Aquaboutic | Focus Security Research | Vulnerability Exploit | POC


learn to play qr code with me

Posted by loope at 2020-03-07

Sometimes the two-dimensional code is seriously damaged and cannot be scanned, which makes me learn a wave of knowledge about the two-dimensional code. There are 40 sizes of QR code. V 1 is a matrix of 21 x 21, V2 is a matrix of 25 x 25, and V3 is a dimension of 29. Every time you increase a level, you will increase the dimension of 4. The formula is: (V-1) * 4 + 21 max. V 40, (40-1) * 4 + 21 = 177, so the max. is a square of 177 x 177.

An example of QR code format is as follows:

Positioning pattern

The positioning pattern is the "back" sign in the upper left, lower left and upper right corner of each QR code. The size of the rectangle used to mark the QR code is 7 * 7 modules.

Functional data: it exists in all sizes and is used to store some formatted data. The main content is "error correction code level (3bit) + mask category (2bit) + BCH code (10bit), For error correction) ", then these 15 bits also need to XOR with 101010000010010, mainly for the purpose that if the error correction level of 00 and the mask of 000 are selected, all of them will be white, which will increase the difficulty of image recognition of the scanner. For example:

The distribution of these 15 bits in the format information area is as follows:

Above version 7, two 3 x 6 areas need to be reserved to store some version information.

Data code and error correction code

In addition to those places mentioned above, the rest places store data codes and error correction codes. It is the dark gray area of the first two figures. Generally, the data is filled from the bottom right corner. First, fill in the data code, and then fill in the error correction code. Take V1 as an example, the filling order of the data is as follows:

Data encoding

QR code supports the following codes:

Digital code: from 0 to 9;

Character code: including 0-9, uppercase A to Z (no lowercase), and the symbol $% * + –. /: including spaces;

Byte code: can be 0-255 iso-8859-1 character;

Japanese code: also double byte code;

Extended channelinterpretation (ECI) mode is mainly used for special character sets. Not all scanners support this coding;

Structured appendmode is used for mixed encoding, that is to say, the QR code contains a variety of encoding formats;

Fnc1 mode is mainly used for some special industries or industries. Like GS1 barcode

The following table is the "number" corresponding to the coding of each pattern, which exists in the format information area.

Because there are many and complex types, and in order to facilitate your understanding, we choose numerical code and character code examples here. For other codes, interested students can view official documents. Example 1:

Numeric encoding, from 0 to 9. If the number of digits to be encoded is not a multiple of 3, then the last 1 or 2 digits will be converted into 4 or 7 bits, and every other 3 digits will be converted into 10 bits of binary number. Finally, these binary data will be connected and the number of encoding mode and character count indicator (that is, how many characters are encoded) and character counter will be added in front of them The length of the number indicator depends on the encoding mode and the version of the two-dimensional code to be encoded. In the number encoding, the character count indicator corresponds to 10, 12 or 14 bits in the following table:

For example, in version 1, when the error correction level is h (we will talk about it below), we need to code: 01234567

(1) Divide the above figures into three groups: 012345 67

(2) Convert them to 10bit binary: 012 to 0000001100; 345 to 0101011001; 67 to 1000011.

(3) String the three binaries together: 000000 1100 0101011001 1000011

(4) Convert the number of numbers into binary (version 1-h is 10 bits): the binary of 8 numbers is 000000000

(5) Add the number coded flag 0001 and the code of step 4 to the front: 0001 0000000000000100101011001 1000011

Example two:

Character encoding (also called alphanumeric encoding). Includes 0-9, uppercase A to Z (no lowercase), and the symbol $% * + –. /: includes spaces. These characters are mapped into a character index table. As shown below (two tables, in Chinese and English): (SP is a space, char is a character, and value is its index value). The process of coding is to divide the characters into two groups, and then convert them into base 45 of the following table, and then convert them into binary 11 bits. If there is a single one, then convert them into binary 6 bits. The character count indicator needs to be compiled into 9, 11 or 13 binaries according to different version sizes (as shown in the table above).

Under the dimension of v 1 and the error correction level of H, the code is ac-42

(1) Find the index (10,12,41,4,2) of ac-42 from the character index table

(2) Two groups: (10,12) (41,4) (2)

(3) Convert each group to 11bits binary: (10,12) 10 * 45 + 12 = 462 to 00111001110; (41,4) 41 * 45 + 4 = 1849 to 11100111001;

(4) Connect these binaries: 00111001110 11100111001 000010

(5) Convert the number of characters to binary (version 1-h is 9 bits): 5 characters, 5 to 000000101

(6) Add code ID 0010 to the header and the number code of step 5: 0010 00000010100111001110 1110011100100000

Terminators and complements

Based on the above example 1, after coding, we get the following coding:

Then, we need to add an end sign to indicate that the real amount data is over.

Let's also add the closing character:

Group by 8 bits in each group. If all the codes do not add up to 8 multiples, we need to add enough zeros after them. For example, there are 45 bits in total above. So, we need to add 3 zeros, and then group by 8 bits: 000100000000000000000001101010110 0110000001 10000000

Then there are the complements. If we haven't reached the maximum number of bits, we need to add some complements to repeat the following two bytes: 11101100 00010001. (the main reason for using these two bytes is to prevent a large dark or light color area from appearing when filling in data, which may interfere with the scanner, making it difficult for the QR code to scan normally). As for how many complements to be filled, you need to check the corresponding table of character number and data capacity in the document. In the official document, the corresponding table is table 7-table 11

From the table, we can know that the data capacity of v1-h is 9 data codewords (each data codeword is 8 bits), and we have 6 data codewords above, so we need to add three 8 bits, and the completion is as follows: 0001000000100000 00001100110 011000110000001 10000000 11101100000000111101100 each group of data above is a data codeword, data Codewords, now only the original data, also need to add error correction code.

Error correcting code

We mentioned the error correction level above. There are four levels of error correction in QR code (from low to high, i.e. L, m, Q, H). That's why some people add icons in the center of QR code, and they can still scan (that is, when the amount of two-dimensional code incomplete does not exceed the allowable range of corresponding error correction level, the content can still be scanned by the scanning tool).

As for how the error correction code is calculated, it involves Reed Solomon error correction algorithm, which is a fixed length code. This is more complex, but there is a library in universal pythom that can be directly called by rededsolo. This means that a fixed length of input data will be processed into a fixed length of output data. Of the most commonly used (255223) codes, 223 Reed Solomon input symbols (8 bits per symbol) are encoded into 255 output symbols. Most of the error correction coding processes are systematic. This means that part of the output codeword contains the original form of the input data. Codes with 8-bit symbol size force a maximum code length of 255 symbols. The standard (255223) can correct up to 16 errors of symbols in each code word. Since each symbol is actually 8 bits, this means that the code can correct up to 16 short burst errors. Reed Solomon code, like convolutional code, is a kind of transparent code. This means that if the channel symbols are reversed at some point in the queue, the decoder can work as well. The decoding result will supplement the original data. However, the code inside will lose transparency after shortening. In the shortened code, "lost" bits need to be replaced by 0 or 1, which depends on whether the data needs to be supplemented. (if the symbol is reversed at this time, the substitute 0 needs to be changed to 1.). In this way, it is necessary to make a mandatory detection decision on the data before decoding.

We have existing Python modules to calculate the error correction code -- Python's reedsolo module. We can refer to the error correction table in the official document. The following table is an example:

Take version 1-h as an example to explain. From the table, we can clearly know that the number of error correction code words should be 17, and the number of error correction blocks should be 1 (indicating that the data to be encoded in this version will only be divided into one data block), (26,9, 8) That is to say, this version of QR code can store a total of 26 codewords, but of these 26 codewords, 9 codewords are data codewords, 17 are error correction codewords (8 * 2 + 1 = 17), and 8-bit error correction capacity. Is there any comment below each table

This is also why the number of error correction code words is R * 2. When there is an arrow behind it, it means that 1 should be added after R * 2. When adding error correction codes to data codewords, there is also the operation of data codeword segmentation. Because version 1's two-dimensional code divides data codewords into one block, which is not obvious enough, we use an example on the Internet:

The above version 5 + Q error correction level: it requires 4 blocks (2 blocks in a group, two groups in total), 15 bytes (data codeword) data in each of the two blocks in the first group plus the error correction capacity of 9 bytes (18 bytes of error correction codeword). Because binary will make the table too large, so we use decimal system to express it. We can see that the error correction code of each data block has 18 bytes, that is, the binary number of 18 8bits.

5. Final coding and interleaving.

At this point, the coding process, only the last step.

For data codewords: first take out the first codewords of each block and arrange them in order of smoothness, then take the second one of the first block, and so on. The data codewords in the above example are as follows:

We take the first column first: 67246182, 70

Then take the second column: 67246182, 70, 852462302477

And so on: 67246182, 70, 85246230247 ... , 38 ,6,50, 17 ,7,236

For error correcting codes, the same is true:

As the same as the data code, we get: 213, 871482351992041161559  39,133,141,236

Then, put the two groups together (error correction code after data code) to get:

67, 246, 182, 70, 85, 246, 230, 247, 70, 66, 247, 118, 134, 7, 119, 86, 87,118, 50, 194, 38, 134, 7, 6, 85, 242, 118, 151, 194, 7, 134, 50, 119, 38, 87,16, 50, 86, 38, 236, 6, 22, 82, 17, 18, 198, 6, 236, 6, 199, 134, 17, 103, 146,151, 236, 38, 6, 50, 17, 7, 236, 213, 87, 148, 235, 199, 204, 116, 159, 11, 96,177, 5, 45, 60, 212, 173, 115, 202, 76, 24, 247, 182, 133, 147, 241, 124, 75,59, 223, 157, 242, 33, 229, 200, 238, 106, 248, 134, 76, 40, 154, 27, 195, 255,117, 129, 230, 172, 154, 209, 189, 82, 111, 17, 10, 2, 86, 163, 108, 131, 161,163, 240, 32, 111, 120, 192, 178, 39, 133, 141, 236

Residual bits

Finally, add Reminder Bits. For some Version QR, the length above is not enough, and add Remainder Bits. For example, for the above 5Q QR code, add 7 bits, and add zero for Remainder Bits. Please refer to table 1 of the official document (a part of which is listed here) for details of which versions require several remainder bits.

Mask (also called mask)

The coding process is completed, but to generate a good QR code, you need to fill the data you have in the blank template in advance, select a suitable mask, XOR the data of the original template with the mask, and then fill in the format information to generate the QR code. The significance of mask existence: QR code is to be scanned, and the scanning is afraid of not being able to clearly distinguish every bit of encoded information. If the number of black and white dots in QR code is uneven, or the spatial distribution is uneven, it will lead to the appearance of large color block area, and the appearance of large color block area will increase the difficulty of positioning during scanning, thus reducing the efficiency of scanning. In more serious cases, if the functional identification happens to appear after the data is filled in, such as the pattern of positioning identification, it will also interfere with the function of normal functional identification, resulting in the QR code can not be scanned. In computer science, mask is a binary string, which transforms data by exclusive or operation with data. In QR code, mask also transforms data matrix by exclusive or operation. So as you may have guessed, the mask of QR code is a pre-defined matrix. The QR standard defines eight data masks by generating rules:

The first three bits of binary data are the numbers corresponding to each pattern mask. This information is also to be filled in format information.

From this figure, we can directly see the template appearance of each mask. Take mask 2 (No. 010) as an example, J mod 3= 0 means that the number of columns that can be divided by 3 must be inverted (black block becomes white block, white block becomes black block). Of course, the information in the fixed format area of QR code does not need to be inverted, so to use mask 2, the number of columns that need to be inverted is: 0, 3, 6, 9 ...

Of course, the official regulations stipulate that when exclusive or is performed, the original data template shall be scored (punished) according to the following rules after exclusive or operation with each mask template. Finally, the one with the lowest score shall be selected as the best mask selection.

In addition, using the reed solo module of python, the data can also be recovered when the two-dimensional code damage exceeds the fault tolerance range of the corresponding level.

Basic usage of reedsolo module in Python 2

1. First install the reedsolo module. The version of the redsolo module officially downloaded by Python is 0.3, which is not very easy to use. This time, the used reedsolo module is stored in the in the download package. After decompression, run the CMD command Python install in this path.

2. Enter the python environment and import the reedsolo module. Define an object and set the number of error correction codes generated to 10.

3. Encode the string "Hello world" to generate error correction code.

4. Decoding

5. Now we have a general understanding of the use method of the reedsolo module. Now let's understand the function of error correction code. For example, we now write "Hello world" as the wrong "hellx xorld". Here we have two errors. With the error correction code generated before, we decode it, and the output is the correct string, and the error correction code is like this.

6. The error correction code algorithm is to encode the content to be corrected byte by byte, so it generates a byte array after encoding.

7. Convert the encoded content to decimal output

Learn to apply, and repeat the solving steps of mma2015-misc400-qr's QR code recovery challenge. The original write-up address is

1. The QR code given by the title is as follows

This is a 25 * 25 QR code, that is, version 2 QR code. From the visible part of it, we can get part of the information of format information:?????????????? 011011010

2. By referring to the corresponding table given in the following website, you can know what encoding mode and mask are used for this QR code

By comparison, the complete format information information information should be: 010111011011010. And the information can also be obtained that the mask used by the QR code is 6, and the corresponding error correction level is Q.

3. Supplement the occluded fixed information and format information completely.

A part of the original data codewords and error correction codewords are obtained by exclusive or operation with the corresponding mask. The figure below shows the pattern corresponding to mask 6.

4. Apply the mask to the QR code we have supplemented, flip the color of the area corresponding to the dark area in the mask, and cover the format information with gray to facilitate data reading, as shown below.

5. Starting from the lower right corner, read the information of data code word and error correction code word according to the serpentine sequence shown in the figure below. For the information reading sequence of different area blocks, please refer to the official documents.

And the corresponding data block distribution should be as follows:

6. Read out all readable information:

7. According to the table of error correction characteristics in the official documents, we can see that there are 22 error correction code words and 22 data code words in version 2-q. at the Q level, it can recover no more than 25% of the damaged bytes, but we only have 16 complete bytes, that is, more than 63% of the bytes are lost. Check carefully the error correction ability of Reed Solomon and realize that it can correct up to twice the erasure That is to say, if it knows where the error is, it is much better to correct it. This means that our code can recover from 50% of the lost bytes! But this means that we can only correct up to 22 lost bytes, but we still only have 16 complete bytes at present, so we need to find a way to recover some bytes to meet the minimum requirement of 22 bytes.

8. Let's sort out the available data first:

0010: [encoding mode = character encoding (alphanumeric mode)]

00000 10100: [9 bit length character count identifier = 20 characters]



     11001100110:【 I】

     1010001000? [S?]

Let's calculate that 22 data codewords are 176 bits, and 20 characters are divided into 10 groups, each group has 11 bits, so 4 + 9 + 10 * 11 = 123 bits, 123 / 8 = 15 more than 3, plus 4 zeros (terminators), and 8 bits need to be added as multiples of 8, 8-3-4 = 1, so 1 0 needs to be added. At this time, there are 16 data codewords in total, 22-16 = 6, so 6 bytes of complement code should be added, as follows (red is the data that was originally hidden):

00100000 10100010 10111000 00111010 01011001 10011010 10001000 ???????????????? ???????? ???????? ???????? ???????? ???????? ???????? ??000000  1110110000010001 11101100 00010001  11101100  00010001

In this way, we have recovered 6 bytes of data. At this time, we only have 22 bytes left, which meets the minimum requirements. We can use error correction code to recover the original data.

9. Write a script to use the reedsolo module of Python for error correction.

10. Get all the information.

11. Splitting and decoding


Official document (Chinese version):

Reed Solomon Code:

Write up of mma2015-misc400-qr:

He tianzhihui:

*Author: ledger 1, reprint please indicate from