Computer encoding principles: storing Chinese-encoded text in different data types

  When I wrote programs on Windows in the past, I did not fully understand how Chinese strings are displayed under different character set encodings. I recently needed this knowledge, so I worked through it using my own analysis and some reference material. Let's analyze the common Chinese encodings together!


Chapter Preview:


1. Storing GB2312-encoded Chinese in a character array
2. The GB2312 standard
3. Why the output is garbled
4. Storing GB2312-encoded Chinese in integer types
5. Unicode
6. UTF-8
7. Storing UTF-8-encoded Chinese in integer types


Chapters:


Storing GB2312-encoded Chinese in a character array:


  First, we output the string "你好" ("Hello") through a character array. Reference code:
    unsigned char testch[4] = {0};
    // "你好" is 4 bytes in GB2312, so the array has no room left for '\0'
    memcpy(testch, "你好", strlen("你好"));
    printf("%s|", testch);

    The result is: 你好烫烫烫烫
                  |


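First, we analyze the bytes that testch outputs:

  We define a 256-byte array, testch2, and copy the memory starting at testch into it:
    unsigned char testch2[256] = {0};
    // deliberately reads past the end of testch to inspect the neighboring stack bytes
    memcpy(testch2, testch, 255);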
      Expanding testch2 in the debugger:
        - testch2 0x00aff9f0 "你好烫烫烫烫
        " unsigned char [256]
        [0] 196 '?' unsigned char
        [1] 227 '?' unsigned char
        [2] 186 '?' unsigned char
        [3] 195 '?' unsigned char
        [4] 204 '?' unsigned char
        [5] 204 '?' unsigned char
        [6] 204 '?' unsigned char
        [7] 204 '?' unsigned char
        [8] 204 '?' unsigned char
        [9] 204 '?' unsigned char
        [10] 204 '?' unsigned char
        [11] 204 '?' unsigned char
        [12] 10 '\n' unsigned char
        [13] 0 unsigned char

        Byte 13 has the value 0, so the program's output of testch stops there.


Next, we analyze the content of the testch output:

  GB2312 represents each Chinese character with two bytes. Judging from the output, VS2010 appears to match bytes as follows (this differs from the definition in the GB2312 standard):
    When the current byte value is in the range 0-128, it is looked up in the ASCII code table and displayed as the corresponding single character (ASCII itself defines only 0-127);
    When the current byte value is 129 or greater, the current byte and the byte after it are treated as one GB2312 pair and merged into a 16-bit number (low 8 bits from the low byte, high 8 bits from the high byte). If that number is a valid GB2312 code, the corresponding GB2312 character is displayed; otherwise the byte is displayed on its own as a single byte.


The GB2312 standard:


  GB2312 is the official Chinese national standard character set for Simplified Chinese. GB is short for 国标 (guobiao, "national standard"). It was issued by China's national standards bureau in 1980 and took effect on May 1, 1981. GB2312 contains 6763 Chinese characters in total: 3755 level-1 characters and 3008 level-2 characters.

  GB2312 represents each character it includes with two bytes. The first byte is the high byte and corresponds to the 94 zones; the second byte is the low byte and corresponds to the 94 positions within a zone, so the zone-position code ranges from 0101 to 9494. Adding 0xA0 to the zone number and to the position number gives the GB2312 encoding. For example, the last code has zone-position code 9494: zone 94 and position 94 each convert to hexadecimal 0x5E, and 0x5E + 0xA0 = 0xFE, so its GB2312 encoding is 0xFEFE.

  GB2312 codes range from 0xA1A1 to 0xFEFE. Chinese characters occupy 0xB0A1-0xF7FE: the first byte is 0xB0-0xF7 (zones 16-87) and the second byte is 0xA1-0xFE (positions 01-94).


Why the output is garbled:


  When we store text in a character array without terminating the string (that is, without setting the byte after the last character to the null character), printing it reads past the end of the array. The VS2010 compiler (in debug builds) automatically fills unused stack memory with the byte 0xCC (204 unsigned), and two consecutive 0xCC bytes are displayed as the GB2312 character "烫" ("hot").


Storing GB2312-encoded Chinese in integer types:


unsigned short:

  With a 32-bit compiler, unsigned short occupies 2 bytes, so we can store the data in an unsigned short array and output "你好":
    unsigned short sho[3] = {58308, 50106, 0};
    printf("%s", (char *)sho);  // cast added: %s expects a char pointer
    First value: 58308 (0xE3C4) in binary is 1110 0011 1100 0100, the Chinese "你";
    Second value: 50106 (0xC3BA) in binary is 1100 0011 1011 1010, the Chinese "好";
    Third value: serves as the null terminator.

    Bit order: higher-order bits are written on the left. On a little-endian machine such as x86, the low byte of each value is stored first in memory, which is why 0xE3C4 is read back as the byte sequence 0xC4 0xE3.


unsigned int:

  With a 32-bit compiler, unsigned int occupies 4 bytes. To output "你好":
    unsigned int sho[2] = {3283805124U, 0};
    printf("%s", (char *)sho);  // cast added: %s expects a char pointer
    First value: 3283805124 (0xC3BAE3C4) in binary is 1100 0011 1011 1010 1110 0011 1100 0100, the Chinese "你好";
    Second value: serves as the null terminator.

    When unsigned int stores GB2312 data, every two bytes form a group: bytes 3 and 4 ("好") are on the left and bytes 1 and 2 ("你") are on the right, and within each group the higher-order byte is again on the left.


long long:

  With most compilers, long long occupies 8 bytes. To output "你好" followed by a third character (bytes 0xD1 0xBD):
    long long sho = 208708629619652;
    printf("%s", (char *)&sho);  // cast added: %s expects a char pointer
    Value: 208708629619652, or 0xBDD1C3BAE3C4 in hexadecimal, in binary is:
    1011 1101 1101 0001 1100 0011 1011 1010 1110 0011 1100 0100, the three Chinese characters;

    When long long stores GB2312 data, the layout follows the same pattern as int. Our example outputs only 3 GB2312 characters, so bytes 6 and 7 (counting from 0) are zero and act as the null terminator.

  Here we can see that unsigned and signed types produce the same output; most of the examples use unsigned types only for convenience.


Unicode:


  In a non-Unicode environment, different countries and regions use different character sets, so some characters may not display correctly. As a transitional technique, Microsoft uses code page (Codepage) conversion tables to partly solve this problem: text in a non-Unicode encoding is converted, through the specified conversion table, to the corresponding Unicode encoding that the system uses internally.
  MultiByteToWideChar is a Windows API function that maps a character string to a wide-character (Unicode) string.


UTF-8:


  UTF-8 is a variable-length character encoding for Unicode that can represent any character in the Unicode standard. UTF-8 encodes each Unicode character as 1 to 4 octets, where the number of octets depends on the integer value assigned to the character. It is an efficient encoding for Unicode documents that consist mainly of US-ASCII characters, because every character in the range U+0000 to U+007F is represented as a single octet.


Storing UTF-8-encoded Chinese in integer types:


long long:

  Because the number of bytes per character in UTF-8 is not fixed, we go straight to the long long type to output "你好":
    system("chcp 65001");  // with VS2010, switch the console output to UTF-8
    long long sho = 208520219770340;
    printf("%s", (char *)&sho);  // cast added: %s expects a char pointer
    Value: 208520219770340, or 0xBDA5E5A0BDE4 in hexadecimal, in binary is:
    1011 1101 1010 0101 1110 0101 1010 0000 1011 1101 1110 0100, the Chinese "你好";

    In this example "你" and "好" each occupy three bytes. Bytes 6, 5 and 4 ("好") form the group on the left and bytes 3, 2 and 1 ("你") form the group on the right; within each group the higher-order byte is again on the left.


Origin blog.csdn.net/a29562268/article/details/104086465