Conversion between UNICODE and UTF-8


1. UTF-8 encoding method

        UTF-8 is a variable-length encoding of Unicode (Unicode is often described as double-byte, but that really refers to UCS-2). UTF-8 encodes UCS in units of 8 bits, so it has no big-endian or little-endian variants. In every character stored in UTF-8, each byte after the first begins with the two bits "10", so a text processor can quickly find the starting position of each character.
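The "10" prefix rule can be illustrated with a short Python sketch. The function name char_starts is made up for this example; it only assumes that the input is already well-formed UTF-8.

```python
def char_starts(data: bytes) -> list[int]:
    """Return the byte offsets at which each UTF-8 character begins."""
    # A continuation byte has the pattern 10xxxxxx, i.e. (b & 0xC0) == 0x80.
    # Every other byte is the first byte of a character.
    return [i for i, b in enumerate(data) if (b & 0xC0) != 0x80]


# "a", U+6C49 (3 bytes in UTF-8) and "z": bytes 61 E6 B1 89 7A
data = "a\u6c49z".encode("utf-8")
print(char_starts(data))  # [0, 1, 4]
```

Because the test looks at one byte at a time, a program can start anywhere in a byte stream and step backward or forward to the nearest character boundary.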

        To stay compatible with the earlier ASCII code (one byte per character), UTF-8 stores Unicode code points in a variable number of bytes. The conversion relationship is as follows:

(Table 3-2 Conversion relationship table between Unicode and UTF-8)

UCS-4 (Unicode) code point range    UTF-8 byte stream
U-00000000 – U-0000007F             0xxxxxxx
U-00000080 – U-000007FF             110xxxxx 10xxxxxx
U-00000800 – U-0000FFFF             1110xxxx 10xxxxxx 10xxxxxx
U-00010000 – U-001FFFFF             11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
U-00200000 – U-03FFFFFF             111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
U-04000000 – U-7FFFFFFF             1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
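As a rough illustration, the table can be read as a lookup from code point to byte count. The sketch below follows the table row by row; the name utf8_length is hypothetical. Note that the last two rows come from the original UCS-4 design: Unicode code points stop at 0x10FFFF, so Python's own encoder never produces 5- or 6-byte sequences.

```python
def utf8_length(code_point: int) -> int:
    """Number of bytes the table assigns to a code point (illustrative)."""
    if code_point < 0:
        raise ValueError("code point must be non-negative")
    # Upper bound of each row in the table, paired with its byte count.
    limits = [
        (0x7F, 1),
        (0x7FF, 2),
        (0xFFFF, 3),
        (0x1FFFFF, 4),
        (0x3FFFFFF, 5),   # historical form, not used for Unicode today
        (0x7FFFFFFF, 6),  # historical form, not used for Unicode today
    ]
    for upper, length in limits:
        if code_point <= upper:
            return length
    raise ValueError("code point exceeds 31 bits")


print(utf8_length(0x41))     # 1
print(utf8_length(0x6C49))   # 3
print(utf8_length(0x20C30))  # 4
```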

2. Convert UNICODE to UTF-8

        The defining feature of UTF-8 is that it uses encodings of different lengths for different ranges of characters. For characters in the range 0x00-0x7F, the UTF-8 encoding is exactly the same as the ASCII encoding. In practice the maximum length of a UTF-8 encoding is 4 bytes, and the 5- and 6-byte templates in Table 3-2 are never needed: the 4-byte template has 21 x positions, so it can hold 21 binary digits, and Unicode's maximum code point, 0x10FFFF, fits in 21 bits.

        For example:

The Unicode code point of the character "汉" (Han) is 0x6C49. 0x6C49 lies between 0x0800 and 0xFFFF, so it uses the 3-byte template: 1110xxxx 10xxxxxx 10xxxxxx. Written in binary, 0x6C49 is 0110 1100 0100 1001. Substituting these bits for the x positions in the template, in order, gives 11100110 10110001 10001001, that is, E6 B1 89.

        Another example:

The code point 0x20C30 lies between 0x010000 and 0x10FFFF, so it uses the 4-byte template: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx. Write 0x20C30 as a 21-bit binary number (padding with leading zeros if there are fewer than 21 bits): 0 0010 0000 1100 0011 0000. Substituting these bits for the x positions in the template, in order, gives 11110000 10100000 10110000 10110000, that is, F0 A0 B0 B0.
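The template-filling procedure used in the two examples above can be sketched in Python as follows. This is a minimal illustration, not a replacement for the built-in str.encode; the name encode_utf8 is made up, and the sketch covers only the 1- to 4-byte templates, since code points end at 0x10FFFF.

```python
def encode_utf8(code_point: int) -> bytes:
    """Encode one Unicode code point by filling the matching template."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("code point out of Unicode range")
    if 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("surrogate code points cannot be encoded")
    if code_point <= 0x7F:
        return bytes([code_point])      # 0xxxxxxx, same as ASCII
    if code_point <= 0x7FF:
        n_cont, lead = 1, 0xC0          # 110xxxxx
    elif code_point <= 0xFFFF:
        n_cont, lead = 2, 0xE0          # 1110xxxx
    else:
        n_cont, lead = 3, 0xF0          # 11110xxx
    # The first byte carries the highest bits of the code point.
    out = [lead | (code_point >> (6 * n_cont))]
    # Each continuation byte is 10xxxxxx and carries the next 6 bits.
    for i in range(n_cont - 1, -1, -1):
        out.append(0x80 | ((code_point >> (6 * i)) & 0x3F))
    return bytes(out)


print(encode_utf8(0x6C49).hex())   # e6b189
print(encode_utf8(0x20C30).hex())  # f0a0b0b0
```

Both results match the hand-computed byte sequences E6 B1 89 and F0 A0 B0 B0.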

3. Convert UTF-8 to UNICODE
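
The reverse direction follows from the same table: the first byte tells how many continuation bytes follow, and the x bits of all the bytes are concatenated to recover the code point. The Python sketch below is one possible way to write this, limited to the 1- to 4-byte templates. The name decode_utf8_char is hypothetical, and the sketch does not reject overlong encodings, which a production decoder would have to do.

```python
def decode_utf8_char(data: bytes) -> int:
    """Decode exactly one UTF-8 byte sequence into its code point."""
    if not data:
        raise ValueError("empty input")
    first = data[0]
    if first < 0x80:                    # 0xxxxxxx
        n_cont, code_point = 0, first
    elif (first & 0xE0) == 0xC0:        # 110xxxxx
        n_cont, code_point = 1, first & 0x1F
    elif (first & 0xF0) == 0xE0:        # 1110xxxx
        n_cont, code_point = 2, first & 0x0F
    elif (first & 0xF8) == 0xF0:        # 11110xxx
        n_cont, code_point = 3, first & 0x07
    else:
        raise ValueError("invalid first byte")
    if len(data) != n_cont + 1:
        raise ValueError("wrong number of bytes for this first byte")
    for b in data[1:]:
        if (b & 0xC0) != 0x80:          # must look like 10xxxxxx
            raise ValueError("invalid continuation byte")
        # Append the 6 payload bits of this continuation byte.
        code_point = (code_point << 6) | (b & 0x3F)
    return code_point


print(hex(decode_utf8_char(bytes.fromhex("E6B189"))))    # 0x6c49
print(hex(decode_utf8_char(bytes.fromhex("F0A0B0B0"))))  # 0x20c30
```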


Origin blog.csdn.net/qq_27586341/article/details/120638514