Unicode and UTF-8


Take this string as an example: It's 知乎日报

The Unicode character set is essentially a lookup table that assigns a number to each character, like this:
I 0049
t 0074
' 0027
s 0073
  0020
知 77e5
乎 4e4e
日 65e5
报 62a5

Each character corresponds to a hexadecimal number, its code point.

Computers only understand binary, so if the string were stored strictly as Unicode code points (UCS-2, two bytes per character), it would look like this:
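As a quick check of the table above, here is a minimal Python sketch that looks up the code point of each character with the built-in ord. The helper name code_points is made up for this illustration.

```python
text = "It's 知乎日报"

def code_points(s):
    """Return (character, 4-digit hex code point) pairs for a string."""
    return [(ch, format(ord(ch), "04x")) for ch in s]

# Print one "character code-point" row per character, like the table above.
for ch, cp in code_points(text):
    print(ch, cp)
```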
I 00000000 01001001
t 00000000 01110100
' 00000000 00100111
s 00000000 01110011
  00000000 00100000
知 01110111 11100101
乎 01001110 01001110
日 01100101 11100101
报 01100010 10100101
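The two-byte layout above can be reproduced in Python. This sketch assumes big-endian byte order and uses the "utf-16-be" codec as a stand-in for UCS-2, which is only valid because every character in the example lies below U+10000. The helper name ucs2_bits is hypothetical.

```python
text = "It's 知乎日报"

def ucs2_bits(s):
    """Return one 'hhhhhhhh llllllll' bit string per character (BMP text only)."""
    rows = []
    for ch in s:
        # For code points below U+10000, UTF-16 and UCS-2 produce the same two bytes.
        raw = ch.encode("utf-16-be")
        rows.append(" ".join(format(b, "08b") for b in raw))
    return rows

for ch, bits in zip(text, ucs2_bits(text)):
    print(ch, bits)

# Total size: 9 characters at 2 bytes each.
print(len(text.encode("utf-16-be")), "bytes")
```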

This string takes 18 bytes in total. But compare the binary codes of the Chinese and English characters: the first 9 bits of every English character are 0! What a waste of disk space and bandwidth.

What can be done?

Use a UTF (Unicode Transformation Format).

UTF-8 does this:

1. For single-byte characters, the first bit of the byte is set to 0. English text therefore takes only one byte per character in UTF-8, exactly the same as ASCII;

2. For n-byte characters (n > 1), the first n bits of the first byte are set to 1 and bit n+1 is set to 0; the first two bits of each following byte are set to 10. The remaining bits of these n bytes are filled with the character's Unicode code point, padded with 0s in the high bits.

This results in the following UTF-8 byte patterns, where x marks the bits that carry the code point (the 5- and 6-byte forms come from the original design; the current standard, RFC 3629, stops at 4 bytes):

0xxxxxxx
110xxxxx 10xxxxxx
1110xxxx 10xxxxxx 10xxxxxx
11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
... ...
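To make the two rules concrete, here is a hand-written encoder for a single code point. It is a sketch rather than a full implementation: it covers only the 1- to 4-byte forms and does not reject surrogate code points the way a real codec would. The function name utf8_encode_code_point is made up for this example.

```python
def utf8_encode_code_point(cp):
    """Encode one code point (0..0x10FFFF) by hand, following the bit patterns above."""
    if cp < 0:
        raise ValueError("code point out of range")
    if cp < 0x80:        # 0xxxxxxx (7 payload bits)
        return bytes([cp])
    if cp < 0x800:       # 110xxxxx 10xxxxxx (11 payload bits)
        return bytes([0xC0 | (cp >> 6),
                      0x80 | (cp & 0x3F)])
    if cp < 0x10000:     # 1110xxxx 10xxxxxx 10xxxxxx (16 payload bits)
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    if cp <= 0x10FFFF:   # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx (21 payload bits)
        return bytes([0xF0 | (cp >> 18),
                      0x80 | ((cp >> 12) & 0x3F),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    raise ValueError("code point out of range")

# Show the bits for one ASCII character and one Chinese character.
for ch in "I知":
    encoded = utf8_encode_code_point(ord(ch))
    print(ch, " ".join(format(b, "08b") for b in encoded))
```

Comparing the result against Python's built-in str.encode("utf-8") for a few characters is an easy way to convince yourself the bit shuffling is right.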

So, "It's 知乎日报" becomes:
I 01001001
t 01110100
' 00100111
s 01110011
  00100000
知 11100111 10011111 10100101
乎 11100100 10111001 10001110
日 11100110 10010111 10100101
报 11100110 10001010 10100101

Compared with the scheme above, the English characters are shorter, but each Chinese character takes one more byte. Still, the whole string is only 17 bytes, slightly shorter than the 18 bytes above.
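The byte counts can be checked with Python's built-in codecs. As before, "utf-16-be" stands in for UCS-2, which is an assumption that holds only because all the characters are below U+10000. The helper name byte_lengths is hypothetical.

```python
def byte_lengths(s):
    """Return the encoded size of s in bytes under UCS-2 (via UTF-16) and UTF-8."""
    return {"ucs2": len(s.encode("utf-16-be")), "utf8": len(s.encode("utf-8"))}

print(byte_lengths("It's 知乎日报"))  # {'ucs2': 18, 'utf8': 17}
print(byte_lengths("知乎日报"))       # {'ucs2': 8, 'utf8': 12}
```

The second line shows the flip side: for text that is purely Chinese, UTF-8 is the longer of the two.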

Below is the homework:

Convert the GB2312 and GBK encodings (look them up yourself) of "It's 知乎日报" into binary. Then, leaving historical factors aside, explain from a technical point of view why GB2312 and GBK are still widely used even though Unicode and UTF-8 are popular.

Spoiler: it's all about saving disk space and bandwidth.
Unicode is the source coding: it digitizes the character set;
UTF-8 is the channel coding: it exists for better storage and transmission.
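As a starting point for the homework, here is a sketch that prints the GBK bytes in binary and compares the sizes, using the "gbk" and "gb2312" codecs that ship with Python. The helper name to_bits is made up for this example.

```python
text = "It's 知乎日报"

def to_bits(data):
    """Format a bytes object as space-separated 8-bit groups."""
    return " ".join(format(b, "08b") for b in data)

# One row per character: ASCII stays at 1 byte, each Chinese character takes 2.
for ch in text:
    print(ch, to_bits(ch.encode("gbk")))

# Compare the total size of the string under each encoding.
for codec in ("gb2312", "gbk", "utf-8", "utf-16-be"):
    print(codec, len(text.encode(codec)), "bytes")
```

For this string both GB2312 and GBK come out at 13 bytes, against 17 for UTF-8 and 18 for the two-byte scheme.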
