Understanding ASCII, GB2312, GBK, and GB18030 Chinese Character Encodings

Text fetched from a web page often shows up garbled when displayed, and fixing that means converting between encodings. To do the conversion correctly, you need to understand how the different Chinese character encodings differ.

This article introduces the ASCII, GB2312, GBK, and GB18030 encodings.

These four are introduced together because they are closely related. In terms of compatibility, GB18030 is compatible with GBK, GBK is compatible with GB2312, and GB2312 is compatible with ASCII. "Compatible" here can be understood simply as a subset relationship with no conflicts: a GB2312-encoded file may contain ASCII characters, a GBK-encoded file may contain GB2312 and ASCII characters, and a GB18030-encoded file may contain GBK, GB2312, and ASCII characters.
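The subset relationship can be checked directly. The following is a minimal sketch in Python, assuming the standard-library codecs named `ascii`, `gb2312`, `gbk`, and `gb18030`; the helper name `encodings_agree` and the sample strings are made up for illustration.

```python
# Text that a narrower encoding can represent should produce the same
# bytes under every wider encoding in the compatibility chain.
CHAIN = ["ascii", "gb2312", "gbk", "gb18030"]

def encodings_agree(text, narrowest):
    """Return True if `text` encodes to identical bytes in `narrowest`
    and in every wider encoding that follows it in CHAIN."""
    start = CHAIN.index(narrowest)
    reference = text.encode(narrowest)
    return all(text.encode(name) == reference for name in CHAIN[start + 1:])

print(encodings_agree("hello", "ascii"))      # plain ASCII text
print(encodings_agree("中文 abc", "gb2312"))   # common characters mixed with ASCII
```

Because the bytes are identical, a file written as GB2312 can also be read back with the `gbk` or `gb18030` codec.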

The characteristics of each type of encoding:

[1] ASCII: each character occupies 1 byte, and the most significant bit of its binary representation must be 0 (extended ASCII is not considered here), so ASCII can represent only 128 characters.

[2] GB2312: the earliest Chinese character encoding. Each Chinese character occupies 2 bytes. To stay compatible with ASCII, neither of the 2 bytes may have a most significant bit of 0 (otherwise they would conflict with ASCII). GB2312 contains 6,763 Chinese characters and 682 special symbols, which covers the characters most commonly used in everyday life.

[3] GBK: GB2312 has only 6,763 Chinese characters, and for a writing system as vast as Chinese, how could 6,763 characters be enough? GBK therefore encodes many more Chinese characters, still using 2 bytes per character, without conflicting with GB2312 or ASCII (that is, while remaining compatible with both). GBK can represent 20,902 Chinese characters plus another 984 symbols such as Chinese punctuation and radicals. Notably, the 20,902 characters include traditional characters.

[4] GB18030: even the 20,000-plus characters of GBK cannot meet every need, because many more Chinese characters, including ones you may never have seen, also need to be encoded. Two bytes per character are clearly no longer enough: 2 bytes allow at most 65,536 combinations, and since compatibility with ASCII rules out a most significant bit of 0, half of those are eliminated outright, leaving a little over 30,000 combinations, too few for all the characters. GB18030 therefore encodes the additional characters with 4 bytes. To remain compatible with GBK, the first two of those four bytes must not conflict with GBK (in practice, the last two bytes do not conflict with GBK either). China published GB18030 twice, in 2000 and in 2005, with the 2005 edition adding to the 2000 one. GB18030 now covers more than 70,000 Chinese characters and even includes characters of minority languages.
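The byte counts described in [1] to [4] can be observed with a short Python sketch. It assumes the standard-library codecs `ascii`, `gb2312`, `gbk`, and `gb18030`; the helper `describe` and the four sample characters are illustrative choices, not part of any standard.

```python
def describe(char):
    """Map each encoding that can represent `char` to its encoded bytes in hex."""
    result = {}
    for name in ("ascii", "gb2312", "gbk", "gb18030"):
        try:
            result[name] = char.encode(name).hex()
        except UnicodeEncodeError:
            pass  # this encoding has no code for the character
    return result

# An ASCII letter, a common simplified character, a traditional character,
# and a rare character from outside the range GBK covers.
for sample in ("A", "中", "龍", "\U00020000"):
    print(repr(sample), describe(sample))

# Two-byte arithmetic from item [4]: the first byte must have its high bit
# set, leaving 128 lead-byte values, each paired with 256 second-byte values.
print(128 * 256)  # 32768
```

Each hex string has two digits per byte, so a 1-byte, 2-byte, or 4-byte encoding shows up as 2, 4, or 8 hex digits respectively.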

The figure shows, for each of the encodings introduced above, the range of values taken by the first 2 bytes (in hexadecimal). Each byte can range from 00 to FF (that is, 0 to 255). The figure makes it easy to see why GB18030 is compatible with GBK, GB2312, and ASCII: the regions the encodings occupy in the first two bytes do not overlap. Note that ASCII uses only 1 byte, so it has no second byte. Also, GB18030's region looks small in the figure, but it is a 4-byte encoding and the figure shows only its first two bytes; counting the last two bytes as well, GB18030 holds far more characters than GBK. Finally, because GBK is compatible with GB2312, the blue GB2312 region should also be counted as part of the GBK region, and likewise the GBK region in principle belongs to the GB18030 region. The figure shows only the part each encoding adds.
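Because the first two bytes never overlap, a decoder can tell how long each character is just by looking at them. The sketch below illustrates this in Python. The byte ranges are the commonly documented ones (lead byte 81 to FE; second byte 30 to 39 for the 4-byte form, 40 to FE except 7F for the 2-byte form) rather than values read off the figure, so treat them as an assumption; `char_length` and `split_chars` are hypothetical helper names.

```python
def char_length(data, i=0):
    """Return the byte length (1, 2 or 4) of the character starting at data[i]
    in a GB18030 byte string, judged only from the first two bytes."""
    first = data[i]
    if first <= 0x7F:
        return 1                      # ASCII: high bit is 0
    if not 0x81 <= first <= 0xFE or i + 1 >= len(data):
        raise ValueError("invalid or truncated lead byte")
    second = data[i + 1]
    if 0x30 <= second <= 0x39:
        return 4                      # four-byte GB18030 form
    if 0x40 <= second <= 0xFE and second != 0x7F:
        return 2                      # two-byte form (GBK, including GB2312)
    raise ValueError("invalid second byte")

def split_chars(data):
    """Split a GB18030 byte string into per-character byte chunks."""
    chunks, i = [], 0
    while i < len(data):
        n = char_length(data, i)
        chunks.append(data[i:i + n])
        i += n
    return chunks

encoded = "A中\U00020000".encode("gb18030")
print([len(c) for c in split_chars(encoded)])  # [1, 2, 4]
```

This sketch checks only the first two bytes, as the figure does; a real decoder would also validate the third and fourth bytes of the 4-byte form.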

In everyday life, more than 99% of the Chinese characters we use fall within the GB2312 range. In practice GBK already covers most scenarios, and the characters that only GB18030 can encode are ones we rarely encounter, which is why GBK is so often the encoding used.
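One way to see which encoding a given piece of text actually needs is to try the encodings from narrowest to widest. This is only a sketch in Python, assuming the standard-library codecs; `narrowest_encoding` is a hypothetical helper name.

```python
def narrowest_encoding(text):
    """Return the first encoding in the compatibility chain that can
    represent every character of `text`, or None if none of them can."""
    for name in ("ascii", "gb2312", "gbk", "gb18030"):
        try:
            text.encode(name)
            return name
        except UnicodeEncodeError:
            continue  # try the next, wider encoding
    return None

for sample in ("hello", "中文", "龍", "\U00020000"):
    print(repr(sample), narrowest_encoding(sample))
```

Everyday simplified Chinese text would typically come back as `gb2312`, and text containing traditional characters as `gbk`.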

Origin www.cnblogs.com/hb01846/p/10948931.html