An Introduction to Character Encodings (ANSI, GBK, GB2312, UTF-8, UTF-16, GB18030, and Unicode)

A long time ago, a group of people decided to combine eight transistors, each of which can be switched on or off, into different states to represent everything in the world. They called this unit a "byte". Later they built machines that could process these bytes. When such a machine ran, it could combine many bytes into many different states, and the states kept changing as it ran. They called this machine a "computer".

At first, computers were used only in the United States. The eight bits of a byte can be combined into 256 (2 to the 8th power) different states. The first 32 states, numbered from 0, were each given a special purpose: whenever a terminal or printer received one of these agreed-upon bytes, it performed an agreed-upon action. On receiving 0x0A (decimal 10) the terminal moved to a new line; on receiving 0x07 the terminal beeped; on receiving 0x1B a printer printed highlighted text, or a terminal displayed letters in color. People found this useful, so they called the byte states below 0x20 "control codes".

They then assigned all the spaces, punctuation marks, digits, and uppercase and lowercase letters to consecutive byte states, up to number 127, so that a computer could store English text using different bytes. Everyone thought this was good, and the scheme became known as the ANSI "ASCII" encoding (American Standard Code for Information Interchange). At that time, all the computers in the world used the same ASCII scheme to store English text.
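As a small illustration of the ASCII layout described here, the following Python sketch prints the byte values of a short English string and of three control codes. The sample string and the variable names are chosen only for this example.

```python
# Illustrative only: inspect the ASCII byte values of a short English string.
text = "Hi, 42!"
ascii_bytes = text.encode("ascii")   # one byte per character
codes = list(ascii_bytes)            # integer value of each byte

# A few control codes; all of them are below 0x20.
control = {
    "line feed": ord("\n"),
    "bell": ord("\a"),
    "escape": ord("\x1b"),
}

print(codes)      # every value is at most 127
print(control)
```

Every printable character lands in the range 0x20 to 0x7E, and the control codes sit below 0x20.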

Later, computers spread more and more widely around the world. Many countries do not use English, and many of their letters are not in ASCII. To store their own text on a computer, people decided to use the positions after 127 for these new letters and symbols. They also added many shapes needed for drawing tables, such as horizontal lines, vertical lines, and crosses, numbering all the way up to the last state, 255. The characters from 128 to 255 are called the "extended character set". After that, however, a single byte had no states left for any further characters.

When computers reached China, there were no byte states left to represent Chinese characters, and more than 6,000 commonly used Chinese characters needed to be stored. So the Chinese developed their own solution and simply discarded the odd symbols after 127. The rule was: a byte with a value of 127 or less keeps its original meaning, but two bytes greater than 127 in a row together represent one Chinese character. The first byte (called the high byte) ranges from 0xA1 to 0xF7, and the second byte (the low byte) ranges from 0xA1 to 0xFE, which gives enough combinations for about 7,000 simplified Chinese characters. These codes also include mathematical symbols, Roman and Greek letters, and Japanese kana. Even the digits, punctuation marks, and letters that already existed in ASCII were re-encoded as two-byte codes; these are the so-called "full-width" characters, while the original ones at 127 or below are called "half-width" characters.

The Chinese found this scheme very good and called it "GB2312". GB2312 is a Chinese extension of ASCII.
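To make the two-byte layout concrete, here is a minimal Python sketch that uses Python's built-in "gb2312" codec. The helper name in_gb2312_range is hypothetical and only checks the high and low byte ranges quoted in the text.

```python
# Sketch: look at GB2312 bytes using Python's built-in "gb2312" codec.
han = "汉".encode("gb2312")          # one Chinese character -> two bytes
half_a = "A".encode("gb2312")        # half-width letter keeps its ASCII byte
full_a = "\uff21".encode("gb2312")   # full-width "A" (U+FF21) -> two bytes


def in_gb2312_range(pair: bytes) -> bool:
    """Check the high/low byte ranges quoted in the text."""
    high, low = pair
    return 0xA1 <= high <= 0xF7 and 0xA1 <= low <= 0xFE


print(han.hex(), in_gb2312_range(han))
print(half_a.hex(), full_a.hex(), in_gb2312_range(full_a))
```

The half-width "A" stays a single ASCII byte, while its full-width twin takes two bytes, both above 127.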

But there are too many Chinese characters, and this was soon not enough. So the requirement that the low byte must also be a code above 127 was dropped: as long as the first byte is greater than 127, it always marks the start of a Chinese character, no matter whether the following byte falls in the extended character set or not. The resulting extended scheme is known as the GBK standard. GBK includes everything in GB2312 and adds nearly 20,000 new Chinese characters (including traditional characters) and symbols. Later, ethnic minorities also needed to use computers, so the scheme was extended again with several thousand characters from minority scripts, and GBK became GB18030. From then on, the cultural heritage of the Chinese nation could be carried on in the computer era.
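The relaxed low-byte rule can be observed with Python's built-in "gbk" and "gb2312" codecs. The sketch below searches the common CJK ideograph range for a character whose GBK low byte is below 0x80 and then confirms that GB2312 cannot encode it. The function name first_gbk_only_char and the scanned range are assumptions made for this example.

```python
# Sketch: find a character that GBK can encode but GB2312 cannot.
def first_gbk_only_char() -> str:
    """Scan common CJK ideographs for one whose GBK low byte is below 0x80."""
    for code_point in range(0x4E00, 0x9FA6):
        ch = chr(code_point)
        try:
            encoded = ch.encode("gbk")
        except UnicodeEncodeError:
            continue
        if len(encoded) == 2 and encoded[1] < 0x80:
            return ch
    raise LookupError("no such character found")


ch = first_gbk_only_char()
gbk_bytes = ch.encode("gbk")

try:
    ch.encode("gb2312")
    in_gb2312 = True
except UnicodeEncodeError:
    in_gb2312 = False

print(ch, gbk_bytes.hex(), in_gb2312)
```

The high byte is still above 127, so a decoder knows a two-byte character has started, but the low byte is free to use values that GB2312 never used.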

Because every country at the time came up with its own encoding standard, as China did, nobody understood anyone else's encoding and nobody supported anyone else's encoding. Back then, anyone in China who wanted a computer to display Chinese characters had to install a "Chinese character system" dedicated to handling the display and input of Chinese characters; with the wrong character system installed, the display became a complete mess. What could be done? An international organization called ISO (the International Organization for Standardization) decided to tackle the problem. Their approach was simple: scrap all the regional encoding schemes and create a new encoding that covers every letter and symbol of every culture on Earth. They planned to call it the "Universal Multiple-Octet Coded Character Set", UCS for short, commonly known as "Unicode".

By the time Unicode was being developed, computer memory capacity had grown greatly, and space was no longer a problem. So ISO specified that every character would be represented uniformly with two bytes, that is, 16 bits. For the "half-width" ASCII characters, Unicode keeps the original code values unchanged and only extends their length from 8 bits to 16 bits, while the characters of other cultures and languages are all re-encoded in a unified way. Because the "half-width" English symbols need only the low 8 bits, their high 8 bits are always 0, so this scheme takes twice as much space when storing English text.
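The doubling for English text is easy to check. The sketch below uses Python's "utf-16-le" codec as a stand-in for the fixed two-byte scheme described here; for ASCII characters the two produce the same bytes. The sample string is arbitrary.

```python
# Sketch: English text stored with 16 bits per character.
english = "Hello"
ascii_bytes = english.encode("ascii")
utf16_bytes = english.encode("utf-16-le")   # 16 bits per character, no BOM

# In little-endian order the low byte comes first and the high byte second.
low_bytes = utf16_bytes[0::2]
high_bytes = utf16_bytes[1::2]

print(len(ascii_bytes), len(utf16_bytes))
print(low_bytes, high_bytes)
```

The low bytes are exactly the ASCII bytes, and every high byte is zero.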

However, Unicode was designed without considering compatibility with any existing encoding scheme. As a result, GBK and Unicode arrange the Chinese characters in completely different code orders, and there is no simple arithmetic way to convert text between Unicode and another encoding; the conversion must be done by table lookup. With two bytes per character, Unicode as originally designed can represent 65,536 (2 to the 16th power) different characters, which was expected to cover the symbols of all the cultures in the world.

Unicode arrived at about the same time as the rise of computer networks, so how to transmit Unicode over a network also had to be considered. This led to a number of UTF (UCS Transformation Format) standards for transmission. As the names suggest, UTF-8 transmits data 8 bits at a time and UTF-16 transmits data 16 bits at a time. For the sake of reliable transmission, however, Unicode code values are not mapped to UTF directly; they are converted by certain rules and algorithms.

UTF-8 uses one to four bytes to encode one code point. The code points from 0 to 127 map directly to a single byte, so text containing only characters in this range is byte-for-byte identical in ASCII and UTF-8. The next 1,920 code points (0x0080 to 0x07FF) map to 2 bytes, and all the remaining code points in the Basic Multilingual Plane require 3 bytes. Code points in the other Unicode planes require 4 bytes.
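A quick way to see the four length classes is to encode one sample character from each range. The sample characters below are chosen only for illustration; the count of two-byte code points follows from the range boundaries.

```python
# Sketch: UTF-8 length for one sample character from each range.
samples = [
    "A",            # U+0041, ASCII range
    "\u00e9",       # U+00E9, within 0x0080-0x07FF
    "\u6c49",       # U+6C49, rest of the Basic Multilingual Plane
    "\U0001f600",   # U+1F600, outside the Basic Multilingual Plane
]
lengths = [len(ch.encode("utf-8")) for ch in samples]

# Number of code points that need exactly two bytes.
two_byte_count = 0x07FF - 0x0080 + 1

print(lengths)
print(two_byte_count)
```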

After reading this, I believe you now understand the relationships among these encodings more clearly. Let me summarize briefly:

● Chinese engineers extended and adapted ASCII to create the GB2312 encoding, which can represent more than 6,000 commonly used Chinese characters.

● There are too many Chinese characters, including traditional characters and various other characters, so the GBK encoding was created. It includes the GB2312 encoding and extends it considerably.

● China is a multi-ethnic country, and almost every ethnic group has its own writing system. To represent those characters, the GBK encoding was extended further into the GB18030 encoding.

● Every country, like China, encoded its own language, so a wide variety of encodings appeared. Without the matching encoding installed, a computer cannot interpret what the encoded content is meant to express.

● Finally, an organization called ISO could not stand it any longer. Together its members created the Unicode encoding, which is very large, large enough to hold any script or symbol in the world. As long as a computer has a Unicode system, text in any script can be saved as Unicode and be interpreted correctly by other computers.

● For transmitting Unicode over networks, two standards appeared, UTF-8 and UTF-16, which transmit 8 bits and 16 bits at a time, respectively.

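This leads some people to ask: since UTF-8 can store so many characters and symbols, why do so many people in China still use encodings such as GBK? The reason is that UTF-8 and similar encodings produce larger files and take up more disk space for Chinese text, so if the vast majority of the intended users are Chinese, an encoding such as GBK also works. With today's computers, however, disk space is dirt cheap and performance is more than enough to ignore this small overhead. It is therefore recommended that all web pages use one uniform encoding: UTF-8.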
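The size difference can be measured directly. In this sketch the sample string is an arbitrary four-character Chinese phrase; common Chinese characters take 2 bytes each in GBK and 3 bytes each in UTF-8, while plain English text is the same size in both.

```python
# Sketch: compare the encoded size of the same text in GBK and UTF-8.
chinese = "汉字编码"   # four Chinese characters
english = "encoding"

gbk_size = len(chinese.encode("gbk"))       # 2 bytes per character
utf8_size = len(chinese.encode("utf-8"))    # 3 bytes per character

english_gbk = len(english.encode("gbk"))
english_utf8 = len(english.encode("utf-8"))

print(gbk_size, utf8_size)
print(english_gbk, english_utf8)
```

For pure Chinese text the UTF-8 version is about one and a half times the size of the GBK version.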

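Why Notepad cannot save "联通" on its own

Create a new text document, type the two characters "联通" (the Chinese name of China Unicom) into it, and save it. When you open the file again, the "联通" you typed has turned into two garbled characters.

This problem is caused by a collision between the GB2312 encoding and the UTF-8 encoding. Here are the conversion rules from Unicode to UTF-8, quoted from the web: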

Unicode range (hex)    UTF-8 encoding (binary)
0000 – 007F            0xxxxxxx
0080 – 07FF            110xxxxx 10xxxxxx
0800 – FFFF            1110xxxx 10xxxxxx 10xxxxxx

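For example, the Unicode code of the character "汉" is 6C49. 6C49 lies between 0800 and FFFF, so the three-byte template is used: 1110xxxx 10xxxxxx 10xxxxxx. Written in binary, 6C49 is 0110 1100 0100 1001. Splitting this bit stream according to the three-byte template gives 0110 110001 001001; substituting these groups for the x's in the template, in order, gives 1110-0110 10-110001 10-001001, that is, E6 B1 89, which is the UTF-8 encoding of the character.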
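The same bit manipulation can be written out in a few lines of Python. This is a minimal sketch of the three-byte template only; the function name encode_three_byte is hypothetical, and the sketch does not handle the other templates or exclude the surrogate range.

```python
# Sketch: fill the template 1110xxxx 10xxxxxx 10xxxxxx by hand.
def encode_three_byte(code_point: int) -> bytes:
    """Encode a code point in 0x0800-0xFFFF with the three-byte template."""
    if not 0x0800 <= code_point <= 0xFFFF:
        raise ValueError("code point outside the three-byte range")
    first = 0b11100000 | (code_point >> 12)               # top 4 bits
    second = 0b10000000 | ((code_point >> 6) & 0b111111)  # middle 6 bits
    third = 0b10000000 | (code_point & 0b111111)          # low 6 bits
    return bytes([first, second, third])


manual = encode_three_byte(0x6C49)
builtin = "汉".encode("utf-8")

print(manual.hex(), builtin.hex())
```

Both values come out as e6b189, matching the hand calculation.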

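When you create a new text file, Notepad's default encoding is ANSI. If you type Chinese characters under the ANSI encoding, they are actually stored in a GB-series encoding. In that encoding, the internal code of "联通" is: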

c1    1100 0001
aa    1010 1010
cd    1100 1101
a8    1010 1000

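Notice anything? The first and second bytes, and likewise the third and fourth bytes, begin with "110" and "10", which exactly matches the two-byte template in the UTF-8 rules. So when the file is opened again, Notepad mistakes it for a UTF-8 encoded file. Remove the 110 from the first byte and the 10 from the second byte and we get "00001 101010"; aligning the bits and padding with leading zeros gives "0000 0000 0110 1010". Unfortunately, this is Unicode 006A, the lowercase letter "j". The next two bytes decode as UTF-8 to 0368, which is not a character that displays as anything meaningful here. This is why a file containing only the two characters "联通" cannot be displayed correctly in Notepad.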
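The misreading can be reproduced with a deliberately lenient decoder that only checks the "110" and "10" prefixes, which is an assumption about how the old Notepad heuristic behaved rather than a description of its actual code. The function name decode_two_byte_lenient is hypothetical. For comparison, the sketch also tries Python's strict UTF-8 decoder, which rejects these bytes.

```python
# Sketch: apply the two-byte template 110xxxxx 10xxxxxx without further checks.
def decode_two_byte_lenient(pair: bytes) -> int:
    """Strip the 110 and 10 prefixes and join the remaining 11 bits."""
    first, second = pair
    if first >> 5 != 0b110 or second >> 6 != 0b10:
        raise ValueError("bytes do not match 110xxxxx 10xxxxxx")
    return ((first & 0b11111) << 6) | (second & 0b111111)


raw = "联通".encode("gbk")   # the bytes Notepad writes under ANSI (GBK)
code_points = [decode_two_byte_lenient(raw[i:i + 2]) for i in (0, 2)]

# A strict UTF-8 decoder does not accept these bytes.
try:
    raw.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

print(raw.hex(), [hex(cp) for cp in code_points], strict_ok)
```

A strict decoder refuses the input because the first pair spends two bytes on a value that fits in one, so the collision only affects programs that guess the encoding from the bit prefixes alone.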

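This problem leads to many related ones. A common one is: I already saved the file in encoding XX, so why is it still in the original encoding YY every time I open it?! The reason is the same: although you saved the file as XX, the system misidentifies it as YY when reading it, so it is still displayed as YY. To avoid this problem, Microsoft's tools write something called a BOM (byte order mark) header at the start of the file.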
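In Python, the UTF-8 BOM is available as codecs.BOM_UTF8, and the "utf-8-sig" codec writes it on encode and strips it on decode. The sketch below shows the three marker bytes in front of the UTF-8 form of "联通"; the variable names are chosen only for this example.

```python
# Sketch: the UTF-8 byte order mark as an explicit encoding signature.
import codecs

text = "联通"
without_bom = text.encode("utf-8")
with_bom = text.encode("utf-8-sig")          # prefixes the BOM bytes

has_bom = with_bom.startswith(codecs.BOM_UTF8)
round_trip = with_bom.decode("utf-8-sig")    # BOM is stripped on decode

print(codecs.BOM_UTF8.hex(), with_bom.hex(), has_bom, round_trip)
```

A reader that sees EF BB BF at the start of a file can treat it as UTF-8 without guessing from the content.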

 


Origin www.cnblogs.com/FengZeng666/p/11683333.html