ANSI, GBK, GB2312, UTF-8, GB18030, and UNICODE web page encodings

Encoding has always been a headache for novices, especially the differences among GBK, GB2312, and UTF-8, the three most common web page encodings; many beginners are confused by them and cannot explain them clearly. But encoding matters a great deal, especially on the web. If what you typed is not garbled yet garbled characters appear on the web page, the cause is almost always the encoding. Beyond garbled text, encoding causes other problems as well (for example, the IE6 CSS loading bug). Stalker m's purpose in writing this article is to explain this encoding problem thoroughly. If you have run into problems like these, read this article carefully.

ANSI, GBK, GB2312, UTF-8, GB18030, and UNICODE
These are the encoding-related keywords you will see most often. Although I list them together, that does not mean they are all on the same level. The content of this section is quoted from the Internet with slight modification; the original source is unknown, so it cannot be credited.

A long, long time ago, a group of people decided to use eight transistors, each of which could be switched on or off, and to combine their states to represent everything in the world. They called this a "byte." Later they built machines that could process these bytes; once such a machine was started, it could use bytes to form many states, and those states kept changing. They called this machine a "computer."

Initially, computers were used only in the United States. An eight-bit byte can form 256 (2 to the 8th power) different states. They set aside the 32 states numbered from 0 for special purposes: when a terminal or printer received one of these agreed bytes, it had to perform an agreed action. On 0x0A the terminal starts a new line; on 0x07 it beeps at people; on 0x1B the printer prints inverted text, or the terminal displays letters in color. They found this very useful, so they called all the byte states below 0x20 "control codes."

They also assigned consecutive byte states to all the spaces, punctuation marks, digits, and upper- and lowercase letters, numbering them up to 127, so that a computer could store English text in bytes. Everyone thought this was great, so the scheme was called the ANSI ASCII code (American Standard Code for Information Interchange). At that time, all computers in the world used the same ASCII scheme to save English text.

Later, computers spread more and more widely. To store their own text on computers, countries around the world decided to use the states after 127 to represent their new letters and symbols, and they also added many shapes needed for drawing tables: horizontal lines, vertical lines, crosses, and so on, numbering all the way up to the last state, 255. The characters from 128 to 255 are called the "extended character set." But this numbering scheme could not accommodate any more codes.

When the Chinese got computers, there were no byte states left to represent Chinese characters, yet more than 6,000 commonly used Chinese characters needed to be stored. So China developed its own scheme: it simply dropped the odd symbols after 127 and made a rule that a byte smaller than 127 keeps its original meaning, but two bytes larger than 127 joined together represent one Chinese character. The first byte (called the high byte) runs from 0xA1 to 0xF7, and the second byte (the low byte) from 0xA1 to 0xFE, which yields more than 7,000 simplified Chinese characters. These codes also include mathematical symbols, Roman and Greek letters, and Japanese kana. Even the digits, punctuation marks, and letters that already exist in ASCII were re-encoded as two-byte codes; these are the so-called "full-width" characters, while the ones below 127 are called "half-width" characters.

The Chinese thought this worked very well, so they named this Chinese character scheme "GB2312". GB2312 is a Chinese extension of ASCII.

However, there are too many Chinese characters in China, and the codes soon ran out, so the low byte was no longer required to be a code after 127: as long as the first byte is greater than 127, it marks the start of a Chinese character, and what follows is content of the extended character set. The resulting expanded scheme is called the GBK standard. GBK includes everything in GB2312 and adds nearly 20,000 new Chinese characters (including traditional characters) and symbols. Later, ethnic minorities began using computers too, so GBK was expanded again with several thousand new minority-script characters and became GB18030. From then on, the culture of the Chinese nation could be carried on in the computer age.
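To see this one-byte/two-byte structure concretely, you can encode a short string and inspect the raw bytes. Below is a minimal Python sketch; the exact byte values are simply whatever the GB tables assign.

# Show the GB2312/GBK rule described above: ASCII stays one byte,
# while a Chinese character becomes two bytes whose values are above 0x7F.
for ch in "A汉":
    encoded = ch.encode("gbk")           # GBK is a superset of GB2312
    print(repr(ch), encoded.hex(" "))
# Expected output (the byte values come from the GB tables):
#   'A'  -> 41       identical to ASCII, a single byte below 0x80
#   '汉' -> ba ba    two bytes, both in the >0x7F range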

At that time every country developed its own encoding standard the way China did; as a result, nobody understood anyone else's encoding and nobody supported anyone else's encoding. Back then, to display Chinese characters on a computer, the Chinese had to install a "Chinese character system" to handle the display and input of Chinese; install the wrong character system and the display became a mess. What could be done? At this point an international body called ISO (the International Organization for Standardization) decided to tackle the problem. Their approach was simple: abolish all the regional encoding schemes and build a single code that includes every culture and every letter and symbol on earth. They called it the "Universal Multiple-Octet Coded Character Set," UCS for short, better known as "UNICODE."

By the time UNICODE was being drafted, computer memory had grown enormously and space was no longer a problem, so ISO simply stipulated that every character must be represented uniformly with two bytes, that is, 16 bits. For the "half-width" ASCII characters, UNICODE keeps the original code values but expands their length from 8 bits to 16 bits, while the characters of all other cultures and languages are re-encoded. Since a "half-width" English symbol needs only the low 8 bits, the high 8 bits are always 0, so this generous scheme wastes twice the space when storing English text.

However, UNICODE was not designed to stay compatible with any existing encoding scheme, so GBK and UNICODE lay out the internal codes of Chinese characters completely differently. There is no simple arithmetic for converting text between UNICODE and another encoding; the conversion has to be done by table lookup. With two bytes per character, UNICODE can represent 65,536 different characters, roughly enough to cover all the cultural symbols in the world.
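A quick way to see that the two layouts are unrelated is to compare a character's Unicode code point with its GBK bytes; a minimal Python sketch:

# The Unicode code point of a character vs. its GB-series internal code:
# there is no arithmetic relation between them, so codecs convert via lookup tables.
ch = "汉"
print(hex(ord(ch)))                  # 0x6c49 -- the Unicode code point U+6C49
print(ch.encode("gbk").hex(" "))     # ba ba  -- the GBK internal code, numerically unrelated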

UNICODE arrived together with the rise of computer networks, and how to transmit UNICODE over a network had to be considered as well. So a series of transmission-oriented UTF (UCS Transfer Format) standards appeared: as the names suggest, UTF-8 transmits data 8 bits at a time and UTF-16 transmits 16 bits at a time. For the sake of transmission reliability, there is no direct one-to-one correspondence between UNICODE and UTF; the conversion follows certain algorithms and rules.
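As a rough illustration of the different transfer formats, the same string produces different byte sequences under UTF-8 and UTF-16. A minimal Python sketch (UTF-16-LE is chosen here only to keep Python from prepending a BOM):

s = "A汉"
print(s.encode("utf-8").hex(" "))      # 41 e6 b1 89  -- ASCII stays one byte; 汉 takes three
print(s.encode("utf-16-le").hex(" "))  # 41 00 49 6c  -- every character becomes a 16-bit unit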

After reading this, you should have a clearer picture of how these encodings relate to one another. A brief summary:

By extending and reworking ASCII for Chinese, China produced the GB2312 encoding, which can represent more than 6,000 commonly used Chinese characters.
There are many more Chinese characters than that, including traditional characters and other variants, so the GBK encoding was created; it contains everything in GB2312 and adds a great deal more.
China is a multi-ethnic country, and almost every ethnic group has its own writing system; to represent those characters, GBK was extended further into the GB18030 encoding.
Every country encoded its own language the way China did, so a multitude of encodings appeared, and without the corresponding encoding installed you cannot interpret what text in that encoding means.
Finally, an organization called ISO could not stand it any longer and created a single encoding, UNICODE, large enough to hold every script and symbol in the world. As long as a computer has a UNICODE-capable system, a file saved as UNICODE can be interpreted correctly by any other computer, no matter what language it contains.
For transmitting UNICODE over a network there are two common formats, UTF-8 and UTF-16, which use 8-bit and 16-bit units respectively.
Some people then ask: since UTF-8 can store so many characters and symbols, why do so many people in China still use GBK and other encodings? Because UTF-8 text is larger and takes up more space; if most of your target users are Chinese, a GB-series encoding also works (as the short sketch below shows, Chinese text is roughly 50% larger in UTF-8 than in GBK). From today's perspective, though, hard disks are dirt cheap and computers are fast enough that this overhead can be ignored, so the recommendation is for all web pages to use one uniform encoding: UTF-8.
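A rough way to see the size difference for Chinese-heavy text is to compare the encoded lengths. A minimal Python sketch; the sample string is arbitrary:

text = "汉字编码" * 1000                 # a mostly-Chinese sample text (4,000 characters)
print(len(text.encode("gbk")))         # 8000  -- 2 bytes per Chinese character in GBK
print(len(text.encode("utf-8")))       # 12000 -- 3 bytes per Chinese character in UTF-8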
Regarding the problem that Notepad cannot save the word "联通" (Unicom) on its own.
If you create a new text document, type the two characters "联通" (Unicom) in it, and save, then when you reopen the file the word you typed appears as garbled characters.

This problem is caused by a collision between GB2312 encoding and UTF-8 encoding. Here are the UNICODE-to-UTF-8 conversion rules, quoted from the Internet:

UNICODE range        UTF-8 encoding
0000 – 007F          0xxxxxxx
0080 – 07FF          110xxxxx 10xxxxxx
0800 – FFFF          1110xxxx 10xxxxxx 10xxxxxx
For example, the Unicode code point of the Chinese character "汉" is 6C49. 6C49 falls between 0800 and FFFF, so the three-byte template 1110xxxx 10xxxxxx 10xxxxxx is used. Written in binary, 6C49 is 0110 1100 0100 1001. Split according to the three-byte template, this bit stream becomes 0110 110001 001001; substituting these groups for the x's in the template gives 1110-0110 10-110001 10-001001, that is, E6 B1 89. This is the UTF-8 encoding of "汉".
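The same calculation can be checked with a few lines of Python; this is only a sketch of the template substitution described above:

cp = 0x6C49                               # Unicode code point of 汉
# The 0800-FFFF range uses the three-byte template 1110xxxx 10xxxxxx 10xxxxxx.
b1 = 0xE0 | (cp >> 12)                    # top 4 bits of the code point
b2 = 0x80 | ((cp >> 6) & 0x3F)            # middle 6 bits
b3 = 0x80 | (cp & 0x3F)                   # low 6 bits
print(bytes([b1, b2, b3]).hex(" "))       # e6 b1 89
print("汉".encode("utf-8").hex(" "))       # e6 b1 89 -- matches the built-in codec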

When you create a new text file, Notepad's default encoding is ANSI. If you type Chinese characters under ANSI encoding, what you actually get is the GB-series encoding. Under that encoding, the internal code of "联通" (Unicom) is:

c1    1100 0001
aa    1010 1010
cd    1100 1101
a8    1010 1000

Did you notice? The first and second bytes, and likewise the third and fourth bytes, begin with "110" and "10" respectively, which exactly matches the two-byte template in the UTF-8 rules. So when you open the file again, Notepad mistakes it for a UTF-8 encoded file. Strip the "110" from the first byte and the "10" from the second byte and you get "00001 101010"; align the bits and pad with leading zeros to get "0000 0000 0110 1010", which is UNICODE 006A, the lowercase letter "j". The next two bytes, decoded as UTF-8, give 0368, which does not correspond to any normally displayable character. That is why a file containing only the word "联通" cannot be displayed properly in Notepad.
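The mix-up can be reproduced at the byte level with a short Python sketch. One caveat: a strict modern UTF-8 decoder actually rejects 0xC1 as an invalid (overlong) lead byte, whereas old Notepad's detection only looked at the 110.../10... bit shape:

data = "联通".encode("gbk")
print(data.hex(" "))                       # c1 aa cd a8 -- the bytes listed above

for b in data:
    print(f"{b:08b}")                      # 11000001 / 10101010 / 11001101 / 10101000

# The second pair really does decode to U+0368 under UTF-8:
print(hex(ord(data[2:4].decode("utf-8"))))  # 0x368

# A strict decoder refuses the first pair (0xC1 is an invalid, overlong lead byte),
# but the old heuristic only checked the 110xxxxx 10xxxxxx pattern.
try:
    data.decode("utf-8")
except UnicodeDecodeError as e:
    print("strict UTF-8 decoder rejects it:", e.reason)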

Many problems radiate out from this one. A common question is: I saved the file as encoding XX, so why does it open as encoding YY every time?! This is the reason: although you saved it as XX, the system misidentified it as YY when reading it, so it is displayed as YY. To avoid this problem, Microsoft came up with something called the BOM header.

Regarding the issue of the BOM header of the file.
When software such as the Notepad that ships with Windows saves a UTF-8 encoded file, it inserts three invisible bytes (0xEF 0xBB 0xBF, the BOM) at the beginning of the file. This hidden marker lets editors such as Notepad recognize that the file is UTF-8 encoded, which avoids the problem above. For ordinary files this causes no trouble.

But this has drawbacks, especially for web pages. PHP does not ignore the BOM, so when it reads, includes, or references such files it treats the BOM as part of the text at the start of the file, and because of how the embedded language works, those bytes are output (displayed) directly. As a result, even if the page's top padding is set to 0, the web page cannot sit flush against the top of the browser, because those three bytes sit at the start of the HTML. If you find unexplained blank space on a web page, the file very likely has a BOM header. When you run into this kind of problem, save the file without a BOM header.
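If you suspect a stray BOM is causing such blank space, you can check for it and strip it yourself. A minimal Python sketch; "page.php" is just a placeholder file name:

BOM = b"\xef\xbb\xbf"                      # the UTF-8 byte order mark

with open("page.php", "rb") as f:          # placeholder file name
    raw = f.read()

if raw.startswith(BOM):
    print("file starts with a UTF-8 BOM; stripping it")
    with open("page.php", "wb") as f:
        f.write(raw[len(BOM):])            # re-save as UTF-8 without BOM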

How to view and modify a document's encoding
1. View and modify it directly with Notepad. Open the file with Notepad, then click "File" → "Save As" in the upper left; a save dialog pops up, and you can pick an encoding at the bottom before clicking Save.

However, this method offers very few encoding choices and is usually only good for quickly checking a file's encoding. I recommend the following method instead.

2. Use another text editor (for example, Notepad++) to view and change the encoding. Almost every mature text editor (Dreamweaver, EmEditor, and so on) can quickly view or modify a file's encoding; Notepad++ does this especially well.

After opening a file, the encoding of the current file will be displayed in the lower right corner.

Click "encoding" in the menu bar above to convert the current document to other encodings

IE6 bug when loading CSS files
When the encoding of an HTML file differs from that of the CSS file it loads, IE6 cannot read the CSS file, so the HTML page ends up unstyled. As far as I have observed, this problem appears only in IE6, not in other browsers. Simply save the CSS file in the same encoding as the HTML file.

One more note: this problem came up only with a PHP front end and a C# back end; if the file is UTF-8 (with a BOM), reading the file line by line produces an offset.
Related: https://blog.csdn.net/tinyletero/article/details/8197974
Source: http://www.qianxingzhem.com/post-1499.html
