Character encoding problems in computers

Table of Contents

1. Computer coding

2. Computer coding classification

1. ASCII encoding

2. GBK encoding

3. UTF-8 encoding

3. Coding application in computer system

4. Coding issues


 When writing code, I often run into encoding problems. I have never understood encodings very well, so let's study them today.

  • GBK encoding: a Chinese character occupies two bytes, and an English letter occupies one byte.
  • UTF-8 encoding: a Chinese character usually occupies three bytes, and an English letter occupies one byte.

1. Computer coding

       Computer coding refers to the way characters such as letters and digits are represented as data inside a computer.

       Why do encodings exist? Data in a computer is stored by electronic components, and because of the limitations of industrial technology, these components can only record two stable states, "on" and "off", which are represented by the digits 1 and 0. In other words, a computer can in essence only record the two digits 0 and 1. Each 0 or 1 is called a bit, which is the smallest unit of data in a computer. A number written with only 0 and 1 is called a binary number.

       But obviously we need to record far more than two values, so bits are combined: n bits together can represent 2^n different values. Binary numbers also get long quickly, so for convenience three bits are grouped into one digit, which gives the octal system, and four bits are grouped into one digit, which gives the hexadecimal system.
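
As a small illustration of how bits group into octal and hexadecimal digits, here is a sketch in Python. The language and the sample value 101110 are my own choices for illustration and do not come from the original material.

```python
# Illustrative value chosen for this example: the binary number 101110.
n = 0b101110  # 46 in decimal

# The same number written in binary, octal, and hexadecimal.
binary = format(n, "b")  # '101110'
octal = format(n, "o")   # '56' -> groups of three bits: 101 110
hexa = format(n, "x")    # '2e' -> groups of four bits: 0010 1110

print(binary, octal, hexa)
```

All three strings describe the same stored bits; only the grouping used to write them down differs.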

       The number problem is solved, but a computer still cannot store a character such as 'a' directly. To solve this, people came up with a scheme: assign a number to every commonly used character. For example, the number of 'a' is 97, so when we need to store 'a' we do not store 'a' itself but the number 97. When reading it back, the 97 is turned into 'a' again, which solves the problem.

       What we usually call an "encoding" is this numbering of characters. The table that maps every character to its number is called a "coding table".

       Common coding tables: ASCII, GB2312 (Simplified Chinese), GBK, Big5 (Traditional Chinese), UTF-8, etc.
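
The lookup in both directions can be tried out with Python's built-in `ord` and `chr` functions. This is only a sketch to make the idea concrete; the variable names are mine.

```python
# Look up the number assigned to a character, and go back again.
code = ord("a")  # 97
char = chr(97)   # 'a'

# What actually gets stored is the number, shown here as 8 bits.
bits = format(code, "08b")  # '01100001'

print(code, char, bits)
```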

2. Computer coding classification

1. ASCII encoding

       When computers were first created, they were mainly used in the "Western world", that is, in English-speaking countries. The characters needed there amount to little more than the 26 English letters plus some symbols. Even counting uppercase and lowercase letters separately, the total never exceeds 128, so one byte per character is more than enough. This encoding, which uses one byte to represent one character, is the earliest one: ASCII.
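
A short Python sketch shows both properties: every ASCII character becomes one byte with a value below 128, and characters outside the table cannot be encoded at all. The sample strings are my own.

```python
text = "Hi!"
data = text.encode("ascii")  # one byte per character
print(list(data))            # [72, 105, 33]

# Every ASCII value fits in 7 bits, i.e. is below 128.
all_below_128 = all(b < 128 for b in data)

# Characters outside the table cannot be encoded.
try:
    "中".encode("ascii")
    encodable = True
except UnicodeEncodeError:
    encodable = False

print(all_below_128, encodable)  # True False
```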

2. GBK encoding

       Later, as computers spread, the whole world needed to use them to store data. ASCII alone no longer works, because it cannot store characters outside the basic Latin alphabet, and its rule of one byte per character obviously cannot cover every language (Chinese alone needs at least a few thousand characters). Therefore, many countries extended ASCII, moving from one byte per character to multiple bytes per character.
For example, the GB series is China's national standard for storing Chinese characters. It consists of GB2312, GBK, and GB18030, where each later standard is basically backward compatible with the earlier ones. GBK is currently the most common.
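
The byte counts can be checked with Python's built-in `gbk`, `gb2312`, and `gb18030` codecs. This is a minimal sketch using one sample character of my choosing.

```python
chinese = "中".encode("gbk")  # two bytes: d6 d0
english = "A".encode("gbk")   # one byte, same value as in ASCII

print(chinese.hex(), len(chinese))  # d6d0 2
print(english.hex(), len(english))  # 41 1

# Bytes produced under the older GB2312 standard still decode correctly
# under the later GBK and GB18030 standards.
old_bytes = "中".encode("gb2312")
same_in_gbk = old_bytes.decode("gbk") == "中"
same_in_gb18030 = old_bytes.decode("gb18030") == "中"
```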

3. UTF-8 encoding

       Of course, if every country uses its own encoding, communication between countries becomes troublesome. For example, a message you meant as praise may, because the encodings differ, be read by the other party as an insult. That will not do. To solve this problem, an organization called the Unicode Consortium formulated a set of encoding rules: Unicode. It supports more than 650 languages and serves as a universal character set.

       Unicode unifies all languages into one character set, so the conflicts between national encodings go away. The Unicode standard is still evolving. The most commonly used characters have numbers that fit in two bytes (very rare characters need four). Modern operating systems and most programming languages support Unicode directly. However, Unicode only specifies the number of each character; it does not specify how those numbers are stored or transmitted. That is the job of the UTF family of encodings.

      At present there are three commonly used UTF encodings: UTF-8, UTF-16 and UTF-32. A computer stores data in bytes of 8 bits. UTF-16 represents a character with 2 bytes (4 for characters outside the basic range) and UTF-32 always uses 4 bytes, so the storage order of the bytes matters: does the low-order byte or the high-order byte come first? The BOM (byte order mark) was introduced to answer this.

The BOM is a special mark at the beginning of a text file: a fixed sequence of bytes that indicates the byte order (endianness) of the file. UTF-8 has a fixed byte order and does not need one, but a UTF-8 BOM is also defined so that a file can be identified as UTF-8 in the same way. However, different platforms and tools handle the UTF-8 BOM differently (some write it by default, others do not expect it), so use it with care.
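
The two possible byte orders can be seen by encoding one character both ways. This is a small Python sketch; the sample character 中 (Unicode number 4E2D in hexadecimal) is my own choice.

```python
ch = "中"  # Unicode number U+4E2D

le = ch.encode("utf-16-le")  # low-order byte first:  2d 4e
be = ch.encode("utf-16-be")  # high-order byte first: 4e 2d
print(le.hex(), be.hex())    # 2d4e 4e2d

# Plain "utf-16" prepends a two-byte mark so that a reader can tell
# which order was used; the order itself follows the platform.
marked = ch.encode("utf-16")
print(len(marked))  # 4
```

Without some indication of the order, a reader cannot tell whether the bytes 2d 4e mean U+4E2D or U+2D4E.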

The BOM for each encoding is as follows:

  • UTF-8 EF BB BF
  • UTF-16(LE) FF FE
  • UTF-16(BE) FE FF
  • UTF-32(LE) FF FE 00 00
  • UTF-32(BE) 00 00 FE FF
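
A reader can use these byte sequences to guess the encoding of a file. The following is a minimal sketch, not a complete detector: the function name `detect_bom` and the label strings are hypothetical, while the BOM constants come from Python's standard `codecs` module.

```python
import codecs

# Longest BOMs first: the UTF-32 (LE) BOM begins with the UTF-16 (LE) BOM.
BOMS = [
    (codecs.BOM_UTF32_LE, "UTF-32 (LE)"),
    (codecs.BOM_UTF32_BE, "UTF-32 (BE)"),
    (codecs.BOM_UTF8, "UTF-8"),
    (codecs.BOM_UTF16_LE, "UTF-16 (LE)"),
    (codecs.BOM_UTF16_BE, "UTF-16 (BE)"),
]


def detect_bom(data: bytes):
    """Return the encoding suggested by a leading BOM, or None if there is none."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name
    return None


print(detect_bom(b"\xef\xbb\xbfhello"))  # UTF-8
print(detect_bom(b"hello"))              # None
```

A file without a BOM returns None here, which says nothing about its encoding. Note also that a UTF-16 (LE) file whose first character is U+0000 would be reported as UTF-32 (LE), so the result is only a hint.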

UTF-8 encoding: Once everything is unified under Unicode, the problem of conflicting encodings disappears. However, if the text you write is almost entirely English, storing every character in two bytes takes twice as much space as ASCII, which is uneconomical for storage and transmission. Therefore, in the spirit of economy, UTF-8 appeared, which turns Unicode into a "variable-length encoding". UTF-8 encodes a Unicode character into 1 to 4 bytes depending on the size of its number (the original design allowed up to 6 bytes, but the current standard stops at 4). Common English letters are encoded as 1 byte, Chinese characters are usually 3 bytes, and only very rare characters are encoded as 4 bytes. If the text you want to transmit contains a lot of English characters, using UTF-8 saves space.
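
The difference in size can be measured directly. This Python sketch compares how many bytes one character takes in each UTF encoding; the helper name `encoded_sizes` and the four sample characters are my own choices.

```python
def encoded_sizes(ch: str) -> dict:
    """Number of bytes one character takes in each UTF encoding."""
    encodings = ("utf-8", "utf-16-le", "utf-32-le")
    return {enc: len(ch.encode(enc)) for enc in encodings}


# An English letter, an accented letter, a Chinese character, an emoji.
for ch in ["A", "\u00e9", "\u4e2d", "\U0001F600"]:
    print(f"U+{ord(ch):04X}", encoded_sizes(ch))
```

The English letter takes 1 byte in UTF-8 but 2 and 4 in the other two, while the Chinese character takes 3 bytes in UTF-8 and only 2 in UTF-16. UTF-8 therefore saves space mainly for text that is mostly English.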

3. Coding application in computer system

       In computer memory, text is uniformly handled as Unicode. When it needs to be saved to disk or transmitted, it is converted to UTF-8. For example, when you edit a file with Notepad, the UTF-8 bytes read from the file are converted to Unicode characters in memory; when you save after editing, the Unicode characters are converted back to UTF-8 and written to the file.

When browsing the web, the server converts the dynamically generated Unicode content to UTF-8 and then sends it to the browser.

That is why the source code of many web pages contains something like <meta charset="UTF-8" />, which means that the page is encoded in UTF-8.
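
The same save-and-load cycle can be sketched in Python, where a `str` holds Unicode characters in memory and the `encoding` argument of `open` controls the conversion. The file name and sample text are made up for this example.

```python
import os
import tempfile

text = "编码 encoding"  # Unicode characters in memory

# A throwaway file in a fresh temporary directory.
path = os.path.join(tempfile.mkdtemp(), "demo.txt")

# Saving: Unicode characters -> UTF-8 bytes on disk.
with open(path, "w", encoding="utf-8") as f:
    f.write(text)

# The file itself holds bytes, not characters.
with open(path, "rb") as f:
    raw = f.read()

# Loading: UTF-8 bytes -> Unicode characters in memory.
with open(path, "r", encoding="utf-8") as f:
    loaded = f.read()

print(len(text), len(raw), loaded == text)  # 11 15 True
```

The text has 11 characters but occupies 15 bytes on disk, because each of the two Chinese characters takes 3 bytes in UTF-8.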

4. Coding issues

       The so-called "encoding problem" usually means garbled Chinese characters. Why does it occur? In China we generally use Chinese-language operating systems, whose default encoding is typically GBK (this is the case for Chinese-language Windows), while internationally UTF-8 is generally used so that text can be understood everywhere (international websites are generally UTF-8 encoded).

  • In GBK encoding, a Chinese character generally occupies 2 bytes.
  • In UTF-8 encoding, a Chinese character generally occupies 3 bytes.

    Because different encodings map the same character to different byte sequences (even the number of bytes differs), garbled characters appear when bytes written in one encoding are parsed with another.

In fact, the "encoding problem" appears because we use the wrong encoding when parsing the Chinese characters that someone else gave us. If we receive GBK, we should parse it as GBK; if we receive UTF-8, we should parse it as UTF-8. Isn't that solved? So, if you encounter garbled Chinese characters in a string:
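
The mismatch is easy to reproduce. In this Python sketch, text is written as UTF-8 and then read back once with the wrong encoding and once with the right one. The sample text 你好 is my own choice.

```python
text = "你好"
utf8_bytes = text.encode("utf-8")  # 6 bytes: 3 per Chinese character

# Parsing UTF-8 bytes as GBK pairs the bytes up wrongly and yields
# unrelated characters instead of the original text.
garbled = utf8_bytes.decode("gbk", errors="replace")

# Parsing with the matching encoding gives the text back.
correct = utf8_bytes.decode("utf-8")

print(garbled, correct)

# The other direction usually fails outright rather than producing
# garbage: these GBK bytes are not valid UTF-8.
try:
    text.encode("gbk").decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False
```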

1. Break the garbled string back up into bytes, using the encoding that was wrongly applied.

 2. Use the xx constructor to reassemble the string from those bytes with the correct encoding.

       Garbled Chinese text is simply the result of assembling the bytes incorrectly when parsing them, much like putting building blocks together in the wrong order; the underlying bytes have not changed. (This holds as long as the wrong decoding did not drop or replace any bytes.)
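
The two steps can be sketched in Python, where `str.encode` plays the role of breaking the string into bytes and `bytes.decode` plays the role of the constructor. The function name `repair` and its default encodings are assumptions for this example, and it only works when the wrong decoding lost no bytes.

```python
def repair(garbled: str, wrong: str = "gbk", right: str = "utf-8") -> str:
    """Undo a wrong decoding, assuming no bytes were lost along the way."""
    raw = garbled.encode(wrong)  # step 1: back to the original bytes
    return raw.decode(right)     # step 2: reassemble with the right encoding


# Produce a garbled string by reading UTF-8 bytes as GBK, then repair it.
garbled = "你好".encode("utf-8").decode("gbk")
fixed = repair(garbled)
print(garbled, "->", fixed)
```

If the garbled string contains replacement characters, some of the original bytes are already gone and this approach cannot recover the text.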

 


 

Origin blog.csdn.net/qq_44159028/article/details/115201653