Character encoding: old questions and new answers, untangled (ASCII, GBK, GB2312, GB18030, Unicode, UTF-8, UTF-16, UTF-32)

In the end I could not get around the topic of character encoding. Every time I think I finally understand it, I still get tripped up when I actually run into it and have to learn the lesson again. Knowledge has to be traced back to its roots; it is not enough to skim the surface.

1. Coding History

1. The problem arises

Tracing a problem back to its source means starting from how the problem arose, and encoding arose from the need to store and transmit what we see and hear through a computer. We all know that a computer has two electrical states, high level and low level, represented by 1 and 0 respectively, that is, binary. To store characters in a computer, the characters must likewise be converted into arrangements of 1s and 0s. So we map each character to a particular (0,1) arrangement, and this mapping is the character set.

2. Solutions

For example, for the English letter a we assign the arrangement 01100001. This arrangement can also be written as 97 (decimal) or 0x61 (hexadecimal), but in essence it is still a (0,1) arrangement. So in the character set we can describe the letter as 0x61, and when that value is parsed back into a character it is rendered as a.
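
To make this concrete, here is a tiny Java check of my own (just the standard library, not part of the original write-up) that prints the decimal, hexadecimal and binary value of the letter a:

char c = 'a';
System.out.println((int) c);                   // 97, the decimal value
System.out.println(Integer.toHexString(c));    // 61, i.e. 0x61
System.out.println(Integer.toBinaryString(c)); // 1100001, the (0,1) arrangement itself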

Following this idea, the Americans solved the problem of storing English first. They found that the arrangements of a single byte (eight 0s or 1s) were enough to hold all English characters, and so the single-byte ASCII character set was born. One byte also happens to be the computer's basic storage unit, which makes reading and writing convenient.

The 8-bit byte itself is a standard that was settled on after much deliberation around the time ASCII was created.

3. Character set development

Later, when we wanted to store Chinese characters, we found that 2 bytes were needed to hold the characters in common use, so GB2312-80 was drawn up in 1980 and GBK was formulated in 1995; both use a double-byte encoding. To also cover the scripts of ethnic minorities, the national standards GB18030-2000 and GB18030-2005 were formulated later.

The current version is GB18030-2005, which fully supports Unicode. It can handle China's minority languages, Chinese, Japanese, Korean and traditional Chinese characters, as well as emoji, without resorting to the user-defined (private-use) area. GB18030 uses a variable-length multi-byte encoding in which each character consists of 1, 2 or 4 bytes; the encoding space is huge, with room for up to about 1.61 million characters.

While each region was developing its own solution, some people realized that if every region defined its character set independently, the parsing work involved in exchanging information would introduce more and more problems. First, you have to know and correctly install the other side's character set and encoding method in advance, and it is very hard to save characters from different regions in the same document. For example, parsing "中国" with Big5 goes wrong, since Big5 (a traditional-Chinese character set) does not even contain the simplified character "国".

So Unicode came into being, aiming to standardize a single character set for the whole world so that such translation problems disappear. Even if a region really must keep its own character set, it only needs a separate conversion to Unicode. And of course, all of these character sets remain compatible with ASCII.


2. Character set and encoding method

1. Difference

The mapping between characters and their (0,1) arrangements is called a character set, and the way those arrangements are actually stored in the computer is called an encoding method.

In the ASCII era a byte was defined as 8 bits, ASCII used 7 of them, and one byte represented one character. Once multi-byte character sets appeared, and especially variable-length character sets, a character may occupy one or more bytes, and the encoding must tell the reading program how many bytes make up each character.

ASCII, GB2312, GBK and GB18030 each specify both the character set and the encoding method, whereas Unicode specifies only the character set; the encoding method must be chosen separately as UTF-8, UTF-16 or UTF-32. Conversely, if the encoding method is UTF-8, the character set must be Unicode.

2. Research tools

We will use Java to study the encoding methods. The getBytes method of a Java String converts characters into concrete bytes, and those bytes are the (0,1) arrangement under the chosen character set. First we need to know the internal encoding of Java strings: the String class internally manages an array of char, and the Java API describes the primitive type char like this:

The char data type (and the value encapsulated by a Character object) is based on the original Unicode specification, which defines characters as fixed-width 16-bit entities
That is to say, the character set Java itself uses is Unicode, and its internal encoding is UTF-16.

String str = "abc";
// is equivalent to:
char data[] = {'a', 'b', 'c'};
String str = new String(data);

str.getBytes(Charset.forName("GBK")) essentially converts "abc" from the UTF-16 encoding of the Unicode character set into GBK's double-byte encoding, finally yielding the GBK bytes of the characters.

new String(bytes, Charset.forName("GBK")) is essentially the reverse process, converting from the GBK encoding back into the Unicode character set, finally yielding the characters that the GBK bytes represent.
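
Putting the two directions together, here is a minimal round-trip sketch of my own showing that encoding to GBK and decoding back with the same charset restores the original characters:

byte[] gbkBytes = "麦芒".getBytes(Charset.forName("GBK"));       // Unicode (UTF-16 in memory) -> GBK bytes
String restored = new String(gbkBytes, Charset.forName("GBK")); // GBK bytes -> Unicode string
System.out.println(restored); // 麦芒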

This is the conversion between the Unicode character set and other character sets, and it lets us observe the different encodings of a character. Here are a few helper methods used for the tests below:


// Parse a byte array using the given charset
public static void fromByteToString(byte[] bytes, String charsetDef) {
	String str = new String(bytes, Charset.forName(charsetDef));
	System.out.println("\r\n" + str);
}

// Return a byte as a binary string
public static String fromByteToBinString(byte b) {
	String tmp = Integer.toBinaryString(b & 0xFF);
	for (int i = tmp.length(); i < 8; i++) {
		tmp = "0" + tmp;
	}
	return tmp;
}

// Return a byte as a hexadecimal string
public static String fromByteToHexString(byte b) {
	String tmp = Integer.toHexString(b & 0xFF);
	tmp = tmp.length() == 1 ? "0" + tmp.toUpperCase() : tmp.toUpperCase();
	return "0x" + tmp;
}

// Print a string's encoding result in binary and hexadecimal
public static void printRes(String maiamng, String charsetDef) {
	byte[] characterSet = maiamng.getBytes(Charset.forName(charsetDef));
	System.out.print("\"" + maiamng + "\" " + charsetDef + " character set (binary): ");
	for (byte b : characterSet) {
		String tmp = fromByteToBinString(b);
		System.out.print(tmp + " ");
	}
	System.out.println();
	System.out.print("\"" + maiamng + "\" " + charsetDef + " character set (hex): ");
	for (byte b : characterSet) {
		String tmp = fromByteToHexString(b);
		System.out.print(tmp + "     ");
	}
}

During conversion from the Unicode character set to another character set's encoding, the conversion can fail for characters the target set cannot represent. When it fails, the character is automatically replaced with the byte 0x3F (the ASCII code of '?'). For example:

String maiamng = "麦芒";
printRes(maiamng, "ASCII");

"麦芒" ASCII character set (binary): 00111111 00111111 
"麦芒" ASCII character set (hex): 0x3F     0x3F
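
If we want to detect such failures instead of silently getting 0x3F, the standard CharsetEncoder API (java.nio.charset.CharsetEncoder) can test encodability first. A small sketch of my own, not one of the helpers above:

CharsetEncoder asciiEncoder = Charset.forName("ASCII").newEncoder();
System.out.println(asciiEncoder.canEncode('a'));  // true
System.out.println(asciiEncoder.canEncode('麦')); // false: ASCII has no mapping for this character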

3. How each encoding method works

Next, let's examine each encoding in turn.

i、ASCII

A single byte, 8 bits of (0,1), with the most significant bit 0. It defines 128 characters, from 0000 0000 (NUL, the null character) to 0111 1111 (DEL, the delete control character), which is enough for English. For example, the letter a:

Binary     Decimal    Hexadecimal
1100001    97         0x61
String maiamng = "a";
String charsetDef = "ASCII";
printRes(maiamng,charsetDef);

"a" ASCII 字符集(二进制):01100001 
"a" ASCII 字符集(16进制):0x61    

ii、GB2312

Two bytes, 16 bits of (0,1), used for Chinese characters. To remain compatible with ASCII, GB2312 was designed so that the highest bit of its bytes is 1, distinguishing them from ASCII bytes (whose highest bit is 0 by design). So when a computer reads a GB2312 byte stream, a byte whose highest bit is 0 is recognized as ASCII and stands for one character on its own, while a byte whose highest bit is 1 starts a two-byte character. (GB2312 specifies both the character set and the encoding method.) For example, "麦":

Binary               Decimal    Hexadecimal
11000010 11110011    49907      0xC2 0xF3
String maiamng = "a麦芒";
String charsetDef = "gb2312";
printRes(maiamng,charsetDef);

"a麦芒" gb2312 字符集(二进制):01100001 11000010 11110011 11000011 10100010 
"a麦芒" gb2312 字符集(16进制):0x61     0xC2     0xF3     0xC3     0xA2     

If the computer encounters data in which the number of bytes with the high bit set to 1 is odd, the bytes cannot all be paired up and the result can only be garbled text:

String charsetDef = "gb2312";
byte[] bytes = {
    
     (byte) 0x61, (byte) 0xC2, (byte) 0xF3,(byte) 0xC3, (byte) 0xA2};   
fromByteToString(bytes,charsetDef); //a麦芒
 
bytes = {
    
     (byte) 0x61, (byte) 0xC2, (byte) 0xF3,(byte) 0xC3};  
fromByteToString(bytes,charsetDef); //a麦?

bytes = {
    
     (byte) 0x61, (byte) 0xC2,(byte) 0xC3, (byte) 0xA2};   
fromByteToString(bytes,charsetDef); //a旅?
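
The splitting rule can also be written out explicitly. Here is a rough sketch of my own that follows the high-bit rule described above, assuming the input is well-formed GB2312 (no dangling trailing byte):

// Walk a GB2312 byte stream and report how many bytes each character occupies
public static void splitGb2312(byte[] bytes) {
	int i = 0;
	while (i < bytes.length) {
		if ((bytes[i] & 0x80) == 0) {
			// high bit 0: a single-byte ASCII character
			System.out.println("1 byte : " + fromByteToHexString(bytes[i]));
			i += 1;
		} else {
			// high bit 1: this byte and the next one form a double-byte character
			System.out.println("2 bytes: " + fromByteToHexString(bytes[i]) + " " + fromByteToHexString(bytes[i + 1]));
			i += 2;
		}
	}
}

splitGb2312(new byte[] {(byte) 0x61, (byte) 0xC2, (byte) 0xF3, (byte) 0xC3, (byte) 0xA2});
// 1 byte : 0x61
// 2 bytes: 0xC2 0xF3
// 2 bytes: 0xC3 0xA2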

iii、GBK

Two bytes, 16 bits of (0,1), used for Chinese characters. The GBK character set is compatible with (a superset of) the GB2312 character set, and GBK's encoding method is consistent with GB2312's.

String maiamng = "a麦芒";
String charsetDef = "gb2312";
printRes(maiamng,charsetDef);

"a麦芒" gb2312 字符集(二进制):01100001 11000010 11110011 11000011 10100010 
"a麦芒" gb2312 字符集(16进制):0x61     0xC2     0xF3     0xC3     0xA2     

Here is a picture showing the difference between ASCII, GB2312, and GBK.

We can see that the code 0x81 0x7E is defined in GBK but undefined in GB2312.

String charsetDef = "gbk";
byte[] bytes = {
    
     (byte) 0x81, (byte) 0x7e};   
fromByteToString(bytes,charsetDef); //亊

charsetDef = "gb2312";
bytes = {
    
     (byte) 0x81, (byte) 0x7e};   
fromByteToString(bytes,charsetDef); //?~

iv、GB18030

Variable length: characters are represented with 1 byte (8 bits), 2 bytes (16 bits) or 4 bytes (32 bits), and the number of bytes in a character is determined from the value ranges of its bytes. Let's look at how GB18030 defines the ranges for each byte.
[Figure: GB18030 byte-range definitions]
For example, take the 4-byte code 0x81 0x30 0x81 0x30. The byte 0x81 falls within the first-byte range of both the two-byte and the four-byte forms, so it could start either a two-byte or a four-byte character; but 0x30 is below the lowest value allowed for the second byte of a two-byte character, so only the four-byte interpretation is possible.
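
Based on the ranges above, the length-detection logic can be sketched roughly as follows (a simplification of my own; the actual standard defines the valid ranges more strictly, and this ignores invalid input):

// Rough sketch: how many bytes does the GB18030 character starting at bytes[pos] occupy?
public static int gb18030CharLength(byte[] bytes, int pos) {
	int b1 = bytes[pos] & 0xFF;
	if (b1 <= 0x7F) {
		return 1;                        // ASCII range: single-byte character
	}
	int b2 = bytes[pos + 1] & 0xFF;
	if (b2 >= 0x30 && b2 <= 0x39) {
		return 4;                        // second byte in 0x30~0x39: four-byte form
	}
	return 2;                            // otherwise: two-byte form
}

byte[] example = {(byte) 0x81, (byte) 0x30, (byte) 0x81, (byte) 0x30};
System.out.println(gb18030CharLength(example, 0)); // 4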

Note that the Eclipse console cannot print 4-byte GB18030 Chinese here, because it uses GBK, and the default Chinese encoding of the Windows operating system is also GBK. There seem to be some particular reasons why 4-byte GB18030 characters cannot be displayed on Windows; a text document saved as ANSI is likewise GBK. So for now I cannot print a 4-byte GB18030 example. I also have not managed to find the complete character tables of the GB18030 standard (they are listed in China's national standards system but cannot be downloaded), so the examples for this part will be added after further research.

v、UNICODE

GB18030 is both a character set and an encoding method (variable-length 1, 2 or 4 bytes), whereas Unicode is only a character set; its encoding methods are defined separately. Before introducing Unicode's encoding methods, a few Unicode concepts:

  1. Code point: the value assigned to a character in the Unicode code table. For example, the code point of the Chinese character "一" is U+4E00, and the code point of the English letter "A" is U+0041.
  2. Code unit: a 16-bit unit of storage is defined as one code unit.
  3. Code plane: the Unicode character set is divided into 17 planes. Code points U+0000-U+FFFF form the first plane, the Basic Multilingual Plane, whose characters can be stored in a single code unit. The remaining 16 supplementary planes run from 0x10000 to 0x10FFFF and require two code units (a Java illustration follows this list).
  4. Surrogate area: within the Basic Multilingual Plane, the 2048 values U+D800-U+DFFF do not represent any character and are called the Unicode surrogate area.
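
These concepts map directly onto Java's String API. Here is a small sketch of my own contrasting code units (length()) with code points (codePointCount()), using a BMP character and a supplementary-plane character:

String bmp = "一";                                     // U+4E00, in the Basic Multilingual Plane
String supp = new String(Character.toChars(0x1D578));  // a supplementary-plane code point
System.out.println(bmp.length());                            // 1 -> one code unit
System.out.println(bmp.codePointCount(0, bmp.length()));     // 1 -> one code point
System.out.println(supp.length());                           // 2 -> two code units (a surrogate pair)
System.out.println(supp.codePointCount(0, supp.length()));   // 1 -> still one code point
System.out.println(Integer.toHexString(bmp.codePointAt(0))); // 4e00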

Unicode has three encoding methods: UTF-8, UTF-16 and UTF-32.

i、UTF-8

A variable-length encoding scheme that uses 1 to 4 bytes per character.

  1. For single-byte characters, the first bit of the byte is set to 0, which keeps it compatible with ASCII.
  2. For an n-byte character (n > 1), the first n bits of the first byte are set to 1 and bit n+1 is set to 0, while the first two bits of every following byte are set to 10. All the remaining, unmentioned bits are filled with the character's Unicode code point.

String maiamng = "a麦芒";
String charsetDef = "UTF-8";
printRes(maiamng,charsetDef);

"a麦芒" UTF-8 字符集(二进制):01100001 11101001 10111010 10100110 11101000 10001010 10010010 
"a麦芒" UTF-8 字符集(16进制):0x61     0xE9     0xBA     0xA6     0xE8     0x8A     0x92  

We can see that "a" (01100001) occupies 1 byte, while "麦" (11101001 10111010 10100110) and "芒" (11101000 10001010 10010010) each occupy 3 bytes. From this we can also work out that the Unicode code point of "麦" is 0x9EA6 and that of "芒" is 0x8292. Looking them up in a Unicode character chart confirms the same result.
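
We can also recover the code point from the UTF-8 bytes by hand, following rule 2 above. A small sketch of my own for the three-byte case (assuming the input really is a valid 3-byte sequence):

// Decode a 3-byte UTF-8 sequence 1110xxxx 10yyyyyy 10zzzzzz back to its code point
public static int decodeUtf8ThreeBytes(byte b1, byte b2, byte b3) {
	int high = b1 & 0x0F; // low 4 bits of the first byte
	int mid  = b2 & 0x3F; // low 6 bits of the second byte
	int low  = b3 & 0x3F; // low 6 bits of the third byte
	return (high << 12) | (mid << 6) | low;
}

System.out.println(Integer.toHexString(decodeUtf8ThreeBytes((byte) 0xE9, (byte) 0xBA, (byte) 0xA6))); // 9ea6 -> 麦
System.out.println(Integer.toHexString(decodeUtf8ThreeBytes((byte) 0xE8, (byte) 0x8A, (byte) 0x92))); // 8292 -> 芒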

ii、UTF-32

A fixed-length encoding scheme: no matter how large the character's code point is, it always uses 4 bytes.

String maiamng = "a麦芒";
String charsetDef = "UTF-32";
printRes(maiamng,charsetDef);

"a麦芒" UTF-32 字符集(二进制):00000000 00000000 00000000 01100001 00000000 00000000 10011110 10100110 00000000 00000000 10000010 10010010 
"a麦芒" UTF-32 字符集(16进制):0x00     0x00     0x00     0x61     0x00     0x00     0x9E     0xA6     0x00     0x00     0x82     0x92     

iii、UTF-16

UTF-16 uses 2 or 4 bytes per character. When a character is 4 bytes long, its first 2 bytes must not be parseable as a character on their own; UTF-16 achieves this variable-length detection through the Unicode surrogate area U+D800-U+DFFF mentioned earlier.

  1. If the code point < U+10000, i.e. it lies in the Basic Multilingual Plane, then 16 bits (one code unit) are enough to hold the character's Unicode value.
  2. If U+10000 <= code point <= U+10FFFF, i.e. it lies in a supplementary plane, UTF-16 uses 2 code units and arranges for each 16-bit unit to fall inside the surrogate area U+D800-U+DFFF. Concretely:
    First initialize two 16-bit unsigned integers, W1 and W2. W1 = 110110yyyyyyyyyy, where each y can be 0 or 1, so W1 ranges from 1101100000000000 to 1101101111111111 (0xD800-0xDBFF). W2 = 110111xxxxxxxxxx, where each x can be 0 or 1, so W2 ranges from 1101110000000000 to 1101111111111111 (0xDC00-0xDFFF).
    Then subtract 0x10000 from the code point U to get a 20-bit value; assign its high 10 bits to the low 10 bits of W1 and its low 10 bits to the low 10 bits of W2. This splits the 20-bit code point U into two 16-bit code units, and both of them land exactly in the surrogate area U+D800-U+DFFF.
    It also means that UTF-16 cannot represent full 21-bit code points.

Let's take a code point from a Unicode supplementary plane as an illustration. Here I use 0x1D578 (a supplementary-plane code point) as the example.

0x1D578 minus 0x10000 gives 0xD578, which written out in binary (padded to 20 bits) is 0000 1101 0101 0111 1000. Following the steps above we get W1 = 1101100000110101 (0xD835) and W2 = 1101110101111000 (0xDD78).
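
The same computation can be checked in Java, either by hand with bit operations or with the JDK helpers Character.highSurrogate / Character.lowSurrogate (available since Java 7). A small sketch:

int codePoint = 0x1D578;
int u = codePoint - 0x10000;                  // 0xD578, treated as a 20-bit value
int w1 = 0xD800 | (u >> 10);                  // high 10 bits go into W1
int w2 = 0xDC00 | (u & 0x3FF);                // low 10 bits go into W2
System.out.println(Integer.toHexString(w1));  // d835
System.out.println(Integer.toHexString(w2));  // dd78

// the built-in helpers agree
System.out.println(Integer.toHexString(Character.highSurrogate(codePoint))); // d835
System.out.println(Integer.toHexString(Character.lowSurrogate(codePoint)));  // dd78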

Because the Eclipse console defaults to GBK encoding (see the referenced article), it cannot print some Unicode characters. So to experiment with UTF-16, we create a text document saved as UTF-16 on the local machine (mine runs Win10), and then use the writeFile routine below to write the encoded bytes into the file for inspection.

// Append a byte array to a file
public static void writeFile(byte[] contentInBytes, String filePath) {
	File file = new File(filePath);
	try {
		FileOutputStream fop = new FileOutputStream(file, true); // append mode
		fop.write(contentInBytes);
		fop.flush();
		fop.close();
		System.out.println("Done");
	} catch (Exception e) {
		e.printStackTrace();
	}
}

String filePath = "C:\\Users\\zhang\\Desktop\\readUTF16.txt";
byte[] bytes = new byte[] {(byte) 0xD8, (byte) 0x35, (byte) 0xDD, (byte) 0x78};

writeFile(bytes, filePath);

The result:
[Screenshot: the resulting file, whose encoding is reported as UTF-16 BE]
The BE here stands for big-endian mode; endianness is explained briefly below.

iv、Endianness (a brief aside)

The computer's basic storage unit is 8 bits, which means the bytes occupy separate physical addresses: values have high and low parts, and addresses have high and low ends. Storing the high-order part at the low address is big-endian, indicated by 0xFE 0xFF (BE); storing the low-order part at the low address is little-endian, indicated by 0xFF 0xFE (LE). Suppose there are two locations A and B (address A lower than address B), each of which can hold one digit from 0 to 9; then the number 65 might be stored as A=6, B=5 (big-endian) or as A=5, B=6 (little-endian).

The big-endian and little-endian layouts of 0xD835 (W1) and 0xDD78 (W2) from the example above are:

Mode            Bytes (low address → high address)
Big-endian      0xD8, 0x35, 0xDD, 0x78
Little-endian   0x35, 0xD8, 0x78, 0xDD
I do not really understand endianness yet; why is the little-endian order here not 0x78, 0xDD, 0x35, 0xD8? I will leave this open for now and write a separate article on endianness once I have figured it out.

Generally speaking, endianness is decided by the CPU. However, ASCII, GB2312, GBK and GB18030 are both character sets and encoding methods, so their byte order is fixed, whereas Unicode does not specify an encoding method, so its byte order is left to be decided by the CPU/platform.
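
Java exposes both byte orders directly through the "UTF-16BE" and "UTF-16LE" charsets (plain "UTF-16", when encoding, additionally writes a byte order mark). A small sketch of my own using the code point from the example above, reusing fromByteToHexString from earlier:

String s = new String(Character.toChars(0x1D578));
byte[] be = s.getBytes(Charset.forName("UTF-16BE"));
byte[] le = s.getBytes(Charset.forName("UTF-16LE"));
for (byte b : be) {
	System.out.print(fromByteToHexString(b) + " "); // 0xD8 0x35 0xDD 0x78
}
System.out.println();
for (byte b : le) {
	System.out.print(fromByteToHexString(b) + " "); // 0x35 0xD8 0x78 0xDD
}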

Let's create another text document saved as UTF-16 little-endian, and then write the bytes in little-endian order into the file to check the result.

String filePath = "C:\\\\Users\\\\zhang\\\\Desktop\\\\readUTF16_LE.txt";
byte[] bytes = new byte[] {
    
    (byte)0x35, (byte)0xD8,(byte)0x78, (byte)0xDD};

writeFile(bytes,filePath);

[Screenshot of the resulting little-endian file]

3. Practice

When we read bytes (0,1) from an external source (a file or the network), we must know their encoding in order to display them correctly.

On the local machine (Win10), create a new text document containing "麦芒" and save it with ANSI encoding.

Under a Chinese-locale Windows, ANSI means GBK encoding. Since GB18030 is compatible with GBK, we can parse the file with either GB18030 or GBK. Here is one more helper, a readFile function.

// Read a file byte by byte and parse the collected bytes with the given charset
public static void readFile(String filePath, String charsetDef) {
	try {
		File file = new File(filePath);
		FileInputStream in = new FileInputStream(file);

		// accumulate the raw bytes plus their hex and binary representations
		ByteArrayOutputStream buffer = new ByteArrayOutputStream();
		String hexString = "";
		String binString = "";
		int n = -1;
		while ((n = in.read()) != -1) {
			byte nByte = (byte) (n & 0xFF);
			hexString += fromByteToHexString(nByte) + "     ";
			binString += fromByteToBinString(nByte) + " ";
			buffer.write(nByte);
		}
		in.close();

		String res = new String(buffer.toByteArray(), charsetDef);
		System.out.println("readFile \"" + res + "\" " + charsetDef + " character set (binary): " + binString);
		System.out.println("readFile \"" + res + "\" " + charsetDef + " character set (hex): " + hexString);

	} catch (Exception e) {
		e.printStackTrace();
	}
}

String charsetDef = "GB18030";
String filePath = "C:\\Users\\zhang\\Desktop\\readANSI.txt";
readFile(filePath, charsetDef);

readFile "麦芒" GB18030 字符集(二进制):11000010 11110011 11000011 10100010 
readFile "麦芒" GB18030 字符集(16进制):0xC2     0xF3     0xC3     0xA2     

When a file is saved as UTF-8, we must parse it as UTF-8 for the decoding to be correct and free of garbled characters. (Only after writing this article did I realize how superficial my previous understanding of that sentence was.)
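
As a quick check, assuming a text document saved as UTF-8 (the file name below is just a placeholder I made up), the same readFile helper displays it correctly as long as we pass the matching charset name:

String filePath = "C:\\Users\\zhang\\Desktop\\readUTF8.txt"; // hypothetical UTF-8-saved file
readFile(filePath, "UTF-8");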

This article was written on and off over a month (from 2021-02-24 to 2021-03-23). Fortunately it was finished from start to end, and I read a great many articles along the way; as a result, the problems are understood, and described in the article, much more carefully.

Reference articles:
https://blog.csdn.net/guxiaonuan/article/details/78678043
https://blog.csdn.net/wh_java01/article/details/53894736
https://cloud.tencent.com/developer/article/1343240
https://www.cnblogs.com/gzhnan/articles/4307717.html
https://blog.csdn.net/zhoubl668/article/details/6914018
https://zhuanlan.zhihu.com/p/27827951
