My understanding of character encoding

Recently I have been working on a codec for a private protocol, which requires encoding character content into a binary byte stream that is sent to a terminal. To encode a phone number I simply called the getBytes() method of String, but when the programmer on the terminal side asked me how to decode this binary content, I started to reflect on my own understanding of character encoding.

My first instinct was to answer that the phone number can simply be decoded as ASCII. But is that strictly correct? A string of digits encoded with String's getBytes() method certainly looks like ASCII. In fact, though, I had made a programming oversight and not specified an encoding: the byte array returned by the no-argument getBytes() is produced with the platform's default charset, which traditionally follows the operating system's settings. According to the protocol agreed for the project, GBK should be used, so the call should have been getBytes("GBK"). And GBK encoding cannot simply be called ASCII encoding.
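The difference between the two calls can be sketched as follows. This is a minimal illustration, not the project's actual code: the class name, the helper encodePhoneNumber, and the sample phone number are made up for the example, and it assumes the runtime provides the GBK charset (standard full JDK distributions do).

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class PhoneNumberEncoding {
    // Assumption: the GBK charset is available in this runtime.
    static final Charset GBK = Charset.forName("GBK");

    // Hypothetical helper: always encode with the charset the protocol agrees on,
    // instead of relying on the platform default used by the no-argument getBytes().
    static byte[] encodePhoneNumber(String phoneNumber) {
        return phoneNumber.getBytes(GBK);
    }

    public static void main(String[] args) {
        String phone = "13800138000"; // made-up sample number

        byte[] gbkBytes = encodePhoneNumber(phone);
        byte[] asciiBytes = phone.getBytes(StandardCharsets.US_ASCII);

        // Digits are ASCII characters, so both encodings yield identical bytes.
        System.out.println(Arrays.equals(gbkBytes, asciiBytes)); // true

        // The charset the no-argument getBytes() would have used on this machine.
        System.out.println(Charset.defaultCharset());
    }
}
```

Passing a Charset object rather than the string "GBK" also avoids the checked UnsupportedEncodingException that getBytes(String) declares.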

So the question is: what is the relationship between GBK and ASCII? Why does a phone number come out the same whether it is encoded with GBK or with ASCII? GBK and ASCII are two different character encodings. ASCII encodes a character in the low seven bits of a byte, so the highest bit is always 0. It is one of the earliest encodings and covers only 2^7 = 128 characters, the ones commonly used in English-speaking countries. GBK was developed in China for encoding (Simplified) Chinese characters. Many articles and even books describe GBK as a fixed two-byte encoding, but in fact GBK uses two bytes only for Chinese characters; to stay compatible with ASCII, it encodes ASCII characters in a single byte, with the same value as in ASCII. It sounds a bit convoluted, but you can think of the GBK standard as including the ASCII standard.

That raises another question: when a computer interprets a byte string as GBK, how does it tell whether a byte is a complete ASCII character or one half of a two-byte Chinese character? The most significant bit of the first byte of a two-byte GBK character is always 1, whereas the most significant bit of an ASCII byte is always 0. So when the computer, reading from the start, reaches a byte whose most significant bit is 1, it knows that this byte and the next one together represent one Chinese character; otherwise the byte is an ASCII character. Note that the second byte of a GBK pair may have its most significant bit set to 0, so this rule only works when the bytes are read in order from a character boundary.
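The rule for telling one-byte and two-byte characters apart can be sketched in a few lines. This is only an illustration under simplifying assumptions: the class and method names are invented, the input is assumed to be well-formed GBK starting at a character boundary, and no validation of the lead or trail byte ranges is attempted.

```java
import java.nio.charset.Charset;

public class GbkScanner {
    // Hypothetical helper: counts characters in well-formed GBK data by
    // inspecting the most significant bit of each leading byte.
    static int countCharacters(byte[] data) {
        int count = 0;
        int i = 0;
        while (i < data.length) {
            if ((data[i] & 0x80) == 0) {
                i += 1; // high bit 0: a single-byte ASCII character
            } else {
                i += 2; // high bit 1: this byte and the next form one character
            }
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        // "a", U+4E2D, "b", U+6587: two ASCII letters and two Chinese characters.
        // Unicode escapes keep the example independent of the source file encoding.
        String text = "a\u4e2d" + "b\u6587";
        byte[] data = text.getBytes(Charset.forName("GBK")); // assumes GBK is available

        System.out.println(data.length);           // 6 (1 + 2 + 1 + 2 bytes)
        System.out.println(countCharacters(data)); // 4
    }
}
```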

In fact, GBK is not the only one: most encodings designed for natural languages are also compatible with ASCII. ASCII appears to be a very universal character encoding, and Unicode also seems to be a very universal "code", so what is the relationship between ASCII and Unicode? They are not the same kind of thing. If ASCII is the smallest coded character set used in computing, Unicode is the standard that tries to be the largest. Unicode defines most of the world's scripts and symbols (Chinese, English, Japanese, and so on), and the majority of commonly used characters have numbers that fit in 2 bytes. But unlike GBK and ASCII, Unicode itself does not prescribe the binary representation of a character. How the number Unicode assigns to a character is turned into bytes is defined by concrete implementation standards such as UTF-8.
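The distinction between the number Unicode assigns and the bytes an encoding produces can be seen in a short sketch. The class name and the hex helper are made up for this example; the character used is U+4E2D, written as a Unicode escape.

```java
import java.nio.charset.StandardCharsets;

public class CodePointDemo {
    // Hypothetical helper: renders a byte array as uppercase hex digits.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String s = "\u4e2d";

        // The number Unicode assigns to the character (its code point).
        System.out.printf("U+%04X%n", s.codePointAt(0)); // U+4E2D

        // The same code point, turned into bytes by two different encodings.
        System.out.println(hex(s.getBytes(StandardCharsets.UTF_8)));    // E4B8AD
        System.out.println(hex(s.getBytes(StandardCharsets.UTF_16BE))); // 4E2D
    }
}
```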

The prefix UTF in the familiar names UTF-8 and UTF-16 stands for Unicode Transformation Format; these are the actual encodings of Unicode. Both convert Unicode character numbers into binary, but UTF-16 uses 2 or 4 bytes per character, while UTF-8 uses 1 to 4 bytes and saves space for text that is mostly ASCII. UTF-8 is now the more widely used of the two. It is compatible with ASCII, representing an ASCII character in just 1 byte, while commonly used Chinese characters generally take 3 bytes. Some domestic programs that do not need internationalization use GBK instead, because it needs only 2 bytes per Chinese character and therefore takes less space than UTF-8.
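These byte counts are easy to check. The sketch below is only an illustration: the class name, the byteLength helper, and the sample characters are chosen for the example, and it assumes the GBK charset is available in the runtime. UTF_16BE is used so that no byte order mark is added to the count.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class EncodedLengths {
    // Hypothetical helper: number of bytes a string occupies in a given charset.
    static int byteLength(String text, Charset charset) {
        return text.getBytes(charset).length;
    }

    public static void main(String[] args) {
        Charset gbk = Charset.forName("GBK"); // assumes GBK is available

        String ascii = "A";
        String chinese = "\u4e2d";     // U+4E2D, a common Chinese character
        String emoji = "\uD83D\uDE00"; // U+1F600, outside the 16-bit range

        System.out.println(byteLength(ascii, StandardCharsets.UTF_8));      // 1
        System.out.println(byteLength(ascii, gbk));                         // 1
        System.out.println(byteLength(ascii, StandardCharsets.UTF_16BE));   // 2

        System.out.println(byteLength(chinese, StandardCharsets.UTF_8));    // 3
        System.out.println(byteLength(chinese, gbk));                       // 2
        System.out.println(byteLength(chinese, StandardCharsets.UTF_16BE)); // 2

        System.out.println(byteLength(emoji, StandardCharsets.UTF_8));      // 4
        System.out.println(byteLength(emoji, StandardCharsets.UTF_16BE));   // 4
    }
}
```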

When programming at the level of characters, we rarely pay attention to how common character encodings work. It was only when I had to program at the level of raw bytes that I found I could not explain these principles clearly on the spot. So I have sorted them out here, as encouragement for myself and for others.

Origin www.cnblogs.com/qingkongxing/p/11444208.html