UTF-16: an encoding pitfall that even top programmers overlook, and one the JDK got wrong for a decade!

  Unicode is an industry standard in the field of computer science that covers character sets and encoding schemes. It was created to address the limitations of traditional character encodings: it assigns a unified, unique binary code to every character of every language, so as to meet the requirements of cross-language, cross-platform text conversion and processing.

 

  Unicode is a character encoding standard, maintained by an international organization, that aims to accommodate all of the world's writing systems and symbols. Unicode code points are currently organized into 17 groups covering 0x0000 to 0x10FFFF; each group is called a plane (Plane), and each plane contains 65,536 code points, for a total of 1,114,112. However, only a small number of planes are actually in use. UTF-8, UTF-16, and UTF-32 are schemes for encoding these code points as digital data.

 

  UTF-16 is one of the five character encoding forms in the third layer of Unicode's hierarchical model: an implementation of the coded character set. That is, it maps the abstract code points of the Unicode character set to sequences of 16-bit integers (code units) for data storage or transmission. A Unicode code point requires either one or two 16-bit code units, so UTF-16 is a variable-length encoding.

 

  The introduction above emphasizes that UTF-16 is a variable-length encoding; in practice, however, many people who have learned about this encoding believe it is fixed-length. This misconception rarely causes any visible problems in day-to-day programming, because commonly used Chinese characters can all be represented with a single 16-bit code unit. But there are nearly 80,000 Chinese characters in total, and a 16-bit code unit can hold at most 65,535 (0xFFFF) values, so more than half of the less common Chinese characters are defined as extended characters, which require two 16-bit code units each.
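The difference is easy to see from managed code; the snippet below (a minimal sketch) compares the string length of a common BMP character with that of an extended character from CJK Extension B:

```csharp
using System;

class SurrogateLengthDemo
{
    static void Main()
    {
        // A BMP character (code point <= 0xFFFF) needs one 16-bit code unit.
        string basic = "\u4E2D";        // U+4E2D, a common Chinese character

        // An extended character (code point > 0xFFFF) needs two code units.
        string extended = "\U00020C30"; // U+20C30, from CJK Extension B

        Console.WriteLine(basic.Length);    // 1
        Console.WriteLine(extended.Length); // 2
    }
}
```

Note that `string.Length` counts 16-bit code units, not characters, which is exactly why the fixed-length assumption is dangerous.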

 

  UTF-16 encodes text as 16-bit unsigned integer code units. Let U denote a Unicode code point. The encoding rules are as follows:

 

  If U < 0x10000, the UTF-16 encoding of U is the 16-bit unsigned integer equal to U (for ease of writing, a 16-bit unsigned integer will hereinafter be referred to as a WORD).

 

  If U >= 0x10000, first compute U' = U - 0x10000, then write U' in binary form as: yyyy yyyy yyxx xxxx xxxx. The UTF-16 encoding of U (in binary) is then: 110110yyyyyyyyyy 110111xxxxxxxxxx.

 

  Why can U' always be written in 20 bits? The largest Unicode code point is 0x10FFFF; after subtracting 0x10000, the maximum value of U' is 0xFFFFF, which certainly fits in 20 binary digits. For example: take the Unicode code point 0x20C30. Subtracting 0x10000 gives 0x10C30, which in binary is 0001 0000 1100 0011 0000. Substituting the first ten bits for the y's in the template and the last ten bits for the x's gives: 1101100001000011 1101110000110000, i.e. 0xD843 0xDC30.
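This worked example can be checked with a few lines of C# that apply the bit template directly:

```csharp
using System;

class SurrogatePairDemo
{
    static void Main()
    {
        int u = 0x20C30;
        int uPrime = u - 0x10000;                      // 0x10C30, fits in 20 bits

        char high = (char)(0xD800 | (uPrime >> 10));   // top 10 bits -> 110110yyyyyyyyyy
        char low  = (char)(0xDC00 | (uPrime & 0x3FF)); // low 10 bits -> 110111xxxxxxxxxx

        Console.WriteLine($"0x{(int)high:X4} 0x{(int)low:X4}"); // 0xD843 0xDC30
    }
}
```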

 

  According to the rules above, the UTF-16 encoding of Unicode code points 0x10000-0x10FFFF consists of two WORDs: the high 6 bits of the first WORD are 110110, and the high 6 bits of the second WORD are 110111. It follows that the first WORD ranges (in binary) from 11011000 00000000 to 11011011 11111111, i.e. 0xD800-0xDBFF, and the second WORD ranges (in binary) from 11011100 00000000 to 11011111 11111111, i.e. 0xDC00-0xDFFF.

 

  To distinguish a single-WORD UTF-16 encoding from the two WORDs of a two-WORD UTF-16 encoding, the designers of Unicode reserved the range 0xD800-0xDFFF, calling it the surrogate area (Surrogate):

D800-DBFF    High Surrogates    (first WORD of a pair)
DC00-DFFF    Low Surrogates     (second WORD of a pair)


  A high surrogate is a code unit in the upper range, used as the first of the two WORDs of a UTF-16 encoding. A low surrogate is a code unit in the lower range, used as the second of the two WORDs.
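For reference, the .NET base class library already exposes these range checks as `char.IsHighSurrogate` and `char.IsLowSurrogate`:

```csharp
using System;

class SurrogateCheckDemo
{
    static void Main()
    {
        Console.WriteLine(char.IsHighSurrogate('\uD843')); // True: in 0xD800-0xDBFF
        Console.WriteLine(char.IsLowSurrogate('\uDC30'));  // True: in 0xDC00-0xDFFF
        Console.WriteLine(char.IsHighSurrogate('\u4E2D')); // False: an ordinary BMP character
    }
}
```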

 

  The above describes the UTF-16 encoding rules; how do we implement them? The following C# code shows how to convert between UTF-16 and UTF-32:

    public class Demo
    {
        internal const char HIGH_SURROGATE_START = '\ud800';
        internal const char HIGH_SURROGATE_END = '\udbff';
        internal const char LOW_SURROGATE_START = '\udc00';
        internal const char LOW_SURROGATE_END = '\udfff';

        internal const int UNICODE_PLANE00_END = 0x00ffff;
        internal const int UNICODE_PLANE01_START = 0x10000;
        internal const int UNICODE_PLANE16_END = 0x10ffff;

        public static bool IsHighSurrogate(char c)
        {
            return ((c >= HIGH_SURROGATE_START) && (c <= HIGH_SURROGATE_END));
        }

        public static bool IsLowSurrogate(char c)
        {
            return ((c >= LOW_SURROGATE_START) && (c <= LOW_SURROGATE_END));
        }

        public static char[] ConvertFromUtf32(int utf32)
        {
            // Unicode defines U+D800 ~ U+DFFF as a dedicated surrogate area; these values are not valid code points.
            if ((utf32 < 0 || utf32 > UNICODE_PLANE16_END) || (utf32 >= HIGH_SURROGATE_START && utf32 <= LOW_SURROGATE_END))
            {
                throw new ArgumentOutOfRangeException("utf32");
            }

            if (utf32 < UNICODE_PLANE01_START)
            {
                // This is a basic character.
                return new char[] { (char)utf32 };
            }

            // This is an extended character; it must be converted to a UTF-16 surrogate pair.
            utf32 -= UNICODE_PLANE01_START;

            return new char[]
            {
                (char)((utf32 / 0x400) + HIGH_SURROGATE_START),
                (char)((utf32 % 0x400) + LOW_SURROGATE_START)
            };
        }

        public static int ConvertToUtf32(char highSurrogate, char lowSurrogate)
        {
            if (!IsHighSurrogate(highSurrogate))
            {
                throw new ArgumentOutOfRangeException("highSurrogate");
            }
            if (!IsLowSurrogate(lowSurrogate))
            {
                throw new ArgumentOutOfRangeException("lowSurrogate");
            }

            return (((highSurrogate - HIGH_SURROGATE_START) * 0x400) + (lowSurrogate - LOW_SURROGATE_START) + UNICODE_PLANE01_START);
        }
    }
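The BCL ships equivalent conversions as `char.ConvertFromUtf32` and `char.ConvertToUtf32`; a quick round trip through them reproduces the 0x20C30 example:

```csharp
using System;

class ConvertDemo
{
    static void Main()
    {
        // Encode code point U+20C30 as a surrogate pair.
        string pair = char.ConvertFromUtf32(0x20C30);
        Console.WriteLine($"0x{(int)pair[0]:X4} 0x{(int)pair[1]:X4}"); // 0xD843 0xDC30

        // Decode the pair back to the original code point.
        int codePoint = char.ConvertToUtf32(pair[0], pair[1]);
        Console.WriteLine($"0x{codePoint:X}"); // 0x20C30
    }
}
```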

  Why do I say the JDK got this wrong for a decade? Because through the Java 7 era, the string architecture was unreasonable: UTF-16 was mistakenly treated as a fixed-length encoding, when in fact UTF-16 is variable-length. Since the maximum value of a char (one code unit) is 0xFFFF while the maximum Unicode code point is 0x10FFFF, a small proportion of characters require two chars to represent. Java later recognized the error, and over the next few releases hurried to change the string encoding to UTF-8 (in fact, it checks whether the string contains characters above 0xFFFF; if so it uses UTF-8, otherwise it continues with the old, flawed UTF-16 handling). Only in later versions did Java use a correct UTF-16 encoding throughout.

 

 

  There was a joke going around a couple of years ago that only phones costing more than 2,000 yuan could type certain Chinese characters with their input method. The reason is exactly this: those characters are extended characters that require surrogate-pair support.
 
 
  Here I would also like to mention my open source project, a high-performance JSON parsing library for the .NET platform: Swifter.Json: https://github.com/Dogwei/Swifter.Json . I hope you will support it.
 
  Finally, here is the UTF-16 / UTF-8 conversion source code used internally by Swifter.Json:
    public static unsafe class EncodingHelper
    {
        public const char ASCIIMaxChar = (char)0x7f;
        public const int Utf8MaxBytesCount = 4;

        public static int GetUtf8Bytes(char* chars, int length, byte* bytes)
        {
            var destination = bytes;

            for (int i = 0; i < length; i++)
            {
                int c = chars[i];

                if (c <= 0x7f)
                {
                    *destination = (byte)c; ++destination;
                }
                else if (c <= 0x7ff)
                {
                    *destination = (byte)(0xc0 | (c >> 6)); ++destination;
                    *destination = (byte)(0x80 | (c & 0x3f)); ++destination;
                }
                else if (c >= 0xd800 && c <= 0xdbff)
                {
                    c = ((c & 0x3ff) << 10) + 0x10000;

                    ++i;

                    if (i < length)
                    {
                        c |= chars[i] & 0x3ff;
                    }

                    *destination = (byte)(0xf0 | (c >> 18)); ++destination;
                    *destination = (byte)(0x80 | ((c >> 12) & 0x3f)); ++destination;
                    *destination = (byte)(0x80 | ((c >> 6) & 0x3f)); ++destination;
                    *destination = (byte)(0x80 | (c & 0x3f)); ++destination;
                }
                else
                {
                    *destination = (byte)(0xe0 | (c >> 12)); ++destination;
                    *destination = (byte)(0x80 | ((c >> 6) & 0x3f)); ++destination;
                    *destination = (byte)(0x80 | (c & 0x3f)); ++destination;
                }
            }

            return (int)(destination - bytes);
        }

        [MethodImpl(VersionDifferences.AggressiveInlining)]
        public static int GetUtf8Chars(byte* bytes, int length, char* chars)
        {
            var destination = chars;

            var current = bytes;
            var end = bytes + length;

            for (; current < end; ++current)
            {
                if ((*((byte*)destination) = *current) > 0x7f)
                {
                    return GetGetUtf8Chars(current, end, destination, chars);
                }

                ++destination;
            }

            return (int)(destination - chars);
        }

        [MethodImpl(MethodImplOptions.NoInlining)]
        private static int GetGetUtf8Chars(byte* current, byte* end, char* destination, char* chars)
        {
            if (current + Utf8MaxBytesCount < end)
            {
                end -= Utf8MaxBytesCount;

                // Unchecked index.
                for (; current < end; ++current)
                {
                    var byt = *current;

                    if (byt <= 0x7f)
                    {
                        *destination = (char)byt;
                    }
                    else if (byt <= 0xdf)
                    {
                        *destination = (char)(((byt & 0x1f) << 6) | (current[1] & 0x3f));

                        ++current;
                    }
                    else if (byt <= 0xef)
                    {
                        *destination = (char)(((byt & 0xf) << 12) | ((current[1] & 0x3f) << 6) + (current[2] & 0x3f));

                        current += 2;
                    }
                    else
                    {
                        var utf32 = (((byt & 0x7) << 18) | ((current[1] & 0x3f) << 12) | ((current[2] & 0x3f) << 6) + (current[3] & 0x3f)) - 0x10000;

                        *destination = (char)(0xd800 | (utf32 >> 10)); ++destination;
                        *destination = (char)(0xdc00 | (utf32 & 0x3ff));

                        current += 3;
                    }

                    ++destination;
                }

                end += Utf8MaxBytesCount;
            }

            // Checked index.
            for (; current < end; ++current)
            {
                var byt = *current;

                if (byt <= 0x7f)
                {
                    *destination = (char)byt;
                }
                else if (byt <= 0xdf && current + 1 < end)
                {
                    *destination = (char)(((byt & 0x1f) << 6) | (current[1] & 0x3f));

                    ++current;
                }
                else if (byt <= 0xef && current + 2 < end)
                {
                    *destination = (char)(((byt & 0xf) << 12) | ((current[1] & 0x3f) << 6) + (current[2] & 0x3f));

                    current += 2;
                }
                else if (current + 3 < end)
                {
                    var utf32 = (((byt & 0x7) << 18) | ((current[1] & 0x3f) << 12) | ((current[2] & 0x3f) << 6) + (current[3] & 0x3f)) - 0x10000;

                    *destination = (char)(0xd800 | (utf32 >> 10)); ++destination;
                    *destination = (char)(0xdc00 | (utf32 & 0x3ff));

                    current += 3;
                }

                ++destination;
            }

            return (int)(destination - chars);
        }

        public static int GetUtf8CharsLength(byte* bytes, int length)
        {
            int count = 0;

            for (int i = 0; i < length;)
            {
                var byt = bytes[i];

                if (byt <= 0x7f) { i += 1; ++count; }      // 1-byte sequence -> 1 char
                else if (byt <= 0xdf) { i += 2; ++count; } // 2-byte sequence -> 1 char
                else if (byt <= 0xef) { i += 3; ++count; } // 3-byte sequence -> 1 char
                else { i += 4; count += 2; }               // 4-byte sequence -> a surrogate pair (2 chars)
            }

            return count;
        }

        public static int GetUtf8MaxBytesLength(int charsLength)
        {
            return charsLength * Utf8MaxBytesCount;
        }

        [MethodImpl(VersionDifferences.AggressiveInlining)]
        public static int GetUtf8MaxCharsLength(int bytesLength)
        {
            return bytesLength;
        }
    }
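As a quick sanity check of the UTF-8 side, the built-in `System.Text.Encoding.UTF8` shows that one extended character (a surrogate pair in UTF-16) becomes a single 4-byte UTF-8 sequence, matching the 4-byte branch in the code above:

```csharp
using System;
using System.Text;

class Utf8Demo
{
    static void Main()
    {
        // U+20C30 is one extended character: two UTF-16 code units, four UTF-8 bytes.
        byte[] bytes = Encoding.UTF8.GetBytes("\U00020C30");
        Console.WriteLine(BitConverter.ToString(bytes)); // F0-A0-B0-B0
    }
}
```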

 


Origin www.cnblogs.com/Dogwei/p/11236706.html