Unicode character encoding range table

I could not find a suitable reference, so I generated a code table myself and worked out these ranges from it. Most of them are accurate, but I could not identify some of the foreign scripts. (Text: IT Plato)

    As for what this table is for: if you want to filter user input, try to detect garbled text, or build a word segmentation system that works on UTF-8 text, these ranges are a useful reference.

1. Chinese character area:

(1) Rare characters:

0x3400--0x4DB5

(2) Common characters:

0x4E00--0x9FA5

(3) Others (compatibility ideographs):

0xF900--0xFA2C
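
As one possible use of the three ranges above, here is a minimal Python sketch that filters user input down to Chinese characters only. The names CHINESE_RANGES, is_chinese_char, and keep_chinese are made up for this illustration; the range values are copied from this table as-is.

```python
# Inclusive code point ranges copied from the table above.
CHINESE_RANGES = [
    (0x3400, 0x4DB5),  # rare characters
    (0x4E00, 0x9FA5),  # common characters
    (0xF900, 0xFA2C),  # others
]


def is_chinese_char(ch: str) -> bool:
    """Return True if the single character ch falls in one of the ranges."""
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in CHINESE_RANGES)


def keep_chinese(text: str) -> str:
    """Drop every character that is not in the Chinese character area."""
    return "".join(ch for ch in text if is_chinese_char(ch))


print(keep_chinese("Hello, 世界! 123"))  # prints: 世界
```

A linear scan over three ranges is enough here; a longer table would probably want a sorted list and binary search instead.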

2. Korean area:

(1) Korean letters (Hangul Jamo):

0x1100--0x11F9

0x3130--0x318E

(2) Korean syllables (Hangul):

0xAC00--0xD7A3

3. Symbols:

(1) Enclosed and numbered list markers (e.g. ① ⑴ ⒈)

0x2460--0x24E9

(2) Box-drawing characters, block elements, etc. (┊┌┍ ▃ ▄ ▅)

0x2500--0x25FF

(3) Pictographic symbols (miscellaneous symbols and dingbats)

0x2600--0x2671

0x2700--0x27FF

(4) Full-width brackets (《》「」『』【】〖〗, etc.)

0x3007--0x301A

(5) Special serial numbers and unit symbols ((1) ㎎ ㎏ ㎡, etc.)

0x3200--0x33FF

(6) Full-width characters corresponding to ASCII

0xFF00--0xFF5E

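Corresponding ASCII range: 0x0021--0x007E (i.e. the ! to ~ interval); each full-width code point is its ASCII counterpart plus 0xFEE0. 0xFF00 itself is unassigned.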
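
The full-width forms 0xFF01 to 0xFF5E sit exactly 0xFEE0 above ASCII 0x21 to 0x7E, so normalizing user input can be sketched in a few lines of Python. The function name to_halfwidth is made up for this example, and mapping the ideographic space 0x3000 to a plain space is an extra assumption that is not part of the table above.

```python
FULLWIDTH_START = 0xFF01   # full-width '!'
FULLWIDTH_END = 0xFF5E     # full-width '~'
OFFSET = 0xFEE0            # 0xFF01 - 0x0021
IDEOGRAPHIC_SPACE = 0x3000  # assumption: treat it as an ordinary space


def to_halfwidth(text: str) -> str:
    """Replace full-width ASCII variants with their plain ASCII characters."""
    out = []
    for ch in text:
        cp = ord(ch)
        if FULLWIDTH_START <= cp <= FULLWIDTH_END:
            out.append(chr(cp - OFFSET))
        elif cp == IDEOGRAPHIC_SPACE:
            out.append(" ")
        else:
            out.append(ch)  # everything else passes through unchanged
    return "".join(out)


print(to_halfwidth("ＡＢＣ１２３！"))  # prints: ABC123!
```

Python's unicodedata.normalize("NFKC", text) performs a similar folding, but it also rewrites many other compatibility characters, which may be more than a simple input filter wants.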

(7) Other special symbols

0x2000--0x22FF

4. Japanese kana and phonetic symbol area:

0x3041--0x30FF

0x3104--0x312A (note: this range is Bopomofo, the Chinese phonetic symbols, rather than Japanese)

0xFF66--0xFF9E

Within the first range, hiragana is 0x3041--0x3094 and katakana is 0x30A1--0x30FA.

5. Other alphabets and phonetic symbols:

(1) Latin letters with diacritics (Roman phonetic symbols)

0x00C0--0x0232

(2) Other European letters (Greek, Cyrillic, extended Latin, extended Greek)

0x0386--0x04F3

0x1E00--0x1EFF

0x1F00--0x1FFF

(3) Arabic

0x0620--0x06FF

(4) Indic and Thai scripts (Devanagari, which is used for Sanskrit; Gurmukhi and Gujarati; Thai)

0x0904--0x0970

0x0A00--0x0AEF

0x0E00--0x0E32
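
To show how ranges like these could be used for spotting garbled text, here is a rough Python sketch that looks up each character in a sorted range list with binary search and reports the share of characters that match nothing. It is only an illustration: the labels, the names classify and unknown_ratio, and the ASCII row are assumptions added for the example, and only a subset of the table is included.

```python
from bisect import bisect_right
from collections import Counter

# (start, end, label), sorted by start and non-overlapping.
# Range values come from this table; the labels are this sketch's shorthand.
RANGES = [
    (0x0020, 0x007E, "ascii"),          # assumption: printable ASCII is fine
    (0x00C0, 0x0232, "latin-ext"),
    (0x0386, 0x04F3, "greek-cyrillic"),
    (0x0620, 0x06FF, "arabic"),
    (0x1100, 0x11F9, "hangul-jamo"),
    (0x2000, 0x22FF, "symbol"),
    (0x2460, 0x24E9, "symbol"),
    (0x2500, 0x25FF, "symbol"),
    (0x3041, 0x30FF, "kana"),
    (0x3400, 0x4DB5, "chinese"),
    (0x4E00, 0x9FA5, "chinese"),
    (0xAC00, 0xD7A3, "hangul"),
    (0xFF00, 0xFF5E, "fullwidth"),
]
STARTS = [start for start, _, _ in RANGES]


def classify(ch: str) -> str:
    """Return the label of the range containing ch, or 'unknown'."""
    cp = ord(ch)
    i = bisect_right(STARTS, cp) - 1  # last range starting at or before cp
    if i >= 0 and cp <= RANGES[i][1]:
        return RANGES[i][2]
    return "unknown"


def unknown_ratio(text: str) -> float:
    """Fraction of characters that fall outside every listed range."""
    if not text:
        return 0.0
    counts = Counter(classify(ch) for ch in text)
    return counts["unknown"] / len(text)


print(classify("中"), classify("한"), classify("ア"))  # prints: chinese hangul kana
```

A caller might treat input as suspicious when unknown_ratio exceeds some threshold; what threshold makes sense depends on the data and is left open here.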

 

6. Unicode to UTF-8 encoding conversion:

Unicode code point range (hex) | UTF-8 byte sequence (binary)

0000 0000 - 0000 007F   | 0xxxxxxx

0000 0080 - 0000 07FF   | 110xxxxx 10xxxxxx

0000 0800 - 0000 FFFF   | 1110xxxx 10xxxxxx 10xxxxxx

0001 0000 - 0010 FFFF   | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
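
The bit patterns in this table can be applied by hand. Below is a small Python sketch that encodes a single code point up to 0xFFFF, the range that the rest of this article deals with. The function name utf8_encode is made up for the example, and it deliberately does not reject the surrogate range 0xD800--0xDFFF, which a real encoder would refuse.

```python
def utf8_encode(cp: int) -> bytes:
    """Encode one code point (0 <= cp <= 0xFFFF) using the table's patterns."""
    if cp <= 0x7F:
        # 0xxxxxxx
        return bytes([cp])
    if cp <= 0x7FF:
        # 110xxxxx 10xxxxxx
        return bytes([
            0b11000000 | (cp >> 6),
            0b10000000 | (cp & 0b111111),
        ])
    if cp <= 0xFFFF:
        # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([
            0b11100000 | (cp >> 12),
            0b10000000 | ((cp >> 6) & 0b111111),
            0b10000000 | (cp & 0b111111),
        ])
    raise ValueError("this sketch only handles code points up to 0xFFFF")


print(utf8_encode(0x4E2D).hex())  # prints: e4b8ad
```

In practice chr(cp).encode("utf-8") does the same job; writing it out by hand is mainly useful for seeing why Chinese characters in 0x4E00--0x9FA5 always take three bytes.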

