Because I could not find a more suitable reference, I generated a code table myself and worked out the ranges below from it. Most of them are accurate, but I could not identify some of the foreign scripts. (Text: IT Plato)
As for what this table is for: if you want to filter user input, try to detect garbled text, or build a word segmentation system that works on UTF-8 text, these ranges are a useful reference.
1. Chinese character area:
(1) Rare characters:
0x3400--0x4DB5
(2) Common characters:
0x4E00--0x9FA5
(3) Others (compatibility ideographs):
0xF900--0xFA2C
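As a minimal sketch of how the three ranges above might be used to test for Chinese characters, the snippet below checks a code point against them. The names `HAN_RANGES`, `is_han`, and `han_ratio` are my own and purely illustrative.

```python
# Inclusive ranges copied from items (1) to (3) above.
HAN_RANGES = [
    (0x3400, 0x4DB5),  # item (1)
    (0x4E00, 0x9FA5),  # item (2)
    (0xF900, 0xFA2C),  # item (3)
]


def is_han(ch):
    """Return True if the single character ch falls in one of HAN_RANGES."""
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in HAN_RANGES)


def han_ratio(text):
    """Fraction of characters in text that are Chinese characters (0.0 if empty)."""
    if not text:
        return 0.0
    return sum(is_han(ch) for ch in text) / len(text)
```

A ratio like this could serve as one input to an input filter, for example rejecting a "Chinese name" field whose ratio is well below 1.0.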
2. Korean area:
(1) Korean jamo (phonetic letters):
0x1100--0x11F9
0x3130--0x318E
(2) Korean syllables:
0xAC00--0xD7A3
3. Symbol area:
(1) Enumeration markers (e.g. ① ⑴ ⒈)
0x2460--0x24E9
(2) Box-drawing, block, and geometric shape characters (┊┌┍ ▃ ▄ ▅)
0x2500--0x25FF
(3) Pictographic symbols (e.g. ☀ ☎ ✂)
0x2600--0x2671
0x2700--0x27FF
(4) Full-width brackets (《》「」『』【】〖〗, etc.)
0x3007--0x301A
(5) Enclosed numbering and unit symbols (㈠ ㎎ ㎏ ㎡, etc.)
0x3200--0x33FF
(6) Full-width characters corresponding to ASCII
0xFF01--0xFF5E
Corresponding to: 0x0021--0x007E (i.e. the ! -- ~ interval)
(7) Other special symbols
0x2000--0x22FF
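The correspondence in item (6) can be turned into a full-width to half-width converter. This is only a sketch, assuming the full-width forms run from 0xFF01 to 0xFF5E and sit exactly 0xFEE0 above ASCII 0x21 to 0x7E. The function name `to_halfwidth` is my own, and the handling of the full-width space 0x3000, which lies outside that range, is an extra I added.

```python
FULLWIDTH_START = 0xFF01  # full-width '!'
FULLWIDTH_END = 0xFF5E    # full-width '~'
OFFSET = 0xFEE0           # 0xFF01 - 0x0021


def to_halfwidth(text):
    """Map full-width ASCII variants in text to their ordinary ASCII forms."""
    out = []
    for ch in text:
        cp = ord(ch)
        if FULLWIDTH_START <= cp <= FULLWIDTH_END:
            out.append(chr(cp - OFFSET))
        elif cp == 0x3000:
            # Ideographic (full-width) space; not part of the range above.
            out.append(" ")
        else:
            out.append(ch)
    return "".join(out)
```

Normalizing input this way before filtering means one set of ASCII rules can cover both widths.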
4. Japanese kana area:
0x3041--0x30FF
0x3104--0x312A (this range is actually Bopomofo, the Chinese Zhuyin phonetic symbols, rather than kana)
0xFF66--0xFF9E (half-width katakana)
Among them, Hiragana: 0x3041--0x3094
Katakana: 0x30A1--0x30FA
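The two kana ranges above can be used to tell hiragana from katakana. In the sketch below, `kana_type` and `hiragana_to_katakana` are illustrative names of my own. The conversion relies on each hiragana in 0x3041--0x3094 having its katakana counterpart exactly 0x60 higher, which is how Unicode lays out the two blocks.

```python
HIRAGANA = (0x3041, 0x3094)
KATAKANA = (0x30A1, 0x30FA)
KANA_OFFSET = 0x30A1 - 0x3041  # 0x60


def kana_type(ch):
    """Return 'hiragana', 'katakana', or None for a single character."""
    cp = ord(ch)
    if HIRAGANA[0] <= cp <= HIRAGANA[1]:
        return "hiragana"
    if KATAKANA[0] <= cp <= KATAKANA[1]:
        return "katakana"
    return None


def hiragana_to_katakana(text):
    """Shift every hiragana character in text to its katakana counterpart."""
    return "".join(
        chr(ord(ch) + KANA_OFFSET) if kana_type(ch) == "hiragana" else ch
        for ch in text
    )
```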
5. Other scripts and phonetic symbols area:
(1) Latin letters with diacritics
0x00C0--0x0232
(2) Greek, Cyrillic, and extended Latin and Greek characters
0x0386--0x04F3
0x1E00--0x1EFF
0x1F00--0x1FFF
(3) Arabic
0x0620--0x06FF
(4) Indic and Thai scripts
0x0904--0x0970 (Devanagari, used for Sanskrit and Hindi)
0x0A00--0x0AEF (Gurmukhi and Gujarati)
0x0E00--0x0E32 (Thai)
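For the filtering and garbled-text detection mentioned at the start, the ranges in sections 1 to 5 can be collected into one lookup table. The following is a rough sketch, not a tested detector. The label names, the 0x3104--0x312A range being left out, and the `looks_garbled` threshold of 0.3 are all assumptions of mine.

```python
from collections import Counter

# (start, end, label) triples adapted from sections 1 to 5; labels are my own.
RANGES = [
    (0x3400, 0x4DB5, "chinese"), (0x4E00, 0x9FA5, "chinese"),
    (0xF900, 0xFA2C, "chinese"),
    (0x1100, 0x11F9, "korean"), (0x3130, 0x318E, "korean"),
    (0xAC00, 0xD7A3, "korean"),
    (0x2000, 0x22FF, "symbol"), (0x2460, 0x24E9, "symbol"),
    (0x2500, 0x25FF, "symbol"), (0x2600, 0x2671, "symbol"),
    (0x2700, 0x27FF, "symbol"), (0x3007, 0x301A, "symbol"),
    (0x3200, 0x33FF, "symbol"), (0xFF01, 0xFF5E, "symbol"),
    (0x3041, 0x30FF, "japanese"), (0xFF66, 0xFF9E, "japanese"),
    (0x00C0, 0x0232, "other"), (0x0386, 0x04F3, "other"),
    (0x1E00, 0x1EFF, "other"), (0x1F00, 0x1FFF, "other"),
    (0x0620, 0x06FF, "other"), (0x0904, 0x0970, "other"),
    (0x0A00, 0x0AEF, "other"), (0x0E00, 0x0E32, "other"),
]


def classify(ch):
    """Return the label of the range containing ch, 'ascii', or 'unknown'."""
    cp = ord(ch)
    if cp < 0x80:
        return "ascii"
    for lo, hi, label in RANGES:
        if lo <= cp <= hi:
            return label
    return "unknown"


def script_profile(text):
    """Count how many characters of text fall under each label."""
    return Counter(classify(ch) for ch in text)


def looks_garbled(text, threshold=0.3):
    """Hypothetical heuristic: too many characters outside every known range."""
    if not text:
        return False
    return script_profile(text)["unknown"] / len(text) > threshold
```

A linear scan over two dozen ranges is fine for short inputs. For bulk text, sorting the table and using binary search would be the obvious next step.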
6. Unicode to UTF-8 encoding conversion:
Unicode code point range | UTF-8 encoding
0x0000--0x007F | 0xxxxxxx
0x0080--0x07FF | 110xxxxx 10xxxxxx
0x0800--0xFFFF | 1110xxxx 10xxxxxx 10xxxxxx
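The bit layouts in the table above can be applied by hand. The sketch below covers only code points up to 0xFFFF, as the table does. `utf8_encode_bmp` is an illustrative name of my own, and rejecting the surrogate range 0xD800--0xDFFF is my addition, mirroring what Python's built-in encoder does.

```python
def utf8_encode_bmp(cp):
    """Encode one code point in 0x0000..0xFFFF as UTF-8 bytes."""
    if not 0 <= cp <= 0xFFFF:
        raise ValueError("only code points up to 0xFFFF are covered here")
    if 0xD800 <= cp <= 0xDFFF:
        raise ValueError("surrogate code points cannot be encoded")
    if cp <= 0x7F:
        # 0xxxxxxx: the code point is stored as-is.
        return bytes([cp])
    if cp <= 0x7FF:
        # 110xxxxx 10xxxxxx: top 5 bits, then low 6 bits.
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    # 1110xxxx 10xxxxxx 10xxxxxx: top 4 bits, middle 6 bits, low 6 bits.
    return bytes([
        0xE0 | (cp >> 12),
        0x80 | ((cp >> 6) & 0x3F),
        0x80 | (cp & 0x3F),
    ])
```

For example, 中 is 0x4E2D, which falls in the third row and encodes to the three bytes E4 B8 AD. The result can be checked against `"中".encode("utf-8")`.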