The difference between utf8 and utf8mb4 in MySQL

MySQL added the utf8mb4 character set in version 5.5.3. The "mb4" suffix is commonly read as "most bytes 4", meaning at most four bytes per character, and the character set exists specifically to store four-byte Unicode characters. Fortunately, utf8mb4 is a superset of utf8, so nothing needs to be converted beyond changing the character set to utf8mb4. If saving space is the priority, utf8 is usually enough.

   2. Content description

   As mentioned above, utf8 can store most Chinese characters, so why use utf8mb4? The reason is that MySQL's utf8 character set supports at most 3 bytes per character. When it encounters a 4-byte character, the insert typically fails with an error. The largest code point that three-byte UTF-8 can encode is U+FFFF, which is the end of the Basic Multilingual Plane (BMP) in Unicode. In other words, no Unicode character outside the Basic Multilingual Plane can be stored in MySQL's utf8 character set. That includes emoji (a special range of Unicode characters, common on iOS and Android phones), many uncommon Chinese characters, and any newly added Unicode characters.
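The three-byte limit can be checked before data ever reaches the database. The following is a minimal sketch, not part of MySQL itself: the helper name `fits_in_mysql_utf8` is hypothetical, and it assumes that "fits" simply means every code point lies in the BMP (at or below U+FFFF).

```python
def fits_in_mysql_utf8(text: str) -> bool:
    """Return True if every character lies in the BMP (code point <= U+FFFF),
    i.e. encodes to at most 3 bytes in UTF-8.

    Hypothetical helper for illustration only.
    """
    return all(ord(ch) <= 0xFFFF for ch in text)


# Sample strings: ASCII, common Chinese, an emoji (U+1F600),
# and a rare CJK character from a supplementary plane (U+20000).
samples = ["MySQL", "\u4e2d\u6587", "\U0001F600", "\U00020000"]

for s in samples:
    byte_len = len(s.encode("utf-8"))
    print(repr(s), "bytes:", byte_len, "fits in utf8:", fits_in_mysql_utf8(s))
```

The two common Chinese characters take 3 bytes each and pass the check, while the emoji and the rare CJK character each take 4 bytes and do not.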

   3. The root cause of the problem

   The original UTF-8 format used one to six bytes and could encode up to 31 bits. The latest UTF-8 specification uses only one to four bytes and can encode a maximum of 21 bits, just enough to represent all 17 Unicode planes.
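The byte-length boundaries and the 21-bit figure can be verified with a few lines of Python. This is only an illustrative sketch using Python's built-in UTF-8 encoder; the helper name `utf8_len` is made up for this example.

```python
def utf8_len(code_point: int) -> int:
    """Number of bytes the given code point occupies in UTF-8."""
    return len(chr(code_point).encode("utf-8"))


# Code points on either side of each length boundary, plus the
# highest valid Unicode code point, U+10FFFF.
boundaries = [0x7F, 0x80, 0x7FF, 0x800, 0xFFFF, 0x10000, 0x10FFFF]

for cp in boundaries:
    print(f"U+{cp:04X}: {utf8_len(cp)} byte(s)")

# The highest code point needs 21 bits, and 17 planes of 65,536
# code points each cover exactly the range U+0000..U+10FFFF.
print((0x10FFFF).bit_length())
print(17 * 0x10000 == 0x10FFFF + 1)
```

U+FFFF is the last code point that fits in three bytes. Everything from U+10000 upward needs four.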

   In MySQL, utf8 is a character set that supports only UTF-8 characters of up to three bytes, that is, the Basic Multilingual Plane in Unicode.

   Why does utf8 in MySQL support only UTF-8 characters of up to three bytes? My guess is that when MySQL was first being developed, Unicode had no supplementary planes yet. At that time, the Unicode committee was still dreaming that "65,535 characters are enough for the whole world". String lengths in MySQL count characters rather than bytes, and for the CHAR data type enough space has to be reserved for the whole string. With the utf8 character set, the reserved space is the byte length of the longest utf8 character multiplied by the declared string length, so it made sense to cap utf8 at 3 bytes per character. For example, for CHAR(100) MySQL reserves 300 bytes. As for why later versions did not extend utf8 itself to 4-byte UTF-8 characters, I think one reason is backward compatibility, and another is that characters outside the Basic Multilingual Plane are rarely used.
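The CHAR(100) arithmetic above can be written out as a small calculation. This sketch only models the fixed-width reservation as described in the text; the function name `char_reserved_bytes` is hypothetical, and the space a real storage engine uses may differ depending on the engine and row format.

```python
# Maximum bytes per character for each MySQL character set,
# as described in the text.
MAX_BYTES_PER_CHAR = {"utf8": 3, "utf8mb4": 4}


def char_reserved_bytes(n_chars: int, charset: str) -> int:
    """Bytes reserved for CHAR(n_chars) under the simple model:
    declared length times the longest character in the charset."""
    return n_chars * MAX_BYTES_PER_CHAR[charset]


for charset in ("utf8", "utf8mb4"):
    print(f"CHAR(100) with {charset}: {char_reserved_bytes(100, charset)} bytes")
```

Under this model, moving a CHAR(100) column from utf8 to utf8mb4 raises the reservation from 300 to 400 bytes, which is why fixed-width columns feel the change most.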

   To store 4-byte UTF-8 characters in MySQL, you need the utf8mb4 character set, which is supported only from version 5.5.3 onward (check your version with: select version();). In my view, utf8mb4 should always be used instead of utf8 for better compatibility. For CHAR columns, utf8mb4 consumes more space, so, following MySQL's official recommendation, use VARCHAR instead of CHAR.

Modify the default database configuration

[client]
default-character-set = utf8mb4
[mysqld]
character-set-server=utf8mb4
collation-server=utf8mb4_unicode_ci
[mysql]
default-character-set = utf8mb4

 
