MySQL 8.0 Reference Manual(讀書筆記37節-- 字元編碼(4)-Unicode Support)

东山絮柳仔發表於2024-04-14

1.概述

The Unicode Standard includes characters from the Basic Multilingual Plane (BMP) and supplementary characters that lie outside the BMP. This section describes support for Unicode in MySQL.

BMP characters have these characteristics:

• Their code point values are between 0 and 65535 (or U+0000 and U+FFFF).

• They can be encoded in a variable-length encoding using 8, 16, or 24 bits (1 to 3 bytes).

• They can be encoded in a fixed-length encoding using 16 bits (2 bytes).

• They are sufficient【səˈfɪʃnt 足夠的;(理由、條件)充足的, 充分的;充足的;<古>有充分能力的, 足以勝任的;】 for almost all characters in major languages.

Supplementary【ˌsʌplɪˈmentri 補充的;額外的;補充性的;外加的;增補性的;】 characters lie outside the BMP:

• Their code point values are between U+10000 and U+10FFFF).

• Unicode support for supplementary characters requires character sets that have a range outside BMP characters and therefore take more space than BMP characters (up to 4 bytes per character).

The UTF-8 (Unicode Transformation Format with 8-bit units) method for encoding Unicode data is implemented according to RFC 3629, which describes encoding sequences【ˈsiːkwənsɪz 順序;次序;一系列;一連串;(電影中表現同一主題或場面的)一組鏡頭;】 that take from one to four bytes. The idea of UTF-8 is that various Unicode characters are encoded using byte sequences of different lengths:

• Basic Latin letters, digits, and punctuation【ˌpʌŋktʃuˈeɪʃn 標點符號;標點符號用法;】 signs use one byte

• Most European and Middle East script letters fit into a 2-byte sequence: extended Latin letters (with tilde, macron, acute, grave and other accents), Cyrillic, Greek, Armenian, Hebrew, Arabic, Syriac, and others.

• Korean, Chinese, and Japanese ideographs use 3-byte or 4-byte sequences.

MySQL supports these Unicode character sets:

• utf8mb4: A UTF-8 encoding of the Unicode character set using one to four bytes per character.

• utf8mb3: A UTF-8 encoding of the Unicode character set using one to three bytes per character. This character set is deprecated【ˈdeprəkeɪtɪd 強烈反對;對…表示極不贊成;】 in MySQL 8.0, and you should use utfmb4 instead.

• utf8: An alias for utf8mb3. In MySQL 8.0, this alias is deprecated; use utf8mb4 instead. utf8 is expected in a future release to become an alias for utf8mb4.

• ucs2: The UCS-2 encoding of the Unicode character set using two bytes per character. Deprecated in MySQL 8.0.28; you should expect support for this character set to be removed in a future release.

• utf16: The UTF-16 encoding for the Unicode character set using two or four bytes per character. Like ucs2 but with an extension for supplementary characters.

• utf16le: The UTF-16LE encoding for the Unicode character set. Like utf16 but little-endian rather than big-endian.

• utf32: The UTF-32 encoding for the Unicode character set using four bytes per character.

Summarizes the general characteristics of Unicode character sets supported by MySQL.

Character Set Supported Characters Required Storage Per Character
utf8mb3, utf8 (deprecated) BMP only 1, 2, or 3 bytes
ucs2 BMP only 2 bytes
utf8mb4 BMP and supplementary 1, 2, 3, or 4 bytes
utf16 BMP and supplementary 2 or 4 bytes
utf16le BMP and supplementary 2 or 4 bytes
utf32 BMP and supplementary 4 bytes

Characters outside the BMP compare as REPLACEMENT CHARACTER and convert to '?' when converted to a Unicode character set that supports only BMP characters (utf8mb3 or ucs2).

If you use character sets that support supplementary characters and thus are “wider” than the BMP-only utf8mb3 and ucs2 character sets, there are potential【pəˈtenʃl 潛在的;可能的;】 incompatibility【ˌɪnkəmˌpætəˈbɪləti 盾;感情不和,不能和諧共處(尤指夫婦關係);感情不和,不能和諧共處(尤指夫婦關係);矛盾;】 issues for your applications;That section also describes how to convert tables from the (3-byte) utf8mb3 to the (4-byte) utf8mb4, and what constraints【kənˈstreɪnts 約束;限制;限定;嚴管;】 may apply in doing so.

A similar set of collations is available for most Unicode character sets. For example, each has a Danish collation, the names of which are utf8mb4_danish_ci, utf8mb3_danish_ci (deprecated), utf8_danish_ci (deprecated), ucs2_danish_ci, utf16_danish_ci, and utf32_danish_ci. The exception is utf16le, which has only two collations.

The MySQL implementation of UCS-2, UTF-16, and UTF-32 stores characters in big-endian byte order and does not use a byte order mark (BOM) at the beginning of values. Other database systems might use little-endian byte order or a BOM. In such cases, conversion of values needs to be performed when transferring data between those systems and MySQL. The implementation of UTF-16LE is little-endian.

大端模式(big endian):資料的高位位元組儲存在記憶體的低地址中,而低位位元組儲存在記憶體的高地址中。

小端模式(little endian):資料的高位位元組儲存在記憶體的高地址中,而低位位元組儲存在記憶體的低地址中。

MySQL uses no BOM for UTF-8 values.

2.The utf8mb4 Character Set (4-Byte UTF-8 Unicode Encoding)

The utfmb4 character set has these characteristics:

• Supports BMP and supplementary characters.

• Requires a maximum of four bytes per multibyte character.

utf8mb4 contrasts【kənˈtræsts 對比;對照;(靠近或作比較時)顯出明顯的差異,形成對比;】 with the utf8mb3 character set, which supports only BMP characters and uses a maximum of three bytes per character:

• For a BMP character, utf8mb4 and utf8mb3 have identical【aɪˈdentɪkl 完全相同的;相同的;同一的;完全同樣的;】 storage characteristics: same code values, same encoding, same length.

• For a supplementary character, utf8mb4 requires four bytes to store it, whereas【ˌwerˈæz (用以比較或對比兩個事實)然而,但是,儘管;(用於正式檔案中句子的開頭)鑑於;】 utf8mb3 cannot store the character at all. When converting utf8mb3 columns to utf8mb4, you need not worry about converting supplementary characters because there are none.

utf8mb4 is a superset of utf8mb3, so for an operation such as the following concatenation, the result has character set utf8mb4 and the collation of utf8mb4_col:

SELECT CONCAT(utf8mb3_col, utf8mb4_col);

Similarly, the following comparison in the WHERE clause works according to the collation of utf8mb4_col:

SELECT * FROM utf8mb3_tbl, utf8mb4_tbl
WHERE utf8mb3_tbl.utf8mb3_col = utf8mb4_tbl.utf8mb4_col;

3.The utf8mb3 Character Set (3-Byte UTF-8 Unicode Encoding)

The utf8mb3 character set has these characteristics:

• Supports BMP characters only (no support for supplementary characters)

• Requires a maximum of three bytes per multibyte character.

Exactly【ɪɡˈzæktli 確切地;準確地;精確地;(要求得到更多資訊)究竟,到底;(答語,表示贊同或強調正確)一點不錯,正是如此,完全正確;】 the same set of characters is available in utf8mb3 and ucs2. That is, they have the same repertoire【ˈrepərtwɑːr (總稱某人的)可表演專案;(某人的)全部才能,全部本領;】.

utf8mb3 can be used in CHARACTER SET clauses, and utf8mb3_collation_substring in COLLATE clauses, where collation_substring is bin, czech_ci, danish_ci, esperanto_ci, estonian_ci, and so forth.

Prior to MySQL 8.0.29, instances of utf8mb3 in statements were converted to utf8. In MySQL 8.0.30 and later, the reverse is true, so that in statements such as SHOW CREATE TABLE or SELECT CHARACTER_SET_NAME FROM INFORMATION_SCHEMA.COLUMNS or SELECT COLLATION_NAME FROM INFORMATION_SCHEMA.COLUMNS, users see the character set or collation name prefixed with utf8mb3 or utf8mb3_.

utf8mb3 is also valid (but deprecated) in contexts other than CHARACTER SET clauses.

4 The ucs2 Character Set (UCS-2 Unicode Encoding)

In UCS-2, every character is represented by a 2-byte Unicode code with the most significant【sɪɡˈnɪfɪkənt 重要的, 有重大意義的;顯著的, 值得注意的;<統>顯著的, 有效的;(詞綴等)有意義的;不可忽略的, 值得注意的;相當數量的;別有含義的, 意味深長的;(語言上)區別性的;】 byte first. For example: LATIN CAPITAL LETTER A has the code 0x0041 and it is stored as a 2-byte sequence: 0x00 0x41. CYRILLIC SMALL LETTER YERU (Unicode 0x044B) is stored as a 2-byte sequence: 0x04 0x4B.

The ucs2 character set has these characteristics:

• Supports BMP characters only (no support for supplementary characters)

• Uses a fixed-length 16-bit encoding and requires two bytes per character.

說明:

The ucs2 character set is deprecated in MySQL 8.0.28; expect it to be removed in a future MySQL release. Please use utf8mb4 instead.

5 The utf16 Character Set (UTF-16 Unicode Encoding)

The utf16 character set is the ucs2 character set with an extension【ɪkˈstenʃn 延伸;延長;延期;副檔名;擴大;擴建部分;進修部,進修課;增加的房間;增建部分;電話分機線;】 that enables encoding of supplementary characters:

• For a BMP character, utf16 and ucs2 have identical【aɪˈdentɪkl 完全相同的;相同的;同一的;完全同樣的;】 storage characteristics: same code values, same encoding, same length.

• For a supplementary character, utf16 has a special sequence for representing the character using 32 bits. This is called the “surrogate” mechanism: For a number greater than 0xffff, take 10 bits and add them to 0xd800 and put them in the first 16-bit word, take 10 more bits and add them to 0xdc00 and put them in the next 16-bit word. Consequently, all supplementary characters require 32 bits, where the first 16 bits are a number between 0xd800 and 0xdbff, and the last 16 bits are a number between 0xdc00 and 0xdfff.

Because utf16 supports surrogates and ucs2 does not, there is a validity check that applies only in utf16: You cannot insert a top surrogate without a bottom surrogate, or vice versa. For example:

INSERT INTO t (ucs2_column) VALUES (0xd800); /* legal */
INSERT INTO t (utf16_column)VALUES (0xd800); /* illegal */

There is no validity check for characters that are technically valid but are not true Unicode (that is, characters that Unicode considers to be “unassigned code points” or “private use” characters or even “illegals” like 0xffff). For example, since U+F8FF is the Apple Logo, this is legal:

INSERT INTO t (utf16_column)VALUES (0xf8ff); /* legal */

Such characters cannot be expected to mean the same thing to everyone.

Because MySQL must allow for the worst case (that one character requires four bytes) the maximum length of a utf16 column or index is only half of the maximum length for a ucs2 column or index. For example, the maximum length of a MEMORY table index key is 3072 bytes, so these statements create tables with the longest permitted indexes for ucs2 and utf16 columns:

CREATE TABLE tf (s1 VARCHAR(1536) CHARACTER SET ucs2) ENGINE=MEMORY;
CREATE INDEX i ON tf (s1);
CREATE TABLE tg (s1 VARCHAR(768) CHARACTER SET utf16) ENGINE=MEMORY;
CREATE INDEX i ON tg (s1);

補充:

The utf16le Character Set (UTF-16LE Unicode Encoding).This is the same as utf16 but is little-endian rather than big-endian.

6.The utf32 Character Set (UTF-32 Unicode Encoding)

The utf32 character set is fixed length (like ucs2 and unlike utf16). utf32 uses 32 bits for every character, unlike ucs2 (which uses 16 bits for every character), and unlike utf16 (which uses 16 bits for some characters and 32 bits for others).

utf32 takes twice as much space as ucs2 and more space than utf16, but utf32 has the same advantage as ucs2 that it is predictable【prɪˈdɪktəbl 可預測的;可預見的;可預料的;意料之中的;老套乏味的;】 for storage: The required number of bytes for utf32 equals the number of characters times 4. Also, unlike utf16, there are no tricks【trɪks 把戲;戲法;花招;詭計;騙局;引起錯覺(或記憶紊亂)的事物;】 for encoding in utf32, so the stored value equals the code value.

To demonstrate how the latter advantage is useful, here is an example that shows how to determine a utf8mb4 value given the utf32 code value:

/* Assume code value = 100cc LINEAR B WHEELED CHARIOT */
CREATE TABLE tmp (utf32_col CHAR(1) CHARACTER SET utf32,
 utf8mb4_col CHAR(1) CHARACTER SET utf8mb4);
INSERT INTO tmp VALUES (0x000100cc,NULL);
UPDATE tmp SET utf8mb4_col = utf32_col;
SELECT HEX(utf32_col),HEX(utf8mb4_col) FROM tmp;

MySQL is very forgiving【fərˈɡɪvɪŋ 寬容的;寬宏大量的;】 about additions of unassigned Unicode characters or private-use-area characters. There is in fact only one validity check for utf32: No code value may be greater than 0x10ffff. For example, this is illegal:

INSERT INTO t (utf32_column) VALUES (0x110000); /* illegal */

7. Converting Between 3-Byte and 4-Byte Unicode Character Sets

This section describes issues that you may face when converting character data between the utf8mb3 and utf8mb4 character sets.

This discussion focuses primarily on converting between utf8mb3 and utf8mb4, but similar principles apply to converting between the ucs2 character set and character sets such as utf16 or utf32.

The utf8mb3 and utf8mb4 character sets differ as follows:--特徵區別

• utf8mb3 supports only characters in the Basic Multilingual Plane (BMP). utf8mb4 additionally supports supplementary characters that lie outside the BMP.

• utf8mb3 uses a maximum of three bytes per character. utf8mb4 uses a maximum of four bytes per character.

One advantage of converting from utf8mb3 to utf8mb4 is that this enables applications to use supplementary characters. One tradeoff is that this may increase data storage space requirements.

In terms of table content, conversion from utf8mb3 to utf8mb4 presents no problems:

• For a BMP character, utf8mb4 and utf8mb3 have identical【aɪˈdentɪkl 完全相同的;相同的;同一的;完全同樣的;】 storage characteristics: same code values, same encoding, same length.

• For a supplementary character, utf8mb4 requires four bytes to store it, whereas【ˌwerˈæz (用以比較或對比兩個事實)然而,但是,儘管;(用於正式檔案中句子的開頭)鑑於;】 utf8mb3 cannot store the character at all. When converting utf8mb3 columns to utf8mb4, you need not worry about converting supplementary characters because there are none.

In terms of table structure, these are the primary potential incompatibilities:

• For the variable-length character data types (VARCHAR and the TEXT types), the maximum permitted length in characters is less for utf8mb4 columns than for utf8mb3 columns.

• For all character data types (CHAR, VARCHAR, and the TEXT types), the maximum number of characters that can be indexed is less for utf8mb4 columns than for utf8mb3 columns.

Consequently【ˈkɑːnsɪkwentli 因此;所以;】, to convert tables from utf8mb3 to utf8mb4, it may be necessary to change some column or index definitions.

The catch when converting from utf8mb3 to utf8mb4 is that the maximum length of a column or index key is unchanged in terms of bytes. Therefore, it is smaller in terms of characters because the maximum length of a character is four bytes instead of three. For the CHAR, VARCHAR, and TEXT data types, watch for these issues when converting your MySQL tables:

• Check all definitions of utf8mb3 columns and make sure they do not exceed【ɪkˈsiːd 超過(數量);超越(法律、命令等)的限制;】 the maximum length for the storage engine.

• Check all indexes on utf8mb3 columns and make sure they do not exceed the maximum length for the storage engine. Sometimes the maximum can change due to storage engine enhancements【ɛnˈhænsmənts 增強;提高;增加;】.

If the preceding conditions apply, you must either reduce the defined length of columns or indexes, or continue to use utf8mb3 rather than utf8mb4.

Here are some examples where structural【ˈstrʌktʃərəl 結構的;結構(或構造)上的;】 changes may be needed:

• A TINYTEXT column can hold up to 255 bytes【位元組;】, so it can hold up to 85 3-byte or 63 4-byte characters. Suppose that you have a TINYTEXT column that uses utf8mb3 but must be able to contain more than 63 characters. You cannot convert it to utf8mb4 unless you also change the data type to a longer type such as TEXT.

Similarly, a very long VARCHAR column may need to be changed to one of the longer TEXT types if you want to convert it from utf8mb3 to utf8mb4.

• InnoDB has a maximum index length of 767 bytes【位元組;】 for tables that use COMPACT or REDUNDANT row format, so for utf8mb3 or utf8mb4 columns, you can index a maximum of 255 or 191 characters, respectively. If you currently have utf8mb3 columns with indexes longer than 191 characters, you must index a smaller number of characters.

In an InnoDB table that uses COMPACT or REDUNDANT row format, these column and index definitions are legal:

col1 VARCHAR(500) CHARACTER SET utf8mb3, INDEX (col1(255))

To use utf8mb4 instead, the index must be smaller:

col1 VARCHAR(500) CHARACTER SET utf8mb4, INDEX (col1(191))

說明:

For InnoDB tables that use COMPRESSED or DYNAMIC row format, index key prefixes longer than 767 bytes (up to 3072 bytes) are permitted. Tables created with these row formats enable you to index a maximum of 1024 or 768 characters for utf8mb3 or utf8mb4 columns, respectively.

The preceding【prɪˈsiːdɪŋ 在前的;前面的;】 types of changes are most likely to be required only if you have very long columns or indexes. Otherwise, you should be able to convert your tables from utf8mb3 to utf8mb4 without problems, using ALTER TABLE as described previously.

The following items summarize other potential incompatibilities:

• SET NAMES 'utf8mb4' causes use of the 4-byte character set for connection character sets. As long as no 4-byte characters are sent from the server, there should be no problems. Otherwise, applications that expect to receive a maximum of three bytes per character may have problems. Conversely, applications that expect to send 4-byte characters must ensure that the server understands them.

• For replication, if character sets that support supplementary characters are to be used on the source, all replicas must understand them as well.

Also, keep in mind the general principle that if a table has different definitions on the source and replica, this can lead to unexpected results. For example, the differences in maximum index key length make it risky to use utf8mb3 on the source and utf8mb4 on the replica. --還是要主從一致的,不管是資料,還是結構

If you have converted to utf8mb4, utf16, utf16le, or utf32, and then decide to convert back to utf8mb3 or ucs2 (for example, to downgrade to an older version of MySQL), these considerations apply:

• utf8mb3 and ucs2 data should present no problems.

• The server must be recent【ˈriːsnt 最近的;近來的;新近的;】 enough to recognize definitions referring to the character set from which you are converting.

• For object definitions that refer to the utf8mb4 character set, you can dump them with mysqldump prior to downgrading, edit the dump file to change instances of utf8mb4 to utf8, and reload the file in the older server, as long as there are no 4-byte characters in the data. The older server sees utf8 in the dump file object definitions and create new objects that use the (3-byte) utf8 character set.

相關文章