MySQL 8.0 Reference Manual (Reading Notes, Section 35: Character Encoding (2))

Posted by 东山絮柳仔 on 2024-04-13

1.Character String Literal【ˈlɪtərəl: literal; exactly as written】 Character Set and Collation

Every character string literal has a character set and a collation.

For the simple statement SELECT 'string', the string has the connection default character set and collation defined by the character_set_connection and collation_connection system variables.

A character string literal may have an optional character set introducer and COLLATE clause, to designate【ˈdezɪɡneɪt: to specify; to name; to appoint】 it as a string that uses a particular character set and collation:

[_charset_name]'string' [COLLATE collation_name]

The _charset_name expression is formally called an introducer【ˌɪntrəˈdusər: one that introduces】. It tells the parser【ˈpɑrsər: syntax analyzer】, “the string that follows uses character set charset_name.” An introducer does not convert the string to the introducer character set, as CONVERT() would. It does not change the string value, although padding【ˈpædɪŋ: filler; stuffing】 may occur.

Examples:

SELECT 'abc';
SELECT _latin1'abc';
SELECT _binary'abc';
SELECT _utf8mb4'abc' COLLATE utf8mb4_danish_ci;

Character set introducers and the COLLATE clause are implemented【ˈɪmplɪmentɪd: carried out; put into effect】 according to standard SQL specifications.

MySQL determines the character set and collation of a character string literal in the following manner (note: the cases are all much alike):

• If both _charset_name and COLLATE collation_name are specified, character set charset_name and collation collation_name are used. collation_name must be a permitted collation for charset_name.

• If _charset_name is specified but COLLATE is not specified, character set charset_name and its default collation are used. To see the default collation for each character set, use the SHOW CHARACTER SET statement or query the INFORMATION_SCHEMA CHARACTER_SETS table.

• If _charset_name is not specified but COLLATE collation_name is specified, the connection default character set given by the character_set_connection system variable and collation collation_name are used. collation_name must be a permitted collation for the connection default character set.

• Otherwise (neither _charset_name nor COLLATE collation_name is specified), the connection default character set and collation given by the character_set_connection and collation_connection system variables are used.
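The four cases above can be sketched as a small decision function. This is only an illustrative Python model, not MySQL's implementation: the function names, the tiny default-collation table, and the "collation name starts with the character set name" check are assumptions made for the sketch.

```python
# Hypothetical subset of default collations (values as in MySQL 8.0).
DEFAULT_COLLATION = {
    "latin1": "latin1_swedish_ci",
    "utf8mb4": "utf8mb4_0900_ai_ci",
    "binary": "binary",
}


def collation_permitted(charset, collation):
    # Simplifying assumption: a collation belongs to the character set
    # whose name it starts with (for example, utf8mb4_danish_ci -> utf8mb4).
    return collation == charset or collation.startswith(charset + "_")


def resolve_literal(introducer, collate, session):
    """Return (charset, collation) for a string literal.

    introducer: character set name without the leading underscore, or None.
    collate:    collation name from a COLLATE clause, or None.
    session:    dict holding character_set_connection and collation_connection.
    """
    if introducer is not None:
        charset = introducer
        # No COLLATE clause: use the default collation of the introducer charset.
        collation = collate if collate is not None else DEFAULT_COLLATION[charset]
    else:
        charset = session["character_set_connection"]
        # No COLLATE clause: use the connection collation.
        collation = collate if collate is not None else session["collation_connection"]
    if not collation_permitted(charset, collation):
        raise ValueError(
            f"collation '{collation}' is not permitted for character set '{charset}'"
        )
    return charset, collation


session = {
    "character_set_connection": "utf8mb4",
    "collation_connection": "utf8mb4_0900_ai_ci",
}
print(resolve_literal(None, None, session))      # SELECT 'abc';
print(resolve_literal("latin1", None, session))  # SELECT _latin1'abc';
print(resolve_literal("utf8mb4", "utf8mb4_danish_ci", session))
```

The last call corresponds to SELECT _utf8mb4'abc' COLLATE utf8mb4_danish_ci; pairing a collation with a character set it does not belong to raises an error in the model, mirroring the "must be a permitted collation" rule.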

An introducer indicates【ˈɪndɪkeɪts: shows; signals】 the character set for the following string, but does not change how the parser performs escape【ɪˈskeɪp: to get away; to avoid】 processing within the string. Escapes are always interpreted【ɪnˈtɜːrprətɪd: explained; understood as】 by the parser according to the character set given by character_set_connection.

The following examples show that escape processing occurs using character_set_connection even in the presence of an introducer. The examples use SET NAMES (which changes character_set_connection), and display the resulting strings using the HEX() function so that the exact string contents can be seen.

Example 1:

mysql> SET NAMES latin1;
mysql> SELECT HEX('à\n'), HEX(_sjis'à\n');
+------------+-----------------+
| HEX('à\n') | HEX(_sjis'à\n') |
+------------+-----------------+
| E00A       | E00A            |
+------------+-----------------+

Here, à (hexadecimal value E0) is followed by \n, the escape sequence for newline. The escape sequence is interpreted using the character_set_connection value of latin1 to produce a literal newline (hexadecimal value 0A). This happens even for the second string. That is, the _sjis introducer does not affect the parser's escape processing.

Example 2:

mysql> SET NAMES sjis;
mysql> SELECT HEX('à\n'), HEX(_latin1'à\n');
+------------+-------------------+
| HEX('à\n') | HEX(_latin1'à\n') |
+------------+-------------------+
| E05C6E     | E05C6E            |
+------------+-------------------+

Here, character_set_connection is sjis, a character set in which the sequence of à followed by \ (hexadecimal values E0 and 5C) is a valid multibyte character. Hence, the first two bytes of the string are interpreted as a single sjis character, and the \ is not interpreted as an escape character. The following n (hexadecimal value 6E) is not interpreted as part of an escape sequence. This is true even for the second string; the _latin1 introducer does not affect escape processing.
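The two examples can be reproduced with a toy byte-level scanner. This is a simplified Python sketch, not the server's parser: process_escapes, the escape table, and the sjis lead/trail byte ranges used here are assumptions chosen to illustrate why the same three bytes (E0 5C 6E) give different results under latin1 and sjis.

```python
# Small subset of backslash escapes (the byte after the backslash -> result byte).
ESCAPES = {
    ord("n"): 0x0A,
    ord("t"): 0x09,
    ord("r"): 0x0D,
    ord("0"): 0x00,
    ord("\\"): 0x5C,
    ord("'"): 0x27,
}


def is_sjis_lead(b):
    # Assumed lead-byte ranges for a two-byte sjis character.
    return 0x81 <= b <= 0x9F or 0xE0 <= b <= 0xFC


def is_sjis_trail(b):
    # Assumed trail-byte ranges; note that 0x5C (backslash) is included.
    return 0x40 <= b <= 0x7E or 0x80 <= b <= 0xFC


def process_escapes(raw, connection_charset):
    """Scan raw literal bytes the way a charset-aware parser might."""
    out = bytearray()
    i = 0
    while i < len(raw):
        b = raw[i]
        if (connection_charset == "sjis" and is_sjis_lead(b)
                and i + 1 < len(raw) and is_sjis_trail(raw[i + 1])):
            # Two bytes form one multibyte character; copy both untouched.
            out += raw[i:i + 2]
            i += 2
        elif b == 0x5C and i + 1 < len(raw):
            # Backslash starts an escape sequence; unknown escapes keep the byte.
            nxt = raw[i + 1]
            out.append(ESCAPES.get(nxt, nxt))
            i += 2
        else:
            out.append(b)
            i += 1
    return bytes(out)


literal = b"\xE0\\n"  # the bytes E0 5C 6E
print(process_escapes(literal, "latin1").hex().upper())  # E00A
print(process_escapes(literal, "sjis").hex().upper())    # E05C6E
```

The function takes only the connection character set as input, which matches the point of the examples: an introducer on the literal plays no part in how escapes are scanned.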

2.The National【ˈnæʃnəl: of a nation】 Character Set

Standard SQL defines NCHAR or NATIONAL CHAR as a way to indicate that a CHAR column should use some predefined character set. MySQL uses utf8 as this predefined character set. For example, these data type declarations are equivalent【ɪˈkwɪvələnt: equal in value, amount, or meaning】:

CHAR(10) CHARACTER SET utf8
NATIONAL CHARACTER(10)
NCHAR(10)

As are these:

VARCHAR(10) CHARACTER SET utf8
NATIONAL VARCHAR(10)
NVARCHAR(10)
NCHAR VARCHAR(10)
NATIONAL CHARACTER VARYING(10)
NATIONAL CHAR VARYING(10)

You can use N'literal' (or n'literal') to create a string in the national character set. These statements are equivalent:

SELECT N'some text';
SELECT n'some text';
SELECT _utf8'some text';

MySQL 8.0 interprets the national character set as utf8mb3, which is now deprecated. Thus, using NATIONAL CHARACTER or one of its synonyms to define the character set for a database, table, or column raises a warning similar to this one:

NATIONAL/NCHAR/NVARCHAR implies the character set UTF8MB3, which will be
replaced by UTF8MB4 in a future release. Please consider using CHAR(x) CHARACTER
SET UTF8MB4 in order to be unambiguous. 

3.Connection Character Sets and Collations

A “connection” is what a client program makes when it connects to the server, to begin a session within which it interacts【ɪntərˈækts 相互作用;交流;合作;相互影響;溝通;】 with the server. The client sends SQL statements, such as queries, over the session connection. The server sends responses, such as result sets or error messages, over the connection back to the client.

3.1 Connection Character Set and Collation System Variables

Several【ˈsevrəl: more than two but not many; various】 character set and collation system variables relate to a client's interaction with the server. Some of these have been mentioned in earlier sections (note: several system variables affect the interaction between client and server):

• The character_set_server and collation_server system variables indicate the server character set and collation.

• The character_set_database and collation_database system variables indicate the character set and collation of the default database.

Additional【əˈdɪʃənl: extra; supplementary】 character set and collation system variables are involved【ɪnˈvɑːlvd: taking part in; connected with】 in handling traffic for the connection between a client and the server. Every client has session-specific connection-related character set and collation system variables. These session system variable values are initialized at connect time, but can be changed within the session.

Several questions about character set and collation handling for client connections can be answered in terms of system variables:

• What character set are statements in when they leave the client?

The server takes the character_set_client system variable to be the character set in which statements are sent by the client.

• What character set should the server translate【trænzˈleɪt: to render in another language; to convert】 statements to after receiving them?

To determine this, the server uses the character_set_connection and collation_connection system variables:

• The server converts statements sent by the client from character_set_client to character_set_connection. Exception: For string literals that have an introducer such as _utf8mb4 or _latin2, the introducer determines the character set.

collation_connection is important for comparisons【kəmˈpɛrəsənz】 of literal strings. For comparisons of strings with column values, collation_connection does not matter because columns have their own collation, which has a higher collation precedence【ˈpresɪdəns: priority】.

• What character set should the server translate query results to before shipping them back to the client?

The character_set_results system variable indicates the character set in which the server returns query results to the client. This includes result data such as column values, result metadata such as column names, and error messages.

To tell the server to perform no conversion of result sets or error messages, set character_set_results to NULL or binary:

SET character_set_results = NULL;
SET character_set_results = binary;
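The three questions above describe a conversion pipeline, which can be modeled roughly in Python. This is a sketch under stated assumptions: the function names are hypothetical, the mapping from MySQL character set names to Python codecs is approximate (latin1 is treated as cp1252), and the introducer exception is left out.

```python
# Approximate mapping from MySQL character set names to Python codecs.
CODEC = {"utf8mb4": "utf-8", "latin1": "cp1252", "cp1251": "cp1251"}


def receive_statement(raw, session):
    """Convert statement bytes from character_set_client to character_set_connection."""
    text = raw.decode(CODEC[session["character_set_client"]])
    return text.encode(CODEC[session["character_set_connection"]])


def send_result(stored, column_charset, session):
    """Convert a stored column value to character_set_results before sending it."""
    target = session["character_set_results"]
    if target is None or target == "binary":
        # NULL or binary: no conversion, the stored bytes go out unchanged.
        return stored
    return stored.decode(CODEC[column_charset]).encode(CODEC[target])


session = {
    "character_set_client": "cp1251",
    "character_set_connection": "utf8mb4",
    "character_set_results": "utf8mb4",
}

# A Cyrillic statement fragment arrives in cp1251 and is converted to utf8mb4.
print(receive_statement("привет".encode("cp1251"), session))

# A latin1 column holding "é" (byte E9) is returned as UTF-8 bytes C3 A9.
print(send_result(b"\xe9", "latin1", session).hex().upper())  # C3A9

# With character_set_results = NULL the byte is returned as stored.
session["character_set_results"] = None
print(send_result(b"\xe9", "latin1", session).hex().upper())  # E9
```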

To see the values of the character set and collation system variables that apply to the current session, use this statement:

SELECT * FROM performance_schema.session_variables
WHERE VARIABLE_NAME IN (
 'character_set_client', 'character_set_connection',
 'character_set_results', 'collation_connection'
) ORDER BY VARIABLE_NAME;

The following simpler statements also display the connection variables, but include other related variables as well. They can be useful to see all character set and collation system variables:

SHOW SESSION VARIABLES LIKE 'character\_set\_%';
SHOW SESSION VARIABLES LIKE 'collation\_%';

Clients can fine-tune the settings for these variables, or depend on the defaults (in which case, you can skip the rest of this section). If you do not use the defaults, you must change the character settings for each connection to the server.

3.2 Impermissible【ˌɪmpɜːrˈmɪsəbl: not allowed】 Client Character Sets

The character_set_client system variable cannot be set to certain character sets:

ucs2
utf16
utf16le
utf32

Attempting to do so produces an error:

mysql> SET character_set_client = 'ucs2';
ERROR 1231 (42000): Variable 'character_set_client'
can't be set to the value of 'ucs2'

The same error occurs if any of those character sets are used in the following contexts, all of which result in an attempt to set character_set_client to the named character set:

• The --default-character-set=charset_name command option used by MySQL client programs such as mysql and mysqladmin.

• The SET NAMES 'charset_name' statement.

• The SET CHARACTER SET 'charset_name' statement.
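A client-side pre-check for this restriction might look like the following Python sketch. The function name check_client_charset is hypothetical, and the error text only imitates the server message shown above; the set of names is the list given in this section.

```python
# Character sets that cannot be used for character_set_client (per the list above).
IMPERMISSIBLE_CLIENT_CHARSETS = {"ucs2", "utf16", "utf16le", "utf32"}


def check_client_charset(name):
    """Return the name unchanged, or raise if it cannot be a client character set."""
    if name.lower() in IMPERMISSIBLE_CLIENT_CHARSETS:
        raise ValueError(
            f"Variable 'character_set_client' can't be set to the value of '{name}'"
        )
    return name


for candidate in ("utf8mb4", "ucs2"):
    try:
        print("ok:", check_client_charset(candidate))
    except ValueError as exc:
        print("rejected:", exc)
```

Running such a check before passing a name to --default-character-set, SET NAMES, or SET CHARACTER SET avoids a round trip that is certain to fail.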

3.3 Client Program Connection Character Set Configuration

When a client connects to the server, it indicates which character set it wants to use for communication with the server. (Actually, the client indicates the default collation for that character set, from which the server can determine the character set.) The server uses this information to set the character_set_client, character_set_results, and character_set_connection system variables to the character set, and collation_connection to the character set default collation. In effect, the server performs the equivalent【ɪˈkwɪvələnt: equal in value, amount, or meaning】 of a SET NAMES operation.

If the server does not support the requested character set or collation, it falls back to using the server character set and collation to configure the connection.

The mysql, mysqladmin, mysqlcheck, mysqlimport, and mysqlshow client programs determine the default character set to use as follows:

• In the absence【ˈæbsəns: lack; nonexistence】 of other information, each client uses the compiled-in default character set, usually utf8mb4.

• Each client can autodetect【to detect automatically】 which character set to use based on the operating system setting, such as the value of the LANG or LC_ALL locale environment variable on Unix systems or the code page setting on Windows systems. For systems on which the locale【loʊˈkæl: place; setting】 is available from the OS, the client uses it to set the default character set rather than using the compiled-in default. For example, setting LANG to ru_RU.KOI8-R causes the koi8r character set to be used. Thus, users can configure the locale in their environment for use by MySQL clients.

The OS character set is mapped to the closest MySQL character set if there is no exact match. If the client does not support the matching character set, it uses the compiled-in default. For example, utf8 and utf-8 map to utf8mb4, and ucs2 is not supported as a connection character set, so it maps to the compiled-in default.

C applications can use character set autodetection based on the OS setting by invoking mysql_options() as follows before connecting to the server:

mysql_options(mysql,
 MYSQL_SET_CHARSET_NAME,
 MYSQL_AUTODETECT_CHARSET_NAME);

• Each client supports a --default-character-set option, which enables users to specify the character set explicitly【ɪkˈsplɪsətli: clearly; expressly】 to override whatever default the client otherwise determines.
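The order of precedence described in this list (explicit option, then OS locale, then compiled-in default) can be sketched in Python. Everything here is illustrative: the function names, the three-entry OS-to-MySQL mapping, and the choice to fall back to the compiled-in default for any unmapped codeset are assumptions, whereas the real clients map to the closest MySQL character set.

```python
COMPILED_DEFAULT = "utf8mb4"

# Hypothetical, tiny mapping from OS codeset names to MySQL character sets.
OS_TO_MYSQL = {"koi8-r": "koi8r", "utf8": "utf8mb4", "utf-8": "utf8mb4", "ucs2": "ucs2"}

# Character sets that cannot serve as a connection character set.
UNSUPPORTED_AS_CONNECTION = {"ucs2", "utf16", "utf16le", "utf32"}


def os_codeset_from_lang(lang):
    """Extract the codeset from a value such as 'ru_RU.KOI8-R'."""
    if not lang or "." not in lang:
        return None
    codeset = lang.split(".", 1)[1].split("@", 1)[0]
    return codeset.lower()


def pick_client_charset(cli_option=None, lang=None):
    # 1. An explicit --default-character-set value overrides everything else.
    if cli_option:
        return cli_option
    # 2. Otherwise try the OS locale.
    codeset = os_codeset_from_lang(lang)
    if codeset is None:
        return COMPILED_DEFAULT
    mysql_name = OS_TO_MYSQL.get(codeset)
    # 3. Unknown or unsupported character sets fall back to the compiled-in default.
    if mysql_name is None or mysql_name in UNSUPPORTED_AS_CONNECTION:
        return COMPILED_DEFAULT
    return mysql_name


print(pick_client_charset(lang="ru_RU.KOI8-R"))                   # koi8r
print(pick_client_charset(lang="en_US.UTF-8"))                    # utf8mb4
print(pick_client_charset(cli_option="latin1", lang="en_US.UTF-8"))  # latin1
```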

3.4 SQL Statements for Connection Character Set Configuration

After a connection has been established, clients can change the character set and collation system variables for the current session. These variables can be changed individually【ˌɪndɪˈvɪdʒuəli: separately; one by one】 using SET statements, but two more convenient statements affect the connection-related character set system variables as a group:

• SET NAMES 'charset_name' [COLLATE 'collation_name']

SET NAMES indicates what character set the client uses to send SQL statements to the server. Thus, SET NAMES 'cp1251' tells the server, “future incoming messages from this client are in character set cp1251.” It also specifies the character set that the server should use for sending results back to the client. (For example, it indicates what character set to use for column values if you use a SELECT statement that produces a result set.)

A SET NAMES 'charset_name' statement is equivalent to these three statements (the effect is the same):

SET character_set_client = charset_name;
SET character_set_results = charset_name;
SET character_set_connection = charset_name;

Setting character_set_connection to charset_name also implicitly【ɪmˈplɪsətli: in an implied way; without being stated】 sets collation_connection to the default collation for charset_name. It is unnecessary to set that collation explicitly. To specify a particular collation to use for collation_connection, add a COLLATE clause:

SET NAMES 'charset_name' COLLATE 'collation_name'

• SET CHARACTER SET 'charset_name'

SET CHARACTER SET is similar to SET NAMES but sets character_set_connection and collation_connection to character_set_database and collation_database (which, as mentioned previously, indicate the character set and collation of the default database).

A SET CHARACTER SET charset_name statement is equivalent to these three statements:

SET character_set_client = charset_name;
SET character_set_results = charset_name;
SET collation_connection = @@collation_database;

Setting collation_connection also implicitly sets character_set_connection to the character set associated with the collation (equivalent to executing SET character_set_connection = @@character_set_database). It is unnecessary to set character_set_connection explicitly.
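The difference between the two statements is easiest to see side by side. The following Python sketch models a session as a dictionary; the function names and the small default-collation table are assumptions for illustration, not server code.

```python
# Hypothetical subset of default collations (values as in MySQL 8.0).
DEFAULT_COLLATION = {
    "latin1": "latin1_swedish_ci",
    "utf8mb4": "utf8mb4_0900_ai_ci",
    "cp1251": "cp1251_general_ci",
}


def set_names(session, charset, collation=None):
    """Model of SET NAMES 'charset' [COLLATE 'collation']."""
    session["character_set_client"] = charset
    session["character_set_results"] = charset
    session["character_set_connection"] = charset
    # Without a COLLATE clause, the character set default collation is used.
    session["collation_connection"] = collation or DEFAULT_COLLATION[charset]


def set_character_set(session, charset):
    """Model of SET CHARACTER SET 'charset'."""
    session["character_set_client"] = charset
    session["character_set_results"] = charset
    # The connection settings follow the default database, not the argument.
    session["character_set_connection"] = session["character_set_database"]
    session["collation_connection"] = session["collation_database"]


def new_session():
    return {
        "character_set_database": "latin1",
        "collation_database": "latin1_swedish_ci",
    }


a = new_session()
set_names(a, "cp1251")
print(a["character_set_connection"], a["collation_connection"])
# cp1251 cp1251_general_ci

b = new_session()
set_character_set(b, "cp1251")
print(b["character_set_connection"], b["collation_connection"])
# latin1 latin1_swedish_ci
```

In both cases character_set_client and character_set_results become cp1251; only the connection character set and collation differ.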

3.5 Connection Character Set Error Handling

Attempts to use an inappropriate【ˌɪnəˈproʊpriət: unsuitable; improper】 connection character set or collation can produce an error, or cause the server to fall back to its default character set and collation for a given connection. This section describes problems that can occur when configuring the connection character set. These problems can occur when establishing a connection or when changing the character set within an established connection.

3.5.1 Connect-Time Error Handling

Some character sets cannot be used as the client character set. If you specify a character set that is valid but not permitted as a client character set, the server returns an error:

$> mysql --default-character-set=ucs2
ERROR 1231 (42000): Variable 'character_set_client' can't be set to
the value of 'ucs2'

If you specify a character set that the client does not recognize, it produces an error:

$> mysql --default-character-set=bogus
mysql: Character set 'bogus' is not a compiled character set and is
not specified in the '/usr/local/mysql/share/charsets/Index.xml' file
ERROR 2019 (HY000): Can't initialize character set bogus
(path: /usr/local/mysql/share/charsets/)

If you specify a character set that the client recognizes but the server does not, the server falls back to its default character set and collation. Suppose that the server is configured to use latin1 and latin1_swedish_ci as its defaults, and that it does not recognize gb18030 as a valid character set. A client that specifies --default-character-set=gb18030 is able to connect to the server, but the resulting character set is not what the client wants:

mysql> SHOW SESSION VARIABLES LIKE 'character\_set\_%';
+--------------------------+--------+
| Variable_name            | Value  |
+--------------------------+--------+
| character_set_client     | latin1 |
| character_set_connection | latin1 |
...
| character_set_results    | latin1 |
...
+--------------------------+--------+
mysql> SHOW SESSION VARIABLES LIKE 'collation_connection';
+----------------------+-------------------+
| Variable_name        | Value             |
+----------------------+-------------------+
| collation_connection | latin1_swedish_ci |
+----------------------+-------------------+

You can see that the connection system variables have been set to reflect a character set and collation of latin1 and latin1_swedish_ci. This occurs because the server cannot satisfy the client character set request and falls back to its defaults.

In this case, the client cannot use the character set that it wants because the server does not support it. The client must either be willing to use a different character set, or connect to a different server that supports the desired character set.

The same problem occurs in a more subtle【ˈsʌtl: not obvious; hard to notice】 context: When the client tells the server to use a character set that the server recognizes, but the default collation for that character set on the client side is not known on the server side. This occurs, for example, when a MySQL 8.0 client wants to connect to a MySQL 5.7 server using utf8mb4 as the client character set. A client that specifies --default-character-set=utf8mb4 is able to connect to the server. However, as in the previous example, the server falls back to its default character set and collation, not what the client requested:

mysql> SHOW SESSION VARIABLES LIKE 'character\_set\_%';
+--------------------------+--------+
| Variable_name            | Value  |
+--------------------------+--------+
| character_set_client     | latin1 |
| character_set_connection | latin1 |
...
| character_set_results    | latin1 |
...
+--------------------------+--------+
mysql> SHOW SESSION VARIABLES LIKE 'collation_connection';
+----------------------+-------------------+
| Variable_name        | Value             |
+----------------------+-------------------+
| collation_connection | latin1_swedish_ci |
+----------------------+-------------------+

Why does this occur? After all, utf8mb4 is known to the 8.0 client and the 5.7 server, so both of them recognize it. To understand this behavior【bɪˈheɪvjər: the way something acts】, it is necessary to understand that when the client tells the server which character set it wants to use, it really tells the server the default collation for that character set. Therefore, the aforementioned【əˈfɔːrmenʃənd: mentioned earlier】 behavior occurs due to a combination of factors:

• The default collation for utf8mb4 differs between MySQL 5.7 and 8.0 (utf8mb4_general_ci for 5.7, utf8mb4_0900_ai_ci for 8.0).

• When the 8.0 client requests a character set of utf8mb4, what it sends to the server is the default 8.0 utf8mb4 collation; that is, utf8mb4_0900_ai_ci.

• utf8mb4_0900_ai_ci is implemented【ˈɪmplɪmentɪd: carried out; put into effect】 only as of MySQL 8.0, so the 5.7 server does not recognize it.

• Because the 5.7 server does not recognize utf8mb4_0900_ai_ci, it cannot satisfy the client character set request, and falls back to its default character set and collation (latin1 and latin1_swedish_ci).

In this case, the client can still use utf8mb4 by issuing a SET NAMES 'utf8mb4' statement after connecting. The resulting collation is the 5.7 default utf8mb4 collation; that is, utf8mb4_general_ci. If the client additionally wants a collation of utf8mb4_0900_ai_ci, it cannot achieve that because the server does not recognize that collation. The client must either be willing to use a different utf8mb4 collation, or connect to a server from MySQL 8.0 or higher.
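The 8.0-client-to-5.7-server scenario can be replayed with a small Python model. It is a sketch only: the server is reduced to a dictionary, the collation is passed by name for readability, and the function names are hypothetical. The fallback at connect time and the error at runtime follow the behavior described in this section.

```python
# A toy MySQL 5.7 server that knows two collations.
SERVER_57 = {
    "default_charset": "latin1",
    "default_collation": "latin1_swedish_ci",
    # collation name -> character set
    "collations": {
        "latin1_swedish_ci": "latin1",
        "utf8mb4_general_ci": "utf8mb4",
    },
    # character set -> default collation on this server
    "charset_defaults": {
        "latin1": "latin1_swedish_ci",
        "utf8mb4": "utf8mb4_general_ci",
    },
}


def connect(server, requested_collation):
    """Connect time: an unknown collation makes the server fall back silently."""
    charset = server["collations"].get(requested_collation)
    if charset is None:
        charset = server["default_charset"]
        collation = server["default_collation"]
    else:
        collation = requested_collation
    return {
        "character_set_client": charset,
        "character_set_connection": charset,
        "character_set_results": charset,
        "collation_connection": collation,
    }


def set_names(server, session, charset):
    """Runtime: an unknown character set is an error, not a fallback."""
    default = server["charset_defaults"].get(charset)
    if default is None:
        raise ValueError(f"Unknown character set: '{charset}'")
    session["character_set_client"] = charset
    session["character_set_connection"] = charset
    session["character_set_results"] = charset
    session["collation_connection"] = default


# The 8.0 client asks for utf8mb4 by sending its default collation.
session = connect(SERVER_57, "utf8mb4_0900_ai_ci")
print(session["character_set_client"], session["collation_connection"])
# latin1 latin1_swedish_ci

# Issuing SET NAMES 'utf8mb4' afterwards yields the 5.7 default collation.
set_names(SERVER_57, session, "utf8mb4")
print(session["character_set_client"], session["collation_connection"])
# utf8mb4 utf8mb4_general_ci
```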

3.5.2 Runtime Error Handling

Within an established connection, the client can request a change of connection character set and collation with SET NAMES or SET CHARACTER SET.

Some character sets cannot be used as the client character set. If you specify a character set that is valid but not permitted as a client character set, the server returns an error.

If the server does not recognize the character set (or the collation), it produces an error.

Additional note:

A client that wants to verify whether its requested character set was honored by the server can execute the following statement after connecting and check that the result is the expected character set:

SELECT @@character_set_client;
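In application code, this check could be wrapped in a small helper. The Python sketch below deliberately avoids any specific driver API: run_scalar is a hypothetical callable that executes one SQL statement and returns a single value, and a stub stands in for a real connection here.

```python
def charset_was_honored(run_scalar, expected):
    """Return True if the session's client character set matches the request.

    run_scalar: hypothetical callable that runs one SQL statement and
                returns the single value of its one-row, one-column result.
    expected:   the character set the client asked for, such as 'utf8mb4'.
    """
    actual = run_scalar("SELECT @@character_set_client")
    return actual == expected


# Stub standing in for a real connection whose server fell back to latin1.
def fake_run_scalar(sql):
    return "latin1"


if not charset_was_honored(fake_run_scalar, "utf8mb4"):
    print("server fell back; consider issuing SET NAMES 'utf8mb4'")
```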
