5.7 與 8.0 對相同檔案的 LOAD DATA 語句結果不同
問題描述
某客戶現場支援,由MySQL 5.7.21升級MySQL 8.0.25後,透過LOAD DATA
匯入檔案,當同一會話連續匯入不同的編碼(UTF8/GB18030)檔案時會出現亂碼。資料庫版本未升級之前,相同的匯入操作在MySQL 5.7.21未出現亂碼。
問題分析
1)檢視簡化後的 LOAD DATA
語句
greatsql> LOAD DATA LOCAL INFILE '/home/greatdb/TEST_UTF8_bak.txt' IGNORE INTO TABLE ASSP_SIS_PAYRES_IMP_BAK CHARACTER SET UTF8MB4 IGNORE 0 LINES (@row) SET `D_NAME` = NULLIF(TRIM(CONVERT(UNHEX(SUBSTR(HEX(@row), 1,240)) USING UTF8MB4)),'');
Query OK, 2 rows affected (0.01 sec)
Records: 2 Deleted: 0 Skipped: 0 Warnings: 0
greatsql> LOAD DATA LOCAL INFILE '/home/greatdb/TEST_GB18030_bak.txt' IGNORE INTO TABLE ASSP_SIS_PAYRES_IMP_BAK CHARACTER SET GB18030 IGNORE 0 LINES (@row) SET `D_NAME` = NULLIF(TRIM(CONVERT(UNHEX(SUBSTR(HEX(@row), 1,240)) USING GB18030)),'');
Query OK, 2 rows affected (0.01 sec)
Records: 2 Deleted: 0 Skipped: 0 Warnings: 0
2)檢視錶資料
+----------+------------------------------------------------------+
| AUTO_INC | D_NAME |
+----------+------------------------------------------------------+
| 1 | xxx社會保險xxx |
| 2 | xxx市路橋區xxx |
| 4 | 鍙板窞甯傝礬妗ュ尯紺句細淇濋櫓浜嬩笟綆$悊涓績 |
| 5 | 鍙板窞甯傝礬妗ュ尯紺句細淇濋櫓浜嬩笟綆$悊涓績 |
+----------+------------------------------------------------------+
4 rows in set (0.00 sec)
3)檢查業務表的字符集與校驗集,發現字符集為 utf8mb4 、校驗集為 utf8mb4_bin
4)檢查資料庫的字符集與校驗集
greatsql> SHOW GLOBAL VARIABLES LIKE '%char%';
+--------------------------------------+--------------------------------+
| Variable_name | Value |
+--------------------------------------+--------------------------------+
| character_set_client | utf8mb4 |
| character_set_connection | utf8mb4 |
| character_set_database | utf8mb4 |
| character_set_filesystem | binary |
| character_set_results | utf8mb4 |
| character_set_server | utf8mb4 |
| character_set_system | utf8mb3 |
| character_sets_dir | /opt/mysql3301/share/charsets/ |
| validate_password_special_char_count | 1 |
+--------------------------------------+--------------------------------+
9 rows in set (0.01 sec)
greatsql> SHOW GLOBAL VARIABLES LIKE '%coll%';
+-------------------------------+--------------------+
| Variable_name | Value |
+-------------------------------+--------------------+
| collation_connection | utf8mb4_bin |
| collation_database | utf8mb4_bin |
| collation_server | utf8mb4_bin |
| default_collation_for_utf8mb4 | utf8mb4_general_ci |
+-------------------------------+--------------------+
4 rows in set (0.00 sec)
程式在MySQL 5.7.21跑了很長時間,一直沒有問題,把資料庫升級MySQL 8.0.25後,新匯入的資料出現部分亂碼, 由此懷疑,MySQL 8.0定長資料匯入LOAD DATA @row
出現BUG。
BUG場景:同一個會話 LOAD DATA
多種字符集檔案,使用@臨時變數切割欄位。將導致匯入資料亂碼,向MySQL官方提BUG,已證實為BUG(編號115824)
問題復現
MySQL: 8.0.25
greatsql> SELECT VERSION();
+-----------+
| version() |
+-----------+
| 8.0.25 |
+-----------+
1 row in set (0.00 sec)
table ddl:
CREATE TABLE `assp_sis_payres_imp_bak` (
`AUTO_INC` bigint unsigned NOT NULL AUTO_INCREMENT COMMENT '自增列',
`D_NAME` varchar(210) CHARACTER SET utf8mb4 COLLATE utf8mb4_bin DEFAULT NULL,
PRIMARY KEY (`AUTO_INC`)
) ENGINE=InnoDB AUTO_INCREMENT=7 DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_bin;
greatsql> SHOW GLOBAL VARIABLES LIKE '%char%';
+--------------------------------------+--------------------------------+
| Variable_name | Value |
+--------------------------------------+--------------------------------+
| character_set_client | utf8mb4 |
| character_set_connection | utf8mb4 |
| character_set_database | utf8mb4 |
| character_set_filesystem | binary |
| character_set_results | utf8mb4 |
| character_set_server | utf8mb4 |
| character_set_system | utf8mb3 |
| character_sets_dir | /opt/mysql3301/share/charsets/ |
| validate_password_special_char_count | 1 |
+--------------------------------------+--------------------------------+
9 rows in set (0.01 sec)
greatsql> SHOW GLOBAL VARIABLES LIKE '%coll%';
+-------------------------------+--------------------+
| Variable_name | Value |
+-------------------------------+--------------------+
| collation_connection | utf8mb4_bin |
| collation_database | utf8mb4_bin |
| collation_server | utf8mb4_bin |
| default_collation_for_utf8mb4 | utf8mb4_general_ci |
+-------------------------------+--------------------+
4 rows in set (0.00 sec)
greatsql> TRUNCATE TABLE assp_sis_payres_imp_bak;
Query OK, 0 rows affected (0.03 sec)
greatsql> SELECT charset(@row), @row;
+---------------+------------+
| charset(@row) | @row |
+---------------+------------+
| binary | NULL |
+---------------+------------+
1 row in set (0.00 sec)
greatsql> LOAD DATA LOCAL INFILE '/root/dba_zc/load/TEST_UTF8_bak.txt' IGNORE INTO TABLE ASSP_SIS_PAYRES_IMP_BAK CHARACTER SET UTF8MB4 IGNORE 0 LINES (@row) SET `D_NAME` = NULLIF(TRIM(CONVERT(UNHEX(SUBSTR(HEX(@row), 1,240)) USING UTF8MB4)),'');
Query OK, 2 rows affected (0.01 sec)
Records: 2 Deleted: 0 Skipped: 0 Warnings: 0
greatsql> SELECT charset(@row), @row;
+---------------+------------------------+
| charset(@row) | @row |
+---------------+------------------------+
| utf8mb4 | XXX路橋區社會保XXX |
+---------------+------------------------+
greatsql> LOAD DATA LOCAL INFILE '/root/dba_zc/load/TEST_GB18030_bak.txt' IGNORE INTO TABLE ASSP_SIS_PAYRES_IMP_BAK CHARACTER SET GB18030 IGNORE 0 LINES (@row) SET `D_NAME` = NULLIF(TRIM(CONVERT(UNHEX(SUBSTR(HEX(@row), 1,240)) USING GB18030)),'');
Query OK, 2 rows affected (0.01 sec)
Records: 2 Deleted: 0 Skipped: 0 Warnings: 0
greatsql> SELECT charset(@row), @row;
+---------------+-----------------------------------------+
| charset(@row) | @row |
+---------------+-----------------------------------------+
| gb18030 | XXX路橋區社會保XXX |
+---------------+-----------------------------------------+
greatsql> SELECT * FROM ASSP_SIS_PAYRES_IMP_BAK;
+----------+---------------------------------------------------------+
| AUTO_INC | D_NAME |
+----------+---------------------------------------------------------+
| 1 | XXX路橋區社會保XXX |
| 2 | XXX路橋區社會保XXX |
| 4 | 鍙板窞甯傝礬妗ュ尯紺句細淇濋櫓浜嬩笟綆$悊涓績 |
| 5 | 鍙板窞甯傝礬妗ュ尯紺句細淇濋櫓浜嬩笟綆$悊涓績 |
+----------+---------------------------------------------------------+
4 rows in set (0.00 sec)
MySQL 5.7.21
greatsql> SELECT VERSION();
+------------+
| version() |
+------------+
| 5.7.21-log |
+------------+
1 row in set (0.01 sec)
table ddl:
CREATE TABLE `assp_sis_payres_imp_bak` (
`AUTO_INC` bigint unsigned NOT NULL AUTO_INCREMENT COMMENT '自增列',
`D_NAME` varchar(210) CHARACTER SET utf8mb4 COLLATE utf8mb4_bin DEFAULT NULL,
PRIMARY KEY (`AUTO_INC`)
) ENGINE=InnoDB AUTO_INCREMENT=7 DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_bin;
greatsql> SHOW GLOBAL VARIABLES LIKE '%char%';
+--------------------------------------+--------------------------------+
| Variable_name | Value |
+--------------------------------------+--------------------------------+
| character_set_client | utf8mb4 |
| character_set_connection | utf8mb4 |
| character_set_database | utf8mb4 |
| character_set_filesystem | binary |
| character_set_results | utf8mb4 |
| character_set_server | utf8mb4 |
| character_set_system | utf8 |
| character_sets_dir | /opt/mysql3305/share/charsets/ |
| validate_password_special_char_count | 1 |
+--------------------------------------+--------------------------------+
9 rows in set (0.00 sec)
greatsql> SHOW GLOBAL VARIABLES LIKE '%coll%';
+----------------------+--------------------+
| Variable_name | Value |
+----------------------+--------------------+
| collation_connection | utf8mb4_general_ci |
| collation_database | utf8mb4_general_ci |
| collation_server | utf8mb4_general_ci |
+----------------------+--------------------+
3 rows in set (0.00 sec)
greatsql> SELECT charset(@row), @row;
+---------------+------------+
| charset(@row) | @row |
+---------------+------------+
| binary | NULL |
+---------------+------------+
1 row in set (0.00 sec)
greatsql> LOAD DATA LOCAL INFILE '/root/dba_zc/load/TEST_UTF8_bak.txt' IGNORE INTO TABLE ASSP_SIS_PAYRES_IMP_BAK CHARACTER SET UTF8MB4 IGNORE 0 LINES (@row) SET `D_NAME` = NULLIF(TRIM(CONVERT(UNHEX(SUBSTR(HEX(@row), 1,240)) USING UTF8MB4)),'');
Query OK, 2 rows affected (0.01 sec)
Records: 2 Deleted: 0 Skipped: 0 Warnings: 0
greatsql> SELECT charset(@row), @row;
+---------------+-----------------------+
| charset(@row) | @row |
+---------------+-----------------------+
| utf8mb4 | XXX路橋區社會保XXX |
+---------------+-----------------------+
greatsql> LOAD DATA LOCAL INFILE '/root/dba_zc/load/TEST_GB18030_bak.txt' IGNORE INTO TABLE ASSP_SIS_PAYRES_IMP_BAK CHARACTER SET GB18030 IGNORE 0 LINES (@row) SET `D_NAME` = NULLIF(TRIM(CONVERT(UNHEX(SUBSTR(HEX(@row), 1,240)) USING GB18030)),'');
Query OK, 2 rows affected (0.01 sec)
Records: 2 Deleted: 0 Skipped: 0 Warnings: 0
greatsql> SELECT charset(@row), @row;
+---------------+-----------------------+
| charset(@row) | @row |
+---------------+-----------------------+
| gb18030 | XXX路橋區社會保XXX |
+---------------+-----------------------+
greatsql> SELECT * FROM ASSP_SIS_PAYRES_IMP_BAK;
+---------------+-----------------------------+
| AUTO_INC | D_NAME |
+---------------+-----------------------------+
| 1 | XXX路橋區社會保XXX |
| 2 | XXX路橋區社會保XXX |
| 4 | XXX路橋區社會保XXX |
| 5 | XXX路橋區社會保XXX |
+---------------+-----------------------------+
4 rows in set (0.00 sec)
BUG規避方案
透過SELECT``charset(@row), @row;
可以看到@row
在執行LOAD DATA
後在5.7.21和8.0.25是一樣的,但最終的影響不一樣。雖然MySQL官方確認此問題為BUG,但沒有提供規避方案或者解決方案。透過萬里工程師研究後,發現一種可行的規避方案。每次執行LOAD DATA
命令前執行 [set @row=_binary'';
] 進行規避。
greatsql> SELECT VERSION();
+-----------+
| version() |
+-----------+
| 8.0.25 |
+-----------+
1 row in set (0.00 sec)
greatsql> SET @row=_binary'';
Query OK, 0 rows affected (0.00 sec)
greatsql> LOAD DATA LOCAL INFILE '/home/greatdb/TEST_UTF8_bak.txt' IGNORE INTO TABLE ASSP_SIS_PAYRES_IMP_BAK CHARACTER SET UTF8MB4 IGNORE 0 LINES (@row) SET `D_NAME` = NULLIF(TRIM(CONVERT(UNHEX(SUBSTR(HEX(@row), 1,240)) USING UTF8MB4)),'');
Query OK, 2 rows affected (0.01 sec)
Records: 2 Deleted: 0 Skipped: 0 Warnings: 0
greatsql> SET @row=_binary'';
Query OK, 0 rows affected (0.00 sec)
greatsql> LOAD DATA LOCAL INFILE '/home/greatdb/TEST_GB18030_bak.txt' IGNORE INTO TABLE ASSP_SIS_PAYRES_IMP_BAK CHARACTER SET GB18030 IGNORE 0 LINES (@row) SET `D_NAME` = NULLIF(TRIM(CONVERT(UNHEX(SUBSTR(HEX(@row), 1,240)) USING GB18030)),'');
Query OK, 2 rows affected (0.01 sec)
Records: 2 Deleted: 0 Skipped: 0 Warnings: 0
greatsql> SELECT * FROM assp_sis_payres_imp_bak;
+----------+--------------------------------------------------+
| AUTO_INC | D_NAME |
+----------+--------------------------------------------------+
| 1 | XXX路橋區社會保XXX |
| 2 | XXX路橋區社會保XXX |
| 4 | XXX路橋區社會保XXX |
| 5 | XXX路橋區社會保XXX |
+----------+--------------------------------------------------+
4 rows in set (0.00 sec)
問題總結
1.BUG原因
MySQL8.0重構定長資料匯入LOAD DATA @row
出現BUG.同一個資料庫會話,多次執行LOAD DATA @row
命令,則第n次執行LOAD DATA @row
的字符集使用的是n-1次的字符集,當檔案的字符集存在不同,例如先後處理GB18030、UTF8字符集的檔案就會資料亂碼。此問題MySQL官方已證實為BUG(編號115824)
2.BUG觸發條件
觸發條件:需同時滿足以下三個條件才會觸發此bug。
1)LOAD DATA
命令使用類似 @row臨時變數 進行資料處理,例如對定長記錄按位元組切割出多個欄位:
LINES (@row) SET COLUMN_NAME = NULLIF(TRIM(CONVERT(UNHEX(SUBSTR(HEX(@row),1,20)) USING GB18030))。
2)在同一個連線中,多次執行LOAD DATA
命令,且先後處理的檔案字符集存在不同(例如GB18030和UTF8)。
3)使用MySQL 8.0。
3.BUG規避辦法
由萬里工程師提出,與MySQL官方社群溝通證實,涉及到滿足上述BUG觸發條件的場景,透過在每次執行LOAD DATA
命令前執行 [set @row=_binary'';
] 進行規避。
參考:https://bugs.mysql.com/bug.php?id=115824
Enjoy GreatSQL 😃
關於 GreatSQL
GreatSQL是適用於金融級應用的國內自主開源資料庫,具備高效能、高可靠、高易用性、高安全等多個核心特性,可以作為MySQL或Percona Server的可選替換,用於線上生產環境,且完全免費併相容MySQL或Percona Server。
相關連結: GreatSQL社群 Gitee GitHub Bilibili
GreatSQL社群:
社群部落格有獎徵稿詳情:https://greatsql.cn/thread-100-1-1.html
技術交流群:
微信:掃碼新增
GreatSQL社群助手
微信好友,傳送驗證資訊加群
。