PostgreSQLsql檔案編碼引起的資料匯入亂碼或查詢字符集異常報錯(invalidbytesequence)

pg小助手發表於2018-10-23

背景
當使用者客戶端字符集與服務端字符集不匹配時,寫入的多位元組字元(例如中文)可能出現亂碼。

例子
資料庫字符集為sql_ascii,允許儲存任意編碼字元。

digoal@pg11-320tb-zfs-> psql
psql (11beta4)
Type “help” for help.

postgres=# l+

                                                             List of databases  
Name Owner Encoding Collate Ctype Access privileges Size Tablespace Description
postgres postgres SQL_ASCII C en_US.utf8 140 TB pg_default default administrative connection database
template0 postgres SQL_ASCII C en_US.utf8 =c/postgres + 15 MB pg_default unmodifiable empty database
postgres=CTc/postgres
template1 postgres SQL_ASCII C en_US.utf8 =c/postgres + 15 MB pg_default default template for new databases

(3 rows)
客戶端為utf8編碼

digoal@pg11-320tb-zfs-> echo $LANG
en_US.utf8
編輯一個檔案,以UTF8編碼

vi test.sql

insert into tbl values (1, `你好`);
內容如下

digoal@pg11-320tb-zfs-> cat test.sql
insert into tbl values (1, `你好`);
編碼如下

digoal@pg11-320tb-zfs-> file test.sql
test.sql: UTF-8 Unicode text
轉換為GBK,寫入資料庫

digoal@pg11-320tb-zfs-> iconv –help
Usage: iconv [OPTION…] [FILE…]
Convert encoding of given files from one encoding to another.

Input/Output format specification:
-f, –from-code=NAME encoding of original text
-t, –to-code=NAME encoding for output

Information:
-l, –list list all known coded character sets

Output control:
-c omit invalid characters from output
-o, –output=FILE output file
-s, –silent suppress warnings

  --verbose              print progress information  

-?, –help Give this help list

  --usage                Give a short usage message  

-V, –version Print program version

Mandatory or optional arguments to long options are also mandatory or optional
for any corresponding short options.

For bug reporting instructions, please see:
http://www.gnu.org/software/libc/bugs.html.

digoal@pg11-320tb-zfs-> iconv -f UTF8 -t GBK test.sql|psql -f –
INSERT 0 1
或者這樣寫入(client_encoding和server都設定為sql_ascii時,不檢查編碼合法性,直接存入資料庫)

digoal@pg11-320tb-zfs-> iconv -f UTF8 -t GBK test.sql -o test.sql.gbk

digoal@pg11-320tb-zfs-> psql
psql (11beta4)
Type “help” for help.

postgres=# set client_encoding =sql_ascii;
SET
postgres=# i ./test.sql.gbk
INSERT 0 1
當設定客戶端client_encoding為utf8編碼時,由於存入的資料編碼不合法,導致查詢異常

digoal@pg11-320tb-zfs-> psql
psql (11beta4)
Type “help” for help.

postgres=# set client_encoding =utf8;
SET
postgres=# select * from tbl;
ERROR: invalid byte sequence for encoding “UTF8”: 0xc4 0xe3
當client_encoding設定為GBK編碼,查詢為亂碼

postgres=# set client_encoding =gbk;
SET
postgres=# select * from tbl;

id info
1 ?oí

參考
《PostgreSQL 多位元組字符集合法性檢測》

《[轉] SqlServe到PG遷移錯誤:無效的編碼序列”UTF8″: 0x00》

《PostgreSQL UTF8 和 GB18030編碼map檔案不完整的問題》

《PostgreSQL WHY ERROR: invalid byte sequence for encoding “UTF8″》

《PostgreSQL SQL_ASCII encoding introduce》

《PostgreSQL Server Encoding sql_ascii attention》
轉自阿里雲德哥


相關文章