What You Need to Know About Python String Encoding

Published 2016-12-15

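During development you constantly run into string-encoding problems: errors such as SyntaxError: Non-ASCII character, or 'ascii' codec can't encode characters in position 0-2: ordinal not in range(128), or output that shows up as garbled text.
Because I did not understand how encodings work, whenever this happened I could only keep trying decode and encode with one encoding after another.
Today I am writing down the causes of, and fixes for, the various encoding problems in Python, so that next time I will not blunder around like a headless fly.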

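Everything below runs on Python 2.7. I have heard that 3.X no longer has these encoding problems because every string there is unicode; I will install 3.X and try it later.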

If you do not know what decode and encode do, it is best to read an introduction to them first.

I. What the encoding declaration does

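1. If a Python file contains Chinese, it must declare its encoding in a comment on the first or second line, for example #encoding=utf-8 to use utf-8. What does this declaration do, and what does it change?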
demo1.py

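# encoding=utf-8
# The file itself must be saved as utf-8 to match the declaration above
test='测试test'
print type(test)  # the type of a plain string literal in 2.7
print repr(test)  # the raw bytes stored in test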

Output:

<type 'str'>
'\xe6\xb5\x8b\xe8\xaf\x95test'

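When we print a variable to the terminal, the IDE or the system usually converts the output for us (Chinese bytes are rendered as Chinese characters, for example), so we never see the variable's raw content.
The repr function shows the variable in the form Python itself sees, that is, its raw content.
The output above shows that test is of type str and that its bytes are utf-8 encoded (Part III explains how to tell), which is exactly the encoding declared in the file.
Now change the encoding to gbk: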
demo2.py

# encoding=gbk
# The file itself must be saved as gbk to match the declaration above
test='测试test'
print type(test)
print repr(test)

Output:

<type 'str'>
'\xb2\xe2\xca\xd4test'

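Now test is gbk encoded.
So the encoding declaration determines how the string literals defined in that py file are encoded (provided the file really is saved in that encoding).
What about a variable that is imported from another py file, or read from a database, redis and so on? How is it encoded?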
a.py

# encoding=utf-8
test='测试test'

b.py

# encoding=gbk
from a import test
print repr(test)

Output:

'\xe6\xb5\x8b\xe8\xaf\x95test'

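test is defined in a.py, which is encoded as utf-8, while b.py is encoded as gbk. When b imports test from a, the result shows that test is still utf-8 encoded, that is, it keeps a.py's encoding.
So the encoding declaration only determines the encoding of its own py file; it does not affect variables that are imported or read from elsewhere.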
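The same effect can be reproduced without saving two differently encoded files by building the byte strings explicitly. The sketch below is written in Python 3 syntax (an assumption, since the article itself targets 2.7), where bytes plays the role of 2.7's str; the variable names are illustrative only.

```python
# Python 3 sketch: bytes here corresponds to Python 2.7's str.
text = '测试test'

# What a utf-8 source file would store for the literal
utf8_bytes = text.encode('utf-8')
# What a gbk source file would store for the same literal
gbk_bytes = text.encode('gbk')

print(repr(utf8_bytes))  # b'\xe6\xb5\x8b\xe8\xaf\x95test'
print(repr(gbk_bytes))   # b'\xb2\xe2\xca\xd4test'

# The bytes keep whatever encoding produced them; handing them to other
# code (as an import does) never re-encodes them.
passed_around = utf8_bytes
print(passed_around == gbk_bytes)  # False
```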

II. Why the common error `codec can't encode characters` occurs

Python programs often fail with codec can't encode characters or codec can't decode characters.

Define a string in Python:

import sys
print sys.getdefaultencoding() # prints ascii
unicode_test=u'测试test'
print repr(str(unicode_test)) # str() implicitly encodes with the default encoding

The code above raises:

UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-1: ordinal not in range(128)

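Besides str(), any operation that combines a unicode string with a str also fails when both sides contain Chinese; when only one side contains Chinese, there is no error.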

unicode_test = u'测试test%s{0}'

print '%stest' % unicode_test  # no error
print '%s测试' % unicode_test  # raises UnicodeDecodeError

print unicode_test % 'test'  # no error
print unicode_test % '测试'  # raises UnicodeDecodeError

print unicode_test.format('test')  # no error
print unicode_test.format('测试')  # raises UnicodeDecodeError

print unicode_test.split('test')  # no error
print unicode_test.split('测试')  # raises UnicodeDecodeError

print unicode_test + 'test'  # no error
print unicode_test + '测试'  # raises UnicodeDecodeError

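Why does this happen?
The reason is explained further down; first, here is the fix for this error:
set the system default encoding to utf-8. (reload(sys) is needed because setdefaultencoding is removed from sys at startup; calling decode and encode explicitly is the more robust alternative.)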

import sys
reload(sys)  # brings back setdefaultencoding, which is removed at startup
sys.setdefaultencoding('utf-8')
print sys.getdefaultencoding()
unicode_test=u'测试test'

demo3.py

# encoding=utf-8
import sys
reload(sys)
sys.setdefaultencoding('utf-8')
unicode_test=u'测试test'
utf8_test='测试test'
gbk_test=unicode_test.encode('gbk')

# merge unicode and utf-8
merge=unicode_test+utf8_test
print type(merge)
print repr(merge)

# merge unicode and gbk
merge=unicode_test+gbk_test
print type(merge)
print repr(merge)
print merge

# merge utf-8 and gbk
merge=utf8_test+gbk_test
print type(merge)
print repr(merge)
print merge

Here we define three strings, unicode_test, utf8_test and gbk_test, which hold unicode, utf-8 and gbk data respectively.
1. Merging unicode and utf-8 outputs:

<type 'unicode'>
u'\u6d4b\u8bd5test\u6d4b\u8bd5test'

The merged result is unicode.
2. Merging unicode and gbk raises:

'utf8' codec can't decode byte 0xb2 in position 0: invalid start byte

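From this we can infer:
when Python operates on two strings, one of which is unicode and the other not, it first decodes the non-unicode string to unicode and only then performs the operation.
For example, string concatenation could be written as the following function: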

import sys

def merge_str(str1, str2):
    # Decode whichever side is a byte str so that both sides are unicode
    if isinstance(str1, unicode) and not isinstance(str2, unicode):
        str2 = str2.decode(sys.getdefaultencoding())
    elif not isinstance(str1, unicode) and isinstance(str2, unicode):
        str1 = str1.decode(sys.getdefaultencoding())
    return str1 + str2

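PS: the initial value of sys.getdefaultencoding() is ascii.
Therefore,
the codec can't encode (decode) characters error comes from an implicit call to encode or decode, and the argument passed to that call is sys.getdefaultencoding(). Decoding a string that contains Chinese with the ascii codec fails, which is why changing the system default encoding avoids the error.
Calling str() makes Python execute unicode_test.encode(sys.getdefaultencoding()), so it fails for the same reason.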
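The inferred rule can be tried out directly. The sketch below emulates it in Python 3 syntax (an assumption, since Python 3 itself never coerces implicitly), with bytes standing in for 2.7's str; merge_like_py2 and its default_encoding parameter are hypothetical names that play the role of merge_str and sys.getdefaultencoding().

```python
# Python 3 sketch of 2.7's implicit coercion; bytes stands in for 2.7's str.
def merge_like_py2(left, right, default_encoding='ascii'):
    """Concatenate two values, decoding the bytes side first when the other side is text."""
    if isinstance(left, str) and isinstance(right, bytes):
        right = right.decode(default_encoding)
    elif isinstance(left, bytes) and isinstance(right, str):
        left = left.decode(default_encoding)
    return left + right


text = '测试test'
utf8_bytes = text.encode('utf-8')
gbk_bytes = text.encode('gbk')

# ASCII-only bytes decode fine under the default 'ascii'
print(merge_like_py2(text, b'test'))

# Non-ASCII bytes fail under 'ascii' ...
try:
    merge_like_py2(text, utf8_bytes)
except UnicodeDecodeError as exc:
    print('ascii default:', exc.reason)

# ... succeed once the default matches the bytes ...
print(merge_like_py2(text, utf8_bytes, default_encoding='utf-8'))

# ... and still fail when the bytes are in a different encoding.
try:
    merge_like_py2(text, gbk_bytes, default_encoding='utf-8')
except UnicodeDecodeError as exc:
    print('utf-8 default:', exc.reason)
```

The last case is the reason changing the default encoding only helps when every byte string in the program really uses that encoding.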

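3. Merging utf-8 and gbk, by contrast, raises no error: Python simply concatenates the two byte strings without any decode or encode step, but part of the result is garbled when it is printed.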
demo4.py

# encoding=gbk
import sys

reload(sys)
sys.setdefaultencoding('utf-8')
unicode_test = u'测试test'
utf8_test = unicode_test.encode('utf-8')
gbk_test = unicode_test.encode('gbk')

# Both operands are byte strings, so they are joined as-is
merge = utf8_test + gbk_test
print type(merge)
print repr(merge)
print merge

Here the file's encoding is gbk and sys.getdefaultencoding() is set to utf-8. The result is:

<type 'str'>
'\xe6\xb5\x8b\xe8\xaf\x95test\xb2\xe2\xca\xd4test'
测试test����test

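That is, the gbk part is garbled. print writes the bytes of a str out unchanged, and the terminal then decodes them with its own encoding (utf-8 in this run), so only the utf-8 part displays correctly.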
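The garbling can be reproduced without a terminal by decoding the merged bytes the way a utf-8 display would. This is a sketch in Python 3 syntax (an assumption; bytes again stands in for 2.7's str), and errors='replace' is used here only to imitate how a terminal shows undecodable bytes.

```python
# Python 3 sketch: bytes stands in for Python 2.7's str.
text = '测试test'
merged = text.encode('utf-8') + text.encode('gbk')

# Imitate a utf-8 terminal: bytes that are not valid utf-8 become U+FFFD
shown_on_utf8_terminal = merged.decode('utf-8', errors='replace')
print(shown_on_utf8_terminal)
```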

III. How to determine the encoding of a string

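1. There is no way to determine the encoding of a string with certainty. Suppose the bytes '\aa' stand for character A in gbk and for character B in utf-8: given only '\aa', how would you tell which encoding it is? It could be either.

2. What we can do is make a rough judgment, because the situation above is rare. Far more often '\aa' stands for character A in gbk but is garbage in utf-8 (for example �), so we can conclude that '\aa' is gbk, since decoding it as utf-8 gives nothing meaningful.

3. In practice there are only three forms we run into most of the time: utf-8, gbk and unicode (strictly speaking, unicode here means Python's unicode type rather than a byte encoding).

unicode usually starts with \u followed by four hexadecimal digits, for example \u6d4b\u8bd5; one \u corresponds to one Chinese character.
utf-8 usually starts with \x followed by two hexadecimal digits, for example \xe6\xb5\x8b\xe8\xaf\x95\xe5\x95\x8a; three \x groups make up one Chinese character.
gbk usually starts with \x followed by two hexadecimal digits, for example \xb2\xe2\xca\xd4\xb0\xa1; two \x groups make up one Chinese character.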

4. Use the chardet module

import chardet
raw = u'我是一只小小鸟'
print chardet.detect(raw.encode('utf-8'))
print chardet.detect(raw.encode('gbk'))

Output:

{'confidence': 0.99, 'encoding': 'utf-8'}
{'confidence': 0.99, 'encoding': 'GB2312'}

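The chardet module estimates the probability that a string is in a given encoding; in my experience it is good enough for the vast majority of use cases.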
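The rough judgment described in point 2 can also be sketched without chardet by simply trying candidate encodings in turn. The example below is in Python 3 syntax (an assumption), and guess_encoding and its candidates parameter are hypothetical names; it is a heuristic, not a substitute for chardet's statistical detection.

```python
def guess_encoding(data, candidates=('utf-8', 'gbk')):
    """Return the first candidate that decodes data without error, else None."""
    for name in candidates:
        try:
            data.decode(name)
        except UnicodeDecodeError:
            continue
        return name
    return None


print(guess_encoding('测试'.encode('utf-8')))  # utf-8
print(guess_encoding('测试'.encode('gbk')))    # gbk
```

utf-8 is tried first because it is the stricter of the two: bytes that are valid utf-8 may also happen to decode as gbk without error, so the order of the candidates matters.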

IV. string_escape and unicode_escape

1. string_escape

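In a str, \x is an escape prefix: the two hexadecimal digits after it denote one byte, for example '\xe6', and in utf-8 three such bytes usually make up one Chinese character.
So a='\xe6\x88\x91' defines the single Chinese character "我". Sometimes, however, we want a to hold not a Chinese character but 3*4=12 literal ASCII characters; encode('string_escape') does that conversion: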

'\xe6\x88\x91'.encode('string_escape') == '\\xe6\\x88\\x91'

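decode does the reverse.
The type is str both before and after the conversion.
One more observation: a='\x' and a='\x0' both raise ValueError: invalid \x escape, whereas a='\a' (a backslash not followed by x) is fine, and so is a='\x00' (x followed by two hexadecimal digits).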

2. unicode_escape

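Likewise, in a unicode string \u is an escape prefix: the four hexadecimal digits after it denote one Chinese character, for example b=u'\u6211' is "我". If we want b to hold six literal ASCII characters instead of one Chinese character, encode('unicode-escape') does the conversion: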

u'\u6211'.encode('unicode-escape') == '\u6211'

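Note that the value is unicode before encode and str afterwards.
\u is an escape prefix in a unicode literal but not in a str literal, which is why the str on the right-hand side needs only one backslash rather than two.
decode does the reverse.
Similarly, a=u'\u' is an error.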
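The unicode_escape codec still exists in Python 3, so the round trip can be checked there as well. The sketch below uses Python 3 syntax (an assumption; the article targets 2.7), where the encoded result is a bytes object and the backslash in a bytes literal has to be doubled.

```python
# Python 3 sketch of the unicode_escape round trip.
escaped = '我'.encode('unicode_escape')
print(escaped)       # b'\\u6211'
print(len(escaped))  # 6: backslash, u, 6, 2, 1, 1

restored = escaped.decode('unicode_escape')
print(restored == '我')  # True
```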

3. Examples

# ordinary str and unicode values
str_char='我'
uni_char=u'我'
print repr(str_char) # '\xe6\x88\x91'
print repr(uni_char) #  u'\u6211'

# decode('unicode-escape')
s1='\u6211'
s2=s1.decode('unicode-escape')
print repr(s1) # '\\u6211'
print repr(s2) # u'\u6211'

# encode('unicode-escape')
s1=u'\u6211'
s2=s1.encode('unicode-escape')
print repr(s1) # u'\u6211'
print repr(s2) # '\\u6211'

# decode("string_escape")
s1='\\xe6\\x88\\x91'
s2=s1.decode('string_escape')
print repr(s1) # '\\xe6\\x88\\x91'
print repr(s2) # '\xe6\x88\x91'

# encode("string_escape")
s1='\xe6\x88\x91'
s2=s1.encode('string_escape')
print repr(s1) # '\xe6\x88\x91'
print repr(s2) # '\\xe6\\x88\\x91'

4. Applications

When the content is a unicode escape sequence but the type is str, decode("unicode_escape") converts it so that both the content and the type are unicode:

s1='\u6211'
s2=s1.decode('unicode-escape')
When the content is utf-8 bytes but the type is unicode, encode("unicode_escape").decode("string_escape") converts it so that both the content and the type are str:

s1=u'\xe6\x88\x91'
s2=s1.encode('unicode_escape').decode("string_escape")
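Python 3 has no string_escape codec, so the second conversion needs a different route there. One common approach, sketched below in Python 3 syntax (an assumption; this is not the author's method, and the variable names are illustrative), is to turn each code point back into a byte with latin-1 and then decode the bytes as utf-8.

```python
# Python 3 sketch: text whose code points are really utf-8 byte values.
mangled = '\xe6\x88\x91'

# latin-1 maps code points 0-255 to the same byte values one-to-one
raw_bytes = mangled.encode('latin-1')
fixed = raw_bytes.decode('utf-8')

print(repr(raw_bytes))  # b'\xe6\x88\x91'
print(fixed == '我')     # True
```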
