Understanding Python's UnicodeEncodeError and UnicodeDecodeError: string and unicode in Python

Posted by wecatch on 2019-02-27

Original URL: https://flycode.co/archives/289091

All Python in this article is version 2.x.

Most people learning Python have run into something like the following:

UnicodeDecodeError

Traceback (most recent call last):
  File "<input>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe4 in position 0: ordinal not in range(128)

UnicodeEncodeError

Traceback (most recent call last):
  File "<input>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-1: ordinal not in range(128)

These two errors are very common in Python and easy to trip over. If you have written C, C++, or Java, they will feel especially infuriating by comparison. They really are, and they used to infuriate me too.

What do these two errors actually mean? A good place to start is with two of Python's basic data types, string and unicode.

string

A string is a sequence of text made up of one or more characters, where a character is the smallest unit of text. In Python a string can be written in any of the following ways:

>>> s1 = 'abc'
>>> s2 = "abc"
>>> s3 = """
...   abc
...   """
>>> s4 = '中文'
>>> for i in [s1, s2, s3, s4]:
...     print type(i)
...
<type 'str'>
<type 'str'>
<type 'str'>
<type 'str'>

In the Python shell, these variables are displayed as:

s1 --> 'abc'
s2 --> 'abc'
s3 --> '\n  abc\n  '
s4 --> '\xe4\xb8\xad\xe6\x96\x87'

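The output for s4 is clearly different from the others: on the surface it is a sequence of hexadecimal values. Yet s4, like every other string, is stored inside Python in the same way: as a byte stream, that is, a sequence of bytes.

A byte is the smallest addressable unit of storage (on most computers) and consists of 8 bits, i.e. 8 binary digits. More generally, Python does not only store the text of string variables as bytes; everything in a Python file is represented as bytes inside the computer, because that is how computers store text data.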
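To make the difference between characters and bytes concrete, here is a minimal sketch. It assumes the bytes shown for s4 are the UTF-8 encoding of the two characters; the names text and raw are illustrative only, and the escapes and single-argument print() calls are used so that the snippet also runs unchanged on Python 3.

```python
# The two characters of s4, written with escapes so the source stays ASCII.
text = u'\u4e2d\u6587'

# The same text as a UTF-8 byte sequence (what a Python 2 str holds).
raw = text.encode('utf-8')

print(len(text))   # number of characters
print(len(raw))    # number of bytes

# bytearray yields integers on both Python 2 and Python 3.
print([hex(b) for b in bytearray(raw)])
```

Two characters become six bytes here, which is why the shell shows s4 as six \x escapes.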

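How are strings represented as bytes?

The answer is encoding. A computer can only recognize binary information, 0s and 1s, not characters such as a or b that are meaningful to humans. To let machines handle these characters, people defined mappings from characters to binary values and encode text according to those mappings. ASCII is an encoding that grew out of this need, and it is also the default encoding in Python 2.x.

ASCII defines how common characters map to numbers. It covers the range 0 to 127, 128 characters in all. Put simply, it is a dictionary that assigns each character a code value (a code point, an integer), which the computer can then represent in binary. For example, the character a has the code 97, which is 1100001 in binary, so a single byte is enough to store it. The string "abc" ends up stored as three bytes: [97] [98] [99]. By default Python uses this rule to encode characters and to decode bytes.

>>> ord('a')
97
>>> chr(97)
'a'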
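The ASCII mapping described above can be checked directly. This is a small sketch with illustrative variable names that are not from the original article; it uses only built-ins and is written to run on both Python 2.7 and Python 3.

```python
word = u'abc'

# Code point of each character.
code_points = [ord(ch) for ch in word]

# The bytes ASCII actually stores for the same text.
stored = list(bytearray(word.encode('ascii')))

# Each code point as a 7-bit binary string.
binary = [format(n, '07b') for n in code_points]

print(code_points)
print(stored)
print(binary)
```

For pure ASCII text the code points and the stored bytes are the same numbers, which is exactly why one byte per character is enough.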

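Because ASCII covers such a small range, what does Python do with characters outside it? Simple: it raises an error, and that is where UnicodeEncodeError and UnicodeDecodeError come from. So when does Python raise these errors, or in other words, when does Python perform encoding and decoding?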

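The unicode object

Like string, unicode is a kind of text object in Python (everything in Python is an object, strings included). Set aside the Unicode character set, Unicode encodings, and UTF-8 for the moment; the word "object" is used here deliberately, to distinguish it from the Unicode character set discussed later. Here "unicode" means Python's unicode object, whose constructor is unicode().

Creating a unicode object in Python is just as easy:

>>> s1 = unicode('abc')
>>> s2 = u'abc'
>>> s3 = U'abc'
>>> s4 = u'中文'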

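In the Python shell, these variables are displayed as:

s1 --> u'abc'
s2 --> u'abc'
s3 --> u'abc'
s4 --> u'\u4e2d\u6587'

Again, the output for s4 differs from the others: these are Unicode characters. ASCII can represent too few characters and is not universal enough (the "extended ASCII" schemes put the unused positions above 127 to work, but the same position means different things in different character sets), so the Unicode character set was created: a much larger dictionary with many more code values.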

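The Unicode character set

The Unicode character set solves two problems:

ASCII cannot represent enough characters
extended ASCII is not universal

Although the Unicode character set is expressive and unifies the rules for numbering characters, it does not specify how those characters are represented in a computer. Unlike ASCII, many of its characters (those with code values above 255) cannot fit in a single byte. How can these code values be stored efficiently? That is the job of the encodings of the Unicode character set: UTF-8, UTF-16, and so on.

At this point the relationship between ASCII, the Unicode character set, and UTF-8 can be summarized. ASCII and Unicode are both coded character sets, each consisting of characters and their corresponding integer values (code points). An ASCII code point is stored in a single byte. UTF-8 is one concrete way of storing Unicode code points, needed because they cannot all fit in a single byte.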
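As a rough illustration of the difference between a code point and its encoded bytes, the sketch below shows both for a few characters. The helper show() is hypothetical; it relies only on the standard binascii.hexlify function and the built-in utf-8 and utf-16-be codecs, and is written to run on both Python 2.7 and Python 3.

```python
import binascii


def show(ch):
    """Return (code point, UTF-8 bytes as hex, UTF-16 big-endian bytes as hex)."""
    return (
        'U+%04X' % ord(ch),
        binascii.hexlify(ch.encode('utf-8')).decode('ascii'),
        binascii.hexlify(ch.encode('utf-16-be')).decode('ascii'),
    )


for ch in [u'a', u'\u4e2d', u'\u6587']:
    print(show(ch))
```

The code point is the same whichever encoding is chosen; only the byte sequence changes. The character a needs one byte in UTF-8 and two in UTF-16, while each of the two Chinese characters needs three bytes in UTF-8 and two in UTF-16.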

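Back to the output for s4: it is the hexadecimal representation of the code points that the Unicode character set assigns to those characters.

A unicode object represents characters from the Unicode character set. How are these characters (really their code values, integers) stored? From the discussion so far you can probably guess: to be stored as bytes they still have to be encoded, but the encoding can no longer be ASCII. It has to be one of the encodings of the Unicode character set, such as UTF-8 or UTF-16.

Encoding a unicode object

To store a unicode object correctly you must specify an encoding. This article only discusses the most widely used one, UTF-8.

A unicode object is encoded in Python like this:

>>> s = u'中文'
>>> s.encode('utf-8')
'\xe4\xb8\xad\xe6\x96\x87'
>>> type(s.encode('utf-8'))
<type 'str'>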

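The result of encoding is a string, stored as a sequence of bytes. Where there is encoding there is also decoding, and UnicodeEncodeError and UnicodeDecodeError occur when Python uses the wrong codec during this encoding or decoding, because by default it uses ASCII for the conversion.

Converting between string and unicode objects

A unicode object can be encoded to a string with UTF-8, and likewise a string can be decoded to a unicode object with UTF-8:

>>> u = u'中文'
>>> s = u.encode('utf-8')
>>> s
'\xe4\xb8\xad\xe6\x96\x87'
>>> type(s)
<type 'str'>
>>> s.decode('utf-8')
u'\u4e2d\u6587'
>>> type(s.decode('utf-8'))
<type 'unicode'>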

Using the wrong codec produces the two familiar exceptions:

>>> u.encode('ascii')
Traceback (most recent call last):
  File "<input>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-1: ordinal not in range(128)
>>>
>>> s.decode('ascii')
Traceback (most recent call last):
  File "<input>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe4 in position 0: ordinal not in range(128)

When these two errors appear seemingly out of nowhere, it is because Python has implicitly used the ASCII codec.
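One way to see which of the two exceptions a given conversion raises is to wrap it in try/except. The helper try_codec() below is a hypothetical name used for illustration; the snippet only uses explicit encode and decode calls, so it behaves the same on Python 2.7 and Python 3.

```python
text = u'\u4e2d\u6587'        # a unicode object
raw = text.encode('utf-8')    # the matching UTF-8 byte string


def try_codec(action, codec):
    """Call action(codec) and report which Unicode error, if any, it raises."""
    try:
        action(codec)
        return 'ok'
    except UnicodeEncodeError:
        return 'UnicodeEncodeError'
    except UnicodeDecodeError:
        return 'UnicodeDecodeError'


print(try_codec(text.encode, 'ascii'))   # unicode -> bytes with the wrong codec
print(try_codec(raw.decode, 'ascii'))    # bytes -> unicode with the wrong codec
print(try_codec(text.encode, 'utf-8'))   # right codec
print(try_codec(raw.decode, 'utf-8'))    # right codec
```

The name of the exception already tells you the direction: an encode error means unicode was being turned into bytes, a decode error means bytes were being turned into unicode.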

Python's implicit encoding and decoding

1. Concatenating a string and a unicode object

>>> s + u''
Traceback (most recent call last):
  File "<input>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe4 in position 0: ordinal not in range(128)

2. Joining a list

>>> as_list = [u, s]
>>> ''.join(as_list)
Traceback (most recent call last):
  File "<input>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe4 in position 0: ordinal not in range(128)

3. String formatting

>>> '%s-%s' % (s, u)
Traceback (most recent call last):
  File "<input>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe4 in position 0: ordinal not in range(128)

4. Printing a unicode object

# test.py
# -*- coding: utf-8 -*-
u = u'中文'
print u

# output
Traceback (most recent call last):
  File "/Users/zhyq0826/workspace/zhyq0826/blog-code/p20161030_python_encoding/uni.py", line 3, in <module>
    print u
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-1: ordinal not in range(128)

5. Writing to a file

>>> f = open('text.txt', 'w')
>>> f.write(u)
Traceback (most recent call last):
  File "<input>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-1: ordinal not in range(128)

In examples 1, 2, and 3, Python implicitly decodes the string to a unicode object with ASCII before carrying out the operation, so all three are decode errors. In examples 4 and 5, Python implicitly encodes the unicode object to a string with ASCII before writing it out, so both are encode errors.
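A possible way to make the five cases work is to do each conversion explicitly with UTF-8 instead of leaving it to the implicit ASCII step. The sketch below assumes the byte string really is UTF-8. The names s_text, encoded, and sink are illustrative, and io.BytesIO stands in for a file opened in binary mode; the snippet runs on both Python 2.7 and Python 3.

```python
import io

u = u'\u4e2d\u6587'        # unicode object
s = u.encode('utf-8')      # byte string, like s in the examples

# Cases 1-3: decode the byte string first, then work only with unicode.
s_text = s.decode('utf-8')
combined = s_text + u                     # concatenation
joined = u''.join([u, s_text])            # list join
formatted = u'%s-%s' % (s_text, u)        # string formatting

# Cases 4-5: encode explicitly before the text leaves the program.
encoded = u.encode('utf-8')
sink = io.BytesIO()                       # stand-in for a binary-mode file
sink.write(encoded)

print(repr(combined))
print(repr(sink.getvalue()))
```

Each fix mirrors the direction of the original error: decode errors are cured by an explicit decode, encode errors by an explicit encode.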

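Anywhere that converts between unicode objects and strings, or that reads or writes unicode objects, may trigger implicit decoding or encoding: writing to a database, writing to a file, reading from a socket, and so on.

By now the real cause of these two exceptions should be mostly clear. A unicode object must be encoded to the corresponding string before it can be stored, transmitted, or printed, and a string must be decoded to the corresponding unicode object before unicode operations such as len and find can be performed on it.

string.decode('utf-8') --> unicode
unicode.encode('utf-8') --> string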
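These two directions can be wrapped in a pair of small helpers. The functions to_unicode() and to_bytes() below are hypothetical names, not part of Python or of the original article, and UTF-8 as the default is an assumption. The type(u'') and type(b'') lookups are there only so the same code runs on Python 2.7 and Python 3.

```python
text_type = type(u'')    # unicode on Python 2, str on Python 3
bytes_type = type(b'')   # str on Python 2, bytes on Python 3


def to_unicode(value, encoding='utf-8'):
    """Decode byte strings; pass unicode objects through unchanged."""
    if isinstance(value, bytes_type):
        return value.decode(encoding)
    return value


def to_bytes(value, encoding='utf-8'):
    """Encode unicode objects; pass byte strings through unchanged."""
    if isinstance(value, text_type):
        return value.encode(encoding)
    return value


print(repr(to_unicode(b'\xe4\xb8\xad\xe6\x96\x87')))
print(repr(to_bytes(u'\u4e2d\u6587')))
```

Because each helper leaves values of the target type alone, calling it twice is harmless, which avoids the double-encoding mistakes that often trigger the implicit ASCII step.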

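How to avoid these errors

1. Understand the direction of the conversion

Whenever an encoding error occurs, first work out which direction the conversion is going, encoding or decoding, and then fix that specific step.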
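The exception object itself carries enough information to work out the direction. The sketch below uses the standard encoding and start attributes of UnicodeError subclasses; the function describe() is a hypothetical helper, and the snippet runs on both Python 2.7 and Python 3.

```python
def describe(exc):
    """Summarize the direction, codec, and position of a Unicode error."""
    if isinstance(exc, UnicodeDecodeError):
        direction = 'bytes -> unicode (decode)'
    elif isinstance(exc, UnicodeEncodeError):
        direction = 'unicode -> bytes (encode)'
    else:
        raise TypeError('not an encode/decode error')
    return '%s: codec %r failed at position %d' % (
        direction, exc.encoding, exc.start)


messages = []
conversions = (
    lambda: b'\xe4\xb8\xad'.decode('ascii'),     # wrong codec while decoding
    lambda: u'\u4e2d\u6587'.encode('ascii'),     # wrong codec while encoding
)
for convert in conversions:
    try:
        convert()
    except UnicodeError as exc:
        messages.append(describe(exc))

for message in messages:
    print(message)
```

Logging a summary like this next to the traceback makes it easier to see whether the fix belongs on the input side (decode) or the output side (encode).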

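2. Declare the source file encoding as utf-8

Put this at the top of the file:

# -*- coding: utf-8 -*-

Python looks for coding: name or coding=name in the first two lines of the file and treats the file as being encoded in name. This tells Python to read the source file, including any non-ASCII string literals in it, with the declared encoding instead of ASCII. It does not change the ASCII codec used for the implicit conversions described above, so explicit encode and decode calls are still needed.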

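3. Decode input to unicode as early as possible, and encode output to a byte stream as late as possible

Whenever a byte stream comes in, decode it to a unicode object as early as possible. Whenever you want to write a unicode object to a file, a database, a socket, or any other external program, encode it first.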
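This rule can be sketched as a function that decodes at its input boundary, works only with unicode in the middle, and encodes at its output boundary. The function annotate() and its report format are made up for illustration, and UTF-8 on both sides is an assumption; the snippet runs on both Python 2.7 and Python 3.

```python
def annotate(raw_in, encoding='utf-8'):
    """Take bytes in, return bytes out, and use unicode for everything between."""
    text = raw_in.decode(encoding)                   # decode right at the input
    report = u'%s (%d chars)' % (text, len(text))    # pure unicode processing
    return report.encode(encoding)                   # encode right at the output


raw_out = annotate(b'\xe4\xb8\xad\xe6\x96\x87')
print(repr(raw_out))
```

Because len() runs on the decoded text, it counts two characters rather than six bytes, which is the kind of operation that only gives the expected answer on unicode objects.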

4. Use the codecs module to read and write unicode objects

The codecs module handles the encoding and decoding for you.

>>> import codecs
>>> f = codecs.open('text.txt', 'w', 'utf-8')
>>> f.write(u)
>>> f.close()
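A round trip shows what such a file object does on both sides. Note the substitution: the sketch uses io.open (available since Python 2.6 and identical to the built-in open on Python 3) rather than the codecs.open call shown above, because it behaves the same way for this purpose and runs on both Python 2.7 and Python 3. The temporary directory and variable names are illustrative.

```python
import io
import os
import shutil
import tempfile

u = u'\u4e2d\u6587'
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, 'text.txt')

# Writing: the file object encodes unicode to UTF-8 bytes.
with io.open(path, 'w', encoding='utf-8') as f:
    f.write(u)

# Reading in text mode: the file object decodes the bytes back to unicode.
with io.open(path, 'r', encoding='utf-8') as f:
    read_back = f.read()

# Reading in binary mode: the raw bytes that are actually on disk.
with io.open(path, 'rb') as f:
    on_disk = f.read()

shutil.rmtree(tmp_dir)   # clean up the temporary directory

print(repr(read_back))
print(repr(on_disk))
```

The program only ever handles unicode objects; the six UTF-8 bytes exist only in the file.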
