UnicodeEncodeError Caused by Python's str()

Published by 浮生若夢的程式設計 on 2018-03-06

Original URL: https://juejin.im/post/5a9e981851882555791829ad

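Background

As everyone knows, UnicodeEncodeError and UnicodeDecodeError are among the trickier problems in Python 2. When one shows up, it often leaves you baffled and seems to come out of nowhere. The book Fluent Python even presents the so-called "Unicode sandwich" model to help with this kind of problem (though you really don't need to go to that much trouble, as explained later).

Today I ran into a small problem of this kind in production. I found it interesting, so here is a quick note to record it.

When the bug was handed to me, the symptom was naturally a baffling message like UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-1: ordinal not in range(128). I went through the logs and quickly found the corresponding line of code, which looked roughly like this: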

thrift_obj = ThriftKeyValue(key=str(xx_obj.name))  # the failing line; xx_obj.name is already a string

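At first, seeing str(xx_obj.name), I could not tell whether it was a slip or intentional. Either way, it is not something I would ever write (every project probably has a bit of magical code like this).

Analysis

Taken literally, the exception says that some string was being encoded by the ASCII codec, the string contained characters outside the range ASCII can represent, and so the encoding failed. From that I inferred two things:

Somewhere there must be a Unicode string (which one does not matter, as long as it contains characters outside the ASCII range). Here it should be xx_obj.name.
Somewhere an encoding step is taking place, and it is happening silently (I really dislike this kind of implicit conversion, and Python 2 has a lot of it). The code does not show where.

After looking it over, the built-in str() function seemed the likely culprit, so I tried the following: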

In [5]: u = u'中国'

In [6]: str(u)
---------------------------------------------------------------------------
UnicodeEncodeError                        Traceback (most recent call last)
<ipython-input-6-b3b94fb7b5a0> in <module>()
----> 1 str(u)

UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-1: ordinal not in range(128)

In [7]: b = u.encode('utf-8')

In [8]: str(b)
Out[8]: '\xe4\xb8\xad\xe5\x9b\xbd'



Sure enough. I checked the documentation, but it had nothing useful to say; the description is too vague:

class str(object='')
Return a string containing a nicely printable representation of an object. For strings, this returns the string itself. The difference with repr(object) is that str(object) does not always attempt to return a string that is acceptable to eval(); its goal is to return a printable string. If no argument is given, returns the empty string, ''.

For more information on strings see Sequence Types — str, unicode, list, tuple, bytearray, buffer, xrange which describes sequence functionality (strings are sequences), and also the string-specific methods described in the String Methods section. To output formatted strings use template strings or the % operator described in the String Formatting Operations section. In addition see the String Services section. See also unicode().
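
What the documentation leaves unsaid is that, in Python 2, calling str() on a unicode object encodes it with the interpreter's default encoding, which is normally 'ascii'. The sketch below makes that hidden step explicit so the failure is easier to see. The helper name implicit_str is hypothetical and exists only for illustration; it is a stand-in for the implicit behaviour, not an actual Python API.

```python
# A minimal sketch of the encode step that Python 2's str() performs
# implicitly on a unicode object. Runs on Python 2.7 and Python 3.

text = u'\u4e2d\u56fd'  # the two characters from the session above


def implicit_str(value, default_encoding='ascii'):
    """Hypothetical stand-in for what str(unicode_obj) does in Python 2."""
    return value.encode(default_encoding)


error_message = None
try:
    implicit_str(text)
except UnicodeEncodeError as exc:
    # Same error as in the traceback: the ASCII codec cannot represent
    # code points above 127.
    error_message = str(exc)
    print(error_message)

# Encoding explicitly with a codec that covers the characters works fine.
utf8_bytes = text.encode('utf-8')
print(repr(utf8_bytes))
```

Seen this way, the error is not about str() as such but about the codec it silently picks.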

In our code base (Python 2), every .py file contains this line:

from __future__ import unicode_literals, absolute_import

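So I guessed that xx_obj.name was a unicode string, and logging its type confirmed it.

Fix

At this point there were two options. One was to convert xx_obj.name into something str() can handle, which here means bytes rather than unicode. I did not do that because it is too ugly. The other, which I chose, was to change the line to this: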

thrift_obj = ThriftKeyValue(key=xx_obj.name)  # no need to call str() here; it probably worked before only because name always happened to be ASCII

The bug was fixed, and everything else kept working as expected.
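
To illustrate why the old line could run for a long time without trouble, here is a small sketch. It assumes, as guessed above, that earlier values of name were pure ASCII. The function name passes_ascii_encode and the sample names are made up for the example.

```python
# Sketch: an implicit ASCII encode only fails once a non-ASCII value arrives.
# Runs on Python 2.7 and Python 3.

def passes_ascii_encode(value):
    """Return True if value can be encoded with the 'ascii' codec."""
    try:
        value.encode('ascii')
    except UnicodeEncodeError:
        return False
    return True


# Hypothetical sample values; the last one contains non-ASCII characters.
names = [u'user_name', u'order-42', u'\u4e2d\u56fd']
results = [(name, passes_ascii_encode(name)) for name in names]

for name, ok in results:
    print(repr(name), ok)
```

The first two values pass and the last one fails, which matches a bug that stays hidden until the first non-ASCII name shows up.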

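Summary

As mentioned earlier, Python 2 performs many implicit conversions like this one, and they are poorly documented. Add a Windows environment and print into the mix and the error messages become even harder to make sense of. Fluent Python covers the so-called "Unicode sandwich" model for dealing with this, and it is quite instructive.

That said, the rule I usually follow is: use only Unicode, and make everything Unicode everywhere. Here is how:

Every .py file must start with the following header: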

#!/usr/bin/env python
# -*- coding: utf-8 -*-
#

from __future__ import unicode_literals, absolute_import

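Byte strings received from the outside world (from the network, from files, and so on) are converted to Unicode first. It is better to extract this into a function to avoid writing the same code repeatedly:

The API names are a bit verbose, mainly so that each name explains itself.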


class DecodeBytesFailedException(Exception):
    """Raised when none of the candidate encodings can decode the input."""


class UnicodeUtils(object):
    @classmethod
    def get_unicode_str(cls, bytes_str, try_decoders=('utf-8', 'gbk', 'utf-16')):
        """Convert to a text string (normally unicode)."""

        if not bytes_str:
            return u''

        # Already unicode: nothing to decode.
        if isinstance(bytes_str, unicode):
            return bytes_str

        # Try each candidate encoding in order; the first one that decodes wins.
        for decoder in try_decoders:
            try:
                return bytes_str.decode(decoder)
            except UnicodeDecodeError:
                pass

        raise DecodeBytesFailedException('decode bytes failed. tried decoders: %s' % list(try_decoders))

    @classmethod
    def encode_to_bytes(cls, unicode_str, encoder='utf-8'):
        """Convert to a byte string."""

        if unicode_str is None:
            return b''

        if isinstance(unicode_str, unicode):
            return unicode_str.encode(encoder)

        # Not unicode yet: normalize to unicode first, then encode.
        u = cls.get_unicode_str(unicode_str)
        return u.encode(encoder)

Everything sent to the outside world is converted to a UTF-8 encoded byte string, using the code above.
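
The whole round trip (decode on the way in, work with Unicode in the middle, encode to UTF-8 on the way out) can be sketched as follows. This is a simplified variant written to run on both Python 2.7 and Python 3, not the helper class above. The function name decode_with_fallback and the sample byte strings are assumptions made for the example.

```python
# Sketch of "decode at the boundary, encode at the boundary".
# Runs on Python 2.7 and Python 3.

def decode_with_fallback(raw, encodings=('utf-8', 'gbk', 'utf-16')):
    """Return text for raw bytes, trying each candidate encoding in order."""
    if isinstance(raw, type(u'')):  # already text, nothing to decode
        return raw
    for name in encodings:
        try:
            return raw.decode(name)
        except UnicodeDecodeError:
            continue
    raise ValueError('could not decode with any of: %s' % list(encodings))


# The same two characters arriving in two different encodings.
utf8_input = b'\xe4\xb8\xad\xe5\x9b\xbd'  # UTF-8 bytes
gbk_input = b'\xd6\xd0\xb9\xfa'           # GBK bytes, not valid UTF-8

text_from_utf8 = decode_with_fallback(utf8_input)
text_from_gbk = decode_with_fallback(gbk_input)

# Inside the program both values are now identical Unicode text.
print(text_from_utf8 == text_from_gbk)

# On the way out, always encode to UTF-8.
outgoing = text_from_gbk.encode('utf-8')
print(repr(outgoing))
```

Trial decoding like this is a heuristic: some byte sequences are valid in more than one encoding, so the order of the candidates matters. When the source states its encoding, using that directly is the safer choice.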
