Python and Encodings

Published 2016-09-01

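Text objects in Python

Python 3.x has three types for handling text and binary data: str, bytes, and bytearray.

bytes and bytearray support almost every str method, except the formatting methods (format, format_map) and a few Unicode-specific ones (casefold, isdecimal, isidentifier, isnumeric, isprintable, encode).
bytes also has a class method, fromhex, that builds a bytes object from a string of hexadecimal digits; str has no such method.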

>>> b = bytes.fromhex('E4 B8 AD')
>>> b
b'\xe4\xb8\xad'
>>> b.decode('utf-8')
'中'
>>> str(b)    # without an encoding argument, str() returns the repr of the bytes object
"b'\\xe4\\xb8\\xad'"

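As a rough sketch of how the two binary types behave in practice (the variable names here are only illustrative): indexing gives an integer, slicing gives the same type back, and only bytearray can be changed in place.

```python
# bytes is immutable; bytearray is its mutable counterpart.
raw = bytes.fromhex('E4 B8 AD')        # UTF-8 encoding of '中'

first = raw[0]                         # indexing gives an int: 228 (0xE4)
head = raw[:1]                         # slicing gives bytes: b'\xe4'

buf = bytearray(raw)
buf[0:3] = '文'.encode('utf-8')        # in-place replacement works on bytearray
text = buf.decode('utf-8')             # '文'

# str() with an explicit encoding decodes instead of returning the repr
decoded = str(raw, 'utf-8')            # '中'
```
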
Unicode and character conversion

chr converts a Unicode code point to a character; ord does the reverse.

>>> ord('A')
65
>>> ord('中')
20013
>>> chr(65)
'A'
>>> chr(20013)
'中'
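Code points are usually written in the U+XXXX notation. A minimal sketch of producing that label with ord (the helper name code_point_label is made up for this example), which also shows that chr works for code points outside the Basic Multilingual Plane:

```python
def code_point_label(char):
    """Format a single character as a U+XXXX label."""
    return f'U+{ord(char):04X}'

labels = [code_point_label(c) for c in 'A中']   # ['U+0041', 'U+4E2D']
emoji = chr(0x1F600)                            # one character, even above U+FFFF
```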

len counts characters (code points), not bytes

>>> len('中')
1
>>> '中'.encode('utf-8')
b'\xe4\xb8\xad'
>>> len('中'.encode('utf-8'))    # length of the bytes object, which holds 3 integers in range(256)
3
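The byte count depends on the encoding, while the character count does not. A small sketch (encoded_lengths is a hypothetical helper, and the three encodings are just sample choices):

```python
def encoded_lengths(text, encodings=('utf-8', 'utf-16-le', 'gbk')):
    """Map each encoding name to the number of bytes text occupies in it."""
    return {enc: len(text.encode(enc)) for enc in encodings}

sizes = encoded_lengths('中')
# one character, but 3 bytes in UTF-8 and 2 bytes in UTF-16-LE and GBK
```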

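Python and encodings

How Python handles encodings internally

When Python receives input, the input should be decoded to Unicode (str) first, and as early as possible.
All processing then happens on Unicode text; do not convert between encodings in the middle of it.
When Python returns results, the text is encoded from Unicode to the encoding we need, and as late as possible.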
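The three steps above might look like this in code. This is only a sketch: the function name shout and the choice of cp1252 and utf-8 are assumptions for the example.

```python
def shout(raw_input: bytes, in_encoding: str, out_encoding: str) -> bytes:
    text = raw_input.decode(in_encoding)    # 1. decode as early as possible
    result = text.upper()                   # 2. work only on str
    return result.encode(out_encoding)      # 3. encode as late as possible

incoming = 'caf\u00e9'.encode('cp1252')     # pretend this arrived from a cp1252 source
outgoing = shout(incoming, 'cp1252', 'utf-8')
```

Keeping the decode and encode calls at the edges means the middle of the program never has to know which encodings are involved.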

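Encoding of Python source code

Python 3 assumes source files are encoded in UTF-8 by default.
To save Python code in a different encoding, put an encoding declaration on the first line of each file, or on the second line if the first is taken by a hash-bang line:
# -*- coding: windows-1252 -*-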
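To see which encoding a source file declares, the standard library's tokenize.detect_encoding can be used. A sketch on an in-memory file (the source bytes below are invented for the example):

```python
import codecs
import io
import tokenize

# A tiny cp1252-encoded "source file" with an encoding declaration on line 1
source = b"# -*- coding: windows-1252 -*-\nname = 'caf\xe9'\n"

# detect_encoding looks at no more than the first two lines
encoding, _lines_read = tokenize.detect_encoding(io.BytesIO(source).readline)
canonical = codecs.lookup(encoding).name    # Python's own name for that codec
text = source.decode(encoding)
```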

Encodings used by Python

C:\Users\JL>chcp        # query the active code page of the console
Active code page: 936
>>> import sys, locale
>>> locale.getpreferredencoding()    # this is the most important one
'cp936'
>>> my_file = open('cafe.txt','r')
>>> type(my_file)
<class '_io.TextIOWrapper'>
>>> my_file.encoding    # file objects default to the value of locale.getpreferredencoding()
'cp936'
>>> sys.stdout.isatty(), sys.stdin.isatty(), sys.stderr.isatty()    # is each stream attached to a console?
(True, True, True)
>>> sys.stdout.encoding, sys.stdin.encoding, sys.stderr.encoding    # if the standard streams are redirected, for example to a file, the encoding comes from the PYTHONIOENCODING environment variable, the console encoding, or locale.getpreferredencoding(), in decreasing priority
('cp936', 'cp936', 'cp936')
>>> sys.getdefaultencoding()    # used by default when Python has to convert binary data to str
'utf-8'
>>> sys.getfilesystemencoding()    # used by default to encode or decode file names (not file contents)
'mbcs'

The results above are from Windows. On GNU/Linux or OS X, all of these would typically be UTF-8.
For the difference between mbcs and utf-8, see http://stackoverflow.com/questions/3298569/difference-between-mbcs-and-utf-8-on-windows
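To compare machines, it can help to gather these settings in one place. A small sketch (encoding_report is a made-up helper; it only calls the functions shown in the session above):

```python
import locale
import sys

def encoding_report():
    """Collect the encoding-related settings of the running interpreter."""
    return {
        'locale.getpreferredencoding()': locale.getpreferredencoding(),
        'sys.stdout.encoding': getattr(sys.stdout, 'encoding', None),
        'sys.getdefaultencoding()': sys.getdefaultencoding(),
        'sys.getfilesystemencoding()': sys.getfilesystemencoding(),
    }

for setting, value in encoding_report().items():
    print(f'{setting:32} -> {value}')
```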

Encodings when reading and writing files

>>> open('cafe.txt','w',encoding='utf-8').write('café')
4
>>> fp = open('cafe.txt','r')
>>> fp.read()
'caf茅'
>>> fp.encoding
'cp936'
>>> open('cafe.txt','r', encoding = 'cp936').read()
'caf茅'
>>> open('cafe.txt','r', encoding = 'latin1').read()
'cafÃ©'
>>> fp = open('cafe.txt','r', encoding = 'utf-8')
>>> fp.encoding
'utf-8'

As the examples above show, never rely on the default encoding: the same code can fail in unexpected ways when it runs on a different machine.
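A machine-independent version of the same experiment, as a sketch: every open call names its encoding, and the file lives in a temporary directory (the path is created only for this example).

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), 'cafe.txt')

# Write with an explicit encoding...
with open(path, 'w', encoding='utf-8') as fp:
    fp.write('caf\u00e9')

# ...and read it back with the same one.
with open(path, 'r', encoding='utf-8') as fp:
    good = fp.read()                      # 'café'

# Reading with the wrong encoding does not fail here; it silently garbles.
with open(path, 'r', encoding='latin1') as fp:
    garbled = fp.read()                   # 'cafÃ©'

size_on_disk = os.path.getsize(path)      # 5 bytes for 4 characters
```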

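How Python deals with Unicode's pitfalls

Python always compares strings, for ordering or for equality, code point by code point.

In Unicode an accented letter can be represented in two ways: as a single precomposed code point, or as a base letter followed by a combining accent. Unicode treats the two as canonically equivalent, but because Python compares code points, they compare as unequal.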

>>> c1 = 'cafe\u0301'
>>> c2 = 'café'
>>> c1 == c2
False
>>> len(c1), len(c2)
(5, 4)

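The fix is the normalize function in the unicodedata module. Its first argument is one of four forms: 'NFC', 'NFD', 'NFKC', 'NFKD'.
NFC (Normalization Form Canonical Composition): decomposes by canonical equivalence, then recomposes by canonical equivalence. For singletons, the recomposed result can differ from the original. It produces the shortest equivalent string, so the two code points 'e\u0301' are composed into the single code point 'é'.
NFD (Normalization Form Canonical Decomposition): decomposes by canonical equivalence.
NFKD (Normalization Form Compatibility Decomposition): decomposes by compatibility equivalence.
NFKC (Normalization Form Compatibility Composition): decomposes by compatibility equivalence, then recomposes by canonical equivalence.
NFKC and NFKD can lose information.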
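A quick way to get a feel for the four forms is to apply all of them to the same string. A sketch (all_forms and the three sample strings are chosen for illustration):

```python
from unicodedata import normalize

FORMS = ('NFC', 'NFD', 'NFKC', 'NFKD')

def all_forms(text):
    """Return {form name: normalized text} for the four normalization forms."""
    return {form: normalize(form, text) for form in FORMS}

e_acute = all_forms('\u00e9')     # 'é': composed forms have length 1, decomposed 2
ligature = all_forms('\ufb01')    # 'ﬁ' survives NFC/NFD, becomes 'fi' under NFKC/NFKD
squared = all_forms('4\u00b2')    # '4²' becomes '42' under NFKC/NFKD: meaning is lost
```

The last sample is the kind of information loss the compatibility forms can cause, which is why they suit searching and indexing better than permanent storage.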

>>> from unicodedata import normalize
>>> c3 = normalize('NFC',c1)        # compose c1 into the shorter form
>>> len(c3)
4
>>> c3 == c2
True
>>> c4 = normalize('NFD',c2)
>>> len(c4)
5
>>> c4 == c1
True

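Western keyboards usually produce the shortest (composed) form, which matches the NFC result; still, it is safer to normalize with 'NFC' before comparing strings for equality. The W3C also recommends NFC.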

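The same character can also have two different code points in Unicode.
In that case normalize replaces one single Unicode character with another.

>>> from unicodedata import name
>>> o1 = '\u2126'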
>>> o2 = '\u03a9'
>>> o1, o2
('Ω', 'Ω')
>>> o1 == o2
False
>>> name(o1), name(o2)
('OHM SIGN', 'GREEK CAPITAL LETTER OMEGA')
>>> o3 = normalize('NFC',o1)
>>> name(o3)
'GREEK CAPITAL LETTER OMEGA'
>>> o3 == o2
True

Another example:

>>> u1 = '\u00b5'
>>> u2 = '\u03bc'
>>> u1,u2
('µ', 'μ')
>>> name(u1), name(u2)
('MICRO SIGN', 'GREEK SMALL LETTER MU')
>>> u3 = normalize('NFKD',u1)
>>> name(u3)
'GREEK SMALL LETTER MU'

One more example:

>>> h1 = '\u00bd'
>>> h2 = normalize('NFKC',h1)
>>> h1, h2
('½', '1⁄2')
>>> len(h1), len(h2)
(1, 3)

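Sometimes we want a case-insensitive comparison.
Use str.casefold(), which converts text to a caseless form for comparison: 'A' becomes 'a', and 'µ' (MICRO SIGN) becomes 'μ' (GREEK SMALL LETTER MU).
In the vast majority of cases (98.9%), str.casefold() and str.lower() give the same result.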
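The few characters where the two methods differ are easy to check directly. A minimal sketch using the German sharp s and the micro sign:

```python
samples = ['A', 'ß', '\u00b5']                # '\u00b5' is MICRO SIGN

lowered = [ch.lower() for ch in samples]      # 'ß' and the micro sign stay as they are
folded = [ch.casefold() for ch in samples]    # 'ß' -> 'ss', MICRO SIGN -> Greek mu

# casefold makes these two spellings compare equal; lower does not
same_folded = 'Straße'.casefold() == 'STRASSE'.casefold()
same_lowered = 'Straße'.lower() == 'STRASSE'.lower()
```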
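
Sorting text

Different languages have different collation rules, so sorting by Python's plain code point comparison often gives results users do not expect.
The usual approach is to sort with locale.strxfrm as the key.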

>>> import locale
>>> locale.setlocale(locale.LC_COLLATE,'pt_BR.UTF-8')
'pt_BR.UTF-8'
>>> sort_result = sorted(initial, key = locale.strxfrm)    # initial: the list of strings to sort
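One catch is that setlocale raises locale.Error when the requested locale is not installed, and that it changes process-wide state. A defensive sketch (sorted_for_locale and the sample word list are assumptions for this example, not part of the locale module):

```python
import locale

def sorted_for_locale(words, locale_name):
    """Sort words by the collation rules of locale_name, if it is installed."""
    previous = locale.setlocale(locale.LC_COLLATE)       # query the current setting
    try:
        locale.setlocale(locale.LC_COLLATE, locale_name)
    except locale.Error:
        return sorted(words)                             # fall back to code point order
    try:
        return sorted(words, key=locale.strxfrm)
    finally:
        locale.setlocale(locale.LC_COLLATE, previous)    # restore the global setting

fruits = ['caju', 'atemoia', 'cajá', 'açaí', 'acerola']
result = sorted_for_locale(fruits, 'pt_BR.UTF-8')
```

The exact order of result depends on the operating system's locale data, so it is worth checking on each target platform.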

Encoding and decoding errors

A decoding error in Python source code raises SyntaxError.
In other cases, encoding and decoding failures raise UnicodeEncodeError and UnicodeDecodeError respectively.
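Both exceptions can be caught, or avoided with the errors argument of encode and decode. A short sketch of the common handlers (the sample strings are arbitrary):

```python
city = 'S\u00e3o Paulo'                    # 'São Paulo'

try:
    city.encode('ascii')
except UnicodeEncodeError as exc:
    bad_span = (exc.start, exc.end)        # position of the character that failed

replaced = city.encode('ascii', errors='replace')            # b'S?o Paulo'
ignored = city.encode('ascii', errors='ignore')              # b'So Paulo'
escaped = city.encode('ascii', errors='xmlcharrefreplace')   # b'S&#227;o Paulo'

octets = b'caf\xe9'                        # latin1 bytes, not valid UTF-8
try:
    octets.decode('utf-8')
except UnicodeDecodeError:
    # U+FFFD REPLACEMENT CHARACTER marks the undecodable byte
    recovered = octets.decode('utf-8', errors='replace')
```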

A few useful functions from Fluent Python

import string
from unicodedata import normalize, combining

def nfc_equal(s1, s2):
    '''Return True if s1 equals s2 after NFC normalization.'''
    return normalize("NFC",s1) == normalize("NFC",s2)

def fold_equal(s1, s2):
    '''Return True if s1 equals s2 after NFC normalization and casefold().'''
    return normalize('NFC',s1).casefold() == normalize('NFC',s2).casefold()

def shave_marks(txt):
    '''Remove all diacritic marks.
    The goal is only to turn Latin text into plain ASCII, but this function
    also changes Greek letters; shave_latin_marks below is more precise.'''
    normal_txt = normalize('NFD',txt)
    shaved = ''.join(c for c in normal_txt if not combining(c))
    return normalize('NFC',shaved)

def shave_latin_marks(txt):
    '''Remove all diacritic marks from Latin base characters.'''
    normal_txt = normalize('NFD',txt)
    keeping = []
    latin_base = False
    for c in normal_txt:
        if combining(c) and latin_base:
            continue    # ignore diacritic marks on a Latin base char
        keeping.append(c)
        # if it is not a combining char, it is a new base char
        if not combining(c):
            latin_base = c in string.ascii_letters
    # join the kept characters and recompose them
    return normalize('NFC', ''.join(keeping))
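A short usage sketch for the two comparison helpers. They are redefined compactly here only so the snippet runs on its own; the sample strings are my own choice.

```python
from unicodedata import normalize

def nfc_equal(s1, s2):
    return normalize('NFC', s1) == normalize('NFC', s2)

def fold_equal(s1, s2):
    return normalize('NFC', s1).casefold() == normalize('NFC', s2).casefold()

composed, decomposed = 'caf\u00e9', 'cafe\u0301'

plain = composed == decomposed                   # False: different code points
canonical = nfc_equal(composed, decomposed)      # True after NFC
case_strict = nfc_equal('A', 'a')                # False: NFC does not touch case
case_folded = fold_equal('Straße', 'STRASSE')    # True: 'ß' folds to 'ss'
```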

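Encoding detection with Chardet

Chardet is a third-party package (installable with pip install chardet), not part of the Python standard library.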
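A minimal sketch of sniffing an encoding with chardet.detect. Since chardet is distributed on PyPI and may not be installed, the hypothetical wrapper sniff_encoding returns None in that case; the sample text is arbitrary. The detected encoding is a statistical guess, so the confidence value deserves a look, especially for short inputs.

```python
def sniff_encoding(data: bytes):
    """Return chardet's guess for data, or None when chardet is not installed."""
    try:
        import chardet               # third-party: pip install chardet
    except ImportError:
        return None
    # detect returns a dict with (at least) 'encoding' and 'confidence' keys
    return chardet.detect(data)

sample = 'Python與編碼：文字、位元組與編碼偵測'.encode('utf-8')
guess = sniff_encoding(sample)
```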

References:

http://blog.csdn.net/tcdddd/article/details/8191464
