Python printable characters and UTF-8 notes (qbit)

Posted by qbit on 2022-12-16
  • Unicode character table: https://en.wikibooks.org/wiki...
  • \xa0 is NO-BREAK SPACE, a space at which a line may not be broken
  • \xad is SOFT HYPHEN, often displayed as a short dash or as a space
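The two code points above can be checked against the Unicode database bundled with Python. This is a small sketch using only the standard-library unicodedata module; the variable names are illustrative.

```python
import unicodedata

# Look up the official name and general category of each code point.
nbsp_info = (unicodedata.name('\xa0'), unicodedata.category('\xa0'))
shy_info = (unicodedata.name('\xad'), unicodedata.category('\xad'))

print(nbsp_info)  # name plus category Zs (space separator)
print(shy_info)   # name plus category Cf (format character)
```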

Printable characters

>>> '你好'.isprintable()
True

>>> '\x41'.isprintable()
True

>>> '\xa0'.isprintable()
False

>>> '\xad'.isprintable()
False

>>> '\u0041'.isprintable()
True
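The results above follow the rule given in the Python documentation for str.isprintable(): characters whose Unicode general category is "Other" (C*) or "Separator" (Z*) count as non-printable, except the ASCII space. A rough sketch of that rule follows; printable_by_category is a hypothetical helper name, and the comparison only covers the listed samples.

```python
import unicodedata

def printable_by_category(ch):
    """Approximate str.isprintable() for a single character."""
    if ch == ' ':
        return True  # the ASCII space is the one separator treated as printable
    # The major class is the first letter: C = Other, Z = Separator.
    return unicodedata.category(ch)[0] not in ('C', 'Z')

samples = ['你', '\x41', '\xa0', '\xad', ' ', '\n']
results = {ch: printable_by_category(ch) for ch in samples}

# Compare with the built-in method on the same samples.
agrees = all(results[ch] == ch.isprintable() for ch in samples)
print(agrees)
```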

UTF8

>>> import codecs
>>> utf8_decoder = codecs.getincrementaldecoder('utf8')()
>>> utf8_decoder.decode(b'hello')
'hello'

>>> utf8_decoder.decode(b'\x41')
'A'

>>> utf8_decoder.decode(b'\xa0')
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa0 in position 0: invalid start byte
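The error arises because 0xA0 is a continuation byte in UTF-8 and cannot start a character; the character U+00A0 itself is encoded as the two bytes C2 A0. The sketch below illustrates this, and also how an incremental decoder buffers a multi-byte character that is split across chunks. The variable names are illustrative, and errors='replace' is just one of the standard error handlers one might choose.

```python
import codecs

# U+00A0 is encoded as two bytes in UTF-8.
nbsp_bytes = '\xa0'.encode('utf8')

# '你' is b'\xe4\xbd\xa0'; feed it to the decoder in two chunks.
decoder = codecs.getincrementaldecoder('utf8')()
first = decoder.decode(b'\xe4\xbd')   # incomplete sequence is buffered
second = decoder.decode(b'\xa0')      # 0xa0 is valid here as a continuation byte

# With errors='replace', a stray 0xa0 becomes U+FFFD instead of raising.
lenient = codecs.getincrementaldecoder('utf8')(errors='replace')
replaced = lenient.decode(b'\xa0', final=True)

print(nbsp_bytes, repr(first), repr(second), repr(replaced))
```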

regex

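# regex is the third-party module (pip install regex); the standard re module does not support \p{...}
>>> import regex
# Letter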
>>> regex.findall(r'[\p{L}]', '水_A_\x41_\xa0_\xad_0\u4dc7地\u20de')
['水', 'A', 'A', '地']
# Mark
>>> regex.findall(r'[\p{M}]', '水_A_\x41_\xa0_\xad_0\u4dc7地\u20de')
['⃞']
# Separator
>>> regex.findall(r'[\p{Z}]', '水_A_\x41_\xa0_\xad_0\u4dc7地\u20de')
['\xa0']
# Symbol
>>> regex.findall(r'[\p{S}]', '水_A_\x41_\xa0_\xad_0\u4dc7地\u20de')
['䷇']
# Number
>>> regex.findall(r'[\p{N}]', '水_A_\x41_\xa0_\xad_0\u4dc7地\u20de')
['0']
# Punctuation
>>> regex.findall(r'[\p{P}]', '水_A_\x41_\xa0_\xad_0\u4dc7地\u20de')
['_', '_', '_', '_', '_']
# Other
>>> regex.findall(r'[\p{C}]', '水_A_\x41_\xa0_\xad_0\u4dc7地\u20de')
['\xad']
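The same grouping can be approximated without the third-party regex module by reading each character's general category from the standard library. This is only a sketch: unicodedata and regex may ship different Unicode versions, so results could differ for recently added characters. The names text and groups are illustrative.

```python
import unicodedata
from collections import defaultdict

text = '水_A_\x41_\xa0_\xad_0\u4dc7地\u20de'

# Bucket every character by the first letter of its general category
# (L = Letter, M = Mark, Z = Separator, S = Symbol, N = Number,
#  P = Punctuation, C = Other).
groups = defaultdict(list)
for ch in text:
    groups[unicodedata.category(ch)[0]].append(ch)

for major in 'LMZSNPC':
    print(major, groups[major])
```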

pandahouse

  • pandahouse has trouble handling unusual characters such as \xad
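The note does not say what exactly goes wrong, so the following is only one possible workaround and does not use pandahouse's own API: clean string values before handing the data over. clean_text is a hypothetical helper, and turning space separators into a plain space while dropping every other non-printable character is an assumption about what is acceptable for the data.

```python
import unicodedata

def clean_text(value):
    """Hypothetical helper: normalize characters that str.isprintable() rejects."""
    out = []
    for ch in value:
        if ch.isprintable():
            out.append(ch)
        elif unicodedata.category(ch) == 'Zs':
            out.append(' ')  # e.g. \xa0 becomes an ordinary space
        # Everything else (e.g. \xad, control characters, newlines) is dropped.
    return ''.join(out)

rows = ['a\xa0b', 'soft\xadhyphen', 'plain']
cleaned = [clean_text(r) for r in rows]
print(cleaned)
```

With a pandas DataFrame, the same function could be applied to a string column (for example with Series.map) before writing.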
This post is from qbit snap
