1.4 Strings

This section introduces ways to work with text.

Representing Literal Text

String literals are written in programs with quotes.

# Single quote
a = 'Yeah but no but yeah but...'

# Double quote
b = "computer says no"

# Triple quotes
c = '''
Look into my eyes, look into my eyes, the eyes, the eyes, the eyes,
not around the eyes,
don't look around the eyes,
look into my eyes, you're under.
'''

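Normally, a string may span only a single line. Triple quotes capture all text up to the closing quotes, including line breaks and all other formatting.

There is no difference between using single (') and double (") quotes. However, the same type of quote used to start a string must be used to terminate it.

String Escape Codes

Escape codes are used to represent control characters and characters that can't easily be typed at the keyboard. Here are some common escape codes: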
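As a small illustration of the quoting rules above, the following sketch compares the quote styles. The variable names are chosen only for this example.

```python
# Single and double quotes produce equal strings.
single = 'computer says no'
double = "computer says no"
print(single == double)          # True

# Starting the string with the other kind of quote avoids escaping
# a quote character that appears inside the text.
apostrophe = "don't look around the eyes"
quoted = 'He said "yeah but no"'
print(apostrophe)
print(quoted)

# A triple-quoted string keeps its line breaks as newline characters.
multi = '''line one
line two'''
print(multi.count('\n'))         # 1
```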

'\n'      Line feed
'\r'      Carriage return
'\t'      Tab
'\''      Literal single quote
'\"'      Literal double quote
'\\'      Literal backslash

String Representation

Each character in a string is stored internally as a so-called Unicode "code-point", which is an integer. You can specify an exact code-point value using the following escape sequences.
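To make the table above concrete, here is a minimal sketch showing that each escape code stands for a single character in the resulting string. The sample strings are made up for illustration.

```python
tab_separated = 'name\tshares'   # one tab character between the words
two_lines = 'first\nsecond'      # one line feed between the words
backslash = '\\'                 # a single literal backslash

print(len('\n'))          # 1: the escape code is one character
print(len(backslash))     # 1
print(two_lines)          # printed on two separate lines
print(repr(two_lines))    # 'first\nsecond': repr() shows the escape code again
```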

a = '\xf1'          # a = 'ñ'
b = '\u2200'        # b = '∀'
c = '\U0001D122'    # c = '𝄢'
d = '\N{FOR ALL}'   # d = '∀'

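The Unicode Character Database is a reference for all available character codes.

String Indexing

Strings work like arrays for accessing individual characters. You use an integer index, starting at 0. Negative indices specify a position relative to the end of the string.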
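The link between characters and integer code points can be explored with the built-in ord() and chr() functions and the standard unicodedata module. These are not mentioned in the text above, so treat this as a supplementary sketch.

```python
import unicodedata

ch = '\u2200'
code = ord(ch)                   # character -> integer code point
print(code)                      # 8704
print(hex(code))                 # 0x2200
print(chr(0xf1))                 # integer code point -> character 'ñ'
print(unicodedata.name(ch))      # FOR ALL

# A character outside the 16-bit range is still a single character.
print(len('\U0001D122'))         # 1
```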

a = 'Hello world'
b = a[0]          # 'H'
c = a[4]          # 'o'
d = a[-1]         # 'd' (end of string)

You can also slice or select substrings by specifying a range of indices with a colon (:).

d = a[:5]     # 'Hello'
e = a[6:]     # 'world'
f = a[3:8]    # 'lo wo'
g = a[-5:]    # 'world'

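The character at the ending index is not included. A missing index is assumed to be the beginning or the end of the string.

String Operations

String operations include concatenation, length, membership testing, and replication.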
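The half-open slicing rule has a few handy consequences, sketched below with the same 'Hello world' string. The names start, stop, and piece are introduced only for this example.

```python
a = 'Hello world'
start, stop = 3, 8

piece = a[start:stop]
print(piece)                          # 'lo wo'
print(len(piece) == stop - start)     # True: the end index is excluded

# Splitting at the same index and rejoining gives back the original.
print(a[:5] + a[5:] == a)             # True

# A negative start counts from the end of the string.
print(a[-5:] == a[len(a) - 5:])       # True

# Slice bounds past the end are clipped rather than raising an error.
print(a[6:100])                       # 'world'
```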

# Concatenation (+)
a = 'Hello' + 'World'   # 'HelloWorld'
b = 'Say ' + a          # 'Say HelloWorld'

# Length (len)
s = 'Hello'
len(s)                  # 5

# Membership test (`in`, `not in`)
t = 'e' in s            # True
f = 'x' in s            # False
g = 'hi' not in s       # True

# Replication (s * n)
rep = s * 5             # 'HelloHelloHelloHelloHello'

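String Methods

Strings have methods that perform various operations on the string data.

Example: stripping any leading or trailing whitespace.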

s = '  Hello '
t = s.strip()     # 'Hello'

Example: case conversion.

s = 'Hello'
l = s.lower()     # 'hello'
u = s.upper()     # 'HELLO'

Example: replacing text.

s = 'Hello world'
t = s.replace('Hello', 'Hallo')    # 'Hallo world'

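More string methods:

Strings have a wide variety of other methods for testing and manipulating text data.

This is a small sample of the available methods: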

s.endswith(suffix)     # Check if string ends with suffix
s.find(t)              # First occurrence of t in s (-1 if not found)
s.index(t)             # First occurrence of t in s (ValueError if not found)
s.isalpha()            # Check if characters are alphabetic
s.isdigit()            # Check if characters are numeric
s.islower()            # Check if characters are lower-case
s.isupper()            # Check if characters are upper-case
s.join(slist)          # Join a list of strings using s as delimiter
s.lower()              # Convert to lower case
s.replace(old,new)     # Replace text
s.rfind(t)             # Search for t from end of string (-1 if not found)
s.rindex(t)            # Search for t from end of string (ValueError if not found)
s.split([delim])       # Split string into list of substrings
s.startswith(prefix)   # Check if string starts with prefix
s.strip()              # Strip leading/trailing space
s.upper()              # Convert to upper case
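A few of the methods listed above are often used together. The following sketch, using a made-up comma-separated line, shows split() and join() as inverses and the difference between find() and index() when the substring is missing.

```python
line = 'AAPL,IBM,MSFT'

fields = line.split(',')          # ['AAPL', 'IBM', 'MSFT']
rejoined = ';'.join(fields)       # 'AAPL;IBM;MSFT'
print(fields)
print(rejoined)

pos = line.find('IBM')            # 5: index where 'IBM' starts
missing = line.find('GOOG')       # -1: find() reports "not found" as -1
print(pos, missing)

# index() does the same search but raises an exception instead.
try:
    line.index('GOOG')
except ValueError as err:
    print('index() raised ValueError:', err)

print(line.startswith('AAPL'), line.endswith('MSFT'))   # True True
```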

String Mutability

Strings are "immutable", or read-only. Once created, the value of a string can't be changed.

>>> s = 'Hello World'
>>> s[1] = 'a'
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: 'str' object does not support item assignment
>>>

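All operations and methods that manipulate string data always create new strings.

String Conversions

Use str() to convert any value to a string. The result is a string holding the same text that the print() function would have produced.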
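Since item assignment is not allowed, a "modified" string has to be built as a new one. The sketch below shows two common ways to do that. The names t and u are only for illustration.

```python
s = 'Hello World'

# replace() returns a new string; s itself is untouched.
t = s.replace('e', 'a')
print(t)        # 'Hallo World'

# Slicing and concatenation can rebuild the string around one position.
u = s[:1] + 'a' + s[2:]
print(u)        # 'Hallo World'

print(s)        # 'Hello World' (the original is unchanged)
```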

>>> x = 42
>>> str(x)
'42'
>>>
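One common reason to call str() is that + does not mix strings with other types automatically. The following sketch uses made-up values to show the conversion and the error it avoids.

```python
x = 42
pi = 3.14159
items = [1, 2, 3]

label = 'shares: ' + str(x)      # explicit conversion before concatenation
print(label)                     # 'shares: 42'
print(str(pi))                   # '3.14159'
print(str(items))                # '[1, 2, 3]'

# Without str(), concatenating a string and an integer fails.
try:
    'shares: ' + x
except TypeError as err:
    print('TypeError:', err)
```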

Byte Strings

A string of 8-bit bytes, commonly encountered with low-level I/O, is written as follows:

data = b'Hello World\r\n'

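Putting a lowercase b before the first quotation mark specifies a byte string as opposed to a text string. (Translator's note: a bytes literal may contain only ASCII characters directly; other byte values must be written with escapes such as \xff.) Most of the usual text string operations also work on byte strings.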

len(data)                         # 13
data[0:5]                         # b'Hello'
data.replace(b'Hello', b'Cruel')  # b'Cruel World\r\n'

Indexing a byte string is a bit different because it returns byte values as integers:

data[0]   # 72 (ASCII code for 'H')

Converting between byte strings and text strings:

text = data.decode('utf-8') # bytes -> text
data = text.encode('utf-8') # text -> bytes

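The 'utf-8' argument specifies a character encoding. Other common values include 'ascii' and 'latin1'.

Raw Strings

Raw strings are string literals in which backslashes are left uninterpreted. They are specified by prefixing the opening quote with a lowercase "r".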
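The choice of encoding matters as soon as the text contains non-ASCII characters. As a rough sketch, using a sample word written with escape codes so the example does not depend on the source file's encoding:

```python
text = '\xf1and\xfa'             # 'ñandú': 5 characters, 2 of them non-ASCII

utf8 = text.encode('utf-8')
latin = text.encode('latin1')

print(len(text))                 # 5 characters
print(len(utf8))                 # 7 bytes: 'ñ' and 'ú' take 2 bytes each
print(len(latin))                # 5 bytes: one byte per character

# Decoding with the encoding that was used recovers the text.
print(utf8.decode('utf-8') == text)      # True

# 'ascii' cannot represent these characters at all.
try:
    text.encode('ascii')
except UnicodeEncodeError as err:
    print('UnicodeEncodeError:', err)
```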

>>> rs = r'c:\newdata\test' # Raw (uninterpreted backslash)
>>> rs
'c:\\newdata\\test'

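The string is the literal text enclosed inside the quotes, exactly as typed. (The interpreter echoes each backslash as \\ because it displays the string in literal form; the string itself contains single backslashes.) This is useful in situations where the backslash has special significance, such as file names and regular expressions.

f-Strings

A string with formatted expression substitution.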
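Comparing the same literal with and without the r prefix makes the effect visible. In this small sketch the names normal and raw are chosen just for the comparison.

```python
normal = 'c:\newdata\test'    # \n and \t are interpreted as control characters
raw = r'c:\newdata\test'      # backslashes are kept as ordinary characters

print(len(normal))            # 13
print(len(raw))               # 15
print(raw)                    # c:\newdata\test
print(raw == 'c:\\newdata\\test')   # True: same text, written with escapes
```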

>>> name = 'IBM'
>>> shares = 100
>>> price = 91.1
>>> a = f'{name:>10s} {shares:10d} {price:10.2f}'
>>> a
'       IBM        100      91.10'
>>> b = f'Cost = ${shares*price:0.2f}'
>>> b
'Cost = $9110.00'
>>>

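Note: This requires Python 3.6 or newer. The meaning of the format codes is covered later.

Exercises

In these exercises, you'll experiment with operations on Python's string type. You should do this at the Python interactive prompt, where you can easily see the results. Important note:

In exercises where you are supposed to interact with the interpreter, >>> is the interpreter prompt that you get when Python wants you to type a new statement. Some statements in the exercises span multiple lines; to get these statements to run, you may have to press Enter a few times. As a reminder, do NOT type the >>> prompt when working through these examples.

Start by defining a string containing a series of stock ticker symbols like this: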
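Until the format codes are covered in detail, the following sketch may help with reading the example above. It only demonstrates a handful of codes (alignment, field width, and decimal places) and is not a full reference.

```python
name = 'IBM'
shares = 100
price = 91.1

left = f'{name:<10s}|'        # left-align in a field 10 characters wide
right = f'{name:>10s}|'       # right-align in a field 10 characters wide
print(left)                   # 'IBM       |'
print(right)                  # '       IBM|'

print(f'{shares:10d}')        # '       100'  integer, width 10
print(f'{shares:05d}')        # '00100'       integer, zero-padded to width 5
print(f'{price:10.2f}')       # '     91.10'  float, width 10, 2 decimals
print(f'{price:.1f}')         # '91.1'        float, 1 decimal, no padding
```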

>>> symbols = 'AAPL,IBM,MSFT,YHOO,SCO'
>>>

Exercise 1.13: Extracting individual characters and substrings

Strings are arrays of characters. Try extracting a few characters:

>>> symbols[0]
?
>>> symbols[1]
?
>>> symbols[2]
?
>>> symbols[-1]        # Last character
?
>>> symbols[-2]        # Negative indices are from end of string
?
>>>

In Python, strings are read-only.

Verify this by trying to change the first character of symbols to a lowercase 'a'.

>>> symbols[0] = 'a'
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: 'str' object does not support item assignment
>>>

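Exercise 1.14: String concatenation

Although string data is read-only, you can always reassign a variable to a newly created string.

Try the following statement, which concatenates a new symbol "GOOG" to the end of symbols: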

>>> symbols = symbols + 'GOOG'
>>> symbols
'AAPL,IBM,MSFT,YHOO,SCOGOOG'
>>>

Oops! That's not what we wanted. Fix it so that the symbols variable holds the value 'AAPL,IBM,MSFT,YHOO,SCO,GOOG'.

>>> symbols = ?
>>> symbols
'AAPL,IBM,MSFT,YHOO,SCO,GOOG'
>>>

Add 'HPQ' to the front of the symbols string:

>>> symbols = ?
>>> symbols
'HPQ,AAPL,IBM,MSFT,YHOO,SCO,GOOG'
>>>

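In these examples, it might look like the original string is being modified, in apparent violation of strings being read-only. That is not the case. Each of these operations creates an entirely new string. When the variable name symbols is reassigned, it points to the newly created string. Afterwards, the old string is destroyed because it is no longer in use.

Exercise 1.15: Membership testing (substring testing)

Experiment with the in operator to check for substrings. At the interactive prompt, try these operations: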
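One way to see that reassignment creates a new string rather than changing the old one is to keep a second name bound to the original. The sketch below uses an unrelated greeting string and a hypothetical alias variable.

```python
greeting = 'Hello'
alias = greeting                  # a second name for the same string object
print(alias is greeting)          # True

greeting = greeting + ' World'    # builds a new string and rebinds the name

print(greeting)                   # 'Hello World'
print(alias)                      # 'Hello' (the original string is unchanged)
print(alias is greeting)          # False: the names now refer to different objects
```

In this sketch the old string stays alive because alias still refers to it. In the exercise above nothing else refers to the old value, so it can be discarded.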

>>> 'IBM' in symbols
?
>>> 'AA' in symbols
True
>>> 'CAT' in symbols
?
>>>

Why did the check for 'AA' return True?

Exercise 1.16: String methods

At the Python interactive prompt, try experimenting with some of the string methods.

>>> symbols.lower()
?
>>> symbols
?
>>>

Remember, strings are always read-only. If you want to save the result of an operation, you need to place it in a variable:

>>> lowersyms = symbols.lower()
>>>

Try some more operations:

>>> symbols.find('MSFT')
?
>>> symbols[13:17]
?
>>> symbols = symbols.replace('SCO','DOA')
>>> symbols
?
>>> name = '   IBM   \n'
>>> name = name.strip()    # Remove surrounding whitespace
>>> name
?
>>>

Exercise 1.17: f-strings

Sometimes you want to create a string and embed the values of other variables in it.

To do that, use an f-string. For example:

>>> name = 'IBM'
>>> shares = 100
>>> price = 91.1
>>> f'{shares} shares of {name} at ${price:0.2f}'
'100 shares of IBM at $91.10'
>>>

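Modify the mortgage.py program from Exercise 1.10 to create its output using f-strings.

Try to make the output nicely aligned.

Exercise 1.18: Regular expressions

One limitation of the basic string operations is that they don't support any kind of advanced pattern matching. For that, you need to turn to Python's re module and regular expressions. Regular expression handling is a big topic, but here is a short example: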

>>> text = 'Today is 3/27/2018. Tomorrow is 3/28/2018.'
>>> # Find all occurrences of a date
>>> import re
>>> re.findall(r'\d+/\d+/\d+', text)
['3/27/2018', '3/28/2018']
>>> # Replace all occurrences of a date with replacement text
>>> re.sub(r'(\d+)/(\d+)/(\d+)', r'\3-\1-\2', text)
'Today is 2018-3-27. Tomorrow is 2018-3-28.'
>>>

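For more information about the re module, see the official documentation at https://docs.python.org/library/re.html.

Commentary

As you start to experiment with the interpreter, you will often want to know more about the operations supported by different objects. For example, how do you find out which operations are available on strings?

Depending on your Python environment, you might be able to see a list of available methods via tab completion. For example, try typing this: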
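As a further sketch building on the example above, re.sub() also accepts a function as the replacement, which allows each match to be reformatted. Here the helper to_iso is a hypothetical name, and the zero-padded output format is an assumption made for illustration.

```python
import re

text = 'Today is 3/27/2018. Tomorrow is 3/28/2018.'
date_pat = re.compile(r'(\d+)/(\d+)/(\d+)')

def to_iso(match):
    # Each parenthesized group in the pattern is available via groups().
    month, day, year = match.groups()
    return f'{year}-{int(month):02d}-{int(day):02d}'

iso_text = date_pat.sub(to_iso, text)
print(iso_text)      # 'Today is 2018-03-27. Tomorrow is 2018-03-28.'

# search() returns the first match, or None if there is no match.
first = date_pat.search(text)
print(first.group(0))     # '3/27/2018'
print(first.groups())     # ('3', '27', '2018')
```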

>>> s = 'hello world'
>>> s.<tab key>
>>>

If pressing the tab key doesn't do anything, you can fall back on the built-in dir() function. For example:

>>> s = 'hello'
>>> dir(s)
['__add__', '__class__', '__contains__', ..., 'find', 'format',
'index', 'isalnum', 'isalpha', 'isdigit', 'islower', 'isspace',
'istitle', 'isupper', 'join', 'ljust', 'lower', 'lstrip', 'partition',
'replace', 'rfind', 'rindex', 'rjust', 'rpartition', 'rsplit',
'rstrip', 'split', 'splitlines', 'startswith', 'strip', 'swapcase',
'title', 'translate', 'upper', 'zfill']
>>>

dir() produces a list of all operations that can appear after the dot (.).

Use the help() function to get more information about a specific operation:
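The list returned by dir() is an ordinary list of strings, so it can be filtered like any other. As a small sketch, the following keeps only the names that do not start with an underscore. The variable name public is chosen for this example.

```python
s = 'hello'

# Keep only the "public" names, skipping those such as __add__.
public = [name for name in dir(s) if not name.startswith('_')]

print('upper' in public)       # True
print('__add__' in public)     # False
print(public[:5])              # the first few public method names
```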

>>> help(s.upper)
Help on built-in function upper:

upper(...)
    S.upper() -> string

    Return a copy of the string S converted to uppercase.
>>>

Note: the complete Chinese translation is available at https://github.com/codists/practical-python-zh

Translation of "Practical Python Programming", 01_04_Strings