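[Python] Regular Expressions

Regular expressions

These notes mostly follow http://funhacks.net/2016/12/27/regular_expression/, with a few small changes. I found the explanation quite clear, so I am recording it here.

A regular expression is a pattern that matches fragments of text.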

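Characters

Metacharacter	Description	Example pattern	Matches
Ordinary character	Matches itself	py	python
.	Matches any character except the newline '\n'	p.t	pyt
\	Escape character: the metacharacter that follows is matched literally	python\.org	python.org
[...]	Character set: the position can hold any one character from the set. Characters can be listed individually, as in [1234], or given as a range, as in [1-4]. A leading ^ negates the set, so [^1-4] matches any character other than 1, 2, 3 and 4. Most special characters lose their special meaning inside a set; the exceptions are \, ], a leading ^, and a - between two characters	p[xyz]t	pxt, pyt, pzt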
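To see the table in action, here is a minimal sketch for Python 3. The sample strings are made up for illustration.

```python
import re

# '.' matches any single character except a newline.
dot_hits = re.findall(r'p.t', 'pyt pat p\nt')

# Escaping the dot makes it literal, so 'pythonXorg' is skipped.
literal = re.search(r'python\.org', 'pythonXorg python.org')

# A character set matches exactly one character from the set;
# a leading '^' negates the set.
set_hits = re.findall(r'p[xyz]t', 'pxt pyt pzt pat')
negated_hits = re.findall(r'[^1-4]', '12a45')

print(dot_hits)         # ['pyt', 'pat']
print(literal.group())  # python.org
print(set_hits)         # ['pxt', 'pyt', 'pzt']
print(negated_hits)     # ['a', '5']
```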

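Predefined character classes

Metacharacter	Description	Example pattern	Matches
\d	Digit: [0-9]	p\dy	p3y
\D	Non-digit: [^\d]	p\Dy	pay
\s	Whitespace: [ \t\r\n\f\v]	p\sy	p y
\S	Non-whitespace: [^\s]	p\Sy	pay
\w	Word character: [a-zA-Z0-9_]	p\wy	pay
\W	Non-word character: [^\w]	p\Wy	p y

The bracket forms above describe ASCII matching. In Python 3, str patterns are Unicode-aware by default, so \d, \s and \w also match the corresponding non-ASCII characters unless the re.ASCII flag is set.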
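A small sketch (Python 3, with an invented sample string) that runs each predefined class against the same text, so the differences are easy to compare:

```python
import re

text = 'p3y pay p y p_y'

digits = re.findall(r'p\dy', text)      # \d: a digit
non_digits = re.findall(r'p\Dy', text)  # \D: anything but a digit
spaces = re.findall(r'p\sy', text)      # \s: a whitespace character
words = re.findall(r'p\wy', text)       # \w: letter, digit or underscore
non_words = re.findall(r'p\Wy', text)   # \W: anything \w does not match

print(digits)      # ['p3y']
print(non_digits)  # ['pay', 'p y', 'p_y']
print(spaces)      # ['p y']
print(words)       # ['p3y', 'pay', 'p_y']
print(non_words)   # ['p y']
```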

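Quantifiers

Metacharacter	Description	Example pattern	Matches
*	Matches the preceding element any number of times, including zero	pyt*	py/pyttttt
+	Matches the preceding element at least once	pyt+	pyt/pyttt
?	Matches the preceding element zero times or once	pyt?	py/pyt
{m}	Matches the preceding element exactly m times	py{2}t	pyyt
{m,n}	Matches the preceding element m to n times. If m is omitted, the range is 0 to n; if n is omitted, it is m to unlimited	py{1,2}t	pyt/pyyt
*?	Makes * non-greedy; +?, ?? and {m,n}? work the same way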
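The quantifiers are easiest to compare side by side. This is only a sketch for Python 3, using a made-up string with a growing number of t's:

```python
import re

text = 'py pyt pytt pyttt'

star = re.findall(r'pyt*', text)        # zero or more 't'
plus = re.findall(r'pyt+', text)        # one or more 't'
optional = re.findall(r'pyt?', text)    # zero or one 't'
exact = re.findall(r'pyt{2}', text)     # exactly two 't'
ranged = re.findall(r'pyt{1,2}', text)  # one to two 't'
lazy = re.findall(r'pyt+?', text)       # non-greedy: as few 't' as possible

print(star)      # ['py', 'pyt', 'pytt', 'pyttt']
print(plus)      # ['pyt', 'pytt', 'pyttt']
print(optional)  # ['py', 'pyt', 'pyt', 'pyt']
print(exact)     # ['pytt', 'pytt']
print(ranged)    # ['pyt', 'pytt', 'pytt']
print(lazy)      # ['pyt', 'pyt', 'pyt']
```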

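Boundary matching

Metacharacter	Description	Example pattern	Matches
^	Matches the start of the string; in multi-line mode, also the start of every line	^python	python
$	Matches the end of the string (or just before a trailing newline); in multi-line mode, also the end of every line	python$	python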
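The difference between the default mode and multi-line mode can be sketched like this (Python 3; the two-line sample text is invented):

```python
import re

text = 'python rocks\nI like python'

# Without re.M, '^' and '$' anchor only at the start and end of the whole string.
start_default = re.findall(r'^\w+', text)
end_default = re.findall(r'\w+$', text)

# With re.M (multi-line mode) they also anchor at every line break.
starts_multiline = re.findall(r'^\w+', text, re.M)
ends_multiline = re.findall(r'\w+$', text, re.M)

print(start_default)     # ['python']
print(end_default)       # ['python']
print(starts_multiline)  # ['python', 'I']
print(ends_multiline)    # ['rocks', 'python']
```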

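Logical OR

Metacharacter	Description	Example pattern	Matches
|	Tries the expression to the left of | first; if it matches, the right-hand expression is skipped, otherwise the right-hand expression is tried	15(0|1|2|3)	150, 151, 152, 153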
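A minimal sketch of alternation in Python 3. The non-capturing group (?:...) is my addition and is not part of the table; it is used so that findall returns the whole match rather than the group contents.

```python
import re

numbers = '150 151 152 153 154'

# Grouping limits '|' to the last digit only.
grouped = re.findall(r'15(?:0|1|2|3)', numbers)

# Without the group, '|' splits the entire pattern: '150', or '1', or '2', or '3'.
ungrouped = re.findall(r'150|1|2|3', numbers)

# Alternatives are tried left to right, and the first one that matches wins.
first_wins = re.search(r'py|python', 'python').group()

print(grouped)     # ['150', '151', '152', '153']
print(first_wins)  # py
```

Because the leftmost alternative wins, it is usually safer to put the longer alternative first when one is a prefix of another.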

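Groups

Metacharacter	Description	Example pattern	Matches
(...)	The enclosed expression is treated as a group. Groups are numbered from the left: each opening parenthesis increases the number by 1	(\d+)	123
\<number>	Refers to the string matched by the group numbered <number>	(\d)abd\1	1abd1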
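Here is a short sketch (Python 3, invented sample strings) of how groups are numbered and how a backreference behaves:

```python
import re

# Groups are numbered by the position of their opening parenthesis,
# so the outer group is 1 and the two inner groups are 2 and 3.
m = re.match(r'((\d+)-(\d+)) (\w+)', '2016-12 regex')
whole, outer, year, month, word = m.group(0, 1, 2, 3, 4)

# \1 must repeat exactly what group 1 captured.
same_digit = re.search(r'(\d)abd\1', '1abd1')
different_digit = re.search(r'(\d)abd\1', '1abd2')

print(outer, year, month, word)  # 2016-12 2016 12 regex
print(same_digit.group())        # 1abd1
print(different_digit)           # None
```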

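The re module

The usual steps for using the re module are:

1. Use the compile function to compile the string form of a regular expression into a Pattern object.
2. Use the methods of the Pattern object to match or search the text, which yields the result (a Match object).
3. Use the attributes and methods of the Match object to extract information and do whatever else is needed.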
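Put together, the three steps look roughly like this. It is a minimal Python 3 sketch and the input string is made up:

```python
import re

# Step 1: compile the pattern string into a Pattern object.
pattern = re.compile(r'\d+')

# Step 2: call a Pattern method to get a Match object (or None).
match = pattern.search('order 66 shipped')

# Step 3: read the result from the Match object.
if match:
    found = match.group()  # the matched text
    where = match.span()   # (start, end) indices in the original string
    print(found, where)    # 66 (6, 8)
```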

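The compile function

The compile function compiles a regular expression and produces a Pattern object. Its general form is:

re.compile(pattern[, flags])

Here pattern is a regular expression in string form, and flags is an optional argument that sets the matching mode, such as ignoring case or multi-line mode.

import re

pattern = re.compile(r'\d+')
# Compiles the regular expression "one or more digits" into a Pattern object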
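The flags argument deserves a quick illustration. This sketch (Python 3, invented strings) uses re.I and re.M, the two modes mentioned above, and combines them with the bitwise OR operator:

```python
import re

# re.I (re.IGNORECASE): letters match regardless of case.
case_insensitive = re.compile(r'python', re.I)
hits = case_insensitive.findall('Python PYTHON python')

# Flags are combined with '|'. re.M lets '^' match at the start of each line.
both = re.compile(r'^python', re.I | re.M)
combined_hits = both.findall('Python\npython')

print(hits)           # ['Python', 'PYTHON', 'python']
print(combined_hits)  # ['Python', 'python']
```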

Above, we compiled a regular expression into a Pattern object. We can now use the methods of pattern to match and search text. The most commonly used Pattern methods are:

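The match method

The match method looks for a match at the start of the string (or at a start position you specify). It performs a single match: it returns as soon as it finds one result rather than looking for all of them. Its general form is:

match(string[, pos[, endpos]])

Here string is the string to match, and pos and endpos are optional arguments giving the start and end positions within the string. They default to 0 and len(string). So when pos and endpos are not given, match tries the pattern at the very start of the string.

On success it returns a Match object; if nothing matches, it returns None.

>>> import re
>>> pattern = re.compile(r'\d+')                     # matches one or more digits
>>> m = pattern.match('one12twothree34four')         # try at the start: no match
>>> print(m)
None
>>> m = pattern.match('one12twothree34four', 2, 10)  # start at 'e': no match
>>> print(m)
None
>>> m = pattern.match('one12twothree34four', 3, 10)  # start at '1': matches
>>> print(m)                                         # a Match object
<re.Match object; span=(3, 5), match='12'>
>>> m.group(0)   # the 0 can be omitted
'12'
>>> m.start(0)   # the 0 can be omitted
3
>>> m.end(0)     # the 0 can be omitted
5
>>> m.span(0)    # the 0 can be omitted
(3, 5)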

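In the session above, a successful match returns a Match object, where:

group([group1, ...]) returns the string matched by one or more groups; to get the entire matched substring, use group() or group(0);
start([group]) returns the start position, within the whole string, of the substring matched by the group (the index of its first character); the argument defaults to 0;
end([group]) returns the end position, within the whole string, of the substring matched by the group (the index of its last character + 1); the argument defaults to 0;
span([group]) returns (start(group), end(group)).

Here is another example:

>>> import re
>>> pattern = re.compile(r'([a-z]+) ([a-z]+)', re.I)   # re.I ignores case
>>> m = pattern.match('Hello World Wide Web')
>>> print(m)                              # match succeeded: a Match object
<re.Match object; span=(0, 11), match='Hello World'>
>>> m.group(0)                            # the entire matched substring
'Hello World'
>>> m.span(0)                             # indices of the entire matched substring
(0, 11)
>>> m.group(1)                            # substring matched by the first group
'Hello'
>>> m.span(1)                             # indices of the first group's substring
(0, 5)
>>> m.group(2)                            # substring matched by the second group
'World'
>>> m.span(2)                             # indices of the second group's substring
(6, 11)
>>> m.groups()                            # same as (m.group(1), m.group(2), ...)
('Hello', 'World')
>>> m.group(3)                            # there is no third group
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
IndexError: no such group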

The search method

The search method looks for a match anywhere in the string. It is also a single match: it returns as soon as it finds one result rather than looking for all of them. Its general form is:

search(string[, pos[, endpos]])

Here string is the string to search, and pos and endpos are optional arguments giving the start and end positions within the string. They default to 0 and len(string).

On success it returns a Match object; if nothing matches, it returns None.

>>> import re
>>> pattern = re.compile(r'\d+')
>>> m = pattern.search('one12twothree34four')  # match() would fail here
>>> m
<re.Match object; span=(3, 5), match='12'>
>>> m.group()
'12'
>>> m = pattern.search('one12twothree34four', 10, 30)  # restrict the search region
>>> m
<re.Match object; span=(13, 15), match='34'>
>>> m.group()
'34'
>>> m.span()
(13, 15)

Another example:

import re
# Compile the regular expression into a Pattern object
pattern = re.compile(r'\d+')
# search() looks for a matching substring and returns None if there is none
# match() would not succeed here
m = pattern.search('hello 123456 789')
if m:
    # Use the Match object to get the result
    print('matching string:', m.group())
    print('position:', m.span())

Output:

matching string: 123456
position: (6, 12)

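The findall method

The match and search methods above are both single matches: they return as soon as one result is found. Most of the time, however, we need to search the whole string and get every match.
The general form of findall is:

findall(string[, pos[, endpos]])

Here string is the string to search, and pos and endpos are optional arguments giving the start and end positions within the string. They default to 0 and len(string).

findall returns all matching substrings as a list; if nothing matches, it returns an empty list.

import re

pattern = re.compile(r'\d+')   # find digits
result1 = pattern.findall('hello 123456 789')
result2 = pattern.findall('one1two2three3four4', 0, 10)

print(result1)
print(result2)

Output:

['123456', '789']
['1', '2']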
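One detail the example above does not show: when the pattern contains capturing groups, findall returns the group contents instead of the whole match. A small Python 3 sketch with a made-up string:

```python
import re

text = 'width=100, height=200'

no_groups = re.findall(r'\w+=\d+', text)       # whole matches
one_group = re.findall(r'\w+=(\d+)', text)     # contents of the single group
two_groups = re.findall(r'(\w+)=(\d+)', text)  # one tuple per match

print(no_groups)   # ['width=100', 'height=200']
print(one_group)   # ['100', '200']
print(two_groups)  # [('width', '100'), ('height', '200')]
```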

The finditer method

finditer behaves much like findall: it also searches the whole string and finds every match. The difference is that it returns an iterator that yields each result (a Match object) in order.

import re

pattern = re.compile(r'\d+')
result_iter1 = pattern.finditer('hello 123456 789')
result_iter2 = pattern.finditer('one1two2three3four4', 0, 10)
print(type(result_iter1))
print(type(result_iter2))
print('result1...')
for m1 in result_iter1:   # m1 is a Match object
    print('matching string: {}, position: {}'.format(m1.group(), m1.span()))
print('result2...')
for m2 in result_iter2:
    print('matching string: {}, position: {}'.format(m2.group(), m2.span()))

Output:

<class 'callable_iterator'>
<class 'callable_iterator'>
result1...
matching string: 123456, position: (6, 12)
matching string: 789, position: (13, 16)
result2...
matching string: 1, position: (3, 4)
matching string: 2, position: (7, 8)

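The split method

The split method splits the string at every substring the pattern matches and returns the pieces as a list. Its general form is:

split(string[, maxsplit])

Here maxsplit sets the maximum number of splits; if it is not given, the string is split at every match.

import re

p = re.compile(r'[\s,;]+')
print(p.split('a,b;; c   d'))

Output:

['a', 'b', 'c', 'd']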
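The example above does not use maxsplit, so here is a short Python 3 sketch of its effect on the same input:

```python
import re

p = re.compile(r'[\s,;]+')
text = 'a,b;; c   d'

all_parts = p.split(text)              # no limit
one_split = p.split(text, maxsplit=1)  # stop after the first separator
two_splits = p.split(text, 2)          # maxsplit may also be passed positionally

print(all_parts)   # ['a', 'b', 'c', 'd']
print(one_split)   # ['a', 'b;; c   d']
print(two_splits)  # ['a', 'b', 'c   d']
```

Whatever is left after the last allowed split is returned untouched as the final list element.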

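The sub method

The sub method performs replacement. Its general form is:

sub(repl, string[, count])

Here repl can be either a string or a function:

If repl is a string, every matching substring is replaced with repl and the new string is returned. repl can also reference groups in the form \id, but the number 0 cannot be used this way (use \g<0> for the whole match);
If repl is a function, it should accept a single argument (a Match object) and return the replacement string. The returned string is inserted as is, so it cannot contain group references.

count sets the maximum number of replacements; if it is not given, every match is replaced.

import re

p = re.compile(r'(\w+) (\w+)')
s = 'hello 123, hello 456'

def func(m):
    return 'hi' + ' ' + m.group(2)

print(p.sub(r'hello world', s))  # replace 'hello 123' and 'hello 456' with 'hello world'
print(p.sub(r'\2 \1', s))        # reference the groups
print(p.sub(func, s))
print(p.sub(func, s, 1))         # replace at most once

Output:

hello world, hello world
123 hello, 456 hello
hi 123, hi 456
hi 123, hello 456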

The subn method

subn behaves like sub and is also used for replacement. Its general form is:

subn(repl, string[, count])

It returns a tuple:

(sub(repl, string[, count]), number of replacements)

The tuple has two elements: the first is the result that sub would return, and the second is the number of replacements made in the original string.

import re

p = re.compile(r'(\w+) (\w+)')
s = 'hello 123, hello 456'

def func(m):
    return 'hi' + ' ' + m.group(2)

print(p.subn(r'hello world', s))
print(p.subn(r'\2 \1', s))
print(p.subn(func, s))
print(p.subn(func, s, 1))

Output:

('hello world, hello world', 2)
('123 hello, 456 hello', 2)
('hi 123, hi 456', 2)
('hi 123, hello 456', 1)

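Other functions

In fact, most methods of the Pattern object produced by compile have a counterpart among the functions of the re module, with slight differences in usage.

The match function

The general form of the match function is:

re.match(pattern, string[, flags])

As you can see, the match function cannot restrict matching to a region of the string; it can only match at the start.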
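The difference can be sketched as follows (Python 3). Slicing the string is only a workaround I am suggesting here, and it changes the reported positions:

```python
import re

text = 'one12twothree34four'

# The module-level function always starts at index 0.
module_level = re.match(r'\d+', text)

# The compiled Pattern method can start elsewhere through pos.
compiled = re.compile(r'\d+').match(text, 3)

# With a slice, positions are relative to the slice, not to the original string.
sliced = re.match(r'\d+', text[3:])

print(module_level)     # None
print(compiled.span())  # (3, 5)
print(sliced.span())    # (0, 2)
```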

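The search function

The general form of the search function is:

re.search(pattern, string[, flags])

The search function cannot restrict the search to a region of the string; otherwise it is used like the search method of a Pattern object.

The findall function

re.findall(pattern, string[, flags])

The findall function cannot restrict the search to a region of the string; otherwise it is used like the findall method of a Pattern object.

The finditer function

The finditer function is used like the finditer method of a Pattern object. Its form is:

re.finditer(pattern, string[, flags])

The split function

The general form of the split function is:

re.split(pattern, string[, maxsplit[, flags]])

The sub function

re.sub(pattern, repl, string[, count[, flags]])

The subn function

The general form of the subn function is:

re.subn(pattern, repl, string[, count[, flags]])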

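Choosing between the two approaches

1. Use re.compile to create a Pattern object, then use the methods of the Pattern object to match and search text;
2. Call re.match, re.search, re.findall and the other module-level functions directly on the text.

If the same regular expression is needed repeatedly, prefer the first approach.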
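A rough sketch of the first approach in Python 3, with an invented pattern name and sample lines. Note that the re module also keeps an internal cache of recently compiled patterns, so the benefit is mainly a named, reusable object and no repeated cache lookup.

```python
import re

# Compiled once, reused for every line.
DATE = re.compile(r'(\d{4})-(\d{2})-(\d{2})')

lines = [
    'released 2016-12-27',
    'no date here',
    'updated 2017-01-03',
]

dates = []
for line in lines:
    m = DATE.search(line)
    if m:
        dates.append(m.groups())

print(dates)  # [('2016', '12', '27'), ('2017', '01', '03')]
```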

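Matching Chinese

Sometimes we want to match the Chinese characters in a piece of text. One thing to note is that Chinese characters mostly fall in the Unicode range [\u4e00-\u9fa5]. I say "mostly" because this range is not complete; for example, it does not include full-width (Chinese) punctuation. In most cases, though, it should be enough.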

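Suppose we want to extract the Chinese from the string title = '你好，hello，世界'. We can do this:

import re
title = '你好，hello，世界'
pattern = re.compile(r'[\u4e00-\u9fa5]+')
result = pattern.findall(title)
print(result)

Note the r prefix on the pattern: it makes the string a raw string, and the re module itself interprets the \u escapes. In Python 2 the pattern also needed the u prefix (ur'...') to make it a unicode string; Python 3 strings are already Unicode, and the ur prefix is no longer valid syntax.

Output:

['你好', '世界']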
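To check the remark about punctuation, this small Python 3 sketch tests whether the full-width comma falls inside the range. The variable names are mine.

```python
import re

title = '你好，hello，世界'
han = re.compile(r'[\u4e00-\u9fa5]+')

# The full-width comma is U+FF0C, which lies outside the range,
# so it is not matched and separates the two results.
words = han.findall(title)
comma_in_range = '\u4e00' <= '，' <= '\u9fa5'
code_point = hex(ord('，'))

print(words)           # ['你好', '世界']
print(comma_in_range)  # False
print(code_point)      # 0xff0c
```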

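Greedy matching

In Python, regular expression matching is greedy by default (a few languages may default to non-greedy), which means it matches as many characters as possible.

For example, suppose we want to find all the div blocks in a string: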

import re
content = 'aa<div>test1</div>bb<div>test2</div>cc'
pattern = re.compile(r'<div>.*</div>')
result = pattern.findall(content)
print(result)

Output:

['<div>test1</div>bb<div>test2</div>']

Because matching is greedy, .* takes as many characters as it can and only gives some back until </div> can still match. The match therefore ends at the last </div> rather than the first one.

For a non-greedy match, add a ? after the quantifier:

import re
content = 'aa<div>test1</div>bb<div>test2</div>cc'
pattern = re.compile(r'<div>.*?</div>')    # note the added ?
result = pattern.findall(content)
print(result)

Output:

['<div>test1</div>', '<div>test2</div>']
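The same ? suffix works on the other quantifiers too. A brief Python 3 sketch, reusing the content string from above plus a made-up digit string:

```python
import re

content = 'aa<div>test1</div>bb<div>test2</div>cc'

greedy = re.findall(r'<div>.+</div>', content)
lazy = re.findall(r'<div>.+?</div>', content)

# {m,n}? takes as few repetitions as allowed; {m,n} takes as many.
fewest = re.search(r'\d{2,4}?', '123456').group()
most = re.search(r'\d{2,4}', '123456').group()

print(greedy)        # ['<div>test1</div>bb<div>test2</div>']
print(lazy)          # ['<div>test1</div>', '<div>test2</div>']
print(fewest, most)  # 12 1234
```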
