Python Web Scraping: The re Regular Expression Library

Posted by LTQblog on 2017-06-14

Original URL: https://blog.csdn.net/qq_22186119/article/details/73252420

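A regular expression is an expression that concisely describes a set of strings.

Compiling: converting a string that follows regular expression syntax into a regular expression pattern object.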

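Operator  Description                                                   Example
.         Any single character (newline excluded by default)
[ ]       Character set; the allowed values for a single character       [abc] matches a, b, or c; [a-z] matches one character from a to z
[^ ]      Negated set; the excluded values for a single character        [^abc] matches one character other than a, b, or c
*         Zero or more repetitions of the preceding character            abc* matches ab, abc, abcc, abccc, ...
+         One or more repetitions of the preceding character             abc+ matches abc, abcc, abccc, ...
?         Zero or one repetition of the preceding character              abc? matches ab, abc
|         Either the left or the right expression                        abc|def matches abc, def
{m}       Exactly m repetitions of the preceding character               ab{2}c matches abbc
{m,n}     m to n repetitions (n included) of the preceding character     ab{1,2}c matches abc, abbc
^         Start of the string                                            ^abc matches abc at the start of a string
$         End of the string                                              abc$ matches abc at the end of a string
( )       Grouping; the contents are treated as a single unit            (abc) matches abc; (abc|def) matches abc, def
\d        Digit; equivalent to [0-9] for ASCII text (for str patterns, Python 3 also matches other Unicode digits unless re.ASCII is set)
\w        Word character; equivalent to [A-Za-z0-9_] for ASCII text (for str patterns, Python 3 also matches other Unicode word characters unless re.ASCII is set)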

Regular expression      Matching strings
P(Y|YT|YTH|YTHO)?N      'PN', 'PYN', 'PYTN', 'PYTHN', 'PYTHON'
PYTHON+                 'PYTHON', 'PYTHONN', 'PYTHONNN', ...
PY[TH]ON                'PYTON', 'PYHON'
PY[^TH]?ON              'PYON', 'PYaON', 'PYbON', 'PYcON', ...
PY{0,3}N                'PN', 'PYN', 'PYYN', 'PYYYN'
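
The rows of this table can be checked mechanically. The sketch below is only an illustration: it assumes the last row is meant as PY{0,3}N (zero to three Y characters), and the names `examples` and `check_examples` are made up for this example. It uses re.fullmatch() so that each string must match its pattern from start to end.

```python
import re

# Each pattern from the table, paired with strings it should fully match.
examples = {
    r'P(Y|YT|YTH|YTHO)?N': ['PN', 'PYN', 'PYTN', 'PYTHN', 'PYTHON'],
    r'PYTHON+': ['PYTHON', 'PYTHONN', 'PYTHONNN'],
    r'PY[TH]ON': ['PYTON', 'PYHON'],
    r'PY[^TH]?ON': ['PYON', 'PYaON', 'PYbON', 'PYcON'],
    r'PY{0,3}N': ['PN', 'PYN', 'PYYN', 'PYYYN'],  # assumed reading of the last row
}


def check_examples(table):
    """Return True only if every string fully matches its pattern."""
    return all(
        re.fullmatch(pattern, text) is not None
        for pattern, strings in table.items()
        for text in strings
    )


print(check_examples(examples))  # True
```

A string with four Y characters, such as 'PYYYYN', is rejected by PY{0,3}N, which is why that row has no trailing ellipsis.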

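Main functions of the re library

The re library is part of the Python standard library and is mainly used for string matching.
Regular expressions are usually written as raw strings, in the form r'text'. In a raw string, backslashes are not interpreted as escape sequences, so they do not have to be escaped a second time.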
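
A small sketch of why the raw string form matters, using the word-boundary operator \b as an assumed example (it is not covered in the operator table above). The variable names are illustrative only.

```python
import re

# In a normal string, '\b' is the backspace character (one character);
# in a raw string, r'\b' is a backslash followed by 'b' (two characters).
print(len('\b'), len(r'\b'))  # 1 2

text = 'cat concat'

# The raw form reaches the regex engine as the word-boundary operator.
raw_hits = re.findall(r'\bcat\b', text)

# The non-raw form looks for literal backspace characters and finds nothing.
plain_hits = re.findall('\bcat\b', text)

print(raw_hits)    # ['cat']
print(plain_hits)  # []
```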

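re.search()     Searches a string for the first location that matches the regular expression; returns a match object
re.match()      Matches the regular expression starting at the beginning of a string; returns a match object
re.findall()    Searches a string and returns all matching substrings as a list
re.split()      Splits a string at the regular expression matches; returns a list
re.finditer()   Searches a string and returns an iterator over the matches; each element is a match object
re.sub()        Replaces every substring that matches the regular expression; returns the resulting string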

re.match

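match() tries to match starting at the beginning of the string.
If the match succeeds, it returns a match object; otherwise it returns None.
Call the match object's group() method to get the matched text.
Syntax:

re.match(pattern, string, flags=0)

pattern : the regular expression, as a string or raw string
string : the string to match against
flags : control flags applied when the regular expression is used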

>>> import re
>>> m = re.match('hello','helloworld')
>>> m.group()
'hello'

To be safe, check the result before using it:

>>> import re
>>> m = re.match('hello','helloworld')
>>> if m is not None:  # or simply: if m:
...     m.group()
...
'hello'

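You can also take advantage of Python's object-oriented style and chain the calls, skipping the intermediate variable (only safe when a match is guaranteed, since None has no group() method):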

>>> import re
>>> re.match('hello','helloworld').group()
'hello'

However, match() only matches at the start of the string, so the following does not give the result a beginner might expect:

>>> import re
>>> m = re.match(r'[1-9]\d{5}', 'BIT 100081')
>>> m.group()
Traceback (most recent call last):
  File "<pyshell#16>", line 1, in <module>
    m.group()
AttributeError: 'NoneType' object has no attribute 'group'

What now? See search() in the next section.

re.search

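Syntax:

re.search(pattern, string, flags=0)

Searches a string for the first location that matches the regular expression and returns a match object, or None if nothing matches.
Parameters:
- pattern : the regular expression, as a string or raw string
- string : the string to match against
- flags : control flags applied when the regular expression is used

Common flags

Flag                  Description
re.I, re.IGNORECASE   Case-insensitive matching; [A-Z] also matches lowercase characters
re.M, re.MULTILINE    The ^ operator also matches at the start of each line of the string
re.S, re.DOTALL       The . operator matches every character; by default it matches everything except newline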

>>> import re
>>> m = re.search(r'[1-9]\d{5}', 'BIT 100081')
>>> m.group()
'100081'
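
The three flags from the table can be tried on a short two-line string. This is a minimal sketch; the sample text and variable names are made up for illustration.

```python
import re

text = 'first line\nSecond LINE'

# re.I: the lowercase pattern also matches the uppercase 'LINE'.
ignore_case = re.findall(r'line', text, flags=re.I)

# re.M: ^ matches at the start of every line, not only the start of the string.
line_starts = re.findall(r'^\w+', text, flags=re.M)

# re.S: . also matches the newline, so one match can span both lines.
without_dotall = re.search(r'first.*LINE', text)
with_dotall = re.search(r'first.*LINE', text, flags=re.S)

print(ignore_case)           # ['line', 'LINE']
print(line_starts)           # ['first', 'Second']
print(without_dotall)        # None
print(with_dotall.group(0) == text)  # True
```

Several flags can be combined with the | operator, for example flags=re.I | re.M.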

re.findall()

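findall() searches a string and returns all matching substrings as a list.

re.findall(pattern, string, flags=0)

Parameters:
- pattern : the regular expression, as a string or raw string
- string : the string to match against
- flags : control flags applied when the regular expression is used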

>>> import re
>>> ls = re.findall(r'[1-9]\d{5}', 'BIT100081 TSU100084')
>>> ls
['100081', '100084']
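
One detail worth knowing: when the pattern contains capturing groups, findall() returns the group contents rather than the whole match. The sketch below extends the postal-code example with an assumed three-letter prefix pattern, [A-Z]{3}, to show the three cases.

```python
import re

text = 'BIT100081 TSU100084'

# No group: each list element is the whole match.
whole = re.findall(r'[A-Z]{3}[1-9]\d{5}', text)

# One group: each element is just that group's text.
codes = re.findall(r'[A-Z]{3}([1-9]\d{5})', text)

# Two groups: each element is a tuple with one entry per group.
pairs = re.findall(r'([A-Z]{3})([1-9]\d{5})', text)

print(whole)  # ['BIT100081', 'TSU100084']
print(codes)  # ['100081', '100084']
print(pairs)  # [('BIT', '100081'), ('TSU', '100084')]
```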

re.split()

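Splits a string at the regular expression matches and returns a list.

Syntax:

re.split(pattern, string, maxsplit=0, flags=0)

Parameters:
- pattern : the regular expression, as a string or raw string
- string : the string to match against
- maxsplit : maximum number of splits; the remainder is returned as the last element (0 means no limit)
- flags : control flags applied when the regular expression is used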

>>> import re
>>> re.split(r'[1-9]\d{5}', 'BIT100081 TSU100084')
['BIT', ' TSU', '']
>>> re.split(r'[1-9]\d{5}', 'BIT100081 TSU100084', maxsplit=1)
['BIT', ' TSU100084']
>>> re.split(r'[1-9]\d{5}', 'BIT100081 TSU100084', maxsplit=2)
['BIT', ' TSU', '']
>>> re.split(r'[1-9]\d{5}', 'BIT100081 TSU100084', maxsplit=3)
['BIT', ' TSU', '']
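
Two further uses of re.split(), sketched under assumptions: the first input line and its separator set are invented for illustration, and the second call reuses the postal-code pattern with a capturing group added, which makes split() keep the separators in the result.

```python
import re

line = 'BIT, TSU;PKU   ZJU'

# Split on any run of commas, semicolons, or whitespace.
fields = re.split(r'[,;\s]+', line)

# A capturing group in the pattern keeps the matched separators in the list.
with_separators = re.split(r'([1-9]\d{5})', 'BIT100081 TSU100084')

print(fields)           # ['BIT', 'TSU', 'PKU', 'ZJU']
print(with_separators)  # ['BIT', '100081', ' TSU', '100084', '']
```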

finditer()

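Searches a string and returns an iterator over the matches; each element is a match object.
Syntax:

re.finditer(pattern, string, flags=0)

pattern : the regular expression, as a string or raw string
string : the string to match against
flags : control flags applied when the regular expression is used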

>>> import re
>>> for m in re.finditer(r'[1-9]\d{5}', 'BIT100081 TSU100084'):
...     print(m.group(0))
...
100081
100084
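
Because finditer() yields match objects, each result also carries its position, which findall() cannot provide. A minimal sketch (the name `found` is illustrative):

```python
import re

text = 'BIT100081 TSU100084'

# Collect (matched text, start, end) for every postal code in the string.
found = [(m.group(0), m.start(), m.end())
         for m in re.finditer(r'[1-9]\d{5}', text)]

print(found)  # [('100081', 3, 9), ('100084', 13, 19)]
```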

sub()

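Replaces every substring that matches the regular expression and returns the resulting string.
Syntax:

re.sub(pattern, repl, string, count=0, flags=0)

Parameters:
- pattern : the regular expression, as a string or raw string
- repl : the string that replaces each match
- string : the string to match against
- count : maximum number of replacements (0 means replace all matches)
- flags : control flags applied when the regular expression is used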

>>> import re
>>> re.sub(r'[1-9]\d{5}', ':zipcode', 'BIT100081 TSU100084')
'BIT:zipcode TSU:zipcode'
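
The count parameter, and the fact that repl may also be a function, can be sketched as follows. The helper `mask` is hypothetical and only serves the illustration.

```python
import re

text = 'BIT100081 TSU100084'

# count=1 replaces only the first match.
first_only = re.sub(r'[1-9]\d{5}', ':zipcode', text, count=1)


def mask(match):
    """Keep the first two digits and mask the rest (illustrative helper)."""
    code = match.group(0)
    return code[:2] + '*' * (len(code) - 2)


# repl can be a function that receives each match object
# and returns the replacement text.
masked = re.sub(r'[1-9]\d{5}', mask, text)

print(first_only)  # BIT:zipcode TSU100084
print(masked)      # BIT10**** TSU10****
```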

compile()

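Compiles the string form of a regular expression into a regular expression object.
Syntax:

regex = re.compile(pattern, flags=0)

Parameters:
- pattern : the regular expression, as a string or raw string
- flags : control flags applied when the regular expression is used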

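An equivalent way to use the re library

Functional usage: a one-off operation

>>> m = re.search(r'[1-9]\d{5}', 'BIT 100081')

Object-oriented usage: compile once, then reuse for multiple operations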

>>> regex = re.compile(r'[1-9]\d{5}')
>>> rst = regex.search('BIT 100081')
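
The benefit of the compiled form shows when the same pattern is applied many times. A minimal sketch, with made-up input lines and names:

```python
import re

# Compile once, then reuse the pattern object for several operations.
zipcode = re.compile(r'[1-9]\d{5}')

lines = ['BIT 100081', 'no code here', 'TSU100084 PKU100871']
results = [zipcode.findall(line) for line in lines]

print(results)          # [['100081'], [], ['100084', '100871']]
print(zipcode.pattern)  # [1-9]\d{5}
```

The pattern object offers the same operations as the module-level functions (search, match, findall, split, finditer, sub), without the pattern argument.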

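The match object of the re library

A match object is the result of one match and carries a lot of information about it.

Attributes of the match object

Attribute  Description
.string    The text being matched
.re        The pattern object (regular expression) used for the match
.pos       The position in the text where the search starts
.endpos    The position in the text where the search ends

Methods of the match object

Method     Description
.group(0)  Returns the matched string
.start()   Start position of the match in the original string
.end()     End position of the match in the original string
.span()    Returns (.start(), .end())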
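
The attributes and methods listed above can be inspected on a single match. A minimal sketch; the dictionary `info` is just a convenient way to print everything at once.

```python
import re

m = re.search(r'[1-9]\d{5}', 'BIT100081 TSU100084')

info = {
    'string': m.string,        # the text that was searched
    'pattern': m.re.pattern,   # source of the pattern object used
    'pos': m.pos,              # where the search started
    'endpos': m.endpos,        # where the search was allowed to end
    'group': m.group(0),       # the matched text
    'start': m.start(),
    'end': m.end(),
    'span': m.span(),
}

for name, value in info.items():
    print(name, '=', repr(value))
```

For this input, the match is '100081' with span (3, 9), and pos and endpos are 0 and 19, the bounds of the whole string.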

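Greedy and minimal matching in the re library

Greedy matching

By default the re library matches greedily, that is, it returns the longest matching substring.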

>>> match = re.search(r'PY.*N', 'PYANBNCNDN')
>>> match.group(0)
'PYANBNCNDN'

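Minimal matching

Operator  Description
*?        Zero or more repetitions of the preceding character, minimal match
+?        One or more repetitions of the preceding character, minimal match
??        Zero or one repetition of the preceding character, minimal match
{m,n}?    m to n repetitions (n included) of the preceding character, minimal match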
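
To see the difference, the same string can be searched with a greedy and a minimal quantifier. A minimal sketch:

```python
import re

text = 'PYANBNCNDN'

# Greedy: .* consumes as much as possible before the final N.
greedy = re.search(r'PY.*N', text).group(0)

# Minimal: .*? stops at the first N that lets the pattern succeed.
minimal = re.search(r'PY.*?N', text).group(0)

print(greedy)   # PYANBNCNDN
print(minimal)  # PYAN
```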

References

http://www.icourse163.org/course/BIT-1001870001
https://docs.python.org/3/library/re.html
