8. Regular Expressions

Posted by WangYao_BigData on 2024-12-06

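Regular Expressions

The core regular-expression syntax is largely the same across programming languages, although individual engines differ in some details. Regular expressions are a technique in their own right, independent of any one language.

Steps

Start with a large body of text
Identify the pattern in it
Write a regular expression that follows that pattern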
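
As a minimal sketch of these three steps, the example below uses a made-up log line and a hypothetical order-ID format (neither comes from the original post):

```python
import re

# Step 1: a block of text that contains the data we want (hypothetical sample).
log = "order A-1001 shipped; order A-1002 pending; order B-77 cancelled"

# Step 2: the pattern spotted by eye: one uppercase letter, a hyphen, then digits.
# Step 3: the same rule written as a regular expression.
order_id = re.compile(r"[A-Z]-\d+")

ids = order_id.findall(log)
print(ids)  # ['A-1001', 'A-1002', 'B-77']
```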

Syntax

A plain string is itself a regular expression

import re

s1 = '博主講的太好了！已經三連xiaohu加關注，求課件！我的郵箱是 1234214@qq.com, 或者xiaohu是 3255@163.com 或者是xiaohu微信手機號 18356781451'
res1 = re.findall('xiaohu', s1)
print(res1) #['xiaohu', 'xiaohu', 'xiaohu']

[] matches any one of the characters listed inside the brackets

s1 = '博主講的太好了！已經三連xiaohuq加關注，求課件！我的郵箱是 1234214@qq.com, 或者xiaohuw是 3255@163.com 或者是xiaohup微信手機號 18356781451'
res1 = re.findall(r'xiaohu[qwp]',s1)
print(res1) # ['xiaohuq', 'xiaohuw', 'xiaohup']

Character ranges

[a-z] matches any one lowercase letter from a to z

s1 = '博主講的太好了！已經三連xiaohuq加關注，求課件！我的郵箱是 1234214@qq.com, 或者xiaohu5是 3255@163.com 或xiaohu8者是xiaohuw微信手機號 18356781451'
res1 = re.findall(r'xiaohu[a-z]',s1)
print(res1) #['xiaohuq', 'xiaohuw']

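[A-Za-z] matches any one ASCII letter, upper or lower case

Because the code points from A to z are contiguous in ASCII, the range [A-z] also works in the example below. Note, however, that this range also covers the six punctuation characters that sit between Z and a ([ \ ] ^ _ `), so [A-Za-z] is the safer way to match letters only.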

s1 = '博主講的太好了！已經三連xiaohuq加關注，求課件！我xiaohuA的郵箱是 1234214@qq.com, 或者xiaohu53255@163.com 或xiaohu8者是xiaohuU微信手機號 18356781451'
res1 = re.findall(r'xiaohu[A-Za-z]', s1)
print(res1) # ['xiaohuq', 'xiaohuA', 'xiaohuU']
res2 = re.findall(r'xiaohu[A-z]', s1)
print(res2)# ['xiaohuq', 'xiaohuA', 'xiaohuU']
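
To make the difference between the two ranges concrete, here is a small sketch on a made-up string (the sample text is an assumption, not from the original post) in which some suffixes are punctuation rather than letters:

```python
import re

# Hypothetical sample: three of the suffixes are punctuation, two are letters.
sample = "xiaohu_ xiaohu[ xiaohu^ xiaohuA xiaohuz"

loose = re.findall(r"xiaohu[A-z]", sample)
strict = re.findall(r"xiaohu[A-Za-z]", sample)

# Every non-letter character that the range A-z also includes.
extras = [chr(c) for c in range(ord("A"), ord("z") + 1) if not chr(c).isalpha()]

print(loose)   # ['xiaohu_', 'xiaohu[', 'xiaohu^', 'xiaohuA', 'xiaohuz']
print(strict)  # ['xiaohuA', 'xiaohuz']
print(extras)  # ['[', '\\', ']', '^', '_', '`']
```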

[0-9]

s1 = '博主講的太好了！已經三連xiaohuq加關注，求課件！我xiaohuA的郵箱是 1234214@qq.com, 或者xiaohu5是 3255@163.com 或xiaohu8者是xiaohuU微信手機號 18356781451'
res1 = re.findall(r'xiaohu[0-9]', s1)
print(res1) # ['xiaohu5', 'xiaohu8']

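The range [0-z] is valid too, but it covers every ASCII character from 0 to z, not only digits and letters: it also matches : ; < = > ? @ (between 9 and A) and [ \ ] ^ _ ` (between Z and a). That is why xiaohu= is matched in the example below. To match only digits and letters, write [0-9A-Za-z].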

s1 = '博主講的太好了！已經三連xiaohu=加關注，求課件！我xiaohuA的郵箱是 1234214@qq.com, 或者xiaohu5是 3255@163.com 或xiaohu8者是xiaohuU微信手機號 18356781451'
res1 = re.findall(r'xiaohu[0-z]', s1)
print(res1) # ['xiaohu=', 'xiaohuA', 'xiaohu5', 'xiaohu8', 'xiaohuU']

\d matches a digit

s1 = '博主講的太好了！已經三連xiaohuq加關注，求課件！我xiaohuA的郵箱是 1234214@qq.com, 或者xiaohu5是 3255@163.com 或xiaohu89者是xiaohuU微信手機號 18356781451'
res1 = re.findall(r'xiaohu\d\d', s1)
print(res1) # ['xiaohu89']

? means the preceding item occurs 0 or 1 times

s1 = '博主講的太好xiaohu了！已經三連xiaohuq加關注，求課件！我xiaohu124453的郵箱是 1234214@qq.com, 或者xiaohu5是 3255@163.com 或xiaohu89者是xiaohuU微信手機號 18356781451'
res1 = re.findall(r'xiaohu\d?', s1)
print(res1) #['xiaohu', 'xiaohu', 'xiaohu1', 'xiaohu5', 'xiaohu8', 'xiaohu']

+ means the preceding item occurs 1 or more times

s1 = '博主講的太好xiaohu了！已經三連xiaohuq加關注，求課件！我xiaohu124453的郵箱是 1234214@qq.com, 或者xiaohu5是 3255@163.com 或xiaohu89者是xiaohuU微信手機號 18356781451'
res1 = re.findall(r'xiaohu\d+', s1)
print(res1) # ['xiaohu124453', 'xiaohu5', 'xiaohu89']

* means the preceding item occurs 0 or more times

s1 = '博主講的太好xiaohu了！已經三連xiaohuq加關注，求課件！我xiaohu124453的郵箱是 1234214@qq.com, 或者xiaohu5是 3255@163.com 或xiaohu89者是xiaohuU微信手機號 18356781451'
res1 = re.findall(r'xiaohu\d*', s1)
print(res1) # ['xiaohu', 'xiaohu', 'xiaohu124453', 'xiaohu5', 'xiaohu89', 'xiaohu']

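Repetition counts

{m,n} gives a range for the number of repetitions: m is the minimum and n is the maximum. Matching is greedy, so as many repetitions as possible (up to n) are taken.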

s1 = '有一個同學的學號為sj2, 另一個同學的學號為sj0000001, 還有一個同學的學號為sj3101, 還有一個學生：sj322010'
res1 = re.findall(r'sj\d{2,6}', s1) 
print(res1) # ['sj000000', 'sj3101', 'sj322010']

{m,} means at least m repetitions, with no upper limit

s1 = '有一個同學sj9的學號為sj11, 另一個同學的學號為sj001, 還有一個同學的學號為sj3101, 還有一個學生：sj322010, 新來的學生學號為：sj34567809'
res1 = re.findall(r'sj\d{2,}', s1)
print(res1) # ['sj11', 'sj001', 'sj3101', 'sj322010', 'sj34567809']

{m} means exactly m repetitions

s1 = '有一個同學sj9的學號為sj331001, 另一個同學的學號為sj32100, 還有一個同學的學號為sj3101, 還有一個學生：sj322010, 新來的學生學號為：sj34567809'
res1 = re.findall(r'sj\d{6}', s1)
print(res1) # ['sj331001', 'sj322010', 'sj345678']
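
The shorthand quantifiers ?, + and * can be read as special cases of the brace form. The sketch below, on a made-up string, checks that each shorthand returns the same matches as its {m,n} spelling:

```python
import re

# Hypothetical sample with zero, one, and several trailing digits.
s = "xiaohu xiaohu1 xiaohu124453"

pairs = [
    (r"xiaohu\d?", r"xiaohu\d{0,1}"),  # ? is {0,1}
    (r"xiaohu\d+", r"xiaohu\d{1,}"),   # + is {1,}
    (r"xiaohu\d*", r"xiaohu\d{0,}"),   # * is {0,}
]

results = [(re.findall(short, s), re.findall(braces, s)) for short, braces in pairs]
for (short, _), (left, right) in zip(pairs, results):
    print(short, left, left == right)
```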

Matching specific mobile numbers: (1) the number starts with 183, 153, or 173; (2) it is exactly 11 digits long

s1 = '我有一個手機號是18347821932，另一個手機號是17386429189，還有一個手機號是15356878621，以前用過一個手機號13987648345'
res1 = re.findall(r'1[857]3\d{8}', s1)
print(res1) # ['18347821932', '17386429189', '15356878621']
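
One caveat: on its own, this pattern also matches the first 11 digits of a longer run of digits. A possible way to enforce "exactly 11 digits" is to add lookarounds, as in the sketch below (the sample text is made up for illustration):

```python
import re

text = "valid: 18347821932, too long: 183478219321234, other prefix: 13987648345"

# Without boundaries, the first 11 digits of the 15-digit run are matched too.
unbounded = re.findall(r"1[857]3\d{8}", text)

# (?<!\d) and (?!\d) require that no digit sits directly before or after the match.
bounded = re.findall(r"(?<!\d)1[857]3\d{8}(?!\d)", text)

print(unbounded)  # ['18347821932', '18347821932']
print(bounded)    # ['18347821932']
```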

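\w matches letters, digits, underscores, and other word characters (for Python 3 str patterns this includes non-ASCII letters such as Chinese characters, unless re.ASCII is set)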

s1 = '我有一個郵箱是183478@qq.com，另一個郵箱是17386@163.com，還有一個郵箱是78621@_mail.com，以前用過一個郵箱139876@*q.com,183478@王q.com，78621@のmail.com'
res1 = re.findall(r'\d+@\w+\.com', s1)
print(res1) #['183478@qq.com', '17386@163.com', '78621@_mail.com', '183478@王q.com', '78621@のmail.com']

\W matches any character that \w does not match

s1 = '我有一個郵箱是hys183478@###.com，另一個郵箱是17386zcy@163.com，還有一個郵箱是786zrx21@gmail.com，以前用過一個郵箱139876@qq.com'
res1 = re.findall(r'\w+@\W+\.com', s1, re.ASCII)
print(res1) # ['hys183478@###.com']
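
The re.ASCII flag matters here because, without it, \w+ would also swallow the Chinese characters in front of the address. A short sketch on a made-up string shows the difference:

```python
import re

# Hypothetical sample: Chinese text directly followed by an email address.
text = "郵箱是hys183478@qq.com"

unicode_match = re.findall(r"\w+@\w+\.com", text)
ascii_match = re.findall(r"\w+@\w+\.com", text, re.ASCII)

print(unicode_match)  # ['郵箱是hys183478@qq.com']
print(ascii_match)    # ['hys183478@qq.com']
```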

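^ anchors the match to the start of the string

# Define a pattern that matches strings beginning with a digit
pattern = r'^\d'

# Test strings
test_strings = ["123abc", "abc123", "456def", "789ghi"]

# Loop over the test strings and check whether each one matches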
for string in test_strings:
    if re.match(pattern, string):
        print(f"'{string}' matches the pattern")
    else:
        print(f"'{string}' does not match the pattern")
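
Note that re.match() only ever tries the start of the string, so ^ is redundant in the example above; the anchor makes a visible difference with re.search(). A small sketch comparing the four combinations:

```python
import re

tests = ["123abc", "abc123"]

# re.match only tries position 0, so "^" adds nothing here.
match_with_caret = [bool(re.match(r"^\d", s)) for s in tests]
match_without_caret = [bool(re.match(r"\d", s)) for s in tests]

# re.search scans the whole string, so "^" is what pins it to the start.
search_with_caret = [bool(re.search(r"^\d", s)) for s in tests]
search_without_caret = [bool(re.search(r"\d", s)) for s in tests]

print(match_with_caret, match_without_caret)    # [True, False] [True, False]
print(search_with_caret, search_without_caret)  # [True, False] [True, True]
```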

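$ anchors the match to the end of the string

# Define a pattern that matches strings ending with a digit
pattern = r'\d$'

# Test strings
test_strings = ["abc", "def2", "ghi", "jkl4"]

# Loop over the test strings and check whether each one matches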
for string in test_strings:
    if re.search(pattern, string):
        print(f"'{string}' matches the pattern")
    else:
        print(f"'{string}' does not match the pattern")
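
One detail worth knowing: in Python, $ also matches just before a single trailing newline, which matters when lines are read from a file. The sketch below contrasts it with \Z, which matches only at the very end (the sample string is made up):

```python
import re

with_newline = "jkl4\n"

# "$" tolerates one trailing newline, so the digit still counts as "at the end".
dollar = re.search(r"\d$", with_newline)

# "\Z" matches only at the absolute end of the string, after the newline.
strict_end = re.search(r"\d\Z", with_newline)

print(bool(dollar), bool(strict_end))  # True False
```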

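() creates a capturing group; when the pattern contains groups, re.findall returns the group contents instead of the whole match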

s1 = '有一個學生的身份證號為340123200312075687，另一個學生的身份證號是340122199705035414'
res1 = re.findall(r'340\d{3}(\d{4})\d{8}', s1)
print(res1) # ['2003', '1997']

s1 = '有一個學生的身份證號為340123200312075687，另一個學生的身份證號是340122199705035414'
res1 = re.findall(r'(340\d{3}(\d{4})\d{8})', s1)
print(res1) # [('340123200312075687', '2003'), ('340122199705035414', '1997')]

s1 = '有一個學生的身份證號為340123200312075687，另一個學生的身份證號是340122199705035414'
res1 = re.findall(r'(340\d{3}(\d{4})(\d{2})(\d{2}))', s1)
print(res1) # [('34012320031207', '2003', '12', '07'), ('34012219970503', '1997', '05', '03')]
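
When there are several groups, numbered positions get hard to read. As an optional variation (not used in the original post), Python's named-group syntax (?P<name>...) lets each part be retrieved by name:

```python
import re

id_number = "340123200312075687"

# Same structure as above, but each date part carries a name.
m = re.search(r"340\d{3}(?P<year>\d{4})(?P<month>\d{2})(?P<day>\d{2})", id_number)

# re.search returns None when nothing matches, so guard before reading groups.
birth = m.groupdict() if m else {}
print(birth)  # {'year': '2003', 'month': '12', 'day': '07'}
```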

| means "or" between alternatives; wrap the alternatives in parentheses to limit their scope

s1 = '有一個學生的身份證號為340123200312075687，另一個學生的身份證號是340122199705035414，另一個學生的身份證號是340110199705035414'
res1 = re.findall(r'(340(123|110)(\d{4})(\d{2})(\d{2}))', s1)
print(res1) # [('34012320031207', '123', '2003', '12', '07'), ('34011019970503', '110', '1997', '05', '03')]
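
The parentheses around the alternatives also create a capturing group, which is why '123' and '110' appear in the result above. If the alternatives are only needed for grouping, a non-capturing group (?:...) keeps them out of the output, as in this sketch:

```python
import re

s = "340123200312075687 340122199705035414 340110199705035414"

# (?:123|110) groups the alternatives without capturing them,
# so findall returns only the year captured by (\d{4}).
years = re.findall(r"340(?:123|110)(\d{4})", s)
print(years)  # ['2003', '1997']
```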

. matches any single character except a newline

s1 = '我有一個鍵盤，鍵盤的售賣序列號為JP2134WFWFasd##&13000, 上一個鍵盤的序列號為JPIUYT4WFqw34sd##&000'
res1 = re.findall(r'JP.{16}', s1)
print(res1) # ['JP2134WFWFasd##&13', 'JPIUYT4WFqw34sd##&']
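
By default the dot stops at a newline; the re.DOTALL flag lifts that restriction. A minimal sketch on a made-up two-line string:

```python
import re

s = "JP12\n34"

# Default: "." refuses to match the newline, so four characters cannot be found.
default = re.findall(r"JP.{4}", s)

# With re.DOTALL, "." matches the newline as well.
dotall = re.findall(r"JP.{4}", s, re.DOTALL)

print(default)  # []
print(dotall)   # ['JP12\n3']
```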

Use the \ escape character to turn . into an ordinary literal dot

s1 = '我有shujia#888一個鍵盤，鍵盤的售shujia.666賣序列號為JP2134WFW.asd##&13, 上一個鍵盤的序列號為JPIUYT4WFqw34sd##&'
res1 = re.findall(r'shujia\.\d{3}', s1)
print(res1) # ['shujia.666']
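
When the literal text comes from a variable, escaping by hand is error-prone. The standard-library function re.escape() adds the backslashes automatically; the sketch below uses a made-up sample string:

```python
import re

literal = "shujia.666"
text = "shujia#666 shujia.666"

# Unescaped, the dot acts as a wildcard and matches "#" as well.
as_pattern = re.findall(literal, text)

# re.escape() backslash-escapes the special characters in the literal.
escaped = re.escape(literal)
as_literal = re.findall(escaped, text)

print(as_pattern)  # ['shujia#666', 'shujia.666']
print(as_literal)  # ['shujia.666']
```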

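Common functions

re.findall() finds every non-overlapping substring of a larger string that matches the pattern and returns them as a list
re.match() checks whether the beginning of the string matches the pattern (use re.fullmatch() to require the whole string to match)

re.search() scans the string from left to right and returns only the first match, as a Match object (or None if nothing matches)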

text = '博主講的實在是太1165872335@數加.com好了，通俗易懂，已三連，求課件，我的郵箱是1165872335@qq.com或' \
       '者是xiaohu2023666@pronton.com謝謝博主 手xiaohu2機微訊號也可以17354074069'

res1 = re.search(r'1\d+@\w+\.com',text)
print(res1) # <re.Match object; span=(8, 25), match='1165872335@數加.com'>
print(res1.group()) # 1165872335@數加.com
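
Since re.match(), re.search() and re.fullmatch() are easy to confuse, here is a side-by-side sketch on one short made-up string:

```python
import re

s = "abc123"

at_start = re.match(r"\d+", s)            # None: the string does not start with a digit
anywhere = re.search(r"\d+", s)           # finds '123' further along
prefix = re.match(r"[a-z]+", s)           # 'abc': a matching prefix is enough for re.match
whole_fail = re.fullmatch(r"[a-z]+", s)   # None: '123' is left over
whole_ok = re.fullmatch(r"[a-z]+\d+", s)  # 'abc123': the pattern covers the whole string

for name, m in [("match", at_start), ("search", anywhere), ("prefix", prefix),
                ("fullmatch-1", whole_fail), ("fullmatch-2", whole_ok)]:
    # Each call returns a Match object or None, so test before calling .group().
    print(name, m.group() if m else None)
```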

re.split() splits a string at every match of the pattern and returns the pieces as a list

text = '1001,xiaohu#18$踢足球'

res1 = re.split(r'[,#$]',text)
print(res1) # ['1001', 'xiaohu', '18', '踢足球']
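
Two variations on the same call can be handy: maxsplit limits the number of splits, and wrapping the pattern in a capturing group keeps the delimiters in the result. A short sketch reusing the string above:

```python
import re

text = '1001,xiaohu#18$踢足球'

# Split only once; the rest of the string stays in one piece.
first_only = re.split(r'[,#$]', text, maxsplit=1)

# A capturing group makes re.split return the delimiters as well.
with_delims = re.split(r'([,#$])', text)

print(first_only)   # ['1001', 'xiaohu#18$踢足球']
print(with_delims)  # ['1001', ',', 'xiaohu', '#', '18', '$', '踢足球']
```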

re.finditer() finds every substring of a larger string that matches the pattern and returns an iterator of Match objects

text = '博主講的實在是太1165872335@數加$.com好了，通俗易懂，已三連，求課件，我的郵箱是 1165872335@qq.com 或' \
       '者是xiaohu2023666@pronton.com謝謝博主 手xiaohu2機微訊號也可以17354074069'


# Use a raw string so that \w, \$ and \. reach the regex engine unchanged
res1 = re.finditer(r'(\w+@(數加\$|qq|pronton)\.com)', text, re.ASCII)
for res in res1:
    print(res.group(1))
    print(res.group(2))
'''1165872335@數加$.com
數加$
1165872335@qq.com
qq
xiaohu2023666@pronton.com
pronton'''
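
Because re.finditer() yields Match objects rather than plain strings, each result also carries its position. A minimal sketch on a made-up string:

```python
import re

text = "a1@qq.com and b2@qq.com"

# Collect (start, end, matched text) for every hit.
hits = [(m.start(), m.end(), m.group()) for m in re.finditer(r"\w+@qq\.com", text)]
print(hits)  # [(0, 9, 'a1@qq.com'), (14, 23, 'b2@qq.com')]
```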

re.fullmatch() matches the pattern against the string as a whole; it returns None unless the entire string matches

text = '安徽省-合肥市-蜀山區-浮山路'

# A raw string (r'...') is what is needed here, not an f-string
res1 = re.fullmatch(r'(\w+)-(\w+)-(\w+)-(\w+)', text)
print(f"Province: {res1.group(1)}")
print(f"City: {res1.group(2)}")
print(f"District: {res1.group(3)}")
print(f"Street: {res1.group(4)}")
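
The example above assumes the match succeeds; if the input has a different shape, re.fullmatch() returns None and .group() raises AttributeError. A defensive sketch, using a hypothetical helper named parse_address:

```python
import re

def parse_address(text):
    """Return the four address parts as a tuple, or None if the shape is wrong."""
    m = re.fullmatch(r'(\w+)-(\w+)-(\w+)-(\w+)', text)
    return m.groups() if m else None

good = parse_address('安徽省-合肥市-蜀山區-浮山路')
bad = parse_address('安徽省-合肥市')

print(good)  # ('安徽省', '合肥市', '蜀山區', '浮山路')
print(bad)   # None
```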
