Regular Expressions (REGEX) - Examples

Posted by Jason990420 on 2021-12-13

Original URL: https://learnku.com/articles/63588?order_by=created_at&

Created: 2021/12/13
Modified: 2021/12/17
WIN10 / Python 3.9.9

Preface

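The previous article covered how Python defines regular expressions; now let's look at some examples of them in use. This article will be updated as more examples are added. To make the content of each pattern easier to explain, most of them are written in re.VERBOSE style.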
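As a quick illustration of what re.VERBOSE changes, here is a minimal sketch comparing a compact pattern with its verbose equivalent. The date pattern and the sample string are made up for this illustration and are not part of the examples that follow.

```python
import re

# Compact form: correct, but hard to read at a glance.
compact = re.compile(r"(\d{4})-(\d{2})-(\d{2})")

# Same pattern under re.VERBOSE: whitespace is ignored and '#' starts a comment.
verbose = re.compile(r"""
    (\d{4})     # year,  group(1)
    -           # literal hyphen
    (\d{2})     # month, group(2)
    -           # literal hyphen
    (\d{2})     # day,   group(3)
""", re.VERBOSE)

sample = "Created: 2021-12-13"   # hypothetical sample text
compact_groups = compact.search(sample).groups()
verbose_groups = verbose.search(sample).groups()
print(compact_groups == verbose_groups, verbose_groups)
# True ('2021', '12', '13')
```

Because whitespace in the pattern is ignored under re.VERBOSE, a literal space has to be written as \s, as [ ], or as a backslash-escaped space.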

Example 1: Scraping a web page for the chapter titles and link URLs in a novel's table of contents

from urllib.request import urlopen

url = 'https://www.ptwxz.com/html/11/11175/'
html = urlopen(url).read().decode('gbk')

>>> html
('<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 
...
 '<h1>萬界最強道長最新章節</h1>\r\n'
 '</div>\r\n'
...
 '<li><a href="8011051.html">第1章 倚天峰上</a></li>\r\n'
 '<li><a href="8011052.html">第2章 萬界道士</a></li>\r\n'
...
 '<li><a href="8050881.html">第55章 隻身誘敵</a></li>\r\n'
 '<li><a href="8052105.html">第56章 終章</a></li>\r\n'
...
 '</body>\r\n'
 '</html>\r\n')

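We would normally use bs4.BeautifulSoup for this kind of task. It looks simpler, but in practice that is not always the case; here a regular expression gives the simpler solution.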

Extracting the book title

import re
# <h1>萬界最強道長最新章節</h1>
title_regex = re.compile(r"""
    <h1>        # <h1>
    (.*?)       # 萬界最強道長, group(1)
    .{4}        # 最新章節
    </h1>       # </h1>
""", re.VERBOSE)

title = title_regex.search(html).group(1)
print(f'Novel title: {title}')

Novel title: 萬界最強道長
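The title lookup above can also be tried offline. The sketch below is an assumption-laden variant: sample_html is a hand-made stand-in for the downloaded page, and extract_title is a hypothetical helper that returns None instead of failing when the page has no matching heading.

```python
import re

# Hypothetical offline sample standing in for the downloaded page.
sample_html = '<div>\r\n<h1>萬界最強道長最新章節</h1>\r\n</div>\r\n'

title_regex = re.compile(r"""
    <h1>        # opening tag
    (.*?)       # book title, group(1)
    .{4}        # fixed four-character suffix after the title
    </h1>       # closing tag
""", re.VERBOSE)

def extract_title(html):
    """Return the book title, or None when no matching <h1> is found."""
    m = title_regex.search(html)
    return m.group(1) if m else None

print(extract_title(sample_html))
print(extract_title('<p>no heading here</p>'))
```

search returns None when nothing matches, so calling .group(1) directly on its result raises AttributeError on a page with a different layout; the helper makes that case explicit.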

Extracting each chapter's link and title

# <li><a href="8011051.html">第1章 倚天峰上</a></li>
chapter_regex = re.compile(r"""
    <li><a      # <li><a
    \s+         # ' '
    href="      # href="
    (.+?)"      # 8011051.html, group(1) link
    >           # >
    (.+?)       # 第1章 倚天峰上, group(2) chapter title
    </a></li>   # </a></li>
""", re.VERBOSE)

chapters = [(url+m.group(1), m.group(2)) for m in chapter_regex.finditer(html)]
for chapter in chapters:
    print(chapter)

('https://www.ptwxz.com/html/11/11175/8011051.html', '第1章 倚天峰上')
('https://www.ptwxz.com/html/11/11175/8011052.html', '第2章 萬界道士')
...
('https://www.ptwxz.com/html/11/11175/8050881.html', '第55章 隻身誘敵')
('https://www.ptwxz.com/html/11/11175/8052105.html', '第56章 終章')
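Here is a small offline sketch of the same chapter extraction, with two changes that are my own assumptions rather than the original code: named groups (link, name) replace the numbered ones, and urllib.parse.urljoin builds the absolute URL instead of string concatenation. sample_html is a two-line excerpt copied from the listing shown earlier.

```python
import re
from urllib.parse import urljoin

base_url = 'https://www.ptwxz.com/html/11/11175/'

# Two list items copied from the page excerpt; a stand-in for the real download.
sample_html = (
    '<li><a href="8011051.html">第1章 倚天峰上</a></li>\r\n'
    '<li><a href="8011052.html">第2章 萬界道士</a></li>\r\n'
)

chapter_regex = re.compile(r"""
    <li><a \s+ href="       # start of the list item and its anchor
    (?P<link>.+?)"          # relative link, e.g. 8011051.html
    >
    (?P<name>.+?)           # chapter title
    </a></li>
""", re.VERBOSE)

chapters = [
    (urljoin(base_url, m['link']), m['name'])
    for m in chapter_regex.finditer(sample_html)
]
for chapter in chapters:
    print(chapter)
```

For links like these, urljoin gives the same result as url + link; it additionally resolves an href that starts with / against the site root, which plain concatenation would get wrong.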

Extracting the content of Chapter 1

import re
from urllib.request import urlopen

url = 'https://www.ptwxz.com/html/11/11175/8011051.html'
html = urlopen(url).read().decode('gbk')

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
...
&nbsp;&nbsp;&nbsp;&nbsp;青天白日，浩浩諸峰。<br /><br />&nbsp;&nbsp;&nbsp;&nbsp;悠悠鐘聲，迴盪山間。
...
<br /><br />&nbsp;&nbsp;&nbsp;&nbsp;陳玄一身子一傾，倒在了地上。
</div>
...
</html>

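Take &nbsp;&nbsp;&nbsp;&nbsp; as the starting point and </div> as the end point to pull out the chapter content, then split it into paragraphs on <br\s*?/><br\s*?/>&nbsp;&nbsp;&nbsp;&nbsp;. The optional whitespace inside each <br /> tag is matched with \s*?.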

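regex = re.compile(r"""
    (?<=&nbsp;&nbsp;&nbsp;&nbsp;)   # preceded by the four-&nbsp; paragraph indent
    .*?                             # chapter text, as short as possible
    (?=</div>)                      # followed by </div>, not included
""", re.VERBOSE | re.DOTALL)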
m = regex.search(html)
text = '\n'.join(re.split(r"<br\s*?/><br\s*?/>&nbsp;&nbsp;&nbsp;&nbsp;", m.group().strip()))
# text = re.sub(r"<br\s*?/><br\s*?/>&nbsp;&nbsp;&nbsp;&nbsp;", '\n', m.group().strip())
print(text)

青天白日，浩浩諸峰。
悠悠鐘聲，迴盪山間。
正是清晨時分，倚天峰上，鐘聲三響，人影綽綽。
...
陳玄一身子一傾，倒在了地上。
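The lookbehind/lookahead extraction and the paragraph split can be checked without a network connection. In this sketch sample_html is a shortened, hand-assembled stand-in for the chapter page (three paragraphs placed on one line), so the real page may differ in its line breaks.

```python
import re

# Hand-assembled stand-in for the chapter page; not the full download.
sample_html = (
    '<div>\r\n'
    '&nbsp;&nbsp;&nbsp;&nbsp;青天白日，浩浩諸峰。'
    '<br /><br />&nbsp;&nbsp;&nbsp;&nbsp;悠悠鐘聲，迴盪山間。'
    '<br /><br />&nbsp;&nbsp;&nbsp;&nbsp;陳玄一身子一傾，倒在了地上。\r\n'
    '</div>\r\n'
)

body_regex = re.compile(r"""
    (?<=&nbsp;&nbsp;&nbsp;&nbsp;)   # just after the first four-&nbsp; indent
    .*?                             # chapter body, as short as possible
    (?=</div>)                      # stop right before </div>
""", re.VERBOSE | re.DOTALL)

m = body_regex.search(sample_html)
body = m.group().strip()

# Each paragraph break is two <br /> tags followed by the next indent.
paragraphs = [p.strip() for p in re.split(r"<br\s*/><br\s*/>(?:&nbsp;){4}", body)]
print('\n'.join(paragraphs))
```

The lookbehind and lookahead only assert their context, so neither the leading indent nor </div> ends up in the matched text.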

Example 2: Extracting all class definitions and their docstrings from a Python script

Here we use the tkinter library's __init__.py as the example.

Reading the file content

import re
import pathlib
import tkinter

base = tkinter.__path__[0]
path = pathlib.Path(base).joinpath('__init__.py')
with open(path, 'rt', encoding='utf-8') as f:
    script = f.read()

Defining the class pattern

# class xxx (yyy) : """zzz"""
class_pattern = r'''
    \bclass             # word boundary, then the keyword class
    \s+?                # space
    [\w]+?              # identifier xxx
    \s*?                # space
    (                   # group 1
        \(              #   (
        .*?             #   yyy
        \)              #   )
    )?                  # group 1 may not exist
    \s*?                # space
    :                   # :
    (                   # group 2
        \s*?            # space
        (["]{3}|[']{3}) # group 3, DOC-STRING
        .*?             # zzz
        \3              # same as group 3
    )?                  # maybe no DOC-STRING
'''
class_regex = re.compile(class_pattern, re.VERBOSE | re.DOTALL)

Extracting the content

classes = [m.group() for m in class_regex.finditer(script)]
for c in classes:
    print(c)

class EventType(str, enum.Enum):
class Event:
    """Container for the properties of an event.

    Instances of this type are generated if one of the following events occurs:

    KeyPress, KeyRelease - for keyboard events
    ButtonPress, ButtonRelease, Motion, Enter, Leave, MouseWheel - for mouse events
    Visibility, Unmap, Map, Expose, FocusIn, FocusOut, Circulate,
    Colormap, Gravity, Reparent, Property, Destroy, Activate,
    Deactivate - for window events.

    If a callback function for one of these events is registered
    using bind, bind_all, bind_class, or tag_bind, the callback is
    called with an Event as first argument. It will have the
    following attributes (in braces are the event types for which
    the attribute is valid):

        serial - serial number of event
    num - mouse button pressed (ButtonPress, ButtonRelease)
    focus - whether the window has the focus (Enter, Leave)
    height - height of the exposed window (Configure, Expose)
    width - width of the exposed window (Configure, Expose)
    keycode - keycode of the pressed key (KeyPress, KeyRelease)
    state - state of the event as a number (ButtonPress, ButtonRelease,
                            Enter, KeyPress, KeyRelease,
                            Leave, Motion)
    state - state as a string (Visibility)
    time - when the event occurred
    x - x-position of the mouse
    y - y-position of the mouse
    x_root - x-position of the mouse on the screen
             (ButtonPress, ButtonRelease, KeyPress, KeyRelease, Motion)
    y_root - y-position of the mouse on the screen
             (ButtonPress, ButtonRelease, KeyPress, KeyRelease, Motion)
    char - pressed character (KeyPress, KeyRelease)
    send_event - see X/Windows documentation
    keysym - keysym of the event as a string (KeyPress, KeyRelease)
    keysym_num - keysym of the event as a number (KeyPress, KeyRelease)
    type - type of the event as a number
    widget - widget in which the event occurred
    delta - delta of wheel movement (MouseWheel)
    """
class Variable:
    """Class to define value holders for e.g. buttons.

    Subclasses StringVar, IntVar, DoubleVar, BooleanVar are specializations
    that constrain the type of the value returned from get()."""

...

class LabelFrame(Widget):
    """labelframe widget."""
class PanedWindow(Widget):
    """panedwindow widget."""
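To see how the optional groups behave, the same pattern can be run against a tiny script. The sketch below uses a hypothetical three-class sample_script (not tkinter) and reports, for each match, the header line and whether group 2 (the docstring) took part in the match.

```python
import re

# Hypothetical three-class script, written line by line to avoid nested triple quotes.
sample_script = (
    'class Plain:\n'
    '    pass\n'
    '\n'
    'class Documented(Base):\n'
    '    """A documented class."""\n'
    '\n'
    'class Single:\n'
    "    '''Uses single quotes.'''\n"
)

class_pattern = r'''
    \bclass             # word boundary, then the keyword
    \s+?                # whitespace
    [\w]+?              # class name
    \s*?                # optional whitespace
    (\(.*?\))?          # group 1: optional base-class list
    \s*?                # optional whitespace
    :                   # colon ending the header
    (                   # group 2: optional docstring
        \s*?            #   whitespace before the docstring
        (["]{3}|[']{3}) #   group 3: opening triple quote
        .*?             #   docstring text
        \3              #   the same triple quote again
    )?
'''
class_regex = re.compile(class_pattern, re.VERBOSE | re.DOTALL)

# (header line, docstring found?) for every match
summary = [
    (m.group().splitlines()[0], m.group(2) is not None)
    for m in class_regex.finditer(sample_script)
]
for header, has_doc in summary:
    print(header, has_doc)
```

One limitation worth keeping in mind: the pattern works on raw text, so a line such as # class Foo: inside a comment would also match.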

Example 3: Converting string content into a list of data

For example, suppose we have the following reference-list text:

import re

text = ''.join("""
[1] ShijunWangRonald M.Summe, Medical Image Analysis, Volume
16, Issue 5, July 2012, pp. 933-951 https://www.sciencedirect.
com/science/article/pii/S1361841512000333
[2] Dupuytren’s contracture, By Mayo Clinic Staff, https://
www.mayoclinic.org/diseases-conditions/dupuytrenscontracture/
symptoms-causes/syc-20371943
[3] Mean and standard deviation. http://www.bmj.com/about-bmj/
resources-readers/
publications/statistics-square-one/2-
mean-and-standard-deviation
[4] Interquartile Range IQR http://www.mathwords.com/i/
interquartile_range.htm
[5] Why are tree-based models robust to outliers? https://www.
quora.com/Why-are-tree-
based-models-robustto-
outliers
[6] https://www.dummies.com/education/math/statistics/howto-
interpret-a-correlation-coefficient-r/
[7] https://www.medicalnewstoday.com/releases/11856.php
[8] Scikit Learn Auc metrics: http://scikit-learn.org/stable/
modules/generated/sklearn.metrics.auc.html
[9] Scikit Learn Library RoC and AUC scores: http://
scikit-learn.
org/stable/modules/generated/sklearn.metrics.roc_auc_
score.html
""".strip().splitlines())

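What we want is the number, the description and the URL as separate fields. Note that the lines have already been joined into a single string here.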

regex = re.compile(r"""
    \[              # [
        (\d+?)      #   integer
    ]               # ]
    \s+?            # one or more whitespace characters
    (.*?)           # any characters
    \s*?            # zero or more whitespace characters
    (https?://.+?)  # simple http(s) match
    (?=\[|$)        # end with '[' or end of string, not included
""", re.VERBOSE | re.DOTALL)

for lst in regex.findall(text):
    print('\n'.join(lst))

Because we want the result as a list, we call findall; to make the printout easier to read, each field is then printed on its own line.

1
ShijunWangRonald M.Summe, Medical Image Analysis, Volume16, Issue 5, July 2012, pp. 933-951
https://www.sciencedirect.com/science/article/pii/S1361841512000333
2
Dupuytren’s contracture, By Mayo Clinic Staff,
https://www.mayoclinic.org/diseases-conditions/dupuytrenscontracture/symptoms-causes/syc-20371943
3
Mean and standard deviation.
http://www.bmj.com/about-bmj/resources-readers/publications/statistics-square-one/2-mean-and-standard-deviation
4
Interquartile Range IQR
http://www.mathwords.com/i/interquartile_range.htm
5
Why are tree-based models robust to outliers?
https://www.quora.com/Why-are-tree-based-models-robustto-outliers
6

https://www.dummies.com/education/math/statistics/howto-interpret-a-correlation-coefficient-r/
7

https://www.medicalnewstoday.com/releases/11856.php
8
Scikit Learn Auc metrics:
http://scikit-learn.org/stable/modules/generated/sklearn.metrics.auc.html
9
Scikit Learn Library RoC and AUC scores:
http://scikit-learn.org/stable/modules/generated/sklearn.metrics.roc_auc_score.html
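If the goal is a list of records rather than printed lines, one possible sketch is to name the groups and build a dict per match. The group names (number, note, url) are my own choice, and the text is shortened to entries [4] and [7] from the list above, already joined into one string.

```python
import re

# Entries [4] and [7] from the reference list, already joined into one string.
text = (
    "[4] Interquartile Range IQR http://www.mathwords.com/i/interquartile_range.htm"
    "[7] https://www.medicalnewstoday.com/releases/11856.php"
)

regex = re.compile(r"""
    \[ (?P<number>\d+?) ]       # reference number inside square brackets
    \s+?                        # one or more whitespace characters
    (?P<note>.*?)               # description, possibly empty
    \s*?                        # optional whitespace
    (?P<url>https?://.+?)       # URL
    (?=\[|$)                    # up to the next '[' or the end of the string
""", re.VERBOSE | re.DOTALL)

records = [
    {'number': int(m['number']), 'note': m['note'], 'url': m['url']}
    for m in regex.finditer(text)
]
for record in records:
    print(record)
```

Entry [7] has no description, so its note is an empty string. Since the lookahead stops at the next '[', a URL that itself contains '[' would be cut short; that is acceptable for this data but worth checking on other input.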

This work is released under a CC license; reprints must credit the author and link to this article.

Jason Yang
