Datawhale Web Scraping Task 2 (Regular Expressions)

Posted by TNTZS666 on 2019-03-02

Original article: https://blog.csdn.net/tntzs666/article/details/88081050

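Learning content

What is a regular expression

Definition: a set of rules for describing text patterns, used to search, replace, and otherwise process strings.
Steps for use:
- 1. Use the compile() function to compile the regular expression string into a pattern object
- 2. Call the pattern object's methods to match against text; a successful match returns a match object
- 3. Use the match object's methods to work with the result
Common methods:
- match: searches from the start position and matches once, returning as soon as one match succeeds
  General form:
  match(string[, pos[, endpos]])
- search: searches from any position and matches once
  search(string[, pos[, endpos]])
  Example: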

>>> import re
>>> pattern = re.compile(r'\d+')  # matches one or more digits

>>> m = pattern.match('one12twothree34four')  # tries the start of the string; no match
>>> print(m)  # prints None

>>> m = pattern.match('one12twothree34four', 2, 10)  # starts matching at 'e'; no match
>>> print(m)  # prints None
>>> m = pattern.match('one12twothree34four', 3, 10)  # starts matching at '1'; matches
>>> print(m)
<_sre.SRE_Match object at 0x10a42aac0>
>>> m.group(0)   # the 0 can be omitted
'12'
>>> m.start(0)   # the 0 can be omitted
3
>>> m.end(0)     # the 0 can be omitted
5
>>> m.span(0)    # the 0 can be omitted
(3, 5)

>>> m = pattern.search('one12twothree34four')  # match would find nothing here
>>> m
<_sre.SRE_Match object at 0x10cc03ac0>
>>> m.group()
'12'

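Note: match only matches at the beginning of the string (or at pos, when given); if the text there does not fit the regular expression, the match fails and the function returns None. search scans through the string until it finds a match. Both match and search match once: they return as soon as one result is found rather than looking for all matches.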
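
Because both methods return None on failure, calling .group() directly on the result raises AttributeError when nothing matched. A minimal sketch of guarding against that follows; `first_number` and `leading_number` are hypothetical helper names used only for illustration, not part of the re module.

```python
import re

NUMBER = re.compile(r'\d+')

def first_number(text):
    """Return the first run of digits anywhere in text, or None."""
    m = NUMBER.search(text)   # search scans forward through the string
    return m.group() if m else None

def leading_number(text):
    """Return the digits at the very start of text, or None."""
    m = NUMBER.match(text)    # match only tries position 0
    return m.group() if m else None

print(first_number('one12twothree34four'))    # 12
print(leading_number('one12twothree34four'))  # None
print(leading_number('12twothree'))           # 12
```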

findall: matches everything, returns a list
findall(string[, pos[, endpos]])
finditer: matches everything, returns an iterator
Example:

import re
pattern = re.compile(r'\d+')  # find runs of digits

result1 = pattern.findall('hello 123456 789')
result2 = pattern.findall('one1two2three3four4', 0, 3)

print(result1)  # output: ['123456', '789']
print(result2)  # output: []

r1 = pattern.finditer('hello 123456 789')
r2 = pattern.finditer('one1two2three3four4', 0, 10)
print(r1)
print(r2)
print('r1....')
for m1 in r1:
    print("matching string:{} position:{}".format(m1.group(), m1.span()))
# output: matching string:123456 position:(6, 12)
#         matching string:789 position:(13, 16)
print('r2....')
for m2 in r2:
    print("matching string:{} position:{}".format(m2.group(), m2.span()))
# output: matching string:1 position:(3, 4)
#         matching string:2 position:(7, 8)

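findall returns all matching substrings as a list; if nothing matches, it returns an empty list.
finditer behaves much like findall: it also searches the whole string and finds every match, but it returns an iterator that yields each result (a Match object) in order.

split: splits a string, returns a list
split(string[, maxsplit])
maxsplit sets the maximum number of splits; by default the string is split at every match
Example: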
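
One detail worth knowing before the case study: when the pattern contains capture groups, findall returns the groups rather than the whole match, and with several groups each item is a tuple. A small sketch on a made-up string (the `rank=... title=...` format is invented purely for illustration):

```python
import re

text = 'rank=1 title=A; rank=2 title=B'

# No capture group: each item is the whole match.
print(re.findall(r'rank=\d+', text))      # ['rank=1', 'rank=2']

# One capture group: each item is that group's text.
print(re.findall(r'rank=(\d+)', text))    # ['1', '2']

# Two capture groups: each item is a tuple of the groups.
pairs = re.findall(r'rank=(\d+) title=(\w+)', text)
print(pairs)                              # [('1', 'A'), ('2', 'B')]

# finditer yields Match objects, so groups and positions are both available.
spans = [(m.group(1), m.span()) for m in re.finditer(r'rank=(\d+)', text)]
print(spans)
```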
```
import re
p = re.compile(r'[\s\,;]+')
print(p.split('a,b;;c   d'))
# output: ['a', 'b', 'c', 'd']
```
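
The example above splits at every match. To show what maxsplit changes, here is a short sketch using the same input; once the limit is reached, the rest of the string is returned unsplit as the last item.

```python
import re

p = re.compile(r'[\s,;]+')
s = 'a,b;;c   d'

print(p.split(s))              # ['a', 'b', 'c', 'd']
print(p.split(s, maxsplit=1))  # ['a', 'b;;c   d']
print(p.split(s, maxsplit=2))  # ['a', 'b', 'c   d']
```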
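sub: replacement
re.sub(pattern, repl, string, count=0)
Here repl can be either a string or a function:
- If repl is a string, it replaces every matched substring and the resulting string is returned. repl can also refer to captured groups in the \id form (for example \1), but number 0 cannot be used this way; use \g<0> for the whole match.
- If repl is a function, it should accept a single argument (a Match object) and return the replacement string (group references inside the returned string are not expanded).
- count sets the maximum number of replacements; when it is not given, every match is replaced
- pattern: the regular expression pattern string
Example: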

import re
p = re.compile(r'(\w+) (\w+)')  # \w = [A-Za-z0-9_] in ASCII mode; for str patterns it also matches other Unicode word characters
s = 'hello 123, hello 456'

print(p.sub(r'hello world', s))   # replaces 'hello 123' and 'hello 456' with 'hello world'
print(p.sub(r'\2 \1', s))         # swaps the two groups
# output: hello world, hello world
#         123 hello, 456 hello
def func(m):
    return 'hi' + ' ' + m.group(2)

print(p.sub(func, s))
print(p.sub(func, s, 1))          # replace at most once
# output: hi 123, hi 456
#         hi 123, hello 456
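
As a small supplement to the rule that group 0 cannot be written as \0 in a replacement string, the sketch below uses the \g<...> spelling instead, reusing the same pattern and input as the example above.

```python
import re

p = re.compile(r'(\w+) (\w+)')
s = 'hello 123, hello 456'

# \g<0> stands for the whole match; \0 is not a group reference.
print(p.sub(r'[\g<0>]', s))   # [hello 123], [hello 456]

# \g<1> is the unambiguous spelling of \1, useful before a literal digit.
print(p.sub(r'\g<1>0', s))    # hello0, hello0
```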

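Matching Chinese
- Chinese characters are Unicode code points (UTF-8 is one encoding of Unicode); most of them fall in the range [\u4e00-\u9fa5]
- Chinese punctuation such as the full-width comma is outside the [\u4e00-\u9fa5] range
Greedy and non-greedy modes
- Greedy mode: match as much as possible, provided the whole expression still matches
- Non-greedy mode: match as little as possible, provided the whole expression still matches
- In Python, quantifiers are greedy by default
  Example: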
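
A minimal sketch of the range in use, on a made-up sample string: the runs of Chinese characters are extracted, while the full-width comma and exclamation mark, the Latin letters, and the digits are skipped.

```python
import re

han = re.compile(r'[\u4e00-\u9fa5]+')
text = '你好，world！正则表达式 regex 123'

print(han.findall(text))   # ['你好', '正则表达式']
print(han.search('，'))     # None: the full-width comma is outside the range
```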

import re
s = "abbbbbbc"
pattern1 = re.compile("ab+")
pattern2 = re.compile("ab*?")
m1 = pattern1.match(s)
m2 = pattern2.match(s)
s1 = m1.group()
s2 = m2.group()
print(s1)  # abbbbbb
print(s2)  # a (non-greedy matches as few b's as possible, so none)
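
The difference matters most when pulling text out of HTML, which is why scraping patterns are usually written with (.*?). A short sketch on a hypothetical two-tag fragment:

```python
import re

# Hypothetical fragment with two adjacent tags of the same kind.
html = '<span class="title">A</span><span class="title">B</span>'

greedy = re.findall(r'<span class="title">(.*)</span>', html)
lazy = re.findall(r'<span class="title">(.*?)</span>', html)

print(greedy)  # ['A</span><span class="title">B']  (runs to the last </span>)
print(lazy)    # ['A', 'B']  (stops at the nearest </span>)
```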

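Case study

Combine requests and re to scrape the Douban Movie Top 250 list.
The fields to extract are rank, film title, country, and director.

Approach:

Each Douban page shows only 25 films, so the URL changes from page to page. The first page is https://movie.douban.com/top250, the next is https://movie.douban.com/top250?start=25&filter=, and the one after that is https://movie.douban.com/top250?start=50&filter=, so a while loop can walk through the pages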

page = 0
while page <= 225:
	url = "https://movie.douban.com/top250?start=%s&filter=" % page
	page += 25  # each page lists 25 films

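To inspect the page, right-click it and choose "View page source".
First locate the fields to scrape: rank, film title, country, and director.
Rank:

Film title: each film has two titles, one Chinese and one English; only the first (Chinese) title is taken here

Country and director are under the same tag: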
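The corresponding regular expressions:
Rank: '<em class="">(.*?)</em>'
Title: '<span class="title">(.*?)</span>'
Director: '<p class="">.*?导演:(.*?)&nbsp;'
Country: '<br>.*?/&nbsp;(.*?)&nbsp;/.*?</p>'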
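
Before sending any requests, the patterns for rank, title, director, and country can be checked offline. The sketch below runs them against a hand-written fragment shaped like one list entry. The fragment is an assumption made for illustration and the live page markup may differ. The label is written in Simplified Chinese (导演) on the assumption that this is how the page spells it.

```python
import re

# Hypothetical fragment shaped like one Top 250 entry; the real page may differ.
sample = '''
<em class="">1</em>
<span class="title">肖申克的救赎</span>
<span class="title">&nbsp;/&nbsp;The Shawshank Redemption</span>
<p class="">
    导演: 弗兰克·德拉邦特 Frank Darabont&nbsp;&nbsp;&nbsp;主演: 蒂姆·罗宾斯 Tim Robbins<br>
    1994&nbsp;/&nbsp;美国&nbsp;/&nbsp;犯罪 剧情
</p>
'''

# re.S lets '.' cross line breaks, since the fields sit on different lines.
entry = re.compile(
    r'<em class="">(.*?)</em>.*?'
    r'<span class="title">(.*?)</span>.*?'
    r'<p class="">.*?导演:(.*?)&nbsp;',
    re.S,
)
nation = re.compile(r'<br>.*?/&nbsp;(.*?)&nbsp;/.*?</p>', re.S)

# Each findall item is a (rank, title, director) tuple; trim the director's padding.
rows = [(rank, title, director.strip())
        for rank, title, director in entry.findall(sample)]
countries = nation.findall(sample)

print(rows)
print(countries)
```

Only the first title span is captured, which matches the choice to keep the Chinese title.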
Finally, add a request header in the code so the scraper is less likely to be blocked

Full code:

import requests
import re
import json

class Movie(object):
    def __init__(self, rnd, n):
        # rnd: (rank, title, director)
        # n:   country
        self.info = {
            'rank': rnd[0],
            'title': rnd[1],
            'director': rnd[2],
            'country': n
        }

    def __repr__(self):
        return str(self.info)

def parse_html(url):
    # Send a browser-like User-Agent so the request is less likely to be rejected
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36 SE 2.X MetaSr 1.0"}
    response = requests.get(url, headers=headers)
    text = response.text
    # Douban serves the page in Simplified Chinese, so the label is written as '导演'
    regex = ('<em class="">(.*?)</em>.*?<span class="title">(.*?)</span>'
             '.*?<p class="">.*?导演:(.*?)&nbsp;')
    nation = '<br>.*?/&nbsp;(.*?)&nbsp;/.*?</p>'
    # re.S lets '.' match newlines, since each entry spans several lines
    rank_name_dire = re.findall(regex, text, re.S)
    r3 = re.findall(nation, text, re.S)
    # Pair each (rank, title, director) tuple with its country and return the list
    return [Movie(i[0], i[1]) for i in zip(rank_name_dire, r3)]


def write_movies_file(info):
    # Append one JSON object per line
    with open('douban_film.txt', 'a', encoding='utf-8') as f:
        f.write(json.dumps(info, ensure_ascii=False) + '\n')


def main():
    for offset in range(0, 250, 25):
        url = 'https://movie.douban.com/top250?start=' + str(offset) + '&filter='
        for item in parse_html(url):
            print(item)
            write_movies_file(item.info)  # a Movie object itself is not JSON serializable


if __name__ == '__main__':
    main()
