Python Web Scraping in Practice: The Requests and re Libraries

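import requests
# Fetch the source of the Baidu home page with a GET request
res = requests.get("https://www.baidu.com")
# Type of the returned object
print(type(res))
# HTTP status code
print(res.status_code)
# Type of the decoded response body
print(type(res.text))
# First 15 characters of the body
print(res.text[:15])
# Cookies set by the server
print(res.cookies)

Output: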

<class 'requests.models.Response'>
200 # status code 200 means the response is normal
<class 'str'>
<!DOCTYPE html>
<RequestsCookieJar[<Cookie BDORZ=27315 for .baidu.com/>]>

3. Main methods

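The requests library has the following 7 main methods. The rest of this section briefly introduces the most commonly used ones.

Method	Description
requests.request()	Builds and sends a request; the general method that the six methods below wrap
requests.get()	The main method for fetching an HTML page; corresponds to HTTP GET
requests.head()	Fetches the header information of an HTML page; corresponds to HTTP HEAD
requests.post()	Submits a POST request to an HTML page; corresponds to HTTP POST
requests.put()	Submits a PUT request to an HTML page; corresponds to HTTP PUT
requests.patch()	Submits a partial-modification request to an HTML page; corresponds to HTTP PATCH
requests.delete()	Submits a delete request to an HTML page; corresponds to HTTP DELETE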

get is the method we use most often. The following code sends a GET request to a site:

import requests
res = requests.get("http://httpbin.org/get")
print(res.text)

Output:

{
  "args": {}, 
  "headers": {
    "Accept": "*/*", 
    "Accept-Encoding": "gzip, deflate", 
    "Host": "httpbin.org", 
    "User-Agent": "python-requests/2.22.0", 
    "X-Amzn-Trace-Id": "Root=1-5e5dd355-96131363cf818957b8e7b67d"
  }, 
  "origin": "171.112.101.74", 
  "url": "http://httpbin.org/get"
}

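The output shows that the response includes the request headers, the URL, and the originating IP address. To pass query parameters in a GET request, set the params argument: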

import requests
data = {
  'building':"zhongyuan",
  'nature':"administrative"
}
res = requests.get("http://httpbin.org/get",params=data)
print(res.text)

Output:

{
  "args": {
    "building": "zhongyuan", 
    "nature": "administrative"
  }, 
  "headers": {
    "Accept": "*/*", 
    "Accept-Encoding": "gzip, deflate", 
    "Host": "httpbin.org", 
    "User-Agent": "python-requests/2.22.0", 
    "X-Amzn-Trace-Id": "Root=1-5e5dd4db-f0568e98f3350cae8998968a"
  }, 
  "origin": "171.112.101.74", 
  "url": "http://httpbin.org/get?building=zhongyuan&nature=administrative"
}
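
To check how params ends up in the URL without sending anything over the network, we can ask requests to build the request and then inspect it. This is a minimal sketch, assuming the Request and prepare() API of a reasonably recent requests release; it only constructs the request and never contacts httpbin.org.

```python
import requests

# Same parameters as in the example above
data = {
    "building": "zhongyuan",
    "nature": "administrative",
}

# Build the request but do not send it; prepare() encodes params into the URL
prepared = requests.Request("GET", "http://httpbin.org/get", params=data).prepare()
print(prepared.url)
```

The printed URL should match the "url" field that httpbin echoes back in the output above.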

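As shown above, the parameters were passed successfully in the GET request. Note also that the response body is not just a string: it is valid JSON, so we can parse it into a Python object with the response's json() method: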

import requests
res = requests.get("http://httpbin.org/get")
print(type(res.text))
print(res.json())
print(type(res.json()))

Output:

<class 'str'>
{'args': {}, 'headers': {'Accept': '*/*', 'Accept-Encoding': 'gzip, deflate', 'Host': 'httpbin.org', 'User-Agent': 'python-requests/2.22.0', 'X-Amzn-Trace-Id': 'Root=1-5e5dd5e5-b195baec1c11b51c03eee96c'}, 'origin': '171.112.101.74', 'url': 'http://httpbin.org/get'}
<class 'dict'>
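
Calling res.json() is in effect a shortcut for decoding the body text as JSON. The sketch below performs the same step offline with the standard json module; the body string is a shortened, hand-written stand-in for res.text (an assumption for illustration, not a real response).

```python
import json

# Shortened stand-in for res.text (hypothetical sample, not a captured response)
body = '{"args": {}, "origin": "171.112.101.74", "url": "http://httpbin.org/get"}'

parsed = json.loads(body)   # str -> dict
print(type(parsed))
print(parsed["url"])
```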

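To make an HTTP request sent by Requests look as if it came from a browser, we usually use the headers keyword argument. Like params, headers takes a dictionary. The following code shows the usage: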

import requests
headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.115 Safari/537.36'
}
res = requests.get("http://httpbin.org/get",headers=headers)
print(res.text)

The output is shown below. The "User-Agent" header has changed and no longer exposes python-requests, which can help with sites that restrict crawlers based on this header.

{
  "args": {}, 
  "headers": {
    "Accept": "*/*", 
    "Accept-Encoding": "gzip, deflate", 
    "Host": "httpbin.org", 
    "User-Agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.115 Safari/537.36", 
    "X-Amzn-Trace-Id": "Root=1-5e5dd68c-e185889878974c88aff8d704"
  }, 
  "origin": "171.112.101.74", 
  "url": "http://httpbin.org/get"
}
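
When many requests need the same headers, one option is to set them once on a requests.Session instead of passing headers= every time. This is a sketch under that assumption; prepare_request() is used only so that the merged headers can be inspected without any network traffic.

```python
import requests

UA = ("Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 "
      "(KHTML, like Gecko) Chrome/59.0.3071.115 Safari/537.36")

session = requests.Session()
session.headers.update({"User-Agent": UA})   # applied to every request made through this session

# Build (but do not send) a request to see which headers would go out
prepared = session.prepare_request(requests.Request("GET", "http://httpbin.org/get"))
print(prepared.headers["User-Agent"])
```

A later call such as session.get("http://httpbin.org/get") would then carry this User-Agent automatically.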

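The data argument is normally used together with POST requests. To submit form data with POST, we only need to pass a dictionary to data. To submit a JSON body instead, pass the dictionary to the json argument.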

import requests
data = {
    'user': 'admin',
    'pass': 'admin'
}
res = requests.post('http://httpbin.org/post', data=data) 
print(res.text)

The result is as follows:

{
  "args": {}, 
  "data": "", 
  "files": {}, 
  "form": {
    "pass": "admin", 
    "user": "admin"
  }, 
  "headers": {
    "Accept": "*/*", 
    "Accept-Encoding": "gzip, deflate", 
    "Content-Length": "21", 
    "Content-Type": "application/x-www-form-urlencoded", 
    "Host": "httpbin.org", 
    "User-Agent": "python-requests/2.22.0", 
    "X-Amzn-Trace-Id": "Root=1-5e5dd775-932576d4bdcad64891fb54fa"
  }, 
  "json": null, 
  "origin": "171.112.101.74", 
  "url": "http://httpbin.org/post"
}
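
The "form" field and the Content-Type header in the output above show that a dictionary passed to data is form-encoded. The sketch below contrasts data= with json= by building both requests without sending them; it assumes the Request and prepare() API of a recent requests release.

```python
import json
import requests

payload = {"user": "admin", "pass": "admin"}
url = "http://httpbin.org/post"

# data= produces a form-encoded body (what the example above sends)
form_req = requests.Request("POST", url, data=payload).prepare()
# json= produces a JSON body with a JSON content type
json_req = requests.Request("POST", url, json=payload).prepare()

print(form_req.headers["Content-Type"], form_req.body)
print(json_req.headers["Content-Type"], json_req.body)
```

With json=, httpbin would echo the payload under "json" rather than "form".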

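When we send requests through a proxy, requests often fail because the proxy has stopped working. We therefore set a request timeout, and when a request times out we switch to another proxy and reconnect.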

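import requests
url = "http://httpbin.org/get"  # example URL (the httpbin endpoint used above)
# Time out after 3 seconds (applies to both connecting and reading)
res = requests.get(url, timeout=3)
# Pass a tuple to set the connect timeout and the read timeout separately
response = requests.get(url, timeout=(3, 30))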
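
The "switch proxy and reconnect" step could be sketched as follows. The function name fetch_with_proxies, the injectable get parameter, and any proxy addresses passed in are assumptions made for illustration; only requests.get, its proxies and timeout arguments, and the exception classes come from the library.

```python
import requests

def fetch_with_proxies(url, proxy_urls, timeout=(3, 30), get=requests.get):
    """Try each proxy in turn and return the first response that arrives in time.

    `get` is injectable so the function can be exercised without a network.
    """
    last_error = None
    for proxy in proxy_urls:
        try:
            # Route both http and https traffic through the current proxy
            return get(url, proxies={"http": proxy, "https": proxy}, timeout=timeout)
        except (requests.exceptions.Timeout,
                requests.exceptions.ConnectionError) as exc:
            last_error = exc          # this proxy failed; move on to the next one
    raise RuntimeError(f"all {len(proxy_urls)} proxies failed") from last_error
```

A call might look like fetch_with_proxies("http://httpbin.org/get", ["http://10.0.0.1:8080", "http://10.0.0.2:8080"]), where both addresses are placeholders rather than working proxies.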

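II. The re library

1. Introduction

A regular expression is a special sequence of characters that makes it easy to check whether a string matches a given pattern. The re module gives Python full regular-expression support. In a crawler, re handles information extraction: with it we can pull exactly the information we want out of page source in bulk.

2. Getting started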

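OSChina provides an online regular-expression tester that is good for practice (https://tool.oschina.net/regex/#). Below, we first enter some test text and then choose the Email-address match, whose regular expression is: [\w!#$%&'*+/=?^_`{|}~-]+(?:\.[\w!#$%&'*+/=?^_`{|}~-]+)*@(?:[\w](?:[\w-]*[\w])?\.)+[\w](?:[\w-]*[\w])?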

Does it look like a string of gibberish? In fact every piece of it has a specific meaning, as listed in the reference below:

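\w      Matches a letter, digit, or underscore
\W      Matches any character that is not a letter, digit, or underscore
\s      Matches any whitespace character; for ASCII text, equivalent to [ \t\n\r\f\v]
\S      Matches any non-whitespace character
\d      Matches any digit
\D      Matches any non-digit
\A      Matches at the start of the string
\Z      Matches at the end of the string (Perl-style engines also match just before a trailing newline; Python's re matches only at the very end)
\z      Matches at the very end of the string (Perl-style; Python's re has traditionally used \Z for this)
\G      Matches at the position where the previous match ended (not supported by Python's re)
\n      Matches a newline
\t      Matches a tab
^       Matches at the start of the string
$       Matches at the end of the string (or just before a trailing newline)
.       Matches any character except a newline; with the re.DOTALL flag it matches any character, including a newline
[...]   Matches any one of a set of characters: [amk] matches a, m, or k
[^...]  Matches any character not in the set: [^abc] matches anything except a, b, or c
*       Matches zero or more repetitions of the preceding expression
+       Matches one or more repetitions of the preceding expression
?       Matches zero or one of the preceding expression; placed after another quantifier (*?, +?, ??) it makes that quantifier non-greedy
{n}     Matches exactly n repetitions of the preceding expression
{n,m}   Matches n to m repetitions of the preceding expression, greedily
a|b     Matches a or b
()      Matches the expression inside the parentheses and captures it as a group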
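
To see a few of these tokens at work, the sketch below runs the Email pattern quoted above, plus two smaller patterns, over a made-up sentence. The sample text and its addresses are assumptions for illustration.

```python
import re

# Made-up sample text (hypothetical addresses)
text = "Contact alice@example.com or bob.smith@mail.example.org before 2020-05-03."

# The Email pattern quoted above, written as a raw string split over two lines
email = re.compile(
    r"[\w!#$%&'*+/=?^_`{|}~-]+(?:\.[\w!#$%&'*+/=?^_`{|}~-]+)*"
    r"@(?:[\w](?:[\w-]*[\w])?\.)+[\w](?:[\w-]*[\w])?"
)
emails = email.findall(text)
print(emails)        # ['alice@example.com', 'bob.smith@mail.example.org']

digits = re.findall(r"\d+", text)          # \d+ : runs of one or more digits
print(digits)        # ['2020', '05', '03']

year = re.search(r"\d{4}", text).group()   # \d{4} : exactly four digits
print(year)          # 2020
```

Writing patterns as raw strings (r"...") keeps Python from interpreting the backslashes before the regex engine sees them.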

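3. Main methods

The match function

Function prototype: match(pattern, string, flags=0)

Tries to match a pattern at the start of the string. If the start of the string does not match, it returns None.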

import re
content= "hello 123 4567 World_This is a regex Demo"
# Raw string so that \s, \d and \w reach the regex engine unchanged
result = re.match(r'^hello\s\d\d\d\s\d{4}\s\w{10}.*Demo$',content)
print(result)
print(result.group()) # the matched text
print(result.span())  # (start, end) positions of the match

Output:

<re.Match object; span=(0, 41), match='hello 123 4567 World_This is a regex Demo'>
hello 123 4567 World_This is a regex Demo
(0, 41)
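
Because match returns None when the start of the string does not fit the pattern, calling group() on the result directly can raise AttributeError. A minimal sketch of guarding against that, reusing the content string from above:

```python
import re

content = "hello 123 4567 World_This is a regex Demo"

# match() only looks at the start, so a pattern for text in the middle returns None
result = re.match(r"World", content)
print(result)                       # None

# Check for None before calling group()
if result is None:
    matched = None
else:
    matched = result.group()

ok = re.match(r"hello\s\d{3}", content)
print(ok.group())                   # hello 123
```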

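Generic matching

The regular expression in the code above is needlessly complicated. There is a catch-all we can use instead: .* (dot star). The dot matches any character except a newline, and the star repeats the preceding element any number of times, so together they match any run of characters. We can therefore simplify the pattern as follows: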

content= "hello 123 4567 World_This is a regex Demo"
result = re.match('^hello.*Demo$',content)
print(result)
print(result.group())
print(result.span())

The output is the same as before:

<re.Match object; span=(0, 41), match='hello 123 4567 World_This is a regex Demo'>
hello 123 4567 World_This is a regex Demo
(0, 41)

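Group matching

To capture a specific target inside the string, wrap it in parentheses () to form a group.

content= "hello 123 4567 World_This is a regex Demo"
result = re.match(r'^hello\s(\d+).*Demo$',content)
print(result.group())
print(result.group(1))

group() prints the whole match and group(1) prints the first captured group: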

hello 123 4567 World_This is a regex Demo
123
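
A pattern can contain several groups. The sketch below, which extends the example above with two extra groups of our own choosing, shows group(n) for individual groups and groups() for all of them at once.

```python
import re

content = "hello 123 4567 World_This is a regex Demo"
result = re.match(r"^hello\s(\d+)\s(\d+)\s(\w+)", content)

print(result.group(1))    # 123
print(result.group(2))    # 4567
print(result.groups())    # ('123', '4567', 'World_This')
```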

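Greedy matching

In short, a greedy quantifier keeps matching as much as it can and stops only when it can match no further.

content= "hello 123 4567 World_This is a regex Demo"
result = re.match(r'^hello.*(?P<name>\d+).*Demo$',content)
print(result.group())
print(result.group(1))
print(result.groupdict())

The result is shown below. The group captures only 7 because the preceding .* has consumed everything it could, leaving a single digit for \d+. This is greedy matching.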

hello 123 4567 World_This is a regex Demo
7
{'name': '7'}

For non-greedy matching, add a question mark (?) after the quantifier:

content= "hello 123 4567 World_This is a regex Demo"
result = re.match(r'^hello.*?(?P<name>\d+).*Demo$',content)
print(result.group())
print(result.group(1))
print(result.groupdict())

This yields 123:

hello 123 4567 World_This is a regex Demo
123
{'name': '123'}
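
The difference matters a great deal when extracting content between tags from page source. A small sketch with a made-up HTML fragment (an assumption for illustration):

```python
import re

html = "<b>one</b> and <b>two</b>"   # hypothetical fragment

greedy = re.findall(r"<b>(.*)</b>", html)    # .* runs on to the last </b>
lazy = re.findall(r"<b>(.*?)</b>", html)     # .*? stops at the first </b>

print(greedy)   # ['one</b> and <b>two']
print(lazy)     # ['one', 'two']
```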

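Adding match flags

The third parameter of match(pattern, string, flags=0), flags, sets the matching mode:

re.I: makes matching case-insensitive

re.L: makes matching locale-dependent

re.S: makes . match every character, including a newline

re.M: multi-line matching; affects ^ and $

re.U: interprets characters according to the Unicode character set; affects \w, \W, \b, \B

re.X: allows whitespace and comments inside the pattern so that it is easier to read

For example, setting the flag to re.I makes matching case-insensitive: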

content= "heLLo 123 4567 World_This is a regex Demo"
result = re.match('hello',content,re.I)
print(result.group())

The result is shown below; heLLo is still printed:

heLLo
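
The other flags work the same way, and several flags can be combined with the | operator. The sketch below uses a made-up three-line string to show re.S, re.M, and re.X in turn.

```python
import re

text = "Hello\nworld 42\nDEMO"   # hypothetical three-line sample

# Without re.S the dot stops at a newline, so this fails
print(re.match(r"hello.*demo", text, re.I))             # None
# re.I and re.S combined with |
both = re.match(r"hello.*demo", text, re.I | re.S)
print(both.span())                                      # (0, 19)

# re.M lets ^ match at the start of every line
starts = re.findall(r"^\w+", text, re.M)
print(starts)                                           # ['Hello', 'world', 'DEMO']

# re.X ignores whitespace in the pattern and allows comments
number = re.search(r"""
    \d+     # one or more digits
""", text, re.X)
print(number.group())                                   # 42
```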

The search function

Function prototype: def search(pattern, string, flags=0)

Scans the whole string and returns the first successful match.

content= '''hahhaha hello 123 4567 world'''
result = re.search('hello.*world',content)
print(result.group())

Output:

hello 123 4567 world
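
A short sketch contrasting match and search on the same string makes the difference concrete; the span shows where in the string search found the pattern.

```python
import re

content = "hahhaha hello 123 4567 world"

print(re.match(r"hello", content))         # None: the string does not start with hello
found = re.search(r"hello.*world", content)
print(found.span())                        # (8, 28)
print(content[found.start():found.end()])  # hello 123 4567 world
```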

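The findall function

Function prototype: def findall(pattern, string, flags=0). Searches the string and returns every match as a list. When the pattern contains one group, the list holds the text captured by that group.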

content= '''
    <url>http://httpbin.org/get</url>
    <url>http://httpbin.org/post</url>
    <url>https://www.baidu.com</url>'''
urls = re.findall('<url>(.*)</url>',content)
for url in urls:
    print(url)

The code above prints every matching string, that is, each link:

http://httpbin.org/get
http://httpbin.org/post
https://www.baidu.com
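
When the pattern has more than one group, findall returns a list of tuples instead. The sketch below splits each link into scheme and host (a pattern of our own, chosen for illustration) and also shows re.finditer, which yields match objects rather than strings.

```python
import re

content = """
    <url>http://httpbin.org/get</url>
    <url>http://httpbin.org/post</url>
    <url>https://www.baidu.com</url>"""

# Two groups: findall returns a list of (scheme, host) tuples
pairs = re.findall(r"<url>(https?)://([^/<]+)", content)
print(pairs)

# finditer yields match objects, so group(), span() and so on are available
hosts = [m.group(2) for m in re.finditer(r"<url>(https?)://([^/<]+)", content)]
print(hosts)
```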

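The sub function

Function prototype: def sub(pattern, repl, string, count=0, flags=0). Replaces every matching substring in the string and returns the new string.

content= '''hello 123 4567 world'''
new_content = re.sub('123.*world','future',content)
print(new_content)

The output shows 123 and everything after it replaced with 'future':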

hello future
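
A few variations on the same call, sketched on the same string: count limits the number of replacements, \1 and \2 in the replacement refer back to captured groups, and re.subn additionally returns how many replacements were made.

```python
import re

content = "hello 123 4567 world"

print(re.sub(r"\d+", "#", content))             # hello # # world
print(re.sub(r"\d+", "#", content, count=1))    # hello # 4567 world

# \1 and \2 in the replacement refer to the captured groups
swapped = re.sub(r"(\d+) (\d+)", r"\2 \1", content)
print(swapped)                                  # hello 4567 123 world

# subn returns the new string together with the number of replacements
replaced, n = re.subn(r"\d+", "#", content)
print(replaced, n)                              # hello # # world 2
```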

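The compile function

Function prototype: def compile(pattern, flags=0). Compiles a regular expression into a pattern object so that it can be reused.

content= '''hello 123 4567 world'''
pattern = '123.*world'
regex = re.compile(pattern)
new_content = re.sub(regex,'future',content)
print(new_content)

The output is the same as above: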

hello future
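
A compiled pattern object also has its own findall, search, and sub methods, which is convenient when the same expression is applied to many strings. A minimal sketch with made-up input lines:

```python
import re

number = re.compile(r"\d+")          # compile once, reuse below

lines = ["hello 123 4567 world", "no digits here", "room 42"]   # hypothetical inputs
results = [number.findall(line) for line in lines]
print(results)                       # [['123', '4567'], [], ['42']]

first = number.search(lines[0])
print(first.group())                 # 123
masked = number.sub("#", lines[2])
print(masked)                        # room #
```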

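That concludes this brief introduction to the requests and re libraries. The next article uses the two libraries in a hands-on web scraping project. For the fundamentals, see the previous article:

Python Web Scraping in Practice: Fundamentals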
