Python Web Crawlers, Part 1: The requests Library

Posted by LTQblog on 2017-06-06

Installing requests

Installing requests for Python 2

python2 -m pip install requests

Installing requests for Python 3

python3 -m pip install requests

A small demo

>>> import requests
>>> r = requests.get("http://www.baidu.com") # request the Baidu home page
>>> r.status_code # check the status code; 200 means the request succeeded
200
>>> r.encoding = 'utf-8' # set the encoding to UTF-8
>>> r.text # show the page content

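Exceptions raised by the requests library

requests.ConnectionError Network connection error, such as a DNS lookup failure or a refused connection
requests.HTTPError HTTP error (raised by raise_for_status() for 4xx/5xx responses)
requests.URLRequired A URL is missing
requests.TooManyRedirects The maximum number of redirects was exceeded
requests.ConnectTimeout Timed out while connecting to the remote server
requests.Timeout The request timed out (covers both connect and read timeouts)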
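Several of these exceptions are related by inheritance, which affects the order in which they should be caught. The sketch below is one possible way to tell them apart; the helper name describe_failure is made up for illustration, and the exceptions are instantiated by hand only so that the example runs without a network.

```python
import requests

def describe_failure(exc):
    """Map a requests exception to a short human-readable label."""
    # Order matters: ConnectTimeout is a subclass of both ConnectionError
    # and Timeout, so the most specific classes are checked first.
    if isinstance(exc, requests.ConnectTimeout):
        return "connect timeout"
    if isinstance(exc, requests.Timeout):
        return "timeout"
    if isinstance(exc, requests.ConnectionError):
        return "connection error"
    if isinstance(exc, requests.HTTPError):
        return "HTTP error"
    if isinstance(exc, requests.TooManyRedirects):
        return "too many redirects"
    if isinstance(exc, requests.RequestException):
        return "other requests error"
    return "not a requests error"

# Instantiated by hand purely for illustration; normally requests raises them.
labels = [describe_failure(e) for e in (
    requests.ConnectTimeout(),
    requests.Timeout(),
    requests.ConnectionError(),
    requests.HTTPError(),
)]
print(labels)
```

All of the exceptions listed above derive from requests.RequestException, so catching that one class is enough when the kind of failure does not matter.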

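A general-purpose code framework: a small example

import requests

def getHTMLText(url):
    """Fetch url and return its text, or an error message on failure."""
    try:
        r = requests.get(url, timeout=30)
        r.raise_for_status()  # raise HTTPError for 4xx/5xx responses
        print(r.apparent_encoding)
        r.encoding = r.apparent_encoding  # decode using the detected encoding
        return r.text
    except requests.RequestException:
        return "An exception occurred"

if __name__ == "__main__":
    url = "http://www.baidu.com"
    print(getHTMLText(url))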

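The HTTP protocol

HTTP (Hypertext Transfer Protocol) is a stateless application-layer protocol based on a request/response model. HTTP uses URLs to locate network resources. A URL has the following format:
http://host[:port][path]
host: a valid Internet host name or IP address
port: the port number; the default is 80
path: the path of the requested resource

Examples of HTTP URLs:
http://www.bit.edu.cn
http://220.181.111.188/duty
How to read an HTTP URL:
A URL is an Internet path for accessing a resource over HTTP; each URL corresponds to one data resource.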
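The host, port, and path parts of the two example URLs can be inspected with the standard library. This is a minimal sketch; the helper name describe_url is an assumption made for illustration, and the default port of 80 is filled in by hand because urlsplit reports no port when none is written.

```python
from urllib.parse import urlsplit

def describe_url(url):
    """Split an HTTP URL into the host, port, and path parts described above."""
    parts = urlsplit(url)
    return {
        "host": parts.hostname,
        # urlsplit reports None when no port is written; HTTP then uses 80.
        "port": parts.port if parts.port is not None else 80,
        # An empty path refers to the root of the site.
        "path": parts.path or "/",
    }

for example in ("http://www.bit.edu.cn", "http://220.181.111.188/duty"):
    print(describe_url(example))
```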

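HTTP operations on resources

GET Request the resource at the URL
HEAD Request only the response headers for the resource at the URL
POST Append new data to the resource at the URL
PUT Store a resource at the URL, overwriting whatever is already there
PATCH Partially update the resource at the URL, changing only some of its content
DELETE Delete the resource stored at the URL

The HTTP methods map one-to-one onto the functions of the requests library.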

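The 7 main functions of the requests library

requests.request() Builds and sends a request; the base function underlying all the others
requests.get() The main function for fetching an HTML page; corresponds to HTTP GET
requests.head() Fetches the headers of an HTML page; corresponds to HTTP HEAD
requests.post() Submits a POST request to an HTML page; corresponds to HTTP POST
requests.put() Submits a PUT request to an HTML page; corresponds to HTTP PUT
requests.patch() Submits a partial-update request to an HTML page; corresponds to HTTP PATCH
requests.delete() Submits a delete request to an HTML page; corresponds to HTTP DELETE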
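The correspondence between HTTP methods and requests functions can be checked without touching the network. The sketch below relies on requests.Request(...).prepare(), which builds a request but does not send it; the URL is reused from the examples in this article and is never contacted.

```python
import requests

URL = "http://python123.io/ws"  # never contacted; the requests are only built

# prepare() constructs the request without sending it, so the HTTP method
# that would go on the wire can be inspected offline.
methods = ["GET", "HEAD", "POST", "PUT", "PATCH", "DELETE"]
prepared = {m: requests.Request(m, URL).prepare() for m in methods}

for m in methods:
    # Each HTTP method has a same-named convenience function in requests.
    helper = getattr(requests, m.lower())
    print(m, "->", "requests." + helper.__name__ + "()", prepared[m].method)
```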

head() example

>>> r = requests.head('http://httpbin.org/get')
>>> r.headers
{'Content-Length': '238', 'Access-Control-Allow-Origin': '*', 'Access-Control-Allow-Credentials': 'true', 'Content-Type': 'application/json', 'Server': 'nginx', 'Connection': 'keep-alive', 'Date': 'Sat, 18 Feb 2017 12:07:44 GMT'}
>>> r.text
''

post() example

>>> payload = {'key1': 'value1', 'key2': 'value2'}
>>> r = requests.post('http://httpbin.org/post', data = payload)
>>> print(r.text)
{ ...
"form": {
"key2": "value2",
"key1": "value1"
},
}

When a dict is POSTed to a URL, it is automatically form-encoded and ends up in the form field.

>>> r = requests.post('http://httpbin.org/post', data = 'ABC')
>>> print(r.text)
{ ...
"data": "ABC",
"form": {},
}

When a string is POSTed to a URL, it is sent as the raw request body and ends up in the data field.
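The difference between posting a dict and posting a string can also be seen on the client side, before anything is sent. The following is a sketch that uses requests.Request(...).prepare() to build both requests offline and then compares their bodies and Content-Type headers.

```python
import requests

URL = "http://httpbin.org/post"  # not contacted; the requests are only built

# prepare() encodes the body the way it would be sent, without sending it.
form_req = requests.Request("POST", URL, data={"key1": "value1", "key2": "value2"}).prepare()
raw_req = requests.Request("POST", URL, data="ABC").prepare()

# A dict is form-encoded and labelled as such.
print(form_req.headers.get("Content-Type"), form_req.body)
# A string is passed through unchanged and gets no Content-Type by default.
print(raw_req.headers.get("Content-Type"), raw_req.body)
```

This is why httpbin reports the dict under form and the string under data.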

put() example

>>> payload = {'key1': 'value1', 'key2': 'value2'}
>>> r = requests.put('http://httpbin.org/put', data = payload)
>>> print(r.text)
{ ...
"form": {
"key2": "value2",
"key1": "value1"
},
}

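The request method

The request method of the requests library is the base method for all the others.

Full usage of the request method

requests.request(method, url, **kwargs)

method : the request method, one of the 7 listed below (GET, PUT, POST, etc.)
url : the URL of the page to fetch
**kwargs: parameters that control the request, 13 in total

method: the request method (7 kinds)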

r = requests.request('GET', url, **kwargs)
r = requests.request('HEAD', url, **kwargs)
r = requests.request('POST', url, **kwargs)
r = requests.request('PUT', url, **kwargs)
r = requests.request('PATCH', url, **kwargs)
r = requests.request('DELETE', url, **kwargs)
r = requests.request('OPTIONS', url, **kwargs)

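These correspond to the request methods of the HTTP protocol.
OPTIONS asks the server which parameters it supports for communicating with clients.

**kwargs: parameters that control the request; all are optional

params : a dict or byte sequence, added to the URL as query parameters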

>>> kv = {'key1': 'value1', 'key2': 'value2'}
>>> r = requests.request('GET', 'http://python123.io/ws', params=kv)
>>> print(r.url)
http://python123.io/ws?key1=value1&key2=value2

data : a dict, byte sequence, or file object, sent as the body of the Request

>>> kv = {'key1': 'value1', 'key2': 'value2'}
>>> r = requests.request('POST', 'http://python123.io/ws', data=kv)
>>> body = 'body content'
>>> r = requests.request('POST', 'http://python123.io/ws', data=body)

json : data in JSON format, sent as the body of the Request

>>> kv = {'key1': 'value1'}
>>> r = requests.request('POST', 'http://python123.io/ws', json=kv)
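To see what the json parameter actually puts on the wire, the request can be built without sending it. This is a small sketch using requests.Request(...).prepare(); the body is decoded with the standard json module rather than compared as text, since its exact byte form may vary between versions of requests.

```python
import json
import requests

req = requests.Request("POST", "http://python123.io/ws", json={"key1": "value1"}).prepare()

# The dict is serialized to JSON and the Content-Type header is set to match.
print(req.headers["Content-Type"])
print(json.loads(req.body))
```

Passing the same dict through data instead would form-encode it rather than serialize it as JSON.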

headers : a dict of custom HTTP headers

>>> hd = {'user-agent': 'Chrome/10'}
>>> r = requests.request('POST', 'http://python123.io/ws', headers=hd)

cookies : a dict or CookieJar, the cookies sent with the Request
auth : a tuple, enables HTTP authentication
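Neither cookies nor auth is illustrated above, so here is a minimal sketch of what they do to a request. The user name, password, and cookie value are hypothetical, and the request is only prepared, never sent. A two-element tuple passed as auth is treated as HTTP Basic authentication.

```python
import base64
import requests

# Hypothetical credentials and cookie, for illustration only.
req = requests.Request(
    "GET",
    "http://python123.io/ws",
    auth=("user", "pass"),
    cookies={"session_id": "abc123"},
).prepare()

# The tuple becomes an Authorization header: "Basic " + base64("user:pass").
scheme, token = req.headers["Authorization"].split(" ", 1)
print(scheme, base64.b64decode(token).decode())

# The dict becomes a Cookie header.
print(req.headers.get("Cookie"))
```

Basic authentication only base64-encodes the credentials, so it should be combined with HTTPS in practice.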

files : a dict, used to upload files

>>> fs = {'file': open('data.xls', 'rb')}
>>> r = requests.request('POST', 'http://python123.io/ws', files=fs)

timeout : the timeout, in seconds

>>> r = requests.request('GET', 'http://www.baidu.com', timeout=10)

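proxies : a dict, sets the proxy servers to use; login credentials can be included

>>> pxs = { 'http': 'http://user:pass@10.10.10.1:1234',
...         'https': 'https://10.10.10.1:4321' }
>>> r = requests.request('GET', 'http://www.baidu.com', proxies=pxs)

allow_redirects : True/False, default True; whether to follow redirects
stream : True/False, default False; when False the response body is downloaded immediately, when True the download is deferred until the content is accessed
verify : True/False, default True; whether to verify the server's SSL certificate
cert : path to a local SSL certificate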

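The get method

Common usage of the get method

r = requests.get(url)

get builds a Request object that asks the server for a resource
and returns a Response object, r, containing the server's resource

Full usage of the get method

requests.get(url, params=None, **kwargs)

url : the URL of the page to fetch
params : extra parameters for the URL, as a dict or byte stream; optional
**kwargs: 12 parameters that control the request; optional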

>>> import requests
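>>> r = requests.get("http://www.baidu.com") # request the Baidu home page
>>> print(r.status_code) # print the status code of the request
200
>>> type(r) # check the type of r; it is an instance of the Response class
<class 'requests.models.Response'>
>>> r.headers # the response headers of the page returned by the GET request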
{'Server': 'bfe/1.0.8.18', 'Date': 'Wed, 19 Apr 2017 09:28:11 GMT', 'Content-Type': 'text/html', 'Last-Modified': 'Mon, 23 Jan 2017 13:27:33 GMT', 'Transfer-Encoding': 'chunked', 'Connection': 'Keep-Alive', 'Cache-Control': 'private, no-cache, no-store, proxy-revalidate, no-transform', 'Pragma': 'no-cache', 'Set-Cookie': 'BDORZ=27315; max-age=86400; domain=.baidu.com; path=/', 'Content-Encoding': 'gzip'}

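A status code of 200 means the request succeeded; 4xx and 5xx codes (for example 404 or 500) mean it failed.

The Response object contains all the information returned by the server, as well as information about the request that produced it.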
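The way raise_for_status() treats different status codes can be tried out offline. In the sketch below a Response object is built by hand, which is done only so the example runs without a network; in real code the object comes back from requests.get() and the other functions. The helper name outcome is made up for illustration.

```python
import requests

def outcome(status_code):
    """Report how raise_for_status() treats a given status code."""
    # Built by hand only so the example runs offline.
    r = requests.Response()
    r.status_code = status_code
    try:
        r.raise_for_status()
    except requests.HTTPError:
        return "HTTPError raised"
    return "no exception"

for code in (200, 301, 404, 500):
    print(code, outcome(code))
```

Only 4xx and 5xx codes raise; a 3xx code does not, which is one reason to check r.status_code explicitly when exactly 200 is required.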

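Attributes of the Response object

r.status_code The HTTP status of the response; 200 means success, 404 means the resource was not found
r.text The response body as a string, i.e. the content of the page at the URL
r.encoding The encoding of the response body, guessed from the HTTP headers
r.apparent_encoding The encoding of the response body, detected from the content itself (a fallback encoding)
r.content The response body in binary (bytes) form

Response encoding

r.encoding: if the headers contain no charset, the encoding is assumed to be ISO-8859-1
r.text decodes the page content according to r.encoding
r.apparent_encoding: the encoding detected by analysing the page content; it can be seen as a fallback for r.encoding

When the headers carry no charset, r.apparent_encoding is generally more reliable than r.encoding.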
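Why the ISO-8859-1 fallback produces unreadable text for a Chinese page can be shown with plain string operations. This sketch does not use requests at all; the sample text is a hypothetical page title, and the two decode calls stand in for what r.text does with a wrong and a right value of r.encoding.

```python
page_text = "百度一下，你就知道"          # hypothetical page content
raw_bytes = page_text.encode("utf-8")    # the bytes a UTF-8 server would send

# Decoding with the ISO-8859-1 fallback never fails, but yields mojibake:
# every byte is turned into one unrelated Latin-1 character.
wrong = raw_bytes.decode("iso-8859-1")
# Decoding with the encoding the bytes were written in restores the text.
right = raw_bytes.decode("utf-8")

print(repr(wrong))
print(right)
```

Setting r.encoding = r.apparent_encoding before reading r.text is the requests equivalent of choosing the second decode.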

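Problems caused by web crawlers

Crawling web pages (playing with individual pages):
small scale, small amounts of data, not sensitive to crawl speed. Use the requests library.
Crawling a website or a series of websites:
medium scale, larger amounts of data, sensitive to crawl speed (for example, crawling Ctrip). Use the Scrapy library.
Crawling the entire web:
large scale; for a search engine, crawl speed is critical. This can only be custom-built.

Harassing servers. Web servers are designed by default for human visitors. Depending on how well and for what purpose it was written, a web crawler can impose a huge resource load on a web server.
Legal risk over ownership. The data on a server belongs to someone. Profiting from data obtained by a crawler carries legal risk.
Privacy leaks. A web crawler may be able to get past simple access controls and obtain protected data, leaking personal information.

How servers restrict web crawlers:
- Source review: restrict by User-Agent (requires some technical capability)
Check the User-Agent field of incoming HTTP headers and respond only to browsers or friendly crawlers
- Public notice: the Robots protocol
Tell all crawlers the site's crawling policy and ask them to comply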
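User-Agent checks are effective against naive crawlers because requests identifies itself by default. The sketch below prints that default and shows that a headers dict replaces it; the value 'Mozilla/5.0' is just a generic browser-like string used for illustration, and the request is prepared through a Session without being sent.

```python
import requests

# The User-Agent that requests sends when none is supplied.
default_ua = requests.utils.default_user_agent()
print(default_ua)

# A headers dict on the request overrides the session default.
session = requests.Session()
req = requests.Request("GET", "http://www.baidu.com", headers={"User-Agent": "Mozilla/5.0"})
prepared = session.prepare_request(req)  # built but not sent
print(prepared.headers["User-Agent"])
```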

The Robots protocol

Robots Exclusion Standard
Purpose:
A website tells web crawlers which pages may be crawled and which may not
Form:
A robots.txt file in the root directory of the website

For example:
JD.com's robots.txt
https://www.jd.com/robots.txt

User-agent: *
Disallow: /?*
Disallow: /pop/*.html
Disallow: /pinpai/*.html?*
User-agent: EtaoSpider
Disallow: /
User-agent: HuihuiSpider
Disallow: /
User-agent: GwdangSpider
Disallow: /
User-agent: WochachaSpider
Disallow: /

Basic syntax of the Robots protocol:

User-agent: *
Disallow: /

Note: * stands for all crawlers, / stands for the root directory
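Rules in this syntax can be evaluated with the standard-library module urllib.robotparser. The sketch below parses a subset of the JD.com rules quoted above from a string, so it runs offline; "MyCrawler" is a hypothetical crawler name. Note that this parser does simple prefix matching, so wildcard patterns such as /pop/*.html may not be interpreted the way the site intends.

```python
from urllib.robotparser import RobotFileParser

# A subset of the JD.com rules quoted above, parsed from a string
# rather than fetched, so the example runs offline.
rules = """\
User-agent: *
Disallow: /?*
Disallow: /pop/*.html
User-agent: EtaoSpider
Disallow: /
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

page = "https://www.jd.com/896813.html"
# "MyCrawler" is a hypothetical name that falls under the "*" rules.
allowed_generic = parser.can_fetch("MyCrawler", page)
# EtaoSpider is disallowed from the whole site.
allowed_etao = parser.can_fetch("EtaoSpider", page)
print(allowed_generic, allowed_etao)  # True False
```

For a live site, RobotFileParser.set_url() followed by read() fetches and parses the real robots.txt instead.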

robots.txt files of some other sites

http://www.baidu.com/robots.txt
http://news.sina.com.cn/robots.txt
http://www.qq.com/robots.txt
http://news.qq.com/robots.txt
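http://www.moe.edu.cn/robots.txt (no robots.txt)

Not every website has a robots.txt.

How to comply with the Robots protocol

In practice, how should a crawler comply with the Robots protocol?

Web crawlers:
Read robots.txt, automatically or manually, before crawling any content
Binding force:
The Robots protocol is advisory rather than binding; a crawler may ignore it, but doing so carries legal risk

Human-like access need not consult the Robots protocol.
If the number of requests and the amount of data fetched are both small, the protocol need not be followed.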

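Web crawling in practice

Crawling a JD.com product page

import requests
url = "https://item.jd.com/896813.html"
try:
    r = requests.get(url)
    r.raise_for_status()
    r.encoding = r.apparent_encoding
    print(r.text[:1000])  # print only the first 1000 characters
except requests.RequestException:
    print("Crawl failed")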

Crawling an Amazon product page

import requests
url = "https://www.amazon.cn/電腦-it-辦公/dp/B00D0393AM/ref=sr_1_4?s=pc&ie=UTF8&qid=1492660788&sr=1-4&keywords=行動硬碟"
try:
    # send a browser-like User-Agent instead of the python-requests default
    kv = {'user-agent':'Mozilla/5.0'}
    r = requests.get(url,headers=kv)
    r.raise_for_status()
    r.encoding = r.apparent_encoding
    print(r.text[:1000])
except requests.RequestException:
    print("Crawl failed")

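Submitting search keywords to Baidu/360

import requests
keyword = "python"
try:
    kv = {'wd':keyword}
    r = requests.get("http://www.baidu.com/s",params=kv)
    # for 360 search, the keyword parameter is 'q' rather than 'wd':
    #r = requests.get("http://www.so.com/s",params={'q':keyword})
    print(r.request.url)  # the URL that was actually requested
    r.raise_for_status()
    print(len(r.text))
except requests.RequestException:
    print("Crawl failed")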
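The params dict is also what takes care of escaping keywords that contain spaces or non-ASCII characters. The sketch below builds the Baidu search URL without sending the request and decodes the query string again with the standard library; the keyword is a made-up sample.

```python
from urllib.parse import urlsplit, parse_qs
import requests

keyword = "python 教程"  # sample keyword with a space and non-ASCII text
req = requests.Request("GET", "http://www.baidu.com/s", params={"wd": keyword}).prepare()

# The keyword is percent-encoded into the query string.
print(req.url)
# Decoding the query string gives back the original keyword.
decoded = parse_qs(urlsplit(req.url).query)
print(decoded)
```

Building the query by string concatenation would require doing this escaping by hand.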

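Downloading and saving an image from the web

import requests
import os
root = "D://pics//"
url = "http://image.nationalgeographic.com.cn/2017/0419/20170419035805561.jpg"
path = root + url.split('/')[-1]  # use the last URL segment as the file name

try:
    if not os.path.exists(root):
        os.mkdir(root)
    if not os.path.exists(path):
        r = requests.get(url)
        r.raise_for_status()  # do not save an error page as an image
        with open(path, 'wb') as f:  # the with block closes the file automatically
            f.write(r.content)
        print("File saved successfully")
    else:
        print("File already exists")
except (requests.RequestException, OSError):
    print("Crawl failed")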
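Taking the last segment of the URL as the file name works for this image, but it would also pick up a query string if the URL had one. A slightly more defensive variant is sketched below, using only the standard library; the helper name local_path_for and the relative directory "pics" are assumptions made for illustration.

```python
import os
import posixpath
from urllib.parse import urlsplit

def local_path_for(url, root):
    """Build the local file path for an image URL inside the root directory."""
    # Take the file name from the URL path only, so a query string such as
    # "?size=large" does not end up in the file name.
    name = posixpath.basename(urlsplit(url).path)
    return os.path.join(root, name)

url = "http://image.nationalgeographic.com.cn/2017/0419/20170419035805561.jpg"
print(local_path_for(url, "pics"))
```

os.path.join also avoids hand-written separators, so the same code works on Windows and on other systems.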

Automatic lookup of an IP address's location

import requests
url = 'http://m.ip138.com/ip.asp?ip='
try:
    r = requests.get(url+'202.204.80.112')
    r.raise_for_status()
    r.encoding = r.apparent_encoding
    print(r.text[-500:])  # print only the last 500 characters
except requests.RequestException:
    print("Crawl failed")

Reference:

http://www.icourse163.org/course/BIT-1001870001
