Accessing Network Resources in Python with the Built-in urllib Module or the Third-Party requests Library

Posted by 潘高 on 2019-04-12

Original URL: http://juejin.im/post/5cb03392f265da037b61002a

Python

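Preface

For more content, please visit my personal blog.

Python offers many ways to access network resources, including urllib, urllib2, urllib3, httplib, httplib2, and requests (urllib2 and httplib are the Python 2 names). This article covers the following two:

The built-in urllib module
- Pros: ships with Python, so no third-party library needs to be installed
- Cons: verbose to use and lacking in higher-level features
The third-party requests library
- Pros: makes working with URL resources very convenient
- Cons: the third-party library has to be downloaded and installed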

The built-in `urllib` module

Making a GET request

Requests are made mainly with the urlopen() function, as follows:

from urllib import request

resp = request.urlopen('http://www.baidu.com')
print(resp.read().decode())

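The result is an http.client.HTTPResponse object, and its read() method returns the data fetched from the page. Note that this data is raw bytes, so it has to be converted to a string with decode() (which assumes UTF-8 unless another encoding is given).

Making a POST request

urlopen() sends a GET request by default. When the data argument is passed to urlopen(), it sends a POST request instead. Note: data must be a bytes object.

The timeout argument sets a timeout in seconds; if the request takes longer than that, an exception is raised. For example:

from urllib import request

resp = request.urlopen('http://www.baidu.com', data=b'word=hello', timeout=10)
print(resp.read().decode())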
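
Form fields usually start out as a dict rather than as bytes. A minimal sketch of producing the bytes that data expects, using urllib.parse.urlencode; the field names here are made up for illustration, and nothing is sent over the network:

```python
from urllib import parse

# Hypothetical form fields, for illustration only
fields = {'word': 'hello', 'lang': 'zh-TW'}

# urlencode() returns a str, so encode it to get the bytes that data requires
payload = parse.urlencode(fields).encode('utf-8')
print(payload)  # b'word=hello&lang=zh-TW'

# payload can now be passed as data, e.g. request.urlopen(url, data=payload)
```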

Adding headers

Requests sent through urllib carry a default header such as "User-Agent": "Python-urllib/3.6" (the version number matches the running Python version), which identifies the request as coming from urllib. For sites that check the User-Agent, we therefore need to set custom headers, and that requires the Request object from urllib.request.

from urllib import request

url = 'http://httpbin.org/get'
headers = {'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.181 Safari/537.36'}

# Build a Request object from the url and headers, then pass it to urlopen()
req = request.Request(url, headers=headers)
resp = request.urlopen(req)
print(resp.read().decode())
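
Before sending anything, it can help to check which headers a Request will carry. The sketch below only builds the object and inspects it; the User-Agent value is a made-up placeholder. One detail worth knowing is that Request stores header names in capitalized form (first letter upper case, the rest lower case), so lookups must use that spelling:

```python
from urllib import request

req = request.Request(
    'http://httpbin.org/get',
    headers={'user-agent': 'my-crawler/0.1'},  # hypothetical User-Agent value
)
# Headers can also be added after construction
req.add_header('Accept-Language', 'zh-TW')

# Header names are normalized with str.capitalize()
print(req.header_items())
print(req.get_header('User-agent'))   # found
print(req.get_header('User-Agent'))   # None: the stored key is 'User-agent'
```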

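The Request object

As shown above, urlopen() accepts not only a URL string but also a Request object, which offers more options. Its signature is:

class urllib.request.Request(url, data=None, headers={},
                                origin_req_host=None,
                                unverifiable=False,
                                method=None)

Only the url argument is required when constructing a Request; data and headers are both optional.

Finally, the method argument of the Request constructor selects the HTTP method, such as PUT or DELETE. When it is omitted, the method is GET if data is None and POST otherwise.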
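
The effect of data and method can be checked with get_method() without sending any request. A small sketch, where the URL is only a placeholder:

```python
from urllib import request

url = 'http://httpbin.org/anything'  # placeholder endpoint; nothing is sent here

get_req = request.Request(url)                                   # no data
post_req = request.Request(url, data=b'key=value')               # data given
put_req = request.Request(url, data=b'key=value', method='PUT')  # explicit method
delete_req = request.Request(url, method='DELETE')

methods = [r.get_method() for r in (get_req, post_req, put_req, delete_req)]
print(methods)  # ['GET', 'POST', 'PUT', 'DELETE']
```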

Adding cookies

To send cookie information along with requests, we need to build a custom opener.

Use request.build_opener() to construct the opener with a cookie handler attached, then call the opener's open() method to make the request, as follows:

from http import cookiejar
from urllib import request

url = 'https://www.baidu.com'
# Create a CookieJar object
cookie = cookiejar.CookieJar()
# Create a cookie handler with HTTPCookieProcessor
cookies = request.HTTPCookieProcessor(cookie)
# Build an opener with that handler
opener = request.build_opener(cookies)
# Make the request through this opener
resp = opener.open(url)

# The cookie jar now holds the cookies received from Baidu
for i in cookie:
    print(i)

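Alternatively, install_opener() can make this opener the global default.

From then on, every request made through urlopen() goes through the same cookie handler and carries these cookies.

# Install this opener as the global opener
request.install_opener(opener)
resp = request.urlopen(url)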
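
To see what the cookie handler does across requests, here is a minimal sketch that runs entirely on localhost. CookieEchoHandler is a hypothetical test server written for this illustration (it is not part of urllib): it sets a token cookie and echoes back whatever Cookie header it receives. The empty ProxyHandler is an added assumption so that the local request bypasses any system proxy.

```python
import threading
from http import cookiejar
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib import request


class CookieEchoHandler(BaseHTTPRequestHandler):
    """Hypothetical test server: sets a cookie and echoes the Cookie header."""

    def do_GET(self):
        body = self.headers.get('Cookie', '').encode()
        self.send_response(200)
        self.send_header('Set-Cookie', 'token=12345; Path=/')
        self.send_header('Content-Length', str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the demo output quiet


# Port 0 lets the operating system pick a free port
server = HTTPServer(('127.0.0.1', 0), CookieEchoHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
url = 'http://127.0.0.1:%d/' % server.server_port

jar = cookiejar.CookieJar()
opener = request.build_opener(
    request.HTTPCookieProcessor(jar),
    request.ProxyHandler({}),  # no proxies for this local request
)

first = opener.open(url).read().decode()   # jar is empty, so no cookie is sent
second = opener.open(url).read().decode()  # the stored cookie is sent back

server.shutdown()
server.server_close()

print(repr(first))
print(repr(second))
```

The first response body is empty because the jar had nothing to send; the second shows the cookie that the jar stored from the first response.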

Setting a proxy

When scraping data with a crawler, a proxy is often used to hide our real IP address, as follows:

from urllib import request

url = 'http://www.baidu.com'
proxy = {'http': '222.222.222.222:80', 'https': '222.222.222.222:80'}
# Create the proxy handler
proxies = request.ProxyHandler(proxy)
# Create the opener object
opener = request.build_opener(proxies)

resp = opener.open(url)
print(resp.read().decode())

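Downloading data to a local file

Network requests often need to save images, audio, or other data locally. One way is to use Python's file operations to write the data returned by read() to a file.

urllib also provides urlretrieve(), which saves the fetched data straight to a file, as follows:

from urllib import request

url = 'http://python.org/'
request.urlretrieve(url, 'python.html')

The second argument to urlretrieve() is the path, including the file name, where the file is saved.

Note: urlretrieve() is a legacy interface ported directly from Python 2 and may be deprecated in a future version.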
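
The first approach mentioned above, writing the response to a file ourselves, might look like the following sketch. The download() helper is a name introduced here for illustration, not a urllib function. The demo uses a file:// URL pointing at a temporary file so that it runs without network access; an http(s) URL is passed in the same way.

```python
import shutil
import tempfile
from pathlib import Path
from urllib import request


def download(url, dest):
    """Stream the response body of url into the file at dest."""
    with request.urlopen(url) as resp, open(dest, 'wb') as out:
        # copyfileobj copies in chunks, so large files are not held in memory
        shutil.copyfileobj(resp, out)


# Stand-in source: a temporary local file exposed as a file:// URL
workdir = Path(tempfile.mkdtemp())
source = workdir / 'source.bin'
source.write_bytes(b'example payload')

target = workdir / 'copy.bin'
download(source.as_uri(), target)
print(target.read_bytes())
```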

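The third-party `requests` library

Installation

Because requests is a third-party library, it has to be installed first:

pip install requests

Making a GET request

Call the get() function directly:

import requests

r = requests.get('http://www.baidu.com/')
print(r.status_code)    # status code
print(r.text)   # body content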

For a URL with query parameters, pass a dict as the params argument:

import requests

r = requests.get('http://www.baidu.com/', params={'q': 'python', 'cat': '1001'})
print(r.url)    # the URL actually requested
print(r.text)
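
To see how params ends up in the URL without sending anything, a request can be prepared first. This is a small sketch that assumes requests is installed; requests.Request(...).prepare() builds the outgoing request but does not contact the server:

```python
import requests

# prepare() builds the outgoing request without sending it
prepared = requests.Request(
    'GET',
    'http://www.baidu.com/',
    params={'q': 'python', 'cat': '1001'},
).prepare()

print(prepared.url)  # http://www.baidu.com/?q=python&cat=1001
```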

Another convenience of requests is that certain response types, such as JSON, can be parsed directly:

r = requests.get('https://query.yahooapis.com/v1/public/yql?q=select%20*%20from%20weather.forecast%20where%20woeid%20%3D%202151330&format=json')
r.json()

# {'query': {'count': 1, 'created': '2017-11-17T07:14:12Z', ...

Adding headers

To send HTTP headers, pass a dict as the headers argument:

r = requests.get('https://www.baidu.com/', headers={'User-Agent': 'Mozilla/5.0 (iPhone; CPU iPhone OS 11_0 like Mac OS X) AppleWebKit'})

To read the response headers:

r.headers
# {'Content-Type': 'text/html; charset=utf-8', 'Transfer-Encoding': 'chunked', 'Content-Encoding': 'gzip', ...}

r.headers['Content-Type']
# 'text/html; charset=utf-8'

Making a POST request

To send a POST request, simply replace get() with post() and pass the POST payload as the data argument:

r = requests.post('https://accounts.baidu.com/login', data={'form_email': 'abc@example.com', 'form_password': '123456'})

By default requests encodes a dict passed as data as application/x-www-form-urlencoded. To send JSON, pass the json argument instead:

params = {'key': 'value'}
r = requests.post(url, json=params)  # serialized to JSON internally
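
The difference between data and json can be inspected on a prepared request, again without sending it. A sketch that assumes requests is installed; the URL is only a placeholder:

```python
import json

import requests

url = 'http://httpbin.org/post'  # placeholder endpoint; nothing is sent here

# A dict passed as data is form-encoded
form = requests.Request('POST', url, data={'key': 'value'}).prepare()
print(form.headers['Content-Type'])
print(form.body)

# A dict passed as json is serialized and labeled as JSON
as_json = requests.Request('POST', url, json={'key': 'value'}).prepare()
print(as_json.headers['Content-Type'])
print(json.loads(as_json.body))
```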

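Uploading files

File uploads need a more complex encoding, but requests reduces this to the files argument:

# The with block closes the file once the upload has finished
with open('report.xls', 'rb') as f:
    upload_files = {'file': f}
    r = requests.post(url, files=upload_files)

Be sure to open the file in 'rb' (binary) mode, so that the length of the bytes read matches the length of the file.

Replacing post() with put(), delete(), and so on requests the resource with PUT or DELETE.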

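Adding cookies

To send cookies with a request, just pass a dict as the cookies argument:

cs = {'token': '12345', 'status': 'working'}
r = requests.get(url, cookies=cs)

requests also handles cookies in the response specially: r.cookies holds the cookies set by the server, so a specific cookie can be read without parsing the Set-Cookie header ourselves. For example, if the server sets a cookie named token:

r.cookies['token']
# '12345'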

Setting a timeout

To set a timeout, pass the timeout argument, in seconds:

r = requests.get(url, timeout=2.5)  # times out after 2.5 seconds
