0. Web Scraping: The urllib Library Explained, urlopen() and Request()

Posted by 那是個好男孩 on 2019-04-09

Original URL: https://www.cnblogs.com/DC0307/p/10675878.html

Web scraping

# Note whether the code uses import urllib.request or from urllib import request

0. urlopen()

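Syntax: urllib.request.urlopen(url, data=None, [timeout, ]*, cafile=None, capath=None, cadefault=False, context=None)

Example 0: (in practice this function is usually called with just three arguments: url, data, and timeout)

* When a data argument is supplied, it must be converted with bytes() into an encoded byte stream (a type distinct from str: a raw sequence of bits such as 010010010), that is, the bytes type. Supplying data also turns the request into a POST.

* response.read() returns data of type bytes, so it needs to be decoded.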

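import socket
import urllib.parse
import urllib.request
import urllib.error

url = 'http://httpbin.org/post'
# urlencode() returns a str; urlopen() needs the POST body as bytes
data = bytes(urllib.parse.urlencode({'word': 'hello'}), encoding='utf8')
try:
    response = urllib.request.urlopen(url, data=data, timeout=1)
except urllib.error.URLError as e:
    # a connection timeout arrives wrapped in URLError.reason
    if isinstance(e.reason, socket.timeout):
        print('TIME OUT')
else:
    # read() returns bytes, so decode it to str before printing
    print(response.read().decode("utf-8"))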

Output:

{
  "args": {}, 
  "data": "", 
  "files": {}, 
  "form": {
    "word": "hello"
  }, 
  "headers": {
    "Accept-Encoding": "identity", 
    "Content-Length": "10", 
    "Content-Type": "application/x-www-form-urlencoded", 
    "Host": "httpbin.org", 
    "User-Agent": "Python-urllib/3.6"
  }, 
  "json": null, 
  "origin": "101.206.170.234, 101.206.170.234", 
  "url": "https://httpbin.org/post"
}
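The conversion of data to bytes can be checked without sending anything over the network. The following is a minimal sketch that only uses the 'word': 'hello' form from the example above; the variable names are illustrative. It shows that bytes(..., encoding='utf8') and str.encode('utf-8') give the same result, and that the body is 10 bytes long, which matches the "Content-Length": "10" header in the output.

```python
from urllib import parse

# urlencode() returns a str such as 'word=hello'.
query = parse.urlencode({'word': 'hello'})

# Two equivalent ways to turn that str into the bytes urlopen() expects.
data_a = bytes(query, encoding='utf8')
data_b = query.encode('utf-8')

print(query)             # word=hello
print(data_a == data_b)  # True
print(len(data_a))       # 10
```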

Example 1: inspecting the status code, the response headers, and the Server field of the response headers

import urllib.request

response = urllib.request.urlopen('https://www.python.org')
print(response.status)
print(response.getheaders())
print(response.getheader('Server'))

Output:

200
[('Server', 'nginx'), ('Content-Type', 'text/html; charset=utf-8'), ('X-Frame-Options', 'DENY'), ('Via', '1.1 vegur'), ('Via', '1.1 varnish'), ('Content-Length', '48410'), ('Accept-Ranges', 'bytes'), ('Date', 'Tue, 09 Apr 2019 02:32:34 GMT'), ('Via', '1.1 varnish'), ('Age', '722'), ('Connection', 'close'), ('X-Served-By', 'cache-iad2126-IAD, cache-hnd18751-HND'), ('X-Cache', 'MISS, HIT'), ('X-Cache-Hits', '0, 1223'), ('X-Timer', 'S1554777154.210361,VS0,VE0'), ('Vary', 'Cookie'), ('Strict-Transport-Security', 'max-age=63072000; includeSubDomains')]
nginx

Calling urlopen() with a bare URL string is quite limited: for example, you cannot set the request headers. That is why the Request class is needed.

1. Request()

Example 0: (the two approaches below produce the same result)

import urllib.request

response = urllib.request.urlopen('https://www.python.org')
print(response.read().decode('utf-8'))

######################################

import urllib.request

req = urllib.request.Request('https://python.org')
response = urllib.request.urlopen(req)
print(response.read().decode('utf-8'))

The following examples show how to use Request() to make GET and POST requests and to set their parameters.

Example 1: (POST request)

from urllib import request, parse

url = 'http://httpbin.org/post'
headers = {
    'User-Agent': 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)',
    'Host': 'httpbin.org'
}
# named form rather than dict, so the built-in dict type is not shadowed
form = {
    'name': 'Germey'
}
data = bytes(parse.urlencode(form), encoding='utf8')
req = request.Request(url=url, data=data, headers=headers, method='POST')
response = request.urlopen(req)
print(response.read().decode('utf-8'))

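You can also add headers with the add_header() method to imitate a browser, and the data argument can alternatively be built as shown here:

Note: headers can also be modified with build_opener(). It is not covered further here, since the above is enough for most cases.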

from urllib import request, parse

url = 'http://httpbin.org/post'
# named form rather than dict, so the built-in dict type is not shadowed
form = {
    'name': 'Germey'
}
data = parse.urlencode(form).encode('utf-8')
req = request.Request(url=url, data=data, method='POST')
req.add_header('User-Agent', 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)')
response = request.urlopen(req)
print(response.read().decode('utf-8'))
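A Request object can be inspected before it is sent, which is a convenient way to see what the headers argument and add_header() actually do. The sketch below builds the same kind of request as the examples above but never calls urlopen(), so it runs offline. It shows that the method defaults to POST when data is present and to GET otherwise, and that header names are stored in capitalized form ('User-agent').

```python
from urllib import request, parse

url = 'http://httpbin.org/post'
data = parse.urlencode({'name': 'Germey'}).encode('utf-8')

# Headers passed to the constructor and headers added later are stored together.
req = request.Request(url, data=data, headers={'Host': 'httpbin.org'})
req.add_header('User-Agent', 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)')

print(req.get_method())                   # POST, because data is present
print(request.Request(url).get_method())  # GET, because there is no data
print(req.get_header('User-agent'))       # names are stored as 'User-agent'
print(req.header_items())                 # all headers as (name, value) pairs
```

Because nothing is sent, this only checks how the request is assembled; it says nothing about how a server will respond.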

Example 2: (GET request) searching Baidu for a keyword

from urllib import request,parse

url = 'http://www.baidu.com/s?wd='
key = '路飛'
key_code = parse.quote(key)
url_all = url + key_code
"""
# Second way to write it
url = 'http://www.baidu.com/s'
key = '路飛'
wd = parse.urlencode({'wd':key})
url_all = url + '?' + wd
"""
req = request.Request(url_all)
response = request.urlopen(req)
print(response.read().decode('utf-8'))
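The two ways of building the search URL in this example can be compared directly. This is a small offline sketch using the same base URL and keyword; the variable names url_quote and url_urlencode are illustrative.

```python
from urllib import parse

key = '路飛'

# First way: percent-encode the keyword and append it to the URL.
url_quote = 'http://www.baidu.com/s?wd=' + parse.quote(key)

# Second way: let urlencode() build the whole 'wd=...' query string.
url_urlencode = 'http://www.baidu.com/s' + '?' + parse.urlencode({'wd': key})

print(url_quote)
print(url_quote == url_urlencode)                        # True
print(parse.unquote(url_quote.split('=', 1)[1]) == key)  # True
```

The two results match for this keyword. They differ once the keyword contains a space: quote() encodes it as %20, while urlencode() encodes it as +.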

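At this point, questions tend to come up about decode and about the quote() and urlencode() functions in the parse module. Some notes:

parse.quote: percent-encodes a str so that it can be placed in a URL
parse.urlencode: turns the k: v pairs of a dict into a k=v query string, percent-encoding each value
parse.unquote: converts percent-encoded data back to the original text
decode: decodes bytes into a str; decode("utf-8") goes hand in hand with read()
encode: encodes a str into bytes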

>>> str0 = '我爱你'
>>> str1 = str0.encode('gb2312')
>>> str1
b'\xce\xd2\xb0\xae\xc4\xe3'
>>> str2 = str0.encode('gbk')
>>> str2
b'\xce\xd2\xb0\xae\xc4\xe3'
>>> str3 = str0.encode('utf-8')
>>> str3
b'\xe6\x88\x91\xe7\x88\xb1\xe4\xbd\xa0'
>>> str00 = str1.decode('gb2312')
>>> str00
'我爱你'
>>> str11 = str1.decode('utf-8') # raises an error because str1 is gb2312-encoded
Traceback (most recent call last):
  File "<pyshell#9>", line 1, in <module>
    str11 = str1.decode('utf-8')
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xce in position 0: invalid continuation byte

* The encoding argument specifies which encoding to use.
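When the encoding of some bytes is not known for certain, the decode error can be caught instead of letting it stop the program. The following is a minimal sketch, assuming the only two candidates are utf-8 and gbk (that assumption is for illustration; real pages should be decoded with the charset they declare). It also shows the errors='replace' option of decode(), which substitutes U+FFFD for undecodable bytes rather than raising.

```python
text = '我爱你'
raw = text.encode('gbk')  # bytes produced with one encoding

try:
    decoded = raw.decode('utf-8')  # wrong codec: raises UnicodeDecodeError
except UnicodeDecodeError:
    decoded = raw.decode('gbk')    # fall back to the codec actually used

print(decoded == text)  # True

# errors='replace' never raises; undecodable bytes become U+FFFD.
lossy = raw.decode('utf-8', errors='replace')
print('\ufffd' in lossy)  # True
```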

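Another question comes up here: what is the difference between read(), readline(), and readlines()?

read(): everything at once, as a single object (bytes for an HTTP response, str for a file opened in text mode)
readline(): one line
readlines(): everything, as a list of lines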
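The difference between the three methods is easy to see on an in-memory stream. In this sketch io.BytesIO stands in for the response object (an assumption made so the example runs offline); like an HTTP response it is a binary stream, so all three methods return bytes.

```python
import io

# Stand-in for a response body; the content is made up for the demo.
body = b'line one\nline two\nline three\n'

everything = io.BytesIO(body).read()      # the whole body as one bytes object
first_line = io.BytesIO(body).readline()  # only the first line
all_lines = io.BytesIO(body).readlines()  # every line, as a list of bytes

print(everything)  # b'line one\nline two\nline three\n'
print(first_line)  # b'line one\n'
print(all_lines)   # [b'line one\n', b'line two\n', b'line three\n']
```

Each call uses a fresh stream because reading consumes it: after read() has returned everything, a second read() on the same object returns an empty bytes object.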
