2. Crawler: urllib library walkthrough (exception handling, URL parsing, analyzing the Robots protocol)

Posted by 那是個好男孩 on 2019-04-09

Original URL: https://www.cnblogs.com/DC0307/p/10677021.html

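1. Exception handling

The URLError class comes from the error module of the urllib library. It inherits from OSError and is the base class of the error module, so any exception raised by the request module can be handled by catching it.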

from urllib import request, error

try:
    response = request.urlopen('http://cuiqingcai.com/index.htm')
except error.HTTPError as e:
    print(e.reason, e.code, e.headers, sep='\n')
except error.URLError as e:
    print(e.reason)
else:
    print('Request Successfully')

The output is as follows:

Not Found
404
Server: nginx/1.10.3 (Ubuntu)
Date: Tue, 09 Apr 2019 07:25:19 GMT
Content-Type: text/html; charset=UTF-8
Transfer-Encoding: chunked
Connection: close
Vary: Cookie
Expires: Wed, 11 Jan 1984 05:00:00 GMT
Cache-Control: no-cache, must-revalidate, max-age=0
Link: <https://cuiqingcai.com/wp-json/>; rel="https://api.w.org/"

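* Think of it this way: HTTPError is a subclass of URLError, so catch HTTPError first and URLError second.

* URLError has the attribute reason; HTTPError has the attributes reason (the cause), headers (the response headers), and code (the status code).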
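The inheritance chain described above can be checked directly, without touching the network. This is only a small sanity check, not part of the original post:

```python
from urllib import error

# HTTPError -> URLError -> OSError
http_is_url_error = issubclass(error.HTTPError, error.URLError)
url_error_is_os_error = issubclass(error.URLError, OSError)

print(http_is_url_error)      # True
print(url_error_is_os_error)  # True
```

Because of this chain, an `except error.URLError` clause placed first would also swallow every HTTPError, which is why the more specific clause goes first.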

import socket
import urllib.request
import urllib.error

try:
    response = urllib.request.urlopen('https://www.baidu.com', timeout=0.01)
except urllib.error.URLError as e:
    print(type(e.reason))
    if isinstance(e.reason, socket.timeout):
        print('TIME OUT')
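The two snippets above can be folded into one helper. The sketch below is an illustration only: `describe_error` is a hypothetical name, and the exceptions are built by hand so that it runs without network access.

```python
import io
import socket
from urllib import error


def describe_error(exc):
    """Return a short label for an exception raised by urlopen()."""
    # HTTPError is tested first because it is a subclass of URLError.
    if isinstance(exc, error.HTTPError):
        return f'HTTP {exc.code}: {exc.reason}'
    if isinstance(exc, error.URLError):
        # For a timeout, reason holds the underlying socket.timeout object.
        if isinstance(exc.reason, socket.timeout):
            return 'timeout'
        return f'URL error: {exc.reason}'
    return f'other: {exc!r}'


# Hand-made exceptions standing in for real failures (assumed values).
timeout_exc = error.URLError(socket.timeout('timed out'))
http_exc = error.HTTPError('http://example.com', 404, 'Not Found', {}, io.BytesIO())
plain_exc = error.URLError('no host given')

print(describe_error(timeout_exc))  # timeout
print(describe_error(http_exc))     # HTTP 404: Not Found
print(describe_error(plain_exc))    # URL error: no host given
```

In a real crawler the helper would be called inside `except error.URLError as e:` around `request.urlopen(...)`.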

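2. Parsing links

The parse module of the urllib library defines a standard interface for handling URLs, such as extracting, combining, and converting the parts of a URL.

!!! scheme://netloc/path;params?query#fragment (scheme = protocol, netloc = domain, path = access path, params = parameters, query = query conditions, fragment = anchor)

urlencode() !!! This function was mentioned earlier. It is important: it builds the parameters of a GET request.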

from urllib.parse import urlencode

params = {
    'name': 'germey',
    'age': 22
}
base_url = 'http://www.baidu.com?'
url = base_url + urlencode(params)
print(url)

The output is as follows:

http://www.baidu.com?name=germey&age=22
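A few more cases are worth knowing when building GET parameters. The keys and values below are made-up sample data, chosen only to show how urlencode() treats non-ASCII text, reserved characters, and list values:

```python
from urllib.parse import urlencode

# Non-ASCII values are percent-encoded as UTF-8 by default.
encoded_text = urlencode({'wd': '我爱你'})
print(encoded_text)   # wd=%E6%88%91%E7%88%B1%E4%BD%A0

# A space becomes '+', and '&' inside a value is escaped so it
# cannot be confused with the separator between parameters.
encoded_space = urlencode({'q': 'a b&c'})
print(encoded_space)  # q=a+b%26c

# doseq=True expands a list into one key=value pair per item.
encoded_list = urlencode({'tag': ['python', 'crawler']}, doseq=True)
print(encoded_list)   # tag=python&tag=crawler
```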

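urllib.parse.urlparse(urlstring, scheme='', allow_fragments=True):

* This function identifies a URL and splits it into its parts. urlstring is required and is the URL to be parsed; scheme is the default protocol (e.g. http, https), used only when the URL itself carries no scheme; allow_fragments controls whether the fragment is parsed: when it is False, the fragment is not split out and stays attached to the preceding component (path or query).

Example 0: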

from urllib.parse import urlparse

result = urlparse('http://www.baidu.com/index.html;user?id=5#comment')
print(type(result), result)

The output is as follows:

<class 'urllib.parse.ParseResult'> ParseResult(scheme='http', netloc='www.baidu.com', path='/index.html', params='user', query='id=5', fragment='comment')

Example 1:

from urllib.parse import urlparse

result = urlparse('www.baidu.com/index.html;user?id=5#comment', scheme='https')
print(result)

"""
Output:
ParseResult(scheme='https', netloc='', path='www.baidu.com/index.html', params='user', query='id=5', fragment='comment')
"""

Example 2:

from urllib.parse import urlparse

result = urlparse('http://www.baidu.com/index.html;user?id=5#comment', scheme='https')
print(result)

"""
Output:
ParseResult(scheme='http', netloc='www.baidu.com', path='/index.html', params='user', query='id=5', fragment='comment')
"""

Example 3:

from urllib.parse import urlparse

result = urlparse('http://www.baidu.com/index.html;user?id=5#comment', allow_fragments=False)
print(result)

"""
Output:
ParseResult(scheme='http', netloc='www.baidu.com', path='/index.html', params='user', query='id=5#comment', fragment='')
"""

Example 4:

from urllib.parse import urlparse

result = urlparse('http://www.baidu.com/index.html#comment', allow_fragments=False)
print(result)

"""
Output:
ParseResult(scheme='http', netloc='www.baidu.com', path='/index.html#comment', params='', query='', fragment='')
"""

urlunparse(data): builds a URL. The argument must be an iterable of exactly 6 items: a list, a tuple, or another specific data structure.

from urllib.parse import urlunparse

data = ['http', 'www.baidu.com', 'index.html', 'user', 'a=6', 'comment']
print(urlunparse(data))

"""
The result is as follows:
http://www.baidu.com/index.html;user?a=6#comment
"""

urljoin(): resolves, merges, and generates links.

from urllib.parse import urljoin

print(urljoin('http://www.baidu.com', 'FAQ.html'))
print(urljoin('http://www.baidu.com', 'https://cuiqingcai.com/FAQ.html'))
print(urljoin('http://www.baidu.com/about.html', 'https://cuiqingcai.com/FAQ.html'))
print(urljoin('http://www.baidu.com/about.html', 'https://cuiqingcai.com/FAQ.html?question=2'))
print(urljoin('http://www.baidu.com?wd=abc', 'https://cuiqingcai.com/index.php'))
print(urljoin('http://www.baidu.com', '?category=2#comment'))
print(urljoin('www.baidu.com', '?category=2#comment'))
print(urljoin('www.baidu.com#comment', '?category=2'))

The output is as follows:

http://www.baidu.com/FAQ.html
https://cuiqingcai.com/FAQ.html
https://cuiqingcai.com/FAQ.html
https://cuiqingcai.com/FAQ.html?question=2
https://cuiqingcai.com/index.php
http://www.baidu.com?category=2#comment
www.baidu.com?category=2#comment
www.baidu.com?category=2

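* A little dizzy? Don't worry, there is a pattern. In short, there are two arguments and the second one takes precedence: whatever it lacks (scheme, netloc, path) is filled in from the first, and whatever it has overrides the first.

* Note that whatever follows the path in the first argument (that is, params, query, and fragment) has no effect in these examples. (The last print statement makes this clear.)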
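In a crawler, urljoin() is mostly used to resolve relative links found on a page. The base URL and the relative links below are invented for illustration. The last case is an exception to the rule of thumb above: when the second argument is only a fragment, the base query is kept.

```python
from urllib.parse import urljoin

base = 'http://www.baidu.com/a/b/c.html?wd=abc#top'

sibling = urljoin(base, 'd.html')                      # same directory
parent = urljoin(base, '../d.html')                    # one directory up
rooted = urljoin(base, '/d.html')                      # from the site root
other_host = urljoin(base, '//cuiqingcai.com/d.html')  # keeps only the scheme
frag_only = urljoin(base, '#frag')                     # base path and query kept

print(sibling)     # http://www.baidu.com/a/b/d.html
print(parent)      # http://www.baidu.com/a/d.html
print(rooted)      # http://www.baidu.com/d.html
print(other_host)  # http://cuiqingcai.com/d.html
print(frag_only)   # http://www.baidu.com/a/b/c.html?wd=abc#frag
```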

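urlsplit(): similar to urlparse. The difference is that urlsplit() does not parse params separately but merges it into path, so it returns 5 components. The result is a tuple type (SplitResult), so values can be read either by attribute or by index.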

from urllib.parse import urlsplit

result = urlsplit('http://www.baidu.com/index.html;user?id=5#comment')
print(type(result), result)
print(result.scheme,result[0])

"""
<class 'urllib.parse.SplitResult'> SplitResult(scheme='http', netloc='www.baidu.com', path='/index.html;user', query='id=5', fragment='comment')
http http
"""

from urllib.parse import urlparse

result = urlparse('http://www.baidu.com/index.html;user?id=5#comment')
print(type(result), result)

"""
<class 'urllib.parse.ParseResult'> ParseResult(scheme='http', netloc='www.baidu.com', path='/index.html', params='user', query='id=5', fragment='comment')
"""

urlunsplit(): similar to urlunparse. The only difference is that the argument must have length 5.

from urllib.parse import urlunsplit

data = ['http', 'www.baidu.com', 'index.html', 'a=6', 'comment']
print(urlunsplit(data))

"""
http://www.baidu.com/index.html?a=6#comment
"""

from urllib.parse import urlunparse

data = ['http', 'www.baidu.com', 'index.html', 'user', 'a=6', 'comment']
print(urlunparse(data))

"""
http://www.baidu.com/index.html;user?a=6#comment
"""

parse_qs(): deserialization. It converts a string of GET request parameters back into a dictionary.

from urllib.parse import parse_qs

query = 'name=germey&age=22'
print(parse_qs(query))

"""
{'name': ['germey'], 'age': ['22']}
"""

parse_qsl(): converts the parameters into a list of tuples.

from urllib.parse import parse_qsl

query = 'name=germey&age=22'
print(parse_qsl(query))

"""
[('name', 'germey'), ('age', '22')]
"""

quote() and unquote(): URL encoding and decoding.

from urllib.parse import quote

keyword = "我爱你"
url = 'https://www.baidu.com/s?wd=' + quote(keyword)
print(url)

"""
https://www.baidu.com/s?wd=%E6%88%91%E7%88%B1%E4%BD%A0
"""


from urllib.parse import unquote

url = 'https://www.baidu.com/s?wd=%E6%88%91%E7%88%B1%E4%BD%A0'
print(unquote(url))

"""
https://www.baidu.com/s?wd=我爱你
"""

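3. Analyzing the Robots protocol (also known as the crawler protocol or robot protocol)

* It tells crawlers and search engines which pages may be crawled and which may not. It is usually a text file named robots.txt.

Example: robots.txt

  User-agent: * (* stands for all crawlers)

  Disallow: / (forbids crawling everything under '/')

  Allow: /public/ (a directory that may be crawled)

* With the robotparser module of urllib, we can analyze a site's Robots protocol.

==> A few commonly used methods (only 3 are listed here; there are also parse(), mtime(), and modified(), which can wait until they are needed):

set_url(): sets the URL of the robots.txt file.

read(): fetches the robots.txt file and parses it. It must be called!!!

can_fetch(): the first argument is the User-agent and the second is the URL. It returns True or False to indicate whether the URL may be crawled.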

from urllib.robotparser import RobotFileParser
rp = RobotFileParser()
rp.set_url('http://www.jianshu.com/robots.txt')
rp.read()
print(rp.can_fetch('*','http://www.jianshu.com/p/b67554025d7d'))
print(rp.can_fetch('*','http://www.jianshu.com/search?q=python&page=1&type=collections'))

"""
The result is as follows:
False
False
"""
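# You can also use the parse() method to read and analyze the rules. It is a bit more involved, so I chose the simpler option, which is enough.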
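For completeness, here is a minimal offline sketch of the parse() route. The rules and the example.com URLs are invented for illustration. Note one assumption about ordering: the standard-library parser checks rules in file order and uses the first match, so the Allow line is placed before the broader Disallow line here.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, supplied as a list of lines.
rules = [
    'User-agent: *',
    'Allow: /public/',
    'Disallow: /',
]

rp = RobotFileParser()
rp.parse(rules)  # no set_url()/read() needed when the lines are given directly

public_ok = rp.can_fetch('*', 'http://example.com/public/index.html')
private_ok = rp.can_fetch('*', 'http://example.com/private/index.html')

print(public_ok)   # True
print(private_ok)  # False
```

This form is handy for testing rules, or when the robots.txt content has already been downloaded by other means.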
