[Python] 網路爬蟲與資訊提取（1）網路爬蟲之規則

clyuan1996發表於2020-11-06

原文網址 : https://blog.csdn.net/weixin_43429677/article/details/108677411

Python爬蟲

Python網路爬蟲與資訊提取北京理工大學嵩天老師慕課學習筆記

網路爬蟲之規則

requets庫入門

requests庫的7個主要方法：

requests.request()構造一個請求，支撐以下各方法的基礎方法
requests.get()獲取HTML網頁的主要方法，對應於HTTP的GET
requests.head()獲取HTML網頁頭資訊的方法，對應於HTTP的HEAD
requests.post()向HTML網頁提交POST請求的方法，對應於HTTP的POST
requests.put()向HTML網頁提交PUT請求的方法，對應於HTTP的PUT
requests.patch()向HTML網頁提交區域性修改請求，對應於HTTP的PATCH
requests.delete()向HTML頁面提交刪除請求，對應於HTTP的DELETE

requests庫的get()方法

獲取網頁r = requests.get(url)
以此來構造一個向伺服器請求資源的Request物件
返回一個包含伺服器資源的Response物件

requests.get(url, params = None, **kwargs)
url:擬獲取頁面的url連結
params:url中的額外引數，字典或位元組流格式，可選
**kwargs：12個控制訪問的引數

requests庫提供的7個常用方法，第一個方法（request方法）是基本方法，其他六個都是通過呼叫requests方法來實現的。

Response物件：

>>> import requests
>>> r = requests.get("http://www.baidu.com")
>>> print(r.status_code)  # 檢測請求的狀態碼
200
>>> type(r)  
<class 'requests.models.Response'>
>>> r.headers  # 返回get請求獲得頁面的頭部資訊
{'Cache-Control': 'private, no-cache, no-store, proxy-revalidate, no-transform', 'Connection': 'keep-alive', 'Content-Encoding': 'gzip', 'Content-Type': 'text/html', 'Date': 'Sat, 19 Sep 2020 05:51:05 GMT', 'Last-Modified': 'Mon, 23 Jan 2017 13:27:36 GMT', 'Pragma': 'no-cache', 'Server': 'bfe/1.0.8.18', 'Set-Cookie': 'BDORZ=27315; max-age=86400; domain=.baidu.com; path=/', 'Transfer-Encoding': 'chunked'}

Response物件包含伺服器返回的所有資訊，也包含請求的Request資訊

Response物件的屬性:

r.status_codeHTTP請求的返回狀態，200表示連線成功，404或其他表示失敗
r.textHTTP響應內容的字串形式，即url隊形的頁面內容
r.encoding從HTTP header中猜測的響應內容編碼方式
r.apparent_encoding從內容中分析出的響應內容編碼方式（備選編碼方式）
r.contentHTTP響應內容的二進位制形式

>>> r = requests.get("http://www.baidu.com")
>>> print(r.status_code)
200
>>> r.text
'<!DOCTYPE html>\r\n<!--STATUS OK--><html> <head><meta http-equiv=content-type content=text/html;charset=utf-8><meta http-equiv=X-UA-Compatible content=IE=Edge><meta content=always name=referrer><link rel=stylesheet type=text/css href=http://s1.bdstatic.com/r/www/cache/bdorz/baidu.min.css><title>ç\x99¾åº¦ä¸\x80ä¸\x8bï¼\x8cä½\xa0å°±ç\x9f¥é\x81\x93</title></head> <body link=#0000cc> <div id=wrapper> <div id=head> <div class=head_wrapper> <div class=s_form> <div class=s_form_wrapper> <div id=lg> <img hidefocus=true src=//www.baidu.com/img/bd_logo1.png width=270 height=129> </div> <form id=form name=f action=//www.baidu.com/s class=fm> <input type=hidden name=bdorz_come value=1> <input type=hidden name=ie value=utf-8> <input type=hidden name=f value=8> <input type=hidden name=rsv_bp value=1> <input type=hidden name=rsv_idx value=1> <input type=hidden name=tn value=baidu><span class="bg s_ipt_wr"><input id=kw name=wd class=s_ipt value maxlength=255 autocomplete=off autofocus></span><span class="bg s_btn_wr"><input type=submit id=su value=ç\x99¾åº¦ä¸\x80ä¸\x8b class="bg s_btn"></span> </form> </div> </div> <div id=u1> <a href=http://news.baidu.com name=tj_trnews class=mnav>æ\x96°é\x97»</a> <a href=http://www.hao123.com name=tj_trhao123 class=mnav>hao123</a> <a href=http://map.baidu.com name=tj_trmap class=mnav>å\x9c°å\x9b¾</a> <a href=http://v.baidu.com name=tj_trvideo class=mnav>è§\x86é¢\x91</a> <a href=http://tieba.baidu.com name=tj_trtieba class=mnav>è´´å\x90§</a> <noscript> <a href=http://www.baidu.com/bdorz/login.gif?login&amp;tpl=mn&amp;u=http%3A%2F%2Fwww.baidu.com%2f%3fbdorz_come%3d1 name=tj_login class=lb>ç\x99»å½\x95</a> </noscript> <script>document.write(\'<a href="http://www.baidu.com/bdorz/login.gif?login&tpl=mn&u=\'+ encodeURIComponent(window.location.href+ (window.location.search === "" ? "?" : "&")+ "bdorz_come=1")+ \'" name="tj_login" class="lb">ç\x99»å½\x95</a>\');</script> <a href=//www.baidu.com/more/ name=tj_briicon class=bri style="display: block;">æ\x9b´å¤\x9aäº§å\x93\x81</a> </div> </div> </div> <div id=ftCon> <div id=ftConw> <p id=lh> <a href=http://home.baidu.com>å\x85³äº\x8eç\x99¾åº¦</a> <a href=http://ir.baidu.com>About Baidu</a> </p> <p id=cp>&copy;2017&nbsp;Baidu&nbsp;<a href=http://www.baidu.com/duty/>ä½¿ç\x94¨ç\x99¾åº¦å\x89\x8då¿\x85è¯»</a>&nbsp; <a href=http://jianyi.baidu.com/ class=cp-feedback>æ\x84\x8fè§\x81å\x8f\x8dé¦\x88</a>&nbsp;äº¬ICPè¯\x81030173å\x8f·&nbsp; <img src=//www.baidu.com/img/gs.gif> </p> </div> </div> </div> </body> </html>\r\n'
>>> r.encoding
'ISO-8859-1'
>>> r.apparent_encoding
'utf-8'
>>> r.encoding = 'utf_8'
>>> r.text
'<!DOCTYPE html>\r\n<!--STATUS OK--><html> <head><meta http-equiv=content-type content=text/html;charset=utf-8><meta http-equiv=X-UA-Compatible content=IE=Edge><meta content=always name=referrer><link rel=stylesheet type=text/css href=http://s1.bdstatic.com/r/www/cache/bdorz/baidu.min.css><title>百度一下，你就知道</title></head> <body link=#0000cc> <div id=wrapper> <div id=head> <div class=head_wrapper> <div class=s_form> <div class=s_form_wrapper> <div id=lg> <img hidefocus=true src=//www.baidu.com/img/bd_logo1.png width=270 height=129> </div> <form id=form name=f action=//www.baidu.com/s class=fm> <input type=hidden name=bdorz_come value=1> <input type=hidden name=ie value=utf-8> <input type=hidden name=f value=8> <input type=hidden name=rsv_bp value=1> <input type=hidden name=rsv_idx value=1> <input type=hidden name=tn value=baidu><span class="bg s_ipt_wr"><input id=kw name=wd class=s_ipt value maxlength=255 autocomplete=off autofocus></span><span class="bg s_btn_wr"><input type=submit id=su value=百度一下 class="bg s_btn"></span> </form> </div> </div> <div id=u1> <a href=http://news.baidu.com name=tj_trnews class=mnav>新聞</a> <a href=http://www.hao123.com name=tj_trhao123 class=mnav>hao123</a> <a href=http://map.baidu.com name=tj_trmap class=mnav>地圖</a> <a href=http://v.baidu.com name=tj_trvideo class=mnav>視訊</a> <a href=http://tieba.baidu.com name=tj_trtieba class=mnav>貼吧</a> <noscript> <a href=http://www.baidu.com/bdorz/login.gif?login&amp;tpl=mn&amp;u=http%3A%2F%2Fwww.baidu.com%2f%3fbdorz_come%3d1 name=tj_login class=lb>登入</a> </noscript> <script>document.write(\'<a href="http://www.baidu.com/bdorz/login.gif?login&tpl=mn&u=\'+ encodeURIComponent(window.location.href+ (window.location.search === "" ? "?" : "&")+ "bdorz_come=1")+ \'" name="tj_login" class="lb">登入</a>\');</script> <a href=//www.baidu.com/more/ name=tj_briicon class=bri style="display: block;">更多產品</a> </div> </div> </div> <div id=ftCon> <div id=ftConw> <p id=lh> <a href=http://home.baidu.com>關於百度</a> <a href=http://ir.baidu.com>About Baidu</a> </p> <p id=cp>&copy;2017&nbsp;Baidu&nbsp;<a href=http://www.baidu.com/duty/>使用百度前必讀</a>&nbsp; <a href=http://jianyi.baidu.com/ class=cp-feedback>意見反饋</a>&nbsp;京ICP證030173號&nbsp; <img src=//www.baidu.com/img/gs.gif> </p> </div> </div> </div> </body> </html>\r\n'

r.encoding:如果header中不存在charset，則認為編碼為ISO-8859-1

爬取網頁的通用程式碼框架

理解requests庫的異常：
requests.ConnectionError 網路連線錯誤異常，如DNS查詢失敗、拒絕連線等
request.HTTPErrorHTTP錯誤異常
requests.URLRequiredURL缺失異常
requests.TooManyRedirects超過最大重定向次數，產生重定向異常
requests.ConnectTimeout連線遠端伺服器超時異常
requests.Timeout請求URL超時，產生超時異常

python網路爬蟲_Python爬蟲：30個小時搞定Python網路爬蟲視訊教程
2020-10-21
Python爬蟲
怎麼利用Python網路爬蟲來提取資訊
2020-03-20
Python爬蟲
python網路爬蟲應用_python網路爬蟲應用實戰
2020-12-29
Python爬蟲
python DHT網路爬蟲
2019-02-14
Python爬蟲
網路爬蟲
2018-12-07
爬蟲
python網路爬蟲（14）使用Scrapy搭建爬蟲框架
2019-07-27
Python爬蟲框架
網路爬蟲——爬蟲實戰（一）
2022-01-29
爬蟲
什麼是Python網路爬蟲?常見的網路爬蟲有哪些?
2020-11-27
Python爬蟲
網路爬蟲（python專案）
2018-12-04
爬蟲Python
專案－－python網路爬蟲
2020-08-15
Python爬蟲
python網路爬蟲合法嗎
2021-09-11
Python爬蟲
Python網路爬蟲實戰
2022-03-18
Python爬蟲
python網路爬蟲（9）構建基礎爬蟲思路
2019-06-09
Python爬蟲
網路爬蟲示例
2018-10-30
爬蟲
網路爬蟲精要
2019-04-27
爬蟲
python3網路爬蟲開發實戰_Python 3開發網路爬蟲(一)
2020-12-07
Python爬蟲
爬蟲（9） - Scrapy框架(1) | Scrapy 非同步網路爬蟲框架
2022-07-05
爬蟲框架非同步
python網路爬蟲筆記（一）
2020-10-25
Python爬蟲筆記
python實現selenium網路爬蟲
2021-03-11
Python爬蟲
網路爬蟲之抓取郵箱
2018-06-18
爬蟲
python爬蟲---網頁爬蟲，圖片爬蟲，文章爬蟲，Python爬蟲爬取新聞網站新聞
2019-01-04
Python爬蟲網頁網站
python網路爬蟲--爬取淘寶聯盟
2018-07-17
Python爬蟲
網路爬蟲的原理
2018-12-02
爬蟲
網路爬蟲專案
2022-01-29
爬蟲
《Python3網路爬蟲開發實戰》教程||爬蟲教程
2018-11-13
Python爬蟲
什麼是網路爬蟲?為什麼用Python寫爬蟲?
2021-03-08
爬蟲Python
【Python爬蟲】正則爬取趕集網
2020-12-24
Python爬蟲
最簡單的網路圖片的爬取 --Pyhon網路爬蟲與資訊獲取
2020-04-04
爬蟲
Python網路爬蟲 - Phantomjs, selenium/Chromedirver使用
2019-01-22
Python爬蟲JSChrome
Python網路爬蟲4 - scrapy入門
2018-05-29
Python爬蟲
python例項，python網路爬蟲爬取大學排名!
2018-11-20
Python爬蟲
python網路爬蟲（7）爬取靜態資料詳解
2019-06-07
Python爬蟲
用PYTHON爬蟲簡單爬取網路小說
2021-09-11
Python爬蟲
[Python3網路爬蟲開發實戰] 分散式爬蟲原理
2019-12-08
Python爬蟲分散式
網路爬蟲之關於爬蟲 http 代理的常見使用方式
2020-04-28
爬蟲HTTP
什麼是網路爬蟲
2018-12-02
爬蟲
網路爬蟲大型教程(二)
2018-05-14
爬蟲
網路爬蟲流程總結
2023-03-09
爬蟲

[Python] 網路爬蟲與資訊提取（1） 網路爬蟲之規則

Python網路爬蟲與資訊提取 北京理工大學 嵩天老師慕課 學習筆記