Python Crawler Development and Project Practice 3: A First Look at Crawlers

Posted by CopperDong on 2018-01-05

Python crawler

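3.1 Overview of web crawlers

Concept: by system architecture and implementation technique, web crawlers fall roughly into four types: general-purpose crawlers, focused crawlers, incremental crawlers, and deep-web crawlers. Real crawler systems usually combine several of these techniques.

Search engines: these are general-purpose crawlers, but they have some limitations:

Search results contain many pages the user does not care about

Limited server resources conflict with the effectively unlimited data on the web

Search engines can do little with information-dense, structured data such as audio and video

Keyword-based retrieval makes it hard to support queries based on semantic information

Focused crawlers, which fetch only the web resources relevant to a target, emerged to address these problems

Focused crawler: a program that automatically downloads web pages to prepare data for topic-oriented user queries

Incremental crawler: updates pages it has already downloaded and crawls only newly created pages. This saves time and storage, but makes the algorithm more complex and harder to implement

Deep-web crawler: web pages divide into surface pages (which search engines can index) and deep pages (which sit behind forms)

Use cases: BT search sites (https://www.cilisou.org/) and cloud-drive search sites (http://www.pansou.com/)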

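The basic workflow is as follows:

First, choose a carefully selected set of seed URLs
Put these URLs into the queue of URLs waiting to be crawled
Take a URL from the waiting queue, resolve its DNS name to an IP address, download and store the page, then put the URL into the queue of crawled URLs
Extract the URLs contained in the downloaded page, compare them against the crawled-URL queue to remove duplicates, put the new ones into the waiting queue, and start the next cycle.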
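
The workflow above can be sketched in a few lines. This is a minimal illustration rather than the book's implementation: it is written for Python 3, and the `FAKE_WEB` dictionary and the `crawl` and `fetch_links` names are hypothetical stand-ins for the DNS lookup, download, and link-extraction steps, so the example runs without network access.

```python
from collections import deque

# Hypothetical in-memory "web": URL -> list of URLs linked from that page.
# It replaces the real download and link-extraction steps.
FAKE_WEB = {
    "http://example.com/": ["http://example.com/a", "http://example.com/b"],
    "http://example.com/a": ["http://example.com/b", "http://example.com/c"],
    "http://example.com/b": ["http://example.com/"],
    "http://example.com/c": [],
}


def crawl(seeds, fetch_links, max_pages=100):
    """Breadth-first crawl; returns URLs in the order they were fetched."""
    to_crawl = deque(seeds)   # queue of URLs waiting to be crawled
    seen = set(seeds)         # every URL ever queued, used for deduplication
    crawled = []              # URLs already crawled, in order
    while to_crawl and len(crawled) < max_pages:
        url = to_crawl.popleft()
        links = fetch_links(url)   # "download" the page and extract its links
        crawled.append(url)
        for link in links:
            if link not in seen:   # skip URLs already queued or crawled
                seen.add(link)
                to_crawl.append(link)
    return crawled


order = crawl(["http://example.com/"], lambda u: FAKE_WEB.get(u, []))
print(order)
```

A set is used for the duplicate check because membership tests stay fast as the number of known URLs grows; a real crawler would swap the lambda for a function that downloads and parses each page.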

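3.2 Making HTTP requests in Python

There are three ways to make HTTP requests in Python: urllib2/urllib, httplib/urllib, and Requests.

urllib2/urllib: two built-in Python 2 modules; urllib2 does most of the work and urllib assists

1. A complete request and response model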

import urllib2
response = urllib2.urlopen('http://www.zhihu.com')
html = response.read()
print html

The request and response can also be split into two steps: first build the request, then get the response

import urllib2
request = urllib2.Request('http://www.zhihu.com')
response = urllib2.urlopen(request)
html = response.read()
print html

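POST requests: passing URL-encoded form data as the second argument of urllib2.Request turns the request into a POST, as the headers example below does.

Sometimes the server rejects your request because it inspects the request headers. This is a common anti-crawler technique.

2. Handling request headers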

import urllib
import urllib2
url = 'http://www.xxxx.com/login'
user_agent = ''  # put a real browser User-Agent string here
referer = 'http://www.xxxx.com/'
postdata = {'username': 'qiye',
            'password': 'qiye_pass'}
# Set the request headers
headers = {'User-Agent': user_agent, 'Referer': referer}
data = urllib.urlencode(postdata)
req = urllib2.Request(url, data, headers)
response = urllib2.urlopen(req)
html = response.read()
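
For readers on Python 3, the same request can be sketched with `urllib.request` and `urllib.parse`, which replace `urllib2` and `urllib`. This is an illustration under stated assumptions: the URL and the User-Agent value are placeholders, and the request is only built and inspected, never sent.

```python
from urllib.parse import urlencode
from urllib.request import Request

url = "http://www.example.com/login"   # placeholder URL
postdata = {"username": "qiye", "password": "qiye_pass"}
headers = {
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64)",   # illustrative value
    "Referer": "http://www.example.com/",
}

# In Python 3 the request body must be bytes, so encode the form string.
data = urlencode(postdata).encode("ascii")
req = Request(url, data=data, headers=headers)

print(req.get_method())   # POST, because data is not None
print(req.data)           # b'username=qiye&password=qiye_pass'
# urllib normalizes header names with str.capitalize(), hence 'User-agent'.
print(req.get_header("User-agent"))
print(req.get_header("Referer"))
```

Passing `req` to `urllib.request.urlopen` would then send the POST with both headers attached.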

3. Cookie handling: use a CookieJar object to manage cookies

import urllib2
import cookielib
cookie = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cookie))
response = opener.open('http://www.zhihu.com')
for item in cookie:
    print item.name + ':' + item.value

Sample output:

SessionID_R3:4y3gT2mcOjBQEQ7RDiqDz6DfdauvG8C5j6jxFg8jIcJvE5ih4USzM0h8WRt1PZomR1C9755SGG5YIzDJZj7XVraQyomhEFA0v6pvBzV94V88uQqUyeDnsMj8MALBSKr
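
To make the role of the cookie jar more concrete, here is an offline sketch in Python 3, where `cookielib` is named `http.cookiejar`. The `FakeResponse` class, the example.com URLs, and the `sessionid=abc123` cookie are all hypothetical; they imitate a server reply so that no network access is needed.

```python
import email
from http.cookiejar import CookieJar
from urllib.request import Request


class FakeResponse:
    """Hypothetical stand-in for an HTTP response; only info() is needed."""

    def __init__(self, header_text):
        self._headers = email.message_from_string(header_text)

    def info(self):
        return self._headers


jar = CookieJar()
first = Request("http://www.example.com/")
# Pretend the server answered the first request with a Set-Cookie header.
reply = FakeResponse("Set-Cookie: sessionid=abc123; Path=/\n\n")
jar.extract_cookies(reply, first)

for item in jar:
    print(item.name + ":" + item.value)

# On the next request to the same site, the jar sends the cookie back.
second = Request("http://www.example.com/profile")
jar.add_cookie_header(second)
print(second.get_header("Cookie"))
```

`HTTPCookieProcessor` performs these two steps automatically for every request made through the opener.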

4. Setting a timeout

import urllib2
request = urllib2.Request('http://www.zhihu.com')
response = urllib2.urlopen(request, timeout=2)
html = response.read()
print html

5. Getting the HTTP response code

import urllib2
try:
    response = urllib2.urlopen('http://www.google.com')
    print response.getcode()  # status code of a successful response, e.g. 200
except urllib2.HTTPError as e:
    if hasattr(e, 'code'):
        print 'Error code:', e.code
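
The same pattern can be wrapped in a small helper. The sketch below uses Python 3 module names (`urllib.request`, `urllib.error`); `get_status` and `always_404` are hypothetical names, and the fake opener exists only so the example can run offline.

```python
import io
import urllib.error
import urllib.request


def get_status(url, opener=urllib.request.urlopen, timeout=2):
    """Return the HTTP status code for url, or None if no response arrived."""
    try:
        response = opener(url, timeout=timeout)
        return response.getcode()          # e.g. 200 on success
    except urllib.error.HTTPError as e:    # the server answered with an error status
        return e.code
    except urllib.error.URLError:          # DNS failure, refused connection, ...
        return None


# Offline demonstration with a hypothetical opener that always fails with 404.
def always_404(url, timeout=None):
    raise urllib.error.HTTPError(url, 404, "Not Found", hdrs=None, fp=io.BytesIO())


print(get_status("http://www.example.com/missing", opener=always_404))
```

`HTTPError` is a subclass of `URLError`, so it has to be caught first; otherwise the more general handler would swallow the status code.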

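6. Redirects: by default, urllib2 automatically follows redirects for HTTP 3XX status codes

To find out whether a redirect took place, check whether the response URL differs from the request URL

import urllib2
response = urllib2.urlopen('http://www.zhihu.com')
isRedirected = response.geturl() != 'http://www.zhihu.com'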

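7. Proxy settings: by default urllib2 reads the http_proxy environment variable to configure the HTTP proxy, but we usually set the proxy dynamically in the program with ProxyHandler instead.

import urllib2
proxy = urllib2.ProxyHandler({'http': '127.0.0.1:8087'})
opener = urllib2.build_opener(proxy)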
urllib2.install_opener(opener)
response = urllib2.urlopen('http://www.zhihu.com/')
print response.read()

install_opener() sets a global opener. If you want to use two different proxies, a better approach is to call the opener's own open method instead of the global urlopen function

import urllib2
proxy = urllib2.ProxyHandler({'http': '127.0.0.1:8087'})
opener = urllib2.build_opener(proxy)
response = opener.open('http://www.zhihu.com/')
print response.read()
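
A sketch of the two-proxy case, written for Python 3 (`urllib.request`). The second proxy address and the `proxies_of` helper are assumptions made for illustration; the openers are only built and inspected here, not used to send requests.

```python
import urllib.request

# Hypothetical proxy addresses; replace them with proxies you actually run.
opener_a = urllib.request.build_opener(
    urllib.request.ProxyHandler({"http": "127.0.0.1:8087"}))
opener_b = urllib.request.build_opener(
    urllib.request.ProxyHandler({"http": "127.0.0.1:8088"}))


def proxies_of(opener):
    """Return the proxy mapping held by an opener's ProxyHandler."""
    for handler in opener.handlers:
        if isinstance(handler, urllib.request.ProxyHandler):
            return handler.proxies
    return {}


print(proxies_of(opener_a))   # {'http': '127.0.0.1:8087'}
print(proxies_of(opener_b))   # {'http': '127.0.0.1:8088'}
# Each request then chooses its proxy explicitly, for example
# opener_a.open('http://www.zhihu.com/') or opener_b.open('http://www.zhihu.com/').
```

Because neither opener is installed globally, plain `urlopen` calls elsewhere in the program are unaffected.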

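httplib/urllib: httplib is a low-level base module that lets you see every step of building an HTTP request, but it offers fewer features.

Requests: a more user-friendly third-party module; install it with pip install requests

1. A complete request and response model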

import requests
r = requests.get('http://www.baidu.com')
print r.content

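2. Responses and encoding

r.content holds the raw bytes of the response, r.text holds the text decoded using r.encoding, and assigning a new value to r.encoding changes how r.text is decoded.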

import requests
r = requests.get('http://www.baidu.com')
print 'content-->' + r.content
print 'text-->' + r.text
print 'encoding-->' + r.encoding
r.encoding = 'utf-8'
print 'new text-->' + r.text

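chardet (pip install chardet) is an excellent module for detecting the encoding of strings and files.

Assign the encoding detected by chardet directly to r.encoding, and r.text will be decoded without garbled characters.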

import requests
import chardet
r = requests.get('http://www.baidu.com')
print chardet.detect(r.content)
r.encoding = chardet.detect(r.content)['encoding']
print r.text
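
Why the encoding matters can be shown without any network access. The sketch below is plain Python 3 and does not use Requests or chardet; the sample string is only an illustration of a UTF-8 page body. When a text response declares no charset, Requests falls back to ISO-8859-1, which is the kind of wrong guess imitated here.

```python
# A page body as it travels over the network: UTF-8 encoded bytes.
raw = "百度一下，你就知道".encode("utf-8")

# Decoding with the wrong charset "succeeds" but produces garbled text.
wrong = raw.decode("iso-8859-1")
# Decoding with the charset the page really uses restores the text.
right = raw.decode("utf-8")

print(repr(wrong[:6]))
print(right)
```

Setting r.encoding to the detected charset has the same effect as choosing the right argument for decode here.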

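Stream mode: with stream=True, the response body is read on demand from r.raw instead of being downloaded all at once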

import requests
r = requests.get('http://www.baidu.com', stream=True)
print r.raw.read(10)

3. Handling request headers

import requests
user_agent = 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.94 Safari/537.36'
headers = {'User-Agent': user_agent}
r = requests.get('http://www.baidu.com', headers=headers)
print r.content

4. Handling the status code and response headers

# -*- coding: utf-8 -*-
import requests
user_agent = 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.94 Safari/537.36'
headers = {'User-Agent': user_agent}
r = requests.get('http://www.baidu.com', headers=headers)
if r.status_code == requests.codes.ok:
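    print r.status_code    # status code
    print r.headers        # response headers
    print r.headers.get('content-type')  # recommended: returns None if the header is missing
    print r.headers['content-type']      # not recommended: raises KeyError if the header is missing
else:
    r.raise_for_status()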

5. Cookie handling

# -*- coding: utf-8 -*-
import requests
user_agent = 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.94 Safari/537.36'
headers = {'User-Agent': user_agent}
r = requests.get('http://www.baidu.com', headers=headers)
# Iterate over all cookie names and print their values
for cookie in r.cookies.keys():
    print cookie + ":" + r.cookies.get(cookie)

Sending custom cookie values:

# -*- coding: utf-8 -*-
import requests
user_agent = 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.94 Safari/537.36'
headers = {'User-Agent': user_agent}
cookies = dict(name='qiye', age='10')
r = requests.get('http://www.baidu.com', headers=headers, cookies=cookies)
print r.text

Requests provides sessions, which keep track of cookies for us so that we can visit pages consecutively without handling cookie values ourselves

# -*- coding: utf-8 -*-
import requests
loginUrl = "http://www.xxx.com/login"
s = requests.Session()
# First visit as a guest; the server assigns a cookie
r = s.get(loginUrl, allow_redirects=True)
datas = {'name':'qiye', 'passwd': 'qiye'}
# Send a POST request to the login URL to upgrade from guest to member
r = s.post(loginUrl, data=datas, allow_redirects=True)
print r.text

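This is a problem that comes up in real projects: if you skip the first step of visiting the login page and send the POST request directly to the login URL, the site treats you as an illegitimate user. A cookie is assigned when you visit the login page, and it must be sent along with the POST request. Handling cookies with a Session object in this way will be very common from here on.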

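6. Redirects and history

Handling redirects only requires setting the allow_redirects parameter; the redirect history is available through r.history

# -*- coding: utf-8 -*-
import requests
r = requests.get('http://github.com')   # redirects to https://github.com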
print r.url
print r.status_code
print r.history

7. Timeout settings

requests.get('http://github.com', timeout=2)

8. Proxy settings

# -*- coding: utf-8 -*-
import requests
proxies = {
    "http": "http://0.10.10.01:3234",
    "https": "http://0.0.0.2:1020",
}
r = requests.get('http://github.com', proxies=proxies)

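Proxies can also be configured through the HTTP_PROXY and HTTPS_PROXY environment variables, but this is less common.

If your proxy requires HTTP Basic Auth, use the http://user:password@host/ syntax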
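
The pieces of such a proxy URL can be checked with the standard library. This Python 3 sketch uses placeholder credentials and a placeholder address; nothing is sent over the network.

```python
from urllib.parse import urlsplit

# Placeholder credentials and host, following the user:password@host pattern.
proxy_url = "http://user:password@10.10.1.10:3128/"
parts = urlsplit(proxy_url)

print(parts.username)   # user
print(parts.password)   # password
print(parts.hostname)   # 10.10.1.10
print(parts.port)       # 3128

# The same string is what goes into the proxies mapping.
proxies = {"http": proxy_url}
```

If the user name or password contains characters such as @ or :, they need to be percent-encoded before being placed in the URL.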
