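Python Crawler Summary (Repost)

Posted by HowieLee59 on 2018-08-09

Original article: https://blog.csdn.net/neo233/article/details/81535938

Python Crawlers

I. Crawler Overview

Crawler scheduler: starts and stops the crawler and monitors how it is running
URL manager: tracks the URLs still to be crawled and the URLs already crawled
Page downloader: downloads the page at a given URL and stores it as a string
Page parser: extracts the valuable data, and extracts related URLs to feed back into the URL manager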

II. URL Manager
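The URL manager's job (tracking URLs to crawl and URLs already crawled) can be sketched with two sets. This is a minimal in-memory sketch, not the original author's implementation; the class name `UrlManager` and its method names are hypothetical.

```python
class UrlManager:
    """Tracks URLs waiting to be crawled and URLs already crawled."""

    def __init__(self):
        self.new_urls = set()   # URLs still to be crawled
        self.old_urls = set()   # URLs already handed out for crawling

    def add_new_url(self, url):
        # Ignore empty values and anything already known to the manager
        if url and url not in self.new_urls and url not in self.old_urls:
            self.new_urls.add(url)

    def add_new_urls(self, urls):
        for url in urls:
            self.add_new_url(url)

    def has_new_url(self):
        return len(self.new_urls) > 0

    def get_new_url(self):
        # Move one URL from the "to crawl" set to the "crawled" set
        url = self.new_urls.pop()
        self.old_urls.add(url)
        return url


manager = UrlManager()
manager.add_new_urls(["http://example.com/a", "http://example.com/b", "http://example.com/a"])
while manager.has_new_url():
    print("crawling", manager.get_new_url())
```

Sets make the duplicate check cheap; a large crawl would more likely keep this state in a database or cache so it survives restarts.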

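III. Page Downloader

(1) Method 1: call urllib2.urlopen(url) directly

(2) Method 2: build a Request object and add header and data to it

header: HTTP header information
data: data submitted by the user

(3) Method 3: build an opener with special-purpose handlers

HTTPCookieProcessor: for pages that require login
ProxyHandler: for pages accessed through a proxy
HTTPSHandler: for pages served over HTTPS
HTTPRedirectHandler: for pages whose URL redirects automatically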

# coding:utf8   # add this line if you hit encoding errors (Python 2 code)

import urllib2
import cookielib
url = "http://www.baidu.com"

print 'Method 1'
response1 = urllib2.urlopen(url)
print response1.getcode()
print len(response1.read())

print 'Method 2'
request = urllib2.Request(url)
# The header name is User-Agent (with a hyphen), not user_agent
request.add_header('User-Agent', 'Mozilla/5.0')
response2 = urllib2.urlopen(request)
print response2.getcode()
print len(response2.read())

print 'Method 3'
cj = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
urllib2.install_opener(opener)
response3 = urllib2.urlopen(request)
print response3.getcode()
print cj
print response3.read()

IV. Page Parser

Built into Python: html.parser
Third-party: BeautifulSoup, lxml
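For comparison with the third-party parsers, here is a small sketch of link extraction using only the built-in html.parser module (Python 3). The class `LinkCollector` and the helper `extract_links` are hypothetical names introduced for illustration.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(html):
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return parser.links


sample = ('<p><a href="http://example.com/elsie">Elsie</a> '
          '<a href="http://example.com/lacie">Lacie</a></p>')
print(extract_links(sample))
```

html.parser is event-driven, so anything beyond simple extraction needs hand-written state tracking; that is the main reason BeautifulSoup or lxml is usually more convenient.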

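Installing beautifulsoup4
1. In a command prompt, change to the Scripts folder of the Python installation: ~\Python27\Scripts
2. Run pip install beautifulsoup4

class is a Python keyword, so BeautifulSoup uses class_ to filter on the class attribute.

All attributes of a node can be accessed dictionary-style.

Reference: "Python爬蟲利器二之Beautiful Soup的用法" (Python Crawler Tools, Part 2: Using Beautiful Soup)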

# coding:utf8

import re
from bs4 import BeautifulSoup

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""
soup = BeautifulSoup(html_doc,'html.parser',from_encoding='utf-8')

print 'All links'
links = soup.find_all('a')
for link in links:
    print link.name, link['href'], link.get_text()

print 'The link to Lacie'
link_node = soup.find('a',href = 'http://example.com/lacie')
print link_node.name, link_node['href'], link_node.get_text()

print 'Regex match'
link_node = soup.find('a',href = re.compile(r'ill'))
print link_node.name, link_node['href'], link_node.get_text()

print 'Text of the title paragraph'
# The title is a <p> tag, so search for 'p' rather than 'a'
p_node = soup.find('p',class_ = "title")
print p_node.name, p_node.get_text()

Output:
All links
a http://example.com/elsie Elsie
a http://example.com/lacie Lacie
a http://example.com/tillie Tillie
The link to Lacie
a http://example.com/lacie Lacie
Regex match
a http://example.com/tillie Tillie
Text of the title paragraph
p The Dormouse's story

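Eclipse: Ctrl+Shift+M, Ctrl+Shift+O, or Ctrl+1 can automatically import the required package or create the corresponding class or method.

V. Worked Example

Study the target site and tailor the crawling strategy to it; update the strategy promptly whenever the target changes.

Mastering Python Web Crawlers (Python 3.X, PyCharm)

I. Types of Crawlers

General-purpose web crawler: crawls the whole web. It consists of an initial URL set, a URL queue, a page-crawling module, a page-analysis module, a page database, a link-filtering module, and so on.
Focused web crawler: crawls a single topic. It consists of an initial URL set, a URL queue, a page-crawling module, a page-analysis module, a page database, a link-filtering module, a content-evaluation module, a link-evaluation module, and so on.
Incremental web crawler: updates incrementally, crawling only new pages (or the parts that have changed) as far as possible.
Deep web crawler: targets pages hidden behind forms, which can only be reached by submitting certain keywords. It consists of a URL list, an LVS list (LVS is the label/value set, i.e. the data source used to fill in forms), a crawl controller, a parser, an LVS controller, a form analyzer, a form processor, a response analyzer, and so on.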

II. Core Techniques

Common PyCharm shortcuts:
Alt+Enter: quick import of a package
Ctrl+Z: undo; Ctrl+Shift+Z: redo

(1) The urllib library

1) Differences between Python 2.X and Python 3.X

Python2.X	Python3.X
`import urllib2`	`import urllib.request, urllib.error`
`import urllib`	`import urllib.request, urllib.error, urllib.parse`
`urllib2.urlopen`	`urllib.request.urlopen`
`urllib.urlencode`	`urllib.parse.urlencode`
`urllib.quote`	`urllib.parse.quote`
`cookielib.CookieJar`	`http.cookiejar.CookieJar`
`urllib2.Request`	`urllib.request.Request`

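2) Fetching a page quickly

import urllib.request

# Fetch the Baidu home page
file = urllib.request.urlopen("http://www.baidu.com", timeout=30) # timeout in seconds
data = file.read()            # read the whole response body; returns bytes
# The response is a stream: once read() has consumed it, the two calls below
# return empty results. Use them on a fresh response instead of after read().
dataline = file.readline()    # read one line (bytes)
datalines = file.readlines()  # read all remaining lines as a list of bytes

# Save to a local HTML file
fhandle = open("/.../1.html","wb")
fhandle.write(data)
fhandle.close()

# Shortcut: download straight to a local file
filename = urllib.request.urlretrieve("http://www.baidu.com",filename="/.../1.html")
urllib.request.urlcleanup() # clear the cache left behind by urlretrieve

# Other commonly used methods
file.getcode() # HTTP status code; 200 means the request succeeded
file.geturl() # URL of the page that was actually fetched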

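# URL encoding (needed before crawling when the URL contains Chinese or other non-standard characters)
# urllib.request.quote is the same function as the documented urllib.parse.quote
urllib.request.quote("http://www.baidu.com")  # http%3A//www.baidu.com
# URL decoding
urllib.request.unquote("http%3A//www.baidu.com") # http://www.baidu.com

Note: when a URL contains Chinese characters, such as https://www.baidu.com/s?wd=電影, the URL actually passed in should be "https://www.baidu.com/s?wd=" + urllib.request.quote("電影"), not urllib.request.quote("https://www.baidu.com/s?wd=電影"), because the latter also escapes the : and ? that belong to the URL itself.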
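A short sketch of the difference described in the note, using the documented urllib.parse functions. The variable names are illustrative only.

```python
from urllib.parse import quote, unquote

base = "https://www.baidu.com/s?wd="
keyword = "電影"

# Correct: escape only the query value
good_url = base + quote(keyword)

# Wrong: escaping the whole URL also escapes ':', '?' and '='
bad_url = quote(base + keyword)

print(good_url)
print(bad_url)
print(unquote(good_url))  # decodes back to the readable form
```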

3) Emulating a browser (for 403 Forbidden responses)

import urllib.request

url = "http://baidu.com"
# Adjacent string literals are joined, so the User-Agent contains no stray indentation
# (a backslash line continuation inside the quotes would embed the leading spaces)
user_agent = ("Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 "
              "(KHTML, like Gecko) Chrome/49.0.2623.221 Safari/537.36 SE 2.X MetaSr 1.0")

# Method 1
headers = ("User-Agent", user_agent)
opener = urllib.request.build_opener()
opener.addheaders = [headers]
data = opener.open(url).read()

# Method 2
req = urllib.request.Request(url)
req.add_header("User-Agent", user_agent)
data = urllib.request.urlopen(req).read()

4) POST requests

import urllib.request
import urllib.parse

url = "http://www.iqianyue.com/mypost"   # test site
# URL-encode the form data with urlencode(), then convert it to UTF-8 bytes with encode()
postdata = urllib.parse.urlencode({"name": "abc", "pass": "111"}).encode("utf-8")
req = urllib.request.Request(url, postdata)
req.add_header("User-Agent", "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 "
               "(KHTML, like Gecko) Chrome/49.0.2623.221 Safari/537.36 SE 2.X MetaSr 1.0")
data = urllib.request.urlopen(req).read()

Note:
postdata = urllib.parse.urlencode({"name": "abc", "pass": "111"}).encode("utf-8"): the result of urlencode() is the str name=abc&pass=111, and it must be converted with encode("utf-8") to the bytes b'name=abc&pass=111' before it can be sent as POST data.
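The request can be inspected without sending anything over the network, which is a convenient way to check the encoding step. This is a small sketch; `build_post_request` is a hypothetical helper name.

```python
import urllib.parse
import urllib.request


def build_post_request(url, fields):
    # urlencode() gives a str; Request needs bytes for the body
    body = urllib.parse.urlencode(fields).encode("utf-8")
    return urllib.request.Request(url, data=body)


req = build_post_request("http://www.iqianyue.com/mypost", {"name": "abc", "pass": "111"})
print(req.get_method())  # a request that carries data defaults to POST
print(req.data)
```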

5) Using a proxy server (for blocked IPs and 403 responses)

def use_proxy(proxy_add, url):
    import urllib.request
    proxy = urllib.request.ProxyHandler({'http': proxy_add})
    opener = urllib.request.build_opener(proxy, urllib.request.HTTPHandler)
    # Install the opener globally so that urlopen() uses it as well
    urllib.request.install_opener(opener)
    # The codec passed to decode() must match the encoding of the page
    data = urllib.request.urlopen(url).read().decode("gb2312")
    return data

# Proxy addresses can be found by searching online (for example on Baidu)
data = use_proxy("116.199.115.79:80", "http://www.baidu.com")
print(data)

Note: encode() converts text to bytes; decode() converts bytes back to text

Example (Python 3.X):
u = '中文'
b = u.encode('utf-8') # result: b'\xe4\xb8\xad\xe6\x96\x87', of type bytes
u1 = b.decode('utf-8') # result: 中文

Process:
str(unicode) --[encode('utf-8')]--> bytes --[decode('utf-8')]--> str(unicode)
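Because decode() must use the page's own encoding, one common approach is to try a short list of candidate codecs in order. The sketch below assumes the candidates are utf-8 and gb2312; `decode_with_fallback` is a hypothetical helper, not a standard library function.

```python
def decode_with_fallback(raw, encodings=("utf-8", "gb2312")):
    """Return raw decoded with the first encoding that accepts it."""
    for name in encodings:
        try:
            return raw.decode(name)
        except UnicodeDecodeError:
            continue
    raise ValueError("none of the encodings could decode the data")


text = "中文"
utf8_bytes = text.encode("utf-8")
gb_bytes = text.encode("gb2312")

print(utf8_bytes)
print(decode_with_fallback(utf8_bytes))
print(decode_with_fallback(gb_bytes))
```

Trying utf-8 first matters: it is strict enough to reject most gb2312 data, whereas a looser codec tried first could silently produce the wrong characters.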

6) DebugLog: debug logging

import urllib.request

httphd = urllib.request.HTTPHandler(debuglevel=1)
httpshd = urllib.request.HTTPSHandler(debuglevel=1)

opener = urllib.request.build_opener(httphd, httpshd)
urllib.request.install_opener(opener)
data = urllib.request.urlopen("http://www.baidu.com")

Output:

send: b'GET / HTTP/1.1\r\nAccept-Encoding: identity\r\nHost: www.baidu.com\r\nUser-Agent: Python-urllib/3.6\r\nConnection: close\r\n\r\n'
reply: 'HTTP/1.1 200 OK\r\n'
header: Date header: Content-Type header: Transfer-Encoding header: Connection header: Vary header: Set-Cookie header: Set-Cookie header: Set-Cookie header: Set-Cookie header: Set-Cookie header: Set-Cookie header: P3P header: Cache-Control header: Cxy_all header: Expires header: X-Powered-By header: Server header: X-UA-Compatible header: BDPAGETYPE header: BDQID header: BDUSERID

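7) Exception handling

URLError is raised when: 1) the server cannot be reached; 2) the remote URL does not exist; 3) there is no network connection; 4) an HTTPError occurs. HTTPError is a subclass of URLError that also carries the HTTP status code. Common status codes:
200: OK
301: Moved Permanently (redirected to a new URL)
302: Found (redirected to a temporary URL)
304: Not Modified (the requested resource has not changed)
400: Bad Request (the request is invalid)
401: Unauthorized (the request lacks authorization)
403: Forbidden (access is denied)
404: Not Found (no matching page was found)
500: Internal Server Error (the server failed internally)
501: Not Implemented (the server does not support the functionality the request needs)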

import urllib.error
import urllib.request

try:
    urllib.request.urlopen("http://www.baidu.com")
except urllib.error.URLError as e:
    if hasattr(e, "code"):
        print(e.code)
    if hasattr(e, "reason"):
        print(e.reason)
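The hasattr checks above work because HTTPError has a code attribute while a plain URLError does not. The sketch below makes the same distinction with isinstance and builds the exceptions by hand so that it runs without network access; `describe_error` is a hypothetical helper.

```python
import urllib.error
from email.message import Message


def describe_error(exc):
    # Check HTTPError first: it is a subclass of URLError
    if isinstance(exc, urllib.error.HTTPError):
        return "HTTP %d: %s" % (exc.code, exc.reason)
    if isinstance(exc, urllib.error.URLError):
        return "URL error: %s" % exc.reason
    raise TypeError("not a urllib error")


# Hand-built exceptions standing in for real failures
http_err = urllib.error.HTTPError("http://example.com/missing", 404, "Not Found", Message(), None)
url_err = urllib.error.URLError("connection refused")

print(describe_error(http_err))
print(describe_error(url_err))
```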

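(2) Regular expressions

1) Basic syntax (applies to other languages too)

1. Matching a single character

[...] matches any one character in the set
\w is equivalent to [a-zA-Z0-9_] for ASCII text: it matches any upper- or lower-case letter, digit, or underscore

2. Matching multiple characters

* matches the preceding character zero or more times and is greedy; *? is the non-greedy form and matches as few times as possible (zero times if the rest of the pattern allows it).
+ matches the preceding character one or more times; +? matches as few times as possible (once if the rest of the pattern allows it).
? matches the preceding character zero or one time; ?? prefers zero times.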
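The greedy and non-greedy forms are easiest to compare side by side. A short sketch with made-up sample strings:

```python
import re

text = "<b>bold</b><i>italic</i>"

greedy = re.match(r"<.+>", text).group()   # takes as much as it can
lazy = re.match(r"<.+?>", text).group()    # stops at the first '>'
print(greedy)
print(lazy)

# On their own, the non-greedy quantifiers take the minimum they are allowed
print(repr(re.match(r"a*?", "aaa").group()))
print(repr(re.match(r"a+?", "aaa").group()))
print(repr(re.match(r"a??", "aaa").group()))

# When the rest of the pattern needs more, they still expand
print(re.match(r"a*?b", "aaab").group())
```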

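3. Boundary matching
4. Group matching

\<num> refers to the string matched by group number num:

Code 1: re.match(r'<(book>)(python)</\1\2','<book>python</book>python').group()
Result: <book>python</book>python
Code 2: re.match(r'<(book>)(python)</\1\2','<book>python</book>python').groups()
Result: ('book>','python')

Explanation: .groups() returns the strings matched by the groups, that is, by the sub-patterns in () inside the overall pattern. In code 1 the overall pattern is '<(book>)(python)</\1\2', which contains the two groups (book>) and (python); the result of code 2 is what those two groups matched. \<num> refers back to those strings by number: \1 is book> and \2 is python.

Named groups (?P<name>) and (?P=name) can be used instead, so code 1 can also be written as:
re.match(r'<(?P<mark1>book>)(?P<mark2>python)</(?P=mark1)(?P=mark2)','<book>python</book>python').group()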

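5. Pattern modifiers (flags)

Flag	Meaning
`I`	Ignore case when matching
`M`	Multi-line matching
`L`	Locale-aware matching
`U`	Interpret character classes according to Unicode
`S`	Make `.` match newlines too, so that `.` matches any character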

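2) The re module

import re
text = 'imooc python'

pa = re.compile(r'imooc') # compile a pattern that matches the string 'imooc'
ma = pa.match(text)
# equivalent to
ma = re.match(r'imooc', text)

ma.string   # the string that was searched
ma.re       # the pattern used (the value of pa)
ma.group()  # the matched text
ma.span()   # the position of the match

pa = re.compile(r'imooc', re.I) # match 'imooc' regardless of case

# All of the above can be written as
ma = re.match(r'imooc', 'imooc python', re.I)

The r prefix on a pattern string:
(1) With r, the pattern is a raw string: Python passes it to the regex engine exactly as written, without processing backslash escapes first.
(2) Without r, a pattern that contains no escape sequences is unaffected; one that does contain them is first processed by Python's string escaping, and this has to be taken into account.
For example, r'imooc\\n' gives the pattern imooc\\n, whereas 'imooc\\n' gives the pattern imooc\n, because in an ordinary string '\\' is the escape sequence for a single '\'.

match starts at the beginning of the string and succeeds only if the beginning matches the pattern; otherwise it returns None
search scans through the string and returns the first place where the pattern matches; if there is none it returns None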
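A quick sketch of the difference between match and search, on a made-up sample string:

```python
import re

text = "learn imooc python"

at_start = re.match(r"imooc", text)    # 'imooc' is not at the start
anywhere = re.search(r"imooc", text)   # found later in the string

print(at_start)
print(anywhere.group(), anywhere.span())

# Both return None on failure, so test the result before calling .group()
result = re.search(r"java", text)
print(result.group() if result else "no match")
```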

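For sub(), the repl argument can be a replacement string, or a function that returns the replacement string. count is the maximum number of replacements; the default, 0, replaces every match.
re.sub(r'\d+','100','imooc videnum=99')
re.sub(r'\d+',lambda x: str(int(x.group())+1),'imooc videnum=99')
Result (both calls): 'imooc videnum=100'
lambda x: str(int(x.group())+1) is an anonymous function. The x before the colon is its parameter, which receives the match object, so .group() is needed to get the matched text. The expression after the colon is the return value. It can also be written as: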

def add(x):
    val = x.group()
    num = int(val)+1
    return str(num)

re.sub(r'\d+',add,'imooc videnum=99')

(3) Using cookies (for simulated login)

import urllib.request
import urllib.parse
import http.cookiejar

# Create a CookieJar object
cjar = http.cookiejar.CookieJar()
# Create a cookie processor with HTTPCookieProcessor and build an opener from it
opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(cjar))
# Install the opener globally
urllib.request.install_opener(opener)

# Login page of the site
url1 = 'http://xxx.com/index.php/login/login_new/'
# Data to POST when logging in
postdata = urllib.parse.urlencode({
            'username': 'xxx',
            'password': 'xxx'
            }).encode("utf-8")
req = urllib.request.Request(url1, postdata)
# A page that can only be viewed after logging in
url2 = 'http://xxx.com/index.php/myclass'

# Log in to the site
file1 = urllib.request.urlopen(req)
# Fetch the target page; the cookies saved at login are sent automatically
file2 = urllib.request.urlopen(url2).read()

(4) Multithreading and queues

# Threading basics
import threading

class A(threading.Thread):
    def __init__(self):
        # Initialize the thread
        threading.Thread.__init__(self)

    def run(self):
        # The work this thread performs
        for i in range(10):
            print("thread A running")

class B(threading.Thread):
    def __init__(self):
        threading.Thread.__init__(self)

    def run(self):
        for i in range(10):
            print("thread B running")

t1 = A()
t1.start()

t2 = B()
t2.start()

# Queue basics (first in, first out)
import queue
# Create the queue object
a = queue.Queue()
# Put data into the queue
a.put("hello")
a.put("php")
a.put("python")
a.put("bye")

for i in range(4):
    # Take an item out
    print(a.get())
    # Mark the item just taken as processed; task_done() pairs with get(),
    # it does not signal the end of input
    a.task_done()
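Threads and a queue are usually combined: worker threads pull URLs from a shared queue until it is drained. The sketch below shows that pattern under stated assumptions: the "download" is a placeholder string rather than a real request, and `crawl_all` and the None sentinel convention are illustrative choices.

```python
import queue
import threading


def crawl_all(urls, worker_count=3):
    tasks = queue.Queue()
    results = []
    lock = threading.Lock()

    def worker():
        while True:
            url = tasks.get()
            if url is None:              # sentinel: no more work for this worker
                tasks.task_done()
                break
            with lock:                   # protect the shared results list
                results.append("fetched " + url)  # placeholder for a real download
            tasks.task_done()

    threads = [threading.Thread(target=worker) for _ in range(worker_count)]
    for t in threads:
        t.start()
    for url in urls:
        tasks.put(url)
    for _ in threads:
        tasks.put(None)                  # one sentinel per worker
    tasks.join()                         # wait until every item is marked done
    for t in threads:
        t.join()
    return results


print(sorted(crawl_all(["http://example.com/1", "http://example.com/2"])))
```

The order in which workers finish is not deterministic, so the results are sorted before printing.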

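(5) Disguising the crawler as a browser

Header fields:

Accept:text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8

Accept: content types the browser supports, listed from highest to lowest priority
text/html: HTML document
application/xhtml+xml: XHTML document
application/xml: XML document

Accept-Encoding:gzip, deflate, sdch (when this field is set, the server returns the content compressed in one of the listed encodings; a browser decompresses it automatically, but a crawler that does not decompress the body will see garbled data)

Accept-Encoding: compression encodings the browser supports
deflate: a lossless data compression algorithm

Accept-Language:zh-CN,zh;q=0.8

Accept-Language: languages supported
zh-CN: zh is Chinese, CN is Simplified
en-US: English (United States)

Connection:keep-alive

Connection: type of connection between client and server
keep-alive: persistent connection
close: the connection is closed after the response

Referer:http://123.sogou.com/ (some anti-crawler sites check this field; it can usually be set to the domain of the page being crawled or to the home page of that site)

Referer: the page the request came from

The addheaders attribute of an opener takes a list of tuples: [('Connection','keep-alive'),("User-Agent", "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.221 Safari/537.36 SE 2.X MetaSr 1.0"),...]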
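As an alternative to addheaders, the same fields can be passed to a Request as a dictionary and inspected before anything is sent. This is a sketch; Accept-Encoding is deliberately left out because urllib does not decompress responses on its own.

```python
import urllib.request

browser_headers = {
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
    "Accept-Language": "zh-CN,zh;q=0.8",
    "Referer": "http://123.sogou.com/",
    "User-Agent": ("Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 "
                   "(KHTML, like Gecko) Chrome/49.0.2623.221 Safari/537.36 SE 2.X MetaSr 1.0"),
}

req = urllib.request.Request("http://123.sogou.com/", headers=browser_headers)

# urllib stores header names in capitalized form, e.g. "User-agent"
for name, value in sorted(req.header_items()):
    print(name + ": " + value)
```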

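III. The Scrapy Framework

(1) Common crawler frameworks

Scrapy: https://scrapy.org/
Crawley
Portia: has a web-based version
newspaper
python-goose

(2) Installing Scrapy

With Python 2.X and Python 3.X both installed, from the command prompt:
py -2: start Python 2.X
py -3: start Python 3.X
py -2 -m pip install ...: install with the Python 2.X pip
py -3 -m pip install ...: install with the Python 3.X pip
If the installation times out:
specify a package index manually by adding -i after pip, for example:
pip install packagename -i http://pypi.douban.com/simple --trusted-host pypi.douban.com
PyPI mirrors in China at the time of writing:
Douban http://pypi.douban.com/simple/
Alibaba Cloud http://mirrors.aliyun.com/pypi/simple/
University of Science and Technology of China https://pypi.mirrors.ustc.edu.cn/simple/
Tsinghua University https://pypi.tuna.tsinghua.edu.cn/simple/
Huazhong University of Science and Technology http://pypi.hustunique.com/
Shandong University of Technology http://pypi.sdutlinux.org/
If the following error appears:
error:Microsoft Visual C++ 14.0 is required. Get it with "Microsoft Visual C++ Build Tools": http://landinghub.visualstudio.com/visual-cpp-build-tools
Solution:
download the Twisted .whl file that matches your setup from http://www.lfd.uci.edu/~gohlke/pythonlibs/#twisted; the number after cp is the Python version and amd64 means 64-bit (go by the bitness of your Python installation)
Then run:
pip install C:\xxx\Twisted-17.5.0-cp36-cp36m-win_amd64.whl
If, after a successful installation, running Scrapy fails with No module named 'win32api':
download and install the matching pywin32 from https://sourceforge.net/projects/pywin32/files%2Fpywin32/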

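(3) Using Scrapy (commands typed at the command prompt)

1) Creating a project

scrapy startproject myscrapy: creates a crawler project named myscrapy and generates the following layout

Directory structure:

myscrapy/scrapy.cfg: project configuration file
myscrapy/myscrapy/items.py: data container file; defines the data to be collected
myscrapy/myscrapy/pipelines.py: pipeline file; post-processes the data defined in items
myscrapy/myscrapy/settings.py: settings file
myscrapy/myscrapy/spiders: folder for spider files
myscrapy/myscrapy/middlewares.py: downloader middleware file

Options:

scrapy startproject --logfile="../logf.log" myscrapy
Creates the myscrapy project and writes a log file named logf at the given path
scrapy startproject --loglevel=DEBUG myscrapy
Creates the project with the log level set to DEBUG (the default). The levels are:

Level	Meaning
`CRITICAL`	The most severe errors
`ERROR`	Errors that must be dealt with immediately
`WARNING`	Warnings; a potential error exists
`INFO`	Informational messages
`DEBUG`	Debugging messages, mostly used during development

scrapy startproject --nolog myscrapy
Creates the project without printing any log output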

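2) Common commands

Global commands: (listed by scrapy -h outside a project folder)

scrapy fetch http://www.baidu.com: shows the process of fetching the site
scrapy fetch --headers --nolog http://www.baidu.com: shows the headers without the log output
scrapy runspider spider_file.py -o xxx/xxx.xxx: runs the given spider file and saves the scraped results to the given file
scrapy settings --get BOT_NAME: prints the project name when run inside a project, scrapybot when run outside one
scrapy shell http://www.baidu.com --nolog: fetches the Baidu home page and opens an interactive shell, with log output turned off.

Project commands: (listed by scrapy -h inside a project folder)

scrapy bench: runs a quick benchmark on the local hardware
scrapy genspider -l: lists the available spider templates

scrapy genspider -d template_name: shows the content of a spider template
scrapy genspider -t template_name spider_name domain: quickly creates a spider file
scrapy check spider_name: runs contract checks on the spider
scrapy crawl spider_name: starts the spider
scrapy list: lists the spiders available in the project
scrapy edit spider_name: opens the spider file for editing (problematic on Windows)
scrapy parse URL: fetches the given URL and processes it with the matching spider. Commonly used options:

Option	Meaning
`--spider=SPIDER`	Use the specified spider to process the URL
`-a NAME=VALUE`	Set a spider argument
`--pipelines`	Process items through the pipelines
`--nolinks`	Do not show the extracted links
`--noitems`	Do not show the scraped items
`--nocolour`	Do not colorize the output
`--rules,-r`	Use the CrawlSpider rules to find the callback
`--callback=CALLBACK,-c CALLBACK`	Spider method used as the callback for parsing the response
`--depth=DEPTH,-d DEPTH`	Crawl depth; the default is 1
`--verbose,-v`	Show detailed information for each depth level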

3) Writing Items

import scrapy
class MyscrapyItem(scrapy.Item):
    # define the fields for your item here like:
    name = scrapy.Field()
    ...

Format: field_name = scrapy.Field()
Instantiation: item = MyscrapyItem(name = "xxx",...)
Access: item["name"], item.keys(), item.items() (an item can be used like a dictionary)

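4) Writing a Spider (BasicSpider)

# -*- coding: utf-8 -*-
import scrapy
class MyspiderSpider(scrapy.Spider):
    name = 'myspider' # spider name
    allowed_domains = ['baidu.com'] 
    start_urls = ['http://baidu.com/']

    def parse(self, response):
        pass

allowed_domains: the domains the spider may crawl. When OffsiteMiddleware is enabled, URLs outside these domains are filtered out automatically and not followed.
start_urls: the starting URLs. If no URLs are given explicitly, crawling starts from the ones defined in this attribute; several may be listed, separated by commas.
parse method: unless another callback is specified, this is the default method for handling the responses to the pages the spider crawls. It processes the response and returns the extracted data, and it is also responsible for following links.

Other methods	Meaning
`start_requests()`	By default reads the URLs in the `start_urls` attribute (this can be customized), creates a Request object for each, and returns an iterable
`make_requests_from_url(url)`	Called by `start_requests()` to create each Request object (deprecated in newer Scrapy releases)
`close(reason)`	Called when the spider is closed
`log(message[,level, component])`	Adds a log entry from within the spider
`__init__()`	Constructor that initializes the spider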

# -*- coding: utf-8 -*-
import scrapy
from myscrapy.items import MyscrapyItem

class MyspiderSpider(scrapy.Spider):
    name = 'myspider'
    allowed_domains = ['baidu.com']
    start_urls = ['http://baidu.com/']
    my_urls = ['http://baidu.com/', 'http://baidu.com/']

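    # Override this method to read your own list of URLs; without the override,
    # the starting URLs come from start_urls
    def start_requests(self):
        for url in self.my_urls:
            # Call the default make_requests_from_url() to build each request and yield it
            yield self.make_requests_from_url(url)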

    def parse(self, response):
        item = MyscrapyItem()
        item["name"] = response.xpath("/html/head/title/text()")
        print(item["name"])

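5) XPath basics

/: selects a tag; can be chained to walk down several levels
//: selects every occurrence of a tag anywhere in the document
text(): gets the text content of the tag
//Z[@X="Y"]: gets all <Z> tags whose attribute X has the value Y

1. Returns a SelectorList object
2. Returns a list containing the extracted content
3. Returns the first element of the list from 2 (raises an exception if the list is empty)
4. Returns the first element of the SelectorList from 1 (raises an exception if the list is empty); the effect is the same as 3
5. 4 returns a str, so 5 returns the first character of that str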
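The path syntax can be tried without Scrapy by using the standard library's xml.etree.ElementTree as a stand-in. This is only a sketch: ElementTree supports a subset of XPath (for instance it has no text() function, so the .text attribute is used instead), and the sample document is made up.

```python
import xml.etree.ElementTree as ET

doc = ET.fromstring(
    "<html><head><title>Demo page</title></head>"
    "<body>"
    '<a class="sister" href="http://example.com/elsie">Elsie</a>'
    '<a class="other" href="http://example.com/other">Other</a>'
    "</body></html>"
)

# Similar in spirit to /html/head/title/text(); the path is relative to the <html> root
title = doc.find("head/title").text

# Similar in spirit to //a[@class="sister"]
sisters = doc.findall(".//a[@class='sister']")

print(title)
print([a.get("href") for a in sisters])
```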

6) Passing arguments to a Spider class (with the -a option)

# -*- coding: utf-8 -*-
import scrapy
from myscrapy.items import MyscrapyItem

class MyspiderSpider(scrapy.Spider):
    name = 'myspider'
    allowed_domains = ['baidu.com']
    start_urls = ['http://baidu.com/']
    # Override the constructor and add the myurl argument
    def __init__(self, myurl=None, *args, **kwargs):
        super(MyspiderSpider, self).__init__(*args, **kwargs)
        myurllist = myurl.split(",")
        # Print the sites that will be crawled
        for i in myurllist:
            print("Crawling site: %s" % i)
        # Redefine the start_urls attribute
        self.start_urls = myurllist

    def parse(self, response):
        item = MyscrapyItem()
        item["name"] = response.xpath("/html/head/title/text()")
        print(item["name"])

Command line: scrapy crawl myspider -a myurl=http://www.sina.com.cn,http://www.baidu.com --nolog

7）XMLFeedSpider

# -*- coding: utf-8 -*-
from scrapy.spiders import XMLFeedSpider

class MyxmlSpider(XMLFeedSpider):
    name = 'myxml'
    allowed_domains = ['sina.com.cn']
    start_urls = ['http://sina.com.cn/feed.xml']
    iterator = 'iternodes'  # you can change this; see the docs
    itertag = 'item'  # change it accordingly

    def parse_node(self, response, selector):
        i = {}
        # i['url'] = selector.select('url').extract()
        # i['name'] = selector.select('name').extract()
        # i['description'] = selector.select('description').extract()
        return i

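iterator: sets the iterator; the default is iternodes (a fast iterator based on regular expressions), and html and xml are also available
itertag: sets the node at which iteration starts
parse_node(self, response, selector): called for each node that matches the given tag name; extract and process the data here

Other attributes and methods	Meaning
`namespaces`	A list defining the namespaces in the document that the spider will process
`adapt_response(response)`	Called before the spider starts analyzing the response
`process_results(response, results)`	Called when the spider returns its results, for final processing before they are returned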

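8) CSVFeedSpider

CSV: a simple, widely used file format whose data converts to and from tabular data. In its basic form it is plain text, with columns separated by commas and rows separated by line breaks.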

# -*- coding: utf-8 -*-
from scrapy.spiders import CSVFeedSpider

class MycsvSpider(CSVFeedSpider):
    name = 'mycsv'
    allowed_domains = ['iqianyue.com']
    start_urls = ['http://iqianyue.com/feed.csv']
    # headers = ['id', 'name', 'description', 'image_link']
    # delimiter = '\t'

    # Do any adaptations you need here
    #def adapt_response(self, response):
    #    return response

    def parse_row(self, response, row):
        i = {}
        #i['url'] = row['url']
        #i['name'] = row['name']
        #i['description'] = row['description']
        return i

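headers: a list of the column names in the CSV file, used to extract the fields of each row
delimiter: the separator between fields; CSV files use a comma
parse_row(self, response, row): receives the Response object and one row at a time, and processes them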

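9) CrawlSpider (automatic crawling)

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

class MycrawlSpider(CrawlSpider):
    name = 'mycrawl'
    allowed_domains = ['sohu.com']
    start_urls = ['http://sohu.com/']
    # Rules for automatic crawling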
    rules = (
        Rule(LinkExtractor(allow=r'Items/'), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        i = {}
        #i['domain_id'] = response.xpath('//input[@id="sid"]/@value').extract()
        #i['name'] = response.xpath('//div[@id="name"]').extract()
        #i['description'] = response.xpath('//div[@id="description"]').extract()
        return i

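rules: sets the automatic crawling rules. The arguments of a Rule are:
LinkExtractor: the link extractor, which picks out the links on a page that meet the given conditions so that they can be crawled next. Its parameters include:

Parameter	Meaning
`allow`	Extract links that match the regular expression
`deny`	Do not extract links that match the regular expression
`restrict_xpaths`	An XPath expression that works together with allow; only links satisfying both are extracted
`allow_domains`	Domains that may be extracted; only links under these domains are used
`deny_domains`	Domains that may not be extracted; links under these domains are skipped

callback='parse_item': the callback method that processes each response
follow=True: whether to follow links. A CrawlSpider extracts the links that satisfy the link extractor's rules and then crawls them automatically, forming a loop. If follow is on, the loop keeps going; if it is off, the loop stops after the first round.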

10) Avoiding bans (configured in settings.py)

Disable cookies: (for sites that identify and profile users through their cookies)

# Disable cookies (enabled by default)
COOKIES_ENABLED = False

Set a download delay: (sets the interval between requests, for sites that analyze how often pages are visited)

# Configure a delay for requests for the same website (default: 0)
# See http://scrapy.readthedocs.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3

IP pool: (for sites that check the visitor's IP)

Write the following in middlewares.py or in a new Python file:

import random
# scrapy.contrib was removed in later Scrapy releases; this is the current module path
from scrapy.downloadermiddlewares.httpproxy import HttpProxyMiddleware

class IPPOOLS(HttpProxyMiddleware):
    myIPPOOL = ["183.151.144.46:8118",
                "110.73.49.52:8123",
                "123.55.2.126:808"]
    # process_request() handles each outgoing request
    def process_request(self, request, spider):
        # Pick a random IP
        thisip = random.choice(self.myIPPOOL)
        # Use that IP as the proxy for this request
        request.meta["proxy"] = "http://" + thisip
        # Print it for inspection
        print('Current proxy IP: %s' % request.meta["proxy"])
Register it as a downloader middleware:

# Enable or disable downloader middlewares
# See http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html
DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 123,
    # Format: 'package.module_name.class_name': order number
    'myscrapy.middlewares.IPPOOLS': 125
}

User-agent pool

Write the following in middlewares.py or in a new Python file:

import random
from scrapy.downloadermiddlewares.useragent import UserAgentMiddleware

class UAPOOLS(UserAgentMiddleware):
    # Adjacent string literals are joined, so no indentation ends up inside the values
    myUApool = [
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/49.0.2623.221 Safari/537.36 SE 2.X MetaSr 1.0",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/60.0.3112.113 Safari/537.36",
    ]

    def process_request(self, request, spider):
        thisUA = random.choice(self.myUApool)
        request.headers.setdefault('User-Agent', thisUA)
        print("Current UA: %s" % thisUA)

Register it as a downloader middleware:

# Enable or disable downloader middlewares
# See http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html
DOWNLOADER_MIDDLEWARES = {
    'myscrapy.middlewares.UAPOOLS': 1,
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': 2,
}

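(4) Scrapy's core architecture

Scrapy engine: the core of the framework; controls the whole data-processing flow and triggers events.
Scheduler: stores the URLs waiting to be crawled, decides their priority, and filters out duplicates.
Downloader: downloads page resources at high speed and passes the data to the engine, which hands it on to the spiders.
Downloader middleware: components between the downloader and the engine that handle the communication between the two.
Spiders: receive the responses from the engine, analyze them, and extract the required data.
Spider middleware: components between the spiders and the engine that handle the communication between the two.
Item pipeline: receives the data extracted by the spiders and processes it, for example cleaning, validating, and storing it in a database.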

(5) Output and storage in Scrapy

1) Saving Chinese text

Enable the pipeline in settings.py

# Configure item pipelines
# See http://scrapy.readthedocs.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    'myscrapy.pipelines.MyscrapyPipeline': 300,
}

import codecs
class MyscrapyPipeline(object):
    def __init__(self):
        # Create or open, for writing, the file that will hold the data
        self.file = codecs.open('E:/xxx/mydata.txt',
                                'wb',
                                encoding="utf-8")
    # Main processing method; called automatically for every item
    def process_item(self, item, spider):
        content = str(item) + '\n'
        self.file.write(content)
        return item
    # Called when the spider is closed
    def close_spider(self, spider):
        self.file.close()

Note: process_item() only runs if the spider's parse() method returns the item: yield item

2) JSON output

import codecs
import json
class MyscrapyPipeline(object):
    def __init__(self):
        print("creating pipeline")
        # Create or open, for writing, the file that will hold the data
        self.file = codecs.open('E:/PycharmProjects/untitled/myscrapy/data/mydata.txt',
                                'wb',
                                encoding="utf-8")

    def process_item(self, item, spider):
        js = json.dumps(dict(item), ensure_ascii=False)
        content = js + '\n'
        self.file.write(content)
        return item

    # Called when the spider is closed
    def close_spider(self, spider):
        self.file.close()

Notes:

In the spider's parse() method, the SelectorList returned by response.xpath("xxx/text()") cannot be serialized to JSON. Call response.xpath("xxx/text()").extract() to turn it into a list of strings first.

json.dumps(dict(item), ensure_ascii=False): by default json.dumps() escapes non-ASCII characters, so Chinese text is written as \uXXXX sequences. With ensure_ascii=False the Chinese text is written as is and displays normally.
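The effect of ensure_ascii can be seen on a small dictionary. The sample item below is made up for illustration.

```python
import json

item = {"name": "中文標題", "tags": ["爬蟲", "Python"]}

escaped = json.dumps(item)                       # default: non-ASCII is escaped
readable = json.dumps(item, ensure_ascii=False)  # Chinese characters are kept as is

print(escaped)
print(readable)

# Both forms decode back to the same data
print(json.loads(escaped) == json.loads(readable))
```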

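3) Database operations

Install: pip install pymysql
Import: import pymysql
Connect to MySQL:
conn = pymysql.connect(host="hostname", user="username", passwd="password"[, db="database name"])
Execute an SQL statement:
conn.query("SQL statement")
View the contents of a table:

# cursor() creates a cursor
cs = conn.cursor()
# execute() runs the select statement
cs.execute("select * from mytb")
# Iterate over the rows
for i in cs:
    print("current row: " + str(cs.rownumber))
    print(i[x])  # x is the index of the column to print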

IV. Example from the Scrapy Documentation

(1) Crawling `http://quotes.toscrape.com/` page after page

import scrapy
class MyxpathSpider(scrapy.Spider):
    name = 'myxpath'
    allowed_domains = ['toscrape.com']
    start_urls = ['http://quotes.toscrape.com/']

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').extract_first(),
                'author': quote.xpath('span/small/text()').extract_first(),
            }

        next_page = response.css('li.next a::attr("href")').extract_first()
        if next_page is not None:
            yield response.follow(next_page, self.parse)

When crawling page after page, make sure the next page is under one of the allowed_domains; otherwise it is filtered out and the loop stops.

Reference:
Links: https://www.jianshu.com/p/3036689e613f
Source: Jianshu (簡書)
