Python Web Crawler 4 - Getting Started with scrapy

Posted by litreily on 2018-05-29

This post was first published at www.litreily.top

scrapy is a powerful crawling framework and well worth learning properly. This article is a summary written after learning and using scrapy. The content is fairly basic, essentially introductory notes, and mainly describes scrapy's core concepts and how to use it.

scrapy framework

First, here is the classic scrapy architecture diagram:

(figure: scrapy framework architecture)

The scrapy framework consists of the following components:

  1. Scrapy Engine
  2. Spiders
  3. Scheduler
  4. Downloader
  5. Item Pipeline
  6. Downloader Middlewares
  7. Spider Middlewares

spider process

The crawling process can be summarized as follows:

  1. The engine obtains the first URL(s) to crawl from the spider and passes them to the scheduler
  2. The scheduler puts the URLs into its queue
  3. The engine asks the scheduler for the next URL to crawl and sends the resulting request to the downloader through the downloader middlewares
  4. The downloader fetches the page from the web and returns the downloaded response to the engine through the downloader middlewares
  5. The engine passes the response to the spider through the spider middlewares
  6. The spider parses the response and returns the extracted Items and new links to the engine through the spider middlewares
  7. The engine sends the Items received from the spider to the Item Pipeline and the new requests to the scheduler
  8. Steps 2 to 7 then repeat until there are no more URLs to crawl

It is worth noting that the Item Pipeline is mainly responsible for data cleaning, validation and persistent storage; the Downloader Middlewares act as hooks between the downloader and the engine, used to monitor or modify download requests and downloaded responses, for example to modify request headers; the Spider Middlewares act as hooks between the spider and the engine, used to process the spider's input and output, i.e. the page responses and the Items and requests obtained after the spider parses a page. A minimal sketch of both kinds of hook follows.
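
To make these hooks concrete, here is a minimal sketch of an item pipeline and a downloader middleware. The class names PriceValidationPipeline and CustomHeaderMiddleware are hypothetical and only illustrate where cleaning and header modification would happen; they are not part of the default project:

# pipelines.py (sketch): clean and validate items handed over by the engine
from scrapy.exceptions import DropItem

class PriceValidationPipeline:                 # hypothetical pipeline
    def process_item(self, item, spider):
        if not item.get('price'):
            raise DropItem('missing price')    # discard invalid items
        item['price'] = float(item['price'])   # simple cleaning step
        return item

# middlewares.py (sketch): tweak request headers before download
class CustomHeaderMiddleware:                  # hypothetical downloader middleware
    def process_request(self, request, spider):
        request.headers.setdefault('User-Agent', 'Mozilla/5.0')
        return None                            # let the request continue as usual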

Items

As for what Items are, I think of an Item as a unit of data obtained after the spider parses a page; it holds a group of fields. For example, if you crawl product information from some site, each crawled page may yield several product records, and each record contains the product name, price, production date, style and so on. We can then define an Item for it:

from scrapy.item import Item, Field


class GoodsItem(Item):
    name = Field()    # product name
    price = Field()   # price
    date = Field()    # production date
    types = Field()   # product style

Field() is essentially an extension of the dict type. As the code above shows, one Item corresponds to one product record, and a single page may contain one or more products. All Item fields are assigned inside the Spider and then handed to the Item Pipeline via the engine. A concrete implementation will appear in the examples of later posts; this article only records scrapy's basic concepts and usage. A minimal sketch of assigning fields in a spider is shown below.
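
As a rough sketch only (the CSS classes .goods, .name and .price are hypothetical, and GoodsItem is the class defined above), populating an item inside a spider callback might look like this:

# sketch: filling in GoodsItem fields inside a spider callback
def parse(self, response):
    for goods in response.css('.goods'):                         # hypothetical selector
        item = GoodsItem()
        item['name'] = goods.css('.name::text').extract_first()
        item['price'] = goods.css('.price::text').extract_first()
        yield item                                               # sent to the Item Pipeline via the engine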

Install

with pip

pip install scrapy

or with conda

conda install -c conda-forge scrapy

The basic commands are as follows:

D:\WorkSpace>scrapy --help
Scrapy 1.5.0 - no active project

Usage:
  scrapy <command> [options] [args]

Available commands:
  bench         Run quick benchmark test
  fetch         Fetch a URL using the Scrapy downloader
  genspider     Generate new spider using pre-defined templates
  runspider     Run a self-contained spider (without creating a project)
  settings      Get settings values
  shell         Interactive scraping console
  startproject  Create new project
  version       Print Scrapy version
  view          Open URL in browser, as seen by Scrapy

  [ more ]      More commands available when run from project directory

Use "scrapy <command> -h" to see more info about a command

If you want to use a virtual environment, install virtualenv first:

pip install virtualenv
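
The usual workflow is then to create and activate an environment before installing scrapy into it; the environment name venv below is arbitrary:

virtualenv venv
venv\Scripts\activate        # Windows; on Linux/macOS use: source venv/bin/activate
pip install scrapy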

scrapy startproject

scrapy startproject <project-name> [project-dir]

This command generates a new scrapy project. Taking demo as an example:

$ scrapy startproject demo
...
You can start your first spider with:
    cd demo
    scrapy genspider example example.com

$ cd demo
$ tree
.
├── demo
│   ├── __init__.py
│   ├── items.py
│   ├── middlewares.py
│   ├── pipelines.py
│   ├── __pycache__
│   ├── settings.py
│   └── spiders
│       ├── __init__.py
│       └── __pycache__
└── scrapy.cfg

4 directories, 7 files

As you can see, startproject automatically generated a set of folders and files, among which:

  1. scrapy.cfg: the project configuration file; it usually does not need to be modified
  2. items.py: the file where items are defined, such as the GoodsItem above
  3. middlewares.py: middleware code; by default it contains a downloader middleware and a spider middleware
  4. pipelines.py: item pipelines, used to process the items returned by spiders, including cleaning, validation and persistence
  5. settings.py: the global configuration file, containing all kinds of global settings
  6. spiders: the folder that holds all spider files; note that one project can contain multiple spiders
  7. __init__.py: marks the containing folder as a Python module
  8. __pycache__: stores the .pyc files (a cross-platform byte code) generated by the interpreter; in Python 2 these files live in the same folder as the .py files

scrapy genspider

Once the project has been generated, you can use the scrapy genspider command to generate a spider file automatically. For example, to crawl the huaban.com home page, run the following:

$ cd demo
$ scrapy genspider huaban www.huaban.com

The spider file huaban.py generated by default looks like this:

# -*- coding: utf-8 -*-
import scrapy


class HuabanSpider(scrapy.Spider):
    name = 'huaban'
    allowed_domains = ['www.huaban.com']
    start_urls = ['http://www.huaban.com/']

    def parse(self, response):
        pass
  • The spider class inherits from scrapy.Spider
  • name is a required attribute that identifies the spider
  • allowed_domains lists the domains the spider is allowed to crawl; links outside these domains are dropped
  • start_urls holds the spider's start URLs; it is a list, so multiple start URLs can be given

If you want to customize the start requests, you can also override the start_requests method of the scrapy.Spider class. This is not covered in detail here, but a rough sketch is given below.
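
As an illustration only (the URL list and callback here simply mirror the generated defaults), such an override might look like this:

# sketch: building the initial requests manually instead of using start_urls
def start_requests(self):
    urls = ['http://www.huaban.com/']
    for url in urls:
        # custom headers, cookies or callbacks could be attached here
        yield scrapy.Request(url=url, callback=self.parse)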

parse is the default callback: after the downloader has downloaded a page, this method is called to parse it, and response is the response to the request. For parsing page content, scrapy provides several built-in selectors (Selector), including XPath selectors, CSS selectors and regular-expression matching. Below are some selector examples to give a more intuitive feel for how they are used.

# xpath selector
response.xpath('//a')
response.xpath('./img').extract()
response.xpath('//*[@id="huaban"]').extract_first()
response.xpath('//*[@id="Profile"]/div[1]/a[2]/text()').extract_first()

# css selector
response.css('a').extract()
response.css('#Profile > div.profile-basic').extract_first()
response.css('a[href="test.html"]::text').extract_first()

# re selector
response.xpath('.').re(r'id:\s*(\d+)')
response.xpath('//a/text()').re_first(r'username: \s(.*)')

Note that response cannot call re or re_first directly; those methods belong to the selectors returned by xpath() or css().

scrapy crawl

Once the spider is written, you can use the scrapy crawl command to start the crawl task.

Inside a created scrapy project directory, scrapy -h shows more help than outside a project, including scrapy crawl, the command used to launch a crawl task.

$ scrapy -h
Scrapy 1.5.0 - project: huaban

Usage:
  scrapy <command> [options] [args]

Available commands:
  bench         Run quick benchmark test
  check         Check spider contracts
  crawl         Run a spider
  edit          Edit spider
  fetch         Fetch a URL using the Scrapy downloader
  genspider     Generate new spider using pre-defined templates
  list          List available spiders
  parse         Parse URL (using its spider) and print the results
  runspider     Run a self-contained spider (without creating a project)
  settings      Get settings values
  shell         Interactive scraping console
  startproject  Create new project
  version       Print Scrapy version
  view          Open URL in browser, as seen by Scrapy

Use "scrapy <command> -h" to see more info about a command
$ scrapy crawl -h
Usage
=====
  scrapy crawl [options] <spider>

Run a spider

Options
=======
--help, -h              show this help message and exit
-a NAME=VALUE           set spider argument (may be repeated)
--output=FILE, -o FILE  dump scraped items into FILE (use - for stdout)
--output-format=FORMAT, -t FORMAT
                        format to use for dumping items with -o

Global Options
--------------
--logfile=FILE          log file. if omitted stderr will be used
--loglevel=LEVEL, -L LEVEL
                        log level (default: DEBUG)
--nolog                 disable logging completely
--profile=FILE          write python cProfile stats to FILE
--pidfile=FILE          write process ID to FILE
--set=NAME=VALUE, -s NAME=VALUE
                        set/override setting (may be repeated)
--pdb                   enable pdb on failure

As the help for scrapy crawl shows, the command accepts many optional arguments but only one required argument, spider, which is the name of the spider to run and corresponds to each spider's name attribute.

scrapy crawl huaban

That concludes the walkthrough of creating and running a scrapy crawl task; concrete examples will be covered in later posts.

scrapy shell

Finally, a brief note on the scrapy shell command. It is an interactive shell, similar to a command-line Python session. When you are just starting to learn scrapy, or starting to crawl an unfamiliar site, you can use it to get familiar with the various functions and selectors, experiment, and correct mistakes until you have a solid grip on scrapy's usage.

$ scrapy shell www.huaban.com
2018-05-29 23:58:49 [scrapy.utils.log] INFO: Scrapy 1.5.0 started (bot: scrapybot)
2018-05-29 23:58:49 [scrapy.utils.log] INFO: Versions: lxml 4.2.1.0, libxml2 2.9.5, cssselect 1.0.3, parsel 1.4.0, w3lib 1.19.0, Twisted 17.9.0, Python 3.6.3 (v3.6.3:2c5fed8, Oct  3
2017, 17:26:49) [MSC v.1900 32 bit (Intel)], pyOpenSSL 17.5.0 (OpenSSL 1.1.0h  27 Mar 2018), cryptography 2.2.2, Platform Windows-10-10.0.17134-SP0
2018-05-29 23:58:49 [scrapy.crawler] INFO: Overridden settings: {'DUPEFILTER_CLASS': 'scrapy.dupefilters.BaseDupeFilter', 'LOGSTATS_INTERVAL': 0}
2018-05-29 23:58:49 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole']
2018-05-29 23:58:50 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2018-05-29 23:58:50 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2018-05-29 23:58:50 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2018-05-29 23:58:50 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2018-05-29 23:58:50 [scrapy.core.engine] INFO: Spider opened
2018-05-29 23:58:50 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET http://huaban.com/> from <GET http://www.huaban.com>
2018-05-29 23:58:50 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://huaban.com/> (referer: None)
[s] Available Scrapy objects:
[s]   scrapy     scrapy module (contains scrapy.Request, scrapy.Selector, etc)
[s]   crawler    <scrapy.crawler.Crawler object at 0x03385CB0>
[s]   item       {}
[s]   request    <GET http://www.huaban.com>
[s]   response   <200 http://huaban.com/>
[s]   settings   <scrapy.settings.Settings object at 0x04CC4D10>
[s]   spider     <DefaultSpider 'default' at 0x4fa6bf0>
[s] Useful shortcuts:
[s]   fetch(url[, redirect=True]) Fetch URL and update local objects (by default, redirects are followed)
[s]   fetch(req)                  Fetch a scrapy.Request and update local objects
[s]   shelp()           Shell help (print this help)
[s]   view(response)    View response in a browser
In [1]: view(response)
Out[1]: True

In [2]: response.xpath('//a')
Out[2]:
[<Selector xpath='//a' data='<a id="elevator" class="off" onclick="re'>,
 <Selector xpath='//a' data='<a class="plus"></a>'>,
 <Selector xpath='//a' data='<a onclick="app.showUploadDialog();">新增採'>,
 <Selector xpath='//a' data='<a class="add-board-item">新增畫板<i class="'>,
 <Selector xpath='//a' data='<a href="/about/goodies/">安裝採集工具<i class'>,
 <Selector xpath='//a' data='<a class="huaban_security_oauth" logo_si'>]

In [3]: response.xpath('//a').extract()
Out[3]:
['<a id="elevator" class="off" onclick="return false;" title="回到頂部"></a>',
 '<a class="plus"></a>',
 '<a onclick="app.showUploadDialog();">新增採集<i class="upload"></i></a>',
 '<a class="add-board-item">新增畫板<i class="add-board"></i></a>',
 '<a href="/about/goodies/">安裝採集工具<i class="goodies"></i></a>',
 '<a class="huaban_security_oauth" logo_size="124x47" logo_type="realname" href="//www.anquan.org" rel="nofollow"> <script src="//static.anquan.org/static/outer/js/aq_auth.js"></script> </a>']

In [4]: response.xpath('//img')
Out[4]: [<Selector xpath='//img' data='<img src="https://d5nxst8fruw4z.cloudfro'>]

In [5]: response.xpath('//a/text()')
Out[5]:
[<Selector xpath='//a/text()' data='新增採集'>,
 <Selector xpath='//a/text()' data='新增畫板'>,
 <Selector xpath='//a/text()' data='安裝採集工具'>,
 <Selector xpath='//a/text()' data=' '>,
 <Selector xpath='//a/text()' data=' '>]

In [6]: response.xpath('//a/text()').extract()
Out[6]: ['新增採集', '新增畫板', '安裝採集工具', ' ', ' ']

In [7]: response.xpath('//a/text()').extract_first()
Out[7]: '新增採集'
