Earlier I wrote a series of posts on Python crawlers (link: https://www.cnblogs.com/weijiutao/p/10735455.html) along with several cases that crawl websites and extract useful information, e.g. https://www.cnblogs.com/weijiutao/p/10614694.html. Starting with this post I will continue studying Python crawling in more depth, mainly based on the Scrapy library. I am recording it here; may it encourage us both!
Scrapy official site: https://docs.scrapy.org/en/latest/
Scrapy Chinese documentation: https://scrapy-chs.readthedocs.io/zh_CN/latest/index.html
The Scrapy Framework
- Scrapy is an application framework written in pure Python for crawling websites and extracting structured data; it has a very wide range of uses.
- That is the power of a framework: a user only needs to customize a few modules to easily implement a crawler that scrapes page content and all kinds of images, which is extremely convenient.
- Scrapy uses the Twisted ['twɪstɪd] asynchronous networking framework (its main rival is Tornado) to handle network communication. This speeds up our downloads without us having to implement an asynchronous framework ourselves, and it exposes all kinds of middleware interfaces so that various needs can be met flexibly.
Scrapy architecture diagram (the green lines are the data flow):
- Scrapy Engine: responsible for the communication, signals, and data transfer between the Spider, Item Pipeline, Downloader, and Scheduler.
- Scheduler: receives the Requests sent over by the Engine, arranges and queues them in a certain way, and hands them back to the Engine when the Engine needs them.
- Downloader: downloads all the Requests sent by the Scrapy Engine and returns the Responses it obtains to the Engine, which passes them to the Spider for processing.
- Spider: processes all Responses, analyzes and extracts data from them, obtains the data needed for the Item fields, and submits the URLs that need following up to the Engine, which puts them into the Scheduler again.
- Item Pipeline: the place where Items obtained from the Spider are handled and post-processed (detailed analysis, filtering, storage, and so on).
- Downloader Middlewares: a component you can think of as a customizable extension of the download functionality.
- Spider Middlewares: a functional component you can think of as a customizable extension for operating on the communication between the Engine and the Spider (e.g. the Responses going into the Spider and the Requests coming out of it).
The above is the Scrapy architecture diagram, and the flow is actually quite clear. Briefly: everything starts from the Spider in the red box, which sends tasks through the Engine to the Scheduler; the requests are then handed to the Downloader, and once processed the results are returned to the Spider; finally the results are handed to the pipeline, which processes them for us.
That may still sound a bit convoluted, but we will dissect it piece by piece in what follows, and in the end we will find that writing a crawler with the Scrapy framework is remarkably simple. A minimal sketch of the flow follows below.
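To make the flow concrete, here is a minimal spider sketch (not code generated by Scrapy; the URL and the title field are purely illustrative assumptions). The Requests it yields travel through the Engine into the Scheduler and on to the Downloader, and the items it yields are handed by the Engine to the Item Pipeline:

import scrapy


class DemoSpider(scrapy.Spider):
    name = 'demo'
    # The initial Requests for these URLs go through the Engine to the Scheduler
    start_urls = ['http://example.com']

    def parse(self, response):
        # The Downloader's Response comes back here via the Engine
        yield {'title': response.css('title::text').get()}
        # A followed URL would be sent back to the Engine and queued in the Scheduler again:
        # yield response.follow(next_page_url, callback=self.parse)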
Installing Scrapy
Windows: pip install scrapy
Mac: sudo pip install scrapy
Upgrade pip: pip install --upgrade pip
I am currently on a Mac and using Python 3; the content is much the same either way, and if you run into system or version problems, feel free to get in touch so we can learn from each other!
After installation, type scrapy in the terminal to check whether the installation succeeded:
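For example, either of the following commands serves as a quick sanity check; if Scrapy is installed correctly, the first prints the installed version number and the second lists the available subcommands:

scrapy version
scrapy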
Creating a Project
Now that Scrapy is installed, we need to use it to develop our crawler project. Enter your chosen project directory and run the following command:
scrapy startproject scrapyDemo
Running the command above generates the following directory structure inside our project directory:
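The exact listing may vary slightly between Scrapy versions, but the generated layout is roughly this:

scrapyDemo/
    scrapy.cfg
    scrapyDemo/
        __init__.py
        items.py
        middlewares.py
        pipelines.py
        settings.py
        spiders/
            __init__.py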
Here is a brief introduction to what each of the main files does:
scrapy.cfg : the project's configuration file
scrapyDemo/ : the project's Python module; the code will be imported from here
scrapyDemo/items.py : the project's items file (the target data)
scrapyDemo/middlewares.py : the project's middlewares file
scrapyDemo/pipelines.py : the project's pipelines file
scrapyDemo/settings.py : the project's settings file
scrapyDemo/spiders/ : the directory where the spider code is stored
Next we will briefly go over the contents of each file. The code in them is currently the simplest boilerplate; when we work through a case later, we will explain each file in a more targeted way.
The __init__.py files are all empty, but they must not be deleted, otherwise the project will not run.
scrapyDemo/items.py
# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html

import scrapy


class ScrapydemoItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    pass
This file is used to define the useful information we obtain with the spider, i.e. the scrapy.Item.
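For instance, if we were scraping a book list, the Item might be sketched like this (the class and field names title/author/price are assumptions made up for illustration, not generated by Scrapy):

import scrapy


class BookItem(scrapy.Item):
    # every piece of data we want to collect is declared as a scrapy.Field()
    title = scrapy.Field()
    author = scrapy.Field()
    price = scrapy.Field()

In the spider we would then fill such an item (item = BookItem(); item['title'] = ...) and yield it so that it flows on into the pipeline.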
scrapyDemo/middlewares.py
# -*- coding: utf-8 -*-

# Define here the models for your spider middleware
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/spider-middleware.html

from scrapy import signals


class ScrapydemoSpiderMiddleware(object):
    # Not all methods need to be defined. If a method is not defined,
    # scrapy acts as if the spider middleware does not modify the
    # passed objects.

    @classmethod
    def from_crawler(cls, crawler):
        # This method is used by Scrapy to create your spiders.
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    def process_spider_input(self, response, spider):
        # Called for each response that goes through the spider
        # middleware and into the spider.

        # Should return None or raise an exception.
        return None

    def process_spider_output(self, response, result, spider):
        # Called with the results returned from the Spider, after
        # it has processed the response.

        # Must return an iterable of Request, dict or Item objects.
        for i in result:
            yield i

    def process_spider_exception(self, response, exception, spider):
        # Called when a spider or process_spider_input() method
        # (from other spider middleware) raises an exception.

        # Should return either None or an iterable of Request, dict
        # or Item objects.
        pass

    def process_start_requests(self, start_requests, spider):
        # Called with the start requests of the spider, and works
        # similarly to the process_spider_output() method, except
        # that it doesn't have a response associated.

        # Must return only requests (not items).
        for r in start_requests:
            yield r

    def spider_opened(self, spider):
        spider.logger.info('Spider opened: %s' % spider.name)


class ScrapydemoDownloaderMiddleware(object):
    # Not all methods need to be defined. If a method is not defined,
    # scrapy acts as if the downloader middleware does not modify the
    # passed objects.

    @classmethod
    def from_crawler(cls, crawler):
        # This method is used by Scrapy to create your spiders.
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    def process_request(self, request, spider):
        # Called for each request that goes through the downloader
        # middleware.

        # Must either:
        # - return None: continue processing this request
        # - or return a Response object
        # - or return a Request object
        # - or raise IgnoreRequest: process_exception() methods of
        #   installed downloader middleware will be called
        return None

    def process_response(self, request, response, spider):
        # Called with the response returned from the downloader.

        # Must either;
        # - return a Response object
        # - return a Request object
        # - or raise IgnoreRequest
        return response

    def process_exception(self, request, exception, spider):
        # Called when a download handler or a process_request()
        # (from other downloader middleware) raises an exception.

        # Must either:
        # - return None: continue processing this exception
        # - return a Response object: stops process_exception() chain
        # - return a Request object: stops process_exception() chain
        pass

    def spider_opened(self, spider):
        spider.logger.info('Spider opened: %s' % spider.name)
This is the middlewares file; the s at the end of the name is the plural, indicating that this file can hold many middlewares. Any middleware we need can be defined here.
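As a rough sketch of what a custom middleware can look like (the class name and the User-Agent string below are illustrative assumptions, not generated code), a downloader middleware that attaches a User-Agent header to every outgoing request could be written like this; it would still have to be enabled in DOWNLOADER_MIDDLEWARES in settings.py before Scrapy uses it:

class FixedUserAgentMiddleware(object):
    # Hypothetical downloader middleware: set a User-Agent header on every
    # request before it reaches the Downloader.
    def process_request(self, request, spider):
        request.headers.setdefault('User-Agent', 'Mozilla/5.0 (compatible; scrapyDemo)')
        return None  # returning None lets the request continue through the chain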
scrapyDemo/pipelines.py
# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html


class ScrapydemoPipeline(object):
    def process_item(self, item, spider):
        return item
This file is commonly known as the pipeline file; it receives our Item data and applies targeted processing to it.
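As a rough sketch of what a real pipeline might do (the class name and the items.jl filename are illustrative assumptions), the following pipeline appends every item to a JSON Lines file; like any pipeline, it would need to be registered in ITEM_PIPELINES in settings.py:

import json


class JsonWriterPipeline(object):
    # Hypothetical pipeline: append each item to a JSON Lines file.
    def open_spider(self, spider):
        self.file = open('items.jl', 'w', encoding='utf-8')

    def close_spider(self, spider):
        self.file.close()

    def process_item(self, item, spider):
        self.file.write(json.dumps(dict(item), ensure_ascii=False) + '\n')
        return item  # return the item so later pipelines can keep processing it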
scrapyDemo/settings.py
# -*- coding: utf-8 -*-

# Scrapy settings for scrapyDemo project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     https://docs.scrapy.org/en/latest/topics/settings.html
#     https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#     https://docs.scrapy.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'scrapyDemo'

SPIDER_MODULES = ['scrapyDemo.spiders']
NEWSPIDER_MODULE = 'scrapyDemo.spiders'


# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'scrapyDemo (+http://www.yourdomain.com)'

# Obey robots.txt rules
ROBOTSTXT_OBEY = True

# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
#COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
#   'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
#   'Accept-Language': 'en',
#}

# Enable or disable spider middlewares
# See https://docs.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#    'scrapyDemo.middlewares.ScrapydemoSpiderMiddleware': 543,
#}

# Enable or disable downloader middlewares
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
#    'scrapyDemo.middlewares.ScrapydemoDownloaderMiddleware': 543,
#}

# Enable or disable extensions
# See https://docs.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    'scrapy.extensions.telnet.TelnetConsole': None,
#}

# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
#ITEM_PIPELINES = {
#    'scrapyDemo.pipelines.ScrapydemoPipeline': 300,
#}

# Enable and configure the AutoThrottle extension (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
This is our settings file, where the basic configuration needs to be done. For example, the two classes in our middlewares file, ScrapydemoSpiderMiddleware and ScrapydemoDownloaderMiddleware, can both be found (commented out) in settings.py.
In the settings file we often configure fields like the ones above, such as ITEM_PIPELINES (the item pipelines), DEFAULT_REQUEST_HEADERS (the default request headers), DOWNLOAD_DELAY (the download delay), and ROBOTSTXT_OBEY (whether to obey the robots.txt crawling rules).
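For reference, a minimal sketch of the kind of overrides we typically make in settings.py might look like this (the concrete values and the pipeline priority 300 are illustrative, not recommendations):

# whether to obey robots.txt
ROBOTSTXT_OBEY = True

# wait 1 second between requests to the same website
DOWNLOAD_DELAY = 1

# default headers attached to every request
DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'zh-CN,zh;q=0.9',
}

# enable our pipeline; the number controls its order in the pipeline chain
ITEM_PIPELINES = {
    'scrapyDemo.pipelines.ScrapydemoPipeline': 300,
}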
In this post we have only briefly introduced Scrapy's basic project layout; in the next post we will implement a crawler case on top of the Scrapy framework.