Earlier I wrote a series of posts on Python crawlers (link: https://www.cnblogs.com/weijiutao/p/10735455.html) along with several cases that crawl websites and extract useful information, e.g. https://www.cnblogs.com/weijiutao/p/10614694.html. Starting with this post I will continue studying Python crawling in more depth, mainly based on the Scrapy library. I am recording it here; may it encourage us both!
Scrapy official site: https://docs.scrapy.org/en/latest/
Scrapy Chinese documentation: https://scrapy-chs.readthedocs.io/zh_CN/latest/index.html
The Scrapy Framework
- Scrapy is an application framework written in pure Python for crawling websites and extracting structured data; it has a very wide range of uses.
- That is the power of a framework: a user only needs to customize a few modules to easily implement a crawler that scrapes page content and all kinds of images, which is extremely convenient.
- Scrapy uses the Twisted ['twɪstɪd] asynchronous networking framework (its main rival is Tornado) to handle network communication. This speeds up our downloads without us having to implement an asynchronous framework ourselves, and it exposes all kinds of middleware interfaces so that various needs can be met flexibly.
Scrapy architecture diagram (the green lines are the data flow):
- Scrapy Engine: responsible for the communication, signals, and data transfer between the Spider, Item Pipeline, Downloader, and Scheduler.
- Scheduler: receives the Requests sent over by the Engine, arranges and queues them in a certain way, and hands them back to the Engine when the Engine needs them.
- Downloader: downloads all the Requests sent by the Scrapy Engine and returns the Responses it obtains to the Engine, which passes them to the Spider for processing.
- Spider: processes all Responses, analyzes and extracts data from them, obtains the data needed for the Item fields, and submits the URLs that need following up to the Engine, which puts them into the Scheduler again.
- Item Pipeline: the place where Items obtained from the Spider are handled and post-processed (detailed analysis, filtering, storage, and so on).
- Downloader Middlewares: a component you can think of as a customizable extension of the download functionality.
- Spider Middlewares: a functional component you can think of as a customizable extension for operating on the communication between the Engine and the Spider (e.g. the Responses going into the Spider and the Requests coming out of it).
The above is the Scrapy architecture diagram, and the flow is actually quite clear. Briefly: everything starts from the Spider in the red box, which sends tasks through the Engine to the Scheduler; the requests are then handed to the Downloader, and once processed the results are returned to the Spider; finally the results are handed to the pipeline, which processes them for us.
That may still sound a bit convoluted, but we will dissect it piece by piece in what follows, and in the end we will find that writing a crawler with the Scrapy framework is remarkably simple. A minimal sketch of the flow follows below.
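To make the flow concrete, here is a minimal spider sketch (not code generated by Scrapy; the URL and the title field are purely illustrative assumptions). The Requests it yields travel through the Engine into the Scheduler and on to the Downloader, and the items it yields are handed by the Engine to the Item Pipeline:

import scrapy


class DemoSpider(scrapy.Spider):
    name = 'demo'
    # The initial Requests for these URLs go through the Engine to the Scheduler
    start_urls = ['http://example.com']

    def parse(self, response):
        # The Downloader's Response comes back here via the Engine
        yield {'title': response.css('title::text').get()}
        # A followed URL would be sent back to the Engine and queued in the Scheduler again:
        # yield response.follow(next_page_url, callback=self.parse)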
Installing Scrapy
Windows: pip install scrapy
Mac: sudo pip install scrapy
Upgrade pip: pip install --upgrade pip
I am currently on a Mac and using Python 3; the content is much the same either way, and if you run into system or version problems, feel free to get in touch so we can learn from each other!
After installation, type scrapy in the terminal to check whether the installation succeeded:
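For example, either of the following commands serves as a quick sanity check; if Scrapy is installed correctly, the first prints the installed version number and the second lists the available subcommands:

scrapy version
scrapy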
Creating a Project
Now that Scrapy is installed, we need to use it to develop our crawler project. Enter your chosen project directory and run the following command:
scrapy startproject scrapyDemo
Running the command above generates the following directory structure inside our project directory:
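The exact listing may vary slightly between Scrapy versions, but the generated layout is roughly this:

scrapyDemo/
    scrapy.cfg
    scrapyDemo/
        __init__.py
        items.py
        middlewares.py
        pipelines.py
        settings.py
        spiders/
            __init__.py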
Here is a brief introduction to what each of the main files does:
scrapy.cfg : the project's configuration file
scrapyDemo/ : the project's Python module; the code will be imported from here
scrapyDemo/items.py : the project's items file (the target data)
scrapyDemo/middlewares.py : the project's middlewares file
scrapyDemo/pipelines.py : the project's pipelines file
scrapyDemo/settings.py : the project's settings file
scrapyDemo/spiders/ : the directory where the spider code is stored
Next we will briefly go over the contents of each file. The code in them is currently the simplest boilerplate; when we work through a case later, we will explain each file in a more targeted way.
The __init__.py files are all empty, but they must not be deleted, otherwise the project will not run.
scrapyDemo/items.py
# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html

import scrapy


class ScrapydemoItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    pass
This file is used to define the useful information we obtain with the spider, i.e. the scrapy.Item.
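For instance, if we were scraping a book list, the Item might be sketched like this (the class and field names title/author/price are assumptions made up for illustration, not generated by Scrapy):

import scrapy


class BookItem(scrapy.Item):
    # every piece of data we want to collect is declared as a scrapy.Field()
    title = scrapy.Field()
    author = scrapy.Field()
    price = scrapy.Field()

In the spider we would then fill such an item (item = BookItem(); item['title'] = ...) and yield it so that it flows on into the pipeline.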
scrapyDemo/middlewares.py
# -*- coding: utf-8 -*-

# Define here the models for your spider middleware
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/spider-middleware.html

from scrapy import signals


class ScrapydemoSpiderMiddleware(object):
    # Not all methods need to be defined. If a method is not defined,
    # scrapy acts as if the spider middleware does not modify the
    # passed objects.

    @classmethod
    def from_crawler(cls, crawler):
        # This method is used by Scrapy to create your spiders.
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    def process_spider_input(self, response, spider):
        # Called for each response that goes through the spider
        # middleware and into the spider.

        # Should return None or raise an exception.
        return None

    def process_spider_output(self, response, result, spider):
        # Called with the results returned from the Spider, after
        # it has processed the response.

        # Must return an iterable of Request, dict or Item objects.
        for i in result:
            yield i

    def process_spider_exception(self, response, exception, spider):
        # Called when a spider or process_spider_input() method
        # (from other spider middleware) raises an exception.

        # Should return either None or an iterable of Request, dict
        # or Item objects.
        pass

    def process_start_requests(self, start_requests, spider):
        # Called with the start requests of the spider, and works
        # similarly to the process_spider_output() method, except
        # that it doesn't have a response associated.

        # Must return only requests (not items).
        for r in start_requests:
            yield r

    def spider_opened(self, spider):
        spider.logger.info('Spider opened: %s' % spider.name)


class ScrapydemoDownloaderMiddleware(object):
    # Not all methods need to be defined. If a method is not defined,
    # scrapy acts as if the downloader middleware does not modify the
    # passed objects.

    @classmethod
    def from_crawler(cls, crawler):
        # This method is used by Scrapy to create your spiders.
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    def process_request(self, request, spider):
        # Called for each request that goes through the downloader
        # middleware.

        # Must either:
        # - return None: continue processing this request
        # - or return a Response object
        # - or return a Request object
        # - or raise IgnoreRequest: process_exception() methods of
        #   installed downloader middleware will be called
        return None

    def process_response(self, request, response, spider):
        # Called with the response returned from the downloader.

        # Must either;
        # - return a Response object
        # - return a Request object
        # - or raise IgnoreRequest
        return response

    def process_exception(self, request, exception, spider):
        # Called when a download handler or a process_request()
        # (from other downloader middleware) raises an exception.

        # Must either:
        # - return None: continue processing this exception
        # - return a Response object: stops process_exception() chain
        # - return a Request object: stops process_exception() chain
        pass

    def spider_opened(self, spider):
        spider.logger.info('Spider opened: %s' % spider.name)
This is the middlewares file; the s at the end of the name is the plural, indicating that this file can hold many middlewares. Any middleware we need can be defined here.
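As a rough sketch of what a custom middleware can look like (the class name and the User-Agent string below are illustrative assumptions, not generated code), a downloader middleware that attaches a User-Agent header to every outgoing request could be written like this; it would still have to be enabled in DOWNLOADER_MIDDLEWARES in settings.py before Scrapy uses it:

class FixedUserAgentMiddleware(object):
    # Hypothetical downloader middleware: set a User-Agent header on every
    # request before it reaches the Downloader.
    def process_request(self, request, spider):
        request.headers.setdefault('User-Agent', 'Mozilla/5.0 (compatible; scrapyDemo)')
        return None  # returning None lets the request continue through the chain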
scrapyDemo/pipelines.py
# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html


class ScrapydemoPipeline(object):
    def process_item(self, item, spider):
        return item
This file is commonly known as the pipeline file; it receives our Item data and applies targeted processing to it.
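As a rough sketch of what a real pipeline might do (the class name and the items.jl filename are illustrative assumptions), the following pipeline appends every item to a JSON Lines file; like any pipeline, it would need to be registered in ITEM_PIPELINES in settings.py:

import json


class JsonWriterPipeline(object):
    # Hypothetical pipeline: append each item to a JSON Lines file.
    def open_spider(self, spider):
        self.file = open('items.jl', 'w', encoding='utf-8')

    def close_spider(self, spider):
        self.file.close()

    def process_item(self, item, spider):
        self.file.write(json.dumps(dict(item), ensure_ascii=False) + '\n')
        return item  # return the item so later pipelines can keep processing it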
scrapyDemo/settings.py
# -*- coding: utf-8 -*-

# Scrapy settings for scrapyDemo project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     https://docs.scrapy.org/en/latest/topics/settings.html
#     https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#     https://docs.scrapy.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'scrapyDemo'

SPIDER_MODULES = ['scrapyDemo.spiders']
NEWSPIDER_MODULE = 'scrapyDemo.spiders'


# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'scrapyDemo (+http://www.yourdomain.com)'

# Obey robots.txt rules
ROBOTSTXT_OBEY = True

# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
#COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
#   'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
#   'Accept-Language': 'en',
#}

# Enable or disable spider middlewares
# See https://docs.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#    'scrapyDemo.middlewares.ScrapydemoSpiderMiddleware': 543,
#}

# Enable or disable downloader middlewares
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
#    'scrapyDemo.middlewares.ScrapydemoDownloaderMiddleware': 543,
#}

# Enable or disable extensions
# See https://docs.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    'scrapy.extensions.telnet.TelnetConsole': None,
#}

# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
#ITEM_PIPELINES = {
#    'scrapyDemo.pipelines.ScrapydemoPipeline': 300,
#}

# Enable and configure the AutoThrottle extension (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
This is our settings file, where the basic configuration needs to be done. For example, the two classes in our middlewares file, ScrapydemoSpiderMiddleware and ScrapydemoDownloaderMiddleware, can both be found (commented out) in settings.py.
In the settings file we often configure fields like the ones above, such as ITEM_PIPELINES (the item pipelines), DEFAULT_REQUEST_HEADERS (the default request headers), DOWNLOAD_DELAY (the download delay), and ROBOTSTXT_OBEY (whether to obey the robots.txt crawling rules).
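For reference, a minimal sketch of the kind of overrides we typically make in settings.py might look like this (the concrete values and the pipeline priority 300 are illustrative, not recommendations):

# whether to obey robots.txt
ROBOTSTXT_OBEY = True

# wait 1 second between requests to the same website
DOWNLOAD_DELAY = 1

# default headers attached to every request
DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'zh-CN,zh;q=0.9',
}

# enable our pipeline; the number controls its order in the pipeline chain
ITEM_PIPELINES = {
    'scrapyDemo.pipelines.ScrapydemoPipeline': 300,
}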
In this post we have only briefly introduced Scrapy's basic project layout; in the next post we will implement a crawler case on top of the Scrapy framework.