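Python Crawler Series (3): Scrapy Basic Concepts

Posted by Yang_Farley on 2018-09-26

Original article: https://blog.csdn.net/farley119/article/details/82848821

Default structure of a Scrapy project

Before digging deeper into crawlers, get Scrapy's basic concepts straight first. Let's start with the basic directory layout of a Scrapy project: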

scrapy.cfg
myproject/
    __init__.py
    items.py
    middlewares.py
    pipelines.py
    settings.py
    spiders/
        __init__.py
        spider1.py
        spider2.py
        ...

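The directory that contains the scrapy.cfg file is called the project root directory. That file holds the name of the Python module that defines the project settings. For example: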

[settings]
default = myproject.settings
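
To make the role of that file concrete, here is a minimal sketch that reads the example scrapy.cfg with Python's standard configparser module and pulls out the settings module name. The helper name read_settings_module is made up for this illustration and is not part of Scrapy.

```python
import configparser

# Contents of the example scrapy.cfg shown above.
SCRAPY_CFG = """\
[settings]
default = myproject.settings
"""


def read_settings_module(cfg_text: str) -> str:
    """Return the dotted module path stored under [settings] -> default."""
    parser = configparser.ConfigParser()
    parser.read_string(cfg_text)
    return parser["settings"]["default"]


if __name__ == "__main__":
    print(read_settings_module(SCRAPY_CFG))
```

The returned string is an ordinary dotted module path, so it could be imported with importlib.import_module once the project root is on sys.path.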

The scrapy command

Running the scrapy tool with no arguments prints some usage help and the available commands:

Scrapy X.Y - no active project

Usage:
  scrapy <command> [options] [args]

Available commands:
  crawl         Run a spider
  fetch         Fetch a URL using the Scrapy downloader
[...]

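If you are inside a Scrapy project, the first line prints the currently active project. The output above came from running the tool outside of any project. Run from inside a project, it prints something like this: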

Scrapy X.Y - project: myproject

Usage:
  scrapy <command> [options] [args]

[...]
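
One way to picture the "inside a project" check is to walk upward from the current directory until a scrapy.cfg turns up. The sketch below does exactly that with pathlib. It assumes that a directory counts as being inside a project when it or one of its ancestors holds scrapy.cfg; the function name find_project_root is hypothetical, and Scrapy's own lookup may differ in detail.

```python
from pathlib import Path
from typing import Optional


def find_project_root(start: Path) -> Optional[Path]:
    """Walk upward from `start` and return the first directory holding scrapy.cfg."""
    start = start.resolve()
    for directory in (start, *start.parents):
        if (directory / "scrapy.cfg").is_file():
            return directory
    return None  # no scrapy.cfg found: we are outside any project


if __name__ == "__main__":
    root = find_project_root(Path.cwd())
    print(f"project root: {root}" if root else "no active project")
```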

Creating a project

The first thing you normally do with scrapy is create a project:

scrapy startproject myproject [project_dir]

This creates a Scrapy project under the project_dir directory. If project_dir is not specified, it defaults to the project name (myproject here).
Next, go into the new project directory:

cd project_dir

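Managing a project

You use the scrapy tool from inside your project to control and manage it. For example, to create a new spider: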

scrapy genspider mydomain mydomain.com

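Some Scrapy commands (such as crawl) must be run from inside a Scrapy project.
Also keep in mind that some commands may behave slightly differently when run from inside a project. For example, the fetch command will use spider-overridden behaviour (such as the user_agent attribute that overrides the user agent) if the URL being fetched is associated with a specific spider. This is intentional, because the fetch command is meant for checking how spiders download pages.

Getting more information

You can also get more information about each command, from any directory, by running: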

scrapy <command> -h

Or, to list all available commands:

scrapy -h

Global commands (these also work without an active project):

startproject
genspider
settings
runspider
shell
fetch
view
version

Project-only commands:

crawl
check
list
edit
parse
bench

startproject

Syntax: scrapy startproject <project_name> [project_dir]

Creates a new Scrapy project named project_name under the project_dir directory. If project_dir is not specified, it defaults to project_name.
Example:

$ scrapy startproject myproject

genspider

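Syntax: scrapy genspider [-t template] <name> <domain>

Creates a new spider in the current folder or, if called from inside a project, in the current project's spiders folder. The <name> parameter becomes the spider's name, while <domain> is used to generate the spider's allowed_domains and start_urls attributes.
Example: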

$ scrapy genspider -l
Available templates:
  basic
  crawl
  csvfeed
  xmlfeed

$ scrapy genspider example example.com
Created spider 'example' using template 'basic'

$ scrapy genspider -t crawl scrapyorg scrapy.org
Created spider 'scrapyorg' using template 'crawl'

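This is just a convenient shortcut command for creating spiders from predefined templates, and it is certainly not the only way to create a spider. You can create the spider source files yourself instead of using this command.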
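
To show how <name> and <domain> end up in the spider source, the sketch below fills a small string template. The template text is a hypothetical stand-in for the basic template, written from memory of what genspider typically produces; the real template ships with Scrapy and may differ between versions. The helper render_spider is likewise made up for this illustration.

```python
from string import Template

# Hypothetical stand-in for the "basic" template; the real one ships with
# Scrapy and may differ between versions.
BASIC_TEMPLATE = Template('''\
import scrapy


class ${classname}(scrapy.Spider):
    name = "${name}"
    allowed_domains = ["${domain}"]
    start_urls = ["https://${domain}"]

    def parse(self, response):
        pass
''')


def render_spider(name: str, domain: str) -> str:
    """Fill the template from the <name> and <domain> arguments."""
    # Build a class name such as "ExampleSpider" from the spider name.
    classname = "".join(part.capitalize() for part in name.split("_")) + "Spider"
    return BASIC_TEMPLATE.substitute(classname=classname, name=name, domain=domain)


if __name__ == "__main__":
    print(render_spider("example", "example.com"))
```

Writing the rendered text to spiders/example.py by hand would have roughly the same effect as running the command, which is the point of the paragraph above.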

crawl

Syntax: scrapy crawl <spider>

Starts crawling using a spider.
Example:


$ scrapy crawl myspider
[ ... myspider starts crawling ... ]

check

Syntax: scrapy check [-l] <spider>

Runs contract checks.
Example:

$ scrapy check -l
first_spider
  * parse
  * parse_item
second_spider
  * parse
  * parse_item

$ scrapy check
[FAILED] first_spider:parse_item
>>> 'RetailPricex' field is missing

[FAILED] first_spider:parse
>>> Returned 92 requests, expected 0..4
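
For context, contracts are annotations written in a callback's docstring, such as @url, @returns and @scrapes. The sketch below shows a hypothetical callback carrying contracts that would match the kind of failures reported above, plus a tiny helper that lists them. The helper extract_contracts is an illustration only and is not how Scrapy itself evaluates contracts.

```python
import re
from typing import List, Tuple


def parse_item(response):
    """Hypothetical callback annotated with spider contracts.

    @url http://www.example.com/some/page.html
    @returns requests 0 4
    @scrapes RetailPrice
    """


# A contract line looks like "@name arg1 arg2 ...".
CONTRACT_LINE = re.compile(r"^\s*@(\w+)\s*(.*)$")


def extract_contracts(callback) -> List[Tuple[str, List[str]]]:
    """Collect (contract name, arguments) pairs from a callback's docstring."""
    contracts = []
    for line in (callback.__doc__ or "").splitlines():
        match = CONTRACT_LINE.match(line)
        if match:
            contracts.append((match.group(1), match.group(2).split()))
    return contracts


if __name__ == "__main__":
    for name, args in extract_contracts(parse_item):
        print(name, args)
```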

list

Syntax: scrapy list

Lists all available spiders in the current project. The output is one spider per line.
Example:

$ scrapy list
spider1
spider2

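edit

Syntax: scrapy edit <spider>

Edits the given spider using the editor defined in the EDITOR environment variable or, if that is unset, the EDITOR setting.

This command is provided only as a convenient shortcut for the most common case; developers are of course free to choose any tool or IDE to write and debug spiders.
Example: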

$ scrapy edit spider1

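fetch

Syntax: scrapy fetch <url>

Downloads the given URL using the Scrapy downloader and writes the contents to standard output.

The interesting thing about this command is that it fetches the page the way the spider would download it. For example, if the spider has a user_agent attribute that overrides the user agent, fetch will use it.

So this command can be used to "see" how your spider would fetch a certain page.

If used outside a project, no per-spider behaviour is applied and it just uses the default Scrapy downloader settings.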
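
The override rule can be pictured as a simple attribute lookup with a fallback. The sketch below assumes the override lives in a user_agent attribute on the spider; the two spider classes and the default value are placeholders invented for this example, and nothing here imports Scrapy.

```python
DEFAULT_USER_AGENT = "default-agent/1.0"  # placeholder for the project-wide value


class PlainSpider:
    """Hypothetical spider that does not override the user agent."""
    name = "plain"


class CustomAgentSpider:
    """Hypothetical spider that overrides the user agent."""
    name = "custom"
    user_agent = "my-crawler/0.1"


def effective_user_agent(spider=None) -> str:
    """Use the spider's user_agent attribute if present, else the default."""
    return getattr(spider, "user_agent", DEFAULT_USER_AGENT)


if __name__ == "__main__":
    print(effective_user_agent(CustomAgentSpider()))  # spider-specific value
    print(effective_user_agent(PlainSpider()))        # falls back to the default
    print(effective_user_agent())                     # no spider: default as well
```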

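Supported options:

--spider=SPIDER: bypass spider autodetection and force use of a specific spider
--headers: print the response's HTTP headers instead of the response body
--no-redirect: do not follow HTTP 3xx redirects (the default is to follow them)
Example: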

$ scrapy fetch --nolog http://www.example.com/some/page.html
[ ... html content here ... ]

$ scrapy fetch --nolog --headers http://www.example.com/
{'Accept-Ranges': ['bytes'],
 'Age': ['1263   '],
 'Connection': ['close     '],
 'Content-Length': ['596'],
 'Content-Type': ['text/html; charset=UTF-8'],
 'Date': ['Wed, 18 Aug 2010 23:59:46 GMT'],
 'Etag': ['"573c1-254-48c9c87349680"'],
 'Last-Modified': ['Fri, 30 Jul 2010 15:30:18 GMT'],
 'Server': ['Apache/2.2.3 (CentOS)']}

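view

Syntax: scrapy view <url>

Opens the given URL in a browser, as your Scrapy spider would "see" it. Spiders sometimes see pages differently from regular users, so this can be used to check what the spider "sees" and confirm it is what you expect.

Supported options:

--spider=SPIDER: bypass spider autodetection and force use of a specific spider
--no-redirect: do not follow HTTP 3xx redirects (the default is to follow them)
Example: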

$ scrapy view http://www.example.com/some/page.html
[ ... browser starts ... ]

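shell

Syntax: scrapy shell [url]

Starts the Scrapy shell for the given URL (if given), or empty if no URL is given. It also supports UNIX-style local file paths, either relative with a ./ or ../ prefix or absolute. See the Scrapy shell documentation for details.

Supported options:

--spider=SPIDER: bypass spider autodetection and force use of a specific spider
-c code: evaluate the code in the shell, print the result and exit
--no-redirect: do not follow HTTP 3xx redirects (the default is to follow them); this only affects the URL you pass as an argument on the command line; once inside the shell, fetch(url) still follows HTTP redirects by default.
Example: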

$ scrapy shell http://www.example.com/some/page.html
[ ... scrapy shell starts ... ]

$ scrapy shell --nolog http://www.example.com/ -c '(response.status, response.url)'
(200, 'http://www.example.com/')

# shell follows HTTP redirects by default
$ scrapy shell --nolog http://httpbin.org/redirect-to?url=http%3A%2F%2Fexample.com%2F -c '(response.status, response.url)'
(200, 'http://example.com/')

# you can disable this with --no-redirect
# (only for the URL passed as command line argument)
$ scrapy shell --no-redirect --nolog http://httpbin.org/redirect-to?url=http%3A%2F%2Fexample.com%2F -c '(response.status, response.url)'
(302, 'http://httpbin.org/redirect-to?url=http%3A%2F%2Fexample.com%2F')
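
The redirect URL in these examples carries its target as a percent-encoded query parameter. As a small aside, the sketch below rebuilds that URL with urllib.parse.urlencode from the standard library; the function name build_redirect_url is made up for this illustration.

```python
from urllib.parse import urlencode


def build_redirect_url(target: str) -> str:
    """Build the httpbin redirect-to URL with `target` percent-encoded."""
    query = urlencode({"url": target})
    return f"http://httpbin.org/redirect-to?{query}"


if __name__ == "__main__":
    # prints http://httpbin.org/redirect-to?url=http%3A%2F%2Fexample.com%2F
    print(build_redirect_url("http://example.com/"))
```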

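parse

Syntax: scrapy parse <url> [options]

Fetches the given URL and parses it with the spider that handles it, using the method passed with the --callback option, or parse if none is given.

Supported options:

--spider=SPIDER: bypass spider autodetection and force use of a specific spider
-a NAME=VALUE: set a spider argument (may be repeated)
--callback or -c: spider method to use as the callback for parsing the response
--meta or -m: additional request meta that will be passed to the callback request. This must be a valid JSON string. Example: --meta='{"foo": "bar"}'
--pipelines: process items through pipelines
--rules or -r: use CrawlSpider rules to discover the callback (i.e. spider method) to use for parsing the response
--noitems: do not show scraped items
--nolinks: do not show extracted links
--nocolour: avoid using pygments to colorize the output
--depth or -d: depth level to which requests should be followed recursively (default: 1)
--verbose or -v: display information for each depth level
Example: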

$ scrapy parse http://www.example.com/ -c parse_item
[ ... scrapy log lines crawling example.com spider ... ]

>>> STATUS DEPTH LEVEL 1 <<<
# Scraped Items  ------------------------------------------------------------
[{'name': u'Example item',
 'category': u'Furniture',
 'length': u'12 cm'}]

# Requests  -----------------------------------------------------------------
[]
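
Because --meta has to be a valid JSON string, it can be easier to generate the argument than to type it by hand. The sketch below builds a scrapy parse command line with json.dumps and shlex.join (Python 3.8+). The function name build_parse_command is hypothetical, and the sketch only assembles the command text; it does not run Scrapy.

```python
import json
import shlex


def build_parse_command(url: str, callback: str, meta: dict) -> str:
    """Assemble a `scrapy parse` command line with a JSON-encoded --meta value."""
    args = [
        "scrapy", "parse", url,
        "-c", callback,
        "--meta", json.dumps(meta),
    ]
    # shlex.join quotes the JSON so the shell passes it as a single argument.
    return shlex.join(args)


if __name__ == "__main__":
    print(build_parse_command("http://www.example.com/", "parse_item", {"foo": "bar"}))
```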

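settings

Syntax: scrapy settings [options]

Gets the value of a Scrapy setting.

If used inside a project it shows the project's value for the setting; otherwise it shows Scrapy's default value for that setting.
Example: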

$ scrapy settings --get BOT_NAME
scrapybot
$ scrapy settings --get DOWNLOAD_DELAY
0
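
The "project value if inside a project, otherwise the default" rule can be sketched as a layered lookup. The example below uses collections.ChainMap for that. The default values are taken from the output above, the project values are invented for the illustration, and Scrapy's real settings object is more elaborate than this.

```python
from collections import ChainMap

# Default values taken from the example output above.
DEFAULTS = {"BOT_NAME": "scrapybot", "DOWNLOAD_DELAY": 0}

# Hypothetical values a project's settings.py might define.
PROJECT_SETTINGS = {"BOT_NAME": "myproject", "DOWNLOAD_DELAY": 2}


def get_setting(name: str, inside_project: bool):
    """Look up a setting, letting project values shadow the defaults."""
    layers = [PROJECT_SETTINGS, DEFAULTS] if inside_project else [DEFAULTS]
    return ChainMap(*layers)[name]


if __name__ == "__main__":
    print(get_setting("BOT_NAME", inside_project=False))  # default value
    print(get_setting("BOT_NAME", inside_project=True))   # project value
```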

runspider

Syntax: scrapy runspider <spider_file.py>

Runs a spider that is self-contained in a Python file, without having to create a project.
Example:

$ scrapy runspider myspider.py
[ ... spider starts crawling ... ]

