[Animated Demo] A Cheerful Guide to Packaging and Deploying Scrapy Projects and Spiders to a Server

Posted by AsyncIns on 2018-10-29

Original URL: https://juejin.im/post/5bd6c44df265da0af213b667

Tags: crawler, server

Packaging and Deploying Spiders with Scrapyd-client

Once your spider code is written, you can start the spider by running its launch script directly, or you can deploy it to Scrapyd and start it through Scrapyd's API.
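To make the second option concrete, here is a minimal sketch of starting a spider through Scrapyd's schedule.json endpoint. It assumes a Scrapyd service at http://localhost:6800/ and an already deployed project named arts; the spider name example_spider is hypothetical, so substitute your own.

```python
from urllib.parse import urlencode
from urllib.request import Request, urlopen


def build_schedule_request(base_url: str, project: str, spider: str) -> Request:
    """Build (but do not send) a POST request for Scrapyd's schedule.json."""
    endpoint = base_url.rstrip("/") + "/schedule.json"
    body = urlencode({"project": project, "spider": spider}).encode("utf-8")
    return Request(endpoint, data=body, method="POST")


def schedule_spider(base_url: str, project: str, spider: str) -> str:
    """Send the request; this needs a running Scrapyd service."""
    request = build_schedule_request(base_url, project, spider)
    with urlopen(request, timeout=10) as response:
        return response.read().decode("utf-8")


# "example_spider" is a hypothetical spider name used only for illustration.
request = build_schedule_request("http://localhost:6800/", "arts", "example_spider")
print(request.get_method(), request.full_url)
```

Calling schedule_spider with the same arguments would actually send the request, which only works once the project has been deployed as described in the rest of this post.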

So how do you package a spider project and deploy it to Scrapyd?

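I will walk through two concrete deployments (one to a local machine, one to a cloud server) to get familiar with packaging a Scrapy spider project, installing and using Scrapyd-client, and the deployment process itself.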

Packaging the Spider Project

The overall Scrapyd packaging and deployment workflow is:

Before Packaging

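After you finish writing spider code with the Scrapy framework, you need to package the project before it can be deployed to Scrapyd. The official documentation describes project packaging as follows: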

Deploying your project involves eggifying it and uploading the egg to Scrapyd via the addversion.json endpoint. You can do this manually, but the easiest way is to use the scrapyd-deploy tool provided by scrapyd-client which will do it all for you.

Scrapy projects are therefore packaged with the Scrapyd-client tool.

Scrapyd-client

It is a client tool dedicated to packaging Scrapy projects, and it is also developed by the Scrapy team. Scrapyd-client packages the project into an .egg file.

Installing Scrapyd-client

Like Scrapyd, it can be installed with pip:

pip install scrapyd-client

Project Configuration Before Packaging

Before packaging, the Scrapy project needs some configuration. In the project's root directory, find the .cfg file (usually scrapy.cfg) and open it in an editor:

# Automatically created by: scrapy startproject
#
# For more information about the [deploy] section see:
# https://scrapyd.readthedocs.io/en/latest/deploy.html

[settings]
default = arts.settings

[deploy]
#url = http://localhost:6800/
project = arts


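The configuration file has a settings section and a deploy section. The settings section specifies the settings module the project uses, while the deploy section holds the options for packaging and deploying the project: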

url - the target address to deploy to
project - the project to package
deploy - the section name, which can carry an alias written as [deploy:alias]

This section uses the project arts and a local Scrapyd service at localhost:6800, so the demonstration is based on those.

Notice that the url line in the .cfg file is commented out by default. Remove the comment marker and give the deploy section the alias locals:

[settings]
default = arts.settings

[deploy:locals]
url = http://localhost:6800/
project = arts
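If you prefer to generate this file rather than edit it by hand, the same content can be produced with Python's standard configparser module. This is only a sketch: the helper name render_scrapy_cfg is made up for illustration, and the values simply mirror the configuration above.

```python
import configparser
import io


def render_scrapy_cfg(project: str, target: str, url: str) -> str:
    """Return scrapy.cfg text with one named deploy target."""
    config = configparser.ConfigParser()
    config["settings"] = {"default": f"{project}.settings"}
    # The section name carries the alias, e.g. [deploy:locals].
    config[f"deploy:{target}"] = {"url": url, "project": project}
    buffer = io.StringIO()
    config.write(buffer)
    return buffer.getvalue()


print(render_scrapy_cfg("arts", "locals", "http://localhost:6800/"))
```

Writing the returned text to scrapy.cfg in the project root would give the same two sections shown above, without the generated comment header.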

Packaging and Deploying

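Then, in the root directory of the arts project (the directory that contains the .cfg file), run the following command (the Scrapyd service must be running normally at this point):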

scrapyd-deploy locals -p arts

This packages the project and deploys it to the specified target service. The Scrapyd service returns the result of the request in JSON format:

node-name:arts$ scrapyd-deploy locals -p arts
Packing version 1538645094
Deploying to project "arts" in http://localhost:6800/addversion.json
Server response (200):
{"node_name": "node-name", "status": "ok", "project": "arts", "version": "1538645094", "spiders": 1}
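The JSON line in the server response above can also be checked programmatically, which is handy in a deployment script. The sketch below parses the exact response shown above; the helper name summarize_deploy is an illustrative choice, not part of Scrapyd or Scrapyd-client.

```python
import json

# Response line copied from the scrapyd-deploy output above.
RAW_RESPONSE = (
    '{"node_name": "node-name", "status": "ok", "project": "arts", '
    '"version": "1538645094", "spiders": 1}'
)


def summarize_deploy(raw: str) -> str:
    """Parse a deploy response and fail loudly if the status is not ok."""
    result = json.loads(raw)
    if result.get("status") != "ok":
        raise RuntimeError(f"deployment failed: {result}")
    return f'{result["project"]} version {result["version"]}: {result["spiders"]} spider(s)'


print(summarize_deploy(RAW_RESPONSE))
```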

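The output includes the version number of this package, the target service address, the node name, the project status, the project name, and the number of spiders it contains. The project name arts also appears in the web interface, as the screenshot in the original post shows.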

Questions to Think About

In the deploy section of scrapy.cfg, is a name for the deploy target required? What happens if it is not set? Can there be more than one deploy section?

We can answer these questions with a hands-on experiment.

If the Deploy Section Has No Name

As the demo shows, if the deploy section has no name, no name is needed on the command line either, and the project is still packaged successfully.
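The two command forms seen so far, with and without a target name, can be sketched as a small helper that builds the scrapyd-deploy argument list. The function name deploy_command is hypothetical; the commands themselves follow the usage shown in this post.

```python
import shlex
from typing import List, Optional


def deploy_command(project: str, target: Optional[str] = None) -> List[str]:
    """Build the scrapyd-deploy argument list, with an optional target name."""
    command = ["scrapyd-deploy"]
    if target is not None:
        # A named section such as [deploy:locals] is selected by its alias.
        command.append(target)
    command += ["-p", project]
    return command


print(shlex.join(deploy_command("arts", "locals")))
print(shlex.join(deploy_command("arts")))
```

The returned list could be passed to subprocess.run from the project root, provided Scrapyd-client is installed and the Scrapyd service is running.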

If There Are Multiple Deploy Sections

I started Scrapyd on the server at 192.168.0.61 and set up two deploy sections in scrapy.cfg: one has no name and its url points to the local Scrapyd, while the other is named servers and its url points to the Scrapyd on the server. The cfg file reads:

[settings]
default = arts.settings

[deploy]
url = http://localhost:6800/
project = arts

[deploy:servers]
url = http://192.168.0.61:6800/
project = arts
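To see at a glance which targets a file like this defines, it can be read with Python's standard configparser module. This is a minimal sketch: the helper name list_deploy_targets is made up, and labelling the unnamed section "default" is an assumption made here for readability.

```python
import configparser

# Same content as the scrapy.cfg shown above.
CFG_TEXT = """
[settings]
default = arts.settings

[deploy]
url = http://localhost:6800/
project = arts

[deploy:servers]
url = http://192.168.0.61:6800/
project = arts
"""


def list_deploy_targets(cfg_text: str) -> dict:
    """Map each deploy target name to its url."""
    parser = configparser.ConfigParser()
    parser.read_string(cfg_text)
    targets = {}
    for section in parser.sections():
        if section == "deploy":
            # Unnamed section; the "default" label is an assumption of this sketch.
            targets["default"] = parser[section]["url"]
        elif section.startswith("deploy:"):
            targets[section.split(":", 1)[1]] = parser[section]["url"]
    return targets


print(list_deploy_targets(CFG_TEXT))
```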