Typically, when you deploy a finished Scrapy spider to a server, you either run it with the nohup command or use scrapyd. With nohup, if the spider dies you may not even notice; you have to log in to the server to check, or set up extra email notifications yourself. With scrapyd, deployment is somewhat involved and the feature set is thin, but otherwise it works well enough.
SpiderKeeper is a spider-management tool, roughly comparable to Scrapinghub's deployment features: it can deploy spiders to multiple servers, run them on a schedule, show spider logs, and report run status.
Project page: https://github.com/DormyMo/SpiderKeeper
1. Runtime Environment
- CentOS 7
- Python 2.7
- Python 3.6
Note: supervisor runs on Python 2.7, while scrapyd and SpiderKeeper run on Python 3.6, which you need to compile and install yourself; a sketch follows.
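For reference, a minimal sketch of building Python 3.6 from source on CentOS 7 into the prefix used later in this post. The exact 3.6.8 version and the /usr/local/python3.6 prefix are assumptions; adjust them to your setup:

yum install -y gcc make zlib-devel openssl-devel
wget https://www.python.org/ftp/python/3.6.8/Python-3.6.8.tgz
tar -xzf Python-3.6.8.tgz && cd Python-3.6.8
./configure --prefix=/usr/local/python3.6
make && make altinstall
# altinstall leaves the system python2 untouched; expose the new binaries on PATH
ln -s /usr/local/python3.6/bin/python3.6 /usr/local/bin/python3
ln -s /usr/local/python3.6/bin/pip3.6 /usr/local/bin/pip3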
2. Install Dependencies
1) supervisor: pip install supervisor
2) scrapyd: pip3 install scrapyd
3) SpiderKeeper: pip3 install SpiderKeeper
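Because two interpreters are involved, it is worth confirming which Python each pip maps to before installing:

pip --version    # should report Python 2.7 (used for supervisor)
pip3 --version   # should report Python 3.6 (used for scrapyd and SpiderKeeper)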
3. Configure scrapyd
1) Create the scrapyd configuration file (scrapyd looks for /etc/scrapyd/scrapyd.conf, among other locations):
[scrapyd]
eggs_dir = eggs
logs_dir = logs
items_dir =
jobs_to_keep = 5
dbs_dir = dbs
max_proc = 0
max_proc_per_cpu = 4
finished_to_keep = 100
poll_interval = 5.0
bind_address = 0.0.0.0
http_port = 6800
debug = off
runner = scrapyd.runner
application = scrapyd.app.application
launcher = scrapyd.launcher.Launcher
webroot = scrapyd.website.Root
[services]
schedule.json = scrapyd.webservice.Schedule
cancel.json = scrapyd.webservice.Cancel
addversion.json = scrapyd.webservice.AddVersion
listprojects.json = scrapyd.webservice.ListProjects
listversions.json = scrapyd.webservice.ListVersions
listspiders.json = scrapyd.webservice.ListSpiders
delproject.json = scrapyd.webservice.DeleteProject
delversion.json = scrapyd.webservice.DeleteVersion
listjobs.json = scrapyd.webservice.ListJobs
daemonstatus.json = scrapyd.webservice.DaemonStatus
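Once scrapyd is running, the daemonstatus.json endpoint wired up above gives a quick health check; localhost:6800 assumes the bind_address and http_port values from this config:

curl http://localhost:6800/daemonstatus.json
# expected response shape: {"status": "ok", "running": 0, "pending": 0, "finished": 0, ...}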
4. Configure supervisor
1) Create the configuration folder and configuration file:
mkdir /etc/supervisor
echo_supervisord_conf > /etc/supervisor/supervisord.conf
2) Edit the configuration file: vim /etc/supervisor/supervisord.conf
;[include]
;files = relative/directory/*.ini
change to:
[include]
files = conf.d/*.conf
3) Create the conf.d folder: mkdir /etc/supervisor/conf.d (the relative include path above is resolved against the directory containing supervisord.conf)
4) Add the scrapyd configuration file: vim /etc/supervisor/conf.d/scrapyd.conf (point command at wherever your Python 3 scrapyd binary actually lives):
[program:scrapyd]
command=/usr/local/python3.6/bin/scrapyd
directory=/opt/SpiderKeeper
user=root
stderr_logfile=/var/log/scrapyd.err.log
stdout_logfile=/var/log/scrapyd.out.log
5) Add the SpiderKeeper configuration file: vim /etc/supervisor/conf.d/spiderkeeper.conf (if supervisor's PATH does not include the Python 3 bin directory, use the full path to the spiderkeeper binary here as well):
[program:spiderkeeper]
command=spiderkeeper --server=http://localhost:6800
directory=/opt/SpiderKeeper
user=root
stderr_logfile=/var/log/spiderkeeper.err.log
stdout_logfile=/var/log/spiderkeeper.out.log
6) Start supervisor: supervisord -c /etc/supervisor/supervisord.conf
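After starting, supervisorctl can verify that both programs came up and restart them when needed; pass the same config file so it talks to the right supervisord instance:

supervisorctl -c /etc/supervisor/supervisord.conf status
supervisorctl -c /etc/supervisor/supervisord.conf restart scrapyd
supervisorctl -c /etc/supervisor/supervisord.conf tail spiderkeeper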
5. Usage
1) Log in at http://localhost:5000 (SpiderKeeper listens on port 5000 by default; the default account is admin/admin)
2) Create a new project
3) Package the spider into an egg (run the commands below inside your Scrapy project directory, the one containing scrapy.cfg):
pip3 install scrapyd-client
scrapyd-deploy --build-egg output.egg
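scrapyd-deploy reads the project name and settings from scrapy.cfg. A minimal sketch of that file, where myproject is a hypothetical project name; the [deploy] section is only needed if you also want to push to scrapyd directly:

[settings]
default = myproject.settings

[deploy]
url = http://localhost:6800/
project = myproject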
4) Upload the packaged egg file through the SpiderKeeper web UI
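Alternatively, the addversion.json endpoint listed in the scrapyd config above lets you push the egg straight to a single scrapyd node, bypassing SpiderKeeper; the project and version values here are hypothetical:

curl http://localhost:6800/addversion.json -F project=myproject -F version=r1 -F egg=@output.egg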
SpiderKeeper can manage the scrapyd instances on multiple servers; just pass additional --server options when starting it.
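For example, assuming a second scrapyd node at 192.168.1.2 (a hypothetical address), the command line in spiderkeeper.conf would become:

command=spiderkeeper --server=http://localhost:6800 --server=http://192.168.1.2:6800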