How to Build a Distributed Crawler: The Basics

Published 2017-06-08

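In the previous post we covered the basics of Celery. This post continues from there and walks step by step through building a distributed crawler with Celery. The target this time is the official Celery documentation.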

First, create a directory named distributedspider, then create a file workers.py inside it with the following content:

from celery import Celery

app = Celery('crawl_task', include=['tasks'], broker='redis://223.129.0.190:6379/1', backend='redis://223.129.0.190:6379/2')

# The official docs recommend JSON as the message serialization format
app.conf.update(
    CELERY_TIMEZONE='Asia/Shanghai',
    CELERY_ENABLE_UTC=True,
    CELERY_ACCEPT_CONTENT=['json'],
    CELERY_TASK_SERIALIZER='json',
    CELERY_RESULT_SERIALIZER='json',
)

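The code above initializes the Celery instance. The include argument lists the modules to import when the Celery app starts, which in practice means the files containing the functions registered as remotely callable tasks. Next we write the task function. Create a file tasks.py with the following content: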

import requests
from bs4 import BeautifulSoup

from workers import app


@app.task
def crawl(url):
    print('Crawling link {}'.format(url))
    resp_text = requests.get(url).text
    soup = BeautifulSoup(resp_text, 'html.parser')
    # Return the text of the first <h1> element on the page
    return soup.find('h1').text

Its job is simple: fetch the given URL and extract the element with the h1 tag.
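The extraction step can be tried without Celery or the network. The sketch below is not the article's code: it swaps BeautifulSoup for the standard library's html.parser so it runs with no extra packages, and the names H1Extractor and extract_h1 are made up for illustration. It also returns None when a page has no h1, a case where soup.find('h1').text in tasks.py would raise AttributeError.

```python
from html.parser import HTMLParser


class H1Extractor(HTMLParser):
    """Collects the text inside the first <h1> element (illustrative helper)."""

    def __init__(self):
        super().__init__()
        self.in_h1 = False   # True while the parser is inside the first <h1>
        self.done = False    # True once the first <h1> has been closed
        self.parts = []      # text fragments found inside the <h1>

    def handle_starttag(self, tag, attrs):
        if tag == 'h1' and not self.done:
            self.in_h1 = True

    def handle_endtag(self, tag):
        if tag == 'h1' and self.in_h1:
            self.in_h1 = False
            self.done = True

    def handle_data(self, data):
        if self.in_h1:
            self.parts.append(data)


def extract_h1(html):
    """Return the text of the first <h1>, or None if the page has none."""
    parser = H1Extractor()
    parser.feed(html)
    parser.close()
    return ''.join(parser.parts) if parser.done else None


page = '<html><body><h1>Tasks<a href="#tasks">#</a></h1><p>Body</p></body></html>'
print(extract_h1(page))
print(extract_h1('<p>no heading here</p>'))
```

Nested tags inside the heading, such as the anchor link above, contribute their text as well, which mirrors what the .text attribute in BeautifulSoup does.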

Finally, create a file task_dispatcher.py with the following content:

from workers import app

url_list = [
    'http://docs.celeryproject.org/en/latest/getting-started/introduction.html',
    'http://docs.celeryproject.org/en/latest/getting-started/brokers/index.html',
    'http://docs.celeryproject.org/en/latest/getting-started/first-steps-with-celery.html',
    'http://docs.celeryproject.org/en/latest/getting-started/next-steps.html',
    'http://docs.celeryproject.org/en/latest/getting-started/resources.html',
    'http://docs.celeryproject.org/en/latest/userguide/application.html',
    'http://docs.celeryproject.org/en/latest/userguide/tasks.html',
    'http://docs.celeryproject.org/en/latest/userguide/canvas.html',
    'http://docs.celeryproject.org/en/latest/userguide/workers.html',
    'http://docs.celeryproject.org/en/latest/userguide/daemonizing.html',
    'http://docs.celeryproject.org/en/latest/userguide/periodic-tasks.html'
]


def manage_crawl_task(urls):
    for url in urls:
        # Send each URL to the workers as a task, referenced by its name
        app.send_task('tasks.crawl', args=(url,))


if __name__ == '__main__':
    manage_crawl_task(url_list)

This code sends tasks to the workers. The task is tasks.crawl and the argument is the URL, passed as a tuple.
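Because workers.py sets JSON as the task serializer, the argument tuple has to survive a JSON round trip. The sketch below shows only that round trip with the standard json module. It is not Celery's real message format, which wraps the arguments in a larger envelope, and the local crawl function is a stand-in that does no network access.

```python
import json

url = 'http://docs.celeryproject.org/en/latest/userguide/tasks.html'
args = (url,)                    # the tuple task_dispatcher.py passes to send_task

wire = json.dumps(args)          # a tuple is encoded as a JSON array
received = json.loads(wire)      # and decodes back as a list, not a tuple


def crawl(url):
    """Stand-in for tasks.crawl; only echoes the URL it was given."""
    return 'would crawl {}'.format(url)


# Positional unpacking works the same way for a list as for a tuple
outcome = crawl(*received)
print(type(received).__name__, outcome)
```

The practical consequence is that task arguments and return values should stick to JSON-friendly types such as strings, numbers, lists and dicts.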

Now let's start a worker on node A (the host whose hostname is resolvewang):

celery -A workers worker -c 2 -l info

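Here -c sets the concurrency to 2, which under Celery's default prefork pool means two worker processes rather than threads, and -l sets the log level to info. Copy the code to node B (the host named wpm) and start a worker there with the same command. You will then see the following output: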

Figure: the two nodes

The node on the left (A) first reports all alone, meaning it is the only node. Once node B starts, A synchronizes with it:

sync with celery@wpm

At this point, we run the dispatcher to send crawl tasks to the two worker nodes:

python task_dispatcher.py

This produces the following output:

Figure: distributed crawling
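One way to picture how the dispatched URLs end up split between nodes is as competing consumers on a shared queue: each message is taken by exactly one consumer. The sketch below is a toy, in-process simulation of that idea using threads and queue.Queue. It is not how Celery or Redis is implemented, and run_workers is a made-up name.

```python
import queue
import threading


def run_workers(urls, worker_names):
    """Let several stand-in workers drain one shared queue of URLs."""
    tasks = queue.Queue()
    for url in urls:
        tasks.put(url)

    handled = {name: [] for name in worker_names}

    def worker(name):
        while True:
            try:
                url = tasks.get_nowait()   # each item is handed out only once
            except queue.Empty:
                return                     # nothing left to do
            handled[name].append(url)

    threads = [threading.Thread(target=worker, args=(name,)) for name in worker_names]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return handled


urls = ['page-{}'.format(i) for i in range(11)]
handled = run_workers(urls, ['resolvewang', 'wpm'])
seen = handled['resolvewang'] + handled['wpm']
print(len(seen), len(set(seen)))
```

How the items are divided between the two names varies from run to run, but every URL is handled once and only once.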

Both nodes are executing crawl tasks, and no task is run twice. Now let's look at the results in Redis:

Figure: the backend

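There are 11 results in total, which shows that the data returned by tasks.crawl has all been stored in db2 (the backend) in JSON form. Besides the return value, each entry also records information such as whether the task succeeded.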
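To make the shape of such an entry concrete, the sketch below decodes one JSON result and pulls out the success flag and return value. The payload is hand-written for illustration, not captured from Redis. The field names status, result, traceback and children are an assumption about what the backend stores and may differ between Celery versions, and summarize_result is a made-up helper.

```python
import json

# Hypothetical entry, written by hand for illustration (not real output)
sample = json.dumps({
    'status': 'SUCCESS',
    'result': 'Introduction to Celery',
    'traceback': None,
    'children': [],
})


def summarize_result(raw):
    """Return (succeeded, value) from one JSON-encoded task result."""
    meta = json.loads(raw)
    succeeded = meta.get('status') == 'SUCCESS'
    return succeeded, meta.get('result')


print(summarize_result(sample))
```

Using .get rather than indexing keeps the helper from failing if a field is absent in a given version's format.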

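With that, we have a very basic distributed web crawler. It does not scale well yet, and it is admittedly rather simple. In the next post I will use Weibo data collection as an example to show how to build a robust distributed web crawler.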

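Readers interested in large-scale Weibo data collection may want to follow the distributed Weibo crawler project. It is worth trying out as well.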
