Python multithreaded asynchronous crawler: an asynchronous crawler experiment in Python [Celery, gevent, requests]

Posted by weixin_39915267 on 2020-11-11

Original URL: https://blog.csdn.net/weixin_39915267/article/details/109622340

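Until now my crawlers have run on a framework I wrote myself: a group of Workers fetch tasks from a Master and then start crawling. The number of processes equals the number of CPU cores, and crawl speed is raised by adding more threads.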
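
As a rough illustration of the older design described above, here is a minimal single-process sketch in which worker threads pull URLs from a shared queue. It is not the author's framework: `run_workers`, `fake_fetch`, and the example URLs are hypothetical names, and a `queue.Queue` stands in for the Master.

```python
import queue
import threading


def run_workers(urls, fetch, num_threads=4):
    """Hand every URL to a pool of worker threads and collect the results."""
    tasks = queue.Queue()          # plays the role of the Master's task list
    for url in urls:
        tasks.put(url)

    results = {}
    lock = threading.Lock()

    def worker():
        while True:
            try:
                url = tasks.get_nowait()   # ask for the next task
            except queue.Empty:
                return                     # nothing left, this worker exits
            body = fetch(url)              # blocking call; other threads keep going
            with lock:
                results[url] = body

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


def fake_fetch(url):
    # Hypothetical stand-in for a real HTTP request.
    return "body of " + url


if __name__ == "__main__":
    demo_urls = ["http://example.invalid/%d" % n for n in range(5)]
    print(run_workers(demo_urls, fake_fetch, num_threads=2))
```

In this model a blocked request ties up a whole thread, which is why throughput is raised by adding threads.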

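I recently looked at Celery. Its interface is really elegant, and it made me want to try writing a crawler on an asynchronous model.

Mock target

To make testing easier, I set up a simple server with Tornado to simulate the site being crawled.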

It does one simple thing: every request blocks for 6 seconds before it is answered.

    import time
    from concurrent.futures import ThreadPoolExecutor

    import tornado.gen
    import tornado.ioloop
    import tornado.web
    from tornado.concurrent import run_on_executor


    class MainHandler(tornado.web.RequestHandler):
        # Thread pool that runs the blocking sleep, so the IOLoop stays free
        executor = ThreadPoolExecutor(40)

        # Note: tornado.web.asynchronous was removed in Tornado 6.0;
        # delete this decorator line when running on newer versions.
        @tornado.web.asynchronous
        @tornado.gen.coroutine
        def get(self):
            print(time.asctime())
            yield self.sleep(6)
            self.write("from server:" + time.asctime())
            self.finish()

        @run_on_executor
        def sleep(self, sec):
            time.sleep(sec)


    if __name__ == "__main__":
        app = tornado.web.Application(handlers=[
            ("^/.*", MainHandler)
        ])
        app.listen(10240)
        tornado.ioloop.IOLoop.instance().start()

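Consumer

The tasks module contains a single function, spider, which uses gevent to request the given target.

    # Patch the socket module before requests is imported,
    # so that its network I/O becomes cooperative.
    import gevent.monkey
    gevent.monkey.patch_socket()

    from celery import Celery
    import socket
    import requests
    import gevent

    app = Celery("tasks",
                 broker="redis://127.0.0.1:6379/3",
                 backend="redis://127.0.0.1:6379/3")


    @app.task
    def spider(url):
        # Run the request in its own greenlet and poll it once per second
        resp = gevent.spawn(requests.get, url)
        tmp = 0
        while True:
            print("wait...", tmp)
            if resp.ready():
                # resp.value is None if requests.get raised an exception
                return "from:" + socket.getfqdn() + " res:" + str(resp.value.text)
            gevent.sleep(1)
            tmp += 1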

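Start the Celery worker in gevent mode:

    celery worker -A tasks --loglevel info -c 100 -P gevent

This argument order matches the Celery release in use at the time. Celery 5 and later expect the -A option before the subcommand, i.e. celery -A tasks worker --loglevel info -c 100 -P gevent.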

Producer

Use the spider function written above to crawl the target.

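In my test I ran the code below in 6 processes at once, and every result came back within 7 seconds, which shows that it works.

    from tasks import spider
    import time
    import random

    # Queue one task; the random path just makes each request distinct
    res = spider.delay("http://127.0.0.1:10240/{}".format(random.randint(1, 999)))

    i = 0
    while True:
        if res.ready():
            print("res:", res.get())
            break
        else:
            print("wait...", i)
        time.sleep(1)
        i += 1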
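
The timing argument above (six requests to a server that blocks for 6 seconds all finish in about 7 seconds, not 36) can be reproduced in miniature with the standard library alone. The sketch below is an analogy, not the article's setup: `http.server` stands in for the Tornado server, a thread pool stands in for the six producer processes and the gevent worker, and the delay is shortened to 0.3 seconds. `SlowHandler`, `timed_crawl`, and `DELAY` are hypothetical names.

```python
import http.client
import threading
import time
from concurrent.futures import ThreadPoolExecutor
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

DELAY = 0.3  # seconds each request blocks (the article's server uses 6)


class SlowHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        time.sleep(DELAY)                      # simulate a slow site
        body = b"from server"
        self.send_response(200)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass                                   # keep the demo output quiet


def timed_crawl(n_requests):
    """Send n_requests concurrent GETs and return (bodies, elapsed seconds)."""
    server = ThreadingHTTPServer(("127.0.0.1", 0), SlowHandler)  # port 0: any free port
    port = server.server_address[1]
    threading.Thread(target=server.serve_forever, daemon=True).start()

    def fetch(_):
        conn = http.client.HTTPConnection("127.0.0.1", port, timeout=10)
        try:
            conn.request("GET", "/")
            return conn.getresponse().read()
        finally:
            conn.close()

    start = time.monotonic()
    with ThreadPoolExecutor(max_workers=n_requests) as pool:
        bodies = list(pool.map(fetch, range(n_requests)))
    elapsed = time.monotonic() - start

    server.shutdown()
    server.server_close()
    return bodies, elapsed


if __name__ == "__main__":
    bodies, elapsed = timed_crawl(6)
    print("%d responses in %.2fs (serial would need about %.2fs)"
          % (len(bodies), elapsed, 6 * DELAY))
```

Because the waits overlap, the total time stays close to one delay rather than six, which is the same effect the 7-second result demonstrates.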

Part of the Celery log output:

The log shows that, within a single Celery process, several spider invocations take turns running.

    [2016-08-20 21:27:11,281: INFO/MainProcess] Starting new HTTP connection (1): 127.0.0.1
    [2016-08-20 21:27:11,313: INFO/MainProcess] Received task: tasks.spider[7b8b6f63-2bef-491e-a3a8-fdbcff824b9c]
    [2016-08-20 21:27:11,314: WARNING/MainProcess] wait...
    [2016-08-20 21:27:11,314: WARNING/MainProcess] 0
    [2016-08-20 21:27:11,316: INFO/MainProcess] Starting new HTTP connection (1): 127.0.0.1
    [2016-08-20 21:27:11,354: INFO/MainProcess] Received task: tasks.spider[5aa05e65-504d-4a04-8247-3f5708bfa46f]
    [2016-08-20 21:27:11,356: WARNING/MainProcess] wait...
    [2016-08-20 21:27:11,356: WARNING/MainProcess] 0
    [2016-08-20 21:27:11,357: INFO/MainProcess] Starting new HTTP connection (1): 127.0.0.1
    [2016-08-20 21:27:11,821: WARNING/MainProcess] wait...
    [2016-08-20 21:27:11,821: WARNING/MainProcess] 1
    [2016-08-20 21:27:11,989: WARNING/MainProcess] wait...
    [2016-08-20 21:27:11,990: WARNING/MainProcess] 1
    [2016-08-20 21:27:12,059: WARNING/MainProcess] wait...
    [2016-08-20 21:27:12,059: WARNING/MainProcess] 2
    [2016-08-20 21:27:12,208: WARNING/MainProcess] wait...
    [2016-08-20 21:27:12,209: WARNING/MainProcess] 1
    [2016-08-20 21:27:12,225: WARNING/MainProcess] wait...
    [2016-08-20 21:27:12,225: WARNING/MainProcess] 1
    [2016-08-20 21:27:12,246: WARNING/MainProcess] wait...
    [2016-08-20 21:27:12,247: WARNING/MainProcess] 2
    [2016-08-20 21:27:12,282: WARNING/MainProcess] wait...
    [2016-08-20 21:27:12,282: WARNING/MainProcess] 1
    [2016-08-20 21:27:12,316: WARNING/MainProcess] wait...
    [2016-08-20 21:27:12,316: WARNING/MainProcess] 1
    [2016-08-20 21:27:12,357: WARNING/MainProcess] wait...
    [2016-08-20 21:27:12,357: WARNING/MainProcess] 1
    [2016-08-20 21:27:12,823: WARNING/MainProcess] wait...
    [2016-08-20 21:27:12,823: WARNING/MainProcess] 2
    [2016-08-20 21:27:12,991: WARNING/MainProcess] wait...
    [2016-08-20 21:27:12,992: WARNING/MainProcess] 2
    [2016-08-20 21:27:13,061: WARNING/MainProcess] wait...
    [2016-08-20 21:27:13,061: WARNING/MainProcess] 3
    [2016-08-20 21:27:13,210: WARNING/MainProcess] wait...
    [2016-08-20 21:27:13,211: WARNING/MainProcess] 2
    [2016-08-20 21:27:13,227: WARNING/MainProcess] wait...
    [2016-08-20 21:27:13,227: WARNING/MainProcess] 2
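
The interleaving visible in the log can be imitated with a small cooperative-scheduling sketch. Note the swap: this uses the standard library's asyncio rather than gevent, so it only illustrates the idea of tasks yielding to each other inside one process; it is not how Celery's gevent pool is implemented. The task names, tick counts, and the `run_all` and `demo` helpers are hypothetical.

```python
import asyncio


async def spider(name, ticks, events, tick=0.01):
    """Record one event per tick, yielding control between ticks."""
    for i in range(ticks):
        events.append((name, i))      # analogous to print("wait...", tmp)
        await asyncio.sleep(tick)     # hand control back, like gevent.sleep(1)
    return name + " done"


async def run_all():
    events = []
    # All three coroutines share one thread and one event loop.
    results = await asyncio.gather(
        spider("a", 3, events),
        spider("b", 3, events),
        spider("c", 3, events),
    )
    return events, results


def demo():
    return asyncio.run(run_all())


if __name__ == "__main__":
    events, results = demo()
    for name, i in events:
        print("wait...", name, i)
    print(results)
```

Each task records its first tick before any task records its second, which mirrors the way the "wait..." counters of different spider calls are mixed together in the log.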

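Conclusion

With Celery, the crawler scales horizontally with little effort: just add consumer processes on more servers.

With gevent, requests becomes non-blocking within a single process, whereas I used to deal with blocking by using multiple threads.

I have only been learning Celery and gevent for a day. Now that this little toy works, it is time to read the documentation and understand them in depth!

Author: spencer404

Link: https://www.jianshu.com/p/c1e53cc32d4d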
