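Large-Scale Async News Crawler: Building an Async Crawler with asyncio

Posted by Wang Ping on 2018-12-03

Original URL: https://www.yuanrenxue.com/crawler/news-crawler-asyncio.html

"I've waited so long and today has finally come; I've dreamed so long and the dream has finally come true." That Andy Lau song keeps playing in my head. Yes, I finally get to write about my favorite topic, the asynchronous crawler. The earlier chapters went step by step and gradually built things up, which meant quite a lot of "nagging". But so that you little apes could learn crawling from the shallow end to the deep end, this old ape had no choice but to nag, and holding back was killing me. Today I get to write at length about async crawlers and say everything I want to say!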

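Implementing an async news crawler with asyncio

Some of you little apes may not be entirely clear on the concept of asynchronous IO, so let's first look at what it is.
To make the concept more vivid, let's use herding sheep as an analogy:

Starting a download request is like letting a sheep out to graze;
finishing a download task is like the sheep coming back to the pen with a full belly.

Synchronous herding goes like this:
Shepherd Xiao Tong (the synchronous one) has 100 sheep to graze. He lets one sheep out, waits until it is full and back, then lets out the second, waits until that one is full and back, then lets out the third... A shepherd who herds like this is really...

Now look at asynchronous herding:
Shepherd Xiao Yi (the asynchronous one) also has 100 sheep to graze. After watching for a while he decides that Xiao Tong's method is rather clumsy. He figures the pasture can hold 10 sheep at a time (the bandwidth), so he lets 10 sheep out at once, and while waiting for them to come back he can even shear some wool. Some sheep eat quickly and return early; he puts them back in the pen and immediately lets out a few more, trying to keep 10 sheep on the pasture at all times.

Clearly, asynchronous herding is far more efficient. Likewise, in the networked world the asynchronous approach is the more efficient one.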
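
The two shepherds can be sketched in a few lines of asyncio. This is only an illustration of the analogy, not code from the crawler: the names graze(), herd_sync(), herd_async() and the 0.01 second grazing time are made up for the example, and asyncio.sleep() stands in for a real download.

```python
import asyncio
import time

GRAZE_SECONDS = 0.01  # hypothetical time one "sheep" (download) needs


async def graze(sheep):
    """Simulate one download with a non-blocking sleep."""
    await asyncio.sleep(GRAZE_SECONDS)
    return sheep


async def herd_sync(flock):
    """Xiao Tong: let one sheep out and wait for it before the next."""
    start = time.perf_counter()
    for sheep in flock:
        await graze(sheep)
    return time.perf_counter() - start


async def herd_async(flock, pasture_size=10):
    """Xiao Yi: keep at most pasture_size sheep grazing at once."""
    pasture = asyncio.Semaphore(pasture_size)
    grazing = 0
    peak = 0

    async def graze_with_limit(sheep):
        nonlocal grazing, peak
        async with pasture:
            grazing += 1
            peak = max(peak, grazing)
            try:
                return await graze(sheep)
            finally:
                grazing -= 1

    start = time.perf_counter()
    await asyncio.gather(*(graze_with_limit(s) for s in flock))
    return time.perf_counter() - start, peak


async def compare(n=30):
    flock = list(range(n))
    sync_time = await herd_sync(flock)
    async_time, peak = await herd_async(flock)
    return sync_time, async_time, peak


if __name__ == '__main__':
    sync_time, async_time, peak = asyncio.run(compare())
    print('sync: %.3fs, async: %.3fs, peak on pasture: %d'
          % (sync_time, async_time, peak))
```

The semaphore plays the role of the pasture: a sheep that comes back frees a slot, and the next waiting sheep goes out right away.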

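At this point some little apes may ask: why not build the crawler with multiple threads or multiple processes? True, threads and processes can also improve the throughput of the synchronous crawler from before, but asynchronous IO improves it more and is a better fit for crawling. When we get the chance later, we can compare the crawl rates of all three.

1. The asynchronous downloader

Remember the downloader we implemented earlier with requests? It works well in the synchronous case but is not suitable for asynchronous code, so we need to rework it first. Fortunately, the aiohttp module already supports asynchronous HTTP requests, so we will use aiohttp to implement the asynchronous downloader.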

async def fetch(session, url, headers=None, timeout=9):
    # Default request headers; a caller-supplied dict replaces them entirely.
    _headers = {
        'User-Agent': ('Mozilla/5.0 (compatible; MSIE 9.0; '
                       'Windows NT 6.1; Win64; x64; Trident/5.0)'),
    }
    if headers:
        _headers = headers
    try:
        async with session.get(url, headers=_headers, timeout=timeout) as response:
            status = response.status
            html = await response.read()
            encoding = response.get_encoding()
            # Decode pages labeled gb2312 as gbk, which is a superset of it.
            if encoding == 'gb2312':
                encoding = 'gbk'
            html = html.decode(encoding, errors='ignore')
            # Final URL after any HTTP redirects.
            redirected_url = str(response.url)
    except Exception as e:
        msg = 'Failed download: {} | exception: {}, {}'.format(url, str(type(e)), str(e))
        print(msg)
        # On failure, report status 0 and an empty body.
        html = ''
        status = 0
        redirected_url = url
    return status, html, redirected_url

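We call this asynchronous downloader fetch(). It takes two required arguments:

session: an aiohttp.ClientSession object. It is created inside the crawler and passed in as an argument on every call to fetch().
url: the URL to download.

The implementation uses an asynchronous context manager (async with). Encoding detection still relies on cchardet, this time indirectly through response.get_encoding() (aiohttp releases of that period used cchardet for detection when it was installed).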
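
The decoding step inside fetch() can be looked at on its own. Below is a minimal sketch of just that step, pulled out into a helper named decode_html() (a hypothetical name, not part of the original code). The fallback to utf-8 when no encoding is reported is also an assumption added for the example. Pages that declare gb2312 sometimes contain characters that only GBK can represent, which is why the label is swapped before decoding.

```python
def decode_html(raw, encoding):
    """Decode downloaded bytes, treating a gb2312 label as gbk."""
    # Assumption for this sketch: fall back to utf-8 when nothing is reported.
    encoding = (encoding or 'utf-8').lower()
    if encoding == 'gb2312':
        # GBK is a superset of GB2312, so decoding as gbk loses nothing.
        encoding = 'gbk'
    return raw.decode(encoding, errors='ignore')


if __name__ == '__main__':
    raw = '朱镕基'.encode('gbk')  # sample bytes encoded as GBK
    print(decode_html(raw, 'gb2312'))
```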
With the asynchronous downloader in place, we can start writing our async crawler.

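2. The asynchronous news crawler

As with the synchronous crawler, we define the whole crawler as a class. Its main members are:

self.urlpool: the URL pool
self.loop: the asynchronous event loop
self.session: the aiohttp.ClientSession object used for asynchronous downloads
self.db: the asynchronous database connection based on aiomysql
self._workers: the number of downloads currently in flight (sheep out on the pasture)

These main members together give us asynchronous control, asynchronous downloading, and asynchronous storage (to the database); the other members play supporting roles. For the crawler class's methods, see the complete implementation below: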

#!/usr/bin/env python3
# File: news-crawler-async.py
# Author: veelion

import traceback
import time
import asyncio
import aiohttp
import urllib.parse as urlparse
import farmhash
import lzma

import uvloop
asyncio.set_event_loop_policy(uvloop.EventLoopPolicy())

import sanicdb

from urlpool import UrlPool
import functions as fn
import config


class NewsCrawlerAsync:
    def __init__(self, name):
        self._workers = 0
        self._workers_max = 30
        self.logger = fn.init_file_logger(name+ '.log')

        self.urlpool = UrlPool(name)

        self.loop = asyncio.get_event_loop()
        self.session = aiohttp.ClientSession(loop=self.loop)
        self.db = sanicdb.SanicDB(
            config.db_host,
            config.db_db,
            config.db_user,
            config.db_password,
            loop=self.loop
        )

    async def load_hubs(self,):
        sql = 'select url from crawler_hub'
        data = await self.db.query(sql)
        self.hub_hosts = set()
        hubs = []
        for d in data:
            host = urlparse.urlparse(d['url']).netloc
            self.hub_hosts.add(host)
            hubs.append(d['url'])
        self.urlpool.set_hubs(hubs, 300)

    async def save_to_db(self, url, html):
        urlhash = farmhash.hash64(url)
        sql = 'select url from crawler_html where urlhash=%s'
        d = await self.db.get(sql, urlhash)
        if d:
            if d['url'] != url:
                msg = 'farmhash collision: %s <=> %s' % (url, d['url'])
                self.logger.error(msg)
            return True
        if isinstance(html, str):
            html = html.encode('utf8')
        html_lzma = lzma.compress(html)
        sql = ('insert into crawler_html(urlhash, url, html_lzma) '
               'values(%s, %s, %s)')
        good = False
        try:
            await self.db.execute(sql, urlhash, url, html_lzma)
            good = True
        except Exception as e:
            if e.args[0] == 1062:
                # Duplicate entry
                good = True
                pass
            else:
                traceback.print_exc()
                raise e
        return good

    def filter_good(self, urls):
        goodlinks = []
        for url in urls:
            host = urlparse.urlparse(url).netloc
            if host in self.hub_hosts:
                goodlinks.append(url)
        return goodlinks

    async def process(self, url, ishub):
        try:
            status, html, redirected_url = await fn.fetch(self.session, url)
            self.urlpool.set_status(url, status)
            if redirected_url != url:
                self.urlpool.set_status(redirected_url, status)
            # Extract links from hub pages. News pages also carry
            # "related news" links; extract those as needed.
            if status != 200:
                return
            if ishub:
                newlinks = fn.extract_links_re(redirected_url, html)
                goodlinks = self.filter_good(newlinks)
                print("%s/%s, goodlinks/newlinks" % (len(goodlinks), len(newlinks)))
                self.urlpool.addmany(goodlinks)
            else:
                await self.save_to_db(redirected_url, html)
        finally:
            # Always release the worker slot, including on the early
            # return above or when an exception is raised.
            self._workers -= 1

    async def loop_crawl(self,):
        await self.load_hubs()
        last_rating_time = time.time()
        counter = 0
        while 1:
            tasks = self.urlpool.pop(self._workers_max)
            if not tasks:
                print('no url to crawl, sleep')
                await asyncio.sleep(3)
                continue
            for url, ishub in tasks.items():
                self._workers += 1
                counter += 1
                print('crawl:', url)
                asyncio.ensure_future(self.process(url, ishub))

            gap = time.time() - last_rating_time
            if gap > 5:
                rate = counter / gap
                print('\tloop_crawl() rate:%s, counter: %s, workers: %s' % (round(rate, 2), counter, self._workers))
                last_rating_time = time.time()
                counter = 0
            if self._workers > self._workers_max:
                print('====== got workers_max, sleep 3 sec to next worker =====')
                await asyncio.sleep(3)

    def run(self):
        try:
            self.loop.run_until_complete(self.loop_crawl())
        except KeyboardInterrupt:
            print('stopped by yourself!')
            del self.urlpool
            pass


if __name__ == '__main__':
    nc = NewsCrawlerAsync('yrx-async')
    nc.run()

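The crawler's main flow is implemented in loop_crawl(). Its body is a while loop: each iteration takes a fixed number of URLs from self.urlpool as download tasks (picking a batch of sheep from the pen) and starts the asynchronous downloads with ensure_future() (letting those sheep out). The process() method downloads a page and then either stores it or extracts new URLs from it, which is like the sheep grazing and having lambs.

Concurrency is controlled through self._workers and self._workers_max. We cannot let concurrency grow without limit: that would strain the local CPU and network bandwidth, and it would also put pressure on the target servers.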
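
The scheduling loop and the worker counter described above can be tried out without a network or a database. The sketch below is a simplified stand-in, not the crawler itself: MiniCrawler, its pending list (in place of the URL pool), the peak attribute, and the simulated download are all invented for illustration. Unlike loop_crawl(), it only pops as many URLs as there are free slots. The counter is decremented in a finally block so that a slot is released even when process() exits early.

```python
import asyncio


class MiniCrawler:
    def __init__(self, urls, workers_max=3):
        self.pending = list(urls)   # stand-in for the URL pool
        self.done = []
        self._workers = 0
        self._workers_max = workers_max
        self.peak = 0               # highest concurrency observed

    async def process(self, url):
        try:
            await asyncio.sleep(0.01)   # simulated download
            if url.endswith('bad'):
                return                  # early exit, like a non-200 status
            self.done.append(url)
        finally:
            self._workers -= 1          # always give the slot back

    async def loop_crawl(self):
        tasks = []
        while self.pending:
            free = self._workers_max - self._workers
            if free <= 0:
                # Pasture is full: yield to the event loop and check again.
                await asyncio.sleep(0.005)
                continue
            batch, self.pending = self.pending[:free], self.pending[free:]
            for url in batch:
                self._workers += 1
                self.peak = max(self.peak, self._workers)
                # Schedule the download without waiting for it to finish.
                tasks.append(asyncio.ensure_future(self.process(url)))
        await asyncio.gather(*tasks)


if __name__ == '__main__':
    crawler = MiniCrawler(['u%d' % i for i in range(7)] + ['bad'])
    asyncio.run(crawler.loop_crawl())
    print(len(crawler.done), 'pages done, peak workers:', crawler.peak)
```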

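At this point we have implemented both a synchronous and an asynchronous news crawler, as the two classes NewsCrawlerSync and NewsCrawlerAsync. Their structure is almost identical; the only difference is that one crawls sequentially and the other concurrently. Comparing the two implementations is a good way for little apes to understand the asynchronous flow.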

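Crawler knowledge points

1. The uvloop module
uvloop is written in Cython and built on top of the libuv library. It is a replacement for asyncio's built-in event loop, and using it takes just two extra lines of code: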

import uvloop
asyncio.set_event_loop_policy(uvloop.EventLoopPolicy())

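According to its author, uvloop makes asyncio very fast: at least 2x faster than nodejs, gevent, and other Python asynchronous frameworks, and close to the performance of Go.

[Figure: performance benchmark by the uvloop author]

This is the performance comparison benchmark run by uvloop's author.
At the time of writing, uvloop does not support Windows or Python versions below 3.5, as can be seen in the setup.py file in its source code: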

if sys.platform in ('win32', 'cygwin', 'cli'):
    raise RuntimeError('uvloop does not support Windows at the moment')

vi = sys.version_info
if vi < (3, 5):
    raise RuntimeError('uvloop requires Python 3.5 or greater')

So little apes on Windows who want to run the async crawler need to comment out those two uvloop lines.
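
Instead of commenting the lines out by hand, one option is to make the import optional so the same file runs on every platform. This is a sketch of that idea rather than what the crawler above does; use_uvloop_if_available() and main() are hypothetical names introduced for the example.

```python
import asyncio

try:
    import uvloop  # third-party; may be missing, for example on Windows
except ImportError:
    uvloop = None


def use_uvloop_if_available():
    """Install uvloop's event loop policy when the module is importable."""
    if uvloop is None:
        return False  # keep asyncio's built-in event loop
    asyncio.set_event_loop_policy(uvloop.EventLoopPolicy())
    return True


async def main():
    await asyncio.sleep(0)
    return 'ok'


if __name__ == '__main__':
    print('uvloop enabled:', use_uvloop_if_available())
    print(asyncio.run(main()))
```

Either way the rest of the crawler stays unchanged, because uvloop only replaces the event loop underneath asyncio.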

Exercises

1. Add a feature to the synchronous downloader() or the asynchronous fetch()
Some little apes may not have seen HTML like this before. It appears inside <head>:

<meta http-equiv="refresh" content="5; url=https://example.com/">

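It tells the browser to redirect to another URL, https://example.com/, after 5 seconds.
So here is the question: add code to downloader() (or fetch()) so that it supports this kind of redirect.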
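
As a possible starting point for this exercise (one approach among several, not the author's answer), the first step is to detect the tag and work out the target URL. The sketch below uses regular expressions and urljoin from the standard library; find_meta_refresh() is a hypothetical helper name. Hooking it into fetch(), for instance by requesting the returned URL when one is found, is left to the reader.

```python
import re
from urllib.parse import urljoin

# Match a <meta ... http-equiv="refresh" ...> tag, case-insensitively.
_META_REFRESH = re.compile(
    r'''<meta[^>]*http-equiv\s*=\s*["']?refresh["']?[^>]*>''',
    re.IGNORECASE)
# Pull the quoted value of the content attribute out of that tag.
_CONTENT = re.compile(
    r'''content\s*=\s*(["'])(.*?)\1''',
    re.IGNORECASE | re.DOTALL)
# The part after the delay looks like: url=<target>
_URL_PART = re.compile(r'\s*url\s*=\s*(.*)', re.IGNORECASE | re.DOTALL)


def find_meta_refresh(base_url, html):
    """Return (delay_seconds, absolute_url) or None if there is no refresh."""
    tag = _META_REFRESH.search(html)
    if not tag:
        return None
    content = _CONTENT.search(tag.group(0))
    if not content:
        return None
    delay_part, _, rest = content.group(2).partition(';')
    try:
        delay = float(delay_part.strip())
    except ValueError:
        return None
    url_part = _URL_PART.match(rest)
    if not url_part:
        return None
    target = url_part.group(1).strip().strip('\'"')
    if not target:
        return None
    # Resolve relative targets against the page's own URL.
    return delay, urljoin(base_url, target)


if __name__ == '__main__':
    page = '<meta http-equiv="refresh" content="5; url=https://example.com/">'
    print(find_meta_refresh('http://a.test/page', page))
```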

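2. How to control the refresh frequency of hubs so the newest news is discovered promptly
This is a very important question when writing a news crawler. The news crawler we implemented does not include this mechanism. Think it over, little apes, and try implementing it yourselves.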
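
To get the thinking started, here is one very small sketch of the idea: remember when each hub was last handed out and only hand it out again once a fixed interval has passed. HubRefresher and pop_due() are invented names, and a single fixed interval is an assumption; a real solution might give busy hubs a shorter interval than quiet ones.

```python
import time


class HubRefresher:
    """Track when each hub URL was last handed out and report which are due."""

    def __init__(self, hubs, interval, clock=time.monotonic):
        self.interval = interval    # seconds between two crawls of a hub
        self.clock = clock          # injectable so the logic can be tested
        self.last_popped = {hub: None for hub in hubs}

    def pop_due(self):
        """Return the hubs that are due now and mark them as just crawled."""
        now = self.clock()
        due = [hub for hub, last in self.last_popped.items()
               if last is None or now - last >= self.interval]
        for hub in due:
            self.last_popped[hub] = now
        return due


if __name__ == '__main__':
    refresher = HubRefresher(['https://example.com/news/'], interval=300)
    print(refresher.pop_due())   # due on the first call
    print(refresher.pop_due())   # empty until the interval has passed
```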

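That wraps up what this old ape had to say about implementing an asynchronous targeted news crawler. Thank you for reading. If you have any suggestions or questions, please leave a comment below and I will reply to each one. You can also follow the Yuanrenxue (猿人學) public account, where my new articles appear promptly.

The following chapters introduce how to use various tools, such as capturing packets with Charles, managing browser cookies, using selenium, and so on. You are welcome to read those too.

I share more insights on my public account, 猿人學 Python; please follow it.

***Copyright notice: unless otherwise stated, all articles are original works of Yuanrenxue (yuanrenxue.com). Please do not reproduce them in any form without authorization from Yuanrenxue.***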
