Busy day today, so just a quick post.
The code below comes from the video mentioned in that post; the GitHub link is: github.com/mikeckenned…
The first version is shown below: an ordinary for-loop crawler. Original source.
    import requests
    import bs4
    from colorama import Fore


    def main():
        get_title_range()
        print("Done.")


    def get_html(episode_number: int) -> str:
        print(Fore.YELLOW + f"Getting HTML for episode {episode_number}", flush=True)

        url = f'https://talkpython.fm/{episode_number}'
        resp = requests.get(url)
        resp.raise_for_status()

        return resp.text


    def get_title(html: str, episode_number: int) -> str:
        print(Fore.CYAN + f"Getting TITLE for episode {episode_number}", flush=True)
        soup = bs4.BeautifulSoup(html, 'html.parser')
        header = soup.select_one('h1')
        if not header:
            return "MISSING"

        return header.text.strip()


    def get_title_range():
        # Please keep this range pretty small to not DDoS my site. ;)
        for n in range(185, 200):
            html = get_html(n)
            title = get_title(html, n)
            print(Fore.WHITE + f"Title found: {title}", flush=True)


    if __name__ == '__main__':
        main()
This version took 37 s to run. Let's use PyCharm's profiler to see exactly where the time goes.
Click Profile '<file name>':

![Analyzing async crawler efficiency with PyCharm's profiler](https://i.iter01.com/images/37a286259d8abe7eb958deacc4e17ea37d76e9a072333e2e6af4d7ab2449714b.jpg)
This produces a detailed call graph showing which functions call which and how long each one takes:

![Analyzing async crawler efficiency with PyCharm's profiler](https://i.iter01.com/images/a173a8ed430ae06cbed4043984b820e0d055a8d4a8b9f8d64d116845ba6d55e5.jpg)
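If you don't have PyCharm Professional handy, the standard library's cProfile reports the same kind of per-function timings. A minimal sketch, assuming the first script above is saved as `sync_scraper.py` (that module name is mine, not from the original post):

```python
import cProfile
import pstats

# The module name sync_scraper is an assumption; save the synchronous
# script above under that name or adjust the import accordingly.
from sync_scraper import get_title_range

# Run the crawler under the profiler and dump the stats to a file.
cProfile.run('get_title_range()', 'sync.prof')

# Print the ten entries with the largest cumulative time; get_html and
# requests.get should dominate, matching what PyCharm shows.
pstats.Stats('sync.prof').sort_stats('cumulative').print_stats(10)
```

Running `python -m cProfile -s cumulative sync_scraper.py` from the shell gives a similar listing without any extra code.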
You can see that `get_html` accounts for 96.7% of the time: roughly 97% of the run is spent on IO, and while the HTML is being fetched the program just sits there waiting. If, instead of idling until the IO completes, it could be doing something useful in the meantime, we would save a lot of time.
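To make the idea concrete, here is a toy illustration (not from the original post) in which `asyncio.sleep` stands in for a network wait: three one-second waits run concurrently and finish in about one second rather than three.

```python
import asyncio
import time


async def fake_request(i: int) -> None:
    # asyncio.sleep stands in for waiting on the network; while this
    # coroutine is suspended, the event loop runs the other ones.
    await asyncio.sleep(1)
    print(f"request {i} done")


async def demo() -> None:
    start = time.perf_counter()
    await asyncio.gather(*(fake_request(i) for i in range(3)))
    print(f"elapsed: {time.perf_counter() - start:.1f} s")  # ~1.0 s, not 3.0 s


asyncio.run(demo())
```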
Let's do a rough calculation of how much time asynchronous fetching with `asyncio` could save. `get_html` took 36.8 s in total over 15 calls, so fetching the HTML for a single link takes 36.8 s / 15 ≈ 2.4 s. **If the fetching were fully asynchronous, getting all 15 links would still take about 2.4 s.** Adding the 0.6 s spent in `get_title`, we estimate the improved program should finish in roughly 3 s, a speedup of about 12x.
Now look at the improved code. Original source.
    import asyncio
    from asyncio import AbstractEventLoop

    import aiohttp
    import requests
    import bs4
    from colorama import Fore


    def main():
        # Create loop
        loop = asyncio.get_event_loop()
        loop.run_until_complete(get_title_range(loop))
        print("Done.")


    async def get_html(episode_number: int) -> str:
        print(Fore.YELLOW + f"Getting HTML for episode {episode_number}", flush=True)

        # Make this async with aiohttp's ClientSession
        url = f'https://talkpython.fm/{episode_number}'
        # resp = await requests.get(url)
        # resp.raise_for_status()
        async with aiohttp.ClientSession() as session:
            async with session.get(url) as resp:
                resp.raise_for_status()
                html = await resp.text()
                return html


    def get_title(html: str, episode_number: int) -> str:
        print(Fore.CYAN + f"Getting TITLE for episode {episode_number}", flush=True)
        soup = bs4.BeautifulSoup(html, 'html.parser')
        header = soup.select_one('h1')
        if not header:
            return "MISSING"
        return header.text.strip()


    async def get_title_range(loop: AbstractEventLoop):
        # Please keep this range pretty small to not DDoS my site. ;)
        tasks = []
        for n in range(190, 200):
            tasks.append((loop.create_task(get_html(n)), n))

        for task, n in tasks:
            html = await task
            title = get_title(html, n)
            print(Fore.WHITE + f"Title found: {title}", flush=True)


    if __name__ == '__main__':
        main()
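As a side note, on Python 3.7+ the same idea can be written without managing the event loop by hand: `asyncio.run` drives the coroutine, one shared `ClientSession` is reused for every request, and `asyncio.gather` awaits all the downloads together. This is only a sketch of an alternative style with function names of my own choosing, not the code from the video:

```python
import asyncio

import aiohttp
import bs4


async def fetch_html(session: aiohttp.ClientSession, episode_number: int) -> str:
    # Reusing one session keeps connections pooled across requests.
    url = f'https://talkpython.fm/{episode_number}'
    async with session.get(url) as resp:
        resp.raise_for_status()
        return await resp.text()


def parse_title(html: str) -> str:
    header = bs4.BeautifulSoup(html, 'html.parser').select_one('h1')
    return header.text.strip() if header else "MISSING"


async def get_titles(start: int, stop: int) -> list:
    async with aiohttp.ClientSession() as session:
        # Schedule every download at once and wait for them together.
        pages = await asyncio.gather(
            *(fetch_html(session, n) for n in range(start, stop)))
    return [parse_title(page) for page in pages]


if __name__ == '__main__':
    # asyncio.run creates and closes the event loop for us.
    for title in asyncio.run(get_titles(190, 200)):
        print(f"Title found: {title}")
```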
Generating a profile for the video's async version with the same steps:

![Analyzing async crawler efficiency with PyCharm's profiler](https://i.iter01.com/images/df9396eba554096dcd196af0394843be78b41bf722694d64d5789a8dc05acbec.jpg)
The run now takes about 3.8 s, which is basically in line with our estimate.
![Analyzing async crawler efficiency with PyCharm's profiler](https://i.iter01.com/images/718c7fe5243d0a8eb561fa15d2adccfd9a85590921a1e97238d687ee57915e6a.jpg)

![Analyzing async crawler efficiency with PyCharm's profiler](https://i.iter01.com/images/99e0e63b2e683ed4e375f0a851a375727bec6ae76681d1ae0060f9c3f8fd7f0c.jpg)
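Finally, if you only want to compare wall-clock times rather than full profiles, wrapping either version's `main()` in `time.perf_counter` is enough. A minimal sketch (the module names are again my own assumptions):

```python
import time

# sync_scraper / async_scraper are assumed module names for the two
# scripts above; import whichever one you want to time.
from async_scraper import main

start = time.perf_counter()
main()
print(f"Elapsed: {time.perf_counter() - start:.1f} s")
```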
My WeChat official account: 全棧不存在的

![Analyzing async crawler efficiency with PyCharm's profiler](https://i.iter01.com/images/07f2442f8bbc91597dc2788309f6045c9b3fde60fffc0ede8bddce6029feda50.jpg)