Python asyncio crawler

Posted by TwinkleStar on 2020-04-28

import aiohttp, aiofiles
from aiohttp.client_exceptions import ClientConnectionError
import asyncio
import os
import re


RE_IMG_PAGES = re.compile(r'''<li><a href=["'](https://www.mzitu.com/\d+)["']''')  # entry URL of each photo set
RE_LIST_NEXT_PAGE = re.compile(r'''next page-numbers" href=["'](https://www.mzitu.com/page/\d+/)["']>''')  # next page of the list
RE_IMG_INFO = re.compile(r'''<div class="main-image">.+?<img src=["']([^"']+?)["'] alt=["']([^"']+?)["']''')  # image URL and title
RE_IMG_NEXT_PAGE = re.compile(r'''href=["']([^"']+?/\d+/\d+)["']><span>下一頁''')  # next page within a photo set ("下一頁" is the site's "next page" link text)
RE_SUB_DIRNAME = re.compile(r'[<>/\\|:*?]')  # characters not allowed in directory names


async def download(url, retries=0):
    # fetch the URL and return the raw response body; non-200 responses are
    # retried (three attempts in total) before raising ClientConnectionError
    headers = {'User-Agent': 'Mozilla', 'Referer': 'https://www.mzitu.com/'}
    if retries < 3:
        async with aiohttp.request('GET', url, headers=headers, allow_redirects=False, expect100=True) as resp:
            if resp.status == 200:
                return await resp.read()
            else:
                await asyncio.sleep(10)  # back off before retrying
                return await download(url, retries + 1)
    else:
        raise ClientConnectionError
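
A note on connection handling: aiohttp.request() sets up a new connector for every call, so each image download pays the connection cost again. When the crawl is widened, a shared aiohttp.ClientSession lets connections be reused. The sketch below shows that pattern under the assumption that the caller creates and owns the session; download_with_session is illustrative and not part of the original script.

async def download_with_session(session, url):
    # 'session' is an aiohttp.ClientSession created by the caller, e.g.
    #   async with aiohttp.ClientSession() as session: ...
    headers = {'User-Agent': 'Mozilla', 'Referer': 'https://www.mzitu.com/'}
    async with session.get(url, headers=headers, allow_redirects=False) as resp:
        resp.raise_for_status()  # surface non-200 responses instead of retrying
        return await resp.read()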


async def save_image(img_url, save_dir=''):
    img = await download(img_url)
    save_dir = RE_SUB_DIRNAME.sub('_', save_dir)  # replace characters that are illegal in directory names
    save_path = os.path.join(save_dir, os.path.split(img_url)[-1])

    try:
        # save directly; if the directory does not exist yet, create it and save again
        async with aiofiles.open(save_path, mode='wb') as img_fp:
            await img_fp.write(img)
    except FileNotFoundError:
        os.mkdir(save_dir)
        async with aiofiles.open(save_path, mode='wb') as img_fp:
            await img_fp.write(img)

    print(save_path)


async def process_list_page(list_page_url):
    list_page = await download(list_page_url)
    list_page = list_page.decode('utf-8')
    img_page_list = RE_IMG_PAGES.findall(list_page)

    # for img_page in img_page_list:
    # crawl every photo set on the list page; enable with caution
    for img_page in img_page_list[:1]:
        await process_img_page(img_page)

    # list_next_page_list = RE_LIST_NEXT_PAGE.findall(list_page)
    # for list_next_page in list_next_page_list:
    #     await process_list_page(list_next_page)
    # crawl the next page of the list; enable with caution
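
Each photo set above is awaited one after another, so only one request is in flight at a time. Because the sets are independent, they could also be scheduled concurrently with asyncio.gather, bounded by a semaphore so the site is not flooded. A sketch of that variant; process_list_page_concurrently and its limit parameter are illustrative, not the original behaviour.

async def process_list_page_concurrently(list_page_url, limit=3):
    # 'limit' caps how many photo sets are crawled at the same time (assumed value)
    list_page = (await download(list_page_url)).decode('utf-8')
    img_page_list = RE_IMG_PAGES.findall(list_page)

    sem = asyncio.Semaphore(limit)

    async def bounded(img_page):
        async with sem:
            await process_img_page(img_page)

    # schedule every photo set at once; the semaphore keeps concurrency at 'limit'
    await asyncio.gather(*(bounded(p) for p in img_page_list))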


async def process_img_page(img_page_url):
    img_page = await download(img_page_url)
    img_page = img_page.decode('utf-8')
    img_info_list = RE_IMG_INFO.findall(img_page)
    for img_url, img_title in img_info_list:
        await save_image(img_url, img_title)

    img_next_page_list = RE_IMG_NEXT_PAGE.findall(img_page)
    for img_next_page in img_next_page_list:
        await process_img_page(img_next_page)


base_url = 'https://www.mzitu.com/'
loop = asyncio.get_event_loop()
loop.run_until_complete(process_list_page(base_url))
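
On Python 3.7 and later the same entry point can be written with asyncio.run(), which creates the event loop, runs the coroutine, and closes the loop afterwards:

# equivalent entry point on Python 3.7+
asyncio.run(process_list_page(base_url))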
