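The previous note summarized some key points of asyncio; this time let's put them to use. To write a crawler with coroutines, the network requests have to move from the requests library to aiohttp, because requests makes blocking calls that would stall the event loop.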
1. Efficiency comparison
- Using the crawler written last time, first scrape some wallpaper links:
```python
urls = [
    "https://alpha.wallhaven.cc/wallpapers/thumb/small/th-729560.jpg",
    "https://alpha.wallhaven.cc/wallpapers/thumb/small/th-724055.jpg",
    "https://alpha.wallhaven.cc/wallpapers/thumb/small/th-716644.jpg",
    "https://alpha.wallhaven.cc/wallpapers/thumb/small/th-716643.jpg",
    "https://alpha.wallhaven.cc/wallpapers/thumb/small/th-716645.jpg",
    "https://alpha.wallhaven.cc/wallpapers/thumb/small/th-686220.jpg",
    "https://alpha.wallhaven.cc/wallpapers/thumb/small/th-686212.jpg",
    "https://alpha.wallhaven.cc/wallpapers/thumb/small/th-652608.jpg",
    "https://alpha.wallhaven.cc/wallpapers/thumb/small/th-639894.jpg",
    "https://alpha.wallhaven.cc/wallpapers/thumb/small/th-639893.jpg",
    "https://alpha.wallhaven.cc/wallpapers/thumb/small/th-639892.jpg",
    "https://alpha.wallhaven.cc/wallpapers/thumb/small/th-639890.jpg",
    "https://alpha.wallhaven.cc/wallpapers/thumb/small/th-639888.jpg",
    "https://alpha.wallhaven.cc/wallpapers/thumb/small/th-468197.jpg",
    "https://alpha.wallhaven.cc/wallpapers/thumb/small/th-467016.jpg",
    "https://alpha.wallhaven.cc/wallpapers/thumb/small/th-467012.jpg",
    "https://alpha.wallhaven.cc/wallpapers/thumb/small/th-467009.jpg",
    "https://alpha.wallhaven.cc/wallpapers/thumb/small/th-467007.jpg",
    "https://alpha.wallhaven.cc/wallpapers/thumb/small/th-467005.jpg",
    "https://alpha.wallhaven.cc/wallpapers/thumb/small/th-466997.jpg",
    "https://alpha.wallhaven.cc/wallpapers/thumb/small/th-466998.jpg",
    "https://alpha.wallhaven.cc/wallpapers/thumb/small/th-466993.jpg",
    "https://alpha.wallhaven.cc/wallpapers/thumb/small/th-466994.jpg",
    "https://alpha.wallhaven.cc/wallpapers/thumb/small/th-466995.jpg",
    "https://alpha.wallhaven.cc/wallpapers/thumb/small/th-729560.jpg",
    "https://alpha.wallhaven.cc/wallpapers/thumb/small/th-724055.jpg",
    "https://alpha.wallhaven.cc/wallpapers/thumb/small/th-716644.jpg",
    "https://alpha.wallhaven.cc/wallpapers/thumb/small/th-716643.jpg",
    "https://alpha.wallhaven.cc/wallpapers/thumb/small/th-716645.jpg",
    "https://alpha.wallhaven.cc/wallpapers/thumb/small/th-686220.jpg",
    "https://alpha.wallhaven.cc/wallpapers/thumb/small/th-686212.jpg",
    "https://alpha.wallhaven.cc/wallpapers/thumb/small/th-652608.jpg",
    "https://alpha.wallhaven.cc/wallpapers/thumb/small/th-639894.jpg",
    "https://alpha.wallhaven.cc/wallpapers/thumb/small/th-639893.jpg",
    "https://alpha.wallhaven.cc/wallpapers/thumb/small/th-639892.jpg",
    "https://alpha.wallhaven.cc/wallpapers/thumb/small/th-639890.jpg",
    "https://alpha.wallhaven.cc/wallpapers/thumb/small/th-639888.jpg",
    "https://alpha.wallhaven.cc/wallpapers/thumb/small/th-468197.jpg",
    "https://alpha.wallhaven.cc/wallpapers/thumb/small/th-467016.jpg",
    "https://alpha.wallhaven.cc/wallpapers/thumb/small/th-467012.jpg",
    "https://alpha.wallhaven.cc/wallpapers/thumb/small/th-467009.jpg",
    "https://alpha.wallhaven.cc/wallpapers/thumb/small/th-467007.jpg",
    "https://alpha.wallhaven.cc/wallpapers/thumb/small/th-467005.jpg",
    "https://alpha.wallhaven.cc/wallpapers/thumb/small/th-466997.jpg",
    "https://alpha.wallhaven.cc/wallpapers/thumb/small/th-466998.jpg",
    "https://alpha.wallhaven.cc/wallpapers/thumb/small/th-466993.jpg",
    "https://alpha.wallhaven.cc/wallpapers/thumb/small/th-466994.jpg",
    "https://alpha.wallhaven.cc/wallpapers/thumb/small/th-466995.jpg",
    "https://alpha.wallhaven.cc/wallpapers/thumb/small/th-466992.jpg",
    "https://alpha.wallhaven.cc/wallpapers/thumb/small/th-466989.jpg"
]
```
- Download these images using coroutines
- Install the aiohttp library (for example, with pip install aiohttp)
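- According to the documentation, aiohttp issues requests through aiohttp.ClientSession(), and it recommends against creating a new session for every request, so only one session is created here: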
```python
import aiohttp


async def main():
    # A single session, to be shared by every request
    async with aiohttp.ClientSession() as session:
        pass
```
- Define the coroutine function that downloads an image and the function that creates the storage path:
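```python
# `headers` was not defined in the original snippet. The User-Agent here is the
# one the multiprocessing version later in this post sends.
headers = {
    'user-agent': (
        'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_2) AppleWebKit/537.36 '
        '(KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36'
    )
}


# Download one image
async def download_img(session, url):
    image_name = url.split('/')[-1]
    async with session.get(url, headers=headers) as response:
        with open('%s/%s' % (get_store_path('city'), image_name), 'wb') as fd:
            # Read the body in 200-byte chunks until it is exhausted
            while True:
                chunk = await response.content.read(200)
                if not chunk:
                    break
                fd.write(chunk)
```

```python
import os


# Return the image storage path, creating it if it does not exist
def get_store_path(dir_name):
    current_path = os.path.abspath('.')
    target_path = os.path.join(current_path, 'wallpaper/%s' % dir_name)
    folder = os.path.exists(target_path)
    if not folder:
        os.makedirs(target_path)
    return target_path
```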
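The two helpers above build the file name with `url.split('/')[-1]` and check for the folder before creating it. As a rough alternative sketch using only the standard library, the same steps could be written with `pathlib` and `urllib.parse`. The names `image_name` and `ensure_store_path` and the `base` parameter are hypothetical and not part of the original crawler:

```python
import tempfile
from pathlib import Path, PurePosixPath
from urllib.parse import urlsplit


def image_name(url):
    """Return the last path segment of the URL, ignoring any query string."""
    return PurePosixPath(urlsplit(url).path).name


def ensure_store_path(dir_name, base='.'):
    """Create <base>/wallpaper/<dir_name> if needed and return it as a Path."""
    target = Path(base).resolve() / 'wallpaper' / dir_name
    # exist_ok=True makes repeated calls harmless, even from several workers.
    target.mkdir(parents=True, exist_ok=True)
    return target


if __name__ == '__main__':
    # A temporary directory keeps the demo from touching the working directory.
    with tempfile.TemporaryDirectory() as tmp:
        folder = ensure_store_path('city', base=tmp)
        url = 'https://alpha.wallhaven.cc/wallpapers/thumb/small/th-729560.jpg'
        print(folder / image_name(url))
```

Calling `ensure_store_path` once before the downloads start, rather than once per image, would also avoid repeating the file system check for every URL.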
- Complete the main function:
```python
import asyncio

import aiohttp


async def main(loop):
    async with aiohttp.ClientSession() as session:
        # One task per URL, all sharing the same session
        tasks = [loop.create_task(download_img(session, url)) for url in urls]
        await asyncio.wait(tasks)


loop = asyncio.get_event_loop()
loop.run_until_complete(main(loop))
loop.close()
```
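On Python 3.7 and later, the loop handling above can be delegated to `asyncio.run`, and `asyncio.gather` can take the place of creating the tasks by hand. The following is only a sketch under those assumptions: `run_all` and `fake_download` are hypothetical names, and `fake_download` merely sleeps as a stand-in for `download_img(session, url)`, so no network access is involved.

```python
import asyncio


async def run_all(urls, worker):
    """Run worker(url) concurrently for every URL; results keep input order."""
    # gather wraps each coroutine in a task and waits for all of them.
    return await asyncio.gather(*(worker(url) for url in urls))


async def fake_download(url):
    # Stand-in for download_img(session, url): only simulates network latency.
    await asyncio.sleep(0.01)
    return url.split('/')[-1]


if __name__ == '__main__':
    # asyncio.run creates the event loop, runs the coroutine, and closes the loop.
    names = asyncio.run(run_all(['a/1.jpg', 'a/2.jpg'], fake_download))
    print(names)
```

One behavioral difference worth knowing: `asyncio.wait` leaves exceptions stored inside the finished tasks, while `asyncio.gather` with its default arguments raises the first exception to the caller, so a failed download is harder to miss.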
- Measure the download time:
```python
import time

if __name__ == '__main__':
    t1 = time.time()
    loop = asyncio.get_event_loop()
    loop.run_until_complete(main(loop))
    loop.close()
    print('Elapsed: %fs' % (time.time() - t1))
```
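A single timed run of a network job is noisy, so it can help to repeat the measurement. Below is a minimal sketch of such a helper. `time_runs` is a hypothetical name, and `time.perf_counter` is used instead of `time.time` because it is a monotonic clock intended for measuring intervals. The demo times a short `time.sleep` rather than the crawler itself.

```python
import time


def time_runs(func, repeats=3):
    """Call func() `repeats` times and return the elapsed seconds of each run."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        func()
        timings.append(time.perf_counter() - start)
    return timings


if __name__ == '__main__':
    # A short sleep stands in for the real download job.
    timings = time_runs(lambda: time.sleep(0.01))
    print('min %.3fs, max %.3fs' % (min(timings), max(timings)))
```

To time the crawler this way, each call of `func` would need its own event loop, since a loop cannot be reused after `loop.close()`.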
- Result:
26 images took 1.476328s
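The urls list has 50 entries but only 26 distinct links, which is why 26 files end up on disk: every repeated link is fetched again and written over the same file name. If that is not intended, the repeats could be dropped before downloading. A minimal sketch on a three-entry sample, where `unique_urls` is a hypothetical helper name and not part of the original crawler:

```python
def unique_urls(urls):
    """Return the URLs with repeats removed, keeping first-seen order."""
    # dict keys are unique and preserve insertion order (Python 3.7+).
    return list(dict.fromkeys(urls))


sample = [
    "https://alpha.wallhaven.cc/wallpapers/thumb/small/th-729560.jpg",
    "https://alpha.wallhaven.cc/wallpapers/thumb/small/th-724055.jpg",
    "https://alpha.wallhaven.cc/wallpapers/thumb/small/th-729560.jpg",  # repeat
]

if __name__ == '__main__':
    for url in unique_urls(sample):
        print(url)
```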
- Download using multiprocessing:
```python
import requests
import multiprocessing
import os
import time

urls = [
    "https://alpha.wallhaven.cc/wallpapers/thumb/small/th-729560.jpg",
    "https://alpha.wallhaven.cc/wallpapers/thumb/small/th-724055.jpg",
    "https://alpha.wallhaven.cc/wallpapers/thumb/small/th-716644.jpg",
    "https://alpha.wallhaven.cc/wallpapers/thumb/small/th-716643.jpg",
    "https://alpha.wallhaven.cc/wallpapers/thumb/small/th-716645.jpg",
    "https://alpha.wallhaven.cc/wallpapers/thumb/small/th-686220.jpg",
    "https://alpha.wallhaven.cc/wallpapers/thumb/small/th-686212.jpg",
    "https://alpha.wallhaven.cc/wallpapers/thumb/small/th-652608.jpg",
    "https://alpha.wallhaven.cc/wallpapers/thumb/small/th-639894.jpg",
    "https://alpha.wallhaven.cc/wallpapers/thumb/small/th-639893.jpg",
    "https://alpha.wallhaven.cc/wallpapers/thumb/small/th-639892.jpg",
    "https://alpha.wallhaven.cc/wallpapers/thumb/small/th-639890.jpg",
    "https://alpha.wallhaven.cc/wallpapers/thumb/small/th-639888.jpg",
    "https://alpha.wallhaven.cc/wallpapers/thumb/small/th-468197.jpg",
    "https://alpha.wallhaven.cc/wallpapers/thumb/small/th-467016.jpg",
    "https://alpha.wallhaven.cc/wallpapers/thumb/small/th-467012.jpg",
    "https://alpha.wallhaven.cc/wallpapers/thumb/small/th-467009.jpg",
    "https://alpha.wallhaven.cc/wallpapers/thumb/small/th-467007.jpg",
    "https://alpha.wallhaven.cc/wallpapers/thumb/small/th-467005.jpg",
    "https://alpha.wallhaven.cc/wallpapers/thumb/small/th-466997.jpg",
    "https://alpha.wallhaven.cc/wallpapers/thumb/small/th-466998.jpg",
    "https://alpha.wallhaven.cc/wallpapers/thumb/small/th-466993.jpg",
    "https://alpha.wallhaven.cc/wallpapers/thumb/small/th-466994.jpg",
    "https://alpha.wallhaven.cc/wallpapers/thumb/small/th-466995.jpg",
    "https://alpha.wallhaven.cc/wallpapers/thumb/small/th-729560.jpg",
    "https://alpha.wallhaven.cc/wallpapers/thumb/small/th-724055.jpg",
    "https://alpha.wallhaven.cc/wallpapers/thumb/small/th-716644.jpg",
    "https://alpha.wallhaven.cc/wallpapers/thumb/small/th-716643.jpg",
    "https://alpha.wallhaven.cc/wallpapers/thumb/small/th-716645.jpg",
    "https://alpha.wallhaven.cc/wallpapers/thumb/small/th-686220.jpg",
    "https://alpha.wallhaven.cc/wallpapers/thumb/small/th-686212.jpg",
    "https://alpha.wallhaven.cc/wallpapers/thumb/small/th-652608.jpg",
    "https://alpha.wallhaven.cc/wallpapers/thumb/small/th-639894.jpg",
    "https://alpha.wallhaven.cc/wallpapers/thumb/small/th-639893.jpg",
    "https://alpha.wallhaven.cc/wallpapers/thumb/small/th-639892.jpg",
    "https://alpha.wallhaven.cc/wallpapers/thumb/small/th-639890.jpg",
    "https://alpha.wallhaven.cc/wallpapers/thumb/small/th-639888.jpg",
    "https://alpha.wallhaven.cc/wallpapers/thumb/small/th-468197.jpg",
    "https://alpha.wallhaven.cc/wallpapers/thumb/small/th-467016.jpg",
    "https://alpha.wallhaven.cc/wallpapers/thumb/small/th-467012.jpg",
    "https://alpha.wallhaven.cc/wallpapers/thumb/small/th-467009.jpg",
    "https://alpha.wallhaven.cc/wallpapers/thumb/small/th-467007.jpg",
    "https://alpha.wallhaven.cc/wallpapers/thumb/small/th-467005.jpg",
    "https://alpha.wallhaven.cc/wallpapers/thumb/small/th-466997.jpg",
    "https://alpha.wallhaven.cc/wallpapers/thumb/small/th-466998.jpg",
    "https://alpha.wallhaven.cc/wallpapers/thumb/small/th-466993.jpg",
    "https://alpha.wallhaven.cc/wallpapers/thumb/small/th-466994.jpg",
    "https://alpha.wallhaven.cc/wallpapers/thumb/small/th-466995.jpg",
    "https://alpha.wallhaven.cc/wallpapers/thumb/small/th-466992.jpg",
    "https://alpha.wallhaven.cc/wallpapers/thumb/small/th-466989.jpg"
]

req_session = requests.Session()
req_session.headers['user-agent'] = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36'


# Return the image storage path, creating it if it does not exist
def get_store_path(dir_name):
    current_path = os.path.abspath('.')
    target_path = os.path.join(current_path, 'wallpaper/%s' % dir_name)
    # exist_ok=True avoids a FileExistsError when two worker processes
    # try to create the folder at the same moment
    os.makedirs(target_path, exist_ok=True)
    return target_path


# Download one image, streaming the body in 128-byte chunks
def download_img(url):
    img = req_session.get(url, stream=True)
    image_name = url.split('/')[-1]
    with open('%s/%s' % (get_store_path('city'), image_name), 'wb') as fd:
        for chunk in img.iter_content(chunk_size=128):
            fd.write(chunk)


def main():
    # Pool() starts one worker process per CPU core by default
    p = multiprocessing.Pool()
    for url in urls:
        p.apply_async(download_img, args=(url,))
    p.close()
    p.join()


if __name__ == '__main__':
    t1 = time.time()
    main()
    print('Elapsed: %fs' % (time.time() - t1))
```
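Result
I ran both versions several times, and each took roughly 1.4 to 2.7s. Coroutines are clearly powerful: with just a single thread they achieved an effect similar to multiprocessing.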
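The reason a single thread can keep up is that downloading is mostly waiting, and the event loop overlaps those waits. The sketch below illustrates this without any network access: `asyncio.sleep` stands in for waiting on a response, and all function names (`fake_request`, `sequential`, `concurrent`, `elapsed`) are hypothetical. It is an illustration of the idea, not a benchmark of the crawler.

```python
import asyncio
import time


async def fake_request(delay):
    # asyncio.sleep stands in for waiting on a network response.
    await asyncio.sleep(delay)
    return delay


async def sequential(delays):
    # Each wait finishes before the next one starts.
    return [await fake_request(d) for d in delays]


async def concurrent(delays):
    # All waits are in flight at once on the same thread.
    return await asyncio.gather(*(fake_request(d) for d in delays))


def elapsed(coro):
    """Run the coroutine to completion and return the seconds it took."""
    start = time.perf_counter()
    asyncio.run(coro)
    return time.perf_counter() - start


if __name__ == '__main__':
    delays = [0.05] * 10
    print('sequential: %.2fs' % elapsed(sequential(delays)))
    print('concurrent: %.2fs' % elapsed(concurrent(delays)))
```

With ten simulated requests of 0.05s each, the sequential version needs about the sum of the delays, while the concurrent one needs roughly the longest single delay.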