N Ways to Write a Python Web Crawler

Posted by 山陰少年 on 2018-10-16

Original URL: https://blog.csdn.net/jclian91/article/details/83095306

Where the Problem Came From

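A few days ago, a reader of my WeChat official account (Python爬蟲及演算法) asked me how to write a crawler for the following requirement. The page to crawl is shown below (URL: https://www.wikidata.org/w/index.php?title=Special:WhatLinksHere/Q5&limit=500&from=0):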

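The goal is to collect the name and description of each famous person inside the red box (there are 500 records; the screenshot shows only some of them). The description appears on the page you reach by clicking the person's name, as shown in the figure below: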

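This means we need to crawl 500 such pages, that is, make 500 HTTP requests (let us assume so for now), and then extract the name and description from each page. Some entries are not people and have no description; those can be skipped. Finally, the URL of each page can be found next to the person's name on the first page; for example, the URL suffix for George Washington's page is Q23.
That is roughly the requirement for the crawler.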
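
As a small illustration of that last point, the sketch below turns the relative links of a list page into full page URLs using only the standard library. It is not the code used later in this article: the sample HTML is a hand-written, simplified stand-in for the real page (the `/wiki/Q23` form of the links is an assumption), and `LinkCollector` is a hypothetical helper name.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

BASE_URL = "https://www.wikidata.org"

# Hand-written, simplified stand-in for the list page (assumption: each
# record is an <li> whose first <a> points to the item page).
SAMPLE_HTML = """
<ul id="mw-whatlinkshere-list">
  <li><a href="/wiki/Q23">George Washington (Q23)</a> <a href="/w/index.php?x=1">links</a></li>
  <li><a href="/wiki/Q42">Douglas Adams (Q42)</a></li>
</ul>
"""


class LinkCollector(HTMLParser):
    """Collect the href of the first <a> inside every <li>."""

    def __init__(self):
        super().__init__()
        self.hrefs = []
        self._need_link = False  # True after <li> until its first <a> is seen

    def handle_starttag(self, tag, attrs):
        if tag == "li":
            self._need_link = True
        elif tag == "a" and self._need_link:
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)
                self._need_link = False


collector = LinkCollector()
collector.feed(SAMPLE_HTML)
# urljoin() resolves each relative link against the site root.
urls = [urljoin(BASE_URL, href) for href in collector.hrefs]
print(urls)
```

Only the first link of each record is kept, which mirrors the idea that the item link comes first and any other links in the same record are ignored.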

Four Ways to Write the Crawler

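First, let us work out the approach: from the first page (https://www.wikidata.org/w/index.php?title=Special:WhatLinksHere/Q5&limit=500&from=0) we obtain the URLs of the 500 item pages, then we crawl those 500 pages for each person's name and description, skipping any page that has no description.
Next, we present four ways to implement this crawler and discuss the strengths and weaknesses of each, in the hope of giving readers a better feel for crawling. The four methods are: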

The plain method (synchronous; requests + BeautifulSoup)
Concurrency (the concurrent.futures module plus requests + BeautifulSoup)
Asynchronous I/O (aiohttp + asyncio + requests + BeautifulSoup)
The Scrapy framework

The Plain Method

The plain method is the synchronous one: it uses requests + BeautifulSoup and fetches the pages one after another. The complete Python code is as follows:

import requests
from bs4 import BeautifulSoup
import time

# Start time
t1 = time.time()
print('#' * 50)

url = "http://www.wikidata.org/w/index.php?title=Special:WhatLinksHere/Q5&limit=500&from=0"
# Request headers
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.87 Safari/537.36'}
# Send the HTTP request for the list page
req = requests.get(url, headers=headers)
# Parse the page
soup = BeautifulSoup(req.text, "lxml")
# Find the <li> records that link to each item page
human_list = soup.find(id='mw-whatlinkshere-list').find_all('li')

urls = []
# Build the full URL of each item page
for human in human_list:
    href = human.find('a')['href']
    urls.append('https://www.wikidata.org' + href)

# Fetch one page and print its name and description
def parser(url):
    req = requests.get(url)
    # Parse the response text as HTML with BeautifulSoup
    soup = BeautifulSoup(req.text, "lxml")
    # Extract the name and description
    name = soup.find('span', class_="wikibase-title-label")
    desc = soup.find('span', class_="wikibase-descriptionview-text")
    # Skip pages that lack either field
    if name is not None and desc is not None:
        print('%-40s,\t%s' % (name.text, desc.text))

for url in urls:
    parser(url)

t2 = time.time()  # End time
print('Plain method, total time: %s' % (t2 - t1))
print('#' * 50)

The output is as follows (the middle part is omitted and replaced by an ellipsis):

##################################################
George Washington                       ,	first President of the United States
Douglas Adams                           ,	British author and humorist (1952–2001)
......
Willoughby Newton                       ,	Politician from Virginia, USA
Mack Wilberg                            ,	American conductor
Plain method, total time: 724.9654655456543
##################################################

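The synchronous method takes about 725 seconds in total, a little over 12 minutes.
The plain method is simple to reason about and easy to implement, but it is inefficient and slow. So let us try concurrency.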

The Concurrent Method

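The concurrent method speeds up the plain method with multiple threads. We use the concurrent.futures module with a pool of at most 20 worker threads (how much this helps in practice depends on the machine). The complete Python code is as follows: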

import requests
from bs4 import BeautifulSoup
import time
from concurrent.futures import ThreadPoolExecutor, wait, ALL_COMPLETED

# Start time
t1 = time.time()
print('#' * 50)

url = "http://www.wikidata.org/w/index.php?title=Special:WhatLinksHere/Q5&limit=500&from=0"
# Request headers
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.87 Safari/537.36'}
# Send the HTTP request for the list page
req = requests.get(url, headers=headers)
# Parse the page
soup = BeautifulSoup(req.text, "lxml")
# Find the <li> records that link to each item page
human_list = soup.find(id='mw-whatlinkshere-list').find_all('li')

urls = []
# Build the full URL of each item page
for human in human_list:
    href = human.find('a')['href']
    urls.append('https://www.wikidata.org' + href)

# Fetch one page and print its name and description
def parser(url):
    req = requests.get(url)
    # Parse the response text as HTML with BeautifulSoup
    soup = BeautifulSoup(req.text, "lxml")
    # Extract the name and description
    name = soup.find('span', class_="wikibase-title-label")
    desc = soup.find('span', class_="wikibase-descriptionview-text")
    # Skip pages that lack either field
    if name is not None and desc is not None:
        print('%-40s,\t%s' % (name.text, desc.text))

# Speed up the crawl with a thread pool
executor = ThreadPoolExecutor(max_workers=20)
# submit() takes the function first, followed by its arguments (any number of them)
future_tasks = [executor.submit(parser, url) for url in urls]
# Block until every task has finished before continuing
wait(future_tasks, return_when=ALL_COMPLETED)

t2 = time.time()  # End time
print('Concurrent method, total time: %s' % (t2 - t1))
print('#' * 50)

The output is as follows (the middle part is omitted and replaced by an ellipsis):

##################################################
Larry Sanger                            ,	American former professor, co-founder of Wikipedia, founder of Citizendium and other projects
Ken Jennings                            ,	American game show contestant and writer
......
Antoine de Saint-Exupery                ,	French writer and aviator
Michael Jackson                         ,	American singer, songwriter and dancer
Concurrent method, total time: 226.7499692440033
##################################################

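With multithreading, the crawler runs in about 227 seconds, roughly one third of the time of the plain method, which is a clear improvement. The price is that pages are no longer processed in order, and switching between threads has a noticeable cost that grows with the number of threads.
For a comparison of multithreading and the plain method in terms of speed, see the article: Python爬蟲之多執行緒下載豆瓣Top250電影圖片 (on downloading the Douban Top 250 movie posters with multiple threads).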
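
If the order of the results matters, one option is to have the worker return a value instead of printing it, and to let `executor.map` hand the results back in input order. The following is a minimal sketch of that idea, not a change to the script above: `time.sleep` stands in for the HTTP request, and `fake_fetch`, the item IDs and the delays are made up for illustration.

```python
import time
from concurrent.futures import ThreadPoolExecutor

# Hypothetical per-item delays in seconds; later items finish sooner.
DELAYS = {"Q23": 0.3, "Q42": 0.2, "Q76": 0.1}


def fake_fetch(item_id):
    """Stand-in for requests.get(): sleep, then return a result."""
    time.sleep(DELAYS[item_id])
    return "page for " + item_id


start = time.perf_counter()
with ThreadPoolExecutor(max_workers=3) as executor:
    # map() runs the calls concurrently but yields results in input order.
    results = list(executor.map(fake_fetch, DELAYS))
elapsed = time.perf_counter() - start

print(results)
print("elapsed: %.2fs" % elapsed)
```

The `with` block also waits for all workers to finish, so no explicit `wait()` call is needed in this form.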

The Asynchronous Method

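Asynchronous code is an effective way to speed up a crawler. aiohttp handles the HTTP requests asynchronously, and asyncio provides the asynchronous I/O. Note that, at the time of writing, aiohttp only supports Python 3.5.3 and later. The complete Python code for the asynchronous version of the crawler is as follows: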

import requests
from bs4 import BeautifulSoup
import time
import aiohttp
import asyncio

# Start time
t1 = time.time()
print('#' * 50)

url = "http://www.wikidata.org/w/index.php?title=Special:WhatLinksHere/Q5&limit=500&from=0"
# Request headers
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.87 Safari/537.36'}
# Send the HTTP request for the list page (synchronously, with requests)
req = requests.get(url, headers=headers)
# Parse the page
soup = BeautifulSoup(req.text, "lxml")
# Find the <li> records that link to each item page
human_list = soup.find(id='mw-whatlinkshere-list').find_all('li')

urls = []
# Build the full URL of each item page
for human in human_list:
    href = human.find('a')['href']
    urls.append('https://www.wikidata.org' + href)

# Asynchronous HTTP request
async def fetch(session, url):
    async with session.get(url) as response:
        return await response.text()

# Parse one page
async def parser(html):
    # Parse the response text as HTML with BeautifulSoup
    soup = BeautifulSoup(html, "lxml")
    # Extract the name and description
    name = soup.find('span', class_="wikibase-title-label")
    desc = soup.find('span', class_="wikibase-descriptionview-text")
    # Skip pages that lack either field
    if name is not None and desc is not None:
        print('%-40s,\t%s' % (name.text, desc.text))

# Download one page, then extract its name and description
async def download(url):
    async with aiohttp.ClientSession() as session:
        try:
            html = await fetch(session, url)
            await parser(html)
        except Exception as err:
            print(err)

# Run all downloads concurrently on the asyncio event loop.
# asyncio.run() needs Python 3.7+; it replaces the older
# get_event_loop() / run_until_complete() pattern.
async def main():
    await asyncio.gather(*(download(url) for url in urls))

asyncio.run(main())

t2 = time.time()  # End time
print('Async method, total time: %s' % (t2 - t1))
print('#' * 50)

The output is as follows (the middle part is omitted and replaced by an ellipsis):

##################################################
Frédéric Taddeï                         ,	French journalist and TV host
Gabriel Gonzáles Videla                 ,	Chilean politician
......
Denmark                                 ,	sovereign state and Scandinavian country in northern Europe
Usain Bolt                              ,	Jamaican sprinter and soccer player
Async method, total time: 126.9002583026886
##################################################

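Clearly, the asynchronous method combines asynchronous I/O with concurrency, so it is naturally much faster: it takes about one sixth of the time of the plain method (127 seconds versus 725 seconds). It is efficient, but it requires asynchronous programming, which takes some time to learn.
For a comparison of the asynchronous method and the plain method in terms of speed, see the article: 利用aiohttp實現非同步爬蟲 (on implementing an asynchronous crawler with aiohttp).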
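
The script above starts all 500 downloads at once. A common variation is to cap the number of requests in flight with `asyncio.Semaphore`. The sketch below shows only that pattern and is not the code that produced the timing above: `asyncio.sleep` stands in for the aiohttp request, and `fake_download`, `MAX_IN_FLIGHT` and the URLs are illustrative assumptions. It uses `asyncio.run`, which requires Python 3.7 or later.

```python
import asyncio

MAX_IN_FLIGHT = 5  # assumed cap on simultaneous requests
in_flight = 0      # current number of simulated requests
peak = 0           # highest value in_flight has reached


async def fake_download(url, semaphore):
    """Stand-in for an aiohttp request, guarded by the semaphore."""
    global in_flight, peak
    async with semaphore:
        in_flight += 1
        peak = max(peak, in_flight)
        await asyncio.sleep(0.01)  # pretend to wait for the network
        in_flight -= 1
    # Return the last path segment, e.g. "Q23"
    return url.rsplit("/", 1)[-1]


async def main(urls):
    semaphore = asyncio.Semaphore(MAX_IN_FLIGHT)
    # gather() returns results in the same order as its arguments.
    return await asyncio.gather(*(fake_download(u, semaphore) for u in urls))


urls = ["https://www.wikidata.org/wiki/Q%d" % i for i in range(1, 21)]
ids = asyncio.run(main(urls))
print(ids[:3], "peak in flight:", peak)
```

A cap like this trades some speed for a gentler load on the server; the right value depends on the site and is not something this article measured.
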
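If 127 seconds still feels slow, try the following asynchronous code. The only difference from the previous asynchronous version is that it uses regular expressions instead of BeautifulSoup to extract the content from each page: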

import requests
from bs4 import BeautifulSoup
import time
import aiohttp
import asyncio
import re

# Start time
t1 = time.time()
print('#' * 50)

url = "http://www.wikidata.org/w/index.php?title=Special:WhatLinksHere/Q5&limit=500&from=0"
# Request headers
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.87 Safari/537.36'}
# Send the HTTP request for the list page (synchronously, with requests)
req = requests.get(url, headers=headers)
# Parse the page
soup = BeautifulSoup(req.text, "lxml")
# Find the <li> records that link to each item page
human_list = soup.find(id='mw-whatlinkshere-list').find_all('li')

urls = []
# Build the full URL of each item page
for human in human_list:
    href = human.find('a')['href']
    urls.append('https://www.wikidata.org' + href)

# Asynchronous HTTP request
async def fetch(session, url):
    async with session.get(url) as response:
        return await response.text()

# Parse one page
async def parser(html):
    # Extract the fields with regular expressions
    try:
        name = re.findall(r'<span class="wikibase-title-label">(.+?)</span>', html)[0]
        desc = re.findall(r'<span class="wikibase-descriptionview-text">(.+?)</span>', html)[0]
        print('%-40s,\t%s' % (name, desc))
    except IndexError:
        # findall() returned an empty list: no name or description, skip the page
        pass

# Download one page, then extract its name and description
async def download(url):
    async with aiohttp.ClientSession() as session:
        try:
            html = await fetch(session, url)
            await parser(html)
        except Exception as err:
            print(err)

# Run all downloads concurrently on the asyncio event loop.
# asyncio.run() needs Python 3.7+; it replaces the older
# get_event_loop() / run_until_complete() pattern.
async def main():
    await asyncio.gather(*(download(url) for url in urls))

asyncio.run(main())

t2 = time.time()  # End time
print('Async method (regex), total time: %s' % (t2 - t1))
print('#' * 50)

The output is as follows (the middle part is omitted and replaced by an ellipsis):

##################################################
Dejen Gebremeskel                       ,	Ethiopian long-distance runner
Erik Kynard                             ,	American high jumper
......
Buzz Aldrin                             ,	American astronaut
Egon Krenz                              ,	former General Secretary of the Socialist Unity Party of East Germany
Async method (regex), total time: 16.521944999694824
##################################################

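16.5 seconds is only about 1/44 of the plain method's time, which is astonishingly fast (thanks to the person who suggested trying this). I had implemented the asynchronous method myself, but with BeautifulSoup as the parser it took 127 seconds; I did not expect that switching to regular expressions would make such a dramatic difference. So although BeautifulSoup is reasonably fast at parsing, in the asynchronous version it still held the speed back: parsing is synchronous work, and while it runs the event loop cannot make progress on the other downloads. The drawback of the regex approach is that once the content you need is more complex, ordinary regular expressions are no longer up to the job and you have to find another way.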
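
To make that trade-off concrete, here is a small sketch of the regular-expression approach applied to hand-written HTML fragments (the fragments are assumptions for illustration; real pages are much larger). It also shows the kind of case where a simple pattern starts to fall short: when the target span contains nested markup, the captured text still includes the inner tags.

```python
import re

NAME_RE = re.compile(r'<span class="wikibase-title-label">(.+?)</span>')
DESC_RE = re.compile(r'<span class="wikibase-descriptionview-text">(.+?)</span>')


def extract(html):
    """Return (name, desc), or None when either field is missing."""
    name = NAME_RE.search(html)
    desc = DESC_RE.search(html)
    if name is None or desc is None:
        return None
    return name.group(1), desc.group(1)


# Hand-written fragments, not real page content.
simple = ('<span class="wikibase-title-label">Douglas Adams</span>'
          '<span class="wikibase-descriptionview-text">British author</span>')
no_desc = '<span class="wikibase-title-label">Some item</span>'
nested = ('<span class="wikibase-title-label"><b>Douglas</b> Adams</span>'
          '<span class="wikibase-descriptionview-text">British author</span>')

print(extract(simple))   # both fields found
print(extract(no_desc))  # description missing, so the page is skipped
print(extract(nested))   # the name still contains the <b> tags
```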

The Scrapy Framework

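Finally, we solve the same task with Scrapy, the well-known Python crawling framework. The project we create is called wikiDataScrapy, and its structure is as follows: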

In settings.py, set "ROBOTSTXT_OBEY = False". Then modify items.py as follows:

# -*- coding: utf-8 -*-

import scrapy

class WikidatascrapyItem(scrapy.Item):
    # define the fields for your item here like:
    name = scrapy.Field()
    desc = scrapy.Field()

Then create wikiSpider.py in the spiders folder with the following code:

import scrapy.cmdline
from wikiDataScrapy.items import WikidatascrapyItem
import requests
from bs4 import BeautifulSoup

# Collect the 500 URLs to request, using requests + BeautifulSoup
def get_urls():
    url = "http://www.wikidata.org/w/index.php?title=Special:WhatLinksHere/Q5&limit=500&from=0"
    # Request headers
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.87 Safari/537.36'}
    # Send the HTTP request for the list page
    req = requests.get(url, headers=headers)
    # Parse the page
    soup = BeautifulSoup(req.text, "lxml")
    # Find the <li> records that link to each item page
    human_list = soup.find(id='mw-whatlinkshere-list').find_all('li')

    urls = []
    # Build the full URL of each item page
    for human in human_list:
        href = human.find('a')['href']
        urls.append('https://www.wikidata.org' + href)

    # print(urls)
    return urls

# Crawl the pages with the Scrapy framework
class bookSpider(scrapy.Spider):
    name = 'wikiScrapy'  # Spider name
    start_urls = get_urls()  # The 500 URLs to crawl

    def parse(self, response):
        item = WikidatascrapyItem()
        # name and description
        item['name'] = response.css('span.wikibase-title-label').xpath('text()').extract_first()
        item['desc'] = response.css('span.wikibase-descriptionview-text').xpath('text()').extract_first()

        yield item

# Run the spider and export the items to a CSV file
scrapy.cmdline.execute(['scrapy', 'crawl', 'wikiScrapy', '-o', 'wiki.csv', '-t', 'csv'])

The output is as follows (only the final Scrapy stats summary is shown):

{'downloader/request_bytes': 166187,
 'downloader/request_count': 500,
 'downloader/request_method_count/GET': 500,
 'downloader/response_bytes': 18988798,
 'downloader/response_count': 500,
 'downloader/response_status_count/200': 500,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2018, 10, 16, 9, 49, 15, 761487),
 'item_scraped_count': 500,
 'log_count/DEBUG': 1001,
 'log_count/INFO': 8,
 'response_received_count': 500,
 'scheduler/dequeued': 500,
 'scheduler/dequeued/memory': 500,
 'scheduler/enqueued': 500,
 'scheduler/enqueued/memory': 500,
 'start_time': datetime.datetime(2018, 10, 16, 9, 48, 44, 58673)}

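As you can see, all 500 pages were crawled successfully in about 32 seconds (the gap between start_time and finish_time in the stats), which is also quite fast. Now look at the generated wiki.csv file, which contains every name and description that was output, as shown in the figure below: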

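As you can see, the columns of the output CSV file are not in a fixed order. As for the problem of blank lines in the CSV file that Scrapy exports, see this Stack Overflow answer: https://stackoverflow.com/questions/39477662/scrapy-csv-file-has-uniform-empty-rows/43394566#43394566 .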
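
For the column order specifically, Scrapy has a `FEED_EXPORT_FIELDS` setting that fixes which fields are exported and in what order. A sketch of the line one could add to the project's settings.py, assuming the two field names defined in items.py above (this was not part of the run shown here):

```python
# settings.py (fragment)
# Export only these item fields to the feed, in this order.
FEED_EXPORT_FIELDS = ['name', 'desc']
```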

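The advantage of building a crawler with Scrapy is that it is a mature framework with support for asynchronous and concurrent crawling and good fault tolerance (for example, the code here never handles the case where name or description cannot be found). However, if you need to modify middleware frequently, you may be better off writing your own crawler, and in this test it was not faster than our hand-written asynchronous crawler with regular expressions (about 32 seconds versus 16.5 seconds). Still, the built-in CSV export is genuinely handy.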
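
Putting the timings reported in this article side by side makes the comparison easier to read. The sketch below only does the arithmetic on the numbers printed in the outputs above; the Scrapy figure is taken as the difference between `finish_time` and `start_time` in the stats, and the labels are shorthand chosen for this example.

```python
from datetime import datetime

# Wall-clock timings in seconds, as reported in the outputs above.
timings = {
    "plain (sequential)": 724.9654655456543,
    "threads (20 workers)": 226.7499692440033,
    "async + BeautifulSoup": 126.9002583026886,
    "async + regex": 16.521944999694824,
}
# Scrapy: difference between finish_time and start_time in the stats dump.
timings["Scrapy"] = (
    datetime(2018, 10, 16, 9, 49, 15, 761487)
    - datetime(2018, 10, 16, 9, 48, 44, 58673)
).total_seconds()

baseline = timings["plain (sequential)"]
for label, seconds in timings.items():
    # Columns: method, elapsed seconds, speedup relative to the plain method
    print("%-22s %8.1f s   %5.1fx" % (label, seconds, baseline / seconds))
```

These are single runs against a live site, so the ratios are indicative rather than a benchmark.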

Summary

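This has been a long article comparing four ways to write a crawler. Each has its own pros and cons, as discussed above. In real problems, the most advanced tool or method is not automatically the best one; it depends on the problem at hand.
That is all for this article. Thanks for reading!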

Note: I now have a WeChat official account, Python爬蟲與演算法 (WeChat ID: easy_web_scrape). You are welcome to follow it.
