Comparing single-threaded and multithreaded (multiprocess) efficiency in Python with a small web image crawler

Published 2016-09-28

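Critics of Python often claim that multithreaded programming in Python is too hard, pointing to the well-known Global Interpreter Lock (GIL), which prevents multiple threads of Python code from running at the same time. As a result, if you are not a Python developer and come from another language such as C++ or Java, the Python threading module will not behave the way you expect. It must be made clear, though, that it is still possible to write concurrent or parallel Python code in certain specific ways, and that the choice has a very different effect on performance. If you do not yet understand the difference between concurrency and parallelism, look it up on Baidu, Google, or Wikipedia first.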

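In this introductory tutorial on concurrent and parallel programming in Python, we will write a small Python program that downloads the most popular images from Imgur. We will write one version that downloads the images sequentially and others that download several images at once. Before starting, you need to register an Imgur application. If you do not have an Imgur account yet, create one first.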

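The Python code in this tutorial was tested with Python 3.4.2, but it should run on Python 2 with only minor changes. The main difference between the two Python versions, as far as this code is concerned, is the urllib2 module.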

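Note: given the harsh network environment in mainland China, the translator got stuck at the Imgur account registration step while testing the original code. For convenience, the translator therefore switched to a different image source. The first attempt used an image API offered by a vendor, but, whether because of the network or for some other reason, the program failed to exit when reading the last image. In frustration, the translator fell back on plain scraping and, following the requests and beautifulsoup4 documentation, crawled 253 images from a certain Toutiao site as the example. The code in this translation has been replaced with the translator's code; for the original code, see the original article, Python Multithreading Tutorial: Concurrency and Parallelism.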

Getting started with Python multithreading

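First, let us create a module named download.py. This file contains all the functions needed to fetch and download the images. We split the functionality into the following three functions: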

get_links
download_link
setup_download_dir

The third function, setup_download_dir, creates the directory that holds the downloaded images if it does not already exist.

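We first combine requests and beautifulsoup4 to extract all the image links from a web page; this is what get_links does. Downloading an image, the job of download_link, is very simple: fetch the image by its URL and write it to a file.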

The code looks like this:

download.py

import os
from pathlib import Path

import requests
from bs4 import BeautifulSoup

# Use requests and bs4 to parse the page and return a list of all image links
def get_links(url):
    req = requests.get(url)
    soup = BeautifulSoup(req.text, "html.parser")
    return [img.attrs.get('data-src') for img in
            soup.find_all('div', class_='img-wrap')
            if img.attrs.get('data-src') is not None]

# Download the image to the local disk
def download_link(directory, link):
    img_name = '{}.jpg'.format(os.path.basename(link))
    download_path = directory / img_name
    r = requests.get(link)
    with download_path.open('wb') as fd:
        fd.write(r.content)

# Set up the download directory named by the directory argument, creating it if it does not exist
def setup_download_dir(directory):
    download_dir = Path(directory)
    if not download_dir.exists():
        download_dir.mkdir()
    return download_dir
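
A slightly more defensive variant of two of these helpers might look like the sketch below. It is an illustration, not the code used for the timings in this article: ensure_dir and image_name are hypothetical names, and the sketch assumes that the last path segment of each image URL is usable as a file name. Path.mkdir accepts parents=True and exist_ok=True, which makes the explicit exists() check unnecessary.

```python
from pathlib import Path, PurePosixPath
from urllib.parse import urlparse


def ensure_dir(directory):
    """Create the directory (and any missing parents) if needed, then return it."""
    download_dir = Path(directory)
    # exist_ok=True turns the call into a no-op when the directory already exists.
    download_dir.mkdir(parents=True, exist_ok=True)
    return download_dir


def image_name(link):
    """Build a .jpg file name from the last path segment of an image URL."""
    # urlparse separates the query string, so '?size=large' stays out of the name.
    last_segment = PurePosixPath(urlparse(link).path).name
    return '{}.jpg'.format(last_segment)
```

Calling ensure_dir twice on the same path is safe, which is convenient when several scripts share one download directory.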

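Next, we write a module that uses these functions to download the images one by one. We name it single.py. This first, simple version of the image downloader consists of a single main function. It calls setup_download_dir to create the download directory. It then uses get_links to collect the image links; because a single page holds only a few images, we fetch the links from 5 pages and combine them into one list. Finally, it calls download_link on each link to write all the images to disk. Here is the code for single.py: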

single.py

from time import time
from itertools import chain

from download import setup_download_dir, get_links, download_link


def main():
    ts = time()

    url1 = 'http://www.toutiao.com/a6333981316853907714'
    url2 = 'http://www.toutiao.com/a6334459308533350658'
    url3 = 'http://www.toutiao.com/a6313664289211924737'
    url4 = 'http://www.toutiao.com/a6334337170774458625'
    url5 = 'http://www.toutiao.com/a6334486705982996738'
    download_dir = setup_download_dir('single_imgs')
    links = list(chain(
        get_links(url1),
        get_links(url2),
        get_links(url3),
        get_links(url4),
        get_links(url5),
    ))
    for link in links:
        download_link(download_dir, link)
    print('Downloaded {} images in total'.format(len(links)))
    print('Took {}s'.format(time() - ts))


if __name__ == '__main__':
    main()

"""
Downloaded 253 images in total
Took 166.0219452381134s
"""

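On my laptop, this script took 166 seconds to download 253 images. Note that the time will vary with your network connection. 166 seconds is not too long, but what if we wanted to download more images, say 2,530 instead of 253? At an average of about 0.66 seconds per image (166 s / 253), 2,530 images would take about 28 minutes, and 25,300 images about 280 minutes. The good news is that concurrency and parallelism can speed this up significantly.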
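
The estimate can be reproduced with a few lines of arithmetic. The sketch below assumes that download time scales linearly with the number of images, which real networks only approximate; the function names are made up for illustration.

```python
def seconds_per_image(total_seconds, image_count):
    """Average time spent on one image in the measured run."""
    return total_seconds / image_count


def projected_minutes(per_image_seconds, image_count):
    """Linear extrapolation of the total time, in minutes."""
    return per_image_seconds * image_count / 60


# Measured run from single.py: 253 images in about 166.02 seconds.
per_image = seconds_per_image(166.02, 253)
for count in (253, 2530, 25300):
    print('{} images: about {:.1f} minutes'.format(
        count, projected_minutes(per_image, count)))
```

This works out to roughly 2.8, 27.7 and 276.7 minutes for the three image counts.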

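The code examples that follow show only the code added to make the program concurrent or parallel. For convenience, all the Python scripts are available in this GitHub repository. (Note: that is the original author's GitHub repository, which holds the code for downloading Imgur images; the code for this article is stored here: concurrency-parallelism-demo.)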

Using multithreading for concurrency and parallelism

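Threads are one of the best-known ways to achieve concurrency and parallelism in Python. Threads are usually a feature provided by the operating system. They are lighter than processes and share most of their memory space.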

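In this Python multithreading tutorial, we write a new module to replace single.py. This module creates a pool of 8 threads, for a total of 9 threads including the main thread. I chose 8 worker threads because my computer has 8 cores, and one thread per core is a reasonable starting point. Even on the same machine, though, the right number of threads depends on the application and the services involved, so several factors have to be weighed.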

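The procedure is much the same as before, except for a new class, DownloadWorker, which inherits from Thread. We override the run method, which runs an infinite loop. On each iteration it first calls self.queue.get() to try to fetch an image URL (in the code, a (directory, link) tuple) from a thread-safe queue. The thread blocks until the queue has an item for it. Once the thread obtains an item, it wakes up and calls download_link from the previous script to download the image into the download directory. When the download finishes, the thread signals the queue that the task is done. This step is very important, because the queue keeps track of how many of the tasks put on it are still unfinished. If the threads did not notify the queue that a download had finished, queue.join() would block the main thread forever.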

thread_toutiao.py

import os
from queue import Queue
from threading import Thread
from time import time
from itertools import chain

from download import setup_download_dir, get_links, download_link


class DownloadWorker(Thread):

    def __init__(self, queue):
        Thread.__init__(self)
        self.queue = queue

    def run(self):
        while True:
            # Get the work from the queue and expand the tuple
            item = self.queue.get()
            if item is None:
                break
            directory, link = item
            download_link(directory, link)
            self.queue.task_done()


def main():
    ts = time()

    url1 = 'http://www.toutiao.com/a6333981316853907714'
    url2 = 'http://www.toutiao.com/a6334459308533350658'
    url3 = 'http://www.toutiao.com/a6313664289211924737'
    url4 = 'http://www.toutiao.com/a6334337170774458625'
    url5 = 'http://www.toutiao.com/a6334486705982996738'
    download_dir = setup_download_dir('thread_imgs')
    # Create a queue to communicate with the worker threads
    queue = Queue()

    links = list(chain(
        get_links(url1),
        get_links(url2),
        get_links(url3),
        get_links(url4),
        get_links(url5),
    ))

    # Create 8 worker threads
    for x in range(8):
        worker = DownloadWorker(queue)
        # Setting daemon to True will let the main thread exit even though the
        # workers are blocking
        worker.daemon = True
        worker.start()

    # Put the tasks into the queue as a tuple
    for link in links:
        queue.put((download_dir, link))

    # Causes the main thread to wait for the queue to finish processing all
    # the tasks
    queue.join()
    print('Downloaded {} images in total'.format(len(links)))
    print('Took {}s'.format(time() - ts))


if __name__ == '__main__':
    main()

"""
Downloaded 253 images in total
Took 57.710124015808105s
"""
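
To try the interplay of get(), task_done() and join() without touching the network, a minimal sketch along the following lines may help. It is not part of the original scripts: fake_download, worker and run are hypothetical names, and time.sleep stands in for the time a real download spends waiting.

```python
import time
from queue import Queue
from threading import Thread


def fake_download(item, results):
    """Stand-in for download_link: sleep to imitate waiting on the network."""
    time.sleep(0.05)
    results.append(item)  # list.append is thread-safe in CPython


def worker(queue, results):
    while True:
        item = queue.get()  # blocks until a task is available
        try:
            fake_download(item, results)
        finally:
            queue.task_done()  # always report completion, even on error


def run(task_count=16, worker_count=8):
    queue = Queue()
    results = []
    for _ in range(worker_count):
        # Daemon threads do not keep the program alive once main returns.
        Thread(target=worker, args=(queue, results), daemon=True).start()
    for item in range(task_count):
        queue.put(item)
    queue.join()  # returns once task_done() was called for every put()
    return results
```

Removing the task_done() call makes run() hang in queue.join(), which is exactly the failure mode described above.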

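On the same machine, this script took 57.7 seconds to download the same number of images, roughly 3 times as fast as the previous example. Although the download is faster, it must be pointed out that, because of the GIL, only one thread executes Python code at any given moment. The code therefore runs concurrently, not in parallel. It is still faster than the single-threaded version because downloading images is an IO-bound operation: the processor is mostly idle during a download, and most of the elapsed time is spent waiting on the network. That is why multithreading improves the download speed so much: while the current thread waits on its download, the processor can switch to another thread and keep working. With Python, or any other scripting language that has a GIL, threads can actually reduce performance. If your code performs a CPU-bound task, such as decompressing a gzip file, using multiple threads will increase the running time rather than reduce it. For CPU-bound tasks, or tasks that need true parallel execution, we can use the multiprocessing module.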
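
The effect on CPU-bound code can be observed with a small experiment like the one below. It is only a sketch: count_down is an artificial busy loop chosen for illustration, the timings depend on the machine, and the comparison assumes a standard CPython build with the GIL enabled.

```python
import time
from threading import Thread


def count_down(n):
    """Pure-Python busy loop: CPU-bound, so every step needs the GIL."""
    while n > 0:
        n -= 1
    return n


def timed_sequential(n, repeats):
    """Run the loop `repeats` times, one after another, and time it."""
    start = time.perf_counter()
    for _ in range(repeats):
        count_down(n)
    return time.perf_counter() - start


def timed_threaded(n, repeats):
    """Run the same loops in `repeats` threads and time them."""
    threads = [Thread(target=count_down, args=(n,)) for _ in range(repeats)]
    start = time.perf_counter()
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return time.perf_counter() - start


if __name__ == '__main__':
    n = 2000000
    print('sequential: {:.2f}s'.format(timed_sequential(n, 4)))
    print('threaded:   {:.2f}s'.format(timed_threaded(n, 4)))
```

On a GIL build the threaded time is typically no better than the sequential one, in contrast to the IO-bound download, where the threads spend most of their time waiting.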

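Although CPython, the standard implementation of Python, has a GIL, not every Python implementation does. IronPython, a Python implementation based on .NET, has no GIL, and neither does Jython, the Java-based implementation. A list of Python implementations can be found here.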

Using multiple processes

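The multiprocessing module is easier to use than threading, because we no longer need to create a thread class as in the previous example. We only need to modify the main function.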

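To use multiple processes, we create a process pool. Using the map method of the pool provided by multiprocessing, we pass the list of URLs to the pool, which starts 8 new processes and has them download the images in parallel. This is true parallelism, but it comes at a cost: the memory used by the script is copied into every process. That does not matter in this simple example, but it can become a heavy burden for large programs.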

Code:

process_toutiao.py

from functools import partial
from multiprocessing.pool import Pool
from itertools import chain
from time import time

from download import setup_download_dir, get_links, download_link


def main():
    ts = time()

    url1 = 'http://www.toutiao.com/a6333981316853907714'
    url2 = 'http://www.toutiao.com/a6334459308533350658'
    url3 = 'http://www.toutiao.com/a6313664289211924737'
    url4 = 'http://www.toutiao.com/a6334337170774458625'
    url5 = 'http://www.toutiao.com/a6334486705982996738'
    download_dir = setup_download_dir('process_imgs')
    links = list(chain(
        get_links(url1),
        get_links(url2),
        get_links(url3),
        get_links(url4),
        get_links(url5),
    ))

    download = partial(download_link, download_dir)
    with Pool(8) as p:
        p.map(download, links)
    print('Downloaded {} images in total'.format(len(links)))
    print('Took {}s'.format(time() - ts))

if __name__ == '__main__':
    main()

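One more note: the multiprocess version also took about 58 seconds to download the same images, roughly the same as the multithreaded version. For CPU-bound tasks, however, multiple processes offer a large speed advantage.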
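
For a CPU-bound job, the same Pool and partial pattern might be applied as in the sketch below. The task weighted_sum_of_squares and the helper run_in_pool are invented for illustration and were not measured for this article; the pool size of 4 is an arbitrary assumption.

```python
from functools import partial
from multiprocessing import Pool


def weighted_sum_of_squares(weight, n):
    """CPU-bound stand-in task: weight * (0^2 + 1^2 + ... + (n-1)^2)."""
    return weight * sum(i * i for i in range(n))


def run_in_pool(inputs, processes=4):
    # partial fixes the first argument, as the download example does with
    # download_dir, so that map only has to supply the varying one.
    task = partial(weighted_sum_of_squares, 2)
    with Pool(processes) as pool:
        return pool.map(task, inputs)


if __name__ == '__main__':
    print(run_in_pool([10, 100, 1000]))
```

Because each worker is a separate process with its own interpreter, the workers can run on different cores at the same time. The if __name__ == '__main__' guard matters here: on platforms that start worker processes by re-importing the main module, it keeps the workers from creating pools of their own.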

Distributing tasks to multiple machines

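In this section the original author discusses distributing the tasks across multiple machines for distributed computing. Since the translator has no environment to test it in and no need for it at the moment, it is skipped here. Interested readers can refer to the link to the original article at the beginning of this post.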

Conclusion

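If your code is IO-bound, the difference between multithreading and multiprocessing in Python is likely to be small. Multiprocessing may be easier to use than multithreading, but it consumes more memory. If your code is CPU-bound, multiprocessing is probably the only sensible choice, especially on machines with multiple processors.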
