This article was first published on Zhihu.
This article uses multithreading to build a simple crawler framework, so that we only have to worry about parsing the pages and not about setting up threads, queues, and so on ourselves. The way it is called is similar to scrapy, but many features are still missing, which is why I call it a simple crawler framework.
The framework provides a Spider class; by writing only the code below, we get a crawler that runs on multiple threads:
class DouBan(Spider):

    def __init__(self):
        super(DouBan, self).__init__()
        self.start_url = 'https://movie.douban.com/top250'
        self.filename = 'douban.json'  # override the default
        self.output_result = False
        self.thread_num = 10

    def start_requests(self):  # override the default method
        yield (self.start_url, self.parse_first)

    def parse_first(self, url):  # just yield the URLs to crawl and their callbacks
        r = requests.get(url)
        soup = BeautifulSoup(r.content, 'lxml')

        movies = soup.find_all('div', class_='info')[:5]
        for movie in movies:
            url = movie.find('div', class_='hd').a['href']
            yield (url, self.parse_second)

        nextpage = soup.find('span', class_='next').a
        if nextpage:
            nexturl = self.start_url + nextpage['href']
            yield (nexturl, self.parse_first)
        else:
            self.running = False  # once we reach this point, no new URLs will be added to the task queue

    def parse_second(self, url):
        r = requests.get(url)
        soup = BeautifulSoup(r.content, 'lxml')
        mydict = {}
        title = soup.find('span', property='v:itemreviewed')
        mydict['title'] = title.text if title else None
        duration = soup.find('span', property='v:runtime')
        mydict['duration'] = duration.text if duration else None
        time = soup.find('span', property='v:initialReleaseDate')
        mydict['time'] = time.text if time else None
        yield mydict


if __name__ == '__main__':
    douban = DouBan()
    douban.run()
As you can see, the usage is very similar to scrapy:

- Subclass the framework class and write only the parse functions (since this is a simple framework, you still have to issue the requests yourself); a minimal example subclass is sketched right after this list
- Use yield to return either data or a new request together with its callback
- Multithreading is automatic (scrapy uses asynchronous IO instead)
- Running it is the same: just call run()
- Whether to save results to a file, etc., is configurable, though extensibility (databases and the like) has not been considered
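To illustrate how little a subclass actually needs, here is a minimal sketch. It relies on the framework's default start_requests, which pairs start_url with a method named parse; the target site, tag names, and field name are assumptions made up for this example, not part of the article's code.

import requests
from bs4 import BeautifulSoup
# Spider is the framework class defined in the full listing later in this article

class QuotesSpider(Spider):

    def __init__(self):
        super(QuotesSpider, self).__init__()
        self.start_url = 'http://quotes.toscrape.com/'  # hypothetical target for illustration
        self.filename = 'quotes.json'

    def parse(self, url):  # picked up by the framework's default start_requests
        r = requests.get(url)
        soup = BeautifulSoup(r.content, 'lxml')
        for quote in soup.find_all('span', class_='text'):
            yield {'quote': quote.text}  # dicts are collected as data
        self.running = False             # only one page, so stop feeding the queue


if __name__ == '__main__':
    QuotesSpider().run()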
Now let's look at how it is implemented.
Compare the following two versions: one is the approach from the previous article, the other has been modified to abstract some functionality out so it is easier to extend.
Please follow the link to the previous article for the earlier version; the modified version is below.
import requests
import time
import threading
from queue import Queue, Empty
import json
from bs4 import BeautifulSoup


def run_time(func):
    def wrapper(*args, **kw):
        start = time.time()
        func(*args, **kw)
        end = time.time()
        print('running', end-start, 's')
    return wrapper


class Spider():

    def __init__(self):
        self.start_url = 'https://movie.douban.com/top250'
        self.qtasks = Queue()
        self.data = list()
        self.thread_num = 5
        self.running = True

    def start_requests(self):
        yield (self.start_url, self.parse_first)

    def parse_first(self, url):
        r = requests.get(url)
        soup = BeautifulSoup(r.content, 'lxml')

        movies = soup.find_all('div', class_='info')[:5]
        for movie in movies:
            url = movie.find('div', class_='hd').a['href']
            yield (url, self.parse_second)

        nextpage = soup.find('span', class_='next').a
        if nextpage:
            nexturl = self.start_url + nextpage['href']
            yield (nexturl, self.parse_first)
        else:
            self.running = False

    def parse_second(self, url):
        r = requests.get(url)
        soup = BeautifulSoup(r.content, 'lxml')
        mydict = {}
        title = soup.find('span', property='v:itemreviewed')
        mydict['title'] = title.text if title else None
        duration = soup.find('span', property='v:runtime')
        mydict['duration'] = duration.text if duration else None
        time = soup.find('span', property='v:initialReleaseDate')
        mydict['time'] = time.text if time else None
        yield mydict

    def start_req(self):
        for task in self.start_requests():
            self.qtasks.put(task)

    def parses(self):
        while self.running or not self.qtasks.empty():
            try:
                url, func = self.qtasks.get(timeout=3)
                print('crawling', url)
                for task in func(url):
                    if isinstance(task, tuple):
                        self.qtasks.put(task)
                    elif isinstance(task, dict):
                        self.data.append(task)
                    else:
                        raise TypeError('parse functions have to yield url-function tuple or data dict')
            except Empty:
                print('{}: Timeout occurred'.format(threading.current_thread().name))
        print(threading.current_thread().name, 'finished')

    @run_time
    def run(self, filename=False):
        ths = []
        th1 = threading.Thread(target=self.start_req)
        th1.start()
        ths.append(th1)
        for _ in range(self.thread_num):
            th = threading.Thread(target=self.parses)
            th.start()
            ths.append(th)
        for th in ths:
            th.join()

        if filename:
            s = json.dumps(self.data, ensure_ascii=False, indent=4)
            with open(filename, 'w', encoding='utf-8') as f:
                f.write(s)
        print('Data crawling is finished.')


if __name__ == '__main__':
    Spider().run(filename='frame.json')
The main ideas behind the changes are:
- When writing parse functions we want to yield, as in scrapy, the URLs to crawl together with their corresponding parse functions. So there is a queue of (URL, parse function) tuples; worker threads keep taking elements from this queue and parse each URL with its function, and this consumption runs on multiple threads.
- yield can return two kinds of values: a (URL, parse function) tuple, or a dict (the data we actually want); a type check decides where each goes. The tuple queue is continuously consumed and replenished, while the dicts only accumulate in the result list, which is written out to a file at the end.
- queue.get is called with a timeout argument and the Empty exception is handled, which guarantees that every thread can terminate; a small standalone sketch of this pattern follows the list.
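The third point is the one most likely to trip people up, so here is a tiny sketch of the pattern on its own, with no crawling involved (the tasks are just integers made up for the demo). Workers block on get for at most three seconds; once the producer is done and the queue drains, each worker falls out of the loop instead of waiting forever.

import threading
from queue import Queue, Empty

q = Queue()
running = True  # flipped to False once no more tasks will be produced

def worker():
    while running or not q.empty():
        try:
            task = q.get(timeout=3)  # without a timeout, get() would block forever
        except Empty:
            continue                 # woke up empty-handed; re-check the loop condition
        print(threading.current_thread().name, 'processed', task)
    print(threading.current_thread().name, 'finished')

threads = [threading.Thread(target=worker) for _ in range(3)]
for t in threads:
    t.start()

for i in range(10):  # the producer side
    q.put(i)
running = False      # signal that production has stopped

for t in threads:
    t.join()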
There is no special trick here and not much more to explain; readers can copy both versions into a text file and compare them side by side.
The framework is then obtained from this second version by pulling out the generic machinery, leaving the user to define only the parts that are specific to each crawler. The complete code is below (the code at the beginning of this article is the second half of this listing).
import requests
import time
import threading
from queue import Queue, Empty
import json
from bs4 import BeautifulSoup


def run_time(func):
    def wrapper(*args, **kw):
        start = time.time()
        func(*args, **kw)
        end = time.time()
        print('running', end-start, 's')
    return wrapper


class Spider():

    def __init__(self):
        self.qtasks = Queue()
        self.data = list()
        self.thread_num = 5
        self.running = True
        self.filename = False
        self.output_result = True

    def start_requests(self):
        yield (self.start_url, self.parse)

    def start_req(self):
        for task in self.start_requests():
            self.qtasks.put(task)

    def parses(self):
        while self.running or not self.qtasks.empty():
            try:
                url, func = self.qtasks.get(timeout=3)
                print('crawling', url)
                for task in func(url):
                    if isinstance(task, tuple):
                        self.qtasks.put(task)
                    elif isinstance(task, dict):
                        if self.output_result:
                            print(task)
                        self.data.append(task)
                    else:
                        raise TypeError('parse functions have to yield url-function tuple or data dict')
            except Empty:
                print('{}: Timeout occurred'.format(threading.current_thread().name))
        print(threading.current_thread().name, 'finished')

    @run_time
    def run(self):
        ths = []
        th1 = threading.Thread(target=self.start_req)
        th1.start()
        ths.append(th1)
        for _ in range(self.thread_num):
            th = threading.Thread(target=self.parses)
            th.start()
            ths.append(th)
        for th in ths:
            th.join()

        if self.filename:
            s = json.dumps(self.data, ensure_ascii=False, indent=4)
            with open(self.filename, 'w', encoding='utf-8') as f:
                f.write(s)
        print('Data crawling is finished.')


class DouBan(Spider):

    def __init__(self):
        super(DouBan, self).__init__()
        self.start_url = 'https://movie.douban.com/top250'
        self.filename = 'douban.json'  # override the default
        self.output_result = False
        self.thread_num = 10

    def start_requests(self):  # override the default method
        yield (self.start_url, self.parse_first)

    def parse_first(self, url):  # just yield the URLs to crawl and their callbacks
        r = requests.get(url)
        soup = BeautifulSoup(r.content, 'lxml')

        movies = soup.find_all('div', class_='info')[:5]
        for movie in movies:
            url = movie.find('div', class_='hd').a['href']
            yield (url, self.parse_second)

        nextpage = soup.find('span', class_='next').a
        if nextpage:
            nexturl = self.start_url + nextpage['href']
            yield (nexturl, self.parse_first)
        else:
            self.running = False  # once we reach this point, no new URLs will be added to the task queue

    def parse_second(self, url):
        r = requests.get(url)
        soup = BeautifulSoup(r.content, 'lxml')
        mydict = {}
        title = soup.find('span', property='v:itemreviewed')
        mydict['title'] = title.text if title else None
        duration = soup.find('span', property='v:runtime')
        mydict['duration'] = duration.text if duration else None
        time = soup.find('span', property='v:initialReleaseDate')
        mydict['time'] = time.text if time else None
        yield mydict


if __name__ == '__main__':
    douban = DouBan()
    douban.run()
With the code split this way, we only have to write the second half, caring solely about parsing the pages and not about how the multithreading is implemented.
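As for the extensibility the framework does not yet address, one simple direction is to post-process self.data after run() returns, since every yielded dict ends up there. Below is a rough sketch, not part of the framework itself, assuming the standard-library sqlite3 module and the DouBan spider above; the database file, table name, and columns are made up for the example.

import sqlite3

douban = DouBan()
douban.run()

# after run() returns, douban.data holds all the dicts yielded by parse_second
conn = sqlite3.connect('douban.db')
conn.execute('CREATE TABLE IF NOT EXISTS movies (title TEXT, duration TEXT, time TEXT)')
conn.executemany(
    'INSERT INTO movies VALUES (?, ?, ?)',
    [(d.get('title'), d.get('duration'), d.get('time')) for d in douban.data]
)
conn.commit()
conn.close()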
You are welcome to follow my Zhihu column, Python Programming, where you can find the column's table of contents and notes on the software and package versions used.