使用redis和mongodb下載小說,並用pytest做測試

weixin_34249678發表於2019-01-06

週末為了熟悉mongodb和redis,寫了一個抓取《白夜行》小說的程式,並且用pytest測試框架做單元測試, 使用了執行緒池加快下載速度:

# white_novel.py
""" 使用redis儲存網址,使用mongodb儲存內容"""

import lxml.html  # type: ignore
import requests  # type: ignore
import redis  # type: ignore
from pymongo import MongoClient, database
from concurrent.futures import ThreadPoolExecutor, as_completed
import time
from multiprocessing.dummy import Pool
from functools import partial

class DownloadWhite:
    """Download a novel chapter by chapter.

    Chapter URLs are scraped from the book's index page and stored in a
    Redis list; downloaded chapter text is bulk-inserted into a MongoDB
    collection. A thread pool overlaps the I/O-bound chapter downloads.
    """

    # Redis list key that holds the scraped chapter URLs.
    KEY = 'urls'

    def __init__(self, workers=15, home_url='http://dongyeguiwu.zuopinj.com/5525'):
        """
        :param workers: number of threads used for concurrent downloads
        :param home_url: index page listing every chapter of the book
        """
        self.workers = workers
        self.home_url = home_url
        # decode_responses=True makes lrange() return str instead of bytes.
        self.redis_client = redis.StrictRedis(decode_responses=True)
        mongo_client = MongoClient()
        db: database.Database = mongo_client['Chapter6']
        self.collection = db['white_novel']

    def _clear(self):
        """Drop previously stored URLs and chapters so a run starts clean."""
        self.redis_client.delete(self.KEY)
        self.collection.delete_many({})

    def save_urls(self):
        """Scrape every chapter link from the index page into Redis."""
        home_page = requests.get(self.home_url).content.decode()
        selector = lxml.html.fromstring(home_page)
        useful = selector.xpath('//div[@class="book_list"]/ul/li')
        # Keep at most one href per <li> and skip link-less items: the old
        # code appended None for those, which redis-py rejects (DataError).
        urls = [href for li in useful for href in li.xpath('a/@href')[:1]]
        if urls:  # RPUSH with zero values is a Redis protocol error
            self.redis_client.rpush(self.KEY, *urls)

    def download_novel(self):
        """Download every chapter listed in Redis and insert them into Mongo."""
        client = self.redis_client
        contents = []
        urls = client.lrange(self.KEY, 0, -1)
        if not urls:
            return
        # ThreadPoolExecutor + as_completed works equally well here;
        # multiprocessing.dummy.Pool (a thread pool despite the module name)
        # is used for its simpler blocking map() API.
        pool = Pool(self.workers)
        try:
            pool.map(partial(self._download_chapter, contents=contents), urls)
        finally:
            pool.close()
            pool.join()
        print(f'at last insert {len(contents)} chapters')
        if contents:  # insert_many([]) raises pymongo InvalidOperation
            self.collection.insert_many(contents)

    @staticmethod
    def _download_chapter(url, contents: list) -> None:
        """Fetch one chapter page and append {'title', 'content'} to contents.

        list.append is atomic under the GIL, so worker threads may safely
        share the same list.
        """
        page = requests.get(url).content.decode()
        selector = lxml.html.fromstring(page)
        title = selector.xpath('//div[@class="h1title"]/h1/text()')[0]
        content = '\n'.join(selector.xpath('//div[@id="htmlContent"]/p/text()'))
        # Fixed typo: documents now store 'content' (was misspelled 'contnet').
        contents.append({'title': title, 'content': content})


if __name__ == '__main__':
    # Rebuild the URL list from scratch, then time only the download phase.
    downloader = DownloadWhite()
    downloader._clear()
    downloader.save_urls()
    t0 = time.perf_counter()
    downloader.download_novel()
    print(f'time elapse {time.perf_counter() - t0} seconds')

執行緒池的實現我試了2個方案,一種方案是ThreadPoolExecutor, 另一種方案是multiprocessing.dummy.Pool, 還用了partial這種小技巧.

不過我有個疑惑:多個執行緒往同一個列表contents裡append,這個contents是執行緒安全的嗎?
What kinds of global value mutation are thread-safe?解答了我的疑問:由於GIL的存在,許多Java中的非執行緒安全問題在Python中不存在了;只有少數類似`L[i] += 4`這樣「先讀取、再賦值」的複合語句,由於不是原子操作,才可能執行緒不安全。

由於使用了執行緒池(15個執行緒)併發下載章節,因此13章的耗時基本等於1章的耗時

at last insert 13 chapters
time elapse 0.9961462760111317 seconds

單元測試:

# test_white_novel.py
import pytest  # type: ignore
import redis  # type: ignore
from pymongo import MongoClient, collection  # type: ignore

from white_novel import DownloadWhite


@pytest.fixture(scope='function')
def wld_instance():
    """Yield a DownloadWhite whose Redis/Mongo state is wiped before and after."""
    print('start')
    downloader = DownloadWhite()
    downloader._clear()
    yield downloader
    # Teardown: leave both stores empty for the next test.
    downloader._clear()
    print('end')


@pytest.fixture(scope='module')
def redis_client():
    """Module-scoped Redis client returning decoded str responses."""
    print('init redis')
    client = redis.StrictRedis(decode_responses=True)
    return client


@pytest.fixture(scope='module')
def white_novel_collection() -> collection.Collection:
    """Module-scoped handle to the Chapter6.white_novel Mongo collection.

    Local names no longer shadow the imported ``pymongo.collection`` module
    (the old body rebound ``collection`` and ``database`` as locals).
    """
    print('init mongo')
    mongo_client = MongoClient()
    db = mongo_client['Chapter6']
    return db['white_novel']


def test_download(wld_instance, redis_client, white_novel_collection):
    """Full pipeline: 13 URLs saved and 13 chapter documents inserted."""
    wld_instance.save_urls()
    wld_instance.download_novel()
    stored_urls = redis_client.llen(wld_instance.KEY)
    stored_docs = white_novel_collection.count_documents(filter={})
    assert stored_urls == 13
    assert stored_docs == 13


def test_not_save_url_download(wld_instance, redis_client, white_novel_collection):
    """download_novel without saved URLs is a no-op: nothing stored anywhere."""
    wld_instance.download_novel()
    assert white_novel_collection.count_documents(filter={}) == 0
    assert redis_client.llen(wld_instance.KEY) == 0

def test_only_save_url(wld_instance, redis_client, white_novel_collection):
    """save_urls alone fills Redis with 13 links but writes no documents."""
    wld_instance.save_urls()
    url_count = redis_client.llen(wld_instance.KEY)
    doc_count = white_novel_collection.count_documents(filter={})
    assert url_count == 13
    assert doc_count == 0

最終抓取的結果如下:


258473-0ded84fa774ab0c9.png
redis 儲存每一章的連結列表
258473-56a485ec97c96855.png
mongodb儲存小說內容

相關文章