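A Web Scraper Example for the Simplest Case

Posted by Jason990420 on 2020-03-06

Original URL: https://learnku.com/articles/41440?order_by=created_at&

Web scraping

File created: 2020/03/06
Last revised: None
Related software information:

Note: Feel free to quote or modify this article as long as you credit the source and the author. The author does not guarantee that the content is entirely correct; you are responsible for any consequences of using it.

Title: A Web Scraper Example for the Simplest Case

I often read novels online, but I keep running into problems such as chapters that fail to download or pages cluttered with ads, so I wrote a simple scraper. It only targets novel sites whose pages can be fetched with a plain request: no headers or cookies to set, and no login, verification, or VIP account required.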

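Requirements:

- The novel's table-of-contents URL, with no pagination; https://www.wfxs.org/html/2/ is used as the example
- Get the URL of every chapter
- Create a directory to hold the whole novel
- Create one text file per chapter
- Use multiple threads to finish faster
- Show the running threads, the number of chapters completed, and the total in a simple display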

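Thread notes

I tried several approaches, but the main program often ended before all threads had finished, so the chapter count always came out wrong. I therefore built my own record of running threads and managed it myself. That finally solved the problem: the program can now tell for certain when every thread has finished.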
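
The idea of a self-managed thread record can be sketched in isolation as follows. This is a minimal illustration, not the article's code: it uses the threading module with a Lock instead of _thread, each worker removes its own record, and run_all, worker, and the doubling "job" are hypothetical names and stand-ins.

```python
import threading
import time

def run_all(jobs, max_threads=4):
    """Run one thread per job, tracking the live threads in a dict."""
    registry = {}               # key -> job, one entry per running thread
    results = []                # output of finished jobs
    lock = threading.Lock()     # guards registry and results

    def worker(key, job):
        time.sleep(0.01)        # stand-in for the real download
        with lock:
            results.append(job * 2)
            del registry[key]   # the record disappears when the work is done

    for key, job in enumerate(jobs):
        # Wait for a free slot, then register the job before starting it.
        while True:
            with lock:
                if len(registry) < max_threads:
                    registry[key] = job
                    break
            time.sleep(0.001)
        threading.Thread(target=worker, args=(key, job)).start()

    # All work is finished only once the registry is empty again.
    while True:
        with lock:
            if not registry:
                return sorted(results)
        time.sleep(0.001)

print(run_all(range(10)))
```

Because a job is registered before its thread starts and removed only after its result is stored, an empty registry means every result has been collected.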

Output

[Image: program output]

Explanation and code

Libraries used

from pathlib import Path
from bs4 import BeautifulSoup as bs
from copy import deepcopy
import urllib.request as request
import _thread
import PySimpleGUI as sg

Create a class that handles the web pages

class WEB():

    def __init__(self):
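        self.base = 'https://www.wfxs.org'  # base URL of the site
        self.root = ''                      # root directory where the novel is saved
        self.queue = {}                     # record of the threads currently working
        self.max = 40                       # maximum number of threads
        self.buffer = {}                    # chapter text from finished threads, waiting to be saved
        self.temp = []                      # chapters whose thread failed, to be queued again
        self.count = 0                      # number of chapters saved
        self.not_allow = '''?|><"*'+,;=[]\x00/\\''' # characters not allowed in file names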

Create the directory: if it already exists, append a number to the name to tell them apart

    def create_subdirectory(self, name):
        i, path = 1, Path(name)
        while path.is_dir():
            i += 1
            path = Path(name+str(i))
        self.root = path
        path.mkdir()
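
A standalone version of the same naming scheme can be tried safely inside a temporary directory. This is a sketch: create_unique_dir and the book name 'novel' are made up for illustration, and unlike the method above it tests path.exists() so that a regular file with the same name is skipped as well.

```python
import tempfile
from pathlib import Path

def create_unique_dir(name):
    """Create `name`, or `name2`, `name3`, ... if the name is already taken."""
    i, path = 1, Path(name)
    while path.exists():        # taken by a directory or by a regular file
        i += 1
        path = Path(f'{name}{i}')
    path.mkdir()
    return path

with tempfile.TemporaryDirectory() as tmp:
    base = str(Path(tmp) / 'novel')     # hypothetical book name
    first = create_unique_dir(base)
    second = create_unique_dir(base)
    print(first.name, second.name)      # novel novel2
```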

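Read the text of a chapter
The chapter text sits directly inside head, so reading head.text as-is would also pick up the text of the child tags. All the child tags are therefore removed first, and head.text is read afterwards.
<html ……><head><title> … </title><meta … /> … chapter text</head>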

    def chapter_content(self, html):
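        # Loop over a snapshot of the direct child tags: extracting from
        # html.head while iterating over it directly would skip elements
        for tag in html.head.find_all(True, recursive=False): tag.extract()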
        chapter_text = self.form(html.head.text).strip()
        return chapter_text

Remove non-breaking spaces (\xA0) from the text and collapse runs of blank lines into single line breaks

    def form(self, text):
        text = text.replace('\xA0', '')
        while '\n\n' in text:
            text = text.replace('\n\n', '\n')
        return text
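
The same cleanup can be written with a regular expression instead of a loop. The sketch below is an equivalent alternative rather than the article's method; clean_text and the sample string are illustrative.

```python
import re

def clean_text(text):
    """Drop non-breaking spaces and collapse runs of newlines into one."""
    text = text.replace('\xa0', '')
    return re.sub(r'\n{2,}', '\n', text)

sample = '\xa0\xa0First line\n\n\n\xa0\xa0Second line\n'
print(repr(clean_text(sample)))     # 'First line\nSecond line\n'
```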

Get a new, unused key for the thread record

    def get_a_key(self):
        for i in range(self.max):
            if i not in self.queue:
                return i
        return None
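
The slot search can be tried on a plain dictionary. In this sketch first_free_key is a hypothetical standalone equivalent of get_a_key; note that it returns None when every slot is taken, so a caller has to check that the record is not full before using the key.

```python
def first_free_key(in_use, limit):
    """Return the smallest key in range(limit) missing from in_use, else None."""
    return next((i for i in range(limit) if i not in in_use), None)

running = {0: 'chapter 1', 1: 'chapter 2', 3: 'chapter 4'}
print(first_free_key(running, 4))   # 2
running[2] = 'chapter 3'
print(first_free_key(running, 4))   # None
```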

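Read the author's name from the table of contents
Use find_all with tag meta and attribute name="author", read the value of content, split the string from the right, and take the rightmost piece.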
<meta content="絕品天醫版權歸葉天南"name="author"/>

    def get_auther(self, html):
        return html.find_all(
            name ='meta', attrs={'name':'author'}
            )[0].get('content').rsplit(sep=None, maxsplit=1)[-1]
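
How rsplit(sep=None, maxsplit=1) picks the last piece is easiest to see on a plain string. The whitespace-separated value below is a hypothetical example, and last_word is an illustrative helper; when the value contains no whitespace at all, the whole string comes back unchanged.

```python
def last_word(content):
    """Return the text after the last run of whitespace in content."""
    return content.rsplit(sep=None, maxsplit=1)[-1]

content = '絕品天醫 版權歸 葉天南'      # hypothetical, whitespace-separated value
print(content.rsplit(sep=None, maxsplit=1))   # ['絕品天醫 版權歸', '葉天南']
print(last_word(content))                     # 葉天南
print(last_word('NoSpacesHere'))              # NoSpacesHere
```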

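Read all chapter titles and links from the table of contents
Use find_all with tag <dd>; the chapter title is the text of the <a> tag and the link is its href value. Anything from a full-width parenthesis （ onward is dropped from the title, and entries whose title is an empty string are skipped.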
<dd><a href="/html/2/3063.html">第十五章慶元診所</a></dd>

    def get_chapters(self, html):
        chapters = html.find_all(name='dd')
        result = []
        for chapter in chapters:
            title = chapter.a.text.split('（')[0]
            if title != '':
                link = self.base + chapter.a.get('href')
                result.append([self.valid(title), link])
        return result
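
The two string operations in this step can be sketched without any HTML parsing. The version below uses urllib.parse.urljoin instead of concatenating self.base with the href, which is a substitution on my part: it also copes with relative links. chapter_entry and the parenthesised suffix in the sample title are hypothetical.

```python
from urllib.parse import urljoin

def chapter_entry(catalog_url, title, href):
    """Return [title, absolute_url], or None when the title is empty."""
    title = title.split('（')[0]        # drop a trailing full-width note
    if title == '':
        return None
    return [title, urljoin(catalog_url, href)]

catalog_url = 'https://www.wfxs.org/html/2/'
print(chapter_entry(catalog_url, '第十五章慶元診所（求收藏）', '/html/2/3063.html'))
print(chapter_entry(catalog_url, '第十五章慶元診所', '3063.html'))
```

Both calls produce the same absolute link, https://www.wfxs.org/html/2/3063.html, whether the href is root-relative or relative to the catalog page.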

Read the book's description from the table of contents
The description is the text that follows <br> inside the <p> tag with class="tl pd8 pd10".
<p class="tl pd8 pd10">作者︰葉天南寫的小說《絕品天醫》… <br/>.……..

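    def get_description(self, html):
        p = html.find(name='p', attrs={'class':"tl pd8 pd10"})
        # <br/> is a void element with no text of its own, so gather the
        # text nodes that follow it inside the <p> rather than reading br.text
        return self.form(''.join(
            str(node) for node in p.br.next_siblings if isinstance(node, str)))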

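Read the book title from the table of contents
The title is the text of the <h1> tag. Because the title is used to create the directory, characters that are illegal in file names are removed from it.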
<h1 class="tc h10">絕品天醫</h1>

    def get_name(self, html):
        return self.valid(html.h1.text)

Load the book title, author, description, and every chapter title with its link from the table of contents

    def load_catalog(self, url):
        status, html     = self.load_html(url)
        if status != 200:
            return None, None, None, None, None
        name        = self.get_name(html)
        auther      = self.get_auther(html)
        description = self.get_description(html)
        chapters    = self.get_chapters(html)
        return status, name, auther, description, chapters

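Load the text of a chapter and put it into the buffer that awaits saving. If the page load returns anything other than status 200, the chapter is removed from the thread record and put on the list to be queued again later.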

    def load_chapter(self, key, chapter, url):
        status, html = self.load_html(url)
        if status != 200:
            self.temp.append([chapter, url])
            del self.queue[key]
        else:
            chapter_text = self.chapter_content(html)
            self.buffer[key] = [chapter, chapter_text]
        return

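Read the HTML at the given URL. On any error, or any status code other than 200, return None to signal failure so that the page can be fetched again later.
The page is encoded in big5; decoding errors are handled with ignore, so a character that cannot be decoded is skipped.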
<meta http-equiv="Content-Type" content="text/html; charset=big5" />

    def load_html(self, url):
        try:
            response = request.urlopen(url)
            status   = response.getcode()
        except Exception:   # network failure or HTTP error: report failure
            return None, ''
        else:
            if status == 200:
                data = str(response.read(), encoding='big5', errors='ignore')
                html = bs(data, features='html.parser')
                return status, html
            else:
                return None, ''
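
The effect of errors='ignore' can be checked without a network connection. In this sketch decode_big5 is an illustrative helper and the damaged byte string is constructed by hand; it is not taken from the site.

```python
def decode_big5(data, errors='ignore'):
    """Decode raw page bytes as Big5; undecodable bytes are dropped by default."""
    return str(data, encoding='big5', errors=errors)

good = '中文'.encode('big5')
damaged = good[:2] + b'\xff' + good[2:]     # inject a byte that is not valid Big5

print(decode_big5(good))
print(decode_big5(damaged))                 # the bad byte is skipped silently
try:
    decode_big5(damaged, errors='strict')
except UnicodeDecodeError as exc:
    print('strict decoding fails:', exc.reason)
```

Ignoring errors keeps the download going, at the cost of silently losing any character that cannot be decoded.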

Delete a thread record

    def queue_delete(self, key):
        del self.queue[key]

Add a thread to the record and start it; the commented-out line is the non-threaded way of doing the same thing

    def queue_insert(self, chapter, url):
        key = self.get_a_key()
        self.queue[key] = [chapter, url]
        # self.load_chapter(key, chapter, url)
        _thread.start_new_thread(self.load_chapter, (key, chapter, url))

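Check whether the thread record has reached its limit. This caps the number of threads: no new thread is added while the record is full.

    def queue_is_full(self):
        return len(self.queue) == self.max

Check whether the thread record is non-empty; an empty record confirms that every thread has finished.

    def queue_not_empty(self):
        return len(self.queue) != 0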

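Save the book's info file, which contains the title, author, and description. If the file already exists, append a number to the name to tell them apart.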

    def save_book(self, name, auther, description):
        i, path = 1, self.root.joinpath(name+'.txt')
        while path.is_file():
            i += 1
            path = self.root.joinpath(name+str(i)+'.txt')
        text = '\n'.join(('Title: %s'%name, 'Author: %s'%auther,
                          'Description: %s'%description))
        with open(path, 'wt', encoding='utf-8') as f:
            f.write(text)

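Save the text of the chapters. If the file name already exists, append a number to tell them apart. Each saved chapter is then deleted from both the save buffer and the thread record.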

    def save_chapter(self):
        buffer = deepcopy(self.buffer)
        for key, value in buffer.items():
            i, path = 1, self.root.joinpath(value[0]+'.txt')
            while path.is_file():
                i += 1
                path = self.root.joinpath(value[0]+str(i)+'.txt')
            with open(path, 'wt', encoding='utf-8') as f:
                f.write(value[1])
            self.count += 1
            del self.buffer[key]
            del self.queue[key]
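
The reason for iterating over a copy is that a dictionary must not change size while it is being looped over. The pattern can be sketched on plain dictionaries; drain is a hypothetical name, the file write is replaced by a list append, and a shallow snapshot with list() is used here on the assumption that the values are not modified during the loop.

```python
def drain(buffer, queue):
    """Move every finished entry out of buffer and queue; return the titles."""
    saved = []
    for key, (title, text) in list(buffer.items()):   # snapshot of the entries
        saved.append(title)     # stand-in for writing the chapter file
        del buffer[key]         # safe: the loop runs over the snapshot
        del queue[key]
    return saved

buffer = {0: ['Chapter 1', 'text 1'], 1: ['Chapter 2', 'text 2']}
queue = {0: 'job 0', 1: 'job 1', 2: 'job 2'}   # key 2 is still downloading
print(drain(buffer, queue), buffer, queue)
```

After the call the buffer is empty and only the record for the still-running job remains in the queue.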

Remove illegal characters from the file name to avoid errors when saving

    def valid(self, text):
        return ''.join((char for char in text if char not in self.not_allow))
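
A standalone version of the filter shows what it keeps and what it drops. The sample title is made up; the character set is copied from self.not_allow. Note that the set does not contain the colon, which Windows also rejects in file names, so titles containing one may still need extra handling there.

```python
NOT_ALLOWED = set('?|><"*\'+,;=[]\x00/\\')   # same characters as self.not_allow

def valid(text):
    """Return text with every disallowed character removed."""
    return ''.join(ch for ch in text if ch not in NOT_ALLOWED)

print(valid('Who am I? [Part 1/2]'))    # Who am I Part 12
```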

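Main program
- If the table of contents cannot be loaded, exit the program
- Save the book's info file
- Build a simple GUI that shows the progress, lets the user quit at any time, and otherwise makes sure every thread has run to completion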

url = 'https://www.wfxs.org/html/2/'    # URL of the novel's table of contents
W = WEB()

status, name, auther, description, chapters = W.load_catalog(url)
if status is None:
    print('%s open failed !' % url)
    quit()

W.create_subdirectory(name)
W.save_book(name, auther, description)

font = ('Courier New', 16, 'bold')
layout = [[sg.Text('', font=font, auto_size_text=False, key='Text1', 
                   size=(W.max, 1))],
          [sg.Text('', font=font, auto_size_text=False, key='Text2', 
                   size=(W.max, 1))]]
window = sg.Window('Novel Download', layout=layout, finalize=True)

size = len(chapters)
pending = deepcopy(chapters)    # chapters still to be downloaded
while len(pending) != 0:
    W.temp = []
    for chapter, url in pending:
        W.save_chapter()
        state, values = window.read(timeout=1)
        if state is None:       # the window was closed: stop immediately
            window.close()
            quit()
        window['Text1'].update(value='■'*len(W.queue))
        window['Text2'].update(
            value='{}/{} chapters saved'.format(W.count, size))
        while W.queue_is_full():    # thread record is full: save finished chapters
            W.save_chapter()
        W.queue_insert(chapter, url)    # start a thread to download the chapter
    # While the thread record is not empty, keep saving chapters until every
    # thread has finished; then rerun the chapters that failed
    while W.queue_not_empty():
        W.save_chapter()
    pending = deepcopy(W.temp)

while True:
    state, values = window.read(timeout=100)
    if state is None:
        break
    window['Text1'].update(value='■'*len(W.queue))
    window['Text2'].update(value='{}/{} chapters saved'.format(W.count, size))
    W.save_chapter()    # keep saving chapters until every thread has finished
window.close()
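
For comparison, the download-and-retry cycle can also be sketched with the standard library's concurrent.futures, which waits for its worker threads on its own. This is an alternative, not the approach used above: download_all, fetch, and max_rounds are hypothetical, fetch is assumed to return the chapter text or None on failure, and retries are capped at a few rounds instead of repeating until every chapter succeeds.

```python
from concurrent.futures import ThreadPoolExecutor

def download_all(chapters, fetch, max_workers=40, max_rounds=3):
    """Fetch every (title, url) pair, retrying failures for a few rounds.

    Returns (done, pending): a dict of title -> text, and the pairs that
    still failed after the last round.
    """
    done, pending = {}, list(chapters)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        for _ in range(max_rounds):
            if not pending:
                break
            # map() returns results in input order once each task finishes
            texts = list(pool.map(lambda item: fetch(item[1]), pending))
            failed = []
            for (title, url), text in zip(pending, texts):
                if text is None:
                    failed.append((title, url))     # try again next round
                else:
                    done[title] = text
            pending = failed
    return done, pending
```

Called as download_all(chapters, fetch), it returns the collected texts together with any chapters that never loaded, so the caller can decide whether to report them or try again later.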

This work is released under a CC license; reposts must credit the author and link to this article.

Jason Yang
