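Python Web Scraping: Downloading a Novel (逆天邪神)

Posted by jlttt on 2022-03-10

Original URL: http://blog.itpub.net/9533994/viewspace-2869029/

Python web scraping

2022-03-06 23:05:11

Disclaimer: this is for my own entertainment and is a summary of my own learning process.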

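Main text:

Environment:

OS: Windows 10
Python version: Python 3.10.2
IDE: PyCharm

Project goals:

Check whether a single novel has been updated, so there is no need to open a browser and look manually.
If the novel has been updated, automatically download the chapters you have not read yet and save them locally as .txt files.
Package the project code as an .exe file that runs standalone on Windows 10.

Final result: all three goals are met. The tool can tell whether the novel has been updated and downloads the new chapters if so. By changing the last-read chapter number (the record of where you stopped reading), you can also make it save the whole novel.

Implementation:

1. Main program

I only wrote a single main.py here; one main block handles everything.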

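# A tool for scraping a novel
# Target: 逆天邪神
# Feature 1: check whether the novel has been updated and, if so, download the new chapters
# Feature 2: download the whole novel (single-threaded). It usually only fetches the latest few chapters, so one thread is enough. (I was lazy.)


import requests
import re
from bs4 import BeautifulSoup
import os

if __name__ == '__main__':
    novel_url = "https://www.bige3.com/book/1030/"  # 逆天邪神
    return_value = is_update(novel_url)  # number of newly published chapters
    if return_value == 0:
        print("The novel has not been updated yet!")
    else:
        print("The novel has " + str(return_value) + " new chapter(s)!")
        print("Downloading the new chapters......")
        download_novel(return_value)
    # os.system("pause")   # comment out while debugging; enable when packaging so the console stays open and shows the result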

2. Helper functions

2.1 The is_update() function

def is_update(url):
    headers = {
        "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.102 Safari/537.36"
    }
    try:
        resp = requests.get(url, headers=headers)
        resp.raise_for_status()  # raises requests.HTTPError for 4xx/5xx status codes
        resp.encoding = 'utf-8'
    except requests.RequestException:
        print("Request failed")
        return 0  # without a response there is nothing to compare, so report no update

    # Link texts found on the index page; the last one is taken as the latest chapter
    resp = re.findall(r'<a href =.*?>(.*?)</a>', resp.text)
    # print("Last chapter in the returned list: " + resp[-1])
    with open("小說更新記錄.txt", "r", encoding='utf-8') as f:  # open the record file
        data = f.read()  # read the recorded chapter title
        # print("source_novel_data is:" + str(data))
    if data == str(resp[-1]):
        # print("Chapters match, the novel has not been updated!")
        return 0
    else:
        # print("The novel has been updated; write the new value to 小說更新記錄.txt")
        data_num = re.findall(r'\d+', data)  # list
        data_num = ''.join(data_num)  # str
        resp_num = re.findall(r'\d+', resp[-1])
        resp_num = ''.join(resp_num)
        gap_num = int(resp_num)-int(data_num)  # number of new chapters
        with open("小說更新記錄.txt", "w", encoding='utf-8') as f:  # open the record file
            f.write(str(resp[-1]))  # write the latest chapter title
            print("writing is ok!")
        return gap_num
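
The update check above boils down to comparing the stored title with the latest title and subtracting the chapter numbers. Here is a minimal sketch of that logic as pure functions, with no network or file access. The names `chapter_number` and `update_gap` are hypothetical, and the sketch assumes a title carries its chapter number as the first run of digits (for example `第1936章`). Unlike the code above, which joins every digit run, it reads only the first run, so a title containing further digits would not distort the number. It also never returns a negative gap.

```python
import re


def chapter_number(title):
    """Return the first run of digits in a chapter title as an int."""
    match = re.search(r'\d+', title)
    if match is None:
        raise ValueError(f"no chapter number in {title!r}")
    return int(match.group())


def update_gap(recorded, latest):
    """Number of chapters published since the recorded one (never negative)."""
    if recorded.strip() == latest.strip():
        return 0
    return max(chapter_number(latest) - chapter_number(recorded), 0)


print(update_gap("1934", "第1936章"))
```

Keeping this part free of I/O makes it easy to check by hand before wiring it to the real request and the record file.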

2.2 The download_novel(return_value) function

# Single-threaded version
def download_novel(return_value):
    if return_value >= 1:
        for i in range(1, return_value+1, 1):
            print(i)
            with open("小說更新記錄.txt", "r", encoding='utf-8') as f:  # open the record file
                data = f.read()  # recorded chapter title, str
                data_num = re.findall(r'\d+', data)  # list
                data_num = ''.join(data_num)  # str
                download_num = int(data_num)+1-(i-1)  # page number in the URL, one higher than the chapter number
                # print(download_num)
                print(novel_url+str(download_num)+'.html')
            resp = requests.get(novel_url+str(download_num)+'.html')
            # print(resp.content)
            soup = BeautifulSoup(resp.text, 'lxml')
            soup.select('#chaptercontent')  # result is not used; the text below is sliced from the whole page
            mytxt = soup.text[soup.text.find('下一章'):soup.text.rfind('『點此報錯')]
            mytxt = mytxt[3:]  # drop the 3-character marker '下一章' itself
            mytxt = mytxt.strip()
            mytxt = mytxt.replace('　　', '\n')  # turn the blanks between paragraphs into line breaks
            novel_save_location = "./novel_downloads/逆天邪神第"+str(download_num-1)+"章.txt"
            with open(novel_save_location, "w", encoding='utf-8') as f:  # open the output file
                f.write(mytxt)
            print("Download finished!")
    else:
        print("invalid parameter!")
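
Judging from the code above, the page number in the URL is one higher than the chapter number for this book: `download_num` is fetched and `download_num-1` goes into the file name. Here is a small sketch that works out the (page number, chapter number) pairs up front, so the record file does not have to be reread on every loop iteration. `pages_to_fetch` and its `offset` parameter are hypothetical names, and the offset of 1 is inferred only from the code above.

```python
def pages_to_fetch(latest_chapter, gap, offset=1):
    """Return (page_number, chapter_number) pairs, newest chapter first."""
    if gap < 1:
        raise ValueError("gap must be at least 1")
    pairs = []
    for i in range(gap):
        chapter = latest_chapter - i
        pairs.append((chapter + offset, chapter))
    return pairs


for page, chapter in pages_to_fetch(1936, 2):
    print(f"{page}.html -> chapter {chapter}")
```

With a recorded latest chapter of 1936 and two new chapters, this yields the same pages as the loop above: 1937.html for chapter 1936 and 1936.html for chapter 1935.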

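Notes:

While debugging, create the folder novel_downloads and mark it as Excluded so that PyCharm does not index it and slow the computer down.
After packaging, main.exe needs two things in the directory it runs from: the folder novel_downloads and the file 小說更新記錄.txt.
Initially, 小說更新記錄.txt just needs to contain a number; any number will do (1, 1935, and so on).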
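
The two prerequisites in these notes could also be created by the program itself on first run. A minimal sketch, assuming the folder and file names used in this post; `ensure_workspace` and its parameters are hypothetical names, not part of the original code. The demo runs in a temporary directory so it leaves nothing behind.

```python
import tempfile
from pathlib import Path


def ensure_workspace(base_dir, default_record="1"):
    """Create novel_downloads/ and the record file under base_dir if missing."""
    base = Path(base_dir)
    downloads = base / "novel_downloads"
    downloads.mkdir(parents=True, exist_ok=True)
    record = base / "小說更新記錄.txt"
    if not record.exists():
        # Any number works as a starting point; an existing record is never overwritten
        record.write_text(default_record, encoding="utf-8")
    return downloads, record


with tempfile.TemporaryDirectory() as tmp:
    downloads, record = ensure_workspace(tmp, "1935")
    print(downloads.is_dir(), record.read_text(encoding="utf-8"))
```

In the real script the call would be something like `ensure_workspace(".")` before `is_update()` runs, since the code opens both paths relative to the current working directory.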

Full code (ready to run):

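# A tool for scraping a novel
# Target: 逆天邪神
# Feature 1: check whether the novel has been updated and, if so, download the new chapters
# Feature 2: download the whole novel (single-threaded). It usually only fetches the latest few chapters, so one thread is enough. (I was lazy.)

import requests
from lxml import etree  # not referenced directly; BeautifulSoup's 'lxml' parser requires the lxml package
import re
from bs4 import BeautifulSoup
import os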

def is_update(url):
    headers = {
        "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.102 Safari/537.36"
    }
    try:
        resp = requests.get(url, headers=headers)
        resp.raise_for_status()  # raises requests.HTTPError for 4xx/5xx status codes
        resp.encoding = 'utf-8'
    except requests.RequestException:
        print("Request failed")
        return 0  # without a response there is nothing to compare, so report no update

    # Link texts found on the index page; the last one is taken as the latest chapter
    resp = re.findall(r'<a href =.*?>(.*?)</a>', resp.text)
    # print("Last chapter in the returned list: " + resp[-1])
    with open("小說更新記錄.txt", "r", encoding='utf-8') as f:  # open the record file
        data = f.read()  # read the recorded chapter title
        # print("source_novel_data is:" + str(data))
    if data == str(resp[-1]):
        # print("Chapters match, the novel has not been updated!")
        return 0
    else:
        # print("The novel has been updated; write the new value to 小說更新記錄.txt")
        data_num = re.findall(r'\d+', data)  # list
        data_num = ''.join(data_num)  # str
        resp_num = re.findall(r'\d+', resp[-1])
        resp_num = ''.join(resp_num)
        gap_num = int(resp_num)-int(data_num)  # number of new chapters
        with open("小說更新記錄.txt", "w", encoding='utf-8') as f:  # open the record file
            f.write(str(resp[-1]))  # write the latest chapter title
            print("writing is ok!")
        return gap_num


# Single-threaded version
def download_novel(return_value):
    if return_value >= 1:
        for i in range(1, return_value+1, 1):
            print(i)
            with open("小說更新記錄.txt", "r", encoding='utf-8') as f:  # open the record file
                data = f.read()  # recorded chapter title, str
                data_num = re.findall(r'\d+', data)  # list
                data_num = ''.join(data_num)  # str
                download_num = int(data_num)+1-(i-1)  # page number in the URL, one higher than the chapter number
                # print(download_num)
                print(novel_url+str(download_num)+'.html')
            resp = requests.get(novel_url+str(download_num)+'.html')
            # print(resp.content)
            soup = BeautifulSoup(resp.text, 'lxml')
            soup.select('#chaptercontent')  # result is not used; the text below is sliced from the whole page
            mytxt = soup.text[soup.text.find('下一章'):soup.text.rfind('『點此報錯')]
            mytxt = mytxt[3:]  # drop the 3-character marker '下一章' itself
            mytxt = mytxt.strip()
            mytxt = mytxt.replace('　　', '\n')  # turn the blanks between paragraphs into line breaks
            novel_save_location = "./novel_downloads/逆天邪神第"+str(download_num-1)+"章.txt"
            with open(novel_save_location, "w", encoding='utf-8') as f:  # open the output file
                f.write(mytxt)
            print("Download finished!")
    else:
        print("invalid parameter!")


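if __name__ == '__main__':
    novel_url = "https://www.bige3.com/book/1030/"  # 逆天邪神
    return_value = is_update(novel_url)  # number of newly published chapters
    if return_value == 0:
        print("The novel has not been updated yet!")
    else:
        print("The novel has " + str(return_value) + " new chapter(s)!")
        print("Downloading the new chapters......")
        download_novel(return_value)
    os.system("pause")  # keep the console window open so the result can be read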

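Drawbacks: the tool is single-threaded. It uses neither async coroutines nor a thread pool, so it gets no speedup when many chapters have to be downloaded. I will optimize the code and add those features when I have time.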
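
As a rough idea of what the thread-pool variant could look like, here is a sketch built on the standard library's `concurrent.futures.ThreadPoolExecutor`. It is not the author's implementation: `download_many`, `fetch`, and `fake_fetch` are hypothetical names, and `fake_fetch` stands in for the real request-and-parse step so the example runs without network access.

```python
from concurrent.futures import ThreadPoolExecutor


def download_many(chapters, fetch, max_workers=4):
    """Fetch several chapters concurrently and return {chapter: text}."""
    chapters = list(chapters)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map yields results in the same order as the input chapters
        texts = list(pool.map(fetch, chapters))
    return dict(zip(chapters, texts))


def fake_fetch(chapter):
    """Stand-in for the real download; returns placeholder text."""
    return f"text of chapter {chapter}"


result = download_many([1934, 1935, 1936], fake_fetch)
print(result[1936])  # text of chapter 1936
```

In real use, `fetch` would wrap the `requests.get` call and the text extraction for one chapter. A small `max_workers` value keeps the load on the site modest.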

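Results:

For example, the current chapter record is:

The latest chapter is 1936章災厄奏鳴, so I change the number to demonstrate.

Without the change, there is no new chapter to fetch:

After the change, the run should look like this:

The corresponding folder contains:

An opened file looks like this:

Over!!!!!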

Packaging

Steps:

In a terminal opened at the PyCharm project path, run: pip install pyinstaller
cd to the directory containing the project's .py file: cd .\study_capture\novel_capture\
Run: pyinstaller -F .\main.py

The result is:

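Techniques used in the project:

Some of these may have been removed while I was optimizing the program, hehe.

Requesting page data

resp = requests.get(url, headers=headers)

Converting between list and str in Python

data_num = re.findall(r'\d+', data)  # re.findall returns a list
data_num = ''.join(data_num)  # str

Finding the chapter titles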

resp = re.findall(r'<a href =.*?>(.*?)</a>', resp.text)
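
To see what this pattern captures, it helps to run it on a small piece of markup. The fragment below is made up for illustration (the real index page may look different, and the titles are placeholders). The only change to the pattern is `\s*`, which accepts `href=` with or without the space before the equals sign.

```python
import re

# Hypothetical index-page fragment; the real markup may differ.
sample_html = (
    '<dd><a href ="/book/1030/1936.html">第1935章 Title A</a></dd>'
    '<dd><a href ="/book/1030/1937.html">第1936章 Title B</a></dd>'
)

# The lazy group (.*?) captures the link text between > and </a>
titles = re.findall(r'<a href\s*=.*?>(.*?)</a>', sample_html)
print(titles[-1])
```

Note that the pattern matches every link of this shape on the page, so taking `[-1]` relies on the last matching link being the latest chapter.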

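Reading a text file

encoding='utf-8' is necessary. Without it, open() uses the locale's default encoding, which on Windows is usually not UTF-8, and reading the Chinese text raises an error.

with open("小說更新記錄.txt", "r", encoding='utf-8') as f:  # open the file
    data = f.read()  # read the file

Writing back to a text file

with open("小說更新記錄.txt", "w", encoding='utf-8') as f:  # open the file
    f.write(str(resp[-1]))  # write the latest chapter title

Filtering HTML values with BS4

In a CSS selector, # selects an element by its id.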

soup = BeautifulSoup(resp.text, 'lxml')
soup.select('#chaptercontent')

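Taking the last element of a list

resp[-1]

Extracting the chapter number from the title

data_num = re.findall(r'\d+', data)  # list

Slicing a string at specific positions

soup.text is a str
find('下一章') returns the index of the first match, searching from the left
rfind('『點此報錯') returns the index of the first match, searching from the right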

mytxt = soup.text[soup.text.find('下一章'):soup.text.rfind('『點此報錯')]
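
One thing to watch with this slicing: `find()` and `rfind()` return -1 when the marker is missing, and the slice then silently produces the wrong text instead of failing. A small helper can guard against that. `between` is a hypothetical name and `page_text` is made-up sample text, not real page content.

```python
def between(text, start_marker, end_marker):
    """Return the text after the first start_marker and before the last end_marker."""
    start = text.find(start_marker)
    end = text.rfind(end_marker)
    if start == -1 or end == -1 or end < start + len(start_marker):
        raise ValueError("markers not found in the expected order")
    return text[start + len(start_marker):end]


page_text = "上一章 目錄 下一章 chapter body here 『點此報錯』"
print(between(page_text, '下一章', '『點此報錯').strip())  # chapter body here
```

Skipping `len(start_marker)` characters also replaces the separate `mytxt[3:]` step used in the download function.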

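String concatenation:

novel_save_location = "./novel_downloads/逆天邪神第"+str(download_num-1)+"章.txt"

When saving the novel:

1. The text contained blanks, and calling

mytxt = mytxt.strip()

did not remove them. I did not know why at first. As I recall from an online course, strip() removes spaces, other whitespace, and line breaks, but it only does so at the start and end of a string, which is why the blanks between paragraphs were still there.

Fix: I could not tell which character it was (in Notepad++), so I copied the blank itself straight into the code. It is most likely the full-width space that Chinese text uses to indent paragraphs.

mytxt=mytxt.replace('　　', '\n')
# Purpose: the sentences were too long in the .txt file, so I break the line at each of these blanks. Compared with the website, the result looks fine.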
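
The behaviour described here can be reproduced in a few lines. The sketch assumes the blank is the full-width (ideographic) space U+3000, which the post does not confirm; printing `ord()` of the character is one way to find out what an invisible character really is. The sample string `raw` is made up.

```python
raw = "\u3000\u3000First paragraph.\u3000\u3000Second paragraph.\n"

# Identify the invisible character by its code point
print(hex(ord(raw[0])))  # 0x3000

stripped = raw.strip()
# strip() only trimmed the ends; the blanks between paragraphs are still there
print("\u3000\u3000" in stripped)  # True

cleaned = stripped.replace("\u3000\u3000", "\n")
print(cleaned.splitlines())  # ['First paragraph.', 'Second paragraph.']
```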

Thanks for reading!!! This is my first post; I write slowly and I am still a beginner. Off to do my homework now. Boo-hoo.
