Today we continue our hands-on scraping series. Besides the usual page scraping, we will introduce a brand-new download feature: the main task is to scrape novel content and save it locally so we can read it offline later.
To keep beginners from getting lost as the feature set grows, I have drawn a functional architecture diagram, shown below:
Let's dive into today's protagonist: the novel site.
Novel Parsing
Fetching the Book List
From the novel site's recommendation list, we only need to parse one of the recommended sections rather than reproduce the whole page, which lets us pull out exactly the information we need far more efficiently.
Here is a sample snippet to help you follow along:
from urllib.request import urlopen, Request
from bs4 import BeautifulSoup as bf

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/122.0.0.0 Safari/537.36 Edg/122.0.0.0'}
req = Request("https://www.readnovel.com/", headers=headers)
# Send the request and fetch the HTML
# The response body is bytes, so decode it into a string
html = urlopen(req)
html_text = bytes.decode(html.read())
soup = bf(html_text, 'html.parser')
for li in soup.select('#new-book-list li'):
    a_tag = li.select_one('a[data-eid="qd_F24"]')
    p_tag = li.select_one('p')
    book = {
        'href': a_tag['href'],
        'title': a_tag.get('title'),
        'content': p_tag.get_text()
    }
    print(book)
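One caveat worth guarding against: if the site's markup changes, `select_one` can return None and the dictionary lookup above will raise an AttributeError. A minimal defensive sketch of the same loop, run here against a made-up inline HTML fragment instead of the live page:

```python
from bs4 import BeautifulSoup as bf

# Made-up fragment standing in for the live page; the second <li> is missing its <a>
sample = ('<ul id="new-book-list">'
          '<li><a data-eid="qd_F24" href="/b/1" title="Book One"></a><p>intro</p></li>'
          '<li><p>no link here</p></li>'
          '</ul>')
soup = bf(sample, 'html.parser')
books = []
for li in soup.select('#new-book-list li'):
    a_tag = li.select_one('a[data-eid="qd_F24"]')
    p_tag = li.select_one('p')
    if a_tag is None or p_tag is None:
        # Skip entries whose markup does not match what we expect
        continue
    books.append({'href': a_tag.get('href'),
                  'title': a_tag.get('title'),
                  'content': p_tag.get_text()})
print(books)
```

The malformed entry is silently skipped instead of crashing the whole scrape.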
Book Details
Normally we browse the book list first and then skim a book's synopsis, so we can parse that content directly. Here is a sample snippet:
from urllib.request import urlopen, Request
from bs4 import BeautifulSoup as bf

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/122.0.0.0 Safari/537.36 Edg/122.0.0.0'}
free_trial_link = []
# link is a book href collected from the book list earlier
req = Request(f"https://www.readnovel.com{link}#Catalog", headers=headers)
# Send the request and fetch the HTML
# The response body is bytes, so decode it into a string
html = urlopen(req)
html_text = bytes.decode(html.read())
soup = bf(html_text, 'html.parser')
og_title = soup.find('meta', property='og:title')['content']
og_description = soup.find('meta', property='og:description')['content']
og_novel_author = soup.find('meta', property='og:novel:author')['content']
og_novel_update_time = soup.find('meta', property='og:novel:update_time')['content']
og_novel_status = soup.find('meta', property='og:novel:status')['content']
og_novel_latest_chapter_name = soup.find('meta', property='og:novel:latest_chapter_name')['content']
# Collect the chapter links that make up the free-preview catalog
div_tag = soup.find('div', id='j-catalogWrap')
list_items = div_tag.find_all('li', attrs={'data-rid': True})
for li in list_items:
    link_text = li.find('a').text
    if '第' in link_text:  # '第' marks numbered chapter titles on the site
        link_url = li.find('a')['href']
        link_obj = {'link_text': link_text,
                    'link_url': link_url}
        free_trial_link.append(link_obj)
print(f"Title: {og_title}")
print(f"Synopsis: {og_description}")
print(f"Author: {og_novel_author}")
print(f"Last updated: {og_novel_update_time}")
print(f"Status: {og_novel_status}")
print(f"Latest chapter: {og_novel_latest_chapter_name}")
Notice that while parsing the synopsis we also parsed the book's chapter catalog along the way. Storing the catalog will make the preview step convenient later: once a book catches our interest we will probably want to read it, and perhaps add it to our list. Saving the catalog here spares us from parsing it all over again at reading time.
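To actually persist the catalog between runs, so a later reading session can skip re-scraping it, one simple option is dumping `free_trial_link` to a JSON file. A sketch, with hypothetical entries standing in for real scraped data:

```python
import json

# Hypothetical catalog entries in the same shape the parsing loop produces
free_trial_link = [
    {'link_text': '第1章 開端', 'link_url': '/chapter/1'},
    {'link_text': '第2章 轉折', 'link_url': '/chapter/2'},
]

# ensure_ascii=False keeps the Chinese chapter titles readable in the file
with open('catalog.json', 'w', encoding='utf-8') as f:
    json.dump(free_trial_link, f, ensure_ascii=False, indent=2)

# A later session can reload the catalog instead of re-scraping it
with open('catalog.json', encoding='utf-8') as f:
    restored = json.load(f)
```

The round trip preserves the list exactly, so the reader loop can consume `restored` the same way it consumes a freshly scraped catalog.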
Free Preview
In this step the main task is to parse the chapter title and chapter body and print them, in preparation for wrapping this logic into methods for downloading and reading later. Doing so keeps the data well organized and makes the code more reusable and maintainable. The following snippet shows how:
from urllib.request import urlopen, Request
from bs4 import BeautifulSoup as bf

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/122.0.0.0 Safari/537.36 Edg/122.0.0.0'}
# link is a chapter URL taken from the catalog parsed earlier
req = Request(f"https://www.readnovel.com{link}", headers=headers)
# Send the request and fetch the HTML
# The response body is bytes, so decode it into a string
html = urlopen(req)
html_text = bytes.decode(html.read())
soup = bf(html_text, 'html.parser')
name = soup.find('h1', class_='j_chapterName')
chapter = {
    'name': name.get_text()
}
print(name.get_text())
ywskythunderfont = soup.find('div', class_='ywskythunderfont')
if ywskythunderfont:
    p_tags = ywskythunderfont.find_all('p')
    chapter['text'] = p_tags[0].get_text()
print(chapter)
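Note that the snippet keeps only `p_tags[0]`, the first paragraph of the chapter. If you want the whole body, a small variation joins every paragraph instead; the sketch below runs against a stand-in HTML fragment, since the real page requires a network request:

```python
from bs4 import BeautifulSoup as bf

# Stand-in markup; on the live site this comes from the chapter page request
html_text = ('<div class="ywskythunderfont">'
             '<p>First paragraph.</p>'
             '<p>Second paragraph.</p>'
             '</div>')
soup = bf(html_text, 'html.parser')
block = soup.find('div', class_='ywskythunderfont')
# Join every <p> rather than keeping only the first one
full_text = '\n'.join(p.get_text() for p in block.find_all('p')) if block else ''
print(full_text)
```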
Novel Download
Once content parsing is done, we already have the chapter text in hand, so all that remains is the download step. In case anyone has forgotten how file writing works, here is a quick refresher:
file_name = 'a.txt'
with open(file_name, 'w', encoding='utf-8') as file:
    file.write('download test')
print(f'File {file_name} downloaded!')
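One detail to watch for once chapter names become file names: titles may contain characters such as `?` or `:` that are invalid in file names on some systems. A hypothetical `safe_file_name` helper (not part of the original code) could sanitize them before writing:

```python
import re

def safe_file_name(name, ext='.txt'):
    # Replace characters that are invalid in Windows file names (and '/' on Unix)
    return re.sub(r'[\\/:*?"<>|]', '_', name).strip() + ext

print(safe_file_name('Chapter 1: What Now?'))
```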
Wrapping It Up
As usual, here is the full source. Even if you don't feel like writing the code yourself, you can copy, paste, and run it, then study the details at your own pace. That is a good way to understand how the code actually executes.
# Import urlopen and Request from urllib.request
from urllib.request import urlopen, Request
# Import BeautifulSoup
from bs4 import BeautifulSoup as bf
from random import choice
from colorama import init
from termcolor import colored
from readchar import readkey

FGS = ['green', 'yellow', 'blue', 'cyan', 'magenta', 'red']
book_list = []
free_trial_link = []
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/122.0.0.0 Safari/537.36 Edg/122.0.0.0'}
def get_hot_book():
    print(colored('Searching the book list!', choice(FGS)))
    book_list.clear()
    req = Request("https://www.readnovel.com/", headers=headers)
    # Send the request and fetch the HTML
    # The response body is bytes, so decode it into a string
    html = urlopen(req)
    html_text = bytes.decode(html.read())
    soup = bf(html_text, 'html.parser')
    for li in soup.select('#new-book-list li'):
        a_tag = li.select_one('a[data-eid="qd_F24"]')
        p_tag = li.select_one('p')
        book = {
            'href': a_tag['href'],
            'title': a_tag.get('title'),
            'content': p_tag.get_text()
        }
        book_list.append(book)
def get_book_detail(link):
    free_trial_link.clear()
    req = Request(f"https://www.readnovel.com{link}#Catalog", headers=headers)
    # Send the request and fetch the HTML
    # The response body is bytes, so decode it into a string
    html = urlopen(req)
    html_text = bytes.decode(html.read())
    soup = bf(html_text, 'html.parser')
    og_title = soup.find('meta', property='og:title')['content']
    og_description = soup.find('meta', property='og:description')['content']
    og_novel_author = soup.find('meta', property='og:novel:author')['content']
    og_novel_update_time = soup.find('meta', property='og:novel:update_time')['content']
    og_novel_status = soup.find('meta', property='og:novel:status')['content']
    og_novel_latest_chapter_name = soup.find('meta', property='og:novel:latest_chapter_name')['content']
    # Collect the chapter links that make up the free-preview catalog
    div_tag = soup.find('div', id='j-catalogWrap')
    list_items = div_tag.find_all('li', attrs={'data-rid': True})
    for li in list_items:
        link_text = li.find('a').text
        if '第' in link_text:  # '第' marks numbered chapter titles on the site
            link_url = li.find('a')['href']
            link_obj = {'link_text': link_text,
                        'link_url': link_url}
            free_trial_link.append(link_obj)
    print(colored(f"Title: {og_title}", choice(FGS)))
    print(colored(f"Synopsis: {og_description}", choice(FGS)))
    print(colored(f"Author: {og_novel_author}", choice(FGS)))
    print(colored(f"Last updated: {og_novel_update_time}", choice(FGS)))
    print(colored(f"Status: {og_novel_status}", choice(FGS)))
    print(colored(f"Latest chapter: {og_novel_latest_chapter_name}", choice(FGS)))
def free_trial(link):
    req = Request(f"https://www.readnovel.com{link}", headers=headers)
    # Send the request and fetch the HTML
    # The response body is bytes, so decode it into a string
    html = urlopen(req)
    html_text = bytes.decode(html.read())
    soup = bf(html_text, 'html.parser')
    name = soup.find('h1', class_='j_chapterName')
    chapter = {
        'name': name.get_text()
    }
    print(colored(name.get_text(), choice(FGS)))
    ywskythunderfont = soup.find('div', class_='ywskythunderfont')
    if ywskythunderfont:
        p_tags = ywskythunderfont.find_all('p')
        chapter['text'] = p_tags[0].get_text()
    return chapter
def download_chapter(chapter):
    file_name = chapter['name'] + '.txt'
    with open(file_name, 'w', encoding='utf-8') as file:
        # The site indents paragraphs with full-width spaces; turn them into line breaks
        file.write(chapter['text'].replace('\u3000\u3000', '\n'))
    print(colored(f'File {file_name} downloaded!', choice(FGS)))
def print_book():
    # Print the titles three per row
    for i in range(0, len(book_list), 3):
        names = [f'{i + j}:{book_list[i + j]["title"]}' for j in range(3) if i + j < len(book_list)]
        print(colored('\t\t'.join(names), choice(FGS)))
def read_book(page):
    if not free_trial_link:
        print(colored('No book selected, nothing to read!', choice(FGS)))
        return
    print(colored(free_trial(free_trial_link[page]['link_url'])['text'], choice(FGS)))
get_hot_book()
init()  # enable colored output in the terminal
print(colored('Search finished!', choice(FGS)))
print(colored('m: back to home page', choice(FGS)))
print(colored('d: free preview', choice(FGS)))
print(colored('x: download all', choice(FGS)))
print(colored('n: next chapter', choice(FGS)))
print(colored('b: previous chapter', choice(FGS)))
print(colored('q: quit reading', choice(FGS)))
my_key = ['q', 'm', 'd', 'x', 'n', 'b']
current = 0
while True:
    while True:
        move = readkey()
        if move in my_key:
            break
    if move == 'q':  # 'q' exits the reader
        break
    if move == 'd':
        read_book(current)
    if move == 'x':  # demo only: download just the first chapter rather than looping over all of them
        download_chapter(free_trial(free_trial_link[0]['link_url']))
    if move == 'b':
        current = current - 1
        if current < 0:
            current = 0
        read_book(current)
    if move == 'n':
        current = current + 1
        if current >= len(free_trial_link):
            current = len(free_trial_link) - 1
        read_book(current)
    if move == 'm':
        print_book()
        current = 0
        num = int(input('Enter a book number: =====>'))
        if num < len(book_list):
            get_book_detail(book_list[num]['href'])
Summary
In today's hands-on scraping session, besides the usual page scraping we added a download feature: the main task was to scrape a novel and save it locally for offline reading. To keep things clear, I drew a functional architecture diagram. We first parsed the novel site, covering the book list, book details, and free-preview chapters, then wrote code for each feature: fetching book information from the list, fetching book details, parsing preview chapters, and downloading the novel. Finally, we wrapped these features into methods so they are easy to call and combine. This exercise deepened our understanding of practical scraping and lays the groundwork for later projects.