Today we continue our hands-on scraping series. Besides the usual page scraping, we will introduce a brand-new download feature: the main task is to scrape novel content and save it locally so we can read it offline later.
To keep beginners from getting lost as the feature set grows, I have drawn a functional architecture diagram, shown below:
Let's dive into today's protagonist: the novel site.
Novel Parsing
Fetching the Book List
From the novel site's recommendation list, we only need to parse one of the recommended sections rather than reproduce the whole page, which lets us pull out exactly the information we need far more efficiently.
Here is a sample snippet to help you follow along:
from urllib.request import urlopen, Request
from bs4 import BeautifulSoup as bf

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/122.0.0.0 Safari/537.36 Edg/122.0.0.0'}
req = Request("https://www.readnovel.com/", headers=headers)
# Send the request and fetch the HTML
# The response body is bytes, so decode it into a string
html = urlopen(req)
html_text = bytes.decode(html.read())
soup = bf(html_text, 'html.parser')
for li in soup.select('#new-book-list li'):
    a_tag = li.select_one('a[data-eid="qd_F24"]')
    p_tag = li.select_one('p')
    book = {
        'href': a_tag['href'],
        'title': a_tag.get('title'),
        'content': p_tag.get_text()
    }
    print(book)
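One caveat worth guarding against: if the site's markup changes, `select_one` can return None and the dictionary lookup above will raise an AttributeError. A minimal defensive sketch of the same loop, run here against a made-up inline HTML fragment instead of the live page:

```python
from bs4 import BeautifulSoup as bf

# Made-up fragment standing in for the live page; the second <li> is missing its <a>
sample = ('<ul id="new-book-list">'
          '<li><a data-eid="qd_F24" href="/b/1" title="Book One"></a><p>intro</p></li>'
          '<li><p>no link here</p></li>'
          '</ul>')
soup = bf(sample, 'html.parser')
books = []
for li in soup.select('#new-book-list li'):
    a_tag = li.select_one('a[data-eid="qd_F24"]')
    p_tag = li.select_one('p')
    if a_tag is None or p_tag is None:
        # Skip entries whose markup does not match what we expect
        continue
    books.append({'href': a_tag.get('href'),
                  'title': a_tag.get('title'),
                  'content': p_tag.get_text()})
print(books)
```

The malformed entry is silently skipped instead of crashing the whole scrape.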
Book Details
Normally we browse the book list first and then skim a book's synopsis, so we can parse that content directly. Here is a sample snippet:
from urllib.request import urlopen, Request
from bs4 import BeautifulSoup as bf

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/122.0.0.0 Safari/537.36 Edg/122.0.0.0'}
free_trial_link = []
# link is a book href collected from the book list earlier
req = Request(f"https://www.readnovel.com{link}#Catalog", headers=headers)
# Send the request and fetch the HTML
# The response body is bytes, so decode it into a string
html = urlopen(req)
html_text = bytes.decode(html.read())
soup = bf(html_text, 'html.parser')
og_title = soup.find('meta', property='og:title')['content']
og_description = soup.find('meta', property='og:description')['content']
og_novel_author = soup.find('meta', property='og:novel:author')['content']
og_novel_update_time = soup.find('meta', property='og:novel:update_time')['content']
og_novel_status = soup.find('meta', property='og:novel:status')['content']
og_novel_latest_chapter_name = soup.find('meta', property='og:novel:latest_chapter_name')['content']
# Collect the chapter links that make up the free-preview catalog
div_tag = soup.find('div', id='j-catalogWrap')
list_items = div_tag.find_all('li', attrs={'data-rid': True})
for li in list_items:
    link_text = li.find('a').text
    if '第' in link_text:  # '第' marks numbered chapter titles on the site
        link_url = li.find('a')['href']
        link_obj = {'link_text': link_text,
                    'link_url': link_url}
        free_trial_link.append(link_obj)
print(f"Title: {og_title}")
print(f"Synopsis: {og_description}")
print(f"Author: {og_novel_author}")
print(f"Last updated: {og_novel_update_time}")
print(f"Status: {og_novel_status}")
print(f"Latest chapter: {og_novel_latest_chapter_name}")
Notice that while parsing the synopsis we also parsed the book's chapter catalog along the way. Storing the catalog will make the preview step convenient later: once a book catches our interest we will probably want to read it, and perhaps add it to our list. Saving the catalog here spares us from parsing it all over again at reading time.
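To actually persist the catalog between runs, so a later reading session can skip re-scraping it, one simple option is dumping `free_trial_link` to a JSON file. A sketch, with hypothetical entries standing in for real scraped data:

```python
import json

# Hypothetical catalog entries in the same shape the parsing loop produces
free_trial_link = [
    {'link_text': '第1章 開端', 'link_url': '/chapter/1'},
    {'link_text': '第2章 轉折', 'link_url': '/chapter/2'},
]

# ensure_ascii=False keeps the Chinese chapter titles readable in the file
with open('catalog.json', 'w', encoding='utf-8') as f:
    json.dump(free_trial_link, f, ensure_ascii=False, indent=2)

# A later session can reload the catalog instead of re-scraping it
with open('catalog.json', encoding='utf-8') as f:
    restored = json.load(f)
```

The round trip preserves the list exactly, so the reader loop can consume `restored` the same way it consumes a freshly scraped catalog.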
Free Preview
In this step the main task is to parse the chapter title and chapter body and print them, in preparation for wrapping this logic into methods for downloading and reading later. Doing so keeps the data well organized and makes the code more reusable and maintainable. The following snippet shows how:
from urllib.request import urlopen, Request
from bs4 import BeautifulSoup as bf

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/122.0.0.0 Safari/537.36 Edg/122.0.0.0'}
# link is a chapter URL taken from the catalog parsed earlier
req = Request(f"https://www.readnovel.com{link}", headers=headers)
# Send the request and fetch the HTML
# The response body is bytes, so decode it into a string
html = urlopen(req)
html_text = bytes.decode(html.read())
soup = bf(html_text, 'html.parser')
name = soup.find('h1', class_='j_chapterName')
chapter = {
    'name': name.get_text()
}
print(name.get_text())
ywskythunderfont = soup.find('div', class_='ywskythunderfont')
if ywskythunderfont:
    p_tags = ywskythunderfont.find_all('p')
    chapter['text'] = p_tags[0].get_text()
print(chapter)
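Note that the snippet keeps only `p_tags[0]`, the first paragraph of the chapter. If you want the whole body, a small variation joins every paragraph instead; the sketch below runs against a stand-in HTML fragment, since the real page requires a network request:

```python
from bs4 import BeautifulSoup as bf

# Stand-in markup; on the live site this comes from the chapter page request
html_text = ('<div class="ywskythunderfont">'
             '<p>First paragraph.</p>'
             '<p>Second paragraph.</p>'
             '</div>')
soup = bf(html_text, 'html.parser')
block = soup.find('div', class_='ywskythunderfont')
# Join every <p> rather than keeping only the first one
full_text = '\n'.join(p.get_text() for p in block.find_all('p')) if block else ''
print(full_text)
```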
Novel Download
Once content parsing is done, we already have the chapter text in hand, so all that remains is the download step. In case anyone has forgotten how file writing works, here is a quick refresher:
file_name = 'a.txt'
with open(file_name, 'w', encoding='utf-8') as file:
    file.write('download test')
print(f'File {file_name} downloaded!')
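One detail to watch for once chapter names become file names: titles may contain characters such as `?` or `:` that are invalid in file names on some systems. A hypothetical `safe_file_name` helper (not part of the original code) could sanitize them before writing:

```python
import re

def safe_file_name(name, ext='.txt'):
    # Replace characters that are invalid in Windows file names (and '/' on Unix)
    return re.sub(r'[\\/:*?"<>|]', '_', name).strip() + ext

print(safe_file_name('Chapter 1: What Now?'))
```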
Wrapping It Up
As usual, here is the full source. Even if you don't feel like writing the code yourself, you can copy, paste, and run it, then study the details at your own pace. That is a good way to understand how the code actually executes.
# Import urlopen and Request from urllib.request
from urllib.request import urlopen, Request
# Import BeautifulSoup
from bs4 import BeautifulSoup as bf
from random import choice
from colorama import init
from termcolor import colored
from readchar import readkey

FGS = ['green', 'yellow', 'blue', 'cyan', 'magenta', 'red']
book_list = []
free_trial_link = []
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/122.0.0.0 Safari/537.36 Edg/122.0.0.0'}
def get_hot_book():
    print(colored('Searching the book list!', choice(FGS)))
    book_list.clear()
    req = Request("https://www.readnovel.com/", headers=headers)
    # Send the request and fetch the HTML
    # The response body is bytes, so decode it into a string
    html = urlopen(req)
    html_text = bytes.decode(html.read())
    soup = bf(html_text, 'html.parser')
    for li in soup.select('#new-book-list li'):
        a_tag = li.select_one('a[data-eid="qd_F24"]')
        p_tag = li.select_one('p')
        book = {
            'href': a_tag['href'],
            'title': a_tag.get('title'),
            'content': p_tag.get_text()
        }
        book_list.append(book)
def get_book_detail(link):
    free_trial_link.clear()
    req = Request(f"https://www.readnovel.com{link}#Catalog", headers=headers)
    # Send the request and fetch the HTML
    # The response body is bytes, so decode it into a string
    html = urlopen(req)
    html_text = bytes.decode(html.read())
    soup = bf(html_text, 'html.parser')
    og_title = soup.find('meta', property='og:title')['content']
    og_description = soup.find('meta', property='og:description')['content']
    og_novel_author = soup.find('meta', property='og:novel:author')['content']
    og_novel_update_time = soup.find('meta', property='og:novel:update_time')['content']
    og_novel_status = soup.find('meta', property='og:novel:status')['content']
    og_novel_latest_chapter_name = soup.find('meta', property='og:novel:latest_chapter_name')['content']
    # Collect the chapter links that make up the free-preview catalog
    div_tag = soup.find('div', id='j-catalogWrap')
    list_items = div_tag.find_all('li', attrs={'data-rid': True})
    for li in list_items:
        link_text = li.find('a').text
        if '第' in link_text:  # '第' marks numbered chapter titles on the site
            link_url = li.find('a')['href']
            link_obj = {'link_text': link_text,
                        'link_url': link_url}
            free_trial_link.append(link_obj)
    print(colored(f"Title: {og_title}", choice(FGS)))
    print(colored(f"Synopsis: {og_description}", choice(FGS)))
    print(colored(f"Author: {og_novel_author}", choice(FGS)))
    print(colored(f"Last updated: {og_novel_update_time}", choice(FGS)))
    print(colored(f"Status: {og_novel_status}", choice(FGS)))
    print(colored(f"Latest chapter: {og_novel_latest_chapter_name}", choice(FGS)))
def free_trial(link):
    req = Request(f"https://www.readnovel.com{link}", headers=headers)
    # Send the request and fetch the HTML
    # The response body is bytes, so decode it into a string
    html = urlopen(req)
    html_text = bytes.decode(html.read())
    soup = bf(html_text, 'html.parser')
    name = soup.find('h1', class_='j_chapterName')
    chapter = {
        'name': name.get_text()
    }
    print(colored(name.get_text(), choice(FGS)))
    ywskythunderfont = soup.find('div', class_='ywskythunderfont')
    if ywskythunderfont:
        p_tags = ywskythunderfont.find_all('p')
        chapter['text'] = p_tags[0].get_text()
    return chapter
def download_chapter(chapter):
    file_name = chapter['name'] + '.txt'
    with open(file_name, 'w', encoding='utf-8') as file:
        # The site indents paragraphs with full-width spaces; turn them into line breaks
        file.write(chapter['text'].replace('\u3000\u3000', '\n'))
    print(colored(f'File {file_name} downloaded!', choice(FGS)))
def print_book():
    # Print the titles three per row
    for i in range(0, len(book_list), 3):
        names = [f'{i + j}:{book_list[i + j]["title"]}' for j in range(3) if i + j < len(book_list)]
        print(colored('\t\t'.join(names), choice(FGS)))
def read_book(page):
    if not free_trial_link:
        print(colored('No book selected, nothing to read!', choice(FGS)))
        return
    print(colored(free_trial(free_trial_link[page]['link_url'])['text'], choice(FGS)))
get_hot_book()
init()  # enable colored output in the terminal
print(colored('Search finished!', choice(FGS)))
print(colored('m: back to home page', choice(FGS)))
print(colored('d: free preview', choice(FGS)))
print(colored('x: download all', choice(FGS)))
print(colored('n: next chapter', choice(FGS)))
print(colored('b: previous chapter', choice(FGS)))
print(colored('q: quit reading', choice(FGS)))
my_key = ['q', 'm', 'd', 'x', 'n', 'b']
current = 0
while True:
    while True:
        move = readkey()
        if move in my_key:
            break
    if move == 'q':  # 'q' exits the reader
        break
    if move == 'd':
        read_book(current)
    if move == 'x':  # demo only: download just the first chapter rather than looping over all of them
        download_chapter(free_trial(free_trial_link[0]['link_url']))
    if move == 'b':
        current = current - 1
        if current < 0:
            current = 0
        read_book(current)
    if move == 'n':
        current = current + 1
        if current >= len(free_trial_link):
            current = len(free_trial_link) - 1
        read_book(current)
    if move == 'm':
        print_book()
        current = 0
        num = int(input('Enter a book number: =====>'))
        if num < len(book_list):
            get_book_detail(book_list[num]['href'])
Summary
In today's hands-on scraping session, besides the usual page scraping we added a download feature: the main task was to scrape a novel and save it locally for offline reading. To keep things clear, I drew a functional architecture diagram. We first parsed the novel site, covering the book list, book details, and free-preview chapters, then wrote code for each feature: fetching book information from the list, fetching book details, parsing preview chapters, and downloading the novel. Finally, we wrapped these features into methods so they are easy to call and combine. This exercise deepened our understanding of practical scraping and lays the groundwork for later projects.