Scraping the text of Romance of the Three Kingdoms (shicimingju.com)
Date: 2024-08-06
1. Complete code
import random
import time
import requests
from lxml import etree
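four_famous_novels = 'https://www.shicimingju.com/bookmark/sidamingzhu.html'  # online reading index for the Four Great Classical Novels (not used below)
three_kingdoms = 'https://www.shicimingju.com/book/sanguoyanyi.html'  # Romance of the Three Kingdoms table of contents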
header = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/125.0.0.0 Safari/537.36 Edg/125.0.0.0'
}
req = requests.get(three_kingdoms, headers=header)
req.encoding = req.apparent_encoding
# print(req.text)
tree = etree.HTML(req.text)
book_mulu = tree.xpath('//div[@class="book-mulu"]/ul/li/a/text()')
mulu_href = tree.xpath('//div[@class="book-mulu"]/ul/li/a/@href')
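for title, href in zip(book_mulu, mulu_href):
    # Hrefs are treated as site-relative paths and prefixed with the domain.
    url = 'https://www.shicimingju.com' + href
    print(url)
    req_content = requests.get(url, headers=header)
    req_content.encoding = req_content.apparent_encoding
    chapter_tree = etree.HTML(req_content.text)
    # Collect every text node inside the chapter body.
    content = chapter_tree.xpath('//div[@class="chapter_content"]//text()')
    print(title)
    print(content)
    # Pause 1 to 4 seconds between requests to go easy on the site.
    time.sleep(random.randint(1, 4))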
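Result: for each chapter, the script prints the URL, the chapter title, and the list of text fragments extracted from the page.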
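The content value printed by the script is the raw list returned by the //text() query, so it usually contains whitespace-only entries and padding characters. A minimal sketch of turning such a list into readable text, assuming each non-empty fragment should become one paragraph. The helper name clean_fragments and the sample fragments are hypothetical, not taken from the site:

```python
def clean_fragments(fragments):
    """Join the text fragments from an XPath //text() query into paragraphs."""
    paragraphs = []
    for fragment in fragments:
        # Non-breaking spaces (\xa0) and full-width spaces (\u3000) are often
        # used as indentation padding; treat them as ordinary whitespace.
        text = fragment.replace('\xa0', ' ').replace('\u3000', ' ').strip()
        if text:  # skip fragments that were only whitespace
            paragraphs.append(text)
    return '\n'.join(paragraphs)


# Hypothetical sample shaped like the list printed by the script.
sample = ['\n', '\xa0\xa0First paragraph. ', '\n', '\u3000\u3000Second paragraph.', '\n']
print(clean_fragments(sample))
```

In the script, this could be applied to content just before printing or saving it.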
2. Key points
2.1 Random delay between requests (to avoid putting too much load on the site)
time.sleep(random.randint(1, 4))
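random.randint(1, 4) picks a whole number of seconds from 1 to 4, both ends included. If fractional pauses are preferred, random.uniform is one alternative. The helper below is only a sketch: the name polite_sleep is hypothetical, and its default bounds are copied from the line above.

```python
import random
import time


def polite_sleep(min_seconds=1.0, max_seconds=4.0):
    """Pause for a random duration between the two bounds and return it."""
    delay = random.uniform(min_seconds, max_seconds)
    time.sleep(delay)
    return delay


# Short bounds here only so the demonstration finishes quickly.
waited = polite_sleep(0.01, 0.05)
print(f'waited {waited:.3f} s')
```

Returning the delay makes it easy to log how long each pause actually was.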
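3. Approach
Step 1: scrape the table of contents and the link under each entry.
Step 2: loop over all the URLs and scrape the content of each chapter page.
Step 3 (TODO): create a folder for Romance of the Three Kingdoms and write each chapter's text into it, one file per chapter, named in the form "01 chapter + chapter title.txt".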
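Step 3 is not implemented in the script. One way it might look, as a sketch only: the function name save_chapter, the folder name sanguoyanyi, and the exact file name pattern (two-digit chapter number, a space, then the title) are assumptions based on the description above, not the author's implementation.

```python
import re
import tempfile
from pathlib import Path


def save_chapter(folder, index, title, fragments):
    """Write one chapter to '<folder>/<NN> <title>.txt' and return the path."""
    out_dir = Path(folder)
    out_dir.mkdir(parents=True, exist_ok=True)  # create the folder if missing
    # Replace characters that are not allowed in file names on Windows.
    safe_title = re.sub(r'[\\/:*?"<>|]', '_', title).strip()
    path = out_dir / f'{index:02d} {safe_title}.txt'
    # Drop whitespace-only fragments and put each remaining one on its own line.
    lines = [f.strip() for f in fragments if f.strip()]
    path.write_text('\n'.join(lines), encoding='utf-8')
    return path


# Demonstration with placeholder data, written to a temporary directory so
# nothing is left behind; a real run would pass a permanent folder instead.
with tempfile.TemporaryDirectory() as tmp:
    saved = save_chapter(Path(tmp) / 'sanguoyanyi', 1, 'Chapter title',
                         ['\n', ' some text ', '\n'])
    print(saved.name)
    print(saved.read_text(encoding='utf-8'))
```

In the scraper, the call would sit inside the chapter loop, with the index counting up from 1 (for example via enumerate with start=1).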