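Beginner Web Scraping Project (Scraping and Downloading a Novel from Biquge)

Posted by Mr.DDG on 2019-05-09

Original article: https://blog.csdn.net/weixin_44266342/article/details/90036382

What is a web crawler? Put simply, it is a script that automatically collects the resources we need.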

Crawler steps
Fetch the data

# Import modules
import requests
import re

url = 'https://www.biqudu.com/43_43821/'

Simulate the request headers

headers = {"User-Agent":"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.119 Safari/537.36"}

Open the page in a browser (I use Google Chrome) and press F12 to open the developer tools.
Copy the User-Agent value from the request headers shown there.

Simulate a browser sending an HTTP request

response = requests.get(url, headers=headers, verify=False)

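Because this is an HTTPS request, the call raised an error for me unless verify=False was passed. Note that verify=False turns off TLS certificate verification, so it is a workaround for a certificate problem on this site rather than something every HTTPS request needs.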

Encoding (without setting it, the page text comes out garbled)

response.encoding = 'utf-8'
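To see why the encoding matters, here is a small offline sketch. The sample string is made up for illustration, and ISO-8859-1 stands in for a wrong guess at the charset; the real page is assumed to be UTF-8, as the line above implies.

```python
# Hypothetical sample: the bytes a UTF-8 page would send for this text
raw = "聖墟 第一章".encode("utf-8")

# Decoding with the wrong charset turns every byte into its own character
wrong = raw.decode("iso-8859-1")

# Decoding with the right charset restores the original text
right = raw.decode("utf-8")

print(len(raw), len(wrong), len(right))
```

The wrong decoding succeeds without raising, which is why the symptom is garbled text rather than an exception.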

Source of the target page

html = response.text
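The fetch steps above can be gathered into one helper. This is a minimal sketch, not the article's code: `fetch_html` and its `get` parameter are hypothetical names, and it adds a timeout and an HTTP status check that the original does not use. It also leaves certificate verification on, so for a site with a certificate problem you would still have to decide whether to pass verify=False.

```python
def fetch_html(url, headers, get=None, timeout=10):
    """Return the decoded page source, raising on HTTP error statuses."""
    if get is None:
        # Imported lazily so the helper can be exercised with a stand-in
        import requests
        get = requests.get
    response = get(url, headers=headers, timeout=timeout)
    # Raise an exception for 4xx/5xx responses instead of parsing an error page
    response.raise_for_status()
    # Assumption carried over from the article: the site serves UTF-8
    response.encoding = "utf-8"
    return response.text
```

With requests installed, `html = fetch_html(url, headers)` would replace the three separate statements.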

Analyze how the data is loaded (using regular expressions)

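def multiple_replace(text, adict):
    # Build one pattern that matches any key of adict literally
    rx = re.compile('|'.join(map(re.escape, adict)))

    def one_xlat(match):
        # Look up the replacement for whichever key matched
        return adict[match.group(0)]
    return rx.sub(one_xlat, text)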
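A short sketch of what the helper buys over chained str.replace calls: all keys are replaced in a single pass, so one replacement is never fed into the next. The `swap` mapping and the sample sentence are made up for illustration, and the helper is repeated in compact form so the snippet runs on its own.

```python
import re

def multiple_replace(text, adict):
    # One alternation pattern over all keys, each matched literally
    rx = re.compile('|'.join(map(re.escape, adict)))
    return rx.sub(lambda match: adict[match.group(0)], text)

swap = {'cat': 'dog', 'dog': 'cat'}

# Single pass: each word is replaced exactly once
single_pass = multiple_replace('cat chases dog', swap)

# Chained replace: the second call undoes part of the first
chained = 'cat chases dog'
for old, new in swap.items():
    chained = chained.replace(old, new)

print(single_pass)  # dog chases cat
print(chained)      # cat chases cat
```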

Extract the novel title

# findall returns a list with a single element here, hence [0]
title = re.findall(r'<meta property="og:title" content="(.*?)"/>', html)[0]
# print(title)

In most cases the novel title sits in a meta tag inside head.
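Indexing the findall result with [0] raises IndexError when the tag is missing. As a hedged alternative, the sketch below uses re.search with a fallback; `extract_title`, its `default` parameter, and the sample HTML are hypothetical, and the pattern is slightly loosened to tolerate whitespace before the closing of the tag.

```python
import re

def extract_title(html, default="untitled"):
    """Return the og:title content, or default when the tag is absent."""
    match = re.search(r'<meta property="og:title" content="(.*?)"\s*/?>', html)
    return match.group(1) if match else default

# Made-up fragment shaped like the page head described above
sample = '<head><meta property="og:title" content="聖墟"/></head>'
title = extract_title(sample)
missing = extract_title('<head></head>')
```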

Get the information for each chapter

dl = re.findall(r'<dt>《聖墟》正文</dt>.*?</dl>', html, re.S)[0]  # re.S lets "." match newlines as well
chapter_info_list = re.findall(r'href="(.*?)">(.*?)<', dl)

The pattern must begin and end with a unique field:

<dt>《聖墟》正文</dt>.*?</dl>

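The opening tag here is a unique field. If the start or end of the pattern is not unique on the page, the match will be inaccurate.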
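The following offline sketch shows both points: the unique opening tag skips an earlier list, and re.S is needed because the block spans several lines. The sample markup, chapter paths, and chapter names are assumptions made up for illustration, not copied from the site.

```python
import re

# Hypothetical page fragment: a "latest chapters" list precedes the main list
sample = '''<dl>
<dt>《聖墟》最新章節</dt>
<dd><a href="/43_43821/9999.html">最新一章</a></dd>
<dt>《聖墟》正文</dt>
<dd><a href="/43_43821/1.html">第一章</a></dd>
<dd><a href="/43_43821/2.html">第二章</a></dd>
</dl>'''

pattern = r'<dt>《聖墟》正文</dt>.*?</dl>'

# Without re.S, "." stops at the first newline, so nothing matches
without_flag = re.findall(pattern, sample)

# With re.S, the lazy ".*?" runs across lines up to the first </dl>
with_flag = re.findall(pattern, sample, re.S)

# Each tuple is (href, link text)
chapters = re.findall(r'href="(.*?)">(.*?)<', with_flag[0])
print(chapters)
```

Only the two chapters after the unique tag are captured; the link in the earlier list is left out.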

Download the data
Loop over every chapter and download it

for chapter_info in chapter_info_list:

    # Each tuple is (href, link text), so the URL comes first:
    # chapter_url = chapter_info[0]
    # chapter_title = chapter_info[1]

    chapter_url, chapter_title = chapter_info

    # Prepend the site root; % formatting is used here instead of + concatenation
    chapter_url = "https://www.biqudu.com%s" % chapter_url
    # Download the chapter page
    chapter_response = requests.get(chapter_url, headers=headers, verify=False)
    chapter_response.encoding = 'utf-8'
    chapter_html = chapter_response.text  # page source
    # Extract the chapter text
    chapter_content = re.findall(r'<script>readx\(\);</script>(.*?)<script>chaptererror\(\);</script>',
                                 chapter_html, re.S)[0]
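Formatting the site root onto the href works when every link is root-relative. As an alternative sketch, the standard library's urllib.parse.urljoin resolves a link against the index page URL whether it is root-relative or page-relative. The chapter file name below is a made-up placeholder.

```python
from urllib.parse import urljoin

base_url = 'https://www.biqudu.com/43_43821/'

# Root-relative href, the shape the loop above assumes
from_root = urljoin(base_url, '/43_43821/2394894.html')

# Page-relative href, which plain string formatting would get wrong
from_page = urljoin(base_url, '2394894.html')

print(from_root)
print(from_page)
```

Both calls produce the same absolute URL.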

Clean the data

    # Still inside the for loop; '\u3000\u3000' is two full-width spaces (paragraph indent)
    adict = {'<br/>': '', '\t': '', '\u3000\u3000': '\n', ';&lt;="="/js/"&gt;&lt;/&gt;': ''}
    chapter_content = multiple_replace(chapter_content, adict)
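To make the effect of the cleaning table concrete, here is a small sketch on a made-up chapter fragment. It assumes, as the mapping above suggests, that paragraphs start with two full-width spaces and end with br tags; the helper is repeated in compact form so the snippet runs on its own.

```python
import re

def multiple_replace(text, adict):
    # One alternation pattern over all keys, each matched literally
    rx = re.compile('|'.join(map(re.escape, adict)))
    return rx.sub(lambda match: adict[match.group(0)], text)

# Subset of the article's mapping: drop tags and tabs, turn indents into newlines
adict = {'<br/>': '', '\t': '', '\u3000\u3000': '\n'}

# Hypothetical raw fragment with two paragraphs
raw = '\t\u3000\u3000第一段。<br/><br/>\u3000\u3000第二段。<br/>'
cleaned = multiple_replace(raw, adict)
print(cleaned)
```

Each paragraph ends up on its own line, with a leading newline before the first one.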

Save the data

    # Still inside the for loop: append this chapter to <title>.txt
    with open('%s.txt' % title, 'a', encoding='utf-8') as f:
        f.write(chapter_title)
        f.write(chapter_content)
        f.write('\n')
        print(chapter_url)
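The save step can be tried without touching the network. In this sketch `append_chapter` is a hypothetical wrapper around the same three writes, and the chapter text is made up; a temporary directory keeps the demo from leaving files behind.

```python
import os
import tempfile

def append_chapter(path, chapter_title, chapter_content):
    """Append one chapter (title, body, trailing newline) to the file at path."""
    with open(path, 'a', encoding='utf-8') as f:
        f.write(chapter_title)
        f.write(chapter_content)
        f.write('\n')

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'novel.txt')
    # Cleaned content starts with a newline, which separates it from the title
    append_chapter(path, '第一章', '\n第一段。')
    append_chapter(path, '第二章', '\n第二段。')
    with open(path, encoding='utf-8') as f:
        saved = f.read()

print(saved)
```

Because the file is opened in append mode, running the crawler a second time adds every chapter again; delete the old file first if you want a clean copy.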
