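I just got started with web scraping today, so naturally I'm beginning with the simplest case to put what I've learned to the test

Posted by 松鼠愛吃餅乾 on 2020-08-20

Original URL: https://www.cnblogs.com/hhh188764/p/13537459.html

Web scraping

Preface

Many free resources can be read online but offer no download option. Using novels as the example, this post shows how to save content that a website only lets you view.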

Key points:

requests
CSS selectors
How to approach crawling every novel on a site

Development environment:

Version: anaconda 5.2.0 (python 3.6.5)
Editor: PyCharm Community Edition

Code

Import the libraries

import requests
import parsel

Request headers

headers = {
    'User-Agent': 'gao fu shui'
}

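Request the data

# chapter_url is the address of one chapter page (the argument passed to download_one_chapter)
response = requests.get(chapter_url, headers=headers)
# Set the text encoding to utf-8 by hand
# response.encoding = 'utf-8'
# Catch-all decoding: letting requests detect the encoding is right about 99% of the time
# print(response.apparent_encoding)  # encoding that requests detects from the content
# print(response.encoding)  # encoding declared by the server
response.encoding = response.apparent_encoding
# print(response)
html = response.text
# print(html)
# print(response.headers)
# # response -> request -> request headers
# print(response.request.headers)
# # View the source: Ctrl + left click
# print(response.cookies)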
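
To see why the encoding assignment matters, here is a small offline sketch that uses only the standard library. The sample string and the choice of GBK are assumptions for illustration; the post does not say which encoding the site actually uses.

```python
# Offline illustration: decoding bytes with the wrong codec garbles the text.
# Assumption: the page body is GBK-encoded; the sample string is made up.
raw = '第一章'.encode('gbk')

# Decoding with the wrong codec does not fail here, it silently produces junk.
wrong = raw.decode('utf-8', errors='replace')
# Decoding with the codec the bytes were written in restores the text.
right = raw.decode('gbk')

print(repr(wrong))
print(right)
```

Setting response.encoding before reading response.text is the requests equivalent of picking the right codec in the second decode call.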

Parse the data

# css xpath
# parsel = css + xpath + re
# Turn the string into an object that can be queried
selector = parsel.Selector(html)

# selector.css()
# selector.xpath()
# selector.re()
# get() returns the first match as a string; getall() returns every match as a list
# ::text extracts text nodes, ::attr(name) extracts an attribute value
h1 = selector.css('.reader h1::text').get()
# print(h1)
content = selector.css('.showtxt::text').getall()
# print(content)
# # The same extraction with XPath
# h1 = selector.xpath('//h1/text()').get()
# print(h1)
# content = selector.xpath('//*[@class="showtxt"]//text()').getall()
# print(content)
# Strip the whitespace around each text fragment
# Start with an empty list to collect the cleaned lines
lines = []

for c in content:
    lines.append(c.strip())

print(h1)
# print(lines)

# str.join merges the list into a single string
text = '\n'.join(lines)
# print(text)
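
The cleanup loop can also be written as a small helper. This is a sketch rather than the author's code: clean_lines is a hypothetical name, the sample list is invented, and unlike the loop in the post it also drops fragments that are empty after stripping.

```python
def clean_lines(fragments):
    """Strip each text fragment and drop the ones that end up empty."""
    stripped = (fragment.strip() for fragment in fragments)
    return [line for line in stripped if line]


# Hypothetical sample of what .getall() might return for a chapter body.
sample = ['\r\n    First paragraph.', '\r\n', '\xa0\xa0Second paragraph.\r\n']
text = '\n'.join(clean_lines(sample))
print(text)
```

Dropping the empty fragments avoids runs of blank lines in the saved file; keep them if the blank lines carry meaning in the source text.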

Save the data

# Append the chapter to <book_name>.txt; the with block closes the file automatically
with open(book_name + '.txt', mode='a', encoding='utf-8') as file:
    file.write(h1)
    file.write('\n')
    file.write(text)
    file.write('\n')
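
A slightly more defensive variation of the save step might look like the sketch below. The helper names safe_filename and append_chapter are hypothetical, and the book and chapter data are made up; the idea is only that a book title taken from a web page may contain characters that are not allowed in file names.

```python
import os
import re
import tempfile


def safe_filename(name):
    """Replace characters that are invalid in file names on common systems."""
    return re.sub(r'[\\/:*?"<>|]', '_', name).strip()


def append_chapter(directory, book_name, title, text):
    """Append one chapter to <directory>/<book_name>.txt and return the path."""
    path = os.path.join(directory, safe_filename(book_name) + '.txt')
    # The with block closes the file even if a write raises.
    with open(path, mode='a', encoding='utf-8') as file:
        file.write(title + '\n')
        file.write(text + '\n')
    return path


# Usage with made-up data in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    path = append_chapter(tmp, 'Demo: Book', 'Chapter 1', 'Hello')
    append_chapter(tmp, 'Demo: Book', 'Chapter 2', 'World')
    with open(path, encoding='utf-8') as file:
        saved = file.read()
print(saved)
```

Because the file is opened in append mode, running the crawler twice for the same book will add every chapter a second time.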

Get the download URLs of all chapters

# download_one_chapter('http://www.shuquge.com/txt/8659/2324752.html')
# download_one_chapter('http://www.shuquge.com/txt/8659/2324753.html')
# download_one_chapter('http://www.shuquge.com/txt/8659/2324754.html')

def download_one_book(index_url):
    index_response = requests.get(index_url, headers=headers)
    index_response.encoding = index_response.apparent_encoding
    sel = parsel.Selector(index_response.text)
    book_name = sel.css('h2::text').get()
    # Extract the URLs of all chapters
    urls = sel.css('.listmain dl dd a::attr(href)').getall()
    # Skip the first 12 links: the newest 12 chapters are listed again at the top of the page
    for url in urls[12:]:
        chapter_url = index_url[:-10] + url
        print(chapter_url)
        download_one_chapter(chapter_url, book_name)
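
The slice index_url[:-10] works because 'index.html' is exactly 10 characters long. A sketch of a less fragile way to build the chapter address uses urljoin from the standard library. The relative href below is an assumption about what the index page contains, chosen to match the chapter URLs quoted earlier.

```python
from urllib.parse import urljoin

index_url = 'http://www.shuquge.com/txt/8659/index.html'
# Assumption: chapter links on the index page are relative, e.g. '2324752.html'.
href = '2324752.html'

# urljoin resolves the link against the page it was found on.
chapter_url = urljoin(index_url, href)
print(chapter_url)

# It also copes with root-relative links, which the slice would not.
same_url = urljoin(index_url, '/txt/8659/2324752.html')
```

With urljoin the code keeps working if the index page is ever served under a different file name.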
# download_one_book('http://www.shuquge.com/txt/8659/index.html')
# download_one_book('http://www.shuquge.com/txt/5809/index.html')
# download_one_book('http://www.shuquge.com/txt/63542/index.html')
"""Download the books listed on the category pages"""
# In a URL such as 2_1.html the first number is the category and the second is the page,
# so one for loop generates the categories and a nested one generates the pages
for cate in ['1', '2', '4']:
    for page in range(1, 101):
        cate_url = 'http://www.shuquge.com/category/' + cate + '_' + str(page) + '.html'
        cate_response = requests.get(cate_url, headers=headers)
        cate_response.encoding = cate_response.apparent_encoding
        sel = parsel.Selector(cate_response.text)
        # Extract the index-page URL of every book on this category page
        urls = sel.css('.l.bd > ul > li > span.s2 > a::attr(href)').getall()
        for url in urls:
            print(url)
            download_one_book(url)
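
The URL pattern used by the nested loops can be isolated in a small generator, which makes it easy to check the addresses before sending any requests. This is only a sketch: category_urls is a hypothetical helper, and the page range in the usage line is shortened for illustration.

```python
def category_urls(categories, first_page, last_page):
    """Yield category listing URLs in the same order as the nested loops."""
    base = 'http://www.shuquge.com/category/'
    for cate in categories:
        for page in range(first_page, last_page + 1):
            yield f'{base}{cate}_{page}.html'


# Two pages per category instead of 100, just to inspect the pattern.
urls = list(category_urls(['1', '2', '4'], 1, 2))
for url in urls:
    print(url)
```

Each generated address would then be fetched and parsed exactly as in the loop body.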
