Python Web Scraping Study 01: Scraping an E-book

Posted by 俠客小飛 on 2020-07-13

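1. Fetch the page

import requests        # import the requests library
'''
Fetch the page
'''
if __name__ == '__main__':          # script entry point
    target = 'https://www.xsbiquge.com/78_78513/108078.html'  # URL of the chapter to scrape
    req = requests.get(url=target)  # send a GET request
    req.encoding = 'utf-8'          # decode the response body as UTF-8
    print(req.text)                 # print the HTML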
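
Why set `req.encoding` at all? requests decodes `req.text` with the encoding it infers from the response headers, and when a text response declares no charset it falls back to ISO-8859-1, which garbles Chinese text. The sketch below reproduces that effect offline with plain `bytes`. The sample string is only an illustration, not data taken from the site.

```python
# Bytes as a server would send them for a UTF-8 encoded page.
raw = '元尊'.encode('utf-8')

# Decoding with the wrong codec yields unreadable characters (mojibake).
garbled = raw.decode('iso-8859-1')

# Decoding with the codec the page actually uses recovers the text.
correct = raw.decode('utf-8')

print(repr(garbled))
print(correct)
```

Setting `req.encoding = 'utf-8'` before reading `req.text` has the same effect as choosing the second decode.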

2. Parse the page with BeautifulSoup

import requests        # import the requests library
from bs4 import BeautifulSoup  # import BeautifulSoup

'''
Parse the page with BeautifulSoup
and extract the e-book text
'''
if __name__ == '__main__':          # script entry point
    target = 'https://www.xsbiquge.com/78_78513/108078.html'  # URL of the chapter to scrape
    req = requests.get(url=target)  # send the request and get the HTML
    req.encoding = 'utf-8'          # decode the response body as UTF-8
    html = req.text                 # keep the page HTML in the html variable
    bs = BeautifulSoup(html, 'lxml')  # parse the HTML with the lxml parser
    texts = bs.find('div', id='content')  # find the first <div id="content"> element
    print(texts)                    # print the element

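3. Split the data, strip whitespace, and extract the text

import requests        # import the requests library
from bs4 import BeautifulSoup  # import BeautifulSoup

'''
Parse the page with BeautifulSoup
and extract the e-book text.
In the last statement, texts.text extracts all the text, strip() removes
leading and trailing whitespace (including newlines), and split() cuts the
text on four \xa0 characters, because every paragraph starts with four
non-breaking spaces
'''
if __name__ == '__main__':          # script entry point
    target = 'https://www.xsbiquge.com/78_78513/108078.html'  # URL of the chapter to scrape
    req = requests.get(url=target)  # send the request and get the HTML
    req.encoding = 'utf-8'          # decode the response body as UTF-8
    html = req.text                 # keep the page HTML in the html variable
    bs = BeautifulSoup(html, 'lxml')  # parse the HTML with the lxml parser
    texts = bs.find('div', id='content')  # find the first <div id="content"> element
    print(texts.text.strip().split('\xa0' * 4))  # print the list of paragraphs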
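
The strip-then-split step can be tried without any network access. The sketch below applies it to a made-up string shaped like the chapter text, with four `\xa0` characters in front of each paragraph. `split_paragraphs` is a hypothetical helper name, and the extra filtering of empty pieces is an addition that the script above does not do.

```python
def split_paragraphs(raw_text):
    """Split chapter text into paragraphs on runs of four non-breaking spaces."""
    # strip() removes leading/trailing whitespace; in Python 3 that
    # includes '\xa0', so the first paragraph's indent is removed as well.
    pieces = raw_text.strip().split('\xa0' * 4)
    # Trim each piece and drop any that end up empty.
    return [piece.strip() for piece in pieces if piece.strip()]


# Made-up sample: two paragraphs, each indented with four '\xa0'.
sample = '\n\xa0\xa0\xa0\xa0First paragraph.\xa0\xa0\xa0\xa0Second paragraph.\n'
print(split_paragraphs(sample))  # ['First paragraph.', 'Second paragraph.']
```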

4. View the chapter list

import requests        # import the requests library
from bs4 import BeautifulSoup  # import BeautifulSoup

'''
View the chapter list:
parse the index page with BeautifulSoup
and print every chapter link
'''
if __name__ == '__main__':          # script entry point
    target = 'https://www.xsbiquge.com/78_78513/'  # URL to scrape: the chapter index of 《元尊》
    req = requests.get(url=target)      # send the request and get the HTML
    req.encoding = 'utf-8'              # decode the response body as UTF-8
    html = req.text                     # keep the page HTML in the html variable
    bs = BeautifulSoup(html, 'lxml')    # parse the HTML with the lxml parser
    chapters = bs.find('div', id='list')  # find the first <div id="list"> element
    chapters = chapters.find_all('a')   # collect every <a> tag inside the list
    for chapter in chapters:
        print(chapter)                  # print each chapter's <a> tag

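5. Get the chapter titles and chapter links

import requests        # import the requests library
from bs4 import BeautifulSoup  # import BeautifulSoup

'''
Parse the index page with BeautifulSoup
and print each chapter's title and full link
'''
if __name__ == '__main__':          # script entry point
    server = 'https://www.xsbiquge.com'  # address of the e-book site
    target = 'https://www.xsbiquge.com/78_78513/'  # URL to scrape: the chapter index of 《元尊》
    req = requests.get(url=target)      # send the request and get the HTML
    req.encoding = 'utf-8'              # decode the response body as UTF-8
    html = req.text                     # keep the page HTML in the html variable
    bs = BeautifulSoup(html, 'lxml')    # parse the HTML with the lxml parser
    chapters = bs.find('div', id='list')  # find the first <div id="list"> element
    chapters = chapters.find_all('a')   # collect every <a> tag inside the list
    for chapter in chapters:
        url = chapter.get('href')       # read the href attribute of the chapter link
        print("《" + chapter.string + "》")  # print the chapter title
        print(server + url)             # join the site address and the href to get the full chapter link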
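
Plain concatenation works here because the `href` values on this site start with `/`. As a hedged alternative, the standard library's `urllib.parse.urljoin` resolves an `href` against the index page URL and also copes with relative or absolute links. `chapter_url` is a hypothetical helper, and the relative and absolute `href` values below are invented purely to show the behavior.

```python
from urllib.parse import urljoin

# The chapter index URL used in this tutorial.
index_url = 'https://www.xsbiquge.com/78_78513/'


def chapter_url(href):
    """Resolve a chapter href against the index page URL."""
    return urljoin(index_url, href)


# Root-relative href, the form the concatenation above relies on.
print(chapter_url('/78_78513/108078.html'))
# Hypothetical page-relative href: resolved against the index directory.
print(chapter_url('108078.html'))
# Hypothetical absolute href: returned unchanged.
print(chapter_url('https://example.com/other.html'))
```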

6. Put it all together and download the e-book

import requests        # import the requests library
from bs4 import BeautifulSoup  # import BeautifulSoup
from tqdm import tqdm  # progress bar for the download loop


'''
Fetch the chapter list, download every chapter,
and append the text to a single file
'''
def get_content(target):
    req = requests.get(url=target)  # send the request and get the HTML
    req.encoding = 'utf-8'  # decode the response body as UTF-8
    html = req.text  # keep the page HTML in the html variable
    bf = BeautifulSoup(html, 'lxml')  # parse the HTML with the lxml parser
    texts = bf.find('div', id='content')  # find the first <div id="content"> element
    content = texts.text.strip().split('\xa0' * 4)  # split the text into paragraphs
    return content


if __name__ == '__main__':          # script entry point
    server = 'https://www.xsbiquge.com'     # address of the e-book site
    book_name = '《元尊》.txt'               # output file name
    target = 'https://www.xsbiquge.com/78_78513/'  # URL to scrape: the chapter index of 《元尊》
    req = requests.get(url=target)      # send the request and get the HTML
    req.encoding = 'utf-8'              # decode the response body as UTF-8
    html = req.text                     # keep the page HTML in the html variable
    chapter_bs = BeautifulSoup(html, 'lxml')      # parse the HTML with the lxml parser
    chapters = chapter_bs.find('div', id='list')  # find the first <div id="list"> element
    chapters = chapters.find_all('a')   # collect every <a> tag inside the list
    # Open the output file once, in append mode, instead of once per chapter
    with open(book_name, 'a', encoding='utf-8') as f:
        for chapter in tqdm(chapters):
            chapter_name = chapter.string           # chapter title
            url = server + chapter.get('href')      # full chapter link built from the href
            content = get_content(url)              # list of paragraphs for this chapter
            f.write("《" + chapter_name + "》")
            f.write('\n')
            f.write('\n'.join(content))
            f.write('\n')

P.S. The download can be a bit slow: one book takes somewhat more than ten minutes. I will improve this once I learn new techniques.
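
Most of the running time is spent waiting on one request after another. One possible way to speed this up, which is not the author's method, is to fetch several chapters at once with a thread pool from the standard library. In the sketch below `download_all` and `fake_fetch` are hypothetical names. `fake_fetch` stands in for `get_content` so the example runs without network access, and the approach assumes the site tolerates a few concurrent requests.

```python
from concurrent.futures import ThreadPoolExecutor


def download_all(urls, fetch, max_workers=4):
    """Call fetch(url) for every URL concurrently; results keep input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map yields results in the same order as the input URLs,
        # so chapters can still be written to the file in sequence.
        return list(pool.map(fetch, urls))


def fake_fetch(url):
    """Stand-in for get_content(url): returns a list of paragraphs."""
    return ['content of ' + url]


results = download_all(['a', 'b', 'c'], fake_fetch)
print(results)
```

In the script above, `download_all(chapter_urls, get_content)` would return one paragraph list per chapter, which could then be written to the file in a plain loop.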
