How to scrape dynamically generated web page content with a Python crawler

Posted by 潘_谈 on 2024-10-31

Tags: Python, crawler, web page

--- There are many good approaches; let's master one first ---

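【Background】

For static pages, we can usually fetch the entire page's content with requests.get() from Python's requests library.

For dynamically generated content, however, requests.get() alone is not enough: it returns only the HTML the server sends, and anything that JavaScript adds afterwards in the browser is missing.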
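
To make the difference concrete, here is a minimal sketch using only the standard library. The page source in RAW_HTML is hypothetical: its chapter container is empty in the markup and is filled in by a script, which is roughly what a plain HTTP fetch of a dynamic page sees.

```python
from html.parser import HTMLParser

# Hypothetical page source: the chapter container is empty in the raw HTML
# and is only filled in by JavaScript once a browser executes the script.
RAW_HTML = """
<html><body>
  <div id="chapters"></div>
  <script>
    document.getElementById("chapters").innerHTML =
      "<a href='/chapter/1.html'>Chapter 1</a>";
  </script>
</body></html>
"""


class LinkCollector(HTMLParser):
    """Collects the href of every <a> tag that exists in the markup itself."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.links.append(dict(attrs).get("href"))


collector = LinkCollector()
collector.feed(RAW_HTML)
# The <a> only appears inside script text, so no link is found in the static HTML
print(collector.links)  # → []
```

A browser (or selenium driving one) runs the script first, so the link exists by the time the page is inspected.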

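【Method】

We can use the third-party Python library selenium to help retrieve the information. The stack used here: python + requests + selenium + BeautifulSoup

We take collecting a novel from 縱橫中文網 (Zongheng Chinese Network) as the example (note: please check the site's robots.txt to find out what may be crawled; as the saying goes, even thieves have a code):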

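Approach:

  1. Use selenium's element-locating methods to find the novel's chapter information

  2. Process it with BeautifulSoup and extract each chapter's title and link

  3. Use requests + BeautifulSoup to fetch the novel's content from each chapter link and save it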
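
Before any of these steps, the robots.txt check mentioned above can be done with the standard library. This is a minimal sketch: the rules in ROBOTS_TXT and the example.com URLs are hypothetical, not the real rules of any site.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, for illustration only
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def build_robot_parser(robots_text: str) -> RobotFileParser:
    """Parse robots.txt text that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser


rp = build_robot_parser(ROBOTS_TXT)
allowed = rp.can_fetch("Mozilla/5.0", "https://example.com/book/1.html")
blocked = rp.can_fetch("Mozilla/5.0", "https://example.com/private/x.html")
print(allowed, blocked)  # → True False
```

Against a live site, RobotFileParser can also download the file itself: call set_url() with the site's robots.txt address and then read().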

【Code】

1. First, in the browser's developer tools, work out and debug the XPath expression that locates the elements you need
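
The real debugging happens in the browser, but an expression can also be sanity-checked offline against a saved fragment. The sketch below uses hypothetical, simplified markup (the class name chapter-list and the URLs are assumptions). It relies on xml.etree.ElementTree, which needs well-formed input and supports only a subset of XPath, whereas selenium accepts full XPath 1.0.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified chapter-list markup (well-formed so ElementTree can parse it)
FRAGMENT = """
<body>
  <div class="chapter-list">
    <ul>
      <li><a href="//example.com/chapter/1.html">Chapter 1</a></li>
      <li><a href="//example.com/chapter/2.html">Chapter 2</a></li>
    </ul>
  </div>
  <div class="footer"><ul><li><a href="/about">About</a></li></ul></div>
</body>
"""

root = ET.fromstring(FRAGMENT)
# Select only the <li> items inside the chapter list, not the footer links
items = root.findall(".//div[@class='chapter-list']/ul/li")
titles = [item.find("a").text for item in items]
print(titles)  # → ['Chapter 1', 'Chapter 2']
```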

2. Use selenium's find_elements() to locate all of the novel's chapters. Here we define a function that takes parameters:

import time

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By


def Get_novel_chapters_info(url: str, xpath: str, skip_num=None, chapters_num=None):
    # skip_num: number of leading chapters to skip (default: skip none)
    # chapters_num: number of chapters to collect (default: all chapters)
    # Create Chrome options (headless: no graphical interface)
    chrome_options = Options()
    chrome_options.add_argument("--headless")
    driver = webdriver.Chrome(options=chrome_options)
    try:
        driver.get(url)
        driver.maximize_window()
        time.sleep(3)  # crude fixed wait for the dynamic content to render
        # Collect the novel's chapter elements; slicing with None keeps everything
        catalogues = driver.find_elements(By.XPATH, xpath)
        catalogues_list = [c.get_attribute('outerHTML') for c in catalogues[skip_num:]]
        return catalogues_list[:chapters_num]
    except Exception:
        # On failure return an empty list (rather than None) so callers can still iterate
        return []
    finally:
        # Always close the browser, whether or not the collection succeeded
        driver.quit()
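
The fixed time.sleep(3) either wastes time or is too short on a slow connection. One common alternative is selenium's explicit wait, sketched below. The helper name wait_for_chapters and the 10-second default are assumptions for illustration, not part of the original code; the selenium imports sit inside the function only so the snippet can be loaded without selenium installed.

```python
def wait_for_chapters(driver, xpath: str, timeout: float = 10.0):
    """Return the elements matching xpath as soon as at least one is present.

    Raises selenium's TimeoutException if nothing matches within `timeout` seconds.
    """
    # Imported here so this sketch can be loaded without selenium installed
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    wait = WebDriverWait(driver, timeout)
    # Polls the page until the condition holds, then returns the matched elements
    return wait.until(EC.presence_of_all_elements_located((By.XPATH, xpath)))
```

Inside the function above, this call could stand in for the time.sleep(3) plus find_elements() pair, for example `catalogues = wait_for_chapters(driver, xpath)`.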

3. Process the collected information with BeautifulSoup and extract each chapter's title and link

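from bs4 import BeautifulSoup

# catalogues_list is the list returned by Get_novel_chapters_info()
# Get each chapter's title and its link
title_link = {}
for each in catalogues_list:
    bs = BeautifulSoup(each, 'html.parser')
    chapter = bs.find('a')
    if chapter is None:
        continue  # skip entries that contain no link
    title = chapter.text
    # Prefixing 'https:' only works if the href is protocol-relative (starts with '//')
    link = 'https:' + chapter.get('href')
    title_link[title] = link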
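
Prefixing 'https:' only works when every href starts with '//'. A more tolerant option is urljoin from the standard library, which resolves protocol-relative, root-relative and absolute links against the page they came from. This is a minimal sketch; PAGE_URL and the example.com links are hypothetical.

```python
from urllib.parse import urljoin

PAGE_URL = "https://example.com/book/1.html"  # hypothetical catalogue page


def absolute_link(page_url: str, href: str) -> str:
    """Resolve an href found on page_url into an absolute URL."""
    return urljoin(page_url, href)


protocol_relative = absolute_link(PAGE_URL, "//read.example.com/chapter/2.html")
root_relative = absolute_link(PAGE_URL, "/chapter/3.html")
already_absolute = absolute_link(PAGE_URL, "https://other.example.com/x.html")

print(protocol_relative)  # → https://read.example.com/chapter/2.html
print(root_relative)      # → https://example.com/chapter/3.html
print(already_absolute)   # → https://other.example.com/x.html
```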

4. Use requests + BeautifulSoup to fetch the novel's content from each chapter link and save it to a single file

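import requests
from bs4 import BeautifulSoup

# Save the novel's content chapter by chapter
novel_path = 'path/to/novels/novel_name.txt'
# Explicit UTF-8: the default file encoding depends on the platform
with open(novel_path, 'a', encoding='utf-8') as f:
    for title, url in title_link.items():
        response = requests.get(url, headers={'user-agent': 'Mozilla/5.0'}, timeout=10)
        html = response.content.decode('utf-8')
        soup = BeautifulSoup(html, 'html.parser')
        content = soup.find('div', class_='content').text
        # Write the chapter title first, then the chapter text
        f.write('---小西瓜免費小說---' + '\n'*2)
        f.write(title + '\n')
        f.write(content + '\n'*3)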
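
The loop above requests every chapter back to back, and a single failed request stops the whole run. Below is a minimal sketch of a gentler loop. The names fetch_chapters, fetch and delay_seconds are assumptions for illustration; the download function is passed in so the loop can be tried without network access.

```python
import time


def fetch_chapters(urls, fetch, delay_seconds=1.0, sleep=time.sleep):
    """Fetch each URL in order, pausing between requests.

    `fetch` is any callable that takes a URL and returns the page text.
    A URL whose fetch raises is recorded as None instead of aborting the run.
    """
    pages = {}
    for index, url in enumerate(urls):
        if index > 0:
            sleep(delay_seconds)  # pause before every request except the first
        try:
            pages[url] = fetch(url)
        except Exception:
            pages[url] = None
    return pages


# Demonstration with a fake downloader (hypothetical URLs, no network needed)
def fake_fetch(url):
    if url.endswith("2.html"):
        raise RuntimeError("simulated network failure")
    return "<html>ok</html>"


pauses = []
demo_urls = [f"https://example.com/chapter/{n}.html" for n in (1, 2, 3)]
pages = fetch_chapters(demo_urls, fake_fetch, delay_seconds=0.5, sleep=pauses.append)
print(pauses)  # → [0.5, 0.5]
print(pages["https://example.com/chapter/2.html"])  # → None
```

In real use, fetch could be a small wrapper that calls requests.get() with the same headers and a timeout, calls raise_for_status() on the response, and returns the decoded text.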
