Datawhale Web Scraping Task 3 (Beautiful Soup)

Posted by TNTZS666 on 2019-03-03

Original URL: https://blog.csdn.net/tntzs666/article/details/88087789

Beautiful Soup

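Beautiful Soup is a very popular Python module. It parses web pages and provides a convenient interface for locating content.
The first step in using Beautiful Soup is to parse the downloaded HTML into a soup document. Because most web pages are not well-formed HTML, Beautiful Soup has to work out their actual structure.

For example, the simple list below is missing the quotes around an attribute value and leaves its list-item tags unclosed: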

<ul class = country>
	<li>Area
	<li>Population
</ul>

If the Population item were parsed as a child of the Area item rather than as its sibling, we would get wrong results when scraping. Let's see how Beautiful Soup handles it.

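from bs4 import BeautifulSoup

broken_html = '<ul class=country><li>Area<li>Population</ul>'
# Use the lxml parser here: Python's built-in 'html.parser' would nest the
# second <li> inside the first instead of treating them as siblings.
soup = BeautifulSoup(broken_html, 'lxml')
fixed_html = soup.prettify()
print(fixed_html)

<html>
 <body>
  <ul class="country">
   <li>
    Area
   </li>
   <li>
    Population
   </li>
  </ul>
 </body>
</html>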

As the output above shows, Beautiful Soup adds the missing quotes and closes the tags. We can now use the find() and find_all() methods to locate the elements we need.
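As a minimal sketch of how find() and find_all() differ, the snippet below runs both on a well-formed copy of the country list. The variable names are illustrative only, and 'html.parser' is used so that nothing beyond bs4 needs to be installed.

```python
from bs4 import BeautifulSoup

# Well-formed version of the country list, so every parser builds the same tree.
html = '<ul class="country"><li>Area</li><li>Population</li></ul>'
soup = BeautifulSoup(html, 'html.parser')

# find() returns the first matching tag, or None if nothing matches.
ul = soup.find('ul', attrs={'class': 'country'})
first_item = ul.find('li')

# find_all() returns a list of every matching tag.
items = [li.get_text() for li in ul.find_all('li')]

print(first_item.get_text())  # Area
print(items)                  # ['Area', 'Population']
```

Because find() returns None when there is no match, it is worth checking the result before calling methods on it.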

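Case study:

Use Beautiful Soup to extract every reply, together with the information about each replier, from a specific thread on the DXY (丁香園) forum.

First, open the thread on the DXY forum and view the page source to find the relevant HTML tags.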

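Poster: a div with class "auth".
Post content: a td with class "postbody".

Both tags are unique within each post, so find is enough to locate them in the HTML: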

user_id = item.find("div", "auth").get_text()
content = item.find("td", "postbody").get_text("|", strip=True)
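To make the two calls above concrete, here is a small sketch that runs them on a hand-written fragment. The markup is hypothetical and only mimics the class names used above; the real DXY page may be structured differently. The second positional argument to find() is a CSS class filter, and get_text("|", strip=True) joins the stripped text pieces with "|".

```python
from bs4 import BeautifulSoup

# Hypothetical markup for a single post; the real page structure may differ.
sample = '''
<table><tbody>
<tr>
  <td><div class="auth"> user_a </div></td>
  <td class="postbody">First line<br>Second line</td>
</tr>
</tbody></table>
'''
item = BeautifulSoup(sample, 'html.parser')

# A string as the second argument matches on the class attribute.
user_id = item.find("div", "auth").get_text(strip=True)
# The separator is placed between text pieces, here on either side of <br>.
content = item.find("td", "postbody").get_text("|", strip=True)

print(user_id)  # user_a
print(content)  # First line|Second line
```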

Full code:

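import requests
from bs4 import BeautifulSoup as bs


def get_soup():
    """Download the thread page and return its HTML text, or None on failure."""
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.119 Safari/537.36'
    }
    url = 'http://www.dxy.cn/bbs/thread/626626'
    try:
        response = requests.get(url, headers=headers, timeout=10)
        if response.status_code == 200:
            return response.text
    except requests.RequestException:
        pass
    return None


def get_item(html):
    """Return a list of (user_id, content) tuples, one per post."""
    topic_con = bs(html, 'lxml')
    # The script treats every <tbody> on the page as one candidate post.
    table = topic_con.find_all('tbody')
    datas = []
    for item in table:
        try:
            user_id = item.find("div", "auth").get_text()
            content = item.find("td", "postbody").get_text("|", strip=True)
            datas.append((user_id, content))
        except AttributeError:
            # find() returned None, so this <tbody> is not a post; skip it.
            pass
    return datas


def main():
    html = get_soup()
    if html is None:
        print('Failed to download the page')
        return
    info = get_item(html)
    for x in info:
        print(x)


if __name__ == '__main__':
    main()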

Result: running the script prints one (user_id, content) tuple for each post in the thread.
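The parsing step can also be tried offline. The sketch below is a variant of the extraction loop, not the original author's code: parse_posts and sample_page are hypothetical names, the fragment is invented, and the loop checks for None explicitly instead of relying on try/except.

```python
from bs4 import BeautifulSoup


def parse_posts(html, parser='html.parser'):
    """Return (user_id, content) tuples for every <tbody> that looks like a post."""
    soup = BeautifulSoup(html, parser)
    posts = []
    for item in soup.find_all('tbody'):
        auth = item.find('div', 'auth')
        body = item.find('td', 'postbody')
        # find() returns None when a tag is missing, so skip incomplete blocks.
        if auth is None or body is None:
            continue
        posts.append((auth.get_text(strip=True), body.get_text('|', strip=True)))
    return posts


# Hypothetical page fragment: two posts and one unrelated <tbody>.
sample_page = '''
<table><tbody><tr><td><div class="auth">user_a</div></td>
<td class="postbody">Hello</td></tr></tbody></table>
<table><tbody><tr><td>advert</td></tr></tbody></table>
<table><tbody><tr><td><div class="auth">user_b</div></td>
<td class="postbody">Reply<br>with two lines</td></tr></tbody></table>
'''

print(parse_posts(sample_page))
# [('user_a', 'Hello'), ('user_b', 'Reply|with two lines')]
```

Checking for None keeps unrelated errors visible, whereas a broad except clause would hide them along with the expected misses.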
