BeautifulSoup (bs4): A Detailed Guide

Posted by ihav2carryon on 2024-11-30

Original URL: https://www.cnblogs.com/ihave2carryon/p/18578873

BeautifulSoup(bs4)

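BeautifulSoup is a Python library whose main purpose is extracting data from web pages. The official documentation describes it this way: BeautifulSoup provides simple, Pythonic functions for navigating, searching, and modifying the parse tree. It is a toolkit that parses a document and hands you the data you want to scrape, and because it is simple, a complete program takes very little code.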

Installing bs4

Install it directly with pip install:

pip install beautifulsoup4

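The lxml parser

lxml is a high-performance Python library for processing XML and HTML documents. Compared with bs4, lxml offers more features and better performance, which is most noticeable on large documents. lxml can be used together with bs4 or on its own.

Installing lxml

Install it with pip install as well:

pip install lxml

The examples that follow use lxml as the parser for bs4.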
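If lxml might not be installed everywhere the code runs, one option is to fall back to Python's built-in html.parser. The sketch below assumes a hypothetical helper named make_soup (it is not part of bs4) and relies on bs4 raising FeatureNotFound when the requested parser is unavailable.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup):
    """Parse markup with lxml when available, else with html.parser."""
    try:
        return BeautifulSoup(markup, "lxml")
    except FeatureNotFound:
        # lxml is not installed; use the parser from the standard library
        return BeautifulSoup(markup, "html.parser")


soup = make_soup("<p>Hello</p>")
print(soup.p.string)  # Hello
```

The two parsers can build slightly different trees for malformed markup, so a fallback like this suits well-formed input best.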

Navigating structured data with BeautifulSoup

.title: get the <title> tag

from bs4 import BeautifulSoup

html_doc = """....
"""
# Create a BeautifulSoup object using the lxml parser
soup = BeautifulSoup(html_doc, 'lxml')
print(soup.title)
#output-><title>The Dormouse's story</title>

.name: get the name of a tag or of the document object

soup = BeautifulSoup(html_doc, 'lxml')
print(soup.title.name)
print(soup.name)
#output->title
#[document]

.string/.text: get the text content of a tag

soup = BeautifulSoup(html_doc, 'lxml')
print(soup.title.string)
print(soup.title.text)
#output->The Dormouse's story
#The Dormouse's story

.p: get the first <p> tag

soup = BeautifulSoup(html_doc, 'lxml')
print(soup.p)
#output-><p class="title"><b>The Dormouse's story</b></p>

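.find_all(name, attrs={}): get all matching tags. Parameters: name is the tag name, such as 'a' for <a> tags or 'p' for <p> tags; attrs={} is a dictionary that filters on attribute values, such as attrs={'class': 'story'}

# Create a BeautifulSoup object using the lxml parser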
soup = BeautifulSoup(html_doc, 'lxml')
print(soup.find_all('p'))
print(soup.find_all('p', attrs={'class': 'title'}))

.find(name, attrs={}): get the first element that matches the conditions

soup = BeautifulSoup(html_doc, 'lxml')
print(soup.find(id="link1"))
#output-><a class="sister" href="https://example.com/elsie" id="link1">Elsie</a>

.parent: get the parent tag

soup = BeautifulSoup(html_doc, 'lxml')
print(soup.title.parent)
#output-><head><title>The Dormouse's story</title></head>

.p['class']: get the value of the class attribute

soup = BeautifulSoup(html_doc, 'lxml')
print(soup.p["class"])
#output->['title']

.get_text(): get all the text in the document

soup = BeautifulSoup(html_doc, 'lxml')
print(soup.get_text())
The Dormouse's story

The Dormouse's story
Once upon a time there were three little sisters; and their names were
Elsie,
Lacie and
Tillie;
and they lived at the bottom of a well.
...
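get_text() also accepts a separator and a strip flag, which helps when the raw output carries as much whitespace as the listing above. A minimal sketch on a made-up one-line document, using html.parser only so that it runs without lxml:

```python
from bs4 import BeautifulSoup

# A tiny illustrative document, not the html_doc used in the article
soup = BeautifulSoup("<p>Hello <b>big</b> world</p>", "html.parser")

plain = soup.get_text()                   # text nodes concatenated as they are
joined = soup.get_text("|", strip=True)   # strip each text node, join with "|"

print(repr(plain))   # 'Hello big world'
print(repr(joined))  # 'Hello|big|world'
```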

Find the links of all <a> tags in the document

a_tags = soup.find_all('a')
for a_tag in a_tags:
    print(a_tag.get("href"))
#output->https://example.com/elsie
#https://example.com/lacie
#https://example.com/tillie

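Kinds of BeautifulSoup objects

When you parse an HTML or XML document, BeautifulSoup converts the whole document into a tree in which every node (tag, text, or comment) is represented as a Python object

The BeautifulSoup tree structure

In an HTML document, the root node is usually the <html> tag, and all other tags and text are its descendants

Consider the following HTML document: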

<html>
    <head>
        <title>The Dormouse's story</title>
    </head>
    <body>
        <h1>The Dormouse's story</h1>
        <p>Once upon a time...</p>
    </body>
</html>

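After BeautifulSoup parses it, <html> is the root node, and <head> and <body>, nested directly inside <html>, are its children. Likewise, <title> is a child of <head>, and <h1> and <p> are children of <body>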
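The parent/child relationships described above can be printed as an indented outline. This is a small sketch: outline is a hypothetical helper written for this illustration, and html.parser is used so that it runs without lxml.

```python
from bs4 import BeautifulSoup, Tag

html_doc = """<html>
    <head>
        <title>The Dormouse's story</title>
    </head>
    <body>
        <h1>The Dormouse's story</h1>
        <p>Once upon a time...</p>
    </body>
</html>"""


def outline(node, depth=0):
    """Return the tag names under node, indented by nesting depth."""
    lines = []
    for child in node.children:
        if isinstance(child, Tag):  # skip text nodes such as whitespace
            lines.append("  " * depth + child.name)
            lines.extend(outline(child, depth + 1))
    return lines


soup = BeautifulSoup(html_doc, "html.parser")
tree = outline(soup)
print("\n".join(tree))
# html
#   head
#     title
#   body
#     h1
#     p
```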

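Object types

BeautifulSoup has four main object types: Tag, NavigableString, BeautifulSoup, and Comment

Tag

A Tag object corresponds to a tag in the original HTML or XML document. Each Tag can contain other tags, text, and attributes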

soup = BeautifulSoup(html_doc, 'lxml')
tag = soup.title
print(type(tag))
#output-><class 'bs4.element.Tag'>

NavigableString

A NavigableString object represents the text inside a tag. It is an immutable string and can be obtained through a Tag's .string attribute

soup = BeautifulSoup(html_doc, 'lxml')
tag = soup.title
print(type(tag.string))
#output-> <class 'bs4.element.NavigableString'>

BeautifulSoup

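A BeautifulSoup object represents the whole document. It can be treated as a special Tag, but it has no real tag name (its .name is the placeholder [document]) and no attributes. It supports traversing, searching, and modifying the whole document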

soup = BeautifulSoup(html_doc, 'lxml')
print(type(soup))
#output-> <class 'bs4.BeautifulSoup'>

Comment

A Comment object is a special kind of NavigableString that represents a comment in HTML or XML

# Parse a snippet whose only <b> tag contains a comment
soup = BeautifulSoup('<b><!--This is a comment--></b>', 'lxml')
print(type(soup.b.string))
#output-> <class 'bs4.element.Comment'>
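Because Comment is a subclass of NavigableString, an isinstance check is one way to tell comments apart from ordinary text. The sketch below uses a made-up snippet and html.parser; the lambda passed to string= is just one possible filter.

```python
from bs4 import BeautifulSoup, Comment, NavigableString

soup = BeautifulSoup(
    "<b><!--This is a comment--></b><i>plain text</i>", "html.parser"
)

comment = soup.b.string  # the only child of <b> is a comment
text = soup.i.string     # the only child of <i> is ordinary text

print(type(comment).__name__)                # Comment
print(type(text).__name__)                   # NavigableString
print(isinstance(comment, NavigableString))  # True

# Collect every comment in the document
comments = soup.find_all(string=lambda s: isinstance(s, Comment))
```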

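Traversing the document tree with BeautifulSoup

BeautifulSoup provides many ways to traverse the parsed document tree

Navigating to parent nodes

.parent and .parents: .parent returns the direct parent of the current node, and .parents iterates over all of its ancestors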

soup = BeautifulSoup(html_doc, 'lxml')
title_tag = soup.title
print(title_tag.parent)
#<head><title>The Dormouse's story</title></head>

soup = BeautifulSoup(html_doc, 'lxml')
body_tag = soup.body
for parent in body_tag.parents:
    print(parent)
#<html><head><title>The Dormouse's story</title></head>
#<body>
#<p class="title"><b>The Dormouse's story</b></p>
#<p class="story">Once upon a time there were three little sisters; and their names were
#<a class="sister" href="https://example.com/elsie" id="link1">Elsie</a>,
#....

Navigating to child nodes

.contents: returns all direct children of the current node as a list

soup = BeautifulSoup(html_doc, 'lxml')
head_contents = soup.head.contents
print(head_contents)
#output-> [<title>The Dormouse's story</title>]

.children: iterates over the direct children of the current node; it returns an iterator, not a list

soup = BeautifulSoup(html_doc, 'lxml')
body_children = soup.body.children

for child in body_children:
    print(child)
#output-><p class="title"><b>The Dormouse's story</b></p>
#<a class="sister" href="https://example.com/elsie" id="link1">Elsie</a>,
#<a class="sister" href="https://example.com/tillie" id="link3">Tillie</a>;
#and they lived at the bottom of a well.</p>
#.....

Strings have no .children or .contents attributes, because a string cannot contain anything
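The difference between .contents and .children shows up as soon as you index or measure the result. A minimal sketch on a made-up list, using html.parser so that it runs without lxml:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<ul><li>one</li><li>two</li></ul>", "html.parser")
ul = soup.ul

contents = ul.contents        # a real list: supports len() and indexing
first = contents[0].string    # text of the first <li>

# .children must be iterated; it is not a list
child_names = [child.name for child in ul.children]
is_list = isinstance(ul.children, list)

print(len(contents), first, child_names, is_list)
```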

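Navigating all descendants

.contents and .children contain only a tag's direct children. For example, the <head> tag has a single direct child, <title>

#[<title>The Dormouse's story</title>]

But the <title> tag has a child of its own: the string "The Dormouse's story". That string is a descendant of the <head> tag

.descendants iterates over all descendants of the current node, depth-first in document order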

soup = BeautifulSoup(html_doc, 'lxml')
for descendant in soup.descendants:
    print(descendant)
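To make the children/descendants distinction concrete, the sketch below counts both for a <head> tag. It uses a made-up one-line document and html.parser so that it runs without lxml.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    "<head><title>The Dormouse's story</title></head>", "html.parser"
)
head = soup.head

children = list(head.children)        # only the <title> tag
descendants = list(head.descendants)  # the <title> tag, then its string

print(len(children))     # 1
print(len(descendants))  # 2
```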

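Node content

.string
- If a tag has exactly one child and that child is a NavigableString, .string returns that child. If the only child is another tag that itself has a .string, the parent reports the same .string, which is why soup.head.string works below.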
```
soup = BeautifulSoup(html_doc, 'lxml')
print(soup.head.string)
#The Dormouse's story
print(soup.title.string)
#The Dormouse's story
```
- But if a tag contains more than one child, it is ambiguous which child .string should refer to, so .string is None
```
soup = BeautifulSoup(html_doc, 'lxml')
print(soup.body.string)
#None
```

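.strings and .stripped_strings

.strings iterates over all the text inside a tag; .stripped_strings does the same but strips surrounding whitespace and skips whitespace-only strings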

soup = BeautifulSoup(html_doc, 'lxml')
for string in soup.strings:
    print(string)
#The Dormouse's story
......

#The Dormouse's story

soup = BeautifulSoup(html_doc, 'lxml')
for string in soup.stripped_strings:
    print(string)
#The Dormouse's story
#The Dormouse's story
#Once upon a time there were three little sisters; and their names were
#Elsie
#,
...
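A compact way to see the difference is to collect both generators into lists. The snippet below is a sketch on a made-up fragment, parsed with html.parser so that it runs without lxml.

```python
from bs4 import BeautifulSoup

html_doc = "<div>\n  <p> one </p>\n  <p>two</p>\n</div>"
soup = BeautifulSoup(html_doc, "html.parser")

raw = list(soup.strings)             # includes the whitespace between tags
clean = list(soup.stripped_strings)  # stripped, whitespace-only strings dropped

print(raw)
print(clean)  # ['one', 'two']
```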

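Searching the document tree with BeautifulSoup

BeautifulSoup provides several methods for searching the parsed document tree

find_all(name , attrs , recursive , string , **kwargs)

find_all() searches all the descendants of the current tag and returns those that match the filters

soup = BeautifulSoup(html_doc, 'lxml')
print(soup.find_all("title"))  # find all title tags
print(soup.find_all("p", "title"))  # find p tags whose class is title

print(soup.find_all("a"))  # find all a tags

print(soup.find_all(id="link2"))  # find the tag whose id is link2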
#[<title>The Dormouse's story</title>]
#[<p class="title"><b>The Dormouse's story</b></p>]
#[<a class="sister" href="https://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="https://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="https://example.com/tillie" id="link3">Tillie</a>]
#[<a class="sister" href="https://example.com/lacie" id="link2">Lacie</a>]

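Next, let's go through what each parameter means

The name parameter

The name parameter finds all tags whose name is name; string objects are ignored automatically

soup.find_all("title")
# [<title>The Dormouse's story</title>]

name accepts any kind of filter: a string, a regular expression, a list, a function, and so on
Passing a string

A string is the simplest filter. Pass a string to a search method and BeautifulSoup looks for tags whose name matches it exactly
- The following example finds all <b> tags in the document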
```
soup.find_all('b')
# [<b>The Dormouse's story</b>]
```
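Passing a regular expression

If you pass a regular expression, BeautifulSoup filters tag names with the pattern's search() method
- Find tags whose names start with b; both <body> and <b> should be found
```
import re
soup = BeautifulSoup(html_doc, 'lxml')
for tag in soup.find_all(re.compile("^b")):
    print(tag.name)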
# body
# b
```
Passing a list

If you pass a list, Beautiful Soup returns anything that matches any item in the list
- Find all <a> tags and <b> tags in the document
```
soup = BeautifulSoup(html_doc, 'lxml')
for tag in soup.find_all(['a', 'b']):
    print(tag.name)
#b
#a
#a
#a
#b
```
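The name parameter also accepts a function, which the examples above do not show. A function filter receives each tag and returns True to keep it. The sketch below uses a shortened made-up document and html.parser; has_class_but_no_id is a name chosen for this illustration.

```python
from bs4 import BeautifulSoup

html_doc = """
<p class="title"><b>The Dormouse's story</b></p>
<p class="story"><a class="sister" href="https://example.com/elsie" id="link1">Elsie</a></p>
"""


def has_class_but_no_id(tag):
    """Keep tags that define a class attribute but no id attribute."""
    return tag.has_attr("class") and not tag.has_attr("id")


soup = BeautifulSoup(html_doc, "html.parser")
names = [tag.name for tag in soup.find_all(has_class_but_no_id)]
print(names)  # ['p', 'p']
```

The <b> tag is skipped because it has no class, and the <a> tag is skipped because it has an id.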

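The **kwargs parameter

In BeautifulSoup, **kwargs (keyword arguments) select tags by their attributes. They can be passed directly to find and find_all, which makes searches more powerful. The attribute name is the keyword, and the value can be a string, a regular expression, or a list

Using a key=value pair

Pass the attribute as key='word'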

soup = BeautifulSoup(html_doc, 'lxml')
print(soup.find_all(id='link1'))
#[<a class="sister" href="https://example.com/elsie" id="link1">Elsie</a>]

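Using a regular expression

Use regular expressions from Python's re module to match attribute values, which makes the search more flexible

import re
soup = BeautifulSoup(html_doc, 'lxml')
print(soup.find_all('a', href=re.compile("elsie")))  # find a tags whose href contains elsie
print(soup.find_all(string=re.compile("^The")))  # find strings that start with The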
#[<a class="sister" href="https://example.com/elsie" id="link1">Elsie</a>]
#["The Dormouse's story", "The Dormouse's story"]

Using a list

You can pass a list as the value of a keyword argument, and BeautifulSoup matches any value in the list

soup = BeautifulSoup(html_doc, 'lxml')
print(soup.find('a', id=['link1', 'link2']))  # find the first a tag whose id is link1 or link2
print(soup.find_all(class_=['sister', 'story']))  # find tags whose class is sister or story
#<a class="sister" href="https://example.com/elsie" id="link1">Elsie</a>
#[<p class="story">Once upon a time there were three little sisters; and their names were
#...

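Special attribute names

Some HTML attribute names clash with Python reserved words or are not valid Python identifiers, so BeautifulSoup provides alternative ways to match them

class_: matches the class attribute, because class is a reserved word in Python
data-*: custom data-* attributes cannot be written as keyword arguments, so match them through the attrs dictionary

soup = BeautifulSoup(html_doc, 'lxml')
print(soup.find_all('p', class_="title"))  # find all p tags whose class is title
print(soup.find_all('p', attrs={'class': 'story'}))  # find all p tags whose class is story, using the attrs dictionary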
#[<p class="title"><b>The Dormouse's story</b></p>]
#[<p class="story">Once upon a time there were three little sisters; and their names were
#<a class="sister" href="https://example.com/elsie" id="link1">Elsie</a>,
#<a class="sister" href="https://example.com/lacie" id="link2">Lacie</a> and
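For data-* attributes specifically, the hyphen rules out the keyword form, so the attrs dictionary is the usual route. A minimal sketch, assuming a made-up document with a hypothetical data-foo attribute and using html.parser:

```python
from bs4 import BeautifulSoup

html_doc = '<div data-foo="value">first</div><div data-foo="other">second</div>'
soup = BeautifulSoup(html_doc, "html.parser")

# soup.find_all(data-foo="value") is a SyntaxError: a keyword cannot contain "-"
matches = soup.find_all(attrs={"data-foo": "value"})
texts = [tag.string for tag in matches]
print(texts)  # ['first']
```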

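The text/string parameter

The text/string parameter lets you search by text content. Like name, it accepts several kinds of values, including strings, regular expressions, lists, and True. Older versions of bs4 named this parameter text; Beautiful Soup 4.4.0 and later name it string

Matching a string

Pass a string as the value of string, and BeautifulSoup returns every text node that is exactly equal to it (the strings themselves, not the tags that contain them)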

soup = BeautifulSoup(html_doc, 'lxml')
print(soup.find_all(string='Elsie'))
#['Elsie']
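When the enclosing tag is wanted rather than the string, string can be combined with a tag name. The sketch below uses a made-up fragment and html.parser; it contrasts the two call shapes.

```python
from bs4 import BeautifulSoup

html_doc = '<p><a id="link1">Elsie</a> and <a id="link2">Lacie</a></p>'
soup = BeautifulSoup(html_doc, "html.parser")

strings = soup.find_all(string="Elsie")   # text nodes equal to "Elsie"
tags = soup.find_all("a", string="Elsie") # <a> tags whose .string is "Elsie"

print(strings)  # ['Elsie']
print(tags)     # [<a id="link1">Elsie</a>]
```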

Matching a regular expression

import re
soup = BeautifulSoup(html_doc, 'lxml')
print(soup.find_all(string=re.compile('sister'), limit=2))  # find the first two strings that contain sister
print(soup.find_all(string=re.compile('Dormouse')))  # find strings that contain Dormouse
#['Once upon a time there were three little sisters; and their names were\n']
#["The Dormouse's story", "The Dormouse's story"]

Matching a list

soup = BeautifulSoup(html_doc, 'lxml')
print(soup.find_all(string=['Elsie', 'Lacie', 'Tillie']))
#['Elsie', 'Lacie', 'Tillie']

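The limit parameter

The limit parameter caps the number of results find_all returns. When you only need the first few matches, limit can make the search faster. It works like the LIMIT keyword in SQL: once the number of results reaches limit, the search stops and returns them

soup = BeautifulSoup(html_doc, 'lxml')
print(soup.find_all('a', limit=2))  # find a tags, returning at most 2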
#[<a class="sister" href="https://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="https://example.com/lacie" id="link2">Lacie</a>]
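The find_all() signature also lists recursive, which has no example of its own. With recursive=False, only direct children of the tag are considered instead of all descendants. A minimal sketch on a made-up document, using html.parser:

```python
from bs4 import BeautifulSoup

html_doc = "<html><head><title>The Dormouse's story</title></head></html>"
soup = BeautifulSoup(html_doc, "html.parser")

deep = soup.html.find_all("title")                      # searches all descendants
shallow = soup.html.find_all("title", recursive=False)  # direct children only

print(len(deep))     # 1
print(len(shallow))  # 0, because <title> is a grandchild of <html>
```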

find_parents() and find_parent()

BeautifulSoup provides find_parents() and find_parent() for searching upward through the parsed tree for ancestor tags. The main difference between them is how many results they return

find_parent(name=None, attrs={}, **kwargs): returns only the nearest matching ancestor
find_parents(name=None, attrs={}, limit=None, **kwargs): returns all matching ancestors, ordered from nearest to farthest

soup = BeautifulSoup(html_doc, 'lxml')
a_string = soup.find(string='Lacie')
print(a_string.find_parent())  # find the direct parent
print('-----------------')
print(a_string.find_parents())  # find all ancestors
#<a class="sister" href="https://example.com/lacie" id="link2">Lacie</a>
#-----------------
#[<a class="sister" href="https://example.com/lacie" id="link2">Lacie</a>, <p class="story">Once upon a time there were three little sisters; and their names were
#<a class="sister" href="https://example.com/elsie" id="link1">Elsie</a>,
#<a class="sister" href="https://example.com/lacie" id="link2">Lacie</a> and
#and they lived at the bottom of a well.</p>, <body>....]
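Both methods accept the same filters as find_all(), so the upward search can skip straight to a particular ancestor. The sketch below uses a shortened made-up document and html.parser.

```python
from bs4 import BeautifulSoup

html_doc = (
    '<body><p class="story">Once upon a time '
    '<a id="link2">Lacie</a> and others</p></body>'
)
soup = BeautifulSoup(html_doc, "html.parser")
a_string = soup.find(string="Lacie")

nearest = a_string.find_parent()       # the enclosing <a> tag
paragraph = a_string.find_parent("p")  # skip ahead to the nearest <p> ancestor
chain = [tag.name for tag in a_string.find_parents(["a", "p"])]

print(nearest.name)        # a
print(paragraph["class"])  # ['story']
print(chain)               # ['a', 'p']
```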

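CSS selectors in BeautifulSoup

In CSS, a tag name is written bare, a class name is prefixed with a dot, and an id is prefixed with #. BeautifulSoup lets you filter elements with the same syntax

select(selector, namespaces=None, limit=None, **kwargs)

The select() method finds elements in an HTML document using a CSS selector and returns a list of all matching elements, much like find_all()

selector: a string containing the CSS selector, such as a simple tag selector, a class selector, or an id selector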
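As a quick illustration of the signature, the sketch below calls select() with and without limit, and also uses select_one(), the companion method that returns only the first match or None. The document is a shortened made-up one, parsed with html.parser.

```python
from bs4 import BeautifulSoup

html_doc = """
<p class="story">
<a class="sister" id="link1">Elsie</a>
<a class="sister" id="link2">Lacie</a>
<a class="sister" id="link3">Tillie</a>
</p>
"""
soup = BeautifulSoup(html_doc, "html.parser")

all_ids = [a["id"] for a in soup.select("a.sister")]
first_two = [a["id"] for a in soup.select("a.sister", limit=2)]
first = soup.select_one("a.sister")     # first match only
missing = soup.select_one("a.brother")  # None when nothing matches

print(all_ids)    # ['link1', 'link2', 'link3']
print(first_two)  # ['link1', 'link2']
```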

Selecting by tag name

soup = BeautifulSoup(html_doc, 'lxml')
print(soup.select('b'))
#[<b>The Dormouse's story</b>, <b><!--This is a comment--></b>]

Selecting by class name

soup = BeautifulSoup(html_doc, 'lxml')
print(soup.select('.title'))
#[<p class="title"><b>The Dormouse's story</b></p>]

Selecting by id

soup = BeautifulSoup(html_doc, 'lxml')
print(soup.select('#link1'))
#[<a class="sister" href="https://example.com/elsie" id="link1">Elsie</a>]

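Combined selectors

Combined selectors work just as they do in CSS: tag names, class names, and ids are combined in the same way

For example, find the element with id link1 inside a <p> tag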

soup = BeautifulSoup(html_doc, 'lxml')
print(soup.select('p #link1'))
#[<a class="sister" href="https://example.com/elsie" id="link1">Elsie</a>]

A class selector can also be combined with an id selector on the same tag

soup = BeautifulSoup(html_doc, 'lxml')
print(soup.select('.story#text'))

Find a tag matched by several class selectors and an id selector

soup = BeautifulSoup(html_doc, 'lxml')
print(soup.select(".story .sister#link1"))
#[<a class="sister" href="https://example.com/elsie" id="link1">Elsie</a>]
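Whitespace matters in the combined selectors above: .story#text describes one element that has both the class and the id, while .story #link1 describes an element with id link1 somewhere inside a .story element. The sketch below assumes a made-up document in which the paragraph carries a hypothetical id="text", and uses html.parser.

```python
from bs4 import BeautifulSoup

html_doc = (
    '<p class="story" id="text">Once upon a time '
    '<a class="sister" id="link1">Elsie</a></p>'
)
soup = BeautifulSoup(html_doc, "html.parser")

same_element = soup.select(".story#text")  # class and id on the same tag
descendant = soup.select(".story #link1")  # #link1 nested inside .story
no_match = soup.select(".story#link1")     # no single tag has both
direct_child = soup.select("p > a")        # <a> that is a direct child of <p>

print(same_element[0].name)  # p
print(descendant[0].name)    # a
print(len(no_match))         # 0
```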

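Attribute selectors

Select tags that have a particular attribute or attribute value

Simple attribute selector

Select tags by a particular attribute; the example below also requires an exact value

soup = BeautifulSoup(html_doc, 'lxml')
print(soup.select("a[href='https://example.com/elsie']"))  # select a tags whose href is https://example.com/elsie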
#[<a class="sister" href="https://example.com/elsie" id="link1">Elsie</a>]

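Attribute value selectors

Select tags whose attribute has a particular value
- Exact match: [attribute="value"]
- Partial match
 - Contains a word: [attribute~="value"] selects tags whose attribute value, read as a whitespace-separated list, contains the given word
 - Starts with: [attribute^="value"] selects tags whose attribute value starts with the given string
 - Ends with: [attribute$="value"] selects tags whose attribute value ends with the given string
 - Contains a substring: [attribute*="value"] selects tags whose attribute value contains the given substring
```
soup = BeautifulSoup(html_doc, 'lxml')
print(soup.select('a[href^="https://example.com"]')) # select a tags whose href starts with https://example.com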
#[<a class="sister" href="https://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="https://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="https://example.com/tillie" id="link3">Tillie</a>]
```
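The operators listed above can be compared side by side on one small document. This is a sketch: the links are made up so that each operator selects a different subset, ids is a hypothetical helper, and html.parser is used so that it runs without lxml.

```python
from bs4 import BeautifulSoup

html_doc = """
<a class="sister big" href="https://example.com/elsie" id="link1">Elsie</a>
<a class="sister" href="https://example.com/lacie" id="link2">Lacie</a>
<a class="brother" href="http://example.org/tillie.html" id="link3">Tillie</a>
"""
soup = BeautifulSoup(html_doc, "html.parser")


def ids(selector):
    """Return the id of every tag matched by a CSS selector."""
    return [tag["id"] for tag in soup.select(selector)]


print(ids("a[href]"))                  # attribute present: all three links
print(ids('a[id="link2"]'))            # exact match: ['link2']
print(ids('a[class~="sister"]'))       # word match: ['link1', 'link2']
print(ids('a[href^="https://"]'))      # prefix: ['link1', 'link2']
print(ids('a[href$=".html"]'))         # suffix: ['link3']
print(ids('a[href*="example.com"]'))   # substring: ['link1', 'link2']
```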
