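How to Use BeautifulSoup

Posted by weixin_33890499 on 2018-03-05

Original URL: https://blog.csdn.net/weixin_33890499/article/details/87022430

What Is BeautifulSoup

BeautifulSoup is a library for parsing, traversing, and maintaining the "tag tree" of a document.

Installation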

pip3 install beautifulsoup4

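Note:

PyPI also hosts a package named BeautifulSoup, but that is probably not the one you want: it is the Beautiful Soup 3 release. Because many projects still use BS3, the BeautifulSoup package remains available. For a new project, however, you should install beautifulsoup4.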

HelloWorld

First, import the package

from bs4 import BeautifulSoup

BeautifulSoup can take an open file object directly and parse the file's contents

soup = BeautifulSoup(open('index.html'), 'html.parser')

It can also parse an HTML string directly

soup = BeautifulSoup('<html>data</html>', 'html.parser')
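Passing open('index.html') straight into BeautifulSoup leaves the file handle open until it is garbage-collected. A minimal sketch of the same idea with a with-block follows; the temporary file and its contents are made up for illustration so the example does not depend on an existing index.html.

```python
import os
import tempfile
from bs4 import BeautifulSoup

# Write a small throwaway HTML file (hypothetical content, for illustration only).
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "index.html")
    with open(path, "w", encoding="utf-8") as f:
        f.write("<html><head><title>demo</title></head><body><p>data</p></body></html>")

    # The with-block closes the file as soon as parsing is done.
    with open(path, encoding="utf-8") as f:
        file_soup = BeautifulSoup(f, "html.parser")

print(file_soup.title.string)
print(file_soup.p.string)
```

Stating the encoding explicitly avoids depending on the platform default when the file contains non-ASCII text.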

The html.parser argument above specifies which parser is used to parse the content. BeautifulSoup currently supports the following parsers

html.parser
lxml
xml
html5lib
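Only html.parser ships with Python; lxml and html5lib are separate packages that must be installed first. The sketch below tries parsers in a preferred order and falls back when one is missing. The helper name make_soup and the preference order are assumptions made for this example, not part of the library.

```python
from bs4 import BeautifulSoup, FeatureNotFound

def make_soup(markup, preferred=("lxml", "html5lib", "html.parser")):
    """Return (soup, parser_name) using the first parser that is installed."""
    for parser in preferred:
        try:
            return BeautifulSoup(markup, parser), parser
        except FeatureNotFound:
            # This parser is not installed; try the next one.
            continue
    raise RuntimeError("no usable parser found")

soup, used = make_soup("<p>unclosed paragraph")
print(used)           # depends on which parsers are installed
print(soup.p.string)
```

Different parsers may repair broken markup in different ways, so it is usually safer to name one parser explicitly once a project has settled on it.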

Here is a demonstration: fetch the HTML content of https://python123.io/ws/demo.html, then parse it with BeautifulSoup

import requests
from bs4 import BeautifulSoup

r = requests.get('https://python123.io/ws/demo.html', timeout=10)
r.raise_for_status()  # stop early if the server returned an HTTP error
html_text = r.text
soup = BeautifulSoup(html_text, 'html.parser')

The HTML content obtained is as follows

<html><head><title>This is a python demo page</title></head>
<body>
<p class="title"><b>The demo python introduces several python courses.</b></p>
<p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
<a href="http://www.icourse163.org/course/BIT-268001" class="py1" id="link1">Basic Python</a> and <a href="http://www.icourse163.org/course/BIT-1001870001" class="py2" id="link2">Advanced Python</a>.</p>
</body></html>
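If the demo URL is unreachable, the examples can still be followed by assigning the HTML above to html_text by hand. A minimal sketch, assuming the page content is exactly the text listed above:

```python
from bs4 import BeautifulSoup

# Offline copy of the demo page, pasted from the listing above.
html_text = """<html><head><title>This is a python demo page</title></head>
<body>
<p class="title"><b>The demo python introduces several python courses.</b></p>
<p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
<a href="http://www.icourse163.org/course/BIT-268001" class="py1" id="link1">Basic Python</a> and <a href="http://www.icourse163.org/course/BIT-1001870001" class="py2" id="link2">Advanced Python</a>.</p>
</body></html>"""

soup = BeautifulSoup(html_text, "html.parser")
print(soup.title.string)
print([a["id"] for a in soup.find_all("a")])
```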

Use prettify() to print the HTML content in formatted form

print(soup.prettify())

Output

<html>
 <head>
  <title>
   This is a python demo page
  </title>
 </head>
 <body>
  <p class="title">
   <b>
    The demo python introduces several python courses.
   </b>
  </p>
  <p class="course">
   Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
   <a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">
    Basic Python
   </a>
   and
   <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">
    Advanced Python
   </a>
   .
  </p>
 </body>
</html>

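Basic elements of the BeautifulSoup class

Element	Type	Description
Tag	bs4.element.Tag	A tag, the most basic unit of information; it opens with <> and closes with </>
Name	str	The tag's name; the name of <p>...</p> is 'p'; accessed as <tag>.name
Attributes	dict	The tag's attributes, organized as a dictionary; accessed as <tag>.attrs
NavigableString	bs4.element.NavigableString	The non-attribute string inside a tag, i.e. the string between <> and </>; accessed as <tag>.string
Comment	bs4.element.Comment	A comment inside a tag; a special kind of NavigableString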

[Figure: diagram of the basic elements of a tag]

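The following sections use the HTML content fetched from https://python123.io/ws/demo.html to explain the basic elements of BeautifulSoup

The Tag element

A tag is the most basic unit of information; it opens with <> and closes with </>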

from bs4 import BeautifulSoup

soup = BeautifulSoup(html_text, 'html.parser')
print(type(soup.title))
print(soup.title)
print(soup.title.title)  # <title> contains no nested <title>, so this prints None
print(soup.a)            # the first <a> tag in the document

Output

<class 'bs4.element.Tag'>
<title>This is a python demo page</title>
None
<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>

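Any tag present in the HTML document can be accessed as soup.<tag>
The form soup.<tag1>.<tag2> retrieves the <tag2> tag nested under <tag1>
When the HTML document contains several matching <tag> elements, soup.<tag> returns the first one; when there is none, it returns None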
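The access rules above can be sketched as follows. The HTML here is a shortened stand-in for the demo page, and the lookup of a table tag is only there to show what happens when a tag is missing.

```python
from bs4 import BeautifulSoup

# Shortened stand-in for the demo page (assumption: only the structure matters here).
html_text = """<html><head><title>This is a python demo page</title></head>
<body>
<p class="title"><b>The demo python introduces several python courses.</b></p>
<p class="course">Courses: <a href="#" id="link1">Basic Python</a></p>
</body></html>"""
soup = BeautifulSoup(html_text, "html.parser")

print(soup.body.p.b)     # chained access: first <b> in the first <p> in <body>
print(soup.p["class"])   # soup.p is the first <p> only
print(soup.table)        # a tag that does not exist gives None

# Check for None before chaining further; None has no find_all().
table = soup.table
rows = table.find_all("tr") if table is not None else []
print(rows)
```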

Tag name

The tag's name; the name of <p>…</p> is 'p'. Accessed as <tag>.name

Inspect the name of the <a> tag

from bs4 import BeautifulSoup

soup = BeautifulSoup(html_text, 'html.parser')
print(type(soup.a.name))
print(soup.a.name)

Output

<class 'str'>
a

Tag attrs

The tag's attributes, organized as a dictionary. Accessed as <tag>.attrs

from bs4 import BeautifulSoup

soup = BeautifulSoup(html_text, 'html.parser')
print(type(soup.p.attrs))
print(soup.p.attrs)
print(soup.a.attrs)

Output

<class 'dict'>
{'class': ['title']}
{'href': 'http://www.icourse163.org/course/BIT-268001', 'class': ['py1'], 'id': 'link1'}
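Besides reading the whole .attrs dictionary, a single attribute can be read directly from the tag. A short sketch, using one link from the demo page:

```python
from bs4 import BeautifulSoup

html_text = '<p class="course"><a href="http://www.icourse163.org/course/BIT-268001" class="py1" id="link1">Basic Python</a></p>'
soup = BeautifulSoup(html_text, "html.parser")
a = soup.a

href = a["href"]        # dictionary-style access; raises KeyError if the attribute is missing
title = a.get("title")  # .get() returns None instead of raising
classes = a["class"]    # class is multi-valued, so it comes back as a list
link_id = a["id"]       # id is single-valued, so it is a plain string

print(href)
print(title)
print(classes)
print(link_id)
print(a.has_attr("href"))
```

Using .get() is the safer choice when scraping pages where an attribute may be absent on some tags.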

Tag NavigableString

The non-attribute string inside a tag, i.e. the string between <> and </>. Accessed as <tag>.string

from bs4 import BeautifulSoup

soup = BeautifulSoup(html_text, 'html.parser')
print(type(soup.p.string))
print(soup.p.string)

Output

<class 'bs4.element.NavigableString'>
The demo python introduces several python courses.
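One detail worth knowing: .string only works when a tag has a single child. When a tag mixes text and nested tags, .string is None and get_text() is the usual alternative. A sketch with a shortened version of the demo paragraphs:

```python
from bs4 import BeautifulSoup

html_text = (
    '<p class="title"><b>The demo python introduces several python courses.</b></p>'
    '<p class="course">Learn <a href="#">Basic Python</a> and <a href="#">Advanced Python</a>.</p>'
)
soup = BeautifulSoup(html_text, "html.parser")

title_p = soup.find("p", class_="title")
course_p = soup.find("p", class_="course")

print(title_p.string)     # single child <b>, so .string reaches through to its text
print(course_p.string)    # several children, so .string is None
print(course_p.get_text())                # all text joined together
print(list(course_p.stripped_strings))    # each text piece, whitespace stripped
```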

Tag Comment

from bs4 import BeautifulSoup

html_text = '''
<b><!--This is a comment--></b>
<p>This is not a comment</p>
'''
soup = BeautifulSoup(html_text, 'html.parser')
print(soup.b.string)
print(type(soup.b.string))
print(soup.p.string)
print(type(soup.p.string))

Output

This is a comment
<class 'bs4.element.Comment'>
This is not a comment
<class 'bs4.element.NavigableString'>

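Both the comment and the ordinary tag content are retrieved through the <tag>.string attribute, yet their types differ:
A tag's Comment is a special kind of NavigableString: bs4.element.Comment.

Keep this in mind in real applications; an if-else statement can tell the two apart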

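from bs4.element import Comment

# tag stands for any Tag taken from the soup, for example soup.b
if isinstance(tag.string, Comment):
    pass  # the content is a comment
else:
    pass  # the content is ordinary text (or None)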
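A runnable version of that check, as a sketch: it walks every string in a small made-up document and separates comments from visible text.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment, NavigableString

# Hypothetical snippet with one comment and one visible string.
html_text = "<div><!--hidden note--><p>visible text</p></div>"
soup = BeautifulSoup(html_text, "html.parser")

comments = []
visible = []
for node in soup.descendants:
    if isinstance(node, Comment):
        # Comment must be tested first because it is a NavigableString subclass.
        comments.append(str(node))
    elif isinstance(node, NavigableString):
        visible.append(str(node))

print(comments)
print(visible)
```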

Analyzing the HTML tree with BeautifulSoup

The HTML document downloaded from https://python123.io/ws/demo.html is as follows

<html><head><title>This is a python demo page</title></head>
<body>
<p class="title"><b>The demo python introduces several python courses.</b></p>
<p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
<a href="http://www.icourse163.org/course/BIT-268001" class="py1" id="link1">Basic Python</a> and <a href="http://www.icourse163.org/course/BIT-1001870001" class="py2" id="link2">Advanced Python</a>.</p>
</body></html>

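The document can be viewed as a tree: the <html> tag is the root, and every other tag hangs below it

The tag tree can therefore be traversed in these ways

Downward traversal
Upward traversal
Sibling traversal

Downward traversal

Attribute	Description
.contents	List of child nodes; all direct children of <tag> stored in a list
.children	Iterator over child nodes; like .contents, used to loop over the direct children
.descendants	Iterator over all descendant nodes, used to loop over every node below the tag

.contents

Inspect the child nodes of the head tag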

from bs4 import BeautifulSoup

soup = BeautifulSoup(html_text, 'html.parser')
print(soup.head.contents)

Output

[<title>This is a python demo page</title>]

Inspect the child nodes of the body tag

print(soup.body.contents)

Output

['\n', <p class="title"><b>The demo python introduces several python courses.</b></p>, '\n', <p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a> and <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>.</p>, '\n']

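Notice that the \n line breaks are output as nodes too. This matters: BeautifulSoup treats the whitespace text between tags, including a line break, as a node. For example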

soup = BeautifulSoup('''
''', 'html.parser')
print(soup.contents)
for child in soup.contents:
    print(type(child))

The output is

['\n']
<class 'bs4.element.NavigableString'>
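Because of these whitespace nodes, looping over .contents or .children often needs a filter. One possible sketch keeps only Tag objects; the two-paragraph HTML is made up for the example.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

html_text = "<body>\n<p>one</p>\n<p>two</p>\n</body>"
soup = BeautifulSoup(html_text, "html.parser")

all_children = soup.body.contents
# Keep only real tags, dropping the '\n' strings between them.
tag_children = [child for child in soup.body.children if isinstance(child, Tag)]

print(len(all_children))
print([tag.string for tag in tag_children])

# find_all with recursive=False is another way to get direct child tags only.
direct_paragraphs = soup.body.find_all("p", recursive=False)
print(len(direct_paragraphs))
```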

.children

Used like .contents: both cover the direct children of a node, but .children is an iterator rather than a list

soup = BeautifulSoup(html_text, 'html.parser')
for child in soup.head.children:
    print(child)

Output

<title>This is a python demo page</title>

.descendants

Inspect the descendant nodes of a tag

soup = BeautifulSoup(html_text, 'html.parser')
for child in soup.head.descendants:
    print(child)

Output

<title>This is a python demo page</title>
This is a python demo page

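Even for the same head tag, .children and .descendants give quite different results. .children yields only the tag's direct children, while .descendants yields every node below it. Because BeautifulSoup also treats strings as nodes, the output is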

<title>This is a python demo page</title>
This is a python demo page
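The difference can also be seen by counting nodes. A small sketch on a made-up fragment:

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

soup = BeautifulSoup("<body><p><b>bold</b> text</p></body>", "html.parser")

children = list(soup.body.children)        # direct children only: the <p> tag
descendants = list(soup.body.descendants)  # <p>, <b>, 'bold', ' text'
tag_names = [node.name for node in descendants if isinstance(node, Tag)]

print(len(children))
print(len(descendants))
print(tag_names)
```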

Upward traversal

Attribute	Description
.parent	The node's parent tag
.parents	Iterator over the node's ancestors, used to loop over the ancestor nodes

.parent

print(soup.title.parent)
print(soup.html.parent)
print(soup.parent)

These print, respectively

<head><title>This is a python demo page</title></head>

<html><head><title>This is a python demo page</title></head>
<body>
<p class="title"><b>The demo python introduces several python courses.</b></p>
<p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
<a href="http://www.icourse163.org/course/BIT-268001" class="py1" id="link1">Basic Python</a> and <a href="http://www.icourse163.org/course/BIT-1001870001" class="py2" id="link2">Advanced Python</a>.</p>
</body></html>

None

.parents

for parent in soup.a.parents:
    print(type(parent), parent.name)

Output

<class 'bs4.element.Tag'> p
<class 'bs4.element.Tag'> body
<class 'bs4.element.Tag'> html
<class 'bs4.BeautifulSoup'> [document]
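One practical use of .parents is building the path from the document root down to a node. A minimal sketch on a made-up fragment:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<html><body><p><a href="#">link</a></p></body></html>', "html.parser")

# .parents walks upward: nearest ancestor first, the BeautifulSoup object last.
ancestors = [parent.name for parent in soup.a.parents]
path = " > ".join(reversed(ancestors))

print(ancestors)
print(path)
print(soup.parent)  # the BeautifulSoup object itself has no parent
```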

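Sibling traversal

Attribute	Description
.next_sibling	The next sibling node in HTML document order
.previous_sibling	The previous sibling node in HTML document order
.next_siblings	Iterator over all following sibling nodes in HTML document order
.previous_siblings	Iterator over all preceding sibling nodes in HTML document order

Note that sibling traversal only happens among nodes that share the same parent, and that a sibling may be a string rather than a tag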
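A short sketch of sibling traversal, using the two links from the demo paragraph. It shows that the direct neighbor of the first link is the text " and ", not the second link.

```python
from bs4 import BeautifulSoup

html_text = '<p><a id="link1">Basic Python</a> and <a id="link2">Advanced Python</a>.</p>'
soup = BeautifulSoup(html_text, "html.parser")
first = soup.a

print(repr(first.next_sibling))            # the string ' and ', not a tag
print(first.next_sibling.next_sibling)     # the second <a> tag
print(first.find_next_sibling("a")["id"])  # skips strings and finds the next <a>
print(first.previous_sibling)              # nothing comes before the first link
print(len(list(first.next_siblings)))      # ' and ', second <a>, '.'
```

When only tags are of interest, find_next_sibling() and find_previous_sibling() save the trouble of skipping string nodes by hand.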

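Traversal summary

Finding nodes

The two most commonly used methods for finding nodes are

find( name , attrs , recursive , string , **kwargs )
find_all( name , attrs , recursive , string , limit , **kwargs )

find: returns the first node that matches
find_all: returns all nodes that match
In older versions of Beautiful Soup the string argument was named text

Some common examples follow

Find the first <a> tag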

a = soup.find('a')
print(a)

Output

<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>

Find all <a> tags

for a in soup.find_all('a'):
    print(a)

Output

<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>
<a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>

Find the <a> tag whose class attribute is py1

a = soup.find('a', {'class': 'py1'})
print(a)

Output

<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>

Find all tags whose names start with the letter p

import re

for tag in soup.find_all(re.compile(r'^p')):
    print(tag.name)

Output

p
p

Find all a tags whose href attribute starts with http://www.icourse163.org/course/

import re

for tag in soup.find_all('a', {'href': re.compile(r'^http://www\.icourse163\.org/course/.+')}):  # dots escaped so they match literally
    print(tag)

Output

<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>
<a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>
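A few more find_all variations that come up often, as a sketch on a shortened copy of the demo page: the class_ and id keyword filters, the limit argument, a list of tag names, and the string filter.

```python
from bs4 import BeautifulSoup

# Shortened copy of the demo page body.
html_text = (
    '<p class="title"><b>Demo</b></p>'
    '<p class="course">'
    '<a href="http://www.icourse163.org/course/BIT-268001" class="py1" id="link1">Basic Python</a> and '
    '<a href="http://www.icourse163.org/course/BIT-1001870001" class="py2" id="link2">Advanced Python</a>.'
    '</p>'
)
soup = BeautifulSoup(html_text, "html.parser")

by_class = soup.find_all("a", class_="py1")             # class_ avoids the Python keyword class
first_only = soup.find_all("a", limit=1)                # stop after the first match
by_id = soup.find_all(id="link2")                       # any tag with this id
several = soup.find_all(["a", "b"])                     # match any of several tag names
by_text = soup.find_all("a", string="Basic Python")     # match on the tag's string

print([tag["id"] for tag in by_class])
print(len(first_only))
print(by_id[0].string)
print([tag.name for tag in several])
print([tag["id"] for tag in by_text])
print(soup.find("table"))  # find returns None when nothing matches
```

Note that find_all returns an empty list when nothing matches, while find returns None, so the two need different checks.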

Summary

Usage

from bs4 import BeautifulSoup

soup = BeautifulSoup(html_text, 'html.parser')

Basic elements

Tag
name
attrs
NavigableString
Comment

Traversal methods

Downward traversal
Upward traversal
Sibling traversal

Finding nodes

find
find_all
