Python Web Scraping: The BeautifulSoup Library

Posted by Praywu on 2020-12-14

Original article: https://www.cnblogs.com/hgzero/p/14132992.html

1. BeautifulSoup

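1.1 Parsers

1) Python standard library

# Usage
BeautifulSoup(markup, "html.parser")

# Advantages
Built into Python's standard library; moderate speed; tolerant of malformed documents

# Disadvantages
Versions before Python 2.7.3 or Python 3.2.2 are less tolerant of malformed documents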

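2) lxml HTML parser

The lxml parser is a good default for most scenarios

# Usage
BeautifulSoup(markup, "lxml")

# Advantages
Very fast; tolerant of malformed documents

# Disadvantages
Requires the lxml package, which depends on C libraries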

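3) lxml XML parser

# Usage
BeautifulSoup(markup, "xml")

# Advantages
Very fast; the only XML parser that BeautifulSoup supports

# Disadvantages
Requires the lxml package, which depends on C libraries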

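4) html5lib

# Usage
BeautifulSoup(markup, "html5lib")

# Advantages
The most tolerant parser; parses documents the way a web browser does and produces valid HTML5

# Disadvantages
Very slow; requires installing the html5lib package (it is pure Python, so no C extension is needed)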
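
The parsers listed above can repair the same broken markup in different ways. The following is a minimal sketch, not part of the original article, that runs one deliberately invalid snippet through whichever parsers happen to be installed. The helper name parse_with_each and the sample markup are illustrative assumptions; FeatureNotFound is the exception bs4 raises when a requested parser is unavailable.

```python
from bs4 import BeautifulSoup, FeatureNotFound

MALFORMED = "<a></p>"  # deliberately invalid markup (illustrative)


def parse_with_each(markup, parsers=("html.parser", "lxml", "html5lib")):
    """Return {parser name: repaired markup, or None if that parser is not installed}."""
    results = {}
    for name in parsers:
        try:
            results[name] = str(BeautifulSoup(markup, name))
        except FeatureNotFound:
            # lxml and html5lib are optional third-party packages
            results[name] = None
    return results


results = parse_with_each(MALFORMED)
for name, repaired in results.items():
    shown = repaired if repaired is not None else "(not installed)"
    print(f"{name:12} -> {shown}")
```

Comparing the printed lines is a quick way to see how much structure each parser adds around the same input.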

1.2 Basic usage

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
from bs4 import BeautifulSoup
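soup = BeautifulSoup(html, 'lxml') # Use the lxml parser
print(soup.prettify())    # Pretty-print the parsed tree; the parser has already filled in missing tags and recovered from errors
print(soup.title.string)  # Get the title tag, then the text inside it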

2. Tag selectors

2.1 Selecting elements

A tag can be selected directly by writing .tagname after the soup object

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.title)        # Select the title tag; prints: <title>The Dormouse's story</title>
print(type(soup.title))  # Type: <class 'bs4.element.Tag'>
print(soup.head)
print(soup.p) # If several tags match, only the first one is returned

2.2 Getting the tag name

Get the name of a tag, for example whether it is a p tag or an a tag

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.title.name) # Get the tag name

2.3 Getting attributes

The value of a tag's name attribute can be read with tag.attrs["name"] or tag["name"]

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.p.attrs['name'])   # Get the value of the name attribute on the p tag
print(soup.p['name'])         # This shorter form works as well
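
Two details of attribute access are easy to miss: indexing a tag with a missing attribute raises KeyError, and class is a multi-valued attribute that comes back as a list. The sketch below illustrates both; it assumes html.parser so that it runs without lxml, and the shortened markup is an illustrative variant of the sample above.

```python
from bs4 import BeautifulSoup

html = """<p class="title" name="dromouse"><b>The Dormouse's story</b></p>"""
soup = BeautifulSoup(html, "html.parser")  # assumption: html.parser, to avoid the lxml dependency
p = soup.p

name_value = p["name"]           # plain string attribute
class_value = p["class"]         # multi-valued attribute, returned as a list
missing = p.get("id")            # .get() returns None instead of raising KeyError
fallback = p.get("id", "no-id")  # .get() also accepts a default value

print(name_value, class_value, missing, fallback)
```

Using .get() is usually the safer choice when scraping pages where an attribute may be absent on some tags.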

2.4 Getting content

The text inside a tag can be read with tag.string

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.p.string)  # Get the content of the p tag (text only): The Dormouse's story
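
Note that .string only yields a value when a tag has a single child (or a single chain of nested children). For a tag with several children it returns None, and get_text() is the usual alternative. A minimal sketch, assuming html.parser and a shortened, illustrative version of the sample markup:

```python
from bs4 import BeautifulSoup

html = """<p class="title"><b>The Dormouse's story</b></p><p class="story">Once <a>Lacie</a> and <a>Tillie</a></p>"""
soup = BeautifulSoup(html, "html.parser")  # assumption: html.parser, to avoid the lxml dependency
title_p, story_p = soup.find_all("p")

single = title_p.string       # one nested child chain, so the text is returned
ambiguous = story_p.string    # several children, so .string is None
joined = story_p.get_text()   # concatenates every string inside the tag

print(single)
print(ambiguous)
print(joined)
```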

2.5 Nested selection

Selections can be nested by chaining tag names with a dot .

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.head.title.string)  # Get the text of the title tag under head

2.6 Children and descendants

1) Children

tag.contents returns all direct children of a tag as a list

html = """
<html>
    <head>
        <title>The Dormouse's story</title>
    </head>
    <body>
        <p class="story">
            Once upon a time there were three little sisters; and their names were
            <a href="http://example.com/elsie" class="sister" id="link1">
                <span>Elsie</span>
            </a>
            <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> 
            and
            <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
            and they lived at the bottom of a well.
        </p>
        <p class="story">...</p>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.p.contents)  # All direct children of the p tag, as a list

tag.children returns all direct children of a tag as an iterator

html = """
<html>
    <head>
        <title>The Dormouse's story</title>
    </head>
    <body>
        <p class="story">
            Once upon a time there were three little sisters; and their names were
            <a href="http://example.com/elsie" class="sister" id="link1">
                <span>Elsie</span>
            </a>
            <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> 
            and
            <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
            and they lived at the bottom of a well.
        </p>
        <p class="story">...</p>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.p.children)  # All direct children of the p tag, as an iterator
for i, child in enumerate(soup.p.children):
    print(i, child)

2) Descendants

tag.descendants returns all descendants of a tag (children, grandchildren, and so on) as an iterator

html = """
<html>
    <head>
        <title>The Dormouse's story</title>
    </head>
    <body>
        <p class="story">
            Once upon a time there were three little sisters; and their names were
            <a href="http://example.com/elsie" class="sister" id="link1">
                <span>Elsie</span>
            </a>
            <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> 
            and
            <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
            and they lived at the bottom of a well.
        </p>
        <p class="story">...</p>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.p.descendants)   # All descendants of the p tag, as an iterator
for i, child in enumerate(soup.p.descendants):
    print(i, child)

2.7 Parent and ancestors

1) Parent

tag.parent returns the parent of a tag

html = """
<html>
    <head>
        <title>The Dormouse's story</title>
    </head>
    <body>
        <p class="story">
            Once upon a time there were three little sisters; and their names were
            <a href="http://example.com/elsie" class="sister" id="link1">
                <span>Elsie</span>
            </a>
            <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> 
            and
            <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
            and they lived at the bottom of a well.
        </p>
        <p class="story">...</p>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.a.parent)  # Get the parent of the a tag

2) Ancestors

tag.parents returns all ancestors of a tag

html = """
<html>
    <head>
        <title>The Dormouse's story</title>
    </head>
    <body>
        <p class="story">
            Once upon a time there were three little sisters; and their names were
            <a href="http://example.com/elsie" class="sister" id="link1">
                <span>Elsie</span>
            </a>
            <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> 
            and
            <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
            and they lived at the bottom of a well.
        </p>
        <p class="story">...</p>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(list(enumerate(soup.a.parents)))   # Get all ancestors of the a tag

2.8 Siblings

tag.next_siblings returns all siblings that come after a tag
tag.previous_siblings returns all siblings that come before a tag

html = """
<html>
    <head>
        <title>The Dormouse's story</title>
    </head>
    <body>
        <p class="story">
            Once upon a time there were three little sisters; and their names were
            <a href="http://example.com/elsie" class="sister" id="link1">
                <span>Elsie</span>
            </a>
            <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> 
            and
            <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
            and they lived at the bottom of a well.
        </p>
        <p class="story">...</p>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(list(enumerate(soup.a.next_siblings)))     # All siblings after the a tag
print(list(enumerate(soup.a.previous_siblings))) # All siblings before the a tag
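
The sibling iterators also yield the whitespace strings that sit between tags, which often surprises newcomers. The sketch below shows one way to keep only the Tag siblings; it assumes html.parser and uses a shortened, illustrative version of the sample markup.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

html = """<p>
<a id="link1">Elsie</a>
<a id="link2">Lacie</a>
<a id="link3">Tillie</a>
</p>"""
soup = BeautifulSoup(html, "html.parser")  # assumption: html.parser, to avoid the lxml dependency
first_link = soup.a

# Every sibling after the first link, including the newline strings between tags
all_siblings = list(first_link.next_siblings)

# Only the siblings that are real tags
tag_siblings = [node for node in first_link.next_siblings if isinstance(node, Tag)]

print(len(all_siblings))
print([tag["id"] for tag in tag_siblings])
```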

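3. Standard selectors

3.1 find_all()

Syntax: find_all(name, attrs, recursive, text, **kwargs)
Since Beautiful Soup 4.4.0 the text argument is also available under the name string.

1) name

Select tags by tag name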

html='''
<div class="panel">
    <div class="panel-heading">
        <h4>Hello</h4>
    </div>
    <div class="panel-body">
        <ul class="list" id="list-1">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
            <li class="element">Jay</li>
        </ul>
        <ul class="list list-small" id="list-2">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
        </ul>
    </div>
</div>
'''
from bs4 import BeautifulSoup
soup1 = BeautifulSoup(html, 'lxml')
print(soup1.find_all('ul'))  # Find every match and return the results as a list
print(type(soup1.find_all('ul')[0]))

soup2 = BeautifulSoup(html, 'lxml')
for ul in soup2.find_all('ul'):
    print(ul.find_all('li'))  # Search again inside each ul

2) attrs

Select tags by their attributes

html='''
<div class="panel">
    <div class="panel-heading">
        <h4>Hello</h4>
    </div>
    <div class="panel-body">
        <ul class="list" id="list-1" name="elements">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
            <li class="element">Jay</li>
        </ul>
        <ul class="list list-small" id="list-2">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
        </ul>
    </div>
</div>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.find_all(attrs={'id': 'list-1'}))    # Find every tag whose id attribute is list-1
print(soup.find_all(attrs={'name': 'elements'}))

soup2 = BeautifulSoup(html, 'lxml')
print(soup2.find_all(id='list-1'))      # Same result as the attrs form above, without having to pass a dict
print(soup2.find_all(class_='element')) # If the attribute name clashes with a Python keyword, append an underscore, as in class_

3) text

Select by text content

html='''
<div class="panel">
    <div class="panel-heading">
        <h4>Hello</h4>
    </div>
    <div class="panel-body">
        <ul class="list" id="list-1">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
            <li class="element">Jay</li>
        </ul>
        <ul class="list list-small" id="list-2">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
        </ul>
    </div>
</div>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.find_all(text='Foo'))   # Select by text: returns the strings that are exactly 'Foo' (the strings themselves, not the tags around them)
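
The find_all() signature also has a recursive parameter, and the library accepts a limit argument and regular expressions as filters. None of these appear in the examples above, so here is a minimal sketch of them. It assumes html.parser, a condensed version of the sample markup, and the newer string keyword in place of text.

```python
import re

from bs4 import BeautifulSoup

html = (
    '<div class="panel">'
    '<ul id="list-1"><li>Foo</li><li>Bar</li></ul>'
    '<ul id="list-2"><li>Foo</li></ul>'
    '</div>'
)
soup = BeautifulSoup(html, "html.parser")  # assumption: html.parser, to avoid the lxml dependency
panel = soup.find("div")

direct_lists = panel.find_all("ul", recursive=False)  # direct children only
direct_items = panel.find_all("li", recursive=False)  # li tags are grandchildren, so nothing matches
first_two = soup.find_all("li", limit=2)              # stop after two matches
f_strings = soup.find_all(string=re.compile("^F"))    # strings that start with F

print(len(direct_lists), len(direct_items))
print([li.get_text() for li in first_two])
print([str(s) for s in f_strings])
```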

3.2 find()

find() returns a single element (the first match); find_all() returns all matching elements

html='''
<div class="panel">
    <div class="panel-heading">
        <h4>Hello</h4>
    </div>
    <div class="panel-body">
        <ul class="list" id="list-1">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
            <li class="element">Jay</li>
        </ul>
        <ul class="list list-small" id="list-2">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
        </ul>
    </div>
</div>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.find('ul'))   # Find the first ul tag
print(type(soup.find('ul')))
print(soup.find('page'))  # No page tag exists, so this prints None

3.3 find_parents() find_parent()

find_parents() returns all ancestor nodes; find_parent() returns the direct parent.
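
As a rough illustration, both methods accept the same filters as find_all() and search upward from the starting tag. The sketch assumes html.parser and a condensed, illustrative version of the earlier markup.

```python
from bs4 import BeautifulSoup

html = '<div class="panel"><ul id="list-1"><li class="element">Foo</li></ul></div>'
soup = BeautifulSoup(html, "html.parser")  # assumption: html.parser, to avoid the lxml dependency
li = soup.find("li")

parent_ul = li.find_parent("ul")    # nearest enclosing ul
ancestor_names = [tag.name for tag in li.find_parents(["ul", "div"])]  # nearest ancestor first
no_match = li.find_parent("table")  # None when nothing matches

print(parent_ul["id"], ancestor_names, no_match)
```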

3.4 find_next_siblings() find_next_sibling()

find_next_siblings() returns all following siblings; find_next_sibling() returns the first following sibling.

3.5 find_previous_siblings() find_previous_sibling()

find_previous_siblings() returns all preceding siblings; find_previous_sibling() returns the first preceding sibling.
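
The four sibling methods can be sketched together on one short list. This example assumes html.parser and a three-item list that is an illustrative stand-in for the sample markup; note that the preceding siblings come back nearest first.

```python
from bs4 import BeautifulSoup

html = "<ul><li>Foo</li><li>Bar</li><li>Jay</li></ul>"
soup = BeautifulSoup(html, "html.parser")  # assumption: html.parser, to avoid the lxml dependency
foo, bar, jay = soup.find_all("li")

next_one = bar.find_next_sibling("li").get_text()
previous_one = bar.find_previous_sibling("li").get_text()
all_after_foo = [li.get_text() for li in foo.find_next_siblings("li")]
all_before_jay = [li.get_text() for li in jay.find_previous_siblings("li")]  # nearest sibling first

print(next_one, previous_one)
print(all_after_foo, all_before_jay)
```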

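3.6 find_all_next() find_next()

find_all_next() returns all matching nodes after the current node; find_next() returns the first matching node after it.

3.7 find_all_previous() find_previous()

find_all_previous() returns all matching nodes before the current node; find_previous() returns the first matching node before it.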
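
Unlike the sibling methods, these four search in document order and are not limited to nodes that share a parent. A minimal sketch, assuming html.parser and a condensed, illustrative version of the sample markup:

```python
from bs4 import BeautifulSoup

html = "<div><h4>Hello</h4><ul><li>Foo</li><li>Bar</li></ul></div>"
soup = BeautifulSoup(html, "html.parser")  # assumption: html.parser, to avoid the lxml dependency
h4 = soup.find("h4")
bar = soup.find_all("li")[1]

items_after_h4 = [li.get_text() for li in h4.find_all_next("li")]
first_after_h4 = h4.find_next("li").get_text()
heading_before_bar = bar.find_previous("h4").get_text()
# Walks backward through the document, so the closest match comes first
tags_before_bar = [tag.name for tag in bar.find_all_previous(["h4", "li"])]

print(items_after_h4, first_after_h4)
print(heading_before_bar, tags_before_bar)
```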

4. CSS selectors

4.1 Basic use of CSS selectors

Pass a CSS selector string directly to select() to make a selection

html='''
<div class="panel">
    <div class="panel-heading">
        <h4>Hello</h4>
    </div>
    <div class="panel-body">
        <ul class="list" id="list-1">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
            <li class="element">Jay</li>
        </ul>
        <ul class="list list-small" id="list-2">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
        </ul>
    </div>
</div>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.select('.panel .panel-heading'))  # Class selector (class=xxx); the space in between also makes it a descendant selector
print(soup.select('ul li'))                  # Tag selector: selects the li tags inside ul tags
print(soup.select('#list-2 .element'))       # id selector (id=xxx)
print(type(soup.select('ul')[0]))

soup2 = BeautifulSoup(html, 'lxml')
for ul in soup2.select('ul'):
    print(ul.select('li'))
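
select() always returns a list. When only the first match is needed, select_one() is the CSS counterpart of find(). The sketch below also shows a child combinator and an attribute selector; it assumes html.parser and a condensed, illustrative version of the sample markup.

```python
from bs4 import BeautifulSoup

html = (
    '<ul class="list" id="list-1">'
    '<li class="element">Foo</li><li class="element">Bar</li>'
    '</ul>'
    '<ul class="list list-small" id="list-2">'
    '<li class="element">Jay</li>'
    '</ul>'
)
soup = BeautifulSoup(html, "html.parser")  # assumption: html.parser, to avoid the lxml dependency

first_item = soup.select_one("#list-1 .element").get_text()               # first match only
small_items = [li.get_text() for li in soup.select("ul.list-small > li")]  # direct children of ul.list-small
by_attribute = soup.select('li[class="element"]')                          # attribute selector
nothing = soup.select_one(".missing")                                      # None when nothing matches

print(first_item, small_items, len(by_attribute), nothing)
```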

4.2 Getting attributes

TAG['id']
TAG.attrs['id']

html='''
<div class="panel">
    <div class="panel-heading">
        <h4>Hello</h4>
    </div>
    <div class="panel-body">
        <ul class="list" id="list-1">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
            <li class="element">Jay</li>
        </ul>
        <ul class="list list-small" id="list-2">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
        </ul>
    </div>
</div>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
for ul in soup.select('ul'):
    print(ul['id'])         # Get the value of the id attribute on the ul tag
    print(ul.attrs['id'])   # The two forms are equivalent

4.3 Getting content

TAG.get_text()

html='''
<div class="panel">
    <div class="panel-heading">
        <h4>Hello</h4>
    </div>
    <div class="panel-body">
        <ul class="list" id="list-1">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
            <li class="element">Jay</li>
        </ul>
        <ul class="list list-small" id="list-2">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
        </ul>
    </div>
</div>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
for li in soup.select('li'):
    print(li.get_text())   # Get the text inside the tag

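5. Summary

Prefer the lxml parser; fall back to html.parser when necessary
Tag selectors (dot access) offer weak filtering but are fast
Use find() and find_all() to match a single result or multiple results
If you are comfortable with CSS selectors, use select()
Remember the common ways to get attribute values and text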
