A Summary of BeautifulSoup

Posted by Cosmop01itan on 2017-01-26

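import re
from bs4 import BeautifulSoup

# Sample document used by the examples below.
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

# Parse the document once; every example below searches this soup object.
soup = BeautifulSoup(html_doc, "html.parser")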

Using the id filter

soup.find_all(id='link2')  # returns the Tag objects that have this id
# [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]

soup.find_all(id=True)  # finds every tag that has an id attribute; returns a list of Tag objects
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
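As a minimal sketch of the two id lookups above, the snippet below rebuilds a shortened copy of the sample document and reads values out of the results rather than printing whole tags. The variable names (matches, lacie_text, all_ids) are only illustrative.

```python
from bs4 import BeautifulSoup

# Shortened copy of the sample document from this post.
html_doc = """
<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
</p>
"""
soup = BeautifulSoup(html_doc, "html.parser")

# find_all() returns a list-like result even when only one tag matches.
matches = soup.find_all(id="link2")
lacie_text = matches[0].get_text()

# id=True keeps every tag that has an id attribute, whatever its value.
all_ids = [tag["id"] for tag in soup.find_all(id=True)]
```

Because the result is a list, index into it (or use find()) before reading attributes or text.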

Using the href filter

soup.find_all(href=re.compile("elsie"))  # filter elements by their href attribute
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]
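A compiled pattern is matched anywhere inside the attribute value, so a fragment such as "elsie" is enough. The sketch below, on a shortened copy of the sample document, also shows an anchored pattern for matching a URL prefix. The variable names are illustrative.

```python
import re
from bs4 import BeautifulSoup

html_doc = """
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
"""
soup = BeautifulSoup(html_doc, "html.parser")

# The pattern may match in the middle of the href value.
elsie_urls = [a["href"] for a in soup.find_all(href=re.compile("elsie"))]

# Anchor with ^ when the value must start with a given prefix.
example_links = soup.find_all(href=re.compile(r"^http://example\.com/"))

# A pattern that matches no href yields an empty result, not an error.
no_links = soup.find_all(href=re.compile(r"^ftp://"))
```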

Using the src filter for images

Because <img> tags have a src attribute, you can filter on src to collect all the images on a page:

<img src="http://image.ylyq.duoku.com/uploads/images/2016/1017/1476676420949247.jpg" width="150" height="200">

<img src="http://image.ylyq.duoku.com/uploads/images/2016/0802/1470133017414468.png" width="150" height="200">

ele = soup.find_all(src=re.compile('http://image'))
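To make the image lookup concrete, here is a small sketch that parses the two <img> tags shown above and collects their URLs. The third <img> with a relative path is a hypothetical addition, included only to show that it is filtered out.

```python
import re
from bs4 import BeautifulSoup

html = """
<img src="http://image.ylyq.duoku.com/uploads/images/2016/1017/1476676420949247.jpg" width="150" height="200">
<img src="http://image.ylyq.duoku.com/uploads/images/2016/0802/1470133017414468.png" width="150" height="200">
<img src="/static/logo.png">
"""
soup = BeautifulSoup(html, "html.parser")

# Restrict the search to <img> tags whose src starts with the image host prefix.
images = soup.find_all("img", src=re.compile(r"^http://image"))

# Pull the URL out of each matching tag.
image_urls = [img["src"] for img in images]
```

Passing "img" as the tag name is optional here, but it avoids matching other tags that carry a src attribute, such as <script>.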

The filters above can also be combined in a single call; a tag must satisfy all of them to match:

soup.find_all(href=re.compile("elsie"), id='link1')
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]

soup.find_all(href=re.compile('http://'), id='someid', src='…')
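The following sketch illustrates that combined keyword filters act as a logical AND. It uses a shortened copy of the sample document, and the variable names are illustrative.

```python
import re
from bs4 import BeautifulSoup

html_doc = """
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
"""
soup = BeautifulSoup(html_doc, "html.parser")

# Both conditions hold for the first link, so it is returned.
both = soup.find_all(href=re.compile("elsie"), id="link1")

# No single tag has this href and this id, so nothing is returned.
conflicting = soup.find_all(href=re.compile("elsie"), id="link2")
```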

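Some attribute names, such as HTML5 data-* attributes, are not valid Python keyword arguments. For those, pass a dictionary through the attrs argument of find_all():

# data_soup is assumed to be a BeautifulSoup object whose document contains a matching <div>.
data_soup.find_all(attrs={"data-foo": "value", "class": "someclass"})
# Only tags carrying both attributes match, for example:
# [<div class="someclass" data-foo="value">foo!</div>]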
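Here is a self-contained sketch of the attrs dictionary. The two <div> elements are hypothetical markup, chosen so that only one of them carries both attributes.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the first div has both attributes, the second only data-foo.
data_soup = BeautifulSoup(
    '<div data-foo="value" class="someclass">foo!</div>'
    '<div data-foo="value">bar!</div>',
    "html.parser",
)

# Every key in the dictionary must match.
both_attrs = data_soup.find_all(attrs={"data-foo": "value", "class": "someclass"})

# With a single key, both divs qualify.
data_only = data_soup.find_all(attrs={"data-foo": "value"})
```

The dictionary form also sidesteps the class keyword problem, since "class" is an ordinary string key here.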

#-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------#

Searching for tags by CSS class name is very useful, but class is a reserved word in Python, so using class as a keyword argument causes a syntax error. Since Beautiful Soup 4.1.2, you can search by CSS class with the class_ argument:

soup.find_all("a", class_="sister")
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

The class_ argument accepts the same kinds of filters: a string, a regular expression, a function, or True:

soup.find_all(class_=re.compile("itl"))
# [<p class="title"><b>The Dormouse's story</b></p>]
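The four filter types for class_ can be compared side by side, as in the sketch below on a shortened copy of the sample document. The helper name has_six_characters is illustrative; Beautiful Soup calls such a function with each class value, or with None when a tag has no class.

```python
import re
from bs4 import BeautifulSoup

html_doc = """
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters:
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>.</p>
"""
soup = BeautifulSoup(html_doc, "html.parser")


def has_six_characters(css_class):
    # True only for class names that are exactly six characters long.
    return css_class is not None and len(css_class) == 6


by_string = soup.find_all(class_="sister")            # exact class name
by_regex = soup.find_all(class_=re.compile("itl"))    # pattern inside the class name
by_function = soup.find_all(class_=has_six_characters)
by_true = soup.find_all(class_=True)                  # any tag that has a class
```

Only "sister" is six characters long, so the function filter selects the three links and skips both paragraphs.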

The text argument searches the string content of the document. Like the name argument, text accepts a string, a regular expression, a list, or True. For example:

soup.find_all(text=re.compile("Dormouse"))
# ["The Dormouse's story", "The Dormouse's story"]

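Although the text argument is meant for searching strings, it can be combined with other arguments to filter tags. Beautiful Soup then finds the tags whose .string attribute matches the text value. The following code finds the <a> tags whose content is exactly "Elsie":

soup.find_all("a", text="Elsie")
# [<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>]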
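The sketch below contrasts string-only searches with tag searches filtered by string content. It uses the string keyword, which is the name Beautiful Soup 4.4.0 and later use for the text argument. On older versions, substitute text. The variable names are illustrative.

```python
import re
from bs4 import BeautifulSoup

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<p class="title"><b>The Dormouse's story</b></p>
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
"""
soup = BeautifulSoup(html_doc, "html.parser")

# Without a tag name, the search returns the matching strings themselves.
strings = [str(s) for s in soup.find_all(string=re.compile("Dormouse"))]

# With a tag name, it returns tags whose .string matches.
exact = soup.find_all("a", string="Elsie")

# A plain string must equal the whole content; a fragment does not match.
fragment = soup.find_all("a", string="Els")

# Use a regular expression for partial matches.
partial = soup.find_all("a", string=re.compile("Els"))
```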

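find_all() returns every match, so a search over a large document tree can be slow. If you do not need all the results, use the limit argument to cap how many are returned. It works like the LIMIT keyword in SQL: once the number of matches reaches limit, the search stops and returns what it has found.

Three tags in the document tree match the search below, but only two are returned because the count is capped:

soup.find_all("a", limit=2)
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]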
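A short sketch of limit follows, together with find(), which returns just the first match (or None) instead of a list. The document is a shortened copy of the sample, and the variable names are illustrative.

```python
from bs4 import BeautifulSoup

html_doc = """
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
"""
soup = BeautifulSoup(html_doc, "html.parser")

# Stop after the first two matches, in document order.
first_two = [a["id"] for a in soup.find_all("a", limit=2)]

# find() returns the first matching tag directly rather than a list.
first = soup.find("a")

# When nothing matches, find() returns None while find_all() returns an empty list.
missing = soup.find("table")
```

Check for None before using the result of find(), since attribute access on None raises an error.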

From <https://www.crummy.com/software/BeautifulSoup/bs4/doc.zh/#true>
