Python Crawler Series (6): Searching the Document Tree

Posted by weixin_34019929 on 2017-10-24


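This morning I lost what I had written. Zhihu has a bug: it claims to auto-save drafts, but nothing was actually saved. Speechless.

Tonight we continue with how to parse HTML documents.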

1. Strings

# Find elements directly by tag name

soup.find_all('b')
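The snippets in this post use a soup object that is never defined here. A minimal sketch of one possible setup follows, assuming the "three sisters" sample document from the Beautiful Soup documentation and the built-in html.parser (both are assumptions, not something this post states):

```python
from bs4 import BeautifulSoup

# Assumed sample document (the "three sisters" example from the Beautiful Soup docs)
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
</body></html>
"""

soup = BeautifulSoup(html_doc, "html.parser")

# A string filter matches tags whose name is exactly that string
bold_tags = soup.find_all("b")
print(bold_tags)
```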

2. Regular expressions

# Find tags whose name matches a regular expression

import re

for tag in soup.find_all(re.compile("^b")):

    print(tag.name)

3. Lists

Find both <a> and <b> tags

soup.find_all(["a", "b"])

4. True

Find all tags

for tag in soup.find_all(True):

    print(tag.name)

5. Functions

def has_class_but_no_id(tag):

    return tag.has_attr('class') and not tag.has_attr('id')

# Pass the function as a filter. Only elements for which it returns True are returned

soup.find_all(has_class_but_no_id)
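To compare the five filter types side by side, here is a small sketch run against a tiny made-up fragment (the markup is an assumption chosen for illustration, not the document used elsewhere in this post):

```python
import re
from bs4 import BeautifulSoup

# Hypothetical fragment, just large enough to tell the filters apart
html = '<body><p class="title"><b>Bold</b></p><a id="link1" class="sister" href="#">A</a></body>'
soup = BeautifulSoup(html, "html.parser")


def has_class_but_no_id(tag):
    # Function filter: receives a tag, returns True to keep it
    return tag.has_attr("class") and not tag.has_attr("id")


by_string = [t.name for t in soup.find_all("b")]            # exact tag name
by_regex = [t.name for t in soup.find_all(re.compile("^b"))]  # names starting with "b"
by_list = [t.name for t in soup.find_all(["a", "b"])]        # any name in the list
by_true = [t.name for t in soup.find_all(True)]              # every tag, no strings
by_function = [t.name for t in soup.find_all(has_class_but_no_id)]

print(by_string, by_regex, by_list, by_true, by_function)
```

Results come back in document order, so the list filter returns the <b> tag before the <a> tag here even though "a" is listed first.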

6. find_all

find_all() searches all descendants of the current tag and returns those that match the filter. Some examples:

soup.find_all("title")

# Find <p> elements with class="title"

soup.find_all("p", "title")

# Find all <a> elements

soup.find_all("a")

# Find by id

soup.find_all(id="link2")

# Find by text content

import re

soup.find(text=re.compile("sisters"))

# Regular expression: find elements whose attribute value matches

soup.find_all(href=re.compile("elsie"))

# Find elements that have an id attribute

soup.find_all(id=True)

# Filter on several conditions at once

soup.find_all(href=re.compile("elsie"), id='link1')

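Some tag attributes cannot be used as keyword arguments in a search, for example the HTML5 data-* attributes:

data_soup = BeautifulSoup('<div data-foo="value">foo!</div>', "lxml")

# SyntaxError: data-foo is not a valid keyword argument name

data_soup.find_all(data-foo="value")

You can still search for such attributes by passing a dictionary as the attrs argument of find_all():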

data_soup.find_all(attrs={"data-foo": "value"})
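A short sketch of why the keyword form fails while the attrs form works. The <div> markup and the html.parser choice are assumptions made for this example:

```python
from bs4 import BeautifulSoup

# Assumed markup: one element carrying a data-* attribute
data_soup = BeautifulSoup('<div data-foo="value">foo!</div>', "html.parser")

# data-foo is not a valid Python identifier, so the keyword form does not even compile
try:
    compile('data_soup.find_all(data-foo="value")', "<example>", "eval")
    keyword_form_compiles = True
except SyntaxError:
    keyword_form_compiles = False

# The attrs dictionary takes the attribute name as an ordinary string
matches = data_soup.find_all(attrs={"data-foo": "value"})
print(keyword_form_compiles, matches)
```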

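# Searching by CSS class: note how class is spelled

Searching for tags by CSS class name is very useful, but class is a reserved word in Python, so using class as a keyword argument is a syntax error. As of Beautiful Soup 4.1.2 you can use the class_ argument to search for tags with a given CSS class: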

soup.find_all("a", class_="sister")

Like any other keyword argument, class_ accepts different kinds of filters: a string, a regular expression, a function, or True:

soup.find_all(class_=re.compile("itl"))

def has_six_characters(css_class):

    return css_class is not None and len(css_class) == 6

soup.find_all(class_=has_six_characters)

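A tag's class attribute can hold multiple values. When you search by CSS class, you match against each of the tag's classes individually:

css_soup = BeautifulSoup('<p class="body strikeout"></p>', "lxml")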

css_soup.find_all("p", class_="strikeout")

css_soup.find_all("p", class_="body")

You can also match the exact string value of the class attribute:

css_soup.find_all("p", class_="body strikeout")

When matching the full class value, the search finds nothing if the class names are in a different order from the markup.
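The ordering rule is easy to check with a one-tag document. This is a sketch under the assumption that the element is <p class="body strikeout">; the final select() call is one possible way to match several classes regardless of order:

```python
from bs4 import BeautifulSoup

# Assumed markup: a single paragraph with two CSS classes
css_soup = BeautifulSoup('<p class="body strikeout"></p>', "html.parser")

single = css_soup.find_all("p", class_="strikeout")          # matches one of the classes
exact = css_soup.find_all("p", class_="body strikeout")      # matches the full attribute string
reordered = css_soup.find_all("p", class_="strikeout body")  # same classes, different order

# A CSS selector does not care about the order of the class names
by_selector = css_soup.select("p.strikeout.body")

print(len(single), len(exact), len(reordered), len(by_selector))
```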

# The class can also be passed through the attrs dictionary

soup.find_all("a", attrs={"class": "sister"})

The text argument searches for strings in the document instead of tags.

Like the name argument, text accepts a string, a regular expression, a list, a function, or True.

soup.find_all(text="Elsie")

soup.find_all(text=["Tillie", "Elsie", "Lacie"])

soup.find_all(text=re.compile("Dormouse"))

def is_the_only_string_within_a_tag(s):

    return (s == s.parent.string)

soup.find_all(text=is_the_only_string_within_a_tag)

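Although text is meant for finding strings, it can be combined with arguments that find tags. Beautiful Soup then returns the tags whose .string attribute matches the text value. The code below finds <a> tags whose .string is exactly "Elsie":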

soup.find_all("a", text="Elsie")
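Note that a plain string must equal the tag's .string, while a regular expression only needs to match somewhere inside it. The sketch below uses made-up markup and the string argument, which is the newer name for text (available since Beautiful Soup 4.4.0; the example assumes at least that version):

```python
import re
from bs4 import BeautifulSoup

# Hypothetical markup: one link is exactly "Elsie", the other only contains it
html = '<p><a id="link1">Elsie</a>, <a id="link2">Lacie and Elsie</a></p>'
soup = BeautifulSoup(html, "html.parser")

# Exact comparison against each <a> tag's .string
exact = [a["id"] for a in soup.find_all("a", string="Elsie")]

# Regular expression: matches anywhere in the string
partial = [a["id"] for a in soup.find_all("a", string=re.compile("Elsie"))]

print(exact, partial)
```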

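find_all() returns every match, so on a large document tree the search can be slow. If you do not need all the results, use the limit argument to cap how many are returned. It works like the LIMIT keyword in SQL: once the number of results reaches limit, the search stops and returns what it has found.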

soup.find_all("a", limit=2)

When you call find_all() on a tag, Beautiful Soup examines all of that tag's descendants. To search only the tag's direct children, pass recursive=False.

soup.html.find_all("title")

soup.html.find_all("title", recursive=False)
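The difference shows up as soon as the target is nested more than one level deep. A minimal sketch, assuming a bare document in which <title> sits inside <head>:

```python
from bs4 import BeautifulSoup

# Assumed markup: <title> is a grandchild of <html>, not a direct child
html = "<html><head><title>The Dormouse's story</title></head><body></body></html>"
soup = BeautifulSoup(html, "html.parser")

deep = soup.html.find_all("title")                          # searches all descendants
shallow = soup.html.find_all("title", recursive=False)      # direct children only
direct_head = soup.html.find_all("head", recursive=False)   # <head> is a direct child

print(len(deep), len(shallow), len(direct_head))
```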

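find_all() is probably the most used search method in Beautiful Soup, so there is a shortcut for it. A BeautifulSoup object or a Tag object can be called like a function, and the result is the same as calling find_all() on that object. The two lines in each pair below are equivalent: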

soup.find_all("a")

soup("a")

soup.title.find_all(text=True)

soup.title(text=True)

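7. find

soup.find_all('title', limit=1) and soup.find('title') locate the same element.

find() returns as soon as it finds the first match. find_all() returns a list, while find() returns a single object.

When nothing matches, find_all() returns an empty list and find() returns None.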
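The return types are the part that tends to cause bugs, so here is a small sketch of both the hit and the miss case (the one-line document is an assumption for illustration):

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<html><head><title>T</title></head></html>", "html.parser")

as_list = soup.find_all("title", limit=1)  # a list holding at most one tag
as_tag = soup.find("title")                # the tag itself

missing_list = soup.find_all("nosuchtag")  # empty list when nothing matches
missing_tag = soup.find("nosuchtag")       # None when nothing matches

print(as_list, as_tag, missing_list, missing_tag)
```

Because find() can return None, chaining attribute access directly on its result (for example soup.find("nosuchtag").text) raises AttributeError; check for None first.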

soup.head.title uses tag names as attributes, which is a shortcut. It works by calling find() repeatedly on the current tag.

soup.head.title is equivalent to soup.find("head").find("title")

8. find_parents() and find_parent()

soup = BeautifulSoup(html_doc, "lxml")

a_string = soup.find(text="Lacie")

print('1---------------------------')

print(a_string)

print('2---------------------------')

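# All ancestors that are <a> tags (returns a list)

print(a_string.find_parents("a"))

print('3---------------------------')

# The nearest ancestor that is a <p> tag (returns a single tag)

print(a_string.find_parent("p"))

print('4---------------------------')

# All ancestors that are <p> tags with class "title"

print(a_string.find_parents("p", class_="title"))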
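Both methods walk upward through the ancestors of the starting node. A self-contained sketch with made-up markup (the string argument is the newer name for text and assumes Beautiful Soup 4.4.0 or later):

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the string "Lacie" sits inside <a>, which sits inside <p class="story">
html = '<p class="story">Names: <a id="link2">Lacie</a></p>'
soup = BeautifulSoup(html, "html.parser")

a_string = soup.find(string="Lacie")

link_ancestors = a_string.find_parents("a")                   # list of matching ancestors
nearest_paragraph = a_string.find_parent("p")                 # first matching ancestor
title_ancestors = a_string.find_parents("p", class_="title")  # no such ancestor here

print(link_ancestors, nearest_paragraph["class"], title_ancestors)
```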

9. find_next_siblings() and find_next_sibling()

soup = BeautifulSoup(html_doc, "lxml")

a_string = soup.find(text="Lacie")

print('1---------------------------')

first_link = soup.a

print(first_link)

print('2---------------------------')

# All later siblings of the current element that are <a> tags

print(first_link.find_next_siblings("a"))

print('3---------------------------')

first_story_paragraph = soup.find("p", "story")

# The first later sibling of the current element that is a <p> tag

print(first_story_paragraph.find_next_sibling("p"))

10. find_previous_siblings() and find_previous_sibling()

The same as section 9, but in the opposite direction

last_link = soup.find("a", id="link3")

last_link

last_link.find_previous_siblings("a")

first_story_paragraph = soup.find("p", "story")

first_story_paragraph.find_previous_sibling("p")
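The sibling searches in sections 9 and 10 can be tried on three links inside one paragraph. This is a sketch with assumed markup; it also shows that the plain .next_sibling attribute returns the text between the tags, whereas find_next_sibling("a") skips it:

```python
from bs4 import BeautifulSoup

# Hypothetical markup: three sibling links separated by text
html = '<p><a id="link1">Elsie</a>, <a id="link2">Lacie</a> and <a id="link3">Tillie</a></p>'
soup = BeautifulSoup(html, "html.parser")

first_link = soup.a
last_link = soup.find("a", id="link3")

later = [a["id"] for a in first_link.find_next_siblings("a")]
nearest_later = first_link.find_next_sibling("a")["id"]

# Previous siblings are returned nearest first
earlier = [a["id"] for a in last_link.find_previous_siblings("a")]

# The raw attribute does not filter: the next node is the ", " string
raw_next = first_link.next_sibling

print(later, nearest_later, earlier, repr(raw_next))
```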

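11. find_all_next() and find_next()

Both methods use .next_elements to iterate over the tags and strings that come after the current tag in the document. find_all_next() returns all matching nodes, and find_next() returns the first matching node: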

first_link.find_all_next(text=True)

first_link.find_next("p")

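12. find_all_previous() and find_previous()

Both methods use .previous_elements to iterate over the tags and strings that come before the current node in the document. find_all_previous() returns all matching nodes, and find_previous() returns the first matching node: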

first_link.find_all_previous("p")

first_link.find_previous("title")
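Unlike the sibling methods, these four follow document order and therefore cross parent boundaries. A sketch with assumed markup that makes the difference visible:

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the link is the only child of the first paragraph
html = '<div><p id="p1"><a id="link1">Elsie</a></p><p id="p2">...</p></div>'
soup = BeautifulSoup(html, "html.parser")

first_link = soup.a

sibling_p = first_link.find_next_sibling("p")      # the link has no siblings at all
next_p = first_link.find_next("p")                 # next <p> in document order
previous_p = first_link.find_previous("p")         # the enclosing <p> comes earlier in the document
following_text = [str(s) for s in first_link.find_all_next(string=True)]

print(sibling_p, next_p["id"], previous_p["id"], following_text)
```

The link's own text is included in following_text because .next_elements starts with the contents of the current tag.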

13. CSS selectors

Find elements by tag name

soup.select("title")

soup.select("p:nth-of-type(3)")

Find elements through the hierarchy (descendants)

soup.select("body a")

soup.select("html head title")

Find direct children

soup.select("head > title")

soup.select("p > a")

soup.select("p > a:nth-of-type(2)")

soup.select("p > #link1")

soup.select("body > a")

Find sibling tags

soup.select("#link1 ~ .sister")

soup.select("#link1 + .sister")
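The two sibling combinators are easy to mix up: ~ selects every later sibling that matches, while + selects only the immediately following element. A sketch on assumed markup, with a small hypothetical helper ids() to keep the output readable:

```python
from bs4 import BeautifulSoup

# Hypothetical markup: three sibling links inside one paragraph
html = ('<p class="story"><a class="sister" id="link1">Elsie</a>, '
        '<a class="sister" id="link2">Lacie</a> and '
        '<a class="sister" id="link3">Tillie</a></p>')
soup = BeautifulSoup(html, "html.parser")


def ids(tags):
    # Helper for this example only: reduce a result list to its id attributes
    return [t["id"] for t in tags]


general = ids(soup.select("#link1 ~ .sister"))     # all later siblings with class sister
adjacent = ids(soup.select("#link1 + .sister"))    # only the next element sibling
second = ids(soup.select("p > a:nth-of-type(2)"))  # second <a> child of a <p>

print(general, adjacent, second)
```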

Find by CSS class name

soup.select(".sister")

Note that class is written without the trailing underscore here

soup.select("[class~=sister]")

Find by tag id

soup.select("#link1")

Find by whether an attribute exists

soup.select('a[href]')

Find by attribute value

soup.select('a[href="http://example.com/elsie"]')

# href ends with "tillie"

soup.select('a[href$="tillie"]')

# href contains ".com/el"

soup.select('a[href*=".com/el"]')

Find by language code, which is just another attribute match:

multilingual_soup = BeautifulSoup(multilingual_markup)

multilingual_soup.select('p[lang|=en]')
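multilingual_markup is not defined anywhere in this post. The sketch below assumes the sample markup from the Beautiful Soup documentation; the |= operator matches a value that is exactly "en" or starts with "en-":

```python
from bs4 import BeautifulSoup

# Assumed sample markup (from the Beautiful Soup docs, not from this post)
multilingual_markup = """
<p lang="en">Hello</p>
<p lang="en-us">Howdy, y'all</p>
<p lang="en-gb">Pip-pip, old fruit</p>
<p lang="fr">Bonjour mes amis</p>
"""

multilingual_soup = BeautifulSoup(multilingual_markup, "html.parser")

english = multilingual_soup.select("p[lang|=en]")
langs = [p["lang"] for p in english]

print(langs)
```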

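Anyone who knows jQuery will understand this part at a glance.

As a programmer, you should learn to carry what you know from one tool over to the next.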
