Python Web Scraping: Beautiful Soup Basics with Examples

Posted by Monste on 2020-08-12

Original URL: https://www.cnblogs.com/Monste/p/13489026.html

# Python Web Scraping: Beautiful Soup Basics

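Beautiful Soup is a Python library for extracting data from HTML and XML files. It works with your parser of choice to provide idiomatic ways of navigating, searching, and modifying the parse tree.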

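Note that Beautiful Soup automatically converts incoming documents to Unicode and outgoing documents to UTF-8. You normally do not need to think about encodings; only when the document does not declare an encoding and Beautiful Soup cannot detect one do you need to state the original encoding.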
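
The encoding behaviour described above can be sketched as follows. The sample bytes and the choice of GBK are assumptions made for illustration; `from_encoding` is the constructor argument for stating the original encoding.

```python
from bs4 import BeautifulSoup

# Hypothetical input: a fragment encoded as GBK, supplied as raw bytes.
raw = "<p>你好</p>".encode("gbk")

# State the original encoding explicitly instead of relying on detection.
soup = BeautifulSoup(raw, "html.parser", from_encoding="gbk")

text = soup.p.string            # already a Unicode string
utf8_bytes = soup.p.encode()    # Tag.encode() defaults to UTF-8

print(text)
print(utf8_bytes.decode("utf-8"))
```

When the input is already a Python str (for example `r.text` from requests), no decoding step is involved and `from_encoding` is not needed.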

## 1. Installing the Beautiful Soup Library

Install Beautiful Soup 4 with pip:

pip install beautifulsoup4

## 2. Main Parsers for BeautifulSoup

|Parser|Usage|Requirement|
|---|---|---|
|Python's built-in HTML parser|BeautifulSoup(markup, 'html.parser')|Install the bs4 library|
|lxml's HTML parser|BeautifulSoup(markup, 'lxml')|pip install lxml|
|lxml's XML parser|BeautifulSoup(markup, 'xml')|pip install lxml|
|html5lib's parser|BeautifulSoup(markup, 'html5lib')|pip install html5lib|

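Usage:

from bs4 import BeautifulSoup
# BeautifulSoup parses markup text and does not fetch URLs, so pass it HTML rather than an address
html = '<html><body><a href="https://www.baidu.com">Baidu</a></body></html>'
bs = BeautifulSoup(html, 'html.parser')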
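
BeautifulSoup parses whatever string it is given and downloads nothing itself. The sketch below contrasts a URL string with real markup; the variable names are illustrative only.

```python
import warnings
from bs4 import BeautifulSoup

# A URL is just text to the parser. Recent bs4 versions warn about this,
# so warnings are silenced here to keep the output clean.
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    from_url_string = BeautifulSoup("https://www.baidu.com", "html.parser")

from_markup = BeautifulSoup("<a href='https://www.baidu.com'>Baidu</a>", "html.parser")

print(len(from_url_string.find_all(True)))   # number of tags parsed from the URL string
print(from_markup.a["href"])
```

To parse a live page, fetch it first (for example with requests) and pass the response text to BeautifulSoup.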

## 3. Basic Usage of BeautifulSoup

Take part of the source of the Baidu home page as an example:

<!DOCTYPE html>
<html>
<head>
  <meta content="text/html;charset=utf-8" http-equiv="content-type" />
  <meta content="IE=Edge" http-equiv="X-UA-Compatible" />
  <meta content="always" name="referrer" />
  <link href="https://ss1.bdstatic.com/5eN1bjq8AAUYm2zgoY3K/r/www/cache/bdorz/baidu.min.css" rel="stylesheet" type="text/css" />
  <title>百度一下，你就知道 </title>
</head>
<body link="#0000cc">
 <div id="wrapper">
  <div id="head">
    <div class="head_wrapper">
     <div id="u1">
      <a class="mnav" href="http://news.baidu.com" name="tj_trnews">新聞 </a>
      <a class="mnav" href="https://www.hao123.com" name="tj_trhao123">hao123 </a>
      <a class="mnav" href="http://map.baidu.com" name="tj_trmap">地圖 </a>
      <a class="mnav" href="http://v.baidu.com" name="tj_trvideo">視訊 </a>
      <a class="mnav" href="http://tieba.baidu.com" name="tj_trtieba">貼吧 </a>
      <a class="bri" href="//www.baidu.com/more/" name="tj_briicon" style="display: block;">更多產品 </a>
     </div>
    </div>
  </div>
 </div>
</body>
</html>

Combining requests with BeautifulSoup's HTML parser, the page can be parsed as follows:

import requests
from bs4 import BeautifulSoup

# Load the page source with the requests library
r = requests.get('https://www.baidu.com')
r.encoding = r.apparent_encoding
html = r.text

bs = BeautifulSoup(html, 'html.parser')

print(bs.prettify())    # print the page in prettified form

The output is as follows:

<!DOCTYPE html>
<!--STATUS OK-->
<html>
 <head>
  <meta content="text/html;charset=utf-8" http-equiv="content-type"/>
  <meta content="IE=Edge" http-equiv="X-UA-Compatible"/>
  <meta content="always" name="referrer"/>
  <link href="https://ss1.bdstatic.com/5eN1bjq8AAUYm2zgoY3K/r/www/cache/bdorz/baidu.min.css" rel="stylesheet" type="text/css"/>
  <title>
   百度一下，你就知道
  </title>
 </head>
 <body link="#0000cc">
  <div id="wrapper">
   <div id="head">
    <div class="head_wrapper">
     <div class="s_form">
      <div class="s_form_wrapper">
       <div id="lg">
        <img height="129" hidefocus="true" src="//www.baidu.com/img/bd_logo1.png" width="270"/>
       </div>
       <form action="//www.baidu.com/s" class="fm" id="form" name="f">
        <input name="bdorz_come" type="hidden" value="1"/>
        <input name="ie" type="hidden" value="utf-8"/>
        <input name="f" type="hidden" value="8"/>
        <input name="rsv_bp" type="hidden" value="1"/>
        <input name="rsv_idx" type="hidden" value="1"/>
        <input name="tn" type="hidden" value="baidu"/>
        <span class="bg s_ipt_wr">
         <input autocomplete="off" autofocus="autofocus" class="s_ipt" id="kw" maxlength="255" name="wd" value=""/>
        </span>
        <span class="bg s_btn_wr">
         <input autofocus="" class="bg s_btn" id="su" type="submit" value="百度一下"/>
        </span>
       </form>
      </div>
     </div>
     <div id="u1">
      <a class="mnav" href="http://news.baidu.com" name="tj_trnews">
       新聞
      </a>
      <a class="mnav" href="https://www.hao123.com" name="tj_trhao123">
       hao123
      </a>
      <a class="mnav" href="http://map.baidu.com" name="tj_trmap">
       地圖
      </a>
      <a class="mnav" href="http://v.baidu.com" name="tj_trvideo">
       視訊
      </a>
      <a class="mnav" href="http://tieba.baidu.com" name="tj_trtieba">
       貼吧
      </a>
      <noscript>
       <a class="lb" href="http://www.baidu.com/bdorz/login.gif?login&amp;tpl=mn&amp;u=http%3A%2F%2Fwww.baidu.com%2f%3fbdorz_come%3d1" name="tj_login">
        登入
       </a>
      </noscript>
      <script>
       document.write('<a href="http://www.baidu.com/bdorz/login.gif?login&tpl=mn&u='+ encodeURIComponent(window.location.href+ (window.location.search === "" ? "?" : "&")+ "bdorz_come=1")+ '" name="tj_login" class="lb">登入</a>');
      </script>
      <a class="bri" href="//www.baidu.com/more/" name="tj_briicon" style="display: block;">
       更多產品
      </a>
     </div>
    </div>
   </div>
   <div id="ftCon">
    <div id="ftConw">
     <p id="lh">
      <a href="http://home.baidu.com">
       關於百度
      </a>
      <a href="http://ir.baidu.com">
       About Baidu
      </a>
     </p>
     <p id="cp">
      ©2017 Baidu
      <a href="http://www.baidu.com/duty/">
       使用百度前必讀
      </a>
      <a class="cp-feedback" href="http://jianyi.baidu.com/">
       意見反饋
      </a>
      京ICP證030173號
      <img src="//www.baidu.com/img/gs.gif"/>
     </p>
    </div>
   </div>
  </div>
 </body>
</html>

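## 4. Basic Elements of the BeautifulSoup Class

BeautifulSoup converts a complex HTML document into a tree structure in which every node is a Python object. All objects fall into four kinds: Tag, NavigableString, Comment, and BeautifulSoup.

|Element|Description|
|---|---|
|Tag|A tag, the basic unit of information, delimited by <> and </>. Accessed as bs.a or bs.p (which returns the first a or p tag).|
|Name|The tag's name, accessed as .name.|
|Attributes|The tag's attributes as a dictionary, accessed as .attrs.|
|NavigableString|The non-attribute string inside a tag, i.e. the text between <> and </>, accessed as .string.|
|Comment|A comment inside a tag; a special type of NavigableString.|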

### Tag

Any tag that exists in the HTML can be accessed as bs.tag. If the document contains several tags with the same name, bs.tag returns the first one. Example:

import requests
from bs4 import BeautifulSoup

# Load the page source with the requests library
r = requests.get('https://www.baidu.com')
r.encoding = r.apparent_encoding
html = r.text
bs = BeautifulSoup(html, 'html.parser')
# Get the first a tag, including its contents
print(bs.a)  # <a class="mnav" href="http://news.baidu.com" name="tj_trnews">新聞</a>
print(type(bs.a))  # <class 'bs4.element.Tag'>

The two most important properties of a Tag are name and attrs, used as follows:

print(bs.a.name)    # a
# Print all attributes of the a tag; attrs returns a dictionary
print(bs.a.attrs)   # {'href': 'http://news.baidu.com', 'name': 'tj_trnews', 'class': ['mnav']}
# Like bs.a.get('class'), except that indexing raises KeyError when the attribute is missing
print(bs.a['class'])    # ['mnav']
bs.a['class'] = 'newClass'  # change the value of the class attribute
print(bs.a) # <a class="newClass" href="http://news.baidu.com" name="tj_trnews">新聞</a>
del bs.a['class']   # delete the class attribute
print(bs.a) # <a href="http://news.baidu.com" name="tj_trnews">新聞</a>
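
The same attribute operations can be tried without a network request. This is a minimal sketch on an inline fragment modelled on the Baidu link above; the link text is replaced with an English placeholder.

```python
from bs4 import BeautifulSoup

html = '<a class="mnav" href="http://news.baidu.com" name="tj_trnews">News</a>'
soup = BeautifulSoup(html, "html.parser")
link = soup.a

tag_name = link.name          # the tag's name
href = link.get("href")       # .get() returns None for a missing attribute
missing = link.get("id")
classes = link["class"]       # class is multi-valued, so it comes back as a list

link["class"] = "newClass"    # overwrite an attribute
del link["name"]              # remove an attribute
print(link)
```

Using `.get()` rather than indexing is a reasonable default when scraping, because real pages often omit attributes.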

### NavigableString

The .string attribute returns the text inside a tag as a NavigableString object:

import requests
from bs4 import BeautifulSoup

# Load the page source with the requests library
r = requests.get('https://www.baidu.com')
r.encoding = r.apparent_encoding
html = r.text
bs = BeautifulSoup(html, 'html.parser')

print(bs.title.string)  # 百度一下，你就知道
print(type(bs.title.string))    # <class 'bs4.element.NavigableString'>
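
One detail worth knowing: .string only returns a value when a tag has a single child. The sketch below, on an illustrative inline fragment, shows the difference between .string and get_text().

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>Hello <b>world</b></p>", "html.parser")

single = soup.b.string         # one child string, so .string returns it
mixed = soup.p.string          # <p> has two children, so .string is None
all_text = soup.p.get_text()   # concatenates every string under <p>

print(single, mixed, all_text)
```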

### Comment

A Comment object is a special type of NavigableString. When printed, its content does not include the comment delimiters, only the text of the comment.

from bs4 import BeautifulSoup

html = """<a class="mnav" href="http://news.baidu.com" name="tj_trnews"><!--新聞--></a>"""
bs = BeautifulSoup(html, 'html.parser')

print(bs.a.string)  # 新聞
print(type(bs.a.string))    # <class 'bs4.element.Comment'>
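
Because a Comment looks like ordinary text when accessed through .string, it is often useful to check the type explicitly. A minimal sketch on an illustrative fragment, separating comments from visible text:

```python
from bs4 import BeautifulSoup, Comment

html = "<div><!--a comment--><p>visible text</p></div>"
soup = BeautifulSoup(html, "html.parser")

# Keep only the strings that are Comment instances.
comments = list(soup.find_all(string=lambda s: isinstance(s, Comment)))

# Keep every string that is not a comment.
visible = [s for s in soup.find_all(string=True) if not isinstance(s, Comment)]

print(comments, visible)
```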

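### BeautifulSoup

The BeautifulSoup object (bs here) represents the whole document. Most of the time you can treat it as a Tag object; it supports most of the methods for navigating and searching the tree.

Because the BeautifulSoup object does not correspond to a real HTML or XML tag, it has no name or attributes of its own. Its .name is instead given the special value "[document]":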

print(bs.name)	# [document]

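## 5. Traversing HTML Content with bs4

An HTML document is a tree of nested tags, and this nesting is the basic structure of every HTML page.

Given this structure, there are three basic ways to traverse it:

Downward traversal
Upward traversal
Sibling traversal

Starting from the current node, these traverse the nodes below it, above it, and at the same level, respectively.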

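### Downward traversal

Downward traversal has three attributes:

|Attribute|Description|
|---|---|
|.contents|A list of child nodes; all direct children are stored in the list.|
|.children|An iterator over the child nodes, for looping over direct children.|
|.descendants|An iterator over all descendant nodes, for looping over every node below the tag.|

Code: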

import requests
from bs4 import BeautifulSoup

# Load the page source with the requests library
r = requests.get('https://www.baidu.com')
r.encoding = r.apparent_encoding
html = r.text
bs = BeautifulSoup(html, 'html.parser')

# Loop over the direct children
for child in bs.body.children:
    print(child)

# Loop over all descendants
for child in bs.body.descendants:
    print(child)

# Print the child nodes as a list
print(bs.head.contents)
print(bs.head.contents[0])  # use a list index to get one element
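
To see exactly what each attribute yields, here is a small offline sketch on an illustrative fragment. Note that descendants include text nodes as well as tags.

```python
from bs4 import BeautifulSoup

html = "<div><p>one <b>two</b></p><p>three</p></div>"
soup = BeautifulSoup(html, "html.parser")
div = soup.div

direct = div.contents                          # list of direct children
child_names = [c.name for c in div.children]   # iterator over the same nodes
descendants = [str(d) for d in div.descendants]  # every node below <div>, depth first

print(child_names)
for d in descendants:
    print(repr(d))
```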

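### Upward traversal

Upward traversal has two attributes:

|Attribute|Description|
|---|---|
|.parent|The node's parent tag.|
|.parents|An iterator (a generator) over the node's ancestors, for looping over them.|

Code:

import requests
from bs4 import BeautifulSoup

# Load the page source with the requests library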
r = requests.get('https://www.baidu.com')
r.encoding = r.apparent_encoding
html = r.text
bs = BeautifulSoup(html, 'html.parser')

for parent in bs.a.parents:
    if parent is not None:
        print(parent.name)

print(bs.a.parent.name)
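
A minimal offline sketch of the same idea, on an illustrative fragment. The last ancestor is the BeautifulSoup object itself, whose name is "[document]".

```python
from bs4 import BeautifulSoup

html = "<html><body><div><a href='#'>link</a></div></body></html>"
soup = BeautifulSoup(html, "html.parser")

parent_name = soup.a.parent.name
ancestor_names = [p.name for p in soup.a.parents]

print(parent_name)
print(ancestor_names)
```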

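### Sibling traversal

Sibling traversal has four attributes. Siblings include text nodes (such as the whitespace between tags), not only tags.

|Attribute|Description|
|---|---|
|.next_sibling|The next sibling node in document order.|
|.previous_sibling|The previous sibling node in document order.|
|.next_siblings|An iterator over all following sibling nodes in document order.|
|.previous_siblings|An iterator over all preceding sibling nodes in document order.|

Code:

import requests
from bs4 import BeautifulSoup

# Load the page source with the requests library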
r = requests.get('https://www.baidu.com')
r.encoding = r.apparent_encoding
html = r.text
bs = BeautifulSoup(html, 'html.parser')

for sibling in bs.a.next_siblings:
    print(sibling)

for sibling in bs.a.previous_siblings:
    print(sibling)
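
Sibling attributes return text nodes too, which often surprises people when the markup contains whitespace between tags. A small sketch on an illustrative fragment:

```python
from bs4 import BeautifulSoup

html = "<div><a>1</a> <a>2</a><a>3</a></div>"
soup = BeautifulSoup(html, "html.parser")
first = soup.a
last = soup.find_all("a")[-1]

raw_next = first.next_sibling                 # the whitespace text node between the tags
next_tag = first.find_next_sibling("a")       # skips text nodes and returns a Tag
following = [str(s) for s in first.next_siblings]
preceding = [str(s) for s in last.previous_siblings]   # nearest sibling first

print(repr(raw_next), next_tag)
print(following)
print(preceding)
```

When only tags are wanted, find_next_sibling() and find_previous_sibling() are usually more convenient than the raw attributes.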

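### Other traversal helpers

|Attribute|Description|
|---|---|
|.strings|If a Tag contains more than one string, i.e. there is text in its descendants, use this to iterate over all of them.|
|.stripped_strings|Same usage as .strings, but with surrounding whitespace removed and whitespace-only strings skipped.|
|.has_attr()|A method that checks whether the Tag has the given attribute.|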
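
A short sketch of these three helpers on an illustrative fragment with line breaks and indentation between tags:

```python
from bs4 import BeautifulSoup

html = '<div id="u1">\n  <a href="http://news.baidu.com">News</a>\n  <a>Map</a>\n</div>'
soup = BeautifulSoup(html, "html.parser")
div = soup.div

raw = list(div.strings)               # includes the whitespace-only strings
clean = list(div.stripped_strings)    # stripped, with empty strings skipped
links_with_href = [a.get_text() for a in div.find_all("a") if a.has_attr("href")]

print(len(raw), clean, links_with_href)
```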

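## 6. Searching the Document Tree

The bs.find_all(name, attrs, recursive, string, **kwargs) method returns a list containing the search results.

|Parameter|Description|
|---|---|
|name|Filter on the tag name.|
|attrs|Filter on tag attribute values; specific attributes can be named.|
|recursive|Whether to search all descendants; defaults to True.|
|string|Filter on the text strings in the document.|

### The name parameter

A string: matches tags whose name is exactly that string. Code: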

a_list = bs.find_all("a")
print(a_list)

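A regular expression: Beautiful Soup matches tag names using the pattern's search() method, so the pattern may match anywhere in the name. Code:

import requests
from bs4 import BeautifulSoup
import re

# Load the page source with the requests library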
r = requests.get('https://www.baidu.com')
r.encoding = r.apparent_encoding
html = r.text
bs = BeautifulSoup(html, 'html.parser')

t_list = bs.find_all(re.compile("p"))
for item in t_list:
    print(item)
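
Because matching uses search(), an unanchored pattern such as "p" also matches tag names that merely contain that letter. A minimal offline sketch on an illustrative fragment:

```python
import re
from bs4 import BeautifulSoup

html = "<html><body><p><span>a</span><b>c</b></p></body></html>"
soup = BeautifulSoup(html, "html.parser")

# "p" anywhere in the tag name: matches both <p> and <span>.
contains_p = [t.name for t in soup.find_all(re.compile("p"))]

# Anchored at the start: tag names beginning with "b".
starts_with_b = [t.name for t in soup.find_all(re.compile("^b"))]

print(contains_p)
print(starts_with_b)
```

Anchoring with ^ and $ (for example "^p$") restricts the match to the exact name.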

A list: Beautiful Soup returns every node that matches any item in the list. Code:

import requests
from bs4 import BeautifulSoup

# Load the page source with the requests library
r = requests.get('https://www.baidu.com')
r.encoding = r.apparent_encoding
html = r.text
bs = BeautifulSoup(html, 'html.parser')

t_list = bs.find_all(["meta", "link"])
for item in t_list:
    print(item)

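A function or method: matching is decided by the function, which receives each tag and returns True or False. Code:

import requests
from bs4 import BeautifulSoup

# Load the page source with the requests library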
r = requests.get('https://www.baidu.com')
r.encoding = r.apparent_encoding
html = r.text
bs = BeautifulSoup(html, 'html.parser')


def name_is_exists(tag):
    return tag.has_attr("name")


t_list = bs.find_all(name_is_exists)
for item in t_list:
    print(item)
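
A filter function can combine several conditions. The sketch below, on an illustrative fragment, keeps tags that have a class attribute but no id; the function name is arbitrary.

```python
from bs4 import BeautifulSoup

html = '<p class="title" id="main">x</p><p class="story">y</p><p>z</p>'
soup = BeautifulSoup(html, "html.parser")

def has_class_but_no_id(tag):
    # Called once per Tag; return True to keep the tag in the results.
    return tag.has_attr("class") and not tag.has_attr("id")

matches = [str(m) for m in soup.find_all(has_class_but_no_id)]
print(matches)
```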

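### The attrs parameter

Not every attribute can be searched with keyword arguments; HTML5 data-* attributes, for example, have names that are not valid Python keyword arguments. The attrs parameter takes a dictionary of the attributes to match.

import requests
from bs4 import BeautifulSoup

# Load the page source with the requests library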
r = requests.get('https://www.baidu.com')
r.encoding = r.apparent_encoding
html = r.text
bs = BeautifulSoup(html, 'html.parser')

t_list = bs.find_all(attrs={"class": "mnav"})


for item in t_list:
    print(item)
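
The difference between keyword filters and the attrs dictionary can be sketched on an inline fragment. The data-sku values here are made up for illustration.

```python
from bs4 import BeautifulSoup

html = ('<li class="gl-item" data-sku="100">A</li>'
        '<li class="gl-item" data-sku="200">B</li>'
        '<li class="other">C</li>')
soup = BeautifulSoup(html, "html.parser")

# class is a reserved word in Python, so the keyword form is class_.
by_class = [t.get_text() for t in soup.find_all("li", class_="gl-item")]

# data-sku cannot be written as a keyword argument, so use attrs.
by_data = [t.get_text() for t in soup.find_all(attrs={"data-sku": "200"})]

print(by_class, by_data)
```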

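### The string parameter

The string parameter searches the text content of the document. Like name, it accepts a string, a regular expression, or a list.

import requests
from bs4 import BeautifulSoup
import re

# Load the page source with the requests library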
r = requests.get('https://www.baidu.com')
r.encoding = r.apparent_encoding
html = r.text
bs = BeautifulSoup(html, 'html.parser')

t_list = bs.find_all(attrs={"class": "mnav"})
for item in t_list:
    print(item)

# string searches for text; text is the older name for the same parameter
t_list = bs.find_all(string="hao123")
for item in t_list:
    print(item)

# string can be combined with other parameters to filter tags
t_list = bs.find_all("a", string=["hao123", "地圖", "貼吧"])
for item in t_list:
    print(item)

t_list = bs.find_all(string=re.compile(r"\d\d"))
for item in t_list:
    print(item)
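
What find_all() returns depends on whether string is used alone or together with a tag name. A minimal offline sketch on an illustrative fragment:

```python
import re
from bs4 import BeautifulSoup

html = '<a href="/n">news</a><a href="/m">map</a><p>call 12 now</p>'
soup = BeautifulSoup(html, "html.parser")

# string alone: the results are the matching strings themselves.
strings_only = list(soup.find_all(string="news"))

# string plus a tag name: the results are tags whose .string matches.
hrefs = [t["href"] for t in soup.find_all("a", string=["news", "map"])]

# A regular expression is searched against each string.
with_digits = list(soup.find_all(string=re.compile(r"\d\d")))

print(strings_only, hrefs, with_digits)
```

A plain string must equal the whole text node, including any surrounding whitespace, so a regular expression is often the safer choice on real pages.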

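When using find_all() with a regular expression (import re first), the usual forms are:

bs.find_all(string=re.compile('python'))  # search for strings containing 'python'

# Or compile the pattern first and pass it in by keyword
pattern = re.compile('python')  # matches the characters 'python'
bs.find_all(string=pattern)  # passed positionally, the pattern would match tag names instead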

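## 7. Common find() Methods

|Method|Description|
|---|---|
|<>.find()|Searches and returns only the first result, or None if nothing matches; same parameters as .find_all().|
|<>.find_parent()|Returns a single result from the ancestor nodes; same parameters as .find_all().|
|<>.find_parents()|Searches the ancestor nodes and returns a list; same parameters as .find_all().|
|<>.find_next_sibling()|Returns a single result from the following siblings; same parameters as .find_all().|
|<>.find_next_siblings()|Searches the following siblings and returns a list; same parameters as .find_all().|
|<>.find_previous_sibling()|Returns a single result from the preceding siblings; same parameters as .find_all().|
|<>.find_previous_siblings()|Searches the preceding siblings and returns a list; same parameters as .find_all().|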
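
The single-result methods return a Tag (or None when nothing matches), while the plural ones return a list. A short sketch on a fragment modelled on the Baidu navigation links, with English placeholder text:

```python
from bs4 import BeautifulSoup

html = ('<div id="u1"><a class="mnav">news</a>'
        '<a class="mnav">map</a><a class="bri">more</a></div>')
soup = BeautifulSoup(html, "html.parser")

first = soup.find("a", class_="mnav")       # first match, a Tag
nothing = soup.find("table")                # no match: None, not an empty list
container = first.find_parent("div")
next_a = first.find_next_sibling("a")
later = [t.get_text() for t in first.find_next_siblings("a")]

print(first.get_text(), nothing, container["id"], next_a.get_text(), later)
```

Since find() can return None, checking the result before accessing its attributes avoids an AttributeError.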

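## 8. Scraping Computer Listings from JD.com

The example prints its results directly to the screen.

(1) We want to scrape one page of computer listings from JD.com, as shown in the figure below:

(2) The page to scrape: https://search.jd.com/search?keyword=macbook%20pro&qrst=1&suggest=5.def.0.V09&wq=macbook%20pro

(3) The goal is to collect the data for every computer on this page, including price, name, and ID. The code is as follows: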

#!/usr/bin/env python
# -*- coding:utf-8 -*-

import requests
from bs4 import BeautifulSoup

headers = {
        'User-agent': "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) "
                      "Chrome/66.0.3359.139 Safari/537.36"
    }

URL = "https://search.jd.com/search?keyword=macbook%20pro&qrst=1&suggest=5.def.0.V09&wq=macbook%20pro"

r = requests.get(URL, headers=headers)
r.encoding = r.apparent_encoding
html = r.text
bs = BeautifulSoup(html, 'html.parser')

all_items = bs.find_all('li', attrs={"class": "gl-item"})

for item in all_items:
    computer_id = item["data-sku"]
    computer_name = item.find('div', attrs={'class': 'p-name p-name-type-2'})
    computer_price = item.find('div', attrs={'class': 'p-price'})
    print('Computer ID: ' + computer_id)
    print('Computer name: ' + computer_name.em.text)
    print('Computer price: ' + computer_price.find('i').string)
    print('------------------------------------------------------------')
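
The loop above assumes every item has a name and a price, and fails with an AttributeError if either find() returns None. The following offline sketch shows one way to make the extraction tolerant of missing fields. The sample markup and the parse_items helper are hypothetical, modelled only on the class names used above; the live page may differ.

```python
from bs4 import BeautifulSoup

# Hypothetical markup modelled on the class names used above.
SAMPLE = """
<ul>
  <li class="gl-item" data-sku="1001">
    <div class="p-price"><strong><i>9999.00</i></strong></div>
    <div class="p-name p-name-type-2"><a><em>MacBook Pro 13</em></a></div>
  </li>
  <li class="gl-item" data-sku="1002">
    <div class="p-name p-name-type-2"><a><em>MacBook Pro 16</em></a></div>
  </li>
</ul>
"""

def parse_items(html):
    """Return one dict per item, using None for any missing field."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for item in soup.find_all("li", class_="gl-item"):
        name = item.find("div", class_="p-name")
        price = item.find("div", class_="p-price")
        price_i = price.find("i") if price is not None else None
        rows.append({
            "id": item.get("data-sku"),
            "name": name.get_text(strip=True) if name is not None else None,
            "price": price_i.get_text(strip=True) if price_i is not None else None,
        })
    return rows

rows = parse_items(SAMPLE)
for row in rows:
    print(row)
```

Returning plain dictionaries keeps parsing separate from printing, so the same function could feed a CSV writer or a database later.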

Part of the output is shown in the figure below:
