Web Scraping: Using XPath

Posted by Xiao0101 on 2024-04-02

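I. Getting Started with XPath

1. Introduction to XPath

XPath (XML Path Language) is a language for locating nodes in an XML document. It is used to navigate and query an XML document, much as SQL is used to query a relational database, and it offers a flexible way to locate and process elements and attributes.

2. Installing lxml

lxml is a third-party Python parsing library that supports both HTML and XML. It is very efficient, and it fills gaps in the XML parsing offered by the xml package in Python's standard library.

Because it is a third-party library, install it before use: pip install lxml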

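from lxml import etree

# Parse the page source into a tree that XPath can query
# (page_source is a string holding the HTML)
selector = etree.HTML(page_source)

# xpath() returns a list of matches
# (expression is an XPath string)
selector.xpath(expression)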
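A minimal sketch of the two calls above, assuming a small hand-written HTML string (the markup and variable names are illustrative and do not come from any real page). It shows that xpath() returns a list even when nothing matches.

```python
from lxml import etree

# Hypothetical HTML snippet used only for illustration
source = "<html><body><p class='intro'>hello</p><p>world</p></body></html>"

# Parse the string into a tree that XPath can query
selector = etree.HTML(source)

texts = selector.xpath("//p/text()")   # text of every <p>
missing = selector.xpath("//table")    # the snippet has no <table>

print(texts)    # ['hello', 'world']
print(missing)  # []
```

Because the result is always a list, indexing with [0] is only safe after checking that the list is not empty.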

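3. How XPath Parsing Works

XPath uses path expressions to select a node or a set of nodes in an XML document. These expressions resemble file-system paths: they walk the hierarchy of elements and attributes to pick out the nodes you want. XPath also supports predicates for filtering, so that target nodes can be located more precisely.

II. XPath Syntax

XPath syntax reference: XPath Reference Manual (php中文網 online manual)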

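1. Selecting Nodes

Expression	Description
nodename	Selects the child elements of the current node that are named nodename.
/	Selects from the root node.
//	Selects matching nodes at any depth in the document (or below the current node when used mid-path), regardless of their position.
.	Selects the current node.
..	Selects the parent of the current node.
@	Selects attributes.

The table below lists some path expressions and their results:

Path expression	Result
bookstore	Selects the bookstore elements that are children of the current node.
/bookstore	Selects the root element bookstore. Note: a path that starts with a forward slash ( / ) is always an absolute path to an element!
bookstore/book	Selects all book elements that are children of bookstore.
//book	Selects all book elements, wherever they are in the document.
bookstore//book	Selects all book elements that are descendants of bookstore, however deep below bookstore they sit.
//@lang	Selects all attributes named lang.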
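The expressions in the table can be tried on a tiny document. The sketch below assumes a made-up bookstore XML (titles and prices are illustrative) and parses it with etree.fromstring, so the root element is bookstore itself.

```python
from lxml import etree

# Hypothetical bookstore document used only for illustration
xml = """
<bookstore>
  <book><title lang="eng">Harry Potter</title><price>29.99</price></book>
  <book><title lang="eng">Learning XML</title><price>39.95</price></book>
</bookstore>
"""
root = etree.fromstring(xml.strip())

root_elems = root.xpath("/bookstore")      # absolute path to the root element
books = root.xpath("book")                 # relative: book children of the context node
all_titles = root.xpath("//title")         # every title, at any depth
langs = root.xpath("//@lang")              # every lang attribute value
parent = all_titles[0].xpath("..")[0]      # parent of the first title

print([e.tag for e in root_elems])    # ['bookstore']
print(len(books))                     # 2
print([t.text for t in all_titles])   # ['Harry Potter', 'Learning XML']
print(langs)                          # ['eng', 'eng']
print(parent.tag)                     # book
```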

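2. Predicates

A predicate finds a specific node, or a node that contains a specific value.

Predicates are written inside square brackets.

The table below lists some path expressions with predicates and their results:

Path expression	Result
/bookstore/book[1]	Selects the first book element that is a child of bookstore.
/bookstore/book[last()]	Selects the last book element that is a child of bookstore.
/bookstore/book[last()-1]	Selects the second-to-last book element that is a child of bookstore.
/bookstore/book[position()<3]	Selects the first two book elements that are children of bookstore.
//title[@lang]	Selects all title elements that have an attribute named lang.
//title[@lang='eng']	Selects all title elements whose lang attribute has the value eng.
/bookstore/book[price>35.00]	Selects all book elements under bookstore whose price element has a value greater than 35.00.
/bookstore/book[price>35.00]/title	Selects the title elements of all book elements under bookstore whose price element has a value greater than 35.00.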
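A short sketch of the predicates in the table, assuming a made-up three-book document (the names Book A to Book C and the prices are illustrative).

```python
from lxml import etree

# Hypothetical data: three books with different lang attributes and prices
xml = """
<bookstore>
  <book><title lang="eng">Book A</title><price>30.00</price></book>
  <book><title lang="fr">Book B</title><price>45.50</price></book>
  <book><title>Book C</title><price>36.00</price></book>
</bookstore>
"""
root = etree.fromstring(xml.strip())

first = root.xpath("/bookstore/book[1]/title/text()")
last = root.xpath("/bookstore/book[last()]/title/text()")
first_two = root.xpath("/bookstore/book[position()<3]/title/text()")
with_lang = root.xpath("//title[@lang]/text()")
eng = root.xpath("//title[@lang='eng']/text()")
pricey = root.xpath("/bookstore/book[price>35.00]/title/text()")

print(first)      # ['Book A']
print(last)       # ['Book C']
print(first_two)  # ['Book A', 'Book B']
print(with_lang)  # ['Book A', 'Book B']
print(eng)        # ['Book A']
print(pricey)     # ['Book B', 'Book C']
```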

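3. Selecting Unknown Nodes

XPath wildcards can be used to select XML elements whose names are not known in advance.

Wildcard	Description
*	Matches any element node.
@*	Matches any attribute node.
node()	Matches a node of any kind.

The table below lists some path expressions and their results:

Path expression	Result
/bookstore/*	Selects all child elements of bookstore.
//*	Selects all elements in the document.
//title[@*]	Selects all title elements that have at least one attribute.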
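The wildcards can be checked on a small document. This sketch assumes a made-up bookstore with one book and one magazine; the tag names and attribute values are illustrative.

```python
from lxml import etree

# Hypothetical document: one book (with attributes) and one magazine
xml = (
    "<bookstore>"
    "<book id='b1'><title lang='eng'>A</title><price>10</price></book>"
    "<magazine><title>B</title></magazine>"
    "</bookstore>"
)
root = etree.fromstring(xml)

children = [e.tag for e in root.xpath("/bookstore/*")]   # child elements of bookstore
everything = [e.tag for e in root.xpath("//*")]          # every element in the document
titled = root.xpath("//title[@*]/text()")                # titles with at least one attribute
book_attrs = root.xpath("//book/@*")                     # every attribute value on book

print(children)    # ['book', 'magazine']
print(everything)  # ['bookstore', 'book', 'title', 'price', 'magazine', 'title']
print(titled)      # ['A']
print(book_attrs)  # ['b1']
```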

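4. Selecting Several Paths

The "|" operator in a path expression lets you select several paths at once.

The table below lists some path expressions and their results:

Path expression	Result
//book/title | //book/price	Selects all title and price elements of book elements.
//title | //price	Selects all title and price elements in the document.
/bookstore/book/title | //price	Selects all title elements of book elements under bookstore, plus all price elements in the document.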
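A small sketch of the "|" operator, assuming a made-up two-book document. The checks below look only at which nodes are returned, not at their order.

```python
from lxml import etree

# Hypothetical document with two books
xml = (
    "<bookstore>"
    "<book><title>A</title><price>10</price></book>"
    "<book><title>B</title><price>20</price></book>"
    "</bookstore>"
)
root = etree.fromstring(xml)

# Each side of "|" is a complete path; the result holds nodes from both
nodes = root.xpath("//book/title | //book/price")
tags = {n.tag for n in nodes}
values = sorted(n.text for n in nodes)

print(len(nodes))  # 4
print(values)      # ['10', '20', 'A', 'B']
```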

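(1) Logical operators

//div[@id="head" and @class="s_down"]  # all div tags whose id is "head" and whose class is "s_down"
//title | //price  # all title and price elements in the document; each side of "|" must be a complete XPath path

(2) Attribute queries

//div[@id]  # all div nodes that have an id attribute
//div[@id="maincontent"]  # all div tags whose id equals "maincontent"
//@class  # all class attributes
//li[@name="xx"]//text()  # text content inside li tags whose name is "xx"

(3) Selecting the n-th tag (indexes start at 1)

tree.xpath('//li[1]/a/text()')  # the first li (within its parent)
tree.xpath('//li[last()]/a/text()')  # the last li
tree.xpath('//li[last()-1]/a/text()')  # the second-to-last li

(4) Fuzzy matching

//div[contains(@id, "he")]  # all div tags whose id contains "he"
//div[starts-with(@id, "he")]  # all div tags whose id starts with "he"

//div/h1/text()  # text of h1 elements that are direct children of a div
//div/a/@href  # href values of a tags that are direct children of a div
//*  # every element
//*[@class="xx"]  # every tag whose class is "xx"

# Serialize a node back to a string
c = tree.xpath('//li/a')[0]
result = etree.tostring(c, encoding='utf-8')
print(result.decode('UTF-8'))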

5. Example

from lxml import etree

doc = '''
<html>
 <head>
  <base href='http://example.com/' />  <!-- set the base URL -->
  <title>Example website</title>  <!-- set the page title -->
 </head>
 <body>
  <div id='images'>
   <a href='image1.html' id='lqz'>Name: My image 1 <br /><img src='image1_thumb.jpg' /></a>
   <a href='image2.html'>Name: My image 2 <br /><img src='image2_thumb.jpg' /></a>
   <a href='image3.html'>Name: My image 3 <br /><img src='image3_thumb.jpg' /></a>
   <a href='image4.html'>Name: My image 4 <br /><img src='image4_thumb.jpg' /></a>
   <a href='image5.html' class='li li-item' name='items'>Name: My image 5 <br /><img src='image5_thumb.jpg' /></a>
   <a href='image6.html' name='items'><span><h5>test</h5></span>Name: My image 6 <br /><img src='image6_thumb.jpg' /></a>
  </div>
 </body>
</html>
'''

# Parse the HTML string into a queryable object
html = etree.HTML(doc)

# 1. All nodes
all_nodes = html.xpath('//*')
print(all_nodes)

# 2. A specific node (the result is a list)
head_node = html.xpath('//head')
print(head_node)

# 3. Children and descendants
child_nodes = html.xpath('//div/a')  # a tags that are direct children of a div
descendant_nodes = html.xpath('//body//a')  # a tags anywhere under body
print(child_nodes)
print(descendant_nodes)

# 4. Parent node
parent_node = html.xpath('//body//a[1]/..')  # parent of the first a tag
print(parent_node)

# 5. Attribute matching
matched_nodes = html.xpath('//body//a[@href="image1.html"]')  # a tags whose href is "image1.html"
print(matched_nodes)

# 6. Getting text
text = html.xpath('//body//a[@href="image1.html"]/text()')  # text of the a tag whose href is "image1.html"
print(text)

# 7. Getting attributes
href_attributes = html.xpath('//body//a/@href')  # href values of all a tags
print(href_attributes)

# 8. Matching a multi-valued attribute
li_class_nodes = html.xpath('//body//a[contains(@class, "li")]')  # a tags whose class contains "li"
print(li_class_nodes)

# 9. Matching several attributes
matched_nodes = html.xpath('//body//a[contains(@class, "li") and @name="items"]')  # a tags whose class contains "li" and whose name is "items"
print(matched_nodes)

# 10. Selecting by position
second_a_text = html.xpath('//a[2]/text()')  # text of the second a tag
print(second_a_text)

# 11. Axes
ancestors = html.xpath('//a/ancestor::*')  # all ancestors of the a tags
div_ancestor_node = html.xpath('//a/ancestor::div')  # div ancestors of the a tags
attribute_values = html.xpath('//a[1]/attribute::*')  # all attribute values of the first a tag
child_nodes = html.xpath('//a[1]/child::*')  # all child elements of the first a tag
descendant_nodes = html.xpath('//a[6]/descendant::*')  # all descendant elements of the sixth a tag
following_nodes = html.xpath('//a[1]/following::*')  # all elements after the first a tag in document order
following_sibling_nodes = html.xpath('//a[1]/following-sibling::*')  # sibling elements after the first a tag

print(ancestors)
print(div_ancestor_node)
print(attribute_values)
print(child_nodes)
print(descendant_nodes)
print(following_nodes)
print(following_sibling_nodes)
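Two details of the example are easy to misread, so here is a small sketch on a made-up page with two div containers (the markup is illustrative). First, //a[1] means "every a that is the first a child of its own parent", while (//a)[1] means "the first a in the whole document"; the two agree in the example above only because all a tags share one parent. Second, contains(@class, "li") is a substring test, so it also matches a class such as "list"; the concat/normalize-space idiom matches a whole class token.

```python
from lxml import etree

# Hypothetical page with a tags under two different parents
html = etree.HTML("""
<div id="one"><a href="1.html" class="li li-item">first</a><a href="2.html" class="list">second</a></div>
<div id="two"><a href="3.html">third</a></div>
""")

per_parent = html.xpath('//a[1]/text()')     # first a inside each parent
doc_first = html.xpath('(//a)[1]/text()')    # first a in the whole document

substring = html.xpath('//a[contains(@class, "li")]/text()')
token = html.xpath(
    '//a[contains(concat(" ", normalize-space(@class), " "), " li ")]/text()'
)

print(per_parent)  # ['first', 'third']
print(doc_first)   # ['first']
print(substring)   # ['first', 'second']
print(token)       # ['first']
```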

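XPath is a powerful tool that is widely used for processing and parsing XML documents. It is applied in many areas, including web development, data scraping, data extraction, and data transformation. In web development, XPath is often used together with XML, HTML, and XSLT (Extensible Stylesheet Language Transformations) to extract data from web pages or to transform data.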

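III. Project Examples

1. Example 1

Requirement:
Scrape second-hand housing listings from 58.com (58同城), using Beijing as the example. Parse out the name of every listing and save the results to persistent storage.
URL: https://bj.58.com/ershoufang/

Approach:

The main task is to inspect the page structure and find which tag holds each listing's name, then write the matching XPath expression.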

import requests
from lxml import etree
from fake_useragent import UserAgent

url = 'https://bj.58.com/ershoufang/'
headers = {
    "User-Agent": UserAgent().random
}

# Fetch the page source
page_text = requests.get(url=url, headers=headers).text

# Parse the page
tree = etree.HTML(page_text)
div_list = tree.xpath('//*[@id="esfMain"]/section/section[3]/section[1]/section[2]/div')
with open('58.txt', 'w', encoding='utf8') as f:
    for div in div_list:
        # Local (relative) parsing
        # The leading "." is required: it stands for the div located above
        title = div.xpath('./a/div[2]/div[1]/div[1]/h3')[0].text
        print(title)
        # Write to the file
        f.write(title + '\n')
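Why the leading "." matters can be seen offline. The sketch below assumes a made-up listing page that is far simpler than the real 58.com markup; first_or_none is a hypothetical helper, not part of lxml, added to avoid an IndexError when a listing has no title.

```python
from lxml import etree

# Hypothetical listing markup, much simpler than the real page
page = etree.HTML("""
<div id="list">
  <div class="item"><h3>House A</h3></div>
  <div class="item"><h3>House B</h3></div>
  <div class="item"><p>no title here</p></div>
</div>
""")

items = page.xpath('//div[@class="item"]')

# "./" is evaluated relative to each div
relative = [div.xpath('./h3/text()') for div in items]
# "//" restarts from the document root, even when called on a sub-element
absolute = [div.xpath('//h3/text()') for div in items]


def first_or_none(results):
    """Return the first XPath match, or None when the list is empty."""
    return results[0] if results else None


titles = [first_or_none(r) for r in relative]

print(relative)     # [['House A'], ['House B'], []]
print(absolute[0])  # ['House A', 'House B']
print(titles)       # ['House A', 'House B', None]
```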

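2. Example 2

Requirement:
Scrape the titles of all chapters of Dream of the Red Chamber (《紅樓夢》).
URL: https://www.shicimingju.com/book/hongloumeng.html

Approach:

The main work is writing the XPath expressions, which are derived by inspecting the tags on the page. The code below also follows each chapter link and collects the chapter text.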

import requests
from lxml import etree
from fake_useragent import UserAgent

url = 'https://www.shicimingju.com/book/hongloumeng.html'
headers = {
    "User-Agent": UserAgent().random
}

page_text = requests.get(url=url, headers=headers).text

# Build the etree object
tree = etree.HTML(page_text)
li_list = tree.xpath('//*[@id="main_left"]/div/div[4]/ul/li')

with open('紅樓夢.txt', 'w', encoding='utf8') as f:
    data_list = []
    for li in li_list:
        title = li.xpath('./a/text()')[0]  # text inside the a tag
        href = li.xpath('./a/@href')[0]  # href value of the a tag
        detail_url = 'https://www.shicimingju.com' + href
        detail_response = requests.get(url=detail_url, headers=headers)
        detail_response.encoding = 'utf8'
        new_tree = etree.HTML(detail_response.text)
        # Chapter text: join every paragraph rather than keeping only the last one
        p_list = new_tree.xpath('//*[@id="main_left"]/div[1]/div/p')
        words = '\n'.join(p.text or '' for p in p_list)
        print(words)
        data_list.append({"title": title, "words": words})
    f.write(str(data_list))
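The detail URL above is built by string concatenation, which only works when every href starts with "/". As an alternative sketch, urljoin from the standard library resolves root-relative, page-relative, and absolute links alike. The href values below are hypothetical and were not taken from the site.

```python
from urllib.parse import urljoin

# URL of the chapter list page
base = 'https://www.shicimingju.com/book/hongloumeng.html'

# Hypothetical href values in three common forms
hrefs = [
    '/book/hongloumeng/1.html',                             # root-relative
    'hongloumeng/2.html',                                   # relative to the page
    'https://www.shicimingju.com/book/hongloumeng/3.html',  # already absolute
]

detail_urls = [urljoin(base, href) for href in hrefs]

for u in detail_urls:
    print(u)
# https://www.shicimingju.com/book/hongloumeng/1.html
# https://www.shicimingju.com/book/hongloumeng/2.html
# https://www.shicimingju.com/book/hongloumeng/3.html
```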

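3. Example 3

Requirement:
Parse the page and download the images.
URL: https://pic.netbian.com/4kdongman/

Approach:
The main points are writing the XPath expression and handling garbled Chinese text.
The XPath expression can start from the <div class="slist"> tag, or from a tag higher up, such as <div id="main">.
Both forms select the same nodes.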

import requests
from lxml import etree
from fake_useragent import UserAgent
import os

if __name__ == '__main__':
    url = 'https://pic.netbian.com/4kdongman/'
    header = {
        'User-Agent': UserAgent().random
    }
    # Fetch the page and keep the response object
    response = requests.get(url=url, headers=header)
    # Manually set the response encoding
    # response.encoding = 'utf-8'
    page_text = response.text
    # Parsing: extract the src and alt attribute values
    # Build the etree object
    tree = etree.HTML(page_text)
    # XPath expressions
    # src_list = tree.xpath('//div[@id="main"]/div[3]/ul/li/a/img/@src')
    # alt_list = tree.xpath('//div[@id="main"]/div[3]/ul/li/a/img/@alt')
    # This form works as well
    src_list = tree.xpath('//div[@class="slist"]/ul/li')

    # Create the output folder
    if not os.path.exists('./tupian'):
        os.mkdir('./tupian')

    for li in src_list:
        img_src = 'http://pic.netbian.com' + li.xpath('./a/img/@src')[0]
        img_alt = li.xpath('./a/img/@alt')[0] + '.jpg'
        # General-purpose fix for garbled Chinese text
        img_alt = img_alt.encode('iso-8859-1').decode('gbk')
        print(img_alt + " : " + img_src)
        # Download the image as raw bytes
        img_data = requests.get(url=img_src, headers=header).content
        img_path = './tupian/' + img_alt
        # Save to disk
        with open(img_path, 'wb') as fp:
            fp.write(img_data)

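4. Example 4

Requirement:
Parse the names of all cities nationwide from the given page.
URL: https://www.aqistudy.cn/historydata/

Approach:

First build the etree object, then write XPath expressions based on the tag hierarchy of the hot-cities and all-cities sections. The expressions target the a tags; xpath() returns a list of the matching nodes, one per city, and we simply iterate over that list.

Method 1: parse the hot cities and all cities separately, then append the city names to one list.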

import requests
from lxml import etree
from fake_useragent import UserAgent

url = 'https://www.aqistudy.cn/historydata/'
headers = {
    'User-Agent': UserAgent().random
}

# Fetch the page source
page_text = requests.get(url=url, headers=headers).text

# Parsing
# Build the etree object
tree = etree.HTML(page_text)
all_city = []  # every city name


# Hot cities
#  hot_city_list = tree.xpath('//div[@class="bottom"]/ul/li')
hot_city_list = tree.xpath('/html/body/div[3]/div/div[1]/div[1]/div[2]/ul/li')
for li in hot_city_list:
    hot_city_name = li.xpath('./a/text()')[0]
    all_city.append(hot_city_name)

# All cities
all_city_list = tree.xpath('//div[@class="bottom"]/ul/div[2]/li')
for li in all_city_list:
    all_city_name = li.xpath('./a/text()')[0]
    all_city.append(all_city_name)

print(all_city, "total:", len(all_city), "cities")

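Method 2: join the two paths with the XPath union operator.

A single path cannot describe both tag hierarchies at once, but the two paths can be written in one expression, separated by "|". In XPath this is the node-set union operator, not a bitwise OR. xpath() then evaluates both paths and returns the a tags matched by either one.
# Join the two paths with the union operator "|"
a_city_list = tree.xpath('//div[@class="bottom"]/ul/li/a | //div[@class="bottom"]/ul/div[2]/li/a')
xpath() returns a single list that holds the a tags for both the hot cities and all cities.

Then iterate over the list.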


import requests
from lxml import etree

if __name__ == '__main__':
    url = 'https://www.aqistudy.cn/historydata/'
    header = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4209.2 Safari/537.36'
    }
    # Fetch the page source
    page_text = requests.get(url=url, headers=header).text

    # Parsing
    # Build the etree object
    tree = etree.HTML(page_text)
    all_city = []  # every city name
    # Locate the a tags for both the hot cities and all cities
    # Path to the hot-city a tags: //div[@class="bottom"]/ul/li/a
    # Path to the all-city a tags: //div[@class="bottom"]/ul/div[2]/li/a
    # Join the two paths with the union operator "|"
    a_city_list = tree.xpath('//div[@class="bottom"]/ul/li/a | //div[@class="bottom"]/ul/div[2]/li/a')
    for a in a_city_list:
        city_name = a.xpath('./text()')[0]
        all_city.append(city_name)
    print(all_city, " total:", len(all_city), "cities")
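One consequence of the union is worth checking offline: "|" removes duplicate nodes, not duplicate text. If a hot city is also listed under all cities, its name appears twice in the result. The sketch below assumes a simplified, made-up structure (two ul lists with hypothetical classes hot and all), not the real page markup, and removes repeated names with dict.fromkeys.

```python
from lxml import etree

# Hypothetical, simplified stand-in for the city page
snippet = """
<div class="bottom">
  <ul class="hot">
    <li><a>Beijing</a></li>
    <li><a>Shanghai</a></li>
  </ul>
  <ul class="all">
    <li><a>Anshan</a></li>
    <li><a>Beijing</a></li>
    <li><a>Shanghai</a></li>
  </ul>
</div>
"""
root = etree.fromstring(snippet.strip())

# Union of two paths: a tags from the hot list and from the full list
links = root.xpath('//ul[@class="hot"]/li/a | //ul[@class="all"]/li/a')
names = [a.text for a in links]

# Same name, different nodes: the union keeps both, so dedupe by text
unique_names = list(dict.fromkeys(names))

# Overlapping paths: each node is still returned only once
overlap = root.xpath('//ul/li/a | //ul[@class="hot"]/li/a')

print(len(names))     # 5
print(unique_names)   # ['Beijing', 'Shanghai', 'Anshan']
print(len(overlap))   # 5
```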

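5. Example 5

Requirement:

Get the cover image and name of each college-student résumé template and print them.
URL: http://sc.chinaz.com/jianli/daxuesheng.html

Approach:

First build the etree object, then write XPath expressions based on the tag hierarchy of the image and the name. xpath() returns a list of the container nodes that match; for each one we read the image address and the name from the a tag inside it, and we simply iterate over the list.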

import requests
from fake_useragent import UserAgent
from lxml import etree

if __name__ == '__main__':
    header = {
        'User-Agent': UserAgent().random
    }

    # Page 1
    # The URL of page 1 differs from that of the other pages
    url1 = 'http://sc.chinaz.com/jianli/daxuesheng.html'
    # Fetch the page and keep the response object
    response = requests.get(url=url1, headers=header)
    # Manually set the response encoding
    response.encoding = 'utf-8'
    page_text = response.text
    # Parsing
    # Build the etree object
    tree = etree.HTML(page_text)
    src_list = tree.xpath('//div[@id="main"]/div/div')
    for a in src_list:
        img_src = a.xpath('./a/img/@src')[0]
        img_alt = a.xpath('./a/img/@alt')[0]
        # General-purpose fix for garbled Chinese text
        # Raises: UnicodeDecodeError: 'gbk' codec can't decode byte 0xbf in position 32: incomplete multibyte sequence
        # img_alt = img_alt.encode('iso-8859-1').decode('gbk')
        print(img_alt, ' : ', img_src)

    # Remaining pages (2 to 13)
    url = 'http://sc.chinaz.com/jianli/daxuesheng'
    for num in range(2, 13 + 1):
        num = str(num)
        new_url = url + '_' + num + '.html'
        # Fetch the page and keep the response object
        response = requests.get(url=new_url, headers=header)
        # Manually set the response encoding
        response.encoding = 'utf-8'
        page_text = response.text
        # Parsing
        # Build the etree object
        tree = etree.HTML(page_text)
        src_list = tree.xpath('//div[@id="main"]/div/div')
        for a in src_list:
            img_src = a.xpath('./a/img/@src')[0]
            img_alt = a.xpath('./a/img/@alt')[0]
            # General-purpose fix for garbled Chinese text
            # Raises: UnicodeDecodeError: 'gbk' codec can't decode byte 0xbf in position 32: incomplete multibyte sequence
            # img_alt = img_alt.encode('iso-8859-1').decode('gbk')
            print(img_alt, ' : ', img_src)

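Walkthrough:

At first the template names (Chinese text) print as garbled characters, so the encoding has to be changed. The general-purpose fix raises an error here and cannot be used, so we set the response encoding manually instead.

response = requests.get(url=url, headers=header)
# Manually set the response encoding
response.encoding = 'utf-8'
page_text = response.text

How do we print the data from every page?
- By inspection, the templates span only 13 pages
and the URLs are very similar:

Page 1: http://sc.chinaz.com/jianli/daxuesheng.html
Page 2: http://sc.chinaz.com/jianli/daxuesheng_2.html
Page 3: http://sc.chinaz.com/jianli/daxuesheng_3.html
Page 4: http://sc.chinaz.com/jianli/daxuesheng_4.html
...
Page 13: http://sc.chinaz.com/jianli/daxuesheng_13.html

Apart from the first page, the other 12 URLs differ only in the trailing number. We can build them by string concatenation: a for loop supplies the number, the loop variable is converted to str, and the pieces are joined. The first page is handled separately, and pages 2 to 13 are printed in the loop.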
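The URL pattern described above can be captured in a small generator. This is only a sketch: page_urls is a hypothetical helper name, and the page count of 13 is the value observed in the text, which may change on the live site.

```python
def page_urls(base, last_page):
    """Yield list-page URLs: page 1 has no suffix, later pages end in _N."""
    yield f"{base}.html"
    for num in range(2, last_page + 1):
        yield f"{base}_{num}.html"


urls = list(page_urls('http://sc.chinaz.com/jianli/daxuesheng', 13))

print(len(urls))  # 13
print(urls[0])    # http://sc.chinaz.com/jianli/daxuesheng.html
print(urls[1])    # http://sc.chinaz.com/jianli/daxuesheng_2.html
print(urls[-1])   # http://sc.chinaz.com/jianli/daxuesheng_13.html
```

With the URLs in one sequence, the fetch-and-parse code only needs to be written once and run inside a single loop over urls.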

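IV. Fixing Garbled Text

Solution 1:

Manually set the response encoding.

response = requests.get(url=url, headers=header)
# Manually set the response encoding
response.encoding = 'utf-8'
page_text = response.text

Solution 2:

Find the data that is garbled and apply encode('iso-8859-1').decode('gbk') to it.

# General-purpose fix for garbled Chinese text
img_alt = img_alt.encode('iso-8859-1').decode('gbk')

Examples 3 and 5 apply these two methods.
Example 3 uses Solution 2, the general-purpose re-encoding fix.
Example 5 uses Solution 1, setting the response encoding manually.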
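Why Solution 2 works can be shown without any network access. The sketch below assumes the page bytes are GBK but were decoded as ISO-8859-1, which is what happens when the server declares no charset and the client falls back to that default. Since ISO-8859-1 maps every byte to exactly one character, encoding back to it recovers the original bytes, which can then be decoded with the correct codec. The sample string is illustrative.

```python
# Bytes as a GBK-encoded page would send them
original = '中文'
raw = original.encode('gbk')

# Wrong decode: every byte becomes one Latin-1 character (garbled text)
garbled = raw.decode('iso-8859-1')

# Solution 2: undo the wrong decode, then decode with the right codec
fixed = garbled.encode('iso-8859-1').decode('gbk')

# Solution 1 equivalent: decode the raw bytes with the right codec up front
direct = raw.decode('gbk')

print(garbled)  # ÖÐÎÄ
print(fixed)    # 中文
print(direct)   # 中文
```

The round trip only helps when the real encoding is the one named in decode(). On a UTF-8 page, decode('gbk') can fail, which is consistent with the UnicodeDecodeError noted in Example 5, where setting the encoding to utf-8 is the appropriate fix.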
