Getting Started with XPath

Posted by 小满三岁啦 on 2024-03-28

Original URL: https://www.cnblogs.com/ccsvip/p/18102864

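Installing the package

pip install lxml

Installing the Chrome extension

XPath Helper: search for it and install it yourself, or use the link in the original post.

Parsing workflow and usage

Instantiate an etree object and load the source of the page you want to parse into it.

Call that object's xpath method with different XPath expressions to locate tags and extract data.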

# Import lxml.etree
from lxml import etree

# Method 1:
# Use etree.parse() to parse a local HTML file
# If parsing raises an error, specify the encoding up front with etree.HTMLParser(encoding='utf-8')
parser = etree.HTMLParser(encoding='utf-8')

# Pass the path directly here; opening the file with open() first raised an error for me
selector = etree.parse('../bs4練習/豆瓣讀書 Top 250.html', parser=parser)
result = etree.tostring(selector)
print(result)

# Method 2:
# [Recommended] Use etree.HTML(html_string) to parse network or local content; it tolerates malformed markup better
with open('../bs4練習/豆瓣讀書 Top 250.html', encoding='utf-8') as file:
    data = file.read()
    html_tree = etree.HTML(data)
    print(html_tree) # <Element html at 0x1ee7986ba80>  this is the node object
  
# [Recommended] The same thing as a one-liner
tree = etree.HTML(open('../bs4練習/豆瓣讀書 Top 250.html', encoding='utf-8').read())

# html_tree.xpath() returns a list
items = html_tree.xpath('//a')  # replace '//a' with any XPath expression
print(items)
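The workflow above depends on a local copy of the page. As a self-contained sketch of the same two steps (build the tree, then call xpath), the snippet below parses an inline HTML string instead. The markup is hypothetical and only stands in for a real page.

```python
from lxml import etree

# Hypothetical sample markup, used only for illustration.
html = ('<html><body><div>'
        '<a href="/login">Sign in</a>'
        '<a href="/app">Get the app</a>'
        '</div></body></html>')

tree = etree.HTML(html)      # step 1: load the page source into an element tree
links = tree.xpath('//a')    # step 2: locate tags with an XPath expression

print(tree.tag)              # tag name of the root element
print(type(links).__name__)  # xpath() hands back a plain Python list
print(len(links))            # number of matched <a> elements
```

Because xpath() returns a list even for a single match, index it (for example links[0]) before calling element methods.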

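XPath syntax

XPath is a language for finding information in an XML document. It navigates through the elements and attributes of the document.

Path expressions

Expression	Description
/	Selects from the root node
//	Selects matching nodes anywhere in the document, regardless of their position (at the start of an expression the search begins at the root)
./	Continues the path from the current node
@	Selects attributes

Examples

The table below lists some path expressions and their results.

Path expression	Result
/html	Selects the root element html. Note: if a path starts with a forward slash (/), it always represents an absolute path to an element!
//li	Selects all li elements, regardless of their position in the document.
//ul//li	Selects all li elements that are descendants of a ul element, regardless of where they sit beneath the ul.
`node.xpath('./div')`	Selects all div children of the current node object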
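To make the table concrete, here is a minimal sketch that runs each kind of expression against a small inline document. The markup and variable names are assumptions made up for this illustration.

```python
from lxml import etree

# Hypothetical markup: one ul with two li items and one ol with a single li.
html = ('<html><body>'
        '<ul id="menu"><li>one</li><li><span>two</span></li></ul>'
        '<ol><li>three</li></ol>'
        '</body></html>')
tree = etree.HTML(html)

root = tree.xpath('/html')       # absolute path starting at the root
all_li = tree.xpath('//li')      # every li, wherever it sits
ul_li = tree.xpath('//ul//li')   # only li elements below a ul
ul = tree.xpath('//ul')[0]       # take one node to continue from
children = ul.xpath('./li')      # li children of that node
ul_id = ul.xpath('./@id')        # attribute of the current node

print(len(root), len(all_li), len(ul_li), len(children), ul_id)
```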

Step by step: paths starting with /

Getting the whole tag: tostring

etree.tostring(obj, encoding=...) serializes a node object, so the result is the whole tag, e.g. <a>..</a>.

from lxml import etree

# Build the tree object
tree = etree.HTML(open('../bs4練習/豆瓣讀書 Top 250.html', encoding='utf-8').read())

# The result is a list
result = tree.xpath('/html/body/div/div/div/a')

# Use etree.tostring to serialize a node object
# Arguments: etree.tostring(node_to_serialize, encoding=...)
r = etree.tostring(result[0], encoding='utf-8')  # this returns bytes

# Decode; decode() defaults to utf-8 when called without arguments
print(r.decode())

# Shorthand
r = etree.tostring(result[0], encoding='utf-8').decode()

# One item at a time
r1 = etree.tostring(result[0], encoding='utf-8').decode()
r2 = etree.tostring(result[1], encoding='utf-8').decode()
print(r1)
print(r2)

# In a loop
for line in result:
    print(etree.tostring(line, encoding='utf-8').decode())

I want them all: paths starting with //

from lxml import etree

# Build the tree object
tree = etree.HTML(open('../bs4練習/豆瓣讀書 Top 250.html', encoding='utf-8').read())

# Get every a on the page, wherever it is
result = tree.xpath('//a')

for line in result:
    print(etree.tostring(line, encoding='utf-8').decode())

# Get every img tag on the page, wherever it is
result = tree.xpath('//img')
for line in result:
    print(etree.tostring(line, encoding='utf-8').decode())

Getting a tag's text: /text()

# Get the text inside the a tags
result = tree.xpath('/html/body/div/div/div/a/text()')
print(result)  # ['登入/註冊', '下載豆瓣客戶端']

Use [ ] to pick a position; indexes start at 1, not 0

from lxml import etree

# Build the tree object
tree = etree.HTML(open('../bs4練習/豆瓣讀書 Top 250.html', encoding='utf-8').read())


# Each tr has two td cells; get the a tag under the first td
result = tree.xpath('//tbody/tr/td[1]/a')
for line in result:
    print(line)
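A small sketch of the 1-based index, using a hypothetical two-row table. It also shows that the index is applied inside every parent, so td[1] yields one cell per row, while td[0] never matches anything.

```python
from lxml import etree

# Hypothetical table with two rows of two cells each.
html = ('<table>'
        '<tr><td><a>r1c1</a></td><td><a>r1c2</a></td></tr>'
        '<tr><td><a>r2c1</a></td><td><a>r2c2</a></td></tr>'
        '</table>')
tree = etree.HTML(html)

first_cells = tree.xpath('//tr/td[1]/a/text()')   # first td of every tr
second_cells = tree.xpath('//tr/td[2]/a/text()')  # second td of every tr
zero_index = tree.xpath('//tr/td[0]/a/text()')    # positions start at 1, so this is empty

print(first_cells)
print(second_cells)
print(zero_index)
```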

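Using / and // in a follow-up path

Summary:

To keep matching beneath the current node, start the path with ./ (or .// for descendants at any depth). A path that starts with / or // ignores the current node and searches from the root of the document again.

from lxml import etree

# Build the tree object
tree = etree.HTML(open('../bs4練習/豆瓣讀書 Top 250.html', encoding='utf-8').read())

result = tree.xpath('/html/body/div/div/div/a')
print(result)  # [<Element a at 0x1b1106bf300>, <Element a at 0x1b1106be880>]

# Any <class 'lxml.etree._Element'> object can keep matching with node.xpath()

# ./ continues matching downward from the current node
r1 = result[0].xpath('./text()')  # the text of the first a matched by /html/body/div/div/div/a
print(r1)  # ['登入/註冊']

# The leading ./ can be omitted: a path that does not start with / is relative to the current node
r2 = result[0].xpath('text()')
print(r2)  # ['登入/註冊']

# A single leading / matches from the root of the document, so in this example /text() returns nothing
# /text() looks for text directly under the document root and has nothing to do with result[0], so no data comes back
r3 = result[0].xpath('/text()')
print(r3)  # []

# //text() is equivalent to tree.xpath('//text()'): it finds every text node in the document and has nothing to do with result[0], so everything comes back
r4 = result[0].xpath('//text()')
print(r4)  # all the text in the document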
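The same behaviour can be reproduced without the local file. The sketch below assumes a hypothetical page with two div blocks and compares relative and absolute paths evaluated on the second one.

```python
from lxml import etree

# Hypothetical markup with two sibling blocks.
html = ('<html><body>'
        '<div id="a"><p>first</p></div>'
        '<div id="b"><p>second</p></div>'
        '</body></html>')
tree = etree.HTML(html)
div_b = tree.xpath('//div[@id="b"]')[0]

relative = div_b.xpath('./p/text()')     # children of div_b only
descendant = div_b.xpath('.//p/text()')  # any depth below div_b
implicit = div_b.xpath('p/text()')       # no leading slash: also relative
from_root = div_b.xpath('//p/text()')    # restarts at the document root
absolute = div_b.xpath('/p/text()')      # p directly under the root: none

print(relative, descendant, implicit, from_root, absolute)
```

In loops over matched nodes, starting every inner expression with a dot avoids collecting data from the whole page by accident.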

Mixing / and //

Get an attribute value with `/@attribute_name`

from lxml import etree

# Build the tree object
tree = etree.HTML(open('../bs4練習/豆瓣讀書 Top 250.html', encoding='utf-8').read())

# Use @ to work with attributes. The XPath expression below finds every tr tag whose class attribute is item
# //tag[@attribute="value"]
result = tree.xpath('//tr[@class="item"]')
for line in result:
    print(etree.tostring(line, encoding='utf-8').decode())

# Get all text: every text node under the div nodes with class=pl2
result = tree.xpath('//div[@class="pl2"]//text()')
for line in result:
    print(line)

# Get every img tag inside the first td of each tr with class=item
result = tree.xpath("//tr[@class='item']/td[1]//img")
for line in result:
    print(etree.tostring(line, encoding='utf-8').decode())

# Get the link address (the src attribute) of every img tag inside the first td of each tr with class=item
# Attribute values are read with @attribute_name
result = tree.xpath('//tr[@class="item"]/td[1]//img/@src')
for line in result:
    print(line)

Limiting the number of results

When all candidates share one parent, the positional syntax below can be replaced by slicing the result list in Python. The predicates are evaluated per parent, though, so //tr/td[1] returns the first td of every tr, which a slice of the combined list does not.

Get the n-th item: [n]

Get the last item: [last()]

Get the n-th item from the end: [last()-(n-1)]

Get the first n items: [position()<=n] or [position()<n+1]

Get items a through b: [position()>=a and position()<=b]

from lxml import etree

# Build the tree object
tree = etree.HTML(open('../bs4練習/豆瓣讀書 Top 250.html', encoding='utf-8').read())

# Get the second span inside the second div of the second td of each tr with class=item
result = tree.xpath("//tr[@class='item']/td[2]/div[2]/span[2]")
for line in result:
    print(etree.tostring(line, encoding='utf-8').decode())

# Get the last one: [last()]
# Get the last span inside the second div of the second td of each tr with class=item
result = tree.xpath('//tr[@class="item"]/td[2]/div[2]/span[last()]')
for line in result:
    print(etree.tostring(line, encoding='utf-8').decode())

# Get the n-th from the end: [last()-(n-1)]; the second from the end is [last()-(2-1)] --> [last()-1]
# Get the second-to-last span inside the second div of the second td of each tr with class=item
result = tree.xpath('//tr[@class="item"]/td[2]/div[2]/span[last()-1]')
for line in result:
    print(etree.tostring(line, encoding='utf-8').decode())

# Get the first n: [position()<n+1]; the first two is [position()<2+1] --> [position()<3]
# The first two can also be written as [position()<=2]
# Get the first two span tags inside the second div of the second td of each tr with class=item
# result = tree.xpath('//tr[@class="item"]/td[2]/div[2]/span[position()<=2]')
result = tree.xpath('//tr[@class="item"]/td[2]/div[2]/span[position()<3]')
for line in result:
    print(etree.tostring(line, encoding='utf-8').decode())
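The positional predicates are easier to compare on a tiny list. The sketch below assumes a hypothetical five-item ul and a small helper, texts, that exists only in this example. The last line shows why the parentheses matter: without them, position is read as a child element name and nothing matches.

```python
from lxml import etree

# Hypothetical list with five items.
tree = etree.HTML('<ul><li>1</li><li>2</li><li>3</li><li>4</li><li>5</li></ul>')


def texts(predicate):
    """Hypothetical helper: text of the li items selected by one predicate."""
    return tree.xpath(f'//ul/li[{predicate}]/text()')


second = texts('2')                                # the n-th item
last = texts('last()')                             # the last item
second_last = texts('last()-1')                    # counting from the end
first_two = texts('position()<=2')                 # the first n items
first_two_alt = texts('position()<3')              # same result, other spelling
middle = texts('position()>=2 and position()<=4')  # items a through b
no_parens = texts('position<3')                    # missing (): matches nothing

print(second, last, second_last, first_two, middle, no_parens)
```

With a single parent like this, slicing gives the same answer as the predicate, for example tree.xpath('//ul/li/text()')[:2] equals first_two.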

https://www.shicimingju.com/book/hongloumeng.html

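Getting items 18 through 29

//div[@class="book-mulu"]/ul/li[position()>=18 and position()<=29]

Logical or: the | operator (the union of two node sets)

Get chapters 18 through 29, excluding chapter 20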

//div[@class="book-mulu"]/ul/li[position()>=18 and position()<20] | //div[@class="book-mulu"]/ul/li[position()>=21 and position()<=29]
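The expression can be tried offline. The sketch below builds a hypothetical 30-item chapter list with the same div class as the expression (the real page's markup is not reproduced here) and checks which items the union selects. The single-predicate variant is an alternative added for comparison, not something taken from the original post.

```python
from lxml import etree

# Hypothetical chapter list: 30 li items under div.book-mulu.
items = ''.join(f'<li>chapter {i}</li>' for i in range(1, 31))
tree = etree.HTML(f'<div class="book-mulu"><ul>{items}</ul></div>')

# Union of two ranges: 18-19 and 21-29.
union_expr = ('//div[@class="book-mulu"]/ul/li[position()>=18 and position()<20]'
              ' | '
              '//div[@class="book-mulu"]/ul/li[position()>=21 and position()<=29]')
chapters = [li.text for li in tree.xpath(union_expr)]

# Alternative: one range with the unwanted position excluded.
single_expr = ('//div[@class="book-mulu"]/ul/li'
               '[position()>=18 and position()<=29 and position()!=20]')
chapters_alt = [li.text for li in tree.xpath(single_expr)]

print(chapters)
print(chapters == chapters_alt)
```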

Logical and

from lxml import etree

# Build the tree object
tree = etree.HTML(open('../bs4練習/豆瓣讀書 Top 250.html', encoding='utf-8').read())


# Get the div whose id equals "top-nav-appintro" and whose class equals "more-items"
result = tree.xpath('//div[@id="top-nav-appintro" and @class="more-items"]')
print(etree.tostring(result[0], encoding='utf-8').decode())
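For comparison, a minimal sketch of and next to the or keyword inside one predicate. The three div elements are hypothetical; or is standard XPath but is not used elsewhere in this post.

```python
from lxml import etree

# Hypothetical markup: ids and classes chosen only to show the difference.
html = ('<div id="nav" class="more-items">both</div>'
        '<div id="nav2" class="more-items">class only</div>'
        '<div id="top">neither</div>')
tree = etree.HTML(html)

# and: both conditions must hold on the same element
both = tree.xpath('//div[@id="nav" and @class="more-items"]/text()')

# or: one matching condition is enough
either = tree.xpath('//div[@id="top" or @class="more-items"]/text()')

print(both)
print(either)
```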

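Attributes

If an attribute value is given, the expression selects tags whose attribute has that value; if no value is given, it selects tags that simply carry the attribute.

from lxml import etree

# Build the tree object
tree = etree.HTML(open('../bs4練習/豆瓣讀書 Top 250.html', encoding='utf-8').read())

# Get all img tags
r1 = tree.xpath('//img')

# Get the src attribute of every img
r2 = tree.xpath('//img/@src')

# Get every src attribute value (whether or not it is on an img tag)
r3 = tree.xpath('//@src')

# Get all img tags that have a src attribute
r4 = tree.xpath('//img[@src]')

# Get all div tags that have a class attribute
r5 = tree.xpath('//div[@class]')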
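The five expressions return different things (elements versus attribute values), which is easy to see on a small document. The markup below is a hypothetical stand-in.

```python
from lxml import etree

# Hypothetical markup: one img with src, one without, and a script with src.
html = ('<div class="box"><img src="a.png"><img alt="no source"></div>'
        '<div><script src="app.js"></script></div>')
tree = etree.HTML(html)

imgs = tree.xpath('//img')                  # elements: every img
img_srcs = tree.xpath('//img/@src')         # strings: src values on img only
all_srcs = tree.xpath('//@src')             # strings: src values on any tag
imgs_with_src = tree.xpath('//img[@src]')   # elements: img that carry src
divs_with_class = tree.xpath('//div[@class]')

print(len(imgs), img_srcs, all_srcs, len(imgs_with_src), len(divs_with_class))
```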

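Comparisons

Apart from = and !=, the other operators are only worth a passing look.

As in JavaScript, the operands are converted to a common type before they are compared. For <, <=, > and >=, XPath 1.0 converts both sides to numbers, so @price>"10" compares numerically. I won't go into more detail here.

from lxml import etree

# Build the tree object
tree = etree.HTML(open('../bs4練習/豆瓣讀書 Top 250.html', encoding='utf-8').read())

# Get every a that has a class attribute
result = tree.xpath('//a[@class]')

# Get every a whose class attribute is "cover"
result = tree.xpath('//a[@class="cover"]')

# Get a tags whose price attribute is greater than 10
result = tree.xpath('//a[@price>"10"]')

# Get a tags whose price attribute is greater than or equal to 10
result = tree.xpath('//a[@price>="10"]')

# Get a tags whose price attribute is less than 10
result = tree.xpath('//a[@price<"10"]')

# Get a tags whose price attribute is less than or equal to 10
result = tree.xpath('//a[@price<="10"]')

# Get a tags whose price attribute is not equal to 10
result = tree.xpath('//a[@price!="10"]')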
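The Douban page has no price attribute, so here is a sketch with hypothetical a tags that do. It shows that a quoted "10" is still compared as a number, and that != only matches tags that actually have the attribute, whereas not(...) also keeps tags without it.

```python
from lxml import etree

# Hypothetical markup: three priced links and one without a price.
html = ('<a price="5">cheap</a><a price="10">ten</a>'
        '<a price="25">pricey</a><a>unpriced</a>')
tree = etree.HTML(html)

greater = tree.xpath('//a[@price>10]/text()')
greater_quoted = tree.xpath('//a[@price>"10"]/text()')  # "10" is converted to a number
at_most = tree.xpath('//a[@price<=10]/text()')
not_equal = tree.xpath('//a[@price!="10"]/text()')      # needs a price attribute
negated = tree.xpath('//a[not(@price="10")]/text()')    # also keeps unpriced tags

print(greater, greater_quoted, at_most, not_equal, negated)
```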

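The wildcard *

The wildcard * mostly shows up when you copy an XPath from the browser; it is rarely needed when you write expressions by hand.

from lxml import etree

# Build the tree object
tree = etree.HTML(open('../bs4練習/豆瓣讀書 Top 250.html', encoding='utf-8').read())


# Get every tag that has a class attribute, whatever the tag is
result = tree.xpath('//*[@class]')

# Get every tag that has a price attribute
result = tree.xpath('//*[@price]')

# node() matches a node of any type (rarely useful)
result = tree.xpath("node()")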
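A short sketch of the difference between * and node(), on hypothetical markup: * matches elements only, while node() also returns text nodes.

```python
from lxml import etree

# Hypothetical markup: a div holding some text followed by three elements.
html = ('<div class="x">lead'
        '<p price="3">a</p><span class="y">b</span><em>c</em>'
        '</div>')
tree = etree.HTML(html)
div = tree.xpath('//div')[0]

with_class = [el.tag for el in tree.xpath('//*[@class]')]  # any tag with class
with_price = [el.tag for el in tree.xpath('//*[@price]')]  # any tag with price
child_tags = [el.tag for el in div.xpath('*')]             # element children only
child_nodes = div.xpath('node()')                          # text node plus elements

print(with_class, with_price, child_tags, len(child_nodes))
```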

Containment: contains

Syntax: *[contains(@attribute, 'substring to look for')]

from lxml import etree

# Build the tree object
tree = etree.HTML(open('../bs4練習/豆瓣讀書 Top 250.html', encoding='utf-8').read())

# Text under div tags whose class attribute value contains pl
result = tree.xpath('//div[contains(@class, "pl")]//text()')
print(result)
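contains is a plain substring test, so it can match more than intended. The sketch below uses hypothetical class names to show this, together with a common XPath 1.0 idiom (not from the original post) for matching a whole class token.

```python
from lxml import etree

# Hypothetical markup: "pl" appears as a token, as a prefix, and inside a word.
html = ('<div class="pl2">A</div><div class="item pl">B</div>'
        '<div class="apple">C</div><div class="other">D</div>')
tree = etree.HTML(html)

# Substring match: pl2, "item pl" and apple all contain "pl".
substring = tree.xpath('//div[contains(@class, "pl")]/text()')

# Whole-token match: pad the class list with spaces and look for " pl ".
token = tree.xpath(
    '//div[contains(concat(" ", normalize-space(@class), " "), " pl ")]/text()'
)

print(substring)
print(token)
```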

Starts with: starts-with

from lxml import etree

# Build the tree object
tree = etree.HTML(open('../bs4練習/豆瓣讀書 Top 250.html', encoding='utf-8').read())

# Text under div tags whose class attribute value starts with a
result = tree.xpath('//div[starts-with(@class, "a")]//text()')
print(result)
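A minimal side-by-side of starts-with and contains on hypothetical class names: starts-with tests the beginning of the whole attribute value, so a class listed second is not matched.

```python
from lxml import etree

# Hypothetical markup: "article" at the start of one value and at the end of another.
html = ('<div class="article-body">A</div>'
        '<div class="main article">B</div>'
        '<div>C</div>')
tree = etree.HTML(html)

prefix = tree.xpath('//div[starts-with(@class, "article")]/text()')
anywhere = tree.xpath('//div[contains(@class, "article")]/text()')

print(prefix)
print(anywhere)
```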

Practice case

import requests
from lxml import etree
import base64
from openpyxl import Workbook
from pathlib import Path
from fake_useragent import UserAgent

BASE_DIR = Path(__file__).parent


def encryption():
    # Despite the name, this decodes the Base64-encoded target URL
    url = b'aHR0cDovL3d3dy5xaWFubXUub3JnL3Jhbmtpbmc='
    return base64.b64decode(url).decode()


def get_request(url):
    try:
        response = requests.get(url=url, headers={'User-Agent': UserAgent().random})
        response.encoding = response.apparent_encoding
        response.raise_for_status()
    except requests.exceptions.RequestException as e:
        print(e)
    else:
        return response.text


wb = Workbook()
ws = wb.active
ws.title = '世界大學排行榜最新'
ws.append(['排名', '學校名稱', '學校英語名', '國家/地區'])


def fetch_content(content):
    tree = etree.HTML(content)
    result = tree.xpath('//*[@id="page-wrapper"]/div/div[2]/div/div/div/div[2]/div/div[5]/table/tbody')
    for line in result:
        rank = line.xpath(".//td[1]/text()")
        school_name = line.xpath(".//td[2]/text()")
        school_english_name = line.xpath(".//td[3]/text()")
        country = line.xpath(".//td[4]/text()")

        for rank_result, school_name_result, school_english_name_result, country_result in zip(rank, school_name,
                                                                                               school_english_name,
                                                                                               country):
            data = [rank_result, school_name_result, school_english_name_result, country_result]
            ws.append(data)


if __name__ == '__main__':
    url = encryption()
    content = get_request(url)
    if content:  # get_request returns None when the request fails
        fetch_content(content)
    wb.save(rf'{BASE_DIR}/世界大學排行榜.xlsx')
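The extraction step can be tried without network access or openpyxl. The sketch below is a variation on the script above, not the same method: it walks the table row by row and reads each cell with string(.), so an empty cell stays in its column. Collecting each column separately and zipping them, as the script does, can shift values when a cell has no text. The table and the names in it are hypothetical placeholders.

```python
from lxml import etree

# Hypothetical ranking table; the second row has an empty name cell.
html = ('<table><tbody>'
        '<tr><td>1</td><td>Alpha</td><td>Alpha University</td><td>Utopia</td></tr>'
        '<tr><td>2</td><td></td><td>Beta Institute</td><td>Erewhon</td></tr>'
        '</tbody></table>')
tree = etree.HTML(html)

# Row by row: one list per tr, one entry per td (empty cells become '').
rows = []
for tr in tree.xpath('//table/tbody/tr'):
    cells = [td.xpath('string(.)').strip() for td in tr.xpath('./td')]
    rows.append(cells)

# Column by column: the empty cell produces no text node, so this list is short.
ranks = tree.xpath('//table/tbody/tr/td[1]/text()')
names = tree.xpath('//table/tbody/tr/td[2]/text()')

for row in rows:
    print(row)
print(len(ranks), len(names))
```

Each entry in rows can be passed to ws.append(row) unchanged.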
