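Crawler Parsing Libraries: Getting Started with XPath

Posted by kuibatian on 2019-11-03

XPath, short for XML Path Language, is a language for locating information in XML documents. It was designed for searching XML, but it works equally well on HTML, so it is a natural choice for extracting data when writing a crawler.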

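1. XPath Overview

XPath is very expressive when it comes to selecting nodes, and its path expressions are concise and easy to read. It also provides more than 100 built-in functions for matching strings, numbers, and times and for processing nodes and sequences, so almost any node you want to locate can be selected with XPath.
Official documentation: https://www.w3.org/TR/xpath/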
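
As a small illustration of the built-in functions mentioned above, the sketch below calls a few of them through lxml. The HTML snippet is made up for this example. lxml evaluates XPath 1.0, so only the 1.0 function set is available there, and numeric results come back as floats.

```python
from lxml import etree

# Hypothetical sample page, invented for this illustration.
html = etree.HTML(
    '<html><body><ul>'
    '<li class="item">  first  </li>'
    '<li class="item">second</li>'
    '<li>third</li>'
    '</ul></body></html>'
)

# count() returns a number (lxml hands it back as a float).
item_count = html.xpath('count(//li)')

# normalize-space() trims and collapses whitespace in a string.
first_text = html.xpath('normalize-space(//li[1])')

# string-length() measures the resulting string.
first_len = html.xpath('string-length(normalize-space(//li[1]))')

print(item_count, first_text, first_len)
```

When an expression evaluates to a number or a string rather than a node set, `xpath()` returns that value directly instead of a list.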

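2. Common XPath Rules

Expression	Description
nodename	Selects all nodes with the given name
/	Selects direct children of the current node
//	Selects descendants of the current node at any depth
.	Selects the current node
..	Selects the parent of the current node
@	Selects an attribute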
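
The rules in the table can be tried out with a minimal sketch like the following. The document string and the variable names are assumptions made for this example, not part of the original script.

```python
from lxml import etree

# Hypothetical document, invented for this sketch.
root = etree.HTML(
    '<html><head><title>Demo</title></head>'
    '<body><div id="box"><p>one</p><p>two</p></div></body></html>'
)

# "/" steps through direct children, starting from the document root here.
titles = root.xpath('/html/head/title')

# "//" matches descendants at any depth.
paragraphs = root.xpath('//p')

# "." is the context node and ".." is its parent.
first_p = paragraphs[0]
same_node = first_p.xpath('.')[0]
parent = first_p.xpath('..')[0]

# "@" selects an attribute value.
box_id = root.xpath('//div/@id')

print(titles[0].text, len(paragraphs), same_node.text, parent.tag, box_id)
```

Calling `xpath()` on an element, as with `first_p` here, makes that element the context node for relative expressions such as `.` and `..`.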

These are the most commonly used XPath matching rules. For example:

//title[@lang='eng']

This XPath expression selects every node named title whose lang attribute has the value eng. The rest of this article uses Python's lxml library to parse HTML with XPath.
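
To see this expression in action without fetching a page, here is a minimal sketch that runs it against a small XML string. The bookstore data is hypothetical and exists only for this illustration.

```python
from lxml import etree

# Hypothetical XML, invented for this sketch.
xml = (
    '<bookstore>'
    '<book><title lang="eng">Harry Potter</title></book>'
    '<book><title lang="fr">Le Petit Prince</title></book>'
    '<book><title>Untitled</title></book>'
    '</bookstore>'
)

root = etree.fromstring(xml)

# Only title elements whose lang attribute equals "eng" are selected.
eng_titles = root.xpath("//title[@lang='eng']")
eng_texts = [t.text for t in eng_titles]

print(eng_texts)
```

Titles with a different lang value, or with no lang attribute at all, are left out of the result.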

For more details, see:
https://www.jianshu.com/p/85a3004b5c06

import requests
# lxml's etree module provides the XPath support used below

from lxml import etree

url = 'http://lol.178.com/'
response = requests.get(url)

with open('178.html', 'wb') as f:
    f.write(response.content)

# Goal: select /html/head/title

html_ele = etree.HTML(response.text)
print(html_ele)
# head_ele = html_ele.xpath('/html/head')
# print(head_ele[0])
# title_ele = head_ele[0].xpath('./title')
# print(title_ele[0])

# Once the tag is found, read the content inside it
# print(title_ele[0].text)
# Alternatively, fetch the inner string directly with XPath: `/html/head/title/text()`

# XPath can pick a specific tag by index; indexes start at 1
# meta_ele = html_ele.xpath('/html/head/meta[1]')
# print(meta_ele[0])

# Use @ to read an attribute of a tag
# charset_str = html_ele.xpath('/html/head/meta[1]/@charset')
# print(charset_str[0])

# /html/head/link[1]/@href
# href_str = html_ele.xpath('/html/head/link[1]/@href')
# print(href_str[0])

# Predicates filter nodes by a condition: /html/body/div[@class="wrap"]
div_ele = html_ele.xpath('/html/body/div[@class="wrap"]/@class')
print(div_ele)

# /html/body/div[@class="wrap"]/div[1]/div[@class="head"]
div_ele = html_ele.xpath('/html/body/div[@class="wrap"]/div[1]/div[@class="head"]/@class')
print(div_ele)

# /html/body/div[@class="wrap"]/div/div[1]/div[1]/div/a
div_ele = html_ele.xpath('/html/body/div[@class="wrap"]/div/div[1]/div[1]/div/a/text()')
print(div_ele)

# Search the whole document with //, for example //div[@class="head"]

div_ele = html_ele.xpath('//div[@class="Oldversion"]/a/@href')
print(div_ele)

# Select every meta tag that has an itemprop attribute: /html/head/meta[@itemprop]

div_ele = html_ele.xpath('/html/head/meta[@itemprop]/@itemprop')
print(div_ele)

# Select every meta tag that does not have an itemprop attribute, using the not() function
div_ele = html_ele.xpath('/html/head/meta[not(@itemprop)]')
print(div_ele)

# Collect all text under the ul found earlier: //ul[@class="ui-nav-main"]//text()
div_ele = html_ele.xpath('//ul[@class="ui-nav-main"]//text()')
print(div_ele)

# Drop strings that contain only whitespace
strs = [item for item in div_ele if item.strip()]
print(strs)

# Other useful functions:
# contains()
# starts-with()
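
The script ends by naming contains() and starts-with() without showing them. The sketch below is one way they could be used inside predicates. It runs on a made-up HTML string, so the class names and links are assumptions rather than anything taken from the page crawled above.

```python
from lxml import etree

# Hypothetical markup, invented for this sketch.
html = etree.HTML(
    '<html><body>'
    '<a class="nav-link active" href="/news/1.html">News</a>'
    '<a class="nav-link" href="/video/2.html">Video</a>'
    '<a class="footer" href="http://example.com/about">About</a>'
    '</body></html>'
)

# contains(): the attribute value includes the given substring.
nav_texts = html.xpath('//a[contains(@class, "nav-link")]/text()')

# starts-with(): the attribute value begins with the given prefix.
news_hrefs = html.xpath('//a[starts-with(@href, "/news")]/@href')

print(nav_texts, news_hrefs)
```

contains() is handy when a class attribute holds several space-separated names, since an exact comparison such as @class="nav-link" would miss the first link here. Note that the XPath function name is spelled with a hyphen, starts-with(), not an underscore.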

This work is licensed under a CC license; reposts must credit the author and link to this article.
