Python lxml: Extracting Data from Web Page HTML/XML

Posted by veelion on 2019-07-04

Original article: https://www.yuanrenxue.com/crawler/python-lxml-usage.html


The Python lxml library: extracting data from web page HTML/XML

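Python's lxml module is a convenient, high-performance parser for HTML and XML. With it, a crawler can parse a web page and easily extract the data it needs. lxml is built on the C libraries libxml2 and libxslt, so it is very fast.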

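The workflow for extracting web page data with lxml

Extracting data from a web page with lxml takes two steps:

Step 1: use lxml to parse the web page (or XML) into a DOM tree. There are three ways to do this: etree.fromstring(), etree.HTML(), and the lxml.html module. They are broadly similar but differ in some details, which are covered below.
Step 2: use XPath to walk the DOM tree, locate the nodes that hold the data you want, and extract it. This step requires some fluency with XPath. XPath has many rules, but don't worry, I'll summarize some common patterns.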
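
The two steps can be sketched roughly as follows. This is a minimal illustration rather than a recommended setup: the sample markup in `page` is made up for the example, and etree.HTML() is just one of the parsing options discussed below.

```python
from lxml import etree

# Sample markup, invented for this sketch.
page = '<html><body><h1>Title</h1><p>first</p><p>second</p></body></html>'

# Step 1: parse the string into a tree; etree.HTML() returns the root element.
root = etree.HTML(page)

# Step 2: run XPath queries against the tree to pull out the data.
title = root.xpath('//h1/text()')
paragraphs = root.xpath('//p/text()')

print(title)       # ['Title']
print(paragraphs)  # ['first', 'second']
```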

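Building the DOM tree

As mentioned above, there are three ways to parse a web page into a DOM tree. If you have trouble making choices, you may wonder which one to pick. No rush, let's look at them one by one. The examples below parse this piece of HTML: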

<div class="1">
    <p class="p_1 item">item_1</p>
    <p class="p_2 item">item_2</p>
</div>
<div class="2">
    <p id="p3"><a href="/go-p3">item_3</a></p>
</div>

Using the etree.fromstring() function

First, look at this function's docstring:

In [3]: etree.fromstring?
Signature:      etree.fromstring(text, parser=None, *, base_url=None)
Call signature: etree.fromstring(*args, **kwargs)
Type:           cython_function_or_method
String form:    <cyfunction fromstring at 0x7fe538822df0>
Docstring:
fromstring(text, parser=None, base_url=None)

Parses an XML document or fragment from a string.  Returns the
root node (or the result returned by a parser target).

To override the default parser with a different parser you can pass it to
the ``parser`` keyword argument.

The ``base_url`` keyword argument allows to set the original base URL of
the document to support relative Paths when looking up external entities
(DTD, XInclude, ...).

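This function parses the input into a DOM tree and returns the root node. Does it place any requirements on the input string text? First, the string must be well-formed: as the docstring says, the function uses the XML parser, so HTML is accepted only when it is also well-formed XML. Beyond that, look at the following example: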

In [19]: html = ''' 
...: <div class="1"> 
...:     <p class="p_1 item">item_1</p> 
...:     <p class="p_2 item">item_2</p> 
...: </div> 
...: <div class="2"> 
...:     <p id="p3"><a href="/go-p3">item_3</a></p> 
...: </div> 
...: '''

In [20]: etree.fromstring(html)
Traceback (most recent call last):

    File "/home/veelion/.virtualenvs/py3.6/lib/python3.6/site-packages/IPython/core/interactiveshell.py", line 3267, in run_code
exec(code_obj, self.user_global_ns, self.user_ns)

    File "<ipython-input-20-aea2e2c2317e>", line 1, in <module>
etree.fromstring(html)

    File "src/lxml/etree.pyx", line 3213, in lxml.etree.fromstring

    File "src/lxml/parser.pxi", line 1877, in lxml.etree._parseMemoryDocument

    File "src/lxml/parser.pxi", line 1758, in lxml.etree._parseDoc

    File "src/lxml/parser.pxi", line 1068, in lxml.etree._BaseParser._parseUnicodeDoc

    File "src/lxml/parser.pxi", line 601, in lxml.etree._ParserContext._handleParseResultDoc

    File "src/lxml/parser.pxi", line 711, in lxml.etree._handleParseResult

    File "src/lxml/parser.pxi", line 640, in lxml.etree._raiseParseError

    File "<string>", line 6
    XMLSyntaxError: Extra content at the end of the document, line 6, column 1

It raised an error! The reason is that our HTML consists of two sibling <div> tags and has no single root node. What if we wrap this HTML in one more outermost <div> tag?

In [22]: etree.fromstring('<div>' + html + '</div>')
Out[22]: <Element div at 0x7fe53aa978c8>

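That works. It returns the root node, an Element object whose tag is div.
To sum up, etree.fromstring() requires a single outermost node and raises an error otherwise. This function can also be used to build the DOM tree of an XML document.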
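
One way to cope with the single-root requirement is to retry with a wrapper element when parsing fails. The sketch below is only an illustration: `parse_fragment` is a hypothetical helper name, the sample markup is abbreviated, and retrying on every XMLSyntaxError assumes the only problem is a missing root (markup that is malformed in other ways will fail again on the retry).

```python
from lxml import etree

# Two sibling top-level elements, so there is no single root node.
fragment = '<div class="1"><p>item_1</p></div><div class="2"><p>item_3</p></div>'


def parse_fragment(text):
    """Parse with etree.fromstring(); wrap the text in a <div> if that fails.

    parse_fragment is a hypothetical helper used only in this sketch.
    """
    try:
        return etree.fromstring(text)
    except etree.XMLSyntaxError:
        # Assumption: the failure is due to several top-level elements,
        # so add one wrapper element and try again.
        return etree.fromstring('<div>' + text + '</div>')


root = parse_fragment(fragment)
print(root.tag)                                # div
print([child.get('class') for child in root])  # ['1', '2']
```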

Using the etree.HTML() function

This function is aimed more specifically at HTML. Here is its docstring:

In [23]: etree.HTML?
Signature:      etree.HTML(text, parser=None, *, base_url=None)
Call signature: etree.HTML(*args, **kwargs)
Type:           cython_function_or_method
String form:    <cyfunction HTML at 0x7fe538822c80>
Docstring:     
HTML(text, parser=None, base_url=None)

Parses an HTML document from a string constant.  Returns the root
node (or the result returned by a parser target).  This function
can be used to embed "HTML literals" in Python code.

To override the parser with a different ``HTMLParser`` you can pass it to
the ``parser`` keyword argument.

The ``base_url`` keyword argument allows to set the original base URL of
the document to support relative Paths when looking up external entities
(DTD, XInclude, ...).

The parameters are exactly the same as those of etree.fromstring(). Let's try it:

In [24]: etree.HTML(html)
Out[24]: <Element html at 0x7fe53ab03748>

HTML with two sibling nodes as input is no problem. But wait, the tag of the returned root Element is html? Let's use etree.tostring() to turn it back into HTML and take a look:

In [26]: print(etree.tostring(etree.HTML(html)).decode())
<html><body><div class="1">
    <p class="p_1 item">item_1</p>
    <p class="p_2 item">item_2</p>
</div>
<div class="2">
    <p id="p3"><a href="/go-p3">item_3</a></p>
</div>
</body></html>

In [27]: print(html)

<div class="1">
    <p class="p_1 item">item_1</p>
    <p class="p_2 item">item_2</p>
</div>
<div class="2">
    <p id="p3"><a href="/go-p3">item_3</a></p>
</div>

In other words, etree.HTML() completes an HTML fragment by adding the <html> and <body> tags around it.
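
Because of the added wrapper tags, the original top-level elements end up under <body>, not at the root. A small sketch, using an abbreviated version of the sample markup, shows where they land:

```python
from lxml import etree

html = '<div class="1"><p>item_1</p></div><div class="2"><p>item_3</p></div>'

root = etree.HTML(html)

# The root is the <html> element that the parser added.
print(root.tag)  # html

# The two original <div> elements are children of the added <body>.
body = root.find('body')
div_classes = [div.get('class') for div in body]
print(div_classes)  # ['1', '2']

# An absolute XPath therefore has to start from /html/body.
print(root.xpath('/html/body/div/@class'))  # ['1', '2']
```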

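Using the lxml.html module

lxml.html is a submodule of lxml. It wraps etree and is better suited to parsing HTML web pages. This submodule offers several functions for building a DOM tree: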

lxml.html.document_fromstring()
lxml.html.fragment_fromstring()
lxml.html.fragments_fromstring()
lxml.html.fromstring()

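You can look up their docstrings in IPython, so they are not listed here. For parsing web pages, the last one, fromstring(), is usually all we need. For our sample HTML, fromstring() also adds a parent div node above the two sibling top-level nodes.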
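
To make the differences concrete, here is a short sketch that feeds the same two-element fragment to all four functions. The markup is an abbreviated version of the sample, and the `create_parent='div'` argument is my choice for the illustration, since fragment_fromstring() otherwise expects a single element.

```python
import lxml.html as lh

html = '<div class="1"><p>item_1</p></div><div class="2"><p>item_3</p></div>'

# fromstring(): several top-level elements come back under one wrapper element.
wrapped = lh.fromstring(html)
print(wrapped.tag, len(wrapped))  # div 2

# fragments_fromstring(): a list with one entry per top-level element.
fragments = lh.fragments_fromstring(html)
print([el.get('class') for el in fragments])  # ['1', '2']

# document_fromstring(): always a full document rooted at <html>.
document = lh.document_fromstring(html)
print(document.tag)  # html

# fragment_fromstring(): needs create_parent when there is more than one element.
single = lh.fragment_fromstring(html, create_parent='div')
print(single.tag, len(single))  # div 2
```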

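With all three approaches covered, you have probably made your choice already: it has to be lxml.html.

Because it is a wrapper built for HTML, it also provides some methods of its own:

For example, suppose we want all the text inside a node, including the text of its child nodes. Elements produced by etree have no single method for this: each element's text attribute holds only the text directly inside that element, up to its first child, so you have to walk the descendants yourself (for example with itertext()). Elements from lxml.html have a text_content() method that conveniently returns all the text contained in a node.
As another example, many web pages write links as relative paths rather than full URLs: <a href="/index.html">. After extracting such a link we would have to join it into a full URL by hand. Here we can use the make_links_absolute() method provided by lxml.html. It is a method of the lxml.html Element objects; etree Element objects do not have it.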
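
A brief sketch of both methods follows. The markup is a variation of the sample with some extra text around the link, and the base URL https://example.com/ is a placeholder chosen for the illustration.

```python
import lxml.html as lh

html = '<div class="2"><p id="p3">see <a href="/go-p3">item_3</a> here</p></div>'
doc = lh.fromstring(html)

p = doc.xpath('//p')[0]

# .text holds only the text before the first child element.
print(repr(p.text))      # 'see '

# text_content() joins the text of the element and all its descendants.
full_text = p.text_content()
print(full_text)         # see item_3 here

# Rewrite relative links in place against a base URL (placeholder value).
doc.make_links_absolute('https://example.com/')
links = doc.xpath('//a/@href')
print(links)             # ['https://example.com/go-p3']
```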

Extracting data with XPath

We will again use the following HTML as the example to see how to locate nodes and extract data.

<div class="1">
    <p class="p_1 item">item_1</p>
    <p class="p_2 item">item_2</p>
</div>
<div class="2">
    <p id="p3"><a href="/go-p3">item_3</a></p>
</div>

First import the lxml.html module and build the DOM tree:

In [50]: import lxml.html as lh

In [51]: doc = lh.fromstring(html)

(1) Locating a node by a tag attribute
For example, to get the <div class="2"> node:

In [52]: doc.xpath('//div[@class="2"]')
Out[52]: [<Element div at 0x7fe53a492ea8>]

In [53]: print(lh.tostring(doc.xpath('//div[@class="2"]')[0]).decode())
<div class="2">
    <p id="p3"><a href="/go-p3">item_3</a></p>
</div>

(2) The contains() syntax
Two <p> tags in the HTML have a class that contains item. To extract these two <p> tags:

In [54]: doc.xpath('//p[contains(@class, "item")]')
Out[54]: [<Element p at 0x7fe53a6a3ea8>, <Element p at 0x7fe53a6a3048>]

## Get the text of the <p> tags:
In [55]: doc.xpath('//p[contains(@class, "item")]/text()')
Out[55]: ['item_1', 'item_2']
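
One thing to keep in mind is that contains() is a plain substring test, so it would also match a class such as "items". If that matters, a common XPath 1.0 idiom pads the class attribute with spaces and looks for the whole token. The sketch below uses made-up markup to show the difference; it is an optional refinement, not something the example above requires.

```python
import lxml.html as lh

# Made-up markup: the second <p> has class "items", which merely contains "item".
html = '<div><p class="p_1 item">a</p><p class="items">b</p></div>'
doc = lh.fromstring(html)

# Substring match: both paragraphs qualify.
loose = doc.xpath('//p[contains(@class, "item")]/text()')
print(loose)   # ['a', 'b']

# Token match: pad the class list with spaces and look for " item ".
strict = doc.xpath(
    '//p[contains(concat(" ", normalize-space(@class), " "), " item ")]/text()'
)
print(strict)  # ['a']
```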

(3) The starts-with() syntax
The extraction goal is the same as in (2). The class of both <p> tags starts with p_, so:

In [60]: doc.xpath('//p[starts-with(@class, "p_")]')
Out[60]: [<Element p at 0x7fe53a6a3ea8>, <Element p at 0x7fe53a6a3048>]

## Get the text of the <p> tags:
In [61]: doc.xpath('//p[starts-with(@class, "p_")]/text()')
Out[61]: ['item_1', 'item_2']

(4) Getting the value of an attribute
For example, to extract all the links in the web page:

In [63]: doc.xpath('//@href')
Out[63]: ['/go-p3']
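
The same idea works for any attribute, and when the link text is needed too, it can be easier to select the elements and read the attribute with get(). A small sketch based on the sample markup:

```python
import lxml.html as lh

html = '''
<div class="2">
    <p id="p3"><a href="/go-p3">item_3</a></p>
</div>
'''
doc = lh.fromstring(html)

# Attribute values via XPath: here the id of every <p>.
ids = doc.xpath('//p/@id')
print(ids)    # ['p3']

# Select the <a> elements that have an href, then pair each URL with its text.
links = [(a.get('href'), a.text_content()) for a in doc.xpath('//a[@href]')]
print(links)  # [('/go-p3', 'item_3')]
```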

If you know more clever uses of XPath, you are welcome to share them. Thanks.


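***Copyright notice: unless otherwise stated, all articles are original works of 猿人學 yuanrenxue.com. Do not reproduce them in any form without authorization from 猿人學.***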
