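Parsing HTML with Python 3

Posted by 微微微笑 on 2017-11-20

Original article: https://www.cnblogs.com/miniren/p/7272755.html

Reference: https://docs.python.org/3/library/html.parser.html

Python ships with a class called HTMLParser, in the standard-library module html.parser.

To use it, define your own class that inherits from HTMLParser, then override some of its methods.

Below are the handler methods most commonly used when parsing HTML. In HTMLParser itself these methods do nothing (apart from handle_startendtag(), which simply calls handle_starttag() and then handle_endtag()), so to use one you have to override it in your own class. How each method is used is covered below.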

    # Overridable -- finish processing of start+end tag: <tag.../>
    def handle_startendtag(self, tag, attrs):
        self.handle_starttag(tag, attrs)
        self.handle_endtag(tag)

    # Overridable -- handle start tag
    def handle_starttag(self, tag, attrs):
        pass

    # Overridable -- handle end tag
    def handle_endtag(self, tag):
        pass

    # Overridable -- handle character reference
    def handle_charref(self, name):
        pass

    # Overridable -- handle entity reference
    def handle_entityref(self, name):
        pass

    # Overridable -- handle data
    def handle_data(self, data):
        pass

    # Overridable -- handle comment
    def handle_comment(self, data):
        pass

    # Overridable -- handle declaration
    def handle_decl(self, decl):
        pass

    # Overridable -- handle processing instruction
    def handle_pi(self, data):
        pass

Usage

1. Simple parsing

from html.parser import HTMLParser

class MyHTMLParser(HTMLParser):
    def handle_starttag(self, tag, attrs):
        print("Encountered a start tag:", tag)

    def handle_endtag(self, tag):
        print("Encountered an end tag :", tag)

    def handle_data(self, data):
        print("Encountered some data  :", data)

parser = MyHTMLParser()
parser.feed('<html><head><title>Test</title></head><body><h1>Parse me!</h1></body></html>')

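Here we define a class, MyHTMLParser, that inherits from HTMLParser and overrides several handle_xxx methods.

After that we only need to call the class's feed() method and pass in the HTML data. Whenever the parser meets a particular construct, it calls the corresponding method automatically. For example, meeting <html> triggers handle_starttag().

The output is: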

Encountered a start tag: html
Encountered a start tag: head
Encountered a start tag: title
Encountered some data  : Test
Encountered an end tag : title
Encountered an end tag : head
Encountered a start tag: body
Encountered a start tag: h1
Encountered some data  : Parse me!
Encountered an end tag : h1
Encountered an end tag : body
Encountered an end tag : html
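
feed() does not need the whole document at once: it can be called repeatedly with successive chunks, followed by close() once the input ends. The sketch below is a minimal illustration of that. The class name HeadingCollector and the chunk boundaries are invented for this example. Because a run of text may reach handle_data() in more than one piece when it is split across chunks, the handler accumulates the pieces and joins them at the end.

```python
from html.parser import HTMLParser


class HeadingCollector(HTMLParser):
    """Hypothetical helper: gathers the text found inside <h1> elements."""

    def __init__(self):
        super().__init__()
        self.in_h1 = False
        self.pieces = []

    def handle_starttag(self, tag, attrs):
        if tag == "h1":
            self.in_h1 = True

    def handle_endtag(self, tag):
        if tag == "h1":
            self.in_h1 = False

    def handle_data(self, data):
        # Text may arrive in several calls, so collect instead of overwriting.
        if self.in_h1:
            self.pieces.append(data)


collector = HeadingCollector()
# The document is deliberately cut in the middle of the heading text.
for chunk in ("<html><body><h1>Par", "se me!</h1></body></html>"):
    collector.feed(chunk)
collector.close()  # tell the parser that no more input is coming

heading = "".join(collector.pieces)
print(heading)
```

Joining the collected pieces makes the result independent of where the chunk boundaries happen to fall.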

2. More complex parsing

from html.parser import HTMLParser
from html.entities import name2codepoint

class MyHTMLParser(HTMLParser):
    def handle_starttag(self, tag, attrs):
        print("Start tag:", tag)
        for attr in attrs:
            print("     attr:", attr)

    def handle_endtag(self, tag):
        print("End tag  :", tag)

    def handle_data(self, data):
        print("Data     :", data)

    def handle_comment(self, data):
        print("Comment  :", data)

    def handle_entityref(self, name):
        c = chr(name2codepoint[name])
        print("Named ent:", c)

    def handle_charref(self, name):
        if name.startswith('x'):
            c = chr(int(name[1:], 16))
        else:
            c = chr(int(name))
        print("Num ent  :", c)

    def handle_decl(self, data):
        print("Decl     :", data)

parser = MyHTMLParser()

1) Parsing a document type declaration

Pass in the following HTML:

parser.feed('<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">')

Running it gives the following; handle_decl() is called automatically.

Decl     : DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd"

2) Parsing attributes

Pass in the following HTML:

parser.feed('<img src="python-logo.png" alt="The Python logo">')

Running it gives the following; handle_starttag() is called automatically, and each attribute arrives as a (name, value) tuple.

Start tag: img
     attr: ('src', 'python-logo.png')
     attr: ('alt', 'The Python logo')
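
Since attrs is a list of (name, value) pairs, a common next step is to look up one attribute by name. The following is a minimal sketch of that idea, collecting the href of every <a> tag. LinkCollector and the sample markup are invented for illustration and are not part of the original article.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Hypothetical helper: records the href of every <a> start tag."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        # attrs is a list of (name, value) pairs; dict() makes lookup easier.
        href = dict(attrs).get("href")
        if href is not None:
            self.links.append(href)


link_parser = LinkCollector()
link_parser.feed(
    '<p><a href="https://www.python.org">Python</a> '
    '<a name="top">no href here</a> '
    '<a href="/about">About</a></p>'
)
link_parser.close()

links = link_parser.links
print(links)
```

Converting attrs to a dict is convenient here, but it keeps only one value if the same attribute name appears twice on a tag; iterate over the list directly when that matters.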

3) Parsing data and end tags

Pass in the following HTML:

parser.feed('<style type="text/css">#python { color: green }</style>')

Running it gives the following; after the start tag, handle_data() and handle_endtag() are called automatically.

Start tag: style
     attr: ('type', 'text/css')
Data     : #python { color: green }
End tag  : style
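
As the output shows, the content of a <style> element is delivered through handle_data() just like ordinary text. When the goal is to extract only the visible text of a page, that content usually has to be skipped. Here is one possible sketch, assuming that ignoring everything inside <script> and <style> is enough. TextExtractor and the sample input are hypothetical.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Hypothetical helper: collects text, ignoring <script> and <style>."""

    SKIPPED = ("script", "style")

    def __init__(self):
        super().__init__()
        self.skip_depth = 0  # greater than 0 while inside a skipped element
        self.pieces = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are not inside <script> or <style>.
        if self.skip_depth == 0:
            self.pieces.append(data)


extractor = TextExtractor()
extractor.feed(
    '<style type="text/css">#python { color: green }</style>'
    '<p>Hello <b>world</b></p>'
)
extractor.close()

text = "".join(extractor.pieces)
print(text)
```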

4) Parsing comments

Pass in the following HTML:

parser.feed('<!-- a comment --><!--[if IE 9]>IE-specific content<![endif]-->')

Running it gives the following; handle_comment() is called automatically.

Comment  :  a comment 
Comment  : [if IE 9]>IE-specific content<![endif]

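5) Parsing character references

Pass in the following HTML. Since Python 3.5, HTMLParser is created with convert_charrefs=True by default. In that mode character references are converted and delivered through handle_data(), and handle_entityref() and handle_charref() are never called. To see those two handlers fire, create the parser with convert_charrefs=False:

parser = MyHTMLParser(convert_charrefs=False)
parser.feed('&gt;&#62;&#x3E;')

In HTML, the character '>' has the entity name &gt; and the entity number &#62;. Here &#x3E; is the same number written in hexadecimal: 0x3E equals 62.

Running it gives the following; handle_entityref() is called for &gt;, then handle_charref() is called for &#62; and &#x3E;.

Named ent: >
Num ent  : >
Num ent  : >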
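
Which handler receives a character reference depends on the convert_charrefs constructor argument, which has defaulted to True since Python 3.5. The sketch below compares the two modes side by side. RefRecorder and record() are names invented for this example.

```python
from html.parser import HTMLParser


class RefRecorder(HTMLParser):
    """Hypothetical helper: records which handler saw each piece of input."""

    def __init__(self, **kwargs):
        super().__init__(**kwargs)
        self.events = []

    def handle_data(self, data):
        self.events.append(("data", data))

    def handle_entityref(self, name):
        self.events.append(("entityref", name))

    def handle_charref(self, name):
        self.events.append(("charref", name))


def record(html_text, **kwargs):
    """Parse html_text with a fresh RefRecorder and return its events."""
    recorder = RefRecorder(**kwargs)
    recorder.feed(html_text)
    recorder.close()
    return recorder.events


default_events = record("&gt;&#62;&#x3E;")
raw_events = record("&gt;&#62;&#x3E;", convert_charrefs=False)

# Join the data pieces rather than assuming they arrive in a single call.
default_text = "".join(value for kind, value in default_events if kind == "data")
print(default_text)
print(raw_events)
```

With the default setting the three references reach handle_data() already converted to '>' characters. With convert_charrefs=False the handlers receive the raw names 'gt', '62', and 'x3E', and converting them is left to the subclass.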
