htmlparsing: a clean, simple HTML parsing library

Posted by 小眾程式碼 on 2018-02-26

HTML Parsing

A clean HTML parsing library, meant to replace the more complex beautifulsoup4, pyquery, and lxml.

github: github.com/gaojiuli/ht…

Installation

pip install htmlparsing

# or

pip install git+https://github.com/gaojiuli/htmlparsing

Usage

import requests

from htmlparsing import Element

url = 'https://python.org'
r = requests.get(url)

Initialization

e = Element(text=r.text, base_url=url)

Getting the links on a page

print(e.links)
"""
{...'/users/membership/', '/events/python-events', '//docs.python.org/3/tutorial/controlflow.html#defining-functions'}
"""


print(e.absolute_links)
"""
{...'https://python.org/download/alternatives',  'https://python.org/about/success/#software-development', 'https://python.org/download/other/', 'https://python.org/community/irc/'}
"""
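The difference between links and absolute_links comes down to resolving each href against base_url. The sketch below illustrates that resolution using only the standard library's urllib.parse.urljoin. It is an illustration of the idea, not necessarily how htmlparsing computes absolute_links internally, and the helper name resolve_links is hypothetical.

```python
from urllib.parse import urljoin


def resolve_links(links, base_url):
    """Resolve each (possibly relative) href against base_url.

    Hypothetical helper for illustration; not part of htmlparsing.
    """
    return {urljoin(base_url, link) for link in links}


base_url = 'https://python.org'
relative = {
    '/users/membership/',  # path relative to the site root
    '//docs.python.org/3/tutorial/controlflow.html#defining-functions',  # scheme-relative
}

for link in sorted(resolve_links(relative, base_url)):
    print(link)
```

Root-relative paths keep the host from base_url, while scheme-relative links (those starting with //) keep their own host and only inherit the scheme.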

Selectors and attribute access

print(e.xpath('//a')[0].attrs)
"""{'href': '#content', 'title': 'Skip to content'}"""

print(e.xpath('//a')[0].attrs.title)
"""Skip to content"""

print(e.css('a')[0].attrs)
"""{'href': '#content', 'title': 'Skip to content'}"""

print(e.parse('<a href="#content" title="Skip to content">{}</a>'))
"""<Result ('Skip to content',) {}>"""
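The parse selector takes a template in which {} marks the part of the markup to capture. To make that idea concrete, here is a rough standard-library sketch that turns such a template into a regular expression. How htmlparsing actually implements parse is not stated in this post, so the function template_search and its behavior are assumptions for illustration only.

```python
import re


def template_search(template, text):
    """Return a tuple of the values captured by each `{}` in template,
    or None if the template does not occur in text.

    Hypothetical helper for illustration; not part of htmlparsing.
    """
    # Escape the literal pieces and join them with non-greedy capture groups.
    literal_parts = [re.escape(part) for part in template.split('{}')]
    pattern = '(.+?)'.join(literal_parts)
    match = re.search(pattern, text, flags=re.DOTALL)
    return match.groups() if match else None


html = '<p><a href="#content" title="Skip to content">Skip to content</a></p>'
template = '<a href="#content" title="Skip to content">{}</a>'
print(template_search(template, html))
```

A template like this is convenient when the surrounding markup is stable and only one value varies; for anything structural, xpath or css is usually the better fit.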

Getting the text content and the full HTML

print(e.xpath('//a')[5].text)
"""PyPI"""

print(e.xpath('//a')[5].html)
"""<a href="https://pypi.python.org/" title="Python Package Index">PyPI</a>"""

print(e.xpath('//a')[5].markdown)
"""[PyPI](https://pypi.python.org/ "Python Package Index")"""

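The markdown output above follows the usual Markdown link form, [text](href "title"). As a minimal sketch of that mapping, the snippet below converts a single well-formed anchor tag using the standard library's xml.etree.ElementTree. The function anchor_to_markdown is hypothetical and is not how htmlparsing necessarily produces its markdown property.

```python
import xml.etree.ElementTree as ET


def anchor_to_markdown(anchor_html):
    """Convert one well-formed <a> element into a Markdown link.

    Hypothetical helper for illustration; not part of htmlparsing.
    """
    element = ET.fromstring(anchor_html)
    text = (element.text or '').strip()
    href = element.get('href', '')
    title = element.get('title')
    if title:
        # Markdown puts the optional title in quotes after the URL.
        return '[{}]({} "{}")'.format(text, href, title)
    return '[{}]({})'.format(text, href)


anchor = '<a href="https://pypi.python.org/" title="Python Package Index">PyPI</a>'
print(anchor_to_markdown(anchor))
```

ElementTree only accepts well-formed XML, so this works for a tidy snippet like the one above but would fail on typical real-world HTML, which is one reason to reach for a dedicated HTML parser.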

Currently supported selectors: xpath, css, parse

