Using PyQuery, a Power Tool for Python Web Scraping

Published 2016-05-27

Original URL: http://python.jobbole.com/85222/

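Preface

Do you find XPath syntax a bit obscure and hard to remember?

Do you find BeautifulSoup's syntax somewhat clumsy and hard to follow?

Are you still grinding away at regular expressions, only to be driven mad by a single missing dot?

Do you already have some front-end background and know CSS selectors, yet keep mixing them up with other, odder selector syntaxes?

Well then, good news for front-end folks: PyQuery is here. The name will immediately remind you of jQuery, and if you are familiar with jQuery, PyQuery is the natural choice for parsing documents. It certainly is for me!

PyQuery is a faithful Python implementation modeled on jQuery. Its syntax is almost identical to jQuery's, so there is no need to memorize yet another set of odd methods.

Can something this good really exist? I can't wait!

Installation

With a tool this good, let's install it right away: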

pip install pyquery

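References

This article is based on the official documentation. For more detail, go to the official documentation itself; that is where you get it straight from the source.

Current version: 1.2.4 (2016/3/24)

Official documentation

Introduction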

pyquery allows you to make jquery queries on xml documents. The API is
as much as possible the similar to jquery. pyquery uses lxml for fast
xml and html manipulation. This is not (or at least not yet) a library
to produce or interact with javascript code. I just liked the jquery
API and I missed it in python so I told myself “Hey let’s make jquery
in python”. This is the result. It can be used for many purposes, one
idea that I might try in the future is to use it for templating with
pure http templates that you modify using pyquery. I can also be used
for web scrapping or for theming applications with Deliverance.

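pyquery lets you run jQuery-style queries on XML documents, and its API is as close to jQuery's as possible. pyquery uses lxml for fast XML and HTML manipulation.

This library is not (or at least not yet) meant for producing or interacting with JavaScript code; it simply mirrors the jQuery API very closely.

Initialization

There are four ways to initialize a PyQuery object.

(1) From a string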

from pyquery import PyQuery as pq
doc = pq("<html></html>")

The HTML source can be passed straight to pq as an argument. doc now plays the role of the $ symbol in jQuery.

(2) lxml.etree

from pyquery import PyQuery as pq
from lxml import etree
doc = pq(etree.fromstring("<html></html>"))

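You can also parse the markup with lxml's etree first and hand the resulting element to pq. lxml's HTML parser (for example etree.HTML) repairs incomplete or sloppy HTML into a complete, well-structured tree. Note that etree.fromstring, as used above, is a strict XML parser and raises an error on malformed input instead of repairing it.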

(3) From a URL

from pyquery import PyQuery as pq
doc = pq('http://www.baidu.com')

This behaves as if the page had been requested directly, much like fetching the link with urllib2 and getting back its HTML.

(4) From a file

from pyquery import PyQuery as pq
doc = pq(filename='hello.html')

You can pass the path of a file directly via the filename argument.

Quick Tour

Let's use a local file as the example: pass in a file named hello.html with the following content:

<div>
    <ul>
         <li class="item-0">first item</li>
         <li class="item-1"><a href="link2.html">second item</a></li>
         <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
         <li class="item-1 active"><a href="link4.html">fourth item</a></li>
         <li class="item-0"><a href="link5.html">fifth item</a></li>
     </ul>
 </div>

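Write the following program:

from pyquery import PyQuery as pq
doc = pq(filename='hello.html')
print(doc.html())   # inner HTML of the root element
print(type(doc))
li = doc('li')      # select every <li> with a CSS selector
print(type(li))
print(li.text())    # text of all matched elements, joined by spaces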

Output

<ul>
         <li class="item-0">first item</li>
         <li class="item-1"><a href="link2.html">second item</a></li>
         <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
         <li class="item-1 active"><a href="link4.html">fourth item</a></li>
         <li class="item-0"><a href="link5.html">fifth item</a></li>
     </ul>
 
<class 'pyquery.pyquery.PyQuery'>
<class 'pyquery.pyquery.PyQuery'>
first item second item third item fourth item fifth item

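Think back to jQuery's syntax: the results are just what you would expect there, aren't they?

One thing to notice: after initialization the object is of type PyQuery, and after filtering it once with a selector the result is still of type PyQuery. That is exactly how jQuery behaves, and it could hardly be better. Now think about what BeautifulSoup and XPath give back: a list, an object that cannot itself be filtered a second time (by which I mean with BeautifulSoup or XPath syntax again).

Compared with that, I simply love PyQuery!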

Attribute Operations

You can drive PyQuery with exactly the same syntax as jQuery.

from pyquery import PyQuery as pq

p = pq('<p id="hello" class="hello"></p>')('p')
print(p.attr("id"))            # read an attribute
print(p.attr("id", "plop"))    # set it; the modified element is returned
print(p.attr("id", "hello"))   # set it back

Output

hello
<p id="plop" class="hello"/>
<p id="hello" class="hello"/>

One more:

from pyquery import PyQuery as pq

p = pq('<p id="hello" class="hello"></p>')('p')
print(p.addClass('beauty'))
print(p.removeClass('hello'))
print(p.css('font-size', '16px'))
print(p.css({'background-color': 'yellow'}))

Output

<p id="hello" class="hello beauty"/>
<p id="hello" class="beauty"/>
<p id="hello" class="beauty" style="font-size: 16px"/>
<p id="hello" class="beauty" style="font-size: 16px; background-color: yellow"/>

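As elegant and assured as ever!

Notice that this is a sequence of operations, and each one modifies p on top of the previous result.

So after the operations above have run, p itself has changed.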

DOM Operations

Again, the same authentic jQuery syntax:

from pyquery import PyQuery as pq

p = pq('<p id="hello" class="hello"></p>')('p')
print(p.append(' check out <a href="http://reddit.com/r/python"><span>reddit</span></a>'))
print(p.prepend('Oh yes!'))
d = pq('<div class="wrap"><div id="test"><a href="http://cuiqingcai.com">Germy</a></div></div>')
p.prependTo(d('#test'))   # move p to the front of #test
print(p)
print(d)
d.empty()                 # remove everything inside d
print(d)

Output

<p id="hello" class="hello"> check out <a href="http://reddit.com/r/python"><span>reddit</span></a></p>
<p id="hello" class="hello">Oh yes! check out <a href="http://reddit.com/r/python"><span>reddit</span></a></p>
<p id="hello" class="hello">Oh yes! check out <a href="http://reddit.com/r/python"><span>reddit</span></a></p>
<div class="wrap"><div id="test"><p id="hello" class="hello">Oh yes! check out <a href="http://reddit.com/r/python"><span>reddit</span></a></p><a href="http://cuiqingcai.com">Germy</a></div></div>
<div class="wrap"/>

This hardly needs further explanation.

DOM operations, too, work just as they do in jQuery.

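Traversal

To traverse a selection, use the items method, which yields each matched element as a PyQuery object, or use each with a lambda: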

from pyquery import PyQuery as pq
doc = pq(filename='hello.html')
lis = doc('li')
for li in lis.items():
    print(li.html())

print(lis.each(lambda e: e))

Output

first item
<a href="link2.html">second item</a>
<a href="link3.html"><span class="bold">third item</span></a>
<a href="link4.html">fourth item</a>
<a href="link5.html">fifth item</a>
<li class="item-0">first item</li>
 <li class="item-1"><a href="link2.html">second item</a></li>
 <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
 <li class="item-1 active"><a href="link4.html">fourth item</a></li>
 <li class="item-0"><a href="link5.html">fifth item</a></li>

In practice, though, the items method is the one used most often.

Web Requests

PyQuery can also make web requests itself, converting the fetched page source into a PyQuery object.

from pyquery import PyQuery as pq
print(pq('http://cuiqingcai.com/', headers={'user-agent': 'pyquery'}))
print(pq('http://httpbin.org/post', {'foo': 'bar'}, method='post', verify=True))

Give it a try: GET and POST both work.

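Ajax

PyQuery also supports Ajax operations, with get and post methods. They are rarely used, though: normally we do not use PyQuery to make network requests, only to parse.

PyQueryAjax

API

And last but not least, the API reference.

API

The complete API, straight from the source, is all in there. If you are not familiar with jQuery syntax, I strongly recommend learning jQuery first and then coming back to PyQuery; it will feel remarkably familiar.

Conclusion

Having used PyQuery, I have fallen deeply in love with it!

How about you?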
