[Python Crawler] Dynamically Fetching CSDN Download Resource Information and Comments with Selenium + PhantomJS

Posted by Eastmount on 2015-08-24

Original URL: https://blog.csdn.net/eastmount/article/details/47907341

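The previous articles covered the basics and installation of Selenium and PhantomJS; this one is an application. It uses Selenium to drive PhantomJS and fetch information about CSDN download resources. The most important part is fetching each resource's comments: they are loaded dynamically by JavaScript, so PhantomJS is used to simulate a browser that loads the page and renders them.
I hope this introductory article helps; if it contains errors or omissions, please bear with me~
  [Python Crawler] Installing PhantomJS and CasperJS on Windows, with a Getting-Started Introduction (Part 1)
  [Python Crawler] Installing PIP + PhantomJS + Selenium on Windows
  [Python Crawler] Driving Firefox and Chrome with Selenium to Search and Take Screenshots
  [Python Crawler] Automatic Login to 163 Mail with Selenium and an Introduction to Locating Elements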

Source code

# coding=utf-8  
  
from selenium import webdriver  
from selenium.webdriver.common.keys import Keys  
import selenium.webdriver.support.ui as ui  
from selenium.webdriver.common.action_chains import ActionChains  
import time      
import re      
import os  
  
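# Start PhantomJS (a headless browser), set the wait timeout, and open the URL
# Raw strings keep the backslashes in the Windows path from being read as escapes
driver = webdriver.PhantomJS(executable_path=r"G:\phantomjs-1.9.1-windows\phantomjs.exe")
driver_detail = webdriver.PhantomJS(executable_path=r"G:\phantomjs-1.9.1-windows\phantomjs.exe")
wait = ui.WebDriverWait(driver,10)
driver.get("http://download.csdn.net/user/eastmount/uploads")
SUMRESOURCES = 0 # Global variable counting the total number of resources (avoid globals where possible)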
  
  
# Get the number of list pages, e.g. <div class="page_nav">共46個 共8頁..</div> (46 items, 8 pages)
def getPage():
    wait.until(lambda driver: driver.find_element_by_xpath("//div[@class='page_nav']"))
    texts = driver.find_element_by_xpath("//div[@class='page_nav']").text
    print texts
    m = re.findall(r'(\w*[0-9]+)\w*',texts) # Regular expression that finds the numbers
    print 'Pages: ' + str(m[1])
    return int(m[1])
  
  
# Get the URL and title of every resource on one list page
def getURL_Title(num):
    global SUMRESOURCES
    url = 'http://download.csdn.net/user/eastmount/uploads/' + str(num)
    print 'Download list URL: ' + url
    '''
    Wait until the pagination bar at the bottom of the page has loaded, then get the URLs and titles.
    Page source:
    <div class='list-container mb-bg'>
        <dl>
           <dt>
              <div class="icon"><img src="xxx"></img></div>
              <h3><a href="/detail/eastmount/8757243">MFC顯示BMP圖片</a></h3>
           </dt>
        </dl>
    </div>
    get_attribute('href') returns the URL already completed to an absolute address.
    Use unicode strings to avoid encoding errors: s.encode('utf8') converts unicode
    to UTF-8, and decode converts UTF-8 back to unicode.
    '''
    driver.get(url)
    wait.until(lambda driver: driver.find_element_by_xpath("//div[@class='page_nav']"))
    list_container = driver.find_elements_by_xpath("//div[@class='list-container mb-bg']/dl/dt/h3/a")
    for title in list_container:
        print 'Num' + str(SUMRESOURCES +1)
        print u'Title: ' + title.text
        print u'Link: ' + title.get_attribute('href')
        SUMRESOURCES = SUMRESOURCES +1
        # Get the details and comments of this resource
        getDetails( str(title.get_attribute('href')) )
    else:
        print ' ' # Blank line
          
  
# Get the resource details. The driver defined earlier is still in use for the list page,
# so a second instance, driver_detail, is used here; otherwise this error is raised:
# Message: Error Message => 'Element does not exist in cache'
def getDetails(url):
    # Get the info box
    driver_detail.get(url)
    details = driver_detail.find_element_by_xpath("//div[@class='info']").text
    print details
    # Load the comments <dl><dt></dt><dd></dd></dl>
    comments = driver_detail.find_elements_by_xpath("//dl[@class='recom_list']/dd")
    for com in comments:
        print u'Comment: ' + com.text
    else:
        print ' ' # Blank line
       
  
# Main function
def main():
    start = time.clock()
    pageNum = getPage()
    i=1
    # Loop over the pages to get titles and URLs
    while(i<=pageNum):
        getURL_Title(i)
        i = i + 1
    else:
        print 'SumResources: ' + str(SUMRESOURCES)
        print 'Load Over'
    end = time.clock()
    print "Time: %f s" % (end - start)

main()

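Implementation steps

1. First get the total number of list pages, using the getPage() function;
2. Each list page holds a list of resources. The driver's find_elements_by_xpath() locates the title links, and get_attribute('href') returns each URL, automatically completed to an absolute address;
3. Using the resource URLs from step 2, open each resource page and get its info box and comments;
4. Because the page is loaded in the headless PhantomJS browser, the JavaScript has already run, so it is enough to read the div with class=info and the dl with class=recom_list.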
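
The four steps above can be sketched as a single loop. This is only an outline under stated assumptions, not the article's code: crawl, list_page and fetch_details are hypothetical names, the two callables stand in for the Selenium calls, and the stub data is made up. It also returns the collected resources instead of updating a global counter such as SUMRESOURCES.

```python
def crawl(page_count, list_page, fetch_details):
    """Visit every list page, then every resource found on it.

    Assumptions (hypothetical helpers, not part of the original script):
    list_page(n) returns a list of (title, url) string pairs for list page n;
    fetch_details(url) returns a tuple (info_text, comments) for one resource.
    """
    resources = []
    for n in range(1, page_count + 1):           # step 1 supplies page_count
        for title, url in list_page(n):          # step 2: titles and URLs
            info, comments = fetch_details(url)  # steps 3 and 4: info box, comments
            resources.append({"title": title, "url": url,
                              "info": info, "comments": comments})
    return resources


# Stub callables standing in for the browser; the data is invented for the demo.
def fake_list_page(n):
    return [("resource %d-%d" % (n, k),
             "http://example.invalid/detail/%d%d" % (n, k)) for k in (1, 2)]


def fake_fetch_details(url):
    return ("info for " + url, ["comment A", "comment B"])


result = crawl(3, fake_list_page, fake_fetch_details)
print(len(result))
```

Because list_page hands back plain strings rather than live page elements, nothing collected from a list page depends on that page still being loaded when the detail page is opened.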

Results

The output is shown in the figure below:

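Program analysis

First get the total page count shown in the figure below, which is 8 at this point. It is obtained with the following code:
texts = driver.find_element_by_xpath("//div[@class='page_nav']").text
Then while(i<=8) visits each page of resources in turn. The URL of each page has the form: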
http://download.csdn.net/user/eastmount/uploads/8
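
As a small illustration of this step, the page count can be pulled out of the page_nav text and turned into the list-page URLs. This is a sketch, not the article's code: parse_page_count and page_urls are hypothetical names, and the sample text is the one quoted in the source comment. It uses [0-9]+ rather than the article's pattern because in Python 3 \w also matches CJK characters, so (\w*[0-9]+)\w* would capture '共46' instead of '46'.

```python
import re


def parse_page_count(nav_text):
    """Return the page count from pagination text such as '共46個 共8頁'.

    Assumes the text holds the item count first and the page count second,
    as in the example quoted in the source code comment.
    """
    numbers = re.findall(r"[0-9]+", nav_text)
    if len(numbers) < 2:
        raise ValueError("expected an item count and a page count: %r" % nav_text)
    return int(numbers[1])


def page_urls(base_url, page_count):
    """Build the URL of every list page, numbered from 1 to page_count."""
    return ["%s/%d" % (base_url, n) for n in range(1, page_count + 1)]


pages = parse_page_count(u"共46個 共8頁")
urls = page_urls("http://download.csdn.net/user/eastmount/uploads", pages)
print(pages)
print(urls[-1])
```

Raising an error when fewer than two numbers are found makes a layout change on the site visible immediately, where m[1] would fail with a less descriptive IndexError.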

Next, get the title and URL of every resource on each page with the following code:

list_container = driver.find_elements_by_xpath("//div[@class='list-container mb-bg']/dl/dt/h3/a")
for title in list_container:
    print 'Num' + str(SUMRESOURCES +1)
    print u'Title: ' + title.text
    print u'Link: ' + title.get_attribute('href')

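In the corresponding page source, the div has class='list-container mb-bg' and the path to each link is <div><dl><dt><h3><a>, so find_elements_by_xpath() returns all matching elements. The URL is also completed automatically; for example,
<a href='/detail/eastmount/6917799'> is returned as "http://download.csdn.net/detail/eastmount/6917799".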
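
The browser does this completion by resolving the href against the URL of the current page. When working from raw HTML without a browser, the same resolution has to be done by hand; a minimal sketch with urljoin from the Python 3 standard library follows (the variable names are illustrative, the URLs are the ones quoted above):

```python
from urllib.parse import urljoin

# URL of the list page the link was found on
page_url = "http://download.csdn.net/user/eastmount/uploads/8"

# href exactly as written in the page source (root-relative)
raw_href = "/detail/eastmount/6917799"

# A root-relative href replaces the whole path of the page URL
absolute = urljoin(page_url, raw_href)
print(absolute)
```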

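Finally, open each resource page and get its info box (InfoBox) and comments. Because PhantomJS behaves like a real browser and runs the page's JavaScript, the dynamically loaded comments can be read directly.

With BeautifulSoup, by contrast, only the static HTML source below can be obtained, and it contains no comment data, because the comments are loaded dynamically by JavaScript (the panel title 資源評論 means "resource comments"). This is where PhantomJS has the advantage. Inspecting the element in Chrome or Firefox does show the populated comment div, which is exactly what simulating a browser is good for!
For comparison, see the earlier article: [Python Learning] A Simple Crawl of CSDN Download Resource Information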

<div class="section-list panel panel-default">  
   <div class="panel-heading">  
      <h3 class="panel-title">資源評論</h3>  
   </div>  
   <!-- recommand -->  
   <script language='JavaScript' defer type='text/javascript' src='/js/comment.js'></script>  
   <div class="recommand download_comment panel-body" sourceid="8772951"></div>  
</div>
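
To make the point concrete, the sketch below parses that static HTML without running any JavaScript and collects the text inside the comment container. It uses html.parser from the Python 3 standard library rather than BeautifulSoup so that it is self-contained; CommentDivText is a hypothetical helper class, and the HTML string is the snippet above.

```python
from html.parser import HTMLParser

STATIC_HTML = """
<div class="section-list panel panel-default">
   <div class="panel-heading">
      <h3 class="panel-title">資源評論</h3>
   </div>
   <script language='JavaScript' defer type='text/javascript' src='/js/comment.js'></script>
   <div class="recommand download_comment panel-body" sourceid="8772951"></div>
</div>
"""


class CommentDivText(HTMLParser):
    """Collect the text found inside the div whose class list has 'download_comment'."""

    def __init__(self):
        super().__init__()
        self.depth = 0      # div nesting depth inside the comment container
        self.found = False  # whether the container was seen at all
        self.chunks = []    # non-empty text fragments inside the container

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        if self.depth:
            self.depth += 1
        elif "download_comment" in (dict(attrs).get("class") or "").split():
            self.found = True
            self.depth = 1

    def handle_endtag(self, tag):
        if tag == "div" and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth and data.strip():
            self.chunks.append(data.strip())


parser = CommentDivText()
parser.feed(STATIC_HTML)
parser.close()
print(parser.found, parser.chunks)
```

The container is found but its text list is empty: the comments only appear after /js/comment.js has run, which a static parser never does.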

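Summary

This article described how to fetch CSDN download resource information with Selenium and PhantomJS. Driving Chrome or Firefox always incurs extra overhead, so the headless PhantomJS browser is used instead. A few questions remain open:
1. How can the black PhantomJS console window be kept from popping up?
2. The program runs inefficiently and responds slowly; how can it be sped up?
If I get the chance, the things I plan to try next include:
1. Downloading the InfoBox of tourist attractions from Baidu Baike (knowledge-graph mining for my graduation project);
2. Crawling the dynamically loaded images on Sohu Pictures to build a smart image-crawling tool;
3. Having the driver control Chrome or Firefox to send messages when automatic login is required.
Finally, I hope this article helps! If it contains errors or omissions, please bear with me~
(By: Eastmount, 2015-8-24, 2:30 a.m. http://blog.csdn.net/eastmount/)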
