[Python crawler] A simple test of scraping CNKI information with BeautifulSoup and Selenium

Posted by Eastmount on 2017-11-17

Original URL: https://blog.csdn.net/eastmount/article/details/78534119

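I have recently been studying complex networks and knowledge graphs, and wanted to scrape paper metadata from CNKI for analysis: title, abstract, publisher, year, download count, citation count, author information, and so on. Scraping CNKI papers ran into the following problems:
1. The scraped content was always empty, because the data is loaded dynamically and the elements could not be located, so I switched to the CNKI 3.0 search interface instead;
2. That interface does not include author information, so the crawler has to open each detail page to collect the authors, which raised a new problem.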

I. Site analysis and element location

The CNKI site is: http://nvsm.cnki.net/kns/brief/default_result.aspx
For example, searching for the keyword Python returns 2681 articles, as shown below.

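However, the paper content located with Selenium was always empty. I later came across a blog post by qiuqingyun that pointed to another CNKI interface (CNKI 3.0 knowledge search: http://search.cnki.net/).
I strongly recommend reading the original post: http://qiuqingyu.cn/2017/04/27/python實現CNKI知網爬蟲/
Searching for python there returns 3428 papers, as shown below.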

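Next, a brief walk through the analysis. The approach is the usual one: inspect the DOM tree to locate the elements. Right-click and inspect an element in the browser, as shown below. Each page contains 15 papers, and each paper sits under a <div class="wz_tab"> tag.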

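Expanding a single entry, as shown below, gives the following locators:
1. Title: the <h3> tag under <div class="wz_content">, which also provides the URL;
2. Abstract: the content of <div class="width715">;
3. Source: the title attribute of the node under <span class="year-count">; the year is extracted with a regular expression;
4. Download count and citation count: <span class="count">, taking the first and second numbers.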
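
The locators above can be tried offline before touching the live site. The following is a minimal sketch, assuming an HTML fragment shaped like the structure just described. `SAMPLE_HTML`, its example URL, and the `parse_results` helper are hypothetical and for illustration only; the real page markup may nest these elements differently.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment modelled on the structure described in the text.
# The real CNKI markup may differ.
SAMPLE_HTML = """
<div class="wz_tab">
  <div class="wz_content">
    <h3><a href="http://example.com/detail?id=1">Sample paper title</a></h3>
    <div class="width715">Sample abstract text</div>
    <span class="year-count">
      <span title="北方文學(下旬)">《北方文學(下旬)》  2017年 第06期</span>
      <span class="count">(11) (0)</span>
    </span>
  </div>
</div>
"""


def parse_results(html):
    """Return one dict per <div class="wz_tab"> block found in html."""
    soup = BeautifulSoup(html, "html.parser")
    records = []
    for tab in soup.find_all("div", class_="wz_tab"):
        heading = tab.find("h3")
        link = heading.find("a")
        info = tab.find("span", class_="year-count")
        records.append({
            "title": heading.get_text(strip=True),
            # Only the first link under <h3> is the detail-page URL
            "url": link.get("href") if link else None,
            "abstract": tab.find("div", class_="width715").get_text(strip=True),
            # The journal name is read from the title attribute of the first inner span
            "journal": info.find("span").get("title"),
            "counts": info.find("span", class_="count").get_text(strip=True),
        })
    return records


if __name__ == "__main__":
    for record in parse_results(SAMPLE_HTML):
        print(record)
```

Keeping the parsing in a function that takes an HTML string makes it easy to check the selectors against a saved copy of a results page.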

The next two sections cover the BeautifulSoup crawler and the Selenium crawler in turn.

II. BeautifulSoup crawler

The complete BeautifulSoup code is as follows:

# -*- coding: utf-8 -*-
import re
import urllib.request
from bs4 import BeautifulSoup


# Main entry point
if __name__ == '__main__':

    url = "http://search.cnki.net/Search.aspx?q=python&rank=relevant&cluster=all&val=&p=0"
    content = urllib.request.urlopen(url).read()
    soup = BeautifulSoup(content, "html.parser")

    # Locate each result block
    wz_tab = soup.find_all("div", class_="wz_tab")
    num = 0
    for tab in wz_tab:
        # Title
        title = tab.find("h3")
        print(title.get_text())
        # Detail-page hyperlink: only the first URL under <h3> is needed
        link = title.find("a")
        if link is not None:
            print(link.get('href'))
        # Abstract
        abstract = tab.find(attrs={"class": "width715"}).get_text()
        print(abstract)
        # Other information
        other = tab.find(attrs={"class": "year-count"})
        lines = other.get_text().split("\n")
        # The text cannot be split reliably on the double space, e.g.
        #     《懷化學院學報》  2017年 第09期
        # so the journal name is read from the title attribute instead:
        #     <span title="北方文學(下旬)">《北方文學(下旬)》  2017年 第06期</span>

        # Journal + year
        journal = other.find("span")
        if journal is not None:
            print(journal.get("title"))
        mode = re.compile(r'\d+\.?\d*')
        number = mode.findall(lines[0])
        print(number[0])  # year

        # Download count and citation count
        number = mode.findall(lines[1])
        if len(number) == 1:
            print(number[0])
        elif len(number) == 2:
            print(number[0], number[1])

        num = num + 1

The output is shown in the figure below:

However, the scraped URLs cannot be opened and always show a login page, for example "http://epub.cnki.net/kns/detail/detail.aspx?filename=DZRU2017110705G&dbname=CAPJLAST&dbcode=cjfq", whereas the URL that displays correctly is "http://www.cnki.net/KCMS/detail/detail.aspx?filename=DZRU2017110705G&dbname=CAPJLAST&dbcode=CJFQ&urlid=&yx=&v=MTc2ODltUm42ajU3VDNmbHFXTTBDTEw3UjdxZVlPZHVGeTdsVXJ6QUpWZz1JVGZaZbzlDWk81NFl3OU16".
The correct page is shown in the figure below:

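Solution: I plan to use Selenium to locate the hyperlink and click it with the mouse, which navigates to the detail page where the author or keyword information can be read.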
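
To see why the scraped link cannot simply be rewritten by hand, it helps to compare the query strings of the two URLs quoted above. The sketch below uses only the standard library. The helper names `query_params` and `compare_urls` are hypothetical, and what the extra parameters mean is not documented here, so the result is an observation rather than a description of the CNKI API.

```python
from urllib.parse import urlsplit, parse_qs

# The two URLs quoted in the text above
LOGIN_URL = ("http://epub.cnki.net/kns/detail/detail.aspx"
             "?filename=DZRU2017110705G&dbname=CAPJLAST&dbcode=cjfq")
WORKING_URL = ("http://www.cnki.net/KCMS/detail/detail.aspx"
               "?filename=DZRU2017110705G&dbname=CAPJLAST&dbcode=CJFQ"
               "&urlid=&yx=&v=MTc2ODltUm42ajU3VDNmbHFXTTBDTEw3UjdxZVlPZHVGeTdsVXJ6"
               "QUpWZz1JVGZaZbzlDWk81NFl3OU16")


def query_params(url):
    """Return the query string of url as a {name: first value} dict."""
    pairs = parse_qs(urlsplit(url).query, keep_blank_values=True)
    return {name: values[0] for name, values in pairs.items()}


def compare_urls(first, second):
    """Return (shared parameters with both values, names only in second)."""
    a, b = query_params(first), query_params(second)
    shared = {name: (a[name], b[name]) for name in a if name in b}
    only_second = sorted(set(b) - set(a))
    return shared, only_second


if __name__ == "__main__":
    shared, only_second = compare_urls(LOGIN_URL, WORKING_URL)
    print(shared)
    print(only_second)
```

Both URLs carry the same filename and dbname, and the dbcode differs only in case. The working URL additionally carries urlid, yx and a long v value, which the scraper has no obvious way to construct; this is one reason to let the browser follow the link instead.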

III. Selenium crawler

The scraping code is as follows:

# -*- coding: utf-8 -*-
import time
import re
from selenium import webdriver
from selenium.webdriver.common.by import By


# Main entry point
if __name__ == '__main__':

    url = "http://search.cnki.net/Search.aspx?q=python&rank=relevant&cluster=all&val=&p=0"
    driver = webdriver.Firefox()
    driver.get(url)
    # Titles
    content = driver.find_elements(By.XPATH, "//div[@class='wz_content']/h3")
    # Abstracts
    abstracts = driver.find_elements(By.XPATH, "//div[@class='width715']")
    # Journal + year
    other = driver.find_elements(By.XPATH, "//span[@class='year-count']/span[1]")
    mode = re.compile(r'\d+\.?\d*')
    # Download count and citation count
    num = driver.find_elements(By.XPATH, "//span[@class='count']")

    # Print the fields of each result
    i = 0
    for tag in content:
        print(tag.text)
        print(abstracts[i].text)
        print(other[i].get_attribute("title"))
        number = mode.findall(other[i].text)
        print(number[0])  # year
        number = mode.findall(num[i].text)
        if len(number) == 1:  # a count may be missing, e.g. (100) ()
            print(number[0])
        elif len(number) == 2:
            print(number[0], number[1])
        print('')

        i = i + 1
        # Click the title to open its detail page
        tag.click()
        time.sleep(1)

The output is as follows:

>>> 
網路資源輔助下的Python程式設計教學 
本文對於Python學習網路資源做了歸納分類,說明了每類資源的特點,具體介紹了幾個有特色的學習網站,就網路資源輔助下的Python學習進行了討論,闡釋了利用優質網路資源可以提高課堂教學效果,增加教學的生動性、直觀性和互動性。同時說明了這些資源的利用能夠方便學生的程式設計訓練,使學生有更多的時間和機會動手程式設計,實現程式設計教學中...
電子技術與軟體工程
2017
11 0

Python虛擬機器記憶體管理的研究 
動態語言的簡潔性,易學性縮短了軟體開發人員的開發週期,所以深受研發人員的喜愛。其在機器學習、科學計算、Web開發等領域都有廣泛的應用。在眾多的動態語言中,Python是使用者數量較大的動態語言之一。本文主要研究Python對記憶體資源的管理。Python開發效率高,但是執行效率常為人詬病,主要原因在於一切皆是物件的語言實現...
南京大學
2014
156 0
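
The two regular-expression steps in the loop above (year and counts) can be isolated into small helpers so that missing values are handled in one place. This is a minimal sketch, not the code used in this post: `extract_year` and `parse_counts` are hypothetical names, the year pattern assumes a four-digit year starting with 19 or 20, and when only one count is present the helper assumes it is the first one, following the "(100) ()" case noted in the code comment.

```python
import re

# Assumption: a publication year is a standalone four-digit number 19xx or 20xx
YEAR_RE = re.compile(r"(?<!\d)((?:19|20)\d{2})(?!\d)")
NUMBER_RE = re.compile(r"\d+")


def extract_year(text):
    """Return the first four-digit year in text as an int, or None."""
    match = YEAR_RE.search(text)
    return int(match.group(1)) if match else None


def parse_counts(text):
    """Return (first, second) counts as ints, padding missing ones with None."""
    numbers = [int(n) for n in NUMBER_RE.findall(text)][:2]
    numbers += [None] * (2 - len(numbers))
    return tuple(numbers)


if __name__ == "__main__":
    print(extract_year("《北方文學(下旬)》  2017年 第06期"))
    print(parse_counts("(100) ()"))
```

Matching a four-digit year rather than taking the first number in the string avoids picking up an issue number or a digit that happens to appear in a journal name.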

The next step is to click through to the detail page and switch windows to capture its information. The code is as follows:

# -*- coding: utf-8 -*-
import time
import re
from selenium import webdriver
from selenium.webdriver.common.by import By


# Main entry point
if __name__ == '__main__':

    url = "http://search.cnki.net/Search.aspx?q=python&rank=relevant&cluster=all&val=&p=0"
    driver = webdriver.Firefox()
    driver.get(url)
    # Titles
    content = driver.find_elements(By.XPATH, "//div[@class='wz_content']/h3")
    # Abstracts
    abstracts = driver.find_elements(By.XPATH, "//div[@class='width715']")
    # Journal + year
    other = driver.find_elements(By.XPATH, "//span[@class='year-count']/span[1]")
    mode = re.compile(r'\d+\.?\d*')
    # Download count and citation count
    num = driver.find_elements(By.XPATH, "//span[@class='count']")

    # Remember the handle of the current (main) window
    now_handle = driver.current_window_handle

    # Print the fields of each result
    i = 0
    for tag in content:
        print(tag.text)
        print(abstracts[i].text)
        print(other[i].get_attribute("title"))
        number = mode.findall(other[i].text)
        print(number[0])  # year
        number = mode.findall(num[i].text)
        if len(number) == 1:  # a count may be missing, e.g. (100) ()
            print(number[0])
        elif len(number) == 2:
            print(number[0], number[1])
        print('')

        i = i + 1
        # Click the title to open its detail page in a new window
        tag.click()
        time.sleep(2)

        # Get the handles of all open windows
        all_handles = driver.window_handles

        # Two windows are now open; switch to the one that is not the main window
        for handle in all_handles:
            if handle != now_handle:
                # Print the handle of the window being switched to
                print(handle)
                driver.switch_to.window(handle)
                time.sleep(1)

                print('Detail page info')
                print(driver.current_url)
                print(driver.title)

                # Read the summary block, which holds the author information
                elem_sub = driver.find_element(By.XPATH, "//div[@class='summary pad10']")
                print("Author", elem_sub.text)
                print('')

                # Close the detail window
                driver.close()

        # Print the main window handle
        print(now_handle)
        driver.switch_to.window(now_handle)  # back to the main window for the next click
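
The window-switching pattern in the loop above can be factored into a helper that always returns to the main window, even when reading a detail page fails. The sketch below is one possible shape, not the code used in this post: `visit_new_windows` and its `handler` callback are hypothetical names, and it relies only on `driver.window_handles`, `driver.switch_to.window` and `driver.close`.

```python
def visit_new_windows(driver, main_handle, handler):
    """Run handler(driver) in every window except main_handle, closing each one.

    Returns the list of handler results. The driver is switched back to
    main_handle afterwards, even if handler raises an exception.
    """
    results = []
    try:
        for handle in driver.window_handles:
            if handle == main_handle:
                continue
            driver.switch_to.window(handle)
            try:
                results.append(handler(driver))
            finally:
                # Close the detail window whether or not handler succeeded
                driver.close()
    finally:
        driver.switch_to.window(main_handle)
    return results
```

Inside the loop, a call such as `visit_new_windows(driver, now_handle, lambda d: d.title)` after `tag.click()` would collect the titles of the detail windows and leave the driver on the results page for the next click.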

However, some pages still cannot be accessed, as shown below:

In the end, I plan to scrape Wanfang Data for the analysis instead.
I hope this article helps; please forgive any errors or omissions.
(By: Eastmount, 2017-11-17, midnight, http://blog.csdn.net/eastmount/ )
