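[Python Learning] Simple Scraping of CSDN Download Resource Information

Posted by Eastmount on 2015-07-21

Original URL: https://blog.csdn.net/eastmount/article/details/46986589

This article is an example of scraping CSDN download-resource information with Python. It uses urllib from the Python 2 standard library to collect, for every resource a given user has uploaded to CSDN, the resource URL, resource name, download count, points, and similar fields. I wrote it because I wanted to collect all the comments on my own resources, but the comments are loaded on the fly by JavaScript, so this article starts with a brief introduction to analysing the HTML pages by hand and scraping information from them.

Source code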

# coding=utf-8  
import urllib  
import time  
import re  
import os

#**************************************************
# Step 1: iterate over the list pages and collect the URL of each resource
# http://download.csdn.net/user/eastmount/uploads/1
# http://download.csdn.net/user/eastmount/uploads/8
#**************************************************

num=1 # running resource counter (46 resources in total)
number=1 # list page counter, pages 1-8
fileurl=open('csdn_url.txt','w+')
fileurl.write('****************Resource URLs*************\n\n')

while number<9:
    url='http://download.csdn.net/user/eastmount/uploads/' + str(number)
    fileurl.write('Download list URL: '+url+'\n\n')
    print 'Download list URL: '+url
    content=urllib.urlopen(url).read()
    open('csdn.html','w+').write(content)

    # Cut out the block that contains the resource URLs
    # (a regex match on the block would need to count the </div> tags)
    start=content.find(r'<div class="list-container mb-bg">')
    end=content.find(r'<div class="page_nav">')
    cutcontent=content[start:end]
    #print cutcontent

    # Extract the URLs from the block
    # Each entry looks like <dt><div><img icon></div><h3><a href>title</a></h3></dt>
    res_dt = r'<dt>(.*?)</dt>'
    m_dt =  re.findall(res_dt,cutcontent,re.S|re.M)
    for obj in m_dt:
        # Count the resources
        print '******************************************'
        print 'Resource No. '+str(num)
        fileurl.write('******************************************\n')
        fileurl.write('Resource No. '+str(num)+'\n')
        num = num +1
        # Extract the resource URL (href in double or single quotes)
        url_list = re.findall(r"(?<=href=\").+?(?=\")|(?<=href=\').+?(?=\')", obj)
        for url in url_list:
            url_load='http://download.csdn.net'+url
            print 'URL: '+url_load
            fileurl.write('URL: http://download.csdn.net'+url+'\n')
        # Extract the resource title
        # <a href="/detail/eastmount/8757243">MFC display BMP image</a>
        res_title = r'<a href=.*?>(.*?)</a>'
        title = re.findall(res_title,obj,re.S|re.M)
        for t in title:
            print unicode('Title: ' + t,'utf-8')
            fileurl.write('Title: ' + t +'\n')


        #**************************************************
        # Step 2: visit each resource page for its details and comments
        # http://download.csdn.net/detail/eastmount/8785591
        #**************************************************

        # Locate the structured infobox
        resources = urllib.urlopen(url_load).read()
        open('resource.html','w+').write(resources)
        start_res=resources.find(r'<div class="wraper-info">')
        end_res=resources.find(r'<div class="enter-link">')
        infobox=resources[start_res:end_res]

        # Resource points, download count, type and size (the first 4 <span></span>)
        res_span = r'<span>(.*?)</span>'
        m_span = re.findall(res_span,infobox,re.S|re.M)
        print 'Resource points: '+m_span[0]
        fileurl.write('Resource points: ' + m_span[0] +'\n')
        print 'Download count: '+m_span[1]
        fileurl.write('Download count: ' + m_span[1] +'\n')
        print 'Resource type: '+m_span[2]
        fileurl.write('Resource type: ' + m_span[2] +'\n')
        print 'Resource size: '+m_span[3]
        fileurl.write('Resource size: ' + m_span[3] +'\n')


        #**************************************************
        # Step 3: how to get the comments
        # http://jeanphix.me/Ghost.py/
        # http://segmentfault.com/q/1010000000143340
        # http://casperjs.org/
        #**************************************************


    # for-else: runs once the for loop over m_dt has finished
    else:
        fileurl.write('******************************************\n\n')
        print '******************************************\n'
        print 'Load Next List\n'
        number = number+1 # move on to the next list page
# while-else: runs once all list pages have been processed
else:
    fileurl.close()

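Results
The output includes the resource URL, resource title, resource points, download count, resource type and resource size:

For example, to scrape the resource information of Guo Lin (郭霖), the list pages are as follows (7 pages in total):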
http://download.csdn.net/user/sinyu890807/uploads/1
http://download.csdn.net/user/sinyu890807/uploads/7
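After a simple change to the URL in the Python source, the download page is as shown in the figure below:

The run output is shown in the figure below: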

HTML analysis
First, by analysing the page source, get the URL and title of every resource on each list page.

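<dt>
   <div class="icon"><img src="/images/minetype/rar.gif" title="rar file"></div>
   <div class="btns"></div>
   <h3><a href="/detail/eastmount/8772951">
          MFC image processing: geometric operations, image translation, rotation, scaling and mirroring (source code)</a>
       <span class="points">0</span>
   </h3>
</dt>
<dd class="meta">Uploader:
    <a class="user_name" href="/user/eastmount">eastmount</a>
         | Upload date: 2015-06-04
         | Downloaded 26 times
</dd>
<dd class="intro">
        This resource mainly follows my blog post "[Digital Image Processing] 6. MFC spatial
        geometric transforms: image translation, mirroring, rotation and scaling explained". It covers
        applied image processing with MFC in VC++ 6.0, displaying BMP images through an MFC single-document view.
</dd>
<dd class="tag">
     <a href="/tag/MFC">MFC</a>
     <a href="/tag/%E5%9B%BE%E5%83%8F%E5%A4%84%E7%90%86">Image processing</a>
</dd>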
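
To make the `<dt>` analysis concrete, here is a minimal Python 3 sketch that pulls the link and title out of a block like the one above. `SAMPLE_DT` is a trimmed copy of that block with an abbreviated title, and `parse_dt_blocks` is a name invented for this example. It reuses the `<dt>(.*?)</dt>` pattern from the script but reads the href and title with a single combined pattern, which is a simplification rather than the script's exact approach.

```python
import re

# Trimmed copy of the <dt> block shown above (title abbreviated).
SAMPLE_DT = '''<dt>
   <div class="icon"><img src="/images/minetype/rar.gif" title="rar"></div>
   <div class="btns"></div>
   <h3><a href="/detail/eastmount/8772951">
          MFC image processing (source code)</a>
       <span class="points">0</span>
   </h3>
</dt>'''


def parse_dt_blocks(html, base='http://download.csdn.net'):
    """Return a list of (absolute_url, title) pairs, one per <dt> block."""
    results = []
    for block in re.findall(r'<dt>(.*?)</dt>', html, re.S):
        # First <a href="...">title</a> inside the block; quotes may be " or '
        m = re.search(r'<a\s+href=["\'](.*?)["\'][^>]*>(.*?)</a>', block, re.S)
        if m is None:
            continue  # no link in this block, skip it
        href, title = m.groups()
        # Collapse the line breaks and indentation inside the title
        results.append((base + href, ' '.join(title.split())))
    return results


print(parse_dt_blocks(SAMPLE_DT))
```

Skipping blocks without a link avoids reusing a stale URL from the previous entry, which the original loop would do if `url_list` came back empty.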

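The corresponding rendered HTML is shown in the figure below:

Then follow the URL to the individual resource page and extract what I call the infobox-style information:

The corresponding Inspect Element output is shown below; it is enough to extract <span>0分</span> (0 points):

Finally, what I wanted to do was get the comments, but they are produced by JS: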
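
As an illustration of the `<span>` step, the sketch below maps the first four spans of an infobox to named fields. The article does not reproduce the infobox HTML, so `SAMPLE_INFOBOX` is a made-up stand-in (only `<span>0分</span>` comes from the text above), and `parse_infobox` and its field names are hypothetical. It returns `None` when fewer than four spans are found, where the script's direct indexing would raise an `IndexError`.

```python
import re

# Hypothetical infobox markup; only the "0分" span is taken from the article.
SAMPLE_INFOBOX = '''<div class="wraper-info">
  <span>0分</span>
  <span>26</span>
  <span>rar</span>
  <span>2.1MB</span>
</div>'''

# Order follows the script: points, download count, type, size.
FIELDS = ('points', 'downloads', 'type', 'size')


def parse_infobox(infobox):
    """Map the first four <span> values to FIELDS, or None if any is missing."""
    spans = re.findall(r'<span>(.*?)</span>', infobox, re.S)
    if len(spans) < len(FIELDS):
        return None
    # zip stops after len(FIELDS) items, so extra spans are ignored
    return dict(zip(FIELDS, (s.strip() for s in spans)))


print(parse_infobox(SAMPLE_INFOBOX))
```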

<div class="section-list panel panel-default">
   <div class="panel-heading">
      <h3 class="panel-title">Resource comments</h3>
   </div>
   <!-- recommand -->
   <script language='JavaScript' defer type='text/javascript' src='/js/comment.js'></script>
   <div class="recommand download_comment panel-body" sourceid="8772951"></div>
</div>

Part of the JS file is shown below:

var base_url= (window.location.host.substring(0,5)=='local') ? 'http://local.downloadv3.csdn.net' : 'http://download.csdn.net';
base_url = "";
$(document).ready(function(){
	
	CC_Comment.initConfig();
	CC_Comment.getContent(1);
});	
var CC_Comment = 
{
	sourceid:0,
	initConfig:function()
	{
		var sid = parseInt($(".download_comment").attr('sourceid'));
		if(isNaN(sid) || sid<=0)
		{
			this.sourceid = 0;
		}else
		{
			this.sourceid = sid;
		}
		
	}
....
}
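
The `sourceid` that `initConfig` reads is already present in the static HTML, so it can be picked up without running any JavaScript. The following Python 3 sketch does that and applies roughly the same validation (fall back to 0 for missing, non-numeric or non-positive values). `extract_source_id` is a hypothetical helper, and `int()` is stricter than JavaScript's `parseInt`, which would accept a value such as `12abc`. How the comments themselves are then requested is not covered here, since the rest of the JS is elided.

```python
import re

SAMPLE_COMMENT_DIV = (
    '<div class="recommand download_comment panel-body" '
    'sourceid="8772951"></div>'
)


def extract_source_id(html):
    """Return the sourceid of the download_comment div, or 0 if unusable."""
    m = re.search(
        r'class="[^"]*\bdownload_comment\b[^"]*"[^>]*\bsourceid="([^"]*)"',
        html,
    )
    if m is None:
        return 0
    try:
        sid = int(m.group(1))
    except ValueError:
        return 0  # mirrors the isNaN(sid) branch
    return sid if sid > 0 else 0  # mirrors the sid<=0 branch


print(extract_source_id(SAMPLE_COMMENT_DIV))
```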

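Finally, I hope this article is of some help! The next article will look at how to get the JS-loaded comments with Python. This one also gives you a simple example of analysing a page by hand, and it can be used to pick out a user's CSDN resources with many downloads and high scores. This is basic material, for reference only.
(By: Eastmount, 2015-7-21, 5 p.m., http://blog.csdn.net/eastmount/)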
