Crawling a personal CSDN blog and exporting the data

=-=會飛起來的哈士奇, published 2020-09-24

Original URL: https://blog.csdn.net/weixin_42873348/article/details/108785952

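I have recently been learning Python, web scraping, and a little pandas.
I happened to come across a blog post (https://blog.csdn.net/xiaoma_2018/article/details/108231658) that implements the same thing. I changed the approach somewhat, but I am still borrowing and learning from that author's method.
The libraries I use are lxml, urllib.request, and pandas.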

The code is as follows:

'''
Import the required libraries
'''
import urllib.request as ur
import lxml.etree as le
import pandas as pd
from config import EachSource,OUTPUT

url = 'https://blog.csdn.net/shine_a/article/list/2'

# Fetch the raw HTML (as bytes) of a blog page
def get_blog(url):
    req = ur.Request(
        url=url,
        headers={
            'cookie':'c_session_id%3D10_1600923150109.901257%3Bc_sid%3D40c3e11ae0d6021f6f8323db1cc321a1%3Bc_segment%3D9%3Bc_first_ref%3Dwww.google.com.hk%3Bc_first_page%3Dhttps%253A%2F%2Fwww.csdn.net%2F%3Bc_ref%3Dhttps%253A%2F%2Fblog.csdn.net%2F%3Bc_page_id%3Ddefault%3B',
            'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.121 Safari/537.36'
        }
    )
    response = ur.urlopen(req).read()
    return response

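# Collect the link to each post, open the second-level page, and extract the data

def get_href_list():
    # `response` is the module-level variable assigned in the __main__ block
    href_s = le.HTML(response).xpath('//div[@class="articleMeList-integration"]/div[2]/div/h4/a/@href')
    for data in href_s:
        # Each item is only a URL string, so open it and decode the page first
        rp = ur.urlopen(data).read()
        html = rp.decode()
        page = le.HTML(html)  # parse the post page once and reuse it
        title = page.xpath('//div[contains(@class,"main_father")]/div/main/div[1]/div/div/div/h1/text()')
        readCount = page.xpath('//div[@class="bar-content"]/span[@class="read-count"]/text()')
        time = page.xpath('//div[@class="article-info-box"]/div[3]/span[2]/text()')
        url_href = data
        # Skip pages where any XPath matched nothing, instead of raising IndexError
        if not (title and readCount and time):
            continue
        content = [title[0],readCount[0],time[0],url_href]
        info = ",".join(content)
        # print(info)
        with open(EachSource,'a',encoding='utf-8') as f:
            f.write(info+'\n')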



# Process the data: read the file written by get_href_list and export it as CSV
def parseData():
    f = pd.read_table(EachSource,sep=',',header=None, names=['Title', 'Views', 'Published','URL'])
    f.to_csv(OUTPUT)


if __name__ == '__main__':
    blogID = input('Enter blogID:')
    pages = input('Enter page:')
    response = get_blog(
        url= 'https://blog.csdn.net/{blogID}/article/list/{pages}'.format(blogID=blogID, pages=pages)
    )
    # data = response.decode('utf-8')
    # with open('a.txt','w',encoding='utf-8') as f:
    #     f.write(data)
    print('Fetching blog links...')
    get_href_list()
    print("Starting to process the data...")
    parseData()
    print("Finished processing the data...")
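
One thing worth noting about the code above: joining the fields with "," shifts the columns whenever a post title itself contains a comma. Below is a minimal sketch of one way around that, using the standard csv module rather than the manual join. The rows and the temporary file path are made up for illustration and are not data from a real crawl.

```python
import csv
import os
import tempfile

# Hypothetical rows: (title, view count, publish time, link).
rows = [
    ["Python, pandas and lxml notes", "1024", "2020-09-24 10:00:00",
     "https://blog.csdn.net/example/article/details/1"],
    ["Plain title", "7", "2020-09-25 08:30:00",
     "https://blog.csdn.net/example/article/details/2"],
]

# Temporary stand-in for the EachSource path from config.
path = os.path.join(tempfile.mkdtemp(), "info.txt")

# csv.writer quotes any field that contains the delimiter.
with open(path, "a", encoding="utf-8", newline="") as f:
    csv.writer(f).writerows(rows)

# Reading back with csv.reader restores the original four fields per row.
with open(path, encoding="utf-8", newline="") as f:
    loaded = list(csv.reader(f))

# The manual join, by contrast, splits the first title into two fields.
naive_fields = ",".join(rows[0]).split(",")
print(len(naive_fields), len(loaded[0]))
```

As far as I know, the pandas readers handle double-quoted fields by default as well, so a file written this way should still load in parseData.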

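Here is a problem I ran into during the implementation:
1. The XPath query that collects each post's URL only gives back the href values, which lxml returns as its own string-like result objects. Iterating over href_s with data therefore yields a URL each time, not the post page (the second-level page) itself. So each URL has to be opened with rp = ur.urlopen(data).read() and the result decoded with html = rp.decode() before the later XPath queries can run.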
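
The difference can be seen without touching the network. The sketch below parses a small hand-written HTML fragment; the markup and URLs are hypothetical stand-ins and far simpler than the real CSDN listing page.

```python
import lxml.etree as le

# Hypothetical listing-page fragment, standing in for the real CSDN markup.
LISTING_HTML = """
<div class="article-list">
  <h4><a href="https://blog.csdn.net/example/article/details/1">First post</a></h4>
  <h4><a href="https://blog.csdn.net/example/article/details/2">Second post</a></h4>
</div>
"""

tree = le.HTML(LISTING_HTML)

# Selecting the attribute (@href) returns string values, one per link.
hrefs = tree.xpath('//h4/a/@href')
print(hrefs)

# Selecting the element itself returns lxml element objects instead.
links = tree.xpath('//h4/a')
print([a.text for a in links])
```

Either way, nothing here has fetched the linked pages yet, which is why the crawler still needs the urlopen and decode step for every URL.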

Here is how the run looked:
Process:
[screenshot]

[screenshot]

Finally, thanks to the blogger Caso_卡索 for the content, which I used for study and reference!
