Saving CSDN Blog Posts as Local Files

Posted by 青蛙愛輪滑 on 2019-03-31

Original URL: https://blog.csdn.net/kevinhanser/article/details/88936860

March 31, 2019, 21:49:39 [Original]

1. Environment

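I recently noticed that none of the externally hosted images in my CSDN blog posts load any more. The posts use https://i.imgur.com/ as the image host, with image links of the form https://i.imgur.com/WxbYfSu.png; the images were uploaded by the Markdown editor I used to write the posts.

i.imgur.com is currently unreachable from IP addresses in mainland China, and I don't know why.

Since the images cannot be loaded from a domestic IP, I decided to go through a proxy outside China and save the images locally. When I have time I will write another script to batch-replace the links, since there are quite a lot of posts.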

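There are a few shortcomings:

The CSS and JS files are not downloaded. Adding that would be straightforward: copy one of the regular expressions in the code and adapt it.
The JS and CSS referenced by the downloaded HTML file still need a network connection. Offline you can read the article content, but the styled layout is missing.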
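
As a rough illustration of that extension, the sketch below collects stylesheet and script URLs from a page with two regular expressions. The helper name find_assets and the sample HTML (including its URLs) are made up for this example, and real CSDN markup may need looser patterns.

```python
import re

# Hypothetical helper, not part of the original script.
CSS_RE = re.compile(r'<link[^>]+href="([^"]+\.css)"', flags=re.I)
JS_RE = re.compile(r'<script[^>]+src="([^"]+\.js)"', flags=re.I)


def find_assets(html):
    """Return (css_urls, js_urls) found in the given HTML text."""
    return CSS_RE.findall(html), JS_RE.findall(html)


# Made-up page fragment used only for this demo.
sample = (
    '<link rel="stylesheet" href="https://csdnimg.cn/skin.min.css">\n'
    '<script src="https://csdnimg.cn/main-1.0.6.js"></script>\n'
)
css_urls, js_urls = find_assets(sample)
print(css_urls)
print(js_urls)
```

Each URL found this way could then be downloaded and rewritten to a local path in the same way the script handles images.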

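Enough preamble. The environment is Python 3. The script imports requests and PIL (Pillow), neither of which is part of the standard library, so install them if they are missing:

pip install requests Pillow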

2. Process breakdown

The URLs to save are stored in a text file, one per line (kali.txt in the full script).
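
A minimal sketch of reading such a file, skipping blank lines so that an empty line never reaches the downloader. The helper name read_urls and the throwaway demo file are assumptions for illustration.

```python
import os
import tempfile


def read_urls(path):
    """Return the non-empty lines of a text file, stripped of whitespace."""
    with open(path, "r", encoding="utf-8") as f:
        return [line.strip() for line in f if line.strip()]


# Demo with a throwaway file; the real script reads kali.txt.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "urls.txt")
    with open(demo_path, "w", encoding="utf-8") as f:
        f.write("https://blog.csdn.net/kevinhanser/article/details/88936860\n\n")
    urls = read_urls(demo_path)

print(urls)
```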

getHtml() fetches a URL through the proxy:

 def getHtml(url):
     proxies = {"https": "127.0.0.1:1080"}
     response = requests.get(url, proxies = proxies)
     return response.text

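getImg() filters and reformats the page content, then saves the article and its images. It starts by stripping the hidden auto-redirect block: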

    # Strip the hidden auto-redirect block
    pattern1 = re.compile(r'<div style="display:none;">(.*?)</div>',flags=re.S)
    match1 = pattern1.sub('',body)
    #print(match1)
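
To see what the non-greedy (.*?) combined with re.S does here, the following small check runs the same pattern on a made-up fragment. The sample HTML is an assumption, not a copy of a real CSDN page.

```python
import re

pattern1 = re.compile(r'<div style="display:none;">(.*?)</div>', flags=re.S)

# Made-up fragment: a hidden block spanning several lines, then real content.
body = (
    '<div style="display:none;">\n<img src="" onerror="redirect()">\n</div>'
    '<div id="content">article text</div>'
)

# re.S lets "." cross the newlines; "?" stops at the first closing </div>.
cleaned = pattern1.sub('', body)
print(cleaned)
```

Without re.S the pattern would not match across the line breaks, and with a greedy (.*) it would swallow everything up to the last </div> on the page.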

The page's <title> is used as the file name, and also as the directory name for the images:

    # Use the page title as the file name
    title = ''
    title_list = re.findall('<title>(.*?) - kevinhanser - CSDN博客</title>',match4)
    #print(title_list)
    title = str(title_list[0])
    print(title)
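
A title can contain characters that are not allowed in file or directory names (for example / or :), and re.findall returns an empty list when the title does not match, in which case title_list[0] raises IndexError. A more defensive variant might look like the sketch below. The name extract_title, the fallback value, the underscore replacement, and the exact title suffix in the pattern are assumptions, not part of the original script.

```python
import re

# Assumed title layout: "<post title> - kevinhanser - CSDN博客".
TITLE_RE = re.compile(r'<title>(.*?) - kevinhanser - CSDN博客</title>', flags=re.S)


def extract_title(html, fallback="untitled"):
    """Return a file-system-safe title, or the fallback if none is found."""
    found = TITLE_RE.findall(html)
    if not found:
        return fallback
    # Replace characters that Windows or POSIX reject in file names.
    return re.sub(r'[\\/:*?"<>|]', '_', found[0]).strip()


# Made-up page title used only for this demo.
page = '<title>Kali: install/setup - kevinhanser - CSDN博客</title>'
title = extract_title(page)
print(title)
```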

All image links in the body are then rewritten to local image paths, so the images also load when the article is viewed offline:

    # Locate the images in the body
    #match5 = pattern5.findall(match4)
    pattern5 = re.compile(r'<p>(.*)i.imgur.com')
    # Point the image links at the local copies (./image/ + title)
    match5 = pattern5.sub('<img src="'+'./image/'+title,match4)
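
This substitution relies on each image sitting in its own <p> on a single line. An alternative that rewrites only the host part of each URL, wherever it appears, could look like the sketch below. The name localize_images and the sample HTML are illustrative assumptions.

```python
import re

# Match an i.imgur.com PNG link and capture just the file name.
IMGUR_RE = re.compile(r'https://i\.imgur\.com/([A-Za-z0-9]+\.png)')


def localize_images(html, title):
    """Point every i.imgur.com PNG link at ./image/<title>/<file name>."""
    # A function replacement avoids backslash-escape surprises in the title.
    return IMGUR_RE.sub(lambda m: './image/' + title + '/' + m.group(1), html)


sample = '<p><img src="https://i.imgur.com/WxbYfSu.png" alt="demo"></p>'
result = localize_images(sample, 'my-post')
print(result)
```

Unlike the pattern above, this version keeps the surrounding <p> tag and leaves the rest of the line untouched.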

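The for loop takes each image name matched by the regular expression, builds the full image URL, downloads it, and saves it under the same file name: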

    # Download img_url and save it locally as img_path
    if not os.path.exists(img_path):
        with urllib.request.urlopen(img_url, timeout=120) as response, open(img_path, 'wb') as f_save:
            f_save.write(response.read())
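
Note that urllib.request.urlopen does not see the proxies dictionary that getHtml() passes to requests; it relies on the system or environment proxy settings. If the images should go through the same local proxy explicitly, one option is an opener built with ProxyHandler, as in this sketch. The proxy address is the one from getHtml(); make_opener and save_image are hypothetical names, and the demo copies a local file:// URL so that it runs without network access.

```python
import os
import pathlib
import shutil
import tempfile
import urllib.request


def make_opener(proxy="http://127.0.0.1:1080"):
    """Build an opener that sends http and https requests through the proxy."""
    handler = urllib.request.ProxyHandler({"http": proxy, "https": proxy})
    return urllib.request.build_opener(handler)


def save_image(opener, img_url, img_path, timeout=120):
    """Download img_url to img_path unless the file already exists."""
    if os.path.exists(img_path):
        return False
    with opener.open(img_url, timeout=timeout) as response, open(img_path, "wb") as f_save:
        shutil.copyfileobj(response, f_save)
    return True


# Offline demo: "download" a local file through the same code path.
opener = make_opener()
with tempfile.TemporaryDirectory() as tmp:
    src = pathlib.Path(tmp, "src.png")
    src.write_bytes(b"fake image bytes")
    dst = os.path.join(tmp, "dst.png")
    first = save_image(opener, src.as_uri(), dst)
    second = save_image(opener, src.as_uri(), dst)  # skipped: file exists
    copied = pathlib.Path(dst).read_bytes()

print(first, second, copied)
```

The with statement closes both the response and the output file, so no explicit flush() or close() call is needed.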

3. Code

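#coding=utf-8
# Python 3
import os                        # file-system helpers
import re                        # regular expressions
import time
import urllib.request            # used to download the images
import requests                  # used to fetch the pages through the proxy
from PIL import Image            # imported by the original script but not used below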

def getHtml(url):
    proxies = {"https": "127.0.0.1:1080"}
    response = requests.get(url, proxies = proxies)
    return response.text

def getImg(body):
    # Strip the hidden auto-redirect block
    pattern1 = re.compile(r'<div style="display:none;">(.*?)</div>',flags=re.S)
    match1 = pattern1.sub('',body)
    #print(match1)
    
    # Strip recommendations and ads
    pattern2 = re.compile(r'</article>(.*)</script>',flags=re.S)
    match2 = pattern2.sub('</div>',match1)
    #print(match2)
    
    # Strip the banner at the very top of the CSDN page
    pattern2_tmp = re.compile(r'b site By baidu end\n    </script>\n    (.*?)    <script src="https://csdnimg.cn/rabbit/exposure-click/main-1.0.6.js"></script>',flags=re.S)
    match2_tmp = pattern2_tmp.sub('',match2)
    #print(match2_tmp)
    
    # Drop the skin CSS so the page fills the screen
    pattern3 = re.compile(r'https://csdnimg.cn/release/phoenix/themes/skin-yellow/skin-yellow-2eefd34acf.min.css',flags=re.S)
    match3 = pattern3.sub('',match2_tmp)
    #print(match3)
    
    # Rename the container class so the page uses the full width instead of a fixed-width centered column
    pattern4 = re.compile(r'container',flags=re.S)
    match4 = pattern4.sub('container_bak',match3)
    #print(match4)
    
    # Use the page title as the file name
    title = ''
    title_list = re.findall('<title>(.*?) - kevinhanser - CSDN博客</title>',match4)
    #print(title_list)
    title = str(title_list[0])
    print(title)
    
    # In Python 2, .decode('string_escape') was needed to get readable Chinese text
    #title = str(title_list[0]).decode('string_escape')
    
    # Collect the image names
    picture_list = re.findall('https://i.imgur.com/(.*?).png',match4)
    picture = picture_list
    #print(picture)
    
    # Locate the images in the body
    #match5 = pattern5.findall(match4)
    pattern5 = re.compile(r'<p>(.*)i.imgur.com')
    # Point the image links at the local copies (./image/ + title)
    match5 = pattern5.sub('<img src="'+'./image/'+title,match4)
    
    # Write the body to an HTML file, encoded as UTF-8
    with open(title+'.html','w',encoding='utf8') as f:
        f.write(match5)
    
    # Report posts that contain no images
    if len(picture) == 0:
        print("%s contains no images" % title)
    
    # Otherwise save the images, using the title as the directory name
    for i in range(len(picture)):
        if not os.path.exists('./image/'+title):
            os.makedirs('./image/'+title)
            
        #print(picture[i])
        img_path = './image/'+title+'/'+picture[i]+'.png'
        img_url = 'https://i.imgur.com/'+picture[i]+'.png'
        print("%s starting..." % img_url)
        #print(img_path)
        
        # Download img_url and save it locally as img_path
        if not os.path.exists(img_path):
            with urllib.request.urlopen(img_url, timeout=120) as response, open(img_path, 'wb') as f_save:
                f_save.write(response.read())
        
            # Requesting too fast raises errors; the retry loop in the main block handles that
            #time.sleep(15)
            
        print(picture[i]+'.png'+" downloaded")


if __name__ == '__main__':
    
    # kali.txt holds the URLs of the CSDN posts, one per line
    with open("kali.txt","r") as f:
        URL_list = f.readlines()
    print(URL_list)
    
    for url in URL_list:
        # Retry on errors, up to 30 attempts per URL
        attempts = 0
        success = False
        print(url)
        while attempts < 30 and not success:
            try:
                # The two calls that do the real work
                body = getHtml(url.strip("\n"))     # fetch the page
                getImg(body)                        # filter it and save it with its images
                success = True
            except Exception:
                attempts += 1
                # Pause for 10 seconds after every fifth failure
                if attempts % 5 == 0:
                    time.sleep(10)
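
The retry loop in the main block can also be factored into a small helper, which makes the policy easier to test in isolation. The sketch below keeps the same policy (up to 30 attempts, a pause after every fifth failure). The name with_retries and its parameters are assumptions, and the demo uses a zero-second pause and a fake task instead of real network calls.

```python
import time


def with_retries(task, max_attempts=30, pause_every=5, pause_seconds=10):
    """Call task() until it succeeds; return (succeeded, failed_attempts)."""
    attempts = 0
    while attempts < max_attempts:
        try:
            task()
            return True, attempts
        except Exception:
            attempts += 1
            # Back off briefly after every pause_every-th failure.
            if attempts % pause_every == 0:
                time.sleep(pause_seconds)
    return False, attempts


calls = []


def flaky():
    """Fake task that fails twice, then succeeds."""
    calls.append(1)
    if len(calls) < 3:
        raise RuntimeError("simulated network error")


outcome = with_retries(flaky, pause_seconds=0)
print(outcome)
```

In the script, the task would be a function that calls getHtml() and getImg() for one URL.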
