"Playing with Python": Building a Crawler for 100,000 Blog Posts

Posted by 小柒2012 on 2019-07-30

Original URL: https://www.cnblogs.com/smallSevens/p/11269447.html

Python, Crawler

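Preface

This post uses crawling articles from cnblogs (博客园) as its example and is for study and reference only. Some sites are so plastered with ads that they are hardly worth a crawler's effort.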

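Crawling

Fetch the post with BeautifulSoup
Convert the HTML to Markdown with html2text
Save the Markdown to a local file
Download the images referenced in the Markdown and replace their URLs
Write the result to the database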
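
The steps above can be sketched as one function that chains them in order. This is only an illustration: `crawl_one` and its parameter names are hypothetical, each step is passed in as a callable so the sketch stays independent of any particular library, and storing the Markdown as converted (before image URLs are rewritten) is an assumption.

```python
from typing import Callable, Dict


def crawl_one(
    url: str,
    fetch: Callable[[str], Dict[str, str]],
    to_markdown: Callable[[str], str],
    save: Callable[[str, str], str],
    localize_images: Callable[[str], None],
    store: Callable[[str, str, str], None],
) -> str:
    """Run the five steps for a single post and return the saved file path."""
    info = fetch(url)                    # 1. fetch the title and HTML body
    md = to_markdown(info["content"])    # 2. convert HTML to Markdown
    md_file = save(md, info["title"])    # 3. write the Markdown file
    localize_images(md_file)             # 4. download images, rewrite URLs
    store(info["title"], md, url)        # 5. insert a row into the database
    return md_file
```

Passing the steps in as arguments also makes it easy to try the flow with stand-in functions before pointing it at a real site.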

Tools

Third-party libraries used: BeautifulSoup, html2text, and PooledDB (from DBUtils). The code below also uses requests for HTTP.

Code

Fetching a post:

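import requests
from bs4 import BeautifulSoup

# Request headers; the User-Agent value here is an assumed placeholder
headers = {'User-Agent': 'Mozilla/5.0'}


# Get the title and the article content
def getHtml(blog):
    res = requests.get(blog, headers=headers)
    soup = BeautifulSoup(res.text, 'html.parser')
    # Get the post title
    title = soup.find('h1', class_='postTitle').text
    # Strip surrounding whitespace
    title = title.strip()
    # Get the post body
    content = soup.find('div', class_='blogpost-body')
    # Drop the outer DIV and keep only its inner HTML
    content = content.decode_contents(formatter="html")
    info = {"title": title, "content": content}
    return info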

Converting HTML to Markdown:

# Uses the open-source third-party library html2text
import html2text
text_maker = html2text.HTML2Text()
md = text_maker.handle(info['content'])

Saving to a local file:

import codecs
import sys


def createFile(md, title):
    print('System default encoding: {}'.format(sys.getdefaultencoding()))
    save_file = str(title) + ".md"
    print('About to write file: {}'.format(save_file))
    # r+  open for reading and writing; the file pointer starts at the beginning.
    # w+  open for reading and writing; an existing file is overwritten, a missing file is created.
    # a+  open for reading and writing; the pointer starts at the end (append mode), a missing file is created.
    with codecs.open(save_file, 'w+', 'utf-8') as f:
        f.write(md)
    print('Finished writing file: {}'.format(save_file))
    return save_file
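
Post titles can contain characters such as `/` or `:` that are not valid in file names, in which case building the name directly from the title may fail or write to an unexpected place. Below is a minimal sketch of a sanitizer; `safe_filename`, its replacement character, and the length limit are assumptions, not part of the original code.

```python
import re


def safe_filename(title: str, max_length: int = 100) -> str:
    """Turn a post title into a name that is safe on common file systems."""
    # Replace characters that Windows or POSIX reject in file names.
    name = re.sub(r'[\\/:*?"<>|\r\n\t]', "_", title)
    # Trim surrounding whitespace and trailing dots.
    name = name.strip().rstrip(".")
    # Fall back to a fixed name if nothing usable is left.
    name = name[:max_length] or "untitled"
    return name + ".md"
```

The result could be used in place of the title-based name when saving, for example `safe_filename(info['title'])`.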

Downloading images locally and replacing their URLs:

import os
import re
import time
from itertools import chain

import requests

# headers and img_patten are expected to exist at module level.
# img_patten needs more than one capture group: the chain() call below
# flattens the tuples that findall() returns in that case.


def replace_md_url(md_file):
    """
    Download the images referenced in the given Markdown file and replace their URLs
    """

    if os.path.splitext(md_file)[1] != '.md':
        print('{} is not a Markdown file; skipping.'.format(md_file))
        return

    cnt_replace = 0
    # Images are stored in a directory named after the current year and month
    dir_ts = time.strftime('%Y%m', time.localtime())
    # Create the directory if it does not exist yet
    os.makedirs(dir_ts, exist_ok=True)
    with open(md_file, 'r', encoding='utf-8') as f:  # open as UTF-8
        post = f.read()

    matches = re.compile(img_patten).findall(post)
    for match in chain(*matches):
        if match:
            file_name = dir_ts + "/" + match.split('/')[-1]
            img = requests.get(match, headers=headers)
            # 'wb' rather than 'ab', so a re-run does not append to an existing image
            with open(file_name, 'wb') as img_file:
                img_file.write(img.content)
            new_url = "https://blog.52itstyle.vip/{}".format(file_name)
            # Update the URL in the Markdown text
            post = post.replace(match, new_url)
            cnt_replace = cnt_replace + 1

    # If anything was replaced, overwrite the current Markdown file
    if post and cnt_replace > 0:
        url = "https://blog.52itstyle.vip"
        with open(md_file, 'w', encoding='utf-8') as f:
            f.write(post)
        print('{1} URL(s) in {0} replaced with {2}/{3}'.format(os.path.basename(md_file), cnt_replace, url, dir_ts))
    elif cnt_replace == 0:
        print('No URLs to replace in {}'.format(os.path.basename(md_file)))
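
The function above relies on an `img_patten` that is not shown in the post. Here is a minimal sketch of what such a pattern could look like and why the `findall()` results need flattening. `IMG_PATTERN` and `find_image_urls` are hypothetical names, and the regular expression is an assumption rather than the author's actual pattern.

```python
import re
from itertools import chain

# Hypothetical pattern: group 1 captures Markdown image URLs,
# group 2 captures the src of inline <img> tags.
IMG_PATTERN = r'!\[[^\]]*\]\(([^)\s]+)[^)]*\)|<img[^>]*?src=["\']([^"\']+)["\']'


def find_image_urls(markdown: str) -> list:
    """Return every image URL referenced in the text, in order of appearance."""
    # With two groups, findall() yields one (md_url, html_url) tuple per match;
    # the group that did not take part in the match is an empty string.
    matches = re.findall(IMG_PATTERN, markdown)
    # Flatten the tuples and drop the empty strings.
    return [url for url in chain(*matches) if url]
```

Because each match fills only one of the two groups, the empty-string check in the loop is what keeps blank entries from being requested as URLs.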

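Writing to the database:

# Write to the database
# mysql is the author's database helper (presumably built on PooledDB); it is not shown here
def write_db(title, content, url):
    sql = "INSERT INTO blog (title, content, url) VALUES (%(title)s, %(content)s, %(url)s);"
    param = {"title": title, "content": content, "url": url}
    mysql.insert(sql, param)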
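
To try the same parameterized insert without a MySQL server, here is a minimal sketch that swaps in the standard-library `sqlite3` module as a stand-in. It is not the author's `mysql` helper or PooledDB; the table schema (including the `id` column) and the extra `conn` parameter are assumptions.

```python
import sqlite3


def write_db(conn: sqlite3.Connection, title: str, content: str, url: str) -> int:
    """Insert one post and return the id of the new row."""
    # sqlite3 uses :name placeholders; MySQL drivers such as PyMySQL use %(name)s.
    sql = "INSERT INTO blog (title, content, url) VALUES (:title, :content, :url)"
    param = {"title": title, "content": content, "url": url}
    # Using the connection as a context manager commits on success
    # and rolls back if the statement raises.
    with conn:
        cursor = conn.execute(sql, param)
    return cursor.lastrowid


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE blog (id INTEGER PRIMARY KEY, title TEXT, content TEXT, url TEXT)"
)
row_id = write_db(conn, "Hello", "# Hello", "https://example.com/p/1.html")
```

In both drivers the values travel separately from the SQL text, so quotes or other special characters in a post body do not break the statement.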

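Summary

In the Internet era, open blogging communities are certainly convenient, but they can also disappear at any time, so it is best to keep a local backup of your own posts. You can also pick bloggers you like and crawl their posts for your own collection.

Source code: https://gitee.com/52itstyle/Python

Demo: https://blog.52itstyle.top

List: https://blog.52itstyle.top/index

Detail: https://blog.52itstyle.top/49.shtml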
