Python Crawler Tutorial for Beginners: Zhihu Article Image Scraper

Posted by 程式設計師啟航 on 2019-07-20

Original URL: http://blog.itpub.net/69913713/viewspace-2651230/

1. Zhihu Article Image Scraper, Part 2: Background

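Yesterday I wrote part of the code for the Zhihu article image scraper, which fetched the JSON data for the answers to a Zhihu question. That post contained some hard-coded values. Today I replace those values and add image downloading to the code.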

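First, the scraper should work for any Zhihu question: you enter the question ID, and it fetches the page information for that question, most importantly the total number of answers.
The question ID is the number in the question URL (highlighted in red in the original post's screenshot).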
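
To make the ID concrete, here is a minimal sketch that accepts either a bare ID or a pasted question URL and returns the numeric ID. The helper name `extract_question_id` and the sample values are assumptions for illustration and are not part of the original post. The URL shape is taken from the `BASE_URL` used in the code.

```python
import re
from typing import Optional

# Hypothetical helper, not from the original post.
# Matches the numeric part of https://www.zhihu.com/question/<id>
QUESTION_URL_PATTERN = re.compile(r"zhihu\.com/question/([0-9]+)")


def extract_question_id(text: str) -> Optional[str]:
    """Return the numeric question ID from a bare ID or a question URL."""
    text = text.strip()
    # A bare ID made only of ASCII digits is returned unchanged.
    if re.fullmatch(r"[0-9]+", text):
        return text
    # Otherwise try to find the ID inside a question URL.
    match = QUESTION_URL_PATTERN.search(text)
    return match.group(1) if match else None
```

A helper like this could run before the input loop so that users can paste a full URL instead of copying the number by hand.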

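The code below checks whether the user's input is a valid ID (digits only), then builds the question URL and reads from the page how many answers the question has.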

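import re
import time

import pymongo
import requests

DATABASE_IP = '127.0.0.1'
DATABASE_PORT = 27017
DATABASE_NAME = 'sun'
# Database.authenticate() was removed in PyMongo 4.0, so the credentials
# are passed to MongoClient instead.
client = pymongo.MongoClient(DATABASE_IP, DATABASE_PORT,
                             username="dba", password="dba",
                             authSource=DATABASE_NAME)
db = client[DATABASE_NAME]
collection = db.zhihuone  # collection the scraped data will be inserted into
BASE_URL = "https://www.zhihu.com/question/{}"


def get_totle_answers(article_id):
    """Return the answer count of the question as a string of digits."""
    headers = {
        # Replace this with your own complete user-agent string
        "user-agent": "Mozilla/5.0 (Windows NT 10.0; WOW64)"
    }
    with requests.Session() as s:
        with s.get(BASE_URL.format(article_id), headers=headers, timeout=3) as rep:
            html = rep.text
            pattern = re.compile(r'<meta itemProp="answerCount" content="(\d*?)"/>')
            match = pattern.search(html)
            if match is None or not match.group(1):
                # Tag not found: invalid ID, blocked request, or changed markup
                return "0"
            print("Found {} records".format(match.group(1)))
            return match.group(1)


if __name__ == '__main__':
    # Keep prompting until the user enters a purely numeric ID
    article_id = ""
    while not article_id.isdigit():
        article_id = input("Please enter the article ID: ")
    totle = get_totle_answers(article_id)
    if int(totle) > 0:
        # ZhihuOne is not defined in this snippet; presumably it is the
        # scraper class from the previous post
        zhi = ZhihuOne(article_id, totle)
        zhi.run()
    else:
        print("No data found!")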
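
The answer-count lookup can be tried without touching the network. The sketch below applies the same regular expression to a made-up HTML fragment. The function name `parse_answer_count` and the sample markup are assumptions for illustration, and the real page markup may differ or may have changed since the post was written.

```python
import re

# Same pattern as in the post, written as a raw string.
ANSWER_COUNT_PATTERN = re.compile(
    r'<meta itemProp="answerCount" content="(\d*?)"/>'
)


def parse_answer_count(html: str) -> int:
    """Return the answer count found in the page HTML, or 0 if absent."""
    match = ANSWER_COUNT_PATTERN.search(html)
    # No tag, or a tag with an empty content attribute, counts as zero.
    if match is None or not match.group(1):
        return 0
    return int(match.group(1))


# Made-up fragment standing in for a downloaded question page.
sample_html = '<div><meta itemProp="answerCount" content="42"/></div>'
print(parse_answer_count(sample_html))  # prints 42
```

Returning 0 when the tag is missing lets the caller fall through to the "no data" branch instead of failing on a `None` match.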

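Next, complete the image download part. Inspecting the response shows that the image URLs are stored in the content field of the JSON, so we use a simple regular expression to match them. The original post shows the details in a screenshot.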
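
As a rough illustration of the matching step, the sketch below pulls image URLs out of a made-up `content` string. The function name `extract_image_urls`, the sample HTML, and the `example.com` hosts are assumptions for illustration. Only the two regular expressions come from the post's code.

```python
import re
from typing import List

# Each image in the answer HTML is wrapped in a <noscript> block.
NOSCRIPT_PATTERN = re.compile(r"<noscript>(.*?)</noscript>")
IMG_SRC_PATTERN = re.compile(r'<img src="(.*?)"')


def extract_image_urls(content: str) -> List[str]:
    """Return the src URL of the first <img> in every <noscript> block."""
    urls = []
    for block in NOSCRIPT_PATTERN.findall(content):
        match = IMG_SRC_PATTERN.search(block)
        if match:  # skip <noscript> blocks that contain no <img src="...">
            urls.append(match.group(1))
    return urls


# Made-up content field with two images and some plain text.
sample_content = (
    '<p>text</p>'
    '<noscript><img src="https://pic2.example.com/a.jpg" width="100"/></noscript>'
    '<p>more text</p>'
    '<noscript><img src="https://pic3.example.com/b.png"/></noscript>'
)
print(extract_image_urls(sample_content))
```

Keeping the extraction in a pure function makes it easy to check against saved responses before any download is attempted.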

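Now write the code. Read the comments in the code below carefully. There is one small bug: pic3 in the image URL has to be changed to pic2 by hand. The cause is unclear for now, and it may be my local network. Also, create a folder named imgs in the project root beforehand to store the images.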

    def download_img(self, data):
        # Download every image referenced in the answers' content
        for item in data["data"]:
            content = item["content"]
            pattern = re.compile(r'<noscript>(.*?)</noscript>')
            imgs = pattern.findall(content)
            for img in imgs:
                match = re.search(r'<img src="(.*?)"', img)
                if match is None:
                    continue  # <noscript> block without an <img src="...">
                download = match.group(1)
                download = download.replace("pic3", "pic2")  # small bug: pic3 URLs cannot be downloaded
                print("Downloading {} ".format(download), end="")
                try:
                    with requests.Session() as s:
                        with s.get(download, timeout=3) as img_down:
                            # The file name is the last path segment of the URL
                            file = download[download.rindex("/") + 1:]
                            with open("imgs/{}".format(file), "wb") as f:  # output directory is hard-coded
                                f.write(img_down.content)
                            print("download finished")
                except Exception as e:
                    print(e.args)
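
The two manual steps mentioned above (creating the imgs folder and swapping pic3 for pic2) can also be handled in code. The following is a minimal sketch under stated assumptions: the helper names `to_pic2`, `image_file_name`, and `local_image_path` and the `example.com` URLs are hypothetical and not part of the original post. Unlike the `rindex("/")` slice in the post, this version drops any query string from the file name.

```python
import os
import posixpath
import tempfile
from urllib.parse import urlsplit


def to_pic2(url: str) -> str:
    """Apply the post's workaround: request pic2 instead of pic3."""
    return url.replace("pic3", "pic2")


def image_file_name(url: str) -> str:
    """Return the last path segment of the URL, without any query string."""
    return posixpath.basename(urlsplit(url).path)


def local_image_path(url: str, directory: str = "imgs") -> str:
    """Create the target directory if needed and return the save path."""
    os.makedirs(directory, exist_ok=True)  # no error if it already exists
    return os.path.join(directory, image_file_name(url))


# Demo in a temporary directory so nothing is left behind.
with tempfile.TemporaryDirectory() as tmp:
    url = to_pic2("https://pic3.example.com/v2-abc.jpg?source=x")
    target = local_image_path(url, os.path.join(tmp, "imgs"))
    print(url)
    print(os.path.basename(target))
```

With a helper like `local_image_path`, the imgs folder no longer has to be created by hand before the first run.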

The run result is as follows (shown as an image in the original post).

Then, while browsing Zhihu, I came across a lot of good questions.

Source: ITPUB Blog, link: http://blog.itpub.net/69913713/viewspace-2651230/. If you repost this article, please credit the source; otherwise legal action will be pursued.
