Getting the URL
Open the page of a Zhihu question and press F12 to bring up the developer tools, then switch to the Network panel.
The Network panel shows the resources the page requests from the server, the size of each resource, how long it took to load, and which resources failed to load, as well as the HTTP request headers, response bodies, and so on.
Taking the question “你有哪些可愛的貓貓照片?” (“What cute cat photos do you have?”) as an example, the Network panel looks like this:
Press Ctrl + F and search for a snippet of text that appears in one of the answers; this locates the target url and its response:
Next, install the required packages. Most of them are straightforward; the one that needs attention is cv2, the Python image-processing package (OpenCV), whose install command is:
pip install opencv-python
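A quick way to verify the installation (the package installs as opencv-python but is imported as cv2):

import cv2  # provided by the opencv-python package
print(cv2.__version__)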
URL Analysis
1. Parameter analysis
The URL we just captured is:
https://www.zhihu.com/api/v4/questions/356541789/answers?include=data%5B*%5D.is_normal%2Cadmin_closed_comment%2Creward_info%2Cis_collapsed%2Cannotation_action%2Cannotation_detail%2Ccollapse_reason%2Cis_sticky%2Ccollapsed_by%2Csuggest_edit%2Ccomment_count%2Ccan_comment%2Ccontent%2Ceditable_content%2Cattachment%2Cvoteup_count%2Creshipment_settings%2Ccomment_permission%2Ccreated_time%2Cupdated_time%2Creview_info%2Crelevant_info%2Cquestion%2Cexcerpt%2Cis_labeled%2Cpaid_info%2Cpaid_info_content%2Crelationship.is_authorized%2Cis_author%2Cvoting%2Cis_thanked%2Cis_nothelp%2Cis_recognized%3Bdata%5B*%5D.mark_infos%5B*%5D.url%3Bdata%5B*%5D.author.follower_count%2Cvip_info%2Cbadge%5B*%5D.topics%3Bdata%5B*%5D.settings.table_of_content.enabled&offset=&limit=3&sort_by=default&platform=desktop
It carries the following parameters:
- limit: the number of answers returned per page
- offset: the offset into the answer list
- sort_by: the sort order of the answers, either the default ranking or by time
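As a side note, there is no need to hand-encode that long query string; requests can build it from a plain dict. A small sketch (the include value is truncated here for readability, so substitute the full value from the URL above):

import requests

# Let requests handle the percent-encoding of the query string.
params = {
    "include": "data[*].is_normal,comment_count,content,voteup_count",  # truncated
    "offset": 0,
    "limit": 3,
    "sort_by": "default",
    "platform": "desktop",
}
resp = requests.get(
    "https://www.zhihu.com/api/v4/questions/356541789/answers",
    params=params,
)
print(resp.url)  # the fully encoded URL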
2. Parsing the response
Let's try sending a request and capturing the HTTP response:
# python3
import requests

if __name__ == '__main__':
    target_url = "https://www.zhihu.com/api/v4/questions/356541789/answers?include=data%5B*%5D.is_normal%2Cadmin_closed_comment%2Creward_info%2Cis_collapsed%2Cannotation_action%2Cannotation_detail%2Ccollapse_reason%2Cis_sticky%2Ccollapsed_by%2Csuggest_edit%2Ccomment_count%2Ccan_comment%2Ccontent%2Ceditable_content%2Cattachment%2Cvoteup_count%2Creshipment_settings%2Ccomment_permission%2Ccreated_time%2Cupdated_time%2Creview_info%2Crelevant_info%2Cquestion%2Cexcerpt%2Cis_labeled%2Cpaid_info%2Cpaid_info_content%2Crelationship.is_authorized%2Cis_author%2Cvoting%2Cis_thanked%2Cis_nothelp%2Cis_recognized%3Bdata%5B*%5D.mark_infos%5B*%5D.url%3Bdata%5B*%5D.author.follower_count%2Cvip_info%2Cbadge%5B*%5D.topics%3Bdata%5B*%5D.settings.table_of_content.enabled&offset=&limit=3&sort_by=default&platform=desktop"
    headers = {
        'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.119 Safari/537.36',
    }
    response = requests.get(url=target_url, headers=headers)
    html = response.text
    print(html)
The response we get back is shown below. What we need are the links to all the images: running the HTTP response body through a JSON formatter shows exactly where the images sit in the JSON, and from there the crawler only has to pull out the download addresses:
Tips: note that a site's response layout changes from time to time, and different sites organize their responses differently, so don't copy any of this blindly.
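For that reason it is worth poking at the structure yourself. Continuing from the request snippet above (target_url and headers as defined there):

import json

# Pretty-print the body to see where the image links live; per the
# response above, the list of answers sits under the 'data' key.
content = json.loads(response.text)
print(list(content.keys()))
print(json.dumps(content, indent=2, ensure_ascii=False)[:2000])  # a readable peek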
3. Getting the URLs of all answers
Still using the trick of searching for answer keywords in the developer tools, we can collect the urls of several answers and look for a pattern among them:
https://www.zhihu.com/api/v4/questions/356541789/answers?include=data%5B%2A%5D.is_normal%2Cadmin_closed_comment%2Creward_info%2Cis_collapsed%2Cannotation_action%2Cannotation_detail%2Ccollapse_reason%2Cis_sticky%2Ccollapsed_by%2Csuggest_edit%2Ccomment_count%2Ccan_comment%2Ccontent%2Ceditable_content%2Cattachment%2Cvoteup_count%2Creshipment_settings%2Ccomment_permission%2Ccreated_time%2Cupdated_time%2Creview_info%2Crelevant_info%2Cquestion%2Cexcerpt%2Cis_labeled%2Cpaid_info%2Cpaid_info_content%2Crelationship.is_authorized%2Cis_author%2Cvoting%2Cis_thanked%2Cis_nothelp%2Cis_recognized%3Bdata%5B%2A%5D.mark_infos%5B%2A%5D.url%3Bdata%5B%2A%5D.author.follower_count%2Cvip_info%2Cbadge%5B%2A%5D.topics%3Bdata%5B%2A%5D.settings.table_of_content.enabled&limit=3&offset=3&platform=desktop&sort_by=default
https://www.zhihu.com/api/v4/questions/356541789/answers?include=data%5B%2A%5D.is_normal%2Cadmin_closed_comment%2Creward_info%2Cis_collapsed%2Cannotation_action%2Cannotation_detail%2Ccollapse_reason%2Cis_sticky%2Ccollapsed_by%2Csuggest_edit%2Ccomment_count%2Ccan_comment%2Ccontent%2Ceditable_content%2Cattachment%2Cvoteup_count%2Creshipment_settings%2Ccomment_permission%2Ccreated_time%2Cupdated_time%2Creview_info%2Crelevant_info%2Cquestion%2Cexcerpt%2Cis_labeled%2Cpaid_info%2Cpaid_info_content%2Crelationship.is_authorized%2Cis_author%2Cvoting%2Cis_thanked%2Cis_nothelp%2Cis_recognized%3Bdata%5B%2A%5D.mark_infos%5B%2A%5D.url%3Bdata%5B%2A%5D.author.follower_count%2Cvip_info%2Cbadge%5B%2A%5D.topics%3Bdata%5B%2A%5D.settings.table_of_content.enabled&limit=3&offset=0&platform=desktop&sort_by=default
Although the urls are not literally identical, they basically all follow the format below; only the offset parameter needs to change:
https://www.zhihu.com/api/v4/questions/356541789/answers?include=data%5B%2A%5D.is_normal%2Cadmin_closed_comment%2Creward_info%2Cis_collapsed%2Cannotation_action%2Cannotation_detail%2Ccollapse_reason%2Cis_sticky%2Ccollapsed_by%2Csuggest_edit%2Ccomment_count%2Ccan_comment%2Ccontent%2Ceditable_content%2Cattachment%2Cvoteup_count%2Creshipment_settings%2Ccomment_permission%2Ccreated_time%2Cupdated_time%2Creview_info%2Crelevant_info%2Cquestion%2Cexcerpt%2Cis_labeled%2Cpaid_info%2Cpaid_info_content%2Crelationship.is_authorized%2Cis_author%2Cvoting%2Cis_thanked%2Cis_nothelp%2Cis_recognized%3Bdata%5B%2A%5D.mark_infos%5B%2A%5D.url%3Bdata%5B%2A%5D.author.follower_count%2Cvip_info%2Cbadge%5B%2A%5D.topics%3Bdata%5B%2A%5D.settings.table_of_content.enabled&limit=3&offset=0&platform=desktop&sort_by=default
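In other words, walking through every answer is just a matter of stepping offset by limit. A minimal sketch (the long include parameter is omitted here for brevity; append the full value in real use):

# Step the offset page by page to enumerate all answers.
BASE_URL = ("https://www.zhihu.com/api/v4/questions/356541789/answers"
            "?limit={limit}&offset={offset}&sort_by=default&platform=desktop")

limit = 3
for page in range(5):
    print(BASE_URL.format(limit=limit, offset=page * limit))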
Code
1. Simulating the request
Adding headers is enough here: Zhihu's checks are not as strict as some other sites', although requests that come too frequently do get blocked for a while. I handle that simply with random request headers and proxy IPs:
def get_http_content(number, offset):
    """Request one page of answers for a Zhihu question and return the parsed JSON.

    Args:
        number: the question's unique id
        offset: the paging offset
    """
    target_url = "https://www.zhihu.com/api/v4/questions/{number}/answers?include=data%5B*%5D.is_normal%2Cadmin_closed_comment%2" \
        "Creward_info%2Cis_collapsed%2Cannotation_action%2Cannotation_detail%2Ccollapse_reason%2Cis_sticky%2Ccollapsed_by%2" \
        "Csuggest_edit%2Ccomment_count%2Ccan_comment%2Ccontent%2Ceditable_content%2Cvoteup_count%2Creshipment_settings%2Ccomment_permission%2" \
        "Ccreated_time%2Cupdated_time%2Creview_info%2Crelevant_info%2Cquestion%2Cexcerpt%2Crelationship.is_authorized%2Cis_author%2Cvoting%2" \
        "Cis_thanked%2Cis_nothelp%2Cis_labeled%2Cis_recognized%2Cpaid_info%2Cpaid_info_content%3Bdata%5B*%5D.mark_infos%5B*%5D.url%3Bdata%5B*%5" \
        "D.author.follower_count%2Cbadge%5B*%5D.topics&offset={offset}&limit={limit}&sort_by=default&platform=desktop".format(
            number=number, offset=offset, limit=limit)
    logger.info("target_url:{}", target_url)
    headers = {
        'User-Agent': fake_useragent.get_random_useragent(),
    }
    ip = IPPool().get_random_key()
    proxies = {"http": "http://" + ip}
    response = requests.get(target_url, headers=headers, proxies=proxies)
    # requests.get raises on connection errors rather than returning None,
    # so only the status code needs checking here.
    if response.status_code != 200:
        logger.warning("http request failed, number={}, offset={}, status_code={}",
                       number, offset, response.status_code)
        return None
    html = response.text
    return json.loads(html)
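The fake_useragent and IPPool helpers used above are not shown in this post. Hypothetical minimal stand-ins that match the calls made here (with placeholder values you would replace with your own) could look like:

import random

# fake_useragent stand-in (hypothetical): a module exposing get_random_useragent().
_AGENTS = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
    '(KHTML, like Gecko) Chrome/72.0.3626.119 Safari/537.36',
    # ...more User-Agent strings...
]

def get_random_useragent():
    """Pick a random User-Agent string."""
    return random.choice(_AGENTS)

# IPPool stand-in (hypothetical): hands out random 'ip:port' proxy strings.
class IPPool:
    _PROXIES = ['127.0.0.1:8888']  # placeholders; use live proxies

    def get_random_key(self):
        return random.choice(self._PROXIES)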
2. Parsing out the image URLs
def start_crawl():
    """Crawl the answers and download the pictures."""
    for i in range(0, max_pages):
        offset = limit * i
        logger.info("download pictures with offset {}", offset)
        # fetch one page of answers
        content_dict = get_http_content(number, offset)
        if content_dict is None:
            logger.error(
                "get http resp fail, number={} offset={}", number, offset)
            continue
        # content_dict['data'] holds the list of answers
        if 'data' not in content_dict:
            logger.error("parse data from http resp fail, dict={}", content_dict)
            continue
        for answer_text in content_dict['data']:
            logger.info(
                "get pictures from answer: https://www.zhihu.com/question/{}/answer/{}", number, answer_text['id'])
            if 'content' not in answer_text:
                logger.error(
                    "parse content from answer text fail, text={}", answer_text)
                continue
            answer_content = pq(answer_text['content'])
            img_urls = answer_content.find('noscript').find('img')
            # log answers that contain no pictures, to make debugging easier
            if len(list(img_urls)) <= 0:
                logger.warning(
                    "this answer has no pictures, url:https://www.zhihu.com/question/{}/answer/{}", number, answer_text['id'])
                continue
            for img_url in img_urls.items():
                # src example: https://pic2.zhimg.com/50/v2-c970108cd260ea095383627362c1d04f_720w.jpg?source=1940ef5c
                src = img_url.attr("src")
                # strip the query string, then pull out the file extension (.jpeg, .gif, ...)
                source_index = src.rfind('?source')
                if source_index == -1:
                    logger.error("find source index fail, src:{}", src)
                    continue
                trimmed = src[0:source_index]
                suffix_index = trimmed.rfind('.')
                if suffix_index == -1:
                    logger.error("find suffix fail, src:{}", src)
                    continue
                suffix = trimmed[suffix_index:]
                logger.info("get picture url, src:{} suffix:{}", src, suffix)
                store_picture(src, suffix)
        # be polite: pause between pages
        time.sleep(1)
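For completeness, the module-level imports and configuration these functions rely on look roughly like this (the values are examples; the brace-style log calls match loguru's API, which is assumed here):

import json
import os
import time
import uuid

import cv2
import requests
from loguru import logger          # assumed from the {}-style log calls
from pyquery import PyQuery as pq  # parses the answer HTML

number = 356541789           # the question id from the URL
limit = 3                    # answers fetched per request
max_pages = 100              # how many pages to walk through
picture_path = './pictures'  # download directory; create it beforehand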
3. Saving the images locally
def store_picture(img_url, suffix):
    """Save one picture into the download directory.

    Args:
        img_url: the picture's URL
        suffix: the file extension, e.g. '.jpg' or '.gif'
    """
    headers = {
        'User-Agent': fake_useragent.get_random_useragent(),
    }
    ip = IPPool().get_random_key()
    proxies = {"http": "http://" + ip}
    http_resp = requests.get(img_url, headers=headers, proxies=proxies)
    if http_resp.status_code != 200:
        logger.warning("get http resp fail, url={} status_code={}",
                       img_url, http_resp.status_code)
        return
    content = http_resp.content
    # a random uuid keeps file names unique across answers
    with open(f"{picture_path}/{uuid.uuid4()}{suffix}", 'wb') as f:
        f.write(content)
4. Removing the watermark
I had originally planned to use image recognition to mask the watermark out (Zhihu's watermark is simple and uniform in style), but I have too much on my plate at the moment, so instead I simply crop it off with the opencv package:
def crop_watermark(ori_dir, adjusted_dir):
    """Remove the watermark by cropping the bottom of each image.

    Note: this cannot handle gif files.

    Args:
        ori_dir: the directory holding the original images
        adjusted_dir: the directory the cropped images are written to
    """
    img_path_list = os.listdir(ori_dir)  # every file in the directory
    total = len(img_path_list)
    cnt = 1
    for img_path in img_path_list:
        logger.info(
            "overall progress: {}/{}, now handling picture: {}", cnt, total, img_path)
        img_abs_path = ori_dir + '/' + img_path
        img = cv2.imread(img_abs_path)
        if img is None:
            logger.error("cv2.imread fail, picture:{}", img_path)
            continue
        height, width = img.shape[0:2]
        # the watermark sits in the bottom strip, so drop the last 40 rows
        cropped = img[0:height-40, 0:width]
        adjusted_img_abs_path = adjusted_dir + '/' + img_path
        cv2.imwrite(adjusted_img_abs_path, cropped)
        cnt += 1
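Called like so, with hypothetical directory names:

import os

ori_dir = './pictures'              # where store_picture saved the images
adjusted_dir = './pictures_cropped'
os.makedirs(adjusted_dir, exist_ok=True)
crop_watermark(ori_dir, adjusted_dir)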
Final Words
I wrote this program mainly to learn HTML parsing and to sharpen my Python. Looking back at it now, there is honestly nothing remarkable here, but I'm leaving the code for anyone who wants a reference.
Also, this program exists only to automate my own routine of collecting pictures and stripping watermarks. Let me say it once more: please don't let a crawler put pressure on someone else's servers.