Getting the URL
Open the page of a Zhihu question and press F12 to bring up the developer tools, then switch to the Network panel.
The Network panel shows the resources the page requests from the server, the size of each resource, how long it took to load, and which resources failed to load, as well as the HTTP request headers, response bodies, and so on.
Taking the question “你有哪些可愛的貓貓照片?” (“What cute cat photos do you have?”) as an example, the Network panel looks like this:
Press Ctrl + F and search for a snippet of text that appears in one of the answers; this locates the target url and its response:
Next, install the required packages. Most of them are straightforward; the one that needs attention is cv2, the Python image-processing package (OpenCV), whose install command is:
pip install opencv-python
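A quick way to verify the installation (the package installs as opencv-python but is imported as cv2):

import cv2  # provided by the opencv-python package
print(cv2.__version__)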
URL Analysis
1. Parameter analysis
The URL we just captured is:
https://www.zhihu.com/api/v4/questions/356541789/answers?include=data%5B*%5D.is_normal%2Cadmin_closed_comment%2Creward_info%2Cis_collapsed%2Cannotation_action%2Cannotation_detail%2Ccollapse_reason%2Cis_sticky%2Ccollapsed_by%2Csuggest_edit%2Ccomment_count%2Ccan_comment%2Ccontent%2Ceditable_content%2Cattachment%2Cvoteup_count%2Creshipment_settings%2Ccomment_permission%2Ccreated_time%2Cupdated_time%2Creview_info%2Crelevant_info%2Cquestion%2Cexcerpt%2Cis_labeled%2Cpaid_info%2Cpaid_info_content%2Crelationship.is_authorized%2Cis_author%2Cvoting%2Cis_thanked%2Cis_nothelp%2Cis_recognized%3Bdata%5B*%5D.mark_infos%5B*%5D.url%3Bdata%5B*%5D.author.follower_count%2Cvip_info%2Cbadge%5B*%5D.topics%3Bdata%5B*%5D.settings.table_of_content.enabled&offset=&limit=3&sort_by=default&platform=desktop
It carries the following parameters:
- limit: the number of answers returned per page
- offset: the offset into the answer list
- sort_by: the sort order of the answers, either the default ranking or by time
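As a side note, there is no need to hand-encode that long query string; requests can build it from a plain dict. A small sketch (the include value is truncated here for readability, so substitute the full value from the URL above):

import requests

# Let requests handle the percent-encoding of the query string.
params = {
    "include": "data[*].is_normal,comment_count,content,voteup_count",  # truncated
    "offset": 0,
    "limit": 3,
    "sort_by": "default",
    "platform": "desktop",
}
resp = requests.get(
    "https://www.zhihu.com/api/v4/questions/356541789/answers",
    params=params,
)
print(resp.url)  # the fully encoded URL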
2. Parsing the response
Let's try sending a request and capturing the HTTP response:
# python3
import requests

if __name__ == '__main__':
    target_url = "https://www.zhihu.com/api/v4/questions/356541789/answers?include=data%5B*%5D.is_normal%2Cadmin_closed_comment%2Creward_info%2Cis_collapsed%2Cannotation_action%2Cannotation_detail%2Ccollapse_reason%2Cis_sticky%2Ccollapsed_by%2Csuggest_edit%2Ccomment_count%2Ccan_comment%2Ccontent%2Ceditable_content%2Cattachment%2Cvoteup_count%2Creshipment_settings%2Ccomment_permission%2Ccreated_time%2Cupdated_time%2Creview_info%2Crelevant_info%2Cquestion%2Cexcerpt%2Cis_labeled%2Cpaid_info%2Cpaid_info_content%2Crelationship.is_authorized%2Cis_author%2Cvoting%2Cis_thanked%2Cis_nothelp%2Cis_recognized%3Bdata%5B*%5D.mark_infos%5B*%5D.url%3Bdata%5B*%5D.author.follower_count%2Cvip_info%2Cbadge%5B*%5D.topics%3Bdata%5B*%5D.settings.table_of_content.enabled&offset=&limit=3&sort_by=default&platform=desktop"
    headers = {
        'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.119 Safari/537.36',
    }
    response = requests.get(url=target_url, headers=headers)
    html = response.text
    print(html)
The response we get back is shown below. What we need are the links to all the images: running the HTTP response body through a JSON formatter shows exactly where the images sit in the JSON, and from there the crawler only has to pull out the download addresses:
Tips: note that a site's response layout changes from time to time, and different sites organize their responses differently, so don't copy any of this blindly.
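For that reason it is worth poking at the structure yourself. Continuing from the request snippet above (target_url and headers as defined there):

import json

# Pretty-print the body to see where the image links live; per the
# response above, the list of answers sits under the 'data' key.
content = json.loads(response.text)
print(list(content.keys()))
print(json.dumps(content, indent=2, ensure_ascii=False)[:2000])  # a readable peek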
3. Getting the URLs of all answers
Still using the trick of searching for answer keywords in the developer tools, we can collect the urls of several answers and look for a pattern among them:
https://www.zhihu.com/api/v4/questions/356541789/answers?include=data%5B%2A%5D.is_normal%2Cadmin_closed_comment%2Creward_info%2Cis_collapsed%2Cannotation_action%2Cannotation_detail%2Ccollapse_reason%2Cis_sticky%2Ccollapsed_by%2Csuggest_edit%2Ccomment_count%2Ccan_comment%2Ccontent%2Ceditable_content%2Cattachment%2Cvoteup_count%2Creshipment_settings%2Ccomment_permission%2Ccreated_time%2Cupdated_time%2Creview_info%2Crelevant_info%2Cquestion%2Cexcerpt%2Cis_labeled%2Cpaid_info%2Cpaid_info_content%2Crelationship.is_authorized%2Cis_author%2Cvoting%2Cis_thanked%2Cis_nothelp%2Cis_recognized%3Bdata%5B%2A%5D.mark_infos%5B%2A%5D.url%3Bdata%5B%2A%5D.author.follower_count%2Cvip_info%2Cbadge%5B%2A%5D.topics%3Bdata%5B%2A%5D.settings.table_of_content.enabled&limit=3&offset=3&platform=desktop&sort_by=default
https://www.zhihu.com/api/v4/questions/356541789/answers?include=data%5B%2A%5D.is_normal%2Cadmin_closed_comment%2Creward_info%2Cis_collapsed%2Cannotation_action%2Cannotation_detail%2Ccollapse_reason%2Cis_sticky%2Ccollapsed_by%2Csuggest_edit%2Ccomment_count%2Ccan_comment%2Ccontent%2Ceditable_content%2Cattachment%2Cvoteup_count%2Creshipment_settings%2Ccomment_permission%2Ccreated_time%2Cupdated_time%2Creview_info%2Crelevant_info%2Cquestion%2Cexcerpt%2Cis_labeled%2Cpaid_info%2Cpaid_info_content%2Crelationship.is_authorized%2Cis_author%2Cvoting%2Cis_thanked%2Cis_nothelp%2Cis_recognized%3Bdata%5B%2A%5D.mark_infos%5B%2A%5D.url%3Bdata%5B%2A%5D.author.follower_count%2Cvip_info%2Cbadge%5B%2A%5D.topics%3Bdata%5B%2A%5D.settings.table_of_content.enabled&limit=3&offset=0&platform=desktop&sort_by=default
Although the urls are not literally identical, they basically all follow the format below; only the offset parameter needs to change:
https://www.zhihu.com/api/v4/questions/356541789/answers?include=data%5B%2A%5D.is_normal%2Cadmin_closed_comment%2Creward_info%2Cis_collapsed%2Cannotation_action%2Cannotation_detail%2Ccollapse_reason%2Cis_sticky%2Ccollapsed_by%2Csuggest_edit%2Ccomment_count%2Ccan_comment%2Ccontent%2Ceditable_content%2Cattachment%2Cvoteup_count%2Creshipment_settings%2Ccomment_permission%2Ccreated_time%2Cupdated_time%2Creview_info%2Crelevant_info%2Cquestion%2Cexcerpt%2Cis_labeled%2Cpaid_info%2Cpaid_info_content%2Crelationship.is_authorized%2Cis_author%2Cvoting%2Cis_thanked%2Cis_nothelp%2Cis_recognized%3Bdata%5B%2A%5D.mark_infos%5B%2A%5D.url%3Bdata%5B%2A%5D.author.follower_count%2Cvip_info%2Cbadge%5B%2A%5D.topics%3Bdata%5B%2A%5D.settings.table_of_content.enabled&limit=3&offset=0&platform=desktop&sort_by=default
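In other words, walking through every answer is just a matter of stepping offset by limit. A minimal sketch (the long include parameter is omitted here for brevity; append the full value in real use):

# Step the offset page by page to enumerate all answers.
BASE_URL = ("https://www.zhihu.com/api/v4/questions/356541789/answers"
            "?limit={limit}&offset={offset}&sort_by=default&platform=desktop")

limit = 3
for page in range(5):
    print(BASE_URL.format(limit=limit, offset=page * limit))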
Code
1. Simulating the request
Adding headers is enough here: Zhihu's checks are not as strict as some other sites', although requests that come too frequently do get blocked for a while. I handle that simply with random request headers and proxy IPs:
def get_http_content(number, offset):
    """Request one page of answers for a Zhihu question and return the parsed JSON.

    Args:
        number: the question's unique id
        offset: the paging offset
    """
    target_url = "https://www.zhihu.com/api/v4/questions/{number}/answers?include=data%5B*%5D.is_normal%2Cadmin_closed_comment%2" \
        "Creward_info%2Cis_collapsed%2Cannotation_action%2Cannotation_detail%2Ccollapse_reason%2Cis_sticky%2Ccollapsed_by%2" \
        "Csuggest_edit%2Ccomment_count%2Ccan_comment%2Ccontent%2Ceditable_content%2Cvoteup_count%2Creshipment_settings%2Ccomment_permission%2" \
        "Ccreated_time%2Cupdated_time%2Creview_info%2Crelevant_info%2Cquestion%2Cexcerpt%2Crelationship.is_authorized%2Cis_author%2Cvoting%2" \
        "Cis_thanked%2Cis_nothelp%2Cis_labeled%2Cis_recognized%2Cpaid_info%2Cpaid_info_content%3Bdata%5B*%5D.mark_infos%5B*%5D.url%3Bdata%5B*%5" \
        "D.author.follower_count%2Cbadge%5B*%5D.topics&offset={offset}&limit={limit}&sort_by=default&platform=desktop".format(
            number=number, offset=offset, limit=limit)
    logger.info("target_url:{}", target_url)
    headers = {
        'User-Agent': fake_useragent.get_random_useragent(),
    }
    ip = IPPool().get_random_key()
    proxies = {"http": "http://" + ip}
    response = requests.get(target_url, headers=headers, proxies=proxies)
    # requests.get raises on connection errors rather than returning None,
    # so only the status code needs checking here.
    if response.status_code != 200:
        logger.warning("http request failed, number={}, offset={}, status_code={}",
                       number, offset, response.status_code)
        return None
    html = response.text
    return json.loads(html)
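The fake_useragent and IPPool helpers used above are not shown in this post. Hypothetical minimal stand-ins that match the calls made here (with placeholder values you would replace with your own) could look like:

import random

# fake_useragent stand-in (hypothetical): a module exposing get_random_useragent().
_AGENTS = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
    '(KHTML, like Gecko) Chrome/72.0.3626.119 Safari/537.36',
    # ...more User-Agent strings...
]

def get_random_useragent():
    """Pick a random User-Agent string."""
    return random.choice(_AGENTS)

# IPPool stand-in (hypothetical): hands out random 'ip:port' proxy strings.
class IPPool:
    _PROXIES = ['127.0.0.1:8888']  # placeholders; use live proxies

    def get_random_key(self):
        return random.choice(self._PROXIES)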
2. Parsing out the image URLs
def start_crawl():
    """Crawl the answers and download the pictures."""
    for i in range(0, max_pages):
        offset = limit * i
        logger.info("download pictures with offset {}", offset)
        # fetch one page of answers
        content_dict = get_http_content(number, offset)
        if content_dict is None:
            logger.error(
                "get http resp fail, number={} offset={}", number, offset)
            continue
        # content_dict['data'] holds the list of answers
        if 'data' not in content_dict:
            logger.error("parse data from http resp fail, dict={}", content_dict)
            continue
        for answer_text in content_dict['data']:
            logger.info(
                "get pictures from answer: https://www.zhihu.com/question/{}/answer/{}", number, answer_text['id'])
            if 'content' not in answer_text:
                logger.error(
                    "parse content from answer text fail, text={}", answer_text)
                continue
            answer_content = pq(answer_text['content'])
            img_urls = answer_content.find('noscript').find('img')
            # log answers that contain no pictures, to make debugging easier
            if len(list(img_urls)) <= 0:
                logger.warning(
                    "this answer has no pictures, url:https://www.zhihu.com/question/{}/answer/{}", number, answer_text['id'])
                continue
            for img_url in img_urls.items():
                # src example: https://pic2.zhimg.com/50/v2-c970108cd260ea095383627362c1d04f_720w.jpg?source=1940ef5c
                src = img_url.attr("src")
                # strip the query string, then pull out the file extension (.jpeg, .gif, ...)
                source_index = src.rfind('?source')
                if source_index == -1:
                    logger.error("find source index fail, src:{}", src)
                    continue
                trimmed = src[0:source_index]
                suffix_index = trimmed.rfind('.')
                if suffix_index == -1:
                    logger.error("find suffix fail, src:{}", src)
                    continue
                suffix = trimmed[suffix_index:]
                logger.info("get picture url, src:{} suffix:{}", src, suffix)
                store_picture(src, suffix)
        # be polite: pause between pages
        time.sleep(1)
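For completeness, the module-level imports and configuration these functions rely on look roughly like this (the values are examples; the brace-style log calls match loguru's API, which is assumed here):

import json
import os
import time
import uuid

import cv2
import requests
from loguru import logger          # assumed from the {}-style log calls
from pyquery import PyQuery as pq  # parses the answer HTML

number = 356541789           # the question id from the URL
limit = 3                    # answers fetched per request
max_pages = 100              # how many pages to walk through
picture_path = './pictures'  # download directory; create it beforehand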
3. Saving the images locally
def store_picture(img_url, suffix):
    """Save one picture into the download directory.

    Args:
        img_url: the picture's URL
        suffix: the file extension, e.g. '.jpg' or '.gif'
    """
    headers = {
        'User-Agent': fake_useragent.get_random_useragent(),
    }
    ip = IPPool().get_random_key()
    proxies = {"http": "http://" + ip}
    http_resp = requests.get(img_url, headers=headers, proxies=proxies)
    if http_resp.status_code != 200:
        logger.warning("get http resp fail, url={} status_code={}",
                       img_url, http_resp.status_code)
        return
    content = http_resp.content
    # a random uuid keeps file names unique across answers
    with open(f"{picture_path}/{uuid.uuid4()}{suffix}", 'wb') as f:
        f.write(content)
4. Removing the watermark
I had originally planned to use image recognition to mask the watermark out (Zhihu's watermark is simple and uniform in style), but I have too much on my plate at the moment, so instead I simply crop it off with the opencv package:
def crop_watermark(ori_dir, adjusted_dir):
    """Remove the watermark by cropping the bottom of each image.

    Note: this cannot handle gif files.

    Args:
        ori_dir: the directory holding the original images
        adjusted_dir: the directory the cropped images are written to
    """
    img_path_list = os.listdir(ori_dir)  # every file in the directory
    total = len(img_path_list)
    cnt = 1
    for img_path in img_path_list:
        logger.info(
            "overall progress: {}/{}, now handling picture: {}", cnt, total, img_path)
        img_abs_path = ori_dir + '/' + img_path
        img = cv2.imread(img_abs_path)
        if img is None:
            logger.error("cv2.imread fail, picture:{}", img_path)
            continue
        height, width = img.shape[0:2]
        # the watermark sits in the bottom strip, so drop the last 40 rows
        cropped = img[0:height-40, 0:width]
        adjusted_img_abs_path = adjusted_dir + '/' + img_path
        cv2.imwrite(adjusted_img_abs_path, cropped)
        cnt += 1
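Called like so, with hypothetical directory names:

import os

ori_dir = './pictures'              # where store_picture saved the images
adjusted_dir = './pictures_cropped'
os.makedirs(adjusted_dir, exist_ok=True)
crop_watermark(ori_dir, adjusted_dir)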
Final Words
I wrote this program mainly to learn HTML parsing and to sharpen my Python. Looking back at it now, there is honestly nothing remarkable here, but I'm leaving the code for anyone who wants a reference.
Also, this program exists only to automate my own routine of collecting pictures and stripping watermarks. Let me say it once more: please don't let a crawler put pressure on someone else's servers.