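Downloading Images from Local HTML Files
For details, see my personal blog: Downloading Images from Local HTML Files
Downloading all images from a single file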
import requests
from lxml import etree
import os
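Read the local HTML file into memory
Pay attention to the encoding here: the file must be opened with the encoding it was saved in.
with open('爬蟲與API(上).html', 'r', encoding='utf-8') as f:
    html = f.read()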
Parse the page
selector = etree.HTML(html)
img_list = selector.xpath('//img/@src')
img_list
['https://pic2.zhimg.com/v2-92e8bf502b2a8cb1c972215297161e40_b.jpg',
'https://pic3.zhimg.com/v2-8a64c355393635e51f486e8f77a31b11_b.jpg',
'https://pic3.zhimg.com/v2-b0b7e8426f7abe8bba55748830e1fedb_b.jpg',
'https://pic3.zhimg.com/v2-1ad5fce7304021d5e8240513242b1842_b.jpg',
'https://pic2.zhimg.com/v2-c4b13d820e724740b6d22d26cd1f78e4_b.jpg']
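The `src` values in this file happen to be absolute URLs, but other saved pages may contain relative paths, `data:` URIs, or the same image more than once. A minimal sketch of a filter that could be applied to `img_list` before downloading; the helper name `clean_image_urls` is hypothetical and not part of the original code:

```python
from urllib.parse import urlparse


def clean_image_urls(srcs):
    """Keep absolute http(s) URLs, drop duplicates, preserve the original order."""
    seen = set()
    cleaned = []
    for src in srcs:
        src = src.strip()
        # Relative paths have an empty scheme; data: URIs have scheme "data".
        if urlparse(src).scheme not in ("http", "https"):
            continue
        if src in seen:
            continue
        seen.add(src)
        cleaned.append(src)
    return cleaned
```

It would be used as `img_list = clean_image_urls(selector.xpath('//img/@src'))`.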
Download the images
num = 0
for img_url in img_list:
    img = requests.get(img_url)
    # Create the output folder and build the image file name
    num += 1
    img_dir = os.getcwd() + '/爬蟲與API(上)/'
    if not os.path.exists(img_dir):
        os.makedirs(img_dir)
    file_name = img_dir + str(num) + ".png"
    # Save the image file
    with open(file_name, 'wb') as f:
        f.write(img.content)
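The loop above saves every image with a `.png` extension even though the URLs in `img_list` end in `.jpg`. One possible way to keep the extension consistent with the source is to read it from the URL path. This is only a sketch: the helper name `build_file_name` and the `.jpg` fallback are assumptions, not part of the original code.

```python
import os
from urllib.parse import urlparse


def build_file_name(img_dir, num, img_url, default_ext=".jpg"):
    """Build '<img_dir>/<num><ext>', taking <ext> from the URL path when it has one."""
    # urlparse(...).path excludes any query string, so 'a.png?x=1' still yields '.png'.
    ext = os.path.splitext(urlparse(img_url).path)[1] or default_ext
    return os.path.join(img_dir, str(num) + ext)
```

With this helper, the file name line in the loop would become `file_name = build_file_name(img_dir, num, img_url)`.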
Batch-downloading the images from every file in the current directory
import requests
from lxml import etree
import os
import glob
Get the names of all .html files in the current directory.
file_list = glob.glob('*.html')
file_list
['xpath+mongodb抓取伯樂線上實戰.html', '代理IP設定.html', '多執行緒爬蟲實現(上).html', '爬蟲基本原理.html']
The following loop downloads the images from all of these files.
for file in file_list:
    with open(file, 'r', encoding='utf-8') as f:
        html = f.read()
    selector = etree.HTML(html)
    img_list = selector.xpath('//img/@src')
    # Download the images
    num = 0
    for img_url in img_list:
        img = requests.get(img_url)
        print(img_url)
        print(img.status_code)
        # Create the output folder (named after the HTML file) and build the image file name
        num += 1
        img_dir = os.getcwd() + '/' + file[:-5] + '/'
        if not os.path.exists(img_dir):
            os.makedirs(img_dir)
        file_name = img_dir + str(num) + ".jpg"
        # Save the image file
        with open(file_name, 'wb') as f:
            f.write(img.content)
https://pic1.zhimg.com/v2-1adc1eb4791afceffe35cd726cd1ee1c_b.jpg
200
https://pic3.zhimg.com/v2-7f577f74b40e98f6c31430b8e884837e_b.jpg
200
https://pic2.zhimg.com/v2-4d9ab580eec66877f4f90688ee856675_b.jpg
200
https://pic3.zhimg.com/v2-dc7f2877b020191711c67b5c059cb7b6_b.jpg
200
https://pic1.zhimg.com/v2-e7b6728b7a35bbe2c035755ad776c89c_b.jpg
200
https://pic2.zhimg.com/v2-0be8b34bd0bf2611715ff1fcd1b32651_b.jpg
200
https://pic4.zhimg.com/v2-782237911b1a2146b07dc5b790f27363_b.jpg
200
https://pic3.zhimg.com/v2-6dd15768a0d9303f6af923440705b346_b.jpg
200
https://pic3.zhimg.com/v2-efb63eef398fb9f3c89ff7a7bf624a96_b.jpg
200
https://pic3.zhimg.com/v2-60b94f730a916a010ee9969233d26b1a_b.jpg
200
https://pic4.zhimg.com/v2-1c42198298f2ed0191c0c8c9bcc1c83f_b.jpg
200
https://pic3.zhimg.com/v2-152abf7e81663e83091507574c579176_b.jpg
200
https://pic2.zhimg.com/v2-5aefef22c1315ea30494576fd7a8fe49_b.jpg
200
https://pic1.zhimg.com/v2-137a8ec31194a86c562dafb9f8886bac_b.jpg
200
https://pic2.zhimg.com/v2-616f2b58e1709c54f5eb73a302f2a64a_b.jpg
400
https://pic4.zhimg.com/v2-a4933e53972df61721540cd84b28d1b8_b.jpg
200
https://pic4.zhimg.com/v2-17a19920c1fb8771f076a38014c88cd0_b.jpg
200
https://pic4.zhimg.com/v2-6841a49976a11bbd6cadd54530edc2f0_b.jpg
200
https://pic3.zhimg.com/v2-cfb2e2d1ba89674777f37cc354f04a30_b.jpg
400
https://pic3.zhimg.com/v2-0778cca50a17f1f9d35d56bd0bedebfd_b.jpg
200
https://pic3.zhimg.com/v2-5f31d4e31af4ec37c56d0266fa26fc93_b.jpg
200
https://pic2.zhimg.com/v2-1b7f1861e6dbf85866fdc540675366d4_b.jpg
400
https://pic1.zhimg.com/v2-c420c79953b45aaedba381445bc5be78_b.jpg
400
https://pic2.zhimg.com/v2-48cc47aff189b5c722862ecd32a4516a_b.jpg
400
https://pic3.zhimg.com/v2-a2580253cde081db3e3f1b8b66dddf93_b.jpg
200
https://pic4.zhimg.com/v2-6841a49976a11bbd6cadd54530edc2f0_b.jpg
200
https://pic1.zhimg.com/v2-569c1425597defc7f2fd5b54e7e3c3d2_b.jpg
400
https://pic1.zhimg.com/v2-850dd573365d9c9a1c9d58fa7f27532c_b.jpg
400
https://pic2.zhimg.com/v2-13c20a4c25725fb9d363c567ab4eb08d_b.jpg
400
https://pic1.zhimg.com/v2-c0235ab217e08e205305de260bea60e0_b.jpg
400
https://pic2.zhimg.com/v2-99be53d259d1d0c0755a63b578816f05_b.jpg
400
https://pic4.zhimg.com/v2-bb0040576245087202432c2c4ebbc88b_b.jpg
200
https://pic3.zhimg.com/v2-184bf0e862d37b5e2297f2c4289d8662_b.jpg
200
https://pic2.zhimg.com/v2-7d761d77317867021fd59e4e90c1bddd_b.jpg
400
https://pic4.zhimg.com/v2-9a315bb94c08e58ed5f63202e8a25d5b_b.jpg
200
https://pic2.zhimg.com/v2-8499b2eb6e641620474641daedb61931_b.jpg
400
https://pic1.zhimg.com/v2-11e49b3e1474035316b4bd2ae4d59a4c_b.jpg
400
https://pic2.zhimg.com/v2-88b64ae8861ace4172d54a6cdb81da31_b.jpg
400
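The loop above derives each output folder with `file[:-5]`, which only works because every name ends in the five characters `.html`. A slightly more general sketch uses `os.path.splitext`; the helper name `folder_for` and its `base_dir` parameter are hypothetical:

```python
import os


def folder_for(html_file, base_dir=None):
    """Return '<base_dir>/<file name without extension>' for an HTML file."""
    base_dir = os.getcwd() if base_dir is None else base_dir
    stem = os.path.splitext(os.path.basename(html_file))[0]
    return os.path.join(base_dir, stem)
```

The folder can then be created with `os.makedirs(folder_for(file), exist_ok=True)`, which also removes the need for the separate `os.path.exists` check.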
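Problem
Images downloaded with the code above often cannot be opened, which suggests that the downloads did not actually succeed. So I looked at this part of the code: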
img = requests.get(img_url)
print(img.status_code)
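Many of the requests returned status code 400, and when I checked the downloaded files, the ones that could not be opened were exactly those whose requests returned 400: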
https://pic2.zhimg.com/v2-8499b2eb6e641620474641daedb61931_b.jpg
400
https://pic1.zhimg.com/v2-11e49b3e1474035316b4bd2ae4d59a4c_b.jpg
400
https://pic2.zhimg.com/v2-88b64ae8861ace4172d54a6cdb81da31_b.jpg
400
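Because the original loop writes `img.content` to disk regardless of the status code, every failed request leaves behind a file that is not an image. A minimal sketch of a guard that skips such responses; the helper name `save_image` is hypothetical, and `response` is assumed to behave like a `requests.Response` (only `status_code` and `content` are used):

```python
def save_image(response, file_name):
    """Write response.content to file_name only if the request returned 200.

    Returns True when a file was written, False when the response was skipped.
    """
    if response.status_code != 200:
        return False
    with open(file_name, "wb") as f:
        f.write(response.content)
    return True
```

In the loop this would replace the final `with open(...)` block, for example `if not save_image(img, file_name): print('skipped', img_url)`.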
I then opened one of these links manually and got the following error:
You do not have permission to get URL '/v2-88b64ae8861ace4172d54a6cdb81da31_b.jpg' from this server.
My guess is that this happens because the code does not set a Referer in the request headers. Here is the improved version:
user_agent="Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 safari/537.36"
referer="https://www.zhihu.com/"
headers={'User-Agent':user_agent,'Referer':referer}
for file in file_list:
    with open(file, 'r', encoding='utf-8') as f:
        html = f.read()
    selector = etree.HTML(html)
    img_list = selector.xpath('//img/@src')
    # Download the images
    num = 0
    for img_url in img_list:
        # Send the User-Agent and Referer headers with every request
        img = requests.get(img_url, headers=headers)
        print(img_url)
        print(img.status_code)
        # Create the output folder (named after the HTML file) and build the image file name
        num += 1
        img_dir = os.getcwd() + '/' + file[:-5] + '/'
        if not os.path.exists(img_dir):
            os.makedirs(img_dir)
        file_name = img_dir + str(num) + ".jpg"
        # Save the image file
        with open(file_name, 'wb') as f:
            f.write(img.content)
https://pic2.zhimg.com/v2-616f2b58e1709c54f5eb73a302f2a64a_b.jpg
200
https://pic4.zhimg.com/v2-a4933e53972df61721540cd84b28d1b8_b.jpg
200
https://pic4.zhimg.com/v2-17a19920c1fb8771f076a38014c88cd0_b.jpg
200
https://pic4.zhimg.com/v2-6841a49976a11bbd6cadd54530edc2f0_b.jpg
200
https://pic3.zhimg.com/v2-cfb2e2d1ba89674777f37cc354f04a30_b.jpg
200
https://pic3.zhimg.com/v2-0778cca50a17f1f9d35d56bd0bedebfd_b.jpg
200
https://pic3.zhimg.com/v2-5f31d4e31af4ec37c56d0266fa26fc93_b.jpg
200
https://pic2.zhimg.com/v2-1b7f1861e6dbf85866fdc540675366d4_b.jpg
200
https://pic1.zhimg.com/v2-c420c79953b45aaedba381445bc5be78_b.jpg
200
https://pic2.zhimg.com/v2-48cc47aff189b5c722862ecd32a4516a_b.jpg
200
https://pic3.zhimg.com/v2-a2580253cde081db3e3f1b8b66dddf93_b.jpg
200
https://pic4.zhimg.com/v2-6841a49976a11bbd6cadd54530edc2f0_b.jpg
200
https://pic1.zhimg.com/v2-569c1425597defc7f2fd5b54e7e3c3d2_b.jpg
200
https://pic1.zhimg.com/v2-850dd573365d9c9a1c9d58fa7f27532c_b.jpg
200
https://pic2.zhimg.com/v2-13c20a4c25725fb9d363c567ab4eb08d_b.jpg
200
https://pic1.zhimg.com/v2-c0235ab217e08e205305de260bea60e0_b.jpg
200
https://pic2.zhimg.com/v2-99be53d259d1d0c0755a63b578816f05_b.jpg
200
https://pic4.zhimg.com/v2-bb0040576245087202432c2c4ebbc88b_b.jpg
200
https://pic3.zhimg.com/v2-184bf0e862d37b5e2297f2c4289d8662_b.jpg
200
https://pic2.zhimg.com/v2-7d761d77317867021fd59e4e90c1bddd_b.jpg
200
https://pic4.zhimg.com/v2-9a315bb94c08e58ed5f63202e8a25d5b_b.jpg
200
https://pic2.zhimg.com/v2-8499b2eb6e641620474641daedb61931_b.jpg
200
https://pic1.zhimg.com/v2-11e49b3e1474035316b4bd2ae4d59a4c_b.jpg
200
https://pic2.zhimg.com/v2-88b64ae8861ace4172d54a6cdb81da31_b.jpg
200
Problem solved!
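As a variation on the improved version, the same headers could be attached once to a `requests.Session` instead of being passed to every `requests.get` call. This is a sketch rather than the method used above; the function name `build_session` is hypothetical, and the header values are the ones from the improved version.

```python
import requests


def build_session(referer="https://www.zhihu.com/"):
    """Return a Session that sends the same User-Agent and Referer on every request."""
    session = requests.Session()
    session.headers.update({
        "User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 "
                      "(KHTML, like Gecko) Chrome/55.0.2883.87 safari/537.36",
        "Referer": referer,
    })
    return session


# Usage sketch (performs a network request, so it is left commented out):
# session = build_session()
# img = session.get(img_url, timeout=10)
```

A session also reuses the underlying connection between requests, which may help when many images come from the same host.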