The Girls You Asked For Are Here! A Hand-Holding Tutorial on Scraping High-Resolution Images! Cultivate an Eye for Beauty
It's been a while since I wrote a crawler. Today I felt the itch again, but I had nothing particular in mind to scrape, so I just picked a site at random to practice on.
1. Environment Setup
This crawler is built on the Scrapy framework:
scrapy startproject Weimei
cd Weimei
scrapy genspider weimei "weimei.com"
Modify the settings.py file and set the image download path.
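The full file is reproduced in section 6; the changes that matter boil down to these lines:

# Key changes in settings.py (shown in full in section 6)
ROBOTSTXT_OBEY = False   # the crawl ignores robots.txt
LOG_LEVEL = "ERROR"      # keep console output quiet
DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en',
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.89 Safari/537.36"
}
ITEM_PIPELINES = {
    'Weimei.pipelines.WeimeiPipeline': 300,
}
IMAGES_STORE = "Download"  # root folder for the downloaded images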
Write a launcher file, start.py:
from scrapy import cmdline
cmdline.execute("scrapy crawl weimei".split())
2. Page Analysis
Today let's start with the photography galleries.
The page uses a typical waterfall (infinite-scroll) layout, so there are two ways to handle it:
- Drive the browser's scroll wheel with selenium, then grab the source of the fully loaded page (a rough sketch follows this list)
- Watch the requests the page sends as you scroll down and replay those requests ourselves (which has the same effect as scrolling)
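For reference, the first approach would look roughly like this. This is only a sketch, assuming selenium and a matching chromedriver are installed; it is not the code this tutorial uses:

# Sketch of approach 1 (not used below): scroll until the page stops growing,
# then read the fully rendered source.
from selenium import webdriver
import time

driver = webdriver.Chrome()
driver.get("https://www.vmgirls.com/photography")
last_height = 0
while True:
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(2)  # give the ajax content time to load
    new_height = driver.execute_script("return document.body.scrollHeight")
    if new_height == last_height:
        break
    last_height = new_height
html = driver.page_source  # page source with all lazy-loaded cards present
driver.quit()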
I went with the second approach.
As you scroll down, you can see the browser fire an ajax request.
The request method is POST, and it carries a set of form parameters.
From those parameters we can infer that paged is the page number: set paged to 1 and you get the first page, so by changing paged and re-sending the request we can crawl as many pages as we like.
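You can sanity-check that deduction outside Scrapy with a quick throwaway script (assuming the requests library; the form fields are the ones captured from the browser):

# Quick check that `paged` selects the page: POST the captured form data
# with different paged values and compare the returned HTML fragments.
import requests

data = {
    "append": "list - archive",
    "paged": "1",  # change to "2", "3", ... for later pages
    "action": "ajax_load_posts",
    "query": "17",
    "page": "cat",
}
resp = requests.post("https://www.vmgirls.com/wp-admin/admin-ajax.php", data=data)
print(resp.status_code, len(resp.text))  # each page should return a different fragment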
3. Code Walkthrough
weimei.py
class WeimeiSpider(scrapy.Spider):
    name = 'weimei'
    # allowed_domains = ['vmgirls.com']
    # start_urls = ['https://www.vmgirls.com/photography']

    # Form data for the POST request; see the page analysis above
    data = {
        "append": "list - archive",
        "paged": "1",  # current page number
        "action": "ajax_load_posts",
        "query": "17",
        "page": "cat"
    }

    # Override start_requests
    def start_requests(self):
        # Crawl several pages; range(3) -> 0, 1, 2
        for i in range(3):
            # Set the page number to fetch
            # (range starts at 0, so add 1)
            self.data["paged"] = str(1 + i)
            # Send the POST request
            yield scrapy.FormRequest(url="https://www.vmgirls.com/wp-admin/admin-ajax.php", method='POST',
                                     formdata=self.data, callback=self.parse)

    def parse(self, response):
        # Parse with BeautifulSoup, using the lxml parser
        bs1 = BeautifulSoup(response.text, "lxml")
        # See figure 1
        div_list = bs1.find_all(class_="col-md-4 d-flex")
        # Iterate over the cards
        for div in div_list:
            a = BeautifulSoup(str(div), "lxml")
            # See figure 2
            # Detail-page URL
            page_url = a.find("a")["href"]
            # Title of each photo set
            name = a.find("a")["title"]
            # Send a GET request for the detail page, carrying name along in meta
            yield scrapy.Request(url=page_url, callback=self.page, meta={"name": name})

    # Crawl the detail page
    def page(self, response):
        # Retrieve the name passed in via meta
        name = response.meta.get("name")
        bs2 = BeautifulSoup(response.text, "lxml")
        # See figure 3
        img_list = bs2.find(class_="nc-light-gallery").find_all("img")
        for img in img_list:
            image = BeautifulSoup(str(img), "lxml")
            # See figure 4
            # Note that I read data-src, not src:
            # data-src is an HTML5 data attribute holding the image's real source;
            # the page lazy-loads images, so src is only filled in by JavaScript.
            # Image URL
            img_url = "https://www.vmgirls.com/" + image.find("img")["data-src"]
            item = WeimeiItem(img_url=img_url, name=name)
            yield item
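For reference, the WeimeiItem the spider yields is a plain two-field item; its definition from items.py (listed again in section 6):

import scrapy


class WeimeiItem(scrapy.Item):
    img_url = scrapy.Field()
    name = scrapy.Field()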
pipelines.py
class WeimeiPipeline(ImagesPipeline):
    def get_media_requests(self, item, info):
        img_url = item["img_url"]
        name = item["name"]
        yield scrapy.Request(url=img_url, meta={"name": name})

    def file_path(self, request, response=None, info=None):
        name = request.meta["name"]
        img_name = request.url.split('/')[-1]
        # Join the two so each photo set's images land in their own folder
        img_path = os.path.join(name, img_name)
        # print(img_path)
        return img_path  # return the storage path (relative to IMAGES_STORE)

    def item_completed(self, results, item, info):
        # print(item)
        return item  # pass the item on to the next pipeline class
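As an aside, the results argument that ImagesPipeline passes to item_completed is a list of (success, info) tuples, one per request yielded from get_media_requests. If you wanted to surface failed downloads, a minimal sketch:

def item_completed(self, results, item, info):
    # results: [(success, file_info_or_failure), ...], one entry per image request
    failed = [detail for ok, detail in results if not ok]
    if failed:
        print(f"{item['name']}: {len(failed)} image(s) failed to download")
    return item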
4. Supporting Figures
Figure 1
Figure 2
Figure 3
Figure 4
5. Run Results
6. Complete Code
weimei.py
# -*- coding: utf-8 -*-
import scrapy
from Weimei.items import WeimeiItem
from bs4 import BeautifulSoup


class WeimeiSpider(scrapy.Spider):
    name = 'weimei'
    # allowed_domains = ['vmgirls.com']
    # start_urls = ['https://www.vmgirls.com/photography']
    data = {
        "append": "list - archive",
        "paged": "1",
        "action": "ajax_load_posts",
        "query": "17",
        "page": "cat"
    }

    def start_requests(self):
        for i in range(3):
            # Pages 1-3, matching section 3 (incrementing the stored value
            # instead would skip page 1 and start from page 2)
            self.data["paged"] = str(i + 1)
            yield scrapy.FormRequest(url="https://www.vmgirls.com/wp-admin/admin-ajax.php", method='POST',
                                     formdata=self.data, callback=self.parse)

    def parse(self, response):
        bs1 = BeautifulSoup(response.text, "lxml")
        div_list = bs1.find_all(class_="col-md-4 d-flex")
        for div in div_list:
            a = BeautifulSoup(str(div), "lxml")
            page_url = a.find("a")["href"]
            name = a.find("a")["title"]
            yield scrapy.Request(url=page_url, callback=self.page, meta={"name": name})

    def page(self, response):
        name = response.meta.get("name")
        bs2 = BeautifulSoup(response.text, "lxml")
        img_list = bs2.find(class_="nc-light-gallery").find_all("img")
        for img in img_list:
            image = BeautifulSoup(str(img), "lxml")
            img_url = "https://www.vmgirls.com/" + image.find("img")["data-src"]
            item = WeimeiItem(img_url=img_url, name=name)
            yield item
items.py
import scrapy


class WeimeiItem(scrapy.Item):
    img_url = scrapy.Field()
    name = scrapy.Field()
pipelines.py
from scrapy.pipelines.images import ImagesPipeline
import scrapy
import os


class WeimeiPipeline(ImagesPipeline):
    def get_media_requests(self, item, info):
        img_url = item["img_url"]
        name = item["name"]
        print(img_url)
        yield scrapy.Request(url=img_url, meta={"name": name})

    def file_path(self, request, response=None, info=None):
        name = request.meta["name"]
        img_name = request.url.split('/')[-1]
        img_path = os.path.join(name, img_name)
        # print(img_path)
        return img_path  # return the storage path for the file

    def item_completed(self, results, item, info):
        # print(item)
        return item  # pass the item on to the next pipeline class
settings.py
BOT_NAME = 'Weimei'
SPIDER_MODULES = ['Weimei.spiders']
NEWSPIDER_MODULE = 'Weimei.spiders'
# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'Weimei (+http://www.yourdomain.com)'
LOG_LEVEL = "ERROR"
# Obey robots.txt rules
ROBOTSTXT_OBEY = False
# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32
# Configure a delay for requests for the same website (default: 0)
# See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16
# Disable cookies (enabled by default)
#COOKIES_ENABLED = False
# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False
# Override the default request headers:
DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en',
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.89 Safari/537.36"
}
# Enable or disable spider middlewares
# See https://docs.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
# 'Weimei.middlewares.WeimeiSpiderMiddleware': 543,
#}
# Enable or disable downloader middlewares
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
# DOWNLOADER_MIDDLEWARES = {
# 'Weimei.middlewares.AreaSpiderMiddleware': 543,
# }
# Enable or disable extensions
# See https://docs.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
# 'scrapy.extensions.telnet.TelnetConsole': None,
#}
# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    'Weimei.pipelines.WeimeiPipeline': 300,
}
# Enable and configure the AutoThrottle extension (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False
# Enable and configure HTTP caching (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
IMAGES_STORE = "Download"
If you found this post worthwhile, dear readers, please like, follow, and bookmark. Your support is my biggest motivation for writing!