The photo sets you asked for are here! A hand-holding tutorial on scraping high-resolution images, and on training an eye for spotting beauty
It has been a while since I last wrote a crawler. Today I felt the itch to write one again, but had no particular target in mind, so I picked a site at random to practice on.
1. Environment Setup
This crawler is built with the Scrapy framework.
scrapy startproject Weimei
cd Weimei
scrapy genspider weimei "weimei.com"
Modify the settings.py file
and set the file download path.
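Concretely, the download path is controlled by two entries, both of which appear again in the complete settings.py at the end of this post: the pipeline registration and the storage root used by ImagesPipeline.

ITEM_PIPELINES = {
    'Weimei.pipelines.WeimeiPipeline': 300,
}
# Root folder that ImagesPipeline writes downloaded files into
IMAGES_STORE = "Download"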
Write a launcher file, start.py:
from scrapy import cmdline
cmdline.execute("scrapy crawl weimei".split())
2. Page Analysis
For today, let's scrape the photography portrait section.
This is a typical waterfall-flow (infinite scroll) page, so there are two ways to handle it:
- Use Selenium to drive the browser's scrolling, then grab the source of the fully loaded page
- Watch the requests the page sends as it scrolls down and replay those requests directly (equivalent to scrolling)
I went with the second approach; for reference, a rough sketch of the first is shown below.
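This is only a minimal sketch of the Selenium option, not what the rest of this post uses; the scroll count and sleep time are arbitrary assumptions.

from selenium import webdriver
import time

driver = webdriver.Chrome()
driver.get("https://www.vmgirls.com/photography")
for _ in range(3):
    # Scroll to the bottom so the page triggers the next batch of posts
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(2)  # wait for the lazily loaded content to render
html = driver.page_source  # full page source after scrolling
driver.quit()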
Scrolling down, you can see the browser fire an AJAX request. The request method is POST and it carries several form parameters. From these we can infer that paged is the page number: setting paged to 1 returns the first page, and by changing paged and repeating the request we can crawl multiple pages.
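You can confirm this guess before writing the spider by replaying the request yourself; here is a quick sketch using the requests library, with the form data exactly as captured in the developer tools (whether an extra browser User-Agent header is needed is an assumption on my part).

import requests

data = {
    "append": "list - archive",
    "paged": "1",  # change this value to fetch a different page
    "action": "ajax_load_posts",
    "query": "17",
    "page": "cat"
}
resp = requests.post("https://www.vmgirls.com/wp-admin/admin-ajax.php", data=data)
print(resp.status_code)
print(resp.text[:500])  # the response is an HTML fragment containing the post cards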
3. Code Analysis
weimei.py
import scrapy
from bs4 import BeautifulSoup
from Weimei.items import WeimeiItem


class WeimeiSpider(scrapy.Spider):
    name = 'weimei'
    # allowed_domains = ['vmgirls.com']
    # start_urls = ['https://www.vmgirls.com/photography']

    # Data submitted with the POST request; see the page analysis above
    data = {
        "append": "list - archive",
        "paged": "1",  # current page number
        "action": "ajax_load_posts",
        "query": "17",
        "page": "cat"
    }

    # Override start_requests
    def start_requests(self):
        # Crawl multiple pages; range(3) -> 0, 1, 2
        for i in range(3):
            # Set the page number for this request
            # (range starts at 0, so add 1)
            self.data["paged"] = str(1 + i)
            # Send the POST request
            yield scrapy.FormRequest(url="https://www.vmgirls.com/wp-admin/admin-ajax.php", method='POST',
                                     formdata=self.data, callback=self.parse)

    def parse(self, response):
        # Parse the response with BeautifulSoup, using the lxml parser
        bs1 = BeautifulSoup(response.text, "lxml")
        # Figure 1
        div_list = bs1.find_all(class_="col-md-4 d-flex")
        # Iterate over every post card
        for div in div_list:
            a = BeautifulSoup(str(div), "lxml")
            # See Figure 2
            # Detail page URL
            page_url = a.find("a")["href"]
            # Name of this photo set
            name = a.find("a")["title"]
            # Send a GET request to the detail page, passing name along in meta
            yield scrapy.Request(url=page_url, callback=self.page, meta={"name": name})

    # Crawl the detail page
    def page(self, response):
        # Retrieve the name passed via meta
        name = response.meta.get("name")
        bs2 = BeautifulSoup(response.text, "lxml")
        # See Figure 3
        img_list = bs2.find(class_="nc-light-gallery").find_all("img")
        for img in img_list:
            image = BeautifulSoup(str(img), "lxml")
            # Figure 4
            # Note: I read data-src rather than src
            # data-src is an HTML5 data attribute, i.e. the real data source
            # Image URL
            img_url = "https://www.vmgirls.com/" + image.find("img")["data-src"]
            item = WeimeiItem(img_url=img_url, name=name)
            yield item
pipelines.py
import os

import scrapy
from scrapy.pipelines.images import ImagesPipeline


class WeimeiPipeline(ImagesPipeline):
    def get_media_requests(self, item, info):
        img_url = item["img_url"]
        name = item["name"]
        yield scrapy.Request(url=img_url, meta={"name": name})

    def file_path(self, request, response=None, info=None):
        name = request.meta["name"]
        img_name = request.url.split('/')[-1]
        # Join name and file name so every photo set ends up in its own folder
        img_path = os.path.join(name, img_name)
        # print(img_path)
        return img_path  # Return the storage path (relative to IMAGES_STORE)

    def item_completed(self, results, item, info):
        # print(item)
        return item  # Pass the item on to the next pipeline
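item_completed here simply hands the item to the next pipeline. If you also want to drop items whose images all failed to download, the results argument can be inspected first; a small sketch (the DropItem check is my addition, not part of the original pipeline, and needs from scrapy.exceptions import DropItem at the top of the file):

    def item_completed(self, results, item, info):
        # results is a list of (success, file_info) tuples from ImagesPipeline;
        # file_info["path"] is the saved path for successful downloads
        image_paths = [res["path"] for ok, res in results if ok]
        if not image_paths:
            # DropItem comes from scrapy.exceptions
            raise DropItem("No images downloaded for %s" % item["name"])
        return item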
4. Reference Figures
Figure 1
Figure 2
Figure 3
Figure 4
5. Run Results
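With IMAGES_STORE = "Download" and the file_path above, every photo set gets its own sub-folder named after the set. The layout should look roughly like this (folder and file names are illustrative):

Download/
├── <photo set name 1>/
│   ├── xxxx.jpeg
│   └── xxxx.jpeg
└── <photo set name 2>/
    └── xxxx.jpeg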
6. Complete Code
weimei.py
# -*- coding: utf-8 -*-
import scrapy
from Weimei.items import WeimeiItem
from bs4 import BeautifulSoup


class WeimeiSpider(scrapy.Spider):
    name = 'weimei'
    # allowed_domains = ['vmgirls.com']
    # start_urls = ['https://www.vmgirls.com/photography']
    data = {
        "append": "list - archive",
        "paged": "1",
        "action": "ajax_load_posts",
        "query": "17",
        "page": "cat"
    }

    def start_requests(self):
        for i in range(3):
            self.data["paged"] = str(1 + i)
            yield scrapy.FormRequest(url="https://www.vmgirls.com/wp-admin/admin-ajax.php", method='POST',
                                     formdata=self.data, callback=self.parse)

    def parse(self, response):
        bs1 = BeautifulSoup(response.text, "lxml")
        div_list = bs1.find_all(class_="col-md-4 d-flex")
        for div in div_list:
            a = BeautifulSoup(str(div), "lxml")
            page_url = a.find("a")["href"]
            name = a.find("a")["title"]
            yield scrapy.Request(url=page_url, callback=self.page, meta={"name": name})

    def page(self, response):
        name = response.meta.get("name")
        bs2 = BeautifulSoup(response.text, "lxml")
        img_list = bs2.find(class_="nc-light-gallery").find_all("img")
        for img in img_list:
            image = BeautifulSoup(str(img), "lxml")
            img_url = "https://www.vmgirls.com/" + image.find("img")["data-src"]
            item = WeimeiItem(img_url=img_url, name=name)
            yield item
items.py
import scrapy


class WeimeiItem(scrapy.Item):
    img_url = scrapy.Field()
    name = scrapy.Field()
pipelines.py
from scrapy.pipelines.images import ImagesPipeline
import scrapy
import os


class WeimeiPipeline(ImagesPipeline):
    def get_media_requests(self, item, info):
        img_url = item["img_url"]
        name = item["name"]
        print(img_url)
        yield scrapy.Request(url=img_url, meta={"name": name})

    def file_path(self, request, response=None, info=None):
        name = request.meta["name"]
        img_name = request.url.split('/')[-1]
        img_path = os.path.join(name, img_name)
        # print(img_path)
        return img_path  # Return the storage path (relative to IMAGES_STORE)

    def item_completed(self, results, item, info):
        # print(item)
        return item  # Pass the item on to the next pipeline
settings.py
BOT_NAME = 'Weimei'
SPIDER_MODULES = ['Weimei.spiders']
NEWSPIDER_MODULE = 'Weimei.spiders'
# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'Weimei (+http://www.yourdomain.com)'
LOG_LEVEL = "ERROR"
# Obey robots.txt rules
ROBOTSTXT_OBEY = False
# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32
# Configure a delay for requests for the same website (default: 0)
# See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16
# Disable cookies (enabled by default)
#COOKIES_ENABLED = False
# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False
# Override the default request headers:
DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en',
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.89 Safari/537.36"
}
# Enable or disable spider middlewares
# See https://docs.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
# 'Weimei.middlewares.WeimeiSpiderMiddleware': 543,
#}
# Enable or disable downloader middlewares
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
# DOWNLOADER_MIDDLEWARES = {
# 'Weimei.middlewares.AreaSpiderMiddleware': 543,
# }
# Enable or disable extensions
# See https://docs.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
# 'scrapy.extensions.telnet.TelnetConsole': None,
#}
# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    'Weimei.pipelines.WeimeiPipeline': 300,
}
# Enable and configure the AutoThrottle extension (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False
# Enable and configure HTTP caching (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
IMAGES_STORE = "Download"
If you found this post useful, please like, follow, and bookmark. Your support is my biggest motivation to keep writing!