The pretty girls you asked for are here! A hand-holding tutorial on scraping high-resolution images! Cultivate an eye for beauty

Posted by Code皮皮蝦 on 2020-08-07

I haven't written a crawler in a while. Today, on a whim, I decided to write one, but I wasn't sure what to scrape, so I picked a random website to practice on.
The target site: 唯美女生 (vmgirls.com)

I. Environment Setup

This crawler is built with the Scrapy framework.

scrapy startproject Weimei

cd Weimei

scrapy genspider weimei "weimei.com"
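
After running these commands, the generated project should look roughly like the standard Scrapy layout below (start.py is the launcher file we add next):

Weimei/
├── scrapy.cfg
└── Weimei/
    ├── __init__.py
    ├── items.py
    ├── middlewares.py
    ├── pipelines.py
    ├── settings.py
    └── spiders/
        ├── __init__.py
        └── weimei.py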

Modify the settings.py file: disable robots.txt compliance, lower the log level to ERROR, add default request headers (with a browser User-Agent), enable the item pipeline, and set the image download path.
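
The concrete changes are the ones below (they appear again in the complete settings.py in section VI):

ROBOTSTXT_OBEY = False   # do not obey robots.txt
LOG_LEVEL = "ERROR"      # only print error-level logs

DEFAULT_REQUEST_HEADERS = {
  'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
  'Accept-Language': 'en',
  "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.89 Safari/537.36"
}

ITEM_PIPELINES = {
   'Weimei.pipelines.WeimeiPipeline': 300,  # enable the image download pipeline
}

IMAGES_STORE = "Download"  # root folder for the downloaded images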

Write the launcher file start.py:

from scrapy import cmdline

cmdline.execute("scrapy crawl weimei".split())
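
With this file, running start.py from the project root (for example from the IDE) has the same effect as typing scrapy crawl weimei on the command line, which makes debugging easier.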

II. Page Analysis

Today let's start by scraping the photography (攝影寫真) category.
This page is a typical waterfall-flow (infinite-scroll) page.
So there are two ways to handle it:

  1. Use Selenium to drive the browser's scrolling and then grab the source of the fully loaded page (a rough sketch follows this list).
  2. Watch the requests the page sends as you scroll down, and scrape by replaying those requests (which has the same effect as scrolling).
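
For reference, approach 1 would look roughly like the sketch below. It is not used in this post and assumes selenium and a matching ChromeDriver are installed:

import time
from selenium import webdriver

driver = webdriver.Chrome()
driver.get("https://www.vmgirls.com/photography")
# Scroll to the bottom a few times so the lazily loaded posts are rendered
for _ in range(3):
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(2)  # give the AJAX content time to load
html = driver.page_source  # source of the fully loaded page
driver.quit()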

I used the second approach.

As you can see, when the page is scrolled down, the browser sends an AJAX request.
The request method is POST, and it carries form parameters.
From these we can infer that paged is the page number: setting paged to 1 requests the first page, and by changing paged and sending the request repeatedly we can crawl multiple pages.
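
Before writing the spider, this inference can be verified with a quick one-off request outside Scrapy. A minimal sketch, assuming the requests library is installed and the endpoint still behaves as observed in the browser:

import requests

url = "https://www.vmgirls.com/wp-admin/admin-ajax.php"
data = {
    "append": "list - archive",
    "paged": "1",               # change this to request other pages
    "action": "ajax_load_posts",
    "query": "17",
    "page": "cat"
}
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.89 Safari/537.36"
}

resp = requests.post(url, data=data, headers=headers)
print(resp.status_code)
print(resp.text[:300])  # should be an HTML fragment containing the post cards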

III. Code Walkthrough

weimei.py

class WeimeiSpider(scrapy.Spider):
    name = 'weimei'
    # allowed_domains = ['vmgirls.com']
    # start_urls = ['https://www.vmgirls.com/photography']

    # Form data submitted with the POST request, see the page analysis above
    data = {
        "append": "list - archive",
        "paged": "1",  # current page number
        "action": "ajax_load_posts",
        "query": "17",
        "page": "cat"
    }

    # Override start_requests
    def start_requests(self):
        # Crawl multiple pages, range(3) -> 0, 1, 2
        for i in range(3):
            # Set the page number to crawl
            # range starts from 0, so add 1
            self.data["paged"] = str(1 + i)
            # Send the POST request
            yield scrapy.FormRequest(url="https://www.vmgirls.com/wp-admin/admin-ajax.php", method='POST',
                                     formdata=self.data, callback=self.parse)

    def parse(self, response):
        # Parse with BeautifulSoup using the lxml parser
        bs1 = BeautifulSoup(response.text, "lxml")

        # See Figure 1
        div_list = bs1.find_all(class_="col-md-4 d-flex")
        # Iterate over the post cards
        for div in div_list:
            a = BeautifulSoup(str(div), "lxml")

            # See Figure 2
            # Detail-page URL
            page_url = a.find("a")["href"]
            # Name of each photo set
            name = a.find("a")["title"]
            # Send a GET request for the detail page, passing name along via meta
            yield scrapy.Request(url=page_url, callback=self.page, meta={"name": name})

    # Crawl the detail page
    def page(self, response):
        # Retrieve the name parameter passed in via meta
        name = response.meta.get("name")
        bs2 = BeautifulSoup(response.text, "lxml")
        # See Figure 3
        img_list = bs2.find(class_="nc-light-gallery").find_all("img")
        for img in img_list:
            image = BeautifulSoup(str(img), "lxml")
            # See Figure 4
            # Note: I read data-src, not src
            # data-src is an HTML5 data attribute, meaning the data source
            # Image URL
            img_url = "https://www.vmgirls.com/" + image.find("img")["data-src"]

            item = WeimeiItem(img_url=img_url, name=name)
            yield item

pipelines.py

class WeimeiPipeline(ImagesPipeline):
    def get_media_requests(self, item, info):
        img_url = item["img_url"]
        name = item["name"]
        yield scrapy.Request(url=img_url, meta={"name": name})

    def file_path(self, request, response=None, info=None):
        name = request.meta["name"]
        img_name = request.url.split('/')[-1]
        # Join the path so that all images of the same photo set land in the same folder
        img_path = os.path.join(name, img_name)
        # print(img_path)
        return img_path  # return the relative file path

    def item_completed(self, results, item, info):
        # print(item)
        return item  # pass the item on to the next pipeline to be executed
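
Since file_path returns name/img_name and settings.py sets IMAGES_STORE = "Download", each photo set ends up in its own sub-folder, i.e. Download/<photo set name>/<original file name>.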

IV. Reference Screenshots

Figure 1: list-page post cards, class "col-md-4 d-flex" (screenshot)
Figure 2: detail-page URL and title in the <a> tag (screenshot)
Figure 3: detail-page image list inside "nc-light-gallery" (screenshot)
Figure 4: the data-src attribute holding the image URL (screenshot)


V. Run Results

(result screenshots)


VI. Complete Code

weimei.py

# -*- coding: utf-8 -*-
import scrapy
from Weimei.items import WeimeiItem
from bs4 import BeautifulSoup

class WeimeiSpider(scrapy.Spider):
    name = 'weimei'
    # allowed_domains = ['vmgirls.com']
    # start_urls = ['https://www.vmgirls.com/photography']

    data = {
        "append": "list - archive",
        "paged": "1",
        "action": "ajax_load_posts",
        "query": "17",
        "page": "cat"
    }

    def start_requests(self):
        for i in range(3):
            self.data["paged"] = str(int(self.data["paged"]) + 1)
            yield scrapy.FormRequest(url="https://www.vmgirls.com/wp-admin/admin-ajax.php", method='POST',
                                 formdata=self.data, callback=self.parse)

    def parse(self, response):
        bs1 = BeautifulSoup(response.text, "lxml")
        div_list = bs1.find_all(class_="col-md-4 d-flex")
        for div in div_list:
            a = BeautifulSoup(str(div), "lxml")
            page_url = a.find("a")["href"]
            name = a.find("a")["title"]
            yield scrapy.Request(url=page_url, callback=self.page, meta={"name": name})

    def page(self, response):
        name = response.meta.get("name")
        bs2 = BeautifulSoup(response.text, "lxml")
        img_list = bs2.find(class_="nc-light-gallery").find_all("img")
        for img in img_list:
            image = BeautifulSoup(str(img), "lxml")
            img_url = "https://www.vmgirls.com/" + image.find("img")["data-src"]
            item = WeimeiItem(img_url=img_url, name=name)
            yield item

items.py

import scrapy


class WeimeiItem(scrapy.Item):
    img_url = scrapy.Field()
    name = scrapy.Field()

pipelines.py

from scrapy.pipelines.images import ImagesPipeline
import scrapy
import os


class WeimeiPipeline(ImagesPipeline):
    def get_media_requests(self, item, info):
        img_url = item["img_url"]
        name = item["name"]
        print(img_url)
        yield scrapy.Request(url=img_url, meta={"name": name})

    def file_path(self, request, response=None, info=None):
        name = request.meta["name"]
        img_name = request.url.split('/')[-1]
        img_path = os.path.join(name, img_name)
        # print(img_path)
        return img_path  # return the relative file path


    def item_completed(self, results, item, info):
        # print(item)
        return item  # pass the item on to the next pipeline to be executed

settings.py

BOT_NAME = 'Weimei'

SPIDER_MODULES = ['Weimei.spiders']
NEWSPIDER_MODULE = 'Weimei.spiders'


# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'Weimei (+http://www.yourdomain.com)'

LOG_LEVEL = "ERROR"
# Obey robots.txt rules
ROBOTSTXT_OBEY = False

# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
#COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Override the default request headers:
DEFAULT_REQUEST_HEADERS = {
  'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
  'Accept-Language': 'en',
  "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.89 Safari/537.36"
}

# Enable or disable spider middlewares
# See https://docs.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#    'Weimei.middlewares.WeimeiSpiderMiddleware': 543,
#}

# Enable or disable downloader middlewares
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
# DOWNLOADER_MIDDLEWARES = {
#    'Weimei.middlewares.AreaSpiderMiddleware': 543,
# }

# Enable or disable extensions
# See https://docs.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    'scrapy.extensions.telnet.TelnetConsole': None,
#}

# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
   'Weimei.pipelines.WeimeiPipeline': 300,
}

# Enable and configure the AutoThrottle extension (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'

IMAGES_STORE = "Download"

If you found this post useful, please like, follow, and bookmark. Your support is my biggest motivation for writing!
