Assignment 1:
Pick a website and crawl all of the images on it, for example the China Weather Network (http://www.weather.com.cn). Use the Scrapy framework to implement the crawl in both a single-threaded and a multi-threaded (concurrent) way. Be sure to limit the crawl: cap the total number of pages (last 2 digits of your student ID) and the total number of downloaded images (last 3 digits of your student ID). Output: print the URL of each downloaded image to the console, save the downloaded images in the images subfolder, and provide screenshots.
Code and run results:
Spider code:
import scrapy
from urllib.parse import urljoin
from scrapy import Item, Field


class WeatherItem(Item):
    # Field consumed by scrapy.pipelines.images.ImagesPipeline
    image_urls = Field()


class Myspider31Spider(scrapy.Spider):
    name = "myspider31"
    allowed_domains = ["weather.com.cn"]
    start_urls = ["https://weather.com.cn"]

    def parse(self, response):
        # Collect every <img> src on the page and resolve it to an absolute URL.
        image_urls = response.css('img::attr(src)').getall()
        full_image_urls = [urljoin(response.url, img_url) for img_url in image_urls]
        item = WeatherItem()
        item['image_urls'] = full_image_urls
        yield item
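The parse method above only processes the start page, so the page-count limit required by the assignment is not yet enforced. Below is a minimal sketch of one way to enforce it by following in-site links and counting parsed pages; MAX_PAGES, the page_count attribute, and the spider name are assumptions for illustration, not part of the original project.

import scrapy
from urllib.parse import urljoin

MAX_PAGES = 25  # hypothetical value: last two digits of the student ID


class LimitedSpider(scrapy.Spider):
    # Sketch: crawl on-site pages until MAX_PAGES responses have been parsed.
    name = "myspider31_limited"
    allowed_domains = ["weather.com.cn"]
    start_urls = ["http://www.weather.com.cn"]
    page_count = 0

    def parse(self, response):
        self.page_count += 1
        # Same image extraction as the original spider.
        image_urls = [urljoin(response.url, u)
                      for u in response.css('img::attr(src)').getall()]
        yield {"image_urls": image_urls}

        # Keep following links inside the allowed domain until the page cap is hit.
        if self.page_count < MAX_PAGES:
            for href in response.css('a::attr(href)').getall():
                yield response.follow(href, callback=self.parse)

Because requests already scheduled may still arrive after the counter reaches the cap, this limit is approximate; the CLOSESPIDER_PAGECOUNT setting sketched after the settings code is an alternative hard stop.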
Settings code (settings.py):
ITEM_PIPELINES = {
    # "project31.pipelines.Project31Pipeline": 300,
    # Use Scrapy's built-in image pipeline to download everything in image_urls.
    'scrapy.pipelines.images.ImagesPipeline': 300,
}
# Raw string so the backslashes in the Windows path are not treated as escapes.
IMAGES_STORE = r'D:\資料集\資料採集實踐3-1'
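The settings above only enable the image pipeline. Scrapy itself runs on a single-threaded Twisted event loop, so the assignment's single-threaded versus multi-threaded requirement is usually realized by adjusting the concurrency settings; the lines below are a hedged sketch with placeholder values, not taken from the original project.

# "Single-threaded" run: one request in flight at a time.
# For the concurrent ("multi-threaded") run, raise this, e.g. CONCURRENT_REQUESTS = 16.
CONCURRENT_REQUESTS = 1

# Built-in CloseSpider extension limits: stop the crawl after this many
# responses / scraped items have been processed.
CLOSESPIDER_PAGECOUNT = 25    # placeholder: last 2 digits of the student ID
CLOSESPIDER_ITEMCOUNT = 125   # placeholder: last 3 digits of the student ID

Note that CLOSESPIDER_ITEMCOUNT counts scraped items rather than downloaded files; since each item here carries a whole list of image URLs, an exact cap on the number of images needs a counter in the pipeline, as sketched after the pipelines code.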
Pipelines code (pipelines.py):
from itemadapter import ItemAdapter


class Project31Pipeline:
    # Pass-through pipeline; image downloading is handled by the ImagesPipeline enabled in settings.py.
    def process_item(self, item, spider):
        return item
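The generated Project31Pipeline does nothing, and the stock ImagesPipeline neither prints the downloaded URLs nor caps the number of images. A minimal sketch of a subclass that does both follows; LimitedImagesPipeline and MAX_IMAGES are hypothetical names, the count is taken at request time rather than after a confirmed download, and the class would have to replace the stock ImagesPipeline in ITEM_PIPELINES to take effect.

from itemadapter import ItemAdapter
from scrapy import Request
from scrapy.pipelines.images import ImagesPipeline

MAX_IMAGES = 125  # hypothetical value: last three digits of the student ID


class LimitedImagesPipeline(ImagesPipeline):
    # Sketch: request at most MAX_IMAGES images and log each downloaded URL.
    requested = 0

    def get_media_requests(self, item, info):
        for url in ItemAdapter(item).get("image_urls", []):
            if self.requested >= MAX_IMAGES:
                break
            self.requested += 1
            yield Request(url)

    def item_completed(self, results, item, info):
        # Print the URL of every successfully downloaded image to the console.
        for ok, result in results:
            if ok:
                print("Downloaded:", result["url"])
        return item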
Run results: