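Data Collection and Fusion Technology: Assignment 3

Main repository link

Assignment 1:

Requirement: pick a website and crawl all of its images, for example the China Weather Network (http://www.weather.com.cn). Use the Scrapy framework to implement the crawl in both single-threaded and multi-threaded modes.
Be sure to limit the crawl, e.g. the total number of pages (last 2 digits of the student ID) and the total number of downloaded images (last 3 digits).
Output: print the downloaded URLs to the console, save the downloaded images in the images subfolder, and provide screenshots.

Code and results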

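import scrapy

# Assumption: WeatherImageItem is defined in the project's items.py with an image_urls field
from ..items import WeatherImageItem


class ChinaWeatherSpider(scrapy.Spider):
    name = 'china_weather'
    allowed_domains = ['www.weather.com.cn']
    start_urls = ['http://www.weather.com.cn/']

    # Crawl limits taken from the student ID (kept as a string so it can be sliced)
    student_id_suffix = '102202107'
    max_pages = int(student_id_suffix[-2:])
    max_images = int(student_id_suffix[-3:])

    def parse(self, response):
        # Running totals are passed from page to page through request meta
        page = response.meta.get('page', 1)
        count = response.meta.get('count', 0)

        # Extract image URLs according to the page structure
        for img_url in response.css('img::attr(src)').getall():
            if count >= self.max_images:
                return
            count += 1
            yield WeatherImageItem(image_urls=[response.urljoin(img_url)])

        # Build the next page URL and follow it while both limits allow
        if page <= self.max_pages and count < self.max_images:
            next_page = response.urljoin('/changepage.shtml?pg={}'.format(page))
            yield response.follow(next_page, self.parse, meta={'page': page + 1, 'count': count})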
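
The spider above only yields items; the downloading and the crawl limits are usually configured through settings. Scrapy issues requests asynchronously rather than with one thread per request, so the "single-threaded" and "multi-threaded" modes asked for here are commonly approximated by changing CONCURRENT_REQUESTS. The following is a minimal sketch under that assumption. The helper names crawl_limits and build_settings are hypothetical; the setting keys are standard Scrapy settings, and the built-in ImagesPipeline needs Pillow installed.

```python
def crawl_limits(student_id):
    """Return (max_pages, max_images) from the last 2 and last 3 digits of the ID."""
    return int(student_id[-2:]), int(student_id[-3:])


def build_settings(student_id, multithreaded):
    """Build a settings dict for the image crawl (hypothetical helper)."""
    max_pages, max_images = crawl_limits(student_id)
    return {
        # Built-in pipeline that downloads every URL listed in item['image_urls']
        "ITEM_PIPELINES": {"scrapy.pipelines.images.ImagesPipeline": 1},
        # Folder the images are written to
        "IMAGES_STORE": "images",
        # 1 request in flight approximates single-threaded; 16 is Scrapy's default
        "CONCURRENT_REQUESTS": 16 if multithreaded else 1,
        # CloseSpider extension: stop after this many responses / scraped items
        "CLOSESPIDER_PAGECOUNT": max_pages,
        "CLOSESPIDER_ITEMCOUNT": max_images,
    }


single = build_settings("102202107", multithreaded=False)
multi = build_settings("102202107", multithreaded=True)
```

Either dict can be assigned to the spider's custom_settings class attribute. Because each item carries one image URL, CLOSESPIDER_ITEMCOUNT is only an approximate cap on the number of images: requests already in flight may still complete.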

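Run results:

Assignment link

Reflections:

Learning to crawl a website's images with the Scrapy framework gave me a deeper understanding of Python programming and web crawlers. By implementing both single-threaded and multi-threaded crawling, I learned how to control the crawl rate and volume so as to comply with the site's terms of use and protect its resources. The exercise sharpened my programming skills and also strengthened my awareness of web ethics.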

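Assignment 2:

Requirement: become proficient with serialising and outputting data through Scrapy's Item and Pipeline; crawl stock information using the Scrapy + XPath + MySQL storage approach.
Candidate website: Eastmoney (東方財富網): https://www.eastmoney.com/
Output: the MySQL storage and output format is as follows:
Column headers use English names, e.g. serial number: id, stock code: bStockNo……, to be defined by the students themselves.

Code and results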

stock_spider.py

        # Excerpt from parse(): follow the paginated list API up to page 3
        self.page_num = response.meta.get('page_num', 1)
        if self.page_num < 3:
            self.page_num += 1
            next_page = f"https://69.push2.eastmoney.com/api/qt/clist/get?cb=jQuery112404359196896638151_1697701391202&pn={self.page_num}&pz=20&po=1&np=1&ut=bd1d9ddb04089700cf9c27f6f7426281&fltt=2&invt=2&wbp2u=|0|0|0|web&fid=f3&fs=m:1+t:2,m:1+t:23&fields=f1,f2,f3,f4,f5,f6,f7,f8,f9,f10,f12,f13,f14,f15,f16,f17,f18,f20,f21,f23,f24,f25,f22,f11,f62,f128,f136,f115,f152&_=1697701391203"
            yield scrapy.Request(url=next_page, callback=self.parse, meta={'page_num': self.page_num})
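
The request URL above carries a cb= callback parameter, which suggests the response is JSONP (JSON wrapped in a function call) rather than HTML. The sketch below shows one way to unwrap such a payload and map it onto the StockItem field names. Two things here are assumptions and not taken from the original post: that the rows sit under data.diff, and the f-code to field mapping in FIELD_MAP. Both should be checked against a live response.

```python
import json
import re

# Assumed mapping from the API's f-codes to the StockItem field names
FIELD_MAP = {
    "f12": "code",
    "f14": "name",
    "f2": "latest_price",
    "f3": "change_degree",
    "f4": "change_amount",
    "f5": "count",
    "f6": "money",
    "f7": "zfcount",
    "f15": "highest",
    "f16": "lowest",
    "f17": "today",
    "f18": "yesterday",
}


def parse_jsonp(text):
    """Strip the callback wrapper cb(...) and decode the JSON inside."""
    match = re.search(r"\((.*)\)", text, re.S)
    if match is None:
        raise ValueError("not a JSONP payload")
    return json.loads(match.group(1))


def extract_rows(text):
    """Return one dict per stock, keyed by StockItem field names."""
    data = parse_jsonp(text).get("data") or {}
    return [
        {name: row.get(code) for code, name in FIELD_MAP.items()}
        for row in data.get("diff", [])
    ]


# Hand-written sample payload, for illustration only
sample = 'jQuery1_2({"data": {"diff": [{"f12": "600000", "f14": "X", "f2": 7.1}]}});'
rows = extract_rows(sample)
```

Inside parse(), each dict could be turned into an item with StockItem(**row); fields missing from a row come back as None.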

items.py

import scrapy

class StockItem(scrapy.Item):
    code = scrapy.Field()
    name = scrapy.Field()
    latest_price = scrapy.Field()
    change_degree = scrapy.Field()
    change_amount = scrapy.Field()
    count = scrapy.Field()
    money = scrapy.Field()
    zfcount = scrapy.Field()
    highest = scrapy.Field()
    lowest = scrapy.Field()
    today = scrapy.Field()
    yesterday = scrapy.Field()
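
The post does not show the pipeline that writes StockItem to MySQL, so the following is only a sketch of what such a pipeline might look like, assuming the pymysql driver. The class name MySQLStockPipeline, the helper build_insert, the table name stock, and all connection details are hypothetical placeholders.

```python
def build_insert(table, item):
    """Return (sql, params) for a parameterised INSERT using %s placeholders."""
    columns = list(item.keys())
    sql = "INSERT INTO `{}` ({}) VALUES ({})".format(
        table,
        ", ".join("`{}`".format(c) for c in columns),
        ", ".join(["%s"] * len(columns)),
    )
    return sql, tuple(item[c] for c in columns)


class MySQLStockPipeline:
    """Hypothetical pipeline; enable it through ITEM_PIPELINES in settings.py."""

    def open_spider(self, spider):
        import pymysql  # imported lazily so this module loads without the driver

        # Connection details are placeholders, not taken from the original post
        self.conn = pymysql.connect(
            host="localhost", user="root", password="secret",
            database="stocks", charset="utf8mb4",
        )
        self.cursor = self.conn.cursor()

    def process_item(self, item, spider):
        # Values are passed separately so the driver escapes them
        sql, params = build_insert("stock", dict(item))
        self.cursor.execute(sql, params)
        self.conn.commit()
        return item

    def close_spider(self, spider):
        self.cursor.close()
        self.conn.close()
```

Column names are wrapped in backticks because names such as count and name can clash with SQL keywords. The target table is assumed to exist already, with one column per StockItem field.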

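Run results:

Assignment link

Reflections:

While doing this assignment I sent too many requests in a short time, which violated the site's robots.txt rules. It taught me that a crawler must take care not to break those rules.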
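
One way to avoid this kind of problem is to read the robots.txt rules before crawling and to slow the crawler down. The sketch below uses Python's standard urllib.robotparser on a made-up robots.txt (SAMPLE_ROBOTS is hypothetical, not the real file of any site) and then derives a few Scrapy politeness settings from it. The setting keys are standard Scrapy settings; the chosen values are assumptions.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, for illustration only
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""


def load_rules(robots_txt):
    """Parse robots.txt text into a RobotFileParser."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


rules = load_rules(SAMPLE_ROBOTS)

# Scrapy settings that apply the same politeness inside a project
POLITE_SETTINGS = {
    "ROBOTSTXT_OBEY": True,                 # skip URLs that robots.txt disallows
    "DOWNLOAD_DELAY": rules.crawl_delay("*") or 1,  # seconds between requests
    "AUTOTHROTTLE_ENABLED": True,           # back off when the server slows down
    "CONCURRENT_REQUESTS_PER_DOMAIN": 1,    # one request at a time per site
}
```

With these rules, rules.can_fetch("*", url) tells the crawler whether a given URL may be requested at all, while DOWNLOAD_DELAY and AutoThrottle limit how fast the allowed URLs are fetched.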

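Assignment 3:

Requirement: become proficient with serialising and outputting data through Scrapy's Item and Pipeline; crawl data from a foreign exchange website using the Scrapy + XPath + MySQL storage approach.

Code and results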

boc_spider.py

import scrapy

# Assumption: the spider lives in the project's spiders/ package next to items.py
from ..items import ForexItem


class BocSpider(scrapy.Spider):
    name = 'boc'
    allowed_domains = ['boc.cn']
    start_urls = ['https://www.boc.cn/sourcedb/whpj/']

    def parse(self, response):
        # Extract the data with XPath, one table row at a time
        for row in response.xpath('//table/tr'):
            # Skip header rows, which have no <td> cells
            if not row.xpath('./td'):
                continue
            item = ForexItem()
            item['currency'] = row.xpath('./td[1]/text()').get()
            item['tbp'] = row.xpath('./td[2]/text()').get()
            item['cbp'] = row.xpath('./td[3]/text()').get()
            item['tsp'] = row.xpath('./td[4]/text()').get()
            item['csp'] = row.xpath('./td[5]/text()').get()
            item['time'] = row.xpath('./td[6]/text()').get()
            yield item

items.py

import scrapy

class ForexItem(scrapy.Item):
    currency = scrapy.Field()
    tbp = scrapy.Field()
    cbp = scrapy.Field()
    tsp = scrapy.Field()
    csp = scrapy.Field()
    time = scrapy.Field()
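
The XPath queries in boc_spider.py return raw strings, or None for empty cells, so a pipeline may want to normalise the values before storing them. The following is a minimal sketch of such a cleaning step, assuming the four rate fields the spider fills (tbp, cbp, tsp, csp) hold plain decimal strings. The function name clean_forex_row is hypothetical.

```python
from decimal import Decimal, InvalidOperation

# The four rate fields filled by the spider
RATE_FIELDS = ("tbp", "cbp", "tsp", "csp")


def clean_forex_row(raw):
    """Return a cleaned copy of one scraped row, or None if it is not a data row."""
    currency = (raw.get("currency") or "").strip()
    if not currency:
        return None  # rows without a currency name carry no usable data
    cleaned = {"currency": currency, "time": (raw.get("time") or "").strip()}
    for field in RATE_FIELDS:
        text = (raw.get(field) or "").strip()
        try:
            # Empty cells become None; everything else is parsed exactly
            cleaned[field] = Decimal(text) if text else None
        except InvalidOperation:
            cleaned[field] = None
    return cleaned


example = clean_forex_row({
    "currency": " USD ", "tbp": "712.5", "cbp": None,
    "tsp": " 715.52", "csp": "", "time": "10:30:00",
})
```

Decimal is used instead of float so that quoted rates are stored without binary rounding, which fits a DECIMAL column in MySQL.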

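Run results:

Reflections:

I learned how to use XPath to extract data from the foreign exchange site precisely and to store it in a MySQL database through a Pipeline.

Assignment link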
