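Assignment ①:

1) Pick a website and crawl all of its images, for example China Weather (http://www.weather.com.cn). Use the Scrapy framework to implement both single-threaded and multi-threaded crawling.

Code walkthrough

weather_spiders.py

Parsing the start page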

def parse(self, response):
    urls = response.xpath('//div[@class="tu"]/a/@href').extract()
    for url in urls:
        yield scrapy.Request(url=url, callback=self.imgs_parse)

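parse is Scrapy's default callback: it receives the response for each start URL and extracts data from it.
An XPath expression collects the href of every link inside div elements with class "tu", which gives a list of URLs.
Each URL is wrapped in a new scrapy.Request whose response is handled by imgs_parse.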
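The extraction step can be tried outside Scrapy. The sketch below is only an illustration: the page snippet and its URLs are made up, and it uses the small XPath subset in the standard library's ElementTree rather than Scrapy's full XPath support. It also resolves relative links with urljoin, which is the job response.urljoin does inside a spider.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urljoin

# Hypothetical, well-formed stand-in for the start page (not the real markup).
PAGE_URL = "http://www.weather.com.cn/"
PAGE = """
<html><body>
  <div class="tu"><a href="/photo/1.shtml">one</a></div>
  <div class="tu"><a href="http://p.weather.com.cn/2.shtml">two</a></div>
  <div class="other"><a href="/ignored.shtml">skip</a></div>
</body></html>
"""


def extract_links(page, base_url):
    """Return absolute URLs of links found directly inside div.tu elements."""
    root = ET.fromstring(page.strip())
    anchors = root.findall(".//div[@class='tu']/a")
    # urljoin leaves absolute URLs alone and resolves relative ones.
    return [urljoin(base_url, a.get("href")) for a in anchors if a.get("href")]


links = extract_links(PAGE, PAGE_URL)
print(links)
```

If the real page returns relative href values, passing them straight to scrapy.Request would fail, so resolving them first is a cheap safeguard.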

Parsing image links

def imgs_parse(self, response):
    item = WeatherItem()
    item["pic_url"] = response.xpath('/html/body/div[3]/div[1]/div[1]/div[2]/div/ul/li/a/img/@src').extract()
    yield item

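imgs_parse handles the response for each image page.
It creates a WeatherItem instance to hold the data.
An XPath expression extracts the image links and stores them as a list in the pic_url field. Because the expression is an absolute path from /html, it will stop matching if the page layout changes.
Finally, yield hands the item to Scrapy's item pipelines.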

pipelines.py

Overriding get_media_requests

def get_media_requests(self, item, info):
    # One download request per image URL collected by the spider
    for url in item['pic_url']:
        yield scrapy.Request(url=url)

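get_media_requests generates one download request per image URL.
The item argument is the Scrapy Item produced by the spider.
The info argument carries the media pipeline's internal state for the running spider.

Overriding item_completed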

def item_completed(self, results, item, info):
    if not results[0][0]:
        raise DropItem('下載失敗')
    return item

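item_completed is called once all image requests for an item have finished, whether they succeeded or failed.
results is a list of 2-tuples: the first element is a boolean that says whether the download succeeded, and the second is a dict describing the stored image on success, or the failure details otherwise.
if not results[0][0]: checks whether the first download request succeeded; if it failed, DropItem is raised with the message '下載失敗' ("download failed") and the item is discarded.
If the download succeeded, the original item is returned.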
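Checking only results[0][0] ignores every image after the first, and it raises IndexError when an item has no image URLs at all. As a rough sketch of a fuller check, the hypothetical helper below walks the whole list. The sample data is hand-written to match the shape described above; in a real run the second element of a failed entry is a Twisted Failure rather than a plain exception.

```python
def summarize_results(results):
    """Split media pipeline results into saved paths and failure messages."""
    saved, failed = [], []
    for ok, value in results:
        if ok:
            # On success, value is a dict that includes the stored file path.
            saved.append(value["path"])
        else:
            # On failure, value describes the error; keep its text form.
            failed.append(str(value))
    return saved, failed


# Hand-written sample shaped like the results argument of item_completed.
sample = [
    (True, {"url": "http://example.com/a.jpg", "path": "full/a.jpg", "checksum": "abc"}),
    (False, Exception("download timed out")),
]
saved, failed = summarize_results(sample)
print(saved, failed)
```

Inside item_completed, one option is to drop the item only when saved is empty, so that a single broken image does not discard the rest of the page.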

items.py


class WeatherItem(scrapy.Item):
    pic_url = scrapy.Field()  # list of image URLs found on one page

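settings.py

Scrapy issues requests concurrently by default. It does this asynchronously on a single event loop rather than with one thread per request, but capping the number of concurrent requests gives the single-threaded and multi-threaded behaviour the assignment asks for.
Single-threaded crawling
CONCURRENT_REQUESTS = 1
Multi-threaded crawling
CONCURRENT_REQUESTS = 16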
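One way to switch between the two modes without editing settings.py each time is to keep two settings profiles and assign one to the spider's custom_settings attribute. The profile names and the pick_settings helper below are hypothetical. Note that CONCURRENT_REQUESTS_PER_DOMAIN defaults to 8, so a crawl that stays on one domain is capped at 8 in-flight requests unless that setting is raised as well.

```python
# Hypothetical profiles; the keys are standard Scrapy setting names.
SINGLE_REQUEST_SETTINGS = {
    "CONCURRENT_REQUESTS": 1,
    "CONCURRENT_REQUESTS_PER_DOMAIN": 1,
}
CONCURRENT_SETTINGS = {
    "CONCURRENT_REQUESTS": 16,
    # Raised from the default of 8 so a single-domain crawl can reach 16.
    "CONCURRENT_REQUESTS_PER_DOMAIN": 16,
}


def pick_settings(single):
    """Return a fresh copy of the profile for the requested mode."""
    return dict(SINGLE_REQUEST_SETTINGS if single else CONCURRENT_SETTINGS)


print(pick_settings(single=True))
```

In the spider class this would be used as custom_settings = pick_settings(single=True). The same values can also be passed on the command line with the -s option of scrapy crawl, for example -s CONCURRENT_REQUESTS=1.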

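Output

Gitee folder link

2) Reflections

The flexibility of XPath: XPath is a powerful tool for selecting data from HTML. The code relies on XPath expressions to locate the required data, and for complex page structures a good command of XPath simplifies extraction.
Scrapy's toolchain: Scrapy's request and callback mechanism gives the crawl an efficient, clear workflow. Yielding requests and items keeps the code concise.
Data handling: a custom WeatherItem gives the scraped data a fixed set of fields, which makes it easier to process after the crawl and improves modularity and reuse.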

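Assignment ②:

1) Become proficient with serializing and exporting data through Scrapy's Item and Pipeline; crawl stock information using the Scrapy + XPath + MySQL storage route.

Candidate website: Eastmoney: https://www.eastmoney.com/

Code walkthrough

stock_spiders.py

The parse method receives the page response.
An XPath expression selects the div elements that contain the stock data.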

for stock in stocks:
    item = StockItem()
    item['bStockNo'] = stock.xpath('.//div[@class="code"]/text()').get()
    item['bStockName'] = stock.xpath('.//div[@class="name"]/text()').get()
    item['latestPrice'] = stock.xpath('.//div[@class="latest"]/text()').get()
    item['priceChangePercent'] = stock.xpath('.//div[@class="percent"]/text()').get()
    item['priceChange'] = stock.xpath('.//div[@class="change"]/text()').get()
    item['volume'] = stock.xpath('.//div[@class="volume"]/text()').get()
    item['amplitude'] = stock.xpath('.//div[@class="amplitude"]/text()').get()
    item['highest'] = stock.xpath('.//div[@class="highest"]/text()').get()
    item['lowest'] = stock.xpath('.//div[@class="lowest"]/text()').get()
    item['openPrice'] = stock.xpath('.//div[@class="open"]/text()').get()
    item['closePrice'] = stock.xpath('.//div[@class="close"]/text()').get()
    yield item

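The loop iterates over the selected stock elements and creates a StockItem for each one.
XPath expressions extract each field of the stock and assign it to the item.
Finally, yield returns the item.

pipelines.py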
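The values returned by .get() are strings, or None when nothing matches, while several columns in the stocks table are FLOAT. A small conversion step before the item reaches the database avoids insert errors. The helper below is a sketch under assumptions: the name to_float is hypothetical, and treating "-" and "--" as missing-value placeholders is a guess about how the page renders empty cells.

```python
def to_float(text):
    """Convert scraped text such as ' 12.34 ' or '-1.25%' to float, else None."""
    if text is None:
        return None
    # Drop surrounding whitespace, thousands separators and a trailing percent sign.
    cleaned = text.strip().replace(",", "").rstrip("%")
    if cleaned in ("", "-", "--"):  # assumed placeholders for missing values
        return None
    try:
        return float(cleaned)
    except ValueError:
        return None


print(to_float(" 12.34 "), to_float("-1.25%"), to_float("-"))
```

It could be applied in the spider, for example item['latestPrice'] = to_float(stock.xpath(...).get()), or in the pipeline just before the INSERT.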

self.cursor = self.connection.cursor()
self.cursor.execute('''
    CREATE TABLE IF NOT EXISTS stocks (
        id INT PRIMARY KEY AUTO_INCREMENT,
        bStockNo VARCHAR(20),
        bStockName VARCHAR(100),
        latestPrice FLOAT,
        priceChangePercent FLOAT,
        priceChange FLOAT,
        volume VARCHAR(20),
        amplitude VARCHAR(20),
        highest FLOAT,
        lowest FLOAT,
        openPrice FLOAT,
        closePrice FLOAT
    )
''')
self.connection.commit()

A cursor is created to execute SQL statements.
The stocks table is created if it does not already exist.

def close_spider(self, spider):
    self.cursor.close()
    self.connection.close()

close_spider closes the cursor and the database connection.

def process_item(self, item, spider):
    self.cursor.execute('''
        INSERT INTO stocks (bStockNo, bStockName, latestPrice, priceChangePercent,
        priceChange, volume, amplitude, highest, lowest, openPrice, closePrice)
        VALUES (%s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s)
    ''', (item['bStockNo'], item['bStockName'], item['latestPrice'],
          item['priceChangePercent'], item['priceChange'], item['volume'],
          item['amplitude'], item['highest'], item['lowest'],
          item['openPrice'], item['closePrice']))
    self.connection.commit()
    return item

process_item handles each scraped item.
It inserts the data with a parameterized SQL statement and commits the change.
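The three pipeline hooks fit together as open, insert, close. The sketch below shows that lifecycle end to end, with one loud assumption: it swaps MySQL for the standard library's sqlite3 with an in-memory database so that it runs without a server. The class name, the reduced column list and the sample row are hypothetical. sqlite3 uses ? placeholders where pymysql uses %s.

```python
import sqlite3


class SqliteStockPipeline:
    """Stand-in for the MySQL pipeline; same hooks, sqlite3 instead of pymysql."""

    def open_spider(self, spider):
        # Assumption: in-memory SQLite replaces the MySQL connection.
        self.connection = sqlite3.connect(":memory:")
        self.cursor = self.connection.cursor()
        self.cursor.execute(
            "CREATE TABLE IF NOT EXISTS stocks ("
            "id INTEGER PRIMARY KEY AUTOINCREMENT, "
            "bStockNo TEXT, bStockName TEXT, latestPrice REAL)"
        )
        self.connection.commit()

    def process_item(self, item, spider):
        try:
            self.cursor.execute(
                "INSERT INTO stocks (bStockNo, bStockName, latestPrice) VALUES (?, ?, ?)",
                (item["bStockNo"], item["bStockName"], item["latestPrice"]),
            )
            self.connection.commit()
        except sqlite3.Error:
            # Undo the failed insert, then let the error surface in the crawl log.
            self.connection.rollback()
            raise
        return item

    def close_spider(self, spider):
        self.cursor.close()
        self.connection.close()


pipeline = SqliteStockPipeline()
pipeline.open_spider(spider=None)
pipeline.process_item(
    {"bStockNo": "TEST01", "bStockName": "Sample Co", "latestPrice": 10.5},
    spider=None,
)
rows = pipeline.cursor.execute(
    "SELECT bStockNo, bStockName, latestPrice FROM stocks"
).fetchall()
pipeline.close_spider(spider=None)
print(rows)
```

Opening the connection once in open_spider and closing it in close_spider means the cost of connecting is paid once per crawl rather than once per item.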

items.py

class StockItem(scrapy.Item):
    id = scrapy.Field()          # serial number
    bStockNo = scrapy.Field()    # stock code
    bStockName = scrapy.Field()  # stock name
    latestPrice = scrapy.Field() # latest price
    priceChangePercent = scrapy.Field() # change (percent)
    priceChange = scrapy.Field()  # change (amount)
    volume = scrapy.Field()       # trading volume
    amplitude = scrapy.Field()    # amplitude
    highest = scrapy.Field()      # day high
    lowest = scrapy.Field()       # day low
    openPrice = scrapy.Field()    # today's open
    closePrice = scrapy.Field()   # previous close

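Output

Gitee folder link

2) Reflections

Connecting to the database with pymysql is a common approach; reading the connection settings from the Scrapy settings makes the pipeline more flexible and easier to maintain.
Code should manage resources carefully: releasing the database connection when the spider finishes helps avoid resource leaks.
Error handling could be added to make the program more robust.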
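Reading the connection settings from the Scrapy settings is usually done in a from_crawler class method, which Scrapy calls to build the pipeline when the method is defined. The sketch below only illustrates that wiring: the class name and the MYSQL_* setting names are assumptions, no database connection is opened, and a SimpleNamespace with a plain dict stands in for the real crawler object.

```python
from types import SimpleNamespace


class StockMySQLPipeline:
    """Holds connection parameters; a real pipeline would connect in open_spider."""

    def __init__(self, host, user, password, database):
        self.host = host
        self.user = user
        self.password = password
        self.database = database

    @classmethod
    def from_crawler(cls, crawler):
        settings = crawler.settings
        # MYSQL_* are hypothetical names that would be defined in settings.py.
        return cls(
            host=settings.get("MYSQL_HOST", "localhost"),
            user=settings.get("MYSQL_USER", "root"),
            password=settings.get("MYSQL_PASSWORD", ""),
            database=settings.get("MYSQL_DATABASE", "stocks"),
        )


# Fake crawler for illustration; Scrapy passes a real Crawler instance.
fake_crawler = SimpleNamespace(
    settings={"MYSQL_HOST": "db.example.internal", "MYSQL_DATABASE": "market"}
)
pipeline = StockMySQLPipeline.from_crawler(fake_crawler)
print(pipeline.host, pipeline.user, pipeline.database)
```

Keeping credentials in settings rather than in the pipeline body also makes it easier to keep them out of version control.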

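Assignment ③:

1) Become proficient with serializing and exporting data through Scrapy's Item and Pipeline; crawl foreign exchange data using the Scrapy + XPath + MySQL storage route.

Candidate website: Bank of China: https://www.boc.cn/sourcedb/whpj/

Code walkthrough

myspiders.py

XPath selects the table rows first, which sets up the per-row field extraction that follows.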

        for tr in trs[1:]:
            # default="" avoids AttributeError on .strip() when a cell is empty
            Currency = tr.xpath("./td[1]/text()").extract_first(default="").strip()
            TSP = tr.xpath("./td[4]/text()").extract_first(default="").strip()
            CSP = tr.xpath("./td[5]/text()").extract_first(default="").strip()
            TBP = tr.xpath("./td[6]/text()").extract_first(default="").strip()
            CBP = tr.xpath("./td[7]/text()").extract_first(default="").strip()
            Time = tr.xpath("./td[8]/text()").extract_first(default="").strip()

The loop iterates over the table rows, skipping the header row, and extracts the required fields from each one.

            item['Currency'] = Currency
            item['TSP'] = TSP
            item['CSP'] = CSP
            item['TBP'] = TBP
            item['CBP'] = CBP
            item['Times'] = Time
            item['Id'] = cont
            cont += 1
            yield item

The extracted values are assigned to the item, which is then returned with yield.

pipelines.py

        # pymysql expects a MySQL charset name such as 'utf8mb4', not 'UTF-8'
        connect = pymysql.connect(host='localhost', user='chenshuo', password='cs031904104',
                                   database='cs031904104', charset='utf8mb4')
        cur = connect.cursor()

A database connection and a cursor are created; the cursor is used to execute the SQL commands that follow.

        try:
            cur.execute(
                "insert into rate_cs (id,Currency,TSP,CSP,TBP,CBP,Times) values ('%d','%s','%s','%s','%s','%s','%s')" % (
                    item['Id'], item['Currency'].replace("'", "''"), item['TSP'].replace("'", "''"),
                    item['CSP'].replace("'", "''"), item['TBP'].replace("'", "''"),
                    item['CBP'].replace("'", "''"), item['Times'].replace("'", "''")))
            connect.commit()  # commit the insert
        except Exception as er:
            print(er)

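The insert runs inside a try block, and any exception is caught and printed.
The SQL statement inserts the data and commits the change; single quotes in the values are doubled so that they do not break the statement.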

        connect.close()  # close the database connection
        return item

The database connection is closed and the processed item is returned.

items.py

class Exp42Item(scrapy.Item):
    Currency = scrapy.Field()
    TSP = scrapy.Field()
    CSP = scrapy.Field()
    TBP = scrapy.Field()
    CBP = scrapy.Field()
    Times = scrapy.Field()
    Id = scrapy.Field()

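Output

Gitee folder link

2) Reflections

Printing output makes debugging easier and helps confirm that the data was extracted correctly.
Building the INSERT with string formatting is convenient but unsafe and open to SQL injection; parameterized queries are worth trying next time.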
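A parameterized version of the insert might look like the sketch below. It is an illustration under a stated assumption: sqlite3 with an in-memory database stands in for MySQL so the example runs anywhere, the table is reduced to four columns, and the sample row is invented. The driver binds the values itself, so a quote inside a value needs no manual doubling.

```python
import sqlite3

# Assumption: in-memory SQLite replaces the MySQL database used in the assignment.
connection = sqlite3.connect(":memory:")
cursor = connection.cursor()
cursor.execute("CREATE TABLE rate_cs (id INTEGER, Currency TEXT, TSP TEXT, Times TEXT)")

# Invented row; the apostrophe would break a string-formatted statement.
row = (1, "O'Test Dollar", "712.34", "10:30:00")
cursor.execute(
    "INSERT INTO rate_cs (id, Currency, TSP, Times) VALUES (?, ?, ?, ?)",
    row,
)
connection.commit()

stored = cursor.execute(
    "SELECT Currency FROM rate_cs WHERE id = ?", (1,)
).fetchone()[0]
connection.close()
print(stored)
```

With pymysql the pattern is the same except that the placeholders are written %s and the values are passed as the second argument to execute, as the process_item method in Assignment ② already does.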
