16--Scrapy02: Pipelines

Posted by Edmond辉仔 on 2024-04-17

Original URL: https://www.cnblogs.com/Edmondhui/p/18139651

Scrapy02--Pipelines

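0. About pipelines

In the previous section we were already able to extract data in the spider. The engine then passes that data on to the pipeline.

So how do we store the data inside the pipeline? This section covers four kinds of storage.

The first three examples use: https://match.lottery.sina.com.cn/lotto/pc_zst/index?lottoType=ssq&actionType=chzs

The last example uses: https://desk.zol.com.cn/dongman/

1. Writing to a csv file

Writing to a file is very simple: just open the file in the pipeline.

Note, however, that doing all the file handling inside process_item is inelegant. We should not have to open the file once for every single item.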

class CaipiaoFilePipeline:
    
    def process_item(self, item, spider):
        with open("caipiao.txt", mode="a", encoding='utf-8') as f:
            # Write to the file in append mode
            f.write(f"{item['qihao']},{'_'.join(item['red_ball'])},{'_'.join(item['blue_ball'])}\n")
        return item

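What we want is to open the file once and then use that single file handle for every write. Is that possible?

Yes. A pipeline can define two extra methods: open_spider() and close_spider().

open_spider() runs once, when the spider starts
close_spider() runs once, when the spider finishes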

class CaipiaoFilePipeline:
    def open_spider(self, spider):
        # Store the handle on the instance so the other methods of the class can use it
        self.f = open("caipiao.txt", mode="a", encoding='utf-8')

    def close_spider(self, spider):
        if self.f:
            self.f.close()

    def process_item(self, item, spider):
        # Write one line to the file
        self.f.write(f"{item['qihao']},{'_'.join(item['red_ball'])},{'_'.join(item['blue_ball'])}\n")
        return item


# settings: enable the pipeline
ITEM_PIPELINES = {
   'caipiao.pipelines.CaipiaoFilePipeline': 300,
}
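
Since this section is about csv output, here is a minimal sketch of the same pipeline built on the standard-library csv module, which handles quoting if a field ever contains a comma. The class name CaipiaoCsvPipeline, the path argument and the run_demo helper are assumptions made for illustration; run_demo only calls the hooks by hand in the order Scrapy would, it does not start a crawl.

```python
import csv


class CaipiaoCsvPipeline:
    """Hypothetical variant of CaipiaoFilePipeline that writes with csv.writer."""

    def __init__(self, path="caipiao.csv"):
        self.path = path
        self.f = None
        self.writer = None

    def open_spider(self, spider):
        # newline="" lets the csv module control line endings itself
        self.f = open(self.path, mode="a", encoding="utf-8", newline="")
        self.writer = csv.writer(self.f)

    def close_spider(self, spider):
        if self.f:
            self.f.close()

    def process_item(self, item, spider):
        # One row per item: issue number, red balls, blue balls
        self.writer.writerow([
            item["qihao"],
            "_".join(item["red_ball"]),
            "_".join(item["blue_ball"]),
        ])
        return item


def run_demo(items, path):
    """Call the hooks in the order Scrapy would, then read the rows back."""
    pipeline = CaipiaoCsvPipeline(path)
    pipeline.open_spider(spider=None)
    try:
        for item in items:
            pipeline.process_item(item, spider=None)
    finally:
        pipeline.close_spider(spider=None)
    with open(path, encoding="utf-8", newline="") as f:
        return list(csv.reader(f))
```

The file is opened once and closed once, no matter how many items pass through process_item.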

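2. Writing to MySQL

With the example above in hand, writing to a database follows naturally.

First, open the database connection in open_spider and close it in close_spider. process_item then does the actual saving.

Start by putting the MySQL settings into settings

# MySQL configuration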
MYSQL_CONFIG = {
   "host": "localhost",
   "port": 3306,
   "user": "root",
   "password": "test123456",
   "database": "spider",
}

from caipiao.settings import MYSQL_CONFIG as mysql

import pymysql

class CaipiaoMySQLPipeline:

    def open_spider(self,spider):
        self.conn = pymysql.connect(
            host=mysql["host"],
            port=mysql["port"],
            user=mysql["user"],
            password=mysql["password"],
            database=mysql["database"]
        )

    def close_spider(self,spider):
        self.conn.close()

    def process_item(self,item,spider):
        # Write to the database
        try:
            cursor = self.conn.cursor()
            sql = "insert into caipiao(qihao,red,blue) values(%s,%s,%s)"
            red = ",".join(item['red_ball'])
            blue = ",".join(item['blue_ball'])
            cursor.execute(sql,(item['qihao'],red,blue))
            self.conn.commit()
            spider.logger.info(f"Saved item {item}")
        except Exception as e:
            self.conn.rollback()
            # Log the error in a single message: extra positional arguments
            # to logger.error() are treated as %-format arguments
            spider.logger.error(f"Failed to save to the database: {e!r}; item: {item}")
        return item


# settings: enable the pipeline
ITEM_PIPELINES = {
   'caipiao.pipelines.CaipiaoMySQLPipeline': 301,
}
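
To try the connect / commit / rollback pattern without a running MySQL server, the sketch below swaps pymysql for sqlite3 from the standard library. This is a stand-in, not the author's setup: the class name CaipiaoSQLitePipeline, the in-memory database and the primary key on qihao are all assumptions for illustration, and sqlite3 uses ? placeholders where pymysql uses %s.

```python
import sqlite3


class CaipiaoSQLitePipeline:
    """Stand-in for CaipiaoMySQLPipeline; sqlite3 replaces pymysql here."""

    def __init__(self, database=":memory:"):
        self.database = database
        self.conn = None

    def open_spider(self, spider):
        self.conn = sqlite3.connect(self.database)
        # Assumed schema: qihao is the primary key, so duplicates are rejected
        self.conn.execute(
            "create table if not exists caipiao("
            "qihao text primary key, red text, blue text)"
        )

    def close_spider(self, spider):
        if self.conn:
            self.conn.close()

    def process_item(self, item, spider):
        sql = "insert into caipiao(qihao, red, blue) values(?, ?, ?)"
        red = ",".join(item["red_ball"])
        blue = ",".join(item["blue_ball"])
        try:
            self.conn.execute(sql, (item["qihao"], red, blue))
            self.conn.commit()
        except sqlite3.Error:
            # Undo the failed statement and keep the crawl going
            self.conn.rollback()
        return item
```

A failed insert, such as a repeated qihao, is rolled back and the item is still returned, so later pipelines keep receiving data.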

3. Writing to MongoDB

Writing to MongoDB follows exactly the same pattern as MySQL

# settings.py  

MONGO_CONFIG = {
   "host": "localhost",
   "port": 27017,
   #'has_user': True,
   #'user': "python_admin",
   #"password": "123456",
   "db": "python"
}


ITEM_PIPELINES = {
    # The three pipelines can coexist; lower numbers run first
   'caipiao.pipelines.CaipiaoFilePipeline': 300,
   'caipiao.pipelines.CaipiaoMySQLPipeline': 301,
   'caipiao.pipelines.CaipiaoMongoDBPipeline': 302,
}

from caipiao.settings import MONGO_CONFIG as mongo

import pymongo

class CaipiaoMongoDBPipeline:
    def open_spider(self,spider):
        client = pymongo.MongoClient(host=mongo['host'], port=mongo['port'])
        db = client[mongo['db']]
        
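        # Note: Database.authenticate() was removed in PyMongo 4;
        # pass username=/password= to MongoClient instead
        # if mongo['has_user']:
        #    db.authenticate(mongo['user'], mongo['password'])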
        self.client = client
        self.collection = db['caipiao']

    def close_spider(self,spider):
        self.client.close()

    def process_item(self,item,spider):
        self.collection.insert_one({"qihao": item['qihao'],'red': item["red_ball"],'blue': item['blue_ball']})
        return item

4. Saving files

Let's try downloading some images with Scrapy.

Image site: https://desk.zol.com.cn/dongman/

First, create the project and fill in the spider. Pay attention to yield scrapy.Request()

import scrapy

from urllib.parse import urljoin


class ZolSpider(scrapy.Spider):
    name = 'zol'
    allowed_domains = ['zol.com.cn']
    start_urls = ['https://desk.zol.com.cn/dongman/']

    def parse(self,response,**kwargs):  # Scrapy calls parse automatically to parse the response
        # print(response.text)
        # 1. Get the detail-page URLs
        a_list = response.xpath("//*[@class='pic-list2  clearfix']/li/a")
        for a in a_list:
            href = a.xpath("./@href").extract_first()
            if href.endswith(".exe"):
                continue

            # print(response.url)   # response.url: the URL of the current request, read from the response object
            # print(href)  # '/bizhi/9109_111583_2.html'

            # href = urljoin(response.url, href)  # this way of joining is reliable
            # Scrapy only:
            href = response.urljoin(href)  # joins response.url with the given href
            # print(href)
            # 2. Request the detail page to get the image download URL

            # Send a new request by yielding a new Request object
            # In the spider, the Request should carry at least:
            # url  -> the URL to request
            # method -> the HTTP method (defaults to GET)
            # callback -> the parse function to run once the response arrives; pass the function itself
            yield scrapy.Request(
                url=href,
                method="get",
                # The parse function that runs automatically once this URL returns
                callback=self.suibianqimignzi,
            )

    def suibianqimignzi(self,response,**kwargs):
        # The response received here is the one returned for url=href
        img_src = response.xpath("//*[@id='bigImg']/@src").extract_first()
        # print(img_src)
        yield {"img_src": img_src}

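4.1 Joining URLs

# URL joining logic: what matters is whether the resource path is relative or absolute

### General rule:
Join the URL of the current request (eg: https://desk.zol.com.cn/dongman/aaa) with the child URL

1. If the child URL is '/bizhi/sss.html'     # absolute path: starts with '/', which means the root of the site
  it is joined to the domain of the request URL

 eg: 'https://desk.zol.com.cn/'  + '/bizhi/sss.html' = 'https://desk.zol.com.cn/bizhi/sss.html'


2. If the child URL is 'bizhi/sss.html'      # relative path: resolved as a sibling of the last segment of the current request URL
  the last segment of the current request URL is removed first, then the two are joined

 eg: 'https://desk.zol.com.cn/dongman/aaa'  + 'bizhi/sss.html' = 'https://desk.zol.com.cn/dongman/bizhi/sss.html'



### Solution: simple, no need to work out the cases yourself
# 1. General-purpose approach
from urllib.parse import urljoin

urljoin(current_request_url, child_url)


# 2. In Scrapy, the response object provides URL joining; under the hood it is the general-purpose approach above
response.urljoin(href)   # ==> joins response.url with href
                         # response.url: the URL of the current request, read from the response object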
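
The two rules above can be checked directly with urljoin from the standard library. The sketch below reuses the example URLs from this section; the variable names are only for illustration.

```python
from urllib.parse import urljoin

# URL of the current request, as in the example above
current_url = "https://desk.zol.com.cn/dongman/aaa"

# Rule 1: a child URL starting with '/' is resolved against the site root
absolute_child = urljoin(current_url, "/bizhi/sss.html")

# Rule 2: a child URL without a leading '/' replaces the last path segment
relative_child = urljoin(current_url, "bizhi/sss.html")

print(absolute_child)  # https://desk.zol.com.cn/bizhi/sss.html
print(relative_child)  # https://desk.zol.com.cn/dongman/bizhi/sss.html
```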

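4.2 The Request object

### Parameters of Request():
url        request URL
method     HTTP method
callback   callback function
errback    callback invoked on error
dont_filter  defaults to False  # when True, the request skips the duplicate filter, so a URL that was already requested is sent again
headers    request headers
cookies    cookie data

meta       metadata       # stores data that other code can read back later from this Request object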
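
To make dont_filter concrete, here is a toy model of duplicate filtering in plain Python. It is not Scrapy's implementation (Scrapy compares request fingerprints rather than raw URL strings); the schedule function and its input format are assumptions made purely for illustration.

```python
def schedule(requests):
    """Toy model: return the URLs that would actually be sent.

    requests is a list of (url, dont_filter) pairs.
    """
    seen = set()
    sent = []
    for url, dont_filter in requests:
        if not dont_filter and url in seen:
            continue  # duplicate dropped by the filter
        seen.add(url)
        sent.append(url)
    return sent


sent = schedule([
    ("https://desk.zol.com.cn/dongman/", False),  # first visit: sent
    ("https://desk.zol.com.cn/dongman/", False),  # duplicate: dropped
    ("https://desk.zol.com.cn/dongman/", True),   # dont_filter=True: sent again
])
```

In this model sent contains the URL twice: once for the first request and once for the request marked dont_filter=True.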

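4.3 Image download pipeline

Next comes the download itself. How do we download an image in a pipeline?

Scrapy ships with ImagesPipeline, which downloads images automatically.

# First install the image-processing module that ImagesPipeline depends on
pip install pillow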

import scrapy
from itemadapter import ItemAdapter

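# ImagesPipeline: the pipeline dedicated to images
from scrapy.pipelines.images import ImagesPipeline

# FilesPipeline: the file download pipeline; the two work in much the same way
from scrapy.pipelines.files import FilesPipeline


class TuPipeline:
    def process_item(self, item, spider):
        print(item['img_src'])
        # One way to save: send the request yourself and write the bytes to a file opened in binary mode
        # import requests
        # resp = requests.get(item['img_src'])
        # with open(item['img_src'].split('/')[-1], 'wb') as f:
            # f.write(resp.content)
        return item



### The Scrapy way: the image pipeline that Scrapy provides
class MyTuPipeline(ImagesPipeline):    # override the three methods below
    # 1. Send the request (to download an image, file, video, ...)
    def get_media_requests(self, item, info):
        url = item['img_src']
        yield scrapy.Request(url=url, meta={"sss": url})  # just yield a Request object


    # 2. Where the image is stored; folders are created automatically
    # Returns a string: the storage path of the image
    # Full path: IMAGES_STORE from settings + the return value of file_path()
    def file_path(self, request, response=None, info=None, *, item=None):
        # Folder
        img_path = "dongman/imgs/kunmo/libaojun/liyijia"

        # File name: the last segment of the URL
        # Option 1: get the URL from the response object
        # file_name = response.url.split("/")[-1]
        # Pitfall: response defaults to None and may not be passed in here, so response.url is not usable

        # Option 2: get the URL from the item. It works, but on a detail page the item usually holds a list of several image URLs, so it is not precise
        # file_name = item['img_src'].split("/")[-1]
        # print("item:", file_name)

        # Option 3: carry the URL of this request in the Request's meta argument (the best option)
        file_name = request.meta['sss'].split("/")[-1]
        print("meta:", file_name)

        real_path = img_path + "/" + file_name  # join folder and file name
        return real_path  # return the storage path


    # 3. Runs after the item is done (images downloaded); typically used to update the item and to log what was downloaded
    def item_completed(self, results, item, info):
        # results: a list with one entry per request (per downloaded image)
        # eg: [(success flag: True or False, result object), (True, object), (True, object)]

        for ok, result in results:  # named result so it does not shadow the info parameter
            if ok:
                print(result['path'])   # path where the downloaded image was stored

        return item  # always return item so the data reaches the next pipeline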
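
One caveat with url.split("/")[-1]: if the image URL carries a query string, the query ends up in the file name. Below is a minimal sketch of a more defensive helper built only on the standard library. The function name file_name_from_url, its default argument and the example.com URLs are hypothetical, chosen for illustration.

```python
import posixpath
from urllib.parse import unquote, urlsplit


def file_name_from_url(url, default="unnamed"):
    """Return the last path segment of url, ignoring query and fragment."""
    path = urlsplit(url).path                  # drops ?query and #fragment
    name = posixpath.basename(unquote(path))   # decode %20 etc., keep last segment
    return name or default                     # fall back when the URL ends in '/'


print(file_name_from_url("https://example.com/a/b/pic.jpg?x=1#top"))  # pic.jpg
```

Inside file_path() a helper like this could stand in for the split("/")[-1] expression, applied to the URL carried in request.meta.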

Finally, configure settings

LOG_LEVEL = "WARNING"

USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML,like Gecko) Chrome/101.0.4951.54 Safari/537.36'

ROBOTSTXT_OBEY = False

ITEM_PIPELINES = {
   'tu.pipelines.TuPipeline': 300,
   'tu.pipelines.MyTuPipeline': 301,
}


# Downloading files (images) may run into 302 redirects
MEDIA_ALLOW_REDIRECTS = True   # allow redirects for media downloads


# Root directory where images are stored
IMAGES_STORE = "./qiaofu"
