Persistent Storage in the Scrapy Framework

Posted by Bound_w on 2019-03-01

Original URL: https://www.cnblogs.com/wqzn/p/10458611.html

1. Persistence via terminal commands

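Make sure the spider's parse method returns an iterable object (typically a list of dicts). That return value can then be written to a file in a chosen format directly from the command line.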

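Run the crawl with the -o option; the file extension selects the format in which the scraped data is stored:
    scrapy crawl <spider_name> -o xxx.json
    scrapy crawl <spider_name> -o xxx.xml
    scrapy crawl <spider_name> -o xxx.csv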
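
To make the expected data shape concrete, here is a minimal sketch using only the standard library. It is not Scrapy's own feed exporter; the sample records and the helper names to_json and to_csv are hypothetical. It only shows how a list of flat dicts, the kind of value parse would return, maps onto JSON and CSV.

```python
import csv
import io
import json

# Hypothetical records, shaped like what parse() might return.
records = [
    {"job_name": "Python crawler engineer", "salary": "15k-25k", "company": "ExampleCo"},
    {"job_name": "Data engineer", "salary": "20k-30k", "company": "SampleSoft"},
]


def to_json(rows):
    """Serialize a list of dicts as a JSON array."""
    return json.dumps(rows, ensure_ascii=False)


def to_csv(rows):
    """Serialize a list of flat dicts as CSV, using the first row's keys as the header."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=list(rows[0].keys()), lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


json_text = to_json(records)
csv_text = to_csv(records)
```

Flat dicts with the same keys in every record are what make the CSV form straightforward: each key becomes a column.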

2. Persistence via pipelines

Scrapy already ships with efficient, convenient support for persistence that we can use directly. To use it, first get to know the following two files:

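    items.py: the data structure template file. It defines the fields of the scraped data.
    pipelines.py: the pipeline file. It receives the data (items) and persists it.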

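Persistence workflow:
    1. After the spider has scraped the data, wrap the data in an item object.
    2. Use the yield keyword to submit the item object to the pipelines for persistence.
    3. In the pipeline's process_item method, receive the item submitted by the spider and write the code that persists the data stored in it.
    4. Enable the pipeline in the settings.py configuration file.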
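
The workflow above can be sketched without Scrapy itself. In this minimal illustration, ListPipeline and run are hypothetical names: run stands in for the Scrapy engine (it is not how Scrapy is implemented), and the pipeline "persists" items into an in-memory list. Only the three method names open_spider, process_item and close_spider mirror the hooks Scrapy calls on a pipeline.

```python
class ListPipeline:
    """Hypothetical pipeline that stores items in an in-memory list."""

    def open_spider(self, spider):
        # Called once before any item arrives: set up the storage.
        self.rows = []
        self.closed = False

    def process_item(self, item, spider):
        # Called once per item: persist it, then hand it on.
        self.rows.append(dict(item))
        return item

    def close_spider(self, spider):
        # Called once after the last item: release the storage.
        self.closed = True


def run(items, pipeline, spider=None):
    """Stand-in for the engine: open, feed every item, always close."""
    pipeline.open_spider(spider)
    try:
        for item in items:
            pipeline.process_item(item, spider)
    finally:
        pipeline.close_spider(spider)
    return pipeline.rows


demo_pipeline = ListPipeline()
stored = run([{"job_name": "a"}, {"job_name": "b"}], demo_pipeline)
```

Opening resources in open_spider and releasing them in close_spider keeps process_item focused on a single item at a time.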

A quick exercise: scrape job listings from Boss Zhipin and persist them.

Spider file:

# -*- coding: utf-8 -*-
import scrapy
from bossPro.items import BossproItem


class BossSpider(scrapy.Spider):
    name = 'boss'
    # allowed_domains = ['www.xxx.com']
    start_urls = ['https://www.zhipin.com/job_detail/?query=python%E7%88%AC%E8%99%AB&scity=101010100&industry=&position=']

    def parse(self, response):
        # Each <li> under the job list holds one job posting
        li_list = response.xpath('//div[@class="job-list"]/ul/li')
        for li in li_list:
            job_name = li.xpath('.//div[@class = "info-primary"]/h3/a/div/text()').extract_first()
            salary = li.xpath('.//div[@class = "info-primary"]/h3/a/span/text()').extract_first()
            company = li.xpath('.//div[@class = "company-text"]/h3/a/text()').extract_first()

            # Instantiate an item object
            item = BossproItem()
            # Store all parsed fields on the item
            item["job_name"] = job_name
            item["salary"] = salary
            item["company"] = company

            # Submit the item to the pipelines
            yield item

- Items file: items.py

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/items.html

import scrapy


class BossproItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    job_name = scrapy.Field()
    salary = scrapy.Field()
    company = scrapy.Field()

- Pipeline file: pipelines.py (storage based on MySQL and Redis)

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html

import json

import pymysql
from redis import Redis


class BossproPipeline(object):
    fp = None

    def open_spider(self, spider):
        # Called once when the spider starts: open the output file
        print('Spider started >>>>>')
        self.fp = open('./boss.txt', 'w', encoding='utf-8')

    def close_spider(self, spider):
        # Called once when the spider finishes: close the output file
        print('Spider finished >>>>>')
        self.fp.close()

    def process_item(self, item, spider):
        # format() also tolerates fields that were parsed as None
        self.fp.write('{}:{}:{}\n'.format(item['job_name'], item['salary'], item['company']))
        # Return the item so the next pipeline in ITEM_PIPELINES receives it
        return item


class mysqlPileLine(object):
    conn = None
    cursor = None

    # Open the database connection
    def open_spider(self, spider):
        self.conn = pymysql.Connect(host='127.0.0.1', port=3306, user='root', password='', db='scrapy', charset='utf8')
        print(self.conn)

    # Store the data
    def process_item(self, item, spider):
        self.cursor = self.conn.cursor()  # create a cursor
        try:
            # Parameterized query: the driver escapes the values itself
            self.cursor.execute('insert into boss values (%s, %s, %s)', (item['job_name'], item['salary'], item['company']))
            self.conn.commit()
        except Exception as e:
            print(e)
            self.conn.rollback()
        return item

    def close_spider(self, spider):
        # Close the cursor (if one was created) before the connection
        if self.cursor:
            self.cursor.close()
        self.conn.close()


class redisPileLine(object):
    conn = None

    def open_spider(self, spider):
        self.conn = Redis(host='127.0.0.1', port=6379)
        print(self.conn)

    def process_item(self, item, spider):
        dic = {
            'name': item['job_name'],
            'salary': item['salary'],
            'company': item['company']
        }
        # Serialize the dict to a JSON string before pushing it onto the list
        self.conn.lpush('boss', json.dumps(dic, ensure_ascii=False))
        return item

- settings.py configuration

USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.109 Safari/537.36'


# Obey robots.txt rules
ROBOTSTXT_OBEY = False



ITEM_PIPELINES = {
    'bossPro.pipelines.BossproPipeline': 300,
    'bossPro.pipelines.mysqlPileLine': 301,
    'bossPro.pipelines.redisPileLine': 302,
}
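
The numbers in ITEM_PIPELINES are priorities: pipelines with lower values run first, and each one receives the item returned by the one before it. The short sketch below only illustrates that ordering; pipeline_order is a hypothetical helper written for this example, not a Scrapy API.

```python
ITEM_PIPELINES = {
    'bossPro.pipelines.BossproPipeline': 300,
    'bossPro.pipelines.mysqlPileLine': 301,
    'bossPro.pipelines.redisPileLine': 302,
}


def pipeline_order(pipelines):
    """Return the pipeline paths sorted by priority, lowest number first."""
    return [path for path, _priority in sorted(pipelines.items(), key=lambda kv: kv[1])]


order = pipeline_order(ITEM_PIPELINES)
for position, path in enumerate(order, start=1):
    print(position, path)
```

This is why every process_item should end with return item: without it, the pipelines that run later have nothing to store.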

Creating the MySQL table:
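
The table definition itself is not shown here. As a rough sketch, a table that accepts the three-value insert used by the MySQL pipeline could look like the following. The column names and lengths are assumptions. The statement is executed against an in-memory SQLite database from the standard library purely to keep the snippet self-contained; on MySQL, the same CREATE TABLE would be run inside the scrapy database.

```python
import sqlite3

# Assumed schema: three text columns in the same order as the INSERT values.
CREATE_BOSS_TABLE = """
CREATE TABLE boss (
    job_name VARCHAR(100),
    salary   VARCHAR(50),
    company  VARCHAR(100)
)
"""


def make_demo_db():
    """Create an in-memory SQLite database containing the assumed boss table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(CREATE_BOSS_TABLE)
    return conn


demo_conn = make_demo_db()
# sqlite3 uses ? placeholders; pymysql uses %s for the same purpose.
demo_conn.execute(
    "INSERT INTO boss VALUES (?, ?, ?)",
    ("Python crawler", "15k-25k", "ExampleCo"),  # hypothetical sample row
)
rows = demo_conn.execute("SELECT job_name, salary, company FROM boss").fetchall()
```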

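Run the spider from the terminal and check the output:

scrapy crawl boss

Note:

If Redis raises an error when storing a dict, the cause is that the installed redis module does not support dict values. Either serialize the dict to a string before storing it (for example with json.dumps), or install an older version of the module from the terminal: pip install -U redis==2.10.6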
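
As an alternative to pinning an old version, the dict can be turned into a JSON string before it is pushed and parsed again when read. The sketch below shows only that round trip. The sample item is hypothetical, and a plain Python list stands in for the Redis list named boss so that the snippet runs without a Redis server; with a real client the push would be conn.lpush('boss', payload).

```python
import json

# Hypothetical item data, shaped like the dict built in the Redis pipeline.
item = {'name': 'Python crawler', 'salary': '15k-25k', 'company': 'ExampleCo'}

# Serialize to a string; ensure_ascii=False keeps non-ASCII text readable.
payload = json.dumps(item, ensure_ascii=False)

# Stand-in for the Redis list: LPUSH adds the new value at the head.
fake_redis_list = []
fake_redis_list.insert(0, payload)

# Reading back: take the head element and parse it into a dict again.
restored = json.loads(fake_redis_list[0])
```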
