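1. Persistent storage via terminal commands
- Make sure the parse method in the spider file returns an iterable object (usually a list or a dict). That return value can then be written to a file in a chosen format directly from the terminal.
Run the crawl with an output file to store the scraped data in the format implied by the file extension:
scrapy crawl spider_name -o xxx.json
scrapy crawl spider_name -o xxx.xml
scrapy crawl spider_name -o xxx.csv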
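To make the expected data shape concrete, here is a minimal standard-library sketch of a list of dicts and a rough equivalent of writing it out as JSON and CSV. It is not Scrapy's exporter code; the sample records and the helper names `to_json` and `to_csv` are made up for illustration.

```python
import csv
import io
import json

# Hypothetical sample records, shaped like what parse() would return.
records = [
    {"job_name": "Python crawler engineer", "salary": "15k-25k", "company": "ExampleCo"},
    {"job_name": "Data engineer", "salary": "20k-30k", "company": "SampleInc"},
]


def to_json(rows):
    """Serialize a list of dicts as a JSON array."""
    return json.dumps(rows, ensure_ascii=False, indent=2)


def to_csv(rows):
    """Serialize a list of dicts as CSV with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(rows[0].keys()), lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


print(to_json(records))
print(to_csv(records))
```

Each dict becomes one record in the output, and its keys become the field names, which is why parse needs to return structured records rather than loose strings.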
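2. Pipeline-based persistent storage
Scrapy already ships with an efficient, convenient persistence mechanism that we can use directly. To use it, first get to know the following two files:
items.py: the data structure template file. It defines the data fields.
pipelines.py: the pipeline file. It receives the data (items) and persists it.
Persistence workflow:
1. After the spider file has scraped the data, wrap the data in an item object.
2. Use the yield keyword to submit the item object to the pipeline for persistence.
3. In the pipeline file, the process_item method receives the item object submitted by the spider. Write the storage code there to persist the data held in the item.
4. Enable the pipeline in the settings.py configuration file.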
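The workflow above can be sketched without Scrapy installed. The following is a simplified simulation, not Scrapy's engine: `TextFilePipeline`, `fake_parse`, and `run` are hypothetical names, and plain dicts stand in for item objects.

```python
import io


class TextFilePipeline:
    """Mimics the three hooks Scrapy calls on a pipeline class."""

    def __init__(self, stream):
        self.stream = stream  # any writable text stream (a file in real use)
        self.opened = False

    def open_spider(self, spider):
        self.opened = True

    def process_item(self, item, spider):
        self.stream.write(f"{item['job_name']}:{item['salary']}:{item['company']}\n")
        return item

    def close_spider(self, spider):
        self.opened = False


def fake_parse():
    """Stand-in for a spider's parse(): yields one item per record."""
    yield {"job_name": "Python crawler", "salary": "15k-25k", "company": "ExampleCo"}
    yield {"job_name": "Data engineer", "salary": "20k-30k", "company": "SampleInc"}


def run(pipeline, items, spider=None):
    """Open once, process every yielded item, then close once."""
    pipeline.open_spider(spider)
    try:
        return [pipeline.process_item(item, spider) for item in items]
    finally:
        pipeline.close_spider(spider)


stream = io.StringIO()
processed = run(TextFilePipeline(stream), fake_parse())
print(stream.getvalue())
```

The point of the sketch is the call order: open_spider runs once before any item, process_item runs once per yielded item, and close_spider runs once at the end.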
A quick exercise: scrape job data from Boss Zhipin and persist it.
Spider file:
# -*- coding: utf-8 -*-
import scrapy
from bossPro.items import BossproItem


class BossSpider(scrapy.Spider):
    name = 'boss'
    # allowed_domains = ['www.xxx.com']
    start_urls = ['https://www.zhipin.com/job_detail/?query=python%E7%88%AC%E8%99%AB&scity=101010100&industry=&position=']

    def parse(self, response):
        li_list = response.xpath('//div[@class="job-list"]/ul/li')
        for li in li_list:
            job_name = li.xpath('.//div[@class = "info-primary"]/h3/a/div/text()').extract_first()
            salary = li.xpath('.//div[@class = "info-primary"]/h3/a/span/text()').extract_first()
            company = li.xpath('.//div[@class = "company-text"]/h3/a/text()').extract_first()

            # Instantiate an item object
            item = BossproItem()
            # Store all of the parsed data in the item object
            item["job_name"] = job_name
            item["salary"] = salary
            item["company"] = company

            # Submit the item to the pipeline
            yield item
- Items file: items.py
# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/items.html

import scrapy


class BossproItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    job_name = scrapy.Field()
    salary = scrapy.Field()
    company = scrapy.Field()
- Pipeline file: pipelines.py (stores to a text file, MySQL, and Redis)
# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html
import json

import pymysql
from redis import Redis


class BossproPipeline(object):
    """Write each item to a local text file."""
    fp = None

    def open_spider(self, spider):
        print('Spider started >>>>>')
        self.fp = open('./boss.txt', 'w', encoding='utf-8')

    def close_spider(self, spider):
        print('Spider finished >>>>>')
        self.fp.close()

    def process_item(self, item, spider):
        self.fp.write(item['job_name'] + ':' + item['salary'] + ':' + item['company'] + '\n')
        # Return the item so the next pipeline in ITEM_PIPELINES receives it
        return item


class mysqlPileLine(object):
    """Insert each item into the MySQL table `boss`."""
    conn = None
    cursor = None

    def open_spider(self, spider):
        # Open the database connection once per crawl
        self.conn = pymysql.Connect(host='127.0.0.1', port=3306, user='root',
                                    password='', db='scrapy', charset='utf8')
        print(self.conn)

    def process_item(self, item, spider):
        self.cursor = self.conn.cursor()  # create a cursor
        try:
            # Let the driver quote the values instead of formatting the SQL string
            self.cursor.execute('insert into boss values (%s, %s, %s)',
                                (item['job_name'], item['salary'], item['company']))
            self.conn.commit()
        except Exception as e:
            print(e)
            self.conn.rollback()
        # Without this return, the next pipeline would receive None
        return item

    def close_spider(self, spider):
        # Close the cursor (if one was created) before the connection
        if self.cursor:
            self.cursor.close()
        self.conn.close()


class redisPileLine(object):
    """Push each item onto the Redis list `boss`."""
    conn = None

    def open_spider(self, spider):
        self.conn = Redis(host='127.0.0.1', port=6379)
        print(self.conn)

    def process_item(self, item, spider):
        dic = {
            'name': item['job_name'],
            'salary': item['salary'],
            'company': item['company']
        }
        # redis-py 3.0+ rejects dict values, so store the dict as a JSON string
        self.conn.lpush('boss', json.dumps(dic, ensure_ascii=False))
        return item
- settings.py configuration
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.109 Safari/537.36'

# Obey robots.txt rules
ROBOTSTXT_OBEY = False

ITEM_PIPELINES = {
    'bossPro.pipelines.BossproPipeline': 300,
    'bossPro.pipelines.mysqlPileLine': 301,
    'bossPro.pipelines.redisPileLine': 302,
}
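The numbers in ITEM_PIPELINES set the order: lower values run first, and each pipeline receives whatever the previous one returned. The following is a simplified simulation of that chaining, not Scrapy's own code; the stage names and the helpers `make_stage` and `run_chain` are hypothetical.

```python
# Hypothetical registry in the same shape as ITEM_PIPELINES: name -> priority.
ITEM_PIPELINES = {"redis_stage": 302, "file_stage": 300, "mysql_stage": 301}

calls = []  # records the order in which stages ran


def make_stage(name, forward=True):
    """Build a process_item-like function that logs its name."""
    def process_item(item):
        calls.append(name)
        # forward=False models a stage that forgets to return the item
        return item if forward else None
    return process_item


def run_chain(item, registry, stages):
    """Run stages in ascending priority, passing each return value on."""
    for name in sorted(registry, key=registry.get):
        item = stages[name](item)
    return item


stages = {name: make_stage(name) for name in ITEM_PIPELINES}
result = run_chain({"job_name": "Python crawler"}, ITEM_PIPELINES, stages)
first_order = list(calls)
print(first_order)

# A stage that does not return the item hands None to every later stage.
stages["mysql_stage"] = make_stage("mysql_stage", forward=False)
broken = run_chain({"job_name": "Python crawler"}, ITEM_PIPELINES, stages)
print(broken)
```

This is why every process_item in a multi-pipeline setup should end with `return item`, unless the intent is to stop the item from reaching later pipelines.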
Create the MySQL table:
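The table definition itself is not reproduced in this post. As a rough sketch only, assuming three text columns in the same order as the pipeline's insert statement, the schema and a parameterized insert might look like the following. The column types and lengths are guesses, and an in-memory SQLite database stands in for MySQL so the example runs without a server.

```python
import sqlite3

# In-memory SQLite stands in for MySQL so the sketch runs without a server.
conn = sqlite3.connect(":memory:")

# Assumed schema: three text columns, in the order the pipeline inserts them.
conn.execute(
    "CREATE TABLE boss (job_name VARCHAR(100), salary VARCHAR(50), company VARCHAR(100))"
)

# Hypothetical item; the embedded quotes show why placeholders matter.
item = {"job_name": 'Python "crawler"', "salary": "15k-25k", "company": "ExampleCo"}

# Placeholders let the driver handle quoting; sqlite3 uses ?, pymysql uses %s.
conn.execute(
    "INSERT INTO boss VALUES (?, ?, ?)",
    (item["job_name"], item["salary"], item["company"]),
)
conn.commit()

rows = conn.execute("SELECT job_name, salary, company FROM boss").fetchall()
print(rows)
conn.close()
```

In MySQL the same CREATE TABLE statement would be run against the scrapy database that the pipeline connects to.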
Run the spider from the terminal and check the output:
scrapy crawl boss
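Note:
If Redis raises an error when storing the dict, the cause is that the installed redis module does not accept dict values (redis-py 3.0 and later only accept bytes, strings, and numbers). Either serialize the dict to a string before pushing it, for example with json.dumps, or install the older version from the terminal: pip install -U redis==2.10.6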
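The serialization route can be sketched without a Redis server. In the sketch below, a plain Python list and a hypothetical `lpush` helper stand in for the Redis list and client. The helper raises TypeError to model the rejection, whereas the real client raises its own error type.

```python
import json

fake_redis_list = []  # stand-in for the Redis list named 'boss'


def lpush(store, value):
    """Mimic LPUSH: reject dicts, prepend accepted values."""
    if not isinstance(value, (bytes, str, int, float)):
        raise TypeError(f"unsupported value type: {type(value).__name__}")
    store.insert(0, value)


item = {"name": "Python crawler", "salary": "15k-25k", "company": "ExampleCo"}

try:
    lpush(fake_redis_list, item)  # a raw dict is rejected
except TypeError as exc:
    print("rejected:", exc)

# A JSON string is accepted and can be decoded again when read back.
lpush(fake_redis_list, json.dumps(item, ensure_ascii=False))
restored = json.loads(fake_redis_list[0])
print(restored == item)
```

Storing JSON keeps the code working across redis module versions, at the cost of a json.loads call whenever the data is read back.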