Scrapy's Item Pipeline

Posted by Mr. Jiang on 2018-01-10

Original URL: https://juejin.im/post/5a560a37f265da3e2b164fb4

Life is short, I use Python

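Item Pipeline

The Item Pipeline is responsible for processing the Items that spiders extract from web pages; its main tasks are cleaning, validating, and storing data.
After a spider has parsed a page, the resulting Items are sent to the Item Pipeline, where several components process them in a fixed order.
Each Item Pipeline component is a Python class that implements a simple method.
The component receives an Item, acts on it, and also decides whether the Item continues to the next step of the pipeline or is dropped and not processed any further.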
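
To make that flow concrete, here is a minimal sketch of the idea in plain Python. It is a simplified model, not Scrapy's actual implementation: DropItem is a local stand-in for scrapy.exceptions.DropItem, and StripName, RequirePrice, and run_pipelines are hypothetical names made up for this illustration.

```python
class DropItem(Exception):
    """Local stand-in for scrapy.exceptions.DropItem (assumption for this sketch)."""


class StripName:
    """Hypothetical cleaning component: trims whitespace around the name."""

    def process_item(self, item, spider):
        item["name"] = item["name"].strip()
        return item


class RequirePrice:
    """Hypothetical validation component: drops items without a price."""

    def process_item(self, item, spider):
        if not item.get("price"):
            raise DropItem("missing price")
        return item


def run_pipelines(item, components, spider=None):
    """Pass the item through each component in order; stop once it is dropped."""
    for component in components:
        try:
            item = component.process_item(item, spider)
        except DropItem:
            return None  # later components never see a dropped item
    return item


components = [StripName(), RequirePrice()]
kept = run_pipelines({"name": "  Flat A ", "price": "3500"}, components)
dropped = run_pipelines({"name": "Flat B", "price": ""}, components)
print(kept)
print(dropped)
```

The first item passes through both components and comes out cleaned; the second is dropped by the validation step, so nothing after that step would run for it.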

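What the Item Pipeline is used for:

Cleaning HTML data
Validating scraped data (checking that the items contain certain fields)
Checking for duplicates (and dropping them). Note: for performance reasons, it is better to deduplicate at the link (URL) level, or to rely on the uniqueness of a database primary key
Storing the scraped items in a database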

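Main methods of an Item Pipeline:

1. process_item(self, item, spider): must be implemented

Every Item Pipeline component must implement this method. It must return an Item (or an object of any subclass) or raise a DropItem exception; a dropped item is not processed by any later pipeline component.

Parameters:

item (Item object): the scraped item
spider (Spider object): the spider that scraped the item

Scrapy calls this method on every item pipeline component. process_item must do one of the following:

Return a dict
Return an Item object or an object of one of its subclasses
Return a Twisted Deferred object
Raise a DropItem exception; in that case the item is not passed on to any later item pipeline component

Important: this is the only method an Item Pipeline must implement; the other three (open_spider/close_spider/from_crawler) are optional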

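2. open_spider(self, spider): optional; called when the spider starts

This method is called when the spider is opened. Use it for anything that has to happen at startup, such as opening a file for writing or connecting to a database

Parameters:

spider (Spider object): the spider that was opened

3. close_spider(self, spider): optional; called when the spider closes

This method is called when the spider is closed. Use it for cleanup, such as closing the file that was written or closing the database connection

Parameters:

spider (Spider object): the spider that was closed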
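
The two hooks pair naturally around a resource. Below is a minimal sketch that writes each item as one JSON line to a file; the class name JsonLinesFilePipeline and the output path are assumptions for illustration, not part of the original project.

```python
import json


class JsonLinesFilePipeline:
    """Hypothetical component: writes every item as one JSON line."""

    path = "58_chuzu.jsonl"  # hypothetical output path

    def open_spider(self, spider):
        # Runs once at startup: open the file that items will be written to
        self.file = open(self.path, "w", encoding="utf8")

    def process_item(self, item, spider):
        # dict(item) works for both plain dicts and scrapy.Item objects
        self.file.write(json.dumps(dict(item), ensure_ascii=False) + "\n")
        return item

    def close_spider(self, spider):
        # Runs once at shutdown: flush and close the file
        self.file.close()
```

Opening the file once in open_spider, rather than on every call to process_item, avoids reopening it for each item.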

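4. from_crawler(cls, crawler): optional; also called at startup, before open_spider

This class method builds a pipeline instance from a Crawler and must return a new pipeline instance. The Crawler object gives access to all of Scrapy's core components, including settings and signals

Parameters:

crawler (Crawler object): the crawler that uses this pipeline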
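
A common use of from_crawler is to pull configuration out of the project settings. The sketch below assumes a custom setting called MAX_NAME_LENGTH, which is a made-up name for illustration; the class name NameLengthPipeline is hypothetical as well.

```python
class NameLengthPipeline:
    """Hypothetical component: truncates item names to a configured length."""

    def __init__(self, max_name_length):
        self.max_name_length = max_name_length

    @classmethod
    def from_crawler(cls, crawler):
        # MAX_NAME_LENGTH is an assumed custom setting; 50 is the fallback
        return cls(max_name_length=crawler.settings.getint("MAX_NAME_LENGTH", 50))

    def process_item(self, item, spider):
        item["name"] = item["name"][: self.max_name_length]
        return item
```

Keeping the settings lookup in from_crawler means __init__ only takes plain values, so the class can also be instantiated directly, for example in a unit test.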

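Example project: scraping rental listings from 58同城 (58.com)

The code is as follows:

items.py: defines the fields we want to scrape; in this example they are name, price, and url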

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/items.html

import scrapy


class City58Item(scrapy.Item):
    # define the fields for your item here like:
    name = scrapy.Field()
    price = scrapy.Field()
    url = scrapy.Field()

city58_demo.py: defines the request URLs and uses pyquery to select the target elements

# -*- coding: utf-8 -*-
import time

import scrapy
from pyquery import PyQuery
from scrapy.http import Request

from ..items import City58Item


class City58DemoSpider(scrapy.Spider):
    name = 'city58_demo'
    allowed_domains = ['58.com']
    start_urls = ['http://sh.58.com/chuzu/']

    def parse(self, response):
        # Queue listing pages 1 to 3
        for index in range(1, 4):
            url = 'http://sh.58.com/chuzu/pn{0}'.format(index)
            print(url)
            # Note: time.sleep blocks Scrapy's event loop; the DOWNLOAD_DELAY
            # setting is the usual way to throttle requests
            time.sleep(3)
            yield Request(url=url, callback=self.page)

    def page(self, response):
        jpy = PyQuery(response.text)
        li_list = jpy('body > div.mainbox > div.main > div.content > div.listBox > ul > li').items()
        for it in li_list:
            a_tag = it('div.des > h2 > a')
            item = City58Item()
            item['name'] = a_tag.text()  # text of the <a> tag
            item['url'] = a_tag.attr('href')  # value of the href attribute
            item['price'] = it('div.listliright > div.money > b').text()
            yield item  # hand the Item back to the engine

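pipeline.py: once the city58_demo spider has scraped an item and returned it to the engine, the engine hands the item to the City58Pipeline pipeline. This pipeline file opens a MongoDB connection and writes the data to the database: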

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html
import json  # only needed for the commented-out file output below

import pymongo


class City58Pipeline(object):
    collection_name = 'scrapy_items'

    def __init__(self, mongo_uri, mongo_db):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db

    @classmethod
    def from_crawler(cls, crawler):
        # Read the connection details from settings.py, with fallbacks.
        # 'MONGO_URI' is an assumed setting name; the original code passed
        # the address itself as the setting name, which always returned None.
        return cls(
            mongo_uri=crawler.settings.get('MONGO_URI', 'localhost:27017'),
            mongo_db=crawler.settings.get('MONGO_DATABASE', 'items')
        )

    def open_spider(self, spider):
        # self.file = open('58_chuzu.txt', 'w', encoding='utf8')
        # Connect to MongoDB when the spider starts
        self.client = pymongo.MongoClient(self.mongo_uri)
        self.db = self.client[self.mongo_db]
        print('Database opened')

    def process_item(self, item, spider):
        # line = '{}\n'.format(json.dumps(dict(item)))  # serialize the item to a string
        # self.file.write(line)
        # Store each item as one document
        self.db[self.collection_name].insert_one(dict(item))
        return item

    def close_spider(self, spider):
        # self.file.close()
        self.client.close()
        print('Database closed')

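settings.py: enable the City58Pipeline pipeline

Note: the 300 in the ITEM_PIPELINES entry is an execution-priority number; when several pipelines are enabled, those with lower numbers run first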
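
For reference, a settings.py entry of this kind typically looks like the fragment below. The module path city58.pipelines is an assumption, because the article does not show the project's package name; replace it with the path of the module that actually contains City58Pipeline.

```python
# settings.py (fragment)
# 'city58.pipelines' is an assumed module path; adjust it to your project.
ITEM_PIPELINES = {
    "city58.pipelines.City58Pipeline": 300,
}
```

By convention the numbers are chosen in the 0 to 1000 range, which leaves room to slot other pipelines before or after this one.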

To run the spider, you can create a file in the project and run that file; the effect is the same as running scrapy crawl city58_demo on the command line.

The final run produces data like that shown in the figure.
