Scrapy Crawler (6): Scraping Bank Wealth-Management Products into MongoDB (120,000+ Records)

Posted by weixin_33763244 on 2018-03-15

  The goal of this Scrapy crawler is to scrape the information on all bank wealth-management products listed on the Rong360 (融360) website and store it in MongoDB. A screenshot of the page is shown below; there are more than 120,000 records in total.


(Screenshot: bank wealth-management product listings on Rong360)

  We will not go over creating and running a Scrapy project again here and only give the relevant code. Readers interested in those steps can refer to: Scrapy Crawler (4): Scraping Images of the Douban Top 250 Movies.
  Modify items.py as follows; it defines the fields stored for each product, such as the product name and the issuing bank.

import scrapy

class BankItem(scrapy.Item):
    # define the fields for your item here like:
    name = scrapy.Field()
    bank = scrapy.Field()
    currency = scrapy.Field()
    startDate = scrapy.Field()
    endDate = scrapy.Field()
    period = scrapy.Field()
    proType = scrapy.Field()
    profit = scrapy.Field()
    amount = scrapy.Field()

  Create the spider file bankSpider.py with the following code, which extracts the product details from each page.

import scrapy
from bank.items import BankItem

class bankSpider(scrapy.Spider):
    name = 'bank'
    start_urls = ['https://www.rong360.com/licai-bank/list/p1']

    def parse(self, response):

        # skip the header row of the product table
        trs = response.css('tr')[1:]

        for tr in trs:
            # build a fresh item for each row instead of reusing a single instance
            item = BankItem()
            item['name'] = tr.xpath('td[1]/a/text()').extract_first()
            item['bank'] = tr.xpath('td[2]/p/text()').extract_first()
            item['currency'] = tr.xpath('td[3]/text()').extract_first()
            item['startDate'] = tr.xpath('td[4]/text()').extract_first()
            item['endDate'] = tr.xpath('td[5]/text()').extract_first()
            item['period'] = tr.xpath('td[6]/text()').extract_first()
            item['proType'] = tr.xpath('td[7]/text()').extract_first()
            item['profit'] = tr.xpath('td[8]/text()').extract_first()
            item['amount'] = tr.xpath('td[9]/text()').extract_first()

            yield item

        # the page may expose one or two "next page" links; guard against none
        next_pages = response.css('a.next-page')

        if len(next_pages) == 1:
            next_page_link = next_pages.xpath('@href').extract_first()
        elif len(next_pages) > 1:
            next_page_link = next_pages[1].xpath('@href').extract_first()
        else:
            next_page_link = None

        if next_page_link:
            next_page = "https://www.rong360.com" + next_page_link
            yield scrapy.Request(next_page, callback=self.parse)
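
  As an aside, Scrapy 1.4+ also offers response.follow, which resolves the relative href against the current page, so the manual URL concatenation above is not strictly required. A minimal equivalent for that last step (not part of the original code):

        # equivalent pagination with response.follow (Scrapy >= 1.4);
        # it joins the relative link with the current page URL for us
        if next_page_link:
            yield response.follow(next_page_link, callback=self.parse)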

  To store the scraped data in MongoDB, we need to modify pipelines.py as follows:

# pipeline that inserts the scraped items into MongoDB
import pymongo

class BankPipeline(object):
    def __init__(self, settings):
        # connect to the MongoDB server
        self.client = pymongo.MongoClient(host=settings['MONGO_HOST'], port=settings['MONGO_PORT'])

        # log in with a user name and password if MongoDB requires authentication
        # self.client.admin.authenticate(settings['MONGO_USER'], settings['MONGO_PSW'])

        # handles for the database and the collection
        self.db = self.client[settings['MONGO_DB']]
        self.coll = self.db[settings['MONGO_COLL']]

    @classmethod
    def from_crawler(cls, crawler):
        # read the MONGO_* values defined in settings.py
        # (the old `from scrapy.conf import settings` import is deprecated)
        return cls(crawler.settings)

    def process_item(self, item, spider):
        postItem = dict(item)
        self.coll.insert_one(postItem)  # insert_one() replaces the deprecated insert()
        return item
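
If you also want the MongoDB connection released when the crawl ends, Scrapy pipelines support a close_spider hook; a small optional method that could be added to the class above:

    def close_spider(self, spider):
        # called once when the spider finishes; close the MongoDB connection
        self.client.close()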

The MongoDB parameters used here, such as MONGO_HOST and MONGO_PORT, are defined in settings.py. Modify settings.py as follows:

  1. ROBOTSTXT_OBEY = False
  2. ITEM_PIPELINES = {'bank.pipelines.BankPipeline': 300}
  3. Add the MongoDB connection parameters:

MONGO_HOST = "localhost"  # host IP
MONGO_PORT = 27017  # port number
MONGO_DB = "Spider"  # database name
MONGO_COLL = "bank"  # collection name
# MONGO_USER = ""
# MONGO_PSW = ""

The user name and password can be filled in as needed.
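
If your MongoDB instance does require authentication, one option (an assumption on my part, not shown in the original article) is to fill in MONGO_USER and MONGO_PSW in settings.py and hand them to MongoClient in the pipeline's __init__:

        # hedged sketch: pass the credentials from settings.py to MongoClient;
        # MONGO_USER / MONGO_PSW are the (commented-out) setting names above
        self.client = pymongo.MongoClient(
            host=settings['MONGO_HOST'],
            port=settings['MONGO_PORT'],
            username=settings['MONGO_USER'],
            password=settings['MONGO_PSW'],
        )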

  Now we can run the crawler.
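
For reference, here is a minimal sketch of launching the spider from Python, assuming bankSpider.py sits in the project's spiders/ directory; the usual alternative is simply running scrapy crawl bank from the project root:

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

from bank.spiders.bankSpider import bankSpider

# load settings.py (pipeline, MongoDB parameters) and start the crawl
process = CrawlerProcess(get_project_settings())
process.crawl(bankSpider)
process.start()  # blocks until the crawl finishes

The run output looked like this: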


(Screenshot: crawl output)

The whole crawl took about 3 hours and pulled in more than 120,000 records; quite an impressive rate!
  Finally, let's take one more look at the data in MongoDB:


(Screenshot: the scraped data in the MongoDB collection)

  Perfect! That wraps up this post. Feedback and discussion are always welcome!
