Scrapy Crawler (6): Scraping Bank Wealth-Management Products into MongoDB (120,000+ Records)

Posted by weixin_33763244 on 2018-03-15

  The goal of this Scrapy crawler is to scrape the information on all bank wealth-management products listed on the Rong360 (融360) website and store it in MongoDB. A screenshot of the page is shown below; there are more than 120,000 records in total.


(Screenshot: bank wealth-management product listings on Rong360)

  We will not go over creating and running a Scrapy project again here and only give the relevant code. Readers interested in those steps can refer to: Scrapy Crawler (4): Scraping Images of the Douban Top 250 Movies.
  Modify items.py as follows; it defines the fields stored for each product, such as the product name and the issuing bank.

import scrapy

class BankItem(scrapy.Item):
    # define the fields for your item here like:
    name = scrapy.Field()
    bank = scrapy.Field()
    currency = scrapy.Field()
    startDate = scrapy.Field()
    endDate = scrapy.Field()
    period = scrapy.Field()
    proType = scrapy.Field()
    profit = scrapy.Field()
    amount = scrapy.Field()

  Create the spider file bankSpider.py with the following code, which extracts the product details from each page.

import scrapy
from bank.items import BankItem

class bankSpider(scrapy.Spider):
    name = 'bank'
    start_urls = ['https://www.rong360.com/licai-bank/list/p1']

    def parse(self, response):

        # skip the header row of the product table
        trs = response.css('tr')[1:]

        for tr in trs:
            # build a fresh item for each row instead of reusing a single instance
            item = BankItem()
            item['name'] = tr.xpath('td[1]/a/text()').extract_first()
            item['bank'] = tr.xpath('td[2]/p/text()').extract_first()
            item['currency'] = tr.xpath('td[3]/text()').extract_first()
            item['startDate'] = tr.xpath('td[4]/text()').extract_first()
            item['endDate'] = tr.xpath('td[5]/text()').extract_first()
            item['period'] = tr.xpath('td[6]/text()').extract_first()
            item['proType'] = tr.xpath('td[7]/text()').extract_first()
            item['profit'] = tr.xpath('td[8]/text()').extract_first()
            item['amount'] = tr.xpath('td[9]/text()').extract_first()

            yield item

        # the page may expose one or two "next page" links; guard against none
        next_pages = response.css('a.next-page')

        if len(next_pages) == 1:
            next_page_link = next_pages.xpath('@href').extract_first()
        elif len(next_pages) > 1:
            next_page_link = next_pages[1].xpath('@href').extract_first()
        else:
            next_page_link = None

        if next_page_link:
            next_page = "https://www.rong360.com" + next_page_link
            yield scrapy.Request(next_page, callback=self.parse)
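
  As an aside, Scrapy 1.4+ also offers response.follow, which resolves the relative href against the current page, so the manual URL concatenation above is not strictly required. A minimal equivalent for that last step (not part of the original code):

        # equivalent pagination with response.follow (Scrapy >= 1.4);
        # it joins the relative link with the current page URL for us
        if next_page_link:
            yield response.follow(next_page_link, callback=self.parse)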

  To store the scraped data in MongoDB, we need to modify pipelines.py as follows:

# pipeline that inserts the scraped items into MongoDB
import pymongo

class BankPipeline(object):
    def __init__(self, settings):
        # connect to the MongoDB server
        self.client = pymongo.MongoClient(host=settings['MONGO_HOST'], port=settings['MONGO_PORT'])

        # log in with a user name and password if MongoDB requires authentication
        # self.client.admin.authenticate(settings['MONGO_USER'], settings['MONGO_PSW'])

        # handles for the database and the collection
        self.db = self.client[settings['MONGO_DB']]
        self.coll = self.db[settings['MONGO_COLL']]

    @classmethod
    def from_crawler(cls, crawler):
        # read the MONGO_* values defined in settings.py
        # (the old `from scrapy.conf import settings` import is deprecated)
        return cls(crawler.settings)

    def process_item(self, item, spider):
        postItem = dict(item)
        self.coll.insert_one(postItem)  # insert_one() replaces the deprecated insert()
        return item
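
If you also want the MongoDB connection released when the crawl ends, Scrapy pipelines support a close_spider hook; a small optional method that could be added to the class above:

    def close_spider(self, spider):
        # called once when the spider finishes; close the MongoDB connection
        self.client.close()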

The MongoDB parameters used here, such as MONGO_HOST and MONGO_PORT, are defined in settings.py. Modify settings.py as follows:

  1. ROBOTSTXT_OBEY = False
  2. ITEM_PIPELINES = {'bank.pipelines.BankPipeline': 300}
  3. Add the MongoDB connection parameters:

MONGO_HOST = "localhost"  # host IP
MONGO_PORT = 27017  # port number
MONGO_DB = "Spider"  # database name
MONGO_COLL = "bank"  # collection name
# MONGO_USER = ""
# MONGO_PSW = ""

The user name and password can be filled in as needed.
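
If your MongoDB instance does require authentication, one option (an assumption on my part, not shown in the original article) is to fill in MONGO_USER and MONGO_PSW in settings.py and hand them to MongoClient in the pipeline's __init__:

        # hedged sketch: pass the credentials from settings.py to MongoClient;
        # MONGO_USER / MONGO_PSW are the (commented-out) setting names above
        self.client = pymongo.MongoClient(
            host=settings['MONGO_HOST'],
            port=settings['MONGO_PORT'],
            username=settings['MONGO_USER'],
            password=settings['MONGO_PSW'],
        )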

  Now we can run the crawler.
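
For reference, here is a minimal sketch of launching the spider from Python, assuming bankSpider.py sits in the project's spiders/ directory; the usual alternative is simply running scrapy crawl bank from the project root:

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

from bank.spiders.bankSpider import bankSpider

# load settings.py (pipeline, MongoDB parameters) and start the crawl
process = CrawlerProcess(get_project_settings())
process.crawl(bankSpider)
process.start()  # blocks until the crawl finishes

The run output looked like this: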


(Screenshot: crawl output)

The whole crawl took about 3 hours and pulled in more than 120,000 records; quite an impressive rate!
  Finally, let's take one more look at the data in MongoDB:


(Screenshot: the scraped data in the MongoDB collection)

  Perfect! That wraps up this post. Feedback and discussion are always welcome!
