Scrapy + MongoDB Website Crawler

Posted by Rachel on 2019-05-19

Original URL: https://learnku.com/articles/28674?order_by=created_at&

Tools and environment

Language: Python 3.6
Database: MongoDB (install and run commands below)

python3 -m pip install pymongo
brew install mongodb
mongod --config /usr/local/etc/mongod.conf

Framework: Scrapy 1.5.1 (install command below)

python3 -m pip install Scrapy

Create a crawler project with the Scrapy framework

Run the following command in a terminal to create a crawler project named myspider:

scrapy startproject myspider

This produces a project directory with the following structure:

(Image: directory tree of the generated myspider project)

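Create a spider from the crawl template

Scrapy provides several spider classes for different purposes:
Spider: the base class that all other spiders inherit from
CrawlSpider: the one most commonly used to crawl a whole site (the example below uses it)
XMLFeedSpider
CSVFeedSpider
SitemapSpider

First, change into the spiders directory on the command line: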

cd myspider/myspider/spiders

Then generate a spider from the crawl template:

scrapy genspider -t crawl zgmlxc www.zgmlxc.com.cn

Arguments:

-t crawl selects the spider template (type)

zgmlxc is the name I gave this spider

www.zgmlxc.com.cn is the site I want to crawl

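Flesh out the zgmlxc spider

Open zgmlxc.py and you will see a basic spider template. The next steps configure it so that the spider crawls the information we tell it to.

Configure the link-following rules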

rules = (
    # Locate the page www.zgmlxc.com.cn/node/72.jspx
    Rule(LinkExtractor(allow=r'.72\.jspx')),
    # On the page matched above, find URLs that match the rule below, crawl them,
    # and pass each response to the parse_item() function
    Rule(LinkExtractor(allow=r'./info/\d+\.jspx'), callback='parse_item'),
)
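
To see which URLs the two allow patterns would pick up, they can be tried out with Python's re module. This is only a rough sketch: the URLs are hypothetical, and it assumes LinkExtractor looks for each pattern anywhere in the absolute URL (search semantics), which is my understanding of its behavior. Note that the leading dot matches any character, so the first pattern would also match a page such as 172.jspx.

```python
import re

# The two allow patterns from the rules above
list_page = re.compile(r'.72\.jspx')
detail_page = re.compile(r'./info/\d+\.jspx')

# Hypothetical URLs, for illustration only
urls = [
    'http://www.zgmlxc.com.cn/node/72.jspx',
    'http://www.zgmlxc.com.cn/node/172.jspx',
    'http://www.zgmlxc.com.cn/info/1234.jspx',
    'http://www.zgmlxc.com.cn/about.jspx',
]


def classify(url):
    """Return which pattern (if any) is found somewhere in the URL."""
    if list_page.search(url):
        return 'list page'
    if detail_page.search(url):
        return 'detail page'
    return 'ignored'


results = {url: classify(url) for url in urls}
for url, kind in results.items():
    print(kind, url)
```

If the first rule should match only node/72.jspx, a stricter pattern such as r'/node/72\.jspx' may be a safer choice.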

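One small gotcha: when rules contains only a single Rule, the trailing comma after it is required. Without the comma the parentheses do not create a tuple, and Scrapy raises an error when it tries to iterate over rules.
rules = (
    Rule(LinkExtractor(allow=r'./info/\d+\.jspx'), callback='parse_item', follow=True),
)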
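
The reason for the comma is plain Python syntax rather than anything specific to Scrapy. A minimal sketch, using a placeholder object in place of a real Rule:

```python
# A stand-in object; in the real spider this would be a scrapy Rule instance
rule = object()

without_comma = (rule)   # parentheses only group the expression: this is just `rule`
with_comma = (rule,)     # the trailing comma makes a one-element tuple

print(without_comma is rule)          # True
print(isinstance(with_comma, tuple))  # True
print(len(with_comma))                # 1
```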

Define the fields to extract in items.py

import scrapy

class CrawlspiderItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    title = scrapy.Field()
    content = scrapy.Field()
    piclist = scrapy.Field()
    shortname = scrapy.Field()

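Complete the parse_item function

This function takes the response returned by the previous step and applies extraction rules to pull out the information we want. join is needed here so that every field ends up as a single string, which makes the later import into the database go smoothly.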

def parse_item(self, response):
    # getall() returns a list of strings; join() merges the list into one string.
    # (get() returns a single string, and joining a string would put a space
    # between every character.)
    yield {
        'title': ' '.join(response.xpath("//div[@class='head']/h3/text()").getall()).strip(),
        'shortname': ' '.join(response.xpath("//div[@class='body']/p/strong/text()").getall()).strip(),
        'piclist': ' '.join(response.xpath("//div[@class='body']/p/img/@src").getall()).strip(),
        'content': ' '.join(response.css("div.body").extract()).strip(),
    }
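
A quick illustration of why join should be given a list of strings and not a single string. The sample values are made up; in the spider they would come from the selector's getall() (a list) or get() (one string).

```python
single = "abc"                 # a single string, like the result of get()
many = ["first", "second"]     # a list of strings, like the result of getall()

joined_single = ' '.join(single)   # iterates over the characters of the string
joined_many = ' '.join(many)       # joins the list elements

print(joined_single)  # a b c
print(joined_many)    # first second
```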

PS: here are some commonly used extraction rules, summarized for reference:

1). Get the src of an img tag:
//img[@class='photo-large']/@src

2). Get the article body together with its markup:
response.css("div.body").extract()
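
To experiment with the first rule without running a crawl, the standard library's ElementTree can serve as a rough stand-in. This is not Scrapy's selector: ElementTree supports only a subset of XPath and cannot select an attribute with /@src, so the sketch below finds the matching img elements and reads src from each one. The markup is hypothetical and must be well-formed XML.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed markup for illustration only
markup = """
<div class="body">
  <img class="photo-large" src="/uploads/a.jpg"/>
  <img class="thumb" src="/uploads/b.jpg"/>
</div>
""".strip()

root = ET.fromstring(markup)

# Select img elements by class, then read the src attribute from each element
srcs = [img.get('src') for img in root.findall(".//img[@class='photo-large']")]
print(srcs)  # ['/uploads/a.jpg']
```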

Store the data in MongoDB

Configure the database settings

Open settings.py and add the following:

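# Connect the spider to the database through an item pipeline.
# The dotted path must start with your project's package name
# (for example myspider if the project was created with the command above).
ITEM_PIPELINES = {
   'crawlspider.pipelines.MongoDBPipeline': 300,
}

# Database settings
MONGODB_SERVER = "localhost"
MONGODB_PORT = 27017
MONGODB_DB = 'spider_world'
MONGODB_COLLECTION = 'zgmlxc'

# Crawl politely: wait 5 seconds between requests, which is friendly to the site
# and also helps avoid being blacklisted
DOWNLOAD_DELAY = 5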

In pipelines.py:

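import logging

import pymongo
from scrapy.exceptions import DropItem

logger = logging.getLogger(__name__)


class MongoDBPipeline(object):
    def __init__(self, server, port, db_name, collection_name):
        connection = pymongo.MongoClient(server, port)
        db = connection[db_name]
        self.collection = db[collection_name]

    @classmethod
    def from_crawler(cls, crawler):
        # Read the MongoDB settings defined in settings.py
        # (crawler.settings replaces the deprecated scrapy.conf.settings)
        settings = crawler.settings
        return cls(
            settings['MONGODB_SERVER'],
            settings['MONGODB_PORT'],
            settings['MONGODB_DB'],
            settings['MONGODB_COLLECTION'],
        )

    def process_item(self, item, spider):
        # Drop the item if any field value is empty
        for field, value in dict(item).items():
            if not value:
                raise DropItem("Missing {0}!".format(field))
        # insert_one replaces the deprecated Collection.insert
        self.collection.insert_one(dict(item))
        logger.debug("Item added to MongoDB database!")
        return item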
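
A note on the empty-field check in process_item: iterating directly over a dict-like item yields its field names, not its values, so a check for missing data needs to look at the values. A small sketch with a plain dict standing in for the scraped item (the sample data is made up):

```python
# A plain dict standing in for a scraped item; 'content' is deliberately empty
item = {'title': 'Hello', 'content': ''}

# Iterating the dict yields keys, which are non-empty strings, so nothing is flagged
empty_by_key = [key for key in item if not key]

# Checking the values finds the field that is actually empty
empty_by_value = [key for key, value in item.items() if not value]

print(empty_by_key)    # []
print(empty_by_value)  # ['content']
```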

Run the spider in a terminal

scrapy crawl zgmlxc

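Check the stored data in Navicat

Create a new MongoDB connection in Navicat and fill in the settings configured above. If everything went well, you will see that the information we wanted has been stored in the database.

(Image: Navicat showing the MongoDB connection and the stored documents)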

That completes the whole process, from writing a custom spider to storing the data in the database.
