Building a Web Crawler with Scrapy and MongoDB in Python (Part 2)

Posted by PyPer on 2015-04-24

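In the previous article, we built a basic web crawler that downloads the latest questions from StackOverflow and stores them in a MongoDB database. In this article, we extend it so that it follows the pagination links at the bottom of each page and downloads the questions (title and URL) from every page.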

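Before you start any scraping job, review the target site's terms of use and comply with its robots.txt file. Also scrape ethically: do not flood a site with requests in a short span of time. Treat any site you scrape as if it were your own.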
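As a rough illustration of checking robots.txt rules before crawling, here is a minimal sketch using Python's standard-library urllib.robotparser. The robots.txt content, the "my-crawler" user agent, and the example.com URLs are made up for the example; a real crawler would load the target site's actual file.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, inlined so the example needs no network access.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""


def build_parser(robots_txt):
    """Parse robots.txt text into a RobotFileParser."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(SAMPLE_ROBOTS_TXT)

# "my-crawler" is an assumed user-agent name.
allowed = parser.can_fetch("my-crawler", "http://example.com/questions?page=2")
blocked = parser.can_fetch("my-crawler", "http://example.com/private/data")
delay = parser.crawl_delay("my-crawler")

print(allowed, blocked, delay)
```

With these sample rules, the questions URL is allowed, the URL under /private/ is not, and the requested crawl delay is 5 seconds.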

Getting Started

There are two possible ways to continue from where we left off.

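The first is to extend our existing spider: use an XPath expression to extract each next-page link from the response inside the “parse_item” method, and yield a Request object whose callback is that same parse_item method. With this approach, the spider automatically generates new requests for the links we specify. The Scrapy documentation has more information on this method.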
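To make the link-discovery step of this first approach concrete, here is a library-free sketch using only the standard library. It assumes, purely for illustration, that the pager marks its next-page link with rel="next"; the NextLinkParser class, the find_next_page helper, and the sample markup are hypothetical and are not taken from the original spider.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class NextLinkParser(HTMLParser):
    """Collect the href of the first <a rel="next"> link on a page."""

    def __init__(self):
        super().__init__()
        self.next_href = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and attrs.get("rel") == "next" and self.next_href is None:
            self.next_href = attrs.get("href")


def find_next_page(page_url, html):
    """Return the absolute URL of the next page, or None on the last page."""
    parser = NextLinkParser()
    parser.feed(html)
    if parser.next_href is None:
        return None
    # Pager links are often relative, so resolve them against the current URL.
    return urljoin(page_url, parser.next_href)


# Hypothetical pager markup; the real page structure may differ.
SAMPLE_HTML = (
    '<div class="pager">'
    '<a rel="next" href="/questions?page=2&amp;sort=newest">next</a>'
    '</div>'
)

next_url = find_next_page(
    "http://stackoverflow.com/questions?sort=newest", SAMPLE_HTML)
print(next_url)
```

Inside a Scrapy spider, the equivalent step would extract the link with response.xpath and end parse_item with something like `yield scrapy.Request(next_url, callback=self.parse_item)` whenever a next page exists.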

The other, simpler option is to use a different type of spider, the CrawlSpider. It is an extended version of the basic Spider and fits our needs exactly.

CrawlSpider

We will use the same Scrapy project as in the previous tutorial, so grab the code from the repo if you need it.

Creating the Boilerplate

Inside the “stack” directory, start by generating the spider boilerplate from the crawl template.

$ scrapy genspider stack_crawler stackoverflow.com -t crawl
Created spider 'stack_crawler' using template 'crawl' in module:
  stack.spiders.stack_crawler

The Scrapy project should now look like this:

├── scrapy.cfg
└── stack
    ├── __init__.py
    ├── items.py
    ├── pipelines.py
    ├── settings.py
    └── spiders
        ├── __init__.py
        ├── stack_crawler.py
        └── stack_spider.py

The stack_crawler.py file looks like this:

# -*- coding: utf-8 -*-
import scrapy
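# Import paths for Scrapy 1.0 and later; older releases used
# scrapy.contrib.linkextractors and scrapy.contrib.spiders instead.
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule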

from stack.items import StackItem

class StackCrawlerSpider(CrawlSpider):
    name = 'stack_crawler'
    allowed_domains = ['stackoverflow.com']
    start_urls = ['http://www.stackoverflow.com/']

    rules = (
        Rule(LinkExtractor(allow=r'Items/'), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        i = StackItem()
        #i['domain_id'] = response.xpath('//input[@id="sid"]/@value').extract()
        #i['name'] = response.xpath('//div[@id="name"]').extract()
        #i['description'] = response.xpath('//div[@id="description"]').extract()
        return i

We only need to make a few updates to this boilerplate.

Updating the “start_urls” List

First, add the first page of questions to the start_urls list:

start_urls = [
    'http://stackoverflow.com/questions?pagesize=50&sort=newest'
]

Updating the “rules” List

Next, we need to add a regular expression to the “rules” attribute to tell the spider where to find the next-page links:

rules = [
    # The "?" is escaped so it matches the literal question mark in the URL.
    Rule(LinkExtractor(allow=r'questions\?page=[0-9]&sort=newest'),
         callback='parse_item', follow=True)
]

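The spider will now automatically request new pages based on those links and pass each response to the “parse_item” method, which extracts the questions and their titles.

If you look closely, you will see that this regular expression limits crawling to the first 9 pages, because for this demo we do not want to scrape all 176,234 pages.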
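The page limit is easy to verify with Python's re module. The sketch below assumes the link extractor applies the pattern with search semantics (a match anywhere in the URL); the sample URLs and the is_followed helper are made up for illustration. It also shows why the question mark must be escaped: left bare, it makes the preceding "s" optional instead of matching a literal "?".

```python
import re

# The rule's pattern, with "?" escaped so it matches a literal question mark.
PAGE_LINK = re.compile(r'questions\?page=[0-9]&sort=newest')

# The same pattern without the escape: "s?" now means "an optional s".
UNESCAPED = re.compile(r'questions?page=[0-9]&sort=newest')


def is_followed(url):
    """Return True if the escaped pattern is found anywhere in the URL."""
    return PAGE_LINK.search(url) is not None


urls = [
    'http://stackoverflow.com/questions?page=2&sort=newest',
    'http://stackoverflow.com/questions?page=9&sort=newest',
    'http://stackoverflow.com/questions?page=10&sort=newest',
    'http://stackoverflow.com/questions?page=2&sort=votes',
]

for url in urls:
    print(url, is_followed(url))
```

Pages 2 and 9 match. Page 10 does not, because [0-9] accepts exactly one digit before "&sort=newest", and the "sort=votes" link is skipped as well. The unescaped variant matches none of these URLs.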

Updating the “parse_item” Method

Now we only need to write the XPath code that parses each page. We already did that in the previous tutorial, so just copy it over.

def parse_item(self, response):
    questions = response.xpath('//div[@class="summary"]/h3')

    for question in questions:
        item = StackItem()
        item['url'] = question.xpath(
            'a[@class="question-hyperlink"]/@href').extract()[0]
        item['title'] = question.xpath(
            'a[@class="question-hyperlink"]/text()').extract()[0]
        yield item

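That is the parsing code for the spider, but do not start it just yet.

Adding a Download Delay

We need to be nice to StackOverflow (and any other site) by setting a download delay in settings.py.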

DOWNLOAD_DELAY = 5

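This tells the spider to wait 5 seconds between consecutive requests. This limit matters: without it, StackOverflow will rate-limit you, and if you keep scraping the site without restraint, your IP will be banned. So be nice, and treat any site you scrape as if it were your own.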
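Scrapy has a few related settings that can sit alongside the fixed delay. The fragment below is a sketch of what a more cautious settings.py might contain; every value other than DOWNLOAD_DELAY = 5 is an illustrative assumption rather than something the original project uses.

```python
# settings.py (fragment): illustrative values, not tuned recommendations.

# Wait about 5 seconds between requests to the same site.
DOWNLOAD_DELAY = 5

# Vary each wait between 0.5x and 1.5x of DOWNLOAD_DELAY (Scrapy's default).
RANDOMIZE_DOWNLOAD_DELAY = True

# Respect the site's robots.txt rules.
ROBOTSTXT_OBEY = True

# Let the AutoThrottle extension adapt the delay to server latency.
# It never goes below DOWNLOAD_DELAY.
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 5
AUTOTHROTTLE_MAX_DELAY = 60
```

With AutoThrottle enabled, DOWNLOAD_DELAY acts as a lower bound, so the spider slows down further when the server responds slowly but never speeds up past the configured delay.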

Now there is only one thing left to consider: storing the data.

MongoDB

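Last time we downloaded only 50 questions, but since we are grabbing much more data this time, we want to avoid adding duplicate questions to the database. To do that, we can use a MongoDB upsert: if a question is already in the database, we update its title; otherwise we insert the new question.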
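The upsert behavior can be pictured without a database. The sketch below is an in-memory stand-in, not MongoDB's API: a plain dict keyed on the question URL plays the role of the collection, and the upsert helper and the sample items are hypothetical.

```python
def upsert(collection, item, key='url'):
    """Insert item, or replace the stored document that has the same key value.

    `collection` is a plain dict standing in for a MongoDB collection.
    Returns True when a new document was inserted, False when one was replaced.
    """
    inserted = item[key] not in collection
    collection[item[key]] = dict(item)
    return inserted


questions = {}

# First sighting of a question: it is inserted.
first = upsert(questions, {'url': '/questions/1/example', 'title': 'Old title'})

# Same URL seen again on a later crawl: the stored document is replaced.
second = upsert(questions, {'url': '/questions/1/example', 'title': 'New title'})

# A different URL: inserted as a new document.
third = upsert(questions, {'url': '/questions/2/another', 'title': 'Another'})

print(len(questions), questions['/questions/1/example']['title'])
```

Crawling the same question twice leaves a single document whose title reflects the most recent crawl, which is the effect the pipeline is after.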

Modify the MongoDBPipeline we defined earlier:

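import pymongo
from scrapy.exceptions import DropItem
from scrapy.utils.project import get_project_settings

settings = get_project_settings()


class MongoDBPipeline(object):

    def __init__(self):
        # pymongo.Connection was removed in PyMongo 3; MongoClient replaces it.
        connection = pymongo.MongoClient(
            settings['MONGODB_SERVER'],
            settings['MONGODB_PORT']
        )
        db = connection[settings['MONGODB_DB']]
        self.collection = db[settings['MONGODB_COLLECTION']]

    def process_item(self, item, spider):
        # Drop the item if any of its fields is empty.
        for field in item:
            if not item[field]:
                raise DropItem("Missing data!")
        # Upsert keyed on the URL: replace the stored question if it
        # exists, otherwise insert it.
        self.collection.replace_one(
            {'url': item['url']}, dict(item), upsert=True)
        spider.logger.debug("Question added to MongoDB database!")
        return item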

For simplicity, we did not optimize the query or deal with indexes, since this is not a production environment.

Testing

Start the spider!

$ scrapy crawl stack_crawler

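Now you can sit back and watch your database fill with data.

Conclusion

You can download the entire source code from the GitHub repository, and leave comments or questions below.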
