Implementing Table-Level Disaster Recovery Replication for Amazon DocumentDB with Change Streams

Published by Amazon Web Services Developers on 2023-03-26

Introduction
Amazon DocumentDB (with MongoDB compatibility) is a fully managed document database service that makes it easy to scale JSON workloads. It scales compute and storage independently to support millions of document read requests per second; automates hardware provisioning, patching, setup, and other database management tasks; provides 99.999999999% durability through automatic replication, continuous backup, and strict network isolation; and lets you use existing MongoDB drivers and tools with the Apache 2.0 open-source MongoDB 3.6 and 4.0 APIs. Given these strengths, more and more enterprises already use, or are about to use, DocumentDB to manage their JSON document databases.


Many businesses need to guarantee the continuity of data and operations and therefore have disaster recovery requirements for critical workloads and data. In June 2021, AWS launched Global Clusters for Amazon DocumentDB (with MongoDB compatibility). A global cluster provides disaster recovery in the event of a region-wide outage, while enabling low-latency global reads by allowing reads from the nearest Amazon DocumentDB cluster. Customers can use it to replicate the DocumentDB cluster in the region where their business runs to other regions and easily achieve cross-region disaster recovery at the data layer. However, because the Global Cluster feature is based on fast storage-level replication, as of this writing it unfortunately supports data synchronization and replication only at the instance level, and does not yet support disaster recovery at the database or collection level.

AWS also offers AWS Database Migration Service (DMS), which can synchronize data at the database or collection level and replicate it across regions with low latency and a low RPO to meet disaster recovery needs. However, for the data-protection requirements of a disaster recovery scenario, DMS does not yet support filtering out delete operations.

In this post, we present an end-to-end solution that uses Amazon Managed Streaming for Apache Kafka (MSK) as message middleware to buffer DocumentDB change stream events, replicating the database to another region while intercepting delete operations. In this example, us-east-1 (N. Virginia) is the primary region and already hosts the primary DocumentDB cluster, and us-west-2 (Oregon) is the DR region and already hosts the DR DocumentDB cluster. Python is used as the programming language; you can also implement the business logic in other mainstream languages such as Java or Node.js, although Ruby is not supported because of driver limitations. Please also use Amazon DocumentDB v4.0 or later. The reference architecture is shown in the figure below:
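In addition to the architecture diagram, the core data flow can be summarized in a minimal sketch (illustrative only, with a placeholder URI): watch the change stream of one collection, forward insert/update events toward MSK, and divert delete events toward S3. The full programs later in this post add resume tokens, error handling, and configuration via environment variables.

# Conceptual sketch only; the complete capture and apply programs appear below.
from pymongo import MongoClient

client = MongoClient("<primary DocumentDB cluster URI>", ssl=True, retryWrites=False)
watched = client["bar"]["foo"]

with watched.watch(full_document='updateLookup') as stream:
    for event in stream:
        if event['operationType'] in ('insert', 'update'):
            pass  # publish the event to the MSK topic; the DR region replays it
        elif event['operationType'] == 'delete':
            pass  # archive the event to S3 instead, so the delete never reaches DR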

Setting up the stream-capture host in the primary region
1. Configure OS environment variables on the stream-capture host in the primary region. Code:

Set the environment variables, replacing the placeholder values with your actual values. This post uses bar.foo as the collection watched for change streams by default; you can substitute your own DB and collection.

##Set the environment variables, replacing the placeholder values with your actual values. This post uses bar.foo as the watched collection by default; you can substitute your own DB and collection
echo -e "USERNAME=\"Your Primary MongoDB User\"\nexport USERNAME\nPASSWORD=\"Your Primary MongoDB password\"\nexport PASSWORD\nmongo_host=\"Primary MongoDB Cluster URI\"\nexport mongo_host\nstate_tbl=\"YOUR STATE COLLECTION\"\nexport state_tbl\nstate_db=\"YOUR STATE DB\"\nexport state_db\nwatched_db_name=\"bar\"\nexport watched_db_name\nwatched_tbl_name=\"foo\"\nexport watched_tbl_name\nevents_remain=1\nexport events_remain\nDocuments_per_run=100000\nexport Documents_per_run\nkfk_host=\"YOUR KFK URI\"\nexport kfk_host\nkfk_topic=\"changevents\"\nexport kfk_topic\nbucket_name=\"YOUR S3 BUCKET\"\nexport bucket_name\nS3_prefix=\"\"\nexport S3_prefix" >> .bash_profile
##Apply the environment variables
source .bash_profile
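Before moving on, you can verify that the variables are visible to Python processes with a quick check like the following sketch (the variable names are the ones defined above):

# Quick sanity check that the required environment variables are set.
import os

required = ['USERNAME', 'PASSWORD', 'mongo_host', 'state_tbl', 'state_db',
            'watched_db_name', 'watched_tbl_name', 'events_remain',
            'Documents_per_run', 'kfk_host', 'kfk_topic', 'bucket_name']
missing = [name for name in required if name not in os.environ]
print('Missing variables: {}'.format(missing) if missing else 'All variables set.')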
  2. Install pymongo and boto3 on the stream-capture host in the primary region.

Refer to "Create a Python 3 virtual environment with the Boto 3 library on Amazon Linux 2" (see References) to complete the installation and configuration of Python 3 and boto3; the details are not repeated here.

##Install pymongo and the kafka-python client used by the capture and apply programs
sudo pip install pymongo kafka-python
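A quick import check such as the following sketch confirms the libraries are available (the versions printed will vary with your environment):

# Confirm that the Python dependencies used later in this post are importable.
import boto3
import kafka
import pymongo

print('pymongo:', pymongo.version)
print('boto3:', boto3.__version__)
print('kafka-python:', kafka.__version__)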
  3. Install the MongoDB client and the SSL certificate on the stream-capture host in the primary region
##Download the SSL certificate to /tmp
wget -P /tmp https://s3.amazonaws.com/rds-downloads/rds-combined-ca-bundle.pem

##Configure the MongoDB YUM repository
echo -e "[mongodb-org-5.0]\nname=MongoDB Repository\nbaseurl=https://repo.mongodb.org/yum/amazon/2/mongodb-org/5.0/x86_64/\ngpgcheck=1\nenabled=1\ngpgkey=https://www.mongodb.org/static/pgp/server-5.0.asc" | sudo tee -a /etc/yum.repos.d/mongodb-org-5.0.repo
##Install the MongoDB client
sudo yum install -y mongodb-org-shell

Create the MSK topic that will receive the change stream events
To create the MSK topic, follow "Getting Started Using Amazon MSK, Step 3: Create a Topic" (see References); the steps are not repeated here. In step 12 of that guide, replace --topic MSKTutorialTopic with --topic changevents, then run step 12.

You should see the following message:

Created topic changevents.
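Optionally, the topic can also be verified from the stream-capture host with the kafka-python client used later in this post; a sketch, assuming the kfk_host environment variable set earlier:

# Optional check: confirm the changevents topic is visible from this host.
import os
from kafka import KafkaConsumer

consumer = KafkaConsumer(bootstrap_servers=os.environ['kfk_host'])
print('changevents' in consumer.topics())   # expect: True
consumer.close()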

Enable Amazon DocumentDB change streams
1. Use the mongo shell client to log in to the primary DocumentDB cluster

mongo --host $mongo_host:27017 --ssl --sslCAFile
/tmp/rds-combined-ca-bundle.pem --username $USERNAME --password $PASSWORD

2. Enable the change stream on bar.foo

db.adminCommand({modifyChangeStreams: 1,database: "bar",collection: "foo", enable: true});

3. Confirm success
{ "ok" : 1 }

Change stream capture program in the primary region

#!/bin/env python

import json
import logging
import os
import time
import boto3
import datetime
from pymongo import MongoClient
from pymongo.errors import OperationFailure
from kafka import KafkaProducer

db_client = None
kafka_client = None                                           
s3_client = None        
                                 
logging.basicConfig(level=logging.ERROR)

# The error code returned when data for the requested resume token has been deleted
err_code_136 = 136


def get_db_client():

    # Use a global variable if CX has interest in Lambda function instead of long-running python
    global db_client

    if db_client is None:
        logging.debug('Creating a new DB client.')

        try:

            username = os.environ['USERNAME']
            password = os.environ['PASSWORD']
            cluster_uri = os.environ['mongo_host']
            db_client = MongoClient(cluster_uri, ssl=True, retryWrites=False, ssl_ca_certs='/tmp/rds-combined-ca-bundle.pem')
            # Make an attempt to connect
            db_client.admin.command('ismaster')
            db_client["admin"].authenticate(name=username, password=password)
            logging.debug('Successfully created a new DB client.')
        except Exception as err:
            logging.error('Failed to create a new DB client: {}'.format(err))
            raise

    return db_client


def get_state_tbl_client():

    """Return a DocumentDB client for the collection in which we store processing state."""

    try:
        db_client = get_db_client()
        state_db_name = os.environ['state_db']
        state_tbl_name = os.environ['state_tbl']
        state_tbl = db_client[state_db_name][state_tbl_name]
    except Exception as err:
        logging.error('Failed to create new state collection client: {}'.format(err))
        raise

    return state_tbl


def get_last_position():

    last_position = None
    logging.debug('Locate the last position.')
    try:
        state_tbl = get_state_tbl_client()
        if "watched_tbl_name" in os.environ:
            position_point = state_tbl.find_one({'currentposition': True, 'watched_db': str(os.environ['watched_db_name']),
                'watched_tbl': str(os.environ['watched_tbl_name']), 'db_level': False})
        else:
            position_point = state_tbl.find_one({'currentposition': True, 'db_level': True,
                'watched_db': str(os.environ['watched_db_name'])})

        if position_point is not None:
            if 'lastProcessed' in position_point:
                last_position = position_point['lastProcessed']
        else:
            if "watched_tbl_name" in os.environ:
                state_tbl.insert_one({'watched_db': str(os.environ['watched_db_name']),
                    'watched_tbl': str(os.environ['watched_tbl_name']), 'currentposition': True, 'db_level': False})
            else:
                state_tbl.insert_one({'watched_db': str(os.environ['watched_db_name']), 'currentposition': True,
                    'db_level': True})

    except Exception as err:
        logging.error('Failed to locate the last processed id: {}'.format(err))
        raise

    return last_position


def save_last_position(resume_token):

    """Save the resume token by the last successfully processed change event."""

    logging.debug('Saving last processed id.')
    try:
        state_tbl = get_state_tbl_client()
        if "watched_tbl_name" in os.environ:
            state_tbl.update_one({'watched_db': str(os.environ['watched_db_name']),
                'watched_tbl': str(os.environ['watched_tbl_name'])}, {'$set': {'lastProcessed': resume_token}})
        else:
            state_tbl.update_one({'watched_db': str(os.environ['watched_db_name']), 'db_level': True},
                {'$set': {'lastProcessed': resume_token}})

    except Exception as err:
        logging.error('Failed to save last processed id: {}'.format(err))
        raise


def conn_kfk_producer():

    # Use a global variable if CX has interest in Lambda function instead of long-running python
    global kafka_client

    if kafka_client is None:
        logging.debug('Creating a new Kafka client.')

        try:
            # Serialize keys and values as JSON so the consumer in the DR region can decode them
            kafka_client = KafkaProducer(bootstrap_servers=os.environ['kfk_host'],
                                         key_serializer=lambda key: json.dumps(key).encode('utf-8'),
                                         value_serializer=lambda value: json.dumps(value).encode('utf-8'),
                                         retries=3)
        except Exception as err:
            logging.error('Failed to create a new Kafka client: {}'.format(err))
            raise

    return kafka_client


def produce_msg(producer_instance, topic_name, key, value):

    """Produce change events to MSK."""

    try:
        producer_instance.send(topic_name, key=key, value=value)
        producer_instance.flush()
    except Exception as err:
        logging.error('Error in publishing message: {}'.format(err))
        raise


def write_S3(event, database, collection, doc_id):

    global s3_client

    if s3_client is None:
        s3_client = boto3.resource('s3')

    try:
        logging.debug('Publishing message to S3.')
        # Only prepend the prefix when it is set to a non-empty value
        if os.environ.get('S3_prefix'):
            s3_client.Object(os.environ['bucket_name'], str(os.environ['S3_prefix']) + '/' + database + '/' +
                collection + '/' + datetime.datetime.now().strftime('%Y/%m/%d/') + doc_id).put(Body=event)
        else:
            s3_client.Object(os.environ['bucket_name'], database + '/' + collection + '/' +
                datetime.datetime.now().strftime('%Y/%m/%d/') + doc_id).put(Body=event)

    except Exception as err:
        logging.error('Error in publishing message to S3: {}'.format(err))
        raise

def main(event, context):
    """Read change events from DocumentDB and push them to MSK & S3."""

    events_processed = 0
    watcher = None
    kafka_client = None

    try:

        # Kafka client set up
        if "kfk_host" in os.environ:
            kafka_client = conn_kfk_producer()
            logging.debug('Kafka client set up.')

        # DocumentDB watched collection set up
        db_client = get_db_client()
        watched_db = os.environ['watched_db_name']
        if "watched_tbl_name" in os.environ:
            watched_tbl = os.environ['watched_tbl_name']
            watcher = db_client[watched_db][watched_tbl]
        else:
            watcher = db_client[watched_db]
        logging.debug('Watching table {}'.format(watcher))

        # DocumentDB sync set up
        state_sync_count = int(os.environ['events_remain'])
        last_position = get_last_position()
        logging.debug("last_position: {}".format(last_position))

        with watcher.watch(full_document='updateLookup', resume_after=last_position) as change_stream:
            i = 0

            while change_stream.alive and i < int(os.environ['Documents_per_run']):

                i += 1
                change_event = change_stream.try_next()
                logging.debug('Event: {}'.format(change_event))

                if change_event is None:
                    time.sleep(0.5)
                    continue
                else:
                    op_type = change_event['operationType']
                    op_id = change_event['_id']['_data']

                    if op_type == 'insert':
                        doc_body = change_event['fullDocument']
                        doc_id = str(doc_body.pop("_id", None))
                        # Copy the document before metadata fields are added, so only
                        # the original fields are replayed in the DR region
                        insert_body = doc_body.copy()
                        readable = datetime.datetime.fromtimestamp(change_event['clusterTime'].time).isoformat()
                        doc_body.update({'operation':op_type,'timestamp':str(change_event['clusterTime'].time),'timestampReadable':str(readable)})
                        doc_body.update({'insert_body':json.dumps(insert_body)})
                        doc_body.update({'db':str(change_event['ns']['db']),'coll':str(change_event['ns']['coll'])})
                        payload = {'_id':doc_id}
                        payload.update(doc_body)
                        # Publish event to MSK
                        produce_msg(kafka_client, os.environ['kfk_topic'], op_id, payload)

                    if op_type == 'update':
                        doc_id = str(change_event['documentKey']['_id'])
                        readable = datetime.datetime.fromtimestamp(change_event['clusterTime'].time).isoformat()
                        doc_body = {'operation':op_type,'timestamp':str(change_event['clusterTime'].time),'timestampReadable':str(readable)}
                        doc_body.update({'updateDescription':json.dumps(change_event['updateDescription'])})
                        doc_body.update({'db':str(change_event['ns']['db']),'coll':str(change_event['ns']['coll'])})
                        payload = {'_id':doc_id}
                        payload.update(doc_body)
                        # Publish event to MSK
                        produce_msg(kafka_client, os.environ['kfk_topic'], op_id, payload)

                    if op_type == 'delete':
                        doc_id = str(change_event['documentKey']['_id'])
                        readable = datetime.datetime.fromtimestamp(change_event['clusterTime'].time).isoformat()
                        doc_body = {'operation':op_type,'timestamp':str(change_event['clusterTime'].time),'timestampReadable':str(readable)}
                        doc_body.update({'db':str(change_event['ns']['db']),'coll':str(change_event['ns']['coll'])})
                        payload = {'_id':doc_id}
                        payload.update(doc_body)
                        # Archive the delete event to S3 instead of replaying it in the DR region
                        if "bucket_name" in os.environ:
                            write_S3(json.dumps(payload), str(change_event['ns']['db']), str(change_event['ns']['coll']), doc_id)

                    logging.debug('Processed event ID {}'.format(op_id))

                    events_processed += 1

    except OperationFailure as of:
        if of.code == err_code_136:
            # Data for the last processed ID has been deleted in the change stream.
            # Store the last known good state so our next invocation
            # starts from the most recently available data.
            save_last_position(None)
        raise

    except Exception as err:
        logging.error('Position point lost: {}'.format(err))
        raise

    else:

        if events_processed > 0:

            save_last_position(change_stream.resume_token)
            logging.debug('Synced token {} to state collection'.format(change_stream.resume_token))
            return {
                'statusCode': 200,
                'description': 'Success',
                'detail': json.dumps(str(events_processed) + ' records processed successfully.')
            }
        else:
            return {
                'statusCode': 201,
                'description': 'Success',
                'detail': json.dumps('No records to process.')
            }

    finally:

        # Close the Kafka client if it was created
        if kafka_client is not None:
            kafka_client.close()
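The main() function above keeps a Lambda-style (event, context) signature. When running it as a long-lived script on the stream-capture host instead, a thin wrapper like the following sketch, appended to the end of the capture script (so the imports above are already available), can invoke it in a loop; the sleep interval is an assumption you can tune to your RPO target.

# Hypothetical runner: invoke main() repeatedly when not using Lambda.
if __name__ == '__main__':
    while True:
        main(None, None)
        time.sleep(5)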

Setting up the stream-apply host environment in the DR region
Code:

##Set the environment variables, replacing the placeholder values with your actual values
echo -e "DR_USERNAME=\"Your DR MongoDB User\"\nexport DR_USERNAME\nDR_PASSWORD=\"Your DR MongoDB Password\"\nexport DR_PASSWORD\nDR_mongo_host=\"Your DR MongoDB cluster URI\"\nexport DR_mongo_host\nkfk_host=\"YOUR KFK URI\"\nexport kfk_host\nkfk_topic=\"changevents\"\nexport kfk_topic\nDocuments_per_run=100000\nexport Documents_per_run" >> .bash_profile
##Apply the environment variables
source .bash_profile

Change stream apply program in the DR region
Deploy and run the following Python code on the stream-apply host.

Python Code:

#!/bin/env python

import json
import logging
import os
import string
import sys
import time
import boto3
import datetime
from bson.objectid import ObjectId
from pymongo import MongoClient
from kafka import KafkaConsumer
                                                
db_client = None 
kafka_client = None                                                  

"""ERROR level for deployment."""                                
logging.basicConfig(level=logging.ERROR)

def get_db_client():
    global db_client

    if db_client is None:
        logging.debug('Creating a new DB client.')

        try:
            username = os.environ['DR_USERNAME']
            password = os.environ['DR_PASSWORD']
            cluster_uri = os.environ['DR_mongo_host']
            db_client = MongoClient(cluster_uri, ssl=True, retryWrites=False, ssl_ca_certs='/tmp/rds-combined-ca-bundle.pem')
            # Make an attempt to connect
            db_client.admin.command('ismaster')
            db_client["admin"].authenticate(name=username, password=password)
            logging.debug('Successfully created a new DB client.')
        except Exception as err:
            logging.error('Failed to create a new DB client: {}'.format(err))
            raise

    return db_client

def conn_kfk_consumer():
    global kafka_client

    if kafka_client is None:
        logging.debug('Creating a new Kafka consumer.')

        try:
            # Subscribe to the change events topic and decode the JSON keys and
            # values produced by the capture program in the primary region.
            kafka_client = KafkaConsumer(os.environ['kfk_topic'],
                                         bootstrap_servers=os.environ['kfk_host'],
                                         auto_offset_reset='latest',
                                         group_id='docdb',
                                         key_deserializer=lambda key: json.loads(key.decode('utf-8')),
                                         value_deserializer=lambda value: json.loads(value.decode('utf-8')))
        except Exception as err:
            logging.error('Failed to create a new Kafka client: {}'.format(err))
            raise

    return kafka_client


def poll_msg(consumer):
    """Poll documentdb changes from MSK."""

    try:
        # poll() returns a dict of {TopicPartition: [messages]}
        return consumer.poll(timeout_ms=500, max_records=1)
    except Exception as err:
        logging.error('Error in polling message: {}'.format(err))
        raise


def apply2mongodb(message, db_client):
    """Apply one insert/update change event from MSK to the DR cluster."""

    events_processed = 0
    event_body = message.value
    op_type = event_body['operation']

    if op_type == 'insert':
        coll_client = db_client[event_body['db']][event_body['coll']]
        insert_body = json.loads(event_body['insert_body'])
        payload = {'_id': ObjectId(event_body['_id'])}
        payload.update(insert_body)
        coll_client.insert_one(payload)
        events_processed += 1

    if op_type == 'update':
        coll_client = db_client[event_body['db']][event_body['coll']]
        update_body = json.loads(event_body['updateDescription'])['updatedFields']
        update_set = {"$set": update_body}
        payload = {'_id': ObjectId(event_body['_id'])}
        coll_client.update_one(payload, update_set)
        events_processed += 1

    return events_processed

def main(event, context):
    events_processed = 0
    kafka_client = None

    try:

        # DR DocumentDB client and Kafka consumer set up
        dr_db_client = get_db_client()
        kafka_client = conn_kfk_consumer()

        i = 0
        while i < int(os.environ['Documents_per_run']):
            i += 1
            records = poll_msg(kafka_client)
            if not records:
                time.sleep(0.5)
                continue
            for messages in records.values():
                for message in messages:
                    events_processed += apply2mongodb(message, dr_db_client)

    except Exception as err:
        logging.error('Failed to apply change events: {}'.format(err))
        raise

    else:

        if events_processed > 0:

            logging.debug('{} events were processed successfully'.format(events_processed))
            return {
                'statusCode': 200,
                'description': 'Success',
                'detail': json.dumps(str(events_processed) + ' events processed successfully.')
            }
        else:
            return {
                'statusCode': 201,
                'description': 'Success',
                'detail': json.dumps('No records to process.')
            }

    finally:

        # Close the Kafka consumer if it was created
        if kafka_client is not None:
            kafka_client.close()
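To gauge how far the DR region lags behind the primary, you can inspect the offsets of the docdb consumer group used above. The following is a sketch using kafka-python's end_offsets() and committed() calls; run it on the stream-apply host with the same environment variables.

# Sketch: estimate how many change events the DR apply program still has to process.
import os
from kafka import KafkaConsumer, TopicPartition

topic = os.environ['kfk_topic']
consumer = KafkaConsumer(bootstrap_servers=os.environ['kfk_host'], group_id='docdb')

partitions = [TopicPartition(topic, p) for p in consumer.partitions_for_topic(topic)]
end_offsets = consumer.end_offsets(partitions)
for tp in partitions:
    committed = consumer.committed(tp) or 0
    print('partition {}: lag {}'.format(tp.partition, end_offsets[tp] - committed))
consumer.close()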

Verifying the results

  1. Log in to the DocumentDB clusters in the primary region and in the DR region.
    Primary region:
mongo --host $mongo_host:27017 --ssl --sslCAFile
/tmp/rds-combined-ca-bundle.pem --username $USERNAME --password $PASSWORD

DR region:

mongo --host $DR_mongo_host:27017 --ssl --sslCAFile
/tmp/rds-combined-ca-bundle.pem --username $DR_USERNAME --password $DR_PASSWORD
  2. Insert data in the primary region
use bar;
db.foo.insertOne({"x":1});
  3. Observe in the DR region
use bar;
db.foo.find();
##Result
{"_id":ObjectId("9416e4a253875177a816b3d6"),"x":1}
  4. Update the data in the primary region
db.foo.updateOne({"x":1},{$set:{"x":2}});
  5. Observe in the DR region
db.foo.find();
##Result
{"_id":ObjectId("9416e4a253875177a816b3d6"),"x":2}

  6. In the primary region, insert a document y=1 into the non-watched collection exa

db.exa.insertOne({"y":1});

  7. List the collections in the primary region and note that the new exa collection has appeared

show tables;
exa
foo
  8. Observe in the DR region: exa does not appear, because exa is not in our watched collection and its change stream is not captured
show tables;
foo
  9. In the primary region, delete the x record from the foo collection
db.foo.deleteOne({"x":2});

##Observe the result: the foo collection in the primary region DocumentDB is now empty
db.foo.find();
##The result is empty
  10. Verify the contents of the foo collection in the DR region
db.foo.find();
##Result
{"_id":ObjectId("9416e4a253875177a816b3d6"),"x":2}
##The delete operation was intercepted

  11. Download the file from S3 and open it; its contents are:

{"_id":"ObjectId(9416e4a253875177a816b3d6)", "operation":"delete", "timestamp":1658233222,"timestampReadable":"2022-07-19 20:20:22", "db":"bar","coll":"foo"}
##This verifies that the delete command was intercepted and archived in S3.
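The archived objects can also be fetched programmatically. Here is a sketch using boto3, assuming the bucket and key layout written by write_S3 above:

# Sketch: list and print the delete events archived for bar.foo.
import os
import boto3

s3 = boto3.client('s3')
prefix = (os.environ.get('S3_prefix', '') + '/bar/foo/').lstrip('/')
resp = s3.list_objects_v2(Bucket=os.environ['bucket_name'], Prefix=prefix)
for obj in resp.get('Contents', []):
    body = s3.get_object(Bucket=os.environ['bucket_name'], Key=obj['Key'])['Body'].read()
    print(obj['Key'], body.decode('utf-8'))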

Summary
In this post we used MSK to asynchronously store the insert/update change stream events from DocumentDB, while intercepting delete change events and archiving them to S3 for later review. If the delete events need further analysis, Amazon Glue and Amazon Athena can be brought in to run ad-hoc queries against the log files stored in S3. The change stream events in MSK are applied to the DocumentDB cluster in the DR region, so the data there only grows and is never deleted, which protects the primary region's database from data loss caused by accidental operations and from costly, time-consuming restore procedures.
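As an illustration of that follow-up idea only: once a Glue/Athena table has been defined over the S3 prefix, the archived delete events can be queried ad hoc through boto3. The database name docdb_audit, the table name delete_events, and the output location below are hypothetical.

# Hypothetical sketch: ad-hoc query of archived delete events via Athena.
# Assumes a Glue/Athena table docdb_audit.delete_events has been defined
# over the S3 prefix used by write_S3; all names here are illustrative.
import boto3

athena = boto3.client('athena')
response = athena.start_query_execution(
    QueryString="SELECT * FROM delete_events WHERE coll = 'foo'",
    QueryExecutionContext={'Database': 'docdb_audit'},
    ResultConfiguration={'OutputLocation': 's3://YOUR-S3-BUCKET/athena-results/'}
)
print(response['QueryExecutionId'])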

References
Create a Python 3 virtual environment with the Boto 3 library on Amazon Linux 2

https://aws.amazon.com/cn/premiumsupport/knowledge-center/ec2...

Create an MSK topic

https://docs.aws.amazon.com/zh_cn/msk/latest/developerguide/c...

About the authors

Fu Xiaoming, Amazon Solutions Architect, is responsible for consulting on and designing cloud computing solutions, and focuses on research and advocacy in databases and edge computing. Before joining AWS he worked in the IT department of a financial firm, where he was responsible for the architecture of an online brokerage platform, and he has extensive experience with distributed systems, high concurrency, and middleware.


Liu Bingbing, Amazon Database Solutions Architect, is responsible for consulting on and designing database solutions on Amazon, and focuses on research and advocacy in big data. Before joining Amazon she worked at Oracle for many years and has extensive experience in database cloud planning, design, operations and tuning, DR solutions, big data and data warehousing, and enterprise applications.

Article source: https://dev.amazoncloud.cn/column/article/630994cd86218f3ca3e...
