A Low-Resource Approach to Real-Time Synchronization of RDS Data into StarRocks

Published by 一条路上的咸鱼 on 2024-11-28

1. Background

We need to synchronize data from an Alibaba Cloud RDS database into a self-hosted StarRocks cluster. Previously we used DolphinScheduler to schedule DataX jobs that loaded the data into StarRocks in batches, but as the business grew this approach ran into the following problems:

1. To meet the requirements of Level 3 classified protection (等保三級), Alibaba Cloud RDS can no longer be accessed over the public internet; it is only reachable from inside the Alibaba Cloud private network.

2. As the business grows, batch synchronization can no longer meet the required data freshness.

To solve these problems, the synchronization architecture below was built.

2. Data Synchronization Architecture

To address the problems above, the following architecture was designed for real-time synchronization:

1. Use a 4-core, 8 GB Alibaba Cloud server that can reach the RDS instance over the private network.

2. Enable public access to the Kafka cluster.

3. Deploy the real-time synchronization script on that server; it reads the RDS binlog in real time, parses and encrypts the events, and sends them to Kafka.

4. Build a Kafka Connect cluster inside the company network, and create connectors that decrypt the Kafka data and load it into the self-hosted StarRocks cluster.

3. Survey of Existing Solutions

In the early stage we evaluated the mainstream solutions, but each of them had problems, which is why we ended up with the current approach. The solutions considered were:

  1. Flink-CDC

Flink-CDC is the most popular real-time synchronization solution today, but a Flink cluster needs far more resources than we have: a single 4-core, 8 GB Alibaba Cloud server is all that is available, and that is already the highest configuration we can get.

In addition, the business uses many different sharding schemes (by range, by organization, by year, by month, by organization plus month, and so on), and for historical reasons the column naming is inconsistent (all upper case, all lower case, camel case, snake case, and so on). With Flink-CDC, creating one task per shard would require resources we simply do not have, while coupling everything into a single job would require a lot of custom code and be hard to change later, so this option was dropped.

  2. Apache SeaTunnel

Apache SeaTunnel is another popular real-time synchronization tool, but it currently cannot match tables by pattern. Since the business systems use several sharding schemes and the number of shards is huge (some tables have hundreds or thousands of shards, and tables sharded by organization can be created at any time), it is very hard to accommodate: thousands of tables would have to be configured one by one, which is not practical, so this option was also dropped.

4. Technical Design of the Core Steps

  1. Real-time binlog consumption

Real-time binlog consumption is implemented with the open-source project python-mysql-replication, a pure-Python implementation of the MySQL replication protocol built on top of PyMySQL. With it, consuming the binlog of the RDS database in real time is straightforward.

Pure Python Implementation of MySQL replication protocol build on top of PyMYSQL. This allows you to receive event like insert, update, delete with their datas and raw SQL queries.
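A minimal sketch of how the library is typically used is shown below (host, credentials, and server_id are placeholders; the full production script in section 6 adds table matching, column mapping, and encryption on top of this):

# Minimal sketch: stream row events from MySQL/RDS with python-mysql-replication.
# Assumes the account has REPLICATION SLAVE / REPLICATION CLIENT privileges and
# that the instance uses binlog_format=ROW.
from pymysqlreplication import BinLogStreamReader
from pymysqlreplication.row_event import (
    WriteRowsEvent, UpdateRowsEvent, DeleteRowsEvent,
)

stream = BinLogStreamReader(
    connection_settings={"host": "127.0.0.1", "port": 3306,
                         "user": "repl", "passwd": "secret"},
    server_id=100,            # must be unique among all replicas of this instance
    resume_stream=True,
    blocking=True,
    only_events=[WriteRowsEvent, UpdateRowsEvent, DeleteRowsEvent],
)

for event in stream:
    for row in event.rows:
        print(event.schema, event.table, row)

stream.close()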

  2. Data encryption and decryption

To keep data safe while it travels between the two private networks, everything that is transferred must be encrypted. After some research we chose AES-GCM, a symmetric cipher that is fast, supports hardware acceleration, and is well suited to encrypting large volumes of data, files, and streams.
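A minimal sketch of the AES-GCM envelope used throughout this design, assuming pycryptodome (the key below is a placeholder). The ciphertext and tag are concatenated so that the Java side can verify and decrypt with a single doFinal() call; see section 6 and the pitfalls section:

# Sketch: AES-GCM encryption producing the {nonce, ciphertext||tag} envelope.
import binascii
import json
from Crypto.Cipher import AES

key = binascii.unhexlify("00112233445566778899aabbccddeeff")  # demo key only

def encrypt(plaintext: str) -> str:
    cipher = AES.new(key, AES.MODE_GCM)                # a random nonce is generated
    ciphertext, tag = cipher.encrypt_and_digest(plaintext.encode("utf-8"))
    return json.dumps({"nonce": cipher.nonce.hex(),
                       "ciphertext": ciphertext.hex() + tag.hex()})

print(encrypt('{"id": 1, "name": "demo"}'))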

  3. Loading data into StarRocks

Data is consumed from Kafka into StarRocks with the officially supported starrocks-connector-for-kafka. Because our data is encrypted, the connector has to be extended with a step that decrypts each record.

  4. Kafka internal/external network mapping

RDS and StarRocks live in two different private networks, and Kafka is used as the bridge between them, so Kafka has to be reachable from the public internet. This is done by configuring different advertised.listeners for the internal and external endpoints.

  5. Batch synchronization

When synchronizing a table that already has data, the historical data must be loaded first, and only then is new data applied from the binlog. The historical load is done with DataX.

  6. Writing to Kafka from DataX

DataX is a batch synchronization tool while Kafka is a streaming component; their roles differ, so DataX does not ship with an official Kafka writer. We therefore extended it with our own Kafka-Writer plugin.

  7. Insert, update, and delete on StarRocks tables

StarRocks provides the primary key table model, which supports inserts, updates, and deletes. starrocks-connector-for-kafka is built on Stream Load, and Stream Load supports inserting, updating, and deleting rows by adding an __op field to each record.
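For illustration, this is how a row is tagged before being sent to Kafka: __op = 0 means upsert (insert or update) and __op = 1 means delete. The field names follow the example table in section 7, and tag_row is a hypothetical helper:

# Sketch: tagging a row with the __op field understood by Stream Load.
def tag_row(row: dict, deleted: bool) -> dict:
    tagged = dict(row)
    tagged["__op"] = 1 if deleted else 0
    return tagged

print(tag_row({"id": 1, "company_name": "acme"}, deleted=False))  # {'id': 1, 'company_name': 'acme', '__op': 0}
print(tag_row({"id": 1, "company_name": "acme"}, deleted=True))   # {'id': 1, 'company_name': 'acme', '__op': 1}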

  8. Support for sharded databases and tables

The system shards databases and tables in several ways, and primary keys may collide across shards. During synchronization the sharding scheme is therefore analyzed, and the corresponding StarRocks table uses (table name, original primary key) or (database name, table name, original primary key) as its primary key.
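A sketch of the idea, mirroring what the production script's _map_columns does when need_table is enabled (merge_shard_row and the shard name below are illustrative): the source table name travels with the row, so (table_name, id) stays unique in the merged StarRocks table.

# Sketch: carry the shard name along so the StarRocks primary key (table_name, id) is unique.
def merge_shard_row(row: dict, source_table: str) -> dict:
    merged = dict(row)
    merged["table_name"] = source_table   # becomes part of the StarRocks primary key
    return merged

print(merge_shard_row({"id": 42, "amount": 9.5}, "t_order_202401"))
# {'id': 42, 'amount': 9.5, 'table_name': 't_order_202401'}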

5. The Synchronization Process Step by Step

Using an existing table as an example, the whole synchronization process is as follows:

  1. Based on the structure of the RDS table to be synchronized, create the corresponding table in StarRocks.
  2. Create the Kafka topics for the historical batch load and for the incremental binlog feed.
  3. Add the table's binlog configuration to the incremental synchronization script, so that binlog events are written to the incremental topic.
  4. Use DataX to load the full historical data into the batch topic.
  5. Create the connector that consumes the batch topic and loads the historical data into the StarRocks table.
  6. Compare the row count reported by DataX with the row count received by StarRocks; if they match, the historical load is complete, and the batch topic and connector can either be deleted or simply left alone. A quick way to do this comparison is sketched after this list.
  7. Create the connector for the incremental topic, which consumes the binlog data and applies it to StarRocks in real time.
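A sketch of the row-count check in step 6, assuming StarRocks is reachable on its MySQL-protocol port (9030) and that the number of rows read is taken from the DataX job summary; connection details are placeholders:

# Sketch: compare the DataX read count with the row count that reached StarRocks.
import pymysql

def starrocks_count(table: str) -> int:
    conn = pymysql.connect(host="starrocks-fe", port=9030,
                           user="user", passwd="password", db="ods_cjm")
    try:
        with conn.cursor() as cur:
            cur.execute("SELECT COUNT(*) FROM {}".format(table))
            return cur.fetchone()[0]
    finally:
        conn.close()

datax_read_count = 1000000          # taken from the DataX job summary
if starrocks_count("ods_test_1") == datax_read_count:
    print("historical load complete; the batch topic and connector can be removed")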

6. Implementation

  1. The DataX KafkaWriter plugin
public class KafkaWriter extends Writer {

    public static class Job extends Writer.Job {

        private static final Logger logger = LoggerFactory.getLogger(Job.class);
        private Configuration conf = null;

        @Override
        public List<Configuration> split(int mandatoryNumber) {
            List<Configuration> configurations = new ArrayList<Configuration>(mandatoryNumber);
            for (int i = 0; i < mandatoryNumber; i++) {
                configurations.add(conf);
            }
            return configurations;
        }

        private void validateParameter() {
            this.conf.getNecessaryValue(Key.BOOTSTRAP_SERVERS, KafkaWriterErrorCode.REQUIRED_VALUE);
            this.conf.getNecessaryValue(Key.TOPIC, KafkaWriterErrorCode.REQUIRED_VALUE);
        }

        @Override
        public void init() {
            this.conf = super.getPluginJobConf();
            logger.info("kafka writer params:{}", conf.toJSON());
            this.validateParameter();
        }


        @Override
        public void destroy() {

        }
    }

    public static class Task extends Writer.Task {
        private static final Logger logger = LoggerFactory.getLogger(Task.class);
        private static final String NEWLINE_FLAG = System.getProperty("line.separator", "\n");

        private Producer<String, String> producer;
        private String fieldDelimiter;
        private Configuration conf;
        private Properties props;
        private AesEncryption aesEncryption;
        private List<String> columns;

        @Override
        public void init() {
            this.conf = super.getPluginJobConf();
            fieldDelimiter = conf.getUnnecessaryValue(Key.FIELD_DELIMITER, "\t", null);
            columns = conf.getList(Key.COLUMN_LIST, new ArrayList<>(), String.class);

            props = new Properties();
            props.put("bootstrap.servers", conf.getString(Key.BOOTSTRAP_SERVERS));
            props.put("acks", conf.getUnnecessaryValue(Key.ACK, "0", null));//這意味著leader需要等待所有備份都成功寫入日誌,這種策略會保證只要有一個備份存活就不會丟失資料。這是最強的保證。
            props.put("retries", conf.getUnnecessaryValue(Key.RETRIES, "5", null));
            props.put("retry.backoff.ms", "1000");
            props.put("batch.size", conf.getUnnecessaryValue(Key.BATCH_SIZE, "16384", null));
            props.put("linger.ms", 100);
            props.put("connections.max.idle.ms", 300000);
            props.put("max.in.flight.requests.per.connection", 5);
            props.put("socket.keepalive.enable", true);
            props.put("key.serializer", conf.getUnnecessaryValue(Key.KEYSERIALIZER, "org.apache.kafka.common.serialization.StringSerializer", null));
            props.put("value.serializer", conf.getUnnecessaryValue(Key.VALUESERIALIZER, "org.apache.kafka.common.serialization.StringSerializer", null));
            producer = new KafkaProducer<String, String>(props);
            String encryptKey = conf.getUnnecessaryValue(Key.ENCRYPT_KEY, null, null);
            if(encryptKey != null){
                aesEncryption = new AesEncryption(encryptKey);
            }
        }

        @Override
        public void prepare() {
            AdminClient adminClient = AdminClient.create(props);
            ListTopicsResult topicsResult = adminClient.listTopics();
            String topic = conf.getNecessaryValue(Key.TOPIC, KafkaWriterErrorCode.REQUIRED_VALUE);
            try {
                if (!topicsResult.names().get().contains(topic)) {
                    // Create the topic if it does not exist yet.
                    NewTopic newTopic = new NewTopic(
                            topic,
                            Integer.parseInt(conf.getUnnecessaryValue(Key.TOPIC_NUM_PARTITION, "1", null)),
                            Short.parseShort(conf.getUnnecessaryValue(Key.TOPIC_REPLICATION_FACTOR, "1", null))
                    );
                    adminClient.createTopics(Collections.singletonList(newTopic)).all().get();
                }
            } catch (Exception e) {
                throw new DataXException(KafkaWriterErrorCode.CREATE_TOPIC, KafkaWriterErrorCode.CREATE_TOPIC.getDescription());
            } finally {
                adminClient.close();
            }
        }

        @Override
        public void startWrite(RecordReceiver lineReceiver) {
            logger.info("start to writer kafka");
            Record record = null;
            while ((record = lineReceiver.getFromReader()) != null) {
                if (conf.getUnnecessaryValue(Key.WRITE_TYPE, WriteType.TEXT.name(), null)
                        .equalsIgnoreCase(WriteType.TEXT.name())) {
                    producer.send(new ProducerRecord<String, String>(this.conf.getString(Key.TOPIC),
                            Md5Encrypt.md5Hexdigest(recordToString(record)),
                            aesEncryption ==null ? recordToString(record): JSONObject.toJSONString(aesEncryption.encrypt(recordToString(record))))
                    );
                } else if (conf.getUnnecessaryValue(Key.WRITE_TYPE, WriteType.TEXT.name(), null)
                        .equalsIgnoreCase(WriteType.JSON.name())) {
                    producer.send(new ProducerRecord<String, String>(this.conf.getString(Key.TOPIC),
                            Md5Encrypt.md5Hexdigest(recordToString(record)),
                            aesEncryption ==null ? recordToJsonString(record) : JSONObject.toJSONString(aesEncryption.encrypt(recordToJsonString(record))))
                    );
                }
                producer.flush();
            }
        }

        @Override
        public void destroy() {
            if (producer != null) {
                producer.close();
            }
        }

        /**
         * Format a record as a delimited text line.
         *
         * @param record
         * @return
         */
        private String recordToString(Record record) {
            int recordLength = record.getColumnNumber();
            if (0 == recordLength) {
                return NEWLINE_FLAG;
            }
            Column column;
            StringBuilder sb = new StringBuilder();
            for (int i = 0; i < recordLength; i++) {
                column = record.getColumn(i);
                sb.append(column.asString()).append(fieldDelimiter);
            }

            sb.setLength(sb.length() - 1);
            sb.append(NEWLINE_FLAG);
            return sb.toString();
        }

        /**
         * Format a record as a JSON string.
         *
         * @param record the record to format
         *
         */
        private String recordToJsonString(Record record) {
            int recordLength = record.getColumnNumber();
            if (0 == recordLength) {
                return "{}";
            }
            Map<String, Object> map = new HashMap<>();
            for (int i = 0; i < recordLength; i++) {
                String key = columns.get(i);
                Column column = record.getColumn(i);
                map.put(key, column.getRawData());
            }
            return JSONObject.toJSONString(map);
        }
    }
}

The data encryption implementation:

public class AesEncryption {

    private SecretKey secretKey;

    private static final int GCM_TAG_LENGTH = 16; // 16 bytes (128 bits); also used as the nonce length below

    public AesEncryption(String secretKey) {
        byte[] keyBytes = hexStringToByteArray(secretKey);
        this.secretKey = new SecretKeySpec(keyBytes, 0, keyBytes.length, "AES");
    }

    public ResultModel encrypt(String data) {
        try {
            byte[] nonce = new byte[GCM_TAG_LENGTH];
            new SecureRandom().nextBytes(nonce);
            Cipher cipher = Cipher.getInstance("AES/GCM/NoPadding");
            GCMParameterSpec gcmSpec = new GCMParameterSpec(GCM_TAG_LENGTH * 8, nonce);
            cipher.init(Cipher.ENCRYPT_MODE, secretKey, gcmSpec);
            byte[] encryptedBytes = cipher.doFinal(data.getBytes());
            return new ResultModel(bytesToHex(nonce), bytesToHex(encryptedBytes));
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    /**
     * Convert a hex string to a byte array.
     */
    private byte[] hexStringToByteArray(String s) {
        int len = s.length();
        byte[] data = new byte[len / 2];
        for (int i = 0; i < len; i += 2) {
            data[i / 2] = (byte) ((Character.digit(s.charAt(i), 16) << 4)
                    + Character.digit(s.charAt(i + 1), 16));
        }
        return data;
    }

    // Convert a byte array to a hex string
    public static String bytesToHex(byte[] bytes) {
        StringBuilder hexString = new StringBuilder();
        for (byte b : bytes) {
            String hex = Integer.toHexString(0xff & b);
            if (hex.length() == 1) hexString.append('0');
            hexString.append(hex);
        }
        return hexString.toString();
    }

}
  2. Extending starrocks-connector-for-kafka
package com.starrocks.connector.kafka.transforms;

public class DecryptJsonTransformation <R extends ConnectRecord<R>> implements Transformation<R> {
    private static final Logger LOG = LoggerFactory.getLogger(DecryptJsonTransformation.class);
    private AesEncryption aesEncryption;

    private interface ConfigName {
        String SECRET_KEY = "secret.key";
    }

    public static final ConfigDef CONFIG_DEF = new ConfigDef()
            .define(ConfigName.SECRET_KEY, ConfigDef.Type.STRING, ConfigDef.Importance.HIGH, "secret key");


    @Override
    public R apply(R record) {
        if (record.value() == null) {
            return record;
        }
        String value = (String) record.value();
        try {
            String newValue = aesEncryption.decrypt(value);
            JSONObject jsonObject = JSON.parseObject(newValue, JSONReader.Feature.UseBigDecimalForDoubles);
            return record.newRecord(record.topic(), record.kafkaPartition(), record.keySchema(), record.key(), null, jsonObject, record.timestamp());
        } catch (Exception e) {
            return record;
        }
    }

    @Override
    public ConfigDef config() {
        return CONFIG_DEF;
    }

    @Override
    public void close() {

    }

    @Override
    public void configure(Map<String, ?> map) {
        final SimpleConfig config = new SimpleConfig(CONFIG_DEF, map);
        String secretKey = config.getString(ConfigName.SECRET_KEY);
        aesEncryption = new AesEncryption(secretKey);
    }
}

public class AesEncryption {

    private SecretKeySpec secretKey;

    public AesEncryption(String secretKey) {
        byte[] keyBytes = hexStringToByteArray(secretKey);
        this.secretKey = new SecretKeySpec(keyBytes, "AES");
    }

    public String encrypt(String data) {
        try {
            Cipher cipher = Cipher.getInstance("AES/GCM/NoPadding");
            cipher.init(Cipher.ENCRYPT_MODE, secretKey);
            byte[] encryptedBytes = cipher.doFinal(data.getBytes());
            return Base64.getEncoder().encodeToString(encryptedBytes);
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    public String decrypt(String encryptedData) throws Exception {
        JSONObject jsonMessage = JSONObject.parseObject(encryptedData);
        byte[] ciphertext = hexStringToByteArray(jsonMessage.getString("ciphertext"));
        byte[] nonce = hexStringToByteArray(jsonMessage.getString("nonce"));
        return decryptData(ciphertext, nonce);
    }


    /**
     * Decrypt data with AES-GCM.
     * @param ciphertext ciphertext with the GCM authentication tag appended
     * @param nonce random IV (nonce)
     * @return the decrypted plaintext
     * @throws Exception
     */
    private  String decryptData(byte[] ciphertext, byte[] nonce) throws Exception {
        // GCMParameterSpec carries the tag length and the nonce used for authentication
        GCMParameterSpec gcmSpec = new GCMParameterSpec(128, nonce); // 128-bit tag
        // Create the AES cipher in decrypt mode
        Cipher cipher = Cipher.getInstance("AES/GCM/NoPadding");
        cipher.init(Cipher.DECRYPT_MODE, secretKey , gcmSpec);
        // doFinal verifies the appended tag and decrypts the data
        byte[] decryptedData = cipher.doFinal(ciphertext);
        // Return the decrypted plaintext
        return new String(decryptedData);
    }

    /**
     * Convert a hex string to a byte array.
     */
    private byte[] hexStringToByteArray(String s) {
        int len = s.length();
        byte[] data = new byte[len / 2];
        for (int i = 0; i < len; i += 2) {
            data[i / 2] = (byte) ((Character.digit(s.charAt(i), 16) << 4)
                    + Character.digit(s.charAt(i + 1), 16));
        }
        return data;
    }
}
  3. Public network configuration for Kafka

To expose Kafka on both the internal and external networks, only the following settings in the server.properties file under kafka/config need to be changed.

# Listeners: each broker binds both an internal (9093) and an external (9092) port
listeners=INTERNAL://<broker internal IP>:9093,EXTERNAL://<broker internal IP>:9092

# Advertised listeners: internal clients use the internal IP on 9093, external clients use the public IP on 19092
advertised.listeners=INTERNAL://<broker internal IP>:9093,EXTERNAL://<public IP>:19092

# Security protocol for each listener
listener.security.protocol.map=INTERNAL:PLAINTEXT,EXTERNAL:PLAINTEXT

# Listener used for inter-broker communication
inter.broker.listener.name=INTERNAL
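A quick way to verify from the office network that the advertised public endpoint accepts connections (a sketch assuming kafka-python 2.0+, which provides bootstrap_connected(); the address is a placeholder):

# Sketch: check that the EXTERNAL listener advertised on the public IP is reachable.
from kafka import KafkaProducer

producer = KafkaProducer(bootstrap_servers="<public IP>:19092")
print("connected:", producer.bootstrap_connected())   # True once broker metadata has been fetched
producer.close()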
  4. Deploying Kafka Connect

a. Create a directory to hold the connector plugin

# Create the directory
mkdir /opt/kafka-connect 

# Unpack the connector archive into this directory

b. Edit Kafka's config/connect-distributed.properties (one node shown as an example)

# Kafka bootstrap servers
bootstrap.servers=192.168.20.41:9093,192.168.20.42:9093,192.168.20.43:9093

# Plugin path where the StarRocks connector is unpacked
plugin.path=/vmimg/opt/kafka-connect/starrocks-kafka-connector

c. Start Connect (one node shown as an example)

 nohup bin/connect-distributed.sh config/connect-distributed.properties > start_connect.log 2>&1 &
  5. Listening to the database binlog and sending encrypted events to Kafka
import os
import json
import binascii
import logging
import re
from typing import List, Dict
from datetime import datetime, date, timedelta
from decimal import Decimal
from Crypto.Cipher import AES
from pymysqlreplication import BinLogStreamReader
from pymysqlreplication.row_event import BINLOG
from kafka import KafkaProducer


logging.basicConfig(level=logging.INFO,
                    format='%(asctime)s - %(levelname)s - %(lineno)d - %(message)s',
                    filename='mzt-stream-rds1.log',
                    filemode='w')

class TableConfig:
    """
    表配置資訊
    """
    def __init__(self, key: str, topic: str, need_table: bool, complete_regex: bool, regex: str, column_mapping: dict ):
        """
        :param key:表的唯一id
        :param topic  topic名單
        :param need_table  是否需要將表名稱作為一個欄位寫入表,用於解決分庫分別問題
        :param complete_regex  是否完全匹配還是正則匹配
        :param regex  匹配表的正規表示式
        :param column_mapping  列的對應關係
        """
        self.key = key
        self.topic = topic
        self.need_table = need_table
        self.complete_regex = complete_regex
        self.regex = regex
        self.column_mapping = column_mapping

class TableConfigReader:
    """
    解析表的配置檔案
    """
    def __init__(self, directory_path: str):
        """
        :param directory_path 配置檔案的目錄
        """
        self.directory_path = directory_path
        # 配置列表
        self.table_config_list: List[TableConfig] = []

    def read(self):
        """
        讀取所有的配置檔案,轉換為配置列表
        """
        entries = os.listdir(self.directory_path)
        # 過濾出所有檔案
        files = [entry for entry in entries if
                 os.path.isfile(os.path.join(self.directory_path, entry)) and entry.endswith(".json")]
        logging.info(f"讀取配置檔案數量:{len(files)}")
        for file in files:
            file_path = os.path.join(self.directory_path, file)
            with open(file_path, 'r', encoding='utf-8') as f:
                content = f.read()
            json_data = json.loads(content)
            self.table_config_list.append(TableConfig(json_data['key'], json_data['topic'], json_data['need_table'], json_data['complete_regex'],  json_data['regex'], json_data['column_mapping']))


class PrefixTrie:
    """
    用於匹配表名稱
    """
    def __init__(self, complete_regex_map: Dict[str, TableConfig], not_complete_list:List[TableConfig]):
        """
        :param complete_regex_map 完全匹配的表配置資訊字典
        :param not_complete_list 正則匹配的列表
        """
        self.complete_regex_map = complete_regex_map
        self.not_complete_list = not_complete_list

    def search(self, text):
        if text in self.complete_regex_map.keys():
            # 完全匹配
            return self.complete_regex_map[text]
        for data in self.not_complete_list:
            # 正則匹配
            match = re.match(data.regex, text)
            if match:
                return data
        return None


class MysqlConfig:
    """
    MYSQL的連線
    """
    def __init__(self, host:str, port:int, user:str, password:str, service_id:int):
        """
        :param host 資料庫的host
        :param port 資料庫的port
        :param user 資料庫的user
        :param password 資料庫的password
        :param service_id 資料庫的server_id,  用於binlog同步
        """
        self.host = host
        self.port = port
        self.user = user
        self.password = password
        self.service_id = service_id

class KafkaBinlogStreamer:
    """
    真正的binlog消費
    """
    def __init__(self, kafka_server: str, mysql_config:MysqlConfig, trie_tree: PrefixTrie, aes_key):
        """
        :param kafka_server  kafka的地址
        :param mysql_config  MYSQL的配置資訊
        :param trie_tree  名稱匹配的資訊
        :param aes_key  資料加密的秘鑰
        """
        self.producer = self._init_kafka_producer(kafka_server)
        self.mysql_config = mysql_config
        self.stream = None
        self.trie_tree = trie_tree
        self.aes_key = aes_key

    def _init_kafka_producer(self, kafka_server):
        """
        Initialize the Kafka producer.
        :param kafka_server: Kafka bootstrap servers
        """
        producer = KafkaProducer(
            bootstrap_servers=kafka_server,
            batch_size=16384,  # batch size in bytes
            linger_ms=100,  # wait up to 100 ms for more messages to fill a batch
            max_request_size=10485760,  # maximum request size in bytes
            acks=1,  # wait for the leader's acknowledgement only
            retries=3,  # number of retries
            value_serializer=lambda v: json.dumps(v, default=self.json_serial, ensure_ascii=False).encode('utf-8')
        )
        logging.info("Kafka producer initialized.")
        return producer

    @staticmethod
    def json_serial(obj):
        """
        JSON serializer for objects not serializable by default json code
        """
        if isinstance(obj, (datetime, date)):
            return obj.strftime('%Y-%m-%d %H:%M:%S')
        if isinstance(obj, Decimal):
            return float(obj)
        if isinstance(obj, bytes):
            return obj.decode('utf-8')
        if isinstance(obj, timedelta):
            # 將 timedelta 型別轉換為字串格式
            return str(obj)
        if isinstance(obj, dict):
            return {KafkaBinlogStreamer.json_serial(k): KafkaBinlogStreamer.json_serial(v) for k, v in obj.items()}
            # 處理列表型別
        if isinstance(obj, list):
            return [KafkaBinlogStreamer.json_serial(item) for item in obj]
        logging.warning(f"Type '{obj.__class__}' for '{obj}' not serializable")
        return None

    @staticmethod
    def convert_bytes_keys_to_str(data):
        """
        Recursively convert bytes keys and values in a dict to str.
        """
        if isinstance(data, dict):
            # convert bytes keys to str and recurse into the values
            return {
                (k.decode('utf-8') if isinstance(k, bytes) else k): KafkaBinlogStreamer.convert_bytes_keys_to_str(v)
                for k, v in data.items()
            }
        elif isinstance(data, list):
            # recurse into every element of the list
            return [KafkaBinlogStreamer.convert_bytes_keys_to_str(item) for item in data]
        elif isinstance(data, bytes):
            # decode bytes values
            return data.decode('utf-8')
        elif isinstance(data, datetime):
            # format datetime values as strings
            return data.strftime('%Y-%m-%d %H:%M:%S')
        else:
            return data

    def build_message(self, binlog_evt, trie_tree: PrefixTrie):
        """
        構建訊息
        :param binlog_evt binlog事件
        :param trie_tree 匹配樹
        """
        schema = str(f"{getattr(binlog_evt, 'schema', '')}.{getattr(binlog_evt, 'table', '')}")
        table_name = str(f"{getattr(binlog_evt, 'table', '')}")
        # 獲取配置
        table_config = trie_tree.search(schema)
        if table_config is None:
            return None
        topic = table_config.topic
        if binlog_evt.event_type == BINLOG.WRITE_ROWS_EVENT_V1:
            # Insert
            data_rows = binlog_evt.rows
            data_list = []
            for data_row in data_rows:
                data_list.append(self._map_columns(self.convert_bytes_keys_to_str(data_row['values']), table_name, table_config, 0))
            return {'event': 'INSERT', 'headers': {'topic': topic}, 'data_list': data_list}
        elif binlog_evt.event_type == BINLOG.UPDATE_ROWS_EVENT_V1:
            # Update
            data_rows = binlog_evt.rows
            data_list = []
            for data_row in data_rows:
                data_list.append(self._map_columns(self.convert_bytes_keys_to_str(data_row['after_values']), table_name, table_config, 0))
            return {'event': 'INSERT', 'headers': {'topic': topic}, 'data_list': data_list}
        elif binlog_evt.event_type == BINLOG.DELETE_ROWS_EVENT_V1:
            # Delete
            data_rows = binlog_evt.rows
            data_list = []
            for data_row in data_rows:
                data_list.append(self._map_columns(self.convert_bytes_keys_to_str(data_row['values']), table_name, table_config, 1))
            return {'event': 'DELETE', 'headers': {'topic': topic}, 'data_list': data_list}
        return None

    @staticmethod
    def _map_columns(values, table_name: str, table_config: TableConfig, op_data: int):
        """
        對列名進行對映
        :param values 資料
        :param table_name 表名稱
        :param table_config 表的配置
        :param op_data op操作
        """
        column_mapping = table_config.column_mapping
        need_table = table_config.need_table
        mapped_values = {}
        for column, value in values.items():
            # 如果列名在對映字典中,則替換為對映的列名
            if column in column_mapping:
                mapped_column = column_mapping.get(column)
                mapped_values[mapped_column] = value
        if need_table:
            mapped_values['table_name'] = table_name
        mapped_values['__op'] = op_data
        return mapped_values

    def encrypt_data(self, data):
        """
        使用 AES-GCM 加密資料
        :param data 待加密資料
        """
        cipher = AES.new(self.aes_key, AES.MODE_GCM)
        ciphertext, tag = cipher.encrypt_and_digest(data.encode('utf-8'))
        return {
            'nonce': cipher.nonce.hex(),
            'ciphertext': ciphertext.hex()  + tag.hex()
        }

    def start_stream(self):
        """
        開始監聽 MySQL Binlog 流
        """
        logging.info("Starting binlog stream...")
        mysql_settings = {
            'host': self.mysql_config.host,
            'port': self.mysql_config.port,
            'user': self.mysql_config.user,
            'password': self.mysql_config.password,

        }
        self.stream = BinLogStreamReader(
            connection_settings=mysql_settings,
            server_id=self.mysql_config.service_id,
            resume_stream=True,
            blocking=True,
           # only_events=[BINLOG.WRITE_ROWS_EVENT_V1, BINLOG.UPDATE_ROWS_EVENT_V1, BINLOG.DELETE_ROWS_EVENT_V1]
        )

        try:
            for evt in self.stream:
                msg = self.build_message(evt, self.trie_tree)
                if msg:
                    topic = msg['headers']['topic']
                    data_list = msg['data_list']
                    for data_row in data_list:
                        try:
                            self.producer.send(topic, value=self.encrypt_data(json.dumps(data_row,  default=self.json_serial, ensure_ascii=False)))
                        except Exception as e:
                            logging.info(e)
                            logging.info(topic)
                            logging.info(data_row)
                            raise e
        except KeyboardInterrupt:
            logging.info("Binlog stream interrupted by user.")
        finally:
            self.close()

    def close(self):
        """
        Release resources: close the binlog stream, then flush and close the producer.
        """
        if self.stream:
            self.stream.close()
            logging.info("Binlog stream closed.")
        self.producer.flush()
        self.producer.close()
        logging.info("Kafka producer closed.")


if __name__ == "__main__":
    # 讀取表配置
    BASE_PATH = "/opt/py38/data-job-stream"
    CONFIG_PATH = BASE_PATH + "/" + "config/rds1"
    table_config_data_list = TableConfigReader(CONFIG_PATH)
    table_config_data_list.read()

    complete_regex_table_dict = {}
    not_complete_regex_table_dict_list = []
    for table_config in table_config_data_list.table_config_list:
        if table_config.complete_regex:
            complete_regex_table_dict[table_config.key] = table_config
        else:
            not_complete_regex_table_dict_list.append(table_config)
    # 構建 Trie 樹
    trie = PrefixTrie(complete_regex_table_dict, not_complete_regex_table_dict_list)

    # 配置引數
    KAFKA_SERVER = "ip:19092,ip2:19093,ip3:19094"
    mysql_config = MysqlConfig("*.mysql.rds.aliyuncs.com", 3306, "username",
                               "password", 100)

    # 對稱加密的秘鑰
    hex_key = "6253c3d*************8b5deba549"
    key_bytes = binascii.unhexlify(hex_key)
    # 建立並啟動 Kafka Binlog Streamer
    streamer = KafkaBinlogStreamer(KAFKA_SERVER, mysql_config, trie, key_bytes)
    streamer.start_stream()

Dependency versions used by the script:

kafka_python==2.0.2
mysql_replication==0.45.1
pycryptodome==3.21.0

7. Supporting Scripts

  1. Generating the batch and incremental configuration files

When a new table needs to be synchronized, update the RDS connection information in this script and run it; it automatically generates the configuration files and scripts for every step of the pipeline.

If the database is test, the table is test_1, and the StarRocks table is ods_test_1, running the script produces the following files:

  • test.test_1.json: the configuration file for binlog synchronization.
  • mzt_ods_cjm_all.test_1_connect.json: the Kafka Connector configuration for the historical batch load.
  • mzt_ods_cjm_all.test_1_datax_config.json: the DataX configuration for the historical batch load.
  • mzt_ods_cjm_all.test_1_datax_shell.sh: the startup script for the DataX job.
  • mzt_ods_cjm_stream.ods_test_1-connect.json: the Kafka Connector configuration for incremental synchronization.
  • ods_test_1_create_table.sql: the SQL script that creates the table in StarRocks.
import os
import shutil
import pymysql
import re
import json

class MySQLMATEDATA():
    """
    mysql 列的後設資料資訊
    """
    def __init__(self, column_name: str, is_nullable: str, data_type: str,character_maximum_length:
                 int,column_key: int, numeric_precision: int, comment: str):
        """
        初始化
        :param column_name 列名稱
        :param is_nullable 是否可以為空
        :param data_type 欄位型別
        :param character_maximum_length 字元最大長度
        :param column_key 鍵
        :param numeric_precision 數字精度
        :param comment 註釋
        """
        self.column_name = column_name
        self.new_column_name = MySQLMATEDATA.camel_to_snake(self.column_name)
        self.is_nullable = True if is_nullable == 'NO' else False
        self.data_type = data_type
        self.character_maximum_length = character_maximum_length
        self.column_key = column_key
        self.numeric_precision = numeric_precision
        self.is_primary_key = True if column_key == 'PRI' else False
        self.comment = comment

    def transform_to_starrcocks(self) -> str:
        """
        轉換為StarRocks的列
        :return: StarRocks的列
        """
        column_str = self.new_column_name
        if self.data_type == 'timestamp':
            column_str += ' DATETIME '
        else:
            if self.character_maximum_length is not None:
                column_str += ' ' + self.data_type + '( ' + str(self.character_maximum_length * 3) + ") "
            elif self.numeric_precision is not None:
                column_str += ' ' + self.data_type + '( ' + str(self.numeric_precision + 1) + ') '
            else:
                column_str += ' ' + self.data_type + ' '
        if self.is_primary_key:
            column_str += ' NOT NULL '
        if self.comment is not None:
            column_str += ' COMMENT "' + self.comment + '"'
        return column_str + ','

    def transform_to_datax(self) -> str:
        """
        轉換列名稱為 DATAX 需要的列
        :return: 新的列
        """
        type  = str(self.data_type).lstrip().lower()
        if type == 'datetime' or type == 'timestamp':
            return 'DATE_FORMAT('+ self.column_name + ', \'%Y-%m-%d %H:%i:%s\') AS ' + self.new_column_name
        elif  type == 'date' :
            return 'DATE_FORMAT(' + self.column_name + ', \'%Y-%m-%d\') AS ' + self.new_column_name
        return self.column_name

    @staticmethod
    def camel_to_snake(name):
        """
        在大寫字母前面加上下劃線,並轉換為小寫
        :param name: 列名稱
        :return: 新的列名稱
        """
        snake_case = re.sub(r'(?<!^)(?=[A-Z])', '_', name).lower()
        return snake_case


class ConfigGeneral(object):
    """
    資料同步相關的配置檔案生成
    """
    def __init__(self, database: str, table: str, topic: str, n_table: str, s_topic: str, mysql_host: str, mysql_port: int,
                 mysql_user: str, mysql_passwd: str):
        """
           初始化
           :param database mysql資料庫名稱
           :param table mysql表名稱
           :param topic 批次同步的topic名稱
           :param n_table StarRocks表名稱
           :param s_topic 流式同步的topic的名稱
           :param mysql_host mysql的host
           :param mysql_port mysql的port
           :param mysql_user mysql的使用者名稱
           :param mysql_passwd mysql的密碼
        """
        self.database = database
        self.table = table
        self.topic = topic
        self.n_table = n_table
        self.s_topic = s_topic
        self.mysql_host = mysql_host
        self.mysql_port = mysql_port
        self.mysql_user = mysql_user
        self.mysql_passwd = mysql_passwd
        self.column_list = []
        self.jdbc_url = 'jdbc:mysql://{}:{}/{}?characterEncoding=utf-8&useSSL=false&tinyInt1isBit=false'.format(self.mysql_host, self.mysql_port, self.database)

    def search_column_name(self):
        """
        查詢mysql中表的後設資料
        :return: 後設資料列表
        """
        sql ="""SELECT 
                    COLUMN_NAME, 
                    IS_NULLABLE,
                    DATA_TYPE,
                    CHARACTER_MAXIMUM_LENGTH,
                    COLUMN_KEY, 
                    NUMERIC_PRECISION, 
                    COLUMN_COMMENT
                FROM information_schema.`COLUMNS` 
                WHERE table_schema='{}' AND table_name = '{}'  
                ORDER BY ORDINAL_POSITION ASC""".format(self.database, self.table)
        table_name_list = []
        conn = pymysql.connect(host=self.mysql_host,
                               port=self.mysql_port,
                               user=self.mysql_user,
                               passwd=self.mysql_passwd,
                               db=self.database,
                               charset='utf8',
                               connect_timeout=200,
                               autocommit=True,
                               read_timeout=2000
                               )
        with conn.cursor() as cursor:
            cursor.execute(query=sql)
            while 1:
                res = cursor.fetchone()
                if res is None:
                    break
                table_name_list.append(MySQLMATEDATA(res[0], res[1], res[2], res[3], res[4], res[5], res[6]))
        self.column_list = table_name_list
        conn.close()

    def create_all_datax_config(self) -> str:
        """
        生成DATAX的配置檔案
        :return: 配置檔案地址
        """
        datax_config ={
            "job":{
                "setting" :{
                    "speed":{
                        "channel":1
                    }
                },
                "content": [
                    {
                        "reader": {
                            "name": "mysqlreader",
                            "parameter": {
                                "username": "",
                                "password": "",
                                "column": [],
                                "connection": [
                                    {
                                        "table": [],
                                        "jdbcUrl": []
                                    }
                                ],
                            }
                        },
                        "writer": {
                            "name": "kafkawriter",
                            "parameter": {
                                "bootstrapServers": "IP1:19092,IP2:19093,IP3:19094",
                                "topic": "",
                                "ack": "all",
                                "batchSize": 1000,
                                "retries": 3,
                                "keySerializer": "org.apache.kafka.common.serialization.StringSerializer",
                                "valueSerializer": "org.apache.kafka.common.serialization.StringSerializer",
                                "fieldDelimiter": ",",
                                "writeType": "json",
                                "topicNumPartition": 1,
                                "topicReplicationFactor": 1,
                                "encryptionKey": "6253c3************a549",
                                "column": []
                            }
                        }
                    }
                ]
            }
        }
        new_column_name_list =[]
        new_column_name_list1 =[]
        for column in self.column_list:
            new_column_name = column.new_column_name
            new_column_name_list.append(column.transform_to_datax())
            new_column_name_list1.append(new_column_name)
        datax_config['job']['content'][0]['reader']['parameter']['column'] = new_column_name_list
        datax_config['job']['content'][0]['writer']['parameter']['column'] = new_column_name_list1
        datax_config['job']['content'][0]['reader']['parameter']['connection'][0]['table'] = [self.table]
        datax_config['job']['content'][0]['reader']['parameter']['connection'][0]['jdbcUrl'] = [self.jdbc_url]
        datax_config['job']['content'][0]['writer']['parameter']['topic'] = self.topic
        with open('config/' + self.topic + '_datax_config.json', 'w', encoding='utf-8') as f:
            f.write(json.dumps(datax_config, ensure_ascii=False, indent=2))
        return self.topic + '_datax_config.json'

    def create_all_datax_shell(self, config_path: str):
        """
        生成DATAX的執行指令碼檔案
        :param config_path: 配置檔案的路徑
        :return: 指令碼檔案路徑
        """
        text = """python3 /opt/datax-k/bin/datax.py {} """.format(config_path)
        with open('config/' + self.topic + '_datax_shell.sh', 'w', encoding='utf-8') as f:
            f.write(text)
        return self.topic + '_datax_shell.sh'

    def create_all_connect(self) -> str:
        """
        生成全量同步的kafka-connect的配置檔案
        :return:  配置檔案路徑
        """
        connect_config = {
            "name": "",
            "config": {
                "connector.class": "com.starrocks.connector.kafka.StarRocksSinkConnector",
                "topics": "",
                "key.converter": "org.apache.kafka.connect.storage.StringConverter",
                "value.converter": "org.apache.kafka.connect.storage.StringConverter",
                "key.converter.schemas.enable": "true",
                "value.converter.schemas.enable": "false",
                "starrocks.http.url": "IP1:8050,IP2:8050,IP3:8050",
                "starrocks.topic2table.map": "",
                "starrocks.username": "",
                "starrocks.password": "",
                "starrocks.database.name": "ods_cjm",
                "sink.properties.strip_outer_array": "true",
                "sink.properties.columns": "",
                "sink.properties.jsonpaths": "",
                "transforms": "decrypt",
                "transforms.decrypt.type": "com.starrocks.connector.kafka.transforms.DecryptJsonTransformation",
                "transforms.decrypt.secret.key": "6253****************a549"
            }
        }
        connect_config['name'] = self.topic + "-connect"
        connect_config['config']['topics'] = self.topic
        connect_config['config']['starrocks.topic2table.map'] = self.topic + ":" + self.n_table
        connect_config['config']['sink.properties.columns'] = ",".join(list(map(lambda x: x.new_column_name ,self.column_list)))
        connect_config['config']['sink.properties.jsonpaths'] = '['  + (",".join(list(map(lambda x : ("\"$." + x.column_name  + "\"") ,self.column_list)))) + "]"
        with open('config/' + self.topic + '_connect.json', 'w', encoding='utf-8') as f:
            f.write(json.dumps(connect_config, ensure_ascii=False, indent=2))
        return self.topic + '_connect.json'

    def search_table_create_sql(self):
        """
        生成StarRocks的建表語句
        :return: 檔案路徑
        """
        table_sql_list = " CREATE TABLE " + self.n_table +" (" + "\n"
        primary_key = ''
        for column in self.column_list:
            table_sql_list  = table_sql_list + column.transform_to_starrcocks() +"\n"
            if column.is_primary_key:
                primary_key = column.column_name
        table_sql_list = table_sql_list +")\n"
        table_sql_list = table_sql_list +"PRIMARY KEY ("+ primary_key+")\n"
        table_sql_list = table_sql_list +"DISTRIBUTED BY HASH ("+ primary_key+");\n"
        with open('config/' +  self.n_table + '_create_table.sql', 'w', encoding='utf-8') as f:
            f.write(table_sql_list)
        return self.n_table + '_create_table.sql'

    def create_stream_config(self):
        """
        生成mysql binlog同步的配置檔案
        :return: 檔案路徑
        """
        config = {
            "key": "",
            "topic": "",
            "need_table": False,
            "complete_regex": True,
            "regex": "",
            "column_mapping": {}
        }
        column_mapping = {}
        for column in self.column_list:
            column_mapping[column.column_name] = column.new_column_name
        config['column_mapping'] = column_mapping
        config['key'] = self.database + '.' + self.table
        config['topic'] = self.s_topic
        with open('config/' + self.database + '.' + self.table + '.json', 'w', encoding='utf-8') as f:
            f.write(json.dumps(config, ensure_ascii=False, indent=2))

    def create_stream_connect(self):
        """
        生成流式同步的kafka-connector的配置檔案
        :return: 檔案路徑
        """
        connect_config = {
            "name": "",
            "config": {
                "connector.class": "com.starrocks.connector.kafka.StarRocksSinkConnector",
                "topics": "",
                "key.converter": "org.apache.kafka.connect.storage.StringConverter",
                "value.converter": "org.apache.kafka.connect.storage.StringConverter",
                "key.converter.schemas.enable": "true",
                "value.converter.schemas.enable": "false",
                "starrocks.http.url": "IP1:8050,IP2:8050,IP3:8050",
                "starrocks.topic2table.map": "",
                "starrocks.username": "",
                "starrocks.password": "",
                "starrocks.database.name": "ods_cjm",
                "sink.properties.strip_outer_array": "true",
                "sink.properties.columns": "",
                "sink.properties.jsonpaths": "",
                "transforms": "decrypt",
                "transforms.decrypt.type": "com.starrocks.connector.kafka.transforms.DecryptJsonTransformation",
                "transforms.decrypt.secret.key": "6253********549"
            }
        }
        connect_config['name'] = self.s_topic + "-connect"
        connect_config['config']['topics'] = self.s_topic
        connect_config['config']['starrocks.topic2table.map'] = self.s_topic + ":" + self.n_table
        connect_config["config"]["sink.properties.columns"] = ",".join(list(map(lambda x : x.new_column_name ,self.column_list))) +",__op"
        connect_config["config"]["sink.properties.jsonpaths"] = '['  + (",".join(list(map(lambda x : ("\"$." + x.new_column_name  + "\"") ,self.column_list)))) + ",\"$.__op\"]"
        with open('config/' + self.s_topic + '-connect.json', 'w', encoding='utf-8') as f:
            f.write(json.dumps(connect_config, ensure_ascii=False, indent=2))
        return self.s_topic + '_connect.json'


def delete_all_files_in_folder(folder_path):
    """
    刪除某個資料夾下面的所有檔案
    :param folder_path: 資料夾路徑
    :return: NONE
    """
    # 檢查資料夾是否存在
    if not os.path.exists(folder_path):
        print("資料夾不存在")
        return

    # 遍歷資料夾中的所有檔案和子資料夾
    for filename in os.listdir(folder_path):
        file_path = os.path.join(folder_path, filename)

        try:
            # 如果是檔案,則刪除
            if os.path.isfile(file_path) or os.path.islink(file_path):
                os.remove(file_path)
                print(f"刪除檔案: {file_path}")
            # 如果是子資料夾,則刪除子資料夾及其內容
            elif os.path.isdir(file_path):
                shutil.rmtree(file_path)
                print(f"刪除資料夾及其內容: {file_path}")
        except Exception as e:
            print(f"刪除時出錯: {file_path},錯誤資訊: {e}")

if __name__ == '__main__':
    delete_all_files_in_folder("config")
    DATABASE = 'hydra_marketing'
    TABLE ='t_data_overview'
    TOPIC ='mzt_ods_cjm_all.' + TABLE
    NEW_TABLE = 'ods_marketing_t_data_overview'
    STREAM_TOPIC = "mzt_ods_cjm_stream." + NEW_TABLE

    MYSQL_HOST = ""
    MYSQL_PORT = 3306
    USER_NAME = ""
    PASSWD=""

    config = ConfigGeneral(DATABASE,TABLE, TOPIC, NEW_TABLE, STREAM_TOPIC, MYSQL_HOST, MYSQL_PORT, USER_NAME, PASSWD)
    config.search_column_name()
    config.search_table_create_sql()
    config_path = config.create_all_datax_config()
    config.create_all_datax_shell(config_path)
    config.create_all_connect()
    config.create_stream_config()
    config.create_stream_connect()
  2. Kafka Connector operations script

This script wraps the main Kafka Connect REST API endpoints, making it easy to manage connectors and check the status of their tasks.

import json
import requests
from typing import List, Mapping


class KafkaConnectAll:
    """
    KAFKA Connect 相關操作
    """
    def __init__(self, base_url: str):
        """
        初始化
        :param base_url: kafka-connector的地址
        """
        self.base_url = base_url

    def query_all(self) -> List[str]:
        """
        查詢全部的connector
        :return: connector名稱列表
        """
        url = self.base_url + '/connectors'
        data_json = requests.get(url).json()
        for data in data_json:
            print(data)
        return data_json

    def delete_connector(self, connector_name: str):
        """
        刪除指定的connector
        :param connector_name: connector名稱
        :return: None
        """
        url = self.base_url + '/connectors/' + connector_name
        requests.delete(url)

    def query_status(self, connector_name: str) -> Mapping[str, str]:
        """
        查詢指定connector的狀態
        :param connector_name: connector名稱
        :return: 狀態資訊
        """
        url = self.base_url + '/connectors/' + connector_name + '/status'
        result = requests.get(url)
        connect_state = result.json()['connector']['state']
        task_states = []
        for task in result.json()['tasks']:
            task_states.append({
                'id': task['id'],
                'state': task['state'],
            })

        print("connector狀態", connect_state)
        print("tasks狀態", task_states)
        return {
            "connector_status": connect_state,
            "task_states": task_states
        }

    def create_connector(self, connector_config: json):
        """
        建立connector
        :param connector_config: 配置檔案
        :return: NONE
        """
        url = self.base_url + '/connectors'
        headers = {"Content-Type": "application/json"}
        try:
            # 傳送 POST 請求建立 Connector
            response = requests.post(url, headers=headers, data=json.dumps(connector_config))
            if response.status_code == 201:
                print("Connector 建立成功")
                print(response.json())
            elif response.status_code == 409:
                print("Connector 已經存在")
            else:
                print(f"Connector 建立失敗,狀態碼: {response.status_code}")
                print(response.json())
        except Exception as e:
            print(f"請求失敗: {e}")

    def query_connector(self, connector_name: str) -> json:
        """
        查詢指定的connector
        :param connector_name: connector名稱
        :return: 內容
        """
        url = self.base_url + '/connectors/' + connector_name
        result = requests.get(url).json()
        print(json.dumps(result, indent=4))
        return result

    def query_connector_config(self, connector_name: str) -> json:
        """
        查詢指定connector的配置檔案
        :param connector_name: connector名稱
        :return: 配置
        """
        url = self.base_url + '/connectors/' + connector_name + "/config"
        result = requests.get(url).json()
        print(json.dumps(result, indent=4))
        return result

    def update_connector_config(self, connector_name: str, connector_config: json):
        """
        修改指定connector的配置
        :param connector_name: 指定connector名稱
        :param connector_config: 配置
        :return: NONE
        """
        url = self.base_url + '/connectors/' + connector_name + "/config"
        headers = {"Content-Type": "application/json"}
        try:
            # 傳送 POST 請求建立 Connector
            response = requests.put(url, headers=headers, data=json.dumps(connector_config))
            if response.status_code == 201:
                print("Connector 更新成功")
                print(response.json())
            elif response.status_code == 409:
                print("Connector 已經存在")
            else:
                print(f"Connector 更新失敗,狀態碼: {response.status_code}")
                print(response.json())
        except Exception as e:
            print(f"請求失敗: {e}")

    def query_connectors_task(self, connector_name: str) -> List[int]:
        """
        查詢指定connector的task列表
        :param connector_name: 指定connector名稱
        :return: taskId列表
        """
        url = self.base_url + '/connectors/' + connector_name + '/tasks'
        result = requests.get(url).json()
        print(json.dumps(result, indent=4))
        task_id = []
        for task in result:
            task_id.append(task['id']['task'])
        return task_id

    def query_connectors_tasks_status(self, connector_name: str, task_id: int) -> json:
        """
        查詢task的狀態
        :param connector_name: 指定connector的名稱
        :param task_id: task id
        :return: 結果
        """
        url = self.base_url + '/connectors/' + connector_name + '/tasks/' + str(task_id) + '/status'
        result = requests.get(url).json()
        print(json.dumps(result, indent=4))
        return result

    def pause_connector(self, connector_name: str):
        """
        暫停connector
        :param connector_name: connector 名稱
        :return: NONE
        """
        url = self.base_url + '/connectors/' + connector_name + '/pause'
        requests.put(url).json()

    def resume_connector(self, connector_name: str):
        """
        恢復
        :param connector_name: connector名稱
        :return: NONE
        """
        url = self.base_url + '/connectors/' + connector_name + '/resume'
        requests.put(url).json()

    def restart_connector(self, connector_name: str):
        """
        重啟
        :param connector_name: connector名稱
        :return: NONE
        """
        url = self.base_url + '/connectors/' + connector_name + '/restart'
        requests.post(url).json()

    def restart_connector_task(self, connector_name: str, task_id: int):
        """
        重啟task
        :param connector_name: connector名稱
        :param task_id: task id
        :return: NONE
        """
        url = self.base_url + '/connectors/' + connector_name + '/tasks/' + str(task_id) + '/restart'
        requests.post(url).json()


if __name__ == '__main__':
    base_url = 'http://IP:8083'
    kafka_connector_all = KafkaConnectAll(base_url)
    kafka_connector_all.query_connectors_tasks_status('user_sys_org-connect', 0)
  3. StarRocks table freshness check script

This script reports the timestamp of the most recent row in each StarRocks table and can be used to verify that synchronization is healthy.

import pymysql
from typing import Tuple
import json


class StarRocksTableCheck:
    """
    starrocks表資料最新日期檢測
    """
    def __init__(self, host: str, port: int, user: str, password: str, database: str):
        """
        初始化
        :param host: 資料庫host
        :param port: 資料庫埠
        :param user: 資料來源使用者名稱稱
        :param password: 資料庫使用者密碼
        :param database: 資料庫名稱
        """
        self.host = host
        self.port = port
        self.user = user
        self.password = password
        self.database = database
        # 資料庫連線
        self.conn = None

    def connect(self) -> None:
        """
        建立連線
        :return: None
        """
        self.conn = pymysql.connect(host=self.host,
                               port=self.port,
                               user=self.user,
                               passwd=self.password,
                               db=self.database,
                               charset='utf8',
                               connect_timeout=200,
                               autocommit=True,
                               read_timeout=2000
                               )

    def close(self) -> None:
        """
        關閉資料庫連線
        :return: NONE
        """
        self.conn.close()

    def query(self, table: str, time_filed: str) -> Tuple[str, str]:
        """
        Query the latest timestamp in a table.
        :param table: table name
        :param time_filed: name of the timestamp column
        :return: (table name, latest timestamp)
        """
        sql = "SELECT MAX({}) AS MAX_TIME FROM {}".format(time_filed, table)
        with self.conn.cursor() as cursor:
            cursor.execute(query=sql)
            res = cursor.fetchone()
            if res is None:
                return table, ''
            print('{} {}'.format(table, res[0]))
            return table, str(res[0])


class ReaderFile:
    """
    讀取檔案
    """
    def __init__(self,file_path: str):
        """
        初始化
        :param file_path: 檔案路徑
        """
        self.file_path = file_path

    def read_file_content(self) -> str:
        """
        讀取檔案內容
        :return: 檔案內容
        """
        with open(self.file_path, 'r', encoding='utf-8') as file:
            content = file.read()
        return content


if __name__ == '__main__':
    starRocksTableCheck = StarRocksTableCheck("ip", 9030, "username", "password", "ods_cjm")
    starRocksTableCheck.connect()
    try:
        content = ReaderFile(r'E:\pycode\python_scripts_new\超級碼\stream\config\starrocks_table.json').read_file_content()
        for table in json.loads(content):
            table_name = table['TABLE_NAME']
            column_name = table['COLUMN_NAME']
            starRocksTableCheck.query(table_name, column_name)
    finally:
        starRocksTableCheck.close()

The table configuration file lists the tables to check and the corresponding timestamp columns.

[
  {
    "TABLE_NAME": "ods_codemanager_codeapply",
    "COLUMN_NAME": "create_date"
  },
  {
    "TABLE_NAME": "ods_t_integral_account",
    "COLUMN_NAME": "update_time"
  }
]
  4. Incremental sync watchdog script

This script checks whether the synchronization scripts are running and starts them if they are not. Scheduled with crontab, it restarts a process that terminated unexpectedly, and message alerting can be added on top.

#!/bin/bash

# Python scripts to monitor
PROCESS_LIST=("mzt-transform-stream-kafka-rds1.py" "mzt-transform-stream-kafka-rds2.py")

# Start commands, one per script
START_CMD_LIST=(
    "nohup python3 mzt-transform-stream-kafka-rds1.py > /opt/py38/data-job-stream/nohup1.out 2>&1 &"
    "nohup python3 mzt-transform-stream-kafka-rds2.py > /opt/py38/data-job-stream/nohup2.out 2>&1 &"
)

# Working directory
BASE_PATH="/opt/py38/data-job-stream"
# Virtual environment activation script
VENV_PATH="/opt/py38/bin/activate"

# Check whether a process is running
check_process() {
    local process_name="$1"
    if pgrep -f "$process_name" > /dev/null; then
        echo "$(date) - process '$process_name' is running."
        return 0
    else
        echo "$(date) - process '$process_name' is not running."
        return 1
    fi
}

# Start a process
start_process() {
    local start_cmd="$1"
    echo "$(date) - starting command: $start_cmd"
    # change into the working directory
    cd "$BASE_PATH" || exit 1
    # activate the virtual environment
    source "$VENV_PATH"
    # run the start command
    eval "$start_cmd"
    if [ $? -eq 0 ]; then
        echo "$(date) - command started successfully."
    else
        echo "$(date) - command failed to start!"
    fi
}

# Main loop
for index in "${!PROCESS_LIST[@]}"; do
    process_name="${PROCESS_LIST[$index]}"
    start_cmd="${START_CMD_LIST[$index]}"

    echo "----- checking process: $process_name -----"
    if ! check_process "$process_name"; then
        start_process "$start_cmd"
    fi
done

  5. Example incremental sync configuration file

This configuration file defines the incremental synchronization rules for one specific table.

{
  "key": "app.device",
  "topic": "mzt_ods_cjm_stream.ods_device",
  "need_table": false,
  "complete_regex": true,
  "regex": "",
  "column_mapping": {
    "Id": "id",
    "CompanyName": "company_name",
    "CompanyId": "company_id",
    "SecretKey": "secret_key",
    "Brand": "brand",
    "ModelType": "model_type",
    "Enable": "enable",
    "CreateTime": "create_time",
    "UpdateTime": "update_time"
  }
}
KEY: Description
key: Unique id of the table, used to match binlog events; formed as database.table, or as a prefix for fuzzy matching
topic: Kafka topic the data is written to
need_table: Whether to add the source table name as a column; if true, a table_name column is added holding the name of the RDS table, which solves the sharding problem
complete_regex: Whether to match the table name exactly; if true, the table name is matched exactly against key; if false, it is matched against regex
regex: Regular expression used to match binlog table names
column_mapping: Column mapping for the table; keys are RDS column names, values are StarRocks column names
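For a sharded table, complete_regex would be false and the regex would be used for matching instead. A hypothetical example (table names and columns are made up), shown as a Python dict with the same shape as the JSON file above:

# Hypothetical config for a table sharded by month (t_order_202401, t_order_202402, ...).
# Since complete_regex is False, "regex" is matched against "database.table".
sharded_table_config = {
    "key": "app.t_order",
    "topic": "mzt_ods_cjm_stream.ods_order",
    "need_table": True,                    # keep the shard name as a table_name column
    "complete_regex": False,
    "regex": r"app\.t_order_\d{6}",
    "column_mapping": {"Id": "id", "Amount": "amount", "CreateTime": "create_time"},
}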
  6. Example Kafka Connector configuration
{
    "name": "mzt_ods_cjm_stream.ods_device-connect",
    "config": {
        "connector.class": "com.starrocks.connector.kafka.StarRocksSinkConnector",
        "topics": "mzt_ods_cjm_stream.ods_device",
        "key.converter": "org.apache.kafka.connect.storage.StringConverter",
        "value.converter": "org.apache.kafka.connect.storage.StringConverter",
        "key.converter.schemas.enable": "true",
        "value.converter.schemas.enable": "false",
        "starrocks.http.url": "IP1:8050,IP2:8050,IP3:8050",
        "starrocks.topic2table.map": "mzt_ods_cjm_stream.ods_device:ods_device",
        "starrocks.username": "",
        "starrocks.password": "",
        "starrocks.database.name": "ods_cjm",
        "sink.properties.strip_outer_array": "true",
        "sink.properties.columns": "id,company_name,company_id,secret_key,brand,model_type,enable,create_time,update_time,__op",
        "sink.properties.jsonpaths": "[\"$.id\",\"$.company_name\",\"$.company_id\",\"$.secret_key\",\"$.brand\",\"$.model_type\",\"$.enable\",\"$.create_time\",\"$.update_time\",\"$.__op\"]",
        "transforms": "decrypt",
        "transforms.decrypt.type": "com.starrocks.connector.kafka.transforms.DecryptJsonTransformation",
        "transforms.decrypt.secret.key": "6253******a549"
    }
}
KEY: Description
name: Unique name of the Kafka connector
config.connector.class: Connector class; keep the default
config.topics: Topics to consume, comma separated
config.key.converter: Key converter; keep the default
config.value.converter: Value converter; keep the default
config.key.converter.schemas.enable: Whether to convert the key schema
config.value.converter.schemas.enable: Whether to convert the value schema
config.starrocks.http.url: StarRocks FE addresses used for Stream Load
config.starrocks.topic2table.map: Topic-to-table mapping in the form topic:table, comma separated for multiple entries
config.starrocks.username: StarRocks username
config.starrocks.password: StarRocks password
config.starrocks.database.name: StarRocks database name
config.sink.properties.strip_outer_array: Whether to strip the outer JSON array
config.sink.properties.columns: List of columns
config.sink.properties.jsonpaths: List of JSON paths, matching the column list one to one
config.transforms: Record transformations to apply
config.transforms.decrypt.type: Transformation implementation class
config.transforms.decrypt.secret.key: Decryption key

8. Notes

  1. python-mysql-replication: a pure-Python library for consuming the MySQL binlog.
  2. starrocks-connector-for-kafka: the StarRocks connector for Kafka Connect.
  3. DataX: a batch data synchronization tool.
  4. kafka-console-ui: a Kafka web console.
  5. StarRocks-kafka-Connector: loading data into StarRocks through a Kafka connector.
  6. Implementing insert, update, and delete with Stream Load.
  7. Kafka Connect REST API reference:
Method and path: Description
GET /connectors: List the active connectors
POST /connectors: Create a new connector; the request body is a JSON object with a string "name" field and a "config" object holding the connector configuration parameters
GET /connectors/{name}: Get information about a specific connector
GET /connectors/{name}/config: Get the configuration parameters of a specific connector
PUT /connectors/{name}/config: Update the configuration parameters of a specific connector
GET /connectors/{name}/status: Get the connector's current status: whether it is running, failed or paused, which worker it is assigned to, error information on failure, and the state of all its tasks
GET /connectors/{name}/tasks: List the tasks currently running for the connector
GET /connectors/{name}/tasks/{taskid}/status: Get the current status of a task: whether it is running, failed or paused, which worker it is assigned to, and error information on failure
PUT /connectors/{name}/pause: Pause the connector and its tasks; message processing stops until the connector is resumed
PUT /connectors/{name}/resume: Resume a paused connector (no-op if the connector is not paused)
POST /connectors/{name}/restart: Restart the connector (typically because it failed)
POST /connectors/{name}/tasks/{taskId}/restart: Restart an individual task (typically because it failed)
DELETE /connectors/{name}: Delete the connector, stop all its tasks, and delete its configuration
  8. Encryption algorithm reference

  • Symmetric encryption (AES-GCM, ChaCha20): fast and hardware accelerated, but the key must be managed securely; suited to bulk data, file, and stream encryption.
  • Asymmetric encryption (RSA, ECDSA): no shared key and high security, but slow and unsuitable for large volumes; suited to key exchange and digital signatures.
  • Stream ciphers (ChaCha20): fast with low latency, but not suited to file encryption; suited to real-time communication and video streaming.
  • Hash algorithms (SHA-256, BLAKE3): one-way and fast, but cannot be used for encryption or decryption; suited to data integrity checks and digital signatures.

9. Pitfalls Encountered

  1. pip failed inside the Python virtual environment with an error that the SSL module was missing.

Solution: install packages from an HTTP mirror:

pip3 install pymysql -i http://mirrors.aliyun.com/pypi/simple/ --trusted-host mirrors.aliyun.com
  2. When DataX synchronizes a timestamp column, the value arrives in Kafka as a number and cannot be written into a StarRocks datetime column.

Solution: in the DataX column mapping, use DATE_FORMAT to convert the value to a string.

"column": [
      "id",
      "name",
      "DATE_FORMAT(timestamp_column, '%Y-%m-%d %H:%i:%s') AS timestamp_column"
],
  3. AES-GCM interoperability between the Python and Java sides.

Solution: the Python side produces a (nonce, ciphertext, tag) triple, which the Java side fails to decrypt as-is; concatenating ciphertext and tag in Python lets Java verify and decrypt them in a single call.

def encrypt_data(self, data):
    """使用 AES-GCM 加密資料"""
    cipher = AES.new(self.aes_key, AES.MODE_GCM)
    ciphertext, tag = cipher.encrypt_and_digest(data.encode('utf-8'))
    return {
        'nonce': cipher.nonce.hex(),
        'ciphertext': ciphertext.hex()  + tag.hex()
    }
public String decrypt(String encryptedData) throws Exception {
    JSONObject jsonMessage = JSONObject.parseObject(encryptedData);
    // parse the ciphertext (with the appended authentication tag) and the IV
    byte[] ciphertext = hexStringToByteArray(jsonMessage.getString("ciphertext"));
    byte[] nonce = hexStringToByteArray(jsonMessage.getString("nonce"));
    return decryptData(ciphertext, nonce);
}
  4. starrocks-connector-for-kafka failed to parse BigDecimal fields and threw an error.

Solution: when decrypting inside starrocks-connector-for-kafka, configure FastJSON to convert BigDecimal values to Double.

 public R apply(R record) {
        if (record.value() == null) {
            return record;
        }
        String value = (String) record.value();
        try {
            String newValue = aesEncryption.decrypt(value);
            // convert BigDecimal to Double
            JSONObject jsonObject = JSON.parseObject(newValue, JSONReader.Feature.UseBigDecimalForDoubles);
            return record.newRecord(record.topic(), record.kafkaPartition(), record.keySchema(), record.key(), null, jsonObject, record.timestamp());
        } catch (Exception e) {
            return record;
        }
    }
  5. The custom-built starrocks-connector-for-kafka could not be loaded by Kafka Connect.

Solution: it must be built with Java 8; a build done with Java 21 could not be loaded.

  6. The latest version of python-mysql-replication could not resolve the table columns when reading the binlog.

Solution: do not use the latest version; use 0.45.1 instead (see issue#612).

  7. python-mysql-replication does not run on Python 3.6.

Solution: python-mysql-replication requires at least Python 3.7; this project uses 3.8.4.

  8. Python's JSON serialization does not support bytes or date/datetime values.

Solution: customize the JSON conversion.

def convert_bytes_keys_to_str(data):
    if isinstance(data, dict):
        # convert bytes keys to str and recurse into the values
        return {
            (k.decode('utf-8') if isinstance(k, bytes) else k): KafkaBinlogStreamer.convert_bytes_keys_to_str(v)
            for k, v in data.items()
        }
    elif isinstance(data, list):
        # recurse into every element of the list
        return [KafkaBinlogStreamer.convert_bytes_keys_to_str(item) for item in data]
    elif isinstance(data, bytes):
        # decode bytes values
        return data.decode('utf-8')
    elif isinstance(data, datetime):
        # format datetime values as strings
        return data.strftime('%Y-%m-%d %H:%M:%S')
    else:
        return data
  9. Setting only_events on python-mysql-replication resulted in no binlog events being consumed at all.

Solution: after testing, the parameter was simply dropped. (A likely cause, though not verified here, is that only_events expects event classes such as WriteRowsEvent rather than the numeric BINLOG constants used in the commented-out configuration above.)

  10. The script's memory usage grew too large while running and it was killed by the system.

Solution: trigger garbage collection manually to release memory.

gc.collect()

10. Current Synchronization Status

Metric: Value
RDS instances: 2
Logical tables synchronized: 56
Physical tables synchronized: 5274
Data latency: under 1 minute
