離線資料同步變遷

一条路上的咸鱼發表於2024-12-05

第一代-基於Hadoop體系的離線資料同步

一、背景

隨著業務的發展,系統進行了微服務的差分,導致資料越來越分散,很難進行一個完整的生命週期的資料查詢,對於某些業務的需求支援變得越來越難,越來越複雜,也越來越難以進行職責劃分。對著業務的發展,資料量越來越大之後,為了良好的業務支援,進行了分庫分表,分庫分表規則五花八門,一旦脫離了業務邏輯,很難確定某一條資料在哪個庫哪個表。

基於這樣的問題和情況,為了滿足業務需求,很自然的就想到了使用大資料服務,將業務資料歸集到一起,建立完整的資料倉儲,便於資料的查詢。

二、資料同步架構

為了追求簡單和通用,由於自身的認識現在,選擇了最標準的大資料架構,即基於Hadoop的大資料體現。整個叢集採用三節點,透過CDH進行叢集的部署和維護。

整個資料鏈路為:

透過Azkaban呼叫Spark應用,將資料從RDS同步到Hive,運營平臺和報表系統採用Presto加速訪問Hive的資料。

三、資料同步詳細過程

資料同步採用Spark任務來進行,將任務打包之後,上傳到Azkaban排程平臺,使用Azkaban進行定時排程,完成T+1級別的資料同步工作。

資料同步程式碼示例:

object MarketMysqlToHiveEtl extends SparkHivePartitionOverwriteApplication{


  /**
   * 刪除已存在的分割槽
   *
   * @param spark SparkSessions例項
   * @param date 日期
   * @param properties 資料庫配置
   */
  def delete_partition(spark: SparkSession, properties:Properties, date: String):Unit={
    val odsDatabaseName = properties.getProperty("hive.datasource.ods")
    DropPartitionTools
     .dropPartitionIfExists(spark,odsDatabaseName,"ods_t_money_record","ds",date)
    DropPartitionTools
     .dropPartitionIfExists(spark,odsDatabaseName,"ods_t_account","ds",date)
  }



  /**
   * 抽取資料
   * @param spark SparkSession例項
   * @param properties 資料庫配置
   * @param date 日期
   */
  def loadData(spark: SparkSession, properties:Properties, date: String): Unit ={
    // 刪除歷史資料,解決重複同步問題
    delete_partition(spark,properties,date)

    // 獲取資料來源配置
    val odsDatabaseName = properties.get("hive.datasource.ods")
    val dataSource = DataSourceUtils.getDataSourceProperties(FinalCode.MARKET_MYSQL_FILENAME,properties)

    var sql = s"select id,account_id,type,original_id,original_code,money,reason,user_type,user_id,organization_id," +
    s"create_time,update_time,detail,deleted,parent_id,counts,'${date}' AS ds from TABLENAME where date(update_time) ='${date}'"

    // 同步資料
    MysqlToHiveTools.readFromMysqlIncrement(spark,dataSource,sql.replace("TABLENAME","t_money_record"),
                                            s"${odsDatabaseName}.ods_t_money_record",SaveMode.Append,"ds")


    sql = s"select id,code,customer_code,name,mobile,type,organization_id,organization_name,create_time,update_time,deleted,status,customer_name," +
    s"customer_id,channel_type,nike_name,version,register_Time,'${date}' AS ds from TABLENAME where date(update_time) ='${date}'"
    MysqlToHiveTools.readFromMysqlIncrement(spark,dataSource,sql.replace("TABLENAME","t_account"),
                                            s"${odsDatabaseName}.ods_t_account",SaveMode.Append,"ds")
  }



  /**
   * 資料etl
   * @param spark SparkSession例項
   * @param SparkSession 資料庫配置
   */
  def etl(spark: SparkSession, properties:Properties): Unit = {
    val sparkConf = spark.sparkContext.getConf
    // 獲取同步的日期
    var lastDate = sparkConf.get("spark.etl.last.day", DateUtils.getLastDayString)
    val dateList = new  ListBuffer[String]()
    if(lastDate.isEmpty){
      // 未配置,設定為前一天
      lastDate = DateUtils.getLastDayString
    }
    if(lastDate.contains("~")){
      // 如果是時間段,獲取時間段中的每一天,解析為時間list
      val dateArray = lastDate.split("~")
      DateUtils.findBetweenDates(dateArray(0), dateArray(1)).foreach(it => dateList.append(it))
    }else if(lastDate.contains(",")){
      // 如果是使用,分隔的多個日期,解析為時間list
      lastDate.split(",").foreach(it => dateList.append(it))
    }else{
      // 新增進時間列表
      dateList.append(lastDate)
    }
    // 迴圈同步每天的資料
    dateList.foreach(it =>  loadData(spark, properties, it))
  }


  def main(args: Array[String]): Unit = {
    job() {
      val sparkAndProperties = SparkUtils.get()
      val spark = sparkAndProperties.spark
      val properties = sparkAndProperties.properties
      // 排程任務
      etl(spark,properties)
    }
  }
}

刪除Partition的程式碼示例:

object DropPartitionTools {


  /**
   * 刪除指定的Partition
   * @param SparkSession例項
   * @param database資料庫名稱
   * @param table表名稱
   * @param partitionKey 分割槽欄位的名稱
   * @param partitionValue 具體的分割槽值
   */
  def dropPartitionIfExists(spark: SparkSession, database: String, table: String, partitionKey: String, partitionValue:String): Unit ={

     val df = spark.sql(
       s"""
         | show tables in ${database} like '${table}'
         |""".stripMargin)

    if(df.count() > 0 ){
      // 表存在,刪除分割槽
      spark.sql(
        s"""
           |ALTER TABLE  ${database}.${table} DROP  IF EXISTS  PARTITION (${partitionKey}='${partitionValue}')
           |""".stripMargin)
    }
  }


  /**
   * 刪除Partition
   * @param SparkSession例項
   * @param database資料庫名稱
   * @param table表名稱
   * @param partitionKey 分割槽欄位的名稱
   */
  def dropHistoryPartitionIfExists(spark: SparkSession, database: String, table: String, partitionKey: String): Unit ={

    val df = spark.sql(
      s"""
         | show tables in ${database} like '${table}'
         |""".stripMargin)

    if(df.count() > 0 ){
      // 表存在,刪除歷史分割槽,獲取8天前的日期
      val sevenDay = DateUtils.getSomeLastDayString(8);
      spark.sql(
        s"""
           |ALTER TABLE  ${database}.${table} DROP  IF EXISTS  PARTITION (${partitionKey} ='${sevenDay}')
           |""".stripMargin)
    }
  }

}

從RDS同步資料到HIVE的程式碼示例:

object MysqlToHiveTools {


  /**
   * 從mysql抽取資料到hive -- 全量
   * @param spark spark例項
   * @param dataSource 資料庫配置資訊
   * @param tableName 抽取的資料庫表名
   * @param destTableName 目標表名
   * @param mode 抽取的模式
   */
  def mysqlToHiveTotal(spark: SparkSession, dataSource: JSONObject,tableName: String, destTableName:String,mode: SaveMode, partition: String): Unit = {
     val sql = "(select * from " + tableName + ") as t"
     mysqlToHive(spark, dataSource, sql, destTableName, mode, partition)
  }


  /**
   * 從mysql抽取資料到hive -- 增量量
   * @param spark spark例項
   * @param dataSource 資料庫配置資訊
   * @param sql 抽取資料的SQL
   * @param destTableName 目標表名
   * @param mode 抽取的模式
   */
  def readFromMysqlIncrement(spark: SparkSession, dataSource: JSONObject,sql: String, destTableName:String,mode: SaveMode, partition: String): Unit = {
    mysqlToHive(spark, dataSource, sql, destTableName, mode, partition)
  }


  /**
   * 真正的抽取資料
   * @param spark spark例項
   * @param properties 資料庫配置資訊
   * @param sql 抽取資料的SQL
   * @param destTableName 目標表名
   * @param mode 抽取的模式
   */
  def mysqlToHive(spark: SparkSession, dataSource: JSONObject,sql: String, destTableName:String, mode: SaveMode, partition: String):Unit={
    val df = spark.read.format("jdbc")
      .option("url",dataSource.getString("url"))
      .option("driver",dataSource.getString("driver"))
      .option("fetchSize", 10000)
      .option("numPartitions",2)
      .option("dbtable",s"(${sql}) AS t")
      .option("user",dataSource.getString("user"))
      .option("password",dataSource.getString("password"))
      .load()
    if(partition == null || partition.isEmpty){
      df.write.format("parquet").mode(mode).saveAsTable(destTableName)
    }else{
      df.write.format("parquet").mode(mode).partitionBy("ds").saveAsTable(destTableName)
    }
  }
}

Spark Application程式碼示例

trait SparkHivePartitionOverwriteApplication extends Logging{


  def getProperties(): Properties ={
    val prop:Properties = new Properties()
    val inputStream = this.getClass.getClassLoader.getResourceAsStream("config.properties")
    prop.load(inputStream);
    prop
  }

  def job(appName: String = null,
          master: String = null)(biz: => Unit): Unit = {
    var spark: SparkSession = null
    System.setProperty("HADOOP_USER_NAME", "mapred")
    val prop:Properties = getProperties()
    if (null == appName) {
      spark = SparkSession.builder
        .config("spark.sql.parquet.writeLegacyFormat", true)
        .config("spark.sql.sources.partitionOverwriteMode","dynamic")
        .config("hive.exec.dynamic.partition.mode","nonstrict")
        .config("spark.sql.hive.convertMetastoreParquet",false)
        .enableHiveSupport
        .getOrCreate
      var sparkAndProperties = SparkAndProperties(spark, prop)
      SparkUtils.set(sparkAndProperties)
    } else {
      spark = SparkSession.builder.master(master).appName(appName)
        .config("spark.sql.parquet.writeLegacyFormat", true)
        .config("spark.sql.sources.partitionOverwriteMode","dynamic")
        .config("hive.exec.dynamic.partition.mode","nonstrict")
        .config("spark.sql.hive.convertMetastoreParquet",false)
        .config("spark.testing.memory","2147480000")
        .config("spark.driver.memory","2147480000")
        .enableHiveSupport.getOrCreate
      var sparkAndProperties = SparkAndProperties(spark, prop)
      SparkUtils.set(sparkAndProperties)
      SparkUtils.set(sparkAndProperties)
    }
    biz
    spark.stop()
    SparkUtils.remove()
  }

}

case class SparkAndProperties(spark: SparkSession,
                              properties: Properties)

四、配套生態

  1. 自定義UDF函式

在使用的過程中,需要將表中的IP地址,解析為所在地的名稱,這需要呼叫第三方的一個服務介面來完成,為了完成這個任務,定義了一個自定義UDF函式,進行解析。

a. 自定義UDF函式

object ParseIp  {
    def evaluate(ip: String):String= {
      // 具體的IP解析服務
      SplitAddress.getPlaceFromIp(ip)
   }
}

b. 使用自定義UDF函式

object TraceTmpEtl extends SparkHivePartitionOverwriteApplication{

  /**
   * 資料同步任務
   * @param spark sparkSession例項
   * @param properties 資料庫配置
   * @param date 日期
   */
  def tmp_t_trace_user_visit_real_time_statistic(spark: SparkSession,properties:Properties,date: String):Unit ={
    // 獲取資料庫配置的資料庫名稱
    val odsDatabaseName = properties.get("hive.datasource.ods")
    val tmpDatabaseName = properties.get("hive.datasource.tmp")

    // 註冊自定義的UDF函式
    spark.udf.register("parseIP", (ip: String) => SplitAddress.getPlaceFromIp(ip))
    // 在Spark SQL中使用UDF函式
    spark.sql(
      s"""
         |INSERT OVERWRITE TABLE ${tmpDatabaseName}.tmp_t_statistic partition(ds='${date}')
         |select
         |	  `id` ,
         |	  `create_time` ,
         |	  `update_time` ,
         |	  `ip` ,
         |      replace( replace( replace(replace( case when parseIP(ip) rlike '^中國' then replace(parseIP(ip),'中國','')
         |          when parseIP(ip) rlike '^內蒙古' then replace(parseIP(ip),'內蒙古','內蒙古自治區')
         |          when parseIP(ip) rlike '^廣西' then replace(parseIP(ip),'廣西','廣西壯族自治區')
         |          when parseIP(ip) rlike '^西藏' then replace(parseIP(ip),'西藏','西藏自治區')
         |          when parseIP(ip) rlike '^寧夏' then replace(parseIP(ip),'寧夏','寧夏回族自治區')
         |          when parseIP(ip) rlike '^新疆' then replace(parseIP(ip),'新疆','新疆維吾爾自治區')
         |          when parseIP(ip) rlike '^香港' then replace(parseIP(ip),'香港','香港特別行政區')
         |          when parseIP(ip) rlike '^澳門' then replace(parseIP(ip),'澳門','澳門特別行政區')
         |     else parseIP(ip) end, "省", "省."),"市", "市."),"縣", "縣."),"區", "區.") as ip_place,
         |	  `page_view` 
         |from ${odsDatabaseName}.ods_t_statistic where ds ='${date}'
         |""".stripMargin)
  }

  /**
   * 資料etl
   * @param spark SparkSession例項
   * @param properties 資料庫配置
   */
  def etl(spark: SparkSession, properties:Properties): Unit = {
    val lastDate = DateUtils.getLastDayString
    tmp_t_trace_user_visit_real_time_statistic(spark,properties, lastDate)
  }


  
  def main(args: Array[String]): Unit = {
    job() {
      val sparkAndProperties = SparkUtils.get()
      val spark = sparkAndProperties.spark
      val properties = sparkAndProperties.properties
      etl(spark,properties)
    }
  }
}

  1. 資料庫的配置安全性問題

剛開始資料庫配置同步配置檔案直接寫死,但是後續發現這樣存在一些安全性的問題,後來採用將資料庫相關的配置組合為一個JSON字串,將其加密之後儲存到MongoDB中,在使用時進行查詢解密。

public class DataSourceUtils {

    private  static Logger logger = LoggerFactory.getLogger(DataSourceUtils.class);

    public static JSONObject getDataSourceProperties(String dataSourceKey,Properties properties){
        List<ServerAddress> adds = new ArrayList<>();
        try {
            String filePath = properties.getProperty("spark.mongo.properties.file.url");
            properties = new Properties();
            File file = new File(filePath);
            FileInputStream inputStream = null;
             inputStream = new FileInputStream(file);
            properties.load(inputStream);
        }catch (Exception e){
            logger.info("not load file, reason:" + e.getMessage());
            e.printStackTrace();
        }
        String mongoUrl = properties.getProperty("mongo_url");
        String mongoPort = properties.getProperty("mongo_port");
        String mongoDbName = properties.getProperty("mongo_dbName");
        String mongoCollect = properties.getProperty("mongo_collect");
        String mongoUser = properties.getProperty("mongo_user");
        String mongoPassword = properties.getProperty("mongo_password");
        String desKey = properties.getProperty("data_des_key");
        ServerAddress serverAddress = new ServerAddress(mongoUrl, Integer.parseInt(mongoPort));
        adds.add(serverAddress);
        List<MongoCredential> credentials = new ArrayList<>();
        MongoCredential mongoCredential = MongoCredential.createScramSha1Credential(mongoUser, mongoDbName, mongoPassword.toCharArray());
        credentials.add(mongoCredential);
        MongoClient mongoClient = new MongoClient(adds, credentials);
        MongoDatabase mongoDatabase = mongoClient.getDatabase(mongoDbName);
        MongoCollection<Document> collection = mongoDatabase.getCollection(mongoCollect);
        //指定查詢過濾器
        Bson filter = Filters.eq("key", dataSourceKey);
        //指定查詢過濾器查詢
        FindIterable findIterable = collection.find(filter);
        //取出查詢到的第一個文件
        Document document = (Document) findIterable.first();
        //列印輸出
        String content = DESUtil.decrypt(desKey, document.getString("content"));
        return JSON.parseObject(content);
    }


    public static  Properties json2Properties(JSONObject jsonObject){
        String tmpKey = "";
        String tmpKeyPre = "";
        Properties properties = new Properties();
        j2p(jsonObject, tmpKey, tmpKeyPre, properties);
        return properties;
    }



    private static void j2p(JSONObject jsonObject, String tmpKey, String tmpKeyPre, Properties properties){
        for (String key : jsonObject.keySet()) {
            // 獲得key
            String value = jsonObject.getString(key);
            try {
                JSONObject jsonStr = JSONObject.parseObject(value);
                tmpKeyPre = tmpKey;
                tmpKey += key + ".";
                j2p(jsonStr, tmpKey, tmpKeyPre, properties);
                tmpKey = tmpKeyPre;
            } catch (Exception e) {
                properties.put(tmpKey + key, value);
                System.out.println(tmpKey + key + "=" + value);
            }
        }
    }
    public static void main(String[] args) {

    }
}

  1. Spark任務指令碼示例
#!/bin/sh

##### env ###########
export JAVA_HOME=/usr/java/jdk1.8.0_151
export SPARK_HOME=/opt/cloudera/parcels/CDH/lib/spark
export PATH=${JAVA_HOME}/bin:${SPARK_HOME}/bin:${PATH}
export SPARK_USER=hadoop
export HADOOP_USER_NAME=hadoop
LAST_DAY="$1"
echo LAST_DAY

spark-submit \
--class net.app315.bigdata.operatereport.ods.MarketMysqlToHiveEtl \
--conf spark.sql.hive.metastore.version=2.1.1 \
--conf spark.sql.hive.metastore.jars=/opt/cloudera/parcels/CDH/lib/hive/lib/* \
--jars /opt/cloudera/parcels/CDH/lib/spark/jars/mysql-connector-java-5.1.48.jar,/opt/cloudera/parcels/CDH/lib/spark/jars/druid-1.1.10.jar \
--master yarn \
--deploy-mode cluster \
--executor-memory 4G \
--driver-memory 2G \
--num-executors 4 \
--executor-cores 2 \
--conf spark.dynamicAllocation.minExecutors=1 \
--conf spark.dynamicAllocation.maxExecutors=8 \
--conf spark.yarn.am.attemptFailuresValidityInterval=1h \
--conf spark.yarn.max.executor.failures=128 \
--conf spark.yarn.executor.failuresValidityInterval=1h \
--conf spark.task.maxFailures=4 \
--conf spark.yarn.maxAppAttempts=2 \
--conf spark.scheduler.mode=FIFO \
--conf spark.network.timeout=420000 \
--conf spark.dynamicAllocation.enabled=true \
--conf spark.executor.heartbeatInterval=360000 \
--conf spark.sql.crossJoin.enabled=true \
--conf spark.mongo.properties.file.url=/opt/conf/mongo.properties \
--conf spark.etl.last.day="${LAST_DAY}" \
./target/spark-operate-report-project-1.0.jar
  1. Job任務指令碼例項
nodes:

  - name: bigdata_market_ods_etl
    type: command
    config:
      command: sh -x ./script/bigdata_market_ods_etl.sh "${spark.etl.last.day}"
      failure.emails: mxx@xxx.com

  - name: bigdata_market_dim_etl
    type: command
    config:
      command: sh -x ./script/bigdata_market_dim_etl.sh "${spark.etl.last.day}"
      failure.emails: mxx@xxx.com
    dependsOn:
          - bigdata_market_ods_etl
          
  - name: bigdata_market_dw_etl
    type: command
    config:
      command: sh -x ./script/bigdata_market_dw_etl.sh "${spark.etl.last.day}"
      failure.emails: mxx@xxx.com
    dependsOn:
          - bigdata_market_dim_etl
          - bigdata_user_dw_etl

五、備註

  1. Davinci報表 一個開源的報表平臺

第二代-基於DolphinScheduler的離線資料同步

一、背景

自從上次開始使用基於Hadoop的大資料體現方案之後,業務平穩發展,但是隨著時間的推移,新的問題開始出現,主要出現的問題為兩個:

  1. 資料的變更越來越頻繁,基於之前SparkSQL任務的方式,只要需要對錶結構進行變更,就需要重新修改Scala程式碼,然後重新進行任務的打包,這對於一些不熟悉程式碼的人來說,不太友好,而且成本也很高。
  2. 雖然使用了Presto對HIVE的資料查詢進行了加速,但是所在資料量越來越大,分析要求越來越複雜,即席查詢越來越多,由於叢集本身資源有限,查詢能力出現了顯著瓶頸。

二、資料同步架構

隨著技術的發展已經對大資料的認識,接觸到了更多的大資料相關的知識與元件,基於此,透過認真分析與思考之後,對資料的同步方案進行了如下的重新設計。

  1. 資料儲存與查詢放棄了HDFS+HIVE+Presto的組合,轉而採用現代化的MPP資料庫StarRocks,StarRocks在資料查詢的效率層面非常優秀,在相同資源的情況下,可以解決目前遇到的資料查詢瓶頸。
  2. 資料同步放棄了SparkSQL,轉而採用更加輕量級的DATAX來進行,其只需要透過簡單的配置,即可完成資料的同步,同時其也支援StarRocks Writer,開發人員只需要具備簡單的SQL知識,就可以完成整個資料同步任務的配置,難度大大降低,效率大大提升,友好度大大提升。
  3. 定時任務排程放棄Azkaban,採用現代化的任務排程工作Apache DolphinScheduler,透過視覺化的頁面進行排程任務工作流的配置,更加友好。

三、資料同步的詳細流程

資料同步在這種方式下變動非常簡單,只需要視覺化的配置DataX任務,即可自動排程。下面的一個任務的配置示例

{
  "job": {
    "setting": {
      "speed": {
        "channel":1
      }
    },
    "content": [
      {
        "reader": {
          "name": "mysqlreader",
          "parameter": {
            "username": "",
            "password": "",
            "connection": [
              {
                "querySql": [
                  "SELECT CustomerId AS customer_id FROM base_info.base_customer where date(UpdateTime) > '${sdt}' and date(UpdateTime) < '${edt}'"
                ],
                "jdbcUrl": [
                  "jdbc:mysql://IP:3306/base_info?characterEncoding=utf-8&useSSL=false&tinyInt1isBit=false"
                ]
              }
            ]
          }
        },
        "writer": {
          "name": "starrockswriter",
          "parameter": {
            "username": "xxx",
            "password": "xxx",
            "database": "ods_cjm_test",
            "table": "ods_base_customer",
            "column": ["id"],
            "preSql": [],
            "postSql": [], 
            "jdbcUrl": "jdbc:mysql://IP:9030/",
            "loadUrl": ["IP:8050", "IP:8050", "IP:8050"],
            "loadProps": {
              "format": "json",
              "strip_outer_array": true
            }
          }
        }
      }
        ]
    }
}

資料同步過程中,遇到了另外一個問題,即業務存在大量的分庫分表的,這些分庫分表的邏輯五花八門,60張左右的邏輯板,經過分庫分表之後達到了驚人的5000多張,為每張表配置任務很顯然不太正常,這就需要能夠在進行資料同步的時候動態生成需要的表列表,把表列表配置到DataX的配置檔案中去。

經過技術的呼叫,Apache DolphinScheduler的Python任務型別很適合做這個事情,由於公司本身使用了Apache DolphinScheduler3.0的版本,其Python任務還不支援返回資料到下游節點,但是社群最新版本已經支援該能力,因為按照已實現版本對其進行改造。

改造之後,Python節點能夠將資料傳遞給他的下游節點,因此使用Python指令碼查詢獲取需要進行同步的表列表,將其傳遞給DataX節點,完成動態表的資料同步

import pymysql
import datetime


def select_all_table(date: str):
    result_list = []
    sql = """
    SELECT concat('"', table_name, '"') 
    FROM information_schema.`TABLES` 
    WHERE table_schema='hydra_production_flow' 
        and table_name like 't_package_flow_log_%'
        and table_name like '%_{}'
    """.format(date)
    conn = pymysql.connect(host='', port=3306, user='', passwd='',
                           db='information_schema')
    cur = conn.cursor()
    cur.execute(query=sql)
    while 1:
        res = cur.fetchone()
        if res is None:
            break
        result_list.append(res[0])
    cur.close()
    conn.close()
    return result_list


if __name__ == '__main__':
    # 獲取當前年月
    # 獲取當前日期
    today = datetime.date.today()
    # 計算前一天的日期
    yesterday = today - datetime.timedelta(days=1)
    current_date = yesterday.strftime("%Y_%m")
    table_list = select_all_table(current_date)
    table_str = ",".join(table_list)
    # 設定變數,傳遞給下游節點
    print('${setValue(table_list=%s)}' % table_str)
{
  "job": {
    "setting": {
      "speed": {
        "channel":1
      }
    },
    "content": [
      {
        "reader": {
          "name": "mysqlreader",
          "parameter": {
            "username": "xxx",
            "password": "xxxx",
            "column": [
              "id",
              "concat('t_package_flow_log_',DATE_FORMAT(create_time,'%Y_%m'))",
              "operation_type"
            ],
            "where": "date(create_time) ${operator_symbol} '${dt}'",
            "connection": [
              {
                "table": [
                  ${table_list}
                ],
                "jdbcUrl": [
                  "jdbc:mysql://xx:3306/hydra_production_flow?characterEncoding=utf-8&useSSL=false&tinyInt1isBit=false"
                ]
              }
            ]
          }
        },
        "writer": {
                    "name": "starrockswriter",
                    "parameter": {
                        "username": "xxxxxx",
                        "password": "xxxxxxx",
                        "database": "ods_cjm",
                        "table": "ods_t_package_flow_log",
                        "column": ["id", "table_name","operation_type"],
                        "preSql": [],
                        "postSql": [], 
                        "jdbcUrl": "jdbc:mysql://IP:9030/",
                        "loadUrl": ["IP:8050", "IP:8050", "IP:8050"],
                        "loadProps": {
                            "format": "json",
                            "strip_outer_array": true
                        }
                    }
                }
            }
        ]
    }
}

四、踩坑記錄

  1. DATAX只支援python2.x

下載支援python3.x的相關檔案,替換DataX中的相同檔案,即可支援python3.x使用

五、備註

  1. StarRocks 高效能的MPP資料庫
  2. DataX 離線資料同步
  3. Apache DolphinScheduler 任務排程工具

第三代-基於Python自定義的離線資料同步

一、背景

自從採用Apache DolphinScheduler + StarRocks資料方案以來,一切都很平穩發展;但是隨著時間的推移,總會出現新的問題。

隨著資料量的增多,使用方需求的增長,已經一些其他因素的影響,對目前的資料同步架構帶來了一些不小的挑戰,這些問題導致任務的維護和更新越來越麻煩,需要耗費大量的時間來進行,急需一種新的方式來處理。

  1. 由於等保的要求,線上RDS資料庫不再支援透過公網訪問,又因為StarRocks也在內網,這就導致了之前的資料同步鏈路徹底斷裂,需要新的方案。
  2. 由於資料結構的頻繁變更、伺服器資源導致的任務排程異常等等原因,需要重跑資料的需求越來越多,這就導致需要不斷的修改任務的排程引數(如日期),目前已經上線了10個業務的排程任務,也就是重新同步一次,就需要依次修改排程這10個任務,這期間還需要專人進行狀態的跟蹤,即使修改排程,壓力很大。

二、資料同步架構

鑑於資料鏈路變更,導致原本資料鏈路斷裂的問題,透過調研之後,決定採用KAFKA進行資料的中轉,在內網部署KAFKA叢集,同時該叢集提供公網訪問地址;在RDS所在的內網機器上使用DataX將RDS資料透過公網地址寫入KAFKA,在內網中透過KafkaConnector消費資料寫入StarRocks。

鑑於新的資源有限,原本內網提供了4臺8C32G的伺服器,但是新的RDS所在內網只能提供一臺最大4C8G的伺服器。因此放棄了使用Apache DolphinScheduler來進行排程,直接使用crontab呼叫對應的Python指令碼進行DataX任務排程。

三、具體的資料同步

新的方案,主要解決的問題有兩個,一是DataX如何將資料寫入KAFKA,二是Python指令碼怎麼解決前面遇到的修改複雜的問題。

  1. DataX寫KAFKA

DataX本身並沒有kafkawriter實現,這就需要我們自己實現一個KafkaWriter來支援我們的需求,同時為了資料安全,希望能夠對資料進行加密。

DataX的KafkaWriter實現

public class KafkaWriter extends Writer {

    public static class Job extends Writer.Job {

        private static final Logger logger = LoggerFactory.getLogger(Job.class);
        private Configuration conf = null;

        @Override
        public List<Configuration> split(int mandatoryNumber) {
            List<Configuration> configurations = new ArrayList<Configuration>(mandatoryNumber);
            for (int i = 0; i < mandatoryNumber; i++) {
                configurations.add(conf);
            }
            return configurations;
        }

        private void validateParameter() {
            this.conf.getNecessaryValue(Key.BOOTSTRAP_SERVERS, KafkaWriterErrorCode.REQUIRED_VALUE);
            this.conf.getNecessaryValue(Key.TOPIC, KafkaWriterErrorCode.REQUIRED_VALUE);
        }

        @Override
        public void init() {
            this.conf = super.getPluginJobConf();
            logger.info("kafka writer params:{}", conf.toJSON());
            this.validateParameter();
        }


        @Override
        public void destroy() {

        }
    }

    public static class Task extends Writer.Task {
        private static final Logger logger = LoggerFactory.getLogger(Task.class);
        private static final String NEWLINE_FLAG = System.getProperty("line.separator", "\n");

        private Producer<String, String> producer;
        private String fieldDelimiter;
        private Configuration conf;
        private Properties props;
        private AesEncryption aesEncryption;
        private List<String> columns;

        @Override
        public void init() {
            this.conf = super.getPluginJobConf();
            fieldDelimiter = conf.getUnnecessaryValue(Key.FIELD_DELIMITER, "\t", null);
            columns = conf.getList(Key.COLUMN_LIST, new ArrayList<>(), String.class);

            props = new Properties();
            props.put("bootstrap.servers", conf.getString(Key.BOOTSTRAP_SERVERS));
            props.put("acks", conf.getUnnecessaryValue(Key.ACK, "0", null));//這意味著leader需要等待所有備份都成功寫入日誌,這種策略會保證只要有一個備份存活就不會丟失資料。這是最強的保證。
            props.put("retries", conf.getUnnecessaryValue(Key.RETRIES, "5", null));
            props.put("retry.backoff.ms", "1000");
            props.put("batch.size", conf.getUnnecessaryValue(Key.BATCH_SIZE, "16384", null));
            props.put("linger.ms", 100);
            props.put("connections.max.idle.ms", 300000);
            props.put("max.in.flight.requests.per.connection", 5);
            props.put("socket.keepalive.enable", true);
            props.put("key.serializer", conf.getUnnecessaryValue(Key.KEYSERIALIZER, "org.apache.kafka.common.serialization.StringSerializer", null));
            props.put("value.serializer", conf.getUnnecessaryValue(Key.VALUESERIALIZER, "org.apache.kafka.common.serialization.StringSerializer", null));
            producer = new KafkaProducer<String, String>(props);
            String encryptKey = conf.getUnnecessaryValue(Key.ENCRYPT_KEY, null, null);
            if(encryptKey != null){
                aesEncryption = new AesEncryption(encryptKey);
            }
        }

        @Override
        public void prepare() {
            AdminClient adminClient = AdminClient.create(props);
            ListTopicsResult topicsResult = adminClient.listTopics();
            String topic = conf.getNecessaryValue(Key.TOPIC, KafkaWriterErrorCode.REQUIRED_VALUE);
            try {
                if (!topicsResult.names().get().contains(topic)) {
                    new NewTopic(
                            topic,
                            Integer.parseInt(conf.getUnnecessaryValue(Key.TOPIC_NUM_PARTITION, "1", null)),
                            Short.parseShort(conf.getUnnecessaryValue(Key.TOPIC_REPLICATION_FACTOR, "1", null))
                    );
                    List<NewTopic> newTopics = new ArrayList<NewTopic>();
                    adminClient.createTopics(newTopics);
                }
                adminClient.close();
            } catch (Exception e) {
                throw new DataXException(KafkaWriterErrorCode.CREATE_TOPIC, KafkaWriterErrorCode.REQUIRED_VALUE.getDescription());
            }
        }

        @Override
        public void startWrite(RecordReceiver lineReceiver) {
            logger.info("start to writer kafka");
            Record record = null;
            while ((record = lineReceiver.getFromReader()) != null) {
                if (conf.getUnnecessaryValue(Key.WRITE_TYPE, WriteType.TEXT.name(), null)
                        .equalsIgnoreCase(WriteType.TEXT.name())) {
                    producer.send(new ProducerRecord<String, String>(this.conf.getString(Key.TOPIC),
                            Md5Encrypt.md5Hexdigest(recordToString(record)),
                            aesEncryption ==null ? recordToString(record): JSONObject.toJSONString(aesEncryption.encrypt(recordToString(record))))
                    );
                } else if (conf.getUnnecessaryValue(Key.WRITE_TYPE, WriteType.TEXT.name(), null)
                        .equalsIgnoreCase(WriteType.JSON.name())) {
                    producer.send(new ProducerRecord<String, String>(this.conf.getString(Key.TOPIC),
                            Md5Encrypt.md5Hexdigest(recordToString(record)),
                            aesEncryption ==null ? recordToJsonString(record) : JSONObject.toJSONString(aesEncryption.encrypt(recordToJsonString(record))))
                    );
                }
                producer.flush();
            }
        }

        @Override
        public void destroy() {
            if (producer != null) {
                producer.close();
            }
        }

        /**
         * 資料格式化
         *
         * @param record
         * @return
         */
        private String recordToString(Record record) {
            int recordLength = record.getColumnNumber();
            if (0 == recordLength) {
                return NEWLINE_FLAG;
            }
            Column column;
            StringBuilder sb = new StringBuilder();
            for (int i = 0; i < recordLength; i++) {
                column = record.getColumn(i);
                sb.append(column.asString()).append(fieldDelimiter);
            }

            sb.setLength(sb.length() - 1);
            sb.append(NEWLINE_FLAG);
            return sb.toString();
        }

        /**
         * 資料格式化
         *
         * @param record 資料
         *
         */
        private String recordToJsonString(Record record) {
            int recordLength = record.getColumnNumber();
            if (0 == recordLength) {
                return "{}";
            }
            Map<String, Object> map = new HashMap<>();
            for (int i = 0; i < recordLength; i++) {
                String key = columns.get(i);
                Column column = record.getColumn(i);
                map.put(key, column.getRawData());
            }
            return JSONObject.toJSONString(map);
        }
    }
}

進行資料加密的實現:

public class AesEncryption {

    private SecretKey secretKey;

    public AesEncryption(String secretKey) {
        byte[] keyBytes = Base64.getDecoder().decode(secretKey);
        this.secretKey = new SecretKeySpec(keyBytes, 0, keyBytes.length, "AES");
    }


    public String encrypt(String data) {
        try {
            Cipher cipher = Cipher.getInstance("AES");
            cipher.init(Cipher.ENCRYPT_MODE, secretKey);
            byte[] encryptedBytes = cipher.doFinal(data.getBytes());
            return Base64.getEncoder().encodeToString(encryptedBytes);
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    public String decrypt(String encryptedData) throws Exception {
        Cipher cipher = Cipher.getInstance("AES");
        cipher.init(Cipher.DECRYPT_MODE, secretKey);
        byte[] decodedBytes = Base64.getDecoder().decode(encryptedData);
        byte[] decryptedBytes = cipher.doFinal(decodedBytes);
        return new String(decryptedBytes);
    }
}


Kafka的公網配置

Kafka的內外網配置,只需要修改kafka/config下面的server.properties檔案中的如下配置即可。

# 配置kafka的監聽埠,同時監聽9093和9092
listeners=INTERNAL://kafka節點3內網IP:9093,EXTERNAL://kafka節點3內網IP:9092

# 配置kafka的對外廣播地址, 同時配置內網的9093和外網的19092
advertised.listeners=INTERNAL://kafka節點3內網IP:9093,EXTERNAL://公網IP:19092

# 配置地址協議
listener.security.protocol.map=INTERNAL:PLAINTEXT,EXTERNAL:PLAINTEXT

# 指定broker內部通訊的地址
inter.broker.listener.name=INTERNAL
  1. 自定義的配置檔案

Python指令碼需要能夠自動生成對應的DataX排程的配置檔案和shell指令碼,自動排程DataX進行任務的執行。因此經過調研,採用自定義配置檔案,透過讀取配置檔案,動態生成對應的DataX任務指令碼和排程指令碼,排程任務執行。

自定義的配置檔案示例1:

{
  "datasource": {
    "host": "xxxxxx",
    "port": "3306",
    "username": "xxxxx",
    "password": "xxxxxxx",
    "properties": {
      "characterEncoding": "utf-8",
      "useSSL": "false",
      "tinyInt1isBit": "false"
    }
  },
  "table": {
    "database": "app",
    "table": "device",
    "column": [
      "Id AS id",
      "CompanyName AS company_name",
      "CompanyId AS company_id",
      "SecretKey AS secret_key",
      "Brand AS brand",
      "ModelType AS model_type",
      "Enable AS enable",
      "CAST(CreateTime as CHAR) AS create_time",
      "CAST(UpdateTime as CHAR) AS update_time"
    ],
    "where": "date(UpdateTime) >= '$[yyyy-MM-dd-8]'",
    "searchTableSql": []
  },
  "kafka": {
    "topic": "mzt_ods_cjm.ods_device"
  }
}

支援分庫分表的配置檔案示例2

{
  "datasource": {
    "host": "xxxxxxx",
    "port": "3306",
    "username": "xxxxxxx",
    "password": "xxxxxxxx",
    "properties": {
      "characterEncoding": "utf-8",
      "useSSL": "false",
      "tinyInt1isBit": "false"
    }
  },
  "table": {
    "database": "hydra_logistics_flow",
    "table": "",
    "column": [
      "id",
      "concat('t_logistics_sweep_out_code_flow_',DATE_FORMAT(create_time,'%Y')) AS table_name",
      "cus_org_id",
      "CAST(create_time as CHAR) AS create_time",
      "replace_product_id",
      "replace_product_name",
      "replace_product_code"
    ],
    "where": "date(create_time) >= '$[yyyy-MM-dd-8]'",
    "searchTableSql": [
      "SELECT concat('t_logistics_sweep_out_code_flow_',YEAR(SUBDATE(CURDATE(), 1))) AS TABLE_NAME",
      "SELECT concat('t_logistics_sweep_out_code_flow_',YEAR(DATE_SUB(DATE_SUB(CURDATE(), INTERVAL 1 DAY), INTERVAL 1 YEAR))) AS TABLE_NAME"
    ]
  },
  "kafka": {
    "topic": "mzt_ods_cjm.ods_t_logistics_sweep_out_code_flow"
  }
}

如上的配置檔案,解釋如下:

KEY 說明
datasource RDS資料來源
datasource.host RDS資料庫的host
datasource.port> RDS資料庫的埠
datasource.username RDS資料庫的使用者名稱
datasource.password RDS資料庫的密碼
datasource.properties jdbc連線的引數,連線時拼接為?key=value&key=value
table 要同步的表資訊
table.database RDS資料庫名稱
table.table RDS中表的名稱,分庫分表的可以為空
table.column RDS表中要同步的欄位列表,支援取別名和使用函式
table.where 同步資料的過濾條件
table.searchTableSql 查詢表名稱的SQL語句,用於動態分庫分表
kafka kafka相關的配置
kafka.topic 資料要寫入的kafka topic的名稱
  1. Python排程指令碼
import json
import os
import pymysql
import re
from datetime import datetime
from dateutil.relativedelta import relativedelta
import uuid
import subprocess
import logging
import hmac
import hashlib
import base64
import urllib.parse
import urllib
import requests
import time
from typing import List, Mapping


def list_files_in_directory(directory_path: str) -> List[str]:
    """
    獲取目錄下的所有以.json結尾的檔案
    :param directory_path: 目錄
    :return: 檔案列表
    """
    entries = os.listdir(directory_path)
    # 過濾出所有檔案
    files = [entry for entry in entries if
             os.path.isfile(os.path.join(directory_path, entry)) and entry.endswith(".json")]
    logging.info(f"讀取配置檔案數量:{len(files)}")
    return files


def read_file_content(file_path: str) -> str:
    """
    讀取檔案內容
    :param file_path: 檔案路徑
    :return: 檔案內容
    """
    with open(file_path, 'r', encoding='utf-8') as file:
        content = file.read()
    return content


def read_all_files_in_directory(directory_path: str) -> Mapping[str, str]:
    """
    讀取資料夾下面的所有檔案的內容
    :param directory_path: 資料夾路徑
    :return: 內容map
    """
    logging.info(f"開始讀取所有的配置檔案資訊")
    files = list_files_in_directory(directory_path)
    file_contents = {}
    for file in files:
        file_path = os.path.join(directory_path, file)
        content = read_file_content(file_path)
        file_contents[file] = content
    sorted_items = sorted(file_contents.items())
    sorted_dict = dict(sorted_items)
    return file_contents


def search_table_list(datasource: json, search_table_sql_list: List[str]) -> List[str]:
    """
    執行語句獲取表資訊
    :param datasource: 資料來源資訊
    :param search_table_sql_list: 查詢表的SQL語句
    :return: 表列表
    """
    logging.info(f"開始查詢需要同步的表")
    host = datasource['host']
    port = int(datasource['port'])
    username = datasource['username']
    password = datasource['password']
    conn = pymysql.connect(host=host,
                           port=port,
                           user=username,
                           passwd=password,
                           db='',
                           charset='utf8',
                           connect_timeout=200,
                           autocommit=True,
                           read_timeout=2000
                          )
    table_name_list = []
    for search_table_sql in search_table_sql_list:
        search_table_sql = parse_where_sql(search_table_sql)
        with conn.cursor() as cursor:
            cursor.execute(query=search_table_sql)
            while 1:
                res = cursor.fetchone()
                if res is None:
                    break
                table_name_list.append(res[0])
    return table_name_list


def general_default_job_config() -> json:
    """
    生成預設的datax配置
    :return: 預設的配置
    """
    default_job_json = """
    {
    "job": {
        "setting": {
            "speed": {
                 "channel":1
            }
        },
        "content": [
            {
                "reader": {
                    "name": "mysqlreader",
                    "parameter": {
                        "username": "test",
                        "password": "test1234",
                        "connection": [
                            {
                                "querySql": [
                                    "SELECT id, code from test.t_open_api_classify"
                                ],
                                "jdbcUrl": [
                                    "jdbc:mysql://IP:3306/test?characterEncoding=utf-8&useSSL=false&tinyInt1isBit=false"
                                ]
                            }
                        ]
                    }
                },
                 "writer": {
                    "name": "kafkawriter",
                    "parameter": {
                        "bootstrapServers": "IP:9092,IP:9092,IP:9092",
                        "topic": "test-m-t-k",
                        "ack": "all",
                        "batchSize": 1000,
                        "retries": 0,
                        "keySerializer": "org.apache.kafka.common.serialization.StringSerializer",
                        "valueSerializer": "org.apache.kafka.common.serialization.StringSerializer",
                        "fieldDelimiter": ",",
                        "writeType": "json",
                        "topicNumPartition": 1,
                        "topicReplicationFactor": 1,
                        "encryptionKey": "5s8FGjerddfWkG/b64CGHHZYvQ=="
                    }
                }
            }
        ]
    }
}
    """
    return json.loads(default_job_json, encoding='utf-8')


def general_jdbc_url(json_config: json) -> str:
    """
    根據資料來源資訊生成jdbc url
    :param json_config: 配置
    :return: jdbc url
    """
    logging.info(f"開始解析jdbc url")
    host = json_config['datasource']['host']
    port = int(json_config['datasource']['port'])
    database = json_config['table']['database']
    url = "jdbc:mysql://{}:{}/{}".format(host, port, database)
    # 解下properties
    properties = json_config['datasource']['properties']
    properties_list = []
    if properties is not None and len(properties) > 0:
        for key, value in properties.items():
            properties_list.append(key + "=" + str(value))
        url = url + "?" + "&".join(properties_list)
    logging.info(f"jdbc url: {url}")
    return url


def parse_where_sql(where_sql: str) -> str:
    """
    解析where語句
    :param where_sql: 原始where語句
    :return: 轉換之後的where語句
    """
    # 定義支援的型別 $[yyyyMMdd+N_Y]  $[yyyyMMdd-N_Y]
    # 正規表示式模式
    logging.info(f"還是解析where語句:where_sql: {where_sql}")
    pattern = r"\$\[.*?\]"
    return re.sub(pattern, replacement_function, where_sql)


def replacement_function(match):
    """
    替換函式
    :param match: 匹配結果
    :return: 替換之後的結果
    """
    matched_text = match.group(0)
    return calc_datetime(matched_text)


def calc_datetime(expression: str) -> str:
    """
    計算時間表示式
    :param expression: 表示式
    :return: 計算之後的值
    """
    logging.info(f"開始計算時間引數:expression: {expression}")
    # 設定對映
    format_units = {
        "yyyy": "%Y",
        "MM": "%m",
        "dd": "%d",
        "HH": "%H",
        "mm": "%M",
        "ss": "%S"
    }

    unit_map = {
        "Y": "yyyy",
        "M": "MM",
        "d": "dd",
        "H": "HH",
        "m": "mm",
        "s": "ss"
    }
    # 解析引數
    expression = expression[2:-1]
    # 判斷其開頭,擷取尾部
    min_unit = None
    for key, value in format_units.items():
        if key in expression:
            min_unit = key
            expression = expression.replace(key, value)

    # 替換完畢,確定是否有數字
    logging.info(f"轉換為Python格式的表示式:expression: {expression}")
    # 定義正規表示式模式
    pattern = r'([^0-9]+)([-+]\d+(\*\d+)?)(?:_([YMdHms]))?'
    matches = re.match(pattern, expression)
    # 輸出拆分結果
    if matches:
        date_part = matches.group(1)
        remainder = matches.group(2)
        unit = matches.group(4)
        if unit is not None and unit in unit_map.keys():
            min_unit = unit_map[unit]
        return calculate_expression(min_unit, date_part, remainder)
    else:
        return expression


def calculate_expression(min_unit: str, date_part: str, remainder: str) -> str:
    """
    計算表示式
    :param min_unit: 最小單位
    :param date_part: 日期表示式部分
    :param remainder: 偏移量部分
    :return: 計算之後的結果
    """
    logging.info(f"開始計算表示式:min_unit: {min_unit}, date_part: {date_part}, remainder:{remainder}")
    # 獲取當前日期和時間
    now = datetime.now()
    # 計算時間的偏移量
    if remainder is None:
        # 格式化的日期
        formatted_datetime = now.strftime(date_part)
        logging.info(f"日期偏移量為空,返回值:{formatted_datetime}")
        return formatted_datetime
    else:
        # 計算偏移量
        plus_or_sub = remainder[0:1]
        offset = eval(remainder[1:])
        logging.info(f"計算偏移量,plus_or_sub:{plus_or_sub}, offset:{offset}")
        if min_unit == 'yyyy':
            if plus_or_sub == '-':
                now = now - relativedelta(years=offset)
            else:
                now = now + relativedelta(years=offset)
        elif min_unit == 'MM':
            if plus_or_sub == '-':
                now = now - relativedelta(months=offset)
            else:
                now = now + relativedelta(months=offset)
        elif min_unit == 'dd':
            if plus_or_sub == '-':
                now = now - relativedelta(days=offset)
            else:
                now = now + relativedelta(days=offset)
        elif min_unit == 'HH':
            if plus_or_sub == '-':
                now = now - relativedelta(hours=offset)
            else:
                now = now + relativedelta(hours=offset)
        elif min_unit == 'mm':
            if plus_or_sub == '-':
                now = now - relativedelta(minutes=offset)
            else:
                now = now + relativedelta(minutes=offset)
        elif min_unit == 'ss':
            if plus_or_sub == '-':
                now = now - relativedelta(seconds=offset)
            else:
                now = now + relativedelta(seconds=offset)
        formatted_datetime = now.strftime(date_part)
        logging.info(f"日期偏移量為空,返回值:{formatted_datetime}")
        return formatted_datetime


def general_reader(json_config: json) -> json:
    """
    生成配置的reader部分
    :param json_config: 配置
    :return: JSON結果
    """
    logging.info(f"開始生成DataX的配置JSON檔案的reader內容")
    reader_json = json.loads("{}", encoding='utf-8')
    reader_json['name'] = "mysqlreader"
    reader_json['parameter'] = {}
    reader_json['parameter']['username'] = json_config['datasource']['username']
    reader_json['parameter']['password'] = json_config['datasource']['password']
    reader_json['parameter']['column'] = json_config['table']['column']
    reader_json['parameter']['connection'] = [{}]
    reader_json['parameter']['connection'][0]['table'] = json_config['table']['table']
    reader_json['parameter']['connection'][0]['jdbcUrl'] = [general_jdbc_url(json_config)]
    where_sql = json_config['table']['where']
    if where_sql is not None and where_sql != '':
        reader_json['parameter']['where'] = parse_where_sql(where_sql)
    return reader_json


def general_writer(json_config: json) -> json:
    """
    生成配置的Writer部分
    :param json_config: 配置
    :return: JSON結果
    """
    columns = json_config['table']['column']
    new_columns = []
    for column in columns:
        column = str(column).replace("`", "")
        if " AS " in str(column).upper():
            new_columns.append(str(column).split(" AS ")[1].strip())
        else:
            new_columns.append(str(column).strip())
    logging.info(f"開始生成DataX的配置JSON檔案的Writer內容")
    writer_json = json.loads("{}", encoding='utf-8')
    writer_json['name'] = "kafkawriter"
    writer_json['parameter'] = {}
    writer_json['parameter']['bootstrapServers'] = "IP:19092,IP:19093,IP:19094"
    writer_json['parameter']['topic'] = json_config['kafka']['topic']
    writer_json['parameter']['ack'] = "all"
    writer_json['parameter']['batchSize'] = 1000
    writer_json['parameter']['retries'] = 3
    writer_json['parameter']['keySerializer'] = "org.apache.kafka.common.serialization.StringSerializer"
    writer_json['parameter']['valueSerializer'] = "org.apache.kafka.common.serialization.StringSerializer"
    writer_json['parameter']['fieldDelimiter'] = ","
    writer_json['parameter']['writeType'] = "json"
    writer_json['parameter']['topicNumPartition'] = 1
    writer_json['parameter']['topicReplicationFactor'] = 1
    writer_json['parameter']['encryptionKey'] = "5s8FGjerddfWkG/b64CGHHZYvQ=="
    writer_json['parameter']['column'] = new_columns
    return writer_json


def general_datax_job_config(datax_config: str):
    """
    生成job的配置內容
    :param datax_config: 配置
    :return: 完整的JSON內容
    """
    logging.info(f"開始生成DataX的配置JSON檔案內容, {datax_config}")
    json_config = json.loads(datax_config, encoding='utf-8')
    # 判定是否需要查詢表
    datasource = json_config['datasource']
    table = json_config['table']['table']
    search_table_sql_list = json_config['table']['searchTableSql']
    if search_table_sql_list is not None and len(search_table_sql_list) > 0:
        # 查詢表列表,覆蓋原來的配置資訊
        table_list = search_table_list(datasource, search_table_sql_list)
    else:
        table_list = [table]
    json_config['table']['table'] = table_list

    # 開始生成配置檔案
    job_json = general_default_job_config()
    job_json['job']['content'][0]['reader'] = general_reader(json_config)
    job_json['job']['content'][0]['writer'] = general_writer(json_config)
    return job_json


def write_job_file(base_path: str, job_config: json) -> str:
    """
    生成job的JSON配置檔案
    :param base_path: 根路徑
    :param job_config: 配置資訊
    :return: 完整的JSON檔案路徑
    """
    # 生成一個指令碼
    logging.info(f"開始建立DataX的配置JSON檔案")
    date_day = datetime.now().strftime('%Y-%m-%d')
    timestamp_milliseconds = int(datetime.now().timestamp() * 1000)
    # 生成UUID
    file_name = str(uuid.uuid4()).replace("-", "") + "_" + str(timestamp_milliseconds) + ".json"
    # 完整檔案路徑
    # 建立資料夾
    mkdir_if_not_exist(base_path + "/task/datax/json/" + date_day)
    complex_file_path = base_path + "/task/datax/json/" + date_day + "/" + file_name
    logging.info(f"完整的DataX的配置JSON檔案路徑:{complex_file_path}")
    with open(complex_file_path, 'w+', encoding='utf-8') as f:
        f.write(json.dumps(job_config, ensure_ascii=False))
    return complex_file_path


def mkdir_if_not_exist(path):
    """
    建立目錄
    :param path: 目錄路徑
    :return: None
    """
    os.makedirs(path, exist_ok=True)


def write_task_file(base_path: str, python_path: str, datax_path: str, job_file_path: str) -> str:
    """
    寫shell指令碼檔案
    :param base_path: 跟路徑
    :param python_path: python執行檔案路徑
    :param datax_path: datax執行檔案路徑
    :param job_file_path: JSON配置檔案路徑
    :return: shell指令碼的完整路徑
    """
    # 組合內容
    logging.info(f"開始建立Shell指令碼檔案")
    task_content = python_path + " " + datax_path + " " + job_file_path
    # 生成一個指令碼
    date_day = datetime.now().strftime('%Y-%m-%d')
    timestamp_milliseconds = int(datetime.now().timestamp() * 1000)
    # 生成UUID
    task_file_name = str(uuid.uuid4()).replace("-", "") + "_" + str(timestamp_milliseconds) + ".sh"
    # 完整檔案路徑
    # 建立資料夾
    mkdir_if_not_exist(base_path + "/task/datax/shell/" + date_day)
    complex_file_path = base_path + "/task/datax/shell/" + date_day + "/" + task_file_name
    logging.info(f"完整的shell指令碼路徑: {complex_file_path}")
    with open(complex_file_path, 'w+', encoding='utf-8') as f:
        f.write(task_content)
    # 新增執行許可權
    current_permissions = os.stat(complex_file_path).st_mode
    # 新增執行許可權 (許可權值 0o111 表示使用者、組和其他人的執行許可權)
    new_permissions = current_permissions | 0o111
    # 使用 os.chmod 設定新的許可權
    os.chmod(complex_file_path, new_permissions)
    return complex_file_path


def signs(dd_secret: str, timestamp: str) -> str:
    """
    釘釘機器人簽名
    :param dd_secret: 秘鑰
    :param timestamp: 時間戳
    :return: 簽名
    """
    secret_enc = dd_secret.encode('utf-8')
    string_to_sign = '{}\n{}'.format(timestamp, dd_secret)
    string_to_sign_enc = string_to_sign.encode('utf-8')
    hmac_code = hmac.new(secret_enc, string_to_sign_enc, digestmod=hashlib.sha256).digest()
    sign = urllib.parse.quote(base64.b64encode(hmac_code))
    return sign


def real_send_msg(dd_secret: str, dd_access_token: str, text: json):
    """
    傳送釘釘機器人訊息
    :param dd_secret: 秘鑰
    :param dd_access_token: token
    :param text: 內容
    :return: None
    """
    timestamp = str(round(time.time() * 1000))
    sign = signs(dd_secret, timestamp)
    headers = {'Content-Type': 'application/json'}
    web_hook = f'https://oapi.dingtalk.com/robot/send?access_token={dd_access_token}&timestamp={timestamp}&sign={sign}'
    # 定義要傳送的資料
    requests.post(web_hook, data=json.dumps(text), headers=headers)


def send_msg(dd_secret: str, dd_access_token: str, job_start_time: str, total_count: int, success_count: int, fail_task_list: List[str]):
    """
    組合釘釘訊息
    :param dd_secret: 秘鑰
    :param dd_access_token: token
    :param job_start_time: 任務開始時間
    :param total_count: 總任務數
    :param success_count: 成功任務數
    :return: NONE
    """
    title = '### <font color=#CCCC00>資料同步結果</font>'
    if success_count == total_count:
        title = '### <font color=#00FF00>資料同步結果</font>'
    elif success_count == 0:
        title = '### <font color=#FF0000>資料同步結果</font>'

    end_time = datetime.now().strftime('%Y-%m-%d %H:%M:%S')
    result = {
        "msgtype": "markdown",
        "markdown": {
            "title": "資料同步結果",
            "text": title + ' \n\n\n\n- '
                    + "總同步任務數:" + str(total_count) + "\n\n- "
                    + "成功任務數:" + str(success_count) + "\n\n- "
                    + "失敗任務數" + str(total_count - success_count) + "\n\n- "
                    + "開始時間:" + str(job_start_time) + "\n\n- "
                    + "結束時間:" + str(end_time) + "\n\n- "
                    + "失敗列表:" + str(fail_task_list) + "\n\n "
        }
    }
    if success_count < total_count:
        result['markdown']['at'] = json.loads("{\"atMobiles\": [\"12345678997\"]}")
    real_send_msg(dd_secret, dd_access_token, result)


def run_job(dd_secret, dd_access_token, job_start_time, base_path: str, python_script_path: str, datax_json_path: str):
    """
    執行任務
    :param dd_secret: 秘鑰
    :param dd_access_token: token
    :param job_start_time: 任務開始時間
    :param base_path: 根路徑
    :param python_script_path: Python執行路徑
    :param datax_json_path: datax執行路徑
    :return: NONE
    """
    task_content_list = read_all_files_in_directory(base_path + "/task/config/")
    success_count = 0
    total_count = len(task_content_list)
    fail_task_list = []
    for task_content in task_content_list:
        try:
            logging.info(f"開始生成,配置檔名稱:{task_content}")
            job_config = general_datax_job_config(task_content_list[task_content])
            job_file_path = write_job_file(base_path, job_config)
            shell_path = write_task_file(base_path, python_script_path, datax_json_path, job_file_path)
            logging.info(f"shell指令碼建立成功,路徑為:{base_path}")
            # 呼叫指令碼
            call_shell(shell_path)
            success_count += 1
        except Exception as e:
            fail_task_list.append(task_content)
            logging.error(f"配置檔案:{task_content} 執行失敗", e)
    # 傳送訊息
    send_msg(dd_secret, dd_access_token, job_start_time, total_count, success_count, fail_task_list)


def call_shell(shell_path: str):
    """
    執行shell指令碼
    :param shell_path: shell指令碼路徑
    :return: NONE
    """
    logging.info(f"呼叫shell指令碼,路徑為:{shell_path}")
    result = subprocess.run(shell_path,
                            check=True,
                            shell=True,
                            universal_newlines=True,
                            stdout=subprocess.PIPE,
                            stderr=subprocess.PIPE)

    # 輸出標準輸出
    logging.info(f"shell指令碼{shell_path}標準輸出:%s", result.stdout)
    # # 輸出標準錯誤輸出
    logging.info(f"shell指令碼{shell_path}標準錯誤輸出:%s", result.stderr)
    # # 輸出返回碼
    logging.info(f"shell指令碼{shell_path}的返回碼:%s", result.returncode)


if __name__ == '__main__':
    """
    碼中臺資料同步任務指令碼
    使用前請修改如下配置資訊:
      - secret  釘釘機器人的秘鑰
      - access_token  釘釘機器人的token
      - python_path   Python的安裝路徑
      - datax_path   datax的執行檔案路徑
    """
    # 釘釘配置
    start_time = datetime.now().strftime('%Y-%m-%d %H:%M:%S')
    secret = ''
    access_token = ''
    python_path = "/usr/bin/python3"
    datax_path = "/opt/datax-k/bin/datax.py"
    # 當前指令碼檔案的目錄路徑
    script_dir = '/opt/data-job'
    curr_date_day = datetime.now().strftime('%Y-%m-%d')
    # 建立資料夾
    mkdir_if_not_exist(script_dir + "/logs/" + curr_date_day)
    logging.basicConfig(level=logging.INFO,
                        format='%(asctime)s - %(levelname)s - %(lineno)d - %(message)s',
                        filename='logs/' + curr_date_day + '/app.log',
                        filemode='w')
    run_job(secret, access_token, start_time, script_dir, python_path, datax_path)
    logging.shutdown()

  1. 同步日期的控制

我們在之前的任務同步中,遇到的問題便是日期的修改很麻煩,因此我們需要一個更加簡單的方式來進行日期的批次更新。在我們上面的排程指令碼中,包含了對日期表示式的解析,我們自定義了一種時間的表示式$[yyyyMMddHHmmss+/-N_Y] 透過解析該表示式,我們可以生成需要的任意時間,該時間表示式的含義為:

  • yyyy 表示年份
  • MM 表示月份
  • dd 表示日期
  • HH 表示24進位制小時
  • mm 表示分鐘
  • ss 表示秒
    • 表示當前時間加上N
    • 表示當前時間減去N
  • _Y 表示加減的單位,可以是YMdHms(年、月、日、時、分、秒)

透過對該表示式的解析,我們可以生成相對於當前之前或之後的任何格式的時間字串,將其用於同步的where條件中,既可以完成針對時間的解析。

  1. 如何更新日期

日期目前可以計算,但是我們需要能夠批次修改配置檔案中的WHERE條件中的時間表示式,如我們想同步8天前的資料,我們就需要將指令碼中的表示式修改為$[yyyyMMdd-8_d] ,即代表當前時間減去8天,這樣我們就可以同步八天前那一天的資料,但是我們可能想同步從8天氣到現在的所有資料,那麼我們希望我們也能批次修改where表示式中的條件,如將=改為>=。

鑑於以上的需求,我們開發了一個新的Python指令碼,透過簡單的配置,即可一次修改所有指令碼中的where條件中的表示式,這樣,我們只需要執行兩個指令碼,就完成了一切,再也不需要依次修改執行10個工作流了。

import json
import os
import logging
from typing import List, Mapping
import re
from datetime import datetime, date


def list_files_in_directory(directory_path: str) -> List[str]:
    """
    獲取目錄下的所有以.json結尾的檔案
    :param directory_path: 目錄
    :return: 檔案列表
    """
    entries = os.listdir(directory_path)
    # 過濾出所有檔案
    files = [entry for entry in entries if
             os.path.isfile(os.path.join(directory_path, entry)) 
             and entry.endswith(".json")]
    logging.info(f"讀取配置檔案數量:{len(files)}")
    return files


def read_file_content(file_path: str) -> str:
    """
    讀取檔案內容
    :param file_path: 檔案路徑
    :return: 檔案內容
    """
    with open(file_path, 'r', encoding='utf-8') as file:
        content = file.read()
    return content


def read_all_files_in_directory(directory_path: str) -> Mapping[str, str]:
    """
    讀取資料夾下面的所有檔案的內容
    :param directory_path: 資料夾路徑
    :return: 內容map
    """
    logging.info(f"開始讀取所有的配置檔案資訊")
    files = list_files_in_directory(directory_path)
    file_contents = {}
    for file in files:
        file_path = os.path.join(directory_path, file)
        content = read_file_content(file_path)
        file_contents[file] = content
    sorted_items = sorted(file_contents.items())
    sorted_dict = dict(sorted_items)
    return file_contents


def parse_where_sql(where_sql: str, sub_day: int, comparator: str = None) -> str:
    """
    解析where語句
    :param where_sql: 原始where語句
    :param sub_day: 天數
    :param comparator: 比較符  包括 = != > < >=   <=
    :return: 轉換之後的where語句
    """
    # 定義支援的型別 $[yyyyMMdd+N_Y]  $[yyyyMMdd-N_Y]
    # 正規表示式模式
    pattern = r'\$(\[.*?\])'
    matches = re.finditer(pattern, where_sql)
    for match in matches:
        matched_text = match.group(1)
        new_search = calc_datetime(matched_text, sub_day)
        where_sql = where_sql.replace(matched_text, new_search)

    legal_comparator_list = ['>==','<>', '!=', '>=', '<=', '=', '>','<']
    legal_default = '@'
    if comparator is not None:
        for legal_comparator in legal_comparator_list:
            if legal_comparator in where_sql:
                where_sql = where_sql.replace(legal_comparator, legal_default)
        where_sql = where_sql.replace(legal_default, comparator)
    return where_sql


def calc_datetime(expression: str, sub_day: int) -> str:
    """
    計算時間表示式
    :param expression: 表示式
    :param sub_day: 天數
    :return: 計算之後的值
    """
    # 替換完畢,確定是否有數字
    # 定義正規表示式模式
    pattern = r'([^0-9]+)([-+]\d+(\*\d+)?)(?:_([YMdHms]))?'
    matches = re.match(pattern, expression)
    # 輸出拆分結果
    if matches:
        date_part = matches.group(1)
        remainder = matches.group(2)
        unit = matches.group(4)
        plus_or_sub = remainder[0:1]
        if unit is not None:
            return date_part + plus_or_sub + str(sub_day) + '_' + unit + "]"
        else:
            return date_part + plus_or_sub + str(sub_day) + "]"
    else:
        return expression


def check_parma(formatted_date: str, sub_day: int, comparator: str = None):
    """
    校驗引數是否合法
    :param formatted_date: 格式化日期
    :param sub_day: 天數
    :param comparator: 運算子
    :return: NONE
    """
    legal_comparator = ['=', '<>', '!=', '>', '>=', '<', '<=']
    if formatted_date is None and sub_day is None:
        raise "formatted_date 和 sub_day不能同時為空"

    if formatted_date is not None:
        try:
            datetime.strptime(formatted_date, "%Y-%m-%d")
        except Exception as _:
            raise "formatted_date 必須是一個完整的yyyy-MM-dd日期格式, 當前sub_day={}".format(sub_day)

    if formatted_date is None and not isinstance(sub_day, int):
        raise "sub_day 必須是一個整數, 當前sub_day={}".format(sub_day)

    if comparator is not None and comparator not in legal_comparator:
        raise "comparator 不合法,合法操作列表為:{} 當前comparator={}".format(legal_comparator, comparator)


def update_file(base_path: str, sub_day: int, comparator: str = None):
    """
    更新配置檔案
    :param base_path 配置檔案根目錄
    :param sub_day  要減去的天數
    :param comparator 比較符
    """
    file_dict = read_all_files_in_directory(base_path)
    for key, value in file_dict.items():
        json_data = json.loads(value, encoding='utf-8')
        where_sql = json_data['table']['where']
        if where_sql is not None:
            new_where_sql = parse_where_sql(where_sql, sub_day, comparator)
            json_data['table']['where'] = new_where_sql

        search_tal_sql_list = json_data['table']['searchTableSql']
        if search_tal_sql_list is not None:
            new_search_table_sql_list = []
            for search_tal_sql in search_tal_sql_list:
                new_search_table_sql = parse_where_sql(search_tal_sql, sub_day)
                new_search_table_sql_list.append(new_search_table_sql)
            json_data['table']['searchTableSql'] = new_search_table_sql_list

        with open(base_path + "/" + key, "w+", encoding='utf-8') as f:
            f.write(json.dumps(json_data, ensure_ascii=False, indent=2))
        print("{} 更新完成".format(key))


if __name__ == '__main__':
    """
    更新資料同步配置檔案的日期
    """
    dir_path = r'/opt/data-job/task/config'
    # 多少天前
    day = 6
    # 要指定的日期
    date_format = '2024-11-19'
    # where表示式的條件
    comparator_symbol = '>='
    check_parma(date_format, day, comparator_symbol)
    if date_format is not None:
        # 使用date_format的值覆蓋day
        single_date = datetime.strptime(date_format, "%Y-%m-%d").date()
        current_date = date.today()
        day = (current_date - single_date).days
    update_file(dir_path, day, comparator_symbol)

  1. 透過KafkaConnector同步資料到StarRocks
    1. starrocks-connector-for-kafka的實現

StarRocks官方提供了starrocks-connector-for-kafka的實現,我們只需要在其中加入我們的資料解密邏輯即可直接使用。

package com.starrocks.connector.kafka.transforms;

public class DecryptJsonTransformation <R extends ConnectRecord<R>> implements Transformation<R> {
    private static final Logger LOG = LoggerFactory.getLogger(DecryptJsonTransformation.class);
    private AesEncryption aesEncryption;

    private interface ConfigName {
        String SECRET_KEY = "secret.key";
    }

    public static final ConfigDef CONFIG_DEF = new ConfigDef()
    .define(ConfigName.SECRET_KEY, ConfigDef.Type.STRING, ConfigDef.Importance.HIGH, "secret key");


    @Override
    public R apply(R record) {
        if (record.value() == null) {
            return record;
        }
        String value = (String) record.value();
        try {
            String newValue = aesEncryption.decrypt(value);
            JSONObject jsonObject = JSON.parseObject(newValue, JSONReader.Feature.UseBigDecimalForDoubles);
            return record.newRecord(record.topic(), record.kafkaPartition(), record.keySchema(), record.key(), null, jsonObject, record.timestamp());
        } catch (Exception e) {
            return record;
        }
    }

    @Override
    public ConfigDef config() {
        return CONFIG_DEF;
    }

    @Override
    public void close() {

    }

    @Override
    public void configure(Map<String, ?> map) {
        final SimpleConfig config = new SimpleConfig(CONFIG_DEF, map);
        String secretKey = config.getString(ConfigName.SECRET_KEY);
        aesEncryption = new AesEncryption(secretKey);
    }
}

解密的邏輯

package com.starrocks.connector.kafka;


import javax.crypto.Cipher;
import javax.crypto.SecretKey;
import javax.crypto.spec.SecretKeySpec;
import java.util.Base64;

public class AesEncryption {

    private SecretKey secretKey;

    public AesEncryption(String secretKey) {
        byte[] keyBytes = Base64.getDecoder().decode(secretKey);
        this.secretKey = new SecretKeySpec(keyBytes, 0, keyBytes.length, "AES");
    }

    public String encrypt(String data) {
        try {
            Cipher cipher = Cipher.getInstance("AES");
            cipher.init(Cipher.ENCRYPT_MODE, secretKey);
            byte[] encryptedBytes = cipher.doFinal(data.getBytes());
            return Base64.getEncoder().encodeToString(encryptedBytes);
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    public String decrypt(String encryptedData) throws Exception {
        Cipher cipher = Cipher.getInstance("AES");
        cipher.init(Cipher.DECRYPT_MODE, secretKey);
        byte[] decodedBytes = Base64.getDecoder().decode(encryptedData);
        byte[] decryptedBytes = cipher.doFinal(decodedBytes);
        return new String(decryptedBytes);
    }
}


b. 配置KafkaConnector任務

{
  "name": "mzt_ods_cjm.ods_device-connect",
  "config": {
    "connector.class": "com.starrocks.connector.kafka.StarRocksSinkConnector",
    "topics": "mzt_ods_cjm.ods_device",
    "key.converter": "org.apache.kafka.connect.storage.StringConverter",
    "value.converter": "org.apache.kafka.connect.storage.StringConverter",
    "key.converter.schemas.enable": "true",
    "value.converter.schemas.enable": "false",
    "starrocks.http.url": "IP:8050,IP:8050,IP:8050",
    "starrocks.topic2table.map": "mzt_ods_cjm.ods_device:ods_device",
    "starrocks.username": "xxxxxxx",
    "starrocks.password": "xxxxxx",
    "starrocks.database.name": "ods_cjm",
    "sink.properties.strip_outer_array": "true",
    "sink.properties.columns": "id,company_name,company_id,secret_key,",
    "sink.properties.jsonpaths": "[\"$.id\",\"$.company_name\",\"$.company_id\",\"$.secret_key\"]",
    "transforms": "decrypt",
    "transforms.decrypt.type": "com.starrocks.connector.kafka.transforms.DecryptJsonTransformation",
    "transforms.decrypt.secret.key": "5s8ekjRWkG/b64CGHHZYvQ=="
  }
}

四、備註

  1. starrocks-connector-for-kafka Kafka Connector是StarRocks資料來源聯結器
  2. DataX 批次資料同步工具
  3. kafka-console-ui Kakfa視覺化控制檯
  4. StarRocks-kafka-Connector 透過kafkaConnector匯入資料到StarRocks
  5. StreamLoad實現資料增刪改
  6. Kafka Connector的API列表
方法 路徑 說明
GET /connectors 返回活動聯結器的列表
POST /connectors 建立一個新的聯結器; 請求主體應該是包含字串name欄位和config帶有聯結器配置引數的物件欄位的JSON物件
GET /connectors/ 獲取有關特定聯結器的資訊
GET /connectors/{name}/config 獲取特定聯結器的配置引數
PUT /connectors/{name}/config 更新特定聯結器的配置引數
GET /connectors/{name}/status 獲取聯結器的當前狀態,包括聯結器是否正在執行,失敗,已暫停等,分配給哪個工作者,失敗時的錯誤資訊以及所有任務的狀態
GET /connectors/{name}/tasks 獲取當前為聯結器執行的任務列表
GET /connectors/{name}/tasks/{taskid}/status 獲取任務的當前狀態,包括如果正在執行,失敗,暫停等,分配給哪個工作人員,如果失敗,則返回錯誤資訊
PUT /connectors/{name}/pause 暫停聯結器及其任務,停止訊息處理,直到聯結器恢復
PUT /connectors/{name}/resume 恢復暫停的聯結器(或者,如果聯結器未暫停,則不執行任何操作)
POST /connectors/{name}/restart 重新啟動聯結器(通常是因為失敗)
POST /connectors/{name}/tasks/{taskId}/restart 重啟個別任務(通常是因為失敗)
DELETE /connectors/ 刪除聯結器,停止所有任務並刪除其配置

相關文章