Apache DolphinScheduler Data Quality Source Code Analysis

Published by 海豚调度 on 2024-11-19

Apache DolphinScheduler is a distributed, easily extensible, visual data workflow task scheduling system, widely used in data scheduling and processing.

In large-scale data engineering projects, managing data quality is crucial, and DolphinScheduler also provides the computing capability for data quality checks. This article analyzes the source code of Apache DolphinScheduler's data quality module to help developers understand the implementation principles and design ideas behind it.

Data Quality Rules

The Apache DolphinScheduler data quality module supports a variety of commonly used data quality rules, listed below.

The data quality rules mainly include null-value checks, custom SQL, cross-table accuracy, cross-table value comparison, field length checks, uniqueness checks, timeliness checks, enumeration value checks, table row count checks, and more.

Data Quality Workflow

The data quality execution flow consists of two parts:

(1) The data quality check workflow is defined on the web UI, scheduled by DolphinScheduler, and submitted to the Spark compute engine;

(2) The Spark side parses the parameters of the data quality rule and completes the check task in three steps: reading the data, executing the transformations, and writing the output.

Defining the Workflow on the Web UI

The data quality task definition used in this example contains only a single node.

Take the input parameters of a null-value check as an example: once the configuration is completed in the UI, a JSON file is generated.

This JSON file is then submitted as a string parameter to the Spark cluster, where it is scheduled and executed.

The JSON file looks like this:


{
    "name": "$t(null_check)",
    "env": {
        "type": "batch",
        "config": null
    },
    "readers": [
        {
            "type": "JDBC",
            "config": {
                "database": "ops",
                "password": "***",
                "driver": "com.mysql.cj.jdbc.Driver",
                "user": "root",
                "output_table": "ops_ms_alarm",
                "table": "ms_alarm",
                "url": "jdbc:mysql://192.168.3.211:3306/ops?allowLoadLocalInfile=false&autoDeserialize=false&allowLocalInfile=false&allowUrlInLocalInfile=false"
            }
        }
    ],
    "transformers": [
        {
            "type": "sql",
            "config": {
                "index": 1,
                "output_table": "total_count",
                "sql": "SELECT COUNT(*) AS total FROM ops_ms_alarm"
            }
        },
        {
            "type": "sql",
            "config": {
                "index": 2,
                "output_table": "null_items",
                "sql": "SELECT * FROM ops_ms_alarm WHERE (alarm_time is null or alarm_time = '') "
            }
        },
        {
            "type": "sql",
            "config": {
                "index": 3,
                "output_table": "null_count",
                "sql": "SELECT COUNT(*) AS nulls FROM null_items"
            }
        }
    ],
    "writers": [
        {
            "type": "JDBC",
            "config": {
                "database": "dolphinscheduler3",
                "password": "***",
                "driver": "com.mysql.cj.jdbc.Driver",
                "user": "root",
                "table": "t_ds_dq_execute_result",
                "url": "jdbc:mysql://192.168.3.212:3306/dolphinscheduler3?characterEncoding=utf-8&allowLoadLocalInfile=false&autoDeserialize=false&allowLocalInfile=false&allowUrlInLocalInfile=false",
                "sql": "select 0 as rule_type,'$t(null_check)' as rule_name,0 as process_definition_id,25 as process_instance_id,26 as task_instance_id,null_count.nulls AS statistics_value,total_count.total AS comparison_value,7 AS comparison_type,3 as check_type,0.95 as threshold,3 as operator,1 as failure_strategy,'hdfs://xmaster:9000/user/hadoop/data_quality_error_data/0_25_211-ops-ms_alarm-空值檢測' as error_output_path,'2022-11-16 03:40:32' as create_time,'2022-11-16 03:40:32' as update_time from null_count full join total_count"
            }
        },
        {
            "type": "JDBC",
            "config": {
                "database": "dolphinscheduler3",
                "password": "***",
                "driver": "com.mysql.cj.jdbc.Driver",
                "user": "root",
                "table": "t_ds_dq_task_statistics_value",
                "url": "jdbc:mysql://192.168.3.212:3306/dolphinscheduler3?characterEncoding=utf-8&allowLoadLocalInfile=false&autoDeserialize=false&allowLocalInfile=false&allowUrlInLocalInfile=false",
                "sql": "select 0 as process_definition_id,26 as task_instance_id,1 as rule_id,'ZKTZKDBTRFDKXKQUDNZJVKNX8OIAEVLQ91VT2EXZD3U=' as unique_code,'null_count.nulls'AS statistics_name,null_count.nulls AS statistics_value,'2022-11-16 03:40:32' as data_time,'2022-11-16 03:40:32' as create_time,'2022-11-16 03:40:32' as update_time from null_count"
            }
        },
        {
            "type": "hdfs_file",
            "config": {
                "path": "hdfs://xmaster:9000/user/hadoop/data_quality_error_data/0_25_211-ops-ms_alarm-空值檢測",
                "input_table": "null_items"
            }
        }
    ]
}

Source Code Analysis on the Spark Side

DataQualityApplication.java is the entry point of the Spark program.


public static void main(String[] args) throws Exception {
    //...
    // Get the parameter from the command line
    String dataQualityParameter = args[0];
    // Convert the JSON parameter into a DataQualityConfiguration object
    DataQualityConfiguration dataQualityConfiguration = JsonUtils.fromJson(dataQualityParameter, DataQualityConfiguration.class);
    //...
    // Build the Config object used to create the SparkRuntimeEnvironment
    EnvConfig envConfig = dataQualityConfiguration.getEnvConfig();
    Config config = new Config(envConfig.getConfig());
    config.put("type", envConfig.getType());
    if (Strings.isNullOrEmpty(config.getString(SPARK_APP_NAME))) {
        config.put(SPARK_APP_NAME, dataQualityConfiguration.getName());
    }

    SparkRuntimeEnvironment sparkRuntimeEnvironment = new SparkRuntimeEnvironment(config);
    // Delegate execution to DataQualityContext
    DataQualityContext dataQualityContext = new DataQualityContext(sparkRuntimeEnvironment, dataQualityConfiguration);
    dataQualityContext.execute();
}
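
The SparkRuntimeEnvironment class itself is not shown here. As a rough, hedged sketch (not the verbatim source), it can be thought of as a thin wrapper that builds a single SparkSession from the Config assembled in main() and records whether the job runs in batch mode; the configuration keys "type" and "spark.app.name" below are assumptions based on the code above.

import org.apache.spark.SparkConf;
import org.apache.spark.sql.SparkSession;

// Hedged sketch of SparkRuntimeEnvironment: build one SparkSession from the Config and
// expose it to the readers/transformers/writers. Key names are assumptions for illustration.
public class SparkRuntimeEnvironmentSketch {

    private final SparkSession sparkSession;
    private final String type; // "batch" or "stream", copied from env.type in the rule JSON

    public SparkRuntimeEnvironmentSketch(Config config) {
        this.type = config.getString("type");
        SparkConf sparkConf = new SparkConf().setAppName(config.getString("spark.app.name"));
        this.sparkSession = SparkSession.builder().config(sparkConf).getOrCreate();
    }

    public SparkSession sparkSession() {
        return sparkSession;
    }

    public boolean isBatch() {
        return "batch".equalsIgnoreCase(type);
    }
}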

The data quality configuration class:

public class DataQualityConfiguration implements IConfig {
    @JsonProperty("name")
    private String name; // name
    @JsonProperty("env")
    private EnvConfig envConfig; // environment configuration
    @JsonProperty("readers")
    private List<ReaderConfig> readerConfigs; // reader configurations
    @JsonProperty("transformers")
    private List<TransformerConfig> transformerConfigs; // transformer configurations
    @JsonProperty("writers")
    private List<WriterConfig> writerConfigs; // writer configurations
    //...
}
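
The ReaderConfig, TransformerConfig and WriterConfig classes are not listed in this article; judging from the JSON shown earlier, each of them is essentially a type plus a free-form config map. A hedged sketch of what such a config class roughly looks like (field names inferred from the JSON, not the verbatim source):

import java.util.Map;
import com.fasterxml.jackson.annotation.JsonProperty;

// Hedged sketch of a reader entry in the rule JSON: a reader type plus its free-form config map.
public class ReaderConfigSketch {

    @JsonProperty("type")
    private String type;                 // e.g. "JDBC" or "HIVE"

    @JsonProperty("config")
    private Map<String, Object> config;  // driver, url, user, password, table, output_table, ...

    public String getType() {
        return type;
    }

    public Map<String, Object> getConfig() {
        return config;
    }
}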

DataQualityContext#execute obtains the readers, transformers and writers from the dataQualityConfiguration object and delegates execution to SparkBatchExecution.


public void execute() throws DataQualityException {
    // Convert List<ReaderConfig> into List<BatchReader>
    List<BatchReader> readers = ReaderFactory
            .getInstance()
            .getReaders(this.sparkRuntimeEnvironment, dataQualityConfiguration.getReaderConfigs());
    // Convert List<TransformerConfig> into List<BatchTransformer>
    List<BatchTransformer> transformers = TransformerFactory
            .getInstance()
            .getTransformer(this.sparkRuntimeEnvironment, dataQualityConfiguration.getTransformerConfigs());
    // Convert List<WriterConfig> into List<BatchWriter>
    List<BatchWriter> writers = WriterFactory
            .getInstance()
            .getWriters(this.sparkRuntimeEnvironment, dataQualityConfiguration.getWriterConfigs());
    // Check the Spark runtime environment
    if (sparkRuntimeEnvironment.isBatch()) {
        // batch mode
        sparkRuntimeEnvironment.getBatchExecution().execute(readers, transformers, writers);
    } else {
        // stream mode, not supported yet
        throw new DataQualityException("stream mode is not supported now");
    }
}

At present, Apache DolphinScheduler does not support quality checks on real-time (streaming) data.

The ReaderFactory class uses the singleton and factory-method design patterns. It currently supports reading from JDBC and Hive data sources, with the corresponding reader classes JdbcReader and HiveReader.

The WriterFactory class uses the same singleton and factory-method design patterns. It currently supports writing to JDBC, HDFS and local file targets, with the corresponding writer classes JdbcWriter, HdfsFileWriter and LocalFileWriter.

The TransformerFactory class likewise uses the singleton and factory-method design patterns; currently only the TransformerType.SQL transformer type is supported.
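
The three factories share the same shape. A simplified, illustrative sketch of the singleton plus factory-method pattern (class and method names here are my own, not the verbatim source) could look like this:

// Illustrative sketch of the singleton + factory-method pattern used by the factories:
// a single shared instance and a dispatch on the configured type that returns the matching reader.
public class ReaderFactorySketch {

    private static final ReaderFactorySketch INSTANCE = new ReaderFactorySketch();

    private ReaderFactorySketch() {
    }

    public static ReaderFactorySketch getInstance() {
        return INSTANCE;
    }

    public BatchReader getReader(ReaderConfig readerConfig) {
        switch (readerConfig.getType().toUpperCase()) {
            case "JDBC":
                return new JdbcReader(new Config(readerConfig.getConfig()));
            case "HIVE":
                return new HiveReader(new Config(readerConfig.getConfig()));
            default:
                throw new IllegalArgumentException("unsupported reader type: " + readerConfig.getType());
        }
    }
}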

Looking at the JSON together with this, we can see what the reader, transformers and writers of a null-value check look like:

1 Reader: reads the data from the source table.
3 Transformers:
total_count: the total row count
null_items: the null-value rows (the error rows)
null_count: the number of null values

The transformation SQL statements are as follows:

-- SELECT COUNT(*) AS total FROM ops_ms_alarm
-- SELECT * FROM ops_ms_alarm WHERE (alarm_time is null or alarm_time = '')
-- SELECT COUNT(*) AS nulls FROM null_items

3 Writers: the first is a JDBC writer that writes the comparison value and the statistics value into t_ds_dq_execute_result, the data quality execution result table.

SELECT
    -- ...
    null_count.nulls AS statistics_value,
    total_count.total AS comparison_value,
    -- ...
    'hdfs://xmaster:9000/user/hadoop/data_quality_error_data/0_25_211-ops-ms_alarm-空值檢測' AS error_output_path,
    -- ...
FROM
    null_count
    FULL JOIN total_count

The second is a JDBC writer that writes the statistics_value into the table t_ds_dq_task_statistics_value:


SELECT
    -- ...
    'null_count.nulls' AS statistics_name,
    null_count.nulls AS statistics_value,
    -- ...
FROM
    null_count

The third is an HDFS writer that writes the null rows (the error data) to an HDFS directory:

{
    "type": "hdfs_file",
    "config": {
        "path": "hdfs://xmaster:9000/user/hadoop/data_quality_error_data/0_25_211-ops-ms_alarm-空值檢測",
        "input_table": "null_items"
    }
}
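
HdfsFileWriter itself is not shown in this article. Conceptually, all it needs to do for this configuration is persist the rows of the input_table temp view under the configured path. A hedged sketch follows; the CSV format and overwrite mode are assumptions for illustration, not taken from the source.

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SaveMode;

// Hedged sketch of an HDFS file writer: persist the error rows under the configured path.
// The CSV format and overwrite mode are assumptions; the real writer may choose differently.
public class HdfsFileWriterSketch implements BatchWriter {

    private final Config config;

    public HdfsFileWriterSketch(Config config) {
        this.config = config;
    }

    @Override
    public void write(Dataset<Row> data, SparkRuntimeEnvironment env) {
        data.write()
            .mode(SaveMode.Overwrite)
            .csv(config.getString("path")); // e.g. hdfs://.../data_quality_error_data/...
    }
}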

SparkBatchExecution#execute


public class SparkBatchExecution implements Execution<BatchReader, BatchTransformer, BatchWriter> {
    private final SparkRuntimeEnvironment environment;

    public SparkBatchExecution(SparkRuntimeEnvironment environment) throws ConfigRuntimeException {
        this.environment = environment;
    }

    @Override
    public void execute(List<BatchReader> readers, List<BatchTransformer> transformers, List<BatchWriter> writers) {
        // Register an input temp view for every reader
        readers.forEach(reader -> registerInputTempView(reader, environment));

        if (!readers.isEmpty()) {
            // Read the dataset from the first reader in the list; the reader implementations are HiveReader and JdbcReader
            Dataset<Row> ds = readers.get(0).read(environment);
            for (BatchTransformer tf : transformers) {
                // Execute the transformation
                ds = executeTransformer(environment, tf, ds);
                // Register the transformed result as a temp view
                registerTransformTempView(tf, ds);
            }

            for (BatchWriter sink : writers) {
                // Output the transformed result through the writer; the writer implementations are JdbcWriter, LocalFileWriter and HdfsFileWriter
                executeWriter(environment, sink, ds);
            }
        }
        // Done
        environment.sparkSession().stop();
    }
}

SparkBatchExecution#registerInputTempView

// Register the input temp view; the view name is the value of the reader's OUTPUT_TABLE config
private void registerInputTempView(BatchReader reader, SparkRuntimeEnvironment environment) {
    Config conf = reader.getConfig();
    if (Boolean.TRUE.equals(conf.has(OUTPUT_TABLE))) { // e.g. ops_ms_alarm
        String tableName = conf.getString(OUTPUT_TABLE);
        registerTempView(tableName, reader.read(environment));
    } else {
        throw new ConfigRuntimeException(
            "[" + reader.getClass().getName() + "] must be registered as dataset, please set \"output_table\" config");
    }
}

registerTempView simply calls Dataset.createOrReplaceTempView:

private void registerTempView(String tableName, Dataset<Row> ds) {
    if (ds != null) {
        ds.createOrReplaceTempView(tableName);
    } else {
        throw new ConfigRuntimeException("dataset is null, can not createOrReplaceTempView");
    }
}

executeTransformer performs the transformation:


private Dataset<Row> executeTransformer(SparkRuntimeEnvironment environment, BatchTransformer transformer, Dataset<Row> dataset) {
    Config config = transformer.getConfig();
    Dataset<Row> inputDataset;
    Dataset<Row> outputDataset = null;
    if (Boolean.TRUE.equals(config.has(INPUT_TABLE))) {
        // Get the table names from INPUT_TABLE
        String[] tableNames = config.getString(INPUT_TABLE).split(",");

        // outputDataset is the union of all the input datasets
        for (String sourceTableName : tableNames) {
            inputDataset = environment.sparkSession().read().table(sourceTableName);
            if (outputDataset == null) {
                outputDataset = inputDataset;
            } else {
                outputDataset = outputDataset.union(inputDataset);
            }
        }
    } else {
        // No INPUT_TABLE in the configuration, use the dataset passed in
        outputDataset = dataset;
    }
    // If TMP_TABLE is configured, register outputDataset as a temp view under that name
    if (Boolean.TRUE.equals(config.has(TMP_TABLE))) {
        if (outputDataset == null) {
            outputDataset = dataset;
        }
        String tableName = config.getString(TMP_TABLE);
        registerTempView(tableName, outputDataset);
    }
    // Run the transformer
    return transformer.transform(outputDataset, environment);
}

SqlTransformer#transform ultimately uses Spark SQL for the processing, so the core of the whole task is the SQL statement itself, which must be generated on the web side; see the JSON file above.


public class SqlTransformer implements BatchTransformer {
    private final Config config;
    public SqlTransformer(Config config) {
        this.config = config;
    }
//...
    @Override
    public Dataset<Row> transform(Dataset<Row> data, SparkRuntimeEnvironment env) {
        return env.sparkSession().sql(config.getString(SQL));
    }
}

executeWriter outputs the data to the configured destination:


private void executeWriter(SparkRuntimeEnvironment environment, BatchWriter writer, Dataset<Row> ds) {
    Config config = writer.getConfig();
    Dataset<Row> inputDataSet = ds;
    if (Boolean.TRUE.equals(config.has(INPUT_TABLE))) {
        String sourceTableName = config.getString(INPUT_TABLE);
        inputDataSet = environment.sparkSession().read().table(sourceTableName);
    }

    writer.write(inputDataSet, environment);
}
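
The concrete writer implementations are not shown above. For the two JDBC writers in the rule JSON, the essential behavior is to evaluate the configured sql over the temp views registered earlier (null_count, total_count, ...) and append the result to the target table through Spark's JDBC data source. A hedged sketch follows; option handling is simplified, and the real JdbcWriter may differ in detail.

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SaveMode;

// Hedged sketch of a JDBC writer: evaluate the configured SQL over the registered temp views
// and append the result to the target table via Spark's JDBC data source.
public class JdbcWriterSketch implements BatchWriter {

    private final Config config;

    public JdbcWriterSketch(Config config) {
        this.config = config;
    }

    @Override
    public void write(Dataset<Row> data, SparkRuntimeEnvironment env) {
        // If a "sql" entry is configured, run it; otherwise fall back to the dataset passed in
        Dataset<Row> result = Boolean.TRUE.equals(config.has("sql"))
                ? env.sparkSession().sql(config.getString("sql"))
                : data;
        result.write()
              .format("jdbc")
              .option("url", config.getString("url"))
              .option("dbtable", config.getString("table"))
              .option("user", config.getString("user"))
              .option("password", config.getString("password"))
              .option("driver", config.getString("driver"))
              .mode(SaveMode.Append)
              .save();
    }
}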

Overall, the data quality check implementation in Apache DolphinScheduler is fairly straightforward: the computation is essentially done with Spark SQL. In this article we walked through the source code structure and implementation logic of the data quality module. Its design emphasizes flexibility and extensibility, which allows it to adapt to the diverse requirements of different organizations.

For developers, understanding the source code in depth not only helps you use DolphinScheduler better, it also provides direction and inspiration for extending its functionality. We hope this article is helpful for your work on data quality control and your exploration of open-source projects.

This article is published with the support of 白鯨開源 (WhaleOps)!
