Passing an Upstream Task's Execution Result to Downstream Tasks in Apache DolphinScheduler

Published by 海豚调度 on 2024-11-07

Original URL: https://www.cnblogs.com/DolphinScheduler/p/18533078

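01 Background

Our company's data development platform uses DolphinScheduler for task scheduling. One scenario is that, after an upstream task finishes, its execution result must be passed to the downstream task.

DolphinScheduler does support passing parameters between tasks; for details, see DolphinScheduler | Documentation Center (https://dolphinscheduler.apache.org/zh-cn/docs/3.2.2/guide/parameter/context).

However, the parameter passing shown in the official examples is configured ahead of time in the management console. So how do we pass parameters between tasks dynamically? For example, suppose we write a custom Task and, once it finishes, want to wrap its execution result and hand it to the next Task in the DAG.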

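02 Analysis

If the official DolphinScheduler examples do not show how to pass parameters dynamically, how should we developers approach this requirement?

My approach was to study DolphinScheduler's built-in Tasks, since at least one of them must pass parameters downstream. My two guesses were SqlTask and HttpTask. My reasoning: surely a Task cannot run a SQL query or make an HTTP request and then simply discard the result?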

Analyzing the HttpTask source code

Go straight to HttpTask's handle method. In DolphinScheduler, the concrete execution logic of every Task lives in its handle method.

The handle method

@Override
public void handle(TaskCallBack taskCallBack) throws TaskException {
    long startTime = System.currentTimeMillis();
    String formatTimeStamp = DateUtils.formatTimeStamp(startTime);
    String statusCode = null;
    String body = null;

    try (
            CloseableHttpClient client = createHttpClient();
            CloseableHttpResponse response = sendRequest(client)) {
        statusCode = String.valueOf(getStatusCode(response));
        body = getResponseBody(response);
        exitStatusCode = validResponse(body, statusCode);
        // As the name suggests, this handles the request result
        addDefaultOutput(body);
        long costTime = System.currentTimeMillis() - startTime;
        log.info(
                "startTime: {}, httpUrl: {}, httpMethod: {}, costTime : {} milliseconds, statusCode : {}, body : {}, log : {}",
                formatTimeStamp, httpParameters.getUrl(),
                httpParameters.getHttpMethod(), costTime, statusCode, body, output);
    } catch (Exception e) {
        appendMessage(e.toString());
        exitStatusCode = -1;
        log.error("httpUrl[" + httpParameters.getUrl() + "] connection failed：" + output, e);
        throw new TaskException("Execute http task failed", e);
    }

}

Next, look at the addDefaultOutput method

public void addDefaultOutput(String response) {
    // put response in output
    // Create the Property object
    Property outputProperty = new Property();
    // Set prop, i.e. the key: "<task name>.response"
    outputProperty.setProp(String.format("%s.%s", taskExecutionContext.getTaskName(), "response"));
    // Set the direction (IN or OUT); OUT here, because the result goes to downstream tasks
    outputProperty.setDirect(Direct.OUT);
    // Set the parameter type; VARCHAR means a string
    outputProperty.setType(DataType.VARCHAR);
    // Set the value, i.e. the HTTP response body
    outputProperty.setValue(response);
    // Key point: add the Property to varPool
    httpParameters.addPropertyToValPool(outputProperty);
}

Analyzing the SqlTask source code

The handle method

@Override
public void handle(TaskCallBack taskCallBack) throws TaskException {
    log.info("Full sql parameters: {}", sqlParameters);
    log.info(
            "sql type : {}, datasource : {}, sql : {} , localParams : {},udfs : {},showType : {},connParams : {},varPool : {} ,query max result limit  {}",
            sqlParameters.getType(),
            sqlParameters.getDatasource(),
            sqlParameters.getSql(),
            sqlParameters.getLocalParams(),
            sqlParameters.getUdfs(),
            sqlParameters.getShowType(),
            sqlParameters.getConnParams(),
            sqlParameters.getVarPool(),
            sqlParameters.getLimit());
    try {

        // get datasource
        baseConnectionParam = (BaseConnectionParam) DataSourceUtils.buildConnectionParams(dbType,
                sqlTaskExecutionContext.getConnectionParams());
        List<String> subSqls = DataSourceProcessorProvider.getDataSourceProcessor(dbType)
                .splitAndRemoveComment(sqlParameters.getSql());

        // ready to execute SQL and parameter entity Map
        List<SqlBinds> mainStatementSqlBinds = subSqls
                .stream()
                .map(this::getSqlAndSqlParamsMap)
                .collect(Collectors.toList());

        List<SqlBinds> preStatementSqlBinds = Optional.ofNullable(sqlParameters.getPreStatements())
                .orElse(new ArrayList<>())
                .stream()
                .map(this::getSqlAndSqlParamsMap)
                .collect(Collectors.toList());
        List<SqlBinds> postStatementSqlBinds = Optional.ofNullable(sqlParameters.getPostStatements())
                .orElse(new ArrayList<>())
                .stream()
                .map(this::getSqlAndSqlParamsMap)
                .collect(Collectors.toList());

        List<String> createFuncs = createFuncs(sqlTaskExecutionContext.getUdfFuncParametersList());

        // execute sql task
        // This is the method that processes the SQL result
        executeFuncAndSql(mainStatementSqlBinds, preStatementSqlBinds, postStatementSqlBinds, createFuncs);

        setExitStatusCode(TaskConstants.EXIT_CODE_SUCCESS);

    } catch (Exception e) {
        setExitStatusCode(TaskConstants.EXIT_CODE_FAILURE);
        log.error("sql task error", e);
        throw new TaskException("Execute sql task failed", e);
    }
}

So let's look at how executeFuncAndSql is implemented

public void executeFuncAndSql(List<SqlBinds> mainStatementsBinds,
                              List<SqlBinds> preStatementsBinds,
                              List<SqlBinds> postStatementsBinds,
                              List<String> createFuncs) throws Exception {
    try (
            Connection connection =
                    DataSourceClientProvider.getAdHocConnection(DbType.valueOf(sqlParameters.getType()),
                            baseConnectionParam)) {

        // create temp function
        if (CollectionUtils.isNotEmpty(createFuncs)) {
            createTempFunction(connection, createFuncs);
        }

        // pre execute
        executeUpdate(connection, preStatementsBinds, "pre");

        // main execute
        String result = null;
        // decide whether to executeQuery or executeUpdate based on sqlType
        if (sqlParameters.getSqlType() == SqlType.QUERY.ordinal()) {
            // query statements need to be convert to JsonArray and inserted into Alert to send
            result = executeQuery(connection, mainStatementsBinds.get(0), "main");
        } else if (sqlParameters.getSqlType() == SqlType.NON_QUERY.ordinal()) {
            // non query statement
            String updateResult = executeUpdate(connection, mainStatementsBinds, "main");
            result = setNonQuerySqlReturn(updateResult, sqlParameters.getLocalParams());
        }
        // deal out params
        // This method processes the result
        sqlParameters.dealOutParam(result);

        // post execute
        executeUpdate(connection, postStatementsBinds, "post");
    } catch (Exception e) {
        log.error("execute sql error: {}", e.getMessage());
        throw e;
    }
}

dealOutParam shows how the result is actually handled

public void dealOutParam(String result) {
    if (CollectionUtils.isEmpty(localParams)) {
        return;
    }
    List<Property> outProperty = getOutProperty(localParams);
    if (CollectionUtils.isEmpty(outProperty)) {
        return;
    }
    if (StringUtils.isEmpty(result)) {
        varPool = VarPoolUtils.mergeVarPool(Lists.newArrayList(varPool, outProperty));
        return;
    }
    List<Map<String, String>> sqlResult = getListMapByString(result);
    if (CollectionUtils.isEmpty(sqlResult)) {
        return;
    }
    // if sql return more than one line
    if (sqlResult.size() > 1) {
        Map<String, List<String>> sqlResultFormat = new HashMap<>();
        // init sqlResultFormat
        Set<String> keySet = sqlResult.get(0).keySet();
        for (String key : keySet) {
            sqlResultFormat.put(key, new ArrayList<>());
        }
        for (Map<String, String> info : sqlResult) {
            for (String key : info.keySet()) {
                sqlResultFormat.get(key).add(String.valueOf(info.get(key)));
            }
        }
        for (Property info : outProperty) {
            if (info.getType() == DataType.LIST) {
                info.setValue(JSONUtils.toJsonString(sqlResultFormat.get(info.getProp())));
            }
        }
    } else {
        // result only one line
        Map<String, String> firstRow = sqlResult.get(0);
        for (Property info : outProperty) {
            info.setValue(String.valueOf(firstRow.get(info.getProp())));
        }
    }
    
    // In essence, the processed SQL result is still stored in varPool; varPool is the key
    varPool = VarPoolUtils.mergeVarPool(Lists.newArrayList(varPool, outProperty));

}
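
The multi-row branch of dealOutParam turns row-oriented results into one list of values per column before serializing each list into an OUT parameter of type LIST. The following is a minimal, self-contained sketch of just that reshaping step, using only the JDK. The class name ColumnPivotSketch, the pivot and sampleRows helpers, and the sample data are all hypothetical and exist only for this illustration; unlike the original, the sketch uses computeIfAbsent so a column missing from the first row does not cause a NullPointerException.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class ColumnPivotSketch {

    // Reshape a list of rows (column name -> value) into
    // one list of values per column, keeping row order.
    static Map<String, List<String>> pivot(List<Map<String, String>> rows) {
        Map<String, List<String>> columns = new LinkedHashMap<>();
        for (Map<String, String> row : rows) {
            for (Map.Entry<String, String> cell : row.entrySet()) {
                columns.computeIfAbsent(cell.getKey(), k -> new ArrayList<>())
                        .add(String.valueOf(cell.getValue()));
            }
        }
        return columns;
    }

    // Hypothetical query result with two rows and two columns.
    static List<Map<String, String>> sampleRows() {
        Map<String, String> first = new LinkedHashMap<>();
        first.put("id", "1");
        first.put("name", "a");
        Map<String, String> second = new LinkedHashMap<>();
        second.put("id", "2");
        second.put("name", "b");
        return Arrays.asList(first, second);
    }

    public static void main(String[] args) {
        // Each column now maps to the values of every row for that column.
        System.out.println(pivot(sampleRows()));
    }
}
```

In the real method, only OUT parameters declared with DataType.LIST receive these per-column lists, and the key of each parameter must match a column name in the result.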

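So, having read the source this far, we know that to pass parameters dynamically we need to wrap the data to be passed in an org.apache.dolphinscheduler.plugin.task.api.model.Property and then add it to the built-in collection variable org.apache.dolphinscheduler.plugin.task.api.parameters.AbstractParameters#varPool.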

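03 Implementation

The steps for implementing a custom Task are not covered here, since they are not the focus of this article.

Once the custom Task is implemented, dynamic parameter passing can be coded like this: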

Property outputProperty = new Property();
// Set the key of the data to pass
outputProperty.setProp("xxxxKey");
// Direction: OUT, because the data goes to downstream tasks
outputProperty.setDirect(Direct.OUT);
// Use the type that matches the data being passed; handling data as a JSON string is recommended
outputProperty.setType(DataType.VARCHAR);
// Set the value of the data to pass
outputProperty.setValue("xxxxValue");
// xxxxParameters is our own custom class; normally each Task has one matching Parameters class
xxxxParameters.addPropertyToValPool(outputProperty);
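
To experiment with this pattern without a DolphinScheduler build, a minimal, self-contained sketch might look like the following. The Property, Direct, DataType, and DemoParameters types here are simplified stand-ins written for this illustration, not the real DolphinScheduler classes, and the replace-by-key behavior of addPropertyToValPool is an assumption. The value is a hand-built JSON string, in line with the suggestion to pass data as JSON.

```java
import java.util.ArrayList;
import java.util.List;

public class DynamicOutputSketch {

    enum Direct { IN, OUT }
    enum DataType { VARCHAR, LIST }

    // Simplified stand-in for DolphinScheduler's Property (hypothetical, not the real class).
    static class Property {
        String prop;
        Direct direct;
        DataType type;
        String value;
    }

    // Hypothetical Parameters class for a custom Task; it only holds the var pool.
    static class DemoParameters {
        final List<Property> varPool = new ArrayList<>();

        // Assumption: an entry with the same key replaces the previous one.
        void addPropertyToValPool(Property property) {
            varPool.removeIf(p -> p.prop.equals(property.prop));
            varPool.add(property);
        }
    }

    // Wrap a key/value pair as an OUT parameter of type VARCHAR.
    static Property outProperty(String key, String value) {
        Property property = new Property();
        property.prop = key;
        property.direct = Direct.OUT;
        property.type = DataType.VARCHAR;
        property.value = value;
        return property;
    }

    public static void main(String[] args) {
        DemoParameters parameters = new DemoParameters();
        // A structured result serialized by hand as a JSON string.
        String json = "{\"rows\":3,\"status\":\"ok\"}";
        parameters.addPropertyToValPool(outProperty("xxxxKey", json));
        for (Property p : parameters.varPool) {
            System.out.println(p.prop + " = " + p.value);
        }
    }
}
```

In a real custom Task, the equivalent call would go at the end of the handle method, once the result to publish is known.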

Internally, DolphinScheduler has logic that converts the List<Property> varPool into a Map<String, Property> varParams, merges varParams with the other parameters, and finally sets the data on Map<String, Property> prepareParamsMap through taskExecutionContext.setPrepareParamsMap(propertyMap).
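
The list-to-map conversion and merge described above can be pictured with a small sketch. This is not DolphinScheduler's actual code: the Property class is a simplified stand-in, the method names toVarParams and prepareParams are hypothetical, and the precedence on key clashes (var pool entries overriding the other parameters) is an assumption made only for illustration.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class VarPoolMergeSketch {

    // Simplified stand-in for DolphinScheduler's Property (key and value only).
    static class Property {
        final String prop;
        final String value;

        Property(String prop, String value) {
            this.prop = prop;
            this.value = value;
        }
    }

    // Index the var pool by key; a later entry with the same key replaces an earlier one.
    static Map<String, Property> toVarParams(List<Property> varPool) {
        Map<String, Property> varParams = new LinkedHashMap<>();
        for (Property property : varPool) {
            varParams.put(property.prop, property);
        }
        return varParams;
    }

    // Merge the other parameters with varParams.
    // Assumption for this sketch: varParams win when keys clash.
    static Map<String, Property> prepareParams(Map<String, Property> otherParams,
                                               List<Property> varPool) {
        Map<String, Property> merged = new LinkedHashMap<>(otherParams);
        merged.putAll(toVarParams(varPool));
        return merged;
    }

    public static void main(String[] args) {
        Map<String, Property> other = new LinkedHashMap<>();
        other.put("bizDate", new Property("bizDate", "2024-11-07"));
        List<Property> varPool = Arrays.asList(new Property("xxxxKey", "xxxxValue"));

        Map<String, Property> prepared = prepareParams(other, varPool);
        // The downstream task can now look up the upstream value by its key.
        System.out.println(prepared.get("xxxxKey").value);
    }
}
```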

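04 Summary

I won't go over what DolphinScheduler is, what it can do, or how to use it here. If you are interested, see the official documentation: DolphinScheduler | Documentation Center (https://dolphinscheduler.apache.org/zh-cn/docs/3.2.2)

I hope this article helps readers master dynamic parameter passing between Tasks and apply it in real work. If it has been of some help to you, I will be glad.

Publication of this article is supported by 白鯨開源 (WhaleOps).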
