[Source Code Analysis] GroupReduce, GroupCombine, and Flink SQL group by

Posted by 羅西的思考 on 2020-06-16

Original URL: https://www.cnblogs.com/rossiXYZ/p/13148695.html


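0x00 Abstract

Starting from source code and a concrete example, this article explains what GroupReduce and GroupCombine are for in Flink. It also touches on how Flink SQL implements group by internally.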

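0x01 Motivation

In the previous article, "[Source Code Analysis] What Flink's groupBy and reduce actually do", we dissected what grouping and reduce do and got a first look at combine. That still left a gap, because Flink also offers several related operators such as GroupReduce and GroupCombine. Leaving them unexplained felt like a bone stuck in the throat, but I had not found a good example to analyze them with.

This article is a by-product of my investigation into a Flink SQL UDF problem. I set out to debug a piece of SQL and discovered that Flink itself provides a perfect example of GroupReduce and GroupCombine in use. So I am sharing it here, and we can analyze together how these two operators are actually used.

Note: the example is Flink SQL, so this article also covers how Flink SQL implements group by internally.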

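0x02 Concepts

The official Flink documentation describes the two operators as follows.

2.1 GroupReduce

The GroupReduce operator is applied to a grouped DataSet and calls a user-defined group-reduce function once for each group. The difference from Reduce is that the user-defined function receives the whole group at once.

Flink invokes the function with an Iterable over all elements of the group, and the function can return an arbitrary number of result elements.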
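To make the difference concrete, here is a minimal plain-Java sketch. It does not use Flink at all: the class ReduceVsGroupReduce and its methods are hypothetical names invented for this illustration, and the call counters only model how often a user function would be invoked per group.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.BiConsumer;
import java.util.function.BinaryOperator;

// Plain-Java illustration only; none of these names are Flink APIs.
class ReduceVsGroupReduce {

    // Group values by key; TreeMap keeps the key order deterministic.
    static Map<String, List<Integer>> group(String[] keys, int[] values) {
        Map<String, List<Integer>> groups = new TreeMap<>();
        for (int i = 0; i < keys.length; i++) {
            groups.computeIfAbsent(keys[i], k -> new ArrayList<>()).add(values[i]);
        }
        return groups;
    }

    // Reduce style: the user function only ever sees two elements at a time.
    static List<Integer> reduce(Map<String, List<Integer>> groups,
                                BinaryOperator<Integer> fn, AtomicInteger calls) {
        List<Integer> out = new ArrayList<>();
        for (List<Integer> g : groups.values()) {
            Integer acc = g.get(0);
            for (int i = 1; i < g.size(); i++) {
                calls.incrementAndGet();
                acc = fn.apply(acc, g.get(i));
            }
            out.add(acc);
        }
        return out;
    }

    // GroupReduce style: one call per group, with the whole group as an Iterable,
    // and the function may emit any number of results.
    static List<Integer> reduceGroup(Map<String, List<Integer>> groups,
                                     BiConsumer<Iterable<Integer>, List<Integer>> fn,
                                     AtomicInteger calls) {
        List<Integer> out = new ArrayList<>();
        for (List<Integer> g : groups.values()) {
            calls.incrementAndGet();
            fn.accept(g, out);
        }
        return out;
    }

    // Example group function: emits two results per group (minimum and maximum).
    static void minMax(Iterable<Integer> values, List<Integer> out) {
        int min = Integer.MAX_VALUE;
        int max = Integer.MIN_VALUE;
        for (int v : values) {
            min = Math.min(min, v);
            max = Math.max(max, v);
        }
        out.add(min);
        out.add(max);
    }

    public static void main(String[] args) {
        Map<String, List<Integer>> groups = group(
                new String[] {"a", "a", "a", "a", "b", "b"},
                new int[] {1, 2, 3, 4, 10, 20});

        AtomicInteger reduceCalls = new AtomicInteger();
        System.out.println(reduce(groups, Integer::sum, reduceCalls) + " calls=" + reduceCalls);

        AtomicInteger groupCalls = new AtomicInteger();
        System.out.println(reduceGroup(groups, ReduceVsGroupReduce::minMax, groupCalls)
                + " calls=" + groupCalls);
    }
}
```

With six input elements in two groups, the pairwise version invokes the user function four times, while the group version invokes it twice and still emits two results per group.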

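2.2 GroupCombine

The GroupCombine transformation is the generalized form of the combine step in a combinable GroupReduceFunction. It is generalized in the sense that it allows combining input type I into an arbitrary output type O. In contrast, the combine step of a GroupReduce only allows combining from input type I to output type I, because the reduce step of the GroupReduceFunction expects input type I.

In some applications it is desirable to combine a DataSet into an intermediate format before applying further transformations (for example, to reduce the data size). A CombineGroup transformation achieves this at very low cost.

Note: GroupCombine on a grouped DataSet is performed in memory with a greedy strategy, which may not process all the data at once but in multiple steps. It also runs on the individual partitions without a data exchange, unlike a GroupReduce transformation. This may lead to partial results, so GroupCombine cannot replace GroupReduce, even though the two operations may look alike.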
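The "partial results" caveat is easy to see in a small plain-Java sketch. Nothing here is a Flink API: partitions are modelled as ordinary lists, and the class CombineThenReduce with its combine and reduce methods is a hypothetical stand-in for the two steps. Note that combine maps String (type I) to Map.Entry<String, Integer> (type O).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java illustration only; partitions are modelled as lists, nothing here is a Flink API.
class CombineThenReduce {

    // Combine step: runs on a single partition, so the counts it emits are partial.
    static List<Map.Entry<String, Integer>> combine(List<String> partition) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String word : partition) {
            counts.merge(word, 1, Integer::sum);
        }
        return new ArrayList<>(counts.entrySet());
    }

    // Final step: sees every partial count of a word (after a data exchange) and sums them.
    static Map<String, Integer> reduce(List<Map.Entry<String, Integer>> partials) {
        Map<String, Integer> totals = new TreeMap<>();
        for (Map.Entry<String, Integer> p : partials) {
            totals.merge(p.getKey(), p.getValue(), Integer::sum);
        }
        return totals;
    }

    public static void main(String[] args) {
        List<String> partition1 = Arrays.asList("flink", "sql", "flink");
        List<String> partition2 = Arrays.asList("flink", "sql");

        List<Map.Entry<String, Integer>> partials = new ArrayList<>(combine(partition1));
        partials.addAll(combine(partition2));

        System.out.println(partials);         // each word appears once per partition
        System.out.println(reduce(partials)); // one total per word
    }
}
```

After the combine step each word still appears once per partition, which is why a final reduce over all partial counts is needed to obtain one total per word.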

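2.3 Example

Feeling a little dizzy? Let the code speak. The following example from the official documentation shows how the CombineGroup and GroupReduce transformations can be used together to implement WordCount. The combine step first counts words partially within each partition, and reduceGroup then sums those partial counts into the final result. Because the combine step has already pre-aggregated the data, far less data has to be transferred between the operators.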

DataSet<String> input = [..] // The words received as input

// Combine step: pre-count the words inside each partition. The user-defined combine function
// is called once per group of buffered records rather than once per record, because the runtime
// hands it the whole group through an Iterable. A word may still yield several partial counts.
DataSet<Tuple2<String, Integer>> combinedWords = input
  .groupBy(word -> word) // group identical words; a plain String needs a key selector, not a field position
  .combineGroup(new GroupCombineFunction<String, Tuple2<String, Integer>>() {

    @Override
    public void combine(Iterable<String> words, Collector<Tuple2<String, Integer>> out) { // combine
        String key = null;
        int count = 0;

        for (String word : words) {
            key = word;
            count++;
        }
        // emit tuple with word and partial count
        out.collect(new Tuple2<>(key, count));
    }
});

// Reduce step: sum the partial counts. The user-defined reduce function is called once per word,
// because the runtime hands it all partial counts of that word at once.
DataSet<Tuple2<String, Integer>> output = combinedWords
  .groupBy(0)                              // group by words again
  .reduceGroup(new GroupReduceFunction<Tuple2<String, Integer>, Tuple2<String, Integer>>() { // group reduce with full data exchange

    @Override
    public void reduce(Iterable<Tuple2<String, Integer>> words, Collector<Tuple2<String, Integer>> out) {
        String key = null;
        int count = 0;

        for (Tuple2<String, Integer> word : words) {
            key = word.f0;
            count += word.f1; // add the partial count, not 1
        }
        // emit tuple with word and total count
        out.collect(new Tuple2<>(key, count));
    }
});

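By now some readers will have noticed that this closely resembles mapPartition: in both cases the runtime does much of the work and hands the user function a whole batch of data. To give a deeper feel for where these two operators are used, we will go through a SQL example that shows how Flink applies them internally, and how powerful they are.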

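0x03 Code

The code below is mainly adapted from the article "flink 使用問題彙總" (a collection of Flink usage problems). It aggregates with group by. The collect function is similar to MySQL's group_concat.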

public class UdfExample {
    // Scalar UDF: joins the distinct elements of a collected multiset into a comma-separated string.
    // collect(f1) produces a multiset, which reaches eval as a Map from element to occurrence count.
    public static class MapToString extends ScalarFunction {

        public String eval(Map<String, Integer> map) {
            if (map == null || map.isEmpty()) {
                return "";
            }
            // Only the keys (the collected values) are needed; the counts are ignored.
            return String.join(",", map.keySet());
        }
    }

    public static void main(String[] args) throws Exception {
        MemSourceBatchOp src = new MemSourceBatchOp(new Object[][]{
                new Object[]{"1", "a", 1L},
                new Object[]{"2", "b33", 2L},
                new Object[]{"2", "CCC", 2L},
                new Object[]{"2", "xyz", 2L},
                new Object[]{"1", "u", 1L}
        }, new String[]{"f0", "f1", "f2"});

        BatchTableEnvironment environment = MLEnvironmentFactory.getDefault().getBatchTableEnvironment();
        Table table = environment.fromDataSet(src.getDataSet());
        environment.registerTable("myTable", table);
        BatchOperator.registerFunction("MapToString",  new MapToString());
        BatchOperator.sqlQuery("select f0, mapToString(collect(f1)) as type from myTable group by f0").print();
    }
}

The program prints:

f0|type
--|----
1|a,u
2|CCC,b33,xyz
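What the query computes can be emulated in plain Java, as a rough sketch of the semantics only. The class CollectGroupBySketch and its methods are hypothetical, and the TreeMap is an assumption made here to get a deterministic order; the element order produced by Flink's collect depends on its internal map and should not be relied on.

```java
import java.util.Map;
import java.util.TreeMap;

// Plain-Java emulation of the query's semantics; not how Flink executes it.
class CollectGroupBySketch {

    // Mirrors MapToString.eval: join the multiset's distinct elements with commas.
    static String mapToString(Map<String, Integer> multiset) {
        if (multiset == null || multiset.isEmpty()) {
            return "";
        }
        return String.join(",", multiset.keySet());
    }

    // Emulates "select f0, mapToString(collect(f1)) as type from myTable group by f0".
    // Each row is {f0, f1}.
    static Map<String, String> query(String[][] rows) {
        // collect(f1) per f0: a multiset, modelled as element -> occurrence count
        Map<String, Map<String, Integer>> collected = new TreeMap<>();
        for (String[] row : rows) {
            collected.computeIfAbsent(row[0], k -> new TreeMap<>())
                     .merge(row[1], 1, Integer::sum);
        }
        Map<String, String> result = new TreeMap<>();
        for (Map.Entry<String, Map<String, Integer>> e : collected.entrySet()) {
            result.put(e.getKey(), mapToString(e.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        String[][] rows = {
            {"1", "a"}, {"2", "b33"}, {"2", "CCC"}, {"2", "xyz"}, {"1", "u"}
        };
        System.out.println("f0|type");
        for (Map.Entry<String, String> e : query(rows).entrySet()) {
            System.out.println(e.getKey() + "|" + e.getValue());
        }
    }
}
```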

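0x04 How Flink SQL translates the query

The heart of this SQL statement is group by, an operation programmers use all the time. But have you ever thought about how group by is actually executed, and what optimizations exist for big-data workloads? Flink implements and optimizes group by with exactly GroupReduce and GroupCombine. The optimizations are:

GroupReduce and GroupCombine invoke the user function far fewer times than a plain reduce operator does. If the reduce logic frequently creates extra objects or touches external resources, this saves considerable time.
Because combine pre-aggregates the data, far less data has to be transferred between operators.

Translating SQL into a Flink program is an intricate process, so we only look at the most critical point, which is DataSetAggregate.translateToPlan. For the SQL statement "select f0, mapToString(collect(f1)) as type from myTable group by f0", Flink translates the aggregation into the following stages:

pre-aggregation: sort + combine;
final aggregation: sort + reduce;

As we saw in the earlier article, groupBy is not an operator in its own right; it is only an auxiliary step of the sorting process. The focus is therefore on combineGroup and reduceGroup, which makes this exactly the example we were looking for.

input ----> (groupBy + combineGroup) ----> (groupBy + reduceGroup) ----> output

The Scala code that performs this translation is shown below. combineGroup later produces a GroupCombineOperator, and reduceGroup produces a GroupReduceOperator.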

  override def translateToPlan(
      tableEnv: BatchTableEnvImpl,
      queryConfig: BatchQueryConfig): DataSet[Row] = {

    if (grouping.length > 0) {
      // grouped aggregation
      ...... 
      if (preAgg.isDefined) { // our example takes this branch
        inputDS
          // pre-aggregation
          .groupBy(grouping: _*)
          .combineGroup(preAgg.get) // becomes a GroupCombineOperator
          .returns(preAggType.get)
          .name(aggOpName)
          // final aggregation
          .groupBy(grouping.indices: _*)
          .reduceGroup(finalAgg.right.get) // becomes a GroupReduceOperator
          .returns(rowTypeInfo)
          .name(aggOpName)
      } else {
        ......
      }
    }
    else {
      ......
    }
  }
}

// Debugger view of the variables
this = {DataSetAggregate@5207} "Aggregate(groupBy: (f0), select: (f0, COLLECT(f1) AS $f1))"
 cluster = {RelOptCluster@5220}

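0x05 JobGraph

LocalExecutor.execute generates the JobGraph. The JobGraph is the data structure submitted to the JobManager and the only job representation that Flink's dataflow engine recognizes. This shared abstraction is what unifies stream and batch processing at runtime.

When the JobGraph is generated, the system ends up with the following JobVertex instances.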

jobGraph = {JobGraph@5652} "JobGraph(jobId: 6aae8b5e5ad32f588136bef26f8b65f6)"
 taskVertices = {LinkedHashMap@5655}  size = 4

{JobVertexID@5677} "c625209bb7fb9a098807551840aeaa99" -> {InputOutputFormatVertex@5678} "CHAIN DataSource (at initializeDataSource(MemSourceBatchOp.java:98) (org.apache.flink.api.java.io.CollectionInputFormat)) -> FlatMap (select: (f0, f1)) (org.apache.flink.runtime.operators.DataSourceTask)"

{JobVertexID@5679} "b56ace4acd7a2f69ea110a9f262ff80a" -> {JobVertex@5680} "CHAIN GroupReduce (groupBy: (f0), select: (f0, COLLECT(f1) AS $f1)) -> FlatMap (select: (f0, mapToString($f1) AS type)) -> Map (Map at linkFrom(MapBatchOp.java:35)) (org.apache.flink.runtime.operators.BatchTask)"
 
{JobVertexID@5681} "3f5e2a0f700421d80ce85e02a6d9db73" -> {InputOutputFormatVertex@5682} "DataSink (collect()) (org.apache.flink.runtime.operators.DataSinkTask)"
 
{JobVertexID@5683} "ad29dc5b2e0a39ad2cd1d164b6f859f7" -> {JobVertex@5684} "GroupCombine (groupBy: (f0), select: (f0, COLLECT(f1) AS $f1)) (org.apache.flink.runtime.operators.BatchTask)"

As we can see, the two corresponding operators appear in the JobGraph. The FlatMap chained after the GroupReduce is generated from the user's UDF MapToString.

GroupCombine (groupBy: (f0), select: (f0, COLLECT(f1) AS $f1))  
  
CHAIN GroupReduce (groupBy: (f0), select: (f0, COLLECT(f1) AS $f1)) -> FlatMap (select: (f0, mapToString($f1) AS type)) -> Map

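0x06 Runtime

Finally, let's see how the runtime handles these two operators.

6.1 ChainedFlatMapDriver

First, Flink processes each record in ChainedFlatMapDriver.collect. This step is required to extract the data from the Table and has little to do with the later group by.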

@Override
public void collect(IT record) {
   try {
      this.numRecordsIn.inc();
      this.mapper.flatMap(record, this.outputCollector);
   } catch (Exception ex) {
      throw new ExceptionInChainedStubException(this.taskName, ex);
   }
}

// Here we can see that the first row has been fetched
record = {Row@9317} "1,a,1"
 fields = {Object[3]@9330} 
this.taskName = "FlatMap (select: (f0, f1))"

// Call stack
collect:80, ChainedFlatMapDriver (org.apache.flink.runtime.operators.chaining)
collect:35, CountingCollector (org.apache.flink.runtime.operators.util.metrics)
invoke:196, DataSourceTask (org.apache.flink.runtime.operators)
doRun:707, Task (org.apache.flink.runtime.taskmanager)
run:532, Task (org.apache.flink.runtime.taskmanager)
run:748, Thread (java.lang)

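6.2 GroupReduceCombineDriver

Next, GroupReduceCombineDriver.run() performs the combine.

this.sorter.write(value) writes the record into the sort buffer.
sortAndCombineAndRetryWrite(value) performs the actual sort and combine when the buffer cannot take the record, then retries the write.

Because this code is generated by the system, the user-defined function for the combine is supplied by the Table API. As the call stack further down shows, the aggregation itself is done in org.apache.flink.table.functions.aggfunctions.CollectAggFunction.accumulate.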

@Override
public void run() throws Exception {
   final MutableObjectIterator<IN> in = this.taskContext.getInput(0);
   final TypeSerializer<IN> serializer = this.serializer;

   if (objectReuseEnabled) {
    .....
   }
   else {
      IN value;
      while (running && (value = in.next()) != null) {
         // try writing to the sorter first
         if (this.sorter.write(value)) {
            continue;
         }

         // do the actual sorting, combining, and data writing
         sortAndCombineAndRetryWrite(value);
      }
   }

   // sort, combine, and send the final batch
   if (running) {
      sortAndCombine();
   }
}

// Debugger view of the variables
value = {Row@9494} "1,a"
 fields = {Object[2]@9503}
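The control flow of run() above (write into the buffer, flush when the write fails, flush once more at the end) can be sketched in plain Java. This is a simplification under stated assumptions: a fixed-capacity list stands in for the sorter's memory, and the class BufferedCombineLoop and its methods are hypothetical names, not Flink code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java illustration only; the capacity-based buffer stands in for the sorter's memory.
class BufferedCombineLoop {

    // Returns one combined batch per flush of the buffer. Assumes capacity >= 1.
    static List<List<String>> run(List<String> input, int capacity) {
        List<List<String>> batches = new ArrayList<>();
        List<String> buffer = new ArrayList<>();
        for (String value : input) {
            if (buffer.size() < capacity) {   // the write fits into the buffer
                buffer.add(value);
                continue;
            }
            // Buffer full: sort and combine what is buffered, then retry the write.
            batches.add(sortAndCombine(buffer));
            buffer.clear();
            buffer.add(value);
        }
        if (!buffer.isEmpty()) {              // sort, combine, and emit the final batch
            batches.add(sortAndCombine(buffer));
        }
        return batches;
    }

    // Counts each word in the buffer; the TreeMap plays the role of the sort.
    static List<String> sortAndCombine(List<String> buffer) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String word : buffer) {
            counts.merge(word, 1, Integer::sum);
        }
        List<String> out = new ArrayList<>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            out.add(e.getKey() + "=" + e.getValue());
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> input = Arrays.asList("b", "b", "a", "b", "a");
        System.out.println(run(input, 3)); // two batches, each with its own partial counts
    }
}
```

With a capacity of 3, the five input words are flushed in two batches, so both words show up in both batches. This is the greedy, multi-step behaviour that makes the combine output partial.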

sortAndCombine does the actual sorting and combining.

Sorting is done by org.apache.flink.runtime.operators.sort.QuickSort.
Combining is done by org.apache.flink.table.functions.aggfunctions.CollectAggFunction.accumulate.
Results are sent downstream by org.apache.flink.table.runtime.aggregate.DataSetPreAggFunction.combine, which calls out.collect(output).

private void sortAndCombine() throws Exception {
   final InMemorySorter<IN> sorter = this.sorter;
   // the actual sort happens here
   this.sortAlgo.sort(sorter);
   final GroupCombineFunction<IN, OUT> combiner = this.combiner;
   final Collector<OUT> output = this.output;

   // iterate over key groups
   if (objectReuseEnabled) {
      ......
   } else {
      final NonReusingKeyGroupedIterator<IN> keyIter = 
            new NonReusingKeyGroupedIterator<IN>(sorter.getIterator(), this.groupingComparator);
      // walk the sorted data one key group at a time
      while (this.running && keyIter.nextKey()) {
         // combiner.combine is the user-defined function; the runtime passes it all buffered values of one key at once
         combiner.combine(keyIter.getValues(), output);
      }
   }
}

The call stack is:

accumulate:57, CollectAggFunction (org.apache.flink.table.functions.aggfunctions)
accumulate:-1, DataSetAggregatePrepareMapHelper$5
combine:71, DataSetPreAggFunction (org.apache.flink.table.runtime.aggregate)
sortAndCombine:213, GroupReduceCombineDriver (org.apache.flink.runtime.operators)
run:188, GroupReduceCombineDriver (org.apache.flink.runtime.operators)
run:504, BatchTask (org.apache.flink.runtime.operators)
invoke:369, BatchTask (org.apache.flink.runtime.operators)
doRun:707, Task (org.apache.flink.runtime.taskmanager)
run:532, Task (org.apache.flink.runtime.taskmanager)
run:748, Thread (java.lang)
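The sort-then-group loop in sortAndCombine can be illustrated with a small plain-Java sketch. It is not Flink code: KeyGroupedWalk is a hypothetical class, List.sort stands in for QuickSort, and the inner while loop plays the role that NonReusingKeyGroupedIterator plays in the runtime.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.BiFunction;

// Plain-Java illustration only; each row is {key, value}.
class KeyGroupedWalk {

    // Sort rows by key, then hand each run of equal keys to the combiner in a single call.
    static List<String> sortAndCombine(List<String[]> rows,
                                       BiFunction<String, List<String>, String> combiner) {
        List<String[]> sorted = new ArrayList<>(rows);
        sorted.sort((a, b) -> a[0].compareTo(b[0])); // stable sort on the key column

        List<String> out = new ArrayList<>();
        int start = 0;
        while (start < sorted.size()) {
            String key = sorted.get(start)[0];
            List<String> values = new ArrayList<>();
            int end = start;
            // collect the run of rows that share the current key
            while (end < sorted.size() && sorted.get(end)[0].equals(key)) {
                values.add(sorted.get(end)[1]);
                end++;
            }
            out.add(combiner.apply(key, values)); // one user-function call per key group
            start = end;
        }
        return out;
    }

    public static void main(String[] args) {
        List<String[]> rows = Arrays.asList(
                new String[] {"1", "a"},
                new String[] {"2", "b33"},
                new String[] {"2", "CCC"},
                new String[] {"2", "xyz"},
                new String[] {"1", "u"});
        System.out.println(sortAndCombine(rows, (key, values) -> key + ":" + values));
    }
}
```

Sorting first is what makes the grouping cheap: equal keys become adjacent, so a single linear pass is enough to find each group.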

6.3 GroupReduceDriver & ChainedFlatMapDriver

These two are covered together because they form an operator chain.

GroupReduceDriver.run performs the reduce. The actual reduce logic lives in org.apache.flink.table.runtime.aggregate.DataSetFinalAggFunction.reduce, which sends the result straight downstream through out.collect(output).

@Override
public void run() throws Exception {
   // cache references on the stack
   final GroupReduceFunction<IT, OT> stub = this.taskContext.getStub();
 
   if (objectReuseEnabled) {
       ......	
   }
   else {
      final NonReusingKeyGroupedIterator<IT> iter = new NonReusingKeyGroupedIterator<IT>(this.input, this.comparator);
      // run stub implementation
      while (this.running && iter.nextKey()) {
         // stub.reduce is the user-defined function; the runtime passes it all values of one key at once
         stub.reduce(iter.getValues(), output);
      }
   }
}

As described earlier, these operators have been configured as an operator chain, so out.collect(output) ends up in CountingCollector. CountingCollector's member variable collector has been set to the ChainedFlatMapDriver.

public void collect(OUT record) {
   this.numRecordsOut.inc();
   this.collector.collect(record);
}

this.collector = {ChainedFlatMapDriver@9643} 
 mapper = {FlatMapRunner@9610} 
 config = {TaskConfig@9655} 
 taskName = "FlatMap (select: (f0, mapToString($f1) AS type))"

The program therefore reaches ChainedFlatMapDriver.collect.

public void collect(IT record) {
   try {
      this.numRecordsIn.inc();
      this.mapper.flatMap(record, this.outputCollector);
   } catch (Exception ex) {
      throw new ExceptionInChainedStubException(this.taskName, ex);
   }
}
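The chaining pattern seen here, where one collector's collect call invokes the next operator directly in the same thread, can be sketched in plain Java. The names below (ChainedCollectors, Counting, ChainedMap) are hypothetical and only mimic the roles of CountingCollector and ChainedFlatMapDriver; the sketch uses a one-to-one map function instead of a flatMap to stay short.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

// Plain-Java illustration of collector chaining; not Flink's implementation.
class ChainedCollectors {

    interface Collector<T> {
        void collect(T record);
    }

    // Counts records and forwards them unchanged, like the CountingCollector role.
    static class Counting<T> implements Collector<T> {
        final Collector<T> next;
        int count = 0;

        Counting(Collector<T> next) {
            this.next = next;
        }

        @Override
        public void collect(T record) {
            count++;
            next.collect(record);
        }
    }

    // Applies a user function and forwards the result, like the chained FlatMap role.
    static class ChainedMap<I, O> implements Collector<I> {
        final Function<I, O> fn;
        final Collector<O> next;

        ChainedMap(Function<I, O> fn, Collector<O> next) {
            this.fn = fn;
            this.next = next;
        }

        @Override
        public void collect(I record) {
            next.collect(fn.apply(record));
        }
    }

    public static void main(String[] args) {
        List<String> sink = new ArrayList<>();
        // reduce output -> counting collector -> chained map -> sink, all plain method calls
        Counting<String> head =
                new Counting<>(new ChainedMap<String, String>(String::toUpperCase, sink::add));

        head.collect("a,u");
        head.collect("CCC,b33,xyz");

        System.out.println(head.count + " records -> " + sink);
    }
}
```

Because each stage is just a method call on the next one, chained operators pass records along without serialization or a network hop, which is why the UDF's eval shows up in the same call stack as GroupReduceDriver.run.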

The final call stack is:

eval:21, UdfExample$MapToString (com.alibaba.alink)
flatMap:-1, DataSetCalcRule$14
flatMap:52, FlatMapRunner (org.apache.flink.table.runtime)
flatMap:31, FlatMapRunner (org.apache.flink.table.runtime)
collect:80, ChainedFlatMapDriver (org.apache.flink.runtime.operators.chaining)
collect:35, CountingCollector (org.apache.flink.runtime.operators.util.metrics)
reduce:80, DataSetFinalAggFunction (org.apache.flink.table.runtime.aggregate)
run:131, GroupReduceDriver (org.apache.flink.runtime.operators)
run:504, BatchTask (org.apache.flink.runtime.operators)
invoke:369, BatchTask (org.apache.flink.runtime.operators)
doRun:707, Task (org.apache.flink.runtime.taskmanager)
run:532, Task (org.apache.flink.runtime.taskmanager)
run:748, Thread (java.lang)

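0x07 Summary

From all of this we can see:

GroupReduce and GroupCombine are very similar to mapPartition: all three are optimized at the system level and move the loop over the records into the user-defined function.
For the SQL group by clause, Flink translates the aggregation into GroupCombine + GroupReduce and uses this two-phase approach to handle large data volumes.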

0x08 References

flink 使用問題彙總 (a collection of Flink usage problems)
