Design Lessons from the Flink Source: Clean Architecture in FileSystemConnector

This article was first published on 泊浮目's Yuque: https://www.yuque.com/17sing

Version	Date	Notes
1.0	2022.3.8	Initial publication

This article analyzes the Flink 1.14 code.

0. Preface

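A while ago I ran into a strange problem in production: a full-load job could not run to completion, and the logs were flooded with the error java.util.concurrent.TimeoutException: Heartbeat of TaskManager with id container xxxx(HOSTNAME:PORT) timed out.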

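The scenario was a full extraction from Oracle into Hive, with the data passing through Kafka. The data volume was at the terabyte level, and the Hive table was partitioned by day on a time field. The failing job consumed the data from Kafka and wrote it to Hive using the Table API.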

1. Troubleshooting

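By the time this problem reached me, a colleague had already done one round of investigation. Search results online suggest that the cause may be heavy load on YARN or brief network instability, and that raising heartbeat.timeout can mitigate it. We raised it, but the problem was not solved.

Another common explanation is frequent GC, with the advice to tune memory. After tuning, there was indeed some effect (the failure took longer to appear). That strongly suggested the problem lay in the code.

Because the previous release had synced data without any trouble, I started looking through recent code changes, but after several passes I found nothing suspicious, which was unnerving. I then asked the on-site colleague to switch back to the previous release and rerun the full load. The problem still occurred.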

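At this point I began to suspect something specific to the production environment, such as the characteristics of the data, but the on-site colleague told me there was nothing special about the data. So I asked for a heap dump from the site, loaded it into an analysis tool, and found an unusually large number of org.apache.flink.streaming.api.functions.sink.filesystem.Bucket objects.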

So I looked at the definition of Bucket:


/**
 * A bucket is the directory organization of the output of the {@link StreamingFileSink}.
 *
 * <p>For each incoming element in the {@code StreamingFileSink}, the user-specified {@link
 * BucketAssigner} is queried to see in which bucket this element should be written to.
 */
@Internal
public class Bucket<IN, BucketID> {

Well, well. One object per directory. By now I doubted the claim that there was "nothing special about the data", but to confirm it I still traced through the code:

|-- HiveTableSink
  \-- createStreamSink
|-- StreamingFileSink
  \-- initializeState
|-- StreamingFileSinkHelper
  \-- constructor
|-- HadoopPathBasedBulkFormatBuilder
  \-- createBuckets
|-- Buckets
  \-- onElement
  \-- getOrCreateBucketForBucketId

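After going through the code, I had a good idea of what was happening. I asked the site whether the data being synced spanned an especially long time range, and they confirmed that it covered more than three years. I suggested either narrowing the time range or reducing the number of time partitions. In the end, the problem was solved by splitting the full load into batches.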
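To make the effect of the time range concrete, here is a minimal sketch. It is not Flink code: BucketGrowthSketch and FakeBucket are hypothetical stand-ins that only mimic the get-or-create behaviour traced above, assuming one bucket id per daily partition.

```java
import java.time.LocalDate;
import java.util.HashMap;
import java.util.Map;

class BucketGrowthSketch {

    /** Hypothetical stand-in for a Bucket: one instance per output directory. */
    static final class FakeBucket {
        final String bucketId;
        long elements;

        FakeBucket(String bucketId) {
            this.bucketId = bucketId;
        }
    }

    private final Map<String, FakeBucket> activeBuckets = new HashMap<>();

    /** An unseen bucket id allocates a new bucket; a known id reuses the existing one. */
    FakeBucket getOrCreateBucketForBucketId(String bucketId) {
        return activeBuckets.computeIfAbsent(bucketId, FakeBucket::new);
    }

    /** Assumption: the partition (and therefore the bucket id) is derived from the event date. */
    void onElement(LocalDate eventDate) {
        getOrCreateBucketForBucketId("dt=" + eventDate).elements++;
    }

    int activeBucketCount() {
        return activeBuckets.size();
    }

    /** Feeds two records per day and returns how many buckets are held in memory afterwards. */
    static int bucketsForRange(LocalDate startInclusive, LocalDate endExclusive) {
        BucketGrowthSketch buckets = new BucketGrowthSketch();
        for (LocalDate d = startInclusive; d.isBefore(endExclusive); d = d.plusDays(1)) {
            buckets.onElement(d);
            buckets.onElement(d); // same day, so the existing bucket is reused
        }
        return buckets.activeBucketCount();
    }

    public static void main(String[] args) {
        System.out.println(
                bucketsForRange(LocalDate.of(2019, 1, 1), LocalDate.of(2022, 1, 1)));
    }
}
```

In this sketch, three years of daily partitions (2019-01-01 up to 2022-01-01) leave 1096 buckets alive at once, because nothing removes a bucket between its creation and the end of the run. Splitting the load into batches bounds that number per run.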

2. Curiosity After the Fix

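If every directory produces a Bucket, wouldn't a long-running streaming job eventually hit the same problem? Such an obvious issue must have occurred to the community long ago. Curiosity drove me to look for the answer, and I eventually found this code: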

    public void commitUpToCheckpoint(final long checkpointId) throws IOException {
        final Iterator<Map.Entry<BucketID, Bucket<IN, BucketID>>> activeBucketIt =
                activeBuckets.entrySet().iterator();

        LOG.info(
                "Subtask {} received completion notification for checkpoint with id={}.",
                subtaskIndex,
                checkpointId);

        while (activeBucketIt.hasNext()) {
            final Bucket<IN, BucketID> bucket = activeBucketIt.next().getValue();
            bucket.onSuccessfulCompletionOfCheckpoint(checkpointId);

            if (!bucket.isActive()) {
                // We've dealt with all the pending files and the writer for this bucket is not
                // currently open.
                // Therefore this bucket is currently inactive and we can remove it from our state.
                activeBucketIt.remove();
                notifyBucketInactive(bucket);
            }
        }
    }

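When committing after a checkpoint completes, this code decides whether to remove a Bucket from the in-memory data structure based on whether the Bucket is still active.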

So what counts as active? The code is short:

    boolean isActive() {
        return inProgressPart != null
                || !pendingFileRecoverablesForCurrentCheckpoint.isEmpty()
                || !pendingFileRecoverablesPerCheckpoint.isEmpty();
    }

Next, let's go through what triggers each of these three conditions.

2.1 inProgressPart != null

This field is of type InProgressFileWriter, and when it is set or cleared is closely tied to the rolling policy of the FileSystem sink.


/**
 * The policy based on which a {@code Bucket} in the {@code Filesystem Sink} rolls its currently
 * open part file and opens a new one.
 */
@PublicEvolving
public interface RollingPolicy<IN, BucketID> extends Serializable {

    /**
     * Determines if the in-progress part file for a bucket should roll on every checkpoint.
     *
     * @param partFileState the state of the currently open part file of the bucket.
     * @return {@code True} if the part file should roll, {@link false} otherwise.
     */
    boolean shouldRollOnCheckpoint(final PartFileInfo<BucketID> partFileState) throws IOException;

    /**
     * Determines if the in-progress part file for a bucket should roll based on its current state,
     * e.g. its size.
     *
     * @param element the element being processed.
     * @param partFileState the state of the currently open part file of the bucket.
     * @return {@code True} if the part file should roll, {@link false} otherwise.
     */
    boolean shouldRollOnEvent(final PartFileInfo<BucketID> partFileState, IN element)
            throws IOException;

    /**
     * Determines if the in-progress part file for a bucket should roll based on a time condition.
     *
     * @param partFileState the state of the currently open part file of the bucket.
     * @param currentTime the current processing time.
     * @return {@code True} if the part file should roll, {@link false} otherwise.
     */
    boolean shouldRollOnProcessingTime(
            final PartFileInfo<BucketID> partFileState, final long currentTime) throws IOException;
}

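These three methods each decide whether the currently open file should be closed in a particular situation:

shouldRollOnCheckpoint: checked before a checkpoint is taken.
shouldRollOnEvent: checks whether the file should be closed based on its current state, for example whether the current part file has grown past a size limit.
shouldRollOnProcessingTime: checks whether the file has been open for too long and should therefore be closed.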
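The three checks can be illustrated with a small threshold-based policy. This is a standalone sketch under stated assumptions: PartInfo and ThresholdPolicy are hypothetical types written for this example, not Flink's PartFileInfo or any built-in policy, and the thresholds are arbitrary.

```java
class RollingPolicySketch {

    /** Hypothetical, trimmed-down view of an open part file. */
    static final class PartInfo {
        final long sizeBytes;
        final long creationTimeMs;
        final long lastUpdateTimeMs;

        PartInfo(long sizeBytes, long creationTimeMs, long lastUpdateTimeMs) {
            this.sizeBytes = sizeBytes;
            this.creationTimeMs = creationTimeMs;
            this.lastUpdateTimeMs = lastUpdateTimeMs;
        }
    }

    /** Rolls on size, on age, or on inactivity; all three thresholds are illustrative. */
    static final class ThresholdPolicy {
        final long maxPartSizeBytes;
        final long rolloverIntervalMs;
        final long inactivityIntervalMs;

        ThresholdPolicy(long maxPartSizeBytes, long rolloverIntervalMs, long inactivityIntervalMs) {
            this.maxPartSizeBytes = maxPartSizeBytes;
            this.rolloverIntervalMs = rolloverIntervalMs;
            this.inactivityIntervalMs = inactivityIntervalMs;
        }

        /** Checked before a checkpoint: here, roll only if the file is already too large. */
        boolean shouldRollOnCheckpoint(PartInfo part) {
            return part.sizeBytes > maxPartSizeBytes;
        }

        /** Checked for each incoming element: roll once the size limit is exceeded. */
        boolean shouldRollOnEvent(PartInfo part) {
            return part.sizeBytes > maxPartSizeBytes;
        }

        /** Checked on a timer: roll if the file is too old or has been idle too long. */
        boolean shouldRollOnProcessingTime(PartInfo part, long nowMs) {
            return nowMs - part.creationTimeMs >= rolloverIntervalMs
                    || nowMs - part.lastUpdateTimeMs >= inactivityIntervalMs;
        }
    }

    public static void main(String[] args) {
        ThresholdPolicy policy = new ThresholdPolicy(128L, 60_000L, 10_000L);
        PartInfo small = new PartInfo(10L, 0L, 5_000L);
        System.out.println(policy.shouldRollOnEvent(small));
        System.out.println(policy.shouldRollOnProcessingTime(small, 61_000L));
    }
}
```

Whichever check fires, the consequence for the Bucket is the same: the open part file is closed, which is what eventually allows the Bucket to become inactive.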

2.2 pendingFileRecoverablesForCurrentCheckpoint isNotEmpty

Elements are added to this list when the RollingPolicy causes a part file to be closed, so it needs no further explanation.

2.3 pendingFileRecoverablesPerCheckpoint isNotEmpty

This builds on pendingFileRecoverablesForCurrentCheckpoint being non-empty. A map stores the relationship between a checkpoint ID and a List<InProgressFileWriter.PendingFileRecoverable>.

2.4 Inactive Buckets

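Putting the conditions above together, an inactive Bucket is simply a directory whose files have all been closed and whose checkpoints have all completed. The check usually happens at these points:

When a Task is restored: the previous state is read from the StateBackend and checked.
After a checkpoint completes: a check is performed.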

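When a Bucket becomes inactive, an inactive notification is sent, telling downstream that the data of that partition has been committed and is now readable. See the issue: Partition commit is delayed when records keep coming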
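The lifecycle described in sections 2.1 to 2.4 can be put together in one simplified model. The sketch below is an assumption-laden illustration, not the Flink implementation: SketchBucket represents part files as plain strings, rolling is triggered by hand instead of by a RollingPolicy, and the inactive notification is omitted.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.NavigableMap;
import java.util.TreeMap;

class BucketLifecycleSketch {

    /** Simplified stand-in for a bucket; part files are represented by strings. */
    static final class SketchBucket {
        private String inProgressPart; // currently open part file, or null
        private List<String> pendingForCurrentCheckpoint = new ArrayList<>();
        private final NavigableMap<Long, List<String>> pendingPerCheckpoint = new TreeMap<>();
        final List<String> committed = new ArrayList<>();
        private int partCounter;

        /** Opens a part file if none is open. */
        void write() {
            if (inProgressPart == null) {
                inProgressPart = "part-" + partCounter++;
            }
        }

        /** Closes the open part file, as a rolling policy would request. */
        void roll() {
            if (inProgressPart != null) {
                pendingForCurrentCheckpoint.add(inProgressPart);
                inProgressPart = null;
            }
        }

        /** On checkpoint: closed parts become tied to this checkpoint id. */
        void onCheckpoint(long checkpointId) {
            if (!pendingForCurrentCheckpoint.isEmpty()) {
                pendingPerCheckpoint.put(checkpointId, pendingForCurrentCheckpoint);
                pendingForCurrentCheckpoint = new ArrayList<>();
            }
        }

        /** On completion: commit everything registered up to and including this id. */
        void onCheckpointComplete(long checkpointId) {
            Iterator<List<String>> it =
                    pendingPerCheckpoint.headMap(checkpointId, true).values().iterator();
            while (it.hasNext()) {
                committed.addAll(it.next());
                it.remove();
            }
        }

        boolean isActive() {
            return inProgressPart != null
                    || !pendingForCurrentCheckpoint.isEmpty()
                    || !pendingPerCheckpoint.isEmpty();
        }
    }

    private final Map<String, SketchBucket> activeBuckets = new HashMap<>();

    SketchBucket bucket(String bucketId) {
        return activeBuckets.computeIfAbsent(bucketId, k -> new SketchBucket());
    }

    void snapshot(long checkpointId) {
        for (SketchBucket b : activeBuckets.values()) {
            b.onCheckpoint(checkpointId);
        }
    }

    /** Commits pending files and drops every bucket that is no longer active. */
    void commitUpToCheckpoint(long checkpointId) {
        Iterator<SketchBucket> it = activeBuckets.values().iterator();
        while (it.hasNext()) {
            SketchBucket b = it.next();
            b.onCheckpointComplete(checkpointId);
            if (!b.isActive()) {
                it.remove();
            }
        }
    }

    int activeBucketCount() {
        return activeBuckets.size();
    }
}
```

In this model, a bucket that still has an open part file survives commitUpToCheckpoint, while a bucket whose parts were all rolled and committed is removed from the map. This is the reason a streaming job does not accumulate one object per historical directory.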

3. Clean Architecture in FileSystemConnector

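With the background above in mind, I came across this proposal: FLIP-115: Filesystem connector in Table. Following the proposal, I skimmed the related source code and found that its implementation is also an example of clean architecture.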

We have already walked through the source code above, so next we look at its abstractions, responsibilities, and layering:

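|-- HiveTableSink  # Table-level API; user-facing and can be called directly
|-- StreamingFileSink  # Streaming-level API; can also be user-facing, sits beneath the Table API
|-- StreamingFileSinkHelper  # Integrates the TimeService logic so that Buckets can be closed periodically, and dispatches data to Buckets. This class is also used by AbstractStreamingWriter, and its comments suggest reusing it in a RichSinkFunction or StreamOperator
|-- BucketsBuilder  # The concrete class called in this scenario is HadoopPathBasedBulkFormatBuilder; it is concerned with the concrete implementations of Buckets and BucketWriter
|-- Buckets  # Manages the lifecycle of Bucket instances; it has several key members
  |-- BucketWriter  # Corresponds to the concrete FileSystem implementation and the output Format
  |-- RollingPolicy  # The rolling policy, covered earlier; not discussed further
  |-- BucketAssigner  # Decides which Bucket each element is written to, for example by key or by date
  |-- BucketFactory  # Creates each Bucket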

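Because responsibilities are split at a fine granularity, the data flow logic is decoupled from the concrete external implementations. A few examples: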

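If we want to drive Hive writes from our own DSL, we only need to write a HiveDSLSink similar to HiveTableSink.
If a data warehouse (or data lake) keeps adding support for new underlying file systems, then once the first implementation is in place, later additions only need the corresponding BucketWriter and FileSystem.
If a data warehouse (or data lake) keeps adding supported Formats, then once the first implementation is in place, later additions only need the corresponding BucketWriter.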
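The decoupling in these examples can be sketched in a few lines. The types below are hypothetical and deliberately tiny: Assigner and Writer only play the roles that BucketAssigner and BucketWriter play in the layering above, and Core stands in for the routing logic that stays unchanged.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class PluggableSinkSketch {

    /** Hypothetical counterpart of an assigner: maps an element to a bucket id. */
    interface Assigner<T> {
        String bucketIdFor(T element);
    }

    /** Hypothetical counterpart of a writer: hides file system and format details. */
    interface Writer<T> {
        void write(String bucketId, T element);
    }

    /** Core routing logic; swapping the assigner or the writer does not touch it. */
    static final class Core<T> {
        private final Assigner<T> assigner;
        private final Writer<T> writer;

        Core(Assigner<T> assigner, Writer<T> writer) {
            this.assigner = assigner;
            this.writer = writer;
        }

        void onElement(T element) {
            writer.write(assigner.bucketIdFor(element), element);
        }
    }

    /** In-memory writer used here in place of a real file system. */
    static final class InMemoryWriter<T> implements Writer<T> {
        final Map<String, List<T>> files = new HashMap<>();

        @Override
        public void write(String bucketId, T element) {
            files.computeIfAbsent(bucketId, k -> new ArrayList<>()).add(element);
        }
    }

    public static void main(String[] args) {
        InMemoryWriter<String> writer = new InMemoryWriter<>();
        // Assumption: each record starts with a 10-character date such as 2022-03-08.
        Core<String> byDate = new Core<>(e -> "dt=" + e.substring(0, 10), writer);
        byDate.onElement("2022-03-08,order-1");
        byDate.onElement("2022-03-09,order-2");
        System.out.println(writer.files.keySet());
    }
}
```

Supporting another storage backend or format in this sketch means adding one more Writer implementation, and changing the partitioning scheme means passing a different Assigner. Core is not modified in either case.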

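With this design, the core logic rarely needs to change, and the parts that change easily are isolated from it, which makes the quality of the whole module easier to guarantee.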