[Continuously updated] Summary of Important FLIPs

Posted by 码以致用 on 2024-06-21

FLIP-27: Refactor Source Interface

Unified stream/batch API

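1. Decoupling the SplitEnumerator from the SourceReader

SplitEnumerator: discovers splits (for example files or Kafka partitions) and assigns them

SourceReader: actually reads the data from the splits

This decouples the split assignment strategy from the reading logic, so each can be packaged as its own component; the Source interface then becomes a factory that builds the SplitEnumerator and the SourceReader. It also reduces checkpoint lock contention.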
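The split between discovery/assignment and reading can be illustrated with a minimal sketch. All types below (`SimpleSource`, `SimpleSplitEnumerator`, `SimpleSourceReader`, `FileListSource`, `DecouplingDemo`) are hypothetical stand-ins written for this example, not the real Flink interfaces, and the round-robin assignment is just one assumed strategy.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

// Hypothetical, simplified stand-ins; NOT the real Flink interfaces.
interface SimpleSplitEnumerator {
    // Returns the next unassigned split, or null when none is left.
    String nextSplit();
}

interface SimpleSourceReader {
    void addSplit(String split);
    List<String> readAll();
}

// The source itself only acts as a factory for the two components.
interface SimpleSource {
    SimpleSplitEnumerator createEnumerator();
    SimpleSourceReader createReader();
}

class FileListSource implements SimpleSource {
    private final List<String> files;

    FileListSource(List<String> files) {
        this.files = files;
    }

    public SimpleSplitEnumerator createEnumerator() {
        Queue<String> pending = new ArrayDeque<>(files);
        return pending::poll; // poll() returns null once the queue is empty
    }

    public SimpleSourceReader createReader() {
        return new SimpleSourceReader() {
            private final List<String> assigned = new ArrayList<>();

            public void addSplit(String split) {
                assigned.add(split);
            }

            public List<String> readAll() {
                List<String> records = new ArrayList<>();
                for (String split : assigned) {
                    records.add("record-from-" + split); // pretend to read the split
                }
                return records;
            }
        };
    }
}

class DecouplingDemo {
    // Assigns splits round-robin to `parallelism` readers and returns each reader's records.
    static List<List<String>> run(SimpleSource source, int parallelism) {
        SimpleSplitEnumerator enumerator = source.createEnumerator();
        List<SimpleSourceReader> readers = new ArrayList<>();
        for (int i = 0; i < parallelism; i++) {
            readers.add(source.createReader());
        }
        int next = 0;
        for (String s = enumerator.nextSplit(); s != null; s = enumerator.nextSplit()) {
            readers.get(next++ % parallelism).addSplit(s);
        }
        List<List<String>> out = new ArrayList<>();
        for (SimpleSourceReader reader : readers) {
            out.add(reader.readAll());
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(run(new FileListSource(List.of("a.txt", "b.txt", "c.txt")), 2));
    }
}
```

Swapping the assignment strategy only touches the enumerator side of this sketch; the reader code stays unchanged, which is the point of the decoupling.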

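2. Unifying stream and batch

Every Source instance can act as either a batch source or a streaming source. Boundedness is an intrinsic property of the Source instance; only the SplitEnumerator needs to know about it, and the reader does not.

FileSource: for bounded input, the SplitEnumerator lists all files under the given directory once; for streaming input, the SplitEnumerator periodically lists all files under the given directory and assigns the new ones.

KafkaSource: for bounded input, the SplitEnumerator finds all partitions and uses each partition's latest offset as the split's end offset; for unbounded input, the SplitEnumerator finds all partitions and uses LONG_MAX as the split's end offset, and if automatic partition discovery is enabled it periodically lists all partitions.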
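The KafkaSource case can be sketched roughly as follows, assuming a simplified model in which a split is just a partition with an offset range. `BoundednessSketch`, `PartitionSplit`, `OffsetEnumerator` and `RangeReader` are hypothetical names for this example only; the start offset of 0 is also an assumption.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical stand-ins for illustration; these are not the real Flink classes.
enum BoundednessSketch { BOUNDED, CONTINUOUS_UNBOUNDED }

// One split = one partition plus the offset range [startOffset, endOffset).
class PartitionSplit {
    final String partition;
    final long startOffset;
    final long endOffset;

    PartitionSplit(String partition, long startOffset, long endOffset) {
        this.partition = partition;
        this.startOffset = startOffset;
        this.endOffset = endOffset;
    }
}

class OffsetEnumerator {
    // Only the enumerator looks at the boundedness: it decides where each split ends.
    static List<PartitionSplit> enumerate(Map<String, Long> latestOffsets, BoundednessSketch mode) {
        List<PartitionSplit> splits = new ArrayList<>();
        // TreeMap only makes the iteration order deterministic for the example.
        for (Map.Entry<String, Long> e : new TreeMap<>(latestOffsets).entrySet()) {
            long end = (mode == BoundednessSketch.BOUNDED) ? e.getValue() : Long.MAX_VALUE;
            splits.add(new PartitionSplit(e.getKey(), 0L, end));
        }
        return splits;
    }
}

class RangeReader {
    // The reader just reads until the split's end offset (or until no more data
    // is currently available); it never checks whether the job is bounded.
    static long readableRecords(PartitionSplit split, long availableUpTo) {
        return Math.max(0L, Math.min(split.endOffset, availableUpTo) - split.startOffset);
    }
}
```

With a bounded split ending at offset 100, the reader stops at 100 even if the partition later grows; with an unbounded split it keeps reading whatever becomes available.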

3. Top-level public interfaces

Source - A factory style class that helps create SplitEnumerator and SourceReader at runtime. It is also aware of the Boundedness.
SourceSplit - An interface for all the split types.
SplitEnumerator - Discover the splits and assign them to the SourceReaders
SplitEnumeratorContext - Provide necessary information to the SplitEnumerator to assign splits and send custom events to the SourceReaders.
SplitAssignment - A container class holding the source split assignment for each subtask.
SourceReader - Read the records from the splits assigned by the SplitEnumerator.
SourceReaderContext - Provide necessary function to the SourceReader to communicate with SplitEnumerator.
SourceOutput - A collector style interface to take the records and timestamps emitted by the SourceReader.
WatermarkOutput - An interface for emitting watermarks and indicating idleness of the source.
Watermark - A new Watermark class will be created in the package org.apache.flink.api.common.eventtime. This class will eventually replace the existing Watermark in org.apache.flink.streaming.api.watermark. This change allows flink-core to remain independent of other modules. Given that we will eventually put all the watermark generation into the Source, this change will be necessary. Note that this FLIP is not intended to change the existing way that watermarks can be overridden in the DataStream after they are emitted by the source.

FLIP-147: Support Checkpoints After Tasks Finished

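Unified stream/batch runtime

In unified stream/batch scenarios, if the runtime cannot keep taking checkpoints after some tasks have finished, the following problems arise:

With a mix of bounded and unbounded inputs, a failover incurs a large rollback cost
Two-phase-commit sinks rely on checkpoints for end-to-end consistency; if the finished tasks can no longer take part in a checkpoint, their data cannot be committed and consistency cannot be guaranteed

The overall checkpoint procedure is changed. First, tasks whose upstream tasks have all terminated but which have not terminated themselves are identified as the new source tasks, and barriers are sent starting from these tasks so that the checkpoint proceeds as usual. Because checkpoint state is recorded per JobVertex, if all tasks of a JobVertex have finished, a special marker is recorded in its state; if only some of its tasks have finished, the state of all still-running tasks is kept as the JobVertex state. All other JobVertex instances are handled exactly as in a normal checkpoint. After a failover restart, fully terminated JobVertex instances are skipped, and the remaining tasks are handled with the normal logic.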
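The selection of the tasks that barriers start from can be sketched as below. This is a simplified model under stated assumptions: tasks are plain strings, the topology is a map from each task to its direct upstream tasks, and `CheckpointTriggerPlanner` with its two methods is a hypothetical helper, not Flink's checkpoint coordinator code.

```java
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical helper for illustration; not Flink's actual coordinator logic.
class CheckpointTriggerPlanner {

    // upstream: task -> its direct upstream tasks (empty list for original sources).
    // finished: tasks that have already terminated.
    // Returns the running tasks whose upstream tasks have all finished. Running
    // original sources qualify too, because their upstream list is empty.
    static Set<String> tasksToTrigger(Map<String, List<String>> upstream, Set<String> finished) {
        Set<String> result = new TreeSet<>();
        for (Map.Entry<String, List<String>> e : upstream.entrySet()) {
            boolean running = !finished.contains(e.getKey());
            if (running && finished.containsAll(e.getValue())) {
                result.add(e.getKey());
            }
        }
        return result;
    }

    // A vertex gets the special "fully finished" marker only if every one of
    // its subtasks has terminated.
    static boolean isFullyFinished(List<String> subtasksOfVertex, Set<String> finished) {
        return finished.containsAll(subtasksOfVertex);
    }

    public static void main(String[] args) {
        Map<String, List<String>> upstream = Map.of(
                "srcA", List.of(),
                "srcB", List.of(),
                "mapA", List.of("srcA"),
                "sink", List.of("mapA", "srcB"));
        // srcA is a bounded source that has already finished.
        System.out.println(tasksToTrigger(upstream, Set.of("srcA")));
    }
}
```

In this example `mapA` is treated as a new source task because its only upstream task `srcA` has finished, while `sink` still receives barriers through its inputs.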

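To let tasks wait for the savepoint to complete in drain mode, and to let normally finishing tasks wait for the next checkpoint to complete before they end, all tasks are first notified with an EndOfDataEvent to finish without closing their network connections. Once all tasks have finished, drain mode additionally triggers a savepoint. Each task then waits for the next checkpoint or savepoint to complete, and only afterwards receives the EndOfPartitionEvent and closes its network connections. In this way all tasks end after waiting for the same final savepoint or checkpoint.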
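The ordering described above can be pictured as a small per-task state machine. This is only a sketch of the sequence of events as summarized in this post; `TaskShutdownSketch`, its states, and the choice to reject an early end-of-partition with an exception are all assumptions made for illustration.

```java
// Hypothetical state machine for illustration; not Flink's task implementation.
class TaskShutdownSketch {
    enum State { RUNNING, WAITING_FOR_FINAL_CHECKPOINT, CHECKPOINT_DONE, CLOSED }

    private State state = State.RUNNING;

    State state() {
        return state;
    }

    // End of data: the task finishes processing, but its network connections stay open.
    void onEndOfData() {
        if (state == State.RUNNING) {
            state = State.WAITING_FOR_FINAL_CHECKPOINT;
        }
    }

    // The next checkpoint (or the savepoint triggered in drain mode) has completed.
    void onCheckpointComplete() {
        if (state == State.WAITING_FOR_FINAL_CHECKPOINT) {
            state = State.CHECKPOINT_DONE;
        }
    }

    // End of partition: only now are the network connections closed.
    void onEndOfPartition() {
        if (state != State.CHECKPOINT_DONE) {
            throw new IllegalStateException("final checkpoint has not completed yet");
        }
        state = State.CLOSED;
    }

    public static void main(String[] args) {
        TaskShutdownSketch task = new TaskShutdownSketch();
        task.onEndOfData();
        task.onCheckpointComplete();
        task.onEndOfPartition();
        System.out.println(task.state());
    }
}
```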

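The previously ambiguous close() and dispose() operations were renamed to finish() and close() respectively. finish() is called only when a task ends normally, whereas close() is called both when the job ends normally and when it ends abnormally.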
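The difference between the two calls can be shown with a minimal sketch. `SketchOperator`, `OperatorRunner` and `RecordingOperator` are hypothetical types for this example, and the try/finally driver is an assumed simplification of how a runtime might invoke them.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, simplified operator lifecycle; not the real Flink operator interface.
interface SketchOperator {
    void process(String record);
    void finish(); // normal end of input only, e.g. flush buffered results
    void close();  // always called, e.g. release resources
}

class OperatorRunner {
    static void run(SketchOperator op, List<String> input) {
        try {
            for (String record : input) {
                op.process(record);
            }
            op.finish(); // reached only if processing did not fail
        } finally {
            op.close();  // reached on both normal and abnormal termination
        }
    }
}

// Records which lifecycle methods were called, for demonstration.
class RecordingOperator implements SketchOperator {
    final List<String> calls = new ArrayList<>();

    public void process(String record) {
        if (record.equals("poison")) {
            throw new IllegalStateException("simulated failure");
        }
        calls.add("process");
    }

    public void finish() {
        calls.add("finish");
    }

    public void close() {
        calls.add("close");
    }
}
```

Running `OperatorRunner.run` on a healthy input records `process`, `finish`, `close`; when `process` throws, only `close` is recorded, so nothing is flushed as if the input had ended normally.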

FLIP-150: Introduce Hybrid Source

Unified stream/batch API

FLIP-95: New TableSource and TableSink interfaces

Unified stream/batch runtime

Public interfaces

Main interfaces:

DynamicTableSource
ScanTableSource extends DynamicTableSource
LookupTableSource extends DynamicTableSource
DynamicTableSink

Corresponding factory interfaces:

Factory
DynamicTableFactory extends Factory
DynamicTableSourceFactory extends DynamicTableFactory
DynamicTableSinkFactory extends DynamicTableFactory
FormatFactory extends Factory

Optional interfaces that add further abilities:

SupportsComputedColumnPushDown
SupportsFilterPushDown
SupportsProjectionPushDown
SupportsWatermarkPushDown
SupportsLimitPushDown
SupportsPartitionPushDown
SupportsPartitioning
SupportsOverwriting
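The idea behind the optional ability interfaces can be sketched as follows: a source opts in by implementing an extra interface, and the planner checks for it before pushing work down. The types below (`ScanSketch`, `LimitPushDownSketch`, `PlainSource`, `LimitAwareSource`, `PlannerSketch`) are hypothetical stand-ins defined for this example; they only mimic the shape of an ability such as SupportsLimitPushDown and are not the Flink interfaces themselves.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical stand-ins for illustration; not the real Flink table interfaces.
interface ScanSketch {
    List<Integer> scan();
}

// Optional ability: a source that can apply a row limit itself.
interface LimitPushDownSketch {
    void applyLimit(long limit);
}

class PlainSource implements ScanSketch {
    public List<Integer> scan() {
        return List.of(1, 2, 3, 4, 5); // always produces every row
    }
}

class LimitAwareSource implements ScanSketch, LimitPushDownSketch {
    private long limit = Long.MAX_VALUE;

    public void applyLimit(long limit) {
        this.limit = limit;
    }

    public List<Integer> scan() {
        List<Integer> rows = new ArrayList<>();
        for (int i = 1; i <= 5 && rows.size() < limit; i++) {
            rows.add(i); // stops early once the pushed-down limit is reached
        }
        return rows;
    }
}

class PlannerSketch {
    static List<Integer> scanWithLimit(ScanSketch source, long limit) {
        if (source instanceof LimitPushDownSketch) {
            // The source supports the ability: push the limit into it.
            ((LimitPushDownSketch) source).applyLimit(limit);
            return source.scan();
        }
        // Otherwise scan everything and apply the limit afterwards.
        return source.scan().stream().limit(limit).collect(Collectors.toList());
    }
}
```

Both sources return the same rows through `PlannerSketch.scanWithLimit`, but the limit-aware one avoids producing rows that would be thrown away.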

Data structure interfaces and classes:

RowData
ArrayData
MapData
StringData
DecimalData
TimestampData
RawValueData
GenericRowData implements RowData
GenericArrayData implements ArrayData
GenericMapData implements MapData

FLIP-131: Consolidate the user-facing Dataflow SDKs/APIs (and deprecate the DataSet API)

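Users are advised to use DataStream/Table/SQL, and DataSet is gradually being deprecated. In fact, before this proposal DataStream could already cover DataSet's functionality, although it ran batch jobs less efficiently, while Table/SQL could already process both streaming and batch data efficiently. The proposal merely makes explicit the idea that batch is a special case of streaming, and adds a recommendation to the DataSet documentation that users use DataStream instead.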

FLIP-188: Introduce Built-in Dynamic Table Storage

Unified stream/batch storage
