Flink State and Fault Tolerance

Posted by weixin_33797791 on 2018-03-14

What is State?

  • An intermediate result of a task/operator at a given point in time
  • A snapshot
  • In Flink, state can be understood as a kind of data structure
  • Example
    For an input stream of <key, value> records, compute the maximum value for a given key. A HashMap would also work, but every update would then require re-traversing the data; with state, the most recent result can be fetched directly, reducing the amount of recomputation (see the sketch after this list).
  • Recovery after the program crashes
  • Rescaling the program (changing its parallelism)
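
A minimal sketch of the max-per-key example above, keeping the previous maximum in Flink's ValueState inside a RichFlatMapFunction (class and state names are illustrative, not from the original post):

```java
import org.apache.flink.api.common.functions.RichFlatMapFunction;
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.util.Collector;

// Keeps the running maximum per key; the previous maximum is read from
// ValueState instead of re-traversing all earlier records.
public class MaxPerKey extends RichFlatMapFunction<Tuple2<String, Long>, Tuple2<String, Long>> {

    private transient ValueState<Long> maxState;

    @Override
    public void open(Configuration parameters) {
        maxState = getRuntimeContext().getState(
                new ValueStateDescriptor<>("max", Long.class));
    }

    @Override
    public void flatMap(Tuple2<String, Long> in, Collector<Tuple2<String, Long>> out) throws Exception {
        Long currentMax = maxState.value();              // null for the first record of a key
        if (currentMax == null || in.f1 > currentMax) {
            maxState.update(in.f1);
            out.collect(Tuple2.of(in.f0, in.f1));        // emit only when the maximum changes
        }
    }
}
```

Used as `stream.keyBy(t -> t.f0).flatMap(new MaxPerKey())`; because the function runs on a KeyedStream, the state handle is automatically scoped to the current key.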

State Types

Operator State


With Operator State (or non-keyed state), each operator state is bound to one parallel operator instance. The Kafka Connector is a good motivating example for the use of Operator State in Flink. Each parallel instance of the Kafka consumer maintains a map of topic partitions and offsets as its Operator State.
The Operator State interfaces support redistributing state among parallel operator instances when the parallelism is changed. There can be different schemes for doing this redistribution.

[Figure: Kafka example]

The official Flink documentation uses the Kafka consumer as its example: the consumer's partition IDs and offsets are analogous to Flink's operator state.

Provided data structure: ListState<T>
Each parallel operator instance holds its own state (a sketch follows below).
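
A minimal sketch of operator state in the spirit of the Kafka-offset example, exposing ListState through the CheckpointedFunction interface; instead of partition offsets it checkpoints a per-instance record count, and all names are illustrative:

```java
import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.api.common.state.ListState;
import org.apache.flink.api.common.state.ListStateDescriptor;
import org.apache.flink.runtime.state.FunctionInitializationContext;
import org.apache.flink.runtime.state.FunctionSnapshotContext;
import org.apache.flink.streaming.api.checkpoint.CheckpointedFunction;

// Counts records per parallel instance and stores the count as operator (non-keyed)
// ListState, so it can be redistributed when the parallelism changes.
public class CountingMapper implements MapFunction<String, String>, CheckpointedFunction {

    private transient ListState<Long> checkpointedCount;
    private long count;

    @Override
    public String map(String value) {
        count++;
        return value;
    }

    @Override
    public void snapshotState(FunctionSnapshotContext context) throws Exception {
        checkpointedCount.clear();
        checkpointedCount.add(count);                // write the current count into the snapshot
    }

    @Override
    public void initializeState(FunctionInitializationContext context) throws Exception {
        checkpointedCount = context.getOperatorStateStore().getListState(
                new ListStateDescriptor<>("count", Long.class));
        for (Long c : checkpointedCount.get()) {     // on restore, sum the redistributed entries
            count += c;
        }
    }
}
```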

Keyed State


Keyed State is always relative to keys and can only be used in functions and operators on a KeyedStream.
You can think of Keyed State as Operator State that has been partitioned, or sharded, with exactly one state-partition per key. Each keyed-state is logically bound to a unique composite of <parallel-operator-instance, key>, and since each key “belongs” to exactly one parallel instance of a keyed operator, we can think of this simply as <operator, key>.
Keyed State is further organized into so-called Key Groups. Key Groups are the atomic unit by which Flink can redistribute Keyed State; there are exactly as many Key Groups as the defined maximum parallelism. During execution each parallel instance of a keyed operator works with the keys for one or more Key Groups.

State on top of a KeyedStream
It can be understood as the keyed counterpart of Operator State after dataStream.keyBy(): Operator State records state per operator instance, whereas Keyed State records state per key on the KeyedStream produced by keyBy().
Data types provided by Keyed State: ValueState<T>, ListState<T>, ReducingState<T>, MapState<T> (see the sketch below).
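
A minimal sketch showing how each of these keyed-state types is declared in a rich function's open() method; the state names and element types are illustrative:

```java
import org.apache.flink.api.common.functions.RichMapFunction;
import org.apache.flink.api.common.state.ListState;
import org.apache.flink.api.common.state.ListStateDescriptor;
import org.apache.flink.api.common.state.MapState;
import org.apache.flink.api.common.state.MapStateDescriptor;
import org.apache.flink.api.common.state.ReducingState;
import org.apache.flink.api.common.state.ReducingStateDescriptor;
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.configuration.Configuration;

// Declares one handle of each keyed-state type; each handle is scoped to the
// current key of the KeyedStream, not to the whole operator instance.
public class KeyedStateKinds extends RichMapFunction<String, String> {

    private transient ValueState<Long> lastValue;
    private transient ListState<String> recentEvents;
    private transient ReducingState<Long> runningSum;
    private transient MapState<String, Long> countsPerField;

    @Override
    public void open(Configuration parameters) {
        lastValue = getRuntimeContext().getState(
                new ValueStateDescriptor<>("lastValue", Long.class));
        recentEvents = getRuntimeContext().getListState(
                new ListStateDescriptor<>("recentEvents", String.class));
        runningSum = getRuntimeContext().getReducingState(
                new ReducingStateDescriptor<Long>("runningSum", (a, b) -> a + b, Long.class));
        countsPerField = getRuntimeContext().getMapState(
                new MapStateDescriptor<>("countsPerField", String.class, Long.class));
    }

    @Override
    public String map(String value) {
        return value;   // the state handles would be read and updated here, per key
    }
}
```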

State Fault Tolerance

  • Introduction
    Apache Flink offers a fault tolerance mechanism to consistently recover the state of data streaming applications. The mechanism ensures that even in the presence of failures, the program’s state will eventually reflect every record from the data stream exactly once. Note that there is a switch to downgrade the guarantees to at least once (described below).
    The fault tolerance mechanism continuously draws snapshots of the distributed streaming data flow. For streaming applications with small state, these snapshots are very light-weight and can be drawn frequently without impacting the performance much. The state of the streaming applications is stored at a configurable place (such as the master node, or HDFS).
    In case of a program failure (due to machine-, network-, or software failure), Flink stops the distributed streaming dataflow. The system then restarts the operators and resets them to the latest successful checkpoint. The input streams are reset to the point of the state snapshot. Any records that are processed as part of the restarted parallel dataflow are guaranteed to not have been part of the checkpointed state before.
    Note: For this mechanism to realize its full guarantees, the data stream source (such as message queue or broker) needs to be able to rewind the stream to a defined recent point. Apache Kafka has this ability and Flink’s connector to Kafka exploits this ability.
    Note: Because Flink’s checkpoints are realized through distributed snapshots, we use the words snapshot and checkpoint interchangeably.
Relies on checkpoints

Checkpoint concept: take a global snapshot and durably store the state of every task/operator

  • Characteristics:
    Asynchronous: lightweight, does not get in the way of normal data processing
    Barrier mechanism
    On failure, the job can be rolled back to the most recent successful checkpoint
    Periodic
  • Guarantees exactly-once (see the configuration sketch below)


    [Figure: Checkpoint]

    [Figure: Restore]
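
A minimal sketch of enabling periodic checkpointing with the exactly-once guarantee and a durable (HDFS) snapshot location as described above; the interval, timeouts, and path are illustrative and assume the Flink 1.x DataStream API:

```java
import org.apache.flink.runtime.state.filesystem.FsStateBackend;
import org.apache.flink.streaming.api.CheckpointingMode;
import org.apache.flink.streaming.api.environment.CheckpointConfig;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class CheckpointConfigExample {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Draw a checkpoint every 10 seconds (periodic, asynchronous snapshots).
        env.enableCheckpointing(10_000L);

        CheckpointConfig config = env.getCheckpointConfig();
        // Exactly-once is the default; it can be downgraded to AT_LEAST_ONCE.
        config.setCheckpointingMode(CheckpointingMode.EXACTLY_ONCE);
        // Leave at least 5 seconds between checkpoints and time each one out after 1 minute.
        config.setMinPauseBetweenCheckpoints(5_000L);
        config.setCheckpointTimeout(60_000L);

        // Store snapshots in a durable location such as HDFS (the path is illustrative).
        env.setStateBackend(new FsStateBackend("hdfs:///flink/checkpoints"));

        // ... build the job (sources, operators, sinks), then:
        // env.execute("checkpointed job");
    }
}
```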
Snapshot
  • Barriers
    Barriers are a key element of Flink's distributed snapshots.
    [Figure: Barriers in a single parallel stream]

    [Figure: Barriers across multiple parallel streams]

    Barriers are injected into the data stream and flow with the records. Each barrier carries a snapshot ID and is very lightweight, so it does not interrupt the flow of data. Barriers belonging to different snapshots can be in the stream at the same time, which means several snapshots may be in progress concurrently.
    Compared with the single-parallelism case, a snapshot over multiple parallel streams requires that the barriers carrying the same snapshot ID have passed the operator from every input stream before that operator can take its checkpoint (barrier alignment).

Personal understanding: Flink's state migration and fault tolerance rely mainly on the checkpoint mechanism, whose most important element is the barrier; barriers guarantee that the records flowing into an operator are covered by a checkpoint.
