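[Source Code Analysis] What exactly is a Flink slot? (2)
0x00 Abstract
Most people have heard of Flink's slot concept, but many are still hazy on the details. What exactly does a slot represent? How is it implemented in code? What role does it play while the execution graph is generated, and during scheduling, resource allocation, deployment and execution? This article and the previous one walk through the source code to uncover the mechanics behind slots.
0x01 Recap
This continues from [Source Code Analysis] What exactly is a Flink slot? (1). The previous article analysed slots from the perspective of system architecture and data structures; this one analyses them from the perspective of the runtime flows. Here again is the system architecture diagram
and the diagram of the logical relationships between the data structures.
Below we go through the flows one by one.
0x02 Registering/updating slots
Slots are registered, or their state is updated, through two paths.
- After a TaskExecutor has registered successfully, it interacts with the RM and registers its slots along the way;
- On every periodic heartbeat, the slot status is attached to the heartbeat payload.
2.1 TaskExecutor registered successfully
Once the TaskExecutor has registered successfully, it interacts with the RM and registers its slots with the ResourceManager (SlotManagerImpl) through the call path below. When SlotManagerImpl receives the report it updates the slot state: if there is already a pendingSlotRequest, the slot is allocated to it directly; otherwise the slot is recorded in the freeSlots variable.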
- TaskExecutor#establishResourceManagerConnection
- TaskSlotTableImpl#createSlotReport: creates the report. At this point the report looks like this:
    slotReport = {SlotReport@9633}
     0 = {SlotStatus@8969} "SlotStatus{slotID=40d390ec-7d52-4f34-af86-d06bb515cc48_0, resourceProfile=ResourceProfile{managedMemory=64.000mb (67108864 bytes), networkMemory=32.000mb (33554432 bytes)}, allocationID=null, jobID=null}"
      slotID = {SlotID@8629} "40d390ec-7d52-4f34-af86-d06bb515cc48_0"
      resourceProfile = {ResourceProfile@4194} "ResourceProfile{managedMemory=64.000mb (67108864 bytes), networkMemory=32.000mb (33554432 bytes)}"
      allocationID = null
      jobID = null
     1 = {SlotStatus@9638} "SlotStatus{slotID=40d390ec-7d52-4f34-af86-d06bb515cc48_1, resourceProfile=ResourceProfile{managedMemory=64.000mb (67108864 bytes), networkMemory=32.000mb (33554432 bytes)}, allocationID=null, jobID=null}"
      slotID = {SlotID@9643} "40d390ec-7d52-4f34-af86-d06bb515cc48_1"
      resourceProfile = {ResourceProfile@4194} "ResourceProfile{managedMemory=64.000mb (67108864 bytes), networkMemory=32.000mb (33554432 bytes)}"
      allocationID = null
      jobID = null
- ResourceManager#sendSlotReport: reached on the RM through RPC (resourceManagerGateway.sendSlotReport)
- SlotManagerImpl#registerTaskManager: registers the TaskManager with the SlotManager
- SlotManagerImpl#registerSlot
- SlotManagerImpl#createAndRegisterTaskManagerSlot: creates and registers the TaskManagerSlot. The code and variables at this point are shown below; the TM's slot information is simply registered in the SlotManager:
    private TaskManagerSlot createAndRegisterTaskManagerSlot(SlotID slotId, ResourceProfile resourceProfile, TaskExecutorConnection taskManagerConnection) {
        final TaskManagerSlot slot = new TaskManagerSlot(
            slotId,
            resourceProfile,
            taskManagerConnection);
        slots.put(slotId, slot);
        return slot;
    }

    slot = {TaskManagerSlot@13322}
     slotId = {SlotID@8629} "40d390ec-7d52-4f34-af86-d06bb515cc48_0"
     resourceProfile = {ResourceProfile@4194}
      cpuCores = {CPUResource@11616} "Resource(CPU: 89884656743115785...0)"
      taskHeapMemory = {MemorySize@11617} "4611686018427387903 bytes"
      taskOffHeapMemory = {MemorySize@11618} "4611686018427387903 bytes"
      managedMemory = {MemorySize@11619} "64 mb"
      networkMemory = {MemorySize@11620} "32 mb"
      extendedResources = {HashMap@11621} size = 0
     taskManagerConnection = {WorkerRegistration@11121}
     allocationId = null
     jobId = null
     assignedSlotRequest = null
     state = {TaskManagerSlot$State@13328} "FREE"
- SlotManagerImpl#updateSlot
- SlotManagerImpl#updateSlotState: if there is a pendingSlotRequest, the slot is allocated to it directly
- SlotManagerImpl#handleFreeSlot: otherwise the freeSlots variable is updated
When the flow finishes, the SlotManager looks as follows. It holds two slots and freeSlots also has two entries, which means both slots are idle:
this = {SlotManagerImpl@11120}
scheduledExecutor = {ActorSystemScheduledExecutorAdapter@11125}
slotRequestTimeout = {Time@11127} "300000 ms"
taskManagerTimeout = {Time@11128} "30000 ms"
slots = {HashMap@11122} size = 2
{SlotID@9643} "40d390ec-7d52-4f34-af86-d06bb515cc48_1" -> {TaskManagerSlot@19206}
{SlotID@8629} "40d390ec-7d52-4f34-af86-d06bb515cc48_0" -> {TaskManagerSlot@13322}
freeSlots = {LinkedHashMap@11129} size = 2
{SlotID@8629} "40d390ec-7d52-4f34-af86-d06bb515cc48_0" -> {TaskManagerSlot@13322}
{SlotID@9643} "40d390ec-7d52-4f34-af86-d06bb515cc48_1" -> {TaskManagerSlot@19206}
taskManagerRegistrations = {HashMap@11130} size = 1
fulfilledSlotRequests = {HashMap@11131} size = 0
pendingSlotRequests = {HashMap@11132} size = 0
pendingSlots = {HashMap@11133} size = 0
slotMatchingStrategy = {AnyMatchingSlotMatchingStrategy@11134} "INSTANCE"
slotRequestTimeoutCheck = {ActorSystemScheduledExecutorAdapter$ScheduledFutureTask@11139}
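The registration decision described in this section can be condensed into a toy model. This is a minimal sketch, not SlotManagerImpl itself: the class name, the string-based slot ids and the "ALLOCATED"/"FREE" states are all assumptions made for illustration. It only shows the branch "serve a pending request if there is one, otherwise record the slot as free".

```java
import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Queue;

/** Toy model of slot registration; every name here is illustrative, not Flink's API. */
class SlotRegistrationSketch {
    // All known slots and their state, keyed by slot id.
    final Map<String, String> slots = new HashMap<>();
    // The subset of slots that nobody is using yet.
    final Map<String, String> freeSlots = new LinkedHashMap<>();
    // Requests that arrived before a slot was available.
    final Queue<String> pendingRequests = new ArrayDeque<>();

    /** Registers one reported slot; returns the request it serves, or null if it stays free. */
    String registerSlot(String slotId) {
        String request = pendingRequests.poll();
        if (request != null) {
            // A request is already waiting: hand the slot over directly.
            slots.put(slotId, "ALLOCATED to " + request);
        } else {
            // Nobody is waiting: remember the slot as free.
            slots.put(slotId, "FREE");
            freeSlots.put(slotId, "FREE");
        }
        return request;
    }

    public static void main(String[] args) {
        SlotRegistrationSketch manager = new SlotRegistrationSketch();
        manager.registerSlot("tm_0");
        manager.registerSlot("tm_1");
        // No request was pending, so both slots end up free, as in the dump above.
        System.out.println("slots=" + manager.slots.size() + ", freeSlots=" + manager.freeSlots.size());
    }
}
```

With no pending request, both registered slots land in freeSlots, which matches the state shown in the debugger dump.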
2.2 Updating slot state through heartbeats
Flink's heartbeat mechanism is also used to report slot information: the slot report is included in the heartbeat payload.
First, the slot report is created on the TE:
- TaskExecutor#heartbeatFromResourceManager
- HeartbeatManagerImpl#requestHeartbeat
- TaskExecutor$ResourceManagerHeartbeatListener#retrievePayload
- TaskSlotTableImpl#createSlotReport
Execution then continues on the RM, where SlotManagerImpl#reportSlotStatus is called to update the slot state.
- ResourceManager#heartbeatFromTaskManager
- HeartbeatManagerImpl#receiveHeartbeat
- ResourceManager$TaskManagerHeartbeatListener#reportPayload
- SlotManagerImpl#reportSlotStatus. At this point the SlotReport looks like this:
    slotReport = {SlotReport@8718}
     slotsStatus = {ArrayList@8717} size = 2
      0 = {SlotStatus@9025} "SlotStatus{slotID=d99e16d7-a30c-4e21-b270-f82884b1813f_0, resourceProfile=ResourceProfile{managedMemory=64.000mb (67108864 bytes), networkMemory=32.000mb (33554432 bytes)}, allocationID=null, jobID=null}"
       slotID = {SlotID@9032} "d99e16d7-a30c-4e21-b270-f82884b1813f_0"
       resourceProfile = {ResourceProfile@4194} "ResourceProfile{managedMemory=64.000mb (67108864 bytes), networkMemory=32.000mb (33554432 bytes)}"
       allocationID = null
       jobID = null
      1 = {SlotStatus@9026} "SlotStatus{slotID=d99e16d7-a30c-4e21-b270-f82884b1813f_1, resourceProfile=ResourceProfile{managedMemory=64.000mb (67108864 bytes), networkMemory=32.000mb (33554432 bytes)}, allocationID=null, jobID=null}"
       slotID = {SlotID@9029} "d99e16d7-a30c-4e21-b270-f82884b1813f_1"
       resourceProfile = {ResourceProfile@4194} "ResourceProfile{managedMemory=64.000mb (67108864 bytes), networkMemory=32.000mb (33554432 bytes)}"
       allocationID = null
       jobID = null
- SlotManagerImpl#updateSlot
- SlotManagerImpl#updateSlotState: if there is a pendingSlotRequest, the slot is allocated to it directly
- SlotManagerImpl#handleFreeSlot: otherwise the freeSlots variable is updated:
    freeSlots.put(freeSlot.getSlotId(), freeSlot);
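0x03 Generating the ExecutionGraph
After a job is submitted and has gone through a series of processing steps, the Scheduler creates the ExecutionGraph. The ExecutionGraph is the parallelised version of the JobGraph, and only after it has been built and analysed can tasks be dispatched to the corresponding task slots. The number of slots is fixed in advance, typically according to the number of CPU cores, so that the CPU's computing resources are used as fully as possible. If the slots are exhausted, newly dispatched tasks cannot run.
ExecutionGraph: the distributed execution graph that the JobManager generates from the JobGraph. It is the core data structure of the scheduling layer.
A JobVertex / ExecutionJobVertex represents an operator, while a concrete ExecutionVertex represents a Task (one parallel subtask).
When the StreamGraph is generated, the StreamGraph.addOperator method already determines which task type the operator runs in, for example OneInputStreamTask or SourceStreamTask.
Suppose OneInputStreamTask.class is the vertexClass of the generated StreamNode. This value is passed along all the way: when the StreamGraph is converted into the JobGraph, it becomes the invokableClass of the JobVertex; when the JobGraph is converted into the ExecutionGraph, it is stored in ExecutionJobVertex.TaskInformation.invokableClassName; and finally it reaches the Task.
The code for this part executes in the following sequence: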
- JobMaster#createScheduler
- DefaultSchedulerFactory#createInstance
- DefaultScheduler#init
- SchedulerBase#init
- SchedulerBase#createAndRestoreExecutionGraph
- SchedulerBase#createExecutionGraph
- ExecutionGraphBuilder#buildGraph
- ExecutionGraph#attachJobGraph
- ExecutionJobVertex#init: the parallelism determines how many Tasks, that is, how many ExecutionVertex instances, are created.
    int numTaskVertices = vertexParallelism > 0 ? vertexParallelism : defaultParallelism;
    this.taskVertices = new ExecutionVertex[numTaskVertices];
- ExecutionVertex#init: this is where the Execution is created.
    this.currentExecution = new Execution(
        getExecutionGraph().getFutureExecutor(),
        this,
        0,
        initialGlobalModVersion,
        createTimestamp,
        timeout);
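0x04 Scheduling
A job's tasks are dispatched to a TaskManager and from there to a specific slot for execution.
The scheduling code in this part only uses CompletableFuture to wire up the execution framework; it can be thought of as working top-down.
Once the job starts being scheduled, the code executes in the following sequence: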
- JobMaster#startJobExecution
- JobMaster#resetAndStartScheduler
- Future operations
- JobMaster#startScheduling
- SchedulerBase#startScheduling
- DefaultScheduler#startSchedulingInternal
- LazyFromSourcesSchedulingStrategy#startScheduling: resource allocation and deployment for the vertices start here.
    allocateSlotsAndDeployExecutionVertices(schedulingTopology.getVertices());
- LazyFromSourcesSchedulingStrategy#allocateSlotsAndDeployExecutionVertices: iterates over the ExecutionVertex instances and selects those that are in CREATED state and whose input is ready.
    private void allocateSlotsAndDeployExecutionVertices(
            final Iterable<? extends SchedulingExecutionVertex<?, ?>> vertices) {
        // pick the ExecutionVertex instances that are in CREATED state and whose input is ready
        final Set<ExecutionVertexID> verticesToDeploy = IterableUtils.toStream(vertices)
            .filter(IS_IN_CREATED_EXECUTION_STATE.and(isInputConstraintSatisfied()))
            .map(SchedulingExecutionVertex::getId)
            .collect(Collectors.toSet());
        // build the DeploymentOption for each ExecutionVertex
        final List<ExecutionVertexDeploymentOption> vertexDeploymentOptions = ...;
        // allocate resources and deploy
        schedulerOperations.allocateSlotsAndDeploy(vertexDeploymentOptions);
    }
- DefaultScheduler#allocateSlotsAndDeploy
This brings us to the first key function of this article, allocateSlotsAndDeploy. Its main steps are:
- allocateSlots "allocates" the slots. Nothing is actually allocated at this point; a series of Futures is created and a list of SlotExecutionVertexAssignment is returned based on them.
- A DeploymentHandle is created for each SlotExecutionVertexAssignment.
- Deployment is driven by the deploymentHandles. In practice this only wires the deployment onto the Futures; the actual deployment runs after the slot allocation has succeeded.
    @Override
    public void allocateSlotsAndDeploy(final List<ExecutionVertexDeploymentOption> executionVertexDeploymentOptions) {
        validateDeploymentOptions(executionVertexDeploymentOptions);
        final Map<ExecutionVertexID, ExecutionVertexDeploymentOption> deploymentOptionsByVertex =
            groupDeploymentOptionsByVertexId(executionVertexDeploymentOptions);
        final List<ExecutionVertexID> verticesToDeploy = executionVertexDeploymentOptions.stream()
            .map(ExecutionVertexDeploymentOption::getExecutionVertexId)
            .collect(Collectors.toList());
        final Map<ExecutionVertexID, ExecutionVertexVersion> requiredVersionByVertex =
            executionVertexVersioner.recordVertexModifications(verticesToDeploy);
        transitionToScheduled(verticesToDeploy);
        // "Allocate" the slots: nothing is allocated yet, a series of Futures is created
        // and a list of SlotExecutionVertexAssignment is returned based on them
        final List<SlotExecutionVertexAssignment> slotExecutionVertexAssignments =
            allocateSlots(executionVertexDeploymentOptions);
        // Create a DeploymentHandle for each SlotExecutionVertexAssignment
        final List<DeploymentHandle> deploymentHandles = createDeploymentHandles(
            requiredVersionByVertex,
            deploymentOptionsByVertex,
            slotExecutionVertexAssignments);
        // Deploy based on the deploymentHandles: this only wires the deployment onto the Futures,
        // the actual deployment runs after the slot allocation has succeeded
        if (isDeployIndividually()) {
            deployIndividually(deploymentHandles);
        } else {
            waitForAllSlotsAndDeploy(deploymentHandles);
        }
    }
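The difference between the two deployment branches can be illustrated with plain CompletableFuture. This is a minimal sketch under assumptions: slots are represented by strings, the two static methods only borrow the names deployIndividually and waitForAllSlotsAndDeploy, and CompletableFuture.allOf stands in for whatever combination helper the real scheduler uses.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.CompletableFuture;

/** Sketch of "wire the deployment now, run it when the slot arrives"; names are hypothetical. */
class DeployWiringSketch {
    /** Each vertex is deployed as soon as its own slot future completes. */
    static void deployIndividually(List<CompletableFuture<String>> slotFutures, List<String> log) {
        for (CompletableFuture<String> slotFuture : slotFutures) {
            slotFuture.thenAccept(slot -> log.add("deploy on " + slot));
        }
    }

    /** Nothing is deployed until every slot future has completed. */
    static void waitForAllSlotsAndDeploy(List<CompletableFuture<String>> slotFutures, List<String> log) {
        CompletableFuture
            .allOf(slotFutures.toArray(new CompletableFuture[0]))
            .thenRun(() -> slotFutures.forEach(f -> log.add("deploy on " + f.join())));
    }

    public static void main(String[] args) {
        List<String> log = new ArrayList<>();
        CompletableFuture<String> first = new CompletableFuture<>();
        CompletableFuture<String> second = new CompletableFuture<>();
        // Wiring happens here, before any slot exists.
        waitForAllSlotsAndDeploy(Arrays.asList(first, second), log);
        first.complete("slot-0");
        System.out.println("after first slot: " + log);
        second.complete("slot-1");
        System.out.println("after second slot: " + log);
    }
}
```

In both variants the method returns immediately; the deployment callbacks only run later, when the slot futures are completed by someone else.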
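The next two sections analyse allocateSlots and deployIndividually / waitForAllSlotsAndDeploy respectively.
0x05 Resource allocation
Note that the entry point here is the allocateSlots call inside allocateSlotsAndDeploy.
When a slot is allocated, the allocation is first attempted in the SlotPool of the JobMaster. All slots are fetched from the SlotPool and the most suitable one is chosen; there are two selection strategies, one that prefers location and one that prefers previously allocated slots. If the SlotPool cannot serve the request, a slot is requested from the ResourceManager through RPC. If the ResourceManager is not connected yet, the request is cached and sent once the connection has been established.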
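The fallback order described above (SlotPool first, then the ResourceManager, with stashing while disconnected) can be sketched as follows. All names are assumptions for illustration; in particular requestedFromResourceManager simply stands in for the RPC, and slot selection strategies are left out.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import java.util.concurrent.CompletableFuture;

/** Sketch of the allocation fallback; a simplified stand-in, not SlotPoolImpl. */
class SlotPoolFallbackSketch {
    final Deque<String> availableSlots = new ArrayDeque<>();
    // Requests stashed while no ResourceManager connection exists.
    final List<CompletableFuture<String>> waitingForResourceManager = new ArrayList<>();
    // Requests that have been "sent" to the ResourceManager (the RPC is omitted here).
    final List<CompletableFuture<String>> requestedFromResourceManager = new ArrayList<>();
    boolean connected = false;

    CompletableFuture<String> allocateSlot() {
        String slot = availableSlots.poll();
        if (slot != null) {
            // Served straight from the pool: the future is already complete.
            return CompletableFuture.completedFuture(slot);
        }
        CompletableFuture<String> pending = new CompletableFuture<>();
        if (connected) {
            requestedFromResourceManager.add(pending);
        } else {
            waitingForResourceManager.add(pending);
        }
        return pending;
    }

    void connectToResourceManager() {
        connected = true;
        // Everything that was stashed is requested now.
        requestedFromResourceManager.addAll(waitingForResourceManager);
        waitingForResourceManager.clear();
    }

    public static void main(String[] args) {
        SlotPoolFallbackSketch pool = new SlotPoolFallbackSketch();
        pool.availableSlots.add("slot-0");
        System.out.println("first request done: " + pool.allocateSlot().isDone());
        System.out.println("second request done: " + pool.allocateSlot().isDone());
        pool.connectToResourceManager();
        System.out.println("sent to RM: " + pool.requestedFromResourceManager.size());
    }
}
```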
5.1 CompletableFuture
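A CompletableFuture is first of all a Future: it has everything a Future offers, such as obtaining the result of an asynchronous computation or cancelling a running task. It is also a CompletionStage, whose main benefit is turning callbacks into chained calls so that Futures can be composed.
The execution framework is built here out of three CompletableFutures.
We name them Future 1, Future 2 and Future 3 in the order in which they appear.
Explaining them in reverse order, however, turns out to be easier, because:
the order of appearance is Future 1, Future 2, Future 3;
the order of invocation is Future 3 ---> Future 2 ---> Future 1.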
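5.1.1 Future 3
We can call it the PhysicalSlot future.
Type: CompletableFuture<PhysicalSlot>
Created in: the construction of the PendingRequest inside requestNewAllocatedSlot. The PendingRequest constructor calls new CompletableFuture<>(), and Future 3 is a member variable of the PendingRequest.
Used for:
- The PendingRequest is added to waitingForResourceManager.
What its callbacks do:
- The whenComplete in allocateMultiTaskSlot assigns the payload to the slot through allocatedSlot.tryAssignPayload.
- A further callback is set up by the forward / thenApply statement in createRootSlot, so that the completion of Future 3 completes Future 2.
When it is called back:
- When the TM (TE) offers a slot, the callback is reached indirectly through the PendingRequest.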
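5.1.2 Future 2
We can call it allocationFuture.
Type:
- CompletableFuture, with type conversions along the way (judging by the code shown later in this article: CompletableFuture<SlotContext>, then CompletableFuture<SingleLogicalSlot>, exposed as CompletableFuture<LogicalSlot>)
Created in:
- The createRootSlot function: final CompletableFuture<SlotContext> slotContextFutureAfterRootSlotResolution = new CompletableFuture<>();
Used for:
- Future 2 is stored in the multiTaskSlot as the member variable private final CompletableFuture<? extends SlotContext> slotContextFuture;
- Future 2 is therefore also what a SingleTaskSlot obtains from parent.getSlotContextFuture(), because MultiTaskSlot and SingleTaskSlot are parent and child.
- In the SingleTaskSlot constructor, a future derived from Future 2 is assigned to the member variable singleLogicalSlotFuture.
- In other words, Future 2 effectively is the singleLogicalSlotFuture member of the SingleTaskSlot.
- In SchedulerImpl#allocateSharedSlot, return leaf.getLogicalSlotFuture(); hands singleLogicalSlotFuture to the caller. This is the allocationFuture that the outer layer sees.
What its callbacks do:
- The callback set up in the SingleTaskSlot constructor creates a SingleLogicalSlot (it is only really created when the callback fires).
- The callback set up in internalAllocateSlot completes Future 1, the allocationResultFuture.
When it is called back:
- It is triggered by the callback of Future 3.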
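5.1.3 Future 1
We can call it allocationResultFuture.
Type:
- CompletableFuture<LogicalSlot>
Created in:
- SchedulerImpl#allocateSlotInternal, which creates the first CompletableFuture.
Used for:
- The later deploy stage uses Future 1: through handle, two follow-up stages are attached to it, and they run after Future 1 has completed.
What its callbacks do:
- allocateSlotsFor attaches the error handling.
- The two follow-up stages attached through handle in the deploy stage run once Future 1 has completed.
When it is called back:
- The statement lives in internalAllocateSlot, but it is executed inside the callback of Future 2.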
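The relationship between the three futures can be reproduced with plain CompletableFuture. This is a minimal sketch, not Flink code: slots are strings, and the variable names future1, future2 and future3 only mirror the naming used in this article. It shows that the future handed to the caller first is the one whose callback fires last.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;

/** Sketch of the Future 3 -> Future 2 -> Future 1 completion chain; all names are illustrative. */
class ThreeFuturesSketch {
    static List<String> completionOrder() {
        List<String> order = new ArrayList<>();

        // Future 1: appears first, it is what the outermost caller holds.
        CompletableFuture<String> future1 = new CompletableFuture<>();
        future1.whenComplete((slot, error) -> order.add("Future 1"));

        // Future 3: created deepest in the call chain, stands for the physical slot request.
        CompletableFuture<String> future3 = new CompletableFuture<>();

        // Future 2: derived from Future 3; the logical slot is only built when Future 3 completes.
        CompletableFuture<String> future2 = future3.thenApply(physicalSlot -> {
            order.add("Future 3");
            return "logical(" + physicalSlot + ")";
        });

        // When Future 2 completes, it completes Future 1.
        future2.whenComplete((logicalSlot, error) -> {
            order.add("Future 2");
            if (error != null) {
                future1.completeExceptionally(error);
            } else {
                future1.complete(logicalSlot);
            }
        });

        // Later, a slot is offered: completing Future 3 triggers the whole chain.
        future3.complete("physical-slot-0");
        return order;
    }

    public static void main(String[] args) {
        System.out.println(completionOrder());
    }
}
```

Nothing happens until future3.complete is called; from then on the callbacks run in the order Future 3, Future 2, Future 1, all on the completing thread.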
5.2 Flow chart
This part is fairly involved, so here is the flow chart first.
* Run in Job Manager
*
* DefaultScheduler#allocateSlotsAndDeploy
* |
* +----> DefaultScheduler#allocateSlots
* | // converts ExecutionVertex into ExecutionVertexSchedulingRequirements
* |
* +----> DefaultExecutionSlotAllocator#allocateSlotsFor ( call 1 begins )
* | // obtains our first CompletableFuture, which we call Future 1
* |
* |
* +--------------> NormalSlotProviderStrategy#allocateSlot
* |
* |
* +--------------> SchedulerImpl#allocateSlotInternal
* | // creates the first CompletableFuture, from now on called allocationResultFuture
* |
* ┌────────────┐
* │ Future 1 │ creates allocationResultFuture
* └────────────┘
* │
* │
* +----> SchedulerImpl#internalAllocateSlot ( call 2 begins )
* | // Future 1 is passed in as a parameter; the calls continue from here and create Future 2 and Future 3
* |
* |
* +-----------> SchedulerImpl#allocateSharedSlot ( call 3 begins )
* | // this involves MultiTaskSlot and SingleTaskSlot
* |
* +-----------> SchedulerImpl#allocateMultiTaskSlot ( call 4 begins )
* |
* |
* +--------------------> SchedulerImpl#requestNewAllocatedSlot
* |
* |
* +--------------------> SlotPoolImpl#requestNewAllocatedSlot
* | // a PendingRequest is created here
* | // the PendingRequest constructor calls new CompletableFuture<>(),
* | // so this is where the third Future is created; note that this Future is for a PhysicalSlot
* |
* |
* ┌────────────┐
* │ Future 3 │ creates Future<PhysicalSlot>; Future 3 is in fact invisible to the user.
* └────────────┘
* |
* |
* +-----------> SchedulerImpl#allocateMultiTaskSlot ( call 4 ends )
* | // back in ( call 4 ), which now holds Future 3
* | // this is the third Future, a Future<PhysicalSlot>
* | // "third" because, from the user's point of view, it is the third one to appear
* |
* +-----------------------> SlotSharingManager#createRootSlot
* | // Future 3 is passed in as a parameter
* | // Future 2 is created here right away
* | // Future 2 is stored in the multiTaskSlot as the member variable slotContextFuture;
* | // the forward / thenApply statement then makes the completion of Future 3 complete Future 2
* |
* |
* +-----------> SchedulerImpl#allocateSharedSlot
* | // back in ( call 3 )
* |
* |
* +-----------------------> SlotSharingManager#allocateSingleTaskSlot
* | // creates a SingleTaskSlot leaf on top of rootMultiTaskSlot and adds it to allTaskSlots.
* | // leaf.getLogicalSlotFuture(); is the already prepared Future 2
* |
* |
* +-----------> SchedulerImpl#allocateSharedSlot
* | // still in ( call 3 )
* | // return leaf.getLogicalSlotFuture(); returns Future 2
* |
* |
* ┌────────────┐
* │ Future 2 │
* └────────────┘
* |
* |
* |
* +----> SchedulerImpl#internalAllocateSlot
* | // back in ( call 2 )
* | // sets things up so that the callback of Future 2 completes Future 1
* |
* |
* +----> DefaultExecutionSlotAllocator#allocateSlotsFor
* | // back in ( call 1 )
* |
* |
* |
* ┌────────────┐
* │ Future 1 │
* └────────────┘
* |
* |
* +----> createDeploymentHandles
* | // creates the DeploymentHandle instances
* |
* |
* +-----------> deployIndividually(deploymentHandles);
* | // attaches two more callbacks to Future 1 as the deployment callbacks
* |
The image below shows the same chart, for reading on a phone.
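5.3 Detailed execution path
By default, Flink allows subtasks to share a slot, provided they are subtasks of different tasks of the same job. As a result, one slot may hold an entire pipeline of the job. Allowing slot sharing has two benefits:
- The number of task slots a Flink cluster needs equals the highest parallelism used in the job. In other words, there is no need to work out how many tasks a program will start in total.
- Resources are easier to utilise fully. Without slot sharing, the non-intensive operators source/flatmap would occupy as many resources as the intensive operators keyAggregation/sink. With slot sharing, raising the base parallelism from 2 to 6 makes full use of the slot resources while ensuring that the heavy subtasks are spread evenly across the TaskManagers.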
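The first benefit can be checked with a small calculation. This is a minimal sketch under assumptions: the task names and parallelism values are made up for illustration, and all tasks are assumed to be in one slot sharing group. With sharing, the job needs as many slots as its highest parallelism; without sharing, every subtask needs its own slot.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Sketch of the slot arithmetic behind slot sharing; the job layout below is hypothetical. */
class SlotSharingCountSketch {
    /** With one sharing group, a slot can hold one subtask of every task. */
    static int slotsWithSharing(Map<String, Integer> parallelismByTask) {
        return parallelismByTask.values().stream().mapToInt(Integer::intValue).max().orElse(0);
    }

    /** Without sharing, every subtask occupies a slot of its own. */
    static int slotsWithoutSharing(Map<String, Integer> parallelismByTask) {
        return parallelismByTask.values().stream().mapToInt(Integer::intValue).sum();
    }

    public static void main(String[] args) {
        Map<String, Integer> job = new LinkedHashMap<>();
        job.put("source/flatMap", 6);
        job.put("keyAggregation", 6);
        job.put("sink", 1);
        System.out.println("with sharing: " + slotsWithSharing(job));       // 6
        System.out.println("without sharing: " + slotsWithoutSharing(job)); // 13
    }
}
```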
The execution path here is roughly as follows:
- DefaultScheduler#allocateSlotsAndDeploy
- DefaultScheduler#allocateSlots: converts each ExecutionVertex into ExecutionVertexSchedulingRequirements, which bundles location information, sharing information, resource information and so on.
- DefaultExecutionSlotAllocator#allocateSlotsFor: this is where the analysis in this section really starts. The function triggers a series of operations, calling down layer by layer. First it obtains our first CompletableFuture, which we call allocationResultFuture (where the name comes from will become clear shortly). This slotFuture is stored in a SlotExecutionVertexAssignment and passed to the outside. The later deploy stage uses this slotFuture: through handle, two follow-up stages are attached to it, and they run after slotFuture has completed.
    public List<SlotExecutionVertexAssignment> allocateSlotsFor(...) {
        for (ExecutionVertexSchedulingRequirements schedulingRequirements : executionVertexSchedulingRequirements) {
            // Obtain the first CompletableFuture; it is produced by chaining on calculatePreferredLocations
            CompletableFuture<LogicalSlot> slotFuture = calculatePreferredLocations(...).thenCompose((...) ->
                slotProviderStrategy.allocateSlot( // the first CompletableFuture is created inside this function
                    slotRequestId,
                    new ScheduledUnit(...),
                    SlotProfile.priorAllocation(...)));
            SlotExecutionVertexAssignment slotExecutionVertexAssignment =
                new SlotExecutionVertexAssignment(executionVertexId, slotFuture);
            slotFuture.whenComplete(
                (ignored, throwable) -> {
                    // Callback of the first CompletableFuture; it only does error handling and is invoked later
                    pendingSlotAssignments.remove(executionVertexId);
                    if (throwable != null) {
                        slotProviderStrategy.cancelSlotRequest(slotRequestId, slotSharingGroupId, throwable);
                    }
                });
            slotExecutionVertexAssignments.add(slotExecutionVertexAssignment);
        }
        return slotExecutionVertexAssignments;
    }
- NormalSlotProviderStrategy#allocateSlot (slotProviderStrategy.allocateSlot)
- SchedulerImpl#allocateSlotInternal: creates the first CompletableFuture, which we call allocationResultFuture.
    private CompletableFuture<LogicalSlot> allocateSlotInternal(...) {
        // The first CompletableFuture is created here; from now on we call it allocationResultFuture
        final CompletableFuture<LogicalSlot> allocationResultFuture = new CompletableFuture<>();
        // allocationResultFuture is passed down for further processing
        internalAllocateSlot(allocationResultFuture, slotRequestId, scheduledUnit, slotProfile, allocationTimeout);
        // return allocationResultFuture
        return allocationResultFuture;
    }
- SchedulerImpl#allocateSlot
- SchedulerImpl#internalAllocateSlot: allocates a single slot or a shared slot depending on whether the vertex shares slots. The second CompletableFuture is obtained here; from now on we call it allocationFuture.
    private void internalAllocateSlot(
            CompletableFuture<LogicalSlot> allocationResultFuture, ...) {
        // The second CompletableFuture, allocationFuture, is obtained here. Note that it is only obtained here, not created.
        CompletableFuture<LogicalSlot> allocationFuture = scheduledUnit.getSlotSharingGroupId() == null ?
            allocateSingleSlot(slotRequestId, slotProfile, allocationTimeout) :
            allocateSharedSlot(slotRequestId, scheduledUnit, slotProfile, allocationTimeout);
        // Callback of the second Future, allocationFuture. Note that whenComplete can be attached to a CompletableFuture several times.
        allocationFuture.whenComplete((LogicalSlot slot, Throwable failure) -> {
            if (failure != null) { // error handling
                cancelSlotRequest(...);
                allocationResultFuture.completeExceptionally(failure);
            } else {
                allocationResultFuture.complete(slot); // this triggers the callbacks of the first Future, allocationResultFuture
            }
        });
    }
- SchedulerImpl#allocateSharedSlot: this is also fairly involved, because MultiTaskSlot and SingleTaskSlot come into play.
    private CompletableFuture<LogicalSlot> allocateSharedSlot(...) {
        // allocate slot with slot sharing
        final SlotSharingManager multiTaskSlotManager = slotSharingManagers.computeIfAbsent(
            scheduledUnit.getSlotSharingGroupId(),
            id -> new SlotSharingManager(id, slotPool, this)); // creates the SlotSharingManager
        final SlotSharingManager.MultiTaskSlotLocality multiTaskSlotLocality;
        if (scheduledUnit.getCoLocationConstraint() != null) {
            multiTaskSlotLocality = allocateCoLocatedMultiTaskSlot(...);
        } else {
            multiTaskSlotLocality = allocateMultiTaskSlot(...); // the MultiTaskSlot is created here
        }
        // the SingleTaskSlot is created here
        final SlotSharingManager.SingleTaskSlot leaf = multiTaskSlotLocality.getMultiTaskSlot().allocateSingleTaskSlot(...);
        // returns the future of the SingleTaskSlot, which is the second Future; how it is created is detailed below
        return leaf.getLogicalSlotFuture();
    }
- SchedulerImpl#allocateMultiTaskSlot: this is one of the hard functions, because the third Future, a Future<PhysicalSlot>, is created here. We describe the third Future ahead of the second one; it is "third" because, from the user's point of view, it is the third one to appear.
    private SlotSharingManager.MultiTaskSlotLocality allocateMultiTaskSlot(...) {
        SlotSharingManager.MultiTaskSlot multiTaskSlot = slotSharingManager.getUnresolvedRootSlot(groupId);
        if (multiTaskSlot == null) {
            // requestNewAllocatedSlot calls the method of the same name in SlotPoolImpl
            // and yields the third Future; note that this Future is for a PhysicalSlot
            final CompletableFuture<PhysicalSlot> slotAllocationFuture = requestNewAllocatedSlot(...);
            // build the multiTaskSlot from the third Future
            multiTaskSlot = slotSharingManager.createRootSlot(..., slotAllocationFuture, ...);
            // callback of the third Future: assigns the payload to the slot
            slotAllocationFuture.whenComplete(
                (PhysicalSlot allocatedSlot, Throwable throwable) -> {
                    final SlotSharingManager.TaskSlot taskSlot = slotSharingManager.getTaskSlot(multiTaskSlotRequestId);
                    if (taskSlot != null) {
                        // assign the payload to the slot
                        if (!allocatedSlot.tryAssignPayload(((SlotSharingManager.MultiTaskSlot) taskSlot))) {...}
                    }
                });
        }
        return SlotSharingManager.MultiTaskSlotLocality.of(multiTaskSlot, Locality.UNKNOWN);
    }
- SchedulerImpl#requestNewAllocatedSlot: calls the method of the same name in SlotPoolImpl.
- SlotPoolImpl#requestNewAllocatedSlot: creates a PendingRequest.
    public CompletableFuture<PhysicalSlot> requestNewAllocatedSlot(...) {
        // create the PendingRequest
        final PendingRequest pendingRequest = PendingRequest.createStreamingRequest(slotRequestId, resourceProfile);
        // add the PendingRequest to waitingForResourceManager, then return the Future
        return requestNewAllocatedSlotInternal(pendingRequest)
            .thenApply((Function.identity()));
    }
  The PendingRequest constructor calls new CompletableFuture<>(), so this is where the third Future is created. Note that this Future is for a PhysicalSlot.
- requestNewAllocatedSlotInternal
    private CompletableFuture<AllocatedSlot> requestNewAllocatedSlotInternal(PendingRequest pendingRequest) {
        if (resourceManagerGateway == null) {
            // simply adds the pendingRequest to waitingForResourceManager
            stashRequestWaitingForResourceManager(pendingRequest);
        } else {
            requestSlotFromResourceManager(resourceManagerGateway, pendingRequest);
        }
        return pendingRequest.getAllocatedSlotFuture(); // the third Future
    }
- SlotSharingManager#createRootSlot: this is where the second Future, a Future<SlotContext>, is actually created.
    MultiTaskSlot createRootSlot(
            SlotRequestId slotRequestId,
            CompletableFuture<? extends SlotContext> slotContextFuture, // this parameter is the third Future
            SlotRequestId allocatedSlotRequestId) {
        // create the second Future, a Future<SlotContext>
        final CompletableFuture<SlotContext> slotContextFutureAfterRootSlotResolution = new CompletableFuture<>();
        // inside createAndRegisterRootSlot the second Future becomes the slotContextFuture member of the MultiTaskSlot
        final MultiTaskSlot rootMultiTaskSlot = createAndRegisterRootSlot(... slotContextFutureAfterRootSlotResolution);
        FutureUtils.forward(
            slotContextFuture.thenApply( // when the third Future completes, it goes on to complete the second Future
                (SlotContext slotContext) -> {
                    // add the root node to the set of resolved root nodes once the SlotContext future has
                    // been completed and we know the slot's TaskManagerLocation
                    tryMarkSlotAsResolved(slotRequestId, slotContext);
                    return slotContext;
                }),
            slotContextFutureAfterRootSlotResolution); // the second Future is completed here
        return rootMultiTaskSlot;
    }
- SlotSharingManager#allocateSingleTaskSlot: the purpose is to create a SingleTaskSlot leaf on top of rootMultiTaskSlot and add it to allTaskSlots.
    SingleTaskSlot allocateSingleTaskSlot(
            SlotRequestId slotRequestId,
            ResourceProfile resourceProfile,
            AbstractID groupId,
            Locality locality) {
        final SingleTaskSlot leaf = new SingleTaskSlot(
            slotRequestId, resourceProfile, groupId, this, locality);
        children.put(groupId, leaf);
        // register the newly allocated slot also at the SlotSharingManager
        allTaskSlots.put(slotRequestId, leaf);
        reserveResource(resourceProfile);
        return leaf;
    }
- Finally we are back in SchedulerImpl#allocateSharedSlot at return leaf.getLogicalSlotFuture();. This is another tricky point: getLogicalSlotFuture returns a CompletableFuture<LogicalSlot> (the second Future), but the SingleLogicalSlot itself is only created later, when the callback fires.
    public final class SingleTaskSlot extends TaskSlot {
        private final MultiTaskSlot parent;
        // future containing a LogicalSlot which is completed once the underlying SlotContext future is completed
        private final CompletableFuture<SingleLogicalSlot> singleLogicalSlotFuture;

        private SingleTaskSlot() {
            singleLogicalSlotFuture = parent.getSlotContextFuture()
                .thenApply(
                    (SlotContext slotContext) -> {
                        return new SingleLogicalSlot( // only created later, when the callback fires
                            slotRequestId,
                            slotContext,
                            slotSharingGroupId,
                            locality,
                            slotOwner);
                    });
        }

        CompletableFuture<LogicalSlot> getLogicalSlotFuture() {
            return singleLogicalSlotFuture.thenApply(Function.identity());
        }
    }
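The forward / thenApply combination used in createRootSlot, together with the lazy creation of the logical slot, can be sketched in isolation. The forward helper below is a hypothetical re-implementation written for this example (it is not copied from Flink's FutureUtils), and slots are again plain strings.

```java
import java.util.concurrent.CompletableFuture;

/** Sketch of forwarding one future's outcome into another; names are illustrative only. */
class ForwardSketch {
    /** Completes target with whatever outcome (value or error) source ends up with. */
    static <T> void forward(CompletableFuture<T> source, CompletableFuture<T> target) {
        source.whenComplete((value, error) -> {
            if (error != null) {
                target.completeExceptionally(error);
            } else {
                target.complete(value);
            }
        });
    }

    static String demo() {
        // Stands in for Future 3 (physical slot) and Future 2 (slot context).
        CompletableFuture<String> physicalSlotFuture = new CompletableFuture<>();
        CompletableFuture<String> slotContextFuture = new CompletableFuture<>();

        // Like createRootSlot: do some bookkeeping on the value, then forward it.
        forward(physicalSlotFuture.thenApply(slot -> slot + " (resolved)"), slotContextFuture);

        // Like the SingleTaskSlot constructor: the logical slot is only built inside the callback.
        CompletableFuture<String> logicalSlotFuture =
            slotContextFuture.thenApply(context -> "LogicalSlot[" + context + "]");

        // Nothing has been created so far; offering the slot triggers everything.
        physicalSlotFuture.complete("slot-0");
        return logicalSlotFuture.join();
    }

    public static void main(String[] args) {
        System.out.println(demo()); // LogicalSlot[slot-0 (resolved)]
    }
}
```

The target future is created up front so that it can be stored and handed out before the source has produced anything, which is the reason the pattern appears in createRootSlot.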
0x06 Deploy stage
Note that the entry point here is the deployIndividually / waitForAllSlotsAndDeploy statement inside allocateSlotsAndDeploy.
The execution path here is roughly as follows:
- DefaultScheduler#allocateSlotsAndDeploy
- DefaultScheduler#allocateSlots: yields the list of SlotExecutionVertexAssignment, as described in detail in the previous section (each ExecutionVertex is converted into ExecutionVertexSchedulingRequirements, which bundles location information, sharing information, resource information and so on).
- List<DeploymentHandle> deploymentHandles = createDeploymentHandles(): creates the DeploymentHandle instances from the SlotExecutionVertexAssignment list.
- DefaultScheduler#deployIndividually: deploys based on the deploymentHandles. This only wires the deployment onto the Futures; the actual deployment runs after the slot allocation has succeeded. This is where the analysis in this section really starts. As the code shows, Future 1 is fetched and a series of operations is attached to it.
    private void deployIndividually(final List<DeploymentHandle> deploymentHandles) {
        for (final DeploymentHandle deploymentHandle : deploymentHandles) {
            FutureUtils.assertNoException(
                deploymentHandle
                    .getSlotExecutionVertexAssignment()
                    .getLogicalSlotFuture()
                    .handle(assignResourceOrHandleError(deploymentHandle))
                    .handle(deployOrHandleError(deploymentHandle)));
        }
    }
- DefaultScheduler#assignResourceOrHandleError: simply returns a function that is invoked later as a callback.
    private BiFunction<LogicalSlot, Throwable, Void> assignResourceOrHandleError(final DeploymentHandle deploymentHandle) {
        final ExecutionVertexVersion requiredVertexVersion = deploymentHandle.getRequiredVertexVersion();
        final ExecutionVertexID executionVertexId = deploymentHandle.getExecutionVertexId();
        return (logicalSlot, throwable) -> {
            if (throwable == null) {
                final ExecutionVertex executionVertex = getExecutionVertex(executionVertexId);
                final boolean sendScheduleOrUpdateConsumerMessage = deploymentHandle.getDeploymentOption().sendScheduleOrUpdateConsumerMessage();
                executionVertex
                    .getCurrentExecutionAttempt()
                    .registerProducedPartitions(logicalSlot.getTaskManagerLocation(), sendScheduleOrUpdateConsumerMessage);
                executionVertex.tryAssignResource(logicalSlot);
            } else {
                handleTaskDeploymentFailure(executionVertexId, maybeWrapWithNoResourceAvailableException(throwable));
            }
            return null;
        };
    }
- deployOrHandleError: likewise returns a function that is invoked later as a callback.
    private BiFunction<Object, Throwable, Void> deployOrHandleError(final DeploymentHandle deploymentHandle) {
        final ExecutionVertexVersion requiredVertexVersion = deploymentHandle.getRequiredVertexVersion();
        final ExecutionVertexID executionVertexId = requiredVertexVersion.getExecutionVertexId();
        return (ignored, throwable) -> {
            if (throwable == null) {
                deployTaskSafe(executionVertexId);
            } else {
                handleTaskDeploymentFailure(executionVertexId, throwable);
            }
            return null;
        };
    }
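How two chained handle stages behave can be tried out with plain CompletableFuture. This is a minimal sketch, not the scheduler's code: the class and log messages are made up, and the two lambdas only play the roles of assignResourceOrHandleError and deployOrHandleError. One general CompletableFuture property is worth seeing in action: a handle stage that returns normally has consumed the upstream error, so the next stage receives a null throwable.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;

/** Sketch of two chained handle() stages on a slot future; names are illustrative only. */
class HandleChainSketch {
    static List<String> wire(CompletableFuture<String> slotFuture) {
        List<String> log = new ArrayList<>();
        slotFuture
            // First stage: plays the role of "assign the resource or handle the error".
            .handle((slot, error) -> {
                log.add(error == null ? "assign " + slot : "assign failed: " + error.getMessage());
                return null;
            })
            // Second stage: plays the role of "deploy or handle the error".
            .handle((ignored, error) -> {
                log.add("deploy stage, upstream error = " + error);
                return null;
            });
        return log;
    }

    public static void main(String[] args) {
        CompletableFuture<String> ok = new CompletableFuture<>();
        List<String> okLog = wire(ok);
        ok.complete("slot-0");
        System.out.println(okLog);

        CompletableFuture<String> failed = new CompletableFuture<>();
        List<String> failedLog = wire(failed);
        failed.completeExceptionally(new RuntimeException("no slot"));
        System.out.println(failedLog);
    }
}
```

In the failure case the second stage still runs and sees no error, so in a design like this the second stage needs some other signal (for example state updated by the first stage) to know that deployment should be skipped.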
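0x07 Resource allocation by the RM
Nearly all the work so far has taken place inside the JM, where the Scheduler and the SlotPool carry out the resource request and deployment stages. By now the SlotPool has accumulated one PendingRequest; once the SlotPool is connected to the RM, it can start requesting resources from the RM.
When the ResourceManager receives a slot request and finds that the JobManager is not registered, it throws an exception right away. Otherwise it forwards the request to the SlotManager. The SlotManager keeps track of all idle slots in the cluster (TaskManagers report their information to the ResourceManager, where the SlotManager stores the mapping between slots and TaskManagers), picks a slot that satisfies the request, and then sends an RPC request to the TaskManager to claim that slot.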
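The "find a matching free slot, otherwise keep the request pending" decision can be sketched as follows. The class, the use of a single memory figure as the resource profile, and the first-match rule are all assumptions for illustration; the real SlotManager compares full ResourceProfile objects through a pluggable matching strategy.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;

/** Sketch of matching a slot request against free slots; a simplified stand-in for the SlotManager. */
class SlotRequestSketch {
    // slotId -> managed memory in MB (a one-number stand-in for a resource profile)
    final Map<String, Integer> freeSlots = new LinkedHashMap<>();
    // allocationId -> required memory in MB
    final Map<String, Integer> pendingSlotRequests = new LinkedHashMap<>();

    /** Returns the slot chosen for the request, or empty if the request has to wait. */
    Optional<String> requestSlot(String allocationId, int requiredMb) {
        Optional<String> match = freeSlots.entrySet().stream()
            .filter(entry -> entry.getValue() >= requiredMb)
            .map(Map.Entry::getKey)
            .findFirst();
        if (match.isPresent()) {
            // The slot is no longer free; the real code would now send an RPC to the TaskManager.
            freeSlots.remove(match.get());
        } else {
            pendingSlotRequests.put(allocationId, requiredMb);
        }
        return match;
    }

    public static void main(String[] args) {
        SlotRequestSketch manager = new SlotRequestSketch();
        manager.freeSlots.put("tm_0", 64);
        System.out.println(manager.requestSlot("alloc-1", 32)); // Optional[tm_0]
        System.out.println(manager.requestSlot("alloc-2", 32)); // Optional.empty
    }
}
```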
The code execution path is as follows:
- JobMaster#establishResourceManagerConnection: the program is running in the JM.
- SlotPoolImpl#connectToResourceManager
- SlotPoolImpl#requestSlotFromResourceManager: here the pool sends an RPC request to the RM.
    private void requestSlotFromResourceManager(
            final ResourceManagerGateway resourceManagerGateway,
            final PendingRequest pendingRequest) {
        // Create an AllocationID; it is passed on to the TM and registered on the TaskSlot.
        final AllocationID allocationId = new AllocationID();
        // Create a SlotRequest and send it to the RM through RPC.
        CompletableFuture<Acknowledge> rmResponse = resourceManagerGateway.requestSlot(
            jobMasterId,
            new SlotRequest(jobId, allocationId, pendingRequest.getResourceProfile(), jobManagerAddress),
            rpcTimeout);
    }
- RPC
- ResourceManager#requestSlot: the program switches to the RM.
- SlotManagerImpl#registerSlotRequest: first runs checkDuplicateRequest to detect duplicates. If the request is not a duplicate, it is recorded in pendingSlotRequests and internalRequestSlot is called to allocate a slot. If an exception occurs, the request is removed from pendingSlotRequests and a SlotManagerException is thrown.
    pendingSlotRequests.put
- SlotManagerImpl#internalRequestSlot
- SlotManagerImpl#findMatchingSlot
- SlotManagerImpl#internalAllocateSlot: at this point no resource is held yet, so it has to be requested from the TM.
    private void internalRequestSlot(PendingSlotRequest pendingSlotRequest) throws ResourceManagerException {
        final ResourceProfile resourceProfile = pendingSlotRequest.getResourceProfile();
        OptionalConsumer.of(findMatchingSlot(resourceProfile))
            .ifPresent(taskManagerSlot -> allocateSlot(taskManagerSlot, pendingSlotRequest))
            .ifNotPresent(() -> fulfillPendingSlotRequestWithPendingTaskManagerSlot(pendingSlotRequest));
    }
- SlotManagerImpl#allocateSlot: requests the resource from the task manager. The TaskExecutorGateway interface is used to allocate a task slot, that is, the resources for tasks, through RPC.
    TaskExecutorGateway gateway = taskExecutorConnection.getTaskExecutorGateway();
    CompletableFuture<Acknowledge> requestFuture = gateway.requestSlot(
        slotId,
        pendingSlotRequest.getJobId(),
        allocationId,
        pendingSlotRequest.getResourceProfile(),
        pendingSlotRequest.getTargetAddress(),
        resourceManagerId,
        taskManagerRequestTimeout);
- RPC
- TaskExecutor#requestSlot: the program switches to the TE.
- TaskSlotTableImpl#allocateSlot: allocates the resource, updates the task slot map and adds the slot to the set of job slots.
    public boolean allocateSlot(int index, JobID jobId, AllocationID allocationId, ResourceProfile resourceProfile, Time slotTimeout) {
        // excerpt: checks, timeout registration and the return statement are omitted
        taskSlot = new TaskSlot<>(index, resourceProfile, memoryPageSize, jobId, allocationId);
        taskSlots.put(index, taskSlot);
        allocatedSlots.put(allocationId, taskSlot);
        slots.add(allocationId);
    }
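0x08 Offering resources
This stage starts at the TE (TM): the TE offers its slots to the JM (see TaskExecutor#offerSlotsToJobManager and JobMaster#offerSlots in the path below), after which the JM can run the job. This part can be seen as bottom-up execution.
After all slot requests have completed, the Execution belonging to each ExecutionVertex is assigned to its slot, that is, the slot's resources are assigned to the Execution. Once the assignment is done, deployment of the job can begin.
There are two key points here:
- When the JM receives a SlotOffer, it uses the taskManagerId parameter passed in through RPC to obtain a taskExecutorGateway, which is then stored as AllocatedSlot.taskManagerGateway. This links the slot as seen by the JM with the TaskManager that hosts it.
- When an Execution is deployed, it obtains the TaskManager's RPC gateway by following SingleLogicalSlot ---> AllocatedSlot ---> TaskManagerGateway, and only then can it submit the task through taskManagerGateway.submitTask. This links the deployment of the Execution with its actual execution.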
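The lookup chain in the second key point can be sketched with three small types. This is a minimal sketch under assumptions: the nested types reuse the names SingleLogicalSlot, AllocatedSlot and TaskManagerGateway only for readability, their fields and the string-based submitTask are invented for this example, and no RPC is involved.

```java
/** Sketch of SingleLogicalSlot -> AllocatedSlot -> TaskManagerGateway; simplified stand-in types. */
class GatewayChainSketch {
    /** Stand-in for the RPC gateway of a TaskManager. */
    interface TaskManagerGateway {
        String submitTask(String taskName);
    }

    /** Physical slot on the JM side; it remembers which gateway it came from. */
    static final class AllocatedSlot {
        final TaskManagerGateway taskManagerGateway;

        AllocatedSlot(TaskManagerGateway taskManagerGateway) {
            this.taskManagerGateway = taskManagerGateway;
        }
    }

    /** Logical slot handed to an Execution; it delegates to the physical slot. */
    static final class SingleLogicalSlot {
        private final AllocatedSlot slotContext;

        SingleLogicalSlot(AllocatedSlot slotContext) {
            this.slotContext = slotContext;
        }

        TaskManagerGateway getTaskManagerGateway() {
            return slotContext.taskManagerGateway;
        }
    }

    /** What deployment boils down to: follow the chain, then submit the task. */
    static String deploy(SingleLogicalSlot slot, String taskName) {
        return slot.getTaskManagerGateway().submitTask(taskName);
    }

    public static void main(String[] args) {
        // The gateway is attached when the slot offer arrives ...
        TaskManagerGateway gateway = task -> "submitted " + task + " to tm-1";
        SingleLogicalSlot slot = new SingleLogicalSlot(new AllocatedSlot(gateway));
        // ... and used much later, at deployment time.
        System.out.println(deploy(slot, "Map (1/2)"));
    }
}
```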
---------- Task Executor ----------
│
│
┌─────────────┐
│ TaskSlot │ requestSlot
└─────────────┘
│
│
┌──────────────┐
│ SlotOffer │ offerSlotsToJobManager
└──────────────┘
│
│
------------- Job Manager -------------
│
│
┌──────────────┐
│ SlotOffer │ JobMaster#offerSlots(taskManagerId,slots)
└──────────────┘
│ //taskManager = registeredTaskManagers.get(taskManagerId);
│ //taskManagerLocation = taskManager.f0;
│ //taskExecutorGateway = taskManager.f1;
│
│
┌──────────────┐
│ SlotOffer │ SlotPoolImpl#offerSlots
└──────────────┘
│
│
┌───────────────┐
│ AllocatedSlot │ SlotPoolImpl#offerSlot
└───────────────┘
│
│
┌───────────────┐
│ call Future 3 │ SlotSharingManager#createRootSlot
└───────────────┘
│
│
┌───────────────┐
│ call Future 2 │ SingleTaskSlot#SingleTaskSlot
└───────────────┘
│
│
┌───────────────────┐
│ SingleLogicalSlot │ new SingleLogicalSlot
└───────────────────┘
│
│
┌───────────────────┐
│ SingleLogicalSlot │
│ call Future 1     │ allocationResultFuture.complete()
└───────────────────┘
│
│
┌──────────────────────────────────┐
│ SingleLogicalSlot                │
│ call assignResourceOrHandleError │
└──────────────────────────────────┘
│
│
┌────────────────┐
│ ExecutionVertex│ tryAssignResource
└────────────────┘
│
│
┌────────────────┐
│ Execution │ tryAssignResource
└────────────────┘
│
│
┌──────────────────┐
│ SingleLogicalSlot│ tryAssignPayload
└──────────────────┘
│
│
┌──────────────────────────┐
│ SingleLogicalSlot        │
│ call deployOrHandleError │
└──────────────────────────┘
│
│
┌────────────────┐
│ ExecutionVertex│ deploy
└────────────────┘
│
│
┌────────────────┐
│ Execution │ deploy // key point
└────────────────┘
│
│
│
---------- Task Executor ----------
│
│
┌────────────────┐
│ TaskExecutor │ submitTask
└────────────────┘
│
│
┌────────────────┐
│ TaskExecutor │ startTaskThread
└────────────────┘
The execution path is as follows:
- TaskExecutor#establishJobManagerConnection
- TaskExecutor#offerSlotsToJobManager: iterates over the TaskSlots that have been allocated and generates a SlotOffer for each of them (containing allocationId, slotIndex and resourceProfile), which is sent to the JM through RPC.
    private void offerSlotsToJobManager(final JobID jobId) {
        final Iterator<TaskSlot<Task>> reservedSlotsIterator = taskSlotTable.getAllocatedSlots(jobId);
        final JobMasterId jobMasterId = jobManagerConnection.getJobMasterId();
        final Collection<SlotOffer> reservedSlots = new HashSet<>(2);
        while (reservedSlotsIterator.hasNext()) {
            SlotOffer offer = reservedSlotsIterator.next().generateSlotOffer();
            reservedSlots.add(offer);
        }
        // send the SlotOffers to the JM through RPC
        CompletableFuture<Collection<SlotOffer>> acceptedSlotsFuture = jobMasterGateway.offerSlots(
            getResourceID(),
            reservedSlots,
            taskManagerConfiguration.getTimeout());
    }
-
RPC
-
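JobMaster # offerSlots. Execution is now on the JM. When the JM receives the SlotOffers, it uses the taskManagerId parameter passed over RPC to look up the TaskExecutorGateway of that TaskManager, wraps it in an RpcTaskManagerGateway, and this gateway ends up as AllocatedSlot.taskManagerGateway. This links the Slot on the JM side to the TaskManager that hosts it.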
-
public CompletableFuture<Collection<SlotOffer>> offerSlots(
        final ResourceID taskManagerId,
        final Collection<SlotOffer> slots,
        final Time timeout) {

    Tuple2<TaskManagerLocation, TaskExecutorGateway> taskManager = registeredTaskManagers.get(taskManagerId);

    final TaskManagerLocation taskManagerLocation = taskManager.f0;
    final TaskExecutorGateway taskExecutorGateway = taskManager.f1;

    final RpcTaskManagerGateway rpcTaskManagerGateway = new RpcTaskManagerGateway(taskExecutorGateway, getFencingToken());

    return CompletableFuture.completedFuture(
        slotPool.offerSlots(
            taskManagerLocation,
            rpcTaskManagerGateway,
            slots));
}
-
-
SlotPoolImpl # offerSlots
-
SlotPoolImpl # offerSlot. This builds an AllocatedSlot from the SlotOffer. The information the offer contributes to the AllocatedSlot is slotIndex and resourceProfile. As a reminder, AllocatedSlot implements PhysicalSlot.
-
boolean offerSlot(
        final TaskManagerLocation taskManagerLocation,
        final TaskManagerGateway taskManagerGateway,
        final SlotOffer slotOffer) {

    // (excerpt: the lookup of allocationID and pendingRequest is omitted)

    // Build an AllocatedSlot from the SlotOffer; the offer contributes slotIndex and resourceProfile
    final AllocatedSlot allocatedSlot = new AllocatedSlot(
        allocationID,
        taskManagerLocation,
        slotOffer.getSlotIndex(),
        slotOffer.getResourceProfile(),
        taskManagerGateway);

    if (pendingRequest != null) {
        allocatedSlots.add(pendingRequest.getSlotRequestId(), allocatedSlot);

        // Take the future of the pendingRequest (the Future 3 from before) and complete it, which fires the callbacks
        if (!pendingRequest.getAllocatedSlotFuture().complete(allocatedSlot)) {
            // we could not complete the pending slot future --> try to fulfill another pending request
            allocatedSlots.remove(pendingRequest.getSlotRequestId());
            tryFulfillSlotRequestOrMakeAvailable(allocatedSlot);
        }
    }
    // (excerpt: the remaining branches and the return statement are omitted)
}
-
-
The callback of Future 3 starts here; the code is in SlotSharingManager # createRootSlot:
-
FutureUtils.forward(
    slotContextFuture.thenApply(
        (SlotContext slotContext) -> {
            // add the root node to the set of resolved root nodes once the SlotContext future has
            // been completed and we know the slot's TaskManagerLocation
            tryMarkSlotAsResolved(slotRequestId, slotContext); // execution reaches here first
            return slotContext;
        }),
    slotContextFutureAfterRootSlotResolution); // and then here
-
-
The callback of Future 2 starts here; the code is in the SingleTaskSlot constructor. Because PhysicalSlot extends SlotContext, this is where the physical Slot is mapped to a logical Slot.
-
singleLogicalSlotFuture = parent.getSlotContextFuture()
    .thenApply(
        (SlotContext slotContext) -> {
            // the callback creates the SingleLogicalSlot
            return new SingleLogicalSlot(
                slotRequestId,
                slotContext,
                slotSharingGroupId,
                locality,
                slotOwner);
        });
-
-
The callback of Future 1 starts here; the code is shown below. It triggers the callbacks that were registered for the later Deploy phase.
-
allocationFuture.whenComplete((LogicalSlot slot, Throwable failure) -> {
    if (failure != null) {
        cancelSlotRequest(
            slotRequestId,
            scheduledUnit.getSlotSharingGroupId(),
            failure);
        allocationResultFuture.completeExceptionally(failure);
    } else {
        allocationResultFuture.complete(slot); // execution reaches here
    }
});
-
-
The chain continues into assignResourceOrHandleError, the callback registered for the Deploy phase, which assigns the resource.
-
private BiFunction<LogicalSlot, Throwable, Void> assignResourceOrHandleError(final DeploymentHandle deploymentHandle) {
    // (excerpt: requiredVertexVersion and executionVertexId are taken from deploymentHandle)
    return (logicalSlot, throwable) -> {
        if (executionVertexVersioner.isModified(requiredVertexVersion)) {
            // the vertex was modified in the meantime, so this deployment is abandoned (details omitted)
            return null;
        }

        if (throwable == null) {
            final ExecutionVertex executionVertex = getExecutionVertex(executionVertexId);
            final boolean sendScheduleOrUpdateConsumerMessage = deploymentHandle.getDeploymentOption().sendScheduleOrUpdateConsumerMessage();
            executionVertex
                .getCurrentExecutionAttempt()
                .registerProducedPartitions(logicalSlot.getTaskManagerLocation(), sendScheduleOrUpdateConsumerMessage);
            executionVertex.tryAssignResource(logicalSlot); // execution reaches here
        }
        // (excerpt: the failure branch is omitted)
        return null;
    };
}
-
The callback goes on to call executionVertex.tryAssignResource:
-
ExecutionVertex # tryAssignResource
-
Execution # tryAssignResource
-
SingleLogicalSlot # tryAssignPayload(this). Here the Execution assigns itself to Slot.payload. A runtime snapshot of the Execution looks like this:
-
payload = {Execution@10669} "Attempt #0 (CHAIN DataSource (at getDefaultTextLineDataSet(WordCountData.java:47) (org.apache.flink.api.java.io.CollectionInputFormat)) -> FlatMap (FlatMap at main(WordCount.java:64)) -> Combine (SUM(1), at main(WordCount.java:67) (1/1)) @ org.apache.flink.runtime.jobmaster.slotpool.SingleLogicalSlot@61c7928f - [SCHEDULED]"
  executor = {ScheduledThreadPoolExecutor@5928} "java.util.concurrent.ScheduledThreadPoolExecutor@6a2c6c71[Running, pool size = 3, active threads = 0, queued tasks = 1, completed tasks = 2]"
  vertex = {ExecutionVertex@10534} "CHAIN DataSource (at getDefaultTextLineDataSet(WordCountData.java:47) (org.apache.flink.api.java.io.CollectionInputFormat)) -> FlatMap (FlatMap at main(WordCount.java:64)) -> Combine (SUM(1), at main(WordCount.java:67) (1/1)"
  attemptId = {ExecutionAttemptID@10792} "2f8b6c7297527225ee4c8036c457ba27"
  globalModVersion = 1
  stateTimestamps = {long[9]@10793}
  attemptNumber = 0
  rpcTimeout = {Time@5924} "18000000 ms"
  partitionInfos = {ArrayList@10794} size = 0
  terminalStateFuture = {CompletableFuture@10795} "java.util.concurrent.CompletableFuture@2eb8f94c[Not completed]"
  releaseFuture = {CompletableFuture@10796} "java.util.concurrent.CompletableFuture@7c794914[Not completed]"
  taskManagerLocationFuture = {CompletableFuture@10797} "java.util.concurrent.CompletableFuture@2e11ac18[Not completed]"
  state = {ExecutionState@10789} "SCHEDULED"
  assignedResource = {SingleLogicalSlot@10507}
  failureCause = null
  taskRestore = null
  assignedAllocationID = null
  accumulatorLock = {Object@10798}
  userAccumulators = null
  ioMetrics = null
  producedPartitions = {LinkedHashMap@10799} size = 1
-
-
-
The chain continues into deployOrHandleError, the callback registered for the Deploy phase, which performs the deployment.
-
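private BiFunction<Object, Throwable, Void> deployOrHandleError(final DeploymentHandle deploymentHandle) {
    // (excerpt: requiredVertexVersion and executionVertexId are taken from deploymentHandle)
    return (ignored, throwable) -> {
        if (executionVertexVersioner.isModified(requiredVertexVersion)) {
            // the vertex was modified in the meantime, so this deployment is skipped (details omitted)
            return null;
        }

        if (throwable == null) {
            deployTaskSafe(executionVertexId); // deployment happens here
        } else {
            handleTaskDeploymentFailure(executionVertexId, throwable);
        }
        return null;
    };
}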
-
The callback then calls further functions:
-
DefaultScheduler # deployTaskSafe
-
ExecutionVertex # deploy
-
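Execution # deploy. Every time an ExecutionVertex is scheduled, there is an Execution for it. In this stage the state of the Execution changes to DEPLOYING and the deployment descriptor for the ExecutionVertex is generated. The TaskManagerGateway is then obtained from the assigned slot so that the Task can be submitted to the corresponding TaskManager. The ExecutionVertex.createDeploymentDescriptor method covers the conversion from the ExecutionGraph to the physical execution graph: for example, IntermediateResultPartition becomes ResultPartition, and ExecutionEdge becomes InputChannelDeploymentDescriptor (which is eventually turned into an InputGate at runtime).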
-
// Key point: when an Execution is deployed, it obtains the RPC gateway of the TaskManager by following
// SingleLogicalSlot ---> AllocatedSlot ---> TaskManagerGateway, and only then can it submit the task
// through taskManagerGateway.submitTask. This ties the deployment phase of the Execution to the execution phase.
public void deploy() throws JobException {
    // (excerpt: slot is the LogicalSlot assigned to this Execution)
    final TaskDeploymentDescriptor deployment = TaskDeploymentDescriptorFactory
        .fromExecutionVertex(vertex, attemptNumber)
        .createDeploymentDescriptor(
            slot.getAllocationId(),
            slot.getPhysicalSlotNumber(),
            taskRestore,
            producedPartitions.values());

    // This is the key point
    final TaskManagerGateway taskManagerGateway = slot.getTaskManagerGateway();

    // The task is submitted to the TaskManager via RPC here
    CompletableFuture
        .supplyAsync(() -> taskManagerGateway.submitTask(deployment, rpcTimeout), executor)
        .thenCompose(Function.identity());
    // (excerpt: handling of the submission result is omitted)
}
-
-
-
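TaskExecutor # submitTask. Execution is now on the TE, and this is where the task really starts running. After receiving the submit request, the TaskManager (TaskExecutor) performs some initialization (for example fetching files from the BlobServer, deserializing the job and task information, and setting up the LibraryCacheManager). The results are used to build a Task (a Runnable), which is then started. The call path is Task#startTaskThread (starts the task thread) -> Task#run (changes the ExecutionVertex state to RUNNING, at which point the vertex shows as RUNNING in the Flink web UI; it also creates an AbstractInvokable object, which is the key link between Flink and the user code).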
-
// This method creates the actual Task and then calls task.startTaskThread() to start running it.
public CompletableFuture<Acknowledge> submitTask(
        TaskDeploymentDescriptor tdd,
        JobMasterId jobMasterId,
        Time timeout) {

    // Fetches the memory manager of the slot; this is part of how memory is divided among slots
    MemoryManager memoryManager = taskSlotTable.getTaskMemoryManager(tdd.getAllocationId());

    // The Task constructor creates the InputGates, ResultPartitions, ResultPartitionWriters, etc. from its arguments
    Task task = new Task(
        jobInformation,
        taskInformation,
        tdd.getExecutionAttemptId(),
        tdd.getAllocationId(),
        tdd.getSubtaskIndex(),
        tdd.getAttemptNumber(),
        tdd.getProducedPartitions(),
        tdd.getInputGates(),
        tdd.getTargetSlotNumber(),
        memoryManager,
        taskExecutorServices.getIOManager(),
        taskExecutorServices.getShuffleEnvironment(),
        taskExecutorServices.getKvStateService(),
        taskExecutorServices.getBroadcastVariableManager(),
        taskExecutorServices.getTaskEventDispatcher(),
        taskStateManager,
        taskManagerActions,
        inputSplitProvider,
        checkpointResponder,
        aggregateManager,
        blobCacheService,
        libraryCache,
        fileCache,
        taskManagerConfiguration,
        taskMetricGroup,
        resultPartitionConsumableNotifier,
        partitionStateChecker,
        getRpcService().getExecutor());

    boolean taskAdded = taskSlotTable.addTask(task);
    task.startTaskThread();
    // (excerpt: error handling and the returned acknowledgement are omitted)
}
-
The thread is now started. The startTaskThread method calls executingThread.start(), which in turn invokes the Task.run method.
public void startTaskThread() {
    executingThread.start();
}
-
-
-
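Finally, execution reaches the Task itself, which calls the user code. The invokable here is the task instance that runs the operator, created via reflection; concretely it is a OneInputStreamTask, a SourceStreamTask, and so on. Taking OneInputStreamTask as an example, the core of the Task's execution is the OneInputStreamTask.invoke method, which calls StreamTask.run. That method is abstract, so the call ends up in the run method of the concrete subclass, such as OneInputStreamTask or SourceStreamTask.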
// The invokable is the task instance that runs the operator; it is created via reflection.
private void doRun() {
    AbstractInvokable invokable = null;
    invokable = loadAndInstantiateInvokable(userCodeClassLoader, nameOfInvokableClass, env);

    // run the invokable
    invokable.invoke();
}
-
-
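The order in which the callbacks above fire (Future 3, Future 2, Future 1, assign, deploy) can be reproduced with plain CompletableFuture. The following is only a sketch, not Flink code: PhysicalSlot and LogicalSlot are hypothetical stand-ins for AllocatedSlot and SingleLogicalSlot, and the host name is made up. It shows that all callbacks are registered first and nothing runs until the slot offer completes the root future.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;

final class SlotCallbackChain {

    /** Hypothetical stand-in for AllocatedSlot (a physical slot offered by a TaskExecutor). */
    static final class PhysicalSlot {
        final String host;
        final int slotIndex;

        PhysicalSlot(String host, int slotIndex) {
            this.host = host;
            this.slotIndex = slotIndex;
        }
    }

    /** Hypothetical stand-in for SingleLogicalSlot: wraps the physical slot and carries a payload. */
    static final class LogicalSlot {
        final PhysicalSlot physical;
        String payload; // plays the role of the Execution assigned to the slot

        LogicalSlot(PhysicalSlot physical) {
            this.physical = physical;
        }
    }

    /** Registers all callbacks, then completes the root future; returns the order in which they fired. */
    static List<String> run() {
        List<String> trace = new ArrayList<>();

        // "Future 3": completed when the slot offer arrives from the TaskExecutor.
        CompletableFuture<PhysicalSlot> allocatedSlotFuture = new CompletableFuture<>();
        CompletableFuture<PhysicalSlot> resolvedRootSlot = allocatedSlotFuture.thenApply(physical -> {
            trace.add("root slot resolved");
            return physical;
        });

        // "Future 2": maps the physical slot to a logical slot.
        CompletableFuture<LogicalSlot> logicalSlotFuture = resolvedRootSlot.thenApply(physical -> {
            trace.add("logical slot created");
            return new LogicalSlot(physical);
        });

        // "Future 1": forwards the result (or the failure) to the future the scheduler waits on.
        CompletableFuture<LogicalSlot> allocationResultFuture = new CompletableFuture<>();
        logicalSlotFuture.whenComplete((slot, failure) -> {
            if (failure != null) {
                allocationResultFuture.completeExceptionally(failure);
            } else {
                trace.add("allocation result completed");
                allocationResultFuture.complete(slot);
            }
        });

        // Callbacks registered by the scheduler: assign the resource, then deploy.
        allocationResultFuture
            .thenApply(slot -> {
                slot.payload = "Execution attempt #0";
                trace.add("resource assigned");
                return slot;
            })
            .thenAccept(slot ->
                trace.add("deploy to " + slot.physical.host + ", slot " + slot.physical.slotIndex));

        // Nothing has run so far; completing the root future triggers the whole chain.
        allocatedSlotFuture.complete(new PhysicalSlot("tm-host-1", 0));
        return trace;
    }

    public static void main(String[] args) {
        run().forEach(System.out::println);
    }
}
```

Because every stage is registered before the root future completes, the stages run one after another in the thread that calls complete, which corresponds to the single chain of callbacks traced above.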
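0x09 How Slots Take Effect
A natural question: once a Slot has been allocated, what role does it play at runtime?
We use the WordCount example to find out.
The example code is the standard WordCount, with a few configuration settings added:
- taskmanager.numberOfTaskSlots sets the number of slots that each TaskManager offers.
- The other settings lengthen heartbeat intervals and timeouts to make debugging easier.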
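import org.apache.flink.api.common.functions.FlatMapFunction;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.LocalEnvironment;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.api.java.utils.MultipleParameterTool;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.examples.java.wordcount.util.WordCountData;
import org.apache.flink.util.Collector;

public class WordCount {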
public static void main(String[] args) throws Exception {
Configuration conf = new Configuration();
conf.setString("heartbeat.timeout", "18000000");
conf.setString("resourcemanager.job.timeout", "18000000");
conf.setString("resourcemanager.taskmanager-timeout", "18000000");
conf.setString("slotmanager.request-timeout", "18000000");
conf.setString("slotmanager.taskmanager-timeout", "18000000");
conf.setString("slot.request.timeout", "18000000");
conf.setString("slot.idle.timeout", "18000000");
conf.setString("akka.ask.timeout", "18000000");
conf.setString("taskmanager.numberOfTaskSlots", "1");
final LocalEnvironment env = ExecutionEnvironment.createLocalEnvironment(conf);
final MultipleParameterTool params = MultipleParameterTool.fromArgs(args);
env.getConfig().setGlobalJobParameters(params);
// get input data
DataSet<String> text = null;
if (params.has("input")) {
// union all the inputs from text files
for (String input : params.getMultiParameterRequired("input")) {
if (text == null) {
text = env.readTextFile(input);
} else {
text = text.union(env.readTextFile(input));
}
}
} else {
// get default test text data
text = WordCountData.getDefaultTextLineDataSet(env);
}
DataSet<Tuple2<String, Integer>> counts =
// split up the lines in pairs (2-tuples) containing: (word,1)
text.flatMap(new Tokenizer())
// group by the tuple field "0" and sum up tuple field "1"
.groupBy(0)
.sum(1);
// emit result
if (params.has("output")) {
counts.writeAsCsv(params.get("output"), "\n", " ");
env.execute("WordCount Example");
} else {
counts.print();
}
}
// *************************************************************************
// USER FUNCTIONS
// *************************************************************************
public static final class Tokenizer implements FlatMapFunction<String, Tuple2<String, Integer>> {
@Override
public void flatMap(String value, Collector<Tuple2<String, Integer>> out) {
// normalize and split the line
String[] tokens = value.toLowerCase().split("\\W+");
// emit the pairs
for (String token : tokens) {
if (token.length() > 0) {
out.collect(new Tuple2<>(token, 1));
}
}
}
}
}
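9.1 Deployment phase
Here the Slot acts as a bridge that connects the submission of the deployment with the execution phase.
As mentioned earlier, after the TE offers a Slot, the JM submits a Task onto that Slot. The logic is as follows:
Every time an ExecutionVertex is scheduled, there is an Execution for it. In the Execution # deploy function:
- The state of the Execution changes to DEPLOYING and the deployment descriptor for the ExecutionVertex is generated. The ExecutionVertex.createDeploymentDescriptor method covers the conversion from the ExecutionGraph to the physical execution graph.
- For example, IntermediateResultPartition is converted to ResultPartition.
- ExecutionEdge is converted to InputChannelDeploymentDescriptor (which is eventually turned into an InputGate at runtime).
- The TaskManagerGateway is then obtained from the assigned slot so that the Task can be submitted to the corresponding TaskManager. The key point is that the Execution obtains the RPC gateway of the TaskManager by following SingleLogicalSlot ---> AllocatedSlot ---> TaskManagerGateway.
- Finally, the Task is submitted through taskManagerGateway.submitTask.
The code is as follows: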
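// Key point: when an Execution is deployed, it obtains the RPC gateway of the TaskManager by following SingleLogicalSlot ---> AllocatedSlot ---> TaskManagerGateway, and only then can it submit the task through taskManagerGateway.submitTask. This ties the deployment phase of the Execution to the execution phase.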
public void deploy() throws JobException {
final TaskDeploymentDescriptor deployment = TaskDeploymentDescriptorFactory
.fromExecutionVertex(vertex, attemptNumber)
.createDeploymentDescriptor(
slot.getAllocationId(),
slot.getPhysicalSlotNumber(),
taskRestore,
producedPartitions.values());
// This is the key point
final TaskManagerGateway taskManagerGateway = slot.getTaskManagerGateway();
// The task is submitted to the TaskManager via RPC here
CompletableFuture.supplyAsync(() -> taskManagerGateway.submitTask(deployment, rpcTimeout), executor).thenCompose(Function.identity());
}
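To make the chain SingleLogicalSlot ---> AllocatedSlot ---> TaskManagerGateway more concrete, here is a minimal sketch in plain Java. It is not Flink's implementation: the three types below are hypothetical, heavily reduced stand-ins that only share their names with the Flink classes, and the descriptor is a plain string. What it does mirror is the shape of deploy(): resolve the gateway through the slot, submit asynchronously, and flatten the nested future with thenCompose(Function.identity()).

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.function.Function;

final class SlotGatewayBridge {

    /** Hypothetical, heavily reduced version of the RPC gateway to a TaskManager. */
    interface TaskManagerGateway {
        CompletableFuture<String> submitTask(String deploymentDescriptor);
    }

    /** Physical slot: remembers which TaskManager it lives on through the gateway. */
    static final class AllocatedSlot {
        private final TaskManagerGateway gateway;

        AllocatedSlot(TaskManagerGateway gateway) {
            this.gateway = gateway;
        }

        TaskManagerGateway getTaskManagerGateway() {
            return gateway;
        }
    }

    /** Logical slot: delegates to the physical slot it was created from. */
    static final class SingleLogicalSlot {
        private final AllocatedSlot slotContext;

        SingleLogicalSlot(AllocatedSlot slotContext) {
            this.slotContext = slotContext;
        }

        TaskManagerGateway getTaskManagerGateway() {
            return slotContext.getTaskManagerGateway();
        }
    }

    /** Same shape as the deploy() excerpt: get the gateway from the slot, then submit the task. */
    static CompletableFuture<String> deploy(SingleLogicalSlot slot, String descriptor, ExecutorService executor) {
        TaskManagerGateway gateway = slot.getTaskManagerGateway();
        // supplyAsync yields a future of a future; thenCompose(identity) flattens it.
        return CompletableFuture
            .supplyAsync(() -> gateway.submitTask(descriptor), executor)
            .thenCompose(Function.identity());
    }

    public static void main(String[] args) {
        ExecutorService executor = Executors.newSingleThreadExecutor();
        try {
            // A fake TaskManager that simply acknowledges every submission.
            TaskManagerGateway gateway = d -> CompletableFuture.completedFuture("ACK:" + d);
            SingleLogicalSlot slot = new SingleLogicalSlot(new AllocatedSlot(gateway));
            System.out.println(deploy(slot, "wordcount-subtask-0", executor).join());
        } finally {
            executor.shutdown();
        }
    }
}
```

The caller of deploy never needs to know which TaskManager it talks to; that knowledge travels with the slot, which is the bridging role described in this section.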
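9.2 Execution phase
We use input splits as the only example here. The Slot again acts as a link: from the Slot one can obtain the host of its TaskManager, and the split assignment then proceeds based on that host.
When a Source reads its input, the input may need to be divided, so Flink partitions it into input splits.
9.2.1 Where FileInputSplit comes from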
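Flink generally divides a file into FileInputSplits according to the parallelism. The number of FileInputSplit objects is not always exactly the parallelism, since it is determined by the splitting algorithm, but it is always either the parallelism or the parallelism + 1. The reason is that the count is derived from file size / parallelism: if the division leaves no remainder, the number of FileInputSplits equals the parallelism; if there is a remainder, the number is either the parallelism or the parallelism + 1.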
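The arithmetic can be illustrated with a small sketch. It implements only one simple reading of the rule described above (a base split length of fileSize / parallelism, with any leftover bytes forming one extra split) and is not Flink's FileInputFormat, which also takes factors such as block size into account. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

final class SplitCountSketch {

    /** Returns {start, length} pairs that cover the byte range [0, fileSize). */
    static List<long[]> split(long fileSize, int parallelism) {
        if (parallelism <= 0 || fileSize < parallelism) {
            throw new IllegalArgumentException("expects fileSize >= parallelism > 0");
        }
        long baseLength = fileSize / parallelism; // integer division
        long remainder = fileSize % parallelism;

        List<long[]> splits = new ArrayList<>();
        for (int i = 0; i < parallelism; i++) {
            splits.add(new long[] {i * baseLength, baseLength});
        }
        if (remainder > 0) {
            // The leftover bytes form one additional split.
            splits.add(new long[] {parallelism * baseLength, remainder});
        }
        return splits;
    }

    public static void main(String[] args) {
        System.out.println(split(100, 4).size()); // 4: 100 divides evenly by 4
        System.out.println(split(103, 4).size()); // 5: the 3 leftover bytes form one extra split
    }
}
```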
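During graph generation, Flink converts each JobVertex into an ExecutionJobVertex by calling new ExecutionJobVertex(). The ExecutionJobVertex stores the inputSplits, so the number of inputSplits is computed from the parallelism.
The ExecutionJobVertex constructor contains the following code, which creates the InputSplits and assigns them to the member variable inputSplits of the ExecutionJobVertex. This is where the splits come from: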
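// set up the input splits, if the vertex has any
try {
    InputSplitSource<InputSplit> splitSource = (InputSplitSource<InputSplit>) jobVertex.getInputSplitSource();
    if (splitSource != null) {
        // (excerpt: switching the context class loader around this call is omitted)
        inputSplits = splitSource.createInputSplits(numTaskVertices);
        if (inputSplits != null) {
            splitAssigner = splitSource.getInputSplitAssigner(inputSplits);
        }
    }
} catch (Throwable t) {
    throw new JobException("Creating the input splits caused an error: " + t.getMessage(), t);
}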
// splitSource at this point:
splitSource = {CollectionInputFormat@7603} "[To be, or not to be,--that is the question:--, Whether 'tis nobler in the mind to suffer, The slings and arrows of outrageous fortune, ...]"
serializer = {StringSerializer@7856}
dataSet = {ArrayList@7857} size = 35
iterator = null
partitionNumber = 0
runtimeContext = null
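9.2.2 File Split
Here we borrow from an online article, an analysis of readTextFile in Flink-1.10.0, to show the rough flow of file splitting. Note that it describes the Stream API.
readTextFile consists of two stages: a Source and a Split Reader. The two stages can run in multiple threads, not necessarily two, because the parallelism of the Split Reader is determined by the configuration file or the startup parameters.
The Source executes as follows. The Source only builds the input splits and does not read any data. The flow below was traced in local execution mode.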
Task.run()
|-- invokable.invoke()
| |-- StreamTask.invoke()
| | |-- beforeInvoke()
| | | |-- init()
| | | | |-- SourceStreamTask.init()
| | | |-- initializeStateAndOpen()
| | | | |-- operator.initializeState()
| | | | |-- operator.open()
| | | | | |-- SourceStreamTask.LegacySourceFunctionThread.run()
| | | | | | |-- StreamSource.run()
| | | | | | | |-- userFunction.run(ctx)
| | | | | | | | |-- ContinuousFileMonitoringFunction.run()
| | | | | | | | | |-- RebalancePartitioner.selectChannel()
| | | | | | | | | |-- RecordWriter.emit()
The Split Reader executes as follows:
Task.run()
|-- invokable.invoke()
| |-- StreamTask.invoke()
| | |-- beforeInvoke()
| | | |-- init()
| | | | |-- OneInputStreamTask.init()
| | | |-- initializeStateAndOpen()
| | | | |-- operator.initializeState()
| | | | | |-- ContinuousFileReaderOperator.initializeState()
| | | | |-- operator.open()
| | | | | |-- ContinuousFileReaderOperator.open()
| | | | | | |-- ContinuousFileReaderOperator.SplitReader.run()
| | |-- runMailboxLoop()
| | | |-- StreamTask.processInput()
| | | | |-- StreamOneInputProcessor.processInput()
| | | | | |-- StreamTaskNetworkInput.emitNext() (a while loop that keeps processing input data)
| | | | | | |-- ContinuousFileReaderOperator.processElement()
| | |-- afterInvoke()
9.2.3 How the Slot is used
For the example in this article, we focus on how the Slot is used in this flow.
The call path is as follows:
-
DataSourceTask # invoke, which runs on the TE at this point
-
DataSourceTask # hasNext
-
while (!this.taskCanceled && splitIterator.hasNext())
-
-
RpcInputSplitProvider # getNextInputSplit
-
CompletableFuture<SerializedInputSplit> futureInputSplit = jobMasterGateway.requestNextInputSplit(
    jobVertexID,
    executionAttemptID);
-
-
RPC
-
Execution is now on the JM
-
JobMaster # requestNextInputSplit
-
SchedulerBase # requestNextInputSplit. This gets the Execution from the executionGraph and then gets the InputSplit from the Execution.
-
public SerializedInputSplit requestNextInputSplit(JobVertexID vertexID, ExecutionAttemptID executionAttempt) throws IOException {
    final Execution execution = executionGraph.getRegisteredExecutions().get(executionAttempt);
    final ExecutionJobVertex vertex = executionGraph.getJobVertex(vertexID);

    final InputSplit nextInputSplit = execution.getNextInputSplit();

    final byte[] serializedInputSplit = InstantiationUtil.serializeObject(nextInputSplit);
    return new SerializedInputSplit(serializedInputSplit);
}
-
Here execution.getNextInputSplit() uses the Slot. It first gets the Slot, then gets the host of the TaskManager from the Slot, and finally gets the InputSplit from the ExecutionVertex.
-
public InputSplit getNextInputSplit() {
    final LogicalSlot slot = this.getAssignedResource();
    final String host = slot != null ? slot.getTaskManagerLocation().getHostname() : null;
    return this.vertex.getNextInputSplit(host);
}
-
public InputSplit getNextInputSplit(String host) {
    final int taskId = getParallelSubtaskIndex();
    synchronized (inputSplits) {
        final InputSplit nextInputSplit = jobVertex.getSplitAssigner().getNextInputSplit(host, taskId);
        if (nextInputSplit != null) {
            inputSplits.add(nextInputSplit);
        }
        return nextInputSplit;
    }
}

// runtime information:
inputSplits = {GenericInputSplit[1]@13113}
  0 = {GenericInputSplit@13121} "GenericSplit (0/1)"
    partitionNumber = 0
    totalNumberOfPartitions = 1
-
-
-
Back in SchedulerBase # requestNextInputSplit, which returns new SerializedInputSplit(serializedInputSplit).
-
RPC
-
Back in the operator Task on the TE, which now has the InputSplit and can continue processing the input.
-
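final InputSplit split = splitIterator.next();
final InputFormat<OT, InputSplit> format = this.format;

// open input format
// open() does not read any data yet. It only records the current split (start position and length)
// and seeks to the start position. Everything up to the first line break is left to the previous split,
// and this split starts reading after it.
format.open(split);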
-
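Why the host taken from the Slot matters can be shown with a small locality-aware assigner. This is a sketch under stated assumptions, not the assigner used in the WordCount run above (whose single GenericInputSplit has no location to prefer): the Split type, the class name, and the host names are all hypothetical. It only illustrates the idea that an assigner receiving a host can hand out a split stored on that host first and fall back to any remaining split otherwise.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

final class HostAwareSplitAssigner {

    /** Hypothetical split that records the host storing its data. */
    static final class Split {
        final int id;
        final String host;

        Split(int id, String host) {
            this.id = id;
            this.host = host;
        }
    }

    private final List<Split> unassigned;

    HostAwareSplitAssigner(List<Split> splits) {
        this.unassigned = new ArrayList<>(splits);
    }

    /** Prefers a split on the requesting host, falls back to any remaining split, returns null when exhausted. */
    synchronized Split getNextInputSplit(String host) {
        if (host != null) {
            for (Iterator<Split> it = unassigned.iterator(); it.hasNext();) {
                Split candidate = it.next();
                if (host.equals(candidate.host)) {
                    it.remove();
                    return candidate;
                }
            }
        }
        return unassigned.isEmpty() ? null : unassigned.remove(0);
    }

    public static void main(String[] args) {
        List<Split> splits = new ArrayList<>();
        splits.add(new Split(0, "host-a"));
        splits.add(new Split(1, "host-b"));
        splits.add(new Split(2, "host-a"));
        HostAwareSplitAssigner assigner = new HostAwareSplitAssigner(splits);

        System.out.println(assigner.getNextInputSplit("host-b").id); // 1: the local split is preferred
        System.out.println(assigner.getNextInputSplit("host-b").id); // 0: no local split left, take the first remaining
        System.out.println(assigner.getNextInputSplit(null).id);     // 2: host unknown, take the first remaining
        System.out.println(assigner.getNextInputSplit("host-a"));    // null: all splits are assigned
    }
}
```

The null host in the third call corresponds to the case in Execution # getNextInputSplit where no Slot has been assigned yet.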