[Source Code Analysis] What Exactly Is a Flink Slot? (1)

羅西的思考, published 2020-08-24


0x00 Abstract

Most Flink users have heard of the Slot concept, but many may not know the details: what exactly does a Slot represent? How is it implemented in the code? What roles does a Slot play during execution graph generation, scheduling, resource allocation, deployment and execution? This article and the next walk through the source code to reveal the mechanics behind Slots.

0x01 Overview & Questions

1.1 How Flink Works

The figure below gives a rough picture of how Flink works, from submitting a Job to executing concrete Tasks. We can see that at execution time a Task is attached to a specific Slot.

                                                                                        +--------------+
                                                                                        | +----------+ |
+--------+          +--------+         +---------+              +---------+             | |Task Slot | |
| Flink  |  Submit  |  Job   | Submit  |  Job    | Submit Task  |  Task   |Execute Task | +----------+ |
|Program +--------->+ Client +-------> | Manager +------------->+ Manager +------------>+              |
+--------+          +--------+         +---------+              +---------+             | +----------+ |
                                                                                        | |Task Slot | |
                                                                                        | +----------+ |
                                                                                        +--------------+


1.2 Questions

It helps to study with questions in mind, so here are a few that I think are representative:

  • What exactly is a Slot?
  • How is a Slot implemented in the code?
  • What does a Slot definition actually contain? CPU? Memory?
  • How does a Slot achieve the various kinds of isolation?
  • How many Slots should a TM be divided into?
  • How is a Slot assigned to a Task? In other words, how does a Task end up running on a Slot?

Answering these questions is not easy; we have to walk through the whole execution flow of a Flink job. Let's explore it together.

0x02 Sample Code

2.1 Sample Code

The sample code is Flink's own WordCount, with a few extra configuration entries:

  • taskmanager.numberOfTaskSlots sets how many slots each TaskManager provides (not the number of TaskManagers).
  • The other entries are for debugging: they lengthen the heartbeat and various timeout intervals.
Configuration conf = new Configuration();
conf.setString("heartbeat.timeout", "18000000");
conf.setString("resourcemanager.job.timeout", "18000000");
conf.setString("resourcemanager.taskmanager-timeout", "18000000");
conf.setString("slotmanager.request-timeout", "18000000");
conf.setString("slotmanager.taskmanager-timeout", "18000000");
conf.setString("slot.request.timeout", "18000000");
conf.setString("slot.idle.timeout", "18000000");
conf.setString("akka.ask.timeout", "18000000");
conf.setString("taskmanager.numberOfTaskSlots", "1");
final LocalEnvironment env = ExecutionEnvironment.createLocalEnvironment(conf);
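
For completeness, here is a minimal sketch of how the rest of the WordCount job might be wired onto this LocalEnvironment (illustrative only, using the DataSet API; imports of DataSet, FlatMapFunction, Tuple2 and Collector are omitted):

DataSet<String> text = env.fromElements("to be or not to be", "that is the question");

DataSet<Tuple2<String, Integer>> counts = text
    // split each line into (word, 1) pairs
    .flatMap(new FlatMapFunction<String, Tuple2<String, Integer>>() {
        @Override
        public void flatMap(String line, Collector<Tuple2<String, Integer>> out) {
            for (String word : line.toLowerCase().split("\\W+")) {
                if (!word.isEmpty()) {
                    out.collect(new Tuple2<>(word, 1));
                }
            }
        }
    })
    .groupBy(0)   // group by the word
    .sum(1);      // sum the counts per word

counts.print();   // print() triggers execution on the local mini cluster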

0x03 The System Viewed from the Slot Perspective

3.1 Flink Components

The system partitioning itself is unchanged; it just becomes clearer when viewed from the angle of Slot resource allocation.

A Flink Cluster consists of one Flink Master and multiple Task Managers; a Flink Master contains one Resource Manager and multiple Job Managers.

  • Each Job Manager in the Flink Master manages exactly one concrete Job.
    • The Scheduler component inside a Job Manager schedules all Tasks in that Job's DAG and issues resource requests; it is the starting point of the whole resource scheduling process;
    • The Slot Pool component inside the JobManager holds all resources allocated to that Job.
  • The single Resource Manager in the Flink Master is responsible for resource scheduling of the whole Flink Cluster and for talking to external scheduling systems, i.e. resource management systems such as Kubernetes, Mesos and Yarn.
  • Task Managers are responsible for executing Tasks. A Slot is a subset of a Task Manager's resources and the basic unit of Flink resource management; the Slot concept runs through the entire resource scheduling process.

Flink Master and Task Manager are process-level components; all the other components live inside those processes.

3.2 Where Slots Come From

As introduced earlier, a TaskManager is a JVM process that executes one task or several subtasks in separate threads.

So inside the multi-threaded TaskManager, one or more of its subtasks run on different threads. How many subtasks can those threads run?

To control how many subtasks the internal threads execute, i.e. how many tasks a TaskManager can accept, the slot concept was introduced: a slot is a fixed-size share of the TaskManager's resources. When the ResourceManager allocates and manages resources, the slot is its smallest unit.

The advantage of the Slot concept is that the work the JobMaster dispatches can run independently in different Slots. This gives a degree of resource isolation and lets the overall resources be used as efficiently as possible.

When subtasks belong to the same job, Flink also allows slots to be shared. Sharing is allowed mainly because it lets small, lightweight tasks run quickly, and it logically decouples the amount of parallel computation from the resources it consumes (similar in spirit to virtual memory), so that jobs and tasks are mapped onto resources more effectively.

3.3 Resource Allocation

Flink resource scheduling is a classic two-layer model: allocation from the Cluster to a Job is done by the Slot Manager, while allocation of resources to Tasks inside a Job is done by the Scheduler. The Scheduler sends a Slot Request to the Slot Pool; if the Slot Pool cannot satisfy the request it forwards it to the Resource Manager, where the component that actually fulfils it is the Slot Manager.

Chaining, between Operator and Task, is about how Operators are composed into Tasks. Slot Sharing, between Task and Job, is about how multiple Tasks share one Slot; this never happens across jobs. Slot Allocation, between the Flink Cluster and a Job, is about how the Slots in the Flink Cluster are allocated to different Jobs.

Let's first summarize this with a diagram drawn with http://asciiflow.com/.

           +------------------------------------------+
           |              TaskManager                 |
           |   +-----------------------------------+  |
           |   |    TaskManagerServices            |  | 2.Status Report
           |   |    +-------------------------+    |  +--------------------+
           |   |    |     TaskSlotTable       |    |  |                    |
           |   |    |   +------------------+  |    |  | 1.Reqister         |
           |   |    |   |TaskSlot TaskSlot |  |    |  +---------------+    |
           |   |    |   +------------------+  |    |  |               |    |
           |   |    +-------------------------+    |  |               |    |
           |   +-----------------------------------+  | <---------+   |    |
           +------------------------------------------+ 6.Request |   |    |
                        |           ^                             |   |    |
                        | 7.Offer   | 8.submitTask                |   |    |
                        v           |                             |   v    v
+-----------------------+-----------+----------------+         +---+---+----+-----------+ 
|                       JobManager                   |         |                        |
|          +-------------------------------------+   |         |  ResourceManager       |
|          |          Scheduler                  |   |         |                        |
|          | +---------------------------------+ |   |         | +--------------------+ |
|          | | LogicalSlot  PhysicalSlot       | |   |         | |    SlotManager     | |
|          | +---------------------------------+ |   |         | |                    | |
|          +-------------------------------------+   |         | | +----------------+ | |
|                    |         |                     |         | | |                | | |
|3.allocateSharedSlot|         |4.allocateSingleSlot |         | | |TaskManagerSlot | | |
|                    v         |                     |         | | |                | | |
| +------------------+-+       |  +----------------+ |         | | |                | | |
| |  SlotSharingManager|       +->+  SlotPool      | |5.Request| | |                | | |
| | +----------------+ |          | +------------+ +---------> | | +----------------+ | |
| | |MultiTaskSlot   | |          | |AllocatedSlot | |         | |                    | |
| | |SingleTaskSlot  | |          | |            + | |         | +--------------------+ |
| | +----------------+ |          | +------------+ | |         +------------------------+
| +--------------------+          +----------------+ |
+----------------------------------------------------+

Figure: Flink resource management related components


As shown in the figure, resource scheduling from Cluster to Job involves five main steps.

  • TE registration (items 1 and 2 in the figure above)

    • Register: when a TE starts, it registers itself and its internal Slots with the RM.
    • Status Report: after startup, the TE sends periodic heartbeats to the RM; the heartbeat payload carries Slot information, and the RM updates its internal Slot state accordingly.
  • JM-internal allocation (items 3 and 4 in the figure above)

    • allocateSingleSlot: the Scheduler sends a request to the Slot Pool. If the Slot Pool has enough Slot resources it allocates directly; otherwise it forwards the request to the Slot Manager (at this point the Job is requesting resources from the Cluster).
    • allocateSharedSlot: the Scheduler sends a request to the Slot Sharing Manager, which builds the Slot tree and then asks the Slot Pool. If the Slot Pool has enough Slot resources it allocates directly; otherwise it forwards the request to the Slot Manager (again, the Job requesting resources from the Cluster).
  • The Job requests resources from the Cluster (items 5 and 6 in the figure above)

    • If the Slot Manager decides the cluster has enough resources to satisfy the request, it sends a Request command to a Task Manager, and the Slot Pool can then satisfy the Scheduler's resource request.
    • In the Active Resource Manager deployment mode, when the Resource Manager decides that the Flink Cluster does not have enough resources, it requests more from the underlying resource scheduling system; that system starts a new Task Manager, the TaskManager registers with the Resource Manager, and the new Slots become available.
  • TE offers Slots (item 7 in the figure above)

    • Offer: the Task Manager offers Slots to the Slot Pool.
  • JM submits Tasks to the TE (item 8 in the figure above)

    • submitTask: the JM updates its internal Slot state and then submits the task to the TE.

These components are described in detail below.

3.4 The Task Manager Side

The relevant components inside the Task Manager are TaskManagerServices and TaskSlotTableImpl. TaskManagerServices provides the TaskManager's basic services, among them the Slot-related facility TaskSlotTable.

3.4.1 TaskManagerServices

TaskManagerServices contains a TaskSlotTable taskSlotTable, which is responsible for Slot management and allocation.

public class TaskManagerServices {
	/** TaskManager services. */
	private final TaskManagerLocation taskManagerLocation;
	private final long managedMemorySize;
	private final IOManager ioManager;
	private final ShuffleEnvironment<?, ?> shuffleEnvironment;
	private final KvStateService kvStateService;
	private final BroadcastVariableManager broadcastVariableManager;
	private final TaskSlotTable<Task> taskSlotTable; // Slot-related
	private final JobManagerTable jobManagerTable;
	private final JobLeaderService jobLeaderService;
	private final TaskExecutorLocalStateStoresManager taskManagerStateStore;
	private final TaskEventDispatcher taskEventDispatcher;
}  

3.4.2 TaskSlotTableImpl

TaskSlotTableImpl is the container for Slots and is built from the configuration. Its important member variables are:

  • taskSlots: the list of all task slots of this TE
  • allocatedSlots: the list of all task slots of this TE that have already been allocated
  • slotsPerJob: the Slots allocated per job
public class TaskSlotTableImpl<T extends TaskSlotPayload> implements TaskSlotTable<T> {

   /**
    * Number of slots in static slot allocation.
    * If a slot is requested with an index, the requested index must be within the range of [0, numberSlots).
    * When generating the slot report, we should always generate slots with index in [0, numberSlots) even if the slot does not exist.
    */
   private final int numberSlots;

   /** Slot resource profile for static slot allocation. */
   private final ResourceProfile defaultSlotResourceProfile; // defines the resources a slot owns

   /** Timer service used to time out allocated slots. */
   private final TimerService<AllocationID> timerService;

   /** The list of all task slots. */
   private final Map<Integer, TaskSlot<T>> taskSlots;

   /** Mapping from allocation id to task slot. */
   private final Map<AllocationID, TaskSlot<T>> allocatedSlots;

   /** Mapping from execution attempt id to task and task slot. */
   private final Map<ExecutionAttemptID, TaskSlotMapping<T>> taskSlotMappings;

   /** Mapping from job id to allocated slots for a job. */
   private final Map<JobID, Set<AllocationID>> slotsPerJob;

   /** Interface for slot actions, such as freeing them or timing them out. */
   @Nullable
   private SlotActions slotActions;
}  

3.4.3 ResourceProfile

The ResourceProfile class deserves special mention: it defines the resources a Slot owns, including

  • number of CPU cores
  • task heap memory size
  • task off-heap memory size
  • managed memory size
  • network memory size
  • extended resources, such as GPU and FPGA

When a user requests a Slot, these values are used to decide whether this Slot satisfies the request.

public class ResourceProfile implements Serializable {
	/** How many cpu cores are needed. Can be null only if it is unknown. */
	@Nullable
	private final Resource cpuCores;

	/** How much task heap memory is needed. */
	@Nullable // can be null only for UNKNOWN
	private final MemorySize taskHeapMemory;

	/** How much task off-heap memory is needed. */
	@Nullable // can be null only for UNKNOWN
	private final MemorySize taskOffHeapMemory;

	/** How much managed memory is needed. */
	@Nullable // can be null only for UNKNOWN
	private final MemorySize managedMemory;

	/** How much network memory is needed. */
	@Nullable // can be null only for UNKNOWN
	private final MemorySize networkMemory;

	/** An extensible field for user specified resources from {@link ResourceSpec}. */
	private final Map<String, Resource> extendedResources = new HashMap<>(1);
}
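
To make the matching idea concrete, here is a small self-contained sketch of how a slot's profile could be checked against a request. This is not Flink's actual ResourceProfile implementation; the class and field names are simplified for illustration:

// Illustrative only: a simplified resource profile and the "does this slot satisfy the request" check.
// Flink's real ResourceProfile also covers off-heap, managed and network memory plus extended resources.
class SimpleResourceProfile {
    final double cpuCores;
    final long taskHeapBytes;

    SimpleResourceProfile(double cpuCores, long taskHeapBytes) {
        this.cpuCores = cpuCores;
        this.taskHeapBytes = taskHeapBytes;
    }

    /** A slot profile matches a request if every dimension is at least as large as requested. */
    boolean isMatching(SimpleResourceProfile required) {
        return this.cpuCores >= required.cpuCores
            && this.taskHeapBytes >= required.taskHeapBytes;
    }
}

// Usage: a slot offering 1 core / 512 MB can serve a request for 1 core / 256 MB.
SimpleResourceProfile slot = new SimpleResourceProfile(1.0, 512L << 20);
SimpleResourceProfile request = new SimpleResourceProfile(1.0, 256L << 20);
boolean ok = slot.isMatching(request); // true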

3.5 The Resource Manager Side

The relevant component in the RM is the SlotManager. Its job is to maintain a unified view of:

  • all registered task manager slots
  • how those slots are allocated
  • all pending Slot requests
  • the currently registered Task Managers

A simplified version of its definition:

public class SlotManagerImpl implements SlotManager {

   /** Scheduled executor for timeouts. */
   private final ScheduledExecutor scheduledExecutor;

   /** Map for all registered slots. */
   private final HashMap<SlotID, TaskManagerSlot> slots;

   /** Index of all currently free slots. */
   private final LinkedHashMap<SlotID, TaskManagerSlot> freeSlots;

   /** All currently registered task managers. */
   private final HashMap<InstanceID, TaskManagerRegistration> taskManagerRegistrations;

   /** Map of fulfilled and active allocations for request deduplication purposes. */
   private final HashMap<AllocationID, SlotID> fulfilledSlotRequests;

   /** Map of pending/unfulfilled slot allocation requests. */
   private final HashMap<AllocationID, PendingSlotRequest> pendingSlotRequests;

   private final HashMap<TaskManagerSlotId, PendingTaskManagerSlot> pendingSlots;

   private final SlotMatchingStrategy slotMatchingStrategy;

   /** ResourceManager's id. */
   private ResourceManagerId resourceManagerId;

   /** Executor for future callbacks which have to be "synchronized". */
   private Executor mainThreadExecutor;

   /** Callbacks for resource (de-)allocations. */
   private ResourceActions resourceActions;
}    
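
The way the SlotManager serves a request can be pictured as: scan the free-slot index, hand out the first slot whose profile matches, otherwise keep the request pending. A rough, self-contained sketch of that idea (not the real SlotManagerImpl code; names and the single-number "capacity" are simplifications):

import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;

// Illustrative only: how a slot manager could match a request against its free-slot index.
class NaiveSlotManager {
    // slotId -> capacity (in the real code this would be a full ResourceProfile)
    private final LinkedHashMap<String, Long> freeSlots = new LinkedHashMap<>();

    void registerFreeSlot(String slotId, long capacity) {
        freeSlots.put(slotId, capacity);
    }

    /** Returns the id of a matching free slot, or empty if the request must stay pending. */
    Optional<String> findMatchingSlot(long requested) {
        Iterator<Map.Entry<String, Long>> it = freeSlots.entrySet().iterator();
        while (it.hasNext()) {
            Map.Entry<String, Long> e = it.next();
            if (e.getValue() >= requested) {
                it.remove();                       // mark the slot as allocated
                return Optional.of(e.getKey());
            }
        }
        return Optional.empty();                   // caller would record a pending slot request
    }
}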

3.6 The Job Master Side

Four components in the JM matter for Slots: Scheduler, SlotSharingManager, SlotPool and Execution.

3.6.1 Scheduler

The Scheduler component is the starting point of the whole resource scheduling process. Its main responsibilities are:

  • scheduling all Tasks of the Job's DAG: based on the Execution Graph and the Tasks' execution states, it decides which Task to schedule next
  • issuing SlotRequest resource requests
  • deciding the assignment between Tasks and Slots

A simplified version of its definition follows; note that it holds the slotSharingManagers and the SlotPool.

SchedulerImpl works with LogicalSlot and PhysicalSlot.

public class SchedulerImpl implements Scheduler {
   /** Strategy that selects the best slot for a given slot allocation request. */
   @Nonnull
   private final SlotSelectionStrategy slotSelectionStrategy;

   /** The slot pool from which slots are allocated. */
   @Nonnull
   private final SlotPool slotPool;

   /** Managers for the different slot sharing groups. */
   @Nonnull
   private final Map<SlotSharingGroupId, SlotSharingManager> slotSharingManagers;
}  

3.6.2 SlotSharingManager

SlotSharingManager manages slot sharing.

Slot sharing means that subtasks of different operators (an operator is usually split into as many parallel Tasks as its parallelism, executed in parallel) may run in the same Task Slot, i.e. they share the Slot.

In Storm, a supervisor hosts workers, and a worker usually runs one Task per Executor. In Flink, a TaskManager hosts slots. The similarity is that both the slot's host and the worker are JVM processes; the difference is that the TaskManager divides its resources across its slots.

SlotSharingGroup is the class Flink uses to implement slot sharing. It lets subtasks share a slot as much as possible: within one group, sub-tasks of different operators with the same parallel index can share one slot. The default group of every operator is "default" (so by default all subtasks of a job may share a slot). Users of the DataStream API can control this grouping explicitly, as sketched below.
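
A hedged example of controlling the grouping from user code (assuming the DataStream API, where operators expose slotSharingGroup(...); the group names are arbitrary):

// Illustrative DataStream snippet: operators in different slot sharing groups do not share slots.
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

env.fromElements("a", "b", "c")
   .map(String::toUpperCase)
   .slotSharingGroup("group-a")       // this map (and downstream operators by default) use group-a
   .filter(s -> !s.isEmpty())
   .slotSharingGroup("group-b")       // the filter is isolated into group-b
   .print();                          // print inherits the group of its input ("group-b")

env.execute("slot sharing demo");

Operators in different groups cannot share slots, so the total slot demand becomes the sum of the highest parallelism in each group.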

SlotSharingManager can build a TaskSlot hierarchy tree. The tree is implemented by MultiTaskSlot and SingleTaskSlot: a MultiTaskSlot is an inner node that contains a set of other TaskSlots, and a SingleTaskSlot is a leaf node.

When requesting slots, the SlotSharingManager goes through the Slot Pool to request the actual physical Slot.

A trimmed-down definition of SlotSharingManager:

public class SlotSharingManager {
   private final SlotSharingGroupId slotSharingGroupId;

   /** Actions to release allocated slots after a complete multi task slot hierarchy has been released. */
   private final AllocatedSlotActions allocatedSlotActions; // points to the SlotPool

   /** Owner of the slots to which to return them when they are released from the outside. */
   private final SlotOwner slotOwner; // points to SlotPool.ProviderAndOwner

   private final Map<SlotRequestId, TaskSlot> allTaskSlots; // stores MultiTaskSlots and SingleTaskSlots

   /** Root nodes which have not been completed because the allocated slot is still pending. */
   private final Map<SlotRequestId, MultiTaskSlot> unresolvedRootSlots; // temporary storage after a slot has been requested but before it is confirmed

   /** Root nodes which have been completed (the underlying allocated slot has been assigned). */
   private final Map<TaskManagerLocation, Map<AllocationID, MultiTaskSlot>> resolvedRootSlots; // after a slot has been requested and confirmed, grouped by TM, e.g. TM1-(MultiTaskSlot, MultiTaskSlot), TM2-(MultiTaskSlot, MultiTaskSlot)
}    

An example of its runtime state:

this = {SlotSharingManager@5659} Object is being initialized
 slotSharingGroupId = {SlotSharingGroupId@5660} "041826faeb51ada2ba89356613583507"
 allocatedSlotActions = {SlotPoolImpl@5661} 
 slotOwner = {SchedulerImpl@5662} 
 allTaskSlots = {HashMap@6489}  size = 0
 unresolvedRootSlots = {HashMap@6153}  size = 0
 resolvedRootSlots = null

3.6.3 SlotPool

The Slot Pool component in the JobManager holds all the resources allocated to the Job. From its trimmed-down definition below you can see that it holds a resourceManagerGateway for sending resource requests to the RM, and that pending requests are kept in pendingRequests or waitingForResourceManager.

What the SlotPool manages are AllocatedSlots, i.e. physical Slots.

public class SlotPoolImpl implements SlotPool {

   private final JobID jobId;

   /** All registered TaskManagers, slots will be accepted and used only if the resource is registered. */
   private final HashSet<ResourceID> registeredTaskManagers;

   /** The book-keeping of all allocated slots. */
   private final AllocatedSlots allocatedSlots;

   /** The book-keeping of all available slots. */
   private final AvailableSlots availableSlots;

   /** All pending requests waiting for slots. */
   private final DualKeyLinkedMap<SlotRequestId, AllocationID, PendingRequest> pendingRequests;

   /** The requests that are waiting for the resource manager to be connected. */
   private final LinkedHashMap<SlotRequestId, PendingRequest> waitingForResourceManager;

   /** The gateway to communicate with resource manager. */
   private ResourceManagerGateway resourceManagerGateway;

   private String jobManagerAddress;
}     

3.6.4 Execution

The JobManager turns the JobGraph into an ExecutionGraph, the parallel version of the JobGraph: each JobVertex contains one ExecutionVertex per parallel subtask. An operator with parallelism 100 has one JobVertex and 100 ExecutionVertices.

Every ExecutionGraph has an associated job status, which indicates the current state of the job's execution.

While the ExecutionGraph is executed, each parallel task goes through several stages, from created to finished or failed, and a task may be executed multiple times (for example during failure recovery).

An ExecutionVertex tracks the execution state of one particular subtask. All ExecutionVertices of a JobVertex are held by an ExecutionJobVertex, which tracks the status of the operator as a whole.

Each Execution represents one execution of an ExecutionVertex; every ExecutionVertex has a current Execution and a prior Execution.

Each Execution contains a private volatile LogicalSlot assignedResource; at runtime this is a SingleLogicalSlot.

When an Execution is deployed, it obtains the TaskManager's RPC gateway along the chain SingleLogicalSlot ---> AllocatedSlot ---> TaskManagerGateway, and only then can it submit the task via taskManagerGateway.submitTask. This is what connects the deployment phase of an Execution with the execution phase.
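
A much-simplified sketch of that hand-off (assuming the Flink 1.10-era interfaces discussed in this article — LogicalSlot, TaskManagerGateway, TaskDeploymentDescriptor, Acknowledge; error handling and the construction of the real TaskDeploymentDescriptor are omitted):

// Simplified sketch of Execution#deploy: from the logical slot to the TaskExecutor RPC gateway.
// Assumes the LogicalSlot / TaskManagerGateway interfaces discussed above; not the full implementation.
void deploySketch(LogicalSlot slot, TaskDeploymentDescriptor deployment, Time rpcTimeout) {
    // 1. The logical slot knows (through the underlying AllocatedSlot) which TaskManager owns it.
    TaskManagerGateway gateway = slot.getTaskManagerGateway();

    // 2. Submitting the task is an RPC to that TaskExecutor; the future completes with an Acknowledge.
    CompletableFuture<Acknowledge> ack = gateway.submitTask(deployment, rpcTimeout);

    // 3. On failure the real code marks the Execution as FAILED and releases the slot.
    ack.whenComplete((acknowledge, failure) -> {
        if (failure != null) {
            // handle deployment failure (omitted)
        }
    });
}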

/**
 * A single execution of a vertex. While an {@link ExecutionVertex} can be executed multiple times
 * (for recovery, re-computation, re-configuration), this class tracks the state of a single execution
 * of that vertex and the resources.  */
public class Execution implements AccessExecution, Archiveable<ArchivedExecution>, LogicalSlot.Payload {
	/** The executor which is used to execute futures. */
	private final Executor executor;

	/** The execution vertex whose task this execution executes. */
	private final ExecutionVertex vertex;

	/** The unique ID marking the specific execution instant of the task. */
	private final ExecutionAttemptID attemptId;

	private final Collection<PartitionInfo> partitionInfos;

	private final CompletableFuture<TaskManagerLocation> taskManagerLocationFuture;

	private volatile LogicalSlot assignedResource;
}    

0x04 Slot Allocation

4.1 Slot Isolation Principles

Computing resources in Flink are defined through Task Slots. Each task slot represents a fixed-size subset of the TaskManager's resources. For example, a TaskManager with 3 slots splits its managed memory evenly into three parts, one per slot.

+----------------------------------------------+
|     TaskExecutor (Flink Memory)              |
|                                              |
| +-------------------+ +--------+ +--------+  |
| |      Slot         | |  Slot  | |  Slot  |  |
| |                   | |        | |        |  |
| | +---------------+ | | +----+ | | +----+ |  |
| | | Task Heap     | | | |    | | | |    | |  |
| | +---------------+ | | +----+ | | +----+ |  |
| | +---------------+ | | +----+ | | +----+ |  |
| | | Task Off Heap | | | |    | | | |    | |  |
| | +---------------+ | | +----+ | | +----+ |  |
| | +---------------+ | | +----+ | | +----+ |  |
| | |  Network      | | | |    | | | |    | |  |
| | +---------------+ | | +----+ | | +----+ |  |
| | +---------------+ | | +----+ | | +----+ |  |
| | | Managed       | | | |    | | | |    | |  |
| | +---------------+ | | +----+ | | +----+ |  |
| +-----------------------------------------+  |
| +-----------------------------------------+  |
| |      Framework Heap                     |  |
| +-----------------------------------------+  |
| +-----------------------------------------+  |
| |     Framework Off Heap                  |  |
| +-----------------------------------------+  |
+----------------------------------------------+


Task slots provide resource isolation between different Tasks in a TaskManager, but only logical isolation, and only of memory: at the scheduling level each task slot is "supposed" to get 1/N of taskmanager.heap.size, which means tasks from different jobs do not compete for memory.

Slot isolation is mainly memory isolation. CPU itself is not isolated; CPU can be shared between slots.

If a TaskManager has several slots, multiple tasks run in the same JVM. Tasks in the same JVM process can share TCP connections (via multiplexing) and heartbeat messages, reducing network traffic, and they can share some data structures, which reduces the per-task overhead to some extent.

Each slot can run a pipeline of several consecutive tasks, for example the n-th parallel instance of a MapFunction together with the n-th parallel instance of a ReduceFunction. By adjusting the size and number of slots, the work of a job can be parallelized well.

4.2 How Many Slots Does the System Have?

By adjusting the number of task slots, users define how tasks are isolated from each other.

We configure slots according to each TaskManager machine's capacity and resources. A slot is a subset of all its resources, and during execution it corresponds to an isolated, independent subtask (thread); in effect, slots separate the different subtasks from each other. If a machine has a lot of memory and many CPUs, it can run more tasks in parallel and can be given more slots.

Slot memory is divided evenly. For instance, a machine with 16 GB of memory divided into 4 slots gives each slot 4 GB. If each slot's memory is too small, tasks may simply fail to run: blowing up the memory is entirely possible.

So we should decide how many slots a TaskManager is split into based on the complexity of the tasks, how much resource they consume, and the machine's own resources. If memory is plentiful, then to avoid low utilization caused by slots contending for the CPU, the number of slots is often set to the number of CPU cores.

Once the JobManager has the execution plan, how does it determine how many slots are needed? It only has to look at the highest parallelism configured among all operators of the job; if that is satisfied, everything else is satisfied too. An example is sketched below.
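
For example, with the default slot sharing group the number of required slots equals the highest operator parallelism. A hedged DataStream sketch (illustrative only):

// Illustrative: with the default slot sharing group the job needs as many slots as its widest operator.
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

env.fromElements("a", "b", "c")                  // non-parallel source, parallelism 1
   .map(String::toUpperCase).setParallelism(6)   // widest operator: 6 parallel sub-tasks
   .filter(s -> !s.isEmpty()).setParallelism(2)
   .print().setParallelism(1);

// Required slots = max parallelism over all operators = 6 (the sub-task pipelines share slots).
env.execute("how many slots");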

  • In Standalone deployment mode the number of Task Managers is fixed. If the cluster is started with the start-cluster.sh script, the number of TMs is decided by the cluster configuration file; a TM can also be started manually by running the taskmanager.sh script.

  • In Kubernetes, Yarn and Mesos deployment modes, the number of Task Managers is decided dynamically and on demand by the SlotManager / ResourceManager:

    • when the current number of Slots cannot satisfy a new Slot Request, a new TaskManager is requested and started
    • when a TaskManager has been idle for a while, it is released after a timeout

0x05 Slot Classification

Flink has many kinds of Slot definitions, spread over different scopes and modules, which makes things complicated.

Let's first summarize the "Slot-related classes" in a table (here we assume CoLocationConstraint is enabled).

  • The first row is the component scope.
  • The second row is the concrete component that uses the "Slot-related class"; for example PhysicalSlot is used by the SlotPool / Scheduler components.
  • The third row is the logical concept the "Slot-related class" corresponds to; for example TaskSlot represents a Slot on a Task Executor.
  • The fourth and following rows are the class names; classes on the same row correspond one-to-one, e.g. one TaskSlot corresponds to one AllocatedSlot and to one TaskManagerSlot.
  • SingleLogicalSlot's member variable SlotContext slotContext actually points to the corresponding AllocatedSlot; this is what ties the logical Slot and the physical Slot together.
  • Each Execution ultimately corresponds to one SingleTaskSlot; the SingleTaskSlot takes its JobVertex information from the Execution.
  • When an Execution is deployed, it obtains the TaskManager's RPC gateway along the chain SingleLogicalSlot ---> AllocatedSlot ---> TaskManagerGateway, and only then can it submit the task via taskManagerGateway.submitTask. This connects the deployment phase of the Execution with the execution phase.
  • The children of a MultiTaskSlot can only be SingleTaskSlots.
  • SlotOffer is just an intermediate variable used for RPC.

The relationships are as follows:

Scope     | Task Manager  | Job Master           | Job Master                     | Job Master        | Resource Manager
Component | TaskSlotTable | SlotPool / Scheduler | SlotSharingManager / Scheduler | Scheduler         | SlotManager
Concept   | Slot on TE    | Physical Slot        | Sharing Slot                   | Logical Slot      |
Class     | TaskSlot      | AllocatedSlot        | SingleTaskSlot                 | SingleLogicalSlot | TaskManagerSlot
          | SlotOffer     |                      | MultiTaskSlot                  |                   |

The one-to-one correspondences are shown below. Except for "taskManagerGateway.submitTask (call)", which denotes a call relationship, everything else is a real physical correspondence or a creation relationship.

  • TaskSlot is the TE's encapsulation of physical resources
  • SlotOffer is an intermediate step, in one-to-one correspondence with a TaskSlot; the TE uses it to offer the Slot to the JM
  • SingleTaskSlot is a leaf node of the shared-slot tree, in one-to-one correspondence with a SlotOffer
  • MultiTaskSlot is an inner node of the shared-slot tree
  • SingleLogicalSlot is the logical abstraction of a Slot, created from a SingleTaskSlot, in one-to-one correspondence with it
  • The Execution#assignedResource member points to a SingleLogicalSlot, in one-to-one correspondence with it
  • TaskManagerSlot is the RM-side record for state tracking/management of a TE's TaskSlot; TaskManagerSlot and TaskSlot correspond one-to-one
                              taskManagerGateway.submitTask (call)
              +---------------------------------------------------------------+
              |                                                               |
              |                                                               |
+-------------+----------+         +------------------------+                 |
|       Execution        |         |    SingleLogicalSlot   |                 |
|                        |         |                        |                 |
| +--------------------+ |         |  +------------------+  |                 |
| |                    | |         |  |                  |  |                 |
| |  assignedResource+-----------> |  |  slotContext  +----------------+      |
| |                    | |         |  |                  |  |          |      |
| |                    | |         |  |  slotOwner       |  |          |      |
| |                    | |         |  |  (SlotPoolImpl)  |  |          |      |
| |                    | |         |  |                  |  |          |      |
| +--------------------+ |         |  +------------------+  |          |      |
|                        |         |                        |          |      |
+------------------------+         +------------------------+          |      |
                                                                       |      |
           +-----------------------------------------------------------+      |
           |                                                                  |
           v                                                                  v
+----------+--------------+        +-------------------------+    +-----------+-------+
|       AllocatedSlot     |        |       SlotOffer         |    | TaskSlot in TM    |
| +---------------------+ |        | +---------------------+ |    | +---------------+ |
| | resourceProfile +------------> | |     resourceProfile+-----> | |resourceProfile| |
| |                     | |        | |                     | |    | |               | |
| |  allocationId  +-------------> | |     allocationId  +------> | |allocationId   | |
| |                     | |        | |                     | |    | |               | |
| | physicalSlotNumber+----------> | |     slotIndex   +--------> | |index          | |
| |                     | |        | +---------------------| |    | +---------------+ |
| |                     | |        +-------------------------+    +-------------------+
| |                     | |        +----------------------------+
| |                     | |        |       JobMaster            |
| | TaskManagerLocation+---------> | +------------------------+ |
| |                     | |        | |                        | |
| | taskManagerGateway+----------> | | registeredTaskManagers | |
| +---------------------| |        | +------------------------+ |
+-------------------------+        +----------------------------+



Next is a Slot workflow (you can think of it as bottom-up). One key point:

When the JM receives a SlotOffer, it uses the taskManagerId parameter passed over RPC to build a taskExecutorGateway, and this taskExecutorGateway is assigned to AllocatedSlot.taskManagerGateway. This is what connects the JM-side Slot with the taskManager that hosts the Slot.

---------- Task Executor ----------
       │ 
       │ 
┌─────────────┐   
│  TaskSlot   │  requestSlot
└─────────────┘     
       │ // TaskSlot is the TE's encapsulation of physical resources
       │  
       │                  
┌──────────────┐   
│  SlotOffer   │  offerSlotsToJobManager
└──────────────┘   
       │ // SlotOffer is an intermediate step, one-to-one with a TaskSlot; the TE uses it to offer the Slot to the JM
       │ 
       │      
------------- Job Manager -------------
       │ 
       │       
┌──────────────┐   
│  SlotOffer   │  JobMaster#offerSlots(taskManagerId,slots)
└──────────────┘     
       │ //taskManager = registeredTaskManagers.get(taskManagerId);     
       │ //taskManagerLocation = taskManager.f0;     
       │ //taskExecutorGateway = taskManager.f1;     
       │    
       │       
┌──────────────┐   
│  SlotOffer   │  SlotPoolImpl#offerSlots
└──────────────┘     
       │ 
       │      
┌───────────────┐   
│ AllocatedSlot │  SlotPoolImpl#offerSlot
└───────────────┘      
       │ 
       │      
┌───────────────┐   
│ Callback Future 3 │ SlotSharingManager#createRootSlot
└───────────────┘      
       │ 
       │      
┌───────────────┐   
│ Callback Future 2 │  SingleTaskSlot#SingleTaskSlot
└───────────────┘      
       │ // SingleTaskSlot is a leaf of the shared-slot tree, one-to-one with the SlotOffer
       │   
       │      
┌───────────────────┐   
│ SingleLogicalSlot │ new SingleLogicalSlot
└───────────────────┘    
       │ // SingleLogicalSlot is the logical abstraction of a Slot, created from the SingleTaskSlot, one-to-one with it
       │     
       │    
┌───────────────────┐   
│ SingleLogicalSlot │  
│ Callback Future 1   │ allocationResultFuture.complete()
└───────────────────┘   
       │    
       │        
┌───────────────────────────────┐  
│     SingleLogicalSlot         │  
│ Callback assignResourceOrHandleError │
└───────────────────────────────┘
       │    
       │        
┌────────────────┐   
│ ExecutionVertex│ tryAssignResource
└────────────────┘    
       │    
       │        
┌────────────────┐   
│    Execution   │ tryAssignResource
└────────────────┘       
       │ // the Execution#assignedResource member points to the SingleLogicalSlot, one-to-one with it
       │     
       │        
┌──────────────────┐   
│ SingleLogicalSlot│ tryAssignPayload
└──────────────────┘  
       │    
       │        
┌───────────────────────┐   
│   SingleLogicalSlot   │      
│ Callback deployOrHandleError │
└───────────────────────┘   
       │    
       │       
┌────────────────┐   
│ ExecutionVertex│ deploy
└────────────────┘    
       │    
       │        
┌────────────────┐   
│    Execution   │ deploy
└────────────────┘        
       │    
       │        
 ---------- Task Executor ----------
       │    
       │     
┌────────────────┐   
│  TaskExecutor  │ submitTask
└────────────────┘     
       │    
       │        
┌────────────────┐   
│  TaskExecutor  │ startTaskThread
└────────────────┘         


Let's go through them one by one.

5.1 The Task Manager Side

The Task Manager contains the following Slot-related structures.

5.1.1 TaskSlot

org.apache.flink.runtime.taskexecutor.slot.TaskSlot is the basic Slot definition within the TM. As we can see, it records which job the slot is allocated to, its resource profile, the tasks running in this slot, the memory manager, the slot state, and so on.

Resource isolation is achieved through the MemoryManager: from the total amount of memory to manage and the size of each memory page, the MemoryManager derives the number of pages and hands these pages out as the usable memory.

Flink cannot guarantee that TM resources are strictly split evenly between slots; resources of different threads within one JVM are not strictly isolated. The "even division" is mostly a scheduling notion: at scheduling time one slot is considered to hold 1/n of the TM's resources (n being the number of slots).

The exception is the managed memory used by DataSet jobs: Flink does guarantee that the TM's managed memory is split evenly across slots. Managed memory is governed by the TM's MemoryManager; tasks request memory from the MemoryManager at runtime, so the upper bound on the memory requested by the tasks in each slot can be enforced.
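
That page bookkeeping is simple arithmetic: the number of pages the MemoryManager can hand out is the managed memory divided by the page size. A tiny illustrative sketch (not the real MemoryManager), using the 128 MB / 32 KB values that appear in the runtime dumps later in this article:

// Illustrative only: how managed memory turns into a fixed budget of memory pages.
long managedMemoryPerSlot = 128L * 1024 * 1024; // 128 MB, as in the runtime dump below
int pageSize = 32 * 1024;                       // 32 KB memory segments, as in the TaskSlotTableImpl dump

long totalPages = managedMemoryPerSlot / pageSize; // 4096 pages available to the tasks in this slot

// A task "reserves" pages; once the budget is exhausted, further requests fail or have to wait.
long requestedPages = 512;
boolean canAllocate = requestedPages <= totalPages; // true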

// Container for multiple {@link TaskSlotPayload tasks} belonging to the same slot. 
public class TaskSlot<T extends TaskSlotPayload> implements AutoCloseableAsync {

   /** Index of the task slot. */
   private final int index;

   /** Resource characteristics for this slot. */
   private final ResourceProfile resourceProfile;

   /** Tasks running in this slot. */
   private final Map<ExecutionAttemptID, T> tasks;

   private final MemoryManager memoryManager; // physical resource isolation

   /** State of this slot. */
   private TaskSlotState state;

   /** Job id to which the slot has been allocated. */
   private final JobID jobId;

   /** Allocation id of this slot. */
   private final AllocationID allocationId; // generated in SlotPoolImpl#requestSlotFromResourceManager

   /** The closing future is completed when the slot is freed and closed. */
   private final CompletableFuture<Void> closingFuture;
    
	public SlotOffer generateSlotOffer() {
		return new SlotOffer(allocationId, index, resourceProfile);
	}    
}  

The RM sends slot allocation requests to the TM; after the TM allocates a Slot, it updates the internal state of TaskSlotTableImpl. The runtime state of an allocated TaskSlot looks like this:

taskSlot = {TaskSlot@12027}
 index = 0
 resourceProfile = {ResourceProfile@5359} "ResourceProfile{managedMemory=128.000mb (134217728 bytes), networkMemory=64.000mb (67108864 bytes)}"
 tasks = {HashMap@18322}  size = 0
 memoryManager = {MemoryManager@18323} 
 state = {TaskSlotState@18324} "ALLOCATED"
 jobId = {JobID@6259} "c7be7a4944784caac382cdcd9e651863"
 allocationId = {AllocationID@6257} "20d50091f2d16939f79f06edf66494f7"
 closingFuture = {CompletableFuture@18325} "java.util.concurrent.CompletableFuture@1ec6b32[Not completed]"

5.1.2 SlotOffer

After the TM has allocated a TaskSlot, it calls TaskSlot#generateSlotOffer to produce a SlotOffer and sends it to the JM.

SlotOffer is simply an intermediate variable used for RPC.

/**
 * Describe the slot offering to job manager provided by task manager.
 */
public class SlotOffer implements Serializable {
	/** Allocation id of this slot, this would be the only identifier for this slot offer */
	private AllocationID allocationId;

	/** Index of the offered slot */
	private final int slotIndex;

	/** The resource profile of the offered slot */
	private final ResourceProfile resourceProfile;
}

5.2 The Job Manager Side

The Slot-related classes inside the JM are roughly the following:

Physical slots

org.apache.flink.runtime.jobmaster.slotpool.PhysicalSlot extends SlotContext

org.apache.flink.runtime.jobmaster.slotpool.AllocatedSlot implements PhysicalSlot

Shared slots

org.apache.flink.runtime.jobmaster.slotpool.SlotSharingManager.TaskSlot

org.apache.flink.runtime.jobmaster.slotpool.SlotSharingManager.MultiTaskSlot

org.apache.flink.runtime.jobmaster.slotpool.SlotSharingManager.SingleTaskSlot

Logical slots

org.apache.flink.runtime.jobmaster.LogicalSlot

org.apache.flink.runtime.jobmaster.slotpool.SingleLogicalSlot implements LogicalSlot, PhysicalSlot.Payload

5.2.1 Physical Slot

A TaskSlot on the TM corresponds to one SlotOffer, which is offered to the JM over RPC; the JM then creates an AllocatedSlot from that SlotOffer. In this way a TM TaskSlot corresponds to a physical Slot in the SlotPool. Everything the SlotPool manages is a physical Slot.

/* The context of an {@link AllocatedSlot}. This represent an interface to classes outside the slot pool to interact with allocated slots.*/
public interface PhysicalSlot extends SlotContext {

   /**
    * Tries to assign the given payload to this allocated slot. This only works if there has not been another payload assigned to this slot.
    *
    * @param payload to assign to this slot
    * @return true if the payload could be assigned, otherwise false
    */
   boolean tryAssignPayload(Payload payload);

   /**
    * Payload which can be assigned to an {@link AllocatedSlot}.
    */
   interface Payload {

      /**
       * Releases the payload
       *
       * @param cause of the payload release
       */
      void release(Throwable cause);
   }
}

AllocatedSlot implements PhysicalSlot. Because it represents a concrete physical Slot, it contains a TaskManagerGateway for interacting with the TaskManager.

The TaskManagerGateway interface defines the methods for communicating with the TaskManager. It has two concrete implementations, based on the Actor model and on RPC respectively. The RPC-based implementation delegates the actual work of submitting tasks to TaskExecutor, the implementation class of TaskExecutorGateway.

class AllocatedSlot implements PhysicalSlot {

	/** The ID under which the slot is allocated. Uniquely identifies the slot. */
	private final AllocationID allocationId;

	/** The location information of the TaskManager to which this slot belongs */
	private final TaskManagerLocation taskManagerLocation;

	/** The resource profile of the slot provides */
	private final ResourceProfile resourceProfile;

	/** RPC gateway to call the TaskManager that holds this slot */
	private final TaskManagerGateway taskManagerGateway;

	/** The number of the slot on the TaskManager to which slot belongs. Purely informational. */
	private final int physicalSlotNumber;

	private final AtomicReference<Payload> payloadReference;
} 

5.2.2 Shared Slot

The following three classes are all defined inside SlotSharingManager and are used by both SlotSharingManager and the Scheduler. To some extent they can be seen as an intermediate stage between the physical Slot and the logical Slot.

  • org.apache.flink.runtime.jobmaster.slotpool.SlotSharingManager.TaskSlot

  • org.apache.flink.runtime.jobmaster.slotpool.SlotSharingManager.MultiTaskSlot

  • org.apache.flink.runtime.jobmaster.slotpool.SlotSharingManager.SingleTaskSlot

As described above, SlotSharingManager manages slot sharing; slot sharing allows different tasks to run in the same slot and supports co-location constraints. SlotSharingManager builds a TaskSlot hierarchy tree.

The hierarchy tree is implemented by MultiTaskSlot and SingleTaskSlot: a MultiTaskSlot is an inner node that contains a set of other TaskSlots, and a SingleTaskSlot is a leaf node.

Once the JM has produced a physical Slot, i.e. an AllocatedSlot, a callback fires (Future 2 in the flow above; the code sits in the SingleTaskSlot constructor, and the callback's input parameter is a SlotContext). Because PhysicalSlot extends SlotContext, this is where SingleTaskSlot maps the physical Slot into a logical Slot: a SingleLogicalSlot.
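
In code this mapping is essentially a future transformation: once the future of the physical slot completes, a SingleLogicalSlot is built from the delivered SlotContext. A simplified sketch of the idea (assuming the classes discussed in this section; the variables and the real constructor's argument list are abbreviated):

// Simplified sketch (inside SingleTaskSlot): map the physical-slot future to a logical slot.
CompletableFuture<? extends SlotContext> slotContextFuture = parent.getSlotContextFuture();

CompletableFuture<SingleLogicalSlot> singleLogicalSlotFuture =
    slotContextFuture.thenApply(slotContext ->
        // PhysicalSlot extends SlotContext, so the AllocatedSlot arrives here and is
        // wrapped into the logical view that will be handed to the Scheduler.
        new SingleLogicalSlot(slotRequestId, slotContext, slotSharingGroupId, locality, slotOwner));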

/**
 * Base class for all task slots.
 */
public abstract static class TaskSlot {
   // every TaskSlot has an associated slot request id
   private final SlotRequestId slotRequestId;

   // all task slots except for the root slots have a group id assigned
   @Nullable
   private final AbstractID groupId;
}   

MultiTaskSlot extends TaskSlot:

public final class MultiTaskSlot extends TaskSlot implements PhysicalSlot.Payload {

   private final Map<AbstractID, TaskSlot> children;

   // the root node has its parent set to null
   @Nullable
   private final MultiTaskSlot parent;

   // underlying allocated slot
   private final CompletableFuture<? extends SlotContext> slotContextFuture;

   // slot request id of the allocated slot
   @Nullable
   private final SlotRequestId allocatedSlotRequestId;

   // true if we are currently releasing our children
   private boolean releasingChildren;

   // the total resources reserved by all the descendants.
   private ResourceProfile reservedResources;
}    

SingleTaskSlot also extends TaskSlot, but every SingleTaskSlot has a MultiTaskSlot parent.

/**
 * {@link TaskSlot} implementation which harbours a {@link LogicalSlot}. The {@link SingleTaskSlot}
 * cannot have any children assigned.
 */
public final class SingleTaskSlot extends TaskSlot {
		private final MultiTaskSlot parent;

		// future containing a LogicalSlot which is completed once the underlying SlotContext future is completed
		private final CompletableFuture<SingleLogicalSlot> singleLogicalSlotFuture;

		// the resource profile of this slot.
		private final ResourceProfile resourceProfile;  
}   

5.2.3 Logical Slot

The callback from the previous section carries the SingleLogicalSlot onward; the next callback invoked is assignResourceOrHandleError, which was set up during the Deploy phase, so we are now in the resource assignment stage.

/**
 * A logical slot represents a resource on a TaskManager into
 * which a single task can be deployed.
 */
public interface LogicalSlot {

   Payload TERMINATED_PAYLOAD = new Payload() {

      private final CompletableFuture<?> completedTerminationFuture = CompletableFuture.completedFuture(null);

      @Override
      public void fail(Throwable cause) {
         // ignore
      }

      @Override
      public CompletableFuture<?> getTerminalStateFuture() {
         return completedTerminationFuture;
      }
   };
 }  

SingleLogicalSlot implements LogicalSlot. Its member variable SlotContext slotContext actually points to the corresponding AllocatedSlot.

/**
 * Implementation of the {@link LogicalSlot} which is used by the {@link SlotPoolImpl}.
 */
public class SingleLogicalSlot implements LogicalSlot, PhysicalSlot.Payload {

   private final SlotRequestId slotRequestId;

   private final SlotContext slotContext;  // actually points to the corresponding AllocatedSlot

   // null if the logical slot does not belong to a slot sharing group, otherwise non-null
   @Nullable
   private final SlotSharingGroupId slotSharingGroupId;

   // locality of this slot wrt the requested preferred locations
   private final Locality locality;

   // owner of this slot to which it is returned upon release
   private final SlotOwner slotOwner;

   private final CompletableFuture<Void> releaseFuture;

   private volatile State state;

   // LogicalSlot.Payload of this slot
   private volatile Payload payload;
}   

5.2.4 Execution

At this point we have the logical Slot, and the callbacks have brought us to the deployment stage.

org.apache.flink.runtime.executiongraph.Execution contains a LogicalSlot. Note also that Execution implements LogicalSlot.Payload, which is used interchangeably with PhysicalSlot.Payload; in the end the Slot's payload is the Execution itself.

public class Execution implements AccessExecution, Archiveable<ArchivedExecution>, LogicalSlot.Payload {
    ......
	private volatile LogicalSlot assignedResource;
    ......
}

The callback then proceeds as follows (see the sketch after this list):

  • it goes down into executionVertex.tryAssignResource,

  • ExecutionVertex # tryAssignResource

  • Execution # tryAssignResource

  • SingleLogicalSlot # tryAssignPayload(this), which assigns the Execution itself to Slot.payload.

This completes the resource assignment: the Execution and the LogicalSlot are now tied together.
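
The payload assignment itself is essentially a compare-and-set on the slot: the Execution becomes the slot's payload only if nothing else has claimed it. A simplified, self-contained sketch of that idea (not the real SingleLogicalSlot code):

import java.util.concurrent.atomic.AtomicReference;

// Illustrative only: assigning an Execution as the payload of a slot must win a compare-and-set,
// so a slot can never carry two payloads at the same time.
class SketchSlot {
    private final AtomicReference<Object> payload = new AtomicReference<>();

    boolean tryAssignPayload(Object execution) {
        // succeeds only if the slot had no payload yet
        return payload.compareAndSet(null, execution);
    }
}

// Usage: the first assignment wins, a second attempt is rejected.
SketchSlot slot = new SketchSlot();
boolean first = slot.tryAssignPayload(new Object());   // true
boolean second = slot.tryAssignPayload(new Object());  // false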

5.3 The Resource Manager Side

The RM component SlotManagerImpl uses TaskManagerSlot:

package org.apache.flink.runtime.resourcemanager.slotmanager.TaskManagerSlot

5.3.1 TaskManagerSlot

When a TM starts or its state changes (via the heartbeat mechanism), the TM registers its Slots with the RM or sends a slot report.

The RM (SlotManagerImpl) updates the state of its internal TaskManagerSlots accordingly. TaskManagerSlot and the TM's TaskSlot correspond one-to-one.

/**
 * A TaskManagerSlot represents a slot located in a TaskManager. It has a unique identification and
 * resource profile associated.
 */
public class TaskManagerSlot implements TaskManagerSlotInformation {

	/** The unique identification of this slot. */
	private final SlotID slotId;

	/** The resource profile of this slot. */
	private final ResourceProfile resourceProfile;

	/** Gateway to the TaskExecutor which owns the slot. */
	private final TaskExecutorConnection taskManagerConnection;

	/** Allocation id for which this slot has been allocated. */
	private AllocationID allocationId;

	/** Job id for which this slot has been allocated. */
	@Nullable
	private JobID jobId;
} 	
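
When a slot report arrives with a heartbeat, SlotManagerImpl reconciles its TaskManagerSlot records with the reported state (free vs. allocated, and for which allocation). A self-contained illustrative sketch of that reconciliation (not the real code; the ids are plain strings here):

import java.util.HashMap;
import java.util.Map;

// Illustrative only: reconciling the RM-side view of a TE's slots with a reported state.
class SketchSlotView {
    // slotId -> allocationId of the job occupying it; an absent key means the slot is free
    private final Map<String, String> slotToAllocation = new HashMap<>();

    /** Apply one entry of a slot report coming from a TaskExecutor heartbeat. */
    void applyReport(String slotId, String reportedAllocationId) {
        String known = slotToAllocation.get(slotId);
        if (known == null && reportedAllocationId != null) {
            // TE says the slot is in use but RM thought it was free: record the allocation
            slotToAllocation.put(slotId, reportedAllocationId);
        } else if (known != null && reportedAllocationId == null) {
            // TE says the slot is free again: release it in the RM view
            slotToAllocation.remove(slotId);
        }
        // otherwise the two views already agree (conflict handling is omitted here)
    }
}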

Now that we have analyzed the data structures, let's look at how they are built when the system starts.

0x06 System Startup

The following is based on debugging in IDEA; the code path in a real deployment is largely the same.

When Flink starts, it creates the corresponding modules, such as TaskManagerServices, Scheduler and SlotPool.

We go through them by major scope.

6.1 TM & TE Side

6.1.1 TaskManagerServices Startup

Slot allocation and management live in TaskManagerServices, as follows.

6.1.1.1 Call Path

While creating the TaskExecutor, the TaskSlotTable is built from the configuration. The call path is:

  1. LocalExecutor#execute
  2. LocalExecutor#startMiniCluster
  3. MiniCluster#start
  4. MiniCluster#startTaskManagers
  5. MiniCluster#startTaskExecutor
  6. TaskManagerRunner#startTaskManager
  7. TaskManagerServices#fromConfiguration
  8. TaskManagerServices#createTaskSlotTable, which builds the TaskSlotTable from the configuration
  9. TaskSlotTableImpl#TaskSlotTableImpl

6.1.1.2 Configuration

The TaskManager settings can be adjusted through the configuration file. In TaskManagerServices#createTaskSlotTable we can see the information derived from the configuration:

taskManagerServicesConfiguration = {TaskManagerServicesConfiguration@4263} 
  confData = {HashMap@4319}  size = 13
   "heartbeat.timeout" -> "18000000"
   "taskmanager.memory.network.min" -> {MemorySize@4338} "64 mb"
   "taskmanager.cpu.cores" -> {Double@4340} 1.7976931348623157E308
   "taskmanager.memory.task.off-heap.size" -> {MemorySize@4342} "9223372036854775807 bytes"
   "execution.target" -> "local"
   "rest.bind-port" -> "0"
   "taskmanager.memory.network.max" -> {MemorySize@4338} "64 mb"
   "execution.attached" -> {Boolean@4349} true
   "jobmanager.scheduler" -> "ng"
   "taskmanager.memory.managed.size" -> {MemorySize@4353} "128 mb"
   "taskmanager.numberOfTaskSlots" -> {Integer@4355} 2
   "taskmanager.memory.task.heap.size" -> {MemorySize@4342} "9223372036854775807 bytes"
   "rest.address" -> "localhost"
 resourceID = {ResourceID@4274} "40d390ec-7d52-4f34-af86-d06bb515cc48"
 taskManagerAddress = {Inet4Address@4276} "localhost/127.0.0.1"
 numberOfSlots = 2

6.1.1.3 TaskSlotTableImpl

The runtime state of TaskSlotTableImpl is shown below. At this point only the data structures have been created; no concrete Slot resources have been produced yet, because the TM only knows it has these resources, while the concrete allocation is handled by the RM.

this = {TaskSlotTableImpl@4192} 
 numberSlots = 2
 defaultSlotResourceProfile = {ResourceProfile@4194} "ResourceProfile{managedMemory=64.000mb (67108864 bytes), networkMemory=32.000mb (33554432 bytes)}"
 memoryPageSize = 32768
 timerService = {TimerService@4195} 
 taskSlots = {HashMap@4375}  size = 0
 allocatedSlots = {HashMap@4387}  size = 0
 taskSlotMappings = {HashMap@4391}  size = 0
 slotsPerJob = {HashMap@4395}  size = 0
 slotActions = null
 state = {TaskSlotTableImpl$State@4403} "CREATED"
 budgetManager = {ResourceBudgetManager@4383} 
 closingFuture = {CompletableFuture@4408} "java.util.concurrent.CompletableFuture@7aac8884[Not completed]"
 mainThreadExecutor = {ComponentMainThreadExecutor$DummyComponentMainThreadExecutor@4219} 

6.1.2 TaskExecutor Startup

Flink creates the Slots through the following call path.

  1. TaskExecutor#onStart;

  2. TaskExecutor#startTaskExecutorServices;

    • taskSlotTable.start(new SlotActionsImpl(), getMainThreadExecutor());
      
  3. TaskSlotTableImpl#start. This ties the TaskExecutor's "main thread execution context" to the TaskSlotTable;

6.2 The Resource Manager Side

On the RM side this is mainly the startup of the SlotManager.

6.2.1 SlotManager Startup

The call stack for the SlotManager is:

  • MiniCluster#setupDispatcherResourceManagerComponents
  • MiniCluster#createDispatcherResourceManagerComponents
  • DefaultDispatcherResourceManagerComponentFactory#create
  • StandaloneResourceManagerFactory#createResourceManager
  • ResourceManagerRuntimeServices#fromConfiguration
  • ResourceManagerRuntimeServices#createSlotManager
  • SlotManagerImpl#SlotManagerImpl

6.3 The Job Master Side

There are more modules here.

6.3.1 Scheduler

The execution path for the Scheduler is:

  • JobMaster#JobMaster

  • JobMaster#createScheduler

  • DefaultSchedulerFactory#createInstance

  • DefaultScheduler#init, which produces schedulingStrategy = LazyFromSourcesSchedulingStrategy. Flink supports two execution modes: LAZY_FROM_SOURCE initializes an Operator only when its input data is ready, while EAGER loads all nodes of the computation graph in topological order right from the start.

    • this.schedulingStrategy = schedulingStrategyFactory.createInstance(this, getSchedulingTopology());
      
      

6.3.2 SlotSharingManager

This is mentioned ahead of time here; in fact a SlotSharingManager is only created lazily, when one is needed and does not exist yet.

The execution path for SlotSharingManager is:

  • SlotProviderStrategy$NormalSlotProviderStrategy#allocateSlot
  • SchedulerImpl#allocateSlot
  • SchedulerImpl#allocateSlotInternal
  • SchedulerImpl#internalAllocateSlot
  • SchedulerImpl#allocateSharedSlot
  • HashMap#computeIfAbsent
  • SchedulerImpl#lambda$allocateSharedSlot
  • SlotSharingManager#init

6.3.3 SlotPool

The execution path for SlotPool is:

  • DefaultJobManagerRunnerFactory#createJobManagerRunner
  • JobManagerRunnerImpl#JobManagerRunnerImpl
  • DefaultJobMasterServiceFactory#createJobMasterService
  • JobMaster#JobMaster
  • DefaultSlotPoolFactory#createSlotPool
  • SlotPoolImpl#SlotPoolImpl

At this point all the basic modules we need have been created.

In the next article we will analyze the source-code execution flow around Slots step by step, covering resource allocation, deployment and the other stages. Stay tuned.
