[Source Code Analysis] Deep learning distributed training framework Horovod (18) --- kubeflow tf-operator

Published by 羅西的思考 on 2021-07-23

Original article: https://www.cnblogs.com/rossiXYZ/p/15032577.html

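0x00 Abstract

Horovod is a distributed training framework based on AllReduce. Thanks to its support for mainstream deep learning frameworks such as TensorFlow and PyTorch, and to its communication optimizations, Horovod is widely used for data-parallel training.

Over the previous dozen or so articles we analyzed Horovod piece by piece. The next mountain to climb is Horovod on K8S.

The goal of this article and the next few is to study how Horovod on K8S works, sort out the related concepts along the way, and try to uncover the design ideas behind it. The approach is to collect and study a number of articles found online and then analyze the code myself. Many thanks to the authors of those articles.

This article is the appetizer and a prerequisite for Horovod on K8S: it introduces the related concepts and the Kubeflow community's tf-operator.

Other articles in this series: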

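[Source Code Analysis] Deep learning distributed training framework Horovod (1) --- Basics

[Source Code Analysis] Deep learning distributed training framework horovod (2) --- From the user's perspective

[Source Code Analysis] Deep learning distributed training framework horovod (3) --- What happens behind horovodrun

[Source Code Analysis] Deep learning distributed training framework horovod (4) --- Network basics & Driver

[Source Code Analysis] Deep learning distributed training framework horovod (5) --- Fusion framework

[Source Code Analysis] Deep learning distributed training framework horovod (6) --- Background thread architecture

[Source Code Analysis] Deep learning distributed training framework horovod (7) --- DistributedOptimizer

[Source Code Analysis] Deep learning distributed training framework horovod (8) --- on spark

[Source Code Analysis] Deep learning distributed training framework horovod (9) --- Launching on spark

[Source Code Analysis] Deep learning distributed training framework horovod (10) --- run on spark

[Source Code Analysis] Deep learning distributed training framework horovod (11) --- on spark --- GLOO scheme

[Source Code Analysis] Deep learning distributed training framework horovod (12) --- Overall architecture of elastic training

[Source Code Analysis] Deep learning distributed training framework horovod (13) --- Elastic training: Driver

[Source Code Analysis] Deep learning distributed training framework horovod (14) --- Elastic training: node discovery & State

[Source Code Analysis] Deep learning distributed training framework horovod (15) --- Broadcast & notification

[Source Code Analysis] Deep learning distributed training framework horovod (16) --- Elastic training: Worker lifecycle

[Source Code Analysis] Deep learning distributed training framework horovod (17) --- Elastic training: fault tolerance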

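0x01 Background

1.1 Kubernetes

Kubernetes, abbreviated K8s (the "8" stands for the eight characters "ubernete"), is an open-source system for managing containerized applications across multiple hosts in a cloud platform. Its goal is to make deploying containerized applications simple and powerful, and it provides mechanisms for deploying, scheduling, updating, and maintaining applications.

Kubernetes is an increasingly popular choice for training deep neural networks, because containers give the flexibility to use different machine learning frameworks and the platform gives the agility to scale on demand.

When a model is complex or the dataset is large, the compute capacity of a single machine is often not enough. With a distributed training framework such as Alibaba's AiACC or the community's Horovod, changing only a few lines of code turns a single-machine training job into a distributed one.

On Kubernetes, the common options are the Kubeflow community's tf-operator, which supports the TensorFlow PS mode, and mpi-operator, which supports Horovod's MPI AllReduce mode.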

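1.2 Containers as the scheduling unit

Why use containers as the scheduling unit of a deep learning system? Because containers are fast to pull and start, and they isolate resources well. Viewed abstractly, the container image can be distributed, scheduled, and executed as part of the job. The price is some performance overhead on the GPU, the network, and so on.

For example, NVIDIA supports Docker for its GPUs: nvidia-docker can stand in for docker when executing the create and run operations.

[Figure: nvidia-docker architecture]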

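1.3 Kubeflow

Kubeflow is an open-source, Kubernetes-native platform for developing, orchestrating, deploying, and running scalable, portable machine learning workloads. Kubeflow can run on any Kubernetes cluster.

Kubeflow handles multi-machine jobs well. The name is simply Kubernetes + TensorFlow. It is a machine learning toolkit, a technology stack that runs on top of K8s. The stack contains many components that are only loosely coupled, so they can be used together or individually.

Kubeflow asks Kubernetes which machines it plans to allocate for the processes of a distributed job, and then tells every process the IP address and port of all the other processes. This guarantees that all processes in a job know about each other.

Why do all processes need to know about each other? Because TensorFlow's PS-based distribution requires it. With the native distributed training of TensorFlow 1.x, every process in a job runs the TensorFlow 1.x runtime. These processes communicate and coordinate with each other to form one "distributed runtime" that interprets and executes the computation graph describing the deep learning computation. At the start of distributed training, the TensorFlow runtime splits the graph into subgraphs, and each process executes one subgraph. If any process fails (for example because a higher-priority job preempts it), execution of the whole graph fails. So TensorFlow's native distributed training is not fault-tolerant. It is, however, fault-recoverable: the TensorFlow API provides checkpoints, so a failed job can be restarted and resume from the most recent checkpoint.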

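1.4 TensorFlow on Kubeflow

Kubeflow supports two different approaches to distributed training with TensorFlow.

The first is the native TensorFlow architecture, which relies on centralized parameter servers to coordinate the workers.
The second is a decentralized approach in which the workers communicate with each other directly through MPI AllReduce primitives, without parameter servers. NVIDIA's NCCL library implements most of the MPI collective primitives efficiently on GPUs, and Uber's Horovod makes multi-GPU and multi-node training with TensorFlow easy. Compared with parameter servers, the second approach makes better use of bandwidth and scales better.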

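1.5 Operator

Operator is a Kubernetes concept used mainly to package, deploy, and manage a user's workloads.

An Operator can be understood simply as CRD + Controller.

CRD (Custom Resource Definition) is the Kubernetes extension mechanism that lets users define their own resource types.
The Controller watches the resources created from the CRD and acts on them on the user's behalf.

As a Java analogy: the operator is a class, the CRD is the class's member variables, and the Controller is the class's member methods.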

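1.6 TF-Operator

Kubeflow offers a large set of components covering many aspects of machine learning, but model training is arguably its most important feature. Kubeflow provides training support for a variety of machine learning frameworks by defining an Operator for each. These Operators manage machine learning or deep learning jobs: how to manage the multiple nodes of a job, how to manage the lifecycle of Pods and jobs, how to handle failures, and so on.

TF-Operator is the extension API that the open-source community built on K8S to provide TensorFlow training. As the resource name TFJob suggests, the implementation resembles a Job. Its characteristics are:

Provides multi-machine training with TensorFlow's native PS-worker architecture
Recommends starting PS and workers together
Uses Services for service discovery
Is the earliest Operator in the community

Because TF-Operator is the earliest Operator in the community, it is worth looking at first.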

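0x02 Distributed TensorFlow

Because TF-Operator exists to support the TensorFlow PS mode, we first introduce distributed TensorFlow.

2.1 Parameter server architecture

In the parameter server architecture (PS architecture), the nodes of the cluster fall into two classes: parameter servers and workers. Parameter servers store the model parameters, while workers compute the gradients of the parameters. In each iteration, a worker fetches the parameters from the parameter servers and sends back the gradients it computed. The parameter servers aggregate the gradients returned by the workers, update the parameters, and broadcast the new parameters to the workers.

The PS-worker architecture has two ways of applying gradient updates, synchronous and asynchronous:

In synchronous training, every worker device trains on a different mini-batch of the same batch. Only after all devices have finished computing the gradients for that batch does the model apply one parameter update based on all of the gradients, and then the PS sends the updated model to every device.

In asynchronous training, no device waits for the gradient computation or parameter updates of other devices; each device computes independently and pushes its gradients to the central node (PS). Asynchronous training is generally much faster overall, but it has a serious problem: stale gradients. All devices start from the same parameters, but in the asynchronous case a device that finishes a training step may find that the model parameters have already been updated by other devices, so the gradient it just computed is out of date.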
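The difference between the two update modes, and where stale gradients come from, can be illustrated with a toy sketch. It is plain Python rather than TensorFlow; the quadratic loss, the learning rate, and the function names are assumptions made only for illustration.

```python
def grad(w, target):
    """Gradient of the toy loss 0.5 * (w - target) ** 2 with respect to w."""
    return w - target


def sync_step(w, targets, lr):
    """Synchronous update: every worker computes its gradient from the same w,
    then the PS averages all gradients and applies a single update."""
    grads = [grad(w, t) for t in targets]
    return w - lr * sum(grads) / len(grads)


def async_step(w, targets, lr):
    """Asynchronous update: every worker pulled the same w, but the PS applies
    the gradients one by one as they arrive.

    Returns the new w and, per worker, how far the PS parameter had already
    drifted from the value that worker used (non-zero drift = stale gradient).
    """
    pulled = w  # the value every worker read before computing
    drift = []
    for t in targets:
        g = grad(pulled, t)       # computed from the old parameter
        drift.append(w - pulled)  # how much w changed in the meantime
        w = w - lr * g            # the PS applies the gradient immediately
    return w, drift
```

With two workers whose data pull towards 1.0 and 3.0, the second worker's gradient in `async_step` is applied to a parameter that has already moved, which is exactly the staleness described above.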

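2.2 TensorFlow PS-Worker

2.2.1 Architecture

This is only a rough introduction, mainly for comparison with TF-Operator.

TF divides jobs mainly into Parameter Server and Worker (different TF versions add their own roles at different stages, such as master or chief).

Parameter Job: runs the model-related work, including storing, distributing, aggregating, and updating the model parameters. It acts as the server side of distributed training and waits for the workers (supervisors) to connect.
Worker Job: called supervisors in the TensorFlow code comments. It runs the training-related work, including the forward computation and the gradient computation. If there are too many parameters for one machine to handle, multiple Tasks are needed (dynamically, a Task is a process on a host; statically, a Task is the code we write).
Chief supervisor: one of the workers must be chosen as the chief. It is the first worker to start, and its job is to merge the parameters learned by the workers and save them.
Cluster: the collection of Jobs, that is, the cluster system as a whole.

Every role has a unique network identity, meaning the roles run on machines with different IPs (or on the same host with different ports).

In practice, the code that builds the network must be identical for all roles. The flow of a PS-worker distributed model is roughly:

pull: each worker pulls the latest model parameters from the PS according to the topology of the dataflow graph.
feed: each worker is fed a different batch of data.
compute: each worker computes gradients from the same model parameters and its own batch, so the workers obtain different gradient values.
push: each worker uploads the gradients it computed to the PS.
update: the PS collects the gradients from all workers, averages them, and updates the model parameters.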
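The five steps above can be sketched as one synchronous training round. This is a minimal single-process illustration, not TensorFlow code: the `ParameterServer` class, the one-parameter model y = w * x, and all names are hypothetical.

```python
class ParameterServer:
    """Holds one scalar parameter and the gradients pushed in this round."""

    def __init__(self, w):
        self.w = w
        self.pending = []

    def pull(self):
        return self.w

    def push(self, g):
        self.pending.append(g)

    def update(self, lr):
        # Average the gradients of all workers, then apply one update.
        avg = sum(self.pending) / len(self.pending)
        self.w -= lr * avg
        self.pending.clear()
        return self.w


def worker_gradient(w, batch):
    """Gradient of the mean squared error of y = w * x on one batch."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)


def train_round(ps, batches, lr):
    for batch in batches:                   # feed: one batch per worker
        w = ps.pull()                       # pull
        ps.push(worker_gradient(w, batch))  # compute + push
    return ps.update(lr)                    # update
```

Every worker in a round sees the same `w`, because `update` only runs after all gradients have been pushed.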

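2.2.2 Code

The logic is as follows:

A Task needs to know which hosts are in the cluster and which ports they listen on. tf.train.ClusterSpec() describes exactly this.
The Cluster here has two Jobs (worker and ps), and the worker Job has three Tasks (that is, three Tasks execute TensorFlow ops).
The ClusterSpec is passed to tf.train.Server() together with the job_name and task_index of the current Task.
Because the same code runs on different hosts, job_name and task_index are passed in to tell the hosts apart, while ps_hosts and worker_hosts, which describe the cluster, are the same on every host.
A tf.train.Server contains the set of local devices (GPUs, CPUs), connections to the ip:port of the other tasks (stored in the cluster), and a session target for running distributed operations. Most importantly, it creates a server that listens on its port: when a request arrives, it executes it locally (it starts the session target and runs the computation on the local devices) and returns the result to the caller.
To keep the ps server listening, we call server.join(), which blocks the program at that point. Why join right after creating the ps server? Because the code that follows assigns the parameters to the ps server for safekeeping, so the ps server only needs to sit and listen.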

import tensorflow as tf  # TensorFlow 1.x API

# To build a cluster with two ps jobs on hosts ps0 and ps1, and 3 worker
# jobs on hosts worker0, worker1 and worker2.
cluster_spec = {
    "ps": ["ps0:2222", "ps1:2222"],
    "worker": ["worker0:2222", "worker1:2222", "worker2:2222"]}

# Create a cluster from the parameter server and worker hosts.
cluster = tf.train.ClusterSpec(cluster_spec)

# Create and start a server for the local task.
# FLAGS holds the parsed command-line flags (job_name, task_index).
server = tf.train.Server(cluster,
                         job_name=FLAGS.job_name,
                         task_index=FLAGS.task_index)

if FLAGS.job_name == "ps":
    # Parameter servers only serve requests, so block here.
    server.join()

A slightly more complete version:

def main(_):
  ps_hosts = FLAGS.ps_hosts.split(",")
  worker_hosts = FLAGS.worker_hosts.split(",")

  # Create a cluster from the parameter server and worker hosts.
  cluster = tf.train.ClusterSpec({"ps": ps_hosts, "worker": worker_hosts})

  # Create and start a server for the local task.
  server = tf.train.Server(cluster,
                           job_name=FLAGS.job_name,
                           task_index=FLAGS.task_index)

  if FLAGS.job_name == "ps":
    server.join()
  elif FLAGS.job_name == "worker":
    # The chief worker is the one whose task_index is 0.
    is_chief = (FLAGS.task_index == 0)

    # Assigns ops to the local worker by default.
    with tf.device(tf.train.replica_device_setter(
        worker_device="/job:worker/task:%d" % FLAGS.task_index,
        cluster=cluster)):
      # Compute: build the model and the training ops here.
      pass

Run it as follows. We write a single program and run it on different hosts with different arguments:

# On ps0.example.com:
$ python trainer.py \
     --ps_hosts=ps0.example.com:2222,ps1.example.com:2222 \
     --worker_hosts=worker0.example.com:2222,worker1.example.com:2222 \
     --job_name=ps --task_index=0
# On ps1.example.com:
$ python trainer.py \
     --ps_hosts=ps0.example.com:2222,ps1.example.com:2222 \
     --worker_hosts=worker0.example.com:2222,worker1.example.com:2222 \
     --job_name=ps --task_index=1
# On worker0.example.com:
$ python trainer.py \
     --ps_hosts=ps0.example.com:2222,ps1.example.com:2222 \
     --worker_hosts=worker0.example.com:2222,worker1.example.com:2222 \
     --job_name=worker --task_index=0
# On worker1.example.com:
$ python trainer.py \
     --ps_hosts=ps0.example.com:2222,ps1.example.com:2222 \
     --worker_hosts=worker0.example.com:2222,worker1.example.com:2222 \
     --job_name=worker --task_index=1
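The four commands above differ only in `--job_name` and `--task_index`, so they can be derived mechanically from one cluster description. The sketch below does this with a hypothetical helper `launch_commands`; it is essentially the bookkeeping that an operator takes over on Kubernetes.

```python
def launch_commands(ps_hosts, worker_hosts, script="trainer.py"):
    """Return {host: command line} for every task in the cluster.

    The host lists are identical for all tasks; only job_name and
    task_index change from one host to the next.
    """
    common = "--ps_hosts={} --worker_hosts={}".format(
        ",".join(ps_hosts), ",".join(worker_hosts))
    commands = {}
    for job_name, hosts in (("ps", ps_hosts), ("worker", worker_hosts)):
        for task_index, host in enumerate(hosts):
            commands[host] = "python {} {} --job_name={} --task_index={}".format(
                script, common, job_name, task_index)
    return commands
```

Doing this by hand for every job, and redoing it whenever a host changes, is the manual work that the rest of the article automates.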

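0x03 TF-Operator

3.1 TF-Operator design

Now that we know roughly how distributed TF works, let's look at the design of TF-Operator.

The following is translated from "Design Doc TFJob K8s CRD".

The goal is to make it easy to run TensorFlow training (and distributed training in particular) on Kubernetes (K8s). I propose doing this by creating a K8s custom resource (CRD) and an associated controller. The CRD takes care of managing the K8s resources needed to run a training job.

Kubernetes makes it easier to manage processes by providing a process-centric (as opposed to VM-centric) view of the world. Kubernetes also provides the essential building blocks for complex distributed applications. For example, K8s has built-in support for DNS, health checking, log collection, metrics collection, storage, and so on.

In K8s, controllers are responsible for ensuring that a set of Pods is running. A Pod is the basic building block in K8s; it describes one or more processes that should be colocated (same IP). K8s ships with a number of built-in controllers. For example, one can ensure that N Pods are running with a particular specification, and the Job controller can be used to run a binary.

The built-in controllers are not sufficient for running distributed TensorFlow jobs. TensorFlow is a stateful application: each parameter server and worker needs to be uniquely addressable to support all the different modes of distributed training. K8s has StatefulSets, but StatefulSets are intended for stateful services that run forever (for example a sharded in-memory cache service such as Redis), not for jobs that run to completion.

As a result, running a distributed TF job on K8s today means cobbling together a solution out of the built-in primitives. Typically this means managing multiple resources by hand. For example, a user could create one StatefulSet for the parameter servers, one StatefulSet for the workers, and a Job for the master.

To address the limitations of the built-in resources, K8s supports custom resources (CRDs) and controllers. With a CRD it is easy to create a controller with the desired semantics for a particular workload while hiding the implementation from users. The K8s community has quickly adopted this pattern and contributed numerous CRDs for various workloads.

The opinion of the K8s team that developed CRDs and various controllers is that most controllers use a non-distributed, multi-threaded design and that scalability is not a problem.

The TFJob CRD defines a TFJob resource for K8s.

A TFJob resource is a collection of TfReplicas. Each TfReplica corresponds to a set of TensorFlow processes that perform one role in the job.

I made an explicit decision not to try to hide or replace K8s abstractions. For example, each TfReplica contains a standard K8s PodTemplate that specifies the processes (including TF) to run in each replica. I did this because K8s already provides a widely adopted and understood API, so introducing new concepts in place of K8s concepts would only be confusing. Furthermore, exposing the PodTemplate lets TFJob users easily take advantage of K8s features. For example, TFJob users can use K8s to attach volumes to their TF processes, which makes it easy to use TF with any storage system supported by K8s (PDs, NFS, and so on).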

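3.2 Architecture

[Figure: TF-Operator architecture]

3.2.1 What is a Pod

Starting from the figure, we first look at the Pod concept in the middle.

A Pod is the smallest unit that K8s schedules. It can be thought of as a group of containers and also as a logical host: entering a Pod feels like entering a Linux host where the usual commands are available, this "host" in turn contains several containers, and entering one of those again feels like entering another Linux host. By default, each container's file system is completely isolated from the other containers. Every Pod has its own IP address, and the containers inside a Pod share the same IP and port space.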

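3.2.2 Why Services are needed

First, every Pod is assigned its own IP address and exposes its own endpoint (Pod IP + ContainerPort) for clients to access, but this access works only inside the cluster; nothing outside the cluster can reach these internal IP addresses.

Second, Pods are mortal, and a Pod's IP is likely to change when it restarts. When a controller replaces a failed Pod with a new one, the new Pod is assigned a new IP address. This raises a question: if a group of Pods provides a service (say HTTP) and their IPs are likely to change, how do clients find and reach the service?

The Kubernetes answer is the Service.

A Service is only an abstraction. It logically represents a group of Pods, and labels select which Pods belong to it. In other words, a Service puts a single entry point in front of a group of Pods with the same function, and can be thought of simply as load balancing for that service.

A Service has its own IP, and this IP does not change. Clients only need to access the Service IP; Kubernetes creates and maintains the mapping between the Service and its Pods. However the backend Pods change, clients are unaffected, because the Service stays the same. This is why Pods are usually accessed through a Service. Kubernetes assigns the Service an internal virtual IP, and the cluster DNS (CoreDNS) registers a name for it, so services inside the cluster can reach the Pods through either this IP or the Service name.

Here is an example Service from the source code.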

apiVersion: v1
kind: Service
metadata:
  annotations:
    prometheus.io/path: /metrics
    prometheus.io/scrape: "true"
    prometheus.io/port: "8443"
  labels:
    app: tf-job-operator
  name: tf-job-operator
spec:
  ports:
  - name: monitoring-port
    port: 8443
    targetPort: 8443
  selector:
    name: tf-job-operator
  type: ClusterIP

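This creates a Service named tf-job-operator with a Cluster IP assigned to it. The Service also keeps watching the Pods matched by its selector and writes their information into an Endpoints object that is also named tf-job-operator. That object plays the role of the Pod set mentioned above.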
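How a selector turns into a set of endpoints can be sketched in a few lines. This is only a model of the matching rule (a Pod matches when its labels contain every key/value pair of the selector); the function name and the dictionary layout of the Pods are assumptions, not the Kubernetes API.

```python
def select_endpoints(selector, pods, target_port):
    """Return "ip:port" for every pod whose labels contain all selector pairs."""
    return [
        "{}:{}".format(pod["ip"], target_port)
        for pod in pods
        if all(pod["labels"].get(k) == v for k, v in selector.items())
    ]


# Hypothetical pods; only the first and the third carry the selected label.
pods = [
    {"ip": "10.0.0.1", "labels": {"name": "tf-job-operator"}},
    {"ip": "10.0.0.2", "labels": {"name": "other"}},
    {"ip": "10.0.0.3", "labels": {"name": "tf-job-operator", "tier": "x"}},
]
endpoints = select_endpoints({"name": "tf-job-operator"}, pods, 8443)
```

When a Pod is replaced and comes back with a new IP, rerunning the selection yields the new endpoint while the Service name and IP stay the same.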

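3.2.3 What is a controller

The resource types built into Kubernetes cannot cover our needs, so they have to be extended through the Custom Resource Definition mechanism.

In K8S everything is a resource: Deployment, Service, and so on.

With CRDs (CustomResourceDefinitions) we can add new resources. For example, I might define my own kind of Deployment resource that offers a different rollout strategy.

Resources can be created, read, updated, and deleted (CRUD) through the K8s RESTful API, and the same holds for resources created from a CRD.

A CRD only defines a resource. We still need to implement a controller, similar to the deployment controller, that watches the CRUD events of that resource and reacts to them, for example by creating Pods.

TF-Operator is essentially the implementation of such a controller, and that controller is what the rest of this article explains.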

3.3 Spec

We start with a Job Spec so that it can be matched against the code later. The sample below has one master, two workers, and one PS.

apiVersion: "kubeflow.org/v1alpha1" # API version; the value must be listed by kubectl api-versions
kind: "TFJob"  # kind of resource to create
metadata:  # metadata of the resource
  name: "example-job"
spec: # specification of the resource
  replicaSpecs: # one entry per replica type
    - replicas: 1
      tfReplicaType: MASTER
      template: # Pod template
        spec:
          containers:
            - image: gcr.io/tf-on-k8s-dogfood/tf_sample:dc944ff  # container image
              name: tensorflow
              args:
                - --log_dir=gs://my-job/log-dir
          restartPolicy: OnFailure
    - replicas: 2
      tfReplicaType: WORKER
      template:
        spec:
          containers:
            - image: gcr.io/tf-on-k8s-dogfood/tf_sample:dc944ff
              name: tensorflow
              args:
                - --log_dir=gs://my-job/log-dir
          restartPolicy: OnFailure
    - replicas: 1
      tfReplicaType: PS

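Now we enter the world of code.

3.4 TFJob

First, the definition of TFJob. It maps roughly onto the Spec above; since this article only aims at an overview, these definitions are all we analyze. Note that the sample Spec uses a replicaSpecs list, whereas the Go types below serialize the replicas as a tfReplicaSpecs map keyed by replica type.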

// TFJob represents a TFJob resource.
type TFJob struct {
	// Standard Kubernetes type metadata.
	metav1.TypeMeta `json:",inline"`

	// Standard Kubernetes object's metadata.
	// +optional
	metav1.ObjectMeta `json:"metadata,omitempty"`

	// Specification of the desired state of the TFJob.
	// +optional
	Spec TFJobSpec `json:"spec,omitempty"`

	// Most recently observed status of the TFJob.
	// Populated by the system.
	// Read-only.
	// +optional
	Status commonv1.JobStatus `json:"status,omitempty"`
}

// TFJobSpec is a desired state description of the TFJob.
type TFJobSpec struct {
	// RunPolicy encapsulates various runtime policies of the distributed training
	// job, for example how to clean up resources and how long the job can stay
	// active.
	RunPolicy commonv1.RunPolicy `json:"runPolicy,inline"`

	// SuccessPolicy defines the policy to mark the TFJob as succeeded.
	// Default to "", using the default rules.
	// +optional
	SuccessPolicy *SuccessPolicy `json:"successPolicy,omitempty"`

	// A map of TFReplicaType (type) to ReplicaSpec (value). Specifies the TF cluster configuration.
	// For example,
	//   {
	//     "PS": ReplicaSpec,
	//     "Worker": ReplicaSpec,
	//   }
	TFReplicaSpecs map[commonv1.ReplicaType]*commonv1.ReplicaSpec `json:"tfReplicaSpecs"`

	// A switch to enable dynamic worker
	EnableDynamicWorker bool `json:"enableDynamicWorker,omitempty"`
}

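3.5 Roles

Next, how TF-Operator implements the TF roles.

3.5.1 Definitions

First the role definitions. The roles correspond essentially to the TensorFlow roles, including some that are kept only for compatibility.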

// setTypeNamesToCamelCase sets the name of all replica types from any case to correct case.
func setTypeNamesToCamelCase(tfJob *TFJob) {
	setTypeNameToCamelCase(tfJob, TFReplicaTypePS)
	setTypeNameToCamelCase(tfJob, TFReplicaTypeWorker)
	setTypeNameToCamelCase(tfJob, TFReplicaTypeChief)
	setTypeNameToCamelCase(tfJob, TFReplicaTypeMaster)
	setTypeNameToCamelCase(tfJob, TFReplicaTypeEval)
}


const (
	// TFReplicaTypePS is the type for parameter servers of distributed TensorFlow.
	TFReplicaTypePS commonv1.ReplicaType = "PS"

	// TFReplicaTypeWorker is the type for workers of distributed TensorFlow.
	// This is also used for non-distributed TensorFlow.
	TFReplicaTypeWorker commonv1.ReplicaType = "Worker"

	// TFReplicaTypeChief is the type for chief worker of distributed TensorFlow.
	// If there is "chief" replica type, it's the "chief worker".
	// Else, worker:0 is the chief worker.
	TFReplicaTypeChief commonv1.ReplicaType = "Chief"

	// TFReplicaTypeMaster is the type for master worker of distributed TensorFlow.
	// This is similar to chief, and kept just for backwards compatibility.
	TFReplicaTypeMaster commonv1.ReplicaType = "Master"

	// TFReplicaTypeEval is the type for evaluation replica in TensorFlow.
	TFReplicaTypeEval commonv1.ReplicaType = "Evaluator"
)
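The effect of setTypeNamesToCamelCase can be sketched as a case-insensitive rename of the map keys. The Python below is an illustration that assumes this is all the helper does; `normalize_replica_types` is a hypothetical name, and the canonical spellings are taken from the constants above.

```python
CANONICAL_TYPES = ("PS", "Worker", "Chief", "Master", "Evaluator")


def normalize_replica_types(replica_specs):
    """Rename keys such as "ps" or "WORKER" to their canonical spelling.

    Keys that match no known replica type are left unchanged.
    """
    canonical = {name.lower(): name for name in CANONICAL_TYPES}
    return {canonical.get(key.lower(), key): spec
            for key, spec in replica_specs.items()}
```

After normalization, the rest of the controller can compare replica types against the constants with plain equality.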

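3.5.2 Creating roles

NewTFJobV2 creates the different roles according to the configuration it is given.

As the function shows, a job is built essentially field by field from the spec: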

apiVersion: "kubeflow.org/v1alpha1"
kind: "TFJob"
metadata:
  name: "example-job"
spec:
  replicaSpecs:

The function is defined as follows.

func NewTFJobV2(worker, ps, master, cheif, evaluator int) *tfv1.TFJob {
	tfJob := &tfv1.TFJob{
		TypeMeta: metav1.TypeMeta{
			Kind: tfv1.Kind,
		},
		ObjectMeta: metav1.ObjectMeta{
			Name:      TestTFJobName,
			Namespace: metav1.NamespaceDefault,
		},
		Spec: tfv1.TFJobSpec{
			TFReplicaSpecs: make(map[commonv1.ReplicaType]*commonv1.ReplicaSpec),
		},
	}
	tfv1.SetObjectDefaults_TFJob(tfJob)

	if worker > 0 {
		worker := int32(worker)
		workerReplicaSpec := &commonv1.ReplicaSpec{
			Replicas: &worker,
			Template: NewTFReplicaSpecTemplate(),
		}
		tfJob.Spec.TFReplicaSpecs[tfv1.TFReplicaTypeWorker] = workerReplicaSpec
	}

	if ps > 0 {
		ps := int32(ps)
		psReplicaSpec := &commonv1.ReplicaSpec{
			Replicas: &ps,
			Template: NewTFReplicaSpecTemplate(),
		}
		tfJob.Spec.TFReplicaSpecs[tfv1.TFReplicaTypePS] = psReplicaSpec
	}

	if master > 0 {
		master := int32(master)
		masterReplicaSpec := &commonv1.ReplicaSpec{
			Replicas: &master,
			Template: NewTFReplicaSpecTemplate(),
		}
		tfJob.Spec.TFReplicaSpecs[tfv1.TFReplicaTypeMaster] = masterReplicaSpec
	}

	if cheif > 0 {
		cheif := int32(cheif)
		cheifReplicaSpec := &commonv1.ReplicaSpec{
			Replicas: &cheif,
			Template: NewTFReplicaSpecTemplate(),
		}
		tfJob.Spec.TFReplicaSpecs[tfv1.TFReplicaTypeChief] = cheifReplicaSpec
	}

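	if evaluator > 0 {
		evaluator := int32(evaluator)
		evaluatorReplicaSpec := &commonv1.ReplicaSpec{
			Replicas: &evaluator,
			Template: NewTFReplicaSpecTemplate(),
		}
		// The quoted source stores this under TFReplicaTypeChief, which looks
		// like a copy-paste slip; TFReplicaTypeEval is assumed to be intended.
		tfJob.Spec.TFReplicaSpecs[tfv1.TFReplicaTypeEval] = evaluatorReplicaSpec
	}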
	return tfJob
}

3.5.3 Identifying the master

The following method decides whether a replica plays the master role.

func (tc *TFController) IsMasterRole(replicas map[commonv1.ReplicaType]*commonv1.ReplicaSpec, rtype commonv1.ReplicaType, index int) bool {
	if ContainChieforMasterSpec(replicas) {
		return rtype == tfv1.TFReplicaTypeChief || rtype == tfv1.TFReplicaTypeMaster
	}
	// else check if it is worker with index 0
	return rtype == tfv1.TFReplicaTypeWorker && index == 0
}
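The same rule, written out in Python for readability. The sketch assumes that ContainChieforMasterSpec simply checks whether a Chief or Master entry exists in the replica map; the function names are hypothetical.

```python
def contains_chief_or_master(replicas):
    """Assumed behaviour of ContainChieforMasterSpec: is either role configured?"""
    return "Chief" in replicas or "Master" in replicas


def is_master_role(replicas, rtype, index):
    """Chief/Master replicas are the master if configured; else it is worker 0."""
    if contains_chief_or_master(replicas):
        return rtype in ("Chief", "Master")
    return rtype == "Worker" and index == 0
```

So worker 0 is only the master in jobs that declare neither a Chief nor a Master replica type.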

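0x04 Controller

Now to the main topic: how the Controller is implemented.

4.1 Key concepts of K8S CRDs

First, some key concepts around K8S CRDs.

informer: watches the apiserver for changes to a particular resource, stores them in a thread-safe local cache, and finally calls back the event handler that we implement.
local cache: the informer continuously syncs data from the apiserver (that is, from etcd) into memory. This greatly reduces the query load on the apiserver, but the data is not perfectly fresh: the local copy lags slightly behind the remote one, although it eventually becomes consistent with etcd. Whether to read from the local cache or fetch from the apiserver in real time therefore has to be decided case by case.
Lister: provides read access (List/Get) to the local cache.
controller: a logical concept, simply the implementation that manages a certain kind of resource, which we have to develop ourselves. A controller mainly does the following:
1. Implements event handlers that react to CRUD operations on the resource.
2. Inside the event handlers, can use the workqueue library to deduplicate consecutive events for the same resource object and to retry after event processing fails. Using it is generally recommended.
Workqueue: a separate library. It is optional but is normally used, for the reasons above. When implementing an event handler, we put the identifier of the changed resource into the workqueue for the processor below to consume.
Clientset: the default clientset can only CRUD the resource types that K8s itself provides, such as deployments and daemonsets. The code generator produces a separate clientset for our custom resource (CRD), so that we can CRUD custom resources with typed code. In short, use the clientset that ships with K8s for built-in resources and the generated clientset for CRDs.
Processor: goroutines that we implement to consume the events in the workqueue; the workqueue deduplicates by resource identifier.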
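The deduplication that the workqueue provides can be sketched with a queue plus a set of pending keys. This is a simplified model, not the client-go implementation (which also tracks items that are currently being processed); `DedupQueue` and `on_event` are hypothetical names.

```python
from collections import deque


class DedupQueue:
    """FIFO queue that holds each key at most once while it is waiting."""

    def __init__(self):
        self._queue = deque()
        self._pending = set()

    def add(self, key):
        if key not in self._pending:  # drop repeated events for the same object
            self._pending.add(key)
            self._queue.append(key)

    def get(self):
        key = self._queue.popleft()
        self._pending.discard(key)
        return key

    def __len__(self):
        return len(self._queue)


def on_event(queue, namespace, name):
    """Event handler: enqueue only the "namespace/name" key, not the object."""
    queue.add("{}/{}".format(namespace, name))
```

Because only the key is queued, the processor always reads the latest state of the object from the local cache when it finally handles the key.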

4.2 Definition

TFController is defined as follows. Its member variables map onto several of the components described above.

// TFController is the type for TFJob Controller, which manages
// the lifecycle of TFJobs.
type TFController struct {
	common.JobController

	// tfJobClientSet is a clientset for CRD TFJob.
	tfJobClientSet tfjobclientset.Interface

	// To allow injection of sync functions for testing.
	syncHandler func(string) (bool, error)

	// tfJobInformer is a temporary field for unstructured informer support.
	tfJobInformer cache.SharedIndexInformer

	// Listers for TFJob, Pod and Service
	// tfJobLister can list/get tfjobs from the shared informer's store.
	tfJobLister tfjoblisters.TFJobLister

	// tfJobInformerSynced returns true if the tfjob store has been synced at least once.
	tfJobInformerSynced cache.InformerSynced
}

4.3 Entry point

The entry point of the TF-Operator logic is runWorker, which simply calls processNextWorkItem in a loop.

func (tc *TFController) runWorker() {
	for tc.processNextWorkItem() {
	}
}

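processNextWorkItem reads a single work item from the WorkQueue and tries to process it by calling syncHandler. The listing is abridged: logging and most error handling are omitted.

// processNextWorkItem will read a single work item off the workqueue and
// attempt to process it, by calling the syncHandler.
func (tc *TFController) processNextWorkItem() bool {
	obj, quit := tc.WorkQueue.Get()
	if quit {
		return false
	}
	defer tc.WorkQueue.Done(obj)
	key, ok := obj.(string)
	if !ok {
		tc.WorkQueue.Forget(obj)
		return true
	}
	if _, err := tc.getTFJobFromKey(key); err != nil {
		return true
	}

	// Sync TFJob to match the actual state to this desired state.
	forget, err := tc.syncHandler(key)
	if err == nil {
		if forget {
			tc.WorkQueue.Forget(key)
		}
		return true
	}
	tc.WorkQueue.AddRateLimited(key)
	return true
}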
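The worker loop and its retry behaviour can be modelled in a few lines of Python. This is a sketch under simplifying assumptions: a failed key is requeued immediately instead of with a rate-limited delay, and the `max_retries` cap is added only so that the example terminates; neither the cap nor the names come from TF-Operator.

```python
from collections import deque


def run_worker(queue, sync_handler, max_retries=3):
    """Drain the queue; requeue a key whose sync fails, up to max_retries."""
    retries = {}
    done, dropped = [], []
    while queue:
        key = queue.popleft()
        try:
            sync_handler(key)
            retries.pop(key, None)  # success: forget the retry history
            done.append(key)
        except Exception:
            retries[key] = retries.get(key, 0) + 1
            if retries[key] < max_retries:
                queue.append(key)   # failure: put the key back for another try
            else:
                dropped.append(key)
    return done, dropped
```

A key that fails once is simply processed again later, which is why the sync function has to be safe to run repeatedly for the same job.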

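4.4 syncHandler

syncHandler syncs a Job by its key: it takes the key that was pulled from the WorkQueue and processes the corresponding job locally.

Earlier the code sets tc.syncHandler = tc.syncTFJob, so we actually end up in syncTFJob.

syncTFJob syncs the tfjob with the given key once its expectations have been fulfilled, meaning that it no longer expects to see any of its pods/services being created or deleted:
If EnableDynamicWorker is set, the tfjob is synced every time, regardless of expectations.
ReconcileJobs is then called to process the job.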

// syncTFJob syncs the tfjob with the given key if it has had its expectations fulfilled, meaning
// it did not expect to see any more of its pods/services created or deleted.
// This function is not meant to be invoked concurrently with the same key.
func (tc *TFController) syncTFJob(key string) (bool, error) {

	namespace, name, err := cache.SplitMetaNamespaceKey(key)
	if err != nil {
		return false, err
	}
	// (handling of a TFJob that no longer exists is omitted here)
	sharedTFJob, err := tc.getTFJobFromName(namespace, name)
	if err != nil {
		return false, err
	}
	tfjob := sharedTFJob.DeepCopy()

	// Sync tfjob every time if EnableDynamicWorker is true
	tfjobNeedsSync := tfjob.Spec.EnableDynamicWorker || tc.satisfiedExpectations(tfjob)

	// Set default for the new tfjob.
	scheme.Scheme.Default(tfjob)

	if tfjobNeedsSync && tfjob.DeletionTimestamp == nil {
		// Call ReconcileJobs to bring the TFJob's pods and services to the desired state.
		if err := tc.ReconcileJobs(tfjob, tfjob.Spec.TFReplicaSpecs, tfjob.Status, &tfjob.Spec.RunPolicy); err != nil {
			return false, err
		}
	}

	return true, nil
}

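4.5 ReconcileJobs

ReconcileJobs checks and updates the replicas for each given ReplicaSpec and acts accordingly. This can be regarded as the main control logic.

If the job has finished, it cleans up by deleting all pods and services.
If the TFJob has exceeded its backoff limit or passed its active deadline, it deletes all pods and services and then sets the status to failed.
Otherwise it iterates over the TFReplicaSpecs section of the configuration:
- It starts the Pods for each type of node.
- After starting the Pods, it also starts a Service for each of them.

// ReconcileJobs checks and updates replicas for each given ReplicaSpec.
// It will requeue the job in case of an error while creating/deleting pods/services.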
func (jc *JobController) ReconcileJobs(
	job interface{},
	replicas map[apiv1.ReplicaType]*apiv1.ReplicaSpec,
	jobStatus apiv1.JobStatus,
	runPolicy *apiv1.RunPolicy) error {

	metaObject, ok := job.(metav1.Object)
	jobName := metaObject.GetName()
	runtimeObject, ok := job.(runtime.Object)
	jobKey, err := KeyFunc(job)
	pods, err := jc.Controller.GetPodsForJob(job)
	services, err := jc.Controller.GetServicesForJob(job)
	oldStatus := jobStatus.DeepCopy()
  
	// If the TFJob has terminated, delete all pods and services.
	if commonutil.IsSucceeded(jobStatus) || commonutil.IsFailed(jobStatus) {
		// If the Job is succeed or failed, delete all pods and services.
		jc.DeletePodsAndServices(runPolicy, job, pods)    
		jc.CleanupJob(runPolicy, jobStatus, job)
		return nil
	}

	// retrieve the previous number of retry
	previousRetry := jc.WorkQueue.NumRequeues(jobKey)
	activePods := k8sutil.FilterActivePods(pods)
	jc.recordAbnormalPods(activePods, runtimeObject)

	active := int32(len(activePods))
	failed := k8sutil.FilterPodCount(pods, v1.PodFailed)
	totalReplicas := k8sutil.GetTotalReplicas(replicas)
	prevReplicasFailedNum := k8sutil.GetTotalFailedReplicas(jobStatus.ReplicaStatuses)

	// jobExceedsLimit is computed from the backoff limit and the active
	// deadline; that computation is omitted from this excerpt.
	if jobExceedsLimit {
		// If the Job exceeds backoff limit or is past active deadline
		// delete all pods and services, then set the status to failed
		if err := jc.DeletePodsAndServices(runPolicy, job, pods); err != nil {
			return err
		}
		if err := jc.CleanupJob(runPolicy, jobStatus, job); err != nil {
			return err
		}
		jc.Recorder.Event(runtimeObject, v1.EventTypeNormal, commonutil.JobFailedReason, failureMessage)
		commonutil.UpdateJobConditions(&jobStatus, apiv1.JobFailed, commonutil.JobFailedReason, failureMessage)
		return jc.Controller.UpdateJobStatusInApiServer(job, &jobStatus)
	} else {
		// General cases which need to reconcile
		if jc.Config.EnableGangScheduling {
			minAvailableReplicas := totalReplicas
			_, err := jc.SyncPodGroup(metaObject, minAvailableReplicas)
		}

		// Iterate over the replica specs: reconcile the Pods of each replica
		// type, then reconcile the Services for those Pods.
		// Diff current active pods/services with replicas.
		for rtype, spec := range replicas {
			err := jc.Controller.ReconcilePods(metaObject, &jobStatus, pods, rtype, spec, replicas)
			err = jc.Controller.ReconcileServices(metaObject, services, rtype, spec)
		}
	}

	err = jc.Controller.UpdateJobStatus(job, replicas, &jobStatus)

  // No need to update the job status if the status hasn't changed since last time.
	if !reflect.DeepEqual(*oldStatus, jobStatus) {
		return jc.Controller.UpdateJobStatusInApiServer(job, &jobStatus)
	}
	return nil
}
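Stripped of the Kubernetes details, the control flow of ReconcileJobs is a three-way decision. The sketch below returns the list of actions instead of performing them; the status strings, action names, and the function itself are assumptions made for illustration.

```python
def reconcile_job(status, exceeds_limit, replica_types):
    """Return the actions ReconcileJobs would take, in order."""
    if status in ("Succeeded", "Failed"):
        # Finished job: only clean up.
        return ["delete-pods-and-services", "cleanup-job"]
    if exceeds_limit:
        # Backoff limit or active deadline exceeded: clean up and fail the job.
        return ["delete-pods-and-services", "cleanup-job", "mark-failed"]
    actions = []
    for rtype in replica_types:
        actions.append("reconcile-pods:" + rtype)
        actions.append("reconcile-services:" + rtype)
    actions.append("update-status")
    return actions
```

For a running job with PS and Worker replicas, this yields one pod reconcile and one service reconcile per replica type, followed by a status update.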

The logic so far looks like this:

             +------------+
             | runWorker  |
             +-----+------+
                   |
                   |
                   v
          +--------+------------+
          | processNextWorkItem |
          +--------+------------+
                   |
                   |
                   v
              +----+------+
              | syncTFJob |
              +----+------+
                   |
                   |
                   v
           +-------+--------+
           | ReconcileJobs  |
           +-------+--------+
                   |
                   |
                   v
          +--------+---------+
          |                  |
          |                  |
          v                  v
+---------+---------+  +-----+--------+
|                   |  |              |
| ReconcileServices |  |ReconcilePods |
|                   |  |              |
+-------------------+  +--------------+

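Next we look at Pod handling and Service handling in turn.

4.6 Handling Pods

4.6.1 ReconcilePods

ReconcilePods checks and updates the pods for each given TFReplicaSpec.

Specifically, it:

Initializes the replica statuses;
Determines the master: if a chief or master replica type is configured, that pod is the master; otherwise the first worker pod (index 0) is chosen;
Calls createNewPod to create pods that are missing;
Deletes pods whose index is out of range.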

// reconcilePods checks and updates pods for each given TFReplicaSpec.
// It will requeue the tfjob in case of an error while creating/deleting pods.
func (tc *TFController) ReconcilePods(
	job interface{},
	jobStatus *commonv1.JobStatus,
	pods []*v1.Pod,
	rtype commonv1.ReplicaType,
	spec *commonv1.ReplicaSpec,
	replicas map[commonv1.ReplicaType]*commonv1.ReplicaSpec,
) error {

	tfJob, ok := job.(*tfv1.TFJob)

	// Convert ReplicaType to lower string.
	rt := strings.ToLower(string(rtype))
  
	// Get all pods of replica type rt.
	pods, err := tc.FilterPodsForReplicaType(pods, rt)

	numReplicas := int(*spec.Replicas)
	masterRole := false

	initializeReplicaStatuses(jobStatus, rtype)

	// GetPodSlices will return enough information here to make decision to add/remove/update resources.
	// For example, let's assume we have pods with replica-index 0, 1, 2
	// If replica is 4, return a slice with size 4. [[0],[1],[2],[]], a pod with replica-index 3 will be created.
	// If replica is 1, return a slice with size 3. [[0],[1],[2]], pod with replica-index 1 and 2 are out of range and will be deleted.
	podSlices := tc.GetPodSlices(pods, numReplicas, logger)
	for index, podSlice := range podSlices {
		if len(podSlice) > 1 {
			logger.Warningf("We have too many pods for %s %d", rt, index)
		} else if len(podSlice) == 0 {
			// If a chief or master replica type is configured, that pod is the master;
			// otherwise the worker with index 0 is chosen as the master.
			// check if this replica is the master role
			masterRole = tc.IsMasterRole(replicas, rtype, index)
			// TODO: [should change to CreateNewPod]
			err = tc.createNewPod(tfJob, rt, strconv.Itoa(index), spec, masterRole, replicas)
		} else {
			// Check the status of the current pod.
			pod := podSlice[0]

			// Scaling down is currently only allowed for workers.
			// check if the index is in the valid range, if not, we should kill the pod
			if index < 0 || index >= numReplicas {
				err = tc.PodControl.DeletePod(pod.Namespace, pod.Name, tfJob)
			}

			// Check if the pod is retryable.
			if spec.RestartPolicy == commonv1.RestartPolicyExitCode {
				if pod.Status.Phase == v1.PodFailed && train_util.IsRetryableExitCode(exitCode) {
					tc.Recorder.Event(tfJob, corev1.EventTypeWarning, tfJobRestartingReason, msg)
					err := commonutil.UpdateJobConditions(jobStatus, commonv1.JobRestarting, tfJobRestartingReason, msg)
					tfJobsRestartCount.Inc()
				}
			}

			updateJobReplicaStatuses(jobStatus, rtype, pod)
		}
	}
	return nil
}
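The slot logic described in the GetPodSlices comment can be reproduced in a short sketch. It follows the two examples given in that comment; the function names and the representation of pods as bare replica indices are assumptions.

```python
def get_pod_slices(pod_indices, replicas):
    """Group existing pods by replica index.

    The result has one slot per index from 0 up to the larger of `replicas`
    and the highest existing index + 1.
    """
    size = max([replicas] + [i + 1 for i in pod_indices])
    slices = [[] for _ in range(size)]
    for i in pod_indices:
        slices[i].append(i)
    return slices


def plan_actions(pod_indices, replicas):
    """Decide per slot what ReconcilePods would do."""
    actions = []
    for index, pod_slice in enumerate(get_pod_slices(pod_indices, replicas)):
        if len(pod_slice) > 1:
            actions.append(("warn-duplicate", index))
        elif len(pod_slice) == 0:
            actions.append(("create", index))
        elif index >= replicas:
            actions.append(("delete", index))
        else:
            actions.append(("keep", index))
    return actions
```

Empty slots below the replica count become creations, and occupied slots at or above it become deletions, which is how scaling in either direction falls out of one loop.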

4.6.2 createNewPod

createNewPod creates a new pod for the given index and type:

// createNewPod creates a new pod for the given index and type.
func (tc *TFController) createNewPod(tfjob *tfv1.TFJob, rt, index string, spec *commonv1.ReplicaSpec, masterRole bool,
	replicas map[commonv1.ReplicaType]*commonv1.ReplicaSpec) error {

	tfjobKey, err := KeyFunc(tfjob)
	expectationPodsKey := expectation.GenExpectationPodsKey(tfjobKey, rt)

	// Create OwnerReference.
	controllerRef := tc.GenOwnerReference(tfjob)

	// Set type and index for the worker.
	labels := tc.GenLabels(tfjob.Name)
	labels[tfReplicaTypeLabel] = rt
	labels[tfReplicaIndexLabel] = index

	podTemplate := spec.Template.DeepCopy()
	// Set name for the template.
	podTemplate.Name = common.GenGeneralName(tfjob.Name, rt, index)
	if podTemplate.Labels == nil {
		podTemplate.Labels = make(map[string]string)
	}
	for key, value := range labels {
		podTemplate.Labels[key] = value
	}

	// Generate the cluster configuration. This is the key step; its implementation is examined below.
	if err := tc.SetClusterSpec(tfjob, podTemplate, rt, index); err != nil {
		return err
	}

	// if gang-scheduling is enabled:
	// 1. if user has specified other scheduler, we report a warning without overriding any fields.
	// 2. if no SchedulerName is set for pods, then we set the SchedulerName to "kube-batch".
	if tc.Config.EnableGangScheduling {
		if isNonGangSchedulerSet(replicas) {
			tc.Recorder.Event(tfjob, v1.EventTypeWarning, podTemplateSchedulerNameReason, errMsg)
		} else {
			podTemplate.Spec.SchedulerName = gangSchedulerName
		}

		if podTemplate.Annotations == nil {
			podTemplate.Annotations = map[string]string{}
		}
		podTemplate.Annotations[gangSchedulingPodGroupAnnotation] = tfjob.GetName()
		podTemplate.Annotations[volcanoTaskSpecKey] = rt
	}

	// Create the Pod using the configuration assembled above.
	err = tc.PodControl.CreatePodsWithControllerRef(tfjob.Namespace, podTemplate, tfjob, controllerRef)
	return nil
}

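4.6.3 Generating the configuration

4.6.3.1 SetClusterSpec

The configuration step in the function above is important enough to look at separately.

SetClusterSpec generates TF_CONFIG and sets it on the given podTemplateSpec: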

// SetClusterSpec generates and sets TF_CONFIG for the given podTemplateSpec.
func (tc *TFController) SetClusterSpec(job interface{}, podTemplate *v1.PodTemplateSpec, rtype, index string) error {
	tfjob, ok := job.(*tfv1.TFJob)

	// Generate TF_CONFIG JSON string.
	tfConfigStr, err := genTFConfigJSONStr(tfjob, rtype, index)

	// Add TF_CONFIG environment variable to tensorflow container in the pod.
	for i := range podTemplate.Spec.Containers {
		if podTemplate.Spec.Containers[i].Name == tfv1.DefaultContainerName {
			if len(podTemplate.Spec.Containers[i].Env) == 0 {
				podTemplate.Spec.Containers[i].Env = make([]v1.EnvVar, 0)
			}
			podTemplate.Spec.Containers[i].Env = append(podTemplate.Spec.Containers[i].Env, v1.EnvVar{
				Name:  tfConfig,
				Value: tfConfigStr,
			})
			break
		}
	}
	return nil
}

4.6.3.2 genTFConfigJSONStr

genTFConfigJSONStr generates the JSON data.

// genTFConfig will generate the environment variable TF_CONFIG
// {
//     "cluster": {
//         "ps": ["ps1:2222", "ps2:2222"],
//         "worker": ["worker1:2222", "worker2:2222", "worker3:2222"]
//     },
//     "task": {
//         "type": "ps",
//         "index": 1
//     }
// }
func genTFConfigJSONStr(tfjob *tfv1.TFJob, rtype, index string) (string, error) {
	// Configure the TFCONFIG environment variable.
	i, err := strconv.ParseInt(index, 0, 32)
	if err != nil {
		return "", err
	}

	cluster, err := genClusterSpec(tfjob)
	if err != nil {
		return "", err
	}

	var tfConfigJSONByteSlice []byte
	if tfjob.Spec.EnableDynamicWorker {
		sparseCluster := convertClusterSpecToSparseClusterSpec(cluster, strings.ToLower(rtype), int32(i))
		sparseTFConfig := SparseTFConfig{
			Cluster: sparseCluster,
			Task: TaskSpec{
				Type:  strings.ToLower(rtype),
				Index: int(i),
			},
		}
		tfConfigJSONByteSlice, err = json.Marshal(sparseTFConfig)
	} else {
		tfConfig := TFConfig{
			Cluster: cluster,
			Task: TaskSpec{
				Type:  strings.ToLower(rtype),
				Index: int(i),
			},
			// We need to set environment to cloud  otherwise it will default to local which isn't what we want.
			// Environment is used by tensorflow.contrib.learn.python.learn in versions <= 1.3
			// TODO(jlewi): I don't think it is used in versions TF >= 1.4. So we can eventually get rid of it.
			Environment: "cloud",
		}
		tfConfigJSONByteSlice, err = json.Marshal(tfConfig)
	}
	if err != nil {
		return "", err
	}

	return string(tfConfigJSONByteSlice), nil
}
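
To make the shape of the generated value concrete, here is a minimal Python sketch of the non-dynamic branch above. It is an illustration only, not the operator's code: the function name gen_tf_config_json and the sample host names are made up for this example, and only the cluster / task / environment layout follows the Go excerpt.

```python
import json


def gen_tf_config_json(cluster, rtype, index):
    """Build a TF_CONFIG string for one replica (dense, non-dynamic case)."""
    tf_config = {
        "cluster": cluster,
        # The replica type is lower-cased and the index, which arrives
        # as a string like in the Go signature, is converted to an integer.
        "task": {"type": rtype.lower(), "index": int(index)},
        "environment": "cloud",
    }
    return json.dumps(tf_config)


# Hypothetical sample cluster, taken from the comment in the Go source.
cluster = {
    "ps": ["ps1:2222", "ps2:2222"],
    "worker": ["worker1:2222", "worker2:2222", "worker3:2222"],
}
print(gen_tf_config_json(cluster, "PS", "1"))
```

Every pod of the job receives the same cluster section; only the task section differs from pod to pod.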

4.6.3.3 genClusterSpec

genClusterSpec derives the cluster section of TF_CONFIG from the replica specs of the TFJob.

// genClusterSpec will generate ClusterSpec.
func genClusterSpec(tfjob *tfv1.TFJob) (ClusterSpec, error) {
	clusterSpec := make(ClusterSpec)

	for rtype, spec := range tfjob.Spec.TFReplicaSpecs {
		rt := strings.ToLower(string(rtype))
		replicaNames := make([]string, 0, *spec.Replicas)

		port, err := GetPortFromTFJob(tfjob, rtype)
		if err != nil {
			return nil, err
		}
		// This loop builds the cluster part of TF_CONFIG. Using DNS names backed by one Service
		// per replica sidesteps the problem that pod IPs are not stable.
		for i := int32(0); i < *spec.Replicas; i++ {
			// As described here: https://kubernetes.io/docs/concepts/services-networking/dns-pod-service/#a-records.
			// Headless service assigned a DNS A record for a name of the form "my-svc.my-namespace.svc.cluster.local".
			// And the last part "svc.cluster.local" is called cluster domain
			// which maybe different between kubernetes clusters.
			hostName := common.GenGeneralName(tfjob.Name, rt, fmt.Sprintf("%d", i))
			svcName := hostName + "." + tfjob.Namespace + "." + "svc"
			clusterDomain := os.Getenv(EnvCustomClusterDomain)
			if len(clusterDomain) > 0 {
				svcName += "." + clusterDomain
			}

			endpoint := fmt.Sprintf("%s:%d", svcName, port)
			replicaNames = append(replicaNames, endpoint)
		}

		clusterSpec[rt] = replicaNames
	}

	return clusterSpec, nil
}
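
The naming scheme is easier to see with concrete values. The following Python sketch mirrors the loop above under two stated assumptions: GenGeneralName is assumed to produce "<job>-<type>-<index>" (its body is not shown in this article), and a single port is used for all replica types instead of calling GetPortFromTFJob per type. The job name, namespace and port are sample values.

```python
def gen_cluster_spec(job_name, namespace, replicas, port, cluster_domain=""):
    """Return {replica_type: [endpoint, ...]} in the layout genClusterSpec uses.

    replicas maps a replica type (for example "Worker") to its replica count.
    """
    cluster_spec = {}
    for rtype, count in replicas.items():
        rt = rtype.lower()
        endpoints = []
        for i in range(count):
            # Assumed naming rule for GenGeneralName: "<job>-<type>-<index>".
            host_name = f"{job_name}-{rt}-{i}"
            svc_name = f"{host_name}.{namespace}.svc"
            # The cluster domain is appended only when one is configured.
            if cluster_domain:
                svc_name += "." + cluster_domain
            endpoints.append(f"{svc_name}:{port}")
        cluster_spec[rt] = endpoints
    return cluster_spec


spec = gen_cluster_spec("dist-mnist", "kubeflow", {"PS": 2, "Worker": 4}, 2222)
print(spec["ps"])
```

Because each endpoint is a DNS name rather than an IP, a restarted pod can come back with a new address without invalidating the TF_CONFIG held by its peers.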

4.6.4 CreatePodsWithControllerRef

Once the cluster configuration is available, it is used to actually start creating the Pod:

func (r RealPodControl) CreatePods(namespace string, template *v1.PodTemplateSpec, object runtime.Object) error {
	return r.createPods("", namespace, template, object, nil)
}

func (r RealPodControl) CreatePodsWithControllerRef(namespace string, template *v1.PodTemplateSpec, controllerObject runtime.Object, controllerRef *metav1.OwnerReference) error {
	if err := ValidateControllerRef(controllerRef); err != nil {
		return err
	}
	return r.createPods("", namespace, template, controllerObject, controllerRef)
}

4.6.5 createPods

This is where the K8S API is actually called to create the pod:

func (r RealPodControl) createPods(nodeName, namespace string, template *v1.PodTemplateSpec, object runtime.Object, controllerRef *metav1.OwnerReference) error {
	pod, err := GetPodFromTemplate(template, object, controllerRef)
	if err != nil {
		return err
	}

	if len(nodeName) != 0 {
		pod.Spec.NodeName = nodeName
	}
	if labels.Set(pod.Labels).AsSelectorPreValidated().Empty() {
		return fmt.Errorf("unable to create pods, no labels")
	}
	// Create the pod through the Kubernetes API (event recording and logging are omitted in this excerpt).
	if _, err := r.KubeClient.CoreV1().Pods(namespace).Create(pod); err != nil {
		return err
	}
	return nil
}

At this point the logic looks like this:

                                        +------------------------------+
          +------------+                | SetClusterSpec               |
          | runWorker  |                |  +-------------------------+ |
          +-----+------+                |  | genTFConfigJSONStr      | |
                |                       |  |                         | |
                |                       |  |      genClusterSpec     | |
                v                       |  |                         | |
       +--------+------------+          |  +-------------------------+ |
       | processNextWorkItem |          +------------------------------+
       +--------+------------+                      |
                |                                   |
                |                                   v
                v                            +------+-------+      +-----------------------------+       +------------+
           +----+------+              +----> | createNewPod +----->+ CreatePodsWithControllerRef +------>+ createPods |
           | syncTFJob |              |      +--------------+      +-----------------------------+       +------------+
           +----+------+              |
                |                     |
                |                     |
                v                     |
        +-------+--------+            |
        | ReconcileJobs  |            |
        +-------+--------+            |
                |                     |
                |                     |
                v                     |
       +--------+---------+           |
       |                  |           |
       |                  |           |
       v                  v           |
+------+----------+  +----+--------+  |
|                 |  |             |  |
|ReconcileServices|  |ReconcilePods+--+
|                 |  |             |
+-----------------+  +-------------+

4.7 Handling Services

4.7.1 ReconcileServices

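ReconcileServices checks and updates the services for each given TFReplicaSpec. Roughly:

- If an error occurs while creating or deleting a service, the job is requeued.
- It fetches all services of replica type rt.
- It either creates a missing service;
- or deletes an out-of-range one. Currently only the worker services can be scaled down this way.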

// reconcileServices checks and updates services for each given ReplicaSpec.
// It will requeue the job in case of an error while creating/deleting services.
func (jc *JobController) ReconcileServices(
	job metav1.Object,
	services []*v1.Service,
	rtype apiv1.ReplicaType,
	spec *apiv1.ReplicaSpec) error {

	// Convert ReplicaType to lower string.
	rt := strings.ToLower(string(rtype))

	replicas := int(*spec.Replicas)
	// Get all services for the type rt.
	services, err := jc.FilterServicesForReplicaType(services, rt)
	if err != nil {
		return err
	}

	// GetServiceSlices will return enough information here to make decision to add/remove/update resources.
	//
	// For example, let's assume we have services with replica-index 0, 1, 2
	// If replica is 4, return a slice with size 4. [[0],[1],[2],[]], a svc with replica-index 3 will be created.
	//
	// If replica is 1, return a slice with size 3. [[0],[1],[2]], svc with replica-index 1 and 2 are out of range and will be deleted.
	serviceSlices := jc.GetServiceSlices(services, replicas, commonutil.LoggerForReplica(job, rt))

	for index, serviceSlice := range serviceSlices {
		if len(serviceSlice) > 1 {
			// More than one service exists for this index; nothing is created or deleted here.
		} else if len(serviceSlice) == 0 {
			// No service exists for this index yet: create one.
			err = jc.CreateNewService(job, rtype, spec, strconv.Itoa(index))
			if err != nil {
				return err
			}
		} else {
			// Check the status of the current svc.
			svc := serviceSlice[0]

			// check if the index is in the valid range, if not, we should kill the svc
			if index < 0 || index >= replicas {
				err = jc.ServiceControl.DeleteService(svc.Namespace, svc.Name, job.(runtime.Object))
				if err != nil {
					return err
				}
			}
		}
	}
	return nil
}
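
The bucket logic described in the comment can be sketched in a few lines of Python. GetServiceSlices itself is not shown in this article, so get_service_slices below is an assumption reconstructed from the two examples in the Go comment (the result has max(replicas, highest index + 1) buckets). Services are represented only by their replica-index, and the function names are invented for the sketch.

```python
def get_service_slices(service_indices, replicas):
    """Group existing services by replica index (indices are non-negative)."""
    size = replicas
    for idx in service_indices:
        size = max(size, idx + 1)
    slices = [[] for _ in range(size)]
    for idx in service_indices:
        slices[idx].append(idx)
    return slices


def plan_actions(service_indices, replicas):
    """Return (action, index) pairs describing what the reconcile loop does."""
    actions = []
    for index, bucket in enumerate(get_service_slices(service_indices, replicas)):
        if len(bucket) > 1:
            # Several services share one index; the loop neither creates nor deletes.
            actions.append(("duplicate", index))
        elif len(bucket) == 0:
            actions.append(("create", index))
        elif index >= replicas:
            actions.append(("delete", index))
    return actions


print(plan_actions([0, 1, 2], 4))  # scale up: index 3 is missing
print(plan_actions([0, 1, 2], 1))  # scale down: indices 1 and 2 are out of range
```

Framing the decision as a comparison between the observed buckets and the desired replica count is what makes the loop safe to run repeatedly: once the services match the spec, it produces no actions.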

4.7.2 CreateNewService

CreateNewService creates a new service for the given index and type:

// createNewService creates a new service for the given index and type.
func (jc *JobController) CreateNewService(job metav1.Object, rtype apiv1.ReplicaType,
	spec *apiv1.ReplicaSpec, index string) error {
	jobKey, err := KeyFunc(job)
	if err != nil {
		return err
	}

	// Convert ReplicaType to lower string.
	rt := strings.ToLower(string(rtype))
	expectationServicesKey := expectation.GenExpectationServicesKey(jobKey, rt)
	err = jc.Expectations.ExpectCreations(expectationServicesKey, 1)
	if err != nil {
		return err
	}

	// Append ReplicaTypeLabel and ReplicaIndexLabel labels.
	labels := jc.GenLabels(job.GetName())
	labels[apiv1.ReplicaTypeLabel] = rt
	labels[apiv1.ReplicaIndexLabel] = index

	port, err := jc.GetPortFromJob(spec)
	if err != nil {
		return err
	}

	service := &v1.Service{
		Spec: v1.ServiceSpec{
			ClusterIP: "None",
			Selector:  labels,
			Ports:     []v1.ServicePort{},
		},
	}

	// Add service port to headless service only if port is set from controller implementation
	if port != nil {
		svcPort := v1.ServicePort{Name: jc.Controller.GetDefaultContainerPortName(), Port: *port}
		service.Spec.Ports = append(service.Spec.Ports, svcPort)
	}

	service.Name = GenGeneralName(job.GetName(), rt, index)
	service.Labels = labels
	// Create OwnerReference.
	controllerRef := jc.GenOwnerReference(job)

	err = jc.ServiceControl.CreateServicesWithControllerRef(job.GetNamespace(), service, job.(runtime.Object), controllerRef)
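	// A timeout means the create request was sent but did not finish initializing in time;
	// the service may still appear later, so it is not treated as a failure here.
	if err != nil && errors.IsTimeout(err) {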
		succeededServiceCreationCount.Inc()
		return nil
	} else if err != nil {
		failedServiceCreationCount.Inc()
		return err
	}
	succeededServiceCreationCount.Inc()
	return nil
}
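
For readers more used to manifests than to Go structs, the object built above corresponds roughly to the following headless Service, written here as a Python dict. This is a sketch under assumptions: the label keys ("job-name", "replica-type", "replica-index") and the port name "tfjob-port" are placeholders, because the article does not show the values of GenLabels, ReplicaTypeLabel, ReplicaIndexLabel or GetDefaultContainerPortName. The service name assumes the same "<job>-<type>-<index>" rule as before.

```python
def build_headless_service(job_name, namespace, rtype, index, port=None):
    """Return a Service manifest (as a dict) selecting exactly one replica."""
    rt = rtype.lower()
    # Placeholder label keys; the real keys come from the operator's constants.
    labels = {
        "job-name": job_name,
        "replica-type": rt,
        "replica-index": str(index),
    }
    ports = []
    # A port is added only when the controller provides one.
    if port is not None:
        ports.append({"name": "tfjob-port", "port": port})
    return {
        "apiVersion": "v1",
        "kind": "Service",
        "metadata": {
            "name": f"{job_name}-{rt}-{index}",
            "namespace": namespace,
            "labels": labels,
        },
        # clusterIP "None" makes the service headless: DNS resolves to the pod itself.
        "spec": {"clusterIP": "None", "selector": labels, "ports": ports},
    }


svc = build_headless_service("dist-mnist", "kubeflow", "Worker", 0, port=2222)
print(svc["metadata"]["name"], svc["spec"]["clusterIP"])
```

Because the selector carries both the replica type and the replica index, each service matches a single pod, which is what gives every replica its own stable DNS name.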

4.7.3 CreateServicesWithControllerRef

The service object built above is then used to actually start creating the Service:

func (r RealServiceControl) CreateServicesWithControllerRef(namespace string, service *v1.Service, controllerObject runtime.Object, controllerRef *metav1.OwnerReference) error {
	if err := ValidateControllerRef(controllerRef); err != nil {
		return err
	}
	return r.createServices(namespace, service, controllerObject, controllerRef)
}

4.7.4 createServices

Only here is the K8S API actually called to create the service:

func (r RealServiceControl) createServices(namespace string, service *v1.Service, object runtime.Object, controllerRef *metav1.OwnerReference) error {
	if labels.Set(service.Labels).AsSelectorPreValidated().Empty() {
		return fmt.Errorf("unable to create Services, no labels")
	}
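	serviceWithOwner, err := GetServiceFromTemplate(service, object, controllerRef)
	if err != nil {
		return err
	}
	// Create the service through the Kubernetes API (event recording and logging are omitted in this excerpt).
	if _, err := r.KubeClient.CoreV1().Services(namespace).Create(serviceWithOwner); err != nil {
		return err
	}
	return nil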
}

The logic now extends to the following:

                                        +------------------------------+
          +------------+                | SetClusterSpec               |
          | runWorker  |                |  +-------------------------+ |
          +-----+------+                |  | genTFConfigJSONStr      | |
                |                       |  |                         | |
                |                       |  |      genClusterSpec     | |
                v                       |  |                         | |
       +--------+------------+          |  +-------------------------+ |
       | processNextWorkItem |          +------------------------------+
       +--------+------------+                      |
                |                                   |
                |                                   v
                v                            +------+-------+      +-----------------------------+       +------------+
           +----+------+              +----> | createNewPod +----->+ CreatePodsWithControllerRef +------>+ createPods |
           | syncTFJob |              |      +--------------+      +-----------------------------+       +------------+
           +----+------+              |
                |                     |
                |                     |
                v                     |           +------------------+     +---------------------------------+    +----------------+
        +-------+--------+            |    +----> | CreateNewService +---->+ CreateServicesWithControllerRef +--->+ createServices |
        | ReconcileJobs  |            |    |      +------------------+     +---------------------------------+    +----------------+
        +-------+--------+            |    |
                |                     |    |
                |                     |    |
                v                     |    |
       +--------+---------+           |    |
       |                  |           |    |
       |                  |           |    |
       v                  v           |    |
+------+----------+  +----+--------+  |    |
|                 |  |             |  |    |
|ReconcileServices|  |ReconcilePods+--+    |
|                 |  |             |       |
+------+----------+  +-------------+       |
       |                                   |
       +---------------------------------->+

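So we can roughly conclude that TF-Operator essentially does two things:

- It uses a custom resource object, TFJob, to describe a distributed machine learning training task;
- It implements a controller for TFJob that manages the lifecycle of the containers and keeps track of the relationships among the many processes on the user's behalf.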

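0x05 Comparison with a Plain Deployment

At this point you may be wondering how TF on K8s really differs from a plain deployment, and where its advantages lie. Let us look at that in detail.

5.1 Running

First, look at the Dockerfile in the source tree: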

FROM tensorflow/tensorflow:1.5.0

ADD . /var/tf_dist_mnist
ENTRYPOINT ["python", "/var/tf_dist_mnist/dist_mnist.py"]

Next, look at the corresponding spec, which declares 2 PS replicas and 4 Worker replicas.

apiVersion: "kubeflow.org/v1"
kind: "TFJob"
metadata:
  name: "dist-mnist-for-e2e-test"
spec:
  tfReplicaSpecs:
    PS:
      replicas: 2
      restartPolicy: Never
      template:
        spec:
          containers:
            - name: tensorflow
              image: kubeflow/tf-dist-mnist-test:1.0
    Worker:
      replicas: 4
      restartPolicy: Never
      template:
        spec:
          containers:
            - name: tensorflow
              image: kubeflow/tf-dist-mnist-test:1.0

Then build the example image and submit the job to run a distributed MNIST training task.

cd ./examples/v1/dist-mnist
docker build -f Dockerfile -t kubeflow/tf-dist-mnist-test:1.0 .
kubectl create -f ./tf_job_mnist.yaml

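5.2 Comparison

We compare the two simply by looking at the training code.

5.2.1 Plain TF

The hosts are configured through script arguments; the code below reads those arguments and starts the server.

# Read the arguments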
ps_spec = FLAGS.ps_hosts.split(',')
worker_spec = FLAGS.worker_hosts.split(',')

# Build the cluster
num_worker = len(worker_spec)
cluster = tf.train.ClusterSpec({'ps': ps_spec, 'worker': worker_spec})
server = tf.train.Server(cluster, job_name=FLAGS.job_name, task_index=FLAGS.task_index)

5.2.2 TF-Operator

First, dist_mnist.py obtains the cluster information like this:

# If not explicitly specified in the constructor and the TF_CONFIG
# environment variable is present, load cluster_spec from TF_CONFIG.
tf_config = json.loads(os.environ.get('TF_CONFIG') or '{}')
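
As a rough illustration of how that parsed value replaces the command-line flags of section 5.2.1, the sketch below extracts the same four pieces of information (PS hosts, worker hosts, job name, task index) from TF_CONFIG. The helper name cluster_from_tf_config and the sample endpoints are invented for this example; it is not the code of dist_mnist.py.

```python
import json
import os


def cluster_from_tf_config(environ=os.environ):
    """Extract (ps_spec, worker_spec, job_name, task_index) from TF_CONFIG."""
    tf_config = json.loads(environ.get("TF_CONFIG") or "{}")
    cluster = tf_config.get("cluster", {})
    task = tf_config.get("task", {})
    return (
        cluster.get("ps", []),
        cluster.get("worker", []),
        task.get("type"),
        task.get("index"),
    )


# Hypothetical environment, shaped like the value the operator injects.
sample_env = {
    "TF_CONFIG": json.dumps({
        "cluster": {
            "ps": ["ps-0:2222"],
            "worker": ["worker-0:2222", "worker-1:2222"],
        },
        "task": {"type": "worker", "index": 1},
        "environment": "cloud",
    })
}
ps_spec, worker_spec, job_name, task_index = cluster_from_tf_config(sample_env)
print(job_name, task_index, len(worker_spec))
```

When TF_CONFIG is absent, the helper returns empty lists and None values, so a script can fall back to its flags for local runs.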

Second, TF-Operator defines the following constant, which shows that the cluster information is passed through this environment variable:

tfConfig = "TF_CONFIG"

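Then SetClusterSpec builds the configuration dynamically from the TFJob object and injects it into the pod template as that environment variable: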

// SetClusterSpec generates and sets TF_CONFIG for the given podTemplateSpec.
func (tc *TFController) SetClusterSpec(job interface{}, podTemplate *v1.PodTemplateSpec, rtype, index string) error {
   tfjob, ok := job.(*tfv1.TFJob)
   if !ok {
      return fmt.Errorf("%v is not a type of TFJob", job)
   }

   // Do not set TF_CONFIG for local training jobs.
   if !isDistributed(tfjob) {
      return nil
   }
   // Generate TF_CONFIG JSON string.
   tfConfigStr, err := genTFConfigJSONStr(tfjob, rtype, index)
   if err != nil {
      return err
   }

   // Add TF_CONFIG environment variable to tensorflow container in the pod.
   for i := range podTemplate.Spec.Containers {
      if podTemplate.Spec.Containers[i].Name == tfv1.DefaultContainerName {
         if len(podTemplate.Spec.Containers[i].Env) == 0 {
            podTemplate.Spec.Containers[i].Env = make([]v1.EnvVar, 0)
         }
         podTemplate.Spec.Containers[i].Env = append(podTemplate.Spec.Containers[i].Env, v1.EnvVar{
            Name:  tfConfig,
            Value: tfConfigStr,
         })
         break
      }
   }
   return nil
}

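So from the user's point of view, only a small code change is needed. Deploying the services and everything around them is taken over by K8S.

Users only need to state in the spec how many workers and PS replicas they want, which lets them concentrate on the model while the DevOps side handles the rest.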

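0x06 Summary

Putting the earlier sections together, TF-Operator offers the following advantages:

- It uses a custom resource object, TFJob, to describe a distributed machine learning training task;
- It implements a controller for TFJob that manages the lifecycle of the containers and keeps track of the relationships among the many processes on the user's behalf;
- Users only need to create a TFJob custom resource and fill in its template; that alone describes how the distributed training program should run;
- Users can concentrate on the model while the DevOps side takes care of everything else.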

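kubeflow/tf-operator works, but it still has a number of shortcomings.

- Kubeflow can launch jobs on Kubernetes that rely on TensorFlow's native distributed computing capability. That native capability is not fault tolerant, and Kubeflow cannot conjure fault tolerance out of nothing. No fault tolerance also means no elastic scheduling.
- When a distributed TensorFlow job runs under kubeflow/tf-operator, model iteration can only start after all requested processes are up. If the cluster lacks the resources to start them all, the job has to wait for other jobs to release resources. Giving the job a dedicated resource pool shortens that wait.
- But because a dedicated pool is not shared, cluster utilization becomes very low. So kubeflow/tf-operator struggles to deliver both development efficiency and cluster utilization.

Most importantly, tf-operator has no connection to Horovod and does not install MPI or related software, so the next article looks at MPI-Operator.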