Kubernetes Scheduler

Posted by FR13DNS on 2020-11-21


Kubernetes relies on the scheduler component to make sure that every pod to be scheduled finds a suitable node in the cluster and runs there in the desired state. During scheduling the scheduler never modifies the Pod resource; it only reads from it, picks the most suitable node according to the configured policies, and then binds the Pod to that node through an API call, which completes the scheduling process.


How it works

  1. Overview of how the kubelet works
    When a user request reaches the scheduler via the APIServer, the scheduler's algorithm picks the node best suited to run the pod, and the result is written back to the APIServer and stored in etcd. Unless the node goes down or the pod is evicted (for example because of OOM), the pod keeps running on that node, and even if the pod is rebuilt the scheduling result does not change. The kubelet on each node keeps watching the APIServer; as soon as an event concerning its own node appears, the node fetches the declared manifest from the APIServer and creates the pod, which includes pulling or starting the local image according to the manifest, mounting volumes if required, and so on.
  2. Overview of how kube-proxy works
    Creating a service follows the same pattern as creating a pod; the only difference is that a service is nothing more than iptables or LVS rules on the nodes, and these rules are generated by each node's kube-proxy, which also watches the APIServer.
  3. Data serialization in the APIServer
    To the APIServer every request is just a client and is subject to authentication and authorization; different clients merely serialize data differently. kubectl serializes data as JSON, while communication between in-cluster components uses Protobuf, developed by Google (a sketch follows this list).
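A minimal sketch, not from the original article, of what that content negotiation looks like against the APIServer; the bearer token, server address and pod name are placeholders:

    # Request the same object as JSON and as Protobuf; $TOKEN and pod-demo are assumptions
    [root@master-0 ~]# curl -sk -H "Authorization: Bearer $TOKEN" \
          -H "Accept: application/json" \
          https://127.0.0.1:6443/api/v1/namespaces/default/pods/pod-demo
    [root@master-0 ~]# curl -sk -H "Authorization: Bearer $TOKEN" \
          -H "Accept: application/vnd.kubernetes.protobuf" \
          https://127.0.0.1:6443/api/v1/namespaces/default/pods/pod-demo -o pod-demo.pb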

The scheduling algorithm

Kubernetes ships with a default scheduler that covers the Pod scheduling needs of the vast majority of scenarios; it can combine built-in and customizable algorithms to pick the one node in the cluster best suited to run the current Pod resource, and its core goal is to distribute Pod resources fairly across the cluster's nodes based on resource availability. This default scheduler, also called the generic scheduler, completes scheduling in three steps: node pre-selection (Predicate), node priority ordering (Priority) and node selection (Select)


Predicate

A container can be constrained along two dimensions: the first is its baseline resource request, which must be satisfiable for the container to run; the second is its resource limit, beyond which no additional memory is allocated, while the container itself reports its current usage. Nodes that cannot meet the pod's baseline resource requests are eliminated in the Predicate phase, as are nodes where, for example, a host port the container wants to listen on is already taken. In short, this step removes from consideration every node that cannot possibly satisfy the pod's basic requirements; the predicate policies work on a one-vote-veto basis.
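As a minimal sketch (not part of the original article), the two dimensions correspond to resources.requests and resources.limits in the pod spec; the pod name and the values are placeholders:

    apiVersion: v1
    kind: Pod
    metadata:
      name: resource-demo              # hypothetical name
    spec:
      containers:
      - name: myapp
        image: ikubernetes/myapp:v1
        resources:
          requests:                    # baseline the node must be able to satisfy (checked in Predicate)
            cpu: 250m
            memory: 128Mi
          limits:                      # hard ceiling the container may not exceed
            cpu: 500m
            memory: 256Mi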

Of the predicate policies supported by Kubernetes 1.10, only a subset of the scheduler's algorithms is enabled by default; to make other predicates take effect they have to be enabled when the scheduler is deployed or added to its configuration afterwards
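A minimal sketch, under the assumption of the legacy policy file passed to kube-scheduler via --policy-config-file, of how extra predicates and priority weights could be enabled; the exact policy names accepted depend on the cluster version, so treat this as illustrative only:

    {
      "kind": "Policy",
      "apiVersion": "v1",
      "predicates": [
        {"name": "PodFitsHostPorts"},
        {"name": "PodFitsResources"},
        {"name": "NoDiskConflict"},
        {"name": "MatchNodeSelector"},
        {"name": "CheckNodeLabelPresence"}
      ],
      "priorities": [
        {"name": "LeastRequestedPriority", "weight": 1},
        {"name": "BalancedResourceAllocation", "weight": 1}
      ]
    }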

Common predicate policies

  1. CheckNodeCondition: checks whether the pod may be scheduled onto a node that is reporting its disk or network as unavailable, or itself as not ready; enabled by default
  2. GeneralPredicates: a policy subset, enabled by default, that bundles several predicates:
    • HostName: if the pod defines pod.spec.hostname, checks whether another pod on the node already occupies that name
    • PodFitsHostPorts: if a container defines pod.spec.containers.ports.hostPort, checks whether another pod on the node already occupies that port
    • MatchNodeSelector: checks whether the node carries the labels required by the pod's label selector
    • PodFitsResources: checks whether the node can satisfy the pod's resource requests, as reported under Allocated resources in kubectl describe node (see the command sketch after this list)
  3. NoDiskConflict: checks whether the node can satisfy the pod's volume requirements without conflicts; not enabled by default
  4. PodToleratesNodeTaints: checks whether the pod's pod.spec.tolerations covers the node's taints; enabled by default
  5. PodToleratesNodeNoExecuteTaints: checks whether the pod's pod.spec.tolerations covers the node's NoExecute taints; not enabled by default
  6. CheckNodeLabelPresence: checks for the presence of specific node labels; not enabled by default
  7. CheckServiceAffinity: decides whether to schedule the pod onto a node based on whether other pods of the service it belongs to already run there; not enabled by default
  8. Three volume-count predicates for public-cloud block storage, enabled by default:
    • MaxEBSVolumeCount
    • MaxGCEPDVolumeCount
    • MaxAzureDiskVolumeCount
  9. CheckVolumeBinding: checks whether the node's bound and unbound PVCs can satisfy the pod's volume requirements; enabled by default
  10. NoVolumeZoneConflict: checks, within the current zone, whether the node's volumes conflict with the pod object; enabled by default
  11. CheckNodeMemoryPressure: checks whether the node is under memory pressure; enabled by default
  12. CheckNodePIDPressure: checks whether the node is under PID pressure; enabled by default
  13. CheckNodeDiskPressure: checks whether the node is under disk I/O pressure; enabled by default
  14. MatchInterPodAffinity: checks whether the node satisfies the pod's affinity or anti-affinity conditions; enabled by default
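A quick way to look at the Allocated resources figures that PodFitsResources evaluates (referenced in item 2 above); the node name reuses one from the examples later in this article:

    [root@master-0 ~]# kubectl describe node slave-0.shared | grep -A 8 "Allocated resources"
    # prints the Resource / Requests / Limits table for cpu and memory on that node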

Priority

After the predicates have filtered the nodes and produced a candidate list, the second phase, priority scoring, begins. The scheduler runs a series of priority functions against every node that passed pre-selection to compute a score between 0 and 10, where 0 means unsuitable and 10 means best suited to host the Pod object


Common priority functions

  1. LeastRequested: scores a node by the ratio of its free resources to its total capacity; a higher score means more idle capacity and a better candidate. The algorithm is as follows (a worked example appears after this list)

    score = ( (cpu_capacity - sum(cpu_requested)) * 10 / cpu_capacity
            + (mem_capacity - sum(mem_requested)) * 10 / mem_capacity ) / 2


    Each term is multiplied by 10 because every priority function scores on a 0-10 scale; the CPU and memory scores are added and the sum is divided by 2 because two dimensions are being averaged

  2. BalancedResourceAllocation: the closer a node's CPU and memory utilisation ratios are to each other, the higher its score; it is meant to be used together with LeastRequested when evaluating how heavily a node is used

  3. NodePreferAvoidPods: this priority function has a default weight of 10000; it scores a node according to whether it carries the annotation scheduler.alpha.kubernetes.io/preferAvoidPods, as follows

    • if the node has no such annotation, its score is 10 multiplied by the weight of 10000
    • if the annotation is present, Pod objects managed by a ReplicationController or ReplicaSet controller score 0, while other Pod objects are unaffected (and receive the highest score)
  4. NodeAffinity: evaluates the node-affinity scheduling preference; it checks how well a given node matches the nodeSelector terms in the Pod resource, and the more entries it matches, the higher it scores. The evaluation uses the preferred rather than the required selector, i.e. PreferredDuringSchedulingIgnoredDuringExecution

  5. TaintToleration: evaluates a node's priority based on the Pod object's tolerance of the node's taints; it checks the Pod object's tolerations list against the node's taints, and the more entries that match, the lower the node scores

  6. SelectorSpread: label-selector spreading; it finds the Service, ReplicationController, ReplicaSet (RS) and StatefulSet objects that match the current pod, then the existing Pod objects matched by their selectors and the nodes those pods run on; nodes running fewer such Pod objects score higher. In short, as its name suggests, this function tries to spread Pods matched by the same label selector across different nodes

  7. InterPodAffinity: iterates over the pod's affinity terms and sums up those that the given node satisfies; the larger the sum, the higher the score

  8. MostRequested: uses the same inputs as LeastRequested but scores in the opposite direction; it tries to fill a node's resources as completely as possible and is generally not used together with LeastRequested

  9. NodeLabel: scores a node according to whether it has certain labels, scoring only when they exist, or scoring by the number of such labels

  10. ImageLocality: scores a node according to whether it already holds the container images that the current Pod object depends on; a node with none of the required images scores 0, and among nodes that do hold them, the larger the total size of the locally present required images, the higher the score, which saves download bandwidth
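As a worked illustration of the LeastRequested arithmetic referenced in item 1 above (the numbers are invented): take a node with 4000m of CPU and 8192Mi of memory whose running pods already request 2000m of CPU and 2048Mi of memory

    cpu: (4000 - 2000) * 10 / 4000 = 5
    mem: (8192 - 2048) * 10 / 8192 = 7.5
    LeastRequested score = (5 + 7.5) / 2 = 6.25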

Priority evaluation:
For a pod, every enabled priority function is evaluated and the scores are summed; the node with the highest total is the best candidate, and if several nodes tie they move on to the Select phase. The scheduler also allows each priority function to be given a simple positive-integer weight. When computing a node's priority it first multiplies each function's score by that function's weight (most functions default to a weight of 1) and then adds all weighted scores together to obtain the node's final priority, so weights give administrators a way to express a preference among priority functions. The formula for each node's final score is:

finalScoreNode=(weight1*priorityFunc1)+(weight2*priorityFunc2)+ ...
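For example (values invented for illustration), if LeastRequested has weight 1 and scores 6 on a node, while ImageLocality has weight 2 and scores 8, that node's final priority is:

    finalScoreNode = (1 * 6) + (2 * 8) = 22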

Select

The pod is bound to the selected node; if the priority phase yields more than one best node, one of them is picked at random

Special scheduling preferences

These give particular pods a way to choose nodes: through them a pod can take part in, or change, the outcome of pre-selection and priority scoring, enabling advanced scheduling. There are three kinds of such preferences

Node labels

When some pods need to run on specific nodes, the nodes should first be classified with labels; the pod definition can then add the preference via pods.spec.nodeName or pods.spec.nodeSelector, and this is evaluated in the Predicate phase

  1. Manifest reference

    [root@master-0 ~]# kubectl explain pod.spec.nodeSelector
    KIND:     Pod
    VERSION:  v1
    
    FIELD:    nodeSelector <map[string]string>
    
    DESCRIPTION:
        NodeSelector is a selector which must be true for the pod to fit on a node.
        Selector which must match a node's labels for the pod to be scheduled on
        that node. More info:
        https://kubernetes.io/docs/concepts/configuration/assign-pod-node/
    
  2. Example

    [root@master-0 ~]# cat nodeselector.yaml
    apiVersion: v1
    kind: Pod
    metadata:
      name: pod-demo
      namespace: default
      labels:
          app: myapp
    spec:
      containers:
      - name: myapp
        image: ikubernetes/myapp:v1
      nodeSelector:
        disktype: ssd
    [root@master-0 ~]# kubectl apply -f nodeselector.yaml
    pod/pod-demo created
    [root@master-0 ~]# kubectl label nodes slave-0.shared disktype=ssd
    node/slave-0.shared labeled
    

Affinity

Affinity is inspected under pod.spec.affinity and is defined along two dimensions, node and pod

[root@master-0 ~]# kubectl explain pod.spec.affinity
KIND:     Pod
VERSION:  v1

RESOURCE: affinity <Object>

DESCRIPTION:
     If specified, the pod's scheduling constraints

     Affinity is a group of affinity scheduling rules.

FIELDS:
   nodeAffinity <Object>
     Describes node affinity scheduling rules for the pod.

   podAffinity <Object>
     Describes pod affinity scheduling rules (e.g. co-locate this pod in the
     same node, zone, etc. as some other pod(s)).

   podAntiAffinity <Object>
     Describes pod anti-affinity scheduling rules (e.g. avoid putting this pod
     in the same node, zone, etc. as some other pod(s)).

Node affinity

There are two types of rules when defining node affinity:

  • Hard affinity (required): a mandatory rule that must be satisfied for the Pod to be scheduled; if no node satisfies it, the Pod object is left in the Pending state
  • Soft affinity (preferred): a flexible scheduling constraint; it expresses a preference for running the Pod object on a certain class of nodes, and the scheduler tries to honour it, but when it cannot it falls back to a node that does not match the rule

Whether the rule is required or preferred, once the Pod resource has been scheduled onto a node, the scheduler will not move the Pod object off that node even if the node's labels later change and no longer satisfy the node affinity rule

Node hard affinity
  1. Node hard affinity: pod.spec.affinity.nodeAffinity.requiredDuringSchedulingIgnoredDuringExecution

    [root@master-0 ~]# kubectl explain pod.spec.affinity.nodeAffinity.requiredDuringSchedulingIgnoredDuringExecution
    KIND:     Pod
    VERSION:  v1
    
    RESOURCE: requiredDuringSchedulingIgnoredDuringExecution <Object>
    
    DESCRIPTION:
        If the affinity requirements specified by this field are not met at
        scheduling time, the pod will not be scheduled onto the node. If the
        affinity requirements specified by this field cease to be met at some point
        during pod execution (e.g. due to an update), the system may or may not try
        to eventually evict the pod from its node.
    
        A node selector represents the union of the results of one or more label
        queries over a set of nodes; that is, it represents the OR of the selectors
        represented by the node selector terms.
    
    FIELDS:
      nodeSelectorTerms <[]Object> -required-       # the node terms the pod is affine to
        Required. A list of node selector terms. The terms are ORed.
    [root@master-0 ~]# cat nodeaffinity.yaml
    apiVersion: v1
    kind: Pod
    metadata:
      name: pod-nodeaffinity
      namespace: default
      labels:
        app: myapp
    spec:
      containers:
      - name: myapp
        image: ikubernetes/myapp:v1
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: zone
                operator: In
                values:
                - foo
                - bar
    [root@master-0 ~]# kubectl apply -f nodeaffinity.yaml
    pod/pod-nodeaffinity created                  # the pod only becomes Running if some node has the label zone with a value of foo or bar
    
  2. The two matching methods under pod.spec.affinity.nodeAffinity.requiredDuringSchedulingIgnoredDuringExecution.nodeSelectorTerms (a matchFields sketch follows this list)

    • matchExpressions: a list of node selector requirements expressed against node labels
    • matchFields: selects Kubernetes resources by the value of one or more resource fields instead of labels, for example
      1. metadata.name=my-service
      2. metadata.namespace!=default
      3. status.phase=Pending
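A minimal sketch (not from the original article) of matchFields in node hard affinity; for nodeSelectorTerms the commonly usable field is metadata.name, the pod name is a placeholder, and the node name reuses one from the earlier examples:

apiVersion: v1
kind: Pod
metadata:
  name: pod-matchfields-demo           # hypothetical name
spec:
  containers:
  - name: myapp
    image: ikubernetes/myapp:v1
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchFields:
          - key: metadata.name         # match the node object's name field instead of a label
            operator: In
            values:
            - slave-0.shared
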
Node soft affinity

Node soft affinity: pod.spec.affinity.nodeAffinity.preferredDuringSchedulingIgnoredDuringExecution

[root@master-0 ~]# kubectl explain pod.spec.affinity.nodeAffinity.preferredDuringSchedulingIgnoredDuringExecution
KIND:     Pod
VERSION:  v1

RESOURCE: preferredDuringSchedulingIgnoredDuringExecution <[]Object>

DESCRIPTION:
     The scheduler will prefer to schedule pods to nodes that satisfy the
     affinity expressions specified by this field, but it may choose a node that
     violates one or more of the expressions. The node that is most preferred is
     the one with the greatest sum of weights, i.e. for each node that meets all
     of the scheduling requirements (resource request, requiredDuringScheduling
     affinity expressions, etc.), compute a sum by iterating through the
     elements of this field and adding "weight" to the sum if the node matches
     the corresponding matchExpressions; the node(s) with the highest sum are
     the most preferred.

     An empty preferred scheduling term matches all objects with implicit weight
     0 (i.e. it's a no-op). A null preferred scheduling term matches no objects
     (i.e. is also a no-op).

FIELDS:
   preference <Object> -required-             # the preferred node term
     A node selector term, associated with the corresponding weight.

   weight <integer> -required-                # weight of the preference
     Weight associated with matching the corresponding nodeSelectorTerm, in the
     range 1-100.
[root@master-0 ~]# cat nodeaffinity-demo.yaml
apiVersion: v1
kind: Pod
metadata:
  name: pod-nodeaffinity-demo
  namespace: default
  labels:
    app: myapp
spec:
  containers:
  - name: myapp
    image: ikubernetes/myapp:v1
  affinity:
    nodeAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
      - preference:
          matchExpressions:
          - key: zone
            operator: In
            values:
            - foo
            - bar
        weight: 60
[root@master-0 ~]# kubectl apply -f nodeaffinity-demo.yaml
pod/pod-nodeaffinity-demo created

Pod affinity

Pod affinity lets related pods run together. The same effect could be achieved with node affinity, but only with careful orchestration; with pod affinity the scheduler may place the first pod anywhere, and pods that have an affinity or anti-affinity relationship with it are then positioned dynamically relative to it. Some mechanism, such as node labels, is still needed so that pod affinity and anti-affinity have a topology to follow

If certain pods prefer to run in the same location, they have affinity for each other; if they prefer not to run in the same location, they have anti-affinity, for example two Nginx pods that both listen on port 80, or pods kept apart for security reasons

Pod hard affinity

  1. Pod hard affinity: pods.spec.affinity.podAffinity.requiredDuringSchedulingIgnoredDuringExecution

    [root@master-0 ~]# kubectl explain pods.spec.affinity.podAffinity.requiredDuringSchedulingIgnoredDuringExecution
    KIND:     Pod
    VERSION:  v1
    
    RESOURCE: requiredDuringSchedulingIgnoredDuringExecution <[]Object>
    
    DESCRIPTION:
        If the affinity requirements specified by this field are not met at
        scheduling time, the pod will not be scheduled onto the node. If the
        affinity requirements specified by this field cease to be met at some point
        during pod execution (e.g. due to a pod label update), the system may or
        may not try to eventually evict the pod from its node. When there are
        multiple elements, the lists of nodes corresponding to each podAffinityTerm
        are intersected, i.e. all terms must be satisfied.
    
        Defines a set of pods (namely those matching the labelSelector relative to
        the given namespace(s)) that this pod should be co-located (affinity) or
        not co-located (anti-affinity) with, where co-located is defined as running
        on a node whose value of the label with key <topologyKey> matches that of
        any node on which a pod of the set of pods is running
    
    FIELDS:
      labelSelector <Object>             # which pods to be affine with; selects the target pod resources
        A label query over a set of resources, in this case pods.
    
      namespaces <[]string>              # the namespace(s) in which the label selector matches pods; if unspecified, the namespace of the pod being created is used
        namespaces specifies which namespaces the labelSelector applies to (matches
        against); null or empty list means "this pod's namespace"
    
      topologyKey <string> -required-    # the key of the location topology
        This pod should be co-located (affinity) or not co-located (anti-affinity)
        with the pods matching the labelSelector in the specified namespaces, where
        co-located is defined as running on a node whose value of the label with
        key topologyKey matches that of any node on which any of the selected pods
        is running. Empty topologyKey is not allowed.
    
  2. Define a base pod and a pod with hard affinity to it

    [root@master-0 ~]# cat pod-requiredaffinity-demo.yaml
    apiVersion: v1
    kind: Pod
    metadata:
      name: pod-first
      namespace: default
      labels:
        app: myapp
    spec:
      containers:
      - name: myapp
        image: ikubernetes/myapp:v1
    ---
    apiVersion: v1
    kind: Pod
    metadata:
      name: pod-second
      namespace: default
      labels:
        app: db
    spec:
      containers:
      - name: busybox
        image: busybox:latest
        imagePullPolicy: IfNotPresent
        command: ["sh","-c","sleep 3600"]
      affinity:
        podAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - {"key": "app", "operator": "In", "values": ["myapp"]}     # select the base pod's label
            topologyKey: kubernetes.io/hostname         # limits which nodes the following busybox pod may run on; with hostname as the key the only match is the exact node where the base pod runs
    [root@master-0 ~]# kubectl apply -f pod-requiredaffinity-demo.yaml
    pod/pod-first created
    pod/pod-second created
    

Pod affinity pinned to a single node is useful only in rare cases; more common are topology constraints based on a shared region, zone or rack. For example, when deploying the Pods of an application service together with its database service, the db Pod might land on a node in either the foo or the bar zone, and the myapp Pod that depends on the data service can then be placed on any node within the zone where the db Pod runs; of course, if the db Pod has replicas running in both foo and bar, the myapp Pod may run on any node in either zone


Pod anti-affinity

The only difference is that the pod must not end up sharing the same topologyKey value (the same topology domain) as the matched pods; apart from that the definition is identical to pod affinity

[root@master-0 ~]# kubectl label nodes slave-0.shared zone=foo
node/slave-0.shared labeled
[root@master-0 ~]# kubectl label nodes slave-1.shared zone=foo
node/slave-1.shared labeled
[root@master-0 ~]# cat pod-required-antiaffinity-demo.yaml
apiVersion: v1
kind: Pod
metadata:
  name: pod-first
  namespace: default
  labels:
    app: myapp
spec:
  containers:
  - name: myapp
    image: ikubernetes/myapp:v1
---
apiVersion: v1
kind: Pod
metadata:
  name: pod-second
  namespace: default
  labels:
    app: db
spec:
  containers:
  - name: busybox
    image: busybox:latest
    imagePullPolicy: IfNotPresent
    command: ["sh","-c","sleep 3600"]
  affinity:
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchExpressions:
          - {"key": "app", "operator": "In", "values": ["myapp"]}
        topologyKey: zone
[root@master-0 ~]# kubectl apply -f pod-required-antiaffinity-demo.yaml
pod/pod-first created
pod/pod-second created
[root@master-0 ~]# kubectl get pod
NAME         READY   STATUS    RESTARTS   AGE
pod-first    1/1     Running   0          3s
pod-second   0/1     Pending   0          3s

Pod soft affinity and soft anti-affinity

These behave the same way as node soft affinity, so they are not repeated in detail; a minimal sketch follows
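A minimal sketch (not in the original article) of pod soft affinity: the pod prefers, with weight 80, to land in the same zone as pods labelled app=myapp, but can still be scheduled elsewhere if no such node fits; the pod name and the weight are placeholders

apiVersion: v1
kind: Pod
metadata:
  name: pod-preferred-affinity-demo    # hypothetical name
spec:
  containers:
  - name: busybox
    image: busybox:latest
    command: ["sh", "-c", "sleep 3600"]
  affinity:
    podAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 80                     # 1-100, higher means a stronger preference
        podAffinityTerm:
          labelSelector:
            matchExpressions:
            - {"key": "app", "operator": "In", "values": ["myapp"]}
          topologyKey: zone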

Taints and Tolerations

Taints are key/value attributes added to a node; Tolerations are a list defined on a pod of the Taints it can tolerate. A node can be marked with taints, and whether a pod may run on that node depends on whether the pod tolerates those taint marks


Taint checking is involved in both the pre-selection and the priority phases. When a new taint that existing pods do not tolerate appears on a node, one of two outcomes follows, depending on the repelling effect defined in Taints.effect

  • NoSchedule: affects only the scheduling process; existing pods are untouched
  • NoExecute: affects both the scheduling process and existing pods; pods that do not tolerate the taint are actively evicted, and an eviction grace period can be set via pods.spec.tolerations.tolerationSeconds (unset or zero means immediate eviction)
  • PreferNoSchedule: a soft version of NoSchedule

Taints

  1. Taints are defined on the node; the template first

    [root@master-0 ~]# kubectl explain node.spec.taints
    KIND:     Node
    VERSION:  v1
    
    RESOURCE: taints <[]Object>
    
    DESCRIPTION:
        If specified, the node's taints.
    
        The node this Taint is attached to has the "effect" on any pod that does
        not tolerate the Taint.
    
    FIELDS:
      effect <string> -required-
        Required. The effect of the taint on pods that do not tolerate the taint.
        Valid effects are NoSchedule, PreferNoSchedule and NoExecute.
    
      key <string> -required-
        Required. The taint key to be applied to a node.
    
      timeAdded <string>
        TimeAdded represents the time at which the taint was added. It is only
        written for NoExecute taints.
    
      value <string>
        The taint value corresponding to the taint key.
    
  2. Command-line form

    Usage:
      kubectl taint NODE NAME KEY_1=VAL_1:TAINT_EFFECT_1 ... KEY_N=VAL_N:TAINT_EFFECT_N [options]
    [root@master-0 ~]# kubectl taint node slave-0.shared node-type=production:NoSchedule
    node/slave-0.shared tainted
    [root@master-0 ~]# kubectl get pod -owide         # none of the pods has any toleration
    NAME                           READY   STATUS              RESTARTS   AGE     IP       NODE             NOMINATED NODE   READINESS GATES
    myapp-98skj                    0/1     ContainerCreating   0          6m27s   <none>   slave-1.shared   <none>           <none>
    myapp-deploy-5d645d645-7dsg5   0/1     ContainerCreating   0          30s     <none>   slave-1.shared   <none>           <none>
    myapp-deploy-5d645d645-fm8tm   0/1     ContainerCreating   0          30s     <none>   slave-1.shared   <none>           <none>
    myapp-deploy-5d645d645-wskql   0/1     ContainerCreating   0          30s     <none>   slave-1.shared   <none>           <none>
    myapp-ms6lv                    0/1     ContainerCreating   0          6m27s   <none>   slave-1.shared   <none>           <none>
    [root@master-0 ~]# kubectl taint node slave-1.shared node-type=dev:NoExecute
    node/slave-1.shared tainted
    [root@master-0 ~]# kubectl get pod
    NAME                           READY   STATUS    RESTARTS   AGE
    myapp-deploy-5d645d645-dppsh   0/1     Pending   0          23s
    myapp-deploy-5d645d645-pcpfp   0/1     Pending   0          23s
    myapp-deploy-5d645d645-rtghf   0/1     Pending   0          23s
    myapp-gmxm6                    0/1     Pending   0          23s
    myapp-j8dhg                    0/1     Pending   0          23s
    

Tolerations

When defining tolerations on a Pod object, two operators are supported

  • Equality comparison: the toleration and the taint must match exactly on key, value and effect
  • Existence check: the key and effect must match exactly, and the toleration's value field must be left empty
  1. Toleration template

    [root@master-0 ~]# kubectl explain pods.spec.tolerations
    KIND:     Pod
    VERSION:  v1
    
    RESOURCE: tolerations <[]Object>
    
    DESCRIPTION:
        If specified, the pod's tolerations.
    
        The pod this Toleration is attached to tolerates any taint that matches the
        triple <key,value,effect> using the matching operator <operator>.
    
    FIELDS:
      effect <string>
        Effect indicates the taint effect to match. Empty means match all taint
        effects. When specified, allowed values are NoSchedule, PreferNoSchedule
        and NoExecute.
    
      key <string>
        Key is the taint key that the toleration applies to. Empty means match all
        taint keys. If the key is empty, operator must be Exists; this combination
        means to match all values and all keys.
    
      operator <string>            # Equal for equality comparison, Exists for the existence check
        Operator represents a key's relationship to the value. Valid operators are
        Exists and Equal. Defaults to Equal. Exists is equivalent to wildcard for
        value, so that a pod can tolerate all taints of a particular category.
    
      tolerationSeconds <integer>    # how long the taint is tolerated
        TolerationSeconds represents the period of time the toleration (which must
        be of effect NoExecute, otherwise this field is ignored) tolerates the
        taint. By default, it is not set, which means tolerate the taint forever
        (do not evict). Zero and negative values will be treated as 0 (evict
        immediately) by the system.
    
      value <string>
        Value is the taint value the toleration matches to. If the operator is
        Exists, the value should be empty, otherwise just a regular string.
    
  2. Define a toleration list using equality comparison

    [root@master-0 ~]# cat deploy.yaml
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: myapp-deploy
      namespace: default
    spec:
      replicas: 3
      selector:
        matchLabels:
          app: myapp
          release: canary
      template:
        metadata:
          labels:
            app: myapp
            release: canary
        spec:
          containers:
          - name: myapp
            image: nginx:1.7
            ports:
            - name: http
              containerPort: 80
          tolerations:
          - key: "node-type"
            operator: "Equal"
            value: "production"
            effect: "NoSchedule"
    [root@master-0 ~]# kubectl apply -f deploy.yaml
    deployment.apps/myapp-deploy configured
    [root@master-0 ~]# kubectl get pod  -owide
    NAME                           READY   STATUS              RESTARTS   AGE     IP       NODE             NOMINATED NODE   READINESS GATES
    myapp-deploy-9f9d6df86-8w6qb   0/1     ContainerCreating   0          2s      <none>   slave-0.shared   <none>           <none>
    myapp-deploy-9f9d6df86-d6vjg   0/1     ContainerCreating   0          2s      <none>   slave-0.shared   <none>           <none>
    myapp-deploy-9f9d6df86-lhh78   0/1     ContainerCreating   0          2s      <none>   slave-0.shared   <none>           <none>
    
  3. Define a toleration list using the existence check

    [root@master-0 ~]# cat deploy.yaml
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: myapp-deploy
      namespace: default
    spec:
      replicas: 3
      selector:
        matchLabels:
          app: myapp
          release: canary
      template:
        metadata:
          labels:
            app: myapp
            release: canary
        spec:
          containers:
          - name: myapp
            image: nginx:1.7
            ports:
            - name: http
              containerPort: 80
          tolerations:
          - key: "node-type"
            operator: "Exists"
            value: ""
            effect: ""                  # with Exists the value acts as a wildcard, so effect can be used to select nodes; if it were NoSchedule here, all pods would be scheduled onto slave-0
    [root@master-0 ~]# kubectl apply -f deploy.yaml
    deployment.apps/myapp-deploy configured
    [root@master-0 ~]# kubectl get pod -owide
    NAME                            READY   STATUS    RESTARTS   AGE   IP            NODE             NOMINATED NODE   READINESS GATES
    myapp-deploy-7c7968f87c-d6b69   1/1     Running   0          12s   10.244.1.24   slave-1.shared   <none>           <none>
    myapp-deploy-7c7968f87c-f798g   1/1     Running   0          12s   10.244.2.21   slave-0.shared   <none>           <none>
    myapp-deploy-7c7968f87c-nvf9m   1/1     Running   0          12s   10.244.2.22   slave-0.shared   <none>           <none>
    

Automatic taints on problem nodes

Since version 1.6 Kubernetes supports marking problem nodes automatically with taints: the node controller adds taint information to a node when specific conditions occur. These taints all carry the NoExecute effect, so existing Pod objects that do not tolerate them are evicted as well. The built-in taints of this kind currently include the following (a toleration sketch follows the list)

  • node.kubernetes.io/not-ready: added automatically when the node enters the NotReady state
  • node.alpha.kubernetes.io/unreachable: added automatically when the node becomes unreachable
  • node.kubernetes.io/out-of-disk: added automatically when the node enters the OutOfDisk state
  • node.kubernetes.io/memory-pressure: the node is under memory pressure
  • node.kubernetes.io/disk-pressure: the node is under disk pressure
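A minimal sketch (not from the original article) of a pod spec fragment that tolerates one of these automatic taints for a limited time, so the pod survives a brief node outage before being evicted; the 120-second value is an arbitrary example

tolerations:
- key: "node.kubernetes.io/not-ready"
  operator: "Exists"
  effect: "NoExecute"
  tolerationSeconds: 120               # stay on the NotReady node for up to two minutes before eviction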
