k8s Pod Scheduling

Posted by 路過的柚子廚 on 2021-07-14

Pod Scheduling

By default, which node a pod runs on is worked out by the scheduler component using its algorithms; the process is not under manual control.

In practice, however, this is often not enough: in many cases we want certain pods to land on certain nodes. So how is that done?

This requires understanding how k8s schedules pods. k8s offers four broad categories of scheduling:

  • Automatic scheduling: the node a pod runs on is decided entirely by the scheduler's algorithms
  • Directed scheduling: nodeName, nodeSelector
  • Affinity scheduling: nodeAffinity, podAffinity, podAntiAffinity
  • Taint (toleration) scheduling: taints, tolerations

Directed Scheduling

Directed scheduling means declaring nodeName or nodeSelector on a pod to place it on the desired node. Note that this scheduling is mandatory: even if the target node does not exist, the pod is still bound to it; it simply fails to run.

nodeName

nodeName forcibly constrains a pod to the node with the given name. This approach skips the scheduler's logic entirely and writes the pod straight into the target node's pod list.

Let's try it out. Create a pod-nodename.yaml file:

apiVersion: v1
kind: Pod
metadata: 
  name: pod-nodename
  namespace: dev
spec:
  containers:
  - name: nginx
    image: nginx:1.17.1
  nodeName: node1         # schedule onto node1

Apply the config file:

[root@master ~]# kubectl create -f pod-nodename.yaml 
pod/pod-nodename created
[root@master ~]# kubectl get pod pod-nodename -n dev -o wide
NAME           READY   STATUS    RESTARTS   AGE   IP            NODE    NOMINATED NODE   READINESS GATES
pod-nodename   1/1     Running   0          49s   10.244.2.35   node1   <none>           <none>

We can see the pod is running on node1.

Next, delete the pod and change the node in the config file to node3:

apiVersion: v1
kind: Pod
metadata: 
  name: pod-nodename
  namespace: dev
spec:
  containers:
  - name: nginx
    image: nginx:1.17.1
  nodeName: node3         # schedule onto node3 (which does not exist)

Apply the config file:

[root@master ~]# kubectl delete -f pod-nodename.yaml 
pod "pod-nodename" deleted
[root@master ~]# vim pod-nodename.yaml 
[root@master ~]# kubectl create -f pod-nodename.yaml 
pod/pod-nodename created
[root@master ~]# kubectl get pod pod-nodename -n dev -o wide
NAME           READY   STATUS    RESTARTS   AGE   IP       NODE    NOMINATED NODE   READINESS GATES
pod-nodename   0/1     Pending   0          21s   <none>   node3   <none>           <none>

Although the pod was assigned to node3, node3 does not exist, so the pod cannot start and stays Pending.


nodeSelector

nodeSelector schedules pods onto nodes that carry specified labels. It is implemented through the k8s label-selector mechanism: before the pod is created, the scheduler uses the MatchNodeSelector strategy to match labels, finds the target node, and schedules the pod onto it. The match is a hard constraint.

Let's try it out:

1. First, add a label to each node:

[root@master ~]# kubectl label nodes node1 nodeenv=pro
node/node1 labeled
[root@master ~]# kubectl label nodes node2 nodeenv=test
node/node2 labeled

2. Create a pod-nodeselector.yaml file and use it to create a pod:

apiVersion: v1
kind: Pod
metadata: 
  name: pod-nodeselector
  namespace: dev
spec:
  containers:
  - name: nginx
    image: nginx:1.17.1
  nodeSelector: 
    nodeenv: pro  # schedule onto a node labelled nodeenv=pro

Apply the config file (process omitted here):

[root@master ~]# kubectl get pod pod-nodeselector -n dev -o wide
NAME               READY   STATUS    RESTARTS   AGE   IP            NODE    NOMINATED NODE   READINESS GATES
pod-nodeselector   1/1     Running   0          22m   10.244.2.36   node1   <none>           <none>

We can see the pod was scheduled to node1.


Affinity Scheduling

The two directed-scheduling methods above are convenient to use, but they have a problem: if no node satisfies the condition, the pod will not run, even when usable nodes remain in the cluster. This limits where they can be used.

To address this, k8s also provides affinity scheduling. It extends nodeSelector so that, through configuration, the scheduler prefers nodes that satisfy the conditions but can still fall back to nodes that do not, which makes scheduling more flexible.

Affinity comes in three flavours:

  • nodeAffinity (node affinity): targets nodes; decides which nodes a pod may be scheduled to
  • podAffinity (pod affinity): targets pods; decides which existing pods a new pod may share a topology domain with
  • podAntiAffinity (pod anti-affinity): targets pods; decides which existing pods a new pod must not share a topology domain with

When to use affinity and anti-affinity:

  • Affinity: if two applications interact frequently, affinity keeps them as close as possible, reducing the performance cost of network communication.
  • Anti-affinity: when an application is deployed as multiple replicas, anti-affinity spreads the instances across nodes, improving the service's availability.

nodeAffinity

Notes on the operators:

- matchExpressions:
  - key: nodeenv         # match nodes that have a label with key nodeenv
    operator: Exists
  - key: nodeenv         # match nodes whose nodeenv label value is "xxx" or "yyy"
    operator: In
    values: ["xxx","yyy"]
  - key: nodeenv         # match nodes whose nodeenv value is greater than "xxx" (Gt/Lt compare values as integers)
    operator: Gt
    values: ["xxx"]

Let's first demonstrate requiredDuringSchedulingIgnoredDuringExecution, the hard rule.

Create pod-nodeaffinity-required.yaml:

apiVersion: v1
kind: Pod
metadata: 
  name: pod-nodeaffinity-required
  namespace: dev
spec:
  containers:
  - name: nginx
    image: nginx:1.17.1
  affinity:   # affinity settings
    nodeAffinity:   # node affinity
      requiredDuringSchedulingIgnoredDuringExecution:  # hard rule
        nodeSelectorTerms:
        - matchExpressions:
          - key: nodeenv
            operator: In
            values: ["xxx","yyy"]

Create and apply the config file:

[root@master ~]# vim pod-nodeaffinity-required.yaml
[root@master ~]# kubectl create -f pod-nodeaffinity-required.yaml 
pod/pod-nodeaffinity-required created
[root@master ~]# kubectl get pod pod-nodeaffinity-required -n dev
NAME                        READY   STATUS    RESTARTS   AGE
pod-nodeaffinity-required   0/1     Pending   0          14s

The pod fails to start. Check the detailed description:

[root@master ~]# kubectl describe pod pod-nodeaffinity-required -n dev
Events:
  Type     Reason            Age        From               Message
  ----     ------            ----       ----               -------
  Warning  FailedScheduling  <unknown>  default-scheduler  0/3 nodes are available: 3 node(s) didn't match node selector.
  Warning  FailedScheduling  <unknown>  default-scheduler  0/3 nodes are available: 3 node(s) didn't match node selector.

Delete the pod, then edit values in the config file:

apiVersion: v1
kind: Pod
metadata: 
  name: pod-nodeaffinity-required
  namespace: dev
spec:
  containers:
  - name: nginx
    image: nginx:1.17.1
  affinity:   # affinity settings
    nodeAffinity:   # node affinity
      requiredDuringSchedulingIgnoredDuringExecution:  # hard rule
        nodeSelectorTerms:
        - matchExpressions:
          - key: nodeenv
            operator: In
            values: ["pro","yyy"]
[root@master ~]# kubectl delete -f pod-nodeaffinity-required.yaml 
pod "pod-nodeaffinity-required" deleted
[root@master ~]# vim pod-nodeaffinity-required.yaml 
[root@master ~]# kubectl create -f pod-nodeaffinity-required.yaml 
pod/pod-nodeaffinity-required created
[root@master ~]# kubectl get pod pod-nodeaffinity-required -n dev
NAME                        READY   STATUS    RESTARTS   AGE
pod-nodeaffinity-required   1/1     Running   0          35s

This time the pod is created successfully.


Next, let's demonstrate preferredDuringSchedulingIgnoredDuringExecution, the soft rule.

Create pod-nodeaffinity-preferred.yaml:

apiVersion: v1
kind: Pod
metadata: 
  name: pod-nodeaffinity-preferred
  namespace: dev
spec:
  containers:
  - name: nginx
    image: nginx:1.17.1
  affinity:   # affinity settings
    nodeAffinity:   # node affinity
      preferredDuringSchedulingIgnoredDuringExecution:  # soft rule
      - weight: 1
        preference:
          matchExpressions:
          - key: nodeenv
            operator: In
            values: ["xxx","yyy"]

Create and apply the config file:

[root@master ~]# vim pod-nodeaffinity-preferred.yaml
[root@master ~]# kubectl create -f pod-nodeaffinity-preferred.yaml 
pod/pod-nodeaffinity-preferred created
[root@master ~]# kubectl get pod pod-nodeaffinity-preferred -n dev -o wide
NAME                         READY   STATUS    RESTARTS   AGE   IP            NODE    NOMINATED NODE   READINESS GATES
pod-nodeaffinity-preferred   1/1     Running   0          23s   10.244.2.38   node1   <none>           <none>

The pod was scheduled to node1 even though no node carries nodeenv=xxx or nodeenv=yyy: a preferred rule only affects scoring, so when nothing matches, the scheduler still places the pod on an available node.
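
The weight field (1 to 100) lets several soft rules be ranked: when scoring a node, the scheduler adds up the weights of every preference the node matches. A hypothetical sketch that prefers nodeenv=pro over nodeenv=test:

preferredDuringSchedulingIgnoredDuringExecution:
- weight: 80              # strongly prefer nodes labelled nodeenv=pro
  preference:
    matchExpressions:
    - key: nodeenv
      operator: In
      values: ["pro"]
- weight: 20              # weakly prefer nodes labelled nodeenv=test
  preference:
    matchExpressions:
    - key: nodeenv
      operator: In
      values: ["test"]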


Notes on configuring nodeAffinity rules:

  • If nodeSelector and nodeAffinity are both defined, both conditions must be satisfied for the pod to run on a node (see the sketch after this list)
  • If nodeAffinity specifies multiple nodeSelectorTerms, matching any one of them is enough
  • If one nodeSelectorTerms entry contains multiple matchExpressions, a node must satisfy all of them to match
  • If the labels of the node a pod runs on change while the pod is running, and the pod's node affinity is no longer satisfied, the system ignores the change
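
As an illustration of the first three rules, here is a hypothetical pod spec (not part of the walkthrough) that combines nodeSelector with a required nodeAffinity holding two nodeSelectorTerms; a node must carry nodeenv=pro and match at least one of the two terms:

apiVersion: v1
kind: Pod
metadata:
  name: pod-affinity-combined   # hypothetical name
  namespace: dev
spec:
  containers:
  - name: nginx
    image: nginx:1.17.1
  nodeSelector:
    nodeenv: pro                # condition 1: this must match, AND...
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:      # condition 2: ...at least ONE term below must match
        - matchExpressions:     # term A: every expression inside a term must match
          - key: nodeenv
            operator: In
            values: ["pro"]
        - matchExpressions:     # term B (ORed with term A)
          - key: nodeenv
            operator: Exists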


podAffinity

podAffinity takes running pods as the reference and places a newly created pod in the same topology domain as the reference pod.

topologyKey specifies the scope, i.e. the topology domain, used during scheduling; for example (a zone-level sketch follows the list):

  • kubernetes.io/hostname distinguishes at the level of the individual node
  • beta.kubernetes.io/os distinguishes by the node's operating system type
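
For example, assuming the nodes carry the standard zone label topology.kubernetes.io/zone, a pod affinity term scoped to a zone rather than a single node would look like this sketch; the reference pod then only needs to be somewhere in the same zone:

podAffinity:
  requiredDuringSchedulingIgnoredDuringExecution:
  - labelSelector:
      matchExpressions:
      - key: podenv
        operator: In
        values: ["pro"]
    topologyKey: topology.kubernetes.io/zone   # co-locate per zone instead of per node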

Next, let's demonstrate requiredDuringSchedulingIgnoredDuringExecution.

Create a reference pod, pod-podaffinity-target.yaml:

apiVersion: v1
kind: Pod
metadata: 
  name: pod-podaffinity-target
  namespace: dev
  labels:
    podenv: pro   # set the label
spec:
  containers:
  - name: nginx
    image: nginx:1.17.1
  nodeName: node1  # pin the reference pod to node1
[root@master ~]# vim pod-podaffinity-target.yaml
[root@master ~]# kubectl create -f pod-podaffinity-target.yaml 
pod/pod-podaffinity-target created
[root@master ~]# kubectl get pod pod-podaffinity-target -n dev -o wide --show-labels
NAME                     READY   STATUS    RESTARTS   AGE     IP            NODE    NOMINATED NODE   READINESS GATES   LABELS
pod-podaffinity-target   1/1     Running   0          2m47s   10.244.2.39   node1   <none>           <none>            podenv=pro

Create pod-podaffinity-required.yaml:

apiVersion: v1
kind: Pod
metadata: 
  name: pod-podaffinity-required
  namespace: dev
spec:
  containers:
  - name: nginx
    image: nginx:1.17.1
  affinity:   # affinity settings
    podAffinity:   # pod affinity
      requiredDuringSchedulingIgnoredDuringExecution:  # hard rule
      - labelSelector:
          matchExpressions: # match pods whose podenv label value is in ["xxx","yyy"]
          - key: podenv
            operator: In
            values: ["xxx","yyy"]
        topologyKey: kubernetes.io/hostname
[root@master ~]# vim pod-podaffinity-required.yaml
[root@master ~]# kubectl create -f pod-podaffinity-required.yaml 
pod/pod-podaffinity-required created
[root@master ~]# kubectl get pod pod-podaffinity-required -n dev -o wide --show-labels
NAME                       READY   STATUS    RESTARTS   AGE   IP       NODE     NOMINATED NODE   READINESS GATES   LABELS
pod-podaffinity-required   0/1     Pending   0          24s   <none>   <none>   <none>           <none>            <none>

The scheduling fails. Check the scheduling events:

[root@master ~]# kubectl describe pod pod-podaffinity-required -n dev
Events:
  Type     Reason            Age        From               Message
  ----     ------            ----       ----               -------
  Warning  FailedScheduling  <unknown>  default-scheduler  0/3 nodes are available: 1 node(s) had taints that the pod didn't tolerate, 2 node(s) didn't match pod affinity rules.
  Warning  FailedScheduling  <unknown>  default-scheduler  0/3 nodes are available: 1 node(s) had taints that the pod didn't tolerate, 2 node(s) didn't match pod affinity rules.

Delete the pod and edit the config file:

apiVersion: v1
kind: Pod
metadata: 
  name: pod-podaffinity-required
  namespace: dev
spec:
  containers:
  - name: nginx
    image: nginx:1.17.1
  affinity:   # affinity settings
    podAffinity:   # pod affinity
      requiredDuringSchedulingIgnoredDuringExecution:  # hard rule
      - labelSelector:
          matchExpressions: # match pods whose podenv label value is in ["pro","yyy"]
          - key: podenv
            operator: In
            values: ["pro","yyy"]
        topologyKey: kubernetes.io/hostname
[root@master ~]# kubectl delete -f pod-podaffinity-required.yaml 
pod "pod-podaffinity-required" deleted
[root@master ~]# vim pod-podaffinity-required.yaml 
[root@master ~]# kubectl create -f pod-podaffinity-required.yaml 
pod/pod-podaffinity-required created
[root@master ~]# kubectl get pod pod-podaffinity-required -n dev -o wide --show-labels
NAME                       READY   STATUS    RESTARTS   AGE   IP            NODE    NOMINATED NODE   READINESS GATES   LABELS
pod-podaffinity-required   1/1     Running   0          11s   10.244.2.40   node1   <none>           <none>            <none>

The pod is now Running on node1, co-located with the reference pod.

podAntiAffinity

podAntiAffinity takes running pods as the reference and keeps a newly created pod out of the topology domain of the reference pod.

Its configuration options are the same as podAffinity's, so rather than going through them again, let's jump straight to a test case.

Keep using the reference pod from the previous case:

[root@master ~]# kubectl get pod -n dev -o wide --show-labels
NAME                       READY   STATUS    RESTARTS   AGE   IP            NODE    NOMINATED NODE   READINESS GATES   LABELS
pod-podaffinity-required   1/1     Running   2          24h   10.244.2.57   node1   <none>           <none>            <none>
pod-podaffinity-target     1/1     Running   2          24h   10.244.2.56   node1   <none>           <none>            podenv=pro

Create pod-podantiaffinity-required.yaml with the following content:

apiVersion: v1
kind: Pod
metadata: 
  name: pod-podantiaffinity-required
  namespace: dev
spec:
  containers:
  - name: nginx
    image: nginx:1.17.1
  affinity:   # affinity settings
    podAntiAffinity:   # pod anti-affinity
      requiredDuringSchedulingIgnoredDuringExecution:  # hard rule
      - labelSelector:
          matchExpressions: # match pods whose podenv label value is in ["pro"]
          - key: podenv
            operator: In
            values: ["pro"]
        topologyKey: kubernetes.io/hostname

Apply the config file:

[root@master ~]# vim pod-podantiaffinity-required.yaml
[root@master ~]# kubectl create -f pod-podantiaffinity-required.yaml 
pod/pod-podantiaffinity-required created
[root@master ~]# kubectl get pod -n dev -o wide
NAME                           READY   STATUS    RESTARTS   AGE   IP            NODE    NOMINATED NODE   READINESS GATES
pod-podaffinity-required       1/1     Running   2          24h   10.244.2.57   node1   <none>           <none>
pod-podaffinity-target         1/1     Running   2          24h   10.244.2.56   node1   <none>           <none>
pod-podantiaffinity-required   1/1     Running   0          10s   10.244.1.57   node2   <none>           <none>

We can see the pod was scheduled to node2, away from the reference pod on node1.
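
This is exactly the high-availability pattern mentioned at the start of this section: inside a Deployment, pointing podAntiAffinity at the Deployment's own label spreads the replicas across nodes. A hypothetical sketch (the name nginx-spread and its label are made up):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-spread            # hypothetical name
  namespace: dev
spec:
  replicas: 2
  selector:
    matchLabels:
      app: nginx-spread
  template:
    metadata:
      labels:
        app: nginx-spread
    spec:
      containers:
      - name: nginx
        image: nginx:1.17.1
      affinity:
        podAntiAffinity:        # repel pods carrying this Deployment's own label
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: app
                operator: In
                values: ["nginx-spread"]
            topologyKey: kubernetes.io/hostname   # at most one replica per node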


Taints and Tolerations

Taints

The scheduling methods so far all take the pod's point of view: we add properties to the pod to decide whether it should be scheduled onto a given node. We can also take the node's point of view and add taint properties to a node to decide whether pods are allowed to be scheduled onto it at all.

Once a node is tainted, a repelling relationship forms between it and pods: the node refuses to let pods be scheduled onto it and can even evict pods already running there.

A taint has the format key=value:effect, where key and value are the taint's label and effect describes what the taint does. Three effects are supported:

  • PreferNoSchedule: k8s tries to avoid placing pods on a node with this taint, unless no other node is available
  • NoSchedule: k8s will not place new pods on a node with this taint, but pods already on the node are unaffected
  • NoExecute: k8s will not place new pods on a node with this taint, and it also evicts the pods already on the node

The kubectl commands for setting and removing taints are:

# set a taint
kubectl taint nodes nodeName key=value:effect

# remove the taint with the given key and effect
kubectl taint nodes nodeName key:effect-

# remove all taints with the given key
kubectl taint nodes nodeName key-

Next, let's demonstrate the effect of taints:

  1. Prepare node1 (to make the effect more obvious, temporarily stop node2)
  2. Give node1 the taint tag=ayanami:PreferNoSchedule, then create pod1 (pod1 runs)
  3. Change node1's taint to tag=ayanami:NoSchedule, then create pod2 (pod1 keeps running, pod2 fails)
  4. Change node1's taint to tag=ayanami:NoExecute, then create pod3 (all three pods fail)

Taint node1 (PreferNoSchedule):

[root@master ~]# kubectl taint nodes node1 tag=ayanami:PreferNoSchedule
node/node1 tainted

Create pod1:

[root@master ~]# kubectl run taint1 --image=nginx:1.17.1 -n dev
kubectl run --generator=deployment/apps.v1 is DEPRECATED and will be removed in a future version. Use kubectl run --generator=run-pod/v1 or kubectl create instead.

[root@master ~]# kubectl get pod -n dev
NAME                      READY   STATUS    RESTARTS   AGE
taint1-766c47bf55-lhmcj   1/1     Running   0          6m16s

Re-taint node1 (remove PreferNoSchedule, set NoSchedule):

[root@master ~]# kubectl taint nodes node1 tag:PreferNoSchedule-
node/node1 untainted
[root@master ~]# kubectl taint nodes node1 tag=ayanami:NoSchedule
node/node1 tainted

Check the pods again; nothing has changed:

[root@master ~]# kubectl get pod -n dev
NAME                      READY   STATUS    RESTARTS   AGE
taint1-766c47bf55-lhmcj   1/1     Running   0          10m

Create a new pod, taint2, and check:

[root@master ~]# kubectl run taint2 --image=nginx:1.17.1 -n dev
kubectl run --generator=deployment/apps.v1 is DEPRECATED and will be removed in a future version. Use kubectl run --generator=run-pod/v1 or kubectl create instead.
deployment.apps/taint2 created
[root@master ~]# kubectl get pod -n dev
NAME                      READY   STATUS    RESTARTS   AGE
taint1-766c47bf55-lhmcj   1/1     Running   0          11m
taint2-84946958cf-h9765   0/1     Pending   0          15s

The new pod cannot reach Running. Describe taint2:

[root@master ~]# kubectl describe pod taint2 -n dev
Events:
  Type     Reason            Age        From               Message
  ----     ------            ----       ----               -------
  Warning  FailedScheduling  <unknown>  default-scheduler  0/3 nodes are available: 3 node(s) had taints that the pod didn't tolerate.
  Warning  FailedScheduling  <unknown>  default-scheduler  0/3 nodes are available: 3 node(s) had taints that the pod didn't tolerate.

Re-taint node1 (remove NoSchedule, set NoExecute):

[root@master ~]# kubectl taint node node1 tag:NoSchedule-
node/node1 untainted
[root@master ~]# kubectl taint node node1 tag=ayanami:NoExecute
node/node1 tainted
[root@master ~]# kubectl get pod -n dev
NAME                      READY   STATUS    RESTARTS   AGE
taint1-766c47bf55-fdtqw   0/1     Pending   0          30s
taint2-84946958cf-26rfx   0/1     Pending   0          30s

Both pods have stopped: the NoExecute taint evicted the existing pods, and since kubectl run created them as Deployments, replacement pods were spun up but stay Pending. Now create one more pod, taint3:

[root@master ~]# kubectl run taint3 --image=nginx:1.17.1 -n dev
kubectl run --generator=deployment/apps.v1 is DEPRECATED and will be removed in a future version. Use kubectl run --generator=run-pod/v1 or kubectl create instead.
[root@master ~]# kubectl get pod -n dev
NAME                      READY   STATUS    RESTARTS   AGE
taint1-766c47bf55-fdtqw   0/1     Pending   0          97s
taint2-84946958cf-26rfx   0/1     Pending   0          97s
taint3-57d45f9d4c-68pwr   0/1     Pending   0          9s

The new pod cannot be scheduled either.

Side note:

A cluster built with kubeadm gives the master node a taint by default, so pods are not scheduled onto the master:

[root@master ~]# kubectl describe node master
Taints:             node-role.kubernetes.io/master:NoSchedule
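
If a pod genuinely must run on the master (a monitoring agent, say), it has to tolerate this taint. Tolerations are the topic of the next section, but as a preview, the pod spec would need roughly the following (a sketch, assuming the default kubeadm taint shown above):

tolerations:
- key: "node-role.kubernetes.io/master"
  operator: "Exists"        # tolerate the taint regardless of its value
  effect: "NoSchedule"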

Tolerations

Taints, as described above, let a node refuse pods. But what if we do want to schedule a pod onto a tainted node? That is where tolerations come in.

A taint means refusal; a toleration means ignoring it: the node uses the taint to refuse the pod, and the pod uses the toleration to ignore the refusal.

Let's look at the effect through a case first:

  1. In the previous section we put a NoExecute taint on node1, so no pod can be scheduled onto it
  2. In this section we give a pod a toleration and schedule it onto node1

Create pod-toleration.yaml with the following content:

apiVersion: v1
kind: Pod
metadata: 
  name: pod-toleration
  namespace: dev
spec:
  containers:
  - name: nginx
    image: nginx:1.17.1
  tolerations:    # add a toleration
  - key: "tag"   # key of the taint to tolerate
    operator: "Equal"   # operator
    value: "ayanami"   # value of the taint to tolerate
    effect: "NoExecute"   # effect to tolerate; must match the taint's effect here

Apply the config file:

[root@master ~]# vim pod-toleration.yaml
[root@master ~]# kubectl create -f pod-toleration.yaml 
pod/pod-toleration created
[root@master ~]# kubectl get pod -n dev
NAME                      READY   STATUS    RESTARTS   AGE
pod-toleration            1/1     Running   0          9s
taint1-766c47bf55-fdtqw   0/1     Pending   0          34m
taint2-84946958cf-26rfx   0/1     Pending   0          34m
taint3-57d45f9d4c-68pwr   0/1     Pending   0          33m

Detailed toleration configuration:

  • key: the key of the taint to tolerate; an empty key matches all keys
  • value: the value of the taint to tolerate
  • operator: the key-value operator; supports Equal (the default) and Exists
  • effect: the effect of the taint to tolerate; an empty effect matches all effects
  • tolerationSeconds: toleration period; takes effect when the effect is NoExecute and sets how long the pod may stay on the node after the taint is added
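
A sketch combining these fields, with placeholder values: with operator Exists the value field must be omitted, and tolerationSeconds bounds how long the pod keeps running after a NoExecute taint appears:

tolerations:
- key: "tag"                # tolerate any taint whose key is "tag"...
  operator: "Exists"        # ...whatever its value (no value field allowed)
  effect: "NoExecute"
  tolerationSeconds: 3600   # the pod is evicted 3600s after the taint is added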

