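Pod Scheduling
By default, the scheduler component decides which node a pod runs on by applying its scheduling algorithms, and the process involves no manual control.
In practice this is often not enough, because in many cases we want to steer certain pods onto certain nodes. How can that be done?
Doing so requires understanding the rules k8s uses to schedule pods. k8s provides four broad categories of scheduling:
- Automatic scheduling: the node is chosen entirely by the scheduler through a series of algorithms
- Directed scheduling: nodeName, nodeSelector
- Affinity scheduling: nodeAffinity, podAffinity, podAntiAffinity
- Taint (toleration) scheduling: taints, tolerations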
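Directed Scheduling
Directed scheduling means declaring nodeName or nodeSelector on a pod so that it is scheduled onto the desired node. Note that this constraint is mandatory: if the target node does not exist, the pod is not placed anywhere else; it simply fails to run.
nodeName
nodeName forces the pod onto the node with the specified name. This approach skips the scheduler's logic altogether: the pod is assigned directly to the named node and added to that node's pod list.
Next, try it out: create a file named pod-nodename.yaml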
    apiVersion: v1
    kind: Pod
    metadata:
      name: pod-nodename
      namespace: dev
    spec:
      containers:
      - name: nginx
        image: nginx:1.17.1
      nodeName: node1 # schedule onto node1
Apply the configuration file
    [root@master ~]# kubectl create -f pod-nodename.yaml
    pod/pod-nodename created
    [root@master ~]# kubectl get pod pod-nodename -n dev -o wide
    NAME           READY   STATUS    RESTARTS   AGE   IP            NODE    NOMINATED NODE   READINESS GATES
    pod-nodename   1/1     Running   0          49s   10.244.2.35   node1   <none>           <none>
The pod is running on node1.
Next, delete the pod and change nodeName in the configuration file to node3
    apiVersion: v1
    kind: Pod
    metadata:
      name: pod-nodename
      namespace: dev
    spec:
      containers:
      - name: nginx
        image: nginx:1.17.1
      nodeName: node3 # schedule onto node3, which does not exist in this cluster
Apply the configuration file
    [root@master ~]# kubectl delete -f pod-nodename.yaml
    pod "pod-nodename" deleted
    [root@master ~]# vim pod-nodename.yaml
    [root@master ~]# kubectl create -f pod-nodename.yaml
    pod/pod-nodename created
    [root@master ~]# kubectl get pod pod-nodename -n dev -o wide
    NAME           READY   STATUS    RESTARTS   AGE   IP       NODE    NOMINATED NODE   READINESS GATES
    pod-nodename   0/1     Pending   0          21s   <none>   node3   <none>           <none>
The pod has been assigned to node3, but because node3 does not exist, the pod cannot start.
nodeSelector
nodeSelector schedules a pod onto nodes that carry the specified labels. It is implemented with the k8s label-selector mechanism: before the pod is placed, the scheduler uses the MatchNodeSelector scheduling policy to match labels, finds the target node, and then schedules the pod onto it. This matching rule is a mandatory constraint.
Next, try it out:
1. First, add a label to each node
    [root@master ~]# kubectl label nodes node1 nodeenv=pro
    node/node1 labeled
    [root@master ~]# kubectl label nodes node2 nodeenv=test
    node/node2 labeled
2. Create a file named pod-nodeselector.yaml and use it to create the pod
    apiVersion: v1
    kind: Pod
    metadata:
      name: pod-nodeselector
      namespace: dev
    spec:
      containers:
      - name: nginx
        image: nginx:1.17.1
      nodeSelector:
        nodeenv: pro # schedule onto a node that has the label nodeenv=pro
Apply the configuration file (the steps are omitted here)
    [root@master ~]# kubectl get pod pod-nodeselector -n dev -o wide
    NAME               READY   STATUS    RESTARTS   AGE   IP            NODE    NOMINATED NODE   READINESS GATES
    pod-nodeselector   1/1     Running   0          22m   10.244.2.36   node1   <none>           <none>
The pod has been scheduled onto node1.
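The label matching behind nodeSelector can be modeled in a few lines of Python. This is only a sketch of the rule (every key/value pair in the selector must appear among the node's labels), not the scheduler's actual code; the function name matches_node_selector and the nodes dictionary are made up for the illustration.

```python
def matches_node_selector(node_labels, node_selector):
    """Return True when every key/value pair of the selector is present on the node."""
    return all(node_labels.get(key) == value for key, value in node_selector.items())


# Labels as set in the walkthrough: node1 has nodeenv=pro, node2 has nodeenv=test.
nodes = {
    "node1": {"nodeenv": "pro"},
    "node2": {"nodeenv": "test"},
}
selector = {"nodeenv": "pro"}

candidates = [name for name, labels in nodes.items() if matches_node_selector(labels, selector)]
print(candidates)  # ['node1']
```

An empty selector matches every node, and a selector that no node satisfies leaves the list empty, which corresponds to a pod that cannot be scheduled.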
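Affinity Scheduling
The two directed scheduling methods above are very convenient, but they share one problem: if no node satisfies the condition, the pod will not run, even when the cluster still has usable nodes. This limits where they can be applied.
To address this, k8s also provides affinity scheduling. It extends nodeSelector so that, through configuration, the scheduler prefers nodes that satisfy the condition but can fall back to nodes that do not, which makes scheduling more flexible.
Affinity falls into three main types:
- nodeAffinity (node affinity): targets nodes and decides which nodes a pod may be scheduled onto
- podAffinity (pod affinity): targets pods and decides which existing pods a new pod may share a topology domain with
- podAntiAffinity (pod anti-affinity): targets pods and decides which existing pods a new pod must not share a topology domain with
When to use affinity and anti-affinity:
- Affinity: if two applications interact frequently, use affinity to place them as close together as possible and reduce the performance cost of network communication.
- Anti-affinity: when an application runs as multiple replicas, use anti-affinity to spread the instances across nodes and improve the availability of the service.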
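nodeAffinity
How the operators are used:
    - matchExpressions:
      - key: nodeenv # matches nodes that have a label with the key nodeenv
        operator: Exists
      - key: nodeenv # matches nodes whose nodeenv label has the value "xxx" or "yyy"
        operator: In
        values: ["xxx","yyy"]
      - key: nodeenv # matches nodes whose nodeenv label has a value greater than "xxx"
        operator: Gt
        values: ["xxx"] # Gt and Lt take a single value, which is compared as an integer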
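To make the operator semantics concrete, here is a small Python model of how one matchExpressions entry could be evaluated against a node's labels. It is a sketch under assumptions: match_expression is a hypothetical helper, and the cpu label in the usage lines is invented for the example. It covers In, NotIn, Exists, DoesNotExist, Gt and Lt.

```python
def match_expression(labels, key, operator, values=()):
    """Evaluate a single matchExpressions entry against a dict of node labels."""
    present = key in labels
    if operator == "Exists":
        return present
    if operator == "DoesNotExist":
        return not present
    if operator == "In":
        return present and labels[key] in values
    if operator == "NotIn":
        # A node without the label also satisfies NotIn.
        return not present or labels[key] not in values
    if operator in ("Gt", "Lt"):
        # Gt and Lt expect exactly one value; both sides are compared as integers.
        if not present or len(values) != 1:
            return False
        try:
            actual, bound = int(labels[key]), int(values[0])
        except ValueError:
            return False
        return actual > bound if operator == "Gt" else actual < bound
    raise ValueError(f"unknown operator: {operator}")


labels = {"nodeenv": "pro", "cpu": "8"}  # "cpu" is an invented label for the Gt/Lt case
print(match_expression(labels, "nodeenv", "Exists"))             # True
print(match_expression(labels, "nodeenv", "In", ["xxx", "yyy"]))  # False
print(match_expression(labels, "cpu", "Gt", ["4"]))               # True
```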
First, a demonstration of requiredDuringSchedulingIgnoredDuringExecution
Create pod-nodeaffinity-required.yaml
    apiVersion: v1
    kind: Pod
    metadata:
      name: pod-nodeaffinity-required
      namespace: dev
    spec:
      containers:
      - name: nginx
        image: nginx:1.17.1
      affinity: # affinity settings
        nodeAffinity: # node affinity
          requiredDuringSchedulingIgnoredDuringExecution: # hard constraint
            nodeSelectorTerms:
            - matchExpressions:
              - key: nodeenv
                operator: In
                values: ["xxx","yyy"]
Create and apply the configuration file
    [root@master ~]# vim pod-nodeaffinity-required.yaml
    [root@master ~]# kubectl create -f pod-nodeaffinity-required.yaml
    pod/pod-nodeaffinity-required created
    [root@master ~]# kubectl get pod pod-nodeaffinity-required -n dev
    NAME                        READY   STATUS    RESTARTS   AGE
    pod-nodeaffinity-required   0/1     Pending   0          14s
The pod stays Pending, so look at its detailed description
    [root@master ~]# kubectl describe pod pod-nodeaffinity-required -n dev
    Events:
      Type     Reason            Age        From               Message
      ----     ------            ----       ----               -------
      Warning  FailedScheduling  <unknown>  default-scheduler  0/3 nodes are available: 3 node(s) didn't match node selector.
      Warning  FailedScheduling  <unknown>  default-scheduler  0/3 nodes are available: 3 node(s) didn't match node selector.
Delete the pod and edit the values in the configuration file
    apiVersion: v1
    kind: Pod
    metadata:
      name: pod-nodeaffinity-required
      namespace: dev
    spec:
      containers:
      - name: nginx
        image: nginx:1.17.1
      affinity: # affinity settings
        nodeAffinity: # node affinity
          requiredDuringSchedulingIgnoredDuringExecution: # hard constraint
            nodeSelectorTerms:
            - matchExpressions:
              - key: nodeenv
                operator: In
                values: ["pro","yyy"]
    [root@master ~]# kubectl delete -f pod-nodeaffinity-required.yaml
    pod "pod-nodeaffinity-required" deleted
    [root@master ~]# vim pod-nodeaffinity-required.yaml
    [root@master ~]# kubectl create -f pod-nodeaffinity-required.yaml
    pod/pod-nodeaffinity-required created
    [root@master ~]# kubectl get pod pod-nodeaffinity-required -n dev
    NAME                        READY   STATUS    RESTARTS   AGE
    pod-nodeaffinity-required   1/1     Running   0          35s
This time the pod is created and runs successfully.
Next, a demonstration of preferredDuringSchedulingIgnoredDuringExecution
Create pod-nodeaffinity-preferred.yaml
    apiVersion: v1
    kind: Pod
    metadata:
      name: pod-nodeaffinity-preferred
      namespace: dev
    spec:
      containers:
      - name: nginx
        image: nginx:1.17.1
      affinity: # affinity settings
        nodeAffinity: # node affinity
          preferredDuringSchedulingIgnoredDuringExecution: # soft constraint
          - weight: 1
            preference:
              matchExpressions:
              - key: nodeenv
                operator: In
                values: ["xxx","yyy"]
Create and apply the configuration file
    [root@master ~]# vim pod-nodeaffinity-preferred.yaml
    [root@master ~]# kubectl create -f pod-nodeaffinity-preferred.yaml
    pod/pod-nodeaffinity-preferred created
    [root@master ~]# kubectl get pod pod-nodeaffinity-preferred -n dev -o wide
    NAME                         READY   STATUS    RESTARTS   AGE   IP            NODE    NOMINATED NODE   READINESS GATES
    pod-nodeaffinity-preferred   1/1     Running   0          23s   10.244.2.38   node1   <none>           <none>
The pod was still scheduled (onto node1) even though no node is labeled nodeenv=xxx or nodeenv=yyy, because a soft constraint only expresses a preference.
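The difference between a hard and a soft constraint can be sketched as a scoring step. The Python below is a deliberately simplified model, not the real scheduler, which combines this score with other scoring rules: each preferred term whose expressions all match adds its weight to the node's score, and a score of zero does not exclude a node. The helper preference_score and the data layout are assumptions made for the illustration, and only the In operator is modeled.

```python
def preference_score(node_labels, preferred_terms):
    """Add up the weight of every preferred term whose In expressions all match."""
    score = 0
    for term in preferred_terms:
        if all(node_labels.get(e["key"]) in e["values"] for e in term["expressions"]):
            score += term["weight"]
    return score


nodes = {"node1": {"nodeenv": "pro"}, "node2": {"nodeenv": "test"}}

# Same preference as pod-nodeaffinity-preferred.yaml: nodeenv In ["xxx", "yyy"], weight 1.
preferred = [{"weight": 1, "expressions": [{"key": "nodeenv", "values": ["xxx", "yyy"]}]}]

scores = {name: preference_score(labels, preferred) for name, labels in nodes.items()}
print(scores)  # {'node1': 0, 'node2': 0}: no node is preferred, but none is excluded
```

With values changed to ["pro"], node1 would score 1 and node2 would score 0, so node1 would be favored without node2 being ruled out.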
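Notes on configuring nodeAffinity rules:
- If both nodeSelector and nodeAffinity are defined, both conditions must be satisfied before the pod can run on a given node
- If nodeAffinity specifies multiple nodeSelectorTerms, it is enough for any one of them to match
- If a single nodeSelectorTerms entry contains multiple matchExpressions, a node must satisfy all of them to match
- If the labels of the node a pod is running on change while the pod is running and no longer satisfy the pod's node affinity, the system ignores the change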
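The first three rules amount to simple boolean logic: expressions within one term are ANDed, terms are ORed, and nodeSelector is ANDed with the affinity result. A minimal Python sketch of that logic follows; the function names and the disk and gpu labels are invented for the example, and only the In and Exists operators are modeled.

```python
def expr_matches(labels, expr):
    """Evaluate one expression; only In and Exists are modeled in this sketch."""
    if expr["operator"] == "Exists":
        return expr["key"] in labels
    if expr["operator"] == "In":
        return labels.get(expr["key"]) in expr["values"]
    raise ValueError(f"operator not modeled: {expr['operator']}")


def term_matches(labels, term):
    """All expressions inside one term must match (logical AND)."""
    return all(expr_matches(labels, expr) for expr in term)


def node_is_eligible(labels, node_selector, selector_terms=None):
    """nodeSelector AND (term1 OR term2 OR ...); None means no node affinity is set."""
    selector_ok = all(labels.get(k) == v for k, v in node_selector.items())
    affinity_ok = selector_terms is None or any(term_matches(labels, t) for t in selector_terms)
    return selector_ok and affinity_ok


labels = {"nodeenv": "pro", "disk": "ssd"}  # "disk" is an invented label
terms = [
    [{"key": "nodeenv", "operator": "In", "values": ["xxx"]}],
    [{"key": "nodeenv", "operator": "In", "values": ["pro"]},
     {"key": "disk", "operator": "Exists"}],
]
print(node_is_eligible(labels, {}, terms))                   # True: the second term matches
print(node_is_eligible(labels, {"nodeenv": "test"}, terms))  # False: nodeSelector fails
```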
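podAffinity
podAffinity uses running pods as a reference so that a newly created pod is placed in the same area (topology domain) as the reference pod.
topologyKey specifies the scope that applies during scheduling, for example:
- If it is set to kubernetes.io/hostname, individual nodes are the unit of distinction
- If it is set to beta.kubernetes.io/os, nodes are distinguished by their operating system type
Next, a demonstration of requiredDuringSchedulingIgnoredDuringExecution
Create a reference pod, pod-podaffinity-target.yaml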
    apiVersion: v1
    kind: Pod
    metadata:
      name: pod-podaffinity-target
      namespace: dev
      labels:
        podenv: pro # set a label
    spec:
      containers:
      - name: nginx
        image: nginx:1.17.1
      nodeName: node1 # pin the target pod to node1
    [root@master ~]# vim pod-podaffinity-target.yaml
    [root@master ~]# kubectl create -f pod-podaffinity-target.yaml
    pod/pod-podaffinity-target created
    [root@master ~]# kubectl get pod pod-podaffinity-target -n dev -o wide --show-labels
    NAME                     READY   STATUS    RESTARTS   AGE     IP            NODE    NOMINATED NODE   READINESS GATES   LABELS
    pod-podaffinity-target   1/1     Running   0          2m47s   10.244.2.39   node1   <none>           <none>            podenv=pro
Create pod-podaffinity-required.yaml
    apiVersion: v1
    kind: Pod
    metadata:
      name: pod-podaffinity-required
      namespace: dev
    spec:
      containers:
      - name: nginx
        image: nginx:1.17.1
      affinity: # affinity settings
        podAffinity: # pod affinity
          requiredDuringSchedulingIgnoredDuringExecution: # hard constraint
          - labelSelector:
              matchExpressions: # match pods whose podenv label is "xxx" or "yyy"
              - key: podenv
                operator: In
                values: ["xxx","yyy"]
            topologyKey: kubernetes.io/hostname
    [root@master ~]# vim pod-podaffinity-required.yaml
    [root@master ~]# kubectl create -f pod-podaffinity-required.yaml
    pod/pod-podaffinity-required created
    [root@master ~]# kubectl get pod pod-podaffinity-required -n dev -o wide --show-labels
    NAME                       READY   STATUS    RESTARTS   AGE   IP       NODE     NOMINATED NODE   READINESS GATES   LABELS
    pod-podaffinity-required   0/1     Pending   0          24s   <none>   <none>   <none>           <none>            <none>
Scheduling failed, so look at the scheduling events
    [root@master ~]# kubectl describe pod pod-podaffinity-required -n dev
    Events:
      Type     Reason            Age        From               Message
      ----     ------            ----       ----               -------
      Warning  FailedScheduling  <unknown>  default-scheduler  0/3 nodes are available: 1 node(s) had taints that the pod didn't tolerate, 2 node(s) didn't match pod affinity rules.
      Warning  FailedScheduling  <unknown>  default-scheduler  0/3 nodes are available: 1 node(s) had taints that the pod didn't tolerate, 2 node(s) didn't match pod affinity rules.
Delete the pod and edit the configuration file
    apiVersion: v1
    kind: Pod
    metadata:
      name: pod-podaffinity-required
      namespace: dev
    spec:
      containers:
      - name: nginx
        image: nginx:1.17.1
      affinity: # affinity settings
        podAffinity: # pod affinity
          requiredDuringSchedulingIgnoredDuringExecution: # hard constraint
          - labelSelector:
              matchExpressions: # match pods whose podenv label is "pro" or "yyy"
              - key: podenv
                operator: In
                values: ["pro","yyy"]
            topologyKey: kubernetes.io/hostname
    [root@master ~]# kubectl delete -f pod-podaffinity-required.yaml
    pod "pod-podaffinity-required" deleted
    [root@master ~]# vim pod-podaffinity-required.yaml
    [root@master ~]# kubectl create -f pod-podaffinity-required.yaml
    pod/pod-podaffinity-required created
    [root@master ~]# kubectl get pod pod-podaffinity-required -n dev -o wide --show-labels
    NAME                       READY   STATUS    RESTARTS   AGE   IP            NODE    NOMINATED NODE   READINESS GATES   LABELS
    pod-podaffinity-required   1/1     Running   0          11s   10.244.2.40   node1   <none>           <none>            <none>
podAntiAffinity
podAntiAffinity uses running pods as a reference so that a newly created pod is kept out of the area (topology domain) the reference pod is in.
Its configuration and options are the same as those of podAffinity, so they are not explained again here; we go straight to a test case.
Continue with the target pod from the previous case
    [root@master ~]# kubectl get pod -n dev -o wide --show-labels
    NAME                       READY   STATUS    RESTARTS   AGE   IP            NODE    NOMINATED NODE   READINESS GATES   LABELS
    pod-podaffinity-required   1/1     Running   2          24h   10.244.2.57   node1   <none>           <none>            <none>
    pod-podaffinity-target     1/1     Running   2          24h   10.244.2.56   node1   <none>           <none>            podenv=pro
Create pod-podantiaffinity-required.yaml with the following content
    apiVersion: v1
    kind: Pod
    metadata:
      name: pod-podantiaffinity-required
      namespace: dev
    spec:
      containers:
      - name: nginx
        image: nginx:1.17.1
      affinity: # affinity settings
        podAntiAffinity: # pod anti-affinity
          requiredDuringSchedulingIgnoredDuringExecution: # hard constraint
          - labelSelector:
              matchExpressions: # match pods whose podenv label is "pro"
              - key: podenv
                operator: In
                values: ["pro"]
            topologyKey: kubernetes.io/hostname
Apply the configuration file
    [root@master ~]# vim pod-podantiaffinity-required.yaml
    [root@master ~]# kubectl create -f pod-podantiaffinity-required.yaml
    pod/pod-podantiaffinity-required created
    [root@master ~]# kubectl get pod -n dev -o wide
    NAME                           READY   STATUS    RESTARTS   AGE   IP            NODE    NOMINATED NODE   READINESS GATES
    pod-podaffinity-required       1/1     Running   2          24h   10.244.2.57   node1   <none>           <none>
    pod-podaffinity-target         1/1     Running   2          24h   10.244.2.56   node1   <none>           <none>
    pod-podantiaffinity-required   1/1     Running   0          10s   10.244.1.57   node2   <none>           <none>
The pod was scheduled onto node2, away from the target pod on node1.
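Pod affinity and anti-affinity both reduce to one question: which topology domains already host a pod that matches the label selector? The Python below is a sketch of that idea and not the scheduler's implementation. The helpers matching_domains and candidate_nodes are hypothetical, only the In operator is modeled, and every node is assumed to carry the topology key label.

```python
def matching_domains(nodes, pods, label_key, values, topology_key):
    """Topology domains (values of topology_key) that already host a matching pod."""
    return {
        nodes[pod["node"]][topology_key]
        for pod in pods
        if pod["labels"].get(label_key) in values
    }


def candidate_nodes(nodes, pods, label_key, values, topology_key, anti=False):
    """Affinity keeps nodes inside a matching domain; anti-affinity keeps the rest."""
    domains = matching_domains(nodes, pods, label_key, values, topology_key)
    return [
        name for name, labels in nodes.items()
        if (labels[topology_key] in domains) != anti
    ]


key = "kubernetes.io/hostname"
nodes = {"node1": {key: "node1"}, "node2": {key: "node2"}}
pods = [{"node": "node1", "labels": {"podenv": "pro"}}]  # the target pod on node1

print(candidate_nodes(nodes, pods, "podenv", ["xxx", "yyy"], key))        # []
print(candidate_nodes(nodes, pods, "podenv", ["pro", "yyy"], key))        # ['node1']
print(candidate_nodes(nodes, pods, "podenv", ["pro"], key, anti=True))    # ['node2']
```

The three calls mirror the three experiments above: no candidate (Pending), node1 for affinity, and node2 for anti-affinity.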
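Taints and Tolerations
Taints
The scheduling methods so far all take the pod's point of view: attributes added to the pod decide whether it should be scheduled onto a given node. We can also take the node's point of view and add taints to a node to decide whether pods are allowed to be scheduled onto it.
Once a node has a taint, it has a mutually exclusive relationship with pods: it refuses to let pods be scheduled onto it and can even evict pods that are already running there.
A taint has the format key=value:effect. key and value form the taint's label, and effect describes what the taint does. Three effects are supported:
- PreferNoSchedule: k8s tries to avoid scheduling pods onto a node with this taint, unless no other node is available
- NoSchedule: k8s does not schedule pods onto a node with this taint, but pods already on the node are not affected
- NoExecute: k8s does not schedule pods onto a node with this taint and also evicts pods already on the node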
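The taint format and the three effects can be summarized in a short Python sketch. It models only the behavior described above for a pod that has no toleration; parse_taint and taint_consequences are hypothetical helper names, not part of any Kubernetes client library.

```python
def parse_taint(spec):
    """Split a 'key=value:effect' string (the value part is optional) into its parts."""
    key_value, effect = spec.rsplit(":", 1)
    key, _, value = key_value.partition("=")
    return {"key": key, "value": value, "effect": effect}


def taint_consequences(effect):
    """Return (blocks_new_pods, evicts_running_pods) for a pod without a toleration."""
    table = {
        "PreferNoSchedule": (False, False),  # soft: the node is only avoided when possible
        "NoSchedule": (True, False),         # new pods are rejected, running pods stay
        "NoExecute": (True, True),           # new pods are rejected, running pods are evicted
    }
    return table[effect]


taint = parse_taint("tag=ayanami:NoExecute")
print(taint)                                # {'key': 'tag', 'value': 'ayanami', 'effect': 'NoExecute'}
print(taint_consequences(taint["effect"]))  # (True, True)
```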
The kubectl commands for adding and removing taints are:
    # add a taint
    kubectl taint nodes nodeName key=value:effect
    # remove a taint
    kubectl taint nodes nodeName key:effect-
    # remove all taints with the given key
    kubectl taint nodes nodeName key-
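Next, a demonstration of what taints do:
- Prepare node1 (to make the effect more obvious, node2 is temporarily shut down)
- Set a taint on node1: tag=ayanami:PreferNoSchedule; then create pod1 (pod1 runs)
- Change the taint on node1 to tag=ayanami:NoSchedule; then create pod2 (pod1 keeps running, pod2 fails)
- Change the taint on node1 to tag=ayanami:NoExecute; then create pod3 (all three pods fail)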
Set a taint on node1 (PreferNoSchedule):
    [root@master ~]# kubectl taint nodes node1 tag=ayanami:PreferNoSchedule
    node/node1 tainted
Create pod1 (named taint1)
    [root@master ~]# kubectl run taint1 --image=nginx:1.17.1 -n dev
    kubectl run --generator=deployment/apps.v1 is DEPRECATED and will be removed in a future version. Use kubectl run --generator=run-pod/v1 or kubectl create instead.
    [root@master ~]# kubectl get pod -n dev
    NAME                      READY   STATUS    RESTARTS   AGE
    taint1-766c47bf55-lhmcj   1/1     Running   0          6m16s
Change the taint on node1 (remove PreferNoSchedule and set NoSchedule)
    [root@master ~]# kubectl taint nodes node1 tag:PreferNoSchedule-
    node/node1 untainted
    [root@master ~]# kubectl taint nodes node1 tag=ayanami:NoSchedule
    node/node1 tainted
Check the pod again; nothing has changed
    [root@master ~]# kubectl get pod -n dev
    NAME                      READY   STATUS    RESTARTS   AGE
    taint1-766c47bf55-lhmcj   1/1     Running   0          10m
Create a new pod, taint2, and check it
    [root@master ~]# kubectl run taint2 --image=nginx:1.17.1 -n dev
    kubectl run --generator=deployment/apps.v1 is DEPRECATED and will be removed in a future version. Use kubectl run --generator=run-pod/v1 or kubectl create instead.
    deployment.apps/taint2 created
    [root@master ~]# kubectl get pod -n dev
    NAME                      READY   STATUS    RESTARTS   AGE
    taint1-766c47bf55-lhmcj   1/1     Running   0          11m
    taint2-84946958cf-h9765   0/1     Pending   0          15s
The new pod cannot reach Running; inspect taint2
    [root@master ~]# kubectl describe pod taint2 -n dev
    Events:
      Type     Reason            Age        From               Message
      ----     ------            ----       ----               -------
      Warning  FailedScheduling  <unknown>  default-scheduler  0/3 nodes are available: 3 node(s) had taints that the pod didn't tolerate.
      Warning  FailedScheduling  <unknown>  default-scheduler  0/3 nodes are available: 3 node(s) had taints that the pod didn't tolerate.
Change the taint on node1 (remove NoSchedule and set NoExecute)
    [root@master ~]# kubectl taint node node1 tag:NoSchedule-
    node/node1 untainted
    [root@master ~]# kubectl taint node node1 tag=ayanami:NoExecute
    node/node1 tainted
    [root@master ~]# kubectl get pod -n dev
    NAME                      READY   STATUS    RESTARTS   AGE
    taint1-766c47bf55-fdtqw   0/1     Pending   0          30s
    taint2-84946958cf-26rfx   0/1     Pending   0          30s
No pod is running any more: taint1 was evicted from node1, and the pods now listed (note the new name suffixes) stay Pending. Now create taint3
    [root@master ~]# kubectl run taint3 --image=nginx:1.17.1 -n dev
    kubectl run --generator=deployment/apps.v1 is DEPRECATED and will be removed in a future version. Use kubectl run --generator=run-pod/v1 or kubectl create instead.
    [root@master ~]# kubectl get pod -n dev
    NAME                      READY   STATUS    RESTARTS   AGE
    taint1-766c47bf55-fdtqw   0/1     Pending   0          97s
    taint2-84946958cf-26rfx   0/1     Pending   0          97s
    taint3-57d45f9d4c-68pwr   0/1     Pending   0          9s
The new pod cannot be scheduled either.
Additional note:
A cluster set up with kubeadm adds a taint to the master node by default, so pods are not scheduled onto the master node
    [root@master ~]# kubectl describe node master
    Taints: node-role.kubernetes.io/master:NoSchedule
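Tolerations
The previous section showed that adding a taint to a node keeps pods from being scheduled onto it. But what if we do want to schedule a pod onto a tainted node? That is what tolerations are for.
A taint means rejection and a toleration means ignoring it: a node uses taints to reject pods, and a pod uses tolerations to ignore that rejection.
First, look at the effect through an example:
- In the previous subsection, node1 was given a NoExecute taint, so pods cannot be scheduled onto it
- In this subsection, we add a toleration to a pod so that it can be scheduled there
Create pod-toleration.yaml with the following content
    apiVersion: v1
    kind: Pod
    metadata:
      name: pod-toleration
      namespace: dev
    spec:
      containers:
      - name: nginx
        image: nginx:1.17.1
      tolerations: # add a toleration
      - key: "tag" # key of the taint to tolerate
        operator: "Equal" # operator
        value: "ayanami" # value of the taint to tolerate
        effect: "NoExecute" # effect to tolerate; must be the same as the effect of the taint on the node
Apply the configuration file
    [root@master ~]# vim pod-toleration.yaml
    [root@master ~]# kubectl create -f pod-toleration.yaml
    pod/pod-toleration created
    [root@master ~]# kubectl get pod -n dev
    NAME                      READY   STATUS    RESTARTS   AGE
    pod-toleration            1/1     Running   0          9s
    taint1-766c47bf55-fdtqw   0/1     Pending   0          34m
    taint2-84946958cf-26rfx   0/1     Pending   0          34m
    taint3-57d45f9d4c-68pwr   0/1     Pending   0          33m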
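Detailed toleration settings
- key: the key of the taint to tolerate; an empty key (used with the Exists operator) matches every key
- value: the value of the taint to tolerate
- operator: the key-value operator; supports Equal (the default) and Exists
- effect: the effect of the taint to tolerate; an empty value matches every effect
- tolerationSeconds: the toleration period; it applies when effect is NoExecute and specifies how long the pod may stay on the node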
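How the key, operator, value and effect fields combine can be modeled as a single matching function. The Python below is a sketch of the matching rule only; the function tolerates and the dictionary layout are assumptions made for the illustration, and tolerationSeconds is not modeled.

```python
def tolerates(toleration, taint):
    """Return True when the toleration matches the taint."""
    operator = toleration.get("operator", "Equal")  # Equal is the default operator
    key = toleration.get("key", "")
    effect = toleration.get("effect", "")

    if effect and effect != taint["effect"]:
        return False  # an empty effect matches every effect
    if key and key != taint["key"]:
        return False  # an empty key matches every key
    if operator == "Exists":
        return True   # Exists ignores the value
    if operator == "Equal":
        return toleration.get("value", "") == taint.get("value", "")
    return False


taint = {"key": "tag", "value": "ayanami", "effect": "NoExecute"}

# The toleration from pod-toleration.yaml.
toleration = {"key": "tag", "operator": "Equal", "value": "ayanami", "effect": "NoExecute"}
print(tolerates(toleration, taint))              # True
print(tolerates({"operator": "Exists"}, taint))  # True: tolerates every taint
```

A toleration whose effect is NoSchedule would not match this taint, which is why the effect in the manifest has to agree with the taint set on node1.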