- Episode 1: Verifying the .NET 5.0 official-release Docker image problem
- Episode 2: A small hole in the code, a big pit behind it; a key suspect found: EnyimMemcachedCore
- Episode 3 (plot twist): EnyimMemcachedCore is innocent, and .NET 5.0 keeps taking the blame
- Episode 4: One .NET, two contingency plans; one issue, double the attention
- Episode 5 (case breakthrough): it was all our fault, and we made .NET 5.0 take the blame
- Episode 6 (revisiting the scene): how Kubernetes deployments behaved during the failure
Our blog system runs on a Kubernetes cluster we built ourselves on Alibaba Cloud servers, and the failure showed up while k8s was rolling out updated pods. During yesterday's release we deliberately watched the process, and in this episode we share what we observed.
During a deployment, k8s walks the pods through 3 update stages, reported as:
- "xxx new replicas have been updated"
- "xxx replicas are pending termination"
- "xxx updated replicas are available"
Under normal conditions the whole rollout completes in roughly 5-8 minutes (this depends on the livenessProbe and readinessProbe settings). Here is the console output during a deployment:
Waiting for deployment "blog-web" rollout to finish: 4 out of 8 new replicas have been updated...
Waiting for deployment spec update to be observed...
Waiting for deployment "blog-web" rollout to finish: 4 out of 15 new replicas have been updated...
Waiting for deployment "blog-web" rollout to finish: 4 out of 15 new replicas have been updated...
Waiting for deployment "blog-web" rollout to finish: 4 out of 15 new replicas have been updated...
Waiting for deployment "blog-web" rollout to finish: 8 out of 15 new replicas have been updated...
Waiting for deployment "blog-web" rollout to finish: 8 out of 15 new replicas have been updated...
Waiting for deployment "blog-web" rollout to finish: 8 out of 15 new replicas have been updated...
...
Waiting for deployment "blog-web" rollout to finish: 4 old replicas are pending termination...
...
Waiting for deployment "blog-web" rollout to finish: 14 of 15 updated replicas are available...
deployment "blog-web" successfully rolled out
In the failure scenario, however, the whole rollout took around 15 minutes to complete. All 3 stages of the pod update were slower than normal, especially the "old replicas are pending termination" stage.
During the deployment we checked pod status with kubectl get pods -l app=blog-web -o wide. The newly deployed pods showed STATUS Running, indicating the livenessProbe check was succeeding at that moment, but most of them never reached the Ready state, indicating their readinessProbe checks were failing; and a RESTARTS count greater than 0 indicates that earlier livenessProbe failures had already caused those pods to be restarted.
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
blog-web-55d5677cf-2854n 0/1 Running 1 5m1s 192.168.107.213 k8s-node3 <none> <none>
blog-web-55d5677cf-7vkqb 0/1 Running 2 6m17s 192.168.228.33 k8s-n9 <none> <none>
blog-web-55d5677cf-8gq6n 0/1 Running 2 5m29s 192.168.102.235 k8s-n19 <none> <none>
blog-web-55d5677cf-g8dsr 0/1 Running 2 5m54s 192.168.104.78 k8s-node11 <none> <none>
blog-web-55d5677cf-kk9mf 0/1 Running 2 6m9s 192.168.42.3 k8s-n13 <none> <none>
blog-web-55d5677cf-kqwzc 0/1 Pending 0 4m44s <none> <none> <none> <none>
blog-web-55d5677cf-lmbvf 0/1 Running 2 5m54s 192.168.201.123 k8s-n14 <none> <none>
blog-web-55d5677cf-ms2tk 0/1 Pending 0 6m9s <none> <none> <none> <none>
blog-web-55d5677cf-nkjrd 1/1 Running 2 6m17s 192.168.254.129 k8s-n7 <none> <none>
blog-web-55d5677cf-nnjdx 0/1 Pending 0 4m48s <none> <none> <none> <none>
blog-web-55d5677cf-pqgpr 0/1 Pending 0 4m33s <none> <none> <none> <none>
blog-web-55d5677cf-qrjr5 0/1 Pending 0 2m38s <none> <none> <none> <none>
blog-web-55d5677cf-t5wvq 1/1 Running 3 6m17s 192.168.10.100 k8s-n12 <none> <none>
blog-web-55d5677cf-w52xc 1/1 Running 3 6m17s 192.168.73.35 k8s-node10 <none> <none>
blog-web-55d5677cf-zk559 0/1 Running 1 5m21s 192.168.118.6 k8s-n4 <none> <none>
blog-web-5b57b7fcb6-7cbdt 1/1 Running 2 18m 192.168.168.77 k8s-n6 <none> <none>
blog-web-5b57b7fcb6-cgfr4 1/1 Running 4 19m 192.168.89.250 k8s-n8 <none> <none>
blog-web-5b57b7fcb6-cz278 1/1 Running 3 19m 192.168.218.99 k8s-n18 <none> <none>
blog-web-5b57b7fcb6-hvzwp 1/1 Running 3 18m 192.168.195.242 k8s-node5 <none> <none>
blog-web-5b57b7fcb6-rhgkq 1/1 Running 1 16m 192.168.86.126 k8s-n20 <none> <none>
In the listing above, the blog-web-55d5677cf-* pods belong to the new ReplicaSet being rolled out, while the blog-web-5b57b7fcb6-* pods are old ones still serving traffic. In our k8s deployment configuration, the livenessProbe and readinessProbe check the same address; the exact settings are as follows:
livenessProbe:
  httpGet:
    path: /
    port: 80
    httpHeaders:
    - name: X-Forwarded-Proto
      value: https
    - name: Host
      value: www.cnblogs.com
  initialDelaySeconds: 30
  periodSeconds: 3
  successThreshold: 1
  failureThreshold: 5
  timeoutSeconds: 5
readinessProbe:
  httpGet:
    path: /
    port: 80
    httpHeaders:
    - name: X-Forwarded-Proto
      value: https
    - name: Host
      value: www.cnblogs.com
  initialDelaySeconds: 40
  periodSeconds: 5
  successThreshold: 1
  failureThreshold: 5
  timeoutSeconds: 5
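To put concrete numbers on these settings: a new pod gets its first liveness check 30 seconds after the container starts, and with failureThreshold: 5 and periodSeconds: 3 it takes 5 × 3 = 15 seconds of consecutive liveness failures before k8s kills and restarts the container (after which the 30-second initial delay applies again). For readiness, checks start after 40 seconds, and 5 consecutive failures at 5-second intervals (25 seconds) mark the pod not Ready and remove it from the Service endpoints, while successThreshold: 1 means a single success puts it back. This is why the rollout duration is tied to these probe settings.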
Because of the latent concurrency bug, the livenessProbe and readinessProbe health checks failed frequently, so the k8s pod update stumbled along. During this stage some old pods were still sharing the load, and whenever new pods ran into trouble the rollout paused until the pods being deployed recovered, so the impact of the failure stayed contained: the site was intermittently up and down.
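This pause-and-keep-serving behavior is governed by the Deployment's rolling update strategy, which only terminates old pods as new ones become Ready. A minimal sketch of such a strategy section is below; the maxSurge / maxUnavailable values here are illustrative assumptions, not taken from our actual config:

  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 25%        # extra new pods allowed above the desired replica count
      maxUnavailable: 25%  # pods allowed to be unavailable during the update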
This stumbling deployment does eventually complete, and the moment it completes is the moment the failure fully erupts. Once the rollout finishes, the new pods take over the entire load, and under the pressure of concurrent requests the new pods with the latent concurrency bug collapse: multiple pods are restarted for failing the livenessProbe, and after restarting they struggle to become Ready because the readinessProbe keeps failing, so they cannot share the load. The few remaining pods are overwhelmed, CrashLoopBackOff spreads from pod to pod, and with the stream of concurrent requests never letting up, there are never enough pods to handle the load, so the failure cannot recover on its own.
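As a diagnostic note: in a state like this, running kubectl describe pod on one of the restarting pods would typically show Warning Unhealthy events recording the liveness and readiness probe failures, and repeated kubectl get pods would show pods cycling through CrashLoopBackOff, matching the symptoms described above.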