The .NET 5.0 Scapegoat Case, Episode 6 - Revisiting the Crime Scene: Kubernetes Deployment Behavior During the Outage

Posted by the Cnblogs team on 2020-11-18

Our blog system is deployed on a Kubernetes cluster that we built ourselves on Alibaba Cloud servers, and the failure appeared precisely while k8s was rolling out updated pods. During yesterday's release we made a point of watching the rollout, and in this episode we share what we observed.

During a deployment, k8s performs the pod update in three stages:

  1. "xxx new replicas have been updated"
  2. "xxx replicas are pending termination"
  3. "xxx updated replicas are available"

Under a normal release the whole rollout usually completes in about 5 to 8 minutes (the exact time depends on the livenessProbe and readinessProbe configuration). Below is the console output during such a deployment:

Waiting for deployment "blog-web" rollout to finish: 4 out of 8 new replicas have been updated...
Waiting for deployment spec update to be observed...
Waiting for deployment "blog-web" rollout to finish: 4 out of 15 new replicas have been updated...
Waiting for deployment "blog-web" rollout to finish: 4 out of 15 new replicas have been updated...
Waiting for deployment "blog-web" rollout to finish: 4 out of 15 new replicas have been updated...
Waiting for deployment "blog-web" rollout to finish: 8 out of 15 new replicas have been updated...
Waiting for deployment "blog-web" rollout to finish: 8 out of 15 new replicas have been updated...
Waiting for deployment "blog-web" rollout to finish: 8 out of 15 new replicas have been updated...
...
Waiting for deployment "blog-web" rollout to finish: 4 old replicas are pending termination...
...
Waiting for deployment "blog-web" rollout to finish: 14 of 15 updated replicas are available...
deployment "blog-web" successfully rolled out

In the failure scenario, however, the whole rollout took around 15 minutes to complete. All three pod update stages were slower than normal, especially the "old replicas are pending termination" stage.

While the rollout was in progress we checked pod status with kubectl get pods -l app=blog-web -o wide. The newly deployed pods were in Running status, meaning the livenessProbe check was passing at that moment, but most of them never became Ready, which means their readinessProbe checks were failing; and a RESTARTS count greater than 0 shows that earlier livenessProbe failures had already caused those pods to be restarted.

NAME                        READY   STATUS    RESTARTS   AGE     IP                NODE         NOMINATED NODE   READINESS GATES
blog-web-55d5677cf-2854n    0/1     Running   1          5m1s    192.168.107.213   k8s-node3    <none>           <none>
blog-web-55d5677cf-7vkqb    0/1     Running   2          6m17s   192.168.228.33    k8s-n9       <none>           <none>
blog-web-55d5677cf-8gq6n    0/1     Running   2          5m29s   192.168.102.235   k8s-n19      <none>           <none>
blog-web-55d5677cf-g8dsr    0/1     Running   2          5m54s   192.168.104.78    k8s-node11   <none>           <none>
blog-web-55d5677cf-kk9mf    0/1     Running   2          6m9s    192.168.42.3      k8s-n13      <none>           <none>
blog-web-55d5677cf-kqwzc    0/1     Pending   0          4m44s   <none>            <none>       <none>           <none>
blog-web-55d5677cf-lmbvf    0/1     Running   2          5m54s   192.168.201.123   k8s-n14      <none>           <none>
blog-web-55d5677cf-ms2tk    0/1     Pending   0          6m9s    <none>            <none>       <none>           <none>
blog-web-55d5677cf-nkjrd    1/1     Running   2          6m17s   192.168.254.129   k8s-n7       <none>           <none>
blog-web-55d5677cf-nnjdx    0/1     Pending   0          4m48s   <none>            <none>       <none>           <none>
blog-web-55d5677cf-pqgpr    0/1     Pending   0          4m33s   <none>            <none>       <none>           <none>
blog-web-55d5677cf-qrjr5    0/1     Pending   0          2m38s   <none>            <none>       <none>           <none>
blog-web-55d5677cf-t5wvq    1/1     Running   3          6m17s   192.168.10.100    k8s-n12      <none>           <none>
blog-web-55d5677cf-w52xc    1/1     Running   3          6m17s   192.168.73.35     k8s-node10   <none>           <none>
blog-web-55d5677cf-zk559    0/1     Running   1          5m21s   192.168.118.6     k8s-n4       <none>           <none>
blog-web-5b57b7fcb6-7cbdt   1/1     Running   2          18m     192.168.168.77    k8s-n6       <none>           <none>
blog-web-5b57b7fcb6-cgfr4   1/1     Running   4          19m     192.168.89.250    k8s-n8       <none>           <none>
blog-web-5b57b7fcb6-cz278   1/1     Running   3          19m     192.168.218.99    k8s-n18      <none>           <none>
blog-web-5b57b7fcb6-hvzwp   1/1     Running   3          18m     192.168.195.242   k8s-node5    <none>           <none>
blog-web-5b57b7fcb6-rhgkq   1/1     Running   1          16m     192.168.86.126    k8s-n20      <none>           <none>
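To confirm which probe a particular pod is failing, its event list is usually the quickest place to look. A small sketch, using one of the pod names from the listing above:

# Describe one of the not-ready pods; the Events section at the end lists
# entries such as "Liveness probe failed" and "Readiness probe failed"
kubectl describe pod blog-web-55d5677cf-2854n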

In our k8s deployment configuration, the livenessProbe and readinessProbe check the same URL. The exact configuration is shown below:

livenessProbe:
    httpGet:
        path: /
        port: 80
        httpHeaders:
        - name: X-Forwarded-Proto
          value: https
        - name: Host
          value: www.cnblogs.com
    initialDelaySeconds: 30
    periodSeconds: 3
    successThreshold: 1
    failureThreshold: 5
    timeoutSeconds: 5
readinessProbe:
    httpGet:
        path: /
        port: 80
        httpHeaders:
        - name: X-Forwarded-Proto
          value: https
        - name: Host
          value: www.cnblogs.com
    initialDelaySeconds: 40
    periodSeconds: 5
    successThreshold: 1
    failureThreshold: 5
    timeoutSeconds: 5
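To put those numbers in perspective: with periodSeconds: 3 and failureThreshold: 5, roughly 15 seconds of consecutively failing liveness checks is enough for kubelet to restart the container, and with periodSeconds: 5 the readiness probe takes a pod out of the Service endpoints on a similar timescale, which matches the restarts and not-Ready pods shown above. As a sketch, the probes actually in effect on the running Deployment can be double-checked like this (assuming the app container is the first container in the pod spec):

# Print the probes configured on the first container of the blog-web deployment
kubectl get deployment blog-web -o jsonpath='{.spec.template.spec.containers[0].livenessProbe}'
kubectl get deployment blog-web -o jsonpath='{.spec.template.spec.containers[0].readinessProbe}'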

Because the latent concurrency bug caused the livenessProbe and readinessProbe checks to fail frequently, k8s stumbled through the pod update. During this phase some of the old pods were still sharing the load, and whenever newly deployed pods ran into trouble the rollout paused until they recovered, so the impact of the failure stayed within bounds and the site behaved erratically, working one moment and failing the next.
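How many old pods keep serving during the rollout, and how far the rollout may run ahead before pausing, is governed by the Deployment's rolling update strategy (maxSurge and maxUnavailable). Our actual values are not shown in this post, but as a sketch they can be read off the live object:

# Show the rolling update strategy that decides how many old pods stay in service
# while the new pods are still failing their readiness checks
kubectl get deployment blog-web -o jsonpath='{.spec.strategy.rollingUpdate}'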

This stumbling, drawn-out deployment did eventually finish, and the moment it completed was the moment the failure erupted in full. Once the rollout was done, the new pods took over all of the traffic. Carrying the concurrency bug, they collapsed under the weight of concurrent requests: pod after pod was restarted because its livenessProbe check failed, and after restarting it struggled to pass the readinessProbe check and become Ready to share the load. The few remaining pods were overwhelmed, CrashLoopBackOff rippled through the deployment, and under the unrelenting stream of concurrent requests there were never enough pods to handle the load, so the outage simply could not recover.
