[Incident Notice] k8s Sailing Log: Capsizing Caused by Adding a Control Cabin (control-plane)

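During the Spring Festival we replaced our production Kubernetes cluster. The old cluster ran Kubernetes 1.17.0 and the new one runs 1.23.3. dapr is deployed on the new cluster, and we have recently been preparing to move more of our standalone servers, such as redis, memcached, and mongodb, onto the k8s cluster.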

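Like the old cluster, the new cluster is meant to be a high-availability cluster, but it started out with only 1 control-plane node. Today's task was to add 2 more control-plane nodes to the new cluster so that it would actually be highly available.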

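How do you join a control-plane node? As far as we knew, k8s offers no direct command for generating the control-plane join command; the one below only produces a command for joining worker nodes.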

kubeadm token create --print-join-command

Previously we joined the control-plane nodes at cluster creation time, right after running kubeadm init, because kubeadm init prints the join command:

You can now join any number of the control-plane node running the following command on each as root:

  kubeadm join k8s-api:6443 --token ****** \
    --discovery-token-ca-cert-hash ****** \
    --control-plane --certificate-key ******

The command above is really just the output of kubeadm token create --print-join-command with --control-plane and --certificate-key appended, but the certificate-key we used back then has expired.
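The composition described above can be sketched as a small shell helper. This is only an illustration: `to_control_plane_join` is a hypothetical name, the token, hash, and key in the example are placeholders, and the function simply appends the two extra flags to whatever worker join command it is given.

```shell
# Hypothetical helper: turn a worker join command into a control-plane one
# by appending the two extra flags described in the text.
to_control_plane_join() {
  local worker_join="$1"   # output of: kubeadm token create --print-join-command
  local cert_key="$2"      # a certificate key that is still valid
  # Strip trailing whitespace so the appended flags are separated by one space.
  worker_join="${worker_join%"${worker_join##*[![:space:]]}"}"
  printf '%s --control-plane --certificate-key %s\n' "$worker_join" "$cert_key"
}

# Example with placeholder values (not real credentials):
#   to_control_plane_join \
#     "kubeadm join k8s-api:6443 --token abc.def --discovery-token-ca-cert-hash sha256:0000" \
#     "1111"
```

The resulting command still depends on the key matching certificates that are currently uploaded to the cluster.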

At first we generated a certificate-key with the command below and tried to join the cluster with it:

kubeadm certs certificate-key

But the join failed with the following error:

[download-certs] Downloading the certificates in Secret "kubeadm-certs" in the "kube-system" Namespace
error execution phase control-plane-prepare/download-certs: error downloading certs: error downloading the secret: Secret "kubeadm-certs" was not found in the "kube-system" Namespace. This Secret might have expired. Please, run kubeadm init phase upload-certs --upload-certs on a control plane to generate a new one
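The error makes sense once the two pieces are separated: `kubeadm certs certificate-key` only prints a fresh random key, while the join step needs the `kubeadm-certs` Secret, which holds the control-plane certificates encrypted with a matching key. kubeadm removes that Secret about two hours after it is uploaded. A pre-check along the following lines can tell whether the Secret is still present; `kubeadm_certs_present` and the `KUBECTL` override are names made up for this sketch.

```shell
# KUBECTL may be overridden (for example with a stub); defaults to kubectl.
KUBECTL="${KUBECTL:-kubectl}"

# Returns 0 if the kubeadm-certs Secret exists in kube-system.
# Returns non-zero if it is missing or if kubectl cannot reach the api-server.
kubeadm_certs_present() {
  "$KUBECTL" -n kube-system get secret kubeadm-certs >/dev/null 2>&1
}

# Usage:
#   kubeadm_certs_present || echo "Secret missing: upload the certificates again"
```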

So we switched to the upload-certs command:

$ kubeadm init phase upload-certs --upload-certs
[upload-certs] Storing the certificates in Secret "kubeadm-certs" in the "kube-system" Namespace
[upload-certs] Using certificate key:
*****
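Because the key is printed on the last line of this output, it can be captured in a script instead of being copied by hand. The sketch below assumes the output layout shown above; `extract_cert_key` is a hypothetical helper name. The commented usage relies on `kubeadm token create` accepting `--certificate-key` together with `--print-join-command`, which prints the control-plane form of the join command.

```shell
# Hypothetical helper: read the output of
#   kubeadm init phase upload-certs --upload-certs
# on stdin and print the certificate key, assumed to be the last
# non-empty line.
extract_cert_key() {
  awk 'NF { last = $0 } END { if (last != "") print last }'
}

# Usage on an existing control-plane node (needs root and kubeadm):
#   key="$(kubeadm init phase upload-certs --upload-certs | extract_cert_key)"
#   kubeadm token create --print-join-command --certificate-key "$key"
```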

We passed the certificate-key generated by upload-certs to the kubeadm join command. This key did work, but the join then hung at the following point:

[etcd] Announced new etcd member joining to the existing etcd cluster
[etcd] Creating static Pod manifest for "etcd"
[etcd] Waiting for the new etcd member to join the cluster. This can take up to 40s
[kubelet-check] Initial timeout of 40s passed.

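We had assumed that a failed join of a new control-plane node would not affect the running cluster, but a harsh reality arrived abruptly: the api-server container on the one and only existing control-plane node suddenly could not start. One of the error messages in syslog was: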

"Unable to authenticate the request due to an error" err="Post "https://k8s-api:6443/apis/authentication.k8s.io/v1/tokenreviews": dial tcp 10.0.9.171:6443: connect: connection refused"

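At this point the api-server was not working, but the pods on the worker nodes were all still running normally, so the applications were not affected for the time being.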
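One possible explanation, which this post does not confirm, is etcd quorum. etcd needs a majority of its registered members to be up before it will serve writes. Once the join step announced a second member to a one-member cluster, the required majority became 2 of 2, so a new member that never came up would leave the original etcd without quorum and the api-server without a usable datastore. The arithmetic, as a small sketch with hypothetical function names:

```shell
# Majority required for an etcd cluster with the given number of registered members.
etcd_quorum() {
  local members="$1"
  echo $(( members / 2 + 1 ))
}

# Number of members that can be down while the cluster still has quorum.
etcd_fault_tolerance() {
  local members="$1"
  echo $(( members - (members / 2 + 1) ))
}

for n in 1 2 3; do
  echo "members=$n quorum=$(etcd_quorum "$n") tolerated_failures=$(etcd_fault_tolerance "$n")"
done
```

Going from 1 member to 2 adds no fault tolerance, which is why the window between adding the second member and the third is the risky one.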

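In this walking-on-thin-ice situation, our first thought was to bring the old cluster back, switch the load over to it, and then deal with the new cluster calmly. But the outage turned out to have a sense of humor: the old cluster's api-server could not start properly either.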

CONTAINER ID        IMAGE                                       COMMAND                  CREATED             STATUS           
0f30ff71a13d        0cae8d5cc64c                                "kube-apiserver --ad…"   5 seconds ago       Up 3 seconds

Fate plays tricks on people, and now outages have learned to do the same.

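Caught between thin ice and a mischievous outage, we made a risky decision: renew all the certificates. We had done this successfully before, but at that time the cluster carried no load.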

We renewed the certificates in short order with the following commands:

$ cd /etc/kubernetes/pki/
$ mv {apiserver.crt,apiserver-etcd-client.key,apiserver-kubelet-client.crt,front-proxy-ca.crt,front-proxy-client.crt,front-proxy-client.key,front-proxy-ca.key,apiserver-kubelet-client.key,apiserver.key,apiserver-etcd-client.crt} ~/
$ kubeadm init phase certs all --control-plane-endpoint "k8s-api:6443"
$ cd /etc/kubernetes/
$ mv {admin.conf,controller-manager.conf,kubelet.conf,scheduler.conf} ~/
$ kubeadm init phase kubeconfig all --control-plane-endpoint "k8s-api:6443"
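The commands above move individual files into the home directory. A slightly more defensive variant, sketched here on the assumption that there is enough disk space for a full copy, snapshots a whole directory into a timestamped backup before anything is regenerated. `backup_dir` and its arguments are hypothetical. kubeadm 1.20 and later also provide `kubeadm certs check-expiration` and `kubeadm certs renew all`, which may be a less invasive route; whether they would have changed the outcome here is unknown.

```shell
# Hypothetical helper: copy a directory (modes and timestamps preserved) to
# <dest_root>/<basename>-<UTC timestamp> and print the backup path.
backup_dir() {
  local src="$1" dest_root="$2"
  local stamp dest
  [ -d "$src" ] || { echo "not a directory: $src" >&2; return 1; }
  stamp="$(date -u +%Y%m%dT%H%M%SZ)"
  dest="$dest_root/$(basename "$src")-$stamp"
  mkdir -p "$dest_root" && cp -a "$src" "$dest" && echo "$dest"
}

# Usage before regenerating certificates and kubeconfig files (as root):
#   backup_dir /etc/kubernetes "$HOME/k8s-backup"
```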

The next step was to reboot the control-plane node server so that it would use the renewed certificates.

The outcome of the reboot, however, was 502s all across the site.

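With the gamble lost, we went straight into emergency recovery: we created an image from the snapshot of the current control-plane node taken in the early hours of today and used that image to reset the whole system.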

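After the reset control-plane node booted successfully, the cluster looked as if it had recovered when viewed from that node: the worker nodes were all in the Ready state and the vast majority of pods were Running. The applications inside the pods, however, did not work properly; for example, they could not connect to other services, and ingress rules had no effect.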
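A status check like the one described can be scripted. The sketch below counts pods whose STATUS column is neither Running nor Completed in the output of `kubectl get pods -A --no-headers`; `count_unhealthy_pods` is a hypothetical name. As this incident shows, a clean result only reflects what the api-server reports and says nothing about whether networking between pods actually works.

```shell
# Reads `kubectl get pods -A --no-headers` output on stdin.
# With -A the columns are: NAMESPACE NAME READY STATUS RESTARTS AGE.
# Prints the number of pods that are neither Running nor Completed.
count_unhealthy_pods() {
  awk 'NF && $4 != "Running" && $4 != "Completed" { n++ } END { print n + 0 }'
}

# Usage:
#   kubectl get nodes
#   kubectl get pods -A --no-headers | count_unhealthy_pods
```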

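At first we thought every worker node would have to leave and rejoin the cluster to recover, but when we tried this on the first worker node, it hung at the following point: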

$ kubeadm reset
[reset] Unmounting mounted directories in "/var/lib/kubelet"

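Then we thought of simply rebooting all the worker nodes. The more critical the moment, the less one can afford to overlook the simple approach: after all of them were rebooted, everything returned to normal.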

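We are very sorry. Between 19:10 and 19:50 today, our operations on the k8s cluster caused a site-wide outage. We apologize for the serious trouble this caused you and ask for your understanding.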
