Kubernetes Cluster Node NotReady Incident Analysis Report

Posted by jaymarco on 2021-11-10


1. Problem Description

The customer reported that nodes in the UAT test environment's Kubernetes cluster were in the NotReady state and that no new resources could be created, which was blocking the vendor's business deployment testing. To avoid further impact on the vendor's application release testing, the on-site engineer recovered the Kubernetes cluster by restarting the kubelet and Docker container services. We then traced and analyzed this UAT Kubernetes cluster over the failure time window. Details of the failure are shown below.
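A minimal sketch of the recovery the on-site engineer performed, assuming systemd-managed `docker` and `kubelet` services; the existence check and fallback message are illustrative additions, not part of the original procedure:

```shell
# Restart the container runtime and the kubelet on the affected node,
# then confirm from a control-plane node that it returns to Ready.
restart_node_services() {
  for svc in docker kubelet; do
    if command -v systemctl >/dev/null 2>&1; then
      # Restart order matters: the runtime first, then the kubelet.
      systemctl restart "$svc" 2>/dev/null || echo "failed to restart $svc"
    else
      echo "systemctl not available; would restart $svc"
    fi
  done
}
restart_node_services
# Afterwards, from a control-plane node:
#   kubectl get nodes -o wide    # nodes should move back to Ready
```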

2. Problem Analysis
2.1. Analysis Process

1. Examine the /var/log/messages log:

Mar 10 10:36:18 UAT-K8S-MASTER01 systemd: Started Session 26305 of user root.

Mar 10 10:36:18 UAT-K8S-MASTER01 systemd-logind: New session 26305 of user root.      << root user logs in

Mar 10 10:36:19 UAT-K8S-MASTER01 kubelet: E0310 10:36:19.780889   20272 file.go:182] Not recursing into manifest path "/etc/kubernetes/manifests/keepalived"

Mar 10 10:36:19 UAT-K8S-MASTER01 kubelet: E0310 10:36:19.783096   20272 file.go:182] Not recursing into manifest path "/etc/kubernetes/manifests/nginx-lb"

Mar 10 10:36:39 UAT-K8S-MASTER01 kubelet: E0310 10:36:39.780897   20272 file.go:182] Not recursing into manifest path "/etc/kubernetes/manifests/keepalived"

Mar 10 10:36:39 UAT-K8S-MASTER01 kubelet: E0310 10:36:39.783123   20272 file.go:182] Not recursing into manifest path "/etc/kubernetes/manifests/nginx-lb"

Mar 10 10:36:59 UAT-K8S-MASTER01 kubelet: E0310 10:36:59.780902   20272 file.go:182] Not recursing into manifest path "/etc/kubernetes/manifests/keepalived"

Mar 10 10:36:59 UAT-K8S-MASTER01 kubelet: E0310 10:36:59.783046   20272 file.go:182] Not recursing into manifest path "/etc/kubernetes/manifests/nginx-lb"

Mar 10 10:37:03 UAT-K8S-MASTER01 systemd: Stopping firewalld - dynamic firewall daemon...     << the firewall is being stopped

Mar 10 10:37:03 UAT-K8S-MASTER01 kernel: IPVS: [sh] scheduler unregistered.    << IPVS immediately logs "scheduler unregistered"

Mar 10 10:37:03 UAT-K8S-MASTER01 kernel: IPVS: [wrr] scheduler unregistered.

Mar 10 10:37:03 UAT-K8S-MASTER01 kernel: IPVS: [rr] scheduler unregistered.

Mar 10 10:37:03 UAT-K8S-MASTER01 kernel: IPVS: ipvs unloaded.    << and the IPVS module is unloaded

Mar 10 10:37:04 UAT-K8S-MASTER01 systemd: Stopped firewalld - dynamic firewall daemon.

Mar 10 10:37:04 UAT-K8S-MASTER01 kubelet: E0310 10:37:04.573494   20272 controller.go:178] failed to update node lease, error: Put "   dial tcp 172.31.250.21:16443: connect: connection refused     << connection-refused errors begin here and repeat continuously

Mar 10 10:37:04 UAT-K8S-MASTER01 kubelet: E0310 10:37:04.573696   20272 controller.go:178] failed to update node lease, error: Put "   dial tcp 172.31.250.21:16443: connect: connection refused

Mar 10 10:37:04 UAT-K8S-MASTER01 kubelet: E0310 10:37:04.573881   20272 controller.go:178] failed to update node lease, error: Put "   dial tcp 172.31.250.21:16443: connect: connection refused

Mar 10 10:37:04 UAT-K8S-MASTER01 kubelet: E0310 10:37:04.574009   20272 controller.go:178] failed to update node lease, error: Put "   dial tcp 172.31.250.21:16443: connect: connection refused

Mar 10 10:37:04 UAT-K8S-MASTER01 kubelet: E0310 10:37:04.574123   20272 controller.go:178] failed to update node lease, error: Put "   dial tcp 172.31.250.21:16443: connect: connection refused

Mar 10 10:37:04 UAT-K8S-MASTER01 kubelet: I0310 10:37:04.574138   20272 controller.go:106] failed to update lease using latest lease, fallback to ensure lease, err: failed 5 attempts to update node lease

Mar 10 10:37:04 UAT-K8S-MASTER01 kubelet: E0310 10:37:04.574228   20272 controller.go:136] failed to ensure node lease exists, will retry in 200ms, error: Get "   eout=10s": dial tcp 172.31.250.21:16443: connect: connection refused

Mar 10 10:37:04 UAT-K8S-MASTER01 kubelet: E0310 10:37:04.774499   20272 controller.go:136] failed to ensure node lease exists, will retry in 400ms, error: Get "   eout=10s": dial tcp 172.31.250.21:16443: connect: connection refused

Mar 10 10:37:05 UAT-K8S-MASTER01 kubelet: E0310 10:37:05.174927   20272 controller.go:136] failed to ensure node lease exists, will retry in 800ms, error: Get "   eout=10s": dial tcp 172.31.250.21:16443: connect: connection refused

Mar 10 10:37:05 UAT-K8S-MASTER01 kubelet: E0310 10:37:05.314735   20272 kubelet_node_status.go:442] Error updating node status, will retry: error getting node "uat-k8s-master01": Get "   s": dial tcp 172.31.250.21:16443: connect: connection refused

Mar 10 10:37:05 UAT-K8S-MASTER01 kubelet: E0310 10:37:05.315321   20272 kubelet_node_status.go:442] Error updating node status, will retry: error getting node "uat-k8s-master01": Get "   dial tcp 172.31.250.21:16443: connect: connection refused

Mar 10 10:37:05 UAT-K8S-MASTER01 kubelet: E0310 10:37:05.315477   20272 kubelet_node_status.go:442] Error updating node status, will retry: error getting node "uat-k8s-master01": Get "   dial tcp 172.31.250.21:16443: connect: connection refused

Mar 10 10:37:05 UAT-K8S-MASTER01 kubelet: E0310 10:37:05.315616   20272 kubelet_node_status.go:442] Error updating node status, will retry: error getting node "uat-k8s-master01": Get "   dial tcp 172.31.250.21:16443: connect: connection refused
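The timeline above can be reconstructed by filtering the syslog for the three actors involved (firewalld, IPVS, kubelet) inside the incident window. A small sketch; the log path and date format assume a CentOS/RHEL-style rsyslog setup:

```shell
# Pull the firewalld/IPVS/kubelet entries from the 10:36-10:37 window.
LOG=${LOG:-/var/log/messages}
if [ -r "$LOG" ]; then
  grep -E 'Mar 10 10:3[67]' "$LOG" | grep -E 'firewalld|IPVS|kubelet'
else
  echo "log file $LOG not readable; set LOG for your distro"
fi
```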

2. Check the firewall status:

[root@UAT-K8S-MASTER01 ~]# systemctl status firewalld

● firewalld.service - firewalld - dynamic firewall daemon

   Loaded: loaded (/usr/lib/systemd/system/firewalld.service; disabled; vendor preset: enabled)

   Active: inactive (dead) since Wed 2021-03-10 10:37:04 CST; 1 day 4h ago    << the stop time matches the timestamps in the messages log

     Docs: man:firewalld(1)

 Main PID: 1201 (code=exited, status=0/SUCCESS)

Jan 12 09:37:35 UAT-K8S-MASTER01 firewalld[1201]: WARNING: COMMAND_FAILED: '/usr/sbin/iptables -w10 -t filter -F DOCKER-ISOLATION-STAGE-1' failed: iptables: No chain/target/match by that name.

Jan 12 09:37:35 UAT-K8S-MASTER01 firewalld[1201]: WARNING: COMMAND_FAILED: '/usr/sbin/iptables -w10 -t filter -X DOCKER-ISOLATION-STAGE-1' failed: iptables: No chain/target/match by that name.

Jan 12 09:37:35 UAT-K8S-MASTER01 firewalld[1201]: WARNING: COMMAND_FAILED: '/usr/sbin/iptables -w10 -t filter -F DOCKER-ISOLATION-STAGE-2' failed: iptables: No chain/target/match by that name.

Jan 12 09:37:35 UAT-K8S-MASTER01 firewalld[1201]: WARNING: COMMAND_FAILED: '/usr/sbin/iptables -w10 -t filter -X DOCKER-ISOLATION-STAGE-2' failed: iptables: No chain/target/match by that name.

Jan 12 09:37:35 UAT-K8S-MASTER01 firewalld[1201]: WARNING: COMMAND_FAILED: '/usr/sbin/iptables -w10 -t filter -F DOCKER-ISOLATION' failed: iptables: No chain/target/match by that name.

Jan 12 09:37:35 UAT-K8S-MASTER01 firewalld[1201]: WARNING: COMMAND_FAILED: '/usr/sbin/iptables -w10 -t filter -X DOCKER-ISOLATION' failed: iptables: No chain/target/match by that name.

Jan 12 09:37:35 UAT-K8S-MASTER01 firewalld[1201]: WARNING: COMMAND_FAILED: '/usr/sbin/iptables -w10 -D FORWARD -i docker0 -o docker0 -j DROP' failed: iptables: Bad rule (does a matching rule exist in that chain?).

Jan 12 09:37:35 UAT-K8S-MASTER01 firewalld[1201]: WARNING: COMMAND_FAILED: '/usr/sbin/iptables -w10 -D FORWARD -i docker0 -o docker0 -j DROP' failed: iptables: Bad rule (does a matching rule exist in that chain?).

Mar 10 10:37:03 UAT-K8S-MASTER01 systemd[1]: Stopping firewalld - dynamic firewall daemon...

Mar 10 10:37:04 UAT-K8S-MASTER01 systemd[1]: Stopped firewalld - dynamic firewall daemon.      << the firewall was stopped at 10:37:04 on March 10
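Since the investigation found the firewalls on the other nodes stopped as well, a quick sweep confirms the state cluster-wide. The node names below are hypothetical placeholders, and passwordless SSH from the admin host is assumed:

```shell
# Report firewalld state on every node; "unreachable" flags SSH failures.
NODES="uat-k8s-master01 uat-k8s-master02 uat-k8s-node01"
for n in $NODES; do
  if command -v ssh >/dev/null 2>&1; then
    state=$(ssh -o BatchMode=yes -o ConnectTimeout=2 "$n" \
              systemctl is-active firewalld 2>/dev/null || echo unreachable)
  else
    state="ssh not available"
  fi
  printf '%-20s %s\n' "$n" "$state"
done
```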


2.2. Root Cause

The /var/log/messages log shows that at 10:36:18 that morning a user logged in as root, and at 10:37:04 issued a command that stopped the firewall on UAT-K8S-MASTER01 (on further investigation, the firewalls on the other nodes had been stopped as well). Immediately after the firewall was stopped, the messages log recorded the IPVS schedulers being unregistered and the IPVS module being unloaded; the log then began printing connection-refused errors against 172.31.250.21:16443 and kept printing them continuously. As a result, none of the cluster's nodes could create new resources, and their status showed as NotReady.
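The `[sh]`, `[wrr]`, and `[rr]` schedulers unregistered in the log belong to the IPVS kernel modules that kube-proxy in IPVS mode relies on. A hedged sketch of reloading them after such an unload; the module list is the standard set for an IPVS-mode kube-proxy (with `nf_conntrack` included as a common prerequisite), an assumption rather than something the log itself confirms:

```shell
# Reload the IPVS kernel modules; requires root on a real host.
IPVS_MODS="ip_vs ip_vs_rr ip_vs_wrr ip_vs_sh nf_conntrack"
for m in $IPVS_MODS; do
  if command -v modprobe >/dev/null 2>&1; then
    modprobe "$m" 2>/dev/null || echo "could not load $m (requires root)"
  else
    echo "modprobe not available; would load $m"
  fi
done
# Verify the schedulers are back:  lsmod | grep '^ip_vs'
```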

3. Summary and Recommendations
3.1. Summary

An accidental manual operation stopped the firewall service on the Kubernetes cluster hosts. As a result, the components on the cluster nodes could no longer communicate through the firewall policies, which disrupted the operation of the whole cluster and even brought cluster nodes down. With the firewall service stopped, restarting the kubelet on the nodes restored the Kubernetes cluster, but node-to-node communication is now no longer governed by the firewall policies. This keeps the nodes communicating, yet it affects DNS resolution between services inside Kubernetes, so we recommend restoring the firewall policies to keep the environment configured as it originally was.

3.2. Recommendations
1. Monitoring of the UAT environment is still incomplete. Key metrics from every node and component should be brought into the monitoring platform, with corresponding alerts and notifications (email, SMS, etc.) configured, so that an overall picture of cluster performance and health is available immediately.
2. Tighten control over the root account and over who may log in. For example, check whether staff from other departments hold permission to log in as root; root login rights must be managed carefully to prevent a similar incident from recurring.
3. The firewalls on all nodes are currently still disabled. Since communication between the components of this Kubernetes cluster is governed by the firewall's network policies, leaving the firewall off affects DNS resolution between services inside Kubernetes; we recommend re-enabling the firewall service.
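Recommendation 3 could be carried out along these lines: re-open the ports the cluster needs before bringing firewalld back, so the API server is not cut off again. The port list below is an assumption, the usual kubeadm control-plane set plus this cluster's 16443 API endpoint port; adjust it to the actual topology:

```shell
# Re-open the required Kubernetes ports, then re-enable firewalld.
# ASSUMPTION: standard kubeadm control-plane ports plus 16443 (this
# cluster's load-balanced API endpoint seen in the kubelet errors).
K8S_PORTS="6443/tcp 16443/tcp 2379-2380/tcp 10250-10252/tcp"
if command -v firewall-cmd >/dev/null 2>&1; then
  systemctl enable --now firewalld
  for p in $K8S_PORTS; do
    firewall-cmd --permanent --add-port="$p"
  done
  firewall-cmd --reload
else
  echo "firewall-cmd not found; ports to open would be: $K8S_PORTS"
fi
```

Re-opening the ports with `--permanent` before the reload means the rules survive reboots, so the cluster is not silently depending on the firewall staying off.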


From the ITPUB blog. Link: http://blog.itpub.net/28833846/viewspace-2841519/. Please credit the source when republishing; unauthorized reuse will be pursued legally.
