Kubernetes Cluster Node NotReady Failure Analysis Report
1. Problem Description
The customer reported that nodes in the Kubernetes cluster of the UAT test environment were in NotReady state and new resources could not be created, which affected the vendor's application deployment and testing. To avoid further impact on the vendor's release testing, the on-site engineer recovered the Kubernetes cluster by restarting the kubelet and Docker container services. This report traces and analyzes the failure window of that UAT Kubernetes cluster; the detailed failure information is shown in the logs below.
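The report does not record the exact commands the on-site engineer ran; a minimal sketch of the check-and-restart sequence described above, assuming it was done per affected node from a root shell, might look like this:

kubectl get nodes -o wide           # confirm which nodes report NotReady
systemctl status docker kubelet     # on the affected node, check both services
systemctl restart docker            # restart the container runtime first
systemctl restart kubelet           # then restart kubelet so the node re-registers
kubectl get nodes                   # verify the node returns to Ready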
2. Problem Analysis
2.1 Analysis Process
1. View the details in the /var/log/messages log:
Mar 10 10:36:18 UAT-K8S-MASTER01 systemd: Started Session 26305 of user root.
Mar 10 10:36:18 UAT-K8S-MASTER01 systemd-logind: New session 26305 of user root.   << root user logs in
Mar 10 10:36:19 UAT-K8S-MASTER01 kubelet: E0310 10:36:19.780889 20272 file.go:182] Not recursing into manifest path "/etc/kubernetes/manifests/keepalived"
Mar 10 10:36:19 UAT-K8S-MASTER01 kubelet: E0310 10:36:19.783096 20272 file.go:182] Not recursing into manifest path "/etc/kubernetes/manifests/nginx-lb"
Mar 10 10:36:39 UAT-K8S-MASTER01 kubelet: E0310 10:36:39.780897 20272 file.go:182] Not recursing into manifest path "/etc/kubernetes/manifests/keepalived"
Mar 10 10:36:39 UAT-K8S-MASTER01 kubelet: E0310 10:36:39.783123 20272 file.go:182] Not recursing into manifest path "/etc/kubernetes/manifests/nginx-lb"
Mar 10 10:36:59 UAT-K8S-MASTER01 kubelet: E0310 10:36:59.780902 20272 file.go:182] Not recursing into manifest path "/etc/kubernetes/manifests/keepalived"
Mar 10 10:36:59 UAT-K8S-MASTER01 kubelet: E0310 10:36:59.783046 20272 file.go:182] Not recursing into manifest path "/etc/kubernetes/manifests/nginx-lb"
Mar 10 10:37:03 UAT-K8S-MASTER01 systemd: Stopping firewalld - dynamic firewall daemon...   << firewall being stopped
Mar 10 10:37:03 UAT-K8S-MASTER01 kernel: IPVS: [sh] scheduler unregistered.   << "scheduler unregistered" IPVS messages appear immediately
Mar 10 10:37:03 UAT-K8S-MASTER01 kernel: IPVS: [wrr] scheduler unregistered.
Mar 10 10:37:03 UAT-K8S-MASTER01 kernel: IPVS: [rr] scheduler unregistered.
Mar 10 10:37:03 UAT-K8S-MASTER01 kernel: IPVS: ipvs unloaded.   << followed by the IPVS module being unloaded
Mar 10 10:37:04 UAT-K8S-MASTER01 systemd: Stopped firewalld - dynamic firewall daemon.
Mar 10 10:37:04 UAT-K8S-MASTER01 kubelet: E0310 10:37:04.573494 20272 controller.go:178] failed to update node lease, error: Put " dial tcp 172.31.250.21:16443: connect: connection refused   << connection refused errors begin here and repeat continuously
Mar 10 10:37:04 UAT-K8S-MASTER01 kubelet: E0310 10:37:04.573696 20272 controller.go:178] failed to update node lease, error: Put " dial tcp 172.31.250.21:16443: connect: connection refused
Mar 10 10:37:04 UAT-K8S-MASTER01 kubelet: E0310 10:37:04.573881 20272 controller.go:178] failed to update node lease, error: Put " dial tcp 172.31.250.21:16443: connect: connection refused
Mar 10 10:37:04 UAT-K8S-MASTER01 kubelet: E0310 10:37:04.574009 20272 controller.go:178] failed to update node lease, error: Put " dial tcp 172.31.250.21:16443: connect: connection refused
Mar 10 10:37:04 UAT-K8S-MASTER01 kubelet: E0310 10:37:04.574123 20272 controller.go:178] failed to update node lease, error: Put " dial tcp 172.31.250.21:16443: connect: connection refused
Mar 10 10:37:04 UAT-K8S-MASTER01 kubelet: I0310 10:37:04.574138 20272 controller.go:106] failed to update lease using latest lease, fallback to ensure lease, err: failed 5 attempts to update node lease
Mar 10 10:37:04 UAT-K8S-MASTER01 kubelet: E0310 10:37:04.574228 20272 controller.go:136] failed to ensure node lease exists, will retry in 200ms, error: Get " eout=10s": dial tcp 172.31.250.21:16443: connect: connection refused
Mar 10 10:37:04 UAT-K8S-MASTER01 kubelet: E0310 10:37:04.774499 20272 controller.go:136] failed to ensure node lease exists, will retry in 400ms, error: Get " eout=10s": dial tcp 172.31.250.21:16443: connect: connection refused
Mar 10 10:37:05 UAT-K8S-MASTER01 kubelet: E0310 10:37:05.174927 20272 controller.go:136] failed to ensure node lease exists, will retry in 800ms, error: Get " eout=10s": dial tcp 172.31.250.21:16443: connect: connection refused
Mar 10 10:37:05 UAT-K8S-MASTER01 kubelet: E0310 10:37:05.314735 20272 kubelet_node_status.go:442] Error updating node status, will retry: error getting node "uat-k8s-master01": Get " s": dial tcp 172.31.250.21:16443: connect: connection refused
Mar 10 10:37:05 UAT-K8S-MASTER01 kubelet: E0310 10:37:05.315321 20272 kubelet_node_status.go:442] Error updating node status, will retry: error getting node "uat-k8s-master01": Get " dial tcp 172.31.250.21:16443: connect: connection refused
Mar 10 10:37:05 UAT-K8S-MASTER01 kubelet: E0310 10:37:05.315477 20272 kubelet_node_status.go:442] Error updating node status, will retry: error getting node "uat-k8s-master01": Get " dial tcp 172.31.250.21:16443: connect: connection refused
Mar 10 10:37:05 UAT-K8S-MASTER01 kubelet: E0310 10:37:05.315616 20272 kubelet_node_status.go:442] Error updating node status, will retry: error getting node "uat-k8s-master01": Get " dial tcp 172.31.250.21:16443: connect: connection refused
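The excerpt above can be pulled out of /var/log/messages by filtering on the failure window. The exact filter used on site is not recorded, so the following grep is only an illustrative assumption based on the timestamps and keywords seen above:

# Extract the Mar 10 10:36-10:39 window and keep the systemd, IPVS and kubelet entries
grep 'Mar 10 10:3[6-9]' /var/log/messages | grep -E 'systemd|IPVS|kubelet'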
2. View the firewalld service status in detail:
[root@UAT-K8S-MASTER01 ~]# systemctl status firewalld
● firewalld.service - firewalld - dynamic firewall daemon
   Loaded: loaded (/usr/lib/systemd/system/firewalld.service; disabled; vendor preset: enabled)
   Active: inactive (dead) since Wed 2021-03-10 10:37:04 CST; 1 day 4h ago   << firewall stop time matches the timestamps recorded in /var/log/messages
     Docs: man:firewalld(1)
 Main PID: 1201 (code=exited, status=0/SUCCESS)

Jan 12 09:37:35 UAT-K8S-MASTER01 firewalld[1201]: WARNING: COMMAND_FAILED: '/usr/sbin/iptables -w10 -t filter -F DOCKER-ISOLATION-STAGE-1' failed: iptables: No chain/target/match by that name.
Jan 12 09:37:35 UAT-K8S-MASTER01 firewalld[1201]: WARNING: COMMAND_FAILED: '/usr/sbin/iptables -w10 -t filter -X DOCKER-ISOLATION-STAGE-1' failed: iptables: No chain/target/match by that name.
Jan 12 09:37:35 UAT-K8S-MASTER01 firewalld[1201]: WARNING: COMMAND_FAILED: '/usr/sbin/iptables -w10 -t filter -F DOCKER-ISOLATION-STAGE-2' failed: iptables: No chain/target/match by that name.
Jan 12 09:37:35 UAT-K8S-MASTER01 firewalld[1201]: WARNING: COMMAND_FAILED: '/usr/sbin/iptables -w10 -t filter -X DOCKER-ISOLATION-STAGE-2' failed: iptables: No chain/target/match by that name.
Jan 12 09:37:35 UAT-K8S-MASTER01 firewalld[1201]: WARNING: COMMAND_FAILED: '/usr/sbin/iptables -w10 -t filter -F DOCKER-ISOLATION' failed: iptables: No chain/target/match by that name.
Jan 12 09:37:35 UAT-K8S-MASTER01 firewalld[1201]: WARNING: COMMAND_FAILED: '/usr/sbin/iptables -w10 -t filter -X DOCKER-ISOLATION' failed: iptables: No chain/target/match by that name.
Jan 12 09:37:35 UAT-K8S-MASTER01 firewalld[1201]: WARNING: COMMAND_FAILED: '/usr/sbin/iptables -w10 -D FORWARD -i docker0 -o docker0 -j DROP' failed: iptables: Bad rule (does a matching rule exist in that chain?).
Jan 12 09:37:35 UAT-K8S-MASTER01 firewalld[1201]: WARNING: COMMAND_FAILED: '/usr/sbin/iptables -w10 -D FORWARD -i docker0 -o docker0 -j DROP' failed: iptables: Bad rule (does a matching rule exist in that chain?).
Mar 10 10:37:03 UAT-K8S-MASTER01 systemd[1]: Stopping firewalld - dynamic firewall daemon...
Mar 10 10:37:04 UAT-K8S-MASTER01 systemd[1]: Stopped firewalld - dynamic firewall daemon.   << firewalld was stopped at 10:37:04 on Mar 10
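The cause analysis below notes that firewalld turned out to be stopped on the other nodes as well. A quick way to confirm that across the cluster (the node names here are placeholders, not taken from the report) is:

for node in uat-k8s-master01 uat-k8s-master02 uat-k8s-node01; do   # placeholder node list
    echo -n "$node: "
    ssh "$node" systemctl is-active firewalld                      # prints "active" or "inactive"
done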
2.2 Problem Cause
The /var/log/messages log shows that at 10:36:18 that morning a user logged in as root, and at 10:37:04 ran the command to stop the firewall, shutting down firewalld on UAT-K8S-MASTER01 (further checking showed that firewalld on the other nodes had also been stopped). Immediately after the firewall was stopped, the messages log recorded that the IPVS schedulers were unregistered and the IPVS module was unloaded; connection-refused errors against 172.31.250.21:16443 then began and kept repeating. As a result, none of the cluster nodes could create new resources and the nodes showed as NotReady.
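Since the IPVS schedulers were unregistered and the ip_vs module unloaded at the moment firewalld stopped, it may be worth verifying the kernel modules that kube-proxy in IPVS mode depends on and reloading any that are missing. A sketch, assuming the standard module names (older kernels use nf_conntrack_ipv4 instead of nf_conntrack):

lsmod | grep -E 'ip_vs|nf_conntrack'                # check whether the IPVS modules are still loaded
for mod in ip_vs ip_vs_rr ip_vs_wrr ip_vs_sh nf_conntrack; do
    modprobe "$mod"                                 # reload any module that was unloaded
done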
3. Summary and Recommendations
3.1 Summary
A human error stopped the firewall service on the Kubernetes cluster hosts. As a result, the components on the cluster nodes could no longer communicate with each other through the firewall policies, which affected the operation of the whole cluster and could even bring the cluster nodes down entirely. With the firewall service stopped, restarting kubelet on the nodes recovered the Kubernetes cluster, but node-to-node communication is now no longer governed by the firewall policies. Although this keeps the nodes communicating, it affects DNS resolution between services inside Kubernetes, so it is recommended to restore the firewall policies and keep the original environment configuration.
3.2 Recommendations
1. Monitoring of the UAT environment is still incomplete. The key metrics of every node and component should be brought into the monitoring platform, with corresponding alerts and notification channels (email, SMS, etc.) configured, so that an overall picture of cluster performance and health is available immediately.
2. Tighten control over the root account and over who may log in: for example, check whether staff from other departments hold root login privileges. Root access needs to be managed carefully to prevent a similar incident from recurring.
3. The firewall on all nodes is currently still stopped. Because communication between the components of this Kubernetes cluster is controlled through firewall network policies, leaving the firewall off affects DNS resolution between internal Kubernetes services. It is recommended to re-enable the firewall service (see the sketch below).
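A minimal sketch of re-enabling firewalld on a master node follows. The port list assumes the standard Kubernetes control-plane defaults plus the load-balanced API port 16443 seen in the logs; the actual rule set should be restored from this environment's original configuration:

systemctl enable --now firewalld
firewall-cmd --permanent --add-port=16443/tcp        # load-balanced kube-apiserver port seen in the logs
firewall-cmd --permanent --add-port=6443/tcp         # kube-apiserver
firewall-cmd --permanent --add-port=2379-2380/tcp    # etcd client/peer
firewall-cmd --permanent --add-port=10250/tcp        # kubelet API
firewall-cmd --permanent --add-port=30000-32767/tcp  # NodePort services
firewall-cmd --reload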