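Root-Cause Diagnosis of an Ingress Failure in a Kubernetes Cluster

Published by KubeSphere on 2022-06-18
Author: scwang18, mainly responsible for technical architecture, with considerable experience in container cloud platforms.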

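Preface

KubeSphere is an open-source, Kubernetes-based cloud-native distributed operating system from QingCloud. It provides a polished management interface for Kubernetes clusters, and our team uses KubeSphere as its development platform.

This article records how we resolved a network failure in a KubeSphere environment.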

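Symptoms

Developers reported that the Harbor registry they had set up kept misbehaving: it occasionally returned net/http: TLS handshake timeout, and requests to harbor.xxxx.cn made with curl also hung frequently and at random. ping, however, showed nothing abnormal.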
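
An intermittent hang like this is easier to reason about once it is counted. The following is a minimal sketch, not the procedure used in this incident: probe_url and count_failures are hypothetical helper names, and the attempt count and timeout are arbitrary defaults. It relies only on standard curl options (--max-time, -w).

```shell
#!/bin/sh
# probe_url URL [COUNT] [TIMEOUT]: request URL COUNT times and print one line
# per attempt: attempt number, curl exit code, HTTP status, total seconds.
# -k skips certificate verification (assumed acceptable for a quick probe).
probe_url() (
  url=$1 count=${2:-20} timeout=${3:-5} i=1
  while [ "$i" -le "$count" ]; do
    out=$(curl -sk -o /dev/null --max-time "$timeout" \
      -w '%{http_code} %{time_total}' "$url")
    # $? here is the exit status of the curl call above (28 means timeout).
    echo "attempt=$i rc=$? result=$out"
    i=$((i + 1))
  done
)

# count_failures: read probe_url output on stdin and print how many attempts
# ended with a non-zero curl exit code.
count_failures() {
  awk '$2 != "rc=0" { n++ } END { print n + 0 }'
}

# Example (not run here):
#   probe_url https://harbor.xxxx.cn 50 5 | tee probe.log | count_failures
```

Running the same probe against the in-cluster Service address and against the external address gives a rough failure rate for each path, which is the comparison the analysis below relies on.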

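Root Cause Analysis

After the fault was reported, it took several rounds of analysis to pin down the cause: most likely, KubeSphere had been installed on what was then the latest Kubernetes release, 1.23.1.

Although ./kk version --show-supported-k8s lists Kubernetes 1.23.1 as supported by KubeSphere 3.2.1, in practice that support is only experimental and has pitfalls.

The analysis went as follows: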

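  1. When the Harbor registry access problem appeared, my first instinct was that the Harbor deployment was at fault. But the Harbor core logs showed no corresponding errors at the times of the failures, not even info-level entries.
  2. I then turned to Harbor portal and checked its access logs, which likewise showed nothing abnormal.
  3. Following the access chain, I went on to check kubesphere-router-kubesphere-system, KubeSphere's build of the nginx ingress controller, and again found no abnormal logs.
  4. I tried accessing Harbor's in-cluster Service address from other Pods in the cluster and saw no timeouts. The preliminary conclusion was that KubeSphere's bundled Ingress was the problem.
  5. I disabled KubeSphere's bundled Ingress Controller and installed the ingress-nginx-controller recommended by Kubernetes. The failure persisted, and the Ingress logs still showed nothing abnormal.
  6. Putting this together, the problem had to lie between the client and the Ingress Controller. My Ingress Controller is exposed outside the cluster via NodePort, so I tested other Services exposed via NodePort and saw the same failure. At this point the Harbor deployment could be ruled out entirely, and it was all but certain that the problem was on the path from the client to the Ingress Controller.
  7. When an external client reaches the Ingress Controller through a NodePort, the traffic passes through kube-proxy. The kube-proxy logs contained this warning:

    can't set sysctl net/ipv4/vs/conn_reuse_mode, kernel version must be at least 4.1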

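This warning appears because the kernel on my CentOS 7.6 hosts is too old: it is currently 3.10.0-1160.21.1.el7.x86_64, which has compatibility problems with IPVS in newer Kubernetes versions.

Upgrading the operating system kernel resolves this.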

  8. After the kernel upgrade, Calico failed to start and reported the following error:

    ipset v7.1: kernel and userspace incompatible: settype hash:ip,port with revision 6 not supported by userspace.

The cause is that KubeSphere installs Calico v3.20.0 by default, and that version does not support the newest Linux kernels. The upgraded kernel is 5.18.1-1.el7.elrepo.x86_64, so Calico needs to be upgraded to v3.23.0 or later.

  9. After Calico was upgraded, it kept reporting errors:

    user "system:serviceaccount:kube-system:calico-node" cannot list resource "caliconodestatuses" in api group "crd.projectcalico.org"

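There was one more error message as well. Both were caused by insufficient resource permissions in the ClusterRole and can be fixed by modifying the ClusterRole.

  10. With that, this baffling network problem was solved.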
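
The kube-proxy warning in step 7 comes down to a version comparison: the running kernel must be at least 4.1. A minimal sketch of that check follows, assuming a release string of the form major.minor.rest as printed by uname -r; kernel_at_least is a hypothetical helper name, not part of any tool mentioned here.

```shell
#!/bin/sh
# kernel_at_least RELEASE MAJOR MINOR: succeed when the major.minor part of
# RELEASE (e.g. the output of `uname -r`) is at least MAJOR.MINOR.
kernel_at_least() (
  release=$1 want_major=$2 want_minor=$3
  major=${release%%.*}        # text before the first dot
  rest=${release#*.}          # text after the first dot
  minor=${rest%%[!0-9]*}      # leading digits of the remainder
  [ "$major" -gt "$want_major" ] ||
    { [ "$major" -eq "$want_major" ] && [ "$minor" -ge "$want_minor" ]; }
)

# Check the kernel of the current host against the 4.1 threshold from the warning.
if kernel_at_least "$(uname -r)" 4 1; then
  echo "kernel $(uname -r) meets the 4.1 minimum"
else
  echo "kernel $(uname -r) is older than 4.1"
fi
```

With the two kernels from this article, the check fails for 3.10.0-1160.21.1.el7.x86_64 and passes for 5.18.1-1.el7.elrepo.x86_64.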

Resolution

Based on the analysis above, the main fixes are as follows:

Upgrading the OS kernel

  1. Use the Alibaba Cloud yum mirror
wget -O /etc/yum.repos.d/CentOS-Base.repo http://mirrors.aliyun.com/repo/Centos-7.repo
yum clean all && yum -y update
  2. Enable the elrepo repository
rpm --import https://www.elrepo.org/RPM-GPG-KEY-elrepo.org
rpm -Uvh http://www.elrepo.org/elrepo-release-7.0-3.el7.elrepo.noarch.rpm
  3. Install the latest kernel
yum --enablerepo=elrepo-kernel install kernel-ml
  4. List all kernels available on the system
awk -F\' '$1=="menuentry " {print i++ " : " $2}' /etc/grub2.cfg
  5. Set the new kernel as the grub2 default

Check the list of available kernels returned in step 4; the first entry (index 0) should normally be the newly installed kernel.

grub2-set-default 0
  6. Regenerate the grub configuration and reboot
grub2-mkconfig -o /boot/grub2/grub.cfg
reboot now
  7. Verify
uname -r

Upgrading Calico

Calico on Kubernetes is usually deployed as a DaemonSet; in my cluster the Calico DaemonSet is named calico-node.

Export it to a YAML file, change every image version in the file to the latest version, v3.23.1, and recreate the DaemonSet.

  1. Export the YAML
kubectl -n kube-system get ds calico-node -o yaml > calico-node.yaml
  2. calico-node.yaml:
apiVersion: apps/v1
kind: DaemonSet
metadata:
  labels:
    k8s-app: calico-node
  name: calico-node
  namespace: kube-system
spec:
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      k8s-app: calico-node
  template:
    metadata:
      creationTimestamp: null
      labels:
        k8s-app: calico-node
    spec:
      containers:
      - env:
        - name: DATASTORE_TYPE
          value: kubernetes
        - name: WAIT_FOR_DATASTORE
          value: "true"
        - name: NODENAME
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: spec.nodeName
        - name: CALICO_NETWORKING_BACKEND
          valueFrom:
            configMapKeyRef:
              key: calico_backend
              name: calico-config
        - name: CLUSTER_TYPE
          value: k8s,bgp
        - name: NODEIP
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: status.hostIP
        - name: IP_AUTODETECTION_METHOD
          value: can-reach=$(NODEIP)
        - name: IP
          value: autodetect
        - name: CALICO_IPV4POOL_IPIP
          value: Always
        - name: CALICO_IPV4POOL_VXLAN
          value: Never
        - name: FELIX_IPINIPMTU
          valueFrom:
            configMapKeyRef:
              key: veth_mtu
              name: calico-config
        - name: FELIX_VXLANMTU
          valueFrom:
            configMapKeyRef:
              key: veth_mtu
              name: calico-config
        - name: FELIX_WIREGUARDMTU
          valueFrom:
            configMapKeyRef:
              key: veth_mtu
              name: calico-config
        - name: CALICO_IPV4POOL_CIDR
          value: 10.233.64.0/18
        - name: CALICO_IPV4POOL_BLOCK_SIZE
          value: "24"
        - name: CALICO_DISABLE_FILE_LOGGING
          value: "true"
        - name: FELIX_DEFAULTENDPOINTTOHOSTACTION
          value: ACCEPT
        - name: FELIX_IPV6SUPPORT
          value: "false"
        - name: FELIX_HEALTHENABLED
          value: "true"
        envFrom:
        - configMapRef:
            name: kubernetes-services-endpoint
            optional: true
        image: calico/node:v3.23.1
        imagePullPolicy: IfNotPresent
        livenessProbe:
          exec:
            command:
            - /bin/calico-node
            - -felix-live
            - -bird-live
          failureThreshold: 6
          initialDelaySeconds: 10
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 10
        name: calico-node
        readinessProbe:
          exec:
            command:
            - /bin/calico-node
            - -felix-ready
            - -bird-ready
          failureThreshold: 3
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 10
        resources:
          requests:
            cpu: 250m
        securityContext:
          privileged: true
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /host/etc/cni/net.d
          name: cni-net-dir
        - mountPath: /lib/modules
          name: lib-modules
          readOnly: true
        - mountPath: /run/xtables.lock
          name: xtables-lock
        - mountPath: /var/run/calico
          name: var-run-calico
        - mountPath: /var/lib/calico
          name: var-lib-calico
        - mountPath: /var/run/nodeagent
          name: policysync
        - mountPath: /sys/fs/
          mountPropagation: Bidirectional
          name: sysfs
        - mountPath: /var/log/calico/cni
          name: cni-log-dir
          readOnly: true
      dnsPolicy: ClusterFirst
      hostNetwork: true
      initContainers:
      - command:
        - /opt/cni/bin/calico-ipam
        - -upgrade
        env:
        - name: KUBERNETES_NODE_NAME
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: spec.nodeName
        - name: CALICO_NETWORKING_BACKEND
          valueFrom:
            configMapKeyRef:
              key: calico_backend
              name: calico-config
        envFrom:
        - configMapRef:
            name: kubernetes-services-endpoint
            optional: true
        image: calico/cni:v3.23.1
        imagePullPolicy: IfNotPresent
        name: upgrade-ipam
        resources: {}
        securityContext:
          privileged: true
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /var/lib/cni/networks
          name: host-local-net-dir
        - mountPath: /host/opt/cni/bin
          name: cni-bin-dir
      - command:
        - /opt/cni/bin/install
        env:
        - name: CNI_CONF_NAME
          value: 10-calico.conflist
        - name: CNI_NETWORK_CONFIG
          valueFrom:
            configMapKeyRef:
              key: cni_network_config
              name: calico-config
        - name: KUBERNETES_NODE_NAME
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: spec.nodeName
        - name: CNI_MTU
          valueFrom:
            configMapKeyRef:
              key: veth_mtu
              name: calico-config
        - name: SLEEP
          value: "false"
        envFrom:
        - configMapRef:
            name: kubernetes-services-endpoint
            optional: true
        image: calico/cni:v3.23.1
        imagePullPolicy: IfNotPresent
        name: install-cni
        resources: {}
        securityContext:
          privileged: true
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /host/opt/cni/bin
          name: cni-bin-dir
        - mountPath: /host/etc/cni/net.d
          name: cni-net-dir
      - image: calico/pod2daemon-flexvol:v3.23.1
        imagePullPolicy: IfNotPresent
        name: flexvol-driver
        resources: {}
        securityContext:
          privileged: true
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /host/driver
          name: flexvol-driver-host
      nodeSelector:
        kubernetes.io/os: linux
      priorityClassName: system-node-critical
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext: {}
      serviceAccount: calico-node
      serviceAccountName: calico-node
      terminationGracePeriodSeconds: 0
      tolerations:
      - effect: NoSchedule
        operator: Exists
      - key: CriticalAddonsOnly
        operator: Exists
      - effect: NoExecute
        operator: Exists
      volumes:
      - hostPath:
          path: /lib/modules
          type: ""
        name: lib-modules
      - hostPath:
          path: /var/run/calico
          type: ""
        name: var-run-calico
      - hostPath:
          path: /var/lib/calico
          type: ""
        name: var-lib-calico
      - hostPath:
          path: /run/xtables.lock
          type: FileOrCreate
        name: xtables-lock
      - hostPath:
          path: /sys/fs/
          type: DirectoryOrCreate
        name: sysfs
      - hostPath:
          path: /opt/cni/bin
          type: ""
        name: cni-bin-dir
      - hostPath:
          path: /etc/cni/net.d
          type: ""
        name: cni-net-dir
      - hostPath:
          path: /var/log/calico/cni
          type: ""
        name: cni-log-dir
      - hostPath:
          path: /var/lib/cni/networks
          type: ""
        name: host-local-net-dir
      - hostPath:
          path: /var/run/nodeagent
          type: DirectoryOrCreate
        name: policysync
      - hostPath:
          path: /usr/libexec/kubernetes/kubelet-plugins/volume/exec/nodeagent~uds
          type: DirectoryOrCreate
        name: flexvol-driver-host
  updateStrategy:
    rollingUpdate:
      maxSurge: 0
      maxUnavailable: 1
    type: RollingUpdate
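
The image edit can also be scripted instead of done by hand. The following is a minimal sketch, assuming the images appear as calico/<name>:<tag> without a registry prefix, as they do in the manifest above; retag_calico_images and the output file name are hypothetical. The article does not show the command used to recreate the DaemonSet, so the kubectl apply line in the comments is only one common option.

```shell
#!/bin/sh
# retag_calico_images VERSION: read a manifest on stdin and rewrite the tag of
# every "image: calico/<name>:<tag>" line to VERSION. Other lines pass through.
retag_calico_images() {
  sed -E "s#(image:[[:space:]]*calico/[a-z0-9-]+):[^[:space:]]+#\1:$1#"
}

# Example (not run here; calico-node-new.yaml is a hypothetical file name):
#   retag_calico_images v3.23.1 < calico-node.yaml > calico-node-new.yaml
#   grep 'image:' calico-node-new.yaml      # review the result first
#   kubectl apply -f calico-node-new.yaml   # assumption: apply is used to roll it out
```

Writing to a second file keeps the exported manifest available for comparison or rollback.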

ClusterRole

The ClusterRole also needs to be modified; otherwise Calico keeps reporting permission errors.

  1. Export the YAML
kubectl get clusterrole calico-node -o yaml > calico-node-clusterrole.yaml
  2. calico-node-clusterrole.yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: calico-node
rules:
- apiGroups:
  - ""
  resources:
  - pods
  - nodes
  - namespaces
  verbs:
  - get
- apiGroups:
  - discovery.k8s.io
  resources:
  - endpointslices
  verbs:
  - watch
  - list
- apiGroups:
  - ""
  resources:
  - endpoints
  - services
  verbs:
  - watch
  - list
  - get
- apiGroups:
  - ""
  resources:
  - configmaps
  verbs:
  - get
- apiGroups:
  - ""
  resources:
  - nodes/status
  verbs:
  - patch
  - update
- apiGroups:
  - networking.k8s.io
  resources:
  - networkpolicies
  verbs:
  - watch
  - list
- apiGroups:
  - ""
  resources:
  - pods
  - namespaces
  - serviceaccounts
  verbs:
  - list
  - watch
- apiGroups:
  - ""
  resources:
  - pods/status
  verbs:
  - patch
- apiGroups:
  - crd.projectcalico.org
  resources:
  - globalfelixconfigs
  - felixconfigurations
  - bgppeers
  - globalbgpconfigs
  - bgpconfigurations
  - ippools
  - ipamblocks
  - globalnetworkpolicies
  - globalnetworksets
  - networkpolicies
  - networksets
  - clusterinformations
  - hostendpoints
  - blockaffinities
  - caliconodestatuses
  - ipreservations
  verbs:
  - get
  - list
  - watch
- apiGroups:
  - crd.projectcalico.org
  resources:
  - ippools
  - felixconfigurations
  - clusterinformations
  verbs:
  - create
  - update
- apiGroups:
  - ""
  resources:
  - nodes
  verbs:
  - get
  - list
  - watch
- apiGroups:
  - crd.projectcalico.org
  resources:
  - bgpconfigurations
  - bgppeers
  verbs:
  - create
  - update
- apiGroups:
  - crd.projectcalico.org
  resources:
  - blockaffinities
  - ipamblocks
  - ipamhandles
  verbs:
  - get
  - list
  - create
  - update
  - delete
- apiGroups:
  - crd.projectcalico.org
  resources:
  - ipamconfigs
  verbs:
  - get
- apiGroups:
  - crd.projectcalico.org
  resources:
  - blockaffinities
  verbs:
  - watch
- apiGroups:
  - apps
  resources:
  - daemonsets
  verbs:
  - get
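
Once the edited ClusterRole is in place, the permission from the error message can be checked directly with kubectl auth can-i and impersonation. This is a minimal sketch: can_calico_node is a hypothetical helper name, the service account is the one named in the error above, and the user running it is assumed to have impersonation rights (a cluster administrator normally does).

```shell
#!/bin/sh
# can_calico_node VERB RESOURCE: ask the API server whether the calico-node
# service account may perform VERB on RESOURCE in the crd.projectcalico.org
# API group. kubectl prints "yes" or "no".
can_calico_node() {
  kubectl auth can-i "$1" "$2.crd.projectcalico.org" \
    --as=system:serviceaccount:kube-system:calico-node
}

# Example (not run here), checking the resource from the error message:
#   can_calico_node list caliconodestatuses
```

A "no" answer for a resource that Calico's logs complain about points at a rule that is still missing from the ClusterRole.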

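Summary

This strange network failure ultimately came down to a mismatch between the KubeSphere version and the Kubernetes version. Work environments should put stability first rather than rushing to adopt the latest releases; otherwise a lot of time is lost solving baffling problems.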
