Author: scwang18, primarily responsible for technical architecture, with a focus on container cloud platforms.
Preface
KubeSphere is an open-source, Kubernetes-based cloud-native distributed operating system from QingCloud. It provides a rather polished Kubernetes cluster management UI, and our team uses KubeSphere as our development platform.
This article records how we resolved a network failure in a KubeSphere environment.
Symptoms
A developer reported that the Harbor registry they had set up kept running into problems, occasionally failing with net/http: TLS handshake timeout. Accessing harbor.xxxx.cn with curl would also hang randomly and frequently, yet ping looked perfectly normal.
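For reference, a minimal reproduction sketch of the checks described above, run from an external client; harbor.xxxx.cn is the placeholder hostname from the report, and the /v2/ path is simply a convenient registry API endpoint to hit:
# Repeat the request and watch for random hangs / TLS handshake timeouts
for i in $(seq 1 20); do
  curl -sk -o /dev/null -w "%{http_code} %{time_total}s\n" --max-time 10 https://harbor.xxxx.cn/v2/
done
# ICMP still looks perfectly normal even while the HTTP requests above hang
ping -c 4 harbor.xxxx.cn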
Root Cause Analysis
After the issue was reported, it took several rounds of analysis to finally pin down the cause: when installing KubeSphere, we had used the latest Kubernetes release, 1.23.1.
Although running ./kk version --show-supported-k8s shows that KubeSphere 3.2.1 supports Kubernetes 1.23.1, that support is in practice only experimental and comes with pitfalls.
The analysis went as follows:
- When the Harbor registry access problem first appeared, my instinct was that the Harbor deployment was broken. However, checking the Harbor core logs showed no error messages corresponding to the failures, not even info-level entries.
- I then turned to Harbor portal and checked its access logs; again, nothing abnormal.
- Following the request chain, I went on to inspect kubesphere-router-kubesphere-system, i.e. KubeSphere's build of the nginx ingress controller. Again, no abnormal logs.
- I tried accessing Harbor's in-cluster Service address from other Pods in the cluster and found that the timeouts did not occur there. My preliminary judgment was that the problem lay with KubeSphere's bundled Ingress.
- I disabled KubeSphere's bundled Ingress Controller and installed the ingress-nginx-controller version recommended by the Kubernetes project. The failure persisted, and the Ingress logs still showed nothing abnormal.
- Putting the above together, the problem had to be somewhere between the client and the Ingress Controller. My Ingress Controller is exposed outside the cluster via NodePort, so I tested other Services exposed through NodePort and hit exactly the same failure. At this point the Harbor deployment could be ruled out completely; the problem was almost certainly between the client and the Ingress Controller.
- When an external client accesses the Ingress Controller through a NodePort, the traffic passes through the kube-proxy component. Analyzing the kube-proxy logs turned up the following warning (see the diagnostic sketch after this list):
can't set sysctl net/ipv4/vs/conn_reuse_mode, kernel version must be at least 4.1
This warning appears because the kernel of my CentOS 7.6 nodes is too old; the current version is 3.10.0-1160.21.1.el7.x86_64, which has compatibility issues with the IPVS mode used by newer Kubernetes releases.
The problem can be fixed by upgrading the operating system kernel.
After upgrading the kernel, Calico failed to start, reporting the following error:
ipset v7.1: kernel and userspace incompatible: settype hash:ip,port with revision 6 not supported by userspace.
The reason is that the Calico version installed by default with KubeSphere is v3.20.0, which does not support the newest Linux kernels. The upgraded kernel is 5.18.1-1.el7.elrepo.x86_64, so Calico needs to be upgraded to v3.23.0 or later.
After upgrading Calico, it still kept reporting errors:
user "system:serviceaccount:kube-system:calico-node" cannot list resource "caliconodestatuses" in api group "crd.projectcalico.org"
There was one other similar error message as well; both are caused by missing permissions in the ClusterRole and can be fixed by modifying the ClusterRole.
- With that, the baffling network problem was finally solved.
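For reference, a hedged sketch of the diagnostic commands behind the last few items; it assumes kube-proxy and Calico are deployed the usual way in kube-system, with kube-proxy Pods labelled k8s-app=kube-proxy (adjust labels and namespaces for your cluster):
# Kernel version on the node (3.10.x on stock CentOS 7.6)
uname -r
# Look for the IPVS / conn_reuse_mode warning in the kube-proxy logs
kubectl -n kube-system logs -l k8s-app=kube-proxy --tail=200 | grep -iE "ipvs|conn_reuse_mode"
# Check which Calico image version the cluster is currently running
kubectl -n kube-system get ds calico-node -o jsonpath='{.spec.template.spec.containers[0].image}'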
Resolution
Based on the analysis above, the main steps of the fix are as follows:
Upgrade the OS kernel
- Switch to Alibaba Cloud's yum mirror
wget -O /etc/yum.repos.d/CentOS-Base.repo http://mirrors.aliyun.com/repo/Centos-7.repo
yum clean all && yum -y update
- Enable the elrepo repository
rpm --import https://www.elrepo.org/RPM-GPG-KEY-elrepo.org
rpm -Uvh http://www.elrepo.org/elrepo-release-7.0-3.el7.elrepo.noarch.rpm
- Install the latest kernel
yum --enablerepo=elrepo-kernel install kernel-ml
- List all kernels available on the system
awk -F\' '$1=="menuentry " {print i++ " : " $2}' /etc/grub2.cfg
- Make the new kernel the default in grub2
Check the list of available kernels returned by the previous step; barring surprises, the first entry (index 0) should be the newly installed kernel.
grub2-set-default 0
- Regenerate the grub configuration file and reboot
grub2-mkconfig -o /boot/grub2/grub.cfg
reboot now
- Verify
uname -r
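Once the node is back up, it may also be worth confirming that the kube-proxy warning is gone; a small verification sketch, again assuming the standard k8s-app=kube-proxy label:
uname -r
# The IPVS sysctl is readable on kernels >= 4.1 once the ip_vs module is loaded
sysctl net.ipv4.vs.conn_reuse_mode
# The earlier warning should no longer appear in fresh kube-proxy logs
kubectl -n kube-system logs -l k8s-app=kube-proxy --tail=200 | grep -i conn_reuse_mode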
Upgrade Calico
Calico is usually deployed on Kubernetes as a DaemonSet; in my cluster the Calico DaemonSet is named calico-node.
Export it to a YAML file, change every image version tag in the file to the latest version v3.23.1, and re-create the DaemonSet; a sketch of these steps is shown after the manifest below.
- Export the YAML
kubectl -n kube-system get ds calico-node -o yaml>calico-node.yaml
- calico-node.yaml:
apiVersion: apps/v1
kind: DaemonSet
metadata:
  labels:
    k8s-app: calico-node
  name: calico-node
  namespace: kube-system
spec:
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      k8s-app: calico-node
  template:
    metadata:
      creationTimestamp: null
      labels:
        k8s-app: calico-node
    spec:
      containers:
      - env:
        - name: DATASTORE_TYPE
          value: kubernetes
        - name: WAIT_FOR_DATASTORE
          value: "true"
        - name: NODENAME
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: spec.nodeName
        - name: CALICO_NETWORKING_BACKEND
          valueFrom:
            configMapKeyRef:
              key: calico_backend
              name: calico-config
        - name: CLUSTER_TYPE
          value: k8s,bgp
        - name: NODEIP
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: status.hostIP
        - name: IP_AUTODETECTION_METHOD
          value: can-reach=$(NODEIP)
        - name: IP
          value: autodetect
        - name: CALICO_IPV4POOL_IPIP
          value: Always
        - name: CALICO_IPV4POOL_VXLAN
          value: Never
        - name: FELIX_IPINIPMTU
          valueFrom:
            configMapKeyRef:
              key: veth_mtu
              name: calico-config
        - name: FELIX_VXLANMTU
          valueFrom:
            configMapKeyRef:
              key: veth_mtu
              name: calico-config
        - name: FELIX_WIREGUARDMTU
          valueFrom:
            configMapKeyRef:
              key: veth_mtu
              name: calico-config
        - name: CALICO_IPV4POOL_CIDR
          value: 10.233.64.0/18
        - name: CALICO_IPV4POOL_BLOCK_SIZE
          value: "24"
        - name: CALICO_DISABLE_FILE_LOGGING
          value: "true"
        - name: FELIX_DEFAULTENDPOINTTOHOSTACTION
          value: ACCEPT
        - name: FELIX_IPV6SUPPORT
          value: "false"
        - name: FELIX_HEALTHENABLED
          value: "true"
        envFrom:
        - configMapRef:
            name: kubernetes-services-endpoint
            optional: true
        image: calico/node:v3.23.1
        imagePullPolicy: IfNotPresent
        livenessProbe:
          exec:
            command:
            - /bin/calico-node
            - -felix-live
            - -bird-live
          failureThreshold: 6
          initialDelaySeconds: 10
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 10
        name: calico-node
        readinessProbe:
          exec:
            command:
            - /bin/calico-node
            - -felix-ready
            - -bird-ready
          failureThreshold: 3
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 10
        resources:
          requests:
            cpu: 250m
        securityContext:
          privileged: true
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /host/etc/cni/net.d
          name: cni-net-dir
        - mountPath: /lib/modules
          name: lib-modules
          readOnly: true
        - mountPath: /run/xtables.lock
          name: xtables-lock
        - mountPath: /var/run/calico
          name: var-run-calico
        - mountPath: /var/lib/calico
          name: var-lib-calico
        - mountPath: /var/run/nodeagent
          name: policysync
        - mountPath: /sys/fs/
          mountPropagation: Bidirectional
          name: sysfs
        - mountPath: /var/log/calico/cni
          name: cni-log-dir
          readOnly: true
      dnsPolicy: ClusterFirst
      hostNetwork: true
      initContainers:
      - command:
        - /opt/cni/bin/calico-ipam
        - -upgrade
        env:
        - name: KUBERNETES_NODE_NAME
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: spec.nodeName
        - name: CALICO_NETWORKING_BACKEND
          valueFrom:
            configMapKeyRef:
              key: calico_backend
              name: calico-config
        envFrom:
        - configMapRef:
            name: kubernetes-services-endpoint
            optional: true
        image: calico/cni:v3.23.1
        imagePullPolicy: IfNotPresent
        name: upgrade-ipam
        resources: {}
        securityContext:
          privileged: true
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /var/lib/cni/networks
          name: host-local-net-dir
        - mountPath: /host/opt/cni/bin
          name: cni-bin-dir
      - command:
        - /opt/cni/bin/install
        env:
        - name: CNI_CONF_NAME
          value: 10-calico.conflist
        - name: CNI_NETWORK_CONFIG
          valueFrom:
            configMapKeyRef:
              key: cni_network_config
              name: calico-config
        - name: KUBERNETES_NODE_NAME
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: spec.nodeName
        - name: CNI_MTU
          valueFrom:
            configMapKeyRef:
              key: veth_mtu
              name: calico-config
        - name: SLEEP
          value: "false"
        envFrom:
        - configMapRef:
            name: kubernetes-services-endpoint
            optional: true
        image: calico/cni:v3.23.1
        imagePullPolicy: IfNotPresent
        name: install-cni
        resources: {}
        securityContext:
          privileged: true
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /host/opt/cni/bin
          name: cni-bin-dir
        - mountPath: /host/etc/cni/net.d
          name: cni-net-dir
      - image: calico/pod2daemon-flexvol:v3.23.1
        imagePullPolicy: IfNotPresent
        name: flexvol-driver
        resources: {}
        securityContext:
          privileged: true
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /host/driver
          name: flexvol-driver-host
      nodeSelector:
        kubernetes.io/os: linux
      priorityClassName: system-node-critical
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext: {}
      serviceAccount: calico-node
      serviceAccountName: calico-node
      terminationGracePeriodSeconds: 0
      tolerations:
      - effect: NoSchedule
        operator: Exists
      - key: CriticalAddonsOnly
        operator: Exists
      - effect: NoExecute
        operator: Exists
      volumes:
      - hostPath:
          path: /lib/modules
          type: ""
        name: lib-modules
      - hostPath:
          path: /var/run/calico
          type: ""
        name: var-run-calico
      - hostPath:
          path: /var/lib/calico
          type: ""
        name: var-lib-calico
      - hostPath:
          path: /run/xtables.lock
          type: FileOrCreate
        name: xtables-lock
      - hostPath:
          path: /sys/fs/
          type: DirectoryOrCreate
        name: sysfs
      - hostPath:
          path: /opt/cni/bin
          type: ""
        name: cni-bin-dir
      - hostPath:
          path: /etc/cni/net.d
          type: ""
        name: cni-net-dir
      - hostPath:
          path: /var/log/calico/cni
          type: ""
        name: cni-log-dir
      - hostPath:
          path: /var/lib/cni/networks
          type: ""
        name: host-local-net-dir
      - hostPath:
          path: /var/run/nodeagent
          type: DirectoryOrCreate
        name: policysync
      - hostPath:
          path: /usr/libexec/kubernetes/kubelet-plugins/volume/exec/nodeagent~uds
          type: DirectoryOrCreate
        name: flexvol-driver-host
  updateStrategy:
    rollingUpdate:
      maxSurge: 0
      maxUnavailable: 1
    type: RollingUpdate
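The bulk image-tag edit and DaemonSet re-creation mentioned above can be done roughly as follows; this is only a sketch, and the sed pattern assumes the exported manifest still references the old v3.20.0 tags, so check with grep first:
# Confirm the old image tags, then rewrite them to v3.23.1
grep "image:" calico-node.yaml
sed -i 's/v3\.20\.0/v3.23.1/g' calico-node.yaml
# Re-create the DaemonSet from the edited manifest
kubectl -n kube-system delete ds calico-node
kubectl apply -f calico-node.yaml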
ClusterRole
The ClusterRole also needs to be modified, otherwise Calico will keep reporting permission errors; a short apply sketch follows after the manifest below.
- Export the YAML
kubectl get clusterrole calico-node -o yaml >calico-node-clusterrole.yaml
- calico-node-clusterrole.yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: calico-node
rules:
- apiGroups:
  - ""
  resources:
  - pods
  - nodes
  - namespaces
  verbs:
  - get
- apiGroups:
  - discovery.k8s.io
  resources:
  - endpointslices
  verbs:
  - watch
  - list
- apiGroups:
  - ""
  resources:
  - endpoints
  - services
  verbs:
  - watch
  - list
  - get
- apiGroups:
  - ""
  resources:
  - configmaps
  verbs:
  - get
- apiGroups:
  - ""
  resources:
  - nodes/status
  verbs:
  - patch
  - update
- apiGroups:
  - networking.k8s.io
  resources:
  - networkpolicies
  verbs:
  - watch
  - list
- apiGroups:
  - ""
  resources:
  - pods
  - namespaces
  - serviceaccounts
  verbs:
  - list
  - watch
- apiGroups:
  - ""
  resources:
  - pods/status
  verbs:
  - patch
- apiGroups:
  - crd.projectcalico.org
  resources:
  - globalfelixconfigs
  - felixconfigurations
  - bgppeers
  - globalbgpconfigs
  - bgpconfigurations
  - ippools
  - ipamblocks
  - globalnetworkpolicies
  - globalnetworksets
  - networkpolicies
  - networksets
  - clusterinformations
  - hostendpoints
  - blockaffinities
  - caliconodestatuses
  - ipreservations
  verbs:
  - get
  - list
  - watch
- apiGroups:
  - crd.projectcalico.org
  resources:
  - ippools
  - felixconfigurations
  - clusterinformations
  verbs:
  - create
  - update
- apiGroups:
  - ""
  resources:
  - nodes
  verbs:
  - get
  - list
  - watch
- apiGroups:
  - crd.projectcalico.org
  resources:
  - bgpconfigurations
  - bgppeers
  verbs:
  - create
  - update
- apiGroups:
  - crd.projectcalico.org
  resources:
  - blockaffinities
  - ipamblocks
  - ipamhandles
  verbs:
  - get
  - list
  - create
  - update
  - delete
- apiGroups:
  - crd.projectcalico.org
  resources:
  - ipamconfigs
  verbs:
  - get
- apiGroups:
  - crd.projectcalico.org
  resources:
  - blockaffinities
  verbs:
  - watch
- apiGroups:
  - apps
  resources:
  - daemonsets
  verbs:
  - get
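Applying the updated role is then a single command; optionally, kubectl auth can-i can confirm that the calico-node service account now has the permission that was previously missing:
kubectl apply -f calico-node-clusterrole.yaml
# Should now answer "yes"
kubectl auth can-i list caliconodestatuses.crd.projectcalico.org --as=system:serviceaccount:kube-system:calico-node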
Summary
In the end, the root cause of this strange network failure was a mismatch between the KubeSphere and Kubernetes versions. In a working environment, stability should come first: do not rush to adopt the newest versions, or you will spend a great deal of time chasing baffling problems.