Summary of Common Kubernetes Problems

Posted by 上善若水~小輝 on 2021-04-07

How to delete an rc, deployment, or service stuck in an inconsistent state

In some cases the kubectl process hangs during deletion, and a subsequent get shows that only some of the resources were deleted while the rest cannot be removed:

[root@k8s-master ~]# kubectl get -f fluentd-elasticsearch/
NAME DESIRED CURRENT READY AGE
rc/elasticsearch-logging-v1 0 2 2 15h

NAME DESIRED CURRENT UP-TO-DATE AVAILABLE AGE
deploy/kibana-logging 0 1 1 1 15h
Error from server (NotFound): services "elasticsearch-logging" not found
Error from server (NotFound): daemonsets.extensions "fluentd-es-v1.22" not found
Error from server (NotFound): services "kibana-logging" not found

The commands to delete these deployments, services, or rcs are as follows:

kubectl delete deployment kibana-logging -n kube-system --cascade=false

kubectl delete deployment kibana-logging -n kube-system  --ignore-not-found

kubectl delete rc elasticsearch-logging-v1 -n kube-system --force --grace-period=0
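If some objects still refuse to go away, re-check what is left before resorting to the etcd reset below; a quick sketch, with the namespace and name patterns taken from the output above:

kubectl get rc,deploy,svc,ds -n kube-system | grep -E 'elasticsearch|kibana|fluentd'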

How to reset etcd when resources cannot be deleted

rm -rf /var/lib/etcd/*

After deleting, reboot the master node. After resetting etcd, the network configuration must be set up again:

etcdctl mk /atomic.io/network/config '{ "Network": "192.168.0.0/16" }'
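To confirm the key was written back and bring the dependent services up, a sketch assuming the default etcd v2 endpoint and that flanneld and docker run on this host:

# Verify the flannel network key (etcd v2 API)
etcdctl get /atomic.io/network/config
# Restart the services that depend on etcd
systemctl restart flanneld docker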

kube-apiserver fails to start

Every start attempt reports the following error:

start request repeated too quickly for kube-apiserver.service

This is not actually a start-frequency problem; check /var/log/messages. In my case, after enabling ServiceAccount the apiserver could not find ca.crt and related files, which caused the startup failure.

May 21 07:56:41 k8s-master kube-apiserver: Flag --port has been deprecated, see --insecure-port instead.
May 21 07:56:41 k8s-master kube-apiserver: F0521 07:56:41.692480 4299 universal_validation.go:104] Validate server run options failed: unable to load client CA file: open /var/run/kubernetes/ca.crt: no such file or directory
May 21 07:56:41 k8s-master systemd: kube-apiserver.service: main process exited, code=exited, status=255/n/a
May 21 07:56:41 k8s-master systemd: Failed to start Kubernetes API Server.
May 21 07:56:41 k8s-master systemd: Unit kube-apiserver.service entered failed state.
May 21 07:56:41 k8s-master systemd: kube-apiserver.service failed.
May 21 07:56:41 k8s-master systemd: kube-apiserver.service holdoff time over, scheduling restart.
May 21 07:56:41 k8s-master systemd: start request repeated too quickly for kube-apiserver.service
May 21 07:56:41 k8s-master systemd: Failed to start Kubernetes API Server.

When deploying logging components such as fluentd, many problems come from the security configuration required once the ServiceAccount option is enabled, so in the end the ServiceAccount still has to be configured properly.

Permission denied errors

When configuring fluentd, the error cannot create /var/log/fluentd.log: Permission denied appears. This is caused by SELinux not being disabled.

Change SELINUX=enforcing to SELINUX=disabled in /etc/selinux/config, then reboot.
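The same change can be scripted; a minimal sketch that also switches SELinux to permissive immediately, so the reboot can wait:

# Permanently disable SELinux (takes effect after reboot)
sed -i 's/^SELINUX=enforcing/SELINUX=disabled/' /etc/selinux/config
# Drop to permissive mode for the current boot
setenforce 0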

Configuration based on ServiceAccount

First generate the various keys needed; replace k8s-master with your master's hostname.

openssl genrsa -out ca.key 2048
openssl req -x509 -new -nodes -key ca.key -subj "/CN=k8s-master" -days 10000 -out ca.crt
openssl genrsa -out server.key 2048

echo subjectAltName=IP:10.254.0.1 > extfile.cnf

# The IP is determined by the following command:

#kubectl get services --all-namespaces |grep 'default'|grep 'kubernetes'|grep '443'|awk '{print $3}'

openssl req -new -key server.key -subj "/CN=k8s-master" -out server.csr

openssl x509 -req -in server.csr -CA ca.crt -CAkey ca.key -CAcreateserial -extfile extfile.cnf -out server.crt -days 10000
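The apiserver command further down also reads a basic-auth file from /root/keys/basic_auth.csv; the legacy format is one password,user,uid entry per line. A minimal sketch with placeholder credentials, plus a check that the SAN landed in the server certificate:

# Placeholder credentials - replace before use (format: password,user,uid)
echo 'admin_password,admin,1000' > /root/keys/basic_auth.csv
# Confirm the subjectAltName was embedded
openssl x509 -in server.crt -noout -text | grep -A1 'Subject Alternative Name'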

If you change the parameters in the /etc/kubernetes/apiserver configuration file, starting via systemctl start kube-apiserver fails with the error:

Validate server run options failed: unable to load client CA file: open /root/keys/ca.crt: permission denied

However, the API Server can be started from the command line:

/usr/bin/kube-apiserver --logtostderr=true --v=0 --etcd-servers=http://k8s-master:2379 --address=0.0.0.0 --port=8080 --kubelet-port=10250 --allow-privileged=true --service-cluster-ip-range=10.254.0.0/16 --admission-control=ServiceAccount --insecure-bind-address=0.0.0.0 --client-ca-file=/root/keys/ca.crt --tls-cert-file=/root/keys/server.crt --tls-private-key-file=/root/keys/server.key --basic-auth-file=/root/keys/basic_auth.csv --secure-port=443 &>> /var/log/kubernetes/kube-apiserver.log &
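Once it is up, a quick smoke test against the insecure port configured above:

curl http://k8s-master:8080/version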

Start the Controller-manager from the command line:

/usr/bin/kube-controller-manager --logtostderr=true --v=0 --master=http://k8s-master:8080 --root-ca-file=/root/keys/ca.crt --service-account-private-key-file=/root/keys/server.key &>> /var/log/kubernetes/kube-controller-manager.log &

etcd fails to start: problem (1)

etcd is the ZooKeeper of the Kubernetes cluster: almost every service depends on etcd starting first, e.g. flanneld, apiserver, docker, and so on. The error log when starting etcd is as follows:

May 24 13:39:09 k8s-master systemd: Stopped Flanneld overlay address etcd agent.
May 24 13:39:28 k8s-master systemd: Starting Etcd Server...
May 24 13:39:28 k8s-master etcd: recognized and used environment variable ETCD_ADVERTISE_CLIENT_URLS=http://etcd:2379,http://etcd:4001
May 24 13:39:28 k8s-master etcd: recognized environment variable ETCD_NAME, but unused: shadowed by corresponding flag
May 24 13:39:28 k8s-master etcd: recognized environment variable ETCD_DATA_DIR, but unused: shadowed by corresponding flag
May 24 13:39:28 k8s-master etcd: recognized environment variable ETCD_LISTEN_CLIENT_URLS, but unused: shadowed by corresponding flag
May 24 13:39:28 k8s-master etcd: etcd Version: 3.1.3
May 24 13:39:28 k8s-master etcd: Git SHA: 21fdcc6
May 24 13:39:28 k8s-master etcd: Go Version: go1.7.4
May 24 13:39:28 k8s-master etcd: Go OS/Arch: linux/amd64
May 24 13:39:28 k8s-master etcd: setting maximum number of CPUs to 1, total number of available CPUs is 1
May 24 13:39:28 k8s-master etcd: the server is already initialized as member before, starting as etcd member...
May 24 13:39:28 k8s-master etcd: listening for peers on http://localhost:2380
May 24 13:39:28 k8s-master etcd: listening for client requests on 0.0.0.0:2379
May 24 13:39:28 k8s-master etcd: listening for client requests on 0.0.0.0:4001
May 24 13:39:28 k8s-master etcd: recovered store from snapshot at index 140014
May 24 13:39:28 k8s-master etcd: name = master
May 24 13:39:28 k8s-master etcd: data dir = /var/lib/etcd/default.etcd
May 24 13:39:28 k8s-master etcd: member dir = /var/lib/etcd/default.etcd/member
May 24 13:39:28 k8s-master etcd: heartbeat = 100ms
May 24 13:39:28 k8s-master etcd: election = 1000ms
May 24 13:39:28 k8s-master etcd: snapshot count = 10000
May 24 13:39:28 k8s-master etcd: advertise client URLs = http://etcd:2379,http://etcd:4001
May 24 13:39:28 k8s-master etcd: ignored file 0000000000000001-0000000000012700.wal.broken in wal
May 24 13:39:29 k8s-master etcd: restarting member 8e9e05c52164694d in cluster cdf818194e3a8c32 at commit index 148905
May 24 13:39:29 k8s-master etcd: 8e9e05c52164694d became follower at term 12
May 24 13:39:29 k8s-master etcd: newRaft 8e9e05c52164694d [peers: [8e9e05c52164694d], term: 12, commit: 148905, applied: 140014, lastindex: 148905, lastterm: 12]
May 24 13:39:29 k8s-master etcd: enabled capabilities for version 3.1
May 24 13:39:29 k8s-master etcd: added member 8e9e05c52164694d [http://localhost:2380] to cluster cdf818194e3a8c32 from store
May 24 13:39:29 k8s-master etcd: set the cluster version to 3.1 from store
May 24 13:39:29 k8s-master etcd: starting server... [version: 3.1.3, cluster version: 3.1]
May 24 13:39:29 k8s-master etcd: raft save state and entries error: open /var/lib/etcd/default.etcd/member/wal/0.tmp: is a directory
May 24 13:39:29 k8s-master systemd: etcd.service: main process exited, code=exited, status=1/FAILURE
May 24 13:39:29 k8s-master systemd: Failed to start Etcd Server.
May 24 13:39:29 k8s-master systemd: Unit etcd.service entered failed state.
May 24 13:39:29 k8s-master systemd: etcd.service failed.
May 24 13:39:29 k8s-master systemd: etcd.service holdoff time over, scheduling restart.

The key line:

raft save state and entries error: open /var/lib/etcd/default.etcd/member/wal/0.tmp: is a directory

Go into that directory, delete 0.tmp, and etcd will start again.
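In command form, using the path from the log above:

# Remove the stray temp entry that blocks WAL writes, then restart etcd
rm -rf /var/lib/etcd/default.etcd/member/wal/0.tmp
systemctl restart etcd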

etcd fails to start: timeout problem (2)

Background: 3 etcd nodes were deployed, and one day all 3 machines lost power and went down at once. After booting them back up, the K8S cluster worked normally, but a check of the components showed that etcd on one node would not start. Investigation found the system time was wrong; after correcting it with ntpdate ntp.aliyun.com and restarting etcd, it still failed, with the following error:

Mar 05 14:27:15 k8s-node2 etcd[3248]: etcd Version: 3.3.13
Mar 05 14:27:15 k8s-node2 etcd[3248]: Git SHA: 98d3084
Mar 05 14:27:15 k8s-node2 etcd[3248]: Go Version: go1.10.8
Mar 05 14:27:15 k8s-node2 etcd[3248]: Go OS/Arch: linux/amd64
Mar 05 14:27:15 k8s-node2 etcd[3248]: setting maximum number of CPUs to 4, total number of available CPUs is 4
Mar 05 14:27:15 k8s-node2 etcd[3248]: the server is already initialized as member before, starting as etcd member
...
Mar 05 14:27:15 k8s-node2 etcd[3248]: peerTLS: cert = /opt/etcd/ssl/server.pem, key = /opt/etcd/ssl/server-key.pem, ca = , trusted-ca = /opt/etcd/ssl/ca.pem, client-cert-auth = false, crl-file =
Mar 05 14:27:15 k8s-node2 etcd[3248]: listening for peers on https://192.168.25.226:2380
Mar 05 14:27:15 k8s-node2 etcd[3248]: The scheme of client url http://127.0.0.1:2379 is HTTP while peer key/cert files are presented. Ignored key/cert files.
Mar 05 14:27:15 k8s-node2 etcd[3248]: listening for client requests on 127.0.0.1:2379
Mar 05 14:27:15 k8s-node2 etcd[3248]: listening for client requests on 192.168.25.226:2379
Mar 05 14:27:15 k8s-node2 etcd[3248]: member 9c166b8b7cb6ecb8 has already been bootstrapped
Mar 05 14:27:15 k8s-node2 systemd[1]: etcd.service: main process exited, code=exited, status=1/FAILURE
Mar 05 14:27:15 k8s-node2 systemd[1]: Failed to start Etcd Server.
Mar 05 14:27:15 k8s-node2 systemd[1]: Unit etcd.service entered failed state.
Mar 05 14:27:15 k8s-node2 systemd[1]: etcd.service failed.
Mar 05 14:27:15 k8s-node2 systemd[1]: etcd.service failed.
Mar 05 14:27:15 k8s-node2 systemd[1]: etcd.service holdoff time over, scheduling restart.
Mar 05 14:27:15 k8s-node2 systemd[1]: Starting Etcd Server...
Mar 05 14:27:15 k8s-node2 etcd[3258]: recognized environment variable ETCD_NAME, but unused: shadowed by corresponding flag
Mar 05 14:27:15 k8s-node2 etcd[3258]: recognized environment variable ETCD_DATA_DIR, but unused: shadowed by corresponding flag
Mar 05 14:27:15 k8s-node2 etcd[3258]: recognized environment variable ETCD_LISTEN_PEER_URLS, but unused: shadowed by corresponding flag
Mar 05 14:27:15 k8s-node2 etcd[3258]: recognized environment variable ETCD_LISTEN_CLIENT_URLS, but unused: shadowed by corresponding flag
Mar 05 14:27:15 k8s-node2 etcd[3258]: recognized environment variable ETCD_INITIAL_ADVERTISE_PEER_URLS, but unused: shadowed by corresponding flag
Mar 05 14:27:15 k8s-node2 etcd[3258]: recognized environment variable ETCD_ADVERTISE_CLIENT_URLS, but unused: shadowed by corresponding flag
Mar 05 14:27:15 k8s-node2 etcd[3258]: recognized environment variable ETCD_INITIAL_CLUSTER, but unused: shadowed by corresponding flag
Mar 05 14:27:15 k8s-node2 etcd[3258]: recognized environment variable ETCD_INITIAL_CLUSTER_TOKEN, but unused: shadowed by corresponding flag
Mar 05 14:27:15 k8s-node2 etcd[3258]: recognized environment variable ETCD_INITIAL_CLUSTER_STATE, but unused: shadowed by corresponding flag

Solution: the log shows no particularly obvious error (the key line is member 9c166b8b7cb6ecb8 has already been bootstrapped). From experience, one broken etcd node has little impact on the cluster, and the cluster was already usable, but the failed etcd node still refused to start. The fix is as follows:

  • Go into etcd's data directory and back up the existing data:
cd /var/lib/etcd/default.etcd/member/
cp -r * /data/bak/
  • Delete all the data files under this directory:
rm -rf /var/lib/etcd/default.etcd/member/*
  • Stop the other two etcd nodes as well, because all etcd members need to be started together; once they come up, the cluster is usable again (see the health check after this list).
# master node
systemctl stop etcd
systemctl restart etcd

# node1
systemctl stop etcd
systemctl restart etcd

# node2
systemctl stop etcd
systemctl restart etcd
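After all three members come back, confirm cluster health; a sketch using the v2-style etcdctl bundled with etcd 3.3 and the certificate paths from the log above:

etcdctl --ca-file=/opt/etcd/ssl/ca.pem \
  --cert-file=/opt/etcd/ssl/server.pem \
  --key-file=/opt/etcd/ssl/server-key.pem \
  --endpoints=https://192.168.25.226:2379 cluster-health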

Configuring mutual host trust (passwordless SSH) on CentOS

On each server, as the user that needs mutual trust, run the following command to generate the public/private key pair; just press Enter at every prompt to accept the defaults.

ssh-keygen -t rsa

You can see the public key file that was generated. Exchange public keys between the hosts; the first copy asks for a password, and after that you are set.

ssh-copy-id -i /root/.ssh/id_rsa.pub root@192.168.199.132 (-p 2222)

-p specifies the port; leave it off for the default port, but if the port has been changed you must add -p. An authorized_keys file is created under .ssh/, recording the public keys of the other servers allowed to log in to this one. Test that login works:

ssh 192.168.199.132 (-p 2222)
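To establish trust toward several nodes in one go, the copy step can be looped; a sketch with hypothetical node IPs:

# Hypothetical node list - adjust the IPs (add -p PORT if the port was changed)
for host in 192.168.199.131 192.168.199.132 192.168.199.133; do
  ssh-copy-id -i /root/.ssh/id_rsa.pub root@$host
done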

Changing the hostname on CentOS

hostnamectl set-hostname k8s-master1

Enabling copy and paste for CentOS in VirtualBox

If a package will not install or the command produces no output, change update to install and run it again.

yum update
yum update kernel
yum update kernel-devel
yum install kernel-headers
yum install gcc
yum install gcc make

After these finish, run the VirtualBox Guest Additions installer:

sh VBoxLinuxAdditions.run

Deleting a Pod stuck in Terminating state

It can be force-deleted with the following command:

kubectl delete pod NAME --grace-period=0 --force
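To sweep every stuck pod in a namespace at once, a sketch (NAMESPACE is a placeholder):

# Force-delete all Terminating pods in NAMESPACE
kubectl get pods -n NAMESPACE | awk '$3=="Terminating" {print $1}' | \
  xargs -r kubectl delete pod -n NAMESPACE --grace-period=0 --force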

Deleting a namespace stuck in Terminating state

It can be force-deleted with the following script:

[root@k8s-master1 k8s]# cat delete-ns.sh
#!/bin/bash
set -e

usage(){
    echo "usage:"
    echo " delete-ns.sh NAMESPACE"
}

if [ $# -lt 1 ];then
    usage
    exit
fi

NAMESPACE=$1
JSONFILE=${NAMESPACE}.json
kubectl get ns "${NAMESPACE}" -o json > "${JSONFILE}"
# In the editor, remove the entries under spec.finalizers, then save and quit
vi "${JSONFILE}"
# Requires a running kubectl proxy serving the API on 127.0.0.1:8001
curl -k -H "Content-Type: application/json" -X PUT --data-binary @"${JSONFILE}" \
    http://127.0.0.1:8001/api/v1/namespaces/"${NAMESPACE}"/finalize
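Typical invocation; a sketch assuming kubectl proxy is started first so the API is reachable on 127.0.0.1:8001:

kubectl proxy &
bash delete-ns.sh stuck-namespace   # stuck-namespace is a placeholder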

What problems can occur when a container has valid CPU/memory requests but no limits?

Below we create such a container, which sets only requests and no limits:

- name: busybox-cnt02
  image: busybox
  command: ["/bin/sh"]
  args: ["-c", "while true; do echo hello from cnt02; sleep 10; done"]
  resources:
    requests:
      memory: "100Mi"
      cpu: "100m"

What problem does this container have? In a normal environment, none. But on a resource-constrained node, containers without a limit can have their resources taken over by other pods, which may cause the containerized application to fail. A LimitRange policy can be used to fill in the limits automatically, provided the LimitRange rules are configured in advance; a sketch follows below.
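A minimal LimitRange sketch (name, namespace, and values are hypothetical); containers created in that namespace without explicit limits get these defaults injected:

kubectl apply -f - <<EOF
apiVersion: v1
kind: LimitRange
metadata:
  name: default-limits
  namespace: default
spec:
  limits:
  - type: Container
    default:          # limits injected when none are specified
      cpu: "500m"
      memory: "256Mi"
    defaultRequest:   # requests injected when none are specified
      cpu: "100m"
      memory: "100Mi"
EOF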

Original article: https://www.cnblogs.com/passzhang
