kubernetes/k8s CRI analysis: how kubelet deletes a pod

Posted by 良凱爾 on 2021-08-15

Original URL: https://www.cnblogs.com/lianngkyle/p/15143931.html

Related posts: "kubernetes/k8s CRI analysis: the container runtime interface"
"kubernetes/k8s CRI analysis: how kubelet creates a pod"

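The earlier posts first introduced CRI and then analyzed the CRI-related kubelet source code in four parts: the CRI-related startup parameters of the kubelet component, the CRI-related interfaces and structs, the CRI-related initialization, and how kubelet calls CRI to create a pod. If you have not read them yet, follow the links above.

Here is the CRI architecture diagram from the earlier posts once more.

This post analyzes how kubelet calls CRI to delete a pod.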

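CRI-related source code analysis in kubelet

The analysis of the kubelet CRI source code consists of the following parts:
(1) kubelet CRI-related startup parameters;
(2) kubelet CRI-related interfaces and structs;
(3) kubelet CRI initialization;
(4) how kubelet calls CRI to create a pod;
(5) how kubelet calls CRI to delete a pod.

The previous two posts covered the first four parts; this post covers part (5).

Based on tag v1.17.4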

https://github.com/kubernetes/kubernetes/releases/tag/v1.17.4

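5. How kubelet calls CRI to delete a pod

kubelet CRI pod deletion call flow

The following uses the kubelet dockershim pod deletion call flow as an example.

kubelet calls dockershim to stop the containers; dockershim in turn calls docker to stop them and calls CNI to tear down the pod network.

Figure 1: kubelet dockershim pod deletion call flow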

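dockershim is the CRI shim built into kubelet. The pod deletion call flow of the other, remote CRI shims is essentially the same as that of dockershim; they simply call a different container engine to operate on the containers, and it is still the CRI shim that calls CNI to tear down the pod network.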

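The detailed source code analysis follows.

Start with the KillPod method of kubeGenericRuntimeManager; the logic that calls CRI to delete a pod is triggered from this method.

As the code of this method shows, kubelet deletes a pod as follows:
(1) first stop all containers that belong to the pod;
(2) then stop the pod sandbox container.

Note: this step only stops the containers; removing them is left to kubelet's GC.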

// pkg/kubelet/kuberuntime/kuberuntime_manager.go
// KillPod kills all the containers of a pod. Pod may be nil, running pod must not be.
// gracePeriodOverride if specified allows the caller to override the pod default grace period.
// only hard kill paths are allowed to specify a gracePeriodOverride in the kubelet in order to not corrupt user data.
// it is useful when doing SIGKILL for hard eviction scenarios, or max grace period during soft eviction scenarios.
func (m *kubeGenericRuntimeManager) KillPod(pod *v1.Pod, runningPod kubecontainer.Pod, gracePeriodOverride *int64) error {
	err := m.killPodWithSyncResult(pod, runningPod, gracePeriodOverride)
	return err.Error()
}

// killPodWithSyncResult kills a runningPod and returns SyncResult.
// Note: The pod passed in could be *nil* when kubelet restarted.
func (m *kubeGenericRuntimeManager) killPodWithSyncResult(pod *v1.Pod, runningPod kubecontainer.Pod, gracePeriodOverride *int64) (result kubecontainer.PodSyncResult) {
	killContainerResults := m.killContainersWithSyncResult(pod, runningPod, gracePeriodOverride)
	for _, containerResult := range killContainerResults {
		result.AddSyncResult(containerResult)
	}

	// stop sandbox, the sandbox will be removed in GarbageCollect
	killSandboxResult := kubecontainer.NewSyncResult(kubecontainer.KillPodSandbox, runningPod.ID)
	result.AddSyncResult(killSandboxResult)
	// Stop all sandboxes belongs to same pod
	for _, podSandbox := range runningPod.Sandboxes {
		if err := m.runtimeService.StopPodSandbox(podSandbox.ID.ID); err != nil {
			killSandboxResult.Fail(kubecontainer.ErrKillPodSandbox, err.Error())
			klog.Errorf("Failed to stop sandbox %q", podSandbox.ID)
		}
	}

	return
}
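The ordering described above can be reproduced with a small stand-alone sketch. The `fakeRuntime` type and the `killPod` helper below are hypothetical names introduced only for illustration; they are not part of kubelet. The sketch also stops containers sequentially for brevity, whereas kubelet stops them concurrently.

```go
package main

import (
	"errors"
	"fmt"
)

// fakeRuntime is a hypothetical stand-in for the CRI runtime service.
// It only records the order in which stop calls arrive.
type fakeRuntime struct {
	calls       []string
	failSandbox bool
}

func (f *fakeRuntime) StopContainer(id string) error {
	f.calls = append(f.calls, "StopContainer:"+id)
	return nil
}

func (f *fakeRuntime) StopPodSandbox(id string) error {
	f.calls = append(f.calls, "StopPodSandbox:"+id)
	if f.failSandbox {
		return errors.New("sandbox stop failed")
	}
	return nil
}

// killPod stops every container first, then every sandbox, and collects
// all errors instead of returning on the first one. Nothing is removed here;
// removal is assumed to happen later in a separate garbage collection pass.
func killPod(rt *fakeRuntime, containers, sandboxes []string) []error {
	var errs []error
	for _, c := range containers {
		if err := rt.StopContainer(c); err != nil {
			errs = append(errs, err)
		}
	}
	for _, s := range sandboxes {
		if err := rt.StopPodSandbox(s); err != nil {
			errs = append(errs, err)
		}
	}
	return errs
}

func main() {
	rt := &fakeRuntime{}
	errs := killPod(rt, []string{"app", "sidecar"}, []string{"sandbox-0"})
	fmt.Println(rt.calls, len(errs))
}
```

A failure to stop one sandbox is recorded but does not stop the loop, which matches how killPodWithSyncResult marks the result as failed and carries on.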

5.1 m.killContainersWithSyncResult

m.killContainersWithSyncResult stops all containers that belong to the pod.

Main logic: start one goroutine per container, each of which calls m.killContainer to stop its container.

// pkg/kubelet/kuberuntime/kuberuntime_container.go
// killContainersWithSyncResult kills all pod's containers with sync results.
func (m *kubeGenericRuntimeManager) killContainersWithSyncResult(pod *v1.Pod, runningPod kubecontainer.Pod, gracePeriodOverride *int64) (syncResults []*kubecontainer.SyncResult) {
	containerResults := make(chan *kubecontainer.SyncResult, len(runningPod.Containers))
	wg := sync.WaitGroup{}

	wg.Add(len(runningPod.Containers))
	for _, container := range runningPod.Containers {
		go func(container *kubecontainer.Container) {
			defer utilruntime.HandleCrash()
			defer wg.Done()

			killContainerResult := kubecontainer.NewSyncResult(kubecontainer.KillContainer, container.Name)
			if err := m.killContainer(pod, container.ID, container.Name, "", gracePeriodOverride); err != nil {
				killContainerResult.Fail(kubecontainer.ErrKillContainer, err.Error())
			}
			containerResults <- killContainerResult
		}(container)
	}
	wg.Wait()
	close(containerResults)

	for containerResult := range containerResults {
		syncResults = append(syncResults, containerResult)
	}
	return
}
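The fan-out pattern used above (one goroutine per container, a WaitGroup, and a channel buffered to the number of containers) can be tried in isolation. This is a minimal sketch; `result`, `stopAll`, and `errTimeout` are hypothetical names, and the final sort is added only to make the output stable.

```go
package main

import (
	"errors"
	"fmt"
	"sort"
	"sync"
)

// errTimeout is a hypothetical error used by the example stop function.
var errTimeout = errors.New("stop timed out")

// result is a simplified, hypothetical counterpart of kubecontainer.SyncResult.
type result struct {
	Name string
	Err  error
}

// stopAll runs stop once per name, each in its own goroutine, and gathers the results.
func stopAll(names []string, stop func(string) error) []result {
	// One buffer slot per container, so no goroutine blocks on send.
	results := make(chan result, len(names))
	var wg sync.WaitGroup

	wg.Add(len(names))
	for _, name := range names {
		go func(name string) {
			defer wg.Done()
			results <- result{Name: name, Err: stop(name)}
		}(name)
	}
	wg.Wait()
	close(results) // safe: every sender has finished

	out := make([]result, 0, len(names))
	for r := range results {
		out = append(out, r)
	}
	// Completion order is nondeterministic; sort for stable output.
	sort.Slice(out, func(i, j int) bool { return out[i].Name < out[j].Name })
	return out
}

func main() {
	res := stopAll([]string{"web", "db"}, func(name string) error {
		if name == "db" {
			return errTimeout
		}
		return nil
	})
	for _, r := range res {
		fmt.Println(r.Name, r.Err)
	}
}
```

Because the channel is sized to the number of containers, every goroutine can send its result and exit even though the receiver only starts reading after wg.Wait returns.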

5.1.1 m.killContainer

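The m.killContainer method mainly calls m.runtimeService.StopContainer.

runtimeService is a RemoteRuntimeService. It implements the RuntimeService interface, the client side of the container runtime interface, and holds the client used to talk to the CRI shim's container runtime server. Calling m.runtimeService.StopContainer therefore amounts to calling the StopContainer method on the CRI shim server to stop the container.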

// pkg/kubelet/kuberuntime/kuberuntime_container.go
// killContainer kills a container through the following steps:
// * Run the pre-stop lifecycle hooks (if applicable).
// * Stop the container.
func (m *kubeGenericRuntimeManager) killContainer(pod *v1.Pod, containerID kubecontainer.ContainerID, containerName string, message string, gracePeriodOverride *int64) error {
	...

	klog.V(2).Infof("Killing container %q with %d second grace period", containerID.String(), gracePeriod)

	err := m.runtimeService.StopContainer(containerID.ID, gracePeriod)
	if err != nil {
		klog.Errorf("Container %q termination failed with gracePeriod %d: %v", containerID.String(), gracePeriod, err)
	} else {
		klog.V(3).Infof("Container %q exited normally", containerID.String())
	}

	m.containerRefManager.ClearRef(containerID)

	return err
}

m.runtimeService.StopContainer

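The m.runtimeService.StopContainer method calls r.runtimeClient.StopContainer; that is, it uses the CRI shim client to ask the CRI shim server to stop the container.

This completes the CRI-related calls inside kubelet. Next, we move into the CRI shim (using dockershim, the CRI shim built into kubelet, as the example) to analyze how the container is stopped.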

// pkg/kubelet/remote/remote_runtime.go
// StopContainer stops a running container with a grace period (i.e., timeout).
func (r *RemoteRuntimeService) StopContainer(containerID string, timeout int64) error {
	// Use timeout + default timeout (2 minutes) as timeout to leave extra time
	// for SIGKILL container and request latency.
	t := r.timeout + time.Duration(timeout)*time.Second
	ctx, cancel := getContextWithTimeout(t)
	defer cancel()

	r.logReduction.ClearID(containerID)
	_, err := r.runtimeClient.StopContainer(ctx, &runtimeapi.StopContainerRequest{
		ContainerId: containerID,
		Timeout:     timeout,
	})
	if err != nil {
		klog.Errorf("StopContainer %q from runtime service failed: %v", containerID, err)
		return err
	}

	return nil
}
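The deadline arithmetic in StopContainer is easy to check on its own. The sketch below assumes a 2-minute default, taken from the comment in the code above, and assumes that getContextWithTimeout behaves like context.WithTimeout on a background context; `defaultTimeout` and `rpcTimeout` are hypothetical names.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// defaultTimeout mirrors the 2-minute default mentioned in the source comment.
// Treat the exact value as an assumption about the kubelet configuration.
const defaultTimeout = 2 * time.Minute

// rpcTimeout returns the time budget for a stop request: the runtime's
// default timeout plus the container's grace period in seconds.
func rpcTimeout(gracePeriodSeconds int64) time.Duration {
	return defaultTimeout + time.Duration(gracePeriodSeconds)*time.Second
}

func main() {
	// Assumption: the real helper wraps context.WithTimeout in a similar way.
	ctx, cancel := context.WithTimeout(context.Background(), rpcTimeout(30))
	defer cancel()

	_, hasDeadline := ctx.Deadline()
	fmt.Println(rpcTimeout(30), hasDeadline) // prints "2m30s true"
}
```

Adding the grace period on top of the default keeps the RPC from timing out while the runtime is still waiting for the container to exit gracefully.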

5.1.2 r.runtimeClient.StopContainer

Next, using dockershim as the example, we move into the CRI shim to analyze how the container is stopped.

The r.runtimeClient.StopContainer call made by kubelet above arrives at the StopContainer method of dockershim.

// pkg/kubelet/dockershim/docker_container.go
// StopContainer stops a running container with a grace period (i.e., timeout).
func (ds *dockerService) StopContainer(_ context.Context, r *runtimeapi.StopContainerRequest) (*runtimeapi.StopContainerResponse, error) {
	err := ds.client.StopContainer(r.ContainerId, time.Duration(r.Timeout)*time.Second)
	if err != nil {
		return nil, err
	}
	return &runtimeapi.StopContainerResponse{}, nil
}
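The layering at this point (the shim receives the grace period as whole seconds and hands a time.Duration to the engine client) can be sketched with two small types. `engine` and `shim` are hypothetical stand-ins for the Docker client wrapper and for dockerService; the trace slice exists only to show the call order.

```go
package main

import (
	"fmt"
	"time"
)

// engine is a hypothetical stand-in for the Docker client wrapper.
type engine struct{ trace []string }

func (e *engine) StopContainer(id string, timeout time.Duration) error {
	e.trace = append(e.trace, fmt.Sprintf("engine.StopContainer(%s, %s)", id, timeout))
	return nil
}

// shim plays the role of the CRI shim server: it receives the grace period
// in whole seconds and converts it to a time.Duration for the engine.
type shim struct{ eng *engine }

func (s *shim) StopContainer(id string, timeoutSeconds int64) error {
	s.eng.trace = append(s.eng.trace, fmt.Sprintf("shim.StopContainer(%s, %d)", id, timeoutSeconds))
	return s.eng.StopContainer(id, time.Duration(timeoutSeconds)*time.Second)
}

func main() {
	eng := &engine{}
	s := &shim{eng: eng}
	if err := s.StopContainer("abc123", 30); err != nil {
		fmt.Println("stop failed:", err)
	}
	for _, line := range eng.trace {
		fmt.Println(line)
	}
}
```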

ds.client.StopContainer

This mainly calls d.client.ContainerStop.

// pkg/kubelet/dockershim/libdocker/kube_docker_client.go
// Stopping an already stopped container will not cause an error in dockerapi.
func (d *kubeDockerClient) StopContainer(id string, timeout time.Duration) error {
	ctx, cancel := d.getCustomTimeoutContext(timeout)
	defer cancel()
	err := d.client.ContainerStop(ctx, id, &timeout)
	if ctxErr := contextError(ctx); ctxErr != nil {
		return ctxErr
	}
	return err
}

d.client.ContainerStop

It builds the request parameters and sends an HTTP request to the corresponding docker URL to stop the container.

// vendor/github.com/docker/docker/client/container_stop.go
// ContainerStop stops a container. In case the container fails to stop
// gracefully within a time frame specified by the timeout argument,
// it is forcefully terminated (killed).
//
// If the timeout is nil, the container's StopTimeout value is used, if set,
// otherwise the engine default. A negative timeout value can be specified,
// meaning no timeout, i.e. no forceful termination is performed.
func (cli *Client) ContainerStop(ctx context.Context, containerID string, timeout *time.Duration) error {
	query := url.Values{}
	if timeout != nil {
		query.Set("t", timetypes.DurationToSecondsString(*timeout))
	}
	resp, err := cli.post(ctx, "/containers/"+containerID+"/stop", query, nil, nil)
	ensureReaderClosed(resp)
	return err
}
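To make the request shape concrete, the following sketch rebuilds only the path and query string. It assumes the timeout is encoded as whole seconds without a fractional part; `stopPath` and `exampleGrace` are hypothetical names, and any base URL or API version prefix that the real client may add is left out.

```go
package main

import (
	"fmt"
	"net/url"
	"strconv"
	"time"
)

// exampleGrace is a hypothetical grace period used in the example below.
var exampleGrace = 30 * time.Second

// stopPath builds the request path and query for stopping a container.
// A nil timeout omits "t", leaving the choice to the engine.
func stopPath(containerID string, timeout *time.Duration) string {
	query := url.Values{}
	if timeout != nil {
		// Assumption: seconds are formatted without a fractional part.
		query.Set("t", strconv.FormatFloat(timeout.Seconds(), 'f', 0, 64))
	}
	path := "/containers/" + containerID + "/stop"
	if encoded := query.Encode(); encoded != "" {
		path += "?" + encoded
	}
	return path
}

func main() {
	grace := exampleGrace
	fmt.Println(stopPath("abc123", &grace)) // /containers/abc123/stop?t=30
	fmt.Println(stopPath("abc123", nil))    // /containers/abc123/stop
}
```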

5.2 m.runtimeService.StopPodSandbox

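The runtimeService in m.runtimeService.StopPodSandbox is a RemoteRuntimeService. It implements the RuntimeService interface, the client side of the container runtime interface, and holds the client used to talk to the CRI shim's container runtime server. Calling m.runtimeService.StopPodSandbox therefore amounts to calling the StopPodSandbox method on the CRI shim server to stop the pod sandbox.

This completes the CRI-related calls inside kubelet. Next, we move into the CRI shim (using dockershim, the CRI shim built into kubelet, as the example) to analyze how the pod sandbox is stopped.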

// pkg/kubelet/remote/remote_runtime.go
// StopPodSandbox stops the sandbox. If there are any running containers in the
// sandbox, they should be forced to termination.
func (r *RemoteRuntimeService) StopPodSandbox(podSandBoxID string) error {
	ctx, cancel := getContextWithTimeout(r.timeout)
	defer cancel()

	_, err := r.runtimeClient.StopPodSandbox(ctx, &runtimeapi.StopPodSandboxRequest{
		PodSandboxId: podSandBoxID,
	})
	if err != nil {
		klog.Errorf("StopPodSandbox %q from runtime service failed: %v", podSandBoxID, err)
		return err
	}

	return nil
}

5.2.1 r.runtimeClient.StopPodSandbox

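Next, using dockershim as the example, we move into the CRI shim to analyze how the pod sandbox is stopped.

The r.runtimeClient.StopPodSandbox call made by kubelet above arrives at the StopPodSandbox method of dockershim.

Stopping the pod sandbox consists of two main steps:
(1) call ds.network.TearDownPod to tear down the pod network;
(2) call ds.client.StopContainer to stop the pod sandbox container.

Note that stopping the pod sandbox succeeds only if both steps succeed, and the two steps may succeed in either order.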

// pkg/kubelet/dockershim/docker_sandbox.go
// StopPodSandbox stops the sandbox. If there are any running containers in the
// sandbox, they should be force terminated.
// TODO: This function blocks sandbox teardown on networking teardown. Is it
// better to cut our losses assuming an out of band GC routine will cleanup
// after us?
func (ds *dockerService) StopPodSandbox(ctx context.Context, r *runtimeapi.StopPodSandboxRequest) (*runtimeapi.StopPodSandboxResponse, error) {
	var namespace, name string
	var hostNetwork bool

	podSandboxID := r.PodSandboxId
	resp := &runtimeapi.StopPodSandboxResponse{}

	// Try to retrieve minimal sandbox information from docker daemon or sandbox checkpoint.
	inspectResult, metadata, statusErr := ds.getPodSandboxDetails(podSandboxID)
	if statusErr == nil {
		namespace = metadata.Namespace
		name = metadata.Name
		hostNetwork = (networkNamespaceMode(inspectResult) == runtimeapi.NamespaceMode_NODE)
	} else {
		checkpoint := NewPodSandboxCheckpoint("", "", &CheckpointData{})
		checkpointErr := ds.checkpointManager.GetCheckpoint(podSandboxID, checkpoint)

		// Proceed if both sandbox container and checkpoint could not be found. This means that following
		// actions will only have sandbox ID and not have pod namespace and name information.
		// Return error if encounter any unexpected error.
		if checkpointErr != nil {
			if checkpointErr != errors.ErrCheckpointNotFound {
				err := ds.checkpointManager.RemoveCheckpoint(podSandboxID)
				if err != nil {
					klog.Errorf("Failed to delete corrupt checkpoint for sandbox %q: %v", podSandboxID, err)
				}
			}
			if libdocker.IsContainerNotFoundError(statusErr) {
				klog.Warningf("Both sandbox container and checkpoint for id %q could not be found. "+
					"Proceed without further sandbox information.", podSandboxID)
			} else {
				return nil, utilerrors.NewAggregate([]error{
					fmt.Errorf("failed to get checkpoint for sandbox %q: %v", podSandboxID, checkpointErr),
					fmt.Errorf("failed to get sandbox status: %v", statusErr)})
			}
		} else {
			_, name, namespace, _, hostNetwork = checkpoint.GetData()
		}
	}

	// WARNING: The following operations made the following assumption:
	// 1. kubelet will retry on any error returned by StopPodSandbox.
	// 2. tearing down network and stopping sandbox container can succeed in any sequence.
	// This depends on the implementation detail of network plugin and proper error handling.
	// For kubenet, if tearing down network failed and sandbox container is stopped, kubelet
	// will retry. On retry, kubenet will not be able to retrieve network namespace of the sandbox
	// since it is stopped. With empty network namespace, CNI bridge plugin will conduct best
	// effort clean up and will not return error.
	errList := []error{}
	ready, ok := ds.getNetworkReady(podSandboxID)
	if !hostNetwork && (ready || !ok) {
		// Only tear down the pod network if we haven't done so already
		cID := kubecontainer.BuildContainerID(runtimeName, podSandboxID)
		err := ds.network.TearDownPod(namespace, name, cID)
		if err == nil {
			ds.setNetworkReady(podSandboxID, false)
		} else {
			errList = append(errList, err)
		}
	}
	if err := ds.client.StopContainer(podSandboxID, defaultSandboxGracePeriod); err != nil {
		// Do not return error if the container does not exist
		if !libdocker.IsContainerNotFoundError(err) {
			klog.Errorf("Failed to stop sandbox %q: %v", podSandboxID, err)
			errList = append(errList, err)
		} else {
			// remove the checkpoint for any sandbox that is not found in the runtime
			ds.checkpointManager.RemoveCheckpoint(podSandboxID)
		}
	}

	if len(errList) == 0 {
		return resp, nil
	}

	// TODO: Stop all running containers in the sandbox.
	return nil, utilerrors.NewAggregate(errList)
}
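Two details of StopPodSandbox are worth isolating: the condition that decides whether the network is torn down, and the fact that both steps run and their errors are collected rather than returned early. The sketch below is a simplified model under those assumptions; `shouldTearDownNetwork`, `stopSandbox`, and `errCNI` are hypothetical names, and a plain error slice stands in for the aggregate error.

```go
package main

import (
	"errors"
	"fmt"
)

// errCNI is a hypothetical error returned by the example teardown function.
var errCNI = errors.New("cni teardown failed")

// shouldTearDownNetwork reports whether pod network teardown is attempted:
// never for host-network pods, otherwise when the network is still marked
// ready or when no readiness record exists for the sandbox.
func shouldTearDownNetwork(hostNetwork, ready, recorded bool) bool {
	return !hostNetwork && (ready || !recorded)
}

// stopSandbox runs both steps regardless of individual failures and
// returns every error that occurred (an empty slice when both succeed).
func stopSandbox(hostNetwork, ready, recorded bool, tearDown, stop func() error) []error {
	var errs []error
	if shouldTearDownNetwork(hostNetwork, ready, recorded) {
		if err := tearDown(); err != nil {
			errs = append(errs, err)
		}
	}
	if err := stop(); err != nil {
		errs = append(errs, err)
	}
	return errs
}

func main() {
	errs := stopSandbox(false, true, true,
		func() error { return errCNI },
		func() error { return nil })
	fmt.Println(len(errs), errs)
}
```

Because the sandbox container is still stopped after a failed teardown, a retry by the caller only has to repeat the network step, which is the behavior the WARNING comment in the source relies on.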

ds.client.StopContainer

This mainly calls d.client.ContainerStop.

// pkg/kubelet/dockershim/libdocker/kube_docker_client.go
// Stopping an already stopped container will not cause an error in dockerapi.
func (d *kubeDockerClient) StopContainer(id string, timeout time.Duration) error {
	ctx, cancel := d.getCustomTimeoutContext(timeout)
	defer cancel()
	err := d.client.ContainerStop(ctx, id, &timeout)
	if ctxErr := contextError(ctx); ctxErr != nil {
		return ctxErr
	}
	return err
}

d.client.ContainerStop

It builds the request parameters and sends an HTTP request to the corresponding docker URL to stop the pod sandbox container.

// vendor/github.com/docker/docker/client/container_stop.go
// ContainerStop stops a container. In case the container fails to stop
// gracefully within a time frame specified by the timeout argument,
// it is forcefully terminated (killed).
//
// If the timeout is nil, the container's StopTimeout value is used, if set,
// otherwise the engine default. A negative timeout value can be specified,
// meaning no timeout, i.e. no forceful termination is performed.
func (cli *Client) ContainerStop(ctx context.Context, containerID string, timeout *time.Duration) error {
	query := url.Values{}
	if timeout != nil {
		query.Set("t", timetypes.DurationToSecondsString(*timeout))
	}
	resp, err := cli.post(ctx, "/containers/"+containerID+"/stop", query, nil, nil)
	ensureReaderClosed(resp)
	return err
}

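Summary

CRI architecture diagram

Below CRI there are two kinds of container runtime implementations:
(1) dockershim, built into kubelet, which provides support for the Docker container engine and for CNI network plugins (including kubenet). The dockershim code lives inside kubelet and is invoked by kubelet; dockershim runs as a separate server that acts as the CRI shim and exposes a gRPC server to kubelet;
(2) external container runtimes, which support container engines such as rkt and containerd.

kubelet CRI pod deletion flow

kubelet deletes a pod as follows:
(1) first stop all containers that belong to the pod;
(2) then stop the pod sandbox container (which includes tearing down the pod network).

Note: this only stops the containers; removing them is left to kubelet's GC.

kubelet CRI pod deletion call flow

The following uses the kubelet dockershim pod deletion call flow as an example.

kubelet calls dockershim to stop the containers; dockershim in turn calls docker to stop them and calls CNI to tear down the pod network.

Figure 1: kubelet dockershim pod deletion call flow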

