What happens before the kubelet creates a Pod?

Posted by abin在路上 on 2024-06-17

When the kubelet watches a newly added Pod, the main things it has to do are:

  • Manage the Pod's status: besides updating the local cache, sync it to the API server

  • Check whether the node has enough resources to create the Pod

  • Create the cgroup and apply the resource configuration

  • Create the root directories for the Pod, its volumes, and its plugins

  • Mount the volumes and wait until all of them are mounted successfully

  • Fetch the imagePullSecrets

  • Call the CRI to create the PodSandbox

  • Call the CRI to create the containers (ephemeral containers, init containers, regular containers)

Pod events are the channel through which Kubernetes components surface key information to users. For a given Pod, the following events normally appear; the sections below walk through them in order.
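
For reference, these are the same events that kubectl describe pod shows, and they can also be read programmatically. A minimal client-go sketch (the kubeconfig path, namespace, and pod name are placeholders, not values taken from this article):

package main

import (
    "context"
    "fmt"

    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/client-go/kubernetes"
    "k8s.io/client-go/tools/clientcmd"
)

func main() {
    // Build a client from the local kubeconfig (~/.kube/config).
    cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
    if err != nil {
        panic(err)
    }
    cs, err := kubernetes.NewForConfig(cfg)
    if err != nil {
        panic(err)
    }

    // List the events that reference a specific Pod.
    events, err := cs.CoreV1().Events("default").List(context.TODO(), metav1.ListOptions{
        FieldSelector: "involvedObject.kind=Pod,involvedObject.name=dp-test-5986bbcc8f-4vlmd",
    })
    if err != nil {
        panic(err)
    }
    for _, e := range events.Items {
        fmt.Printf("%s\t%s\t%s\n", e.Type, e.Reason, e.Message)
    }
}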

The Pod is scheduled

When a Pod is created, it is first submitted to the API server (the same is true for Pods managed by a controller). The scheduler then notices the unscheduled Pod and schedules it. Once a suitable node has been found through a series of scheduling algorithms, the scheduler binds the Pod to that node. For the default scheduler, the sign that scheduling has finished is the following message in the Pod's events:

Successfully assigned default/dp-test-5986bbcc8f-4vlmd to 192.168.0.9


The corresponding code in the scheduler:

func (sched *Scheduler) finishBinding(fwk framework.Framework, assumed *v1.Pod, targetNode string, status *framework.Status) {
    if finErr := sched.Cache.FinishBinding(assumed); finErr != nil {
        klog.ErrorS(finErr, "Scheduler cache FinishBinding failed")
    }
    if !status.IsSuccess() {
        klog.V(1).InfoS("Failed to bind pod", "pod", klog.KObj(assumed))
        return
    }

    fwk.EventRecorder().Eventf(assumed, nil, v1.EventTypeNormal, "Scheduled", "Binding", "Successfully assigned %v/%v to %v", assumed.Namespace, assumed.Name, targetNode)
}


Kubelet discovers a Pod that needs to be handled

The kubelet continuously watches the API server for the list of Pods scheduled to the current node:

The list is filtered with a field selector on spec.nodeName.

// NewSourceApiserver creates a config source that watches and pulls from the apiserver.
func NewSourceApiserver(c clientset.Interface, nodeName types.NodeName, nodeHasSynced func() bool, updates chan<- interface{}) {
    lw := cache.NewListWatchFromClient(c.CoreV1().RESTClient(), "pods", metav1.NamespaceAll, fields.OneTermEqualSelector("spec.nodeName", string(nodeName)))

    // The Reflector responsible for watching pods at the apiserver should be run only after
    // the node sync with the apiserver has completed.
    klog.InfoS("Waiting for node sync before watching apiserver pods")
    go func() {
        for {
            if nodeHasSynced() {
                klog.V(4).InfoS("node sync completed")
                break
            }
            time.Sleep(WaitForAPIServerSyncPeriod)
            klog.V(4).InfoS("node sync has not completed yet")
        }
        klog.InfoS("Watching apiserver")
        newSourceApiserverFromLW(lw, updates)
    }()
}


When the kubelet sees that a new Pod has been scheduled to the current node, it logs the following:

The log verbosity has to be raised to 5 for this and the later log lines to show up.

Jun 15 15:18:59 iv-yd6ldax3wgr9cxxjx9iv kubelet[3390574]: I0615 15:18:59.269157 3390574 config.go:398] "Receiving a new pod" pod="default/dp-test-5986bbcc8f-gcrmm"


The corresponding code:

// recordFirstSeenTime records the first seen time of this pod.
func recordFirstSeenTime(pod *v1.Pod) {
    klog.V(4).InfoS("Receiving a new pod", "pod", klog.KObj(pod))
    pod.Annotations[kubetypes.ConfigFirstSeenAnnotationKey] = kubetypes.NewTimestamp().GetString()
}


Here the kubelet adds an annotation with the key kubernetes.io/config.seen to the Pod (this annotation is only used inside the kubelet and is not synced to the API server), recording when the kubelet first saw the Pod.

The kubelet then classifies the Pods it receives and sends them through the s.updates channel to the corresponding handler. Following this channel, we can find the handler functions:

// syncLoop is the main loop for processing changes. It watches for changes from
// three channels (file, apiserver, and http) and creates a union of them. For
// any new change seen, will run a sync against desired state and running state. If
// no changes are seen to the configuration, will synchronize the last known desired
// state every sync-frequency seconds. Never returns.
func (kl *Kubelet) syncLoop(ctx context.Context, updates <-chan kubetypes.PodUpdate, handler SyncHandler) {
    klog.InfoS("Starting kubelet main sync loop")
    // The syncTicker wakes up kubelet to checks if there are any pod workers
    // that need to be sync'd. A one-second period is sufficient because the
    // sync interval is defaulted to 10s.
    syncTicker := time.NewTicker(time.Second)
    defer syncTicker.Stop()
    housekeepingTicker := time.NewTicker(housekeepingPeriod)
    defer housekeepingTicker.Stop()
    plegCh := kl.pleg.Watch()
    const (
        base   = 100 * time.Millisecond
        max    = 5 * time.Second
        factor = 2
    )
    duration := base
    // Responsible for checking limits in resolv.conf
    // The limits do not have anything to do with individual pods
    // Since this is called in syncLoop, we don't need to call it anywhere else
    if kl.dnsConfigurer != nil && kl.dnsConfigurer.ResolverConfig != "" {
        kl.dnsConfigurer.CheckLimitsForResolvConf()
    }

    for {
        if err := kl.runtimeState.runtimeErrors(); err != nil {
            klog.ErrorS(err, "Skipping pod synchronization")
            // exponential backoff
            time.Sleep(duration)
            duration = time.Duration(math.Min(float64(max), factor*float64(duration)))
            continue
        }
        // reset backoff if we have a success
        duration = base

        kl.syncLoopMonitor.Store(kl.clock.Now())
        if !kl.syncLoopIteration(ctx, updates, handler, syncTicker.C, housekeepingTicker.C, plegCh) {
            break
        }
        kl.syncLoopMonitor.Store(kl.clock.Now())
    }
}


syncLoop is the kubelet's main processing loop, and syncLoopIteration reads events from the various channels and dispatches them to workers. It is fairly long, but the comments give a good overview of the logic:

// syncLoopIteration reads from various channels and dispatches pods to the
// given handler.
//
// Arguments:
// 1.  configCh:       a channel to read config events from
// 2.  handler:        the SyncHandler to dispatch pods to
// 3.  syncCh:         a channel to read periodic sync events from
// 4.  housekeepingCh: a channel to read housekeeping events from
// 5.  plegCh:         a channel to read PLEG updates from
//
// Events are also read from the kubelet liveness manager's update channel.
//
// The workflow is to read from one of the channels, handle that event, and
// update the timestamp in the sync loop monitor.
//
// Here is an appropriate place to note that despite the syntactical
// similarity to the switch statement, the case statements in a select are
// evaluated in a pseudorandom order if there are multiple channels ready to
// read from when the select is evaluated.  In other words, case statements
// are evaluated in random order, and you can not assume that the case
// statements evaluate in order if multiple channels have events.
//
// With that in mind, in truly no particular order, the different channels
// are handled as follows:
//
//   - configCh: dispatch the pods for the config change to the appropriate
//     handler callback for the event type
//   - plegCh: update the runtime cache; sync pod
//   - syncCh: sync all pods waiting for sync
//   - housekeepingCh: trigger cleanup of pods
//   - health manager: sync pods that have failed or in which one or more
//     containers have failed health checks


At this point, the kubelet logs the following, indicating that it has received the request to add the Pod and is calling the corresponding handler:

Jun 15 15:18:59 iv-yd6ldax3wgr9cxxjx9iv kubelet[3390574]: I0615 15:18:59.269224 3390574 kubelet.go:2206] "SyncLoop ADD" source="api" pods="[default/dp-test-5986bbcc8f-gcrmm]"


The corresponding code:

func (kl *Kubelet) syncLoopIteration(ctx context.Context, configCh <-chan kubetypes.PodUpdate, handler SyncHandler,
    syncCh <-chan time.Time, housekeepingCh <-chan time.Time, plegCh <-chan *pleg.PodLifecycleEvent) bool {
    select {
    case u, open := <-configCh:
        // Update from a config source; dispatch it to the right handler
        // callback.
        if !open {
            klog.ErrorS(nil, "Update channel is closed, exiting the sync loop")
            return false
        }

        switch u.Op {
        case kubetypes.ADD:
            klog.V(2).InfoS("SyncLoop ADD", "source", u.Source, "pods", klog.KObjSlice(u.Pods))
            // After restarting, kubelet will get all existing pods through
            // ADD as if they are new pods. These pods will then go through the
            // admission process and *may* be rejected. This can be resolved
            // once we have checkpointing.
            handler.HandlePodAdditions(u.Pods)
        case kubetypes.UPDATE:
            klog.V(2).InfoS("SyncLoop UPDATE", "source", u.Source, "pods", klog.KObjSlice(u.Pods))
            handler.HandlePodUpdates(u.Pods)
        case kubetypes.REMOVE:
            klog.V(2).InfoS("SyncLoop REMOVE", "source", u.Source, "pods", klog.KObjSlice(u.Pods))
            handler.HandlePodRemoves(u.Pods)
        case kubetypes.RECONCILE:
            klog.V(4).InfoS("SyncLoop RECONCILE", "source", u.Source, "pods", klog.KObjSlice(u.Pods))
            handler.HandlePodReconcile(u.Pods)
        case kubetypes.DELETE:
            klog.V(2).InfoS("SyncLoop DELETE", "source", u.Source, "pods", klog.KObjSlice(u.Pods))
            // DELETE is treated as a UPDATE because of graceful deletion.
            handler.HandlePodUpdates(u.Pods)
        case kubetypes.SET:
            // TODO: Do we want to support this?
            klog.ErrorS(nil, "Kubelet does not support snapshot update")
        default:
            klog.ErrorS(nil, "Invalid operation type received", "operation", u.Op)
        }

        // content omitted
    }
    return true
}


Each event type has its own handler; we will not dig into all of them here.

  • Add Pod: HandlePodAdditions

  • Update Pod, Delete Pod: HandlePodUpdates

  • Remove Pod: HandlePodRemoves

  • Reconcile Pod: HandlePodReconcile

For the kubetypes.ADD event, the handler is HandlePodAdditions:

// HandlePodAdditions is the callback in SyncHandler for pods being added from
// a config source.
func (kl *Kubelet) HandlePodAdditions(pods []*v1.Pod) {
    start := kl.clock.Now()
    sort.Sort(sliceutils.PodsByCreationTime(pods))
    for _, pod := range pods {
        existingPods := kl.podManager.GetPods()
        // Always add the pod to the pod manager. Kubelet relies on the pod
        // manager as the source of truth for the desired state. If a pod does
        // not exist in the pod manager, it means that it has been deleted in
        // the apiserver and no action (other than cleanup) is required.
        kl.podManager.AddPod(pod)

        // Static (mirror) pods are handled by handleMirrorPod
        if kubetypes.IsMirrorPod(pod) {
            kl.handleMirrorPod(pod, start)
            continue
        }

        // Only go through the admission process if the pod is not requested
        // for termination by another part of the kubelet. If the pod is already
        // using resources (previously admitted), the pod worker is going to be
        // shutting it down. If the pod hasn't started yet, we know that when
        // the pod worker is invoked it will also avoid setting up the pod, so
        // we simply avoid doing any work.
        if !kl.podWorkers.IsPodTerminationRequested(pod.UID) {
            // We failed pods that we rejected, so activePods include all admitted
            // pods that are alive.
            activePods := kl.filterOutInactivePods(existingPods)

            // Check if we can admit the pod; if not, reject it.
            if ok, reason, message := kl.canAdmitPod(activePods, pod); !ok {
                kl.rejectPod(pod, reason, message)
                continue
            }
        }
        mirrorPod, _ := kl.podManager.GetMirrorPodByPod(pod)
        kl.dispatchWork(pod, kubetypes.SyncPodCreate, mirrorPod, start)
    }
}


Only Pods that are not being terminated go through the admission logic. The corresponding method is canAdmitPod, in which the various admitHandlers run their Admit logic; the log reads:

Jun 15 15:18:59 iv-yd6ldax3wgr9cxxjx9iv kubelet[3390574]: I0615 15:18:59.269248 3390574 topology_manager.go:210] "Topology Admit Handler" podUID=dd780355-f908-4bf2-9193-68dabbbca69e podNamespace="default" podName="dp-test-5986bbcc8f-gcrmm"


The corresponding code:

func (m *manager) Admit(attrs *lifecycle.PodAdmitAttributes) lifecycle.PodAdmitResult {
    klog.InfoS("Topology Admit Handler", "podUID", attrs.Pod.UID, "podNamespace", attrs.Pod.Namespace, "podName", attrs.Pod.Name)
    pod := attrs.Pod

    return m.scope.Admit(pod)
}


The Admit method first determines, from container.Resources.Limits, which resources need to be allocated (for example NICs); the log reads:

Jun 15 15:18:59 iv-yd6ldax3wgr9cxxjx9iv kubelet[3390574]: I0615 15:18:59.269261 3390574 manager.go:854] "Looking for needed resources" needed=1 resourceName="xxx/eni-ip"


The corresponding code:

// allocateContainerResources attempts to allocate all of required device
// plugin resources for the input container, issues an Allocate rpc request
// for each new device resource requirement, processes their AllocateResponses,
// and updates the cached containerDevices on success.
func (m *ManagerImpl) allocateContainerResources(pod *v1.Pod, container *v1.Container, devicesToReuse map[string]sets.String) error {
    podUID := string(pod.UID)
    contName := container.Name
    allocatedDevicesUpdated := false
    needsUpdateCheckpoint := false
    // Extended resources are not allowed to be overcommitted.
    // Since device plugin advertises extended resources,
    // therefore Requests must be equal to Limits and iterating
    // over the Limits should be sufficient.
    for k, v := range container.Resources.Limits {
        resource := string(k)
        needed := int(v.Value())
        klog.V(3).InfoS("Looking for needed resources", "needed", needed, "resourceName", resource)
        if !m.isDevicePluginResource(resource) {
            continue
        }
        // Updates allocatedDevices to garbage collect any stranded resources
        // before doing the device plugin allocation.
        if !allocatedDevicesUpdated {
            m.UpdateAllocatedDevices()
            allocatedDevicesUpdated = true
        }

        var allocDevices sets.String
        var err error
        contDevicesScheduledKey := fmt.Sprintf("%s_container-%s", resource, contName)
        contDevicesScheduled := pod.Annotations[contDevicesScheduledKey]
        if contDevicesScheduled != "" {
            allocDevices, err = m.devicesToAllocateFromScheduler(podUID, contName, resource, contDevicesScheduled, needed, devicesToReuse[resource])
        } else {
            allocDevices, err = m.devicesToAllocate(podUID, contName, resource, needed, devicesToReuse[resource])
        }
        if err != nil {
            return err
        }
        if allocDevices == nil || len(allocDevices) <= 0 {
            continue
        }

        // content omitted
        m.podDevices.insert(podUID, contName, resource, allocDevicesWithNUMA, resp.ContainerResponses[0])
    }

    if needsUpdateCheckpoint {
        return m.writeCheckpoint()
    }

    return nil
}


Next, devicesToAllocate is called to allocate the required devices; the next log line is:

Jun 15 15:18:59 iv-yd6ldax3wgr9cxxjx9iv kubelet[3390574]: I0615 15:18:59.269321 3390574 manager.go:621] "Need devices to allocate for pod" deviceNumber=1 resourceName="xxx/eni-ip" podUID="dd780355-f908-4bf2-9193-68dabbbca69e" containerName="nginx"


The corresponding code:

// Returns list of device Ids we need to allocate with Allocate rpc call.
// Returns empty list in case we don't need to issue the Allocate rpc call.
func (m *ManagerImpl) devicesToAllocate(podUID, contName, resource string, required int, reusableDevices sets.String) (sets.String, error) {
    m.mutex.Lock()
    defer m.mutex.Unlock()
    needed := required
    // Gets list of devices that have already been allocated.
    // This can happen if a container restarts for example.
    devices := m.podDevices.containerDevices(podUID, contName, resource)
    if devices != nil {
        klog.V(3).InfoS("Found pre-allocated devices for resource on pod", "resourceName", resource, "containerName", contName, "podUID", string(podUID), "devices", devices.List())
        needed = needed - devices.Len()
        // A pod's resource is not expected to change once admitted by the API server,
        // so just fail loudly here. We can revisit this part if this no longer holds.
        if needed != 0 {
            return nil, fmt.Errorf("pod %q container %q changed request for resource %q from %d to %d", string(podUID), contName, resource, devices.Len(), required)
        }
    }

    // We have 3 major flows to handle:
    // 1. kubelet running, normal allocation (needed > 0, container being  [re]created). Steady state and most common case by far and large.
    // 2. kubelet restart. In this scenario every other component of the stack (device plugins, app container, runtime) is still running.
    // 3. node reboot. In this scenario device plugins may not be running yet when we try to allocate devices.
    //    note: if we get this far the runtime is surely running. This is usually enforced at OS level by startup system services dependencies.

    // First we take care of the exceptional flow (scenarios 2 and 3). In both flows, kubelet is reinitializing, and while kubelet is initializing, sources are NOT all ready.
    // Is this a simple kubelet restart (scenario 2)? To distinguish, we use the informations we got for runtime. If we are asked to allocate devices for containers reported
    // running, then it can only be a kubelet restart. On node reboot the runtime and the containers were also shut down. Then, if the container was running, it can only be
    // because it already has access to all the required devices, so we got nothing to do and we can bail out.
    if !m.sourcesReady.AllReady() && m.isContainerAlreadyRunning(podUID, contName) {
        klog.V(3).InfoS("container detected running, nothing to do", "deviceNumber", needed, "resourceName", resource, "podUID", string(podUID), "containerName", contName)
        return nil, nil
    }

    // We dealt with scenario 2. If we got this far it's either scenario 3 (node reboot) or scenario 1 (steady state, normal flow).
    klog.V(3).InfoS("Need devices to allocate for pod", "deviceNumber", needed, "resourceName", resource, "podUID", string(podUID), "containerName", contName)
    healthyDevices, hasRegistered := m.healthyDevices[resource]
    // content omitted

    // Filters available Devices based on NUMA affinity.
    aligned, unaligned, noAffinity := m.filterByAffinity(podUID, contName, resource, available)

    // If we can allocate all remaining devices from the set of aligned ones, then
    // give the plugin the chance to influence which ones to allocate from that set.
    if needed < aligned.Len() {
        // First allocate from the preferred devices list (if available).
        preferred, err := m.callGetPreferredAllocationIfAvailable(podUID, contName, resource, aligned.Union(allocated), allocated, required)
        if err != nil {
            return nil, err
        }
        if allocateRemainingFrom(preferred.Intersection(aligned)) {
            return allocated, nil
        }
        // Then fallback to allocate from the aligned set if no preferred list
        // is returned (or not enough devices are returned in that list).
        if allocateRemainingFrom(aligned) {
            return allocated, nil
        }

        return nil, fmt.Errorf("unexpectedly allocated less resources than required. Requested: %d, Got: %d", required, required-needed)
    }

    // If we can't allocate all remaining devices from the set of aligned ones,
    // then start by first allocating all of the  aligned devices (to ensure
    // that the alignment guaranteed by the TopologyManager is honored).
    if allocateRemainingFrom(aligned) {
        return allocated, nil
    }

    // Then give the plugin the chance to influence the decision on any
    // remaining devices to allocate.
    preferred, err := m.callGetPreferredAllocationIfAvailable(podUID, contName, resource, available.Union(allocated), allocated, required)
    if err != nil {
        return nil, err
    }
    if allocateRemainingFrom(preferred.Intersection(available)) {
        return allocated, nil
    }
    // Finally, if the plugin did not return a preferred allocation (or didn't
    // return a large enough one), then fall back to allocating the remaining
    // devices from the 'unaligned' and 'noAffinity' sets.
    if allocateRemainingFrom(unaligned) {
        return allocated, nil
    }
    if allocateRemainingFrom(noAffinity) {
        return allocated, nil
    }

    return nil, fmt.Errorf("unexpectedly allocated less resources than required. Requested: %d, Got: %d", required, required-needed)
}


If the device plugin has not enabled GetPreferredAllocation, the following is logged:

Jun 15 15:18:59 iv-yd6ldax3wgr9cxxjx9iv kubelet[3390574]: I0615 15:18:59.269375 3390574 manager.go:1041] "Plugin options indicate to skip GetPreferredAllocation for resource" resourceName="xxx/eni-ip"


The corresponding code:

// callGetPreferredAllocationIfAvailable issues GetPreferredAllocation grpc
// call for device plugin resource with GetPreferredAllocationAvailable option set.
func (m *ManagerImpl) callGetPreferredAllocationIfAvailable(podUID, contName, resource string, available, mustInclude sets.String, size int) (sets.String, error) {
    eI, ok := m.endpoints[resource]
    if !ok {
        return nil, fmt.Errorf("endpoint not found in cache for a registered resource: %s", resource)
    }

    if eI.opts == nil || !eI.opts.GetPreferredAllocationAvailable {
        klog.V(4).InfoS("Plugin options indicate to skip GetPreferredAllocation for resource", "resourceName", resource)
        return nil, nil
    }

    m.mutex.Unlock()
    klog.V(4).InfoS("Issuing a GetPreferredAllocation call for container", "containerName", contName, "podUID", string(podUID))
    resp, err := eI.e.getPreferredAllocation(available.UnsortedList(), mustInclude.UnsortedList(), size)
    m.mutex.Lock()
    if err != nil {
        return nil, fmt.Errorf("device plugin GetPreferredAllocation rpc failed with err: %v", err)
    }
    if resp != nil && len(resp.ContainerResponses) > 0 {
        return sets.NewString(resp.ContainerResponses[0].DeviceIDs...), nil
    }
    return sets.NewString(), nil
}


Next comes the actual device allocation, which logs:

Jun 15 15:18:59 iv-yd6ldax3wgr9cxxjx9iv kubelet[3390574]: I0615 15:18:59.269402 3390574 manager.go:909] "Making allocation request for device plugin" devices=[eni-ip-121] resourceName="xxx/eni-ip"


The corresponding code is in the allocateContainerResources method:

func (m *ManagerImpl) allocateContainerResources(pod *v1.Pod, container *v1.Container, devicesToReuse map[string]sets.String) error {
    // content omitted
    devs := allocDevices.UnsortedList()
    // TODO: refactor this part of code to just append a ContainerAllocationRequest
    // in a passed in AllocateRequest pointer, and issues a single Allocate call per pod.
    klog.V(3).InfoS("Making allocation request for device plugin", "devices", devs, "resourceName", resource)
    resp, err := eI.e.allocate(devs)
    metrics.DevicePluginAllocationDuration.WithLabelValues(resource).Observe(metrics.SinceInSeconds(startRPCTime))
    if err != nil {
        // In case of allocation failure, we want to restore m.allocatedDevices
        // to the actual allocated state from m.podDevices.
        m.mutex.Lock()
        m.allocatedDevices = m.podDevices.devices()
        m.mutex.Unlock()
        return err
    }

    if len(resp.ContainerResponses) == 0 {
        return fmt.Errorf("no containers return in allocation response %v", resp)
    }
    // content omitted
}


⚠️ Note: the metric device_plugin_alloc_duration_seconds shows how long device allocation took (unfortunately it only carries the resource name, with no Pod or container information).

Here an RPC is issued to allocate the required devices:

func (c *devicePluginClient) Allocate(ctx context.Context, in *AllocateRequest, opts ...grpc.CallOption) (*AllocateResponse, error) {
    out := new(AllocateResponse)
    err := c.cc.Invoke(ctx, "/v1beta1.DevicePlugin/Allocate", in, out, opts...)
    if err != nil {
        return nil, err
    }
    return out, nil
}


With resource allocation done, the Pod is handed to its worker via dispatchWork:

kl.dispatchWork(pod, kubetypes.SyncPodCreate, mirrorPod, start)


At this point, you can see a log line like the following:

Jun 15 15:18:59 iv-yd6ldax3wgr9cxxjx9iv kubelet[3390574]: I0615 15:18:59.273776 3390574 pod_workers.go:585] "Pod is being synced for the first time" pod="default/dp-test-5986bbcc8f-gcrmm" podUID=dd780355-f908-4bf2-9193-68dabbbca69e


The corresponding code:

func (p *podWorkers) UpdatePod(options UpdatePodOptions) {
    // content omitted
    // decide what to do with this pod - we are either setting it up, tearing it down, or ignoring it
    now := time.Now()
    status, ok := p.podSyncStatuses[uid]
    if !ok {
        klog.V(4).InfoS("Pod is being synced for the first time", "pod", klog.KObj(pod), "podUID", pod.UID)
        status = &podSyncStatus{
            syncedAt: now,
            fullname: kubecontainer.GetPodFullName(pod),
        }
        // if this pod is being synced for the first time, we need to make sure it is an active pod
        if !isRuntimePod && (pod.Status.Phase == v1.PodFailed || pod.Status.Phase == v1.PodSucceeded) {
            // check to see if the pod is not running and the pod is terminal.
            // If this succeeds then record in the podWorker that it is terminated.
            if statusCache, err := p.podCache.Get(pod.UID); err == nil {
                if isPodStatusCacheTerminal(statusCache) {
                    status = &podSyncStatus{
                        terminatedAt:       now,
                        terminatingAt:      now,
                        syncedAt:           now,
                        startedTerminating: true,
                        finished:           true,
                        fullname:           kubecontainer.GetPodFullName(pod),
                    }
                }
            }
        }
        p.podSyncStatuses[uid] = status
    }
    // content omitted
}


For a newly created Pod, a goroutine (the Pod worker) is started to handle it. Its main logic is managePodLoop, which receives requests through outCh:

// Creating a new pod worker either means this is a new pod, or that the
// kubelet just restarted. In either case the kubelet is willing to believe
// the status of the pod for the first pod worker sync. See corresponding
// comment in syncPod.
go func() {
    defer runtime.HandleCrash()
    p.managePodLoop(outCh)
}()


When a request arrives on outCh, the following is logged:

Jun 15 15:18:59 iv-yd6ldax3wgr9cxxjx9iv kubelet[3390574]: I0615 15:18:59.273812 3390574 pod_workers.go:902] "Processing pod event" pod="default/dp-test-5986bbcc8f-gcrmm" podUID=dd780355-f908-4bf2-9193-68dabbbca69e updateType=0


The code:

func (p *podWorkers) managePodLoop(podUpdates <-chan podWork) {
    // content omitted
    klog.V(4).InfoS("Processing pod event", "pod", klog.KObj(pod), "podUID", pod.UID, "updateType", update.WorkType)
    var isTerminal bool
    err := func() error {
        // The worker is responsible for ensuring the sync method sees the appropriate
        // status updates on resyncs (the result of the last sync), transitions to
        // terminating (no wait), or on terminated (whatever the most recent state is).
        // Only syncing and terminating can generate pod status changes, while terminated
        // pods ensure the most recent status makes it to the api server.
        var status *kubecontainer.PodStatus
        var err error
        switch {
        case update.Options.RunningPod != nil:
            // when we receive a running pod, we don't need status at all
        default:
            // wait until we see the next refresh from the PLEG via the cache (max 2s)
            // TODO: this adds ~1s of latency on all transitions from sync to terminating
            //  to terminated, and on all termination retries (including evictions). We should
            //  improve latency by making the pleg continuous and by allowing pod status
            //  changes to be refreshed when key events happen (killPod, sync->terminating).
            //  Improving this latency also reduces the possibility that a terminated
            //  container's status is garbage collected before we have a chance to update the
            //  API server (thus losing the exit code).
            status, err = p.podCache.GetNewerThan(pod.UID, lastSyncTime)
        }
        if err != nil {
            // This is the legacy event thrown by manage pod loop all other events are now dispatched
            // from syncPodFn
            p.recorder.Eventf(pod, v1.EventTypeWarning, events.FailedSync, "error determining status: %v", err)
            return err
        }

        ctx := p.contextForWorker(pod.UID)

        // Take the appropriate action (illegal phases are prevented by UpdatePod)
        switch {
        case update.WorkType == TerminatedPodWork:
            err = p.syncTerminatedPodFn(ctx, pod, status)

        case update.WorkType == TerminatingPodWork:
            var gracePeriod *int64
            if opt := update.Options.KillPodOptions; opt != nil {
                gracePeriod = opt.PodTerminationGracePeriodSecondsOverride
            }
            podStatusFn := p.acknowledgeTerminating(pod)

            err = p.syncTerminatingPodFn(ctx, pod, status, update.Options.RunningPod, gracePeriod, podStatusFn)

        default:
            isTerminal, err = p.syncPodFn(ctx, update.Options.UpdateType, pod, update.Options.MirrorPod, status)
        }

        lastSyncTime = time.Now()
        return err
    }()
    // content omitted
}


Everything except the following two work types is handled by syncPodFn:

  • TerminatedPodWork

  • TerminatingPodWork

In syncPodFn, the first thing logged is:

Jun 15 15:18:59 iv-yd6ldax3wgr9cxxjx9iv kubelet[3390574]: I0615 15:18:59.273831 3390574 kubelet.go:1620] "syncPod enter" pod="default/dp-test-5986bbcc8f-gcrmm" podUID=dd780355-f908-4bf2-9193-68dabbbca69e


The corresponding code is below (syncPod is the main logic for syncing a Pod's state; the workflow is described in the comments):

// syncPod is the transaction script for the sync of a single pod (setting up)
// a pod. This method is reentrant and expected to converge a pod towards the
// desired state of the spec. The reverse (teardown) is handled in
// syncTerminatingPod and syncTerminatedPod. If syncPod exits without error,
// then the pod runtime state is in sync with the desired configuration state
// (pod is running). If syncPod exits with a transient error, the next
// invocation of syncPod is expected to make progress towards reaching the
// runtime state. syncPod exits with isTerminal when the pod was detected to
// have reached a terminal lifecycle phase due to container exits (for
// RestartNever or RestartOnFailure) and the next method invoked will by
// syncTerminatingPod.
//
// Arguments:
//
// updateType - whether this is a create (first time) or an update, should
//
//  only be used for metrics since this method must be reentrant
//
// pod - the pod that is being set up
// mirrorPod - the mirror pod known to the kubelet for this pod, if any
// podStatus - the most recent pod status observed for this pod which can
//
//  be used to determine the set of actions that should be taken during
//  this loop of syncPod
//
// The workflow is:
//   - If the pod is being created, record pod worker start latency
//   - Call generateAPIPodStatus to prepare an v1.PodStatus for the pod
//   - If the pod is being seen as running for the first time, record pod
//     start latency
//   - Update the status of the pod in the status manager
//   - Stop the pod's containers if it should not be running due to soft
//     admission
//   - Ensure any background tracking for a runnable pod is started
//   - Create a mirror pod if the pod is a static pod, and does not
//     already have a mirror pod
//   - Create the data directories for the pod if they do not exist
//   - Wait for volumes to attach/mount
//   - Fetch the pull secrets for the pod
//   - Call the container runtime's SyncPod callback
//   - Update the traffic shaping for the pod's ingress and egress limits
//
// If any step of this workflow errors, the error is returned, and is repeated
// on the next syncPod call.
//
// This operation writes all events that are dispatched in order to provide
// the most accurate information possible about an error situation to aid debugging.
// Callers should not write an event if this operation returns an error.
func (kl *Kubelet) syncPod(_ context.Context, updateType kubetypes.SyncPodType, pod, mirrorPod *v1.Pod, podStatus *kubecontainer.PodStatus) (isTerminal bool, err error) {
    // TODO(#113606): connect this with the incoming context parameter, which comes from the pod worker.
    // Currently, using that context causes test failures.
    ctx := context.TODO()
    klog.V(4).InfoS("syncPod enter", "pod", klog.KObj(pod), "podUID", pod.UID)
    defer func() {
        klog.V(4).InfoS("syncPod exit", "pod", klog.KObj(pod), "podUID", pod.UID, "isTerminal", isTerminal)
    }()

    // Latency measurements for the main workflow are relative to the
    // first time the pod was seen by kubelet.
    var firstSeenTime time.Time
    if firstSeenTimeStr, ok := pod.Annotations[kubetypes.ConfigFirstSeenAnnotationKey]; ok {
        firstSeenTime = kubetypes.ConvertToTimestamp(firstSeenTimeStr).Get()
    }

    // Record pod worker start latency if being created
    // TODO: make pod workers record their own latencies
    if updateType == kubetypes.SyncPodCreate {
        if !firstSeenTime.IsZero() {
            // This is the first time we are syncing the pod. Record the latency
            // since kubelet first saw the pod if firstSeenTime is set.
            metrics.PodWorkerStartDuration.Observe(metrics.SinceInSeconds(firstSeenTime))
        } else {
            klog.V(3).InfoS("First seen time not recorded for pod",
                "podUID", pod.UID,
                "pod", klog.KObj(pod))
        }
    }

    // content omitted
}


The metric pod_worker_start_duration_seconds records the time from when the kubelet first saw this Pod to its first syncPod.

In the next log line, the kubelet generates the Pod's status:

Jun 15 15:18:59 iv-yd6ldax3wgr9cxxjx9iv kubelet[3390574]: I0615 15:18:59.273841 3390574 kubelet_pods.go:1461] "Generating pod status" pod="default/dp-test-5986bbcc8f-gcrmm"


The corresponding code:

// generateAPIPodStatus creates the final API pod status for a pod, given the
// internal pod status. This method should only be called from within sync*Pod methods.
func (kl *Kubelet) generateAPIPodStatus(pod *v1.Pod, podStatus *kubecontainer.PodStatus) v1.PodStatus {
    klog.V(3).InfoS("Generating pod status", "pod", klog.KObj(pod))
    // content omitted
}


For a newly created Pod the initial phase is Pending, which matches this log line:

Jun 15 15:18:59 iv-yd6ldax3wgr9cxxjx9iv kubelet[3390574]: I0615 15:18:59.273867 3390574 kubelet_pods.go:1426] "Pod waiting > 0, pending"


Computing the Pod's phase

A Pod's phase is computed from the states of all of its containers:

// getPhase returns the phase of a pod given its container info.
func getPhase(spec *v1.PodSpec, info []v1.ContainerStatus) v1.PodPhase {
    pendingInitialization := 0
    failedInitialization := 0
    for _, container := range spec.InitContainers {
        containerStatus, ok := podutil.GetContainerStatus(info, container.Name)
        if !ok {
            pendingInitialization++
            continue
        }

        switch {
        case containerStatus.State.Running != nil:
            pendingInitialization++
        case containerStatus.State.Terminated != nil:
            if containerStatus.State.Terminated.ExitCode != 0 {
                failedInitialization++
            }
        case containerStatus.State.Waiting != nil:
            if containerStatus.LastTerminationState.Terminated != nil {
                if containerStatus.LastTerminationState.Terminated.ExitCode != 0 {
                    failedInitialization++
                }
            } else {
                pendingInitialization++
            }
        default:
            pendingInitialization++
        }
    }

    unknown := 0
    running := 0
    waiting := 0
    stopped := 0
    succeeded := 0
    for _, container := range spec.Containers {
        containerStatus, ok := podutil.GetContainerStatus(info, container.Name)
        if !ok {
            unknown++
            continue
        }

        switch {
        case containerStatus.State.Running != nil:
            running++
        case containerStatus.State.Terminated != nil:
            stopped++
            if containerStatus.State.Terminated.ExitCode == 0 {
                succeeded++
            }
        case containerStatus.State.Waiting != nil:
            if containerStatus.LastTerminationState.Terminated != nil {
                stopped++
            } else {
                waiting++
            }
        default:
            unknown++
        }
    }

    if failedInitialization > 0 && spec.RestartPolicy == v1.RestartPolicyNever {
        return v1.PodFailed
    }

    switch {
    case pendingInitialization > 0:
        fallthrough
    case waiting > 0:
        klog.V(5).InfoS("Pod waiting > 0, pending")
        // One or more containers has not been started
        return v1.PodPending
    case running > 0 && unknown == 0:
        // All containers have been started, and at least
        // one container is running
        return v1.PodRunning
    case running == 0 && stopped > 0 && unknown == 0:
        // All containers are terminated
        if spec.RestartPolicy == v1.RestartPolicyAlways {
            // All containers are in the process of restarting
            return v1.PodRunning
        }
        if stopped == succeeded {
            // RestartPolicy is not Always, and all
            // containers are terminated in success
            return v1.PodSucceeded
        }
        if spec.RestartPolicy == v1.RestartPolicyNever {
            // RestartPolicy is Never, and all containers are
            // terminated with at least one in failure
            return v1.PodFailed
        }
        // RestartPolicy is OnFailure, and at least one in failure
        // and in the process of restarting
        return v1.PodRunning
    default:
        klog.V(5).InfoS("Pod default case, pending")
        return v1.PodPending
    }
}


  • failedInitialization > 0 and RestartPolicy is Never: the Pod is Failed

  • there are pending init containers or waiting containers: the Pod is Pending (this is the case for the example Pod above, whose nginx container is waiting with reason ContainerCreating; see the toy check after this list)

  • at least one running container and no containers in unknown state: the Pod is Running

  • no running containers, at least one stopped container, and no containers in unknown state:

      • if RestartPolicy is Always: the Pod is Running

      • if the number of stopped containers equals the number of succeeded containers: the Pod is Succeeded

      • if RestartPolicy is Never: the Pod is Failed

      • otherwise (RestartPolicy is OnFailure, with at least one failure and a restart in progress): the Pod is Running

  • in all other cases: the Pod is Pending
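
A toy, self-contained version of just that waiting-container rule (not kubelet's actual getPhase, which is unexported), applied to the nginx container state seen in the logs above:

package main

import (
    "fmt"

    v1 "k8s.io/api/core/v1"
)

func main() {
    // The example Pod's only container is still Waiting with reason ContainerCreating.
    statuses := []v1.ContainerStatus{{
        Name:  "nginx",
        State: v1.ContainerState{Waiting: &v1.ContainerStateWaiting{Reason: "ContainerCreating"}},
    }}

    // Count waiting containers, mirroring the "waiting > 0" branch of the decision table.
    waiting := 0
    for _, s := range statuses {
        if s.State.Waiting != nil && s.LastTerminationState.Terminated == nil {
            waiting++
        }
    }

    phase := v1.PodRunning
    if waiting > 0 {
        phase = v1.PodPending
    }
    fmt.Println(phase) // Pending
}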

The metric pod_start_duration_seconds records the time from when the kubelet first saw the Pod to when the Pod is running.

Once the phase has been computed, the following is logged:

Jun 15 15:18:59 iv-yd6ldax3wgr9cxxjx9iv kubelet[3390574]: I0615 15:18:59.273874 3390574 kubelet_pods.go:1471] "Got phase for pod" pod="default/dp-test-5986bbcc8f-gcrmm" oldPhase=Pending phase=Pending


With the Pod status computed, the next step is to update it. The corresponding code:

func (kl *Kubelet) syncPod(_ context.Context, updateType kubetypes.SyncPodType, pod, mirrorPod *v1.Pod, podStatus *kubecontainer.PodStatus) (isTerminal bool, err error) {
    // other content omitted
    kl.statusManager.SetPodStatus(pod, apiPodStatus)
    // other content omitted
}


SetPodStatus calls updateStatusInternal to update the Pod status:

func (m *manager) SetPodStatus(pod *v1.Pod, status v1.PodStatus) {
    m.podStatusesLock.Lock()
    defer m.podStatusesLock.Unlock()

    // Make sure we're caching a deep copy.
    status = *status.DeepCopy()

    // Force a status update if deletion timestamp is set. This is necessary
    // because if the pod is in the non-running state, the pod worker still
    // needs to be able to trigger an update and/or deletion.
    m.updateStatusInternal(pod, status, pod.DeletionTimestamp != nil)
}


Besides updating the Pod status in the local cache, updateStatusInternal also queues a request to be sent to the API server. In updateStatusInternal, the following is logged:

Jun 15 15:18:59 iv-yd6ldax3wgr9cxxjx9iv kubelet[3390574]: I0615 15:18:59.273910 3390574 status_manager.go:552] "updateStatusInternal" version=1 pod="default/dp-test-5986bbcc8f-gcrmm" podUID=dd780355-f908-4bf2-9193-68dabbbca69e containers="(nginx state=waiting previous=<none>)"


The corresponding code:

func (m *manager) updateStatusInternal(pod *v1.Pod, status v1.PodStatus, forceUpdate bool) bool {
    // content omitted
    // Perform some more extensive logging of container termination state to assist in
    // debugging production races (generally not needed).
    if klogV := klog.V(5); klogV.Enabled() {
        var containers []string
        for _, s := range append(append([]v1.ContainerStatus(nil), status.InitContainerStatuses...), status.ContainerStatuses...) {
            var current, previous string
            switch {
            case s.State.Running != nil:
                current = "running"
            case s.State.Waiting != nil:
                current = "waiting"
            case s.State.Terminated != nil:
                current = fmt.Sprintf("terminated=%d", s.State.Terminated.ExitCode)
            default:
                current = "unknown"
            }
            switch {
            case s.LastTerminationState.Running != nil:
                previous = "running"
            case s.LastTerminationState.Waiting != nil:
                previous = "waiting"
            case s.LastTerminationState.Terminated != nil:
                previous = fmt.Sprintf("terminated=%d", s.LastTerminationState.Terminated.ExitCode)
            default:
                previous = "<none>"
            }
            containers = append(containers, fmt.Sprintf("(%s state=%s previous=%s)", s.Name, current, previous))
        }
        sort.Strings(containers)
        klogV.InfoS("updateStatusInternal", "version", cachedStatus.version+1, "pod", klog.KObj(pod), "podUID", pod.UID, "containers", strings.Join(containers, " "))
    }
    // content omitted
}


The request to update the Pod status in the API server is pushed onto a channel, and the following is logged:

Jun 15 15:18:59 iv-yd6ldax3wgr9cxxjx9iv kubelet[3390574]: I0615 15:18:59.273990 3390574 status_manager.go:582] "Status Manager: adding pod with new status to podStatusChannel" pod="default/dp-test-5986bbcc8f-gcrmm" podUID=dd780355-f908-4bf2-9193-68dabbbca69e statusVersion=1 status={Phase:Pending Conditions:[{Type:Initialized Status:True LastProbeTime:0001-01-01 00:00:00 +0000 UTC LastTransitionTime:2024-06-15 15:18:59 +0800 CST Reason: Message:} {Type:Ready Status:False LastProbeTime:0001-01-01 00:00:00 +0000 UTC LastTransitionTime:2024-06-15 15:18:59 +0800 CST Reason:ContainersNotReady Message:containers with unready status: [nginx]} {Type:ContainersReady Status:False LastProbeTime:0001-01-01 00:00:00 +0000 UTC LastTransitionTime:2024-06-15 15:18:59 +0800 CST Reason:ContainersNotReady Message:containers with unready status: [nginx]} {Type:PodScheduled Status:True LastProbeTime:0001-01-01 00:00:00 +0000 UTC LastTransitionTime:2024-06-15 15:18:59 +0800 CST Reason: Message:}] Message: Reason: NominatedNodeName: HostIP:192.168.0.9 PodIP: PodIPs:[] StartTime:2024-06-15 15:18:59 +0800 CST InitContainerStatuses:[] ContainerStatuses:[{Name:nginx State:{Waiting:&ContainerStateWaiting{Reason:ContainerCreating,Message:,} Running:nil Terminated:nil} LastTerminationState:{Waiting:nil Running:nil Terminated:nil} Ready:false RestartCount:0 Image:nginx:latest ImageID: ContainerID: Started:0xc001ed178c}] QOSClass:BestEffort EphemeralContainerStatuses:[]}


The corresponding code:

func (m *manager) updateStatusInternal(pod *v1.Pod, status v1.PodStatus, forceUpdate bool) bool {
    // content omitted
    m.podStatuses[pod.UID] = newStatus

    select {
    case m.podStatusChannel <- podStatusSyncRequest{pod.UID, newStatus}:
        klog.V(5).InfoS("Status Manager: adding pod with new status to podStatusChannel",
            "pod", klog.KObj(pod),
            "podUID", pod.UID,
            "statusVersion", newStatus.version,
            "status", newStatus.status)
        return true
    default:
        // Let the periodic syncBatch handle the update if the channel is full.
        // We can't block, since we hold the mutex lock.
        klog.V(4).InfoS("Skipping the status update for pod for now because the channel is full",
            "pod", klog.KObj(pod),
            "status", status)
        return false
    }
    // content omitted
}


As you can see, the request carries only two pieces of information (the types are sketched below):

  • pod.UID

  • newStatus
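
For orientation, here is a sketch of the two types behind podStatusChannel, trimmed down to the fields the surrounding code actually uses (the exact definitions vary across kubelet versions):

package status

import (
    v1 "k8s.io/api/core/v1"
    "k8s.io/apimachinery/pkg/types"
)

// versionedPodStatus pairs the computed status with a per-pod version counter
// that the status manager uses to tell old and new updates apart.
type versionedPodStatus struct {
    status  v1.PodStatus
    version uint64
    // podName, podNamespace, ... omitted here
}

// podStatusSyncRequest is what gets pushed onto podStatusChannel.
type podStatusSyncRequest struct {
    podUID types.UID
    status versionedPodStatus
}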

Updating the Pod status in the API server

Accordingly, the next log line shows the channel's handler (the Status Manager) receiving this request:

Jun 15 15:18:59 iv-yd6ldax3wgr9cxxjx9iv kubelet[3390574]: I0615 15:18:59.274047 3390574 status_manager.go:186] "Status Manager: syncing pod with status from podStatusChannel" podUID=dd780355-f908-4bf2-9193-68dabbbca69e statusVersion=1 status={Phase:Pending Conditions:[{Type:Initialized Status:True LastProbeTime:0001-01-01 00:00:00 +0000 UTC LastTransitionTime:2024-06-15 15:18:59 +0800 CST Reason: Message:} {Type:Ready Status:False LastProbeTime:0001-01-01 00:00:00 +0000 UTC LastTransitionTime:2024-06-15 15:18:59 +0800 CST Reason:ContainersNotReady Message:containers with unready status: [nginx]} {Type:ContainersReady Status:False LastProbeTime:0001-01-01 00:00:00 +0000 UTC LastTransitionTime:2024-06-15 15:18:59 +0800 CST Reason:ContainersNotReady Message:containers with unready status: [nginx]} {Type:PodScheduled Status:True LastProbeTime:0001-01-01 00:00:00 +0000 UTC LastTransitionTime:2024-06-15 15:18:59 +0800 CST Reason: Message:}] Message: Reason: NominatedNodeName: HostIP:192.168.0.9 PodIP: PodIPs:[] StartTime:2024-06-15 15:18:59 +0800 CST InitContainerStatuses:[] ContainerStatuses:[{Name:nginx State:{Waiting:&ContainerStateWaiting{Reason:ContainerCreating,Message:,} Running:nil Terminated:nil} LastTerminationState:{Waiting:nil Running:nil Terminated:nil} Ready:false RestartCount:0 Image:nginx:latest ImageID: ContainerID: Started:0xc001ed178c}] QOSClass:BestEffort EphemeralContainerStatuses:[]}


The corresponding code:

func (m *manager) Start() {
    // Don't start the status manager if we don't have a client. This will happen
    // on the master, where the kubelet is responsible for bootstrapping the pods
    // of the master components.
    if m.kubeClient == nil {
        klog.InfoS("Kubernetes client is nil, not starting status manager")
        return
    }

    klog.InfoS("Starting to sync pod status with apiserver")

    //nolint:staticcheck // SA1015 Ticker can leak since this is only called once and doesn't handle termination.
    syncTicker := time.NewTicker(syncPeriod).C

    // syncPod and syncBatch share the same go routine to avoid sync races.
    go wait.Forever(func() {
        for {
            select {
            case syncRequest := <-m.podStatusChannel:
                klog.V(5).InfoS("Status Manager: syncing pod with status from podStatusChannel",
                    "podUID", syncRequest.podUID,
                    "statusVersion", syncRequest.status.version,
                    "status", syncRequest.status.status)
                m.syncPod(syncRequest.podUID, syncRequest.status)
            case <-syncTicker:
                klog.V(5).InfoS("Status Manager: syncing batch")
                // remove any entries in the status channel since the batch will handle them
                for i := len(m.podStatusChannel); i > 0; i-- {
                    <-m.podStatusChannel
                }
                m.syncBatch()
            }
        }
    }, 0)
}


The Status Manager is a goroutine that pushes status updates to the API server, mainly through m.syncPod:

Jun 15 15:18:59 iv-yd6ldax3wgr9cxxjx9iv kubelet[3390574]: I0615 15:18:59.285142 3390574 status_manager.go:719] "Patch status for pod" pod="default/dp-test-5986bbcc8f-gcrmm" podUID=dd780355-f908-4bf2-9193-68dabbbca69e patch="{\"metadata\":{\"uid\":\"dd780355-f908-4bf2-9193-68dabbbca69e\"},\"status\":{\"$setElementOrder/conditions\":[{\"type\":\"Initialized\"},{\"type\":\"Ready\"},{\"type\":\"ContainersReady\"},{\"type\":\"PodScheduled\"}],\"conditions\":[{\"lastProbeTime\":null,\"lastTransitionTime\":\"2024-06-15T07:18:59Z\",\"status\":\"True\",\"type\":\"Initialized\"},{\"lastProbeTime\":null,\"lastTransitionTime\":\"2024-06-15T07:18:59Z\",\"message\":\"containers with unready status: [nginx]\",\"reason\":\"ContainersNotReady\",\"status\":\"False\",\"type\":\"Ready\"},{\"lastProbeTime\":null,\"lastTransitionTime\":\"2024-06-15T07:18:59Z\",\"message\":\"containers with unready status: [nginx]\",\"reason\":\"ContainersNotReady\",\"status\":\"False\",\"type\":\"ContainersReady\"}],\"containerStatuses\":[{\"image\":\"nginx:latest\",\"imageID\":\"\",\"lastState\":{},\"name\":\"nginx\",\"ready\":false,\"restartCount\":0,\"started\":false,\"state\":{\"waiting\":{\"reason\":\"ContainerCreating\"}}}],\"hostIP\":\"192.168.0.9\",\"startTime\":\"2024-06-15T07:18:59Z\"}}"


The corresponding code:

// syncPod syncs the given status with the API server. The caller must not hold the lock.
func (m *manager) syncPod(uid types.UID, status versionedPodStatus) {
    // content omitted
    newPod, patchBytes, unchanged, err := statusutil.PatchPodStatus(context.TODO(), m.kubeClient, pod.Namespace, pod.Name, pod.UID, pod.Status, mergedStatus)
    klog.V(3).InfoS("Patch status for pod", "pod", klog.KObj(pod), "podUID", uid, "patch", string(patchBytes))
    // content omitted
}


Here the kubelet uses the Patch API:

updatedPod, err := c.CoreV1().Pods(namespace).Patch(ctx, name, types.StrategicMergePatchType, patchBytes, metav1.PatchOptions{}, "status")
if err != nil {
    return nil, nil, false, fmt.Errorf("failed to patch status %q for pod %q/%q: %v", patchBytes, namespace, name, err)
}
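
The patch bytes themselves come from diffing the old Pod against a copy whose status has been replaced. A minimal sketch of that idea using apimachinery's strategic merge patch helper (this approximates what statusutil.PatchPodStatus does internally; it is not the kubelet's exact code):

package main

import (
    "encoding/json"
    "fmt"

    v1 "k8s.io/api/core/v1"
    "k8s.io/apimachinery/pkg/util/strategicpatch"
)

func main() {
    oldPod := v1.Pod{Status: v1.PodStatus{Phase: v1.PodPending}}
    newPod := oldPod.DeepCopy()
    newPod.Status.Phase = v1.PodRunning
    newPod.Status.Message = "containers started"

    oldData, _ := json.Marshal(oldPod)
    newData, _ := json.Marshal(newPod)

    // Build a strategic merge patch that only carries the changed status fields.
    patch, err := strategicpatch.CreateTwoWayMergePatch(oldData, newData, v1.Pod{})
    if err != nil {
        panic(err)
    }
    // Prints something like: {"status":{"message":"containers started","phase":"Running"}}
    fmt.Println(string(patch))
}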


Once the update succeeds, the following is logged:

Jun 15 15:18:59 iv-yd6ldax3wgr9cxxjx9iv kubelet[3390574]: I0615 15:18:59.285211 3390574 status_manager.go:728] "Status for pod updated successfully" pod="default/dp-test-5986bbcc8f-gcrmm" statusVersion=1 status={Phase:Pending Conditions:[{Type:Initialized Status:True LastProbeTime:0001-01-01 00:00:00 +0000 UTC LastTransitionTime:2024-06-15 15:18:59 +0800 CST Reason: Message:} {Type:Ready Status:False LastProbeTime:0001-01-01 00:00:00 +0000 UTC LastTransitionTime:2024-06-15 15:18:59 +0800 CST Reason:ContainersNotReady Message:containers with unready status: [nginx]} {Type:ContainersReady Status:False LastProbeTime:0001-01-01 00:00:00 +0000 UTC LastTransitionTime:2024-06-15 15:18:59 +0800 CST Reason:ContainersNotReady Message:containers with unready status: [nginx]} {Type:PodScheduled Status:True LastProbeTime:0001-01-01 00:00:00 +0000 UTC LastTransitionTime:2024-06-15 15:18:59 +0800 CST Reason: Message:}] Message: Reason: NominatedNodeName: HostIP:192.168.0.9 PodIP: PodIPs:[] StartTime:2024-06-15 15:18:59 +0800 CST InitContainerStatuses:[] ContainerStatuses:[{Name:nginx State:{Waiting:&ContainerStateWaiting{Reason:ContainerCreating,Message:,} Running:nil Terminated:nil} LastTerminationState:{Waiting:nil Running:nil Terminated:nil} Ready:false RestartCount:0 Image:nginx:latest ImageID: ContainerID: Started:0xc001ed178c}] QOSClass:BestEffort EphemeralContainerStatuses:[]}


The corresponding code:

func (m *manager) syncPod(uid types.UID, status versionedPodStatus) {
    // content omitted
    if unchanged {
        klog.V(3).InfoS("Status for pod is up-to-date", "pod", klog.KObj(pod), "statusVersion", status.version)
    } else {
        klog.V(3).InfoS("Status for pod updated successfully", "pod", klog.KObj(pod), "statusVersion", status.version, "status", mergedStatus)
        pod = newPod
        // We pass a new object (result of API call which contains updated ResourceVersion)
        m.podStartupLatencyHelper.RecordStatusUpdated(pod)
    }
    // content omitted
}


Creating the cgroup

Next, the kubelet needs to create a cgroup for the Pod and apply its resource configuration:


func (kl *Kubelet) syncPod(_ context.Context, updateType kubetypes.SyncPodType, pod, mirrorPod *v1.Pod, podStatus *kubecontainer.PodStatus) (isTerminal bool, err error) {
    // content omitted
    // Create Cgroups for the pod and apply resource parameters
    // to them if cgroups-per-qos flag is enabled.
    pcm := kl.containerManager.NewPodContainerManager()
    // If pod has already been terminated then we need not create
    // or update the pod's cgroup
    // TODO: once context cancellation is added this check can be removed
    if !kl.podWorkers.IsPodTerminationRequested(pod.UID) {
        // When the kubelet is restarted with the cgroups-per-qos
        // flag enabled, all the pod's running containers
        // should be killed intermittently and brought back up
        // under the qos cgroup hierarchy.
        // Check if this is the pod's first sync
        firstSync := true
        for _, containerStatus := range apiPodStatus.ContainerStatuses {
            if containerStatus.State.Running != nil {
                firstSync = false
                break
            }
        }
        // Don't kill containers in pod if pod's cgroups already
        // exists or the pod is running for the first time
        podKilled := false
        if !pcm.Exists(pod) && !firstSync {
            p := kubecontainer.ConvertPodStatusToRunningPod(kl.getRuntime().Type(), podStatus)
            if err := kl.killPod(ctx, pod, p, nil); err == nil {
                podKilled = true
            } else {
                klog.ErrorS(err, "KillPod failed", "pod", klog.KObj(pod), "podStatus", podStatus)
            }
        }
        // Create and Update pod's Cgroups
        // Don't create cgroups for run once pod if it was killed above
        // The current policy is not to restart the run once pods when
        // the kubelet is restarted with the new flag as run once pods are
        // expected to run only once and if the kubelet is restarted then
        // they are not expected to run again.
        // We don't create and apply updates to cgroup if its a run once pod and was killed above
        if !(podKilled && pod.Spec.RestartPolicy == v1.RestartPolicyNever) {
            if !pcm.Exists(pod) {
                if err := kl.containerManager.UpdateQOSCgroups(); err != nil {
                    klog.V(2).InfoS("Failed to update QoS cgroups while syncing pod", "pod", klog.KObj(pod), "err", err)
                }
                if err := pcm.EnsureExists(pod); err != nil {
                    kl.recorder.Eventf(pod, v1.EventTypeWarning, events.FailedToCreatePodContainer, "unable to ensure pod container exists: %v", err)
                    return false, fmt.Errorf("failed to ensure that the pod: %v cgroups exist and are correctly applied: %v", pod.UID, err)
                }
            }
        }
    }
    // content omitted
}


UpdateQOSCgroups updates the resource configuration, and if the Pod's cgroup does not exist yet, pcm.EnsureExists creates it:

// EnsureExists takes a pod as argument and makes sure that
// pod cgroup exists if qos cgroup hierarchy flag is enabled.
// If the pod level container doesn't already exist it is created.
func (m *podContainerManagerImpl) EnsureExists(pod *v1.Pod) error {
    podContainerName, _ := m.GetPodContainerName(pod)
    // check if container already exist
    alreadyExists := m.Exists(pod)
    if !alreadyExists {
        enforceMemoryQoS := false
        if utilfeature.DefaultFeatureGate.Enabled(kubefeatures.MemoryQoS) &&
            libcontainercgroups.IsCgroup2UnifiedMode() {
            enforceMemoryQoS = true
        }
        // Create the pod container
        containerConfig := &CgroupConfig{
            Name:               podContainerName,
            ResourceParameters: ResourceConfigForPod(pod, m.enforceCPULimits, m.cpuCFSQuotaPeriod, enforceMemoryQoS),
        }
        if m.podPidsLimit > 0 {
            containerConfig.ResourceParameters.PidsLimit = &m.podPidsLimit
        }
        if enforceMemoryQoS {
            klog.V(4).InfoS("MemoryQoS config for pod", "pod", klog.KObj(pod), "unified", containerConfig.ResourceParameters.Unified)
        }
        if pod.Annotations[annotationHostResourceView] == "false" {
            hostResourceView := false
            containerConfig.ResourceParameters.HostResourceView = &hostResourceView
        }
        if err := m.cgroupManager.Create(containerConfig); err != nil {
            return fmt.Errorf("failed to create container for %v : %v", podContainerName, err)
        }
    }
    return nil
}


The metric cgroup_manager_duration_seconds records how long creating the cgroup takes (other operation types are distinguished by a label).

Updating the cgroup is done with runc's cgroup manager package:

// Update updates the cgroup with the specified Cgroup Configuration
func (m *cgroupManagerImpl) Update(cgroupConfig *CgroupConfig) error {
    start := time.Now()
    defer func() {
        metrics.CgroupManagerDuration.WithLabelValues("update").Observe(metrics.SinceInSeconds(start))
    }()

    libcontainerCgroupConfig := m.libctCgroupConfig(cgroupConfig, true)
    manager, err := manager.New(libcontainerCgroupConfig)
    if err != nil {
        return fmt.Errorf("failed to create cgroup manager: %v", err)
    }
    return manager.Set(libcontainerCgroupConfig.Resources)
}


Creating the Pod directories

Next, the kubelet creates the data directories for the Pod:

// makePodDataDirs creates the dirs for the pod datas.
func (kl *Kubelet) makePodDataDirs(pod *v1.Pod) error {
    uid := pod.UID
    if err := os.MkdirAll(kl.getPodDir(uid), 0750); err != nil && !os.IsExist(err) {
        return err
    }
    if err := os.MkdirAll(kl.getPodVolumesDir(uid), 0750); err != nil && !os.IsExist(err) {
        return err
    }
    if err := os.MkdirAll(kl.getPodPluginsDir(uid), 0750); err != nil && !os.IsExist(err) {
        return err
    }
    return nil
}


This covers the following three directories (see the path sketch after this list):

  • the Pod root directory

  • the volumes root directory

  • the plugins root directory
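
With the default kubelet root directory (an assumption here; it is configurable via --root-dir), these resolve to paths like the following, shown as a small Go sketch using the UID from the example Pod:

package main

import (
    "fmt"
    "path/filepath"
)

func main() {
    root := "/var/lib/kubelet"
    uid := "dd780355-f908-4bf2-9193-68dabbbca69e"

    fmt.Println(filepath.Join(root, "pods", uid))            // Pod root directory
    fmt.Println(filepath.Join(root, "pods", uid, "volumes")) // volumes root directory
    fmt.Println(filepath.Join(root, "pods", uid, "plugins")) // plugins root directory
}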

Mounting the volumes

Next, the kubelet waits until the volumes required by the Pod have been mounted (it does not wait for Pods that are terminating):

Jun 15 15:19:06 iv-yd6ldax3wgr9cxxjx9iv kubelet[3390574]: I0615 15:19:06.599897 3390574 volume_manager.go:408] "Waiting for volumes to attach and mount for pod" pod="default/dp-test-5986bbcc8f-gcrmm"


The corresponding code:


func (kl *Kubelet) syncPod(_ context.Context, updateType kubetypes.SyncPodType, pod, mirrorPod *v1.Pod, podStatus *kubecontainer.PodStatus) (isTerminal bool, err error) {
    // content omitted
    // Volume manager will not mount volumes for terminating pods
    // TODO: once context cancellation is added this check can be removed
    if !kl.podWorkers.IsPodTerminationRequested(pod.UID) {
        // Wait for volumes to attach/mount
        if err := kl.volumeManager.WaitForAttachAndMount(pod); err != nil {
            kl.recorder.Eventf(pod, v1.EventTypeWarning, events.FailedMountVolume, "Unable to attach or mount volumes: %v", err)
            klog.ErrorS(err, "Unable to attach or mount volumes for pod; skipping pod", "pod", klog.KObj(pod))
            return false, err
        }
    }
    // content omitted
}


The kubelet tracks a Pod's volume state through desiredStateOfWorld and actualStateOfWorld. Here, WaitForAttachAndMount relies solely on these two caches to confirm that every volume has been attached and mounted:

The handling differs per volume type, and third-party volumes also involve a CSI plugin; that will be analyzed separately in a later article.

func (vm *volumeManager) WaitForAttachAndMount(pod *v1.Pod) error {
    // content omitted
    err := wait.PollImmediate(
        podAttachAndMountRetryInterval,
        podAttachAndMountTimeout,
        vm.verifyVolumesMountedFunc(uniquePodName, expectedVolumes))

    if err != nil {
        unmountedVolumes :=
            vm.getUnmountedVolumes(uniquePodName, expectedVolumes)
        // Also get unattached volumes for error message
        unattachedVolumes :=
            vm.getUnattachedVolumes(expectedVolumes)

        if len(unmountedVolumes) == 0 {
            return nil
        }

        return fmt.Errorf(
            "unmounted volumes=%v, unattached volumes=%v: %s",
            unmountedVolumes,
            unattachedVolumes,
            err)
    }
    // content omitted
}


If this step fails, the error shows up in the Pod's events; it is one we run into frequently:

Unable to attach or mount volumes xxx


The unmounted volumes and unattached volumes values here are derived from desiredStateOfWorld and actualStateOfWorld.

Once all volumes have been mounted successfully, the following is logged:

Jun 15 15:19:06 iv-yd6ldax3wgr9cxxjx9iv kubelet[3390574]: I0615 15:19:06.599946 3390574 volume_manager.go:439] "All volumes are attached and mounted for pod" pod="default/dp-test-5986bbcc8f-gcrmm"


Fetching the imagePullSecrets

Next, the kubelet fetches the imagePullSecrets from the API server:

func (kl *Kubelet) syncPod(_ context.Context, updateType kubetypes.SyncPodType, pod, mirrorPod *v1.Pod, podStatus *kubecontainer.PodStatus) (isTerminal bool, err error) {
    // content omitted
    // Fetch the pull secrets for the pod
    pullSecrets := kl.getPullSecretsForPod(pod)
    // content omitted
}


The imagePullSecrets are passed to the container runtime's SyncPod method, which starts the containers:

func (kl *Kubelet) syncPod(_ context.Context, updateType kubetypes.SyncPodType, pod, mirrorPod *v1.Pod, podStatus *kubecontainer.PodStatus) (isTerminal bool, err error) {
    // content omitted
    // Ensure the pod is being probed
    kl.probeManager.AddPod(pod)

    // Call the container runtime's SyncPod callback
    result := kl.containerRuntime.SyncPod(ctx, pod, podStatus, pullSecrets, kl.backOff)
    kl.reasonCache.Update(pod.UID, result)
    if err := result.Error(); err != nil {
        // Do not return error if the only failures were pods in backoff
        for _, r := range result.SyncResults {
            if r.Error != kubecontainer.ErrCrashLoopBackOff && r.Error != images.ErrImagePullBackOff {
                // Do not record an event here, as we keep all event logging for sync pod failures
                // local to container runtime, so we get better errors.
                return false, err
            }
        }

        return false, nil
    }
    // content omitted
}


Creating the sandbox and the containers

At this point, the kubelet's preparation work before creating the Pod is done. The SyncPod method then calls the runtime's RPCs over the CRI interface to create the sandbox and the containers:

The details will be analyzed separately in a later article.

// SyncPod syncs the running pod into the desired pod by executing following steps:
//
//  1. Compute sandbox and container changes.
//  2. Kill pod sandbox if necessary.
//  3. Kill any containers that should not be running.
//  4. Create sandbox if necessary.
//  5. Create ephemeral containers.
//  6. Create init containers.
//  7. Create normal containers.
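
To get a concrete feel for what "calling the runtime over CRI" means, here is a minimal sketch that issues the same RunPodSandbox RPC the kubelet's runtime client uses (the socket path and the sandbox metadata are assumptions; the kubelet builds a far richer PodSandboxConfig from the Pod spec, and a real runtime may well reject a config this bare):

package main

import (
    "context"
    "fmt"
    "time"

    "google.golang.org/grpc"
    "google.golang.org/grpc/credentials/insecure"
    runtimeapi "k8s.io/cri-api/pkg/apis/runtime/v1"
)

func main() {
    // Dial the CRI runtime socket; the kubelet uses whatever --container-runtime-endpoint points at.
    conn, err := grpc.Dial("unix:///run/containerd/containerd.sock",
        grpc.WithTransportCredentials(insecure.NewCredentials()))
    if err != nil {
        panic(err)
    }
    defer conn.Close()

    client := runtimeapi.NewRuntimeServiceClient(conn)
    ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
    defer cancel()

    // Step 4 of the list above: create the Pod sandbox.
    resp, err := client.RunPodSandbox(ctx, &runtimeapi.RunPodSandboxRequest{
        Config: &runtimeapi.PodSandboxConfig{
            Metadata: &runtimeapi.PodSandboxMetadata{
                Name:      "demo",
                Namespace: "default",
                Uid:       "demo-uid",
            },
        },
    })
    if err != nil {
        panic(err)
    }
    fmt.Println("sandbox id:", resp.PodSandboxId)
}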

