What happens before the kubelet creates a Pod?

Posted by abin在路上 on 2024-06-17

When the kubelet watches a newly added Pod, the main things it has to do are:

  • Manage the Pod's status: besides updating the local cache, sync it to the API server

  • Check whether the node has enough resources to create the Pod

  • Create the cgroup and apply the resource configuration

  • Create the root directories for the Pod, its volumes, and its plugins

  • Mount the volumes and wait until all of them are mounted successfully

  • Fetch the imagePullSecrets

  • Call the CRI to create the PodSandbox

  • Call the CRI to create the containers (ephemeral containers, init containers, regular containers)

Pod events are the channel through which Kubernetes components surface key information to users. For a given Pod, the following events normally appear; the sections below walk through them in order.
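
For reference, these are the same events that kubectl describe pod shows, and they can also be read programmatically. A minimal client-go sketch (the kubeconfig path, namespace, and pod name are placeholders, not values taken from this article):

package main

import (
    "context"
    "fmt"

    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/client-go/kubernetes"
    "k8s.io/client-go/tools/clientcmd"
)

func main() {
    // Build a client from the local kubeconfig (~/.kube/config).
    cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
    if err != nil {
        panic(err)
    }
    cs, err := kubernetes.NewForConfig(cfg)
    if err != nil {
        panic(err)
    }

    // List the events that reference a specific Pod.
    events, err := cs.CoreV1().Events("default").List(context.TODO(), metav1.ListOptions{
        FieldSelector: "involvedObject.kind=Pod,involvedObject.name=dp-test-5986bbcc8f-4vlmd",
    })
    if err != nil {
        panic(err)
    }
    for _, e := range events.Items {
        fmt.Printf("%s\t%s\t%s\n", e.Type, e.Reason, e.Message)
    }
}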

The Pod is scheduled

When a Pod is created, it is first submitted to the API server (the same is true for Pods managed by a controller). The scheduler then notices the unscheduled Pod and schedules it. Once a suitable node has been found through a series of scheduling algorithms, the scheduler binds the Pod to that node. For the default scheduler, the sign that scheduling has finished is the following message in the Pod's events:

Successfully assigned default/dp-test-5986bbcc8f-4vlmd to 192.168.0.9


The corresponding code in the scheduler:

func (sched *Scheduler) finishBinding(fwk framework.Framework, assumed *v1.Pod, targetNode string, status *framework.Status) {
    if finErr := sched.Cache.FinishBinding(assumed); finErr != nil {
        klog.ErrorS(finErr, "Scheduler cache FinishBinding failed")
    }
    if !status.IsSuccess() {
        klog.V(1).InfoS("Failed to bind pod", "pod", klog.KObj(assumed))
        return
    }

    fwk.EventRecorder().Eventf(assumed, nil, v1.EventTypeNormal, "Scheduled", "Binding", "Successfully assigned %v/%v to %v", assumed.Namespace, assumed.Name, targetNode)
}


Kubelet discovers a Pod that needs to be handled

The kubelet continuously watches the API server for the list of Pods scheduled to the current node:

The list is filtered with a field selector on spec.nodeName.

// NewSourceApiserver creates a config source that watches and pulls from the apiserver.
func NewSourceApiserver(c clientset.Interface, nodeName types.NodeName, nodeHasSynced func() bool, updates chan<- interface{}) {
    lw := cache.NewListWatchFromClient(c.CoreV1().RESTClient(), "pods", metav1.NamespaceAll, fields.OneTermEqualSelector("spec.nodeName", string(nodeName)))

    // The Reflector responsible for watching pods at the apiserver should be run only after
    // the node sync with the apiserver has completed.
    klog.InfoS("Waiting for node sync before watching apiserver pods")
    go func() {
        for {
            if nodeHasSynced() {
                klog.V(4).InfoS("node sync completed")
                break
            }
            time.Sleep(WaitForAPIServerSyncPeriod)
            klog.V(4).InfoS("node sync has not completed yet")
        }
        klog.InfoS("Watching apiserver")
        newSourceApiserverFromLW(lw, updates)
    }()
}


When the kubelet sees that a new Pod has been scheduled to the current node, it logs the following:

The log verbosity has to be raised to 5 for this and the later log lines to show up.

Jun 15 15:18:59 iv-yd6ldax3wgr9cxxjx9iv kubelet[3390574]: I0615 15:18:59.269157 3390574 config.go:398] "Receiving a new pod" pod="default/dp-test-5986bbcc8f-gcrmm"


The corresponding code:

// recordFirstSeenTime records the first seen time of this pod.
func recordFirstSeenTime(pod *v1.Pod) {
    klog.V(4).InfoS("Receiving a new pod", "pod", klog.KObj(pod))
    pod.Annotations[kubetypes.ConfigFirstSeenAnnotationKey] = kubetypes.NewTimestamp().GetString()
}


Here the kubelet adds an annotation with the key kubernetes.io/config.seen to the Pod (this annotation is only used inside the kubelet and is not synced to the API server), recording when the kubelet first saw the Pod.

The kubelet then classifies the Pods it receives and sends them through the s.updates channel to the corresponding handler. Following this channel, we can find the handler functions:

// syncLoop is the main loop for processing changes. It watches for changes from
// three channels (file, apiserver, and http) and creates a union of them. For
// any new change seen, will run a sync against desired state and running state. If
// no changes are seen to the configuration, will synchronize the last known desired
// state every sync-frequency seconds. Never returns.
func (kl *Kubelet) syncLoop(ctx context.Context, updates <-chan kubetypes.PodUpdate, handler SyncHandler) {
    klog.InfoS("Starting kubelet main sync loop")
    // The syncTicker wakes up kubelet to checks if there are any pod workers
    // that need to be sync'd. A one-second period is sufficient because the
    // sync interval is defaulted to 10s.
    syncTicker := time.NewTicker(time.Second)
    defer syncTicker.Stop()
    housekeepingTicker := time.NewTicker(housekeepingPeriod)
    defer housekeepingTicker.Stop()
    plegCh := kl.pleg.Watch()
    const (
        base   = 100 * time.Millisecond
        max    = 5 * time.Second
        factor = 2
    )
    duration := base
    // Responsible for checking limits in resolv.conf
    // The limits do not have anything to do with individual pods
    // Since this is called in syncLoop, we don't need to call it anywhere else
    if kl.dnsConfigurer != nil && kl.dnsConfigurer.ResolverConfig != "" {
        kl.dnsConfigurer.CheckLimitsForResolvConf()
    }

    for {
        if err := kl.runtimeState.runtimeErrors(); err != nil {
            klog.ErrorS(err, "Skipping pod synchronization")
            // exponential backoff
            time.Sleep(duration)
            duration = time.Duration(math.Min(float64(max), factor*float64(duration)))
            continue
        }
        // reset backoff if we have a success
        duration = base

        kl.syncLoopMonitor.Store(kl.clock.Now())
        if !kl.syncLoopIteration(ctx, updates, handler, syncTicker.C, housekeepingTicker.C, plegCh) {
            break
        }
        kl.syncLoopMonitor.Store(kl.clock.Now())
    }
}


syncLoop is the kubelet's main processing loop, and syncLoopIteration reads events from the various channels and dispatches them to workers. It is fairly long, but the comments give a good overview of the logic:

// syncLoopIteration reads from various channels and dispatches pods to the
// given handler.
//
// Arguments:
// 1.  configCh:       a channel to read config events from
// 2.  handler:        the SyncHandler to dispatch pods to
// 3.  syncCh:         a channel to read periodic sync events from
// 4.  housekeepingCh: a channel to read housekeeping events from
// 5.  plegCh:         a channel to read PLEG updates from
//
// Events are also read from the kubelet liveness manager's update channel.
//
// The workflow is to read from one of the channels, handle that event, and
// update the timestamp in the sync loop monitor.
//
// Here is an appropriate place to note that despite the syntactical
// similarity to the switch statement, the case statements in a select are
// evaluated in a pseudorandom order if there are multiple channels ready to
// read from when the select is evaluated.  In other words, case statements
// are evaluated in random order, and you can not assume that the case
// statements evaluate in order if multiple channels have events.
//
// With that in mind, in truly no particular order, the different channels
// are handled as follows:
//
//   - configCh: dispatch the pods for the config change to the appropriate
//     handler callback for the event type
//   - plegCh: update the runtime cache; sync pod
//   - syncCh: sync all pods waiting for sync
//   - housekeepingCh: trigger cleanup of pods
//   - health manager: sync pods that have failed or in which one or more
//     containers have failed health checks


At this point, the kubelet logs the following, indicating that it has received the request to add the Pod and is calling the corresponding handler:

Jun 15 15:18:59 iv-yd6ldax3wgr9cxxjx9iv kubelet[3390574]: I0615 15:18:59.269224 3390574 kubelet.go:2206] "SyncLoop ADD" source="api" pods="[default/dp-test-5986bbcc8f-gcrmm]"


The corresponding code:

func (kl *Kubelet) syncLoopIteration(ctx context.Context, configCh <-chan kubetypes.PodUpdate, handler SyncHandler,
    syncCh <-chan time.Time, housekeepingCh <-chan time.Time, plegCh <-chan *pleg.PodLifecycleEvent) bool {
    select {
    case u, open := <-configCh:
        // Update from a config source; dispatch it to the right handler
        // callback.
        if !open {
            klog.ErrorS(nil, "Update channel is closed, exiting the sync loop")
            return false
        }

        switch u.Op {
        case kubetypes.ADD:
            klog.V(2).InfoS("SyncLoop ADD", "source", u.Source, "pods", klog.KObjSlice(u.Pods))
            // After restarting, kubelet will get all existing pods through
            // ADD as if they are new pods. These pods will then go through the
            // admission process and *may* be rejected. This can be resolved
            // once we have checkpointing.
            handler.HandlePodAdditions(u.Pods)
        case kubetypes.UPDATE:
            klog.V(2).InfoS("SyncLoop UPDATE", "source", u.Source, "pods", klog.KObjSlice(u.Pods))
            handler.HandlePodUpdates(u.Pods)
        case kubetypes.REMOVE:
            klog.V(2).InfoS("SyncLoop REMOVE", "source", u.Source, "pods", klog.KObjSlice(u.Pods))
            handler.HandlePodRemoves(u.Pods)
        case kubetypes.RECONCILE:
            klog.V(4).InfoS("SyncLoop RECONCILE", "source", u.Source, "pods", klog.KObjSlice(u.Pods))
            handler.HandlePodReconcile(u.Pods)
        case kubetypes.DELETE:
            klog.V(2).InfoS("SyncLoop DELETE", "source", u.Source, "pods", klog.KObjSlice(u.Pods))
            // DELETE is treated as a UPDATE because of graceful deletion.
            handler.HandlePodUpdates(u.Pods)
        case kubetypes.SET:
            // TODO: Do we want to support this?
            klog.ErrorS(nil, "Kubelet does not support snapshot update")
        default:
            klog.ErrorS(nil, "Invalid operation type received", "operation", u.Op)
        }

        // content omitted
    }
    return true
}


Each event type has its own handler; we will not dig into all of them here.

  • Add Pod: HandlePodAdditions

  • Update Pod, Delete Pod: HandlePodUpdates

  • Remove Pod: HandlePodRemoves

  • Reconcile Pod: HandlePodReconcile

For the kubetypes.ADD event, the handler is HandlePodAdditions:

// HandlePodAdditions is the callback in SyncHandler for pods being added from
// a config source.
func (kl *Kubelet) HandlePodAdditions(pods []*v1.Pod) {
    start := kl.clock.Now()
    sort.Sort(sliceutils.PodsByCreationTime(pods))
    for _, pod := range pods {
        existingPods := kl.podManager.GetPods()
        // Always add the pod to the pod manager. Kubelet relies on the pod
        // manager as the source of truth for the desired state. If a pod does
        // not exist in the pod manager, it means that it has been deleted in
        // the apiserver and no action (other than cleanup) is required.
        kl.podManager.AddPod(pod)

        // Static (mirror) pods are handled by handleMirrorPod
        if kubetypes.IsMirrorPod(pod) {
            kl.handleMirrorPod(pod, start)
            continue
        }

        // Only go through the admission process if the pod is not requested
        // for termination by another part of the kubelet. If the pod is already
        // using resources (previously admitted), the pod worker is going to be
        // shutting it down. If the pod hasn't started yet, we know that when
        // the pod worker is invoked it will also avoid setting up the pod, so
        // we simply avoid doing any work.
        if !kl.podWorkers.IsPodTerminationRequested(pod.UID) {
            // We failed pods that we rejected, so activePods include all admitted
            // pods that are alive.
            activePods := kl.filterOutInactivePods(existingPods)

            // Check if we can admit the pod; if not, reject it.
            if ok, reason, message := kl.canAdmitPod(activePods, pod); !ok {
                kl.rejectPod(pod, reason, message)
                continue
            }
        }
        mirrorPod, _ := kl.podManager.GetMirrorPodByPod(pod)
        kl.dispatchWork(pod, kubetypes.SyncPodCreate, mirrorPod, start)
    }
}


Only Pods that are not being terminated go through the admission logic. The corresponding method is canAdmitPod, in which the various admitHandlers run their Admit logic; the log reads:

Jun 15 15:18:59 iv-yd6ldax3wgr9cxxjx9iv kubelet[3390574]: I0615 15:18:59.269248 3390574 topology_manager.go:210] "Topology Admit Handler" podUID=dd780355-f908-4bf2-9193-68dabbbca69e podNamespace="default" podName="dp-test-5986bbcc8f-gcrmm"


The corresponding code:

func (m *manager) Admit(attrs *lifecycle.PodAdmitAttributes) lifecycle.PodAdmitResult {
    klog.InfoS("Topology Admit Handler", "podUID", attrs.Pod.UID, "podNamespace", attrs.Pod.Namespace, "podName", attrs.Pod.Name)
    pod := attrs.Pod

    return m.scope.Admit(pod)
}


The Admit method first determines, from container.Resources.Limits, which resources need to be allocated (for example NICs); the log reads:

Jun 15 15:18:59 iv-yd6ldax3wgr9cxxjx9iv kubelet[3390574]: I0615 15:18:59.269261 3390574 manager.go:854] "Looking for needed resources" needed=1 resourceName="xxx/eni-ip"


The corresponding code:

// allocateContainerResources attempts to allocate all of required device
// plugin resources for the input container, issues an Allocate rpc request
// for each new device resource requirement, processes their AllocateResponses,
// and updates the cached containerDevices on success.
func (m *ManagerImpl) allocateContainerResources(pod *v1.Pod, container *v1.Container, devicesToReuse map[string]sets.String) error {
    podUID := string(pod.UID)
    contName := container.Name
    allocatedDevicesUpdated := false
    needsUpdateCheckpoint := false
    // Extended resources are not allowed to be overcommitted.
    // Since device plugin advertises extended resources,
    // therefore Requests must be equal to Limits and iterating
    // over the Limits should be sufficient.
    for k, v := range container.Resources.Limits {
        resource := string(k)
        needed := int(v.Value())
        klog.V(3).InfoS("Looking for needed resources", "needed", needed, "resourceName", resource)
        if !m.isDevicePluginResource(resource) {
            continue
        }
        // Updates allocatedDevices to garbage collect any stranded resources
        // before doing the device plugin allocation.
        if !allocatedDevicesUpdated {
            m.UpdateAllocatedDevices()
            allocatedDevicesUpdated = true
        }

        var allocDevices sets.String
        var err error
        contDevicesScheduledKey := fmt.Sprintf("%s_container-%s", resource, contName)
        contDevicesScheduled := pod.Annotations[contDevicesScheduledKey]
        if contDevicesScheduled != "" {
            allocDevices, err = m.devicesToAllocateFromScheduler(podUID, contName, resource, contDevicesScheduled, needed, devicesToReuse[resource])
        } else {
            allocDevices, err = m.devicesToAllocate(podUID, contName, resource, needed, devicesToReuse[resource])
        }
        if err != nil {
            return err
        }
        if allocDevices == nil || len(allocDevices) <= 0 {
            continue
        }

        // content omitted
        m.podDevices.insert(podUID, contName, resource, allocDevicesWithNUMA, resp.ContainerResponses[0])
    }

    if needsUpdateCheckpoint {
        return m.writeCheckpoint()
    }

    return nil
}


Next, devicesToAllocate is called to allocate the required devices; the next log line is:

Jun 15 15:18:59 iv-yd6ldax3wgr9cxxjx9iv kubelet[3390574]: I0615 15:18:59.269321 3390574 manager.go:621] "Need devices to allocate for pod" deviceNumber=1 resourceName="xxx/eni-ip" podUID="dd780355-f908-4bf2-9193-68dabbbca69e" containerName="nginx"


The corresponding code:

// Returns list of device Ids we need to allocate with Allocate rpc call.
// Returns empty list in case we don't need to issue the Allocate rpc call.
func (m *ManagerImpl) devicesToAllocate(podUID, contName, resource string, required int, reusableDevices sets.String) (sets.String, error) {
    m.mutex.Lock()
    defer m.mutex.Unlock()
    needed := required
    // Gets list of devices that have already been allocated.
    // This can happen if a container restarts for example.
    devices := m.podDevices.containerDevices(podUID, contName, resource)
    if devices != nil {
        klog.V(3).InfoS("Found pre-allocated devices for resource on pod", "resourceName", resource, "containerName", contName, "podUID", string(podUID), "devices", devices.List())
        needed = needed - devices.Len()
        // A pod's resource is not expected to change once admitted by the API server,
        // so just fail loudly here. We can revisit this part if this no longer holds.
        if needed != 0 {
            return nil, fmt.Errorf("pod %q container %q changed request for resource %q from %d to %d", string(podUID), contName, resource, devices.Len(), required)
        }
    }

    // We have 3 major flows to handle:
    // 1. kubelet running, normal allocation (needed > 0, container being  [re]created). Steady state and most common case by far and large.
    // 2. kubelet restart. In this scenario every other component of the stack (device plugins, app container, runtime) is still running.
    // 3. node reboot. In this scenario device plugins may not be running yet when we try to allocate devices.
    //    note: if we get this far the runtime is surely running. This is usually enforced at OS level by startup system services dependencies.

    // First we take care of the exceptional flow (scenarios 2 and 3). In both flows, kubelet is reinitializing, and while kubelet is initializing, sources are NOT all ready.
    // Is this a simple kubelet restart (scenario 2)? To distinguish, we use the informations we got for runtime. If we are asked to allocate devices for containers reported
    // running, then it can only be a kubelet restart. On node reboot the runtime and the containers were also shut down. Then, if the container was running, it can only be
    // because it already has access to all the required devices, so we got nothing to do and we can bail out.
    if !m.sourcesReady.AllReady() && m.isContainerAlreadyRunning(podUID, contName) {
        klog.V(3).InfoS("container detected running, nothing to do", "deviceNumber", needed, "resourceName", resource, "podUID", string(podUID), "containerName", contName)
        return nil, nil
    }

    // We dealt with scenario 2. If we got this far it's either scenario 3 (node reboot) or scenario 1 (steady state, normal flow).
    klog.V(3).InfoS("Need devices to allocate for pod", "deviceNumber", needed, "resourceName", resource, "podUID", string(podUID), "containerName", contName)
    healthyDevices, hasRegistered := m.healthyDevices[resource]
    // content omitted

    // Filters available Devices based on NUMA affinity.
    aligned, unaligned, noAffinity := m.filterByAffinity(podUID, contName, resource, available)

    // If we can allocate all remaining devices from the set of aligned ones, then
    // give the plugin the chance to influence which ones to allocate from that set.
    if needed < aligned.Len() {
        // First allocate from the preferred devices list (if available).
        preferred, err := m.callGetPreferredAllocationIfAvailable(podUID, contName, resource, aligned.Union(allocated), allocated, required)
        if err != nil {
            return nil, err
        }
        if allocateRemainingFrom(preferred.Intersection(aligned)) {
            return allocated, nil
        }
        // Then fallback to allocate from the aligned set if no preferred list
        // is returned (or not enough devices are returned in that list).
        if allocateRemainingFrom(aligned) {
            return allocated, nil
        }

        return nil, fmt.Errorf("unexpectedly allocated less resources than required. Requested: %d, Got: %d", required, required-needed)
    }

    // If we can't allocate all remaining devices from the set of aligned ones,
    // then start by first allocating all of the  aligned devices (to ensure
    // that the alignment guaranteed by the TopologyManager is honored).
    if allocateRemainingFrom(aligned) {
        return allocated, nil
    }

    // Then give the plugin the chance to influence the decision on any
    // remaining devices to allocate.
    preferred, err := m.callGetPreferredAllocationIfAvailable(podUID, contName, resource, available.Union(allocated), allocated, required)
    if err != nil {
        return nil, err
    }
    if allocateRemainingFrom(preferred.Intersection(available)) {
        return allocated, nil
    }
    // Finally, if the plugin did not return a preferred allocation (or didn't
    // return a large enough one), then fall back to allocating the remaining
    // devices from the 'unaligned' and 'noAffinity' sets.
    if allocateRemainingFrom(unaligned) {
        return allocated, nil
    }
    if allocateRemainingFrom(noAffinity) {
        return allocated, nil
    }

    return nil, fmt.Errorf("unexpectedly allocated less resources than required. Requested: %d, Got: %d", required, required-needed)
}


If the device plugin has not enabled GetPreferredAllocation, the following is logged:

Jun 15 15:18:59 iv-yd6ldax3wgr9cxxjx9iv kubelet[3390574]: I0615 15:18:59.269375 3390574 manager.go:1041] "Plugin options indicate to skip GetPreferredAllocation for resource" resourceName="xxx/eni-ip"


The corresponding code:

// callGetPreferredAllocationIfAvailable issues GetPreferredAllocation grpc
// call for device plugin resource with GetPreferredAllocationAvailable option set.
func (m *ManagerImpl) callGetPreferredAllocationIfAvailable(podUID, contName, resource string, available, mustInclude sets.String, size int) (sets.String, error) {
    eI, ok := m.endpoints[resource]
    if !ok {
        return nil, fmt.Errorf("endpoint not found in cache for a registered resource: %s", resource)
    }

    if eI.opts == nil || !eI.opts.GetPreferredAllocationAvailable {
        klog.V(4).InfoS("Plugin options indicate to skip GetPreferredAllocation for resource", "resourceName", resource)
        return nil, nil
    }

    m.mutex.Unlock()
    klog.V(4).InfoS("Issuing a GetPreferredAllocation call for container", "containerName", contName, "podUID", string(podUID))
    resp, err := eI.e.getPreferredAllocation(available.UnsortedList(), mustInclude.UnsortedList(), size)
    m.mutex.Lock()
    if err != nil {
        return nil, fmt.Errorf("device plugin GetPreferredAllocation rpc failed with err: %v", err)
    }
    if resp != nil && len(resp.ContainerResponses) > 0 {
        return sets.NewString(resp.ContainerResponses[0].DeviceIDs...), nil
    }
    return sets.NewString(), nil
}


Next comes the actual device allocation, which logs:

Jun 15 15:18:59 iv-yd6ldax3wgr9cxxjx9iv kubelet[3390574]: I0615 15:18:59.269402 3390574 manager.go:909] "Making allocation request for device plugin" devices=[eni-ip-121] resourceName="xxx/eni-ip"


The corresponding code is in the allocateContainerResources method:

func (m *ManagerImpl) allocateContainerResources(pod *v1.Pod, container *v1.Container, devicesToReuse map[string]sets.String) error {
    // content omitted
    devs := allocDevices.UnsortedList()
    // TODO: refactor this part of code to just append a ContainerAllocationRequest
    // in a passed in AllocateRequest pointer, and issues a single Allocate call per pod.
    klog.V(3).InfoS("Making allocation request for device plugin", "devices", devs, "resourceName", resource)
    resp, err := eI.e.allocate(devs)
    metrics.DevicePluginAllocationDuration.WithLabelValues(resource).Observe(metrics.SinceInSeconds(startRPCTime))
    if err != nil {
        // In case of allocation failure, we want to restore m.allocatedDevices
        // to the actual allocated state from m.podDevices.
        m.mutex.Lock()
        m.allocatedDevices = m.podDevices.devices()
        m.mutex.Unlock()
        return err
    }

    if len(resp.ContainerResponses) == 0 {
        return fmt.Errorf("no containers return in allocation response %v", resp)
    }
    // content omitted
}


⚠️ Note: the metric device_plugin_alloc_duration_seconds shows how long device allocation took (unfortunately it only carries the resource name, with no Pod or container information).

Here an RPC is issued to allocate the required devices:

func (c *devicePluginClient) Allocate(ctx context.Context, in *AllocateRequest, opts ...grpc.CallOption) (*AllocateResponse, error) {
    out := new(AllocateResponse)
    err := c.cc.Invoke(ctx, "/v1beta1.DevicePlugin/Allocate", in, out, opts...)
    if err != nil {
        return nil, err
    }
    return out, nil
}


With resource allocation done, the Pod is handed to its worker via dispatchWork:

kl.dispatchWork(pod, kubetypes.SyncPodCreate, mirrorPod, start)


At this point, you can see a log line like the following:

Jun 15 15:18:59 iv-yd6ldax3wgr9cxxjx9iv kubelet[3390574]: I0615 15:18:59.273776 3390574 pod_workers.go:585] "Pod is being synced for the first time" pod="default/dp-test-5986bbcc8f-gcrmm" podUID=dd780355-f908-4bf2-9193-68dabbbca69e


The corresponding code:

func (p *podWorkers) UpdatePod(options UpdatePodOptions) {
    // content omitted
    // decide what to do with this pod - we are either setting it up, tearing it down, or ignoring it
    now := time.Now()
    status, ok := p.podSyncStatuses[uid]
    if !ok {
        klog.V(4).InfoS("Pod is being synced for the first time", "pod", klog.KObj(pod), "podUID", pod.UID)
        status = &podSyncStatus{
            syncedAt: now,
            fullname: kubecontainer.GetPodFullName(pod),
        }
        // if this pod is being synced for the first time, we need to make sure it is an active pod
        if !isRuntimePod && (pod.Status.Phase == v1.PodFailed || pod.Status.Phase == v1.PodSucceeded) {
            // check to see if the pod is not running and the pod is terminal.
            // If this succeeds then record in the podWorker that it is terminated.
            if statusCache, err := p.podCache.Get(pod.UID); err == nil {
                if isPodStatusCacheTerminal(statusCache) {
                    status = &podSyncStatus{
                        terminatedAt:       now,
                        terminatingAt:      now,
                        syncedAt:           now,
                        startedTerminating: true,
                        finished:           true,
                        fullname:           kubecontainer.GetPodFullName(pod),
                    }
                }
            }
        }
        p.podSyncStatuses[uid] = status
    }
    // content omitted
}


For a newly created Pod, a goroutine (the Pod worker) is started to handle it. Its main logic is managePodLoop, which receives requests through outCh:

// Creating a new pod worker either means this is a new pod, or that the
// kubelet just restarted. In either case the kubelet is willing to believe
// the status of the pod for the first pod worker sync. See corresponding
// comment in syncPod.
go func() {
    defer runtime.HandleCrash()
    p.managePodLoop(outCh)
}()


When a request arrives on outCh, the following is logged:

Jun 15 15:18:59 iv-yd6ldax3wgr9cxxjx9iv kubelet[3390574]: I0615 15:18:59.273812 3390574 pod_workers.go:902] "Processing pod event" pod="default/dp-test-5986bbcc8f-gcrmm" podUID=dd780355-f908-4bf2-9193-68dabbbca69e updateType=0


The code:

func (p *podWorkers) managePodLoop(podUpdates <-chan podWork) {
    // content omitted
    klog.V(4).InfoS("Processing pod event", "pod", klog.KObj(pod), "podUID", pod.UID, "updateType", update.WorkType)
    var isTerminal bool
    err := func() error {
        // The worker is responsible for ensuring the sync method sees the appropriate
        // status updates on resyncs (the result of the last sync), transitions to
        // terminating (no wait), or on terminated (whatever the most recent state is).
        // Only syncing and terminating can generate pod status changes, while terminated
        // pods ensure the most recent status makes it to the api server.
        var status *kubecontainer.PodStatus
        var err error
        switch {
        case update.Options.RunningPod != nil:
            // when we receive a running pod, we don't need status at all
        default:
            // wait until we see the next refresh from the PLEG via the cache (max 2s)
            // TODO: this adds ~1s of latency on all transitions from sync to terminating
            //  to terminated, and on all termination retries (including evictions). We should
            //  improve latency by making the pleg continuous and by allowing pod status
            //  changes to be refreshed when key events happen (killPod, sync->terminating).
            //  Improving this latency also reduces the possibility that a terminated
            //  container's status is garbage collected before we have a chance to update the
            //  API server (thus losing the exit code).
            status, err = p.podCache.GetNewerThan(pod.UID, lastSyncTime)
        }
        if err != nil {
            // This is the legacy event thrown by manage pod loop all other events are now dispatched
            // from syncPodFn
            p.recorder.Eventf(pod, v1.EventTypeWarning, events.FailedSync, "error determining status: %v", err)
            return err
        }

        ctx := p.contextForWorker(pod.UID)

        // Take the appropriate action (illegal phases are prevented by UpdatePod)
        switch {
        case update.WorkType == TerminatedPodWork:
            err = p.syncTerminatedPodFn(ctx, pod, status)

        case update.WorkType == TerminatingPodWork:
            var gracePeriod *int64
            if opt := update.Options.KillPodOptions; opt != nil {
                gracePeriod = opt.PodTerminationGracePeriodSecondsOverride
            }
            podStatusFn := p.acknowledgeTerminating(pod)

            err = p.syncTerminatingPodFn(ctx, pod, status, update.Options.RunningPod, gracePeriod, podStatusFn)

        default:
            isTerminal, err = p.syncPodFn(ctx, update.Options.UpdateType, pod, update.Options.MirrorPod, status)
        }

        lastSyncTime = time.Now()
        return err
    }()
    // content omitted
}


Everything except the following two work types is handled by syncPodFn:

  • TerminatedPodWork

  • TerminatingPodWork

In syncPodFn, the first thing logged is:

Jun 15 15:18:59 iv-yd6ldax3wgr9cxxjx9iv kubelet[3390574]: I0615 15:18:59.273831 3390574 kubelet.go:1620] "syncPod enter" pod="default/dp-test-5986bbcc8f-gcrmm" podUID=dd780355-f908-4bf2-9193-68dabbbca69e


The corresponding code is below (syncPod is the main logic for syncing a Pod's state; the workflow is described in the comments):

// syncPod is the transaction script for the sync of a single pod (setting up)
// a pod. This method is reentrant and expected to converge a pod towards the
// desired state of the spec. The reverse (teardown) is handled in
// syncTerminatingPod and syncTerminatedPod. If syncPod exits without error,
// then the pod runtime state is in sync with the desired configuration state
// (pod is running). If syncPod exits with a transient error, the next
// invocation of syncPod is expected to make progress towards reaching the
// runtime state. syncPod exits with isTerminal when the pod was detected to
// have reached a terminal lifecycle phase due to container exits (for
// RestartNever or RestartOnFailure) and the next method invoked will by
// syncTerminatingPod.
//
// Arguments:
//
// updateType - whether this is a create (first time) or an update, should
//
//  only be used for metrics since this method must be reentrant
//
// pod - the pod that is being set up
// mirrorPod - the mirror pod known to the kubelet for this pod, if any
// podStatus - the most recent pod status observed for this pod which can
//
//  be used to determine the set of actions that should be taken during
//  this loop of syncPod
//
// The workflow is:
//   - If the pod is being created, record pod worker start latency
//   - Call generateAPIPodStatus to prepare an v1.PodStatus for the pod
//   - If the pod is being seen as running for the first time, record pod
//     start latency
//   - Update the status of the pod in the status manager
//   - Stop the pod's containers if it should not be running due to soft
//     admission
//   - Ensure any background tracking for a runnable pod is started
//   - Create a mirror pod if the pod is a static pod, and does not
//     already have a mirror pod
//   - Create the data directories for the pod if they do not exist
//   - Wait for volumes to attach/mount
//   - Fetch the pull secrets for the pod
//   - Call the container runtime's SyncPod callback
//   - Update the traffic shaping for the pod's ingress and egress limits
//
// If any step of this workflow errors, the error is returned, and is repeated
// on the next syncPod call.
//
// This operation writes all events that are dispatched in order to provide
// the most accurate information possible about an error situation to aid debugging.
// Callers should not write an event if this operation returns an error.
func (kl *Kubelet) syncPod(_ context.Context, updateType kubetypes.SyncPodType, pod, mirrorPod *v1.Pod, podStatus *kubecontainer.PodStatus) (isTerminal bool, err error) {
    // TODO(#113606): connect this with the incoming context parameter, which comes from the pod worker.
    // Currently, using that context causes test failures.
    ctx := context.TODO()
    klog.V(4).InfoS("syncPod enter", "pod", klog.KObj(pod), "podUID", pod.UID)
    defer func() {
        klog.V(4).InfoS("syncPod exit", "pod", klog.KObj(pod), "podUID", pod.UID, "isTerminal", isTerminal)
    }()

    // Latency measurements for the main workflow are relative to the
    // first time the pod was seen by kubelet.
    var firstSeenTime time.Time
    if firstSeenTimeStr, ok := pod.Annotations[kubetypes.ConfigFirstSeenAnnotationKey]; ok {
        firstSeenTime = kubetypes.ConvertToTimestamp(firstSeenTimeStr).Get()
    }

    // Record pod worker start latency if being created
    // TODO: make pod workers record their own latencies
    if updateType == kubetypes.SyncPodCreate {
        if !firstSeenTime.IsZero() {
            // This is the first time we are syncing the pod. Record the latency
            // since kubelet first saw the pod if firstSeenTime is set.
            metrics.PodWorkerStartDuration.Observe(metrics.SinceInSeconds(firstSeenTime))
        } else {
            klog.V(3).InfoS("First seen time not recorded for pod",
                "podUID", pod.UID,
                "pod", klog.KObj(pod))
        }
    }

    // content omitted
}


The metric pod_worker_start_duration_seconds records the time from when the kubelet first saw this Pod to its first syncPod.

In the next log line, the kubelet generates the Pod's status:

Jun 15 15:18:59 iv-yd6ldax3wgr9cxxjx9iv kubelet[3390574]: I0615 15:18:59.273841 3390574 kubelet_pods.go:1461] "Generating pod status" pod="default/dp-test-5986bbcc8f-gcrmm"


The corresponding code:

// generateAPIPodStatus creates the final API pod status for a pod, given the
// internal pod status. This method should only be called from within sync*Pod methods.
func (kl *Kubelet) generateAPIPodStatus(pod *v1.Pod, podStatus *kubecontainer.PodStatus) v1.PodStatus {
    klog.V(3).InfoS("Generating pod status", "pod", klog.KObj(pod))
    // content omitted
}


For a newly created Pod the initial phase is Pending, which matches this log line:

Jun 15 15:18:59 iv-yd6ldax3wgr9cxxjx9iv kubelet[3390574]: I0615 15:18:59.273867 3390574 kubelet_pods.go:1426] "Pod waiting > 0, pending"


Computing the Pod's phase

A Pod's phase is computed from the states of all of its containers:

// getPhase returns the phase of a pod given its container info.
func getPhase(spec *v1.PodSpec, info []v1.ContainerStatus) v1.PodPhase {
    pendingInitialization := 0
    failedInitialization := 0
    for _, container := range spec.InitContainers {
        containerStatus, ok := podutil.GetContainerStatus(info, container.Name)
        if !ok {
            pendingInitialization++
            continue
        }

        switch {
        case containerStatus.State.Running != nil:
            pendingInitialization++
        case containerStatus.State.Terminated != nil:
            if containerStatus.State.Terminated.ExitCode != 0 {
                failedInitialization++
            }
        case containerStatus.State.Waiting != nil:
            if containerStatus.LastTerminationState.Terminated != nil {
                if containerStatus.LastTerminationState.Terminated.ExitCode != 0 {
                    failedInitialization++
                }
            } else {
                pendingInitialization++
            }
        default:
            pendingInitialization++
        }
    }

    unknown := 0
    running := 0
    waiting := 0
    stopped := 0
    succeeded := 0
    for _, container := range spec.Containers {
        containerStatus, ok := podutil.GetContainerStatus(info, container.Name)
        if !ok {
            unknown++
            continue
        }

        switch {
        case containerStatus.State.Running != nil:
            running++
        case containerStatus.State.Terminated != nil:
            stopped++
            if containerStatus.State.Terminated.ExitCode == 0 {
                succeeded++
            }
        case containerStatus.State.Waiting != nil:
            if containerStatus.LastTerminationState.Terminated != nil {
                stopped++
            } else {
                waiting++
            }
        default:
            unknown++
        }
    }

    if failedInitialization > 0 && spec.RestartPolicy == v1.RestartPolicyNever {
        return v1.PodFailed
    }

    switch {
    case pendingInitialization > 0:
        fallthrough
    case waiting > 0:
        klog.V(5).InfoS("Pod waiting > 0, pending")
        // One or more containers has not been started
        return v1.PodPending
    case running > 0 && unknown == 0:
        // All containers have been started, and at least
        // one container is running
        return v1.PodRunning
    case running == 0 && stopped > 0 && unknown == 0:
        // All containers are terminated
        if spec.RestartPolicy == v1.RestartPolicyAlways {
            // All containers are in the process of restarting
            return v1.PodRunning
        }
        if stopped == succeeded {
            // RestartPolicy is not Always, and all
            // containers are terminated in success
            return v1.PodSucceeded
        }
        if spec.RestartPolicy == v1.RestartPolicyNever {
            // RestartPolicy is Never, and all containers are
            // terminated with at least one in failure
            return v1.PodFailed
        }
        // RestartPolicy is OnFailure, and at least one in failure
        // and in the process of restarting
        return v1.PodRunning
    default:
        klog.V(5).InfoS("Pod default case, pending")
        return v1.PodPending
    }
}


  • failedInitialization > 0 and RestartPolicy is Never: the Pod is Failed

  • there are pending init containers or waiting containers: the Pod is Pending (this is the case for the example Pod above, whose nginx container is waiting with reason ContainerCreating; see the toy check after this list)

  • at least one running container and no containers in unknown state: the Pod is Running

  • no running containers, at least one stopped container, and no containers in unknown state:

      • if RestartPolicy is Always: the Pod is Running

      • if the number of stopped containers equals the number of succeeded containers: the Pod is Succeeded

      • if RestartPolicy is Never: the Pod is Failed

      • otherwise (RestartPolicy is OnFailure, with at least one failure and a restart in progress): the Pod is Running

  • in all other cases: the Pod is Pending
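
A toy, self-contained version of just that waiting-container rule (not kubelet's actual getPhase, which is unexported), applied to the nginx container state seen in the logs above:

package main

import (
    "fmt"

    v1 "k8s.io/api/core/v1"
)

func main() {
    // The example Pod's only container is still Waiting with reason ContainerCreating.
    statuses := []v1.ContainerStatus{{
        Name:  "nginx",
        State: v1.ContainerState{Waiting: &v1.ContainerStateWaiting{Reason: "ContainerCreating"}},
    }}

    // Count waiting containers, mirroring the "waiting > 0" branch of the decision table.
    waiting := 0
    for _, s := range statuses {
        if s.State.Waiting != nil && s.LastTerminationState.Terminated == nil {
            waiting++
        }
    }

    phase := v1.PodRunning
    if waiting > 0 {
        phase = v1.PodPending
    }
    fmt.Println(phase) // Pending
}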

The metric pod_start_duration_seconds records the time from when the kubelet first saw the Pod to when the Pod is running.

Once the phase has been computed, the following is logged:

Jun 15 15:18:59 iv-yd6ldax3wgr9cxxjx9iv kubelet[3390574]: I0615 15:18:59.273874 3390574 kubelet_pods.go:1471] "Got phase for pod" pod="default/dp-test-5986bbcc8f-gcrmm" oldPhase=Pending phase=Pending


With the Pod status computed, the next step is to update it. The corresponding code:

func (kl *Kubelet) syncPod(_ context.Context, updateType kubetypes.SyncPodType, pod, mirrorPod *v1.Pod, podStatus *kubecontainer.PodStatus) (isTerminal bool, err error) {
    // other content omitted
    kl.statusManager.SetPodStatus(pod, apiPodStatus)
    // other content omitted
}


SetPodStatus calls updateStatusInternal to update the Pod status:

func (m *manager) SetPodStatus(pod *v1.Pod, status v1.PodStatus) {
    m.podStatusesLock.Lock()
    defer m.podStatusesLock.Unlock()

    // Make sure we're caching a deep copy.
    status = *status.DeepCopy()

    // Force a status update if deletion timestamp is set. This is necessary
    // because if the pod is in the non-running state, the pod worker still
    // needs to be able to trigger an update and/or deletion.
    m.updateStatusInternal(pod, status, pod.DeletionTimestamp != nil)
}


Besides updating the Pod status in the local cache, updateStatusInternal also queues a request to be sent to the API server. In updateStatusInternal, the following is logged:

Jun 15 15:18:59 iv-yd6ldax3wgr9cxxjx9iv kubelet[3390574]: I0615 15:18:59.273910 3390574 status_manager.go:552] "updateStatusInternal" version=1 pod="default/dp-test-5986bbcc8f-gcrmm" podUID=dd780355-f908-4bf2-9193-68dabbbca69e containers="(nginx state=waiting previous=<none>)"


The corresponding code:

func (m *manager) updateStatusInternal(pod *v1.Pod, status v1.PodStatus, forceUpdate bool) bool {
    // content omitted
    // Perform some more extensive logging of container termination state to assist in
    // debugging production races (generally not needed).
    if klogV := klog.V(5); klogV.Enabled() {
        var containers []string
        for _, s := range append(append([]v1.ContainerStatus(nil), status.InitContainerStatuses...), status.ContainerStatuses...) {
            var current, previous string
            switch {
            case s.State.Running != nil:
                current = "running"
            case s.State.Waiting != nil:
                current = "waiting"
            case s.State.Terminated != nil:
                current = fmt.Sprintf("terminated=%d", s.State.Terminated.ExitCode)
            default:
                current = "unknown"
            }
            switch {
            case s.LastTerminationState.Running != nil:
                previous = "running"
            case s.LastTerminationState.Waiting != nil:
                previous = "waiting"
            case s.LastTerminationState.Terminated != nil:
                previous = fmt.Sprintf("terminated=%d", s.LastTerminationState.Terminated.ExitCode)
            default:
                previous = "<none>"
            }
            containers = append(containers, fmt.Sprintf("(%s state=%s previous=%s)", s.Name, current, previous))
        }
        sort.Strings(containers)
        klogV.InfoS("updateStatusInternal", "version", cachedStatus.version+1, "pod", klog.KObj(pod), "podUID", pod.UID, "containers", strings.Join(containers, " "))
    }
    // content omitted
}


The request to update the Pod status in the API server is pushed onto a channel, and the following is logged:

Jun 15 15:18:59 iv-yd6ldax3wgr9cxxjx9iv kubelet[3390574]: I0615 15:18:59.273990 3390574 status_manager.go:582] "Status Manager: adding pod with new status to podStatusChannel" pod="default/dp-test-5986bbcc8f-gcrmm" podUID=dd780355-f908-4bf2-9193-68dabbbca69e statusVersion=1 status={Phase:Pending Conditions:[{Type:Initialized Status:True LastProbeTime:0001-01-01 00:00:00 +0000 UTC LastTransitionTime:2024-06-15 15:18:59 +0800 CST Reason: Message:} {Type:Ready Status:False LastProbeTime:0001-01-01 00:00:00 +0000 UTC LastTransitionTime:2024-06-15 15:18:59 +0800 CST Reason:ContainersNotReady Message:containers with unready status: [nginx]} {Type:ContainersReady Status:False LastProbeTime:0001-01-01 00:00:00 +0000 UTC LastTransitionTime:2024-06-15 15:18:59 +0800 CST Reason:ContainersNotReady Message:containers with unready status: [nginx]} {Type:PodScheduled Status:True LastProbeTime:0001-01-01 00:00:00 +0000 UTC LastTransitionTime:2024-06-15 15:18:59 +0800 CST Reason: Message:}] Message: Reason: NominatedNodeName: HostIP:192.168.0.9 PodIP: PodIPs:[] StartTime:2024-06-15 15:18:59 +0800 CST InitContainerStatuses:[] ContainerStatuses:[{Name:nginx State:{Waiting:&ContainerStateWaiting{Reason:ContainerCreating,Message:,} Running:nil Terminated:nil} LastTerminationState:{Waiting:nil Running:nil Terminated:nil} Ready:false RestartCount:0 Image:nginx:latest ImageID: ContainerID: Started:0xc001ed178c}] QOSClass:BestEffort EphemeralContainerStatuses:[]}


The corresponding code:

func (m *manager) updateStatusInternal(pod *v1.Pod, status v1.PodStatus, forceUpdate bool) bool {
    // content omitted
    m.podStatuses[pod.UID] = newStatus

    select {
    case m.podStatusChannel <- podStatusSyncRequest{pod.UID, newStatus}:
        klog.V(5).InfoS("Status Manager: adding pod with new status to podStatusChannel",
            "pod", klog.KObj(pod),
            "podUID", pod.UID,
            "statusVersion", newStatus.version,
            "status", newStatus.status)
        return true
    default:
        // Let the periodic syncBatch handle the update if the channel is full.
        // We can't block, since we hold the mutex lock.
        klog.V(4).InfoS("Skipping the status update for pod for now because the channel is full",
            "pod", klog.KObj(pod),
            "status", status)
        return false
    }
    // content omitted
}


As you can see, the request carries only two pieces of information (the types are sketched below):

  • pod.UID

  • newStatus
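
For orientation, here is a sketch of the two types behind podStatusChannel, trimmed down to the fields the surrounding code actually uses (the exact definitions vary across kubelet versions):

package status

import (
    v1 "k8s.io/api/core/v1"
    "k8s.io/apimachinery/pkg/types"
)

// versionedPodStatus pairs the computed status with a per-pod version counter
// that the status manager uses to tell old and new updates apart.
type versionedPodStatus struct {
    status  v1.PodStatus
    version uint64
    // podName, podNamespace, ... omitted here
}

// podStatusSyncRequest is what gets pushed onto podStatusChannel.
type podStatusSyncRequest struct {
    podUID types.UID
    status versionedPodStatus
}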

Updating the Pod status in the API server

Accordingly, the next log line shows the channel's handler (the Status Manager) receiving this request:

Jun 15 15:18:59 iv-yd6ldax3wgr9cxxjx9iv kubelet[3390574]: I0615 15:18:59.274047 3390574 status_manager.go:186] "Status Manager: syncing pod with status from podStatusChannel" podUID=dd780355-f908-4bf2-9193-68dabbbca69e statusVersion=1 status={Phase:Pending Conditions:[{Type:Initialized Status:True LastProbeTime:0001-01-01 00:00:00 +0000 UTC LastTransitionTime:2024-06-15 15:18:59 +0800 CST Reason: Message:} {Type:Ready Status:False LastProbeTime:0001-01-01 00:00:00 +0000 UTC LastTransitionTime:2024-06-15 15:18:59 +0800 CST Reason:ContainersNotReady Message:containers with unready status: [nginx]} {Type:ContainersReady Status:False LastProbeTime:0001-01-01 00:00:00 +0000 UTC LastTransitionTime:2024-06-15 15:18:59 +0800 CST Reason:ContainersNotReady Message:containers with unready status: [nginx]} {Type:PodScheduled Status:True LastProbeTime:0001-01-01 00:00:00 +0000 UTC LastTransitionTime:2024-06-15 15:18:59 +0800 CST Reason: Message:}] Message: Reason: NominatedNodeName: HostIP:192.168.0.9 PodIP: PodIPs:[] StartTime:2024-06-15 15:18:59 +0800 CST InitContainerStatuses:[] ContainerStatuses:[{Name:nginx State:{Waiting:&ContainerStateWaiting{Reason:ContainerCreating,Message:,} Running:nil Terminated:nil} LastTerminationState:{Waiting:nil Running:nil Terminated:nil} Ready:false RestartCount:0 Image:nginx:latest ImageID: ContainerID: Started:0xc001ed178c}] QOSClass:BestEffort EphemeralContainerStatuses:[]}


The corresponding code:

func (m *manager) Start() {
    // Don't start the status manager if we don't have a client. This will happen
    // on the master, where the kubelet is responsible for bootstrapping the pods
    // of the master components.
    if m.kubeClient == nil {
        klog.InfoS("Kubernetes client is nil, not starting status manager")
        return
    }

    klog.InfoS("Starting to sync pod status with apiserver")

    //nolint:staticcheck // SA1015 Ticker can leak since this is only called once and doesn't handle termination.
    syncTicker := time.NewTicker(syncPeriod).C

    // syncPod and syncBatch share the same go routine to avoid sync races.
    go wait.Forever(func() {
        for {
            select {
            case syncRequest := <-m.podStatusChannel:
                klog.V(5).InfoS("Status Manager: syncing pod with status from podStatusChannel",
                    "podUID", syncRequest.podUID,
                    "statusVersion", syncRequest.status.version,
                    "status", syncRequest.status.status)
                m.syncPod(syncRequest.podUID, syncRequest.status)
            case <-syncTicker:
                klog.V(5).InfoS("Status Manager: syncing batch")
                // remove any entries in the status channel since the batch will handle them
                for i := len(m.podStatusChannel); i > 0; i-- {
                    <-m.podStatusChannel
                }
                m.syncBatch()
            }
        }
    }, 0)
}


The Status Manager is a goroutine that pushes status updates to the API server, mainly through m.syncPod:

Jun 15 15:18:59 iv-yd6ldax3wgr9cxxjx9iv kubelet[3390574]: I0615 15:18:59.285142 3390574 status_manager.go:719] "Patch status for pod" pod="default/dp-test-5986bbcc8f-gcrmm" podUID=dd780355-f908-4bf2-9193-68dabbbca69e patch="{\"metadata\":{\"uid\":\"dd780355-f908-4bf2-9193-68dabbbca69e\"},\"status\":{\"$setElementOrder/conditions\":[{\"type\":\"Initialized\"},{\"type\":\"Ready\"},{\"type\":\"ContainersReady\"},{\"type\":\"PodScheduled\"}],\"conditions\":[{\"lastProbeTime\":null,\"lastTransitionTime\":\"2024-06-15T07:18:59Z\",\"status\":\"True\",\"type\":\"Initialized\"},{\"lastProbeTime\":null,\"lastTransitionTime\":\"2024-06-15T07:18:59Z\",\"message\":\"containers with unready status: [nginx]\",\"reason\":\"ContainersNotReady\",\"status\":\"False\",\"type\":\"Ready\"},{\"lastProbeTime\":null,\"lastTransitionTime\":\"2024-06-15T07:18:59Z\",\"message\":\"containers with unready status: [nginx]\",\"reason\":\"ContainersNotReady\",\"status\":\"False\",\"type\":\"ContainersReady\"}],\"containerStatuses\":[{\"image\":\"nginx:latest\",\"imageID\":\"\",\"lastState\":{},\"name\":\"nginx\",\"ready\":false,\"restartCount\":0,\"started\":false,\"state\":{\"waiting\":{\"reason\":\"ContainerCreating\"}}}],\"hostIP\":\"192.168.0.9\",\"startTime\":\"2024-06-15T07:18:59Z\"}}"


The corresponding code:

// syncPod syncs the given status with the API server. The caller must not hold the lock.
func (m *manager) syncPod(uid types.UID, status versionedPodStatus) {
    // content omitted
    newPod, patchBytes, unchanged, err := statusutil.PatchPodStatus(context.TODO(), m.kubeClient, pod.Namespace, pod.Name, pod.UID, pod.Status, mergedStatus)
    klog.V(3).InfoS("Patch status for pod", "pod", klog.KObj(pod), "podUID", uid, "patch", string(patchBytes))
    // content omitted
}


Here the kubelet uses the Patch API:

updatedPod, err := c.CoreV1().Pods(namespace).Patch(ctx, name, types.StrategicMergePatchType, patchBytes, metav1.PatchOptions{}, "status")
if err != nil {
    return nil, nil, false, fmt.Errorf("failed to patch status %q for pod %q/%q: %v", patchBytes, namespace, name, err)
}
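
The patch bytes themselves come from diffing the old Pod against a copy whose status has been replaced. A minimal sketch of that idea using apimachinery's strategic merge patch helper (this approximates what statusutil.PatchPodStatus does internally; it is not the kubelet's exact code):

package main

import (
    "encoding/json"
    "fmt"

    v1 "k8s.io/api/core/v1"
    "k8s.io/apimachinery/pkg/util/strategicpatch"
)

func main() {
    oldPod := v1.Pod{Status: v1.PodStatus{Phase: v1.PodPending}}
    newPod := oldPod.DeepCopy()
    newPod.Status.Phase = v1.PodRunning
    newPod.Status.Message = "containers started"

    oldData, _ := json.Marshal(oldPod)
    newData, _ := json.Marshal(newPod)

    // Build a strategic merge patch that only carries the changed status fields.
    patch, err := strategicpatch.CreateTwoWayMergePatch(oldData, newData, v1.Pod{})
    if err != nil {
        panic(err)
    }
    // Prints something like: {"status":{"message":"containers started","phase":"Running"}}
    fmt.Println(string(patch))
}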


Once the update succeeds, the following is logged:

Jun 15 15:18:59 iv-yd6ldax3wgr9cxxjx9iv kubelet[3390574]: I0615 15:18:59.285211 3390574 status_manager.go:728] "Status for pod updated successfully" pod="default/dp-test-5986bbcc8f-gcrmm" statusVersion=1 status={Phase:Pending Conditions:[{Type:Initialized Status:True LastProbeTime:0001-01-01 00:00:00 +0000 UTC LastTransitionTime:2024-06-15 15:18:59 +0800 CST Reason: Message:} {Type:Ready Status:False LastProbeTime:0001-01-01 00:00:00 +0000 UTC LastTransitionTime:2024-06-15 15:18:59 +0800 CST Reason:ContainersNotReady Message:containers with unready status: [nginx]} {Type:ContainersReady Status:False LastProbeTime:0001-01-01 00:00:00 +0000 UTC LastTransitionTime:2024-06-15 15:18:59 +0800 CST Reason:ContainersNotReady Message:containers with unready status: [nginx]} {Type:PodScheduled Status:True LastProbeTime:0001-01-01 00:00:00 +0000 UTC LastTransitionTime:2024-06-15 15:18:59 +0800 CST Reason: Message:}] Message: Reason: NominatedNodeName: HostIP:192.168.0.9 PodIP: PodIPs:[] StartTime:2024-06-15 15:18:59 +0800 CST InitContainerStatuses:[] ContainerStatuses:[{Name:nginx State:{Waiting:&ContainerStateWaiting{Reason:ContainerCreating,Message:,} Running:nil Terminated:nil} LastTerminationState:{Waiting:nil Running:nil Terminated:nil} Ready:false RestartCount:0 Image:nginx:latest ImageID: ContainerID: Started:0xc001ed178c}] QOSClass:BestEffort EphemeralContainerStatuses:[]}


The corresponding code:

func (m *manager) syncPod(uid types.UID, status versionedPodStatus) {
    // content omitted
    if unchanged {
        klog.V(3).InfoS("Status for pod is up-to-date", "pod", klog.KObj(pod), "statusVersion", status.version)
    } else {
        klog.V(3).InfoS("Status for pod updated successfully", "pod", klog.KObj(pod), "statusVersion", status.version, "status", mergedStatus)
        pod = newPod
        // We pass a new object (result of API call which contains updated ResourceVersion)
        m.podStartupLatencyHelper.RecordStatusUpdated(pod)
    }
    // content omitted
}


Creating the cgroup

Next, the kubelet needs to create a cgroup for the Pod and apply its resource configuration:


func (kl *Kubelet) syncPod(_ context.Context, updateType kubetypes.SyncPodType, pod, mirrorPod *v1.Pod, podStatus *kubecontainer.PodStatus) (isTerminal bool, err error) {
    // content omitted
    // Create Cgroups for the pod and apply resource parameters
    // to them if cgroups-per-qos flag is enabled.
    pcm := kl.containerManager.NewPodContainerManager()
    // If pod has already been terminated then we need not create
    // or update the pod's cgroup
    // TODO: once context cancellation is added this check can be removed
    if !kl.podWorkers.IsPodTerminationRequested(pod.UID) {
        // When the kubelet is restarted with the cgroups-per-qos
        // flag enabled, all the pod's running containers
        // should be killed intermittently and brought back up
        // under the qos cgroup hierarchy.
        // Check if this is the pod's first sync
        firstSync := true
        for _, containerStatus := range apiPodStatus.ContainerStatuses {
            if containerStatus.State.Running != nil {
                firstSync = false
                break
            }
        }
        // Don't kill containers in pod if pod's cgroups already
        // exists or the pod is running for the first time
        podKilled := false
        if !pcm.Exists(pod) && !firstSync {
            p := kubecontainer.ConvertPodStatusToRunningPod(kl.getRuntime().Type(), podStatus)
            if err := kl.killPod(ctx, pod, p, nil); err == nil {
                podKilled = true
            } else {
                klog.ErrorS(err, "KillPod failed", "pod", klog.KObj(pod), "podStatus", podStatus)
            }
        }
        // Create and Update pod's Cgroups
        // Don't create cgroups for run once pod if it was killed above
        // The current policy is not to restart the run once pods when
        // the kubelet is restarted with the new flag as run once pods are
        // expected to run only once and if the kubelet is restarted then
        // they are not expected to run again.
        // We don't create and apply updates to cgroup if its a run once pod and was killed above
        if !(podKilled && pod.Spec.RestartPolicy == v1.RestartPolicyNever) {
            if !pcm.Exists(pod) {
                if err := kl.containerManager.UpdateQOSCgroups(); err != nil {
                    klog.V(2).InfoS("Failed to update QoS cgroups while syncing pod", "pod", klog.KObj(pod), "err", err)
                }
                if err := pcm.EnsureExists(pod); err != nil {
                    kl.recorder.Eventf(pod, v1.EventTypeWarning, events.FailedToCreatePodContainer, "unable to ensure pod container exists: %v", err)
                    return false, fmt.Errorf("failed to ensure that the pod: %v cgroups exist and are correctly applied: %v", pod.UID, err)
                }
            }
        }
    }
    // content omitted
}


UpdateQOSCgroups updates the resource configuration, and if the Pod's cgroup does not exist yet, pcm.EnsureExists creates it:

// EnsureExists takes a pod as argument and makes sure that
// pod cgroup exists if qos cgroup hierarchy flag is enabled.
// If the pod level container doesn't already exist it is created.
func (m *podContainerManagerImpl) EnsureExists(pod *v1.Pod) error {
    podContainerName, _ := m.GetPodContainerName(pod)
    // check if container already exist
    alreadyExists := m.Exists(pod)
    if !alreadyExists {
        enforceMemoryQoS := false
        if utilfeature.DefaultFeatureGate.Enabled(kubefeatures.MemoryQoS) &&
            libcontainercgroups.IsCgroup2UnifiedMode() {
            enforceMemoryQoS = true
        }
        // Create the pod container
        containerConfig := &CgroupConfig{
            Name:               podContainerName,
            ResourceParameters: ResourceConfigForPod(pod, m.enforceCPULimits, m.cpuCFSQuotaPeriod, enforceMemoryQoS),
        }
        if m.podPidsLimit > 0 {
            containerConfig.ResourceParameters.PidsLimit = &m.podPidsLimit
        }
        if enforceMemoryQoS {
            klog.V(4).InfoS("MemoryQoS config for pod", "pod", klog.KObj(pod), "unified", containerConfig.ResourceParameters.Unified)
        }
        if pod.Annotations[annotationHostResourceView] == "false" {
            hostResourceView := false
            containerConfig.ResourceParameters.HostResourceView = &hostResourceView
        }
        if err := m.cgroupManager.Create(containerConfig); err != nil {
            return fmt.Errorf("failed to create container for %v : %v", podContainerName, err)
        }
    }
    return nil
}


The metric cgroup_manager_duration_seconds records how long creating the cgroup takes (other operation types are distinguished by a label).

Updating the cgroup is done with runc's cgroup manager package:

// Update updates the cgroup with the specified Cgroup Configuration
func (m *cgroupManagerImpl) Update(cgroupConfig *CgroupConfig) error {
    start := time.Now()
    defer func() {
        metrics.CgroupManagerDuration.WithLabelValues("update").Observe(metrics.SinceInSeconds(start))
    }()

    libcontainerCgroupConfig := m.libctCgroupConfig(cgroupConfig, true)
    manager, err := manager.New(libcontainerCgroupConfig)
    if err != nil {
        return fmt.Errorf("failed to create cgroup manager: %v", err)
    }
    return manager.Set(libcontainerCgroupConfig.Resources)
}


Creating the Pod directories

Next, the kubelet creates the data directories for the Pod:

// makePodDataDirs creates the dirs for the pod datas.
func (kl *Kubelet) makePodDataDirs(pod *v1.Pod) error {
    uid := pod.UID
    if err := os.MkdirAll(kl.getPodDir(uid), 0750); err != nil && !os.IsExist(err) {
        return err
    }
    if err := os.MkdirAll(kl.getPodVolumesDir(uid), 0750); err != nil && !os.IsExist(err) {
        return err
    }
    if err := os.MkdirAll(kl.getPodPluginsDir(uid), 0750); err != nil && !os.IsExist(err) {
        return err
    }
    return nil
}


This covers the following three directories (see the path sketch after this list):

  • the Pod root directory

  • the volumes root directory

  • the plugins root directory
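
With the default kubelet root directory (an assumption here; it is configurable via --root-dir), these resolve to paths like the following, shown as a small Go sketch using the UID from the example Pod:

package main

import (
    "fmt"
    "path/filepath"
)

func main() {
    root := "/var/lib/kubelet"
    uid := "dd780355-f908-4bf2-9193-68dabbbca69e"

    fmt.Println(filepath.Join(root, "pods", uid))            // Pod root directory
    fmt.Println(filepath.Join(root, "pods", uid, "volumes")) // volumes root directory
    fmt.Println(filepath.Join(root, "pods", uid, "plugins")) // plugins root directory
}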

Mounting the volumes

Next, the kubelet waits until the volumes required by the Pod have been mounted (it does not wait for Pods that are terminating):

Jun 15 15:19:06 iv-yd6ldax3wgr9cxxjx9iv kubelet[3390574]: I0615 15:19:06.599897 3390574 volume_manager.go:408] "Waiting for volumes to attach and mount for pod" pod="default/dp-test-5986bbcc8f-gcrmm"


The corresponding code:


func (kl *Kubelet) syncPod(_ context.Context, updateType kubetypes.SyncPodType, pod, mirrorPod *v1.Pod, podStatus *kubecontainer.PodStatus) (isTerminal bool, err error) {
    // content omitted
    // Volume manager will not mount volumes for terminating pods
    // TODO: once context cancellation is added this check can be removed
    if !kl.podWorkers.IsPodTerminationRequested(pod.UID) {
        // Wait for volumes to attach/mount
        if err := kl.volumeManager.WaitForAttachAndMount(pod); err != nil {
            kl.recorder.Eventf(pod, v1.EventTypeWarning, events.FailedMountVolume, "Unable to attach or mount volumes: %v", err)
            klog.ErrorS(err, "Unable to attach or mount volumes for pod; skipping pod", "pod", klog.KObj(pod))
            return false, err
        }
    }
    // content omitted
}


The kubelet tracks a Pod's volume state through desiredStateOfWorld and actualStateOfWorld. Here, WaitForAttachAndMount relies solely on these two caches to confirm that every volume has been attached and mounted:

The handling differs per volume type, and third-party volumes also involve a CSI plugin; that will be analyzed separately in a later article.

func (vm *volumeManager) WaitForAttachAndMount(pod *v1.Pod) error {
    // content omitted
    err := wait.PollImmediate(
        podAttachAndMountRetryInterval,
        podAttachAndMountTimeout,
        vm.verifyVolumesMountedFunc(uniquePodName, expectedVolumes))

    if err != nil {
        unmountedVolumes :=
            vm.getUnmountedVolumes(uniquePodName, expectedVolumes)
        // Also get unattached volumes for error message
        unattachedVolumes :=
            vm.getUnattachedVolumes(expectedVolumes)

        if len(unmountedVolumes) == 0 {
            return nil
        }

        return fmt.Errorf(
            "unmounted volumes=%v, unattached volumes=%v: %s",
            unmountedVolumes,
            unattachedVolumes,
            err)
    }
    // content omitted
}


If this step fails, the error shows up in the Pod's events; it is one we run into frequently:

Unable to attach or mount volumes xxx


The unmounted volumes and unattached volumes values here are derived from desiredStateOfWorld and actualStateOfWorld.

Once all volumes have been mounted successfully, the following is logged:

Jun 15 15:19:06 iv-yd6ldax3wgr9cxxjx9iv kubelet[3390574]: I0615 15:19:06.599946 3390574 volume_manager.go:439] "All volumes are attached and mounted for pod" pod="default/dp-test-5986bbcc8f-gcrmm"


Fetching the imagePullSecrets

Next, the kubelet fetches the imagePullSecrets from the API server:

func (kl *Kubelet) syncPod(_ context.Context, updateType kubetypes.SyncPodType, pod, mirrorPod *v1.Pod, podStatus *kubecontainer.PodStatus) (isTerminal bool, err error) {
    // content omitted
    // Fetch the pull secrets for the pod
    pullSecrets := kl.getPullSecretsForPod(pod)
    // content omitted
}


The imagePullSecrets are passed to the container runtime's SyncPod method, which starts the containers:

func (kl *Kubelet) syncPod(_ context.Context, updateType kubetypes.SyncPodType, pod, mirrorPod *v1.Pod, podStatus *kubecontainer.PodStatus) (isTerminal bool, err error) {
    // content omitted
    // Ensure the pod is being probed
    kl.probeManager.AddPod(pod)

    // Call the container runtime's SyncPod callback
    result := kl.containerRuntime.SyncPod(ctx, pod, podStatus, pullSecrets, kl.backOff)
    kl.reasonCache.Update(pod.UID, result)
    if err := result.Error(); err != nil {
        // Do not return error if the only failures were pods in backoff
        for _, r := range result.SyncResults {
            if r.Error != kubecontainer.ErrCrashLoopBackOff && r.Error != images.ErrImagePullBackOff {
                // Do not record an event here, as we keep all event logging for sync pod failures
                // local to container runtime, so we get better errors.
                return false, err
            }
        }

        return false, nil
    }
    // content omitted
}


Creating the sandbox and the containers

At this point, the kubelet's preparation work before creating the Pod is done. The SyncPod method then calls the runtime's RPCs over the CRI interface to create the sandbox and the containers:

The details will be analyzed separately in a later article.

// SyncPod syncs the running pod into the desired pod by executing following steps:
//
//  1. Compute sandbox and container changes.
//  2. Kill pod sandbox if necessary.
//  3. Kill any containers that should not be running.
//  4. Create sandbox if necessary.
//  5. Create ephemeral containers.
//  6. Create init containers.
//  7. Create normal containers.
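
To get a concrete feel for what "calling the runtime over CRI" means, here is a minimal sketch that issues the same RunPodSandbox RPC the kubelet's runtime client uses (the socket path and the sandbox metadata are assumptions; the kubelet builds a far richer PodSandboxConfig from the Pod spec, and a real runtime may well reject a config this bare):

package main

import (
    "context"
    "fmt"
    "time"

    "google.golang.org/grpc"
    "google.golang.org/grpc/credentials/insecure"
    runtimeapi "k8s.io/cri-api/pkg/apis/runtime/v1"
)

func main() {
    // Dial the CRI runtime socket; the kubelet uses whatever --container-runtime-endpoint points at.
    conn, err := grpc.Dial("unix:///run/containerd/containerd.sock",
        grpc.WithTransportCredentials(insecure.NewCredentials()))
    if err != nil {
        panic(err)
    }
    defer conn.Close()

    client := runtimeapi.NewRuntimeServiceClient(conn)
    ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
    defer cancel()

    // Step 4 of the list above: create the Pod sandbox.
    resp, err := client.RunPodSandbox(ctx, &runtimeapi.RunPodSandboxRequest{
        Config: &runtimeapi.PodSandboxConfig{
            Metadata: &runtimeapi.PodSandboxMetadata{
                Name:      "demo",
                Namespace: "default",
                Uid:       "demo-uid",
            },
        },
    })
    if err != nil {
        panic(err)
    }
    fmt.Println("sandbox id:", resp.PodSandboxId)
}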

