k8s daemonset controller source code analysis

By 良凱爾, published 2021-12-23

daemonset controller analysis

daemonset controller overview

The daemonset controller is one of the many controllers in the kube-controller-manager component and is the controller for the daemonset resource object. It watches four kinds of resources: daemonset, pod, node, and ControllerRevision. When any of these resources changes, the daemonset controller is triggered to reconcile the corresponding daemonset object. This covers creating daemonset pods on suitable nodes, deleting daemonset pods from unsuitable nodes, rolling updates of the daemonset, updating the daemonset's status, cleaning up old daemonset revisions, and so on.

daemonset controller architecture diagram

The rough composition and processing flow of the daemonset controller are shown in the figure below. The daemonset controller registers event handlers for daemonset, pod, node, and ControllerRevision objects; when an event occurs it is watched and the corresponding daemonset object is put into the queue. The syncDaemonSet method contains the core logic by which the daemonset controller reconciles daemonset objects: it takes a daemonset object out of the queue and reconciles it.

daemonset update strategies

(1) OnDelete: with the OnDelete update strategy, after the DaemonSet pod template is updated, new DaemonSet pods are only created after you manually delete the old DaemonSet pods.
(2) RollingUpdate: the default update strategy. With the RollingUpdate update strategy, after the DaemonSet pod template is updated, old DaemonSet pods are deleted and new DaemonSet pods are created automatically according to the rolling update configuration. During a rolling update, at most one DaemonSet pod runs on each node.
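
For reference, here is a minimal sketch of how the two strategies are expressed with the apps/v1 Go types (matching the v1.17 API, where RollingUpdateDaemonSet only has a MaxUnavailable field); the values chosen are purely illustrative.

// Sketch: the two DaemonSet update strategies expressed with apps/v1 Go types.
package main

import (
	"fmt"

	appsv1 "k8s.io/api/apps/v1"
	"k8s.io/apimachinery/pkg/util/intstr"
)

func main() {
	// RollingUpdate: old pods are removed and replaced automatically,
	// with at most maxUnavailable daemon pods unavailable at a time.
	maxUnavailable := intstr.FromInt(1)
	rolling := appsv1.DaemonSetUpdateStrategy{
		Type: appsv1.RollingUpdateDaemonSetStrategyType,
		RollingUpdate: &appsv1.RollingUpdateDaemonSet{
			MaxUnavailable: &maxUnavailable,
		},
	}

	// OnDelete: new pods are only created after the old pods are deleted manually.
	onDelete := appsv1.DaemonSetUpdateStrategy{
		Type: appsv1.OnDeleteDaemonSetStrategyType,
	}

	fmt.Println(rolling.Type, onDelete.Type)
}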

The daemonset controller analysis is split into two parts:
(1) daemonset controller initialization and startup;
(2) daemonset controller core processing logic.

1. daemonset controller initialization and startup analysis

Based on tag v1.17.4

https://github.com/kubernetes/kubernetes/releases/tag/v1.17.4

Let's go straight to the startDaemonSetController function, the entry point for analyzing the initialization and startup of the daemonset controller.

startDaemonSetController

Main logic of startDaemonSetController:
(1) call daemon.NewDaemonSetsController to create and initialize a DaemonSetsController;
(2) start a goroutine to run the DaemonSetsController's Run method.

// cmd/kube-controller-manager/app/apps.go
func startDaemonSetController(ctx ControllerContext) (http.Handler, bool, error) {
	if !ctx.AvailableResources[schema.GroupVersionResource{Group: "apps", Version: "v1", Resource: "daemonsets"}] {
		return nil, false, nil
	}
	dsc, err := daemon.NewDaemonSetsController(
		ctx.InformerFactory.Apps().V1().DaemonSets(),
		ctx.InformerFactory.Apps().V1().ControllerRevisions(),
		ctx.InformerFactory.Core().V1().Pods(),
		ctx.InformerFactory.Core().V1().Nodes(),
		ctx.ClientBuilder.ClientOrDie("daemon-set-controller"),
		flowcontrol.NewBackOff(1*time.Second, 15*time.Minute),
	)
	if err != nil {
		return nil, true, fmt.Errorf("error creating DaemonSets controller: %v", err)
	}
	go dsc.Run(int(ctx.ComponentConfig.DaemonSetController.ConcurrentDaemonSetSyncs), ctx.Stop)
	return nil, true, nil
}

1.1 daemon.NewDaemonSetsController

As the code of daemon.NewDaemonSetsController shows, the daemonset controller registers EventHandlers for daemonset, node, pod, and ControllerRevision objects, i.e. it listens for events on these objects, puts them into the event queue, and processes them. It also assigns dsc.syncDaemonSet to dsc.syncHandler, registering it as the core processing method; dsc.Run later calls this method to reconcile daemonset objects (the core processing method is analyzed in detail below).

// pkg/controller/daemon/daemon_controller.go
func NewDaemonSetsController(
	daemonSetInformer appsinformers.DaemonSetInformer,
	historyInformer appsinformers.ControllerRevisionInformer,
	podInformer coreinformers.PodInformer,
	nodeInformer coreinformers.NodeInformer,
	kubeClient clientset.Interface,
	failedPodsBackoff *flowcontrol.Backoff,
) (*DaemonSetsController, error) {
	eventBroadcaster := record.NewBroadcaster()
	eventBroadcaster.StartLogging(klog.Infof)
	eventBroadcaster.StartRecordingToSink(&v1core.EventSinkImpl{Interface: kubeClient.CoreV1().Events("")})

	if kubeClient != nil && kubeClient.CoreV1().RESTClient().GetRateLimiter() != nil {
		if err := ratelimiter.RegisterMetricAndTrackRateLimiterUsage("daemon_controller", kubeClient.CoreV1().RESTClient().GetRateLimiter()); err != nil {
			return nil, err
		}
	}
	dsc := &DaemonSetsController{
		kubeClient:    kubeClient,
		eventRecorder: eventBroadcaster.NewRecorder(scheme.Scheme, v1.EventSource{Component: "daemonset-controller"}),
		podControl: controller.RealPodControl{
			KubeClient: kubeClient,
			Recorder:   eventBroadcaster.NewRecorder(scheme.Scheme, v1.EventSource{Component: "daemonset-controller"}),
		},
		crControl: controller.RealControllerRevisionControl{
			KubeClient: kubeClient,
		},
		burstReplicas: BurstReplicas,
		expectations:  controller.NewControllerExpectations(),
		queue:         workqueue.NewNamedRateLimitingQueue(workqueue.DefaultControllerRateLimiter(), "daemonset"),
	}

	daemonSetInformer.Informer().AddEventHandler(cache.ResourceEventHandlerFuncs{
		AddFunc: func(obj interface{}) {
			ds := obj.(*apps.DaemonSet)
			klog.V(4).Infof("Adding daemon set %s", ds.Name)
			dsc.enqueueDaemonSet(ds)
		},
		UpdateFunc: func(old, cur interface{}) {
			oldDS := old.(*apps.DaemonSet)
			curDS := cur.(*apps.DaemonSet)
			klog.V(4).Infof("Updating daemon set %s", oldDS.Name)
			dsc.enqueueDaemonSet(curDS)
		},
		DeleteFunc: dsc.deleteDaemonset,
	})
	dsc.dsLister = daemonSetInformer.Lister()
	dsc.dsStoreSynced = daemonSetInformer.Informer().HasSynced

	historyInformer.Informer().AddEventHandler(cache.ResourceEventHandlerFuncs{
		AddFunc:    dsc.addHistory,
		UpdateFunc: dsc.updateHistory,
		DeleteFunc: dsc.deleteHistory,
	})
	dsc.historyLister = historyInformer.Lister()
	dsc.historyStoreSynced = historyInformer.Informer().HasSynced

	// Watch for creation/deletion of pods. The reason we watch is that we don't want a daemon set to create/delete
	// more pods until all the effects (expectations) of a daemon set's create/delete have been observed.
	podInformer.Informer().AddEventHandler(cache.ResourceEventHandlerFuncs{
		AddFunc:    dsc.addPod,
		UpdateFunc: dsc.updatePod,
		DeleteFunc: dsc.deletePod,
	})
	dsc.podLister = podInformer.Lister()

	// This custom indexer will index pods based on their NodeName which will decrease the amount of pods we need to get in simulate() call.
	podInformer.Informer().GetIndexer().AddIndexers(cache.Indexers{
		"nodeName": indexByPodNodeName,
	})
	dsc.podNodeIndex = podInformer.Informer().GetIndexer()
	dsc.podStoreSynced = podInformer.Informer().HasSynced

	nodeInformer.Informer().AddEventHandler(cache.ResourceEventHandlerFuncs{
		AddFunc:    dsc.addNode,
		UpdateFunc: dsc.updateNode,
	},
	)
	dsc.nodeStoreSynced = nodeInformer.Informer().HasSynced
	dsc.nodeLister = nodeInformer.Lister()

	dsc.syncHandler = dsc.syncDaemonSet
	dsc.enqueueDaemonSet = dsc.enqueue

	dsc.failedPodsBackoff = failedPodsBackoff

	return dsc, nil
}

1.2 dsc.Run

Focus on the for loop: according to the value of workers (default 2), it starts the corresponding number of goroutines running dsc.runWorker, which mainly calls the core processing method dsc.syncDaemonSet mentioned above.

// pkg/controller/daemon/daemon_controller.go
func (dsc *DaemonSetsController) Run(workers int, stopCh <-chan struct{}) {
	defer utilruntime.HandleCrash()
	defer dsc.queue.ShutDown()

	klog.Infof("Starting daemon sets controller")
	defer klog.Infof("Shutting down daemon sets controller")

	if !cache.WaitForNamedCacheSync("daemon sets", stopCh, dsc.podStoreSynced, dsc.nodeStoreSynced, dsc.historyStoreSynced, dsc.dsStoreSynced) {
		return
	}

	for i := 0; i < workers; i++ {
		go wait.Until(dsc.runWorker, time.Second, stopCh)
	}

	go wait.Until(dsc.failedPodsBackoff.GC, BackoffGCInterval, stopCh)

	<-stopCh
}

1.2.1 dsc.runWorker

It takes an event key out of the queue and calls dsc.syncHandler (i.e. dsc.syncDaemonSet) to reconcile it. As mentioned earlier, the events in the queue come from the EventHandlers that the daemonset controller registered for daemonset, node, pod, and ControllerRevision objects; their change events are watched and then put into the queue.

// pkg/controller/daemon/daemon_controller.go
func (dsc *DaemonSetsController) runWorker() {
	for dsc.processNextWorkItem() {
	}
}

func (dsc *DaemonSetsController) processNextWorkItem() bool {
	dsKey, quit := dsc.queue.Get()
	if quit {
		return false
	}
	defer dsc.queue.Done(dsKey)

	err := dsc.syncHandler(dsKey.(string))
	if err == nil {
		dsc.queue.Forget(dsKey)
		return true
	}

	utilruntime.HandleError(fmt.Errorf("%v failed with : %v", dsKey, err))
	dsc.queue.AddRateLimited(dsKey)

	return true
}

2. daemonset controller core processing logic analysis

syncDaemonSet

Let's look directly at the daemonset controller's core processing method, syncDaemonSet.

Main logic:
(1) record the current time at the start of the method and define a defer function to calculate the total execution time, i.e. how long it took to reconcile one daemonset;
(2) get the daemonset object by its namespace and name;
(3) get the list of all node objects;
(4) check whether the daemonset object's DeletionTimestamp is non-nil; if so, return directly, since the daemonset is being deleted and needs no further reconciliation;
(5) call dsc.constructHistory to get the daemonset's current and old revisions;
(6) call dsc.expectations.SatisfiedExpectations to check whether the daemonset satisfies the expectations mechanism (the expectations mechanism serves the same purpose as in the replicaset controller analysis and is not expanded here); if not satisfied, call dsc.updateDaemonSetStatus to update the daemonset's status and return;
(7) call dsc.manage, which does not distinguish between pods of old and new daemonset revisions; it only ensures that a daemonset pod runs on every node that meets the conditions, creating pods on suitable nodes that do not yet have one and deleting daemonset pods from nodes that do not meet the conditions;
(8) call dsc.expectations.SatisfiedExpectations again; if satisfied, check the daemonset's configured update strategy. For a rolling update, call dsc.rollingUpdate, which handles the rolling update of the daemonset object and deletes old pods according to the rolling update configuration (pod creation is done in dsc.manage).
When the update strategy is OnDelete, nothing extra is done here, because new DaemonSet pods are only created after the old DaemonSet pods are deleted manually; once the old pods are deleted manually, the new-revision pods are created in dsc.manage;
(9) call dsc.cleanupHistory to clean up the daemonset's historical revisions that no longer have pods, according to the daemonset's spec.revisionHistoryLimit and revision age (oldest revisions are cleaned up first);
(10) finally, call dsc.updateDaemonSetStatus to update the daemonset's status based on how the existing daemonset pods are deployed, the pods' states, whether nodes meet the pod running conditions, and other information.

// pkg/controller/daemon/daemon_controller.go
func (dsc *DaemonSetsController) syncDaemonSet(key string) error {
	startTime := time.Now()
	defer func() {
		klog.V(4).Infof("Finished syncing daemon set %q (%v)", key, time.Since(startTime))
	}()

	namespace, name, err := cache.SplitMetaNamespaceKey(key)
	if err != nil {
		return err
	}
	ds, err := dsc.dsLister.DaemonSets(namespace).Get(name)
	if errors.IsNotFound(err) {
		klog.V(3).Infof("daemon set has been deleted %v", key)
		dsc.expectations.DeleteExpectations(key)
		return nil
	}
	if err != nil {
		return fmt.Errorf("unable to retrieve ds %v from store: %v", key, err)
	}

	nodeList, err := dsc.nodeLister.List(labels.Everything())
	if err != nil {
		return fmt.Errorf("couldn't get list of nodes when syncing daemon set %#v: %v", ds, err)
	}

	everything := metav1.LabelSelector{}
	if reflect.DeepEqual(ds.Spec.Selector, &everything) {
		dsc.eventRecorder.Eventf(ds, v1.EventTypeWarning, SelectingAllReason, "This daemon set is selecting all pods. A non-empty selector is required.")
		return nil
	}

	// Don't process a daemon set until all its creations and deletions have been processed.
	// For example if daemon set foo asked for 3 new daemon pods in the previous call to manage,
	// then we do not want to call manage on foo until the daemon pods have been created.
	dsKey, err := controller.KeyFunc(ds)
	if err != nil {
		return fmt.Errorf("couldn't get key for object %#v: %v", ds, err)
	}

	// If the DaemonSet is being deleted (either by foreground deletion or
	// orphan deletion), we cannot be sure if the DaemonSet history objects
	// it owned still exist -- those history objects can either be deleted
	// or orphaned. Garbage collector doesn't guarantee that it will delete
	// DaemonSet pods before deleting DaemonSet history objects, because
	// DaemonSet history doesn't own DaemonSet pods. We cannot reliably
	// calculate the status of a DaemonSet being deleted. Therefore, return
	// here without updating status for the DaemonSet being deleted.
	if ds.DeletionTimestamp != nil {
		return nil
	}

	// Construct histories of the DaemonSet, and get the hash of current history
	cur, old, err := dsc.constructHistory(ds)
	if err != nil {
		return fmt.Errorf("failed to construct revisions of DaemonSet: %v", err)
	}
	hash := cur.Labels[apps.DefaultDaemonSetUniqueLabelKey]

	if !dsc.expectations.SatisfiedExpectations(dsKey) {
		// Only update status. Don't raise observedGeneration since controller didn't process object of that generation.
		return dsc.updateDaemonSetStatus(ds, nodeList, hash, false)
	}

	err = dsc.manage(ds, nodeList, hash)
	if err != nil {
		return err
	}

	// Process rolling updates if we're ready.
	if dsc.expectations.SatisfiedExpectations(dsKey) {
		switch ds.Spec.UpdateStrategy.Type {
		case apps.OnDeleteDaemonSetStrategyType:
		case apps.RollingUpdateDaemonSetStrategyType:
			err = dsc.rollingUpdate(ds, nodeList, hash)
		}
		if err != nil {
			return err
		}
	}

	err = dsc.cleanupHistory(ds, old)
	if err != nil {
		return fmt.Errorf("failed to clean up revisions of DaemonSet: %v", err)
	}

	return dsc.updateDaemonSetStatus(ds, nodeList, hash, true)
}

2.1 dsc.manage

The dsc.manage method does not distinguish between pods of old and new daemonset revisions; it mainly ensures that a daemonset pod runs on every node that meets the conditions, creating pods on suitable nodes that do not yet have one and deleting daemonset pods from nodes that do not meet the conditions.

Main logic:
(1) call dsc.getNodesToDaemonPods to get all pods of the daemonset by its Selector and return a map from node to the daemonset pods on it;
(2) iterate over the node list obtained earlier and run dsc.podsShouldBeOnNode, which compares the pod's nodeName, nodeSelector, toleration of node taints, etc., against the node object's information to determine whether a daemonset pod already exists on the node and whether a pod should be created on or deleted from that node;
(3) call getUnscheduledPodsWithoutNode, which compares each pod's nodeName against the node list obtained earlier and adds pods whose nodeName no longer exists to the list of pods to delete;
(4) call dsc.syncNodes to create and delete pods according to the list of nodes that need a pod and the list of pods to delete.

// pkg/controller/daemon/daemon_controller.go
func (dsc *DaemonSetsController) manage(ds *apps.DaemonSet, nodeList []*v1.Node, hash string) error {
	// Find out the pods which are created for the nodes by DaemonSet.
	nodeToDaemonPods, err := dsc.getNodesToDaemonPods(ds)
	if err != nil {
		return fmt.Errorf("couldn't get node to daemon pod mapping for daemon set %q: %v", ds.Name, err)
	}

	// For each node, if the node is running the daemon pod but isn't supposed to, kill the daemon
	// pod. If the node is supposed to run the daemon pod, but isn't, create the daemon pod on the node.
	var nodesNeedingDaemonPods, podsToDelete []string
	for _, node := range nodeList {
		nodesNeedingDaemonPodsOnNode, podsToDeleteOnNode, err := dsc.podsShouldBeOnNode(
			node, nodeToDaemonPods, ds)

		if err != nil {
			continue
		}

		nodesNeedingDaemonPods = append(nodesNeedingDaemonPods, nodesNeedingDaemonPodsOnNode...)
		podsToDelete = append(podsToDelete, podsToDeleteOnNode...)
	}

	// Remove unscheduled pods assigned to not existing nodes when daemonset pods are scheduled by scheduler.
	// If node doesn't exist then pods are never scheduled and can't be deleted by PodGCController.
	podsToDelete = append(podsToDelete, getUnscheduledPodsWithoutNode(nodeList, nodeToDaemonPods)...)

	// Label new pods using the hash label value of the current history when creating them
	if err = dsc.syncNodes(ds, podsToDelete, nodesNeedingDaemonPods, hash); err != nil {
		return err
	}

	return nil
}

2.1.1 dsc.podsShouldBeOnNode

The dsc.podsShouldBeOnNode method determines whether a daemonset pod needs to run on a given node; it returns nodesNeedingDaemonPods and podsToDelete, which are the nodes that need a daemonset pod and the list of pods to delete, respectively.

Main logic:
(1) call dsc.nodeShouldRunDaemonPod, which returns shouldSchedule and shouldContinueRunning, indicating whether a daemonset pod should be scheduled onto the node and whether a daemonset pod on the node may keep running;
(2) when shouldSchedule is true, i.e. a pod should be scheduled onto the node but does not exist there yet, add the node to nodesNeedingDaemonPods;
(3) when shouldContinueRunning is true, find the daemonset pods on the node that are still running and have not exited, add any failed pods to podsToDelete, then sort the remaining running pods by creation time, keep only the oldest one, and add the rest to podsToDelete;
(4) when shouldContinueRunning is false, i.e. the daemonset pod should not keep running on the node but daemonset pods exist on it, add all of that daemonset's pods on the node to podsToDelete;
(5) return nodesNeedingDaemonPods and podsToDelete, the nodes that need a daemonset pod and the list of pods to delete, respectively.

// pkg/controller/daemon/daemon_controller.go
func (dsc *DaemonSetsController) podsShouldBeOnNode(
	node *v1.Node,
	nodeToDaemonPods map[string][]*v1.Pod,
	ds *apps.DaemonSet,
) (nodesNeedingDaemonPods, podsToDelete []string, err error) {

	_, shouldSchedule, shouldContinueRunning, err := dsc.nodeShouldRunDaemonPod(node, ds)
	if err != nil {
		return
	}

	daemonPods, exists := nodeToDaemonPods[node.Name]

	switch {
	case shouldSchedule && !exists:
		// If daemon pod is supposed to be running on node, but isn't, create daemon pod.
		nodesNeedingDaemonPods = append(nodesNeedingDaemonPods, node.Name)
	case shouldContinueRunning:
		// If a daemon pod failed, delete it
		// If there's non-daemon pods left on this node, we will create it in the next sync loop
		var daemonPodsRunning []*v1.Pod
		for _, pod := range daemonPods {
			if pod.DeletionTimestamp != nil {
				continue
			}
			if pod.Status.Phase == v1.PodFailed {
				// This is a critical place where DS is often fighting with kubelet that rejects pods.
				// We need to avoid hot looping and backoff.
				backoffKey := failedPodsBackoffKey(ds, node.Name)

				now := dsc.failedPodsBackoff.Clock.Now()
				inBackoff := dsc.failedPodsBackoff.IsInBackOffSinceUpdate(backoffKey, now)
				if inBackoff {
					delay := dsc.failedPodsBackoff.Get(backoffKey)
					klog.V(4).Infof("Deleting failed pod %s/%s on node %s has been limited by backoff - %v remaining",
						pod.Namespace, pod.Name, node.Name, delay)
					dsc.enqueueDaemonSetAfter(ds, delay)
					continue
				}

				dsc.failedPodsBackoff.Next(backoffKey, now)

				msg := fmt.Sprintf("Found failed daemon pod %s/%s on node %s, will try to kill it", pod.Namespace, pod.Name, node.Name)
				klog.V(2).Infof(msg)
				// Emit an event so that it's discoverable to users.
				dsc.eventRecorder.Eventf(ds, v1.EventTypeWarning, FailedDaemonPodReason, msg)
				podsToDelete = append(podsToDelete, pod.Name)
			} else {
				daemonPodsRunning = append(daemonPodsRunning, pod)
			}
		}
		// If daemon pod is supposed to be running on node, but more than 1 daemon pod is running, delete the excess daemon pods.
		// Sort the daemon pods by creation time, so the oldest is preserved.
		if len(daemonPodsRunning) > 1 {
			sort.Sort(podByCreationTimestampAndPhase(daemonPodsRunning))
			for i := 1; i < len(daemonPodsRunning); i++ {
				podsToDelete = append(podsToDelete, daemonPodsRunning[i].Name)
			}
		}
	case !shouldContinueRunning && exists:
		// If daemon pod isn't supposed to run on node, but it is, delete all daemon pods on node.
		for _, pod := range daemonPods {
			if pod.DeletionTimestamp != nil {
				continue
			}
			podsToDelete = append(podsToDelete, pod.Name)
		}
	}

	return nodesNeedingDaemonPods, podsToDelete, nil
}

dsc.nodeShouldRunDaemonPod

The dsc.nodeShouldRunDaemonPod method is not analyzed in depth here. It mainly calls dsc.simulate to run the Predicates algorithms to check whether a node meets the conditions for running the pod. If the predicates fail, then based on the failure reasons it returns wantToRun, shouldSchedule, and shouldContinueRunning, which respectively indicate whether the node matches the pod's selector, taints, and so on (regardless of whether the node has sufficient resources), whether the daemonset pod should be scheduled onto the node, and whether a daemonset pod on the node may keep running; if the predicates succeed, all three are true.
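
As a rough illustration only (this is not the controller's actual code; the predicate outcome here is an assumed input), the following sketch captures the semantics of the three return values described above.

// Illustrative sketch of the semantics of nodeShouldRunDaemonPod's return
// values; the real method runs scheduler predicates via dsc.simulate, while
// here the predicate outcome is an assumed input.
package main

import "fmt"

type predicateOutcome struct {
	selectorAndTaintsMatch bool // node matches nodeSelector/affinity and its taints are tolerated
	insufficientResources  bool // predicates failed only due to insufficient resources
}

// decide mirrors the meaning (not the code) of wantToRun, shouldSchedule
// and shouldContinueRunning.
func decide(p predicateOutcome) (wantToRun, shouldSchedule, shouldContinueRunning bool) {
	if !p.selectorAndTaintsMatch {
		// hard mismatch: no pod should be created or kept on this node
		return false, false, false
	}
	if p.insufficientResources {
		// do not schedule a new pod, but an existing daemon pod may keep running
		return true, false, true
	}
	// all predicates pass
	return true, true, true
}

func main() {
	fmt.Println(decide(predicateOutcome{selectorAndTaintsMatch: true}))
}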

2.1.2 dsc.syncNodes

dsc.syncNodes is the method through which the daemonset controller creates and deletes pods.

This method also involves the expectations mechanism, which serves the same purpose and is used in basically the same way as in the replicaset controller; if you have forgotten it, refer back to the expectations analysis in the replicaset controller article. It is not expanded on here.

Main logic:
(1) calculate the number of pods to create and delete, capped at dsc.burstReplicas (250), i.e. each sync of a daemonset object can create/delete at most 250 pods; anything beyond that has to wait for the next sync;
(2) call dsc.expectations.SetExpectations to set the expectations;
(3) call util.CreatePodTemplate to compute the pod template to create;
(4) create pods first: similar to pod creation in the replicaset controller, a slow-start algorithm is used and creation happens in batches, 1 pod in the first batch, 2 in the second, 4 in the third, doubling each time until the desired number is reached; each batch starts as many goroutines as there are pods to create in it, each goroutine creates one pod, and a WaitGroup waits for all creation tasks in the batch to finish before the next batch starts;
(5) then delete pods: for each pod to delete, a goroutine is started to delete it, and a WaitGroup waits for all goroutines to finish.

// pkg/controller/daemon/daemon_controller.go
func (dsc *DaemonSetsController) syncNodes(ds *apps.DaemonSet, podsToDelete, nodesNeedingDaemonPods []string, hash string) error {
	// We need to set expectations before creating/deleting pods to avoid race conditions.
	dsKey, err := controller.KeyFunc(ds)
	if err != nil {
		return fmt.Errorf("couldn't get key for object %#v: %v", ds, err)
	}

	createDiff := len(nodesNeedingDaemonPods)
	deleteDiff := len(podsToDelete)

	if createDiff > dsc.burstReplicas {
		createDiff = dsc.burstReplicas
	}
	if deleteDiff > dsc.burstReplicas {
		deleteDiff = dsc.burstReplicas
	}

	dsc.expectations.SetExpectations(dsKey, createDiff, deleteDiff)

	// error channel to communicate back failures.  make the buffer big enough to avoid any blocking
	errCh := make(chan error, createDiff+deleteDiff)

	klog.V(4).Infof("Nodes needing daemon pods for daemon set %s: %+v, creating %d", ds.Name, nodesNeedingDaemonPods, createDiff)
	createWait := sync.WaitGroup{}
	// If the returned error is not nil we have a parse error.
	// The controller handles this via the hash.
	generation, err := util.GetTemplateGeneration(ds)
	if err != nil {
		generation = nil
	}
	template := util.CreatePodTemplate(ds.Spec.Template, generation, hash)
	// Batch the pod creates. Batch sizes start at SlowStartInitialBatchSize
	// and double with each successful iteration in a kind of "slow start".
	// This handles attempts to start large numbers of pods that would
	// likely all fail with the same error. For example a project with a
	// low quota that attempts to create a large number of pods will be
	// prevented from spamming the API service with the pod create requests
	// after one of its pods fails.  Conveniently, this also prevents the
	// event spam that those failures would generate.
	batchSize := integer.IntMin(createDiff, controller.SlowStartInitialBatchSize)
	for pos := 0; createDiff > pos; batchSize, pos = integer.IntMin(2*batchSize, createDiff-(pos+batchSize)), pos+batchSize {
		errorCount := len(errCh)
		createWait.Add(batchSize)
		for i := pos; i < pos+batchSize; i++ {
			go func(ix int) {
				defer createWait.Done()

				podTemplate := template.DeepCopy()
				// The pod's NodeAffinity will be updated to make sure the Pod is bound
				// to the target node by default scheduler. It is safe to do so because there
				// should be no conflicting node affinity with the target node.
				podTemplate.Spec.Affinity = util.ReplaceDaemonSetPodNodeNameNodeAffinity(
					podTemplate.Spec.Affinity, nodesNeedingDaemonPods[ix])

				err := dsc.podControl.CreatePodsWithControllerRef(ds.Namespace, podTemplate,
					ds, metav1.NewControllerRef(ds, controllerKind))

				if err != nil {
					if errors.HasStatusCause(err, v1.NamespaceTerminatingCause) {
						// If the namespace is being torn down, we can safely ignore
						// this error since all subsequent creations will fail.
						return
					}
				}
				if err != nil {
					klog.V(2).Infof("Failed creation, decrementing expectations for set %q/%q", ds.Namespace, ds.Name)
					dsc.expectations.CreationObserved(dsKey)
					errCh <- err
					utilruntime.HandleError(err)
				}
			}(i)
		}
		createWait.Wait()
		// any skipped pods that we never attempted to start shouldn't be expected.
		skippedPods := createDiff - (batchSize + pos)
		if errorCount < len(errCh) && skippedPods > 0 {
			klog.V(2).Infof("Slow-start failure. Skipping creation of %d pods, decrementing expectations for set %q/%q", skippedPods, ds.Namespace, ds.Name)
			dsc.expectations.LowerExpectations(dsKey, skippedPods, 0)
			// The skipped pods will be retried later. The next controller resync will
			// retry the slow start process.
			break
		}
	}

	klog.V(4).Infof("Pods to delete for daemon set %s: %+v, deleting %d", ds.Name, podsToDelete, deleteDiff)
	deleteWait := sync.WaitGroup{}
	deleteWait.Add(deleteDiff)
	for i := 0; i < deleteDiff; i++ {
		go func(ix int) {
			defer deleteWait.Done()
			if err := dsc.podControl.DeletePod(ds.Namespace, podsToDelete[ix], ds); err != nil {
				klog.V(2).Infof("Failed deletion, decrementing expectations for set %q/%q", ds.Namespace, ds.Name)
				dsc.expectations.DeletionObserved(dsKey)
				errCh <- err
				utilruntime.HandleError(err)
			}
		}(i)
	}
	deleteWait.Wait()

	// collect errors if any for proper reporting/retry logic in the controller
	errors := []error{}
	close(errCh)
	for err := range errCh {
		errors = append(errors, err)
	}
	return utilerrors.NewAggregate(errors)
}

2.2 dsc.rollingUpdate

The dsc.rollingUpdate method handles the rolling update of a daemonset object: according to the configured rolling update settings, it deletes old pods (pod creation is done in dsc.manage).

Main logic:
(1) call dsc.getNodesToDaemonPods to get the map from node to the daemonset's pods on it;
(2) call dsc.getAllDaemonSetPods to get all pods of old daemonset revisions;
(3) call dsc.getUnavailableNumbers to get maxUnavailable from the daemonset's rolling update configuration and numUnavailable, the number of eligible nodes that have no daemonset pod or whose daemonset pod is unavailable;
(4) call util.SplitByAvailablePods to split all pods of old daemonset revisions into an oldAvailablePods list and an oldUnavailablePods list;
(5) define a string slice oldPodsToDelete to hold the pods that are about to be deleted;
(6) add all oldUnavailablePods to the oldPodsToDelete slice;
(7) iterate over oldAvailablePods; while numUnavailable is less than maxUnavailable, add the pod to oldPodsToDelete and increment numUnavailable;
(8) call dsc.syncNodes to delete the pods in the oldPodsToDelete slice.

// pkg/controller/daemon/update.go
func (dsc *DaemonSetsController) rollingUpdate(ds *apps.DaemonSet, nodeList []*v1.Node, hash string) error {
	nodeToDaemonPods, err := dsc.getNodesToDaemonPods(ds)
	if err != nil {
		return fmt.Errorf("couldn't get node to daemon pod mapping for daemon set %q: %v", ds.Name, err)
	}

	_, oldPods := dsc.getAllDaemonSetPods(ds, nodeToDaemonPods, hash)
	maxUnavailable, numUnavailable, err := dsc.getUnavailableNumbers(ds, nodeList, nodeToDaemonPods)
	if err != nil {
		return fmt.Errorf("couldn't get unavailable numbers: %v", err)
	}
	oldAvailablePods, oldUnavailablePods := util.SplitByAvailablePods(ds.Spec.MinReadySeconds, oldPods)

	// for oldPods delete all not running pods
	var oldPodsToDelete []string
	klog.V(4).Infof("Marking all unavailable old pods for deletion")
	for _, pod := range oldUnavailablePods {
		// Skip terminating pods. We won't delete them again
		if pod.DeletionTimestamp != nil {
			continue
		}
		klog.V(4).Infof("Marking pod %s/%s for deletion", ds.Name, pod.Name)
		oldPodsToDelete = append(oldPodsToDelete, pod.Name)
	}

	klog.V(4).Infof("Marking old pods for deletion")
	for _, pod := range oldAvailablePods {
		if numUnavailable >= maxUnavailable {
			klog.V(4).Infof("Number of unavailable DaemonSet pods: %d, is equal to or exceeds allowed maximum: %d", numUnavailable, maxUnavailable)
			break
		}
		klog.V(4).Infof("Marking pod %s/%s for deletion", ds.Name, pod.Name)
		oldPodsToDelete = append(oldPodsToDelete, pod.Name)
		numUnavailable++
	}
	return dsc.syncNodes(ds, oldPodsToDelete, []string{}, hash)
}

2.3 dsc.updateDaemonSetStatus

The dsc.updateDaemonSetStatus method updates the daemonset's status based on how the existing daemonset pods are deployed, the pods' states, whether nodes meet the pod running conditions, and other information. The code is not analyzed here; instead, here is what each field in the daemonset status means.

(1) currentNumberScheduled: the number of nodes on which a daemonset pod has already been scheduled;
(2) desiredNumberScheduled: the number of nodes on which a daemonset pod should be scheduled;
(3) numberMisscheduled: the number of nodes that should not run a daemonset pod but have one scheduled on them;
(4) numberAvailable: the number of pods in the Available state (a pod is considered Available after it has been Ready for MinReadySeconds);
(5) numberReady: the number of pods in the Ready state;
(6) numberUnavailable: desiredNumberScheduled - numberAvailable;
(7) updatedNumberScheduled: the number of nodes on which the latest-revision daemonset pod has been scheduled.
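
To make the field relationships concrete, here is a small sketch using the apps/v1 DaemonSetStatus Go type; the numbers are made up purely for illustration.

// Sketch: the status fields above on the apps/v1 DaemonSetStatus type,
// filled with illustrative values.
package main

import (
	"fmt"

	appsv1 "k8s.io/api/apps/v1"
)

func main() {
	status := appsv1.DaemonSetStatus{
		DesiredNumberScheduled: 5, // nodes that should run a daemon pod
		CurrentNumberScheduled: 5, // nodes that already run one
		UpdatedNumberScheduled: 3, // nodes running the latest-revision pod
		NumberReady:            4, // pods that are Ready
		NumberAvailable:        4, // Ready for at least MinReadySeconds
		NumberMisscheduled:     0, // nodes running a daemon pod they shouldn't
	}
	// numberUnavailable = desiredNumberScheduled - numberAvailable
	status.NumberUnavailable = status.DesiredNumberScheduled - status.NumberAvailable
	fmt.Printf("%+v\n", status)
}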

Summary

The daemonset controller's pod creation flow is similar to the replicaset controller's: both use the expectations mechanism and cap the number of pods that can be created or deleted in a single reconciliation. Like statefulset, daemonset has two update strategies, OnDelete and RollingUpdate. With OnDelete, the corresponding pods must be deleted manually before the daemonset controller creates new ones. RollingUpdate differs from statefulset and deployment: during a rolling update, pods are deleted first and then created, unlike deployment, which can create new pods before deleting the old ones.

daemonset controller architecture

The rough composition and processing flow of the daemonset controller are shown in the figure above. The daemonset controller registers event handlers for daemonset, pod, node, and ControllerRevision objects; when an event occurs it is watched and the corresponding daemonset object is put into the queue. The syncDaemonSet method contains the core logic by which the daemonset controller reconciles daemonset objects: it takes a daemonset object out of the queue and reconciles it.

daemonset controller core processing logic

The core processing logic of the daemonset controller is to reconcile daemonset objects: create pods on suitable nodes and delete pods from unsuitable nodes; when a rolling update is triggered, delete old pods and create new pods according to the configured rolling update strategy; clean up the daemonset's historical revisions according to the revision history limit; and finally update the daemonset object's status.

daemonset controller pod creation algorithm

The daemonset controller's pod creation algorithm is almost the same as the replicaset controller's: pods are created in batches that grow as 1, 2, 4, 8, ... (at most 250 pods can be created per reconciliation; anything beyond that is created in the next reconciliation). If any pod creation in a batch fails (e.g. the apiserver rate-limits and drops the request; note that timeouts are an exception, because initialization handling may time out), the remaining batches are not attempted, and the pod creation algorithm runs again the next time the daemonset object is reconciled, until every eligible node has a pod of the daemonset.
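
To see the batch progression in isolation, here is a small standalone sketch that mirrors the slow-start loop in dsc.syncNodes; createDiff is an illustrative value.

// Sketch of the slow-start batching: batch sizes start at 1 and double each
// round, bounded by the number of pods still to create (createDiff itself is
// already capped at 250 per sync in dsc.syncNodes).
package main

import "fmt"

func minInt(a, b int) int {
	if a < b {
		return a
	}
	return b
}

func main() {
	createDiff := 20 // pods still to create in this sync (illustrative)
	for pos, batch := 0, minInt(createDiff, 1); pos < createDiff; batch, pos = minInt(2*batch, createDiff-(pos+batch)), pos+batch {
		fmt.Printf("batch of %d: pods [%d, %d)\n", batch, pos, pos+batch)
	}
	// prints batches of 1, 2, 4, 8 and finally 5
}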

daemonset controller pod deletion algorithm

The daemonset controller's pod deletion algorithm starts as many goroutines as there are pods to delete (at most 250 pods can be deleted per reconciliation) and waits for all of them to finish. If any deletion fails (e.g. the apiserver rate-limits and drops the request) or the 250 cap is exceeded, the pod deletion algorithm runs again the next time the daemonset object is reconciled, until all pods expected to be deleted have been deleted.
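
For reference, a generic, self-contained sketch of this fan-out deletion pattern (the function names here are illustrative, not the controller's actual API):

// Sketch: one goroutine per pod to delete, a WaitGroup to wait for all of
// them, and a buffered error channel to collect failures for aggregation.
package main

import (
	"fmt"
	"sync"
)

func deleteAll(podsToDelete []string, deletePod func(name string) error) []error {
	var wg sync.WaitGroup
	errCh := make(chan error, len(podsToDelete)) // big enough to never block
	wg.Add(len(podsToDelete))
	for i := range podsToDelete {
		go func(ix int) {
			defer wg.Done()
			if err := deletePod(podsToDelete[ix]); err != nil {
				errCh <- err
			}
		}(i)
	}
	wg.Wait()
	close(errCh)
	var errs []error
	for err := range errCh {
		errs = append(errs, err)
	}
	return errs
}

func main() {
	errs := deleteAll([]string{"pod-a", "pod-b"}, func(name string) error {
		fmt.Println("deleting", name)
		return nil
	})
	fmt.Println("errors:", errs)
}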
