kube-scheduler Source Code Analysis (2): Core Processing Logic

Posted by 良凱爾 on 2022-03-06

Overview of kube-scheduler

kube-scheduler is one of the core components of kubernetes, responsible for scheduling pod objects. Concretely, based on its scheduling algorithms (the predicate algorithms and the priority algorithms), kube-scheduler assigns each unscheduled pod to the most suitable node.

kube-scheduler Architecture Diagram

The overall composition and processing flow of kube-scheduler is shown in the diagram below. kube-scheduler list/watches objects such as pods and nodes. Through informers it places unscheduled pods into the pending-pod queue and builds the scheduler cache (used to quickly fetch node and other objects). The sched.scheduleOne method is where the core scheduling logic of kube-scheduler lives: it takes one pod from the pending queue, runs the predicate and priority algorithms to pick the best node, updates the cache, and asynchronously performs the bind operation, i.e. sets the pod's nodeName field. At that point the scheduling of one pod is complete.

The analysis of the kube-scheduler component is split into two parts:
(1) initialization and startup of kube-scheduler;
(2) the core processing logic of kube-scheduler.

The previous post covered initialization and startup of the kube-scheduler component; this post analyzes its core processing logic.

2. Analysis of the kube-scheduler Core Processing Logic

Based on tag v1.17.4

https://github.com/kubernetes/kubernetes/releases/tag/v1.17.4

We start directly from the core processing method of kube-scheduler, sched.Run.

sched.Run

The main logic of sched.Run:
(1) wait until the object caches of the informers have synced;
(2) call sched.scheduleOne in a loop to schedule pods.

// pkg/scheduler/scheduler.go
func (sched *Scheduler) Run(ctx context.Context) {
    // 1. Wait for the informers' object caches to sync
	if !cache.WaitForCacheSync(ctx.Done(), sched.scheduledPodsHasSynced) {
		return
	}
    
    // 2. Call sched.scheduleOne in a loop to schedule pods
	wait.UntilWithContext(ctx, sched.scheduleOne, 0)
}

sched.scheduleOne

The sched.scheduleOne method schedules one pod onto a suitable node. Its main logic:
(1) take one pod from the scheduler's pending-pod queue; if the pod's DeletionTimestamp is set (it is being deleted), skip scheduling it;
(2) call sched.Algorithm.Schedule to run the scheduling algorithms, returning the node computed by the predicate and priority phases;
(3) if the scheduling algorithm fails, record a scheduling-failure event and update the pod's status;
(4) if no suitable node was found, check whether preemption is enabled on the scheduler; if so, call sched.preempt to run the preemption logic;
(5) call sched.VolumeBinder.Binder.AssumePodVolumes to update the cache and check whether all associated PVCs are already bound;
(6) run the "reserve" plugins (just note this for now; these plugin types are covered later);
(7) call sched.assume to record in the scheduler cache that this pod has been scheduled; since updating the pod's nodeName is asynchronous, this prevents the pod from being scheduled twice;
(8) start a goroutine that performs the pod's binding asynchronously:
(8.1) run the "permit" plugins (just note this for now; covered later);
(8.2) call sched.bindVolumes to bind the volumes;
(8.3) run the "prebind" plugins (just note this for now; covered later);
(8.4) update the pod's nodeName and write it to etcd;
(8.5) run the "postbind" plugins (just note this for now; covered later); scheduling of this pod is complete.

// pkg/scheduler/scheduler.go
func (sched *Scheduler) scheduleOne(ctx context.Context) {
	fwk := sched.Framework
    
    // 1. Pop the next pod from the scheduler's pending-pod queue
	podInfo := sched.NextPod()
	// pod could be nil when schedulerQueue is closed
	if podInfo == nil || podInfo.Pod == nil {
		return
	}
	pod := podInfo.Pod
	// Skip scheduling if the pod's DeletionTimestamp is set (the pod is being deleted)
	if pod.DeletionTimestamp != nil {
		sched.Recorder.Eventf(pod, nil, v1.EventTypeWarning, "FailedScheduling", "Scheduling", "skip schedule deleting pod: %v/%v", pod.Namespace, pod.Name)
		klog.V(3).Infof("Skip schedule deleting pod: %v/%v", pod.Namespace, pod.Name)
		return
	}

	klog.V(3).Infof("Attempting to schedule pod: %v/%v", pod.Namespace, pod.Name)
    
    // 2. Call sched.Algorithm.Schedule to run the scheduling algorithms, returning the node computed by the predicate and priority phases
	// Synchronously attempt to find a fit for the pod.
	start := time.Now()
	state := framework.NewCycleState()
	state.SetRecordFrameworkMetrics(rand.Intn(100) < frameworkMetricsSamplePercent)
	schedulingCycleCtx, cancel := context.WithCancel(ctx)
	defer cancel()
	scheduleResult, err := sched.Algorithm.Schedule(schedulingCycleCtx, state, pod)
	if err != nil {
	    // 3. On scheduling failure, record a failure event and update the pod's status
		sched.recordSchedulingFailure(podInfo.DeepCopy(), err, v1.PodReasonUnschedulable, err.Error())
		// Schedule() may have failed because the pod would not fit on any host, so we try to
		// preempt, with the expectation that the next time the pod is tried for scheduling it
		// will fit due to the preemption. It is also possible that a different pod will schedule
		// into the resources that were preempted, but this is harmless.
		if fitError, ok := err.(*core.FitError); ok {
		    // 4. If no suitable node was found, run sched.preempt unless preemption is disabled on the scheduler
			if sched.DisablePreemption {
				klog.V(3).Infof("Pod priority feature is not enabled or preemption is disabled by scheduler configuration." +
					" No preemption is performed.")
			} else {
				preemptionStartTime := time.Now()
				sched.preempt(schedulingCycleCtx, state, fwk, pod, fitError)
				metrics.PreemptionAttempts.Inc()
				metrics.SchedulingAlgorithmPreemptionEvaluationDuration.Observe(metrics.SinceInSeconds(preemptionStartTime))
				metrics.DeprecatedSchedulingAlgorithmPreemptionEvaluationDuration.Observe(metrics.SinceInMicroseconds(preemptionStartTime))
				metrics.SchedulingLatency.WithLabelValues(metrics.PreemptionEvaluation).Observe(metrics.SinceInSeconds(preemptionStartTime))
				metrics.DeprecatedSchedulingLatency.WithLabelValues(metrics.PreemptionEvaluation).Observe(metrics.SinceInSeconds(preemptionStartTime))
			}
			// Pod did not fit anywhere, so it is counted as a failure. If preemption
			// succeeds, the pod should get counted as a success the next time we try to
			// schedule it. (hopefully)
			metrics.PodScheduleFailures.Inc()
		} else {
			klog.Errorf("error selecting node for pod: %v", err)
			metrics.PodScheduleErrors.Inc()
		}
		return
	}
	metrics.SchedulingAlgorithmLatency.Observe(metrics.SinceInSeconds(start))
	metrics.DeprecatedSchedulingAlgorithmLatency.Observe(metrics.SinceInMicroseconds(start))
	// Tell the cache to assume that a pod now is running on a given node, even though it hasn't been bound yet.
	// This allows us to keep scheduling without waiting on binding to occur.
	assumedPodInfo := podInfo.DeepCopy()
	assumedPod := assumedPodInfo.Pod
    
    // 5. Update the cache and check whether all associated PVCs are already bound
	// Assume volumes first before assuming the pod.
	//
	// If all volumes are completely bound, then allBound is true and binding will be skipped.
	//
	// Otherwise, binding of volumes is started after the pod is assumed, but before pod binding.
	//
	// This function modifies 'assumedPod' if volume binding is required.
	allBound, err := sched.VolumeBinder.Binder.AssumePodVolumes(assumedPod, scheduleResult.SuggestedHost)
	if err != nil {
		sched.recordSchedulingFailure(assumedPodInfo, err, SchedulerError,
			fmt.Sprintf("AssumePodVolumes failed: %v", err))
		metrics.PodScheduleErrors.Inc()
		return
	}
    
    // 6. Run "reserve" plugins
	// Run "reserve" plugins.
	if sts := fwk.RunReservePlugins(schedulingCycleCtx, state, assumedPod, scheduleResult.SuggestedHost); !sts.IsSuccess() {
		sched.recordSchedulingFailure(assumedPodInfo, sts.AsError(), SchedulerError, sts.Message())
		metrics.PodScheduleErrors.Inc()
		return
	}
    
    // 7. Call sched.assume to record in the scheduler cache that this pod is scheduled; updating the pod's nodeName is asynchronous, so this prevents the pod from being scheduled twice
	// assume modifies `assumedPod` by setting NodeName=scheduleResult.SuggestedHost
	err = sched.assume(assumedPod, scheduleResult.SuggestedHost)
	if err != nil {
		// This is most probably result of a BUG in retrying logic.
		// We report an error here so that pod scheduling can be retried.
		// This relies on the fact that Error will check if the pod has been bound
		// to a node and if so will not add it back to the unscheduled pods queue
		// (otherwise this would cause an infinite loop).
		sched.recordSchedulingFailure(assumedPodInfo, err, SchedulerError, fmt.Sprintf("AssumePod failed: %v", err))
		metrics.PodScheduleErrors.Inc()
		// trigger un-reserve plugins to clean up state associated with the reserved Pod
		fwk.RunUnreservePlugins(schedulingCycleCtx, state, assumedPod, scheduleResult.SuggestedHost)
		return
	}
	
	// 8. Start a goroutine to perform the pod's binding asynchronously
	// bind the pod to its host asynchronously (we can do this b/c of the assumption step above).
	go func() {
		bindingCycleCtx, cancel := context.WithCancel(ctx)
		defer cancel()
		metrics.SchedulerGoroutines.WithLabelValues("binding").Inc()
		defer metrics.SchedulerGoroutines.WithLabelValues("binding").Dec()

		// Run "permit" plugins.
		permitStatus := fwk.RunPermitPlugins(bindingCycleCtx, state, assumedPod, scheduleResult.SuggestedHost)
		if !permitStatus.IsSuccess() {
			var reason string
			if permitStatus.IsUnschedulable() {
				metrics.PodScheduleFailures.Inc()
				reason = v1.PodReasonUnschedulable
			} else {
				metrics.PodScheduleErrors.Inc()
				reason = SchedulerError
			}
			if forgetErr := sched.Cache().ForgetPod(assumedPod); forgetErr != nil {
				klog.Errorf("scheduler cache ForgetPod failed: %v", forgetErr)
			}
			// trigger un-reserve plugins to clean up state associated with the reserved Pod
			fwk.RunUnreservePlugins(bindingCycleCtx, state, assumedPod, scheduleResult.SuggestedHost)
			sched.recordSchedulingFailure(assumedPodInfo, permitStatus.AsError(), reason, permitStatus.Message())
			return
		}

		// Bind volumes first before Pod
		if !allBound {
			err := sched.bindVolumes(assumedPod)
			if err != nil {
				sched.recordSchedulingFailure(assumedPodInfo, err, "VolumeBindingFailed", err.Error())
				metrics.PodScheduleErrors.Inc()
				// trigger un-reserve plugins to clean up state associated with the reserved Pod
				fwk.RunUnreservePlugins(bindingCycleCtx, state, assumedPod, scheduleResult.SuggestedHost)
				return
			}
		}

		// Run "prebind" plugins.
		preBindStatus := fwk.RunPreBindPlugins(bindingCycleCtx, state, assumedPod, scheduleResult.SuggestedHost)
		if !preBindStatus.IsSuccess() {
			var reason string
			metrics.PodScheduleErrors.Inc()
			reason = SchedulerError
			if forgetErr := sched.Cache().ForgetPod(assumedPod); forgetErr != nil {
				klog.Errorf("scheduler cache ForgetPod failed: %v", forgetErr)
			}
			// trigger un-reserve plugins to clean up state associated with the reserved Pod
			fwk.RunUnreservePlugins(bindingCycleCtx, state, assumedPod, scheduleResult.SuggestedHost)
			sched.recordSchedulingFailure(assumedPodInfo, preBindStatus.AsError(), reason, preBindStatus.Message())
			return
		}

		err := sched.bind(bindingCycleCtx, assumedPod, scheduleResult.SuggestedHost, state)
		metrics.E2eSchedulingLatency.Observe(metrics.SinceInSeconds(start))
		metrics.DeprecatedE2eSchedulingLatency.Observe(metrics.SinceInMicroseconds(start))
		if err != nil {
			metrics.PodScheduleErrors.Inc()
			// trigger un-reserve plugins to clean up state associated with the reserved Pod
			fwk.RunUnreservePlugins(bindingCycleCtx, state, assumedPod, scheduleResult.SuggestedHost)
			sched.recordSchedulingFailure(assumedPodInfo, err, SchedulerError, fmt.Sprintf("Binding rejected: %v", err))
		} else {
			// Calculating nodeResourceString can be heavy. Avoid it if klog verbosity is below 2.
			if klog.V(2) {
				klog.Infof("pod %v/%v is bound successfully on node %q, %d nodes evaluated, %d nodes were found feasible.", assumedPod.Namespace, assumedPod.Name, scheduleResult.SuggestedHost, scheduleResult.EvaluatedNodes, scheduleResult.FeasibleNodes)
			}

			metrics.PodScheduleSuccesses.Inc()
			metrics.PodSchedulingAttempts.Observe(float64(podInfo.Attempts))
			metrics.PodSchedulingDuration.Observe(metrics.SinceInSeconds(podInfo.InitialAttemptTimestamp))

			// Run "postbind" plugins.
			fwk.RunPostBindPlugins(bindingCycleCtx, state, assumedPod, scheduleResult.SuggestedHost)
		}
	}()
}

2.1 sched.Algorithm.Schedule

sched.Algorithm.Schedule runs the predicate and priority algorithms to compute a suitable node for the pod. Its main logic:
(1) check the PVCs used by the pod, failing if any of them is being deleted;
(2) call g.snapshot to take a snapshot of all node information for this scheduling cycle; both the predicate and priority phases below use this snapshot;
(3) run the "prefilter" plugins (just note this for now; covered later);
(4) call g.findNodesThatFit to run the predicate algorithms and filter out the nodes that can run this pod;
(5) run the "postfilter" plugins (just note this for now; covered later);
(6) if the number of feasible nodes is 0, return a failure immediately;
(7) if the number of feasible nodes is 1, return that node directly without running the priority algorithms;
(8) call g.prioritizeNodes to run the priority algorithms and score all feasible nodes;
(9) call g.selectHost to pick the best (highest-scoring) node from the scored list and return it.

// pkg/scheduler/core/generic_scheduler.go
func (g *genericScheduler) Schedule(ctx context.Context, state *framework.CycleState, pod *v1.Pod) (result ScheduleResult, err error) {
	trace := utiltrace.New("Scheduling", utiltrace.Field{Key: "namespace", Value: pod.Namespace}, utiltrace.Field{Key: "name", Value: pod.Name})
	defer trace.LogIfLong(100 * time.Millisecond)
    
    // 1. Check the pod's PVCs, failing if any referenced PVC is being deleted
	if err := podPassesBasicChecks(pod, g.pvcLister); err != nil {
		return result, err
	}
	trace.Step("Basic checks done")
    
    // 2. Call g.snapshot to take a snapshot of all node information for this scheduling cycle; the predicate and priority phases below both use this snapshot
	if err := g.snapshot(); err != nil {
		return result, err
	}
	trace.Step("Snapshoting scheduler cache and node infos done")

	if len(g.nodeInfoSnapshot.NodeInfoList) == 0 {
		return result, ErrNoNodesAvailable
	}

	// Run "prefilter" plugins.
	preFilterStatus := g.framework.RunPreFilterPlugins(ctx, state, pod)
	if !preFilterStatus.IsSuccess() {
		return result, preFilterStatus.AsError()
	}
	trace.Step("Running prefilter plugins done")
    
    // 4. Call g.findNodesThatFit to run the predicate algorithms and filter out the nodes that can run this pod
	startPredicateEvalTime := time.Now()
	filteredNodes, failedPredicateMap, filteredNodesStatuses, err := g.findNodesThatFit(ctx, state, pod)
	if err != nil {
		return result, err
	}
	trace.Step("Computing predicates done")

	// Run "postfilter" plugins.
	postfilterStatus := g.framework.RunPostFilterPlugins(ctx, state, pod, filteredNodes, filteredNodesStatuses)
	if !postfilterStatus.IsSuccess() {
		return result, postfilterStatus.AsError()
	}
    
    // 6. If no node fits, return a failure immediately
	if len(filteredNodes) == 0 {
		return result, &FitError{
			Pod:                   pod,
			NumAllNodes:           len(g.nodeInfoSnapshot.NodeInfoList),
			FailedPredicates:      failedPredicateMap,
			FilteredNodesStatuses: filteredNodesStatuses,
		}
	}
	trace.Step("Running postfilter plugins done")
	metrics.SchedulingAlgorithmPredicateEvaluationDuration.Observe(metrics.SinceInSeconds(startPredicateEvalTime))
	metrics.DeprecatedSchedulingAlgorithmPredicateEvaluationDuration.Observe(metrics.SinceInMicroseconds(startPredicateEvalTime))
	metrics.SchedulingLatency.WithLabelValues(metrics.PredicateEvaluation).Observe(metrics.SinceInSeconds(startPredicateEvalTime))
	metrics.DeprecatedSchedulingLatency.WithLabelValues(metrics.PredicateEvaluation).Observe(metrics.SinceInSeconds(startPredicateEvalTime))
    
    // 7. If exactly one node fits, return it directly and skip the priority phase
	startPriorityEvalTime := time.Now()
	// When only one node after predicate, just use it.
	if len(filteredNodes) == 1 {
		metrics.SchedulingAlgorithmPriorityEvaluationDuration.Observe(metrics.SinceInSeconds(startPriorityEvalTime))
		metrics.DeprecatedSchedulingAlgorithmPriorityEvaluationDuration.Observe(metrics.SinceInMicroseconds(startPriorityEvalTime))
		return ScheduleResult{
			SuggestedHost:  filteredNodes[0].Name,
			EvaluatedNodes: 1 + len(failedPredicateMap) + len(filteredNodesStatuses),
			FeasibleNodes:  1,
		}, nil
	}

    // 8. Call g.prioritizeNodes to run the priority algorithms and score all feasible nodes
	metaPrioritiesInterface := g.priorityMetaProducer(pod, filteredNodes, g.nodeInfoSnapshot)
	priorityList, err := g.prioritizeNodes(ctx, state, pod, metaPrioritiesInterface, filteredNodes)
	if err != nil {
		return result, err
	}

	metrics.SchedulingAlgorithmPriorityEvaluationDuration.Observe(metrics.SinceInSeconds(startPriorityEvalTime))
	metrics.DeprecatedSchedulingAlgorithmPriorityEvaluationDuration.Observe(metrics.SinceInMicroseconds(startPriorityEvalTime))
	metrics.SchedulingLatency.WithLabelValues(metrics.PriorityEvaluation).Observe(metrics.SinceInSeconds(startPriorityEvalTime))
	metrics.DeprecatedSchedulingLatency.WithLabelValues(metrics.PriorityEvaluation).Observe(metrics.SinceInSeconds(startPriorityEvalTime))
    
    // 9. Call g.selectHost to pick the best (highest-scoring) node and return it
	host, err := g.selectHost(priorityList)
	trace.Step("Prioritizing done")

	return ScheduleResult{
		SuggestedHost:  host,
		EvaluatedNodes: len(filteredNodes) + len(failedPredicateMap) + len(filteredNodesStatuses),
		FeasibleNodes:  len(filteredNodes),
	}, err
}

2.1.1 g.snapshot

The g.snapshot method takes a snapshot of all node information for the current scheduling cycle; the predicate and priority phases that follow both use this snapshot. When node information has not changed (determined by comparing node generations), the method returns the existing snapshot without updating it; only when node information has changed does it refresh the snapshot and return the latest one.

// pkg/scheduler/core/generic_scheduler.go
func (g *genericScheduler) snapshot() error {
	// Used for all fit and priority funcs.
	return g.cache.UpdateNodeInfoSnapshot(g.nodeInfoSnapshot)
}
// pkg/scheduler/internal/cache/cache.go
func (cache *schedulerCache) UpdateNodeInfoSnapshot(nodeSnapshot *nodeinfosnapshot.Snapshot) error {
	cache.mu.Lock()
	defer cache.mu.Unlock()
	balancedVolumesEnabled := utilfeature.DefaultFeatureGate.Enabled(features.BalanceAttachedNodeVolumes)

	// Get the last generation of the snapshot.
	snapshotGeneration := nodeSnapshot.Generation
    
    // Add and update node entries in the snapshot
	// Start from the head of the NodeInfo doubly linked list and update snapshot
	// of NodeInfos updated after the last snapshot.
	for node := cache.headNode; node != nil; node = node.next {
		if node.info.GetGeneration() <= snapshotGeneration {
			// all the nodes are updated before the existing snapshot. We are done.
			break
		}
		if balancedVolumesEnabled && node.info.TransientInfo != nil {
			// Transient scheduler info is reset here.
			node.info.TransientInfo.ResetTransientSchedulerInfo()
		}
		if np := node.info.Node(); np != nil {
			nodeSnapshot.NodeInfoMap[np.Name] = node.info.Clone()
		}
	}
	// Update the snapshot generation with the latest NodeInfo generation.
	if cache.headNode != nil {
		nodeSnapshot.Generation = cache.headNode.info.GetGeneration()
	}
    
    // Remove snapshot entries for nodes that no longer exist
	if len(nodeSnapshot.NodeInfoMap) > len(cache.nodes) {
		for name := range nodeSnapshot.NodeInfoMap {
			if _, ok := cache.nodes[name]; !ok {
				delete(nodeSnapshot.NodeInfoMap, name)
			}
		}
	}

	// Take a snapshot of the nodes order in the tree
	nodeSnapshot.NodeInfoList = make([]*schedulernodeinfo.NodeInfo, 0, cache.nodeTree.numNodes)
	nodeSnapshot.HavePodsWithAffinityNodeInfoList = make([]*schedulernodeinfo.NodeInfo, 0, cache.nodeTree.numNodes)
	for i := 0; i < cache.nodeTree.numNodes; i++ {
		nodeName := cache.nodeTree.next()
		if n := nodeSnapshot.NodeInfoMap[nodeName]; n != nil {
			nodeSnapshot.NodeInfoList = append(nodeSnapshot.NodeInfoList, n)
			if len(n.PodsWithAffinity()) > 0 {
				nodeSnapshot.HavePodsWithAffinityNodeInfoList = append(nodeSnapshot.HavePodsWithAffinityNodeInfoList, n)
			}
		} else {
			klog.Errorf("node %q exist in nodeTree but not in NodeInfoMap, this should not happen.", nodeName)
		}
	}
	return nil
}

2.1.2 g.findNodesThatFit: Running the Predicate Algorithms

The g.findNodesThatFit method is where the predicate algorithms run. Main logic:
(1) call g.numFeasibleNodesToFind to compute how many feasible nodes the predicate phase should look for;
(2) define the checkNode function, which decides whether a node is feasible;
(3) start 16 goroutines that run checkNode on all nodes in parallel, producing a list of feasible nodes whose length is at most the value returned by g.numFeasibleNodesToFind;
(4) iterate over the scheduler-extenders (a webhook extension mechanism of kube-scheduler) and run each extender's Filter on the already-filtered nodes, i.e. run the HTTP-extended predicate algorithms;
(5) finally return filtered (the nodes that passed the predicates), failedPredicateMap (the nodes that failed and why), and related results.

// pkg/scheduler/core/generic_scheduler.go
func (g *genericScheduler) findNodesThatFit(ctx context.Context, state *framework.CycleState, pod *v1.Pod) ([]*v1.Node, FailedPredicateMap, framework.NodeToStatusMap, error) {
	var filtered []*v1.Node
	failedPredicateMap := FailedPredicateMap{}
	filteredNodesStatuses := framework.NodeToStatusMap{}

	if len(g.predicates) == 0 && !g.framework.HasFilterPlugins() {
		filtered = g.nodeInfoSnapshot.ListNodes()
	} else {
		allNodes := len(g.nodeInfoSnapshot.NodeInfoList)
		numNodesToFind := g.numFeasibleNodesToFind(int32(allNodes))

		// Create filtered list with enough space to avoid growing it
		// and allow assigning.
		filtered = make([]*v1.Node, numNodesToFind)
		errCh := util.NewErrorChannel()
		var (
			predicateResultLock sync.Mutex
			filteredLen         int32
		)

		ctx, cancel := context.WithCancel(ctx)

		// We can use the same metadata producer for all nodes.
		meta := g.predicateMetaProducer(pod, g.nodeInfoSnapshot)
		state.Write(migration.PredicatesStateKey, &migration.PredicatesStateData{Reference: meta})

		checkNode := func(i int) {
			// We check the nodes starting from where we left off in the previous scheduling cycle,
			// this is to make sure all nodes have the same chance of being examined across pods.
			nodeInfo := g.nodeInfoSnapshot.NodeInfoList[(g.nextStartNodeIndex+i)%allNodes]
			fits, failedPredicates, status, err := g.podFitsOnNode(
				ctx,
				state,
				pod,
				meta,
				nodeInfo,
				g.alwaysCheckAllPredicates,
			)
			if err != nil {
				errCh.SendErrorWithCancel(err, cancel)
				return
			}
			if fits {
				length := atomic.AddInt32(&filteredLen, 1)
				if length > numNodesToFind {
					cancel()
					atomic.AddInt32(&filteredLen, -1)
				} else {
					filtered[length-1] = nodeInfo.Node()
				}
			} else {
				predicateResultLock.Lock()
				if !status.IsSuccess() {
					filteredNodesStatuses[nodeInfo.Node().Name] = status
				}
				if len(failedPredicates) != 0 {
					failedPredicateMap[nodeInfo.Node().Name] = failedPredicates
				}
				predicateResultLock.Unlock()
			}
		}

		// Stops searching for more nodes once the configured number of feasible nodes
		// are found.
		workqueue.ParallelizeUntil(ctx, 16, allNodes, checkNode)
		processedNodes := int(filteredLen) + len(filteredNodesStatuses) + len(failedPredicateMap)
		g.nextStartNodeIndex = (g.nextStartNodeIndex + processedNodes) % allNodes

		filtered = filtered[:filteredLen]
		if err := errCh.ReceiveError(); err != nil {
			return []*v1.Node{}, FailedPredicateMap{}, framework.NodeToStatusMap{}, err
		}
	}

	if len(filtered) > 0 && len(g.extenders) != 0 {
		for _, extender := range g.extenders {
			if !extender.IsInterested(pod) {
				continue
			}
			filteredList, failedMap, err := extender.Filter(pod, filtered, g.nodeInfoSnapshot.NodeInfoMap)
			if err != nil {
				if extender.IsIgnorable() {
					klog.Warningf("Skipping extender %v as it returned error %v and has ignorable flag set",
						extender, err)
					continue
				}

				return []*v1.Node{}, FailedPredicateMap{}, framework.NodeToStatusMap{}, err
			}

			for failedNodeName, failedMsg := range failedMap {
				if _, found := failedPredicateMap[failedNodeName]; !found {
					failedPredicateMap[failedNodeName] = []predicates.PredicateFailureReason{}
				}
				failedPredicateMap[failedNodeName] = append(failedPredicateMap[failedNodeName], predicates.NewFailureReason(failedMsg))
			}
			filtered = filteredList
			if len(filtered) == 0 {
				break
			}
		}
	}
	return filtered, failedPredicateMap, filteredNodesStatuses, nil
}

g.numFeasibleNodesToFind

The g.numFeasibleNodesToFind method computes how many feasible nodes the predicate phase should look for.

The returned node count is computed as follows:
(1) when the number of nodes is less than 100, or the configuration parameter percentageOfNodesToScore is greater than or equal to 100, return the total number of nodes;
(2) when percentageOfNodesToScore is less than or equal to 0, use the built-in adaptive formula:
returned node count = number of nodes * (50 - number of nodes / 125) / 100
(3) when percentageOfNodesToScore is greater than 0, returned node count = number of nodes * percentageOfNodesToScore / 100;
(4) when the computed node count is less than 100, still return the minimum value of 100.

About percentageOfNodesToScore: this parameter is used for kube-scheduler performance tuning. It allows the predicate phase to stop searching once a certain number of feasible nodes has been found, improving scheduler performance in large clusters.

// pkg/scheduler/core/generic_scheduler.go
func (g *genericScheduler) numFeasibleNodesToFind(numAllNodes int32) (numNodes int32) {
	if numAllNodes < minFeasibleNodesToFind || g.percentageOfNodesToScore >= 100 {
		return numAllNodes
	}

	adaptivePercentage := g.percentageOfNodesToScore
	if adaptivePercentage <= 0 {
		basePercentageOfNodesToScore := int32(50)
		adaptivePercentage = basePercentageOfNodesToScore - numAllNodes/125
		if adaptivePercentage < minFeasibleNodesPercentageToFind {
			adaptivePercentage = minFeasibleNodesPercentageToFind
		}
	}

	numNodes = numAllNodes * adaptivePercentage / 100
	if numNodes < minFeasibleNodesToFind {
		return minFeasibleNodesToFind
	}

	return numNodes
}

checkNode

The main logic of the checkNode function:
(1) take a node from the node snapshot;
(2) call g.podFitsOnNode to run all registered predicate algorithms for this pod on this node;
(3) once the number of feasible nodes found exceeds the number of nodes to look for, call the cancel function to stop searching.

        checkNode := func(i int) {
			// (1) Take a node from the node snapshot
			nodeInfo := g.nodeInfoSnapshot.NodeInfoList[(g.nextStartNodeIndex+i)%allNodes]
			
			// (2) Run all registered predicate algorithms for this pod on this node
			fits, failedPredicates, status, err := g.podFitsOnNode(
				ctx,
				state,
				pod,
				meta,
				nodeInfo,
				g.alwaysCheckAllPredicates,
			)
			if err != nil {
				errCh.SendErrorWithCancel(err, cancel)
				return
			}
			if fits {
				length := atomic.AddInt32(&filteredLen, 1)
				
				// (3) Once enough feasible nodes have been found, call cancel to stop searching
				if length > numNodesToFind {
					cancel()
					atomic.AddInt32(&filteredLen, -1)
				} else {
					filtered[length-1] = nodeInfo.Node()
				}
			} else {
				predicateResultLock.Lock()
				if !status.IsSuccess() {
					filteredNodesStatuses[nodeInfo.Node().Name] = status
				}
				if len(failedPredicates) != 0 {
					failedPredicateMap[nodeInfo.Node().Name] = failedPredicates
				}
				predicateResultLock.Unlock()
			}
		}

The predicate algorithms run in a predefined order, and the configuration parameter alwaysCheckAllPredicates decides whether the remaining predicates are still evaluated after one of them fails.

// pkg/scheduler/algorithm/predicates/predicates.go
var (
	predicatesOrdering = []string{CheckNodeUnschedulablePred,
		GeneralPred, HostNamePred, PodFitsHostPortsPred,
		MatchNodeSelectorPred, PodFitsResourcesPred, NoDiskConflictPred,
		PodToleratesNodeTaintsPred, PodToleratesNodeNoExecuteTaintsPred, CheckNodeLabelPresencePred,
		CheckServiceAffinityPred, MaxEBSVolumeCountPred, MaxGCEPDVolumeCountPred, MaxCSIVolumeCountPred,
		MaxAzureDiskVolumeCountPred, MaxCinderVolumeCountPred, CheckVolumeBindingPred, NoVolumeZoneConflictPred,
		EvenPodsSpreadPred, MatchInterPodAffinityPred}
)

g.podFitsOnNode

The g.podFitsOnNode method runs all registered predicate algorithms for a given pod on a given node to see whether the node fits.

Three points in g.podFitsOnNode deserve attention:
(1) the for loop runs twice;
(2) predicates.Ordering() defines the order in which the predicates run;
(3) g.framework.RunFilterPlugins runs the "Filter" plugins.

What the two iterations of the for loop do:
(1) in the first pass, kube-scheduler assumes the nominated (preempting) pods are already running on the node (even though they have not actually been scheduled there) and runs the predicates;
(2) in the second pass, kube-scheduler runs the predicates normally, without considering the nominated pods.

Why the loop must run twice, i.e. why the predicates execute twice:
(1) the first pass is necessary because the nominated (higher-priority) pods may end up scheduled onto this node, and predicates such as InterPodAntiAffinity consider the mutual-exclusion relations between all pods;
(2) the second pass is also necessary because the nominated pods may not actually end up on this node, and some predicates, such as pod affinity, may fail to hold if the nominated pods never land there; so the predicates must also run once without the nominated pods.

func (g *genericScheduler) podFitsOnNode(
	ctx context.Context,
	state *framework.CycleState,
	pod *v1.Pod,
	meta predicates.Metadata,
	info *schedulernodeinfo.NodeInfo,
	alwaysCheckAllPredicates bool,
) (bool, []predicates.PredicateFailureReason, *framework.Status, error) {
	var failedPredicates []predicates.PredicateFailureReason
	var status *framework.Status

	podsAdded := false
	
	...
	
	// Run the predicate checks twice
	for i := 0; i < 2; i++ {
		metaToUse := meta
		stateToUse := state
		nodeInfoToUse := info
		if i == 0 {
		    // Handle the nominated (higher-priority, preempting) pods
			var err error
			// addNominatedPods enumerates the node's nominated pods, i.e. takes the preempting pods into account before the predicates run below
			podsAdded, metaToUse, stateToUse, nodeInfoToUse, err = g.addNominatedPods(ctx, pod, meta, state, info)
			if err != nil {
				return false, []predicates.PredicateFailureReason{}, nil, err
			}
		} else if !podsAdded || len(failedPredicates) != 0 || !status.IsSuccess() {
			break
		}

		for _, predicateKey := range predicates.Ordering() {
			var (
				fit     bool
				reasons []predicates.PredicateFailureReason
				err     error
			)

			if predicate, exist := g.predicates[predicateKey]; exist {
				fit, reasons, err = predicate(pod, metaToUse, nodeInfoToUse)
				if err != nil {
					return false, []predicates.PredicateFailureReason{}, nil, err
				}

				if !fit {
					// eCache is available and valid, and predicates result is unfit, record the fail reasons
					failedPredicates = append(failedPredicates, reasons...)
					// if alwaysCheckAllPredicates is false, short circuit all predicates when one predicate fails.
					if !alwaysCheckAllPredicates {
						klog.V(5).Infoln("since alwaysCheckAllPredicates has not been set, the predicate " +
							"evaluation is short circuited and there are chances " +
							"of other predicates failing as well.")
						break
					}
				}
			}
		}

		status = g.framework.RunFilterPlugins(ctx, stateToUse, pod, nodeInfoToUse)
		if !status.IsSuccess() && !status.IsUnschedulable() {
			return false, failedPredicates, status, status.AsError()
		}
	}

	return len(failedPredicates) == 0 && status.IsSuccess(), failedPredicates, status, nil
}

predicates.Ordering() contains the full list of predicate algorithms and defines the order in which they run, so when extending the predicate algorithms you must remember to add the new predicate's name to this list.

// pkg/scheduler/algorithm/predicates/predicates.go

var (
	predicatesOrdering = []string{CheckNodeUnschedulablePred,
		GeneralPred, HostNamePred, PodFitsHostPortsPred,
		MatchNodeSelectorPred, PodFitsResourcesPred, NoDiskConflictPred,
		PodToleratesNodeTaintsPred, PodToleratesNodeNoExecuteTaintsPred, CheckNodeLabelPresencePred,
		CheckServiceAffinityPred, MaxEBSVolumeCountPred, MaxGCEPDVolumeCountPred, MaxCSIVolumeCountPred,
		MaxAzureDiskVolumeCountPred, MaxCinderVolumeCountPred, CheckVolumeBindingPred, NoVolumeZoneConflictPred,
		EvenPodsSpreadPred, MatchInterPodAffinityPred}
)

// Ordering returns the ordering of predicates.
func Ordering() []string {
	return predicatesOrdering
}

2.1.3 g.prioritizeNodes: Running the Priority Algorithms

The g.prioritizeNodes method is where the priority algorithms run. After all priority algorithms have run on a node, the node ends up with a score between a minimum of 0 and a maximum of 100 (100 in v1.17.4; older versions may use 10).

// pkg/scheduler/framework/v1alpha1/interface.go
const (
	// MaxNodeScore is the maximum score a Score plugin is expected to return.
	MaxNodeScore int64 = 100

	// MinNodeScore is the minimum score a Score plugin is expected to return.
	MinNodeScore int64 = 0

	...
)

Main logic of g.prioritizeNodes:
(1) if no priority algorithms, extenders, or score plugins are registered, every node gets a score of 1;
(2) run the priority algorithms on the nodes, using a MapReduce-style scheme (the Map and Reduce functions are defined when a priority algorithm is registered);
(2.1) start 16 goroutines that run each priority algorithm's Map on the nodes in parallel and record each algorithm's score for each node;
(2.2) for each priority algorithm, start a goroutine to run its Reduce;
(3) run the "Score" plugins;
(4) sum up each node's scores (a node's score = sum over priority algorithms of that algorithm's score * that algorithm's weight);
(5) iterate over the scheduler-extenders (a webhook extension mechanism of kube-scheduler) and run each extender's Prioritize, i.e. the HTTP-extended priority algorithms, then recompute each node's total score.

About Map-Reduce here: a priority algorithm's Map function computes the score of one node; when needed, a Reduce function is defined to normalize those scores into [0-100] according to some policy (shrinking scores, inverting scores, and so on). For more background on Map-Reduce, consult other material; it is not expanded on here.

// pkg/scheduler/core/generic_scheduler.go
func (g *genericScheduler) prioritizeNodes(
	ctx context.Context,
	state *framework.CycleState,
	pod *v1.Pod,
	meta interface{},
	nodes []*v1.Node,
) (framework.NodeScoreList, error) {
	// If no priority configs are provided, then all nodes will have a score of one.
	// This is required to generate the priority list in the required format
	if len(g.prioritizers) == 0 && len(g.extenders) == 0 && !g.framework.HasScorePlugins() {
		result := make(framework.NodeScoreList, 0, len(nodes))
		for i := range nodes {
			result = append(result, framework.NodeScore{
				Name:  nodes[i].Name,
				Score: 1,
			})
		}
		return result, nil
	}

	var (
		mu   = sync.Mutex{}
		wg   = sync.WaitGroup{}
		errs []error
	)
	appendError := func(err error) {
		mu.Lock()
		defer mu.Unlock()
		errs = append(errs, err)
	}

	results := make([]framework.NodeScoreList, len(g.prioritizers))

	for i := range g.prioritizers {
		results[i] = make(framework.NodeScoreList, len(nodes))
	}

	workqueue.ParallelizeUntil(context.TODO(), 16, len(nodes), func(index int) {
		nodeInfo := g.nodeInfoSnapshot.NodeInfoMap[nodes[index].Name]
		for i := range g.prioritizers {
			var err error
			results[i][index], err = g.prioritizers[i].Map(pod, meta, nodeInfo)
			if err != nil {
				appendError(err)
				results[i][index].Name = nodes[index].Name
			}
		}
	})

	for i := range g.prioritizers {
		if g.prioritizers[i].Reduce == nil {
			continue
		}
		wg.Add(1)
		go func(index int) {
			metrics.SchedulerGoroutines.WithLabelValues("prioritizing_mapreduce").Inc()
			defer func() {
				metrics.SchedulerGoroutines.WithLabelValues("prioritizing_mapreduce").Dec()
				wg.Done()
			}()
			if err := g.prioritizers[index].Reduce(pod, meta, g.nodeInfoSnapshot, results[index]); err != nil {
				appendError(err)
			}
			if klog.V(10) {
				for _, hostPriority := range results[index] {
					klog.Infof("%v -> %v: %v, Score: (%d)", util.GetPodFullName(pod), hostPriority.Name, g.prioritizers[index].Name, hostPriority.Score)
				}
			}
		}(i)
	}
	// Wait for all computations to be finished.
	wg.Wait()
	if len(errs) != 0 {
		return framework.NodeScoreList{}, errors.NewAggregate(errs)
	}

	// Run the Score plugins.
	state.Write(migration.PrioritiesStateKey, &migration.PrioritiesStateData{Reference: meta})
	scoresMap, scoreStatus := g.framework.RunScorePlugins(ctx, state, pod, nodes)
	if !scoreStatus.IsSuccess() {
		return framework.NodeScoreList{}, scoreStatus.AsError()
	}

	// Summarize all scores.
	result := make(framework.NodeScoreList, 0, len(nodes))

	for i := range nodes {
		result = append(result, framework.NodeScore{Name: nodes[i].Name, Score: 0})
		for j := range g.prioritizers {
			result[i].Score += results[j][i].Score * g.prioritizers[j].Weight
		}

		for j := range scoresMap {
			result[i].Score += scoresMap[j][i].Score
		}
	}

	// Call each interested extender's Prioritize endpoint concurrently and
	// fold its weighted scores into the result.
	if len(g.extenders) != 0 && nodes != nil {
		combinedScores := make(map[string]int64, len(g.nodeInfoSnapshot.NodeInfoList))
		for i := range g.extenders {
			if !g.extenders[i].IsInterested(pod) {
				continue
			}
			wg.Add(1)
			go func(extIndex int) {
				metrics.SchedulerGoroutines.WithLabelValues("prioritizing_extender").Inc()
				defer func() {
					metrics.SchedulerGoroutines.WithLabelValues("prioritizing_extender").Dec()
					wg.Done()
				}()
				prioritizedList, weight, err := g.extenders[extIndex].Prioritize(pod, nodes)
				if err != nil {
					// Prioritization errors from extender can be ignored, let k8s/other extenders determine the priorities
					return
				}
				mu.Lock()
				for i := range *prioritizedList {
					host, score := (*prioritizedList)[i].Host, (*prioritizedList)[i].Score
					if klog.V(10) {
						klog.Infof("%v -> %v: %v, Score: (%d)", util.GetPodFullName(pod), host, g.extenders[extIndex].Name(), score)
					}
					combinedScores[host] += score * weight
				}
				mu.Unlock()
			}(i)
		}
		// wait for all go routines to finish
		wg.Wait()
		for i := range result {
			// MaxExtenderPriority may diverge from the max priority used in the scheduler and defined by MaxNodeScore,
			// therefore we need to scale the score returned by extenders to the score range used by the scheduler.
			result[i].Score += combinedScores[result[i].Name] * (framework.MaxNodeScore / extenderv1.MaxExtenderPriority)
		}
	}

	if klog.V(10) {
		for i := range result {
			klog.Infof("Host %s => Score %d", result[i].Name, result[i].Score)
		}
	}
	return result, nil
}
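To see the score-summation logic near the end of the function in isolation, here is a minimal, self-contained sketch. The `NodeScore` type and the `sumScores` helper are simplified stand-ins for `framework.NodeScoreList` and the summation loop above, not the real scheduler types.

```go
package main

import "fmt"

// NodeScore mirrors the shape of framework.NodeScore (simplified for illustration).
type NodeScore struct {
	Name  string
	Score int64
}

// sumScores reproduces the summation loop in isolation: for each node, add
// every prioritizer's score multiplied by that prioritizer's weight, then add
// the score-plugin scores as-is.
func sumScores(nodes []string, prioritizerResults [][]int64, weights []int64, pluginScores [][]int64) []NodeScore {
	result := make([]NodeScore, 0, len(nodes))
	for i, name := range nodes {
		ns := NodeScore{Name: name}
		for j := range prioritizerResults {
			ns.Score += prioritizerResults[j][i] * weights[j]
		}
		for j := range pluginScores {
			ns.Score += pluginScores[j][i]
		}
		result = append(result, ns)
	}
	return result
}

func main() {
	nodes := []string{"node1", "node2"}
	// Two prioritizers, each scoring both nodes.
	prioritizerResults := [][]int64{{10, 40}, {50, 20}}
	weights := []int64{1, 2} // the second prioritizer counts double
	pluginScores := [][]int64{{5, 5}}
	for _, ns := range sumScores(nodes, prioritizerResults, weights, pluginScores) {
		fmt.Printf("%s => %d\n", ns.Name, ns.Score)
	}
}
```

Here node1 totals 10\*1 + 50\*2 + 5 = 115, matching the `score * weight` accumulation in the real code.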

2.1.4 g.selectHost

The g.selectHost method picks the highest-scoring node from the scored node list and returns it; when several nodes tie for the highest score, one of them is returned uniformly at random.

// pkg/scheduler/core/generic_scheduler.go
func (g *genericScheduler) selectHost(nodeScoreList framework.NodeScoreList) (string, error) {
	if len(nodeScoreList) == 0 {
		return "", fmt.Errorf("empty priorityList")
	}
	maxScore := nodeScoreList[0].Score
	selected := nodeScoreList[0].Name
	cntOfMaxScore := 1
	for _, ns := range nodeScoreList[1:] {
		if ns.Score > maxScore {
			maxScore = ns.Score
			selected = ns.Name
			cntOfMaxScore = 1
		} else if ns.Score == maxScore {
			cntOfMaxScore++
			if rand.Intn(cntOfMaxScore) == 0 {
				// Replace the candidate with probability of 1/cntOfMaxScore
				selected = ns.Name
			}
		}
	}
	return selected, nil
}
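The tie-breaking trick above is a one-pass reservoir sample: replacing the current candidate with probability 1/cntOfMaxScore makes every top-scoring node equally likely to win. The self-contained sketch below reuses the same loop with a simplified `NodeScore` type to demonstrate this empirically.

```go
package main

import (
	"fmt"
	"math/rand"
)

// NodeScore is a simplified stand-in for framework.NodeScore.
type NodeScore struct {
	Name  string
	Score int64
}

// selectHost mirrors g.selectHost: a single pass tracking the current maximum,
// replacing the candidate with probability 1/cntOfMaxScore on ties, which
// yields a uniform choice among all top-scoring nodes.
func selectHost(list []NodeScore) (string, error) {
	if len(list) == 0 {
		return "", fmt.Errorf("empty priorityList")
	}
	maxScore := list[0].Score
	selected := list[0].Name
	cntOfMaxScore := 1
	for _, ns := range list[1:] {
		if ns.Score > maxScore {
			maxScore = ns.Score
			selected = ns.Name
			cntOfMaxScore = 1
		} else if ns.Score == maxScore {
			cntOfMaxScore++
			if rand.Intn(cntOfMaxScore) == 0 {
				selected = ns.Name
			}
		}
	}
	return selected, nil
}

func main() {
	// Three nodes tie at 90; each should win roughly a third of the time,
	// and the low-scoring node should never win.
	list := []NodeScore{{"a", 90}, {"b", 90}, {"c", 90}, {"d", 10}}
	counts := map[string]int{}
	for i := 0; i < 30000; i++ {
		host, _ := selectHost(list)
		counts[host]++
	}
	fmt.Println(counts["d"] == 0, counts["a"] > 0, counts["b"] > 0, counts["c"] > 0)
}
```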

2.2 sched.preempt

The preemption mechanism: when a high-priority pod cannot find a suitable node, kube-scheduler evicts lower-priority pods from a node so that the high-priority pod can be scheduled onto it; the evicted pods then go back into the pending queue and are rescheduled. The preemption code is not analyzed in detail here and can be explored on your own.
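As a rough illustration of the idea only (not the actual kube-scheduler preemption code, which involves node selection, PDB checks, and nominated-node handling), the toy sketch below evicts the lowest-priority pods until enough CPU is freed for the pending pod; all type and field names here are invented for the example.

```go
package main

import (
	"fmt"
	"sort"
)

// pod is a toy model of a running pod: a name, a priority, and a CPU request
// in millicores. These fields are illustrative, not the real API types.
type pod struct {
	name     string
	priority int32
	cpu      int64
}

// pickVictims returns the lower-priority pods to evict so that `need`
// millicores become available on the node, or nil if preemption cannot help.
func pickVictims(running []pod, preemptor pod, free, need int64) []pod {
	// Only pods with strictly lower priority than the preemptor are candidates.
	var candidates []pod
	for _, p := range running {
		if p.priority < preemptor.priority {
			candidates = append(candidates, p)
		}
	}
	// Evict the lowest-priority pods first.
	sort.Slice(candidates, func(i, j int) bool { return candidates[i].priority < candidates[j].priority })
	var victims []pod
	for _, p := range candidates {
		if free >= need {
			break
		}
		victims = append(victims, p)
		free += p.cpu
	}
	if free < need {
		return nil // even evicting every candidate would not free enough
	}
	return victims
}

func main() {
	running := []pod{{"low-a", 0, 500}, {"mid", 100, 500}, {"high", 1000, 500}}
	preemptor := pod{"urgent", 500, 800}
	for _, v := range pickVictims(running, preemptor, 0, preemptor.cpu) {
		fmt.Println("evict:", v.name)
	}
}
```

In this sketch "low-a" and "mid" are evicted (freeing 1000m for an 800m request), while "high" is untouchable because its priority exceeds the preemptor's.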

kube-scheduler extension mechanisms

kube-scheduler currently provides two extension mechanisms:
(1) scheduler-extender;
(2) scheduler framework.

scheduler-extender

scheduler-extender extends the scheduler through HTTP webhooks: external endpoints can be called during the predicates (filtering) phase, the priorities (scoring) phase, and the Bind phase.

Reference: https://liqiang.io/post/kubernetes-scheduler-extender-dd6516a6

scheduler framework

The plugin invocations mentioned in the earlier code analysis are provided by kube-scheduler's extensible scheduler framework, which exposes interfaces at key points of the scheduler's lifecycle so that users can extend those points and customize the scheduler. For reasons of space, the scheduler framework is not covered in depth here; it is worth exploring on your own.

Many of kube-scheduler's built-in predicates and priorities are now implemented on top of the scheduler framework.

Reference: https://cloudnative.to/blog/202003-k8s-scheduling-framework/

Summary

kube-scheduler overview

The kube-scheduler component is one of the core components of kubernetes. It is responsible for scheduling pod resource objects: using its scheduling algorithms (predicates for filtering and priorities for scoring), it schedules each unscheduled pod onto the most suitable node.

kube-scheduler architecture diagram

The rough composition and processing flow of kube-scheduler are shown in the diagram above. kube-scheduler list/watches pods, nodes and other objects; through informers it puts unscheduled pods into the pending-pod queue and builds the scheduler cache (used to quickly look up nodes and other objects). The sched.scheduleOne method then holds the core logic for scheduling a pod: it takes one pod from the pending queue, runs the predicates and priorities algorithms to select the best node, updates the cache, and asynchronously performs the bind operation, i.e. sets the pod's nodeName field, which completes the scheduling of that pod.

kube-scheduler core processing flow diagram

The flow diagram below shows the core processing steps of the sched.scheduleOne method; steps related to kube-scheduler's extension mechanisms are not drawn.
