Kubernetes Source Code Analysis: kube-scheduler

Posted by NeverMore_RC on 2019-03-13

This section begins the analysis of the Kubernetes source code, based on 1.13.4, the latest version at the time of writing.

Startup Analysis

The entry points of the core Kubernetes components all live under the cmd directory; the kube-scheduler entry point is in scheduler.go.

All Kubernetes components start up as commands, built on the spf13/cobra library. The configuration options are wrapped into a command object, and calling Execute runs the Run function defined for that command. runCommand then completes the initialization of the configuration and calls Run to continue the startup.
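
For orientation, this is roughly what the cobra pattern looks like (a minimal, hypothetical sketch rather than the actual cmd/kube-scheduler code; in the real source the command is built by NewSchedulerCommand and its Run function goes through the options and configuration plumbing):

package main

import (
	"fmt"
	"os"

	"github.com/spf13/cobra"
)

func main() {
	// Build a *cobra.Command; kube-scheduler does this in app.NewSchedulerCommand().
	cmd := &cobra.Command{
		Use: "kube-scheduler",
		Run: func(cmd *cobra.Command, args []string) {
			// In the real code, runCommand(...) validates the options,
			// builds the scheduler configuration and finally calls Run(...).
			fmt.Println("scheduler starting...")
		},
	}

	// Execute parses the command line and invokes the Run function defined above.
	if err := cmd.Execute(); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
}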

Analysis of the Run Method

The Run method mainly does the following:
1. Checks whether the VolumeScheduling feature gate needs to be enabled;
2. Initializes the structs that hold the scheduler configuration;
3. Sets up and prepares the event broadcaster;
4. Configures health checking;
5. Configures metrics;
6. Starts all the informers (Kubernetes watches for resource changes mainly through the Informer + Workqueue mechanism);
7. Checks whether leader election is required and starts the scheduler accordingly (a rough sketch of this flow follows the list).
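
The sketch below is a highly condensed, hypothetical rendering of that flow; the deps struct and its fields are placeholders standing in for the much richer configuration object that cmd/kube-scheduler/app actually passes around:

package main

import (
	"context"
	"fmt"
)

// deps is a stand-in for the scheduler's real configuration object.
type deps struct {
	startInformers func(stopCh <-chan struct{})
	leaderElect    bool
	runWithLease   func(ctx context.Context, onStartedLeading func(context.Context))
	runScheduler   func(ctx context.Context)
}

// run mirrors the shape of the real Run method: start the informers first,
// then either start scheduling directly or only after winning the election.
func run(ctx context.Context, d deps) {
	d.startInformers(ctx.Done()) // step 6 above

	if d.leaderElect { // step 7 above
		// Only the instance that wins the election starts scheduling.
		d.runWithLease(ctx, d.runScheduler)
		return
	}
	d.runScheduler(ctx)
}

func main() {
	run(context.Background(), deps{
		startInformers: func(<-chan struct{}) { fmt.Println("informers started") },
		leaderElect:    false,
		runScheduler:   func(context.Context) { fmt.Println("scheduling loop started") },
	})
}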

The Scheduling Entry Point

The scheduling work ultimately lands in the scheduler.go file under pkg/scheduler, which starts a dedicated goroutine to drive the scheduling loop, roughly as sketched below.
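
The loop pattern is easy to reproduce on its own. In the sketch below, scheduleOne is only a stand-in for the scheduler's real method, but the wait.Until call mirrors how Scheduler.Run keeps invoking it until the stop channel is closed:

package main

import (
	"fmt"
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
)

// scheduleOne stands in for the scheduler's real scheduleOne method, which
// pops the next Pod from the queue and schedules it.
func scheduleOne() {
	fmt.Println("scheduling one pod")
	time.Sleep(500 * time.Millisecond)
}

func main() {
	stopCh := make(chan struct{})

	// wait.Until calls scheduleOne again and again (period 0 means "immediately
	// after the previous call returns") until stopCh is closed, so exactly one
	// Pod is handled per iteration of the loop.
	go wait.Until(scheduleOne, 0, stopCh)

	time.Sleep(3 * time.Second)
	close(stopCh)
}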

Analysis of the scheduleOne Method

scheduleOne, as the name suggests, schedules a single Pod per invocation. The complete function is shown below.

// scheduleOne does the entire scheduling workflow for a single pod.  It is serialized on the scheduling algorithm's host fitting.
func (sched *Scheduler) scheduleOne() {
	// 1. Pop the next pending Pod from the scheduling queue
	pod := sched.config.NextPod()
	// pod could be nil when schedulerQueue is closed
	if pod == nil {
		return
	}
	if pod.DeletionTimestamp != nil {
		sched.config.Recorder.Eventf(pod, v1.EventTypeWarning, "FailedScheduling", "skip schedule deleting pod: %v/%v", pod.Namespace, pod.Name)
		klog.V(3).Infof("Skip schedule deleting pod: %v/%v", pod.Namespace, pod.Name)
		return
	}

	klog.V(3).Infof("Attempting to schedule pod: %v/%v", pod.Namespace, pod.Name)

	// Synchronously attempt to find a fit for the pod.
	start := time.Now()
	// 2. Run the scheduling algorithm to pick a host for the Pod
	suggestedHost, err := sched.schedule(pod)
	if err != nil {
		// schedule() may have failed because the pod would not fit on any host, so we try to
		// preempt, with the expectation that the next time the pod is tried for scheduling it
		// will fit due to the preemption. It is also possible that a different pod will schedule
		// into the resources that were preempted, but this is harmless.
		if fitError, ok := err.(*core.FitError); ok {
			preemptionStartTime := time.Now()
			sched.preempt(pod, fitError)
			metrics.PreemptionAttempts.Inc()
			metrics.SchedulingAlgorithmPremptionEvaluationDuration.Observe(metrics.SinceInMicroseconds(preemptionStartTime))
			metrics.SchedulingLatency.WithLabelValues(metrics.PreemptionEvaluation).Observe(metrics.SinceInSeconds(preemptionStartTime))
			// Pod did not fit anywhere, so it is counted as a failure. If preemption
			// succeeds, the pod should get counted as a success the next time we try to
			// schedule it. (hopefully)
			metrics.PodScheduleFailures.Inc()
		} else {
			klog.Errorf("error selecting node for pod: %v", err)
			metrics.PodScheduleErrors.Inc()
		}
		return
	}
	metrics.SchedulingAlgorithmLatency.Observe(metrics.SinceInMicroseconds(start))
	// Tell the cache to assume that a pod now is running on a given node, even though it hasn't been bound yet.
	// This allows us to keep scheduling without waiting on binding to occur.
	// 3. Cache the Pod/Node assumption so scheduling can keep going without waiting for each binding to finish (binding is slow)
	assumedPod := pod.DeepCopy()

	// Assume volumes first before assuming the pod.
	//
	// If all volumes are completely bound, then allBound is true and binding will be skipped.
	//
	// Otherwise, binding of volumes is started after the pod is assumed, but before pod binding.
	//
	// This function modifies 'assumedPod' if volume binding is required.
	// 4. Handle the VolumeScheduling feature: assume volumes if needed
	allBound, err := sched.assumeVolumes(assumedPod, suggestedHost)
	if err != nil {
		klog.Errorf("error assuming volumes: %v", err)
		metrics.PodScheduleErrors.Inc()
		return
	}

	// assume modifies `assumedPod` by setting NodeName=suggestedHost
	// 5. Set the Pod's NodeName to the chosen host and store it in the cache
	err = sched.assume(assumedPod, suggestedHost)
	if err != nil {
		klog.Errorf("error assuming pod: %v", err)
		metrics.PodScheduleErrors.Inc()
		return
	}
	// bind the pod to its host asynchronously (we can do this b/c of the assumption step above).
	// 6. Call the apiserver asynchronously to perform the final binding, which is persisted to etcd
	go func() {
		// Bind volumes first before Pod
		if !allBound {
			err := sched.bindVolumes(assumedPod)
			if err != nil {
				klog.Errorf("error binding volumes: %v", err)
				metrics.PodScheduleErrors.Inc()
				return
			}
		}

		err := sched.bind(assumedPod, &v1.Binding{
			ObjectMeta: metav1.ObjectMeta{Namespace: assumedPod.Namespace, Name: assumedPod.Name, UID: assumedPod.UID},
			Target: v1.ObjectReference{
				Kind: "Node",
				Name: suggestedHost,
			},
		})
		metrics.E2eSchedulingLatency.Observe(metrics.SinceInMicroseconds(start))
		if err != nil {
			klog.Errorf("error binding pod: %v", err)
			metrics.PodScheduleErrors.Inc()
		} else {
			metrics.PodScheduleSuccesses.Inc()
		}
	}()
}

It does the following:
1. Pops the next pending Pod from the scheduling queue;
2. Runs the scheduling algorithm (predicates plus priorities) to find a matching host for the Pod; if no suitable host is found, it decides whether to preempt, i.e. apply the Pod preemption policy to free up a node for the Pod;
3. Caches the Pod, assuming the binding has already succeeded (mainly to decouple the scheduling and binding phases);
4. Checks whether the VolumeScheduling feature requires additional Pod information to be filled in;
5. Writes the chosen host into the Pod's NodeName (the essence of scheduling is filling an empty NodeName with the right Node);
6. Starts a separate binding goroutine, which calls the apiserver to perform the final binding asynchronously and persist the result to etcd.

The Scheduling Algorithm

The actual scheduling happens in the Schedule method of generic_scheduler.go. It has two main phases: predicates (filtering) and priorities (scoring), roughly as sketched below.
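
Before diving into each phase, here is a self-contained, toy rendering of that two-phase structure; the node type and the fits/score callbacks are placeholders for the real predicate and priority machinery:

package main

import "fmt"

// node and the fits/score callbacks stand in for the real *v1.Node, predicate
// and priority machinery in generic_scheduler.go.
type node struct{ name string }

// schedule mirrors the two-phase structure of genericScheduler.Schedule:
// filter all nodes with the predicates, score the survivors with the
// priorities, then pick the best-scoring host.
func schedule(nodes []node, fits func(node) bool, score func(node) int) (string, error) {
	// Phase 1: predicates - findNodesThatFit.
	var feasible []node
	for _, n := range nodes {
		if fits(n) {
			feasible = append(feasible, n)
		}
	}
	if len(feasible) == 0 {
		return "", fmt.Errorf("no node fits the pod")
	}

	// Phase 2: priorities - PrioritizeNodes followed by selectHost.
	best, bestScore := feasible[0], score(feasible[0])
	for _, n := range feasible[1:] {
		if s := score(n); s > bestScore {
			best, bestScore = n, s
		}
	}
	return best.name, nil
}

func main() {
	nodes := []node{{"node-a"}, {"node-b"}, {"node-c"}}
	host, _ := schedule(nodes,
		func(n node) bool { return n.name != "node-a" }, // toy predicate
		func(n node) int { return len(n.name) },         // toy priority
	)
	fmt.Println("selected:", host)
}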

Predicates (Filtering)

The predicate phase is driven by findNodesThatFit; its main code is as follows:

// Filters the nodes to find the ones that fit based on the given predicate functions
// Each node is passed through the predicate functions to determine if it is a fit
func (g *genericScheduler) findNodesThatFit(pod *v1.Pod, nodes []*v1.Node) ([]*v1.Node, FailedPredicateMap, error) {
	var filtered []*v1.Node
	failedPredicateMap := FailedPredicateMap{}

	// If no predicates are configured, every Node is considered a match
	if len(g.predicates) == 0 {
		filtered = nodes
	} else {
		allNodes := int32(g.cache.NodeTree().NumNodes)
		// numFeasibleNodesToFind caps how many Nodes are returned in one pass, keeping the slice from growing too large
		numNodesToFind := g.numFeasibleNodesToFind(allNodes)

		// Create filtered list with enough space to avoid growing it
		// and allow assigning.
		filtered = make([]*v1.Node, numNodesToFind)
		errs := errors.MessageCountMap{}
		var (
			predicateResultLock sync.Mutex
			filteredLen         int32
			equivClass          *equivalence.Class
		)

		ctx, cancel := context.WithCancel(context.Background())

		// We can use the same metadata producer for all nodes.
		meta := g.predicateMetaProducer(pod, g.cachedNodeInfoMap)

		if g.equivalenceCache != nil {
			// getEquivalenceClassInfo will return immediately if no equivalence pod found
			equivClass = equivalence.NewClass(pod)
		}

		// checkNode runs the predicates against a single Node
		checkNode := func(i int) {
			var nodeCache *equivalence.NodeCache
			// Fetch the next Node from the node tree
			nodeName := g.cache.NodeTree().Next()
			if g.equivalenceCache != nil {
				nodeCache = g.equivalenceCache.LoadNodeCache(nodeName)
			}
			fits, failedPredicates, err := podFitsOnNode(
				pod,
				meta,
				g.cachedNodeInfoMap[nodeName],
				g.predicates,
				nodeCache,
				g.schedulingQueue,
				g.alwaysCheckAllPredicates,
				equivClass,
			)
			if err != nil {
				predicateResultLock.Lock()
				errs[err.Error()]++
				predicateResultLock.Unlock()
				return
			}
			if fits {
				// Keep the number of matched Nodes within numNodesToFind
				length := atomic.AddInt32(&filteredLen, 1)
				if length > numNodesToFind {
					// Tell ParallelizeUntil to stop scheduling more work
					cancel()
					atomic.AddInt32(&filteredLen, -1)
				} else {
					filtered[length-1] = g.cachedNodeInfoMap[nodeName].Node()
				}
			} else {
				predicateResultLock.Lock()
				failedPredicateMap[nodeName] = failedPredicates
				predicateResultLock.Unlock()
			}
		}

		// Stops searching for more nodes once the configured number of feasible nodes
		// are found.
		// Run checkNode for the Nodes in parallel (16 workers)
		workqueue.ParallelizeUntil(ctx, 16, int(allNodes), checkNode)

		filtered = filtered[:filteredLen]
		if len(errs) > 0 {
			return []*v1.Node{}, FailedPredicateMap{}, errors.CreateAggregateFromMessageCountMap(errs)
		}
	}

	if len(filtered) > 0 && len(g.extenders) != 0 {
		for _, extender := range g.extenders {
			if !extender.IsInterested(pod) {
				continue
			}
			filteredList, failedMap, err := extender.Filter(pod, filtered, g.cachedNodeInfoMap)
			if err != nil {
				if extender.IsIgnorable() {
					klog.Warningf("Skipping extender %v as it returned error %v and has ignorable flag set",
						extender, err)
					continue
				} else {
					return []*v1.Node{}, FailedPredicateMap{}, err
				}
			}

			for failedNodeName, failedMsg := range failedMap {
				if _, found := failedPredicateMap[failedNodeName]; !found {
					failedPredicateMap[failedNodeName] = []algorithm.PredicateFailureReason{}
				}
				failedPredicateMap[failedNodeName] = append(failedPredicateMap[failedNodeName], predicates.NewFailureReason(failedMsg))
			}
			filtered = filteredList
			if len(filtered) == 0 {
				break
			}
		}
	}
	return filtered, failedPredicateMap, nil
}

findNodesThatFit does a few things:
1. Checks whether any predicates are configured; if not, it simply returns the full Node list;
2. If predicates are configured, it runs checkNode for multiple Nodes concurrently (16 workers at a time) to decide whether the Pod can be scheduled onto each Node;
3. After that filtering, if scheduler extenders are configured, the remaining Nodes are filtered again by each extender to produce the final list of matching Nodes.
One thing to note: the number of feasible Nodes collected in one pass is capped by numFeasibleNodesToFind; once that many Nodes have matched, checkNode is not called for the rest (the cap is derived from the percentageOfNodesToScore setting, with 100 acting as a lower bound rather than a hard maximum; see the sketch below).
In my view this is a little problematic: when the cluster is large enough, the numFeasibleNodesToFind cap means not every Node is scanned, so the truly best-fitting Node may never be examined. The Node that is eventually chosen is then only the best among the Nodes that were scanned, i.e. merely a comparatively good fit.
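
As far as I recall the 1.13 implementation, the cap works roughly like this (a stand-alone sketch of the method of the same name on genericScheduler, which reads percentageOfNodesToScore from its own configuration):

// minFeasibleNodesToFind matches the constant used by the scheduler in 1.13.
const minFeasibleNodesToFind = 100

// numFeasibleNodesToFind is a sketch of the method of the same name on
// genericScheduler: small clusters are scanned fully, large clusters are
// scanned only up to a percentage of their nodes, but never fewer than 100.
func numFeasibleNodesToFind(numAllNodes, percentageOfNodesToScore int32) int32 {
	if numAllNodes < minFeasibleNodesToFind ||
		percentageOfNodesToScore <= 0 || percentageOfNodesToScore >= 100 {
		return numAllNodes
	}
	numNodes := numAllNodes * percentageOfNodesToScore / 100
	if numNodes < minFeasibleNodesToFind {
		return minFeasibleNodesToFind
	}
	return numNodes
}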
The function that ultimately performs the per-Node check is podFitsOnNode.
The hardest part of podFitsOnNode to follow is that its for loop runs twice; going by the comments, the reasoning is roughly:
1. In the first pass, all nominatedPods with higher or equal priority are added to the Node, and meta and nodeInfo are updated accordingly. nominatedPods are Pods that have been nominated to run on the Node but are not actually running yet. Doing this guarantees that the higher-priority Pods will not fail to schedule because the current Pod is admitted;
2. In the second pass, the nominatedPods are not added to the Node. The reason is policies such as Pod affinity: if the current Pod's affinity depends on the nominatedPods there would be a problem, because the nominatedPods are not guaranteed to actually land on that Node.

// podFitsOnNode checks whether a node given by NodeInfo satisfies the given predicate functions.
// For given pod, podFitsOnNode will check if any equivalent pod exists and try to reuse its cached
// predicate results as possible.
// This function is called from two different places: Schedule and Preempt.
// When it is called from Schedule, we want to test whether the pod is schedulable
// on the node with all the existing pods on the node plus higher and equal priority
// pods nominated to run on the node.
// When it is called from Preempt, we should remove the victims of preemption and
// add the nominated pods. Removal of the victims is done by SelectVictimsOnNode().
// It removes victims from meta and NodeInfo before calling this function.
func podFitsOnNode(
	pod *v1.Pod,
	meta algorithm.PredicateMetadata,
	info *schedulercache.NodeInfo,
	predicateFuncs map[string]algorithm.FitPredicate,
	nodeCache *equivalence.NodeCache,
	queue internalqueue.SchedulingQueue,
	alwaysCheckAllPredicates bool,
	equivClass *equivalence.Class,
) (bool, []algorithm.PredicateFailureReason, error) {
	var (
		eCacheAvailable  bool
		failedPredicates []algorithm.PredicateFailureReason
	)

	podsAdded := false
	// We run predicates twice in some cases. If the node has greater or equal priority
	// nominated pods, we run them when those pods are added to meta and nodeInfo.
	// If all predicates succeed in this pass, we run them again when these
	// nominated pods are not added. This second pass is necessary because some
	// predicates such as inter-pod affinity may not pass without the nominated pods.
	// If there are no nominated pods for the node or if the first run of the
	// predicates fail, we don't run the second pass.
	// We consider only equal or higher priority pods in the first pass, because
	// those are the current "pod" must yield to them and not take a space opened
	// for running them. It is ok if the current "pod" take resources freed for
	// lower priority pods.
	// Requiring that the new pod is schedulable in both circumstances ensures that
	// we are making a conservative decision: predicates like resources and inter-pod
	// anti-affinity are more likely to fail when the nominated pods are treated
	// as running, while predicates like pod affinity are more likely to fail when
	// the nominated pods are treated as not running. We can't just assume the
	// nominated pods are running because they are not running right now and in fact,
	// they may end up getting scheduled to a different node.
	// The loop runs twice because nominated Pods may end up on a different Node, and because of concerns such as Pod affinity
	for i := 0; i < 2; i++ {
		metaToUse := meta
		nodeInfoToUse := info
		if i == 0 {
			// First pass: update meta and nodeInfo with the nominated Pods and run the predicates against that updated view
			// Second pass: leave meta and nodeInfo unchanged so the Pod does not rely solely on the nominated Pods (mainly because of things like pod affinity)
			podsAdded, metaToUse, nodeInfoToUse = addNominatedPods(pod, meta, info, queue)
		} else if !podsAdded || len(failedPredicates) != 0 {
			break
		}
		// Bypass eCache if node has any nominated pods.
		// TODO(bsalamat): consider using eCache and adding proper eCache invalidations
		// when pods are nominated or their nominations change.
		eCacheAvailable = equivClass != nil && nodeCache != nil && !podsAdded
		for predicateID, predicateKey := range predicates.Ordering() {
			var (
				fit     bool
				reasons []algorithm.PredicateFailureReason
				err     error
			)
			//TODO (yastij) : compute average predicate restrictiveness to export it as Prometheus metric
			if predicate, exist := predicateFuncs[predicateKey]; exist {
				if eCacheAvailable {
					fit, reasons, err = nodeCache.RunPredicate(predicate, predicateKey, predicateID, pod, metaToUse, nodeInfoToUse, equivClass)
				} else {
					fit, reasons, err = predicate(pod, metaToUse, nodeInfoToUse)
				}
				if err != nil {
					return false, []algorithm.PredicateFailureReason{}, err
				}

				if !fit {
					// eCache is available and valid, and predicates result is unfit, record the fail reasons
					failedPredicates = append(failedPredicates, reasons...)
					// if alwaysCheckAllPredicates is false, short circuit all predicates when one predicate fails.
					if !alwaysCheckAllPredicates {
						klog.V(5).Infoln("since alwaysCheckAllPredicates has not been set, the predicate " +
							"evaluation is short circuited and there are chances " +
							"of other predicates failing as well.")
						break
					}
				}
			}
		}
	}

	return len(failedPredicates) == 0, failedPredicates, nil
}

After that, the configured predicates are evaluated one by one. There is a small optimization here: if an equivalent Pod was scheduled before, the cached result from that earlier run is returned directly, and on a cache hit the predicate itself does not need to run again. I do have some doubts about this cache, though: the cached result for the earlier Pod may have been a success, but if the Node's state has changed since then, a cached success does not necessarily mean the current attempt would really succeed.

Default Predicates

This part covers the default predicate set. The defaults are defined in pkg/scheduler/algorithmprovider/defaults/defaults.go, where defaultPredicates returns the default list of predicates. The predicate implementations themselves live in pkg/scheduler/algorithm/predicates/predicates.go.

Each predicate has a position in a fixed ordering (Order), which is described in detail in the official documentation.
The predicate functions all share the same shape: they take (pod *v1.Pod, meta algorithm.PredicateMetadata, nodeInfo *schedulercache.NodeInfo) and return (bool, []algorithm.PredicateFailureReason, error); a toy example of that shape is shown below.
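
The following is a purely illustrative predicate written against that signature; it is not part of the default set, and the import paths reflect the 1.13 source layout as I recall it:

package example

import (
	"fmt"

	"k8s.io/api/core/v1"

	"k8s.io/kubernetes/pkg/scheduler/algorithm"
	"k8s.io/kubernetes/pkg/scheduler/algorithm/predicates"
	schedulercache "k8s.io/kubernetes/pkg/scheduler/cache"
)

// errMissingExampleLabel is a made-up failure reason for this toy predicate.
var errMissingExampleLabel = predicates.NewFailureReason("node is missing the label example=true")

// NodeHasExampleLabel is a toy FitPredicate: it admits a Node only if the
// Node carries the label example=true. Real predicates follow the same
// pattern: return fit, a list of failure reasons when not fit, or an error.
func NodeHasExampleLabel(pod *v1.Pod, meta algorithm.PredicateMetadata, nodeInfo *schedulercache.NodeInfo) (bool, []algorithm.PredicateFailureReason, error) {
	node := nodeInfo.Node()
	if node == nil {
		return false, nil, fmt.Errorf("node not found")
	}
	if node.Labels["example"] == "true" {
		return true, nil, nil
	}
	return false, []algorithm.PredicateFailureReason{errMissingExampleLabel}, nil
}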

Priorities (Scoring)

The predicate phase yields an array of Nodes. If more than one Node passed the predicates, the priority algorithms are run to score them and pick the best one.
The priority phase is driven by PrioritizeNodes. It uses the same kind of parallel fan-out as the predicate phase and follows a MapReduce style: each priority's Map function computes a score for one Node, and its Reduce function aggregates (typically normalizes) the results into the final scores; a toy Map/Reduce pair is sketched below.
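
Here is an illustrative Map/Reduce pair written against the 1.13-era signatures as I remember them; the scoring logic is invented purely for demonstration and is not one of the built-in priorities:

package example

import (
	"fmt"

	"k8s.io/api/core/v1"

	schedulerapi "k8s.io/kubernetes/pkg/scheduler/api"
	schedulercache "k8s.io/kubernetes/pkg/scheduler/cache"
)

// ExamplePriorityMap is a toy Map function: it simply scores a Node by the
// number of Pods already running on it (for illustration only).
func ExamplePriorityMap(pod *v1.Pod, meta interface{}, nodeInfo *schedulercache.NodeInfo) (schedulerapi.HostPriority, error) {
	node := nodeInfo.Node()
	if node == nil {
		return schedulerapi.HostPriority{}, fmt.Errorf("node not found")
	}
	return schedulerapi.HostPriority{Host: node.Name, Score: len(nodeInfo.Pods())}, nil
}

// ExamplePriorityReduce normalizes the raw Map scores into the 0..MaxPriority
// range, which is what several of the built-in Reduce functions do.
func ExamplePriorityReduce(pod *v1.Pod, meta interface{}, nodeNameToInfo map[string]*schedulercache.NodeInfo, result schedulerapi.HostPriorityList) error {
	var maxScore int
	for _, hp := range result {
		if hp.Score > maxScore {
			maxScore = hp.Score
		}
	}
	if maxScore == 0 {
		return nil
	}
	for i := range result {
		result[i].Score = result[i].Score * schedulerapi.MaxPriority / maxScore
	}
	return nil
}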

Default Priorities

The default priorities are also defined in pkg/scheduler/algorithmprovider/defaults/defaults.go: defaultPriorities returns the default set of priority algorithms, each of them registered through a factory. The code is as follows:

func defaultPriorities() sets.String {
	return sets.NewString(
		// spreads pods by minimizing the number of pods (belonging to the same service or replication controller) on the same node.
		factory.RegisterPriorityConfigFactory(
			"SelectorSpreadPriority",
			factory.PriorityConfigFactory{
				MapReduceFunction: func(args factory.PluginFactoryArgs) (algorithm.PriorityMapFunction, algorithm.PriorityReduceFunction) {
					return priorities.NewSelectorSpreadPriority(args.ServiceLister, args.ControllerLister, args.ReplicaSetLister, args.StatefulSetLister)
				},
				Weight: 1,
			},
		),
		// pods should be placed in the same topological domain (e.g. same node, same rack, same zone, same power domain, etc.)
		// as some other pods, or, conversely, should not be placed in the same topological domain as some other pods.
		factory.RegisterPriorityConfigFactory(
			"InterPodAffinityPriority",
			factory.PriorityConfigFactory{
				Function: func(args factory.PluginFactoryArgs) algorithm.PriorityFunction {
					return priorities.NewInterPodAffinityPriority(args.NodeInfo, args.NodeLister, args.PodLister, args.HardPodAffinitySymmetricWeight)
				},
				Weight: 1,
			},
		),

		// Prioritize nodes by least requested utilization.
		factory.RegisterPriorityFunction2("LeastRequestedPriority", priorities.LeastRequestedPriorityMap, nil, 1),

		// Prioritizes nodes to help achieve balanced resource usage
		factory.RegisterPriorityFunction2("BalancedResourceAllocation", priorities.BalancedResourceAllocationMap, nil, 1),

		// Set this weight large enough to override all other priority functions.
		// TODO: Figure out a better way to do this, maybe at same time as fixing #24720.
		factory.RegisterPriorityFunction2("NodePreferAvoidPodsPriority", priorities.CalculateNodePreferAvoidPodsPriorityMap, nil, 10000),

		// Prioritizes nodes that have labels matching NodeAffinity
		factory.RegisterPriorityFunction2("NodeAffinityPriority", priorities.CalculateNodeAffinityPriorityMap, priorities.CalculateNodeAffinityPriorityReduce, 1),

		// Prioritizes nodes that marked with taint which pod can tolerate.
		factory.RegisterPriorityFunction2("TaintTolerationPriority", priorities.ComputeTaintTolerationPriorityMap, priorities.ComputeTaintTolerationPriorityReduce, 1),

		// ImageLocalityPriority prioritizes nodes that have images requested by the pod present.
		factory.RegisterPriorityFunction2("ImageLocalityPriority", priorities.ImageLocalityPriorityMap, nil, 1),
	)
}

The set of priority algorithms in use can largely be read off the code layout; each priority strategy lives in its own file.
After scoring, selectHost is called to pick the Node with the highest score. If several Nodes tie for the highest score, one of them is chosen in a round-robin fashion, roughly as sketched below.
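
A stand-alone sketch of that selection and tie-breaking logic (the real method lives on genericScheduler and uses its lastNodeIndex field rather than a package-level variable):

package example

import (
	"fmt"

	schedulerapi "k8s.io/kubernetes/pkg/scheduler/api"
)

// lastNodeIndex plays the role of genericScheduler.lastNodeIndex: it advances
// on every call so ties are broken round-robin across scheduling cycles.
var lastNodeIndex uint64

// selectHost is a sketch of how the highest-scoring host is picked: collect
// the indices of all entries sharing the maximum score, then rotate among them.
func selectHost(priorityList schedulerapi.HostPriorityList) (string, error) {
	if len(priorityList) == 0 {
		return "", fmt.Errorf("empty priorityList")
	}

	maxScore := priorityList[0].Score
	maxIndexes := []int{0}
	for i := 1; i < len(priorityList); i++ {
		switch {
		case priorityList[i].Score > maxScore:
			maxScore = priorityList[i].Score
			maxIndexes = []int{i}
		case priorityList[i].Score == maxScore:
			maxIndexes = append(maxIndexes, i)
		}
	}

	ix := int(lastNodeIndex % uint64(len(maxIndexes)))
	lastNodeIndex++
	return priorityList[maxIndexes[ix]].Host, nil
}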

Preemption

When the normal scheduling flow cannot find a suitable node (mainly because no node passes the predicates), the scheduler decides whether preemption is needed. The code lives in pkg/scheduler/scheduler.go, in the preempt method:

// preempt tries to create room for a pod that has failed to schedule, by preempting lower priority pods if possible.
// If it succeeds, it adds the name of the node where preemption has happened to the pod annotations.
// It returns the node name and an error if any.
func (sched *Scheduler) preempt(preemptor *v1.Pod, scheduleErr error) (string, error) {

	// 1. If Pod priority is not enabled, or the scheduler is configured with DisablePreemption, skip preemption
	if !util.PodPriorityEnabled() || sched.config.DisablePreemption {
		klog.V(3).Infof("Pod priority feature is not enabled or preemption is disabled by scheduler configuration." +
			" No preemption is performed.")
		return "", nil
	}
	// 2. Fetch the up-to-date state of the preemptor Pod
	preemptor, err := sched.config.PodPreemptor.GetUpdatedPod(preemptor)
	if err != nil {
		klog.Errorf("Error getting the updated preemptor pod object: %v", err)
		return "", err
	}

	// 3. Ask the configured algorithm for the preemption result. Four values come back:
	//    1. the Node chosen for preemption
	//    2. the list of lower-priority victim Pods that must be deleted
	//    3. the nominatedPods whose nomination should be cleared
	//    4. an error, if any
	node, victims, nominatedPodsToClear, err := sched.config.Algorithm.Preempt(preemptor, sched.config.NodeLister, scheduleErr)
	metrics.PreemptionVictims.Set(float64(len(victims)))
	if err != nil {
		klog.Errorf("Error preempting victims to make room for %v/%v.", preemptor.Namespace, preemptor.Name)
		return "", err
	}
	var nodeName = ""
	if node != nil {
		// 1. Tie the Pod to the Node (the Pod's nominated node name is set) and build the apiserver update
		// 2. Delete each of the victim Pods one by one
		// Preemption only succeeds if both of these complete
		nodeName = node.Name
		// Update the scheduling queue with the nominated pod information. Without
		// this, there would be a race condition between the next scheduling cycle
		// and the time the scheduler receives a Pod Update for the nominated pod.
		sched.config.SchedulingQueue.UpdateNominatedPodForNode(preemptor, nodeName)

		// Make a call to update nominated node name of the pod on the API server.
		err = sched.config.PodPreemptor.SetNominatedNodeName(preemptor, nodeName)
		if err != nil {
			klog.Errorf("Error in preemption process. Cannot update pod %v/%v annotations: %v", preemptor.Namespace, preemptor.Name, err)
			sched.config.SchedulingQueue.DeleteNominatedPodIfExists(preemptor)
			return "", err
		}

		for _, victim := range victims {
			if err := sched.config.PodPreemptor.DeletePod(victim); err != nil {
				klog.Errorf("Error preempting pod %v/%v: %v", victim.Namespace, victim.Name, err)
				return "", err
			}
			sched.config.Recorder.Eventf(victim, v1.EventTypeNormal, "Preempted", "by %v/%v on node %v", preemptor.Namespace, preemptor.Name, nodeName)
		}
	}
	// Clearing nominated pods should happen outside of "if node != nil". Node could
	// be nil when a pod with nominated node name is eligible to preempt again,
	// but preemption logic does not find any node for it. In that case Preempt()
	// function of generic_scheduler.go returns the pod itself for removal of the annotation.
	// 4. Clear the nominated node name of these Pods; failures here are not critical and do not affect the overall result
	for _, p := range nominatedPodsToClear {
		rErr := sched.config.PodPreemptor.RemoveNominatedNodeName(p)
		if rErr != nil {
			klog.Errorf("Cannot remove nominated node annotation of pod: %v", rErr)
			// We do not return as this error is not critical.
		}
	}
	return nodeName, err
}

The overall structure is quite clear and consists of the following steps:
1. Decide whether preemption should run at all, based on two checks (PodPriority must be enabled, and the scheduler must not be configured with DisablePreemption); both must hold;
2. Fetch the up-to-date state of the Pod that wants to preempt;
3. Obtain the preemption result from the configured algorithm's preemption strategy (the core step);
4. Finish up (update the Pod, delete the lower-priority victim Pods, and clean up resources such as nominatedPods).
The heart of the whole process is the algorithm call that produces the preemption result. As with the predicates and priorities, the default implementation lives in generic_scheduler.go, in the Preempt method.
Preempt returns four values:
1) the Node selected for preemption;
2) the list of preempted (victim) Pods to be deleted;
3) the nominatedPods whose nomination is to be cleared;
4) an error, if any.
Preempt performs the following steps:
1. From the nodes that failed the predicates, collect those where preemption might still help; a switch statement weeds out the nodes that cannot serve as preemption candidates.

Any node whose predicate failure is one of the reasons listed in that switch cannot be used as a preemption candidate;
2. Fetch the list of PDBs (PodDisruptionBudgets), which are used as a criterion in the later steps;
3. Call selectNodesForPreemption to work out which Nodes preemption could succeed on. All candidate Nodes are evaluated in parallel through ParallelizeUntil, along the call path checkNode --> selectVictimsOnNode --> podFitsOnNode, so in the end the same podFitsOnNode check as normal scheduling is reused. Unlike a normal predicate run, preemption first looks at Pod priorities and only calls podFitsOnNode after the lower-priority Pods have been removed, which is what produces the preemption effect. selectNodesForPreemption returns a map whose key is the Node and whose value describes what would have to be removed if that Node were chosen, namely the victim Pods and the associated PDB-violation information;

4. With the feasible preemption Nodes in hand, filter them further through any extenders that support preemption;
5. Choose the final preemption Node by calling pickOneNodeForPreemption, which is based on five rules:
a) the Node with the smallest number of PDB violations;
b) the Node whose victims have the lowest priority, i.e. the highest-priority Pod that would be evicted there is less important than on any other candidate Node;
c) if still tied, compare the sums of the priorities of all victims (the lower-priority Pods to be deleted);
d) if the priority sums are still equal, pick the Node with the fewest victims;
e) if there is still a tie, pick the first such Node (the ordering is effectively arbitrary);
6. Once the final Node is chosen, record the lower-priority NominatedPods on that Node; these Pods have not been scheduled yet, so their nomination must be removed and they will be rescheduled. The code is as follows:

// preempt finds nodes with pods that can be preempted to make room for "pod" to
// schedule. It chooses one of the nodes and preempts the pods on the node and
// returns 1) the node, 2) the list of preempted pods if such a node is found,
// 3) A list of pods whose nominated node name should be cleared, and 4) any
// possible error.
// Preempt does not update its snapshot. It uses the same snapshot used in the
// scheduling cycle. This is to avoid a scenario where preempt finds feasible
// nodes without preempting any pod. When there are many pending pods in the
// scheduling queue a nominated pod will go back to the queue and behind
// other pods with the same priority. The nominated pod prevents other pods from
// using the nominated resources and the nominated pod could take a long time
// before it is retried after many other pending pods.
func (g *genericScheduler) Preempt(pod *v1.Pod, nodeLister algorithm.NodeLister, scheduleErr error) (*v1.Node, []*v1.Pod, []*v1.Pod, error) {
	// Scheduler may return various types of errors. Consider preemption only if
	// the error is of type FitError.
	fitError, ok := scheduleErr.(*FitError)
	if !ok || fitError == nil {
		return nil, nil, nil, nil
	}
	if !podEligibleToPreemptOthers(pod, g.cachedNodeInfoMap) {
		klog.V(5).Infof("Pod %v/%v is not eligible for more preemption.", pod.Namespace, pod.Name)
		return nil, nil, nil, nil
	}
	allNodes, err := nodeLister.List()
	if err != nil {
		return nil, nil, nil, err
	}
	if len(allNodes) == 0 {
		return nil, nil, nil, ErrNoNodesAvailable
	}
	// 1. Collect the nodes that failed the predicates but where preemption might still help (the preemption node is always chosen from this candidate set)
	potentialNodes := nodesWherePreemptionMightHelp(allNodes, fitError.FailedPredicates)
	if len(potentialNodes) == 0 {
		klog.V(3).Infof("Preemption will not help schedule pod %v/%v on any node.", pod.Namespace, pod.Name)
		// In this case, we should clean-up any existing nominated node name of the pod.
		return nil, nil, []*v1.Pod{pod}, nil
	}
	// 2. Fetch the list of PDBs (PodDisruptionBudgets)
	pdbs, err := g.pdbLister.List(labels.Everything())
	if err != nil {
		return nil, nil, nil, err
	}
	// 3. For every candidate Node, work out which Pods would have to be preempted on it
	nodeToVictims, err := selectNodesForPreemption(pod, g.cachedNodeInfoMap, potentialNodes, g.predicates,
		g.predicateMetaProducer, g.schedulingQueue, pdbs)
	if err != nil {
		return nil, nil, nil, err
	}

	// We will only check nodeToVictims with extenders that support preemption.
	// Extenders which do not support preemption may later prevent preemptor from being scheduled on the nominated
	// node. In that case, scheduler will find a different host for the preemptor in subsequent scheduling cycles.
	// 4. Let extenders that support preemption filter the candidates further
	nodeToVictims, err = g.processPreemptionWithExtenders(pod, nodeToVictims)
	if err != nil {
		return nil, nil, nil, err
	}

	// 5. Pick one Node
	candidateNode := pickOneNodeForPreemption(nodeToVictims)
	if candidateNode == nil {
		return nil, nil, nil, err
	}

	// Lower priority pods nominated to run on this node, may no longer fit on
	// this node. So, we should remove their nomination. Removing their
	// nomination updates these pods and moves them to the active queue. It
	// lets scheduler find another place for them.
	// 6. Find the lower-priority nominated Pods on that Node whose nomination needs to be cleared
	nominatedPods := g.getLowerPriorityNominatedPods(pod, candidateNode.Name)
	if nodeInfo, ok := g.cachedNodeInfoMap[candidateNode.Name]; ok {
		return nodeInfo.Node(), nodeToVictims[candidateNode].Pods, nominatedPods, err
	}

	return nil, nil, nil, fmt.Errorf(
		"preemption failed: the target node %s has been deleted from scheduler cache",
		candidateNode.Name)
}

In summary, what preemption hinges on is Pod priority. Unlike normal scheduling, preemption explicitly ranks Pods by priority, and that ranking is what makes preemption possible.

Leader Election

When the scheduler starts, it checks whether leader election is required. Configuring it is simple: just add --leader-elect=true to the configuration. In the code, if leader election is detected, the scheduler first takes part in the election, and only the instance that becomes the leader gets to perform the scheduling work.
The election code is fairly compact and lives in the client-go package, at client-go/tools/leaderelection/leaderelection.go.
Three calls matter most: le.acquire(ctx), le.renew(ctx), and le.config.Callbacks.OnStartedLeading(ctx).
acquire reports whether the election was won; only after it succeeds do OnStartedLeading and renew run. OnStartedLeading is a callback, and what it executes is the scheduler's run function.
renew takes care of refreshing the leadership: once a scheduler instance has been elected, it must keep renewing its record so that it can be verified the leader is still functioning. A minimal wiring of these callbacks is sketched below.
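
For orientation, here is a minimal, hypothetical client-go leader-election setup in the style of that era; the kubeconfig path, namespace, lock name and identity are all placeholders, and the real kube-scheduler wires this up through its own configuration plumbing:

package main

import (
	"context"
	"log"
	"time"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/kubernetes/scheme"
	"k8s.io/client-go/tools/clientcmd"
	"k8s.io/client-go/tools/leaderelection"
	"k8s.io/client-go/tools/leaderelection/resourcelock"
	"k8s.io/client-go/tools/record"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", "/path/to/kubeconfig") // placeholder path
	if err != nil {
		log.Fatal(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	// Event recorder used by the lock to emit "became leader" events.
	recorder := record.NewBroadcaster().NewRecorder(scheme.Scheme, corev1.EventSource{Component: "my-scheduler"})

	// An Endpoints-based lock, mirroring the kube-scheduler endpoint in kube-system.
	lock := &resourcelock.EndpointsLock{
		EndpointsMeta: metav1.ObjectMeta{Namespace: "kube-system", Name: "my-scheduler"},
		Client:        client.CoreV1(),
		LockConfig:    resourcelock.ResourceLockConfig{Identity: "scheduler-instance-1", EventRecorder: recorder},
	}

	leaderelection.RunOrDie(context.Background(), leaderelection.LeaderElectionConfig{
		Lock:          lock,
		LeaseDuration: 15 * time.Second,
		RenewDeadline: 10 * time.Second,
		RetryPeriod:   2 * time.Second,
		Callbacks: leaderelection.LeaderCallbacks{
			// This is where kube-scheduler would start its scheduling loop.
			OnStartedLeading: func(ctx context.Context) { log.Println("became leader, start scheduling") },
			// Fired when the lease is lost; the real scheduler exits here.
			OnStoppedLeading: func() { log.Println("lost leadership") },
		},
	})
}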
Both acquire and renew end up calling the same method, tryAcquireOrRenew, which is the core of the whole election.
tryAcquireOrRenew, as its name says, acquires the leader lease if it has not been acquired yet and otherwise renews it. It consists of three parts:
1. Call Get to check whether an ElectionRecord already exists. If it does not, Create a new Endpoints object; the current instance becomes the scheduler leader and the election has succeeded. Otherwise fall through to the update path;
2. If a record was found, this is a lease renewal: verify the holder's identity and the timestamps to decide whether the lease may be renewed;
3. Perform the Update call to refresh the leadership record.

// tryAcquireOrRenew tries to acquire a leader lease if it is not already acquired,
// else it tries to renew the lease if it has already been acquired. Returns true
// on success else returns false.
func (le *LeaderElector) tryAcquireOrRenew() bool {
	now := metav1.Now()
	leaderElectionRecord := rl.LeaderElectionRecord{
		HolderIdentity:       le.config.Lock.Identity(),
		LeaseDurationSeconds: int(le.config.LeaseDuration / time.Second),
		RenewTime:            now,
		AcquireTime:          now,
	}

	// 1. obtain or create the ElectionRecord
	// 1. Get the Endpoints object to obtain oldLeaderElectionRecord
	oldLeaderElectionRecord, err := le.config.Lock.Get()
	if err != nil {
		if !errors.IsNotFound(err) {
			klog.Errorf("error retrieving resource lock %v: %v", le.config.Lock.Describe(), err)
			return false
		}
		// No record yet: create a new Endpoints object and take leadership
		if err = le.config.Lock.Create(leaderElectionRecord); err != nil {
			klog.Errorf("error initially creating leader election record: %v", err)
			return false
		}
		le.observedRecord = leaderElectionRecord
		le.observedTime = le.clock.Now()
		return true
	}

	// 2. Record obtained, check the Identity & Time
	// 2. A record exists: check the identity and timestamps to decide whether renewal is legitimate
	if !reflect.DeepEqual(le.observedRecord, *oldLeaderElectionRecord) {
		le.observedRecord = *oldLeaderElectionRecord
		le.observedTime = le.clock.Now()
	}
	if le.observedTime.Add(le.config.LeaseDuration).After(now.Time) &&
		!le.IsLeader() {
		klog.V(4).Infof("lock is held by %v and has not yet expired", oldLeaderElectionRecord.HolderIdentity)
		return false
	}

	// 3. We're going to try to update. The leaderElectionRecord is set to it's default
	// here. Let's correct it before updating.
	if le.IsLeader() {
		leaderElectionRecord.AcquireTime = oldLeaderElectionRecord.AcquireTime
		leaderElectionRecord.LeaderTransitions = oldLeaderElectionRecord.LeaderTransitions
	} else {
		leaderElectionRecord.LeaderTransitions = oldLeaderElectionRecord.LeaderTransitions + 1
	}

	// update the lock itself
	if err = le.config.Lock.Update(leaderElectionRecord); err != nil {
		klog.Errorf("Failed to update lock: %v", err)
		return false
	}
	le.observedRecord = leaderElectionRecord
	le.observedTime = le.clock.Now()
	return true
}

The scheduler's leader election is fairly simple: it comes down to whether the Endpoints record stored in etcd can be updated, and the whole mechanism relies on etcd's guarantees to keep the distributed operation consistent and unique. The corresponding endpoint, kube-scheduler, can be inspected in the kube-system namespace.

