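kube-scheduler source code analysis (1): initialization and startup
Introduction to kube-scheduler
kube-scheduler is one of the core components of Kubernetes. It is responsible for scheduling pods: using its scheduling algorithms (predicates, which filter out unsuitable nodes, and priorities, which score the remaining ones), it assigns each unscheduled pod to the most suitable node.
kube-scheduler architecture
The rough composition and processing flow of kube-scheduler are as follows. kube-scheduler list/watches objects such as pods and nodes. Through informers it puts unscheduled pods into the pending pod queue and builds the scheduler cache (used to quickly look up nodes and other objects). The sched.scheduleOne method holds the core pod scheduling logic: it takes one pod from the pending queue, runs the predicate and priority algorithms, and selects the best node. It then updates the cache and asynchronously performs the bind operation, that is, it sets the pod's nodeName field. At that point the scheduling of the pod is complete.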
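The flow described above (take a pod from the queue, filter, score, bind) can be sketched in a few lines of Go. This is only an illustration under simplifying assumptions: the Pod and Node types, the single CPU-based predicate, and the "most free CPU wins" priority are hypothetical and are not the real kube-scheduler types or algorithms.

```go
package main

import "fmt"

// Node and Pod are simplified, hypothetical stand-ins for the real API objects.
type Node struct {
	Name    string
	FreeCPU int // free CPU in millicores
}

type Pod struct {
	Name     string
	CPU      int // requested CPU in millicores
	NodeName string
}

// filter is the predicate phase: keep only nodes with enough free CPU.
func filter(pod Pod, nodes []Node) []Node {
	var feasible []Node
	for _, n := range nodes {
		if n.FreeCPU >= pod.CPU {
			feasible = append(feasible, n)
		}
	}
	return feasible
}

// pickBest is the priority phase: prefer the feasible node with the most free CPU.
func pickBest(feasible []Node) (Node, bool) {
	if len(feasible) == 0 {
		return Node{}, false
	}
	best := feasible[0]
	for _, n := range feasible[1:] {
		if n.FreeCPU > best.FreeCPU {
			best = n
		}
	}
	return best, true
}

// scheduleOne pops one pod from the queue, filters, scores, and "binds" it.
func scheduleOne(queue *[]Pod, nodes []Node) (Pod, error) {
	if len(*queue) == 0 {
		return Pod{}, fmt.Errorf("queue is empty")
	}
	pod := (*queue)[0]
	*queue = (*queue)[1:]
	best, ok := pickBest(filter(pod, nodes))
	if !ok {
		return pod, fmt.Errorf("no node fits pod %s", pod.Name)
	}
	pod.NodeName = best.Name // binding boils down to setting nodeName
	return pod, nil
}

func main() {
	queue := []Pod{{Name: "web", CPU: 500}}
	nodes := []Node{
		{Name: "node-a", FreeCPU: 300},
		{Name: "node-b", FreeCPU: 800},
		{Name: "node-c", FreeCPU: 2000},
	}
	pod, err := scheduleOne(&queue, nodes)
	fmt.Println(pod.NodeName, err)
}
```

In this sketch node-a is filtered out by the predicate, and node-c wins the scoring phase because it has the most free CPU.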
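The analysis of kube-scheduler is split into two parts:
(1) kube-scheduler initialization and startup;
(2) kube-scheduler core processing logic.
This article covers initialization and startup; the next one covers the core processing logic.
1. kube-scheduler initialization and startup
Based on tag v1.17.4
https://github.com/kubernetes/kubernetes/releases/tag/v1.17.4
We start from the NewSchedulerCommand function, which serves as the entry point for the initialization and startup analysis.
NewSchedulerCommand
Main logic of NewSchedulerCommand:
(1) initialize the default values of the component's startup parameters;
(2) define the run function of the kube-scheduler command, runCommand (runCommand eventually calls Run to start kube-scheduler; Run is analyzed below);
(3) parse the command-line startup parameters of kube-scheduler.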
// cmd/kube-scheduler/app/server.go
func NewSchedulerCommand(registryOptions ...Option) *cobra.Command {
// 1. Initialize the default values of the component's startup parameters
opts, err := options.NewOptions()
if err != nil {
klog.Fatalf("unable to initialize command options: %v", err)
}
// 2. Define the run function of the kube-scheduler command, runCommand
cmd := &cobra.Command{
Use: "kube-scheduler",
Long: `The Kubernetes scheduler is a policy-rich, topology-aware,
workload-specific function that significantly impacts availability, performance,
and capacity. The scheduler needs to take into account individual and collective
resource requirements, quality of service requirements, hardware/software/policy
constraints, affinity and anti-affinity specifications, data locality, inter-workload
interference, deadlines, and so on. Workload-specific requirements will be exposed
through the API as necessary.`,
Run: func(cmd *cobra.Command, args []string) {
if err := runCommand(cmd, args, opts, registryOptions...); err != nil {
fmt.Fprintf(os.Stderr, "%v\n", err)
os.Exit(1)
}
},
}
// 3. Parse the command-line startup parameters
fs := cmd.Flags()
namedFlagSets := opts.Flags()
verflag.AddFlags(namedFlagSets.FlagSet("global"))
globalflag.AddGlobalFlags(namedFlagSets.FlagSet("global"), cmd.Name())
for _, f := range namedFlagSets.FlagSets {
fs.AddFlagSet(f)
}
...
}
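The three steps of NewSchedulerCommand (default options, a run closure, merged flag sets) can be sketched with the standard library alone. The sketch below uses Go's flag package instead of cobra, and the options struct, the flag names, and the default values are hypothetical placeholders chosen for illustration, not the real kube-scheduler options.

```go
package main

import (
	"flag"
	"fmt"
)

// options is a hypothetical, much smaller stand-in for options.Options.
type options struct {
	port        int
	leaderElect bool
}

// newOptions returns the default values (step 1). The values are placeholders.
func newOptions() *options {
	return &options{port: 10251, leaderElect: true}
}

// flagSets groups flags by name, loosely like the named flag sets in the listing.
func (o *options) flagSets() map[string]*flag.FlagSet {
	serving := flag.NewFlagSet("serving", flag.ContinueOnError)
	serving.IntVar(&o.port, "port", o.port, "port to serve on")
	le := flag.NewFlagSet("leader election", flag.ContinueOnError)
	le.BoolVar(&o.leaderElect, "leader-elect", o.leaderElect, "enable leader election")
	return map[string]*flag.FlagSet{"serving": serving, "leader election": le}
}

// command bundles the merged flag set with the run closure (step 2).
type command struct {
	flags *flag.FlagSet
	run   func() error
}

func newCommand(run func(o *options) error) *command {
	opts := newOptions()
	fs := flag.NewFlagSet("scheduler-sketch", flag.ContinueOnError)
	// Step 3: merge every named flag set into the command's own flag set.
	// Sharing f.Value means parsing fs writes straight into opts.
	for _, set := range opts.flagSets() {
		set.VisitAll(func(f *flag.Flag) { fs.Var(f.Value, f.Name, f.Usage) })
	}
	return &command{flags: fs, run: func() error { return run(opts) }}
}

// execute parses the arguments and then invokes the run closure.
func (c *command) execute(args []string) error {
	if err := c.flags.Parse(args); err != nil {
		return err
	}
	return c.run()
}

func main() {
	cmd := newCommand(func(o *options) error {
		fmt.Printf("port=%d leaderElect=%t\n", o.port, o.leaderElect)
		return nil
	})
	if err := cmd.execute([]string{"--port=10260", "--leader-elect=false"}); err != nil {
		fmt.Println("error:", err)
	}
}
```

The point of the design is that the options object is created once with defaults, the flags write into it during parsing, and the run closure only sees the final values.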
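runCommand
runCommand is the function executed by the kube-scheduler command. Two parts of its logic matter here:
(1) it calls algorithmprovider.ApplyFeatureGates, which additionally registers the corresponding predicates and priorities depending on which FeatureGates are enabled;
(2) it calls Run to start kube-scheduler.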
// cmd/kube-scheduler/app/server.go
// runCommand runs the scheduler.
func runCommand(cmd *cobra.Command, args []string, opts *options.Options, registryOptions ...Option) error {
...
// Apply algorithms based on feature gates.
// TODO: make configurable?
algorithmprovider.ApplyFeatureGates()
// Configz registration.
if cz, err := configz.New("componentconfig"); err == nil {
cz.Set(cc.ComponentConfig)
} else {
return fmt.Errorf("unable to register configz: %s", err)
}
ctx, cancel := context.WithCancel(context.Background())
defer cancel()
return Run(ctx, cc, registryOptions...)
}
1.1 algorithmprovider.ApplyFeatureGates
Depending on which FeatureGates are enabled, this additionally registers the corresponding predicates and priorities.
// pkg/scheduler/algorithmprovider/plugins.go
import (
"k8s.io/kubernetes/pkg/scheduler/algorithmprovider/defaults"
)
func ApplyFeatureGates() func() {
return defaults.ApplyFeatureGates()
}
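1.1.1 init
plugins.go imports the defaults package, so before looking at defaults.ApplyFeatureGates we first look at the init functions of the defaults package. They register the built-in scheduling algorithms, both predicates and priorities.
(1) First, the init function in defaults.go of the defaults package.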
// pkg/scheduler/algorithmprovider/defaults/defaults.go
func init() {
registerAlgorithmProvider(defaultPredicates(), defaultPriorities())
}
Predicates:
// pkg/scheduler/algorithmprovider/defaults/defaults.go
func defaultPredicates() sets.String {
return sets.NewString(
predicates.NoVolumeZoneConflictPred,
predicates.MaxEBSVolumeCountPred,
predicates.MaxGCEPDVolumeCountPred,
predicates.MaxAzureDiskVolumeCountPred,
predicates.MaxCSIVolumeCountPred,
predicates.MatchInterPodAffinityPred,
predicates.NoDiskConflictPred,
predicates.GeneralPred,
predicates.PodToleratesNodeTaintsPred,
predicates.CheckVolumeBindingPred,
predicates.CheckNodeUnschedulablePred,
)
}
Priorities:
// pkg/scheduler/algorithmprovider/defaults/defaults.go
func defaultPriorities() sets.String {
return sets.NewString(
priorities.SelectorSpreadPriority,
priorities.InterPodAffinityPriority,
priorities.LeastRequestedPriority,
priorities.BalancedResourceAllocation,
priorities.NodePreferAvoidPodsPriority,
priorities.NodeAffinityPriority,
priorities.TaintTolerationPriority,
priorities.ImageLocalityPriority,
)
}
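The registerAlgorithmProvider function registers the algorithm providers. An algorithm provider stores the list of scheduling algorithms of every type, both predicates and priorities (only the algorithm keys are stored, not the algorithms themselves).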
// pkg/scheduler/algorithmprovider/defaults/defaults.go
func registerAlgorithmProvider(predSet, priSet sets.String) {
// Registers algorithm providers. By default we use 'DefaultProvider', but user can specify one to be used
// by specifying flag.
scheduler.RegisterAlgorithmProvider(scheduler.DefaultProvider, predSet, priSet)
// Cluster autoscaler friendly scheduling algorithm.
scheduler.RegisterAlgorithmProvider(ClusterAutoscalerProvider, predSet,
copyAndReplace(priSet, priorities.LeastRequestedPriority, priorities.MostRequestedPriority))
}
The registered algorithm providers end up in the variable algorithmProviderMap (which holds the key lists of all scheduling algorithms), a global variable of that package.
// pkg/scheduler/algorithm_factory.go
// RegisterAlgorithmProvider registers a new algorithm provider with the algorithm registry.
func RegisterAlgorithmProvider(name string, predicateKeys, priorityKeys sets.String) string {
schedulerFactoryMutex.Lock()
defer schedulerFactoryMutex.Unlock()
validateAlgorithmNameOrDie(name)
algorithmProviderMap[name] = AlgorithmProviderConfig{
FitPredicateKeys: predicateKeys,
PriorityFunctionKeys: priorityKeys,
}
return name
}
// pkg/scheduler/algorithm_factory.go
var (
...
algorithmProviderMap = make(map[string]AlgorithmProviderConfig)
...
)
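To make the "keys only" idea concrete, here is a minimal sketch of a provider registry guarded by a mutex. Everything in it is a simplified assumption: keySet stands in for sets.String, the key strings are illustrative, and copyAndReplace is a guess at what the helper of the same name in the listing does (copy the set and swap one key).

```go
package main

import (
	"fmt"
	"sort"
	"sync"
)

// keySet is a tiny stand-in for sets.String.
type keySet map[string]struct{}

func newKeySet(keys ...string) keySet {
	s := make(keySet, len(keys))
	for _, k := range keys {
		s[k] = struct{}{}
	}
	return s
}

// sorted returns the keys in alphabetical order.
func (s keySet) sorted() []string {
	out := make([]string, 0, len(s))
	for k := range s {
		out = append(out, k)
	}
	sort.Strings(out)
	return out
}

// providerConfig stores only algorithm keys, never the algorithms themselves.
type providerConfig struct {
	predicateKeys keySet
	priorityKeys  keySet
}

var (
	registryMu  sync.Mutex
	providerMap = make(map[string]providerConfig)
)

// registerProvider stores the two key sets under the provider name.
func registerProvider(name string, preds, pris keySet) string {
	registryMu.Lock()
	defer registryMu.Unlock()
	providerMap[name] = providerConfig{predicateKeys: preds, priorityKeys: pris}
	return name
}

// copyAndReplace copies the set and swaps one key for another if it is present.
func copyAndReplace(s keySet, from, to string) keySet {
	out := newKeySet(s.sorted()...)
	if _, ok := out[from]; ok {
		delete(out, from)
		out[to] = struct{}{}
	}
	return out
}

func main() {
	preds := newKeySet("GeneralPredicates", "NoDiskConflict")
	pris := newKeySet("LeastRequestedPriority", "ImageLocalityPriority")
	registerProvider("DefaultProvider", preds, pris)
	registerProvider("ClusterAutoscalerProvider", preds,
		copyAndReplace(pris, "LeastRequestedPriority", "MostRequestedPriority"))

	for _, name := range []string{"DefaultProvider", "ClusterAutoscalerProvider"} {
		fmt.Println(name, providerMap[name].priorityKeys.sorted())
	}
}
```

The two providers share the same predicate keys and differ in a single priority key, which mirrors how the listing derives the cluster-autoscaler provider from the default one.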
(2) Next, the init function in register_predicates.go of the defaults package, which registers the predicates.
// pkg/scheduler/algorithmprovider/defaults/register_predicates.go
func init() {
...
// Fit is defined based on the absence of port conflicts.
// This predicate is actually a default predicate, because it is invoked from
// predicates.GeneralPredicates()
scheduler.RegisterFitPredicate(predicates.PodFitsHostPortsPred, predicates.PodFitsHostPorts)
// Fit is determined by resource availability.
// This predicate is actually a default predicate, because it is invoked from
// predicates.GeneralPredicates()
scheduler.RegisterFitPredicate(predicates.PodFitsResourcesPred, predicates.PodFitsResources)
...
}
(3) Finally, the init function in register_priorities.go of the defaults package, which registers the priorities.
// pkg/scheduler/algorithmprovider/defaults/register_priorities.go
func init() {
...
// Prioritize nodes by least requested utilization.
scheduler.RegisterPriorityMapReduceFunction(priorities.LeastRequestedPriority, priorities.LeastRequestedPriorityMap, nil, 1)
// Prioritizes nodes to help achieve balanced resource usage
scheduler.RegisterPriorityMapReduceFunction(priorities.BalancedResourceAllocation, priorities.BalancedResourceAllocationMap, nil, 1)
...
}
Both kinds of registration end up in global variables: registered predicates are stored in fitPredicateMap, and registered priorities are stored in priorityFunctionMap.
// pkg/scheduler/algorithm_factory.go
var (
...
fitPredicateMap = make(map[string]FitPredicateFactory)
...
priorityFunctionMap = make(map[string]PriorityConfigFactory)
...
)
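The relationship between the key lists held by a provider and the function maps filled by init can be sketched as follows. This is a simplified illustration: the pod and node types, the registry names, and the predicate bodies are hypothetical, and the real factories take more arguments than the plain functions used here.

```go
package main

import "fmt"

type pod struct {
	cpu      int // requested CPU in millicores
	hostPort int // 0 means no host port requested
}

type node struct {
	name      string
	freeCPU   int
	usedPorts map[int]bool
}

type fitPredicate func(p pod, n node) bool

type priorityConfig struct {
	score  func(p pod, n node) int
	weight int
}

// Global registries keyed by algorithm name (hypothetical names for the sketch).
var (
	predicateRegistry = make(map[string]fitPredicate)
	priorityRegistry  = make(map[string]priorityConfig)
)

func registerFitPredicate(name string, f fitPredicate) string {
	predicateRegistry[name] = f
	return name
}

func registerPriority(name string, score func(p pod, n node) int, weight int) string {
	priorityRegistry[name] = priorityConfig{score: score, weight: weight}
	return name
}

// init runs when the package is loaded, so importing it is enough to register.
func init() {
	registerFitPredicate("PodFitsResources", func(p pod, n node) bool { return n.freeCPU >= p.cpu })
	registerFitPredicate("PodFitsHostPorts", func(p pod, n node) bool {
		return p.hostPort == 0 || !n.usedPorts[p.hostPort]
	})
	registerPriority("LeastRequested", func(p pod, n node) int { return n.freeCPU - p.cpu }, 1)
}

// fits resolves each key to its registered function and requires all to pass.
func fits(keys []string, p pod, n node) (bool, error) {
	for _, k := range keys {
		pred, ok := predicateRegistry[k]
		if !ok {
			return false, fmt.Errorf("predicate %q is not registered", k)
		}
		if !pred(p, n) {
			return false, nil
		}
	}
	return true, nil
}

// totalScore sums weight*score over the selected priority keys.
func totalScore(keys []string, p pod, n node) (int, error) {
	sum := 0
	for _, k := range keys {
		cfg, ok := priorityRegistry[k]
		if !ok {
			return 0, fmt.Errorf("priority %q is not registered", k)
		}
		sum += cfg.weight * cfg.score(p, n)
	}
	return sum, nil
}

func main() {
	p := pod{cpu: 500, hostPort: 8080}
	n := node{name: "node-a", freeCPU: 1000, usedPorts: map[int]bool{8080: true}}
	ok, err := fits([]string{"PodFitsResources", "PodFitsHostPorts"}, p, n)
	fmt.Println(ok, err)
}
```

Here the node has enough CPU but port 8080 is already taken, so the pod does not fit. Keeping keys and functions in separate maps lets a provider or a user policy select algorithms by name without touching the implementations.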
1.1.2 defaults.ApplyFeatureGates
This function checks whether specific FeatureGates are enabled and, if so, additionally registers the corresponding predicates and priorities.
// pkg/scheduler/algorithmprovider/defaults/defaults.go
func ApplyFeatureGates() (restore func()) {
...
// Only register EvenPodsSpread predicate & priority if the feature is enabled
if utilfeature.DefaultFeatureGate.Enabled(features.EvenPodsSpread) {
klog.Infof("Registering EvenPodsSpread predicate and priority function")
// register predicate
scheduler.InsertPredicateKeyToAlgorithmProviderMap(predicates.EvenPodsSpreadPred)
scheduler.RegisterFitPredicate(predicates.EvenPodsSpreadPred, predicates.EvenPodsSpreadPredicate)
// register priority
scheduler.InsertPriorityKeyToAlgorithmProviderMap(priorities.EvenPodsSpreadPriority)
scheduler.RegisterPriorityMapReduceFunction(
priorities.EvenPodsSpreadPriority,
priorities.CalculateEvenPodsSpreadPriorityMap,
priorities.CalculateEvenPodsSpreadPriorityReduce,
1,
)
}
// Prioritizes nodes that satisfy pod's resource limits
if utilfeature.DefaultFeatureGate.Enabled(features.ResourceLimitsPriorityFunction) {
klog.Infof("Registering resourcelimits priority function")
scheduler.RegisterPriorityMapReduceFunction(priorities.ResourceLimitsPriority, priorities.ResourceLimitsPriorityMap, nil, 1)
// Register the priority function to specific provider too.
scheduler.InsertPriorityKeyToAlgorithmProviderMap(scheduler.RegisterPriorityMapReduceFunction(priorities.ResourceLimitsPriority, priorities.ResourceLimitsPriorityMap, nil, 1))
}
...
}
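The signature above returns a function named restore, while the listing elides what that function does. A minimal sketch of the pattern, assuming restore simply puts the registry back into the state it had before the feature-gated registrations, could look like this. The featureGates and registered maps are hypothetical stand-ins for the real feature gate and algorithm registries.

```go
package main

import "fmt"

// featureGates is a hypothetical stand-in for utilfeature.DefaultFeatureGate.
var featureGates = map[string]bool{"EvenPodsSpread": false}

// registered is a hypothetical stand-in for the global predicate key registry.
var registered = map[string]bool{"GeneralPredicates": true}

// applyFeatureGates registers extra algorithms for enabled gates and returns
// a function that puts the registry back the way it was (an assumption here).
func applyFeatureGates() (restore func()) {
	snapshot := make(map[string]bool, len(registered))
	for k, v := range registered {
		snapshot[k] = v
	}
	if featureGates["EvenPodsSpread"] {
		registered["EvenPodsSpread"] = true
	}
	return func() { registered = snapshot }
}

func main() {
	featureGates["EvenPodsSpread"] = true
	restore := applyFeatureGates()
	fmt.Println("after apply:", registered["EvenPodsSpread"])
	restore()
	fmt.Println("after restore:", registered["EvenPodsSpread"])
}
```

A restore function of this kind is mostly useful in tests, where one test case enables a gate and the next one should start from a clean registry.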
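1.2 Run
The Run function starts kube-scheduler according to the configuration parameters. Its core logic is:
(1) prepare the event client used to report the events produced by kube-scheduler to the API server;
(2) call scheduler.New to instantiate the scheduler object;
(3) start the event broadcaster;
(4) set up the health checks of kube-scheduler and start the HTTP servers for health checks and metrics;
(5) start the informers of all object types registered earlier and begin syncing resources;
(6) call WaitForCacheSync to wait until all informers have finished syncing, so that the local cache is consistent with the data served by the API server (which is backed by etcd);
(7) decide, based on the startup parameters, whether to enable leader election;
(8) call sched.Run to start kube-scheduler (sched.Run is the entry point for the core processing logic analysis in the next article).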
// cmd/kube-scheduler/app/server.go
func Run(ctx context.Context, cc schedulerserverconfig.CompletedConfig, outOfTreeRegistryOptions ...Option) error {
// To help debugging, immediately log version
klog.V(1).Infof("Starting Kubernetes Scheduler version %+v", version.Get())
outOfTreeRegistry := make(framework.Registry)
for _, option := range outOfTreeRegistryOptions {
if err := option(outOfTreeRegistry); err != nil {
return err
}
}
// 1. Prepare the event client used to report kube-scheduler events to the API server
// Prepare event clients.
if _, err := cc.Client.Discovery().ServerResourcesForGroupVersion(eventsv1beta1.SchemeGroupVersion.String()); err == nil {
cc.Broadcaster = events.NewBroadcaster(&events.EventSinkImpl{Interface: cc.EventClient.Events("")})
cc.Recorder = cc.Broadcaster.NewRecorder(scheme.Scheme, cc.ComponentConfig.SchedulerName)
} else {
recorder := cc.CoreBroadcaster.NewRecorder(scheme.Scheme, v1.EventSource{Component: cc.ComponentConfig.SchedulerName})
cc.Recorder = record.NewEventRecorderAdapter(recorder)
}
// 2. Call scheduler.New to instantiate the scheduler object
// Create the scheduler.
sched, err := scheduler.New(cc.Client,
cc.InformerFactory,
cc.PodInformer,
cc.Recorder,
ctx.Done(),
scheduler.WithName(cc.ComponentConfig.SchedulerName),
scheduler.WithAlgorithmSource(cc.ComponentConfig.AlgorithmSource),
scheduler.WithHardPodAffinitySymmetricWeight(cc.ComponentConfig.HardPodAffinitySymmetricWeight),
scheduler.WithPreemptionDisabled(cc.ComponentConfig.DisablePreemption),
scheduler.WithPercentageOfNodesToScore(cc.ComponentConfig.PercentageOfNodesToScore),
scheduler.WithBindTimeoutSeconds(cc.ComponentConfig.BindTimeoutSeconds),
scheduler.WithFrameworkOutOfTreeRegistry(outOfTreeRegistry),
scheduler.WithFrameworkPlugins(cc.ComponentConfig.Plugins),
scheduler.WithFrameworkPluginConfig(cc.ComponentConfig.PluginConfig),
scheduler.WithPodMaxBackoffSeconds(cc.ComponentConfig.PodMaxBackoffSeconds),
scheduler.WithPodInitialBackoffSeconds(cc.ComponentConfig.PodInitialBackoffSeconds),
)
if err != nil {
return err
}
// 3. Start the event broadcaster
// Prepare the event broadcaster.
if cc.Broadcaster != nil && cc.EventClient != nil {
cc.Broadcaster.StartRecordingToSink(ctx.Done())
}
if cc.CoreBroadcaster != nil && cc.CoreEventClient != nil {
cc.CoreBroadcaster.StartRecordingToSink(&corev1.EventSinkImpl{Interface: cc.CoreEventClient.Events("")})
}
// 4. Set up the kube-scheduler health checks and start the HTTP servers for health checks and metrics
// Setup healthz checks.
var checks []healthz.HealthChecker
if cc.ComponentConfig.LeaderElection.LeaderElect {
checks = append(checks, cc.LeaderElection.WatchDog)
}
// Start up the healthz server.
if cc.InsecureServing != nil {
separateMetrics := cc.InsecureMetricsServing != nil
handler := buildHandlerChain(newHealthzHandler(&cc.ComponentConfig, separateMetrics, checks...), nil, nil)
if err := cc.InsecureServing.Serve(handler, 0, ctx.Done()); err != nil {
return fmt.Errorf("failed to start healthz server: %v", err)
}
}
if cc.InsecureMetricsServing != nil {
handler := buildHandlerChain(newMetricsHandler(&cc.ComponentConfig), nil, nil)
if err := cc.InsecureMetricsServing.Serve(handler, 0, ctx.Done()); err != nil {
return fmt.Errorf("failed to start metrics server: %v", err)
}
}
if cc.SecureServing != nil {
handler := buildHandlerChain(newHealthzHandler(&cc.ComponentConfig, false, checks...), cc.Authentication.Authenticator, cc.Authorization.Authorizer)
// TODO: handle stoppedCh returned by c.SecureServing.Serve
if _, err := cc.SecureServing.Serve(handler, 0, ctx.Done()); err != nil {
// fail early for secure handlers, removing the old error loop from above
return fmt.Errorf("failed to start secure server: %v", err)
}
}
// 5. Start the informers of all object types registered earlier and begin syncing resources
// Start all informers.
go cc.PodInformer.Informer().Run(ctx.Done())
cc.InformerFactory.Start(ctx.Done())
// 6. Wait until all informers have synced, so the local cache matches the data served by the API server (backed by etcd)
// Wait for all caches to sync before scheduling.
cc.InformerFactory.WaitForCacheSync(ctx.Done())
// 7. Decide, based on the startup parameters, whether to enable leader election
// If leader election is enabled, runCommand via LeaderElector until done and exit.
if cc.LeaderElection != nil {
cc.LeaderElection.Callbacks = leaderelection.LeaderCallbacks{
OnStartedLeading: sched.Run,
OnStoppedLeading: func() {
klog.Fatalf("leaderelection lost")
},
}
leaderElector, err := leaderelection.NewLeaderElector(*cc.LeaderElection)
if err != nil {
return fmt.Errorf("couldn't create leader elector: %v", err)
}
leaderElector.Run(ctx)
return fmt.Errorf("lost lease")
}
// 8. Call sched.Run to start kube-scheduler
// Leader election is disabled, so runCommand inline until done.
sched.Run(ctx)
return fmt.Errorf("finished without leader elect")
}
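1.2.1 scheduler.New
Instantiating the scheduler object consists of three parts:
(1) instantiate the informers for objects such as pod, node, pvc and pv;
(2) call configurator.CreateFromProvider (which uses the built-in scheduling algorithms registered earlier) or configurator.CreateFromConfig (which uses a user-supplied scheduling policy) to instantiate the scheduler;
(3) register event handlers on the informers.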
// pkg/scheduler/scheduler.go
func New(client clientset.Interface,
informerFactory informers.SharedInformerFactory,
podInformer coreinformers.PodInformer,
recorder events.EventRecorder,
stopCh <-chan struct{},
opts ...Option) (*Scheduler, error) {
stopEverything := stopCh
if stopEverything == nil {
stopEverything = wait.NeverStop
}
options := defaultSchedulerOptions
for _, opt := range opts {
opt(&options)
}
// 1. Instantiate the informers for objects such as node, pvc and pv
schedulerCache := internalcache.New(30*time.Second, stopEverything)
volumeBinder := volumebinder.NewVolumeBinder(
client,
informerFactory.Core().V1().Nodes(),
informerFactory.Storage().V1().CSINodes(),
informerFactory.Core().V1().PersistentVolumeClaims(),
informerFactory.Core().V1().PersistentVolumes(),
informerFactory.Storage().V1().StorageClasses(),
time.Duration(options.bindTimeoutSeconds)*time.Second,
)
registry := options.frameworkDefaultRegistry
if registry == nil {
registry = frameworkplugins.NewDefaultRegistry(&frameworkplugins.RegistryArgs{
VolumeBinder: volumeBinder,
})
}
registry.Merge(options.frameworkOutOfTreeRegistry)
snapshot := nodeinfosnapshot.NewEmptySnapshot()
configurator := &Configurator{
client: client,
informerFactory: informerFactory,
podInformer: podInformer,
volumeBinder: volumeBinder,
schedulerCache: schedulerCache,
StopEverything: stopEverything,
hardPodAffinitySymmetricWeight: options.hardPodAffinitySymmetricWeight,
disablePreemption: options.disablePreemption,
percentageOfNodesToScore: options.percentageOfNodesToScore,
bindTimeoutSeconds: options.bindTimeoutSeconds,
podInitialBackoffSeconds: options.podInitialBackoffSeconds,
podMaxBackoffSeconds: options.podMaxBackoffSeconds,
enableNonPreempting: utilfeature.DefaultFeatureGate.Enabled(kubefeatures.NonPreemptingPriority),
registry: registry,
plugins: options.frameworkPlugins,
pluginConfig: options.frameworkPluginConfig,
pluginConfigProducerRegistry: options.frameworkConfigProducerRegistry,
nodeInfoSnapshot: snapshot,
algorithmFactoryArgs: AlgorithmFactoryArgs{
SharedLister: snapshot,
InformerFactory: informerFactory,
VolumeBinder: volumeBinder,
HardPodAffinitySymmetricWeight: options.hardPodAffinitySymmetricWeight,
},
configProducerArgs: &frameworkplugins.ConfigProducerArgs{},
}
metrics.Register()
// 2. Instantiate the scheduler via configurator.CreateFromProvider (built-in algorithms registered earlier) or configurator.CreateFromConfig (user-supplied policy)
var sched *Scheduler
source := options.schedulerAlgorithmSource
switch {
case source.Provider != nil:
// Create the config from a named algorithm provider.
sc, err := configurator.CreateFromProvider(*source.Provider)
if err != nil {
return nil, fmt.Errorf("couldn't create scheduler using provider %q: %v", *source.Provider, err)
}
sched = sc
case source.Policy != nil:
// Create the config from a user specified policy source.
policy := &schedulerapi.Policy{}
switch {
case source.Policy.File != nil:
if err := initPolicyFromFile(source.Policy.File.Path, policy); err != nil {
return nil, err
}
case source.Policy.ConfigMap != nil:
if err := initPolicyFromConfigMap(client, source.Policy.ConfigMap, policy); err != nil {
return nil, err
}
}
sc, err := configurator.CreateFromConfig(*policy)
if err != nil {
return nil, fmt.Errorf("couldn't create scheduler from policy: %v", err)
}
sched = sc
default:
return nil, fmt.Errorf("unsupported algorithm source: %v", source)
}
// Additional tweaks to the config produced by the configurator.
sched.Recorder = recorder
sched.DisablePreemption = options.disablePreemption
sched.StopEverything = stopEverything
sched.podConditionUpdater = &podConditionUpdaterImpl{client}
sched.podPreemptor = &podPreemptorImpl{client}
sched.scheduledPodsHasSynced = podInformer.Informer().HasSynced
// 3. Register event handlers on the informers
AddAllEventHandlers(sched, options.schedulerName, informerFactory, podInformer)
return sched, nil
}
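scheduler.New receives its settings as a variadic list of Option values (scheduler.WithName and so on) that are applied on top of defaultSchedulerOptions. This is the functional options pattern, and a minimal sketch of it looks like the following. The struct fields are a small subset chosen for illustration, and the default values are placeholders, not the real defaults.

```go
package main

import "fmt"

// schedulerOptions is a reduced, illustrative version of the options struct.
type schedulerOptions struct {
	schedulerName            string
	disablePreemption        bool
	percentageOfNodesToScore int32
}

// Option mutates the options struct; each With... helper returns one.
type Option func(*schedulerOptions)

func WithName(name string) Option {
	return func(o *schedulerOptions) { o.schedulerName = name }
}

func WithPreemptionDisabled(disabled bool) Option {
	return func(o *schedulerOptions) { o.disablePreemption = disabled }
}

func WithPercentageOfNodesToScore(p int32) Option {
	return func(o *schedulerOptions) { o.percentageOfNodesToScore = p }
}

// Placeholder defaults for the sketch, not the real default values.
var defaultSchedulerOptions = schedulerOptions{
	schedulerName:            "default-scheduler",
	percentageOfNodesToScore: 50,
}

// buildOptions copies the defaults and applies each option in order.
func buildOptions(opts ...Option) schedulerOptions {
	options := defaultSchedulerOptions // struct copy, the defaults stay untouched
	for _, opt := range opts {
		opt(&options)
	}
	return options
}

func main() {
	custom := buildOptions(WithName("my-scheduler"), WithPreemptionDisabled(true))
	fmt.Printf("%+v\n", custom)
	fmt.Printf("%+v\n", buildOptions())
}
```

The benefit of this pattern is that new settings can be added without changing the signature of the constructor, and callers only mention the settings they want to override.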
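Summary
Introduction to kube-scheduler
kube-scheduler is one of the core components of Kubernetes. It is responsible for scheduling pods: using its scheduling algorithms (predicates and priorities), it assigns each unscheduled pod to the most suitable node.
kube-scheduler architecture
kube-scheduler list/watches objects such as pods and nodes. Through informers it puts unscheduled pods into the pending pod queue and builds the scheduler cache (used to quickly look up nodes and other objects). The sched.scheduleOne method holds the core pod scheduling logic: it takes one pod from the pending queue, runs the predicate and priority algorithms, and selects the best node. It then updates the cache and asynchronously performs the bind operation, that is, it sets the pod's nodeName field. At that point the scheduling of the pod is complete.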