Reading the kube-scheduler source code, starting from a slow e2e scheduling alert in a k8s cluster

Published by ning1875 on 2022-04-07


The alert's PromQL


histogram_quantile(0.99, sum(rate(scheduler_e2e_scheduling_duration_seconds_bucket{job="kube-scheduler"}[5m])) without(instance, pod)) > 3 for 1m
  • Meaning: the 99th-percentile e2e scheduling latency exceeds 3 seconds (the alert fires once this holds for 1 minute)

    Tracing this histogram metric

  • Code version: v1.20
  • Location: D:\go_path\src\github.com\kubernetes\kubernetes\pkg\scheduler\metrics\metrics.go
  • Tracing the caller: the metric is observed inside the observeScheduleAttemptAndLatency wrapper, at D:\go_path\src\github.com\kubernetes\kubernetes\pkg\scheduler\metrics\profile_metrics.go
  • There we can see that all three scheduling outcomes record this latency (see the sketch below)
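  • For reference, the wrapper in profile_metrics.go looks roughly like this in v1.20 (a lightly abridged sketch):

    // Sketch of pkg/scheduler/metrics/profile_metrics.go (v1.20, abridged).
    // All three outcomes funnel into the same e2e histogram, labeled by result.
    var (
        scheduledResult     = "scheduled"
        unschedulableResult = "unschedulable"
        errorResult         = "error"
    )

    // PodScheduled records a successful scheduling attempt and its duration.
    func PodScheduled(profile string, duration float64) {
        observeScheduleAttemptAndLatency(scheduledResult, profile, duration)
    }

    // PodUnschedulable records an attempt for a pod that could not be scheduled.
    func PodUnschedulable(profile string, duration float64) {
        observeScheduleAttemptAndLatency(unschedulableResult, profile, duration)
    }

    // PodScheduleError records an attempt that failed with an error.
    func PodScheduleError(profile string, duration float64) {
        observeScheduleAttemptAndLatency(errorResult, profile, duration)
    }

    func observeScheduleAttemptAndLatency(result, profile string, duration float64) {
        e2eSchedulingLatency.WithLabelValues(result, profile).Observe(duration)
        scheduleAttempts.WithLabelValues(result, profile).Inc()
    }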

Tracing the caller

  • Location: D:\go_path\src\github.com\kubernetes\kubernetes\pkg\scheduler\scheduler.go + 608
  • The caller is Scheduler.scheduleOne, which times the scheduling of each pod
  • The concrete observation points sit at the bottom of the asynchronous bind goroutine
  • Conclusion: e2e measures the duration of the entire scheduleOne pass

    go func() {
          err := sched.bind(bindingCycleCtx, fwk, assumedPod, scheduleResult.SuggestedHost, state)
          if err != nil {
              metrics.PodScheduleError(fwk.ProfileName(), metrics.SinceInSeconds(start))
              // trigger un-reserve plugins to clean up state associated with the reserved Pod
              fwk.RunReservePluginsUnreserve(bindingCycleCtx, state, assumedPod, scheduleResult.SuggestedHost)
              if err := sched.SchedulerCache.ForgetPod(assumedPod); err != nil {
                  klog.Errorf("scheduler cache ForgetPod failed: %v", err)
              }
              sched.recordSchedulingFailure(fwk, assumedPodInfo, fmt.Errorf("binding rejected: %w", err), SchedulerError, "")
          } else {
              // Calculating nodeResourceString can be heavy. Avoid it if klog verbosity is below 2.
              if klog.V(2).Enabled() {
                  klog.InfoS("Successfully bound pod to node", "pod", klog.KObj(pod), "node", scheduleResult.SuggestedHost, "evaluatedNodes", scheduleResult.EvaluatedNodes, "feasibleNodes", scheduleResult.FeasibleNodes)
              }
              metrics.PodScheduled(fwk.ProfileName(), metrics.SinceInSeconds(start))
              metrics.PodSchedulingAttempts.Observe(float64(podInfo.Attempts))
              metrics.PodSchedulingDuration.WithLabelValues(getAttemptsLabel(podInfo)).Observe(metrics.SinceInSeconds(podInfo.InitialAttemptTimestamp))
    
              // Run "postbind" plugins.
              fwk.RunPostBindPlugins(bindingCycleCtx, state, assumedPod, scheduleResult.SuggestedHost)
          }
    }()
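  • For completeness, the start timestamp that all of these observations measure against is taken at the very top of scheduleOne (a heavily abridged sketch of v1.20):

    // Abridged sketch of the top of Scheduler.scheduleOne (v1.20).
    // `start` is taken right after a pod is popped from the scheduling queue;
    // every PodScheduled/PodScheduleError call above observes against it,
    // so e2e covers algorithm + assume + (async) bind.
    func (sched *Scheduler) scheduleOne(ctx context.Context) {
        podInfo := sched.NextPod()
        pod := podInfo.Pod
        // ... profile/framework lookup and sanity checks elided ...
        start := time.Now()
        // ... scheduling cycle, assume, and the binding goroutine follow ...
    }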

Which stages does scheduleOne include, from top to bottom?

01 Scheduling-algorithm latency

  • Sample code

    // Invoke the scheduling algorithm to produce a result
    scheduleResult, err := sched.Algorithm.Schedule(schedulingCycleCtx, fwk, state, pod)
    // Handle errors (body elided)
    if err != nil {
        // ...
    }
    // Record the scheduling-algorithm latency
    metrics.SchedulingAlgorithmLatency.Observe(metrics.SinceInSeconds(start))
  • As the code shows, there are three main steps:

    • invoke the scheduling algorithm to produce a result
    • handle any error
    • record the scheduling-algorithm latency
  • So the first thing to check is the algorithm's own latency; the corresponding histogram metric is

    histogram_quantile(0.99, sum(rate(scheduler_scheduling_algorithm_duration_seconds_bucket{job="kube-scheduler"}[5m])) by (le))
  • Overlaying the p99 of e2e and of the algorithm against the alert window, the two curves line up closely
  • However, at p99 algorithm > e2e, which should be impossible: e2e wraps the whole cycle and ought to be the upper bound. Switching to p999, the two come out roughly equal
  • The discrepancy is caused by the error of Prometheus's linear interpolation within histogram buckets; for the underlying mechanics see my article on histogram linear interpolation (a concrete sketch follows below)
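  • To make that interpolation error concrete, here is a minimal standalone sketch (not scheduler code; the bucket counts are hypothetical) of how histogram_quantile linearly interpolates inside the bucket containing the target rank. With coarse buckets the estimate can be off by almost a full bucket width, which is enough for two series' quantiles to cross:

    package main

    import "fmt"

    type bucket struct {
        upperBound float64 // the le label
        count      float64 // cumulative count
    }

    // quantile mimics histogram_quantile's core logic (simplified: no +Inf
    // or empty-histogram handling).
    func quantile(q float64, buckets []bucket) float64 {
        total := buckets[len(buckets)-1].count
        rank := q * total
        lowerBound, lowerCount := 0.0, 0.0
        for _, b := range buckets {
            if b.count >= rank {
                // Linear interpolation: the true samples could sit anywhere
                // in (lowerBound, b.upperBound], so this is only an estimate.
                return lowerBound + (b.upperBound-lowerBound)*(rank-lowerCount)/(b.count-lowerCount)
            }
            lowerBound, lowerCount = b.upperBound, b.count
        }
        return buckets[len(buckets)-1].upperBound
    }

    func main() {
        // Hypothetical distribution: 900 samples <= 1s, 990 <= 2s, 1000 <= 4s.
        bs := []bucket{{1, 900}, {2, 990}, {4, 1000}}
        fmt.Println(quantile(0.99, bs))  // -> 2, the rank lands exactly on a bucket edge
        fmt.Println(quantile(0.999, bs)) // -> 3.8, interpolated inside (2,4]
    }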
The flow inside Algorithm.Schedule
  • Two main function calls stand out in Schedule

    
    feasibleNodes, filteredNodesStatuses, err := g.findNodesThatFitPod(ctx, fwk, state, pod)
    priorityList, err := g.prioritizeNodes(ctx, fwk, state, pod, feasibleNodes)
  • findNodesThatFitPod implements the Filter phase (a sketch of its internals follows this list); the matching metric is scheduler_framework_extension_point_duration_seconds_bucket

    histogram_quantile(0.999, sum by(extension_point,le) (rate(scheduler_framework_extension_point_duration_seconds_bucket{job="kube-scheduler"}[5m])))
    
  • (screenshot of the corresponding panel omitted)
  • prioritizeNodes implements the Score phase; the matching metric is

    histogram_quantile(0.99, sum by(plugin,le) (rate(scheduler_plugin_execution_duration_seconds_bucket{job="kube-scheduler"}[5m])))
  • (screenshot of the corresponding panel omitted)
  • The concrete flow above matches the scheduling-framework diagram in the official docs (image omitted)
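  • A lightly abridged sketch of findNodesThatFitPod (v1.20) shows why both the Filter extension point and filter-type extenders contribute to this phase:

    // Abridged sketch of genericScheduler.findNodesThatFitPod (v1.20).
    // PreFilter and Filter plugins run first, then filter-type extenders.
    func (g *genericScheduler) findNodesThatFitPod(ctx context.Context, fwk framework.Framework, state *framework.CycleState, pod *v1.Pod) ([]*v1.Node, framework.NodeToStatusMap, error) {
        filteredNodesStatuses := make(framework.NodeToStatusMap)

        // Run "prefilter" plugins.
        s := fwk.RunPreFilterPlugins(ctx, state, pod)
        if !s.IsSuccess() {
            // ... error handling elided ...
        }

        // Run "filter" plugins against (a sample of) all nodes.
        feasibleNodes, err := g.findNodesThatPassFilters(ctx, fwk, state, pod, filteredNodesStatuses)
        if err != nil {
            return nil, nil, err
        }

        // Let filter-type extenders trim the list further.
        feasibleNodes, err = g.findNodesThatPassExtenders(pod, feasibleNodes, filteredNodesStatuses)
        if err != nil {
            return nil, nil, err
        }
        return feasibleNodes, filteredNodesStatuses, nil
    }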

02 Bind latency

  • Now let's return to the bind flow
  • Its core is this bind call

    err := sched.bind(bindingCycleCtx, fwk, assumedPod, scheduleResult.SuggestedHost, state)
  • We can see that bind times itself separately inside the function

    func (sched *Scheduler) bind(ctx context.Context, fwk framework.Framework, assumed *v1.Pod, targetNode string, state *framework.CycleState) (err error) {
      start := time.Now()
      defer func() {
          sched.finishBinding(fwk, assumed, targetNode, start, err)
      }()
    
      bound, err := sched.extendersBinding(assumed, targetNode)
      if bound {
          return err
      }
      bindStatus := fwk.RunBindPlugins(ctx, state, assumed, targetNode)
      if bindStatus.IsSuccess() {
          return nil
      }
      if bindStatus.Code() == framework.Error {
          return bindStatus.AsError()
      }
      return fmt.Errorf("bind status: %s, %v", bindStatus.Code().String(), bindStatus.Message())
    }
  • The corresponding metric is

    histogram_quantile(0.999, sum by(le) (rate(scheduler_binding_duration_seconds_bucket{job="kube-scheduler"}[5m])))
  • Here we compare the p999 of e2e and bind
  • Compared with the algorithm, bind tracks e2e much more closely
  • bind itself contains two main flows: sched.extendersBinding, which executes external binding extenders
  • and fwk.RunBindPlugins, which executes the in-tree bind plugins
Internal (in-tree) bind plugins
  • The code is below; the main flow is simply to run the configured bind plugins one by one

    // RunBindPlugins runs the set of configured bind plugins until one returns a non `Skip` status.
    func (f *frameworkImpl) RunBindPlugins(ctx context.Context, state *framework.CycleState, pod *v1.Pod, nodeName string) (status *framework.Status) {
      startTime := time.Now()
      defer func() {
          metrics.FrameworkExtensionPointDuration.WithLabelValues(bind, status.Code().String(), f.profileName).Observe(metrics.SinceInSeconds(startTime))
      }()
      if len(f.bindPlugins) == 0 {
          return framework.NewStatus(framework.Skip, "")
      }
      for _, bp := range f.bindPlugins {
          status = f.runBindPlugin(ctx, bp, state, pod, nodeName)
          if status != nil && status.Code() == framework.Skip {
              continue
          }
          if !status.IsSuccess() {
              err := status.AsError()
              klog.ErrorS(err, "Failed running Bind plugin", "plugin", bp.Name(), "pod", klog.KObj(pod))
              return framework.AsStatus(fmt.Errorf("running Bind plugin %q: %w", bp.Name(), err))
          }
          return status
      }
      return status
    }
  • The default bind plugin binds the pod to the chosen node by calling the pod's bind method; binding is a subresource of pods

    // Bind binds pods to nodes using the k8s client.
    func (b DefaultBinder) Bind(ctx context.Context, state *framework.CycleState, p *v1.Pod, nodeName string) *framework.Status {
      klog.V(3).Infof("Attempting to bind %v/%v to %v", p.Namespace, p.Name, nodeName)
      binding := &v1.Binding{
          ObjectMeta: metav1.ObjectMeta{Namespace: p.Namespace, Name: p.Name, UID: p.UID},
          Target:     v1.ObjectReference{Kind: "Node", Name: nodeName},
      }
      err := b.handle.ClientSet().CoreV1().Pods(binding.Namespace).Bind(ctx, binding, metav1.CreateOptions{})
      if err != nil {
          return framework.AsStatus(err)
      }
      return nil
    }
    
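  • Since binding is just a pods subresource, the same call can be issued by any client. A minimal sketch with client-go (the pod/node names are hypothetical, and building the clientset is elided):

    // Standalone example, not scheduler code: create a Binding for pod
    // "mypod" in namespace "default" onto node "node-1", which is exactly
    // the request DefaultBinder sends.
    package main

    import (
        "context"

        v1 "k8s.io/api/core/v1"
        metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
        "k8s.io/client-go/kubernetes"
    )

    func bindPod(ctx context.Context, clientset kubernetes.Interface) error {
        binding := &v1.Binding{
            ObjectMeta: metav1.ObjectMeta{Namespace: "default", Name: "mypod"},
            Target:     v1.ObjectReference{Kind: "Node", Name: "node-1"},
        }
        // POST /api/v1/namespaces/default/pods/mypod/binding
        return clientset.CoreV1().Pods("default").Bind(ctx, binding, metav1.CreateOptions{})
    }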
  • The bind action itself also has a metric recording its latency

    histogram_quantile(0.999, sum by(le) (rate(scheduler_plugin_execution_duration_seconds_bucket{extension_point="Bind",plugin="DefaultBinder",job="kube-scheduler"}[5m])))
  • Meanwhile, the defer func in RunBindPlugins records the latency of the whole Bind extension point

    histogram_quantile(0.9999, sum by(le) (rate(scheduler_framework_extension_point_duration_seconds_bucket{extension_point="Bind",job="kube-scheduler"}[5m])))
  • Judging by both metrics, the in-tree plugin latency is very low
extendersBinding: external extenders
  • The code is below: iterate over the Algorithm's Extenders, check whether each one is a binder (and interested in this pod), then call extender.Bind

    // TODO(#87159): Move this to a Plugin.
    func (sched *Scheduler) extendersBinding(pod *v1.Pod, node string) (bool, error) {
      for _, extender := range sched.Algorithm.Extenders() {
          if !extender.IsBinder() || !extender.IsInterested(pod) {
              continue
          }
          return true, extender.Bind(&v1.Binding{
              ObjectMeta: metav1.ObjectMeta{Namespace: pod.Namespace, Name: pod.Name, UID: pod.UID},
              Target:     v1.ObjectReference{Kind: "Node", Name: node},
          })
      }
      return false, nil
    }
    
  • extender.Bind forwards the binding over HTTP to the external scheduler extender

    // Bind delegates the action of binding a pod to a node to the extender.
    func (h *HTTPExtender) Bind(binding *v1.Binding) error {
      var result extenderv1.ExtenderBindingResult
      if !h.IsBinder() {
          // This shouldn't happen as this extender wouldn't have become a Binder.
          return fmt.Errorf("unexpected empty bindVerb in extender")
      }
      req := &extenderv1.ExtenderBindingArgs{
          PodName:      binding.Name,
          PodNamespace: binding.Namespace,
          PodUID:       binding.UID,
          Node:         binding.Target.Name,
      }
      if err := h.send(h.bindVerb, req, &result); err != nil {
          return err
      }
      if result.Error != "" {
          return fmt.Errorf(result.Error)
      }
      return nil
    }
  • Unfortunately, there is no metric recording this latency
  • So my current guess is that the remaining e2e time is spent iterating over sched.Algorithm.Extenders and executing their Bind calls
  • sched.Algorithm.Extenders is populated from the extenders section of KubeSchedulerConfiguration
  • that is, from a configuration file whose path is passed to kube-scheduler via the --config command-line flag to customize its behavior; so far I haven't found such a file in this cluster (a hypothetical way to measure this follows below)
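  • One way to confirm the guess would be a local patch that times extendersBinding at its call site inside sched.bind. A hypothetical sketch; note that extenderBindingLatency is NOT an upstream metric and would first have to be defined and registered in pkg/scheduler/metrics alongside the existing histograms:

    // Hypothetical patch inside Scheduler.bind (v1.20):
    extStart := time.Now()
    bound, err := sched.extendersBinding(assumed, targetNode)
    // extenderBindingLatency: a new (non-upstream) histogram for extender binds.
    extenderBindingLatency.Observe(metrics.SinceInSeconds(extStart))
    if bound {
        return err
    }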

Summary

The scheduler's scheduling flow

  • Scheduling a single pod consists of three main steps:

    • run the Predicate and Priority phases, calling their respective algorithm plugins, to select the best Node
    • assume the Pod onto that Node and save the result into the scheduler cache
    • validate with extenders and plugins, and bind the Pod if validation passes

    The e2e latency mainly comes from bind

  • Yet the measured bind latency does not look that long so far
  • To be continued
