Source-code details: how the default k8s scheduler filters nodes by a pod's resource requests

func (sched *Scheduler) scheduleOne(ctx context.Context) {
    // ... (excerpt) the scheduling cycle hands the pod to the scheduling algorithm
    scheduleResult, err := sched.Algorithm.Schedule(schedulingCycleCtx, sched.Extenders, fwk, state, pod)
    // ...
}

Analyzing the Schedule method

The default Schedule method lives in D:\go_path\src\github.com\kubernetes\kubernetes\pkg\scheduler\generic_scheduler.go

Its doc comment says:

// Schedule tries to schedule the given pod to one of the nodes in the node list.
// If it succeeds, it will return the name of the node.
// If it fails, it will return a FitError error with reasons.

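In other words, Schedule tries to place the pod on one of the nodes in the given list.
On success it returns the name of that node.
On failure it returns a FitError carrying the reasons.

Now for the method's return value

The fields of the ScheduleResult struct are clearly named, so their purpose is obvious at a glance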

(result ScheduleResult, err error)

type ScheduleResult struct {
    // Name of the scheduler suggest host
    SuggestedHost string // the node chosen for the pod
    // Number of nodes scheduler evaluated on one pod scheduled
    EvaluatedNodes int // how many nodes took part in the evaluation
    // Number of feasible nodes on one pod scheduled
    FeasibleNodes int // how many nodes turned out to be suitable
}

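Next, the method's parameters

(ctx context.Context, extenders []framework.Extender, fwk framework.Framework, state *framework.CycleState, pod *v1.Pod)
ctx is the context
extenders are the configured scheduler extenders, i.e. external services the scheduler calls out to via webhooks
fwk is the built-in scheduling framework object
state is the CycleState, a per-scheduling-cycle store that plugins use to pass data to each other
pod is the target pod waiting to be scheduled

The core of the method is findNodesThatFitPod

The call looks like this: feasibleNodes, diagnosis, err := g.findNodesThatFitPod(ctx, extenders, fwk, state, pod)
findNodesThatFitPod runs the prefilter plugins first and then the plugins in the filter plugin list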

step01 Run the prefilter plugins

    // Run "prefilter" plugins.
    s := fwk.RunPreFilterPlugins(ctx, state, pod)
    allNodes, err := g.nodeInfoSnapshot.NodeInfos().List()
    if err != nil {
        return nil, diagnosis, err
    }

The code that iterates over them is as follows

func (f *frameworkImpl) RunPreFilterPlugins(ctx context.Context, state *framework.CycleState, pod *v1.Pod) (status *framework.Status) {
  startTime := time.Now()
  defer func() {
      metrics.FrameworkExtensionPointDuration.WithLabelValues(preFilter, status.Code().String(), f.profileName).Observe(metrics.SinceInSeconds(startTime))
  }()
  for _, pl := range f.preFilterPlugins {
      status = f.runPreFilterPlugin(ctx, pl, state, pod)
      if !status.IsSuccess() {
          status.SetFailedPlugin(pl.Name())
          if status.IsUnschedulable() {
              return status
          }
          return framework.AsStatus(fmt.Errorf("running PreFilter plugin %q: %w", pl.Name(), status.AsError())).WithFailedPlugin(pl.Name())
      }
  }

  return nil
}

The core is calling the PreFilter method of each PreFilterPlugin in turn
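
The loop is easier to see in a stripped-down form. The following is only a minimal sketch of the pattern, not the framework's code: Status, PreFilterPlugin, fakePlugin and runPreFilterPlugins are simplified, hypothetical stand-ins, and the pod is reduced to a plain string. It shows that plugins run in order and that the first non-success status ends the loop.

```go
package main

import "fmt"

// Status is a simplified, hypothetical stand-in for framework.Status.
// A nil *Status means success, as in the real framework.
type Status struct {
	Unschedulable bool
	Err           error
	FailedPlugin  string
}

// PreFilterPlugin is a toy version of the interface (assumption: the pod is a string).
type PreFilterPlugin interface {
	Name() string
	PreFilter(pod string) *Status
}

// runPreFilterPlugins calls each plugin in order and returns at the first failure.
func runPreFilterPlugins(plugins []PreFilterPlugin, pod string) *Status {
	for _, pl := range plugins {
		status := pl.PreFilter(pod)
		if status == nil {
			continue // success, move on to the next plugin
		}
		if status.Unschedulable {
			status.FailedPlugin = pl.Name()
			return status
		}
		// Any other failure is wrapped as an internal error.
		return &Status{
			Err:          fmt.Errorf("running PreFilter plugin %q: %w", pl.Name(), status.Err),
			FailedPlugin: pl.Name(),
		}
	}
	return nil
}

// fakePlugin returns a fixed status and records whether it was called.
type fakePlugin struct {
	name   string
	result *Status
	called bool
}

func (p *fakePlugin) Name() string { return p.name }

func (p *fakePlugin) PreFilter(pod string) *Status {
	p.called = true
	return p.result
}

func main() {
	a := &fakePlugin{name: "A"}
	b := &fakePlugin{name: "B", result: &Status{Unschedulable: true}}
	c := &fakePlugin{name: "C"}
	st := runPreFilterPlugins([]PreFilterPlugin{a, b, c}, "demo-pod")
	fmt.Println(st.FailedPlugin, c.called)
}
```

In this toy run plugin B rejects the pod, so C is never called.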

type PreFilterPlugin interface {
    Plugin
    // PreFilter is called at the beginning of the scheduling cycle. All PreFilter
    // plugins must return success or the pod will be rejected.
    PreFilter(ctx context.Context, state *CycleState, p *v1.Pod) *Status
    // PreFilterExtensions returns a PreFilterExtensions interface if the plugin implements one,
    // or nil if it does not. A Pre-filter plugin can provide extensions to incrementally
    // modify its pre-processed info. The framework guarantees that the extensions
    // AddPod/RemovePod will only be called after PreFilter, possibly on a cloned
    // CycleState, and may call those functions more than once before calling
    // Filter again on a specific node.
    PreFilterExtensions() PreFilterExtensions
}

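Which PreFilterPlugins are there by default?

Searching the official documentation for prefilter
turns up 8 of them, such as NodePorts, NodeResourcesFit and VolumeBinding
This roughly matches the implementers of PreFilter that the IDE shows

Let's pick one, the PreFilterPlugin of NodeResourcesFit, and take a look

Location: D:\go_path\src\github.com\kubernetes\kubernetes\pkg\scheduler\framework\plugins\noderesources\fit.go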

func (f *Fit) PreFilter(ctx context.Context, cycleState *framework.CycleState, pod *v1.Pod) *framework.Status {
    cycleState.Write(preFilterStateKey, computePodResourceRequest(pod, f.enablePodOverhead))
    return nil
}

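As the method above shows, it only computes the pod's resource request and writes it into the CycleState, in preparation for the filtering that follows
The numbers come from computePodResourceRequest; we don't need to read the code, the comment alone makes its meaning clear
It aggregates over the pod's init containers and app containers and takes the largest request in each resource dimension
Init containers and app containers are handled differently
In the example given in the comment, init containers run one after another, so the max per dimension is enough: 2c 3G
App containers run at the same time, so their requests must be summed: 3c 2G
Finally, take the per-dimension max of the two: 3c 3G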

// computePodResourceRequest returns a framework.Resource that covers the largest
// width in each resource dimension. Because init-containers run sequentially, we collect
// the max in each dimension iteratively. In contrast, we sum the resource vectors for
// regular containers since they run simultaneously.
//
// If Pod Overhead is specified and the feature gate is set, the resources defined for Overhead
// are added to the calculated Resource request sum
//
// Example:
//
// Pod:
//   InitContainers
//     IC1:
//       CPU: 2
//       Memory: 1G
//     IC2:
//       CPU: 2
//       Memory: 3G
//   Containers
//     C1:
//       CPU: 2
//       Memory: 1G
//     C2:
//       CPU: 1
//       Memory: 1G
//
// Result: CPU: 3, Memory: 3G
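
The arithmetic in that comment can be reproduced with a few lines of Go. This is a minimal sketch under stated assumptions, not the real computePodResourceRequest: Resource, podRequest and max64 are hypothetical names, memory is kept as a whole number of G to match the example, and pod overhead is left out.

```go
package main

import "fmt"

// Resource is a simplified, hypothetical stand-in for framework.Resource.
type Resource struct {
	MilliCPU int64 // CPU in millicores
	Memory   int64 // memory in G, matching the units of the example
}

func max64(a, b int64) int64 {
	if a > b {
		return a
	}
	return b
}

// podRequest sums the regular containers, takes the per-dimension max over
// the init containers, and returns the per-dimension max of those two results.
func podRequest(initContainers, containers []Resource) Resource {
	var sum, initMax Resource
	for _, c := range containers {
		sum.MilliCPU += c.MilliCPU
		sum.Memory += c.Memory
	}
	for _, ic := range initContainers {
		initMax.MilliCPU = max64(initMax.MilliCPU, ic.MilliCPU)
		initMax.Memory = max64(initMax.Memory, ic.Memory)
	}
	return Resource{
		MilliCPU: max64(sum.MilliCPU, initMax.MilliCPU),
		Memory:   max64(sum.Memory, initMax.Memory),
	}
}

func main() {
	initContainers := []Resource{{MilliCPU: 2000, Memory: 1}, {MilliCPU: 2000, Memory: 3}}
	containers := []Resource{{MilliCPU: 2000, Memory: 1}, {MilliCPU: 1000, Memory: 1}}
	// Init max is 2c 3G, container sum is 3c 2G, so the result is 3c 3G.
	fmt.Printf("%+v\n", podRequest(initContainers, containers))
}
```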

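At this point you may be puzzled: Fit's PreFilter contains no code that filters nodes by resources

That logic actually lives in the filter plugin
because once findNodesThatFitPod has run all the prefilter plugins, it goes on to run the filter plugins
namely the Filter function of NodeResourcesFit
Location: D:\go_path\src\github.com\kubernetes\kubernetes\pkg\scheduler\framework\plugins\noderesources\fit.go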

// Filter invoked at the filter extension point.
// Checks if a node has sufficient resources, such as cpu, memory, gpu, opaque int resources etc to run a pod.
// It returns a list of insufficient resources, if empty, then the node has all the resources requested by the pod.
func (f *Fit) Filter(ctx context.Context, cycleState *framework.CycleState, pod *v1.Pod, nodeInfo *framework.NodeInfo) *framework.Status {
    s, err := getPreFilterState(cycleState)
    if err != nil {
        return framework.AsStatus(err)
    }

    insufficientResources := fitsRequest(s, nodeInfo, f.ignoredResources, f.ignoredResourceGroups)

    if len(insufficientResources) != 0 {
        // We will keep all failure reasons.
        failureReasons := make([]string, 0, len(insufficientResources))
        for _, r := range insufficientResources {
            failureReasons = append(failureReasons, r.Reason)
        }
        return framework.NewStatus(framework.Unschedulable, failureReasons...)
    }
    return nil
}

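As the comment above shows, this checks whether a node has enough resources to satisfy what the target pod requests

The actual resource arithmetic is in fitsRequest

Taking cpu as an example: the pod does not fit if its request exceeds the node's allocatable CPU minus the CPU already requested by the pods on that node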

  if podRequest.MilliCPU > (nodeInfo.Allocatable.MilliCPU - nodeInfo.Requested.MilliCPU) {
      insufficientResources = append(insufficientResources, InsufficientResource{
          v1.ResourceCPU,
          "Insufficient cpu",
          podRequest.MilliCPU,
          nodeInfo.Requested.MilliCPU,
          nodeInfo.Allocatable.MilliCPU,
      })
  }
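
To see that check produce a result, here is a small self-contained sketch. It is a toy version under assumptions, not the real fitsRequest: Resource, NodeInfo and InsufficientResource are simplified stand-ins, only cpu and memory are checked, and memory is counted in MiB.

```go
package main

import "fmt"

// Resource is a simplified stand-in for framework.Resource (assumption).
type Resource struct {
	MilliCPU int64 // millicores
	Memory   int64 // MiB in this sketch
}

// NodeInfo keeps only the two fields the check needs.
type NodeInfo struct {
	Allocatable Resource // what the node can hand out to pods
	Requested   Resource // sum of the requests of pods already on the node
}

// InsufficientResource mirrors the shape of the struct in the excerpt.
type InsufficientResource struct {
	ResourceName string
	Reason       string
	Requested    int64
	Used         int64
	Capacity     int64
}

// fitsRequest returns one entry per resource the node cannot satisfy.
func fitsRequest(pod Resource, node NodeInfo) []InsufficientResource {
	var out []InsufficientResource
	if pod.MilliCPU > node.Allocatable.MilliCPU-node.Requested.MilliCPU {
		out = append(out, InsufficientResource{"cpu", "Insufficient cpu",
			pod.MilliCPU, node.Requested.MilliCPU, node.Allocatable.MilliCPU})
	}
	if pod.Memory > node.Allocatable.Memory-node.Requested.Memory {
		out = append(out, InsufficientResource{"memory", "Insufficient memory",
			pod.Memory, node.Requested.Memory, node.Allocatable.Memory})
	}
	return out
}

func main() {
	node := NodeInfo{
		Allocatable: Resource{MilliCPU: 4000, Memory: 8192},
		Requested:   Resource{MilliCPU: 3500, Memory: 2048},
	}
	// Only 500m CPU is still unrequested, so a 1000m pod does not fit.
	for _, r := range fitsRequest(Resource{MilliCPU: 1000, Memory: 1024}, node) {
		fmt.Println(r.Reason)
	}
}
```

Note that the comparison uses the requests of the pods already on the node, not their measured usage.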

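What if more than one node satisfies the pod's resource request?

That's simple: findNodesThatPassFilters just returns several nodes

and they are then handed to the score plugins, which score them and pick one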

  feasibleNodes, err := g.findNodesThatPassFilters(ctx, fwk, state, pod, diagnosis, allNodes)
  if err != nil {
      return nil, diagnosis, err
  }
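
The filter-then-score hand-off can be sketched as below. This is an illustration only: Node, filter, score and pick are hypothetical names, only CPU is considered, and the scoring formula is a made-up one loosely modeled on a "least allocated" idea. The real scheduler combines several weighted score plugins, so treat the numbers as an assumption.

```go
package main

import "fmt"

// Node is a toy node description (assumption: CPU only).
type Node struct {
	Name                string
	AllocatableMilliCPU int64
	RequestedMilliCPU   int64
}

// filter keeps the nodes whose unrequested CPU covers the pod's request.
func filter(nodes []Node, podMilliCPU int64) []Node {
	var feasible []Node
	for _, n := range nodes {
		if podMilliCPU <= n.AllocatableMilliCPU-n.RequestedMilliCPU {
			feasible = append(feasible, n)
		}
	}
	return feasible
}

// score returns 0..100: the share of CPU left unrequested after placing the pod.
func score(n Node, podMilliCPU int64) int64 {
	if n.AllocatableMilliCPU == 0 {
		return 0
	}
	free := n.AllocatableMilliCPU - n.RequestedMilliCPU - podMilliCPU
	return free * 100 / n.AllocatableMilliCPU
}

// pick filters first, then returns the feasible node with the highest score.
// On a tie the first node wins in this sketch.
func pick(nodes []Node, podMilliCPU int64) (string, bool) {
	best, bestScore, found := "", int64(-1), false
	for _, n := range filter(nodes, podMilliCPU) {
		if s := score(n, podMilliCPU); s > bestScore {
			best, bestScore, found = n.Name, s, true
		}
	}
	return best, found
}

func main() {
	nodes := []Node{
		{Name: "node-a", AllocatableMilliCPU: 4000, RequestedMilliCPU: 3500},
		{Name: "node-b", AllocatableMilliCPU: 4000, RequestedMilliCPU: 1000},
		{Name: "node-c", AllocatableMilliCPU: 8000, RequestedMilliCPU: 6000},
	}
	// node-a is filtered out; node-b scores 50 and node-c scores 12.
	fmt.Println(pick(nodes, 1000))
}
```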

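Summary

The PreFilterPlugin of NodeResourcesFit computes the pod's resource request, treating init containers and app containers differently
At which stage does the default k8s scheduler filter out nodes that cannot satisfy the pod's resources? The answer: the Filter function of NodeResourcesFit
If filter returns several nodes, they are handed to the score plugins, which score them and pick one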

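Thought experiment

If you write an extended scheduler on the k8s scheduling framework that implements only Filter and filters nodes by their real load, what goes wrong?
The answer: because the default NodeResourcesFit is skipped, the pod may be rejected by the kubelet's admit step with errors such as OutOfmemory or OutOfcpu
because the kubelet still checks the new pod's requests against the resources already allocated on its node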
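
A tiny numeric sketch makes the mismatch concrete. Everything here is hypothetical: Node, loadFilter and requestCheck are illustrative names, UsedMilliCPU stands for a measured value from some metrics source, and requestCheck only imitates the request-based comparison. It is not the kubelet's actual admission code.

```go
package main

import "fmt"

// Node is a toy node with both accounting views of its CPU.
type Node struct {
	AllocatableMilliCPU int64
	RequestedMilliCPU   int64 // sum of requests of pods already on the node
	UsedMilliCPU        int64 // measured usage (assumption: from a metrics source)
}

// loadFilter is a hypothetical load-only filter: it looks at measured usage.
func loadFilter(n Node, podMilliCPU int64) bool {
	return podMilliCPU <= n.AllocatableMilliCPU-n.UsedMilliCPU
}

// requestCheck imitates a request-based check: it looks at requests only.
func requestCheck(n Node, podMilliCPU int64) bool {
	return podMilliCPU <= n.AllocatableMilliCPU-n.RequestedMilliCPU
}

func main() {
	// The node is nearly idle but almost fully requested.
	n := Node{AllocatableMilliCPU: 4000, RequestedMilliCPU: 3500, UsedMilliCPU: 500}
	pod := int64(1000)
	fmt.Println("load-based filter passes:", loadFilter(n, pod))
	fmt.Println("request-based check passes:", requestCheck(n, pod))
}
```

The load-based filter accepts the node while the request-based check rejects it, which is the gap that shows up as an admission failure on the node.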

So how should a scheduler based on real load be written?

See the follow-up article: Extending k8s: a scheduler based on real load
