Introduction to the deployment controller
The deployment controller is one of the many controllers in the kube-controller-manager component, and it is the controller for the deployment resource object. By watching three resources — deployments, replicasets and pods — it reconciles the corresponding deployment whenever any of the three changes, carrying out scaling, pause and resume, updates, rollbacks, status updates, cleanup of old replicasets owned by the deployment, and so on.
Deployment controller architecture diagram
The rough structure and processing flow of the deployment controller are shown in the diagram below. The deployment controller registers event handlers for pod, replicaset and deployment objects; when an event occurs it is picked up by the watch and the corresponding deployment object is put into a queue. The syncDeployment method holds the deployment controller's core logic for reconciling deployment objects: it takes deployment objects off the queue and reconciles them.
The deployment controller analysis is split into two parts:
(1) deployment controller initialization and startup analysis;
(2) deployment controller processing logic analysis.
1. Deployment controller initialization and startup analysis
Based on tag v1.17.4
https://github.com/kubernetes/kubernetes/releases/tag/v1.17.4
Let's go straight to the startDeploymentController function as the entry point for analyzing how the deployment controller is initialized and started.
startDeploymentController
Main logic of startDeploymentController:
(1) call deployment.NewDeploymentController to create and initialize the DeploymentController;
(2) start a goroutine that runs the DeploymentController's Run method.
// cmd/kube-controller-manager/app/apps.go
func startDeploymentController(ctx ControllerContext) (http.Handler, bool, error) {
if !ctx.AvailableResources[schema.GroupVersionResource{Group: "apps", Version: "v1", Resource: "deployments"}] {
return nil, false, nil
}
dc, err := deployment.NewDeploymentController(
ctx.InformerFactory.Apps().V1().Deployments(),
ctx.InformerFactory.Apps().V1().ReplicaSets(),
ctx.InformerFactory.Core().V1().Pods(),
ctx.ClientBuilder.ClientOrDie("deployment-controller"),
)
if err != nil {
return nil, true, fmt.Errorf("error creating Deployment controller: %v", err)
}
go dc.Run(int(ctx.ComponentConfig.DeploymentController.ConcurrentDeploymentSyncs), ctx.Stop)
return nil, true, nil
}
1.1 deployment.NewDeploymentController
From the code of deployment.NewDeploymentController we can see that the deployment controller registers EventHandlers for the deployment, replicaset and pod objects — that is, it listens for events on these objects, puts the events into the work queue and processes them. It also assigns dc.syncDeployment to dc.syncHandler, registering it as the core processing method; dc.Run calls this core method to reconcile deployment objects (the core processing method is analyzed in detail later).
// pkg/controller/deployment/deployment_controller.go
// NewDeploymentController creates a new DeploymentController.
func NewDeploymentController(dInformer appsinformers.DeploymentInformer, rsInformer appsinformers.ReplicaSetInformer, podInformer coreinformers.PodInformer, client clientset.Interface) (*DeploymentController, error) {
eventBroadcaster := record.NewBroadcaster()
eventBroadcaster.StartLogging(klog.Infof)
eventBroadcaster.StartRecordingToSink(&v1core.EventSinkImpl{Interface: client.CoreV1().Events("")})
if client != nil && client.CoreV1().RESTClient().GetRateLimiter() != nil {
if err := ratelimiter.RegisterMetricAndTrackRateLimiterUsage("deployment_controller", client.CoreV1().RESTClient().GetRateLimiter()); err != nil {
return nil, err
}
}
dc := &DeploymentController{
client: client,
eventRecorder: eventBroadcaster.NewRecorder(scheme.Scheme, v1.EventSource{Component: "deployment-controller"}),
queue: workqueue.NewNamedRateLimitingQueue(workqueue.DefaultControllerRateLimiter(), "deployment"),
}
dc.rsControl = controller.RealRSControl{
KubeClient: client,
Recorder: dc.eventRecorder,
}
dInformer.Informer().AddEventHandler(cache.ResourceEventHandlerFuncs{
AddFunc: dc.addDeployment,
UpdateFunc: dc.updateDeployment,
// This will enter the sync loop and no-op, because the deployment has been deleted from the store.
DeleteFunc: dc.deleteDeployment,
})
rsInformer.Informer().AddEventHandler(cache.ResourceEventHandlerFuncs{
AddFunc: dc.addReplicaSet,
UpdateFunc: dc.updateReplicaSet,
DeleteFunc: dc.deleteReplicaSet,
})
podInformer.Informer().AddEventHandler(cache.ResourceEventHandlerFuncs{
DeleteFunc: dc.deletePod,
})
dc.syncHandler = dc.syncDeployment
dc.enqueueDeployment = dc.enqueue
dc.dLister = dInformer.Lister()
dc.rsLister = rsInformer.Lister()
dc.podLister = podInformer.Lister()
dc.dListerSynced = dInformer.Informer().HasSynced
dc.rsListerSynced = rsInformer.Informer().HasSynced
dc.podListerSynced = podInformer.Informer().HasSynced
return dc, nil
}
1.2 dc.Run
The key part is the for loop: according to the value of workers (which comes from the kube-controller-manager startup flag concurrent-deployment-syncs), it starts that many goroutines running the dc.worker method, which mainly calls the deployment controller's core processing method dc.syncDeployment described earlier.
// pkg/controller/deployment/deployment_controller.go
func (dc *DeploymentController) Run(workers int, stopCh <-chan struct{}) {
defer utilruntime.HandleCrash()
defer dc.queue.ShutDown()
klog.Infof("Starting deployment controller")
defer klog.Infof("Shutting down deployment controller")
if !cache.WaitForNamedCacheSync("deployment", stopCh, dc.dListerSynced, dc.rsListerSynced, dc.podListerSynced) {
return
}
for i := 0; i < workers; i++ {
go wait.Until(dc.worker, time.Second, stopCh)
}
<-stopCh
}
1.2.1 dc.worker
dc.worker takes an event key from the queue and calls dc.syncHandler, i.e. dc.syncDeployment, to reconcile it. As mentioned earlier, the events in the queue come from the EventHandlers the deployment controller registered for deployment, replicaset and pod objects: their change events are watched and then put into the queue.
// pkg/controller/deployment/deployment_controller.go
func (dc *DeploymentController) worker() {
for dc.processNextWorkItem() {
}
}
func (dc *DeploymentController) processNextWorkItem() bool {
key, quit := dc.queue.Get()
if quit {
return false
}
defer dc.queue.Done(key)
err := dc.syncHandler(key.(string))
dc.handleErr(err, key)
return true
}
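The keys in the work queue are standard informer keys of the form namespace/name. Below is a minimal, self-contained sketch (the deployment name is made up for illustration) showing how such a key is produced and split apart using the client-go cache helpers the controller itself relies on.

package main

import (
	"fmt"

	appsv1 "k8s.io/api/apps/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/tools/cache"
)

func main() {
	// A made-up deployment object, used only to illustrate the key format.
	d := &appsv1.Deployment{
		ObjectMeta: metav1.ObjectMeta{Namespace: "default", Name: "nginx-deployment"},
	}

	// Enqueueing is ultimately based on MetaNamespaceKeyFunc, which yields "namespace/name".
	key, _ := cache.MetaNamespaceKeyFunc(d)
	fmt.Println(key) // default/nginx-deployment

	// syncDeployment later splits the key back into namespace and name.
	namespace, name, _ := cache.SplitMetaNamespaceKey(key)
	fmt.Println(namespace, name) // default nginx-deployment
}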
2. Analysis of the deployment controller's core processing logic
Before analyzing the core processing logic, let's first go over a few key concepts.
A few key concepts
(1) The latest replicaset object
Which replicaset object counts as the latest? A replicaset whose pod template is identical to the deployment's pod template is the latest replicaset.
(2) Old replicaset objects
Which replicaset objects count as old? Every replicaset other than the latest one is an old replicaset.
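As an illustration of this matching idea, here is a simplified sketch (not the controller's actual implementation; the real helper, deploymentutil.FindNewReplicaSet, compares pod templates while ignoring the pod-template-hash label that the controller injects into every replicaset it creates):

package deploysketch

import (
	apps "k8s.io/api/apps/v1"
	v1 "k8s.io/api/core/v1"
	apiequality "k8s.io/apimachinery/pkg/api/equality"
)

// equalIgnoreHash reports whether two pod templates are equal once the
// pod-template-hash label is stripped from both sides.
func equalIgnoreHash(t1, t2 *v1.PodTemplateSpec) bool {
	t1Copy, t2Copy := t1.DeepCopy(), t2.DeepCopy()
	delete(t1Copy.Labels, apps.DefaultDeploymentUniqueLabelKey)
	delete(t2Copy.Labels, apps.DefaultDeploymentUniqueLabelKey)
	return apiequality.Semantic.DeepEqual(t1Copy, t2Copy)
}

// isLatestRS checks whether rs is the deployment's latest replicaset, i.e.
// whether its pod template matches the deployment's pod template.
func isLatestRS(d *apps.Deployment, rs *apps.ReplicaSet) bool {
	return equalIgnoreHash(&rs.Spec.Template, &d.Spec.Template)
}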
(3) Pods in the ready state
A pod is in the ready state when, among the pod's .status.conditions, the condition whose type is Ready has its status set to True.
apiVersion: v1
kind: Pod
...
status:
  conditions:
  - lastProbeTime: null
    lastTransitionTime: "2021-08-04T08:47:03Z"
    status: "True"
    type: Ready
  ...
The status of the Ready condition is only set to True after all of the pod's containers are ready.
When does a container in a pod become ready? Based on the readiness probe configured for the container, kubelet updates the pod's status and marks the container as ready once the probe succeeds; a YAML example is shown below.
apiVersion: v1
kind: Pod
...
status:
  ...
  containerStatuses:
  - containerID: xxx
    image: xxx
    imageID: xxx
    lastState: {}
    name: test
    ready: true
  ...
(4) Pods in the available state
A pod is in the available state once it has been ready for more than minReadySeconds.
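A simplified, illustrative sketch of this availability rule (not the controller's exact helper): the pod's Ready condition must be True and must have stayed True for at least minReadySeconds.

package podsketch

import (
	"time"

	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// getPodReadyCondition returns the pod's Ready condition, if present.
func getPodReadyCondition(status v1.PodStatus) *v1.PodCondition {
	for i := range status.Conditions {
		if status.Conditions[i].Type == v1.PodReady {
			return &status.Conditions[i]
		}
	}
	return nil
}

// isPodAvailable reports whether the pod is ready and has been ready for at
// least minReadySeconds relative to now.
func isPodAvailable(pod *v1.Pod, minReadySeconds int32, now metav1.Time) bool {
	c := getPodReadyCondition(pod.Status)
	if c == nil || c.Status != v1.ConditionTrue {
		return false
	}
	if minReadySeconds == 0 {
		return true
	}
	minReadyDuration := time.Duration(minReadySeconds) * time.Second
	return !c.LastTransitionTime.IsZero() &&
		c.LastTransitionTime.Add(minReadyDuration).Before(now.Time)
}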
syncDeployment
Let's look directly at syncDeployment, the deployment controller's core processing method.
Main logic:
(1) Record the current time when the method starts and define a defer function to measure the method's total execution time, i.e. how long one reconciliation of a deployment takes;
(2) Get the deployment object by its namespace and name;
(3) Call dc.getReplicaSetsForDeployment: process all replicaset objects in the same namespace as the deployment; a replicaset that matches the deployment but is not yet associated with it is adopted by setting its ownerReferences field, and a replicaset that is associated but no longer matches has the corresponding ownerReferences removed; finally, return the list of ReplicaSets that match and are associated with the Deployment;
(4) Call dc.getPodMapForDeployment: using the deployment's selector, get the pods associated with the deployment, group them by the UID of the replicaset they belong to, and return them as a map[types.UID][]*v1.Pod;
(5) If the deployment's DeletionTimestamp is not nil, call dc.syncStatusOnly, which recalculates the deployment's status field from the replicasets the deployment owns and updates it, then return immediately without going any further;
(6) Call dc.checkPausedConditions: check whether the deployment is paused and, if so, update the deployment's status field, adding the pause-related condition;
(7) If the deployment's .Spec.Paused is true, call dc.sync and return immediately afterwards;
(8) Call getRollbackTo to check whether the deployment's annotations contain the key deprecated.deployment.rollback.to; if it exists and its value is not empty, call dc.rollback to perform the rollback;
(9) Call dc.isScalingEvent: check whether the deployment is in a scaling state and, if so, call dc.sync to handle the scaling, then return immediately;
(10) Check the deployment's update strategy: when it is Recreate, call dc.rolloutRecreate to perform a recreate update of the deployment; when it is RollingUpdate, call dc.rolloutRolling to perform a rolling update.
// pkg/controller/deployment/deployment_controller.go
// syncDeployment will sync the deployment with the given key.
// This function is not meant to be invoked concurrently with the same key.
func (dc *DeploymentController) syncDeployment(key string) error {
startTime := time.Now()
klog.V(4).Infof("Started syncing deployment %q (%v)", key, startTime)
defer func() {
klog.V(4).Infof("Finished syncing deployment %q (%v)", key, time.Since(startTime))
}()
namespace, name, err := cache.SplitMetaNamespaceKey(key)
if err != nil {
return err
}
deployment, err := dc.dLister.Deployments(namespace).Get(name)
if errors.IsNotFound(err) {
klog.V(2).Infof("Deployment %v has been deleted", key)
return nil
}
if err != nil {
return err
}
// Deep-copy otherwise we are mutating our cache.
// TODO: Deep-copy only when needed.
d := deployment.DeepCopy()
everything := metav1.LabelSelector{}
if reflect.DeepEqual(d.Spec.Selector, &everything) {
dc.eventRecorder.Eventf(d, v1.EventTypeWarning, "SelectingAll", "This deployment is selecting all pods. A non-empty selector is required.")
if d.Status.ObservedGeneration < d.Generation {
d.Status.ObservedGeneration = d.Generation
dc.client.AppsV1().Deployments(d.Namespace).UpdateStatus(d)
}
return nil
}
// List ReplicaSets owned by this Deployment, while reconciling ControllerRef
// through adoption/orphaning.
rsList, err := dc.getReplicaSetsForDeployment(d)
if err != nil {
return err
}
// List all Pods owned by this Deployment, grouped by their ReplicaSet.
// Current uses of the podMap are:
//
// * check if a Pod is labeled correctly with the pod-template-hash label.
// * check that no old Pods are running in the middle of Recreate Deployments.
podMap, err := dc.getPodMapForDeployment(d, rsList)
if err != nil {
return err
}
if d.DeletionTimestamp != nil {
return dc.syncStatusOnly(d, rsList)
}
// Update deployment conditions with an Unknown condition when pausing/resuming
// a deployment. In this way, we can be sure that we won't timeout when a user
// resumes a Deployment with a set progressDeadlineSeconds.
if err = dc.checkPausedConditions(d); err != nil {
return err
}
if d.Spec.Paused {
return dc.sync(d, rsList)
}
// rollback is not re-entrant in case the underlying replica sets are updated with a new
// revision so we should ensure that we won't proceed to update replica sets until we
// make sure that the deployment has cleaned up its rollback spec in subsequent enqueues.
if getRollbackTo(d) != nil {
return dc.rollback(d, rsList)
}
scalingEvent, err := dc.isScalingEvent(d, rsList)
if err != nil {
return err
}
if scalingEvent {
return dc.sync(d, rsList)
}
switch d.Spec.Strategy.Type {
case apps.RecreateDeploymentStrategyType:
return dc.rolloutRecreate(d, rsList, podMap)
case apps.RollingUpdateDeploymentStrategyType:
return dc.rolloutRolling(d, rsList)
}
return fmt.Errorf("unexpected deployment strategy type: %s", d.Spec.Strategy.Type)
}
2.1 dc.getReplicaSetsForDeployment
The main job of dc.getReplicaSetsForDeployment is to get the ReplicaSets in the cluster that are related to the Deployment: replicasets that match the deployment but are not associated with it are adopted by setting their ownerReferences field, while replicasets that are associated but no longer match have the corresponding ownerReferences removed.
Main logic:
(1) List all replicaset objects in the deployment's namespace;
(2) Call cm.ClaimReplicaSets to process those replicasets further and finally return the list of replicasets that match and are associated with the deployment.
// pkg/controller/deployment/deployment_controller.go
// getReplicaSetsForDeployment uses ControllerRefManager to reconcile
// ControllerRef by adopting and orphaning.
// It returns the list of ReplicaSets that this Deployment should manage.
func (dc *DeploymentController) getReplicaSetsForDeployment(d *apps.Deployment) ([]*apps.ReplicaSet, error) {
// List all ReplicaSets to find those we own but that no longer match our
// selector. They will be orphaned by ClaimReplicaSets().
rsList, err := dc.rsLister.ReplicaSets(d.Namespace).List(labels.Everything())
if err != nil {
return nil, err
}
deploymentSelector, err := metav1.LabelSelectorAsSelector(d.Spec.Selector)
if err != nil {
return nil, fmt.Errorf("deployment %s/%s has invalid label selector: %v", d.Namespace, d.Name, err)
}
// If any adoptions are attempted, we should first recheck for deletion with
// an uncached quorum read sometime after listing ReplicaSets (see #42639).
canAdoptFunc := controller.RecheckDeletionTimestamp(func() (metav1.Object, error) {
fresh, err := dc.client.AppsV1().Deployments(d.Namespace).Get(d.Name, metav1.GetOptions{})
if err != nil {
return nil, err
}
if fresh.UID != d.UID {
return nil, fmt.Errorf("original Deployment %v/%v is gone: got uid %v, wanted %v", d.Namespace, d.Name, fresh.UID, d.UID)
}
return fresh, nil
})
cm := controller.NewReplicaSetControllerRefManager(dc.rsControl, d, deploymentSelector, controllerKind, canAdoptFunc)
return cm.ClaimReplicaSets(rsList)
}
2.1.1 cm.ClaimReplicaSets
ClaimReplicaSets iterates over all replicaset objects in the same namespace as the deployment and calls m.ClaimObject on each of them. m.ClaimObject adopts replicasets that match but are not associated with the deployment by setting their ownerReferences field, and removes the corresponding ownerReferences from replicasets that are associated but no longer match.
// pkg/controller/controller_ref_manager.go
func (m *ReplicaSetControllerRefManager) ClaimReplicaSets(sets []*apps.ReplicaSet) ([]*apps.ReplicaSet, error) {
var claimed []*apps.ReplicaSet
var errlist []error
match := func(obj metav1.Object) bool {
return m.Selector.Matches(labels.Set(obj.GetLabels()))
}
adopt := func(obj metav1.Object) error {
return m.AdoptReplicaSet(obj.(*apps.ReplicaSet))
}
release := func(obj metav1.Object) error {
return m.ReleaseReplicaSet(obj.(*apps.ReplicaSet))
}
for _, rs := range sets {
ok, err := m.ClaimObject(rs, match, adopt, release)
if err != nil {
errlist = append(errlist, err)
continue
}
if ok {
claimed = append(claimed, rs)
}
}
return claimed, utilerrors.NewAggregate(errlist)
}
2.2 dc.getPodMapForDeployment
dc.getPodMapForDeployment: using the deployment's selector, get the pods associated with the deployment, group them by the UID of the replicaset they belong to, and return the result as a map[types.UID][]*v1.Pod.
// pkg/controller/deployment/deployment_controller.go
func (dc *DeploymentController) getPodMapForDeployment(d *apps.Deployment, rsList []*apps.ReplicaSet) (map[types.UID][]*v1.Pod, error) {
// Get all Pods that potentially belong to this Deployment.
selector, err := metav1.LabelSelectorAsSelector(d.Spec.Selector)
if err != nil {
return nil, err
}
pods, err := dc.podLister.Pods(d.Namespace).List(selector)
if err != nil {
return nil, err
}
// Group Pods by their controller (if it's in rsList).
podMap := make(map[types.UID][]*v1.Pod, len(rsList))
for _, rs := range rsList {
podMap[rs.UID] = []*v1.Pod{}
}
for _, pod := range pods {
// Do not ignore inactive Pods because Recreate Deployments need to verify that no
// Pods from older versions are running before spinning up new Pods.
controllerRef := metav1.GetControllerOf(pod)
if controllerRef == nil {
continue
}
// Only append if we care about this UID.
if _, ok := podMap[controllerRef.UID]; ok {
podMap[controllerRef.UID] = append(podMap[controllerRef.UID], pod)
}
}
return podMap, nil
}
2.3 dc.syncStatusOnly
If the deployment's DeletionTimestamp is not nil, dc.syncStatusOnly is called: it recalculates the deployment's status field from the replicasets the deployment owns and updates it; afterwards syncDeployment returns immediately without going any further.
// pkg/controller/deployment/sync.go
func (dc *DeploymentController) syncStatusOnly(d *apps.Deployment, rsList []*apps.ReplicaSet) error {
newRS, oldRSs, err := dc.getAllReplicaSetsAndSyncRevision(d, rsList, false)
if err != nil {
return err
}
allRSs := append(oldRSs, newRS)
return dc.syncDeploymentStatus(allRSs, newRS, d)
}
// pkg/controller/deployment/sync.go
func (dc *DeploymentController) syncDeploymentStatus(allRSs []*apps.ReplicaSet, newRS *apps.ReplicaSet, d *apps.Deployment) error {
newStatus := calculateStatus(allRSs, newRS, d)
if reflect.DeepEqual(d.Status, newStatus) {
return nil
}
newDeployment := d
newDeployment.Status = newStatus
_, err := dc.client.AppsV1().Deployments(newDeployment.Namespace).UpdateStatus(newDeployment)
return err
}
For how exactly the deployment's status is calculated, see the calculateStatus function; all of the calculation logic lives there and is not expanded on here.
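To give a feel for the kind of aggregation involved, here is a heavily simplified, illustrative sketch of how the replica counters in the status can be summed from the replicasets; it is not the real calculateStatus and omits conditions, collision count handling and other details.

package statussketch

import apps "k8s.io/api/apps/v1"

// simplifiedStatus sketches how a deployment's replica counters can be
// aggregated from its replicasets: totals come from all replicasets, while
// updatedReplicas only counts the newest one.
func simplifiedStatus(allRSs []*apps.ReplicaSet, newRS *apps.ReplicaSet, d *apps.Deployment) apps.DeploymentStatus {
	status := apps.DeploymentStatus{ObservedGeneration: d.Generation}
	for _, rs := range allRSs {
		if rs == nil {
			continue
		}
		status.Replicas += rs.Status.Replicas
		status.ReadyReplicas += rs.Status.ReadyReplicas
		status.AvailableReplicas += rs.Status.AvailableReplicas
	}
	if newRS != nil {
		status.UpdatedReplicas = newRS.Status.Replicas
	}
	if status.Replicas > status.AvailableReplicas {
		status.UnavailableReplicas = status.Replicas - status.AvailableReplicas
	}
	return status
}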
2.4 dc.rollback
First, getRollbackTo checks whether the deployment's annotations contain the key deprecated.deployment.rollback.to; if the key exists and its value is not empty, dc.rollback is called to perform the rollback.
// pkg/controller/deployment/rollback.go
func getRollbackTo(d *apps.Deployment) *extensions.RollbackConfig {
// Extract the annotation used for round-tripping the deprecated RollbackTo field.
revision := d.Annotations[apps.DeprecatedRollbackTo]
if revision == "" {
return nil
}
revision64, err := strconv.ParseInt(revision, 10, 64)
if err != nil {
// If it's invalid, ignore it.
return nil
}
return &extensions.RollbackConfig{
Revision: revision64,
}
}
Main logic of dc.rollback:
(1) Get the list of replicasets that match and are associated with the deployment;
(2) Get the revision to roll back to;
(3) Iterate over the replicasets obtained above and compare each one's revision with the target revision; when they match, call dc.rollbackToTemplate to perform the rollback (mainly by copying that revision's replicaset pod template into the deployment's .Spec.Template);
(4) Finally, regardless of whether the rollback succeeded, clear the rollback information from the deployment object (i.e. remove the deprecated.deployment.rollback.to annotation that round-trips the deprecated .spec.rollbackTo field) and update the deployment.
// pkg/controller/deployment/rollback.go
func (dc *DeploymentController) rollback(d *apps.Deployment, rsList []*apps.ReplicaSet) error {
newRS, allOldRSs, err := dc.getAllReplicaSetsAndSyncRevision(d, rsList, true)
if err != nil {
return err
}
allRSs := append(allOldRSs, newRS)
rollbackTo := getRollbackTo(d)
// If rollback revision is 0, rollback to the last revision
if rollbackTo.Revision == 0 {
if rollbackTo.Revision = deploymentutil.LastRevision(allRSs); rollbackTo.Revision == 0 {
// If we still can't find the last revision, gives up rollback
dc.emitRollbackWarningEvent(d, deploymentutil.RollbackRevisionNotFound, "Unable to find last revision.")
// Gives up rollback
return dc.updateDeploymentAndClearRollbackTo(d)
}
}
for _, rs := range allRSs {
v, err := deploymentutil.Revision(rs)
if err != nil {
klog.V(4).Infof("Unable to extract revision from deployment's replica set %q: %v", rs.Name, err)
continue
}
if v == rollbackTo.Revision {
klog.V(4).Infof("Found replica set %q with desired revision %d", rs.Name, v)
// rollback by copying podTemplate.Spec from the replica set
// revision number will be incremented during the next getAllReplicaSetsAndSyncRevision call
// no-op if the spec matches current deployment's podTemplate.Spec
performedRollback, err := dc.rollbackToTemplate(d, rs)
if performedRollback && err == nil {
dc.emitRollbackNormalEvent(d, fmt.Sprintf("Rolled back deployment %q to revision %d", d.Name, rollbackTo.Revision))
}
return err
}
}
dc.emitRollbackWarningEvent(d, deploymentutil.RollbackRevisionNotFound, "Unable to find the revision to rollback to.")
// Gives up rollback
return dc.updateDeploymentAndClearRollbackTo(d)
}
2.5 dc.sync
Now let's analyze dc.sync. In both of the following cases dc.sync is called and then syncDeployment returns immediately:
(1) when the deployment's .Spec.Paused is true;
(2) when dc.isScalingEvent reports that the deployment is in a scaling state.
About the Paused field
When a deployment's .Spec.Paused is true, the deployment is paused; when it is false, the deployment is in the normal state. While a deployment is paused, modifications to its PodTemplateSpec do not trigger a deployment update; only when .Spec.Paused is set back to false will an update be triggered again.
Main logic of dc.sync:
(1) Call dc.getAllReplicaSetsAndSyncRevision to get the latest replicaset object and the list of old replicaset objects;
(2) Call dc.scale to decide whether a scaling operation is needed and perform it if so;
(3) When the deployment's .Spec.Paused is true and no rollback is in progress, call dc.cleanupDeployment to delete the oldest old replicasets according to the number of history revisions to keep (.Spec.RevisionHistoryLimit) and the replicasets' creation times;
(4) Call dc.syncDeploymentStatus to compute and update the deployment's status field.
// pkg/controller/deployment/sync.go
// sync is responsible for reconciling deployments on scaling events or when they
// are paused.
func (dc *DeploymentController) sync(d *apps.Deployment, rsList []*apps.ReplicaSet) error {
newRS, oldRSs, err := dc.getAllReplicaSetsAndSyncRevision(d, rsList, false)
if err != nil {
return err
}
if err := dc.scale(d, newRS, oldRSs); err != nil {
// If we get an error while trying to scale, the deployment will be requeued
// so we can abort this resync
return err
}
// Clean up the deployment when it's paused and no rollback is in flight.
if d.Spec.Paused && getRollbackTo(d) == nil {
if err := dc.cleanupDeployment(oldRSs, d); err != nil {
return err
}
}
allRSs := append(oldRSs, newRS)
return dc.syncDeploymentStatus(allRSs, newRS, d)
}
2.5.1 dc.scale
dc.scale is responsible for scaling the deployment up or down. Its main logic:
(1) Call deploymentutil.FindActiveOrLatest: if exactly one replicaset has a non-zero replica count (or none does, in which case the newest replicaset is used), check whether that replicaset's replica count equals the deployment's desired replica count; if it does, return directly, otherwise call dc.scaleReplicaSetAndRecordEvent to set its replica count to the deployment's desired count;
(2) If the latest replicaset is already saturated (its replica count matches the deployment's desired count) while some old replicasets still have non-zero replica counts, find those old replicasets and call dc.scaleReplicaSetAndRecordEvent to scale each of them down to 0, then return;
(3) Otherwise, when the latest replicaset's replica count differs from the deployment's desired count, old replicasets still have replicas, and the deployment's update strategy is rolling update, the deployment may be in the middle of a rolling update; in that case the new and old replicasets are scaled proportionally to keep the rolling update stable. The details are left to the reader and not expanded on here.
// pkg/controller/deployment/sync.go
func (dc *DeploymentController) scale(deployment *apps.Deployment, newRS *apps.ReplicaSet, oldRSs []*apps.ReplicaSet) error {
// If there is only one active replica set then we should scale that up to the full count of the
// deployment. If there is no active replica set, then we should scale up the newest replica set.
if activeOrLatest := deploymentutil.FindActiveOrLatest(newRS, oldRSs); activeOrLatest != nil {
if *(activeOrLatest.Spec.Replicas) == *(deployment.Spec.Replicas) {
return nil
}
_, _, err := dc.scaleReplicaSetAndRecordEvent(activeOrLatest, *(deployment.Spec.Replicas), deployment)
return err
}
// If the new replica set is saturated, old replica sets should be fully scaled down.
// This case handles replica set adoption during a saturated new replica set.
if deploymentutil.IsSaturated(deployment, newRS) {
for _, old := range controller.FilterActiveReplicaSets(oldRSs) {
if _, _, err := dc.scaleReplicaSetAndRecordEvent(old, 0, deployment); err != nil {
return err
}
}
return nil
}
// There are old replica sets with pods and the new replica set is not saturated.
// We need to proportionally scale all replica sets (new and old) in case of a
// rolling deployment.
if deploymentutil.IsRollingUpdate(deployment) {
allRSs := controller.FilterActiveReplicaSets(append(oldRSs, newRS))
allRSsReplicas := deploymentutil.GetReplicaCountForReplicaSets(allRSs)
allowedSize := int32(0)
if *(deployment.Spec.Replicas) > 0 {
allowedSize = *(deployment.Spec.Replicas) + deploymentutil.MaxSurge(*deployment)
}
// Number of additional replicas that can be either added or removed from the total
// replicas count. These replicas should be distributed proportionally to the active
// replica sets.
deploymentReplicasToAdd := allowedSize - allRSsReplicas
// The additional replicas should be distributed proportionally amongst the active
// replica sets from the larger to the smaller in size replica set. Scaling direction
// drives what happens in case we are trying to scale replica sets of the same size.
// In such a case when scaling up, we should scale up newer replica sets first, and
// when scaling down, we should scale down older replica sets first.
var scalingOperation string
switch {
case deploymentReplicasToAdd > 0:
sort.Sort(controller.ReplicaSetsBySizeNewer(allRSs))
scalingOperation = "up"
case deploymentReplicasToAdd < 0:
sort.Sort(controller.ReplicaSetsBySizeOlder(allRSs))
scalingOperation = "down"
}
// Iterate over all active replica sets and estimate proportions for each of them.
// The absolute value of deploymentReplicasAdded should never exceed the absolute
// value of deploymentReplicasToAdd.
deploymentReplicasAdded := int32(0)
nameToSize := make(map[string]int32)
for i := range allRSs {
rs := allRSs[i]
// Estimate proportions if we have replicas to add, otherwise simply populate
// nameToSize with the current sizes for each replica set.
if deploymentReplicasToAdd != 0 {
proportion := deploymentutil.GetProportion(rs, *deployment, deploymentReplicasToAdd, deploymentReplicasAdded)
nameToSize[rs.Name] = *(rs.Spec.Replicas) + proportion
deploymentReplicasAdded += proportion
} else {
nameToSize[rs.Name] = *(rs.Spec.Replicas)
}
}
// Update all replica sets
for i := range allRSs {
rs := allRSs[i]
// Add/remove any leftovers to the largest replica set.
if i == 0 && deploymentReplicasToAdd != 0 {
leftover := deploymentReplicasToAdd - deploymentReplicasAdded
nameToSize[rs.Name] = nameToSize[rs.Name] + leftover
if nameToSize[rs.Name] < 0 {
nameToSize[rs.Name] = 0
}
}
// TODO: Use transactions when we have them.
if _, _, err := dc.scaleReplicaSet(rs, nameToSize[rs.Name], deployment, scalingOperation); err != nil {
// Return as soon as we fail, the deployment is requeued
return err
}
}
}
return nil
}
2.5.2 dc.cleanupDeployment
When all of the deployment's pods are updated and available and no old pods are running, dc.cleanupDeployment is called: based on the number of history revisions the deployment keeps (.Spec.RevisionHistoryLimit) and the replicasets' creation times, it deletes the oldest of the old replicasets.
// pkg/controller/deployment/sync.go
func (dc *DeploymentController) cleanupDeployment(oldRSs []*apps.ReplicaSet, deployment *apps.Deployment) error {
if !deploymentutil.HasRevisionHistoryLimit(deployment) {
return nil
}
// Avoid deleting replica set with deletion timestamp set
aliveFilter := func(rs *apps.ReplicaSet) bool {
return rs != nil && rs.ObjectMeta.DeletionTimestamp == nil
}
cleanableRSes := controller.FilterReplicaSets(oldRSs, aliveFilter)
diff := int32(len(cleanableRSes)) - *deployment.Spec.RevisionHistoryLimit
if diff <= 0 {
return nil
}
sort.Sort(controller.ReplicaSetsByCreationTimestamp(cleanableRSes))
klog.V(4).Infof("Looking to cleanup old replica sets for deployment %q", deployment.Name)
for i := int32(0); i < diff; i++ {
rs := cleanableRSes[i]
// Avoid delete replica set with non-zero replica counts
if rs.Status.Replicas != 0 || *(rs.Spec.Replicas) != 0 || rs.Generation > rs.Status.ObservedGeneration || rs.DeletionTimestamp != nil {
continue
}
klog.V(4).Infof("Trying to cleanup replica set %q for deployment %q", rs.Name, deployment.Name)
if err := dc.client.AppsV1().ReplicaSets(rs.Namespace).Delete(rs.Name, nil); err != nil && !errors.IsNotFound(err) {
// Return error instead of aggregating and continuing DELETEs on the theory
// that we may be overloading the api server.
return err
}
}
return nil
}
2.6 dc.rolloutRecreate
Based on the deployment's update strategy .Spec.Strategy.Type, dc.rolloutRecreate is called for further processing when the strategy is Recreate.
Main logic of dc.rolloutRecreate:
(1) Call dc.getAllReplicaSetsAndSyncRevision to get the latest replicaset object and the list of old replicaset objects;
(2) Call dc.scaleDownOldReplicaSetsForRecreate to scale the old replicasets down to 0 replicas; if any old replicasets needed to be scaled down, call dc.syncRolloutStatus to update the deployment status and return immediately;
(3) Call the oldPodsRunning function to check whether any pod belonging to the deployment is still running (a pod whose pod.Status.Phase is Failed or Succeeded counts as not running); if some are still running, call dc.syncRolloutStatus to update the deployment status and return immediately;
(4) If the new replicaset object has not been created yet, call dc.getAllReplicaSetsAndSyncRevision again to create it (note: the newly created replicaset has a replica count of 0);
(5) Call dc.scaleUpNewReplicaSetForRecreate to scale the newly created replicaset up, setting its replica count to the deployment's desired replica count (the deployment's .Spec.Replicas);
(6) Call util.DeploymentComplete to check whether all of the deployment's pods are updated and available and no old pods are running; if so, call dc.cleanupDeployment to delete the oldest old replicasets according to the configured revision history limit (.Spec.RevisionHistoryLimit) and the replicasets' creation times;
(7) Call dc.syncRolloutStatus to update the deployment status.
// pkg/controller/deployment/recreate.go
// rolloutRecreate implements the logic for recreating a replica set.
func (dc *DeploymentController) rolloutRecreate(d *apps.Deployment, rsList []*apps.ReplicaSet, podMap map[types.UID][]*v1.Pod) error {
// Don't create a new RS if not already existed, so that we avoid scaling up before scaling down.
newRS, oldRSs, err := dc.getAllReplicaSetsAndSyncRevision(d, rsList, false)
if err != nil {
return err
}
allRSs := append(oldRSs, newRS)
activeOldRSs := controller.FilterActiveReplicaSets(oldRSs)
// scale down old replica sets.
scaledDown, err := dc.scaleDownOldReplicaSetsForRecreate(activeOldRSs, d)
if err != nil {
return err
}
if scaledDown {
// Update DeploymentStatus.
return dc.syncRolloutStatus(allRSs, newRS, d)
}
// Do not process a deployment when it has old pods running.
if oldPodsRunning(newRS, oldRSs, podMap) {
return dc.syncRolloutStatus(allRSs, newRS, d)
}
// If we need to create a new RS, create it now.
if newRS == nil {
newRS, oldRSs, err = dc.getAllReplicaSetsAndSyncRevision(d, rsList, true)
if err != nil {
return err
}
allRSs = append(oldRSs, newRS)
}
// scale up new replica set.
if _, err := dc.scaleUpNewReplicaSetForRecreate(newRS, d); err != nil {
return err
}
if util.DeploymentComplete(d, &d.Status) {
if err := dc.cleanupDeployment(oldRSs, d); err != nil {
return err
}
}
// Sync deployment status.
return dc.syncRolloutStatus(allRSs, newRS, d)
}
2.6.1 dc.getAllReplicaSetsAndSyncRevision
dc.getAllReplicaSetsAndSyncRevision returns all of the old replicaset objects together with the latest replicaset object.
Which replicaset is the latest? The one whose pod template matches the deployment's pod template. Which replicasets are old? Every replicaset other than the latest one.
// pkg/controller/deployment/sync.go
func (dc *DeploymentController) getAllReplicaSetsAndSyncRevision(d *apps.Deployment, rsList []*apps.ReplicaSet, createIfNotExisted bool) (*apps.ReplicaSet, []*apps.ReplicaSet, error) {
_, allOldRSs := deploymentutil.FindOldReplicaSets(d, rsList)
// Get new replica set with the updated revision number
newRS, err := dc.getNewReplicaSet(d, rsList, allOldRSs, createIfNotExisted)
if err != nil {
return nil, nil, err
}
return newRS, allOldRSs, nil
}
2.6.2 dc.syncRolloutStatus
syncRolloutStatus computes the deployment's new status and updates it; the specific calculation logic can be read in the code and is not expanded on here.
// pkg/controller/deployment/progress.go
func (dc *DeploymentController) syncRolloutStatus(allRSs []*apps.ReplicaSet, newRS *apps.ReplicaSet, d *apps.Deployment) error {
newStatus := calculateStatus(allRSs, newRS, d)
...
}
2.6.3 oldPodsRunning
oldPodsRunning iterates over all of the deployment's pods, picks out the pods belonging to old replicaset objects, and checks whether every such pod's status (pod.Status.Phase) is Failed or Succeeded; if so, none of the old pods are running any more and it returns false.
// pkg/controller/deployment/recreate.go
func oldPodsRunning(newRS *apps.ReplicaSet, oldRSs []*apps.ReplicaSet, podMap map[types.UID][]*v1.Pod) bool {
if oldPods := util.GetActualReplicaCountForReplicaSets(oldRSs); oldPods > 0 {
return true
}
for rsUID, podList := range podMap {
// If the pods belong to the new ReplicaSet, ignore.
if newRS != nil && newRS.UID == rsUID {
continue
}
for _, pod := range podList {
switch pod.Status.Phase {
case v1.PodFailed, v1.PodSucceeded:
// Don't count pods in terminal state.
continue
case v1.PodUnknown:
// This happens in situation like when the node is temporarily disconnected from the cluster.
// If we can't be sure that the pod is not running, we have to count it.
return true
default:
// Pod is not in terminal phase.
return true
}
}
}
return false
}
2.6.4 dc.getAllReplicaSetsAndSyncRevision
dc.getAllReplicaSetsAndSyncRevision gets the latest replicaset object and the list of old replicaset objects; when the createIfNotExisted argument is true and the new replicaset does not exist yet, dc.getNewReplicaSet creates a new replicaset object (with a replica count of 0).
// pkg/controller/deployment/sync.go
func (dc *DeploymentController) getAllReplicaSetsAndSyncRevision(d *apps.Deployment, rsList []*apps.ReplicaSet, createIfNotExisted bool) (*apps.ReplicaSet, []*apps.ReplicaSet, error) {
_, allOldRSs := deploymentutil.FindOldReplicaSets(d, rsList)
// Get new replica set with the updated revision number
newRS, err := dc.getNewReplicaSet(d, rsList, allOldRSs, createIfNotExisted)
if err != nil {
return nil, nil, err
}
return newRS, allOldRSs, nil
}
2.6.5 dc.scaleDownOldReplicaSetsForRecreate
This iterates over all old replicasets and calls dc.scaleReplicaSetAndRecordEvent to scale each of them down to 0 replicas.
// pkg/controller/deployment/recreate.go
func (dc *DeploymentController) scaleDownOldReplicaSetsForRecreate(oldRSs []*apps.ReplicaSet, deployment *apps.Deployment) (bool, error) {
scaled := false
for i := range oldRSs {
rs := oldRSs[i]
// Scaling not required.
if *(rs.Spec.Replicas) == 0 {
continue
}
scaledRS, updatedRS, err := dc.scaleReplicaSetAndRecordEvent(rs, 0, deployment)
if err != nil {
return false, err
}
if scaledRS {
oldRSs[i] = updatedRS
scaled = true
}
}
return scaled, nil
}
2.6.6 dc.scaleUpNewReplicaSetForRecreate
This calls dc.scaleReplicaSetAndRecordEvent to set the latest replicaset's replica count to the deployment's desired replica count.
// pkg/controller/deployment/recreate.go
func (dc *DeploymentController) scaleUpNewReplicaSetForRecreate(newRS *apps.ReplicaSet, deployment *apps.Deployment) (bool, error) {
scaled, _, err := dc.scaleReplicaSetAndRecordEvent(newRS, *(deployment.Spec.Replicas), deployment)
return scaled, err
}
2.7 dc.rolloutRolling
Based on the deployment's update strategy .Spec.Strategy.Type, dc.rolloutRolling is called for further processing when the strategy is RollingUpdate.
Main logic of dc.rolloutRolling:
(1) Call dc.getAllReplicaSetsAndSyncRevision to get the latest replicaset object and the list of old replicaset objects; when the new replicaset does not exist yet, a new replicaset object is created (with a replica count of 0);
(2) Call dc.reconcileNewReplicaSet to reconcile the new replicaset: based on the rolling-update setting .Spec.Strategy.RollingUpdate.MaxSurge and the number of existing pods, decide whether to scale the new replicaset up and by how many replicas;
(3) If the new replicaset's replica count was changed during reconciliation, call dc.syncRolloutStatus to update the deployment status and return immediately;
(4) Call dc.reconcileOldReplicaSets: based on the rolling-update setting .Spec.Strategy.RollingUpdate.MaxUnavailable, the number of existing available pods and the number of available pods owned by the new replicaset, decide whether to scale the old replicasets down and by how many replicas;
(5) If the old replicasets' replica counts were changed during reconciliation, call dc.syncRolloutStatus to update the deployment status and return immediately;
(6) Call deploymentutil.DeploymentComplete to check whether all of the deployment's pods are updated and available and no old pods are running; if so, call dc.cleanupDeployment to delete the oldest old replicasets according to the configured revision history limit (.Spec.RevisionHistoryLimit) and the replicasets' creation times;
(7) Call dc.syncRolloutStatus to update the deployment status.
// pkg/controller/deployment/rolling.go
// rolloutRolling implements the logic for rolling a new replica set.
func (dc *DeploymentController) rolloutRolling(d *apps.Deployment, rsList []*apps.ReplicaSet) error {
newRS, oldRSs, err := dc.getAllReplicaSetsAndSyncRevision(d, rsList, true)
if err != nil {
return err
}
allRSs := append(oldRSs, newRS)
// Scale up, if we can.
scaledUp, err := dc.reconcileNewReplicaSet(allRSs, newRS, d)
if err != nil {
return err
}
if scaledUp {
// Update DeploymentStatus
return dc.syncRolloutStatus(allRSs, newRS, d)
}
// Scale down, if we can.
scaledDown, err := dc.reconcileOldReplicaSets(allRSs, controller.FilterActiveReplicaSets(oldRSs), newRS, d)
if err != nil {
return err
}
if scaledDown {
// Update DeploymentStatus
return dc.syncRolloutStatus(allRSs, newRS, d)
}
if deploymentutil.DeploymentComplete(d, &d.Status) {
if err := dc.cleanupDeployment(oldRSs, d); err != nil {
return err
}
}
// Sync deployment status
return dc.syncRolloutStatus(allRSs, newRS, d)
}
2.7.1 dc.reconcileNewReplicaSet
dc.reconcileNewReplicaSet reconciles the new replicaset object: based on the deployment's rolling-update configuration and the number of existing pods, it decides whether to scale the new replicaset up.
Main logic:
(1) If the new replicaset's replica count equals the deployment's declared replica count, no reconciliation is needed; return directly;
(2) If the new replicaset's replica count is greater than the deployment's declared replica count, call dc.scaleReplicaSetAndRecordEvent to scale the replicaset down to the declared count, then return;
(3) If the new replicaset's replica count is less than the deployment's declared replica count, call deploymentutil.NewRSNewReplicas, which uses the rolling-update setting .Spec.Strategy.RollingUpdate.MaxSurge to compute the replica count the new replicaset should have, then call dc.scaleReplicaSetAndRecordEvent to update the replicaset's replica count.
// pkg/controller/deployment/rolling.go
func (dc *DeploymentController) reconcileNewReplicaSet(allRSs []*apps.ReplicaSet, newRS *apps.ReplicaSet, deployment *apps.Deployment) (bool, error) {
if *(newRS.Spec.Replicas) == *(deployment.Spec.Replicas) {
// Scaling not required.
return false, nil
}
if *(newRS.Spec.Replicas) > *(deployment.Spec.Replicas) {
// Scale down.
scaled, _, err := dc.scaleReplicaSetAndRecordEvent(newRS, *(deployment.Spec.Replicas), deployment)
return scaled, err
}
newReplicasCount, err := deploymentutil.NewRSNewReplicas(deployment, allRSs, newRS)
if err != nil {
return false, err
}
scaled, _, err := dc.scaleReplicaSetAndRecordEvent(newRS, newReplicasCount, deployment)
return scaled, err
}
NewRSNewReplicas
When the deployment uses the rolling-update strategy, NewRSNewReplicas uses the .Spec.Strategy.RollingUpdate.MaxSurge setting and intstrutil.GetValueFromIntOrPercent to compute maxSurge (the maximum number of pods that may exist above the deployment's declared replica count during the rolling update), and then, from maxSurge and the number of existing pods, computes the replica count the new replicaset should have.
// pkg/controller/deployment/util/deployment_util.go
func NewRSNewReplicas(deployment *apps.Deployment, allRSs []*apps.ReplicaSet, newRS *apps.ReplicaSet) (int32, error) {
switch deployment.Spec.Strategy.Type {
case apps.RollingUpdateDeploymentStrategyType:
// Check if we can scale up.
maxSurge, err := intstrutil.GetValueFromIntOrPercent(deployment.Spec.Strategy.RollingUpdate.MaxSurge, int(*(deployment.Spec.Replicas)), true)
if err != nil {
return 0, err
}
// Find the total number of pods
currentPodCount := GetReplicaCountForReplicaSets(allRSs)
maxTotalPods := *(deployment.Spec.Replicas) + int32(maxSurge)
if currentPodCount >= maxTotalPods {
// Cannot scale up.
return *(newRS.Spec.Replicas), nil
}
// Scale up.
scaleUpCount := maxTotalPods - currentPodCount
// Do not exceed the number of desired replicas.
scaleUpCount = int32(integer.IntMin(int(scaleUpCount), int(*(deployment.Spec.Replicas)-*(newRS.Spec.Replicas))))
return *(newRS.Spec.Replicas) + scaleUpCount, nil
case apps.RecreateDeploymentStrategyType:
return *(deployment.Spec.Replicas), nil
default:
return 0, fmt.Errorf("deployment type %v isn't supported", deployment.Spec.Strategy.Type)
}
}
intstrutil.GetValueFromIntOrPercent
Computing maxSurge is not complicated either. When maxSurge is a percentage, because the roundUp argument is true, the formula is maxSurge = ⌈deployment.Spec.Strategy.RollingUpdate.MaxSurge × deployment.Spec.Replicas / 100⌉ (rounded up); when maxSurge is not a percentage, its value is returned directly.
// staging/src/k8s.io/apimachinery/pkg/util/intstr/intstr.go
func GetValueFromIntOrPercent(intOrPercent *IntOrString, total int, roundUp bool) (int, error) {
if intOrPercent == nil {
return 0, errors.New("nil value for IntOrString")
}
value, isPercent, err := getIntOrPercentValue(intOrPercent)
if err != nil {
return 0, fmt.Errorf("invalid value for IntOrString: %v", err)
}
if isPercent {
if roundUp {
value = int(math.Ceil(float64(value) * (float64(total)) / 100))
} else {
value = int(math.Floor(float64(value) * (float64(total)) / 100))
}
}
return value, nil
}
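As a quick usage sketch (the numbers are made up), the same helper shows how the default 25% values resolve for a deployment with 10 desired replicas — maxSurge rounds up and maxUnavailable rounds down:

package main

import (
	"fmt"

	intstrutil "k8s.io/apimachinery/pkg/util/intstr"
)

func main() {
	replicas := 10
	maxSurge := intstrutil.FromString("25%")
	maxUnavailable := intstrutil.FromString("25%")

	// maxSurge rounds up: ceil(25 * 10 / 100) = 3
	surge, _ := intstrutil.GetValueFromIntOrPercent(&maxSurge, replicas, true)
	// maxUnavailable rounds down: floor(25 * 10 / 100) = 2
	unavailable, _ := intstrutil.GetValueFromIntOrPercent(&maxUnavailable, replicas, false)

	fmt.Println(surge, unavailable) // 3 2
}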
2.7.2 dc.reconcileOldReplicaSets
dc.reconcileOldReplicaSets reconciles the old replicaset objects: based on the rolling-update setting .Spec.Strategy.RollingUpdate.MaxUnavailable and the number of existing available pods, it decides whether to scale the old replicasets down.
Main logic:
(1) Get the total replica count of the old replicasets; if it is 0, the old replicasets cannot be scaled down any further and reconciliation is done, so return directly;
(2) Call deploymentutil.MaxUnavailable to compute maxUnavailable, the maximum number of pods that may be unavailable (note: when the MaxUnavailable and MaxSurge values in the deployment's rolling-update configuration both come out as 0, this computation of MaxUnavailable returns 1, because with both at 0 the rolling update could not proceed);
(3) From maxUnavailable, the deployment's desired replica count, the new replicaset's desired replica count and the new replicaset's number of available replicas, compute maxScaledDown, the maximum number of replicas that can be scaled down; if maxScaledDown is less than or equal to 0, the old replicasets cannot be scaled down for now, so return directly;
(4) Call dc.cleanupUnhealthyReplicas to scale down unhealthy replicas first (e.g. pods that are not-ready, unscheduled or pending), processing the replicasets in order of creation time; the details are not expanded on here;
(5) Call dc.scaleDownOldReplicaSetsForRollingUpdate, which computes from the rolling-update setting .Spec.Strategy.RollingUpdate.MaxUnavailable the replica counts the old replicasets should have and calls dc.scaleReplicaSetAndRecordEvent to scale the old replicasets down (so, as can be seen, both dc.cleanupUnhealthyReplicas and dc.scaleDownOldReplicaSetsForRollingUpdate may scale old replicasets down);
(6) If the total number of replicas scaled down is greater than 0, return true; otherwise return false.
// pkg/controller/deployment/rolling.go
func (dc *DeploymentController) reconcileOldReplicaSets(allRSs []*apps.ReplicaSet, oldRSs []*apps.ReplicaSet, newRS *apps.ReplicaSet, deployment *apps.Deployment) (bool, error) {
oldPodsCount := deploymentutil.GetReplicaCountForReplicaSets(oldRSs)
if oldPodsCount == 0 {
// Can't scale down further
return false, nil
}
allPodsCount := deploymentutil.GetReplicaCountForReplicaSets(allRSs)
klog.V(4).Infof("New replica set %s/%s has %d available pods.", newRS.Namespace, newRS.Name, newRS.Status.AvailableReplicas)
maxUnavailable := deploymentutil.MaxUnavailable(*deployment)
minAvailable := *(deployment.Spec.Replicas) - maxUnavailable
newRSUnavailablePodCount := *(newRS.Spec.Replicas) - newRS.Status.AvailableReplicas
maxScaledDown := allPodsCount - minAvailable - newRSUnavailablePodCount
if maxScaledDown <= 0 {
return false, nil
}
// Clean up unhealthy replicas first, otherwise unhealthy replicas will block deployment
// and cause timeout. See https://github.com/kubernetes/kubernetes/issues/16737
oldRSs, cleanupCount, err := dc.cleanupUnhealthyReplicas(oldRSs, deployment, maxScaledDown)
if err != nil {
return false, nil
}
klog.V(4).Infof("Cleaned up unhealthy replicas from old RSes by %d", cleanupCount)
// Scale down old replica sets, need check maxUnavailable to ensure we can scale down
allRSs = append(oldRSs, newRS)
scaledDownCount, err := dc.scaleDownOldReplicaSetsForRollingUpdate(allRSs, oldRSs, deployment)
if err != nil {
return false, nil
}
klog.V(4).Infof("Scaled down old RSes of deployment %s by %d", deployment.Name, scaledDownCount)
totalScaledDown := cleanupCount + scaledDownCount
return totalScaledDown > 0, nil
}
dc.scaleDownOldReplicaSetsForRollingUpdate
dc.scaleDownOldReplicaSetsForRollingUpdate scales the old replicaset objects down. Main logic:
(1) From the rolling-update setting .Spec.Strategy.RollingUpdate.MaxUnavailable and the number of existing available pods, compute totalScaleDownCount, the number of replicas that currently needs to be scaled down;
(2) Sort the old replicasets by creation time;
(3) Iterate over the old replicasets and scale each one down according to the total number of replicas that needs to be removed.
// pkg/controller/deployment/rolling.go
func (dc *DeploymentController) scaleDownOldReplicaSetsForRollingUpdate(allRSs []*apps.ReplicaSet, oldRSs []*apps.ReplicaSet, deployment *apps.Deployment) (int32, error) {
maxUnavailable := deploymentutil.MaxUnavailable(*deployment)
// Check if we can scale down.
minAvailable := *(deployment.Spec.Replicas) - maxUnavailable
// Find the number of available pods.
availablePodCount := deploymentutil.GetAvailableReplicaCountForReplicaSets(allRSs)
if availablePodCount <= minAvailable {
// Cannot scale down.
return 0, nil
}
klog.V(4).Infof("Found %d available pods in deployment %s, scaling down old RSes", availablePodCount, deployment.Name)
sort.Sort(controller.ReplicaSetsByCreationTimestamp(oldRSs))
totalScaledDown := int32(0)
totalScaleDownCount := availablePodCount - minAvailable
for _, targetRS := range oldRSs {
if totalScaledDown >= totalScaleDownCount {
// No further scaling required.
break
}
if *(targetRS.Spec.Replicas) == 0 {
// cannot scale down this ReplicaSet.
continue
}
// Scale down.
scaleDownCount := int32(integer.IntMin(int(*(targetRS.Spec.Replicas)), int(totalScaleDownCount-totalScaledDown)))
newReplicasCount := *(targetRS.Spec.Replicas) - scaleDownCount
if newReplicasCount > *(targetRS.Spec.Replicas) {
return 0, fmt.Errorf("when scaling down old RS, got invalid request to scale down %s/%s %d -> %d", targetRS.Namespace, targetRS.Name, *(targetRS.Spec.Replicas), newReplicasCount)
}
_, _, err := dc.scaleReplicaSetAndRecordEvent(targetRS, newReplicasCount, deployment)
if err != nil {
return totalScaledDown, err
}
totalScaledDown += scaleDownCount
}
return totalScaledDown, nil
}
Summary
The deployment controller is one of the many controllers in the kube-controller-manager component, and it is the controller for the deployment resource object. By watching the deployment, replicaset and pod resources, it reconciles the corresponding deployment whenever any of the three changes, carrying out scaling, pause and resume, updates, rollbacks, status updates, cleanup of old replicasets owned by the deployment, and so on.
All of these operations — scaling, pause and resume, updates, rollbacks, status updates and old-replicaset cleanup — are handled and dispatched inside the deployment controller's core processing method syncDeployment.
As for updates, depending on the update strategy configured on the deployment object (Recreate or RollingUpdate), the deployment controller calls the rolloutRecreate or rolloutRolling method to update the deployment.
The code analysis above also shows that the deployment controller is not responsible for deleting deployment objects, and, apart from cleaning up excess replicasets according to the revision history limit, it is not responsible for deleting replicaset objects either (in fact, other than those excess replicasets that the deployment controller cleans up per the revision history limit, replicaset deletion is handled by the garbage collector controller).
Deployment controller architecture
The rough structure and processing flow of the deployment controller are shown in the diagram below. The deployment controller registers event handlers for pod, replicaset and deployment objects; when an event occurs it is picked up by the watch and the corresponding deployment object is put into a queue. The syncDeployment method holds the deployment controller's core logic for reconciling deployment objects: it takes deployment objects off the queue and reconciles them.
Deployment controller core processing logic
The core processing logic of the deployment controller lives in the syncDeployment method; the diagram below shows the processing flow of syncDeployment.
How the deployment controller creates replicasets
Whether the deployment is configured with the Recreate or the RollingUpdate update strategy, the processing logic of dc.rolloutRecreate and dc.rolloutRolling checks whether the latest replicaset object exists and creates it if it does not.
After a deployment object is created, the deployment controller receives the deployment's add event and starts reconciling it. The first time it enters the dc.rolloutRecreate or dc.rolloutRolling logic, the deployment does not own any replicaset yet, so a new replicaset object is created.
Deployment Recreate update flow
(1) First scale the old replicasets down to 0 replicas;
(2) Wait until all pods of the old replicasets are no longer running (a pod whose pod.Status.Phase is Failed or Succeeded counts as not running);
(3) Then create the new replicaset object (note: the newly created replicaset has a replica count of 0);
(4) Then scale the newly created replicaset up to the deployment's desired replica count;
(5) Finally, wait until all of the deployment's pods belong to the latest replicaset, the number of pods equals the deployment's desired replica count, and all pods are available; the deployment update is then complete.
Deployment RollingUpdate update flow
(1) Based on the rolling-update setting .Spec.Strategy.RollingUpdate.MaxSurge and the number of existing pods, decide whether to scale the new replicaset up and by how many replicas;
(2) Based on the rolling-update setting .Spec.Strategy.RollingUpdate.MaxUnavailable, the number of existing available pods and the number of available pods owned by the new replicaset, decide whether to scale the old replicasets down and by how many replicas;
(3) Repeat the steps above until all of the deployment's pods belong to the latest replicaset, the number of pods equals the deployment's desired replica count, and all pods are available; the rolling update is then complete.
How the deployment rolling-update rate is controlled
Let's first look at the two key parameters of a deployment's rolling-update configuration:
(1) .Spec.Strategy.RollingUpdate.MaxUnavailable: the upper limit on the number of pods that may be unavailable during the update. The value can be an absolute number (e.g. 5) or a percentage of the deployment's desired replica count (e.g. 10%); the formula is: desired replica count × percentage, rounded down. If maxSurge is 0, this value must not be 0. MaxUnavailable defaults to 25%. The smaller the value, the more stable the service stays and the smoother the deployment update.
(2) .Spec.Strategy.RollingUpdate.MaxSurge: the number of pods that may be created above the desired number of pods. The value can be an absolute number (e.g. 5) or a percentage of the deployment's desired replica count (e.g. 10%); the formula is: desired replica count × percentage, rounded up. If MaxUnavailable is 0, this value must not be 0. MaxSurge defaults to 25%. The larger the value, the faster the deployment update.
Mnemonic: with percentages, maxSurge rounds up and maxUnavailable rounds down.
Note: MaxUnavailable and MaxSurge must not both be configured as 0, but after the calculation both values may come out as 0; in that case, to keep the rolling update working, the deployment controller sets MaxUnavailable to 1 during the rolling update, as shown in the sketch below.
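A minimal, illustrative sketch of that rule (modeled on the idea behind the controller's fencepost-resolution helper, deploymentutil.ResolveFenceposts, but not the exact implementation):

package rollingsketch

import (
	intstrutil "k8s.io/apimachinery/pkg/util/intstr"
)

// resolveMaxSurgeAndUnavailable sketches how the effective maxSurge and
// maxUnavailable are derived for a desired replica count: maxSurge rounds up,
// maxUnavailable rounds down, and if both resolve to 0 the rollout could never
// make progress, so maxUnavailable is forced to 1.
func resolveMaxSurgeAndUnavailable(maxSurge, maxUnavailable *intstrutil.IntOrString, desired int32) (int32, int32, error) {
	surge, err := intstrutil.GetValueFromIntOrPercent(maxSurge, int(desired), true) // round up
	if err != nil {
		return 0, 0, err
	}
	unavailable, err := intstrutil.GetValueFromIntOrPercent(maxUnavailable, int(desired), false) // round down
	if err != nil {
		return 0, 0, err
	}
	if surge == 0 && unavailable == 0 {
		// Both zero would block the rolling update entirely; fall back to 1 unavailable pod.
		unavailable = 1
	}
	return int32(surge), int32(unavailable), nil
}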
For example, with a deployment whose desired replica count is 2, MaxSurge set to 0 and MaxUnavailable set to 1% (MaxUnavailable is a percentage; applying the formula and rounding down gives 0, so both MaxSurge and MaxUnavailable are 0, and the controller therefore sets MaxUnavailable to 1 for the rolling update): once the rolling update is triggered, the old replicaset is immediately scaled down to 1 replica and the new replicaset is scaled up to 1. Once the new pod is available, the old replicaset can be scaled down to 0 and the new replicaset scaled up to 2. Throughout the rolling update, at least 1 available pod is guaranteed at all times.
For example, with a deployment whose desired replica count is 2, MaxSurge set to 1% and MaxUnavailable set to 0 (MaxSurge is computed from the formula and rounded up to 1): once the rolling update is triggered, the new replicaset is immediately scaled up to 1 replica; once the new pod is available, the old replicaset is scaled down to 1 and the new replicaset scaled up to 2; once that new pod is available, the old replicaset is scaled down to 0. Throughout the rolling update, at least 2 available pods are guaranteed at all times.
More examples:
// 2 desired, max unavailable 1%, surge 0% - should scale old(-1), then new(+1), then old(-1), then new(+1)
// 1 desired, max unavailable 1%, surge 0% - should scale old(-1), then new(+1)
// 2 desired, max unavailable 25%, surge 1% - should scale new(+1), then old(-1), then new(+1), then old(-1)
// 1 desired, max unavailable 25%, surge 1% - should scale new(+1), then old(-1)
// 2 desired, max unavailable 0%, surge 1% - should scale new(+1), then old(-1), then new(+1), then old(-1)
// 1 desired, max unavailable 0%, surge 1% - should scale new(+1), then old(-1)