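Learn Knative with Me (4): Serving Autoscaling

Knative Serving uses a single shared autoscaler. By default this is the Knative Pod Autoscaler (KPA), which provides fast, request-based autoscaling out of the box.

You can also configure Knative to use the Horizontal Pod Autoscaler (HPA) or a custom autoscaler.

KPA configuration

The KPA configuration lives in the config-autoscaler ConfigMap in the knative-serving namespace. Run the following command to view its default contents: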

kubectl -n knative-serving describe cm config-autoscaler

The default configuration looks like this:

====
_example:
----
################################
#                              #
#    EXAMPLE CONFIGURATION     #
#                              #
################################

# This block is not actually functional configuration,
# but serves to illustrate the available configuration
# options and document them in a way that is accessible
# to users that `kubectl edit` this config map.
#
# These sample configuration options may be copied out of
# this example block and unindented to be in the data block
# to actually change the configuration.

# The Revision ContainerConcurrency field specifies the maximum number
# of requests the Container can handle at once. Container concurrency
# target percentage is how much of that maximum to use in a stable
# state. E.g. if a Revision specifies ContainerConcurrency of 10, then
# the Autoscaler will try to maintain 7 concurrent connections per pod
# on average.
# Note: this limit will be applied to container concurrency set at every
# level (ConfigMap, Revision Spec or Annotation).
# For legacy and backwards compatibility reasons, this value also accepts
# fractional values in (0, 1] interval (i.e. 0.7 ⇒ 70%).
# Thus minimal percentage value must be greater than 1.0, or it will be
# treated as a fraction.
# NOTE: that this value does not affect actual number of concurrent requests
#       the user container may receive, but only the average number of requests
#       that the revision pods will receive.
container-concurrency-target-percentage: "70"

# The container concurrency target default is what the Autoscaler will
# try to maintain when concurrency is used as the scaling metric for the
# Revision and the Revision specifies unlimited concurrency.
# When revision explicitly specifies container concurrency, that value
# will be used as a scaling target for autoscaler.
# When specifying unlimited concurrency, the autoscaler will
# horizontally scale the application based on this target concurrency.
# This is what we call "soft limit" in the documentation, i.e. it only
# affects number of pods and does not affect the number of requests
# individual pod processes.
# The value must be a positive number such that the value multiplied
# by container-concurrency-target-percentage is greater than 0.01.
# NOTE: that this value will be adjusted by application of
#       container-concurrency-target-percentage, i.e. by default
#       the system will target on average 70 concurrent requests
#       per revision pod.
# NOTE: Only one metric can be used for autoscaling a Revision.
container-concurrency-target-default: "100"

# The requests per second (RPS) target default is what the Autoscaler will
# try to maintain when RPS is used as the scaling metric for a Revision and
# the Revision specifies unlimited RPS. Even when specifying unlimited RPS,
# the autoscaler will horizontally scale the application based on this
# target RPS.
# Must be greater than 1.0.
# NOTE: Only one metric can be used for autoscaling a Revision.
requests-per-second-target-default: "200"

# The target burst capacity specifies the size of burst in concurrent
# requests that the system operator expects the system will receive.
# Autoscaler will try to protect the system from queueing by introducing
# Activator in the request path if the current spare capacity of the
# service is less than this setting.
# If this setting is 0, then Activator will be in the request path only
# when the revision is scaled to 0.
# If this setting is > 0 and container-concurrency-target-percentage is
# 100% or 1.0, then activator will always be in the request path.
# -1 denotes unlimited target-burst-capacity and activator will always
# be in the request path.
# Other negative values are invalid.
target-burst-capacity: "200"

# When operating in a stable mode, the autoscaler operates on the
# average concurrency over the stable window.
# Stable window must be in whole seconds.
stable-window: "60s"

# When observed average concurrency during the panic window reaches
# panic-threshold-percentage the target concurrency, the autoscaler
# enters panic mode. When operating in panic mode, the autoscaler
# scales on the average concurrency over the panic window which is
# panic-window-percentage of the stable-window.
# When computing the panic window it will be rounded to the closest
# whole second.
panic-window-percentage: "10.0"

# The percentage of the container concurrency target at which to
# enter panic mode when reached within the panic window.
panic-threshold-percentage: "200.0"

# Max scale up rate limits the rate at which the autoscaler will
# increase pod count. It is the maximum ratio of desired pods versus
# observed pods.
# Cannot be less or equal to 1.
# I.e with value of 2.0 the number of pods can at most go N to 2N
# over single Autoscaler period (see tick-interval), but at least N to
# N+1, if Autoscaler needs to scale up.
max-scale-up-rate: "1000.0"

# Max scale down rate limits the rate at which the autoscaler will
# decrease pod count. It is the maximum ratio of observed pods versus
# desired pods.
# Cannot be less or equal to 1.
# I.e. with value of 2.0 the number of pods can at most go N to N/2
# over single Autoscaler evaluation period (see tick-interval), but at
# least N to N-1, if Autoscaler needs to scale down.
max-scale-down-rate: "2.0"

# Scale to zero feature flag
enable-scale-to-zero: "true"

# Tick interval is the time between autoscaling calculations.
tick-interval: "2s"

# Scale to zero grace period is the time an inactive revision is left
# running before it is scaled to zero (min: 6s).
scale-to-zero-grace-period: "30s"

# Enable graceful scaledown feature flag.
# Once enabled, it allows the autoscaler to prioritize pods processing
# fewer (or zero) requests for removal when scaling down.
enable-graceful-scaledown: "false"

# pod-autoscaler-class specifies the default pod autoscaler class
# that should be used if none is specified. If omitted, the Knative
# Horizontal Pod Autoscaler (KPA) is used by default.
pod-autoscaler-class: "kpa.autoscaling.knative.dev"

# The capacity of a single activator task.
# The `unit` is one concurrent request proxied by the activator.
# activator-capacity must be at least 1.
# This value is used for computation of the Activator subset size.
# See the algorithm here: http://bit.ly/38XiCZ3.
# TODO(vagababov): tune after actual benchmarking.
activator-capacity: "100.0"

Next, let's look at what each configuration option means.

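enable-scale-to-zero: Set this to true if you want revisions to scale to zero. It is enabled by default.
scale-to-zero-grace-period: How long an inactive revision is left running before it is scaled to zero (minimum: 6s). The default is 30s.
stable-window: In stable mode, the autoscaler operates on the average concurrency over the stable window (minimum: 6s). The default is 60s. It can also be set per Revision through an annotation in the Revision template, for example autoscaling.knative.dev/window: "60s".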
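container-concurrency-target-percentage: The share of a Revision's container concurrency that the autoscaler aims to use in a stable state. The default is 70. For example, if a Revision specifies a ContainerConcurrency of 10, the autoscaler tries to maintain 7 concurrent requests per pod on average. Values in the interval (0, 1] are treated as fractions (0.7 means 70%).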
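activator-capacity: The capacity of a single activator task, measured in concurrent requests proxied by the activator. It must be at least 1 and is used to compute the activator subset size.
pod-autoscaler-class: The default pod autoscaler class to use when none is specified. If omitted, the Knative Pod Autoscaler (KPA) is used.
enable-graceful-scaledown: Feature flag for graceful scale-down. When enabled, the autoscaler prioritizes pods that are processing fewer (or zero) requests for removal when scaling down.
tick-interval: The time between autoscaling calculations. The default is 2s.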
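max-scale-down-rate: Limits how fast the autoscaler removes pods. It is the maximum ratio of observed pods to desired pods and must be greater than 1. With a value of 2.0, a revision running N pods can shrink to at most N/2 pods in a single autoscaler evaluation period (see tick-interval), but shrinks by at least one pod (N to N-1) if the autoscaler needs to scale down. The default is 2.0.
max-scale-up-rate: Limits how fast the autoscaler adds pods. It is the maximum ratio of desired pods to observed pods and must be greater than 1. With a value of 2.0, a revision running N pods can grow to at most 2N pods in a single autoscaler period (see tick-interval), but grows by at least one pod (N to N+1) if the autoscaler needs to scale up. The default is 1000.0.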
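panic-threshold-percentage: The percentage of the container concurrency target at which the autoscaler enters panic mode, when reached within the panic window. The default is 200.0.
panic-window-percentage: The size of the panic window as a percentage of the stable window. When the average concurrency observed during the panic window reaches panic-threshold-percentage of the target concurrency, the autoscaler enters panic mode and scales on the average concurrency over the panic window. The computed panic window is rounded to the nearest whole second. The default is 10.0.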
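target-burst-capacity: The size of the burst, in concurrent requests, that the operator expects the system to receive. If the current spare capacity of the service is less than this value, the autoscaler tries to protect the system from queueing by putting the Activator in the request path. If set to 0, the Activator is in the request path only when the revision is scaled to zero. If set to a value greater than 0 while container-concurrency-target-percentage is 100% (or 1.0), the Activator is always in the request path. -1 means unlimited target burst capacity, and the Activator is always in the request path. Other negative values are invalid.
requests-per-second-target-default: The target the autoscaler tries to maintain when requests per second (RPS) is the scaling metric for a Revision and the Revision specifies unlimited RPS. Even with unlimited RPS, the autoscaler scales the application horizontally based on this target. The value must be greater than 1.0. Note: only one metric can be used for autoscaling a Revision.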
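To make the relationship between these options concrete, here is a minimal Python sketch that derives the per-pod soft target and the panic window from the default values shown in the ConfigMap. The function and key names (`as_fraction`, `seconds`, `derived_settings`, `soft_target_per_pod`, `panic_window_seconds`) are ours for illustration and are not part of Knative; the duration parser only handles the whole-second form used in this article.

```python
import math

# Values copied from the example ConfigMap shown above.
CONFIG = {
    "container-concurrency-target-percentage": "70",
    "container-concurrency-target-default": "100",
    "stable-window": "60s",
    "panic-window-percentage": "10.0",
}


def as_fraction(percentage: str) -> float:
    """Turn a percentage string into a fraction.

    Per the ConfigMap comments, values in (0, 1] are already fractions.
    """
    value = float(percentage)
    return value if value <= 1.0 else value / 100.0


def seconds(duration: str) -> int:
    """Parse a whole-second duration such as "60s" (the only form handled here)."""
    if not duration.endswith("s"):
        raise ValueError(f"unsupported duration: {duration!r}")
    return int(duration[:-1])


def derived_settings(config: dict) -> dict:
    """Compute the values the ConfigMap comments describe in prose."""
    utilization = as_fraction(config["container-concurrency-target-percentage"])
    soft_target = float(config["container-concurrency-target-default"]) * utilization
    panic_window = round(
        seconds(config["stable-window"])
        * float(config["panic-window-percentage"]) / 100.0
    )
    return {
        "soft_target_per_pod": soft_target,    # about 70 with the defaults
        "panic_window_seconds": panic_window,  # 10% of 60s
    }
```

With the defaults, the autoscaler aims for roughly 70 concurrent requests per pod on average, and the panic window is one tenth of the 60-second stable window.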

Termination period

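The termination period is the time a pod takes to shut down after its last request completes. It equals the sum of the stable-window and scale-to-zero-grace-period values. With the defaults above, that is 60s + 30s = 90 seconds.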

Configuring concurrency

The autoscaler's concurrency can be configured in the following ways:

target

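target defines how many concurrent requests are desired at a given time (a soft limit). It is the recommended way to configure the autoscaler in Knative.

The default concurrency target configured in the ConfigMap is 100.

container-concurrency-target-default: "100"

This value can be overridden per Revision with the autoscaling.knative.dev/target annotation: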

autoscaling.knative.dev/target: "50"

containerConcurrency

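Note: use containerConcurrency only when the application needs an enforced concurrency constraint, that is, when you explicitly need to limit how many requests reach the application at a given time.

containerConcurrency limits the number of concurrent requests allowed at a given time (a hard limit) and is configured in the Revision template.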

containerConcurrency: 0 | 1 | 2-N

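1: guarantees that only one request at a time is handled by a given container instance of the Revision;
2-N: limits request concurrency to that value (2 or more);
0: no limit; the system decides.

Configuring scale bounds (minScale and maxScale)

minScale and maxScale set the minimum and maximum number of pods that serve the application. Use them to limit cold starts or to control compute costs.

minScale and maxScale can be configured in the Revision template as follows: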

spec:
 template:
  metadata:
   annotations:
    autoscaling.knative.dev/minScale: "2"
    autoscaling.knative.dev/maxScale: "10"

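Default behavior

If the minScale annotation is not set, pods scale down to zero (or to 1 if enable-scale-to-zero is false in the ConfigMap above).

If the maxScale annotation is not set, there is no upper limit on the number of pods created.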
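The bounds and their defaults can be summarized in a few lines of Python. This is only a sketch of the rules described above, not Knative's implementation; `apply_scale_bounds` and its parameters are hypothetical names, and `None` stands for "annotation not set".

```python
from typing import Optional


def apply_scale_bounds(
    desired: int,
    min_scale: Optional[int] = None,
    max_scale: Optional[int] = None,
    enable_scale_to_zero: bool = True,
) -> int:
    """Clamp a desired pod count to the minScale/maxScale bounds."""
    # Without a minScale annotation the floor is 0, or 1 when
    # scale-to-zero is disabled in the ConfigMap.
    floor = min_scale if min_scale is not None else 0
    if not enable_scale_to_zero:
        floor = max(floor, 1)
    bounded = max(desired, floor)
    # Without a maxScale annotation there is no upper limit.
    if max_scale is not None:
        bounded = min(bounded, max_scale)
    return bounded
```

For the annotations in the snippet above (minScale "2", maxScale "10"), a desired count of 0 becomes 2 and a desired count of 50 becomes 10.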

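How the KPA works

Autoscaling comes down to two questions. The first is which metric to scale on: CPU, memory, or RPS? The second is the scaling policy, that is, how many instances to scale to.

Components involved

The autoscaling system consists of a few "physical" and logical components, briefly defined here. Knowing what they are, where they are deployed, and what they do helps greatly in understanding the control and data flow. The components may do more than is outlined here; this article covers only the details that affect autoscaling.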

Queue Proxy

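The queue proxy is a sidecar container deployed alongside the user container in every user pod. Every request sent to an application instance passes through the queue proxy first, hence the "proxy" in its name.

The queue proxy's main purpose is to measure and limit concurrency to the user's application. If a revision defines a concurrency limit of 5, the queue proxy makes sure that no more than 5 requests reach the application instance at once. If more requests arrive, it queues them locally, hence the "queue" in its name. The queue proxy also measures the incoming request load and reports the average concurrency and requests per second on a separate port.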
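The limit-and-queue behavior can be modeled with a semaphore. The following is a toy Python model under our own assumptions, not the queue proxy's actual code: `QueueProxyModel` and `run_burst` are hypothetical names, and `asyncio.sleep` stands in for the user container doing work.

```python
import asyncio


class QueueProxyModel:
    """Toy model: cap in-flight requests and record the peak concurrency."""

    def __init__(self, limit: int) -> None:
        self._slots = asyncio.Semaphore(limit)
        self.in_flight = 0
        self.peak = 0
        self.handled = 0

    async def handle(self, work_seconds: float) -> None:
        # Requests beyond the limit wait here, which is the "queue" part.
        async with self._slots:
            self.in_flight += 1
            self.peak = max(self.peak, self.in_flight)
            await asyncio.sleep(work_seconds)  # stand-in for the user container
            self.in_flight -= 1
            self.handled += 1


async def run_burst(limit: int, requests: int) -> QueueProxyModel:
    """Send `requests` simultaneous requests through a proxy with `limit` slots."""
    proxy = QueueProxyModel(limit)
    await asyncio.gather(*(proxy.handle(0.01) for _ in range(requests)))
    return proxy
```

Running `asyncio.run(run_burst(limit=5, requests=12))` handles all 12 requests while never letting more than 5 reach the simulated application at once.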

Autoscaler

The autoscaler is a standalone pod made up of three main components:

PodAutoscaler reconciler
Collector
Decider

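The PodAutoscaler reconciler makes sure that any change to PodAutoscalers (see the API section) is picked up correctly and reflected in the Decider, the Collector, or both.

The Collector is responsible for collecting metrics from the queue proxies on the application instances. To do so, it scrapes their internal metrics endpoints and sums the results to get a metric representative of the whole system. For scalability, only a sample of the application instances is scraped, and the received metrics are extrapolated to the whole set of instances.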
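The sample-and-extrapolate idea can be sketched as follows. How Knative picks the sample size is not covered in this article, so `sample_size` is a hypothetical parameter, as is the function name `estimate_total_concurrency`.

```python
import random
from typing import Sequence


def estimate_total_concurrency(
    per_pod_concurrency: Sequence[float],
    sample_size: int,
    rng: random.Random,
) -> float:
    """Scrape a random sample of pods and extrapolate to all pods."""
    if not per_pod_concurrency:
        return 0.0
    # Never sample more pods than exist, and always sample at least one.
    sample_size = max(1, min(sample_size, len(per_pod_concurrency)))
    sample = rng.sample(list(per_pod_concurrency), sample_size)
    average = sum(sample) / len(sample)
    # Assume the unsampled pods look like the sampled ones.
    return average * len(per_pod_concurrency)
```

The estimate is exact when every pod carries the same load and only approximate otherwise, which is the trade-off made for scalability.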

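The Decider takes all available metrics and decides how many pods the application deployment should be scaled to. Basically, the calculation is want = concurrencyInSystem/targetConcurrencyPerInstance.

In addition, it adjusts the value for the revision's maximum scale rate and its minimum and maximum instance settings. It also computes how much burst capacity is left in the current deployment, which determines whether the Activator can be taken out of the data path.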
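Here is a hedged Python sketch of that calculation, including the max-scale-up-rate and max-scale-down-rate limits from the ConfigMap. The function name, the rounding choices (ceil for the upper limit, floor for the lower limit) and the treatment of zero ready pods as one pod are our assumptions; they are chosen so that the "at least N+1" and "at least N-1" guarantees from the ConfigMap comments hold. The minScale and maxScale bounds are left out here.

```python
import math


def desired_pods(
    concurrency_in_system: float,
    target_per_pod: float,
    ready_pods: int,
    max_scale_up_rate: float = 1000.0,
    max_scale_down_rate: float = 2.0,
) -> int:
    """want = concurrencyInSystem / targetConcurrencyPerInstance, rate-limited."""
    if target_per_pod <= 0:
        raise ValueError("target_per_pod must be positive")
    if max_scale_up_rate <= 1 or max_scale_down_rate <= 1:
        raise ValueError("scale rates must be greater than 1")
    want = math.ceil(concurrency_in_system / target_per_pod)
    # Assumption: treat zero ready pods as one so scale-from-zero is possible.
    base = max(ready_pods, 1)
    upper = math.ceil(base * max_scale_up_rate)          # at most N -> rate * N
    lower = math.floor(ready_pods / max_scale_down_rate)  # at most N -> N / rate
    return max(lower, min(want, upper))
```

For example, 50 concurrent requests against a target of 10 per pod gives 5 pods with the default rates, but only 2 pods in one step if max-scale-up-rate were 2.0 and a single pod was ready.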

Activator

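The Activator is a globally shared deployment that is highly scalable. Its main purposes are to buffer requests and to report metrics to the autoscaler.

The Activator is mainly involved in scaling to and from zero and in capacity-aware load balancing. When a revision is scaled to zero instances, the Activator is placed in the data path instead of the revision's instances. If requests arrive for that revision, the Activator buffers them, pokes the autoscaler with metrics, and holds the requests until application instances appear. At that point the Activator immediately forwards the buffered requests to the new instances while taking care not to overload the existing instances of the application. The Activator effectively acts as a load balancer here: it distributes load across all pods as they become available and does not overload them with respect to their concurrency settings. The system places the Activator in the data path, or takes it out, as it sees fit so that it can act as the load balancer described above. If the current deployment has enough headroom that it is unlikely to be overloaded, the Activator is removed from the data path to keep network overhead to a minimum.

Unlike the queue proxy, the Activator actively sends metrics to the autoscaler over a WebSocket connection, to minimize scale-from-zero latency.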
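The rules for when the Activator sits in the data path follow from the target-burst-capacity comments in the ConfigMap. The sketch below encodes only those rules; `activator_in_path` and `capacity_per_pod` are hypothetical names, and how the real autoscaler measures spare capacity is an assumption on our part.

```python
def activator_in_path(
    ready_pods: int,
    capacity_per_pod: float,
    observed_concurrency: float,
    target_burst_capacity: float,
) -> bool:
    """Decide whether the Activator should be in the request path."""
    if target_burst_capacity < 0 and target_burst_capacity != -1:
        raise ValueError("negative values other than -1 are invalid")
    if ready_pods == 0:
        return True   # scaled to zero: the Activator must buffer requests
    if target_burst_capacity == -1:
        return True   # unlimited burst capacity: always in the path
    if target_burst_capacity == 0:
        return False  # only in the path when scaled to zero
    # Assumption: spare capacity is what the ready pods could still absorb.
    spare = ready_pods * capacity_per_pod - observed_concurrency
    return spare < target_burst_capacity
```

With the default target-burst-capacity of 200, three pods that can each take 100 requests and currently see 50 in total have 250 spare, so the Activator can step out of the path; with two pods the spare capacity drops to 150 and it stays in.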

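Algorithm

The autoscaler scales on the average number of in-flight requests per pod (concurrency). The system default target concurrency is 100, but this example uses 10 for the service. With a load of 50 concurrent requests, the autoscaler creates 5 pods (50 concurrent requests / target of 10 = 5 pods).

The algorithm has two modes, stable and panic. Stable mode works over a long window and panic mode over a short one; panic mode exists to scale up quickly when requests surge within a short period.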

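Stable Mode

In stable mode, the autoscaler adjusts the Deployment's replica count to reach the desired concurrency per pod. The calculation uses the average per-pod concurrency over the 60-second window rather than the current replica count, because there is a delay between adding a pod and that pod becoming ready to serve and report metrics.

Panic Mode

The KPA computes average concurrency over a 60-second window, so the system needs about a minute to settle at the desired concurrency level. The autoscaler also computes a 6-second panic window, and enters panic mode if that window reaches 2x the target concurrency. In panic mode the autoscaler operates on the shorter, more sensitive panic window. Once the panic condition has not been met for 60 seconds, the autoscaler returns to the original 60-second stable window.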

                                                       |
                                  Panic Target--->  +--| 20
                                                    |  |
                                                    | <------Panic Window
                                                    |  |
       Stable Target--->  +-------------------------|--| 10   CONCURRENCY
                          |                         |  |
                          |                      <-----------Stable Window
                          |                         |  |
--------------------------+-------------------------+--+ 0
120                       60                           0
                     TIME
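The two windows in the diagram can be modeled with a small state tracker. This is a simplified sketch under our own assumptions: it takes one per-pod concurrency sample per second, and `ModeTracker` with its methods is a hypothetical name, not Knative's API. The defaults mirror the diagram (target 10, stable window 60s, panic window 6s, panic threshold 2x).

```python
from collections import deque


class ModeTracker:
    """Tracks stable/panic mode from one per-pod concurrency sample per second."""

    def __init__(self, target_per_pod: float = 10.0, stable_window: int = 60,
                 panic_window: int = 6, panic_threshold: float = 2.0) -> None:
        self.target_per_pod = target_per_pod
        self.stable_window = stable_window
        self.panic_window = panic_window
        self.panic_threshold = panic_threshold
        self.samples = deque(maxlen=stable_window)  # keeps the stable window only
        self.panicking = False
        self.calm_seconds = 0

    def _average(self, last: int) -> float:
        recent = list(self.samples)[-last:]
        return sum(recent) / len(recent) if recent else 0.0

    def observe(self, concurrency_per_pod: float) -> str:
        """Record one sample and return the resulting mode."""
        self.samples.append(concurrency_per_pod)
        panic_level = self.panic_threshold * self.target_per_pod
        if self._average(self.panic_window) >= panic_level:
            self.panicking = True
            self.calm_seconds = 0  # the panic condition is met again
        elif self.panicking:
            self.calm_seconds += 1
            if self.calm_seconds >= self.stable_window:
                self.panicking = False  # calm for a full stable window
        return "panic" if self.panicking else "stable"

    def scaling_concurrency(self) -> float:
        """Average the autoscaler would scale on in the current mode."""
        window = self.panic_window if self.panicking else self.stable_window
        return self._average(window)
```

Feeding the tracker a steady 10 per pod keeps it in stable mode; a sudden spike that pushes the 6-second average to 20 or more switches it to panic mode, and it only returns to stable mode after 60 consecutive samples in which the panic condition is not met.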

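Data flow

Scaling in stable mode

In steady state, the autoscaler continuously scrapes the currently active revision pods to keep adjusting the revision's scale. As requests flow into the system, the scraped values change, and the autoscaler instructs the revision's deployment to follow the given scale.

The SKS (ServerlessService) tracks changes in the deployment's size through the private service and updates the public service accordingly.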

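Scaling to zero

A revision is scaled to zero once there are no more requests in the system. All scrapes from the autoscaler to the revision's pods return a concurrency of 0, and the activator reports the same (1).

Before the last pod of the revision is actually removed, the system makes sure that the activator is in the path and routable. The autoscaler, which made the decision to scale to zero in the first place, instructs the SKS to use proxy mode, so all traffic is directed to the activator (4.1). The SKS's public service is then probed until it is confirmed to return a response from the activator. Once that is the case and the grace period (configurable through scale-to-zero-grace-period) has elapsed, the last pod of the revision is removed and the revision has successfully scaled to zero (5).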

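Scaling from zero

If a revision is scaled to zero and a request arrives for it, the system needs to scale it up. Because the SKS is in proxy mode, the request reaches the activator (1), which counts it and reports it to the autoscaler (2.1). The activator then buffers the request and watches the SKS's private service for endpoints to appear (2.2).

The autoscaler receives the metric from the activator and immediately runs an autoscaling cycle (3). That cycle determines that at least one pod is needed (4), and the autoscaler instructs the revision's deployment to scale to N > 0 replicas (5.1). It also puts the SKS into serve mode, so that traffic flows directly to the revision's pods once they are up (5.2).

The activator eventually sees the endpoints appear and starts probing them. Once a probe passes, the corresponding address is considered healthy and is used to route the buffered request and any other requests that arrived in the meantime (8.2).

The revision has now successfully scaled from zero.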
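The buffering step of this flow can be illustrated with a toy Python model. It is a sketch under our own assumptions, not the Activator's implementation: `ActivatorModel` and `simulate_cold_start` are hypothetical names, and an `asyncio.Event` stands in for "an endpoint appeared and passed its probe".

```python
import asyncio
from typing import List, Tuple


class ActivatorModel:
    """Toy model: count requests, then hold them until an endpoint is healthy."""

    def __init__(self) -> None:
        self.endpoint_ready = asyncio.Event()
        self.reported = 0            # requests counted and reported to the autoscaler
        self.forwarded: List[int] = []

    async def handle(self, request_id: int) -> None:
        self.reported += 1                 # count and report the request
        await self.endpoint_ready.wait()   # buffer until a pod is healthy
        self.forwarded.append(request_id)  # forward to the new instance


async def simulate_cold_start(request_ids: List[int]) -> Tuple[int, List[int]]:
    activator = ActivatorModel()
    tasks = [asyncio.create_task(activator.handle(r)) for r in request_ids]
    await asyncio.sleep(0)  # let the requests reach the activator
    buffered = activator.reported - len(activator.forwarded)
    activator.endpoint_ready.set()  # the new pod passed its probe
    await asyncio.gather(*tasks)
    return buffered, activator.forwarded
```

Calling `asyncio.run(simulate_cold_start([1, 2, 3]))` shows all three requests held while no pod exists and all three forwarded once the endpoint becomes ready.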

KPA example

We use the official autoscale-go sample for the demonstration. The service.yaml is as follows:

apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: autoscale-go
  namespace: default
spec:
  template:
    metadata:
      annotations:
        # Target 10 in-flight-requests per pod.
        autoscaling.knative.dev/target: "10"
        autoscaling.knative.dev/minScale: "1" 
        autoscaling.knative.dev/maxScale: "3"
    spec:
      containers:
      - image: gcr.io/knative-samples/autoscale-go:0.1

Once it is deployed, one instance is kept running even with no traffic, because minScale is set to 1.

kubectl get pods
NAME                                           READY   STATUS    RESTARTS   AGE
autoscale-go-h9x5z-deployment-84d57876-5mjzt   2/2     Running   0          12s

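Concurrency is the scaling metric. We send 50 concurrent requests for 30 seconds, with minScale keeping at least 1 instance and maxScale capping the scale-out at 3 instances.

We test with hey by running the following command: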

hey -z 30s -c 50 "http://autoscale-go.default.serverless.ushareit.me?sleep=100&prime=10000&bloat=5"

When it finishes, hey prints some statistics:

Summary:
  Total:    30.1853 secs
  Slowest:    0.4866 secs
  Fastest:    0.1753 secs
  Average:    0.1838 secs
  Requests/sec:    271.6219

  Total data:    819814 bytes
  Size/request:    99 bytes

Response time histogram:
  0.175 [1]    |
  0.206 [8044]    |■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■
  0.238 [63]    |
  0.269 [39]    |
  0.300 [0]    |
  0.331 [1]    |
  0.362 [1]    |
  0.393 [7]    |
  0.424 [17]    |
  0.455 [11]    |
  0.487 [15]    |


Latency distribution:
  10% in 0.1782 secs
  25% in 0.1794 secs
  50% in 0.1808 secs
  75% in 0.1828 secs
  90% in 0.1863 secs
  95% in 0.1910 secs
  99% in 0.2502 secs

Details (average, fastest, slowest):
  DNS+dialup:    0.0012 secs, 0.1753 secs, 0.4866 secs
  DNS-lookup:    0.0007 secs, 0.0000 secs, 0.1098 secs
  req write:    0.0000 secs, 0.0000 secs, 0.0024 secs
  resp wait:    0.1824 secs, 0.1752 secs, 0.3321 secs
  resp read:    0.0001 secs, 0.0000 secs, 0.0071 secs

Status code distribution:
  [200]    8199 responses
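As a quick sanity check on these numbers, Little's law says that throughput is roughly concurrency divided by average latency. The short Python sketch below applies it to the figures reported by hey; the function name `expected_throughput` is ours, and the inputs are copied from the output above.

```python
def expected_throughput(concurrency: int, average_latency_seconds: float) -> float:
    """Little's law: requests per second = in-flight requests / average latency."""
    return concurrency / average_latency_seconds


# Figures copied from the hey summary above.
REPORTED_RPS = 271.6219
from_littles_law = expected_throughput(50, 0.1838)  # 50 workers, 0.1838s average
from_totals = 8199 / 30.1853                        # responses / total seconds

print(f"Little's law estimate: {from_littles_law:.1f} req/s")
print(f"Responses / duration:  {from_totals:.1f} req/s")
```

Both estimates land within about 1% of the Requests/sec value that hey reports, so the 50 client workers were kept busy for the whole run.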

The pods scaled as follows:

kubectl get pods
NAME                                           READY   STATUS    RESTARTS   AGE
autoscale-go-h9x5z-deployment-84d57876-5mjzt   2/2     Running   0          5m15s
autoscale-go-h9x5z-deployment-84d57876-64b2f   2/2     Running   0          21s
autoscale-go-h9x5z-deployment-84d57876-pf2c9   2/2     Running   0          21s

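The service would otherwise have scaled out to 5 instances, but because maxScale is set to 3, it stops at 3.

Knative autoscaling can use either the default KPA or CPU-based metrics through the Horizontal Pod Autoscaler (HPA). To switch to CPU-based autoscaling instead of the default request-based metric, add or modify the autoscaling.knative.dev/class and autoscaling.knative.dev/metric annotations in the Revision template: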

spec:
 template:
  metadata:
   annotations:
    autoscaling.knative.dev/metric: cpu
    autoscaling.knative.dev/target: "70"
    autoscaling.knative.dev/class: hpa.autoscaling.knative.dev

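If you have deployed the monitoring stack in the knative-monitoring namespace, Grafana shows these changes more visually.

Conclusion

Compared with the traditional Kubernetes HPA, the KPA supports scaling to zero and scaling up from zero. With the traditional HPA, if the pod count is 0 there is nothing to collect RPS from, so even when traffic starts arriving the HPA will not scale up.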