Linkerd Canary Deployments and A/B Testing

By 為少, published on 2021-07-29

This guide shows you how to use Linkerd and Flagger to automate canary deployments and A/B testing.

Prerequisites

Flagger requires a Kubernetes cluster v1.16 or newer and Linkerd 2.10 or newer.
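
You can check both versions before starting, for example:

kubectl version
linkerd version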

Install Linkerd and Prometheus (part of Linkerd Viz):

linkerd install | kubectl apply -f -
linkerd viz install | kubectl apply -f -

Install Flagger in the linkerd namespace:

kubectl apply -k github.com/fluxcd/flagger//kustomize/linkerd

Bootstrap

Flagger takes a Kubernetes deployment and optionally a horizontal pod autoscaler (HPA), then creates a series of objects (Kubernetes deployments, ClusterIP services and an SMI traffic split). These objects expose the application inside the mesh and drive the canary analysis and promotion.
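
For podinfo, the generated SMI traffic split looks roughly like the following sketch (the backend weights are illustrative and shift during the analysis; the exact apiVersion depends on your Flagger and Linkerd versions):

apiVersion: split.smi-spec.io/v1alpha2
kind: TrafficSplit
metadata:
  name: podinfo
  namespace: test
spec:
  service: podinfo
  backends:
    - service: podinfo-primary
      weight: 95
    - service: podinfo-canary
      weight: 5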

Create a test namespace and enable Linkerd proxy injection:

kubectl create ns test
kubectl annotate namespace test linkerd.io/inject=enabled

Install the load testing service to generate traffic during the canary analysis:

kubectl apply -k https://github.com/fluxcd/flagger//kustomize/tester?ref=main

Create a deployment and a horizontal pod autoscaler:

kubectl apply -k https://github.com/fluxcd/flagger//kustomize/podinfo?ref=main

Create a Canary custom resource for the podinfo deployment:

apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
  name: podinfo
  namespace: test
spec:
  # deployment reference
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: podinfo
  # HPA reference (optional)
  autoscalerRef:
    apiVersion: autoscaling/v2beta2
    kind: HorizontalPodAutoscaler
    name: podinfo
  # the maximum time in seconds for the canary deployment
  # to make progress before it is rolled back (default 600s)
  progressDeadlineSeconds: 60
  service:
    # ClusterIP port number
    port: 9898
    # container port number or name (optional)
    targetPort: 9898
  analysis:
    # schedule interval (default 60s)
    interval: 30s
    # max number of failed metric checks before rollback
    threshold: 5
    # max traffic percentage routed to canary
    # percentage (0-100)
    maxWeight: 50
    # canary increment step
    # percentage (0-100)
    stepWeight: 5
    # Linkerd Prometheus checks
    metrics:
    - name: request-success-rate
      # minimum req success rate (non 5xx responses)
      # percentage (0-100)
      thresholdRange:
        min: 99
      interval: 1m
    - name: request-duration
      # maximum req duration P99
      # milliseconds
      thresholdRange:
        max: 500
      interval: 30s
    # testing (optional)
    webhooks:
      - name: acceptance-test
        type: pre-rollout
        url: http://flagger-loadtester.test/
        timeout: 30s
        metadata:
          type: bash
          cmd: "curl -sd 'test' http://podinfo-canary.test:9898/token | grep token"
      - name: load-test
        type: rollout
        url: http://flagger-loadtester.test/
        metadata:
          cmd: "hey -z 2m -q 10 -c 2 http://podinfo-canary.test:9898/"

Save the above resource as podinfo-canary.yaml and then apply it:

kubectl apply -f ./podinfo-canary.yaml

When the canary analysis starts, Flagger will call the pre-rollout webhooks before routing traffic to the canary. The canary analysis will run for about five minutes while validating the HTTP metrics and rollout hooks every half a minute (with stepWeight: 5, maxWeight: 50 and a 30s interval, promotion takes ten steps of 30 seconds each).

After a couple of seconds Flagger will create the canary objects:

# applied
deployment.apps/podinfo
horizontalpodautoscaler.autoscaling/podinfo
ingresses.extensions/podinfo
canary.flagger.app/podinfo

# generated
deployment.apps/podinfo-primary
horizontalpodautoscaler.autoscaling/podinfo-primary
service/podinfo
service/podinfo-canary
service/podinfo-primary
trafficsplits.split.smi-spec.io/podinfo

After the bootstrap, the podinfo deployment will be scaled to zero and the traffic to podinfo.test will be routed to the primary pods. During the canary analysis, the podinfo-canary.test address can be used to target the canary pods directly.

Automated canary promotion

Flagger implements a control loop that gradually shifts traffic to the canary while measuring key performance indicators such as the HTTP request success rate, average request duration and pod health.
Based on the analysis of the KPIs, a canary is promoted or aborted, and the analysis result is published to Slack.
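
As a minimal sketch, per-canary Slack notifications can be wired up with an AlertProvider resource referenced from the analysis spec (the webhook URL below is a placeholder):

apiVersion: flagger.app/v1beta1
kind: AlertProvider
metadata:
  name: on-call
  namespace: linkerd
spec:
  type: slack
  # placeholder incoming-webhook URL
  address: https://hooks.slack.com/services/YOUR/WEBHOOK/URL

# referenced from the Canary resource:
#   analysis:
#     alerts:
#       - name: "on-call team"
#         severity: error
#         providerRef:
#           name: on-call
#           namespace: linkerd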


Flagger canary stages

Trigger a canary deployment by updating the container image:

kubectl -n test set image deployment/podinfo \
podinfod=stefanprodan/podinfo:3.1.1

Flagger detects that the deployment revision has changed and starts a new rollout:

kubectl -n test describe canary/podinfo

Status:
  Canary Weight:         0
  Failed Checks:         0
  Phase:                 Succeeded
Events:
 New revision detected! Scaling up podinfo.test
 Waiting for podinfo.test rollout to finish: 0 of 1 updated replicas are available
 Pre-rollout check acceptance-test passed
 Advance podinfo.test canary weight 5
 Advance podinfo.test canary weight 10
 Advance podinfo.test canary weight 15
 Advance podinfo.test canary weight 20
 Advance podinfo.test canary weight 25
 Waiting for podinfo.test rollout to finish: 1 of 2 updated replicas are available
 Advance podinfo.test canary weight 30
 Advance podinfo.test canary weight 35
 Advance podinfo.test canary weight 40
 Advance podinfo.test canary weight 45
 Advance podinfo.test canary weight 50
 Copying podinfo.test template spec to podinfo-primary.test
 Waiting for podinfo-primary.test rollout to finish: 1 of 2 updated replicas are available
 Promotion completed! Scaling down podinfo.test

Note that if you apply new changes to the deployment during the canary analysis, Flagger will restart the analysis.

A canary deployment is triggered by changes in any of the following objects:

  • Deployment PodSpec (container image, command, ports, env, resources, etc.; see the example after this list)
  • ConfigMaps mounted as volumes or mapped to environment variables
  • Secrets mounted as volumes or mapped to environment variables
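
For example, changing an environment variable in the PodSpec also triggers a new rollout (PODINFO_UI_COLOR is just an illustrative podinfo setting):

kubectl -n test set env deployment/podinfo PODINFO_UI_COLOR=blue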

You can monitor all canaries with:

watch kubectl get canaries --all-namespaces

NAMESPACE   NAME      STATUS        WEIGHT   LASTTRANSITIONTIME
test        podinfo   Progressing   15       2019-06-30T14:05:07Z
prod        frontend  Succeeded     0        2019-06-30T16:15:07Z
prod        backend   Failed        0        2019-06-30T17:05:07Z

Automated rollback

During the canary analysis you can generate HTTP 500 errors and high latency to test whether Flagger pauses the rollout and rolls back the faulted version.

Trigger another canary deployment:

kubectl -n test set image deployment/podinfo \
podinfod=stefanprodan/podinfo:3.1.2

Exec into the load tester pod with:

kubectl -n test exec -it flagger-loadtester-xx-xx sh

Generate HTTP 500 errors:

watch -n 1 curl http://podinfo-canary.test:9898/status/500

Generate latency:

watch -n 1 curl http://podinfo-canary.test:9898/delay/1

When the number of failed checks reaches the canary analysis threshold, the traffic is routed back to the primary, the canary is scaled to zero and the rollout is marked as failed.

kubectl -n test describe canary/podinfo

Status:
  Canary Weight:         0
  Failed Checks:         10
  Phase:                 Failed
Events:
 Starting canary analysis for podinfo.test
 Pre-rollout check acceptance-test passed
 Advance podinfo.test canary weight 5
 Advance podinfo.test canary weight 10
 Advance podinfo.test canary weight 15
 Halt podinfo.test advancement success rate 69.17% < 99%
 Halt podinfo.test advancement success rate 61.39% < 99%
 Halt podinfo.test advancement success rate 55.06% < 99%
 Halt podinfo.test advancement request duration 1.20s > 0.5s
 Halt podinfo.test advancement request duration 1.45s > 0.5s
 Rolling back podinfo.test failed checks threshold reached 5
 Canary failed! Scaling down podinfo.test

Custom metrics

The canary analysis can be extended with Prometheus queries.

Let's define a check for not found errors. Edit the canary analysis and add the following metric:

  analysis:
    metrics:
    - name: "404s percentage"
      threshold: 3
      query: |
        100 - sum(
            rate(
                response_total{
                    namespace="test",
                    deployment="podinfo",
                    status_code!="404",
                    direction="inbound"
                }[1m]
            )
        )
        /
        sum(
            rate(
                response_total{
                    namespace="test",
                    deployment="podinfo",
                    direction="inbound"
                }[1m]
            )
        )
        * 100

The above configuration validates the canary version by checking whether the HTTP 404 req/sec percentage is below 3 percent of the total traffic.
If the 404 rate reaches the 3% threshold, the analysis is aborted and the canary is marked as failed.
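
You can sanity-check such a query against Linkerd's Prometheus before wiring it into the analysis, for example by port-forwarding the Viz extension's Prometheus (assuming the default linkerd-viz installation):

kubectl -n linkerd-viz port-forward svc/prometheus 9090
# then open http://localhost:9090 and paste the query above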

Trigger a canary deployment by updating the container image:

kubectl -n test set image deployment/podinfo \
podinfod=stefanprodan/podinfo:3.1.3

Generate 404s:

watch -n 1 curl http://podinfo-canary:9898/status/404

Watch the Flagger logs:

kubectl -n linkerd logs deployment/flagger -f | jq .msg

Starting canary deployment for podinfo.test
Pre-rollout check acceptance-test passed
Advance podinfo.test canary weight 5
Halt podinfo.test advancement 404s percentage 6.20 > 3
Halt podinfo.test advancement 404s percentage 6.45 > 3
Halt podinfo.test advancement 404s percentage 7.22 > 3
Halt podinfo.test advancement 404s percentage 6.50 > 3
Halt podinfo.test advancement 404s percentage 6.34 > 3
Rolling back podinfo.test failed checks threshold reached 5
Canary failed! Scaling down podinfo.test

If you have configured Slack, Flagger will send a notification with the reason why the canary failed.

Linkerd Ingress

Two ingress controllers are compatible with both Flagger and Linkerd: NGINX and Gloo.

Install NGINX:

helm upgrade -i nginx-ingress stable/nginx-ingress \
--namespace ingress-nginx

Create an ingress definition for podinfo that rewrites the incoming header to the internal service name (required by Linkerd):

apiVersion: extensions/v1beta1
kind: Ingress
metadata:
  name: podinfo
  namespace: test
  labels:
    app: podinfo
  annotations:
    kubernetes.io/ingress.class: "nginx"
    nginx.ingress.kubernetes.io/configuration-snippet: |
      proxy_set_header l5d-dst-override $service_name.$namespace.svc.cluster.local:9898;
      proxy_hide_header l5d-remote-ip;
      proxy_hide_header l5d-server-id;
spec:
  rules:
    - host: app.example.com
      http:
        paths:
          - backend:
              serviceName: podinfo
              servicePort: 9898
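
Save it as, for example, podinfo-ingress.yaml and apply it:

kubectl apply -f ./podinfo-ingress.yaml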

When using an ingress controller, the Linkerd traffic split does not apply to incoming traffic, since NGINX runs outside of the mesh. In order to run a canary analysis for a frontend app, Flagger creates a shadow ingress and sets the NGINX-specific annotations.
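
The generated canary ingress reuses the standard NGINX canary annotations; during a weighted rollout it looks roughly like this (the weight value is illustrative), while for A/B testing Flagger sets canary-by-header or canary-by-cookie instead:

metadata:
  annotations:
    kubernetes.io/ingress.class: "nginx"
    nginx.ingress.kubernetes.io/canary: "true"
    nginx.ingress.kubernetes.io/canary-weight: "5"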

A/B Testing

Besides weighted routing, Flagger can also be configured to route traffic to the canary based on HTTP match conditions. In an A/B testing scenario you would use HTTP headers or cookies to target a specific segment of your users. This is particularly useful for frontend applications that require session affinity.


Flagger Linkerd Ingress

Edit the podinfo canary analysis: set the provider to nginx, add the ingress reference, remove the max/step weights, and add the match conditions and iterations:

apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
  name: podinfo
  namespace: test
spec:
  # ingress reference
  provider: nginx
  ingressRef:
    apiVersion: extensions/v1beta1
    kind: Ingress
    name: podinfo
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: podinfo
  autoscalerRef:
    apiVersion: autoscaling/v2beta2
    kind: HorizontalPodAutoscaler
    name: podinfo
  service:
    # container port
    port: 9898
  analysis:
    interval: 1m
    threshold: 10
    iterations: 10
    match:
      # curl -H 'X-Canary: always' http://app.example.com
      - headers:
          x-canary:
            exact: "always"
      # curl -b 'canary=always' http://app.example.com
      - headers:
          cookie:
            exact: "canary"
    # Linkerd Prometheus checks
    metrics:
    - name: request-success-rate
      thresholdRange:
        min: 99
      interval: 1m
    - name: request-duration
      thresholdRange:
        max: 500
      interval: 30s
    webhooks:
      - name: acceptance-test
        type: pre-rollout
        url: http://flagger-loadtester.test/
        timeout: 30s
        metadata:
          type: bash
          cmd: "curl -sd 'test' http://podinfo-canary:9898/token | grep token"
      - name: load-test
        type: rollout
        url: http://flagger-loadtester.test/
        metadata:
          cmd: "hey -z 2m -q 10 -c 2 -H 'Cookie: canary=always' http://app.example.com"

The above configuration will run an analysis for ten minutes, targeting users that have a canary cookie set to always or that call the service with the X-Canary: always header.

Note that the load test now targets the external address and uses the canary cookie.
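
You can reach the canary yourself from outside the mesh using either of the match conditions above, for example:

curl -H 'X-Canary: always' http://app.example.com
curl -b 'canary=always' http://app.example.com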

Trigger a canary deployment by updating the container image:

kubectl -n test set image deployment/podinfo \
podinfod=stefanprodan/podinfo:3.1.4

Flagger detects that the deployment revision has changed and starts the A/B test:

kubectl -n test describe canary/podinfo

Events:
 Starting canary deployment for podinfo.test
 Pre-rollout check acceptance-test passed
 Advance podinfo.test canary iteration 1/10
 Advance podinfo.test canary iteration 2/10
 Advance podinfo.test canary iteration 3/10
 Advance podinfo.test canary iteration 4/10
 Advance podinfo.test canary iteration 5/10
 Advance podinfo.test canary iteration 6/10
 Advance podinfo.test canary iteration 7/10
 Advance podinfo.test canary iteration 8/10
 Advance podinfo.test canary iteration 9/10
 Advance podinfo.test canary iteration 10/10
 Copying podinfo.test template spec to podinfo-primary.test
 Waiting for podinfo-primary.test rollout to finish: 1 of 2 updated replicas are available
 Promotion completed! Scaling down podinfo.test