ArgoWorkflow Tutorial (3): Sharing Files Between Steps with Artifacts

Published by 探索云原生 on 2024-08-29

Original URL: https://www.cnblogs.com/KubeExplorer/p/18386454

The previous article covered the relationship between Workflow, WorkflowTemplate, and template. This one covers how to use S3-backed artifacts in argo-workflow to share files between steps.

This article answers two questions:

1) How to configure the artifact-repository
2) How to use artifacts in a Workflow

1. Configuring the artifact-repository

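ArgoWorkflow persists artifacts to S3 based on its artifact-repository configuration.

There are three ways to provide that configuration:

1) Global configuration: write the S3 settings into the configuration file used by the workflow-controller Deployment to set the artifactRepository used cluster-wide. This has the lowest priority and can be overridden by the two methods below.
2) Namespace default configuration: ArgoWorkflow looks for a default configuration in the namespace where the Workflow runs. This has the second-highest priority and overrides the global configuration.
- Convention: a ConfigMap named artifact-repositories in the Workflow's namespace is used as the configuration.
3) Configuration in the Workflow: a Workflow can also state explicitly which artifact-repository to use. This has the highest priority.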

Note 📢: however the artifact-repository is specified, the Secret holding the S3 access key and secret key (AK/SK) must be synced to the namespace where the Workflow runs.

Priority: InWorkflowConfig > Namespace > Global
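
The priority order above can be pictured as a simple "first one that is set wins" lookup. The following is only a minimal sketch of that idea, not the controller's actual code; the function name and the repository dictionaries are hypothetical and exist purely for illustration.

```python
from typing import Optional


def resolve_artifact_repository(
    workflow_config: Optional[dict],
    namespace_default: Optional[dict],
    global_config: Optional[dict],
) -> Optional[dict]:
    """Return the first configuration that is set, highest priority first."""
    for candidate in (workflow_config, namespace_default, global_config):
        if candidate is not None:
            return candidate
    return None  # nothing configured at any level


# Hypothetical settings, used only to exercise the lookup.
global_repo = {"bucket": "argo", "endpoint": "minio.default.svc:9000"}
namespace_repo = {"bucket": "team-a", "endpoint": "minio.team-a.svc:9000"}

# The namespace default overrides the global configuration.
print(resolve_artifact_repository(None, namespace_repo, global_repo)["bucket"])
# Without a namespace default, the global configuration is used.
print(resolve_artifact_repository(None, None, global_repo)["bucket"])
```

If none of the three levels is configured, the sketch returns None, which corresponds to the case where a step that outputs an artifact cannot be stored anywhere.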

Global configuration

When ArgoWorkflow is deployed with helm, the configuration is specified this way by default.

The workflow-controller Deployment YAML looks like this:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: argo-workflow-argo-workflows-workflow-controller
  namespace: argo-dev
spec:
  template:
    metadata:
    spec:
      containers:
      - args:
        - --configmap
        - argo-workflow-argo-workflows-workflow-controller-configmap
        - --executor-image
        - quay.io/argoproj/argoexec:v3.4.11
        - --loglevel
        - info
        - --gloglevel
        - "0"
        - --log-format
        - text

The startup arguments include --configmap argo-workflow-argo-workflows-workflow-controller-configmap, which names the ConfigMap that the configuration file is read from.

The content of this ConfigMap is as follows:

apiVersion: v1
data:
  # ... omitted
  artifactRepository: |
    s3:
      endpoint: minio.default.svc:9000
      bucket: argo
      insecure: true
      accessKeySecret:
        name: my-s3-secret
        key: accessKey
      secretKeySecret:
        name: my-s3-secret
        key: secretKey
kind: ConfigMap
metadata:
  name: argo-workflow-argo-workflows-workflow-controller-configmap
  namespace: argo-dev

It contains the S3 endpoint, bucket, AK/SK references, and so on. With this information a Workflow can access S3.

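Namespace default configuration

In the current implementation, ArgoWorkflow prefers the default artifactRepository configuration in the Workflow's namespace.

By default, the ConfigMap named artifact-repositories is used as the artifactRepository configuration for Workflows in that namespace. Its content looks roughly like this:

Note: the ConfigMap must be named artifact-repositories.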

apiVersion: v1
kind: ConfigMap
metadata:
  # If you want to use this config map by default, name it "artifact-repositories". Otherwise, you can provide a reference to a
  # different config map in `artifactRepositoryRef.configMap`.
  name: artifact-repositories
  annotations:
    # v3.0 and after - if you want to use a specific key, put that key into this annotation.
    workflows.argoproj.io/default-artifact-repository: my-artifact-repository
data:
  my-artifact-repository: |
    s3:
      bucket: lixd-argo
      endpoint: minio.argo-dev.svc:9000
      insecure: true
      accessKeySecret:
        name: my-s3-secret
        key: accessKey
      secretKeySecret:
        name: my-s3-secret
        key: secretKey
  # Multiple repositories can be defined
  my-artifact-repository2: ...

Each key under data corresponds to one repository, and the workflows.argoproj.io/default-artifact-repository annotation selects which artifactRepository is used by default.

Here, for example, my-artifact-repository is set as the default artifactRepository.
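
The selection rule described above (an explicit key if one is given, otherwise the key named by the annotation) can be sketched as follows. This is an illustrative model rather than Argo's implementation: the ConfigMap is represented as a plain dictionary, and the function name pick_repository and the placeholder values are assumptions.

```python
from typing import Optional

DEFAULT_ANNOTATION = "workflows.argoproj.io/default-artifact-repository"


def pick_repository(configmap: dict, key: Optional[str] = None) -> str:
    """Return the repository definition stored under `key`.

    When no key is given, fall back to the key named by the
    default-artifact-repository annotation.
    """
    if key is None:
        annotations = configmap.get("metadata", {}).get("annotations", {})
        key = annotations.get(DEFAULT_ANNOTATION)
    if key is None:
        raise KeyError("no key given and no default annotation set")
    return configmap["data"][key]


# A dictionary shaped like the artifact-repositories ConfigMap above.
# The repository bodies are shortened placeholders.
configmap = {
    "metadata": {
        "name": "artifact-repositories",
        "annotations": {DEFAULT_ANNOTATION: "my-artifact-repository"},
    },
    "data": {
        "my-artifact-repository": "s3: {bucket: lixd-argo}",
        "my-artifact-repository2": "s3: {bucket: another-bucket}",
    },
}

print(pick_repository(configmap))                             # default key
print(pick_repository(configmap, "my-artifact-repository2"))  # explicit key
```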

Configuration in the Workflow

In addition, a Workflow can state directly which artifactRepository to use.

spec:
  artifactRepositoryRef:
    configMap: my-artifact-repository # default is "artifact-repositories"
    key: v2-s3-artifact-repository # default can be set by the `workflows.argoproj.io/default-artifact-repository` annotation in config map.

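Both the ConfigMap and the key must be given so that exactly one artifactRepository is identified.

The lookup only happens in the current namespace, so make sure this ConfigMap exists there.

Alternatively, write the S3 configuration directly into the Workflow (not recommended), like this: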

  templates:
  - name: artifact-example
    inputs:
      artifacts:
      - name: my-input-artifact
        path: /my-input-artifact
        s3:
          endpoint: s3.amazonaws.com
          bucket: my-aws-bucket-name
          key: path/in/bucket/my-input-artifact.tgz
          accessKeySecret:
            name: my-aws-s3-credentials
            key: accessKey
          secretKeySecret:
            name: my-aws-s3-credentials
            key: secretKey
    outputs:
      artifacts:
      - name: my-output-artifact
        path: /my-output-artifact
        s3:
          endpoint: storage.googleapis.com
          bucket: my-gcs-bucket-name
          # NOTE that, by default, all output artifacts are automatically tarred and
          # gzipped before saving. So as a best practice, .tgz or .tar.gz
          # should be incorporated into the key name so the resulting file
          # has an accurate file extension.
          key: path/in/bucket/my-output-artifact.tgz
          accessKeySecret:
            name: my-gcs-s3-credentials
            key: accessKey
          secretKeySecret:
            name: my-gcs-s3-credentials
            key: secretKey
          region: my-GCS-storage-bucket-region
    container:
      image: debian:latest
      command: [sh, -c]
      args: ["cp -r /my-input-artifact /my-output-artifact"]


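Summary

There are three ways to configure it:

1) Global configuration
2) Namespace default configuration
3) Configuration in the Workflow

Note 📢: because the S3 AK/SK is stored as a Secret, all three methods require that Secret to be synced to the Workflow's namespace. Otherwise the Pod cannot use it and the Workflow will not run correctly.

It would be nice if ArgoWorkflow handled this automatically. In the meantime, https://github.com/mittwald/kubernetes-replicator can be used to sync the Secret automatically.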

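Differences between the three methods:

Global configuration: a single ConfigMap holds the S3 settings for the whole cluster and every Workflow uses it. Simple, but inflexible.
Namespace default configuration: different namespaces can use different S3 settings, but a ConfigMap has to be created in every namespace.
Configuration in the Workflow: the most flexible option, since each Workflow can use a different S3 configuration, but it requires creating many ConfigMaps.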

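When to use which:

If there is only one S3 configuration for the whole cluster, use the global configuration. It is the simplest.

If tenants are isolated by namespace and use different S3 storage, the namespace default configuration is a good fit.

Only when neither of the above fits is configuration in the Workflow recommended.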

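2. Using artifacts in a Workflow

key-only-artifacts

When a Workflow does not explicitly specify S3 configuration, argo looks up the artifact-repository configuration automatically, following the priority order described earlier.

The configuration in the namespace is preferred. If there is none, the global configuration is used.

A complete demo: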

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: artifact-passing-
spec:
  entrypoint: artifact-example
  templates:
  - name: artifact-example
    steps:
    - - name: generate-artifact
        template: whalesay
    - - name: consume-artifact
        template: print-message
        arguments:
          artifacts:
          # bind message to the hello-art artifact
          # generated by the generate-artifact step
          - name: message
            from: "{{steps.generate-artifact.outputs.artifacts.hello-art}}"

  - name: whalesay
    container:
      image: docker/whalesay:latest
      command: [sh, -c]
      args: ["cowsay hello world | tee /tmp/hello_world.txt"]
    outputs:
      artifacts:
      # generate hello-art artifact from /tmp/hello_world.txt
      # artifacts can be directories as well as files
      - name: hello-art
        path: /tmp/hello_world.txt

  - name: print-message
    inputs:
      artifacts:
      # unpack the message input artifact
      # and put it at /tmp/message
      - name: message
        path: /tmp/message
    container:
      image: alpine:latest
      command: [sh, -c]
      args: ["cat /tmp/message"]

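The first step creates a file with the tee command and exposes it through outputs. Because it is declared under artifacts, the file is saved to S3.

The second step then declares inputs.artifacts, which reads the artifact named message from S3 and places it at /tmp/message.

So where does the artifact read in the second step come from? It is the one passed through arguments.artifacts in steps, and the two are matched by name.

The overall logic is essentially the same as for parameters:

1) The whalesay template declares through outputs.artifacts that it produces an artifact.
2) The print-message template declares through inputs.artifacts that it needs an artifact, and says where to place it.
3) When steps uses that template, it passes an artifact through arguments.artifacts. The source of this artifact is the output from 1, referenced with the {{steps.generate-artifact.outputs.artifacts.$name}} syntax.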
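
To make the name-based wiring in the three points above concrete, here is a small sketch of how a reference of the form {{steps.<step>.outputs.artifacts.<name>}} could be split into a step name and an artifact name and then looked up. It only illustrates the naming scheme. The regular expression, the resolve_reference helper, and the storage key shown are assumptions, and Argo's own template engine is not implemented this way.

```python
import re

# Matches {{steps.<step-name>.outputs.artifacts.<artifact-name>}}
REFERENCE = re.compile(
    r"\{\{steps\.([^.{}]+)\.outputs\.artifacts\.([^.{}]+)\}\}"
)


def resolve_reference(expression: str, step_outputs: dict) -> str:
    """Look up the artifact that a steps.*.outputs.artifacts.* reference names."""
    match = REFERENCE.fullmatch(expression)
    if match is None:
        raise ValueError(f"not an artifact reference: {expression!r}")
    step_name, artifact_name = match.groups()
    return step_outputs[step_name][artifact_name]


# Output artifacts recorded per step; the storage key is a made-up placeholder.
step_outputs = {
    "generate-artifact": {"hello-art": "hypothetical/key/hello-art.tgz"},
}

location = resolve_reference(
    "{{steps.generate-artifact.outputs.artifacts.hello-art}}", step_outputs
)
print(location)
```

The consuming template never sees the name hello-art. It only sees its own input name (message), which is why the binding has to be spelled out in arguments.artifacts.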

If no artifact repository is configured, the step that outputs the artifact fails, and the node status in the Workflow shows an error like this:

    artifact-passing-vzp2r-1469537892:
      boundaryID: artifact-passing-vzp2r
      displayName: generate-artifact
      finishedAt: "2024-03-29T08:42:34Z"
      hostNodeName: lixd-argo
      id: artifact-passing-vzp2r-1469537892
      message: 'Error (exit code 1): You need to configure artifact storage. More
        information on how to do this can be found in the docs: https://argo-workflows.readthedocs.io/en/release-3.5/configure-artifact-repository/'
      name: artifact-passing-vzp2r[0].generate-artifact

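Artifact compression

By default, every artifact is packed into a tar archive and gzip-compressed. The archive field controls this behavior:

Default: tar + gzip
Optionally disable tar + gzip
Or set the gzip compression level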

<... snipped ...>
    outputs:
      artifacts:
        # default behavior - tar+gzip default compression.
      - name: hello-art-1
        path: /tmp/hello_world.txt

        # disable archiving entirely - upload the file / directory as is.
        # this is useful when the container layout matches the desired target repository layout.   
      - name: hello-art-2
        path: /tmp/hello_world.txt
        archive:
          none: {}

        # customize the compression behavior (disabling it here).
        # this is useful for files with varying compression benefits, 
        # e.g. disabling compression for a cached build workspace and large binaries, 
        # or increasing compression for "perfect" textual data - like a json/xml export of a large database.
      - name: hello-art-3
        path: /tmp/hello_world.txt
        archive:
          tar:
            # no compression (also accepts the standard gzip 1 to 9 values)
            compressionLevel: 0
<... snipped ...>
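
To see what "tar + gzip" means for the stored object, the following sketch packs a single file the same general way using Python's standard tarfile module and checks the gzip magic bytes. It is an illustration of the archive format only, not the code the Argo executor runs; the function name archive_tar_gzip is hypothetical, and the default level of 9 is Python's default, not necessarily Argo's.

```python
import io
import tarfile


def archive_tar_gzip(name: str, content: bytes, compress_level: int = 9) -> bytes:
    """Pack one file into an in-memory gzip-compressed tar archive."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w:gz", compresslevel=compress_level) as tar:
        info = tarfile.TarInfo(name=name)
        info.size = len(content)
        tar.addfile(info, io.BytesIO(content))
    return buffer.getvalue()


payload = b"hello world\n"

# Comparable to the default behavior: tar, then gzip.
archived = archive_tar_gzip("hello_world.txt", payload)
# Comparable to compressionLevel: 0: still a .tgz, but stored without compression.
stored = archive_tar_gzip("hello_world.txt", payload, compress_level=0)

# Every gzip stream starts with the bytes 0x1f 0x8b.
print(archived[:2] == b"\x1f\x8b", stored[:2] == b"\x1f\x8b")

# Unpacking returns the original file content.
with tarfile.open(fileobj=io.BytesIO(archived), mode="r:gz") as tar:
    print(tar.extractfile("hello_world.txt").read() == payload)
```

With archive: none: {} there is no wrapper at all, so the object in the bucket is simply the file itself.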

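Artifact garbage collection

All artifacts are uploaded to S3, so they need to be cleaned up to keep S3 from filling up.

The good news is that, starting with argo-workflow 3.4, a Workflow can be configured to delete artifacts automatically once they are no longer needed.

Two collection strategies are currently available:

OnWorkflowCompletion: delete as soon as the workflow finishes running
OnWorkflowDeletion: delete only when the workflow itself is deleted

A strategy can be set once for all artifacts in a Workflow, and it can also be set individually for each artifact.

A demo: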

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: artifact-gc-
spec:
  entrypoint: main
  artifactGC:
    strategy: OnWorkflowDeletion  # default Strategy set here applies to all Artifacts by default
  templates:
    - name: main
      container:
        image: argoproj/argosay:v2
        command:
          - sh
          - -c
        args:
          - |
            echo "can throw this away" > /tmp/temporary-artifact.txt
            echo "keep this" > /tmp/keep-this.txt
      outputs:
        artifacts:
          - name: temporary-artifact
            path: /tmp/temporary-artifact.txt
            s3:
              key: temporary-artifact.txt
          - name: keep-this
            path: /tmp/keep-this.txt
            s3:
              key: keep-this.txt
            artifactGC:
              strategy: Never   # optional override for an Artifact

The key parts are:

spec:
  entrypoint: main
  # Default strategy for all artifacts in the Workflow
  artifactGC:
    strategy: OnWorkflowDeletion  # default Strategy set here applies to all Artifacts by default
  # ...
      # Strategy for a single artifact
      outputs:
        artifacts:
          - name: keep-this
            artifactGC:
              strategy: Never   # optional override for an Artifact
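
The override rule shown above (a per-artifact strategy wins over the Workflow-level default) can be sketched as a small function. This is a simplified model for illustration. The function name is hypothetical, and treating "nothing set at either level" as Never is an assumption of this sketch.

```python
from typing import Optional


def effective_gc_strategy(
    workflow_strategy: Optional[str],
    artifact_strategy: Optional[str],
) -> str:
    """Return the GC strategy that applies to one artifact."""
    if artifact_strategy is not None:
        return artifact_strategy   # per-artifact override
    if workflow_strategy is not None:
        return workflow_strategy   # Workflow-level default
    return "Never"                 # assumption: no setting means no GC


# Mirrors the demo: the Workflow default is OnWorkflowDeletion,
# and only keep-this overrides it.
artifacts = {"temporary-artifact": None, "keep-this": "Never"}
for name, override in artifacts.items():
    print(name, effective_gc_strategy("OnWorkflowDeletion", override))
```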

Note: to prevent artifacts from being deleted by mistake when the same workflow runs concurrently, configure different artifact repositories for different workflows.

forceFinalizerRemoval

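argo-workflow starts a Pod named in the form <wfName>-artgc-* to perform garbage collection. If that Pod fails, the whole Workflow is marked as failed as well.

In addition, because the finalizers have not been removed: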

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  finalizers:
  - workflows.argoproj.io/artifact-gc

the Workflow cannot be deleted. Run the following command to remove the finalizers:

kubectl patch workflow my-wf \
    --type json \
    --patch='[ { "op": "remove", "path": "/metadata/finalizers" } ]'

To improve this experience, argo-workflow 3.5 added the forceFinalizerRemoval parameter:

spec:
  artifactGC:
    strategy: OnWorkflowDeletion 
    forceFinalizerRemoval: true

When forceFinalizerRemoval is set to true, the finalizers are removed even if GC fails.

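Other common artifact sources

Besides S3 artifacts, argo-workflow also has built-in support for fetching artifacts through git and http, for convenience.

It can clone code directly from a given git repository or download a file from a given URL, like this: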

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: hardwired-artifact-
spec:
  entrypoint: hardwired-artifact
  templates:
  - name: hardwired-artifact
    inputs:
      artifacts:
      # Check out the main branch of the argo repo and place it at /src
      # revision can be anything that git checkout accepts: branch, commit, tag, etc.
      - name: argo-source
        path: /src
        git:
          repo: https://github.com/argoproj/argo-workflows.git
          revision: "main"
      # Download kubectl 1.8.0 and place it at /bin/kubectl
      - name: kubectl
        path: /bin/kubectl
        mode: 0755
        http:
          url: https://storage.googleapis.com/kubernetes-release/release/v1.8.0/bin/linux/amd64/kubectl
      # Copy an s3 compatible artifact repository bucket (such as AWS, GCS and MinIO) and place it at /s3
      - name: objects
        path: /s3
        s3:
          endpoint: storage.googleapis.com
          bucket: my-bucket-name
          key: path/in/bucket
          accessKeySecret:
            name: my-s3-credentials
            key: accessKey
          secretKeySecret:
            name: my-s3-credentials
            key: secretKey
    container:
      image: debian
      command: [sh, -c]
      args: ["ls -l /src /bin/kubectl /s3"]

3. Demo

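What to test:

1) Whether the S3 configuration works when it is created in the Workflow's namespace
2) Whether creating the S3 configuration in the namespace where Argo is deployed removes the need to sync it

ConfigMap:

Name: argo-workflow-argo-workflows-workflow-controller-configmap
Namespace: argo-dev
Key: artifactRepository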

Preparing MinIO

Deploy local-path-storage as the CSI storage provider. Skip this step if another CSI is already available.

kubectl apply -f https://raw.githubusercontent.com/rancher/local-path-provisioner/v0.0.24/deploy/local-path-storage.yaml

Then deploy MinIO:

helm install minio oci://registry-1.docker.io/bitnamicharts/minio

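After installation, the chart's notes show the in-cluster service address and the commands for reading the root credentials. The names below use the release name my-release; the command above names the release minio, so adjust them to match:

    my-release-minio.default.svc.cluster.local

    export ROOT_USER=$(kubectl get secret --namespace default my-release-minio -o jsonpath="{.data.root-user}" | base64 -d)
    export ROOT_PASSWORD=$(kubectl get secret --namespace default my-release-minio -o jsonpath="{.data.root-password}" | base64 -d)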

Configuring the artifact-repository

The full content of cm.yaml:

apiVersion: v1
kind: ConfigMap
metadata:
  name: artifact-repositories
  annotations:
    workflows.argoproj.io/default-artifact-repository: my-artifact-repository
data:
  my-artifact-repository: |
    s3:
      bucket: argo
      endpoint: minio.default.svc.cluster.local:9000
      insecure: true
      accessKeySecret:
        name: my-s3-secret
        key: accessKey
      secretKeySecret:
        name: my-s3-secret
        key: secretKey

The full content of secret.yaml:

apiVersion: v1
stringData:
  accessKey: admin
  secretKey: minioadmin
kind: Secret
metadata:
  name: my-s3-secret
type: Opaque

Create the artifact repository configuration:

kubectl apply -f cm.yaml
kubectl apply -f secret.yaml

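Using the artifact in a Workflow

There are two steps:

generate: creates a file and writes it to S3 through outputs.artifacts
consume: reads the file from S3 through inputs.artifacts and prints its content

The full content of workflow.yaml: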

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: key-only-artifacts-
spec:
  entrypoint: main
  templates:
    - name: main
      dag:
        tasks:
          - name: generate
            template: generate
          - name: consume
            template: consume
            dependencies:
              - generate
    - name: generate
      container:
        image: argoproj/argosay:v2
        args: [ echo, hello, /mnt/file ]
      outputs:
        artifacts:
          - name: file
            path: /mnt/file
            s3:
              key: my-file
    - name: consume
      container:
        image: argoproj/argosay:v2
        args: [cat, /tmp/file]
      inputs:
        artifacts:
          - name: file
            path: /tmp/file
            s3:
              key: my-file

Create the Workflow:

kubectl create -f workflow.yaml

Wait for it to finish:

[root@lixd-argo artiface]# kubectl get wf
NAME                                  STATUS      AGE     MESSAGE
key-only-artifacts-9r84h              Succeeded   2m30s

Checking the file in S3

Check in S3 whether the file exists.

The argo bucket contains a file named my-file, and its Content-Type is application/gzip, which confirms that argo applies tar+gzip to artifacts.

(Screenshot: argo-artifact-s3)

4. Summary

This article covered how artifacts are used in argo, including how to configure the artifact-repository: