Velero Series (Part 4): Production Migration with Velero in Practice

Published by 東風微鳴 on 2022-12-12

Overview

Goal

Using the velero tool, achieve the following overall goal:

  • Migrate a specific namespace between clusters B and A.

The concrete objectives are:

  1. Install velero (including restic) on clusters B and A
  2. Back up the specific namespace caseycui2020 on cluster B:
    1. Back up resources such as deployments and configmaps;
      1. Before backing up, exclude the YAML of specific secrets.
    2. Back up volume data (via restic);
      1. Use the "opt-in" approach to back up only specific pod volumes.
  3. Migrate the namespace caseycui2020 to cluster A:
    1. Migrate resources, using include to migrate only specific resources;
    2. Migrate volume data (via restic).

Installation

  1. Create a Velero-specific credentials file (credentials-velero) in your local directory:

    We use XSKY object storage (the company's NetApp object storage is not compatible):

    [default]
    aws_access_key_id = xxxxxxxxxxxxxxxxxxxxxxxx
    aws_secret_access_key = xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
    
  2. (OpenShift) First create the namespace velero: oc new-project velero

  3. By default, a user-level OpenShift namespace does not schedule pods on all nodes in the cluster.

    To schedule the namespace onto all nodes, an annotation is needed:

    oc annotate namespace velero openshift.io/node-selector=""
    

    This should be done before installing Velero.

  4. Start the server and the storage services. In the Velero directory, run:

    velero install \
        --provider aws \
        --plugins velero/velero-plugin-for-aws:v1.0.0 \
        --bucket velero \
        --secret-file ./credentials-velero \
        --use-restic \
        --use-volume-snapshots=true \
        --backup-location-config region="default",s3ForcePathStyle="true",s3Url="http://glacier.ewhisper.cn",insecureSkipTLSVerify="true",signatureVersion="4" \
        --snapshot-location-config region="default"
    

    The objects created include:

    CustomResourceDefinition/backups.velero.io: attempting to create resource
    CustomResourceDefinition/backups.velero.io: created
    CustomResourceDefinition/backupstoragelocations.velero.io: attempting to create resource
    CustomResourceDefinition/backupstoragelocations.velero.io: created
    CustomResourceDefinition/deletebackuprequests.velero.io: attempting to create resource
    CustomResourceDefinition/deletebackuprequests.velero.io: created
    CustomResourceDefinition/downloadrequests.velero.io: attempting to create resource
    CustomResourceDefinition/downloadrequests.velero.io: created
    CustomResourceDefinition/podvolumebackups.velero.io: attempting to create resource
    CustomResourceDefinition/podvolumebackups.velero.io: created
    CustomResourceDefinition/podvolumerestores.velero.io: attempting to create resource
    CustomResourceDefinition/podvolumerestores.velero.io: created
    CustomResourceDefinition/resticrepositories.velero.io: attempting to create resource
    CustomResourceDefinition/resticrepositories.velero.io: created
    CustomResourceDefinition/restores.velero.io: attempting to create resource
    CustomResourceDefinition/restores.velero.io: created
    CustomResourceDefinition/schedules.velero.io: attempting to create resource
    CustomResourceDefinition/schedules.velero.io: created
    CustomResourceDefinition/serverstatusrequests.velero.io: attempting to create resource
    CustomResourceDefinition/serverstatusrequests.velero.io: created
    CustomResourceDefinition/volumesnapshotlocations.velero.io: attempting to create resource
    CustomResourceDefinition/volumesnapshotlocations.velero.io: created
    Waiting for resources to be ready in cluster...
    Namespace/velero: attempting to create resource
    Namespace/velero: created
    ClusterRoleBinding/velero: attempting to create resource
    ClusterRoleBinding/velero: created
    ServiceAccount/velero: attempting to create resource
    ServiceAccount/velero: created
    Secret/cloud-credentials: attempting to create resource
    Secret/cloud-credentials: created
    BackupStorageLocation/default: attempting to create resource
    BackupStorageLocation/default: created
    VolumeSnapshotLocation/default: attempting to create resource
    VolumeSnapshotLocation/default: created
    Deployment/velero: attempting to create resource
    Deployment/velero: created
    DaemonSet/restic: attempting to create resource
    DaemonSet/restic: created
    Velero is installed! ⛵ Use 'kubectl logs deployment/velero -n velero' to view the status.
    
  5. (OpenShift) Add the velero ServiceAccount to the privileged SCC:

    $ oc adm policy add-scc-to-user privileged -z velero -n velero
    
  6. (OpenShift) For OpenShift versions >= 4.1, modify the DaemonSet yaml to request privileged mode:

    @@ -67,3 +67,5 @@ spec:
                  value: /credentials/cloud
                - name: VELERO_SCRATCH_DIR
                  value: /scratch
    +          securityContext:
    +            privileged: true
    

    Or:

    oc patch ds/restic \
      --namespace velero \
      --type json \
      -p '[{"op":"add","path":"/spec/template/spec/containers/0/securityContext","value": { "privileged": true}}]'
    

Backup - Cluster B

Back up specific cluster-level resources

velero backup create <backup-name> --include-cluster-resources=true  --include-resources deployments,configmaps

View the backup:

velero backup describe YOUR_BACKUP_NAME

Back up the specific namespace caseycui2020

Exclude specific resources

Resources labeled velero.io/exclude-from-backup=true are not included in the backup, even if they also carry a label matching the backup's selector.

In this way, secrets and other resources that do not need to be backed up are excluded via the velero.io/exclude-from-backup=true label.

Some of the secrets excluded this way:

builder-dockercfg-jbnzr
default-token-lshh8
pipeline-token-xt645
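
To exclude a resource this way, apply the label before the backup runs. A one-line sketch, using a secret name from the list above:

oc -n caseycui2020 label secret builder-dockercfg-jbnzr velero.io/exclude-from-backup=true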

Back up pod volumes with restic

Note:

The following two pod volumes in this namespace also need to be backed up, although they are not yet in production use:

  • mycoreapphttptask-callback
  • mycoreapphttptaskservice-callback

Back up selectively via the "opt-in" approach.

  1. Run the following command for each pod that contains volumes to be backed up:

    oc -n caseycui2020 annotate pod/<mybackendapp-pod-name> backup.velero.io/backup-volumes=jmx-exporter-agent,pinpoint-agent,my-mybackendapp-claim
    oc -n caseycui2020 annotate pod/<elitegetrecservice-pod-name> backup.velero.io/backup-volumes=uploadfile
    

    where the volume names are the names of the volumes in the pod spec.

    For example, for the following pod:

    apiVersion: v1
    kind: Pod
    metadata:
      name: sample
      namespace: foo
    spec:
      containers:
      - image: k8s.gcr.io/test-webserver
        name: test-webserver
        volumeMounts:
        - name: pvc-volume
          mountPath: /volume-1
        - name: emptydir-volume
          mountPath: /volume-2
      volumes:
      - name: pvc-volume
        persistentVolumeClaim:
          claimName: test-volume-claim
      - name: emptydir-volume
        emptyDir: {}
    

    You would run:

    kubectl -n foo annotate pod/sample backup.velero.io/backup-volumes=pvc-volume,emptydir-volume
    

    If you use a controller to manage your pods, you can also provide this annotation in the pod template spec, so that every new pod carries it; see the sketch below.
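
    For instance, the annotation can be patched into a controller's pod template. A sketch, with the DeploymentConfig name and volume name as illustrative assumptions:

    oc -n caseycui2020 patch dc/mybackendapp --type merge \
      -p '{"spec":{"template":{"metadata":{"annotations":{"backup.velero.io/backup-volumes":"uploadfile"}}}}}'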

Backup and verification

Back up the namespace and its objects, together with the pod volumes carrying the annotation above:

# production namespace
velero backup create caseycui2020 --include-namespaces caseycui2020

View the backup:

velero backup describe YOUR_BACKUP_NAME
velero backup logs caseycui2020
oc -n velero get podvolumebackups -l velero.io/backup-name=caseycui2020 -o yaml

The describe output is as follows:

Name:         caseycui2020
Namespace:    velero
Labels:       velero.io/storage-location=default
Annotations:  velero.io/source-cluster-k8s-gitversion=v1.18.3+2cf11e2
              velero.io/source-cluster-k8s-major-version=1
              velero.io/source-cluster-k8s-minor-version=18+

Phase:  Completed

Errors:    0
Warnings:  0

Namespaces:
  Included:  caseycui2020
  Excluded:  <none>

Resources:
  Included:        *
  Excluded:        <none>
  Cluster-scoped:  auto

Label selector:  <none>

Storage Location:  default

Velero-Native Snapshot PVs:  auto

TTL:  720h0m0s

Hooks:  <none>

Backup Format Version:  1.1.0

Started:    2020-10-21 09:28:16 +0800 CST
Completed:  2020-10-21 09:29:17 +0800 CST

Expiration:  2020-11-20 09:28:16 +0800 CST

Total items to be backed up:  591
Items backed up:              591

Velero-Native Snapshots: <none included>

Restic Backups (specify --details for more information):
  Completed:  3
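
As the output hints, rerunning describe with --details breaks the "Completed: 3" line down into the individual restic pod-volume backups:

velero backup describe caseycui2020 --details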

Scheduled backups

Create periodically scheduled backups based on a cron expression:

velero schedule create caseycui2020-b-daily --schedule="0 3 * * *" --include-namespaces caseycui2020

Alternatively, you can use some non-standard shorthand cron expressions:

velero schedule create test-daily --schedule="@every 24h" --include-namespaces caseycui2020

See the cron package documentation for more usage examples.
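
Each run of a schedule creates a backup named <schedule-name>-<timestamp>, so scheduled backups can be inspected with the usual commands:

velero schedule get
velero backup get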

Cluster migration - to cluster A

Using Backups and Restores

As long as you point each Velero instance at the same cloud object storage location, Velero can help you port resources from one cluster to another. This scenario assumes that your clusters are hosted by the same cloud provider. Note that Velero itself does not support migrating persistent volume snapshots across cloud providers. If you want to migrate volume data between cloud platforms, enable restic, which backs up volume contents at the filesystem level.

  1. (Cluster B) Assuming you have not already checkpointed your data with a Velero schedule operation, first back up the entire cluster (replacing <BACKUP-NAME> as appropriate):

    velero backup create <BACKUP-NAME>
    

    The default backup retention period, expressed as a TTL (time to live), is 30 days (720 hours); you can change this as needed with the --ttl <DURATION> flag. See "How Velero Works" for more about backup expiration.

  2. (Cluster A) Configure BackupStorageLocations and VolumeSnapshotLocations pointing at the location used by cluster B, using velero backup-location create and velero snapshot-location create. Make sure to configure the BackupStorageLocations as read-only by passing the --access-mode=ReadOnly flag to velero backup-location create (since I only have one bucket, I skipped the read-only setting). Below is the installation on cluster A, which configures BackupStorageLocations and VolumeSnapshotLocations at install time.

    velero install \
        --provider aws \
        --plugins velero/velero-plugin-for-aws:v1.0.0 \
        --bucket velero \
        --secret-file ./credentials-velero \
        --use-restic \
        --use-volume-snapshots=true \
        --backup-location-config region="default",s3ForcePathStyle="true",s3Url="http://glacier.ewhisper.cn",insecureSkipTLSVerify="true",signatureVersion="4"\
        --snapshot-location-config region="default"
    
  3. (Cluster A) Make sure the Velero Backup object has been created. Velero resources are synchronized with the backup files in cloud storage.

    velero backup describe <BACKUP-NAME>
    

    Note: the default sync interval is 1 minute, so make sure to wait before checking. You can configure this interval with the Velero server's --backup-sync-period flag.

  4. (Cluster A) Once you have confirmed that the right backup (<BACKUP-NAME>) exists, you can restore everything with the command below. (Since the backup contains only the caseycui2020 namespace, there is no need to filter with --include-namespaces caseycui2020 when restoring.)

    velero restore create --from-backup caseycui2020 --include-resources buildconfigs.build.openshift.io,configmaps,deploymentconfigs.apps.openshift.io,imagestreams.image.openshift.io,imagestreamtags.image.openshift.io,imagetags.image.openshift.io,limitranges,namespaces,networkpolicies.networking.k8s.io,persistentvolumeclaims,prometheusrules.monitoring.coreos.com,resourcequotas,rolebindings.authorization.openshift.io,rolebindings.rbac.authorization.k8s.io,routes.route.openshift.io,secrets,servicemonitors.monitoring.coreos.com,services,templateinstances.template.openshift.io
    

    Because restoring persistentvolumeclaims later turned out to be problematic (see the verification below), on subsequent runs the pvc resource was dropped from the list, to be dealt with separately:

    velero restore create --from-backup caseycui2020 --include-resources buildconfigs.build.openshift.io,configmaps,deploymentconfigs.apps.openshift.io,imagestreams.image.openshift.io,imagestreamtags.image.openshift.io,imagetags.image.openshift.io,limitranges,namespaces,networkpolicies.networking.k8s.io,prometheusrules.monitoring.coreos.com,resourcequotas,rolebindings.authorization.openshift.io,rolebindings.rbac.authorization.k8s.io,routes.route.openshift.io,secrets,servicemonitors.monitoring.coreos.com,services,templateinstances.template.openshift.io
    

Verify the two clusters

Check that the second cluster is working as expected:

  1. (Cluster A) Run:

    velero restore get
    

    The result is as follows:

    NAME                       BACKUP      STATUS            STARTED   COMPLETED   ERRORS   WARNINGS         CREATED                         SELECTOR
    caseycui2020-20201021102342   caseycui2020   Failed            <nil>     <nil>       0        0          2020-10-21 10:24:14 +0800 CST   <none>
    caseycui2020-20201021103040   caseycui2020   PartiallyFailed   <nil>     <nil>       46       34         2020-10-21 10:31:12 +0800 CST   <none>
    caseycui2020-20201021105848   caseycui2020   InProgress        <nil>     <nil>       0        0          2020-10-21 10:59:20 +0800 CST   <none>
    
  2. Then run:

    velero restore describe <RESTORE-NAME-FROM-GET-COMMAND>
    oc -n velero get podvolumerestores -l velero.io/restore-name=YOUR_RESTORE_NAME -o yaml
    

    The result is as follows:

    Name:         caseycui2020-20201021102342
    Namespace:    velero
    Labels:       <none>
    Annotations:  <none>
    
    Phase:  InProgress
    
    Started:    <n/a>
    Completed:  <n/a>
    
    Backup:  caseycui2020
    
    Namespaces:
      Included:  all namespaces found in the backup
      Excluded:  <none>
    
    Resources:
      Included:        buildconfigs.build.openshift.io, configmaps, deploymentconfigs.apps.openshift.io, imagestreams.image.openshift.io, imagestreamtags.image.openshift.io, imagetags.image.openshift.io, limitranges, namespaces, networkpolicies.networking.k8s.io, persistentvolumeclaims, prometheusrules.monitoring.coreos.com, resourcequotas, rolebindings.authorization.openshift.io, rolebindings.rbac.authorization.k8s.io, routes.route.openshift.io, secrets, servicemonitors.monitoring.coreos.com, services, templateinstances.template.openshift.io
      Excluded:        nodes, events, events.events.k8s.io, backups.velero.io, restores.velero.io, resticrepositories.velero.io
      Cluster-scoped:  auto
    
    Namespace mappings:  <none>
    
    Label selector:  <none>
    
    Restore PVs:  auto
    

If you run into problems, make sure Velero is running in the same namespace in both clusters.

The problem I ran into: OpenShift uses imagestreams and imagetags, the corresponding images could not be pulled over, and the containers did not start.

Since the containers did not start, the pod volumes were not restored successfully either.

Name:         caseycui2020-20201021110424
Namespace:    velero
Labels:       <none>
Annotations:  <none>

Phase:  PartiallyFailed (run 'velero restore logs caseycui2020-20201021110424' for more information)

Started:    <n/a>
Completed:  <n/a>

Warnings:
  Velero:     <none>
  Cluster:    <none>
  Namespaces:
    caseycui2020:  could not restore, imagetags.image.openshift.io "mybackendapp:1.0.0" already exists. Warning: the in-cluster version is different than the backed-up version.
                could not restore, imagetags.image.openshift.io "mybackendappno:1.0.0" already exists. Warning: the in-cluster version is different than the backed-up version.
                ...

Errors:
  Velero:     <none>
  Cluster:    <none>
  Namespaces:
    caseycui2020:  error restoring imagestreams.image.openshift.io/caseycui2020/mybackendapp: ImageStream.image.openshift.io "mybackendapp" is invalid: []: Internal error: imagestreams "mybackendapp" is invalid: spec.tags[latest].from.name: Invalid value: "mybackendapp@sha256:6c5ab553a97c74ad602d2427a326124621c163676df91f7040b035fa64b533c7": error generating tag event: imagestreamimage.image.openshift.io ......

Backup:  caseycui2020

Namespaces:
  Included:  all namespaces found in the backup
  Excluded:  <none>

Resources:
  Included:        buildconfigs.build.openshift.io, configmaps, deploymentconfigs.apps.openshift.io, imagestreams.image.openshift.io, imagestreamtags.image.openshift.io, imagetags.image.openshift.io, limitranges, namespaces, networkpolicies.networking.k8s.io, persistentvolumeclaims, prometheusrules.monitoring.coreos.com, resourcequotas, rolebindings.authorization.openshift.io, rolebindings.rbac.authorization.k8s.io, routes.route.openshift.io, secrets, servicemonitors.monitoring.coreos.com, services, templateinstances.template.openshift.io
  Excluded:        nodes, events, events.events.k8s.io, backups.velero.io, restores.velero.io, resticrepositories.velero.io
  Cluster-scoped:  auto

Namespace mappings:  <none>

Label selector:  <none>

Restore PVs:  auto

Summary of migration issues

The issues so far are summarized below:

  1. The images in imagestreams.image.openshift.io, imagestreamtags.image.openshift.io and imagetags.image.openshift.io were not imported successfully; more precisely, the latest tag failed to import. imagestreamtags.image.openshift.io also take time to become effective.

  2. persistentvolumeclaims errored after migration, as follows:

    phase: Lost
    

    The reason: the StorageClass configuration differs between clusters A and B, so a PVC from cluster B cannot simply bind in cluster A. Moreover, a PVC cannot be modified in place after creation; it has to be deleted and recreated.

  3. Route domains: some domains are specific to cluster A or B. For example, jenkins-caseycui2020.b.caas.ewhisper.cn must be adjusted to jenkins-caseycui2020.a.caas.ewhisper.cn after migrating to cluster A (see the sketch after this list).

  4. podVolume data was not migrated.
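
For the Route domains in item 3: a Route's spec.host normally cannot be changed in place (updating it requires the routes/custom-host permission), so the simplest fix is to delete the migrated route and recreate it with the new domain. A sketch, where the route name, service name and edge TLS termination are illustrative assumptions:

oc -n caseycui2020 delete route jenkins
oc -n caseycui2020 create route edge jenkins \
  --service=jenkins \
  --hostname=jenkins-caseycui2020.a.caas.ewhisper.cn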

The latest tag failed to import

Import it manually with the following command (1.0.1 is the newest version in the ImageStream):

oc tag xxl-job-admin:1.0.1 xxl-job-admin:latest
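
If you are unsure which tag is the newest, the ImageStream's status lists the tags it currently carries; a quick check (the ImageStream name is from the example above):

oc get imagestream xxl-job-admin \
  -o jsonpath='{range .status.tags[*]}{.tag}{"\n"}{end}'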

PVC phase Lost issue

If creating it manually, the PVC YAML needs adjusting. The PVC before and after adjustment:

Original YAML from cluster B:

kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  annotations:
    pv.kubernetes.io/bind-completed: 'yes'
    pv.kubernetes.io/bound-by-controller: 'yes'
    volume.beta.kubernetes.io/storage-provisioner: csi.trident.netapp.io
  selfLink: /api/v1/namespaces/caseycui2020/persistentvolumeclaims/jenkins
  resourceVersion: '77304786'
  name: jenkins
  uid: ffcabc42-845d-4cdf-8c7c-56e97cb5ea82
  creationTimestamp: '2020-10-21T03:05:46Z'
  managedFields:
    - manager: kube-controller-manager
      operation: Update
      apiVersion: v1
      time: '2020-10-21T03:05:46Z'
      fieldsType: FieldsV1
      fieldsV1:
        'f:status':
          'f:phase': {}
    - manager: velero-server
      operation: Update
      apiVersion: v1
      time: '2020-10-21T03:05:46Z'
      fieldsType: FieldsV1
      fieldsV1:
        'f:metadata':
          'f:annotations':
            .: {}
            'f:pv.kubernetes.io/bind-completed': {}
            'f:pv.kubernetes.io/bound-by-controller': {}
            'f:volume.beta.kubernetes.io/storage-provisioner': {}
          'f:labels':
            .: {}
            'f:app': {}
            'f:template': {}
            'f:template.openshift.io/template-instance-owner': {}
            'f:velero.io/backup-name': {}
            'f:velero.io/restore-name': {}
        'f:spec':
          'f:accessModes': {}
          'f:resources':
            'f:requests':
              .: {}
              'f:storage': {}
          'f:storageClassName': {}
          'f:volumeMode': {}
          'f:volumeName': {}
  namespace: caseycui2020
  finalizers:
    - kubernetes.io/pvc-protection
  labels:
    app: jenkins-persistent
    template: jenkins-persistent-monitored
    template.openshift.io/template-instance-owner: 5a0b28c3-c760-451b-b92f-a781406d9e91
    velero.io/backup-name: caseycui2020
    velero.io/restore-name: caseycui2020-20201021110424
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 5Gi
  volumeName: pvc-414efafd-8b22-48da-8c20-6025a8e671ca
  storageClassName: nas-data
  volumeMode: Filesystem
status:
  phase: Lost

After adjustment:

kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: jenkins
  namespace: caseycui2020
  labels:
    app: jenkins-persistent
    template: jenkins-persistent-monitored
    template.openshift.io/template-instance-owner: 5a0b28c3-c760-451b-b92f-a781406d9e91
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 5Gi
  storageClassName: nas-data
  volumeMode: Filesystem
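
Since a Lost PVC cannot be repaired in place, delete it and recreate it from the adjusted manifest (assuming the YAML above is saved as jenkins-pvc.yaml):

oc -n caseycui2020 delete pvc jenkins
oc -n caseycui2020 apply -f jenkins-pvc.yaml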

podVolume data was not migrated

It can be migrated manually with the following commands:

# Log in to cluster B
# First copy the /opt/prometheus data out of cluster B into the current folder
oc rsync xxl-job-admin-5-9sgf7:/opt/prometheus .
# The rsync command above creates a prometheus directory
cd prometheus
# Log in to cluster A
# Copy the data back in (make sure the pod has started first) (you can remove `JAVA_OPTS` beforehand)
oc rsync ./ xxl-job-admin-2-6k8df:/opt/prometheus/
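
A quick sanity check in cluster A after the copy (the pod name is from the example above):

oc exec xxl-job-admin-2-6k8df -- ls -l /opt/prometheus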

Summary

This article was written fairly early on; OpenShift has since released its own migration tooling built on top of Velero, and migrations can be performed directly with that tool.

Also, OpenShift clusters carry many restrictions as well as many OpenShift-specific resources, so in practice they differ considerably from vanilla Kubernetes; this needs careful attention.

Although this attempt failed, the approach is still worth borrowing from.


"Where three walk together, one can be my teacher; knowledge shared belongs to all." Written by the 東風微鳴 tech blog, EWhisper.cn.
