PV and PVC
A PVC (PersistentVolumeClaim) is a namespace-scoped resource, created either by a user or by the StatefulSet controller (from a VolumeClaimTemplate). A PVC is analogous to a Pod: a Pod consumes Node resources, while a PVC consumes PV resources. Just as a Pod can request specific amounts of resources (CPU and memory), a PVC can request a specific storage size and access modes (Access Modes).
A PV (PersistentVolume) is a piece of storage in the cluster, backed by NFS, iSCSI, Ceph, GlusterFS, or another storage system. PVs are created by the cluster administrator; developers then request them through PVCs. A PVC is a claim on a PV, much like a Pod is a claim on Node resources.
Statically provisioning a volume
That is, we manually create a PV and a PVC, bind them together, and then have a Pod use the PVC, which in turn uses the PV.
Create an NFS PV and its corresponding PVC
apiVersion: v1
kind: PersistentVolume
metadata:
  name: nfs-pv
spec:
  capacity:
    storage: 10Gi
  accessModes:
    - ReadWriteOnce
  persistentVolumeReclaimPolicy: Delete
  nfs:
    server: 192.168.203.110
    path: /data/nfs
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: nfs-pvc
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 10Gi
Check the PVC
$ kubectl get pvc
NAME      STATUS   VOLUME   CAPACITY   ACCESS MODES   STORAGECLASS   AGE
nfs-pvc   Bound    nfs-pv   10Gi       RWO                           101s
Create a Pod that uses the PVC
apiVersion: v1
kind: Pod
metadata:
  name: test-nfs
spec:
  containers:
    - image: ubuntu:22.04
      name: ubuntu
      command:
        - /bin/sh
        - -c
        - sleep 10000
      volumeMounts:
        - mountPath: /data
          name: nfs-volume
  volumes:
    - name: nfs-volume
      persistentVolumeClaim:
        claimName: nfs-pvc
❯ kubectl exec -it test-nfs -- cat /data/nfs
192.168.203.110:/data/nfs
The PVC/PV binding flow
func (ctrl *PersistentVolumeController) syncUnboundClaim(ctx context.Context, claim *v1.PersistentVolumeClaim) error {
logger := klog.FromContext(ctx)
if claim.Spec.VolumeName == "" {
// Is this delayed binding, i.e. is the VolumeBindingMode WaitForFirstConsumer?
delayBinding, err := storagehelpers.IsDelayBindingMode(claim, ctrl.classLister)
if err != nil {
return err
}
// Find the best matching PV for this PVC
volume, err := ctrl.volumes.findBestMatchForClaim(claim, delayBinding)
if err != nil {
logger.V(2).Info("Synchronizing unbound PersistentVolumeClaim, Error finding PV for claim", "PVC", klog.KObj(claim), "err", err)
return fmt.Errorf("error finding PV for claim %q: %w", claimToClaimKey(claim), err)
}
if volume == nil {
// No PV found for this claim (handling omitted)
} else /* pv != nil */ {
claimKey := claimToClaimKey(claim)
logger.V(4).Info("Synchronizing unbound PersistentVolumeClaim, volume found", "PVC", klog.KObj(claim), "volumeName", volume.Name, "volumeStatus", getVolumeStatusForLogging(volume))
// Bind the PV and the PVC.
// This updates the PVC's spec.volumeName and status, and the PV's status.
if err = ctrl.bind(ctx, volume, claim); err != nil {
return err
}
return nil
}
} else /* pvc.Spec.VolumeName != nil */ {
/*
......
*/
}
}
// Selection: FindMatchingVolume picks the best PV for a given claim
func FindMatchingVolume(
claim *v1.PersistentVolumeClaim,
volumes []*v1.PersistentVolume,
node *v1.Node,
excludedVolumes map[string]*v1.PersistentVolume,
delayBinding bool) (*v1.PersistentVolume, error) {
var smallestVolume *v1.PersistentVolume
var smallestVolumeQty resource.Quantity
requestedQty := claim.Spec.Resources.Requests[v1.ResourceName(v1.ResourceStorage)]
requestedClass := GetPersistentVolumeClaimClass(claim)
var selector labels.Selector
if claim.Spec.Selector != nil {
internalSelector, err := metav1.LabelSelectorAsSelector(claim.Spec.Selector)
if err != nil {
return nil, fmt.Errorf("error creating internal label selector for claim: %v: %v", claimToClaimKey(claim), err)
}
selector = internalSelector
}
// Go through all available volumes with two goals:
// - find a volume that is either pre-bound by user or dynamically
// provisioned for this claim. Because of this we need to loop through
// all volumes.
// - find the smallest matching one if there is no volume pre-bound to
// the claim.
for _, volume := range volumes {
if _, ok := excludedVolumes[volume.Name]; ok {
// Skip volumes in the excluded list
continue
}
if volume.Spec.ClaimRef != nil && !IsVolumeBoundToClaim(volume, claim) {
continue
}
volumeQty := volume.Spec.Capacity[v1.ResourceStorage]
if volumeQty.Cmp(requestedQty) < 0 {
continue
}
// filter out mismatching volumeModes
if CheckVolumeModeMismatches(&claim.Spec, &volume.Spec) {
continue
}
// check if PV's DeletionTimeStamp is set, if so, skip this volume.
if volume.ObjectMeta.DeletionTimestamp != nil {
continue
}
nodeAffinityValid := true
if node != nil {
// Scheduler path, check that the PV NodeAffinity
// is satisfied by the node
// CheckNodeAffinity is the most expensive call in this loop.
// We should check cheaper conditions first or consider optimizing this function.
err := CheckNodeAffinity(volume, node.Labels)
if err != nil {
nodeAffinityValid = false
}
}
if IsVolumeBoundToClaim(volume, claim) {
// If PV node affinity is invalid, return no match.
// This means the prebound PV (and therefore PVC)
// is not suitable for this node.
if !nodeAffinityValid {
return nil, nil
}
return volume, nil
}
if node == nil && delayBinding {
// PV controller does not bind this claim.
// Scheduler will handle binding unbound volumes
// Scheduler path will have node != nil
continue
}
// filter out:
// - volumes in non-available phase
// - volumes whose labels don't match the claim's selector, if specified
// - volumes in Class that is not requested
// - volumes whose NodeAffinity does not match the node
if volume.Status.Phase != v1.VolumeAvailable {
// We ignore volumes in non-available phase, because volumes that
// satisfies matching criteria will be updated to available, binding
// them now has high chance of encountering unnecessary failures
// due to API conflicts.
continue
} else if selector != nil && !selector.Matches(labels.Set(volume.Labels)) {
continue
}
if GetPersistentVolumeClass(volume) != requestedClass {
continue
}
if !nodeAffinityValid {
continue
}
if node != nil {
// Scheduler path
// Check that the access modes match
if !CheckAccessModes(claim, volume) {
continue
}
}
if smallestVolume == nil || smallestVolumeQty.Cmp(volumeQty) > 0 {
smallestVolume = volume
smallestVolumeQty = volumeQty
}
}
if smallestVolume != nil {
// Found a matching volume
return smallestVolume, nil
}
return nil, nil
}
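// ---------------------------------------------------------------------------
// A hypothetical, simplified sketch (not the real controller code) of what
// ctrl.bind amounts to once FindMatchingVolume has picked a PV: the PV's
// ClaimRef and the PVC's volumeName are set so the two objects point at each
// other, and both move to the Bound phase. The real controller performs the
// same changes via API updates and also records annotations such as
// pv.kubernetes.io/bound-by-controller.
// (Uses v1 "k8s.io/api/core/v1", like the code above.)
func bindSketch(pv *v1.PersistentVolume, pvc *v1.PersistentVolumeClaim) {
	pv.Spec.ClaimRef = &v1.ObjectReference{
		Kind:      "PersistentVolumeClaim",
		Namespace: pvc.Namespace,
		Name:      pvc.Name,
		UID:       pvc.UID,
	}
	pv.Status.Phase = v1.VolumeBound

	pvc.Spec.VolumeName = pv.Name
	pvc.Status.Phase = v1.ClaimBound
}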
How the kubelet mounts the volume
if err := os.MkdirAll(dir, 0750); err != nil {
return err
}
source := fmt.Sprintf("%s:%s", nfsMounter.server, nfsMounter.exportPath)
options := []string{}
if nfsMounter.readOnly {
options = append(options, "ro")
}
mountOptions := util.JoinMountOptions(nfsMounter.mountOptions, options)
err = nfsMounter.mounter.MountSensitiveWithoutSystemd(source, dir, "nfs", mountOptions, nil)
The kubelet then effectively runs a mount -t nfs ... command to mount the NFS export onto the host; the mount target is roughly /var/lib/kubelet/pods/[POD-ID]/volumes/
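To make that concrete, here is a small standalone sketch using k8s.io/mount-utils, the same library the plugin code above calls into. The target directory and NFS server address are placeholders, and it has to run as root on a host with the NFS client tools installed:
package main

import (
	"log"
	"os"

	mount "k8s.io/mount-utils"
)

func main() {
	// Placeholder target; the kubelet uses /var/lib/kubelet/pods/<pod-uid>/volumes/...
	target := "/tmp/nfs-demo"
	if err := os.MkdirAll(target, 0750); err != nil {
		log.Fatal(err)
	}

	// mount.New("") returns a Mounter that shells out to the mount(8) binary.
	mounter := mount.New("")
	if err := mounter.Mount("192.168.203.110:/data/nfs", target, "nfs", nil); err != nil {
		log.Fatal(err)
	}
	log.Printf("mounted NFS export at %s", target)
}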
StorageClass
A StorageClass is the Kubernetes resource used to describe a class of storage, such as NFS, iSCSI, Ceph, or GlusterFS. StorageClasses are cluster-scoped resources created by the cluster administrator; users can then reference a StorageClass to have PVs created dynamically.
Dynamically provisioning a volume
Compared with static provisioning, dynamic provisioning needs no intervention from the cluster administrator. The flow is shown in the diagram below:
Create a StorageClass, a PVC, and a Pod
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: local-storage
provisioner: rancher.io/local-path
reclaimPolicy: Delete
volumeBindingMode: WaitForFirstConsumer
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: my-local-pvc
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 128Mi
  storageClassName: local-storage
---
apiVersion: v1
kind: Pod
metadata:
  name: test
spec:
  containers:
    - image: ubuntu:22.04
      name: ubuntu
      command:
        - /bin/sh
        - -c
        - sleep 10000
      volumeMounts:
        - mountPath: /data
          name: my-local-pvc
  volumes:
    - name: my-local-pvc
      persistentVolumeClaim:
        claimName: my-local-pvc
Check the PV
❯ kubectl get pv
NAME                                       CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS   CLAIM                  STORAGECLASS    REASON   AGE
pvc-9d257d8a-29a8-4abf-a1e2-c7e4953fc0ca   128Mi      RWO            Delete           Bound    default/my-local-pvc   local-storage            85s
How a StorageClass-backed PV gets created
// Still inside syncUnboundClaim.
// volume == nil means no suitable PV was found; if the PVC's storageClassName is set, look up the corresponding StorageClass.
if volume == nil {
switch {
case delayBinding && !storagehelpers.IsDelayBindingProvisioning(claim):
// ......
case storagehelpers.GetPersistentVolumeClaimClass(claim) != "":
// The PVC specifies a storageClassName, so provision a volume via the corresponding StorageClass
if err = ctrl.provisionClaim(ctx, claim); err != nil {
return err
}
return nil
default:
}
return nil
}
func (ctrl *PersistentVolumeController) provisionClaim(ctx context.Context, claim *v1.PersistentVolumeClaim) error {
plugin, storageClass, err := ctrl.findProvisionablePlugin(claim)
ctrl.scheduleOperation(logger, opName, func() error {
var err error
if plugin == nil {
// For an external provisioner (here we installed the rancher.io/local-path provisioner),
// provisionClaimOperationExternal is called
_, err = ctrl.provisionClaimOperationExternal(ctx, claim, storageClass)
} else {
// An in-tree provisioner is handled directly
_, err = ctrl.provisionClaimOperation(ctx, claim, plugin, storageClass)
}
return err
})
return nil
}
// For an external provisioner, the PVC's annotations get volume.beta.kubernetes.io/storage-provisioner: rancher.io/local-path and volume.kubernetes.io/storage-provisioner: rancher.io/local-path
func (ctrl *PersistentVolumeController) setClaimProvisioner(ctx context.Context, claim *v1.PersistentVolumeClaim, provisionerName string) (*v1.PersistentVolumeClaim, error) {
if val, ok := claim.Annotations[storagehelpers.AnnStorageProvisioner]; ok && val == provisionerName {
// annotation is already set, nothing to do
return claim, nil
}
// The volume from method args can be pointing to watcher cache. We must not
// modify these, therefore create a copy.
claimClone := claim.DeepCopy()
// TODO: remove the beta storage provisioner anno after the deprecation period
logger := klog.FromContext(ctx)
metav1.SetMetaDataAnnotation(&claimClone.ObjectMeta, storagehelpers.AnnBetaStorageProvisioner, provisionerName)
metav1.SetMetaDataAnnotation(&claimClone.ObjectMeta, storagehelpers.AnnStorageProvisioner, provisionerName)
updateMigrationAnnotations(logger, ctrl.csiMigratedPluginManager, ctrl.translator, claimClone.Annotations, true)
newClaim, err := ctrl.kubeClient.CoreV1().PersistentVolumeClaims(claim.Namespace).Update(ctx, claimClone, metav1.UpdateOptions{})
if err != nil {
return newClaim, err
}
_, err = ctrl.storeClaimUpdate(logger, newClaim)
if err != nil {
return newClaim, err
}
return newClaim, nil
}
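// ---------------------------------------------------------------------------
// Hypothetical helper (not from sig-storage-lib-external-provisioner) showing
// what an external provisioner later keys on: the storage-provisioner
// annotation written by setClaimProvisioner above must match its own name.
func shouldProvision(claim *v1.PersistentVolumeClaim, provisionerName string) bool {
	ann := claim.Annotations["volume.kubernetes.io/storage-provisioner"]
	if ann == "" {
		ann = claim.Annotations["volume.beta.kubernetes.io/storage-provisioner"] // deprecated beta key
	}
	return ann == provisionerName
}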
kubernetes external provisioner
A Kubernetes external provisioner is a standalone process that creates PVs dynamically: it watches PVCs whose storage-provisioner annotation matches its provisioner name and provisions a PV for each of them, honouring the referenced StorageClass's parameters and reclaimPolicy.
Here I write a small external provisioner for NFS:
package main
import (
"context"
"fmt"
"path/filepath"
"github.com/golang/glog"
v1 "k8s.io/api/core/v1"
metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
"k8s.io/client-go/kubernetes"
"k8s.io/client-go/tools/clientcmd"
"k8s.io/client-go/util/homedir"
"sigs.k8s.io/controller-runtime/pkg/log"
"sigs.k8s.io/sig-storage-lib-external-provisioner/v10/controller"
)
const provisionerName = "provisioner.test.com/nfs"
var _ controller.Provisioner = &nfsProvisioner{}
type nfsProvisioner struct {
client kubernetes.Interface
}
func (p *nfsProvisioner) Provision(ctx context.Context, options controller.ProvisionOptions) (*v1.PersistentVolume, controller.ProvisioningState, error) {
if options.PVC.Spec.Selector != nil {
return nil, controller.ProvisioningFinished, fmt.Errorf("claim Selector is not supported")
}
glog.V(4).Infof("nfs provisioner: VolumeOptions %v", options)
pv := &v1.PersistentVolume{
ObjectMeta: metav1.ObjectMeta{
Name: options.PVName,
},
Spec: v1.PersistentVolumeSpec{
PersistentVolumeReclaimPolicy: *options.StorageClass.ReclaimPolicy,
AccessModes: options.PVC.Spec.AccessModes,
MountOptions: options.StorageClass.MountOptions,
Capacity: v1.ResourceList{
v1.ResourceName(v1.ResourceStorage): options.PVC.Spec.Resources.Requests[v1.ResourceName(v1.ResourceStorage)],
},
PersistentVolumeSource: v1.PersistentVolumeSource{
NFS: &v1.NFSVolumeSource{
Server: options.StorageClass.Parameters["server"],
Path: options.StorageClass.Parameters["path"],
ReadOnly: options.StorageClass.Parameters["readOnly"] == "true",
},
},
},
}
return pv, controller.ProvisioningFinished, nil
}
func (p *nfsProvisioner) Delete(ctx context.Context, volume *v1.PersistentVolume) error {
// Since this is plain NFS, no real backing resource was created, so there is nothing to delete here.
// If the provisioner had created resources, they would need to be cleaned up here,
// typically by calling a CSI driver to create/delete the resource.
return nil
}
func main() {
l := log.FromContext(context.Background())
config, err := clientcmd.BuildConfigFromFlags("", filepath.Join(homedir.HomeDir(), ".kube", "config"))
if err != nil {
glog.Fatalf("Failed to create kubeconfig: %v", err)
}
clientset, err := kubernetes.NewForConfig(config)
if err != nil {
glog.Fatalf("Failed to create client: %v", err)
}
clientNFSProvisioner := &nfsProvisioner{
client: clientset,
}
pc := controller.NewProvisionController(l,
clientset,
provisionerName,
clientNFSProvisioner,
controller.LeaderElection(true),
)
glog.Info("Starting provision controller")
pc.Run(context.Background())
}
Create an NFS StorageClass, a PVC, and a Pod
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: my-nfs
provisioner: provisioner.test.com/nfs
reclaimPolicy: Delete
volumeBindingMode: Immediate
parameters:
  server: "192.168.203.110"
  path: /data/nfs
  readOnly: "false"
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: my-nfs-pvc
spec:
  storageClassName: my-nfs
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 1Gi
---
apiVersion: v1
kind: Pod
metadata:
  name: test-nfs
spec:
  containers:
    - image: ubuntu:22.04
      name: ubuntu
      command:
        - /bin/sh
        - -c
        - sleep 10000
      volumeMounts:
        - mountPath: /data
          name: my-nfs-pvc
  volumes:
    - name: my-nfs-pvc
      persistentVolumeClaim:
        claimName: my-nfs-pvc
❯ kubectl exec -it test-nfs -- cat /data/nfs
192.168.203.110:/data/nfs
The CSI flow
The overall flow for persistent storage is shown in the diagram below:
Provisioner
When the csi-controller is deployed, it runs a sidecar container from the project at https://github.com/kubernetes-csi/external-provisioner
This project is a CSI provisioner: it watches the PVCs that belong to it, and when a new PVC is created it calls the CSI driver's CreateVolume method to create a volume and then creates the corresponding PV. When the PVC is deleted, it calls the CSI driver's DeleteVolume method and removes the volume and the PV.
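As a rough illustration of that translation — not the actual external-provisioner code — the provisioner ends up issuing a CSI CreateVolume RPC along the lines of the sketch below: the PV name becomes the volume name, the PVC's storage request becomes the capacity, and the StorageClass parameters are passed straight through to the driver.
package provisionsketch

import (
	"context"

	"github.com/container-storage-interface/spec/lib/go/csi"
)

// provisionSketch is a simplified, hypothetical helper: it asks the CSI
// controller service for a volume and returns the volume ID that would be
// recorded in the new PV's spec.csi.volumeHandle.
func provisionSketch(ctx context.Context, cc csi.ControllerClient,
	pvName string, requestedBytes int64, scParams map[string]string) (string, error) {
	resp, err := cc.CreateVolume(ctx, &csi.CreateVolumeRequest{
		Name:          pvName,
		CapacityRange: &csi.CapacityRange{RequiredBytes: requestedBytes},
		VolumeCapabilities: []*csi.VolumeCapability{{
			AccessType: &csi.VolumeCapability_Mount{Mount: &csi.VolumeCapability_MountVolume{}},
			AccessMode: &csi.VolumeCapability_AccessMode{Mode: csi.VolumeCapability_AccessMode_SINGLE_NODE_WRITER},
		}},
		Parameters: scParams,
	})
	if err != nil {
		return "", err
	}
	return resp.GetVolume().GetVolumeId(), nil
}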
Attacher
The external-attacher is another csi-controller sidecar; the project lives at https://github.com/kubernetes-csi/external-attacher
It is the CSI attacher: it watches VolumeAttachment objects (created by the AttachDetach controller), and when a new one appears it calls the CSI driver's ControllerPublishVolume method to attach the volume to the node; when the object is deleted it calls ControllerUnpublishVolume to detach the volume.
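For context, the VolumeAttachment objects it watches look roughly like the hypothetical example below (the driver and node names are made up):
package attachsketch

import (
	storagev1 "k8s.io/api/storage/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// newVolumeAttachment builds an example VolumeAttachment like the ones the
// AttachDetach controller creates; external-attacher reacts to it by calling
// ControllerPublishVolume and then sets status.attached to true.
func newVolumeAttachment(pvName string) *storagev1.VolumeAttachment {
	return &storagev1.VolumeAttachment{
		ObjectMeta: metav1.ObjectMeta{Name: "csi-example-attachment"},
		Spec: storagev1.VolumeAttachmentSpec{
			Attacher: "example.csi.driver.io", // CSI driver name (hypothetical)
			NodeName: "node-1",                // node the volume should be attached to
			Source: storagev1.VolumeAttachmentSource{
				PersistentVolumeName: &pvName,
			},
		},
	}
}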
Snapshotter
The external-snapshotter is another csi-controller sidecar; the project lives at https://github.com/kubernetes-csi/external-snapshotter
It is the CSI snapshotter: it watches VolumeSnapshot resources, and when a new one is created it calls the CSI driver's CreateSnapshot method to take a snapshot; when the resource is deleted it calls DeleteSnapshot to remove the snapshot.
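At the CSI level, taking a snapshot is again a single RPC; a hedged sketch (not the external-snapshotter code itself) of that call:
package snapshotsketch

import (
	"context"

	"github.com/container-storage-interface/spec/lib/go/csi"
)

// snapshotSketch is a simplified, hypothetical helper: it asks the CSI
// controller service to snapshot an existing volume and returns the snapshot
// ID that external-snapshotter would record in the VolumeSnapshotContent.
func snapshotSketch(ctx context.Context, cc csi.ControllerClient,
	snapshotName, sourceVolumeID string) (string, error) {
	resp, err := cc.CreateSnapshot(ctx, &csi.CreateSnapshotRequest{
		Name:           snapshotName,
		SourceVolumeId: sourceVolumeID,
	})
	if err != nil {
		return "", err
	}
	return resp.GetSnapshot().GetSnapshotId(), nil
}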
csi-node
csi-node is a kubelet plugin, so it has to run on every node. When a Pod is created and the corresponding VolumeAttachment reports .status.attached, the kubelet calls the CSI driver's NodeStageVolume function; the in-tree plugin (csiAttacher) then calls the in-tree CSI plugin's (csiMountMgr) SetUp function, which in turn calls the CSI driver's NodePublishVolume function to mount the volume into the Pod. When the Pod is deleted, the kubelet notices that a Pod using a CSI volume is gone and calls the in-tree CSI plugin's (csiMountMgr) TearDown function, which calls the external CSI driver's NodeUnpublishVolume function over a unix domain socket; the kubelet then calls the in-tree plugin's (csiAttacher) UnmountDevice function, which calls the external CSI driver's NodeUnstageVolume function over the same unix domain socket.
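A hedged sketch of the NodePublishVolume call that ends up mounting the volume into the Pod's directory (the volume ID, staging path, and capability here are placeholders; the target path follows the usual kubelet layout):
package nodesketch

import (
	"context"

	"github.com/container-storage-interface/spec/lib/go/csi"
)

// publishSketch is a simplified, hypothetical helper showing the final mount
// step: the staged volume is published into the Pod's volume directory.
func publishSketch(ctx context.Context, nc csi.NodeClient,
	volumeID, stagingPath, targetPath string) error {
	_, err := nc.NodePublishVolume(ctx, &csi.NodePublishVolumeRequest{
		VolumeId:          volumeID,
		StagingTargetPath: stagingPath, // e.g. the global mount directory prepared by NodeStageVolume
		TargetPath:        targetPath,  // e.g. /var/lib/kubelet/pods/<pod-uid>/volumes/kubernetes.io~csi/<pv>/mount
		VolumeCapability: &csi.VolumeCapability{
			AccessType: &csi.VolumeCapability_Mount{Mount: &csi.VolumeCapability_MountVolume{}},
			AccessMode: &csi.VolumeCapability_AccessMode{Mode: csi.VolumeCapability_AccessMode_SINGLE_NODE_WRITER},
		},
	})
	return err
}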
csi-node-driver-registrar
This is a csi-node sidecar container; the project lives at https://github.com/kubernetes-csi/node-driver-registrar
Its main job is to register the CSI plugin with the kubelet. After registration the kubelet calls the CSI plugin's Probe method, and if that succeeds it calls the plugin's NodeGetInfo method to obtain node information.
csi-livenessprobe
This is a csi-node sidecar container; the project lives at https://github.com/kubernetes-csi/livenessprobe
Its main job is to expose an endpoint that a Kubernetes livenessProbe can use to check whether the CSI plugin is running correctly: on a request to /healthz it calls the CSI driver's Probe method and returns 200 if the call succeeds, otherwise 500.
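A minimal sketch of that behaviour, assuming a gRPC connection to the driver's CSI socket (the socket path and listen port are placeholders):
package main

import (
	"context"
	"log"
	"net/http"
	"time"

	"github.com/container-storage-interface/spec/lib/go/csi"
	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
)

// healthzHandler calls the CSI Identity Probe RPC and maps the result to an
// HTTP status code, mirroring what the /healthz endpoint described above does.
func healthzHandler(identity csi.IdentityClient) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		ctx, cancel := context.WithTimeout(r.Context(), 10*time.Second)
		defer cancel()
		if _, err := identity.Probe(ctx, &csi.ProbeRequest{}); err != nil {
			http.Error(w, err.Error(), http.StatusInternalServerError) // 500
			return
		}
		w.WriteHeader(http.StatusOK) // 200
	}
}

func main() {
	// Placeholder socket path for the CSI driver.
	conn, err := grpc.NewClient("unix:///csi/csi.sock",
		grpc.WithTransportCredentials(insecure.NewCredentials()))
	if err != nil {
		log.Fatal(err)
	}
	http.Handle("/healthz", healthzHandler(csi.NewIdentityClient(conn)))
	log.Fatal(http.ListenAndServe(":9808", nil))
}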