Author: 周虛（應金挺）

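The open-sourcing of SREWorks has attracted many users to deploy and try our product, including quite a few who are new to Kubernetes. As use of the SREWorks cloud-native operations platform deepens, some users still have questions about the underlying principles and concepts. We are therefore launching the "Kubernetes Resource Orchestration Series", which starts from the Pod YAML at the bottom layer and works upward step by step. We hope it answers common questions about Kubernetes and gives users a deeper understanding of cloud-native technology.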

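1. Pod overall structure

The overall structure of a Pod YAML can be roughly divided into four parts: Resource, Object (metadata), Spec (specification), and Status. This article walks through these four parts one by one.

Resource: defines the resource type and version; these attributes are required to fetch the resource from the REST API.
Object: the metadata of the resource, which establishes its basic identity.
Spec / Status:

- Spec: defines the desired state of the resource, including user-provided configuration, defaults filled in by the system, and values initialized or changed by surrounding components (scheduler, HPA, and so on).
- Status: records the current state of the resource, so that, driven by the declarative configuration in Spec, the Pod keeps converging toward the desired state.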

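2. Resource - REST API

By scope, k8s resources fall into namespaced resources and cluster-scoped resources. A Namespace in k8s acts as a soft tenant and isolates resources from one another. Pod is a namespaced resource, and the namespace appears not only in the YAML fields but also in the k8s REST API paths.

The overall structure of the REST API, using a Pod as the example: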

apiVersion: v1
kind: Pod
metadata:
  name: test-pod
  namespace: default

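The YAML above identifies a Pod resource object with namespace default and name test-pod, which also shows that Pod is a namespaced resource. The object's apiVersion is v1, which belongs to the k8s built-in core group and is served under the /api prefix. From this we can extract the following data for the object:

group: api
apiVersion: v1
kind: Pod
name: test-pod
namespace: default

Based on this data, the apiserver registers the following REST API paths, where {resource} is the lowercase plural form of the kind (pods for Pod):

/api/{apiVersion}/{resource}: all resources of this kind
/api/{apiVersion}/namespaces/{namespace}/{resource}: all resources of this kind in the given namespace
/api/{apiVersion}/namespaces/{namespace}/{resource}/{name}: the resource of this kind with the given name in the given namespace
/api/{apiVersion}/namespaces/{namespace}/{resource}/{name}/{subresource}: subresource operations on that resource

Adding the HTTP method on top of these paths gives a complete REST API.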

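3. Object (metadata)

The REST API pins down the kind and apiVersion of the Resource and the namespace and name of the Object. Since metadata is a common structure shared by every k8s resource object, it also offers a number of common mechanisms.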

metadata:
  annotations:
    alibabacloud.com/owner: testdemo
    k8s.aliyun.com/pod-eni: "true"
  creationTimestamp: "2022-06-02T07:21:36Z"
  deletionTimestamp: "2022-06-02T07:22:51Z"
  labels:
    app: taihao-app-cn-shanghai-pre-cloud-resource
    pod-template-hash: 5bbb759f78
  name: testdemo-5bbb759f78-27v88
  namespace: default
  ownerReferences:
  - apiVersion: apps/v1
    blockOwnerDeletion: true
    controller: true
    kind: ReplicaSet
    name: testdemo-5bbb759f78
    uid: 9c3f268a-c0d1-4038-bb2b-b92928f45e3d
  resourceVersion: "60166035"
  uid: e4236960-8be2-41bf-ac44-e7460378afbb

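Looking at the YAML above, the fields can be summarized as follows:

namespace: only namespaced resources use this field
name: the name of the resource instance
uid: the unique identifier of the resource; it distinguishes a deleted instance from a recreated one with the same name
resourceVersion: the internal k8s version of the object; it advances whenever the object changes, which makes changes detectable and underpins the core k8s list-watch mechanism
creationTimestamp: the creation time of the resource instance
deletionTimestamp: the deletion time of the resource instance; its use is covered later in the Pod lifecycle
ownerReferences: the owner of the resource. In the YAML above, the Pod is owned by the ReplicaSet named testdemo-5bbb759f78. ownerReferences has no namespace field, so an owner reference cannot cross namespaces. This builds the relationship between resources from the bottom up
labels: k8s service discovery and the related soft associations all revolve around labels. For example, the labelSelector of the ReplicaSet testdemo-5bbb759f78 matches the labels of the current Pod, which establishes the relationship between the two from the top down
annotations: usually extra fields consumed by surrounding systems; for example, k8s.aliyun.com/pod-eni="true" here is consumed by the network system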

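label & labelSelector

A Deployment uses its own labelSelector (app=taihao-app-cluster), together with the hash label computed from its pod template (pod-template-hash: 5b8b879786), to select the matching ReplicaSet. The ReplicaSet in turn uses its own labelSelector to select the matching Pods. A Service, which provides service discovery, likewise selects its Pods through a labelSelector.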
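
As a minimal sketch of label-based service discovery, a Service could select Pods with a selector like the one below. The Service name testdemo-svc and both port numbers are assumptions made for illustration; only the app: testdemo label is taken from the Deployment and ReplicaSet samples in this article.

```yaml
apiVersion: v1
kind: Service
metadata:
  name: testdemo-svc          # hypothetical name
  namespace: default
spec:
  selector:
    app: testdemo             # selects every Pod labeled app=testdemo
  ports:
  - port: 80                  # assumed Service port
    targetPort: 8080          # assumed container port
```

The selector only looks at labels, so any Pod in the namespace carrying app=testdemo becomes a backend, regardless of which ReplicaSet created it.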

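Owner & GC (garbage collection)

From a Pod's metadata.ownerReferences we can find the owning ReplicaSet, and from the ReplicaSet's metadata.ownerReferences we can find the Deployment. When the Deployment is deleted, its ReplicaSets and Pods are reclaimed by following the tree built from these owner references.

Deploy & ReplicaSet

label & labelSelector define selection from the top down; owner & GC define how associated resources are reclaimed.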

apiVersion: apps/v1
kind: ReplicaSet
metadata:
  generation: 1
  labels:
    app: testdemo
    pod-template-hash: bcd889947
  name: testdemo-bcd889947
  namespace: taihao
  ownerReferences:
  - apiVersion: apps/v1
    blockOwnerDeletion: true
    controller: true
    kind: Deployment
    name: testdemo
    uid: 1dddc849-c254-4cf5-aec8-9e1c2b5e65af
spec:
  replicas: 1
  selector:
    matchLabels:
      app: testdemo
      pod-template-hash: bcd889947
  template:
    metadata:
      creationTimestamp: null
      labels:
        app: testdemo
        pod-template-hash: bcd889947
    spec:
      containers:
      - args:
        - -c
        - sleep 1000000
        command:
        - sh
        image: centos:7
        imagePullPolicy: IfNotPresent
        name: testdemo
status:
  fullyLabeledReplicas: 1
  observedGeneration: 1
  replicas: 1

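replicaset.spec.replicas: the number of instances, that is, the number of Pods the rs manages

replicaset.spec.selector: selects the matching Pods by label

replicaset.spec.template: the pod template from which the ReplicaSet creates Pods

replicaset.status: the current state of the Pods managed by the ReplicaSet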

apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: testdemo
  name: testdemo
spec:
  replicas: 1
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      app: testdemo
  strategy:
    rollingUpdate:
      maxSurge: 25%
      maxUnavailable: 25%
    type: RollingUpdate
  template:
    metadata:
      creationTimestamp: null
      labels:
        app: testdemo
    spec:
      containers:
      - args:
        - -c
        - sleep 1000000
        command:
        - sh
        image: centos:7
        imagePullPolicy: IfNotPresent
        name: testdemo
status:
  availableReplicas: 1
  observedGeneration: 2
  readyReplicas: 1
  replicas: 2
  unavailableReplicas: 1
  updatedReplicas: 1

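deploy.spec.replicas: the desired number of Pod instances

deploy.spec.revisionHistoryLimit: the number of old ReplicaSets the deploy retains (10 here)

deploy.spec.selector: the label selector the deploy uses to match its resources

deploy.spec.strategy: the upgrade strategy of the deploy

deploy.spec.template: the template that defines the Pods the deploy creates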

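4. Spec (specification)

Spec describes the desired state of the Pod and, to a degree, also covers the logic of the Pod's full lifecycle. The lifecycle can be divided into the following stages:

Pending: the Pod has not been scheduled yet
Creating: the kubelet on the node has discovered the Pod and is creating it
Running: at least one container is running, and the kubelet starts health checks
Terminating: the Pod is being deleted, and the kubelet starts reclaiming its containers
Terminated: the Pod has been destroyed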

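Pod lifecycle: Pending

After the Pod resource is created it has not been scheduled yet. At this point the scheduler schedules it based on the configuration in the Pod YAML and the resource state of the nodes.

The scheduler parses the Pod YAML, extracts the policies in it, and matches them against the nodes in the node pool. If there are matching nodes, it picks the best one and updates spec.nodeName in the Pod YAML, which completes scheduling.

Resource policy

The resource policy states the resources the Pod needs to run. For example, if the Pod needs 2 cores and 4G of memory, the target node must still have at least 2 cores and 4G available before the Pod can run there.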
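
A resource policy of this kind might be written as follows. This is a sketch rather than the original demo: the container name is assumed, and 4G is expressed as 4Gi. The scheduler compares the requests values, not the limits, against what is still allocatable on each node.

```yaml
spec:
  containers:
  - name: testdemo              # assumed container name
    image: testdemo:v1
    resources:
      requests:
        cpu: "2"                # 2 cores, used by the scheduler for placement
        memory: 4Gi
      limits:
        cpu: "2"                # upper bound enforced at runtime
        memory: 4Gi
```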

Node label selection policy

The node label selection policy checks whether a node carries the label topology.kubernetes.io/region: cn-hangzhou.
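
One way to express this is spec.nodeSelector, sketched below. Only the label key and value come from the text; the container entry is a placeholder to keep the fragment complete.

```yaml
spec:
  nodeSelector:
    topology.kubernetes.io/region: cn-hangzhou   # node must carry exactly this label
  containers:
  - name: testdemo                               # assumed container name
    image: testdemo:v1
```

nodeSelector is a hard requirement: if no node carries the label, the Pod stays Pending.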

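Affinity policy

Affinity comes in two forms: node affinity and Pod affinity (prefer the nodes where certain Pods already run). In general, nodes that satisfy the affinity are preferred. The example here is node affinity, which matches nodes labeled disk-type=aaa or disk-type=bbb.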
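
A node affinity rule for these two label values could look like the sketch below. It uses the hard form, requiredDuringSchedulingIgnoredDuringExecution, which is an assumption; if the intent is only a preference, preferredDuringSchedulingIgnoredDuringExecution with a weight would be the softer alternative.

```yaml
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: disk-type
            operator: In        # matches if the node label value is any of the listed values
            values:
            - aaa
            - bbb
  containers:
  - name: testdemo              # assumed container name
    image: testdemo:v1
```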

Taint policy

When a node has a taint configured and the Pod has no toleration for that taint, the Pod is not allowed to be scheduled onto the node.
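
As an illustration, suppose a node has been tainted with dedicated=test:NoSchedule. The key, value, and effect are hypothetical and do not come from the original demo. A Pod that should still be allowed onto that node would carry a matching toleration:

```yaml
spec:
  tolerations:
  - key: dedicated              # hypothetical taint key
    operator: Equal
    value: test                 # hypothetical taint value
    effect: NoSchedule          # must match the effect of the taint
  containers:
  - name: testdemo              # assumed container name
    image: testdemo:v1
```

A toleration only permits scheduling onto the tainted node; it does not by itself attract the Pod to it.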

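Pod lifecycle: Creating

Once scheduling is done, the creation stage begins: the kubelet creates the Pod according to the desired state in pod.spec.

During creation the kubelet roughly goes through the following steps:

Cgroup configuration: sets up cgroups for the containers, which enforces resource limits such as the configured cpu and memory; this is also where the Pod's QoS class is determined
Environment initialization: prepares the Pod's data directories; volumes are set up through CSI, and image pull secrets are fetched in preparation for pulling images
Creating the pause container: the pause container exists mainly so that the container network can be configured, which is done by calling CNI
Creating the Pod containers: pulls the business images using the image pull secrets; the relevant Pod YAML configuration is passed in when the containers are created, and once a container has started, the postStart callback is invoked

Some of the key concepts in these steps are explained in more detail below.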

image

spec:
  containers:
  - image: testdemo:v1
    imagePullPolicy: Always
    name: test-config
  imagePullSecrets:
  - name: image-regsecret

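imagePullSecrets: the credentials for pulling images, which make sure that image testdemo:v1 can be pulled, especially when the registry is private

imagePullPolicy: the image pull policy

Always: always pull the image
IfNotPresent: use the local image if present and do not pull
Never: only use the local image and never pull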

containers

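Note that containers is plural: several containers can be listed, for example nginx alongside the business container. The benefit is that code and processes unrelated to the business logic can be kept out of the business container as far as possible.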
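
A two-container Pod of this kind might look like the following sketch. The nginx image tag and the port are assumptions; the Pod name and the business image reuse values from earlier samples in this article.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: test-pod
  namespace: default
spec:
  containers:
  - name: nginx                 # helper container placed in front of the business container
    image: nginx:1.25           # assumed tag
    ports:
    - containerPort: 80         # assumed port
  - name: testdemo              # business container
    image: testdemo:v1
```

Containers in the same Pod share one network namespace, so nginx can reach the business container over localhost.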

A container involves many settings, including basic ones such as volume, env, dnsConfig, and host.

spec:
  containers:
  - env:
    - name: TZ
      value: Asia/Shanghai
    image: testdemo:v1
    name: taihao-app-cn-shanghai-pre-share
    volumeMounts:
    - mountPath: /home/admin
      name: test-config
      readOnly: true
  dnsConfig:
    nameservers:
    - 100.100.1.1
    - 100.100.2.1
    options:
    - name: ndots
      value: "3"
    - name: timeout
      value: "3"
    - name: attempts
      value: "3"
    searches:
    - default.svc.cluster.local
    - svc.cluster.local
    - cluster.local
  hostAliases:
  - hostnames:
    - kubernetes
    - kubernetes.default
    - kubernetes.default.svc
    - kubernetes.default.svc.cluster.local
    ip: 1.1.1.1
  volumes:
  - configMap:
      defaultMode: 420
      name: test-config
    name: test-config

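env: sets environment variables for the container

dnsConfig: configures DNS resolution for the Pod

hostAliases: adds entries to the /etc/hosts file

volumes / volumeMounts: mount configuration files into the container; a storage system can be mounted into the container in the same way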
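
The volume in the sample refers to a ConfigMap named test-config. A minimal sketch of such a ConfigMap is shown below; the key application.properties and its content are hypothetical.

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: test-config
  namespace: default
data:
  application.properties: |     # hypothetical key; each key becomes one file in the mount
    server.port=8080
```

With the volumeMounts entry from the sample, this key would appear in the container as the read-only file /home/admin/application.properties. The defaultMode value 420 in the sample is the decimal form of the octal file mode 0644.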

postStart

containers:
  - image: testdemo:v1
    imagePullPolicy: Always
    lifecycle:
      postStart:
        exec:
          command:
          - /bin/sh
          - -c
          - sleep 5

The postStart demo here runs a command; it can also issue an HTTP request. It is mainly used for deploying resources and preparing the environment.
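
The HTTP form of the same hook could be sketched as follows. The container name, the /init path, and port 8080 are assumptions for illustration.

```yaml
containers:
  - image: testdemo:v1
    name: testdemo              # assumed container name
    lifecycle:
      postStart:
        httpGet:
          path: /init           # hypothetical endpoint served by the container
          port: 8080            # assumed port
          scheme: HTTP
```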

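Pod lifecycle: Running

Once the Pod is in the Running stage, its health starts to be checked. The kubelet currently offers three ways to judge it:

readiness: checks whether the Pod is healthy, that is, ready to serve
liveness: checks whether the container is working properly; if the check fails, the container is restarted
readinessGate: lets third-party components take part in the health verdict; if the third-party check does not pass, the Pod is not healthy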

spec:
  readinessGates:
  - conditionType: TestPodReady
  containers:
  - image: testdemo:v1
    imagePullPolicy: Always
    livenessProbe:
      failureThreshold: 3
      initialDelaySeconds: 45
      periodSeconds: 5
      successThreshold: 1
      tcpSocket:
        port: 8080
      timeoutSeconds: 1
    readinessProbe:
      failureThreshold: 3
      httpGet:
        path: /actuator/health
        port: 8989
        scheme: HTTP
      initialDelaySeconds: 25
      periodSeconds: 3
      successThreshold: 1
      timeoutSeconds: 1

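readiness and liveness checks take the same parameters:

httpGet / tcpSocket: the check method, either an HTTP request or a TCP socket connection; exec (run a command) and grpc checks are also available
initialDelaySeconds: how long to wait before the first check, since a container usually needs some time after startup before it can be verified
periodSeconds: the interval between checks
failureThreshold: the number of consecutive failures after which the check is considered failed
successThreshold: the number of consecutive successes after which the check is considered passed
timeoutSeconds: the check timeout; if no response arrives within this time, the check counts as failed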
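
For comparison with the httpGet and tcpSocket forms, an exec-based check could be sketched like this. The container name and the marker file /tmp/ready are hypothetical; the timing values are copied from the readinessProbe sample above.

```yaml
spec:
  containers:
  - image: testdemo:v1
    name: testdemo              # assumed container name
    readinessProbe:
      exec:
        command:                # the check passes when the command exits with code 0
        - cat
        - /tmp/ready            # hypothetical marker file written by the application
      initialDelaySeconds: 25
      periodSeconds: 3
      failureThreshold: 3
      successThreshold: 1
      timeoutSeconds: 1
```

With periodSeconds: 3 and failureThreshold: 3, a Pod that was ready is marked not ready after three consecutive failed checks, which is roughly nine seconds of continuous failure.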

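Although readiness and liveness take the same parameters, they act on the result differently.

readiness defaults to false: the Pod is unhealthy until the check passes, and only then is it marked healthy
liveness defaults to true: the Pod is not restarted at the start, and the container is restarted only after the check fails

readinessGate is an extension of Pod health. With it, Pod health also depends on a matching condition in pod.status.conditions, and a missing condition counts as False. For example, with the readinessGate conditionType: TestPodReady used here, the corresponding condition looks like this: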

status:
  conditions:
  - lastProbeTime: null
    lastTransitionTime: "2022-07-05T09:16:07Z"
    status: "False"
    type: TestPodReady

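While this condition.status is False, the Pod stays unhealthy even if the readiness check passes. Only when a third-party system updates the condition.status to True can the Pod become healthy. This makes it possible to plug additional health signals into the Pod.

Pod lifecycle: Terminating

When a client requests deletion of a Pod, what actually happens is that pod.metadata.deletionTimestamp is set. Once the kubelet notices this, it starts reclaiming the Pod.

The Pod's shutdown sequence is normally preStop -> SIGTERM -> SIGKILL.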

lifecycle:
  preStop:
    exec:
      command:
      - /bin/sh
      - -c
      - sleep 5

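After running preStop, the kubelet sends SIGTERM to the processes in the container. If the total time exceeds the grace period, which is 30s by default (spec.terminationGracePeriodSeconds, recorded on deletion as metadata.deletionGracePeriodSeconds), the kubelet forcibly sends SIGKILL to the container. In other words, preStop plus SIGTERM handling together must not take longer than 30s.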
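
If an application needs longer than the default to shut down, the grace period can be raised in the Pod spec. The sketch below assumes a value of 60 seconds and an assumed container name; the preStop hook is the one from the sample above.

```yaml
spec:
  terminationGracePeriodSeconds: 60   # assumed value; 30 is the default when omitted
  containers:
  - image: testdemo:v1
    name: testdemo                    # assumed container name
    lifecycle:
      preStop:
        exec:
          command:
          - /bin/sh
          - -c
          - sleep 5                   # counts against the grace period
```

Here the 5 seconds spent in preStop leave about 55 seconds for the process to exit after SIGTERM before SIGKILL is sent.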

5. Status

status:
  conditions:
  - lastProbeTime: null
    lastTransitionTime: "2022-07-05T09:16:07Z"
    status: "True"
    type: TestPodReady
  - lastProbeTime: null
    lastTransitionTime: "2022-07-05T09:16:07Z"
    status: "True"
    type: Initialized
  - lastProbeTime: null
    lastTransitionTime: "2022-07-05T09:16:14Z"
    status: "True"
    type: Ready
  - lastProbeTime: null
    lastTransitionTime: "2022-07-05T09:16:14Z"
    status: "True"
    type: ContainersReady
  - lastProbeTime: null
    lastTransitionTime: "2022-07-05T09:16:07Z"
    status: "False"
    type: ContainerDiskPressure
  - lastProbeTime: null
    lastTransitionTime: "2022-07-05T09:16:07Z"
    status: "True"
    type: PodScheduled
  containerStatuses:
  - containerID: containerd://xxxxx
    image: docker.io/library/testdemo:v1
    imageID: docker.io/library/centos@sha256:xxxx
    lastState: {}
    name: zxtest
    ready: true
    restartCount: 0
    started: true
    state:
      running:
        startedAt: "2022-07-05T09:16:13Z"
  hostIP: 21.1.96.23
  phase: Running
  podIP: 10.11.17.172
  podIPs:
  - ip: 10.11.17.172
  qosClass: Guaranteed
  startTime: "2022-07-05T09:16:07Z"

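Based on the YAML sample above, the Pod status breaks down as follows:

conditions: conditions are a more detailed status report and also an extension mechanism in their own right; other extension fields can be added to them, for example to report network state. readinessGate is one use of this mechanism. Whether the Pod is ready, however, is always decided solely by whether type: Ready is True
containerStatuses: the state of each container in the Pod
hostIP: the IP address of the node the Pod runs on
phase: the lifecycle phase of the Pod
- Pending: one or more containers of the Pod are not running yet; this covers the time before the Pod is scheduled to a node and the time spent pulling images
- Running: the Pod is bound to a node and at least one container is running or restarting
- Succeeded: all containers in the Pod have terminated successfully
- Failed: all containers in the Pod have terminated and at least one of them terminated in failure
- Unknown: the state of the Pod cannot be obtained
podIP / podIPs: the IP addresses of the Pod; when there are both IPv4 and IPv6 addresses, they are listed in podIPs
qosClass: the Kubernetes quality of service class
- Guaranteed: resources.requests equals resources.limits
- Burstable: resources.requests differs from resources.limits
- BestEffort: neither resources.requests nor resources.limits is set

startTime: the start time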
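
To make the qosClass values listed above concrete, the sketch below shows a container whose requests are lower than its limits, which results in Burstable. The container name and all values are assumptions.

```yaml
spec:
  containers:
  - name: testdemo              # assumed container name
    image: testdemo:v1
    resources:
      requests:                 # lower than limits, so the Pod is classed as Burstable
        cpu: 500m
        memory: 1Gi
      limits:
        cpu: "1"
        memory: 2Gi
```

Setting requests equal to limits for both cpu and memory in every container would make the Pod Guaranteed, and removing the resources block from every container would make it BestEffort.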

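With the four parts above taken apart, we have largely answered the question of "where a Pod comes from" in k8s. Later articles in this series will turn to "where it goes": the appeal of Kubernetes lies not only in bringing up a single workload but in orchestrating massive numbers of workloads on demand.

Later articles will be published on our official account "阿里智慧運維". Please stay tuned. You are also welcome to leave a message there about topics you would like to learn about and to exchange ideas with the SREWorks team.

Kubernetes Resource Orchestration Series, Part 1: Pod YAML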