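K8S Internals Series: Part 1

Once Kubernetes won the container orchestration wars, K8S became the operating system of the cloud-native era. K8S makes everything simpler, yet K8S itself keeps growing more complex. In the K8S Internals series, the BoCloud (博雲) container cloud R&D team will regularly share hot topics across the K8S ecosystem, including scheduling, security, networking, performance, storage, and application scenarios. We hope that while you enjoy the efficiency and convenience K8S brings, you can also appreciate its core mechanisms with the ease of a master butcher carving an ox.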

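1. Introduction to Pod Security Policy

Pod Security Admission is designed to replace Pod Security Policy, so it is worth introducing Pod Security Policy first. A Pod Security Policy defines a set of conditions that a Pod must satisfy at run time, together with default values for the related fields; a Pod can be created only if it meets these conditions. The spec of a Pod Security Policy object contains the following fields, which are the aspects a Pod Security Policy can control: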

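Control aspect	Field names
Running privileged containers	privileged
Using host namespaces	hostPID, hostIPC
Using host networking and ports	hostNetwork, hostPorts
Controlling which volume types may be used	volumes
Using the host filesystem	allowedHostPaths
Allowing specific FlexVolume drivers	allowedFlexVolumes
Allocating an FSGroup that owns the Pod's volumes	fsGroup
Requiring a read-only root filesystem	readOnlyRootFilesystem
Setting the user and group IDs of the container	runAsUser, runAsGroup, supplementalGroups
Restricting escalation to root privileges	allowPrivilegeEscalation, defaultAllowPrivilegeEscalation
Linux capabilities	defaultAddCapabilities, requiredDropCapabilities, allowedCapabilities
Setting the SELinux context of the container	seLinux
Specifying the allowed proc mount types for a container	allowedProcMountTypes
Specifying the AppArmor profile used by containers	annotations
Specifying the seccomp profile used by containers	annotations
Specifying the sysctl profile used by containers	forbiddenSysctls, allowedUnsafeSysctls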

AppArmor and seccomp are configured by adding annotations to the PodSecurityPolicy object:

seccomp.security.alpha.kubernetes.io/allowedProfileNames: 'docker/default'
seccomp.security.alpha.kubernetes.io/defaultProfileName: 'docker/default'
apparmor.security.beta.kubernetes.io/allowedProfileNames: 'runtime/default'
apparmor.security.beta.kubernetes.io/defaultProfileName: 'runtime/default'

Pod Security Policy is a cluster-level resource. Its usage flow is as follows:

[Figure: PSP usage flow]

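Because a PSP takes effect only after a ClusterRole/Role is created and bound to a service account through a ClusterRoleBinding/RoleBinding, it is hard to tell which PSPs are actually in use, and harder still to tell which security rules constrain the creation of a given Pod.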

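2. Why Pod Security Admission Was Introduced

Anyone who has used PodSecurityPolicy will have run into its problems: it has no dry-run or audit mode, it is awkward to turn on and off, and its behavior is not very transparent. As a result of these shortcomings, PodSecurityPolicy was deprecated in Kubernetes v1.21 and is scheduled for removal in v1.25, while Kubernetes v1.22 added the new Pod Security Admission feature.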

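3. Introduction to Pod Security Admission

Pod Security Admission is an admission controller built into Kubernetes. Its feature gate is enabled by default in Kubernetes v1.23; in v1.22 it must be enabled with the kube-apiserver flag --feature-gates="...,PodSecurity=true". On Kubernetes versions older than v1.22 you can install the Pod Security Admission Webhook yourself.

Pod Security Admission restricts Pod creation in the cluster by enforcing the built-in Pod Security Standards.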

3.1 Pod Security Standards

To cover a broad range of security scenarios, the Pod Security Standards define three progressively stricter Pod security policy levels:

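Profile	Description
Privileged	Unrestricted policy that provides the widest possible level of permissions. This policy allows known privilege escalations.
Baseline	Minimally restrictive policy that prevents known privilege escalations. Allows the default (minimally specified) Pod configuration.
Restricted	Heavily restricted policy that follows current Pod hardening best practices.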

See Pod Security Standards for details.

3.2 Enforcing Pod Security Standards

Once the Pod Security Admission feature gate is enabled in a Kubernetes cluster, Pod Security Standards are enforced by setting labels on a namespace. Three modes are available:

Mode	Description
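enforce	Pods that violate the policy are rejected.
audit	Violations add an audit annotation to the event recorded in the audit log, but the request is otherwise allowed.
warn	Violations trigger a user-facing warning, but the request is otherwise allowed.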
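
The effect of the three modes can be summarized in a minimal Go sketch. The `Decision` type and the `applyModes` function are hypothetical names chosen for illustration and are not taken from the Kubernetes source. The sketch only mirrors the table: an enforce violation rejects the Pod, while audit and warn violations leave the request allowed and merely attach an audit annotation or a user-facing warning.

```go
package main

import "fmt"

// Decision is a hypothetical, simplified admission outcome.
type Decision struct {
	Allowed          bool
	AuditAnnotations []string
	Warnings         []string
}

// applyModes combines per-mode violation messages into one decision.
// An empty string means the Pod satisfies the level configured for that mode.
func applyModes(enforceViolation, auditViolation, warnViolation string) Decision {
	d := Decision{Allowed: true}
	if enforceViolation != "" {
		d.Allowed = false // enforce: the Pod is rejected
	}
	if auditViolation != "" {
		// audit: record the violation, but do not reject
		d.AuditAnnotations = append(d.AuditAnnotations, auditViolation)
	}
	if warnViolation != "" {
		// warn: tell the user, but do not reject
		d.Warnings = append(d.Warnings, warnViolation)
	}
	return d
}

func main() {
	// The Pod violates only the levels configured for audit and warn.
	d := applyModes("", "hostPID=true", "hostPID=true")
	fmt.Printf("allowed=%v audit=%d warnings=%d\n",
		d.Allowed, len(d.AuditAnnotations), len(d.Warnings))
}
```

Because each mode is evaluated independently, a namespace can, for example, enforce a lenient level while warning about a stricter one.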

The label template is explained below:

# Set the mode and the security standard level.
# MODE must be one of `enforce`, `audit`, or `warn`.
# LEVEL must be one of `privileged`, `baseline`, or `restricted`.
pod-security.kubernetes.io/<MODE>: <LEVEL>

# Optional: pins the version of the security standard to use.
# MODE must be one of `enforce`, `audit`, or `warn`.
# VERSION must be a valid Kubernetes minor version (for example v1.23), or `latest`.
pod-security.kubernetes.io/<MODE>-version: <VERSION>

A namespace can set any combination of modes, and each mode can be assigned a different security standard level.
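
As an illustration of the label rules, here is a minimal Go sketch that validates a single namespace label. The function `validateLabel` and the regular expression are assumptions made for this example and are not the validation code used by Kubernetes itself; in particular, the version pattern is a simplified reading of "a valid Kubernetes minor version or `latest`".

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

const labelPrefix = "pod-security.kubernetes.io/"

// versionPattern accepts "latest" or a minor version such as "v1.23" (simplified).
var versionPattern = regexp.MustCompile(`^(latest|v1\.(0|[1-9][0-9]*))$`)

var validModes = map[string]bool{"enforce": true, "audit": true, "warn": true}
var validLevels = map[string]bool{"privileged": true, "baseline": true, "restricted": true}

// validateLabel checks one label against the <MODE> / <MODE>-version rules.
// Labels outside the pod-security prefix are ignored and return nil.
func validateLabel(key, value string) error {
	if !strings.HasPrefix(key, labelPrefix) {
		return nil
	}
	name := strings.TrimPrefix(key, labelPrefix)
	if strings.HasSuffix(name, "-version") {
		mode := strings.TrimSuffix(name, "-version")
		if !validModes[mode] {
			return fmt.Errorf("unknown mode %q", mode)
		}
		if !versionPattern.MatchString(value) {
			return fmt.Errorf("invalid version %q", value)
		}
		return nil
	}
	if !validModes[name] {
		return fmt.Errorf("unknown mode %q", name)
	}
	if !validLevels[value] {
		return fmt.Errorf("invalid level %q", value)
	}
	return nil
}

func main() {
	fmt.Println(validateLabel("pod-security.kubernetes.io/enforce", "baseline"))
	fmt.Println(validateLabel("pod-security.kubernetes.io/warn-version", "v1.23"))
	fmt.Println(validateLabel("pod-security.kubernetes.io/enforce", "strict"))
}
```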

Cluster-wide defaults for Pod Security Admission can be set through the admission controller configuration file:

apiVersion: apiserver.config.k8s.io/v1
kind: AdmissionConfiguration
plugins:
- name: PodSecurity
  configuration:
    apiVersion: pod-security.admission.config.k8s.io/v1beta1
    kind: PodSecurityConfiguration
    # Defaults applied when a mode label is not set.
    #
    # Level label values must be one of:
    # - "privileged" (default)
    # - "baseline"
    # - "restricted"
    #
    # Version label values must be one of:
    # - "latest" (default) 
    # - specific version like "v1.23"
    defaults:
      enforce: "privileged"
      enforce-version: "latest"
      audit: "privileged"
      audit-version: "latest"
      warn: "privileged"
      warn-version: "latest"
    exemptions:
      # Array of authenticated usernames to exempt.
      usernames: []
      # Array of runtime class names to exempt.
      runtimeClassNames: []
      # Array of namespaces to exempt.
      namespaces: []

Pod Security Admission can exempt Pods from the security standard checks along three dimensions: username, runtimeClassName, and namespace.
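
The three exemption dimensions can be pictured with a small Go sketch. The `Exemptions` struct and `isExempt` method below are hypothetical simplifications written for this article, not the upstream types; the sketch assumes that a request is exempt as soon as any one dimension matches.

```go
package main

import "fmt"

// Exemptions is a simplified stand-in for the three exemption lists in the configuration.
type Exemptions struct {
	Usernames         []string
	RuntimeClassNames []string
	Namespaces        []string
}

// contains reports whether list includes s.
func contains(list []string, s string) bool {
	for _, v := range list {
		if v == s {
			return true
		}
	}
	return false
}

// isExempt reports whether a request matches any of the three dimensions.
func (e Exemptions) isExempt(username, runtimeClassName, namespace string) bool {
	return contains(e.Usernames, username) ||
		contains(e.RuntimeClassNames, runtimeClassName) ||
		contains(e.Namespaces, namespace)
}

func main() {
	e := Exemptions{Namespaces: []string{"kube-system"}}
	fmt.Println(e.isExempt("alice", "", "kube-system")) // namespace is exempt
	fmt.Println(e.isExempt("alice", "", "default"))     // nothing matches
}
```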

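3.3 Pod Security Standards in Practice

Environment: Kubernetes v1.23

Containers face many attack risks at run time, such as container escape and resource-exhaustion attacks launched from inside a container.

3.3.1 Baseline Policy

The Baseline policy targets common containerized applications and prevents known privilege escalations. According to the official documentation, it is aimed at application operators and developers of non-critical applications. The policy includes the following:

Host namespaces must not be shared, privileged containers are forbidden, Linux capabilities are restricted, hostPath volumes are forbidden, host ports are restricted, and AppArmor, SELinux, seccomp, and sysctl settings are constrained.

The following demonstrates the Baseline policy.

Risks of violating the Baseline policy:

A privileged container can see host devices.
After mounting procfs, a container can see host processes, breaking process isolation.
Network isolation can be broken.
After mounting the runtime socket, a container can talk to the container runtime without restriction.

Any of the risks above can lead to container escape.

Create a namespace named my-baseline-namespace, with both the enforce and warn modes set to the Baseline level of the Pod Security Standards: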

apiVersion: v1
kind: Namespace
metadata:
  name: my-baseline-namespace
  labels:
    pod-security.kubernetes.io/enforce: baseline  
    pod-security.kubernetes.io/enforce-version: v1.23
    pod-security.kubernetes.io/warn: baseline
    pod-security.kubernetes.io/warn-version: v1.23

Create Pods

Create a Pod that violates the Baseline policy:

apiVersion: v1
kind: Pod
metadata:
  name: hostnamespaces2
  namespace: my-baseline-namespace
spec:
  containers:
  - image: bitnami/prometheus:2.33.5
    name: prometheus
    securityContext:
      allowPrivilegeEscalation: true
      privileged: true
      capabilities:
        drop:
        - ALL
  hostPID: true
  securityContext:
    runAsNonRoot: true
    seccompProfile:
      type: RuntimeDefault

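Running apply reports that hostPID=true and securityContext.privileged=true are not allowed, and the Pod is rejected. A privileged container running with hostPID enabled is not isolated from host processes, which makes container escape easy: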

[root@localhost podSecurityStandard]# kubectl apply -f fail-hostnamespaces2.yaml
Error from server (Forbidden): error when creating "fail-hostnamespaces2.yaml": pods "hostnamespaces2" is forbidden: violates PodSecurity "baseline:v1.23": host namespaces (hostPID=true), privileged (container "prometheus" must not set securityContext.privileged=true)

Create a Pod that does not violate the Baseline policy by setting hostPID=false and securityContext.privileged=false:

apiVersion: v1
kind: Pod
metadata:
  name: hostnamespaces2
  namespace: my-baseline-namespace
spec:
  containers:
  - image: bitnami/prometheus:2.33.5
    name: prometheus
    securityContext:
      allowPrivilegeEscalation: false
      privileged: false
      capabilities:
        drop:
        - ALL
  hostPID: false
  securityContext:
    runAsNonRoot: true
    seccompProfile:
      type: RuntimeDefault

Run apply; the Pod is admitted:

[root@localhost podSecurityStandard]# kubectl apply -f pass-hostnamespaces2.yaml
pod/hostnamespaces2 created

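3.3.2 Restricted Policy

The Restricted policy enforces current Pod hardening best practices. According to the official documentation, it is aimed at operators and developers of security-critical applications, as well as lower-trust users. It includes everything in the Baseline policy and additionally: limits non-core volume types to those defined through PersistentVolumes, forbids privilege escalation (via SetUID or SetGID file mode), requires containers to run as a non-root user, forbids containers from setting runAsUser to 0, and requires containers to drop ALL capabilities, allowing only NET_BIND_SERVICE to be added back.

The Restricted policy further limits obtaining root privileges and Linux kernel capabilities inside a container. For example, a man-in-the-middle attack on the Kubernetes network needs the Linux CAP_NET_RAW capability in order to send ARP packets.

Create a namespace named my-restricted-namespace, with both the enforce and warn modes set to the Restricted level of the Pod Security Standards: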

apiVersion: v1
kind: Namespace
metadata:
  name: my-restricted-namespace
  labels:
    pod-security.kubernetes.io/enforce: restricted
    pod-security.kubernetes.io/enforce-version: v1.23
    pod-security.kubernetes.io/warn: restricted
    pod-security.kubernetes.io/warn-version: v1.23

Create Pods

Create a Pod that violates the Restricted policy:

apiVersion: v1
kind: Pod
metadata:
  name: runasnonroot0
  namespace: my-restricted-namespace
spec:
  containers:
  - image: bitnami/prometheus:2.33.5
    name: prometheus
    securityContext:
      allowPrivilegeEscalation: false
  securityContext:
    seccompProfile:
      type: RuntimeDefault

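Running apply reports that securityContext.runAsNonRoot=true and securityContext.capabilities.drop=["ALL"] must be set, and the Pod is rejected. A container running as root has excessive privileges, and combined with undropped Linux capabilities this exposes the cluster to the risk of a man-in-the-middle attack on the Kubernetes network: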

[root@localhost podSecurityStandard]# kubectl apply -f fail-runasnonroot0.yaml
Error from server (Forbidden): error when creating "fail-runasnonroot0.yaml": pods "runasnonroot0" is forbidden: violates PodSecurity "restricted:v1.23": unrestricted capabilities (container "prometheus" must set securityContext.capabilities.drop=["ALL"]), runAsNonRoot != true (pod or container "prometheus" must set securityContext.runAsNonRoot=true)

Create a Pod that does not violate the Restricted policy by setting securityContext.runAsNonRoot=true and dropping all Linux capabilities:

apiVersion: v1
kind: Pod
metadata:
  name: runasnonroot0
  namespace: my-restricted-namespace
spec:
  containers:
  - image: bitnami/prometheus:2.33.5
    name: prometheus
    securityContext:
      allowPrivilegeEscalation: false
      capabilities:
        drop:
        - ALL
  securityContext:
    runAsNonRoot: true
    seccompProfile:
      type: RuntimeDefault

Run apply; the Pod is admitted:

[root@localhost podSecurityStandard]# kubectl apply -f pass-runasnonroot0.yaml
pod/runasnonroot0 created

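3.4 Current Limitations of Pod Security Admission

If your cluster already uses PodSecurityPolicy, migrating to Pod Security Admission takes a certain amount of work.

First consider whether Pod Security Admission fits your cluster at all. It currently aims to meet the most common security needs out of the box, and differs from PSP in the following ways:

Pod Security Admission only checks Pods against the security standards. It cannot mutate Pods and therefore cannot apply default security settings to them.
Pod Security Admission supports only the three officially defined levels and does not allow custom policies. PSP rules therefore cannot always be migrated one-to-one, and each security rule has to be evaluated individually.
Unlike PSP, Pod Security Admission cannot be bound to specific users; it only supports exempting specific users, RuntimeClasses, or namespaces.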

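4. Pod Security Admission Source Code Analysis

A Kubernetes admission controller is a plugin decoupled from the API server logic at the code level. When an object is created, updated, or deleted, the controller can intercept the request and run specific logic before the object is persisted to etcd. The typical path of a request through the API server is as follows: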

[Figure: API request processing flow]

4.1 Main Logic Flowchart

[Figure: Pod Security Admission code flow]

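The main logic of Pod Security Admission is shown in the figure. The admission controller first parses the intercepted request and then handles it according to the resource type:

Namespace: If the resource is a Namespace, the admission controller first parses the namespace labels to obtain the configured level, mode, and pinned version of the Pod Security Standards. If the labels carry no Pod Security Standards information, the request is allowed. Otherwise the controller distinguishes between creating a new namespace and updating an existing one: on create it checks that the configuration is valid, and on update it evaluates whether the Pods in the namespace comply with the newly configured policy.
Pod: If the resource is a Pod, the admission controller first looks up the Pod Security Standards configured on the Pod's namespace. If the namespace has none, the request is allowed; otherwise the Pod is evaluated against the policy.
Others: The admission controller first looks up the Pod Security Standards configured on the resource's namespace. If the namespace has none, the request is allowed. Otherwise it checks whether the resource is one that contains a PodSpec, such as PodTemplate, ReplicationController, ReplicaSet, Deployment, DaemonSet, StatefulSet, Job, or CronJob, extracts the PodSpec, and evaluates it against the policy.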

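4.2 Initializing Pod Security Admission

Like most Go programs, Pod Security Admission uses github.com/spf13/cobra to build its startup command, which calls runServer on startup to initialize and start the webhook server. The Options argument carries default settings such as DefaultClientQPSLimit, DefaultClientQPSBurst, DefaultPort, and DefaultInsecurePort.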

// NewServerCommand creates a *cobra.Command object with default parameters.
func NewServerCommand() *cobra.Command {
    opts := options.NewOptions()

    cmdName := "podsecurity-webhook"
    if executable, err := os.Executable(); err == nil {
        cmdName = filepath.Base(executable)
    }
    cmd := &cobra.Command{
        Use: cmdName,
        Long: `The PodSecurity webhook is a standalone webhook server implementing the Pod
Security Standards.`,
        RunE: func(cmd *cobra.Command, _ []string) error {
            verflag.PrintAndExitIfRequested()
            // Initialize and start the webhook server.
            return runServer(cmd.Context(), opts)
        },
        Args: cobra.NoArgs,
    }
    opts.AddFlags(cmd.Flags())
    verflag.AddFlags(cmd.Flags())

    return cmd
}

The runServer function loads the admission controller configuration, initializes the server, and finally starts it.

func runServer(ctx context.Context, opts *options.Options) error {
    // Load the configuration.
    config, err := LoadConfig(opts)
    if err != nil {
        return err
    }
    // Initialize the server from the configuration.
    server, err := Setup(config)
    if err != nil {
        return err
    }
    
    ctx, cancel := context.WithCancel(ctx)
    defer cancel()
    go func() {
        stopCh := apiserver.SetupSignalHandler()
        <-stopCh
        cancel()
    }()
    // Start the server.
    return server.Start(ctx)
}

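The main parts of the Setup function are excerpted below. Setup creates an Admission object with the following members:

PodSecurityConfig: the admission controller configuration, including the default level, mode, and pinned Kubernetes version of the Pod Security Standards, as well as the exempted Usernames, RuntimeClasses, and Namespaces.
Evaluator: the evaluator, which defines the concrete methods for checking the security standard policies.
Metrics: collects Prometheus metrics.
PodSpecExtractor: extracts the PodSpec from the request object.
PodLister: lists the Pods in a given namespace.
NamespaceGetter: gets the namespace of the resource in the intercepted request.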

// Setup creates an Admission object to handle the admission logic.
func Setup(c *Config) (*Server, error) {
    ...
    s.delegate = &admission.Admission{
        Configuration:    c.PodSecurityConfig,
        Evaluator:        evaluator,
        Metrics:          metrics,
        PodSpecExtractor: admission.DefaultPodSpecExtractor{},
        PodLister:        admission.PodListerFromClient(client),
        NamespaceGetter:  admission.NamespaceGetterFromListerAndClient(namespaceLister, client),
    }
   ...
    return s, nil
}

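Once the admission controller server has started, it registers the HandleValidate method to handle admission validation. This method calls Validate to perform the actual Pod Security Standards checks.

// HandleValidate handles requests intercepted by the webhook.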
func (s *Server) HandleValidate(w http.ResponseWriter, r *http.Request) {
    defer utilruntime.HandleCrash(func(_ interface{}) {
        // Assume the crash happened before the response was written.
        http.Error(w, "internal server error", http.StatusInternalServerError)
    })
     ...
    // Run the actual validation.
    response := s.delegate.Validate(ctx, attributes)
    response.UID = review.Request.UID // Response UID must match request UID
    review.Response = response
    writeResponse(w, review)
}

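4.3 Admission Validation Logic

Validate calls a different validation method depending on the resource type in the request. All three paths below eventually call EvaluatePod to evaluate the Pod against the security standard policy.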

// Validate admits an API request.
// The objects in admission attributes are expected to be external v1 objects that we care about.
// The returned response may be shared and must not be mutated.
func (a *Admission) Validate(ctx context.Context, attrs api.Attributes) *admissionv1.AdmissionResponse {
    var response *admissionv1.AdmissionResponse
    switch attrs.GetResource().GroupResource() {
    case namespacesResource:
        response = a.ValidateNamespace(ctx, attrs)
    case podsResource:
        response = a.ValidatePod(ctx, attrs)
    default:
        response = a.ValidatePodController(ctx, attrs)
    }
    return response
}

EvaluatePod uses the level and version configured on the namespace to select the set of checks to run against the Pod.

func (r *checkRegistry) EvaluatePod(lv api.LevelVersion, podMetadata *metav1.ObjectMeta, podSpec *corev1.PodSpec) []CheckResult {
    // A Privileged level (the permissive policy) needs no checks; return immediately.
    if lv.Level == api.LevelPrivileged {
        return nil
    }
    // If the highest registered check version is older than the version set on the namespace, fall back to the highest registered version.
    if r.maxVersion.Older(lv.Version) {
        lv.Version = r.maxVersion
    }

    var checks []CheckPodFn
    // The configured level is Baseline.
    if lv.Level == api.LevelBaseline {
        checks = r.baselineChecks[lv.Version]
    } else {
        // includes non-overridden baseline checks
        // All other levels run the Restricted checks.
        checks = r.restrictedChecks[lv.Version]
    }

    var results []CheckResult
    // Run every check and collect the results.
    for _, check := range checks {
        results = append(results, check(podMetadata, podSpec))
    }
    return results
}
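
To connect the per-check results with the error text seen in the earlier kubectl output, here is a minimal Go sketch of how failed checks could be joined into a single message. The local `CheckResult` type only copies the three fields used in this article, and `forbiddenMessage` is a hypothetical helper, not the upstream aggregation code. The "reason (detail)" format is inferred from the kubectl output shown in section 3.3.

```go
package main

import (
	"fmt"
	"strings"
)

// CheckResult is a local, simplified copy of the result fields used in this article.
type CheckResult struct {
	Allowed         bool
	ForbiddenReason string
	ForbiddenDetail string
}

// forbiddenMessage joins every failed check as "reason (detail)".
// It returns an empty string when all checks passed.
func forbiddenMessage(results []CheckResult) string {
	var parts []string
	for _, r := range results {
		if r.Allowed {
			continue
		}
		if r.ForbiddenDetail == "" {
			parts = append(parts, r.ForbiddenReason)
			continue
		}
		parts = append(parts, fmt.Sprintf("%s (%s)", r.ForbiddenReason, r.ForbiddenDetail))
	}
	return strings.Join(parts, ", ")
}

func main() {
	results := []CheckResult{
		{Allowed: false, ForbiddenReason: "host namespaces", ForbiddenDetail: "hostPID=true"},
		{Allowed: true},
		{Allowed: false, ForbiddenReason: "privileged",
			ForbiddenDetail: `container "prometheus" must not set securityContext.privileged=true`},
	}
	fmt.Println(forbiddenMessage(results))
}
```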

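The following concrete check shows how a Pod security standard is verified. It checks whether every container in the Pod has disabled allowPrivilegeEscalation. AllowPrivilegeEscalation controls whether child processes inside the container can escalate their privileges; it is usually set together with running as a non-root user (MustRunAsNonRoot).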

func allowPrivilegeEscalation_1_8(podMetadata *metav1.ObjectMeta, podSpec *corev1.PodSpec) CheckResult {
    var badContainers []string
    visitContainers(podSpec, func(container *corev1.Container) {
        // Flag the container if its security context is unset, AllowPrivilegeEscalation is unset, or AllowPrivilegeEscalation is true.
        if container.SecurityContext == nil || container.SecurityContext.AllowPrivilegeEscalation == nil || *container.SecurityContext.AllowPrivilegeEscalation {
            badContainers = append(badContainers, container.Name)
        }
    })

    if len(badContainers) > 0 {
        // At least one container violates the policy; return the details.
        return CheckResult{
            Allowed:         false,
            ForbiddenReason: "allowPrivilegeEscalation != false",
            ForbiddenDetail: fmt.Sprintf(
                "%s %s must set securityContext.allowPrivilegeEscalation=false",
                pluralize("container", "containers", len(badContainers)),
                joinQuote(badContainers),
            ),
        }
    }
    return CheckResult{Allowed: true}
}
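
Other checks follow the same pattern of visiting each container and collecting the offenders. As a rough sketch, the Restricted capabilities rule described in section 3.3.2 (drop ALL, add back at most NET_BIND_SERVICE) might look like the code below. The `Container` and `Capabilities` types are simplified stand-ins rather than the corev1 types, and `capabilitiesRestricted` is a hypothetical name, so this is not the upstream implementation.

```go
package main

import "fmt"

// Capabilities and Container are simplified stand-ins for the corev1 types.
type Capabilities struct {
	Add  []string
	Drop []string
}

type Container struct {
	Name         string
	Capabilities *Capabilities
}

// dropsAll reports whether the drop list contains ALL.
func dropsAll(drop []string) bool {
	for _, d := range drop {
		if d == "ALL" {
			return true
		}
	}
	return false
}

// addsForbidden reports whether the add list contains anything other than NET_BIND_SERVICE.
func addsForbidden(add []string) bool {
	for _, a := range add {
		if a != "NET_BIND_SERVICE" {
			return true
		}
	}
	return false
}

// capabilitiesRestricted returns the names of containers that violate the rule:
// drop must include ALL, and add may contain only NET_BIND_SERVICE.
func capabilitiesRestricted(containers []Container) []string {
	var bad []string
	for _, c := range containers {
		if c.Capabilities == nil || !dropsAll(c.Capabilities.Drop) || addsForbidden(c.Capabilities.Add) {
			bad = append(bad, c.Name)
		}
	}
	return bad
}

func main() {
	containers := []Container{
		{Name: "ok", Capabilities: &Capabilities{Drop: []string{"ALL"}, Add: []string{"NET_BIND_SERVICE"}}},
		{Name: "no-caps-set"},
		{Name: "adds-net-raw", Capabilities: &Capabilities{Drop: []string{"ALL"}, Add: []string{"NET_RAW"}}},
	}
	fmt.Println(capabilitiesRestricted(containers))
}
```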

Summary

Pod Security Admission was promoted to beta in Kubernetes v1.23. Although its feature set is still modest, the feature has a promising future.

Introducing Pod Security Admission, the New K8S Pod Security Mechanism | K8S Internals Series, Part 1