一、User Namespace

user namespace和权限息息相关，事关容器的安全。
user namespace可以嵌套，内核控制最多32层。除了系统默认的user namespace外，所有的user namespace都有一个父user namespace。
当在一个进程中调用unshare或clone创建新的user namespace时，当前进程原来所在的user namespace为父，新的为子
在不同的user namespace中，同一个用户的user id 和 group id 可以不一样。
一个用户可以在父user namespace中是普通用户，而在子中是超级用户。
从Linux 3.8 开始，创建新的user namespace不再需要root权限。

二、演示

xundh@xundh-To-be-filled-by-O-E-M:~$ id
uid=1000(xundh) gid=1000(xundh) 组=1000(xundh),4(adm),24(cdrom),27(sudo),30(dip),46(plugdev),116(lpadmin),126(sambashare)
xundh@xundh-To-be-filled-by-O-E-M:~$ readlink /proc/$$/ns/user
user:[4026531837]
xundh@xundh-To-be-filled-by-O-E-M:~$ unshare --user /bin/bash
nobody@xundh-To-be-filled-by-O-E-M:~$ readlink /proc/$$/ns/user
user:[4026532323]
nobody@xundh-To-be-filled-by-O-E-M:~$ id
uid=65534(nobody) gid=65534(nogroup) 组=65534(nogroup)
nobody@xundh-To-be-filled-by-O-E-M:~$

这里没有映射父user namespace的 User ID和 group ID到子user namespace中来，当在新的user namespace中用getuid()和getgid()获取user id和 group id时，系统返回文件 /proc/sys/kernel/overflowuid中定义的user ID及 proc/sys/kernel/overflowgid中定义的group ID，它们的默认值都是65534.

下面使用这个user

nobody@xundh-To-be-filled-by-O-E-M:~$ ls -l / | grep root
drwx------   4 nobody nogroup       4096 3月  19 11:03 root
nobody@xundh-To-be-filled-by-O-E-M:~$ ls /root
ls: 无法打开目录'/root': 权限不够
nobody@xundh-To-be-filled-by-O-E-M:~$

/root虽然属于 nobody，但当前的nobody并没有权限查看，说明两个nobody不是同一个ID，没有映射关系。

映射user ID和group ID

通常情况下，创建新的user namespace后，第一件事就是映射user 和 group ID。映射的方法是添加配置到/proc/PID/uid_map 和 /proc/PID/gid_map （这里的PID是新user namespace中的进程ID，初始为空）。
文件配置格式：

ID-inside-ns ID-outside-ns length

第一个字段ID-inside-ns表示在容器显示的UID或GID，
第二个字段ID-outside-ns表示容器外映射的真实的UID或GID。
第三个字段表示映射的范围，一般填1，表示一一对应。

如 0 1000 256 ，表示父 user namspace中的1000~1256映射到新user namespace中的 0 ~256。
系统默认的user namespace没有父user namespace，kernel提供提供了一个虚拟的uid和gid map。

xundh@xundh-To-be-filled-by-O-E-M:~$ cat /proc/$$/uid_map
         0          0 4294967295
xundh@xundh-To-be-filled-by-O-E-M:~$

写这两个文件的进程需要这个namespace中的CAP_SETUID (CAP_SETGID)权限（可参看Capabilities）
写入的进程必须是此user namespace的父或子的user namespace进程。
另外需要满如下条件之一：1）父进程将effective uid/gid映射到子进程的user namespace中，2）父进程如果有CAP_SETUID/CAP_SETGID权限，那么它将可以映射到父进程中的任一uid/gid。

映射操作：

#--------------------------第一个shell窗口----------------------
#获取当前bash的pid
nobody@ubuntu:~$ echo $$
24126

#--------------------------第二个shell窗口----------------------
#dev是map文件的owner
dev@ubuntu:~$ ls -l /proc/24126/uid_map /proc/24126/gid_map
-rw-r--r-- 1 dev dev 0 7月  24 23:11 /proc/24126/gid_map
-rw-r--r-- 1 dev dev 0 7月  24 23:11 /proc/24126/uid_map

#但还是没有权限写这个文件
dev@ubuntu:~$ echo '0 1000 100' > /proc/24126/uid_map
bash: echo: write error: Operation not permitted
dev@ubuntu:~$ echo '0 1000 100' > /proc/24126/gid_map
bash: echo: write error: Operation not permitted
#当前用户运行的bash进程没有CAP_SETUID和CAP_SETGID的权限
dev@ubuntu:~$ cat /proc/$$/status | egrep 'Cap(Inh|Prm|Eff)'
CapInh: 0000000000000000
CapPrm: 0000000000000000
CapEff: 0000000000000000

#为/binb/bash设置capability，
dev@ubuntu:~$ sudo setcap cap_setgid,cap_setuid+ep /bin/bash
#重新加载bash以后我们看到相应的capability已经有了
dev@ubuntu:~$ exec bash
dev@ubuntu:~$ cat /proc/$$/status | egrep 'Cap(Inh|Prm|Eff)'
CapInh: 0000000000000000
CapPrm: 00000000000000c0
CapEff: 00000000000000c0


#再试一次写map文件，成功了
dev@ubuntu:~$ echo '0 1000 100' > /proc/24126/uid_map
dev@ubuntu:~$ echo '0 1000 100' > /proc/24126/gid_map
dev@ubuntu:~$
#再写一次就失败了，因为这个文件只能写一次
dev@ubuntu:~$ echo '0 1000 100' > /proc/24126/uid_map
bash: echo: write error: Operation not permitted
dev@ubuntu:~$ echo '0 1000 100' > /proc/24126/gid_map
bash: echo: write error: Operation not permitted

#后续测试不需要CAP_SETUID了，将/bin/bash的capability恢复到原来的设置
dev@ubuntu:~$ sudo setcap cap_setgid,cap_setuid-ep /bin/bash
dev@ubuntu:~$ getcap /bin/bash
/bin/bash =

#--------------------------第一个shell窗口----------------------
#回到第一个窗口，id已经变成0了，说明映射成功
nobody@ubuntu:~$ id
uid=0(root) gid=0(root) groups=0(root),65534(nogroup)

#--------------------------第二个shell窗口----------------------
#回到第二个窗口，确认map文件的owner，这里24126是新user namespace中的bash
dev@ubuntu:~$ ls -l /proc/24126/
......
-rw-r--r-- 1 dev dev 0 7月  24 23:13 gid_map
dr-x--x--x 2 dev dev 0 7月  24 23:10 ns
-rw-r--r-- 1 dev dev 0 7月  24 23:13 uid_map
......

#--------------------------第一个shell窗口----------------------
#重新加载bash，提示有root权限了
nobody@ubuntu:~$ exec bash
root@ubuntu:~#

#0000003fffffffff表示当前运行的bash拥有所有的capability
root@ubuntu:~# cat /proc/$$/status | egrep 'Cap(Inh|Prm|Eff)'
CapInh: 0000000000000000
CapPrm: 0000003fffffffff
CapEff: 0000003fffffffff
#--------------------------第二个shell窗口----------------------
#回到第二个窗口，发现owner已经变了，变成了root
#目前还不清楚为什么有这样的机制
dev@ubuntu:~$ ls -l /proc/24126/
......
-rw-r--r-- 1 root root 0 7月  24 23:13 gid_map
dr-x--x--x 2 root root 0 7月  24 23:10 ns
-rw-r--r-- 1 root root 0 7月  24 23:13 uid_map
......
#虽然不能看目录里有哪些文件，但是可以读里面文件的内容
dev@ubuntu:~$ ls -l /proc/24126/ns
ls: cannot open directory '/proc/24126/ns': Permission denied
dev@ubuntu:~$ readlink /proc/24126/ns/user
user:[4026532464]


#--------------------------第一个shell窗口----------------------
#和第二个窗口一样的结果
root@ubuntu:~# ls -l /proc/24126/ns
ls: cannot open directory '/proc/24126/ns': Permission denied
root@ubuntu:~# readlink /proc/24126/ns/user
user:[4026532464]

#仍然不能访问/root目录，因为他的拥有着是nobody
root@ubuntu:~# ls -l /|grep root
drwx------   3 nobody nogroup  4096 7月   8 18:39 root
root@ubuntu:~# ls /root
ls: cannot open directory '/root': Permission denied

#对于原来/home/dev下的内容，显示的owner已经映射过来了，由dev变成了新namespace中的root，
#当前root用户可以访问他里面的内容
root@ubuntu:~# ls -l /home
drwxr-xr-x 8 root root 4096 7月  21 18:35 dev
root@ubuntu:~# touch /home/dev/temp01
root@ubuntu:~#

#试试设置主机名称
root@ubuntu:~# hostname container001
hostname: you must be root to change the host name
#修改失败，说明这个新user namespace中的root账号在父user namespace里面不好使
#这也正是user namespace所期望达到的效果，当访问其他user namespace里的资源时，
#是以其他user namespace中的相应账号的权限来执行的，
#比如这里root对应父user namespace的账号是dev，所以改不了系统的hostname

对于map文件来说，在父user namespace和子user namespac中打开子user namespace中进程的这个文件看到的都是同样的内容，但如果是在其他的user namespace中打开这个map文件，‘ID-outside-ns’表示的就是映射到当前user namespace的ID:

#--------------------------打开一个新窗口----------------------
#创建一个新的user namespace，并取名container001
dev@ubuntu:~$ unshare --user --uts -r /bin/bash
root@ubuntu:~# hostname container001
root@ubuntu:~# exec bash
#记下bash的pid
root@container001:~# echo $$
27898

#在container001里面创建新的namespace container002
root@container001:~# unshare --user --uts -r /bin/bash
root@container001:~# hostname container002
root@container001:~# exec bash
#记下bash的pid
root@container002:~# echo $$
28066

#查看自己namespace中进程的uid map文件
#这里表示父user namespace的0映射到了当前namespace的0
root@container002:~# cat /proc/28066/uid_map
         0          0          1

#--------------------------再打开一个新窗口----------------------
#在系统默认namespace中查看同样这个文件，发现和上面的显示的不一样
#因为默认namespace是container002的爷爷，所以他们两个里面看到的东西有可能不一样
#这里表示当前user namespace的账号1000映射到了进程28066所在user namespace的账号0
#当然如果上面是用root账号创建的container001，这里显示的内容就和上面一样了
dev@ubuntu:~$ cat /proc/28066/uid_map
         0       1000          1

#我们再进入到container001，在里面看看这个文件，发现和在ontainer002看到的结果一样
#说明对于进程28066来说，在他自己所在的user namespace和他的父user namespace看到的map文件内容是一样的
dev@ubuntu:~$ nsenter --user --uts -t 27898 --preserve-credentials bash
root@container001:~# cat /proc/28066/uid_map
         0          0          1
#默认情况下，nsenter会调用setgroups函数去掉root group的权限，
#这里--preserve-credentials是为了让nsenter不调用setgroups函数，因为调用这个函数需要root权限

#测试完成后可以关闭这两个窗口，后面不会再用到了

user namespace的 owner

当一个用户创建一个新的user namespace的时候，这个用户就是这个新user namespace的owner，在父user namespace的这个用户就会拥有新user namespace及其所有子孙user namespace的所有capabilities.

#--------------------------第四个shell窗口----------------------
#新建用户test用于测试
dev@ubuntu:~$ sudo useradd test
dev@ubuntu:~$ sudo passwd test
Enter new UNIX password:
Retype new UNIX password:
passwd: password updated successfully

#切换到test账户并创建新的user namespace
#为了便于区分，同时创建新的uts namespace
dev@ubuntu:~$ su test
Password:
test@ubuntu:/home/dev$ unshare --user --uts -r /bin/bash

#设置一个容易区分的hostname
root@ubuntu:/home/dev# hostname container001
root@ubuntu:/home/dev# exec bash
root@container001:/home/dev# readlink /proc/$$/ns/user
user:[4026532463]
root@container001:/home/dev# echo $$
24419

#--------------------------第五个shell窗口----------------------
#使用dev账号新建一个user namespace
dev@ubuntu:~$ unshare --user --uts -r /bin/bash
root@ubuntu:~# hostname container002
root@ubuntu:~# exec bash
root@container002:~# readlink /proc/$$/ns/user
user:[4026532464]
root@container002:~# echo $$
24435

#--------------------------第六个shell窗口----------------------
#用dev账号往container002中加入新的进程/bin/bash成功，因为dev是container002的owner
dev@ubuntu:~$ nsenter --user -t 24435 --preserve-credentials --uts /bin/bash
root@container002:~# id
uid=0(root) gid=0(root) groups=0(root),65534(nogroup)
root@container002:~# readlink /proc/$$/ns/user
user:[4026532464]

#回到默认user namespace
root@container002:~# exit
exit
dev@ubuntu:~$

#因为container001的owner是test，用dev账号往container001中加入新的进程/bin/bash失败
dev@ubuntu:~$ nsenter --user -t 24419 --preserve-credentials --uts /bin/bash
nsenter: cannot open /proc/24419/ns/user: Permission denied

#用root账号往container001中加入新的进程/bin/bash成功
dev@ubuntu:~$ sudo nsenter --user -t 24419 --preserve-credentials --uts /bin/bash
nobody@container001:~$ readlink /proc/$$/ns/user
user:[4026532463]
#由于root账号没有映射到container001中，所以这里在container001中看到的账号是nobody
nobody@container001:~$ id
uid=65534(nobody) gid=65534(nogroup) groups=65534(nogroup)

#退出container001，便于后续测试
nobody@container001:~$ exit
dev@ubuntu:~$
#--------------------------第五个shell窗口----------------------
#回到第5个窗口，继续创建一个新的user namespace
root@container002:~# unshare --user --uts -r /bin/bash
root@container002:~# hostname container003
root@container002:~# exec bash
root@container003:~# readlink /proc/$$/ns/user
user:[4026532471]
root@container003:~# echo $$
24533

#--------------------------第六个shell窗口----------------------
#回到第6个窗口，用dev账号往container003（孙子user namespace）中加入新的bash进程，成功，
#说明dev拥有孙子user namespace的capabilities
dev@ubuntu:~$ nsenter --user -t 24533 --preserve-credentials --uts /bin/bash
root@container003:~# readlink /proc/$$/ns/user
user:[4026532471]

三、clone实现

#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <sys/mount.h>
#include <sys/capability.h>
#include <stdio.h>
#include <sched.h>
#include <signal.h>
#include <unistd.h>

#define STACK_SIZE (1024 * 1024)

static char container_stack[STACK_SIZE];
char* const container_args[] = {
    "/bin/bash",
    NULL
};

int pipefd[2];

void set_map(char* file, int inside_id, int outside_id, int len) {
    FILE* mapfd = fopen(file, "w");
    if (NULL == mapfd) {
        perror("open file error");
        return;
    }
    fprintf(mapfd, "%d %d %d", inside_id, outside_id, len);
    fclose(mapfd);
}

void set_uid_map(pid_t pid, int inside_id, int outside_id, int len) {
    char file[256];
    sprintf(file, "/proc/%d/uid_map", pid);
    set_map(file, inside_id, outside_id, len);
}

void set_gid_map(pid_t pid, int inside_id, int outside_id, int len) {
    char file[256];
    sprintf(file, "/proc/%d/gid_map", pid);
    set_map(file, inside_id, outside_id, len);
}

int container_main(void* arg)
{

    printf("Container [%5d] - inside the container!\n", getpid());

    printf("Container: eUID = %ld;  eGID = %ld, UID=%ld, GID=%ld\n",
            (long) geteuid(), (long) getegid(), (long) getuid(), (long) getgid());

    /* 等待父进程通知后再往下执行（进程间的同步） */
    char ch;
    close(pipefd[1]);
    read(pipefd[0], &ch, 1);

    printf("Container [%5d] - setup hostname!\n", getpid());
    //set hostname
    sethostname("container",10);

    //remount "/proc" to make sure the "top" and "ps" show container's information
    mount("proc", "/proc", "proc", 0, NULL);

    execv(container_args[0], container_args);
    printf("Something's wrong!\n");
    return 1;
}

int main()
{
    const int gid=getgid(), uid=getuid();

    printf("Parent: eUID = %ld;  eGID = %ld, UID=%ld, GID=%ld\n",
            (long) geteuid(), (long) getegid(), (long) getuid(), (long) getgid());

    pipe(pipefd);
 
    printf("Parent [%5d] - start a container!\n", getpid());

    int container_pid = clone(container_main, container_stack+STACK_SIZE, 
            CLONE_NEWUTS | CLONE_NEWPID | CLONE_NEWNS | CLONE_NEWUSER | SIGCHLD, NULL);

    
    printf("Parent [%5d] - Container [%5d]!\n", getpid(), container_pid);

    //To map the uid/gid, 
    //   we need edit the /proc/PID/uid_map (or /proc/PID/gid_map) in parent
    //The file format is
    //   ID-inside-ns   ID-outside-ns   length
    //if no mapping, 
    //   the uid will be taken from /proc/sys/kernel/overflowuid
    //   the gid will be taken from /proc/sys/kernel/overflowgid
    set_uid_map(container_pid, 0, uid, 1);
    set_gid_map(container_pid, 0, gid, 1);

    printf("Parent [%5d] - user/group mapping done!\n", getpid());

    /* 通知子进程 */
    close(pipefd[1]);

    waitpid(container_pid, NULL, 0);
    printf("Parent - container stopped!\n");
    return 0;
}

上面编译要安装yum install libcap-devel

上面使用管道来对父子进程进行同步，因为子进程中有一个execv的系统调用，这个系统调用会把当前子进程的进程空间给覆盖掉，我们希望在execv之前就做好user namespace的 uid/gid 的映射，这样，execv运行的 /bin/bash 就会因为我们设置了uid为0 的inside-uid而变成#号的提示符。

参考：

https://segmentfault.com/a/1190000006913195
https://coolshell.cn/articles/17029.html

Docker 学习笔记12 容器技术原理 User Namespace

User Namespace

一、User Namespace

二、演示

映射user ID和group ID

user namespace的 owner

三、clone实现

猜你喜欢