Linux Containers the Hard Way [Hackaday]
If you want to make containers under Linux, plenty of high-level options exist. [Lucavallin] wanted to learn more about how containers really work, so he decided to tackle the problem using the low-level kernel functions, and he shared the code with us on GitHub.
Containers are more isolated than processes but not quite full virtual machines. While a virtual machine creates a fake computer, a container is more like a fake operating system. Applications can run with their own idea of libraries, devices, and other resources, but the container doesn’t try to abstract the underlying hardware.
[Lucavallin] tells us that the key features include namespaces, which allow different kernel resources to be grouped into related sets and control access to the different features. The seccomp facility controls what system calls a process may make, while the capabilities system controls what root can do in the container. Finally, the cgroups system allows you to limit resources so one container gets a fair share of things like CPU time or disk I/O.
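To get a feel for what a namespace does, here is a minimal Python sketch (not [Lucavallin]'s code, which talks to the kernel from C). It forks a child, asks the kernel for new user and UTS namespaces via unshare(), and changes the hostname only inside the child. The function name hostname_in_new_namespace is ours, and the sketch assumes unprivileged user namespaces are allowed; if the kernel or a sandbox refuses, it simply returns None.

```python
import ctypes
import os
import socket

# Flag values from <linux/sched.h>.
CLONE_NEWUSER = 0x10000000  # new user namespace (grants capabilities inside it)
CLONE_NEWUTS = 0x04000000   # new UTS namespace (hostname and domain name)


def hostname_in_new_namespace(name):
    """Set a hostname inside fresh user + UTS namespaces in a child process.

    Returns the hostname the child saw, or None if the request was refused.
    (Hypothetical helper for illustration only.)
    """
    read_fd, write_fd = os.pipe()
    pid = os.fork()
    if pid == 0:  # child
        try:
            os.close(read_fd)
            libc = ctypes.CDLL(None, use_errno=True)
            if libc.unshare(CLONE_NEWUSER | CLONE_NEWUTS) == 0:
                socket.sethostname(name)  # only affects the new UTS namespace
                os.write(write_fd, socket.gethostname().encode())
        finally:
            os._exit(0)  # never fall back into the parent's code
    # parent: read whatever the child reported, then reap it
    os.close(write_fd)
    with os.fdopen(read_fd, "rb") as pipe:
        seen = pipe.read()
    os.waitpid(pid, 0)
    return seen.decode() or None


if __name__ == "__main__":
    print("host before:", socket.gethostname())
    print("inside namespace:", hostname_in_new_namespace("container-demo"))
    print("host after:", socket.gethostname())
```

The hostname printed before and after should match, since the change lives only in the child's namespace. A real container runtime layers mount, PID, and network namespaces on top of this, along with seccomp filters, capability drops, and cgroup limits.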
These features are available in kernels starting with version 6.0.x, so you’ll need that. In addition, namespaces and cgroups v2 have to be enabled. If you aren’t sure, skim your /boot/config-* file (use the one that matches what uname -a tells you). For the user namespace, for example, you should find CONFIG_USER_NS set to y. You can also look at /proc/self/ns and see if it has the namespace object you are looking for. If you want to be sure cgroups v2 is enabled, try “grep cgroup /proc/filesystems” and you should see a “cgroup2” entry.
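If you would rather script those checks, something like the following Python sketch would do. The helper names (has_cgroup2, config_enabled, namespace_types, read_text) are made up for this example, and it assumes your distribution installs the kernel config as /boot/config-<release>; a False result may only mean the file could not be read.

```python
import os


def has_cgroup2(filesystems_text):
    """True if the text of /proc/filesystems lists a cgroup2 entry."""
    for line in filesystems_text.splitlines():
        fields = line.split()
        if fields and fields[-1] == "cgroup2":
            return True
    return False


def config_enabled(config_text, option):
    """True if a kernel config file's text contains OPTION=y."""
    return f"{option}=y" in (line.strip() for line in config_text.splitlines())


def namespace_types(ns_dir="/proc/self/ns"):
    """Names of the namespace objects the kernel exposes for this process."""
    try:
        return sorted(os.listdir(ns_dir))
    except OSError:
        return []


def read_text(path):
    """File contents, or an empty string if the file cannot be read."""
    try:
        with open(path) as handle:
            return handle.read()
    except OSError:
        return ""


if __name__ == "__main__":
    # os.uname().release is the same string that uname -r prints.
    config_path = f"/boot/config-{os.uname().release}"
    print("CONFIG_USER_NS=y:", config_enabled(read_text(config_path), "CONFIG_USER_NS"))
    print("namespaces:", namespace_types())
    print("cgroup2:", has_cgroup2(read_text("/proc/filesystems")))
```

The parsing functions take plain text, so you can also point them at a config file copied from another machine.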
Do you need to roll your own container solution? No. Do you want to? We do because we love to learn more about why things work on a starship Linux system.