Containers are chroot with a Marketing Budget
There are many ways to understand how containers work, but most useful explanations are actually simplifications.
Many people have settled on explaining containers by calling them ‘light-weight VMs’ and they are light-weight because they ‘share the kernel with the host’. This is useful, but it simplifies a lot away. What is a ‘light-weight VM’? What does sharing the kernel mean?
Others will tell you containers are about namespaces and specific kernel visibility tweaks. This is also a helpful explanation because namespaces partition visibility, so that running containers can’t see other things on the same machine.
But for me, containers are just chrooted processes. Sure, they are more than that: containers have a nice developer experience, an open-source foundation, and a whole ecosystem of cloud-native companies pushing them forward. But let me show you why I think chroot[1] is the key.
So, let’s build a container runtime using only the chroot system call. Doing so, we can learn a little about chroot, a little about container runtimes, and it will also be fun!
The Goal
By the end, I’ll have something that looks like docker run, called chrun, where you can pull docker images:
> chrun pull redis
Pulling image redis
export image 16b87aa63c8f3a1e14a50feb94cba39eaa5d19bec64d90ff76c3ded058ad09c8
And then run them:
> chrun run redis "/usr/local/bin/redis-server"
Running /usr/local/bin/redis-server in /tmp/_assets_redis_tar_gz4234401501
4360:C 31 Oct 2022 16:07:57.253 # oO0OoO0OoO0Oo Redis is starting oO0OoO0OoO0Oo
4360:C 31 Oct 2022 16:07:57.253 # Redis version=7.0.5, bits=64,
4360:C 31 Oct 2022 16:07:57.253 # Warning: no config file specified, using the
4360:M 31 Oct 2022 16:07:57.256 * Increased maximum number of open files to
4360:M 31 Oct 2022 16:07:57.256 * monotonic clock: POSIX clock_gettime
_._
_.-``__ ''-._
_.-`` `. `_. ''-._ Redis 7.0.5 (00000000/0) 64 bit
.-`` .-` `. ` `\/ _.,_ ''-._
( ' , .-` | `, ) Running in standalone mode
|`-._`-...-` __...-.``-._|'` _.-'| Port: 6379
| `-._ `._ / _.-' | PID: 4360
`-._ `-._ `-./ _.-' _.-'
|`-._`-._ `-.__.-' _.-'_.-'|
| `-._`-._ _.-'_.-' | https://redis.io
`-._ `-._`-.__.-'_.-' _.-'
|`-._`-._ `-.__.-' _.-'_.-'|
| `-._`-._ _.-'_.-' |
`-._ `-._`-.__.-'_.-' _.-'
`-._ `-.__.-' _.-'
`-._ _.-'
`-.__.-'
4360:M 31 Oct 2022 16:07:57.260 # Server initialized
4360:M 31 Oct 2022 16:07:57.265 * Ready to accept connections
And it will do this using chroot. But first, some background.
History of chroot
chroot probably doesn’t get mentioned much now that containers exist, but it’s a Unix system call. This means it’s a way to request something from the operating system kernel. It is also a utility program, so it’s easy to call from the shell.
> chroot /bin /bash
All it does is change the root directory (/) to a new value. That’s all chrooting does. It just changes what / means. That sounds simple, but file paths are at the heart of how Unix works, so you can do a lot with this call.
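To make "it just changes what / means" concrete, here is a small Go sketch that models which host path a chrooted process ends up touching. The helper name resolveInRoot is made up for illustration, and it only models plain pathname lookup: it ignores symlinks and is not how the kernel implements the call.

```go
package main

import (
	"fmt"
	"path/filepath"
)

// resolveInRoot is a hypothetical helper: given the directory that was passed
// to chroot and a path as the chrooted process writes it, it returns the path
// the host would see. Symlinks are ignored.
func resolveInRoot(root, path string) string {
	// Cleaning "/"+path collapses any leading "..", so "/.." stays "/",
	// just as ".." in the root directory refers to the root itself.
	return filepath.Join(root, filepath.Clean("/"+path))
}

func main() {
	for _, p := range []string{"/hello", "/bin/sh", "/../etc/passwd"} {
		fmt.Printf("%-16s -> %s\n", p, resolveInRoot("/testroot", p))
	}
}
```

Every path the process asks for, even one that tries to climb out with "..", lands somewhere under the new root in this model.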
Chroot is a much older system call than the ones modern container runtimes use, which means, in theory, the chrun shown above could run on a much older Linux kernel. But how far back into Linux history could we go?
Actually, we can go back to way before the creation of Linux. chroot first appeared in 1979 for Unix v7.
(I know this because Diomidis Spinellis put together this excellent GitHub repository that recreates the history of Unix from the earliest available source to today’s modern variations[2]. The history recreated in this repo stretches back to 1970 and includes the original PDP-7 assembly code of the first iteration of Unix.)
It came along with chdir (the system call equivalent of cd) and looked like this:
chdir()
{
	chdirec(&u.u_cdir);
}

chroot()
{
	if (suser())
		chdirec(&u.u_rdir);
}

struct user
{
	...
	struct inode *u_cdir; /* pointer to inode of current directory */
	struct inode *u_rdir; /* root directory of current process */
	...
}
u is the user struct for the current process, which holds u_rdir and u_cdir.
So, each process on a Unix system has a current directory and a root directory, and chroot is a way to change the root value (u_rdir) in the same way cd changes the current working directory (u_cdir). In Unix V7 that’s basically all the chroot code I see, except for the syscall list and some userland code so that you can call chroot from your shell:
/ C library -- chroot

/ error = chroot(string);

.globl	_chroot
.globl	cerror
.chroot = 61.

_chroot:
	mov	r5,-(sp)
	mov	sp,r5
	mov	4(r5),0f
	sys	0; 9f
	bec	1f
	jmp	cerror
1:
	clr	r0
	mov	(sp)+,r5
	rts	pc
.data
9:
	sys	.chroot; 0:..
So chroot goes way back, back into the 70s, and while the implementation has probably changed over the years, semantically it still matches the description found in the UNIX V7 Manual:
Chroot sets the root directory, the starting point for path names beginning with /. The call is restricted to the super-user.
Ok, history lesson over. Let’s start building things.
Using chroot Directly
Let’s start with the command-line and work towards our docker run clone.
The most straightforward docker run is hello-world:
> docker run hello-world
Hello from Docker!
This message shows that your installation appears to be working correctly.
To generate this message, Docker took the following steps:
1. The Docker client contacted the Docker daemon.
2. The Docker daemon pulled the "hello-world" image from the Docker Hub.
(amd64)
3. The Docker daemon created a new container from that image which runs the
executable that produces the output you are currently reading.
4. The Docker daemon streamed that output to the Docker client, which sent it
to your terminal.
...
Recreating this hello-world run in a chroot jail[3] is relatively straightforward.
chroot Hello World
At the command line, I can set up the hello-world binary in a changed root like so:
> mkdir /testroot
> cp hello /testroot
Then run it:
> chroot /testroot /hello
Hello from Docker!
This message shows that your installation appears to be working correctly.
To generate this message, Docker took the following steps:
1. The Docker client contacted the Docker daemon.
2. The Docker daemon pulled the "hello-world" image from the Docker Hub.
(amd64)
3. The Docker daemon created a new container from that image which runs the
executable that produces the output you are currently reading.
4. The Docker daemon streamed that output to the Docker client, which sent it
to your terminal.
To try something more ambitious, you can run an Ubuntu container with:
$ docker run -it ubuntu bash
Share images, automate workflows, and more with a free Docker ID:
https://hub.docker.com/
For more examples and ideas, visit:
https://docs.docker.com/get-started/
The Root of the Matter
chroot only works as a root user, so assume from here on out everything is being done as root on a Linux machine.
If you try as a non-root user, you will get something like this:
> chroot /testroot /hello
chroot: cannot change root directory to '/testroot': Operation not permitted
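The chroot itself can also be done from Go, making the system call directly:
package main

import (
	"log"
	"os"
	"os/exec"
	"syscall"
)

func main() {
	// Build the command first. The path is only resolved when it runs,
	// by which point / already means /testroot.
	cmd := exec.Command("/hello")

	// Change this process's root directory (requires root).
	if err := syscall.Chroot("/testroot"); err != nil {
		log.Fatal(err)
	}

	cmd.Stdin = os.Stdin
	cmd.Stdout = os.Stdout
	cmd.Stderr = os.Stderr
	if err := cmd.Run(); err != nil {
		log.Fatal(err)
	}
}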
And the output is the same:
> go run go-change-root.go
Hello from Docker!
This message shows that your installation appears to be working correctly.
...
That hello process runs with a filesystem rooted to /testroot. So there is nothing in the filesystem it can see besides itself.
I could verify that by running a shell inside it and poking around. However, when you change the root, the command passed is relative to the new root, so running /bin/sh will fail.
> chroot /testroot /bin/sh
chroot: /bin/sh: No such file or directory
I could cp /bin/sh into /testroot, but sh dynamically links in libc and probably other stuff. Those file pointers won’t point to anything in our new root so it won’t work. This is also why you can’t shell into the hello-world image with docker run:
> docker run -it hello-world /bin/sh
exec: "/bin/sh": stat /bin/sh: no such file or directory: unknown.
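To see what "libc and probably other stuff" means for a particular binary, Go's standard debug/elf package can list the dynamic loader and the shared libraries an ELF file asks for. This is a rough sketch: runtimeDeps is a made-up helper name, /bin/sh is only a default, and the list covers direct dependencies only, not what those libraries need in turn.

```go
package main

import (
	"debug/elf"
	"fmt"
	"io"
	"os"
	"strings"
)

// runtimeDeps is a hypothetical helper. It returns the ELF interpreter (the
// dynamic loader) and the shared libraries the binary requests by name.
func runtimeDeps(path string) (string, []string, error) {
	f, err := elf.Open(path)
	if err != nil {
		return "", nil, err
	}
	defer f.Close()

	interp := ""
	for _, p := range f.Progs {
		if p.Type == elf.PT_INTERP {
			b, err := io.ReadAll(p.Open())
			if err != nil {
				return "", nil, err
			}
			// The segment holds a NUL-terminated path.
			interp = strings.TrimRight(string(b), "\x00")
		}
	}
	libs, err := f.ImportedLibraries()
	return interp, libs, err
}

func main() {
	path := "/bin/sh"
	if len(os.Args) > 1 {
		path = os.Args[1]
	}
	interp, libs, err := runtimeDeps(path)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Println("loader:", interp)
	for _, l := range libs {
		fmt.Println("needs: ", l)
	}
}
```

Each file it reports would have to exist at the same path inside the new root before the binary could start there.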
Predictably, you can only shell into an image with a shell (and supporting userspace dependencies) inside it. So I’m going to be using redis:latest today:
> docker run -it redis /bin/sh
> cd /
> ls
bin boot data dev etc home lib lib64 media mnt opt proc root run
sbin srv sys tmp usr var
Now let’s try to chroot into this redis image. But to do that, I first need to get the file-system out of the image so I can pass it to chroot.
Extracting File Systems From Containers
To extract the file-system from the image, the first thing I’ll try is to grab the image and extract it:
docker save redis -o redisImage.tar
mkdir redis
cd redis && tar -mxvf ../redisImage.tar
I can then look inside it:
redis
├── 131d224a301217b1d881f2464837d310dc8e0bf701d049fc30fb9eabddd98cbc
│   ├── VERSION
│   ├── json
│   └── layer.tar
├── 2279e9cb00a8a268cb01a1ccd1b7c0a01dc6b9ec619a7877dda2ca81e7409428
│   ├── VERSION
│   ├── json
│   └── layer.tar
├── 2d0405b8f23157bc9f45cadc12b8b7ff23446dfe968bfa7473cb78ec2444d198
│   ├── VERSION
│   ├── json
│   └── layer.tar
├── 770413d3495f9ba555e345d5c5397580a61cc64d9a945135b4b2235eed19d07b
│   ├── VERSION
│   ├── json
│   └── layer.tar
├── beb3916dbc72988060eaa0ba9ba119c76eb1c07db1c18ea53d3ca4f40a03c436
│   ├── VERSION
│   ├── json
│   └── layer.tar
├── c2342258f8ca7ab5af86e82df6e9ade908a949216679667b0f39b59bcd38c4e9.json
├── f7b46deebf614151dce2888bcb81e312da2ac791230b02688a5dbab1dee7ea91
│   ├── VERSION
│   ├── json
│   └── layer.tar
├── manifest.json
└── repositories
I’m on the right track, but this is not exactly what I wanted. Each layer.tar holds the union file-system changes for one image layer. To build the completed file structure I would need to extract each of these and combine them in the right order.
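The "right order" is recorded in manifest.json, whose Layers array lists the layer tarballs from the base layer upward. As a rough sketch, assuming the docker save layout shown above, reading that order in Go could look like this. The names layerOrder and manifestEntry are made up, and the sample uses shortened placeholder layer names rather than real digests.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// manifestEntry mirrors only the manifest.json fields this sketch needs.
type manifestEntry struct {
	RepoTags []string `json:"RepoTags"`
	Layers   []string `json:"Layers"`
}

// layerOrder is a hypothetical helper that returns the layer tarballs of the
// first image in the manifest, base layer first.
func layerOrder(manifest []byte) ([]string, error) {
	var entries []manifestEntry
	if err := json.Unmarshal(manifest, &entries); err != nil {
		return nil, err
	}
	if len(entries) == 0 {
		return nil, fmt.Errorf("manifest lists no images")
	}
	return entries[0].Layers, nil
}

func main() {
	// Placeholder layer names stand in for the real digests.
	sample := []byte(`[{"RepoTags":["redis:latest"],
		"Layers":["aaa/layer.tar","bbb/layer.tar","ccc/layer.tar"]}]`)
	layers, err := layerOrder(sample)
	if err != nil {
		panic(err)
	}
	for i, l := range layers {
		fmt.Printf("%d: extract %s\n", i+1, l)
	}
}
```

Extracting in that order is only part of the job: a later layer can also delete files from an earlier one through whiteout entries, which this sketch does not handle.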
Thankfully, I can just ask docker to do that for us with docker export.
> docker export $(docker create redis) -o redis.tar.gz
> mkdir redis && cd redis
> tar --no-same-owner --no-same-permissions --owner=0 --group=0 \
-mxf ../redis.tar.gz
Then I end up with the extracted redis file structure:
./redis
├── bin
├── boot
├── data
├── dev
├── etc
├── home
├── lib
├── lib64
├── media
├── mnt
├── opt
├── proc
├── root
├── run
├── sbin
├── sys
├── tmp
├── usr
└── var
And so if I wrap that docker export up into a bash script, I can grab the file system for any image on Docker Hub, turning any Linux container image into a tar file.
./pull "redis"
Pulling image redis
export image c20f5ecac2f9c49521b32433ffc6abeade950e77592805b0fc61fea00d6e32f5
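The script itself is bash and isn't shown here. As a rough Go equivalent of the same idea, the sketch below shells out to docker create and docker export. The names pull and exportArgs are made up, the output file name simply follows the .tar.gz naming used above, and removing the throwaway container afterwards is my own addition.

```go
package main

import (
	"fmt"
	"os"
	"os/exec"
	"strings"
)

// exportArgs is a hypothetical helper that builds the `docker export` arguments.
func exportArgs(containerID, out string) []string {
	return []string{"export", "-o", out, containerID}
}

// pull creates a stopped container from image and exports its file system to out.
func pull(image, out string) error {
	fmt.Println("Pulling image", image)
	id, err := exec.Command("docker", "create", image).Output()
	if err != nil {
		return fmt.Errorf("docker create: %w", err)
	}
	container := strings.TrimSpace(string(id))
	// Remove the throwaway container once the export is done.
	defer exec.Command("docker", "rm", container).Run()

	fmt.Println("export image", container)
	return exec.Command("docker", exportArgs(container, out)...).Run()
}

func main() {
	if len(os.Args) < 2 {
		fmt.Fprintln(os.Stderr, "usage: pull <image>")
		os.Exit(1)
	}
	image := os.Args[1]
	if err := pull(image, image+".tar.gz"); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
}
```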
From there, my trusty rusty chroot command works much like my docker run -it redis /bin/sh from a couple of steps ago:
> chroot ./redis /bin/sh
> ls
bin boot data dev etc home lib lib64 media mnt opt proc root run
sbin srv sys tmp usr var
It’s Just a Process
Here is why this is interesting from a learning perspective:
When I run docker run .., something happens: an image is turned into a container and started up. It’s not really a VM, but if you shell inside and look around, it seems like one. But now, with chroot at hand, you can see what ‘not really a VM’ means: It’s just a process!
Namespaces mean that a process started in a container can’t see the other processes on the machine, and cgroups mean that the process can have CPU and memory limits placed on it, but really, at a conceptual level, it’s just a process running with a different file-system root. Really, containers are just a fancier way to chroot something!
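On Linux you can check the "different file-system root" part for any running process, because /proc/<pid>/root is a symbolic link to that process's root directory. Here is a small sketch, with procRootLink as a made-up helper name. It inspects its own process, which will normally report /, and inspecting another user's process generally needs root.

```go
package main

import (
	"fmt"
	"os"
)

// procRootLink is a hypothetical helper that returns the /proc entry pointing
// at a process's root directory.
func procRootLink(pid int) string {
	return fmt.Sprintf("/proc/%d/root", pid)
}

func main() {
	pid := os.Getpid()
	root, err := os.Readlink(procRootLink(pid))
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Printf("pid %d has root %s\n", pid, root)
}
```

Pointing the same lookup at the PID of a chrooted process, from outside the chroot, should show the directory that was passed to chroot.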
Ok, let’s keep going.
ChRun Time
Another thing you may have noticed about containers is that they are ephemeral and relatively isolated. I can run N containers from one image and they will each be unique. Modern container runtimes use a union file-system (like overlayfs) for this, but I can get close to that with just temp directories.
Here’s my plan. When chrun pull <imagename> is called, I grab a tar of the image and store it somewhere. Then each time chrun run <imagename> is called, I’ll do the following:
- Create a temporary directory
- Extract <imagename>.tar.gz into it
- Change root into that directory
- On exit, delete the directory
It looks like this:
func main() {
	tar := fmt.Sprintf("./assets/%s.tar.gz", os.Args[2])
	cmd := os.Args[3]
	dir := createTempDir(tar)
	defer os.RemoveAll(dir)
	must(unTar(tar, dir))
	chroot(dir, cmd)
}
First, I create a temp directory:
func createTempDir(name string) string {
	var nonAlphanumericRegex = regexp.MustCompile(`[^a-zA-Z0-9 ]+`)

	prefix := nonAlphanumericRegex.ReplaceAllString(name, "_")
	dir, err := ioutil.TempDir("", prefix)
	if err != nil {
		log.Fatal(err)
	}
	return dir
}
Then I untar things:
func unTar(source string, dst string) error {
	r, err := os.Open(source)
	if err != nil {
		return err
	}
	defer r.Close()

	ctx := context.Background()
	return extract.Archive(ctx, r, dst, nil)
}
And then chroot, and we’re off:
func chroot(root string, call string) {
	fmt.Printf("Running %s in %s\n", call, root)
	cmd := exec.Command(call)
	must(syscall.Chroot(root))
	cmd.Stdin = os.Stdin
	cmd.Stdout = os.Stdout
	cmd.Stderr = os.Stderr
	must(cmd.Run())
}
( There are actually a couple more bits to it, but they are uninteresting. Entire file in this repo. )
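As a side note, Go can also apply the chroot to the child process only, through the Chroot field of syscall.SysProcAttr. This is an alternative sketch, not what chrun does, and chrootedCommand is a made-up name. Because the calling process keeps its own root here, host paths such as the temp directory still resolve in the parent afterwards.

```go
package main

import (
	"fmt"
	"os"
	"os/exec"
	"syscall"
)

// chrootedCommand is a hypothetical helper that prepares call to run with
// root as its root directory, without changing the root of the caller.
func chrootedCommand(root, call string) *exec.Cmd {
	cmd := exec.Command(call)
	cmd.SysProcAttr = &syscall.SysProcAttr{Chroot: root}
	cmd.Dir = "/" // working directory, interpreted inside the new root
	cmd.Stdin, cmd.Stdout, cmd.Stderr = os.Stdin, os.Stdout, os.Stderr
	return cmd
}

func main() {
	if len(os.Args) < 3 {
		fmt.Fprintln(os.Stderr, "usage: sketch <root> <command>")
		os.Exit(1)
	}
	if err := chrootedCommand(os.Args[1], os.Args[2]).Run(); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
}
```

It still needs root to run. The rest of this post sticks with the simpler version shown earlier.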
And with that, I can do things like start up a redis client and server and have them talk to each other:
> ./chrun pull redis
Pulling image redis
export image 16b87aa63c8f3a1e14a50feb94cba39eaa5d19bec64d90ff76c3ded058ad09c8
chrun pulls an image from Docker Hub and builds a tar archive of it (docker export does the heavy lifting).
> chrun run redis "/usr/local/bin/redis-server"
Running /usr/local/bin/redis-server in /tmp/_assets_redis_tar_gz4234401501
4360:C 31 Oct 2022 16:07:57.253 # oO0OoO0OoO0Oo Redis is starting oO0OoO0OoO0Oo
4360:C 31 Oct 2022 16:07:57.253 # Redis version=7.0.5, bits=64,
4360:C 31 Oct 2022 16:07:57.253 # Warning: no config file specified, using the
4360:M 31 Oct 2022 16:07:57.256 * Increased maximum number of open files to 10032 (it was originally set to 1024).
4360:M 31 Oct 2022 16:07:57.256 * monotonic clock: POSIX clock_gettime
_._
_.-``__ ''-._
_.-`` `. `_. ''-._ Redis 7.0.5 (00000000/0) 64 bit
.-`` .-` `. ` `\/ _.,_ ''-._
( ' , .-` | `, ) Running in standalone mode
|`-._`-...-` __...-.``-._|'` _.-'| Port: 6379
| `-._ `._ / _.-' | PID: 4360
`-._ `-._ `-./ _.-' _.-'
|`-._`-._ `-.__.-' _.-'_.-'|
| `-._`-._ _.-'_.-' | https://redis.io
`-._ `-._`-.__.-'_.-' _.-'
|`-._`-._ `-.__.-' _.-'_.-'|
| `-._`-._ _.-'_.-' |
`-._ `-._`-.__.-'_.-' _.-'
`-._ `-.__.-' _.-'
`-._ _.-'
`-.__.-'
4360:M 31 Oct 2022 16:07:57.260 # Server initialized
4360:M 31 Oct 2022 16:07:57.265 * Ready to accept connections
chrun run extracts the tar to a temp dir, changes root to it, and then starts the passed command. Afterward, it cleans up the temp dir.
> chrun run redis "/usr/local/bin/redis-cli"
Running /usr/local/bin/redis-cli in /tmp/_assets_redis_tar_gz1366317376
127.0.0.1:6379> SET mykey "Hello\nWorld"
OK
127.0.0.1:6379> GET mykey
"Hello\nWorld"
127.0.0.1:6379>
127.0.0.1:6379> exit
And after I stop them, the temp dir is removed, and they disappear. So there you go, ‘containers’ using only chroot.
The source is on github.
Who Cares?
So who cares? I mean, many container runtimes already exist (runC, containerd, gVisor, StarStruck) and they’re all better than this one in almost every way[4].
Well, it could just be me, but understanding that a container is very similar to a process that has been chrooted – so it’s running against the same operating system but with a different root – that understanding helps ground my knowledge of what containers are. It makes them seem less magical[5] and lets me think about new possibilities[6].
And so containers are great. Namespaces, cgroups v2, runC, overlayfs, the OCI image format, and everything else in this space is impressive engineering. It’s incredible forward progress we can all take advantage of. But it’s not magic. It’s just a long series of progressive refinements ( and a bit of marketing ) on top of a feature that has been in Unix since … let me check:
> git log usr/src/libc/sys/chroot.s | head -5
commit a0b0c390d5f37060bf64b63bba8e9f0a1dceb337
Author: Dennis Ritchie <dmr@research.uucp>
Date: Wed Jan 10 14:59:44 1979 -0500
Research V7 development
[1] I pronounce chroot as change root. I asked around and some say ‘see haitch root’ and some ‘chur-oot’ (which I kind of hate).
[2] Thanks to Diomidis Spinellis for creating the unix history repo and pointing me towards the V7 manual and thanks to Alex Couture-Beil for pushing me to understand the code.
[3] The term chroot jail used to be used quite a bit on the internet, and so I’m using it here, but it’s a bit problematic as a term. A program running in a restricted root lacks access to files outside the root, but breaking out of this jail is often possible, especially in the way we will be using them today.
Containers provide more restrictions than chroot. For example, cgroups provide resource isolation, and namespaces limit visibility, but even containers are not proper security barriers.
I’m not a security expert, but it’s probably better to think of both chroot and containers as offering an abstraction barrier, rather than providing a security barrier.
[4] gVisor is really cool. It’s an application kernel for containers, which limits some of the security surface-area of a container runtime. StarStruck is just there to check if you are paying attention. It’s not a container runtime at all, but Joe Exotic’s 2nd studio album.
[5] Here’s one example where chroot made things seem less magical.
When I see people complaining that docker takes up too much disk space, I know that it’s more about having lots of docker images sitting around on disk than anything innate to how containers work. Also, it’s probably a bit about how many containers include full Linux distros in them. No one is complaining about the size of hello-world:latest or any from scratch image.
Nothing about containers means they need a full Linux distro in them; that’s just a convenience.
[6] One new possibility that seems exciting to me is building native OS X containers based on chroot.
When you run a Linux container on a Mac, it runs inside a Linux VM. But OS X has the chroot system call, so it should be possible to build native Mac containers, which contain an OS X build of redis and a base layer including the OS X dependencies that redis needs to run. These containers would lack some of the isolation of Linux containers, but they would have almost no overhead and could have much faster volume mounts (a current source of frustration for many devs who are using OS X).