Minibox: A miniature Linux container runner

Published on 2017-12-23
Tagged: linux


I've been curious about how Linux containers work for a long time. I've played around with Docker, and it basically seems like magic. I decided to learn more about them by writing my own tiny container implementation.

Background

Containers in Linux are implemented using namespaces, which are a relatively new virtualization mechanism. I mean virtualization in the computer science sense here: isolating a process's view of the rest of the system from other processes. Some resources have always been virtualized in Linux: access to memory (virtual memory) and the CPU (preemptive scheduling). Most global system resources were not virtualized before namespaces came along: all processes had the same view of the file system, the network, user IDs, and IPC.

In 2002, mount namespaces were added. Each mount namespace has its own mount table, so processes in different mount namespaces have different views on which filesystems are mounted, and where. Additional namespaces were added after 2006: there are now namespaces for process IDs, network, interprocess communication, UTS (host and domain name), user IDs, and control groups.
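
An easy way to see namespaces in action (an aside, not part of minibox): every namespace a process belongs to shows up as a symlink under /proc/self/ns, and two processes share a namespace exactly when the corresponding links point at the same inode. A tiny Go sketch that prints them might look like this; run it in two different mount namespaces and the mnt link will differ.

package main

import (
  "fmt"
  "os"
  "path/filepath"
)

func main() {
  // Each namespace appears as a symlink like "mnt -> mnt:[4026531840]".
  links, err := filepath.Glob("/proc/self/ns/*")
  if err != nil {
    fmt.Fprintln(os.Stderr, err)
    os.Exit(1)
  }
  for _, link := range links {
    target, err := os.Readlink(link)
    if err != nil {
      continue
    }
    fmt.Printf("%s -> %s\n", filepath.Base(link), target)
  }
}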

Namespaces are the foundation of containers. A container is a collection of programs and resources that run in isolation from the rest of the processes on a system. Docker, at its lowest level, is a layer of packaging and configuration on top of the raw functionality built into the kernel.

Minibox

In order to understand containers better, I built a tiny, crappy version of Docker. You can find it at github.com/jayconrod/minibox.

My goal was to package a statically linked program and some related files in an ext2 disk image, mount that at the file system root inside a mount namespace, then execute the program. All configuration is done on the command line. I only used a mount namespace; processes in this container are not isolated from the network or anything else.

I implemented everything in Go. Go provides easy access to system calls via wrappers in golang.org/x/sys/unix. There were a few things that didn't quite work, so I dropped down into C in a couple places. I avoided using C as much as possible though since string manipulation and memory management are an absolute pain.

Demo

Before we dive into the code, I'll show how to use this thing. You'll need to have Go installed, and you'll need root access on your system.

First, download the project.

$ go get -d -u github.com/jayconrod/minibox
$ cd $GOPATH/src/github.com/jayconrod/minibox

Build the program that will run in the container. This can be any program. I wrote something simple that prints some information about its environment, lists the files in its directory, then exits.

$ go build list-files.go
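
The repo contains the real list-files.go; a minimal sketch of such a program (not the repo's exact code) that produces output like the demo below could look like this:

package main

import (
  "fmt"
  "os"
  "path/filepath"
)

func main() {
  fmt.Println("invoked as: ", os.Args[0])
  fmt.Println()
  fmt.Printf("uid: %d gid: %d\n", os.Getuid(), os.Getgid())
  fmt.Println()
  fmt.Println("environment:")
  for _, e := range os.Environ() {
    fmt.Println(e)
  }
  fmt.Println()
  fmt.Println("files:")
  // Walk the current directory and print each path, including "." itself.
  filepath.Walk(".", func(path string, info os.FileInfo, err error) error {
    if err != nil {
      return err
    }
    fmt.Println(path)
    return nil
  })
}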

Run the bash script to create a disk image. This script fills a 32MB file with zeroes, formats it with mkfs, mounts it at /mnt, copies list-files there, creates a few other files in there (just so list-files has something to look at), then unmounts the image. The script needs to be run with sudo because it uses mount.

$ sudo ./build-image.bash

Next, build minibox, the container runner.

$ go build minibox.go

Run it like this:

$ sudo ./minibox \
-image mini.ext2 \
-fstype ext2 \
-dir /mnt \
-entry /list-files \
-uid 1000 \
-gid 1000

You should see this:

invoked as:  /list-files

uid: 1000 gid: 1000

environment:

files:
.
bar
baz
foo
list-files
lost+found

Implementation

Ok, let's dive into the implementation of our container runner. This program sets up the container, executes the program inside it, then tears down anything that needs to be torn down. The container runner must run as root, since nearly all of the system calls it uses are privileged.

Step 1: Create new namespaces for the current process with unshare. CLONE_NEWNS puts the process in a new mount namespace; CLONE_FS unshares the filesystem attributes (the root directory and current directory), so changes to them don't affect other processes. You can use the same flags with clone if you want to create a new namespace and a new process at the same time.

if err := unix.Unshare(CLONE_FS | CLONE_NEWNS); err != nil {
  return 1, errors.Wrap(err, "unshare")
}

Step 2: Mount the disk image. This was more complicated than I thought it would be. When you mount a disk image on the command line, you can pass the disk image directly to the mount program, and it knows what to do. The mount system call is not as smart though: for a disk filesystem like ext2, the source has to be a block device, not a regular file. So we need to attach the image to a loop device and mount that. I'll break this into substeps; I translated them from the example on the loop(4) man page.

Step 2a: open /dev/loop-control and find a free loop device. Linux has a number of loop devices in /dev. Some of them may be in use, so you can use /dev/loop-control to find which one to use.

If you aren't familiar with Linux device files, you can open, read, and write them as if they were normal files. These operations are handled by device drivers in the kernel. Those drivers can send signals to the hardware to do something like playing sound or sending packets on the network in response to reads and writes. Of course, you can't do everything with read and write. For small miscellaneous operations, there's ioctl, which we use below. ioctl takes an open file descriptor, a request number (which has some device-specific meaning), and more untyped arguments interpreted by that request. Most requests simply get or set a number, so Go provides two wrappers for ioctl: IoctlGetInt and IoctlSetInt.

loopctlFd, err := unix.Open("/dev/loop-control", syscall.O_RDWR, 0)
if err != nil {
  return 1, errors.Wrap(err, "open /dev/loop-control")
}
defer closeFn("/dev/loop-control", loopctlFd)
devNum, err := unix.IoctlGetInt(loopctlFd, LOOP_CTL_GET_FREE)
if err != nil {
  return 1, errors.Wrap(err, "ioctl LOOP_CTL_GET_FREE")
}

Step 2b: open the disk image and bind its file descriptor to /dev/loopN, where N is the number we got from /dev/loop-control. The binding is done with another ioctl, LOOP_SET_FD. Once bound, /dev/loopN will act like a disk with the disk image as its backing store. We'll need to do LOOP_CLR_FD when we clean up later.

loopDevName := fmt.Sprintf("/dev/loop%d", devNum)
loopFd, err := unix.Open(loopDevName, syscall.O_RDWR, 0)
if err != nil {
  return 1, errors.Wrapf(err, "open %s", loopDevName)
}
defer closeFn(loopDevName, loopFd)

imageFd, err := unix.Open(image, syscall.O_RDWR, 0)
if err != nil {
  return 1, errors.Wrapf(err, "open %s", image)
}
defer closeFn(image, imageFd)

if err := unix.IoctlSetInt(loopFd, LOOP_SET_FD, imageFd); err != nil {
  return 1, errors.Wrap(err, "ioctl LOOP_SET_FD")
}
defer func() {
  _, clearErr := unix.IoctlGetInt(loopFd, LOOP_CLR_FD)
  if clearErr != nil && err == nil {
    err = clearErr
  }
}()

Step 2c: mount the loop device like a normal disk.

if err := unix.Mount(loopDevName, dir, fstype, 0, ""); err != nil {
  return 1, errors.Wrap(err, "mount")
}

Step 3: Create a child process with fork, and wait for it to complete with wait4. fork creates a new process by duplicating the calling process. The child process will eventually execute the program inside the container with execve, but we need to make a few other system calls first.

Unfortunately, Go doesn't provide a standalone version of fork; it only has ForkExec, which glues fork and execve together. As far as I understand, this is because fork only duplicates the calling thread, and the Go runtime has some background threads (garbage collector?) that it needs to keep running. When I asked my coworkers about this, they said don't worry about it, just use RawSyscall, YOLO. RawSyscall it is then. It seems to work well enough for this demo, but I wouldn't use this in production code.

pid, _, errno := unix.RawSyscall(uintptr(C.SYS_fork), 0, 0, 0)
if errno != 0 {
  return 1, errors.Wrap(errno, "fork")
}
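
fork returns twice: the parent sees the child's pid, and the child sees 0. The branch isn't shown explicitly in this post, but it looks something like the sketch below; runChild is a hypothetical helper standing in for steps 4 through 6, and the repo may structure this differently.

if pid == 0 {
  // Child process: pivot into the container image, drop privileges, and
  // exec the entry point (steps 4-6). runChild is hypothetical.
  runChild(dir, entry, uid, gid)
  // runChild only returns if something went wrong.
  os.Exit(1)
}
// Parent process: fall through to the wait loop below.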

After the fork, the parent process waits with wait4 for the child process to exit or be killed by a signal. wait4 can also return when the child process is stopped or continued (depending on the options passed). We don't care about that, so we call it in a loop and check why it returned.

for {
  var status unix.WaitStatus
  if _, err = unix.Wait4(int(pid), &status, 0, nil); err != nil {
    return 1, errors.Wrap(err, "wait4")
  }
  if status.Signaled() {
    return 1, errors.Errorf("process terminated by signal %v", status.Signal())
  }
  if status.Exited() {
    return status.ExitStatus(), nil
  }
  if status.Stopped() || status.Continued() {
    continue
  }
  return 1, errors.Errorf("unknown return from wait: %x", status)
}

The remaining steps occur inside the child process.

Step 4: Make the mounted container image the new file system root.

Step 4a: Create a directory called .old_root inside the container image.

oldRootDir := filepath.Join(dir, ".old_root")
os.Mkdir(oldRootDir, 0700)

Step 4b: Call pivot_root. This makes the container image the new root of the file system and moves the old root to .old_root. Also, change the current directory to /. pivot_root is vaguely specified and may leave the current directory in an indeterminate state, so it's best to set it explicitly.

if err := unix.PivotRoot(dir, oldRootDir); err != nil {
  log.Fatal(errors.Wrap(err, "pivot_root"))
}
if err := os.Chdir("/"); err != nil {
  log.Fatal(errors.Wrap(err, "chdir"))
}

Step 4c: unmount the old root file system and remove the .old_root directory. At this point, the old file system should no longer be visible.

if err := unix.Unmount("/.old_root", MNT_DETACH); err != nil {
  log.Fatal(errors.Wrap(err, "unmount"))
}
if err := os.Remove("/.old_root"); err != nil {
  log.Fatal(errors.Wrap(err, "remove"))
}

Step 5: Drop privileges with setgid and setuid. setgid has to be called first: both calls require root privilege, and once setuid succeeds we're no longer root, so a setgid afterward would fail. These system calls both have wrappers in golang.org/x/sys/unix, but their implementations just return an "operation not supported" error instead of doing something useful. RawSyscall once again, I guess.

_, _, errno := unix.RawSyscall(uintptr(C.SYS_setgid), uintptr(gid), 0, 0)
if errno != 0 {
  log.Fatal(errors.Wrap(errno, "setgid"))
}
_, _, errno = unix.RawSyscall(uintptr(C.SYS_setuid), uintptr(uid), 0, 0)
if errno != 0 {
  log.Fatal(errors.Wrap(errno, "setuid"))
}

Step 6: Execute the program inside the container. execve is the system call we want to use. This executes a program in the current process (the child process). Most of the process's state (virtual memory) is dropped and replaced with the new program. Some state is preserved: file descriptors are left open, so the child process can still read and write stdin and stdout. Command line arguments and environment variables are passed in explicitly through the execve call.

Go doesn't provide a wrapper for execve (other than ForkExec), and passing string arguments through RawSyscall didn't sound like fun to me. So I ended up writing my own wrapper in cgo. I joined the arguments into a single string with a NUL byte after each argument.

Here's the Go side of things:

cEntry := C.CString(entry)
cArgc := C.int(len(flag.Args()))
cArgstr := C.CString(strings.Join(flag.Args(), "\x00") + "\x00")
C.execWrapper(cEntry, cArgc, cArgstr)

And the C side:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

void execWrapper(char* path, int argc, char* argstr) {
  // Build argv: the path, then the NUL-separated arguments, then a
  // terminating NULL pointer.
  char** argv = malloc((argc+2) * sizeof(char*));
  argv[0] = path;
  for (int i = 0; i < argc; i++) {
    argv[i+1] = argstr;
    argstr += strlen(argstr) + 1;
  }
  argv[argc+1] = NULL;
  execve(path, argv, NULL);
  // execve only returns on failure.
  perror("execve");
  free(argv);
  exit(1);
}

Conclusion

This was a fun demo to write. I learned quite a bit about how containers are implemented, and I got to play with system calls for the first time in a while.

Once again, you can find the full implementation at github.com/jayconrod/minibox. If you decide to hack on this, keep a few things in mind: the runner has to be run as root, the only isolation is the mount namespace (processes in the container still share the network and everything else with the host), and the RawSyscall and cgo tricks above are fine for a demo but not something I'd ship in production.

Happy hacking!