Project 3: Container Runtime from Scratch

Build a minimal OCI-style container runtime that creates namespaces, applies cgroup limits, sets up a root filesystem, and runs a process in isolation.

Quick Reference

| Attribute | Value |
|-----------|-------|
| Difficulty | Advanced (Level 4) |
| Time Estimate | 1-2 weeks |
| Main Programming Language | C or Go |
| Alternative Programming Languages | Rust |
| Coolness Level | Level 4: Practical Systems Magic |
| Business Potential | Level 2: Platform Engineer Cred |
| Prerequisites | Linux syscalls, process model, filesystems |
| Key Topics | namespaces, cgroups v2, rootfs, OCI lifecycle |

1. Learning Objectives

By completing this project, you will:

  1. Create PID, mount, UTS, IPC, network, and user namespaces with clone/unshare.
  2. Apply cgroup v2 resource limits for CPU and memory.
  3. Build a minimal root filesystem and use pivot_root to isolate the filesystem.
  4. Implement an OCI-style lifecycle: create -> start -> delete.
  5. Explain how containers differ from VMs in isolation boundaries and performance.

2. All Theory Needed (Per-Concept Breakdown)

2.1 Linux Namespaces (PID, Mount, UTS, Net, User, IPC)

Fundamentals

Namespaces are a Linux kernel feature that virtualize global system resources for a group of processes. Each namespace type isolates a specific resource: PID namespace isolates process IDs, mount namespace isolates filesystem mounts, UTS isolates hostname, network isolates network devices, user isolates UID/GID mappings, and IPC isolates System V IPC and POSIX message queues. A container runtime creates one or more namespaces for the process it launches so the process believes it is alone on the system. Understanding namespaces is essential because they form the isolation boundary for containers. They are the reason ps and hostname look different inside a container.

Deep Dive into the concept

Namespaces are implemented as lightweight kernel objects that reference sets of resources. When you create a namespace using clone() flags or unshare(), the kernel creates a new namespace instance and attaches the calling process to it. Processes in the same namespace see the same resource view; processes in different namespaces see distinct views. This is how a container can see only its own processes or its own network interfaces.

PID namespaces are particularly interesting. The first process created in a new PID namespace becomes PID 1 in that namespace. PID 1 has special semantics: it receives orphaned processes and is responsible for reaping zombies. This means your container runtime must be careful: the process you launch as PID 1 must handle signals and reap children, or your container will leak zombies. This is why many runtimes use a small init process as PID 1.

Mount namespaces provide filesystem isolation by allowing a process to have its own mount table. You can mount /proc, /sys, and other filesystems without affecting the host. This is crucial because many tools rely on /proc to reflect the process list and system information. Without remounting /proc inside the mount namespace, your container might see host processes.

Network namespaces isolate interfaces, routing tables, and firewall rules. A new network namespace starts with no interfaces except loopback. Your runtime must create a veth pair, move one end into the container’s namespace, and configure IP addresses or let DHCP handle it. Even if you skip networking initially, understanding how it works is important for a complete mental model.

User namespaces allow unprivileged users to have root inside the namespace. The kernel maps container UID 0 to a non-root UID on the host. This makes rootless containers possible. However, user namespaces are tricky because many filesystem operations still require capabilities on the host. Your runtime must configure UID/GID maps in /proc/<pid>/uid_map and /proc/<pid>/gid_map, and may need to disable setgroups for security. If you do not understand user namespaces, you will struggle to implement rootless mode correctly.

Finally, namespaces are composable. A container runtime can combine multiple namespaces to create a strong isolation boundary. However, namespaces alone do not limit resource usage; that’s the role of cgroups. So, namespaces define “what you can see,” and cgroups define “how much you can use.” Both are required for container isolation.

There is also an operational implication: namespaces are created per process tree, so the lifecycle of a namespace is tied to the lifetime of the processes inside it. If PID 1 exits, the namespace is torn down. This means that your runtime must manage process lifetime carefully to avoid leaking resources or leaving dangling mounts. It also means that a container can “die” simply because its init process exits: when PID 1 terminates, the kernel kills every remaining process in the PID namespace, even if they are still doing useful work. This is a key behavioral difference from VMs, which can outlive individual processes.

Another practical detail is namespace propagation. Mount namespaces can use shared or private propagation; if you do not set them correctly, mounts may leak between host and container. A best practice is to remount / as private before performing container mounts. This ensures isolation even when the host mounts new filesystems later.

How this fits into the project

Namespaces are the core of Section 3.2 and Section 5.10 Phase 1. They define the container’s illusion of a separate system.

Definitions & key terms

  • Namespace -> Kernel object providing isolated view of a resource.
  • PID 1 -> First process in a PID namespace; responsible for reaping zombies.
  • UTS namespace -> Isolates hostname and domain name.
  • User namespace -> Isolates UID/GID mappings for rootless containers.

Mental model diagram (ASCII)

Host PID namespace: [1,2,3,...]
Container PID namespace: [1,2,3]

Process in container sees only its own PID tree

How it works (step-by-step, with invariants and failure modes)

  1. Create child process with clone() and namespace flags.
  2. Child becomes PID 1 in new namespace.
  3. Set hostname and mount /proc.
  4. Configure network if needed.

Failure modes: PID 1 does not reap children, /proc not remounted, user namespace mapping fails.

Minimal concrete example

/* "stack" must point to the TOP of the child's stack area (stacks grow down) */
clone(child_fn, stack, CLONE_NEWPID | CLONE_NEWNS | CLONE_NEWUTS | SIGCHLD, arg);

Common misconceptions

  • “Namespaces are containers.” -> Namespaces are one component; cgroups and rootfs isolation are also required.
  • “PID 1 behaves like a normal process.” -> PID 1 ignores some signals by default.
  • “User namespaces are always safe.” -> They require careful mapping and capability handling.

Check-your-understanding questions

  1. Why must a new PID namespace have a child process to see PID 1?
  2. What happens if you don’t remount /proc?
  3. Predict the effect of creating a network namespace without configuring interfaces.

Check-your-understanding answers

  1. The process that calls clone remains in the old namespace; the child enters the new one and becomes PID 1.
  2. The container will see host processes, breaking isolation.
  3. The container will have no network connectivity except loopback.

Real-world applications

  • Docker and containerd use namespaces for isolation.
  • Sandboxes use namespaces to isolate untrusted workloads.

Where you’ll apply it

References

  • namespaces(7) man page
  • Linux kernel documentation on namespaces

Key insights

Namespaces are about visibility: they control what the process can see, not what it can consume.

Summary

You now understand how namespaces isolate processes and why PID 1 semantics matter.

Homework/Exercises to practice the concept

  1. Use unshare -p -f --mount-proc to create a PID namespace and observe PID 1.
  2. Create a UTS namespace and set a custom hostname.

Solutions to the homework/exercises

  1. unshare -pf --mount-proc /bin/bash then run ps -ef.
  2. unshare -u /bin/bash then hostname container.

2.2 cgroups v2 Resource Control

Fundamentals

cgroups (control groups) limit and account for resource usage by processes. cgroups v2 provides a unified hierarchy that exposes controllers such as CPU, memory, and I/O. A container runtime must create a cgroup for the container and set limits like memory.max or cpu.max. Without cgroups, namespaces only isolate visibility; the container could still consume all CPU or memory. Understanding cgroups v2 is critical for building a safe container runtime. You should be able to explain the difference between a hard limit and a soft limit; that distinction underpins predictable multi-tenant behavior and capacity planning.

Deep Dive into the concept

cgroups v2 reorganizes resource control into a single unified hierarchy. Each cgroup is a directory under /sys/fs/cgroup. By writing to control files, you define limits and policies. For example, memory.max sets a hard memory limit, memory.swap.max controls swap usage, and cpu.max defines CPU time quota and period. When a container exceeds a memory limit, the kernel triggers OOM kill inside the cgroup, killing the offending process and recording an event.

Unlike cgroups v1, v2 uses a single tree and enforces “no internal processes” for cgroups that have children (unless delegated). This means your runtime must create a leaf cgroup and move the container process into it. If you plan to run nested cgroups, you must understand delegation rules and the cgroup.subtree_control file, which enables controllers for child cgroups.

The cgroup filesystem is itself a control interface. When you create a directory demo/, you can write process IDs to cgroup.procs to attach processes. The runtime should attach the container’s init process to the cgroup before it executes the user command, ensuring the entire process tree inherits limits. If you attach too late, child processes might escape limits.

cgroups v2 also provides accounting. Files like memory.current and cpu.stat give live usage metrics. You can use these for monitoring or to implement a stats command in your runtime. This is a stepping stone to Project 6, where resource metrics are used by a scheduler.

Resource control has trade-offs. For CPU, cpu.max uses a quota/period model; short period with low quota can cause bursty performance. For memory, memory.high provides a soft limit and throttles, while memory.max is a hard limit. A good runtime might set both to provide better behavior under pressure. Your project can keep it simple but must document these trade-offs.

Finally, cgroups interact with namespaces. A rootless container can still use cgroups if delegation is configured, but many systems restrict this by default. You may need to run with root for simplicity. Understanding these constraints helps you design a runtime that works in the typical environment.

Another nuance is accounting accuracy. cgroup metrics are updated asynchronously, so CPU and memory usage may lag behind reality by a few milliseconds. For a small runtime this is fine, but if you later build monitoring or enforcement on top, you must tolerate small delays and avoid using these stats as a strict gate. This is why production systems often combine cgroup metrics with application-level probes. Even in a toy runtime, it’s useful to log both the configured limits and the current usage to see how the kernel enforces them under load.

Be aware that cgroup limits can interact with the kernel’s OOM killer. If you set memory.max too low, the kernel may kill the container’s init process, which terminates the entire namespace. This looks like a sudden container exit and can be confusing until you inspect dmesg or cgroup events. Building a small “OOM detector” that reads memory.events is a great way to observe this behavior directly.

How this fits into the project

Resource limits are implemented in Section 3.2 and Section 5.10 Phase 2, and validated in Section 3.7.

Definitions & key terms

  • cgroup -> A control group that limits and accounts for resource usage.
  • Controller -> A resource type managed by cgroups (CPU, memory, I/O).
  • memory.max -> Hard memory limit in bytes.
  • cpu.max -> CPU quota and period (e.g., 50000 100000).

Mental model diagram (ASCII)

/sys/fs/cgroup
  +-- mini-runtime/
       +-- memory.max
       +-- cpu.max
       +-- cgroup.procs  <- container PID

How it works (step-by-step, with invariants and failure modes)

  1. Create cgroup directory.
  2. Enable controllers in parent.
  3. Write limits to control files.
  4. Write PID to cgroup.procs.

Failure modes: controller not enabled, permission denied, process attached too late.

Minimal concrete example

echo "+cpu +memory" > /sys/fs/cgroup/cgroup.subtree_control  # enable controllers for children
mkdir /sys/fs/cgroup/demo
echo 200000000 > /sys/fs/cgroup/demo/memory.max
echo "50000 100000" > /sys/fs/cgroup/demo/cpu.max
echo $$ > /sys/fs/cgroup/demo/cgroup.procs

Common misconceptions

  • “cgroups v1 and v2 are the same.” -> v2 is unified and has different semantics.
  • “Setting limits is enough.” -> The process must be attached to the cgroup.
  • “Limits are always enforced immediately.” -> Some limits are enforced lazily.

Check-your-understanding questions

  1. Why must you enable controllers in cgroup.subtree_control?
  2. What happens when memory usage exceeds memory.max?
  3. Predict how CPU quota affects a CPU-bound process.

Check-your-understanding answers

  1. Controllers must be delegated to children; otherwise writes are rejected.
  2. The kernel kills a process in the cgroup (OOM kill).
  3. The process will be throttled after its quota is consumed each period.

Real-world applications

  • Kubernetes uses cgroups to enforce pod resource limits.
  • Cloud providers meter CPU and memory usage via cgroups.

Where you’ll apply it

References

  • Linux cgroup v2 documentation

Key insights

cgroups turn containers from “isolated” into “fairly isolated” by enforcing resource boundaries.

Summary

You now know how to apply CPU and memory limits to container processes.

Homework/Exercises to practice the concept

  1. Create a cgroup with a low memory limit and run stress inside it.
  2. Observe memory.current as the workload runs.

Solutions to the homework/exercises

  1. You’ll see OOM kill when the limit is exceeded.
  2. memory.current increases until it hits the limit.

2.3 Root Filesystem Isolation and OCI Lifecycle

Fundamentals

A container runtime must isolate the filesystem view. This is done by creating a mount namespace and then performing pivot_root into a prepared root filesystem (rootfs). The OCI runtime spec defines a standard bundle format: a rootfs directory plus a config.json describing the process, namespaces, mounts, and limits. Even if you implement only a subset of OCI, understanding the lifecycle (create, start, delete) is essential for compatibility and clean teardown. The bundle is the contract between tooling and the runtime: if it is malformed, everything downstream breaks, which is why the OCI spec defines it strictly and why it serves as the contract boundary for automation.

Deep Dive into the concept

Filesystem isolation is often underestimated. A container should see a root filesystem that includes only what it needs: a minimal /bin, /lib, /etc, and /proc. You can build this rootfs using debootstrap, alpine tarballs, or a simple busybox root. The runtime’s job is to mount this rootfs inside the container’s mount namespace and then switch the process’s root to it using pivot_root (preferred) or chroot (simpler but less secure).

pivot_root requires you to create a directory to hold the old root (often /oldroot inside the new rootfs). After pivoting, you unmount the old root to avoid accidental escape. You also need to mount /proc inside the container so tools like ps work properly, and mount /sys if you want system introspection. The order of operations matters: you must be in the new mount namespace before pivoting, and you must create the necessary mount points before pivot_root can succeed.

The OCI runtime spec defines config.json fields such as process args, environment, rootfs path, namespaces, and capabilities. A minimal runtime may parse only a subset, but it should follow the lifecycle: create sets up namespaces and cgroups but does not start the user process; start launches it; delete cleans up cgroups and namespaces. This separation is important because orchestration tools expect to inspect or modify the container before it starts. In your project, you can combine create/start in a single command but should still implement the internal phases.

Capabilities and seccomp provide an extra layer of security. Linux capabilities break root privileges into fine-grained bits like CAP_NET_ADMIN or CAP_SYS_CHROOT. Containers often drop many capabilities to reduce risk. Seccomp filters system calls, preventing dangerous syscalls. For a minimal runtime, you can implement capability dropping via prctl and skip seccomp, but you should document why these matter.

The rootfs and OCI lifecycle are also where container images meet the runtime. In full systems, the image is unpacked into a rootfs. Your runtime can accept an existing directory. Understanding that this rootfs must be isolated, mounted, and cleaned up is the key to avoiding host contamination and to ensuring your runtime is deterministic and safe.

There is also a subtle security angle: if you allow bind mounts from the host into the container, you can easily defeat isolation. Even if you do not implement bind mounts, you should understand why container runtimes treat them carefully and often require explicit flags. Similarly, a writable /proc or /sys can allow privilege escalation. For this project, keep mounts minimal and read-only where possible, and document which mount points are necessary. This will help you reason about the security boundary your runtime provides.

The OCI lifecycle also emphasizes cleanup. After a container exits, you must unmount any mounts you created and delete cgroup directories, otherwise you accumulate stale state. In production, these leaks cause resource exhaustion over time. Even in a toy runtime, implement cleanup as a first-class step rather than an afterthought, and log cleanup actions so you can verify determinism across runs.

How this fits into the project

Rootfs setup and lifecycle are implemented in Section 3.2, Section 5.10 Phase 3, and validated in Section 3.7.

Definitions & key terms

  • rootfs -> Directory that becomes the container’s root filesystem.
  • pivot_root -> System call that swaps root filesystems.
  • OCI bundle -> rootfs + config.json describing the container.
  • Capability -> Fine-grained privilege control.

Mental model diagram (ASCII)

Host FS
  /containers/demo/rootfs
        |
        v
mount namespace + pivot_root -> container sees / as rootfs

How it works (step-by-step, with invariants and failure modes)

  1. Create mount namespace.
  2. Bind-mount rootfs to itself (make it a mount point).
  3. Create /oldroot inside rootfs.
  4. Call pivot_root(new_root, oldroot).
  5. Unmount /oldroot.

Failure modes: pivot_root fails if new root is not a mount point or if oldroot is not inside new root.

Minimal concrete example

/* glibc has no pivot_root wrapper; invoke it via syscall(2) */
mount(rootfs, rootfs, NULL, MS_BIND | MS_REC, NULL);  /* make rootfs a mount point */
chdir(rootfs);
mkdir("oldroot", 0755);
syscall(SYS_pivot_root, ".", "oldroot");
chdir("/");

Common misconceptions

  • “chroot is enough.” -> chroot can be escaped if the process keeps file descriptors.
  • “rootfs doesn’t need /proc.” -> many tools require /proc to function.
  • “OCI is optional.” -> It’s the lingua franca of container runtimes.

Check-your-understanding questions

  1. Why is pivot_root preferred over chroot?
  2. What is the role of config.json in OCI?
  3. Predict what happens if you forget to unmount the old root.

Check-your-understanding answers

  1. It fully replaces the root filesystem and avoids simple escapes.
  2. It defines the process, mounts, namespaces, and resource limits.
  3. The container retains access to the host root, breaking isolation.

Real-world applications

  • runc implements the OCI runtime spec.
  • containerd uses OCI bundles to start containers.

Where you’ll apply it

References

  • OCI runtime specification
  • pivot_root(2) man page

Key insights

Filesystem isolation is not just about hiding files; it’s about preventing escape.

Summary

You now understand how rootfs isolation and OCI lifecycle phases structure a container runtime.

Homework/Exercises to practice the concept

  1. Create a minimal rootfs with busybox and run /bin/sh inside.
  2. Inspect the mount table inside the container.

Solutions to the homework/exercises

  1. Use busybox static binary and required libs in rootfs.
  2. mount inside the container should show / as the new rootfs.

2.4 Linux Capabilities, Seccomp, and LSMs (Container Security Hardening)

Fundamentals

Namespaces and cgroups provide isolation and resource control, but they do not automatically make a container safe. The Linux security model has three additional layers you must understand: capabilities (fine-grained root privileges), seccomp (syscall filtering), and LSMs like AppArmor or SELinux (policy-based mandatory access control). A minimal container runtime should drop dangerous capabilities, prevent privilege escalation with no_new_privs, and optionally apply a seccomp profile. These steps drastically reduce the attack surface of a container while still allowing useful workloads to run. If you skip them, a container process running as root can often perform host-level operations that break the isolation illusion.

Deep Dive into the concept

Linux originally had a binary privilege model: either you were root or you were not. Capabilities split root into discrete powers, such as CAP_NET_ADMIN (configure networking), CAP_SYS_ADMIN (mount, pivot_root, and other powerful operations), and CAP_SYS_TIME (change system clock). A process has multiple capability sets: permitted, effective, inheritable, bounding, and ambient. The effective set is what is actually active; the permitted set is what could be activated; the bounding set is the ceiling for all future capability changes. Container runtimes typically drop capabilities from the bounding set to ensure the process can never regain them, even if it executes a setuid binary. In practice, you can start with a minimal capability list (e.g., CAP_CHOWN, CAP_DAC_OVERRIDE, CAP_SETUID, CAP_SETGID) and add more only when needed.

Seccomp (secure computing mode) is the next layer. It lets you define a filter (expressed as BPF) that either allows, denies, or traps system calls. A common approach is to use a default-deny list for dangerous syscalls such as kexec_load, ptrace, or mount, and allow common syscalls for typical workloads. The runtime installs the filter after setting up namespaces but before executing the target process. If the process calls a blocked syscall, the kernel can return EPERM, kill the process, or send a signal, depending on the filter action. This prevents entire classes of container escape techniques that rely on unusual syscalls.

LSMs (Linux Security Modules) provide policy-based access control that goes beyond capabilities and seccomp. AppArmor uses path-based profiles; SELinux uses labels and types. Container runtimes often integrate with these systems by loading a profile and attaching it to the container process. For a minimal runtime, you may not implement full LSM integration, but you should understand how it fits into the security stack and why production runtimes rely on it.

These layers interact. For example, unprivileged user namespaces allow root inside the container but map it to an unprivileged UID outside. This reduces the need for real root capabilities, but you must still drop capabilities inside the namespace because CAP_SYS_ADMIN in a user namespace is still powerful. Another example is no_new_privs, a flag that prevents a process from gaining new privileges via exec, even if the binary has setuid bits. This is often required before installing seccomp filters so that a process cannot bypass the filter by execing a privileged binary.

From an OCI perspective, these security settings map directly into the config.json: process.capabilities controls which capabilities are kept; process.noNewPrivileges sets the flag; linux.seccomp defines the syscall filter. This is a great place to connect the theory to your runtime implementation. You can parse a small subset of this config and apply it directly using capset, prctl, and seccomp syscalls (or libseccomp).

Security hardening introduces trade-offs. Dropping too many capabilities can break workloads (e.g., DHCP inside the container needs network capabilities). Aggressive seccomp filters can block legitimate syscalls used by language runtimes or JITs. The right approach is iterative: start with a minimal, known-good baseline, test a simple workload (like /bin/sh or python), and expand only as needed. This is a core security engineering discipline: reduce privileges until functionality breaks, then grant the smallest additional privilege required.

How this fits into the project

This concept informs Section 3.2 functional requirements (security constraints), Section 5.10 Phase 2 (runtime hardening), and Section 7.1 debugging when a container fails due to permissions.

Definitions & key terms

  • Capability -> A fine-grained privilege (e.g., CAP_NET_ADMIN).
  • Bounding set -> The maximum set of capabilities a process can ever obtain.
  • Seccomp -> Syscall filtering mechanism using BPF.
  • LSM -> Linux Security Module (e.g., AppArmor, SELinux).
  • no_new_privs -> Flag that prevents gaining privileges on exec.

Mental model diagram (ASCII)

Container process
  | capabilities (what it may do)
  | seccomp filter (which syscalls allowed)
  | LSM policy (what files/resources allowed)
  v
Effective privileges = intersection of all three layers

How it works (step-by-step, with invariants and failure modes)

  1. Create namespaces and set up the rootfs.
  2. Drop capabilities to a minimal set (invariant: bounding set removed for dangerous caps).
  3. Set no_new_privs to prevent privilege escalation.
  4. Install a seccomp filter allowing only required syscalls.
  5. Exec the container process.

Failure modes: a blocked syscall causes unexpected EPERM or process death.

Minimal concrete example

prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0);

// Drop all but a minimal set of capabilities (pseudocode)
cap_set_proc(minimal_caps);

// Install a tiny seccomp filter (libseccomp)
scmp_filter_ctx ctx = seccomp_init(SCMP_ACT_ERRNO(EPERM));  // default action: deny with EPERM
seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(read), 0);
seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(write), 0);
seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(exit), 0);
seccomp_load(ctx);

Common misconceptions

  • “Namespaces are enough for security.” -> A root process in a namespace can still be dangerous without capability drops.
  • “Seccomp is only for browsers.” -> It is widely used in containers to block dangerous syscalls.
  • “no_new_privs is optional.” -> Without it, setuid binaries can bypass your capability restrictions.

Check-your-understanding questions

  1. Why is dropping capabilities from the bounding set stronger than only dropping from the effective set?
  2. Predict what happens if a seccomp filter blocks mount but your runtime calls pivot_root.
  3. What does no_new_privs protect against?
  4. How do user namespaces interact with capabilities?

Check-your-understanding answers

  1. The bounding set prevents the process from ever regaining the capability, even through exec or setuid binaries.
  2. The container setup will fail with EPERM or be killed during setup.
  3. It prevents gaining new privileges through exec of setuid or file capabilities.
  4. Capabilities are scoped to the user namespace, but powerful caps like CAP_SYS_ADMIN still require caution.

Real-world applications

  • Docker and containerd apply capability drops and seccomp profiles by default.
  • Production environments use AppArmor/SELinux to confine container file and network access.

Where you’ll apply it

References

  • capabilities(7) man page
  • seccomp(2) and libseccomp documentation
  • OCI runtime spec (process.capabilities and seccomp sections)

Key insights

Isolation without privilege reduction is an illusion; security hardening makes the isolation real.

Summary

You now know how capabilities, seccomp, and LSMs combine to harden containers.

Homework/Exercises to practice the concept

  1. Create a tiny seccomp profile that denies mount and verify it blocks a container mount command.
  2. Drop CAP_NET_ADMIN and confirm that ip link add fails inside the container.

Solutions to the homework/exercises

  1. Use libseccomp to block mount and observe EPERM when running mount in the container.
  2. Remove CAP_NET_ADMIN and watch ip return an “Operation not permitted” error.

3. Project Specification

3.1 What You Will Build

A CLI minirun that:

  • Creates namespaces and cgroups for a container.
  • Sets up a rootfs and mounts /proc.
  • Runs a command as PID 1 in the container.
  • Supports run, exec, and delete operations.

Included: namespace creation, cgroup limits, rootfs isolation, logging. Excluded: full image management, registry pulls, network CNI plugins.

3.2 Functional Requirements

  1. Namespace Setup: PID, mount, UTS, IPC, and optionally network.
  2. Cgroup Limits: CPU and memory limits applied.
  3. Rootfs Isolation: pivot_root and /proc mount.
  4. OCI Bundle Support: accept rootfs and config.json subset.
  5. Lifecycle: create/start/delete with cleanup.

3.3 Non-Functional Requirements

  • Performance: container start under 1 second for a minimal rootfs.
  • Reliability: cleanup removes cgroup directories and mounts.
  • Usability: clear CLI usage and error messages.

3.4 Example Usage / Output

$ sudo ./minirun run ./rootfs /bin/sh
[minirun] namespaces: pid, mnt, uts, ipc
[minirun] cgroup: cpu.max=50000/100000 memory.max=256M
[minirun] pivot_root -> /rootfs
root@container:/# hostname
container

3.5 Data Formats / Schemas / Protocols

OCI subset config.json:

{
  "process": {"args": ["/bin/sh"], "env": ["PATH=/bin"]},
  "root": {"path": "rootfs"},
  "linux": {"namespaces": [{"type": "pid"}, {"type": "mount"}]}
}

3.6 Edge Cases

  • Missing rootfs -> exit code 2.
  • User namespace mapping fails -> exit code 3.
  • pivot_root fails due to missing mount point -> exit code 4.

3.7 Real World Outcome

You will be able to run a process that believes it is PID 1, has a unique hostname, and is constrained by CPU and memory limits.

3.7.1 How to Run (Copy/Paste)

sudo ./minirun run ./rootfs /bin/sh

3.7.2 Golden Path Demo (Deterministic)

  • Run a shell, check ps -ef shows PID 1 as /bin/sh.
  • Run hostname and confirm isolation.

3.7.3 CLI Transcript (Success + Failure)

$ sudo ./minirun run ./rootfs /bin/sh
[minirun] ok
[exit] code=0

$ sudo ./minirun run ./missing /bin/sh
[error] rootfs not found
[exit] code=2

Exit codes:

  • 0 success
  • 2 missing rootfs
  • 3 user namespace mapping error
  • 4 pivot_root error

4. Solution Architecture

4.1 High-Level Design

CLI -> parse args -> setup namespaces -> setup cgroups -> pivot_root -> exec

4.2 Key Components

| Component | Responsibility | Key Decisions |
|-----------|----------------|---------------|
| Namespace manager | clone/unshare | which namespaces enabled |
| Cgroup manager | create and attach | cgroups v2 only |
| Rootfs manager | mount + pivot_root | prefer pivot_root |
| OCI parser | read config.json | support minimal fields |

4.3 Data Structures (No Full Code)

struct ContainerConfig {
  char *rootfs;
  char **argv;
  long mem_max;
  char *cpu_max;
};

4.4 Algorithm Overview

Key Algorithm: Create + Start

  1. Parse config/args.
  2. Clone child with namespaces.
  3. In child, setup rootfs and cgroups.
  4. Exec target process.

Complexity Analysis:

  • Time: O(1) per container startup
  • Space: O(rootfs size)

5. Implementation Guide

5.1 Development Environment Setup

sudo apt-get install uidmap
gcc -o minirun main.c

5.2 Project Structure

minirun/
+-- src/
|   +-- main.c
|   +-- namespaces.c
|   +-- cgroups.c
|   +-- rootfs.c
+-- README.md

5.3 The Core Question You’re Answering

“How does Linux isolate a process so it looks like a VM without a hypervisor?”

5.4 Concepts You Must Understand First

  1. Namespaces and PID 1 semantics
  2. cgroups v2 limits
  3. pivot_root and mount namespaces

5.5 Questions to Guide Your Design

  1. How will you handle signals for PID 1?
  2. When will you attach the process to the cgroup?
  3. What is your cleanup strategy?

5.6 Thinking Exercise

Explain why PID 1 must reap children and what happens if it doesn’t.

5.7 The Interview Questions They’ll Ask

  1. What is the difference between namespaces and cgroups?
  2. Why do containers start faster than VMs?
  3. What does OCI standardize?

5.8 Hints in Layers

Hint 1: Start with just PID and mount namespaces.
Hint 2: Add cgroups for memory limits.
Hint 3: Add user namespaces for rootless mode.

5.9 Books That Will Help

| Topic | Book | Chapter |
|-------|------|---------|
| Virtual Machines | OS Concepts | Ch. 16 |
| Linux Internals | The Linux Programming Interface | Ch. 6 |
| OS Design | Modern Operating Systems | Ch. 7 |

5.10 Implementation Phases

Phase 1: Foundation (3-4 days)

Goals: Namespaces and rootfs isolation.
Tasks: clone with namespaces, mount /proc, pivot_root.
Checkpoint: PID 1 inside container.

Phase 2: Core Functionality (3-4 days)

Goals: cgroups and OCI parsing.
Tasks: create cgroup, apply limits, parse config.json.
Checkpoint: resource limits enforced.

Phase 3: Polish & Edge Cases (3 days)

Goals: cleanup and error handling.
Tasks: delete cgroups on exit, handle failures.
Checkpoint: missing rootfs returns exit code 2.

5.11 Key Implementation Decisions

| Decision | Options | Recommendation | Rationale |
|----------|---------|----------------|-----------|
| Rootfs switch | chroot vs pivot_root | pivot_root | stronger isolation |
| Cgroup version | v1 vs v2 | v2 | modern unified hierarchy |
| User namespaces | on/off | optional | complexity vs security |


6. Testing Strategy

6.1 Test Categories

| Category | Purpose | Examples |
|----------|---------|----------|
| Unit Tests | helpers | config parser |
| Integration Tests | container run | /bin/sh run |
| Edge Case Tests | missing rootfs | failure demo |

6.2 Critical Test Cases

  1. ps -ef shows PID 1 as container process.
  2. Memory limit triggers OOM inside container.
  3. Missing rootfs returns exit code 2.

6.3 Test Data

rootfs/, config.json, missing/

7. Common Pitfalls & Debugging

7.1 Frequent Mistakes

| Pitfall | Symptom | Solution |
|---------|---------|----------|
| No /proc mount | ps shows host | mount proc inside namespace |
| User ns fails | permission denied | map UID/GID correctly |
| cgroup not applied | limits ignored | attach PID before exec |

7.2 Debugging Strategies

  • Use strace to confirm clone flags.
  • Inspect /proc/self/ns to verify namespaces.

7.3 Performance Traps

  • Overly strict CPU quota can make the shell sluggish.

8. Extensions & Challenges

8.1 Beginner Extensions

  • Add exec into a running container.
  • Add hostname config option.

8.2 Intermediate Extensions

  • Support network namespaces with veth pairs.
  • Add basic seccomp filtering.

8.3 Advanced Extensions

  • Implement a minimal image unpacker.
  • Add support for runc-compatible bundles.

9. Real-World Connections

9.1 Industry Applications

  • Docker: built on namespaces, cgroups, and OCI runtime.
  • Serverless: containers provide isolation for functions.
  • runc: reference OCI runtime.
  • containerd: container lifecycle manager.

9.2 Interview Relevance

  • Understanding namespaces and cgroups is a common systems interview topic.

10. Resources

10.1 Essential Reading

  • OCI runtime spec
  • Linux namespaces and cgroups docs

10.2 Video Resources

  • “Containers from Scratch” talks

10.3 Tools & Documentation

  • unshare, nsenter, cgexec

11. Self-Assessment Checklist

11.1 Understanding

  • I can explain what each namespace isolates.
  • I understand cgroup v2 controllers.
  • I can explain why pivot_root is necessary.

11.2 Implementation

  • Container runs with PID 1.
  • Resource limits are enforced correctly.
  • Rootfs isolation is correct.

11.3 Growth

  • I can explain containers vs VMs clearly.

12. Submission / Completion Criteria

Minimum Viable Completion:

  • Container runs a process with namespaces and cgroups.
  • Rootfs is isolated.

Full Completion:

  • OCI config parsing and cleanup.
  • Demonstrated resource limit enforcement.

Excellence (Going Above & Beyond):

  • Rootless mode with user namespaces.
  • Seccomp filtering for syscalls.