Project 10: The Poor Man’s Docker (Container Runtime)

Build a minimal container runtime that isolates PID and mount namespaces and applies cgroup limits.

Quick Reference

Attribute | Value
Difficulty | Expert
Time Estimate | 2 weeks
Main Programming Language | Go or C
Alternative Programming Languages | Rust, Python
Coolness Level | See REFERENCE.md (Level 5)
Business Potential | See REFERENCE.md (Level 4)
Prerequisites | Process control, mounts, cgroups, root access
Key Topics | namespaces, pivot_root, cgroups, PID 1

1. Learning Objectives

By completing this project, you will:

  1. Explain how namespaces provide isolated views of system resources.
  2. Start a process as PID 1 inside a new PID namespace.
  3. Mount a new root filesystem and /proc inside the container.
  4. Apply resource limits using cgroups.

2. All Theory Needed (Per-Concept Breakdown)

Namespaces and Container Assembly

Fundamentals

Namespaces isolate a process’s view of global system resources such as PIDs, mounts, and hostnames. Cgroups limit resource usage. A container is a process launched in new namespaces with limits and a dedicated root filesystem. The kernel is shared, but the process sees a different world. Understanding these primitives reveals that containers are not magic; they are controlled views and budgets built from standard kernel features.

Deep Dive

A container runtime is a program that orchestrates a precise sequence of syscalls to set up isolation. The first step is to create a process in new namespaces. PID namespaces provide an isolated process ID tree in which the first process becomes PID 1 inside the container. Mount namespaces provide a separate mount table so the container can have its own filesystem layout. UTS namespaces isolate the hostname and domain name. Network namespaces, if used, isolate network interfaces. The runtime can create these namespaces with clone or unshare and then configure them before launching the target program. One subtlety: unshare with CLONE_NEWPID does not move the calling process into the new PID namespace; only children created afterwards are placed in it, so the runtime must fork after unsharing for the target to become PID 1.
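
As a rough illustration of the namespace step, the sketch below combines the CLONE_NEW* flag values from <linux/sched.h> and calls unshare(2) through ctypes. The function names (namespace_flags, enter_namespaces, spawn_in_namespaces) are hypothetical, and Python is used only for brevity; the same calls map directly onto C or Go. The privileged functions need root or CAP_SYS_ADMIN and are not invoked at import time.

```python
import ctypes
import os

# Flag values from <linux/sched.h>.
CLONE_NEWNS = 0x00020000   # mount namespace
CLONE_NEWUTS = 0x04000000  # hostname and domain name
CLONE_NEWPID = 0x20000000  # process ID tree
CLONE_NEWNET = 0x40000000  # network interfaces

FLAG_BY_NAME = {
    "mount": CLONE_NEWNS,
    "uts": CLONE_NEWUTS,
    "pid": CLONE_NEWPID,
    "net": CLONE_NEWNET,
}


def namespace_flags(names):
    """OR together the clone flags for the requested namespace names."""
    flags = 0
    for name in names:
        flags |= FLAG_BY_NAME[name]
    return flags


def enter_namespaces(names):
    """Call unshare(2) for the given namespaces (requires privileges)."""
    libc = ctypes.CDLL(None, use_errno=True)
    if libc.unshare(namespace_flags(names)) != 0:
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err))


def spawn_in_namespaces(names, child_main):
    """Unshare, then fork: the first child is PID 1 of the new PID namespace."""
    enter_namespaces(names)
    pid = os.fork()
    if pid == 0:
        os._exit(child_main())  # child_main returns the exit code
    _, status = os.waitpid(pid, 0)
    return os.waitstatus_to_exitcode(status)
```

A call such as spawn_in_namespaces(["pid", "mount", "uts"], child_main) would run child_main as PID 1; the fork is what places the child in the new PID namespace, since the caller of unshare stays in its original one.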

The root filesystem is a critical piece. The runtime prepares a directory tree that contains the binaries and libraries needed by the container, then switches the root to this tree with pivot_root so that the process sees the new filesystem as /. A common mistake is to use only chroot, which changes the root directory for path lookups but leaves the host’s mounts in the mount table and does not prevent access through file descriptors that still reference the old root. pivot_root is the more complete approach because it swaps the old root and new root, allowing the old root to be unmounted. Note that pivot_root requires the new root to be a mount point, which is why runtimes typically bind-mount the rootfs directory onto itself first.
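
One way to picture the root switch is as an ordered plan of mount operations. The sketch below builds that plan as data (so it can be printed or dry-run) and includes an executor that issues the calls through ctypes. Everything here is illustrative: pivot_plan, run_plan, and the .old_root directory name are hypothetical, and the pivot_root syscall numbers are assumptions that cover only x86_64 and aarch64. The executor must run as root inside the new mount namespace.

```python
import ctypes
import os
import platform

MS_BIND, MS_REC, MS_PRIVATE = 0x1000, 0x4000, 0x40000  # from <sys/mount.h>
MNT_DETACH = 2
# Assumption: syscall numbers for these two architectures only.
SYS_PIVOT_ROOT = {"x86_64": 155, "aarch64": 41}


def pivot_plan(new_root, old_name=".old_root"):
    """Return the ordered steps that switch / to new_root."""
    old = os.path.join(new_root, old_name)
    return [
        ("mount", None, "/", MS_REC | MS_PRIVATE),        # keep mount events off the host
        ("mount", new_root, new_root, MS_BIND | MS_REC),  # new root must be a mount point
        ("mkdir", old),
        ("pivot_root", new_root, old),
        ("chdir", "/"),
        ("umount", "/" + old_name, MNT_DETACH),           # drop the host root
        ("rmdir", "/" + old_name),
    ]


def _check(ret):
    if ret != 0:
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err))


def _enc(path):
    return None if path is None else os.fsencode(path)


def run_plan(plan):
    """Execute a plan produced by pivot_plan (root required)."""
    libc = ctypes.CDLL(None, use_errno=True)
    libc.mount.argtypes = [ctypes.c_char_p, ctypes.c_char_p, ctypes.c_char_p,
                           ctypes.c_ulong, ctypes.c_void_p]
    for op, *args in plan:
        if op == "mount":
            src, dst, flags = args
            _check(libc.mount(_enc(src), _enc(dst), None, flags, None))
        elif op == "pivot_root":
            number = ctypes.c_long(SYS_PIVOT_ROOT[platform.machine()])
            _check(libc.syscall(number, ctypes.c_char_p(_enc(args[0])),
                                ctypes.c_char_p(_enc(args[1]))))
        elif op == "umount":
            _check(libc.umount2(_enc(args[0]), args[1]))
        elif op == "mkdir":
            os.makedirs(args[0], exist_ok=True)
        elif op == "chdir":
            os.chdir(args[0])
        elif op == "rmdir":
            os.rmdir(args[0])
```

Separating the plan from its execution is a design choice made for this sketch: it lets you log each step before running it, which helps when a mount fails halfway through.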

The /proc filesystem must be mounted inside the container’s mount namespace. Without this, tools like ps either read the host’s procfs and show host processes, or fail entirely because the new root has no /proc. Mounting procfs from inside the new PID namespace ensures that process introspection reflects the container’s process tree. This is a key insight: a procfs instance reflects the PID namespace of the process that mounted it, so it must be mounted fresh after the namespaces are created.
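
The dependency of ps on /proc can be made concrete with a small sketch. mount_proc below issues the mount(2) call for a fresh procfs, and visible_pids lists the numeric directories under a procfs root, which is roughly how ps discovers processes. Both function names are hypothetical; mount_proc requires root and should be called from a process that is already inside the new PID namespace.

```python
import ctypes
import os

MS_NOSUID, MS_NODEV, MS_NOEXEC = 0x2, 0x4, 0x8  # from <sys/mount.h>


def mount_proc(target="/proc"):
    """Mount a fresh procfs at target (root required)."""
    libc = ctypes.CDLL(None, use_errno=True)
    libc.mount.argtypes = [ctypes.c_char_p, ctypes.c_char_p, ctypes.c_char_p,
                           ctypes.c_ulong, ctypes.c_void_p]
    os.makedirs(target, exist_ok=True)
    flags = MS_NOSUID | MS_NODEV | MS_NOEXEC
    if libc.mount(b"proc", os.fsencode(target), b"proc", flags, None) != 0:
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err))


def visible_pids(proc_root="/proc"):
    """Return the PIDs exposed by a procfs mount, sorted ascending."""
    return sorted(int(name) for name in os.listdir(proc_root) if name.isdigit())
```

Calling visible_pids() before and after mount_proc() inside the container is a quick way to see the host’s process list shrink to the container’s own.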

Cgroups enforce resource limits. A container without cgroups is not a safe isolation boundary because it can still consume unlimited CPU or memory. The runtime should create a cgroup, set limits, and attach the container process to it. This ensures that the kernel enforces resource budgets. For deterministic demos, use fixed limits and a controlled workload.
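
A minimal sketch of the cgroup step, assuming cgroup v2 mounted at /sys/fs/cgroup and the cpu and memory controllers enabled for the parent group. The interface files memory.max, cpu.max, and cgroup.procs come from the cgroup v2 documentation; the function names and the example group path are hypothetical.

```python
import os


def cpu_max_value(cpus, period_us=100_000):
    """Format a cpu.max line: '<quota> <period>', both in microseconds."""
    return f"{round(cpus * period_us)} {period_us}"


def apply_limits(cgroup_dir, pid, memory_bytes, cpus):
    """Create the cgroup, write its limits, then move the process into it."""
    os.makedirs(cgroup_dir, exist_ok=True)
    settings = {
        "memory.max": str(memory_bytes),
        "cpu.max": cpu_max_value(cpus),
        "cgroup.procs": str(pid),  # written last so limits apply on entry
    }
    for name, value in settings.items():
        with open(os.path.join(cgroup_dir, name), "w") as handle:
            handle.write(value)
```

For example, apply_limits("/sys/fs/cgroup/mycontainer", child_pid, 64 * 1024 * 1024, 0.5) would cap the container at 64 MiB of memory and half a CPU. Doing this from the parent, on the host side, avoids needing /sys/fs/cgroup inside the new root.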

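PID 1 behavior is special. In a PID namespace, the first process has PID 1 and is responsible for reaping zombies and handling signals correctly. Many subtle container bugs appear because PID 1 behaves differently from normal processes: signals sent from inside the namespace are delivered only if PID 1 has installed a handler for them, so a default-action SIGTERM is silently ignored, and orphaned processes are reparented to PID 1 and must be reaped by it. When PID 1 exits, the kernel kills every other process in the namespace. A minimal runtime should be aware of this and either run a simple init process or implement basic reaping itself.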
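
The reaping and signal-forwarding duties can be sketched as a tiny init loop. This is one possible shape, not a reference implementation: run_as_init is a hypothetical name, only SIGTERM and SIGINT are forwarded, and the loop returns as soon as the main child exits. It runs unprivileged, so it can be tried outside a container too.

```python
import os
import signal


def run_as_init(argv):
    """Fork and exec argv, forward signals to it, and reap every child."""
    child = os.fork()
    if child == 0:
        try:
            os.execvp(argv[0], argv)
        finally:
            os._exit(127)  # reached only if exec failed

    def forward(signum, _frame):
        try:
            os.kill(child, signum)
        except ProcessLookupError:
            pass  # main child already gone

    forwarded = (signal.SIGTERM, signal.SIGINT)
    previous = {sig: signal.signal(sig, forward) for sig in forwarded}
    try:
        while True:
            # -1 waits for any child, so reparented orphans are reaped too.
            pid, status = os.waitpid(-1, 0)
            if pid == child:
                code = os.waitstatus_to_exitcode(status)
                # Shell convention: 128 + signal number when killed by a signal.
                return code if code >= 0 else 128 - code
    finally:
        for sig, handler in previous.items():
            if handler is not None:
                signal.signal(sig, handler)
```

Installing explicit handlers matters here because, as PID 1, the process would otherwise never receive a default-action SIGTERM sent from inside the namespace.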

Container assembly is therefore an orchestration of isolation (namespaces), filesystem setup (pivot_root and mounts), and resource control (cgroups). Each step has a clear contract and observable outcome, which is why this project is a deep test of your systems understanding.

How this fits into the project

You will apply this concept in §3.1 for container requirements, in §4.1 for runtime architecture, and in §5.10 for the implementation phases. It builds directly on P09-cgroup-resource-governor.md.

Definitions & key terms

  • Namespace: Kernel feature that isolates a resource view.
  • PID 1: First process in a PID namespace.
  • pivot_root: Switches the root filesystem.
  • Mount namespace: Isolates mount table and filesystem view.
  • Container: Process with isolated namespaces and limits.

Mental model diagram

Host kernel
  |
  +-- Namespace set -> process (PID 1)
  |        |
  |        +-- mount /proc
  |        +-- new root filesystem

How it works

  1. Create new namespaces for PID and mount.
  2. Set up root filesystem and pivot.
  3. Mount procfs inside namespace.
  4. Apply cgroup limits.
  5. Launch target program as PID 1.

Minimal concrete example

Inside container:
PID 1 -> /bin/sh
hostname -> sandbox
ps -> shows only container processes

Common misconceptions

  • “Containers are lightweight VMs.” They share the host kernel.
  • “chroot equals container.” chroot does not isolate mounts or PIDs.
  • “cgroups are optional.” They are required for resource isolation.

Check-your-understanding questions

  1. Why must /proc be mounted inside the namespace?
  2. What is special about PID 1?
  3. Why is pivot_root safer than chroot?
  4. How do cgroups and namespaces complement each other?

Check-your-understanding answers

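  1. A procfs mount reflects the PID namespace of the process that mounted it, so only a fresh mount made inside the container shows the container’s processes.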
  2. PID 1 must reap zombies and handles signals differently.
  3. pivot_root swaps roots and allows old root to be unmounted.
  4. Namespaces isolate views; cgroups enforce resource limits.

Real-world applications

  • Container runtimes and orchestrators.
  • Sandbox environments for untrusted code.
  • Lightweight isolation in CI pipelines.

Where you’ll apply it

  • §3.1 (container requirements), §4.1 (runtime architecture), and §5.10 (implementation phases).

References

  • namespaces(7) man page: https://man7.org/linux/man-pages/man7/namespaces.7.html
  • cgroup v2 docs: https://docs.kernel.org/admin-guide/cgroup-v2.html

Key insights

Containers are composed, not invented: namespaces plus cgroups plus mounts.

Summary

If you can build a minimal container runtime, you understand the core of Docker.

Homework/Exercises to practice the concept

  1. Identify which namespaces are required for a minimal container.
  2. Explain why ps fails without a /proc mount.

Solutions to the homework/exercises

  1. PID and mount namespaces are the minimum for basic isolation.
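  2. ps reads process information from /proc. Without a procfs mounted from inside the new PID namespace, it either has nothing to read or reads the host’s view instead of the container’s.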

3. Project Specification

3.1 What You Will Build

A minimal container runtime that launches a command in new PID, mount, and UTS namespaces, sets a hostname, switches to a new root filesystem, mounts /proc, and applies cgroup limits.

3.2 Functional Requirements

  1. Namespace isolation: PID and mount namespaces at minimum, plus a UTS namespace so that setting the hostname does not change the host’s.
  2. Filesystem isolation: new root filesystem and /proc mount.
  3. Resource limits: CPU and memory limits via cgroups.

3.3 Non-Functional Requirements

  • Performance: container startup in under 1 second.
  • Reliability: cleanup of mounts and cgroups.
  • Usability: clear CLI and error messages.

3.4 Example Usage / Output

$ sudo ./mycontainer run /bin/sh
container# hostname
sandbox
container# ps
PID  USER  CMD
1    root  /bin/sh
2    root  ps

3.5 Data Formats / Schemas / Protocols

  • Root filesystem layout: /bin, /lib, /proc, /tmp.
  • CLI syntax: mycontainer run <command>.
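
A small sketch of how the CLI syntax above could be parsed with argparse. The split into a command plus trailing args is an assumption made for illustration; the source only specifies mycontainer run <command>.

```python
import argparse


def build_parser():
    """Build a parser for: mycontainer run <command> [args...]"""
    parser = argparse.ArgumentParser(prog="mycontainer")
    actions = parser.add_subparsers(dest="action", required=True)
    run = actions.add_parser("run", help="run a command in a new container")
    run.add_argument("command", help="program to start as PID 1")
    # Everything after the command is passed through untouched (assumed behavior).
    run.add_argument("args", nargs=argparse.REMAINDER)
    return parser
```

With this parser, build_parser().parse_args(["run", "/bin/sh"]) yields an action of "run" and a command of "/bin/sh", while a missing subcommand produces a usage error instead of a traceback.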

3.6 Edge Cases

  • Missing root filesystem.
  • Failure to mount /proc.
  • Lack of privileges for namespaces or cgroups.

3.7 Real World Outcome

3.7.1 How to Run (Copy/Paste)

  • Run as root or with sufficient capabilities.
  • From the project root, run: sudo ./mycontainer run /bin/sh

3.7.2 Golden Path Demo (Deterministic)

Use a fixed root filesystem and hostname.

3.7.3 If CLI: Exact terminal transcript

$ sudo ./mycontainer run /bin/sh
container# hostname
sandbox
container# ps
PID  USER  CMD
1    root  /bin/sh
2    root  ps
container# exit
# exit code: 0

Failure demo (deterministic, run with the root filesystem directory absent):

$ sudo ./mycontainer run /bin/sh
error: missing root filesystem
# exit code: 2

Exit codes:

  • 0 success
  • 2 missing root filesystem
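
The exit-code contract above can be enforced with a preflight check that runs before any namespace is created. The sketch below is hypothetical: check_rootfs and preflight are illustrative names, the required directories are taken from the layout in §3.5, and mapping an incomplete rootfs to exit code 2 is an assumption rather than something the specification states.

```python
import os
import sys

EXIT_OK = 0
EXIT_MISSING_ROOTFS = 2
REQUIRED_DIRS = ("bin", "lib", "proc", "tmp")  # layout listed in section 3.5


def check_rootfs(path):
    """Return an error message, or None if the rootfs looks usable."""
    if not os.path.isdir(path):
        return "missing root filesystem"
    missing = [name for name in REQUIRED_DIRS
               if not os.path.isdir(os.path.join(path, name))]
    if missing:
        # Assumption: an incomplete rootfs is reported like a missing one.
        return "root filesystem is missing: " + ", ".join(missing)
    return None


def preflight(rootfs):
    """Print any error to stderr and return the process exit code."""
    error = check_rootfs(rootfs)
    if error:
        print(f"error: {error}", file=sys.stderr)
        return EXIT_MISSING_ROOTFS
    return EXIT_OK
```

Failing early like this keeps the failure demo deterministic, because no mounts or cgroups exist yet that would need cleanup.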

4. Solution Architecture

4.1 High-Level Design

setup namespaces -> pivot root -> mount /proc -> apply cgroups -> exec

4.2 Key Components

Component | Responsibility | Key Decisions
Namespace setup | Create PID and mount namespaces | Use clone/unshare
Rootfs setup | Prepare and switch root | pivot_root preferred
Cgroup manager | Apply CPU/memory limits | Reuse Project 9 logic

4.3 Data Structures (No Full Code)

  • Namespace config: flags for PID, mount, UTS.
  • Rootfs config: path, required directories.

4.4 Algorithm Overview

Key Algorithm: container launch

  1. Create new namespaces.
  2. Set the hostname and switch to the new root filesystem with pivot_root.
  3. Mount /proc inside namespace.
  4. Apply cgroup limits.
  5. Exec target program as PID 1.

Complexity Analysis:

  • Time: O(1) setup plus program runtime.
  • Space: O(1) extra memory.

5. Implementation Guide

5.1 Development Environment Setup

# Run on a Linux system with root privileges

5.2 Project Structure

project-root/
├── src/
│   ├── container.c
│   ├── namespaces.c
│   └── cgroups.c
├── rootfs/
│   ├── bin/
│   └── lib/
└── README.md

5.3 The Core Question You’re Answering

“What is a container in kernel terms?”

5.4 Concepts You Must Understand First

  1. Namespaces
    • Which namespaces are required for isolation?
    • Book Reference: “The Linux Programming Interface” - namespaces sections
  2. Root filesystem setup
    • Why pivot_root is safer than chroot.
    • Book Reference: Linux kernel documentation

5.5 Questions to Guide Your Design

  1. How will you ensure PID 1 reaps children?
  2. How will you clean up mounts on exit?

5.6 Thinking Exercise

The /proc Trap

Explain what ps shows if /proc is not remounted inside the container.

5.7 The Interview Questions They’ll Ask

  1. “How do namespaces differ from VMs?”
  2. “What is PID 1 responsible for?”
  3. “Why is pivot_root used?”
  4. “How do cgroups enforce limits?”
  5. “How does Docker isolate processes?”

5.8 Hints in Layers

Hint 1: Start with unshare. Experiment with the unshare command in the shell to understand its behavior.

Hint 2: Minimal rootfs. Use a tiny rootfs with just a shell and its libraries.

Hint 3: Mount /proc. After entering the namespace, mount procfs.

Hint 4: Debugging. Compare the output of ps inside and outside the container.

5.9 Books That Will Help

Topic | Book | Chapter
Namespaces | “The Linux Programming Interface” | namespaces sections
Containers | “Container Security” by Liz Rice | Ch. 2-3

5.10 Implementation Phases

Phase 1: Foundation (3 days)

Goals:

  • Create PID, mount, and UTS namespaces.

Tasks:

  1. Launch a child with new namespaces.
  2. Set hostname inside namespace.

Checkpoint: hostname differs inside container.

Phase 2: Core Functionality (4 days)

Goals:

  • Set up rootfs and /proc.

Tasks:

  1. Prepare rootfs directory.
  2. pivot_root into new root.
  3. Mount /proc.

Checkpoint: ps shows only container processes.

Phase 3: Polish & Edge Cases (3 days)

Goals:

  • Add cgroup limits and cleanup.

Tasks:

  1. Apply CPU/memory limits.
  2. Clean up mounts on exit.

Checkpoint: cgroup limits are enforced and cleanup is reliable.
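
Reliable cleanup is easier when every setup step registers its own undo action. The sketch below uses contextlib.ExitStack so that completed steps are undone in reverse order, even if a later step or the container body raises. run_with_cleanup is a hypothetical helper, and the setup and undo callables stand in for whatever your runtime actually does (mounting, creating the cgroup directory, and so on).

```python
import contextlib


def run_with_cleanup(steps, body):
    """Run (setup, undo) pairs in order, then body; undo in reverse order.

    Only steps whose setup succeeded are undone, and the undo actions run
    whether body returns normally or raises.
    """
    with contextlib.ExitStack() as stack:
        for setup, undo in steps:
            setup()
            stack.callback(undo)
        return body()
```

In practice the cgroup directory has to be removed by the parent after the container exits, while mounts made inside the container’s mount namespace disappear with the namespace; explicit unmounts matter mainly for anything mounted on the host side.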

5.11 Key Implementation Decisions

Decision | Options | Recommendation | Rationale
Root switch | chroot vs pivot_root | pivot_root | stronger isolation
Namespace set | minimal vs full | PID + mount + UTS | focused scope

6. Testing Strategy

6.1 Test Categories

Category | Purpose | Examples
Unit Tests | Config parsing | validate CLI args
Integration Tests | Full container | run /bin/sh
Edge Case Tests | Missing rootfs | error output

6.2 Critical Test Cases

  1. Isolation: ps shows only container processes.
  2. Rootfs: files outside rootfs are inaccessible.
  3. Limits: CPU/memory caps applied.

6.3 Test Data

Fixed rootfs tree with /bin/sh

7. Common Pitfalls & Debugging

7.1 Frequent Mistakes

Pitfall | Symptom | Solution
No /proc mount | ps shows host | mount procfs inside
chroot only | host mounts visible | use pivot_root
No cleanup | stale mounts | unmount on exit

7.2 Debugging Strategies

  • Compare namespaces: inspect /proc/self/ns inside and outside.
  • Verbose logs: print each setup step.
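
The namespace comparison can be scripted. Each entry under /proc/<pid>/ns is a symbolic link whose target looks like pid:[4026531836]; two processes share a namespace exactly when the targets match. The helper names below are hypothetical, and proc_root is a parameter only so the functions can be pointed at a different procfs mount.

```python
import os


def namespace_ids(pid="self", proc_root="/proc"):
    """Map namespace name to its link target for one process."""
    ns_dir = os.path.join(proc_root, str(pid), "ns")
    return {name: os.readlink(os.path.join(ns_dir, name))
            for name in sorted(os.listdir(ns_dir))}


def changed_namespaces(outside, inside):
    """Return the names of namespaces whose identifiers differ."""
    return sorted(name for name in outside.keys() & inside.keys()
                  if outside[name] != inside[name])
```

Capturing namespace_ids() on the host and again inside the container, then passing both to changed_namespaces, should list at least pid and mnt if isolation is working.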

7.3 Performance Traps

Mount operations are fast; the main cost is process startup and I/O in the rootfs.


8. Extensions & Challenges

8.1 Beginner Extensions

  • Add a UTS namespace for the hostname (if you did not already include it in Phase 1).
  • Add environment variable injection.

8.2 Intermediate Extensions

  • Add network namespace with veth pair.
  • Add read-only rootfs option.

8.3 Advanced Extensions

  • Implement OCI bundle support.
  • Add seccomp syscall filtering.

9. Real-World Connections

9.1 Industry Applications

  • Container runtimes: runc and containerd.
  • Sandboxing: isolate untrusted code.

9.2 Related Open Source Projects

  • runc: https://github.com/opencontainers/runc - reference runtime.
  • containerd: https://github.com/containerd/containerd - container manager.

9.3 Interview Relevance

Containers and namespaces are standard systems interview topics.


10. Resources

10.1 Essential Reading

  • namespaces(7) man page
  • cgroup v2 documentation

10.2 Video Resources

  • “Containers from scratch” talks (search title)

10.3 Tools & Documentation

  • namespaces: https://man7.org/linux/man-pages/man7/namespaces.7.html
  • cgroup v2: https://docs.kernel.org/admin-guide/cgroup-v2.html

11. Self-Assessment Checklist

11.1 Understanding

  • I can explain PID and mount namespaces
  • I can explain pivot_root vs chroot
  • I understand why /proc must be remounted

11.2 Implementation

  • All functional requirements are met
  • Resource limits are applied
  • Cleanup is reliable

11.3 Growth

  • I can explain this project in an interview
  • I documented lessons learned
  • I can propose an extension

12. Submission / Completion Criteria

Minimum Viable Completion:

  • Run a process in PID + mount namespaces
  • Mount /proc inside container
  • Demonstrate isolation

Full Completion:

  • All minimum criteria plus:
  • Apply cgroup limits
  • Failure demo with exit code

Excellence (Going Above & Beyond):

  • OCI bundle support
  • Seccomp filtering