Project 10: The Poor Man’s Docker (Container Runtime)
Build a minimal container runtime that isolates PID and mount namespaces and applies cgroup limits.
Quick Reference
| Attribute | Value |
|---|---|
| Difficulty | Expert |
| Time Estimate | 2 weeks |
| Main Programming Language | Go or C |
| Alternative Programming Languages | Rust, Python |
| Coolness Level | See REFERENCE.md (Level 5) |
| Business Potential | See REFERENCE.md (Level 4) |
| Prerequisites | Process control, mounts, cgroups, root access |
| Key Topics | namespaces, pivot_root, cgroups, PID 1 |
1. Learning Objectives
By completing this project, you will:
- Explain how namespaces provide isolated views of system resources.
- Start a process as PID 1 inside a new PID namespace.
- Mount a new root filesystem and /proc inside the container.
- Apply resource limits using cgroups.
2. All Theory Needed (Per-Concept Breakdown)
Namespaces and Container Assembly
Fundamentals Namespaces isolate a process’s view of global system resources like PIDs, mounts, and hostnames. Cgroups limit resource usage. A container is a process launched in new namespaces with limits and a dedicated root filesystem. The kernel is shared, but the process sees a different world. Understanding these primitives reveals that containers are not magic; they are controlled views and budgets built from standard kernel features.
Deep Dive A container runtime is a program that orchestrates a precise sequence of syscalls to set up isolation. The first step is to create a process in new namespaces. PID namespaces provide an isolated process ID tree where the first process becomes PID 1 inside the container. Mount namespaces provide a separate mount table so the container can have its own filesystem layout. UTS namespaces isolate hostname and domain name. Network namespaces, if used, isolate network interfaces. The runtime can create these namespaces using clone or unshare and then configure them before launching the target program.
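The namespace selection described above comes down to OR-ing a few flag bits and passing them to unshare(2) or clone(2). The following is a minimal sketch in Python (one of the alternative languages listed for this project), not a definitive implementation. The flag values are the ones defined in the Linux headers; the helper names `namespace_flags` and `unshare` are hypothetical, and the call through ctypes assumes a glibc-based Linux system and sufficient privileges.

```python
import ctypes
import os

# Flag values from <linux/sched.h>.
CLONE_FLAGS = {
    "mount": 0x00020000,  # CLONE_NEWNS
    "uts": 0x04000000,    # CLONE_NEWUTS
    "ipc": 0x08000000,    # CLONE_NEWIPC
    "pid": 0x20000000,    # CLONE_NEWPID
    "net": 0x40000000,    # CLONE_NEWNET
}


def namespace_flags(names):
    """OR together the clone flags for the requested namespace names."""
    flags = 0
    for name in names:
        flags |= CLONE_FLAGS[name]
    return flags


def unshare(flags):
    """Call unshare(2) through libc (hypothetical helper; needs privileges)."""
    libc = ctypes.CDLL(None, use_errno=True)
    if libc.unshare(flags) != 0:
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err))


if __name__ == "__main__":
    # Flags for the minimal set used in this project plus UTS for the hostname.
    print(hex(namespace_flags(["pid", "mount", "uts"])))
```

One detail worth keeping in mind: unshare with CLONE_NEWPID does not move the calling process into the new PID namespace. The next child it forks becomes PID 1 there, so a runtime built on unshare still needs a fork before exec.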
The root filesystem is a critical piece. The runtime prepares a directory tree that contains the binaries and libraries needed by the container. It then switches the root to this tree using a root pivot operation. This ensures that the process sees the new filesystem as /. A common mistake is to use only chroot, which changes the root directory but does not isolate mount points or prevent access to file descriptors that reference the old root. pivot_root is the more complete approach because it swaps the old root and new root, allowing the old root to be unmounted.
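The root switch is easier to reason about as an ordered list of operations. The sketch below only builds that list as data so the order can be inspected and tested without privileges; it does not execute anything. The function name `pivot_plan` and the `.old_root` directory name are assumptions made for illustration, while the mount flag values are those from the Linux headers.

```python
import os

# Flag values from <sys/mount.h>.
MS_BIND = 4096
MS_REC = 16384
MS_PRIVATE = 1 << 18
MNT_DETACH = 2


def pivot_plan(rootfs, old_name=".old_root"):
    """Return the ordered operations for switching root into `rootfs`."""
    put_old = os.path.join(rootfs, old_name)
    return [
        # Make the current mount tree private so changes do not propagate
        # back to the host; pivot_root also refuses shared propagation.
        ("mount", None, "/", None, MS_REC | MS_PRIVATE),
        # pivot_root requires new_root to be a mount point, so bind-mount
        # the rootfs directory onto itself.
        ("mount", rootfs, rootfs, None, MS_BIND | MS_REC),
        ("mkdir", put_old),
        ("pivot_root", rootfs, put_old),
        ("chdir", "/"),
        # The old root is now visible under the new root; detach and remove it.
        ("umount2", "/" + old_name, MNT_DETACH),
        ("rmdir", "/" + old_name),
    ]


if __name__ == "__main__":
    for step in pivot_plan("/srv/rootfs"):
        print(step)
```

Keeping the plan separate from its execution is a design choice for this sketch: a real runtime would perform each step with the corresponding syscall inside the new mount namespace and stop at the first failure.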
The /proc filesystem must be mounted inside the container’s mount namespace. Without this, tools like ps will show host processes or fail entirely. Mounting procfs inside the namespace ensures that process introspection reflects the container’s PID namespace. This is a key insight: /proc reflects the namespace in which it is mounted, so it must be remounted after the namespace is created.
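To see why the mount matters, it helps to remember that tools like ps build their process list from the numeric directory names under /proc. The sketch below does the same thing in a few lines; the function name `visible_pids` and its `proc_root` parameter are hypothetical and exist only so the idea can be tried against any directory.

```python
import os


def visible_pids(proc_root="/proc"):
    """Return the PIDs listed in a procfs mount: its numeric directory names."""
    return sorted(int(name) for name in os.listdir(proc_root) if name.isdigit())


if __name__ == "__main__":
    # On the host this prints many PIDs. Inside a container with its own
    # procfs mount it should print only the container's processes.
    print(visible_pids())
```

If the container still sees the host's procfs, this function returns host PIDs even though the process itself lives in a new PID namespace.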
Cgroups enforce resource limits. A container without cgroups is not a safe isolation boundary because it can still consume unlimited CPU or memory. The runtime should create a cgroup, set limits, and attach the container process to it. This ensures that the kernel enforces resource budgets. For deterministic demos, use fixed limits and a controlled workload.
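With cgroup v2, creating a cgroup, setting limits, and attaching a process are all plain file writes. The sketch below assumes cgroup v2 and that the cpu and memory controllers are enabled for the parent cgroup; the function names and the percent-based CPU parameter are hypothetical conveniences, while cpu.max, memory.max, and cgroup.procs are the interface files described in the cgroup v2 documentation.

```python
import os


def cpu_max(percent, period_us=100_000):
    """Format a cpu.max value: '<quota> <period>' in microseconds."""
    return f"{period_us * percent // 100} {period_us}"


def apply_limits(cgroup_dir, pid, cpu_percent, memory_bytes):
    """Create the cgroup directory, write limits, then attach the process."""
    os.makedirs(cgroup_dir, exist_ok=True)
    settings = (
        ("cpu.max", cpu_max(cpu_percent)),
        ("memory.max", str(memory_bytes)),
        ("cgroup.procs", str(pid)),  # attach last, once limits are in place
    )
    for filename, value in settings:
        with open(os.path.join(cgroup_dir, filename), "w") as f:
            f.write(value)


if __name__ == "__main__":
    # Prints "50000 100000": half of one CPU per 100 ms period.
    print(cpu_max(50))
```

On a real system `cgroup_dir` would be a directory under /sys/fs/cgroup and the writes require root. Writing the limits before attaching the process means the workload never runs unconstrained.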
PID 1 behavior is special. In a PID namespace, the first process has PID 1 and is responsible for reaping zombies and handling signals correctly. Many subtle container bugs appear because PID 1 behaves differently from normal processes: signals sent from inside the namespace reach it only if it has installed a handler for them, orphaned processes are reparented to it and stay zombies until it waits on them, and when it exits the kernel kills every other process in the namespace. A minimal runtime should be aware of this and either run a simple init process or implement basic reaping itself.
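Basic reaping can be as small as a wait loop. The sketch below assumes the runtime itself stays as PID 1 and forks the target program as its main child; the function name `reap_until_exit` is hypothetical. It blocks until the main child exits and collects any other children that terminate in the meantime.

```python
import os


def reap_until_exit(main_pid):
    """Block until main_pid exits, reaping any other children along the way."""
    while True:
        pid, status = os.wait()
        if pid != main_pid:
            continue  # an orphan or helper process: reaped, nothing else to do
        if os.WIFEXITED(status):
            return os.WEXITSTATUS(status)
        # Killed by a signal: follow the shell convention of 128 + signal number.
        return 128 + os.WTERMSIG(status)


if __name__ == "__main__":
    child = os.fork()
    if child == 0:
        os._exit(7)  # stand-in for the container's main program
    print(reap_until_exit(child))
```

Returning the main child's exit status lets the runtime pass it on as its own exit code, which keeps the failure demo and scripted use predictable.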
Container assembly is therefore an orchestration of isolation (namespaces), filesystem setup (pivot_root and mounts), and resource control (cgroups). Each step has a clear contract and observable outcome, which is why this project is a deep test of your systems understanding.
How this fits into the project You will apply this concept in §3.1 for container requirements, in §4.1 for runtime architecture, and in §5.10 for the implementation phases. It builds directly on P09-cgroup-resource-governor.md.
Definitions & key terms
- Namespace: Kernel feature that isolates a resource view.
- PID 1: First process in a PID namespace.
- pivot_root: Switches the root filesystem.
- Mount namespace: Isolates mount table and filesystem view.
- Container: Process with isolated namespaces and limits.
Mental model diagram
Host kernel
|
+-- Namespace set -> process (PID 1)
| |
| +-- mount /proc
| +-- new root filesystem
How it works
- Create new namespaces for PID and mount.
- Set up root filesystem and pivot.
- Mount procfs inside namespace.
- Apply cgroup limits.
- Launch target program as PID 1.
Minimal concrete example
Inside container:
PID 1 -> /bin/sh
hostname -> sandbox
ps -> shows only container processes
Common misconceptions
- “Containers are lightweight VMs.” They share the host kernel.
- “chroot equals container.” chroot does not isolate mounts or PIDs.
- “cgroups are optional.” They are required for resource isolation.
Check-your-understanding questions
- Why must /proc be mounted inside the namespace?
- What is special about PID 1?
- Why is pivot_root safer than chroot?
- How do cgroups and namespaces complement each other?
Check-your-understanding answers
- /proc reflects the PID namespace where it is mounted.
- PID 1 must reap zombies and handles signals differently.
- pivot_root swaps roots and allows old root to be unmounted.
- Namespaces isolate views; cgroups enforce resource limits.
Real-world applications
- Container runtimes and orchestrators.
- Sandbox environments for untrusted code.
- Lightweight isolation in CI pipelines.
Where you’ll apply it
- See §3.1 What You Will Build and §4.2 Key Components.
- Also used in: P09-cgroup-resource-governor.md
References
- namespaces(7) man page: https://man7.org/linux/man-pages/man7/namespaces.7.html
- cgroup v2 docs: https://docs.kernel.org/admin-guide/cgroup-v2.html
Key insights Containers are composed, not invented: namespaces plus cgroups plus mounts.
Summary If you can build a minimal container runtime, you understand the core of Docker.
Homework/Exercises to practice the concept
- Identify which namespaces are required for a minimal container.
- Explain why ps fails without a /proc mount.
Solutions to the homework/exercises
- PID and mount namespaces are the minimum for basic isolation.
- /proc must be mounted inside the namespace to show container processes.
3. Project Specification
3.1 What You Will Build
A minimal container runtime that launches a command in new PID and mount namespaces, sets a hostname, mounts a new root filesystem, mounts /proc, and applies cgroup limits.
3.2 Functional Requirements
- Namespace isolation: PID and mount namespaces at minimum.
- Filesystem isolation: new root filesystem and /proc mount.
- Resource limits: CPU and memory limits via cgroups.
3.3 Non-Functional Requirements
- Performance: container startup in under 1 second.
- Reliability: cleanup of mounts and cgroups.
- Usability: clear CLI and error messages.
3.4 Example Usage / Output
$ sudo ./mycontainer run /bin/sh
container# hostname
sandbox
container# ps
PID USER CMD
1 root /bin/sh
2 root ps
3.5 Data Formats / Schemas / Protocols
- Root filesystem layout: /bin, /lib, /proc, /tmp.
- CLI syntax: mycontainer run <command>.
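The CLI syntax above maps naturally onto a subcommand parser. This is a minimal sketch using Python's argparse; the `build_parser` name and the extra `args` field for arguments passed to the command are assumptions beyond what the specification states.

```python
import argparse


def build_parser():
    """Parser for: mycontainer run <command> [args...]"""
    parser = argparse.ArgumentParser(prog="mycontainer")
    sub = parser.add_subparsers(dest="action", required=True)
    run = sub.add_parser("run", help="run a command in a new container")
    run.add_argument("command", help="program to execute as PID 1")
    # Everything after the command is handed to it untouched (assumed extension).
    run.add_argument("args", nargs=argparse.REMAINDER,
                     help="arguments passed to the command")
    return parser


if __name__ == "__main__":
    ns = build_parser().parse_args()
    print(ns.action, ns.command, ns.args)
```

Using a subparser for run leaves room for later subcommands without changing how existing invocations are parsed.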
3.6 Edge Cases
- Missing root filesystem.
- Failure to mount /proc.
- Lack of privileges for namespaces or cgroups.
3.7 Real World Outcome
3.7.1 How to Run (Copy/Paste)
- Run as root or with sufficient capabilities.
- Execute ./mycontainer run /bin/sh in the project root.
3.7.2 Golden Path Demo (Deterministic)
Use a fixed root filesystem and hostname.
3.7.3 If CLI: Exact terminal transcript
$ sudo ./mycontainer run /bin/sh
container# hostname
sandbox
container# ps
PID USER CMD
1 root /bin/sh
2 root ps
container# exit
# exit code: 0
Failure demo (deterministic):
$ sudo ./mycontainer run /bin/sh
error: missing root filesystem
# exit code: 2
Exit codes:
- 0 success
- 2 missing root filesystem
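The exit-code contract above can be enforced with a single early check before any namespace work starts. In this sketch the constant names and the `check_rootfs` function are hypothetical, and "missing" is taken to mean only that the directory does not exist; the error text and the codes 0 and 2 come from the transcript.

```python
import os
import sys

EXIT_OK = 0
EXIT_MISSING_ROOTFS = 2


def check_rootfs(path):
    """Return the exit code the runtime should use for this rootfs path."""
    if not os.path.isdir(path):
        print("error: missing root filesystem", file=sys.stderr)
        return EXIT_MISSING_ROOTFS
    return EXIT_OK


if __name__ == "__main__":
    # Assumes the rootfs lives in ./rootfs, as in the project structure.
    sys.exit(check_rootfs("rootfs"))
```

Failing before unshare or any mount call means there is nothing to clean up on this path, which keeps the failure demo deterministic.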
4. Solution Architecture
4.1 High-Level Design
setup namespaces -> pivot root -> mount /proc -> apply cgroups -> exec
4.2 Key Components
| Component | Responsibility | Key Decisions |
|---|---|---|
| Namespace setup | Create PID and mount namespaces | Use clone/unshare |
| Rootfs setup | Prepare and switch root | pivot_root preferred |
| Cgroup manager | Apply CPU/memory limits | Reuse Project 9 logic |
4.3 Data Structures (No Full Code)
- Namespace config: flags for PID, mount, UTS.
- Rootfs config: path, required directories.
4.4 Algorithm Overview
Key Algorithm: container launch
- Create new namespaces.
- Set hostname and mount root filesystem.
- Mount /proc inside namespace.
- Apply cgroup limits.
- Exec target program as PID 1.
Complexity Analysis:
- Time: O(1) setup plus program runtime.
- Space: O(1) extra memory.
5. Implementation Guide
5.1 Development Environment Setup
# Run on a Linux system with root privileges
5.2 Project Structure
project-root/
├── src/
│ ├── container.c
│ ├── namespaces.c
│ └── cgroups.c
├── rootfs/
│ ├── bin/
│ └── lib/
└── README.md
5.3 The Core Question You’re Answering
“What is a container in kernel terms?”
5.4 Concepts You Must Understand First
- Namespaces
- Which namespaces are required for isolation?
- Book Reference: “The Linux Programming Interface” - namespaces sections
- Root filesystem setup
- Why pivot_root is safer than chroot.
- Book Reference: Linux kernel documentation
5.5 Questions to Guide Your Design
- How will you ensure PID 1 reaps children?
- How will you clean up mounts on exit?
5.6 Thinking Exercise
The /proc Trap
Explain what ps shows if /proc is not remounted inside the container.
5.7 The Interview Questions They’ll Ask
- “How do namespaces differ from VMs?”
- “What is PID 1 responsible for?”
- “Why is pivot_root used?”
- “How do cgroups enforce limits?”
- “How does Docker isolate processes?”
5.8 Hints in Layers
Hint 1: Start with unshare Experiment with unshare in the shell to understand behavior.
Hint 2: Minimal rootfs Use a tiny rootfs with just a shell and libraries.
Hint 3: Mount /proc After entering the namespace, mount procfs.
Hint 4: Debugging Compare ps inside and outside the container.
5.9 Books That Will Help
| Topic | Book | Chapter |
|---|---|---|
| Namespaces | “The Linux Programming Interface” | namespaces sections |
| Containers | “Container Security” by Liz Rice | Ch. 2-3 |
5.10 Implementation Phases
Phase 1: Foundation (3 days)
Goals:
- Create PID and mount namespaces.
Tasks:
- Launch a child with new namespaces.
- Set hostname inside namespace.
Checkpoint: hostname differs inside container.
Phase 2: Core Functionality (4 days)
Goals:
- Set up rootfs and /proc.
Tasks:
- Prepare rootfs directory.
- pivot_root into new root.
- Mount /proc.
Checkpoint: ps shows only container processes.
Phase 3: Polish & Edge Cases (3 days)
Goals:
- Add cgroup limits and cleanup.
Tasks:
- Apply CPU/memory limits.
- Clean up mounts on exit.
Checkpoint: cgroup limits are enforced and cleanup is reliable.
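One way to make cleanup reliable is to tie the cgroup's lifetime to a scope, so the directory is removed even when the container run raises an error. The sketch below uses a context manager; the name `scoped_cgroup` is hypothetical. It relies on the cgroup v2 rule that a cgroup with no children and no live processes can be destroyed by removing its directory.

```python
import contextlib
import os


@contextlib.contextmanager
def scoped_cgroup(path):
    """Create a cgroup directory and remove it when the block exits."""
    os.mkdir(path)
    try:
        yield path
    finally:
        # On cgroupfs this fails with EBUSY while processes remain,
        # so wait for the container to exit before leaving the block.
        os.rmdir(path)
```

A usage sketch, assuming a hypothetical cgroup name: `with scoped_cgroup("/sys/fs/cgroup/mycontainer") as cg:` followed by writing limits, launching the container, and waiting for it to exit inside the block.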
5.11 Key Implementation Decisions
| Decision | Options | Recommendation | Rationale |
|---|---|---|---|
| Root switch | chroot vs pivot_root | pivot_root | stronger isolation |
| Namespace set | minimal vs full | PID + mount + UTS | focused scope |
6. Testing Strategy
6.1 Test Categories
| Category | Purpose | Examples |
|---|---|---|
| Unit Tests | Config parsing | validate CLI args |
| Integration Tests | Full container run | /bin/sh |
| Edge Case Tests | Missing rootfs | error output |
6.2 Critical Test Cases
- Isolation: ps shows only container processes.
- Rootfs: files outside rootfs are inaccessible.
- Limits: CPU/memory caps applied.
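The isolation check can be automated by parsing captured ps output. This sketch assumes the three-column PID/USER/CMD layout shown in the example transcript; the function names `parse_ps` and `looks_isolated` are hypothetical, and other ps implementations print different columns.

```python
def parse_ps(output):
    """Parse 'PID USER CMD' lines into (pid, user, cmd) tuples."""
    rows = []
    for line in output.strip().splitlines()[1:]:  # skip the header line
        pid, user, cmd = line.split(None, 2)
        rows.append((int(pid), user, cmd))
    return rows


def looks_isolated(rows, init_cmd):
    """True if PID 1 is the container's command rather than the host's init."""
    return any(pid == 1 and cmd == init_cmd for pid, _, cmd in rows)


if __name__ == "__main__":
    sample = "PID USER CMD\n1 root /bin/sh\n2 root ps\n"
    print(parse_ps(sample))
```

An integration test could run ps inside the container, feed its output to `parse_ps`, and assert both that PID 1 is the launched command and that the process count is as small as expected.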
6.3 Test Data
Fixed rootfs tree with /bin/sh
7. Common Pitfalls & Debugging
7.1 Frequent Mistakes
| Pitfall | Symptom | Solution |
|---|---|---|
| No /proc mount | ps shows host | mount procfs inside |
| chroot only | host mounts visible | use pivot_root |
| No cleanup | stale mounts | unmount on exit |
7.2 Debugging Strategies
- Compare namespaces: inspect /proc/self/ns inside and outside.
- Verbose logs: print each setup step.
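Each entry under /proc/<pid>/ns is a symbolic link whose target names the namespace type and its inode number, so two processes share a namespace exactly when the link targets match. The sketch below reads those links and reports which namespaces differ; the function names and the `proc_root` parameter are hypothetical.

```python
import os


def namespace_ids(pid="self", proc_root="/proc"):
    """Map namespace name to link target, e.g. {'pid': 'pid:[4026531836]'}."""
    ns_dir = os.path.join(proc_root, str(pid), "ns")
    return {
        name: os.readlink(os.path.join(ns_dir, name))
        for name in sorted(os.listdir(ns_dir))
    }


def differing(a, b):
    """Names of namespaces present in both mappings with different IDs."""
    return sorted(name for name in a if name in b and a[name] != b[name])


if __name__ == "__main__":
    for name, target in namespace_ids().items():
        print(name, target)
```

Run from the host against the runtime's own PID and the container's PID, `differing` should list at least pid and mnt once Phase 1 works.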
7.3 Performance Traps
Mount operations are fast; the main cost is process startup and I/O in the rootfs.
8. Extensions & Challenges
8.1 Beginner Extensions
- Add UTS namespace for hostname.
- Add environment variable injection.
8.2 Intermediate Extensions
- Add network namespace with veth pair.
- Add read-only rootfs option.
8.3 Advanced Extensions
- Implement OCI bundle support.
- Add seccomp syscall filtering.
9. Real-World Connections
9.1 Industry Applications
- Container runtimes: runc and containerd.
- Sandboxing: isolate untrusted code.
9.2 Related Open Source Projects
- runc: https://github.com/opencontainers/runc - reference runtime.
- containerd: https://github.com/containerd/containerd - container manager.
9.3 Interview Relevance
Containers and namespaces are standard systems interview topics.
10. Resources
10.1 Essential Reading
- namespaces(7) man page
- cgroup v2 documentation
10.2 Video Resources
- “Containers from scratch” talks (search title)
10.3 Tools & Documentation
- namespaces: https://man7.org/linux/man-pages/man7/namespaces.7.html
- cgroup v2: https://docs.kernel.org/admin-guide/cgroup-v2.html
10.4 Related Projects in This Series
- P09-cgroup-resource-governor.md: the cgroup logic reused by this project.
11. Self-Assessment Checklist
11.1 Understanding
- I can explain PID and mount namespaces
- I can explain pivot_root vs chroot
- I understand why /proc must be remounted
11.2 Implementation
- All functional requirements are met
- Resource limits are applied
- Cleanup is reliable
11.3 Growth
- I can explain this project in an interview
- I documented lessons learned
- I can propose an extension
12. Submission / Completion Criteria
Minimum Viable Completion:
- Run a process in PID + mount namespaces
- Mount /proc inside container
- Demonstrate isolation
Full Completion:
- All minimum criteria plus:
- Apply cgroup limits
- Failure demo with exit code
Excellence (Going Above & Beyond):
- OCI bundle support
- Seccomp filtering