Project 13: Containers from Scratch
Build a minimal container runtime using namespaces and cgroups.
Quick Reference
| Attribute | Value |
|---|---|
| Difficulty | Advanced |
| Time Estimate | 18-28 hours |
| Main Programming Language | C |
| Alternative Programming Languages | Go (common for containers) |
| Coolness Level | Very High |
| Business Potential | High (container tooling) |
| Prerequisites | syscalls, mount, process model |
| Key Topics | namespaces, cgroups, chroot/pivot_root |
1. Learning Objectives
By completing this project, you will:
- Create new PID, mount, and network namespaces.
- Apply cgroup limits for CPU and memory.
- Run a process with an isolated root filesystem.
- Explain why containers are not VMs.
2. All Theory Needed (Per-Concept Breakdown)
Namespaces, cgroups, and Process Isolation
Fundamentals
Containers isolate processes using kernel features, not hardware virtualization. Namespaces give a process its own view of system resources: PID namespace isolates process IDs, mount namespace isolates filesystem view, and network namespace isolates network interfaces. Cgroups limit and account resource usage like CPU and memory. Together, they provide isolation and resource control for a group of processes while still sharing the host kernel.
Deep Dive into the concept
Namespaces virtualize specific kernel resources. A PID namespace makes processes see their own PID numbers; the first process in the namespace becomes PID 1 inside the container. This process is responsible for reaping zombies inside the container. A mount namespace allows you to create a new filesystem view: you can mount a new /proc and change the root via pivot_root or chroot. A network namespace provides its own network stack, interfaces, and routing tables. This is why containers can have isolated IPs without being full VMs.
Cgroups (control groups) manage resource limits. In cgroup v2, you write values to files like memory.max or cpu.max inside the cgroup filesystem. The kernel enforces those limits for processes assigned to the cgroup. This allows you to ensure a container cannot consume unlimited resources. Cgroups also provide accounting, so you can measure how much memory or CPU time a container used.
A minimal container runtime uses the clone syscall with namespace flags (CLONE_NEWPID, CLONE_NEWNS, CLONE_NEWNET, etc.), then configures the environment. With clone(CLONE_NEWPID), the new child is already PID 1 inside the container. If you call unshare(CLONE_NEWPID) instead, the caller stays in its original PID namespace, so it must fork, and that first child becomes PID 1. For mount namespaces, you must mount a new /proc. For filesystem isolation, you create a root filesystem directory and use pivot_root or chroot to switch to it. You must also handle UID/GID mapping if you want user namespaces (optional but recommended for safety).
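The clone step can be sketched as a small wrapper that allocates a child stack, calls clone() with caller-supplied namespace flags, and waits for the child. The names `run_cloned` and `child_exit_code` and the 1 MiB stack size are assumptions made for illustration, not part of any existing runtime.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sched.h>
#include <signal.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>

#define STACK_SIZE (1024 * 1024)

/* Example child: its return value becomes the child's exit status. */
int child_exit_code(void *arg)
{
    return *(int *)arg;
}

/* Run fn(arg) in a child created by clone() with the given namespace flags.
 * Returns the child's exit status, or -1 if clone/wait failed or the child
 * did not exit normally. */
int run_cloned(int (*fn)(void *), void *arg, int ns_flags)
{
    assert(fn != NULL);

    char *stack = mmap(NULL, STACK_SIZE, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS | MAP_STACK, -1, 0);
    if (stack == MAP_FAILED)
        return -1;

    /* The stack grows downward on common architectures, so pass its top.
     * SIGCHLD lets the parent wait for the child with waitpid(). */
    pid_t pid = clone(fn, stack + STACK_SIZE, ns_flags | SIGCHLD, arg);

    int status = -1;
    if (pid > 0 && waitpid(pid, &status, 0) == pid && WIFEXITED(status))
        status = WEXITSTATUS(status);
    else
        status = -1;

    munmap(stack, STACK_SIZE);
    return status;
}
```

Passing `CLONE_NEWPID | CLONE_NEWNS` as `ns_flags` normally requires CAP_SYS_ADMIN (in practice, running as root) and fails with EPERM otherwise. Passing 0 exercises the same code path without creating any namespace, which is handy while debugging.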
This project reveals the difference between containers and VMs: containers share the kernel, so their isolation boundaries are limited to what the kernel provides. A container escape is possible if the kernel has vulnerabilities. That is why containers are lighter than VMs but provide weaker isolation.
How this fits in the projects
This concept is used in Section 3.2 and Section 3.7 and builds on Project 6 (process model) and Project 11 (IPC).
Definitions & key terms
- Namespace: kernel feature that isolates a resource view.
- cgroup: resource control and accounting group.
- pivot_root: switch to a new root filesystem.
- User namespace: maps container UIDs to host UIDs.
Mental model diagram (ASCII)
Host kernel
|-- container A (PID ns, mount ns, net ns)
|-- container B (PID ns, mount ns, net ns)
How it works (step-by-step)
- clone() with namespace flags.
- In child, set hostname (this needs CLONE_NEWUTS, otherwise the host's hostname changes) and mount /proc.
- Set up root filesystem (chroot/pivot_root).
- Configure cgroup limits (simplest from the parent, which still sees the host's cgroup filesystem).
- exec target command.
Minimal concrete example
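/* stack_top = stack + STACK_SIZE: clone() expects the top of the child's stack, which grows downward */
pid_t pid = clone(child_fn, stack_top, CLONE_NEWPID | CLONE_NEWNS | SIGCHLD, NULL);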
Common misconceptions
- “Containers are VMs”: they share the host kernel.
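- “PID 1 behaves like a normal process”: it must reap orphaned zombies, and signals sent from inside the namespace are ignored unless it has installed a handler for them.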
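The reaping duty of PID 1 can be sketched as a wait loop. The function name `reap_all` is hypothetical, and this version keeps waiting until no children remain, which is a simplification: a real init would usually exit as soon as its main child does.

```c
#include <assert.h>
#include <errno.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Reap every child of the calling process until none are left.
 * Stores the exit status of `main_child` (or 128 + signal number if it was
 * killed) in *main_status and returns the number of children reaped. */
int reap_all(pid_t main_child, int *main_status)
{
    int reaped = 0;
    assert(main_status != NULL);
    *main_status = -1;

    for (;;) {
        int status;
        pid_t pid = waitpid(-1, &status, 0);  /* -1: wait for any child */
        if (pid < 0) {
            if (errno == EINTR)
                continue;                     /* interrupted, try again */
            break;                            /* ECHILD: no children left */
        }
        reaped++;
        if (pid == main_child)
            *main_status = WIFEXITED(status) ? WEXITSTATUS(status)
                                             : 128 + WTERMSIG(status);
    }
    return reaped;
}
```

Inside the container, orphaned processes are reparented to PID 1, so a loop like this also collects processes that PID 1 never forked itself.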
Check-your-understanding questions
- Why does a container need its own /proc mount?
- What does a cgroup limit do?
- How is a container different from a VM?
Check-your-understanding answers
- A /proc mount reflects the PID namespace of the process that mounted it; without a fresh mount, tools like ps read the host's /proc and show host PIDs.
- It caps resources like memory or CPU.
- Containers share the host kernel; VMs have separate kernels.
Real-world applications
- Docker, runc, Kubernetes.
Where you’ll apply it
- This project: Section 3.2, Section 3.7, Section 5.10 Phase 2.
- Also used in: Project 6.
References
- TLPI Ch. 40, 42
- Docker/runc design docs
Key insights
Containers are a kernel feature composition, not a hardware abstraction.
Summary
By implementing namespaces and cgroups, you build the core of a container runtime.
Homework/Exercises to practice the concept
- Add user namespace mapping.
- Add network namespace with veth pair.
- Add cgroup CPU limit.
Solutions to the homework/exercises
- Write uid_map and gid_map files.
- Use `ip link add` to create a veth pair and move one end into the container's network namespace.
- Write `cpu.max` with quota/period.
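For the first exercise, a minimal sketch of the mapping files might look like the following. The helper names (`format_id_map`, `map_root_to`) are hypothetical. The sketch assumes the child was created with CLONE_NEWUSER and that the parent maps container UID/GID 0 to a single host ID. Writing "deny" to setgroups before gid_map is what an unprivileged parent has to do on current kernels.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Format one map line: "<id inside ns> <id outside ns> <range length>\n". */
int format_id_map(char *buf, size_t len,
                  unsigned inside, unsigned outside, unsigned count)
{
    assert(buf != NULL);
    int n = snprintf(buf, len, "%u %u %u\n", inside, outside, count);
    return (n > 0 && (size_t)n < len) ? 0 : -1;
}

/* Write `text` to an existing file in a single write() call. */
static int write_text(const char *path, const char *text)
{
    int fd = open(path, O_WRONLY);
    if (fd < 0)
        return -1;
    ssize_t len = (ssize_t)strlen(text);
    int ok = write(fd, text, (size_t)len) == len;
    close(fd);
    return ok ? 0 : -1;
}

/* Map UID 0 / GID 0 inside the child's user namespace to the given host IDs. */
int map_root_to(pid_t child, uid_t host_uid, gid_t host_gid)
{
    char path[64], map[64];

    snprintf(path, sizeof path, "/proc/%ld/uid_map", (long)child);
    if (format_id_map(map, sizeof map, 0, (unsigned)host_uid, 1) != 0 ||
        write_text(path, map) != 0)
        return -1;

    /* Must happen before gid_map when the writer is unprivileged. */
    snprintf(path, sizeof path, "/proc/%ld/setgroups", (long)child);
    if (write_text(path, "deny") != 0)
        return -1;

    snprintf(path, sizeof path, "/proc/%ld/gid_map", (long)child);
    if (format_id_map(map, sizeof map, 0, (unsigned)host_gid, 1) != 0 ||
        write_text(path, map) != 0)
        return -1;
    return 0;
}
```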
3. Project Specification
3.1 What You Will Build
A minimal container runtime that launches a command in isolated PID/mount/network namespaces with cgroup limits and a minimal root filesystem.
3.2 Functional Requirements
- Create PID and mount namespaces.
- Mount /proc inside the container.
- Set up a root filesystem and chroot/pivot_root.
- Apply cgroup memory/CPU limits.
3.3 Non-Functional Requirements
- Performance: container starts in <1 second.
- Reliability: the cgroup directory is removed on exit; namespaces are released by the kernel once their last process exits.
- Usability: `./mini_container /bin/bash`.
3.4 Example Usage / Output
$ sudo ./mini_container /bin/bash
[container] pid: 1
[container] hostname: mini
3.5 Data Formats / Schemas / Protocols
- Cgroup v2 files: `memory.max`, `cpu.max`.
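To make the two formats concrete: `memory.max` takes a byte count, and `cpu.max` takes a quota and a period in microseconds separated by a space. The sketch below builds both strings. The function names and the idea of expressing the CPU limit as a percentage of one CPU are assumptions for illustration.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Build the cpu.max value "<quota_us> <period_us>" for a limit given as a
 * percentage of one CPU (values above 100 allow more than one CPU). */
int format_cpu_max(char *buf, size_t len, int percent, long period_us)
{
    assert(buf != NULL);
    if (percent <= 0 || period_us <= 0)
        return -1;
    long quota_us = period_us * percent / 100;
    int n = snprintf(buf, len, "%ld %ld", quota_us, period_us);
    return (n > 0 && (size_t)n < len) ? 0 : -1;
}

/* Build the memory.max value: a plain byte count. */
int format_memory_max(char *buf, size_t len, size_t bytes)
{
    assert(buf != NULL);
    int n = snprintf(buf, len, "%zu", bytes);
    return (n > 0 && (size_t)n < len) ? 0 : -1;
}
```

With a 100000 microsecond period, a 50% limit becomes `50000 100000`, and a 128 MiB memory limit becomes `134217728`.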
3.6 Edge Cases
- Missing root filesystem.
- Cgroup controller not enabled.
- Command exits immediately.
3.7 Real World Outcome
3.7.1 How to Run (Copy/Paste)
sudo ./mini_container /bin/bash
3.7.2 Golden Path Demo (Deterministic)
- Fixed rootfs and resource limits (memory=128M, cpu=50%).
3.7.3 If CLI: exact terminal transcript
$ sudo ./mini_container /bin/bash
[container] pid: 1
[container] hostname: mini
$ ps
PID TTY TIME CMD
1 pts/0 00:00:00 bash
Failure demo (deterministic):
$ sudo ./mini_container /nope
error: exec failed (ENOENT)
Exit codes:
- 0: success
- 2: invalid args
- 3: namespace/cgroup error
4. Solution Architecture
4.1 High-Level Design
CLI -> clone() -> namespace setup -> cgroup setup -> exec
4.2 Key Components
| Component | Responsibility | Key Decisions |
|---|---|---|
| Namespace setup | PID/mount/net | clone flags |
| Rootfs setup | chroot/pivot_root | minimal rootfs |
| Cgroup setup | memory/cpu limits | cgroup v2 |
4.3 Data Structures (No Full Code)
struct limits {
    size_t memory_max;  /* memory limit, written to memory.max */
    int cpu_quota;      /* CPU limit, used to derive cpu.max */
};
4.4 Algorithm Overview
Key Algorithm: container launch
- clone child with namespaces.
- setup mounts and rootfs.
- configure cgroups.
- exec command.
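One ordering detail in these steps is that the child should not exec until the parent has finished its side of the setup, such as placing the child in its cgroup. A common way to enforce that is a pipe used as a barrier. The sketch below uses fork() instead of clone() to stay short, and every name in it (`launch_synced`, `record_child`, `fail_setup`, `child_returns_42`) is hypothetical.

```c
#include <assert.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

static pid_t seen_child;

/* Example callbacks used to exercise launch_synced(). */
int record_child(pid_t child) { seen_child = child; return 0; }
int fail_setup(pid_t child)   { (void)child; return -1; }
int child_returns_42(void)    { return 42; }

/* Fork a child that blocks until parent_setup(child_pid) has succeeded.
 * Returns the child's exit status: child_main()'s result on success, or 3
 * if the parent's setup failed. Returns -1 on pipe/fork/wait errors. */
int launch_synced(int (*parent_setup)(pid_t), int (*child_main)(void))
{
    int pipefd[2];
    assert(parent_setup != NULL && child_main != NULL);
    if (pipe(pipefd) != 0)
        return -1;

    pid_t pid = fork();
    if (pid < 0) {
        close(pipefd[0]);
        close(pipefd[1]);
        return -1;
    }
    if (pid == 0) {
        char go;
        close(pipefd[1]);
        /* Blocks here: 1 byte means "go", EOF means the parent gave up. */
        ssize_t n = read(pipefd[0], &go, 1);
        close(pipefd[0]);
        _exit(n == 1 ? child_main() : 3);
    }

    close(pipefd[0]);
    if (parent_setup(pid) == 0)
        (void)!write(pipefd[1], "x", 1);  /* release the child */
    close(pipefd[1]);                     /* EOF if nothing was written */

    int status;
    if (waitpid(pid, &status, 0) != pid || !WIFEXITED(status))
        return -1;
    return WEXITSTATUS(status);
}
```

In a runtime, `parent_setup` would be where the cgroup limits are written and the child's PID is added to the cgroup, and `child_main` would mount /proc, switch the root, and exec the command.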
Complexity Analysis:
- Time: O(1) for setup
- Space: O(1)
5. Implementation Guide
5.1 Development Environment Setup
sudo apt-get install build-essential util-linux
5.2 Project Structure
project-root/
|-- mini_container.c
|-- rootfs/
`-- Makefile
5.3 The Core Question You’re Answering
“How do namespaces and cgroups isolate processes without virtualizing hardware?”
5.4 Concepts You Must Understand First
- clone syscall and namespace flags.
- mount namespace and /proc.
- cgroup v2 filesystem.
5.5 Questions to Guide Your Design
- Will you use chroot or pivot_root?
- How will you map UID/GID for safety?
- What limits will you enforce?
5.6 Thinking Exercise
Compare container vs VM: kernel, filesystem, and isolation differences.
5.7 The Interview Questions They’ll Ask
- Why is a container not a VM?
- What does PID namespace change?
5.8 Hints in Layers
Hint 1: Use unshare to prototype namespace behavior.
Hint 2: Use clone with CLONE_NEWPID.
Hint 3: Add cgroup memory limit.
5.9 Books That Will Help
| Topic | Book | Chapter |
|---|---|---|
| Namespaces | TLPI | 40 |
| Cgroups | TLPI | 42 |
5.10 Implementation Phases
Phase 1: Namespaces (6-8 hours)
Goals: PID + mount namespace working.
Phase 2: Rootfs (4-6 hours)
Goals: chroot/pivot_root and /proc mount.
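A rough sketch of this phase, assuming the caller is already inside new mount and PID namespaces and that the rootfs contains an empty /proc directory. It follows the `pivot_root(".", ".")` idiom described in the pivot_root(2) manual page. The function name `enter_rootfs` is hypothetical.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <sys/mount.h>
#include <sys/stat.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Switch the calling process to `new_root` and mount a fresh /proc there.
 * Returns 0 on success, -1 on failure with errno set. */
int enter_rootfs(const char *new_root)
{
    struct stat st;
    assert(new_root != NULL);

    /* Fail early, before touching any mounts, if the rootfs is missing. */
    if (stat(new_root, &st) != 0)
        return -1;
    if (!S_ISDIR(st.st_mode)) {
        errno = ENOTDIR;
        return -1;
    }

    /* Keep mount changes from propagating back to the host. */
    if (mount(NULL, "/", NULL, MS_REC | MS_PRIVATE, NULL) != 0)
        return -1;
    /* pivot_root needs new_root to be a mount point: bind it onto itself. */
    if (mount(new_root, new_root, NULL, MS_BIND | MS_REC, NULL) != 0)
        return -1;
    if (chdir(new_root) != 0)
        return -1;
    /* glibc provides no wrapper, so use the raw system call. */
    if (syscall(SYS_pivot_root, ".", ".") != 0)
        return -1;
    /* The old root is now stacked on "."; detach it so it is unreachable. */
    if (umount2(".", MNT_DETACH) != 0)
        return -1;
    if (chdir("/") != 0)
        return -1;

    /* A new proc instance reflects the PID namespace of the mounter. */
    return mount("proc", "/proc", "proc",
                 MS_NOSUID | MS_NODEV | MS_NOEXEC, NULL);
}
```

The early `stat` check also covers the "missing root filesystem" edge case: the function returns -1 with errno set to ENOENT before any mount is changed.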
Phase 3: Cgroups (6-8 hours)
Goals: CPU and memory limits enforced.
5.11 Key Implementation Decisions
| Decision | Options | Recommendation | Rationale |
|---|---|---|---|
| Rootfs switch | chroot vs pivot_root | pivot_root | cleaner isolation |
| Cgroup version | v1 vs v2 | v2 | modern standard |
6. Testing Strategy
6.1 Test Categories
| Category | Purpose | Examples |
|---|---|---|
| Unit | flag parsing | limit values |
| Integration | container run | /bin/bash |
| Limits | memory cap | allocate memory |
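For the unit category, a limit-value parser is a natural target. The sketch below assumes the runtime accepts sizes such as `128M` with binary K/M/G suffixes. That syntax and the name `parse_size` are assumptions, since the specification does not fix a flag format.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

/* Parse "4096", "64K", "128M" or "1G" into a byte count.
 * Returns 0 on success, -1 on malformed or out-of-range input. */
int parse_size(const char *s, size_t *out)
{
    assert(s != NULL && out != NULL);

    /* strtoull would accept a leading sign or spaces, so reject them. */
    if (*s < '0' || *s > '9')
        return -1;

    char *end;
    errno = 0;
    unsigned long long v = strtoull(s, &end, 10);
    if (errno != 0)
        return -1;

    unsigned shift = 0;
    switch (*end) {
    case '\0': break;
    case 'K': case 'k': shift = 10; end++; break;
    case 'M': case 'm': shift = 20; end++; break;
    case 'G': case 'g': shift = 30; end++; break;
    default: return -1;
    }
    if (*end != '\0')
        return -1;                      /* trailing garbage such as "1MB" */
    if (v > (SIZE_MAX >> shift))
        return -1;                      /* would overflow size_t */

    *out = (size_t)v << shift;
    return 0;
}
```

A unit test can then assert that `128M` parses to 134217728 bytes and that inputs such as `12X` or an empty string are rejected.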
6.2 Critical Test Cases
- `ps` inside the container shows the launched command as PID 1 and no host processes.
- Memory limit enforced by allocation test.
- Root filesystem isolation (host files hidden).
6.3 Test Data
Memory test: allocate 256MB with limit 128MB
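A possible allocation helper for this test is sketched below. The name `touch_memory` and the chunked approach are assumptions. The important detail is that each chunk is written to, because untouched pages are not charged to the cgroup.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Keeps the last pointer observable so the compiler cannot drop the
 * allocations as dead code. */
static void *volatile sink;

/* Allocate `total` bytes in pieces of at most `chunk` bytes and write to
 * every byte. Returns how many bytes were allocated and touched. The memory
 * is intentionally never freed so it stays resident. */
size_t touch_memory(size_t total, size_t chunk)
{
    size_t done = 0;
    assert(chunk > 0);

    while (done < total) {
        size_t n = total - done < chunk ? total - done : chunk;
        char *p = malloc(n);
        if (p == NULL)
            break;
        memset(p, 0xA5, n);  /* touching the pages makes them count */
        sink = p;
        done += n;
    }
    return done;
}
```

Run inside the container with a 128MB limit, a request for 256MB would be expected to end with the process being killed by the kernel's OOM handling rather than with a short return value, though the exact behavior depends on swap settings.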
7. Common Pitfalls & Debugging
7.1 Frequent Mistakes
| Pitfall | Symptom | Solution |
|---|---|---|
| /proc not mounted | ps shows host processes | mount proc in namespace |
| PID 1 not reaping | zombies | add wait loop |
| cgroup not enabled | limits ignored | mount cgroup2 and enable the controllers in the parent's cgroup.subtree_control |
7.2 Debugging Strategies
- Use `lsns` to inspect namespaces.
- Print cgroup paths and contents.
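The same information can be read programmatically from the symbolic links under /proc/<pid>/ns, which is useful for debug output from the runtime itself. This is a small sketch with hypothetical helper names (`ns_identity`, `same_namespace`). It assumes /proc is mounted and that the caller may inspect the target process.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Read the identity of namespace `ns` ("pid", "mnt", "net", ...) for
 * process `pid` into buf, e.g. "pid:[4026531836]". Returns 0 or -1. */
int ns_identity(pid_t pid, const char *ns, char *buf, size_t len)
{
    char path[64];
    assert(ns != NULL && buf != NULL && len > 0);

    int n = snprintf(path, sizeof path, "/proc/%ld/ns/%s", (long)pid, ns);
    if (n < 0 || (size_t)n >= sizeof path)
        return -1;

    /* readlink() does not NUL-terminate, so leave room and do it here. */
    ssize_t r = readlink(path, buf, len - 1);
    if (r < 0)
        return -1;
    buf[r] = '\0';
    return 0;
}

/* Returns 1 if both processes share namespace `ns`, 0 if not, -1 on error. */
int same_namespace(pid_t a, pid_t b, const char *ns)
{
    char ida[64], idb[64];
    if (ns_identity(a, ns, ida, sizeof ida) != 0 ||
        ns_identity(b, ns, idb, sizeof idb) != 0)
        return -1;
    return strcmp(ida, idb) == 0;
}
```

Calling `same_namespace(getpid(), child_pid, "pid")` from the parent should return 0 once the child really runs in its own PID namespace.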
7.3 Performance Traps
- Excessive setup steps in child before exec.
8. Extensions & Challenges
8.1 Beginner Extensions
- Add hostname isolation.
8.2 Intermediate Extensions
- Add network namespace with veth pair.
8.3 Advanced Extensions
- Implement a simple image format for rootfs.
9. Real-World Connections
9.1 Industry Applications
- Containers in Docker/Kubernetes.
9.2 Related Open Source Projects
- runc and containerd.
9.3 Interview Relevance
- Namespace and cgroup questions.
10. Resources
10.1 Essential Reading
- TLPI Ch. 40, 42
10.2 Video Resources
- Container internals lectures
10.3 Tools & Documentation
- `man clone`, `man unshare`, cgroup docs
10.4 Related Projects in This Series
11. Self-Assessment Checklist
11.1 Understanding
- I can explain namespaces and cgroups.
- I can explain PID 1 responsibilities.
11.2 Implementation
- Container runs and is isolated.
11.3 Growth
- I can explain containers vs VMs.
12. Submission / Completion Criteria
Minimum Viable Completion:
- PID + mount namespaces and /proc isolation.
Full Completion:
- cgroup limits enforced.
Excellence (Going Above & Beyond):
- User namespaces and network isolation.