Project 13: Containers from Scratch

Build a minimal container runtime using namespaces and cgroups.

Quick Reference

Attribute	Value
Difficulty	Advanced
Time Estimate	18-28 hours
Main Programming Language	C
Alternative Programming Languages	Go (common for containers)
Coolness Level	Very High
Business Potential	High (container tooling)
Prerequisites	syscalls, mount, process model
Key Topics	namespaces, cgroups, chroot/pivot_root

1. Learning Objectives

By completing this project, you will:

Create new PID, mount, and network namespaces.
Apply cgroup limits for CPU and memory.
Run a process with an isolated root filesystem.
Explain why containers are not VMs.

2. All Theory Needed (Per-Concept Breakdown)

Namespaces, cgroups, and Process Isolation

Fundamentals

Containers isolate processes using kernel features, not hardware virtualization. Namespaces give a process its own view of system resources: PID namespace isolates process IDs, mount namespace isolates filesystem view, and network namespace isolates network interfaces. Cgroups limit and account resource usage like CPU and memory. Together, they provide isolation and resource control for a group of processes while still sharing the host kernel.

Deep Dive into the concept

Namespaces virtualize specific kernel resources. A PID namespace gives its processes their own PID number space; the first process in the namespace becomes PID 1 inside the container. This process is responsible for reaping zombies inside the container, and if it exits, the kernel kills every other process in the namespace. A mount namespace allows you to create a new filesystem view: you can mount a new /proc and change the root via pivot_root or chroot. A network namespace provides its own network stack, interfaces, and routing tables. This is why containers can have isolated IPs without being full VMs.

Cgroups (control groups) manage resource limits. In cgroup v2, you create a directory under the cgroup filesystem (usually mounted at /sys/fs/cgroup), write values to files like memory.max or cpu.max inside it, and add a process by writing its PID to cgroup.procs. The kernel enforces those limits for every process in the cgroup, so a container cannot consume unlimited resources. Cgroups also provide accounting, so you can measure how much memory or CPU time a container used.
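As a rough illustration of the paragraph above, the sketch below writes limits into an existing cgroup v2 directory and then moves a process into it. The helper names cg_write and cg_apply are hypothetical, and the code assumes the directory was already created (for example with mkdir under /sys/fs/cgroup) and that the memory and cpu controllers are enabled for it.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Write `value` into <cgroup_dir>/<file>. Returns 0 on success, -1 on error.
 * Control files already exist on a real cgroup2 mount, so nothing is created. */
int cg_write(const char *cgroup_dir, const char *file, const char *value)
{
    assert(cgroup_dir != NULL && file != NULL && value != NULL);

    char path[4096];
    int n = snprintf(path, sizeof path, "%s/%s", cgroup_dir, file);
    if (n < 0 || (size_t)n >= sizeof path)
        return -1;

    int fd = open(path, O_WRONLY);
    if (fd < 0)
        return -1;

    size_t len = strlen(value);
    int rc = (write(fd, value, len) == (ssize_t)len) ? 0 : -1;
    if (close(fd) < 0)
        rc = -1;
    return rc;
}

/* Cap memory and CPU for the cgroup, then move `pid` into it. */
int cg_apply(const char *cgroup_dir, unsigned long long mem_bytes,
             long quota_us, long period_us, pid_t pid)
{
    char buf[64];

    snprintf(buf, sizeof buf, "%llu", mem_bytes);
    if (cg_write(cgroup_dir, "memory.max", buf) < 0)
        return -1;

    /* cpu.max takes "<quota> <period>", both in microseconds. */
    snprintf(buf, sizeof buf, "%ld %ld", quota_us, period_us);
    if (cg_write(cgroup_dir, "cpu.max", buf) < 0)
        return -1;

    snprintf(buf, sizeof buf, "%ld", (long)pid);
    return cg_write(cgroup_dir, "cgroup.procs", buf);
}
```

For example, cg_apply("/sys/fs/cgroup/mini", 128ULL << 20, 50000, 100000, child_pid) would request 128 MiB and half of one CPU, where the directory name "mini" is only a placeholder.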

A minimal container runtime uses the clone syscall with namespace flags (CLONE_NEWPID, CLONE_NEWNS, CLONE_NEWNET, etc.), then configures the environment. With clone(CLONE_NEWPID), the child itself is PID 1 inside the container; an extra fork is only needed if you use unshare(CLONE_NEWPID) instead, because unshare leaves the caller in its old PID namespace and only its next child becomes PID 1. In the new mount namespace, first mark the mount tree private so changes do not propagate to the host, then mount a new /proc so it reflects the new PID namespace. For filesystem isolation, you create a root filesystem directory and use pivot_root or chroot to switch. You must also handle UID/GID mapping if you want user namespaces (optional but recommended for safety).

This project reveals the difference between containers and VMs: containers share the host kernel, so their isolation is only as strong as the kernel's own enforcement. A kernel vulnerability can allow a container escape. That is why containers are lighter than VMs but provide weaker isolation.

How this fits into the projects

This concept is used in Section 3.2 and Section 3.7 and builds on Project 6 (process model) and Project 11 (IPC).

Definitions & key terms

Namespace: kernel feature that isolates a resource view.
cgroup: resource control and accounting group.
pivot_root: switch to a new root filesystem.
User namespace: maps container UIDs to host UIDs.

Mental model diagram (ASCII)

Host kernel
  |-- container A (PID ns, mount ns, net ns)
  |-- container B (PID ns, mount ns, net ns)

How it works (step-by-step)

clone() with namespace flags.
In child, set hostname (this needs CLONE_NEWUTS, otherwise the host's name changes) and mount /proc.
Set up root filesystem (chroot/pivot_root).
Configure cgroup limits (usually from the parent, which still sees the host's cgroup filesystem).
exec target command.

Minimal concrete example

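/* stack is a buffer of STACK_SIZE bytes; clone() wants its top because the stack grows down */
clone(child_fn, stack + STACK_SIZE, CLONE_NEWPID | CLONE_NEWNS | SIGCHLD, NULL);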
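A fuller sketch of the same call might look like the following. It is one possible arrangement, not a reference implementation: the names container_flags and spawn_container, the 1 MiB STACK_SIZE, and the exit status 127 for a failed exec are assumptions made for illustration. The hostname "mini" and exit code 3 are taken from the example output and exit-code table in Section 3.7. CLONE_NEWUTS is included so that sethostname() affects only the container.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <assert.h>
#include <sched.h>
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mount.h>
#include <sys/types.h>
#include <unistd.h>

#define STACK_SIZE (1024 * 1024)

/* Namespace flags for the container; the network namespace is optional. */
int container_flags(int with_net)
{
    int flags = CLONE_NEWPID | CLONE_NEWNS | CLONE_NEWUTS;
    if (with_net)
        flags |= CLONE_NEWNET;
    return flags;
}

/* Runs as PID 1 of the new PID namespace. `arg` is the argv to exec. */
static int child_fn(void *arg)
{
    char **argv = arg;

    /* Keep our mount changes from propagating back to the host. */
    if (mount(NULL, "/", NULL, MS_REC | MS_PRIVATE, NULL) < 0) {
        perror("mount private");
        return 3;
    }
    if (sethostname("mini", strlen("mini")) < 0) {
        perror("sethostname");
        return 3;
    }
    /* A fresh procfs shows only the PIDs of this namespace. */
    if (mount("proc", "/proc", "proc", 0, NULL) < 0) {
        perror("mount /proc");
        return 3;
    }
    execvp(argv[0], argv);
    perror("exec failed");
    return 127;
}

/* Start argv in new namespaces. Returns the child's host PID, or -1. */
pid_t spawn_container(char **argv, int with_net)
{
    assert(argv != NULL && argv[0] != NULL);

    char *stack = malloc(STACK_SIZE);
    if (stack == NULL)
        return -1;

    /* Pass the top of the buffer: the stack grows down. */
    pid_t pid = clone(child_fn, stack + STACK_SIZE,
                      container_flags(with_net) | SIGCHLD, argv);
    free(stack); /* safe: without CLONE_VM the child has its own copy */
    return pid;
}
```

The parent would normally follow spawn_container() with waitpid() on the returned PID. Creating these namespaces requires root (or a user namespace), so an unprivileged run is expected to fail with EPERM.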

Common misconceptions

“Containers are VMs”: they share the host kernel.
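“PID 1 behaves like a normal process”: it must reap zombies, and signals sent from inside the namespace are ignored unless it installs a handler for them.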
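The reaping duty of PID 1 can be sketched as a wait loop. The function name reap_until is hypothetical, and returning 128 plus the signal number for a killed child is a shell convention adopted here as an assumption, not something this project specifies.

```c
#include <assert.h>
#include <errno.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Reap every child that exits (including orphans re-parented to us) until
 * `main_child` itself exits. Returns its exit status, 128 + signal number if
 * a signal killed it, or -1 on an unexpected waitpid() error. */
int reap_until(pid_t main_child)
{
    assert(main_child > 0);

    for (;;) {
        int status;
        pid_t pid = waitpid(-1, &status, 0);
        if (pid < 0) {
            if (errno == EINTR)
                continue;   /* interrupted by a signal: retry */
            return -1;      /* ECHILD or another error */
        }
        if (pid != main_child)
            continue;       /* an orphan was reaped; keep waiting */
        if (WIFEXITED(status))
            return WEXITSTATUS(status);
        if (WIFSIGNALED(status))
            return 128 + WTERMSIG(status);
    }
}
```

A container init that forks the target command instead of exec'ing it directly could call reap_until() on that child and use the result as its own exit status.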

Check-your-understanding questions

Why does a container need its own /proc mount?
What does a cgroup limit do?
How is a container different from a VM?

Check-your-understanding answers

/proc lists processes as seen from the PID namespace of whoever mounted it; without a fresh mount, tools like ps still show host PIDs.
It caps resources like memory or CPU.
Containers share the host kernel; VMs have separate kernels.

Real-world applications

Docker, runc, Kubernetes.

Where you’ll apply it

This project: Section 3.2, Section 3.7, Section 5.10 Phase 2.
Also used in: Project 6.

References

TLPI Ch. 40, 42
Docker/runc design docs

Key insights

Containers are a kernel feature composition, not a hardware abstraction.

Summary

By implementing namespaces and cgroups, you build the core of a container runtime.

Homework/Exercises to practice the concept

Add user namespace mapping.
Add network namespace with veth pair.
Add cgroup CPU limit.

Solutions to the homework/exercises

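Write /proc/<pid>/uid_map and /proc/<pid>/gid_map for the child (an unprivileged writer must first write deny to /proc/<pid>/setgroups).
Use ip link add veth0 type veth peer name veth1, then move one end with ip link set veth1 netns <pid>.
Write "<quota> <period>" to cpu.max, for example "50000 100000" for 50% of one CPU.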
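For the first exercise, a minimal sketch of the mapping step is shown below. It assumes the child was created with CLONE_NEWUSER and that the parent performs the writes; the helper names format_id_map, write_proc_file, and map_root_to are made up for this example.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Format one mapping line: "<id inside> <id outside> <count>\n".
 * Returns the line length, or -1 if it does not fit in buf. */
int format_id_map(char *buf, size_t len,
                  unsigned inside, unsigned outside, unsigned count)
{
    assert(buf != NULL && len > 0);
    int n = snprintf(buf, len, "%u %u %u\n", inside, outside, count);
    return (n > 0 && (size_t)n < len) ? n : -1;
}

/* Write `text` to /proc/<pid>/<name> with a single write() call. */
int write_proc_file(pid_t pid, const char *name, const char *text)
{
    char path[64];
    snprintf(path, sizeof path, "/proc/%ld/%s", (long)pid, name);

    int fd = open(path, O_WRONLY);
    if (fd < 0)
        return -1;
    size_t len = strlen(text);
    int rc = (write(fd, text, len) == (ssize_t)len) ? 0 : -1;
    close(fd);
    return rc;
}

/* Map container root (ID 0) to one unprivileged host UID and GID. */
int map_root_to(pid_t pid, uid_t host_uid, gid_t host_gid)
{
    char line[64];

    if (format_id_map(line, sizeof line, 0, host_uid, 1) < 0 ||
        write_proc_file(pid, "uid_map", line) < 0)
        return -1;
    /* Unprivileged writers must disable setgroups() before gid_map. */
    if (write_proc_file(pid, "setgroups", "deny") < 0)
        return -1;
    if (format_id_map(line, sizeof line, 0, host_gid, 1) < 0 ||
        write_proc_file(pid, "gid_map", line) < 0)
        return -1;
    return 0;
}
```

The child should wait (for example on a pipe) until the parent has finished map_root_to() before it calls exec, since each map file can only be written once.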

3. Project Specification

3.1 What You Will Build

A minimal container runtime that launches a command in isolated PID/mount/network namespaces with cgroup limits and a minimal root filesystem.

3.2 Functional Requirements

Create PID and mount namespaces.
Mount /proc inside the container.
Set up a root filesystem and chroot/pivot_root.
Apply cgroup memory/CPU limits.

3.3 Non-Functional Requirements

Performance: container starts in <1 second.
Reliability: child reaped and cgroup directory removed on exit (namespaces disappear once their last process exits).
Usability: ./mini_container /bin/bash.

3.4 Example Usage / Output

$ sudo ./mini_container /bin/bash
[container] pid: 1
[container] hostname: mini

3.5 Data Formats / Schemas / Protocols

Cgroup v2 files: memory.max, cpu.max.

3.6 Edge Cases

Missing root filesystem.
Cgroup controller not enabled.
Command exits immediately.
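The first two edge cases can be caught before clone() with simple preflight checks. The sketch below is one way to do that; is_directory and has_controller are hypothetical helper names, and the controller list is assumed to be the text read from the container cgroup's cgroup.controllers file.

```c
#include <assert.h>
#include <string.h>
#include <sys/stat.h>

/* Returns 1 if `path` exists and is a directory, else 0. */
int is_directory(const char *path)
{
    struct stat st;
    assert(path != NULL);
    return stat(path, &st) == 0 && S_ISDIR(st.st_mode);
}

/* Returns 1 if `name` appears as a whole word in a space-separated
 * controller list such as "cpuset cpu io memory pids\n", else 0. */
int has_controller(const char *list, const char *name)
{
    assert(list != NULL && name != NULL && name[0] != '\0');

    size_t n = strlen(name);
    const char *p = list;
    while (*p != '\0') {
        size_t word = strcspn(p, " \n");   /* length of the current word */
        if (word == n && strncmp(p, name, n) == 0)
            return 1;
        p += word;
        p += strspn(p, " \n");             /* skip separators */
    }
    return 0;
}
```

Matching whole words matters here because "cpu" is a prefix of "cpuset", so a plain strstr() would report a false positive.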

3.7 Real World Outcome

3.7.1 How to Run (Copy/Paste)

sudo ./mini_container /bin/bash

3.7.2 Golden Path Demo (Deterministic)

Fixed rootfs and resource limits (memory=128M, cpu=50%).

3.7.3 If CLI: exact terminal transcript

$ sudo ./mini_container /bin/bash
[container] pid: 1
[container] hostname: mini
$ ps
PID TTY TIME CMD
1   pts/0 00:00:00 bash

Failure demo (deterministic):

$ sudo ./mini_container /nope
error: exec failed (ENOENT)

Exit codes:

0 success
2 invalid args
3 namespace/cgroup error

4. Solution Architecture

4.1 High-Level Design

CLI -> clone() -> namespace setup -> cgroup setup -> exec

4.2 Key Components

4.3 Data Structures (No Full Code)

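struct limits {
    size_t memory_max;  /* bytes, the value written to memory.max */
    int cpu_quota;      /* quota written to cpu.max, e.g. microseconds per period */
};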
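To connect this struct to the golden-path values (memory=128M, cpu=50%), the sketch below parses such inputs and renders the strings for memory.max and cpu.max. It assumes cpu_quota is stored in microseconds and that the period is fixed at 100000 microseconds; parse_size, cpu_quota_from_percent, limits_to_strings, and CPU_PERIOD_US are illustrative names. Overflow from very large sizes is not checked.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

struct limits {
    size_t memory_max;  /* bytes */
    int cpu_quota;      /* microseconds per period (assumption) */
};

#define CPU_PERIOD_US 100000

/* Parse "4096", "64K", "128M" or "1G" (binary multiples) into bytes.
 * Returns 0 on success, -1 on malformed input; *out is untouched on error. */
int parse_size(const char *s, size_t *out)
{
    assert(s != NULL && out != NULL);

    char *end;
    errno = 0;
    unsigned long long v = strtoull(s, &end, 10);
    if (errno != 0 || end == s || s[0] == '-')
        return -1;
    switch (*end) {
    case 'K': case 'k': v <<= 10; end++; break;
    case 'M': case 'm': v <<= 20; end++; break;
    case 'G': case 'g': v <<= 30; end++; break;
    default: break;
    }
    if (*end != '\0')
        return -1;
    *out = (size_t)v;
    return 0;
}

/* Convert a percentage of one CPU (1..100) into a quota, or -1 if invalid. */
int cpu_quota_from_percent(int percent)
{
    if (percent <= 0 || percent > 100)
        return -1;
    return CPU_PERIOD_US / 100 * percent;
}

/* Render the strings to write into memory.max and cpu.max. */
void limits_to_strings(const struct limits *l,
                       char *mem, size_t mem_len, char *cpu, size_t cpu_len)
{
    snprintf(mem, mem_len, "%zu", l->memory_max);
    snprintf(cpu, cpu_len, "%d %d", l->cpu_quota, CPU_PERIOD_US);
}
```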

4.4 Algorithm Overview

Key Algorithm: container launch

clone child with namespaces.
setup mounts and rootfs.
configure cgroups.
exec command.

Complexity Analysis:

Time: O(1) for setup
Space: O(1)

5. Implementation Guide

5.1 Development Environment Setup

sudo apt-get install build-essential util-linux

5.2 Project Structure

project-root/
|-- mini_container.c
|-- rootfs/
`-- Makefile

5.3 The Core Question You’re Answering

“How do namespaces and cgroups isolate processes without virtualizing hardware?”

5.4 Concepts You Must Understand First

clone syscall and namespace flags.
mount namespace and /proc.
cgroup v2 filesystem.

5.5 Questions to Guide Your Design

Will you use chroot or pivot_root?
How will you map UID/GID for safety?
What limits will you enforce?

5.6 Thinking Exercise

Compare container vs VM: kernel, filesystem, and isolation differences.

5.7 The Interview Questions They’ll Ask

Why is a container not a VM?
What does PID namespace change?

5.8 Hints in Layers

Hint 1: Use the unshare command to prototype namespace behavior, e.g. sudo unshare --pid --fork --mount-proc /bin/bash.

Hint 2: Use clone with CLONE_NEWPID.

Hint 3: Add cgroup memory limit.

5.9 Books That Will Help

| Topic | Book | Chapter |
|-------|------|---------|
| Namespaces | TLPI | 40 |
| Cgroups | TLPI | 42 |

5.10 Implementation Phases

Phase 1: Namespaces (6-8 hours)

Goals: PID + mount namespace working.

Phase 2: Rootfs (4-6 hours)

Goals: chroot/pivot_root and /proc mount.
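If you choose pivot_root for this phase, the switch could be sketched as below. The function name enter_rootfs is hypothetical. The code assumes it runs in the child after the mount namespace was unshared and "/" was marked MS_REC | MS_PRIVATE, and that the rootfs contains an empty /proc directory. glibc does not provide a pivot_root() wrapper, so the raw syscall is used.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <assert.h>
#include <stdio.h>
#include <sys/mount.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Thin wrapper around the raw pivot_root system call. */
static int pivot_root_sys(const char *new_root, const char *put_old)
{
    return (int)syscall(SYS_pivot_root, new_root, put_old);
}

/* Make `rootfs` the root of the current mount namespace.
 * Returns 0 on success, -1 on error with errno set. */
int enter_rootfs(const char *rootfs)
{
    assert(rootfs != NULL);

    /* pivot_root() needs new_root to be a mount point: bind it onto itself. */
    if (mount(rootfs, rootfs, NULL, MS_BIND | MS_REC, NULL) < 0)
        return -1;
    if (chdir(rootfs) < 0)
        return -1;
    /* Using "." for both arguments stacks the old root over the new one... */
    if (pivot_root_sys(".", ".") < 0)
        return -1;
    /* ...so a lazy unmount of "." drops the old root and exposes the new. */
    if (umount2(".", MNT_DETACH) < 0)
        return -1;
    if (chdir("/") < 0)
        return -1;
    /* Mount a procfs that belongs to the container's PID namespace. */
    return mount("proc", "/proc", "proc", 0, NULL);
}
```

The pivot_root(".", ".") followed by umount2(".", MNT_DETACH) sequence is the variant described in the pivot_root(2) manual page; it avoids creating a temporary put_old directory. With chroot instead, the old root remains mounted and reachable by a privileged process, which is the main reason to prefer pivot_root.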

Phase 3: Cgroups (6-8 hours)

Goals: CPU and memory limits enforced.

5.11 Key Implementation Decisions

6. Testing Strategy

6.1 Test Categories

6.2 Critical Test Cases

ps inside the container shows the shell as PID 1 and no host processes.
Memory limit enforced by allocation test.
Root filesystem isolation (host files hidden).

6.3 Test Data

Memory test: allocate 256MB with limit 128MB
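A memory test along these lines can be a small helper that allocates and touches memory one megabyte at a time. This is only a sketch: hog_memory and hog_blocks are hypothetical names, and the blocks are deliberately never freed so that usage keeps growing.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define MB (1024u * 1024u)

/* Head of a list that keeps every allocated block reachable. */
void *hog_blocks;

/* Allocate and touch `total_mb` megabytes, one megabyte at a time, and keep
 * them all. Returns how many megabytes were obtained before malloc failed. */
size_t hog_memory(size_t total_mb)
{
    size_t got = 0;

    while (got < total_mb) {
        void **block = malloc(MB);
        if (block == NULL)
            break;
        memset(block, 0xA5, MB);  /* touching the pages is what gets charged */
        block[0] = hog_blocks;    /* link into the list; never freed */
        hog_blocks = block;
        got++;
    }
    assert(got <= total_mb);
    return got;
}
```

Run inside the container with hog_memory(256) and a 128 MB limit, the process is typically killed by the kernel's OOM killer partway through rather than seeing malloc() fail. If swap is available to the cgroup, pages may be swapped out instead, so the outcome depends on the host configuration.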

7. Common Pitfalls & Debugging

7.1 Frequent Mistakes

7.2 Debugging Strategies

Use lsns to inspect namespaces.
Print cgroup paths and contents.

7.3 Performance Traps

Excessive setup steps in child before exec.

8. Extensions & Challenges

8.1 Beginner Extensions

Add hostname isolation.

8.2 Intermediate Extensions

Add network namespace with veth pair.

8.3 Advanced Extensions

Implement a simple image format for rootfs.

9. Real-World Connections

9.1 Industry Applications

Containers in Docker/Kubernetes.

9.2 Related Open Source Projects

runc and containerd.

9.3 Interview Relevance

Namespace and cgroup questions.

10. Resources

10.1 Essential Reading

TLPI Ch. 40, 42

10.2 Video Resources

Container internals lectures

10.3 Tools & Documentation

man clone, man unshare, cgroup docs

10.4 Related Projects in This Series

Project 6

11. Self-Assessment Checklist

11.1 Understanding

I can explain namespaces and cgroups.
I can explain PID 1 responsibilities.

11.2 Implementation

Container runs and is isolated.

11.3 Growth

I can explain containers vs VMs.

12. Submission / Completion Criteria

Minimum Viable Completion:

PID + mount namespaces and /proc isolation.

Full Completion:

cgroup limits enforced.

Excellence (Going Above & Beyond):

User namespaces and network isolation.