Project 5: Build Your Own Container From Scratch

Create a minimal container runtime using Linux namespaces, cgroups, and filesystem isolation.

Quick Reference

| Attribute | Value |
|---|---|
| Difficulty | Advanced |
| Time Estimate | 2-3 weeks |
| Main Programming Language | C |
| Alternative Programming Languages | Go, Rust |
| Coolness Level | Level 4: Hardcore Tech Flex |
| Business Potential | Infrastructure tooling |
| Prerequisites | Linux process model, syscalls, filesystem basics |
| Key Topics | Namespaces, cgroups, chroot/pivot_root, capabilities |

1. Learning Objectives

By completing this project, you will:

  1. Create isolated processes using Linux namespaces.
  2. Apply cgroups to limit CPU and memory.
  3. Build a root filesystem for containers.
  4. Understand capabilities and privilege dropping.
  5. Explain the difference between containers and VMs.
  6. Run a deterministic container demo.

2. All Theory Needed (Per-Concept Breakdown)

2.1 Linux Namespaces

Description / Expanded Explanation

Namespaces give a process an isolated view of a system resource such as process IDs, the network stack, mount points, or user IDs. A container is essentially a process running inside a set of namespaces.

Definitions & Key Terms
  • PID namespace -> isolated process IDs
  • mount namespace -> isolated filesystem mounts
  • network namespace -> isolated network stack
  • user namespace -> map user IDs
Mental Model Diagram (ASCII)
host PID 1
  |- container init (PID 1 inside namespace)

How It Works (Step-by-Step)
  1. Call clone or unshare with namespace flags.
  2. Child process gets isolated view (PID 1 inside container).
  3. Processes inside the namespace cannot see host processes; the host still sees container processes, but under their host PIDs.
Minimal Concrete Example
clone(child_fn, stack_top, CLONE_NEWPID | CLONE_NEWNS | SIGCHLD, NULL);
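The call above can be expanded into a small sketch. The helper name `spawn_isolated` and the 1 MiB stack size are assumptions made for illustration, not part of the project specification. The child only checks that it sees itself as PID 1, and the call needs CAP_SYS_ADMIN, so expect it to fail with EPERM when run unprivileged.

```c
#define _GNU_SOURCE
#include <sched.h>
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/wait.h>
#include <unistd.h>

#define STACK_SIZE (1024 * 1024)   /* assumed size, adjust as needed */

/* Runs inside the new namespaces; getpid() should report 1 here. */
static int child_fn(void *arg)
{
    (void)arg;
    return getpid() == 1 ? 0 : 1;
}

/*
 * Hypothetical helper: clone a child into new PID and mount namespaces.
 * Returns the child's exit status, or -1 if clone() or waitpid() failed
 * (typically EPERM when the caller lacks CAP_SYS_ADMIN).
 */
static int spawn_isolated(void)
{
    char *stack = malloc(STACK_SIZE);
    if (stack == NULL)
        return -1;

    /* The stack grows downward on most architectures, so pass its top. */
    pid_t pid = clone(child_fn, stack + STACK_SIZE,
                      CLONE_NEWPID | CLONE_NEWNS | SIGCHLD, NULL);
    if (pid == -1) {
        perror("clone");
        free(stack);
        return -1;
    }

    int status = 0;
    pid_t waited = waitpid(pid, &status, 0);
    free(stack);
    if (waited == -1)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

Including SIGCHLD in the flags lets the parent collect the child with an ordinary waitpid call.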
Common Misconceptions
  • “Namespaces provide security alone” -> they isolate but do not enforce limits.
  • “PID 1 behaves like normal” -> PID 1 must reap zombies.
Check-Your-Understanding Questions
  1. Why does a container need a PID namespace?
  2. What happens if the container init process dies?
  3. How does a mount namespace affect file visibility?
Where You’ll Apply It

2.2 Cgroups and Resource Limits

Description / Expanded Explanation

Cgroups limit and account for resource usage. They control CPU shares, memory limits, and process counts to prevent containers from consuming all host resources.

Definitions & Key Terms
  • cgroup -> kernel feature to limit resources
  • cpu.shares -> relative CPU allocation (cgroup v1; the v2 equivalent is cpu.weight)
  • memory.max -> memory limit
  • pids.max -> process count limit
Mental Model Diagram (ASCII)
cgroup: /mycontainer
  cpu.shares=256
  memory.max=128M

How It Works (Step-by-Step)
  1. Create a cgroup directory.
  2. Write limits to control files.
  3. Add the container process to the cgroup.
  4. Kernel enforces limits.
Minimal Concrete Example
echo 134217728 > memory.max
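The first three steps above (create the directory, write limits, add the process) can be sketched in C as follows. The helper names `cgroup_create`, `cgroup_write`, and `cgroup_add_pid` are hypothetical, and the sketch assumes a cgroup v2 hierarchy in which the kernel creates the control files as soon as the directory exists.

```c
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Step 1: create the cgroup directory (an existing one is accepted). */
static int cgroup_create(const char *cgroup_dir)
{
    if (mkdir(cgroup_dir, 0755) == -1 && errno != EEXIST)
        return -1;
    return 0;
}

/*
 * Step 2: write `value` into the control file `name` inside `cgroup_dir`.
 * The file must already exist. Returns 0 on success, -1 on error.
 */
static int cgroup_write(const char *cgroup_dir, const char *name,
                        const char *value)
{
    char path[4096];
    int n = snprintf(path, sizeof path, "%s/%s", cgroup_dir, name);
    if (n < 0 || (size_t)n >= sizeof path)
        return -1;

    int fd = open(path, O_WRONLY);
    if (fd == -1)
        return -1;

    size_t len = strlen(value);
    ssize_t written = write(fd, value, len);
    close(fd);
    return written == (ssize_t)len ? 0 : -1;
}

/* Step 3: move a process into the cgroup by writing its PID. */
static int cgroup_add_pid(const char *cgroup_dir, pid_t pid)
{
    char buf[32];
    snprintf(buf, sizeof buf, "%d", (int)pid);
    return cgroup_write(cgroup_dir, "cgroup.procs", buf);
}
```

A caller might run `cgroup_write(dir, "memory.max", "134217728")` and then `cgroup_add_pid(dir, child_pid)`, where `dir` is a directory under the cgroup mount point (commonly /sys/fs/cgroup).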
Common Misconceptions
  • “Cgroups are only for CPU” -> they cover multiple resources.
  • “Limits are advisory” -> the kernel enforces them: memory.max and pids.max are hard caps, while CPU shares take effect only under contention.
Check-Your-Understanding Questions
  1. What happens when a container exceeds memory.max?
  2. How are CPU shares enforced across containers?
  3. What is the difference between cgroup v1 and v2?
Where You’ll Apply It
  • See 3.2 for functional requirements.
  • See 5.10 Phase 2 for implementation.
  • Also used in: P02 Load Balancer for resource isolation

2.3 Filesystem Isolation (chroot and pivot_root)

Description / Expanded Explanation

Containers need a root filesystem that looks like a full OS. chroot and pivot_root change the apparent root for a process, isolating file access.

Definitions & Key Terms
  • chroot -> change root directory for a process
  • pivot_root -> swap the root mount, moving the old root to put_old so it can be unmounted
  • bind mount -> mount directory to another location
  • overlayfs -> layered filesystem for copy-on-write
Mental Model Diagram (ASCII)
/ (host)
  /containers/rootfs -> new /

How It Works (Step-by-Step)
  1. Prepare rootfs directory with /bin, /lib, etc.
  2. Bind mount necessary files.
  3. Call pivot_root to switch to new root.
  4. Unmount old root to prevent escape.
Minimal Concrete Example
syscall(SYS_pivot_root, new_root, put_old);  /* glibc provides no pivot_root() wrapper */
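The pivot_root call is only one step of a longer sequence. The sketch below shows one plausible ordering, assuming the caller is already in a new mount namespace and holds CAP_SYS_ADMIN. The helper name `enter_rootfs` and the `.old_root` directory name are illustrative choices, not requirements of the project.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <limits.h>
#include <stdio.h>
#include <sys/mount.h>
#include <sys/stat.h>
#include <sys/syscall.h>
#include <unistd.h>

/*
 * Hypothetical helper: make `new_root` the root of the calling process.
 * Returns 0 on success, -1 on the first failing step.
 */
static int enter_rootfs(const char *new_root)
{
    char put_old[PATH_MAX];

    /* Validate the path before touching any mounts. */
    if (access(new_root, X_OK) == -1)
        return -1;
    int n = snprintf(put_old, sizeof put_old, "%s/.old_root", new_root);
    if (n < 0 || (size_t)n >= sizeof put_old)
        return -1;

    /* Stop mount events from propagating back to the host. */
    if (mount(NULL, "/", NULL, MS_REC | MS_PRIVATE, NULL) == -1)
        return -1;
    /* pivot_root requires new_root to be a mount point. */
    if (mount(new_root, new_root, NULL, MS_BIND | MS_REC, NULL) == -1)
        return -1;
    if (mkdir(put_old, 0700) == -1 && errno != EEXIST)
        return -1;
    if (syscall(SYS_pivot_root, new_root, put_old) == -1)
        return -1;
    if (chdir("/") == -1)
        return -1;
    /* Detach the old root so host files are no longer reachable. */
    if (umount2("/.old_root", MNT_DETACH) == -1)
        return -1;
    return rmdir("/.old_root");
}
```

Call this only from the child that lives in the new mount namespace; running it in the host namespace with sufficient privileges would change the host's own mounts.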
Common Misconceptions
  • “chroot is secure” -> without namespaces it is not.
  • “rootfs can be empty” -> you need binaries and libs.
Check-Your-Understanding Questions
  1. Why is pivot_root safer than chroot?
  2. What files must exist for a minimal shell container?
  3. How do bind mounts help share host directories?
Where You’ll Apply It

2.4 Capabilities and Seccomp

Description / Expanded Explanation

Linux capabilities split root privileges into smaller units. Seccomp filters system calls. Together they reduce the attack surface of a container.

Definitions & Key Terms
  • capabilities -> fine-grained privileges
  • seccomp -> system call filtering
  • drop privileges -> remove unneeded capabilities
Mental Model Diagram (ASCII)
root -> [cap_net_bind_service, cap_sys_admin, ...] -> drop most

How It Works (Step-by-Step)
  1. Start as root in a user namespace.
  2. Drop capabilities not required.
  3. Install seccomp filter to allow only safe syscalls.
Minimal Concrete Example
prctl(PR_CAPBSET_DROP, CAP_SYS_ADMIN, 0, 0, 0);
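Dropping a single capability generalizes to dropping everything except a keep-list. The sketch below uses only prctl and covers the bounding set alone; a fuller implementation would probably also clear the effective, permitted, and inheritable sets. The helper names `bounding_has` and `drop_bounding_except` are hypothetical.

```c
#include <linux/capability.h>
#include <sys/prctl.h>

/* Returns 1 if `cap` is in the bounding set, 0 if not, -1 on error. */
static int bounding_has(int cap)
{
    return prctl(PR_CAPBSET_READ, cap, 0, 0, 0);
}

/*
 * Drop every capability from the bounding set except those listed in
 * `keep`. Needs CAP_SETPCAP; returns -1 on the first failed drop.
 * CAP_LAST_CAP comes from the build-time kernel headers, so capabilities
 * added by a newer running kernel are not covered by this loop.
 */
static int drop_bounding_except(const int *keep, int nkeep)
{
    for (int cap = 0; cap <= CAP_LAST_CAP; cap++) {
        int wanted = 0;
        for (int i = 0; i < nkeep; i++) {
            if (keep[i] == cap)
                wanted = 1;
        }
        /* Skip kept capabilities and ones that are absent or unknown. */
        if (wanted || bounding_has(cap) != 1)
            continue;
        if (prctl(PR_CAPBSET_DROP, cap, 0, 0, 0) == -1)
            return -1;
    }
    return 0;
}
```

For example, a keep-list of `{ CAP_NET_BIND_SERVICE }` leaves the container able to bind low ports while removing CAP_SYS_ADMIN and the rest from the bounding set.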
Common Misconceptions
  • “Running as root inside container is safe” -> it can still be dangerous.
  • “Seccomp is too complex” -> even a small allowlist helps.
Check-Your-Understanding Questions
  1. Why drop CAP_SYS_ADMIN?
  2. What syscalls are needed for a shell container?
  3. How does seccomp prevent container escape?
Where You’ll Apply It
  • See 3.2 for security requirements.
  • See 5.10 Phase 3 for hardening.
  • Also used in: P01 Memory Allocator for debug safety

2.5 Init Process and Process Lifecycle

Description / Expanded Explanation

The first process inside a PID namespace is PID 1. It must reap zombie processes and handle signals correctly. Container runtimes typically run a small init to manage this.

Definitions & Key Terms
  • PID 1 -> first process in namespace
  • zombie -> process that has exited but not reaped
  • signal -> asynchronous notification
Mental Model Diagram (ASCII)
PID 1 (init)
  |- child process

How It Works (Step-by-Step)
  1. Container starts with PID 1 init.
  2. Init forks child command.
  3. Init waits and reaps child exit.
  4. Signals are forwarded to the child.
Minimal Concrete Example
while ((pid = waitpid(-1, &status, 0)) > 0) { }
Common Misconceptions
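  • “PID 1 behaves like any process” -> the kernel delivers a signal to PID 1 only if PID 1 has installed a handler for it, so an unhandled SIGTERM is silently dropped (SIGKILL and SIGSTOP sent from the host are the exceptions).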
  • “Zombies are harmless” -> too many can exhaust PID space.
Check-Your-Understanding Questions
  1. Why must PID 1 reap zombies?
  2. What happens if PID 1 exits?
  3. How do you forward SIGTERM to the child process?
Where You’ll Apply It
  • See 4.2 for component responsibilities.
  • See 5.10 Phase 1 for init implementation.
  • Also used in: P08 TCP Socket Server

3. Project Specification

3.1 What You Will Build

A minimal container runtime that can run a command in isolated namespaces with a dedicated root filesystem and resource limits.

3.2 Functional Requirements

  1. Create PID, mount, and UTS namespaces.
  2. Set up root filesystem via pivot_root.
  3. Apply CPU and memory cgroup limits.
  4. Run a user-specified command inside container.
  5. Provide basic logging and deterministic demo.

3.3 Non-Functional Requirements

  • Performance: container startup under 200 ms.
  • Reliability: correct cleanup of cgroups and mounts.
  • Security: drop capabilities.

3.4 Example Usage / Output

$ ./minict run --rootfs ./rootfs -- /bin/sh
/ # ps
PID 1 sh

3.5 Data Formats / Schemas / Protocols

Config file:

rootfs: ./rootfs
cgroups:
  memory: 128M
  cpu_shares: 256
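A value such as `128M` has to be converted to bytes before it can be written to memory.max. The parser below is a minimal sketch; the function name `parse_size` and the accepted suffixes (K, M, G as powers of 1024) are assumptions, since the config format above does not spell them out.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/*
 * Hypothetical parser for values such as "128M": a decimal number with an
 * optional K, M, or G suffix. Stores the byte count in *out and returns 0,
 * or returns -1 on malformed or overflowing input.
 */
static int parse_size(const char *text, size_t *out)
{
    /* Require a leading digit so signs and empty strings are rejected. */
    if (text == NULL || *text < '0' || *text > '9')
        return -1;

    char *end;
    errno = 0;
    unsigned long long n = strtoull(text, &end, 10);
    if (errno != 0)
        return -1;

    unsigned shift = 0;
    switch (*end) {
    case '\0': break;
    case 'K': shift = 10; end++; break;
    case 'M': shift = 20; end++; break;
    case 'G': shift = 30; end++; break;
    default:  return -1;
    }
    /* Reject trailing characters and values that overflow size_t. */
    if (*end != '\0' || n > (SIZE_MAX >> shift))
        return -1;

    *out = (size_t)n << shift;
    return 0;
}
```

With this sketch, `parse_size("128M", &v)` stores 134217728 in `v`, the same number used in the memory.max example in section 2.2.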

3.6 Edge Cases

  • Missing binaries in rootfs.
  • Exceeding memory limit.
  • Cleanup on failure.

3.7 Real World Outcome

3.7.1 How to Run (Copy/Paste)

make
./minict run --rootfs ./rootfs -- /bin/sh

3.7.2 Golden Path Demo (Deterministic)

Use a fixed rootfs and fixed command sequence.

3.7.3 CLI Transcript (Success)

$ ./minict run --rootfs ./rootfs -- /bin/echo hello
hello
$ echo $?
0

3.7.4 CLI Transcript (Failure)

$ ./minict run --rootfs ./rootfs -- /bin/missing
minict: exec failed: /bin/missing
$ echo $?
127

3.7.5 Exit Codes

  • 0 success
  • 1 invalid config
  • 2 namespace setup failed
  • 127 exec failed

4. Solution Architecture

4.1 High-Level Design

cli -> container setup -> namespaces -> cgroups -> exec cmd

4.2 Key Components

| Component | Responsibility | Key Decisions |
|---|---|---|
| Namespace manager | create namespaces | clone/unshare |
| Rootfs manager | pivot_root, mounts | bind mounts |
| Cgroup manager | apply limits | v2 unified |
| Init process | PID 1 reaping | waitpid loop |

4.3 Data Structures (No Full Code)

struct container_config {
    char *rootfs;
    size_t memory_limit;
    int cpu_shares;
};

4.4 Algorithm Overview

  1. Parse config and validate rootfs.
  2. Create namespaces and fork child.
  3. Set up rootfs and mounts.
  4. Apply cgroups and drop capabilities.
  5. Exec target command.
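One ordering detail in these steps is that limits should be in place before the target command starts running. A common way to arrange this, sketched below as an assumption rather than a prescribed design, is to have the child block on a pipe until the parent has finished its setup (for example, writing the child's PID into cgroup.procs). The names `launch_gated` and `announce_child` are hypothetical, and `announce_child` is only a placeholder for real setup work.

```c
#include <stdio.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Placeholder for parent-side setup such as joining a cgroup. */
static int announce_child(pid_t pid)
{
    fprintf(stderr, "child %d ready for setup\n", (int)pid);
    return 0;
}

/*
 * Hypothetical launcher: fork a child that waits until
 * `parent_setup(child_pid)` has succeeded, then execs `argv`.
 * Returns the child's exit code, 127 if exec failed, -1 on other errors.
 */
static int launch_gated(int (*parent_setup)(pid_t), char *const argv[])
{
    int gate[2];
    if (pipe(gate) == -1)
        return -1;

    pid_t child = fork();
    if (child == -1) {
        close(gate[0]);
        close(gate[1]);
        return -1;
    }
    if (child == 0) {
        char go;
        close(gate[1]);
        /* Blocks until the parent writes one byte; EOF means abort. */
        if (read(gate[0], &go, 1) != 1)
            _exit(2);
        close(gate[0]);
        execvp(argv[0], argv);
        _exit(127);
    }

    close(gate[0]);
    int ok = parent_setup(child) == 0;
    if (ok)
        ok = write(gate[1], "1", 1) == 1;
    close(gate[1]);   /* on failure the child sees EOF and exits */

    int status;
    if (waitpid(child, &status, 0) == -1 || !ok)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

In a full runtime the fork would be replaced by clone with namespace flags, and the child would set up its rootfs and drop capabilities between the read and the exec.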

Complexity Analysis

  • Time: O(1) setup per container
  • Space: O(1) per container metadata

5. Implementation Guide

5.1 Development Environment Setup

make

5.2 Project Structure

project-root/
├── src/ns.c
├── src/cgroups.c
├── src/rootfs.c
├── src/init.c
└── tests/

5.3 The Core Question You’re Answering

“What is a container really, and how does Linux isolate it?”

5.4 Concepts You Must Understand First

  1. Namespace types and clone flags
  2. cgroup v2 layout
  3. pivot_root and mounts
  4. capabilities and seccomp basics

5.5 Questions to Guide Your Design

  1. Which namespaces are strictly required for isolation?
  2. How do you ensure cleanup on failure?
  3. What is the minimal rootfs content?

5.6 Thinking Exercise

Draw the process tree from host PID to container PID 1 to child process.

5.7 The Interview Questions They’ll Ask

  1. How do containers differ from VMs?
  2. What is the role of cgroups?
  3. Why is PID 1 special in a container?

5.8 Hints in Layers

Hint 1: Start with a PID namespace and echo a command.
Hint 2: Add mount namespace and chroot.
Hint 3: Add cgroups last.

5.9 Books That Will Help

| Topic | Book | Chapter |
|---|---|---|
| Linux internals | The Linux Programming Interface | Ch. 41-44 |
| Containers | Linux Container Internals | Sections on namespaces |

5.10 Implementation Phases

Phase 1: Foundation (4-6 days)

Namespaces and init process.

Phase 2: Core Functionality (5-7 days)

Rootfs and cgroups.

Phase 3: Polish and Edge Cases (4-6 days)

Capabilities, cleanup, logging.

5.11 Key Implementation Decisions

| Decision | Options | Recommendation | Rationale |
|---|---|---|---|
| Namespace setup | clone vs unshare | clone | simple control |
| Rootfs | chroot vs pivot_root | pivot_root | more secure |
| Cgroups | v1 vs v2 | v2 | unified hierarchy |


6. Testing Strategy

6.1 Test Categories

| Category | Purpose | Examples |
|---|---|---|
| Unit Tests | config parsing | invalid rootfs |
| Integration Tests | isolation | ps inside container |
| Resource Tests | limits | memory OOM |

6.2 Critical Test Cases

  1. Container sees PID 1 for its own init.
  2. Memory limit triggers OOM kill in container.
  3. Host files not accessible from container rootfs.

6.3 Test Data

commands: /bin/echo, /bin/sh, /usr/bin/yes

7. Common Pitfalls & Debugging

7.1 Frequent Mistakes

| Pitfall | Symptom | Solution |
|---|---|---|
| Missing mounts | commands not found | add /proc and /dev mounts |
| No zombie reaping | zombie buildup | implement init loop |
| cgroup cleanup | leaked directories | cleanup on exit |

7.2 Debugging Strategies

  • Use strace to inspect clone and mount syscalls.
  • Use lsns to list namespaces.
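The information lsns reports can also be checked from inside a test program, because each entry under /proc/PID/ns is a symbolic link whose target identifies the namespace. The helpers below are a sketch with hypothetical names (`ns_id`, `same_ns`); reading another user's process may require extra privileges.

```c
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: read the identifier of namespace `kind`
 * ("pid", "mnt", "uts", ...) for process `pid` into `buf`. The text has
 * the form "pid:[<inode number>]". Returns 0 on success, -1 on error.
 */
static int ns_id(pid_t pid, const char *kind, char *buf, size_t len)
{
    char path[64];
    int n = snprintf(path, sizeof path, "/proc/%d/ns/%s", (int)pid, kind);
    if (n < 0 || (size_t)n >= sizeof path || len == 0)
        return -1;

    ssize_t got = readlink(path, buf, len - 1);
    if (got == -1)
        return -1;
    buf[got] = '\0';   /* readlink does not terminate the string */
    return 0;
}

/* Returns 1 if both processes share the namespace, 0 if not, -1 on error. */
static int same_ns(pid_t a, pid_t b, const char *kind)
{
    char ida[64], idb[64];
    if (ns_id(a, kind, ida, sizeof ida) == -1 ||
        ns_id(b, kind, idb, sizeof idb) == -1)
        return -1;
    return strcmp(ida, idb) == 0;
}
```

An integration test could assert that `same_ns(getpid(), container_pid, "pid")` returns 0 once the container is running, where `container_pid` is the host PID returned by clone.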

7.3 Performance Traps

  • Excessive bind mounts slow startup.

8. Extensions & Challenges

8.1 Beginner Extensions

  • Add hostname isolation (UTS namespace).
  • Add basic logging.

8.2 Intermediate Extensions

  • Add network namespace and veth pair.
  • Add seccomp filter.

8.3 Advanced Extensions

  • Implement layered rootfs with overlayfs.
  • Add container image import.

9. Real-World Connections

9.1 Industry Applications

  • Docker, containerd, and Kubernetes runtimes.
  • runc
  • containerd

9.3 Interview Relevance

  • Demonstrates OS isolation and security concepts.

10. Resources

10.1 Essential Reading

  • The Linux Programming Interface (Namespaces and cgroups)
  • Linux Container Internals

10.2 Video Resources

  • Container internals talks

10.3 Tools & Documentation

  • strace, lsns, cgget

11. Self-Assessment Checklist

11.1 Understanding

  • I can explain how namespaces isolate processes.
  • I can explain cgroup resource limits.
  • I can explain why PID 1 needs reaping.

11.2 Implementation

  • Container runs command successfully.
  • Resource limits are enforced.
  • Rootfs isolation works.

11.3 Growth

  • I can explain container internals in an interview.

12. Submission / Completion Criteria

Minimum Viable Completion:

  • Runs command in new namespaces.
  • Rootfs isolation works.

Full Completion:

  • Cgroups and capabilities implemented.
  • Deterministic demo passes.

Excellence (Going Above & Beyond):

  • Network namespace and overlayfs support.
  • Image import and snapshotting.