Project 5: Build Your Own Container From Scratch

Create a minimal container runtime using Linux namespaces, cgroups, and filesystem isolation.

Quick Reference

Attribute	Value
Difficulty	Advanced
Time Estimate	2-3 weeks
Main Programming Language	C
Alternative Programming Languages	Go, Rust
Coolness Level	Level 4: Hardcore Tech Flex
Business Potential	Infrastructure tooling
Prerequisites	Linux process model, syscalls, filesystem basics
Key Topics	Namespaces, cgroups, chroot/pivot_root, capabilities

1. Learning Objectives

By completing this project, you will:

Create isolated processes using Linux namespaces.
Apply cgroups to limit CPU and memory.
Build a root filesystem for containers.
Understand capabilities and privilege dropping.
Explain the difference between containers and VMs.
Run a deterministic container demo.

2. All Theory Needed (Per-Concept Breakdown)

2.1 Linux Namespaces

Description / Expanded Explanation

Namespaces give a process an isolated view of a system resource such as process IDs, the network stack, mount points, or user IDs. A container is essentially a process running inside a set of namespaces.

Definitions & Key Terms

PID namespace -> isolated process IDs
mount namespace -> isolated filesystem mounts
network namespace -> isolated network stack
user namespace -> isolated user and group IDs, mapped onto host IDs

Mental Model Diagram (ASCII)

host PID 1
  |- container init (PID 1 inside namespace)

How It Works (Step-by-Step)

Call clone or unshare with namespace flags.
Child process gets isolated view (PID 1 inside container).
Processes inside the container cannot see processes outside it; the host still sees container processes, but under different (host-side) PIDs.
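The steps above can be sketched as a small helper. This is a minimal illustration, not the project's required design: run_in_new_pid_ns and child_fn are hypothetical names, and the call needs CAP_SYS_ADMIN (typically root), so it returns -1 when run unprivileged.

```c
#define _GNU_SOURCE            /* must come before any #include to expose clone() */
#include <sched.h>
#include <signal.h>
#include <stdlib.h>
#include <sys/wait.h>
#include <unistd.h>

#define STACK_SIZE (1024 * 1024)

/* Runs inside the new PID namespace: exit 0 only if we really are PID 1. */
static int child_fn(void *arg) {
    (void)arg;
    return getpid() == 1 ? 0 : 1;
}

/* Hypothetical helper: returns the child's exit code, or -1 if the
 * namespace could not be created (for example EPERM without root). */
int run_in_new_pid_ns(void) {
    char *stack = malloc(STACK_SIZE);
    if (stack == NULL)
        return -1;

    /* clone() takes the TOP of the stack because it grows downward.
     * SIGCHLD makes the child waitable like a normal fork()ed child. */
    pid_t pid = clone(child_fn, stack + STACK_SIZE, CLONE_NEWPID | SIGCHLD, NULL);
    if (pid == -1) {
        free(stack);
        return -1;
    }

    int status = 0;
    pid_t w = waitpid(pid, &status, 0);
    free(stack);
    if (w == -1 || !WIFEXITED(status))
        return -1;
    return WEXITSTATUS(status);
}
```

When run as root, a return value of 0 confirms that the child saw itself as PID 1 even though the parent received an ordinary host PID from clone.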

Minimal Concrete Example

/* stack_top = stack + STACK_SIZE; SIGCHLD lets the parent waitpid() on the child */
clone(child_fn, stack_top, CLONE_NEWPID | CLONE_NEWNS | SIGCHLD, NULL);

Common Misconceptions

“Namespaces provide security alone” -> they isolate what a process can see but do not limit resource usage or privileges; cgroups, capabilities, and seccomp cover those.
“PID 1 behaves like normal” -> PID 1 must reap zombies.

Check-Your-Understanding Questions

Why does a container need a PID namespace?
What happens if the container init process dies?
How does a mount namespace affect file visibility?

Where You’ll Apply It

See 3.1 for container creation.
See 4.1 for architecture.
Also used in: P08 TCP Socket Server

2.2 Cgroups and Resource Limits

Description / Expanded Explanation

Cgroups limit and account for resource usage. They control CPU shares, memory limits, and process counts to prevent containers from consuming all host resources.

Definitions & Key Terms

cgroup -> kernel feature to limit resources
cpu.shares -> relative CPU allocation (cgroup v1 name; the cgroup v2 equivalent is cpu.weight)
memory.max -> memory limit
pids.max -> process count limit

Mental Model Diagram (ASCII)

cgroup: /mycontainer
  cpu.shares=256
  memory.max=128M

How It Works (Step-by-Step)

Create a cgroup directory.
Write limits to control files.
Add the container process to the cgroup.
Kernel enforces limits.

Minimal Concrete Example

echo 134217728 > /sys/fs/cgroup/mycontainer/memory.max   # 128 MiB, cgroup v2

Common Misconceptions

“Cgroups are only for CPU” -> they cover multiple resources.
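“Limits are advisory” -> hard limits such as memory.max and pids.max are enforced by the kernel; CPU shares are proportional and only take effect when CPUs are contended.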

Check-Your-Understanding Questions

What happens when a container exceeds memory.max?
How are CPU shares enforced across containers?
What is the difference between cgroup v1 and v2?

Where You’ll Apply It

See 3.2 for functional requirements.
See 5.10 Phase 2 for implementation.
Also used in: P02 Load Balancer for resource isolation

2.3 Filesystem Isolation (chroot and pivot_root)

Description / Expanded Explanation

Containers need a root filesystem that looks like a full OS. chroot and pivot_root change the apparent root for a process, isolating file access.

Definitions & Key Terms

chroot -> change root directory for a process
pivot_root -> make a new mount the root filesystem and move the old root to a directory beneath it (the old root is unmounted in a separate step)
bind mount -> mount directory to another location
overlayfs -> layered filesystem for copy-on-write

Mental Model Diagram (ASCII)

/ (host)
  /containers/rootfs -> new /

How It Works (Step-by-Step)

Prepare rootfs directory with /bin, /lib, etc.
Bind mount necessary files.
Call pivot_root to switch to new root.
Unmount old root to prevent escape.
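One possible shape for steps 2 to 4 is shown below. It is a sketch under assumptions: enter_rootfs is a hypothetical name, the caller is already inside a new mount namespace with CAP_SYS_ADMIN, and it uses the pivot_root(".", ".") sequence described in the pivot_root(2) manual page so that no temporary put_old directory is needed.

```c
#include <stddef.h>
#include <sys/mount.h>
#include <sys/stat.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Switch the calling process to new_root. Returns 0 on success, -1 on error.
 * Must only be called inside a fresh mount namespace. */
int enter_rootfs(const char *new_root) {
    struct stat st;

    /* Validate first so a bad path fails before any mount is touched. */
    if (stat(new_root, &st) == -1 || !S_ISDIR(st.st_mode))
        return -1;

    /* Keep our mount changes from propagating back to the host. */
    if (mount(NULL, "/", NULL, MS_REC | MS_PRIVATE, NULL) == -1)
        return -1;

    /* pivot_root requires new_root to be a mount point,
     * so bind-mount the directory onto itself. */
    if (mount(new_root, new_root, NULL, MS_BIND | MS_REC, NULL) == -1)
        return -1;

    if (chdir(new_root) == -1)
        return -1;

    /* glibc has no pivot_root() wrapper. With "." for both arguments the
     * old root ends up stacked on the same path as the new root. */
    if (syscall(SYS_pivot_root, ".", ".") == -1)
        return -1;

    /* Detach the old root so the host filesystem is no longer reachable. */
    if (umount2(".", MNT_DETACH) == -1)
        return -1;

    return chdir("/");
}
```

Calling this outside a new mount namespace would rearrange the host's own mounts, which is why the mount namespace comes first in the build order.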

Minimal Concrete Example

syscall(SYS_pivot_root, new_root, put_old);  /* glibc provides no pivot_root() wrapper */

Common Misconceptions

“chroot is secure” -> a process that keeps root privileges (CAP_SYS_CHROOT) can break out of a chroot; pair pivot_root with a mount namespace instead.
“rootfs can be empty” -> you need binaries and libs.

Check-Your-Understanding Questions

Why is pivot_root safer than chroot?
What files must exist for a minimal shell container?
How do bind mounts help share host directories?

Where You’ll Apply It

See 3.1 for container setup.
See 5.10 Phase 1 for rootfs creation.
Also used in: P10 Mini Git Object Store

2.4 Capabilities and Seccomp

Description / Expanded Explanation

Linux capabilities split root privileges into smaller units. Seccomp filters system calls. Together they reduce the attack surface of a container.

Definitions & Key Terms

capabilities -> fine-grained privileges
seccomp -> system call filtering
drop privileges -> remove unneeded capabilities

Mental Model Diagram (ASCII)

root -> [cap_net_bind_service, cap_sys_admin, ...] -> drop most

How It Works (Step-by-Step)

Start as root in a user namespace.
Drop capabilities not required.
Install seccomp filter to allow only safe syscalls.
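Step 2 can be sketched with an allowlist: keep a short list of capabilities and drop everything else from the bounding set before exec. The names cap_is_kept and drop_bounding_caps are hypothetical. Note that the bounding set limits what a later execve can grant; it does not by itself clear capabilities the process already holds, and the call fails with EPERM unless the caller has CAP_SETPCAP.

```c
#include <errno.h>
#include <linux/capability.h>
#include <stddef.h>
#include <sys/prctl.h>

/* Return 1 if cap appears in the keep list, 0 otherwise. */
int cap_is_kept(int cap, const int *keep, size_t nkeep) {
    for (size_t i = 0; i < nkeep; i++) {
        if (keep[i] == cap)
            return 1;
    }
    return 0;
}

/* Drop every capability not in keep[] from the bounding set.
 * Returns 0 on success, -1 on the first unexpected failure. */
int drop_bounding_caps(const int *keep, size_t nkeep) {
    for (int cap = 0; cap <= CAP_LAST_CAP; cap++) {
        if (cap_is_kept(cap, keep, nkeep))
            continue;
        if (prctl(PR_CAPBSET_DROP, (unsigned long)cap, 0UL, 0UL, 0UL) == -1) {
            /* EINVAL: the running kernel does not know this capability
             * (headers newer than kernel), so there is nothing to drop. */
            if (errno == EINVAL)
                continue;
            return -1;
        }
    }
    return 0;
}
```

An allowlist is used here rather than a denylist so that capabilities added by future kernels are dropped by default.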

Minimal Concrete Example

prctl(PR_CAPBSET_DROP, CAP_SYS_ADMIN, 0, 0, 0);

Common Misconceptions

“Running as root inside container is safe” -> without a user namespace, container root is host root, restrained only by its remaining capabilities and any seccomp filter.
“Seccomp is too complex” -> even a small allowlist helps.

Check-Your-Understanding Questions

Why drop CAP_SYS_ADMIN?
What syscalls are needed for a shell container?
How does seccomp prevent container escape?

Where You’ll Apply It

See 3.2 for security requirements.
See 5.10 Phase 3 for hardening.
Also used in: P01 Memory Allocator for debug safety

2.5 Init Process and Process Lifecycle

Description / Expanded Explanation

The first process inside a PID namespace is PID 1. It must reap zombie processes and handle signals correctly. Container runtimes typically run a small init to manage this.

Definitions & Key Terms

PID 1 -> first process in namespace
zombie -> process that has exited but not reaped
signal -> asynchronous notification

Mental Model Diagram (ASCII)

PID 1 (init)
  |- child process

How It Works (Step-by-Step)

Container starts with PID 1 init.
Init forks child command.
Init waits and reaps child exit.
Signals are forwarded to the child.

Minimal Concrete Example

while ((pid = waitpid(-1, &status, 0)) > 0) { }

Common Misconceptions

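“PID 1 behaves like any process” -> signals sent from inside the namespace reach PID 1 only if it has installed a handler for them, so an unhandled SIGTERM is silently dropped.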
“Zombies are harmless” -> too many can exhaust PID space.

Check-Your-Understanding Questions

Why must PID 1 reap zombies?
What happens if PID 1 exits?
How do you forward SIGTERM to the child process?

Where You’ll Apply It

See 4.2 for component responsibilities.
See 5.10 Phase 1 for init implementation.
Also used in: P08 TCP Socket Server

3. Project Specification

3.1 What You Will Build

A minimal container runtime that can run a command in isolated namespaces with a dedicated root filesystem and resource limits.

3.2 Functional Requirements

Create PID, mount, and UTS namespaces.
Set up root filesystem via pivot_root.
Apply CPU and memory cgroup limits.
Run a user-specified command inside container.
Provide basic logging and deterministic demo.

3.3 Non-Functional Requirements

Performance: container startup under 200 ms.
Reliability: correct cleanup of cgroups and mounts.
Security: drop capabilities.

3.4 Example Usage / Output

$ ./minict run --rootfs ./rootfs -- /bin/sh
/ # ps
PID 1 sh

3.5 Data Formats / Schemas / Protocols

Config file:

rootfs: ./rootfs
cgroups:
  memory: 128M
  cpu_shares: 256
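A value such as 128M has to become a byte count before it can be written to memory.max. The helper below is a sketch with the hypothetical name parse_size. It assumes binary multiples (K = 1024, M = 1024^2, G = 1024^3), which matches the 128M = 134217728 figure used earlier in this guide.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Parse "<digits>[K|M|G]" into bytes. Returns 0 on success, -1 on error. */
int parse_size(const char *s, size_t *out) {
    /* Reject empty strings, signs, and leading whitespace up front,
     * because strtoull would silently accept "-1" and " 5". */
    if (s == NULL || *s < '0' || *s > '9')
        return -1;

    char *end;
    errno = 0;
    unsigned long long v = strtoull(s, &end, 10);
    if (errno == ERANGE)
        return -1;

    unsigned shift = 0;
    switch (*end) {
    case '\0': break;
    case 'K': shift = 10; end++; break;
    case 'M': shift = 20; end++; break;
    case 'G': shift = 30; end++; break;
    default:  return -1;
    }
    if (*end != '\0')
        return -1;              /* trailing junk such as "12MB" */

    if (v > (SIZE_MAX >> shift))
        return -1;              /* would overflow size_t */

    *out = (size_t)v << shift;
    return 0;
}
```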

3.6 Edge Cases

Missing binaries in rootfs.
Exceeding memory limit.
Cleanup on failure.

3.7 Real World Outcome

3.7.1 How to Run (Copy/Paste)

make
./minict run --rootfs ./rootfs -- /bin/sh

3.7.2 Golden Path Demo (Deterministic)

Use a fixed rootfs and fixed command sequence.

3.7.3 CLI Transcript (Success)

$ ./minict run --rootfs ./rootfs -- /bin/echo hello
hello
$ echo $?
0

3.7.4 CLI Transcript (Failure)

$ ./minict run --rootfs ./rootfs -- /bin/missing
minict: exec failed: /bin/missing
$ echo $?
127

3.7.5 Exit Codes

0 success
1 invalid config
2 namespace setup failed
127 exec failed
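The exit codes above can be kept in one place so the child and the parent agree on them. The enum and function names below are hypothetical; the message format follows the failure transcript in this section.

```c
#include <stdio.h>
#include <sys/wait.h>
#include <unistd.h>

enum minict_exit {
    MINICT_EXIT_OK        = 0,
    MINICT_EXIT_CONFIG    = 1,    /* invalid config */
    MINICT_EXIT_NAMESPACE = 2,    /* namespace setup failed */
    MINICT_EXIT_EXEC      = 127   /* exec failed */
};

/* Child side: replace the process with argv, or report failure and
 * exit with 127. This function never returns. */
void exec_command(char *const argv[]) {
    execv(argv[0], argv);
    fprintf(stderr, "minict: exec failed: %s\n", argv[0]);
    _exit(MINICT_EXIT_EXEC);
}

/* Parent side: wait for the child and return its exit code,
 * or -1 if it did not exit normally. */
int wait_exit_code(pid_t pid) {
    int status;
    if (waitpid(pid, &status, 0) == -1 || !WIFEXITED(status))
        return -1;
    return WEXITSTATUS(status);
}
```

The child uses _exit rather than exit so that stdio buffers inherited from the parent are not flushed a second time.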

4. Solution Architecture

4.1 High-Level Design

cli -> container setup -> namespaces -> cgroups -> exec cmd

4.2 Key Components

4.3 Data Structures (No Full Code)

struct container_config {
    char *rootfs;
    size_t memory_limit;
    int cpu_shares;
};
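Validating this struct before any namespace is created makes the "invalid config" exit path cheap. The check below is a sketch: config_validate is a hypothetical name, and treating a zero memory limit or a non-positive cpu_shares as an error is an assumption, not something the specification states.

```c
#include <stddef.h>
#include <sys/stat.h>

struct container_config {
    char *rootfs;
    size_t memory_limit;
    int cpu_shares;
};

/* Return 0 if the config looks usable, -1 otherwise. */
int config_validate(const struct container_config *cfg) {
    struct stat st;

    if (cfg == NULL || cfg->rootfs == NULL)
        return -1;

    /* rootfs must exist and be a directory. */
    if (stat(cfg->rootfs, &st) == -1 || !S_ISDIR(st.st_mode))
        return -1;

    /* Assumption: limits must be set explicitly and be positive. */
    if (cfg->memory_limit == 0 || cfg->cpu_shares <= 0)
        return -1;

    return 0;
}
```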

4.4 Algorithm Overview

Parse config and validate rootfs.
Create namespaces and fork child.
Set up rootfs and mounts.
Apply cgroups and drop capabilities.
Exec target command.
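One ordering problem hides in these steps: if the parent is the one that writes the child's PID into the cgroup, the child must not exec until that is done. A common way to handle this is a pipe used as a gate. The sketch below uses plain fork for brevity and the hypothetical names spawn_after_setup and setup_fn; in the real runtime the fork would be the clone call that creates the namespaces.

```c
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Parent-side setup hook (for example: add `child` to a cgroup).
 * Returns 0 on success. May be NULL when no setup is needed. */
typedef int (*setup_fn)(pid_t child, void *ctx);

/* Start argv, but let it exec only after setup() has succeeded.
 * Returns the child's exit code, or -1 on error. */
int spawn_after_setup(char *const argv[], setup_fn setup, void *ctx) {
    int gate[2];
    if (pipe(gate) == -1)
        return -1;

    pid_t pid = fork();
    if (pid == -1) {
        close(gate[0]);
        close(gate[1]);
        return -1;
    }

    if (pid == 0) {
        char go;
        ssize_t n;
        close(gate[1]);
        /* Block until the parent writes one byte. EOF means setup failed. */
        do {
            n = read(gate[0], &go, 1);
        } while (n == -1 && errno == EINTR);
        if (n != 1)
            _exit(2);
        close(gate[0]);
        execv(argv[0], argv);
        _exit(127);
    }

    close(gate[0]);
    int ok = (setup == NULL) || (setup(pid, ctx) == 0);
    if (ok && write(gate[1], "x", 1) != 1)
        ok = 0;
    close(gate[1]);   /* without a prior write, the child sees EOF and exits 2 */

    int status;
    pid_t w;
    do {
        w = waitpid(pid, &status, 0);
    } while (w == -1 && errno == EINTR);
    if (w == -1 || !WIFEXITED(status))
        return -1;
    return WEXITSTATUS(status);
}
```

Because the child exits on EOF, a failed setup never leaves an unconfined process running.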

Complexity Analysis

Time: O(1) setup per container
Space: O(1) per container metadata

5. Implementation Guide

5.1 Development Environment Setup

make

5.2 Project Structure

project-root/
├── src/ns.c
├── src/cgroups.c
├── src/rootfs.c
├── src/init.c
└── tests/

5.3 The Core Question You’re Answering

“What is a container really, and how does Linux isolate it?”

5.4 Concepts You Must Understand First

Namespace types and clone flags
cgroup v2 layout
pivot_root and mounts
capabilities and seccomp basics

5.5 Questions to Guide Your Design

Which namespaces are strictly required for isolation?
How do you ensure cleanup on failure?
What is the minimal rootfs content?

5.6 Thinking Exercise

Draw the process tree from host PID to container PID 1 to child process.

5.7 The Interview Questions They’ll Ask

How do containers differ from VMs?
What is the role of cgroups?
Why is PID 1 special in a container?

5.8 Hints in Layers

Hint 1: Start with a PID namespace and run a simple command such as echo.
Hint 2: Add a mount namespace and chroot.
Hint 3: Add cgroups last.

5.9 Books That Will Help

5.10 Implementation Phases

Phase 1: Foundation (4-6 days)

Namespaces and init process.

Phase 2: Core Functionality (5-7 days)

Rootfs and cgroups.

Phase 3: Polish and Edge Cases (4-6 days)

Capabilities, cleanup, logging.

5.11 Key Implementation Decisions

6. Testing Strategy

6.1 Test Categories

6.2 Critical Test Cases

Container sees PID 1 for its own init.
Memory limit triggers OOM kill in container.
Host files not accessible from container rootfs.

6.3 Test Data

commands: /bin/echo, /bin/sh, /usr/bin/yes

7. Common Pitfalls & Debugging

7.1 Frequent Mistakes

7.2 Debugging Strategies

Use strace to inspect clone and mount syscalls.
Use lsns to list namespaces.
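The same check can be done from inside a test program. Each /proc/<pid>/ns/<type> entry is a symbolic link whose target looks like pid:[4026531836], and two processes share a namespace exactly when those targets match. The helper names below are hypothetical, and the sketch assumes /proc is mounted.

```c
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Read the namespace identity of `pid` for type `ns` ("pid", "mnt", "uts", ...)
 * into buf. Returns 0 on success, -1 on error. */
int ns_identity(pid_t pid, const char *ns, char *buf, size_t buflen) {
    char path[64];
    int n = snprintf(path, sizeof path, "/proc/%d/ns/%s", (int)pid, ns);
    if (n < 0 || (size_t)n >= sizeof path || buflen == 0)
        return -1;

    /* readlink does not NUL-terminate, so reserve one byte for it. */
    ssize_t len = readlink(path, buf, buflen - 1);
    if (len == -1)
        return -1;
    buf[len] = '\0';
    return 0;
}

/* Return 1 if a and b share the namespace, 0 if not, -1 on error. */
int same_namespace(pid_t a, pid_t b, const char *ns) {
    char ida[64], idb[64];
    if (ns_identity(a, ns, ida, sizeof ida) == -1 ||
        ns_identity(b, ns, idb, sizeof idb) == -1)
        return -1;
    return strcmp(ida, idb) == 0;
}
```

In a test, same_namespace(getpid(), container_pid, "pid") returning 0 is direct evidence that the PID namespace was actually created.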

7.3 Performance Traps

Excessive bind mounts slow startup.

8. Extensions & Challenges

8.1 Beginner Extensions

Add hostname isolation (UTS namespace).
Add basic logging.

8.2 Intermediate Extensions

Add network namespace and veth pair.
Add seccomp filter.

8.3 Advanced Extensions

Implement layered rootfs with overlayfs.
Add container image import.

9. Real-World Connections

9.1 Industry Applications

Docker, containerd, and Kubernetes runtimes.

9.2 Related Open Source Projects

runc
containerd

9.3 Interview Relevance

Demonstrates OS isolation and security concepts.

10. Resources

10.1 Essential Reading

The Linux Programming Interface (Namespaces and cgroups)
Linux Container Internals

10.2 Video Resources

Container internals talks

10.3 Tools & Documentation

strace, lsns, cgget

11. Self-Assessment Checklist

11.1 Understanding

I can explain how namespaces isolate processes.
I can explain cgroup resource limits.
I can explain why PID 1 needs reaping.

11.2 Implementation

Container runs command successfully.
Resource limits are enforced.
Rootfs isolation works.

11.3 Growth

I can explain container internals in an interview.

12. Submission / Completion Criteria

Minimum Viable Completion:

Runs command in new namespaces.
Rootfs isolation works.

Full Completion:

Cgroups and capabilities implemented.
Deterministic demo passes.

Excellence (Going Above & Beyond):

Network namespace and overlayfs support.
Image import and snapshotting.