Project 8: Container Runtime from Scratch

Build a minimal container runtime using namespaces, cgroups, and an OCI-style bundle.

Quick Reference

| Attribute | Value |
|-----------|-------|
| Difficulty | Level 4: Advanced |
| Time Estimate | 2-3 weeks |
| Main Programming Language | C or Go |
| Alternative Programming Languages | Rust |
| Coolness Level | Level 4: Systems Practicality |
| Business Potential | Level 2: Platform Cred |
| Prerequisites | Linux namespaces, cgroups |
| Key Topics | OS isolation, OCI runtime model |

1. Learning Objectives

By completing this project, you will:

  1. Create PID, mount, network, and user namespaces.
  2. Apply cgroup v2 limits for CPU and memory.
  3. Run a process as PID 1 inside a rootfs.
  4. Explain why containers differ from VMs in isolation.

2. All Theory Needed (Per-Concept Breakdown)

2.1 Linux Namespaces and Process Isolation

Fundamentals

Namespaces create isolated views of global resources. PID namespaces provide isolated process trees; mount namespaces provide separate filesystem views; network namespaces provide independent network stacks; UTS namespaces isolate hostnames; IPC namespaces isolate shared memory and message queues; user namespaces map IDs for rootless containers. Together, these form the isolation boundary for containers. They are lightweight because they reuse the host kernel rather than virtualizing hardware.

Containers are not VMs. They do not have a separate kernel; they share the host kernel. This makes them fast to start and efficient, but it also means their security boundary is weaker than a VM boundary. This is why many platforms run containers inside VMs for multi-tenant isolation.

Deep Dive into the concept

Namespaces are implemented as kernel-level views of global resources. When a process is created with namespace flags, it gets its own instance of those resources. PID namespaces are special: PID 1 has unique signal semantics and is responsible for reaping zombies. This is why container runtimes often run an init-like process or use a small init wrapper.
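To make the PID 1 caveat concrete, here is a minimal init sketch in C, not the project's full runtime. It assumes it is already running as PID 1 inside a new PID namespace; the /bin/sh workload is illustrative. It forwards SIGTERM to the workload and reaps every child, including orphans re-parented to it.

```c
/* Minimal PID-1 init sketch: fork one workload, forward SIGTERM,
 * and reap all children. Assumes we are PID 1 in a new PID namespace. */
#include <errno.h>
#include <signal.h>
#include <sys/wait.h>
#include <unistd.h>

static pid_t child = -1;

static void on_term(int sig)
{
    if (child > 0)
        kill(child, sig);          /* forward to the real workload */
}

int main(void)
{
    signal(SIGTERM, on_term);      /* PID 1 gets no default signal handling */

    child = fork();
    if (child == 0) {
        execl("/bin/sh", "sh", (char *)NULL);  /* illustrative workload */
        _exit(127);
    }

    int status;
    for (;;) {
        pid_t pid = wait(&status); /* reaps zombies, including orphans */
        if (pid < 0) {
            if (errno == EINTR)
                continue;          /* interrupted by the forwarded signal */
            break;                 /* ECHILD: nothing left to reap */
        }
        if (pid == child)          /* when PID 1 exits, the namespace dies */
            return WIFEXITED(status) ? WEXITSTATUS(status) : 128;
    }
    return 0;
}
```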

Mount namespaces isolate filesystem mounts. A runtime typically mounts a rootfs and then uses pivot_root to replace the host root with the container root. It must also set mount propagation to avoid leaking mounts back to the host. This is subtle and a common source of bugs.
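The sequence is short but ordering-sensitive. Below is a hedged C sketch, assuming the process is already inside a new mount namespace and `rootfs` points at a prepared root filesystem; the `oldroot` directory name is illustrative. glibc has no pivot_root wrapper, so it is invoked via syscall(2).

```c
/* Switch into a container rootfs with pivot_root.
 * Assumes a new mount namespace and a prepared rootfs directory. */
#define _GNU_SOURCE
#include <sys/mount.h>
#include <sys/stat.h>
#include <sys/syscall.h>
#include <unistd.h>

int enter_rootfs(const char *rootfs)
{
    /* Make all mounts private so nothing propagates back to the host. */
    if (mount(NULL, "/", NULL, MS_REC | MS_PRIVATE, NULL) < 0)
        return -1;
    /* pivot_root requires new_root to be a mount point: bind it to itself. */
    if (mount(rootfs, rootfs, NULL, MS_BIND | MS_REC, NULL) < 0)
        return -1;
    if (chdir(rootfs) < 0)
        return -1;
    mkdir("oldroot", 0700);                    /* may already exist */
    if (syscall(SYS_pivot_root, ".", "oldroot") < 0)
        return -1;
    if (chdir("/") < 0)
        return -1;
    /* Detach the old root so the host filesystem becomes unreachable. */
    if (umount2("/oldroot", MNT_DETACH) < 0)
        return -1;
    return rmdir("/oldroot");
}
```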

Network namespaces provide isolated network stacks with their own interfaces, routing tables, and firewall rules. To connect a container to the host or outside world, the runtime creates virtual links (veth pairs) and attaches one end to a bridge or virtual switch. This is conceptually similar to VM networking but occurs at the kernel networking layer.
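Netlink is the kernel's native interface for this, but a minimal runtime can shell out to iproute2 instead. The sketch below does exactly that; the interface names (veth0, ceth0), the pre-existing bridge br0, and addressing the namespace by container PID are all illustrative assumptions.

```c
/* Connect a container's network namespace to a host bridge by shelling
 * out to iproute2. Names (veth0, ceth0, br0) are illustrative. */
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>

static void run(const char *cmd)
{
    if (system(cmd) != 0)
        fprintf(stderr, "command failed: %s\n", cmd);
}

void setup_veth(pid_t container_pid)
{
    char cmd[128];
    /* Create the pair: veth0 stays on the host, ceth0 moves inside. */
    run("ip link add veth0 type veth peer name ceth0");
    snprintf(cmd, sizeof cmd, "ip link set ceth0 netns %d", (int)container_pid);
    run(cmd);
    /* Attach the host end to the existing bridge and bring it up. */
    run("ip link set veth0 master br0 up");
    /* Addresses and routes are then configured from inside the namespace. */
}
```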

User namespaces allow a process to be “root” inside the container while mapped to an unprivileged UID on the host. This reduces the impact of container escapes but introduces complexity with filesystem ownership. Modern Linux supports idmapped mounts to reduce chown overhead.
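A minimal rootless sketch follows; the host UID/GID of 1000 is an assumption for illustration. Note the ordering: an unprivileged process must deny setgroups before it may write gid_map.

```c
/* Become "root" in a new user namespace while remaining unprivileged
 * on the host. Assumes the host UID/GID is 1000 (illustrative). */
#define _GNU_SOURCE
#include <fcntl.h>
#include <sched.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

static void write_file(const char *path, const char *text)
{
    int fd = open(path, O_WRONLY);
    if (fd < 0) { perror(path); return; }
    if (write(fd, text, strlen(text)) < 0)
        perror(path);
    close(fd);
}

int main(void)
{
    if (unshare(CLONE_NEWUSER) != 0) { perror("unshare"); return 1; }
    /* Required before an unprivileged process may write gid_map. */
    write_file("/proc/self/setgroups", "deny");
    write_file("/proc/self/uid_map", "0 1000 1");  /* container 0 -> host 1000 */
    write_file("/proc/self/gid_map", "0 1000 1");
    printf("uid inside namespace: %d\n", (int)getuid());  /* prints 0 */
    execl("/bin/sh", "sh", (char *)NULL);
    return 1;
}
```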

Namespaces are powerful but not sufficient for security. They isolate views of kernel resources; they do not protect against kernel vulnerabilities, because every container still shares the host kernel. This is why additional layers such as seccomp, AppArmor, and SELinux are standard in production.

Container runtimes must also manage filesystem layers and lifecycle cleanup. Overlayfs can introduce copy-up overhead that surprises applications. Rootless containers improve safety but require user namespaces and idmapped mounts, which can break assumptions in legacy software. Signal forwarding and zombie reaping are subtle but critical for long-running containers. These details determine whether a runtime feels reliable in production.

How this fits in the project

Namespaces are the primary mechanism your runtime will use to isolate the container.

Definitions & key terms

  • Namespace: kernel-isolated view of a global resource.
  • PID 1: special process with signal and reaping semantics.
  • veth: virtual Ethernet pair used to connect namespaces.

Mental model diagram

Process -> PID/NET/MNT/UTS namespaces -> isolated view

How it works (step-by-step, with invariants and failure modes)

  1. Create new namespaces for the container process.
  2. Configure mounts and rootfs.
  3. Configure network namespace (veth, routes).
  4. Execute target process.

Invariants: namespaces must be created before exec; PID 1 must reap children. Failure modes include leaked mounts or zombie processes.

Minimal concrete example

CLONE -> new PID + mount namespaces -> exec /bin/sh
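Expanded into runnable C, this is a hedged sketch rather than the full runtime: run as root, it creates PID, mount, and UTS namespaces, mounts a fresh /proc so the shell's ps sees only container processes, and execs /bin/sh.

```c
/* Minimal concrete example: new PID + mount + UTS namespaces, then
 * exec /bin/sh with its own /proc. Run as root. */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <sys/mount.h>
#include <sys/wait.h>
#include <unistd.h>

static char stack[1024 * 1024];               /* child stack for clone() */

static int child(void *arg)
{
    (void)arg;
    sethostname("container", 9);              /* visible only in this UTS ns */
    /* Keep mounts private, then give the new PID namespace its own /proc. */
    mount(NULL, "/", NULL, MS_REC | MS_PRIVATE, NULL);
    mount("proc", "/proc", "proc", 0, NULL);
    execl("/bin/sh", "sh", (char *)NULL);
    perror("execl");
    return 1;
}

int main(void)
{
    pid_t pid = clone(child, stack + sizeof stack,
                      CLONE_NEWPID | CLONE_NEWNS | CLONE_NEWUTS | SIGCHLD,
                      NULL);
    if (pid < 0) { perror("clone"); return 1; }
    return waitpid(pid, NULL, 0) < 0;
}
```

Inside the shell, `echo $$` prints 1 and `hostname` prints container, matching the golden-path behavior in §3.7.2.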

Common misconceptions

  • Namespaces provide the same isolation as a VM.
  • PID 1 behaves like other processes.

Check-your-understanding questions

  1. Why does PID 1 need special handling?
  2. Why are user namespaces important for rootless containers?

Check-your-understanding answers

  1. PID 1 ignores some signals and must reap zombies.
  2. They map container root to an unprivileged host UID.

Real-world applications

  • Docker, containerd, CRI-O

Where you’ll apply it

  • Apply in §3.2 (functional requirements) and §5.10 (implementation phases)
  • Also used in: P10-mini-cloud-control-plane

References

  • Linux namespaces documentation

Key insights

Namespaces isolate resources, not the kernel.

Summary

You now understand how namespaces create container isolation boundaries.

Homework/Exercises to practice the concept

  1. Explain the lifecycle of a process in a PID namespace.
  2. Sketch a veth pair connecting a container to a bridge.

Solutions to the homework/exercises

  1. The process becomes PID 1 and must reap children.
  2. One end in container netns, one end in host bridge.

2.2 Cgroups and the OCI Runtime Model

Fundamentals

Cgroups control resource usage by grouping processes and applying limits. Cgroup v2 provides a unified hierarchy with controllers for CPU, memory, and I/O. A container runtime uses cgroups to enforce resource limits and track usage. The OCI runtime specification defines the container bundle format (rootfs + config) and the lifecycle of a container process. This enables interoperability across runtimes.

Deep Dive into the concept

Cgroups are hierarchical. Controllers are enabled on parent cgroups and apply to children. In cgroup v2, CPU limits use a quota/period model, memory limits use memory.max and memory.high, and I/O limits use io.max. A runtime must create a cgroup subtree, enable controllers, and place the container process into that cgroup.
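A sketch of those steps, assuming the standard v2 mount at /sys/fs/cgroup; the minirun subtree name and the specific limit values are illustrative.

```c
/* cgroup v2 sketch: enable controllers in the parent, create a child
 * cgroup, apply limits, attach a process. Run as root. */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>
#include <unistd.h>

static int write_str(const char *path, const char *val)
{
    int fd = open(path, O_WRONLY);
    if (fd < 0)
        return -1;
    ssize_t n = write(fd, val, strlen(val));
    close(fd);
    return n < 0 ? -1 : 0;
}

int setup_cgroup(pid_t pid)
{
    /* Controllers must be enabled in the parent before children use them. */
    write_str("/sys/fs/cgroup/cgroup.subtree_control", "+cpu +memory");
    mkdir("/sys/fs/cgroup/minirun", 0755);
    /* 50% of one CPU: 50 ms quota per 100 ms period. */
    write_str("/sys/fs/cgroup/minirun/cpu.max", "50000 100000");
    write_str("/sys/fs/cgroup/minirun/memory.max", "268435456");  /* 256 MiB */
    /* Writing the PID moves the whole process into the cgroup. */
    char buf[32];
    snprintf(buf, sizeof buf, "%d", (int)pid);
    return write_str("/sys/fs/cgroup/minirun/cgroup.procs", buf);
}
```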

The OCI runtime spec defines a config that describes namespaces, mounts, cgroups, capabilities, and seccomp filters. A runtime reads this config, creates namespaces, sets up mounts, applies cgroups, and execs the target process. This is a strict contract: if a runtime ignores fields, behavior diverges across environments.

Resource control has operational implications. If memory.max is too low, the container is OOM-killed. If CPU quotas are too strict, latency increases. If I/O limits are misconfigured, the container can starve or monopolize resources. Observability is crucial: operators need to see cgroup metrics to understand why containers slow down.

Security features are part of the OCI model. Capabilities define which privileged operations a container can perform. Seccomp filters restrict syscalls. LSMs enforce access control. These are not optional in production; they reduce the attack surface when untrusted workloads run on shared hosts.

The runtime also manages lifecycle signals: create, start, stop, delete. It must handle cleanup if the container crashes and ensure that mounts and cgroups are removed. Otherwise, resource leaks accumulate over time. This is a major operational challenge in long-running hosts.
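A minimal cleanup sketch for the delete step, reusing the illustrative names from above. MNT_DETACH lazily detaches mounts that a crashed workload may still hold busy; the cgroup directory can only be removed once no process remains in it.

```c
/* Teardown sketch for the delete step: undo what create/start built. */
#include <stdio.h>
#include <sys/mount.h>
#include <unistd.h>

void cleanup_container(const char *rootfs_mount)
{
    /* Lazy detach survives mounts still held busy by a dying workload. */
    if (umount2(rootfs_mount, MNT_DETACH) < 0)
        perror("umount2");
    /* Fails with EBUSY while any process is still in the cgroup. */
    if (rmdir("/sys/fs/cgroup/minirun") < 0)
        perror("rmdir cgroup");
}
```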

How this fits in the project

You will implement cgroup creation, limit enforcement, and OCI-style config handling.

Definitions & key terms

  • cgroup: resource control group in Linux.
  • OCI bundle: rootfs plus configuration for a container.
  • Capability: a fine-grained privilege token.

Mental model diagram

Config -> namespaces + cgroups -> exec process

How it works (step-by-step, with invariants and failure modes)

  1. Create cgroup subtree and enable controllers.
  2. Apply limits to the container cgroup.
  3. Place container process in cgroup.
  4. Execute process and monitor limits.

Invariants: controllers enabled in parent; process attached to correct cgroup. Failure modes include missing limits or cgroup leaks.

Minimal concrete example

SET memory.max=256M -> run process -> allocations beyond the limit trigger reclaim, then the OOM killer

Common misconceptions

  • Cgroups fully isolate performance.
  • OCI spec is optional if your runtime is “simple”.

Check-your-understanding questions

  1. Why is cgroup v2 hierarchical?
  2. What happens if you forget to enable controllers?

Check-your-understanding answers

  1. It allows consistent resource limits across a tree.
  2. Limits will not be enforced.

Real-world applications

  • Kubernetes resource limits
  • Docker and containerd runtimes

Where you’ll apply it

  • Apply in §3.2 (functional requirements) and §6.2 (critical tests)
  • Also used in: P07-vagrant-style-orchestrator

References

  • cgroup v2 documentation
  • OCI runtime spec

Key insights

Cgroups provide resource control; OCI provides portability.

Summary

You now understand how cgroups and OCI define container runtime behavior.

Homework/Exercises to practice the concept

  1. Explain how CPU quotas are enforced in cgroup v2.
  2. List the minimum fields an OCI config must include.

Solutions to the homework/exercises

  1. cpu.max holds a quota/period pair; e.g. "50000 100000" grants 50 ms of CPU time per 100 ms period, i.e. half of one CPU.
  2. Rootfs path, namespaces, mounts, process args, and limits.

3. Project Specification

3.1 What You Will Build

A minimal runtime that creates namespaces, applies cgroups, and runs a process inside a rootfs.

3.2 Functional Requirements

  1. Create PID, mount, UTS, and network namespaces.
  2. Apply cgroup CPU and memory limits.
  3. Set hostname and rootfs.
  4. Execute target process.

3.3 Non-Functional Requirements

  • Performance: startup under 1 second.
  • Reliability: cleanup on exit.
  • Usability: clear logs.

3.4 Example Usage / Output

$ sudo ./minirun run ./rootfs /bin/sh
[minirun] namespaces: pid, net, mnt, uts
[minirun] cgroup: cpu.max=50% memory.max=256M

3.5 Data Formats / Schemas / Protocols

  • OCI-style config fields for namespaces and limits

3.6 Edge Cases

  • Missing rootfs
  • PID 1 exits immediately

3.7 Real World Outcome

Process runs as PID 1 with isolated hostname and enforced limits.

3.7.1 How to Run (Copy/Paste)

  • Run as root
  • Provide rootfs and command

3.7.2 Golden Path Demo (Deterministic)

  • Container prints hostname and shows PID 1

3.7.3 If CLI: exact terminal transcript

$ sudo ./minirun run ./rootfs /bin/sh
root@container:/# hostname
container

4. Solution Architecture

4.1 High-Level Design

Config -> namespaces -> cgroups -> exec

4.2 Key Components

| Component | Responsibility | Key Decisions |
|-----------|----------------|---------------|
| Namespace setup | Create isolation | Which namespaces |
| Cgroup setup | Limits | v2 only |
| Rootfs | Mounts | overlayfs optional |

4.3 Data Structures (No Full Code)

  • Config object: namespaces, limits, command

4.4 Algorithm Overview

  1. Create namespaces
  2. Setup rootfs
  3. Apply cgroups
  4. Exec process

5. Implementation Guide

5.1 Development Environment Setup

# Ensure cgroup v2 is enabled; this should print "cgroup2fs"
stat -fc %T /sys/fs/cgroup

5.2 Project Structure

project-root/
├── src/
│   ├── runtime.c
│   └── cgroup.c
└── README.md

5.3 The Core Question You’re Answering

“How does Linux isolate a process so it looks like a VM without a hypervisor?”

5.4 Concepts You Must Understand First

  1. Namespaces
  2. cgroup v2
  3. OCI runtime lifecycle

5.5 Questions to Guide Your Design

  1. How will you clean up mounts and cgroups?
  2. How will you handle PID 1 signal behavior?

5.6 Thinking Exercise

Draw the namespace and cgroup hierarchy for one container.

5.7 The Interview Questions They’ll Ask

  1. “What is the difference between namespaces and cgroups?”
  2. “Why do containers start faster than VMs?”

5.8 Hints in Layers

Hint 1: Start with the PID namespace only.
Hint 2: Add cgroup limits after namespaces work.
Hint 3: Pseudocode:

CLONE namespaces -> mount rootfs -> apply cgroups -> exec

Hint 4: Use lsns to verify isolation.

5.9 Books That Will Help

| Topic | Book | Chapter |
|-------|------|---------|
| OS virtualization | “Modern Operating Systems” | Ch. 7 |
| System calls | “The Linux Programming Interface” | Ch. 49 |

5.10 Implementation Phases

  • Phase 1: Namespace setup
  • Phase 2: Rootfs + mounts
  • Phase 3: Cgroup limits

5.11 Key Implementation Decisions

| Decision | Options | Recommendation | Rationale |
|----------|---------|----------------|-----------|
| Rootfs | chroot vs pivot_root | pivot_root | closer to OCI |


6. Testing Strategy

6.1 Test Categories

| Category | Purpose | Examples |
|----------|---------|----------|
| Integration Tests | Isolation | PID 1, hostname |

6.2 Critical Test Cases

  1. PID namespace isolates process list.
  2. Memory limit triggers OOM inside container.

6.3 Test Data

Run /bin/sh and print hostname

7. Common Pitfalls & Debugging

7.1 Frequent Mistakes

| Pitfall | Symptom | Solution |
|---------|---------|----------|
| /proc not mounted | ps shows host processes | Mount /proc inside the container |
| Missing cgroup controller | No limits enforced | Enable controllers in the parent cgroup |

7.2 Debugging Strategies

  • Check /proc/self/cgroup and lsns.

7.3 Performance Traps

  • Excessive mount operations on startup.

8. Extensions & Challenges

8.1 Beginner Extensions

  • Add a custom hostname flag.

8.2 Intermediate Extensions

  • Add network namespace with veth.

8.3 Advanced Extensions

  • Add seccomp filtering.

9. Real-World Connections

9.1 Industry Applications

  • containerd, CRI-O
  • runc, crun

9.2 Interview Relevance

  • Namespaces, cgroups, OCI

10. Resources

10.1 Essential Reading

  • OCI runtime spec
  • cgroup v2 documentation

10.2 Video Resources

  • Container internals talks