Project 1: Kernel-Level Container Runtime Lab
Build a minimal runtime lab that launches isolated processes with explicit namespace, cgroup, and privilege constraints.
Quick Reference
| Attribute | Value |
|---|---|
| Difficulty | Expert |
| Time Estimate | 10-16 hours |
| Main Programming Language | Go |
| Alternative Programming Languages | Rust, C |
| Coolness Level | Level 4 - Hardcore Tech Flex |
| Business Potential | 2. Operational Superpower |
| Prerequisites | Linux process model, cgroups, namespace basics |
| Key Topics | PID/mount/net namespaces, cgroups v2, capabilities, PID 1 lifecycle |
1. Learning Objectives
By completing this project, you will:
- Explain the kernel primitives behind container runtime behavior.
- Design a deterministic runtime launch sequence with observable checkpoints.
- Diagnose isolation and resource failures from kernel signals.
- Produce a security baseline for non-privileged runtime execution.
2. All Theory Needed (Per-Concept Breakdown)
2.1 Namespace Boundaries
Fundamentals
Namespaces isolate visibility, not trust. A process in a PID namespace can see a different process tree than the host, but it still shares the host kernel. Mount namespaces isolate filesystem views. Network namespaces isolate interfaces and routes. You need this distinction to avoid confusing containerization with virtualization.
Deep Dive into the concept
Container startup order matters: namespace setup before mount operations, then process execution. PID namespace semantics also require understanding that the first process becomes PID 1 with special signal/reaping responsibilities. If this process ignores shutdown signals, child processes remain and graceful termination fails. Mount namespace mistakes create confusing cross-container behavior: missing /proc, host filesystem leakage, or read-write paths that should be read-only.
Visibility boundaries define debugging complexity. A process may appear healthy from inside but fail from host perspective due to cgroup pressure. Observability strategy should include both namespace-internal and host-level views.
How this fits into other projects
- Primary here and reused in P04 and P09.
Definitions & key terms
- Namespace, PID 1, mount propagation, host namespace sharing.
Mental model diagram
Host process tree:          Container process tree:
  1    systemd                1   app-init
  200  kubelet                17  worker
  350  runtime-shim           22  helper
Same kernel, different visibility scope.
How it works
- Create namespaces.
- Mount runtime root filesystem.
- Enter namespace context.
- Execute entrypoint as PID 1.
Invariants: isolation scope is explicit. Failure modes: leaked host mounts, zombie processes, shutdown delays.
Minimal concrete example
Pseudo-sequence:
create pid,mnt,net namespaces
mount rootfs read-only + writable tmp dirs
drop capabilities
exec /app/start as PID 1
Common misconceptions
- “Namespace means secure by default.” -> False; privilege and mounts still matter.
Check-your-understanding questions
- Why can PID 1 break graceful shutdown?
- Which namespace bug can expose host paths?
Check-your-understanding answers
- PID 1 must forward/handle signals and reap children.
- Incorrect mount namespace setup or broad hostPath mounts.
Real-world applications
- Hardened multi-tenant runtime baselines.
Where you’ll apply it
- P01, P04, P09.
References
- Kubernetes container runtime docs
- Docker runtime overview
Key insights
- Isolation is a composition of boundaries, not one flag.
Summary
- Namespace correctness is the foundation of predictable container behavior.
Homework/Exercises to practice the concept
- Trace shutdown behavior for workloads with and without PID 1 signal handling.
- Compare namespace visibility outputs between host and container contexts.
Solutions to the homework/exercises
- Capture process trees before and after termination signal.
- Record command outputs and identify boundary differences.
2.2 Cgroups and Resource Policy
Fundamentals
Cgroups control resource consumption. CPU limits define throttling windows; memory limits define hard ceilings that may trigger OOM kills.
Deep Dive into the concept
Resource policy is workload-specific. Setting low limits can pass integration tests but fail under bursty production load. CPU throttling is especially deceptive because average CPU metrics may look healthy while request latency explodes. Memory behavior includes page cache and allocator overhead; OOM can happen even when application-level memory seems below limit.
Good policy pairs requests and limits with observed workload characteristics. Production tuning must include stress tests and SLO correlation. A reliable runtime baseline logs cgroup signals and surfaces them in dashboards and incident timelines.
How this fits into other projects
- P01 baseline, used again in P05 scheduling and P10 platform policy.
Definitions & key terms
- CPU quota, throttling, OOM kill, request vs limit.
Mental model diagram
Workload demand ---> cgroup guardrails ---> kernel scheduler/memory manager
      |                      |                           |
 burst traffic       throttle/OOM events       latency + restart impact
How it works
- Assign process to cgroup path.
- Apply CPU/memory controls.
- Observe throttle/OOM counters.
- Tune policy with workload evidence.
Invariants: limits are enforced; guarantees depend on requests and cluster state. Failure modes: hidden throttling, repeated OOM restarts.
Minimal concrete example
Policy comparison run:
Profile A: cpu 200m, memory 128Mi
Profile B: cpu 500m, memory 256Mi
Run same workload and compare p95 latency + restart count.
Common misconceptions
- “Low resource limits always reduce cost.” -> Can increase retries and total cost.
Check-your-understanding questions
- Why can throttling raise latency without high average CPU?
- Why are requests essential for scheduling quality?
Check-your-understanding answers
- Burst work is repeatedly paused by quota windows.
- Scheduler uses requests to estimate placement feasibility.
Real-world applications
- FinOps-aware workload tuning.
Where you’ll apply it
- P01, P05, P10.
References
- Kubernetes resource management docs
Key insights
- Resource settings are SLO controls, not just cost controls.
Summary
- Cgroup policy requires measurement-driven iteration.
Homework/Exercises to practice the concept
- Build a throttling test and capture latency effects.
- Document one stable policy profile per workload type.
Solutions to the homework/exercises
- Compare p95/p99 under identical load at different quotas.
- Add runbook table with policy rationale and tradeoffs.
3. Project Specification
3.1 What You Will Build
A lab tool with explicit runtime phases:
- prepare isolation context
- apply resource policy
- execute process
- collect observability snapshot
Included:
- deterministic launch and inspect commands
- structured logs for each lifecycle step
Excluded:
- full OCI-compliant runtime implementation
- production multi-tenant security hardening
3.2 Functional Requirements
- Create process in isolated PID/mount/net context.
- Apply cgroup constraints and report effective values.
- Surface lifecycle events (start, signal, exit reason).
- Output explainable diagnostics on failures.
3.3 Non-Functional Requirements
- Performance: startup diagnostics complete in under 2 seconds.
- Reliability: repeated runs produce consistent lifecycle events.
- Usability: single command for run + inspect + cleanup.
3.4 Example Usage / Output
$ ./runtime-lab run --profile baseline --cmd /bin/sh
$ ./runtime-lab inspect pod-001
$ ./runtime-lab cleanup pod-001
3.5 Data Formats / Schemas / Protocols
- Runtime profile schema: namespace set, cgroup limits, capability list.
- Event schema: timestamp, phase, status, diagnostic fields.
3.6 Edge Cases
- Entrypoint process exits immediately.
- cgroup path cannot be created.
- runtime cleanup called on already-removed object.
3.7 Real World Outcome
3.7.1 How to Run (Copy/Paste)
$ ./runtime-lab run --profile baseline --cmd /bin/sh
$ ./runtime-lab inspect pod-001
$ ./runtime-lab stop pod-001
3.7.2 Golden Path Demo (Deterministic)
- Launch baseline profile.
- Verify namespace and cgroup values.
- Terminate and verify cleanup.
3.7.3 Exact Terminal Transcript
$ ./runtime-lab run --profile baseline --cmd /bin/sh
[INFO] namespace setup complete
[INFO] cgroup policy applied cpu=500m memory=256Mi
[INFO] process started pid=1 (container namespace)
$ ./runtime-lab inspect pod-001
status: running
namespaces: pid,mnt,net,uts
limits: cpu=500m memory=256Mi
$ ./runtime-lab stop pod-001
[INFO] termination signal delivered
[INFO] process tree drained
[INFO] cleanup complete
4. Solution Architecture
4.1 High-Level Design
CLI -> runtime planner -> isolation executor -> process launcher -> diagnostics collector
4.2 Key Components
| Component | Responsibility | Key Decisions |
|---|---|---|
| Planner | Parse profile + command | deterministic phase ordering |
| Executor | Apply namespaces/cgroups | fail fast on unsafe config |
| Launcher | Start and stop process | explicit PID 1 handling |
| Observer | Capture events and state | structured timeline output |
4.3 Data Structures (No Full Code)
RuntimeProfile:
- namespace_set
- cpu_limit
- memory_limit
- capability_set
RuntimeEvent:
- timestamp
- phase
- status
- details
4.4 Algorithm Overview
Key Algorithm: deterministic launch sequence
- Validate profile.
- Apply isolation controls.
- Start process and monitor lifecycle.
- Emit summary and cleanup.
Complexity:
- Time: O(phases)
- Space: O(events captured)
5. Implementation Guide
5.1 Development Environment Setup
# install the Go toolchain (any recent release)
# verify cgroup v2 is mounted: stat -fc %T /sys/fs/cgroup   (expect: cgroup2fs)
# run local lab checks: go build ./... && go test ./...
5.2 Project Structure
runtime-lab/
cmd/
internal/planner/
internal/executor/
internal/observer/
docs/
5.3 The Core Question You’re Answering
“Which concrete kernel controls must be composed for process isolation to be meaningful and observable?”
5.4 Concepts You Must Understand First
- Namespace semantics and lifecycle.
- cgroup throttling/OOM signals.
- Capability minimization patterns.
5.5 Questions to Guide Your Design
- Which step should fail closed for safety?
- Which signals prove runtime correctness, not just successful command execution?
5.6 Milestones
- Launch process with namespace boundaries.
- Apply cgroup policy and confirm counters.
- Add stop/cleanup flow and deterministic transcript.
- Inject failures and validate diagnostics.
5.7 Validation and Testing
- Repeated start/stop reliability tests.
- Stress tests for throttle/OOM behavior.
- Negative tests for unsafe config and invalid profiles.
5.8 Common Pitfalls and Recovery
- Missing cleanup after abnormal termination.
- Hidden privilege escalation from overly broad capability sets.
5.9 Definition of Done
- Deterministic launch/inspect/stop transcript recorded.
- Failure diagnostics cover at least three injected faults.
- Security and resource policy baseline documented.
- Postmortem notes explain one real tradeoff you observed.