Project 1: Kernel-Level Container Runtime Lab
Build a minimal runtime lab that launches isolated processes with explicit namespace, cgroup, and privilege constraints.
Quick Reference
| Attribute | Value |
|---|---|
| Difficulty | Expert |
| Time Estimate | 10-16 hours |
| Main Programming Language | Go |
| Alternative Programming Languages | Rust, C |
| Coolness Level | Level 4 - Hardcore Tech Flex |
| Business Potential | 2. Operational Superpower |
| Prerequisites | Linux process model, cgroups, namespace basics |
| Key Topics | PID/mount/net namespaces, cgroups v2, capabilities, PID 1 lifecycle |
1. Learning Objectives
By completing this project, you will:
- Explain the kernel primitives behind container runtime behavior.
- Design a deterministic runtime launch sequence with observable checkpoints.
- Diagnose isolation and resource failures from kernel signals.
- Produce a security baseline for non-privileged runtime execution.
2. All Theory Needed (Per-Concept Breakdown)
2.1 Namespace Boundaries
Fundamentals
Namespaces isolate visibility, not trust. A process in a PID namespace can see a different process tree than the host, but it still shares the host kernel. Mount namespaces isolate filesystem views. Network namespaces isolate interfaces and routes. You need this distinction to avoid confusing containerization with virtualization.
Deep Dive into the concept
Container startup order matters: namespace setup before mount operations, then process execution. PID namespace semantics also require understanding that the first process becomes PID 1 with special signal/reaping responsibilities. If this process ignores shutdown signals, child processes remain and graceful termination fails. Mount namespace mistakes create confusing cross-container behavior: missing /proc, host filesystem leakage, or read-write paths that should be read-only.
Visibility boundaries define debugging complexity. A process may appear healthy from inside but fail from host perspective due to cgroup pressure. Observability strategy should include both namespace-internal and host-level views.
How this fits into other projects
- Primary here and reused in P04 and P09.
Definitions & key terms
- Namespace, PID 1, mount propagation, host namespace sharing.
Mental model diagram
Host process tree:          Container process tree:
  1    systemd                1   app-init
  200  kubelet                17  worker
  350  runtime-shim           22  helper
Same kernel, different visibility scope.
How it works
- Create namespaces.
- Mount runtime root filesystem.
- Enter namespace context.
- Execute entrypoint as PID 1.
Invariants: isolation scope is explicit. Failure modes: leaked host mounts, zombie processes, shutdown delays.
Minimal concrete example
Pseudo-sequence:
create pid,mnt,net namespaces
mount rootfs read-only + writable tmp dirs
drop capabilities
exec /app/start as PID 1
Common misconceptions
- “Namespace means secure by default.” -> False; privilege and mounts still matter.
Check-your-understanding questions
- Why can PID 1 break graceful shutdown?
- Which namespace bug can expose host paths?
Check-your-understanding answers
- PID 1 must forward/handle signals and reap children.
- Incorrect mount namespace setup or broad hostPath mounts.
Real-world applications
- Hardened multi-tenant runtime baselines.
Where you’ll apply it
- P01, P04, P09.
References
- Kubernetes container runtime docs
- Docker runtime overview
Key insights
- Isolation is a composition of boundaries, not one flag.
Summary
- Namespace correctness is the foundation of predictable container behavior.
Homework/Exercises to practice the concept
- Trace shutdown behavior for workloads with and without PID 1 signal handling.
- Compare namespace visibility outputs between host and container contexts.
Solutions to the homework/exercises
- Capture process trees before and after termination signal.
- Record command outputs and identify boundary differences.
2.2 Cgroups and Resource Policy
Fundamentals
Cgroups control resource consumption. CPU limits define throttling windows; memory limits define hard ceilings that may trigger OOM kills.
Deep Dive into the concept
Resource policy is workload-specific. Setting low limits can pass integration tests but fail under bursty production load. CPU throttling is especially deceptive because average CPU metrics may look healthy while request latency explodes. Memory behavior includes page cache and allocator overhead; OOM can happen even when application-level memory seems below limit.
Good policy pairs requests and limits with observed workload characteristics. Production tuning must include stress tests and SLO correlation. A reliable runtime baseline logs cgroup signals and surfaces them in dashboards and incident timelines.
How this fits into other projects
- P01 baseline, used again in P05 scheduling and P10 platform policy.
Definitions & key terms
- CPU quota, throttling, OOM kill, request vs limit.
Mental model diagram
Workload demand ---> cgroup guardrails ---> kernel scheduler/memory manager
      |                      |                           |
 burst traffic       throttle/OOM events       latency + restart impact
How it works
- Assign process to cgroup path.
- Apply CPU/memory controls.
- Observe throttle/OOM counters.
- Tune policy with workload evidence.
Invariants: limits are enforced; guarantees depend on requests and cluster state. Failure modes: hidden throttling, repeated OOM restarts.
Minimal concrete example
Policy comparison run:
Profile A: cpu 200m, memory 128Mi
Profile B: cpu 500m, memory 256Mi
Run same workload and compare p95 latency + restart count.
Common misconceptions
- “Low resource limits always reduce cost.” -> Can increase retries and total cost.
Check-your-understanding questions
- Why can throttling raise latency without high average CPU?
- Why are requests essential for scheduling quality?
Check-your-understanding answers
- Burst work is repeatedly paused by quota windows.
- Scheduler uses requests to estimate placement feasibility.
Real-world applications
- FinOps-aware workload tuning.
Where you’ll apply it
- P01, P05, P10.
References
- Kubernetes resource management docs
Key insights
- Resource settings are SLO controls, not just cost controls.
Summary
- Cgroup policy requires measurement-driven iteration.
Homework/Exercises to practice the concept
- Build a throttling test and capture latency effects.
- Document one stable policy profile per workload type.
Solutions to the homework/exercises
- Compare p95/p99 under identical load at different quotas.
- Add runbook table with policy rationale and tradeoffs.
3. Project Specification
3.1 What You Will Build
A lab tool with explicit runtime phases:
- prepare isolation context
- apply resource policy
- execute process
- collect observability snapshot
Included:
- deterministic launch and inspect commands
- structured logs for each lifecycle step
Excluded:
- full OCI-compliant runtime implementation
- production multi-tenant security hardening
3.2 Functional Requirements
- Create process in isolated PID/mount/net context.
- Apply cgroup constraints and report effective values.
- Surface lifecycle events (start, signal, exit reason).
- Output explainable diagnostics on failures.
3.3 Non-Functional Requirements
- Performance: startup diagnostics complete in under 2 seconds.
- Reliability: repeated runs produce consistent lifecycle events.
- Usability: single command for run + inspect + cleanup.
3.4 Example Usage / Output
$ ./runtime-lab run --profile baseline --cmd /bin/sh
$ ./runtime-lab inspect pod-001
$ ./runtime-lab cleanup pod-001
3.5 Data Formats / Schemas / Protocols
- Runtime profile schema: namespace set, cgroup limits, capability list.
- Event schema: timestamp, phase, status, diagnostic fields.
3.6 Edge Cases
- Entrypoint process exits immediately.
- cgroup path cannot be created.
- runtime cleanup called on already-removed object.
3.7 Real World Outcome
3.7.1 How to Run (Copy/Paste)
$ ./runtime-lab run --profile baseline --cmd /bin/sh
$ ./runtime-lab inspect pod-001
$ ./runtime-lab stop pod-001
3.7.2 Golden Path Demo (Deterministic)
- Launch baseline profile.
- Verify namespace and cgroup values.
- Terminate and verify cleanup.
3.7.3 Exact Terminal Transcript
$ ./runtime-lab run --profile baseline --cmd /bin/sh
[INFO] namespace setup complete
[INFO] cgroup policy applied cpu=500m memory=256Mi
[INFO] process started pid=1 (container namespace)
$ ./runtime-lab inspect pod-001
status: running
namespaces: pid,mnt,net,uts
limits: cpu=500m memory=256Mi
$ ./runtime-lab stop pod-001
[INFO] termination signal delivered
[INFO] process tree drained
[INFO] cleanup complete
4. Solution Architecture
4.1 High-Level Design
CLI -> runtime planner -> isolation executor -> process launcher -> diagnostics collector
4.2 Key Components
| Component | Responsibility | Key Decisions |
|---|---|---|
| Planner | Parse profile + command | deterministic phase ordering |
| Executor | Apply namespaces/cgroups | fail fast on unsafe config |
| Launcher | Start and stop process | explicit PID 1 handling |
| Observer | Capture events and state | structured timeline output |
4.3 Data Structures (No Full Code)
RuntimeProfile:
- namespace_set
- cpu_limit
- memory_limit
- capability_set
RuntimeEvent:
- timestamp
- phase
- status
- details
4.4 Algorithm Overview
Key Algorithm: deterministic launch sequence
- Validate profile.
- Apply isolation controls.
- Start process and monitor lifecycle.
- Emit summary and cleanup.
Complexity:
- Time: O(phases)
- Space: O(events captured)
5. Implementation Guide
5.1 Development Environment Setup
# install the Go toolchain (any recent release)
# verify cgroup v2 is mounted: stat -fc %T /sys/fs/cgroup   (expect: cgroup2fs)
# run local lab checks: go build ./... && go test ./...
5.2 Project Structure
runtime-lab/
cmd/
internal/planner/
internal/executor/
internal/observer/
docs/
5.3 The Core Question You’re Answering
“Which concrete kernel controls must be composed for process isolation to be meaningful and observable?”
5.4 Concepts You Must Understand First
- Namespace semantics and lifecycle.
- cgroup throttling/OOM signals.
- Capability minimization patterns.
5.5 Questions to Guide Your Design
- Which step should fail closed for safety?
- Which signals prove runtime correctness, not just successful command execution?
5.6 Milestones
- Launch process with namespace boundaries.
- Apply cgroup policy and confirm counters.
- Add stop/cleanup flow and deterministic transcript.
- Inject failures and validate diagnostics.
5.7 Validation and Testing
- Repeated start/stop reliability tests.
- Stress tests for throttle/OOM behavior.
- Negative tests for unsafe config and invalid profiles.
5.8 Common Pitfalls and Recovery
- Missing cleanup after abnormal termination.
- Hidden privilege escalation from overly broad capability sets.
5.9 Definition of Done
- Deterministic launch/inspect/stop transcript recorded.
- Failure diagnostics cover at least three injected faults.
- Security and resource policy baseline documented.
- Postmortem notes explain one real tradeoff you observed.