Docker Containers and Kubernetes Mastery - Real World Projects
Goal: Build deep, first-principles understanding of containers and Kubernetes by connecting Linux kernel primitives, OCI standards, and Kubernetes control loops to practical systems work. You will stop treating Dockerfiles and YAML as magic and start reasoning from namespaces, cgroups, image manifests, reconciliation, and network datapaths. Across ten real projects, you will design, observe, stress, and harden containerized systems with measurable outcomes. By the end, you will be able to design production-grade container platforms, debug hard incidents, and explain tradeoffs clearly in interviews and architecture reviews.
Introduction
Docker packages and runs software as isolated processes. Kubernetes coordinates those processes at cluster scale through a declarative API and control loops. Together they solve a practical problem: shipping software consistently while keeping operations predictable under change, failure, and growth.
What you will build in this sprint:
- A mini container runtime model and image/registry workflow
- A scheduler and controller simulation environment
- Networking, storage, and policy labs that mirror production incidents
- A final GitOps-style platform capstone that combines all concepts
In scope:
- Linux container internals, OCI image/runtime/distribution model
- Kubernetes control plane, scheduling, reconciliation, networking, stateful operations
- Security baselines, policy enforcement, and operational troubleshooting
Out of scope:
- Managed cloud vendor specifics as the primary focus
- Full service-mesh internals implementation
- Writing a full production orchestrator from scratch
Big-picture system view:
Developer -> Source + Dockerfile -> OCI Image -> Registry -> Cluster Pull -> Pod Runtime
| | | |
| | | +-> kubelet + CRI + runtime
| | +-> auth, tags, digests
| +-> layers, manifests, signatures
+-> CI, tests, SBOM, policy gates
Kubernetes Control Plane:
API Server <-> etcd
| |
v v
Scheduler Controllers ----> Desired state -> Actual state convergence
|
v
Nodes (kubelet, CNI, CSI, runtime)
How to Use This Guide
- Read the Theory Primer first; each project assumes the concepts are internalized.
- Choose one learning path, but complete Project 1 and Project 2 first regardless of your path.
- Before each project, answer the Core Question and Thinking Exercise on paper.
- During implementation, capture observable evidence: logs, events, metrics, and deterministic command transcripts.
- After each project, validate against Definition of Done and write a short incident-style postmortem.
Prerequisites & Background Knowledge
Essential Prerequisites (Must Have)
- Linux command-line fluency (processes, files, networking basics)
- One systems language or scripting language (Go, Rust, Python, or Bash)
- Basic distributed systems vocabulary (leader, quorum, eventual vs strong consistency)
- Recommended Reading: “The Linux Programming Interface” by Michael Kerrisk - chapters on processes, namespaces, and filesystems
Helpful But Not Required
- Prior Docker and Kubernetes usage
- Exposure to CI/CD and GitOps tooling
- Prior cloud networking experience
Self-Assessment Questions
- Can you explain what PID 1 inside a container means and why it matters?
- Can you explain the difference between an image digest and a mutable tag?
- Can you explain why Kubernetes controllers continuously reconcile instead of executing once?
- Can you describe one realistic failure mode in Kubernetes networking?
Development Environment Setup
Required Tools:
- Docker Engine or Docker Desktop (recent stable)
- kubectl, kustomize, helm
- Local cluster tool (kind or k3d)
- crictl, jq, rg, stern, k9s (optional but recommended)
Recommended Tools:
- prometheus + grafana local stack for observability
- trivy or grype for image scanning
- cosign for image signing exercises
Testing Your Setup:
$ docker version
Client: Docker Engine - Community
Server: Docker Engine - Community
$ kubectl version --client
Client Version: v1.xx.x
$ kind create cluster --name dkk-lab
Creating cluster "dkk-lab" ...
Cluster "dkk-lab" created
$ kubectl get nodes
NAME STATUS ROLES AGE VERSION
dkk-lab-control-plane Ready control-plane 1m v1.xx.x
Time Investment
- Simple projects: 4-8 hours each
- Moderate projects: 10-20 hours each
- Complex projects: 20-40 hours each
- Total sprint: 3-5 months (part-time)
Important Reality Check
- Kubernetes issues often look like app bugs but are control-plane, networking, or policy bugs.
- Many failures are timing and state bugs; reproducibility requires disciplined observability.
- The value comes from debugging and verification, not from copying manifests.
Big Picture / Mental Model
The entire stack is a sequence of contracts:
Kernel contract:
namespaces + cgroups + capabilities + seccomp
OCI contract:
image spec + runtime spec + distribution spec
Kubernetes contract:
desired state in API -> controllers/scheduler -> converged running state
Operational contract:
policy + observability + backup/restore + rollout/rollback
Control-loop mental model:
[User intent]
|
v
[kubectl apply]
|
v
[API object stored in etcd] ---> [event stream]
| |
v v
[controllers compare desired vs actual] --(actions)--> [runtime changes]
^ |
| v
+------------------------- observe ---------------- [actual state]
Theory Primer
Chapter 1: Container Isolation Primitives and Runtime Execution Path
Fundamentals
Containers are isolated processes, not lightweight virtual machines. The kernel features that make containerization possible are namespaces (visibility isolation), cgroups (resource control), capabilities (privilege partitioning), and Linux security modules such as seccomp/AppArmor/SELinux. Namespaces define what a process can see: process IDs, network interfaces, mount points, and hostnames. Cgroups define what a process can consume: CPU time, memory, and I/O bandwidth. Capabilities split root into fine-grained powers so workloads do not need full host-level privilege. This model matters because Docker and Kubernetes are orchestration and packaging layers on top of these primitives, not replacements for them.
Deep Dive
At runtime, a container launch path is a chain of components. A client issues a command to a container engine; the engine communicates with a daemon, which then calls a lower-level runtime to realize the container process according to an OCI runtime bundle. The bundle describes process args, environment, mounts, and isolation settings. The runtime then asks the kernel to apply namespace boundaries and cgroup constraints before executing the target process.
The first non-obvious rule is that container boundaries are not absolute security boundaries by default. Isolation quality depends on configured privileges, host kernel state, and runtime hardening. A container that runs privileged, mounts host paths, or has dangerous capabilities can often escape intended isolation. This is why production baselines restrict host namespace sharing, disable privilege escalation, and narrow capabilities.
The second rule is lifecycle semantics. The process started as the container entrypoint becomes PID 1 inside that namespace. PID 1 has special signal-handling semantics, and poor signal handling is a frequent cause of stuck rollouts and long termination delays. In orchestrated systems, clean shutdown behavior is operationally critical because rolling updates, autoscaling, and disruption budgets all assume workloads terminate predictably.
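To make the PID 1 responsibility concrete, here is a hedged sketch of a minimal init process in Python: it spawns child workloads, forwards SIGTERM to them, and reaps them before a grace period expires. Real inits such as tini also handle SIGINT and orphaned grandchildren; the command list and function name here are illustrative, not any real tool's API.

```python
import os
import signal
import time

def run_init(child_cmds, grace_seconds=10.0):
    """Spawn children, forward SIGTERM to them, reap until done or deadline."""
    children = []
    for cmd in child_cmds:
        pid = os.fork()
        if pid == 0:
            try:
                os.execvp(cmd[0], cmd)  # child becomes the workload process
            finally:
                os._exit(127)           # never fall back into parent code

        children.append(pid)

    def on_term(signum, frame):
        # A PID 1 that skips this step leaves children running until the
        # orchestrator's grace period expires and SIGKILL arrives.
        for pid in children:
            try:
                os.kill(pid, signal.SIGTERM)  # fan the signal out
            except ProcessLookupError:
                pass  # child already exited

    signal.signal(signal.SIGTERM, on_term)

    deadline = time.time() + grace_seconds
    while children and time.time() < deadline:
        pid, _ = os.waitpid(-1, os.WNOHANG)  # reap without blocking
        if pid:
            children.remove(pid)
        else:
            time.sleep(0.05)
    return children  # non-empty => workloads missed the grace period
```

A workload whose entrypoint ignores SIGTERM or never reaps children exhibits exactly the stuck-rollout symptom described above.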
The third rule is resource economics. Cgroups enforce quotas and limits, but the effective behavior depends on scheduler pressure and node-level contention. Memory limits can trigger out-of-memory kills when page cache and heap pressure interact. CPU throttling can produce high latency despite low average CPU usage. Practical mastery requires correlating kernel-level signals with application behavior, not just setting arbitrary limits.
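The throttling effect can be illustrated with a simplified model of CFS bandwidth control (ignoring multiple cores and scheduler details): a cgroup gets quota_ms of CPU per period_ms, and work beyond the quota waits for the next period, stretching wall-clock latency even when average utilization is low. The function below is an assumption-laden sketch, not a kernel simulation.

```python
def completion_time_ms(work_ms, quota_ms, period_ms):
    """Wall-clock time to finish work_ms of CPU under a quota/period limit."""
    full_periods, remainder = divmod(work_ms, quota_ms)
    if remainder == 0:
        # Finishes exactly at the end of the quota slice of the last period.
        return (full_periods - 1) * period_ms + quota_ms
    # Otherwise: burn full quota slices, then finish partway into one more.
    return full_periods * period_ms + remainder
```

Under a 50 ms quota per 100 ms period, a 50 ms request finishes in 50 ms, but a 60 ms request takes 110 ms of wall time: the tail stretches while average CPU stays under the limit.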
Failure modes cluster around visibility assumptions and privilege assumptions. Teams often assume that if a process runs in a container it cannot affect host state; this is false if hostPath mounts or privileged mode are present. Teams also assume limits imply guaranteed resources; this is false in oversubscribed environments where noisy neighbors and scheduling policy dominate tail latency. Production reliability comes from explicitly modeling these constraints and testing under stress.
Operationally, you should treat runtime configuration as code and audit it. Minimal base images, explicit users, read-only filesystems where possible, and defense-in-depth security controls provide measurable risk reduction. Debugging and forensics become easier when runtime policies are deterministic and versioned.
How this fits on projects
- Projects 1, 4, 8, and 9 depend directly on runtime isolation behavior.
Definitions & key terms
- Namespace: Kernel isolation boundary for visibility scope.
- Cgroup: Kernel resource accounting and limit mechanism.
- Capability: Fine-grained privilege unit replacing full root powers.
- PID 1: Init-like process inside a container namespace.
- OCI runtime bundle: Files describing how to run a containerized process.
Mental model diagram
Host Kernel
+--------------------------------------------------------------+
| Namespaces: pid net mnt uts ipc user |
| Cgroups: cpu memory io pids |
| LSM/seccomp: syscall and policy filters |
+-------------------------^------------------------------------+
|
OCI runtime applies config
|
Container Process Tree |
PID 1 -> worker -> helper |
How it works
- Build runtime config from image metadata plus launch overrides.
- Create namespaces and cgroup assignments.
- Mount root filesystem and required volumes.
- Drop privileges and apply seccomp/profile constraints.
- Execute entrypoint as PID 1.
Invariants:
- Namespace boundaries define visibility, not absolute trust.
- Resource limits are enforced relative to host capacity and policy.
Failure modes:
- Stuck termination due to bad PID 1 behavior.
- OOM kills under memory pressure despite normal averages.
- Privilege misuse through broad capabilities.
Minimal concrete example
Pseudo-run sequence:
container run --image app:v1 --memory 256Mi --cpu 500m
-> create PID + NET + MNT namespaces
-> join cgroup path /kubepods/.../container-id
-> mount overlay rootfs
-> set no-new-privileges + seccomp profile
-> exec /app/server
Common misconceptions
- “Containers are VMs.” -> They share the host kernel.
- “Non-root user means fully safe.” -> Other privileges can still be dangerous.
- “Limits equal reservations.” -> Limits and requests serve different scheduling purposes.
Check-your-understanding questions
- Why is PID 1 behavior important for rolling updates?
- How can a container become risky even if it is non-root?
- Why can CPU throttling increase tail latency without high average CPU?
Check-your-understanding answers
- PID 1 receives termination signals first and must shut down children cleanly.
- Dangerous capabilities, host mounts, and shared namespaces can bypass expectations.
- Throttling interrupts work bursts and stretches completion time for latency-sensitive requests.
Real-world applications
- Multi-tenant platform hardening
- Runtime policy baselines for regulated workloads
- Performance incident analysis under node contention
Where you’ll apply it
- Project 1, Project 4, Project 8, Project 9
References
- Kubernetes container runtimes docs: https://kubernetes.io/docs/setup/production-environment/container-runtimes/
- Kubernetes CRI docs: https://kubernetes.io/docs/concepts/containers/cri/
- Docker overview: https://docs.docker.com/get-started/docker-overview/
Key insights
- Containers are a kernel-level process isolation contract with explicit security and resource tradeoffs.
Summary
- Understanding namespaces, cgroups, and privileges is mandatory for reliable Docker/Kubernetes operations.
Homework/Exercises to practice the concept
- Compare behavior of two identical workloads with different CPU and memory limits.
- Test graceful shutdown behavior with and without proper signal handling.
- Document capability requirements for one workload and remove all unnecessary ones.
Solutions to the homework/exercises
- Record throttle and OOM events; compare p95 latency and restart counts.
- Use termination tests and verify child processes exit before grace period ends.
- Start from restricted baseline and add only capabilities proven necessary by workload tests.
Chapter 2: OCI Image Model, Build Graphs, and Registry Distribution
Fundamentals
OCI standards define portable image and runtime formats across tools. The image spec describes manifests, configuration objects, and layers. The distribution spec defines registry APIs for pushing and pulling content by digest. Modern builders such as BuildKit optimize builds through content-addressable caching and dependency graph execution. The practical result is reproducibility and portability: the same digest should represent the same content everywhere.
Deep Dive
An image is a directed set of immutable blobs plus metadata, not a monolithic file. Layers are content-addressed by digest, so registries can deduplicate and clients can verify integrity. Tags are mutable pointers to manifests; digests are immutable identifiers. Production policy should treat digests as deployment truth and tags as discovery convenience.
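A minimal sketch of content addressing makes the digest/tag distinction tangible. The blob contents and manifest fields below are toy values, not the actual OCI media types:

```python
import hashlib
import json

def digest(blob: bytes) -> str:
    # Content addressing: a blob's identity is the hash of its bytes.
    return "sha256:" + hashlib.sha256(blob).hexdigest()

layer = b"...compressed layer tarball..."
config = json.dumps({"Entrypoint": ["/app/server"]}).encode()

# A manifest references config and layers by digest, so once serialized
# it is itself content-addressed.
manifest = json.dumps(
    {"config": digest(config), "layers": [digest(layer)]},
    sort_keys=True,
).encode()

tags = {"app:prod": digest(manifest)}  # a tag is just a mutable pointer

# Retargeting the tag changes what "app:prod" means, but every digest
# still names exactly the bytes it always named.
tags["app:prod"] = digest(b"some other manifest")
```

This is why deploying by digest is drift-proof while deploying by tag is not: only the pointer moves, never the content behind a digest.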
Build systems convert source context and instructions into a graph of filesystem transformations. BuildKit executes independent stages in parallel when possible, skips unused stages, and can export/import cache from registries. This makes build performance and reproducibility first-class concerns. The optimization target is not only speed but also determinism: identical inputs should produce identical artifact digests.
Registry interactions follow a predictable workflow: authenticate, check existence of blobs, upload missing blobs, upload manifest, and update tag references. Pulls reverse this flow. Failures usually involve auth scope mismatch, digest mismatch, manifest media-type compatibility, or transient network errors. Teams that understand this protocol can debug CI failures quickly instead of retrying blindly.
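The push workflow above can be sketched against a toy in-memory registry. The dictionary shapes are illustrative only and do not mirror the actual OCI distribution API endpoints:

```python
def push(registry, manifest_digest, manifest, blobs):
    """Upload only missing blobs (dedup by digest), then the manifest."""
    uploaded = []
    for d, content in blobs.items():
        if d not in registry["blobs"]:  # existence check before upload
            registry["blobs"][d] = content
            uploaded.append(d)
    registry["manifests"][manifest_digest] = manifest
    return uploaded

registry = {"blobs": {}, "manifests": {}, "tags": {}}
blobs = {"sha256:aaa": b"layer", "sha256:bbb": b"config"}
push(registry, "sha256:mmm", {"layers": ["sha256:aaa"]}, blobs)
```

Because blobs are keyed by digest, a repeat push of unchanged content uploads nothing, which is the deduplication property that makes shared base layers cheap.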
Supply chain security overlays this model. Signatures and attestations bind trust metadata to digests, while SBOMs expose component composition for vulnerability and license analysis. Security posture improves when policy gates promote only signed and scanned digests to runtime environments.
A common performance trap is cache invalidation at the wrong build step. If an early step is volatile, all downstream layers rebuild. Better layering strategy puts stable dependencies before rapidly changing application files, reducing cache churn. Another trap is pulling large base images for tiny services, inflating cold-start and network costs. Minimal base images and explicit dependency management improve both security and performance.
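The invalidation cascade follows directly from how cache keys chain. This is a simplified model, not BuildKit's actual key derivation: each step's key covers the step content plus everything before it, so a change at any step invalidates every downstream layer.

```python
import hashlib

def build_keys(steps):
    """Chain cache keys: each key hashes the step plus the previous key."""
    key, keys = "", []
    for step in steps:
        key = hashlib.sha256((key + step).encode()).hexdigest()
        keys.append(key)
    return keys

stable_first = ["FROM base", "COPY deps", "RUN install", "COPY app"]
app_changed  = ["FROM base", "COPY deps", "RUN install", "COPY app v2"]
# Only the final key differs: three cached layers are reused.

deps_changed = ["FROM base", "COPY deps v2", "RUN install", "COPY app"]
# Keys differ from index 1 onward: the expensive install step reruns.
```

Ordering stable steps first is therefore not a stylistic preference; it directly determines how much of the chain survives a typical source change.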
At scale, registry architecture becomes a platform concern: geo-replication, retention policies, immutable tag policies, and garbage collection all influence reliability and cost. Digest pinning combined with promotion pipelines yields auditable releases and safer rollbacks.
How this fits on projects
- Projects 2, 3, 4, and 10 are image/registry heavy.
Definitions & key terms
- Manifest: Object describing referenced config and layers.
- Digest: Immutable hash-based content identifier.
- Tag: Mutable human-friendly reference.
- SBOM: Software bill of materials for component transparency.
- Attestation: Signed metadata proving build/provenance claims.
Mental model diagram
Source -> Build Graph -> Layers + Config -> Manifest -> Registry
| | | | |
| +-> cache keys +-> digests +-> tag -> pull by digest
How it works
- Parse build instructions into dependency graph.
- Execute graph nodes, producing layers and metadata.
- Hash content to produce digests.
- Push missing blobs and manifest to registry.
- Deploy by immutable digest.
Invariants:
- Same digest implies same content.
- Tags can change over time.
Failure modes:
- Digest mismatch from corruption or tooling incompatibility.
- Cache misses from poor layer ordering.
- Mutable tag drift causing unplanned runtime changes.
Minimal concrete example
Pseudo-manifest flow:
build app -> manifest digest sha256:AAA
push manifest sha256:AAA to registry
set tag app:prod -> sha256:AAA
deploy workload image@sha256:AAA (not tag-only)
Common misconceptions
- “Tag equals immutable release.” -> Tags are mutable pointers.
- “Fast build means good build.” -> Reproducibility and traceability matter equally.
- “Registry is just storage.” -> It is a protocol and policy boundary.
Check-your-understanding questions
- Why should production deploy by digest instead of tag?
- How can BuildKit reduce CI cost in ephemeral runners?
- Why are SBOMs useful even when no critical CVEs are present?
Check-your-understanding answers
- Digests prevent drift from mutable tag updates.
- External cache export/import avoids rebuilding unchanged graph nodes.
- They support license/compliance and future vulnerability investigations.
Real-world applications
- Reproducible CI/CD pipelines
- Secure artifact promotion between environments
- Registry cost and latency optimization
Where you’ll apply it
- Project 2, Project 3, Project 4, Project 10
References
- OCI image-spec release notices: https://opencontainers.org/release-notices/overview/
- OCI distribution-spec release notices: https://opencontainers.org/release-notices/v1-1-1-distribution-spec/
- Docker BuildKit docs: https://docs.docker.com/build/buildkit/
- Docker cache guidance: https://docs.docker.com/build/cache/optimize/
Key insights
- Treat image digests as immutable contracts and build graphs as reproducibility engines.
Summary
- OCI standards plus disciplined build and promotion patterns eliminate many deployment surprises.
Homework/Exercises to practice the concept
- Design a promotion workflow from dev to prod using digest pinning.
- Compare build times with and without external cache.
- Draft an SBOM and signature verification gate for pre-deploy checks.
Solutions to the homework/exercises
- Promote manifest digests across repos or tags without rebuilding.
- Measure warm/cold build delta and identify invalidation boundaries.
- Block deploy if signature missing or vulnerability threshold exceeded.
Chapter 3: Kubernetes Control Plane, Scheduling, and Reconciliation
Fundamentals
Kubernetes is a distributed control system centered on desired state. Users and automation declare intent in API objects. The control plane stores that intent in etcd, then scheduler and controllers continuously work to make actual state match desired state. The architecture is explicitly event-driven and eventually convergent, not transactionally immediate.
Deep Dive
The API server is the front door for all state mutations. It validates requests, enforces admission policies, and persists objects. etcd is the source-of-truth data store behind the API server. Watches stream object changes to controllers and operators, enabling reactive automation. This design scales by decoupling declaration from execution.
Scheduler behavior is often misunderstood. Scheduling is a two-stage decision process: filtering identifies feasible nodes and scoring ranks candidates. Plugin chains support extensible policies (resource fit, affinity, topology spread, taints/tolerations, storage constraints). Once a node is selected, binding finalizes placement. Backoff and queue behavior influence pending pod latency under pressure.
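The two-stage decision can be sketched as a single function. The node and pod shapes are invented for illustration, and the scoring policy here (least allocated wins) is just one of many the real plugin chains support:

```python
def schedule(pod, nodes):
    # Stage 1: filtering keeps only feasible nodes.
    feasible = [
        n for n in nodes
        if n["free_cpu"] >= pod["cpu"]
        and n["free_mem"] >= pod["mem"]
        and set(n["taints"]) <= set(pod["tolerations"])  # all taints tolerated
    ]
    if not feasible:
        return None  # pod stays Pending until constraints can be met
    # Stage 2: scoring ranks the feasible candidates.
    return max(feasible, key=lambda n: (n["free_cpu"], n["free_mem"]))["name"]
```

Returning None mirrors real behavior: an infeasible pod is not rejected, it waits in the queue and is retried as cluster state changes.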
Controllers are reconciliation loops. Each loop compares desired and observed state for a resource type, then issues minimal actions to reduce drift. Deployments manage ReplicaSets; ReplicaSets manage Pods; StatefulSets manage ordered identity and storage semantics. The power of Kubernetes comes from this layered composition of simple controllers, not from one monolithic scheduler brain.
Operationally, control-plane correctness depends on event ordering assumptions, retry semantics, and idempotent logic. Controllers must tolerate missed events by periodic resync and state listing. They must also tolerate transient errors and partial progress. This is why Kubernetes patterns emphasize level-triggered reconciliation over edge-triggered one-shot automation.
Failure modes include API server overload, etcd latency spikes, and pathological resync storms. Symptoms often appear as slow scheduling, delayed rollouts, and stale status fields. Observability should include API request latency, watch health, queue depth, reconcile duration, and error rate.
Design tradeoffs are explicit: strong consistency for API object storage through etcd quorum, but eventual convergence for higher-level workload state. This balance enables robust distributed operation without requiring global transactional workflows for every object mutation.
How this fits on projects
- Projects 5, 6, 7, and 10 rely directly on scheduler and controller mental models.
Definitions & key terms
- Reconciliation: Continuous convergence process from desired to actual state.
- Admission: Validation/mutation step before object persistence.
- Binding: Scheduler action that assigns a pod to a node.
- Informer/watch: Event-driven cache+notification mechanism for API changes.
- Idempotency: Safe repeated execution with same final result.
Mental model diagram
Intent (YAML) -> API Server -> etcd
|
+-> Scheduler queue -> filter -> score -> bind
|
+-> Controllers (Deployment/StatefulSet/custom)
|
v
Runtime changes on nodes
How it works
- Persist desired object in API.
- Notify scheduler/controller loops via watch streams.
- Scheduler binds unscheduled pods.
- Controllers create/update/delete lower-level objects.
- Status updates report observed state.
Invariants:
- etcd-backed API data is authoritative.
- Controllers must be idempotent and retry-safe.
Failure modes:
- Pending pods from unsatisfied constraints.
- Slow reconciliation from API or controller backpressure.
- Divergence from non-idempotent custom automation.
Minimal concrete example
Pseudo-control loop:
if desired replicas != observed ready pods:
create or delete pods
requeue after event or periodic resync
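The pseudo-loop above can be made runnable as a single level-triggered pass; `create_pod` and `delete_pod` are hypothetical callbacks standing in for API calls:

```python
def reconcile(desired_replicas, observed_pods, create_pod, delete_pod):
    """One level-triggered pass: issue minimal actions toward desired state.

    Idempotent: calling it again after convergence performs no actions.
    """
    diff = desired_replicas - len(observed_pods)
    if diff > 0:
        for _ in range(diff):
            create_pod()             # scale up by the exact shortfall
    elif diff < 0:
        for pod in observed_pods[desired_replicas:]:
            delete_pod(pod)          # scale down the surplus only
```

Because the function compares whole states rather than reacting to individual events, a missed event or a duplicate invocation cannot push the system past the desired count.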
Common misconceptions
- “Apply means immediate execution.” -> Apply means desired state update.
- “Scheduler alone guarantees reliability.” -> Controllers and probes are equally critical.
- “One failed reconcile means broken system.” -> Retries are expected in distributed control loops.
Check-your-understanding questions
- Why does Kubernetes prefer reconciliation loops over one-shot actions?
- What does scheduler scoring add beyond feasibility filtering?
- Why must controller actions be idempotent?
Check-your-understanding answers
- Distributed systems need convergence under partial failures and retries.
- It optimizes placement quality among many feasible nodes.
- Retries and duplicate events are normal; non-idempotent actions cause drift or thrash.
Real-world applications
- Capacity-aware scheduling strategies
- Safe rollout automation
- Building custom operators for domain-specific platforms
Where you’ll apply it
- Project 5, Project 6, Project 7, Project 10
References
- Kubernetes components: https://kubernetes.io/docs/concepts/overview/components
- Scheduler configuration and framework: https://kubernetes.io/docs/reference/scheduling/config/
- etcd operations in Kubernetes: https://kubernetes.io/docs/tasks/administer-cluster/configure-upgrade-etcd/
Key insights
- Kubernetes reliability emerges from small, idempotent, event-driven control loops.
Summary
- Mastering scheduler and controller behavior is the shortest path to cluster debugging confidence.
Homework/Exercises to practice the concept
- Trace one Deployment rollout from API update to ready pods.
- Simulate node pressure and explain why specific pods stay pending.
- Design reconcile pseudocode for a custom backup resource.
Solutions to the homework/exercises
- Capture events and correlate generation/observedGeneration transitions.
- Inspect scheduling events for resource, affinity, or taint constraints.
- Ensure idempotent steps, status checkpoints, and retry-safe external calls.
Chapter 4: Kubernetes Networking and Traffic Delivery
Fundamentals
Kubernetes networking is built on a simple but strict model: each pod gets a cluster-routable IP, pod-to-pod communication should work without NAT in the pod network, and Services provide stable virtual addressing for dynamic pod backends. CNI plugins implement pod network plumbing, while kube-proxy or eBPF datapaths implement service routing.
Deep Dive
Networking in Kubernetes has three layers of responsibility. First, CNI attaches pod interfaces and routes. Second, Service abstractions map stable virtual identities to changing pod endpoints. Third, ingress/gateway layers expose traffic entry points and policy. Understanding where a failure sits among these layers accelerates incident response.
CNI behavior determines pod connectivity fundamentals: interface creation, IP allocation, route setup, and network policy enforcement support. A broken CNI deployment often manifests as pods that are Running but unreachable. Service routing issues, in contrast, usually appear as healthy pod-to-pod communication with broken virtual service endpoints.
Service resolution combines DNS and endpoint programming. Clients usually resolve service names to cluster IPs, then kube-proxy/ipvs/eBPF routes traffic to endpoint pods. Endpoint readiness governs whether pods receive traffic. Misconfigured probes can therefore look like network outages when the actual issue is readiness policy.
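A sketch of endpoint programming shows why a Running-but-unready pod gets zero traffic even though the network path is intact. The pod and selector shapes are illustrative, not the real EndpointSlice schema:

```python
def service_endpoints(pods, selector):
    """Only Ready pods matching the selector become service backends."""
    return [
        p["ip"] for p in pods
        if selector.items() <= p["labels"].items() and p["ready"]
    ]
```

A failing readiness probe flips `ready` to False and silently removes the pod from this list, which looks like a network outage from the client's side.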
Gateway and ingress APIs define north-south traffic policy. Gateway API has matured into a more expressive model for multi-tenant and advanced routing use cases. Teams should treat routing configuration as code, with explicit ownership boundaries and test coverage for edge cases such as TLS mismatch, backend policy conflicts, and canary rules.
Common failure patterns include DNS timeouts, stale endpoints, policy-denied flows, MTU mismatches, and asymmetric routing across nodes. Effective troubleshooting starts at L3/L4 reachability, then progresses to service abstraction and L7 policy. Skipping layers leads to guesswork.
Performance tradeoffs also matter. Overly chatty service meshes, suboptimal connection reuse, and mis-scoped retries can amplify latency and load. Good architecture balances observability and policy richness with datapath simplicity.
How this fits on projects
- Projects 8, 10, and parts of 6 and 7 depend on this network model.
Definitions & key terms
- CNI: Standard interface for container network plugins.
- Service: Stable virtual endpoint for a dynamic set of pods.
- EndpointSlice: Scalable representation of service backends.
- NetworkPolicy: Allow/deny traffic policy at pod level.
- Gateway API: Kubernetes API model for advanced service networking.
Mental model diagram
Client Pod -> DNS -> Service VIP -> datapath (kube-proxy/eBPF) -> Endpoint Pod
| |
+-------------------- CNI routes and policy --------------+
North-South:
Internet -> Gateway/Ingress -> Service -> Pod
How it works
- CNI assigns pod IP and node routes.
- Service controller tracks matching pods.
- EndpointSlices update as pod readiness changes.
- Datapath routes service traffic to healthy endpoints.
- Gateway/ingress applies entry policies and TLS handling.
Invariants:
- Pod-level identity is IP-based in cluster network model.
- Service identity is stable while backend membership changes.
Failure modes:
- DNS or endpoint staleness causing intermittent failures.
- Policy misconfiguration blocking expected flows.
- MTU or conntrack pressure creating silent packet drops.
Minimal concrete example
Pseudo-debug sequence:
1) pod A cannot call service B
2) test pod A -> pod B IP directly
3) test DNS resolution for service B
4) inspect EndpointSlices and readiness
5) inspect NetworkPolicy allow rules
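The five checks above can be encoded as a decision function: the inputs are the results of each check in order, and the output is the layer to investigate first. The layer names are this guide's shorthand, not any tool's output:

```python
def classify_network_failure(pod_to_pod_ok, dns_ok,
                             has_ready_endpoints, policy_allows):
    """Map layered check results to the most likely failing layer."""
    if not pod_to_pod_ok:
        return "CNI / pod network"        # step 2 failed: plumbing broken
    if not dns_ok:
        return "cluster DNS"              # step 3 failed: name resolution
    if not has_ready_endpoints:
        return "readiness / EndpointSlices"  # step 4: no traffic-ready backends
    if not policy_allows:
        return "NetworkPolicy"            # step 5: flow is policy-denied
    return "application / L7"             # all lower layers check out
```

Working top to bottom through these checks is the discipline that prevents the guesswork described above: each check rules an entire layer in or out.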
Common misconceptions
- “Service equals load balancer process.” -> It is a virtual abstraction plus datapath rules.
- “Running pod means traffic-ready.” -> Readiness controls endpoint participation.
- “NetworkPolicy defaults to deny only for inbound.” -> Behavior depends on selected policy rules.
Check-your-understanding questions
- Why can a Running pod still receive zero service traffic?
- What distinguishes a CNI issue from a Service issue?
- Why does endpoint readiness matter for rollout safety?
Check-your-understanding answers
- Failed readiness probe keeps it out of endpoints.
- CNI issues break direct pod reachability; service issues can preserve direct pod reachability.
- It prevents routing to pods that are alive but not ready for requests.
Real-world applications
- Multi-tenant cluster segmentation
- Reliable service exposure and traffic shaping
- Incident triage for packet loss and service brownouts
Where you’ll apply it
- Project 8, Project 10 (and supporting work in Projects 6 and 7)
References
- Kubernetes network model: https://kubernetes.io/docs/concepts/services-networking/
- CNI project: https://github.com/containernetworking/cni
- Gateway API v1.4 update: https://kubernetes.io/blog/2025/11/06/gateway-api-v1-4/
Key insights
- Networking reliability comes from layered reasoning: CNI, Service endpoints, then L7 policy.
Summary
- Treat networking as composable control planes, not a single black box.
Homework/Exercises to practice the concept
- Create a failure matrix for DNS, endpoints, and network policy faults.
- Run a controlled network policy deny test and validate expected blast radius.
- Compare service latency with different endpoint counts and connection reuse patterns.
Solutions to the homework/exercises
- Define symptoms and diagnostics per layer.
- Start with audit mode labels, then enforce with explicit namespace scoping.
- Track p50/p95 latency and connection metrics before and after adjustments.
Chapter 5: Stateful Workloads, Security Baselines, and GitOps Operations
Fundamentals
Stateless scheduling is only half the story; real systems need durable data, controlled access, and repeatable change management. Kubernetes addresses this with StatefulSets, persistent volumes, policy controls, and automation patterns such as GitOps. Production excellence depends on combining data safety, security defaults, and auditable deployment flow.
Deep Dive
Stateful workloads require stable identity and storage attachment semantics. StatefulSets provide ordered startup/shutdown and stable pod naming, while persistent volumes bind data beyond pod lifetime. This model supports databases, queues, and other stateful systems but introduces operational complexity: upgrades, failover, and backup/restore must be intentionally designed.
Storage reliability is an operational pipeline, not a single feature. You need backup cadence, restore drills, and integrity checks. A backup that has never been restored is an assumption, not a control. In practice, teams should treat the recovery time objective (RTO) and recovery point objective (RPO) as testable metrics.
Security posture starts with least privilege and admission control. Pod Security Standards define escalating profiles from privileged to restricted. Namespace-level enforcement plus explicit exceptions provides a practical baseline. Complementary controls include image scanning, signature verification, secret management, and strict RBAC scoping.
Policy-as-code tools encode these rules as versioned constraints. This reduces drift and enables pre-merge validation. Good policies are explicit, explainable, and paired with developer feedback loops. Overly strict policies without migration paths cause platform friction; gradual enforcement with audit/warn stages is usually the better rollout strategy.
GitOps operationalizes desired state management by treating cluster config as source-controlled truth. Automated reconciliation from Git to cluster improves auditability and rollback safety, but only if repository structure, promotion flow, and secret handling are well designed. GitOps is not “just auto-apply”; it is a discipline of declarative operations, drift management, and release governance.
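Drift management reduces to a set comparison between declared and live objects. A minimal sketch, assuming objects are keyed by something like (kind, name) and compared as plain dictionaries; real reconcilers diff structured manifests:

```python
def detect_drift(declared, actual):
    """Compare Git-declared objects against live objects, by key."""
    missing = sorted(declared.keys() - actual.keys())  # in Git, not in cluster
    extra = sorted(actual.keys() - declared.keys())    # in cluster, not in Git
    changed = sorted(
        k for k in declared.keys() & actual.keys() if declared[k] != actual[k]
    )
    return {"missing": missing, "extra": extra, "changed": changed}
```

A GitOps reconciler acts on exactly these three sets: create what is missing, prune or flag what is extra, and patch what has changed, every sync interval.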
Failure modes in this chapter are as much organizational as technical: unclear ownership, untested restores, and policy exemptions with no expiry. Mature teams align platform, security, and application owners around shared controls and explicit service-level objectives.
How this fits on projects
- Projects 7, 9, and 10 are centered on these production-grade concerns.
Definitions & key terms
- StatefulSet: Workload API for ordered, identity-stable pod sets.
- PVC/PV: Persistent volume claim and persistent volume abstraction.
- Pod Security Standards: Baseline security profile model.
- GitOps: Continuous reconciliation of declarative config from Git.
- Drift: Difference between declared and actual running state.
Mental model diagram
Git (desired config) -> reconciler -> cluster state
| |
+-> policy checks +-> runtime signals
Stateful workload:
identity + storage + backup + restore + failover drills
How it works
- Define workload, storage, and policy manifests declaratively.
- Enforce baseline admission and RBAC constraints.
- Reconcile cluster from versioned source of truth.
- Execute backup and restore drills on schedule.
- Track drift and security exceptions over time.
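The reconcile and drift-tracking steps above can be sketched as a minimal loop. This is an illustrative model with hypothetical types and names, not Flux or Argo internals: desired state comes from Git, actual state from the cluster, and reconciliation re-applies only what drifted.

```go
package main

import "fmt"

// State maps a resource name to the hash of its declared manifest.
type State map[string]string

// diff returns the resources whose live state differs from the declared state.
func diff(desired, actual State) []string {
	var drifted []string
	for name, want := range desired {
		if actual[name] != want {
			drifted = append(drifted, name)
		}
	}
	return drifted
}

// reconcile re-applies the declared manifest for every drifted resource
// and returns how many corrections were made.
func reconcile(desired, actual State) int {
	drifted := diff(desired, actual)
	for _, name := range drifted {
		actual[name] = desired[name] // stand-in for "kubectl apply"
	}
	return len(drifted)
}

func main() {
	desired := State{"deploy/web": "sha-aaa", "svc/web": "sha-bbb"} // from Git
	actual := State{"deploy/web": "sha-old", "svc/web": "sha-bbb"}  // from cluster
	fmt.Println("corrected:", reconcile(desired, actual))
	fmt.Println("drift remaining:", len(diff(desired, actual))) // 0 after convergence
}
```

The invariant worth noticing: reconciliation only ever moves actual state toward declared state, so a manual cluster edit shows up as drift and is corrected on the next loop.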
Invariants:
- State durability requires independent validation, not assumptions.
- Security controls require explicit exception management.
Failure modes:
- Data loss from untested restore path.
- Policy bypass through broad namespace exemptions.
- Config drift from manual cluster changes outside Git flow.
Minimal concrete example
Pseudo-release flow:
PR -> policy and security checks -> merge to main -> GitOps sync -> rollout
if health check fails:
rollback to previous commit digest
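The rollback branch of this flow can be sketched in a few lines of Go (hypothetical names; the health probe is stubbed for illustration):

```go
package main

import "fmt"

// promote models a health-gated rollout: sync the candidate commit, and
// if the health check fails, fall back to the previously deployed digest,
// which is still available in Git history.
func promote(current, candidate string, healthy func(string) bool) string {
	if healthy(candidate) {
		return candidate
	}
	return current // rollback path
}

func main() {
	// Stubbed health probe: any revision except "sha-bad" reports healthy.
	healthy := func(rev string) bool { return rev != "sha-bad" }
	fmt.Println(promote("sha-good", "sha-bad", healthy)) // rolled back: sha-good
	fmt.Println(promote("sha-good", "sha-new", healthy)) // promoted: sha-new
}
```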
Common misconceptions
- “StatefulSet means data is safe by default.” -> Backup and restore strategy is separate.
- “Policy engine alone secures cluster.” -> RBAC, runtime hardening, and supply chain controls are also required.
- “GitOps removes need for incident response.” -> It improves recovery but does not prevent all failures.
Check-your-understanding questions
- Why is restore testing more important than backup creation logs?
- How does gradual policy enforcement reduce operational risk?
- What problem does GitOps solve that CI pipelines alone do not?
Check-your-understanding answers
- It validates end-to-end recoverability under realistic conditions.
- It surfaces violations early without immediate production breakage.
- It continuously enforces declared state and detects runtime drift.
Real-world applications
- Database platform operations
- Security compliance and audit readiness
- Multi-team production release governance
Where you’ll apply it
- Project 7, Project 9, Project 10
References
- Pod Security Standards: https://kubernetes.io/docs/concepts/security/pod-security-standards/
- Pod Security Admission: https://kubernetes.io/docs/tasks/configure-pod-container/enforce-standards-admission-controller
- Kubernetes security checklist: https://kubernetes.io/docs/concepts/security/security-checklist/
- NSA/CISA Kubernetes hardening guidance: https://www.nsa.gov/Press-Room/News-Highlights/Article/Article/2716980/nsa-cisa-release-kubernetes-hardening-guidance/
Key insights
- Production maturity is the intersection of durable state, enforceable policy, and declarative operations.
Summary
- Stateful reliability and platform security require recurring drills, clear ownership, and policy discipline.
Homework/Exercises to practice the concept
- Define a backup and restore runbook with explicit RTO/RPO targets.
- Create a phased Pod Security rollout plan (audit -> warn -> enforce).
- Model a GitOps rollback scenario after a failed deployment.
Solutions to the homework/exercises
- Include backup cadence, restore command sequence, verification checks, and escalation path.
- Start with non-prod namespaces, then production with timed exception expiry.
- Revert the Git commit, reconcile, and verify post-rollback service health and data integrity.
Glossary
- Container: Isolated process with constrained kernel visibility and resources.
- OCI: Open standards for container images, runtime configuration, and distribution APIs.
- Digest: Immutable hash identifier for image content.
- Control Loop: Continuous comparison of desired and actual state followed by corrective actions.
- EndpointSlice: Kubernetes object describing service backend endpoints.
- StatefulSet: Kubernetes workload API for identity-stable, ordered, stateful pods.
- GitOps: Operational model where Git is source of truth and automated reconcilers apply state.
- Drift: Configuration difference between declared and runtime state.
- PSS: Pod Security Standards policy profiles.
- CRI: Kubernetes interface between kubelet and container runtime.
Why Docker and Kubernetes Matter
Modern motivation:
- Teams need faster release cycles without environment-specific surprises.
- Platform teams need predictable operations under autoscaling, failures, and audits.
- AI/ML and data workloads increasingly require standardized orchestration at scale.
Real-world statistics and impact:
- CNCF Annual Cloud Native Survey (January 2026): 82% of container users report Kubernetes in production; 98% report adoption of cloud native techniques; 59% report much or nearly all development/deployment is cloud native.
- Stack Overflow Developer Survey 2024: Docker appears in 53.9% of all-respondent “other tools” usage and 58.7% among professional developers; Kubernetes appears at 19.4% overall and 22% among professional developers.
- Kubernetes v1.34 (August 27, 2025) shipped 58 enhancements, showing a fast-moving but stable ecosystem.
- OCI standards continued active releases through 2025 (image-spec v1.1.1, distribution-spec v1.1.1, runtime-spec v1.3.0), reinforcing interoperability.
Context and evolution:
- Early container usage focused on developer portability.
- The ecosystem then standardized around OCI artifacts and Kubernetes orchestration.
- Current maturity emphasizes supply-chain trust, platform engineering, and AI workload operations.
Old vs new operating model:
| Traditional VM-Centric Ops | Cloud-Native Container Ops |
|---|---|
| Manual host configuration | Declarative desired state |
| Release by mutable server | Release by immutable artifacts |
| Snowflake environments | Reproducible images and policy |
| Ad-hoc rollback | Versioned rollout and rollback |
| Limited workload portability | Runtime + OCI + Kubernetes portability |
Concept Summary Table
| Concept Cluster | What You Need to Internalize |
|---|---|
| Container Isolation Primitives | Containers are isolated processes using namespaces, cgroups, and privilege controls; these are the root cause surface for many runtime bugs. |
| OCI Image and Distribution Model | Images are immutable, content-addressed artifacts moved through standardized registry APIs; digest pinning and cache strategy are operational essentials. |
| Kubernetes Control Plane and Reconciliation | Kubernetes is event-driven desired-state convergence via scheduler and controllers, not immediate imperative execution. |
| Networking and Traffic Delivery | Reliable service behavior requires layered reasoning across CNI, Service endpoints, and L7 routing policy. |
| Stateful + Security + GitOps Operations | Durable systems require tested backup/restore, enforceable policy, and declarative change management with drift control. |
Project-to-Concept Map
| Project | Concepts Applied |
|---|---|
| Project 1 | Container Isolation Primitives |
| Project 2 | OCI Image and Distribution Model |
| Project 3 | OCI Image and Distribution Model |
| Project 4 | Container Isolation Primitives, OCI Image and Distribution Model |
| Project 5 | Kubernetes Control Plane and Reconciliation |
| Project 6 | Kubernetes Control Plane and Reconciliation |
| Project 7 | Kubernetes Control Plane and Reconciliation, Stateful + Security + GitOps Operations |
| Project 8 | Networking and Traffic Delivery |
| Project 9 | Container Isolation Primitives, Stateful + Security + GitOps Operations |
| Project 10 | All concept clusters |
Deep Dive Reading by Concept
| Concept | Book and Chapter | Why This Matters |
|---|---|---|
| Container Isolation Primitives | “The Linux Programming Interface” by Michael Kerrisk - process and namespace chapters | Grounds container behavior in kernel mechanics. |
| OCI Image and Distribution Model | “Docker Deep Dive” by Nigel Poulton - image, registry, and build chapters | Builds artifact literacy for reproducible delivery. |
| Kubernetes Control Plane and Reconciliation | “Programming Kubernetes” by Michael Hausenblas and Stefan Schimanski - controllers/operators chapters | Explains how desired state becomes running systems. |
| Networking and Traffic Delivery | “Kubernetes in Action” by Marko Luksa - networking and service chapters | Connects service abstractions to real packet paths. |
| Stateful + Security + GitOps Operations | “Kubernetes Patterns” by Bilgin Ibryam and Roland Huss - stateful, operational, and policy patterns | Provides operational blueprints for production reliability. |
Quick Start: Your First 48 Hours
Day 1:
- Read Theory Primer Chapters 1-3.
- Build and run Project 1 baseline outcome.
- Capture one page of notes: “What changed in my mental model?”
Day 2:
- Read Theory Primer Chapters 4-5.
- Complete Project 2 deterministic outcome transcript.
- Verify both projects against Definition of Done checklists.
Recommended Learning Paths
Path 1: Platform Engineer Track
- Project 1 -> Project 2 -> Project 5 -> Project 6 -> Project 7 -> Project 10
Path 2: DevOps/SRE Track
- Project 2 -> Project 3 -> Project 8 -> Project 9 -> Project 10
Path 3: Security and Reliability Track
- Project 1 -> Project 4 -> Project 7 -> Project 9 -> Project 10
Success Metrics
- You can explain and diagnose at least 15 common Docker/Kubernetes failure patterns.
- You can deploy by digest with policy gates and rollback safely.
- You can trace one user request from gateway to pod and back with observable evidence.
- You can execute a tested restore workflow for a stateful workload.
- You can present a clear architecture tradeoff memo for your capstone platform.
Project Overview Table
| # | Project | Core Focus | Difficulty | Time |
|---|---|---|---|---|
| 1 | Kernel-Level Container Runtime Lab | Namespaces, cgroups, process lifecycle | Advanced | 10-16h |
| 2 | OCI Image Builder + Registry Workflow | Manifests, layers, digest workflows | Advanced | 10-16h |
| 3 | BuildKit Cache and Reproducibility Lab | Build graph optimization, cache strategy | Intermediate | 6-10h |
| 4 | CRI Runtime Observability Lab | kubelet-runtime boundaries, runtime debugging | Advanced | 8-14h |
| 5 | Kubernetes Scheduler Simulator | Filter/score decisions, placement policy | Advanced | 12-20h |
| 6 | Controller Reconciliation Lab | Idempotent control loops, drift correction | Advanced | 12-20h |
| 7 | StatefulSet Failover + Backup Drills | Durability, ordering, restore confidence | Advanced | 14-24h |
| 8 | CNI + Service Network Troubleshooting | Packet path reasoning and policy | Advanced | 12-20h |
| 9 | Pod Security + Policy-as-Code | Admission policy, least privilege | Intermediate-Advanced | 8-14h |
| 10 | GitOps Platform Capstone | Integrated production platform thinking | Expert | 20-40h |
Project List
The following projects move you from container internals to production-grade Kubernetes platform operations.
Project 1: Kernel-Level Container Runtime Lab
- File: DOCKER_CONTAINERS_KUBERNETES_LEARNING_PROJECTS/P01-container-runtime-from-kernel-primitives.md
- Main Programming Language: Go
- Alternative Programming Languages: Rust, C
- Coolness Level: Level 4 - Hardcore Tech Flex
- Business Potential: 2. The “Operational Superpower”
- Difficulty: Level 4 - Expert
- Knowledge Area: Linux internals, process isolation
- Software or Tool: namespaces, cgroups v2, runc concepts
- Main Book: The Linux Programming Interface
What you will build: A minimal runtime lab that starts isolated processes with explicit namespace and cgroup constraints.
Why it teaches containers: It strips away Docker abstractions and shows the kernel-level reality.
Core challenges you will face:
- PID and mount namespace correctness -> Container Isolation Primitives
- Resource controls with cgroups v2 -> Container Isolation Primitives
- Safe privilege model -> Stateful + Security + GitOps Operations
Real World Outcome
You run a deterministic scenario where a workload is launched with explicit constraints and verify:
- isolated process view
- constrained memory and CPU
- predictable termination behavior
Expected transcript:
$ ./runtime-lab run --memory 256Mi --cpu 500m --cmd /bin/sh
[INFO] Created pid,mnt,uts,net namespaces
[INFO] Attached process to cgroup /lab/runtime/pod-001
[INFO] Applied seccomp baseline profile
[INFO] Process started as PID 1 inside container namespace
$ ./runtime-lab inspect pod-001
Namespaces: pid,mnt,uts,net
Limits: cpu=500m memory=256Mi
Capabilities: NET_BIND_SERVICE only
The Core Question You Are Answering
“What must be true at the kernel boundary for a process to behave like a container?”
This question forces you to reason about primitives, not CLI conveniences.
Concepts You Must Understand First
- Linux namespaces
- Which resources are isolated and which are shared?
- Book Reference: The Linux Programming Interface - namespace chapters
- cgroups v2
- How do throttling and OOM behavior surface in runtime metrics?
- Book Reference: Linux internals references in TLPI + kernel docs
- Capabilities and no-new-privileges
- Which privileges are still dangerous under non-root users?
- Book Reference: Kubernetes security checklist + Linux capabilities docs
Questions to Guide Your Design
- Isolation design
- Which namespaces are mandatory for this lab baseline?
- Which host resources must never be mounted inside the runtime sandbox?
- Resource policy
- What are reasonable default quotas for deterministic tests?
- How will you observe throttle/OOM events quickly?
Thinking Exercise
Execution Path Trace
Draw the runtime flow from command invocation to exec and list where each security/resource control is applied.
Questions to answer:
- Which step is irreversible if misconfigured?
- Which failures are visible only after workload pressure increases?
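One way to make the trace concrete is to compose the namespace flag mask the runtime would hand to clone(2) or unshare(2). The flag values below are copied from the Linux uapi headers (sched.h) and defined locally so the sketch compiles anywhere; actually unsharing these namespaces requires elevated privileges, so this sketch only composes and reports the mask.

```go
package main

import "fmt"

// Namespace flag values from Linux <sched.h>, defined locally for portability.
const (
	cloneNewNS  = 0x00020000 // mnt namespace
	cloneNewUTS = 0x04000000
	cloneNewPID = 0x20000000
	cloneNewNet = 0x40000000
)

var namespaceFlags = map[string]uintptr{
	"pid": cloneNewPID,
	"mnt": cloneNewNS,
	"uts": cloneNewUTS,
	"net": cloneNewNet,
}

// cloneFlags composes the mask for the requested namespaces, failing
// loudly on an unknown name instead of silently skipping a checkpoint.
func cloneFlags(names []string) (uintptr, error) {
	var flags uintptr
	for _, n := range names {
		f, ok := namespaceFlags[n]
		if !ok {
			return 0, fmt.Errorf("unknown namespace %q", n)
		}
		flags |= f
	}
	return flags, nil
}

func main() {
	flags, err := cloneFlags([]string{"pid", "mnt", "uts", "net"})
	fmt.Printf("flags=%#x err=%v\n", flags, err)
}
```

Treating each namespace as one explicit flag reinforces the lab's checkpoint mindset: every isolation feature is requested, applied, and verified separately.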
The Interview Questions They Will Ask
- “Explain namespaces vs cgroups in one minute.”
- “Why is PID 1 behavior critical in containers?”
- “How do capabilities differ from running as root?”
- “How would you debug frequent OOM kills in production?”
- “Why is privileged mode risky even in internal environments?”
Hints in Layers
Hint 1: Starting Point
- Treat each isolation feature as a separate checkpoint, not one monolithic step.
Hint 2: Next Level
- Validate process view (`ps`), filesystem view (`mount`), and resource counters separately.
Hint 3: Technical Details
- Use pseudocode checkpoints: create namespaces -> assign cgroup -> apply security profile -> exec.
Hint 4: Tools/Debugging
- Use `strace`, `cat /sys/fs/cgroup/...`, and structured logs per lifecycle phase.
Books That Will Help
| Topic | Book | Chapter |
|---|---|---|
| Process isolation | The Linux Programming Interface | process + namespace chapters |
| Runtime internals | Docker Deep Dive | runtime architecture sections |
| Security posture | Kubernetes security docs | pod security + hardening sections |
Common Pitfalls and Debugging
Problem 1: “Process never exits cleanly”
- Why: PID 1 signal handling missing.
- Fix: Add explicit shutdown and child reap handling policy.
- Quick test: Launch and terminate 10 times; verify zero zombie processes.
Problem 2: “Latency spikes under low average CPU”
- Why: CPU throttling bursts.
- Fix: Tune requests/limits with workload profile.
- Quick test: Compare p95 latency before and after throttling adjustment.
Definition of Done
- Namespace and cgroup constraints are verifiably applied.
- Termination behavior is deterministic under repeated tests.
- Security baseline is documented and enforced.
- Failure signals (OOM/throttle) are observable and explained.
Project 2: OCI Image Builder and Registry Workflow
- File: DOCKER_CONTAINERS_KUBERNETES_LEARNING_PROJECTS/P02-oci-image-builder-registry-workflow.md
- Main Programming Language: Go
- Alternative Programming Languages: Python, Rust
- Coolness Level: Level 3 - Genuinely Clever
- Business Potential: 3. The “Platform Service”
- Difficulty: Level 3 - Advanced
- Knowledge Area: Artifact packaging and distribution protocols
- Software or Tool: OCI image/distribution specs
- Main Book: Docker Deep Dive
What you will build: An educational workflow that creates OCI-style artifacts and pushes/pulls them with digest integrity checks.
Why it teaches containers: It exposes the real contract behind docker build and docker push.
Core challenges you will face:
- Manifest and layer model comprehension -> OCI Image and Distribution Model
- Digest-first deployment discipline -> OCI Image and Distribution Model
- Promotion policy and immutability -> Stateful + Security + GitOps Operations
Real World Outcome
$ ./oci-lab build ./demo-app
[INFO] Generated layer digests:
sha256:1a2b...
sha256:3c4d...
[INFO] Manifest digest: sha256:9f9e...
$ ./oci-lab push registry.local/demo-app:lab
[INFO] Uploaded 2 new blobs
[INFO] Uploaded manifest sha256:9f9e...
[INFO] Tag demo-app:lab -> sha256:9f9e...
$ ./oci-lab pull registry.local/demo-app@sha256:9f9e...
[INFO] Digest verified
The Core Question You Are Answering
“How do we guarantee that what we built is exactly what we run?”
Concepts You Must Understand First
- Content-addressable storage
- Why digests matter for integrity and reproducibility.
- Book Reference: Designing Data-Intensive Applications - data integrity and replication concepts
- OCI artifact model
- Manifest, config, and layer relationship.
- Book Reference: Docker Deep Dive - image internals chapters
- Registry API behavior
- Why missing blob checks and retries matter.
- Book Reference: OCI distribution spec docs
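Content addressing is small enough to demonstrate directly. This sketch shows the core guarantee behind pull-by-digest; real registries hash the exact serialized manifest bytes, but the principle is identical:

```go
package main

import (
	"crypto/sha256"
	"fmt"
)

// digest returns the OCI-style content address of a blob.
func digest(blob []byte) string {
	return fmt.Sprintf("sha256:%x", sha256.Sum256(blob))
}

// verify simulates the integrity check a client runs on pull-by-digest:
// the content is re-hashed and compared to the requested address.
func verify(blob []byte, want string) bool {
	return digest(blob) == want
}

func main() {
	layer := []byte("layer contents")
	d := digest(layer)
	fmt.Println(verify(layer, d))              // true: content matches its address
	fmt.Println(verify([]byte("tampered"), d)) // false: any byte change changes the digest
}
```

This is why a digest reference can never silently point at different content, while a tag can be moved at any time.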
Questions to Guide Your Design
- Artifact identity
- Which workflows should permit tags, and which must require digests?
- How do you prevent accidental mutable-tag deployments?
- Promotion flow
- Will you rebuild per environment or promote immutable digest artifacts?
- How do you audit release provenance?
Thinking Exercise
Tag Drift Drill
Model a scenario where app:prod is retagged without deployment review.
- Which controls would detect this?
- Which controls would prevent it?
The Interview Questions They Will Ask
- “What is the difference between tag and digest?”
- “Why do registries deduplicate layers?”
- “How would you design secure image promotion across environments?”
- “What causes digest mismatch errors?”
- “How do SBOM and signatures relate to image trust?”
Hints in Layers
Hint 1: Starting Point
- Build a tiny artifact and inspect manifest structure first.
Hint 2: Next Level
- Compare push behavior when blobs already exist in registry.
Hint 3: Technical Details
- Model push as: auth -> upload missing blobs -> upload manifest -> assign tag.
Hint 4: Tools/Debugging
- Use registry API inspection and manifest tooling to validate media types and digests.
Books That Will Help
| Topic | Book | Chapter |
|---|---|---|
| Docker image internals | Docker Deep Dive | image + registry chapters |
| Data integrity | Designing Data-Intensive Applications | data model and integrity chapters |
| OCI standards | OCI docs | image/distribution specifications |
Common Pitfalls and Debugging
Problem 1: “Deployment uses wrong binary despite successful build”
- Why: Tag drift.
- Fix: Enforce digest pinning in deploy manifests.
- Quick test: Compare deployed digest to build digest in release metadata.
Problem 2: “CI build times keep increasing”
- Why: Layer cache invalidation strategy is poor.
- Fix: Reorder build graph to isolate volatile steps.
- Quick test: Run warm cache builds three times and compare.
Definition of Done
- OCI manifest/layer model is documented with real digests.
- Push/pull by digest works with verification.
- Promotion flow uses immutable references.
- Failure paths are tested (auth failure, missing blob, digest mismatch).
Project 3: BuildKit Cache and Reproducibility Lab
- File: DOCKER_CONTAINERS_KUBERNETES_LEARNING_PROJECTS/P03-buildkit-multi-stage-cache-lab.md
- Main Programming Language: Dockerfile + shell
- Alternative Programming Languages: Make, Python, Go
- Coolness Level: Level 2 - Practical but High ROI
- Business Potential: 4. The “Cost and Velocity Win”
- Difficulty: Level 2 - Intermediate
- Knowledge Area: CI/CD performance engineering
- Software or Tool: BuildKit, buildx, external cache backends
- Main Book: Docker Deep Dive
What you will build: A benchmarked CI build pipeline with deterministic cache behavior and measurable speedups.
Why it teaches containers: It connects artifact theory to practical build throughput and cost.
Core challenges you will face:
- Cache key stability -> OCI Image and Distribution Model
- Multi-stage dependency planning -> OCI Image and Distribution Model
- Deterministic builds under ephemeral runners -> Stateful + Security + GitOps Operations
Real World Outcome
$ ./build-lab run --scenario cold
[RESULT] Build duration: 4m12s
[RESULT] Cache hit ratio: 0%
$ ./build-lab run --scenario warm
[RESULT] Build duration: 1m03s
[RESULT] Cache hit ratio: 78%
$ ./build-lab report
[SUMMARY] Mean speedup: 3.9x
[SUMMARY] Reproducibility: digest stable across 5 runs
The Core Question You Are Answering
“How do we make container builds both fast and reproducible in ephemeral CI?”
Concepts You Must Understand First
- Build graph dependencies
- Which input changes invalidate which layers?
- Book Reference: Docker build docs and BuildKit overview
- External cache backends
- Why internal builder cache is insufficient for ephemeral CI.
- Book Reference: Docker cache backend docs
- Deterministic artifact strategy
- How to separate volatile from stable build inputs.
- Book Reference: Docker Deep Dive
Questions to Guide Your Design
- Which build stage changes most often, and how can it be isolated?
- Which cache export/import strategy best fits your CI runner model?
Thinking Exercise
Cache Invalidation Map
Draw a stage graph and mark which files invalidate each stage. Estimate build-time impact for each invalidation edge.
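The invalidation edges in that map follow one rule, which can be modeled in a few lines. This is a simplified sketch of BuildKit-style layer keys (hypothetical stage names; the base image tag is just an example), where each step's key depends on its own inputs and the key of the step before it:

```go
package main

import (
	"crypto/sha256"
	"fmt"
)

// cacheKey chains a step's inputs with its parent's key, so one volatile
// input invalidates every later step in the graph.
func cacheKey(parent, inputs string) string {
	sum := sha256.Sum256([]byte(parent + "|" + inputs))
	return fmt.Sprintf("%x", sum[:4])
}

// pipeline models a typical three-step build: base image, dependency
// install (changes rarely), source copy + compile (changes every commit).
func pipeline(depsFile, sourceTree string) []string {
	base := cacheKey("", "FROM golang:1.22")    // example base image
	deps := cacheKey(base, depsFile)
	build := cacheKey(deps, sourceTree)
	return []string{base, deps, build}
}

func main() {
	a := pipeline("go.mod v1", "src rev1")
	b := pipeline("go.mod v1", "src rev2") // only source changed
	fmt.Println("deps layer reused:", a[1] == b[1])  // true: cache hit
	fmt.Println("build layer reused:", a[2] == b[2]) // false: must rebuild
}
```

Reordering a Dockerfile so volatile inputs appear late is exactly an attempt to push invalidation edges toward the leaves of this graph.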
The Interview Questions They Will Ask
- “What is BuildKit and why does it improve builds?”
- “How do you debug persistent cache misses?”
- “When can cache optimization hurt correctness?”
- “Why might digest reproducibility fail across runners?”
Hints in Layers
Hint 1: Starting Point
- Measure baseline cold/warm builds before tuning anything.
Hint 2: Next Level
- Move dependency installation earlier, source copy later.
Hint 3: Technical Details
- Use registry-backed cache with explicit `cache-to`/`cache-from` policy.

Hint 4: Tools/Debugging
- Inspect BuildKit output for cache hit/miss reasons per step.
Books That Will Help
| Topic | Book | Chapter |
|---|---|---|
| Build internals | Docker Deep Dive | build and layer chapters |
| BuildKit behavior | Docker docs | BuildKit and cache sections |
| CI optimization | SRE/DevOps references | pipeline optimization chapters |
Common Pitfalls and Debugging
Problem 1: “Warm build not faster than cold build”
- Why: Non-deterministic step invalidates early stage.
- Fix: Reorder volatile inputs and stabilize build args.
- Quick test: Change one file per stage and measure invalidation scope.
Problem 2: “Digest differs across runners”
- Why: Environment-specific timestamps or toolchain drift.
- Fix: Pin toolchain and normalize build metadata.
- Quick test: Run same commit on two runners and compare digest outputs.
Definition of Done
- Cold vs warm build benchmark captured.
- External cache configured and verified.
- Reproducibility validated across repeated builds.
- Optimization decisions documented with measured evidence.
Project 4: CRI Runtime Observability with crictl and Events
- File: DOCKER_CONTAINERS_KUBERNETES_LEARNING_PROJECTS/P04-cri-observability-with-crictl-and-events.md
- Main Programming Language: Shell + Go (optional)
- Alternative Programming Languages: Python, Rust
- Coolness Level: Level 3 - Incident-Ready
- Business Potential: 2. The “Ops Differentiator”
- Difficulty: Level 3 - Advanced
- Knowledge Area: Node-level debugging
- Software or Tool: kubelet, CRI, crictl, container runtime logs
- Main Book: Kubernetes in Action
What you will build: A runtime observability playbook and tooling workflow to diagnose pod lifecycle failures from node to API.
Why it teaches containers/Kubernetes: It reveals the exact kubelet-to-runtime boundary where many production incidents occur.
Core challenges you will face:
- Correlating events across layers -> Kubernetes Control Plane and Reconciliation
- Runtime introspection at CRI boundary -> Container Isolation Primitives
- Separating app failures from platform failures -> OCI Image and Distribution Model
Real World Outcome
$ ./cri-lab diagnose pod/web-7d8f9
[STEP] API status collected
[STEP] Node runtime queried via CRI
[STEP] Pulled event timeline for last 15m
[FINDING] Image pull backoff caused by registry auth scope mismatch
[REMEDIATION] Updated imagePullSecret and redeployed
[VERIFY] Pod transitioned to Ready in 42s
The Core Question You Are Answering
“When a pod fails, where exactly did the failure happen in the control path?”
Concepts You Must Understand First
- Pod lifecycle and container states
- Pending vs Running vs CrashLoopBackOff interpretation.
- Book Reference: Kubernetes docs on pod lifecycle
- CRI contract
- How kubelet and runtime communicate.
- Book Reference: Kubernetes CRI docs
- Image pull and auth flow
- Which failures happen before container start.
- Book Reference: OCI distribution and registry behavior
Questions to Guide Your Design
- How will you time-correlate API events and node runtime events?
- Which minimum fields must every incident timeline include?
Thinking Exercise
Failure Taxonomy Exercise
Create a table mapping common statuses (ImagePullBackOff, CrashLoopBackOff, CreateContainerError) to likely failure layers and first diagnostic command.
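A starter version of that table can live as data in your diagnostic tool. The mappings below are illustrative defaults to extend with your own incidents, not an exhaustive taxonomy:

```go
package main

import "fmt"

// entry records the likely failure layer and the first diagnostic command.
type entry struct{ layer, firstCmd string }

var taxonomy = map[string]entry{
	"ImagePullBackOff":     {"registry/auth (pre-start)", "crictl pull <image>"},
	"CrashLoopBackOff":     {"application (post-start)", "kubectl logs --previous"},
	"CreateContainerError": {"runtime config (pre-start)", "kubectl describe pod"},
	"Pending":              {"scheduler (pre-schedule)", "kubectl describe pod"},
}

// classify points an operator at the right layer before any guessing starts.
func classify(status string) string {
	e, ok := taxonomy[status]
	if !ok {
		return "unknown: build the event timeline first"
	}
	return e.layer + " -> " + e.firstCmd
}

func main() {
	fmt.Println(classify("ImagePullBackOff"))
	fmt.Println(classify("SomethingNew"))
}
```

Encoding the taxonomy as data keeps the playbook reviewable and versioned alongside the tooling.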
The Interview Questions They Will Ask
- “How do you debug CrashLoopBackOff systematically?”
- “What does CRI do in Kubernetes?”
- “How do you tell image pull problems from runtime start problems?”
- “What is the first node-level command you run in a pod lifecycle incident?”
Hints in Layers
Hint 1: Starting Point
- Start with event timeline, not assumptions.
Hint 2: Next Level
- Build a single timeline that combines API, kubelet, and runtime entries.
Hint 3: Technical Details
- Categorize failure as pre-schedule, post-schedule pre-start, or post-start.
Hint 4: Tools/Debugging
- Use `kubectl describe`, `crictl ps/images/inspect`, and node journal logs.
Books That Will Help
| Topic | Book | Chapter |
|---|---|---|
| Kubernetes troubleshooting | Kubernetes in Action | operations chapters |
| Runtime architecture | Programming Kubernetes | node and runtime sections |
| Incident practice | SRE workbook resources | incident response chapters |
Common Pitfalls and Debugging
Problem 1: “Blaming app code for image pull failures”
- Why: Failure happens before process start.
- Fix: Verify registry auth and manifest availability first.
- Quick test: Pull same digest manually with runtime credentials.
Problem 2: “Confusing pending scheduling with runtime failure”
- Why: Node assignment never happened.
- Fix: Check scheduler events and constraints.
- Quick test: Inspect pod events for `FailedScheduling` markers.
Definition of Done
- End-to-end failure taxonomy documented.
- Deterministic timeline generation workflow implemented.
- At least three failure scenarios diagnosed and resolved.
- Post-incident checklist created for recurring use.
Project 5: Kubernetes Scheduler Simulator
- File: DOCKER_CONTAINERS_KUBERNETES_LEARNING_PROJECTS/P05-kubernetes-scheduler-simulator.md
- Main Programming Language: Go
- Alternative Programming Languages: Python, Rust
- Coolness Level: Level 4 - System Design Depth
- Business Potential: 3. The “Platform Brain”
- Difficulty: Level 4 - Expert
- Knowledge Area: Scheduling and capacity strategy
- Software or Tool: scheduler framework concepts
- Main Book: Programming Kubernetes
What you will build: A simulator that models Kubernetes-like filter and score placement decisions for pod scheduling.
Why it teaches Kubernetes: Scheduling policy becomes concrete when you implement and test it under synthetic workloads.
Core challenges you will face:
- Filter and score plugin design -> Kubernetes Control Plane and Reconciliation
- Fairness vs efficiency tradeoffs -> Kubernetes Control Plane and Reconciliation
- Constraint explosion under real workload mixes -> Stateful + Security + GitOps Operations
Real World Outcome
$ ./sched-lab simulate --workload workload-a.json --nodes nodes.json
[INFO] Pending pods: 120
[INFO] Feasible candidates after filter: 38 avg per pod
[INFO] Final placement complete: 112 bound, 8 unschedulable
[REPORT] Top unschedulable reason: insufficient memory (6), taint mismatch (2)
The Core Question You Are Answering
“What policy produces stable, fair, and efficient placement under changing constraints?”
Concepts You Must Understand First
- Scheduler filter/score lifecycle
- Book Reference: Kubernetes scheduler framework docs
- Resource requests/limits semantics
- Book Reference: Kubernetes in Action scheduling chapters
- Affinity, taints, topology spread
- Book Reference: Programming Kubernetes scheduling content
Questions to Guide Your Design
- Which scoring weights align with your workload goals?
- How will you explain unschedulable decisions in human-readable form?
Thinking Exercise
Policy Tradeoff Matrix
Evaluate bin-packing-heavy vs spread-heavy policies across three workload distributions.
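The filter/score split itself fits in a short sketch. This is not the real scheduler framework, just a minimal model: hard constraints filter nodes, then a least-allocated score prefers the emptiest feasible node, with a deterministic tie-break so repeated runs match.

```go
package main

import (
	"fmt"
	"sort"
)

type node struct {
	name    string
	freeMem int // MiB available
}

// schedule returns the chosen node, or ok=false when the pod is unschedulable.
func schedule(podMem int, nodes []node) (string, bool) {
	// Filter phase: drop nodes that violate the hard constraint.
	var feasible []node
	for _, n := range nodes {
		if n.freeMem >= podMem {
			feasible = append(feasible, n)
		}
	}
	if len(feasible) == 0 {
		return "", false // report reason: insufficient memory everywhere
	}
	// Score phase: least-allocated preference, name as stable tie-break.
	sort.Slice(feasible, func(i, j int) bool {
		if feasible[i].freeMem != feasible[j].freeMem {
			return feasible[i].freeMem > feasible[j].freeMem
		}
		return feasible[i].name < feasible[j].name
	})
	return feasible[0].name, true
}

func main() {
	nodes := []node{{"n1", 512}, {"n2", 2048}, {"n3", 100}}
	name, ok := schedule(256, nodes)
	fmt.Println(name, ok) // n2 true
}
```

Flipping the sort direction turns this spread-heavy policy into a bin-packing one, which is exactly the tradeoff the matrix asks you to evaluate.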
The Interview Questions They Will Ask
- “How does Kubernetes scheduler choose a node?”
- “What causes pods to remain Pending?”
- “How would you customize scheduling for GPU and latency workloads?”
- “What tradeoff exists between utilization and resilience?”
Hints in Layers
Hint 1: Starting Point
- Implement explainability first (why not scheduled), then optimize scoring.
Hint 2: Next Level
- Add configurable policy weights and compare outputs.
Hint 3: Technical Details
- Separate hard constraints (filter) from preference weights (score).
Hint 4: Tools/Debugging
- Produce per-pod decision traces for post-analysis.
Books That Will Help
| Topic | Book | Chapter |
|---|---|---|
| Scheduler internals | Programming Kubernetes | scheduling chapters |
| Cluster operations | Kubernetes in Action | scheduling and resources |
| Capacity planning | SRE literature | capacity and reliability chapters |
Common Pitfalls and Debugging
Problem 1: “Great utilization, poor failure tolerance”
- Why: Overly aggressive bin packing.
- Fix: Add spread and anti-affinity weighting.
- Quick test: Simulate one-node loss and compare disruption.
Problem 2: “Placement looks random”
- Why: Missing deterministic tie-break logic.
- Fix: Add stable ranking and explicit tie-break rules.
- Quick test: Re-run same workload and verify deterministic output.
Definition of Done
- Simulator supports filter and score phases with explainable output.
- At least three policy profiles benchmarked.
- Unschedulable diagnostics are deterministic and human-readable.
- Tradeoff report includes utilization and resilience metrics.
Project 6: Controller Reconciliation Lab
- File: DOCKER_CONTAINERS_KUBERNETES_LEARNING_PROJECTS/P06-controller-reconciliation-lab.md
- Main Programming Language: Go
- Alternative Programming Languages: Python, Java
- Coolness Level: Level 4 - Platform Craft
- Business Potential: 3. The “Automation Engine”
- Difficulty: Level 4 - Expert
- Knowledge Area: Declarative automation
- Software or Tool: Custom resources, informers, reconcile loops
- Main Book: Programming Kubernetes
What you will build: A controller lab that reconciles a custom resource into child workloads with idempotent actions and status reporting.
Why it teaches Kubernetes: Reconciliation is the core Kubernetes design pattern.
Core challenges you will face:
- Idempotent reconcile design -> Kubernetes Control Plane and Reconciliation
- Status and error semantics -> Kubernetes Control Plane and Reconciliation
- Safe retries and backoff -> Stateful + Security + GitOps Operations
Real World Outcome
$ ./reconcile-lab apply examples/cache-cluster.yaml
[INFO] Custom resource accepted
[INFO] Reconcile iteration #1 -> created StatefulSet and Service
[INFO] Reconcile iteration #2 -> status ReadyReplicas=3
[INFO] Drift detected after manual edit -> corrected in 9s
The Core Question You Are Answering
“How do we automate desired state safely in the presence of retries, failures, and drift?”
Concepts You Must Understand First
- Informer/watch mechanics
- Book Reference: Programming Kubernetes controller chapters
- Idempotent API mutation patterns
- Book Reference: Kubernetes patterns/operator design literature
- Status conditions and observedGeneration
- Book Reference: Kubernetes API conventions
Questions to Guide Your Design
- Which reconcile steps are safe to repeat without side effects?
- How do you represent partial progress in status fields?
Thinking Exercise
Retry Safety Walkthrough
For each reconcile step, write the expected behavior if the step runs five times due to transient errors.
The Interview Questions They Will Ask
- “What makes a reconcile loop idempotent?”
- “How do you avoid hot loops in failing controllers?”
- “How do status conditions help operations teams?”
- “Why is eventual consistency acceptable here?”
Hints in Layers
Hint 1: Starting Point
- Define desired child resources as pure functions of spec.
Hint 2: Next Level
- Separate read/compare and mutate phases clearly.
Hint 3: Technical Details
- Use generation checks and condition updates for progress visibility.
Hint 4: Tools/Debugging
- Track reconcile duration, requeue reasons, and error class counts.
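The hints above can be compressed into one skeleton: desired child state as a pure function of spec, a clear read/compare/mutate split, and actions that are no-ops on repeat. The sketch below is a toy model under those assumptions; `CacheClusterSpec`, `Cluster`, and the string return values are hypothetical, not the controller-runtime API.

```go
package main

import "fmt"

// CacheClusterSpec is a hypothetical custom resource spec.
type CacheClusterSpec struct {
	Name     string
	Replicas int
}

// ChildState models the observed child workload (e.g. a StatefulSet).
type ChildState struct {
	Exists   bool
	Replicas int
}

// Cluster is a toy API surface standing in for the real API server.
type Cluster struct{ children map[string]*ChildState }

func (c *Cluster) get(name string) ChildState {
	if s, ok := c.children[name]; ok {
		return *s
	}
	return ChildState{}
}

func (c *Cluster) apply(name string, replicas int) {
	c.children[name] = &ChildState{Exists: true, Replicas: replicas}
}

// desired is a pure function of spec: calling it any number of times
// yields the same answer, which is what makes reconcile retry-safe.
func desired(spec CacheClusterSpec) ChildState {
	return ChildState{Exists: true, Replicas: spec.Replicas}
}

// reconcile reads, compares, and mutates only on drift. It returns the
// action it took so retries can be observed to be no-ops.
func reconcile(c *Cluster, spec CacheClusterSpec) string {
	want := desired(spec)
	got := c.get(spec.Name)
	switch {
	case !got.Exists:
		c.apply(spec.Name, want.Replicas)
		return "created"
	case got.Replicas != want.Replicas:
		c.apply(spec.Name, want.Replicas)
		return "updated"
	default:
		return "no-op"
	}
}

func main() {
	c := &Cluster{children: map[string]*ChildState{}}
	spec := CacheClusterSpec{Name: "cache", Replicas: 3}
	fmt.Println(reconcile(c, spec)) // created
	// Running the same step five times, as in the thinking exercise:
	// every repeat is a no-op, so transient retries cause no side effects.
	for i := 0; i < 5; i++ {
		fmt.Println(reconcile(c, spec)) // no-op
	}
	c.children["cache"].Replicas = 1 // simulate manual drift
	fmt.Println(reconcile(c, spec))  // updated
}
```

The same compare-before-mutate shape is what prevents the "controller thrashes API server" pitfall: when nothing has drifted, the loop issues no writes at all.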
Books That Will Help
| Topic | Book | Chapter |
|---|---|---|
| Operator/controller design | Programming Kubernetes | operator chapters |
| Kubernetes patterns | Kubernetes Patterns | controller and operational patterns |
| API conventions | Kubernetes docs | API conventions references |
Common Pitfalls and Debugging
Problem 1: “Controller thrashes API server”
- Why: Reconcile loop requeues immediately on non-critical conditions.
- Fix: Use bounded backoff and event-driven requeues.
- Quick test: Inject transient errors and verify stable request rates.
Problem 2: “Status says Ready but resources are unhealthy”
- Why: Status update logic not tied to observed child health.
- Fix: Gate Ready on concrete health checks and generation sync.
- Quick test: Force child failure and confirm status flips promptly.
Definition of Done
- Reconcile logic is idempotent and retry-safe.
- Status conditions communicate progress and failure clearly.
- Drift correction is demonstrated with deterministic timing.
- Metrics/logging support incident diagnosis.
Project 7: StatefulSet Failover and Backup Drills
- File: DOCKER_CONTAINERS_KUBERNETES_LEARNING_PROJECTS/P07-statefulset-failover-backup-drills.md
- Main Programming Language: YAML + shell automation
- Alternative Programming Languages: Go, Python
- Coolness Level: Level 3 - Production Realism
- Business Potential: 4. The “Reliability Multiplier”
- Difficulty: Level 4 - Expert
- Knowledge Area: Stateful reliability engineering
- Software or Tool: StatefulSets, PVCs, backup/restore tooling
- Main Book: Kubernetes Patterns
What you will build: A stateful workload lab with controlled failover, backup schedules, and tested restore procedures.
Why it teaches production Kubernetes: Data safety and recoverability are where real platform maturity is proven.
Core challenges you will face:
- Identity and storage lifecycle reasoning -> Stateful + Security + GitOps Operations
- Failure-domain aware design -> Kubernetes Control Plane and Reconciliation
- Restore confidence and RTO/RPO validation -> Stateful + Security + GitOps Operations
Real World Outcome
$ ./state-lab inject-failure --node worker-1
[INFO] Primary pod rescheduled with persistent volume reattach
[INFO] Service remained available with 1 retry spike
$ ./state-lab backup --job nightly-001
[INFO] Backup completed: snapshot-2026-02-11T02:00Z
$ ./state-lab restore --snapshot snapshot-2026-02-11T02:00Z
[INFO] Restore verified by checksum and application consistency checks
[RESULT] RTO=7m12s RPO<=5m
The Core Question You Are Answering
“Can this workload fail and recover without violating availability and data guarantees?”
Concepts You Must Understand First
- StatefulSet semantics
- Book Reference: Kubernetes in Action stateful chapters
- Persistent volume lifecycle
- Book Reference: Kubernetes storage docs and patterns
- RTO/RPO engineering
- Book Reference: SRE and disaster recovery playbooks
Questions to Guide Your Design
- Which failure scenarios are most likely and most expensive?
- How will you prove restore integrity beyond process startup?
Thinking Exercise
Disaster Timeline Drill
Model a scenario in which the control plane stays healthy but a node fails, and map the detection, failover, restore, and validation checkpoints.
The Interview Questions They Will Ask
- “How do StatefulSets differ from Deployments?”
- “What is the difference between backup completion and restore confidence?”
- “How do you choose RPO and RTO targets?”
- “What is a common anti-pattern in stateful Kubernetes operations?”
Hints in Layers
Hint 1: Starting Point
- Define success criteria before running failure drills.
Hint 2: Next Level
- Test with realistic write load, not idle workloads.
Hint 3: Technical Details
- Verify logical integrity (application checks) in addition to snapshot status.
Hint 4: Tools/Debugging
- Use timeline logs and metrics to calculate RTO/RPO accurately.
Books That Will Help
| Topic | Book | Chapter |
|---|---|---|
| Stateful patterns | Kubernetes Patterns | stateful service chapters |
| Storage behavior | Kubernetes in Action | storage chapters |
| Reliability targets | Site Reliability Engineering | availability and DR sections |
Common Pitfalls and Debugging
Problem 1: “Restore works in test but fails in incident”
- Why: Unvalidated dependencies and order-of-operations mismatch.
- Fix: Rehearse full runbook in environment parity.
- Quick test: Quarterly full restore game day with timeboxed objectives.
Problem 2: “Data corruption after failover”
- Why: Consistency checkpoints missing before backup/restore.
- Fix: Add pre/post consistency validation gates.
- Quick test: Compare checksums and application-level invariants.
Definition of Done
- Failover scenario completed with measured service impact.
- Backup and restore runbook executed end-to-end.
- RTO/RPO measured and documented.
- Integrity checks prove data correctness post-restore.
Project 8: CNI and Service Network Troubleshooting Lab
- File: DOCKER_CONTAINERS_KUBERNETES_LEARNING_PROJECTS/P08-cni-service-network-troubleshooting.md
- Main Programming Language: Shell + YAML
- Alternative Programming Languages: Go, Python
- Coolness Level: Level 4 - Incident Hero
- Business Potential: 3. The “Platform Reliability Service”
- Difficulty: Level 4 - Expert
- Knowledge Area: Cluster networking and incident response
- Software or Tool: CNI, Service, NetworkPolicy, Gateway API
- Main Book: Kubernetes in Action
What you will build: A reproducible networking incident lab with scripted fault injection and layered diagnosis workflows.
Why it teaches Kubernetes networking: You learn to diagnose by layer instead of guessing.
Core challenges you will face:
- Layered fault isolation -> Networking and Traffic Delivery
- Readiness vs reachability distinction -> Networking and Traffic Delivery
- Policy and routing interactions -> Stateful + Security + GitOps Operations
Real World Outcome
$ ./net-lab run-scenario policy-block
[SYMPTOM] checkout-service timeout from frontend
[DIAG] pod-to-pod direct IP works
[DIAG] Service endpoint list healthy
[DIAG] NetworkPolicy denies namespace path
[FIX] Applied allow rule for frontend -> checkout
[VERIFY] p95 latency returned to baseline in 90s
The Core Question You Are Answering
“At which network layer is the failure actually happening, and how can I prove it quickly?”
Concepts You Must Understand First
- Kubernetes network model
- Book Reference: Kubernetes services/networking docs
- Service and EndpointSlice mechanics
- Book Reference: Kubernetes in Action networking chapters
- NetworkPolicy behavior
- Book Reference: Kubernetes security/network policy docs
Questions to Guide Your Design
- What is the minimum diagnostic sequence that avoids false conclusions?
- Which metrics/logs prove recovery, not just temporary symptom relief?
Thinking Exercise
Fault Tree Construction
Create a fault tree for “service timeout” with branch probabilities and the first diagnostic probe for each branch.
The Interview Questions They Will Ask
- “How do you troubleshoot Kubernetes networking in a structured way?”
- “Why can a service fail while pods remain healthy?”
- “What role does EndpointSlice play?”
- “How do you test NetworkPolicy safely in production-like environments?”
Hints in Layers
Hint 1: Starting Point
- Test direct pod IP reachability before service abstraction.
Hint 2: Next Level
- Validate endpoint readiness and DNS resolution separately.
Hint 3: Technical Details
- Record packet path checkpoints across source pod, node datapath, and destination pod.
Hint 4: Tools/Debugging
- Use targeted captures and policy audit outputs with timestamps.
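The layered sequence in the hints (pod IP first, then endpoints, then DNS, then policy) can be encoded as an ordered probe list that stops at the first failing layer. This is a sketch of the control flow only; real checks would shell out to ping, dig, kubectl, or packet captures, and the `Probe`/`diagnose` names are hypothetical.

```go
package main

import "fmt"

// Probe is one diagnostic step: a layer name and a check that reports
// whether that layer is healthy. Checks are injected functions here so
// the ordering logic can be tested without a cluster.
type Probe struct {
	Layer string
	Check func() bool
}

// diagnose runs probes bottom-up and stops at the first failing layer,
// which is the earliest point at which the fault can be proven. Running
// probes in a fixed order is what prevents false conclusions like
// "assume DNS" from the pitfalls below.
func diagnose(probes []Probe) string {
	for _, p := range probes {
		if !p.Check() {
			return p.Layer
		}
	}
	return "no fault found"
}

func main() {
	// Simulated policy-block scenario: pod IP, endpoints, and DNS are
	// fine, but the NetworkPolicy layer denies the path.
	probes := []Probe{
		{"pod-to-pod IP reachability", func() bool { return true }},
		{"EndpointSlice readiness", func() bool { return true }},
		{"DNS resolution", func() bool { return true }},
		{"NetworkPolicy allows path", func() bool { return false }},
	}
	fmt.Println("fault layer:", diagnose(probes))
}
```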
Books That Will Help
| Topic | Book | Chapter |
|---|---|---|
| Kubernetes networking | Kubernetes in Action | networking chapters |
| Service architecture | Kubernetes docs | services/network model |
| Policy hardening | Kubernetes security docs | policy and admission sections |
Common Pitfalls and Debugging
Problem 1: “Assuming DNS issue for every timeout”
- Why: Endpoint or policy issue often looks identical from app logs.
- Fix: Follow layered diagnostics strictly.
- Quick test: Direct pod IP and EndpointSlice check before DNS deep dive.
Problem 2: “NetworkPolicy change fixed one path but broke another”
- Why: Incomplete namespace/selector modeling.
- Fix: Build explicit traffic matrix before policy edits.
- Quick test: Run matrix verification against all service dependencies.
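The traffic-matrix quick test can be automated: enumerate every expected source-to-destination flow with its allow/deny verdict, then compare against what the current policy set actually permits. A sketch under stated assumptions; `Edge`, `verifyMatrix`, and the policy oracle are illustrative names, and in practice the oracle would be a connectivity probe or a policy simulator.

```go
package main

import "fmt"

// Edge is one expected flow in the traffic matrix.
type Edge struct {
	From, To string
	Allowed  bool
}

// verifyMatrix compares the expected matrix against an oracle that
// evaluates the current policy set, and returns every mismatch so a
// policy edit that fixes one path but breaks another is caught.
func verifyMatrix(matrix []Edge, policyAllows func(from, to string) bool) []string {
	var mismatches []string
	for _, e := range matrix {
		if got := policyAllows(e.From, e.To); got != e.Allowed {
			mismatches = append(mismatches,
				fmt.Sprintf("%s -> %s: expected allowed=%v, got %v", e.From, e.To, e.Allowed, got))
		}
	}
	return mismatches
}

func main() {
	matrix := []Edge{
		{"frontend", "checkout", true},
		{"frontend", "payments-db", false},
		{"checkout", "payments-db", true},
	}
	// Toy policy: frontend may reach checkout; only checkout may reach payments-db.
	policy := func(from, to string) bool {
		return (from == "frontend" && to == "checkout") ||
			(from == "checkout" && to == "payments-db")
	}
	if mismatches := verifyMatrix(matrix, policy); len(mismatches) == 0 {
		fmt.Println("matrix verified")
	} else {
		for _, m := range mismatches {
			fmt.Println("VIOLATION:", m)
		}
	}
}
```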
Definition of Done
- At least three network fault scenarios reproduced and resolved.
- Layered diagnostic playbook documented.
- Mean time to root cause measured and improved.
- Post-fix validation includes latency and error budget checks.
Project 9: Pod Security and Policy-as-Code Enforcement
- File: DOCKER_CONTAINERS_KUBERNETES_LEARNING_PROJECTS/P09-policy-as-code-and-pod-security.md
- Main Programming Language: YAML policy + shell
- Alternative Programming Languages: Rego, CEL, Go
- Coolness Level: Level 3 - Security-Forward Platforming
- Business Potential: 4. The “Compliance Accelerator”
- Difficulty: Level 3 - Advanced
- Knowledge Area: Admission control and workload security
- Software or Tool: Pod Security Standards, admission policies, image policy gates
- Main Book: Kubernetes Patterns
What you will build: A staged policy rollout that enforces baseline and restricted controls with measurable developer impact.
Why it teaches production security: You convert abstract guidance into operational controls and exception workflows.
Core challenges you will face:
- Policy correctness vs developer usability -> Stateful + Security + GitOps Operations
- Least privilege enforcement -> Container Isolation Primitives
- Exception governance -> Stateful + Security + GitOps Operations
Real World Outcome
$ ./policy-lab evaluate --mode audit
[RESULT] 37 workloads evaluated
[RESULT] 9 violations: privileged=true (3), hostPath (2), runAsNonRoot missing (4)
$ ./policy-lab enforce --namespace payments
[INFO] Enforcement active: restricted profile
[VERIFY] compliant workloads deploy, violating workloads blocked with clear reason
The Core Question You Are Answering
“How do we enforce secure defaults without freezing delivery velocity?”
Concepts You Must Understand First
- Pod Security Standards profiles
- Book Reference: Kubernetes pod security standards docs
- Admission control flow
- Book Reference: Kubernetes admission controller docs
- Exception lifecycle management
- Book Reference: security governance practices
Questions to Guide Your Design
- Which rules should be enforced immediately, and which should phase in?
- What exception metadata is mandatory (owner, expiry, risk rationale)?
Thinking Exercise
Policy Rollout Plan
Design a 30-day rollout from audit to enforce for two namespaces with different workload maturity.
The Interview Questions They Will Ask
- “What are Baseline and Restricted Pod Security profiles?”
- “How do you roll out policy safely in production?”
- “What metrics show whether policy rollout is healthy?”
- “How do you prevent exception sprawl?”
Hints in Layers
Hint 1: Starting Point
- Start in audit mode and classify violations by risk and fix complexity.
Hint 2: Next Level
- Add actionable remediation text to policy failures.
Hint 3: Technical Details
- Require exception expiry and approval metadata.
Hint 4: Tools/Debugging
- Track violation counts, time-to-fix, and blocked deployment rates.
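The audit-mode evaluation hinted at above amounts to running each workload spec through a rule set and collecting violations with actionable remediation text. Here is a toy sketch of that shape; `PodSpec` and `evaluate` are hypothetical simplifications, and real Pod Security Standards checks cover far more fields than the three shown.

```go
package main

import "fmt"

// PodSpec carries only the fields the toy rules inspect.
type PodSpec struct {
	Name         string
	Privileged   bool
	UsesHostPath bool
	RunAsNonRoot bool
}

// evaluate returns violations paired with remediation text, mirroring
// the "audit first, enforce later" rollout: in audit mode results are
// only reported; in enforce mode a non-empty result would block admission.
func evaluate(p PodSpec) []string {
	var v []string
	if p.Privileged {
		v = append(v, "privileged=true: drop privileged or request an exception")
	}
	if p.UsesHostPath {
		v = append(v, "hostPath volume: use a PVC or emptyDir instead")
	}
	if !p.RunAsNonRoot {
		v = append(v, "runAsNonRoot missing: set securityContext.runAsNonRoot=true")
	}
	return v
}

func main() {
	workloads := []PodSpec{
		{Name: "web", RunAsNonRoot: true},
		{Name: "legacy-agent", Privileged: true, UsesHostPath: true},
	}
	for _, w := range workloads {
		for _, violation := range evaluate(w) {
			fmt.Printf("[AUDIT] %s: %s\n", w.Name, violation)
		}
	}
}
```

Because each violation carries its own fix instruction, the same output serves both the audit report and the admission-denial message, which is what keeps enforcement from freezing delivery.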
Books That Will Help
| Topic | Book | Chapter |
|---|---|---|
| Policy patterns | Kubernetes Patterns | governance and policy chapters |
| Security standards | Kubernetes docs | pod security and admission sections |
| Hardening guidance | NSA/CISA guidance | recommendations summary |
Common Pitfalls and Debugging
Problem 1: “Policy rollout triggered mass deployment failures”
- Why: No audit phase and no exception process.
- Fix: Stage rollout and publish migration guides.
- Quick test: Pilot namespace before broad enforcement.
Problem 2: “Rules are enforced but risk remains high”
- Why: Coverage gaps (RBAC, image trust, secret handling).
- Fix: Expand controls beyond pod spec checks.
- Quick test: Map controls against threat scenarios and run gap review.
Definition of Done
- Policy baseline enforced in at least one production-like namespace.
- Exception workflow documented and time-bounded.
- Security and delivery impact metrics collected.
- Rollout playbook reusable for new namespaces.
Project 10: GitOps Platform Capstone
- File: DOCKER_CONTAINERS_KUBERNETES_LEARNING_PROJECTS/P10-gitops-platform-capstone.md
- Main Programming Language: YAML + automation scripts
- Alternative Programming Languages: Go, Python
- Coolness Level: Level 4 - Full-System Mastery
- Business Potential: 5. The “Platform Product”
- Difficulty: Level 4 - Expert
- Knowledge Area: End-to-end platform architecture
- Software or Tool: GitOps reconciler, policy gates, observability stack
- Main Book: Kubernetes Patterns + Programming Kubernetes
What you will build: A cohesive platform blueprint that integrates image promotion, policy, scheduling, networking, and stateful reliability into one operating model.
Why it teaches mastery: It forces concept integration across the full lifecycle from build to recovery.
Core challenges you will face:
- Cross-domain architecture decisions -> All concept clusters
- Operational governance and ownership boundaries -> Stateful + Security + GitOps Operations
- Reliable rollback and incident response -> Kubernetes Control Plane and Reconciliation
Real World Outcome
$ ./platform-capstone promote release-2026-02-11
[CHECK] Image digest signed and scanned
[CHECK] Policy gates passed
[CHECK] Canary rollout healthy at 10% then 50% then 100%
[CHECK] SLO guardrails respected
[RESULT] Release completed in 18m with full audit trail
$ ./platform-capstone drill --scenario node-failure-and-rollback
[RESULT] Auto-heal completed in 4m
[RESULT] Rollback validated to previous digest in 2m
The Core Question You Are Answering
“Can I run a multi-team platform where change is fast, safe, observable, and recoverable?”
Concepts You Must Understand First
- Artifact trust and promotion
- Book Reference: Docker Deep Dive + OCI docs
- Reconciliation and rollout mechanics
- Book Reference: Programming Kubernetes
- Policy and security baselines
- Book Reference: Kubernetes security docs
- Stateful and network reliability
- Book Reference: Kubernetes Patterns
Questions to Guide Your Design
- Where are the hard gates and where are advisory gates?
- Which metrics decide rollback automatically vs manually?
Thinking Exercise
Platform Operating Model Diagram
Draw ownership boundaries for app teams, platform team, and security team. Mark handoff points and escalation paths.
The Interview Questions They Will Ask
- “How would you design a production-grade Kubernetes platform?”
- “What are your non-negotiable release gates and why?”
- “How do you balance developer velocity and security controls?”
- “Describe your rollback strategy for stateful and stateless workloads.”
- “What does good platform observability look like?”
Hints in Layers
Hint 1: Starting Point
- Begin with a minimal golden path and explicit non-goals.
Hint 2: Next Level
- Add staged promotion with measurable quality gates.
Hint 3: Technical Details
- Define a policy matrix: build-time, admission-time, runtime controls.
Hint 4: Tools/Debugging
- Implement dashboards for rollout health, error budget, and drift detection.
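The hard-vs-advisory gate question above can be made concrete with an ordered gate runner: hard gates block the release, advisory gates only warn, and every decision lands in the audit trail. This is a control-flow sketch only; the `Gate` and `promote` names are hypothetical, and real checks would call signing, scanning, and SLO systems.

```go
package main

import "fmt"

// Gate is one promotion checkpoint. Hard gates stop the release;
// advisory gates only warn.
type Gate struct {
	Name  string
	Hard  bool
	Check func() bool
}

// promote runs gates in order and returns whether the release proceeds,
// plus an audit trail of every decision, matching the "full audit trail"
// outcome in the transcript above.
func promote(gates []Gate) (ok bool, trail []string) {
	for _, g := range gates {
		if g.Check() {
			trail = append(trail, "[PASS] "+g.Name)
			continue
		}
		if g.Hard {
			trail = append(trail, "[BLOCK] "+g.Name)
			return false, trail
		}
		trail = append(trail, "[WARN] "+g.Name)
	}
	return true, trail
}

func main() {
	gates := []Gate{
		{"image digest signed", true, func() bool { return true }},
		{"vulnerability scan clean", true, func() bool { return true }},
		{"cost budget within plan", false, func() bool { return false }}, // advisory only
		{"canary SLO healthy at 10%", true, func() bool { return true }},
	}
	ok, trail := promote(gates)
	for _, line := range trail {
		fmt.Println(line)
	}
	fmt.Println("release proceeds:", ok)
}
```

Encoding the hard/advisory split in data rather than code means the policy matrix (build-time, admission-time, runtime) can be reviewed and versioned like any other platform contract.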
Books That Will Help
| Topic | Book | Chapter |
|---|---|---|
| Platform design | Kubernetes Patterns | platform and operational patterns |
| Control loops | Programming Kubernetes | controller/operator chapters |
| Reliability | Site Reliability Engineering | release and incident chapters |
Common Pitfalls and Debugging
Problem 1: “Too many tools, no coherent operating model”
- Why: Tool-first design without ownership boundaries.
- Fix: Define platform contracts and responsibilities first.
- Quick test: A new team completes the onboarding exercise without relying on ad-hoc tribal knowledge.
Problem 2: “Rollbacks are documented but not trusted”
- Why: No recurring rollback drills.
- Fix: Schedule and score game days.
- Quick test: Timeboxed quarterly rollback exercise with pass/fail criteria.
Definition of Done
- End-to-end release flow includes digest trust, policy gates, and observability.
- Rollback and failover drills validated with measured outcomes.
- Ownership model and runbooks are documented.
- Architecture tradeoffs and future roadmap are explicit.
Project Comparison Table
| Project | Difficulty | Time | Depth of Understanding | Fun Factor |
|---|---|---|---|---|
| 1. Kernel-Level Container Runtime Lab | Expert | 10-16h | Very High | 4/5 |
| 2. OCI Image Builder and Registry Workflow | Advanced | 10-16h | High | 4/5 |
| 3. BuildKit Cache and Reproducibility Lab | Intermediate | 6-10h | Medium-High | 3/5 |
| 4. CRI Runtime Observability Lab | Advanced | 8-14h | High | 4/5 |
| 5. Scheduler Simulator | Expert | 12-20h | Very High | 5/5 |
| 6. Controller Reconciliation Lab | Expert | 12-20h | Very High | 5/5 |
| 7. StatefulSet Failover and Backup Drills | Expert | 14-24h | Very High | 4/5 |
| 8. CNI and Service Network Troubleshooting Lab | Expert | 12-20h | Very High | 5/5 |
| 9. Pod Security and Policy-as-Code Enforcement | Advanced | 8-14h | High | 4/5 |
| 10. GitOps Platform Capstone | Expert | 20-40h | Maximum | 5/5 |
Recommendation
If you are new to this topic: Start with Project 1, then Project 2, then Project 3 to build a strong artifact/runtime base.
If you are a platform engineer: Start with Project 5, Project 6, and then integrate with Project 10.
If you want security and reliability leadership: Prioritize Project 7, Project 8, and Project 9, then complete Project 10.
Final Overall Project: Platform Reliability and Delivery Program
The Goal: Combine Projects 2, 6, 7, 8, 9, and 10 into a production-like platform operating model.
- Build immutable image promotion and trust gates.
- Implement reconciliation-driven deployment and rollback automation.
- Add network fault injection and stateful restore drills with SLO guardrails.
Success Criteria: You can execute one full release and one disaster drill with measured success and clear postmortem evidence.
From Learning to Production: What Is Next
| Your Project | Production Equivalent | Gap to Fill |
|---|---|---|
| Project 1 | Hardened runtime baseline | enterprise policy integration |
| Project 2 | Artifact registry platform | organization-wide provenance and signing |
| Project 3 | CI optimization program | cross-team standardization and governance |
| Project 4 | SRE runtime incident runbook | 24/7 alerting and escalation integration |
| Project 5 | Scheduling policy tuning | live workload and cost optimization loops |
| Project 6 | Operator/controller platform | formal API lifecycle and versioning |
| Project 7 | Stateful reliability program | multi-region DR and compliance audits |
| Project 8 | Network reliability engineering | advanced traffic policy and eBPF observability |
| Project 9 | Security policy platform | full threat model and audit evidence automation |
| Project 10 | Internal developer platform | productization and multi-team onboarding |
Summary
This learning path covers Docker and Kubernetes through 10 hands-on projects that start from kernel-level container behavior and end with platform-scale operational excellence.
| # | Project Name | Main Language | Difficulty | Time Estimate |
|---|---|---|---|---|
| 1 | Kernel-Level Container Runtime Lab | Go | Expert | 10-16h |
| 2 | OCI Image Builder and Registry Workflow | Go | Advanced | 10-16h |
| 3 | BuildKit Cache and Reproducibility Lab | Dockerfile/Shell | Intermediate | 6-10h |
| 4 | CRI Runtime Observability Lab | Shell | Advanced | 8-14h |
| 5 | Kubernetes Scheduler Simulator | Go | Expert | 12-20h |
| 6 | Controller Reconciliation Lab | Go | Expert | 12-20h |
| 7 | StatefulSet Failover and Backup Drills | YAML/Shell | Expert | 14-24h |
| 8 | CNI and Service Network Troubleshooting Lab | Shell | Expert | 12-20h |
| 9 | Pod Security and Policy-as-Code Enforcement | YAML | Advanced | 8-14h |
| 10 | GitOps Platform Capstone | YAML/Shell | Expert | 20-40h |
Expected Outcomes
- You can reason about container and Kubernetes behavior from first principles.
- You can build and operate digest-trusted, policy-enforced release pipelines.
- You can diagnose and recover from real production failure patterns with confidence.
Additional Resources and References
Standards and Specifications
- OCI release notices overview: https://opencontainers.org/release-notices/overview/
- OCI image-spec v1.1.1 notice: https://opencontainers.org/release-notices/v1-1-1-image-spec/
- OCI distribution-spec v1.1.1 notice: https://opencontainers.org/release-notices/v1-1-1-distribution-spec/
- OCI runtime-spec v1.3.0 notice: https://opencontainers.org/release-notices/v1-3-0-runtime-spec/
Official Documentation
- Docker overview: https://docs.docker.com/get-started/docker-overview/
- Docker BuildKit docs: https://docs.docker.com/build/buildkit/
- Docker cache optimization: https://docs.docker.com/build/cache/optimize/
- Kubernetes components: https://kubernetes.io/docs/concepts/overview/components
- Kubernetes CRI: https://kubernetes.io/docs/concepts/containers/cri/
- Kubernetes container runtimes: https://kubernetes.io/docs/setup/production-environment/container-runtimes/
- Kubernetes network model: https://kubernetes.io/docs/concepts/services-networking/
- Kubernetes pod lifecycle: https://kubernetes.io/docs/concepts/workloads/pods/pod-lifecycle/
- Kubernetes scheduler configuration: https://kubernetes.io/docs/reference/scheduling/config/
- Kubernetes operating etcd: https://kubernetes.io/docs/tasks/administer-cluster/configure-upgrade-etcd/
- Kubernetes pod security standards: https://kubernetes.io/docs/concepts/security/pod-security-standards/
- Kubernetes security checklist: https://kubernetes.io/docs/concepts/security/security-checklist/
Industry Data and Adoption Context
- CNCF Annual Cloud Native Survey announcement (Jan 20, 2026): https://www.cncf.io/announcements/2026/01/20/kubernetes-established-as-the-de-facto-operating-system-for-ai-as-production-use-hits-82-in-2025-cncf-annual-cloud-native-survey/
- CNCF Annual Survey 2023: https://www.cncf.io/reports/cncf-annual-survey-2023/
- Stack Overflow Developer Survey 2024: https://survey.stackoverflow.co/2024/
- Stack Overflow 2024 Technology breakdown: https://survey.stackoverflow.co/2024/technology
Security Guidance
- NSA/CISA Kubernetes hardening guidance notice: https://www.nsa.gov/Press-Room/News-Highlights/Article/Article/2716980/nsa-cisa-release-kubernetes-hardening-guidance/