Project 8: CNI and Service Network Troubleshooting Lab
Build a layered troubleshooting workflow for Kubernetes networking incidents using controlled fault injection.
Quick Reference
| Attribute | Value |
|---|---|
| Difficulty | Expert |
| Time Estimate | 12-20 hours |
| Main Programming Language | Shell + YAML |
| Alternative Programming Languages | Go, Python |
| Coolness Level | Level 4 - Incident Hero |
| Business Potential | 3. Reliability Service |
| Prerequisites | cluster networking basics, DNS, service endpoints, network policy |
| Key Topics | CNI diagnostics, endpoint analysis, policy debugging, packet path reasoning |
1. Learning Objectives
- Diagnose network failures by layer, not guesswork.
- Distinguish CNI, Service, DNS, and policy failure classes.
- Validate fixes with latency and error-budget evidence.
- Reduce MTTR for recurrent network incidents.
2. All Theory Needed (Per-Concept Breakdown)
2.1 Kubernetes Network Layers
Fundamentals
Pod connectivity, service routing, and ingress/gateway policies are separate layers with different failure signals.
Deep Dive into the concept
Layered diagnostics begin with direct pod-to-pod reachability. If direct pod traffic works, investigate the Service abstraction next: EndpointSlices, DNS resolution, and kube-proxy or eBPF dataplane rules. If the Service path works inside the cluster but fails externally, inspect the ingress or gateway configuration. Finally, remember that NetworkPolicy can deny paths that appear healthy from application logs alone. Maintaining a strict diagnostic sequence prevents false conclusions.
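The sequence above can be sketched as ordered shell checks. This is a minimal sketch using standard kubectl commands; pod names, namespaces, ports, the `/healthz` path, and the availability of `wget`/`nslookup` in the pod image are all assumptions, not part of any real workload:

```shell
# Hedged sketch of the strict layer order; all workload names are placeholders.
set -euo pipefail

check_pod_to_pod() {   # layer 1: direct pod IP reachability, bypassing Services
  # assumes wget exists in the source pod's image
  kubectl exec "$1" -- wget -qO- --timeout=2 "http://$2:$3/healthz" >/dev/null
}

check_dns() {          # layer 2: in-cluster resolution of the Service name
  kubectl exec "$1" -- nslookup "$2.$3.svc.cluster.local" >/dev/null
}

check_endpoints() {    # layer 3: does the Service have any ready endpoint addresses?
  kubectl get endpointslices -n "$1" -l "kubernetes.io/service-name=$2" \
    -o jsonpath='{.items[*].endpoints[*].addresses[*]}' | grep -q .
}

check_policies() {     # layer 4: surface NetworkPolicies that could deny the path
  kubectl get networkpolicy -n "$1" -o name
}
```

Running the checks strictly in this order is what lets a single failing layer localize the fault: a layer-3 failure with layers 1 and 2 passing points at endpoint readiness, not the CNI.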
2.2 Fault Isolation and Verification Discipline
Fundamentals
Fixes must be validated against both symptom resolution and guardrail metrics.
Deep Dive into the concept
A temporary symptom improvement does not prove root cause. Validation should include request success rate, p95/p99 latency, and regression checks for unrelated services. Mature teams maintain fault trees for frequent incidents and automate first-line checks.
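One way to make that validation concrete is a small SLO gate over a replay log. This sketch assumes a hypothetical `replay.log` with `<http_status> <latency_ms>` lines; the synthetic data, 200-only success criterion, and nearest-rank p95 are all illustrative choices:

```shell
# Synthetic replay log for illustration: 19 fast successes, 1 slow failure.
printf '200 10\n%.0s' {1..19} > replay.log
echo '500 100' >> replay.log

# Success rate: fraction of requests with HTTP 200.
total=$(wc -l < replay.log)
ok=$(awk '$1 == 200' replay.log | wc -l)
success=$(awk -v o="$ok" -v t="$total" 'BEGIN { printf "%.3f", o / t }')

# p95 latency via nearest-rank: sort latencies, take the 95th-percentile row.
p95=$(awk '{print $2}' replay.log | sort -n |
  awk -v t="$total" 'NR == int(0.95 * t + 0.5) { print; exit }')

echo "success_rate=$success p95_ms=$p95"
```

A fix would only count as verified when both numbers clear their budgets and the same gate, run against an unrelated service, shows no regression.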
3. Project Specification
3.1 What You Will Build
A network incident lab with:
- reproducible failure scenarios
- layered diagnostic scripts
- fix validation workflow
3.2 Functional Requirements
- Inject at least three network fault types.
- Run deterministic diagnostics per fault.
- Output probable root cause and confidence.
- Validate fix through metrics and replay tests.
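The "probable root cause and confidence" requirement can be sketched as a transparent rule table over the ordered layer-check results; the cause names and confidence labels below are invented for this lab:

```shell
# Hedged sketch: map layer-check results to a probable cause.
# Argument order mirrors the diagnostic sequence: pod, dns, endpoints, policy.
classify() {
  local pod=$1 dns=$2 eps=$3 policy=$4
  if   [ "$pod" = fail ];    then echo "cause=cni_dataplane confidence=high"
  elif [ "$dns" = fail ];    then echo "cause=cluster_dns confidence=high"
  elif [ "$eps" = fail ];    then echo "cause=missing_endpoints confidence=medium"
  elif [ "$policy" = deny ]; then echo "cause=networkpolicy_deny confidence=high"
  else                            echo "cause=unknown confidence=low"
  fi
}
```

Keeping the mapping this explicit is what makes the confidence model auditable: every verdict can be traced to one rule and one failing layer.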
3.3 Non-Functional Requirements
- Performance: diagnosis report in under 3 minutes.
- Reliability: repeated scenarios classify consistently.
- Usability: clear runbook steps for on-call use.
3.4 Real World Outcome
$ ./net-lab run --scenario service-timeout
symptom: frontend -> checkout timeout
layer_checks:
  pod_to_pod: pass
  dns: pass
  endpointslice: pass
  networkpolicy: deny detected
fix: allow frontend namespace to checkout service
verification: success rate back to 99.9%, p95 normalized
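The `fix` line in the sample output corresponds to an ingress allow rule. A minimal sketch of such a NetworkPolicy, assuming hypothetical labels (`app: checkout`), a namespace literally named `frontend`, and TCP port 8080:

```yaml
# Illustrative fix only; selectors, namespace names, and port are assumptions.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-frontend-to-checkout
  namespace: checkout
spec:
  podSelector:
    matchLabels:
      app: checkout          # the pods whose ingress is being opened
  policyTypes:
    - Ingress
  ingress:
    - from:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: frontend
      ports:
        - protocol: TCP
          port: 8080
```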
4. Solution Architecture
4.1 High-Level Design
fault injector -> diagnostics engine -> root cause classifier -> validation runner
4.2 Key Components
| Component | Responsibility | Key Decisions |
|---|---|---|
| Fault injector | create controlled failures | realistic but reversible scenarios |
| Diagnostics engine | execute layered checks | strict check ordering |
| Classifier | map symptoms to likely cause | transparent confidence model |
| Validation runner | confirm recovery | SLO-based verification |
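To show how the four components could share a contract, here is a hypothetical scenario definition; the schema is invented for this lab, not taken from any existing tool:

```yaml
# Illustrative scenario file; every field name here is an assumption.
name: service-timeout
fault:
  type: networkpolicy-deny       # e.g. networkpolicy-deny, dns-blackhole, endpoint-drain
  target:
    namespace: checkout
    service: checkout
  revert: true                   # the injector must be able to undo the fault
diagnostics:
  order: [pod_to_pod, dns, endpointslice, networkpolicy]
  timeout_seconds: 180           # matches the 3-minute reporting budget
validation:
  success_rate_min: 0.999
  p95_latency_budget_ms: 250
  replay: true
```

Declaring the check order and validation budgets per scenario keeps the diagnostics engine deterministic and the classifier's inputs reproducible across runs.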
5. Implementation Guide
5.1 The Core Question You’re Answering
“Which network layer is failing, and what evidence proves that diagnosis?”
5.2 Milestones
- Build baseline traffic matrix.
- Add fault scenarios and scripted checks.
- Add root-cause report output.
- Add post-fix validation suite.
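The baseline traffic matrix milestone can be sketched as a probe loop over every ordered pod pair; pod names, the port, and the health path are placeholders for your own workloads:

```shell
# Hedged sketch: record pod-to-pod reachability as the lab's healthy baseline.
probe() {  # "pass" or "fail" for one source -> destination pair
  # assumes wget exists in the source pod's image; port/path are placeholders
  if kubectl exec "$1" -- wget -qO- --timeout=2 "http://$2:8080/healthz" >/dev/null 2>&1
  then echo pass
  else echo fail
  fi
}

traffic_matrix() {  # print "src dst result" for every ordered pod pair
  local pods=("$@") src dst
  for src in "${pods[@]}"; do
    for dst in "${pods[@]}"; do
      [ "$src" = "$dst" ] && continue
      echo "$src $dst $(probe "$src" "$dst")"
    done
  done
}
```

Diffing a matrix captured during an incident against this baseline turns "something is broken" into a concrete list of failing pairs to feed the layered checks.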
5.3 Definition of Done
- Three network faults reproduced and diagnosed.
- Layered runbook validated by repeated tests.
- Fix verification includes latency + success metrics.
- MTTR improvement documented across runs.