Project 8: CNI and Service Network Troubleshooting Lab

Build a layered troubleshooting workflow for Kubernetes networking incidents using controlled fault injection.

Quick Reference

Attribute                          Value
Difficulty                         Expert
Time Estimate                      12-20 hours
Main Programming Language          Shell + YAML
Alternative Programming Languages  Go, Python
Coolness Level                     Level 4 - Incident Hero
Business Potential                 3. Reliability Service
Prerequisites                      cluster networking basics, DNS, service endpoints, network policy
Key Topics                         CNI diagnostics, endpoint analysis, policy debugging, packet path reasoning

1. Learning Objectives

  1. Diagnose network failures by layer, not guesswork.
  2. Distinguish CNI, Service, DNS, and policy failure classes.
  3. Validate fixes with latency and error-budget evidence.
  4. Reduce MTTR for recurrent network incidents.

2. All Theory Needed (Per-Concept Breakdown)

2.1 Kubernetes Network Layers

Fundamentals

Pod connectivity, service routing, and ingress/gateway policies are separate layers with different failure signals.

Deep Dive into the concept

Layered diagnostics begin with direct pod-to-pod reachability. If direct pod traffic works, move up to the service abstraction (EndpointSlices, DNS, kube-proxy or eBPF rules). If the service path works inside the cluster but fails externally, inspect ingress/gateway configuration and policy. Remember that NetworkPolicy can silently deny paths that look healthy in application logs. Maintaining a strict diagnostic sequence prevents false conclusions about which layer failed.
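The strict ordering above can be sketched as a small check runner. The check_* names are hypothetical stubs standing in for real probes (pod-to-pod curl, DNS lookup, EndpointSlice inspection); a simulated NetworkPolicy failure is hard-coded so the script runs standalone:

```sh
#!/bin/sh
# Layered check runner sketch. Each check_* function would wrap a real probe;
# the stubs below simulate a NetworkPolicy denial for demonstration.
check_pod_to_pod()    { return 0; }   # would run: kubectl exec ... -- curl <pod-ip>
check_dns()           { return 0; }   # would run: kubectl exec ... -- nslookup <svc>
check_endpointslice() { return 0; }   # would run: kubectl get endpointslice -l ...
check_networkpolicy() { return 1; }   # simulated failure for this demo

for layer in pod_to_pod dns endpointslice networkpolicy; do
  if "check_${layer}"; then
    echo "${layer}: pass"
  else
    echo "${layer}: fail"   # first failing layer is the suspect
    break                   # strict ordering: stop here, do not keep probing
  fi
done
```

Stopping at the first failing layer is what keeps the diagnosis honest: checks further down the stack would fail too, but only as a consequence of the earlier break.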

2.2 Fault Isolation and Verification Discipline

Fundamentals

Fixes must be validated against both symptom resolution and guardrail metrics.

Deep Dive into the concept

A temporary improvement in symptoms does not prove the root cause has been found. Validation should include request success rate, p95/p99 latency, and regression checks for unrelated services. Mature teams maintain fault trees for frequent incident classes and automate the first-line checks.
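A minimal validation pass over replay results might look like the following; the log format ("<http_status> <latency_ms>"), the file path, and the inline sample are all assumptions standing in for real replay output, so the arithmetic is checkable:

```sh
#!/bin/sh
# Validation sketch: compute success rate and p95 latency from replay results.
# Assumed log format per line: "<http_status> <latency_ms>".
cat > /tmp/replay.log <<'EOF'
200 12
200 15
200 18
500 250
200 20
200 14
200 16
200 22
200 19
200 17
EOF

total=$(awk 'END { print NR }' /tmp/replay.log)
ok=$(awk '$1 == 200 { n++ } END { print n }' /tmp/replay.log)
success_rate=$(awk -v ok="$ok" -v t="$total" 'BEGIN { printf "%.1f", ok * 100 / t }')

# p95: sort by latency, take the value at ceil(0.95 * n)
p95=$(sort -n -k2 /tmp/replay.log | awk -v t="$total" \
  'NR == int((t * 95 + 99) / 100) { print $2 }')

echo "success_rate=${success_rate}% p95=${p95}ms"
```

Note that a single slow 500 drags p95 far above the healthy baseline here, which is exactly why latency percentiles belong next to the raw success rate in any fix verification.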


3. Project Specification

3.1 What You Will Build

A network incident lab with:

  • reproducible failure scenarios
  • layered diagnostic scripts
  • fix validation workflow

3.2 Functional Requirements

  1. Inject at least three network fault types.
  2. Run deterministic diagnostics per fault.
  3. Output probable root cause and confidence.
  4. Validate fix through metrics and replay tests.
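The four requirements above could be captured in a declarative scenario file that the lab tooling consumes; the schema below is an illustrative assumption, not a format the project prescribes:

```yaml
# Illustrative scenario definition (assumed schema).
scenario: service-timeout
fault:
  type: networkpolicy-deny     # e.g. networkpolicy-deny, dns-break, endpoint-drain
  target: checkout
  reversible: true             # faults must be cleanly removable
checks:                        # executed strictly in this order
  - pod_to_pod
  - dns
  - endpointslice
  - networkpolicy
validation:
  success_rate_min: 99.9       # percent
  p95_latency_max_ms: 200
  replay_requests: 500
```

Keeping scenarios declarative makes the "repeated scenarios classify consistently" requirement testable: the same file should produce the same diagnosis on every run.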

3.3 Non-Functional Requirements

  • Performance: diagnosis report in under 3 minutes.
  • Reliability: repeated scenarios classify consistently.
  • Usability: clear runbook steps for on-call use.

3.7 Real World Outcome

$ ./net-lab run --scenario service-timeout
symptom: frontend -> checkout timeout
layer_checks:
  pod_to_pod: pass
  dns: pass
  endpointslice: pass
  networkpolicy: deny detected
fix: allow frontend namespace to checkout service
verification: success rate back to 99.9%, p95 normalized

4. Solution Architecture

4.1 High-Level Design

fault injector -> diagnostics engine -> root cause classifier -> validation runner
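One transparent way to wire the classifier stage is a plain rule table keyed on the first failing layer; the causes and confidence labels below are illustrative, not taken from the project:

```sh
#!/bin/sh
# Classifier sketch: map the first failing layer to a probable cause and a
# coarse confidence. Rule table and confidence values are illustrative.
classify() {
  case "$1" in
    pod_to_pod)    echo "probable_cause=cni_dataplane confidence=high" ;;
    dns)           echo "probable_cause=coredns_or_resolv confidence=medium" ;;
    endpointslice) echo "probable_cause=selector_or_readiness confidence=high" ;;
    networkpolicy) echo "probable_cause=policy_denial confidence=high" ;;
    *)             echo "probable_cause=unknown confidence=low" ;;
  esac
}

classify networkpolicy
```

A visible rule table like this is what makes the confidence model "transparent": an on-call engineer can read exactly why the tool blamed a given layer, and amend the table when it is wrong.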

4.2 Key Components

Component           Responsibility                Key Decisions
Fault injector      create controlled failures    realistic but reversible scenarios
Diagnostics engine  execute layered checks        strict check ordering
Classifier          map symptoms to likely cause  transparent confidence model
Validation runner   confirm recovery              SLO-based verification

5. Implementation Guide

5.3 The Core Question You’re Answering

“Which network layer is failing, and what evidence proves that diagnosis?”

5.6 Milestones

  1. Build baseline traffic matrix.
  2. Add fault scenarios and scripted checks.
  3. Add root-cause report output.
  4. Add post-fix validation suite.
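Milestone 1, the baseline traffic matrix, can be sketched as a nested probe loop. probe() is a stub so the script runs standalone; in-cluster it would wrap something like kubectl exec with a short-timeout curl, and the pod names here are hypothetical:

```sh
#!/bin/sh
# Baseline traffic matrix sketch: probe every src -> dst pair and print one
# row per source pod. probe() is a stub; a real version would exec into the
# source pod and curl the destination with a short timeout.
pods="frontend checkout payments"
probe() { return 0; }   # stub: report every path reachable at baseline

for src in $pods; do
  row="$src:"
  for dst in $pods; do
    if probe "$src" "$dst"; then
      row="$row ${dst}=ok"
    else
      row="$row ${dst}=FAIL"
    fi
  done
  echo "$row"
done
```

Capturing this matrix before any fault is injected gives the diagnostics engine a known-good reference, so later failures show up as diffs against the baseline rather than absolute judgments.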

5.9 Definition of Done

  • Three network faults reproduced and diagnosed.
  • Layered runbook validated by repeated tests.
  • Fix verification includes latency + success metrics.
  • MTTR improvement documented across runs.