Project 8: CNI and Service Network Troubleshooting Lab

Build a layered troubleshooting workflow for Kubernetes networking incidents using controlled fault injection.

Quick Reference

Attribute                          Value
Difficulty                         Expert
Time Estimate                      12-20 hours
Main Programming Language          Shell + YAML
Alternative Programming Languages  Go, Python
Coolness Level                     Level 4 - Incident Hero
Business Potential                 3. Reliability Service
Prerequisites                      cluster networking basics, DNS, service endpoints, network policy
Key Topics                         CNI diagnostics, endpoint analysis, policy debugging, packet path reasoning

1. Learning Objectives

  1. Diagnose network failures by layer, not guesswork.
  2. Distinguish CNI, Service, DNS, and policy failure classes.
  3. Validate fixes with latency and error-budget evidence.
  4. Reduce MTTR for recurrent network incidents.

2. All Theory Needed (Per-Concept Breakdown)

2.1 Kubernetes Network Layers

Fundamentals

Pod connectivity, service routing, and ingress/gateway policies are separate layers with different failure signals.

Deep Dive into the concept

Layered diagnostics begin with direct pod-to-pod reachability. If direct pod traffic works, move up to the service abstraction (EndpointSlices, DNS, kube-proxy or eBPF rules). If the service path works inside the cluster but fails externally, inspect ingress/gateway configuration and policy. Remember that NetworkPolicy can silently deny paths that look healthy in application logs. Maintaining a strict diagnostic sequence prevents false conclusions about which layer failed.
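The strict ordering above can be sketched as a small check runner. The check_* names are hypothetical stubs standing in for real probes (pod-to-pod curl, DNS lookup, EndpointSlice inspection); a simulated NetworkPolicy failure is hard-coded so the script runs standalone:

```sh
#!/bin/sh
# Layered check runner sketch. Each check_* function would wrap a real probe;
# the stubs below simulate a NetworkPolicy denial for demonstration.
check_pod_to_pod()    { return 0; }   # would run: kubectl exec ... -- curl <pod-ip>
check_dns()           { return 0; }   # would run: kubectl exec ... -- nslookup <svc>
check_endpointslice() { return 0; }   # would run: kubectl get endpointslice -l ...
check_networkpolicy() { return 1; }   # simulated failure for this demo

for layer in pod_to_pod dns endpointslice networkpolicy; do
  if "check_${layer}"; then
    echo "${layer}: pass"
  else
    echo "${layer}: fail"   # first failing layer is the suspect
    break                   # strict ordering: stop here, do not keep probing
  fi
done
```

Stopping at the first failing layer is what keeps the diagnosis honest: checks further down the stack would fail too, but only as a consequence of the earlier break.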

2.2 Fault Isolation and Verification Discipline

Fundamentals

Fixes must be validated against both symptom resolution and guardrail metrics.

Deep Dive into the concept

A temporary improvement in symptoms does not prove the root cause has been found. Validation should include request success rate, p95/p99 latency, and regression checks for unrelated services. Mature teams maintain fault trees for frequent incident classes and automate the first-line checks.
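A minimal validation pass over replay results might look like the following; the log format ("<http_status> <latency_ms>"), the file path, and the inline sample are all assumptions standing in for real replay output, so the arithmetic is checkable:

```sh
#!/bin/sh
# Validation sketch: compute success rate and p95 latency from replay results.
# Assumed log format per line: "<http_status> <latency_ms>".
cat > /tmp/replay.log <<'EOF'
200 12
200 15
200 18
500 250
200 20
200 14
200 16
200 22
200 19
200 17
EOF

total=$(awk 'END { print NR }' /tmp/replay.log)
ok=$(awk '$1 == 200 { n++ } END { print n }' /tmp/replay.log)
success_rate=$(awk -v ok="$ok" -v t="$total" 'BEGIN { printf "%.1f", ok * 100 / t }')

# p95: sort by latency, take the value at ceil(0.95 * n)
p95=$(sort -n -k2 /tmp/replay.log | awk -v t="$total" \
  'NR == int((t * 95 + 99) / 100) { print $2 }')

echo "success_rate=${success_rate}% p95=${p95}ms"
```

Note that a single slow 500 drags p95 far above the healthy baseline here, which is exactly why latency percentiles belong next to the raw success rate in any fix verification.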


3. Project Specification

3.1 What You Will Build

A network incident lab with:

  • reproducible failure scenarios
  • layered diagnostic scripts
  • fix validation workflow

3.2 Functional Requirements

  1. Inject at least three network fault types.
  2. Run deterministic diagnostics per fault.
  3. Output probable root cause and confidence.
  4. Validate fix through metrics and replay tests.
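The four requirements above could be captured in a declarative scenario file that the lab tooling consumes; the schema below is an illustrative assumption, not a format the project prescribes:

```yaml
# Illustrative scenario definition (assumed schema).
scenario: service-timeout
fault:
  type: networkpolicy-deny     # e.g. networkpolicy-deny, dns-break, endpoint-drain
  target: checkout
  reversible: true             # faults must be cleanly removable
checks:                        # executed strictly in this order
  - pod_to_pod
  - dns
  - endpointslice
  - networkpolicy
validation:
  success_rate_min: 99.9       # percent
  p95_latency_max_ms: 200
  replay_requests: 500
```

Keeping scenarios declarative makes the "repeated scenarios classify consistently" requirement testable: the same file should produce the same diagnosis on every run.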

3.3 Non-Functional Requirements

  • Performance: diagnosis report in under 3 minutes.
  • Reliability: repeated scenarios classify consistently.
  • Usability: clear runbook steps for on-call use.

3.7 Real World Outcome

$ ./net-lab run --scenario service-timeout
symptom: frontend -> checkout timeout
layer_checks:
  pod_to_pod: pass
  dns: pass
  endpointslice: pass
  networkpolicy: deny detected
fix: allow frontend namespace to checkout service
verification: success rate back to 99.9%, p95 normalized

4. Solution Architecture

4.1 High-Level Design

fault injector -> diagnostics engine -> root cause classifier -> validation runner
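One transparent way to wire the classifier stage is a plain rule table keyed on the first failing layer; the causes and confidence labels below are illustrative, not taken from the project:

```sh
#!/bin/sh
# Classifier sketch: map the first failing layer to a probable cause and a
# coarse confidence. Rule table and confidence values are illustrative.
classify() {
  case "$1" in
    pod_to_pod)    echo "probable_cause=cni_dataplane confidence=high" ;;
    dns)           echo "probable_cause=coredns_or_resolv confidence=medium" ;;
    endpointslice) echo "probable_cause=selector_or_readiness confidence=high" ;;
    networkpolicy) echo "probable_cause=policy_denial confidence=high" ;;
    *)             echo "probable_cause=unknown confidence=low" ;;
  esac
}

classify networkpolicy
```

A visible rule table like this is what makes the confidence model "transparent": an on-call engineer can read exactly why the tool blamed a given layer, and amend the table when it is wrong.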

4.2 Key Components

Component           Responsibility                Key Decisions
Fault injector      create controlled failures    realistic but reversible scenarios
Diagnostics engine  execute layered checks        strict check ordering
Classifier          map symptoms to likely cause  transparent confidence model
Validation runner   confirm recovery              SLO-based verification

5. Implementation Guide

5.3 The Core Question You’re Answering

“Which network layer is failing, and what evidence proves that diagnosis?”

5.6 Milestones

  1. Build baseline traffic matrix.
  2. Add fault scenarios and scripted checks.
  3. Add root-cause report output.
  4. Add post-fix validation suite.
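Milestone 1, the baseline traffic matrix, can be sketched as a nested probe loop. probe() is a stub so the script runs standalone; in-cluster it would wrap something like kubectl exec with a short-timeout curl, and the pod names here are hypothetical:

```sh
#!/bin/sh
# Baseline traffic matrix sketch: probe every src -> dst pair and print one
# row per source pod. probe() is a stub; a real version would exec into the
# source pod and curl the destination with a short timeout.
pods="frontend checkout payments"
probe() { return 0; }   # stub: report every path reachable at baseline

for src in $pods; do
  row="$src:"
  for dst in $pods; do
    if probe "$src" "$dst"; then
      row="$row ${dst}=ok"
    else
      row="$row ${dst}=FAIL"
    fi
  done
  echo "$row"
done
```

Capturing this matrix before any fault is injected gives the diagnostics engine a known-good reference, so later failures show up as diffs against the baseline rather than absolute judgments.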

5.9 Definition of Done

  • Three network faults reproduced and diagnosed.
  • Layered runbook validated by repeated tests.
  • Fix verification includes latency + success metrics.
  • MTTR improvement documented across runs.