Project 5: Kubernetes Scheduler Simulator
Build a simulator that models filter/score scheduling decisions and explains why pods are placed or left pending.
Quick Reference
| Attribute | Value |
|---|---|
| Difficulty | Expert |
| Time Estimate | 12-20 hours |
| Main Programming Language | Go |
| Alternative Programming Languages | Python, Rust |
| Coolness Level | Level 4 - System Design Depth |
| Business Potential | 3. Platform Brain |
| Prerequisites | resource requests, affinity, taints, topology concepts |
| Key Topics | filter/score pipeline, fairness, explainable scheduling |
1. Learning Objectives
- Model scheduler filter and score phases.
- Compare policy profiles under realistic workload mixes.
- Explain unschedulable results deterministically.
- Quantify utilization vs resilience tradeoffs.
2. All Theory Needed (Per-Concept Breakdown)
2.1 Scheduling Pipeline
Fundamentals
The Kubernetes scheduler first filters out nodes that cannot run the pod, then scores the remaining feasible nodes. Binding assigns the pod to the chosen node.
Deep Dive into the concept
Hard constraints include resource availability, taints/tolerations, node selectors, and affinity requirements. Scoring handles preference optimization: spreading, bin-packing, locality, and custom priorities. A robust simulator separates these concerns and produces explainability outputs for each decision. Unschedulable diagnosis is as valuable as successful placement.
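The filter phase can be sketched as a function that checks hard constraints in a fixed order and returns a reason code for the first one that fails. This is a minimal sketch, not the kube-scheduler implementation: the `Node` and `Pod` shapes, the `filterNode` and `tolerates` helpers, and the reason codes `cpu` and `selector-mismatch` are assumptions made for illustration, and taints are reduced to bare keys.

```go
package main

import "fmt"

// Node and Pod are simplified, hypothetical shapes for this simulator,
// not the real Kubernetes API types.
type Node struct {
	Name    string
	FreeCPU int64             // millicores still unallocated
	FreeMem int64             // MiB still unallocated
	Labels  map[string]string // node labels
	Taints  []string          // taint keys, all treated as NoSchedule
}

type Pod struct {
	Name         string
	CPU, Mem     int64             // resource requests
	NodeSelector map[string]string // every pair must match a node label
	Tolerations  []string          // taint keys this pod tolerates
}

// filterNode checks hard constraints in a fixed order and returns the first
// failing reason code, or "" when the node is feasible.
func filterNode(p Pod, n Node) string {
	if p.CPU > n.FreeCPU {
		return "cpu"
	}
	if p.Mem > n.FreeMem {
		return "memory"
	}
	for k, v := range p.NodeSelector {
		if got, ok := n.Labels[k]; !ok || got != v {
			return "selector-mismatch"
		}
	}
	for _, taint := range n.Taints {
		if !tolerates(p.Tolerations, taint) {
			return "taint-mismatch"
		}
	}
	return ""
}

// tolerates reports whether the pod's tolerations cover the given taint key.
func tolerates(tolerations []string, taint string) bool {
	for _, t := range tolerations {
		if t == taint {
			return true
		}
	}
	return false
}

func main() {
	pod := Pod{Name: "web", CPU: 500, Mem: 2048, NodeSelector: map[string]string{"disk": "ssd"}}
	nodes := []Node{
		{Name: "n1", FreeCPU: 1000, FreeMem: 1024, Labels: map[string]string{"disk": "ssd"}},
		{Name: "n2", FreeCPU: 1000, FreeMem: 4096, Labels: map[string]string{"disk": "ssd"}, Taints: []string{"gpu"}},
		{Name: "n3", FreeCPU: 1000, FreeMem: 4096, Labels: map[string]string{"disk": "ssd"}},
	}
	for _, n := range nodes {
		if reason := filterNode(pod, n); reason != "" {
			fmt.Printf("%s: rejected (%s)\n", n.Name, reason)
		} else {
			fmt.Printf("%s: feasible\n", n.Name)
		}
	}
}
```

Checking constraints in a fixed order means a node that fails several checks always reports the same reason, which keeps unschedulable reports reproducible.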
How this fits into the projects
- Core for P05; supports rollout planning in P10.
Definitions & key terms
- feasible set, scoring weight, unschedulable reason, binding.
Mental model diagram
pending pod -> filter feasible nodes -> score feasible nodes -> bind best node
How it works
- Evaluate hard constraints.
- Build feasible set.
- Score with policy weights.
- Pick deterministic winner and bind.
Invariants: hard constraints are never violated. Failure modes: opaque tie-break logic, overfitted scoring weights.
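The four steps above can be sketched as a single function that handles one pod at a time. This is an illustrative sketch under simplifying assumptions: CPU is the only hard constraint, node names are unique, and the `Decision` record, the `schedule` function, and the `spreadScore` policy are hypothetical names that do not come from Kubernetes.

```go
package main

import "fmt"

type Node struct {
	Name    string
	FreeCPU int64 // millicores still unallocated
}

type Pod struct {
	Name string
	CPU  int64 // requested millicores
}

// Decision records what happened to one pod so the result can be explained.
type Decision struct {
	Pod      string
	Node     string            // empty when the pod stays pending
	Rejected map[string]string // node name -> reason code
	Scores   map[string]int    // node name -> score, feasible nodes only
}

// schedule runs filter -> score -> pick -> bind for a single pod.
// nodes is mutated on bind so later pods see the reduced capacity.
func schedule(p Pod, nodes []Node, score func(Pod, Node) int) Decision {
	d := Decision{Pod: p.Name, Rejected: map[string]string{}, Scores: map[string]int{}}
	best, bestScore := -1, 0
	for i, n := range nodes {
		// Filter: a hard constraint is never traded against a score.
		if p.CPU > n.FreeCPU {
			d.Rejected[n.Name] = "cpu"
			continue
		}
		// Score: only feasible nodes receive a score.
		s := score(p, n)
		d.Scores[n.Name] = s
		// Pick: highest score wins, ties go to the lowest node name.
		if best == -1 || s > bestScore || (s == bestScore && n.Name < nodes[best].Name) {
			best, bestScore = i, s
		}
	}
	if best >= 0 {
		nodes[best].FreeCPU -= p.CPU // bind
		d.Node = nodes[best].Name
	}
	return d
}

// spreadScore is a hypothetical policy: prefer the node with the most CPU
// left after placement.
func spreadScore(p Pod, n Node) int { return int(n.FreeCPU - p.CPU) }

func main() {
	nodes := []Node{{Name: "n1", FreeCPU: 1000}, {Name: "n2", FreeCPU: 2000}, {Name: "n3", FreeCPU: 2000}}
	pods := []Pod{{Name: "a", CPU: 1500}, {Name: "b", CPU: 1500}, {Name: "c", CPU: 1500}}
	for _, p := range pods {
		d := schedule(p, nodes, spreadScore)
		fmt.Printf("%s -> %q rejected=%v\n", d.Pod, d.Node, d.Rejected)
	}
}
```

Because every decision keeps its rejected nodes and scores, the same record can back both the placement report and the unschedulable diagnosis.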
Minimal concrete example
Pod A feasible nodes: n1,n3,n4
Scores: n1=74 n3=81 n4=81
Tie-break rule: lexicographically lowest node name -> n3
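The example above can be reproduced with a two-key sort: score descending, then node name ascending. The `scored` type and `rank` function are hypothetical names for this sketch. Go iterates maps in random order, so the explicit sort is what makes the result stable.

```go
package main

import (
	"fmt"
	"sort"
)

type scored struct {
	Node  string
	Score int
}

// rank orders candidates by score descending, then node name ascending,
// so equal scores always resolve the same way.
func rank(scores map[string]int) []scored {
	out := make([]scored, 0, len(scores))
	for n, s := range scores {
		out = append(out, scored{Node: n, Score: s})
	}
	sort.Slice(out, func(i, j int) bool {
		if out[i].Score != out[j].Score {
			return out[i].Score > out[j].Score
		}
		return out[i].Node < out[j].Node
	})
	return out
}

func main() {
	ranked := rank(map[string]int{"n1": 74, "n3": 81, "n4": 81})
	fmt.Println("winner:", ranked[0].Node) // winner: n3
	fmt.Println("ranking:", ranked)
}
```

String comparison is byte-wise, so `n10` sorts before `n3`. If node names carry numeric suffixes, the tie-break rule should state whether that ordering is intended.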
Common misconceptions
- “Scheduler only checks free CPU/memory.” -> many policy constraints apply.
Check-your-understanding questions
- Why separate filter and score phases?
- Why do deterministic tie-breakers matter?
Check-your-understanding answers
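- Feasibility is a yes/no check against hard constraints, while scoring ranks the nodes that passed. Keeping them separate lets every rejection carry an exact reason.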
- Determinism enables reproducibility and incident analysis.
2.2 Capacity and Fairness Tradeoffs
Fundamentals
High utilization can conflict with fault tolerance and noisy-neighbor isolation.
Deep Dive into the concept
Aggressive bin packing improves cost but concentrates risk. Spread-oriented policies improve resilience but may waste headroom. A useful simulator reports both resource efficiency and disruption impact under node failure scenarios.
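One way to put numbers on this tradeoff is to compute two metrics for the same set of pods under different placements. The sketch below assumes identical nodes, counts only nodes that hold at least one pod toward utilization, and measures disruption as the largest share of pods lost when any single node fails. The `Placement` type and `report` function are hypothetical, and a real simulator would likely track memory and failure domains larger than one node.

```go
package main

import "fmt"

// Placement maps a node name to the CPU requests (millicores) of the pods
// bound to it.
type Placement map[string][]int64

// report returns the CPU utilization across nodes in use and the largest
// fraction of pods lost if any single node fails.
func report(p Placement, nodeCPU int64) (utilization, worstLoss float64) {
	var usedCPU int64
	totalPods, maxPods, activeNodes := 0, 0, 0
	for _, reqs := range p {
		if len(reqs) == 0 {
			continue // empty nodes could be scaled away, so they are not counted
		}
		activeNodes++
		totalPods += len(reqs)
		if len(reqs) > maxPods {
			maxPods = len(reqs)
		}
		for _, r := range reqs {
			usedCPU += r
		}
	}
	if activeNodes == 0 {
		return 0, 0
	}
	utilization = float64(usedCPU) / float64(int64(activeNodes)*nodeCPU)
	worstLoss = float64(maxPods) / float64(totalPods)
	return utilization, worstLoss
}

func main() {
	// The same four 1000m pods on 4000m nodes, placed two ways.
	packed := Placement{"n1": {1000, 1000, 1000, 1000}, "n2": {}, "n3": {}, "n4": {}}
	spread := Placement{"n1": {1000}, "n2": {1000}, "n3": {1000}, "n4": {1000}}

	u, l := report(packed, 4000)
	fmt.Printf("packed: utilization=%.2f worst_single_node_loss=%.2f\n", u, l)
	u, l = report(spread, 4000)
	fmt.Printf("spread: utilization=%.2f worst_single_node_loss=%.2f\n", u, l)
}
```

With this toy input the packed placement reaches full utilization on one node but loses every pod if that node fails, while the spread placement uses a quarter of each node and loses a quarter of the pods in the worst case.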
3. Project Specification
3.1 What You Will Build
A simulator that ingests node and pod datasets, applies scheduling policies, and outputs placement + diagnostics.
3.2 Functional Requirements
- Parse node and pod constraints.
- Execute filter/score decision flow.
- Produce explainable placement reports.
- Support multiple policy profiles.
3.3 Non-Functional Requirements
- Performance: schedule 1,000 pods in under 10 seconds locally.
- Reliability: deterministic output for same input.
- Usability: human-readable unschedulable report.
3.7 Real World Outcome
$ ./sched-lab simulate --nodes nodes.json --pods pods.json --policy spread
scheduled: 922
unschedulable: 78
top_reasons:
- memory: 41
- taint-mismatch: 21
- affinity-conflict: 16
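The `top_reasons` summary could be produced by tallying one primary reason per unschedulable pod and sorting the tally deterministically. This is a sketch: the `reasonCount` type and `topReasons` function are hypothetical, the sample data is hand-made rather than real simulator output, and how a primary reason is chosen when nodes reject a pod for different reasons is left as an assumption.

```go
package main

import (
	"fmt"
	"sort"
)

type reasonCount struct {
	Reason string
	Count  int
}

// topReasons tallies one primary reason per unschedulable pod and sorts by
// count descending, then reason ascending, so the report is stable across runs.
func topReasons(primary map[string]string) []reasonCount {
	counts := map[string]int{}
	for _, reason := range primary {
		counts[reason]++
	}
	out := make([]reasonCount, 0, len(counts))
	for r, c := range counts {
		out = append(out, reasonCount{Reason: r, Count: c})
	}
	sort.Slice(out, func(i, j int) bool {
		if out[i].Count != out[j].Count {
			return out[i].Count > out[j].Count
		}
		return out[i].Reason < out[j].Reason
	})
	return out
}

func main() {
	// pod name -> primary reason; a tiny hand-made sample.
	primary := map[string]string{
		"api-1":   "memory",
		"api-2":   "memory",
		"api-3":   "memory",
		"db-1":    "taint-mismatch",
		"db-2":    "taint-mismatch",
		"cache-1": "affinity-conflict",
	}
	fmt.Println("top_reasons:")
	for _, rc := range topReasons(primary) {
		fmt.Printf("- %s: %d\n", rc.Reason, rc.Count)
	}
}
```

Sorting on reason name as a secondary key keeps two reasons with equal counts from swapping places between runs.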
4. Solution Architecture
4.1 High-Level Design
dataset loader -> filter engine -> score engine -> binder -> report module
4.2 Key Components
| Component | Responsibility | Key Decisions |
|---|---|---|
| Loader | parse node/pod inputs | schema validation |
| Filter engine | hard constraints | explicit reason codes |
| Score engine | preference weights | policy profiles |
| Reporter | placement diagnostics | deterministic summaries |
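The score engine row pairs preference weights with policy profiles. One possible shape is a profile that maps scorer names to integer weights and a weighted average over 0 to 100 scores. Everything here is an assumption for illustration: the scorer names borrow from kube-scheduler's MostAllocated and LeastAllocated strategies but the formulas are simplified to CPU only, the profile names and weights are invented, and the node is assumed to have passed filtering with `CapCPU > 0`.

```go
package main

import "fmt"

type Node struct {
	Name            string
	CapCPU, UsedCPU int64 // millicores
}

// Profile maps a scorer name to its weight.
type Profile map[string]int

// scorers return a value in 0..100 for a node that already passed filtering.
var scorers = map[string]func(n Node, req int64) int{
	// Higher when the node would be fuller after placement (bin packing).
	"most-allocated": func(n Node, req int64) int {
		return int(100 * (n.UsedCPU + req) / n.CapCPU)
	},
	// Higher when the node would have more headroom after placement (spreading).
	"least-allocated": func(n Node, req int64) int {
		return int(100 * (n.CapCPU - n.UsedCPU - req) / n.CapCPU)
	},
}

var profiles = map[string]Profile{
	"binpack": {"most-allocated": 1},
	"spread":  {"least-allocated": 1},
	"mixed":   {"most-allocated": 3, "least-allocated": 1},
}

// score returns the weighted average of the profile's scorers.
// Integer sums do not depend on map iteration order, so the result is stable.
func score(p Profile, n Node, req int64) int {
	total, weights := 0, 0
	for name, w := range p {
		fn, ok := scorers[name]
		if !ok {
			continue // unknown scorer names are ignored in this sketch
		}
		total += w * fn(n, req)
		weights += w
	}
	if weights == 0 {
		return 0
	}
	return total / weights
}

func main() {
	busy := Node{Name: "busy", CapCPU: 4000, UsedCPU: 2000}
	idle := Node{Name: "idle", CapCPU: 4000, UsedCPU: 0}
	for _, name := range []string{"binpack", "spread", "mixed"} {
		p := profiles[name]
		fmt.Printf("%-8s busy=%d idle=%d\n", name, score(p, busy, 1000), score(p, idle, 1000))
	}
}
```

Keeping profiles as data rather than code makes it cheap to add a third or fourth profile for comparison runs.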
5. Implementation Guide
5.3 The Core Question You’re Answering
“What scheduling policy gives the best operational outcome for this workload mix, and why?”
5.6 Milestones
- Filter-only prototype.
- Add scoring profiles.
- Add deterministic tie-breaks.
- Add failure-domain simulation.
5.9 Definition of Done
- Filter and score phases implemented.
- Unschedulable reasons are deterministic.
- At least 3 policy profiles compared.
- Utilization/resilience tradeoff report produced.