Project 5: Kubernetes Scheduler Simulator

Build a simulator that models filter/score scheduling decisions and explains why pods are placed or left pending.

Quick Reference

Attribute	Value
Difficulty	Expert
Time Estimate	12-20 hours
Main Programming Language	Go
Alternative Programming Languages	Python, Rust
Coolness Level	Level 4 - System Design Depth
Business Potential	3. Platform Brain
Prerequisites	resource requests, affinity, taints, topology concepts
Key Topics	filter/score pipeline, fairness, explainable scheduling

1. Learning Objectives

Model scheduler filter and score phases.
Compare policy profiles under realistic workload mixes.
Explain unschedulable results deterministically.
Quantify utilization vs resilience tradeoffs.

2. All Theory Needed (Per-Concept Breakdown)

2.1 Scheduling Pipeline

Fundamentals

Kubernetes scheduling first filters the cluster down to the nodes that are feasible for a pod, then scores those feasible candidates. Binding then assigns the pod to the chosen node.

Deep Dive into the concept

Hard constraints include resource availability, taints/tolerations, node selectors, and affinity requirements. Scoring handles preference optimization: spreading, bin-packing, locality, and custom priorities. A robust simulator separates these concerns and produces explainability outputs for each decision. Unschedulable diagnosis is as valuable as successful placement.
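The filter phase with explicit reason codes could be sketched as follows. This is a minimal illustration, not the Kubernetes implementation: the `Node` and `Pod` shapes, the `checkNode` and `Filter` functions, and the `cpu` and `selector-mismatch` codes are assumptions made for this example, taints are reduced to bare keys, and affinity is left out entirely.

```go
package main

import "sort"

// Node and Pod are hypothetical, simplified shapes for the simulator.
// They are not the Kubernetes API types.
type Node struct {
	Name       string
	FreeCPU    int64 // millicores still unallocated
	FreeMemory int64 // MiB still unallocated
	Labels     map[string]string
	Taints     []string // simplified: taint keys only, all treated as NoSchedule
}

type Pod struct {
	Name         string
	CPU          int64 // requested millicores
	Memory       int64 // requested MiB
	NodeSelector map[string]string
	Tolerations  map[string]bool // set of tolerated taint keys
}

// Reason codes (assumed names) attached to every rejected node.
const (
	ReasonCPU      = "cpu"
	ReasonMemory   = "memory"
	ReasonTaint    = "taint-mismatch"
	ReasonSelector = "selector-mismatch"
)

// checkNode returns the reason codes that make n infeasible for p.
// An empty result means the node passes every hard constraint.
func checkNode(p Pod, n Node) []string {
	var reasons []string
	if p.CPU > n.FreeCPU {
		reasons = append(reasons, ReasonCPU)
	}
	if p.Memory > n.FreeMemory {
		reasons = append(reasons, ReasonMemory)
	}
	for _, t := range n.Taints {
		if !p.Tolerations[t] {
			reasons = append(reasons, ReasonTaint)
			break
		}
	}
	for k, v := range p.NodeSelector {
		if n.Labels[k] != v {
			reasons = append(reasons, ReasonSelector)
			break
		}
	}
	return reasons
}

// Filter returns the feasible node names in sorted order, plus the
// rejection reasons for every node that failed a hard constraint.
func Filter(p Pod, nodes []Node) (feasible []string, rejected map[string][]string) {
	rejected = make(map[string][]string)
	for _, n := range nodes {
		if r := checkNode(p, n); len(r) > 0 {
			rejected[n.Name] = r
		} else {
			feasible = append(feasible, n.Name)
		}
	}
	sort.Strings(feasible)
	return feasible, rejected
}
```

Keeping the rejection map alongside the feasible set is what makes an unschedulable pod explainable: when `feasible` comes back empty, `rejected` already says why each node was ruled out.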

How this fits on projects

Core for P05; supports rollout planning in P10.

Definitions & key terms

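feasible set: the nodes that pass every hard constraint for a pod.
scoring weight: the multiplier a policy profile applies to one scoring preference.
unschedulable reason: the recorded cause for rejecting a node or leaving a pod pending.
binding: assigning the pod to the chosen node.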

Mental model diagram

pending pod -> filter feasible nodes -> score feasible nodes -> bind best node

How it works

Evaluate hard constraints.
Build feasible set.
Score with policy weights.
Pick deterministic winner and bind.

Invariants: a hard constraint is never violated, whatever the scores. Failure modes: opaque tie-break logic, scoring weights overfitted to one workload mix.
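The "score with policy weights" step can be illustrated with a plain weighted sum. This is a sketch under stated assumptions: the `Profile` type, the `WeightedScore` function, and the dimension names `spread` and `binpack` are hypothetical, and each per-dimension score is assumed to be an integer in the range 0..100.

```go
package main

// Profile maps a scoring dimension to its weight. The dimension names
// used with it are hypothetical and chosen by the simulator author.
type Profile map[string]int

// WeightedScore combines per-dimension scores (each assumed to lie in
// 0..100) into one node score. Dimensions missing from dims count as 0.
// Integer addition is order-independent, so iterating a map here does
// not affect determinism.
func WeightedScore(profile Profile, dims map[string]int) int {
	total := 0
	for name, weight := range profile {
		total += weight * dims[name]
	}
	return total
}
```

For a node with `spread=80` and `binpack=20`, a profile of `{"spread": 3, "binpack": 1}` yields 260, while `{"spread": 1, "binpack": 3}` yields 140. The same node ranks differently under each profile, which is the effect a profile comparison is meant to expose.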

Minimal concrete example

Pod A feasible nodes: n1,n3,n4
Scores: n1=74 n3=81 n4=81
Tie-break rule: lowest node name -> n3
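The tie-break in this example can be expressed as a two-key sort: highest score first, then lowest node name. The function name `PickWinner` is an assumption for this sketch, and "lowest" is interpreted as lexicographic string order.

```go
package main

import "sort"

// PickWinner returns the node with the highest score. Ties go to the
// lexicographically smallest node name. ok is false when scores is empty.
func PickWinner(scores map[string]int) (winner string, ok bool) {
	names := make([]string, 0, len(scores))
	for name := range scores {
		names = append(names, name)
	}
	// Sort by score descending, then by name ascending, so the result
	// does not depend on Go's randomized map iteration order.
	sort.Slice(names, func(i, j int) bool {
		a, b := names[i], names[j]
		if scores[a] != scores[b] {
			return scores[a] > scores[b]
		}
		return a < b
	})
	if len(names) == 0 {
		return "", false
	}
	return names[0], true
}
```

With the scores above (`n1=74`, `n3=81`, `n4=81`) this returns `n3`. Note that lexicographic order places `n10` before `n2`; a simulator with many nodes may prefer zero-padded names or a numeric comparison.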

Common misconceptions

“Scheduler only checks free CPU/memory.” -> many policy constraints apply.

Check-your-understanding questions

Why separate filter and score phases?
Why do deterministic tie-breakers matter?

Check-your-understanding answers

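Feasibility and optimization solve different problems: filtering is a yes/no check against hard constraints, while scoring ranks only the nodes that already passed, so a preference can never override a constraint.
Determinism makes results reproducible, so the same input can be replayed during incident analysis and yield the same placement.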

2.2 Capacity and Fairness Tradeoffs

Fundamentals

High utilization can conflict with fault tolerance and noisy-neighbor isolation.

Deep Dive into the concept

Aggressive bin packing improves cost but concentrates risk. Spread-oriented policies improve resilience but may waste headroom. A useful simulator reports both resource efficiency and disruption impact under node failure scenarios.
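One way to put numbers on this tradeoff is to compare how many nodes a placement occupies against how many pods the worst single-node failure would displace. The sketch below assumes a hypothetical `Placement` map from pod name to node name and treats every node as its own failure domain; a fuller simulator would also model zones and rescheduling capacity.

```go
package main

// Placement maps pod name to node name (a hypothetical shape).
type Placement map[string]string

// Tradeoff summarizes density against single-node failure impact.
type Tradeoff struct {
	NodesUsed          int     // nodes hosting at least one pod; fewer means denser packing
	WorstCaseDisplaced int     // pods displaced if the busiest node fails
	WorstCaseFraction  float64 // WorstCaseDisplaced divided by total pods
}

// Evaluate computes the tradeoff metrics for one placement.
func Evaluate(p Placement) Tradeoff {
	perNode := make(map[string]int)
	for _, node := range p {
		perNode[node]++
	}
	worst := 0
	for _, count := range perNode {
		if count > worst {
			worst = count
		}
	}
	t := Tradeoff{NodesUsed: len(perNode), WorstCaseDisplaced: worst}
	if len(p) > 0 {
		t.WorstCaseFraction = float64(worst) / float64(len(p))
	}
	return t
}
```

For four pods, a packed placement of three on `n1` and one on `n2` uses 2 nodes but loses 75% of pods if `n1` fails; spreading one pod per node uses 4 nodes and caps the loss at 25%.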

3. Project Specification

3.1 What You Will Build

A simulator that ingests node and pod datasets, applies scheduling policies, and outputs placement + diagnostics.

3.2 Functional Requirements

Parse node and pod constraints.
Execute filter/score decision flow.
Produce explainable placement reports.
Support multiple policy profiles.

3.3 Non-Functional Requirements

Performance: schedule 1,000 pods in under 10 seconds locally.
Reliability: deterministic output for the same input.
Usability: human-readable unschedulable report.

3.7 Real World Outcome

$ ./sched-lab simulate --nodes nodes.json --pods pods.json --policy spread
scheduled: 922
unschedulable: 78
top_reasons:
  - memory: 41
  - taint-mismatch: 21
  - affinity-conflict: 16
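A `top_reasons` summary like the one above could be produced by a tally with a stable sort order. In the sample, the counts sum to 78, which matches the unschedulable total, so this sketch assumes each pending pod is attributed exactly one primary reason. The `ReasonCount` type and `TopReasons` function are hypothetical names.

```go
package main

import "sort"

// ReasonCount pairs a reason code with the number of pods it blocked.
type ReasonCount struct {
	Reason string
	Count  int
}

// TopReasons tallies one primary reason per unschedulable pod and
// returns the counts sorted by count descending, then reason ascending,
// so the report is identical across runs.
func TopReasons(primary []string) []ReasonCount {
	counts := make(map[string]int)
	for _, reason := range primary {
		counts[reason]++
	}
	out := make([]ReasonCount, 0, len(counts))
	for reason, count := range counts {
		out = append(out, ReasonCount{Reason: reason, Count: count})
	}
	sort.Slice(out, func(i, j int) bool {
		if out[i].Count != out[j].Count {
			return out[i].Count > out[j].Count
		}
		return out[i].Reason < out[j].Reason
	})
	return out
}
```

The secondary sort on the reason name matters: without it, two reasons with equal counts could swap places between runs, which would break the deterministic-output requirement.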

4. Solution Architecture

4.1 High-Level Design

dataset loader -> filter engine -> score engine -> binder -> report module

4.2 Key Components

Component	Responsibility	Key Decisions
Loader	parse node/pod inputs	schema validation
Filter engine	hard constraints	explicit reason codes
Score engine	preference weights	policy profiles
Reporter	placement diagnostics	deterministic summaries

5. Implementation Guide

5.3 The Core Question You’re Answering

“What scheduling policy gives the best operational outcome for this workload mix, and why?”

5.6 Milestones

Filter-only prototype.
Add scoring profiles.
Add deterministic tie-breaks.
Add failure-domain simulation.

5.9 Definition of Done

Filter and score phases implemented.
Unschedulable reasons are deterministic.
At least 3 policy profiles compared.
Utilization/resilience tradeoff report produced.