Project 4: KV Cache “OOM” Simulator

Build a simulator that estimates KV cache memory usage and predicts out-of-memory conditions.

Quick Reference

Attribute	Value
Difficulty	Level 3: Advanced
Time Estimate	8-12 hours
Language	Python
Prerequisites	Transformer internals, memory math
Key Topics	KV cache sizing, memory profiling

1. Learning Objectives

By completing this project, you will:

Calculate KV cache memory usage by model size.
Simulate batch and sequence length impacts.
Predict OOM thresholds for given hardware.
Visualize memory growth curves.
Build a simple capacity planner.

2. Theoretical Foundation

2.1 KV Cache Memory

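KV cache memory grows linearly with batch size, sequence length, number of layers, number of KV heads, and head dimension. It also scales with the bytes per element of the cache dtype, and it carries a factor of 2 because keys and values are stored separately.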
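The scaling above can be written as a single product. This is a sketch that assumes a standard decoder-only transformer with a dense cache (no paging, no cache quantization). The symbol names are chosen here for illustration.

```latex
M_{\mathrm{KV}} = 2 \cdot L \cdot H_{\mathrm{kv}} \cdot d_{\mathrm{head}} \cdot S \cdot B \cdot b

% L        = number of layers
% H_kv     = number of key/value heads
% d_head   = dimension per head
% S        = sequence length in tokens
% B        = batch size
% b        = bytes per element (2 for fp16/bf16)

% Worked example (illustrative 7B-class dimensions):
% L = 32, H_kv = 32, d_head = 128, S = 4096, B = 1, b = 2
% M_KV = 2 * 32 * 32 * 128 * 4096 * 1 * 2 = 2,147,483,648 bytes = 2 GiB
% Per token: 2 * 32 * 32 * 128 * 2 = 524,288 bytes = 0.5 MiB
```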

3. Project Specification

3.1 What You Will Build

A simulator that takes model parameters, outputs the estimated KV cache memory usage, and warns when that estimate exceeds the available GPU memory.

3.2 Functional Requirements

Memory model for KV cache.
Inputs: number of layers, number of KV heads, head dimension, sequence length, batch size, and bytes per element (cache dtype).
Hardware profiles for GPU RAM sizes.
Visualization of memory curves.
Reports with OOM thresholds.

3.3 Non-Functional Requirements

Deterministic calculations.
Clear reporting with explicit units (MB/GB), applied consistently as either decimal or binary.
Configurable model presets.
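One possible way to meet the unit and preset requirements is sketched below. The `ModelPreset` class, the `PRESETS` table, and `format_bytes` are hypothetical names, and the preset dimensions are illustrative rather than taken from any specific model card. The sketch assumes binary units and labels them GiB/MiB so they cannot be confused with decimal GB/MB.

```python
from dataclasses import dataclass

MIB = 1024 ** 2
GIB = 1024 ** 3


@dataclass(frozen=True)
class ModelPreset:
    """Hypothetical preset record; field names are illustrative."""
    name: str
    num_layers: int
    num_kv_heads: int
    head_dim: int
    bytes_per_elem: int = 2  # assumes an fp16/bf16 cache


# Illustrative dimensions only; check them against the real model config.
PRESETS = {
    "example-7b": ModelPreset("example-7b", 32, 32, 128),
    "example-gqa": ModelPreset("example-gqa", 32, 8, 128),
}


def format_bytes(n: int) -> str:
    """Render a byte count with an explicit binary unit."""
    if n >= GIB:
        return f"{n / GIB:.2f} GiB"
    return f"{n / MIB:.2f} MiB"
```

Keeping all conversion in one function means every report line goes through the same rounding and the same unit label, which also keeps the output deterministic.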

4. Solution Architecture

4.1 Components

Component	Responsibility
Memory Model	Compute KV cache usage
Simulator	Run scenarios
Visualizer	Plot memory curves
Reporter	Summarize thresholds

5. Implementation Guide

5.1 Project Structure

QUANTIZATION_DISTILLATION_INFERENCE_OPTIMIZATION_MASTERY/P04-kv-oom/
├── src/
│   ├── model.py
│   ├── simulate.py
│   ├── plot.py
│   └── report.py

5.2 Implementation Phases

Phase 1: Memory model (3-4h)

Compute cache size formulas.
Checkpoint: output matches hand-computed values for known configurations.
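A minimal sketch of the Phase 1 memory model follows. The function name `kv_cache_bytes` and its parameter names are assumptions for illustration. It models a dense cache only and ignores allocator overhead and fragmentation.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim,
                   seq_len, batch_size, bytes_per_elem=2):
    """Estimate dense KV cache size in bytes.

    The leading 2 accounts for storing keys and values separately.
    """
    factors = (num_layers, num_kv_heads, head_dim,
               seq_len, batch_size, bytes_per_elem)
    if any(f <= 0 for f in factors):
        raise ValueError("all inputs must be positive")
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_elem)


# Checkpoint against a hand-computed value:
# 2 * 32 * 32 * 128 * 4096 * 1 * 2 bytes = 2 GiB.
actual = kv_cache_bytes(32, 32, 128, seq_len=4096, batch_size=1)
assert actual == 2 * 1024 ** 3
print(actual / 1024 ** 3)  # prints 2.0
```

Working in integer bytes until the final report avoids floating-point drift and keeps the checkpoint comparison exact.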

Phase 2: Simulation (3-4h)

Run batch/seq sweeps.
Checkpoint: thresholds identified.
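The sweep in Phase 2 could look like the sketch below. All function names are hypothetical. The KV budget is assumed to be total GPU memory minus a fixed reservation for weights and runtime overhead. The 24 GiB and 14 GiB figures are placeholders, not measurements.

```python
GIB = 1024 ** 3


def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_elem=2):
    """Bytes of KV cache added per token per sequence."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem


def kv_budget_bytes(gpu_mem_gib, reserved_gib):
    """Memory left for the KV cache after an assumed fixed reservation."""
    return (gpu_mem_gib - reserved_gib) * GIB


def max_batch_size(budget_bytes, per_token_bytes, seq_len):
    """Largest batch whose cache fits in the budget at this sequence length."""
    return budget_bytes // (per_token_bytes * seq_len)


def sweep(budget_bytes, per_token_bytes, seq_lens, batch_sizes):
    """Return (seq_len, batch, bytes_used, is_oom) for every combination."""
    rows = []
    for s in seq_lens:
        for b in batch_sizes:
            used = per_token_bytes * s * b
            rows.append((s, b, used, used > budget_bytes))
    return rows


per_token = kv_bytes_per_token(32, 32, 128)               # 524288 bytes
budget = kv_budget_bytes(gpu_mem_gib=24, reserved_gib=14)  # assumed 10 GiB
for s, b, used, oom in sweep(budget, per_token, [2048, 4096], [1, 4, 8]):
    flag = "OOM" if oom else "ok"
    print(f"seq={s} batch={b} {used / GIB:.2f} GiB {flag}")
```

With these placeholder numbers, only the 4096-token, batch-8 scenario exceeds the budget, and `max_batch_size` gives the threshold directly without scanning.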

Phase 3: Visualization (2-4h)

Plot memory usage curves.
Checkpoint: charts show OOM points.
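For Phase 3, one option is to keep curve computation separate from drawing, so the numbers can be tested without a display. The sketch below assumes matplotlib is installed and uses hypothetical function names. The plotting function imports matplotlib lazily and writes a PNG with the non-interactive Agg backend.

```python
GIB = 1024 ** 3


def memory_curve(per_token_bytes, batch_size, seq_lens):
    """KV cache size in GiB at each sequence length."""
    return [per_token_bytes * batch_size * s / GIB for s in seq_lens]


def oom_seq_len(per_token_bytes, batch_size, budget_bytes):
    """First sequence length (in tokens) whose cache exceeds the budget."""
    return budget_bytes // (per_token_bytes * batch_size) + 1


def plot_curves(per_token_bytes, batch_sizes, seq_lens, budget_bytes,
                path="kv_memory.png"):
    """Plot one curve per batch size plus a horizontal budget line."""
    import matplotlib
    matplotlib.use("Agg")  # render to file; no display needed
    import matplotlib.pyplot as plt

    fig, ax = plt.subplots()
    for b in batch_sizes:
        ax.plot(seq_lens, memory_curve(per_token_bytes, b, seq_lens),
                label=f"batch={b}")
    ax.axhline(budget_bytes / GIB, color="red", linestyle="--",
               label="KV budget")
    ax.set_xlabel("Sequence length (tokens)")
    ax.set_ylabel("KV cache (GiB)")
    ax.legend()
    fig.savefig(path)
    plt.close(fig)
    return path
```

A call such as `plot_curves(524288, [1, 4, 8], list(range(512, 8193, 512)), 10 * GIB)` would draw three straight lines. Each line crosses the dashed budget line at the OOM point for that batch size.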

6. Testing Strategy

6.1 Test Categories

Category	Purpose	Examples
Unit	memory calc	known values
Integration	simulation	sweep outputs
Regression	report	stable thresholds

6.2 Critical Test Cases

Memory scales linearly with batch size.
OOM threshold computed correctly.
Reports show GB values accurately.
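The three cases above might be expressed as pytest-style functions like the following. This is a sketch with hypothetical helper names, and the model dimensions are illustrative. The helpers are redefined compactly so the file runs on its own.

```python
GIB = 1024 ** 3


def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, batch, bytes_per_elem=2):
    # Factor of 2 covers keys and values.
    return 2 * layers * kv_heads * head_dim * seq_len * batch * bytes_per_elem


def max_seq_len(budget_bytes, layers, kv_heads, head_dim, batch, bytes_per_elem=2):
    # Bytes added per sequence position across the whole batch.
    per_position = kv_cache_bytes(layers, kv_heads, head_dim, 1, batch, bytes_per_elem)
    return budget_bytes // per_position


def test_linear_in_batch():
    base = kv_cache_bytes(32, 32, 128, 4096, 1)
    for batch in (2, 4, 16):
        assert kv_cache_bytes(32, 32, 128, 4096, batch) == batch * base


def test_threshold_boundary():
    budget = 10 * GIB
    limit = max_seq_len(budget, 32, 32, 128, batch=4)
    # The last fitting length stays within budget; one more token exceeds it.
    assert kv_cache_bytes(32, 32, 128, limit, 4) <= budget
    assert kv_cache_bytes(32, 32, 128, limit + 1, 4) > budget


def test_gib_value():
    assert kv_cache_bytes(32, 32, 128, 4096, 1) / GIB == 2.0
```

Testing both sides of the threshold catches off-by-one errors that a single equality check would miss.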

7. Common Pitfalls & Debugging

Pitfall	Symptom	Fix
Wrong units	misleading output	convert from bytes in one place and label every value (MB/GB)
Missing factors	underestimates	include the K/V factor of 2, layers, KV heads, head dim, and bytes per element
Overcounting	inflated memory	verify formula components

8. Extensions & Challenges

Beginner

Add preset models (Llama, GPT).
Add CSV export.

Intermediate

Add activation memory estimates.
Add interactive sliders.

Advanced

Add multi-GPU sharding estimates.
Compare with real profiling data.

9. Real-World Connections

Inference sizing depends on KV cache planning.
Serving systems use these estimates for capacity planning, such as choosing a maximum batch size and context length per GPU.

10. Resources

LLM memory estimation guides
KV cache profiling references

11. Self-Assessment Checklist

I can compute KV cache memory.
I can predict OOM thresholds.
I can visualize memory scaling.

12. Submission / Completion Criteria

Minimum Completion:

KV cache memory simulator

Full Completion:

OOM thresholds + charts

Excellence:

Multi-GPU estimates
Comparison with real profiling

This guide was generated from project_based_ideas/AI_AGENTS_LLM_RAG/QUANTIZATION_DISTILLATION_INFERENCE_OPTIMIZATION_MASTERY.md.