Project 4: KV Cache “OOM” Simulator
Build a simulator that estimates KV cache memory usage and predicts out-of-memory conditions.
Quick Reference
| Attribute | Value |
|---|---|
| Difficulty | Level 3: Advanced |
| Time Estimate | 8-12 hours |
| Language | Python |
| Prerequisites | Transformer internals, memory math |
| Key Topics | KV cache sizing, memory profiling |
1. Learning Objectives
By completing this project, you will:
- Calculate KV cache memory usage by model size.
- Simulate batch and sequence length impacts.
- Predict OOM thresholds for given hardware.
- Visualize memory growth curves.
- Build a simple capacity planner.
2. Theoretical Foundation
2.1 KV Cache Memory
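The KV cache stores one key vector and one value vector per token, per layer, per KV head. Its size therefore grows linearly with batch size, sequence length, number of layers, number of KV heads, head dimension, and the number of bytes used per element.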
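The relationship above can be written as a short function. This is a minimal sketch, not a definitive memory model: the name `kv_cache_bytes`, its parameter names, and the default of 2 bytes per element (fp16/bf16) are assumptions made for illustration. It ignores allocator overhead and fragmentation.

```python
def kv_cache_bytes(
    num_layers: int,
    num_kv_heads: int,
    head_dim: int,
    seq_len: int,
    batch_size: int,
    bytes_per_elem: int = 2,  # assumption: fp16/bf16 cache entries
) -> int:
    """Estimate KV cache size in bytes (hypothetical helper)."""
    # The leading 2 accounts for storing both keys and values.
    return (
        2 * num_layers * num_kv_heads * head_dim
        * seq_len * batch_size * bytes_per_elem
    )


# Illustrative configuration: 32 layers, 32 KV heads, head dimension 128.
per_token = kv_cache_bytes(32, 32, 128, seq_len=1, batch_size=1)
full = kv_cache_bytes(32, 32, 128, seq_len=4096, batch_size=1)
print(per_token)        # 524288 bytes, i.e. 0.5 MiB per token
print(full / 1024**3)   # 2.0 (GiB) for one 4096-token sequence
```

Every factor enters as a plain product, so doubling any single input doubles the estimate.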
3. Project Specification
3.1 What You Will Build
A simulator that takes model parameters, reports the estimated KV cache memory usage, and warns when the estimate exceeds the available GPU memory.
3.2 Functional Requirements
- Memory model for KV cache.
- Inputs: layers, heads, head dimension, sequence length, batch size, and bytes per element.
- Hardware profiles for GPU RAM sizes.
- Visualization of memory curves.
- Reports with OOM thresholds.
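One way the hardware profiles and OOM warning could fit together is sketched below. The profile labels (`gpu-24g` and so on), the `check_fit` function, and the `reserved_bytes` parameter (memory already used by weights, activations, and runtime overhead) are all hypothetical names chosen for this example, not part of any library.

```python
GIB = 1024**3

# Hypothetical hardware profiles: label -> total GPU memory in bytes.
HARDWARE_PROFILES = {
    "gpu-24g": 24 * GIB,
    "gpu-40g": 40 * GIB,
    "gpu-80g": 80 * GIB,
}


def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, batch_size, bytes_per_elem=2):
    # Factor 2: keys and values are both cached.
    return 2 * layers * kv_heads * head_dim * seq_len * batch_size * bytes_per_elem


def check_fit(cache_bytes: int, profile: str, reserved_bytes: int = 0) -> dict:
    """Compare a cache estimate with the memory left on a profile."""
    available = HARDWARE_PROFILES[profile] - reserved_bytes
    return {
        "profile": profile,
        "cache_gib": cache_bytes / GIB,
        "available_gib": available / GIB,
        "oom": cache_bytes > available,
    }


estimate = kv_cache_bytes(32, 32, 128, seq_len=4096, batch_size=16)
result = check_fit(estimate, "gpu-24g")
if result["oom"]:
    print(
        f"WARNING: KV cache needs {result['cache_gib']:.1f} GiB, "
        f"only {result['available_gib']:.1f} GiB available on {result['profile']}"
    )
```

Keeping the profiles in a plain dictionary makes it easy to add or override entries from a config file later.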
3.3 Non-Functional Requirements
- Deterministic calculations.
- Clear reporting with units (MB/GB).
- Configurable model presets.
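For the unit-reporting requirement, a small formatter keeps every conversion in one place. The sketch below is an assumption about how this might be done: `format_bytes` is a hypothetical helper, and it defaults to binary units (1 GiB = 1024³ bytes) with an option for decimal units (1 GB = 1000³ bytes), since mixing the two is a common source of confusion.

```python
def format_bytes(num_bytes: int, binary: bool = True) -> str:
    """Render a byte count with an explicit unit label (hypothetical helper)."""
    base = 1024 if binary else 1000
    units = (
        ["B", "KiB", "MiB", "GiB", "TiB"]
        if binary
        else ["B", "kB", "MB", "GB", "TB"]
    )
    value = float(num_bytes)
    for unit in units:
        # Stop at the first unit where the value drops below the base,
        # or at the largest unit available.
        if value < base or unit == units[-1]:
            return f"{value:.2f} {unit}"
        value /= base


print(format_bytes(2 * 1024**3))                # 2.00 GiB
print(format_bytes(2 * 1024**3, binary=False))  # 2.15 GB
```

The same byte count reads as 2.00 GiB or 2.15 GB, which is why the label should always accompany the number.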
4. Solution Architecture
4.1 Components
| Component | Responsibility |
|---|---|
| Memory Model | Compute KV cache usage |
| Simulator | Run scenarios |
| Visualizer | Plot memory curves |
| Reporter | Summarize thresholds |
5. Implementation Guide
5.1 Project Structure
QUANTIZATION_DISTILLATION_INFERENCE_OPTIMIZATION_MASTERY/P04-kv-oom/
├── src/
│ ├── model.py
│ ├── simulate.py
│ ├── plot.py
│ └── report.py
5.2 Implementation Phases
Phase 1: Memory model (3-4h)
- Compute cache size formulas.
- Checkpoint: matches known examples.
Phase 2: Simulation (3-4h)
- Run batch/seq sweeps.
- Checkpoint: thresholds identified.
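A sweep for this phase could look like the sketch below. Because the cache is linear in batch size, the largest batch that fits is the memory budget divided by the per-sequence cost, rounded down. The names `max_batch_size` and `sweep`, and the 24 GiB budget, are illustrative assumptions.

```python
GIB = 1024**3


def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, batch_size, bytes_per_elem=2):
    # Factor 2: keys and values are both cached.
    return 2 * layers * kv_heads * head_dim * seq_len * batch_size * bytes_per_elem


def max_batch_size(budget_bytes, layers, kv_heads, head_dim, seq_len, bytes_per_elem=2):
    """Largest batch whose KV cache fits in budget_bytes (0 if none fits)."""
    per_sequence = kv_cache_bytes(layers, kv_heads, head_dim, seq_len, 1, bytes_per_elem)
    return budget_bytes // per_sequence


def sweep(budget_bytes, layers, kv_heads, head_dim, seq_lens):
    """Map each sequence length to its OOM threshold in batch size."""
    return {
        s: max_batch_size(budget_bytes, layers, kv_heads, head_dim, s)
        for s in seq_lens
    }


thresholds = sweep(24 * GIB, 32, 32, 128, [1024, 2048, 4096, 8192])
for seq_len, batch in thresholds.items():
    print(f"seq_len={seq_len:5d}  max batch={batch}")
```

With this illustrative configuration the threshold halves each time the sequence length doubles (48, 24, 12, 6).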
Phase 3: Visualization (2-4h)
- Plot memory usage curves.
- Checkpoint: charts show OOM points.
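The plotting phase might be sketched as follows, assuming matplotlib is installed. `memory_curve` and `plot_curves` are hypothetical names, and the default model dimensions are illustrative. The dashed horizontal line marks the capacity, so the OOM point for each batch size is where its curve crosses that line.

```python
GIB = 1024**3


def kv_cache_gib(layers, kv_heads, head_dim, seq_len, batch_size, bytes_per_elem=2):
    # Factor 2: keys and values are both cached.
    return 2 * layers * kv_heads * head_dim * seq_len * batch_size * bytes_per_elem / GIB


def memory_curve(seq_lens, batch_size, layers=32, kv_heads=32, head_dim=128):
    """KV cache size in GiB at each sequence length for one batch size."""
    return [kv_cache_gib(layers, kv_heads, head_dim, s, batch_size) for s in seq_lens]


def plot_curves(seq_lens, batch_sizes, capacity_gib, path="kv_cache.png"):
    """Save one curve per batch size plus a capacity line to an image file."""
    # Imported lazily so the numeric helpers work without matplotlib.
    import matplotlib
    matplotlib.use("Agg")  # non-interactive backend, writes files only
    import matplotlib.pyplot as plt

    fig, ax = plt.subplots()
    for b in batch_sizes:
        ax.plot(seq_lens, memory_curve(seq_lens, b), marker="o", label=f"batch={b}")
    ax.axhline(capacity_gib, color="red", linestyle="--",
               label=f"capacity ({capacity_gib} GiB)")
    ax.set_xlabel("Sequence length (tokens)")
    ax.set_ylabel("KV cache (GiB)")
    ax.legend()
    fig.savefig(path)
    plt.close(fig)
```

A call such as `plot_curves([1024, 2048, 4096, 8192], [1, 4, 16], capacity_gib=24)` would write `kv_cache.png` with three curves and the capacity line.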
6. Testing Strategy
6.1 Test Categories
| Category | Purpose | Examples |
|---|---|---|
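| Unit | Verify the memory calculation | Known reference values |
| Integration | Verify the simulation end to end | Batch/sequence sweep outputs |
| Regression | Keep reports stable across changes | Unchanged OOM thresholds |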
6.2 Critical Test Cases
- Memory scales linearly with batch size.
- OOM threshold computed correctly.
- Reports show GB values accurately.
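The three cases above could be expressed as plain assertion-based tests, runnable directly or under a test runner such as pytest. This is a sketch under assumptions: the helper `kv_cache_bytes`, the test names, the model dimensions, and the 24 GiB budget are all illustrative.

```python
GIB = 1024**3


def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, batch_size, bytes_per_elem=2):
    # Factor 2: keys and values are both cached.
    return 2 * layers * kv_heads * head_dim * seq_len * batch_size * bytes_per_elem


def test_linear_in_batch_size():
    base = kv_cache_bytes(32, 32, 128, 2048, 1)
    for b in (2, 4, 8):
        assert kv_cache_bytes(32, 32, 128, 2048, b) == b * base


def test_oom_threshold_is_tight():
    budget = 24 * GIB
    per_seq = kv_cache_bytes(32, 32, 128, 4096, 1)
    threshold = budget // per_seq
    # The threshold batch fits; one more sequence does not.
    assert threshold * per_seq <= budget
    assert (threshold + 1) * per_seq > budget


def test_gib_reporting():
    # 32 layers x 32 heads x 128 dims x 4096 tokens in fp16 is exactly 2 GiB.
    assert kv_cache_bytes(32, 32, 128, 4096, 1) / GIB == 2.0
```

Checking both sides of the threshold guards against off-by-one errors in the rounding.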
7. Common Pitfalls & Debugging
| Pitfall | Symptom | Fix |
|---|---|---|
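| Wrong units | misleading output | convert from bytes in one place and label units explicitly (MB vs. MiB, GB vs. GiB) |
| Missing factors | underestimates | include layer count, head count, head dimension, and the factor of 2 for keys and values |
| Overcounting | inflated memory | verify each formula component against the model configuration |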
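The missing-factor and overcounting pitfalls can be made concrete with a small comparison. The example assumes a hypothetical model with 32 query heads but only 8 KV heads (as in grouped-query attention); the numbers are illustrative, not taken from a specific model.

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, batch_size, bytes_per_elem=2):
    # Factor 2: keys and values are both cached.
    return 2 * layers * kv_heads * head_dim * seq_len * batch_size * bytes_per_elem


# Correct: size the cache by the number of KV heads (assumed to be 8 here).
correct = kv_cache_bytes(32, 8, 128, seq_len=4096, batch_size=1)

# Overcounting: plugging in the 32 query heads instead of the 8 KV heads.
overcount = kv_cache_bytes(32, 32, 128, seq_len=4096, batch_size=1)

# Missing factor: forgetting that both keys and values are stored.
undercount = correct // 2

print(overcount / correct)   # 4.0
print(undercount / correct)  # 0.5
```

Either mistake shifts the predicted OOM threshold by the same ratio, so it is worth checking each factor against the model's configuration.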
8. Extensions & Challenges
Beginner
- Add preset models (Llama, GPT).
- Add CSV export.
Intermediate
- Add activation memory estimates.
- Add interactive sliders.
Advanced
- Add multi-GPU sharding estimates.
- Compare with real profiling data.
9. Real-World Connections
- Inference sizing depends on KV cache planning.
- Serving systems use these estimates for capacity.
10. Resources
- LLM memory estimation guides
- KV cache profiling references
11. Self-Assessment Checklist
- I can compute KV cache memory.
- I can predict OOM thresholds.
- I can visualize memory scaling.
12. Submission / Completion Criteria
Minimum Completion:
- KV cache memory simulator
Full Completion:
- OOM thresholds + charts
Excellence:
- Multi-GPU estimates
- Comparison with real profiling
This guide was generated from project_based_ideas/AI_AGENTS_LLM_RAG/QUANTIZATION_DISTILLATION_INFERENCE_OPTIMIZATION_MASTERY.md.