Project 4: KV Cache “OOM” Simulator

Build a simulator that estimates KV cache memory usage and predicts out-of-memory conditions.

Quick Reference

Attribute       Value
Difficulty      Level 3: Advanced
Time Estimate   8-12 hours
Language        Python
Prerequisites   Transformer internals, memory math
Key Topics      KV cache sizing, memory profiling

1. Learning Objectives

By completing this project, you will:

  1. Calculate KV cache memory usage from a model's architecture (layers, heads, head dimension, precision).
  2. Simulate batch and sequence length impacts.
  3. Predict OOM thresholds for given hardware.
  4. Visualize memory growth curves.
  5. Build a simple capacity planner.

2. Theoretical Foundation

2.1 KV Cache Memory

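The KV cache stores one key vector and one value vector per token, per attention head, per layer. Its size therefore grows linearly with batch size, sequence length, number of layers, number of heads, head dimension, and the number of bytes used per element.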
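
The relationship above can be written as bytes = 2 × layers × heads × head_dim × seq_len × batch_size × bytes_per_element, where the factor 2 counts keys and values. The following is a minimal sketch of that formula, not a definitive implementation. The function name `kv_cache_bytes`, its parameter names, and the example model shape are illustrative assumptions. It also assumes every layer caches the same number of key/value heads and that the cache is stored in 2-byte elements (fp16/bf16).

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, batch_size, bytes_per_elem=2):
    """Estimate KV cache size in bytes.

    The leading 2 accounts for storing both keys and values.
    bytes_per_elem=2 assumes an fp16/bf16 cache (an assumption, not a requirement).
    """
    return 2 * layers * kv_heads * head_dim * seq_len * batch_size * bytes_per_elem


# Hypothetical model shape: 32 layers, 32 heads, head dimension 128.
per_token = kv_cache_bytes(32, 32, 128, seq_len=1, batch_size=1)
full = kv_cache_bytes(32, 32, 128, seq_len=4096, batch_size=1)

print(per_token)     # 524288 bytes, i.e. 0.5 MiB per token
print(full / 2**30)  # 2.0 (GiB) for one 4096-token sequence
```

Because every factor enters as a plain product, doubling any single input doubles the estimate, which makes the formula easy to sanity-check by hand.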


3. Project Specification

3.1 What You Will Build

A simulator that takes model parameters and outputs estimated KV cache memory usage, with warnings when the estimate exceeds the available GPU memory.

3.2 Functional Requirements

  1. Memory model for KV cache.
  2. Inputs: layers, heads, head dimension, sequence length, batch size, and bytes per element.
  3. Hardware profiles for GPU RAM sizes.
  4. Visualization of memory curves.
  5. Reports with OOM thresholds.
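
One way to connect hardware profiles to an OOM threshold is sketched below. It is an illustration under stated assumptions: the profile names and sizes in `GPU_PROFILES`, the `reserved_bytes` figure standing in for weights and other non-cache memory, and the helper `max_batch_size` are all hypothetical. The threshold is taken to be the largest batch whose KV cache still fits in the memory left after the reservation.

```python
GIB = 2**30

# Hypothetical hardware profiles: total GPU memory in bytes.
GPU_PROFILES = {"gpu-16gib": 16 * GIB, "gpu-24gib": 24 * GIB, "gpu-80gib": 80 * GIB}


def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, batch_size, bytes_per_elem=2):
    """KV cache size in bytes; the factor 2 covers keys and values."""
    return 2 * layers * kv_heads * head_dim * seq_len * batch_size * bytes_per_elem


def max_batch_size(gpu_bytes, reserved_bytes, layers, kv_heads, head_dim,
                   seq_len, bytes_per_elem=2):
    """Largest batch whose KV cache fits in the memory left after reserved_bytes."""
    free = gpu_bytes - reserved_bytes
    per_seq = kv_cache_bytes(layers, kv_heads, head_dim, seq_len, 1, bytes_per_elem)
    return max(free // per_seq, 0)


# Assumed scenario: 14 GiB reserved for weights, 4096-token sequences,
# and a hypothetical 32-layer, 32-head, head_dim=128 model.
for name, total in GPU_PROFILES.items():
    limit = max_batch_size(total, 14 * GIB, 32, 32, 128, seq_len=4096)
    print(f"{name}: max batch before OOM = {limit}")
# gpu-16gib: max batch before OOM = 1
# gpu-24gib: max batch before OOM = 5
# gpu-80gib: max batch before OOM = 33
```

Integer division keeps the result deterministic, and clamping at zero covers profiles where the reservation alone already exceeds the GPU's memory.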

3.3 Non-Functional Requirements

  • Deterministic calculations.
  • Clear reporting with units (MB/GB).
  • Configurable model presets.
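
For reporting with clear units, a small formatter helps keep binary units (MiB/GiB, powers of 1024) apart from decimal units (MB/GB, powers of 1000). The sketch below is one possible approach; the name `format_bytes` and its `binary` flag are assumptions for illustration.

```python
def format_bytes(n, binary=True):
    """Render a byte count with an explicit unit.

    binary=True  uses powers of 1024 (KiB, MiB, GiB).
    binary=False uses powers of 1000 (kB, MB, GB).
    """
    base = 1024 if binary else 1000
    units = ["B", "KiB", "MiB", "GiB", "TiB"] if binary else ["B", "kB", "MB", "GB", "TB"]
    value = float(n)
    for unit in units:
        # Stop at the first unit where the value is below the base,
        # or at the largest unit available.
        if value < base or unit == units[-1]:
            return f"{value:.2f} {unit}"
        value /= base


print(format_bytes(2**31))                # 2.00 GiB
print(format_bytes(2**31, binary=False))  # 2.15 GB
```

The same byte count reads as 2.00 GiB or 2.15 GB, so a report should state which convention it uses.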

4. Solution Architecture

4.1 Components

Component      Responsibility
Memory Model   Compute KV cache usage
Simulator      Run scenarios
Visualizer     Plot memory curves
Reporter       Summarize thresholds

5. Implementation Guide

5.1 Project Structure

QUANTIZATION_DISTILLATION_INFERENCE_OPTIMIZATION_MASTERY/P04-kv-oom/
├── src/
│   ├── model.py
│   ├── simulate.py
│   ├── plot.py
│   └── report.py

5.2 Implementation Phases

Phase 1: Memory model (3-4h)

  • Compute cache size formulas.
  • Checkpoint: matches known examples.

Phase 2: Simulation (3-4h)

  • Run batch/seq sweeps.
  • Checkpoint: thresholds identified.
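
A batch/sequence sweep for this phase might look like the sketch below. The `sweep` function, the row layout, the model dictionary, and the 10 GiB budget are hypothetical choices for illustration, and the cache is assumed to use 2-byte elements.

```python
from itertools import product


def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, batch_size, bytes_per_elem=2):
    """KV cache size in bytes; the factor 2 covers keys and values."""
    return 2 * layers * kv_heads * head_dim * seq_len * batch_size * bytes_per_elem


def sweep(model, batch_sizes, seq_lens, budget_bytes):
    """Return one row per (batch, seq_len) pair with bytes used and an OOM flag."""
    rows = []
    for batch, seq in product(batch_sizes, seq_lens):
        used = kv_cache_bytes(batch_size=batch, seq_len=seq, **model)
        rows.append({"batch": batch, "seq_len": seq,
                     "bytes": used, "oom": used > budget_bytes})
    return rows


# Hypothetical model preset and memory budget (10 GiB available for the cache).
model = {"layers": 32, "kv_heads": 32, "head_dim": 128}
rows = sweep(model, batch_sizes=[1, 4, 8], seq_lens=[2048, 4096], budget_bytes=10 * 2**30)

for row in rows:
    status = "OOM" if row["oom"] else "ok"
    print(row["batch"], row["seq_len"], row["bytes"] / 2**30, status)
```

In this configuration only the largest scenario (batch 8 at 4096 tokens, 16 GiB) exceeds the budget, so the threshold lies between 8 GiB and 16 GiB of cache.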

Phase 3: Visualization (2-4h)

  • Plot memory usage curves.
  • Checkpoint: charts show OOM points.
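
The visualization step could be sketched as follows, assuming matplotlib is installed. The helper names (`curve_points`, `first_oom`, `plot_curves`), the output path, and the model shape are hypothetical. Matplotlib is imported inside `plot_curves` so the data helpers can be used and tested without it.

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, batch_size, bytes_per_elem=2):
    """KV cache size in bytes; the factor 2 covers keys and values."""
    return 2 * layers * kv_heads * head_dim * seq_len * batch_size * bytes_per_elem


def curve_points(model, batch_size, seq_lens):
    """KV cache size in GiB for each sequence length at a fixed batch size."""
    return [kv_cache_bytes(seq_len=s, batch_size=batch_size, **model) / 2**30
            for s in seq_lens]


def first_oom(seq_lens, gib_values, budget_gib):
    """First sequence length whose cache exceeds the budget, or None."""
    for s, g in zip(seq_lens, gib_values):
        if g > budget_gib:
            return s
    return None


def plot_curves(model, batch_sizes, seq_lens, budget_gib, path="kv_cache.png"):
    """Save a chart with one curve per batch size and a horizontal budget line."""
    import matplotlib
    matplotlib.use("Agg")  # render to a file without needing a display
    import matplotlib.pyplot as plt

    fig, ax = plt.subplots()
    for b in batch_sizes:
        ax.plot(seq_lens, curve_points(model, b, seq_lens), label=f"batch={b}")
    ax.axhline(budget_gib, color="red", linestyle="--", label="memory budget")
    ax.set_xlabel("Sequence length (tokens)")
    ax.set_ylabel("KV cache (GiB)")
    ax.legend()
    fig.savefig(path)
    plt.close(fig)


# Hypothetical model shape and sweep range.
model = {"layers": 32, "kv_heads": 32, "head_dim": 128}
seq_lens = [1024, 2048, 4096, 8192]
points = curve_points(model, batch_size=4, seq_lens=seq_lens)
print(points)                                      # [2.0, 4.0, 8.0, 16.0]
print(first_oom(seq_lens, points, budget_gib=10))  # 8192
```

Calling `plot_curves(model, [1, 4, 8], seq_lens, budget_gib=10)` would write the chart to `kv_cache.png`; the point where a curve crosses the dashed line is the OOM point for that batch size.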

6. Testing Strategy

6.1 Test Categories

Category      Purpose       Examples
Unit          memory calc   known values
Integration   simulation    sweep outputs
Regression    report        stable thresholds

6.2 Critical Test Cases

  1. Memory scales linearly with batch size.
  2. OOM threshold computed correctly.
  3. Reports show GB values accurately.
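
The three cases above could be expressed as plain-assert tests in the pytest style, as in the sketch below. The function under test, the helper `max_batch_size`, the test names, and the model shape are hypothetical stand-ins for whatever the project's own modules expose.

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, batch_size, bytes_per_elem=2):
    """KV cache size in bytes; the factor 2 covers keys and values."""
    return 2 * layers * kv_heads * head_dim * seq_len * batch_size * bytes_per_elem


def max_batch_size(budget_bytes, per_seq_bytes):
    """Largest batch whose total cache stays within budget_bytes."""
    return budget_bytes // per_seq_bytes


def test_linear_in_batch():
    one = kv_cache_bytes(32, 32, 128, 4096, 1)
    assert kv_cache_bytes(32, 32, 128, 4096, 8) == 8 * one


def test_oom_threshold():
    per_seq = kv_cache_bytes(32, 32, 128, 4096, 1)  # 2 GiB per sequence
    budget = 10 * 2**30
    n = max_batch_size(budget, per_seq)
    assert n == 5
    # The threshold is tight: n fits, n + 1 does not.
    assert n * per_seq <= budget < (n + 1) * per_seq


def test_gib_value():
    assert kv_cache_bytes(32, 32, 128, 4096, 1) / 2**30 == 2.0


if __name__ == "__main__":
    test_linear_in_batch()
    test_oom_threshold()
    test_gib_value()
    print("all checks passed")
```

Using exact integer comparisons works here because the formula involves only integer multiplication, which also supports the requirement that calculations be deterministic.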

7. Common Pitfalls & Debugging

Pitfall           Symptom             Fix
Wrong units       misleading output   enforce MB/GB conversion
Missing factors   underestimates      include layer/head dims
Overcounting      inflated memory     verify formula components

8. Extensions & Challenges

Beginner

  • Add preset models (Llama, GPT).
  • Add CSV export.

Intermediate

  • Add activation memory estimates.
  • Add interactive sliders.

Advanced

  • Add multi-GPU sharding estimates.
  • Compare with real profiling data.

9. Real-World Connections

  • Sizing an inference deployment depends on KV cache planning, because the cache limits how large batches and contexts can grow.
  • Serving systems use these estimates for capacity planning.

10. Resources

  • LLM memory estimation guides
  • KV cache profiling references

11. Self-Assessment Checklist

  • I can compute KV cache memory.
  • I can predict OOM thresholds.
  • I can visualize memory scaling.

12. Submission / Completion Criteria

Minimum Completion:

  • KV cache memory simulator

Full Completion:

  • OOM thresholds + charts

Excellence:

  • Multi-GPU estimates
  • Comparison with real profiling

This guide was generated from project_based_ideas/AI_AGENTS_LLM_RAG/QUANTIZATION_DISTILLATION_INFERENCE_OPTIMIZATION_MASTERY.md.