Project 5: Speculative Decoding Simulator

Build a simulator that models speculative decoding speedups and acceptance rates.

Quick Reference

Attribute       Value
Difficulty      Level 4: Expert
Time Estimate   10-16 hours
Language        Python
Prerequisites   Decoding basics, probability
Key Topics      Speculative decoding, acceptance rates, latency

1. Learning Objectives

By completing this project, you will:

  1. Simulate draft and target model decoding.
  2. Compute acceptance rates for proposals.
  3. Estimate latency savings under different configs.
  4. Visualize speedup curves.
  5. Compare speculative vs baseline decoding.

2. Theoretical Foundation

2.1 Speculative Decoding

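A smaller, faster draft model proposes several tokens ahead; the larger target model then verifies them in a single forward pass and keeps the longest accepted prefix. Throughput improves when the acceptance rate is high, because each expensive target pass then yields more than one token.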
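To quantify how acceptance drives throughput, a common simplification in the speculative decoding literature treats every drafted token as accepted independently with probability alpha. Under that assumption (this guide does not state it), a step that drafts gamma tokens emits (1 - alpha^(gamma+1)) / (1 - alpha) tokens per target pass on average, counting the one token the target always contributes. The names `alpha`, `gamma`, and `expected_tokens_per_step` are illustrative.

```python
def expected_tokens_per_step(alpha: float, gamma: int) -> float:
    """Expected tokens emitted per target forward pass.

    Assumes each drafted token is accepted independently with probability
    alpha, and that the target always contributes one token: a correction
    after the first rejection, or a bonus token if every draft is accepted.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    if gamma < 0:
        raise ValueError("gamma must be non-negative")
    if alpha == 1.0:
        return float(gamma + 1)  # limit of the geometric sum as alpha -> 1
    return (1.0 - alpha ** (gamma + 1)) / (1.0 - alpha)


if __name__ == "__main__":
    # Print the expected yield for a few acceptance rates at gamma = 4.
    for alpha in (0.2, 0.5, 0.8):
        print(alpha, round(expected_tokens_per_step(alpha, gamma=4), 3))
```

With alpha = 0 the function returns exactly 1 token per step, which is the baseline case of ordinary one-token-at-a-time decoding.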


3. Project Specification

3.1 What You Will Build

A simulator that models speculative decoding with tunable parameters and reports speedup.

3.2 Functional Requirements

  1. Draft model simulator with proposal batches.
  2. Target verification for acceptance rate.
  3. Latency model for speedup estimates.
  4. Charts for acceptance vs speedup.
  5. Config presets for common model sizes.

3.3 Non-Functional Requirements

  • Deterministic results with fixed seeds.
  • Clear output reports.
  • Configurable parameters.

4. Solution Architecture

4.1 Components

Component         Responsibility
Draft Simulator   Generate proposals
Verifier          Accept/reject tokens
Latency Model     Estimate speedup
Visualizer        Plot results
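One way the Verifier's accept/reject step could work is the rejection-sampling rule from the speculative sampling literature: a token x drawn from the draft distribution q is kept with probability min(1, p(x)/q(x)), where p is the target distribution, and on rejection a replacement is drawn from the renormalized residual max(0, p - q). This guide does not prescribe a rule, so treat the sketch below, including the function names and the toy three-token distributions, as assumptions.

```python
from __future__ import annotations

import random


def acceptance_probability(p: list[float], q: list[float]) -> float:
    """Chance that a token sampled from draft q survives verification by target p.

    Under the min(1, p/q) acceptance rule this equals sum_x min(p[x], q[x]).
    """
    return sum(min(pi, qi) for pi, qi in zip(p, q))


def verify_token(token: int, p: list[float], q: list[float],
                 rng: random.Random) -> tuple[bool, int]:
    """Accept or reject one drafted token. Returns (accepted, emitted_token)."""
    if rng.random() < min(1.0, p[token] / q[token]):
        return True, token
    # Rejected: resample from the residual max(0, p - q); choices() normalizes
    # the weights. A rejection implies p != q, so the residual is non-zero.
    residual = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
    replacement = rng.choices(range(len(p)), weights=residual, k=1)[0]
    return False, replacement


if __name__ == "__main__":
    # Toy three-token vocabulary; the numbers are illustrative, not measured.
    p = [0.5, 0.3, 0.2]  # target distribution
    q = [0.3, 0.3, 0.4]  # draft distribution
    rng = random.Random(0)
    trials = 10_000
    kept = 0
    for _ in range(trials):
        token = rng.choices(range(len(q)), weights=q, k=1)[0]
        accepted, _emitted = verify_token(token, p, q, rng)
        kept += accepted
    print("analytic :", acceptance_probability(p, q))
    print("empirical:", kept / trials)
```

A simulator that only needs acceptance statistics can skip real distributions and draw a Bernoulli outcome per token instead; the rule above matters when you want the emitted tokens to follow the target distribution.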

5. Implementation Guide

5.1 Project Structure

QUANTIZATION_DISTILLATION_INFERENCE_OPTIMIZATION_MASTERY/P05-speculative-decoding/
├── src/
│   ├── draft.py
│   ├── verify.py
│   ├── latency.py
│   └── plot.py

5.2 Implementation Phases

Phase 1: Simulation model (4-6h)

  • Simulate draft proposals.
  • Checkpoint: acceptance rates computed.
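A minimal sketch of Phase 1, assuming a per-token Bernoulli acceptance model with probability `alpha` rather than real model distributions. The function name, the returned keys, and the default seed are hypothetical choices for illustration.

```python
import random


def simulate(alpha: float, gamma: int, steps: int, seed: int = 0) -> dict:
    """Run `steps` draft-then-verify rounds and return acceptance statistics."""
    if gamma < 1 or steps < 1:
        raise ValueError("gamma and steps must be at least 1")
    rng = random.Random(seed)  # private, seeded generator keeps runs reproducible
    proposed = accepted = emitted = 0
    for _ in range(steps):
        n_ok = 0
        for _ in range(gamma):
            if rng.random() >= alpha:
                break  # the first rejection discards the rest of the batch
            n_ok += 1
        proposed += gamma
        accepted += n_ok
        emitted += n_ok + 1  # the target contributes one token every round
    return {
        # Fraction of ALL drafted tokens that were kept. This is lower than
        # the per-token alpha because tokens after a rejection are thrown away.
        "acceptance_rate": accepted / proposed,
        "tokens_per_step": emitted / steps,
    }


if __name__ == "__main__":
    print(simulate(alpha=0.7, gamma=4, steps=10_000, seed=42))
```

Note the two possible meanings of "acceptance rate" here: the per-token probability `alpha` fed in, and the kept-over-proposed ratio reported out. Reports should state which one they show.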

Phase 2: Latency model (3-5h)

  • Estimate speedups based on acceptance.
  • Checkpoint: speedup curve generated.
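A possible shape for the Phase 2 latency model, under two stated assumptions: drafting `gamma` tokens costs `gamma` sequential draft passes, and verifying them costs the same as one ordinary target pass. Both are simplifications; the class name, field names, and the latencies in the usage example are placeholders, not measurements.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LatencyModel:
    draft_ms: float   # one draft-model forward pass (hypothetical input)
    target_ms: float  # one target-model forward pass (hypothetical input)
    gamma: int        # drafted tokens per step

    def step_ms(self) -> float:
        # gamma sequential draft passes, then one target pass checking them all
        return self.gamma * self.draft_ms + self.target_ms

    def speedup(self, alpha: float) -> float:
        """Baseline ms/token divided by speculative ms/token."""
        if alpha >= 1.0:
            tokens = self.gamma + 1.0
        else:
            # expected tokens per step under independent per-token acceptance
            tokens = (1.0 - alpha ** (self.gamma + 1)) / (1.0 - alpha)
        speculative_ms_per_token = self.step_ms() / tokens
        return self.target_ms / speculative_ms_per_token


def speedup_curve(model: LatencyModel, points: int = 11) -> list:
    """Return (alpha, speedup) pairs on an even grid over [0, 1]."""
    if points < 2:
        raise ValueError("points must be at least 2")
    grid = [i / (points - 1) for i in range(points)]
    return [(alpha, model.speedup(alpha)) for alpha in grid]


if __name__ == "__main__":
    # Placeholder latencies chosen only to exercise the code.
    model = LatencyModel(draft_ms=2.0, target_ms=20.0, gamma=4)
    for alpha, s in speedup_curve(model, points=6):
        print(f"alpha={alpha:.1f}  speedup={s:.2f}")
```

Because the draft passes are always paid for, this model reports a speedup below 1.0 at low acceptance, which is a useful sanity check against inflated results.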

Phase 3: Visualization (2-4h)

  • Plot acceptance vs speedup.
  • Checkpoint: charts match expectations.
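A sketch of the Phase 3 plot, assuming matplotlib as the plotting library (this guide does not name one). The function takes precomputed (acceptance, speedup) pairs, so it stays independent of whichever latency model produced them; `plot_curves` and `split_points` are hypothetical names.

```python
def split_points(points):
    """Turn [(alpha, speedup), ...] into two parallel lists sorted by alpha."""
    ordered = sorted(points)
    return [a for a, _ in ordered], [s for _, s in ordered]


def plot_curves(curves, path):
    """Plot one line per labelled curve and save the figure to `path`.

    `curves` maps a label (for example a preset name) to a list of
    (acceptance_rate, speedup) pairs.
    """
    # Imported lazily so the rest of the simulator runs without matplotlib.
    import matplotlib
    matplotlib.use("Agg")  # non-interactive backend; no display required
    import matplotlib.pyplot as plt

    fig, ax = plt.subplots(figsize=(6, 4))
    for label, points in curves.items():
        xs, ys = split_points(points)
        ax.plot(xs, ys, marker="o", label=label)
    # Reference line: below it, speculation is slower than baseline decoding.
    ax.axhline(1.0, color="gray", linestyle="--", linewidth=1)
    ax.set_xlabel("Acceptance rate")
    ax.set_ylabel("Speedup vs. baseline")
    ax.legend()
    fig.tight_layout()
    fig.savefig(path, dpi=150)
    plt.close(fig)
    return path
```

A quick visual check for the checkpoint: every curve should rise with acceptance, and at zero acceptance it should sit at or below the dashed 1.0 line.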

6. Testing Strategy

6.1 Test Categories

Category      Purpose      Examples
Unit          Acceptance   Correct rates
Integration   Simulation   Outputs stable
Regression    Charts       Stable metrics

6.2 Critical Test Cases

  1. Speedup increases as the acceptance rate rises.
  2. Speedup approaches 1.0 when acceptance is low, and drops below 1.0 once draft cost is counted.
  3. Config presets produce distinct curves.
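The three cases above could be expressed as plain test functions that a runner such as pytest would collect. The `speedup` helper is a compact stand-in for your own latency model, using the independent-acceptance simplification with `cost_ratio` as draft pass time divided by target pass time. The preset names and values are placeholders, not real model sizes.

```python
def speedup(alpha: float, gamma: int, cost_ratio: float) -> float:
    """Stand-in latency model: expected tokens per step over relative step cost."""
    if alpha >= 1.0:
        tokens = gamma + 1.0
    else:
        tokens = (1.0 - alpha ** (gamma + 1)) / (1.0 - alpha)
    return tokens / (gamma * cost_ratio + 1.0)


# Hypothetical presets: (gamma, cost_ratio). Replace with your own configs.
PRESETS = {
    "cheap-draft": (4, 0.05),
    "costly-draft": (4, 0.25),
}

ALPHAS = [i / 10 for i in range(11)]


def test_speedup_increases_with_acceptance():
    values = [speedup(a, gamma=4, cost_ratio=0.1) for a in ALPHAS]
    assert all(later > earlier for earlier, later in zip(values, values[1:]))


def test_low_acceptance_is_at_most_baseline():
    # A free draft model gives exactly baseline speed at zero acceptance.
    assert speedup(0.0, gamma=4, cost_ratio=0.0) == 1.0
    # Any real draft cost makes speculation slower than baseline there.
    assert speedup(0.0, gamma=4, cost_ratio=0.1) < 1.0


def test_presets_produce_distinct_curves():
    curves = {
        name: [speedup(a, gamma, cost) for a in ALPHAS]
        for name, (gamma, cost) in PRESETS.items()
    }
    assert curves["cheap-draft"] != curves["costly-draft"]
```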

7. Common Pitfalls & Debugging

Pitfall                     Symptom            Fix
Unrealistic speedups        Inflated results   Validate latency model
Random variance             Unstable curves    Fix seeds
Misinterpreted acceptance   Wrong metrics      Clarify definitions

8. Extensions & Challenges

Beginner

  • Add CSV export.
  • Add command-line presets.

Intermediate

  • Add variable draft batch sizes.
  • Add cost estimates per token.

Advanced

  • Add multi-token verification modeling.
  • Compare with published benchmarks.

9. Real-World Connections

  • Inference servers use speculative decoding to reduce per-token latency.
  • Whether speculation saves cost depends on the acceptance rate: low acceptance wastes draft compute.

10. Resources

  • Speculative decoding papers
  • Inference optimization guides

11. Self-Assessment Checklist

  • I can model speculative decoding.
  • I can compute speedup estimates.
  • I can visualize acceptance impacts.

12. Submission / Completion Criteria

Minimum Completion:

  • Speculative decoding simulator

Full Completion:

  • Speedup curves and reports

Excellence:

  • Cost estimates and benchmark comparisons

This guide was generated from project_based_ideas/AI_AGENTS_LLM_RAG/QUANTIZATION_DISTILLATION_INFERENCE_OPTIMIZATION_MASTERY.md.