Project 5: Build a Simple Inference Server with Continuous Batching

Implement an inference server that batches incoming requests dynamically to raise throughput while keeping tail latency within acceptable bounds.


Quick Reference

Attribute      Value
Difficulty     Level 4: Expert
Time Estimate  1–2 weeks
Language       Python
Prerequisites  Async programming, model inference basics
Key Topics     Batching, scheduling, throughput, latency

Learning Objectives

By completing this project, you will:

  1. Build an async inference server with request queues.
  2. Implement continuous batching with a scheduling window.
  3. Measure throughput and p99 latency.
  4. Handle backpressure and timeouts safely.
  5. Compare batched vs unbatched performance.

The Core Question You’re Answering

“How do you maximize throughput without breaking latency guarantees?”

Continuous batching is the key, but only if you measure tradeoffs correctly.


Concepts You Must Understand First

Concept           Why It Matters              Where to Learn
Async queues      Handle concurrent requests  Async programming guides
Batch scheduling  Throughput vs. latency      Serving-system papers
Tail latency      Production SLAs             Observability docs

Theoretical Foundation

Continuous Batching

Requests -> Queue -> Scheduler -> Batch -> Inference -> Responses

Rather than waiting for a fixed batch size to fill, the scheduler merges whichever requests arrive within a short window (bounded by a maximum batch size), so newly arriving requests join the next batch instead of stalling the current one. The window length is the main knob: a longer window yields bigger batches and higher throughput, a shorter window yields lower latency.
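
A minimal sketch of that merging step, assuming requests arrive on an asyncio.Queue; the names collect_batch, max_batch_size, and max_wait_s are illustrative, not part of the spec:

import asyncio
import time

async def collect_batch(queue: asyncio.Queue, max_batch_size: int = 8,
                        max_wait_s: float = 0.010) -> list:
    """Merge requests that arrive within a short window, up to a size cap."""
    batch = [await queue.get()]            # block until at least one request exists
    deadline = time.monotonic() + max_wait_s
    while len(batch) < max_batch_size:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break
        try:
            batch.append(await asyncio.wait_for(queue.get(), timeout=remaining))
        except asyncio.TimeoutError:
            break                          # window closed; run with what we have
    return batch

A scheduler loop can call this repeatedly and hand each batch to the inference worker; because the window also caps how long the oldest request waits, it doubles as the starvation guard listed in the pitfalls table.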


Project Specification

What You’ll Build

An inference server that accepts requests, dynamically batches them, and reports latency/throughput metrics.

Functional Requirements

  1. Request queue with arrival timestamps (see the sketch after this list)
  2. Scheduler that batches within a window
  3. Inference worker for batched calls
  4. Metrics for throughput and p95/p99 latency
  5. Backpressure and queue limits
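
A possible shape for the queued item (requirement 1), carrying the timestamp needed for latency metrics and the future the API handler will await; field names and the maxsize value are assumptions:

import asyncio
import time
from dataclasses import dataclass

@dataclass
class InferenceRequest:
    prompt: str
    arrived_at: float         # time.monotonic() at enqueue; used for latency metrics
    response: asyncio.Future  # set by the inference worker, awaited by the API handler

# Requirement 5: a bounded queue is the simplest backpressure signal --
# put_nowait() raises asyncio.QueueFull once maxsize is reached.
request_queue: asyncio.Queue = asyncio.Queue(maxsize=256)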

Non-Functional Requirements

  • Graceful degradation under load
  • Deterministic test mode
  • Configurable batch window and size

Real World Outcome

Example metrics output:

{
  "throughput_rps": 45.2,
  "p50_ms": 120,
  "p95_ms": 280,
  "p99_ms": 450,
  "batch_size_avg": 6
}
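
One way such a report could be assembled from per-request timings, using the simple nearest-rank percentile; the function and field names mirror the example above but are otherwise assumptions:

import math

def metrics_report(latencies_ms: list[float], batch_sizes: list[int],
                   elapsed_s: float) -> dict:
    """Summarize a run: throughput, latency percentiles, average batch size."""
    def percentile(values: list[float], p: float) -> float:
        ordered = sorted(values)
        rank = max(0, math.ceil(p / 100 * len(ordered)) - 1)   # nearest-rank index
        return ordered[rank]

    return {
        "throughput_rps": round(len(latencies_ms) / elapsed_s, 1),
        "p50_ms": round(percentile(latencies_ms, 50)),
        "p95_ms": round(percentile(latencies_ms, 95)),
        "p99_ms": round(percentile(latencies_ms, 99)),
        "batch_size_avg": round(sum(batch_sizes) / len(batch_sizes)),
    }

Record each request's latency as completion time minus arrival timestamp, and record the size of every batch the scheduler dispatches; p99 only stabilizes once you have at least a few hundred samples.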

Architecture Overview

┌──────────────┐  requests   ┌──────────────┐
│ API Server   │────────────▶│  Scheduler   │
└──────────────┘             └──────┬───────┘
                                    ▼
                             ┌──────────────┐
                             │  Inference   │
                             └──────────────┘
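
A self-contained sketch of how the three boxes can be wired together with asyncio tasks and futures. For brevity it carries each request as a (prompt, future) tuple and merges naively by draining the queue; the windowed collect_batch sketch above can be dropped in instead, and run_model is a stub for the real batched forward pass:

import asyncio

async def run_model(prompts: list[str]) -> list[str]:
    """Stub for the real batched model call."""
    await asyncio.sleep(0.05)                     # simulated per-batch inference cost
    return [p.upper() for p in prompts]

async def scheduler_worker(queue: asyncio.Queue) -> None:
    """Scheduler + Inference boxes: form a batch, run it, fan results back out."""
    while True:
        batch = [await queue.get()]
        while not queue.empty():                  # merge whatever else already arrived
            batch.append(queue.get_nowait())
        outputs = await run_model([prompt for prompt, _ in batch])
        for (_, fut), out in zip(batch, outputs):
            fut.set_result(out)                   # unblocks the waiting API handler

async def api_handler(queue: asyncio.Queue, prompt: str) -> str:
    """API Server box: enqueue the request, then await its future."""
    fut = asyncio.get_running_loop().create_future()
    await queue.put((prompt, fut))
    return await fut

async def main() -> None:
    queue: asyncio.Queue = asyncio.Queue()
    asyncio.create_task(scheduler_worker(queue))
    print(await asyncio.gather(*(api_handler(queue, f"req {i}") for i in range(5))))

asyncio.run(main())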

Implementation Guide

Phase 1: Basic Server (4–6h)

  • Accept requests and return responses (see the minimal server sketch below)
  • Checkpoint: a single request round-trips successfully
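
A minimal starting point for this phase, assuming FastAPI as the web framework (the spec does not mandate one; any async HTTP framework works). There is no queue or batching yet, only a placeholder model:

from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class InferenceIn(BaseModel):
    prompt: str

@app.post("/infer")
async def infer(body: InferenceIn) -> dict:
    # Phase 1: call the (stubbed) model directly, one request at a time.
    return {"completion": body.prompt.upper()}

Run it with uvicorn server:app (if the file is saved as server.py) and confirm the checkpoint by sending a single POST with a JSON body containing a prompt field.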

Phase 2: Continuous Batching (4–8h)

  • Add scheduler window logic (the collect_batch sketch under Continuous Batching is one starting point)
  • Checkpoint: average batch size > 1 and higher throughput under concurrent load (see the load-generator sketch below)
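
A small load generator for the checkpoint, assuming the Phase 1 server is listening on localhost:8000 and httpx is installed (both assumptions, not part of the spec). Pair its client-side numbers with the server-side batch_size_avg metric to confirm that batches larger than one are actually forming:

import asyncio
import time

import httpx

URL = "http://localhost:8000/infer"   # assumed endpoint from Phase 1

async def one_request(client: httpx.AsyncClient) -> float:
    start = time.monotonic()
    resp = await client.post(URL, json={"prompt": "hello"}, timeout=10.0)
    resp.raise_for_status()
    return (time.monotonic() - start) * 1000   # latency in milliseconds

async def main(n: int = 100) -> None:
    async with httpx.AsyncClient() as client:
        start = time.monotonic()
        latencies = await asyncio.gather(*(one_request(client) for _ in range(n)))
        elapsed = time.monotonic() - start
    print(f"throughput: {n / elapsed:.1f} rps")
    print(f"p95 latency: {sorted(latencies)[int(0.95 * n) - 1]:.0f} ms")

if __name__ == "__main__":
    asyncio.run(main())

Running it once with batching disabled and once with it enabled gives the batched vs. unbatched comparison from objective 5.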

Phase 3: Metrics + Backpressure (4–8h)

  • Track latency percentiles (the metrics_report sketch under Real World Outcome covers the math)
  • Add queue limits and request timeouts (see the backpressure sketch below)
  • Checkpoint: a metrics report like the example output above is generated
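
A sketch of the backpressure and timeout path, reusing the queue-plus-future pattern from the earlier sketches; ServerOverloaded, REQUEST_TIMEOUT_S, and the maxsize value are illustrative choices:

import asyncio

REQUEST_QUEUE: asyncio.Queue = asyncio.Queue(maxsize=256)   # hard cap on queued work
REQUEST_TIMEOUT_S = 2.0                                     # per-request latency budget

class ServerOverloaded(Exception):
    """Signal for the API layer to return 429/503 instead of hanging."""

async def enqueue_with_backpressure(prompt: str) -> str:
    fut = asyncio.get_running_loop().create_future()
    try:
        REQUEST_QUEUE.put_nowait((prompt, fut))     # fail fast rather than queue forever
    except asyncio.QueueFull:
        raise ServerOverloaded("queue limit reached; shedding load")
    try:
        return await asyncio.wait_for(fut, timeout=REQUEST_TIMEOUT_S)
    except asyncio.TimeoutError:
        # wait_for cancels the future on timeout, so the worker should check
        # fut.cancelled() before calling set_result on each request it dequeues.
        raise ServerOverloaded("request timed out before completion")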

Common Pitfalls & Debugging

Pitfall          Symptom             Fix
Starvation       Old requests stuck  Enforce a max wait time
Tiny batches     No throughput gain  Adjust the window size
Unbounded queue  Memory blow-up      Add backpressure (queue limits)

Interview Questions They’ll Ask

  1. What is the tradeoff between batch size and latency?
  2. How do you handle backpressure safely?
  3. How do you compute p99 latency?

Hints in Layers

  • Hint 1: Start with FIFO queue.
  • Hint 2: Add a short batching window (e.g., 10ms).
  • Hint 3: Measure p95/p99 latency.
  • Hint 4: Add backpressure when queue grows.

Learning Milestones

  1. Server Runs: accepts requests reliably.
  2. Batched: throughput improves.
  3. Measured: latency percentiles reported.

Submission / Completion Criteria

Minimum Completion

  • Async server + batching

Full Completion

  • Metrics + backpressure

Excellence

  • Adaptive batch sizing
  • Streaming responses

This guide was generated from project_based_ideas/AI_AGENTS_LLM_RAG/HUGGINGFACE_TRANSFORMERS_ML_INFERENCE_ECOSYSTEM.md.