Project 5: Build a Simple Inference Server with Continuous Batching
Implement an inference server that dynamically batches incoming requests to improve throughput while keeping tail latency under control.
Quick Reference
| Attribute | Value |
|---|---|
| Difficulty | Level 4: Expert |
| Time Estimate | 1–2 weeks |
| Language | Python |
| Prerequisites | Async programming, model inference basics |
| Key Topics | batching, scheduling, throughput, latency |
Learning Objectives
By completing this project, you will:
- Build an async inference server with request queues.
- Implement continuous batching with a scheduling window.
- Measure throughput and p99 latency.
- Handle backpressure and timeouts safely.
- Compare batched vs unbatched performance.
The Core Question You’re Answering
“How do you maximize throughput without breaking latency guarantees?”
Continuous batching is the key, but only if you measure tradeoffs correctly.
Concepts You Must Understand First
| Concept | Why It Matters | Where to Learn |
|---|---|---|
| Async queues | Handle concurrent requests | Async programming guides |
| Batch scheduling | Throughput vs latency | Serving system papers |
| Tail latency | Production SLAs | Observability docs |
Theoretical Foundation
Continuous Batching
Requests -> Queue -> Scheduler -> Batch -> Inference -> Responses
Rather than waiting for a fixed batch size to fill, the server merges whatever requests arrive within a short scheduling window and dispatches them together.
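A minimal sketch of that window logic, assuming pending requests sit in an `asyncio.Queue`; the names `collect_batch`, `batch_window_s`, and `max_batch_size` are illustrative, not part of the spec:

```python
import asyncio
import time


async def collect_batch(queue: asyncio.Queue, batch_window_s: float = 0.010,
                        max_batch_size: int = 8) -> list:
    """Collect whatever requests arrive within one scheduling window."""
    # Block until at least one request is available, then open the window.
    batch = [await queue.get()]
    deadline = time.monotonic() + batch_window_s
    while len(batch) < max_batch_size:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break
        try:
            batch.append(await asyncio.wait_for(queue.get(), timeout=remaining))
        except asyncio.TimeoutError:
            break
    return batch
```

The first arrival opens the window; the window closes when either `max_batch_size` is reached or the deadline passes, which bounds how long any request can wait for batch-mates.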
Project Specification
What You’ll Build
An inference server that accepts requests, dynamically batches them, and reports latency/throughput metrics.
Functional Requirements
- Request queue with timestamps
- Scheduler that batches within a window
- Inference worker for batched calls
- Metrics for throughput and p95/p99 latency
- Backpressure and queue limits
Non-Functional Requirements
- Graceful degradation under load
- Deterministic test mode
- Configurable batch window and size
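The requirements above suggest a small configuration object for the tunable knobs. A possible sketch, where the field names and defaults are assumptions you should adjust:

```python
from dataclasses import dataclass


@dataclass
class ServerConfig:
    """Configuration knobs for the batching server (illustrative defaults)."""
    batch_window_s: float = 0.010   # how long the scheduler waits to merge requests
    max_batch_size: int = 8         # hard cap on requests per inference call
    max_queue_size: int = 256       # backpressure: reject work beyond this depth
    request_timeout_s: float = 2.0  # fail requests that wait in the queue too long
    deterministic: bool = False     # test mode: fixed seed and fixed fake latency
```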
Real World Outcome
Example metrics output:
```json
{
  "throughput_rps": 45.2,
  "p50_ms": 120,
  "p95_ms": 280,
  "p99_ms": 450,
  "batch_size_avg": 6
}
```
Architecture Overview
```
┌──────────────┐  requests  ┌──────────────┐
│  API Server  │───────────▶│  Scheduler   │
└──────────────┘            └──────┬───────┘
                                   ▼
                            ┌──────────────┐
                            │  Inference   │
                            └──────────────┘
```
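One way to connect these boxes is to have the API server enqueue a small request envelope that carries an `asyncio.Future`, which the inference worker later resolves. A sketch with assumed field and function names:

```python
import asyncio
import time
from dataclasses import dataclass


@dataclass
class PendingRequest:
    """What the API server enqueues and the inference worker resolves."""
    payload: dict            # raw request body
    enqueued_at: float       # for measuring queue wait and enforcing max wait
    done: asyncio.Future     # resolved with the model output for this request


async def submit(queue: asyncio.Queue, payload: dict) -> dict:
    """API side: enqueue the request, then await the worker's result."""
    req = PendingRequest(payload=payload,
                         enqueued_at=time.monotonic(),
                         done=asyncio.get_running_loop().create_future())
    await queue.put(req)
    return await req.done
```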
Implementation Guide
Phase 1: Basic Server (4–6h)
- Accept requests and respond
- Checkpoint: single request succeeds
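A minimal Phase 1 sketch, assuming aiohttp (any async HTTP framework works); the `/infer` route and echo response are placeholders for your model call:

```python
from aiohttp import web


async def infer(request: web.Request) -> web.Response:
    """Phase 1: handle one request at a time, no batching yet."""
    body = await request.json()
    # Placeholder model call; replaced by the batched worker in Phase 2.
    result = {"output": f"echo:{body.get('input', '')}"}
    return web.json_response(result)


app = web.Application()
app.add_routes([web.post("/infer", infer)])

if __name__ == "__main__":
    web.run_app(app, port=8000)
```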
Phase 2: Continuous Batching (4–8h)
- Add scheduler window logic
- Checkpoint: batch size >1 achieved
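A Phase 2 sketch of the worker loop, reusing the `collect_batch`, `PendingRequest`, and `ServerConfig` sketches above; `run_model` stands in for your batched inference call:

```python
import asyncio


async def inference_worker(queue: asyncio.Queue, config: ServerConfig) -> None:
    """Continuously pull one window's worth of requests and run them as a batch."""
    while True:
        batch = await collect_batch(queue, config.batch_window_s,
                                    config.max_batch_size)
        inputs = [req.payload for req in batch]
        outputs = await run_model(inputs)   # hypothetical batched inference call
        for req, out in zip(batch, outputs):
            if not req.done.done():         # skip requests that already timed out
                req.done.set_result(out)
```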
Phase 3: Metrics + Backpressure (4–8h)
- Track latency percentiles
- Add queue limits
- Checkpoint: metrics report generated
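A Phase 3 sketch of backpressure and queue timeouts, assuming the queue is created with `asyncio.Queue(maxsize=config.max_queue_size)` and reusing `PendingRequest` and `ServerConfig` from the earlier sketches:

```python
import asyncio
import time


async def handle_request(queue: asyncio.Queue, payload: dict,
                         config: ServerConfig) -> dict:
    """Reject new work when the queue is full; fail requests that wait too long."""
    req = PendingRequest(payload=payload,
                         enqueued_at=time.monotonic(),
                         done=asyncio.get_running_loop().create_future())
    try:
        queue.put_nowait(req)          # bounded queue raises instead of blocking
    except asyncio.QueueFull:
        raise RuntimeError("server overloaded")           # map to HTTP 429 in the API layer
    try:
        # wait_for cancels the future on timeout, so the worker's done() check skips it.
        return await asyncio.wait_for(req.done, timeout=config.request_timeout_s)
    except asyncio.TimeoutError:
        raise RuntimeError("request timed out in queue")  # map to HTTP 503 in the API layer
```

Rejecting at enqueue time keeps memory bounded under overload, while the per-request timeout prevents old requests from being starved indefinitely.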
Common Pitfalls & Debugging
| Pitfall | Symptom | Fix |
|---|---|---|
| Starvation | old requests stuck | enforce max wait time |
| Tiny batches | no throughput gain | adjust window size |
| Unbounded queue | memory blowup | add backpressure |
Interview Questions They’ll Ask
- What is the tradeoff between batch size and latency?
- How do you handle backpressure safely?
- How do you compute p99 latency?
Hints in Layers
- Hint 1: Start with a plain FIFO queue.
- Hint 2: Add a short batching window (e.g., 10ms).
- Hint 3: Measure p95/p99 latency.
- Hint 4: Apply backpressure when the queue grows.
Learning Milestones
- Server Runs: accepts requests reliably.
- Batched: throughput improves.
- Measured: latency percentiles reported.
Submission / Completion Criteria
Minimum Completion
- Async server + batching
Full Completion
- Metrics + backpressure
Excellence
- Adaptive batch sizing
- Streaming responses
This guide was generated from project_based_ideas/AI_AGENTS_LLM_RAG/HUGGINGFACE_TRANSFORMERS_ML_INFERENCE_ECOSYSTEM.md.