Project 5: Build a Simple Inference Server with Continuous Batching
Implement an inference server that dynamically batches incoming requests to improve throughput while keeping tail latency under control.
Quick Reference
| Attribute | Value |
|---|---|
| Difficulty | Level 4: Expert |
| Time Estimate | 1–2 weeks |
| Language | Python |
| Prerequisites | Async programming, model inference basics |
| Key Topics | batching, scheduling, throughput, latency |
Learning Objectives
By completing this project, you will:
- Build an async inference server with request queues.
- Implement continuous batching with a scheduling window.
- Measure throughput and p99 latency.
- Handle backpressure and timeouts safely.
- Compare batched vs unbatched performance.
The Core Question You’re Answering
“How do you maximize throughput without breaking latency guarantees?”
Continuous batching is the key, but only if you measure tradeoffs correctly.
Concepts You Must Understand First
| Concept | Why It Matters | Where to Learn |
|---|---|---|
| Async queues | Handle concurrent requests | Async programming guides |
| Batch scheduling | Throughput vs latency | Serving system papers |
| Tail latency | Production SLAs | Observability docs |
Theoretical Foundation
Continuous Batching
Requests -> Queue -> Scheduler -> Batch -> Inference -> Responses
Rather than waiting for a fixed batch size to fill, the server merges whatever requests arrive within a short scheduling window and dispatches them together.
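A minimal sketch of that window logic, assuming pending requests sit in an `asyncio.Queue`; the names `collect_batch`, `batch_window_s`, and `max_batch_size` are illustrative, not part of the spec:

```python
import asyncio
import time


async def collect_batch(queue: asyncio.Queue, batch_window_s: float = 0.010,
                        max_batch_size: int = 8) -> list:
    """Collect whatever requests arrive within one scheduling window."""
    # Block until at least one request is available, then open the window.
    batch = [await queue.get()]
    deadline = time.monotonic() + batch_window_s
    while len(batch) < max_batch_size:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break
        try:
            batch.append(await asyncio.wait_for(queue.get(), timeout=remaining))
        except asyncio.TimeoutError:
            break
    return batch
```

The first arrival opens the window; the window closes when either `max_batch_size` is reached or the deadline passes, which bounds how long any request can wait for batch-mates.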
Project Specification
What You’ll Build
An inference server that accepts requests, dynamically batches them, and reports latency/throughput metrics.
Functional Requirements
- Request queue with timestamps
- Scheduler that batches within a window
- Inference worker for batched calls
- Metrics for throughput and p95/p99 latency
- Backpressure and queue limits
Non-Functional Requirements
- Graceful degradation under load
- Deterministic test mode
- Configurable batch window and size
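The requirements above suggest a small configuration object for the tunable knobs. A possible sketch, where the field names and defaults are assumptions you should adjust:

```python
from dataclasses import dataclass


@dataclass
class ServerConfig:
    """Configuration knobs for the batching server (illustrative defaults)."""
    batch_window_s: float = 0.010   # how long the scheduler waits to merge requests
    max_batch_size: int = 8         # hard cap on requests per inference call
    max_queue_size: int = 256       # backpressure: reject work beyond this depth
    request_timeout_s: float = 2.0  # fail requests that wait in the queue too long
    deterministic: bool = False     # test mode: fixed seed and fixed fake latency
```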
Real World Outcome
Example metrics output:
```json
{
  "throughput_rps": 45.2,
  "p50_ms": 120,
  "p95_ms": 280,
  "p99_ms": 450,
  "batch_size_avg": 6
}
```
Architecture Overview
```
┌──────────────┐  requests  ┌──────────────┐
│  API Server  │───────────▶│  Scheduler   │
└──────────────┘            └──────┬───────┘
                                   ▼
                            ┌──────────────┐
                            │  Inference   │
                            └──────────────┘
```
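One way to connect these boxes is to have the API server enqueue a small request envelope that carries an `asyncio.Future`, which the inference worker later resolves. A sketch with assumed field and function names:

```python
import asyncio
import time
from dataclasses import dataclass


@dataclass
class PendingRequest:
    """What the API server enqueues and the inference worker resolves."""
    payload: dict            # raw request body
    enqueued_at: float       # for measuring queue wait and enforcing max wait
    done: asyncio.Future     # resolved with the model output for this request


async def submit(queue: asyncio.Queue, payload: dict) -> dict:
    """API side: enqueue the request, then await the worker's result."""
    req = PendingRequest(payload=payload,
                         enqueued_at=time.monotonic(),
                         done=asyncio.get_running_loop().create_future())
    await queue.put(req)
    return await req.done
```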
Implementation Guide
Phase 1: Basic Server (4–6h)
- Accept requests and respond
- Checkpoint: single request succeeds
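A minimal Phase 1 sketch, assuming aiohttp (any async HTTP framework works); the `/infer` route and echo response are placeholders for your model call:

```python
from aiohttp import web


async def infer(request: web.Request) -> web.Response:
    """Phase 1: handle one request at a time, no batching yet."""
    body = await request.json()
    # Placeholder model call; replaced by the batched worker in Phase 2.
    result = {"output": f"echo:{body.get('input', '')}"}
    return web.json_response(result)


app = web.Application()
app.add_routes([web.post("/infer", infer)])

if __name__ == "__main__":
    web.run_app(app, port=8000)
```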
Phase 2: Continuous Batching (4–8h)
- Add scheduler window logic
- Checkpoint: batch size >1 achieved
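A Phase 2 sketch of the worker loop, reusing the `collect_batch`, `PendingRequest`, and `ServerConfig` sketches above; `run_model` stands in for your batched inference call:

```python
import asyncio


async def inference_worker(queue: asyncio.Queue, config: ServerConfig) -> None:
    """Continuously pull one window's worth of requests and run them as a batch."""
    while True:
        batch = await collect_batch(queue, config.batch_window_s,
                                    config.max_batch_size)
        inputs = [req.payload for req in batch]
        outputs = await run_model(inputs)   # hypothetical batched inference call
        for req, out in zip(batch, outputs):
            if not req.done.done():         # skip requests that already timed out
                req.done.set_result(out)
```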
Phase 3: Metrics + Backpressure (4–8h)
- Track latency percentiles
- Add queue limits
- Checkpoint: metrics report generated
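A Phase 3 sketch of backpressure and queue timeouts, assuming the queue is created with `asyncio.Queue(maxsize=config.max_queue_size)` and reusing `PendingRequest` and `ServerConfig` from the earlier sketches:

```python
import asyncio
import time


async def handle_request(queue: asyncio.Queue, payload: dict,
                         config: ServerConfig) -> dict:
    """Reject new work when the queue is full; fail requests that wait too long."""
    req = PendingRequest(payload=payload,
                         enqueued_at=time.monotonic(),
                         done=asyncio.get_running_loop().create_future())
    try:
        queue.put_nowait(req)          # bounded queue raises instead of blocking
    except asyncio.QueueFull:
        raise RuntimeError("server overloaded")           # map to HTTP 429 in the API layer
    try:
        # wait_for cancels the future on timeout, so the worker's done() check skips it.
        return await asyncio.wait_for(req.done, timeout=config.request_timeout_s)
    except asyncio.TimeoutError:
        raise RuntimeError("request timed out in queue")  # map to HTTP 503 in the API layer
```

Rejecting at enqueue time keeps memory bounded under overload, while the per-request timeout prevents old requests from being starved indefinitely.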
Common Pitfalls & Debugging
| Pitfall | Symptom | Fix |
|---|---|---|
| Starvation | old requests stuck | enforce max wait time |
| Tiny batches | no throughput gain | adjust window size |
| Unbounded queue | memory blowup | add backpressure |
Interview Questions They’ll Ask
- What is the tradeoff between batch size and latency?
- How do you handle backpressure safely?
- How do you compute p99 latency?
Hints in Layers
- Hint 1: Start with a plain FIFO queue.
- Hint 2: Add a short batching window (e.g., 10ms).
- Hint 3: Measure p95/p99 latency.
- Hint 4: Apply backpressure when the queue grows.
Learning Milestones
- Server Runs: accepts requests reliably.
- Batched: throughput improves.
- Measured: latency percentiles reported.
Submission / Completion Criteria
Minimum Completion
- Async server + batching
Full Completion
- Metrics + backpressure
Excellence
- Adaptive batch sizing
- Streaming responses
This guide was generated from project_based_ideas/AI_AGENTS_LLM_RAG/HUGGINGFACE_TRANSFORMERS_ML_INFERENCE_ECOSYSTEM.md.