Project 2: Implement a Transformer from Scratch
Build a minimal transformer encoder/decoder to understand attention, masking, and inference mechanics.
Quick Reference
| Attribute | Value |
|---|---|
| Difficulty | Level 3: Advanced |
| Time Estimate | 1–2 weeks |
| Language | Python |
| Prerequisites | PyTorch basics, linear algebra |
| Key Topics | attention, masking, layer norm, feed-forward |
Learning Objectives
By completing this project, you will:
- Implement multi-head attention with correct shapes.
- Apply masking for encoder and decoder behavior.
- Build transformer blocks with residuals and layer norm.
- Validate outputs against a Hugging Face (HF) reference model.
- Identify inference bottlenecks in attention.
The Core Question You’re Answering
“What exactly is a transformer block doing at inference time?”
Implementing one from scratch exposes the math and the memory costs that library abstractions hide.
Concepts You Must Understand First
| Concept | Why It Matters | Where to Learn |
|---|---|---|
| Q/K/V projections | Attention mechanics (formula below) | Transformer basics |
| Masking | Causal vs bidirectional | NLP architectures |
| Layer norm | Stabilizes training | Deep learning guides |
| Residuals | Gradient flow | ResNet concepts |
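For reference, the scaled dot-product attention that the Q/K/V row above refers to, in its standard textbook form (nothing project-specific):

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V$$

Here d_k is the per-head key dimension; dividing by √d_k keeps the softmax inputs in a numerically reasonable range.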
Theoretical Foundation
Transformer Block Anatomy
Input -> Attention -> Add & Norm -> MLP -> Add & Norm
Each block adds context-aware features while preserving stability.
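A minimal sketch of that anatomy in PyTorch, using torch's built-in nn.MultiheadAttention as a stand-in until you write your own attention in Phase 1 (all sizes are illustrative toy defaults):

```python
import torch
import torch.nn as nn

class TransformerBlock(nn.Module):
    """Post-norm block matching the diagram: Attention -> Add & Norm -> MLP -> Add & Norm."""
    def __init__(self, d_model=64, n_heads=4, d_ff=256):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.mlp = nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x, attn_mask=None):
        attn_out, _ = self.attn(x, x, x, attn_mask=attn_mask)  # self-attention over the sequence
        x = self.norm1(x + attn_out)        # residual, then layer norm (post-norm)
        x = self.norm2(x + self.mlp(x))     # second residual + norm around the MLP
        return x
```

Note that many HF models (e.g. GPT-2) are pre-norm: the layer norm comes before attention and the MLP rather than after. If your outputs drift from the reference, normalization order is the first thing to check (see Common Pitfalls).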
Project Specification
What You’ll Build
A minimal transformer that runs inference on toy sequences and matches a HF reference model's outputs within a small numerical tolerance (max absolute difference on the order of 1e-5 in float32).
Functional Requirements
- Multi-head attention implementation
- Masking logic for encoder/decoder
- Residual + layer norm blocks
- Forward pass for batch input
- Reference output comparison
Non-Functional Requirements
- Deterministic outputs (seeding sketch below)
- Clear tensor shape logging
- Small model sizes for CPU testing
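A minimal sketch of keeping runs deterministic and CPU-friendly; the Linear layer is only a stand-in for your model:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)                     # repeatable weight initialization
model = nn.Linear(8, 8)                  # stand-in for your transformer
model.eval()                             # turn off dropout for inference
with torch.no_grad():                    # no autograd bookkeeping needed
    out = model(torch.randn(2, 4, 8))    # (batch, seq, d_model)
print(out.shape)                         # torch.Size([2, 4, 8])
```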
Real World Outcome
Example comparison:
Your output: [0.12, -0.03, 0.44, ...]
HF output: [0.12, -0.03, 0.44, ...]
Max diff: 1e-5
Architecture Overview
┌──────────────┐  inputs  ┌──────────────┐
│  Embeddings  │─────────▶│  Transformer │
└──────────────┘          └──────┬───────┘
                                 ▼
                          ┌──────────────┐
                          │    Output    │
                          └──────────────┘
Implementation Guide
Phase 1: Attention (3–5h)
- Build Q/K/V projections (see the sketch below)
- Checkpoint: attention weights correct
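One way to sketch Phase 1, under the usual "split d_model into heads" convention (names and sizes are illustrative, not prescribed by this project):

```python
import math
import torch
import torch.nn as nn

class MultiHeadAttention(nn.Module):
    def __init__(self, d_model=64, n_heads=4):
        super().__init__()
        assert d_model % n_heads == 0, "d_model must divide evenly into heads"
        self.n_heads, self.d_head = n_heads, d_model // n_heads
        self.q_proj = nn.Linear(d_model, d_model)
        self.k_proj = nn.Linear(d_model, d_model)
        self.v_proj = nn.Linear(d_model, d_model)
        self.out_proj = nn.Linear(d_model, d_model)

    def forward(self, x, mask=None):
        B, T, D = x.shape
        # Project, then split D into (n_heads, d_head) and move heads next to batch.
        q = self.q_proj(x).view(B, T, self.n_heads, self.d_head).transpose(1, 2)
        k = self.k_proj(x).view(B, T, self.n_heads, self.d_head).transpose(1, 2)
        v = self.v_proj(x).view(B, T, self.n_heads, self.d_head).transpose(1, 2)
        scores = q @ k.transpose(-2, -1) / math.sqrt(self.d_head)   # (B, heads, T, T)
        if mask is not None:
            scores = scores.masked_fill(mask == 0, float("-inf"))   # hide disallowed positions
        weights = scores.softmax(dim=-1)        # checkpoint: each row sums to 1
        out = (weights @ v).transpose(1, 2).reshape(B, T, D)
        return self.out_proj(out)
```

Quick shape check: `MultiHeadAttention()(torch.randn(2, 5, 64)).shape` should come back as `torch.Size([2, 5, 64])`.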
Phase 2: Blocks + Model (4–8h)
- Add MLP + layer norm, then stack blocks into a model (see the sketch below)
- Checkpoint: forward pass works
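A sketch of stacking blocks into a tiny model matching the architecture diagram, reusing the TransformerBlock sketched in the Theoretical Foundation section; vocabulary size, depth, and context length are arbitrary toy values:

```python
import torch
import torch.nn as nn

class TinyTransformer(nn.Module):
    def __init__(self, vocab=100, d_model=64, n_heads=4, n_layers=2, max_len=32):
        super().__init__()
        self.tok = nn.Embedding(vocab, d_model)
        self.pos = nn.Embedding(max_len, d_model)      # learned positions, for simplicity
        # TransformerBlock is the post-norm block sketched earlier in this guide.
        self.blocks = nn.ModuleList(TransformerBlock(d_model, n_heads) for _ in range(n_layers))
        self.head = nn.Linear(d_model, vocab)

    def forward(self, ids, mask=None):
        B, T = ids.shape
        x = self.tok(ids) + self.pos(torch.arange(T, device=ids.device))
        for block in self.blocks:
            x = block(x, mask)         # mask convention must match your attention module
        return self.head(x)            # (B, T, vocab) logits

logits = TinyTransformer()(torch.randint(0, 100, (2, 8)))
print(logits.shape)                    # torch.Size([2, 8, 100])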
Phase 3: Reference Comparison (4–8h)
- Compare with HF output (see the comparison sketch below)
- Checkpoint: diff within tolerance
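A sketch of Phase 3 using the transformers library. "gpt2" is just one convenient small reference model; `report_diff` and the commented-out block output are hypothetical names for your side of the comparison:

```python
import torch
from transformers import AutoModel, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
ref = AutoModel.from_pretrained("gpt2").eval()

ids = tok("hello world", return_tensors="pt")["input_ids"]
with torch.no_grad():
    ref_out = ref(ids, output_hidden_states=True)
ref_hidden = ref_out.hidden_states     # tuple: embedding output, then one tensor per block

def report_diff(mine, theirs, name):
    """Print the worst absolute difference; ~1e-5 is a realistic float32 target."""
    print(f"{name}: max diff {(mine - theirs).abs().max().item():.1e}")

# After copying the pretrained weights into your model, compare layer by layer, e.g.:
# report_diff(my_block1_output, ref_hidden[1], "block 1")
```

Comparing per-block hidden states rather than only the final output makes it much easier to localize where your implementation diverges.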
Common Pitfalls & Debugging
| Pitfall | Symptom | Fix |
|---|---|---|
| Shape errors | runtime crashes | assert tensor dims |
| Mask leaks | future tokens visible | verify mask broadcast (check below) |
| Output drift | mismatch vs HF | check normalization order |
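A minimal causal-mask construction and leak check for the "Mask leaks" row, assuming attention scores of shape (batch, heads, T, T):

```python
import torch

T = 5
causal = torch.tril(torch.ones(T, T, dtype=torch.bool))   # True = allowed to attend
mask = causal[None, None, :, :]                            # broadcasts over (batch, heads)

scores = torch.randn(2, 4, T, T).masked_fill(~mask, float("-inf"))
weights = scores.softmax(dim=-1)

# Position i must put zero weight on every j > i.
assert torch.all(weights.triu(diagonal=1) == 0), "future tokens are visible!"
```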
Interview Questions They’ll Ask
- Why do transformers need residual connections?
- How does causal masking change attention?
- What dominates inference cost in attention? (back-of-envelope estimate below)
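A back-of-envelope answer to the last question: the attention score matrix has shape (batch × heads × T × T), so its memory and compute grow quadratically with sequence length. For example (illustrative numbers, not project requirements):

```python
# Memory for one layer's float32 attention scores at a GPT-2-like head count:
batch, heads, T = 1, 12, 4096
score_bytes = batch * heads * T * T * 4
print(f"{score_bytes / 2**20:.0f} MiB of scores for a single layer")   # 768 MiB
```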
Hints in Layers
- Hint 1: Start with single-head attention.
- Hint 2: Add multi-head by splitting dimensions.
- Hint 3: Implement masking and compare outputs.
- Hint 4: Validate against a small HF model.
Learning Milestones
- Block Works: forward pass runs.
- Mask Works: causal behavior verified.
- Matched: outputs align with HF.
Submission / Completion Criteria
Minimum Completion
- Transformer block forward pass
Full Completion
- Reference output match
Excellence
- KV cache or flash attention (KV-cache sketch below)
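For the excellence tier, a minimal sketch of the KV-cache idea (names are illustrative): during autoregressive decoding, keep each layer's past keys and values so only the newest token needs projecting at every step.

```python
import torch

class KVCache:
    """Per-layer cache of past keys/values, each of shape (B, heads, T_past, d_head)."""
    def __init__(self):
        self.k = None
        self.v = None

    def append(self, k_new, v_new):
        self.k = k_new if self.k is None else torch.cat([self.k, k_new], dim=2)
        self.v = v_new if self.v is None else torch.cat([self.v, v_new], dim=2)
        return self.k, self.v   # attend the new query against everything cached so far
```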
This guide was generated from project_based_ideas/AI_AGENTS_LLM_RAG/HUGGINGFACE_TRANSFORMERS_ML_INFERENCE_ECOSYSTEM.md.