Project 2: Implement a Transformer from Scratch

Build a minimal transformer encoder/decoder to understand attention, masking, and inference mechanics.


Quick Reference

Attribute      Value
Difficulty     Level 3: Advanced
Time Estimate  1–2 weeks
Language       Python
Prerequisites  PyTorch basics, linear algebra
Key Topics     attention, masking, layer norm, feed-forward

Learning Objectives

By completing this project, you will:

  1. Implement multi-head attention with correct shapes.
  2. Apply masking for encoder and decoder behavior.
  3. Build transformer blocks with residuals and layer norm.
  4. Validate outputs against a Hugging Face (HF) reference model.
  5. Identify inference bottlenecks in attention.

The Core Question You’re Answering

“What exactly is a transformer block doing at inference time?”

Implementing one from scratch exposes both the math and the memory costs.


Concepts You Must Understand First

Concept            Why It Matters           Where to Learn
Q/K/V projections  Attention mechanics      Transformer basics
Masking            Causal vs bidirectional  NLP architectures
Layer norm         Stabilizes training      Deep learning guides
Residuals          Gradient flow            ResNet concepts

Theoretical Foundation

Transformer Block Anatomy

Input -> Attention -> Add & Norm -> MLP -> Add & Norm

Each block adds context-aware features while preserving stability.
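To make the anatomy concrete, here is a minimal sketch of one post-norm block in PyTorch. It uses the built-in nn.MultiheadAttention purely to illustrate the data flow; in the project you replace it with your own attention from Phase 1, and the sizes (d_model=64, n_heads=4, d_ff=256) are arbitrary toy values.

import torch
import torch.nn as nn

class TransformerBlock(nn.Module):
    """One post-norm block: Attention -> Add & Norm -> MLP -> Add & Norm."""
    def __init__(self, d_model=64, n_heads=4, d_ff=256):
        super().__init__()
        # batch_first=True keeps tensors as (batch, seq, d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.mlp = nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x, attn_mask=None):
        attn_out, _ = self.attn(x, x, x, attn_mask=attn_mask)
        x = self.norm1(x + attn_out)       # residual + layer norm
        x = self.norm2(x + self.mlp(x))    # residual + layer norm
        return x

x = torch.randn(2, 10, 64)                 # (batch, seq, d_model)
print(TransformerBlock()(x).shape)         # torch.Size([2, 10, 64])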


Project Specification

What You’ll Build

A minimal transformer that runs inference on toy sequences and matches Hugging Face reference outputs within a small numerical tolerance.

Functional Requirements

  1. Multi-head attention implementation
  2. Masking logic for encoder/decoder (see the sketch after this list)
  3. Residual + layer norm blocks
  4. Forward pass for batch input
  5. Reference output comparison
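One way requirement 2 might look, as a sketch: a boolean causal mask for decoder-style attention and a padding mask for encoder-style attention. The convention assumed here is that True marks positions to block, which matches masked_fill and PyTorch's boolean attn_mask.

import torch

def causal_mask(seq_len):
    # Decoder-style mask: position i may attend only to positions <= i.
    # True marks the positions that must be blocked (the upper triangle).
    return torch.triu(torch.ones(seq_len, seq_len), diagonal=1).bool()

def padding_mask(lengths, seq_len):
    # Encoder-style mask: block attention to padded positions past each
    # sequence's true length. Shape (batch, seq_len), True = padded.
    positions = torch.arange(seq_len)
    return positions.unsqueeze(0) >= lengths.unsqueeze(1)

print(causal_mask(4))
print(padding_mask(torch.tensor([2, 4]), seq_len=4))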

Non-Functional Requirements

  • Deterministic outputs
  • Clear tensor shape logging
  • Small model sizes for CPU testing
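A sketch of how these constraints might be handled; the seed, config values, and logging helper below are arbitrary choices, not requirements.

import torch

torch.manual_seed(0)                        # deterministic weights and inputs
torch.use_deterministic_algorithms(True)    # fail loudly on non-deterministic ops

CONFIG = dict(d_model=64, n_heads=4, n_layers=2, vocab_size=1000)  # small enough for CPU

def log_shape(name, t):
    # Cheap tensor-shape logging to catch dimension mistakes early.
    print(f"{name}: {tuple(t.shape)}")

log_shape("input_ids", torch.randint(0, 1000, (2, 10)))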

Real World Outcome

Example comparison:

Your output:   [0.12, -0.03, 0.44, ...]
HF output:     [0.12, -0.03, 0.44, ...]
Max diff:      1e-5
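One way to produce a comparison like the one above, assuming yours and ref are same-shape hidden-state tensors and the tolerance is your choice:

import torch

def compare(yours, ref, tol=1e-4):
    # Report the largest element-wise difference against the reference.
    max_diff = (yours - ref).abs().max().item()
    print(f"Max diff: {max_diff:.2e}  ({'OK' if max_diff < tol else 'MISMATCH'})")
    return max_diff < tol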

Architecture Overview

┌──────────────┐ inputs  ┌──────────────┐
│ Embeddings   │────────▶│ Transformer  │
└──────────────┘         └──────┬───────┘
                                ▼
                         ┌──────────────┐
                         │ Output       │
                         └──────────────┘

Implementation Guide

Phase 1: Attention (3–5h)

  • Build Q/K/V projections
  • Checkpoint: attention weights correct
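A possible shape for the Phase 1 deliverable: Q/K/V projections, head splitting, and scaled dot-product attention softmax(QK^T / sqrt(d_head)) V. All names and sizes here are illustrative, not required.

import math
import torch
import torch.nn as nn

class MultiHeadAttention(nn.Module):
    def __init__(self, d_model=64, n_heads=4):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads, self.d_head = n_heads, d_model // n_heads
        self.q_proj = nn.Linear(d_model, d_model)
        self.k_proj = nn.Linear(d_model, d_model)
        self.v_proj = nn.Linear(d_model, d_model)
        self.out_proj = nn.Linear(d_model, d_model)

    def forward(self, x, mask=None):
        B, T, D = x.shape
        # Project and split into heads: (B, n_heads, T, d_head)
        def split(t):
            return t.view(B, T, self.n_heads, self.d_head).transpose(1, 2)
        q, k, v = split(self.q_proj(x)), split(self.k_proj(x)), split(self.v_proj(x))
        # Scaled dot-product attention: softmax(QK^T / sqrt(d_head)) V
        scores = q @ k.transpose(-2, -1) / math.sqrt(self.d_head)   # (B, H, T, T)
        if mask is not None:
            scores = scores.masked_fill(mask, float("-inf"))
        weights = scores.softmax(dim=-1)   # checkpoint: each row sums to 1
        out = weights @ v                  # (B, H, T, d_head)
        return self.out_proj(out.transpose(1, 2).reshape(B, T, D))

x = torch.randn(2, 10, 64)
print(MultiHeadAttention()(x).shape)       # torch.Size([2, 10, 64])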

Phase 2: Blocks + Model (4–8h)

  • Add MLP + layer norm
  • Checkpoint: forward pass works
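A sketch of how Phase 2 might wire embeddings and stacked blocks into a forward pass over a batch. It uses nn.TransformerEncoderLayer as a stand-in for your hand-written block so the snippet runs on its own; in the real project you substitute your own attention, MLP, and layer norm modules.

import torch
import torch.nn as nn

class MiniTransformer(nn.Module):
    """Embeddings -> stacked blocks -> output states, on batched token ids."""
    def __init__(self, vocab_size=1000, d_model=64, n_heads=4, n_layers=2, max_len=128):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, d_model)
        self.pos_emb = nn.Embedding(max_len, d_model)
        # Stand-in for your block (attention + MLP + residuals + layer norm).
        self.blocks = nn.ModuleList(
            nn.TransformerEncoderLayer(d_model, n_heads, dim_feedforward=256,
                                       dropout=0.0, batch_first=True)
            for _ in range(n_layers)
        )

    def forward(self, input_ids):
        B, T = input_ids.shape
        x = self.tok_emb(input_ids) + self.pos_emb(torch.arange(T, device=input_ids.device))
        for block in self.blocks:
            x = block(x)
        return x   # (B, T, d_model) contextualized states

ids = torch.randint(0, 1000, (2, 10))       # batch of toy sequences
print(MiniTransformer()(ids).shape)         # torch.Size([2, 10, 64])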

Phase 3: Reference Comparison (4–8h)

  • Compare with HF output
  • Checkpoint: diff within tolerance
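A sketch of the reference comparison, assuming a small Hugging Face encoder; the checkpoint name is just one convenient choice, and matching within tolerance also requires loading the reference weights into your own implementation.

import torch
from transformers import AutoModel, AutoTokenizer

# One convenient small reference; any small encoder checkpoint works.
name = "distilbert-base-uncased"
tok = AutoTokenizer.from_pretrained(name)
ref_model = AutoModel.from_pretrained(name)
ref_model.eval()

inputs = tok("a toy sequence", return_tensors="pt")
with torch.no_grad():
    ref_out = ref_model(**inputs).last_hidden_state   # (1, seq, hidden)

# yours = my_model(inputs["input_ids"])   # after copying the reference weights
# print((yours - ref_out).abs().max())    # target: ~1e-5 or below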

Common Pitfalls & Debugging

Pitfall       Symptom                Fix
Shape errors  runtime crashes        assert tensor dims
Mask leaks    future tokens visible  verify mask broadcast
Output drift  mismatch vs HF         check normalization order
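For the mask-leak pitfall, a quick check you might add, assuming your attention module can return its attention weights: with a correct causal mask, every weight above the diagonal should be numerically zero.

import torch

def assert_causal(attn_weights, atol=1e-6):
    # attn_weights: (..., T, T), rows = queries, cols = keys.
    # With a correct causal mask, every entry above the diagonal must be ~0.
    T = attn_weights.shape[-1]
    future = torch.triu(torch.ones(T, T), diagonal=1).bool()
    leaked = attn_weights[..., future].abs().max().item()
    assert leaked < atol, f"future positions visible: max weight {leaked}"

# Example: a row-normalized lower-triangular matrix passes the check.
w = torch.tril(torch.ones(4, 4))
assert_causal(w / w.sum(-1, keepdim=True))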

Interview Questions They’ll Ask

  1. Why do transformers need residual connections?
  2. How does causal masking change attention?
  3. What dominates inference cost in attention?

Hints in Layers

  • Hint 1: Start with single-head attention.
  • Hint 2: Add multi-head by splitting dimensions (sketched after this list).
  • Hint 3: Implement masking and compare outputs.
  • Hint 4: Validate against a small HF model.
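Hint 2 in a few lines, assuming d_model splits evenly into n_heads heads:

import torch

B, T, d_model, n_heads = 2, 10, 64, 4
d_head = d_model // n_heads

x = torch.randn(B, T, d_model)
# Split the model dimension into heads, then move heads next to batch so
# each head runs its own (T x T) attention: (B, T, D) -> (B, H, T, d_head).
heads = x.view(B, T, n_heads, d_head).transpose(1, 2)
# Merge back after attention: (B, H, T, d_head) -> (B, T, D).
merged = heads.transpose(1, 2).reshape(B, T, d_model)
print(heads.shape, merged.shape)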

Learning Milestones

  1. Block Works: forward pass runs.
  2. Mask Works: causal behavior verified.
  3. Matched: outputs align with HF.

Submission / Completion Criteria

Minimum Completion

  • Transformer block forward pass

Full Completion

  • Reference output match

Excellence

  • KV cache or flash attention
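If you go for the KV-cache stretch goal, a minimal sketch of the idea (one simple structure, not the only one): cache past keys/values per layer so each decode step only processes the newest token instead of re-running the whole prefix.

import torch

class KVCache:
    """Append-only cache of past keys/values for one attention layer."""
    def __init__(self):
        self.k = None   # (B, H, T_seen, d_head)
        self.v = None

    def update(self, k_new, v_new):
        # k_new, v_new: (B, H, 1, d_head) for the current token.
        self.k = k_new if self.k is None else torch.cat([self.k, k_new], dim=2)
        self.v = v_new if self.v is None else torch.cat([self.v, v_new], dim=2)
        return self.k, self.v

cache = KVCache()
for _ in range(3):                                  # three decode steps
    k, v = cache.update(torch.randn(1, 4, 1, 16), torch.randn(1, 4, 1, 16))
print(k.shape)                                      # torch.Size([1, 4, 3, 16])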

This guide was generated from project_based_ideas/AI_AGENTS_LLM_RAG/HUGGINGFACE_TRANSFORMERS_ML_INFERENCE_ECOSYSTEM.md.