Project 1: Build Attention from Scratch (Pure NumPy)
Project Metadata
| Attribute | Value |
|---|---|
| Difficulty | Level 4: Expert |
| Time Estimate | 1-2 weeks |
| Prerequisites | Python, NumPy, linear algebra (matrix multiplication, dot products) |
| Main Programming Language | Python |
| Alternative Languages | Rust, C++ |
| Primary Tool | NumPy, Matplotlib |
| Main Book | “Build a Large Language Model (From Scratch)” by Sebastian Raschka |
| Knowledge Area | Deep Learning / Linear Algebra |
Learning Objectives
By completing this project, you will be able to:
- Implement scaled dot-product attention from first principles using only NumPy
- Explain mathematically why we compute QK^T / sqrt(d_k) and what each operation accomplishes
- Create attention masks for causal language modeling that prevent future token leakage
- Implement multi-head attention and understand how parallel attention subspaces work
- Visualize attention patterns to debug and understand what the model “looks at”
- Answer interview questions about attention with deep mathematical understanding
Theoretical Foundation
The Core Question You’re Answering
“How does a neural network know which words to ‘pay attention to’ when processing a sentence, and how is this implemented mathematically from first principles?”
This project forces you to answer:
- Why do we need three matrices (Q, K, V) instead of just comparing tokens directly?
- What does the softmax actually do, and why is it essential?
- How does scaling by sqrt(d_k) prevent gradient problems?
- Why does matrix multiplication QK^T give us relevance scores?
Most people use torch.nn.MultiheadAttention as a black box. After this project, you’ll understand every operation inside that black box and could implement it yourself in any language.
Concepts You Must Understand First
Stop and research these before coding:
1. Matrix Multiplication Semantics
What does Q @ K^T actually compute? (Hint: dot product between every query and every key)
Consider two matrices:
- Q with shape (N, d): N queries, each of dimension d
- K with shape (N, d): N keys, each of dimension d
When we compute Q @ K^T:
- K^T has shape (d, N)
- Result has shape (N, N)
- Element [i, j] = dot product of query i with key j
This is computing all pairwise similarities between queries and keys in one matrix operation.
Book Reference: “Introduction to Linear Algebra” by Gilbert Strang - Ch. 1-2
2. Dot Product as Similarity
Why is the dot product between two vectors a measure of their similarity?
dot(a, b) = |a| × |b| × cos(θ)
where θ is the angle between vectors a and b
- Vectors pointing same direction: cos(0°) = 1 → large positive dot product
- Orthogonal vectors: cos(90°) = 0 → dot product = 0
- Opposite vectors: cos(180°) = -1 → large negative dot product
The dot product captures both magnitude and directional alignment.
Book Reference: “Mathematics for Machine Learning” by Deisenroth et al. - Ch. 3.1
3. Softmax Function
What does softmax do? It converts arbitrary numbers to probabilities that sum to 1.
softmax(x_i) = exp(x_i) / Σ exp(x_j)
Why softmax instead of just normalization?
- Softmax emphasizes larger values exponentially
- Small differences become big differences after exp()
- This creates a “winner-take-more” dynamic
Example:
scores = [2.0, 1.0, 0.5]
# Simple normalization (dividing by sum):
# [2.0/3.5, 1.0/3.5, 0.5/3.5] = [0.57, 0.29, 0.14]
# Softmax:
# exp([2.0, 1.0, 0.5]) = [7.39, 2.72, 1.65]
# Sum = 11.76
# [0.63, 0.23, 0.14] # Notice: gap between first two increased
Book Reference: “Deep Learning” by Goodfellow et al. - Ch. 6.2.2.3
4. Why Scaling by sqrt(d_k)?
What happens to dot products as dimension increases?
For two random unit vectors in d dimensions, their dot product has:
- Mean: 0
- Variance: 1/d… wait, that’s the variance of each component
- Actually, variance of dot product ≈ d (for standard normal components)
Problem: As d increases, dot products have larger magnitude, pushing softmax into saturation regions where gradients are tiny.
Solution: Divide by sqrt(d_k) to normalize variance back to ~1
# Without scaling (d_k = 64):
scores = [-12.3, 45.2, 31.1, ...] # Large magnitudes
softmax(scores) → [0.00, 0.99, 0.01, ...] # One dominates, gradients vanish
# With scaling:
scores / 8 = [-1.5, 5.6, 3.9, ...] # Reasonable magnitudes
softmax(scores) → [0.02, 0.72, 0.26, ...] # Smoother, better gradients
Book Reference: “Attention Is All You Need” paper - Section 3.2.1
5. Attention Masks
What is a causal mask? It prevents attending to future positions.
For autoregressive generation (like GPT), token at position t should only see positions 0…t-1.
Implementation: Set future positions to -∞ before softmax
# For 4 tokens, causal mask:
mask = [
[0, -inf, -inf, -inf], # Token 0 can only see itself
[0, 0, -inf, -inf], # Token 1 sees 0 and 1
[0, 0, 0, -inf], # Token 2 sees 0, 1, 2
[0, 0, 0, 0 ] # Token 3 sees all
]
# After softmax: exp(-inf) = 0, so future tokens have zero weight
Book Reference: “Build a Large Language Model” by Sebastian Raschka - Ch. 3.5
Project Specification
What You’ll Build
A pure NumPy implementation of attention that:
- Implements scaled dot-product attention:
Attention(Q, K, V) = softmax(QK^T / sqrt(d_k))V - Supports optional causal masking for autoregressive models
- Implements multi-head attention with configurable number of heads
- Generates visualization heatmaps of attention patterns
Input/Output Specification
# Single-head attention
def attention(Q, K, V, mask=None):
"""
Args:
Q: Query matrix, shape (seq_len, d_k) or (batch, seq_len, d_k)
K: Key matrix, same shape as Q
V: Value matrix, same shape as Q
mask: Optional mask, shape (seq_len, seq_len)
Returns:
output: Attention output, same shape as Q
attention_weights: Softmax weights, shape (seq_len, seq_len)
"""
pass
# Multi-head attention
def multi_head_attention(x, W_q, W_k, W_v, W_o, num_heads, mask=None):
"""
Args:
x: Input embeddings, shape (batch, seq_len, d_model)
W_q, W_k, W_v: Projection matrices, shape (d_model, d_model)
W_o: Output projection, shape (d_model, d_model)
num_heads: Number of attention heads
mask: Optional causal mask
Returns:
output: Multi-head attention output
all_attention_weights: Weights from all heads
"""
pass
Expected Outputs
$ python attention_demo.py
=== Single-Head Attention Demo ===
Input sequence: ["The", "cat", "sat", "on", "the", "mat"]
Embedding dimension: 64
Sequence length: 6
Computing Q, K, V matrices...
✓ Q shape: (6, 64)
✓ K shape: (6, 64)
✓ V shape: (6, 64)
Computing attention scores (QK^T)...
Raw scores shape: (6, 6)
Score matrix:
The cat sat on the mat
The [[45.2 12.3 8.1 3.2 44.8 9.1]
cat [12.1 67.3 34.2 5.1 11.9 28.3]
sat [ 8.3 33.9 52.1 18.2 8.7 41.2]
on [ 3.1 5.2 17.8 29.4 3.3 38.9]
the [44.9 12.1 8.4 3.4 45.1 9.3]
mat [ 9.2 28.1 40.8 38.7 9.4 71.2]]
Scaling by sqrt(d_k) = sqrt(64) = 8.0...
Applying softmax (converting to probabilities)...
Attention weights:
The cat sat on the mat
The [0.42 0.08 0.05 0.01 0.41 0.03] ← "The" attends to itself and "the"
cat [0.03 0.35 0.18 0.01 0.03 0.40] ← "cat" attends to "mat" and itself
sat [0.01 0.15 0.24 0.03 0.01 0.56] ← "sat" attends heavily to "mat"
on [0.01 0.01 0.04 0.17 0.01 0.76] ← "on" attends to "mat"
the [0.42 0.08 0.05 0.01 0.41 0.03]
mat [0.01 0.08 0.19 0.19 0.01 0.52] ← "mat" attends to itself
Computing final output (attention_weights @ V)...
✓ Output shape: (6, 64)
=== Multi-Head Attention Demo ===
Number of heads: 8
Dimension per head: 8
Head 1 attention patterns (syntactic):
"sat" → "cat" (0.62) - subject-verb relationship
"cat" → "sat" (0.38) - verb dependency
Head 2 attention patterns (positional):
Each token attends to neighbors (local window)
Head 3 attention patterns (semantic):
"cat" → "mat" (0.71) - both are nouns
Concatenating head outputs...
Final output shape: (6, 64)
=== Visualization Saved ===
✓ attention_heatmap.png - Shows attention weight matrix
✓ multihead_comparison.png - Shows all 8 head patterns
Solution Architecture
Component Diagram
┌─────────────────────────────────────────────────────────────────┐
│ Attention Module │
│ │
│ ┌──────────┐ ┌──────────┐ ┌──────────┐ │
│ │ Input │───>│ Linear │───>│ Q, K, V │ │
│ │Embeddings│ │Projections│ │ Matrices │ │
│ └──────────┘ └──────────┘ └─────┬────┘ │
│ │ │
│ ┌────────────────────┼────────────────────┐ │
│ │ ▼ │ │
│ │ ┌────────────────────────────────────┐│ │
│ │ │ Attention Computation ││ │
│ │ │ ││ │
│ │ │ scores = Q @ K^T ││ │
│ │ │ scores = scores / sqrt(d_k) ││ │
│ │ │ scores = scores + mask (optional) ││ │
│ │ │ weights = softmax(scores) ││ │
│ │ │ output = weights @ V ││ │
│ │ └──────────────────┬─────────────────┘│ │
│ │ (per head) │ │ │
│ └─────────────────────┼──────────────────┘ │
│ │ │
│ ┌─────────────────────▼──────────────────┐ │
│ │ Concatenate Heads + Output Projection │ │
│ └─────────────────────┬──────────────────┘ │
│ │ │
│ ▼ │
│ ┌──────────────────┐ │
│ │ Final Output │ │
│ └──────────────────┘ │
└─────────────────────────────────────────────────────────────────┘
File Structure
attention_from_scratch/
├── attention.py # Core attention functions
│ ├── softmax() # Numerically stable softmax
│ ├── attention() # Single-head scaled dot-product attention
│ ├── multi_head_attention() # Multi-head attention
│ └── create_causal_mask() # Causal masking helper
├── visualization.py # Attention visualization
│ ├── plot_attention_heatmap()
│ └── plot_multihead_comparison()
├── demo.py # Interactive demo script
├── tests.py # Unit tests
└── outputs/ # Generated visualizations
├── attention_heatmap.png
└── multihead_comparison.png
Data Flow
- Input: Token embeddings (seq_len, d_model)
- Project: Apply W_q, W_k, W_v to get Q, K, V matrices
- Split: Reshape for multi-head: (seq_len, d_model) → (num_heads, seq_len, d_k)
- Attention: For each head, compute scaled dot-product attention
- Concatenate: Merge head outputs back: (num_heads, seq_len, d_k) → (seq_len, d_model)
- Project: Apply W_o output projection
- Output: Attention output (seq_len, d_model)
Implementation Guide
Phase 1: Basic Attention (Days 1-2)
Goal: Implement single-head scaled dot-product attention
import numpy as np
def softmax(x, axis=-1):
"""Numerically stable softmax."""
x_max = np.max(x, axis=axis, keepdims=True)
exp_x = np.exp(x - x_max)
return exp_x / np.sum(exp_x, axis=axis, keepdims=True)
def attention(Q, K, V, mask=None):
"""
Scaled dot-product attention.
Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V
"""
d_k = Q.shape[-1]
# Step 1: Compute attention scores
scores = Q @ K.T # (seq_len, seq_len)
# Step 2: Scale by sqrt(d_k)
scores = scores / np.sqrt(d_k)
# Step 3: Apply mask (optional)
if mask is not None:
scores = scores + mask # mask has -inf where blocked
# Step 4: Softmax to get attention weights
attention_weights = softmax(scores, axis=-1)
# Step 5: Weighted sum of values
output = attention_weights @ V
return output, attention_weights
Checkpoint: Test with 2-token example, verify manually calculated values match.
Phase 2: Causal Masking (Day 3)
Goal: Implement masking for autoregressive models
def create_causal_mask(size):
"""Create causal (autoregressive) mask."""
# Upper triangle should be -inf (can't see future)
mask = np.triu(np.ones((size, size)), k=1) * -1e9
return mask
# Test: After applying mask and softmax, upper triangle should be 0
Phase 3: Multi-Head Attention (Days 4-5)
Goal: Implement parallel attention heads
Key insight: Reshape to process all heads in parallel using NumPy’s vectorization.
def multi_head_attention(x, W_q, W_k, W_v, W_o, num_heads, mask=None):
batch_size, seq_len, d_model = x.shape
d_k = d_model // num_heads
# Project to Q, K, V
Q = x @ W_q # (batch, seq, d_model)
K = x @ W_k
V = x @ W_v
# Reshape: (batch, seq, d_model) → (batch, num_heads, seq, d_k)
Q = Q.reshape(batch_size, seq_len, num_heads, d_k).transpose(0, 2, 1, 3)
K = K.reshape(batch_size, seq_len, num_heads, d_k).transpose(0, 2, 1, 3)
V = V.reshape(batch_size, seq_len, num_heads, d_k).transpose(0, 2, 1, 3)
# Compute attention for all heads (vectorized)
scores = Q @ K.transpose(0, 1, 3, 2) # (batch, heads, seq, seq)
scores = scores / np.sqrt(d_k)
if mask is not None:
scores = scores + mask
weights = softmax(scores, axis=-1)
output = weights @ V # (batch, heads, seq, d_k)
# Concatenate heads: (batch, heads, seq, d_k) → (batch, seq, d_model)
output = output.transpose(0, 2, 1, 3).reshape(batch_size, seq_len, d_model)
# Output projection
output = output @ W_o
return output, weights
Phase 4: Visualization (Days 6-7)
Goal: Create interpretable attention visualizations
import matplotlib.pyplot as plt
import seaborn as sns
def visualize_attention(attention_weights, tokens, save_path='attention.png'):
plt.figure(figsize=(10, 8))
sns.heatmap(
attention_weights,
xticklabels=tokens,
yticklabels=tokens,
cmap='YlOrRd',
annot=True,
fmt='.2f',
cbar_kws={'label': 'Attention Weight'}
)
plt.xlabel('Key Position (attending to)')
plt.ylabel('Query Position (from)')
plt.title('Attention Weight Matrix')
plt.tight_layout()
plt.savefig(save_path, dpi=150)
plt.close()
Testing Strategy
Unit Tests
def test_attention_shapes():
"""Verify output shapes are correct."""
Q = np.random.randn(6, 64)
K = np.random.randn(6, 64)
V = np.random.randn(6, 64)
output, weights = attention(Q, K, V)
assert output.shape == (6, 64), f"Expected (6, 64), got {output.shape}"
assert weights.shape == (6, 6), f"Expected (6, 6), got {weights.shape}"
def test_attention_weights_sum_to_one():
"""Verify softmax produces valid probability distribution."""
Q = np.random.randn(4, 32)
K = np.random.randn(4, 32)
V = np.random.randn(4, 32)
_, weights = attention(Q, K, V)
row_sums = weights.sum(axis=-1)
np.testing.assert_allclose(row_sums, 1.0, rtol=1e-5)
def test_causal_mask_blocks_future():
"""Verify causal mask prevents attending to future."""
mask = create_causal_mask(4)
Q = K = V = np.eye(4) # Identity for easy verification
_, weights = attention(Q, K, V, mask=mask)
# Upper triangle should be zero
upper_triangle = np.triu(weights, k=1)
assert np.allclose(upper_triangle, 0), "Future positions should have zero weight"
def test_numerical_stability():
"""Verify softmax handles large values."""
# Create scores that would overflow without stability fix
Q = np.random.randn(4, 32) * 100 # Large values
K = np.random.randn(4, 32) * 100
V = np.random.randn(4, 32)
output, weights = attention(Q, K, V)
assert not np.any(np.isnan(output)), "Output should not contain NaN"
assert not np.any(np.isinf(output)), "Output should not contain Inf"
Manual Verification
Trace this by hand before coding:
# Simplified: 2 tokens, 4 dimensions
Q = [[1, 0, 1, 0], # Token 1 query
[0, 1, 0, 1]] # Token 2 query
K = [[1, 0, 1, 0], # Token 1 key
[0, 1, 0, 1]] # Token 2 key
V = [[10, 20, 30, 40], # Token 1 value
[5, 15, 25, 35]] # Token 2 value
# Step 1: QK^T
# Q[0] · K[0] = 1*1 + 0*0 + 1*1 + 0*0 = 2
# Q[0] · K[1] = 0
# Q[1] · K[0] = 0
# Q[1] · K[1] = 2
scores = [[2, 0],
[0, 2]]
# Step 2: Scale by sqrt(4) = 2
scaled = [[1, 0],
[0, 1]]
# Step 3: Softmax
# Row 1: [exp(1), exp(0)] / (exp(1)+exp(0)) ≈ [0.73, 0.27]
attention_weights = [[0.73, 0.27],
[0.27, 0.73]]
# Step 4: Output = weights @ V
# Output[0] = 0.73 * [10,20,30,40] + 0.27 * [5,15,25,35]
# = [8.65, 18.65, 28.65, 38.65]
Common Pitfalls
1. Forgetting to Scale
Problem: Softmax gradients vanish for large scores
# Wrong
scores = Q @ K.T
weights = softmax(scores)
# Right
scores = Q @ K.T / np.sqrt(d_k)
weights = softmax(scores)
2. Wrong Transpose for Multi-Head
Problem: Incorrect dimension ordering after split
# Wrong: Transpose after reshape doesn't give (batch, heads, seq, d_k)
Q = Q.reshape(batch, num_heads, seq_len, d_k) # Wrong order!
# Right: Reshape then transpose
Q = Q.reshape(batch, seq_len, num_heads, d_k).transpose(0, 2, 1, 3)
3. Mask Has Wrong Sign
Problem: Using positive infinity instead of negative
# Wrong: Positive values boost attention
mask = np.triu(np.ones((n, n)), k=1) * 1e9
# Right: Negative infinity blocks attention
mask = np.triu(np.ones((n, n)), k=1) * -1e9
4. Broadcasting Errors in Batched Attention
Problem: Mask doesn’t broadcast correctly
# Mask shape should be (1, 1, seq, seq) for batched attention
# to broadcast across (batch, heads, seq, seq)
mask = mask[np.newaxis, np.newaxis, :, :]
Extensions and Challenges
Challenge 1: Relative Position Encodings
Implement relative positional bias like T5 or ALiBi:
def alibi_mask(seq_len, num_heads):
"""ALiBi: Attention with Linear Biases."""
# Each head has different slope for distance penalty
slopes = 2 ** (-(8 / num_heads) * np.arange(1, num_heads + 1))
positions = np.abs(np.arange(seq_len)[:, None] - np.arange(seq_len))
return -slopes[:, None, None] * positions
Challenge 2: Grouped Query Attention
Implement GQA where multiple query heads share key-value heads:
def grouped_query_attention(Q, K, V, num_query_heads, num_kv_heads):
"""GQA: num_query_heads queries share num_kv_heads keys/values."""
# Repeat K, V to match query heads
repeat_factor = num_query_heads // num_kv_heads
K = np.repeat(K, repeat_factor, axis=1)
V = np.repeat(V, repeat_factor, axis=1)
# Continue with standard attention...
Challenge 3: Attention with Linear Complexity
Implement Performer’s FAVOR+ or Linear Attention:
def linear_attention(Q, K, V):
"""O(n) attention using kernel approximation."""
# Use feature map φ such that softmax(QK^T) ≈ φ(Q) @ φ(K)^T
def feature_map(x):
return np.exp(x - np.max(x, axis=-1, keepdims=True))
Q_prime = feature_map(Q)
K_prime = feature_map(K)
# Compute in O(n) instead of O(n²)
KV = K_prime.T @ V # (d, d)
return Q_prime @ KV # (n, d)
Interview Questions
Conceptual Questions
-
“Explain the attention mechanism in one sentence.”
Answer: “Attention computes a weighted average of values, where weights are learned by comparing how well queries match keys.”
-
“Why do we need three separate matrices Q, K, V instead of using the same matrix?”
Answer: “Separating them allows the model to learn different representations: Q for ‘what to look for’, K for ‘what’s available’, V for ‘what to retrieve’. Using different weight matrices (W_Q, W_K, W_V) gives the model more flexibility to learn these roles independently.”
-
“What’s the time complexity of attention, and why?”
Answer: “O(N²d) where N is sequence length, d is dimension. The N² comes from QK^T creating an N×N attention matrix, and d is the dimension of each vector.”
-
“How does multi-head attention differ from single-head?”
Answer: “Multi-head runs h parallel attention mechanisms, each with lower dimension (d/h). This lets the model learn different types of relationships (syntax, semantics, position) simultaneously.”
-
“What would happen if you removed the sqrt(d_k) scaling?”
Answer: “For large d_k, dot products would have large variance. This pushes softmax into regions with very small gradients (saturation), making training difficult.”
-
“How do you prevent GPT from looking at future tokens?”
Answer: “Apply a causal mask: set attention scores to -infinity for all future positions before softmax. After softmax, these become zero, eliminating future information.”
-
“What’s the difference between self-attention and cross-attention?”
Answer: “Self-attention: Q, K, V all come from the same sequence. Cross-attention: Q from one sequence, K and V from another (e.g., encoder-decoder models).”
Coding Questions
-
“Implement softmax in a numerically stable way.”
def softmax(x, axis=-1): x_max = np.max(x, axis=axis, keepdims=True) exp_x = np.exp(x - x_max) # Subtract max for stability return exp_x / np.sum(exp_x, axis=axis, keepdims=True) -
“How would you debug attention that’s giving uniform weights?”
Answer: Check that Q and K are actually different (if identical to input, all dot products similar). Verify scaling is applied. Check for temperature (if too high, softmax is uniform). Visualize the attention pattern to see if it’s learning anything.
Resources
Primary Reading
| Topic | Book | Chapter |
|---|---|---|
| Attention mechanism fundamentals | “Build a Large Language Model (From Scratch)” by Sebastian Raschka | Ch. 3: “Coding Attention Mechanisms” |
| Mathematical foundations | “The Illustrated Transformer” by Jay Alammar | Full article |
| Linear algebra for attention | “Introduction to Linear Algebra” by Gilbert Strang | Ch. 1-3 |
| Softmax and numerical stability | “Deep Learning” by Goodfellow, Bengio, Courville | Ch. 6.2 |
| Original transformer paper | “Attention Is All You Need” by Vaswani et al. | Section 3.2 |
Interactive Resources
- Transformer Explainer: https://poloclub.github.io/transformer-explainer/
- Jay Alammar’s Illustrated Transformer: https://jalammar.github.io/illustrated-transformer/
- 3Blue1Brown: Attention in Transformers: YouTube visual explanation
Code References
- Harvard NLP Annotated Transformer: http://nlp.seas.harvard.edu/annotated-transformer/
- Hugging Face Transformers source code:
modeling_utils.py,modeling_gpt2.py
Self-Assessment Checklist
Before moving to Project 2, verify you can:
- Explain why QK^T computes pairwise similarities
- Implement numerically stable softmax from scratch
- Create a causal mask and verify it blocks future tokens
- Implement multi-head attention with correct reshaping
- Visualize attention weights and interpret the patterns
- Answer all interview questions confidently
- Trace attention computation by hand for a 2-token example
- Explain the purpose of the sqrt(d_k) scaling factor
Next Project: P02 - Implement Flash Attention