Project 1: Build Attention from Scratch (Pure NumPy)
Project Metadata
| Attribute | Value |
|---|---|
| Difficulty | Level 4: Expert |
| Time Estimate | 1-2 weeks |
| Prerequisites | Python, NumPy, linear algebra (matrix multiplication, dot products) |
| Main Programming Language | Python |
| Alternative Languages | Rust, C++ |
| Primary Tool | NumPy, Matplotlib |
| Main Book | "Build a Large Language Model (From Scratch)" by Sebastian Raschka |
| Knowledge Area | Deep Learning / Linear Algebra |
Learning Objectives
By completing this project, you will be able to:
- Implement scaled dot-product attention from first principles using only NumPy
- Explain mathematically why we compute QK^T / sqrt(d_k) and what each operation accomplishes
- Create attention masks for causal language modeling that prevent future token leakage
- Implement multi-head attention and understand how parallel attention subspaces work
- Visualize attention patterns to debug and understand what the model "looks at"
- Answer interview questions about attention with deep mathematical understanding
Theoretical Foundation
The Core Question You're Answering
"How does a neural network know which words to 'pay attention to' when processing a sentence, and how is this implemented mathematically from first principles?"
This project forces you to answer:
- Why do we need three matrices (Q, K, V) instead of just comparing tokens directly?
- What does the softmax actually do, and why is it essential?
- How does scaling by sqrt(d_k) prevent gradient problems?
- Why does matrix multiplication QK^T give us relevance scores?
Most people use torch.nn.MultiheadAttention as a black box. After this project, you'll understand every operation inside that black box and could implement it yourself in any language.
Concepts You Must Understand First
Stop and research these before coding:
1. Matrix Multiplication Semantics
What does Q @ K^T actually compute? (Hint: dot product between every query and every key)
Consider two matrices:
- Q with shape (N, d): N queries, each of dimension d
- K with shape (N, d): N keys, each of dimension d
When we compute Q @ K^T:
- K^T has shape (d, N)
- Result has shape (N, N)
- Element [i, j] = dot product of query i with key j
This is computing all pairwise similarities between queries and keys in one matrix operation.
Book Reference: "Introduction to Linear Algebra" by Gilbert Strang - Ch. 1-2
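To see this concretely, here is a small NumPy check (an illustrative sketch, not part of the project code) confirming that element [i, j] of Q @ K.T is the dot product of query i with key j:

import numpy as np

rng = np.random.default_rng(0)
Q = rng.standard_normal((3, 4))   # 3 queries, each of dimension 4
K = rng.standard_normal((3, 4))   # 3 keys, each of dimension 4

scores = Q @ K.T                  # shape (3, 3): every pairwise query-key score
# Element [i, j] equals the dot product of query i with key j
assert np.allclose(scores[1, 2], np.dot(Q[1], K[2]))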
2. Dot Product as Similarity
Why is the dot product between two vectors a measure of their similarity?
dot(a, b) = |a| × |b| × cos(θ)
where θ is the angle between vectors a and b
- Vectors pointing in the same direction: cos(0°) = 1 → large positive dot product
- Orthogonal vectors: cos(90°) = 0 → dot product = 0
- Opposite vectors: cos(180°) = -1 → large negative dot product
The dot product captures both magnitude and directional alignment.
Book Reference: "Mathematics for Machine Learning" by Deisenroth et al. - Ch. 3.1
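A quick numeric illustration of the same idea (a throwaway sketch; the vectors are chosen arbitrarily):

import numpy as np

a = np.array([1.0, 0.0])
b_same = np.array([2.0, 0.0])      # same direction as a
b_orth = np.array([0.0, 3.0])      # orthogonal to a
b_opp  = np.array([-2.0, 0.0])     # opposite direction

print(np.dot(a, b_same))   # 2.0  -> positive: aligned
print(np.dot(a, b_orth))   # 0.0  -> orthogonal
print(np.dot(a, b_opp))    # -2.0 -> negative: opposed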
3. Softmax Function
What does softmax do? It converts arbitrary numbers to probabilities that sum to 1.
softmax(x_i) = exp(x_i) / Σ_j exp(x_j)
Why softmax instead of just normalization?
- Softmax emphasizes larger values exponentially
- Small differences become big differences after exp()
- This creates a "winner-take-more" dynamic
Example:
scores = [2.0, 1.0, 0.5]
# Simple normalization (dividing by sum):
# [2.0/3.5, 1.0/3.5, 0.5/3.5] = [0.57, 0.29, 0.14]
# Softmax:
# exp([2.0, 1.0, 0.5]) = [7.39, 2.72, 1.65]
# Sum = 11.76
# [0.63, 0.23, 0.14] # Notice: gap between first two increased
Book Reference: "Deep Learning" by Goodfellow et al. - Ch. 6.2.2.3
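You can reproduce the numbers above with a few lines of NumPy (illustrative sketch):

import numpy as np

scores = np.array([2.0, 1.0, 0.5])
simple = scores / scores.sum()                  # plain normalization
soft   = np.exp(scores) / np.exp(scores).sum()  # softmax
print(simple.round(2))   # [0.57 0.29 0.14]
print(soft.round(2))     # [0.63 0.23 0.14]  - larger scores get emphasized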
4. Why Scaling by sqrt(d_k)?
What happens to dot products as dimension increases?
For two random vectors in d dimensions whose components are drawn independently from a standard normal distribution, the dot product has:
- Mean: 0
- Variance: d (each of the d component-wise products contributes variance 1)
Problem: As d increases, dot products have larger magnitude, pushing softmax into saturation regions where gradients are tiny.
Solution: Divide by sqrt(d_k) to normalize variance back to ~1
# Without scaling (d_k = 64):
scores = [-12.3, 45.2, 31.1, ...] # Large magnitudes
softmax(scores) ≈ [0.00, 0.99, 0.01, ...] # One dominates, gradients vanish
# With scaling:
scores / 8 = [-1.5, 5.6, 3.9, ...] # Reasonable magnitudes
softmax(scores) ≈ [0.02, 0.72, 0.26, ...] # Smoother, better gradients
Book Reference: "Attention Is All You Need" paper - Section 3.2.1
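The variance claim is easy to verify empirically. The sketch below (assuming standard-normal components and a few arbitrary dimensions) shows the raw dot-product variance growing roughly like d while the scaled version stays near 1:

import numpy as np

rng = np.random.default_rng(0)
for d in (4, 64, 512):
    q = rng.standard_normal((10000, d))
    k = rng.standard_normal((10000, d))
    dots = np.sum(q * k, axis=1)                 # 10,000 sample dot products
    print(d, dots.var().round(1), (dots / np.sqrt(d)).var().round(2))
# Raw variance is roughly d; after dividing by sqrt(d) it is roughly 1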
5. Attention Masks
What is a causal mask? It prevents attending to future positions.
For autoregressive generation (like GPT), the token at position t should only attend to positions 0…t (itself and earlier), never to future positions.
Implementation: Set future positions to -∞ before softmax
# For 4 tokens, causal mask:
mask = [
[0, -inf, -inf, -inf], # Token 0 can only see itself
[0, 0, -inf, -inf], # Token 1 sees 0 and 1
[0, 0, 0, -inf], # Token 2 sees 0, 1, 2
[0, 0, 0, 0 ] # Token 3 sees all
]
# After softmax: exp(-inf) = 0, so future tokens have zero weight
Book Reference: "Build a Large Language Model" by Sebastian Raschka - Ch. 3.5
Project Specification
What Youโll Build
A pure NumPy implementation of attention that:
- Implements scaled dot-product attention: Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V
- Supports optional causal masking for autoregressive models
- Implements multi-head attention with configurable number of heads
- Generates visualization heatmaps of attention patterns
Input/Output Specification
# Single-head attention
def attention(Q, K, V, mask=None):
"""
Args:
Q: Query matrix, shape (seq_len, d_k) or (batch, seq_len, d_k)
K: Key matrix, same shape as Q
V: Value matrix, same shape as Q
mask: Optional mask, shape (seq_len, seq_len)
Returns:
output: Attention output, same shape as Q
attention_weights: Softmax weights, shape (seq_len, seq_len)
"""
pass
# Multi-head attention
def multi_head_attention(x, W_q, W_k, W_v, W_o, num_heads, mask=None):
"""
Args:
x: Input embeddings, shape (batch, seq_len, d_model)
W_q, W_k, W_v: Projection matrices, shape (d_model, d_model)
W_o: Output projection, shape (d_model, d_model)
num_heads: Number of attention heads
mask: Optional causal mask
Returns:
output: Multi-head attention output
all_attention_weights: Weights from all heads
"""
pass
Expected Outputs
$ python attention_demo.py
=== Single-Head Attention Demo ===
Input sequence: ["The", "cat", "sat", "on", "the", "mat"]
Embedding dimension: 64
Sequence length: 6
Computing Q, K, V matrices...
✓ Q shape: (6, 64)
✓ K shape: (6, 64)
✓ V shape: (6, 64)
Computing attention scores (QK^T)...
Raw scores shape: (6, 6)
Score matrix:
The cat sat on the mat
The [[45.2 12.3 8.1 3.2 44.8 9.1]
cat [12.1 67.3 34.2 5.1 11.9 28.3]
sat [ 8.3 33.9 52.1 18.2 8.7 41.2]
on [ 3.1 5.2 17.8 29.4 3.3 38.9]
the [44.9 12.1 8.4 3.4 45.1 9.3]
mat [ 9.2 28.1 40.8 38.7 9.4 71.2]]
Scaling by sqrt(d_k) = sqrt(64) = 8.0...
Applying softmax (converting to probabilities)...
Attention weights:
The cat sat on the mat
The [0.42 0.08 0.05 0.01 0.41 0.03] ← "The" attends to itself and "the"
cat [0.03 0.35 0.18 0.01 0.03 0.40] ← "cat" attends to "mat" and itself
sat [0.01 0.15 0.24 0.03 0.01 0.56] ← "sat" attends heavily to "mat"
on [0.01 0.01 0.04 0.17 0.01 0.76] ← "on" attends to "mat"
the [0.42 0.08 0.05 0.01 0.41 0.03]
mat [0.01 0.08 0.19 0.19 0.01 0.52] ← "mat" attends to itself
Computing final output (attention_weights @ V)...
✓ Output shape: (6, 64)
=== Multi-Head Attention Demo ===
Number of heads: 8
Dimension per head: 8
Head 1 attention patterns (syntactic):
"sat" โ "cat" (0.62) - subject-verb relationship
"cat" โ "sat" (0.38) - verb dependency
Head 2 attention patterns (positional):
Each token attends to neighbors (local window)
Head 3 attention patterns (semantic):
"cat" โ "mat" (0.71) - both are nouns
Concatenating head outputs...
Final output shape: (6, 64)
=== Visualization Saved ===
✓ attention_heatmap.png - Shows attention weight matrix
✓ multihead_comparison.png - Shows all 8 head patterns
Solution Architecture
Component Diagram
Attention Module

  Input Embeddings
        |
        v
  Linear Projections (W_q, W_k, W_v)
        |
        v
  Q, K, V Matrices
        |
        v
  Attention Computation (per head):
      scores  = Q @ K^T
      scores  = scores / sqrt(d_k)
      scores  = scores + mask   (optional)
      weights = softmax(scores)
      output  = weights @ V
        |
        v
  Concatenate Heads + Output Projection
        |
        v
  Final Output
File Structure
attention_from_scratch/
├── attention.py               # Core attention functions
│   ├── softmax()              # Numerically stable softmax
│   ├── attention()            # Single-head scaled dot-product attention
│   ├── multi_head_attention() # Multi-head attention
│   └── create_causal_mask()   # Causal masking helper
├── visualization.py           # Attention visualization
│   ├── plot_attention_heatmap()
│   └── plot_multihead_comparison()
├── demo.py                    # Interactive demo script
├── tests.py                   # Unit tests
└── outputs/                   # Generated visualizations
    ├── attention_heatmap.png
    └── multihead_comparison.png
Data Flow
- Input: Token embeddings (seq_len, d_model)
- Project: Apply W_q, W_k, W_v to get Q, K, V matrices
- Split: Reshape for multi-head: (seq_len, d_model) → (num_heads, seq_len, d_k)
- Attention: For each head, compute scaled dot-product attention
- Concatenate: Merge head outputs back: (num_heads, seq_len, d_k) → (seq_len, d_model)
- Project: Apply W_o output projection
- Output: Attention output (seq_len, d_model)
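As a sanity check on these shapes, here is a short trace with illustrative sizes (d_model = 64, 8 heads, 6 tokens); the names are hypothetical and this is not one of the required project files:

import numpy as np

seq_len, d_model, num_heads = 6, 64, 8
d_k = d_model // num_heads                   # 8 dimensions per head

x = np.random.randn(seq_len, d_model)        # input embeddings
W_q = np.random.randn(d_model, d_model)
Q = x @ W_q                                  # (6, 64) after projection
Q_heads = Q.reshape(seq_len, num_heads, d_k).transpose(1, 0, 2)
print(Q_heads.shape)                         # (8, 6, 8): (num_heads, seq_len, d_k)
merged = Q_heads.transpose(1, 0, 2).reshape(seq_len, d_model)
print(np.allclose(merged, Q))                # True: split and merge are inverses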
Implementation Guide
Phase 1: Basic Attention (Days 1-2)
Goal: Implement single-head scaled dot-product attention
import numpy as np
def softmax(x, axis=-1):
"""Numerically stable softmax."""
x_max = np.max(x, axis=axis, keepdims=True)
exp_x = np.exp(x - x_max)
return exp_x / np.sum(exp_x, axis=axis, keepdims=True)
def attention(Q, K, V, mask=None):
"""
Scaled dot-product attention.
Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V
"""
d_k = Q.shape[-1]
# Step 1: Compute attention scores
scores = Q @ K.T # (seq_len, seq_len)
# Step 2: Scale by sqrt(d_k)
scores = scores / np.sqrt(d_k)
# Step 3: Apply mask (optional)
if mask is not None:
scores = scores + mask # mask has -inf where blocked
# Step 4: Softmax to get attention weights
attention_weights = softmax(scores, axis=-1)
# Step 5: Weighted sum of values
output = attention_weights @ V
return output, attention_weights
Checkpoint: Test with 2-token example, verify manually calculated values match.
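One way to automate that checkpoint, assuming the attention() function above and the hand-traced values from the Manual Verification section later in this guide:

import numpy as np

Q = K = np.array([[1., 0., 1., 0.],
                  [0., 1., 0., 1.]])
V = np.array([[10., 20., 30., 40.],
              [ 5., 15., 25., 35.]])
output, weights = attention(Q, K, V)
# Scaled scores are [[1, 0], [0, 1]]; softmax of each row is ~[0.73, 0.27]
np.testing.assert_allclose(weights, [[0.731, 0.269], [0.269, 0.731]], atol=1e-3)
np.testing.assert_allclose(output[0], [8.65, 18.65, 28.65, 38.65], atol=1e-2)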
Phase 2: Causal Masking (Day 3)
Goal: Implement masking for autoregressive models
def create_causal_mask(size):
"""Create causal (autoregressive) mask."""
# Upper triangle should be -inf (can't see future)
mask = np.triu(np.ones((size, size)), k=1) * -1e9
return mask
# Test: After applying mask and softmax, upper triangle should be 0
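A minimal version of that test, assuming the attention() and create_causal_mask() functions defined above:

import numpy as np

def test_causal_mask_zeroes_upper_triangle():
    mask = create_causal_mask(4)
    Q = np.random.randn(4, 8)
    K = np.random.randn(4, 8)
    V = np.random.randn(4, 8)
    _, weights = attention(Q, K, V, mask=mask)
    # Every entry above the diagonal must be (numerically) zero after softmax
    assert np.allclose(np.triu(weights, k=1), 0.0, atol=1e-6)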
Phase 3: Multi-Head Attention (Days 4-5)
Goal: Implement parallel attention heads
Key insight: Reshape to process all heads in parallel using NumPy's vectorization.
def multi_head_attention(x, W_q, W_k, W_v, W_o, num_heads, mask=None):
batch_size, seq_len, d_model = x.shape
d_k = d_model // num_heads
# Project to Q, K, V
Q = x @ W_q # (batch, seq, d_model)
K = x @ W_k
V = x @ W_v
# Reshape: (batch, seq, d_model) → (batch, num_heads, seq, d_k)
Q = Q.reshape(batch_size, seq_len, num_heads, d_k).transpose(0, 2, 1, 3)
K = K.reshape(batch_size, seq_len, num_heads, d_k).transpose(0, 2, 1, 3)
V = V.reshape(batch_size, seq_len, num_heads, d_k).transpose(0, 2, 1, 3)
# Compute attention for all heads (vectorized)
scores = Q @ K.transpose(0, 1, 3, 2) # (batch, heads, seq, seq)
scores = scores / np.sqrt(d_k)
if mask is not None:
scores = scores + mask
weights = softmax(scores, axis=-1)
output = weights @ V # (batch, heads, seq, d_k)
# Concatenate heads: (batch, heads, seq, d_k) → (batch, seq, d_model)
output = output.transpose(0, 2, 1, 3).reshape(batch_size, seq_len, d_model)
# Output projection
output = output @ W_o
return output, weights
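A quick smoke test of the function above with randomly initialized projection matrices (illustrative values only; shapes follow the docstring):

import numpy as np

batch, seq_len, d_model, num_heads = 2, 6, 64, 8
rng = np.random.default_rng(0)
x   = rng.standard_normal((batch, seq_len, d_model))
W_q = rng.standard_normal((d_model, d_model)) * 0.1
W_k = rng.standard_normal((d_model, d_model)) * 0.1
W_v = rng.standard_normal((d_model, d_model)) * 0.1
W_o = rng.standard_normal((d_model, d_model)) * 0.1

out, w = multi_head_attention(x, W_q, W_k, W_v, W_o, num_heads)
print(out.shape)  # (2, 6, 64)
print(w.shape)    # (2, 8, 6, 6): one attention matrix per head per batch item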
Phase 4: Visualization (Days 6-7)
Goal: Create interpretable attention visualizations
import matplotlib.pyplot as plt
import seaborn as sns
def visualize_attention(attention_weights, tokens, save_path='attention.png'):
plt.figure(figsize=(10, 8))
sns.heatmap(
attention_weights,
xticklabels=tokens,
yticklabels=tokens,
cmap='YlOrRd',
annot=True,
fmt='.2f',
cbar_kws={'label': 'Attention Weight'}
)
plt.xlabel('Key Position (attending to)')
plt.ylabel('Query Position (from)')
plt.title('Attention Weight Matrix')
plt.tight_layout()
plt.savefig(save_path, dpi=150)
plt.close()
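As a usage example, this could be called on single-head weights from Phase 1 (hypothetical tokens and random stand-in embeddings; assumes the attention() function defined above):

import numpy as np

tokens = ["The", "cat", "sat", "on", "the", "mat"]
rng = np.random.default_rng(0)
emb = rng.standard_normal((len(tokens), 64))   # stand-in embeddings
_, weights = attention(emb, emb, emb)          # use embeddings directly as Q, K, V for a quick demo
visualize_attention(weights, tokens, save_path='attention_heatmap.png')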
Testing Strategy
Unit Tests
def test_attention_shapes():
"""Verify output shapes are correct."""
Q = np.random.randn(6, 64)
K = np.random.randn(6, 64)
V = np.random.randn(6, 64)
output, weights = attention(Q, K, V)
assert output.shape == (6, 64), f"Expected (6, 64), got {output.shape}"
assert weights.shape == (6, 6), f"Expected (6, 6), got {weights.shape}"
def test_attention_weights_sum_to_one():
"""Verify softmax produces valid probability distribution."""
Q = np.random.randn(4, 32)
K = np.random.randn(4, 32)
V = np.random.randn(4, 32)
_, weights = attention(Q, K, V)
row_sums = weights.sum(axis=-1)
np.testing.assert_allclose(row_sums, 1.0, rtol=1e-5)
def test_causal_mask_blocks_future():
"""Verify causal mask prevents attending to future."""
mask = create_causal_mask(4)
Q = K = V = np.eye(4) # Identity for easy verification
_, weights = attention(Q, K, V, mask=mask)
# Upper triangle should be zero
upper_triangle = np.triu(weights, k=1)
assert np.allclose(upper_triangle, 0), "Future positions should have zero weight"
def test_numerical_stability():
"""Verify softmax handles large values."""
# Create scores that would overflow without stability fix
Q = np.random.randn(4, 32) * 100 # Large values
K = np.random.randn(4, 32) * 100
V = np.random.randn(4, 32)
output, weights = attention(Q, K, V)
assert not np.any(np.isnan(output)), "Output should not contain NaN"
assert not np.any(np.isinf(output)), "Output should not contain Inf"
Manual Verification
Trace this by hand before coding:
# Simplified: 2 tokens, 4 dimensions
Q = [[1, 0, 1, 0], # Token 1 query
[0, 1, 0, 1]] # Token 2 query
K = [[1, 0, 1, 0], # Token 1 key
[0, 1, 0, 1]] # Token 2 key
V = [[10, 20, 30, 40], # Token 1 value
[5, 15, 25, 35]] # Token 2 value
# Step 1: QK^T
# Q[0] ยท K[0] = 1*1 + 0*0 + 1*1 + 0*0 = 2
# Q[0] ยท K[1] = 0
# Q[1] ยท K[0] = 0
# Q[1] ยท K[1] = 2
scores = [[2, 0],
[0, 2]]
# Step 2: Scale by sqrt(4) = 2
scaled = [[1, 0],
[0, 1]]
# Step 3: Softmax
# Row 1: [exp(1), exp(0)] / (exp(1)+exp(0)) ≈ [0.73, 0.27]
attention_weights = [[0.73, 0.27],
[0.27, 0.73]]
# Step 4: Output = weights @ V
# Output[0] = 0.73 * [10,20,30,40] + 0.27 * [5,15,25,35]
# = [8.65, 18.65, 28.65, 38.65]
Common Pitfalls
1. Forgetting to Scale
Problem: Softmax gradients vanish for large scores
# Wrong
scores = Q @ K.T
weights = softmax(scores)
# Right
scores = Q @ K.T / np.sqrt(d_k)
weights = softmax(scores)
2. Wrong Transpose for Multi-Head
Problem: Incorrect dimension ordering after split
# Wrong: Reshaping straight to (batch, num_heads, seq_len, d_k) scrambles which values belong to which head
Q = Q.reshape(batch, num_heads, seq_len, d_k) # Wrong order!
# Right: Reshape then transpose
Q = Q.reshape(batch, seq_len, num_heads, d_k).transpose(0, 2, 1, 3)
3. Mask Has Wrong Sign
Problem: Using positive infinity instead of negative
# Wrong: Positive values boost attention
mask = np.triu(np.ones((n, n)), k=1) * 1e9
# Right: Negative infinity blocks attention
mask = np.triu(np.ones((n, n)), k=1) * -1e9
4. Broadcasting Errors in Batched Attention
Problem: Mask doesnโt broadcast correctly
# Mask shape should be (1, 1, seq, seq) for batched attention
# to broadcast across (batch, heads, seq, seq)
mask = mask[np.newaxis, np.newaxis, :, :]
Extensions and Challenges
Challenge 1: Relative Position Encodings
Implement relative positional bias like T5 or ALiBi:
def alibi_mask(seq_len, num_heads):
"""ALiBi: Attention with Linear Biases."""
# Each head has different slope for distance penalty
slopes = 2 ** (-(8 / num_heads) * np.arange(1, num_heads + 1))
positions = np.abs(np.arange(seq_len)[:, None] - np.arange(seq_len))
return -slopes[:, None, None] * positions
Challenge 2: Grouped Query Attention
Implement GQA where multiple query heads share key-value heads:
def grouped_query_attention(Q, K, V, num_query_heads, num_kv_heads):
"""GQA: num_query_heads queries share num_kv_heads keys/values."""
# Repeat K, V to match query heads
repeat_factor = num_query_heads // num_kv_heads
K = np.repeat(K, repeat_factor, axis=1)
V = np.repeat(V, repeat_factor, axis=1)
# Continue with standard attention...
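One possible way to finish that sketch, assuming Q has shape (batch, num_query_heads, seq, d_k), K and V have shape (batch, num_kv_heads, seq, d_k), and reusing the softmax() helper from Phase 1 (a rough sketch, not a reference implementation):

import numpy as np

def grouped_query_attention(Q, K, V, num_query_heads, num_kv_heads):
    """GQA sketch: num_query_heads query heads share num_kv_heads key/value heads."""
    d_k = Q.shape[-1]
    repeat_factor = num_query_heads // num_kv_heads
    # Repeat each K/V head so every query head has a matching key/value head
    K = np.repeat(K, repeat_factor, axis=1)        # (batch, num_query_heads, seq, d_k)
    V = np.repeat(V, repeat_factor, axis=1)
    scores = Q @ K.transpose(0, 1, 3, 2) / np.sqrt(d_k)
    weights = softmax(scores, axis=-1)
    return weights @ V, weights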
Challenge 3: Attention with Linear Complexity
Implement Performer's FAVOR+ or Linear Attention:
def linear_attention(Q, K, V):
    """O(n) attention using kernel approximation."""
    # Use a feature map φ such that softmax(QK^T) ≈ φ(Q) @ φ(K)^T (after row normalization)
    def feature_map(x):
        return np.exp(x - np.max(x, axis=-1, keepdims=True))
    Q_prime = feature_map(Q)   # (n, d)
    K_prime = feature_map(K)   # (n, d)
    # Associativity: (Q' K'^T) V == Q' (K'^T V), so the n x n matrix is never formed
    KV = K_prime.T @ V                                    # (d, d)
    normalizer = Q_prime @ K_prime.sum(axis=0)[:, None]   # (n, 1) row-wise denominator
    return (Q_prime @ KV) / normalizer                    # (n, d), computed in O(n·d²) instead of O(n²·d)
Interview Questions
Conceptual Questions
- "Explain the attention mechanism in one sentence."
Answer: "Attention computes a weighted average of values, where the weights are learned by comparing how well queries match keys."
- "Why do we need three separate matrices Q, K, V instead of using the same matrix?"
Answer: "Separating them allows the model to learn different representations: Q for 'what to look for', K for 'what's available', V for 'what to retrieve'. Using different weight matrices (W_Q, W_K, W_V) gives the model more flexibility to learn these roles independently."
- "What's the time complexity of attention, and why?"
Answer: "O(N²d), where N is the sequence length and d is the dimension. The N² comes from QK^T creating an N×N attention matrix, and each entry is a dot product of d-dimensional vectors."
- "How does multi-head attention differ from single-head?"
Answer: "Multi-head attention runs h parallel attention mechanisms, each with lower dimension (d/h). This lets the model learn different types of relationships (syntax, semantics, position) simultaneously."
- "What would happen if you removed the sqrt(d_k) scaling?"
Answer: "For large d_k, dot products would have large variance. This pushes softmax into regions with very small gradients (saturation), making training difficult."
- "How do you prevent GPT from looking at future tokens?"
Answer: "Apply a causal mask: set attention scores to -infinity for all future positions before the softmax. After the softmax these become zero, eliminating future information."
- "What's the difference between self-attention and cross-attention?"
Answer: "Self-attention: Q, K, V all come from the same sequence. Cross-attention: Q comes from one sequence, K and V from another (e.g., encoder-decoder models)."
Coding Questions
- "Implement softmax in a numerically stable way."

def softmax(x, axis=-1):
    x_max = np.max(x, axis=axis, keepdims=True)
    exp_x = np.exp(x - x_max)  # Subtract max for stability
    return exp_x / np.sum(exp_x, axis=axis, keepdims=True)

- "How would you debug attention that's giving uniform weights?"
Answer: Check that Q and K are actually different (if they are identical to the input, all dot products look similar). Verify the sqrt(d_k) scaling is applied. Check the temperature (if it is too high, softmax becomes uniform). Visualize the attention pattern to see whether it is learning anything.
Resources
Primary Reading
| Topic | Source | Chapter |
|---|---|---|
| Attention mechanism fundamentals | "Build a Large Language Model (From Scratch)" by Sebastian Raschka | Ch. 3: "Coding Attention Mechanisms" |
| Mathematical foundations | "The Illustrated Transformer" by Jay Alammar | Full article |
| Linear algebra for attention | "Introduction to Linear Algebra" by Gilbert Strang | Ch. 1-3 |
| Softmax and numerical stability | "Deep Learning" by Goodfellow, Bengio, Courville | Ch. 6.2 |
| Original transformer paper | "Attention Is All You Need" by Vaswani et al. | Section 3.2 |
Interactive Resources
- Transformer Explainer: https://poloclub.github.io/transformer-explainer/
- Jay Alammar's Illustrated Transformer: https://jalammar.github.io/illustrated-transformer/
- 3Blue1Brown: Attention in Transformers: YouTube visual explanation
Code References
- Harvard NLP Annotated Transformer: http://nlp.seas.harvard.edu/annotated-transformer/
- Hugging Face Transformers source code:
modeling_utils.py, modeling_gpt2.py
Self-Assessment Checklist
Before moving to Project 2, verify you can:
- Explain why QK^T computes pairwise similarities
- Implement numerically stable softmax from scratch
- Create a causal mask and verify it blocks future tokens
- Implement multi-head attention with correct reshaping
- Visualize attention weights and interpret the patterns
- Answer all interview questions confidently
- Trace attention computation by hand for a 2-token example
- Explain the purpose of the sqrt(d_k) scaling factor
Next Project: P02 - Implement Flash Attention