Project 1: Build Attention from Scratch (Pure NumPy)

Project Metadata

Attribute	Value
Difficulty	Level 4: Expert
Time Estimate	1-2 weeks
Prerequisites	Python, NumPy, linear algebra (matrix multiplication, dot products)
Main Programming Language	Python
Alternative Languages	Rust, C++
Primary Tool	NumPy, Matplotlib
Main Book	“Build a Large Language Model (From Scratch)” by Sebastian Raschka
Knowledge Area	Deep Learning / Linear Algebra

Learning Objectives

By completing this project, you will be able to:

Implement scaled dot-product attention from first principles using only NumPy
Explain mathematically why we compute QK^T / sqrt(d_k) and what each operation accomplishes
Create attention masks for causal language modeling that prevent future token leakage
Implement multi-head attention and understand how parallel attention subspaces work
Visualize attention patterns to debug and understand what the model “looks at”
Answer interview questions about attention with deep mathematical understanding

Theoretical Foundation

The Core Question You’re Answering

“How does a neural network know which words to ‘pay attention to’ when processing a sentence, and how is this implemented mathematically from first principles?”

This project forces you to answer:

Why do we need three matrices (Q, K, V) instead of just comparing tokens directly?
What does the softmax actually do, and why is it essential?
How does scaling by sqrt(d_k) prevent gradient problems?
Why does matrix multiplication QK^T give us relevance scores?

Most people use torch.nn.MultiheadAttention as a black box. After this project, you’ll understand every operation inside that black box and could implement it yourself in any language.

Concepts You Must Understand First

Stop and research these before coding:

1. Matrix Multiplication Semantics

What does Q @ K^T actually compute? (Hint: dot product between every query and every key)

Consider two matrices:

Q with shape (N, d): N queries, each of dimension d
K with shape (N, d): N keys, each of dimension d

When we compute Q @ K^T:

K^T has shape (d, N)
Result has shape (N, N)
Element [i, j] = dot product of query i with key j

This is computing all pairwise similarities between queries and keys in one matrix operation.

Book Reference: “Introduction to Linear Algebra” by Gilbert Strang - Ch. 1-2

2. Dot Product as Similarity

Why is the dot product between two vectors a measure of their similarity?

dot(a, b) = |a| × |b| × cos(θ)

where θ is the angle between vectors a and b

Vectors pointing same direction: cos(0°) = 1 → large positive dot product
Orthogonal vectors: cos(90°) = 0 → dot product = 0
Opposite vectors: cos(180°) = -1 → large negative dot product

The dot product captures both magnitude and directional alignment.

Book Reference: “Mathematics for Machine Learning” by Deisenroth et al. - Ch. 3.1

3. Softmax Function

What does softmax do? It converts arbitrary numbers to probabilities that sum to 1.

softmax(x_i) = exp(x_i) / Σ exp(x_j)

Why softmax instead of just normalization?

Softmax emphasizes larger values exponentially
Small differences become big differences after exp()
This creates a “winner-take-more” dynamic

Example:

scores = [2.0, 1.0, 0.5]

# Simple normalization (dividing by sum):
# [2.0/3.5, 1.0/3.5, 0.5/3.5] = [0.57, 0.29, 0.14]

# Softmax:
# exp([2.0, 1.0, 0.5]) = [7.39, 2.72, 1.65]
# Sum = 11.76
# [0.63, 0.23, 0.14]  # Notice: gap between first two increased

Book Reference: “Deep Learning” by Goodfellow et al. - Ch. 6.2.2.3

4. Why Scaling by sqrt(d_k)?

What happens to dot products as dimension increases?

For two random unit vectors in d dimensions, their dot product has:

Mean: 0
Variance: 1/d… wait, that’s the variance of each component
Actually, variance of dot product ≈ d (for standard normal components)

Problem: As d increases, dot products have larger magnitude, pushing softmax into saturation regions where gradients are tiny.

Solution: Divide by sqrt(d_k) to normalize variance back to ~1

# Without scaling (d_k = 64):
scores = [-12.3, 45.2, 31.1, ...]  # Large magnitudes
softmax(scores) → [0.00, 0.99, 0.01, ...]  # One dominates, gradients vanish

# With scaling:
scores / 8 = [-1.5, 5.6, 3.9, ...]  # Reasonable magnitudes
softmax(scores) → [0.02, 0.72, 0.26, ...]  # Smoother, better gradients

Book Reference: “Attention Is All You Need” paper - Section 3.2.1

5. Attention Masks

What is a causal mask? It prevents attending to future positions.

For autoregressive generation (like GPT), token at position t should only see positions 0…t-1.

Implementation: Set future positions to -∞ before softmax

# For 4 tokens, causal mask:
mask = [
    [0,    -inf, -inf, -inf],  # Token 0 can only see itself
    [0,    0,    -inf, -inf],  # Token 1 sees 0 and 1
    [0,    0,    0,    -inf],  # Token 2 sees 0, 1, 2
    [0,    0,    0,    0   ]   # Token 3 sees all
]

# After softmax: exp(-inf) = 0, so future tokens have zero weight

Book Reference: “Build a Large Language Model” by Sebastian Raschka - Ch. 3.5

Project Specification

What You’ll Build

A pure NumPy implementation of attention that:

Implements scaled dot-product attention: Attention(Q, K, V) = softmax(QK^T / sqrt(d_k))V
Supports optional causal masking for autoregressive models
Implements multi-head attention with configurable number of heads
Generates visualization heatmaps of attention patterns

Input/Output Specification

# Single-head attention
def attention(Q, K, V, mask=None):
    """
    Args:
        Q: Query matrix, shape (seq_len, d_k) or (batch, seq_len, d_k)
        K: Key matrix, same shape as Q
        V: Value matrix, same shape as Q
        mask: Optional mask, shape (seq_len, seq_len)

    Returns:
        output: Attention output, same shape as Q
        attention_weights: Softmax weights, shape (seq_len, seq_len)
    """
    pass

# Multi-head attention
def multi_head_attention(x, W_q, W_k, W_v, W_o, num_heads, mask=None):
    """
    Args:
        x: Input embeddings, shape (batch, seq_len, d_model)
        W_q, W_k, W_v: Projection matrices, shape (d_model, d_model)
        W_o: Output projection, shape (d_model, d_model)
        num_heads: Number of attention heads
        mask: Optional causal mask

    Returns:
        output: Multi-head attention output
        all_attention_weights: Weights from all heads
    """
    pass

Expected Outputs

$ python attention_demo.py

=== Single-Head Attention Demo ===

Input sequence: ["The", "cat", "sat", "on", "the", "mat"]
Embedding dimension: 64
Sequence length: 6

Computing Q, K, V matrices...
✓ Q shape: (6, 64)
✓ K shape: (6, 64)
✓ V shape: (6, 64)

Computing attention scores (QK^T)...
Raw scores shape: (6, 6)
Score matrix:
        The    cat    sat     on    the    mat
The   [[45.2  12.3   8.1   3.2   44.8   9.1]
cat    [12.1  67.3  34.2   5.1   11.9  28.3]
sat    [ 8.3  33.9  52.1  18.2    8.7  41.2]
on     [ 3.1   5.2  17.8  29.4    3.3  38.9]
the    [44.9  12.1   8.4   3.4   45.1   9.3]
mat    [ 9.2  28.1  40.8  38.7    9.4  71.2]]

Scaling by sqrt(d_k) = sqrt(64) = 8.0...

Applying softmax (converting to probabilities)...
Attention weights:
        The    cat    sat     on    the    mat
The   [0.42  0.08  0.05  0.01  0.41  0.03]  ← "The" attends to itself and "the"
cat   [0.03  0.35  0.18  0.01  0.03  0.40]  ← "cat" attends to "mat" and itself
sat   [0.01  0.15  0.24  0.03  0.01  0.56]  ← "sat" attends heavily to "mat"
on    [0.01  0.01  0.04  0.17  0.01  0.76]  ← "on" attends to "mat"
the   [0.42  0.08  0.05  0.01  0.41  0.03]
mat   [0.01  0.08  0.19  0.19  0.01  0.52]  ← "mat" attends to itself

Computing final output (attention_weights @ V)...
✓ Output shape: (6, 64)

=== Multi-Head Attention Demo ===

Number of heads: 8
Dimension per head: 8

Head 1 attention patterns (syntactic):
  "sat" → "cat" (0.62) - subject-verb relationship
  "cat" → "sat" (0.38) - verb dependency

Head 2 attention patterns (positional):
  Each token attends to neighbors (local window)

Head 3 attention patterns (semantic):
  "cat" → "mat" (0.71) - both are nouns

Concatenating head outputs...
Final output shape: (6, 64)

=== Visualization Saved ===
✓ attention_heatmap.png - Shows attention weight matrix
✓ multihead_comparison.png - Shows all 8 head patterns

Solution Architecture

Component Diagram

┌─────────────────────────────────────────────────────────────────┐
│                    Attention Module                             │
│                                                                 │
│  ┌──────────┐    ┌──────────┐    ┌──────────┐                 │
│  │  Input   │───>│  Linear  │───>│  Q, K, V │                 │
│  │Embeddings│    │Projections│    │ Matrices │                 │
│  └──────────┘    └──────────┘    └─────┬────┘                 │
│                                        │                        │
│                   ┌────────────────────┼────────────────────┐  │
│                   │                    ▼                    │  │
│                   │  ┌────────────────────────────────────┐│  │
│                   │  │       Attention Computation        ││  │
│                   │  │                                    ││  │
│                   │  │  scores = Q @ K^T                  ││  │
│                   │  │  scores = scores / sqrt(d_k)       ││  │
│                   │  │  scores = scores + mask (optional) ││  │
│                   │  │  weights = softmax(scores)         ││  │
│                   │  │  output = weights @ V              ││  │
│                   │  └──────────────────┬─────────────────┘│  │
│                   │        (per head)   │                  │  │
│                   └─────────────────────┼──────────────────┘  │
│                                         │                      │
│                   ┌─────────────────────▼──────────────────┐  │
│                   │  Concatenate Heads + Output Projection │  │
│                   └─────────────────────┬──────────────────┘  │
│                                         │                      │
│                                         ▼                      │
│                              ┌──────────────────┐             │
│                              │  Final Output    │             │
│                              └──────────────────┘             │
└─────────────────────────────────────────────────────────────────┘

File Structure

attention_from_scratch/
├── attention.py           # Core attention functions
│   ├── softmax()          # Numerically stable softmax
│   ├── attention()        # Single-head scaled dot-product attention
│   ├── multi_head_attention()  # Multi-head attention
│   └── create_causal_mask()    # Causal masking helper
├── visualization.py       # Attention visualization
│   ├── plot_attention_heatmap()
│   └── plot_multihead_comparison()
├── demo.py               # Interactive demo script
├── tests.py              # Unit tests
└── outputs/              # Generated visualizations
    ├── attention_heatmap.png
    └── multihead_comparison.png

Data Flow

Input: Token embeddings (seq_len, d_model)
Project: Apply W_q, W_k, W_v to get Q, K, V matrices
Split: Reshape for multi-head: (seq_len, d_model) → (num_heads, seq_len, d_k)
Attention: For each head, compute scaled dot-product attention
Concatenate: Merge head outputs back: (num_heads, seq_len, d_k) → (seq_len, d_model)
Project: Apply W_o output projection
Output: Attention output (seq_len, d_model)

Implementation Guide

Phase 1: Basic Attention (Days 1-2)

Goal: Implement single-head scaled dot-product attention

import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x_max = np.max(x, axis=axis, keepdims=True)
    exp_x = np.exp(x - x_max)
    return exp_x / np.sum(exp_x, axis=axis, keepdims=True)

def attention(Q, K, V, mask=None):
    """
    Scaled dot-product attention.

    Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V
    """
    d_k = Q.shape[-1]

    # Step 1: Compute attention scores
    scores = Q @ K.T  # (seq_len, seq_len)

    # Step 2: Scale by sqrt(d_k)
    scores = scores / np.sqrt(d_k)

    # Step 3: Apply mask (optional)
    if mask is not None:
        scores = scores + mask  # mask has -inf where blocked

    # Step 4: Softmax to get attention weights
    attention_weights = softmax(scores, axis=-1)

    # Step 5: Weighted sum of values
    output = attention_weights @ V

    return output, attention_weights

Checkpoint: Test with 2-token example, verify manually calculated values match.

Phase 2: Causal Masking (Day 3)

Goal: Implement masking for autoregressive models

def create_causal_mask(size):
    """Create causal (autoregressive) mask."""
    # Upper triangle should be -inf (can't see future)
    mask = np.triu(np.ones((size, size)), k=1) * -1e9
    return mask

# Test: After applying mask and softmax, upper triangle should be 0

Phase 3: Multi-Head Attention (Days 4-5)

Goal: Implement parallel attention heads

Key insight: Reshape to process all heads in parallel using NumPy’s vectorization.

def multi_head_attention(x, W_q, W_k, W_v, W_o, num_heads, mask=None):
    batch_size, seq_len, d_model = x.shape
    d_k = d_model // num_heads

    # Project to Q, K, V
    Q = x @ W_q  # (batch, seq, d_model)
    K = x @ W_k
    V = x @ W_v

    # Reshape: (batch, seq, d_model) → (batch, num_heads, seq, d_k)
    Q = Q.reshape(batch_size, seq_len, num_heads, d_k).transpose(0, 2, 1, 3)
    K = K.reshape(batch_size, seq_len, num_heads, d_k).transpose(0, 2, 1, 3)
    V = V.reshape(batch_size, seq_len, num_heads, d_k).transpose(0, 2, 1, 3)

    # Compute attention for all heads (vectorized)
    scores = Q @ K.transpose(0, 1, 3, 2)  # (batch, heads, seq, seq)
    scores = scores / np.sqrt(d_k)

    if mask is not None:
        scores = scores + mask

    weights = softmax(scores, axis=-1)
    output = weights @ V  # (batch, heads, seq, d_k)

    # Concatenate heads: (batch, heads, seq, d_k) → (batch, seq, d_model)
    output = output.transpose(0, 2, 1, 3).reshape(batch_size, seq_len, d_model)

    # Output projection
    output = output @ W_o

    return output, weights

Phase 4: Visualization (Days 6-7)

Goal: Create interpretable attention visualizations

import matplotlib.pyplot as plt
import seaborn as sns

def visualize_attention(attention_weights, tokens, save_path='attention.png'):
    plt.figure(figsize=(10, 8))
    sns.heatmap(
        attention_weights,
        xticklabels=tokens,
        yticklabels=tokens,
        cmap='YlOrRd',
        annot=True,
        fmt='.2f',
        cbar_kws={'label': 'Attention Weight'}
    )
    plt.xlabel('Key Position (attending to)')
    plt.ylabel('Query Position (from)')
    plt.title('Attention Weight Matrix')
    plt.tight_layout()
    plt.savefig(save_path, dpi=150)
    plt.close()

Testing Strategy

Unit Tests

def test_attention_shapes():
    """Verify output shapes are correct."""
    Q = np.random.randn(6, 64)
    K = np.random.randn(6, 64)
    V = np.random.randn(6, 64)

    output, weights = attention(Q, K, V)

    assert output.shape == (6, 64), f"Expected (6, 64), got {output.shape}"
    assert weights.shape == (6, 6), f"Expected (6, 6), got {weights.shape}"

def test_attention_weights_sum_to_one():
    """Verify softmax produces valid probability distribution."""
    Q = np.random.randn(4, 32)
    K = np.random.randn(4, 32)
    V = np.random.randn(4, 32)

    _, weights = attention(Q, K, V)
    row_sums = weights.sum(axis=-1)

    np.testing.assert_allclose(row_sums, 1.0, rtol=1e-5)

def test_causal_mask_blocks_future():
    """Verify causal mask prevents attending to future."""
    mask = create_causal_mask(4)
    Q = K = V = np.eye(4)  # Identity for easy verification

    _, weights = attention(Q, K, V, mask=mask)

    # Upper triangle should be zero
    upper_triangle = np.triu(weights, k=1)
    assert np.allclose(upper_triangle, 0), "Future positions should have zero weight"

def test_numerical_stability():
    """Verify softmax handles large values."""
    # Create scores that would overflow without stability fix
    Q = np.random.randn(4, 32) * 100  # Large values
    K = np.random.randn(4, 32) * 100
    V = np.random.randn(4, 32)

    output, weights = attention(Q, K, V)

    assert not np.any(np.isnan(output)), "Output should not contain NaN"
    assert not np.any(np.isinf(output)), "Output should not contain Inf"

Manual Verification

Trace this by hand before coding:

# Simplified: 2 tokens, 4 dimensions
Q = [[1, 0, 1, 0],    # Token 1 query
     [0, 1, 0, 1]]    # Token 2 query

K = [[1, 0, 1, 0],    # Token 1 key
     [0, 1, 0, 1]]    # Token 2 key

V = [[10, 20, 30, 40],   # Token 1 value
     [5, 15, 25, 35]]    # Token 2 value

# Step 1: QK^T
# Q[0] · K[0] = 1*1 + 0*0 + 1*1 + 0*0 = 2
# Q[0] · K[1] = 0
# Q[1] · K[0] = 0
# Q[1] · K[1] = 2
scores = [[2, 0],
          [0, 2]]

# Step 2: Scale by sqrt(4) = 2
scaled = [[1, 0],
          [0, 1]]

# Step 3: Softmax
# Row 1: [exp(1), exp(0)] / (exp(1)+exp(0)) ≈ [0.73, 0.27]
attention_weights = [[0.73, 0.27],
                     [0.27, 0.73]]

# Step 4: Output = weights @ V
# Output[0] = 0.73 * [10,20,30,40] + 0.27 * [5,15,25,35]
#           = [8.65, 18.65, 28.65, 38.65]

Common Pitfalls

1. Forgetting to Scale

Problem: Softmax gradients vanish for large scores

# Wrong
scores = Q @ K.T
weights = softmax(scores)

# Right
scores = Q @ K.T / np.sqrt(d_k)
weights = softmax(scores)

2. Wrong Transpose for Multi-Head

Problem: Incorrect dimension ordering after split

# Wrong: Transpose after reshape doesn't give (batch, heads, seq, d_k)
Q = Q.reshape(batch, num_heads, seq_len, d_k)  # Wrong order!

# Right: Reshape then transpose
Q = Q.reshape(batch, seq_len, num_heads, d_k).transpose(0, 2, 1, 3)

3. Mask Has Wrong Sign

Problem: Using positive infinity instead of negative

# Wrong: Positive values boost attention
mask = np.triu(np.ones((n, n)), k=1) * 1e9

# Right: Negative infinity blocks attention
mask = np.triu(np.ones((n, n)), k=1) * -1e9

4. Broadcasting Errors in Batched Attention

Problem: Mask doesn’t broadcast correctly

# Mask shape should be (1, 1, seq, seq) for batched attention
# to broadcast across (batch, heads, seq, seq)
mask = mask[np.newaxis, np.newaxis, :, :]

Extensions and Challenges

Challenge 1: Relative Position Encodings

Implement relative positional bias like T5 or ALiBi:

def alibi_mask(seq_len, num_heads):
    """ALiBi: Attention with Linear Biases."""
    # Each head has different slope for distance penalty
    slopes = 2 ** (-(8 / num_heads) * np.arange(1, num_heads + 1))
    positions = np.abs(np.arange(seq_len)[:, None] - np.arange(seq_len))
    return -slopes[:, None, None] * positions

Challenge 2: Grouped Query Attention

Implement GQA where multiple query heads share key-value heads:

def grouped_query_attention(Q, K, V, num_query_heads, num_kv_heads):
    """GQA: num_query_heads queries share num_kv_heads keys/values."""
    # Repeat K, V to match query heads
    repeat_factor = num_query_heads // num_kv_heads
    K = np.repeat(K, repeat_factor, axis=1)
    V = np.repeat(V, repeat_factor, axis=1)
    # Continue with standard attention...

Challenge 3: Attention with Linear Complexity

Implement Performer’s FAVOR+ or Linear Attention:

def linear_attention(Q, K, V):
    """O(n) attention using kernel approximation."""
    # Use feature map φ such that softmax(QK^T) ≈ φ(Q) @ φ(K)^T
    def feature_map(x):
        return np.exp(x - np.max(x, axis=-1, keepdims=True))

    Q_prime = feature_map(Q)
    K_prime = feature_map(K)

    # Compute in O(n) instead of O(n²)
    KV = K_prime.T @ V  # (d, d)
    return Q_prime @ KV  # (n, d)

Interview Questions

Conceptual Questions

“Explain the attention mechanism in one sentence.”

Answer: “Attention computes a weighted average of values, where weights are learned by comparing how well queries match keys.”
“Why do we need three separate matrices Q, K, V instead of using the same matrix?”

Answer: “Separating them allows the model to learn different representations: Q for ‘what to look for’, K for ‘what’s available’, V for ‘what to retrieve’. Using different weight matrices (W_Q, W_K, W_V) gives the model more flexibility to learn these roles independently.”
“What’s the time complexity of attention, and why?”

Answer: “O(N²d) where N is sequence length, d is dimension. The N² comes from QK^T creating an N×N attention matrix, and d is the dimension of each vector.”
“How does multi-head attention differ from single-head?”

Answer: “Multi-head runs h parallel attention mechanisms, each with lower dimension (d/h). This lets the model learn different types of relationships (syntax, semantics, position) simultaneously.”
“What would happen if you removed the sqrt(d_k) scaling?”

Answer: “For large d_k, dot products would have large variance. This pushes softmax into regions with very small gradients (saturation), making training difficult.”
“How do you prevent GPT from looking at future tokens?”

Answer: “Apply a causal mask: set attention scores to -infinity for all future positions before softmax. After softmax, these become zero, eliminating future information.”
“What’s the difference between self-attention and cross-attention?”

Answer: “Self-attention: Q, K, V all come from the same sequence. Cross-attention: Q from one sequence, K and V from another (e.g., encoder-decoder models).”

Coding Questions

“Implement softmax in a numerically stable way.”

def softmax(x, axis=-1):
    x_max = np.max(x, axis=axis, keepdims=True)
    exp_x = np.exp(x - x_max)  # Subtract max for stability
    return exp_x / np.sum(exp_x, axis=axis, keepdims=True)

“How would you debug attention that’s giving uniform weights?”

Answer: Check that Q and K are actually different (if identical to input, all dot products similar). Verify scaling is applied. Check for temperature (if too high, softmax is uniform). Visualize the attention pattern to see if it’s learning anything.

Resources

Primary Reading

Topic	Book	Chapter
Attention mechanism fundamentals	“Build a Large Language Model (From Scratch)” by Sebastian Raschka	Ch. 3: “Coding Attention Mechanisms”
Mathematical foundations	“The Illustrated Transformer” by Jay Alammar	Full article
Linear algebra for attention	“Introduction to Linear Algebra” by Gilbert Strang	Ch. 1-3
Softmax and numerical stability	“Deep Learning” by Goodfellow, Bengio, Courville	Ch. 6.2
Original transformer paper	“Attention Is All You Need” by Vaswani et al.	Section 3.2

Interactive Resources

Transformer Explainer: https://poloclub.github.io/transformer-explainer/
Jay Alammar’s Illustrated Transformer: https://jalammar.github.io/illustrated-transformer/
3Blue1Brown: Attention in Transformers: YouTube visual explanation

Code References

Harvard NLP Annotated Transformer: http://nlp.seas.harvard.edu/annotated-transformer/
Hugging Face Transformers source code: modeling_utils.py, modeling_gpt2.py

Self-Assessment Checklist

Before moving to Project 2, verify you can:

Explain why QK^T computes pairwise similarities
Implement numerically stable softmax from scratch
Create a causal mask and verify it blocks future tokens
Implement multi-head attention with correct reshaping
Visualize attention weights and interpret the patterns
Answer all interview questions confidently
Trace attention computation by hand for a 2-token example
Explain the purpose of the sqrt(d_k) scaling factor

Next Project: P02 - Implement Flash Attention