Project 1: Build Attention from Scratch (Pure NumPy)

Project Metadata

  • Difficulty: Level 4 (Expert)
  • Time Estimate: 1-2 weeks
  • Prerequisites: Python, NumPy, linear algebra (matrix multiplication, dot products)
  • Main Programming Language: Python
  • Alternative Languages: Rust, C++
  • Primary Tools: NumPy, Matplotlib
  • Main Book: “Build a Large Language Model (From Scratch)” by Sebastian Raschka
  • Knowledge Area: Deep Learning / Linear Algebra

Learning Objectives

By completing this project, you will be able to:

  1. Implement scaled dot-product attention from first principles using only NumPy
  2. Explain mathematically why we compute QK^T / sqrt(d_k) and what each operation accomplishes
  3. Create attention masks for causal language modeling that prevent future token leakage
  4. Implement multi-head attention and understand how parallel attention subspaces work
  5. Visualize attention patterns to debug and understand what the model “looks at”
  6. Answer interview questions about attention with deep mathematical understanding

Theoretical Foundation

The Core Question You’re Answering

“How does a neural network know which words to ‘pay attention to’ when processing a sentence, and how is this implemented mathematically from first principles?”

This project forces you to answer:

  • Why do we need three matrices (Q, K, V) instead of just comparing tokens directly?
  • What does the softmax actually do, and why is it essential?
  • How does scaling by sqrt(d_k) prevent gradient problems?
  • Why does matrix multiplication QK^T give us relevance scores?

Most people use torch.nn.MultiheadAttention as a black box. After this project, you’ll understand every operation inside that black box and could implement it yourself in any language.

Concepts You Must Understand First

Stop and research these before coding:

1. Matrix Multiplication Semantics

What does Q @ K^T actually compute? (Hint: dot product between every query and every key)

Consider two matrices:

  • Q with shape (N, d): N queries, each of dimension d
  • K with shape (N, d): N keys, each of dimension d

When we compute Q @ K^T:

  • K^T has shape (d, N)
  • Result has shape (N, N)
  • Element [i, j] = dot product of query i with key j

This is computing all pairwise similarities between queries and keys in one matrix operation.
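
A quick NumPy check of that claim (the small shapes here are arbitrary, chosen just for illustration):

import numpy as np

rng = np.random.default_rng(0)
Q = rng.standard_normal((3, 4))   # 3 queries, each of dimension 4
K = rng.standard_normal((3, 4))   # 3 keys, each of dimension 4

scores = Q @ K.T                  # shape (3, 3)
print(scores.shape)                                    # (3, 3)
print(np.allclose(scores[1, 2], np.dot(Q[1], K[2])))   # True: element [i, j] = q_i · k_j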

Book Reference: “Introduction to Linear Algebra” by Gilbert Strang - Ch. 1-2

2. Dot Product as Similarity

Why is the dot product between two vectors a measure of their similarity?

dot(a, b) = |a| × |b| × cos(θ)

where θ is the angle between vectors a and b

  • Vectors pointing in the same direction: cos(0°) = 1 → large positive dot product
  • Orthogonal vectors: cos(90°) = 0 → dot product = 0
  • Opposite vectors: cos(180°) = -1 → large negative dot product

The dot product captures both magnitude and directional alignment.
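
A tiny numerical illustration (the vectors are arbitrary, picked so the angles are obvious):

import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 6.0])   # same direction as a
c = np.array([3.0, 0.0, -1.0])  # orthogonal to a

def cosine(u, v): return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

print(a @ b, cosine(a, b))      # 28.0, ≈1.0: aligned, large positive dot product
print(a @ c, cosine(a, c))      # 0.0, 0.0: orthogonal, zero dot product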

Book Reference: “Mathematics for Machine Learning” by Deisenroth et al. - Ch. 3.1

3. Softmax Function

What does softmax do? It converts arbitrary numbers to probabilities that sum to 1.

softmax(x_i) = exp(x_i) / Σ exp(x_j)

Why softmax instead of just normalization?

  • Softmax emphasizes larger values exponentially
  • Small differences become big differences after exp()
  • This creates a “winner-take-more” dynamic

Example:

scores = [2.0, 1.0, 0.5]

# Simple normalization (dividing by sum):
# [2.0/3.5, 1.0/3.5, 0.5/3.5] = [0.57, 0.29, 0.14]

# Softmax:
# exp([2.0, 1.0, 0.5]) = [7.39, 2.72, 1.65]
# Sum = 11.76
# [0.63, 0.23, 0.14]  # Notice: gap between first two increased
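
The same comparison as runnable NumPy:

import numpy as np

scores = np.array([2.0, 1.0, 0.5])
print(scores / scores.sum())    # ≈ [0.57, 0.29, 0.14]  simple normalization
exp_s = np.exp(scores)
print(exp_s / exp_s.sum())      # ≈ [0.63, 0.23, 0.14]  softmax widens the lead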

Book Reference: “Deep Learning” by Goodfellow et al. - Ch. 6.2.2.3

4. Why Scaling by sqrt(d_k)?

What happens to dot products as dimension increases?

For two random vectors in d dimensions whose components are independent with mean 0 and variance 1, the dot product has:

  • Mean: 0
  • Variance: d (each of the d component-wise products has variance 1, and the variances add)

Problem: As d increases, dot products have larger magnitude, pushing softmax into saturation regions where gradients are tiny.

Solution: Divide by sqrt(d_k) to normalize variance back to ~1

# Without scaling (d_k = 64):
scores = [-12.3, 45.2, 31.1, ...]  # Large magnitudes
softmax(scores) → [0.00, 0.99, 0.01, ...]  # One dominates, gradients vanish

# With scaling:
scores / 8 = [-1.5, 5.6, 3.9, ...]  # Reasonable magnitudes
softmax(scores) → [0.02, 0.72, 0.26, ...]  # Smoother, better gradients
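
You can confirm the variance argument empirically; a minimal sketch:

import numpy as np

rng = np.random.default_rng(0)
for d in (4, 64, 1024):
    q = rng.standard_normal((10000, d))
    k = rng.standard_normal((10000, d))
    dots = (q * k).sum(axis=1)            # 10000 sample dot products
    print(d, dots.var().round(1), (dots / np.sqrt(d)).var().round(2))
# raw variance grows like d; dividing by sqrt(d) brings it back to ~1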

Book Reference: “Attention Is All You Need” paper - Section 3.2.1

5. Attention Masks

What is a causal mask? It prevents attending to future positions.

For autoregressive generation (like GPT), the token at position t should only see positions 0…t, i.e., itself and earlier tokens, never anything that comes after it.

Implementation: Set future positions to -∞ before softmax

# For 4 tokens, causal mask:
mask = [
    [0,    -inf, -inf, -inf],  # Token 0 can only see itself
    [0,    0,    -inf, -inf],  # Token 1 sees 0 and 1
    [0,    0,    0,    -inf],  # Token 2 sees 0, 1, 2
    [0,    0,    0,    0   ]   # Token 3 sees all
]

# After softmax: exp(-inf) = 0, so future tokens have zero weight

Book Reference: “Build a Large Language Model (From Scratch)” by Sebastian Raschka - Ch. 3.5


Project Specification

What You’ll Build

A pure NumPy implementation of attention that:

  1. Implements scaled dot-product attention: Attention(Q, K, V) = softmax(QK^T / sqrt(d_k))V
  2. Supports optional causal masking for autoregressive models
  3. Implements multi-head attention with configurable number of heads
  4. Generates visualization heatmaps of attention patterns

Input/Output Specification

# Single-head attention
def attention(Q, K, V, mask=None):
    """
    Args:
        Q: Query matrix, shape (seq_len, d_k) or (batch, seq_len, d_k)
        K: Key matrix, same shape as Q
        V: Value matrix, same shape as Q
        mask: Optional mask, shape (seq_len, seq_len)

    Returns:
        output: Attention output, same shape as Q
        attention_weights: Softmax weights, shape (seq_len, seq_len)
    """
    pass

# Multi-head attention
def multi_head_attention(x, W_q, W_k, W_v, W_o, num_heads, mask=None):
    """
    Args:
        x: Input embeddings, shape (batch, seq_len, d_model)
        W_q, W_k, W_v: Projection matrices, shape (d_model, d_model)
        W_o: Output projection, shape (d_model, d_model)
        num_heads: Number of attention heads
        mask: Optional causal mask

    Returns:
        output: Multi-head attention output
        all_attention_weights: Weights from all heads
    """
    pass

Expected Outputs

$ python demo.py

=== Single-Head Attention Demo ===

Input sequence: ["The", "cat", "sat", "on", "the", "mat"]
Embedding dimension: 64
Sequence length: 6

Computing Q, K, V matrices...
✓ Q shape: (6, 64)
✓ K shape: (6, 64)
✓ V shape: (6, 64)

Computing attention scores (QK^T)...
Raw scores shape: (6, 6)
Score matrix:
        The    cat    sat     on    the    mat
The   [[45.2  12.3   8.1   3.2   44.8   9.1]
cat    [12.1  67.3  34.2   5.1   11.9  28.3]
sat    [ 8.3  33.9  52.1  18.2    8.7  41.2]
on     [ 3.1   5.2  17.8  29.4    3.3  38.9]
the    [44.9  12.1   8.4   3.4   45.1   9.3]
mat    [ 9.2  28.1  40.8  38.7    9.4  71.2]]

Scaling by sqrt(d_k) = sqrt(64) = 8.0...

Applying softmax (converting to probabilities)...
Attention weights:
        The    cat    sat     on    the    mat
The   [0.42  0.08  0.05  0.01  0.41  0.03]  ← "The" attends to itself and "the"
cat   [0.03  0.35  0.18  0.01  0.03  0.40]  ← "cat" attends to "mat" and itself
sat   [0.01  0.15  0.24  0.03  0.01  0.56]  ← "sat" attends heavily to "mat"
on    [0.01  0.01  0.04  0.17  0.01  0.76]  ← "on" attends to "mat"
the   [0.42  0.08  0.05  0.01  0.41  0.03]
mat   [0.01  0.08  0.19  0.19  0.01  0.52]  ← "mat" attends to itself

Computing final output (attention_weights @ V)...
✓ Output shape: (6, 64)

=== Multi-Head Attention Demo ===

Number of heads: 8
Dimension per head: 8

Head 1 attention patterns (syntactic):
  "sat" → "cat" (0.62) - subject-verb relationship
  "cat" → "sat" (0.38) - verb dependency

Head 2 attention patterns (positional):
  Each token attends to neighbors (local window)

Head 3 attention patterns (semantic):
  "cat" → "mat" (0.71) - both are nouns

Concatenating head outputs...
Final output shape: (6, 64)

=== Visualization Saved ===
✓ attention_heatmap.png - Shows attention weight matrix
✓ multihead_comparison.png - Shows all 8 head patterns

Solution Architecture

Component Diagram

Attention Module

Input Embeddings
       │
       ▼
Linear Projections (W_q, W_k, W_v)
       │
       ▼
Q, K, V Matrices
       │
       ▼
┌─────────────────────────────────────┐
│  Attention Computation (per head)   │
│                                     │
│  scores  = Q @ K^T                  │
│  scores  = scores / sqrt(d_k)       │
│  scores  = scores + mask (optional) │
│  weights = softmax(scores)          │
│  output  = weights @ V              │
└─────────────────────────────────────┘
       │
       ▼
Concatenate Heads + Output Projection
       │
       ▼
Final Output

File Structure

attention_from_scratch/
├── attention.py                # Core attention functions
│   ├── softmax()               # Numerically stable softmax
│   ├── attention()             # Single-head scaled dot-product attention
│   ├── multi_head_attention()  # Multi-head attention
│   └── create_causal_mask()    # Causal masking helper
├── visualization.py            # Attention visualization
│   ├── plot_attention_heatmap()
│   └── plot_multihead_comparison()
├── demo.py                     # Interactive demo script
├── tests.py                    # Unit tests
└── outputs/                    # Generated visualizations
    ├── attention_heatmap.png
    └── multihead_comparison.png

Data Flow

  1. Input: Token embeddings (seq_len, d_model)
  2. Project: Apply W_q, W_k, W_v to get Q, K, V matrices
  3. Split: Reshape for multi-head: (seq_len, d_model) → (num_heads, seq_len, d_k)
  4. Attention: For each head, compute scaled dot-product attention
  5. Concatenate: Merge head outputs back: (num_heads, seq_len, d_k) → (seq_len, d_model)
  6. Project: Apply W_o output projection
  7. Output: Attention output (seq_len, d_model)

Implementation Guide

Phase 1: Basic Attention (Days 1-2)

Goal: Implement single-head scaled dot-product attention

import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x_max = np.max(x, axis=axis, keepdims=True)
    exp_x = np.exp(x - x_max)
    return exp_x / np.sum(exp_x, axis=axis, keepdims=True)

def attention(Q, K, V, mask=None):
    """
    Scaled dot-product attention.

    Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V
    """
    d_k = Q.shape[-1]

    # Step 1: Compute attention scores
    scores = Q @ K.T  # (seq_len, seq_len)

    # Step 2: Scale by sqrt(d_k)
    scores = scores / np.sqrt(d_k)

    # Step 3: Apply mask (optional)
    if mask is not None:
        scores = scores + mask  # mask has -inf where blocked

    # Step 4: Softmax to get attention weights
    attention_weights = softmax(scores, axis=-1)

    # Step 5: Weighted sum of values
    output = attention_weights @ V

    return output, attention_weights

Checkpoint: Test with a 2-token example and verify that the results match your manually calculated values (see Manual Verification below).

Phase 2: Causal Masking (Day 3)

Goal: Implement masking for autoregressive models

def create_causal_mask(size):
    """Create causal (autoregressive) mask."""
    # Upper triangle should be -inf (can't see future)
    mask = np.triu(np.ones((size, size)), k=1) * -1e9
    return mask

# Test: After applying mask and softmax, upper triangle should be 0
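
A quick check of that property (a sketch; assumes softmax() and create_causal_mask() from this project are in scope):

scores = np.random.randn(4, 4)
weights = softmax(scores + create_causal_mask(4), axis=-1)
print(np.triu(weights, k=1))   # ≈ 0 everywhere above the diagonal
print(weights.sum(axis=-1))    # each row still sums to 1 over the visible positions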

Phase 3: Multi-Head Attention (Days 4-5)

Goal: Implement parallel attention heads

Key insight: Reshape to process all heads in parallel using NumPy’s vectorization.

def multi_head_attention(x, W_q, W_k, W_v, W_o, num_heads, mask=None):
    batch_size, seq_len, d_model = x.shape
    d_k = d_model // num_heads

    # Project to Q, K, V
    Q = x @ W_q  # (batch, seq, d_model)
    K = x @ W_k
    V = x @ W_v

    # Reshape: (batch, seq, d_model) → (batch, num_heads, seq, d_k)
    Q = Q.reshape(batch_size, seq_len, num_heads, d_k).transpose(0, 2, 1, 3)
    K = K.reshape(batch_size, seq_len, num_heads, d_k).transpose(0, 2, 1, 3)
    V = V.reshape(batch_size, seq_len, num_heads, d_k).transpose(0, 2, 1, 3)

    # Compute attention for all heads (vectorized)
    scores = Q @ K.transpose(0, 1, 3, 2)  # (batch, heads, seq, seq)
    scores = scores / np.sqrt(d_k)

    if mask is not None:
        scores = scores + mask

    weights = softmax(scores, axis=-1)
    output = weights @ V  # (batch, heads, seq, d_k)

    # Concatenate heads: (batch, heads, seq, d_k) → (batch, seq, d_model)
    output = output.transpose(0, 2, 1, 3).reshape(batch_size, seq_len, d_model)

    # Output projection
    output = output @ W_o

    return output, weights
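
A quick smoke test of the function above with random weights (the shapes and the 0.02 init scale are arbitrary choices for this sketch):

rng = np.random.default_rng(0)
batch, seq_len, d_model, num_heads = 2, 6, 64, 8
x = rng.standard_normal((batch, seq_len, d_model))
W_q, W_k, W_v, W_o = [rng.standard_normal((d_model, d_model)) * 0.02 for _ in range(4)]
mask = create_causal_mask(seq_len)   # (seq, seq) broadcasts over (batch, heads, seq, seq)

out, weights = multi_head_attention(x, W_q, W_k, W_v, W_o, num_heads, mask=mask)
print(out.shape, weights.shape)      # (2, 6, 64) (2, 8, 6, 6)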

Phase 4: Visualization (Days 6-7)

Goal: Create interpretable attention visualizations

import matplotlib.pyplot as plt
import seaborn as sns

def visualize_attention(attention_weights, tokens, save_path='attention.png'):
    plt.figure(figsize=(10, 8))
    sns.heatmap(
        attention_weights,
        xticklabels=tokens,
        yticklabels=tokens,
        cmap='YlOrRd',
        annot=True,
        fmt='.2f',
        cbar_kws={'label': 'Attention Weight'}
    )
    plt.xlabel('Key Position (attending to)')
    plt.ylabel('Query Position (from)')
    plt.title('Attention Weight Matrix')
    plt.tight_layout()
    plt.savefig(save_path, dpi=150)
    plt.close()
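
A minimal way to exercise the plotting helper, assuming attention() from Phase 1 is importable (the random embeddings and output filename are placeholders):

import numpy as np

tokens = ["The", "cat", "sat", "on", "the", "mat"]
x = np.random.default_rng(0).standard_normal((len(tokens), 64))  # stand-in embeddings
_, weights = attention(x, x, x)       # self-attention: Q = K = V = x
visualize_attention(weights, tokens, save_path='attention_heatmap.png')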

Testing Strategy

Unit Tests

def test_attention_shapes():
    """Verify output shapes are correct."""
    Q = np.random.randn(6, 64)
    K = np.random.randn(6, 64)
    V = np.random.randn(6, 64)

    output, weights = attention(Q, K, V)

    assert output.shape == (6, 64), f"Expected (6, 64), got {output.shape}"
    assert weights.shape == (6, 6), f"Expected (6, 6), got {weights.shape}"

def test_attention_weights_sum_to_one():
    """Verify softmax produces valid probability distribution."""
    Q = np.random.randn(4, 32)
    K = np.random.randn(4, 32)
    V = np.random.randn(4, 32)

    _, weights = attention(Q, K, V)
    row_sums = weights.sum(axis=-1)

    np.testing.assert_allclose(row_sums, 1.0, rtol=1e-5)

def test_causal_mask_blocks_future():
    """Verify causal mask prevents attending to future."""
    mask = create_causal_mask(4)
    Q = K = V = np.eye(4)  # Identity for easy verification

    _, weights = attention(Q, K, V, mask=mask)

    # Upper triangle should be zero
    upper_triangle = np.triu(weights, k=1)
    assert np.allclose(upper_triangle, 0), "Future positions should have zero weight"

def test_numerical_stability():
    """Verify softmax handles large values."""
    # Create scores that would overflow without stability fix
    Q = np.random.randn(4, 32) * 100  # Large values
    K = np.random.randn(4, 32) * 100
    V = np.random.randn(4, 32)

    output, weights = attention(Q, K, V)

    assert not np.any(np.isnan(output)), "Output should not contain NaN"
    assert not np.any(np.isinf(output)), "Output should not contain Inf"

Manual Verification

Trace this by hand before coding:

# Simplified: 2 tokens, 4 dimensions
Q = [[1, 0, 1, 0],    # Token 1 query
     [0, 1, 0, 1]]    # Token 2 query

K = [[1, 0, 1, 0],    # Token 1 key
     [0, 1, 0, 1]]    # Token 2 key

V = [[10, 20, 30, 40],   # Token 1 value
     [5, 15, 25, 35]]    # Token 2 value

# Step 1: QK^T
# Q[0] ยท K[0] = 1*1 + 0*0 + 1*1 + 0*0 = 2
# Q[0] ยท K[1] = 0
# Q[1] ยท K[0] = 0
# Q[1] ยท K[1] = 2
scores = [[2, 0],
          [0, 2]]

# Step 2: Scale by sqrt(4) = 2
scaled = [[1, 0],
          [0, 1]]

# Step 3: Softmax
# Row 1: [exp(1), exp(0)] / (exp(1)+exp(0)) โ‰ˆ [0.73, 0.27]
attention_weights = [[0.73, 0.27],
                     [0.27, 0.73]]

# Step 4: Output = weights @ V
# Output[0] = 0.73 * [10,20,30,40] + 0.27 * [5,15,25,35]
#           = [8.65, 18.65, 28.65, 38.65]

Common Pitfalls

1. Forgetting to Scale

Problem: Softmax gradients vanish for large scores

# Wrong
scores = Q @ K.T
weights = softmax(scores)

# Right
scores = Q @ K.T / np.sqrt(d_k)
weights = softmax(scores)

2. Wrong Transpose for Multi-Head

Problem: Incorrect dimension ordering after split

# Wrong: reshaping straight to (batch, heads, seq, d_k) groups the wrong features into each head
Q = Q.reshape(batch, num_heads, seq_len, d_k)  # Wrong order!

# Right: Reshape then transpose
Q = Q.reshape(batch, seq_len, num_heads, d_k).transpose(0, 2, 1, 3)

3. Mask Has Wrong Sign

Problem: Using positive infinity instead of negative

# Wrong: Positive values boost attention
mask = np.triu(np.ones((n, n)), k=1) * 1e9

# Right: Negative infinity blocks attention
mask = np.triu(np.ones((n, n)), k=1) * -1e9

4. Broadcasting Errors in Batched Attention

Problem: Mask doesn’t broadcast correctly

# Mask shape should be (1, 1, seq, seq) for batched attention
# to broadcast across (batch, heads, seq, seq)
mask = mask[np.newaxis, np.newaxis, :, :]
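
A shape check for the batched case (a sketch; a plain (seq, seq) mask broadcasts on its own, but per-example masks of shape (batch, seq, seq) do need an explicit head axis):

batch, heads, seq = 2, 8, 6
scores = np.random.randn(batch, heads, seq, seq)
mask = create_causal_mask(seq)                  # (seq, seq)
mask4d = mask[np.newaxis, np.newaxis, :, :]     # (1, 1, seq, seq)
print((scores + mask4d).shape)                  # (2, 8, 6, 6)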

Extensions and Challenges

Challenge 1: Relative Position Encodings

Implement relative positional bias like T5 or ALiBi:

def alibi_mask(seq_len, num_heads):
    """ALiBi: Attention with Linear Biases."""
    # Each head has different slope for distance penalty
    slopes = 2 ** (-(8 / num_heads) * np.arange(1, num_heads + 1))
    positions = np.abs(np.arange(seq_len)[:, None] - np.arange(seq_len))
    return -slopes[:, None, None] * positions
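
A quick check of the helper above (with this slope formula, head 0 gets the largest slope, 1/2); the resulting bias is added to the raw scores before softmax, just like the causal mask:

bias = alibi_mask(seq_len=6, num_heads=8)
print(bias.shape)        # (8, 6, 6): one additive bias matrix per head
print(bias[0, 0, :3])    # [ 0.  -0.5 -1. ]: penalty grows linearly with distance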

Challenge 2: Grouped Query Attention

Implement GQA where multiple query heads share key-value heads:

def grouped_query_attention(Q, K, V, num_query_heads, num_kv_heads):
    """GQA: num_query_heads queries share num_kv_heads keys/values."""
    # Repeat K, V to match query heads
    repeat_factor = num_query_heads // num_kv_heads
    K = np.repeat(K, repeat_factor, axis=1)
    V = np.repeat(V, repeat_factor, axis=1)
    # Continue with standard attention...
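
A shape check for the repeat step (a sketch that assumes the head-major layout (batch, heads, seq, d_k) used elsewhere in this project):

import numpy as np

batch, num_q_heads, num_kv_heads, seq, d_k = 2, 8, 2, 6, 16
K = np.random.randn(batch, num_kv_heads, seq, d_k)
K_expanded = np.repeat(K, num_q_heads // num_kv_heads, axis=1)
print(K_expanded.shape)   # (2, 8, 6, 16): each KV head now serves 4 consecutive query heads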

Challenge 3: Attention with Linear Complexity

Implement Performer’s FAVOR+ or Linear Attention:

def linear_attention(Q, K, V):
    """O(n) attention using kernel approximation."""
    # Use feature map φ such that softmax(QK^T) ≈ φ(Q) @ φ(K)^T
    def feature_map(x):
        return np.exp(x - np.max(x, axis=-1, keepdims=True))

    Q_prime = feature_map(Q)
    K_prime = feature_map(K)

    # Compute in O(n) instead of O(n²)
    KV = K_prime.T @ V  # (d, d)
    # Note: a full linear-attention implementation also divides each row by the
    # normalizer Q_prime @ K_prime.sum(axis=0); that term is omitted in this sketch.
    return Q_prime @ KV  # (n, d)

Interview Questions

Conceptual Questions

  1. “Explain the attention mechanism in one sentence.”

    Answer: “Attention computes a weighted average of values, where the weights are computed by comparing how well queries match keys.”

  2. “Why do we need three separate matrices Q, K, V instead of using the same matrix?”

    Answer: “Separating them allows the model to learn different representations: Q for ‘what to look for’, K for ‘what’s available’, V for ‘what to retrieve’. Using different weight matrices (W_Q, W_K, W_V) gives the model more flexibility to learn these roles independently.”

  3. “What’s the time complexity of attention, and why?”

    Answer: “O(N²d) where N is sequence length, d is dimension. The N² comes from QK^T creating an N×N attention matrix, and d is the dimension of each vector.”

  4. “How does multi-head attention differ from single-head?”

    Answer: “Multi-head runs h parallel attention mechanisms, each with lower dimension (d/h). This lets the model learn different types of relationships (syntax, semantics, position) simultaneously.”

  5. “What would happen if you removed the sqrt(d_k) scaling?”

    Answer: “For large d_k, dot products would have large variance. This pushes softmax into regions with very small gradients (saturation), making training difficult.”

  6. “How do you prevent GPT from looking at future tokens?”

    Answer: “Apply a causal mask: set attention scores to -infinity for all future positions before softmax. After softmax, these become zero, eliminating future information.”

  7. “What’s the difference between self-attention and cross-attention?”

    Answer: “Self-attention: Q, K, V all come from the same sequence. Cross-attention: Q from one sequence, K and V from another (e.g., encoder-decoder models).”

Coding Questions

  1. “Implement softmax in a numerically stable way.”

    def softmax(x, axis=-1):
        x_max = np.max(x, axis=axis, keepdims=True)
        exp_x = np.exp(x - x_max)  # Subtract max for stability
        return exp_x / np.sum(exp_x, axis=axis, keepdims=True)
    
  2. “How would you debug attention that’s giving uniform weights?”

    Answer: Check that Q and K are actually different (if identical to input, all dot products similar). Verify scaling is applied. Check for temperature (if too high, softmax is uniform). Visualize the attention pattern to see if it’s learning anything.


Resources

Primary Reading

  • Attention mechanism fundamentals: “Build a Large Language Model (From Scratch)” by Sebastian Raschka, Ch. 3 (“Coding Attention Mechanisms”)
  • Mathematical foundations: “The Illustrated Transformer” by Jay Alammar (full article)
  • Linear algebra for attention: “Introduction to Linear Algebra” by Gilbert Strang, Ch. 1-3
  • Softmax and numerical stability: “Deep Learning” by Goodfellow, Bengio, and Courville, Ch. 6.2
  • Original transformer paper: “Attention Is All You Need” by Vaswani et al., Section 3.2

Interactive Resources

  • Transformer Explainer: https://poloclub.github.io/transformer-explainer/
  • Jay Alammar’s Illustrated Transformer: https://jalammar.github.io/illustrated-transformer/
  • 3Blue1Brown: Attention in Transformers: YouTube visual explanation

Code References

  • Harvard NLP Annotated Transformer: http://nlp.seas.harvard.edu/annotated-transformer/
  • Hugging Face Transformers source code: modeling_utils.py, modeling_gpt2.py

Self-Assessment Checklist

Before moving to Project 2, verify you can:

  • Explain why QK^T computes pairwise similarities
  • Implement numerically stable softmax from scratch
  • Create a causal mask and verify it blocks future tokens
  • Implement multi-head attention with correct reshaping
  • Visualize attention weights and interpret the patterns
  • Answer all interview questions confidently
  • Trace attention computation by hand for a 2-token example
  • Explain the purpose of the sqrt(d_k) scaling factor

Next Project: P02 - Implement Flash Attention