Project 10: Recurrent Character Generator (RNN)

The ancestor of GPT: Build a network that remembers the past and generates text one character at a time.


Project Overview

Attribute        Details
Difficulty       Level 5: Master
Time Estimate    3 Weeks
Language         Python (NumPy Only)
Prerequisites    Project 5 (Autograd), Project 8 (MNIST)
Knowledge Area   Sequence Modeling / NLP
Main Book        "Deep Learning" by Goodfellow, Bengio, Courville - Ch. 10

Learning Objectives

By the end of this project, you will be able to:

  1. Explain why feedforward networks cannot model sequences and what RNNs add
  2. Implement the RNN forward pass with hidden state propagation
  3. Derive and implement Backpropagation Through Time (BPTT)
  4. Diagnose and fix vanishing/exploding gradient problems
  5. Apply gradient clipping to stabilize training
  6. Generate text by sampling from probability distributions
  7. Implement temperature sampling for controlling creativity
  8. Understand the foundations that led to LSTMs, Transformers, and GPT

The Core Question You're Answering

"How does AI handle Time and Sequence?"

Consider this sentence: "The cat sat on the ___"

To predict "mat," you need to remember "cat" and "sat" from earlier. A feedforward network processes each word independently; it has no memory of previous words.

Feedforward Network (No Memory):
┌─────────┐     ┌─────────┐     ┌─────────┐     ┌─────────┐
│  "The"  │     │  "cat"  │     │  "sat"  │     │  "on"   │
└─────────┘     └─────────┘     └─────────┘     └─────────┘
     │               │               │               │
     ▼               ▼               ▼               ▼
  [Pred]          [Pred]          [Pred]          [Pred]
                                                      │
                                               ???  ◄─┘
                                          (No context!)

An RNN creates a loop: the output from processing "The" influences how we process "cat," which influences "sat," and so on:

Recurrent Neural Network (Memory via Hidden State):

          h₀          h₁          h₂          h₃
           │           │           │           │
           ▼           ▼           ▼           ▼
        ┌─────┐     ┌─────┐     ┌─────┐     ┌─────┐
        │ RNN │────►│ RNN │────►│ RNN │────►│ RNN │────► h₄
        │Cell │     │Cell │     │Cell │     │Cell │
        └─────┘     └─────┘     └─────┘     └─────┘
           ▲           ▲           ▲           ▲
           │           │           │           │
        "The"       "cat"       "sat"        "on"
                                               │
                                               ▼
                                            "mat" ✓
                                        (Has context!)

The hidden state h is the network's "short-term memory." It carries information from previous timesteps. This is the fundamental insight that enables language modeling, speech recognition, and eventually led to the Transformer architecture that powers ChatGPT.


Concepts You Must Understand First

1. Why Feedforward Nets Can't Model Sequences

Aspect               Feedforward    Recurrent
Input size           Fixed          Variable length
Memory               None           Hidden state
Processing           Parallel       Sequential
"Cat" vs "Cat sat"   Same output    Different output

Book Reference: "Deep Learning" Ch. 10.1 - Goodfellow

2. The Recurrence Relation

The core of an RNN is this equation:

h_t = f(W_hh · h_{t-1} + W_xh · x_t + b_h)

Where:

  • h_t = hidden state at time t (the "memory")
  • h_{t-1} = hidden state from previous timestep
  • x_t = input at time t
  • W_hh = hidden-to-hidden weights (how memory affects memory)
  • W_xh = input-to-hidden weights (how input affects memory)
  • f = activation function (usually tanh)

        ┌─────────────────┐
        │                 │
        ▼                 │
    ┌───────┐         ┌───┴───┐
    │  h_t  │◄────────│ W_hh  │
    └───┬───┘         └───────┘
        │
        ▲
    ┌───┴───┐
    │ tanh  │
    └───┬───┘
        │
        ▲
    ┌───┴───────────────────┐
    │   W_xh · x_t          │
    │      +                │
    │   W_hh · h_{t-1}      │
    │      +                │
    │   b_h                 │
    └───────────────────────┘

3. Character-Level vs Word-Level Modeling

Approach        Vocabulary Size   Granularity   Pros               Cons
Character       50-100            Fine          Handles any word   Longer sequences
Word            50,000+           Coarse        Semantic meaning   OOV problem
Subword (BPE)   30,000            Medium        Best of both       Complexity

For this project, we use character-level because:

  • Small vocabulary (just letters, punctuation, spaces)
  • Can generate any word, even made-up ones
  • Easier to understand and debug

4. One-Hot Encoding for Characters

Each character becomes a binary vector:

Vocabulary: ['a', 'b', 'c', ..., 'z', ' ', '.']
Size: 28 characters (26 letters, space, period)

'a' = [1, 0, 0, ..., 0, 0, 0]  # Index 0
'b' = [0, 1, 0, ..., 0, 0, 0]  # Index 1
'c' = [0, 0, 1, ..., 0, 0, 0]  # Index 2
' ' = [0, 0, 0, ..., 0, 1, 0]  # Index 26

Book Reference: "Neural Networks and Deep Learning" Ch. 3 - Nielsen

5. Unrolling the RNN Through Time

To apply backpropagation, we "unroll" the loop into a feedforward network:

Unrolled RNN for sequence "cat":

    ┌─────┐     ┌─────┐     ┌─────┐
    │ h_0 │────►│ h_1 │────►│ h_2 │────► h_3
    └──┬──┘     └──┬──┘     └──┬──┘
       │           │           │
       ▼           ▼           ▼
    ┌─────┐     ┌─────┐     ┌─────┐
    │ y_1 │     │ y_2 │     │ y_3 │
    └─────┘     └─────┘     └─────┘
       │           │           │
       ▼           ▼           ▼
   Pred:'a'    Pred:'t'    Pred:' '
       │           │           │
       ▼           ▼           ▼
   Loss_1      Loss_2      Loss_3
       ▲           ▲           ▲
       │           │           │
    'c'          'a'          't'
   (Input)     (Input)      (Input)

Each step predicts the NEXT character given the history.
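
In code, the training pairs for such a sequence are simply the text shifted by one character. A tiny sketch of the idea (the variable names here are illustrative, not from the project files):

text = "cat "                        # toy corpus
chars = sorted(set(text))
char_to_idx = {ch: i for i, ch in enumerate(chars)}

# Inputs are characters 0..T-2, targets are characters 1..T-1:
# the model sees 'c', 'a', 't' and must predict 'a', 't', ' '.
inputs  = [char_to_idx[ch] for ch in text[:-1]]
targets = [char_to_idx[ch] for ch in text[1:]]
print(inputs, targets)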

6. Truncated BPTT

For long sequences, gradients must flow through hundreds of timesteps. This is slow and causes gradient issues. Truncated BPTT limits how far back we propagate:

Full sequence: "The quick brown fox jumps over the lazy dog"

Truncated to chunks of 10:
Chunk 1: "The quick " (carry h forward)
Chunk 2: "brown fox " (carry h forward)
Chunk 3: "jumps over" (carry h forward)
...

Gradients only flow within each chunk, but hidden state carries forward.
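
A minimal sketch of how the chunking works, using the sentence above. In the real training loop, the hidden state returned by the forward pass over one chunk is fed into the next chunk, while gradients are not carried across the boundary:

text = "The quick brown fox jumps over the lazy dog"
seq_length = 10

# Walk the text in fixed-size windows; each window is one BPTT chunk,
# and the input/target pairs inside it are offset by one character.
for start in range(0, len(text) - seq_length - 1, seq_length):
    chunk_in  = text[start : start + seq_length]
    chunk_out = text[start + 1 : start + seq_length + 1]
    print(repr(chunk_in), "->", repr(chunk_out))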

Deep Theoretical Foundation

The RNN Equations in Detail

Forward Pass (for one timestep):

# 1. Combine previous hidden state and current input
z = W_hh @ h_prev + W_xh @ x + b_h

# 2. Apply non-linearity
h = np.tanh(z)

# 3. Compute output (prediction)
y = W_hy @ h + b_y

# 4. Convert to probabilities
p = softmax(y)

Dimensions:

Vocabulary size: V (e.g., 65 characters)
Hidden size: H (e.g., 128)

x:     (V, 1)     - one-hot input
h:     (H, 1)     - hidden state
y:     (V, 1)     - raw scores
p:     (V, 1)     - probabilities

W_xh:  (H, V)     - input to hidden
W_hh:  (H, H)     - hidden to hidden
W_hy:  (V, H)     - hidden to output
b_h:   (H, 1)     - hidden bias
b_y:   (V, 1)     - output bias
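
The training code later in this guide calls an initialize_params helper that is never shown. One possible version consistent with the shapes above (small random weights, zero biases); treat it as a sketch, not the official implementation:

import numpy as np

def initialize_params(vocab_size, hidden_size, scale=0.01):
    """Create the five parameter arrays with the shapes listed above."""
    rng = np.random.default_rng(0)
    return {
        'W_xh': rng.standard_normal((hidden_size, vocab_size)) * scale,
        'W_hh': rng.standard_normal((hidden_size, hidden_size)) * scale,
        'W_hy': rng.standard_normal((vocab_size, hidden_size)) * scale,
        'b_h':  np.zeros((hidden_size, 1)),
        'b_y':  np.zeros((vocab_size, 1)),
    }

The key insertion order matters if, like the training loop below, you zip params.keys() with the gradient tuple returned by backward(); keep it matched to (W_xh, W_hh, W_hy, b_h, b_y).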

Backpropagation Through Time (BPTT)

The gradient must flow backward through time:

Forward direction (t=0 → T):
x_0 ──► h_0 ──► h_1 ──► h_2 ──► ... ──► h_T
         │       │       │               │
         ▼       ▼       ▼               ▼
        y_0     y_1     y_2             y_T
         │       │       │               │
         ▼       ▼       ▼               ▼
        L_0     L_1     L_2             L_T

Backward direction (t=T → 0):
dL/dh_T ──► dL/dh_{T-1} ──► dL/dh_{T-2} ──► ... ──► dL/dh_0

The BPTT Algorithm:

  1. Run forward pass, cache all hidden states and inputs
  2. Compute loss at each timestep
  3. Starting from the last timestep, compute gradients
  4. For each timestep t, the gradient of h_t has TWO sources:
    • From the loss at time t (through y_t)
    • From the next timestep (through h_{t+1})

# Gradient accumulation at timestep t
dh = dh_from_output + dh_from_next_timestep

# Where:
dh_from_output = W_hy.T @ dy_t
dh_from_next_timestep = W_hh.T @ (dh_next * (1 - h_next**2))  # tanh derivative

Why Gradients Explode or Vanish

Consider the gradient flowing from h_T to h_0:

∂h_T/∂h_0 = ∂h_T/∂h_{T-1} × ∂h_{T-1}/∂h_{T-2} × ... × ∂h_1/∂h_0

Each factor is approximately:

∂h_t/∂h_{t-1} ≈ W_hh × diag(1 - h_t²)

Over T timesteps:

∂h_T/∂h_0 ≈ (W_hh)^T × [product of tanh derivatives]

where (W_hh)^T means W_hh multiplied by itself T times (the T-th power, not the transpose).

If the largest eigenvalue of W_hh is greater than 1, the gradient explodes exponentially. If it is less than 1, the gradient vanishes exponentially.

Gradient magnitude over time:

                    Exploding (λ > 1)
                         /
                        /
                       /
                      /
    ─────────────────●────────────────  Stable (λ ≈ 1)
                      \
                       \
                        \
                         \
                    Vanishing (λ < 1)

    ←──────────────────────────────────→
           Time (number of steps)

Book Reference: "Deep Learning" Ch. 10.7 - Goodfellow
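
You can see this effect with a few lines of NumPy: repeatedly multiplying a vector by a matrix whose largest eigenvalue magnitude is above or below 1 makes its norm blow up or shrink. This toy experiment ignores the tanh factors, so it only illustrates the W_hh part of the story:

import numpy as np

rng = np.random.default_rng(0)
for scale, label in [(1.2, "exploding"), (0.8, "vanishing")]:
    # Random recurrent matrix rescaled so its largest |eigenvalue| is `scale`.
    W = rng.standard_normal((64, 64))
    W *= scale / np.max(np.abs(np.linalg.eigvals(W)))

    g = np.ones((64, 1))            # stand-in for a gradient vector
    for _ in range(50):             # 50 "timesteps" of backpropagation
        g = W.T @ g
    print(f"{label}: norm after 50 steps = {np.linalg.norm(g):.3e}")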

Gradient Clipping

To prevent exploding gradients, we clip the gradient norm:

def clip_gradients(gradients, max_norm=5.0):
    total_norm = 0
    for grad in gradients:
        total_norm += np.sum(grad ** 2)
    total_norm = np.sqrt(total_norm)

    if total_norm > max_norm:
        scale = max_norm / total_norm
        for grad in gradients:
            grad *= scale

    return gradients

This keeps training stable without fundamentally solving the vanishing gradient problem.

Teacher Forcing vs Free-Running

Teacher Forcing: During training, always feed the CORRECT previous character. Free-Running: Feed the model's own predictions.

Teacher Forcing (Training):
Input:  "The quick brown"
Target: "he quick brown "

At each step, regardless of what the model predicts,
we feed the TRUE previous character.

Free-Running (Generation):
Input:  "T"
Output: "Th" โ†’ feed 'h'
Output: "The" โ†’ feed 'e'
Output: "The " โ†’ feed ' '
...

Teacher forcing is faster to train but can cause issues at generation time if the model never learned to recover from mistakes.
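
The difference is easiest to see as two loops. Below is a toy, self-contained sketch; step() is a fake stand-in for one RNN step (it returns random probabilities), so only the control flow is meaningful:

import numpy as np

vocab = list("abc ")
rng = np.random.default_rng(0)

def step(x_idx, h):
    """Toy stand-in for one RNN step: fake probabilities, updated state."""
    h = np.tanh(h + 0.1 * x_idx)
    return rng.dirichlet(np.ones(len(vocab))), h

# Teacher forcing: the next input is always the TRUE character.
truth = "abca"
h = np.zeros(1)
for t in range(len(truth) - 1):
    p, h = step(vocab.index(truth[t]), h)     # predict character t+1
    # the loss would compare p against vocab.index(truth[t + 1]) here

# Free-running: the next input is whatever the model just sampled.
h, idx, out = np.zeros(1), vocab.index("a"), "a"
for _ in range(5):
    p, h = step(idx, h)
    idx = rng.choice(len(vocab), p=p)         # feed the prediction back in
    out += vocab[idx]
print(out)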

Sampling Strategies

Argmax Sampling: Always pick the most likely character.

next_char = np.argmax(probabilities)

Problem: Deterministic, repetitive output.

Temperature Sampling: Adjust probability distribution.

def sample_with_temperature(probs, temperature=1.0):
    # Higher temperature = more random
    # Lower temperature = more deterministic
    probs = np.exp(np.log(probs + 1e-8) / temperature)
    probs = probs / np.sum(probs)
    return np.random.choice(len(probs), p=probs)

Temperature Effect:

T=0.1 (Cold):  Probabilities: [0.98, 0.01, 0.01, ...]
               โ†’ Almost deterministic, repetitive

T=1.0 (Normal): Probabilities: [0.60, 0.25, 0.15, ...]
               โ†’ Balanced creativity

T=2.0 (Hot):   Probabilities: [0.35, 0.33, 0.32, ...]
               โ†’ High randomness, nonsensical

Real World Outcome

You will train on a text corpus and generate new text in a similar style.

Training Output

$ python train_rnn.py --text shakespeare.txt --hidden_size 256 --seq_length 50

Loading text file...
Text length: 1,115,394 characters
Vocabulary size: 65 unique characters
Vocabulary: ['\n', ' ', '!', '$', '&', "'", ',', '-', '.', ...]

Building RNN...
Architecture: CharRNN(input=65, hidden=256, output=65)
Total parameters: 99,137

Training with:
  - Sequence length: 50
  - Learning rate: 0.002
  - Gradient clip: 5.0

Iter 0 (loss=4.185):
ajnRCq$?xPkEWYxvJ&RoAGlFc
Vz!DNUSC'Wf:aMz;S
(complete gibberish)

Iter 1000 (loss=2.312):
the senter of the wing and stond
And the dake the with of the
(word-like patterns emerging)

Iter 5000 (loss=1.876):
KING RICHARD:
The world is fair and stright to me,
And I shall have the man of death.
(grammar emerging, character names)

Iter 10000 (loss=1.542):
ROMEO:
What light through yonder window breaks?
It is the east, and Juliet is the sun!
Arise, fair sun, and kill the envious moon.
(Shakespearean style captured!)

Iter 20000 (loss=1.328):
HAMLET:
To be, or not to be, that is the question:
Whether 'tis nobler in the mind to suffer
The slings and arrows of outrageous fortune,
Or to take arms against a sea of troubles.
(Near-perfect Shakespeare mimicry!)

Generation Output

$ python generate.py --model shakespeare_rnn.pkl --seed "To be or" --length 500 --temperature 0.8

Seed: "To be or"

Generated:
To be or not to be, that is the question:
Whether 'tis nobler in the mind to suffer
The slings and arrows of outrageous fortune,
Or to take arms against a sea of troubles,
And by opposing end them? To die: to sleep;
No more; and by a sleep to say we end
The heart-ache and the thousand natural shocks
That flesh is heir to, 'tis a consummation
Devoutly to be wish'd. To die, to sleep;
To sleep: perchance to dream: ay, there's the rub.

$ python generate.py --seed "KING:" --temperature 1.2

KING:
What say you, lords? Shall we make merry sport?
The duke hath spoken words of wondrous might,
And I would fain hear more of this strange tale.
But soft! What messenger approaches now?

Enter MESSENGER

MESSENGER:
My liege, the army of the north doth march!
They come with banners raised and swords unsheathed!

Solution Architecture

System Design

┌──────────────────────────────────────────────────────────────────┐
│                          CharRNN System                          │
├──────────────────────────────────────────────────────────────────┤
│                                                                  │
│  ┌──────────────┐    ┌──────────────┐    ┌──────────────┐        │
│  │ Text Loader  │───►│ Preprocessor │───►│  DataLoader  │        │
│  │              │    │ (Vocabulary) │    │  (Batching)  │        │
│  └──────────────┘    └──────────────┘    └──────────────┘        │
│                                                 │                │
│                                                 ▼                │
│  ┌───────────────────────────────────────────────────────────┐   │
│  │                          RNN Cell                         │   │
│  │                                                           │   │
│  │   x_t ──► [W_xh] ──┐                                      │   │
│  │                    ├──► [+] ──► [tanh] ──► h_t            │   │
│  │   h_{t-1} ► [W_hh] ┘                        │             │   │
│  │                                             │             │   │
│  │                            h_t ──► [W_hy] ──► y_t         │   │
│  └───────────────────────────────────────────────────────────┘   │
│                                                 │                │
│                                                 ▼                │
│  ┌──────────────┐    ┌──────────────┐    ┌──────────────┐        │
│  │   Softmax    │───►│  CrossEnt    │───►│   BPTT       │        │
│  │              │    │  Loss        │    │  Backward    │        │
│  └──────────────┘    └──────────────┘    └──────────────┘        │
│                                                 │                │
│                                                 ▼                │
│  ┌──────────────┐    ┌──────────────┐    ┌──────────────┐        │
│  │  Grad Clip   │───►│   Optimizer  │───►│  Generator   │        │
│  │              │    │   (Adagrad)  │    │  (Sampling)  │        │
│  └──────────────┘    └──────────────┘    └──────────────┘        │
│                                                                  │
└──────────────────────────────────────────────────────────────────┘

Class Structure

class CharRNN:
    """Character-level RNN for text generation"""

    def __init__(self, vocab_size, hidden_size):
        self.vocab_size = vocab_size
        self.hidden_size = hidden_size

        # Initialize weights
        self.W_xh = None  # (hidden_size, vocab_size)
        self.W_hh = None  # (hidden_size, hidden_size)
        self.W_hy = None  # (vocab_size, hidden_size)
        self.b_h = None   # (hidden_size, 1)
        self.b_y = None   # (vocab_size, 1)

        # Cache for BPTT
        self.cache = {}

    def forward(self, inputs, h_prev):
        """Forward pass through time"""
        pass

    def backward(self, targets):
        """Backpropagation through time"""
        pass

    def sample(self, seed, length, temperature):
        """Generate text"""
        pass

File Structure

char_rnn/
├── data.py           # Text loading and preprocessing
├── model.py          # RNN implementation
├── train.py          # Training loop
├── generate.py       # Text generation
├── utils.py          # Gradient clipping, sampling
└── shakespeare.txt   # Training data

Phased Implementation Guide

Phase 1: Text Preprocessing and Vocabulary (Day 1-2)

Goal: Convert text file to tensors.

Steps:

  1. Load text file
  2. Build character-to-index and index-to-character mappings
  3. Encode text as integers

class Vocabulary:
    def __init__(self, text):
        self.chars = sorted(list(set(text)))
        self.char_to_idx = {ch: i for i, ch in enumerate(self.chars)}
        self.idx_to_char = {i: ch for i, ch in enumerate(self.chars)}
        self.size = len(self.chars)

    def encode(self, text):
        return [self.char_to_idx[ch] for ch in text]

    def decode(self, indices):
        return ''.join([self.idx_to_char[i] for i in indices])

Verification:

  • Encode "hello", decode back, verify it's "hello"
  • Print vocabulary size and sample characters

Phase 2: Character Encoding (One-Hot) (Day 2-3)

Goal: Convert integer indices to one-hot vectors.

def one_hot(index, vocab_size):
    vec = np.zeros((vocab_size, 1))
    vec[index] = 1.0
    return vec

def indices_to_one_hot(indices, vocab_size):
    return [one_hot(i, vocab_size) for i in indices]

Verification:

  • One-hot of index 3 with vocab_size 10 should be [0,0,0,1,0,0,0,0,0,0]

Phase 3: RNN Cell Forward Pass (Day 3-4)

Goal: Implement one timestep of the RNN.

def rnn_step_forward(x, h_prev, W_xh, W_hh, W_hy, b_h, b_y):
    """
    x:      (vocab_size, 1) - one-hot input
    h_prev: (hidden_size, 1) - previous hidden state

    Returns:
    h_next: (hidden_size, 1) - next hidden state
    y:      (vocab_size, 1) - output logits
    cache:  tuple of values needed for backward pass
    """
    # Hidden state
    z = W_xh @ x + W_hh @ h_prev + b_h
    h_next = np.tanh(z)

    # Output
    y = W_hy @ h_next + b_y

    cache = (x, h_prev, h_next, z, y)
    return h_next, y, cache

Verification:

  • Check output shapes
  • h should be bounded [-1, 1] (tanh)

Phase 4: Unrolled Forward Pass (Day 4-5)

Goal: Forward pass through entire sequence.

def forward(inputs, h_init, params):
    """
    inputs: list of one-hot vectors, length T
    h_init: initial hidden state

    Returns:
    outputs: list of output logits
    h_final: final hidden state
    caches: list of caches for backward pass
    """
    h = h_init
    outputs = []
    caches = []

    for x in inputs:
        h, y, cache = rnn_step_forward(x, h, **params)
        outputs.append(y)
        caches.append(cache)

    return outputs, h, caches

Phase 5: Loss Computation (Day 5-6)

Goal: Compute cross-entropy loss at each timestep.

def softmax(y):
    exp_y = np.exp(y - np.max(y))  # numerical stability
    return exp_y / np.sum(exp_y)

def cross_entropy_loss(outputs, targets):
    """
    outputs: list of logits
    targets: list of target indices

    Returns:
    loss: scalar
    probs: list of probability vectors (for backward pass)
    """
    loss = 0
    probs = []

    for y, target in zip(outputs, targets):
        p = softmax(y)
        loss -= np.log(p[target] + 1e-8)
        probs.append(p)

    return loss, probs

Phase 6: BPTT Backward Pass (Day 6-9) - THE HARD PART

Goal: Compute gradients flowing backward through time.

def backward(probs, targets, caches, params):
    """
    probs: list of softmax outputs
    targets: list of target indices
    caches: list of (x, h_prev, h_next, z, y) tuples

    Returns:
    gradients for all parameters
    """
    # params is a dict (as in the training loop), so pull the arrays out by name
    W_xh, W_hh, W_hy = params['W_xh'], params['W_hh'], params['W_hy']
    b_h, b_y = params['b_h'], params['b_y']

    # Initialize gradients
    dW_xh = np.zeros_like(W_xh)
    dW_hh = np.zeros_like(W_hh)
    dW_hy = np.zeros_like(W_hy)
    db_h = np.zeros_like(b_h)
    db_y = np.zeros_like(b_y)

    # Gradient from next timestep (starts at 0)
    dh_next = np.zeros((W_hh.shape[0], 1))

    # Backward through time
    for t in reversed(range(len(caches))):
        x, h_prev, h_next, z, y = caches[t]
        p = probs[t]
        target = targets[t]

        # Gradient of loss w.r.t. output
        dy = p.copy()
        dy[target] -= 1  # softmax + cross-entropy gradient

        # Gradient w.r.t. W_hy and b_y
        dW_hy += dy @ h_next.T
        db_y += dy

        # Gradient w.r.t. hidden state (from output AND next timestep)
        dh = W_hy.T @ dy + dh_next

        # Gradient through tanh
        dz = dh * (1 - h_next ** 2)

        # Gradient w.r.t. W_xh, W_hh, b_h
        dW_xh += dz @ x.T
        dW_hh += dz @ h_prev.T
        db_h += dz

        # Gradient to pass to previous timestep
        dh_next = W_hh.T @ dz

    return dW_xh, dW_hh, dW_hy, db_h, db_y

Key Insight: dh has TWO sources:

  1. From the loss at this timestep (through W_hy)
  2. From the next timestep (dh_next)

Phase 7: Gradient Clipping (Day 9-10)

Goal: Prevent exploding gradients.

def clip_gradients(grads, max_norm=5.0):
    total_norm = 0
    for grad in grads:
        total_norm += np.sum(grad ** 2)
    total_norm = np.sqrt(total_norm)

    if total_norm > max_norm:
        for grad in grads:
            grad *= max_norm / total_norm

    return grads, total_norm

Phase 8: Training Loop (Day 10-14)

Goal: Put it all together.

def train(text, hidden_size=128, seq_length=25, learning_rate=0.1):
    vocab = Vocabulary(text)
    data = vocab.encode(text)

    # Initialize parameters
    params = initialize_params(vocab.size, hidden_size)

    # Adagrad memory (for adaptive learning rate)
    mem = {k: np.zeros_like(v) for k, v in params.items()}

    h_prev = np.zeros((hidden_size, 1))
    pointer = 0

    for iteration in range(100000):
        # Reset if end of data
        if pointer + seq_length + 1 >= len(data):
            h_prev = np.zeros((hidden_size, 1))
            pointer = 0

        # Get batch
        inputs = data[pointer:pointer + seq_length]
        targets = data[pointer + 1:pointer + seq_length + 1]

        # Forward
        outputs, h_prev, caches = forward(
            indices_to_one_hot(inputs, vocab.size),
            h_prev, params
        )

        # Loss
        loss, probs = cross_entropy_loss(outputs, targets)

        # Backward
        grads = backward(probs, targets, caches, params)

        # Clip gradients
        grads, grad_norm = clip_gradients(grads)

        # Update (Adagrad)
        # NOTE: this relies on params being created in the order
        # (W_xh, W_hh, W_hy, b_h, b_y) so the keys line up with grads
        for param_name, grad in zip(params.keys(), grads):
            mem[param_name] += grad ** 2
            params[param_name] -= learning_rate * grad / (np.sqrt(mem[param_name]) + 1e-8)

        pointer += seq_length

        # Sample occasionally
        if iteration % 1000 == 0:
            print(f"Iter {iteration}, Loss: {loss:.4f}")
            sample_text = sample(params, vocab, h_prev, 200)
            print(sample_text)

Phase 9: Text Generation (Day 14-17)

Goal: Generate new text by sampling.

def sample(params, vocab, h, length, temperature=1.0, seed=None):
    """Generate text character by character"""

    if seed:
        # Warm up hidden state with seed
        for ch in seed:
            x = one_hot(vocab.char_to_idx[ch], vocab.size)
            h, _, _ = rnn_step_forward(x, h, **params)
        text = seed
    else:
        # Start with random character
        idx = np.random.randint(vocab.size)
        text = vocab.idx_to_char[idx]

    # Generate
    for _ in range(length):
        x = one_hot(vocab.char_to_idx[text[-1]], vocab.size)
        h, y, _ = rnn_step_forward(x, h, **params)

        # Apply temperature
        p = softmax(y / temperature)

        # Sample
        idx = np.random.choice(vocab.size, p=p.flatten())
        text += vocab.idx_to_char[idx]

    return text

Questions to Guide Your Design

  1. Hidden State Initialization: Should h_0 be zeros? Random? Learned?

  2. Sequence Length Trade-off: Longer sequences capture more context but have worse gradient flow. What's optimal?

  3. Why Adagrad?: Why use Adagrad instead of vanilla SGD for RNNs?

  4. Gradient Clipping Threshold: How do you choose max_norm = 5.0? What happens with 1.0 or 50.0?

  5. Temperature Selection: When should you use T=0.5 vs T=1.5?


Thinking Exercise

Trace an Unrolled RNN by Hand

Consider the 3-character sequence "ab ": given the inputs "ab", the network should predict "b " (each target is the input shifted by one).

Vocabulary: {a=0, b=1, ' '=2}
Hidden size: 2
Initial h_0 = [0, 0]

W_xh = [[0.1, 0.2, 0.3],
        [0.4, 0.5, 0.6]]   (2x3)

W_hh = [[0.1, 0.2],
        [0.3, 0.4]]         (2x2)

W_hy = [[0.1, 0.2],
        [0.3, 0.4],
        [0.5, 0.6]]         (3x2)

Task: Compute h_1 and y_1 after processing 'a' (biases are taken as zero in this trace).

x_1 = [1, 0, 0]  (one-hot for 'a')

z_1 = W_xh @ x_1 + W_hh @ h_0
    = [[0.1], [0.4]] + [[0], [0]]
    = [[0.1], [0.4]]

h_1 = tanh(z_1) = [[0.0997], [0.3799]]

y_1 = W_hy @ h_1
    = [[0.1*0.0997 + 0.2*0.3799],
       [0.3*0.0997 + 0.4*0.3799],
       [0.5*0.0997 + 0.6*0.3799]]
    = [[0.086], [0.182], [0.278]]

p_1 = softmax(y_1) ≈ [0.30, 0.33, 0.37]

Target is 'b' (index 1). Loss = -log(0.33) ≈ 1.10

Now trace the backward pass:

dy_1 = p_1 - one_hot(target) = [0.30, -0.67, 0.37]
dh_1 = W_hy.T @ dy_1 = ...

Complete this trace to understand BPTT deeply.
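
A few lines of NumPy reproduce the forward half of this trace (biases are zero, as above) and give you numbers to check the rest of your backward pass against:

import numpy as np

W_xh = np.array([[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]])
W_hh = np.array([[0.1, 0.2], [0.3, 0.4]])
W_hy = np.array([[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]])

x_1 = np.array([[1.0], [0.0], [0.0]])      # one-hot 'a'
h_0 = np.zeros((2, 1))

h_1 = np.tanh(W_xh @ x_1 + W_hh @ h_0)     # ≈ [[0.0997], [0.3799]]
y_1 = W_hy @ h_1                           # ≈ [[0.086], [0.182], [0.278]]
p_1 = np.exp(y_1) / np.sum(np.exp(y_1))    # ≈ [[0.30], [0.33], [0.37]]
loss = -np.log(p_1[1, 0])                  # target 'b' (index 1), ≈ 1.10

dy_1 = p_1.copy()
dy_1[1] -= 1.0                             # ≈ [[0.30], [-0.67], [0.37]]
dh_1 = W_hy.T @ dy_1                       # continue the trace from here
print(h_1, y_1, p_1, loss, dy_1, dh_1, sep="\n")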


Testing Strategy

Unit Tests

def test_rnn_forward_shapes():
    vocab_size, hidden_size = 10, 5
    params = initialize_params(vocab_size, hidden_size)
    x = one_hot(3, vocab_size)
    h = np.zeros((hidden_size, 1))

    h_next, y, cache = rnn_step_forward(x, h, **params)

    assert h_next.shape == (hidden_size, 1)
    assert y.shape == (vocab_size, 1)
    assert np.all(np.abs(h_next) <= 1)  # tanh bounds

def test_gradient_check():
    """Numerical gradient verification"""
    # ... compare analytical vs numerical gradients

Integration Tests

  1. Train on "ababab..." for 1000 iterations; the model should learn to predict the alternating pattern (a minimal version of this test is sketched below)
  2. Train on "hello hello hello"; the model should learn to complete "hel" → "lo"
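
A minimal version of the first test, assuming the functions from the earlier phases (Vocabulary, indices_to_one_hot, forward, cross_entropy_loss, backward, clip_gradients) plus an initialize_params helper are importable; sizes are kept tiny so it runs in seconds:

import numpy as np

def test_learns_alternating_pattern():
    """Overfit a tiny RNN on 'ababab...'; per-char loss should fall near zero."""
    text = "ab" * 200
    vocab = Vocabulary(text)
    data = vocab.encode(text)
    params = initialize_params(vocab.size, hidden_size=16)
    mem = {k: np.zeros_like(v) for k, v in params.items()}

    h = np.zeros((16, 1))
    seq_length = 10
    for it in range(1000):
        start = (it * seq_length) % (len(data) - seq_length - 1)
        inputs = indices_to_one_hot(data[start:start + seq_length], vocab.size)
        targets = data[start + 1:start + seq_length + 1]

        outputs, h, caches = forward(inputs, h, params)
        loss, probs = cross_entropy_loss(outputs, targets)
        grads = backward(probs, targets, caches, params)
        grads, _ = clip_gradients(grads)
        for name, grad in zip(params.keys(), grads):   # same order as backward()
            mem[name] += grad ** 2
            params[name] -= 0.1 * grad / (np.sqrt(mem[name]) + 1e-8)

    assert loss / seq_length < 0.1, f"per-character loss too high: {loss / seq_length}"

The starting loss should be close to ln(2) ≈ 0.69 per character and drop sharply once the alternation is learned.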

Common Pitfalls and Debugging Tips

1. Exploding Gradients

Symptom: Loss becomes NaN, weights become huge.

Cause: Gradient clipping threshold too high or missing.

Fix:

grads = clip_gradients(grads, max_norm=5.0)

2. Vanishing Gradients

Symptom: Model stops learning, loss plateaus high.

Cause: Long sequences, small hidden state.

Fix:

  • Use shorter sequence length
  • Increase hidden size
  • Consider LSTM/GRU

3. Mode Collapse

Symptom: Model generates the same character repeatedly ("eeeeee...").

Cause: Temperature too low, or undertrained.

Fix: Increase temperature during generation.

4. Slow Training

Symptom: Each iteration takes seconds.

Cause: Python loops instead of vectorization.

Fix: Batch multiple sequences, use matrix operations.
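
One cheap speedup that keeps the rest of the code unchanged: multiplying a matrix by a one-hot vector just selects a column, so you can skip building one-hot vectors entirely. A small illustration of the idea (the sizes are arbitrary):

import numpy as np

rng = np.random.default_rng(0)
W_xh = rng.standard_normal((128, 65))
idx = 42                                    # integer index of the input character

one_hot = np.zeros((65, 1))
one_hot[idx] = 1.0

# W_xh @ one_hot is exactly column `idx` of W_xh; indexing avoids the matmul.
assert np.allclose(W_xh @ one_hot, W_xh[:, idx:idx + 1])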

5. Poor Generation Quality

Symptom: Output is word-like but nonsensical.

Cause: Undertrained, or sequence length too short.

Fix: Train longer, increase hidden size, use longer sequences.


The Interview Questions They'll Ask

  1. "What is the vanishing gradient problem?"
    • Gradients shrink exponentially when flowing through many timesteps
    • Caused by repeated multiplication by weights < 1 and tanh derivatives
    • Makes it hard to learn long-range dependencies
  2. "How does LSTM solve vanishing gradients?"
    • LSTM adds a โ€œcell stateโ€ that flows unchanged through time
    • Gradients can flow through the cell state without vanishing
    • Forget/input/output gates control information flow
  3. "What is teacher forcing?"
    • During training, feed the true previous output instead of predicted
    • Speeds up training but can cause exposure bias
  4. "How do RNNs differ from Transformers?"
    • RNNs process sequentially, Transformers in parallel
    • RNNs have fixed context (hidden state), Transformers use attention
    • Transformers scale better but need more memory
  5. "How would you implement beam search?"
    • Keep track of top-k sequences at each step
    • Expand each sequence, keep best k overall
    • More diverse outputs than greedy/sampling
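
For question 5, here is a compact, self-contained beam-search sketch. next_log_probs() is a toy stand-in for "one RNN step plus log-softmax"; in a real character RNN you would also carry a hidden state per beam:

import numpy as np

vocab = list("ab ")

def next_log_probs(seq):
    """Toy stand-in for one RNN step: prefers alternating 'a'/'b'."""
    p = np.full(len(vocab), 0.1)
    p[vocab.index("b" if seq[-1] == "a" else "a")] = 0.8
    return np.log(p / p.sum())

def beam_search(seed="a", k=3, steps=5):
    beams = [(0.0, seed)]                          # (cumulative log-prob, text)
    for _ in range(steps):
        candidates = []
        for score, seq in beams:
            logp = next_log_probs(seq)
            for i, ch in enumerate(vocab):         # expand every beam by every char
                candidates.append((score + logp[i], seq + ch))
        beams = sorted(candidates, reverse=True)[:k]   # keep only the best k
    return beams

print(beam_search())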

Hints in Layers

Hint 1: Weight Initialization

Use small random weights:

W = np.random.randn(m, n) * 0.01

Too large → exploding gradients. Too small → vanishing.

Hint 2: Hidden State Management

In this NumPy implementation, BPTT is naturally truncated at each chunk boundary, because backward() only sees the states cached for the current chunk. If you later port the model to an autograd framework, detach (or copy) the hidden state between batches so the graph does not grow without bound:

h_prev = h_prev.copy()  # break the gradient chain at the chunk boundary

Hint 3: Debugging BPTT

Test with sequence length 1 first:

# With seq_length=1, BPTT reduces to normal backprop
# Much easier to debug

Hint 4: Numerical Gradient Check

def numerical_gradient(f, x, epsilon=1e-5):
    grad = np.zeros_like(x)
    for i in range(x.size):
        x_plus = x.copy()
        x_plus.flat[i] += epsilon
        x_minus = x.copy()
        x_minus.flat[i] -= epsilon
        grad.flat[i] = (f(x_plus) - f(x_minus)) / (2 * epsilon)
    return grad
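
To turn the test_gradient_check stub above into a real test, wrap the loss as a function of one parameter matrix and compare it against the analytical gradient from backward(). A sketch assuming the phase functions and an initialize_params helper are available:

import numpy as np

def relative_error(a, b):
    return np.max(np.abs(a - b) / (np.abs(a) + np.abs(b) + 1e-12))

def check_W_hh_gradient():
    params = initialize_params(vocab_size=5, hidden_size=4)
    inputs = indices_to_one_hot([0, 1, 2, 3], vocab_size=5)
    targets = [1, 2, 3, 4]
    h0 = np.zeros((4, 1))

    def loss_fn(W_hh_flat):
        p = dict(params, W_hh=W_hh_flat.reshape(params['W_hh'].shape))
        outputs, _, _ = forward(inputs, h0, p)
        loss, _ = cross_entropy_loss(outputs, targets)
        return float(np.sum(loss))

    outputs, _, caches = forward(inputs, h0, params)
    _, probs = cross_entropy_loss(outputs, targets)
    _, dW_hh, _, _, _ = backward(probs, targets, caches, params)

    num = numerical_gradient(loss_fn, params['W_hh'].flatten())
    print("relative error:", relative_error(num.reshape(dW_hh.shape), dW_hh))

A relative error around 1e-6 or smaller suggests the BPTT gradients are correct; errors near 1 almost always mean a bug.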

Extensions and Challenges

1. Implement LSTM

Add forget gate, input gate, output gate, and cell state:

f_t = sigmoid(W_f @ [h_{t-1}, x_t])  # Forget gate
i_t = sigmoid(W_i @ [h_{t-1}, x_t])  # Input gate
c_t = f_t * c_{t-1} + i_t * tanh(W_c @ [h_{t-1}, x_t])  # Cell state
o_t = sigmoid(W_o @ [h_{t-1}, x_t])  # Output gate
h_t = o_t * tanh(c_t)  # Hidden state
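
A NumPy sketch of one LSTM timestep in the same style as rnn_step_forward; here the concatenation [h_{t-1}, x_t] is handled explicitly, and the weight shapes (hidden_size, hidden_size + vocab_size) are an assumption that matches that layout:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step_forward(x, h_prev, c_prev, W_f, W_i, W_c, W_o, b_f, b_i, b_c, b_o):
    """One LSTM timestep. Each W_* has shape (hidden, hidden + vocab)."""
    hx = np.vstack([h_prev, x])             # concatenate [h_{t-1}, x_t]

    f = sigmoid(W_f @ hx + b_f)             # forget gate
    i = sigmoid(W_i @ hx + b_i)             # input gate
    o = sigmoid(W_o @ hx + b_o)             # output gate
    c_tilde = np.tanh(W_c @ hx + b_c)       # candidate cell state

    c = f * c_prev + i * c_tilde            # new cell state
    h = o * np.tanh(c)                      # new hidden state
    return h, c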

2. Add Temperature Scheduling

Start with high temperature (exploration), decay to low (exploitation):

temperature = max(0.5, 2.0 - iteration / 10000)

3. Try on Music (MIDI)

Each โ€œcharacterโ€ is a MIDI note. Generate melodies!

4. Try on Code

Train on Python source code. Generate syntactically plausible code.

5. Implement GRU

Simplified LSTM with only two gates:

z_t = sigmoid(W_z @ [h_{t-1}, x_t])  # Update gate
r_t = sigmoid(W_r @ [h_{t-1}, x_t])  # Reset gate
h_t = (1 - z_t) * h_{t-1} + z_t * tanh(W @ [r_t * h_{t-1}, x_t])

Real-World Connections

Ancestry of Modern LLMs

1986: RNN (Rumelhart)
  ↓
1997: LSTM (Hochreiter & Schmidhuber)
  ↓
2014: GRU (Cho et al.)
  ↓
2017: Transformer (Vaswani et al.)
  ↓
2018: GPT-1 (OpenAI)
  ↓
2020: GPT-3
  ↓
2023: GPT-4, Claude, etc.

The RNN you build is the great-grandfather of ChatGPT!

Where RNNs Are Still Used

  • Speech recognition: Sequential audio processing
  • Handwriting recognition: Stroke sequences
  • Time series forecasting: Stock prices, weather
  • Edge devices: Simpler than Transformers, less memory

Books That Will Help

Topic                      Book                                           Chapter
RNN fundamentals           "Deep Learning" by Goodfellow                  Ch. 10
Practical implementation   "Grokking Deep Learning" by Trask              Ch. 11-12
LSTM/GRU                   "Deep Learning" by Goodfellow                  Ch. 10.10
NLP applications           "Speech and Language Processing" by Jurafsky   Ch. 9

Online Resources

  • Andrej Karpathy's "The Unreasonable Effectiveness of Recurrent Neural Networks" (blog post)
  • Karpathy's min-char-rnn.py (reference implementation)
  • colah's blog post "Understanding LSTM Networks"

Self-Assessment Checklist

Understanding

  • Can explain why feedforward nets can't model sequences
  • Can draw the unrolled RNN diagram
  • Can explain vanishing/exploding gradients
  • Can describe the difference between RNN, LSTM, GRU

Implementation

  • RNN forward pass produces correct shapes
  • BPTT backward pass passes gradient check
  • Gradient clipping prevents explosions
  • Training loop reduces loss over time

Generation

  • Can generate text with different temperatures
  • Output quality improves with training
  • Can seed generation with custom text

Extensions

  • Understand how LSTM adds cell state
  • Could explain how to implement beam search
  • Understand connection to Transformers

Final Note

This project is the bridge between classical neural networks and modern language models. The hidden state you implement here is a direct ancestor of the key-value caches in Transformers. The sequence modeling intuition you develop will serve you when understanding GPT, BERT, and beyond.

When you successfully generate Shakespearean text, you will have implemented the core insight that launched the NLP revolution: neural networks can learn to model the probability distribution of language.

You are building the foundation of everything from autocomplete to ChatGPT.