Project 10: Recurrent Character Generator (RNN)

The ancestor of GPT: Build a network that remembers the past and generates text one character at a time.


Project Overview

Attribute        Details
Difficulty       Level 5: Master
Time Estimate    3 Weeks
Language         Python (NumPy Only)
Prerequisites    Project 5 (Autograd), Project 8 (MNIST)
Knowledge Area   Sequence Modeling / NLP
Main Book        "Deep Learning" by Goodfellow, Bengio, Courville - Ch. 10

Learning Objectives

By the end of this project, you will be able to:

  1. Explain why feedforward networks cannot model sequences and what RNNs add
  2. Implement the RNN forward pass with hidden state propagation
  3. Derive and implement Backpropagation Through Time (BPTT)
  4. Diagnose and fix vanishing/exploding gradient problems
  5. Apply gradient clipping to stabilize training
  6. Generate text by sampling from probability distributions
  7. Implement temperature sampling for controlling creativity
  8. Understand the foundations that led to LSTMs, Transformers, and GPT

The Core Question You're Answering

"How does AI handle Time and Sequence?"

Consider this sentence: "The cat sat on the ___"

To predict "mat," you need to remember "cat" and "sat" from earlier. A feedforward network processes each word independently; it has no memory of previous words.

Feedforward Network (No Memory):
┌─────────┐     ┌─────────┐     ┌─────────┐     ┌─────────┐
│  "The"  │     │  "cat"  │     │  "sat"  │     │  "on"   │
└─────────┘     └─────────┘     └─────────┘     └─────────┘
     │               │               │               │
     ▼               ▼               ▼               ▼
  [Pred]          [Pred]          [Pred]          [Pred]
                                                      │
                                               ???  ◄─┘
                                          (No context!)

An RNN creates a loop: the output from processing "The" influences how we process "cat," which influences "sat," and so on:

Recurrent Neural Network (Memory via Hidden State):

          h₀          h₁          h₂          h₃
           │           │           │           │
           ▼           ▼           ▼           ▼
        ┌─────┐     ┌─────┐     ┌─────┐     ┌─────┐
        │ RNN │────►│ RNN │────►│ RNN │────►│ RNN │────► h₄
        │Cell │     │Cell │     │Cell │     │Cell │
        └─────┘     └─────┘     └─────┘     └─────┘
           ▲           ▲           ▲           ▲
           │           │           │           │
        "The"       "cat"       "sat"        "on"
                                               │
                                               ▼
                                            "mat" ✓
                                        (Has context!)

The hidden state h is the network's "short-term memory." It carries information from previous timesteps. This is the fundamental insight that enables language modeling, speech recognition, and eventually led to the Transformer architecture that powers ChatGPT.


Concepts You Must Understand First

1. Why Feedforward Nets Can't Model Sequences

Aspect               Feedforward    Recurrent
Input size           Fixed          Variable length
Memory               None           Hidden state
Processing           Parallel       Sequential
"Cat" vs "Cat sat"   Same output    Different output

Book Reference: "Deep Learning" Ch. 10.1 - Goodfellow

2. The Recurrence Relation

The core of an RNN is this equation:

h_t = f(W_hh · h_{t-1} + W_xh · x_t + b_h)

Where:

  • h_t = hidden state at time t (the "memory")
  • h_{t-1} = hidden state from previous timestep
  • x_t = input at time t
  • W_hh = hidden-to-hidden weights (how memory affects memory)
  • W_xh = input-to-hidden weights (how input affects memory)
  • f = activation function (usually tanh)

        ┌─────────────────┐
        │                 │
        ▼                 │
    ┌───────┐         ┌───┴───┐
    │  h_t  │◄────────│ W_hh  │
    └───┬───┘         └───────┘
        │
        ▲
    ┌───┴───┐
    │ tanh  │
    └───┬───┘
        │
        ▲
    ┌───┴───────────────────┐
    │   W_xh · x_t          │
    │      +                │
    │   W_hh · h_{t-1}      │
    │      +                │
    │   b_h                 │
    └───────────────────────┘

3. Character-Level vs Word-Level Modeling

Approach        Vocabulary Size   Granularity   Pros               Cons
Character       50-100            Fine          Handles any word   Longer sequences
Word            50,000+           Coarse        Semantic meaning   OOV problem
Subword (BPE)   30,000            Medium        Best of both       Complexity

For this project, we use character-level because:

  • Small vocabulary (just letters, punctuation, spaces)
  • Can generate any word, even made-up ones
  • Easier to understand and debug

4. One-Hot Encoding for Characters

Each character becomes a binary vector:

Vocabulary: ['a', 'b', 'c', ..., 'z', ' ', '.']
Size: 28 characters (26 letters, space, period)

'a' = [1, 0, 0, ..., 0, 0, 0]  # Index 0
'b' = [0, 1, 0, ..., 0, 0, 0]  # Index 1
'c' = [0, 0, 1, ..., 0, 0, 0]  # Index 2
' ' = [0, 0, 0, ..., 0, 1, 0]  # Index 26

Book Reference: "Neural Networks and Deep Learning" Ch. 3 - Nielsen

5. Unrolling the RNN Through Time

To apply backpropagation, we "unroll" the loop into a feedforward network:

Unrolled RNN for sequence "cat":

    ┌─────┐     ┌─────┐     ┌─────┐
    │ h_0 │────►│ h_1 │────►│ h_2 │────► h_3
    └──┬──┘     └──┬──┘     └──┬──┘
       │           │           │
       ▼           ▼           ▼
    ┌─────┐     ┌─────┐     ┌─────┐
    │ y_1 │     │ y_2 │     │ y_3 │
    └─────┘     └─────┘     └─────┘
       │           │           │
       ▼           ▼           ▼
   Pred:'a'    Pred:'t'    Pred:' '
       │           │           │
       ▼           ▼           ▼
   Loss_1      Loss_2      Loss_3
       ▲           ▲           ▲
       │           │           │
    'c'          'a'          't'
   (Input)     (Input)      (Input)

Each step predicts the NEXT character given the history.
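
In code, the training pairs for such a sequence are simply the text shifted by one character. A tiny sketch of the idea (the variable names here are illustrative, not from the project files):

text = "cat "                        # toy corpus
chars = sorted(set(text))
char_to_idx = {ch: i for i, ch in enumerate(chars)}

# Inputs are characters 0..T-2, targets are characters 1..T-1:
# the model sees 'c', 'a', 't' and must predict 'a', 't', ' '.
inputs  = [char_to_idx[ch] for ch in text[:-1]]
targets = [char_to_idx[ch] for ch in text[1:]]
print(inputs, targets)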

6. Truncated BPTT

For long sequences, gradients must flow through hundreds of timesteps. This is slow and causes gradient issues. Truncated BPTT limits how far back we propagate:

Full sequence: "The quick brown fox jumps over the lazy dog"

Truncated to chunks of 10:
Chunk 1: "The quick " (carry h forward)
Chunk 2: "brown fox " (carry h forward)
Chunk 3: "jumps over" (carry h forward)
...

Gradients only flow within each chunk, but hidden state carries forward.
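
A minimal sketch of how the chunking works, using the sentence above. In the real training loop, the hidden state returned by the forward pass over one chunk is fed into the next chunk, while gradients are not carried across the boundary:

text = "The quick brown fox jumps over the lazy dog"
seq_length = 10

# Walk the text in fixed-size windows; each window is one BPTT chunk,
# and the input/target pairs inside it are offset by one character.
for start in range(0, len(text) - seq_length - 1, seq_length):
    chunk_in  = text[start : start + seq_length]
    chunk_out = text[start + 1 : start + seq_length + 1]
    print(repr(chunk_in), "->", repr(chunk_out))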

Deep Theoretical Foundation

The RNN Equations in Detail

Forward Pass (for one timestep):

# 1. Combine previous hidden state and current input
z = W_hh @ h_prev + W_xh @ x + b_h

# 2. Apply non-linearity
h = np.tanh(z)

# 3. Compute output (prediction)
y = W_hy @ h + b_y

# 4. Convert to probabilities
p = softmax(y)

Dimensions:

Vocabulary size: V (e.g., 65 characters)
Hidden size: H (e.g., 128)

x:     (V, 1)     - one-hot input
h:     (H, 1)     - hidden state
y:     (V, 1)     - raw scores
p:     (V, 1)     - probabilities

W_xh:  (H, V)     - input to hidden
W_hh:  (H, H)     - hidden to hidden
W_hy:  (V, H)     - hidden to output
b_h:   (H, 1)     - hidden bias
b_y:   (V, 1)     - output bias
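
The training code later in this guide calls an initialize_params helper that is never shown. One possible version consistent with the shapes above (small random weights, zero biases); treat it as a sketch, not the official implementation:

import numpy as np

def initialize_params(vocab_size, hidden_size, scale=0.01):
    """Create the five parameter arrays with the shapes listed above."""
    rng = np.random.default_rng(0)
    return {
        'W_xh': rng.standard_normal((hidden_size, vocab_size)) * scale,
        'W_hh': rng.standard_normal((hidden_size, hidden_size)) * scale,
        'W_hy': rng.standard_normal((vocab_size, hidden_size)) * scale,
        'b_h':  np.zeros((hidden_size, 1)),
        'b_y':  np.zeros((vocab_size, 1)),
    }

The key insertion order matters if, like the training loop below, you zip params.keys() with the gradient tuple returned by backward(); keep it matched to (W_xh, W_hh, W_hy, b_h, b_y).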

Backpropagation Through Time (BPTT)

The gradient must flow backward through time:

Forward direction (t=0 → T):
x_0 ──► h_0 ──► h_1 ──► h_2 ──► ... ──► h_T
         │       │       │               │
         ▼       ▼       ▼               ▼
        y_0     y_1     y_2             y_T
         │       │       │               │
         ▼       ▼       ▼               ▼
        L_0     L_1     L_2             L_T

Backward direction (t=T → 0):
dL/dh_T ──► dL/dh_{T-1} ──► dL/dh_{T-2} ──► ... ──► dL/dh_0

The BPTT Algorithm:

  1. Run forward pass, cache all hidden states and inputs
  2. Compute loss at each timestep
  3. Starting from the last timestep, compute gradients
  4. For each timestep t, the gradient of h_t has TWO sources:
    • From the loss at time t (through y_t)
    • From the next timestep (through h_{t+1})

# Gradient accumulation at timestep t
dh = dh_from_output + dh_from_next_timestep

# Where:
dh_from_output = W_hy.T @ dy_t
dh_from_next_timestep = W_hh.T @ (dh_next * (1 - h_next**2))  # tanh derivative

Why Gradients Explode or Vanish

Consider the gradient flowing from h_T to h_0:

∂h_T/∂h_0 = ∂h_T/∂h_{T-1} × ∂h_{T-1}/∂h_{T-2} × ... × ∂h_1/∂h_0

Each factor is approximately:

∂h_t/∂h_{t-1} ≈ W_hh × diag(1 - h_t²)

Over T timesteps:

∂h_T/∂h_0 ≈ (W_hh)^T × [product of tanh derivatives]

where (W_hh)^T means W_hh multiplied by itself T times (the T-th power, not the transpose).

If the largest eigenvalue of W_hh is greater than 1, the gradient explodes exponentially. If it is less than 1, the gradient vanishes exponentially.

Gradient magnitude over time:

                    Exploding (λ > 1)
                         /
                        /
                       /
                      /
    ─────────────────●────────────────  Stable (λ ≈ 1)
                      \
                       \
                        \
                         \
                    Vanishing (λ < 1)

    ←──────────────────────────────────→
           Time (number of steps)

Book Reference: "Deep Learning" Ch. 10.7 - Goodfellow
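
You can see this effect with a few lines of NumPy: repeatedly multiplying a vector by a matrix whose largest eigenvalue magnitude is above or below 1 makes its norm blow up or shrink. This toy experiment ignores the tanh factors, so it only illustrates the W_hh part of the story:

import numpy as np

rng = np.random.default_rng(0)
for scale, label in [(1.2, "exploding"), (0.8, "vanishing")]:
    # Random recurrent matrix rescaled so its largest |eigenvalue| is `scale`.
    W = rng.standard_normal((64, 64))
    W *= scale / np.max(np.abs(np.linalg.eigvals(W)))

    g = np.ones((64, 1))            # stand-in for a gradient vector
    for _ in range(50):             # 50 "timesteps" of backpropagation
        g = W.T @ g
    print(f"{label}: norm after 50 steps = {np.linalg.norm(g):.3e}")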

Gradient Clipping

To prevent exploding gradients, we clip the gradient norm:

def clip_gradients(gradients, max_norm=5.0):
    total_norm = 0
    for grad in gradients:
        total_norm += np.sum(grad ** 2)
    total_norm = np.sqrt(total_norm)

    if total_norm > max_norm:
        scale = max_norm / total_norm
        for grad in gradients:
            grad *= scale

    return gradients

This keeps training stable without fundamentally solving the vanishing gradient problem.

Teacher Forcing vs Free-Running

Teacher Forcing: During training, always feed the CORRECT previous character. Free-Running: Feed the model's own predictions.

Teacher Forcing (Training):
Input:  "The quick brown"
Target: "he quick brown "

At each step, regardless of what the model predicts,
we feed the TRUE previous character.

Free-Running (Generation):
Input:  "T"
Output: "Th" โ†’ feed 'h'
Output: "The" โ†’ feed 'e'
Output: "The " โ†’ feed ' '
...

Teacher forcing is faster to train but can cause issues at generation time if the model never learned to recover from mistakes.
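
The difference is easiest to see as two loops. Below is a toy, self-contained sketch; step() is a fake stand-in for one RNN step (it returns random probabilities), so only the control flow is meaningful:

import numpy as np

vocab = list("abc ")
rng = np.random.default_rng(0)

def step(x_idx, h):
    """Toy stand-in for one RNN step: fake probabilities, updated state."""
    h = np.tanh(h + 0.1 * x_idx)
    return rng.dirichlet(np.ones(len(vocab))), h

# Teacher forcing: the next input is always the TRUE character.
truth = "abca"
h = np.zeros(1)
for t in range(len(truth) - 1):
    p, h = step(vocab.index(truth[t]), h)     # predict character t+1
    # the loss would compare p against vocab.index(truth[t + 1]) here

# Free-running: the next input is whatever the model just sampled.
h, idx, out = np.zeros(1), vocab.index("a"), "a"
for _ in range(5):
    p, h = step(idx, h)
    idx = rng.choice(len(vocab), p=p)         # feed the prediction back in
    out += vocab[idx]
print(out)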

Sampling Strategies

Argmax Sampling: Always pick the most likely character.

next_char = np.argmax(probabilities)

Problem: Deterministic, repetitive output.

Temperature Sampling: Adjust probability distribution.

def sample_with_temperature(probs, temperature=1.0):
    # Higher temperature = more random
    # Lower temperature = more deterministic
    probs = np.exp(np.log(probs + 1e-8) / temperature)
    probs = probs / np.sum(probs)
    return np.random.choice(len(probs), p=probs)

Temperature Effect:

T=0.1 (Cold):  Probabilities: [0.98, 0.01, 0.01, ...]
               โ†’ Almost deterministic, repetitive

T=1.0 (Normal): Probabilities: [0.60, 0.25, 0.15, ...]
               โ†’ Balanced creativity

T=2.0 (Hot):   Probabilities: [0.35, 0.33, 0.32, ...]
               โ†’ High randomness, nonsensical

Real World Outcome

You will train on a text corpus and generate new text in a similar style.

Training Output

$ python train_rnn.py --text shakespeare.txt --hidden_size 256 --seq_length 50

Loading text file...
Text length: 1,115,394 characters
Vocabulary size: 65 unique characters
Vocabulary: ['\n', ' ', '!', '$', '&', "'", ',', '-', '.', ...]

Building RNN...
Architecture: CharRNN(input=65, hidden=256, output=65)
Total parameters: 99,137

Training with:
  - Sequence length: 50
  - Learning rate: 0.002
  - Gradient clip: 5.0

Iter 0 (loss=4.185):
ajnRCq$?xPkEWYxvJ&RoAGlFc
Vz!DNUSC'Wf:aMz;S
(complete gibberish)

Iter 1000 (loss=2.312):
the senter of the wing and stond
And the dake the with of the
(word-like patterns emerging)

Iter 5000 (loss=1.876):
KING RICHARD:
The world is fair and stright to me,
And I shall have the man of death.
(grammar emerging, character names)

Iter 10000 (loss=1.542):
ROMEO:
What light through yonder window breaks?
It is the east, and Juliet is the sun!
Arise, fair sun, and kill the envious moon.
(Shakespearean style captured!)

Iter 20000 (loss=1.328):
HAMLET:
To be, or not to be, that is the question:
Whether 'tis nobler in the mind to suffer
The slings and arrows of outrageous fortune,
Or to take arms against a sea of troubles.
(Near-perfect Shakespeare mimicry!)

Generation Output

$ python generate.py --model shakespeare_rnn.pkl --seed "To be or" --length 500 --temperature 0.8

Seed: "To be or"

Generated:
To be or not to be, that is the question:
Whether 'tis nobler in the mind to suffer
The slings and arrows of outrageous fortune,
Or to take arms against a sea of troubles,
And by opposing end them? To die: to sleep;
No more; and by a sleep to say we end
The heart-ache and the thousand natural shocks
That flesh is heir to, 'tis a consummation
Devoutly to be wish'd. To die, to sleep;
To sleep: perchance to dream: ay, there's the rub.

$ python generate.py --seed "KING:" --temperature 1.2

KING:
What say you, lords? Shall we make merry sport?
The duke hath spoken words of wondrous might,
And I would fain hear more of this strange tale.
But soft! What messenger approaches now?

Enter MESSENGER

MESSENGER:
My liege, the army of the north doth march!
They come with banners raised and swords unsheathed!

Solution Architecture

System Design

┌──────────────────────────────────────────────────────────────────┐
│                          CharRNN System                          │
├──────────────────────────────────────────────────────────────────┤
│                                                                  │
│  ┌──────────────┐    ┌──────────────┐    ┌──────────────┐        │
│  │ Text Loader  │───►│ Preprocessor │───►│  DataLoader  │        │
│  │              │    │ (Vocabulary) │    │  (Batching)  │        │
│  └──────────────┘    └──────────────┘    └──────────────┘        │
│                                                 │                │
│                                                 ▼                │
│  ┌───────────────────────────────────────────────────────────┐   │
│  │                          RNN Cell                         │   │
│  │                                                           │   │
│  │   x_t ──► [W_xh] ──┐                                      │   │
│  │                    ├──► [+] ──► [tanh] ──► h_t            │   │
│  │   h_{t-1} ► [W_hh] ┘                        │             │   │
│  │                                             │             │   │
│  │                            h_t ──► [W_hy] ──► y_t         │   │
│  └───────────────────────────────────────────────────────────┘   │
│                                                 │                │
│                                                 ▼                │
│  ┌──────────────┐    ┌──────────────┐    ┌──────────────┐        │
│  │   Softmax    │───►│  CrossEnt    │───►│   BPTT       │        │
│  │              │    │  Loss        │    │  Backward    │        │
│  └──────────────┘    └──────────────┘    └──────────────┘        │
│                                                 │                │
│                                                 ▼                │
│  ┌──────────────┐    ┌──────────────┐    ┌──────────────┐        │
│  │  Grad Clip   │───►│   Optimizer  │───►│  Generator   │        │
│  │              │    │   (Adagrad)  │    │  (Sampling)  │        │
│  └──────────────┘    └──────────────┘    └──────────────┘        │
│                                                                  │
└──────────────────────────────────────────────────────────────────┘

Class Structure

class CharRNN:
    """Character-level RNN for text generation"""

    def __init__(self, vocab_size, hidden_size):
        self.vocab_size = vocab_size
        self.hidden_size = hidden_size

        # Initialize weights
        self.W_xh = None  # (hidden_size, vocab_size)
        self.W_hh = None  # (hidden_size, hidden_size)
        self.W_hy = None  # (vocab_size, hidden_size)
        self.b_h = None   # (hidden_size, 1)
        self.b_y = None   # (vocab_size, 1)

        # Cache for BPTT
        self.cache = {}

    def forward(self, inputs, h_prev):
        """Forward pass through time"""
        pass

    def backward(self, targets):
        """Backpropagation through time"""
        pass

    def sample(self, seed, length, temperature):
        """Generate text"""
        pass

File Structure

char_rnn/
├── data.py           # Text loading and preprocessing
├── model.py          # RNN implementation
├── train.py          # Training loop
├── generate.py       # Text generation
├── utils.py          # Gradient clipping, sampling
└── shakespeare.txt   # Training data

Phased Implementation Guide

Phase 1: Text Preprocessing and Vocabulary (Day 1-2)

Goal: Convert text file to tensors.

Steps:

  1. Load text file
  2. Build character-to-index and index-to-character mappings
  3. Encode text as integers

class Vocabulary:
    def __init__(self, text):
        self.chars = sorted(list(set(text)))
        self.char_to_idx = {ch: i for i, ch in enumerate(self.chars)}
        self.idx_to_char = {i: ch for i, ch in enumerate(self.chars)}
        self.size = len(self.chars)

    def encode(self, text):
        return [self.char_to_idx[ch] for ch in text]

    def decode(self, indices):
        return ''.join([self.idx_to_char[i] for i in indices])

Verification:

  • Encode "hello", decode back, verify it's "hello"
  • Print vocabulary size and sample characters

Phase 2: Character Encoding (One-Hot) (Day 2-3)

Goal: Convert integer indices to one-hot vectors.

def one_hot(index, vocab_size):
    vec = np.zeros((vocab_size, 1))
    vec[index] = 1.0
    return vec

def indices_to_one_hot(indices, vocab_size):
    return [one_hot(i, vocab_size) for i in indices]

Verification:

  • One-hot of index 3 with vocab_size 10 should be [0,0,0,1,0,0,0,0,0,0]

Phase 3: RNN Cell Forward Pass (Day 3-4)

Goal: Implement one timestep of the RNN.

def rnn_step_forward(x, h_prev, W_xh, W_hh, W_hy, b_h, b_y):
    """
    x:      (vocab_size, 1) - one-hot input
    h_prev: (hidden_size, 1) - previous hidden state

    Returns:
    h_next: (hidden_size, 1) - next hidden state
    y:      (vocab_size, 1) - output logits
    cache:  tuple of values needed for backward pass
    """
    # Hidden state
    z = W_xh @ x + W_hh @ h_prev + b_h
    h_next = np.tanh(z)

    # Output
    y = W_hy @ h_next + b_y

    cache = (x, h_prev, h_next, z, y)
    return h_next, y, cache

Verification:

  • Check output shapes
  • h should be bounded [-1, 1] (tanh)

Phase 4: Unrolled Forward Pass (Day 4-5)

Goal: Forward pass through entire sequence.

def forward(inputs, h_init, params):
    """
    inputs: list of one-hot vectors, length T
    h_init: initial hidden state

    Returns:
    outputs: list of output logits
    h_final: final hidden state
    caches: list of caches for backward pass
    """
    h = h_init
    outputs = []
    caches = []

    for x in inputs:
        h, y, cache = rnn_step_forward(x, h, **params)
        outputs.append(y)
        caches.append(cache)

    return outputs, h, caches

Phase 5: Loss Computation (Day 5-6)

Goal: Compute cross-entropy loss at each timestep.

def softmax(y):
    exp_y = np.exp(y - np.max(y))  # numerical stability
    return exp_y / np.sum(exp_y)

def cross_entropy_loss(outputs, targets):
    """
    outputs: list of logits
    targets: list of target indices

    Returns:
    loss: scalar
    probs: list of probability vectors (for backward pass)
    """
    loss = 0
    probs = []

    for y, target in zip(outputs, targets):
        p = softmax(y)
        loss -= np.log(p[target] + 1e-8)
        probs.append(p)

    return loss, probs

Phase 6: BPTT Backward Pass (Day 6-9) - THE HARD PART

Goal: Compute gradients flowing backward through time.

def backward(probs, targets, caches, params):
    """
    probs: list of softmax outputs
    targets: list of target indices
    caches: list of (x, h_prev, h_next, z, y) tuples

    Returns:
    gradients for all parameters
    """
    # params is a dict (as in the training loop), so pull the arrays out by name
    W_xh, W_hh, W_hy = params['W_xh'], params['W_hh'], params['W_hy']
    b_h, b_y = params['b_h'], params['b_y']

    # Initialize gradients
    dW_xh = np.zeros_like(W_xh)
    dW_hh = np.zeros_like(W_hh)
    dW_hy = np.zeros_like(W_hy)
    db_h = np.zeros_like(b_h)
    db_y = np.zeros_like(b_y)

    # Gradient from next timestep (starts at 0)
    dh_next = np.zeros((W_hh.shape[0], 1))

    # Backward through time
    for t in reversed(range(len(caches))):
        x, h_prev, h_next, z, y = caches[t]
        p = probs[t]
        target = targets[t]

        # Gradient of loss w.r.t. output
        dy = p.copy()
        dy[target] -= 1  # softmax + cross-entropy gradient

        # Gradient w.r.t. W_hy and b_y
        dW_hy += dy @ h_next.T
        db_y += dy

        # Gradient w.r.t. hidden state (from output AND next timestep)
        dh = W_hy.T @ dy + dh_next

        # Gradient through tanh
        dz = dh * (1 - h_next ** 2)

        # Gradient w.r.t. W_xh, W_hh, b_h
        dW_xh += dz @ x.T
        dW_hh += dz @ h_prev.T
        db_h += dz

        # Gradient to pass to previous timestep
        dh_next = W_hh.T @ dz

    return dW_xh, dW_hh, dW_hy, db_h, db_y

Key Insight: dh has TWO sources:

  1. From the loss at this timestep (through W_hy)
  2. From the next timestep (dh_next)

Phase 7: Gradient Clipping (Day 9-10)

Goal: Prevent exploding gradients.

def clip_gradients(grads, max_norm=5.0):
    total_norm = 0
    for grad in grads:
        total_norm += np.sum(grad ** 2)
    total_norm = np.sqrt(total_norm)

    if total_norm > max_norm:
        for grad in grads:
            grad *= max_norm / total_norm

    return grads, total_norm

Phase 8: Training Loop (Day 10-14)

Goal: Put it all together.

def train(text, hidden_size=128, seq_length=25, learning_rate=0.1):
    vocab = Vocabulary(text)
    data = vocab.encode(text)

    # Initialize parameters
    params = initialize_params(vocab.size, hidden_size)

    # Adagrad memory (for adaptive learning rate)
    mem = {k: np.zeros_like(v) for k, v in params.items()}

    h_prev = np.zeros((hidden_size, 1))
    pointer = 0

    for iteration in range(100000):
        # Reset if end of data
        if pointer + seq_length + 1 >= len(data):
            h_prev = np.zeros((hidden_size, 1))
            pointer = 0

        # Get batch
        inputs = data[pointer:pointer + seq_length]
        targets = data[pointer + 1:pointer + seq_length + 1]

        # Forward
        outputs, h_prev, caches = forward(
            indices_to_one_hot(inputs, vocab.size),
            h_prev, params
        )

        # Loss
        loss, probs = cross_entropy_loss(outputs, targets)

        # Backward
        grads = backward(probs, targets, caches, params)

        # Clip gradients
        grads, grad_norm = clip_gradients(grads)

        # Update (Adagrad)
        # NOTE: this relies on params being created in the order
        # (W_xh, W_hh, W_hy, b_h, b_y) so the keys line up with grads
        for param_name, grad in zip(params.keys(), grads):
            mem[param_name] += grad ** 2
            params[param_name] -= learning_rate * grad / (np.sqrt(mem[param_name]) + 1e-8)

        pointer += seq_length

        # Sample occasionally
        if iteration % 1000 == 0:
            print(f"Iter {iteration}, Loss: {loss:.4f}")
            sample_text = sample(params, vocab, h_prev, 200)
            print(sample_text)

Phase 9: Text Generation (Day 14-17)

Goal: Generate new text by sampling.

def sample(params, vocab, h, length, temperature=1.0, seed=None):
    """Generate text character by character"""

    if seed:
        # Warm up hidden state with seed
        for ch in seed:
            x = one_hot(vocab.char_to_idx[ch], vocab.size)
            h, _, _ = rnn_step_forward(x, h, **params)
        text = seed
    else:
        # Start with random character
        idx = np.random.randint(vocab.size)
        text = vocab.idx_to_char[idx]

    # Generate
    for _ in range(length):
        x = one_hot(vocab.char_to_idx[text[-1]], vocab.size)
        h, y, _ = rnn_step_forward(x, h, **params)

        # Apply temperature
        p = softmax(y / temperature)

        # Sample
        idx = np.random.choice(vocab.size, p=p.flatten())
        text += vocab.idx_to_char[idx]

    return text

Questions to Guide Your Design

  1. Hidden State Initialization: Should h_0 be zeros? Random? Learned?

  2. Sequence Length Trade-off: Longer sequences capture more context but have worse gradient flow. What's optimal?

  3. Why Adagrad?: Why use Adagrad instead of vanilla SGD for RNNs?

  4. Gradient Clipping Threshold: How do you choose max_norm = 5.0? What happens with 1.0 or 50.0?

  5. Temperature Selection: When should you use T=0.5 vs T=1.5?


Thinking Exercise

Trace an Unrolled RNN by Hand

Consider the 3-character sequence "ab ": given the inputs "ab", the network should predict "b " (each target is the input shifted by one).

Vocabulary: {a=0, b=1, ' '=2}
Hidden size: 2
Initial h_0 = [0, 0]

W_xh = [[0.1, 0.2, 0.3],
        [0.4, 0.5, 0.6]]   (2x3)

W_hh = [[0.1, 0.2],
        [0.3, 0.4]]         (2x2)

W_hy = [[0.1, 0.2],
        [0.3, 0.4],
        [0.5, 0.6]]         (3x2)

Task: Compute h_1 and y_1 after processing 'a' (biases are taken as zero in this trace).

x_1 = [1, 0, 0]  (one-hot for 'a')

z_1 = W_xh @ x_1 + W_hh @ h_0
    = [[0.1], [0.4]] + [[0], [0]]
    = [[0.1], [0.4]]

h_1 = tanh(z_1) = [[0.0997], [0.3799]]

y_1 = W_hy @ h_1
    = [[0.1*0.0997 + 0.2*0.3799],
       [0.3*0.0997 + 0.4*0.3799],
       [0.5*0.0997 + 0.6*0.3799]]
    = [[0.086], [0.182], [0.278]]

p_1 = softmax(y_1) ≈ [0.30, 0.33, 0.37]

Target is 'b' (index 1). Loss = -log(0.33) ≈ 1.10

Now trace the backward pass:

dy_1 = p_1 - one_hot(target) = [0.30, -0.67, 0.37]
dh_1 = W_hy.T @ dy_1 = ...

Complete this trace to understand BPTT deeply.
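
A few lines of NumPy reproduce the forward half of this trace (biases are zero, as above) and give you numbers to check the rest of your backward pass against:

import numpy as np

W_xh = np.array([[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]])
W_hh = np.array([[0.1, 0.2], [0.3, 0.4]])
W_hy = np.array([[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]])

x_1 = np.array([[1.0], [0.0], [0.0]])      # one-hot 'a'
h_0 = np.zeros((2, 1))

h_1 = np.tanh(W_xh @ x_1 + W_hh @ h_0)     # ≈ [[0.0997], [0.3799]]
y_1 = W_hy @ h_1                           # ≈ [[0.086], [0.182], [0.278]]
p_1 = np.exp(y_1) / np.sum(np.exp(y_1))    # ≈ [[0.30], [0.33], [0.37]]
loss = -np.log(p_1[1, 0])                  # target 'b' (index 1), ≈ 1.10

dy_1 = p_1.copy()
dy_1[1] -= 1.0                             # ≈ [[0.30], [-0.67], [0.37]]
dh_1 = W_hy.T @ dy_1                       # continue the trace from here
print(h_1, y_1, p_1, loss, dy_1, dh_1, sep="\n")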


Testing Strategy

Unit Tests

def test_rnn_forward_shapes():
    vocab_size, hidden_size = 10, 5
    params = initialize_params(vocab_size, hidden_size)
    x = one_hot(3, vocab_size)
    h = np.zeros((hidden_size, 1))

    h_next, y, cache = rnn_step_forward(x, h, **params)

    assert h_next.shape == (hidden_size, 1)
    assert y.shape == (vocab_size, 1)
    assert np.all(np.abs(h_next) <= 1)  # tanh bounds

def test_gradient_check():
    """Numerical gradient verification"""
    # ... compare analytical vs numerical gradients

Integration Tests

  1. Train on "ababab..." for 1000 iterations; the model should learn to predict the alternating pattern (a minimal version of this test is sketched below)
  2. Train on "hello hello hello"; the model should learn to complete "hel" → "lo"
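
A minimal version of the first test, assuming the functions from the earlier phases (Vocabulary, indices_to_one_hot, forward, cross_entropy_loss, backward, clip_gradients) plus an initialize_params helper are importable; sizes are kept tiny so it runs in seconds:

import numpy as np

def test_learns_alternating_pattern():
    """Overfit a tiny RNN on 'ababab...'; per-char loss should fall near zero."""
    text = "ab" * 200
    vocab = Vocabulary(text)
    data = vocab.encode(text)
    params = initialize_params(vocab.size, hidden_size=16)
    mem = {k: np.zeros_like(v) for k, v in params.items()}

    h = np.zeros((16, 1))
    seq_length = 10
    for it in range(1000):
        start = (it * seq_length) % (len(data) - seq_length - 1)
        inputs = indices_to_one_hot(data[start:start + seq_length], vocab.size)
        targets = data[start + 1:start + seq_length + 1]

        outputs, h, caches = forward(inputs, h, params)
        loss, probs = cross_entropy_loss(outputs, targets)
        grads = backward(probs, targets, caches, params)
        grads, _ = clip_gradients(grads)
        for name, grad in zip(params.keys(), grads):   # same order as backward()
            mem[name] += grad ** 2
            params[name] -= 0.1 * grad / (np.sqrt(mem[name]) + 1e-8)

    assert loss / seq_length < 0.1, f"per-character loss too high: {loss / seq_length}"

The starting loss should be close to ln(2) ≈ 0.69 per character and drop sharply once the alternation is learned.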

Common Pitfalls and Debugging Tips

1. Exploding Gradients

Symptom: Loss becomes NaN, weights become huge.

Cause: Gradient clipping threshold too high or missing.

Fix:

grads = clip_gradients(grads, max_norm=5.0)

2. Vanishing Gradients

Symptom: Model stops learning, loss plateaus high.

Cause: Long sequences, small hidden state.

Fix:

  • Use shorter sequence length
  • Increase hidden size
  • Consider LSTM/GRU

3. Mode Collapse

Symptom: Model generates the same character repeatedly ("eeeeee...").

Cause: Temperature too low, or undertrained.

Fix: Increase temperature during generation.

4. Slow Training

Symptom: Each iteration takes seconds.

Cause: Python loops instead of vectorization.

Fix: Batch multiple sequences, use matrix operations.
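
One cheap speedup that keeps the rest of the code unchanged: multiplying a matrix by a one-hot vector just selects a column, so you can skip building one-hot vectors entirely. A small illustration of the idea (the sizes are arbitrary):

import numpy as np

rng = np.random.default_rng(0)
W_xh = rng.standard_normal((128, 65))
idx = 42                                    # integer index of the input character

one_hot = np.zeros((65, 1))
one_hot[idx] = 1.0

# W_xh @ one_hot is exactly column `idx` of W_xh; indexing avoids the matmul.
assert np.allclose(W_xh @ one_hot, W_xh[:, idx:idx + 1])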

5. Poor Generation Quality

Symptom: Output is word-like but nonsensical.

Cause: Undertrained, or sequence length too short.

Fix: Train longer, increase hidden size, use longer sequences.


The Interview Questions They'll Ask

  1. "What is the vanishing gradient problem?"
    • Gradients shrink exponentially when flowing through many timesteps
    • Caused by repeated multiplication by weights < 1 and tanh derivatives
    • Makes it hard to learn long-range dependencies
  2. "How does LSTM solve vanishing gradients?"
    • LSTM adds a โ€œcell stateโ€ that flows unchanged through time
    • Gradients can flow through the cell state without vanishing
    • Forget/input/output gates control information flow
  3. "What is teacher forcing?"
    • During training, feed the true previous output instead of predicted
    • Speeds up training but can cause exposure bias
  4. "How do RNNs differ from Transformers?"
    • RNNs process sequentially, Transformers in parallel
    • RNNs have fixed context (hidden state), Transformers use attention
    • Transformers scale better but need more memory
  5. "How would you implement beam search?"
    • Keep track of top-k sequences at each step
    • Expand each sequence, keep best k overall
    • More diverse outputs than greedy/sampling
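
For question 5, here is a compact, self-contained beam-search sketch. next_log_probs() is a toy stand-in for "one RNN step plus log-softmax"; in a real character RNN you would also carry a hidden state per beam:

import numpy as np

vocab = list("ab ")

def next_log_probs(seq):
    """Toy stand-in for one RNN step: prefers alternating 'a'/'b'."""
    p = np.full(len(vocab), 0.1)
    p[vocab.index("b" if seq[-1] == "a" else "a")] = 0.8
    return np.log(p / p.sum())

def beam_search(seed="a", k=3, steps=5):
    beams = [(0.0, seed)]                          # (cumulative log-prob, text)
    for _ in range(steps):
        candidates = []
        for score, seq in beams:
            logp = next_log_probs(seq)
            for i, ch in enumerate(vocab):         # expand every beam by every char
                candidates.append((score + logp[i], seq + ch))
        beams = sorted(candidates, reverse=True)[:k]   # keep only the best k
    return beams

print(beam_search())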

Hints in Layers

Hint 1: Weight Initialization

Use small random weights:

W = np.random.randn(m, n) * 0.01

Too large → exploding gradients. Too small → vanishing.

Hint 2: Hidden State Management

In this NumPy implementation, BPTT is naturally truncated at each chunk boundary, because backward() only sees the states cached for the current chunk. If you later port the model to an autograd framework, detach (or copy) the hidden state between batches so the graph does not grow without bound:

h_prev = h_prev.copy()  # break the gradient chain at the chunk boundary

Hint 3: Debugging BPTT

Test with sequence length 1 first:

# With seq_length=1, BPTT reduces to normal backprop
# Much easier to debug

Hint 4: Numerical Gradient Check

def numerical_gradient(f, x, epsilon=1e-5):
    grad = np.zeros_like(x)
    for i in range(x.size):
        x_plus = x.copy()
        x_plus.flat[i] += epsilon
        x_minus = x.copy()
        x_minus.flat[i] -= epsilon
        grad.flat[i] = (f(x_plus) - f(x_minus)) / (2 * epsilon)
    return grad
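
To turn the test_gradient_check stub above into a real test, wrap the loss as a function of one parameter matrix and compare it against the analytical gradient from backward(). A sketch assuming the phase functions and an initialize_params helper are available:

import numpy as np

def relative_error(a, b):
    return np.max(np.abs(a - b) / (np.abs(a) + np.abs(b) + 1e-12))

def check_W_hh_gradient():
    params = initialize_params(vocab_size=5, hidden_size=4)
    inputs = indices_to_one_hot([0, 1, 2, 3], vocab_size=5)
    targets = [1, 2, 3, 4]
    h0 = np.zeros((4, 1))

    def loss_fn(W_hh_flat):
        p = dict(params, W_hh=W_hh_flat.reshape(params['W_hh'].shape))
        outputs, _, _ = forward(inputs, h0, p)
        loss, _ = cross_entropy_loss(outputs, targets)
        return float(np.sum(loss))

    outputs, _, caches = forward(inputs, h0, params)
    _, probs = cross_entropy_loss(outputs, targets)
    _, dW_hh, _, _, _ = backward(probs, targets, caches, params)

    num = numerical_gradient(loss_fn, params['W_hh'].flatten())
    print("relative error:", relative_error(num.reshape(dW_hh.shape), dW_hh))

A relative error around 1e-6 or smaller suggests the BPTT gradients are correct; errors near 1 almost always mean a bug.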

Extensions and Challenges

1. Implement LSTM

Add forget gate, input gate, output gate, and cell state:

f_t = sigmoid(W_f @ [h_{t-1}, x_t])  # Forget gate
i_t = sigmoid(W_i @ [h_{t-1}, x_t])  # Input gate
c_t = f_t * c_{t-1} + i_t * tanh(W_c @ [h_{t-1}, x_t])  # Cell state
o_t = sigmoid(W_o @ [h_{t-1}, x_t])  # Output gate
h_t = o_t * tanh(c_t)  # Hidden state
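
A NumPy sketch of one LSTM timestep in the same style as rnn_step_forward; here the concatenation [h_{t-1}, x_t] is handled explicitly, and the weight shapes (hidden_size, hidden_size + vocab_size) are an assumption that matches that layout:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step_forward(x, h_prev, c_prev, W_f, W_i, W_c, W_o, b_f, b_i, b_c, b_o):
    """One LSTM timestep. Each W_* has shape (hidden, hidden + vocab)."""
    hx = np.vstack([h_prev, x])             # concatenate [h_{t-1}, x_t]

    f = sigmoid(W_f @ hx + b_f)             # forget gate
    i = sigmoid(W_i @ hx + b_i)             # input gate
    o = sigmoid(W_o @ hx + b_o)             # output gate
    c_tilde = np.tanh(W_c @ hx + b_c)       # candidate cell state

    c = f * c_prev + i * c_tilde            # new cell state
    h = o * np.tanh(c)                      # new hidden state
    return h, c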

2. Add Temperature Scheduling

Start with high temperature (exploration), decay to low (exploitation):

temperature = max(0.5, 2.0 - iteration / 10000)

3. Try on Music (MIDI)

Each โ€œcharacterโ€ is a MIDI note. Generate melodies!

4. Try on Code

Train on Python source code. Generate syntactically plausible code.

5. Implement GRU

Simplified LSTM with only two gates:

z_t = sigmoid(W_z @ [h_{t-1}, x_t])  # Update gate
r_t = sigmoid(W_r @ [h_{t-1}, x_t])  # Reset gate
h_t = (1 - z_t) * h_{t-1} + z_t * tanh(W @ [r_t * h_{t-1}, x_t])

Real-World Connections

Ancestry of Modern LLMs

1986: RNN (Rumelhart)
  ↓
1997: LSTM (Hochreiter & Schmidhuber)
  ↓
2014: GRU (Cho et al.)
  ↓
2017: Transformer (Vaswani et al.)
  ↓
2018: GPT-1 (OpenAI)
  ↓
2020: GPT-3
  ↓
2023: GPT-4, Claude, etc.

The RNN you build is the great-grandfather of ChatGPT!

Where RNNs Are Still Used

  • Speech recognition: Sequential audio processing
  • Handwriting recognition: Stroke sequences
  • Time series forecasting: Stock prices, weather
  • Edge devices: Simpler than Transformers, less memory

Books That Will Help

Topic                      Book                                           Chapter
RNN fundamentals           "Deep Learning" by Goodfellow                  Ch. 10
Practical implementation   "Grokking Deep Learning" by Trask              Ch. 11-12
LSTM/GRU                   "Deep Learning" by Goodfellow                  Ch. 10.10
NLP applications           "Speech and Language Processing" by Jurafsky   Ch. 9

Online Resources

  • Andrej Karpathy's "The Unreasonable Effectiveness of Recurrent Neural Networks" (blog post)
  • Karpathy's min-char-rnn.py (reference implementation)
  • colah's blog post "Understanding LSTM Networks"

Self-Assessment Checklist

Understanding

  • Can explain why feedforward nets can't model sequences
  • Can draw the unrolled RNN diagram
  • Can explain vanishing/exploding gradients
  • Can describe the difference between RNN, LSTM, GRU

Implementation

  • RNN forward pass produces correct shapes
  • BPTT backward pass passes gradient check
  • Gradient clipping prevents explosions
  • Training loop reduces loss over time

Generation

  • Can generate text with different temperatures
  • Output quality improves with training
  • Can seed generation with custom text

Extensions

  • Understand how LSTM adds cell state
  • Could explain how to implement beam search
  • Understand connection to Transformers

Final Note

This project is the bridge between classical neural networks and modern language models. The hidden state you implement here is a direct ancestor of the key-value caches in Transformers. The sequence modeling intuition you develop will serve you when understanding GPT, BERT, and beyond.

When you successfully generate Shakespearean text, you will have implemented the core insight that launched the NLP revolution: neural networks can learn to model the probability distribution of language.

You are building the foundation of everything from autocomplete to ChatGPT.