Project 10: Recurrent Character Generator (RNN)
The ancestor of GPT: Build a network that remembers the past and generates text one character at a time.
Project Overview
| Attribute | Details |
|---|---|
| Difficulty | Level 5: Master |
| Time Estimate | 3 Weeks |
| Language | Python (NumPy Only) |
| Prerequisites | Project 5 (Autograd), Project 8 (MNIST) |
| Knowledge Area | Sequence Modeling / NLP |
| Main Book | "Deep Learning" by Goodfellow, Bengio, Courville - Ch. 10 |
Learning Objectives
By the end of this project, you will be able to:
- Explain why feedforward networks cannot model sequences and what RNNs add
- Implement the RNN forward pass with hidden state propagation
- Derive and implement Backpropagation Through Time (BPTT)
- Diagnose and fix vanishing/exploding gradient problems
- Apply gradient clipping to stabilize training
- Generate text by sampling from probability distributions
- Implement temperature sampling for controlling creativity
- Understand the foundations that led to LSTMs, Transformers, and GPT
The Core Question You're Answering
"How does AI handle Time and Sequence?"
Consider this sentence: "The cat sat on the ___"
To predict "mat," you need to remember "cat" and "sat" from earlier. A feedforward network processes each word independently; it has no memory of previous words.
Feedforward Network (No Memory):
┌─────────┐   ┌─────────┐   ┌─────────┐   ┌─────────┐
│  "The"  │   │  "cat"  │   │  "sat"  │   │  "on"   │
└────┬────┘   └────┬────┘   └────┬────┘   └────┬────┘
     │             │             │             │
     ▼             ▼             ▼             ▼
  [Pred]        [Pred]        [Pred]        [Pred]
                                               │
                                               ▼
                                              ???
                                        (No context!)
An RNN creates a loop. The hidden state produced while processing "The" influences how we process "cat," which influences "sat," and so on:
Recurrent Neural Network (Memory via Hidden State):
   h_1           h_2           h_3           h_4
    ▲             ▲             ▲             ▲
    │             │             │             │
┌───┴───┐     ┌───┴───┐     ┌───┴───┐     ┌───┴───┐
│  RNN  │────►│  RNN  │────►│  RNN  │────►│  RNN  │────► "mat" ✓
│ Cell  │     │ Cell  │     │ Cell  │     │ Cell  │      (Has context!)
└───┬───┘     └───┬───┘     └───┬───┘     └───┬───┘
    ▲             ▲             ▲             ▲
    │             │             │             │
  "The"         "cat"         "sat"         "on"
The hidden state h is the network's "short-term memory." It carries information from previous timesteps. This is the fundamental insight behind language modeling and speech recognition, and it eventually led to the Transformer architecture that powers ChatGPT.
Concepts You Must Understand First
1. Why Feedforward Nets Can't Model Sequences
| Aspect | Feedforward | Recurrent |
|---|---|---|
| Input size | Fixed | Variable length |
| Memory | None | Hidden state |
| Processing | Parallel | Sequential |
| "Cat" vs "Cat sat" | Same output | Different output |
Book Reference: "Deep Learning" Ch. 10.1 - Goodfellow
2. The Recurrence Relation
The core of an RNN is this equation:
h_t = f(W_hh · h_{t-1} + W_xh · x_t + b_h)
Where:
- h_t = hidden state at time t (the "memory")
- h_{t-1} = hidden state from the previous timestep
- x_t = input at time t
- W_hh = hidden-to-hidden weights (how memory affects memory)
- W_xh = input-to-hidden weights (how input affects memory)
- b_h = hidden bias
- f = activation function (usually tanh)
   ┌─────────┐
   │   h_t   │──────────────────────────────────┐
   └────┬────┘                                   │
        ▲                                        │  (fed back as h_{t-1}
        │                                        │   at the next timestep)
   ┌────┴────┐                                   │
   │  tanh   │                                   │
   └────┬────┘                                   │
        ▲                                        │
        │                                        │
   ┌────┴───────────────────┐                    │
   │   W_xh · x_t           │                    │
   │     + W_hh · h_{t-1} ◄─┼──────[W_hh]────────┘
   │     + b_h              │
   └────────────────────────┘
3. Character-Level vs Word-Level Modeling
| Approach | Vocabulary Size | Granularity | Pros | Cons |
|---|---|---|---|---|
| Character | 50-100 | Fine | Handles any word | Longer sequences |
| Word | 50,000+ | Coarse | Semantic meaning | OOV problem |
| Subword (BPE) | 30,000 | Medium | Best of both | Complexity |
For this project, we use character-level because:
- Small vocabulary (just letters, punctuation, spaces)
- Can generate any word, even made-up ones
- Easier to understand and debug
4. One-Hot Encoding for Characters
Each character becomes a binary vector:
Vocabulary: ['a', 'b', 'c', ..., 'z', ' ', '.']
Size: 27 characters
'a' = [1, 0, 0, ..., 0, 0, 0] # Index 0
'b' = [0, 1, 0, ..., 0, 0, 0] # Index 1
'c' = [0, 0, 1, ..., 0, 0, 0] # Index 2
' ' = [0, 0, 0, ..., 0, 1, 0] # Index 26
Book Reference: "Neural Networks and Deep Learning" Ch. 3 - Nielsen
5. Unrolling the RNN Through Time
To apply backpropagation, we "unroll" the loop into a feedforward network:
Unrolled RNN for sequence "cat":
        'c'           'a'           't'
         │             │             │
         ▼             ▼             ▼
      ┌─────┐       ┌─────┐       ┌─────┐
h_0 ─►│ h_1 │──────►│ h_2 │──────►│ h_3 │
      └──┬──┘       └──┬──┘       └──┬──┘
         │             │             │
         ▼             ▼             ▼
      ┌─────┐       ┌─────┐       ┌─────┐
      │ y_1 │       │ y_2 │       │ y_3 │
      └──┬──┘       └──┬──┘       └──┬──┘
         │             │             │
         ▼             ▼             ▼
      Pred:'a'      Pred:'t'      Pred:' '
         │             │             │
         ▼             ▼             ▼
       Loss_1        Loss_2        Loss_3
         ▲             ▲             ▲
         │             │             │
        'a'           't'           ' '
      (Target)      (Target)      (Target)
Each step predicts the NEXT character given the history.
6. Truncated BPTT
For long sequences, gradients must flow through hundreds of timesteps. This is slow and causes gradient issues. Truncated BPTT limits how far back we propagate:
Full sequence: "The quick brown fox jumps over the lazy dog"
Truncated to chunks of 10:
Chunk 1: "The quick " (carry h forward)
Chunk 2: "brown fox " (carry h forward)
Chunk 3: "jumps over" (carry h forward)
...
Gradients only flow within each chunk, but hidden state carries forward.
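A minimal sketch of the chunking, assuming `data` is the list of character indices produced by the vocabulary; the variable names here are illustrative stand-ins, and the forward/backward passes from the later phases would run inside the loop:

import numpy as np

# Sketch: walk over the encoded text in fixed-length chunks.
# BPTT later runs only inside each chunk, but the hidden state `h`
# is carried across chunk boundaries so context is not lost.
data = list(range(1000))        # stand-in for vocab.encode(text)
chunk_len = 10
hidden_size = 128

h = np.zeros((hidden_size, 1))  # carried forward, never reset mid-text
for start in range(0, len(data) - chunk_len - 1, chunk_len):
    inputs  = data[start     : start + chunk_len]
    targets = data[start + 1 : start + chunk_len + 1]
    # The forward and backward passes (Phases 4-6) go here and see only
    # this chunk, so gradients reach back at most chunk_len steps, while
    # the forward pass keeps updating and returning h.
    pass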
Deep Theoretical Foundation
The RNN Equations in Detail
Forward Pass (for one timestep):
# 1. Combine previous hidden state and current input
z = W_hh @ h_prev + W_xh @ x + b_h
# 2. Apply non-linearity
h = np.tanh(z)
# 3. Compute output (prediction)
y = W_hy @ h + b_y
# 4. Convert to probabilities
p = softmax(y)
Dimensions:
Vocabulary size: V (e.g., 65 characters)
Hidden size: H (e.g., 128)
x: (V, 1) - one-hot input
h: (H, 1) - hidden state
y: (V, 1) - raw scores
p: (V, 1) - probabilities
W_xh: (H, V) - input to hidden
W_hh: (H, H) - hidden to hidden
W_hy: (V, H) - hidden to output
b_h: (H, 1) - hidden bias
b_y: (V, 1) - output bias
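Later phases call an `initialize_params` helper that is never spelled out in this guide; one possible implementation consistent with the shapes above (small random weights, zero biases, returned as a dict so the training loop can iterate over it) could be:

import numpy as np

def initialize_params(vocab_size, hidden_size, scale=0.01):
    """Small random weights, zero biases, shaped as listed above."""
    return {
        'W_xh': np.random.randn(hidden_size, vocab_size) * scale,
        'W_hh': np.random.randn(hidden_size, hidden_size) * scale,
        'W_hy': np.random.randn(vocab_size, hidden_size) * scale,
        'b_h':  np.zeros((hidden_size, 1)),
        'b_y':  np.zeros((vocab_size, 1)),
    }

Keeping the key order identical to the gradient tuple returned by backward() (dW_xh, dW_hh, dW_hy, db_h, db_y) keeps the parameter update in Phase 8 simple.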
Backpropagation Through Time (BPTT)
The gradient must flow backward through time:
Forward direction (t=0 → T):
x_0 ──► h_0 ──► h_1 ──► h_2 ──► ... ──► h_T
         │       │       │               │
         ▼       ▼       ▼               ▼
        y_0     y_1     y_2             y_T
         │       │       │               │
         ▼       ▼       ▼               ▼
        L_0     L_1     L_2             L_T
Backward direction (t=T → 0):
dL/dh_T ──► dL/dh_{T-1} ──► dL/dh_{T-2} ──► ... ──► dL/dh_0
The BPTT Algorithm:
- Run forward pass, cache all hidden states and inputs
- Compute loss at each timestep
- Starting from the last timestep, compute gradients
- For each timestep t, the gradient of h_t has TWO sources:
- From the loss at time t (through y_t)
- From the next timestep (through h_{t+1})
# Gradient accumulation at timestep t
dh = dh_from_output + dh_from_next_timestep
# Where:
dh_from_output = W_hy.T @ dy_t
dh_from_next_timestep = W_hh.T @ (dh_next * (1 - h_next**2)) # tanh derivative
Why Gradients Explode or Vanish
Consider the gradient flowing from h_T to h_0:
∂h_T/∂h_0 = ∂h_T/∂h_{T-1} × ∂h_{T-1}/∂h_{T-2} × ... × ∂h_1/∂h_0
Each factor is the Jacobian:
∂h_t/∂h_{t-1} = diag(1 - h_t²) · W_hh
Over T timesteps:
∂h_T/∂h_0 ≈ (W_hh)^T × [product of tanh derivatives]   (here (W_hh)^T means W_hh multiplied by itself T times)
If the largest eigenvalue (spectral radius) of W_hh is greater than 1, the gradient explodes exponentially. If it is less than 1, the gradient vanishes exponentially.
Gradient magnitude over time:

                              Exploding (λ > 1)
                             /
                            /
                           /
                          /
  ───────────────────────────────────────────  Stable (λ ≈ 1)
                          \
                           \
                            \
                             \
                              Vanishing (λ < 1)
  ────────────────────────────────────────────►
                Time (number of steps)
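A quick numerical illustration of this claim (a standalone toy, not part of the project code): repeatedly multiplying a vector by the same matrix grows or shrinks its norm exponentially depending on the spectral radius.

import numpy as np

rng = np.random.default_rng(0)
H = 64
for radius, label in [(1.2, "exploding"), (0.8, "vanishing")]:
    W = rng.standard_normal((H, H))
    W *= radius / np.max(np.abs(np.linalg.eigvals(W)))  # set spectral radius
    v = rng.standard_normal((H, 1))                     # a "gradient" vector
    for _ in range(50):                                 # 50 timesteps
        v = W.T @ v
    print(f"{label}: norm after 50 steps = {np.linalg.norm(v):.3e}")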
Book Reference: "Deep Learning" Ch. 10.7 - Goodfellow
Gradient Clipping
To prevent exploding gradients, we clip the gradient norm:
def clip_gradients(gradients, max_norm=5.0):
total_norm = 0
for grad in gradients:
total_norm += np.sum(grad ** 2)
total_norm = np.sqrt(total_norm)
if total_norm > max_norm:
scale = max_norm / total_norm
for grad in gradients:
grad *= scale
return gradients
This keeps training stable without fundamentally solving the vanishing gradient problem.
Teacher Forcing vs Free-Running
Teacher Forcing: During training, always feed the CORRECT previous character. Free-Running: Feed the model's own predictions.
Teacher Forcing (Training):
Input: "The quick brown"
Target: "he quick brown "
At each step, regardless of what the model predicts,
we feed the TRUE previous character.
Free-Running (Generation):
Input: "T"
Output: "Th" โ feed 'h'
Output: "The" โ feed 'e'
Output: "The " โ feed ' '
...
Teacher forcing is faster to train but can cause issues at generation time if the model never learned to recover from mistakes.
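In code, teacher forcing is nothing more than the one-character shift between inputs and targets (the same shift the Phase 8 training loop uses); a small illustration:

# Teacher forcing: the input at step t is the TRUE character at position t,
# and the target is the TRUE character at position t+1, regardless of what
# the model predicted at the previous step.
text = "The quick brown"
inputs, targets = text[:-1], text[1:]
for x_ch, y_ch in zip(inputs, targets):
    print(f"feed {x_ch!r} -> predict {y_ch!r}")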
Sampling Strategies
Argmax Sampling: Always pick the most likely character.
next_char = np.argmax(probabilities)
Problem: Deterministic, repetitive output.
Temperature Sampling: Adjust probability distribution.
def sample_with_temperature(probs, temperature=1.0):
# Higher temperature = more random
# Lower temperature = more deterministic
probs = np.exp(np.log(probs + 1e-8) / temperature)
probs = probs / np.sum(probs)
return np.random.choice(len(probs), p=probs)
Temperature Effect:
T=0.1 (Cold):   Probabilities: [0.98, 0.01, 0.01, ...]
                → Almost deterministic, repetitive
T=1.0 (Normal): Probabilities: [0.60, 0.25, 0.15, ...]
                → Balanced creativity
T=2.0 (Hot):    Probabilities: [0.35, 0.33, 0.32, ...]
                → High randomness, nonsensical
Real World Outcome
You will train on a text corpus and generate new text in a similar style.
Training Output
$ python train_rnn.py --text shakespeare.txt --hidden_size 256 --seq_length 50
Loading text file...
Text length: 1,115,394 characters
Vocabulary size: 65 unique characters
Vocabulary: ['\n', ' ', '!', '$', '&', "'", ',', '-', '.', ...]
Building RNN...
Architecture: CharRNN(input=65, hidden=256, output=65)
Total parameters: 99,137
Training with:
- Sequence length: 50
- Learning rate: 0.002
- Gradient clip: 5.0
Iter 0 (loss=4.185):
ajnRCq$?xPkEWYxvJ&RoAGlFc
Vz!DNUSC'Wf:aMz;S
(complete gibberish)
Iter 1000 (loss=2.312):
the senter of the wing and stond
And the dake the with of the
(word-like patterns emerging)
Iter 5000 (loss=1.876):
KING RICHARD:
The world is fair and stright to me,
And I shall have the man of death.
(grammar emerging, character names)
Iter 10000 (loss=1.542):
ROMEO:
What light through yonder window breaks?
It is the east, and Juliet is the sun!
Arise, fair sun, and kill the envious moon.
(Shakespearean style captured!)
Iter 20000 (loss=1.328):
HAMLET:
To be, or not to be, that is the question:
Whether 'tis nobler in the mind to suffer
The slings and arrows of outrageous fortune,
Or to take arms against a sea of troubles.
(Near-perfect Shakespeare mimicry!)
Generation Output
$ python generate.py --model shakespeare_rnn.pkl --seed "To be or" --length 500 --temperature 0.8
Seed: "To be or"
Generated:
To be or not to be, that is the question:
Whether 'tis nobler in the mind to suffer
The slings and arrows of outrageous fortune,
Or to take arms against a sea of troubles,
And by opposing end them? To die: to sleep;
No more; and by a sleep to say we end
The heart-ache and the thousand natural shocks
That flesh is heir to, 'tis a consummation
Devoutly to be wish'd. To die, to sleep;
To sleep: perchance to dream: ay, there's the rub.
$ python generate.py --seed "KING:" --temperature 1.2
KING:
What say you, lords? Shall we make merry sport?
The duke hath spoken words of wondrous might,
And I would fain hear more of this strange tale.
But soft! What messenger approaches now?
Enter MESSENGER
MESSENGER:
My liege, the army of the north doth march!
They come with banners raised and swords unsheathed!
Solution Architecture
System Design
┌───────────────────────────────────────────────────────────────┐
│                        CharRNN System                         │
├───────────────────────────────────────────────────────────────┤
│                                                               │
│  ┌──────────────┐    ┌──────────────┐    ┌──────────────┐     │
│  │ Text Loader  │───►│ Preprocessor │───►│  DataLoader  │     │
│  │              │    │ (Vocabulary) │    │  (Batching)  │     │
│  └──────────────┘    └──────────────┘    └──────┬───────┘     │
│                                                 │             │
│                                                 ▼             │
│  ┌─────────────────────────────────────────────────────────┐  │
│  │                        RNN Cell                         │  │
│  │                                                         │  │
│  │   x_t     ──► [W_xh] ──┐                                │  │
│  │                        ├──► [+] ──► [tanh] ──► h_t      │  │
│  │   h_{t-1} ──► [W_hh] ──┘                                │  │
│  │                                                         │  │
│  │   h_t ──► [W_hy] ──► y_t                                │  │
│  └────────────────────────────┬────────────────────────────┘  │
│                               │                               │
│                               ▼                               │
│  ┌──────────────┐    ┌──────────────┐    ┌──────────────┐     │
│  │   Softmax    │───►│   CrossEnt   │───►│     BPTT     │     │
│  │              │    │     Loss     │    │   Backward   │     │
│  └──────────────┘    └──────────────┘    └──────┬───────┘     │
│                                                 │             │
│                                                 ▼             │
│  ┌──────────────┐    ┌──────────────┐    ┌──────────────┐     │
│  │  Grad Clip   │───►│  Optimizer   │───►│  Generator   │     │
│  │              │    │  (Adagrad)   │    │  (Sampling)  │     │
│  └──────────────┘    └──────────────┘    └──────────────┘     │
│                                                               │
└───────────────────────────────────────────────────────────────┘
Class Structure
class CharRNN:
"""Character-level RNN for text generation"""
def __init__(self, vocab_size, hidden_size):
self.vocab_size = vocab_size
self.hidden_size = hidden_size
# Initialize weights
self.W_xh = None # (hidden_size, vocab_size)
self.W_hh = None # (hidden_size, hidden_size)
self.W_hy = None # (vocab_size, hidden_size)
self.b_h = None # (hidden_size, 1)
self.b_y = None # (vocab_size, 1)
# Cache for BPTT
self.cache = {}
def forward(self, inputs, h_prev):
"""Forward pass through time"""
pass
def backward(self, targets):
"""Backpropagation through time"""
pass
def sample(self, seed, length, temperature):
"""Generate text"""
pass
File Structure
char_rnn/
├── data.py           # Text loading and preprocessing
├── model.py          # RNN implementation
├── train.py          # Training loop
├── generate.py       # Text generation
├── utils.py          # Gradient clipping, sampling
└── shakespeare.txt   # Training data
Phased Implementation Guide
Phase 1: Text Preprocessing and Vocabulary (Day 1-2)
Goal: Convert text file to tensors.
Steps:
- Load text file
- Build character-to-index and index-to-character mappings
- Encode text as integers
class Vocabulary:
def __init__(self, text):
self.chars = sorted(list(set(text)))
self.char_to_idx = {ch: i for i, ch in enumerate(self.chars)}
self.idx_to_char = {i: ch for i, ch in enumerate(self.chars)}
self.size = len(self.chars)
def encode(self, text):
return [self.char_to_idx[ch] for ch in text]
def decode(self, indices):
return ''.join([self.idx_to_char[i] for i in indices])
Verification:
- Encode "hello", decode back, verify it's "hello" (see the check below)
- Print vocabulary size and sample characters
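A throwaway check along those lines, assuming the Vocabulary class above:

text = "hello world. hello again."
vocab = Vocabulary(text)
assert vocab.decode(vocab.encode("hello")) == "hello"   # round-trip survives
print(vocab.size, vocab.chars)                          # size and characters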
Phase 2: Character Encoding (One-Hot) (Day 2-3)
Goal: Convert integer indices to one-hot vectors.
def one_hot(index, vocab_size):
vec = np.zeros((vocab_size, 1))
vec[index] = 1.0
return vec
def indices_to_one_hot(indices, vocab_size):
return [one_hot(i, vocab_size) for i in indices]
Verification:
- One-hot of index 3 with vocab_size 10 should be [0,0,0,1,0,0,0,0,0,0]
Phase 3: RNN Cell Forward Pass (Day 3-4)
Goal: Implement one timestep of the RNN.
def rnn_step_forward(x, h_prev, W_xh, W_hh, W_hy, b_h, b_y):
"""
x: (vocab_size, 1) - one-hot input
h_prev: (hidden_size, 1) - previous hidden state
Returns:
h_next: (hidden_size, 1) - next hidden state
y: (vocab_size, 1) - output logits
cache: tuple of values needed for backward pass
"""
# Hidden state
z = W_xh @ x + W_hh @ h_prev + b_h
h_next = np.tanh(z)
# Output
y = W_hy @ h_next + b_y
cache = (x, h_prev, h_next, z, y)
return h_next, y, cache
Verification:
- Check output shapes
- h should be bounded [-1, 1] (tanh)
Phase 4: Unrolled Forward Pass (Day 4-5)
Goal: Forward pass through entire sequence.
def forward(inputs, h_init, params):
"""
inputs: list of one-hot vectors, length T
h_init: initial hidden state
Returns:
outputs: list of output logits
h_final: final hidden state
caches: list of caches for backward pass
"""
h = h_init
outputs = []
caches = []
for x in inputs:
h, y, cache = rnn_step_forward(x, h, **params)
outputs.append(y)
caches.append(cache)
return outputs, h, caches
Phase 5: Loss Computation (Day 5-6)
Goal: Compute cross-entropy loss at each timestep.
def softmax(y):
exp_y = np.exp(y - np.max(y)) # numerical stability
return exp_y / np.sum(exp_y)
def cross_entropy_loss(outputs, targets):
"""
outputs: list of logits
targets: list of target indices
Returns:
loss: scalar
probs: list of probability vectors (for backward pass)
"""
loss = 0
probs = []
for y, target in zip(outputs, targets):
p = softmax(y)
loss -= np.log(p[target, 0] + 1e-8)  # p is (V, 1); index the scalar so loss stays a float
probs.append(p)
return loss, probs
Phase 6: BPTT Backward Pass (Day 6-9) - THE HARD PART
Goal: Compute gradients flowing backward through time.
def backward(probs, targets, caches, params):
"""
probs: list of softmax outputs
targets: list of target indices
caches: list of (x, h_prev, h_next, z, y) tuples
Returns:
gradients for all parameters
"""
W_xh, W_hh, W_hy = params['W_xh'], params['W_hh'], params['W_hy']
b_h, b_y = params['b_h'], params['b_y']  # params is the same dict used by forward()
# Initialize gradients
dW_xh = np.zeros_like(W_xh)
dW_hh = np.zeros_like(W_hh)
dW_hy = np.zeros_like(W_hy)
db_h = np.zeros_like(b_h)
db_y = np.zeros_like(b_y)
# Gradient from next timestep (starts at 0)
dh_next = np.zeros((W_hh.shape[0], 1))
# Backward through time
for t in reversed(range(len(caches))):
x, h_prev, h_next, z, y = caches[t]
p = probs[t]
target = targets[t]
# Gradient of loss w.r.t. output
dy = p.copy()
dy[target] -= 1 # softmax + cross-entropy gradient
# Gradient w.r.t. W_hy and b_y
dW_hy += dy @ h_next.T
db_y += dy
# Gradient w.r.t. hidden state (from output AND next timestep)
dh = W_hy.T @ dy + dh_next
# Gradient through tanh
dz = dh * (1 - h_next ** 2)
# Gradient w.r.t. W_xh, W_hh, b_h
dW_xh += dz @ x.T
dW_hh += dz @ h_prev.T
db_h += dz
# Gradient to pass to previous timestep
dh_next = W_hh.T @ dz
return dW_xh, dW_hh, dW_hy, db_h, db_y
Key Insight: dh has TWO sources:
- From the loss at this timestep (through W_hy)
- From the next timestep (dh_next)
Phase 7: Gradient Clipping (Day 9-10)
Goal: Prevent exploding gradients.
def clip_gradients(grads, max_norm=5.0):
total_norm = 0
for grad in grads:
total_norm += np.sum(grad ** 2)
total_norm = np.sqrt(total_norm)
if total_norm > max_norm:
for grad in grads:
grad *= max_norm / total_norm
return grads, total_norm
Phase 8: Training Loop (Day 10-14)
Goal: Put it all together.
def train(text, hidden_size=128, seq_length=25, learning_rate=0.1):
vocab = Vocabulary(text)
data = vocab.encode(text)
# Initialize parameters
params = initialize_params(vocab.size, hidden_size)
# Adagrad memory (for adaptive learning rate)
mem = {k: np.zeros_like(v) for k, v in params.items()}
h_prev = np.zeros((hidden_size, 1))
pointer = 0
for iteration in range(100000):
# Reset if end of data
if pointer + seq_length + 1 >= len(data):
h_prev = np.zeros((hidden_size, 1))
pointer = 0
# Get batch
inputs = data[pointer:pointer + seq_length]
targets = data[pointer + 1:pointer + seq_length + 1]
# Forward
outputs, h_prev, caches = forward(
indices_to_one_hot(inputs, vocab.size),
h_prev, params
)
# Loss
loss, probs = cross_entropy_loss(outputs, targets)
# Backward
grads = backward(probs, targets, caches, params)
# Clip gradients
grads, grad_norm = clip_gradients(grads)
# Update (Adagrad)
for param_name, grad in zip(['W_xh', 'W_hh', 'W_hy', 'b_h', 'b_y'], grads):  # match backward()'s return order
mem[param_name] += grad ** 2
params[param_name] -= learning_rate * grad / (np.sqrt(mem[param_name]) + 1e-8)
pointer += seq_length
# Sample occasionally
if iteration % 1000 == 0:
print(f"Iter {iteration}, Loss: {loss:.4f}")
sample_text = sample(params, vocab, h_prev, 200)
print(sample_text)
Phase 9: Text Generation (Day 14-17)
Goal: Generate new text by sampling.
def sample(params, vocab, h, length, temperature=1.0, seed=None):
"""Generate text character by character"""
if seed:
# Warm up hidden state with seed
for ch in seed:
x = one_hot(vocab.char_to_idx[ch], vocab.size)
h, _, _ = rnn_step_forward(x, h, **params)
text = seed
else:
# Start with random character
idx = np.random.randint(vocab.size)
text = vocab.idx_to_char[idx]
# Generate
for _ in range(length):
x = one_hot(vocab.char_to_idx[text[-1]], vocab.size)
h, y, _ = rnn_step_forward(x, h, **params)
# Apply temperature
p = softmax(y / temperature)
# Sample
idx = np.random.choice(vocab.size, p=p.flatten())
text += vocab.idx_to_char[idx]
return text
Questions to Guide Your Design
- Hidden State Initialization: Should h_0 be zeros? Random? Learned?
- Sequence Length Trade-off: Longer sequences capture more context but have worse gradient flow. What's optimal?
- Why Adagrad?: Why use Adagrad instead of vanilla SGD for RNNs?
- Gradient Clipping Threshold: How do you choose max_norm = 5.0? What happens with 1.0 or 50.0?
- Temperature Selection: When should you use T=0.5 vs T=1.5?
Thinking Exercise
Trace an Unrolled RNN by Hand
Consider the 3-character sequence "ab ": the inputs are "ab" and the targets (the next characters) are "b ".
Vocabulary: {a=0, b=1, ' '=2}
Hidden size: 2
Initial h_0 = [0, 0]
W_xh = [[0.1, 0.2, 0.3],
[0.4, 0.5, 0.6]] (2x3)
W_hh = [[0.1, 0.2],
[0.3, 0.4]] (2x2)
W_hy = [[0.1, 0.2],
[0.3, 0.4],
[0.5, 0.6]] (3x2)
Task: Compute h_1 and y_1 after processing 'a'.
x_1 = [1, 0, 0] (one-hot for 'a')
z_1 = W_xh @ x_1 + W_hh @ h_0
= [[0.1], [0.4]] + [[0], [0]]
= [[0.1], [0.4]]
h_1 = tanh(z_1) = [[0.0997], [0.3799]]
y_1 = W_hy @ h_1
= [[0.1*0.0997 + 0.2*0.3799],
[0.3*0.0997 + 0.4*0.3799],
[0.5*0.0997 + 0.6*0.3799]]
= [[0.086], [0.182], [0.278]]
p_1 = softmax(y_1) = [0.302, 0.332, 0.366]
Target is 'b' (index 1). Loss = -log(0.332) = 1.10
Now trace the backward pass:
dy_1 = p_1 - target = [0.302, -0.668, 0.366]
dh_1 = W_hy.T @ dy_1 = ...
Complete this trace to understand BPTT deeply.
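If you want to check your hand trace, a few lines of NumPy reproduce the numbers above (biases are taken to be zero, as in the exercise):

import numpy as np

W_xh = np.array([[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]])
W_hh = np.array([[0.1, 0.2], [0.3, 0.4]])
W_hy = np.array([[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]])

x_1 = np.array([[1.0], [0.0], [0.0]])      # one-hot for 'a'
h_0 = np.zeros((2, 1))

h_1 = np.tanh(W_xh @ x_1 + W_hh @ h_0)     # ~[0.0997, 0.3799]
y_1 = W_hy @ h_1                           # ~[0.086, 0.182, 0.278]
p_1 = np.exp(y_1) / np.sum(np.exp(y_1))    # ~[0.302, 0.332, 0.366]

loss = -np.log(p_1[1, 0])                  # target 'b' (index 1) -> ~1.10
dy_1 = p_1.copy(); dy_1[1, 0] -= 1.0       # ~[0.302, -0.668, 0.366]
dh_1 = W_hy.T @ dy_1                       # continue the trace from here
print(h_1.ravel(), p_1.ravel(), round(loss, 3), dh_1.ravel())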
Testing Strategy
Unit Tests
def test_rnn_forward_shapes():
vocab_size, hidden_size = 10, 5
params = initialize_params(vocab_size, hidden_size)
x = one_hot(3, vocab_size)
h = np.zeros((hidden_size, 1))
h_next, y, cache = rnn_step_forward(x, h, **params)
assert h_next.shape == (hidden_size, 1)
assert y.shape == (vocab_size, 1)
assert np.all(np.abs(h_next) <= 1) # tanh bounds
def test_gradient_check():
"""Numerical gradient verification"""
# ... compare analytical vs numerical gradients
Integration Tests
- Train on "ababab..." for 1000 iterations; it should learn to predict the alternating pattern (see the sketch below)
- Train on "hello hello hello"; it should learn to complete "hel" → "lo"
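A sketch of the first check, reusing the pieces defined in the phased guide (Vocabulary, initialize_params, one_hot, indices_to_one_hot, rnn_step_forward, forward, cross_entropy_loss, backward, clip_gradients); the hyperparameters and iteration count are illustrative, not tuned:

def test_learns_alternating_pattern():
    text = "ab" * 300
    vocab = Vocabulary(text)
    data = vocab.encode(text)
    hidden_size, seq_len, lr = 16, 10, 0.1
    params = initialize_params(vocab.size, hidden_size)
    mem = {k: np.zeros_like(v) for k, v in params.items()}

    for it in range(1000):
        start = (it * seq_len) % (len(data) - seq_len - 1)
        inputs = data[start:start + seq_len]
        targets = data[start + 1:start + seq_len + 1]
        h = np.zeros((hidden_size, 1))                 # fresh state per chunk
        outputs, h, caches = forward(indices_to_one_hot(inputs, vocab.size), h, params)
        loss, probs = cross_entropy_loss(outputs, targets)
        grads, _ = clip_gradients(backward(probs, targets, caches, params))
        for name, g in zip(['W_xh', 'W_hh', 'W_hy', 'b_h', 'b_y'], grads):
            mem[name] += g ** 2
            params[name] -= lr * g / (np.sqrt(mem[name]) + 1e-8)

    # After seeing 'a', the trained model should strongly prefer 'b'.
    x = one_hot(vocab.char_to_idx['a'], vocab.size)
    _, y, _ = rnn_step_forward(x, np.zeros((hidden_size, 1)), **params)
    assert np.argmax(y) == vocab.char_to_idx['b']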
Common Pitfalls and Debugging Tips
1. Exploding Gradients
Symptom: Loss becomes NaN, weights become huge.
Cause: Gradient clipping threshold too high or missing.
Fix:
grads = clip_gradients(grads, max_norm=5.0)
2. Vanishing Gradients
Symptom: Model stops learning, loss plateaus high.
Cause: Long sequences, small hidden state.
Fix:
- Use shorter sequence length
- Increase hidden size
- Consider LSTM/GRU
3. Mode Collapse
Symptom: Model generates the same character repeatedly ("eeeeee...").
Cause: Temperature too low, or undertrained.
Fix: Increase temperature during generation.
4. Slow Training
Symptom: Each iteration takes seconds.
Cause: Python loops instead of vectorization.
Fix: Batch multiple sequences, use matrix operations.
5. Poor Generation Quality
Symptom: Output is word-like but nonsensical.
Cause: Undertrained, or sequence length too short.
Fix: Train longer, increase hidden size, use longer sequences.
The Interview Questions They'll Ask
- "What is the vanishing gradient problem?"
- Gradients shrink exponentially when flowing backward through many timesteps
- Caused by repeated multiplication by recurrent weights with small eigenvalues and by tanh derivatives (which are at most 1)
- Makes it hard to learn long-range dependencies
- "How does LSTM solve vanishing gradients?"
- LSTM adds a "cell state" that flows largely unchanged through time
- Gradients can flow through the cell state without vanishing
- Forget/input/output gates control information flow
- "What is teacher forcing?"
- During training, feed the true previous character instead of the model's own prediction
- Speeds up training but can cause exposure bias
- "How do RNNs differ from Transformers?"
- RNNs process sequentially; Transformers process all positions in parallel
- RNNs compress context into a fixed-size hidden state; Transformers attend over the full sequence
- Transformers scale better but need more memory
- "How would you implement beam search?"
- Keep track of the top-k sequences at each step
- Expand each sequence with every possible next character, keep the best k overall
- Higher-likelihood output than greedy decoding, though less diverse than sampling (see the sketch below)
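A compact sketch of that idea on top of the CharRNN pieces above (rnn_step_forward, softmax, one_hot, and the vocab object); this is one straightforward variant, not the only way to do it:

def beam_search(params, vocab, h0, seed_idx, length, k=5):
    """Keep the k highest log-probability continuations at every step."""
    beams = [(0.0, [seed_idx], h0)]          # (log-prob, indices, hidden state)
    for _ in range(length):
        candidates = []
        for logp, seq, h in beams:
            x = one_hot(seq[-1], vocab.size)
            h_next, y, _ = rnn_step_forward(x, h, **params)
            p = softmax(y).ravel()
            for idx in np.argsort(p)[-k:]:   # top-k single-step extensions
                candidates.append((logp + np.log(p[idx] + 1e-8),
                                   seq + [int(idx)], h_next))
        # Keep only the k best sequences overall.
        beams = sorted(candidates, key=lambda c: c[0], reverse=True)[:k]
    return vocab.decode(beams[0][1])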
Hints in Layers
Hint 1: Weight Initialization
Use small random weights:
W = np.random.randn(m, n) * 0.01
Too large → exploding gradients. Too small → vanishing gradients.
Hint 2: Hidden State Management
Detach the hidden state between chunks to limit how far BPTT reaches. In this pure-NumPy version, truncation already happens because backward() only sees the current chunk's caches; if you instead build on the Project 5 autograd, break the graph explicitly:
h_prev = h_prev.copy()  # break the gradient chain between chunks
Hint 3: Debugging BPTT
Test with sequence length 1 first:
# With seq_length=1, BPTT reduces to normal backprop
# Much easier to debug
Hint 4: Numerical Gradient Check
def numerical_gradient(f, x, epsilon=1e-5):
grad = np.zeros_like(x)
for i in range(x.size):
x_plus = x.copy()
x_plus.flat[i] += epsilon
x_minus = x.copy()
x_minus.flat[i] -= epsilon
grad.flat[i] = (f(x_plus) - f(x_minus)) / (2 * epsilon)
return grad
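One way to wire this up against the analytical BPTT gradients on a tiny model (a sketch reusing the helpers from the phases above; the tolerance is a rule of thumb):

def check_rnn_gradients(vocab_size=5, hidden_size=4, seq_len=3):
    params = initialize_params(vocab_size, hidden_size)
    inputs = list(np.random.randint(vocab_size, size=seq_len))
    targets = list(np.random.randint(vocab_size, size=seq_len))
    h0 = np.zeros((hidden_size, 1))

    def loss_fn():
        outs, _, _ = forward(indices_to_one_hot(inputs, vocab_size), h0, params)
        loss, _ = cross_entropy_loss(outs, targets)
        return float(loss)

    outs, _, caches = forward(indices_to_one_hot(inputs, vocab_size), h0, params)
    _, probs = cross_entropy_loss(outs, targets)
    analytic = dict(zip(['W_xh', 'W_hh', 'W_hy', 'b_h', 'b_y'],
                        backward(probs, targets, caches, params)))

    for name in analytic:
        original = params[name]

        def f(perturbed, name=name, original=original):
            params[name] = perturbed           # swap in the perturbed copy
            value = loss_fn()
            params[name] = original            # restore
            return value

        numeric = numerical_gradient(f, original)
        rel_err = np.linalg.norm(numeric - analytic[name]) / (
            np.linalg.norm(numeric) + np.linalg.norm(analytic[name]) + 1e-12)
        print(name, rel_err)                   # expect roughly < 1e-5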
Extensions and Challenges
1. Implement LSTM
Add forget gate, input gate, output gate, and cell state:
f_t = sigmoid(W_f @ [h_{t-1}, x_t]) # Forget gate
i_t = sigmoid(W_i @ [h_{t-1}, x_t]) # Input gate
c_t = f_t * c_{t-1} + i_t * tanh(W_c @ [h_{t-1}, x_t]) # Cell state
o_t = sigmoid(W_o @ [h_{t-1}, x_t]) # Output gate
h_t = o_t * tanh(c_t) # Hidden state
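To make the gate equations concrete, here is a one-step LSTM forward pass in the same NumPy style as the RNN cell (a sketch; each gate weight acts on the concatenation [h_{t-1}; x_t] and has shape (hidden, hidden + vocab)):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step_forward(x, h_prev, c_prev, W_f, W_i, W_c, W_o, b_f, b_i, b_c, b_o):
    """One LSTM timestep on column vectors x (vocab, 1) and h_prev/c_prev (hidden, 1)."""
    concat = np.vstack([h_prev, x])          # [h_{t-1}; x_t]
    f = sigmoid(W_f @ concat + b_f)          # forget gate
    i = sigmoid(W_i @ concat + b_i)          # input gate
    c_tilde = np.tanh(W_c @ concat + b_c)    # candidate cell state
    c = f * c_prev + i * c_tilde             # cell state: the "gradient highway"
    o = sigmoid(W_o @ concat + b_o)          # output gate
    h = o * np.tanh(c)                       # hidden state
    return h, c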
2. Add Temperature Scheduling
Start with high temperature (exploration), decay to low (exploitation):
temperature = max(0.5, 2.0 - iteration / 10000)
3. Try on Music (MIDI)
Each โcharacterโ is a MIDI note. Generate melodies!
4. Try on Code
Train on Python source code. Generate syntactically plausible code.
5. Implement GRU
Simplified LSTM with only two gates:
z_t = sigmoid(W_z @ [h_{t-1}, x_t]) # Update gate
r_t = sigmoid(W_r @ [h_{t-1}, x_t]) # Reset gate
h_t = (1 - z_t) * h_{t-1} + z_t * tanh(W @ [r_t * h_{t-1}, x_t])
Real-World Connections
Ancestry of Modern LLMs
1986: RNN (Rumelhart)
        ↓
1997: LSTM (Hochreiter & Schmidhuber)
        ↓
2014: GRU (Cho et al.)
        ↓
2017: Transformer (Vaswani et al.)
        ↓
2018: GPT-1 (OpenAI)
        ↓
2020: GPT-3
        ↓
2023: GPT-4, Claude, etc.
The RNN you build is the great-grandfather of ChatGPT!
Where RNNs Are Still Used
- Speech recognition: Sequential audio processing
- Handwriting recognition: Stroke sequences
- Time series forecasting: Stock prices, weather
- Edge devices: Simpler than Transformers, less memory
Books That Will Help
| Topic | Book | Chapter |
|---|---|---|
| RNN fundamentals | "Deep Learning" by Goodfellow | Ch. 10 |
| Practical implementation | "Grokking Deep Learning" by Trask | Ch. 11-12 |
| LSTM/GRU | "Deep Learning" by Goodfellow | Ch. 10.10 |
| NLP applications | "Speech and Language Processing" by Jurafsky | Ch. 9 |
Online Resources
- Andrej Karpathy's "The Unreasonable Effectiveness of Recurrent Neural Networks" (blog post)
- Karpathy's min-char-rnn.py (reference implementation)
- Chris Olah's "Understanding LSTM Networks" (blog post)
Self-Assessment Checklist
Understanding
- Can explain why feedforward nets can't model sequences
- Can draw the unrolled RNN diagram
- Can explain vanishing/exploding gradients
- Can describe the difference between RNN, LSTM, GRU
Implementation
- RNN forward pass produces correct shapes
- BPTT backward pass passes gradient check
- Gradient clipping prevents explosions
- Training loop reduces loss over time
Generation
- Can generate text with different temperatures
- Output quality improves with training
- Can seed generation with custom text
Extensions
- Understand how LSTM adds cell state
- Could explain how to implement beam search
- Understand connection to Transformers
Final Note
This project is the bridge between classical neural networks and modern language models. The hidden state you implement here is a direct ancestor of the key-value caches in Transformers. The sequence modeling intuition you develop will serve you when understanding GPT, BERT, and beyond.
When you successfully generate Shakespearean text, you will have implemented the core insight that launched the NLP revolution: neural networks can learn to model the probability distribution of language.
You are building the foundation of everything from autocomplete to ChatGPT.