Project 8: MNIST From First Principles
The "Hello World" of AI, done the hard way - Build a neural network to recognize handwritten digits using nothing but NumPy, mastering softmax, one-hot encoding, and mini-batch gradient descent
Project Overview
| Attribute | Value |
|---|---|
| Difficulty | Level 4: Expert |
| Time Estimate | 2 Weeks |
| Language | Python (NumPy Only) |
| Prerequisites | Projects 3 (Linear Regression), 4 (Spam Filter), 6 (Fraud Detection MLP) |
| Main Book | "Neural Networks and Deep Learning" by Michael Nielsen - Chapter 1 |
| Knowledge Area | Computer Vision / Deep Learning |
Learning Objectives
After completing this project, you will be able to:
- Implement Softmax from scratch - Derive and code the function that turns 10 raw scores into probabilities summing to 1, including numerical stability tricks
- Master multi-class cross-entropy - Understand why this loss function works for classification and derive its gradient analytically
- Build one-hot encoding utilities - Convert integer labels to vector representations that neural networks can learn from
- Achieve true vectorization mastery - Process 60,000 images without Python loops, understanding broadcasting and matrix operations at a deep level
- Implement mini-batch training - Balance memory efficiency and gradient stability through batch processing
- Design multi-layer dense networks - Build networks with arbitrary hidden layers, understanding the shapes that flow through each layer
- Debug neural networks systematically - Use gradient checking, visualization, and layer-by-layer analysis to find and fix problems
- Visualize learned representations - Extract and display what each neuron has learned to "look for" in input images
The Core Question You're Answering
"How do we handle more than Yes/No answers?"
In Project 4 (Spam Filter), you built a binary classifier: Spam or Ham. But what if there are 10 possible answers? 100? 1000?
The key insight is that multi-class classification is just running multiple binary classifiers simultaneously and forcing them to compete. Softmax creates this competition - when one output neuronâs score goes up, all others must go down (because probabilities sum to 1).
Before coding, sit with these questions (a short sketch follows them):
- Why can't we just use 10 sigmoid neurons independently?
- What happens if two output neurons both claim 99% confidence?
- How do we train a network when only one answer is "correct"?
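To make the first question concrete, here is a minimal sketch (the sigmoid and softmax helpers are written inline for illustration): ten independent sigmoid outputs need not sum to 1, while softmax always does.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

logits = np.array([3.0, 2.9, 0.5, -1.0, 0.0, 1.2, -2.0, 0.3, -0.5, 0.1])
print(sigmoid(logits).sum())  # not 1 - several classes can look "confident" at once
print(softmax(logits).sum())  # exactly 1 - the classes compete for probability mass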
Concepts You Must Understand First
Stop and research these before coding. Each concept builds on the previous.
1. Multi-Class Classification Setup
Unlike binary classification where we have one output neuron, multi-class needs N output neurons for N classes. Each neuron represents the "evidence" for one class.
Binary (Project 4):                  Multi-Class (This Project):
Input -> Hidden -> [1 neuron]        Input -> Hidden -> [10 neurons]
                       |                                   |
                    sigmoid                             softmax
                       |                                   |
                    p(spam)                 [p(0), p(1), ..., p(9)]
Key insight: The output layer has 10 neurons, one for each digit. But before softmax, these are just raw scores (logits) - they can be any real number, including negative!
Book Reference: "Neural Networks and Deep Learning" Ch. 1.6 (Implementing our network to classify digits)
2. The Softmax Function and Its Properties
Softmax converts a vector of arbitrary real numbers into a probability distribution:
\[\text{softmax}(z_i) = \frac{e^{z_i}}{\sum_{j=1}^{K} e^{z_j}}\]
Properties you must internalize:
- All outputs are in (0, 1)
- All outputs sum to exactly 1
- The largest input becomes the largest probability
- It's differentiable everywhere
Example:
logits = [2.0, 1.0, 0.1]
e^2.0 = 7.389
e^1.0 = 2.718
e^0.1 = 1.105
sum = 11.212
softmax = [7.389/11.212, 2.718/11.212, 1.105/11.212]
= [0.659, 0.242, 0.099] (sums to 1.0)
Critical: Unlike sigmoid (which squashes independently), softmax creates competition - increasing one logit decreases all other probabilities.
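As a quick check of the worked example above, a short NumPy snippet reproduces the same numbers:
import numpy as np

logits = np.array([2.0, 1.0, 0.1])
exp_z = np.exp(logits)
probs = exp_z / exp_z.sum()
print(probs)        # approximately [0.659, 0.242, 0.099]
print(probs.sum())  # 1.0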
Book Reference: "Deep Learning" by Goodfellow et al., Ch. 6.2.2.3
3. Cross-Entropy Loss for Multiple Classes
For binary classification, we used: \(L = -[y \log(\hat{y}) + (1-y)\log(1-\hat{y})]\)
For multi-class with one-hot encoding: \(L = -\sum_{i=1}^{K} y_i \log(\hat{y}_i)\)
Since only one $y_i = 1$ (the correct class) and all others are 0, this simplifies to: \(L = -\log(\hat{y}_{\text{correct}})\)
Intuition: We want to maximize the probability of the correct class. Log makes this a minimization problem. If we predict 0.9 for the correct class, loss is -log(0.9) = 0.105 (low). If we predict 0.1, loss is -log(0.1) = 2.303 (high).
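The two loss values quoted above come straight from the formula; a two-line check:
import numpy as np
print(-np.log(0.9))  # ~0.105 - confident and correct, small loss
print(-np.log(0.1))  # ~2.303 - confident and wrong, large loss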
Book Reference: "Grokking Deep Learning" Ch. 6
4. One-Hot Encoding
Neural networks work with numbers, not categorical labels. We convert:
Label: 3 -> [0, 0, 0, 1, 0, 0, 0, 0, 0, 0]
Label: 7 -> [0, 0, 0, 0, 0, 0, 0, 1, 0, 0]
Label: 0 -> [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
This encoding lets us:
- Compute loss directly (dot product with log probabilities)
- Avoid implying ordering (3 is not "between" 2 and 4 for classification)
- Enable batch matrix operations
def one_hot(labels, num_classes):
"""Convert integer labels to one-hot vectors."""
n = len(labels)
one_hot_matrix = np.zeros((n, num_classes))
one_hot_matrix[np.arange(n), labels] = 1
return one_hot_matrix
Book Reference: "Hands-On Machine Learning" by Geron, Ch. 4
5. Mini-Batch Training
With 60,000 training images, we have three choices:
| Method | Batch Size | Updates per Epoch | Gradient Quality |
|---|---|---|---|
| Stochastic GD | 1 | 60,000 | Very noisy |
| Full Batch GD | 60,000 | 1 | Very smooth but slow |
| Mini-Batch GD | 32-256 | 235-1875 | Good balance |
Mini-batch advantages:
- Uses vectorization (fast)
- Gradients are stable enough to converge
- Adds regularization through noise
- Fits in memory
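The "Updates per Epoch" column is just 60,000 divided by the batch size, rounded up; a quick sketch to reproduce it:
import math

n = 60_000
for batch_size in (1, 32, 256, 60_000):
    print(batch_size, math.ceil(n / batch_size))  # 60000, 1875, 235, 1 updates per epoch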
Book Reference: "Deep Learning" Ch. 8.1.3
6. Data Loading and Shuffling
The order of training examples matters! If you train on all 0s, then all 1s, then all 2s... the network will "forget" earlier digits (catastrophic forgetting).
Shuffling after each epoch prevents the network from learning spurious patterns in the order of examples.
def shuffle_data(X, y):
"""Shuffle training data while keeping X-y pairs aligned."""
indices = np.random.permutation(len(y))
return X[indices], y[indices]
Book Reference: "Neural Networks and Deep Learning" Ch. 3
Deep Theoretical Foundation
Softmax Derivation and Interpretation
Let's derive softmax from first principles. We want to convert scores to probabilities.
Attempt 1: Just normalize \(p_i = \frac{z_i}{\sum_j z_j}\)
Problem: What if $z_i$ is negative? We'd get negative "probabilities."
Attempt 2: Use absolute values \(p_i = \frac{|z_i|}{\sum_j |z_j|}\)
Problem: z = [-10, 0.1] would give [0.99, 0.01], but -10 should mean "definitely not this class."
Attempt 3: Exponentiate first \(p_i = \frac{e^{z_i}}{\sum_j e^{z_j}}\)
This works because:
- $e^x > 0$ for all x (no negative probabilities)
- $e^{-10} \approx 0.00005$ (negative scores become tiny)
- $e^{10} \approx 22026$ (positive scores become large)
- The ratio structure naturally creates probabilities
Interpretation: Softmax is the "argmax" function made differentiable. Argmax returns [0,0,1,0] (one-hot for max). Softmax returns [0.02, 0.08, 0.85, 0.05] (soft version where all classes have some probability).
Numerical Stability: The Log-Sum-Exp Trick
Naive softmax will overflow or underflow for large/small logits:
# BAD: Will overflow
z = np.array([1000, 1001, 1002])
exp_z = np.exp(z) # [inf, inf, inf] - overflow!
Solution: Subtract the maximum before exponentiating:
\[\text{softmax}(z_i) = \frac{e^{z_i - \max(z)}}{\sum_j e^{z_j - \max(z)}}\]
This is mathematically identical (the max cancels in the ratio) but numerically stable:
def stable_softmax(z):
"""Numerically stable softmax."""
z_shifted = z - np.max(z, axis=-1, keepdims=True)
exp_z = np.exp(z_shifted)
return exp_z / np.sum(exp_z, axis=-1, keepdims=True)
Book Reference: "Deep Learning" Ch. 4.1 (Numerical Computation)
Softmax + Cross-Entropy Gradient Derivation
This is where the magic happens. Let's derive the gradient of the loss with respect to the logits.
Setup:
- Logits: $z = [z_1, z_2, \ldots, z_K]$
- Softmax output: $\hat{y}_i = \frac{e^{z_i}}{\sum_j e^{z_j}}$
- One-hot target: $y = [0, 0, 1, 0, \ldots, 0]$ (1 at the correct class)
- Cross-entropy loss: $L = -\sum_i y_i \log(\hat{y}_i)$
Goal: Find $\frac{\partial L}{\partial z_k}$ for all k.
Step 1: Derivative of softmax output with respect to logit
For $i = k$ (same index): \(\frac{\partial \hat{y}_i}{\partial z_k} = \hat{y}_i(1 - \hat{y}_i)\)
For $i \neq k$ (different indices): \(\frac{\partial \hat{y}_i}{\partial z_k} = -\hat{y}_i \hat{y}_k\)
Compact form: $\frac{\partial \hat{y}_i}{\partial z_k} = \hat{y}_i(\delta_{ik} - \hat{y}_k)$
where $\delta_{ik}$ is the Kronecker delta (1 if i=k, 0 otherwise).
Step 2: Apply chain rule
\[\frac{\partial L}{\partial z_k} = \sum_i \frac{\partial L}{\partial \hat{y}_i} \cdot \frac{\partial \hat{y}_i}{\partial z_k}\]
Since $\frac{\partial L}{\partial \hat{y}_i} = -\frac{y_i}{\hat{y}_i}$:
\[\frac{\partial L}{\partial z_k} = -\sum_i \frac{y_i}{\hat{y}_i} \cdot \hat{y}_i(\delta_{ik} - \hat{y}_k) = -\sum_i y_i(\delta_{ik} - \hat{y}_k) = -y_k + \hat{y}_k \sum_i y_i\]
Since $\sum_i y_i = 1$ (a one-hot vector sums to 1):
\[\frac{\partial L}{\partial z_k} = \hat{y}_k - y_k\]
The beautiful result: The gradient is simply prediction minus target.
# Gradient of cross-entropy loss w.r.t. logits
dL_dz = softmax_output - one_hot_target
# That's it! No complex derivatives needed in code.
This is why softmax + cross-entropy is so popular: The gradient is trivially simple.
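If you want to convince yourself numerically, a small finite-difference check (softmax and loss helpers written inline) confirms that the analytical gradient matches:
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def loss(z, y):
    return -np.sum(y * np.log(softmax(z)))

z = np.array([2.0, 1.0, 0.1])
y = np.array([0.0, 1.0, 0.0])   # correct class is index 1
analytic = softmax(z) - y        # the (prediction - target) result

eps = 1e-6
numeric = np.array([
    (loss(z + eps * np.eye(3)[k], y) - loss(z - eps * np.eye(3)[k], y)) / (2 * eps)
    for k in range(3)
])
print(np.max(np.abs(analytic - numeric)))  # tiny, e.g. well below 1e-8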
Why Softmax Creates Competition
Consider a network predicting between cats and dogs:
Independent sigmoids:
- P(cat) = 0.8, P(dog) = 0.9
- Both are high! The network is confused but doesn't "know" it
Softmax:
- If logit_cat = 2 and logit_dog = 2.5
- P(cat) = 0.38, P(dog) = 0.62
- The outputs are forced to sum to 1
When we increase logit_cat:
- P(cat) goes UP
- P(dog) goes DOWN (even though we didnât touch logit_dog!)
This automatic competition is what makes softmax work for mutually exclusive classes.
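A tiny sketch (softmax helper written inline) makes this visible: raising only logit_cat still lowers P(dog).
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

print(softmax(np.array([2.0, 2.5])))  # roughly [0.38, 0.62]
print(softmax(np.array([3.0, 2.5])))  # P(cat) rises and P(dog) falls, though logit_dog never changed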
Weight Initialization for Deep Networks
Random initialization matters more than you think. Bad initialization can cause:
- Vanishing gradients: Outputs saturate, gradients approach 0
- Exploding gradients: Values grow exponentially through layers
- Symmetry breaking failure: All neurons learn the same thing
Xavier/Glorot Initialization (for tanh/sigmoid): \(W \sim \mathcal{N}\left(0, \sqrt{\frac{2}{n_{\text{in}} + n_{\text{out}}}}\right)\)
He Initialization (for ReLU): \(W \sim \mathcal{N}\left(0, \sqrt{\frac{2}{n_{\text{in}}}}\right)\)
def he_init(shape):
"""He initialization for ReLU networks."""
fan_in = shape[0]
std = np.sqrt(2.0 / fan_in)
return np.random.randn(*shape) * std
def xavier_init(shape):
"""Xavier initialization for tanh/sigmoid networks."""
fan_in, fan_out = shape[0], shape[1]
std = np.sqrt(2.0 / (fan_in + fan_out))
return np.random.randn(*shape) * std
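As a rough sanity check on why this scaling matters, the sketch below (using the same scaling as he_init) pushes a random batch through a few ReLU layers and prints the activation spread, which should stay roughly constant instead of shrinking toward zero:
import numpy as np

np.random.seed(0)
x = np.random.randn(256, 784)
for fan_in, fan_out in [(784, 128), (128, 64), (64, 64)]:
    W = np.random.randn(fan_in, fan_out) * np.sqrt(2.0 / fan_in)  # He scaling
    x = np.maximum(0, x @ W)                                      # Dense + ReLU
    print(round(float(x.std()), 3))  # stays on the order of 1 rather than collapsing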
Book Reference: "Deep Learning" Ch. 8.4
Real World Outcome
When you complete this project, here is exactly what you should see:
Training the Network
$ python mnist_train.py
Loading MNIST dataset...
Training set: 60000 images (28x28 pixels)
Test set: 10000 images (28x28 pixels)
Preprocessing:
- Flattening 28x28 -> 784
- Normalizing to [0, 1]
- One-hot encoding 10 classes
Network Architecture: [784 -> 128 -> 64 -> 10]
Layer 1: 784 x 128 + 128 = 100,480 parameters
Layer 2: 128 x 64 + 64 = 8,256 parameters
Layer 3: 64 x 10 + 10 = 650 parameters
Total: 109,386 trainable parameters (weights + biases)
Hyperparameters:
Learning rate: 0.1
Batch size: 64
Epochs: 10
Training...
Epoch 1/10 | Train Loss: 0.3821 | Train Acc: 89.2% | Test Acc: 90.1% | Time: 12.3s
Epoch 2/10 | Train Loss: 0.1892 | Train Acc: 94.5% | Test Acc: 94.2% | Time: 11.8s
Epoch 3/10 | Train Loss: 0.1341 | Train Acc: 96.0% | Test Acc: 95.3% | Time: 11.9s
Epoch 4/10 | Train Loss: 0.1023 | Train Acc: 97.0% | Test Acc: 96.1% | Time: 12.1s
Epoch 5/10 | Train Loss: 0.0812 | Train Acc: 97.6% | Test Acc: 96.5% | Time: 12.0s
Epoch 6/10 | Train Loss: 0.0654 | Train Acc: 98.0% | Test Acc: 96.8% | Time: 11.7s
Epoch 7/10 | Train Loss: 0.0531 | Train Acc: 98.4% | Test Acc: 97.0% | Time: 11.9s
Epoch 8/10 | Train Loss: 0.0432 | Train Acc: 98.7% | Test Acc: 97.1% | Time: 12.2s
Epoch 9/10 | Train Loss: 0.0351 | Train Acc: 98.9% | Test Acc: 97.2% | Time: 11.8s
Epoch 10/10 | Train Loss: 0.0285 | Train Acc: 99.1% | Test Acc: 97.2% | Time: 12.0s
Training complete!
Final Test Accuracy: 97.2% (9720/10000 correct)
Model saved to: mnist_model.npz
Confusion Matrix:
0 1 2 3 4 5 6 7 8 9
+----------------------------------------------------
0 | 971 0 2 1 0 2 2 1 1 0
1 | 0 1121 3 2 0 1 3 1 4 0
2 | 4 2 1002 5 3 0 2 7 6 1
3 | 0 0 4 983 0 10 0 5 5 3
4 | 1 0 2 0 958 0 5 2 2 12
5 | 2 0 0 7 1 869 6 1 4 2
6 | 5 3 1 0 4 4 939 0 2 0
7 | 1 4 8 4 1 0 0 997 2 11
8 | 3 0 3 6 4 5 3 4 940 6
9 | 3 4 0 4 10 4 1 6 3 974
Predicting Individual Digits
$ python predict_digit.py --image my_drawing.png
Loading model from mnist_model.npz...
Loading image: my_drawing.png (28x28)
Preprocessing:
- Grayscale conversion
- Resize to 28x28
- Invert if necessary (MNIST has white digits on black)
- Normalize to [0, 1]
Forward pass through network...
Prediction: 7 (Confidence: 99.2%)
Full probability distribution:
0: 0.001%
1: 0.003%
2: 0.012%
3: 0.002%
4: 0.008%
5: 0.001%
6: 0.002%
7: 99.215% <-- PREDICTED
8: 0.721%
9: 0.035%
Visualizing Learned Weights
$ python visualize_weights.py
Loading model from mnist_model.npz...
Generating visualization of first layer weights...
Each image shows what input pattern activates one hidden neuron.
Saved: weights_layer1.png (16x8 grid of 784->128 learned patterns)
Saved: weight_histogram.png (distribution of weight values)
Observations:
- Some neurons detect horizontal strokes
- Some neurons detect vertical strokes
- Some neurons detect curves/corners
- Weights form interpretable patterns resembling digit fragments
Example Weight Visualization
When you visualize the 128 neurons of the first layer (each is a 28x28 pattern):
+---+---+---+---+---+---+---+---+
| | | / | \ | | - | | | O | <- Different patterns
+---+---+---+---+---+---+---+---+
| C | ) | ( | _ | | | | |
+---+---+---+---+---+---+---+---+
...
You'll see:
- Edge detectors (horizontal, vertical, diagonal lines)
- Curve detectors (arcs like in 0, 6, 8, 9)
- Corner detectors (L-shapes like in 7, 4)
- Position-specific patterns (top-left vs bottom-right)
Solution Architecture
High-Level System Design
+------------------------------------------------------------------+
| MNIST Training Pipeline |
+------------------------------------------------------------------+
| |
| +-----------+ +------------+ +-----------+ |
| | Data | | Network | | Training | |
| | Module | | Module | | Module | |
| +-----------+ +------------+ +-----------+ |
| | | | |
| - Load MNIST - Layer class - Mini-batch SGD |
| - Preprocess - Forward pass - Learning rate |
| - Batch - Backward pass - Epoch loop |
| - Shuffle - Weight init - Metrics |
| |
+------------------------------------------------------------------+
Data Flow Through the Network
Input Image (28x28 = 784 pixels)
|
v
+---------------+
| Flatten |
| (28x28->784)|
+---------------+
|
v
+---------------+
| Dense Layer 1 |
| 784 -> 128 |
| + ReLU |
+---------------+
|
v
+---------------+
| Dense Layer 2 |
| 128 -> 64 |
| + ReLU |
+---------------+
|
v
+---------------+
| Output Layer |
| 64 -> 10 |
| + Softmax |
+---------------+
|
v
Probabilities [p0, p1, ..., p9]
Matrix Dimensions at Each Step
Understanding shapes is crucial for debugging. For a batch of 64 images:
| Layer | Input Shape | Weights Shape | Output Shape |
|---|---|---|---|
| Input (batch of images) | (64, 784) | - | - |
| Dense Layer 1 | (64, 784) | (784, 128) | (64, 128) |
| ReLU Activation | (64, 128) | - | (64, 128) |
| Dense Layer 2 | (64, 128) | (128, 64) | (64, 64) |
| ReLU Activation | (64, 64) | - | (64, 64) |
| Output Layer | (64, 64) | (64, 10) | (64, 10) |
| Softmax | (64, 10) | - | (64, 10) |
Class Structure
# Recommended module structure
class Layer:
"""Abstract base class for network layers."""
def forward(self, x): pass
def backward(self, grad): pass
class Dense(Layer):
"""Fully connected layer."""
def __init__(self, input_dim, output_dim):
self.W = he_init((input_dim, output_dim))
self.b = np.zeros(output_dim)
def forward(self, x):
# Cache input for backward pass
self.x = x
return x @ self.W + self.b
def backward(self, grad):
# Gradient w.r.t. weights
self.dW = self.x.T @ grad
self.db = np.sum(grad, axis=0)
# Gradient to pass to previous layer
return grad @ self.W.T
class ReLU(Layer):
"""ReLU activation function."""
def forward(self, x):
self.x = x
return np.maximum(0, x)
def backward(self, grad):
return grad * (self.x > 0)
class Softmax:
"""Softmax activation (output layer)."""
def forward(self, x):
exp_x = np.exp(x - np.max(x, axis=1, keepdims=True))
return exp_x / np.sum(exp_x, axis=1, keepdims=True)
class Network:
"""Sequential network of layers."""
def __init__(self, layers):
self.layers = layers
def forward(self, x):
for layer in self.layers:
x = layer.forward(x)
return x
def backward(self, grad):
for layer in reversed(self.layers):
grad = layer.backward(grad)
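A short usage sketch (assuming the classes and the he_init helper defined above) wires these pieces together and confirms the shapes from the table:
import numpy as np

net = Network([
    Dense(784, 128), ReLU(),
    Dense(128, 64), ReLU(),
    Dense(64, 10),
])
x = np.random.rand(64, 784)               # a fake batch of 64 flattened images
probs = Softmax().forward(net.forward(x))
print(probs.shape)                        # (64, 10)
print(probs.sum(axis=1)[:3])              # each row sums to 1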
File Organization
mnist-from-scratch/
+-- data/
| +-- download_mnist.py # Script to download MNIST
| +-- train-images-idx3-ubyte
| +-- train-labels-idx1-ubyte
| +-- t10k-images-idx3-ubyte
| +-- t10k-labels-idx1-ubyte
|
+-- src/
| +-- layers.py # Dense, ReLU, Softmax
| +-- network.py # Network class
| +-- loss.py # CrossEntropyLoss
| +-- data_loader.py # MNIST loading, batching
| +-- utils.py # One-hot, shuffling, metrics
|
+-- train.py # Main training script
+-- predict.py # Inference on new images
+-- visualize.py # Weight visualization
+-- tests/
+-- test_layers.py # Unit tests for layers
+-- test_gradients.py # Gradient checking
Phased Implementation Guide
Phase 1: Load and Explore MNIST Data (2-3 hours)
Goal: Download MNIST and understand its structure.
Tasks:
- Download the dataset
MNIST is available at http://yann.lecun.com/exdb/mnist/ or through various libraries.
# Option 1: Manual download
import urllib.request
import gzip

urls = {
    'train_images': 'http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz',
    'train_labels': 'http://yann.lecun.com/exdb/mnist/train-labels-idx1-ubyte.gz',
    'test_images': 'http://yann.lecun.com/exdb/mnist/t10k-images-idx3-ubyte.gz',
    'test_labels': 'http://yann.lecun.com/exdb/mnist/t10k-labels-idx1-ubyte.gz',
}

for name, url in urls.items():
    print(f"Downloading {name}...")
    urllib.request.urlretrieve(url, f'data/{name}.gz')
- Parse the IDX file format
MNIST uses a custom binary format. Write a parser:
def load_mnist_images(filepath):
    """Load MNIST images from IDX file."""
    with gzip.open(filepath, 'rb') as f:
        # Magic number (2051 for images)
        magic = int.from_bytes(f.read(4), 'big')
        assert magic == 2051
        # Dimensions
        num_images = int.from_bytes(f.read(4), 'big')
        rows = int.from_bytes(f.read(4), 'big')
        cols = int.from_bytes(f.read(4), 'big')
        # Read image data
        data = np.frombuffer(f.read(), dtype=np.uint8)
        return data.reshape(num_images, rows, cols)

def load_mnist_labels(filepath):
    """Load MNIST labels from IDX file."""
    with gzip.open(filepath, 'rb') as f:
        magic = int.from_bytes(f.read(4), 'big')
        assert magic == 2049
        num_labels = int.from_bytes(f.read(4), 'big')
        return np.frombuffer(f.read(), dtype=np.uint8)
- Explore the data
train_images = load_mnist_images('data/train-images-idx3-ubyte.gz')
train_labels = load_mnist_labels('data/train-labels-idx1-ubyte.gz')

print(f"Shape: {train_images.shape}")  # (60000, 28, 28)
print(f"Pixel range: {train_images.min()} to {train_images.max()}")  # 0 to 255
print(f"Label distribution: {np.bincount(train_labels)}")

# Visualize a few examples
import matplotlib.pyplot as plt
fig, axes = plt.subplots(2, 5, figsize=(10, 4))
for i, ax in enumerate(axes.flat):
    ax.imshow(train_images[i], cmap='gray')
    ax.set_title(f"Label: {train_labels[i]}")
    ax.axis('off')
plt.savefig('sample_digits.png')
Success Criteria: You can load all 60,000 training images and visualize samples.
Phase 2: Preprocess Data (1-2 hours)
Goal: Transform raw images into network-ready format.
Tasks:
- Flatten images
28x28 images become 784-element vectors:
def flatten_images(images):
    """Flatten 28x28 images to 784-length vectors."""
    return images.reshape(images.shape[0], -1)

X_train = flatten_images(train_images)
print(f"Flattened shape: {X_train.shape}")  # (60000, 784)
- Normalize pixel values
Neural networks work better with inputs in [0, 1] or standardized:
def normalize(X):
    """Normalize pixel values to [0, 1]."""
    return X.astype(np.float32) / 255.0

X_train = normalize(X_train)
print(f"Normalized range: {X_train.min():.3f} to {X_train.max():.3f}")
- Create validation split
Reserve some training data for validation:
def train_val_split(X, y, val_ratio=0.1):
    """Split data into training and validation sets."""
    n = len(y)
    indices = np.random.permutation(n)
    val_size = int(n * val_ratio)
    val_idx = indices[:val_size]
    train_idx = indices[val_size:]
    return (X[train_idx], y[train_idx], X[val_idx], y[val_idx])

X_train, y_train, X_val, y_val = train_val_split(X_train, train_labels)
Success Criteria: Data is flattened, normalized, and split into train/val/test sets.
Phase 3: One-Hot Encode Labels (30 minutes)
Goal: Convert integer labels to one-hot vectors.
Tasks:
- Implement one-hot encoding
def one_hot(labels, num_classes=10):
    """Convert integer labels to one-hot vectors."""
    n = len(labels)
    one_hot_matrix = np.zeros((n, num_classes), dtype=np.float32)
    one_hot_matrix[np.arange(n), labels] = 1.0
    return one_hot_matrix
- Verify correctness
# Test
labels = np.array([0, 1, 2, 3])
encoded = one_hot(labels, 10)
print(encoded)
# [[1, 0, 0, 0, 0, 0, 0, 0, 0, 0],
#  [0, 1, 0, 0, 0, 0, 0, 0, 0, 0],
#  [0, 0, 1, 0, 0, 0, 0, 0, 0, 0],
#  [0, 0, 0, 1, 0, 0, 0, 0, 0, 0]]
assert (encoded.sum(axis=1) == 1).all()  # Each row sums to 1
assert encoded.sum() == 4                # Total of 4 ones
Success Criteria: one_hot([3, 7], 10) returns correct 2x10 matrix.
Phase 4: Network Architecture Setup (2-3 hours)
Goal: Implement the Dense layer and ReLU activation classes.
Tasks:
- Implement Dense layer (forward only first)
class Dense:
    def __init__(self, input_dim, output_dim):
        # He initialization
        self.W = np.random.randn(input_dim, output_dim) * np.sqrt(2.0 / input_dim)
        self.b = np.zeros(output_dim)

    def forward(self, x):
        """Forward pass: y = xW + b"""
        self.x = x  # Cache for backward
        return x @ self.W + self.b
- Implement ReLU activation
class ReLU:
    def forward(self, x):
        """ReLU: max(0, x)"""
        self.x = x  # Cache for backward
        return np.maximum(0, x)
- Test the forward pass
# Create a simple 2-layer network
layer1 = Dense(784, 128)
relu1 = ReLU()
layer2 = Dense(128, 10)

# Forward pass
x = X_train[:5]            # 5 sample images
h = layer1.forward(x)      # (5, 128)
h = relu1.forward(h)       # (5, 128)
out = layer2.forward(h)    # (5, 10)

print(f"Output shape: {out.shape}")  # (5, 10)
print(f"Sample output: {out[0]}")    # 10 raw scores (logits)
Success Criteria: Forward pass produces (batch_size, 10) output tensor.
Phase 5: Softmax and Cross-Entropy (2-3 hours)
Goal: Implement numerically stable softmax and cross-entropy loss.
Tasks:
- Implement stable softmax
def softmax(z):
    """Numerically stable softmax."""
    # Subtract max for stability
    z_shifted = z - np.max(z, axis=1, keepdims=True)
    exp_z = np.exp(z_shifted)
    return exp_z / np.sum(exp_z, axis=1, keepdims=True)
- Test softmax properties
logits = np.array([[1.0, 2.0, 3.0],
                   [1000.0, 1001.0, 1002.0]])  # Large values
probs = softmax(logits)
print(f"Probabilities:\n{probs}")
print(f"Row sums: {probs.sum(axis=1)}")     # Should be [1.0, 1.0]
assert np.allclose(probs.sum(axis=1), 1.0)  # All sum to 1
- Implement cross-entropy loss
def cross_entropy_loss(probs, targets):
    """
    Compute cross-entropy loss.
    probs: (batch, num_classes) - softmax outputs
    targets: (batch, num_classes) - one-hot encoded
    """
    # Clip to prevent log(0)
    probs_clipped = np.clip(probs, 1e-10, 1 - 1e-10)
    # Cross-entropy: -sum(target * log(pred))
    loss = -np.sum(targets * np.log(probs_clipped)) / len(probs)
    return loss
- Test loss computation
# Perfect prediction
probs_perfect = np.array([[0.99, 0.01]])
targets = np.array([[1.0, 0.0]])
loss = cross_entropy_loss(probs_perfect, targets)
print(f"Perfect prediction loss: {loss:.4f}")  # Very low

# Wrong prediction
probs_wrong = np.array([[0.01, 0.99]])
loss = cross_entropy_loss(probs_wrong, targets)
print(f"Wrong prediction loss: {loss:.4f}")    # Very high
Success Criteria: Softmax outputs sum to 1; loss is low for correct predictions, high for wrong.
Phase 6: Forward Pass (Vectorized) (2-3 hours)
Goal: Chain all layers together for a complete forward pass.
Tasks:
- Build the Network class
class Network:
    def __init__(self, layer_sizes):
        """
        Create network with given layer sizes.
        Example: [784, 128, 64, 10] creates 3 Dense layers.
        """
        self.layers = []
        for i in range(len(layer_sizes) - 1):
            self.layers.append(Dense(layer_sizes[i], layer_sizes[i+1]))
            if i < len(layer_sizes) - 2:  # No ReLU after last layer
                self.layers.append(ReLU())

    def forward(self, x):
        """Forward pass through all layers."""
        for layer in self.layers:
            x = layer.forward(x)
        return x

    def predict(self, x):
        """Get class predictions."""
        logits = self.forward(x)
        probs = softmax(logits)
        return np.argmax(probs, axis=1)
- Test on full dataset
net = Network([784, 128, 64, 10])

# Forward pass on all training data
logits = net.forward(X_train)
probs = softmax(logits)

# Initial accuracy (should be ~10% random)
predictions = np.argmax(probs, axis=1)
accuracy = (predictions == y_train).mean()
print(f"Initial accuracy: {accuracy:.1%}")  # ~10%
Success Criteria: Network makes predictions (even if random at first).
Phase 7: Backward Pass (Vectorized) (4-5 hours)
Goal: Implement gradient computation for all layers.
This is the hardest phase. Take your time.
Tasks:
- Implement Dense backward pass
class Dense:
    # ... (keep forward method)

    def backward(self, grad_output):
        """
        Backward pass.
        grad_output: gradient of loss w.r.t. this layer's output
        (it already carries the 1/batch_size factor from the loss gradient)
        Returns: gradient of loss w.r.t. this layer's input
        """
        # Gradient w.r.t. weights (no extra division here: averaging
        # over the batch happens once, in softmax_cross_entropy_backward)
        self.dW = self.x.T @ grad_output
        # Gradient w.r.t. biases
        self.db = np.sum(grad_output, axis=0)
        # Gradient w.r.t. input (to pass to previous layer)
        grad_input = grad_output @ self.W.T
        return grad_input
- Implement ReLU backward pass
class ReLU:
    # ... (keep forward method)

    def backward(self, grad_output):
        """
        Backward pass.
        ReLU gradient: 1 if x > 0, else 0
        """
        return grad_output * (self.x > 0)
- Implement softmax + cross-entropy gradient
Recall the beautiful result: gradient is simply (prediction - target)
def softmax_cross_entropy_backward(probs, targets):
    """
    Gradient of cross-entropy loss w.r.t. logits.
    This combines softmax and cross-entropy gradients.
    """
    return (probs - targets) / len(probs)
- Add backward method to Network
class Network:
    # ... (keep existing methods)

    def backward(self, grad):
        """Backward pass through all layers in reverse."""
        for layer in reversed(self.layers):
            grad = layer.backward(grad)
- Gradient checking (CRITICAL)
Verify your gradients numerically:
def gradient_check(net, x, y, epsilon=1e-5):
    """Verify analytical gradients match numerical gradients."""
    # Get analytical gradient
    logits = net.forward(x)
    probs = softmax(logits)
    loss = cross_entropy_loss(probs, y)
    grad = softmax_cross_entropy_backward(probs, y)
    net.backward(grad)

    # Check each weight in first layer
    layer = net.layers[0]
    for i in range(min(5, layer.W.shape[0])):
        for j in range(min(5, layer.W.shape[1])):
            # Numerical gradient via central differences
            original = layer.W[i, j]

            layer.W[i, j] = original + epsilon
            logits_plus = net.forward(x)
            loss_plus = cross_entropy_loss(softmax(logits_plus), y)

            layer.W[i, j] = original - epsilon
            logits_minus = net.forward(x)
            loss_minus = cross_entropy_loss(softmax(logits_minus), y)

            layer.W[i, j] = original
            numerical_grad = (loss_plus - loss_minus) / (2 * epsilon)
            analytical_grad = layer.dW[i, j]

            rel_error = abs(numerical_grad - analytical_grad) / (
                abs(numerical_grad) + abs(analytical_grad) + 1e-8)
            if rel_error > 1e-5:
                print(f"Gradient check FAILED at ({i},{j}): "
                      f"num={numerical_grad:.6f}, ana={analytical_grad:.6f}")
                return False

    print("Gradient check PASSED!")
    return True
Success Criteria: Gradient check passes (relative error < 1e-5).
Phase 8: Mini-Batch Data Loader (1-2 hours)
Goal: Create efficient batch iteration.
Tasks:
- Implement batch generator
def create_batches(X, y, batch_size, shuffle=True):
    """
    Generate mini-batches from data.
    Yields (X_batch, y_batch) tuples.
    """
    n = len(y)
    indices = np.arange(n)
    if shuffle:
        np.random.shuffle(indices)
    for start in range(0, n, batch_size):
        end = min(start + batch_size, n)
        batch_idx = indices[start:end]
        yield X[batch_idx], y[batch_idx]
- Test batching
batch_size = 64
num_batches = 0
total_samples = 0

for X_batch, y_batch in create_batches(X_train, y_train_onehot, batch_size):
    num_batches += 1
    total_samples += len(y_batch)

print(f"Number of batches: {num_batches}")
print(f"Total samples: {total_samples}")
# Ceiling division: the final partial batch counts too
print(f"Expected batches: {-(-len(y_train) // batch_size)}")
Success Criteria: Batches cover all data exactly once per epoch.
Phase 9: Training Loop (3-4 hours)
Goal: Put it all together into a training loop.
Tasks:
- Implement SGD update
def sgd_update(net, learning_rate):
    """Update weights using SGD."""
    for layer in net.layers:
        if hasattr(layer, 'W'):
            layer.W -= learning_rate * layer.dW
            layer.b -= learning_rate * layer.db
- Implement training loop
def train(net, X_train, y_train, X_val, y_val,
          epochs=10, batch_size=64, learning_rate=0.1):
    """Train the network."""
    y_train_onehot = one_hot(y_train)

    for epoch in range(epochs):
        epoch_loss = 0
        num_batches = 0

        # Training
        for X_batch, y_batch in create_batches(X_train, y_train_onehot, batch_size):
            # Forward
            logits = net.forward(X_batch)
            probs = softmax(logits)
            loss = cross_entropy_loss(probs, y_batch)
            epoch_loss += loss
            num_batches += 1

            # Backward
            grad = softmax_cross_entropy_backward(probs, y_batch)
            net.backward(grad)

            # Update
            sgd_update(net, learning_rate)

        # Evaluation
        train_acc = evaluate(net, X_train, y_train)
        val_acc = evaluate(net, X_val, y_val)
        print(f"Epoch {epoch+1}/{epochs} | "
              f"Loss: {epoch_loss/num_batches:.4f} | "
              f"Train Acc: {train_acc:.1%} | "
              f"Val Acc: {val_acc:.1%}")

def evaluate(net, X, y):
    """Compute accuracy."""
    predictions = net.predict(X)
    return (predictions == y).mean()
- Train and monitor
net = Network([784, 128, 64, 10])
train(net, X_train, y_train, X_val, y_val,
      epochs=10, batch_size=64, learning_rate=0.1)
Success Criteria: Accuracy increases each epoch; reaches >95% on validation.
Phase 10: Evaluation and Visualization (2-3 hours)
Goal: Final testing and weight visualization.
Tasks:
- Evaluate on test set
test_acc = evaluate(net, X_test, y_test)
print(f"Test accuracy: {test_acc:.2%}")  # Should be ~97%
- Generate confusion matrix
def confusion_matrix(y_true, y_pred, num_classes=10):
    """Generate confusion matrix."""
    cm = np.zeros((num_classes, num_classes), dtype=int)
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1
    return cm

predictions = net.predict(X_test)
cm = confusion_matrix(y_test, predictions)
print("Confusion Matrix:")
print(cm)
- Visualize learned weights
def visualize_weights(layer, num_show=64):
    """Visualize first layer weights as images."""
    W = layer.W  # (784, hidden_size)
    # Each column is one neuron's weights
    num_neurons = min(num_show, W.shape[1])
    grid_size = int(np.ceil(np.sqrt(num_neurons)))

    fig, axes = plt.subplots(grid_size, grid_size, figsize=(12, 12))
    for i, ax in enumerate(axes.flat):
        if i < num_neurons:
            weights = W[:, i].reshape(28, 28)
            ax.imshow(weights, cmap='RdBu', vmin=-0.5, vmax=0.5)
        ax.axis('off')
    plt.savefig('weights_visualization.png')
    print("Saved weights_visualization.png")

visualize_weights(net.layers[0])
- Save and load model
def save_model(net, filepath):
    """Save model weights."""
    weights = {}
    for i, layer in enumerate(net.layers):
        if hasattr(layer, 'W'):
            weights[f'layer_{i}_W'] = layer.W
            weights[f'layer_{i}_b'] = layer.b
    np.savez(filepath, **weights)

def load_model(net, filepath):
    """Load model weights."""
    data = np.load(filepath)
    for i, layer in enumerate(net.layers):
        if hasattr(layer, 'W'):
            layer.W = data[f'layer_{i}_W']
            layer.b = data[f'layer_{i}_b']

save_model(net, 'mnist_model.npz')
Success Criteria: 97%+ test accuracy; weight visualizations show interpretable patterns.
Questions to Guide Your Design
Before implementing each phase, think through these:
Data Representation
- Why do we flatten 28x28 images instead of keeping them 2D?
- What would happen if we didn't normalize pixel values?
- Why is shuffling between epochs important?
Network Architecture
- Why start with 128 and 64 hidden units? What happens with fewer? More?
- What's the minimum network that can achieve 90%? 95%?
- Why do we use ReLU instead of sigmoid in hidden layers?
Softmax and Loss
- What happens if we use sigmoid instead of softmax for the output?
- Why does softmax use exp() specifically?
- How does cross-entropy "punish" confident wrong answers?
Training Dynamics
- What learning rate is too high? Too low? How do you know?
- How many epochs until the network starts overfitting?
- What batch size works best? Why?
Debugging
- If accuracy is stuck at 10%, what's wrong?
- If loss is NaN, what's the likely cause?
- How do you verify gradients are correct?
Thinking Exercise
Before coding, work through this example by hand.
Setup: 2-class classification (cat vs dog), 3 input features, 1 hidden layer with 2 neurons
Input: x = [0.5, 0.8, 0.2]
Hidden layer: W1 = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]], b1 = [0.1, 0.1]
Output layer: W2 = [[0.3, 0.7], [0.4, 0.6]], b2 = [0.1, 0.1]
True label: y = [1, 0] (cat)
Trace the forward pass:
- Hidden layer pre-activation: z1 = x @ W1 + b1 = ?
- Hidden layer activation (ReLU): h1 = max(0, z1) = ?
- Output layer pre-activation: z2 = h1 @ W2 + b2 = ?
- Softmax: probs = softmax(z2) = ?
- Cross-entropy loss: L = -sum(y * log(probs)) = ?
Now trace backward:
- Gradient of loss w.r.t. z2: dL/dz2 = probs - y = ?
- Gradient w.r.t. W2: dL/dW2 = h1.T @ dL/dz2 = ?
Continue through the network...
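Once you have worked the numbers out by hand, a small sketch like this (values copied from the setup above) lets you verify every intermediate result:
import numpy as np

x = np.array([0.5, 0.8, 0.2])
W1 = np.array([[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]); b1 = np.array([0.1, 0.1])
W2 = np.array([[0.3, 0.7], [0.4, 0.6]]);             b2 = np.array([0.1, 0.1])
y = np.array([1.0, 0.0])

z1 = x @ W1 + b1                 # hidden pre-activation
h1 = np.maximum(0, z1)           # ReLU
z2 = h1 @ W2 + b2                # output pre-activation
probs = np.exp(z2 - z2.max()) / np.exp(z2 - z2.max()).sum()
L = -np.sum(y * np.log(probs))   # cross-entropy loss
dL_dz2 = probs - y               # gradient w.r.t. logits
dL_dW2 = np.outer(h1, dL_dz2)    # gradient w.r.t. W2

for name, value in [("z1", z1), ("h1", h1), ("z2", z2),
                    ("probs", probs), ("loss", L), ("dL/dz2", dL_dz2)]:
    print(name, value)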
Testing Strategy
Unit Tests
# tests/test_layers.py
def test_dense_forward_shape():
layer = Dense(10, 5)
x = np.random.randn(3, 10)
y = layer.forward(x)
assert y.shape == (3, 5)
def test_relu_forward():
relu = ReLU()
x = np.array([[-1, 0, 1], [2, -2, 0]])
y = relu.forward(x)
expected = np.array([[0, 0, 1], [2, 0, 0]])
assert np.allclose(y, expected)
def test_softmax_sums_to_one():
logits = np.random.randn(5, 10)
probs = softmax(logits)
assert np.allclose(probs.sum(axis=1), 1.0)
def test_softmax_all_positive():
logits = np.random.randn(5, 10) * 100 # Large values
probs = softmax(logits)
assert (probs > 0).all()
def test_one_hot_encoding():
labels = np.array([0, 5, 9])
encoded = one_hot(labels, 10)
assert encoded.shape == (3, 10)
assert encoded[0, 0] == 1
assert encoded[1, 5] == 1
assert encoded[2, 9] == 1
assert encoded.sum() == 3
Gradient Checking
def test_gradient_correctness():
net = Network([5, 3, 2])
x = np.random.randn(2, 5)
y = one_hot(np.array([0, 1]), 2)
assert gradient_check(net, x, y)
Integration Tests
def test_can_overfit_small_batch():
"""Network should perfectly fit a tiny dataset."""
net = Network([784, 50, 10])
X_tiny = X_train[:10]
y_tiny = y_train[:10]
for _ in range(500):
train_one_epoch(net, X_tiny, y_tiny, batch_size=10, lr=0.5)
acc = evaluate(net, X_tiny, y_tiny)
assert acc > 0.99, f"Should overfit tiny batch, got {acc}"
Common Pitfalls and Debugging Tips
Pitfall 1: Softmax Numerical Overflow
Symptom: RuntimeWarning: overflow encountered in exp, outputs are NaN
Cause: Exponentiating large numbers (e.g., e^1000 = inf)
Fix: Subtract the max before exponentiating
# BAD
exp_z = np.exp(z)
# GOOD
z_shifted = z - np.max(z, axis=1, keepdims=True)
exp_z = np.exp(z_shifted)
Pitfall 2: Log of Zero
Symptom: -inf in loss, NaN in gradients
Cause: Taking log(0) when softmax outputs exactly 0
Fix: Clip probabilities
probs_clipped = np.clip(probs, 1e-10, 1 - 1e-10)
log_probs = np.log(probs_clipped)
Pitfall 3: Shape Mismatch in Matrix Multiplication
Symptom: ValueError: shapes (64,128) and (784,128) not aligned
Cause: Wrong dimension order in weight matrix or forgot to transpose
Debug: Print shapes at every step
print(f"x.shape: {x.shape}")
print(f"W.shape: {W.shape}")
print(f"Expected: ({x.shape[0]}, {W.shape[1]})")
Pitfall 4: Forgetting to Cache Inputs
Symptom: Backward pass fails with "AttributeError: 'Dense' object has no attribute 'x'"
Cause: Forward pass didnât save the input for gradient computation
Fix: Always cache in forward:
def forward(self, x):
self.x = x # CACHE THIS
return x @ self.W + self.b
Pitfall 5: Gradient Not Averaged Over Batch
Symptom: Training is unstable, loss oscillates wildly
Cause: The batch averaging is applied zero times or twice (e.g., summed in the layer and never divided, or divided both in the loss gradient and again in the layer)
Fix: Average exactly once. In this project the 1/batch_size factor lives in the loss gradient:
grad = (probs - targets) / len(probs)  # average here once; layer.backward() then uses plain sums
Pitfall 6: Learning Rate Too High
Symptom: Loss increases or oscillates instead of decreasing
Cause: Updates overshoot the minimum
Fix: Start with lr=0.001, increase gradually to 0.1
Pitfall 7: Weights All Zero or Same
Symptom: All neurons produce identical outputs
Cause: Zero initialization or same random seed
Fix: Use proper He initialization
W = np.random.randn(in_dim, out_dim) * np.sqrt(2.0 / in_dim)
Pitfall 8: Accuracy Stuck at 10%
Symptom: Network never learns better than random guessing
Cause: Usually a gradient bug, or incorrect loss computation
Debug:
- Run gradient check
- Verify loss decreases on tiny dataset
- Check that predictions change after update
Interview Questions
Prepare to answer these after completing the project:
Fundamental Understanding
- "Explain the softmax function. Why do we use it instead of sigmoid for multi-class classification?"
  Key points: Softmax normalizes to probabilities summing to 1, creating competition between classes. Sigmoid treats each class independently, which doesn't make sense for mutually exclusive classes.
- "Derive the gradient of cross-entropy loss with respect to the logits when using softmax."
  Key insight: The gradient simplifies beautifully to (prediction - target). Show the derivation step by step.
- "What is the log-sum-exp trick and why is it important?"
  Key points: Prevents numerical overflow when computing softmax. Subtracting the max before exponentiating keeps numbers in a reasonable range without changing the result.
Implementation Details
- "How does mini-batch gradient descent differ from stochastic and batch gradient descent? What are the tradeoffs?"
  Key points: Batch size affects gradient noise (regularization), memory usage, and convergence speed. Mini-batch is the practical middle ground.
- "Why do we need to shuffle the training data between epochs?"
  Key points: Prevents the network from learning spurious patterns in the order of examples. Reduces variance in gradient estimates.
- "Explain Xavier/Glorot and He initialization. When would you use each?"
  Key points: Xavier for sigmoid/tanh, He for ReLU. Both aim to keep variance stable through layers, preventing vanishing/exploding gradients.
Debugging and Analysis
- "Your MNIST network is stuck at 10% accuracy. Walk me through how you would debug this."
  Key points: Check gradient correctness, verify loss decreases on a tiny batch, check shapes, verify data preprocessing, start with a simpler network.
- "Look at these weight visualizations. What do they tell you about what the network learned?"
  Key points: The first layer learns edge detectors and templates for digit parts. Each neuron specializes in detecting specific patterns.
- "The training accuracy is 99.5% but test accuracy is 94%. What's happening and how would you fix it?"
  Key points: Overfitting. Solutions include regularization (L2, dropout), early stopping, data augmentation, simpler model.
Advanced Concepts
- "How would you modify this network to handle Fashion-MNIST without changing the architecture?"
  Key points: No architecture change needed - just different data. Might need different hyperparameters. This demonstrates the generality of the approach.
Hints in Layers
Hint 1: Getting Started - Start with the data. Load MNIST, visualize examples, verify shapes. Many bugs come from misunderstanding the data format. MNIST images are grayscale with white digits on black background (0 = black, 255 = white).
Hint 2: Building Layers - Implement the forward pass first for all layers. Test that you can pass data through the whole network. Shapes should be: (batch, 784) -> (batch, 128) -> (batch, 64) -> (batch, 10).
Hint 3: Softmax Stability - Always subtract the max before exponentiating. The most common bug is numerical overflow/underflow in softmax. Use this pattern:
z_max = np.max(z, axis=1, keepdims=True)
exp_z = np.exp(z - z_max)
softmax = exp_z / exp_z.sum(axis=1, keepdims=True)
Hint 4: Gradient of Softmax + Cross-Entropy - Don't compute them separately! The combined gradient is simply (softmax_output - one_hot_target). This is one of the most elegant results in deep learning.
Hint 5: Gradient Checking - Before training, verify your gradients numerically. Use epsilon = 1e-5, compute (f(x+eps) - f(x-eps)) / (2*eps), and compare to your analytical gradient. Relative error should be < 1e-5.
Hint 6: First Training Run - Start with a simpler network (e.g., [784, 10] with no hidden layers). This is just logistic regression. It should reach ~92% accuracy. If not, you have a bug in your core training loop.
Hint 7: Weight Visualization - The first Dense layer has shape (784, 128). Each column is one neuron's weights, which you can reshape to 28x28 and visualize. These show what pattern each neuron is "looking for."
Extensions and Challenges
Extension 1: Add Dropout
Dropout randomly sets neurons to zero during training, providing regularization:
class Dropout:
def __init__(self, p=0.5):
self.p = p # Probability of keeping a neuron
def forward(self, x, training=True):
if not training:
return x
self.mask = np.random.binomial(1, self.p, x.shape) / self.p
return x * self.mask
def backward(self, grad):
return grad * self.mask
Challenge: Add dropout after each ReLU. Compare test accuracy with and without dropout.
Extension 2: Visualize Learned Features
Beyond raw weights, visualize what inputs maximally activate each neuron:
def visualize_activation_maximization(net, layer_idx, neuron_idx, num_steps=100):
"""Find input that maximally activates a specific neuron."""
x = np.random.randn(1, 784) * 0.01
for step in range(num_steps):
# Forward to target layer
h = x
for i, layer in enumerate(net.layers[:layer_idx+1]):
h = layer.forward(h)
# Gradient w.r.t. target neuron
grad = np.zeros_like(h)
grad[0, neuron_idx] = 1
# Backward to input
for layer in reversed(net.layers[:layer_idx+1]):
grad = layer.backward(grad)
# Gradient ascent
x += 0.1 * grad
x = np.clip(x, -1, 1)
return x.reshape(28, 28)
Extension 3: Try Fashion-MNIST
Fashion-MNIST is a drop-in replacement for MNIST with clothing items instead of digits:
- Same format: 60,000 training, 10,000 test, 28x28 grayscale
- 10 classes: T-shirt, Trouser, Pullover, Dress, Coat, Sandal, Shirt, Sneaker, Bag, Ankle boot
- More challenging than digits (~89% accuracy vs ~97%)
Challenge: Achieve >88% on Fashion-MNIST with your implementation.
Extension 4: Learning Rate Scheduling
Implement learning rate decay:
def step_decay(initial_lr, epoch, drop=0.5, epochs_drop=5):
"""Reduce LR by factor every N epochs."""
return initial_lr * (drop ** (epoch // epochs_drop))
def exponential_decay(initial_lr, epoch, decay_rate=0.95):
"""Exponential decay."""
return initial_lr * (decay_rate ** epoch)
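One possible way to use these schedulers with the Phase 9 loop (assuming step_decay and exponential_decay as defined above): compute the rate at the top of each epoch and pass it to sgd_update.
# assumes step_decay / exponential_decay from above
for epoch in range(10):
    lr = step_decay(0.1, epoch)  # 0.1 for epochs 0-4, then 0.05, then 0.025, ...
    print(epoch, lr, round(exponential_decay(0.1, epoch), 4))
# in train(), pass this lr to sgd_update(net, lr) instead of a fixed learning rate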
Extension 5: Momentum
Add momentum to SGD for faster convergence:
class MomentumSGD:
def __init__(self, net, lr=0.01, momentum=0.9):
self.lr = lr
self.momentum = momentum
self.velocity = {}
for i, layer in enumerate(net.layers):
if hasattr(layer, 'W'):
self.velocity[f'{i}_W'] = np.zeros_like(layer.W)
self.velocity[f'{i}_b'] = np.zeros_like(layer.b)
def update(self, net):
for i, layer in enumerate(net.layers):
if hasattr(layer, 'W'):
self.velocity[f'{i}_W'] = (
self.momentum * self.velocity[f'{i}_W'] -
self.lr * layer.dW
)
layer.W += self.velocity[f'{i}_W']
self.velocity[f'{i}_b'] = (
self.momentum * self.velocity[f'{i}_b'] -
self.lr * layer.db
)
layer.b += self.velocity[f'{i}_b']
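A sketch of how this could slot into the Phase 9 training loop (assuming the Network and training code above), replacing the plain sgd_update call:
opt = MomentumSGD(net, lr=0.01, momentum=0.9)
# ... inside the mini-batch loop, after net.backward(grad):
opt.update(net)  # instead of sgd_update(net, learning_rate)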
Real-World Connections
Optical Character Recognition (OCR)
MNIST is the ancestor of all OCR systems. Modern OCR uses:
- CNNs instead of dense layers (Project 9)
- Larger, higher-resolution images
- Thousands of characters (Chinese, Japanese, etc.)
- Sequence-to-sequence models for words
Industry application: Banks process millions of handwritten checks daily using MNIST-descended technology.
Check Reading and Document Processing
Every time you deposit a check via mobile app:
- Image is captured and preprocessed
- Handwritten amounts are segmented
- Each digit is classified (like MNIST)
- Business logic validates the result
Scale: Chase Bank processes 40+ million checks per day.
Medical Imaging
The same principles apply to:
- Detecting tumors in X-rays
- Classifying cell types in microscopy
- Reading handwritten prescriptions
Impact: AI-assisted diagnosis can achieve specialist-level accuracy.
Postal Systems
USPS and postal services worldwide use handwritten digit recognition:
- ZIP code reading (98% accuracy required)
- Address parsing
- Package routing
Fun fact: The postal service was one of the first commercial applications of neural networks (1990s).
Books That Will Help
| Topic | Book | Specific Chapters |
|---|---|---|
| MNIST walkthrough | "Neural Networks and Deep Learning" by Michael Nielsen | Ch. 1 (complete MNIST example) |
| Backpropagation intuition | "Grokking Deep Learning" by Andrew Trask | Ch. 4-5 (hands-on gradient descent) |
| Mathematical foundations | "Deep Learning" by Goodfellow et al. | Ch. 6 (deep feedforward networks) |
| Practical implementation | "Deep Learning with Python" by Chollet | Ch. 2 (first neural network) |
| Numerical stability | "Numerical Recipes" by Press et al. | Ch. 5 (function evaluation) |
| Weight initialization | "Deep Learning" by Goodfellow et al. | Ch. 8.4 (parameter initialization) |
Free Online Resources
- Michael Nielsen's book: http://neuralnetworksanddeeplearning.com/
- Arguably the best free resource for understanding neural networks
- Includes complete MNIST implementation with explanations
- 3Blue1Brown Neural Networks series: YouTube
- Excellent visualizations of gradient descent and backpropagation
- Watch before implementing
- Andrej Karpathy's "micrograd": GitHub
- Minimal autograd engine in ~100 lines
- Good reference for understanding backward pass
Self-Assessment Checklist
Before moving to Project 9 (CNN), verify you can:
Conceptual Understanding
- Explain why softmax creates competition between classes
- Derive the gradient of softmax + cross-entropy from first principles
- Explain the log-sum-exp trick and why it's necessary
- Describe what one-hot encoding is and why we use it
- Explain mini-batch gradient descent vs. full batch vs. stochastic
Implementation Skills
- Implement softmax that doesnât overflow for large inputs
- Write a Dense layer with correct forward and backward passes
- Implement ReLU activation with correct gradient
- Create a data loader that shuffles and batches correctly
- Save and load model weights
Debugging Abilities
- Use gradient checking to verify your implementation
- Debug a network stuck at 10% accuracy
- Identify overfitting from training curves
- Tune learning rate for stable training
Practical Results
- Achieve >96% test accuracy on MNIST
- Visualize learned weights as 28x28 images
- Generate a confusion matrix and interpret it
- Predict digits from your own handwritten images
Questions to Answer Without Looking
- What's the shape of the gradient for a Dense layer with input (64, 784) and output (64, 128)?
- Why do we use cross-entropy instead of MSE for classification?
- What happens if you initialize all weights to zero?
- How many trainable parameters in a network [784 -> 128 -> 64 -> 10]?
- What does a horizontal stripe pattern in a weight visualization mean?
Next: P09: CNN From Scratch - Replace dense layers with convolutional layers and achieve >99% accuracy