Project 8: MNIST From First Principles

The “Hello World” of AI, done the hard way - Build a neural network to recognize handwritten digits using nothing but NumPy, mastering softmax, one-hot encoding, and mini-batch gradient descent


Project Overview

Attribute        Value
--------------   ---------------------------------------------------------------------------
Difficulty       Level 4: Expert
Time Estimate    2 Weeks
Language         Python (NumPy Only)
Prerequisites    Projects 3 (Linear Regression), 4 (Spam Filter), 6 (Fraud Detection MLP)
Main Book        “Neural Networks and Deep Learning” by Michael Nielsen - Chapter 1
Knowledge Area   Computer Vision / Deep Learning

Learning Objectives

After completing this project, you will be able to:

  1. Implement Softmax from scratch - Derive and code the function that turns 10 raw scores into probabilities summing to 1, including numerical stability tricks

  2. Master multi-class cross-entropy - Understand why this loss function works for classification and derive its gradient analytically

  3. Build one-hot encoding utilities - Convert integer labels to vector representations that neural networks can learn from

  4. Achieve true vectorization mastery - Process 60,000 images without Python loops, understanding broadcasting and matrix operations at a deep level

  5. Implement mini-batch training - Balance memory efficiency and gradient stability through batch processing

  6. Design multi-layer dense networks - Build networks with arbitrary hidden layers, understanding the shapes that flow through each layer

  7. Debug neural networks systematically - Use gradient checking, visualization, and layer-by-layer analysis to find and fix problems

  8. Visualize learned representations - Extract and display what each neuron has learned to “look for” in input images


The Core Question You’re Answering

“How do we handle more than Yes/No answers?”

In Project 4 (Spam Filter), you built a binary classifier: Spam or Ham. But what if there are 10 possible answers? 100? 1000?

The key insight is that multi-class classification is just running multiple binary classifiers simultaneously and forcing them to compete. Softmax creates this competition - when one output neuron’s score goes up, all others must go down (because probabilities sum to 1).

Before coding, sit with this question:

  • Why can’t we just use 10 sigmoid neurons independently?
  • What happens if two output neurons both claim 99% confidence?
  • How do we train a network when only one answer is “correct”?

Concepts You Must Understand First

Stop and research these before coding. Each concept builds on the previous.

1. Multi-Class Classification Setup

Unlike binary classification where we have one output neuron, multi-class needs N output neurons for N classes. Each neuron represents the “evidence” for one class.

Binary (Project 4):                Multi-Class (This Project):

Input -> Hidden -> [1 neuron]      Input -> Hidden -> [10 neurons]
                     |                                  |
                  sigmoid                            softmax
                     |                                  |
                  p(spam)                     [p(0), p(1), ..., p(9)]

Key insight: The output layer has 10 neurons, one for each digit. But before softmax, these are just raw scores (logits) - they can be any real number, including negative!

Book Reference: “Neural Networks and Deep Learning” Ch. 1.6 (Implementing our network to classify digits)

2. The Softmax Function and Its Properties

Softmax converts a vector of arbitrary real numbers into a probability distribution:

\[\text{softmax}(z_i) = \frac{e^{z_i}}{\sum_{j=1}^{K} e^{z_j}}\]

Properties you must internalize:

  • All outputs are in (0, 1)
  • All outputs sum to exactly 1
  • The largest input becomes the largest probability
  • It’s differentiable everywhere
Example:
logits = [2.0, 1.0, 0.1]

e^2.0 = 7.389
e^1.0 = 2.718
e^0.1 = 1.105
sum   = 11.212

softmax = [7.389/11.212, 2.718/11.212, 1.105/11.212]
        = [0.659, 0.242, 0.099]  (sums to 1.0)

Critical: Unlike sigmoid (which squashes independently), softmax creates competition - increasing one logit decreases all other probabilities.

Book Reference: “Deep Learning” by Goodfellow et al., Ch. 6.2.2.3

3. Cross-Entropy Loss for Multiple Classes

For binary classification, we used: \(L = -[y \log(\hat{y}) + (1-y)\log(1-\hat{y})]\)

For multi-class with one-hot encoding: \(L = -\sum_{i=1}^{K} y_i \log(\hat{y}_i)\)

Since only one $y_i = 1$ (the correct class) and all others are 0, this simplifies to: \(L = -\log(\hat{y}_{\text{correct}})\)

Intuition: We want to maximize the probability of the correct class. Log makes this a minimization problem. If we predict 0.9 for the correct class, loss is -log(0.9) = 0.105 (low). If we predict 0.1, loss is -log(0.1) = 2.303 (high).
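
A two-line NumPy check (a sketch; the class index and probabilities are made up for illustration) that the full sum and the one-hot shortcut agree:

import numpy as np

y     = np.array([0, 0, 0, 1, 0, 0, 0, 0, 0, 0])    # one-hot target, true class = 3
y_hat = np.array([0.01, 0.02, 0.02, 0.90, 0.01,
                  0.01, 0.01, 0.01, 0.005, 0.005])   # model probabilities (sum to 1)

full_sum = -np.sum(y * np.log(y_hat))   # -sum_i y_i log(y_hat_i)
shortcut = -np.log(y_hat[3])            # -log(y_hat_correct)
print(full_sum, shortcut)               # both ~0.105, matching the intuition above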

Book Reference: “Grokking Deep Learning” Ch. 6

4. One-Hot Encoding

Neural networks work with numbers, not categorical labels. We convert:

Label: 3  ->  [0, 0, 0, 1, 0, 0, 0, 0, 0, 0]
Label: 7  ->  [0, 0, 0, 0, 0, 0, 0, 1, 0, 0]
Label: 0  ->  [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]

This encoding lets us:

  • Compute loss directly (dot product with log probabilities)
  • Avoid implying ordering (3 is not “between” 2 and 4 for classification)
  • Enable batch matrix operations
def one_hot(labels, num_classes):
    """Convert integer labels to one-hot vectors."""
    n = len(labels)
    one_hot_matrix = np.zeros((n, num_classes))
    one_hot_matrix[np.arange(n), labels] = 1
    return one_hot_matrix

Book Reference: “Hands-On Machine Learning” by Geron, Ch. 4

5. Mini-Batch Training

With 60,000 training images, we have three choices:

Method           Batch Size   Updates per Epoch   Gradient Quality
--------------   ----------   -----------------   --------------------
Stochastic GD    1            60,000              Very noisy
Full Batch GD    60,000       1                   Very smooth but slow
Mini-Batch GD    32-256       235-1,875           Good balance

Mini-batch advantages:

  • Uses vectorization (fast)
  • Gradients are stable enough to converge
  • Adds regularization through noise
  • Fits in memory
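
If you want to double-check the updates-per-epoch column in the table above, here is a quick sketch (assuming the 60,000 training images MNIST provides):

import math

n = 60_000
for batch_size in (1, 32, 64, 256, 60_000):
    print(f"batch size {batch_size:>6}: {math.ceil(n / batch_size)} updates per epoch")
# 1 -> 60000, 32 -> 1875, 64 -> 938, 256 -> 235, 60000 -> 1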

Book Reference: “Deep Learning” Ch. 8.1.3

6. Data Loading and Shuffling

The order of training examples matters! If you train on all 0s, then all 1s, then all 2s… the network will “forget” earlier digits (catastrophic forgetting).

Shuffling after each epoch prevents the network from learning spurious patterns in the order of examples.

def shuffle_data(X, y):
    """Shuffle training data while keeping X-y pairs aligned."""
    indices = np.random.permutation(len(y))
    return X[indices], y[indices]

Book Reference: “Neural Networks and Deep Learning” Ch. 3


Deep Theoretical Foundation

Softmax Derivation and Interpretation

Let’s derive softmax from first principles. We want to convert scores to probabilities.

Attempt 1: Just normalize \(p_i = \frac{z_i}{\sum_j z_j}\)

Problem: What if $z_i$ is negative? We’d get negative “probabilities.”

Attempt 2: Use absolute values \(p_i = \frac{|z_i|}{\sum_j |z_j|}\)

Problem: z = [-10, 0.1] would give [0.99, 0.01], but -10 should mean “definitely not this class.”

Attempt 3: Exponentiate first \(p_i = \frac{e^{z_i}}{\sum_j e^{z_j}}\)

This works because:

  • $e^x > 0$ for all x (no negative probabilities)
  • $e^{-10} \approx 0.00005$ (negative scores become tiny)
  • $e^{10} \approx 22026$ (positive scores become large)
  • The ratio structure naturally creates probabilities

Interpretation: Softmax is the “argmax” function made differentiable. Argmax returns [0,0,1,0] (one-hot for max). Softmax returns [0.02, 0.08, 0.85, 0.05] (soft version where all classes have some probability).
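
A quick numeric comparison (a sketch) of the three attempts on the troublesome example z = [-10, 0.1]:

import numpy as np

z = np.array([-10.0, 0.1])
print(z / z.sum())                   # Attempt 1: [1.01, -0.01] - a negative "probability"
print(np.abs(z) / np.abs(z).sum())   # Attempt 2: [0.99, 0.01]  - the "definitely not" score wins
print(np.exp(z) / np.exp(z).sum())   # Attempt 3: [~0.00004, ~0.99996] - sensible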

Numerical Stability: The Log-Sum-Exp Trick

Naive softmax will overflow or underflow for large/small logits:

# BAD: Will overflow
z = np.array([1000, 1001, 1002])
exp_z = np.exp(z)  # [inf, inf, inf] - overflow!

Solution: Subtract the maximum before exponentiating:

\[\text{softmax}(z_i) = \frac{e^{z_i - \max(z)}}{\sum_j e^{z_j - \max(z)}}\]

This is mathematically identical (the max cancels in the ratio) but numerically stable:

def stable_softmax(z):
    """Numerically stable softmax."""
    z_shifted = z - np.max(z, axis=-1, keepdims=True)
    exp_z = np.exp(z_shifted)
    return exp_z / np.sum(exp_z, axis=-1, keepdims=True)

Book Reference: “Deep Learning” Ch. 4.1 (Numerical Computation)

Softmax + Cross-Entropy Gradient Derivation

This is where the magic happens. Let’s derive the gradient of the loss with respect to the logits.

Setup:

  • Logits: $z = [z_1, z_2, …, z_K]$
  • Softmax output: $\hat{y}_i = \frac{e^{z_i}}{\sum_j e^{z_j}}$
  • One-hot target: $y = [0, 0, 1, 0, …, 0]$ (1 at the correct class)
  • Cross-entropy loss: $L = -\sum_i y_i \log(\hat{y}_i)$

Goal: Find $\frac{\partial L}{\partial z_k}$ for all k.

Step 1: Derivative of softmax output with respect to logit

For $i = k$ (same index): \(\frac{\partial \hat{y}_i}{\partial z_k} = \hat{y}_i(1 - \hat{y}_i)\)

For $i \neq k$ (different indices): \(\frac{\partial \hat{y}_i}{\partial z_k} = -\hat{y}_i \hat{y}_k\)

Compact form: $\frac{\partial \hat{y}_i}{\partial z_k} = \hat{y}_i(\delta_{ik} - \hat{y}_k)$

where $\delta_{ik}$ is the Kronecker delta (1 if i=k, 0 otherwise).

Step 2: Apply chain rule

\[\frac{\partial L}{\partial z_k} = \sum_i \frac{\partial L}{\partial \hat{y}_i} \cdot \frac{\partial \hat{y}_i}{\partial z_k}\]

Since $\frac{\partial L}{\partial \hat{y}_i} = -\frac{y_i}{\hat{y}_i}$:

\[\frac{\partial L}{\partial z_k} = -\sum_i \frac{y_i}{\hat{y}_i} \cdot \hat{y}_i(\delta_{ik} - \hat{y}_k)\] \[= -\sum_i y_i(\delta_{ik} - \hat{y}_k)\] \[= -y_k + \hat{y}_k \sum_i y_i\]

Since $\sum_i y_i = 1$ (one-hot sums to 1):

\[\frac{\partial L}{\partial z_k} = \hat{y}_k - y_k\]

The beautiful result: The gradient is simply prediction minus target.

# Gradient of cross-entropy loss w.r.t. logits
dL_dz = softmax_output - one_hot_target
# That's it! No complex derivatives needed in code.

This is why softmax + cross-entropy is so popular: The gradient is trivially simple.
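
You can confirm the result numerically on a single example before trusting it in the full network. A small standalone sketch (softmax written inline, one three-class example):

import numpy as np

def softmax_1d(z):
    e = np.exp(z - z.max())
    return e / e.sum()

z = np.array([0.5, -1.2, 2.0])
y = np.array([0.0, 0.0, 1.0])         # one-hot target, correct class = 2
analytic = softmax_1d(z) - y          # the derived gradient: prediction - target

# Central-difference estimate of dL/dz_k for L = -log(softmax(z)[2])
eps = 1e-6
numeric = np.zeros_like(z)
for k in range(len(z)):
    zp, zm = z.copy(), z.copy()
    zp[k] += eps
    zm[k] -= eps
    numeric[k] = (-np.log(softmax_1d(zp)[2]) + np.log(softmax_1d(zm)[2])) / (2 * eps)

print(np.max(np.abs(analytic - numeric)))  # tiny (~1e-10): the formulas agree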

Why Softmax Creates Competition

Consider a network predicting between cats and dogs:

Independent sigmoids:

  • P(cat) = 0.8, P(dog) = 0.9
  • Both are high! The network is confused but doesn’t “know” it

Softmax:

  • If logit_cat = 2 and logit_dog = 2.5
  • P(cat) = 0.38, P(dog) = 0.62
  • The outputs are forced to sum to 1

When we increase logit_cat:

  • P(cat) goes UP
  • P(dog) goes DOWN (even though we didn’t touch logit_dog!)

This automatic competition is what makes softmax work for mutually exclusive classes.
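
The competition is easy to see numerically. A small sketch using the cat/dog logits above:

import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

print(softmax(np.array([2.0, 2.5])))   # [P(cat), P(dog)] ~ [0.38, 0.62]
print(softmax(np.array([3.0, 2.5])))   # raise only logit_cat -> ~[0.62, 0.38]
# P(dog) dropped even though logit_dog never changed.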

Weight Initialization for Deep Networks

Random initialization matters more than you think. Bad initialization can cause:

  • Vanishing gradients: Outputs saturate, gradients approach 0
  • Exploding gradients: Values grow exponentially through layers
  • Symmetry breaking failure: All neurons learn the same thing

Xavier/Glorot Initialization (for tanh/sigmoid): \(W \sim \mathcal{N}\left(0, \sigma^2\right)\) with \(\sigma = \sqrt{\frac{2}{n_{\text{in}} + n_{\text{out}}}}\)

He Initialization (for ReLU): \(W \sim \mathcal{N}\left(0, \sigma^2\right)\) with \(\sigma = \sqrt{\frac{2}{n_{\text{in}}}}\)

def he_init(shape):
    """He initialization for ReLU networks."""
    fan_in = shape[0]
    std = np.sqrt(2.0 / fan_in)
    return np.random.randn(*shape) * std

def xavier_init(shape):
    """Xavier initialization for tanh/sigmoid networks."""
    fan_in, fan_out = shape[0], shape[1]
    std = np.sqrt(2.0 / (fan_in + fan_out))
    return np.random.randn(*shape) * std
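
A quick empirical sanity check (a sketch, not part of the training code) that He initialization keeps the scale of ReLU activations roughly constant as depth grows:

import numpy as np

rng = np.random.default_rng(0)
h = rng.standard_normal((1000, 784))   # fake batch of standardized inputs

for layer in range(5):
    W = rng.standard_normal((h.shape[1], 256)) * np.sqrt(2.0 / h.shape[1])  # He scale
    h = np.maximum(0.0, h @ W)                                              # ReLU
    rms = np.sqrt((h ** 2).mean())
    print(f"layer {layer + 1}: RMS activation = {rms:.3f}")  # stays near 1.0

Swap the He scale for something tiny (e.g. 0.01) and the RMS collapses toward 0 within a few layers - exactly the vanishing-signal problem these schemes are designed to avoid.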

Book Reference: “Deep Learning” Ch. 8.4


Real World Outcome

When you complete this project, here is exactly what you should see:

Training the Network

$ python mnist_train.py
Loading MNIST dataset...
Training set: 60000 images (28x28 pixels)
Test set: 10000 images (28x28 pixels)

Preprocessing:
  - Flattening 28x28 -> 784
  - Normalizing to [0, 1]
  - One-hot encoding 10 classes

Network Architecture: [784 -> 128 -> 64 -> 10]
  Layer 1: 784 x 128 + 128 = 100,480 parameters
  Layer 2: 128 x 64 + 64 = 8,256 parameters
  Layer 3: 64 x 10 + 10 = 650 parameters
  Total: 109,386 trainable parameters (weights + biases)

Hyperparameters:
  Learning rate: 0.1
  Batch size: 64
  Epochs: 10

Training...
Epoch  1/10 | Train Loss: 0.3821 | Train Acc: 89.2% | Test Acc: 90.1% | Time: 12.3s
Epoch  2/10 | Train Loss: 0.1892 | Train Acc: 94.5% | Test Acc: 94.2% | Time: 11.8s
Epoch  3/10 | Train Loss: 0.1341 | Train Acc: 96.0% | Test Acc: 95.3% | Time: 11.9s
Epoch  4/10 | Train Loss: 0.1023 | Train Acc: 97.0% | Test Acc: 96.1% | Time: 12.1s
Epoch  5/10 | Train Loss: 0.0812 | Train Acc: 97.6% | Test Acc: 96.5% | Time: 12.0s
Epoch  6/10 | Train Loss: 0.0654 | Train Acc: 98.0% | Test Acc: 96.8% | Time: 11.7s
Epoch  7/10 | Train Loss: 0.0531 | Train Acc: 98.4% | Test Acc: 97.0% | Time: 11.9s
Epoch  8/10 | Train Loss: 0.0432 | Train Acc: 98.7% | Test Acc: 97.1% | Time: 12.2s
Epoch  9/10 | Train Loss: 0.0351 | Train Acc: 98.9% | Test Acc: 97.2% | Time: 11.8s
Epoch 10/10 | Train Loss: 0.0285 | Train Acc: 99.1% | Test Acc: 97.2% | Time: 12.0s

Training complete!
Final Test Accuracy: 97.2% (9720/10000 correct)
Model saved to: mnist_model.npz

Confusion Matrix:
        0    1    2    3    4    5    6    7    8    9
    +----------------------------------------------------
  0 |  971    0    2    1    0    2    2    1    1    0
  1 |    0 1121    3    2    0    1    3    1    4    0
  2 |    4    2 1002    5    3    0    2    7    6    1
  3 |    0    0    4  983    0   10    0    5    5    3
  4 |    1    0    2    0  958    0    5    2    2   12
  5 |    2    0    0    7    1  869    6    1    4    2
  6 |    5    3    1    0    4    4  939    0    2    0
  7 |    1    4    8    4    1    0    0  997    2   11
  8 |    3    0    3    6    4    5    3    4  940    6
  9 |    3    4    0    4   10    4    1    6    3  974

Predicting Individual Digits

$ python predict_digit.py --image my_drawing.png
Loading model from mnist_model.npz...
Loading image: my_drawing.png (28x28)

Preprocessing:
  - Grayscale conversion
  - Resize to 28x28
  - Invert if necessary (MNIST has white digits on black)
  - Normalize to [0, 1]

Forward pass through network...

Prediction: 7 (Confidence: 99.2%)

Full probability distribution:
  0: 0.001%
  1: 0.003%
  2: 0.012%
  3: 0.002%
  4: 0.008%
  5: 0.001%
  6: 0.002%
  7: 99.215%  <-- PREDICTED
  8: 0.721%
  9: 0.035%

Visualizing Learned Weights

$ python visualize_weights.py
Loading model from mnist_model.npz...

Generating visualization of first layer weights...
Each image shows what input pattern activates one hidden neuron.

Saved: weights_layer1.png (16x8 grid of 784->128 learned patterns)
Saved: weight_histogram.png (distribution of weight values)

Observations:
  - Some neurons detect horizontal strokes
  - Some neurons detect vertical strokes
  - Some neurons detect curves/corners
  - Weights form interpretable patterns resembling digit fragments

Example Weight Visualization

When you visualize the 128 neurons of the first layer (each is a 28x28 pattern):

+---+---+---+---+---+---+---+---+
|   |   | / | \ |   | - | | | O |   <- Different patterns
+---+---+---+---+---+---+---+---+
| C | ) | ( | _ |   |   |   |   |
+---+---+---+---+---+---+---+---+
...

You'll see:
- Edge detectors (horizontal, vertical, diagonal lines)
- Curve detectors (arcs like in 0, 6, 8, 9)
- Corner detectors (L-shapes like in 7, 4)
- Position-specific patterns (top-left vs bottom-right)

Solution Architecture

High-Level System Design

+------------------------------------------------------------------+
|                      MNIST Training Pipeline                      |
+------------------------------------------------------------------+
|                                                                   |
|   +-----------+     +------------+     +-----------+              |
|   |  Data     |     | Network    |     | Training  |              |
|   |  Module   |     | Module     |     | Module    |              |
|   +-----------+     +------------+     +-----------+              |
|        |                  |                  |                    |
|   - Load MNIST       - Layer class      - Mini-batch SGD         |
|   - Preprocess       - Forward pass     - Learning rate          |
|   - Batch            - Backward pass    - Epoch loop             |
|   - Shuffle          - Weight init      - Metrics                |
|                                                                   |
+------------------------------------------------------------------+

Data Flow Through the Network

Input Image (28x28 = 784 pixels)
         |
         v
   +---------------+
   |   Flatten     |
   |   (28x28->784)|
   +---------------+
         |
         v
   +---------------+
   | Dense Layer 1 |
   | 784 -> 128    |
   | + ReLU        |
   +---------------+
         |
         v
   +---------------+
   | Dense Layer 2 |
   | 128 -> 64     |
   | + ReLU        |
   +---------------+
         |
         v
   +---------------+
   | Output Layer  |
   | 64 -> 10      |
   | + Softmax     |
   +---------------+
         |
         v
   Probabilities [p0, p1, ..., p9]

Matrix Dimensions at Each Step

Understanding shapes is crucial for debugging. For a batch of 64 images:

Layer                    Input Shape    Weights Shape    Output Shape
----------------------   -----------    -------------    ------------
Input (batch of images)  (64, 784)      -                -
Dense Layer 1            (64, 784)      (784, 128)       (64, 128)
ReLU Activation          (64, 128)      -                (64, 128)
Dense Layer 2            (64, 128)      (128, 64)        (64, 64)
ReLU Activation          (64, 64)       -                (64, 64)
Output Layer             (64, 64)       (64, 10)         (64, 10)
Softmax                  (64, 10)       -                (64, 10)
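
The same table, verified with throwaway arrays (a sketch; the weights are random and untrained, only the shapes matter):

import numpy as np

batch = np.random.randn(64, 784)
W1, b1 = np.random.randn(784, 128), np.zeros(128)
W2, b2 = np.random.randn(128, 64),  np.zeros(64)
W3, b3 = np.random.randn(64, 10),   np.zeros(10)

h1 = np.maximum(0, batch @ W1 + b1)   # (64, 128)
h2 = np.maximum(0, h1 @ W2 + b2)      # (64, 64)
logits = h2 @ W3 + b3                 # (64, 10)
print(h1.shape, h2.shape, logits.shape)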

Class Structure

# Recommended module structure

class Layer:
    """Abstract base class for network layers."""
    def forward(self, x): pass
    def backward(self, grad): pass

class Dense(Layer):
    """Fully connected layer."""
    def __init__(self, input_dim, output_dim):
        self.W = he_init((input_dim, output_dim))
        self.b = np.zeros(output_dim)

    def forward(self, x):
        # Cache input for backward pass
        self.x = x
        return x @ self.W + self.b

    def backward(self, grad):
        # Gradient w.r.t. weights
        self.dW = self.x.T @ grad
        self.db = np.sum(grad, axis=0)
        # Gradient to pass to previous layer
        return grad @ self.W.T

class ReLU(Layer):
    """ReLU activation function."""
    def forward(self, x):
        self.x = x
        return np.maximum(0, x)

    def backward(self, grad):
        return grad * (self.x > 0)

class Softmax:
    """Softmax activation (output layer)."""
    def forward(self, x):
        exp_x = np.exp(x - np.max(x, axis=1, keepdims=True))
        return exp_x / np.sum(exp_x, axis=1, keepdims=True)

class Network:
    """Sequential network of layers."""
    def __init__(self, layers):
        self.layers = layers

    def forward(self, x):
        for layer in self.layers:
            x = layer.forward(x)
        return x

    def backward(self, grad):
        for layer in reversed(self.layers):
            grad = layer.backward(grad)
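
A minimal usage sketch of the classes above (assumes he_init from the initialization section is in scope; the weights are untrained, so the outputs are meaningless beyond their shapes):

import numpy as np

net = Network([
    Dense(784, 128), ReLU(),
    Dense(128, 64),  ReLU(),
    Dense(64, 10),
])

x = np.random.randn(32, 784).astype(np.float32)
logits = net.forward(x)               # (32, 10) raw scores
probs = Softmax().forward(logits)     # rows sum to 1
print(logits.shape, probs.sum(axis=1)[:3])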

File Organization

mnist-from-scratch/
+-- data/
|   +-- download_mnist.py      # Script to download MNIST
|   +-- train-images-idx3-ubyte
|   +-- train-labels-idx1-ubyte
|   +-- t10k-images-idx3-ubyte
|   +-- t10k-labels-idx1-ubyte
|
+-- src/
|   +-- layers.py              # Dense, ReLU, Softmax
|   +-- network.py             # Network class
|   +-- loss.py                # CrossEntropyLoss
|   +-- data_loader.py         # MNIST loading, batching
|   +-- utils.py               # One-hot, shuffling, metrics
|
+-- train.py                   # Main training script
+-- predict.py                 # Inference on new images
+-- visualize.py               # Weight visualization
+-- tests/
    +-- test_layers.py         # Unit tests for layers
    +-- test_gradients.py      # Gradient checking

Phased Implementation Guide

Phase 1: Load and Explore MNIST Data (2-3 hours)

Goal: Download MNIST and understand its structure.

Tasks:

  1. Download the dataset

    MNIST is available at http://yann.lecun.com/exdb/mnist/ or through various libraries.

    # Option 1: Manual download
    import urllib.request
    import gzip
    
    urls = {
        'train_images': 'http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz',
        'train_labels': 'http://yann.lecun.com/exdb/mnist/train-labels-idx1-ubyte.gz',
        'test_images': 'http://yann.lecun.com/exdb/mnist/t10k-images-idx3-ubyte.gz',
        'test_labels': 'http://yann.lecun.com/exdb/mnist/t10k-labels-idx1-ubyte.gz',
    }
    
    for name, url in urls.items():
        print(f"Downloading {name}...")
        urllib.request.urlretrieve(url, f'data/{name}.gz')
    
  2. Parse the IDX file format

    MNIST uses a custom binary format. Write a parser:

    def load_mnist_images(filepath):
        """Load MNIST images from IDX file."""
        with gzip.open(filepath, 'rb') as f:
            # Magic number (2051 for images)
            magic = int.from_bytes(f.read(4), 'big')
            assert magic == 2051
    
            # Dimensions
            num_images = int.from_bytes(f.read(4), 'big')
            rows = int.from_bytes(f.read(4), 'big')
            cols = int.from_bytes(f.read(4), 'big')
    
            # Read image data
            data = np.frombuffer(f.read(), dtype=np.uint8)
            return data.reshape(num_images, rows, cols)
    
    def load_mnist_labels(filepath):
        """Load MNIST labels from IDX file."""
        with gzip.open(filepath, 'rb') as f:
            magic = int.from_bytes(f.read(4), 'big')
            assert magic == 2049
    
            num_labels = int.from_bytes(f.read(4), 'big')
            return np.frombuffer(f.read(), dtype=np.uint8)
    
  3. Explore the data

    train_images = load_mnist_images('data/train-images-idx3-ubyte.gz')
    train_labels = load_mnist_labels('data/train-labels-idx1-ubyte.gz')
    
    print(f"Shape: {train_images.shape}")  # (60000, 28, 28)
    print(f"Pixel range: {train_images.min()} to {train_images.max()}")  # 0 to 255
    print(f"Label distribution: {np.bincount(train_labels)}")
    
    # Visualize a few examples
    import matplotlib.pyplot as plt
    fig, axes = plt.subplots(2, 5, figsize=(10, 4))
    for i, ax in enumerate(axes.flat):
        ax.imshow(train_images[i], cmap='gray')
        ax.set_title(f"Label: {train_labels[i]}")
        ax.axis('off')
    plt.savefig('sample_digits.png')
    

Success Criteria: You can load all 60,000 training images and visualize samples.


Phase 2: Preprocess Data (1-2 hours)

Goal: Transform raw images into network-ready format.

Tasks:

  1. Flatten images

    28x28 images become 784-element vectors:

    def flatten_images(images):
        """Flatten 28x28 images to 784-length vectors."""
        return images.reshape(images.shape[0], -1)
    
    X_train = flatten_images(train_images)
    print(f"Flattened shape: {X_train.shape}")  # (60000, 784)
    
  2. Normalize pixel values

    Neural networks work better with inputs in [0, 1] or standardized:

    def normalize(X):
        """Normalize pixel values to [0, 1]."""
        return X.astype(np.float32) / 255.0
    
    X_train = normalize(X_train)
    print(f"Normalized range: {X_train.min():.3f} to {X_train.max():.3f}")
    
  3. Create validation split

    Reserve some training data for validation:

    def train_val_split(X, y, val_ratio=0.1):
        """Split data into training and validation sets."""
        n = len(y)
        indices = np.random.permutation(n)
        val_size = int(n * val_ratio)
    
        val_idx = indices[:val_size]
        train_idx = indices[val_size:]
    
        return (X[train_idx], y[train_idx],
                X[val_idx], y[val_idx])
    
    X_train, y_train, X_val, y_val = train_val_split(X_train, train_labels)
    

Success Criteria: Data is flattened, normalized, and split into train/val/test sets.


Phase 3: One-Hot Encode Labels (30 minutes)

Goal: Convert integer labels to one-hot vectors.

Tasks:

  1. Implement one-hot encoding

    def one_hot(labels, num_classes=10):
        """Convert integer labels to one-hot vectors."""
        n = len(labels)
        one_hot_matrix = np.zeros((n, num_classes), dtype=np.float32)
        one_hot_matrix[np.arange(n), labels] = 1.0
        return one_hot_matrix
    
  2. Verify correctness

    # Test
    labels = np.array([0, 1, 2, 3])
    encoded = one_hot(labels, 10)
    print(encoded)
    # [[1, 0, 0, 0, 0, 0, 0, 0, 0, 0],
    #  [0, 1, 0, 0, 0, 0, 0, 0, 0, 0],
    #  [0, 0, 1, 0, 0, 0, 0, 0, 0, 0],
    #  [0, 0, 0, 1, 0, 0, 0, 0, 0, 0]]
    
    assert np.all(encoded.sum(axis=1) == 1)  # Each row sums to exactly 1
    assert encoded.sum() == 4              # Total of 4 ones
    

Success Criteria: one_hot([3, 7], 10) returns correct 2x10 matrix.


Phase 4: Network Architecture Setup (2-3 hours)

Goal: Implement the Dense layer and ReLU activation classes.

Tasks:

  1. Implement Dense layer (forward only first)

    class Dense:
        def __init__(self, input_dim, output_dim):
            # He initialization
            self.W = np.random.randn(input_dim, output_dim) * np.sqrt(2.0 / input_dim)
            self.b = np.zeros(output_dim)
    
        def forward(self, x):
            """Forward pass: y = xW + b"""
            self.x = x  # Cache for backward
            return x @ self.W + self.b
    
  2. Implement ReLU activation

    class ReLU:
        def forward(self, x):
            """ReLU: max(0, x)"""
            self.x = x  # Cache for backward
            return np.maximum(0, x)
    
  3. Test the forward pass

    # Create a simple 2-layer network
    layer1 = Dense(784, 128)
    relu1 = ReLU()
    layer2 = Dense(128, 10)
    
    # Forward pass
    x = X_train[:5]  # 5 sample images
    h = layer1.forward(x)      # (5, 128)
    h = relu1.forward(h)       # (5, 128)
    out = layer2.forward(h)    # (5, 10)
    
    print(f"Output shape: {out.shape}")  # (5, 10)
    print(f"Sample output: {out[0]}")    # 10 raw scores (logits)
    

Success Criteria: Forward pass produces (batch_size, 10) output tensor.


Phase 5: Softmax and Cross-Entropy (2-3 hours)

Goal: Implement numerically stable softmax and cross-entropy loss.

Tasks:

  1. Implement stable softmax

    def softmax(z):
        """Numerically stable softmax."""
        # Subtract max for stability
        z_shifted = z - np.max(z, axis=1, keepdims=True)
        exp_z = np.exp(z_shifted)
        return exp_z / np.sum(exp_z, axis=1, keepdims=True)
    
  2. Test softmax properties

    logits = np.array([[1.0, 2.0, 3.0],
                       [1000.0, 1001.0, 1002.0]])  # Large values
    
    probs = softmax(logits)
    print(f"Probabilities:\n{probs}")
    print(f"Row sums: {probs.sum(axis=1)}")  # Should be [1.0, 1.0]
    assert np.allclose(probs.sum(axis=1), 1.0)  # All sum to 1
    
  3. Implement cross-entropy loss

    def cross_entropy_loss(probs, targets):
        """
        Compute cross-entropy loss.
        probs: (batch, num_classes) - softmax outputs
        targets: (batch, num_classes) - one-hot encoded
        """
        # Clip to prevent log(0)
        probs_clipped = np.clip(probs, 1e-10, 1 - 1e-10)
        # Cross-entropy: -sum(target * log(pred))
        loss = -np.sum(targets * np.log(probs_clipped)) / len(probs)
        return loss
    
  4. Test loss computation

    # Perfect prediction
    probs_perfect = np.array([[0.99, 0.01]])
    targets = np.array([[1.0, 0.0]])
    loss = cross_entropy_loss(probs_perfect, targets)
    print(f"Perfect prediction loss: {loss:.4f}")  # Very low
    
    # Wrong prediction
    probs_wrong = np.array([[0.01, 0.99]])
    loss = cross_entropy_loss(probs_wrong, targets)
    print(f"Wrong prediction loss: {loss:.4f}")  # Very high
    

Success Criteria: Softmax outputs sum to 1; loss is low for correct predictions, high for wrong.


Phase 6: Forward Pass (Vectorized) (2-3 hours)

Goal: Chain all layers together for a complete forward pass.

Tasks:

  1. Build the Network class

    class Network:
        def __init__(self, layer_sizes):
            """
            Create network with given layer sizes.
            Example: [784, 128, 64, 10] creates 3 Dense layers.
            """
            self.layers = []
            for i in range(len(layer_sizes) - 1):
                self.layers.append(Dense(layer_sizes[i], layer_sizes[i+1]))
                if i < len(layer_sizes) - 2:  # No ReLU after last layer
                    self.layers.append(ReLU())
    
        def forward(self, x):
            """Forward pass through all layers."""
            for layer in self.layers:
                x = layer.forward(x)
            return x
    
        def predict(self, x):
            """Get class predictions."""
            logits = self.forward(x)
            probs = softmax(logits)
            return np.argmax(probs, axis=1)
    
  2. Test on full dataset

    net = Network([784, 128, 64, 10])
    
    # Forward pass on all training data
    logits = net.forward(X_train)
    probs = softmax(logits)
    
    # Initial accuracy (should be ~10% random)
    predictions = np.argmax(probs, axis=1)
    accuracy = (predictions == y_train).mean()
    print(f"Initial accuracy: {accuracy:.1%}")  # ~10%
    

Success Criteria: Network makes predictions (even if random at first).


Phase 7: Backward Pass (Vectorized) (4-5 hours)

Goal: Implement gradient computation for all layers.

This is the hardest phase. Take your time.

Tasks:

  1. Implement Dense backward pass

    class Dense:
        # ... (keep forward method)
    
        def backward(self, grad_output):
            """
            Backward pass.
            grad_output: gradient of loss w.r.t. this layer's output
            Returns: gradient of loss w.r.t. this layer's input
            """
            # Note: the loss gradient (softmax_cross_entropy_backward, below)
            # is already averaged over the batch, so no extra division here.

            # Gradient w.r.t. weights
            self.dW = self.x.T @ grad_output

            # Gradient w.r.t. biases
            self.db = np.sum(grad_output, axis=0)
    
            # Gradient w.r.t. input (to pass to previous layer)
            grad_input = grad_output @ self.W.T
    
            return grad_input
    
  2. Implement ReLU backward pass

    class ReLU:
        # ... (keep forward method)
    
        def backward(self, grad_output):
            """
            Backward pass.
            ReLU gradient: 1 if x > 0, else 0
            """
            return grad_output * (self.x > 0)
    
  3. Implement softmax + cross-entropy gradient

    Recall the beautiful result: gradient is simply (prediction - target)

    def softmax_cross_entropy_backward(probs, targets):
        """
        Gradient of cross-entropy loss w.r.t. logits.
        This combines softmax and cross-entropy gradients.
        """
        return (probs - targets) / len(probs)
    
  4. Add backward method to Network

    class Network:
        # ... (keep existing methods)
    
        def backward(self, grad):
            """Backward pass through all layers in reverse."""
            for layer in reversed(self.layers):
                grad = layer.backward(grad)
    
  5. Gradient checking (CRITICAL)

    Verify your gradients numerically:

    def gradient_check(net, x, y, epsilon=1e-5):
        """Verify analytical gradients match numerical gradients."""
        # Get analytical gradient
        logits = net.forward(x)
        probs = softmax(logits)
        loss = cross_entropy_loss(probs, y)
        grad = softmax_cross_entropy_backward(probs, y)
        net.backward(grad)
    
        # Check each weight in first layer
        layer = net.layers[0]
        for i in range(min(5, layer.W.shape[0])):
            for j in range(min(5, layer.W.shape[1])):
                # Numerical gradient
                original = layer.W[i, j]
    
                layer.W[i, j] = original + epsilon
                logits_plus = net.forward(x)
                loss_plus = cross_entropy_loss(softmax(logits_plus), y)
    
                layer.W[i, j] = original - epsilon
                logits_minus = net.forward(x)
                loss_minus = cross_entropy_loss(softmax(logits_minus), y)
    
                layer.W[i, j] = original
    
                numerical_grad = (loss_plus - loss_minus) / (2 * epsilon)
                analytical_grad = layer.dW[i, j]
    
                rel_error = abs(numerical_grad - analytical_grad) / (abs(numerical_grad) + abs(analytical_grad) + 1e-8)
                if rel_error > 1e-5:
                    print(f"Gradient check FAILED at ({i},{j}): num={numerical_grad:.6f}, ana={analytical_grad:.6f}")
                    return False
    
        print("Gradient check PASSED!")
        return True
    

Success Criteria: Gradient check passes (relative error < 1e-5).


Phase 8: Mini-Batch Data Loader (1-2 hours)

Goal: Create efficient batch iteration.

Tasks:

  1. Implement batch generator

    def create_batches(X, y, batch_size, shuffle=True):
        """
        Generate mini-batches from data.
        Yields (X_batch, y_batch) tuples.
        """
        n = len(y)
        indices = np.arange(n)
    
        if shuffle:
            np.random.shuffle(indices)
    
        for start in range(0, n, batch_size):
            end = min(start + batch_size, n)
            batch_idx = indices[start:end]
            yield X[batch_idx], y[batch_idx]
    
  2. Test batching

    batch_size = 64
    num_batches = 0
    total_samples = 0
    
    for X_batch, y_batch in create_batches(X_train, y_train_onehot, batch_size):
        num_batches += 1
        total_samples += len(y_batch)
    
    print(f"Number of batches: {num_batches}")
    print(f"Total samples: {total_samples}")
    print(f"Expected batches: {len(y_train) // batch_size + 1}")
    

Success Criteria: Batches cover all data exactly once per epoch.


Phase 9: Training Loop (3-4 hours)

Goal: Put it all together into a training loop.

Tasks:

  1. Implement SGD update

    def sgd_update(net, learning_rate):
        """Update weights using SGD."""
        for layer in net.layers:
            if hasattr(layer, 'W'):
                layer.W -= learning_rate * layer.dW
                layer.b -= learning_rate * layer.db
    
  2. Implement training loop

    def train(net, X_train, y_train, X_val, y_val,
              epochs=10, batch_size=64, learning_rate=0.1):
        """Train the network."""
        y_train_onehot = one_hot(y_train)
    
        for epoch in range(epochs):
            epoch_loss = 0
            num_batches = 0
    
            # Training
            for X_batch, y_batch in create_batches(X_train, y_train_onehot, batch_size):
                # Forward
                logits = net.forward(X_batch)
                probs = softmax(logits)
                loss = cross_entropy_loss(probs, y_batch)
                epoch_loss += loss
                num_batches += 1
    
                # Backward
                grad = softmax_cross_entropy_backward(probs, y_batch)
                net.backward(grad)
    
                # Update
                sgd_update(net, learning_rate)
    
            # Evaluation
            train_acc = evaluate(net, X_train, y_train)
            val_acc = evaluate(net, X_val, y_val)
    
            print(f"Epoch {epoch+1}/{epochs} | "
                  f"Loss: {epoch_loss/num_batches:.4f} | "
                  f"Train Acc: {train_acc:.1%} | "
                  f"Val Acc: {val_acc:.1%}")
    
    def evaluate(net, X, y):
        """Compute accuracy."""
        predictions = net.predict(X)
        return (predictions == y).mean()
    
  3. Train and monitor

    net = Network([784, 128, 64, 10])
    train(net, X_train, y_train, X_val, y_val,
          epochs=10, batch_size=64, learning_rate=0.1)
    

Success Criteria: Accuracy increases each epoch; reaches >95% on validation.


Phase 10: Evaluation and Visualization (2-3 hours)

Goal: Final testing and weight visualization.

Tasks:

  1. Evaluate on test set

    test_acc = evaluate(net, X_test, y_test)
    print(f"Test accuracy: {test_acc:.2%}")  # Should be ~97%
    
  2. Generate confusion matrix

    def confusion_matrix(y_true, y_pred, num_classes=10):
        """Generate confusion matrix."""
        cm = np.zeros((num_classes, num_classes), dtype=int)
        for t, p in zip(y_true, y_pred):
            cm[t, p] += 1
        return cm
    
    predictions = net.predict(X_test)
    cm = confusion_matrix(y_test, predictions)
    print("Confusion Matrix:")
    print(cm)
    
  3. Visualize learned weights

    def visualize_weights(layer, num_show=64):
        """Visualize first layer weights as images."""
        W = layer.W  # (784, hidden_size)
    
        # Each column is one neuron's weights
        num_neurons = min(num_show, W.shape[1])
        grid_size = int(np.ceil(np.sqrt(num_neurons)))
    
        fig, axes = plt.subplots(grid_size, grid_size, figsize=(12, 12))
        for i, ax in enumerate(axes.flat):
            if i < num_neurons:
                weights = W[:, i].reshape(28, 28)
                ax.imshow(weights, cmap='RdBu', vmin=-0.5, vmax=0.5)
            ax.axis('off')
    
        plt.savefig('weights_visualization.png')
        print("Saved weights_visualization.png")
    
    visualize_weights(net.layers[0])
    
  4. Save and load model

    def save_model(net, filepath):
        """Save model weights."""
        weights = {}
        for i, layer in enumerate(net.layers):
            if hasattr(layer, 'W'):
                weights[f'layer_{i}_W'] = layer.W
                weights[f'layer_{i}_b'] = layer.b
        np.savez(filepath, **weights)
    
    def load_model(net, filepath):
        """Load model weights."""
        data = np.load(filepath)
        for i, layer in enumerate(net.layers):
            if hasattr(layer, 'W'):
                layer.W = data[f'layer_{i}_W']
                layer.b = data[f'layer_{i}_b']
    
    save_model(net, 'mnist_model.npz')
    

Success Criteria: 97%+ test accuracy; weight visualizations show interpretable patterns.


Questions to Guide Your Design

Before implementing each phase, think through these:

Data Representation

  • Why do we flatten 28x28 images instead of keeping them 2D?
  • What would happen if we didn’t normalize pixel values?
  • Why is shuffling between epochs important?

Network Architecture

  • Why start with 128 and 64 hidden units? What happens with fewer? More?
  • What’s the minimum network that can achieve 90%? 95%?
  • Why do we use ReLU instead of sigmoid in hidden layers?

Softmax and Loss

  • What happens if we use sigmoid instead of softmax for the output?
  • Why does softmax use exp() specifically?
  • How does cross-entropy “punish” confident wrong answers?

Training Dynamics

  • What learning rate is too high? Too low? How do you know?
  • How many epochs until the network starts overfitting?
  • What batch size works best? Why?

Debugging

  • If accuracy is stuck at 10%, what’s wrong?
  • If loss is NaN, what’s the likely cause?
  • How do you verify gradients are correct?

Thinking Exercise

Before coding, work through this example by hand.

Setup: 2-class classification (cat vs dog), 3 input features, 1 hidden layer with 2 neurons

Input: x = [0.5, 0.8, 0.2]
Hidden layer: W1 = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]], b1 = [0.1, 0.1]
Output layer: W2 = [[0.3, 0.7], [0.4, 0.6]], b2 = [0.1, 0.1]
True label: y = [1, 0] (cat)

Trace the forward pass:

  1. Hidden layer pre-activation: z1 = x @ W1 + b1 = ?

  2. Hidden layer activation (ReLU): h1 = max(0, z1) = ?

  3. Output layer pre-activation: z2 = h1 @ W2 + b2 = ?

  4. Softmax: probs = softmax(z2) = ?

  5. Cross-entropy loss: L = -sum(y * log(probs)) = ?

Now trace backward:

  1. Gradient of loss w.r.t. z2: dL/dz2 = probs - y = ?

  2. Gradient w.r.t. W2: dL/dW2 = h1.T @ dL/dz2 = ?

Continue through the network…
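
When you have finished the hand calculation, a short NumPy sketch (same numbers as the setup above) lets you check your arithmetic:

import numpy as np

x  = np.array([0.5, 0.8, 0.2])
W1 = np.array([[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]); b1 = np.array([0.1, 0.1])
W2 = np.array([[0.3, 0.7], [0.4, 0.6]]);             b2 = np.array([0.1, 0.1])
y  = np.array([1.0, 0.0])                            # true label: cat

z1 = x @ W1 + b1                       # hidden pre-activation
h1 = np.maximum(0, z1)                 # ReLU
z2 = h1 @ W2 + b2                      # output pre-activation
probs = np.exp(z2) / np.exp(z2).sum()  # softmax
loss = -np.sum(y * np.log(probs))      # cross-entropy

dz2 = probs - y                        # dL/dz2
dW2 = np.outer(h1, dz2)                # dL/dW2 = h1^T dz2 for a single example
print(z1, h1, z2, probs, loss, dz2, dW2, sep="\n")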


Testing Strategy

Unit Tests

# tests/test_layers.py

def test_dense_forward_shape():
    layer = Dense(10, 5)
    x = np.random.randn(3, 10)
    y = layer.forward(x)
    assert y.shape == (3, 5)

def test_relu_forward():
    relu = ReLU()
    x = np.array([[-1, 0, 1], [2, -2, 0]])
    y = relu.forward(x)
    expected = np.array([[0, 0, 1], [2, 0, 0]])
    assert np.allclose(y, expected)

def test_softmax_sums_to_one():
    logits = np.random.randn(5, 10)
    probs = softmax(logits)
    assert np.allclose(probs.sum(axis=1), 1.0)

def test_softmax_all_positive():
    logits = np.random.randn(5, 10) * 100  # Large values
    probs = softmax(logits)
    assert (probs > 0).all()

def test_one_hot_encoding():
    labels = np.array([0, 5, 9])
    encoded = one_hot(labels, 10)
    assert encoded.shape == (3, 10)
    assert encoded[0, 0] == 1
    assert encoded[1, 5] == 1
    assert encoded[2, 9] == 1
    assert encoded.sum() == 3

Gradient Checking

def test_gradient_correctness():
    net = Network([5, 3, 2])
    x = np.random.randn(2, 5)
    y = one_hot(np.array([0, 1]), 2)

    assert gradient_check(net, x, y)

Integration Tests

def test_can_overfit_small_batch():
    """Network should perfectly fit a tiny dataset."""
    net = Network([784, 50, 10])
    X_tiny = X_train[:10]
    y_tiny = y_train[:10]

    # train_one_epoch: assumed helper that runs one epoch of the Phase 9
    # training loop (forward, backward, SGD update) over the given data
    for _ in range(500):
        train_one_epoch(net, X_tiny, y_tiny, batch_size=10, lr=0.5)

    acc = evaluate(net, X_tiny, y_tiny)
    assert acc > 0.99, f"Should overfit tiny batch, got {acc}"

Common Pitfalls and Debugging Tips

Pitfall 1: Softmax Numerical Overflow

Symptom: RuntimeWarning: overflow encountered in exp, outputs are NaN

Cause: Exponentiating large numbers (e.g., e^1000 = inf)

Fix: Subtract the max before exponentiating

# BAD
exp_z = np.exp(z)

# GOOD
z_shifted = z - np.max(z, axis=1, keepdims=True)
exp_z = np.exp(z_shifted)

Pitfall 2: Log of Zero

Symptom: -inf in loss, NaN in gradients

Cause: Taking log(0) when softmax outputs exactly 0

Fix: Clip probabilities

probs_clipped = np.clip(probs, 1e-10, 1 - 1e-10)
log_probs = np.log(probs_clipped)

Pitfall 3: Shape Mismatch in Matrix Multiplication

Symptom: ValueError: shapes (64,128) and (784,128) not aligned

Cause: Wrong dimension order in weight matrix or forgot to transpose

Debug: Print shapes at every step

print(f"x.shape: {x.shape}")
print(f"W.shape: {W.shape}")
print(f"Expected: ({x.shape[0]}, {W.shape[1]})")

Pitfall 4: Forgetting to Cache Inputs

Symptom: Backward pass fails with “AttributeError: ‘Dense’ object has no attribute ‘x’”

Cause: Forward pass didn’t save the input for gradient computation

Fix: Always cache in forward:

def forward(self, x):
    self.x = x  # CACHE THIS
    return x @ self.W + self.b

Pitfall 5: Gradient Not Scaled Correctly Over the Batch

Symptom: Training is unstable (loss oscillates wildly) or crawls along far too slowly

Cause: Gradients are summed with no averaging, or divided by the batch size twice

Fix: Divide by the batch size exactly once - in the loss gradient (as softmax_cross_entropy_backward does in this guide), or alternatively in each layer's backward pass, but never in both places

grad = (probs - targets) / len(probs)  # average once here...
self.dW = self.x.T @ grad_output       # ...and not again in the layer

Pitfall 6: Learning Rate Too High

Symptom: Loss increases or oscillates instead of decreasing

Cause: Updates overshoot the minimum

Fix: Lower the learning rate (try 0.01, then 0.001) until the loss decreases steadily; raise it gradually toward 0.1 only once training is stable

Pitfall 7: Weights All Zero or Same

Symptom: All neurons produce identical outputs

Cause: Zero initialization or same random seed

Fix: Use proper He initialization

W = np.random.randn(in_dim, out_dim) * np.sqrt(2.0 / in_dim)

Pitfall 8: Accuracy Stuck at 10%

Symptom: Network never learns better than random guessing

Cause: Usually a gradient bug, or incorrect loss computation

Debug (a quick sanity check is sketched after this list):

  1. Run gradient check
  2. Verify loss decreases on tiny dataset
  3. Check that predictions change after update
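
One extra sanity check worth running first (a sketch, assuming the softmax, cross_entropy_loss, and one_hot helpers from the earlier phases): with freshly initialized weights and 10 roughly balanced classes, the loss should start near -log(1/10) ≈ 2.303. A wildly different starting value usually means the loss or softmax wiring is wrong, not the learning itself.

probs = softmax(net.forward(X_train[:256]))
print(cross_entropy_loss(probs, one_hot(y_train[:256])))  # expect roughly 2.3 before training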

Interview Questions

Prepare to answer these after completing the project:

Fundamental Understanding

  1. “Explain the softmax function. Why do we use it instead of sigmoid for multi-class classification?”

    Key points: Softmax normalizes to probabilities summing to 1, creating competition between classes. Sigmoid treats each class independently, which doesn’t make sense for mutually exclusive classes.

  2. “Derive the gradient of cross-entropy loss with respect to the logits when using softmax.”

    Key insight: The gradient simplifies beautifully to (prediction - target). Show the derivation step by step.

  3. “What is the log-sum-exp trick and why is it important?”

    Key points: Prevents numerical overflow when computing softmax. Subtracting the max before exponentiating keeps numbers in a reasonable range without changing the result.

Implementation Details

  1. “How does mini-batch gradient descent differ from stochastic and batch gradient descent? What are the tradeoffs?”

    Key points: Batch size affects gradient noise (regularization), memory usage, and convergence speed. Mini-batch is the practical middle ground.

  2. “Why do we need to shuffle the training data between epochs?”

    Key points: Prevents the network from learning spurious patterns in the order of examples. Reduces variance in gradient estimates.

  3. “Explain Xavier/Glorot and He initialization. When would you use each?”

    Key points: Xavier for sigmoid/tanh, He for ReLU. Both aim to keep variance stable through layers, preventing vanishing/exploding gradients.

Debugging and Analysis

  1. “Your MNIST network is stuck at 10% accuracy. Walk me through how you would debug this.”

    Key points: Check gradient correctness, verify loss decreases on tiny batch, check shapes, verify data preprocessing, start with simpler network.

  2. “Look at these weight visualizations. What do they tell you about what the network learned?”

    Key points: First layer learns edge detectors, templates for digit parts. Each neuron specializes in detecting specific patterns.

  3. “The training accuracy is 99.5% but test accuracy is 94%. What’s happening and how would you fix it?”

    Key points: Overfitting. Solutions include regularization (L2, dropout), early stopping, data augmentation, simpler model.

Advanced Concepts

  1. “How would you modify this network to handle Fashion-MNIST without changing the architecture?”

    Key points: No architecture change needed - just different data. Might need different hyperparameters. This demonstrates the generality of the approach.


Hints in Layers

Hint 1: Getting Started Start with the data. Load MNIST, visualize examples, verify shapes. Many bugs come from misunderstanding the data format. MNIST images are grayscale with white digits on black background (0 = black, 255 = white).

Hint 2: Building Layers Implement forward pass first for all layers. Test that you can pass data through the whole network. Shapes should be: (batch, 784) -> (batch, 128) -> (batch, 64) -> (batch, 10).

Hint 3: Softmax Stability Always subtract the max before exponentiating. The most common bug is numerical overflow/underflow in softmax. Use this pattern:

z_max = np.max(z, axis=1, keepdims=True)
exp_z = np.exp(z - z_max)
softmax = exp_z / exp_z.sum(axis=1, keepdims=True)

Hint 4: Gradient of Softmax + Cross-Entropy Don’t compute them separately! The combined gradient is simply (softmax_output - one_hot_target). This is one of the most elegant results in deep learning.

Hint 5: Gradient Checking Before training, verify your gradients numerically. Use epsilon = 1e-5, compute (f(x+eps) - f(x-eps)) / (2*eps), compare to your analytical gradient. Relative error should be < 1e-5.

Hint 6: First Training Run Start with a simpler network (e.g., [784, 10] with no hidden layers). This is just logistic regression. It should reach ~92% accuracy. If not, you have a bug in your core training loop.
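
A sketch of that baseline, reusing the Network and train helpers from Phases 6 and 9 (assumed to be defined or importable in your project):

baseline = Network([784, 10])   # no hidden layers: plain softmax regression
train(baseline, X_train, y_train, X_val, y_val,
      epochs=5, batch_size=64, learning_rate=0.1)
# Roughly 92% validation accuracy is expected; much lower points to a bug
# in the loss, gradients, or update step rather than in the architecture.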

Hint 7: Weight Visualization The first Dense layer has shape (784, 128). Each column is one neuron’s weights, which you can reshape to 28x28 and visualize. These show what pattern each neuron is “looking for.”


Extensions and Challenges

Extension 1: Add Dropout

Dropout randomly sets neurons to zero during training, providing regularization:

class Dropout:
    def __init__(self, p=0.5):
        self.p = p  # Probability of keeping a neuron

    def forward(self, x, training=True):
        if not training:
            return x
        self.mask = np.random.binomial(1, self.p, x.shape) / self.p
        return x * self.mask

    def backward(self, grad):
        return grad * self.mask

Challenge: Add dropout after each ReLU. Compare test accuracy with and without dropout.

Extension 2: Visualize Learned Features

Beyond raw weights, visualize what inputs maximally activate each neuron:

def visualize_activation_maximization(net, layer_idx, neuron_idx, num_steps=100):
    """Find input that maximally activates a specific neuron."""
    x = np.random.randn(1, 784) * 0.01

    for step in range(num_steps):
        # Forward to target layer
        h = x
        for i, layer in enumerate(net.layers[:layer_idx+1]):
            h = layer.forward(h)

        # Gradient w.r.t. target neuron
        grad = np.zeros_like(h)
        grad[0, neuron_idx] = 1

        # Backward to input
        for layer in reversed(net.layers[:layer_idx+1]):
            grad = layer.backward(grad)

        # Gradient ascent
        x += 0.1 * grad
        x = np.clip(x, -1, 1)

    return x.reshape(28, 28)

Extension 3: Try Fashion-MNIST

Fashion-MNIST is a drop-in replacement for MNIST with clothing items instead of digits:

  • Same format: 60,000 training, 10,000 test, 28x28 grayscale
  • 10 classes: T-shirt, Trouser, Pullover, Dress, Coat, Sandal, Shirt, Sneaker, Bag, Ankle boot
  • More challenging than digits (~89% accuracy vs ~97%)

Challenge: Achieve >88% on Fashion-MNIST with your implementation.

Extension 4: Learning Rate Scheduling

Implement learning rate decay:

def step_decay(initial_lr, epoch, drop=0.5, epochs_drop=5):
    """Reduce LR by factor every N epochs."""
    return initial_lr * (drop ** (epoch // epochs_drop))

def exponential_decay(initial_lr, epoch, decay_rate=0.95):
    """Exponential decay."""
    return initial_lr * (decay_rate ** epoch)

Extension 5: Momentum

Add momentum to SGD for faster convergence:

class MomentumSGD:
    def __init__(self, net, lr=0.01, momentum=0.9):
        self.lr = lr
        self.momentum = momentum
        self.velocity = {}
        for i, layer in enumerate(net.layers):
            if hasattr(layer, 'W'):
                self.velocity[f'{i}_W'] = np.zeros_like(layer.W)
                self.velocity[f'{i}_b'] = np.zeros_like(layer.b)

    def update(self, net):
        for i, layer in enumerate(net.layers):
            if hasattr(layer, 'W'):
                self.velocity[f'{i}_W'] = (
                    self.momentum * self.velocity[f'{i}_W'] -
                    self.lr * layer.dW
                )
                layer.W += self.velocity[f'{i}_W']

                self.velocity[f'{i}_b'] = (
                    self.momentum * self.velocity[f'{i}_b'] -
                    self.lr * layer.db
                )
                layer.b += self.velocity[f'{i}_b']
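
A usage sketch showing where MomentumSGD slots into the Phase 9 loop (replacing sgd_update; names as defined earlier in this guide):

optimizer = MomentumSGD(net, lr=0.01, momentum=0.9)

for X_batch, y_batch in create_batches(X_train, y_train_onehot, batch_size=64):
    probs = softmax(net.forward(X_batch))
    net.backward(softmax_cross_entropy_backward(probs, y_batch))
    optimizer.update(net)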

Real-World Connections

Optical Character Recognition (OCR)

MNIST is the ancestor of all OCR systems. Modern OCR uses:

  • CNNs instead of dense layers (Project 9)
  • Larger, higher-resolution images
  • Thousands of characters (Chinese, Japanese, etc.)
  • Sequence-to-sequence models for words

Industry application: Banks process millions of handwritten checks daily using MNIST-descended technology.

Check Reading and Document Processing

Every time you deposit a check via mobile app:

  1. Image is captured and preprocessed
  2. Handwritten amounts are segmented
  3. Each digit is classified (like MNIST)
  4. Business logic validates the result

Scale: Chase Bank processes 40+ million checks per day.

Medical Imaging

The same principles apply to:

  • Detecting tumors in X-rays
  • Classifying cell types in microscopy
  • Reading handwritten prescriptions

Impact: AI-assisted diagnosis can achieve specialist-level accuracy.

Postal Systems

USPS and postal services worldwide use handwritten digit recognition:

  • ZIP code reading (98% accuracy required)
  • Address parsing
  • Package routing

Fun fact: The postal service was one of the first commercial applications of neural networks (1990s).


Books That Will Help

Topic                       Book                                                      Specific Chapters
-------------------------   -------------------------------------------------------   ------------------------------------
MNIST walkthrough           “Neural Networks and Deep Learning” by Michael Nielsen    Ch. 1 (complete MNIST example)
Backpropagation intuition   “Grokking Deep Learning” by Andrew Trask                  Ch. 4-5 (hands-on gradient descent)
Mathematical foundations    “Deep Learning” by Goodfellow et al.                      Ch. 6 (deep feedforward networks)
Practical implementation    “Deep Learning with Python” by Chollet                    Ch. 2 (first neural network)
Numerical stability         “Numerical Recipes” by Press et al.                       Ch. 5 (function evaluation)
Weight initialization       “Deep Learning” by Goodfellow et al.                      Ch. 8.4 (parameter initialization)

Free Online Resources

  • Michael Nielsen’s book: http://neuralnetworksanddeeplearning.com/
    • Arguably the best free resource for understanding neural networks
    • Includes complete MNIST implementation with explanations
  • 3Blue1Brown Neural Networks series: YouTube
    • Excellent visualizations of gradient descent and backpropagation
    • Watch before implementing
  • Andrej Karpathy’s “micrograd”: GitHub
    • Minimal autograd engine in ~100 lines
    • Good reference for understanding backward pass

Self-Assessment Checklist

Before moving to Project 9 (CNN), verify you can:

Conceptual Understanding

  • Explain why softmax creates competition between classes
  • Derive the gradient of softmax + cross-entropy from first principles
  • Explain the log-sum-exp trick and why it’s necessary
  • Describe what one-hot encoding is and why we use it
  • Explain mini-batch gradient descent vs. full batch vs. stochastic

Implementation Skills

  • Implement softmax that doesn’t overflow for large inputs
  • Write a Dense layer with correct forward and backward passes
  • Implement ReLU activation with correct gradient
  • Create a data loader that shuffles and batches correctly
  • Save and load model weights

Debugging Abilities

  • Use gradient checking to verify your implementation
  • Debug a network stuck at 10% accuracy
  • Identify overfitting from training curves
  • Tune learning rate for stable training

Practical Results

  • Achieve >96% test accuracy on MNIST
  • Visualize learned weights as 28x28 images
  • Generate a confusion matrix and interpret it
  • Predict digits from your own handwritten images

Questions to Answer Without Looking

  1. What’s the shape of the gradient for a Dense layer with input (64, 784) and output (64, 128)?
  2. Why do we use cross-entropy instead of MSE for classification?
  3. What happens if you initialize all weights to zero?
  4. How many trainable parameters in a network [784 -> 128 -> 64 -> 10]?
  5. What does a horizontal stripe pattern in a weight visualization mean?

Next: P09: CNN From Scratch - Replace dense layers with convolutional layers and achieve >99% accuracy