Project 8: MNIST From First Principles

The “Hello World” of AI, done the hard way - Build a neural network to recognize handwritten digits using nothing but NumPy, mastering softmax, one-hot encoding, and mini-batch gradient descent


Project Overview

Attribute        Value
--------------   ---------------------------------------------------------------------------
Difficulty       Level 4: Expert
Time Estimate    2 Weeks
Language         Python (NumPy Only)
Prerequisites    Projects 3 (Linear Regression), 4 (Spam Filter), 6 (Fraud Detection MLP)
Main Book        “Neural Networks and Deep Learning” by Michael Nielsen - Chapter 1
Knowledge Area   Computer Vision / Deep Learning

Learning Objectives

After completing this project, you will be able to:

  1. Implement Softmax from scratch - Derive and code the function that turns 10 raw scores into probabilities summing to 1, including numerical stability tricks

  2. Master multi-class cross-entropy - Understand why this loss function works for classification and derive its gradient analytically

  3. Build one-hot encoding utilities - Convert integer labels to vector representations that neural networks can learn from

  4. Achieve true vectorization mastery - Process 60,000 images without Python loops, understanding broadcasting and matrix operations at a deep level

  5. Implement mini-batch training - Balance memory efficiency and gradient stability through batch processing

  6. Design multi-layer dense networks - Build networks with arbitrary hidden layers, understanding the shapes that flow through each layer

  7. Debug neural networks systematically - Use gradient checking, visualization, and layer-by-layer analysis to find and fix problems

  8. Visualize learned representations - Extract and display what each neuron has learned to “look for” in input images


The Core Question You’re Answering

“How do we handle more than Yes/No answers?”

In Project 4 (Spam Filter), you built a binary classifier: Spam or Ham. But what if there are 10 possible answers? 100? 1000?

The key insight is that multi-class classification is just running multiple binary classifiers simultaneously and forcing them to compete. Softmax creates this competition - when one output neuron’s score goes up, all others must go down (because probabilities sum to 1).

Before coding, sit with this question:

  • Why can’t we just use 10 sigmoid neurons independently?
  • What happens if two output neurons both claim 99% confidence?
  • How do we train a network when only one answer is “correct”?

Concepts You Must Understand First

Stop and research these before coding. Each concept builds on the previous.

1. Multi-Class Classification Setup

Unlike binary classification where we have one output neuron, multi-class needs N output neurons for N classes. Each neuron represents the “evidence” for one class.

Binary (Project 4):                Multi-Class (This Project):

Input -> Hidden -> [1 neuron]      Input -> Hidden -> [10 neurons]
                     |                                  |
                  sigmoid                            softmax
                     |                                  |
                  p(spam)                     [p(0), p(1), ..., p(9)]

Key insight: The output layer has 10 neurons, one for each digit. But before softmax, these are just raw scores (logits) - they can be any real number, including negative!

Book Reference: “Neural Networks and Deep Learning” Ch. 1.6 (Implementing our network to classify digits)

2. The Softmax Function and Its Properties

Softmax converts a vector of arbitrary real numbers into a probability distribution:

\[\text{softmax}(z_i) = \frac{e^{z_i}}{\sum_{j=1}^{K} e^{z_j}}\]

Properties you must internalize:

  • All outputs are in (0, 1)
  • All outputs sum to exactly 1
  • The largest input becomes the largest probability
  • It’s differentiable everywhere
Example:
logits = [2.0, 1.0, 0.1]

e^2.0 = 7.389
e^1.0 = 2.718
e^0.1 = 1.105
sum   = 11.212

softmax = [7.389/11.212, 2.718/11.212, 1.105/11.212]
        = [0.659, 0.242, 0.099]  (sums to 1.0)

Critical: Unlike sigmoid (which squashes independently), softmax creates competition - increasing one logit decreases all other probabilities.

Book Reference: “Deep Learning” by Goodfellow et al., Ch. 6.2.2.3

3. Cross-Entropy Loss for Multiple Classes

For binary classification, we used: \(L = -[y \log(\hat{y}) + (1-y)\log(1-\hat{y})]\)

For multi-class with one-hot encoding: \(L = -\sum_{i=1}^{K} y_i \log(\hat{y}_i)\)

Since only one $y_i = 1$ (the correct class) and all others are 0, this simplifies to: \(L = -\log(\hat{y}_{\text{correct}})\)

Intuition: We want to maximize the probability of the correct class. Log makes this a minimization problem. If we predict 0.9 for the correct class, loss is -log(0.9) = 0.105 (low). If we predict 0.1, loss is -log(0.1) = 2.303 (high).
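
A two-line NumPy check (a sketch; the class index and probabilities are made up for illustration) that the full sum and the one-hot shortcut agree:

import numpy as np

y     = np.array([0, 0, 0, 1, 0, 0, 0, 0, 0, 0])    # one-hot target, true class = 3
y_hat = np.array([0.01, 0.02, 0.02, 0.90, 0.01,
                  0.01, 0.01, 0.01, 0.005, 0.005])   # model probabilities (sum to 1)

full_sum = -np.sum(y * np.log(y_hat))   # -sum_i y_i log(y_hat_i)
shortcut = -np.log(y_hat[3])            # -log(y_hat_correct)
print(full_sum, shortcut)               # both ~0.105, matching the intuition above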

Book Reference: “Grokking Deep Learning” Ch. 6

4. One-Hot Encoding

Neural networks work with numbers, not categorical labels. We convert:

Label: 3  ->  [0, 0, 0, 1, 0, 0, 0, 0, 0, 0]
Label: 7  ->  [0, 0, 0, 0, 0, 0, 0, 1, 0, 0]
Label: 0  ->  [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]

This encoding lets us:

  • Compute loss directly (dot product with log probabilities)
  • Avoid implying ordering (3 is not “between” 2 and 4 for classification)
  • Enable batch matrix operations
def one_hot(labels, num_classes):
    """Convert integer labels to one-hot vectors."""
    n = len(labels)
    one_hot_matrix = np.zeros((n, num_classes))
    one_hot_matrix[np.arange(n), labels] = 1
    return one_hot_matrix

Book Reference: “Hands-On Machine Learning” by Geron, Ch. 4

5. Mini-Batch Training

With 60,000 training images, we have three choices:

Method           Batch Size   Updates per Epoch   Gradient Quality
--------------   ----------   -----------------   --------------------
Stochastic GD    1            60,000              Very noisy
Full Batch GD    60,000       1                   Very smooth but slow
Mini-Batch GD    32-256       235-1,875           Good balance

Mini-batch advantages:

  • Uses vectorization (fast)
  • Gradients are stable enough to converge
  • Adds regularization through noise
  • Fits in memory
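
If you want to double-check the updates-per-epoch column in the table above, here is a quick sketch (assuming the 60,000 training images MNIST provides):

import math

n = 60_000
for batch_size in (1, 32, 64, 256, 60_000):
    print(f"batch size {batch_size:>6}: {math.ceil(n / batch_size)} updates per epoch")
# 1 -> 60000, 32 -> 1875, 64 -> 938, 256 -> 235, 60000 -> 1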

Book Reference: “Deep Learning” Ch. 8.1.3

6. Data Loading and Shuffling

The order of training examples matters! If you train on all 0s, then all 1s, then all 2s… the network will “forget” earlier digits (catastrophic forgetting).

Shuffling after each epoch prevents the network from learning spurious patterns in the order of examples.

def shuffle_data(X, y):
    """Shuffle training data while keeping X-y pairs aligned."""
    indices = np.random.permutation(len(y))
    return X[indices], y[indices]

Book Reference: “Neural Networks and Deep Learning” Ch. 3


Deep Theoretical Foundation

Softmax Derivation and Interpretation

Let’s derive softmax from first principles. We want to convert scores to probabilities.

Attempt 1: Just normalize \(p_i = \frac{z_i}{\sum_j z_j}\)

Problem: What if $z_i$ is negative? We’d get negative “probabilities.”

Attempt 2: Use absolute values \(p_i = \frac{|z_i|}{\sum_j |z_j|}\)

Problem: z = [-10, 0.1] would give [0.99, 0.01], but -10 should mean “definitely not this class.”

Attempt 3: Exponentiate first \(p_i = \frac{e^{z_i}}{\sum_j e^{z_j}}\)

This works because:

  • $e^x > 0$ for all x (no negative probabilities)
  • $e^{-10} \approx 0.00005$ (negative scores become tiny)
  • $e^{10} \approx 22026$ (positive scores become large)
  • The ratio structure naturally creates probabilities

Interpretation: Softmax is the “argmax” function made differentiable. Argmax returns [0,0,1,0] (one-hot for max). Softmax returns [0.02, 0.08, 0.85, 0.05] (soft version where all classes have some probability).
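
A quick numeric comparison (a sketch) of the three attempts on the troublesome example z = [-10, 0.1]:

import numpy as np

z = np.array([-10.0, 0.1])
print(z / z.sum())                   # Attempt 1: [1.01, -0.01] - a negative "probability"
print(np.abs(z) / np.abs(z).sum())   # Attempt 2: [0.99, 0.01]  - the "definitely not" score wins
print(np.exp(z) / np.exp(z).sum())   # Attempt 3: [~0.00004, ~0.99996] - sensible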

Numerical Stability: The Log-Sum-Exp Trick

Naive softmax will overflow or underflow for large/small logits:

# BAD: Will overflow
z = np.array([1000, 1001, 1002])
exp_z = np.exp(z)  # [inf, inf, inf] - overflow!

Solution: Subtract the maximum before exponentiating:

\[\text{softmax}(z_i) = \frac{e^{z_i - \max(z)}}{\sum_j e^{z_j - \max(z)}}\]

This is mathematically identical (the max cancels in the ratio) but numerically stable:

def stable_softmax(z):
    """Numerically stable softmax."""
    z_shifted = z - np.max(z, axis=-1, keepdims=True)
    exp_z = np.exp(z_shifted)
    return exp_z / np.sum(exp_z, axis=-1, keepdims=True)

Book Reference: “Deep Learning” Ch. 4.1 (Numerical Computation)

Softmax + Cross-Entropy Gradient Derivation

This is where the magic happens. Let’s derive the gradient of the loss with respect to the logits.

Setup:

  • Logits: $z = [z_1, z_2, …, z_K]$
  • Softmax output: $\hat{y}_i = \frac{e^{z_i}}{\sum_j e^{z_j}}$
  • One-hot target: $y = [0, 0, 1, 0, …, 0]$ (1 at the correct class)
  • Cross-entropy loss: $L = -\sum_i y_i \log(\hat{y}_i)$

Goal: Find $\frac{\partial L}{\partial z_k}$ for all k.

Step 1: Derivative of softmax output with respect to logit

For $i = k$ (same index): \(\frac{\partial \hat{y}_i}{\partial z_k} = \hat{y}_i(1 - \hat{y}_i)\)

For $i \neq k$ (different indices): \(\frac{\partial \hat{y}_i}{\partial z_k} = -\hat{y}_i \hat{y}_k\)

Compact form: $\frac{\partial \hat{y}_i}{\partial z_k} = \hat{y}_i(\delta_{ik} - \hat{y}_k)$

where $\delta_{ik}$ is the Kronecker delta (1 if i=k, 0 otherwise).

Step 2: Apply chain rule

\[\frac{\partial L}{\partial z_k} = \sum_i \frac{\partial L}{\partial \hat{y}_i} \cdot \frac{\partial \hat{y}_i}{\partial z_k}\]

Since $\frac{\partial L}{\partial \hat{y}_i} = -\frac{y_i}{\hat{y}_i}$:

\[\frac{\partial L}{\partial z_k} = -\sum_i \frac{y_i}{\hat{y}_i} \cdot \hat{y}_i(\delta_{ik} - \hat{y}_k)\] \[= -\sum_i y_i(\delta_{ik} - \hat{y}_k)\] \[= -y_k + \hat{y}_k \sum_i y_i\]

Since $\sum_i y_i = 1$ (one-hot sums to 1):

\[\frac{\partial L}{\partial z_k} = \hat{y}_k - y_k\]

The beautiful result: The gradient is simply prediction minus target.

# Gradient of cross-entropy loss w.r.t. logits
dL_dz = softmax_output - one_hot_target
# That's it! No complex derivatives needed in code.

This is why softmax + cross-entropy is so popular: The gradient is trivially simple.
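
You can confirm the result numerically on a single example before trusting it in the full network. A small standalone sketch (softmax written inline, one three-class example):

import numpy as np

def softmax_1d(z):
    e = np.exp(z - z.max())
    return e / e.sum()

z = np.array([0.5, -1.2, 2.0])
y = np.array([0.0, 0.0, 1.0])         # one-hot target, correct class = 2
analytic = softmax_1d(z) - y          # the derived gradient: prediction - target

# Central-difference estimate of dL/dz_k for L = -log(softmax(z)[2])
eps = 1e-6
numeric = np.zeros_like(z)
for k in range(len(z)):
    zp, zm = z.copy(), z.copy()
    zp[k] += eps
    zm[k] -= eps
    numeric[k] = (-np.log(softmax_1d(zp)[2]) + np.log(softmax_1d(zm)[2])) / (2 * eps)

print(np.max(np.abs(analytic - numeric)))  # tiny (~1e-10): the formulas agree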

Why Softmax Creates Competition

Consider a network predicting between cats and dogs:

Independent sigmoids:

  • P(cat) = 0.8, P(dog) = 0.9
  • Both are high! The network is confused but doesn’t “know” it

Softmax:

  • If logit_cat = 2 and logit_dog = 2.5
  • P(cat) = 0.38, P(dog) = 0.62
  • The outputs are forced to sum to 1

When we increase logit_cat:

  • P(cat) goes UP
  • P(dog) goes DOWN (even though we didn’t touch logit_dog!)

This automatic competition is what makes softmax work for mutually exclusive classes.
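
The competition is easy to see numerically. A small sketch using the cat/dog logits above:

import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

print(softmax(np.array([2.0, 2.5])))   # [P(cat), P(dog)] ~ [0.38, 0.62]
print(softmax(np.array([3.0, 2.5])))   # raise only logit_cat -> ~[0.62, 0.38]
# P(dog) dropped even though logit_dog never changed.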

Weight Initialization for Deep Networks

Random initialization matters more than you think. Bad initialization can cause:

  • Vanishing gradients: Outputs saturate, gradients approach 0
  • Exploding gradients: Values grow exponentially through layers
  • Symmetry breaking failure: All neurons learn the same thing

Xavier/Glorot Initialization (for tanh/sigmoid): \(W \sim \mathcal{N}\left(0, \sigma^2\right)\) with \(\sigma = \sqrt{\frac{2}{n_{\text{in}} + n_{\text{out}}}}\)

He Initialization (for ReLU): \(W \sim \mathcal{N}\left(0, \sigma^2\right)\) with \(\sigma = \sqrt{\frac{2}{n_{\text{in}}}}\)

def he_init(shape):
    """He initialization for ReLU networks."""
    fan_in = shape[0]
    std = np.sqrt(2.0 / fan_in)
    return np.random.randn(*shape) * std

def xavier_init(shape):
    """Xavier initialization for tanh/sigmoid networks."""
    fan_in, fan_out = shape[0], shape[1]
    std = np.sqrt(2.0 / (fan_in + fan_out))
    return np.random.randn(*shape) * std
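
A quick empirical sanity check (a sketch, not part of the training code) that He initialization keeps the scale of ReLU activations roughly constant as depth grows:

import numpy as np

rng = np.random.default_rng(0)
h = rng.standard_normal((1000, 784))   # fake batch of standardized inputs

for layer in range(5):
    W = rng.standard_normal((h.shape[1], 256)) * np.sqrt(2.0 / h.shape[1])  # He scale
    h = np.maximum(0.0, h @ W)                                              # ReLU
    rms = np.sqrt((h ** 2).mean())
    print(f"layer {layer + 1}: RMS activation = {rms:.3f}")  # stays near 1.0

Swap the He scale for something tiny (e.g. 0.01) and the RMS collapses toward 0 within a few layers - exactly the vanishing-signal problem these schemes are designed to avoid.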

Book Reference: “Deep Learning” Ch. 8.4


Real World Outcome

When you complete this project, here is exactly what you should see:

Training the Network

$ python mnist_train.py
Loading MNIST dataset...
Training set: 60000 images (28x28 pixels)
Test set: 10000 images (28x28 pixels)

Preprocessing:
  - Flattening 28x28 -> 784
  - Normalizing to [0, 1]
  - One-hot encoding 10 classes

Network Architecture: [784 -> 128 -> 64 -> 10]
  Layer 1: 784 x 128 + 128 = 100,480 parameters
  Layer 2: 128 x 64 + 64 = 8,256 parameters
  Layer 3: 64 x 10 + 10 = 650 parameters
  Total: 109,386 trainable parameters (weights + biases)

Hyperparameters:
  Learning rate: 0.1
  Batch size: 64
  Epochs: 10

Training...
Epoch  1/10 | Train Loss: 0.3821 | Train Acc: 89.2% | Test Acc: 90.1% | Time: 12.3s
Epoch  2/10 | Train Loss: 0.1892 | Train Acc: 94.5% | Test Acc: 94.2% | Time: 11.8s
Epoch  3/10 | Train Loss: 0.1341 | Train Acc: 96.0% | Test Acc: 95.3% | Time: 11.9s
Epoch  4/10 | Train Loss: 0.1023 | Train Acc: 97.0% | Test Acc: 96.1% | Time: 12.1s
Epoch  5/10 | Train Loss: 0.0812 | Train Acc: 97.6% | Test Acc: 96.5% | Time: 12.0s
Epoch  6/10 | Train Loss: 0.0654 | Train Acc: 98.0% | Test Acc: 96.8% | Time: 11.7s
Epoch  7/10 | Train Loss: 0.0531 | Train Acc: 98.4% | Test Acc: 97.0% | Time: 11.9s
Epoch  8/10 | Train Loss: 0.0432 | Train Acc: 98.7% | Test Acc: 97.1% | Time: 12.2s
Epoch  9/10 | Train Loss: 0.0351 | Train Acc: 98.9% | Test Acc: 97.2% | Time: 11.8s
Epoch 10/10 | Train Loss: 0.0285 | Train Acc: 99.1% | Test Acc: 97.2% | Time: 12.0s

Training complete!
Final Test Accuracy: 97.2% (9720/10000 correct)
Model saved to: mnist_model.npz

Confusion Matrix:
        0    1    2    3    4    5    6    7    8    9
    +----------------------------------------------------
  0 |  971    0    2    1    0    2    2    1    1    0
  1 |    0 1121    3    2    0    1    3    1    4    0
  2 |    4    2 1002    5    3    0    2    7    6    1
  3 |    0    0    4  983    0   10    0    5    5    3
  4 |    1    0    2    0  958    0    5    2    2   12
  5 |    2    0    0    7    1  869    6    1    4    2
  6 |    5    3    1    0    4    4  939    0    2    0
  7 |    1    4    8    4    1    0    0  997    2   11
  8 |    3    0    3    6    4    5    3    4  940    6
  9 |    3    4    0    4   10    4    1    6    3  974

Predicting Individual Digits

$ python predict_digit.py --image my_drawing.png
Loading model from mnist_model.npz...
Loading image: my_drawing.png (28x28)

Preprocessing:
  - Grayscale conversion
  - Resize to 28x28
  - Invert if necessary (MNIST has white digits on black)
  - Normalize to [0, 1]

Forward pass through network...

Prediction: 7 (Confidence: 99.2%)

Full probability distribution:
  0: 0.001%
  1: 0.003%
  2: 0.012%
  3: 0.002%
  4: 0.008%
  5: 0.001%
  6: 0.002%
  7: 99.215%  <-- PREDICTED
  8: 0.721%
  9: 0.035%

Visualizing Learned Weights

$ python visualize_weights.py
Loading model from mnist_model.npz...

Generating visualization of first layer weights...
Each image shows what input pattern activates one hidden neuron.

Saved: weights_layer1.png (16x8 grid of 784->128 learned patterns)
Saved: weight_histogram.png (distribution of weight values)

Observations:
  - Some neurons detect horizontal strokes
  - Some neurons detect vertical strokes
  - Some neurons detect curves/corners
  - Weights form interpretable patterns resembling digit fragments

Example Weight Visualization

When you visualize the 128 neurons of the first layer (each is a 28x28 pattern):

+---+---+---+---+---+---+---+---+
|   |   | / | \ |   | - | | | O |   <- Different patterns
+---+---+---+---+---+---+---+---+
| C | ) | ( | _ |   |   |   |   |
+---+---+---+---+---+---+---+---+
...

You'll see:
- Edge detectors (horizontal, vertical, diagonal lines)
- Curve detectors (arcs like in 0, 6, 8, 9)
- Corner detectors (L-shapes like in 7, 4)
- Position-specific patterns (top-left vs bottom-right)

Solution Architecture

High-Level System Design

+------------------------------------------------------------------+
|                      MNIST Training Pipeline                      |
+------------------------------------------------------------------+
|                                                                   |
|   +-----------+     +------------+     +-----------+              |
|   |  Data     |     | Network    |     | Training  |              |
|   |  Module   |     | Module     |     | Module    |              |
|   +-----------+     +------------+     +-----------+              |
|        |                  |                  |                    |
|   - Load MNIST       - Layer class      - Mini-batch SGD         |
|   - Preprocess       - Forward pass     - Learning rate          |
|   - Batch            - Backward pass    - Epoch loop             |
|   - Shuffle          - Weight init      - Metrics                |
|                                                                   |
+------------------------------------------------------------------+

Data Flow Through the Network

Input Image (28x28 = 784 pixels)
         |
         v
   +---------------+
   |   Flatten     |
   |   (28x28->784)|
   +---------------+
         |
         v
   +---------------+
   | Dense Layer 1 |
   | 784 -> 128    |
   | + ReLU        |
   +---------------+
         |
         v
   +---------------+
   | Dense Layer 2 |
   | 128 -> 64     |
   | + ReLU        |
   +---------------+
         |
         v
   +---------------+
   | Output Layer  |
   | 64 -> 10      |
   | + Softmax     |
   +---------------+
         |
         v
   Probabilities [p0, p1, ..., p9]

Matrix Dimensions at Each Step

Understanding shapes is crucial for debugging. For a batch of 64 images:

Layer                    Input Shape    Weights Shape    Output Shape
----------------------   -----------    -------------    ------------
Input (batch of images)  (64, 784)      -                -
Dense Layer 1            (64, 784)      (784, 128)       (64, 128)
ReLU Activation          (64, 128)      -                (64, 128)
Dense Layer 2            (64, 128)      (128, 64)        (64, 64)
ReLU Activation          (64, 64)       -                (64, 64)
Output Layer             (64, 64)       (64, 10)         (64, 10)
Softmax                  (64, 10)       -                (64, 10)
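
The same table, verified with throwaway arrays (a sketch; the weights are random and untrained, only the shapes matter):

import numpy as np

batch = np.random.randn(64, 784)
W1, b1 = np.random.randn(784, 128), np.zeros(128)
W2, b2 = np.random.randn(128, 64),  np.zeros(64)
W3, b3 = np.random.randn(64, 10),   np.zeros(10)

h1 = np.maximum(0, batch @ W1 + b1)   # (64, 128)
h2 = np.maximum(0, h1 @ W2 + b2)      # (64, 64)
logits = h2 @ W3 + b3                 # (64, 10)
print(h1.shape, h2.shape, logits.shape)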

Class Structure

# Recommended module structure

class Layer:
    """Abstract base class for network layers."""
    def forward(self, x): pass
    def backward(self, grad): pass

class Dense(Layer):
    """Fully connected layer."""
    def __init__(self, input_dim, output_dim):
        self.W = he_init((input_dim, output_dim))
        self.b = np.zeros(output_dim)

    def forward(self, x):
        # Cache input for backward pass
        self.x = x
        return x @ self.W + self.b

    def backward(self, grad):
        # Gradient w.r.t. weights
        self.dW = self.x.T @ grad
        self.db = np.sum(grad, axis=0)
        # Gradient to pass to previous layer
        return grad @ self.W.T

class ReLU(Layer):
    """ReLU activation function."""
    def forward(self, x):
        self.x = x
        return np.maximum(0, x)

    def backward(self, grad):
        return grad * (self.x > 0)

class Softmax:
    """Softmax activation (output layer)."""
    def forward(self, x):
        exp_x = np.exp(x - np.max(x, axis=1, keepdims=True))
        return exp_x / np.sum(exp_x, axis=1, keepdims=True)

class Network:
    """Sequential network of layers."""
    def __init__(self, layers):
        self.layers = layers

    def forward(self, x):
        for layer in self.layers:
            x = layer.forward(x)
        return x

    def backward(self, grad):
        for layer in reversed(self.layers):
            grad = layer.backward(grad)
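
A minimal usage sketch of the classes above (assumes he_init from the initialization section is in scope; the weights are untrained, so the outputs are meaningless beyond their shapes):

import numpy as np

net = Network([
    Dense(784, 128), ReLU(),
    Dense(128, 64),  ReLU(),
    Dense(64, 10),
])

x = np.random.randn(32, 784).astype(np.float32)
logits = net.forward(x)               # (32, 10) raw scores
probs = Softmax().forward(logits)     # rows sum to 1
print(logits.shape, probs.sum(axis=1)[:3])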

File Organization

mnist-from-scratch/
+-- data/
|   +-- download_mnist.py      # Script to download MNIST
|   +-- train-images-idx3-ubyte
|   +-- train-labels-idx1-ubyte
|   +-- t10k-images-idx3-ubyte
|   +-- t10k-labels-idx1-ubyte
|
+-- src/
|   +-- layers.py              # Dense, ReLU, Softmax
|   +-- network.py             # Network class
|   +-- loss.py                # CrossEntropyLoss
|   +-- data_loader.py         # MNIST loading, batching
|   +-- utils.py               # One-hot, shuffling, metrics
|
+-- train.py                   # Main training script
+-- predict.py                 # Inference on new images
+-- visualize.py               # Weight visualization
+-- tests/
    +-- test_layers.py         # Unit tests for layers
    +-- test_gradients.py      # Gradient checking

Phased Implementation Guide

Phase 1: Load and Explore MNIST Data (2-3 hours)

Goal: Download MNIST and understand its structure.

Tasks:

  1. Download the dataset

    MNIST is available at http://yann.lecun.com/exdb/mnist/ or through various libraries.

    # Option 1: Manual download
    import urllib.request
    import gzip
    
    urls = {
        'train_images': 'http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz',
        'train_labels': 'http://yann.lecun.com/exdb/mnist/train-labels-idx1-ubyte.gz',
        'test_images': 'http://yann.lecun.com/exdb/mnist/t10k-images-idx3-ubyte.gz',
        'test_labels': 'http://yann.lecun.com/exdb/mnist/t10k-labels-idx1-ubyte.gz',
    }
    
    for name, url in urls.items():
        print(f"Downloading {name}...")
        urllib.request.urlretrieve(url, f'data/{name}.gz')
    
  2. Parse the IDX file format

    MNIST uses a custom binary format. Write a parser:

    def load_mnist_images(filepath):
        """Load MNIST images from IDX file."""
        with gzip.open(filepath, 'rb') as f:
            # Magic number (2051 for images)
            magic = int.from_bytes(f.read(4), 'big')
            assert magic == 2051
    
            # Dimensions
            num_images = int.from_bytes(f.read(4), 'big')
            rows = int.from_bytes(f.read(4), 'big')
            cols = int.from_bytes(f.read(4), 'big')
    
            # Read image data
            data = np.frombuffer(f.read(), dtype=np.uint8)
            return data.reshape(num_images, rows, cols)
    
    def load_mnist_labels(filepath):
        """Load MNIST labels from IDX file."""
        with gzip.open(filepath, 'rb') as f:
            magic = int.from_bytes(f.read(4), 'big')
            assert magic == 2049
    
            num_labels = int.from_bytes(f.read(4), 'big')
            return np.frombuffer(f.read(), dtype=np.uint8)
    
  3. Explore the data

    train_images = load_mnist_images('data/train-images-idx3-ubyte.gz')
    train_labels = load_mnist_labels('data/train-labels-idx1-ubyte.gz')
    
    print(f"Shape: {train_images.shape}")  # (60000, 28, 28)
    print(f"Pixel range: {train_images.min()} to {train_images.max()}")  # 0 to 255
    print(f"Label distribution: {np.bincount(train_labels)}")
    
    # Visualize a few examples
    import matplotlib.pyplot as plt
    fig, axes = plt.subplots(2, 5, figsize=(10, 4))
    for i, ax in enumerate(axes.flat):
        ax.imshow(train_images[i], cmap='gray')
        ax.set_title(f"Label: {train_labels[i]}")
        ax.axis('off')
    plt.savefig('sample_digits.png')
    

Success Criteria: You can load all 60,000 training images and visualize samples.


Phase 2: Preprocess Data (1-2 hours)

Goal: Transform raw images into network-ready format.

Tasks:

  1. Flatten images

    28x28 images become 784-element vectors:

    def flatten_images(images):
        """Flatten 28x28 images to 784-length vectors."""
        return images.reshape(images.shape[0], -1)
    
    X_train = flatten_images(train_images)
    print(f"Flattened shape: {X_train.shape}")  # (60000, 784)
    
  2. Normalize pixel values

    Neural networks work better with inputs in [0, 1] or standardized:

    def normalize(X):
        """Normalize pixel values to [0, 1]."""
        return X.astype(np.float32) / 255.0
    
    X_train = normalize(X_train)
    print(f"Normalized range: {X_train.min():.3f} to {X_train.max():.3f}")
    
  3. Create validation split

    Reserve some training data for validation:

    def train_val_split(X, y, val_ratio=0.1):
        """Split data into training and validation sets."""
        n = len(y)
        indices = np.random.permutation(n)
        val_size = int(n * val_ratio)
    
        val_idx = indices[:val_size]
        train_idx = indices[val_size:]
    
        return (X[train_idx], y[train_idx],
                X[val_idx], y[val_idx])
    
    X_train, y_train, X_val, y_val = train_val_split(X_train, train_labels)
    

Success Criteria: Data is flattened, normalized, and split into train/val/test sets.


Phase 3: One-Hot Encode Labels (30 minutes)

Goal: Convert integer labels to one-hot vectors.

Tasks:

  1. Implement one-hot encoding

    def one_hot(labels, num_classes=10):
        """Convert integer labels to one-hot vectors."""
        n = len(labels)
        one_hot_matrix = np.zeros((n, num_classes), dtype=np.float32)
        one_hot_matrix[np.arange(n), labels] = 1.0
        return one_hot_matrix
    
  2. Verify correctness

    # Test
    labels = np.array([0, 1, 2, 3])
    encoded = one_hot(labels, 10)
    print(encoded)
    # [[1, 0, 0, 0, 0, 0, 0, 0, 0, 0],
    #  [0, 1, 0, 0, 0, 0, 0, 0, 0, 0],
    #  [0, 0, 1, 0, 0, 0, 0, 0, 0, 0],
    #  [0, 0, 0, 1, 0, 0, 0, 0, 0, 0]]
    
    assert np.all(encoded.sum(axis=1) == 1)  # Each row sums to exactly 1
    assert encoded.sum() == 4              # Total of 4 ones
    

Success Criteria: one_hot([3, 7], 10) returns correct 2x10 matrix.


Phase 4: Network Architecture Setup (2-3 hours)

Goal: Implement the Dense layer and ReLU activation classes.

Tasks:

  1. Implement Dense layer (forward only first)

    class Dense:
        def __init__(self, input_dim, output_dim):
            # He initialization
            self.W = np.random.randn(input_dim, output_dim) * np.sqrt(2.0 / input_dim)
            self.b = np.zeros(output_dim)
    
        def forward(self, x):
            """Forward pass: y = xW + b"""
            self.x = x  # Cache for backward
            return x @ self.W + self.b
    
  2. Implement ReLU activation

    class ReLU:
        def forward(self, x):
            """ReLU: max(0, x)"""
            self.x = x  # Cache for backward
            return np.maximum(0, x)
    
  3. Test the forward pass

    # Create a simple 2-layer network
    layer1 = Dense(784, 128)
    relu1 = ReLU()
    layer2 = Dense(128, 10)
    
    # Forward pass
    x = X_train[:5]  # 5 sample images
    h = layer1.forward(x)      # (5, 128)
    h = relu1.forward(h)       # (5, 128)
    out = layer2.forward(h)    # (5, 10)
    
    print(f"Output shape: {out.shape}")  # (5, 10)
    print(f"Sample output: {out[0]}")    # 10 raw scores (logits)
    

Success Criteria: Forward pass produces (batch_size, 10) output tensor.


Phase 5: Softmax and Cross-Entropy (2-3 hours)

Goal: Implement numerically stable softmax and cross-entropy loss.

Tasks:

  1. Implement stable softmax

    def softmax(z):
        """Numerically stable softmax."""
        # Subtract max for stability
        z_shifted = z - np.max(z, axis=1, keepdims=True)
        exp_z = np.exp(z_shifted)
        return exp_z / np.sum(exp_z, axis=1, keepdims=True)
    
  2. Test softmax properties

    logits = np.array([[1.0, 2.0, 3.0],
                       [1000.0, 1001.0, 1002.0]])  # Large values
    
    probs = softmax(logits)
    print(f"Probabilities:\n{probs}")
    print(f"Row sums: {probs.sum(axis=1)}")  # Should be [1.0, 1.0]
    assert np.allclose(probs.sum(axis=1), 1.0)  # All sum to 1
    
  3. Implement cross-entropy loss

    def cross_entropy_loss(probs, targets):
        """
        Compute cross-entropy loss.
        probs: (batch, num_classes) - softmax outputs
        targets: (batch, num_classes) - one-hot encoded
        """
        # Clip to prevent log(0)
        probs_clipped = np.clip(probs, 1e-10, 1 - 1e-10)
        # Cross-entropy: -sum(target * log(pred))
        loss = -np.sum(targets * np.log(probs_clipped)) / len(probs)
        return loss
    
  4. Test loss computation

    # Perfect prediction
    probs_perfect = np.array([[0.99, 0.01]])
    targets = np.array([[1.0, 0.0]])
    loss = cross_entropy_loss(probs_perfect, targets)
    print(f"Perfect prediction loss: {loss:.4f}")  # Very low
    
    # Wrong prediction
    probs_wrong = np.array([[0.01, 0.99]])
    loss = cross_entropy_loss(probs_wrong, targets)
    print(f"Wrong prediction loss: {loss:.4f}")  # Very high
    

Success Criteria: Softmax outputs sum to 1; loss is low for correct predictions, high for wrong.


Phase 6: Forward Pass (Vectorized) (2-3 hours)

Goal: Chain all layers together for a complete forward pass.

Tasks:

  1. Build the Network class

    class Network:
        def __init__(self, layer_sizes):
            """
            Create network with given layer sizes.
            Example: [784, 128, 64, 10] creates 3 Dense layers.
            """
            self.layers = []
            for i in range(len(layer_sizes) - 1):
                self.layers.append(Dense(layer_sizes[i], layer_sizes[i+1]))
                if i < len(layer_sizes) - 2:  # No ReLU after last layer
                    self.layers.append(ReLU())
    
        def forward(self, x):
            """Forward pass through all layers."""
            for layer in self.layers:
                x = layer.forward(x)
            return x
    
        def predict(self, x):
            """Get class predictions."""
            logits = self.forward(x)
            probs = softmax(logits)
            return np.argmax(probs, axis=1)
    
  2. Test on full dataset

    net = Network([784, 128, 64, 10])
    
    # Forward pass on all training data
    logits = net.forward(X_train)
    probs = softmax(logits)
    
    # Initial accuracy (should be ~10% random)
    predictions = np.argmax(probs, axis=1)
    accuracy = (predictions == y_train).mean()
    print(f"Initial accuracy: {accuracy:.1%}")  # ~10%
    

Success Criteria: Network makes predictions (even if random at first).


Phase 7: Backward Pass (Vectorized) (4-5 hours)

Goal: Implement gradient computation for all layers.

This is the hardest phase. Take your time.

Tasks:

  1. Implement Dense backward pass

    class Dense:
        # ... (keep forward method)
    
        def backward(self, grad_output):
            """
            Backward pass.
            grad_output: gradient of loss w.r.t. this layer's output
            Returns: gradient of loss w.r.t. this layer's input
            """
            # Note: the loss gradient (softmax_cross_entropy_backward, below)
            # is already averaged over the batch, so no extra division here.

            # Gradient w.r.t. weights
            self.dW = self.x.T @ grad_output

            # Gradient w.r.t. biases
            self.db = np.sum(grad_output, axis=0)
    
            # Gradient w.r.t. input (to pass to previous layer)
            grad_input = grad_output @ self.W.T
    
            return grad_input
    
  2. Implement ReLU backward pass

    class ReLU:
        # ... (keep forward method)
    
        def backward(self, grad_output):
            """
            Backward pass.
            ReLU gradient: 1 if x > 0, else 0
            """
            return grad_output * (self.x > 0)
    
  3. Implement softmax + cross-entropy gradient

    Recall the beautiful result: gradient is simply (prediction - target)

    def softmax_cross_entropy_backward(probs, targets):
        """
        Gradient of cross-entropy loss w.r.t. logits.
        This combines softmax and cross-entropy gradients.
        """
        return (probs - targets) / len(probs)
    
  4. Add backward method to Network

    class Network:
        # ... (keep existing methods)
    
        def backward(self, grad):
            """Backward pass through all layers in reverse."""
            for layer in reversed(self.layers):
                grad = layer.backward(grad)
    
  5. Gradient checking (CRITICAL)

    Verify your gradients numerically:

    def gradient_check(net, x, y, epsilon=1e-5):
        """Verify analytical gradients match numerical gradients."""
        # Get analytical gradient
        logits = net.forward(x)
        probs = softmax(logits)
        loss = cross_entropy_loss(probs, y)
        grad = softmax_cross_entropy_backward(probs, y)
        net.backward(grad)
    
        # Check each weight in first layer
        layer = net.layers[0]
        for i in range(min(5, layer.W.shape[0])):
            for j in range(min(5, layer.W.shape[1])):
                # Numerical gradient
                original = layer.W[i, j]
    
                layer.W[i, j] = original + epsilon
                logits_plus = net.forward(x)
                loss_plus = cross_entropy_loss(softmax(logits_plus), y)
    
                layer.W[i, j] = original - epsilon
                logits_minus = net.forward(x)
                loss_minus = cross_entropy_loss(softmax(logits_minus), y)
    
                layer.W[i, j] = original
    
                numerical_grad = (loss_plus - loss_minus) / (2 * epsilon)
                analytical_grad = layer.dW[i, j]
    
                rel_error = abs(numerical_grad - analytical_grad) / (abs(numerical_grad) + abs(analytical_grad) + 1e-8)
                if rel_error > 1e-5:
                    print(f"Gradient check FAILED at ({i},{j}): num={numerical_grad:.6f}, ana={analytical_grad:.6f}")
                    return False
    
        print("Gradient check PASSED!")
        return True
    

Success Criteria: Gradient check passes (relative error < 1e-5).


Phase 8: Mini-Batch Data Loader (1-2 hours)

Goal: Create efficient batch iteration.

Tasks:

  1. Implement batch generator

    def create_batches(X, y, batch_size, shuffle=True):
        """
        Generate mini-batches from data.
        Yields (X_batch, y_batch) tuples.
        """
        n = len(y)
        indices = np.arange(n)
    
        if shuffle:
            np.random.shuffle(indices)
    
        for start in range(0, n, batch_size):
            end = min(start + batch_size, n)
            batch_idx = indices[start:end]
            yield X[batch_idx], y[batch_idx]
    
  2. Test batching

    batch_size = 64
    num_batches = 0
    total_samples = 0
    
    for X_batch, y_batch in create_batches(X_train, y_train_onehot, batch_size):
        num_batches += 1
        total_samples += len(y_batch)
    
    print(f"Number of batches: {num_batches}")
    print(f"Total samples: {total_samples}")
    print(f"Expected batches: {len(y_train) // batch_size + 1}")
    

Success Criteria: Batches cover all data exactly once per epoch.


Phase 9: Training Loop (3-4 hours)

Goal: Put it all together into a training loop.

Tasks:

  1. Implement SGD update

    def sgd_update(net, learning_rate):
        """Update weights using SGD."""
        for layer in net.layers:
            if hasattr(layer, 'W'):
                layer.W -= learning_rate * layer.dW
                layer.b -= learning_rate * layer.db
    
  2. Implement training loop

    def train(net, X_train, y_train, X_val, y_val,
              epochs=10, batch_size=64, learning_rate=0.1):
        """Train the network."""
        y_train_onehot = one_hot(y_train)
    
        for epoch in range(epochs):
            epoch_loss = 0
            num_batches = 0
    
            # Training
            for X_batch, y_batch in create_batches(X_train, y_train_onehot, batch_size):
                # Forward
                logits = net.forward(X_batch)
                probs = softmax(logits)
                loss = cross_entropy_loss(probs, y_batch)
                epoch_loss += loss
                num_batches += 1
    
                # Backward
                grad = softmax_cross_entropy_backward(probs, y_batch)
                net.backward(grad)
    
                # Update
                sgd_update(net, learning_rate)
    
            # Evaluation
            train_acc = evaluate(net, X_train, y_train)
            val_acc = evaluate(net, X_val, y_val)
    
            print(f"Epoch {epoch+1}/{epochs} | "
                  f"Loss: {epoch_loss/num_batches:.4f} | "
                  f"Train Acc: {train_acc:.1%} | "
                  f"Val Acc: {val_acc:.1%}")
    
    def evaluate(net, X, y):
        """Compute accuracy."""
        predictions = net.predict(X)
        return (predictions == y).mean()
    
  3. Train and monitor

    net = Network([784, 128, 64, 10])
    train(net, X_train, y_train, X_val, y_val,
          epochs=10, batch_size=64, learning_rate=0.1)
    

Success Criteria: Accuracy increases each epoch; reaches >95% on validation.


Phase 10: Evaluation and Visualization (2-3 hours)

Goal: Final testing and weight visualization.

Tasks:

  1. Evaluate on test set

    test_acc = evaluate(net, X_test, y_test)
    print(f"Test accuracy: {test_acc:.2%}")  # Should be ~97%
    
  2. Generate confusion matrix

    def confusion_matrix(y_true, y_pred, num_classes=10):
        """Generate confusion matrix."""
        cm = np.zeros((num_classes, num_classes), dtype=int)
        for t, p in zip(y_true, y_pred):
            cm[t, p] += 1
        return cm
    
    predictions = net.predict(X_test)
    cm = confusion_matrix(y_test, predictions)
    print("Confusion Matrix:")
    print(cm)
    
  3. Visualize learned weights

    def visualize_weights(layer, num_show=64):
        """Visualize first layer weights as images."""
        W = layer.W  # (784, hidden_size)
    
        # Each column is one neuron's weights
        num_neurons = min(num_show, W.shape[1])
        grid_size = int(np.ceil(np.sqrt(num_neurons)))
    
        fig, axes = plt.subplots(grid_size, grid_size, figsize=(12, 12))
        for i, ax in enumerate(axes.flat):
            if i < num_neurons:
                weights = W[:, i].reshape(28, 28)
                ax.imshow(weights, cmap='RdBu', vmin=-0.5, vmax=0.5)
            ax.axis('off')
    
        plt.savefig('weights_visualization.png')
        print("Saved weights_visualization.png")
    
    visualize_weights(net.layers[0])
    
  4. Save and load model

    def save_model(net, filepath):
        """Save model weights."""
        weights = {}
        for i, layer in enumerate(net.layers):
            if hasattr(layer, 'W'):
                weights[f'layer_{i}_W'] = layer.W
                weights[f'layer_{i}_b'] = layer.b
        np.savez(filepath, **weights)
    
    def load_model(net, filepath):
        """Load model weights."""
        data = np.load(filepath)
        for i, layer in enumerate(net.layers):
            if hasattr(layer, 'W'):
                layer.W = data[f'layer_{i}_W']
                layer.b = data[f'layer_{i}_b']
    
    save_model(net, 'mnist_model.npz')
    

Success Criteria: 97%+ test accuracy; weight visualizations show interpretable patterns.


Questions to Guide Your Design

Before implementing each phase, think through these:

Data Representation

  • Why do we flatten 28x28 images instead of keeping them 2D?
  • What would happen if we didn’t normalize pixel values?
  • Why is shuffling between epochs important?

Network Architecture

  • Why start with 128 and 64 hidden units? What happens with fewer? More?
  • What’s the minimum network that can achieve 90%? 95%?
  • Why do we use ReLU instead of sigmoid in hidden layers?

Softmax and Loss

  • What happens if we use sigmoid instead of softmax for the output?
  • Why does softmax use exp() specifically?
  • How does cross-entropy “punish” confident wrong answers?

Training Dynamics

  • What learning rate is too high? Too low? How do you know?
  • How many epochs until the network starts overfitting?
  • What batch size works best? Why?

Debugging

  • If accuracy is stuck at 10%, what’s wrong?
  • If loss is NaN, what’s the likely cause?
  • How do you verify gradients are correct?

Thinking Exercise

Before coding, work through this example by hand.

Setup: 2-class classification (cat vs dog), 3 input features, 1 hidden layer with 2 neurons

Input: x = [0.5, 0.8, 0.2]
Hidden layer: W1 = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]], b1 = [0.1, 0.1]
Output layer: W2 = [[0.3, 0.7], [0.4, 0.6]], b2 = [0.1, 0.1]
True label: y = [1, 0] (cat)

Trace the forward pass:

  1. Hidden layer pre-activation: z1 = x @ W1 + b1 = ?

  2. Hidden layer activation (ReLU): h1 = max(0, z1) = ?

  3. Output layer pre-activation: z2 = h1 @ W2 + b2 = ?

  4. Softmax: probs = softmax(z2) = ?

  5. Cross-entropy loss: L = -sum(y * log(probs)) = ?

Now trace backward:

  1. Gradient of loss w.r.t. z2: dL/dz2 = probs - y = ?

  2. Gradient w.r.t. W2: dL/dW2 = h1.T @ dL/dz2 = ?

Continue through the network…
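
When you have finished the hand calculation, a short NumPy sketch (same numbers as the setup above) lets you check your arithmetic:

import numpy as np

x  = np.array([0.5, 0.8, 0.2])
W1 = np.array([[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]); b1 = np.array([0.1, 0.1])
W2 = np.array([[0.3, 0.7], [0.4, 0.6]]);             b2 = np.array([0.1, 0.1])
y  = np.array([1.0, 0.0])                            # true label: cat

z1 = x @ W1 + b1                       # hidden pre-activation
h1 = np.maximum(0, z1)                 # ReLU
z2 = h1 @ W2 + b2                      # output pre-activation
probs = np.exp(z2) / np.exp(z2).sum()  # softmax
loss = -np.sum(y * np.log(probs))      # cross-entropy

dz2 = probs - y                        # dL/dz2
dW2 = np.outer(h1, dz2)                # dL/dW2 = h1^T dz2 for a single example
print(z1, h1, z2, probs, loss, dz2, dW2, sep="\n")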


Testing Strategy

Unit Tests

# tests/test_layers.py

def test_dense_forward_shape():
    layer = Dense(10, 5)
    x = np.random.randn(3, 10)
    y = layer.forward(x)
    assert y.shape == (3, 5)

def test_relu_forward():
    relu = ReLU()
    x = np.array([[-1, 0, 1], [2, -2, 0]])
    y = relu.forward(x)
    expected = np.array([[0, 0, 1], [2, 0, 0]])
    assert np.allclose(y, expected)

def test_softmax_sums_to_one():
    logits = np.random.randn(5, 10)
    probs = softmax(logits)
    assert np.allclose(probs.sum(axis=1), 1.0)

def test_softmax_all_positive():
    logits = np.random.randn(5, 10) * 100  # Large values
    probs = softmax(logits)
    assert (probs > 0).all()

def test_one_hot_encoding():
    labels = np.array([0, 5, 9])
    encoded = one_hot(labels, 10)
    assert encoded.shape == (3, 10)
    assert encoded[0, 0] == 1
    assert encoded[1, 5] == 1
    assert encoded[2, 9] == 1
    assert encoded.sum() == 3

Gradient Checking

def test_gradient_correctness():
    net = Network([5, 3, 2])
    x = np.random.randn(2, 5)
    y = one_hot(np.array([0, 1]), 2)

    assert gradient_check(net, x, y)

Integration Tests

def test_can_overfit_small_batch():
    """Network should perfectly fit a tiny dataset."""
    net = Network([784, 50, 10])
    X_tiny = X_train[:10]
    y_tiny = y_train[:10]

    # train_one_epoch: assumed helper that runs one epoch of the Phase 9
    # training loop (forward, backward, SGD update) over the given data
    for _ in range(500):
        train_one_epoch(net, X_tiny, y_tiny, batch_size=10, lr=0.5)

    acc = evaluate(net, X_tiny, y_tiny)
    assert acc > 0.99, f"Should overfit tiny batch, got {acc}"

Common Pitfalls and Debugging Tips

Pitfall 1: Softmax Numerical Overflow

Symptom: RuntimeWarning: overflow encountered in exp, outputs are NaN

Cause: Exponentiating large numbers (e.g., e^1000 = inf)

Fix: Subtract the max before exponentiating

# BAD
exp_z = np.exp(z)

# GOOD
z_shifted = z - np.max(z, axis=1, keepdims=True)
exp_z = np.exp(z_shifted)

Pitfall 2: Log of Zero

Symptom: -inf in loss, NaN in gradients

Cause: Taking log(0) when softmax outputs exactly 0

Fix: Clip probabilities

probs_clipped = np.clip(probs, 1e-10, 1 - 1e-10)
log_probs = np.log(probs_clipped)

Pitfall 3: Shape Mismatch in Matrix Multiplication

Symptom: ValueError: shapes (64,128) and (784,128) not aligned

Cause: Wrong dimension order in weight matrix or forgot to transpose

Debug: Print shapes at every step

print(f"x.shape: {x.shape}")
print(f"W.shape: {W.shape}")
print(f"Expected: ({x.shape[0]}, {W.shape[1]})")

Pitfall 4: Forgetting to Cache Inputs

Symptom: Backward pass fails with “AttributeError: ‘Dense’ object has no attribute ‘x’”

Cause: Forward pass didn’t save the input for gradient computation

Fix: Always cache in forward:

def forward(self, x):
    self.x = x  # CACHE THIS
    return x @ self.W + self.b

Pitfall 5: Gradient Not Scaled Correctly Over the Batch

Symptom: Training is unstable (loss oscillates wildly) or crawls along far too slowly

Cause: Gradients are summed with no averaging, or divided by the batch size twice

Fix: Divide by the batch size exactly once - in the loss gradient (as softmax_cross_entropy_backward does in this guide), or alternatively in each layer's backward pass, but never in both places

grad = (probs - targets) / len(probs)  # average once here...
self.dW = self.x.T @ grad_output       # ...and not again in the layer

Pitfall 6: Learning Rate Too High

Symptom: Loss increases or oscillates instead of decreasing

Cause: Updates overshoot the minimum

Fix: Lower the learning rate (try 0.01, then 0.001) until the loss decreases steadily; raise it gradually toward 0.1 only once training is stable

Pitfall 7: Weights All Zero or Same

Symptom: All neurons produce identical outputs

Cause: Zero initialization or same random seed

Fix: Use proper He initialization

W = np.random.randn(in_dim, out_dim) * np.sqrt(2.0 / in_dim)

Pitfall 8: Accuracy Stuck at 10%

Symptom: Network never learns better than random guessing

Cause: Usually a gradient bug, or incorrect loss computation

Debug (a quick sanity check is sketched after this list):

  1. Run gradient check
  2. Verify loss decreases on tiny dataset
  3. Check that predictions change after update
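
One extra sanity check worth running first (a sketch, assuming the softmax, cross_entropy_loss, and one_hot helpers from the earlier phases): with freshly initialized weights and 10 roughly balanced classes, the loss should start near -log(1/10) ≈ 2.303. A wildly different starting value usually means the loss or softmax wiring is wrong, not the learning itself.

probs = softmax(net.forward(X_train[:256]))
print(cross_entropy_loss(probs, one_hot(y_train[:256])))  # expect roughly 2.3 before training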

Interview Questions

Prepare to answer these after completing the project:

Fundamental Understanding

  1. “Explain the softmax function. Why do we use it instead of sigmoid for multi-class classification?”

    Key points: Softmax normalizes to probabilities summing to 1, creating competition between classes. Sigmoid treats each class independently, which doesn’t make sense for mutually exclusive classes.

  2. “Derive the gradient of cross-entropy loss with respect to the logits when using softmax.”

    Key insight: The gradient simplifies beautifully to (prediction - target). Show the derivation step by step.

  3. “What is the log-sum-exp trick and why is it important?”

    Key points: Prevents numerical overflow when computing softmax. Subtracting the max before exponentiating keeps numbers in a reasonable range without changing the result.

Implementation Details

  1. “How does mini-batch gradient descent differ from stochastic and batch gradient descent? What are the tradeoffs?”

    Key points: Batch size affects gradient noise (regularization), memory usage, and convergence speed. Mini-batch is the practical middle ground.

  2. “Why do we need to shuffle the training data between epochs?”

    Key points: Prevents the network from learning spurious patterns in the order of examples. Reduces variance in gradient estimates.

  3. “Explain Xavier/Glorot and He initialization. When would you use each?”

    Key points: Xavier for sigmoid/tanh, He for ReLU. Both aim to keep variance stable through layers, preventing vanishing/exploding gradients.

Debugging and Analysis

  1. “Your MNIST network is stuck at 10% accuracy. Walk me through how you would debug this.”

    Key points: Check gradient correctness, verify loss decreases on tiny batch, check shapes, verify data preprocessing, start with simpler network.

  2. “Look at these weight visualizations. What do they tell you about what the network learned?”

    Key points: First layer learns edge detectors, templates for digit parts. Each neuron specializes in detecting specific patterns.

  3. “The training accuracy is 99.5% but test accuracy is 94%. What’s happening and how would you fix it?”

    Key points: Overfitting. Solutions include regularization (L2, dropout), early stopping, data augmentation, simpler model.

Advanced Concepts

  1. “How would you modify this network to handle Fashion-MNIST without changing the architecture?”

    Key points: No architecture change needed - just different data. Might need different hyperparameters. This demonstrates the generality of the approach.


Hints in Layers

Hint 1: Getting Started Start with the data. Load MNIST, visualize examples, verify shapes. Many bugs come from misunderstanding the data format. MNIST images are grayscale with white digits on black background (0 = black, 255 = white).

Hint 2: Building Layers Implement forward pass first for all layers. Test that you can pass data through the whole network. Shapes should be: (batch, 784) -> (batch, 128) -> (batch, 64) -> (batch, 10).

Hint 3: Softmax Stability Always subtract the max before exponentiating. The most common bug is numerical overflow/underflow in softmax. Use this pattern:

z_max = np.max(z, axis=1, keepdims=True)
exp_z = np.exp(z - z_max)
softmax = exp_z / exp_z.sum(axis=1, keepdims=True)

Hint 4: Gradient of Softmax + Cross-Entropy Don’t compute them separately! The combined gradient is simply (softmax_output - one_hot_target). This is one of the most elegant results in deep learning.

Hint 5: Gradient Checking Before training, verify your gradients numerically. Use epsilon = 1e-5, compute (f(x+eps) - f(x-eps)) / (2*eps), compare to your analytical gradient. Relative error should be < 1e-5.

Hint 6: First Training Run Start with a simpler network (e.g., [784, 10] with no hidden layers). This is just logistic regression. It should reach ~92% accuracy. If not, you have a bug in your core training loop.
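
A sketch of that baseline, reusing the Network and train helpers from Phases 6 and 9 (assumed to be defined or importable in your project):

baseline = Network([784, 10])   # no hidden layers: plain softmax regression
train(baseline, X_train, y_train, X_val, y_val,
      epochs=5, batch_size=64, learning_rate=0.1)
# Roughly 92% validation accuracy is expected; much lower points to a bug
# in the loss, gradients, or update step rather than in the architecture.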

Hint 7: Weight Visualization The first Dense layer has shape (784, 128). Each column is one neuron’s weights, which you can reshape to 28x28 and visualize. These show what pattern each neuron is “looking for.”


Extensions and Challenges

Extension 1: Add Dropout

Dropout randomly sets neurons to zero during training, providing regularization:

class Dropout:
    def __init__(self, p=0.5):
        self.p = p  # Probability of keeping a neuron

    def forward(self, x, training=True):
        if not training:
            return x
        self.mask = np.random.binomial(1, self.p, x.shape) / self.p
        return x * self.mask

    def backward(self, grad):
        return grad * self.mask

Challenge: Add dropout after each ReLU. Compare test accuracy with and without dropout.

Extension 2: Visualize Learned Features

Beyond raw weights, visualize what inputs maximally activate each neuron:

def visualize_activation_maximization(net, layer_idx, neuron_idx, num_steps=100):
    """Find input that maximally activates a specific neuron."""
    x = np.random.randn(1, 784) * 0.01

    for step in range(num_steps):
        # Forward to target layer
        h = x
        for i, layer in enumerate(net.layers[:layer_idx+1]):
            h = layer.forward(h)

        # Gradient w.r.t. target neuron
        grad = np.zeros_like(h)
        grad[0, neuron_idx] = 1

        # Backward to input
        for layer in reversed(net.layers[:layer_idx+1]):
            grad = layer.backward(grad)

        # Gradient ascent
        x += 0.1 * grad
        x = np.clip(x, -1, 1)

    return x.reshape(28, 28)

Extension 3: Try Fashion-MNIST

Fashion-MNIST is a drop-in replacement for MNIST with clothing items instead of digits:

  • Same format: 60,000 training, 10,000 test, 28x28 grayscale
  • 10 classes: T-shirt, Trouser, Pullover, Dress, Coat, Sandal, Shirt, Sneaker, Bag, Ankle boot
  • More challenging than digits (~89% accuracy vs ~97%)

Challenge: Achieve >88% on Fashion-MNIST with your implementation.

Extension 4: Learning Rate Scheduling

Implement learning rate decay:

def step_decay(initial_lr, epoch, drop=0.5, epochs_drop=5):
    """Reduce LR by factor every N epochs."""
    return initial_lr * (drop ** (epoch // epochs_drop))

def exponential_decay(initial_lr, epoch, decay_rate=0.95):
    """Exponential decay."""
    return initial_lr * (decay_rate ** epoch)

Extension 5: Momentum

Add momentum to SGD for faster convergence:

class MomentumSGD:
    def __init__(self, net, lr=0.01, momentum=0.9):
        self.lr = lr
        self.momentum = momentum
        self.velocity = {}
        for i, layer in enumerate(net.layers):
            if hasattr(layer, 'W'):
                self.velocity[f'{i}_W'] = np.zeros_like(layer.W)
                self.velocity[f'{i}_b'] = np.zeros_like(layer.b)

    def update(self, net):
        for i, layer in enumerate(net.layers):
            if hasattr(layer, 'W'):
                self.velocity[f'{i}_W'] = (
                    self.momentum * self.velocity[f'{i}_W'] -
                    self.lr * layer.dW
                )
                layer.W += self.velocity[f'{i}_W']

                self.velocity[f'{i}_b'] = (
                    self.momentum * self.velocity[f'{i}_b'] -
                    self.lr * layer.db
                )
                layer.b += self.velocity[f'{i}_b']
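
A usage sketch showing where MomentumSGD slots into the Phase 9 loop (replacing sgd_update; names as defined earlier in this guide):

optimizer = MomentumSGD(net, lr=0.01, momentum=0.9)

for X_batch, y_batch in create_batches(X_train, y_train_onehot, batch_size=64):
    probs = softmax(net.forward(X_batch))
    net.backward(softmax_cross_entropy_backward(probs, y_batch))
    optimizer.update(net)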

Real-World Connections

Optical Character Recognition (OCR)

MNIST is the ancestor of all OCR systems. Modern OCR uses:

  • CNNs instead of dense layers (Project 9)
  • Larger, higher-resolution images
  • Thousands of characters (Chinese, Japanese, etc.)
  • Sequence-to-sequence models for words

Industry application: Banks process millions of handwritten checks daily using MNIST-descended technology.

Check Reading and Document Processing

Every time you deposit a check via mobile app:

  1. Image is captured and preprocessed
  2. Handwritten amounts are segmented
  3. Each digit is classified (like MNIST)
  4. Business logic validates the result

Scale: Chase Bank processes 40+ million checks per day.

Medical Imaging

The same principles apply to:

  • Detecting tumors in X-rays
  • Classifying cell types in microscopy
  • Reading handwritten prescriptions

Impact: AI-assisted diagnosis can achieve specialist-level accuracy.

Postal Systems

USPS and postal services worldwide use handwritten digit recognition:

  • ZIP code reading (98% accuracy required)
  • Address parsing
  • Package routing

Fun fact: The postal service was one of the first commercial applications of neural networks (1990s).


Books That Will Help

Topic                       Book                                                      Specific Chapters
-------------------------   -------------------------------------------------------   ------------------------------------
MNIST walkthrough           “Neural Networks and Deep Learning” by Michael Nielsen    Ch. 1 (complete MNIST example)
Backpropagation intuition   “Grokking Deep Learning” by Andrew Trask                  Ch. 4-5 (hands-on gradient descent)
Mathematical foundations    “Deep Learning” by Goodfellow et al.                      Ch. 6 (deep feedforward networks)
Practical implementation    “Deep Learning with Python” by Chollet                    Ch. 2 (first neural network)
Numerical stability         “Numerical Recipes” by Press et al.                       Ch. 5 (function evaluation)
Weight initialization       “Deep Learning” by Goodfellow et al.                      Ch. 8.4 (parameter initialization)

Free Online Resources

  • Michael Nielsen’s book: http://neuralnetworksanddeeplearning.com/
    • Arguably the best free resource for understanding neural networks
    • Includes complete MNIST implementation with explanations
  • 3Blue1Brown Neural Networks series: YouTube
    • Excellent visualizations of gradient descent and backpropagation
    • Watch before implementing
  • Andrej Karpathy’s “micrograd”: GitHub
    • Minimal autograd engine in ~100 lines
    • Good reference for understanding backward pass

Self-Assessment Checklist

Before moving to Project 9 (CNN), verify you can:

Conceptual Understanding

  • Explain why softmax creates competition between classes
  • Derive the gradient of softmax + cross-entropy from first principles
  • Explain the log-sum-exp trick and why it’s necessary
  • Describe what one-hot encoding is and why we use it
  • Explain mini-batch gradient descent vs. full batch vs. stochastic

Implementation Skills

  • Implement softmax that doesn’t overflow for large inputs
  • Write a Dense layer with correct forward and backward passes
  • Implement ReLU activation with correct gradient
  • Create a data loader that shuffles and batches correctly
  • Save and load model weights

Debugging Abilities

  • Use gradient checking to verify your implementation
  • Debug a network stuck at 10% accuracy
  • Identify overfitting from training curves
  • Tune learning rate for stable training

Practical Results

  • Achieve >96% test accuracy on MNIST
  • Visualize learned weights as 28x28 images
  • Generate a confusion matrix and interpret it
  • Predict digits from your own handwritten images

Questions to Answer Without Looking

  1. What’s the shape of the gradient for a Dense layer with input (64, 784) and output (64, 128)?
  2. Why do we use cross-entropy instead of MSE for classification?
  3. What happens if you initialize all weights to zero?
  4. How many trainable parameters in a network [784 -> 128 -> 64 -> 10]?
  5. What does a horizontal stripe pattern in a weight visualization mean?

Next: P09: CNN From Scratch - Replace dense layers with convolutional layers and achieve >99% accuracy