P03: Neural Network from Scratch

Build a neural network that classifies handwritten digits, implementing forward propagation, backpropagation, and training, using only matrix operations.


Overview

Attribute        Value
Difficulty       Intermediate-Advanced
Time Estimate    2-3 weeks
Language         Python (recommended) or C
Prerequisites    Calculus basics (derivatives), matrix multiplication
Primary Book     "Math for Programmers" by Paul Orland

Learning Objectives

By completing this project, you will:

  1. See neural networks as linear algebra - Every layer is matrix multiplication + nonlinearity
  2. Master the chain rule in matrix form - Backpropagation is gradient computation
  3. Understand gradient descent geometrically - Navigating high-dimensional landscapes
  4. Grasp matrix transposes deeply - Why they appear in backprop
  5. Internalize batch processing - Matrix-matrix multiplication for efficiency
  6. Connect math to learning - Watch abstract matrices develop "knowledge"

Theoretical Foundation

Part 1: The Neuron - A Linear Function Plus Nonlinearity

A Single Neuron

The fundamental building block:

                         SINGLE NEURON

    Inputs              Weights          Sum          Output

    x₁ ───────w₁────┐
                    │
    x₂ ───────w₂────┼────→ Σ(wᵢxᵢ) + b ──→ σ(z) ──→ a
                    │           ↑
    x₃ ───────w₃────┘           │
                              bias

    Linear part:     z = w₁x₁ + w₂x₂ + w₃x₃ + b
    Activation:      a = σ(z)

Single Neuron

The linear part (z):

z = w · x + b = w₁x₁ + w₂x₂ + ... + wₙxₙ + b

This is a dot product! It measures how much the input aligns
with the weight vector.

The nonlinear part (activation function σ):

  • Without nonlinearity, stacking layers would still be linear
  • Common activations: sigmoid, tanh, ReLU

Why Nonlinearity is Essential

Linear layer 1:  y = W₁x + b₁
Linear layer 2:  z = W₂y + b₂ = W₂(W₁x + b₁) + b₂ = W₂W₁x + (W₂b₁ + b₂)

This is still just z = Ax + c, a single linear transformation!
Stacking linear layers gains nothing.

With nonlinearity:  y = σ(W₁x + b₁)
                    z = σ(W₂y + b₂)

Now the composition is nonlinear, and the network can approximate a very broad class of functions.
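
To see the collapse concretely, here is a minimal numpy sketch with arbitrary random matrices (not project code): two stacked linear layers produce exactly the same outputs as a single linear layer with A = W₂W₁ and c = W₂b₁ + b₂.

import numpy as np

rng = np.random.default_rng(0)
W1, b1 = rng.standard_normal((4, 3)), rng.standard_normal((4, 1))
W2, b2 = rng.standard_normal((2, 4)), rng.standard_normal((2, 1))
x = rng.standard_normal((3, 1))

# Two stacked linear layers...
z = W2 @ (W1 @ x + b1) + b2

# ...collapse to one linear layer with A = W2·W1 and c = W2·b1 + b2
A, c = W2 @ W1, W2 @ b1 + b2
assert np.allclose(z, A @ x + c)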

Part 2: The Layer - Many Neurons in Parallel

From Neuron to Layer

Instead of one neuron, we compute many simultaneously:

                      NEURAL NETWORK LAYER

    Input         Weight Matrix         Bias      Activation
    Vector                              Vector

    [x₁]     [w₁₁  w₁₂  w₁₃]   [x₁]     [b₁]             [a₁]
    [x₂]  →  [w₂₁  w₂₂  w₂₃] × [x₂]  +  [b₂]  →  σ() →   [a₂]
    [x₃]     [w₃₁  w₃₂  w₃₃]   [x₃]     [b₃]             [a₃]
             [w₄₁  w₄₂  w₄₃]            [b₄]             [a₄]

    3 inputs → 4 outputs (4 neurons)
    Weight matrix: 4×3 (rows = neurons, columns = inputs)

Neural Network Layer as Matrix Operation

The matrix equation:

z = Wx + b
a = σ(z)

Where:
- x is the input vector (n×1)
- W is the weight matrix (m×n)
- b is the bias vector (m×1)
- z is the pre-activation (m×1)
- a is the output/activation (m×1)

Each row of W is one neuron's weight vector:

Row 1 of W = weights for neuron 1
Row 2 of W = weights for neuron 2
...

(Wx)ᵢ = row i of W · x = weighted sum for neuron i
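
As a minimal sketch (sizes chosen arbitrarily for illustration), the entire layer is two numpy expressions:

import numpy as np

rng = np.random.default_rng(1)
W = rng.standard_normal((4, 3))    # 4 neurons, 3 inputs (m=4, n=3)
b = rng.standard_normal((4, 1))
x = rng.standard_normal((3, 1))

z = W @ x + b                      # pre-activation, shape (4, 1)
a = 1 / (1 + np.exp(-z))           # sigmoid activation, shape (4, 1)
assert z.shape == (4, 1) and a.shape == (4, 1)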

Part 3: The Network - Layers Composed

Multi-Layer Architecture

                    MULTI-LAYER NEURAL NETWORK

    Input        Hidden Layer 1      Hidden Layer 2       Output
    (784)            (128)               (64)              (10)

     x               a¹                  a²                 ŷ
    [·]             [·]                 [·]                [·]
    [·]    W¹,b¹    [·]      W²,b²      [·]      W³,b³     [·]
    [·]   ────────→ [·]    ────────→    [·]    ────────→   [·]
    [·]     σ       [·]       σ         [·]    softmax     [·]
    ...             ...                 ...                ...
    [·]             [·]                 [·]                [·]

    For MNIST:
    - Input: 28×28 = 784 pixels
    - Output: 10 classes (digits 0-9)
    - Hidden layers: learn intermediate features

Multi-Layer Neural Network Architecture

Forward propagation equations:

Layer 1:  z¹ = W¹x + b¹,     a¹ = σ(z¹)
Layer 2:  z² = W²a¹ + b²,    a² = σ(z²)
Layer 3:  z³ = W³a² + b³,    ŷ = softmax(z³)

Dimensions for MNIST:

Input x:    784 × 1
W¹:         128 × 784     a¹: 128 × 1
W²:         64 × 128      a²: 64 × 1
W³:         10 × 64       ŷ:  10 × 1

Total parameters: 128×784 + 128 + 64×128 + 64 + 10×64 + 10 = 109,386
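
The parameter count can be verified with a two-line sketch (the layer sizes are the ones listed above):

sizes = [784, 128, 64, 10]
# each layer contributes out*in weights plus out biases
total = sum(m * n + m for n, m in zip(sizes[:-1], sizes[1:]))
print(total)  # 109386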

Part 4: Activation Functions

Sigmoid

σ(z) = 1 / (1 + e⁻ᶻ)

Range: (0, 1)
Derivative: σ'(z) = σ(z)(1 - σ(z))

Graph:
    1 ──────────────═══════════
      │           ╱
      │         ╱
    0.5 ───────●───────────────
      │       ╱│
      │     ╱  │
    0 ═════════┴───────────────
              0       z

Problem: "vanishing gradients" - the derivative approaches 0 for large |z|

Sigmoid Activation Function

ReLU (Rectified Linear Unit)

ReLU(z) = max(0, z)

Range: [0, ∞)
Derivative: 1 if z > 0, else 0

Graph:
    │      ╱
    │    ╱
    │  ╱
    │╱
    ●─────────────
   0              z

Advantages: No vanishing gradient (for z > 0), computationally simple
Disadvantage: "Dead neurons" if a unit's z stays below 0 for all inputs

ReLU Activation Function

Softmax (for output layer classification)

softmax(zᵢ) = e^zᵢ / Σⱼ e^zⱼ

Converts raw scores to probabilities:
- All outputs sum to 1
- Each output ∈ (0, 1)

Example:
z = [2.0, 1.0, 0.1]
e^z = [7.39, 2.72, 1.11]
softmax(z) = [0.66, 0.24, 0.10]  ← probabilities for 3 classes
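
The numbers in this example can be reproduced with a few lines of numpy (a quick sketch; the numerically stable version used in the project appears in Phase 2):

import numpy as np

z = np.array([2.0, 1.0, 0.1])
p = np.exp(z) / np.exp(z).sum()
print(p.round(2))   # [0.66 0.24 0.1]
print(p.sum())      # 1.0 (up to floating-point rounding)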

Part 5: The Loss Function

Cross-Entropy Loss

For classification, we use cross-entropy:

L = -Σᵢ yᵢ log(ŷᵢ)

Where:
- y is the true label (one-hot encoded)
- ŷ is the predicted probability

For single correct class c:
L = -log(ŷ_c)

If we predict high probability for correct class: L is small
If we predict low probability for correct class: L is large

Example:

True label: 3 (one-hot: [0,0,0,1,0,0,0,0,0,0])
Prediction: [0.01, 0.01, 0.02, 0.90, 0.02, 0.01, 0.01, 0.01, 0.005, 0.005]

L = -log(0.90) ≈ 0.105  (good prediction, low loss)

Bad prediction: [0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1]
L = -log(0.1) ≈ 2.303   (poor prediction, high loss)
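
Both loss values can be checked directly; a small sketch, with y as the one-hot vector from the example above:

import numpy as np

y = np.zeros(10)
y[3] = 1  # true class is 3
good = np.array([0.01, 0.01, 0.02, 0.90, 0.02, 0.01, 0.01, 0.01, 0.005, 0.005])
uniform = np.full(10, 0.1)

print(-np.sum(y * np.log(good)))     # ~0.105
print(-np.sum(y * np.log(uniform)))  # ~2.303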

Part 6: Gradient Descent

The Optimization Problem

Find weights W*, b* that minimize:

L(W, b) = (1/N) Σᵢ Loss(network(xᵢ; W, b), yᵢ)

Where the sum is over all training examples.

The Gradient Descent Update

W := W - α × ∂L/∂W
b := b - α × ∂L/∂b

Where α is the learning rate.

Intuition: Move in the direction of steepest descent.

Geometric picture:

                   LOSS LANDSCAPE

    Loss
     │      ╱╲
     │     ╱  ╲    ╱╲
     │    ╱    ╲  ╱  ╲
     │   ╱      ╲╱    ╲____╱╲
     │  ╱                    ╲___
     │ ●─→─→─→─→─→─→─→─→─→─→─→─→●  (minimum)
     └──────────────────────────────── weights

   Each step moves downhill along the gradient.
   Learning rate α controls step size.

Loss Landscape - Gradient Descent
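
A one-dimensional sketch of the same idea, using the toy loss L(w) = (w - 3)² and an arbitrarily chosen learning rate, shows the iterates sliding toward the minimum at w = 3:

w, alpha = 0.0, 0.1
for _ in range(25):
    grad = 2 * (w - 3)        # dL/dw for L(w) = (w - 3)^2
    w = w - alpha * grad      # gradient descent update
print(round(w, 3))            # approaches 3.0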

Stochastic Gradient Descent (SGD)

Instead of computing gradient over all examples:

For each mini-batch of size B:
    1. Forward pass: compute predictions
    2. Compute loss
    3. Backward pass: compute gradients
    4. Update weights

Advantages:

  • Faster iterations
  • Noise helps escape local minima
  • Fits in memory

Part 7: Backpropagation - The Chain Rule in Matrices

The Chain Rule Review

For scalar functions:

If y = f(g(x)), then dy/dx = (dy/dg) × (dg/dx)

For matrices, we need to track shapes carefully.

Backward Through One Layer

Consider layer: a = σ(Wx + b)

We receive from the next layer: ∂L/∂a (gradient w.r.t. this layer's output)

We need to compute:

∂L/∂W = ?
∂L/∂b = ?
∂L/∂x = ? (to pass to previous layer)

Step 1: Gradient through activation

∂L/∂z = ∂L/∂a ⊙ σ'(z)

Where ⊙ is element-wise multiplication.
Each element of z affects its corresponding element of a.

Step 2: Gradient w.r.t. weights

z = Wx + b
∂z/∂W = ?

The key insight: zᵢ = Σⱼ Wᵢⱼ xⱼ
So ∂zᵢ/∂Wᵢⱼ = xⱼ

In matrix form: ∂L/∂W = (∂L/∂z) × xᵀ
                         (m×1)   (1×n)  =  (m×n)

This is an outer product!

Step 3: Gradient w.r.t. bias

∂L/∂b = ∂L/∂z

(bias adds directly to z, so gradient passes through)

Step 4: Gradient w.r.t. input (to pass backward)

z = Wx + b
∂zᵢ/∂xⱼ = Wᵢⱼ

In matrix form: ∂L/∂x = Wᵀ × (∂L/∂z)
                        (n×m)  (m×1)  =  (n×1)

THE TRANSPOSE APPEARS! This is why understanding transposes matters.
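
A shape-only numpy sketch (random values, m = 4 outputs, n = 3 inputs, none of it project code) makes the outer product and the transpose visible:

import numpy as np

rng = np.random.default_rng(2)
m, n = 4, 3
W = rng.standard_normal((m, n))
x = rng.standard_normal((n, 1))
dL_dz = rng.standard_normal((m, 1))   # stand-in for the gradient from the next layer

dL_dW = dL_dz @ x.T                   # (m,1) @ (1,n) -> (m,n): outer product
dL_dx = W.T @ dL_dz                   # (n,m) @ (m,1) -> (n,1): transpose appears
assert dL_dW.shape == (m, n) and dL_dx.shape == (n, 1)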

The Full Backpropagation Algorithm

                    BACKPROPAGATION FLOW

  FORWARD PASS (compute activations)
  ───────────────────────────────────
  x ──W¹──→ z¹ ──σ──→ a¹ ──W²──→ z² ──σ──→ a² ──W³──→ z³ ──softmax──→ ŷ

  Store all z's and a's (needed for the backward pass!)

  BACKWARD PASS (compute gradients)
  ─────────────────────────────────
  ∂L/∂ŷ ──→ ∂L/∂z³ ──W³ᵀ──→ ∂L/∂a² ──→ ∂L/∂z² ──W²ᵀ──→ ∂L/∂a¹ ──→ ∂L/∂z¹
               │                           │                         │
               ▼                           ▼                         ▼
            ∂L/∂W³                      ∂L/∂W²                    ∂L/∂W¹
            ∂L/∂b³                      ∂L/∂b²                    ∂L/∂b¹

Backpropagation Flow

Pseudocode:

def backward(self, dL_dy):
    # For each layer, from last to first:
    for layer in reversed(self.layers):
        # Gradient through activation
        dL_dz = dL_dy * activation_derivative(layer.z)

        # Gradients for this layer's parameters
        layer.dW = outer(dL_dz, layer.input)
        layer.db = dL_dz

        # Gradient to pass to previous layer
        dL_dy = layer.W.T @ dL_dz

    return  # Gradients stored in each layer

Part 8: Batch Processing with Matrices

Single Example vs. Batch

Single example:

x: (n, 1) vector
z = Wx + b
a = σ(z)

Batch of B examples:

X: (n, B) matrix - each column is one example
Z = WX + b (broadcast b to each column)
A = σ(Z)

Why batches?

  1. GPU parallelism: matrix-matrix multiplication is optimized
  2. Gradient averaging: smoother optimization
  3. Memory efficiency: process in chunks

Matrix dimensions for batch:

X:     (784, 32)  - 32 images, each 784 pixels
W¹:    (128, 784)
Z¹:    (128, 32)  - one 128-value column per example
A¹:    (128, 32)
...
Ŷ:     (10, 32)   - 32 predictions, each over 10 classes

Backprop with Batches

The gradients need to be summed (or averaged) over the batch:

# Forward
Z = W @ X + b.reshape(-1, 1)  # broadcast bias
A = activation(Z)

# Backward
dL_dZ = dL_dA * activation_derivative(Z)
dL_dW = (dL_dZ @ X.T) / batch_size  # average over batch
dL_db = dL_dZ.mean(axis=1, keepdims=True)  # average over batch, keep column shape
dL_dX = W.T @ dL_dZ
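
Before wiring these lines into a Layer class, the shapes can be sanity-checked on random data (a sketch assuming ReLU as the activation):

import numpy as np

rng = np.random.default_rng(3)
batch_size = 32
X = rng.standard_normal((784, batch_size))
W = rng.standard_normal((128, 784))
b = np.zeros((128, 1))

Z = W @ X + b                            # bias broadcasts across the 32 columns
A = np.maximum(0, Z)                     # ReLU
dL_dA = rng.standard_normal(A.shape)     # stand-in for the upstream gradient
dL_dZ = dL_dA * (Z > 0)                  # ReLU derivative
dL_dW = (dL_dZ @ X.T) / batch_size
dL_db = dL_dZ.mean(axis=1, keepdims=True)
assert dL_dW.shape == W.shape and dL_db.shape == b.shape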

Part 9: Weight Initialization

Why Initialization Matters

Bad initialization → gradients vanish or explode → training fails

Problem: If weights are too large, activations saturate (sigmoid → 0 or 1)
         If weights are too small, activations collapse to zero

The variance of activations should stay roughly constant through layers.

Xavier/Glorot Initialization

For sigmoid/tanh:

W ~ Normal(0, sqrt(2 / (fan_in + fan_out)))

Or equivalently:
W ~ Uniform(-sqrt(6 / (fan_in + fan_out)), sqrt(6 / (fan_in + fan_out)))

Where:
- fan_in = number of input neurons
- fan_out = number of output neurons

He Initialization

For ReLU:

W ~ Normal(0, sqrt(2 / fan_in))

The math: ReLU zeros out half the activations, so we need 2× the variance to compensate.
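
A minimal sketch of an initializer along these lines; the name initialize_weights matches the helper assumed by the Layer class in the Solution Architecture section, but the exact form here is a choice, not the required one:

import numpy as np

def initialize_weights(input_size, output_size, scheme="he"):
    """He initialization for ReLU layers, Xavier/Glorot otherwise."""
    if scheme == "he":
        std = np.sqrt(2.0 / input_size)
    else:
        std = np.sqrt(2.0 / (input_size + output_size))
    return np.random.randn(output_size, input_size) * std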


Part 10: Understanding Through Linear Algebra

Neural Networks as Function Composition

f(x) = σ(Wᴸ · σ(Wᴸ⁻¹ · ... · σ(W¹ · x + b¹) ... + bᴸ⁻¹) + bᴸ)

Each layer:
1. Linearly transforms (rotates, scales, translates via W and b)
2. Applies pointwise nonlinearity (σ)

The sequence of transformations warps the input space to make
classification easier.

What Each Layer "Sees"

Layer 1: Raw pixels
Layer 2: Edges and simple patterns (combinations of pixels)
Layer 3: Shapes and parts (combinations of edges)
Layer 4: Objects and digits (combinations of shapes)
Output:  Class probabilities

The Geometry of Classification

             SPACE WARPING THROUGH LAYERS

    Input Space          Hidden Space         Output Space

    ● ● ●                    ●                     ●
    ○ ○ ○    Layer 1     ○       ○   Layer 2
    ● ● ●   ────────→        ●      ────────→   ○
    ○ ○ ○                ○       ○
                             ●                     ●

   Interleaved           Separated              Linearly
   (hard to separate)    (easier)               separable

   The network learns transformations that untangle the data!

Space Warping Through Layers


Project Specification

What You're Building

A neural network that:

  1. Loads the MNIST dataset (handwritten digits)
  2. Implements forward propagation with matrix operations
  3. Implements backpropagation to compute gradients
  4. Trains using stochastic gradient descent
  5. Achieves >95% accuracy on test set

Minimum Requirements

  • Load and preprocess MNIST data
  • Implement matrix operations (or use numpy)
  • Build network with at least 2 hidden layers
  • Implement ReLU and softmax activations
  • Implement cross-entropy loss
  • Implement backpropagation from scratch
  • Train and achieve >95% test accuracy
  • Plot training loss over epochs

Stretch Goals

  • Implement multiple optimizers (SGD, momentum, Adam)
  • Add dropout regularization
  • Visualize learned weights as images
  • Implement convolutional layers
  • Add batch normalization

Solution Architecture

Data Structures

class Layer:
    """A single neural network layer."""

    def __init__(self, input_size: int, output_size: int, activation: str):
        self.W = initialize_weights(input_size, output_size)
        self.b = np.zeros((output_size, 1))
        self.activation = activation

        # Stored during forward pass for backprop
        self.input = None
        self.z = None
        self.a = None

        # Gradients computed during backward pass
        self.dW = None
        self.db = None

class NeuralNetwork:
    """Multi-layer neural network."""

    def __init__(self, layer_sizes: list, activations: list):
        self.layers = []
        for i in range(len(layer_sizes) - 1):
            self.layers.append(Layer(
                layer_sizes[i],
                layer_sizes[i + 1],
                activations[i]
            ))

    def forward(self, X: np.ndarray) -> np.ndarray:
        """Compute forward pass, return output."""
        pass

    def backward(self, Y: np.ndarray) -> None:
        """Compute gradients via backpropagation."""
        pass

    def update(self, learning_rate: float) -> None:
        """Apply gradient descent update."""
        pass

Module Structure

                           main.py
                 - Load MNIST
                 - Training loop
                 - Evaluation and visualization
                             │
        ┌────────────────────┼────────────────────┐
        ▼                    ▼                    ▼
    data.py              network.py            train.py
    load_mnist           Layer class           train_epoch
    normalize            NeuralNetwork         evaluate
    one_hot              forward               sgd_update
    batch_loader         backward              learning_schedule
        │                    │                    │
        └────────────────────┼────────────────────┘
                             ▼
                      activations.py
                      relu, relu_deriv
                      sigmoid, sigmoid_deriv
                      softmax
                      cross_entropy

Neural Network Module Structure

Training Loop Flow

                         TRAINING LOOP

  For each epoch (pass through entire dataset):
  ──────────────────────────────────────────────

    Shuffle data              Randomize order each epoch
         │
         ▼
    For each mini-batch       Process in mini-batches
         │
         ▼
    1. FORWARD PASS
       X_batch ──→ Layer 1 ──→ Layer 2 ──→ ... ──→ Ŷ_batch
       Store activations at each layer
         │
         ▼
    2. COMPUTE LOSS
       L = CrossEntropy(Ŷ_batch, Y_batch)
         │
         ▼
    3. BACKWARD PASS
       ∂L/∂Ŷ ──→ Layer N ──→ ... ──→ Layer 1
       Compute ∂L/∂W, ∂L/∂b for each layer
         │
         ▼
    4. UPDATE WEIGHTS
       W := W - α × ∂L/∂W
       b := b - α × ∂L/∂b
         │
         └──→ Next mini-batch

  After epoch: Evaluate on validation set, adjust learning rate

Training Loop


Phased Implementation Guide

Phase 1: Data Loading and Preprocessing (Days 1-2)

Goal: Load MNIST and prepare for training

Tasks:

  1. Download/load MNIST dataset (use keras.datasets or fetch manually)
  2. Normalize pixel values to [0, 1] or [-1, 1]
  3. Flatten images from 28×28 to 784-element vectors
  4. One-hot encode labels
  5. Create batch iterator

Implementation:

import numpy as np

def load_mnist():
    # Using keras for convenience
    from tensorflow.keras.datasets import mnist
    (x_train, y_train), (x_test, y_test) = mnist.load_data()

    # Normalize to [0, 1]
    x_train = x_train.reshape(-1, 784).T / 255.0  # (784, N)
    x_test = x_test.reshape(-1, 784).T / 255.0

    # One-hot encode
    y_train_onehot = np.zeros((10, len(y_train)))
    y_train_onehot[y_train, np.arange(len(y_train))] = 1

    return x_train, y_train_onehot, x_test, y_test

Verification:

# Check shapes
assert x_train.shape == (784, 60000)
assert y_train_onehot.shape == (10, 60000)

# Check normalization
assert x_train.min() >= 0 and x_train.max() <= 1

# Check one-hot
assert np.allclose(y_train_onehot.sum(axis=0), 1)  # each column sums to 1

Phase 2: Activation Functions (Days 3-4)

Goal: Implement all activation functions and derivatives

Tasks:

  1. Implement ReLU and its derivative
  2. Implement sigmoid and its derivative
  3. Implement softmax
  4. Implement cross-entropy loss

Implementation:

def relu(z):
    return np.maximum(0, z)

def relu_derivative(z):
    return (z > 0).astype(float)

def sigmoid(z):
    return 1 / (1 + np.exp(-np.clip(z, -500, 500)))

def sigmoid_derivative(z):
    s = sigmoid(z)
    return s * (1 - s)

def softmax(z):
    # Subtract max for numerical stability
    exp_z = np.exp(z - z.max(axis=0, keepdims=True))
    return exp_z / exp_z.sum(axis=0, keepdims=True)

def cross_entropy_loss(y_pred, y_true):
    # Avoid log(0)
    epsilon = 1e-15
    y_pred = np.clip(y_pred, epsilon, 1 - epsilon)
    return -np.sum(y_true * np.log(y_pred)) / y_true.shape[1]

Verification:

# ReLU
assert np.array_equal(relu(np.array([-1, 0, 1])), [0, 0, 1])
assert np.array_equal(relu_derivative(np.array([-1, 0, 1])), [0, 0, 1])

# Softmax sums to 1
z = np.random.randn(10, 5)
assert np.allclose(softmax(z).sum(axis=0), 1)

# Sigmoid is in (0, 1)
assert (sigmoid(z) > 0).all() and (sigmoid(z) < 1).all()

Phase 3: Forward Propagation (Days 5-7)

Goal: Implement forward pass through network

Tasks:

  1. Implement Layer forward pass
  2. Implement Network forward pass
  3. Verify with known inputs

Implementation:

class Layer:
    def forward(self, X):
        self.input = X
        self.z = self.W @ X + self.b
        if self.activation == 'relu':
            self.a = relu(self.z)
        elif self.activation == 'sigmoid':
            self.a = sigmoid(self.z)
        elif self.activation == 'softmax':
            self.a = softmax(self.z)
        return self.a

class NeuralNetwork:
    def forward(self, X):
        current = X
        for layer in self.layers:
            current = layer.forward(current)
        return current

Verification:

# Initialize network
net = NeuralNetwork([784, 128, 10], ['relu', 'softmax'])

# Forward pass should produce valid probabilities
output = net.forward(x_train[:, :32])
assert output.shape == (10, 32)
assert np.allclose(output.sum(axis=0), 1)

Phase 4: Backpropagation (Days 8-11)

Goal: Implement backward pass for gradient computation

Tasks:

  1. Implement output layer gradient (softmax + cross-entropy)
  2. Implement hidden layer backward pass
  3. Chain gradients through network
  4. Verify with numerical gradient checking

Implementation:

class Layer:
    def backward(self, dL_da):
        # Gradient through activation
        if self.activation == 'relu':
            dL_dz = dL_da * relu_derivative(self.z)
        elif self.activation == 'sigmoid':
            dL_dz = dL_da * sigmoid_derivative(self.z)
        elif self.activation == 'softmax':
            # For softmax + cross-entropy, gradient simplifies to (ŷ - y)
            dL_dz = dL_da  # Pass dL_da directly (see note below)

        # Gradients for parameters
        batch_size = self.input.shape[1]
        self.dW = dL_dz @ self.input.T / batch_size
        self.db = dL_dz.mean(axis=1, keepdims=True)

        # Gradient to pass to previous layer
        dL_dx = self.W.T @ dL_dz
        return dL_dx

class NeuralNetwork:
    def backward(self, Y):
        # Output layer: softmax + cross-entropy derivative is (ŷ - y)
        dL_da = self.layers[-1].a - Y

        # Backpropagate through layers
        for layer in reversed(self.layers):
            dL_da = layer.backward(dL_da)

Numerical Gradient Check:

def numerical_gradient(f, W, epsilon=1e-5):
    """Compute gradient numerically for verification."""
    grad = np.zeros_like(W)
    for i in range(W.shape[0]):
        for j in range(W.shape[1]):
            W[i, j] += epsilon
            f_plus = f()
            W[i, j] -= 2 * epsilon
            f_minus = f()
            W[i, j] += epsilon  # restore
            grad[i, j] = (f_plus - f_minus) / (2 * epsilon)
    return grad

# Compare analytical vs numerical gradient
# They should match within 1e-5 relative error
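
A sketch of how the check might be wired up for the first layer's weights, on a small batch for speed (net, X_batch, Y_batch, and cross_entropy_loss are assumed from the surrounding phases):

# Analytical gradient from backprop
net.forward(X_batch)
net.backward(Y_batch)
analytical = net.layers[0].dW.copy()

# Numerical gradient of the same averaged loss w.r.t. the same weight matrix
loss_fn = lambda: cross_entropy_loss(net.forward(X_batch), Y_batch)
numerical = numerical_gradient(loss_fn, net.layers[0].W)

rel_error = np.abs(analytical - numerical).max() / (np.abs(numerical).max() + 1e-12)
print(rel_error)   # should be on the order of 1e-5 or smaller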

Phase 5: Training (Days 12-14)

Goal: Train the network to classify digits

Tasks:

  1. Implement SGD update step
  2. Implement training loop
  3. Track and plot loss over epochs
  4. Implement evaluation on test set

Implementation:

def train_epoch(net, X, Y, batch_size, learning_rate):
    """Train for one epoch."""
    n_samples = X.shape[1]
    indices = np.random.permutation(n_samples)
    total_loss = 0

    for start in range(0, n_samples, batch_size):
        end = min(start + batch_size, n_samples)
        batch_idx = indices[start:end]

        X_batch = X[:, batch_idx]
        Y_batch = Y[:, batch_idx]

        # Forward
        output = net.forward(X_batch)

        # Loss
        loss = cross_entropy_loss(output, Y_batch)
        total_loss += loss * (end - start)

        # Backward
        net.backward(Y_batch)

        # Update
        for layer in net.layers:
            layer.W -= learning_rate * layer.dW
            layer.b -= learning_rate * layer.db

    return total_loss / n_samples

def evaluate(net, X, y):
    """Compute accuracy."""
    predictions = net.forward(X).argmax(axis=0)
    return (predictions == y).mean()

Training Loop:

net = NeuralNetwork([784, 128, 64, 10], ['relu', 'relu', 'softmax'])

for epoch in range(20):
    loss = train_epoch(net, x_train, y_train_onehot, batch_size=32, learning_rate=0.1)
    acc = evaluate(net, x_test, y_test)
    print(f"Epoch {epoch}: Loss = {loss:.4f}, Test Accuracy = {acc:.4f}")

Phase 6: Visualization and Analysis (Days 15-17)

Goal: Understand what the network learned

Tasks:

  1. Plot training loss curve
  2. Visualize first layer weights as images
  3. Show example predictions with confidence
  4. Analyze misclassified examples

Visualizations:

import matplotlib.pyplot as plt

# Weight visualization
def visualize_weights(W):
    """Plot first layer weights as 28ร—28 images."""
    n_neurons = min(W.shape[0], 25)
    fig, axes = plt.subplots(5, 5, figsize=(10, 10))
    for i, ax in enumerate(axes.flat):
        if i < n_neurons:
            img = W[i].reshape(28, 28)
            ax.imshow(img, cmap='RdBu')
        ax.axis('off')
    plt.show()

# Prediction examples
def show_predictions(net, X, y, n=10):
    """Show predictions with confidence."""
    probs = net.forward(X[:, :n])
    preds = probs.argmax(axis=0)
    for i in range(n):
        print(f"True: {y[i]}, Pred: {preds[i]}, Confidence: {probs[preds[i], i]:.2%}")

Phase 7: Improvements and Extensions (Days 18-21)

Goal: Improve accuracy and add features

Tasks:

  1. Add momentum to SGD
  2. Implement learning rate decay
  3. Add dropout regularization
  4. Experiment with architecture

Momentum SGD:

class MomentumSGD:
    def __init__(self, learning_rate, momentum=0.9):
        self.lr = learning_rate
        self.momentum = momentum
        self.velocities = {}

    def update(self, layer, layer_id):
        if layer_id not in self.velocities:
            self.velocities[layer_id] = {
                'W': np.zeros_like(layer.W),
                'b': np.zeros_like(layer.b)
            }

        v = self.velocities[layer_id]
        v['W'] = self.momentum * v['W'] - self.lr * layer.dW
        v['b'] = self.momentum * v['b'] - self.lr * layer.db

        layer.W += v['W']
        layer.b += v['b']
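
One way to slot this optimizer into the Phase 5 loop, replacing the plain W/b update inside train_epoch (a sketch; the learning rate here is chosen arbitrarily):

optimizer = MomentumSGD(learning_rate=0.05, momentum=0.9)

# inside the mini-batch loop, after net.backward(Y_batch):
for layer_id, layer in enumerate(net.layers):
    optimizer.update(layer, layer_id)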

Testing Strategy

Unit Tests

  1. Activation functions:
    • ReLU(negative) = 0
    • Softmax sums to 1
    • Sigmoid ∈ (0, 1)
  2. Forward pass:
    • Output shape matches expected
    • Probabilities sum to 1
  3. Backward pass:
    • Gradients match numerical approximation
    • Shapes are consistent
  4. Update step:
    • Loss decreases on same batch
    • Weights actually change

Integration Tests

  1. Overfit small batch: Should achieve 100% accuracy
  2. Training curve: Loss should decrease
  3. Gradient check: Analytical ≈ numerical

Performance Tests

  1. Target accuracy: >95% on test set
  2. Training time: Reasonable (minutes, not hours)
  3. Memory usage: Fits in RAM

Common Pitfalls and Debugging

Shape Mismatches

Symptom: "Shapes not aligned" errors

Cause: Matrix dimensions don't match for multiplication

Fix: Draw out dimensions at each step:

X:      (784, 32)
W¹:     (128, 784)
W¹ @ X: (128, 32) ✓

dL_dz:  (128, 32)
X.T:    (32, 784)
dW:     (128, 32) @ (32, 784) = (128, 784) ✓

Vanishing/Exploding Gradients

Symptom: Weights become NaN or don't change

Cause: Improper initialization or unstable activations

Fix:

  • Use Xavier/He initialization
  • Use ReLU instead of sigmoid for hidden layers
  • Gradient clipping: grad = np.clip(grad, -1, 1)

Numerical Instability

Symptom: NaN in loss or activations

Cause: Overflow in exp(), log(0)

Fix:

# Softmax: subtract max
exp_z = np.exp(z - z.max(axis=0, keepdims=True))

# Cross-entropy: clip predictions
y_pred = np.clip(y_pred, 1e-15, 1 - 1e-15)

Learning Rate Issues

Symptom: Loss oscillates or doesn't decrease

Cause: Learning rate too high or too low

Fix: Start with 0.01-0.1, use learning rate finder, or adaptive optimizers

Batch Size Effects

Symptom: Different results with different batch sizes

Cause: Gradients not averaged correctly

Fix: Divide gradients by batch size:

self.dW = dL_dz @ self.input.T / batch_size

Extensions and Challenges

Beginner Extensions

  • Add L2 regularization (weight decay)
  • Implement early stopping
  • Save/load model weights

Intermediate Extensions

  • Implement Adam optimizer
  • Add batch normalization
  • Implement dropout
  • Try different architectures

Advanced Extensions

  • Implement convolutional layers
  • Add residual connections
  • Implement in pure C
  • GPU acceleration with CUDA

Real-World Connections

Deep Learning Frameworks

Every framework (PyTorch, TensorFlow, JAX) does exactly this:

  • Forward pass: compute outputs
  • Backward pass: compute gradients (autograd)
  • Update: optimizer step

Understanding the math lets you debug and optimize effectively.

Computer Vision

This MNIST classifier is a simplified version of:

  • Handwriting recognition (postal services)
  • OCR (document scanning)
  • Medical image analysis
  • Self-driving cars

Natural Language Processing

Same principles apply:

  • Word embeddings are learned weight matrices
  • Transformers use attention (matrix operations)
  • Language models predict next tokens (classification)

Optimization Theory

Gradient descent connects to:

  • Convex optimization
  • Stochastic approximation
  • Online learning

Self-Assessment Checklist

Conceptual Understanding

  • I can explain why neural networks need nonlinearity
  • I understand what each matrix multiplication represents
  • I can derive the backpropagation equations
  • I know why the transpose appears in gradients
  • I understand the role of learning rate and batch size

Practical Skills

  • I implemented forward propagation with matrix operations
  • I implemented backpropagation from scratch
  • My network achieves >95% accuracy on MNIST
  • I can visualize and interpret learned weights
  • I can debug shape mismatch errors

Teaching Test

Can you explain to someone else:

  • How does a single neuron compute its output?
  • What does each layer learn?
  • Why do we use mini-batches?
  • How does backpropagation work?

Resources

Primary

  • "Math for Programmers" by Paul Orland - Chapter on ML
  • 3Blue1Brown: Neural Networks - Visual intuition
  • "Grokking Algorithms" - ML appendix

Reference

  • "Deep Learning" by Goodfellow, Bengio, Courville - The definitive textbook
  • "Hands-On Machine Learning" by Aurélien Géron - Practical guide
  • CS231n course notes (Stanford) - Detailed derivations

Code References

  • NumPy documentation - Matrix operations
  • micrograd by Andrej Karpathy - Minimal autograd implementation
  • PyTorch source code - Production implementation

When you complete this project, you'll understand that deep learning is not magic: it's linear algebra composed with nonlinearities, optimized via calculus. Every AI breakthrough is built on the foundations you now understand.