Project 9: The CNN From Scratch (Pooling & Strides)
Sprint: AI Prediction & Neural Networks: From Math to Machine
Focus Area: Convolutional Neural Networks and Spatial Invariance
Project Metadata
| Attribute | Value |
|---|---|
| Difficulty | Level 5: Master |
| Main Programming Language | Python (NumPy) |
| Alternative Languages | C, Rust, Julia |
| Coolness Level | Level 5: Pure Magic |
| Business Potential | 4. Open Core (Custom Vision Hardware) |
| Knowledge Area | Convolutional Neural Networks |
| Software/Tools | NumPy, Matplotlib, MNIST dataset |
| Main Book | "Deep Learning" by Goodfellow, Bengio, Courville - Ch. 9 |
| Estimated Time | 2-3 Weeks |
| Prerequisites | Project 7 (Kernel Explorer), Project 8 (MNIST Dense) |
What You Will Build
You will upgrade your MNIST handwritten digit classifier (from Project 8) by replacing the naive "flatten the image" approach with proper Convolutional Layers and Max Pooling Layers. You will implement the complete forward and backward pass for convolution operations manually, including the notoriously tricky im2col algorithm that makes convolution efficient.
Your CNN will:
- Learn filters automatically (instead of you hardcoding edge detectors)
- Recognize digits regardless of their position in the image (translation invariance)
- Use fewer than half the parameters of the dense network while achieving higher accuracy
- Train in reasonable time through vectorized operations
This is considered the hardest project in this learning path. The backward pass through convolution is where most people give up. If you complete this, you truly understand how CNNs work at the deepest level.
Learning Objectives
By completing this project, you will:
- Implement Conv2D Forward Pass - Slide learned filters across images to produce feature maps
- Master the im2col Transformation - Convert convolution into matrix multiplication for efficiency
- Implement Conv2D Backward Pass - The notoriously difficult gradient computation through convolution
- Build Max Pooling Layers - Downsample feature maps while preserving important features
- Implement Max Pooling Backward Pass - Route gradients only through the "winning" neurons
- Connect Convolutional and Dense Layers - Flatten feature volumes to feed into fully connected layers
- Understand Parameter Sharing - Why CNNs are efficient for images
- Achieve Translation Invariance - Recognize patterns regardless of position
- Debug with Gradient Checking - Verify your backprop implementation is correct
The Core Question You're Answering
"How can we make AI efficient enough for images?"
A 28x28 grayscale image has 784 pixels. Thatโs manageable. But a 1000x1000 color image has 3 million inputs. If your first hidden layer has 1000 neurons, you need 3 billion weights just for layer 1. This is impossible to train.
CNNs solve this through two key insights:
- Local Connectivity: A neuron doesn't need to see the entire image. It only needs to see a small patch (like 3x3 pixels). Edges and textures are local features.
- Parameter Sharing: The same filter that detects a vertical edge in the top-left corner should work in the bottom-right corner too. We use the same weights everywhere.
These two ideas reduce parameters by 1000x while actually improving accuracy, because:
- Sparse connections prevent overfitting
- Shared weights encode translation invariance (a "7" is a "7" anywhere in the image)
- Hierarchical features emerge naturally (edges -> textures -> shapes -> objects)
Concepts You Must Understand First
Before implementing, ensure you have solid grounding in these foundational concepts:
| Concept | Why It Matters | Where to Learn |
|---|---|---|
| Convolution Operation | You must be able to compute a convolution by hand. Project 7 should have given you this. Know what happens when a 3x3 kernel slides over a 5x5 image. | Project 7, "Deep Learning with Python" Ch. 5 |
| Parameter Sharing | The key insight that makes CNNs work. One filter = one set of weights applied everywhere. This gives translation invariance. | "Deep Learning" Ch. 9.2 |
| Sparse Connectivity | Each output pixel connects to only a small patch of input, not the entire image. This is why CNNs have fewer parameters. | "Deep Learning" Ch. 9.2 |
| Translation Invariance | A CNN should recognize a cat whether it's on the left or right side of the image. Pooling and shared weights create this property. | "Deep Learning" Ch. 9.3 |
| Max Pooling Operation | Downsampling by taking the maximum in each patch. Reduces spatial dimensions and provides local translation invariance. | "Deep Learning with Python" Ch. 5.1.2 |
| Feature Map Dimensions | Given input (28x28), filter (3x3), stride (1), padding (0), what's the output size? You must know the formula: (W - F + 2P) / S + 1. | "Deep Learning" Ch. 9.5 |
| The im2col Transformation | The trick that converts convolution into matrix multiplication. Essential for efficient implementation. | Stanford CS231n Notes |
The Dimension Formula
This will save you hours of debugging:
Output Size = floor((Input_Size - Filter_Size + 2*Padding) / Stride) + 1
Example: Input 28x28, Filter 3x3, Padding 0, Stride 1:
Output = (28 - 3 + 0) / 1 + 1 = 26
So a 28x28 image becomes a 26x26 feature map after one 3x3 convolution.
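A small helper (hypothetical, not required by the project) keeps this formula at hand while debugging shapes; the assertions below trace the exact architecture built later in this project:

def conv_output_size(input_size, filter_size, padding=0, stride=1):
    """Spatial output size of a convolution or pooling layer."""
    return (input_size - filter_size + 2 * padding) // stride + 1

assert conv_output_size(28, 3) == 26             # Conv2D(3x3) on 28x28
assert conv_output_size(26, 2, stride=2) == 13   # MaxPool(2x2, stride 2)
assert conv_output_size(13, 3) == 11             # second Conv2D(3x3)
assert conv_output_size(11, 2, stride=2) == 5    # second MaxPool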
Deep Theoretical Foundation
Why Convolutions Are Perfect for Images
Consider a dense network trying to recognize a "7":
Dense Network View:
Input: 28x28 = 784 pixels Each hidden neuron connects
(flattened to vector) to ALL 784 input pixels
[x1, x2, x3, ... x784] --> [h1, h2, h3, ... h256]
Weights: 784 * 256 = 200,704 parameters (just layer 1!)
Problem: A "7" at pixel (5,5) looks COMPLETELY DIFFERENT from
a "7" at pixel (20,20) because different weights fire.
Now consider a CNN:
CNN View:
Input: 28x28 image One filter (3x3 = 9 weights)
(keep the 2D structure) slides across the ENTIRE image
┌──────────────────┐       ┌───┐
│        7         │   *   │ F │   =   Feature Map 26x26
│                  │       └───┘
│                  │
└──────────────────┘   Same 9 weights used everywhere!
Weights: 9 parameters (the filter)
Benefit: A "7" activates the same filter whether it's
top-left or bottom-right. Translation invariance!
Parameter Efficiency: Conv vs Dense
Letโs compare parameter counts for processing a 28x28 image:
───────────────────────────────────────────────────────────────────────────
                       Parameter Count Comparison
───────────────────────────────────────────────────────────────────────────

  Dense Layer (784 inputs -> 256 hidden):
      Parameters = 784 * 256 + 256 (bias) = 200,960

  Conv Layer (1 channel -> 32 filters, 3x3):
      Parameters = 32 * (3 * 3 * 1) + 32 (bias) = 320

  Ratio: Dense / Conv = 628x MORE parameters for dense!

  And the conv layer produces MORE information:
      Dense: 256 values
      Conv:  32 * 26 * 26 = 21,632 values (feature maps)

───────────────────────────────────────────────────────────────────────────
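You can reproduce these numbers in a few lines (a sanity check, nothing more):

dense_params = 784 * 256 + 256          # weights + bias
conv_params = 32 * (3 * 3 * 1) + 32     # 32 filters of 3x3x1, plus bias
print(dense_params)                     # 200960
print(conv_params)                      # 320
print(dense_params // conv_params)      # 628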
The Receptive Field Concept
Each pixel in a deeper layer โseesโ a larger patch of the original image:
Layer 1: Each output pixel sees 3x3 of input (receptive field = 3)
Layer 2: Each output pixel sees 3x3 of layer 1 (receptive field = 5)
Layer 3: Each output pixel sees 3x3 of layer 2 (receptive field = 7)
┌─────────────────────┐
│    Original Image   │
│  ┌───────────────┐  │
│  │  RF Layer 2   │  │
│  │  ┌─────────┐  │  │
│  │  │  RF L1  │  │  │
│  │  │   3x3   │  │  │
│  │  └─────────┘  │  │
│  │      5x5      │  │
│  └───────────────┘  │
│         7x7         │
└─────────────────────┘
The deeper you go, the more context each neuron has.
Early layers: edges, textures
Middle layers: parts (eyes, wheels)
Deep layers: objects (faces, cars)
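For stacked stride-1 convolutions, each layer adds (kernel_size - 1) pixels of context; a one-function sketch of that rule:

def receptive_field(num_layers, kernel_size=3):
    """Receptive field of n stacked stride-1 conv layers."""
    rf = 1
    for _ in range(num_layers):
        rf += kernel_size - 1   # each layer widens the view by (k - 1)
    return rf

print([receptive_field(n) for n in (1, 2, 3)])   # [3, 5, 7]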
Pooling for Spatial Invariance
Max pooling takes the maximum value in each patch:
Input 4x4: Max Pool 2x2, Stride 2:
┌───┬───┬───┬───┐        ┌───┬───┐
│ 1 │ 2 │ 5 │ 3 │        │ 6 │ 8 │   (max of top-left 2x2     = 6)
├───┼───┼───┼───┤   -->  ├───┼───┤   (max of top-right 2x2    = 8)
│ 6 │ 4 │ 8 │ 1 │        │ 7 │ 9 │   (max of bottom-left 2x2  = 7)
├───┼───┼───┼───┤        └───┴───┘   (max of bottom-right 2x2 = 9)
│ 2 │ 7 │ 3 │ 9 │
├───┼───┼───┼───┤
│ 1 │ 5 │ 4 │ 2 │
└───┴───┴───┴───┘
Why it helps:
1. Reduces spatial size (4x4 -> 2x2 = 75% reduction)
2. Provides local translation invariance:
- If the "6" moved one pixel right (to where "4" was),
the output would still be "6" (or close to it)
3. Keeps the "loudest" feature in each region
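You can verify the example above directly in NumPy (the reshape trick assumes the spatial size divides evenly by the pool size):

import numpy as np

x = np.array([[1, 2, 5, 3],
              [6, 4, 8, 1],
              [2, 7, 3, 9],
              [1, 5, 4, 2]])

# Split into 2x2 blocks, then take the max within each block
pooled = x.reshape(2, 2, 2, 2).max(axis=(1, 3))
print(pooled)   # [[6 8]
                #  [7 9]]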
Backpropagation Through Convolution (The Hard Part)
This is where most people give up. Let's build intuition before diving into math.
Forward pass recap:
- Input: Image X of shape (H, W)
- Filter: Kernel K of shape (FH, FW)
- Output: Feature map Y of shape (H-FH+1, W-FW+1)
- Each Y[i,j] = sum(X[i:i+FH, j:j+FW] * K)
Backward pass goal:
- Given: gradient of loss w.r.t. output, dL/dY
- Find: dL/dX (to backprop further) and dL/dK (to update weights)
The key insight: In the forward pass, each input pixel X[a,b] contributes to multiple outputs (wherever the filter overlapped that pixel). In the backward pass, we sum all those contributions.
How one input pixel affects multiple outputs:
Input X (5x5):                        Output Y (3x3) with 3x3 filter:
┌───┬───┬───┬───┬───┐                 ┌───┬───┬───┐
│   │   │   │   │   │                 │Y00│Y01│Y02│
├───┼───┼───┼───┼───┤                 ├───┼───┼───┤
│   │ X │   │   │   │  <-- This       │Y10│Y11│Y12│
├───┼───┼───┼───┼───┤      pixel      ├───┼───┼───┤
│   │   │   │   │   │                 │Y20│Y21│Y22│
├───┼───┼───┼───┼───┤                 └───┴───┴───┘
│   │   │   │   │   │
├───┼───┼───┼───┼───┤                 X[1,1] contributes to Y[0,0]
│   │   │   │   │   │                 (when filter is at position 0,0)
└───┴───┴───┴───┴───┘

So dL/dX[1,1] includes dL/dY[0,0] * K[1,1]
The gradient of the filter (dL/dK) is even more interesting:
- Each filter weight K[i,j] was multiplied by many input values during the forward pass
- So dL/dK[i,j] = sum over all positions of dL/dY[pos] * X[corresponding input]
- This is actually a convolution of X with dL/dY!
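A minimal numeric illustration of that last point, with a toy 4x4 input and a 2x2 filter (so dL/dY is 3x3); the values here are arbitrary:

import numpy as np

X = np.arange(16, dtype=float).reshape(4, 4)   # toy input
dY = np.ones((3, 3))                           # pretend upstream gradient

# dL/dK[a,b] = sum_{i,j} dY[i,j] * X[i+a, j+b]: literally sliding dY over X
dK = np.zeros((2, 2))
for a in range(2):
    for b in range(2):
        dK[a, b] = np.sum(dY * X[a:a+3, b:b+3])
print(dK)   # same shape as the filter, one gradient per filter weight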
The im2col Transformation: Convolution as Matrix Multiplication
The naive convolution uses nested loops and is slow. The im2col trick converts convolution into a single matrix multiplication:
Original Convolution (4x4 input, 2x2 filter, stride 1):
Input X: Filter K: Output Y (3x3):
โโโโโฌโโโโฌโโโโฌโโโโ โโโโโฌโโโโ โโโโโฌโโโโฌโโโโ
โ 1 โ 2 โ 3 โ 4 โ โ w โ x โ โ โ โ โ
โโโโโผโโโโผโโโโผโโโโค โโโโโผโโโโค โโโโโผโโโโผโโโโค
โ 5 โ 6 โ 7 โ 8 โ โ y โ z โ โ โ โ โ
โโโโโผโโโโผโโโโผโโโโค โโโโโดโโโโ โโโโโผโโโโผโโโโค
โ 9 โ10 โ11 โ12 โ โ โ โ โ
โโโโโผโโโโผโโโโผโโโโค โโโโโดโโโโดโโโโ
โ13 โ14 โ15 โ16 โ
โโโโโดโโโโดโโโโดโโโโ
Step 1: im2col - Stretch each receptive field into a column
Position (0,0): [1,2,5,6] ---> Column 0
Position (0,1): [2,3,6,7] ---> Column 1
Position (0,2): [3,4,7,8] ---> Column 2
Position (1,0): [5,6,9,10] ---> Column 3
... and so on for all 9 positions
im2col(X) matrix (4 x 9):
┌────┬────┬────┬────┬────┬────┬────┬────┬────┐
│  1 │  2 │  3 │  5 │  6 │  7 │  9 │ 10 │ 11 │
│  2 │  3 │  4 │  6 │  7 │  8 │ 10 │ 11 │ 12 │
│  5 │  6 │  7 │  9 │ 10 │ 11 │ 13 │ 14 │ 15 │
│  6 │  7 │  8 │ 10 │ 11 │ 12 │ 14 │ 15 │ 16 │
└────┴────┴────┴────┴────┴────┴────┴────┴────┘
Step 2: Flatten filter to row vector
K_flat = [w, x, y, z] (1 x 4)
Step 3: Matrix multiplication
Output = K_flat @ im2col(X) = (1 x 4) @ (4 x 9) = (1 x 9)
Step 4: Reshape output to 3x3
Why this is faster:
- Matrix multiplication is highly optimized (BLAS, GPU acceleration)
- Avoids Python loop overhead
- With multiple filters, it's even more efficient (just more rows in K_flat)
Backprop Through Max Pooling (Routing Gradients)
Max pooling has no learnable parameters, but we still need to backpropagate gradients. The rule is simple:
The gradient only flows through the neuron that "won" (had the max value)
Forward Max Pool:
┌───┬───┐
│ 1 │ 4 │   max = 4 (position [0,1])
├───┼───┤
│ 2 │ 3 │
└───┴───┘

Backward (given dL/dOutput = 0.5):
┌─────┬─────┐
│  0  │ 0.5 │   Only the winning position gets the gradient
├─────┼─────┤
│  0  │  0  │
└─────┴─────┘
This is called "gradient routing" - we need to remember
which position won during forward pass.
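For a single 2x2 patch, the routing looks like this in NumPy (a sketch of the idea, not the batched layer you'll build later):

import numpy as np

patch = np.array([[1.0, 4.0],
                  [2.0, 3.0]])
d_output = 0.5

d_patch = np.zeros_like(patch)
winner = np.unravel_index(np.argmax(patch), patch.shape)   # (0, 1)
d_patch[winner] = d_output
print(d_patch)   # [[0.  0.5]
                 #  [0.  0. ]]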
Modern CNN Architectures Overview
Understanding history helps you appreciate design choices:
───────────────────────────────────────────────────────────────────────────
                       CNN Architecture Evolution
───────────────────────────────────────────────────────────────────────────

LeNet-5 (1998) - Yann LeCun
├── 2 conv layers, 2 pooling, 3 dense
├── Designed for 32x32 grayscale digits
└── ~60K parameters

AlexNet (2012) - Krizhevsky, Sutskever, Hinton
├── First deep CNN to win ImageNet (15.3% error, previous was 26%)
├── 5 conv layers, 3 dense layers
├── Used ReLU (not sigmoid/tanh) and dropout
└── ~60M parameters, trained on GPU

VGGNet (2014) - Simonyan, Zisserman
├── Key insight: stack many 3x3 convs (better than fewer large ones)
├── 16-19 layers, very uniform architecture
└── ~138M parameters

ResNet (2015) - He et al.
├── Skip connections allow training 100+ layer networks
├── Solved the vanishing gradient problem
└── Still a state-of-the-art baseline today
───────────────────────────────────────────────────────────────────────────
Real World Outcome
When you complete this project and run your CNN trainer, you will see:
$ python train_cnn.py
───────────────────────────────────────────────────────────────────────────
     CNN From Scratch - MNIST Classifier
     Implemented with NumPy only (no frameworks!)
───────────────────────────────────────────────────────────────────────────

Architecture:
───────────────────────────────────────────────────────────────────────
  Input: (1, 28, 28)
        │
        ▼
  Conv2D(1 -> 32, 3x3, stride=1) + ReLU
  Output: (32, 26, 26)          Parameters: 320
        │
        ▼
  MaxPool2D(2x2, stride=2)
  Output: (32, 13, 13)          Parameters: 0
        │
        ▼
  Conv2D(32 -> 64, 3x3, stride=1) + ReLU
  Output: (64, 11, 11)          Parameters: 18,496
        │
        ▼
  MaxPool2D(2x2, stride=2)
  Output: (64, 5, 5)            Parameters: 0
        │
        ▼
  Flatten
  Output: (1600,)
        │
        ▼
  Dense(1600 -> 128) + ReLU
  Output: (128,)                Parameters: 204,928
        │
        ▼
  Dense(128 -> 10) + Softmax
  Output: (10,)                 Parameters: 1,290
───────────────────────────────────────────────────────────────────────

Total Parameters: 225,034
───────────────────────────────────────────────────────────────────────
Compare to Dense Network (Project 8): 500,000+ parameters
CNN uses 55% fewer parameters!
───────────────────────────────────────────────────────────────────────

Loading MNIST dataset...
  Training samples: 60,000
  Test samples: 10,000

Training Configuration:
  Batch size: 64
  Learning rate: 0.01
  Optimizer: SGD with momentum (0.9)

───────────────────────────────────────────────────────────────────────────
Training Progress:
───────────────────────────────────────────────────────────────────────────

Epoch  1/10 [████████████████████████████████████████] 938/938
  Train Loss: 0.2341   Train Acc: 92.3%   Time: 45.2s
  Test Loss: 0.0892    Test Acc: 97.2%

Epoch  2/10 [████████████████████████████████████████] 938/938
  Train Loss: 0.0812   Train Acc: 97.5%   Time: 44.8s
  Test Loss: 0.0654    Test Acc: 98.0%

Epoch  3/10 [████████████████████████████████████████] 938/938
  Train Loss: 0.0543   Train Acc: 98.3%   Time: 44.9s
  Test Loss: 0.0521    Test Acc: 98.4%

Epoch  4/10 [████████████████████████████████████████] 938/938
  Train Loss: 0.0398   Train Acc: 98.7%   Time: 45.1s
  Test Loss: 0.0478    Test Acc: 98.6%

Epoch  5/10 [████████████████████████████████████████] 938/938
  Train Loss: 0.0312   Train Acc: 99.0%   Time: 45.0s
  Test Loss: 0.0412    Test Acc: 98.8%

... [epochs 6-9] ...

Epoch 10/10 [████████████████████████████████████████] 938/938
  Train Loss: 0.0098   Train Acc: 99.7%   Time: 44.7s
  Test Loss: 0.0356    Test Acc: 99.1%

───────────────────────────────────────────────────────────────────────────
Training Complete!
───────────────────────────────────────────────────────────────────────────

Final Results:
  Test Accuracy: 99.1% (9,910 / 10,000 correct)

Comparison:
┌─────────────────────┬────────────┬───────────────┬──────────────────┐
│ Model               │ Parameters │ Test Accuracy │ Improvement      │
├─────────────────────┼────────────┼───────────────┼──────────────────┤
│ Dense (Project 8)   │ ~500,000   │ 97.5%         │ baseline         │
│ CNN (This Project)  │ ~225,000   │ 99.1%         │ +1.6% acc, 55%   │
│                     │            │               │ fewer params     │
└─────────────────────┴────────────┴───────────────┴──────────────────┘

Visualizing learned filters...
Saved: learned_filters_conv1.png

First layer filters (32 x 3x3):
───────────────────────────────────────────────────────────────────────
  [Edge |] [Edge -] [Edge /] [Blob] [Corner] [Texture] ...

  The network automatically learned edge detectors!
  Compare to your hand-coded kernels from Project 7.
───────────────────────────────────────────────────────────────────────

Testing translation invariance...
  Original digit "7" at center:  Prediction: 7 (99.8% confidence)
  Same "7" shifted 5px right:    Prediction: 7 (99.6% confidence)
  Same "7" shifted 5px down:     Prediction: 7 (99.4% confidence)

  Dense network on shifted images:
  Same "7" shifted 5px right:    Prediction: 1 (45% confidence)  ✗ FAIL

CNN maintains recognition regardless of position!

Model saved to: cnn_mnist_model.npz
───────────────────────────────────────────────────────────────────────────
This output demonstrates:
- Dramatically better accuracy than dense networks
- Fewer parameters (efficiency through convolution)
- Automatic feature learning (no hand-coded filters needed)
- Translation invariance (the key property of CNNs)
Solution Architecture
Core Classes
Your implementation needs these key components:
───────────────────────────────────────────────────────────────────────────
                        CNN Architecture Overview
───────────────────────────────────────────────────────────────────────────

class Conv2D:
    """Convolutional layer with learnable filters"""
    - __init__(in_channels, out_channels, kernel_size, stride, padding)
    - forward(x) -> feature_maps
    - backward(d_out) -> d_input, stores d_weights/d_bias
    - im2col(x) -> col_matrix (for efficient forward)
    - col2im(col, shape) -> image (for efficient backward)

class MaxPool2D:
    """Max pooling layer (no learnable parameters)"""
    - __init__(pool_size, stride)
    - forward(x) -> pooled_output, stores max_indices
    - backward(d_out) -> d_input (gradient routing)

class Flatten:
    """Reshape 3D feature maps to 1D vector"""
    - forward(x) -> flattened
    - backward(d_out) -> reshaped to original

class Dense:
    """Fully connected layer (from Project 8)"""
    - forward(x) -> output
    - backward(d_out) -> d_input, stores d_weights/d_bias

class ReLU:
    """ReLU activation"""
    - forward(x) -> max(0, x)
    - backward(d_out) -> d_out * (x > 0)

class Softmax:
    """Softmax for final layer"""
    - forward(x) -> probabilities
    - backward(d_out) -> gradients

class CNN:
    """Container that chains layers together"""
    - __init__(layers: List[Layer])
    - forward(x) -> prediction
    - backward(d_loss) -> propagates gradients
    - update_params(learning_rate)
───────────────────────────────────────────────────────────────────────────
Data Flow Through the Network
Input: (batch, 1, 28, 28)
        │
        ▼
┌──────────────┐
│    Conv2D    │  32 filters of 3x3
│   (1 -> 32)  │
└──────┬───────┘
       │ (batch, 32, 26, 26)
       ▼
┌──────────────┐
│     ReLU     │  max(0, x)
└──────┬───────┘
       │ (batch, 32, 26, 26)
       ▼
┌──────────────┐
│  MaxPool2D   │  2x2, stride 2
└──────┬───────┘
       │ (batch, 32, 13, 13)
       ▼
┌──────────────┐
│    Conv2D    │  64 filters of 3x3
│  (32 -> 64)  │
└──────┬───────┘
       │ (batch, 64, 11, 11)
       ▼
┌──────────────┐
│     ReLU     │
└──────┬───────┘
       │ (batch, 64, 11, 11)
       ▼
┌──────────────┐
│  MaxPool2D   │  2x2, stride 2
└──────┬───────┘
       │ (batch, 64, 5, 5)
       ▼
┌──────────────┐
│   Flatten    │  Reshape to 1D
└──────┬───────┘
       │ (batch, 1600)
       ▼
┌──────────────┐
│    Dense     │  1600 -> 128
│    + ReLU    │
└──────┬───────┘
       │ (batch, 128)
       ▼
┌──────────────┐
│    Dense     │  128 -> 10
│   + Softmax  │
└──────┬───────┘
       │ (batch, 10)
       ▼
  Predictions
Phased Implementation Guide
Phase 1: Conv2D Forward Pass (Days 1-3)
Goal: Implement the forward convolution using nested loops first
Start with the simplest possible implementation:
import numpy as np

def conv2d_forward_naive(x, weights, bias):
    """
    Naive convolution implementation with loops.

    Args:
        x: Input of shape (batch, in_channels, H, W)
        weights: Filters of shape (out_channels, in_channels, FH, FW)
        bias: Bias of shape (out_channels,)

    Returns:
        Output of shape (batch, out_channels, H_out, W_out)
    """
    batch, in_channels, H, W = x.shape
    out_channels, _, FH, FW = weights.shape
    H_out = H - FH + 1
    W_out = W - FW + 1
    output = np.zeros((batch, out_channels, H_out, W_out))

    for b in range(batch):
        for oc in range(out_channels):
            for i in range(H_out):
                for j in range(W_out):
                    # Extract patch
                    patch = x[b, :, i:i+FH, j:j+FW]
                    # Convolve: element-wise multiply and sum
                    output[b, oc, i, j] = np.sum(patch * weights[oc]) + bias[oc]
    return output
Test it: Create a 5x5 input, 3x3 filter with known values, and verify output matches hand calculation.
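For example, an "identity" filter (all zeros except a 1 at the center) should reproduce the 3x3 interior of the image exactly; a hand-checkable sketch:

import numpy as np

x = np.arange(25, dtype=float).reshape(1, 1, 5, 5)   # one image, one channel
w = np.zeros((1, 1, 3, 3))
w[0, 0, 1, 1] = 1.0                                  # identity filter: passes the center pixel through
b = np.zeros(1)

out = conv2d_forward_naive(x, w, b)
assert np.allclose(out[0, 0], x[0, 0, 1:4, 1:4])     # 3x3 interior of the 5x5 image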
Checkpoint: Can perform forward convolution (slowly) on small inputs.
Phase 2: im2col Transformation (Days 4-6)
Goal: Convert convolution to matrix multiplication for speed
The im2col function stretches each receptive field into a column:
def im2col(x, FH, FW, stride=1, padding=0):
    """
    Transform input into column matrix for efficient convolution.

    Args:
        x: Input of shape (batch, C, H, W)
        FH, FW: Filter height and width
        stride: Stride of convolution
        padding: Zero padding

    Returns:
        col: Column matrix of shape (C*FH*FW, batch*H_out*W_out)
    """
    batch, C, H, W = x.shape

    # Apply padding if needed
    if padding > 0:
        x = np.pad(x, ((0,0), (0,0), (padding,padding), (padding,padding)))
        H += 2 * padding
        W += 2 * padding

    H_out = (H - FH) // stride + 1
    W_out = (W - FW) // stride + 1

    # Create output array
    col = np.zeros((C * FH * FW, batch * H_out * W_out))
    col_idx = 0
    for b in range(batch):
        for i in range(H_out):
            for j in range(W_out):
                # Extract receptive field and flatten
                patch = x[b, :, i*stride:i*stride+FH, j*stride:j*stride+FW]
                col[:, col_idx] = patch.flatten()
                col_idx += 1
    return col
The fast convolution:
def conv2d_forward_fast(x, weights, bias):
    batch, in_channels, H, W = x.shape
    out_channels, _, FH, FW = weights.shape
    H_out = H - FH + 1
    W_out = W - FW + 1

    # im2col: (C*FH*FW, batch*H_out*W_out)
    col = im2col(x, FH, FW)

    # Reshape weights: (out_channels, in_channels*FH*FW)
    W_col = weights.reshape(out_channels, -1)

    # Matrix multiplication!
    # (out_channels, C*FH*FW) @ (C*FH*FW, batch*H_out*W_out)
    #   = (out_channels, batch*H_out*W_out)
    output = W_col @ col + bias.reshape(-1, 1)

    # Reshape to (batch, out_channels, H_out, W_out)
    output = output.reshape(out_channels, batch, H_out, W_out)
    output = output.transpose(1, 0, 2, 3)
    return output
Verify: Both naive and fast implementations should give identical results. Profile the speed difference.
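A sketch of that verification, assuming both implementations from above are in scope:

import time
import numpy as np

x = np.random.randn(8, 3, 32, 32)
w = np.random.randn(16, 3, 3, 3)
b = np.random.randn(16)

t0 = time.perf_counter()
out_naive = conv2d_forward_naive(x, w, b)
t1 = time.perf_counter()
out_fast = conv2d_forward_fast(x, w, b)
t2 = time.perf_counter()

assert np.allclose(out_naive, out_fast)
print(f"naive: {t1-t0:.3f}s   im2col: {t2-t1:.3f}s")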
Checkpoint: Fast forward pass using matrix multiplication.
Phase 3: Conv2D Backward Pass (Days 7-12)
Goal: Implement gradient computation through convolution
This is the hardest part. We need two gradients:
- dL/dX (to backpropagate to earlier layers)
- dL/dW (to update the filter weights)
Understanding the gradient of the filter (dL/dW):
Forward: Y[i,j] = sum_{a,b} X[i+a, j+b] * W[a, b]
Backward: dL/dW[a,b] = sum_{i,j} dL/dY[i,j] * X[i+a, j+b]
This is a convolution of X with dL/dY!
Understanding the gradient of the input (dL/dX):
Each input X[a,b] contributes to multiple outputs Y[i,j]
wherever the filter overlapped that position.
dL/dX[a,b] = sum over all (i,j) where X[a,b] was used:
dL/dY[i,j] * W[a-i, b-j]
This is a "full" convolution of dL/dY with W flipped!
Implementation sketch:
def conv2d_backward(d_out, x, weights, col):
    """
    Backward pass for convolution.

    Args:
        d_out: Gradient from next layer, shape (batch, out_channels, H_out, W_out)
        x: Original input, shape (batch, in_channels, H, W)
        weights: Filter weights, shape (out_channels, in_channels, FH, FW)
        col: Cached im2col matrix from forward pass

    Returns:
        d_x: Gradient w.r.t. input
        d_w: Gradient w.r.t. weights
        d_b: Gradient w.r.t. bias
    """
    batch, out_channels, H_out, W_out = d_out.shape
    _, in_channels, FH, FW = weights.shape

    # Gradient of bias: sum over batch and spatial dimensions
    d_b = np.sum(d_out, axis=(0, 2, 3))

    # Reshape d_out for matrix multiplication
    d_out_col = d_out.transpose(1, 0, 2, 3).reshape(out_channels, -1)

    # Gradient of weights: d_out convolved with input
    # (out_channels, batch*H_out*W_out) @ (batch*H_out*W_out, in_channels*FH*FW)
    d_w = d_out_col @ col.T
    d_w = d_w.reshape(weights.shape)

    # Gradient of input: need col2im
    W_col = weights.reshape(out_channels, -1)
    d_col = W_col.T @ d_out_col  # (in_channels*FH*FW, batch*H_out*W_out)

    # col2im to convert back to image shape
    d_x = col2im(d_col, x.shape, FH, FW)
    return d_x, d_w, d_b
The col2im function (inverse of im2col):
def col2im(col, x_shape, FH, FW, stride=1, padding=0):
    """
    Inverse of im2col: accumulate gradients back to image format.

    Key insight: Multiple columns contributed to the same input pixel,
    so we SUM the gradients (not replace).
    """
    batch, C, H, W = x_shape
    H_out = (H - FH) // stride + 1
    W_out = (W - FW) // stride + 1

    dx = np.zeros((batch, C, H, W))
    col_idx = 0
    for b in range(batch):
        for i in range(H_out):
            for j in range(W_out):
                # Get the column gradient
                patch_grad = col[:, col_idx].reshape(C, FH, FW)
                # ACCUMULATE into the appropriate position
                # (note the stride, so the indices match im2col)
                dx[b, :, i*stride:i*stride+FH, j*stride:j*stride+FW] += patch_grad
                col_idx += 1
    return dx
Checkpoint: Backward pass computes gradients. Verify with gradient checking (next phase).
Phase 4: MaxPool Forward Pass (Day 13)
Goal: Implement max pooling
class MaxPool2D:
    def __init__(self, pool_size=2, stride=2):
        self.pool_size = pool_size
        self.stride = stride
        self.max_indices = None   # Store for backward
        self.input_shape = None   # Cached so backward(d_out) matches the other layers

    def forward(self, x):
        """
        Max pooling forward pass.

        Args:
            x: Input of shape (batch, C, H, W)

        Returns:
            Output of shape (batch, C, H//pool, W//pool)
        """
        self.input_shape = x.shape
        batch, C, H, W = x.shape
        PH = PW = self.pool_size
        S = self.stride
        H_out = (H - PH) // S + 1
        W_out = (W - PW) // S + 1
        output = np.zeros((batch, C, H_out, W_out))
        self.max_indices = np.zeros((batch, C, H_out, W_out, 2), dtype=int)

        for b in range(batch):
            for c in range(C):
                for i in range(H_out):
                    for j in range(W_out):
                        h_start = i * S
                        w_start = j * S
                        patch = x[b, c, h_start:h_start+PH, w_start:w_start+PW]
                        # Find max and its position
                        max_val = np.max(patch)
                        max_pos = np.unravel_index(np.argmax(patch), (PH, PW))
                        output[b, c, i, j] = max_val
                        self.max_indices[b, c, i, j] = [h_start + max_pos[0],
                                                        w_start + max_pos[1]]
        return output
Checkpoint: MaxPool reduces spatial dimensions by half.
Phase 5: MaxPool Backward Pass (Day 14)
Goal: Route gradients through max positions only
def backward(self, d_out):
    """
    Max pooling backward pass.
    The gradient only flows through the position that had the max value.
    (Uses self.input_shape cached during forward, so the interface matches
    the other layers: backward(d_out) -> d_input.)
    """
    batch, C, H, W = self.input_shape
    d_x = np.zeros((batch, C, H, W))
    _, _, H_out, W_out = d_out.shape

    for b in range(batch):
        for c in range(C):
            for i in range(H_out):
                for j in range(W_out):
                    # Get the position that won during forward
                    max_h, max_w = self.max_indices[b, c, i, j]
                    # Route the gradient to that position
                    d_x[b, c, max_h, max_w] += d_out[b, c, i, j]
    return d_x
Checkpoint: Gradients flow only through max positions.
Phase 6: Flatten Layer (Day 15)
Goal: Reshape 3D feature maps to 1D for dense layers
class Flatten:
    def __init__(self):
        self.input_shape = None

    def forward(self, x):
        """Flatten all dimensions except batch."""
        self.input_shape = x.shape
        batch = x.shape[0]
        return x.reshape(batch, -1)

    def backward(self, d_out):
        """Reshape gradient back to original shape."""
        return d_out.reshape(self.input_shape)
Checkpoint: Can connect conv layers to dense layers.
Phase 7: Connect to Existing Dense Layers (Days 16-17)
Goal: Integrate Dense and activation layers from Project 8
You should already have Dense and ReLU layers from Project 8. Make sure they have a consistent interface:
class Dense:
    def __init__(self, in_features, out_features):
        # He initialization (good for ReLU)
        self.weights = np.random.randn(in_features, out_features) * np.sqrt(2.0 / in_features)
        self.bias = np.zeros(out_features)
        self.d_weights = None
        self.d_bias = None
        self.input_cache = None

    def forward(self, x):
        self.input_cache = x
        return x @ self.weights + self.bias

    def backward(self, d_out):
        self.d_weights = self.input_cache.T @ d_out
        self.d_bias = np.sum(d_out, axis=0)
        return d_out @ self.weights.T

class ReLU:
    def __init__(self):
        self.mask = None

    def forward(self, x):
        self.mask = (x > 0)
        return np.maximum(0, x)

    def backward(self, d_out):
        return d_out * self.mask
Checkpoint: All layer types have forward/backward methods.
Phase 8: Build the Full CNN (Days 18-19)
Goal: Chain all layers together
class CNN:
    def __init__(self):
        self.layers = [
            Conv2D(1, 32, kernel_size=3),
            ReLU(),
            MaxPool2D(2, 2),
            Conv2D(32, 64, kernel_size=3),
            ReLU(),
            MaxPool2D(2, 2),
            Flatten(),
            Dense(64 * 5 * 5, 128),   # 64 channels, 5x5 spatial
            ReLU(),
            Dense(128, 10),
            Softmax()
        ]

    def forward(self, x):
        for layer in self.layers:
            x = layer.forward(x)
        return x

    def backward(self, d_loss):
        for layer in reversed(self.layers):
            d_loss = layer.backward(d_loss)

    def update_params(self, lr):
        for layer in self.layers:
            if hasattr(layer, 'weights'):
                layer.weights -= lr * layer.d_weights
                layer.bias -= lr * layer.d_bias
Checkpoint: Can do a full forward-backward pass.
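A quick smoke test for this checkpoint (a sketch: the random "gradient" only exercises shapes and plumbing, not learning):

import numpy as np

cnn = CNN()
x = np.random.randn(4, 1, 28, 28)   # batch of 4 fake images
probs = cnn.forward(x)
assert probs.shape == (4, 10)

d_loss = np.random.randn(4, 10)     # stand-in for the real loss gradient
cnn.backward(d_loss)
cnn.update_params(0.01)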
Phase 9: Train on MNIST (Days 20-21)
Goal: Train the CNN and achieve 99%+ accuracy
def train_cnn():
    # Load MNIST
    X_train, y_train, X_test, y_test = load_mnist()

    # Reshape to (batch, 1, 28, 28) for CNN
    X_train = X_train.reshape(-1, 1, 28, 28) / 255.0
    X_test = X_test.reshape(-1, 1, 28, 28) / 255.0

    cnn = CNN()
    batch_size = 64
    learning_rate = 0.01
    epochs = 10

    for epoch in range(epochs):
        # Shuffle data
        indices = np.random.permutation(len(X_train))
        total_loss = 0
        correct = 0

        for i in range(0, len(X_train), batch_size):
            batch_idx = indices[i:i+batch_size]
            X_batch = X_train[batch_idx]
            y_batch = y_train[batch_idx]

            # Forward pass
            predictions = cnn.forward(X_batch)

            # Compute loss and accuracy
            loss = cross_entropy_loss(predictions, y_batch)
            total_loss += loss * len(batch_idx)
            correct += np.sum(np.argmax(predictions, axis=1) == y_batch)

            # Backward pass
            d_loss = cross_entropy_gradient(predictions, y_batch)
            cnn.backward(d_loss)

            # Update weights
            cnn.update_params(learning_rate)

        train_acc = correct / len(X_train)
        print(f"Epoch {epoch+1}: Loss={total_loss/len(X_train):.4f}, Acc={train_acc:.2%}")

        # Test accuracy
        test_pred = cnn.forward(X_test)
        test_acc = np.mean(np.argmax(test_pred, axis=1) == y_test)
        print(f"  Test Acc={test_acc:.2%}")
Checkpoint: Model achieves 99%+ accuracy on MNIST.
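The training loop above assumes cross_entropy_loss and cross_entropy_gradient carried over from Project 8. If you don't have them handy, a minimal sketch for softmax outputs and integer labels:

import numpy as np

def cross_entropy_loss(probs, labels, eps=1e-12):
    """Mean negative log-likelihood; probs are softmax outputs, labels are ints."""
    n = probs.shape[0]
    return -np.mean(np.log(probs[np.arange(n), labels] + eps))

def cross_entropy_gradient(probs, labels):
    """Gradient w.r.t. the pre-softmax logits: (probs - onehot) / n.
    This uses the combined softmax + cross-entropy shortcut, so the Softmax
    layer's backward should pass this gradient through unchanged."""
    n = probs.shape[0]
    grad = probs.copy()
    grad[np.arange(n), labels] -= 1.0
    return grad / n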
Questions to Guide Your Design
Before writing code, think through these design questions:
Dimension Tracking
- What are the output dimensions after each layer? Given a 28x28 input, trace through every layer. This will catch most bugs early.
- How do you handle the batch dimension? All operations must work on batches, not single images.
Memory Considerations
- What do you need to cache for the backward pass? The im2col matrix? Max indices? Input values?
- How much memory does training take? With 64 images of 28x28 and 32 filters, how big is the im2col matrix?
Efficiency
- Where are the bottlenecks? im2col is expensive. Can you optimize it?
- Can you vectorize the pooling operations? The naive loop implementation is slow.
Gradient Computation
- How do you handle multiple input channels in conv backward? Each output channel has gradients from all input channels.
- What happens at the edges of the image? With no padding, edge pixels contribute to fewer outputs.
Thinking Exercise
Before implementing, trace the backward pass through a tiny example by hand:
Setup:
- Input: 4x4 single-channel image
- Filter: 2x2, single filter
- Output: 3x3 feature map
Input X:             Filter W:      Output Y:
┌───┬───┬───┬───┐    ┌───┬───┐      ┌───┬───┬───┐
│ 1 │ 2 │ 3 │ 4 │    │ a │ b │      │Y00│Y01│Y02│
├───┼───┼───┼───┤    ├───┼───┤      ├───┼───┼───┤
│ 5 │ 6 │ 7 │ 8 │    │ c │ d │      │Y10│Y11│Y12│
├───┼───┼───┼───┤    └───┴───┘      ├───┼───┼───┤
│ 9 │10 │11 │12 │                   │Y20│Y21│Y22│
├───┼───┼───┼───┤                   └───┴───┴───┘
│13 │14 │15 │16 │
└───┴───┴───┴───┘
Forward pass equations:
Y[0,0] = 1*a + 2*b + 5*c + 6*d
Y[0,1] = 2*a + 3*b + 6*c + 7*d
Y[0,2] = 3*a + 4*b + 7*c + 8*d
... (continue for all 9 outputs)
Your task: Given dL/dY (the gradient of loss w.r.t. each output), derive:
1. dL/da, dL/db, dL/dc, dL/dd (gradients for the filter weights)
2. dL/dX[1,1] (gradient for the input pixel at position (1,1), which is value 6)
Hint for #1: dL/da = sum of dL/dY[i,j] * (X element that was multiplied by 'a' at that position)
Hint for #2: Which output positions Y[i,j] used input X[1,1]=6? That input contributes to the gradient from each of those positions.
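After deriving the answers on paper, a few lines of NumPy let you check yourself (the upstream gradient dL/dY and the filter values here are arbitrary examples):

import numpy as np

X = np.arange(1, 17, dtype=float).reshape(4, 4)    # the 4x4 input above
dY = np.ones((3, 3))                               # pick any dL/dY you like

# Filter gradients: each weight saw a shifted 3x3 window of X
dW = np.array([[np.sum(dY * X[a:a+3, b:b+3]) for b in range(2)]
               for a in range(2)])                 # [[dL/da, dL/db], [dL/dc, dL/dd]]

# dL/dX[1,1]: X[1,1]=6 was used by Y[0,0], Y[0,1], Y[1,0], Y[1,1],
# multiplied by d, c, b, a respectively (the filter "flipped")
a, b, c, d = 1.0, 2.0, 3.0, 4.0                    # example filter values
dX_11 = dY[0, 0]*d + dY[0, 1]*c + dY[1, 0]*b + dY[1, 1]*a
print(dW, dX_11)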
Testing Strategy
Gradient Checking Is Essential
The backward pass is complex enough that bugs are almost guaranteed. Use numerical gradient checking:
def gradient_check(layer, x, epsilon=1e-5):
    """
    Verify analytical gradients match numerical gradients.
    """
    # Forward pass
    output = layer.forward(x)

    # Create random gradient from "next layer"
    d_out = np.random.randn(*output.shape)

    # Analytical gradient
    d_x_analytical = layer.backward(d_out)

    # Numerical gradient
    d_x_numerical = np.zeros_like(x)
    for i in np.ndindex(x.shape):
        x_plus = x.copy()
        x_plus[i] += epsilon
        out_plus = layer.forward(x_plus)

        x_minus = x.copy()
        x_minus[i] -= epsilon
        out_minus = layer.forward(x_minus)

        # Gradient = change in loss / change in input
        d_x_numerical[i] = np.sum((out_plus - out_minus) * d_out) / (2 * epsilon)

    # Compare
    diff = np.linalg.norm(d_x_analytical - d_x_numerical)
    diff /= np.linalg.norm(d_x_analytical) + np.linalg.norm(d_x_numerical)
    print(f"Relative difference: {diff}")
    assert diff < 1e-5, "Gradient check failed!"
Test each layer individually:
- Test Conv2D backward with a tiny input (4x4)
- Test MaxPool backward
- Test the full network on one training example
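For example, checking Conv2D on a tiny input (assuming your Conv2D follows the layer interface above; keep inputs small, since numerical checking perturbs every element):

import numpy as np

layer = Conv2D(in_channels=1, out_channels=2, kernel_size=2)
x = np.random.randn(1, 1, 4, 4)
gradient_check(layer, x)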
Unit Tests
def test_conv2d_output_shape():
    layer = Conv2D(in_channels=1, out_channels=32, kernel_size=3)
    x = np.random.randn(4, 1, 28, 28)  # batch of 4
    out = layer.forward(x)
    assert out.shape == (4, 32, 26, 26), f"Expected (4, 32, 26, 26), got {out.shape}"

def test_maxpool_reduces_size():
    layer = MaxPool2D(pool_size=2, stride=2)
    x = np.random.randn(4, 32, 26, 26)
    out = layer.forward(x)
    assert out.shape == (4, 32, 13, 13), f"Expected (4, 32, 13, 13), got {out.shape}"

def test_im2col_correctness():
    """Verify im2col matches naive convolution."""
    x = np.random.randn(1, 1, 5, 5)
    w = np.random.randn(1, 1, 3, 3)
    b = np.zeros(1)
    out_naive = conv2d_forward_naive(x, w, b)
    out_fast = conv2d_forward_fast(x, w, b)
    assert np.allclose(out_naive, out_fast), "im2col convolution doesn't match naive!"
Common Pitfalls and Debugging Tips
1. Dimension Mismatches
Symptom: ValueError: shapes not aligned during matrix multiplication
Cause: im2col produces wrong shape, or reshape is incorrect
Fix: Print shapes at every step. The im2col output should be:
- Rows: in_channels * filter_height * filter_width
- Columns: batch * output_height * output_width
2. Forgetting to Accumulate Gradients in col2im
Symptom: Training diverges or accuracy stays at 10%
Cause: Using = instead of += in col2im
# WRONG:
dx[b, :, i:i+FH, j:j+FW] = patch_grad
# RIGHT:
dx[b, :, i:i+FH, j:j+FW] += patch_grad
Each input pixel contributes to multiple outputs, so gradients must be accumulated.
3. Transpose Confusion in Backward Pass
Symptom: Gradient check fails
Cause: The shapes in matrix multiplication are wrong
Fix: Write out the shapes explicitly:
# dL/dW = dL/dY (transposed somehow) @ X (transposed somehow)
# Work out the shapes:
# W shape: (out_channels, in_channels, FH, FW)
# Need dW to be this shape
# d_out: (batch, out_channels, H_out, W_out)
# col: (in_channels*FH*FW, batch*H_out*W_out)
4. Max Pooling Gradient Routing Errors
Symptom: Gradients are wrong, but only when pooling is involved
Cause: Max indices were stored incorrectly, or not accounting for stride
Fix: Verify max indices point to the actual maximum values:
# After forward pass, verify:
for each (i,j) in output:
assert x[max_indices[i,j]] == output[i,j]
5. Learning Rate Issues
Symptom: Loss explodes or stays constant
Cause: Learning rate wrong for convolution (often needs to be smaller than for dense)
Fix: Start with lr=0.001 for conv layers. The gradients through convolution can be large because many paths contribute to each gradient.
6. Numerical Stability in Softmax
Symptom: NaN values during training
Cause: Softmax overflow
Fix: Subtract max before exponentiation:
def softmax(x):
    x_stable = x - np.max(x, axis=1, keepdims=True)  # subtract row max for stability
    exp_x = np.exp(x_stable)
    return exp_x / np.sum(exp_x, axis=1, keepdims=True)
Interview Questions
If you build a CNN from scratch, expect these questions:
Conceptual Questions
- "Explain the difference between valid and same padding."
  - Valid: no padding, output smaller than input
  - Same: pad so output has the same spatial size as input
  - Formula for same padding: P = (F - 1) / 2, where F is the filter size
- "Why do we use small filters (3x3) instead of large ones (7x7)?"
  - Two stacked 3x3 filters have the same receptive field as one 5x5
  - But 2 * (3*3) = 18 params vs 25 params
  - More non-linearities (ReLU between layers)
  - VGGNet proved this empirically
- "What is the receptive field and why does it matter?"
  - The region of input that affects one output pixel
  - Deeper layers have larger receptive fields
  - Determines what context the network can use
- "How does max pooling provide translation invariance?"
  - If a feature shifts slightly, it might still be the max in its pool region
  - Small translations don't change the pooled output
  - But large translations (bigger than the pool size) aren't invariant
Implementation Questions
- "Walk me through the backward pass of convolution."
  - Need dL/dW and dL/dX
  - dL/dW: convolve the input with d_out
  - dL/dX: "full" convolution of d_out with the flipped filter
  - im2col makes this efficient
- "Why is im2col used instead of direct convolution?"
  - Converts convolution to matrix multiplication
  - Matrix multiplication is heavily optimized (BLAS, cuBLAS)
  - Avoids Python loop overhead
  - GPU-friendly
- "How would you implement strided convolution?"
  - In im2col, columns are extracted at stride intervals
  - Skip stride positions when iterating
  - Output size: (W - F) // stride + 1
- "What happens if I forget to store max indices during the forward pass?"
  - Cannot compute the correct backward pass
  - Gradients won't flow to the right input positions
  - Training will fail
Architecture Questions
- "Why do CNNs alternate conv and pooling layers?"
  - Conv: learn features at the current resolution
  - Pool: reduce size, add invariance
  - Alternating builds hierarchy: edges -> textures -> parts -> objects
- "How would you add batch normalization to your CNN?"
  - Add a BN layer after conv, before activation
  - Normalize each channel across batch and spatial dimensions
  - Learnable scale and shift parameters
  - Improves training stability
Hints in Layers
Stuck on implementation? Read only the hint level you need:
Challenge: im2col Is Confusing
Hint Level 1 (Conceptual): Think of im2col as taking each receptive field patch and making it a column in a matrix.
Hint Level 2 (Direction): For a 4x4 input with 2x2 filter, you get 9 positions (3x3 output). Each position is a 2x2 patch = 4 values. So im2col output is 4x9.
Hint Level 3 (Specific): Use np.lib.stride_tricks.as_strided for a fast vectorized version (but be careful with strides!).
Hint Level 4 (Code):
# Fast im2col using stride tricks
def im2col_fast(x, FH, FW, stride=1):
    B, C, H, W = x.shape
    out_h = (H - FH) // stride + 1
    out_w = (W - FW) // stride + 1

    # Use stride tricks to create a view of all patches (no copying)
    shape = (B, C, out_h, out_w, FH, FW)
    strides = (x.strides[0], x.strides[1],
               x.strides[2]*stride, x.strides[3]*stride,
               x.strides[2], x.strides[3])
    patches = np.lib.stride_tricks.as_strided(x, shape=shape, strides=strides)

    # Reshape to (C*FH*FW, B*out_h*out_w)
    return patches.transpose(1, 4, 5, 0, 2, 3).reshape(C*FH*FW, -1)
Challenge: col2im Accumulation
Hint Level 1 (Conceptual): Each input pixel appears in multiple columns of im2col. In col2im, you must add all contributions.
Hint Level 2 (Direction): Use np.add.at for indexed accumulation, which handles the case where the same index appears multiple times.
Hint Level 3 (Specific): Keep track of which input positions each column came from during im2col.
Hint Level 4 (Code):
# col2im by accumulating over output positions
# (np.add.at works similarly for a fully indexed version)
def col2im_fast(col, x_shape, FH, FW, stride=1):
    B, C, H, W = x_shape
    out_h = (H - FH) // stride + 1
    out_w = (W - FW) // stride + 1

    # (C, FH, FW, B, out_h, out_w) -> (B, C, out_h, out_w, FH, FW)
    col_reshaped = col.reshape(C, FH, FW, B, out_h, out_w).transpose(3, 0, 4, 5, 1, 2)
    dx = np.zeros((B, C, H, W))
    for i in range(out_h):
        for j in range(out_w):
            # SUM each patch gradient into its receptive field
            dx[:, :, i*stride:i*stride+FH, j*stride:j*stride+FW] += col_reshaped[:, :, i, j]
    return dx
Challenge: Gradient of Conv Filter
Hint Level 1 (Conceptual): dL/dW is the correlation of the input with the error gradient.
Hint Level 2 (Direction): Itโs actually a convolution where you slide d_out over the input.
Hint Level 3 (Specific): Using im2col, the columns represent input patches. Multiply by the corresponding output gradients.
Hint Level 4 (Code):
# dW = d_out_col @ col.T, then reshape
# d_out_col shape: (out_channels, B*out_h*out_w)
# col shape: (C*FH*FW, B*out_h*out_w)
# Result: (out_channels, C*FH*FW) -> reshape to (out_channels, C, FH, FW)
Extensions and Challenges
1. Add Batch Normalization
Batch normalization stabilizes training and allows higher learning rates:
class BatchNorm2D:
    def __init__(self, num_features, eps=1e-5, momentum=0.1):
        self.gamma = np.ones(num_features)   # Scale
        self.beta = np.zeros(num_features)   # Shift
        self.eps = eps
        self.momentum = momentum
        self.running_mean = np.zeros(num_features)
        self.running_var = np.ones(num_features)

    def forward(self, x, training=True):
        if training:
            mean = x.mean(axis=(0, 2, 3), keepdims=True)
            var = x.var(axis=(0, 2, 3), keepdims=True)
            # Update running statistics
            self.running_mean = (1 - self.momentum) * self.running_mean + self.momentum * mean.squeeze()
            self.running_var = (1 - self.momentum) * self.running_var + self.momentum * var.squeeze()
        else:
            mean = self.running_mean.reshape(1, -1, 1, 1)
            var = self.running_var.reshape(1, -1, 1, 1)
        x_norm = (x - mean) / np.sqrt(var + self.eps)
        return self.gamma.reshape(1, -1, 1, 1) * x_norm + self.beta.reshape(1, -1, 1, 1)
2. Implement Residual Connections (ResNet-style)
Skip connections allow training much deeper networks:
class ResidualBlock:
    def __init__(self, channels):
        self.conv1 = Conv2D(channels, channels, 3, padding=1)
        self.bn1 = BatchNorm2D(channels)
        self.conv2 = Conv2D(channels, channels, 3, padding=1)
        self.bn2 = BatchNorm2D(channels)
        self.relu = ReLU()

    def forward(self, x):
        identity = x  # Save input
        out = self.conv1.forward(x)
        out = self.bn1.forward(out)
        out = self.relu.forward(out)
        out = self.conv2.forward(out)
        out = self.bn2.forward(out)
        out = out + identity  # Skip connection!
        out = self.relu.forward(out)
        return out
3. Try on CIFAR-10
CIFAR-10 has 32x32 color images (3 channels) with 10 classes (airplanes, cars, etc.):
- Modify input channels from 1 to 3
- Likely need more layers/filters for the harder task
- Data augmentation helps: random crops, flips
4. Implement Average Pooling
Alternative to max pooling that takes the mean instead:
class AvgPool2D:
    def forward(self, x):
        # Average over each pool region
        pass

    def backward(self, d_out):
        # Gradient distributed equally to all positions
        # (unlike max pool where only the winner gets gradient)
        pass
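If you want a reference to compare against after attempting it yourself, here is a minimal loop-based sketch assuming the same constructor and interface as MaxPool2D above:

import numpy as np

class AvgPool2DSketch:
    def __init__(self, pool_size=2, stride=2):
        self.pool_size = pool_size
        self.stride = stride
        self.input_shape = None

    def forward(self, x):
        self.input_shape = x.shape
        B, C, H, W = x.shape
        P, S = self.pool_size, self.stride
        H_out, W_out = (H - P) // S + 1, (W - P) // S + 1
        out = np.zeros((B, C, H_out, W_out))
        for i in range(H_out):
            for j in range(W_out):
                # Mean over each pool region (vectorized over batch and channels)
                out[:, :, i, j] = x[:, :, i*S:i*S+P, j*S:j*S+P].mean(axis=(2, 3))
        return out

    def backward(self, d_out):
        B, C, H, W = self.input_shape
        P, S = self.pool_size, self.stride
        d_x = np.zeros((B, C, H, W))
        _, _, H_out, W_out = d_out.shape
        for i in range(H_out):
            for j in range(W_out):
                # Every position in the region receives an equal 1/(P*P) share
                d_x[:, :, i*S:i*S+P, j*S:j*S+P] += d_out[:, :, i:i+1, j:j+1] / (P * P)
        return d_x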
5. Add Dropout
Regularization technique that randomly zeros neurons during training:
class Dropout2D:
    def __init__(self, p=0.5):
        self.p = p
        self.mask = None

    def forward(self, x, training=True):
        if training:
            # Inverted dropout: scale so expected activation is unchanged
            self.mask = (np.random.rand(*x.shape) > self.p) / (1 - self.p)
            return x * self.mask
        return x

    def backward(self, d_out):
        return d_out * self.mask
Real-World Connections
Self-Driving Cars
Teslaโs Autopilot, Waymo, and others use CNNs for:
- Lane detection (pixel classification)
- Object detection (pedestrians, cars, signs)
- Depth estimation from cameras
Your CNN from scratch demonstrates the core technology. Production systems use:
- Much deeper networks (ResNet-50, EfficientNet)
- Multiple camera inputs fused together
- Real-time inference optimization
Medical Imaging
CNNs detect diseases in X-rays, MRIs, and CT scans:
- Diabetic retinopathy detection (Google)
- Skin cancer classification (Stanford)
- COVID-19 detection from chest X-rays
Your CNN teaches the fundamentals used in FDA-approved medical AI devices.
Smartphone Cameras
When your phone applies "portrait mode" or "night mode":
- CNNs segment foreground from background
- CNNs denoise low-light images
- CNNs enhance resolution (super-resolution)
All running on your phoneโs neural processing unit.
Content Moderation
Facebook, YouTube, and Instagram use CNNs to:
- Detect nudity and violence
- Identify copyrighted content
- Filter spam and fake accounts
Billions of images processed daily using architectures that build on what youโre learning.
Books That Will Help
| Book | Relevant Chapters | What You'll Learn |
|---|---|---|
| Deep Learning by Goodfellow, Bengio, Courville | Ch. 9: Convolutional Networks | The theoretical foundation: why CNNs work, receptive fields, invariance properties. The math is rigorous but essential. |
| Deep Learning with Python by François Chollet | Ch. 5: Deep Learning for Computer Vision | Practical intuition for CNN architectures. Written by the creator of Keras. Less math, more insight. |
| Neural Networks and Deep Learning by Michael Nielsen | Ch. 6: Deep Learning | Free online book with excellent visualizations. Good for building intuition before diving into implementation. |
| Grokking Deep Learning by Andrew Trask | Ch. 8, 10: CNNs | Code-first approach that matches our project style. Shows implementations you can learn from. |
| Dive into Deep Learning (d2l.ai) | Ch. 6: Convolutional Neural Networks | Free online book with executable code. Shows both math and implementation side by side. |
Academic Papers Worth Reading
- LeNet-5 (LeCun et al., 1998): The original CNN paper for digit recognition
- AlexNet (Krizhevsky et al., 2012): The paper that started the deep learning revolution
- VGGNet (Simonyan & Zisserman, 2014): Shows power of small 3x3 filters
- ResNet (He et al., 2015): Skip connections for very deep networks
Self-Assessment Checklist
Before considering this project complete, verify you can:
Implementation
- Implement Conv2D forward pass with correct output dimensions
- Implement im2col transformation for efficient convolution
- Implement Conv2D backward pass (gradient check passes)
- Implement MaxPool2D forward pass with max index tracking
- Implement MaxPool2D backward pass with gradient routing
- Connect conv layers to dense layers via Flatten
- Train the full CNN on MNIST to 99%+ accuracy
Understanding
- Explain why CNNs are more efficient than dense networks for images
- Calculate output dimensions given input, filter, stride, and padding
- Trace the backward pass of convolution for a simple example by hand
- Explain how max pooling provides translation invariance
- Describe the relationship between receptive field and network depth
Debugging
- Use gradient checking to verify backward passes
- Debug dimension mismatches in matrix operations
- Identify and fix numerical stability issues
Extensions
- Explain how batch normalization would integrate into your CNN
- Describe how residual connections (skip connections) work
- Compare your implementationโs performance to a framework (PyTorch/TensorFlow)
Resources
Primary References
- Stanford CS231n: Convolutional Neural Networks - Excellent notes on backprop through conv layers
- Deep Learning Book Chapter 9 - Theoretical foundation
- Andrej Karpathy's Conv Net Demo - Visual interactive demo
Implementation References
- im2col Explained - Detailed walkthrough
- Caffeโs im2col - Reference implementation
Videos
- 3Blue1Brown: But what is a convolution? - Beautiful visual explanation
- Andrew Ng: Convolutional Neural Networks - Coursera course
Datasets
- MNIST handwritten digits (60,000 training / 10,000 test images) - used throughout this project
- CIFAR-10 (32x32 color images, 10 classes) - for the extension challenge
Key Insights
Convolution is parameter sharing. Instead of learning separate weights for each pixel position, we learn one set of weights (the filter) and apply it everywhere. This single insight reduces parameters by orders of magnitude and gives CNNs their power.
im2col is the trick that makes CNNs fast. By reformatting the convolution as matrix multiplication, we leverage decades of linear algebra optimization. Every GPU CNN implementation uses this trick.
The backward pass through convolution is itself a convolution. Once you see this, the math becomes elegant: forward is convolution with the filter, backward is convolution with the flipped filter (plus some transpositions).
Translation invariance isn't magic - it's architecture. Shared weights mean the same features are detected everywhere. Pooling provides local invariance. Together, they let CNNs recognize objects regardless of position.
After completing this project, you will have implemented the core architecture that powers computer vision. From self-driving cars to medical imaging, CNNs are everywhere. You now understand not just how to use them, but how they work at the byte level. Project 10 (RNN) will show you how to extend these ideas to sequences and time.