Project 9: The CNN From Scratch (Pooling & Strides)
Sprint: AI Prediction & Neural Networks: From Math to Machine
Focus Area: Convolutional Neural Networks and Spatial Invariance
Project Metadata
| Attribute | Value |
|---|---|
| Difficulty | Level 5: Master |
| Main Programming Language | Python (NumPy) |
| Alternative Languages | C, Rust, Julia |
| Coolness Level | Level 5: Pure Magic |
| Business Potential | 4. Open Core (Custom Vision Hardware) |
| Knowledge Area | Convolutional Neural Networks |
| Software/Tools | NumPy, Matplotlib, MNIST dataset |
| Main Book | "Deep Learning" by Goodfellow, Bengio, Courville - Ch. 9 |
| Estimated Time | 2-3 Weeks |
| Prerequisites | Project 7 (Kernel Explorer), Project 8 (MNIST Dense) |
What You Will Build
You will upgrade your MNIST handwritten digit classifier (from Project 8) by replacing the naive "flatten the image" approach with proper Convolutional Layers and Max Pooling Layers. You will implement the complete forward and backward pass for convolution operations manually, including the notoriously tricky im2col algorithm that makes convolution efficient.
Your CNN will:
- Learn filters automatically (instead of you hardcoding edge detectors)
- Recognize digits regardless of their position in the image (translation invariance)
- Use fewer than half the parameters of the dense network while achieving higher accuracy
- Train in reasonable time through vectorized operations
This is considered the hardest project in this learning path. The backward pass through convolution is where most people give up. If you complete this, you truly understand how CNNs work at the deepest level.
Learning Objectives
By completing this project, you will:
- Implement Conv2D Forward Pass - Slide learned filters across images to produce feature maps
- Master the im2col Transformation - Convert convolution into matrix multiplication for efficiency
- Implement Conv2D Backward Pass - The notoriously difficult gradient computation through convolution
- Build Max Pooling Layers - Downsample feature maps while preserving important features
- Implement Max Pooling Backward Pass - Route gradients only through the "winning" neurons
- Connect Convolutional and Dense Layers - Flatten feature volumes to feed into fully connected layers
- Understand Parameter Sharing - Why CNNs are efficient for images
- Achieve Translation Invariance - Recognize patterns regardless of position
- Debug with Gradient Checking - Verify your backprop implementation is correct
The Core Question You're Answering
"How can we make AI efficient enough for images?"
A 28x28 grayscale image has 784 pixels. Thatโs manageable. But a 1000x1000 color image has 3 million inputs. If your first hidden layer has 1000 neurons, you need 3 billion weights just for layer 1. This is impossible to train.
CNNs solve this through two key insights:
- Local Connectivity: A neuron doesn't need to see the entire image. It only needs to see a small patch (like 3x3 pixels). Edges and textures are local features.
- Parameter Sharing: The same filter that detects a vertical edge in the top-left corner should work in the bottom-right corner too. We use the same weights everywhere.
These two ideas reduce parameters by 1000x while actually improving accuracy, because:
- Sparse connections prevent overfitting
- Shared weights encode translation invariance (a "7" is a "7" anywhere in the image)
- Hierarchical features emerge naturally (edges -> textures -> shapes -> objects)
Concepts You Must Understand First
Before implementing, ensure you have solid grounding in these foundational concepts:
| Concept | Why It Matters | Where to Learn |
|---|---|---|
| Convolution Operation | You must be able to compute a convolution by hand. Project 7 should have given you this. Know what happens when a 3x3 kernel slides over a 5x5 image. | Project 7, "Deep Learning with Python" Ch. 5 |
| Parameter Sharing | The key insight that makes CNNs work. One filter = one set of weights applied everywhere. This gives translation invariance. | "Deep Learning" Ch. 9.2 |
| Sparse Connectivity | Each output pixel connects to only a small patch of input, not the entire image. This is why CNNs have fewer parameters. | "Deep Learning" Ch. 9.2 |
| Translation Invariance | A CNN should recognize a cat whether it's on the left or right side of the image. Pooling and shared weights create this property. | "Deep Learning" Ch. 9.3 |
| Max Pooling Operation | Downsampling by taking the maximum in each patch. Reduces spatial dimensions and provides local translation invariance. | "Deep Learning with Python" Ch. 5.1.2 |
| Feature Map Dimensions | Given input (28x28), filter (3x3), stride (1), padding (0), what's the output size? You must know the formula: (W - F + 2P) / S + 1. | "Deep Learning" Ch. 9.5 |
| The im2col Transformation | The trick that converts convolution into matrix multiplication. Essential for efficient implementation. | Stanford CS231n Notes |
The Dimension Formula
This will save you hours of debugging:
Output Size = floor((Input_Size - Filter_Size + 2*Padding) / Stride) + 1
Example: Input 28x28, Filter 3x3, Padding 0, Stride 1:
Output = (28 - 3 + 0) / 1 + 1 = 26
So a 28x28 image becomes a 26x26 feature map after one 3x3 convolution.
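A small helper (hypothetical, not required by the project) keeps this formula at hand while debugging shapes; the assertions below trace the exact architecture built later in this project:

def conv_output_size(input_size, filter_size, padding=0, stride=1):
    """Spatial output size of a convolution or pooling layer."""
    return (input_size - filter_size + 2 * padding) // stride + 1

assert conv_output_size(28, 3) == 26             # Conv2D(3x3) on 28x28
assert conv_output_size(26, 2, stride=2) == 13   # MaxPool(2x2, stride 2)
assert conv_output_size(13, 3) == 11             # second Conv2D(3x3)
assert conv_output_size(11, 2, stride=2) == 5    # second MaxPool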
Deep Theoretical Foundation
Why Convolutions Are Perfect for Images
Consider a dense network trying to recognize a "7":
Dense Network View:
Input: 28x28 = 784 pixels Each hidden neuron connects
(flattened to vector) to ALL 784 input pixels
[x1, x2, x3, ... x784] --> [h1, h2, h3, ... h256]
Weights: 784 * 256 = 200,704 parameters (just layer 1!)
Problem: A "7" at pixel (5,5) looks COMPLETELY DIFFERENT from
a "7" at pixel (20,20) because different weights fire.
Now consider a CNN:
CNN View:
Input: 28x28 image One filter (3x3 = 9 weights)
(keep the 2D structure) slides across the ENTIRE image
┌──────────────────┐       ┌───┐
│        7         │   *   │ F │   =   Feature Map 26x26
│                  │       └───┘
│                  │
└──────────────────┘   Same 9 weights used everywhere!
Weights: 9 parameters (the filter)
Benefit: A "7" activates the same filter whether it's
top-left or bottom-right. Translation invariance!
Parameter Efficiency: Conv vs Dense
Letโs compare parameter counts for processing a 28x28 image:
───────────────────────────────────────────────────────────────────────────
                       Parameter Count Comparison
───────────────────────────────────────────────────────────────────────────

  Dense Layer (784 inputs -> 256 hidden):
      Parameters = 784 * 256 + 256 (bias) = 200,960

  Conv Layer (1 channel -> 32 filters, 3x3):
      Parameters = 32 * (3 * 3 * 1) + 32 (bias) = 320

  Ratio: Dense / Conv = 628x MORE parameters for dense!

  And the conv layer produces MORE information:
      Dense: 256 values
      Conv:  32 * 26 * 26 = 21,632 values (feature maps)

───────────────────────────────────────────────────────────────────────────
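You can reproduce these numbers in a few lines (a sanity check, nothing more):

dense_params = 784 * 256 + 256          # weights + bias
conv_params = 32 * (3 * 3 * 1) + 32     # 32 filters of 3x3x1, plus bias
print(dense_params)                     # 200960
print(conv_params)                      # 320
print(dense_params // conv_params)      # 628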
The Receptive Field Concept
Each pixel in a deeper layer โseesโ a larger patch of the original image:
Layer 1: Each output pixel sees 3x3 of input (receptive field = 3)
Layer 2: Each output pixel sees 3x3 of layer 1 (receptive field = 5)
Layer 3: Each output pixel sees 3x3 of layer 2 (receptive field = 7)
┌─────────────────────┐
│    Original Image   │
│  ┌───────────────┐  │
│  │  RF Layer 2   │  │
│  │  ┌─────────┐  │  │
│  │  │  RF L1  │  │  │
│  │  │   3x3   │  │  │
│  │  └─────────┘  │  │
│  │      5x5      │  │
│  └───────────────┘  │
│         7x7         │
└─────────────────────┘
The deeper you go, the more context each neuron has.
Early layers: edges, textures
Middle layers: parts (eyes, wheels)
Deep layers: objects (faces, cars)
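For stacked stride-1 convolutions, each layer adds (kernel_size - 1) pixels of context; a one-function sketch of that rule:

def receptive_field(num_layers, kernel_size=3):
    """Receptive field of n stacked stride-1 conv layers."""
    rf = 1
    for _ in range(num_layers):
        rf += kernel_size - 1   # each layer widens the view by (k - 1)
    return rf

print([receptive_field(n) for n in (1, 2, 3)])   # [3, 5, 7]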
Pooling for Spatial Invariance
Max pooling takes the maximum value in each patch:
Input 4x4: Max Pool 2x2, Stride 2:
┌───┬───┬───┬───┐        ┌───┬───┐
│ 1 │ 2 │ 5 │ 3 │        │ 6 │ 8 │   (max of top-left 2x2     = 6)
├───┼───┼───┼───┤   -->  ├───┼───┤   (max of top-right 2x2    = 8)
│ 6 │ 4 │ 8 │ 1 │        │ 7 │ 9 │   (max of bottom-left 2x2  = 7)
├───┼───┼───┼───┤        └───┴───┘   (max of bottom-right 2x2 = 9)
│ 2 │ 7 │ 3 │ 9 │
├───┼───┼───┼───┤
│ 1 │ 5 │ 4 │ 2 │
└───┴───┴───┴───┘
Why it helps:
1. Reduces spatial size (4x4 -> 2x2 = 75% reduction)
2. Provides local translation invariance:
- If the "6" moved one pixel right (to where "4" was),
the output would still be "6" (or close to it)
3. Keeps the "loudest" feature in each region
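You can verify the example above directly in NumPy (the reshape trick assumes the spatial size divides evenly by the pool size):

import numpy as np

x = np.array([[1, 2, 5, 3],
              [6, 4, 8, 1],
              [2, 7, 3, 9],
              [1, 5, 4, 2]])

# Split into 2x2 blocks, then take the max within each block
pooled = x.reshape(2, 2, 2, 2).max(axis=(1, 3))
print(pooled)   # [[6 8]
                #  [7 9]]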
Backpropagation Through Convolution (The Hard Part)
This is where most people give up. Let's build intuition before diving into math.
Forward pass recap:
- Input: Image X of shape (H, W)
- Filter: Kernel K of shape (FH, FW)
- Output: Feature map Y of shape (H-FH+1, W-FW+1)
- Each Y[i,j] = sum(X[i:i+FH, j:j+FW] * K)
Backward pass goal:
- Given: gradient of loss w.r.t. output, dL/dY
- Find: dL/dX (to backprop further) and dL/dK (to update weights)
The key insight: In the forward pass, each input pixel X[a,b] contributes to multiple outputs (wherever the filter overlapped that pixel). In the backward pass, we sum all those contributions.
How one input pixel affects multiple outputs:
Input X (5x5):                        Output Y (3x3) with 3x3 filter:
┌───┬───┬───┬───┬───┐                 ┌───┬───┬───┐
│   │   │   │   │   │                 │Y00│Y01│Y02│
├───┼───┼───┼───┼───┤                 ├───┼───┼───┤
│   │ X │   │   │   │  <-- This       │Y10│Y11│Y12│
├───┼───┼───┼───┼───┤      pixel      ├───┼───┼───┤
│   │   │   │   │   │                 │Y20│Y21│Y22│
├───┼───┼───┼───┼───┤                 └───┴───┴───┘
│   │   │   │   │   │
├───┼───┼───┼───┼───┤                 X[1,1] contributes to Y[0,0]
│   │   │   │   │   │                 (when filter is at position 0,0)
└───┴───┴───┴───┴───┘

So dL/dX[1,1] includes dL/dY[0,0] * K[1,1]
The gradient of the filter (dL/dK) is even more interesting:
- Each filter weight K[i,j] was multiplied by many input values during the forward pass
- So dL/dK[i,j] = sum over all positions of dL/dY[pos] * X[corresponding input]
- This is actually a convolution of X with dL/dY!
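A minimal numeric illustration of that last point, with a toy 4x4 input and a 2x2 filter (so dL/dY is 3x3); the values here are arbitrary:

import numpy as np

X = np.arange(16, dtype=float).reshape(4, 4)   # toy input
dY = np.ones((3, 3))                           # pretend upstream gradient

# dL/dK[a,b] = sum_{i,j} dY[i,j] * X[i+a, j+b]: literally sliding dY over X
dK = np.zeros((2, 2))
for a in range(2):
    for b in range(2):
        dK[a, b] = np.sum(dY * X[a:a+3, b:b+3])
print(dK)   # same shape as the filter, one gradient per filter weight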
The im2col Transformation: Convolution as Matrix Multiplication
The naive convolution uses nested loops and is slow. The im2col trick converts convolution into a single matrix multiplication:
Original Convolution (4x4 input, 2x2 filter, stride 1):
Input X: Filter K: Output Y (3x3):
โโโโโฌโโโโฌโโโโฌโโโโ โโโโโฌโโโโ โโโโโฌโโโโฌโโโโ
โ 1 โ 2 โ 3 โ 4 โ โ w โ x โ โ โ โ โ
โโโโโผโโโโผโโโโผโโโโค โโโโโผโโโโค โโโโโผโโโโผโโโโค
โ 5 โ 6 โ 7 โ 8 โ โ y โ z โ โ โ โ โ
โโโโโผโโโโผโโโโผโโโโค โโโโโดโโโโ โโโโโผโโโโผโโโโค
โ 9 โ10 โ11 โ12 โ โ โ โ โ
โโโโโผโโโโผโโโโผโโโโค โโโโโดโโโโดโโโโ
โ13 โ14 โ15 โ16 โ
โโโโโดโโโโดโโโโดโโโโ
Step 1: im2col - Stretch each receptive field into a column
Position (0,0): [1,2,5,6] ---> Column 0
Position (0,1): [2,3,6,7] ---> Column 1
Position (0,2): [3,4,7,8] ---> Column 2
Position (1,0): [5,6,9,10] ---> Column 3
... and so on for all 9 positions
im2col(X) matrix (4 x 9):
┌────┬────┬────┬────┬────┬────┬────┬────┬────┐
│  1 │  2 │  3 │  5 │  6 │  7 │  9 │ 10 │ 11 │
│  2 │  3 │  4 │  6 │  7 │  8 │ 10 │ 11 │ 12 │
│  5 │  6 │  7 │  9 │ 10 │ 11 │ 13 │ 14 │ 15 │
│  6 │  7 │  8 │ 10 │ 11 │ 12 │ 14 │ 15 │ 16 │
└────┴────┴────┴────┴────┴────┴────┴────┴────┘
Step 2: Flatten filter to row vector
K_flat = [w, x, y, z] (1 x 4)
Step 3: Matrix multiplication
Output = K_flat @ im2col(X) = (1 x 4) @ (4 x 9) = (1 x 9)
Step 4: Reshape output to 3x3
Why this is faster:
- Matrix multiplication is highly optimized (BLAS, GPU acceleration)
- Avoids Python loop overhead
- With multiple filters, it's even more efficient (just more rows in K_flat)
Backprop Through Max Pooling (Routing Gradients)
Max pooling has no learnable parameters, but we still need to backpropagate gradients. The rule is simple:
The gradient only flows through the neuron that "won" (had the max value)
Forward Max Pool:
┌───┬───┐
│ 1 │ 4 │   max = 4 (position [0,1])
├───┼───┤
│ 2 │ 3 │
└───┴───┘

Backward (given dL/dOutput = 0.5):
┌─────┬─────┐
│  0  │ 0.5 │   Only the winning position gets the gradient
├─────┼─────┤
│  0  │  0  │
└─────┴─────┘
This is called "gradient routing" - we need to remember
which position won during forward pass.
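For a single 2x2 patch, the routing looks like this in NumPy (a sketch of the idea, not the batched layer you'll build later):

import numpy as np

patch = np.array([[1.0, 4.0],
                  [2.0, 3.0]])
d_output = 0.5

d_patch = np.zeros_like(patch)
winner = np.unravel_index(np.argmax(patch), patch.shape)   # (0, 1)
d_patch[winner] = d_output
print(d_patch)   # [[0.  0.5]
                 #  [0.  0. ]]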
Modern CNN Architectures Overview
Understanding history helps you appreciate design choices:
───────────────────────────────────────────────────────────────────────────
                       CNN Architecture Evolution
───────────────────────────────────────────────────────────────────────────

LeNet-5 (1998) - Yann LeCun
├── 2 conv layers, 2 pooling, 3 dense
├── Designed for 32x32 grayscale digits
└── ~60K parameters

AlexNet (2012) - Krizhevsky, Sutskever, Hinton
├── First deep CNN to win ImageNet (15.3% error, previous was 26%)
├── 5 conv layers, 3 dense layers
├── Used ReLU (not sigmoid/tanh) and dropout
└── ~60M parameters, trained on GPU

VGGNet (2014) - Simonyan, Zisserman
├── Key insight: stack many 3x3 convs (better than fewer large ones)
├── 16-19 layers, very uniform architecture
└── ~138M parameters

ResNet (2015) - He et al.
├── Skip connections allow training 100+ layer networks
├── Solved the vanishing gradient problem
└── Still a state-of-the-art baseline today
───────────────────────────────────────────────────────────────────────────
Real World Outcome
When you complete this project and run your CNN trainer, you will see:
$ python train_cnn.py
───────────────────────────────────────────────────────────────────────────
     CNN From Scratch - MNIST Classifier
     Implemented with NumPy only (no frameworks!)
───────────────────────────────────────────────────────────────────────────

Architecture:
───────────────────────────────────────────────────────────────────────
  Input: (1, 28, 28)
        │
        ▼
  Conv2D(1 -> 32, 3x3, stride=1) + ReLU
  Output: (32, 26, 26)          Parameters: 320
        │
        ▼
  MaxPool2D(2x2, stride=2)
  Output: (32, 13, 13)          Parameters: 0
        │
        ▼
  Conv2D(32 -> 64, 3x3, stride=1) + ReLU
  Output: (64, 11, 11)          Parameters: 18,496
        │
        ▼
  MaxPool2D(2x2, stride=2)
  Output: (64, 5, 5)            Parameters: 0
        │
        ▼
  Flatten
  Output: (1600,)
        │
        ▼
  Dense(1600 -> 128) + ReLU
  Output: (128,)                Parameters: 204,928
        │
        ▼
  Dense(128 -> 10) + Softmax
  Output: (10,)                 Parameters: 1,290
───────────────────────────────────────────────────────────────────────

Total Parameters: 225,034
───────────────────────────────────────────────────────────────────────
Compare to Dense Network (Project 8): 500,000+ parameters
CNN uses 55% fewer parameters!
───────────────────────────────────────────────────────────────────────

Loading MNIST dataset...
  Training samples: 60,000
  Test samples: 10,000

Training Configuration:
  Batch size: 64
  Learning rate: 0.01
  Optimizer: SGD with momentum (0.9)

───────────────────────────────────────────────────────────────────────────
Training Progress:
───────────────────────────────────────────────────────────────────────────

Epoch  1/10 [████████████████████████████████████████] 938/938
  Train Loss: 0.2341   Train Acc: 92.3%   Time: 45.2s
  Test Loss: 0.0892    Test Acc: 97.2%

Epoch  2/10 [████████████████████████████████████████] 938/938
  Train Loss: 0.0812   Train Acc: 97.5%   Time: 44.8s
  Test Loss: 0.0654    Test Acc: 98.0%

Epoch  3/10 [████████████████████████████████████████] 938/938
  Train Loss: 0.0543   Train Acc: 98.3%   Time: 44.9s
  Test Loss: 0.0521    Test Acc: 98.4%

Epoch  4/10 [████████████████████████████████████████] 938/938
  Train Loss: 0.0398   Train Acc: 98.7%   Time: 45.1s
  Test Loss: 0.0478    Test Acc: 98.6%

Epoch  5/10 [████████████████████████████████████████] 938/938
  Train Loss: 0.0312   Train Acc: 99.0%   Time: 45.0s
  Test Loss: 0.0412    Test Acc: 98.8%

... [epochs 6-9] ...

Epoch 10/10 [████████████████████████████████████████] 938/938
  Train Loss: 0.0098   Train Acc: 99.7%   Time: 44.7s
  Test Loss: 0.0356    Test Acc: 99.1%

───────────────────────────────────────────────────────────────────────────
Training Complete!
───────────────────────────────────────────────────────────────────────────

Final Results:
  Test Accuracy: 99.1% (9,910 / 10,000 correct)

Comparison:
┌─────────────────────┬────────────┬───────────────┬──────────────────┐
│ Model               │ Parameters │ Test Accuracy │ Improvement      │
├─────────────────────┼────────────┼───────────────┼──────────────────┤
│ Dense (Project 8)   │ ~500,000   │ 97.5%         │ baseline         │
│ CNN (This Project)  │ ~225,000   │ 99.1%         │ +1.6% acc, 55%   │
│                     │            │               │ fewer params     │
└─────────────────────┴────────────┴───────────────┴──────────────────┘

Visualizing learned filters...
Saved: learned_filters_conv1.png

First layer filters (32 x 3x3):
───────────────────────────────────────────────────────────────────────
  [Edge |] [Edge -] [Edge /] [Blob] [Corner] [Texture] ...

  The network automatically learned edge detectors!
  Compare to your hand-coded kernels from Project 7.
───────────────────────────────────────────────────────────────────────

Testing translation invariance...
  Original digit "7" at center:  Prediction: 7 (99.8% confidence)
  Same "7" shifted 5px right:    Prediction: 7 (99.6% confidence)
  Same "7" shifted 5px down:     Prediction: 7 (99.4% confidence)

  Dense network on shifted images:
  Same "7" shifted 5px right:    Prediction: 1 (45% confidence)  ✗ FAIL

CNN maintains recognition regardless of position!

Model saved to: cnn_mnist_model.npz
───────────────────────────────────────────────────────────────────────────
This output demonstrates:
- Dramatically better accuracy than dense networks
- Fewer parameters (efficiency through convolution)
- Automatic feature learning (no hand-coded filters needed)
- Translation invariance (the key property of CNNs)
Solution Architecture
Core Classes
Your implementation needs these key components:
───────────────────────────────────────────────────────────────────────────
                        CNN Architecture Overview
───────────────────────────────────────────────────────────────────────────

class Conv2D:
    """Convolutional layer with learnable filters"""
    - __init__(in_channels, out_channels, kernel_size, stride, padding)
    - forward(x) -> feature_maps
    - backward(d_out) -> d_input, stores d_weights/d_bias
    - im2col(x) -> col_matrix (for efficient forward)
    - col2im(col, shape) -> image (for efficient backward)

class MaxPool2D:
    """Max pooling layer (no learnable parameters)"""
    - __init__(pool_size, stride)
    - forward(x) -> pooled_output, stores max_indices
    - backward(d_out) -> d_input (gradient routing)

class Flatten:
    """Reshape 3D feature maps to 1D vector"""
    - forward(x) -> flattened
    - backward(d_out) -> reshaped to original

class Dense:
    """Fully connected layer (from Project 8)"""
    - forward(x) -> output
    - backward(d_out) -> d_input, stores d_weights/d_bias

class ReLU:
    """ReLU activation"""
    - forward(x) -> max(0, x)
    - backward(d_out) -> d_out * (x > 0)

class Softmax:
    """Softmax for final layer"""
    - forward(x) -> probabilities
    - backward(d_out) -> gradients

class CNN:
    """Container that chains layers together"""
    - __init__(layers: List[Layer])
    - forward(x) -> prediction
    - backward(d_loss) -> propagates gradients
    - update_params(learning_rate)
───────────────────────────────────────────────────────────────────────────
Data Flow Through the Network
Input: (batch, 1, 28, 28)
        │
        ▼
┌──────────────┐
│    Conv2D    │  32 filters of 3x3
│   (1 -> 32)  │
└──────┬───────┘
       │ (batch, 32, 26, 26)
       ▼
┌──────────────┐
│     ReLU     │  max(0, x)
└──────┬───────┘
       │ (batch, 32, 26, 26)
       ▼
┌──────────────┐
│  MaxPool2D   │  2x2, stride 2
└──────┬───────┘
       │ (batch, 32, 13, 13)
       ▼
┌──────────────┐
│    Conv2D    │  64 filters of 3x3
│  (32 -> 64)  │
└──────┬───────┘
       │ (batch, 64, 11, 11)
       ▼
┌──────────────┐
│     ReLU     │
└──────┬───────┘
       │ (batch, 64, 11, 11)
       ▼
┌──────────────┐
│  MaxPool2D   │  2x2, stride 2
└──────┬───────┘
       │ (batch, 64, 5, 5)
       ▼
┌──────────────┐
│   Flatten    │  Reshape to 1D
└──────┬───────┘
       │ (batch, 1600)
       ▼
┌──────────────┐
│    Dense     │  1600 -> 128
│    + ReLU    │
└──────┬───────┘
       │ (batch, 128)
       ▼
┌──────────────┐
│    Dense     │  128 -> 10
│   + Softmax  │
└──────┬───────┘
       │ (batch, 10)
       ▼
  Predictions
Phased Implementation Guide
Phase 1: Conv2D Forward Pass (Days 1-3)
Goal: Implement the forward convolution using nested loops first
Start with the simplest possible implementation:
import numpy as np

def conv2d_forward_naive(x, weights, bias):
    """
    Naive convolution implementation with loops.

    Args:
        x: Input of shape (batch, in_channels, H, W)
        weights: Filters of shape (out_channels, in_channels, FH, FW)
        bias: Bias of shape (out_channels,)

    Returns:
        Output of shape (batch, out_channels, H_out, W_out)
    """
    batch, in_channels, H, W = x.shape
    out_channels, _, FH, FW = weights.shape
    H_out = H - FH + 1
    W_out = W - FW + 1
    output = np.zeros((batch, out_channels, H_out, W_out))

    for b in range(batch):
        for oc in range(out_channels):
            for i in range(H_out):
                for j in range(W_out):
                    # Extract patch
                    patch = x[b, :, i:i+FH, j:j+FW]
                    # Convolve: element-wise multiply and sum
                    output[b, oc, i, j] = np.sum(patch * weights[oc]) + bias[oc]
    return output
Test it: Create a 5x5 input, 3x3 filter with known values, and verify output matches hand calculation.
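For example, an "identity" filter (all zeros except a 1 at the center) should reproduce the 3x3 interior of the image exactly; a hand-checkable sketch:

import numpy as np

x = np.arange(25, dtype=float).reshape(1, 1, 5, 5)   # one image, one channel
w = np.zeros((1, 1, 3, 3))
w[0, 0, 1, 1] = 1.0                                  # identity filter: passes the center pixel through
b = np.zeros(1)

out = conv2d_forward_naive(x, w, b)
assert np.allclose(out[0, 0], x[0, 0, 1:4, 1:4])     # 3x3 interior of the 5x5 image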
Checkpoint: Can perform forward convolution (slowly) on small inputs.
Phase 2: im2col Transformation (Days 4-6)
Goal: Convert convolution to matrix multiplication for speed
The im2col function stretches each receptive field into a column:
def im2col(x, FH, FW, stride=1, padding=0):
    """
    Transform input into column matrix for efficient convolution.

    Args:
        x: Input of shape (batch, C, H, W)
        FH, FW: Filter height and width
        stride: Stride of convolution
        padding: Zero padding

    Returns:
        col: Column matrix of shape (C*FH*FW, batch*H_out*W_out)
    """
    batch, C, H, W = x.shape

    # Apply padding if needed
    if padding > 0:
        x = np.pad(x, ((0,0), (0,0), (padding,padding), (padding,padding)))
        H += 2 * padding
        W += 2 * padding

    H_out = (H - FH) // stride + 1
    W_out = (W - FW) // stride + 1

    # Create output array
    col = np.zeros((C * FH * FW, batch * H_out * W_out))
    col_idx = 0
    for b in range(batch):
        for i in range(H_out):
            for j in range(W_out):
                # Extract receptive field and flatten
                patch = x[b, :, i*stride:i*stride+FH, j*stride:j*stride+FW]
                col[:, col_idx] = patch.flatten()
                col_idx += 1
    return col
The fast convolution:
def conv2d_forward_fast(x, weights, bias):
    batch, in_channels, H, W = x.shape
    out_channels, _, FH, FW = weights.shape
    H_out = H - FH + 1
    W_out = W - FW + 1

    # im2col: (C*FH*FW, batch*H_out*W_out)
    col = im2col(x, FH, FW)

    # Reshape weights: (out_channels, in_channels*FH*FW)
    W_col = weights.reshape(out_channels, -1)

    # Matrix multiplication!
    # (out_channels, C*FH*FW) @ (C*FH*FW, batch*H_out*W_out)
    #   = (out_channels, batch*H_out*W_out)
    output = W_col @ col + bias.reshape(-1, 1)

    # Reshape to (batch, out_channels, H_out, W_out)
    output = output.reshape(out_channels, batch, H_out, W_out)
    output = output.transpose(1, 0, 2, 3)
    return output
Verify: Both naive and fast implementations should give identical results. Profile the speed difference.
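A sketch of that verification, assuming both implementations from above are in scope:

import time
import numpy as np

x = np.random.randn(8, 3, 32, 32)
w = np.random.randn(16, 3, 3, 3)
b = np.random.randn(16)

t0 = time.perf_counter()
out_naive = conv2d_forward_naive(x, w, b)
t1 = time.perf_counter()
out_fast = conv2d_forward_fast(x, w, b)
t2 = time.perf_counter()

assert np.allclose(out_naive, out_fast)
print(f"naive: {t1-t0:.3f}s   im2col: {t2-t1:.3f}s")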
Checkpoint: Fast forward pass using matrix multiplication.
Phase 3: Conv2D Backward Pass (Days 7-12)
Goal: Implement gradient computation through convolution
This is the hardest part. We need two gradients:
- dL/dX (to backpropagate to earlier layers)
- dL/dW (to update the filter weights)
Understanding the gradient of the filter (dL/dW):
Forward: Y[i,j] = sum_{a,b} X[i+a, j+b] * W[a, b]
Backward: dL/dW[a,b] = sum_{i,j} dL/dY[i,j] * X[i+a, j+b]
This is a convolution of X with dL/dY!
Understanding the gradient of the input (dL/dX):
Each input X[a,b] contributes to multiple outputs Y[i,j]
wherever the filter overlapped that position.
dL/dX[a,b] = sum over all (i,j) where X[a,b] was used:
dL/dY[i,j] * W[a-i, b-j]
This is a "full" convolution of dL/dY with W flipped!
Implementation sketch:
def conv2d_backward(d_out, x, weights, col):
    """
    Backward pass for convolution.

    Args:
        d_out: Gradient from next layer, shape (batch, out_channels, H_out, W_out)
        x: Original input, shape (batch, in_channels, H, W)
        weights: Filter weights, shape (out_channels, in_channels, FH, FW)
        col: Cached im2col matrix from forward pass

    Returns:
        d_x: Gradient w.r.t. input
        d_w: Gradient w.r.t. weights
        d_b: Gradient w.r.t. bias
    """
    batch, out_channels, H_out, W_out = d_out.shape
    _, in_channels, FH, FW = weights.shape

    # Gradient of bias: sum over batch and spatial dimensions
    d_b = np.sum(d_out, axis=(0, 2, 3))

    # Reshape d_out for matrix multiplication
    d_out_col = d_out.transpose(1, 0, 2, 3).reshape(out_channels, -1)

    # Gradient of weights: d_out convolved with input
    # (out_channels, batch*H_out*W_out) @ (batch*H_out*W_out, in_channels*FH*FW)
    d_w = d_out_col @ col.T
    d_w = d_w.reshape(weights.shape)

    # Gradient of input: need col2im
    W_col = weights.reshape(out_channels, -1)
    d_col = W_col.T @ d_out_col  # (in_channels*FH*FW, batch*H_out*W_out)

    # col2im to convert back to image shape
    d_x = col2im(d_col, x.shape, FH, FW)
    return d_x, d_w, d_b
The col2im function (inverse of im2col):
def col2im(col, x_shape, FH, FW, stride=1, padding=0):
    """
    Inverse of im2col: accumulate gradients back to image format.

    Key insight: Multiple columns contributed to the same input pixel,
    so we SUM the gradients (not replace).
    """
    batch, C, H, W = x_shape
    H_out = (H - FH) // stride + 1
    W_out = (W - FW) // stride + 1

    dx = np.zeros((batch, C, H, W))
    col_idx = 0
    for b in range(batch):
        for i in range(H_out):
            for j in range(W_out):
                # Get the column gradient
                patch_grad = col[:, col_idx].reshape(C, FH, FW)
                # ACCUMULATE into the appropriate position
                # (note the stride, so the indices match im2col)
                dx[b, :, i*stride:i*stride+FH, j*stride:j*stride+FW] += patch_grad
                col_idx += 1
    return dx
Checkpoint: Backward pass computes gradients. Verify with gradient checking (next phase).
Phase 4: MaxPool Forward Pass (Day 13)
Goal: Implement max pooling
class MaxPool2D:
    def __init__(self, pool_size=2, stride=2):
        self.pool_size = pool_size
        self.stride = stride
        self.max_indices = None   # Store for backward
        self.input_shape = None   # Cached so backward(d_out) matches the other layers

    def forward(self, x):
        """
        Max pooling forward pass.

        Args:
            x: Input of shape (batch, C, H, W)

        Returns:
            Output of shape (batch, C, H//pool, W//pool)
        """
        self.input_shape = x.shape
        batch, C, H, W = x.shape
        PH = PW = self.pool_size
        S = self.stride
        H_out = (H - PH) // S + 1
        W_out = (W - PW) // S + 1
        output = np.zeros((batch, C, H_out, W_out))
        self.max_indices = np.zeros((batch, C, H_out, W_out, 2), dtype=int)

        for b in range(batch):
            for c in range(C):
                for i in range(H_out):
                    for j in range(W_out):
                        h_start = i * S
                        w_start = j * S
                        patch = x[b, c, h_start:h_start+PH, w_start:w_start+PW]
                        # Find max and its position
                        max_val = np.max(patch)
                        max_pos = np.unravel_index(np.argmax(patch), (PH, PW))
                        output[b, c, i, j] = max_val
                        self.max_indices[b, c, i, j] = [h_start + max_pos[0],
                                                        w_start + max_pos[1]]
        return output
Checkpoint: MaxPool reduces spatial dimensions by half.
Phase 5: MaxPool Backward Pass (Day 14)
Goal: Route gradients through max positions only
def backward(self, d_out):
    """
    Max pooling backward pass.
    The gradient only flows through the position that had the max value.
    (Uses self.input_shape cached during forward, so the interface matches
    the other layers: backward(d_out) -> d_input.)
    """
    batch, C, H, W = self.input_shape
    d_x = np.zeros((batch, C, H, W))
    _, _, H_out, W_out = d_out.shape

    for b in range(batch):
        for c in range(C):
            for i in range(H_out):
                for j in range(W_out):
                    # Get the position that won during forward
                    max_h, max_w = self.max_indices[b, c, i, j]
                    # Route the gradient to that position
                    d_x[b, c, max_h, max_w] += d_out[b, c, i, j]
    return d_x
Checkpoint: Gradients flow only through max positions.
Phase 6: Flatten Layer (Day 15)
Goal: Reshape 3D feature maps to 1D for dense layers
class Flatten:
    def __init__(self):
        self.input_shape = None

    def forward(self, x):
        """Flatten all dimensions except batch."""
        self.input_shape = x.shape
        batch = x.shape[0]
        return x.reshape(batch, -1)

    def backward(self, d_out):
        """Reshape gradient back to original shape."""
        return d_out.reshape(self.input_shape)
Checkpoint: Can connect conv layers to dense layers.
Phase 7: Connect to Existing Dense Layers (Days 16-17)
Goal: Integrate Dense and activation layers from Project 8
You should already have Dense and ReLU layers from Project 8. Make sure they have a consistent interface:
class Dense:
    def __init__(self, in_features, out_features):
        # He initialization (good for ReLU)
        self.weights = np.random.randn(in_features, out_features) * np.sqrt(2.0 / in_features)
        self.bias = np.zeros(out_features)
        self.d_weights = None
        self.d_bias = None
        self.input_cache = None

    def forward(self, x):
        self.input_cache = x
        return x @ self.weights + self.bias

    def backward(self, d_out):
        self.d_weights = self.input_cache.T @ d_out
        self.d_bias = np.sum(d_out, axis=0)
        return d_out @ self.weights.T

class ReLU:
    def __init__(self):
        self.mask = None

    def forward(self, x):
        self.mask = (x > 0)
        return np.maximum(0, x)

    def backward(self, d_out):
        return d_out * self.mask
Checkpoint: All layer types have forward/backward methods.
Phase 8: Build the Full CNN (Days 18-19)
Goal: Chain all layers together
class CNN:
    def __init__(self):
        self.layers = [
            Conv2D(1, 32, kernel_size=3),
            ReLU(),
            MaxPool2D(2, 2),
            Conv2D(32, 64, kernel_size=3),
            ReLU(),
            MaxPool2D(2, 2),
            Flatten(),
            Dense(64 * 5 * 5, 128),   # 64 channels, 5x5 spatial
            ReLU(),
            Dense(128, 10),
            Softmax()
        ]

    def forward(self, x):
        for layer in self.layers:
            x = layer.forward(x)
        return x

    def backward(self, d_loss):
        for layer in reversed(self.layers):
            d_loss = layer.backward(d_loss)

    def update_params(self, lr):
        for layer in self.layers:
            if hasattr(layer, 'weights'):
                layer.weights -= lr * layer.d_weights
                layer.bias -= lr * layer.d_bias
Checkpoint: Can do a full forward-backward pass.
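A quick smoke test for this checkpoint (a sketch: the random "gradient" only exercises shapes and plumbing, not learning):

import numpy as np

cnn = CNN()
x = np.random.randn(4, 1, 28, 28)   # batch of 4 fake images
probs = cnn.forward(x)
assert probs.shape == (4, 10)

d_loss = np.random.randn(4, 10)     # stand-in for the real loss gradient
cnn.backward(d_loss)
cnn.update_params(0.01)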
Phase 9: Train on MNIST (Days 20-21)
Goal: Train the CNN and achieve 99%+ accuracy
def train_cnn():
    # Load MNIST
    X_train, y_train, X_test, y_test = load_mnist()

    # Reshape to (batch, 1, 28, 28) for CNN
    X_train = X_train.reshape(-1, 1, 28, 28) / 255.0
    X_test = X_test.reshape(-1, 1, 28, 28) / 255.0

    cnn = CNN()
    batch_size = 64
    learning_rate = 0.01
    epochs = 10

    for epoch in range(epochs):
        # Shuffle data
        indices = np.random.permutation(len(X_train))
        total_loss = 0
        correct = 0

        for i in range(0, len(X_train), batch_size):
            batch_idx = indices[i:i+batch_size]
            X_batch = X_train[batch_idx]
            y_batch = y_train[batch_idx]

            # Forward pass
            predictions = cnn.forward(X_batch)

            # Compute loss and accuracy
            loss = cross_entropy_loss(predictions, y_batch)
            total_loss += loss * len(batch_idx)
            correct += np.sum(np.argmax(predictions, axis=1) == y_batch)

            # Backward pass
            d_loss = cross_entropy_gradient(predictions, y_batch)
            cnn.backward(d_loss)

            # Update weights
            cnn.update_params(learning_rate)

        train_acc = correct / len(X_train)
        print(f"Epoch {epoch+1}: Loss={total_loss/len(X_train):.4f}, Acc={train_acc:.2%}")

        # Test accuracy
        test_pred = cnn.forward(X_test)
        test_acc = np.mean(np.argmax(test_pred, axis=1) == y_test)
        print(f"  Test Acc={test_acc:.2%}")
Checkpoint: Model achieves 99%+ accuracy on MNIST.
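The training loop above assumes cross_entropy_loss and cross_entropy_gradient carried over from Project 8. If you don't have them handy, a minimal sketch for softmax outputs and integer labels:

import numpy as np

def cross_entropy_loss(probs, labels, eps=1e-12):
    """Mean negative log-likelihood; probs are softmax outputs, labels are ints."""
    n = probs.shape[0]
    return -np.mean(np.log(probs[np.arange(n), labels] + eps))

def cross_entropy_gradient(probs, labels):
    """Gradient w.r.t. the pre-softmax logits: (probs - onehot) / n.
    This uses the combined softmax + cross-entropy shortcut, so the Softmax
    layer's backward should pass this gradient through unchanged."""
    n = probs.shape[0]
    grad = probs.copy()
    grad[np.arange(n), labels] -= 1.0
    return grad / n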
Questions to Guide Your Design
Before writing code, think through these design questions:
Dimension Tracking
- What are the output dimensions after each layer? Given a 28x28 input, trace through every layer. This will catch most bugs early.
- How do you handle the batch dimension? All operations must work on batches, not single images.
Memory Considerations
- What do you need to cache for the backward pass? The im2col matrix? Max indices? Input values?
- How much memory does training take? With 64 images of 28x28 and 32 filters, how big is the im2col matrix?
Efficiency
- Where are the bottlenecks? im2col is expensive. Can you optimize it?
- Can you vectorize the pooling operations? The naive loop implementation is slow.
Gradient Computation
- How do you handle multiple input channels in conv backward? Each output channel has gradients from all input channels.
- What happens at the edges of the image? With no padding, edge pixels contribute to fewer outputs.
Thinking Exercise
Before implementing, trace the backward pass through a tiny example by hand:
Setup:
- Input: 4x4 single-channel image
- Filter: 2x2, single filter
- Output: 3x3 feature map
Input X:             Filter W:      Output Y:
┌───┬───┬───┬───┐    ┌───┬───┐      ┌───┬───┬───┐
│ 1 │ 2 │ 3 │ 4 │    │ a │ b │      │Y00│Y01│Y02│
├───┼───┼───┼───┤    ├───┼───┤      ├───┼───┼───┤
│ 5 │ 6 │ 7 │ 8 │    │ c │ d │      │Y10│Y11│Y12│
├───┼───┼───┼───┤    └───┴───┘      ├───┼───┼───┤
│ 9 │10 │11 │12 │                   │Y20│Y21│Y22│
├───┼───┼───┼───┤                   └───┴───┴───┘
│13 │14 │15 │16 │
└───┴───┴───┴───┘
Forward pass equations:
Y[0,0] = 1*a + 2*b + 5*c + 6*d
Y[0,1] = 2*a + 3*b + 6*c + 7*d
Y[0,2] = 3*a + 4*b + 7*c + 8*d
... (continue for all 9 outputs)
Your task: Given dL/dY (the gradient of loss w.r.t. each output), derive:
1. dL/da, dL/db, dL/dc, dL/dd (gradients for the filter weights)
2. dL/dX[1,1] (gradient for the input pixel at position (1,1), which is value 6)
Hint for #1: dL/da = sum of dL/dY[i,j] * (X element that was multiplied by 'a' at that position)
Hint for #2: Which output positions Y[i,j] used input X[1,1]=6? That input contributes to the gradient from each of those positions.
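After deriving the answers on paper, a few lines of NumPy let you check yourself (the upstream gradient dL/dY and the filter values here are arbitrary examples):

import numpy as np

X = np.arange(1, 17, dtype=float).reshape(4, 4)    # the 4x4 input above
dY = np.ones((3, 3))                               # pick any dL/dY you like

# Filter gradients: each weight saw a shifted 3x3 window of X
dW = np.array([[np.sum(dY * X[a:a+3, b:b+3]) for b in range(2)]
               for a in range(2)])                 # [[dL/da, dL/db], [dL/dc, dL/dd]]

# dL/dX[1,1]: X[1,1]=6 was used by Y[0,0], Y[0,1], Y[1,0], Y[1,1],
# multiplied by d, c, b, a respectively (the filter "flipped")
a, b, c, d = 1.0, 2.0, 3.0, 4.0                    # example filter values
dX_11 = dY[0, 0]*d + dY[0, 1]*c + dY[1, 0]*b + dY[1, 1]*a
print(dW, dX_11)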
Testing Strategy
Gradient Checking Is Essential
The backward pass is complex enough that bugs are almost guaranteed. Use numerical gradient checking:
def gradient_check(layer, x, epsilon=1e-5):
    """
    Verify analytical gradients match numerical gradients.
    """
    # Forward pass
    output = layer.forward(x)

    # Create random gradient from "next layer"
    d_out = np.random.randn(*output.shape)

    # Analytical gradient
    d_x_analytical = layer.backward(d_out)

    # Numerical gradient
    d_x_numerical = np.zeros_like(x)
    for i in np.ndindex(x.shape):
        x_plus = x.copy()
        x_plus[i] += epsilon
        out_plus = layer.forward(x_plus)

        x_minus = x.copy()
        x_minus[i] -= epsilon
        out_minus = layer.forward(x_minus)

        # Gradient = change in loss / change in input
        d_x_numerical[i] = np.sum((out_plus - out_minus) * d_out) / (2 * epsilon)

    # Compare
    diff = np.linalg.norm(d_x_analytical - d_x_numerical)
    diff /= np.linalg.norm(d_x_analytical) + np.linalg.norm(d_x_numerical)
    print(f"Relative difference: {diff}")
    assert diff < 1e-5, "Gradient check failed!"
Test each layer individually:
- Test Conv2D backward with a tiny input (4x4)
- Test MaxPool backward
- Test the full network on one training example
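For example, checking Conv2D on a tiny input (assuming your Conv2D follows the layer interface above; keep inputs small, since numerical checking perturbs every element):

import numpy as np

layer = Conv2D(in_channels=1, out_channels=2, kernel_size=2)
x = np.random.randn(1, 1, 4, 4)
gradient_check(layer, x)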
Unit Tests
def test_conv2d_output_shape():
    layer = Conv2D(in_channels=1, out_channels=32, kernel_size=3)
    x = np.random.randn(4, 1, 28, 28)  # batch of 4
    out = layer.forward(x)
    assert out.shape == (4, 32, 26, 26), f"Expected (4, 32, 26, 26), got {out.shape}"

def test_maxpool_reduces_size():
    layer = MaxPool2D(pool_size=2, stride=2)
    x = np.random.randn(4, 32, 26, 26)
    out = layer.forward(x)
    assert out.shape == (4, 32, 13, 13), f"Expected (4, 32, 13, 13), got {out.shape}"

def test_im2col_correctness():
    """Verify im2col matches naive convolution."""
    x = np.random.randn(1, 1, 5, 5)
    w = np.random.randn(1, 1, 3, 3)
    b = np.zeros(1)
    out_naive = conv2d_forward_naive(x, w, b)
    out_fast = conv2d_forward_fast(x, w, b)
    assert np.allclose(out_naive, out_fast), "im2col convolution doesn't match naive!"
Common Pitfalls and Debugging Tips
1. Dimension Mismatches
Symptom: ValueError: shapes not aligned during matrix multiplication
Cause: im2col produces wrong shape, or reshape is incorrect
Fix: Print shapes at every step. The im2col output should be:
- Rows: in_channels * filter_height * filter_width
- Columns: batch * output_height * output_width
2. Forgetting to Accumulate Gradients in col2im
Symptom: Training diverges or accuracy stays at 10%
Cause: Using = instead of += in col2im
# WRONG:
dx[b, :, i:i+FH, j:j+FW] = patch_grad
# RIGHT:
dx[b, :, i:i+FH, j:j+FW] += patch_grad
Each input pixel contributes to multiple outputs, so gradients must be accumulated.
3. Transpose Confusion in Backward Pass
Symptom: Gradient check fails
Cause: The shapes in matrix multiplication are wrong
Fix: Write out the shapes explicitly:
# dL/dW = dL/dY (transposed somehow) @ X (transposed somehow)
# Work out the shapes:
# W shape: (out_channels, in_channels, FH, FW)
# Need dW to be this shape
# d_out: (batch, out_channels, H_out, W_out)
# col: (in_channels*FH*FW, batch*H_out*W_out)
4. Max Pooling Gradient Routing Errors
Symptom: Gradients are wrong, but only when pooling is involved
Cause: Max indices were stored incorrectly, or not accounting for stride
Fix: Verify max indices point to the actual maximum values:
# After forward pass, verify:
for each (i,j) in output:
assert x[max_indices[i,j]] == output[i,j]
5. Learning Rate Issues
Symptom: Loss explodes or stays constant
Cause: Learning rate wrong for convolution (often needs to be smaller than for dense)
Fix: Start with lr=0.001 for conv layers. The gradients through convolution can be large because many paths contribute to each gradient.
6. Numerical Stability in Softmax
Symptom: NaN values during training
Cause: Softmax overflow
Fix: Subtract max before exponentiation:
def softmax(x):
    x_stable = x - np.max(x, axis=1, keepdims=True)  # subtract row max for stability
    exp_x = np.exp(x_stable)
    return exp_x / np.sum(exp_x, axis=1, keepdims=True)
Interview Questions
If you build a CNN from scratch, expect these questions:
Conceptual Questions
- "Explain the difference between valid and same padding."
  - Valid: no padding, output smaller than input
  - Same: pad so output has the same spatial size as input
  - Formula for same padding: P = (F - 1) / 2, where F is the filter size
- "Why do we use small filters (3x3) instead of large ones (7x7)?"
  - Two stacked 3x3 filters have the same receptive field as one 5x5
  - But 2 * (3*3) = 18 params vs 25 params
  - More non-linearities (ReLU between layers)
  - VGGNet proved this empirically
- "What is the receptive field and why does it matter?"
  - The region of input that affects one output pixel
  - Deeper layers have larger receptive fields
  - Determines what context the network can use
- "How does max pooling provide translation invariance?"
  - If a feature shifts slightly, it might still be the max in its pool region
  - Small translations don't change the pooled output
  - But large translations (bigger than the pool size) aren't invariant
Implementation Questions
- "Walk me through the backward pass of convolution."
  - Need dL/dW and dL/dX
  - dL/dW: convolve the input with d_out
  - dL/dX: "full" convolution of d_out with the flipped filter
  - im2col makes this efficient
- "Why is im2col used instead of direct convolution?"
  - Converts convolution to matrix multiplication
  - Matrix multiplication is heavily optimized (BLAS, cuBLAS)
  - Avoids Python loop overhead
  - GPU-friendly
- "How would you implement strided convolution?"
  - In im2col, columns are extracted at stride intervals
  - Skip stride positions when iterating
  - Output size: (W - F) // stride + 1
- "What happens if I forget to store max indices during the forward pass?"
  - Cannot compute the correct backward pass
  - Gradients won't flow to the right input positions
  - Training will fail
Architecture Questions
- "Why do CNNs alternate conv and pooling layers?"
  - Conv: learn features at the current resolution
  - Pool: reduce size, add invariance
  - Alternating builds hierarchy: edges -> textures -> parts -> objects
- "How would you add batch normalization to your CNN?"
  - Add a BN layer after conv, before activation
  - Normalize each channel across batch and spatial dimensions
  - Learnable scale and shift parameters
  - Improves training stability
Hints in Layers
Stuck on implementation? Read only the hint level you need:
Challenge: im2col Is Confusing
Hint Level 1 (Conceptual): Think of im2col as taking each receptive field patch and making it a column in a matrix.
Hint Level 2 (Direction): For a 4x4 input with 2x2 filter, you get 9 positions (3x3 output). Each position is a 2x2 patch = 4 values. So im2col output is 4x9.
Hint Level 3 (Specific): Use np.lib.stride_tricks.as_strided for a fast vectorized version (but be careful with strides!).
Hint Level 4 (Code):
# Fast im2col using stride tricks
def im2col_fast(x, FH, FW, stride=1):
    B, C, H, W = x.shape
    out_h = (H - FH) // stride + 1
    out_w = (W - FW) // stride + 1

    # Use stride tricks to create a view of all patches (no copying)
    shape = (B, C, out_h, out_w, FH, FW)
    strides = (x.strides[0], x.strides[1],
               x.strides[2]*stride, x.strides[3]*stride,
               x.strides[2], x.strides[3])
    patches = np.lib.stride_tricks.as_strided(x, shape=shape, strides=strides)

    # Reshape to (C*FH*FW, B*out_h*out_w)
    return patches.transpose(1, 4, 5, 0, 2, 3).reshape(C*FH*FW, -1)
Challenge: col2im Accumulation
Hint Level 1 (Conceptual): Each input pixel appears in multiple columns of im2col. In col2im, you must add all contributions.
Hint Level 2 (Direction): Use np.add.at for indexed accumulation, which handles the case where the same index appears multiple times.
Hint Level 3 (Specific): Keep track of which input positions each column came from during im2col.
Hint Level 4 (Code):
# col2im by accumulating over output positions
# (np.add.at works similarly for a fully indexed version)
def col2im_fast(col, x_shape, FH, FW, stride=1):
    B, C, H, W = x_shape
    out_h = (H - FH) // stride + 1
    out_w = (W - FW) // stride + 1

    # (C, FH, FW, B, out_h, out_w) -> (B, C, out_h, out_w, FH, FW)
    col_reshaped = col.reshape(C, FH, FW, B, out_h, out_w).transpose(3, 0, 4, 5, 1, 2)
    dx = np.zeros((B, C, H, W))
    for i in range(out_h):
        for j in range(out_w):
            # SUM each patch gradient into its receptive field
            dx[:, :, i*stride:i*stride+FH, j*stride:j*stride+FW] += col_reshaped[:, :, i, j]
    return dx
Challenge: Gradient of Conv Filter
Hint Level 1 (Conceptual): dL/dW is the correlation of the input with the error gradient.
Hint Level 2 (Direction): Itโs actually a convolution where you slide d_out over the input.
Hint Level 3 (Specific): Using im2col, the columns represent input patches. Multiply by the corresponding output gradients.
Hint Level 4 (Code):
# dW = d_out_col @ col.T, then reshape
# d_out_col shape: (out_channels, B*out_h*out_w)
# col shape: (C*FH*FW, B*out_h*out_w)
# Result: (out_channels, C*FH*FW) -> reshape to (out_channels, C, FH, FW)
Extensions and Challenges
1. Add Batch Normalization
Batch normalization stabilizes training and allows higher learning rates:
class BatchNorm2D:
    def __init__(self, num_features, eps=1e-5, momentum=0.1):
        self.gamma = np.ones(num_features)   # Scale
        self.beta = np.zeros(num_features)   # Shift
        self.eps = eps
        self.momentum = momentum
        self.running_mean = np.zeros(num_features)
        self.running_var = np.ones(num_features)

    def forward(self, x, training=True):
        if training:
            mean = x.mean(axis=(0, 2, 3), keepdims=True)
            var = x.var(axis=(0, 2, 3), keepdims=True)
            # Update running statistics
            self.running_mean = (1 - self.momentum) * self.running_mean + self.momentum * mean.squeeze()
            self.running_var = (1 - self.momentum) * self.running_var + self.momentum * var.squeeze()
        else:
            mean = self.running_mean.reshape(1, -1, 1, 1)
            var = self.running_var.reshape(1, -1, 1, 1)
        x_norm = (x - mean) / np.sqrt(var + self.eps)
        return self.gamma.reshape(1, -1, 1, 1) * x_norm + self.beta.reshape(1, -1, 1, 1)
2. Implement Residual Connections (ResNet-style)
Skip connections allow training much deeper networks:
class ResidualBlock:
    def __init__(self, channels):
        self.conv1 = Conv2D(channels, channels, 3, padding=1)
        self.bn1 = BatchNorm2D(channels)
        self.conv2 = Conv2D(channels, channels, 3, padding=1)
        self.bn2 = BatchNorm2D(channels)
        self.relu = ReLU()

    def forward(self, x):
        identity = x  # Save input
        out = self.conv1.forward(x)
        out = self.bn1.forward(out)
        out = self.relu.forward(out)
        out = self.conv2.forward(out)
        out = self.bn2.forward(out)
        out = out + identity  # Skip connection!
        out = self.relu.forward(out)
        return out
3. Try on CIFAR-10
CIFAR-10 has 32x32 color images (3 channels) with 10 classes (airplanes, cars, etc.):
- Modify input channels from 1 to 3
- Likely need more layers/filters for the harder task
- Data augmentation helps: random crops, flips
4. Implement Average Pooling
Alternative to max pooling that takes the mean instead:
class AvgPool2D:
    def forward(self, x):
        # Average over each pool region
        pass

    def backward(self, d_out):
        # Gradient distributed equally to all positions
        # (unlike max pool where only the winner gets gradient)
        pass
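If you want a reference to compare against after attempting it yourself, here is a minimal loop-based sketch assuming the same constructor and interface as MaxPool2D above:

import numpy as np

class AvgPool2DSketch:
    def __init__(self, pool_size=2, stride=2):
        self.pool_size = pool_size
        self.stride = stride
        self.input_shape = None

    def forward(self, x):
        self.input_shape = x.shape
        B, C, H, W = x.shape
        P, S = self.pool_size, self.stride
        H_out, W_out = (H - P) // S + 1, (W - P) // S + 1
        out = np.zeros((B, C, H_out, W_out))
        for i in range(H_out):
            for j in range(W_out):
                # Mean over each pool region (vectorized over batch and channels)
                out[:, :, i, j] = x[:, :, i*S:i*S+P, j*S:j*S+P].mean(axis=(2, 3))
        return out

    def backward(self, d_out):
        B, C, H, W = self.input_shape
        P, S = self.pool_size, self.stride
        d_x = np.zeros((B, C, H, W))
        _, _, H_out, W_out = d_out.shape
        for i in range(H_out):
            for j in range(W_out):
                # Every position in the region receives an equal 1/(P*P) share
                d_x[:, :, i*S:i*S+P, j*S:j*S+P] += d_out[:, :, i:i+1, j:j+1] / (P * P)
        return d_x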
5. Add Dropout
Regularization technique that randomly zeros neurons during training:
class Dropout2D:
    def __init__(self, p=0.5):
        self.p = p
        self.mask = None

    def forward(self, x, training=True):
        if training:
            # Inverted dropout: scale so expected activation is unchanged
            self.mask = (np.random.rand(*x.shape) > self.p) / (1 - self.p)
            return x * self.mask
        return x

    def backward(self, d_out):
        return d_out * self.mask
Real-World Connections
Self-Driving Cars
Teslaโs Autopilot, Waymo, and others use CNNs for:
- Lane detection (pixel classification)
- Object detection (pedestrians, cars, signs)
- Depth estimation from cameras
Your CNN from scratch demonstrates the core technology. Production systems use:
- Much deeper networks (ResNet-50, EfficientNet)
- Multiple camera inputs fused together
- Real-time inference optimization
Medical Imaging
CNNs detect diseases in X-rays, MRIs, and CT scans:
- Diabetic retinopathy detection (Google)
- Skin cancer classification (Stanford)
- COVID-19 detection from chest X-rays
Your CNN teaches the fundamentals used in FDA-approved medical AI devices.
Smartphone Cameras
When your phone applies "portrait mode" or "night mode":
- CNNs segment foreground from background
- CNNs denoise low-light images
- CNNs enhance resolution (super-resolution)
All running on your phoneโs neural processing unit.
Content Moderation
Facebook, YouTube, and Instagram use CNNs to:
- Detect nudity and violence
- Identify copyrighted content
- Filter spam and fake accounts
Billions of images processed daily using architectures that build on what youโre learning.
Books That Will Help
| Book | Relevant Chapters | What You'll Learn |
|---|---|---|
| Deep Learning by Goodfellow, Bengio, Courville | Ch. 9: Convolutional Networks | The theoretical foundation: why CNNs work, receptive fields, invariance properties. The math is rigorous but essential. |
| Deep Learning with Python by François Chollet | Ch. 5: Deep Learning for Computer Vision | Practical intuition for CNN architectures. Written by the creator of Keras. Less math, more insight. |
| Neural Networks and Deep Learning by Michael Nielsen | Ch. 6: Deep Learning | Free online book with excellent visualizations. Good for building intuition before diving into implementation. |
| Grokking Deep Learning by Andrew Trask | Ch. 8, 10: CNNs | Code-first approach that matches our project style. Shows implementations you can learn from. |
| Dive into Deep Learning (d2l.ai) | Ch. 6: Convolutional Neural Networks | Free online book with executable code. Shows both math and implementation side by side. |
Academic Papers Worth Reading
- LeNet-5 (LeCun et al., 1998): The original CNN paper for digit recognition
- AlexNet (Krizhevsky et al., 2012): The paper that started the deep learning revolution
- VGGNet (Simonyan & Zisserman, 2014): Shows power of small 3x3 filters
- ResNet (He et al., 2015): Skip connections for very deep networks
Self-Assessment Checklist
Before considering this project complete, verify you can:
Implementation
- Implement Conv2D forward pass with correct output dimensions
- Implement im2col transformation for efficient convolution
- Implement Conv2D backward pass (gradient check passes)
- Implement MaxPool2D forward pass with max index tracking
- Implement MaxPool2D backward pass with gradient routing
- Connect conv layers to dense layers via Flatten
- Train the full CNN on MNIST to 99%+ accuracy
Understanding
- Explain why CNNs are more efficient than dense networks for images
- Calculate output dimensions given input, filter, stride, and padding
- Trace the backward pass of convolution for a simple example by hand
- Explain how max pooling provides translation invariance
- Describe the relationship between receptive field and network depth
Debugging
- Use gradient checking to verify backward passes
- Debug dimension mismatches in matrix operations
- Identify and fix numerical stability issues
Extensions
- Explain how batch normalization would integrate into your CNN
- Describe how residual connections (skip connections) work
- Compare your implementationโs performance to a framework (PyTorch/TensorFlow)
Resources
Primary References
- Stanford CS231n: Convolutional Neural Networks - Excellent notes on backprop through conv layers
- Deep Learning Book Chapter 9 - Theoretical foundation
- Andrej Karpathy's Conv Net Demo - Visual interactive demo
Implementation References
- im2col Explained - Detailed walkthrough
- Caffeโs im2col - Reference implementation
Videos
- 3Blue1Brown: But what is a convolution? - Beautiful visual explanation
- Andrew Ng: Convolutional Neural Networks - Coursera course
Datasets
- MNIST handwritten digits (60,000 training / 10,000 test images) - used throughout this project
- CIFAR-10 (32x32 color images, 10 classes) - for the extension challenge
Key Insights
Convolution is parameter sharing. Instead of learning separate weights for each pixel position, we learn one set of weights (the filter) and apply it everywhere. This single insight reduces parameters by orders of magnitude and gives CNNs their power.
im2col is the trick that makes CNNs fast. By reformatting the convolution as matrix multiplication, we leverage decades of linear algebra optimization. Every GPU CNN implementation uses this trick.
The backward pass through convolution is itself a convolution. Once you see this, the math becomes elegant: forward is convolution with the filter, backward is convolution with the flipped filter (plus some transpositions).
Translation invariance isn't magic - it's architecture. Shared weights mean the same features are detected everywhere. Pooling provides local invariance. Together, they let CNNs recognize objects regardless of position.
After completing this project, you will have implemented the core architecture that powers computer vision. From self-driving cars to medical imaging, CNNs are everywhere. You now understand not just how to use them, but how they work at the byte level. Project 10 (RNN) will show you how to extend these ideas to sequences and time.