P07: The Convolutional Kernel Explorer
Build a tool where you manually define 3x3 kernels and slide them over images to detect edges, sharpen, or blur - writing the convolution operation from scratch.
Project Overview
| Attribute | Value |
|---|---|
| Difficulty | Intermediate |
| Time Estimate | Weekend (8-16 hours) |
| Language | Python (OpenCV/NumPy) |
| Prerequisites | Loops, 2D arrays, basic image concepts |
| Primary Book | "Deep Learning with Python" by Francois Chollet, Ch. 5 |
| Knowledge Area | Computer Vision / Signal Processing |
Learning Objectives
After completing this project, you will be able to:
- Understand convolution mathematically - Explain exactly what happens when a kernel slides over an image
- Implement convolution from scratch - Write the nested loops that compute the sliding window operation
- Design kernels for specific effects - Know which 3x3 matrices detect edges, blur, or sharpen
- Handle image boundaries - Implement padding strategies (zero, reflect, wrap)
- Control output dimensions - Calculate output size based on stride and padding
- Process color images - Apply convolution to RGB images with 3 channels
- Connect to CNNs - Understand that neural networks LEARN these kernels automatically
The Core Question You're Answering
"How does a computer 'see' shapes?"
A digital image is just a grid of numbers - pixel values from 0 (black) to 255 (white). The computer has no concept of "edge," "corner," or "face." It just sees numbers.
The insight: Shapes are patterns in how numbers change. An edge is a sudden jump from low to high (or high to low). A corner is where two edges meet. A texture is a repeated pattern of changes.
Convolution is the mathematical operation that detects these changes. A 3x3 kernel acts as a "pattern detector" - it looks at a small window of pixels and produces a single number indicating how well that pattern matches.
Before coding, internalize this: A CNN doesn't "see" images - it detects patterns of numerical change, layer by layer, until abstract patterns like "cat" or "dog" emerge from primitive patterns like "horizontal edge" or "curve."
Concepts You Must Understand First
Stop and research these before coding:
1. Images as 2D/3D Arrays of Numbers
An image is a matrix (or tensor for color):
- Grayscale: 2D array, shape (height, width), values 0-255
- Color (RGB): 3D array, shape (height, width, 3), each channel is a 2D array
Grayscale image (5x5):        Color image: each pixel is 3 numbers (R, G, B):
+----------------------+      +------------------+
|  50  80 100 120  90  |      |   R    G    B    |
|  60  90 140 180 110  |      |  255   0    0    |  pure red
|  70 100 200 220 130  |      |   0   255   0    |  pure green
|  80 110 180 200 120  |      |   0    0   255   |  pure blue
|  70  90 120 140 100  |      +------------------+
+----------------------+
Book Reference: "Computer Vision: Algorithms and Applications" Ch. 2 - Richard Szeliski
2. The Convolution Operation Mathematically
Convolution is a โsliding windowโ operation. For each position in the image:
- Place the kernel (3x3 matrix) over a region of the image
- Multiply each kernel value by the corresponding pixel value
- Sum all products
- This sum becomes one pixel in the output
Mathematical definition for a kernel K and image I:
                k-1   k-1
Output(x, y) =  Sum   Sum   I(x+i, y+j) * K(i, j)
                i=0   j=0

where k is the kernel size (usually 3)
Book Reference: "Digital Image Processing" Ch. 3 - Gonzalez & Woods
3. Why Convolution Detects Features
The magic is in the kernel values. Consider this kernel:
Edge Detection (Laplacian):
+--------------+
| -1  -1  -1   |
| -1   8  -1   |
| -1  -1  -1   |
+--------------+
When applied to a uniform region (all same color), the result is 0:
- Center contributes: 8 * pixel_value
- Neighbors contribute: 8 * (-1) * pixel_value = -8 * pixel_value
- Total: 0
When applied to an edge (center bright, neighbors dark):
- Center contributes: 8 * high_value
- Neighbors contribute: -1 * low_values (smaller magnitude)
- Total: large positive number!
Key insight: The kernel "lights up" only where the pattern it encodes exists.
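You can verify this arithmetic directly before writing any real code; a minimal sketch:

import numpy as np

laplacian = np.array([[-1, -1, -1],
                      [-1,  8, -1],
                      [-1, -1, -1]], dtype=np.float32)

uniform_patch = np.full((3, 3), 100.0, dtype=np.float32)  # flat region
edge_patch = np.full((3, 3), 10.0, dtype=np.float32)      # dark surround...
edge_patch[1, 1] = 200.0                                  # ...bright center

print((uniform_patch * laplacian).sum())  # 0.0    -> no pattern, no response
print((edge_patch * laplacian).sum())     # 1520.0 -> strong response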
4. Common Kernels and What They Detect
Identity (no change):
+--------------+
|  0   0   0   |
|  0   1   0   |   Output = Input
|  0   0   0   |
+--------------+

Box Blur (average):
+------------------+
| 1/9  1/9  1/9    |
| 1/9  1/9  1/9    |   Averages the 3x3 neighborhood
| 1/9  1/9  1/9    |
+------------------+

Gaussian Blur (weighted average):
+---------------------+
| 1/16  2/16  1/16    |
| 2/16  4/16  2/16    |   Center weighted more
| 1/16  2/16  1/16    |
+---------------------+

Sharpen:
+--------------+
|  0  -1   0   |
| -1   5  -1   |   Emphasizes center vs neighbors
|  0  -1   0   |
+--------------+

Sobel (horizontal edge):
+--------------+
| -1  -2  -1   |
|  0   0   0   |   Responds to horizontal edges (vertical gradients)
|  1   2   1   |
+--------------+

Sobel (vertical edge):
+--------------+
| -1   0   1   |
| -2   0   2   |   Responds to vertical edges (horizontal gradients)
| -1   0   1   |
+--------------+

Emboss:
+--------------+
| -2  -1   0   |
| -1   1   1   |   Creates 3D relief effect
|  0   1   2   |
+--------------+
5. Padding Modes (valid, same, full)
When the kernel reaches the image edge, what happens?
No Padding (valid):
- Output is smaller than input
- Output size: (N - K + 1) where N is input size, K is kernel size
- For 5x5 image with 3x3 kernel: output is 3x3
Zero Padding (same):
- Pad input with zeros so output equals input size
- Pad amount: (K - 1) / 2 on each side
- For 3x3 kernel: pad 1 pixel of zeros around entire image
Reflect Padding:
- Mirror pixels at the boundary
- Avoids introducing artificial zeros
Original (5x5):        Zero Padded (7x7):
+-----------+          +---------------+
| A B C D E |          | 0 0 0 0 0 0 0 |
| F G H I J |   -->    | 0 A B C D E 0 |
| K L M N O |          | 0 F G H I J 0 |
| P Q R S T |          | 0 K L M N O 0 |
| U V W X Y |          | 0 P Q R S T 0 |
+-----------+          | 0 U V W X Y 0 |
                       | 0 0 0 0 0 0 0 |
                       +---------------+
6. Strides and Output Size Calculation
Stride = how many pixels to move the kernel between applications
Stride 1: Move 1 pixel at a time (standard)
Stride 2: Skip every other position (downsamples by 2)
Output size formula:
output_size = floor((input_size + 2*padding - kernel_size) / stride) + 1
Example:
Input: 224x224
Kernel: 3x3
Padding: 1 (same padding)
Stride: 2
Output = floor((224 + 2*1 - 3) / 2) + 1 = floor(223 / 2) + 1 = 111 + 1 = 112
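To sanity-check the formula in code (conv_out_size is a hypothetical helper name, not part of the project spec; floor division mirrors the formula above):

def conv_out_size(input_size, kernel_size, padding=0, stride=1):
    """Output size of a convolution along one dimension."""
    return (input_size + 2 * padding - kernel_size) // stride + 1

print(conv_out_size(224, 3, padding=1, stride=2))  # 112
print(conv_out_size(5, 3))                         # 3 ("valid" mode)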
Book Reference: โDeep Learningโ by Goodfellow, Bengio, Courville - Ch. 9
Deep Theoretical Foundation
Signal Processing Origins of Convolution
Convolution was invented for signal processing long before AI. The idea: combine two signals by sliding one over the other.
In 1D (audio signals):
Signal: [ 1 2 3 4 5 6 7 ]
Kernel: [ 0.25 0.5 0.25 ] (simple smoothing)
Slide kernel across signal:
Position 1: 0.25*1 + 0.5*2 + 0.25*3 = 2.0
Position 2: 0.25*2 + 0.5*3 + 0.25*4 = 3.0
...
For images, we extend this to 2D - the kernel slides in both X and Y directions.
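The 1D case fits in a few lines of NumPy; a minimal sketch reproducing the positions above:

import numpy as np

signal = np.array([1, 2, 3, 4, 5, 6, 7], dtype=np.float32)
kernel = np.array([0.25, 0.5, 0.25], dtype=np.float32)

# Slide the 3-tap kernel across the signal ("valid" positions only)
out = [float(signal[i:i+3] @ kernel) for i in range(len(signal) - 2)]
print(out)  # [2.0, 3.0, 4.0, 5.0, 6.0]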
Edge Detection Theory (First Derivative)
In calculus, the derivative measures rate of change. A large derivative means rapid change - an edge!
Pixel values:  10  10  10  200  200  200
                         ^
            Sudden jump = high derivative = EDGE
Approximate derivative with finite differences:
df/dx ≈ f(x+1) - f(x-1)   (central difference; the usual 1/2 factor only rescales the result)
The Sobel operator encodes this derivative:
Sobel X (detects vertical edges):
+--------------+
| -1   0   1   |   This computes: right - left,
| -2   0   2   |   weighted by distance from center
| -1   0   1   |
+--------------+
Large positive output = bright on right, dark on left
Large negative output = dark on right, bright on left
Zero output = no horizontal gradient
The Sobel, Prewitt, and Laplacian Operators
Prewitt (simpler Sobel):
Horizontal:           Vertical:
+--------------+      +--------------+
| -1  -1  -1   |      | -1   0   1   |
|  0   0   0   |      | -1   0   1   |
|  1   1   1   |      | -1   0   1   |
+--------------+      +--------------+
Laplacian (second derivative - detects ALL edges):
+--------------+       +--------------+
|  0  -1   0   |  or   | -1  -1  -1   |
| -1   4  -1   |       | -1   8  -1   |
|  0  -1   0   |       | -1  -1  -1   |
+--------------+       +--------------+
The Laplacian detects edges in all directions because it computes the sum of second derivatives.
Mathematical Definition of 2D Convolution
Formally, discrete 2D convolution of image I with kernel K:
(I * K)[x, y] = Sum_i Sum_j I[x-i, y-j] * K[i, j]
Note: Some definitions flip the kernel (true convolution).
What we implement is technically "cross-correlation."
For symmetric kernels, they're identical.
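If SciPy is available you can see the flip in action; a minimal sketch (SciPy's convolve2d/correlate2d, unrelated to the convolve2d you will write yourself):

import numpy as np
from scipy.signal import convolve2d as sp_convolve2d, correlate2d

img = np.random.rand(5, 5).astype(np.float32)
k = np.array([[0, 1, 0],
              [0, 0, 0],
              [0, 0, 0]], dtype=np.float32)  # deliberately asymmetric

conv = sp_convolve2d(img, k, mode='valid')          # true convolution (flips k)
corr = correlate2d(img, np.flip(k), mode='valid')   # correlation with flipped k
assert np.allclose(conv, corr)                      # identical by definition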
How CNNs LEARN Optimal Kernels
Here's the profound insight: In a CNN, the kernel values are weights that are learned through backpropagation.
Traditional approach (what you're building):
- Human designs kernel: "I want to detect horizontal edges"
- Kernel is fixed: [-1, -2, -1], [0, 0, 0], [1, 2, 1]
CNN approach:
- Initialize kernel with random values
- Show network many images with labels
- Backpropagate error to adjust kernel values
- Network discovers: "To recognize cats, I need THIS edge pattern"
The first layer of a trained CNN often learns kernels that look like Gabor filters (oriented edges at various angles). Deeper layers learn increasingly abstract patterns.
Layer 1 learns:       Layer 2 learns:        Layer 3+ learns:
Edges, gradients      Corners, textures      Object parts
                                             (eyes, ears, wheels)
This is why understanding convolution is essential: CNNs are just networks that learn which convolutions are useful.
Real World Outcome
After completing this project, you'll have a command-line tool that applies convolutions to images:
$ python convolve.py --image face.jpg --kernel "edge_detect"
Applying Kernel:
[[-1, -1, -1],
[-1, 8, -1],
[-1, -1, -1]]
Input image shape: (480, 640, 3)
Output image shape: (480, 640)
Processing time: 0.23 seconds
Saved result to output.jpg
Visual Output Examples
Original Image (face.jpg):
Description: A photograph of a human face with clear features -
eyes, nose, mouth, hair outline visible. Smooth skin tones,
varying lighting across the face.
After Edge Detection (Laplacian):
Description: The output is predominantly dark (black background)
with bright white lines tracing:
- The outline of the face against the background
- The edges of the eyes (eyelids, iris boundaries)
- The nose bridge and nostrils
- The lip boundaries
- Hair strands
All smooth regions (cheeks, forehead) are black because
there's no change = no edge.
After Sharpening:
Description: The face looks "crisper" - edges are more defined,
fine details like individual hairs and skin texture are
more visible. The image has higher local contrast.
After Gaussian Blur:
Description: The face looks "softer" - like looking through
frosted glass. Fine details are smoothed out, the image
appears slightly out of focus. Good for removing noise.
After Emboss:
Description: The face appears as if carved in stone or metal,
with a 3D relief effect. Edges facing one direction are
bright, opposite direction are dark. Gives a sculptural look.
Solution Architecture
System Design
+------------------------------------------------------------------+
|                     CONVOLVE.PY ARCHITECTURE                     |
+------------------------------------------------------------------+
|                                                                  |
|   +----------------+         +----------------+                  |
|   |   CLI Parser   |-------->|  Kernel Loader |                  |
|   |   (argparse)   |         |  (dictionary)  |                  |
|   +----------------+         +-------+--------+                  |
|           |                          |                           |
|           v                          v                           |
|   +----------------+         +----------------+                  |
|   |  Image Loader  |-------->|  Preprocessor  |                  |
|   |  (OpenCV/PIL)  |         |  (to float,    |                  |
|   |  RGB or Gray   |         |   normalize)   |                  |
|   +----------------+         +-------+--------+                  |
|                                      |                           |
|                                      v                           |
|                      +---------------------------+               |
|                      |    CONVOLUTION ENGINE     |               |
|                      |                           |               |
|                      |  +---------------------+  |               |
|                      |  |   Padding Handler   |  |               |
|                      |  | (zero/reflect/wrap) |  |               |
|                      |  +----------+----------+  |               |
|                      |             |             |               |
|                      |             v             |               |
|                      |  +---------------------+  |               |
|                      |  |   Sliding Window    |  |               |
|                      |  |   (nested loops)    |  |               |
|                      |  |                     |  |               |
|                      |  | for y in range(...):|  |               |
|                      |  |   for x in range:   |  |               |
|                      |  |     extract_patch   |  |               |
|                      |  |     element_mult    |  |               |
|                      |  |     sum_to_output   |  |               |
|                      |  +----------+----------+  |               |
|                      |             |             |               |
|                      |             v             |               |
|                      |  +---------------------+  |               |
|                      |  |    Stride Handler   |  |               |
|                      |  |  (output indexing)  |  |               |
|                      |  +---------------------+  |               |
|                      +-------------+-------------+               |
|                                    |                             |
|                                    v                             |
|                      +---------------------------+               |
|                      |      Post-processing      |               |
|                      |  - Clip to valid range    |               |
|                      |  - Convert to uint8       |               |
|                      |  - Handle color merge     |               |
|                      +-------------+-------------+               |
|                                    |                             |
|                                    v                             |
|                      +---------------------------+               |
|                      |          Output           |               |
|                      |  - Save to file           |               |
|                      |  - Display (optional)     |               |
|                      |  - Side-by-side compare   |               |
|                      +---------------------------+               |
|                                                                  |
+------------------------------------------------------------------+
Data Flow
Input Image          Kernel            Output
(H x W x C)          (K x K)           (H' x W')
+--------------+     +------------+    +--------------+
|  50  80 100  |     | -1  -1  -1 |    |              |
|  60  90 140  |  *  | -1   8  -1 |  = |  ??? (you    |
|  70 100 200  |     | -1  -1  -1 |    |   compute!)  |
|  80 110 180  |     +------------+    |              |
|  70  90 120  |                       +--------------+
+--------------+
Step-by-step:
1. Position kernel at top-left
2. Multiply and sum
3. Store in output[0,0]
4. Slide kernel right
5. Repeat until right edge
6. Move down, start from left
7. Repeat until bottom
Module Structure
convolve/
├── convolve.py        # Main CLI entry point
├── kernels.py         # Predefined kernel definitions
├── convolution.py     # Core convolution implementation
├── padding.py         # Padding utilities
├── utils.py           # Image I/O, visualization
└── tests/
    ├── test_convolution.py
    ├── test_kernels.py
    └── test_images/
        ├── checkerboard.png
        ├── gradient.png
        └── simple_edge.png
Phased Implementation Guide
Phase 1: Load Image as NumPy Array (30 minutes)
Goal: Read an image file and understand its structure
Tasks:
- Install dependencies: pip install numpy opencv-python matplotlib
- Load an image using OpenCV or PIL
- Print its shape and data type
- Understand the value range (0-255 for uint8)
Code Structure (hints, not solution):
import cv2
import numpy as np

def load_image(path, grayscale=True):
    """
    Load image from file.

    Args:
        path: Path to image file
        grayscale: If True, convert to grayscale

    Returns:
        numpy array of shape (H, W) for gray or (H, W, 3) for color
    """
    # Use cv2.imread with appropriate flags
    # cv2.IMREAD_GRAYSCALE or cv2.IMREAD_COLOR
    # Return as float32 for easier math
    pass
Verification:
img = load_image("test.jpg", grayscale=True)
print(f"Shape: {img.shape}") # Should be (height, width)
print(f"Dtype: {img.dtype}") # Should be float32
print(f"Range: {img.min()} to {img.max()}") # 0.0 to 255.0
Phase 2: Define Kernel Dictionaries (30 minutes)
Goal: Create a library of predefined kernels
Tasks:
- Create a dictionary mapping names to 3x3 NumPy arrays
- Include: identity, blur, sharpen, edge_detect, sobel_x, sobel_y, emboss
Code Structure:
KERNELS = {
    "identity": np.array([
        [0, 0, 0],
        [0, 1, 0],
        [0, 0, 0]
    ], dtype=np.float32),

    "box_blur": np.array([
        # Fill in: 1/9 for all 9 elements
    ], dtype=np.float32),

    "edge_detect": np.array([
        # Fill in: Laplacian kernel
    ], dtype=np.float32),

    # Add more...
}

def get_kernel(name):
    """Return kernel by name, or parse custom [[...]] format."""
    if name in KERNELS:
        return KERNELS[name]
    # Optional: parse custom kernels from string
    pass
Verification:
kernel = get_kernel("box_blur")
print(kernel.sum()) # Should be ~1.0 for blur
print(kernel.shape) # Should be (3, 3)
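For the optional custom-kernel string format, one safe approach is ast.literal_eval; a minimal sketch under that assumption (parse_custom_kernel is a hypothetical helper name):

import ast
import numpy as np

def parse_custom_kernel(text):
    """Parse a string like "[[0,-1,0],[-1,5,-1],[0,-1,0]]" into a kernel."""
    values = ast.literal_eval(text)  # safe: only parses Python literals
    kernel = np.array(values, dtype=np.float32)
    if kernel.ndim != 2 or kernel.shape[0] % 2 == 0:
        raise ValueError(f"Expected an odd-sized 2D kernel, got {kernel.shape}")
    return kernel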
Phase 3: Implement Naive Convolution (Nested Loops) (1-2 hours)
Goal: Write the basic sliding window algorithm
This is the core of the project. You must implement this yourself!
The Algorithm:
For each output pixel (out_y, out_x):
1. Find the corresponding patch in the input
2. The patch is kernel_size x kernel_size
3. Multiply patch elementwise with kernel
4. Sum all products
5. Store in output[out_y, out_x]
ASCII Visualization of Sliding Window:
Input Image (5x5):           Kernel (3x3):
+-----------+                +---------------+
| a b c d e |                | k00  k01  k02 |
| f g h i j |                | k10  k11  k12 |
| k l m n o |                | k20  k21  k22 |
| p q r s t |                +---------------+
| u v w x y |
+-----------+

Position 1 (top-left):       Position 2 (shifted right):
+-------------+              +-------------+
| [a b c] d e |              | a [b c d] e |
| [f g h] i j |              | f [g h i] j |
| [k l m] n o |              | k [l m n] o |
|  p q r s t  |              |  p q r s t  |
|  u v w x y  |              |  u v w x y  |
+-------------+              +-------------+

Output[0,0] = a*k00 + b*k01 + c*k02      Output[0,1] = b*k00 + c*k01 + d*k02
            + f*k10 + g*k11 + h*k12                  + g*k10 + h*k11 + i*k12
            + k*k20 + l*k21 + m*k22                  + l*k20 + m*k21 + n*k22
Continue sliding right, then down...
Code Structure:
def convolve2d_naive(image, kernel):
    """
    Apply 2D convolution using nested loops.

    Args:
        image: 2D numpy array (H, W)
        kernel: 2D numpy array (K, K), must be odd-sized

    Returns:
        2D numpy array, output of convolution
    """
    img_h, img_w = image.shape
    k_h, k_w = kernel.shape

    # Calculate output dimensions (no padding = "valid")
    out_h = img_h - k_h + 1
    out_w = img_w - k_w + 1

    # Initialize output array
    output = np.zeros((out_h, out_w), dtype=np.float32)

    # Half kernel size for indexing
    half_k = k_h // 2

    # Nested loops - YOU IMPLEMENT THIS
    for out_y in range(out_h):
        for out_x in range(out_w):
            # Extract the patch from image
            # Multiply with kernel
            # Sum and store
            pass

    return output
Verification:
# Test with identity kernel - output should equal center of input
img = np.array([[1, 2, 3],
                [4, 5, 6],
                [7, 8, 9]], dtype=np.float32)
kernel = KERNELS["identity"]
result = convolve2d_naive(img, kernel)
assert result[0, 0] == 5  # Center of 3x3 input

# Test with known values
img = np.ones((5, 5), dtype=np.float32)
blur = KERNELS["box_blur"]
result = convolve2d_naive(img, blur)
assert np.allclose(result, 1.0)  # Blurring uniform image = same
Phase 4: Add Padding Support (1 hour)
Goal: Allow output to be same size as input
Tasks:
- Implement pad_image(image, pad_size, mode='zero')
- Support modes: 'zero', 'reflect', 'replicate'
- Modify convolve2d to accept a padding parameter
Padding Visualization:
Original (3x3):      Zero Padded (5x5):
+---------+          +-------------+
|  1 2 3  |          |  0 0 0 0 0  |
|  4 5 6  |   -->    |  0 1 2 3 0  |
|  7 8 9  |          |  0 4 5 6 0  |
+---------+          |  0 7 8 9 0  |
                     |  0 0 0 0 0  |
                     +-------------+

Reflect Padded (5x5):
+-------------+
|  5 4 5 6 5  |   (reflected at borders)
|  2 1 2 3 2  |
|  5 4 5 6 5  |
|  8 7 8 9 8  |
|  5 4 5 6 5  |
+-------------+
Code Structure:
def pad_image(image, pad_size, mode='zero'):
    """
    Pad image borders.

    Args:
        image: 2D numpy array
        pad_size: Number of pixels to pad on each side
        mode: 'zero', 'reflect', or 'replicate'

    Returns:
        Padded image
    """
    if mode == 'zero':
        return np.pad(image, pad_size, mode='constant', constant_values=0)
    elif mode == 'reflect':
        return np.pad(image, pad_size, mode='reflect')
    elif mode == 'replicate':
        return np.pad(image, pad_size, mode='edge')
    else:
        raise ValueError(f"Unknown padding mode: {mode}")
Verification:
img = np.array([[1, 2], [3, 4]], dtype=np.float32)
padded = pad_image(img, 1, 'zero')
assert padded.shape == (4, 4)
assert padded[0, 0] == 0 # Corner should be zero
assert padded[1, 1] == 1 # Original data preserved
Phase 5: Add Stride Support (45 minutes)
Goal: Skip pixels for faster computation/downsampling
Stride Visualization:
Stride = 1 (standard):        Stride = 2 (downsample):
Every position computed       Skip every other position
+----------------+            +----------------+
| [X][X][X][X]   |            | [X] .  [X] .   |
| [X][X][X][X]   |            |  .  .   .  .   |
| [X][X][X][X]   |            | [X] .  [X] .   |
| [X][X][X][X]   |            |  .  .   .  .   |
+----------------+            +----------------+
Output: 4x4                   Output: 2x2
Code Modification:
def convolve2d(image, kernel, stride=1, padding=0, pad_mode='zero'):
    """
    Full convolution with stride and padding support.
    """
    # Pad image if needed
    if padding > 0:
        image = pad_image(image, padding, pad_mode)

    img_h, img_w = image.shape
    k_h, k_w = kernel.shape

    # Calculate output dimensions with stride
    out_h = ((img_h - k_h) // stride) + 1
    out_w = ((img_w - k_w) // stride) + 1

    output = np.zeros((out_h, out_w), dtype=np.float32)

    for out_y in range(out_h):
        for out_x in range(out_w):
            # Input position accounts for stride
            in_y = out_y * stride
            in_x = out_x * stride
            # Rest of convolution...
            pass

    return output
Verification:
img = np.ones((6, 6), dtype=np.float32)
kernel = KERNELS["identity"]
result = convolve2d(img, kernel, stride=2, padding=1)
assert result.shape == (3, 3) # Downsampled by 2
Phase 6: Handle Color Images (3 Channels) (45 minutes)
Goal: Apply convolution to RGB images
Strategy: Apply the same kernel to each channel independently, then recombine
Color Image (H, W, 3):
+-------------+
|  R channel  | --> convolve --> R'
+-------------+
|  G channel  | --> convolve --> G'  --> Stack --> Output (H', W', 3)
+-------------+
|  B channel  | --> convolve --> B'
+-------------+
Code Structure:
def convolve2d_color(image, kernel, **kwargs):
    """
    Apply convolution to color image.

    Args:
        image: 3D numpy array (H, W, C)
        kernel: 2D kernel to apply to each channel
        **kwargs: stride, padding, pad_mode

    Returns:
        3D numpy array (H', W', C)
    """
    if image.ndim == 2:
        # Grayscale
        return convolve2d(image, kernel, **kwargs)

    # Process each channel
    channels = []
    for c in range(image.shape[2]):
        channel_result = convolve2d(image[:, :, c], kernel, **kwargs)
        channels.append(channel_result)

    return np.stack(channels, axis=2)
Verification:
img = np.random.rand(100, 100, 3).astype(np.float32) * 255
kernel = KERNELS["box_blur"]
result = convolve2d_color(img, kernel, padding=1)
assert result.shape == img.shape # Same dimensions
assert result.shape[2] == 3 # Still 3 channels
Phase 7: Visualization and CLI (1 hour)
Goal: Create user-friendly interface and visual comparison
Tasks:
- Create side-by-side before/after display
- Parse command-line arguments
- Handle image saving
- Add progress indicator for large images
Code Structure:
import argparse
import matplotlib.pyplot as plt

def visualize_result(original, processed, kernel_name):
    """Show original and processed side by side."""
    fig, axes = plt.subplots(1, 2, figsize=(12, 6))
    axes[0].imshow(original, cmap='gray' if original.ndim == 2 else None)
    axes[0].set_title('Original')
    axes[0].axis('off')
    axes[1].imshow(processed, cmap='gray' if processed.ndim == 2 else None)
    axes[1].set_title(f'After {kernel_name}')
    axes[1].axis('off')
    plt.tight_layout()
    plt.show()

def main():
    parser = argparse.ArgumentParser(description='Image Convolution Tool')
    parser.add_argument('--image', required=True, help='Input image path')
    parser.add_argument('--kernel', required=True, help='Kernel name or custom [[...]]')
    parser.add_argument('--output', default='output.jpg', help='Output path')
    parser.add_argument('--stride', type=int, default=1)
    parser.add_argument('--padding', type=int, default=0)
    parser.add_argument('--show', action='store_true', help='Display result')
    args = parser.parse_args()

    # Load, process, save
    # You implement this!

if __name__ == '__main__':
    main()
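One possible shape for the load-process-save glue; a minimal sketch, assuming the helpers from earlier phases (load_image, get_kernel, convolve2d_color) and RGB float images:

# inside main(), after parsing args
image = load_image(args.image, grayscale=False)
kernel = get_kernel(args.kernel)
result = convolve2d_color(image, kernel, stride=args.stride, padding=args.padding)
result = np.clip(result, 0, 255).astype(np.uint8)
cv2.imwrite(args.output, cv2.cvtColor(result, cv2.COLOR_RGB2BGR))  # OpenCV saves BGR
if args.show:
    visualize_result(image.astype(np.uint8), result, args.kernel)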
Questions to Guide Your Design
Before implementing, think through these:
Algorithm Design
- Loop Order: Should you loop over output positions or input positions? Why?
- Index Calculation: Given output position (y, x), what input pixels contribute?
- Edge Cases: What happens at corners? What about even-sized kernels?
Efficiency Considerations
- Memory Layout: Is row-major or column-major access faster? Why?
- Vectorization: How could you use NumPy operations instead of loops?
- Separable Kernels: If a kernel is separable (k = v * h^T), how does this help?
Numerical Stability
- Data Types: Why use float32 instead of uint8 during computation?
- Clipping: After convolution, values may exceed 0-255. How do you handle this?
- Normalization: Should the kernel sum to 1? When does it matter?
Design Decisions
- Kernel Flipping: True convolution flips the kernel. Cross-correlation doesn't. Which are you implementing?
- Boundary Handling: When would you prefer zero padding vs. reflection?
- Color vs. Grayscale: For edge detection, would you convolve color or convert to gray first?
Thinking Exercise
Manual Convolution Practice
Before coding, apply this 3x3 kernel to this 5x5 image BY HAND:
Input Image:
+----------------------+
|   0   0    0   0   0 |
|   0   0    0   0   0 |
|   0   0  255   0   0 |
|   0   0    0   0   0 |
|   0   0    0   0   0 |
+----------------------+
(A single white pixel in the center)
Kernel (Laplacian edge detector):
+--------------+
| -1  -1  -1   |
| -1   8  -1   |
| -1  -1  -1   |
+--------------+
Calculate the output (3x3, no padding):
Position (0,0): top-left 3x3 patch

  Patch:           Kernel:           Products:
  0   0    0       -1  -1  -1         0   0    0
  0   0    0   x   -1   8  -1    =    0   0    0      Sum = -255
  0   0  255       -1  -1  -1         0   0  -255

Output[0,0] = -255

Position (0,1): patch shifted one step right

  Patch:           Products:
  0    0   0        0     0   0
  0    0   0        0     0   0       Sum = -255
  0  255   0        0  -255   0

Output[0,1] = -255

Position (1,1): centered on the white pixel

  Patch:           Products:
  0    0   0        0     0   0
  0  255   0        0  2040   0       Sum = 2040
  0    0   0        0     0   0

Output[1,1] = 8 * 255 = 2040
Complete the 3x3 output grid:
+---------------------+
|  -255  -255  -255   |
|  -255  2040  -255   |
|  -255  -255  -255   |
+---------------------+
Insight: The edge detector "amplifies" the single pixel into a pattern showing high contrast between center and surroundings!
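To double-check the hand calculation by machine, a quick sketch assuming SciPy is available (correlate2d matches the sliding-window math above):

import numpy as np
from scipy.signal import correlate2d

img = np.zeros((5, 5), dtype=np.float32)
img[2, 2] = 255
laplacian = np.array([[-1, -1, -1],
                      [-1,  8, -1],
                      [-1, -1, -1]], dtype=np.float32)

print(correlate2d(img, laplacian, mode='valid'))
# [[-255. -255. -255.]
#  [-255. 2040. -255.]
#  [-255. -255. -255.]]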
Testing Strategy
Unit Tests for Core Functions
def test_identity_kernel():
    """Identity kernel should preserve center values."""
    img = np.array([[1, 2, 3],
                    [4, 5, 6],
                    [7, 8, 9]], dtype=np.float32)
    kernel = np.array([[0, 0, 0],
                       [0, 1, 0],
                       [0, 0, 0]], dtype=np.float32)
    result = convolve2d_naive(img, kernel)
    assert result.shape == (1, 1)
    assert result[0, 0] == 5

def test_box_blur_uniform():
    """Blurring uniform image should return same values."""
    img = np.full((5, 5), 100.0, dtype=np.float32)
    kernel = np.full((3, 3), 1/9, dtype=np.float32)
    result = convolve2d_naive(img, kernel)
    assert np.allclose(result, 100.0)

def test_output_dimensions():
    """Test output size calculation."""
    img = np.zeros((10, 10), dtype=np.float32)
    kernel = np.zeros((3, 3), dtype=np.float32)

    # No padding
    result = convolve2d(img, kernel, padding=0)
    assert result.shape == (8, 8)

    # Same padding
    result = convolve2d(img, kernel, padding=1)
    assert result.shape == (10, 10)

    # Stride 2
    result = convolve2d(img, kernel, stride=2, padding=1)
    assert result.shape == (5, 5)
def test_edge_detection_on_gradient():
    """Horizontal-edge Sobel kernel should respond at the transition."""
    # Top two rows white, rest black: a horizontal edge between rows 1 and 2
    img = np.zeros((5, 5), dtype=np.float32)
    img[0:2, :] = 255
    # Sobel kernel for horizontal edges (vertical gradient)
    kernel = np.array([[-1, -2, -1],
                       [ 0,  0,  0],
                       [ 1,  2,  1]], dtype=np.float32)
    result = convolve2d_naive(img, kernel)
    # Patches for output rows 0 and 1 straddle the edge; row 2 sees only black.
    # The bright-to-dark transition gives a negative response, so compare
    # absolute values.
    assert np.abs(result[0, :]).mean() > np.abs(result[2, :]).mean()
    assert np.abs(result[1, :]).mean() > np.abs(result[2, :]).mean()
Visual Tests
- Checkerboard Pattern: Apply identity kernel - output should look identical
- Uniform Image: Apply any kernel - output should be uniform (for valid kernels)
- Natural Image: Apply Sobel - should highlight edges
- Blur: Apply Gaussian blur - image should look softer
Edge Cases
def test_single_pixel_image():
    """Handle 1x1 images gracefully."""
    img = np.array([[100.0]])
    kernel = np.array([[0, 0, 0],
                       [0, 1, 0],
                       [0, 0, 0]], dtype=np.float32)
    # With same padding, should work
    result = convolve2d(img, kernel, padding=1)
    assert result.shape == (1, 1)

def test_large_kernel():
    """Handle kernels larger than image."""
    img = np.ones((3, 3), dtype=np.float32)
    kernel = np.ones((5, 5), dtype=np.float32) / 25
    # Should return empty or raise informative error
    try:
        result = convolve2d(img, kernel, padding=0)
        assert result.size == 0 or result.shape[0] == 0
    except ValueError as e:
        assert "kernel larger than image" in str(e).lower()
Common Pitfalls and Debugging Tips
Pitfall 1: Off-by-One Errors
Symptom: Output has black borders or wrong dimensions
Cause: Incorrect loop bounds or index calculations
Fix: Double-check the output dimension formula:
out_h = ((img_h + 2*padding - kernel_h) // stride) + 1
Pitfall 2: Axis Confusion
Symptom: Image appears rotated or flipped
Cause: Mixing up x/y, row/column, height/width conventions
Fix: Be consistent. NumPy uses [row, col] = [y, x]. Document your convention.
Pitfall 3: Integer Division Issues
Symptom: Blur kernel produces very dark output
Cause: Using integer division when creating kernel
# WRONG:
kernel = np.array([[1/9, 1/9, 1/9], ...]) # 1/9 = 0 in Python 2!
# RIGHT:
kernel = np.array([[1/9, 1/9, 1/9], ...], dtype=np.float32)
Pitfall 4: Clipping Issues
Symptom: Edge detection output looks washed out or wrong
Cause: Not handling negative values or overflow
Fix:
# After convolution, values may be outside [0, 255]
# Option 1: Clip
output = np.clip(output, 0, 255)
# Option 2: Normalize
output = (output - output.min()) / (output.max() - output.min()) * 255
# Option 3: Absolute value (for edge detection)
output = np.abs(output)
Pitfall 5: Slow Performance
Symptom: Processing takes minutes for large images
Cause: Python loops are slow
Fix:
- First make it work correctly with loops
- Then optimize with NumPy vectorization (a sketch follows below) or use scipy.ndimage.convolve
- For learning, slow is fine - you're understanding the algorithm!
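For later reference, a vectorized sketch assuming NumPy 1.20+ (for sliding_window_view); it computes the same "valid" result as the nested loops:

import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def convolve2d_vectorized(image, kernel):
    """'Valid' sliding-window correlation without Python-level loops."""
    # windows has shape (out_h, out_w, k_h, k_w): every patch at once
    windows = sliding_window_view(image, kernel.shape)
    return np.einsum('ijkl,kl->ij', windows, kernel)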
Debugging Technique: Print Intermediate Values
def convolve2d_debug(image, kernel, verbose=True):
    """Convolution ("valid" mode) with debug output for the first positions."""
    k_h, k_w = kernel.shape
    out_h = image.shape[0] - k_h + 1
    out_w = image.shape[1] - k_w + 1
    output = np.zeros((out_h, out_w), dtype=np.float32)
    for out_y in range(out_h):
        for out_x in range(out_w):
            patch = image[out_y:out_y+k_h, out_x:out_x+k_w]
            product = patch * kernel
            value = product.sum()
            if verbose and out_y < 2 and out_x < 2:
                print(f"Position ({out_y}, {out_x}):")
                print(f"  Patch:\n{patch}")
                print(f"  Kernel:\n{kernel}")
                print(f"  Product:\n{product}")
                print(f"  Sum: {value}")
            output[out_y, out_x] = value
    return output
Interview Questions
Prepare to answer these:
Basic Understanding
- "Explain convolution in image processing."
- Key points: Sliding window, element-wise multiplication, sum to single output, kernel defines what pattern to detect
- "Why do we use 3x3 kernels instead of larger ones?"
- Key points: Smaller = faster, stacking small kernels achieves a large receptive field, 3x3 captures local patterns well
- "What's the difference between convolution and correlation?"
- Key points: Convolution flips the kernel, correlation doesn't. For symmetric kernels they're identical. Deep learning typically uses correlation.
Implementation Details
- "How do you handle image borders during convolution?"
- Key points: Padding strategies - zero padding adds artificial edges, reflect padding is more natural, valid mode shrinks output
- "Why does stride affect output dimensions?"
- Key points: Stride skips positions, fewer outputs = smaller spatial dimensions, stride 2 halves size
- "How would you optimize a naive convolution implementation?"
- Key points: im2col transformation, vectorization, separable kernels, FFT-based convolution for large kernels
CNN Connection
- "How does convolution relate to CNNs?"
- Key points: CNN learns kernel values via backpropagation, first layers learn edges, deeper layers learn abstract features
- "What is a feature map?"
- Key points: Output of one convolution, each kernel produces one feature map, multiple kernels = multiple feature maps
- "Why are CNNs translation invariant?"
- Key points: Same kernel applied everywhere, a pattern is detected regardless of position, pooling further improves invariance
Practical Application
- "When would you use Sobel vs. Laplacian for edge detection?"
- Key points: Sobel detects directional edges (horizontal/vertical), Laplacian detects all edges, Sobel gives edge direction info
Hints in Layers
If you're stuck, read these one at a time:
Hint 1: Loop Structure The outer loops iterate over output positions. For each output position, you need to know which input pixels contribute. With no padding, output position (y, x) gets input from image[y:y+k_h, x:x+k_w].
Hint 2: NumPy Slicing
Extract a patch with: patch = image[y:y+k_size, x:x+k_size]. This gives you a k_size x k_size array.
Hint 3: Element-wise Operations
Once you have the patch, computing the output is simple: value = (patch * kernel).sum(). NumPy handles element-wise multiplication.
Hint 4: Padding for Same Output Size
To make output same size as input with 3x3 kernel, pad by 1 on each side. Use np.pad(image, 1, mode='constant') for zero padding.
Hint 5: Stride Implementation With stride s, the input position for output (oy, ox) is (oy * s, ox * s). Update your output dimension calculation accordingly.
Hint 6: Color Images
Process each channel independently. Split image into R, G, B, convolve each, then stack with np.stack([r_out, g_out, b_out], axis=2).
Extensions and Challenges
Extension 1: Implement Gaussian Blur
The Gaussian kernel isn't just uniform weights - it follows a bell curve:
def create_gaussian_kernel(size, sigma):
"""Create a Gaussian blur kernel."""
ax = np.linspace(-(size // 2), size // 2, size)
xx, yy = np.meshgrid(ax, ax)
kernel = np.exp(-0.5 * (xx**2 + yy**2) / sigma**2)
return kernel / kernel.sum() # Normalize
# Try different sigma values
kernel_3x3 = create_gaussian_kernel(3, sigma=1.0)
kernel_5x5 = create_gaussian_kernel(5, sigma=1.5)
Extension 2: Separable Kernels for Efficiency
Some kernels can be decomposed into vertical * horizontal:
Box blur 3x3:            Can be separated:
+-----------------+      +-------+     +-----------------+
| 1/9  1/9  1/9   |      |  1/3  |     |  1/3  1/3  1/3  |
| 1/9  1/9  1/9   |  =   |  1/3  |  x  +-----------------+
| 1/9  1/9  1/9   |      |  1/3  |
+-----------------+      +-------+

Instead of 9 multiplications per pixel, you do 3 + 3 = 6!
Implement separable convolution:
def convolve_separable(image, v_kernel, h_kernel):
    """Apply vertical then horizontal 1D convolutions."""
    # First convolve horizontally
    temp = convolve1d_horizontal(image, h_kernel)
    # Then convolve vertically
    result = convolve1d_vertical(temp, v_kernel)
    return result
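The 1-D helpers are the part the skeleton leaves open; a minimal sketch, assuming "valid" mode and 1-D NumPy kernels:

import numpy as np

def convolve1d_horizontal(image, h_kernel):
    k = len(h_kernel)
    out = np.zeros((image.shape[0], image.shape[1] - k + 1), dtype=np.float32)
    for x in range(out.shape[1]):
        out[:, x] = image[:, x:x+k] @ h_kernel  # dot along each row window
    return out

def convolve1d_vertical(image, v_kernel):
    k = len(v_kernel)
    out = np.zeros((image.shape[0] - k + 1, image.shape[1]), dtype=np.float32)
    for y in range(out.shape[0]):
        out[y, :] = v_kernel @ image[y:y+k, :]  # dot along each column window
    return out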
Extension 3: Real-Time Webcam Edge Detector
import cv2

def realtime_edge_detection():
    cap = cv2.VideoCapture(0)
    kernel = np.array([[-1, -1, -1],
                       [-1,  8, -1],
                       [-1, -1, -1]], dtype=np.float32)
    while True:
        ret, frame = cap.read()
        if not ret:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY).astype(np.float32)
        # Note: the naive Python-loop convolve2d is slow at webcam resolution -
        # downscale the frame or swap in a vectorized version for real time
        edges = convolve2d(gray, kernel, padding=1)
        edges = np.clip(np.abs(edges), 0, 255).astype(np.uint8)
        cv2.imshow('Edges', edges)
        if cv2.waitKey(1) & 0xFF == ord('q'):
            break
    cap.release()
    cv2.destroyAllWindows()
Extension 4: Compare with OpenCV
Verify your implementation matches OpenCV:
import cv2

# Your implementation (zero padding)
my_result = convolve2d(image, kernel, padding=1)

# OpenCV's implementation - use a float image and match the border mode
# (filter2D defaults to reflected borders, which would differ at the edges;
# BORDER_CONSTANT pads with zeros, like padding=1 above)
cv_result = cv2.filter2D(image.astype(np.float32), -1, kernel,
                         borderType=cv2.BORDER_CONSTANT)

# Should match (within floating point tolerance)
assert np.allclose(my_result, cv_result, rtol=1e-5)
Extension 5: Visualize What CNNs Learn
Load a pretrained CNN and visualize its first-layer kernels:
import torch
import torchvision.models as models
import matplotlib.pyplot as plt

# Load pretrained model (newer torchvision versions prefer
# weights=models.ResNet18_Weights.DEFAULT over pretrained=True)
model = models.resnet18(pretrained=True)

# Get first convolutional layer weights
first_conv = model.conv1.weight.data.numpy()
# Shape: (64, 3, 7, 7) - 64 kernels, 3 input channels, 7x7 size

# Visualize the kernels
fig, axes = plt.subplots(8, 8, figsize=(12, 12))
for i, ax in enumerate(axes.flat):
    # Take first channel of each kernel
    kernel = first_conv[i, 0]
    ax.imshow(kernel, cmap='gray')
    ax.axis('off')
plt.suptitle('First Layer CNN Kernels (learned from data!)')
plt.show()
Real-World Connections
Instagram Filters
Those "vintage" and "artistic" filters? Many are just convolutions:
- Vignette: Not convolution, but demonstrates per-pixel operations
- Sharpen: The sharpen kernel you implemented
- Blur/Soft Focus: Gaussian blur kernels
- Emboss: Creates that โstampedโ look
Photoshop
- Unsharp Mask: Sharpen by subtracting blurred version
- Find Edges: Sobel or Laplacian filters
- Gaussian Blur: Exactly what you built
- High Pass Filter: Original minus low-pass (blurred)
Medical Imaging
- X-ray Enhancement: Edge enhancement to see bone details
- MRI Processing: Noise reduction with smoothing kernels
- Tumor Detection: CNN-learned kernels find abnormalities
Computer Vision Systems
- Self-Driving Cars: Edge detection for lane finding
- Face Recognition: CNN first layers detect facial features
- OCR: Edge detection helps isolate characters
How Production Systems Differ
Your implementation uses Python loops - educational but slow. Production systems:
- GPU Acceleration: cuDNN library runs convolutions on GPU
- SIMD Instructions: CPU uses vectorized operations
- Memory Optimization: im2col transformation trades memory for speed (sketched after this list)
- Fused Operations: Combine convolution + activation in one kernel
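To make the im2col idea concrete, a toy sketch (assuming NumPy 1.20+ for sliding_window_view; real libraries do this in optimized C/CUDA):

import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def im2col_conv(image, kernel):
    """Toy im2col: unroll every patch into a row, then one matrix product."""
    k_h, k_w = kernel.shape
    patches = sliding_window_view(image, (k_h, k_w))   # (out_h, out_w, k_h, k_w)
    out_h, out_w = patches.shape[:2]
    cols = patches.reshape(out_h * out_w, k_h * k_w)   # one flattened patch per row
    return (cols @ kernel.reshape(-1)).reshape(out_h, out_w)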
Books That Will Help
| Topic | Book | Chapter/Section |
|---|---|---|
| CNN Foundations | Deep Learning with Python by Francois Chollet | Ch. 5 (Complete introduction) |
| Mathematical Theory | Deep Learning by Goodfellow, Bengio, Courville | Ch. 9 (Convolutional Networks) |
| Image Processing | Digital Image Processing by Gonzalez & Woods | Ch. 3 (Spatial Filtering) |
| Computer Vision | Computer Vision: Algorithms and Applications by Szeliski | Ch. 3 (Image Processing) |
| Signal Processing | Signals and Systems by Oppenheim | Ch. 3-4 (Convolution) |
| Practical Vision | Programming Computer Vision with Python by Solem | Ch. 1-2 (Basic operations) |
Self-Assessment Checklist
Conceptual Understanding
- I can explain convolution without looking at notes
- I can design a kernel for a specific effect (blur, sharpen, edge)
- I understand why padding and stride affect output dimensions
- I can calculate output size given input, kernel, padding, and stride
- I know why CNNs LEARN kernels instead of using hand-crafted ones
Implementation Skills
- My naive convolution produces correct output for known inputs
- I can handle both grayscale and color images
- I implemented padding correctly (zero, reflect)
- I implemented stride support
- My code handles edge cases gracefully
Practical Application
- I applied edge detection to a real image and understood the output
- I experimented with different kernels and saw their effects
- I can debug convolution issues by inspecting intermediate values
- I verified my implementation against OpenCV
Teaching Test
Can you explain to someone else:
- Why does an edge detection kernel have negative values?
- Whatโs the difference between blur and sharpen kernels?
- Why do we use "same" padding in neural networks?
- How does a CNN โseeโ a cat?
Moving Forward
After completing this project:
- Next Project: P08: MNIST From First Principles - Apply convolution thinking to digit recognition
- Then: P09: CNN From Scratch - Build a full CNN with learned kernels
- Deep Dive: Implement the backward pass for convolution (gradient computation)
The key insight to carry forward: Convolution is pattern matching. A kernel says "I'm looking for THIS pattern," and the output says "here's how strongly that pattern appears at each location." CNNs learn which patterns matter for the task at hand.
When you look at a CNN architecture like ResNet or VGG, you now understand the fundamental operation: layers upon layers of learned pattern detectors, from simple edges to complex objects.
This project bridges classical computer vision (hand-designed kernels) with modern deep learning (learned kernels). Understanding both gives you the intuition to design, debug, and optimize neural networks for vision tasks.