P07: The Convolutional Kernel Explorer
Build a tool where you manually define 3x3 kernels and slide them over images to detect edges, sharpen, or blur - writing the convolution operation from scratch.
Project Overview
| Attribute | Value |
|---|---|
| Difficulty | Intermediate |
| Time Estimate | Weekend (8-16 hours) |
| Language | Python (OpenCV/NumPy) |
| Prerequisites | Loops, 2D arrays, basic image concepts |
| Primary Book | "Deep Learning with Python" by Francois Chollet, Ch. 5 |
| Knowledge Area | Computer Vision / Signal Processing |
Learning Objectives
After completing this project, you will be able to:
- Understand convolution mathematically - Explain exactly what happens when a kernel slides over an image
- Implement convolution from scratch - Write the nested loops that compute the sliding window operation
- Design kernels for specific effects - Know which 3x3 matrices detect edges, blur, or sharpen
- Handle image boundaries - Implement padding strategies (zero, reflect, wrap)
- Control output dimensions - Calculate output size based on stride and padding
- Process color images - Apply convolution to RGB images with 3 channels
- Connect to CNNs - Understand that neural networks LEARN these kernels automatically
The Core Question You're Answering
"How does a computer 'see' shapes?"
A digital image is just a grid of numbers - pixel values from 0 (black) to 255 (white). The computer has no concept of "edge," "corner," or "face." It just sees numbers.
The insight: Shapes are patterns in how numbers change. An edge is a sudden jump from low to high (or high to low). A corner is where two edges meet. A texture is a repeated pattern of changes.
Convolution is the mathematical operation that detects these changes. A 3x3 kernel acts as a "pattern detector" - it looks at a small window of pixels and produces a single number indicating how well that pattern matches.
Before coding, internalize this: A CNN doesn't "see" images - it detects patterns of numerical change, layer by layer, until abstract patterns like "cat" or "dog" emerge from primitive patterns like "horizontal edge" or "curve."
Concepts You Must Understand First
Stop and research these before coding:
1. Images as 2D/3D Arrays of Numbers
An image is a matrix (or tensor for color):
- Grayscale: 2D array, shape (height, width), values 0-255
- Color (RGB): 3D array, shape (height, width, 3), each channel is a 2D array
Grayscale image (5x5):        Color image: each pixel is 3 numbers (R, G, B):
+----------------------+      +------------------+
|  50  80 100 120  90  |      |   R    G    B    |
|  60  90 140 180 110  |      |  255   0    0    |  pure red
|  70 100 200 220 130  |      |   0   255   0    |  pure green
|  80 110 180 200 120  |      |   0    0   255   |  pure blue
|  70  90 120 140 100  |      +------------------+
+----------------------+
Book Reference: "Computer Vision: Algorithms and Applications" Ch. 2 - Richard Szeliski
2. The Convolution Operation Mathematically
Convolution is a โsliding windowโ operation. For each position in the image:
- Place the kernel (3x3 matrix) over a region of the image
- Multiply each kernel value by the corresponding pixel value
- Sum all products
- This sum becomes one pixel in the output
Mathematical definition for a kernel K and image I:
                k-1   k-1
Output(x, y) =  Sum   Sum   I(x+i, y+j) * K(i, j)
                i=0   j=0

where k is the kernel size (usually 3)
Book Reference: "Digital Image Processing" Ch. 3 - Gonzalez & Woods
3. Why Convolution Detects Features
The magic is in the kernel values. Consider this kernel:
Edge Detection (Laplacian):
+--------------+
| -1  -1  -1   |
| -1   8  -1   |
| -1  -1  -1   |
+--------------+
When applied to a uniform region (all same color), the result is 0:
- Center contributes: 8 * pixel_value
- Neighbors contribute: 8 * (-1) * pixel_value = -8 * pixel_value
- Total: 0
When applied to an edge (center bright, neighbors dark):
- Center contributes: 8 * high_value
- Neighbors contribute: -1 * low_values (smaller magnitude)
- Total: large positive number!
Key insight: The kernel "lights up" only where the pattern it encodes exists.
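You can verify this arithmetic directly before writing any real code; a minimal sketch:

import numpy as np

laplacian = np.array([[-1, -1, -1],
                      [-1,  8, -1],
                      [-1, -1, -1]], dtype=np.float32)

uniform_patch = np.full((3, 3), 100.0, dtype=np.float32)  # flat region
edge_patch = np.full((3, 3), 10.0, dtype=np.float32)      # dark surround...
edge_patch[1, 1] = 200.0                                  # ...bright center

print((uniform_patch * laplacian).sum())  # 0.0    -> no pattern, no response
print((edge_patch * laplacian).sum())     # 1520.0 -> strong response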
4. Common Kernels and What They Detect
Identity (no change):
+--------------+
|  0   0   0   |
|  0   1   0   |   Output = Input
|  0   0   0   |
+--------------+

Box Blur (average):
+------------------+
| 1/9  1/9  1/9    |
| 1/9  1/9  1/9    |   Averages the 3x3 neighborhood
| 1/9  1/9  1/9    |
+------------------+

Gaussian Blur (weighted average):
+---------------------+
| 1/16  2/16  1/16    |
| 2/16  4/16  2/16    |   Center weighted more
| 1/16  2/16  1/16    |
+---------------------+

Sharpen:
+--------------+
|  0  -1   0   |
| -1   5  -1   |   Emphasizes center vs neighbors
|  0  -1   0   |
+--------------+

Sobel (horizontal edge):
+--------------+
| -1  -2  -1   |
|  0   0   0   |   Responds to horizontal edges (vertical gradients)
|  1   2   1   |
+--------------+

Sobel (vertical edge):
+--------------+
| -1   0   1   |
| -2   0   2   |   Responds to vertical edges (horizontal gradients)
| -1   0   1   |
+--------------+

Emboss:
+--------------+
| -2  -1   0   |
| -1   1   1   |   Creates 3D relief effect
|  0   1   2   |
+--------------+
5. Padding Modes (valid, same, full)
When the kernel reaches the image edge, what happens?
No Padding (valid):
- Output is smaller than input
- Output size: (N - K + 1) where N is input size, K is kernel size
- For 5x5 image with 3x3 kernel: output is 3x3
Zero Padding (same):
- Pad input with zeros so output equals input size
- Pad amount: (K - 1) / 2 on each side
- For 3x3 kernel: pad 1 pixel of zeros around entire image
Reflect Padding:
- Mirror pixels at the boundary
- Avoids introducing artificial zeros
Original (5x5):        Zero Padded (7x7):
+-----------+          +---------------+
| A B C D E |          | 0 0 0 0 0 0 0 |
| F G H I J |   -->    | 0 A B C D E 0 |
| K L M N O |          | 0 F G H I J 0 |
| P Q R S T |          | 0 K L M N O 0 |
| U V W X Y |          | 0 P Q R S T 0 |
+-----------+          | 0 U V W X Y 0 |
                       | 0 0 0 0 0 0 0 |
                       +---------------+
6. Strides and Output Size Calculation
Stride = how many pixels to move the kernel between applications
Stride 1: Move 1 pixel at a time (standard)
Stride 2: Skip every other position (downsamples by 2)
Output size formula:
output_size = floor((input_size + 2*padding - kernel_size) / stride) + 1
Example:
Input: 224x224
Kernel: 3x3
Padding: 1 (same padding)
Stride: 2
Output = floor((224 + 2*1 - 3) / 2) + 1 = floor(223 / 2) + 1 = 111 + 1 = 112
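To sanity-check the formula in code (conv_out_size is a hypothetical helper name, not part of the project spec; floor division mirrors the formula above):

def conv_out_size(input_size, kernel_size, padding=0, stride=1):
    """Output size of a convolution along one dimension."""
    return (input_size + 2 * padding - kernel_size) // stride + 1

print(conv_out_size(224, 3, padding=1, stride=2))  # 112
print(conv_out_size(5, 3))                         # 3 ("valid" mode)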
Book Reference: โDeep Learningโ by Goodfellow, Bengio, Courville - Ch. 9
Deep Theoretical Foundation
Signal Processing Origins of Convolution
Convolution was invented for signal processing long before AI. The idea: combine two signals by sliding one over the other.
In 1D (audio signals):
Signal: [ 1 2 3 4 5 6 7 ]
Kernel: [ 0.25 0.5 0.25 ] (simple smoothing)
Slide kernel across signal:
Position 1: 0.25*1 + 0.5*2 + 0.25*3 = 2.0
Position 2: 0.25*2 + 0.5*3 + 0.25*4 = 3.0
...
For images, we extend this to 2D - the kernel slides in both X and Y directions.
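The 1D case fits in a few lines of NumPy; a minimal sketch reproducing the positions above:

import numpy as np

signal = np.array([1, 2, 3, 4, 5, 6, 7], dtype=np.float32)
kernel = np.array([0.25, 0.5, 0.25], dtype=np.float32)

# Slide the 3-tap kernel across the signal ("valid" positions only)
out = [float(signal[i:i+3] @ kernel) for i in range(len(signal) - 2)]
print(out)  # [2.0, 3.0, 4.0, 5.0, 6.0]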
Edge Detection Theory (First Derivative)
In calculus, the derivative measures rate of change. A large derivative means rapid change - an edge!
Pixel values:  10  10  10  200  200  200
                         ^
            Sudden jump = high derivative = EDGE
Approximate derivative with finite differences:
df/dx ≈ f(x+1) - f(x-1)   (central difference; the usual 1/2 factor only rescales the result)
The Sobel operator encodes this derivative:
Sobel X (detects vertical edges):
+--------------+
| -1   0   1   |   This computes: right - left,
| -2   0   2   |   weighted by distance from center
| -1   0   1   |
+--------------+
Large positive output = bright on right, dark on left
Large negative output = dark on right, bright on left
Zero output = no horizontal gradient
The Sobel, Prewitt, and Laplacian Operators
Prewitt (simpler Sobel):
Horizontal:           Vertical:
+--------------+      +--------------+
| -1  -1  -1   |      | -1   0   1   |
|  0   0   0   |      | -1   0   1   |
|  1   1   1   |      | -1   0   1   |
+--------------+      +--------------+
Laplacian (second derivative - detects ALL edges):
+--------------+       +--------------+
|  0  -1   0   |  or   | -1  -1  -1   |
| -1   4  -1   |       | -1   8  -1   |
|  0  -1   0   |       | -1  -1  -1   |
+--------------+       +--------------+
The Laplacian detects edges in all directions because it computes the sum of second derivatives.
Mathematical Definition of 2D Convolution
Formally, discrete 2D convolution of image I with kernel K:
(I * K)[x, y] = Sum_i Sum_j I[x-i, y-j] * K[i, j]
Note: Some definitions flip the kernel (true convolution).
What we implement is technically "cross-correlation."
For symmetric kernels, they're identical.
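If SciPy is available you can see the flip in action; a minimal sketch (SciPy's convolve2d/correlate2d, unrelated to the convolve2d you will write yourself):

import numpy as np
from scipy.signal import convolve2d as sp_convolve2d, correlate2d

img = np.random.rand(5, 5).astype(np.float32)
k = np.array([[0, 1, 0],
              [0, 0, 0],
              [0, 0, 0]], dtype=np.float32)  # deliberately asymmetric

conv = sp_convolve2d(img, k, mode='valid')          # true convolution (flips k)
corr = correlate2d(img, np.flip(k), mode='valid')   # correlation with flipped k
assert np.allclose(conv, corr)                      # identical by definition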
How CNNs LEARN Optimal Kernels
Here's the profound insight: In a CNN, the kernel values are weights that are learned through backpropagation.
Traditional approach (what you're building):
- Human designs kernel: "I want to detect horizontal edges"
- Kernel is fixed: [-1, -2, -1], [0, 0, 0], [1, 2, 1]
CNN approach:
- Initialize kernel with random values
- Show network many images with labels
- Backpropagate error to adjust kernel values
- Network discovers: "To recognize cats, I need THIS edge pattern"
The first layer of a trained CNN often learns kernels that look like Gabor filters (oriented edges at various angles). Deeper layers learn increasingly abstract patterns.
Layer 1 learns:       Layer 2 learns:        Layer 3+ learns:
Edges, gradients      Corners, textures      Object parts
                                             (eyes, ears, wheels)
This is why understanding convolution is essential: CNNs are just networks that learn which convolutions are useful.
Real World Outcome
After completing this project, you'll have a command-line tool that applies convolutions to images:
$ python convolve.py --image face.jpg --kernel "edge_detect"
Applying Kernel:
[[-1, -1, -1],
[-1, 8, -1],
[-1, -1, -1]]
Input image shape: (480, 640, 3)
Output image shape: (480, 640)
Processing time: 0.23 seconds
Saved result to output.jpg
Visual Output Examples
Original Image (face.jpg):
Description: A photograph of a human face with clear features -
eyes, nose, mouth, hair outline visible. Smooth skin tones,
varying lighting across the face.
After Edge Detection (Laplacian):
Description: The output is predominantly dark (black background)
with bright white lines tracing:
- The outline of the face against the background
- The edges of the eyes (eyelids, iris boundaries)
- The nose bridge and nostrils
- The lip boundaries
- Hair strands
All smooth regions (cheeks, forehead) are black because
there's no change = no edge.
After Sharpening:
Description: The face looks "crisper" - edges are more defined,
fine details like individual hairs and skin texture are
more visible. The image has higher local contrast.
After Gaussian Blur:
Description: The face looks "softer" - like looking through
frosted glass. Fine details are smoothed out, the image
appears slightly out of focus. Good for removing noise.
After Emboss:
Description: The face appears as if carved in stone or metal,
with a 3D relief effect. Edges facing one direction are
bright, opposite direction are dark. Gives a sculptural look.
Solution Architecture
System Design
+------------------------------------------------------------------+
|                     CONVOLVE.PY ARCHITECTURE                     |
+------------------------------------------------------------------+
|                                                                  |
|   +----------------+         +----------------+                  |
|   |   CLI Parser   |-------->|  Kernel Loader |                  |
|   |   (argparse)   |         |  (dictionary)  |                  |
|   +----------------+         +-------+--------+                  |
|           |                          |                           |
|           v                          v                           |
|   +----------------+         +----------------+                  |
|   |  Image Loader  |-------->|  Preprocessor  |                  |
|   |  (OpenCV/PIL)  |         |  (to float,    |                  |
|   |  RGB or Gray   |         |   normalize)   |                  |
|   +----------------+         +-------+--------+                  |
|                                      |                           |
|                                      v                           |
|                      +---------------------------+               |
|                      |    CONVOLUTION ENGINE     |               |
|                      |                           |               |
|                      |  +---------------------+  |               |
|                      |  |   Padding Handler   |  |               |
|                      |  | (zero/reflect/wrap) |  |               |
|                      |  +----------+----------+  |               |
|                      |             |             |               |
|                      |             v             |               |
|                      |  +---------------------+  |               |
|                      |  |   Sliding Window    |  |               |
|                      |  |   (nested loops)    |  |               |
|                      |  |                     |  |               |
|                      |  | for y in range(...):|  |               |
|                      |  |   for x in range:   |  |               |
|                      |  |     extract_patch   |  |               |
|                      |  |     element_mult    |  |               |
|                      |  |     sum_to_output   |  |               |
|                      |  +----------+----------+  |               |
|                      |             |             |               |
|                      |             v             |               |
|                      |  +---------------------+  |               |
|                      |  |    Stride Handler   |  |               |
|                      |  |  (output indexing)  |  |               |
|                      |  +---------------------+  |               |
|                      +-------------+-------------+               |
|                                    |                             |
|                                    v                             |
|                      +---------------------------+               |
|                      |      Post-processing      |               |
|                      |  - Clip to valid range    |               |
|                      |  - Convert to uint8       |               |
|                      |  - Handle color merge     |               |
|                      +-------------+-------------+               |
|                                    |                             |
|                                    v                             |
|                      +---------------------------+               |
|                      |          Output           |               |
|                      |  - Save to file           |               |
|                      |  - Display (optional)     |               |
|                      |  - Side-by-side compare   |               |
|                      +---------------------------+               |
|                                                                  |
+------------------------------------------------------------------+
Data Flow
Input Image          Kernel            Output
(H x W x C)          (K x K)           (H' x W')
+--------------+     +------------+    +--------------+
|  50  80 100  |     | -1  -1  -1 |    |              |
|  60  90 140  |  *  | -1   8  -1 |  = |  ??? (you    |
|  70 100 200  |     | -1  -1  -1 |    |   compute!)  |
|  80 110 180  |     +------------+    |              |
|  70  90 120  |                       +--------------+
+--------------+
Step-by-step:
1. Position kernel at top-left
2. Multiply and sum
3. Store in output[0,0]
4. Slide kernel right
5. Repeat until right edge
6. Move down, start from left
7. Repeat until bottom
Module Structure
convolve/
├── convolve.py        # Main CLI entry point
├── kernels.py         # Predefined kernel definitions
├── convolution.py     # Core convolution implementation
├── padding.py         # Padding utilities
├── utils.py           # Image I/O, visualization
└── tests/
    ├── test_convolution.py
    ├── test_kernels.py
    └── test_images/
        ├── checkerboard.png
        ├── gradient.png
        └── simple_edge.png
Phased Implementation Guide
Phase 1: Load Image as NumPy Array (30 minutes)
Goal: Read an image file and understand its structure
Tasks:
- Install dependencies: pip install numpy opencv-python matplotlib
- Load an image using OpenCV or PIL
- Print its shape and data type
- Understand the value range (0-255 for uint8)
Code Structure (hints, not solution):
import cv2
import numpy as np

def load_image(path, grayscale=True):
    """
    Load image from file.

    Args:
        path: Path to image file
        grayscale: If True, convert to grayscale

    Returns:
        numpy array of shape (H, W) for gray or (H, W, 3) for color
    """
    # Use cv2.imread with appropriate flags
    # cv2.IMREAD_GRAYSCALE or cv2.IMREAD_COLOR
    # Return as float32 for easier math
    pass
Verification:
img = load_image("test.jpg", grayscale=True)
print(f"Shape: {img.shape}") # Should be (height, width)
print(f"Dtype: {img.dtype}") # Should be float32
print(f"Range: {img.min()} to {img.max()}") # 0.0 to 255.0
Phase 2: Define Kernel Dictionaries (30 minutes)
Goal: Create a library of predefined kernels
Tasks:
- Create a dictionary mapping names to 3x3 NumPy arrays
- Include: identity, blur, sharpen, edge_detect, sobel_x, sobel_y, emboss
Code Structure:
KERNELS = {
    "identity": np.array([
        [0, 0, 0],
        [0, 1, 0],
        [0, 0, 0]
    ], dtype=np.float32),

    "box_blur": np.array([
        # Fill in: 1/9 for all 9 elements
    ], dtype=np.float32),

    "edge_detect": np.array([
        # Fill in: Laplacian kernel
    ], dtype=np.float32),

    # Add more...
}

def get_kernel(name):
    """Return kernel by name, or parse custom [[...]] format."""
    if name in KERNELS:
        return KERNELS[name]
    # Optional: parse custom kernels from string
    pass
Verification:
kernel = get_kernel("box_blur")
print(kernel.sum()) # Should be ~1.0 for blur
print(kernel.shape) # Should be (3, 3)
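For the optional custom-kernel string format, one safe approach is ast.literal_eval; a minimal sketch under that assumption (parse_custom_kernel is a hypothetical helper name):

import ast
import numpy as np

def parse_custom_kernel(text):
    """Parse a string like "[[0,-1,0],[-1,5,-1],[0,-1,0]]" into a kernel."""
    values = ast.literal_eval(text)  # safe: only parses Python literals
    kernel = np.array(values, dtype=np.float32)
    if kernel.ndim != 2 or kernel.shape[0] % 2 == 0:
        raise ValueError(f"Expected an odd-sized 2D kernel, got {kernel.shape}")
    return kernel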
Phase 3: Implement Naive Convolution (Nested Loops) (1-2 hours)
Goal: Write the basic sliding window algorithm
This is the core of the project. You must implement this yourself!
The Algorithm:
For each output pixel (out_y, out_x):
1. Find the corresponding patch in the input
2. The patch is kernel_size x kernel_size
3. Multiply patch elementwise with kernel
4. Sum all products
5. Store in output[out_y, out_x]
ASCII Visualization of Sliding Window:
Input Image (5x5):           Kernel (3x3):
+-----------+                +---------------+
| a b c d e |                | k00  k01  k02 |
| f g h i j |                | k10  k11  k12 |
| k l m n o |                | k20  k21  k22 |
| p q r s t |                +---------------+
| u v w x y |
+-----------+

Position 1 (top-left):       Position 2 (shifted right):
+-------------+              +-------------+
| [a b c] d e |              | a [b c d] e |
| [f g h] i j |              | f [g h i] j |
| [k l m] n o |              | k [l m n] o |
|  p q r s t  |              |  p q r s t  |
|  u v w x y  |              |  u v w x y  |
+-------------+              +-------------+

Output[0,0] = a*k00 + b*k01 + c*k02      Output[0,1] = b*k00 + c*k01 + d*k02
            + f*k10 + g*k11 + h*k12                  + g*k10 + h*k11 + i*k12
            + k*k20 + l*k21 + m*k22                  + l*k20 + m*k21 + n*k22
Continue sliding right, then down...
Code Structure:
def convolve2d_naive(image, kernel):
    """
    Apply 2D convolution using nested loops.

    Args:
        image: 2D numpy array (H, W)
        kernel: 2D numpy array (K, K), must be odd-sized

    Returns:
        2D numpy array, output of convolution
    """
    img_h, img_w = image.shape
    k_h, k_w = kernel.shape

    # Calculate output dimensions (no padding = "valid")
    out_h = img_h - k_h + 1
    out_w = img_w - k_w + 1

    # Initialize output array
    output = np.zeros((out_h, out_w), dtype=np.float32)

    # Half kernel size for indexing
    half_k = k_h // 2

    # Nested loops - YOU IMPLEMENT THIS
    for out_y in range(out_h):
        for out_x in range(out_w):
            # Extract the patch from image
            # Multiply with kernel
            # Sum and store
            pass

    return output
Verification:
# Test with identity kernel - output should equal center of input
img = np.array([[1, 2, 3],
                [4, 5, 6],
                [7, 8, 9]], dtype=np.float32)
kernel = KERNELS["identity"]
result = convolve2d_naive(img, kernel)
assert result[0, 0] == 5  # Center of 3x3 input

# Test with known values
img = np.ones((5, 5), dtype=np.float32)
blur = KERNELS["box_blur"]
result = convolve2d_naive(img, blur)
assert np.allclose(result, 1.0)  # Blurring uniform image = same
Phase 4: Add Padding Support (1 hour)
Goal: Allow output to be same size as input
Tasks:
- Implement pad_image(image, pad_size, mode='zero')
- Support modes: 'zero', 'reflect', 'replicate'
- Modify convolve2d to accept a padding parameter
Padding Visualization:
Original (3x3):      Zero Padded (5x5):
+---------+          +-------------+
|  1 2 3  |          |  0 0 0 0 0  |
|  4 5 6  |   -->    |  0 1 2 3 0  |
|  7 8 9  |          |  0 4 5 6 0  |
+---------+          |  0 7 8 9 0  |
                     |  0 0 0 0 0  |
                     +-------------+

Reflect Padded (5x5):
+-------------+
|  5 4 5 6 5  |   (reflected at borders)
|  2 1 2 3 2  |
|  5 4 5 6 5  |
|  8 7 8 9 8  |
|  5 4 5 6 5  |
+-------------+
Code Structure:
def pad_image(image, pad_size, mode='zero'):
    """
    Pad image borders.

    Args:
        image: 2D numpy array
        pad_size: Number of pixels to pad on each side
        mode: 'zero', 'reflect', or 'replicate'

    Returns:
        Padded image
    """
    if mode == 'zero':
        return np.pad(image, pad_size, mode='constant', constant_values=0)
    elif mode == 'reflect':
        return np.pad(image, pad_size, mode='reflect')
    elif mode == 'replicate':
        return np.pad(image, pad_size, mode='edge')
    else:
        raise ValueError(f"Unknown padding mode: {mode}")
Verification:
img = np.array([[1, 2], [3, 4]], dtype=np.float32)
padded = pad_image(img, 1, 'zero')
assert padded.shape == (4, 4)
assert padded[0, 0] == 0 # Corner should be zero
assert padded[1, 1] == 1 # Original data preserved
Phase 5: Add Stride Support (45 minutes)
Goal: Skip pixels for faster computation/downsampling
Stride Visualization:
Stride = 1 (standard):        Stride = 2 (downsample):
Every position computed       Skip every other position
+----------------+            +----------------+
| [X][X][X][X]   |            | [X] .  [X] .   |
| [X][X][X][X]   |            |  .  .   .  .   |
| [X][X][X][X]   |            | [X] .  [X] .   |
| [X][X][X][X]   |            |  .  .   .  .   |
+----------------+            +----------------+
Output: 4x4                   Output: 2x2
Code Modification:
def convolve2d(image, kernel, stride=1, padding=0, pad_mode='zero'):
    """
    Full convolution with stride and padding support.
    """
    # Pad image if needed
    if padding > 0:
        image = pad_image(image, padding, pad_mode)

    img_h, img_w = image.shape
    k_h, k_w = kernel.shape

    # Calculate output dimensions with stride
    out_h = ((img_h - k_h) // stride) + 1
    out_w = ((img_w - k_w) // stride) + 1

    output = np.zeros((out_h, out_w), dtype=np.float32)

    for out_y in range(out_h):
        for out_x in range(out_w):
            # Input position accounts for stride
            in_y = out_y * stride
            in_x = out_x * stride
            # Rest of convolution...
            pass

    return output
Verification:
img = np.ones((6, 6), dtype=np.float32)
kernel = KERNELS["identity"]
result = convolve2d(img, kernel, stride=2, padding=1)
assert result.shape == (3, 3) # Downsampled by 2
Phase 6: Handle Color Images (3 Channels) (45 minutes)
Goal: Apply convolution to RGB images
Strategy: Apply the same kernel to each channel independently, then recombine
Color Image (H, W, 3):
+-------------+
|  R channel  | --> convolve --> R'
+-------------+
|  G channel  | --> convolve --> G'  --> Stack --> Output (H', W', 3)
+-------------+
|  B channel  | --> convolve --> B'
+-------------+
Code Structure:
def convolve2d_color(image, kernel, **kwargs):
    """
    Apply convolution to color image.

    Args:
        image: 3D numpy array (H, W, C)
        kernel: 2D kernel to apply to each channel
        **kwargs: stride, padding, pad_mode

    Returns:
        3D numpy array (H', W', C)
    """
    if image.ndim == 2:
        # Grayscale
        return convolve2d(image, kernel, **kwargs)

    # Process each channel
    channels = []
    for c in range(image.shape[2]):
        channel_result = convolve2d(image[:, :, c], kernel, **kwargs)
        channels.append(channel_result)

    return np.stack(channels, axis=2)
Verification:
img = np.random.rand(100, 100, 3).astype(np.float32) * 255
kernel = KERNELS["box_blur"]
result = convolve2d_color(img, kernel, padding=1)
assert result.shape == img.shape # Same dimensions
assert result.shape[2] == 3 # Still 3 channels
Phase 7: Visualization and CLI (1 hour)
Goal: Create user-friendly interface and visual comparison
Tasks:
- Create side-by-side before/after display
- Parse command-line arguments
- Handle image saving
- Add progress indicator for large images
Code Structure:
import argparse
import matplotlib.pyplot as plt

def visualize_result(original, processed, kernel_name):
    """Show original and processed side by side."""
    fig, axes = plt.subplots(1, 2, figsize=(12, 6))
    axes[0].imshow(original, cmap='gray' if original.ndim == 2 else None)
    axes[0].set_title('Original')
    axes[0].axis('off')
    axes[1].imshow(processed, cmap='gray' if processed.ndim == 2 else None)
    axes[1].set_title(f'After {kernel_name}')
    axes[1].axis('off')
    plt.tight_layout()
    plt.show()

def main():
    parser = argparse.ArgumentParser(description='Image Convolution Tool')
    parser.add_argument('--image', required=True, help='Input image path')
    parser.add_argument('--kernel', required=True, help='Kernel name or custom [[...]]')
    parser.add_argument('--output', default='output.jpg', help='Output path')
    parser.add_argument('--stride', type=int, default=1)
    parser.add_argument('--padding', type=int, default=0)
    parser.add_argument('--show', action='store_true', help='Display result')
    args = parser.parse_args()

    # Load, process, save
    # You implement this!

if __name__ == '__main__':
    main()
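One possible shape for the load-process-save glue; a minimal sketch, assuming the helpers from earlier phases (load_image, get_kernel, convolve2d_color) and RGB float images:

# inside main(), after parsing args
image = load_image(args.image, grayscale=False)
kernel = get_kernel(args.kernel)
result = convolve2d_color(image, kernel, stride=args.stride, padding=args.padding)
result = np.clip(result, 0, 255).astype(np.uint8)
cv2.imwrite(args.output, cv2.cvtColor(result, cv2.COLOR_RGB2BGR))  # OpenCV saves BGR
if args.show:
    visualize_result(image.astype(np.uint8), result, args.kernel)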
Questions to Guide Your Design
Before implementing, think through these:
Algorithm Design
- Loop Order: Should you loop over output positions or input positions? Why?
- Index Calculation: Given output position (y, x), what input pixels contribute?
- Edge Cases: What happens at corners? What about even-sized kernels?
Efficiency Considerations
- Memory Layout: Is row-major or column-major access faster? Why?
- Vectorization: How could you use NumPy operations instead of loops?
- Separable Kernels: If a kernel is separable (k = v * h^T), how does this help?
Numerical Stability
- Data Types: Why use float32 instead of uint8 during computation?
- Clipping: After convolution, values may exceed 0-255. How do you handle this?
- Normalization: Should the kernel sum to 1? When does it matter?
Design Decisions
- Kernel Flipping: True convolution flips the kernel. Cross-correlation doesn't. Which are you implementing?
- Boundary Handling: When would you prefer zero padding vs. reflection?
- Color vs. Grayscale: For edge detection, would you convolve color or convert to gray first?
Thinking Exercise
Manual Convolution Practice
Before coding, apply this 3x3 kernel to this 5x5 image BY HAND:
Input Image:
+----------------------+
|   0   0    0   0   0 |
|   0   0    0   0   0 |
|   0   0  255   0   0 |
|   0   0    0   0   0 |
|   0   0    0   0   0 |
+----------------------+
(A single white pixel in the center)
Kernel (Laplacian edge detector):
+--------------+
| -1  -1  -1   |
| -1   8  -1   |
| -1  -1  -1   |
+--------------+
Calculate the output (3x3, no padding):
Position (0,0): top-left 3x3 patch

  Patch:           Kernel:           Products:
  0   0    0       -1  -1  -1         0   0    0
  0   0    0   x   -1   8  -1    =    0   0    0      Sum = -255
  0   0  255       -1  -1  -1         0   0  -255

Output[0,0] = -255

Position (0,1): patch shifted one step right

  Patch:           Products:
  0    0   0        0     0   0
  0    0   0        0     0   0       Sum = -255
  0  255   0        0  -255   0

Output[0,1] = -255

Position (1,1): centered on the white pixel

  Patch:           Products:
  0    0   0        0     0   0
  0  255   0        0  2040   0       Sum = 2040
  0    0   0        0     0   0

Output[1,1] = 8 * 255 = 2040
Complete the 3x3 output grid:
+---------------------+
|  -255  -255  -255   |
|  -255  2040  -255   |
|  -255  -255  -255   |
+---------------------+
Insight: The edge detector "amplifies" the single pixel into a pattern showing high contrast between center and surroundings!
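To double-check the hand calculation by machine, a quick sketch assuming SciPy is available (correlate2d matches the sliding-window math above):

import numpy as np
from scipy.signal import correlate2d

img = np.zeros((5, 5), dtype=np.float32)
img[2, 2] = 255
laplacian = np.array([[-1, -1, -1],
                      [-1,  8, -1],
                      [-1, -1, -1]], dtype=np.float32)

print(correlate2d(img, laplacian, mode='valid'))
# [[-255. -255. -255.]
#  [-255. 2040. -255.]
#  [-255. -255. -255.]]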
Testing Strategy
Unit Tests for Core Functions
def test_identity_kernel():
    """Identity kernel should preserve center values."""
    img = np.array([[1, 2, 3],
                    [4, 5, 6],
                    [7, 8, 9]], dtype=np.float32)
    kernel = np.array([[0, 0, 0],
                       [0, 1, 0],
                       [0, 0, 0]], dtype=np.float32)
    result = convolve2d_naive(img, kernel)
    assert result.shape == (1, 1)
    assert result[0, 0] == 5

def test_box_blur_uniform():
    """Blurring uniform image should return same values."""
    img = np.full((5, 5), 100.0, dtype=np.float32)
    kernel = np.full((3, 3), 1/9, dtype=np.float32)
    result = convolve2d_naive(img, kernel)
    assert np.allclose(result, 100.0)

def test_output_dimensions():
    """Test output size calculation."""
    img = np.zeros((10, 10), dtype=np.float32)
    kernel = np.zeros((3, 3), dtype=np.float32)

    # No padding
    result = convolve2d(img, kernel, padding=0)
    assert result.shape == (8, 8)

    # Same padding
    result = convolve2d(img, kernel, padding=1)
    assert result.shape == (10, 10)

    # Stride 2
    result = convolve2d(img, kernel, stride=2, padding=1)
    assert result.shape == (5, 5)
def test_edge_detection_on_gradient():
    """Horizontal-edge Sobel kernel should respond at the transition."""
    # Top two rows white, rest black: a horizontal edge between rows 1 and 2
    img = np.zeros((5, 5), dtype=np.float32)
    img[0:2, :] = 255
    # Sobel kernel for horizontal edges (vertical gradient)
    kernel = np.array([[-1, -2, -1],
                       [ 0,  0,  0],
                       [ 1,  2,  1]], dtype=np.float32)
    result = convolve2d_naive(img, kernel)
    # Patches for output rows 0 and 1 straddle the edge; row 2 sees only black.
    # The bright-to-dark transition gives a negative response, so compare
    # absolute values.
    assert np.abs(result[0, :]).mean() > np.abs(result[2, :]).mean()
    assert np.abs(result[1, :]).mean() > np.abs(result[2, :]).mean()
Visual Tests
- Checkerboard Pattern: Apply identity kernel - output should look identical
- Uniform Image: Apply any kernel - output should be uniform (for valid kernels)
- Natural Image: Apply Sobel - should highlight edges
- Blur: Apply Gaussian blur - image should look softer
Edge Cases
def test_single_pixel_image():
    """Handle 1x1 images gracefully."""
    img = np.array([[100.0]])
    kernel = np.array([[0, 0, 0],
                       [0, 1, 0],
                       [0, 0, 0]], dtype=np.float32)
    # With same padding, should work
    result = convolve2d(img, kernel, padding=1)
    assert result.shape == (1, 1)

def test_large_kernel():
    """Handle kernels larger than image."""
    img = np.ones((3, 3), dtype=np.float32)
    kernel = np.ones((5, 5), dtype=np.float32) / 25
    # Should return empty or raise informative error
    try:
        result = convolve2d(img, kernel, padding=0)
        assert result.size == 0 or result.shape[0] == 0
    except ValueError as e:
        assert "kernel larger than image" in str(e).lower()
Common Pitfalls and Debugging Tips
Pitfall 1: Off-by-One Errors
Symptom: Output has black borders or wrong dimensions
Cause: Incorrect loop bounds or index calculations
Fix: Double-check the output dimension formula:
out_h = ((img_h + 2*padding - kernel_h) // stride) + 1
Pitfall 2: Axis Confusion
Symptom: Image appears rotated or flipped
Cause: Mixing up x/y, row/column, height/width conventions
Fix: Be consistent. NumPy uses [row, col] = [y, x]. Document your convention.
Pitfall 3: Integer Division Issues
Symptom: Blur kernel produces very dark output
Cause: Using integer division when creating kernel
# WRONG:
kernel = np.array([[1/9, 1/9, 1/9], ...]) # 1/9 = 0 in Python 2!
# RIGHT:
kernel = np.array([[1/9, 1/9, 1/9], ...], dtype=np.float32)
Pitfall 4: Clipping Issues
Symptom: Edge detection output looks washed out or wrong
Cause: Not handling negative values or overflow
Fix:
# After convolution, values may be outside [0, 255]
# Option 1: Clip
output = np.clip(output, 0, 255)
# Option 2: Normalize
output = (output - output.min()) / (output.max() - output.min()) * 255
# Option 3: Absolute value (for edge detection)
output = np.abs(output)
Pitfall 5: Slow Performance
Symptom: Processing takes minutes for large images
Cause: Python loops are slow
Fix:
- First make it work correctly with loops
- Then optimize with NumPy vectorization (a sketch follows below) or use scipy.ndimage.convolve
- For learning, slow is fine - you're understanding the algorithm!
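For later reference, a vectorized sketch assuming NumPy 1.20+ (for sliding_window_view); it computes the same "valid" result as the nested loops:

import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def convolve2d_vectorized(image, kernel):
    """'Valid' sliding-window correlation without Python-level loops."""
    # windows has shape (out_h, out_w, k_h, k_w): every patch at once
    windows = sliding_window_view(image, kernel.shape)
    return np.einsum('ijkl,kl->ij', windows, kernel)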
Debugging Technique: Print Intermediate Values
def convolve2d_debug(image, kernel, verbose=True):
    """Convolution ("valid" mode) with debug output for the first positions."""
    k_h, k_w = kernel.shape
    out_h = image.shape[0] - k_h + 1
    out_w = image.shape[1] - k_w + 1
    output = np.zeros((out_h, out_w), dtype=np.float32)
    for out_y in range(out_h):
        for out_x in range(out_w):
            patch = image[out_y:out_y+k_h, out_x:out_x+k_w]
            product = patch * kernel
            value = product.sum()
            if verbose and out_y < 2 and out_x < 2:
                print(f"Position ({out_y}, {out_x}):")
                print(f"  Patch:\n{patch}")
                print(f"  Kernel:\n{kernel}")
                print(f"  Product:\n{product}")
                print(f"  Sum: {value}")
            output[out_y, out_x] = value
    return output
Interview Questions
Prepare to answer these:
Basic Understanding
- "Explain convolution in image processing."
- Key points: Sliding window, element-wise multiplication, sum to single output, kernel defines what pattern to detect
- "Why do we use 3x3 kernels instead of larger ones?"
- Key points: Smaller = faster, stacking small kernels achieves a large receptive field, 3x3 captures local patterns well
- "What's the difference between convolution and correlation?"
- Key points: Convolution flips the kernel, correlation doesn't. For symmetric kernels they're identical. Deep learning typically uses correlation.
Implementation Details
- "How do you handle image borders during convolution?"
- Key points: Padding strategies - zero padding adds artificial edges, reflect padding is more natural, valid mode shrinks output
- "Why does stride affect output dimensions?"
- Key points: Stride skips positions, fewer outputs = smaller spatial dimensions, stride 2 halves size
- "How would you optimize a naive convolution implementation?"
- Key points: im2col transformation, vectorization, separable kernels, FFT-based convolution for large kernels
CNN Connection
- "How does convolution relate to CNNs?"
- Key points: CNN learns kernel values via backpropagation, first layers learn edges, deeper layers learn abstract features
- "What is a feature map?"
- Key points: Output of one convolution, each kernel produces one feature map, multiple kernels = multiple feature maps
- "Why are CNNs translation invariant?"
- Key points: Same kernel applied everywhere, a pattern is detected regardless of position, pooling further improves invariance
Practical Application
- "When would you use Sobel vs. Laplacian for edge detection?"
- Key points: Sobel detects directional edges (horizontal/vertical), Laplacian detects all edges, Sobel gives edge direction info
Hints in Layers
If you're stuck, read these one at a time:
Hint 1: Loop Structure The outer loops iterate over output positions. For each output position, you need to know which input pixels contribute. With no padding, output position (y, x) gets input from image[y:y+k_h, x:x+k_w].
Hint 2: NumPy Slicing
Extract a patch with: patch = image[y:y+k_size, x:x+k_size]. This gives you a k_size x k_size array.
Hint 3: Element-wise Operations
Once you have the patch, computing the output is simple: value = (patch * kernel).sum(). NumPy handles element-wise multiplication.
Hint 4: Padding for Same Output Size
To make output same size as input with 3x3 kernel, pad by 1 on each side. Use np.pad(image, 1, mode='constant') for zero padding.
Hint 5: Stride Implementation With stride s, the input position for output (oy, ox) is (oy * s, ox * s). Update your output dimension calculation accordingly.
Hint 6: Color Images
Process each channel independently. Split image into R, G, B, convolve each, then stack with np.stack([r_out, g_out, b_out], axis=2).
Extensions and Challenges
Extension 1: Implement Gaussian Blur
The Gaussian kernel isn't just uniform weights - it follows a bell curve:
def create_gaussian_kernel(size, sigma):
"""Create a Gaussian blur kernel."""
ax = np.linspace(-(size // 2), size // 2, size)
xx, yy = np.meshgrid(ax, ax)
kernel = np.exp(-0.5 * (xx**2 + yy**2) / sigma**2)
return kernel / kernel.sum() # Normalize
# Try different sigma values
kernel_3x3 = create_gaussian_kernel(3, sigma=1.0)
kernel_5x5 = create_gaussian_kernel(5, sigma=1.5)
Extension 2: Separable Kernels for Efficiency
Some kernels can be decomposed into vertical * horizontal:
Box blur 3x3:            Can be separated:
+-----------------+      +-------+     +-----------------+
| 1/9  1/9  1/9   |      |  1/3  |     |  1/3  1/3  1/3  |
| 1/9  1/9  1/9   |  =   |  1/3  |  x  +-----------------+
| 1/9  1/9  1/9   |      |  1/3  |
+-----------------+      +-------+

Instead of 9 multiplications per pixel, you do 3 + 3 = 6!
Implement separable convolution:
def convolve_separable(image, v_kernel, h_kernel):
    """Apply vertical then horizontal 1D convolutions."""
    # First convolve horizontally
    temp = convolve1d_horizontal(image, h_kernel)
    # Then convolve vertically
    result = convolve1d_vertical(temp, v_kernel)
    return result
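The 1-D helpers are the part the skeleton leaves open; a minimal sketch, assuming "valid" mode and 1-D NumPy kernels:

import numpy as np

def convolve1d_horizontal(image, h_kernel):
    k = len(h_kernel)
    out = np.zeros((image.shape[0], image.shape[1] - k + 1), dtype=np.float32)
    for x in range(out.shape[1]):
        out[:, x] = image[:, x:x+k] @ h_kernel  # dot along each row window
    return out

def convolve1d_vertical(image, v_kernel):
    k = len(v_kernel)
    out = np.zeros((image.shape[0] - k + 1, image.shape[1]), dtype=np.float32)
    for y in range(out.shape[0]):
        out[y, :] = v_kernel @ image[y:y+k, :]  # dot along each column window
    return out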
Extension 3: Real-Time Webcam Edge Detector
import cv2

def realtime_edge_detection():
    cap = cv2.VideoCapture(0)
    kernel = np.array([[-1, -1, -1],
                       [-1,  8, -1],
                       [-1, -1, -1]], dtype=np.float32)
    while True:
        ret, frame = cap.read()
        if not ret:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY).astype(np.float32)
        # Note: the naive Python-loop convolve2d is slow at webcam resolution -
        # downscale the frame or swap in a vectorized version for real time
        edges = convolve2d(gray, kernel, padding=1)
        edges = np.clip(np.abs(edges), 0, 255).astype(np.uint8)
        cv2.imshow('Edges', edges)
        if cv2.waitKey(1) & 0xFF == ord('q'):
            break
    cap.release()
    cv2.destroyAllWindows()
Extension 4: Compare with OpenCV
Verify your implementation matches OpenCV:
import cv2

# Your implementation (zero padding)
my_result = convolve2d(image, kernel, padding=1)

# OpenCV's implementation - use a float image and match the border mode
# (filter2D defaults to reflected borders, which would differ at the edges;
# BORDER_CONSTANT pads with zeros, like padding=1 above)
cv_result = cv2.filter2D(image.astype(np.float32), -1, kernel,
                         borderType=cv2.BORDER_CONSTANT)

# Should match (within floating point tolerance)
assert np.allclose(my_result, cv_result, rtol=1e-5)
Extension 5: Visualize What CNNs Learn
Load a pretrained CNN and visualize its first-layer kernels:
import torch
import torchvision.models as models
import matplotlib.pyplot as plt

# Load pretrained model (newer torchvision versions prefer
# weights=models.ResNet18_Weights.DEFAULT over pretrained=True)
model = models.resnet18(pretrained=True)

# Get first convolutional layer weights
first_conv = model.conv1.weight.data.numpy()
# Shape: (64, 3, 7, 7) - 64 kernels, 3 input channels, 7x7 size

# Visualize the kernels
fig, axes = plt.subplots(8, 8, figsize=(12, 12))
for i, ax in enumerate(axes.flat):
    # Take first channel of each kernel
    kernel = first_conv[i, 0]
    ax.imshow(kernel, cmap='gray')
    ax.axis('off')
plt.suptitle('First Layer CNN Kernels (learned from data!)')
plt.show()
Real-World Connections
Instagram Filters
Those "vintage" and "artistic" filters? Many are just convolutions:
- Vignette: Not convolution, but demonstrates per-pixel operations
- Sharpen: The sharpen kernel you implemented
- Blur/Soft Focus: Gaussian blur kernels
- Emboss: Creates that โstampedโ look
Photoshop
- Unsharp Mask: Sharpen by subtracting blurred version
- Find Edges: Sobel or Laplacian filters
- Gaussian Blur: Exactly what you built
- High Pass Filter: Original minus low-pass (blurred)
Medical Imaging
- X-ray Enhancement: Edge enhancement to see bone details
- MRI Processing: Noise reduction with smoothing kernels
- Tumor Detection: CNN-learned kernels find abnormalities
Computer Vision Systems
- Self-Driving Cars: Edge detection for lane finding
- Face Recognition: CNN first layers detect facial features
- OCR: Edge detection helps isolate characters
How Production Systems Differ
Your implementation uses Python loops - educational but slow. Production systems:
- GPU Acceleration: cuDNN library runs convolutions on GPU
- SIMD Instructions: CPU uses vectorized operations
- Memory Optimization: im2col transformation trades memory for speed (sketched after this list)
- Fused Operations: Combine convolution + activation in one kernel
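To make the im2col idea concrete, a toy sketch (assuming NumPy 1.20+ for sliding_window_view; real libraries do this in optimized C/CUDA):

import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def im2col_conv(image, kernel):
    """Toy im2col: unroll every patch into a row, then one matrix product."""
    k_h, k_w = kernel.shape
    patches = sliding_window_view(image, (k_h, k_w))   # (out_h, out_w, k_h, k_w)
    out_h, out_w = patches.shape[:2]
    cols = patches.reshape(out_h * out_w, k_h * k_w)   # one flattened patch per row
    return (cols @ kernel.reshape(-1)).reshape(out_h, out_w)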
Books That Will Help
| Topic | Book | Chapter/Section |
|---|---|---|
| CNN Foundations | Deep Learning with Python by Francois Chollet | Ch. 5 (Complete introduction) |
| Mathematical Theory | Deep Learning by Goodfellow, Bengio, Courville | Ch. 9 (Convolutional Networks) |
| Image Processing | Digital Image Processing by Gonzalez & Woods | Ch. 3 (Spatial Filtering) |
| Computer Vision | Computer Vision: Algorithms and Applications by Szeliski | Ch. 3 (Image Processing) |
| Signal Processing | Signals and Systems by Oppenheim | Ch. 3-4 (Convolution) |
| Practical Vision | Programming Computer Vision with Python by Solem | Ch. 1-2 (Basic operations) |
Self-Assessment Checklist
Conceptual Understanding
- I can explain convolution without looking at notes
- I can design a kernel for a specific effect (blur, sharpen, edge)
- I understand why padding and stride affect output dimensions
- I can calculate output size given input, kernel, padding, and stride
- I know why CNNs LEARN kernels instead of using hand-crafted ones
Implementation Skills
- My naive convolution produces correct output for known inputs
- I can handle both grayscale and color images
- I implemented padding correctly (zero, reflect)
- I implemented stride support
- My code handles edge cases gracefully
Practical Application
- I applied edge detection to a real image and understood the output
- I experimented with different kernels and saw their effects
- I can debug convolution issues by inspecting intermediate values
- I verified my implementation against OpenCV
Teaching Test
Can you explain to someone else:
- Why does an edge detection kernel have negative values?
- Whatโs the difference between blur and sharpen kernels?
- Why do we use "same" padding in neural networks?
- How does a CNN โseeโ a cat?
Moving Forward
After completing this project:
- Next Project: P08: MNIST From First Principles - Apply convolution thinking to digit recognition
- Then: P09: CNN From Scratch - Build a full CNN with learned kernels
- Deep Dive: Implement the backward pass for convolution (gradient computation)
The key insight to carry forward: Convolution is pattern matching. A kernel says "I'm looking for THIS pattern," and the output says "here's how strongly that pattern appears at each location." CNNs learn which patterns matter for the task at hand.
When you look at a CNN architecture like ResNet or VGG, you now understand the fundamental operation: layers upon layers of learned pattern detectors, from simple edges to complex objects.
This project bridges classical computer vision (hand-designed kernels) with modern deep learning (learned kernels). Understanding both gives you the intuition to design, debug, and optimize neural networks for vision tasks.