Project 19: Neural Network from First Principles

A multi-layer neural network that learns to classify handwritten digits (MNIST). Implement forward pass, backpropagation, and training loop from scratch—no TensorFlow, no PyTorch, just NumPy.

Quick Reference

| Attribute | Value |
| --- | --- |
| Difficulty | Level 4: Expert (The Systems Architect) |
| Main Programming Language | Python |
| Alternative Programming Languages | C, Julia, Rust |
| Coolness Level | Level 5: Pure Magic (Super Cool) |
| Business Potential | 1. The “Resume Gold” (Educational/Personal Brand) |
| Knowledge Area | Deep Learning / Optimization |
| Software or Tool | Neural Network |
| Main Book | “Neural Networks and Deep Learning” by Michael Nielsen |

1. Learning Objectives

By completing this project, you will:

  1. Translate math definitions into deterministic implementation steps.
  2. Build validation checks that make correctness observable.
  3. Diagnose numerical, logical, and data-shape failures early.
  4. Explain tradeoffs in interviews using evidence from your own build.

2. All Theory Needed (Per-Concept Breakdown)

This project applies the following theory clusters:

  • Symbolic-to-numeric translation (expressions, data shapes, invariants)
  • Stability constraints (precision, scaling, stopping criteria)
  • Optimization or inference logic (depending on project objective)
  • Evaluation discipline (error analysis, test coverage, reproducibility)

Concept A: Mathematical Representation Discipline

Fundamentals

A math expression is not executable until you define representation, ordering, and domain constraints. The same equation can be represented as a token stream, a tree, a matrix pipeline, or a probability graph. Choosing a representation determines which bugs you can catch early.

Deep Dive into the concept

Most project failures begin before algorithm selection: they start with ambiguous representation. If your parser cannot distinguish unary minus from subtraction, your calculator fails. If your matrix dimensions are implicit rather than validated, your linear algebra pipeline fails silently. If your probabilistic assumptions (independence, stationarity, or class priors) are not explicit, your inference can look accurate on one split and collapse on another.

The core implementation move is to treat representation as a contract. Define each object with shape, domain, and semantic intent. Then enforce invariants at boundaries: input parser, preprocessing, training loop, evaluation stage. This makes debugging local instead of global.

How this fits this project

You will encode each operation with explicit contracts and invariant checks.

Definitions & key terms

  • Invariant: Property that must hold before and after each operation.
  • Shape contract: Expected dimensional structure of vectors/matrices/tensors.
  • Domain constraint: Allowed value range (for example log input > 0).
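These three terms can be enforced directly in code. A minimal NumPy sketch (the function names are illustrative, not part of the project spec):

```python
import numpy as np

def checked_log(x: np.ndarray) -> np.ndarray:
    """Apply log only after the domain constraint (x > 0) is verified."""
    assert np.all(x > 0), "domain violation: log input must be > 0"
    return np.log(x)

def checked_matmul(a: np.ndarray, w: np.ndarray) -> np.ndarray:
    """Enforce the shape contract at the boundary instead of letting a
    dimension mismatch surface deep inside the pipeline."""
    assert a.ndim == 2 and w.ndim == 2, "expected 2-D arrays"
    assert a.shape[1] == w.shape[0], (
        f"shape contract broken: {a.shape} @ {w.shape}")
    return a @ w
```

Failing fast at the boundary turns a silent wrong answer into a local, named diagnostic.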

Mental model diagram

User Input -> Representation Layer -> Validated Operation -> Observable Output
              (tokens/shapes)        (invariants pass)       (tests/plots/logs)

How it works

  1. Parse/ingest data into typed structures.
  2. Validate shape/domain invariants.
  3. Execute operation.
  4. Compare observed output with expected behavior.
  5. Record failure signature if mismatch appears.

Minimal concrete example

PSEUDOCODE
read expression
tokenize with precedence rules
if token sequence invalid -> return syntax error
evaluate tree
if domain violation -> return bounded diagnostic
print value and confidence check

Common misconceptions

  • “If it runs once, representation is correct.” -> false.
  • “Type checks are enough without shape checks.” -> false.

Check-your-understanding questions

  1. Which invariant catches division-by-zero earliest?
  2. Why does shape validation belong at boundaries rather than only in core logic?
  3. Predict failure if tokenization ignores unary minus.

Check-your-understanding answers

  1. Domain check on denominator before operation execution.
  2. Boundary validation keeps errors local and diagnostic.
  3. Expressions like -2^2 get misinterpreted and produce wrong precedence behavior.

Real-world applications

Feature preprocessing, model-serving input validation, and experiment-tracking schema enforcement.

Where you’ll apply it

This project and every downstream project in the sprint.

References

  • CSAPP (Bryant & O’Hallaron), floating-point chapter
  • Math for Programmers (Paul Orland), representation-oriented chapters

Key insight

Correct representation reduces the complexity of every later decision.

Summary

Stable ML math implementations start with explicit contracts, not implicit assumptions.

Homework/Exercises

  1. Write five invariants for your project.
  2. Build a failing test input for each invariant.

Solutions

  1. Include at least one shape, one domain, one convergence, one reproducibility, and one output-range invariant.
  2. Each failing input should trigger exactly one diagnostic to keep root-cause analysis clean.

3. Build Blueprint

  1. Scope the smallest end-to-end slice that produces visible output.
  2. Add deterministic tests and edge-case probes.
  3. Layer complexity only after baseline behavior is stable.
  4. Add metrics logging before optimization.
  5. Run failure drills: perturb inputs, scale values, and check stability.

4. Real-World Outcome (Target)

$ python neural_net.py mnist/

Loading MNIST dataset...
  Training: 60,000 images
  Test: 10,000 images

Network architecture: 784 → 128 → 64 → 10
  Layer 1: 784 inputs × 128 outputs = 100,352 weights
  Layer 2: 128 × 64 = 8,192 weights
  Layer 3: 64 × 10 = 640 weights
  Total: 109,184 weights + 202 biases = 109,386 trainable parameters

Training with mini-batch gradient descent (batch_size=32, lr=0.01)

Epoch 1/10:  Loss = 0.823, Accuracy = 78.2%
Epoch 2/10:  Loss = 0.412, Accuracy = 89.1%
Epoch 5/10:  Loss = 0.187, Accuracy = 94.6%
Epoch 10/10: Loss = 0.098, Accuracy = 97.2%

Test set accuracy: 96.8%

[Confusion matrix showing per-digit accuracy]
[Visualization: some misclassified examples with predictions]

$ python neural_net.py predict digit.png
[Shows image]
Prediction: 7 (confidence: 98.3%)
Probabilities: [0.001, 0.002, 0.005, 0.001, 0.002, 0.001, 0.001, 0.983, 0.002, 0.002]

Implementation Hints: Forward pass for layer l:

z[l] = a[l-1] @ W[l] + b[l]
a[l] = activation(z[l])  # ReLU or sigmoid

Backward pass (chain rule!):

# Output layer (with softmax + cross-entropy)
delta[L] = a[L] - y_one_hot  # Beautifully simple!

# Hidden layers
delta[l] = (delta[l+1] @ W[l+1].T) * activation_derivative(z[l])

# Gradients
dW[l] = a[l-1].T @ delta[l]
db[l] = delta[l].sum(axis=0)

This is the mathematical heart of deep learning. Every framework automates this, but you’ll have built it by hand.
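One way to convince yourself these equations are right is a numerical gradient check: perturb one weight, recompute the loss, and compare the finite-difference slope with the analytic gradient. A sketch for a single softmax layer (the shapes here are arbitrary test sizes, not the project architecture):

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def loss_fn(X, W, b, y):
    p = softmax(X @ W + b)
    return -np.mean(np.sum(y * np.log(p), axis=1))

X = rng.normal(size=(5, 4))
W = rng.normal(size=(4, 3))
b = np.zeros(3)
y = np.eye(3)[rng.integers(0, 3, size=5)]   # random one-hot targets

# analytic gradient from delta = a - y (softmax + cross-entropy combined)
delta = (softmax(X @ W + b) - y) / X.shape[0]
dW_analytic = X.T @ delta

# numerical gradient for a single weight entry
eps = 1e-6
W_pert = W.copy()
W_pert[0, 0] += eps
dW_numeric = (loss_fn(X, W_pert, b, y) - loss_fn(X, W, b, y)) / eps

assert abs(dW_numeric - dW_analytic[0, 0]) < 1e-4
```

The same spot-check extends to hidden layers: any mismatch larger than a few multiples of eps points at a bug in the backward pass.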

Learning milestones:

  1. Network trains and loss decreases → You understand forward/backward pass
  2. Accuracy exceeds 95% → You’ve built a working deep learning system
  3. You can explain backpropagation step-by-step → You’ve internalized the chain rule

5. Core Design Notes from Main Guide

Core Question

“How does a pile of linear algebra and calculus learn to see?”

A neural network is nothing more than alternating linear transformations (matrix multiplications) and nonlinear activations (ReLU, sigmoid). Yet when you stack them together and train with gradient descent, something remarkable happens: the network learns to recognize handwritten digits, or faces, or speech. How? The magic is backpropagation: the chain rule applied systematically to compute how every single weight affects the final loss. By building this from scratch, you will understand that there is no magic at all. Just matrices, derivatives, and the patient accumulation of tiny gradient steps toward a solution.

Concepts You Must Understand First

Stop and research these before coding:

  1. The Multi-Layer Perceptron Architecture
    • What is a “layer” in a neural network?
    • How does information flow from input through hidden layers to output?
    • What is the role of weights vs biases at each layer?
    • Book Reference: “Neural Networks and Deep Learning” Chapter 1 - Michael Nielsen
  2. Activation Functions (ReLU, Sigmoid, Softmax)
    • Why do we need nonlinear activation functions?
    • What is the difference between ReLU(z) = max(0, z) and sigmoid?
    • When do you use softmax (output layer) vs ReLU (hidden layers)?
    • Book Reference: “Deep Learning” Chapter 6 - Goodfellow et al.
  3. Forward Pass as Matrix Chain Multiplication
    • How do you compute the output of a layer: a = activation(W @ x + b)?
    • What are the dimensions of W if input is n-dimensional and output is m-dimensional?
    • How do you “chain” layers together?
    • Book Reference: “Hands-On Machine Learning” Chapter 10 - Aurelien Geron
  4. Backpropagation (Chain Rule in Depth)
    • What is the chain rule: d(f(g(x)))/dx = f’(g(x)) * g’(x)?
    • How do you propagate gradients backward through layers?
    • Why do you need to cache values from the forward pass?
    • Book Reference: “Neural Networks and Deep Learning” Chapter 2 - Michael Nielsen
  5. Softmax and Cross-Entropy for Multi-Class
    • What is softmax: exp(z_i) / sum(exp(z_j))?
    • Why is softmax + cross-entropy elegant (gradient = output - target)?
    • What is the “log-sum-exp” trick for numerical stability?
    • Book Reference: “Deep Learning” Chapter 6 - Goodfellow et al.
  6. Weight Initialization
    • Why do you not initialize all weights to zero?
    • What is Xavier/Glorot initialization? What about He initialization?
    • How does initialization affect training speed and convergence?
    • Book Reference: “Hands-On Machine Learning” Chapter 11 - Aurelien Geron
  7. Mini-Batch Gradient Descent
    • Why use mini-batches instead of all data (batch) or one sample (stochastic)?
    • How does batch size affect training dynamics?
    • What is an “epoch” in this context?
    • Book Reference: “Deep Learning” Chapter 8 - Goodfellow et al.
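The log-sum-exp trick from concept 5 is worth sketching before you code the real thing. A minimal illustration (not the project's required API): shifting the logits by their row maximum changes nothing mathematically but keeps exp() from overflowing.

```python
import numpy as np

def log_softmax(z):
    # subtracting the row max is a mathematical no-op for softmax,
    # but it bounds the exponent so exp() cannot overflow
    z_shift = z - z.max(axis=1, keepdims=True)
    return z_shift - np.log(np.sum(np.exp(z_shift), axis=1, keepdims=True))

def cross_entropy(z, y_one_hot):
    """Stable cross-entropy computed directly from logits."""
    return -np.mean(np.sum(y_one_hot * log_softmax(z), axis=1))

z = np.array([[1000.0, 1001.0, 1002.0]])   # naive exp(z) would overflow
y = np.array([[0.0, 0.0, 1.0]])
print(cross_entropy(z, y))                  # ≈ 0.4076, finite and correct
```

Computing the loss from logits via log_softmax, rather than from softmax probabilities via log, avoids both overflow in exp and log(0) underflow.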

Questions to Guide Your Design

Before implementing, think through these:

  1. Data representation: How will you represent the MNIST images? Flatten to 784-dimensional vectors? Normalize pixel values?

  2. Layer abstraction: Will you create a Layer class, or just use functions? How do you store weights and gradients?

  3. Forward pass caching: What values do you need to save during forward pass for use in backward pass?

  4. Gradient accumulation: In mini-batch training, how do you accumulate gradients across samples before updating weights?

  5. Numerical stability: Where might you encounter overflow or underflow? (Hint: softmax, cross-entropy)

  6. Training monitoring: How will you track loss and accuracy during training? When do you evaluate on the test set?

Thinking Exercise

Trace backpropagation through a tiny network by hand:

Consider a 2-layer network: Input (2) -> Hidden (2) -> Output (1)

  • Input: x = [1, 2]
  • Weights layer 1: W1 = [[0.1, 0.2], [0.3, 0.4]], b1 = [0.1, 0.1]
  • Weights layer 2: W2 = [[0.5], [0.6]], b2 = [0.1]
  • Activation: ReLU for hidden, sigmoid for output
  • Target: y = 1

Forward pass:

z1 = W1 @ x + b1 = [[0.1, 0.2], [0.3, 0.4]] @ [1, 2] + [0.1, 0.1] = [0.6, 1.2]
a1 = ReLU(z1) = [0.6, 1.2]
z2 = W2.T @ a1 + b2 = [0.5, 0.6] @ [0.6, 1.2] + 0.1 = 0.5*0.6 + 0.6*1.2 + 0.1 = 1.12
a2 = sigmoid(1.12) = 0.754
Loss = -log(0.754) = 0.282
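You can verify this hand trace with a few lines of NumPy before attempting the backward pass:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([1.0, 2.0])
W1 = np.array([[0.1, 0.2], [0.3, 0.4]])
b1 = np.array([0.1, 0.1])
W2 = np.array([[0.5], [0.6]])
b2 = np.array([0.1])

z1 = W1 @ x + b1          # [0.6, 1.2], as in the trace
a1 = np.maximum(0, z1)    # ReLU leaves both positive values unchanged
z2 = W2.T @ a1 + b2       # [1.12]
a2 = sigmoid(z2)          # ≈ 0.754
loss = -np.log(a2)        # ≈ 0.282
```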

Backward pass:

dL/da2 = -1/a2 = -1.326  (gradient of cross-entropy w.r.t. prediction)
da2/dz2 = a2*(1-a2) = 0.754 * 0.246 = 0.186  (sigmoid derivative)
dL/dz2 = dL/da2 * da2/dz2 = -1.326 * 0.186 = -0.247
  (or use combined: dL/dz2 = a2 - y = 0.754 - 1 = -0.246)

dL/dW2 = a1 * dL/dz2 = [0.6, 1.2] * (-0.246) = [-0.148, -0.295]
dL/db2 = dL/dz2 = -0.246

dL/da1 = W2 * dL/dz2 = [0.5, 0.6] * (-0.246) = [-0.123, -0.148]
da1/dz1 = [1, 1]  (ReLU derivative, both positive)
dL/dz1 = [-0.123, -0.148]

dL/dW1 = outer(dL/dz1, x) = ... (exercise for you)
dL/db1 = dL/dz1 = [-0.123, -0.148]
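A finite-difference check confirms the layer-2 gradients above without giving away the dL/dW1 exercise. The loss helper below simply re-runs the same forward pass with a perturbed W2:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss(W2_flat):
    # same tiny network as the trace; target y = 1
    x = np.array([1.0, 2.0])
    W1 = np.array([[0.1, 0.2], [0.3, 0.4]])
    b1 = np.array([0.1, 0.1])
    W2 = W2_flat.reshape(2, 1)
    b2 = np.array([0.1])
    a1 = np.maximum(0, W1 @ x + b1)
    a2 = sigmoid(W2.T @ a1 + b2)
    return (-np.log(a2)).item()

W2 = np.array([0.5, 0.6])
eps = 1e-6
grad = np.array([(loss(W2 + eps * np.eye(2)[i]) - loss(W2)) / eps
                 for i in range(2)])
# should match the analytic dL/dW2 ≈ [-0.148, -0.295] from the trace
```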

Interview Questions

  1. “Explain backpropagation in your own words.”
    • Expected answer: Backpropagation applies the chain rule to compute how much each weight contributes to the loss. We propagate gradients backward from the output, multiplying by local derivatives at each layer.
  2. “Why do we need activation functions? What happens without them?”
    • Expected answer: Without nonlinear activations, the entire network collapses to a single linear transformation. A stack of linear functions is just one linear function.
  3. “What is the vanishing gradient problem and how do you address it?”
    • Expected answer: Gradients shrink as they propagate through many layers (especially with sigmoid). Solutions: ReLU activation, skip connections (ResNet), batch normalization.
  4. “Why is mini-batch gradient descent preferred over batch or stochastic?”
    • Expected answer: Mini-batch balances computational efficiency (parallel processing) with gradient noise (helps escape local minima). Batch is slow; stochastic is noisy.
  5. “How do you initialize the weights of a neural network?”
    • Expected answer: Xavier/Glorot for sigmoid/tanh (variance = 1/n_in), He for ReLU (variance = 2/n_in). Zero initialization fails to break symmetry: every neuron in a layer computes the same output and receives the same gradient, so they never differentiate and the network cannot learn.
  6. “What is the difference between epoch, batch, and iteration?”
    • Expected answer: Epoch = one pass through all training data. Batch = subset of data for one gradient update. Iteration = one gradient update step.
  7. “How would you debug a neural network that is not learning?”
    • Expected answer: Check: loss decreasing? gradients flowing (not zero or exploding)? learning rate appropriate? data correctly preprocessed? architecture reasonable? overfitting a tiny batch first?
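The last check in question 7, overfitting a tiny batch, is the fastest way to validate gradient code, so it is worth sketching. A minimal single-layer version (the sizes and learning rate here are arbitrary test values, not the project architecture): a correct implementation should drive the loss on four memorized samples close to zero.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))   # four samples the model should memorize
y = np.eye(4)                 # four distinct one-hot classes

W = rng.normal(size=(8, 4)) * np.sqrt(2.0 / 8)
b = np.zeros(4)

def softmax(z):
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

for step in range(500):
    p = softmax(X @ W + b)
    delta = (p - y) / len(X)      # softmax + cross-entropy gradient
    W -= 1.0 * (X.T @ delta)
    b -= 1.0 * delta.sum(axis=0)

# probability assigned to each sample's true class sits on the diagonal
loss = -np.mean(np.log(softmax(X @ W + b)[np.arange(4), np.arange(4)]))
# if this loss does not approach zero, the gradient code is suspect
```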

Hints in Layers (Treat as pseudocode guidance)

Hint 1: Structure your forward pass to cache everything needed for backward:

def forward(X, weights):
    cache = {'input': X}
    A = X
    for i, (W, b) in enumerate(weights):
        Z = A @ W + b
        cache[f'Z{i}'] = Z
        A = relu(Z) if i < len(weights)-1 else softmax(Z)
        cache[f'A{i}'] = A
    return A, cache

Hint 2: The backward pass mirrors the forward:

def backward(y_true, cache, weights):
    grads = []
    m = y_true.shape[0]
    L = len(weights)
    # softmax + cross-entropy combined: dL/dZ for the output layer
    dZ = cache[f'A{L-1}'] - y_true

    for i in reversed(range(L)):
        if i < L - 1:
            # hidden layers: push dA through the ReLU derivative
            dZ = dA * relu_derivative(cache[f'Z{i}'])
        A_prev = cache[f'A{i-1}'] if i > 0 else cache['input']
        dW = A_prev.T @ dZ / m
        db = np.sum(dZ, axis=0) / m
        grads.append((dW, db))
        dA = dZ @ weights[i][0].T   # gradient w.r.t. previous activations

    return list(reversed(grads))

Hint 3: Softmax with numerical stability:

def softmax(z):
    exp_z = np.exp(z - np.max(z, axis=1, keepdims=True))  # subtract max
    return exp_z / np.sum(exp_z, axis=1, keepdims=True)

Hint 4: He initialization for ReLU networks:

W = np.random.randn(n_in, n_out) * np.sqrt(2.0 / n_in)
b = np.zeros(n_out)
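A quick statistical check of this initialization (sizes borrowed from the target architecture; the check itself is illustrative, not part of the required program):

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_out = 784, 128

# He initialization: each weight drawn with variance 2 / n_in
W = rng.normal(size=(n_in, n_out)) * np.sqrt(2.0 / n_in)
print(W.var())                            # ≈ 2 / 784 ≈ 0.00255

# the factor of 2 compensates for ReLU zeroing half its inputs,
# keeping the second moment of activations close to the input's
x = rng.normal(size=(1000, n_in))
a = np.maximum(0, x @ W)
print((x ** 2).mean(), (a ** 2).mean())   # both ≈ 1.0
```

Without the factor of 2, the activation magnitude would shrink layer by layer, which is one way vanishing gradients show up in deep ReLU networks.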

Hint 5: Training loop structure:

for epoch in range(epochs):
    # shuffle X_train and y_train together here so batches differ per epoch
    for batch_start in range(0, len(X_train), batch_size):
        X_batch = X_train[batch_start:batch_start+batch_size]
        y_batch = y_train[batch_start:batch_start+batch_size]

        output, cache = forward(X_batch, weights)
        grads = backward(y_batch, cache, weights)

        for i, (dW, db) in enumerate(grads):
            # reassign rather than mutate in place, so this works even
            # when each weights[i] is an immutable (W, b) tuple
            W, b = weights[i]
            weights[i] = (W - learning_rate * dW, b - learning_rate * db)

Books That Will Help

| Topic | Book | Chapter |
| --- | --- | --- |
| Neural Network Fundamentals | “Neural Networks and Deep Learning” by Michael Nielsen | Chapters 1-2: Networks, Backpropagation |
| Backpropagation Math | “Deep Learning” by Goodfellow et al. | Chapter 6: Deep Feedforward Networks |
| Softmax and Cross-Entropy | “Deep Learning” by Goodfellow et al. | Chapter 6.2: Output Units |
| Weight Initialization | “Hands-On Machine Learning” by Aurelien Geron | Chapter 11: Training Deep Networks |
| Mini-Batch Optimization | “Deep Learning” by Goodfellow et al. | Chapter 8: Optimization |
| Practical Implementation | “Make Your Own Neural Network” by Tariq Rashid | Full book: step-by-step |


6. Validation, Pitfalls, and Completion

Common Pitfalls and Debugging

Problem 1: “Outputs drift after a few iterations”

  • Why: Hidden numerical instability (unscaled features, aggressive step size, or repeated subtraction of nearly equal values).
  • Fix: Normalize inputs, reduce step size, and track relative error rather than only absolute error.
  • Quick test: Run the same task with two scales of input (for example x and 10x) and compare normalized error curves.

Problem 2: “Results are inconsistent across runs”

  • Why: Random seeds, data split randomness, or non-deterministic ordering are uncontrolled.
  • Fix: Set seeds, log configuration, and store split indices and hyperparameters with each run.
  • Quick test: Re-run three times with the same seed and confirm metrics remain inside a tight tolerance band.
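The quick test above can be automated. A sketch, where run_experiment is a hypothetical stand-in for your real training entry point:

```python
import numpy as np

def run_experiment(seed: int) -> float:
    """Stand-in for a full training run: route all randomness through
    one seeded generator so identical seeds give identical metrics."""
    rng = np.random.default_rng(seed)
    data = rng.normal(size=(100, 10))
    weights = rng.normal(size=10)
    return float((data @ weights).std())

runs = [run_experiment(seed=42) for _ in range(3)]
assert len(set(runs)) == 1, "same seed must reproduce the same metric"
```

In the real project the metric would be test accuracy, and the logged configuration (seed, split indices, hyperparameters) is what makes a failed reproduction diagnosable.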

Problem 3: “The project works on the demo case but fails on edge cases”

  • Why: Tests only cover happy-path inputs.
  • Fix: Add adversarial inputs (empty values, extreme ranges, near-singular matrices, rare classes).
  • Quick test: Build an edge-case test matrix and ensure every scenario reports expected behavior.

Definition of Done

  • Core functionality works on reference inputs
  • Edge cases are tested and documented
  • Results are reproducible (seeded and versioned configuration)
  • Performance or convergence behavior is measured and explained
  • A short retrospective explains what failed first and how you fixed it

7. Extension Ideas

  1. Add a stress-test mode with adversarial inputs.
  2. Add a short benchmark report (runtime + memory + error trend).
  3. Add a reproducibility bundle (seed, config, and fixed test corpus).

8. Why This Project Matters


This project is valuable because it creates observable evidence of mathematical reasoning under real implementation constraints.