Project 11: Backpropagation from Scratch (Single Neuron)

A single neuron that learns via backpropagation. This is the atomic unit of neural networks. You’ll implement the forward pass, loss calculation, and backward pass (gradient computation via the chain rule) completely from scratch.

Quick Reference

  • Difficulty: Level 3, Advanced (The Engineer)
  • Main Programming Language: Python
  • Alternative Programming Languages: C, Julia, Rust
  • Coolness Level: Level 4, Hardcore Tech Flex
  • Business Potential: 1. The “Resume Gold” (Educational/Personal Brand)
  • Knowledge Area: Neural Networks / Calculus
  • Software or Tool: Backprop Engine
  • Main Book: “Neural Networks and Deep Learning” by Michael Nielsen

1. Learning Objectives

By completing this project, you will:

  1. Translate math definitions into deterministic implementation steps.
  2. Build validation checks that make correctness observable.
  3. Diagnose numerical, logical, and data-shape failures early.
  4. Explain tradeoffs in interviews using evidence from your own build.

2. All Theory Needed (Per-Concept Breakdown)

This project applies the following theory clusters:

  • Symbolic-to-numeric translation (expressions, data shapes, invariants)
  • Stability constraints (precision, scaling, stopping criteria)
  • Optimization or inference logic (depending on project objective)
  • Evaluation discipline (error analysis, test coverage, reproducibility)

Concept A: Mathematical Representation Discipline

Fundamentals: A math expression is not executable until you define representation, ordering, and domain constraints. The same equation can be represented as a token stream, tree, matrix pipeline, or probability graph. Choosing a representation determines what bugs you can catch early.

Deep Dive: Most project failures begin before algorithm selection: they start with ambiguous representation. If your parser cannot distinguish unary minus from subtraction, your calculator fails. If your matrix dimensions are implicit rather than validated, your linear algebra pipeline fails silently. If your probabilistic assumptions (independence, stationarity, or class priors) are not explicit, your inference can look accurate on one split and collapse on another. The core implementation move is to treat representation as a contract. Define each object with shape, domain, and semantic intent. Then enforce invariants at boundaries: input parser, preprocessing, training loop, evaluation stage. This makes debugging local instead of global.

How this fits this project: You will encode each operation with explicit contracts and invariant checks.

Definitions & key terms

  • Invariant: Property that must hold before and after each operation.
  • Shape contract: Expected dimensional structure of vectors/matrices/tensors.
  • Domain constraint: Allowed value range (for example log input > 0).
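These three terms can be enforced directly in code. A minimal sketch for this project's forward pass (the function name and error messages are illustrative, not part of the project spec):

```python
import math

def forward(x, w, b):
    """Forward pass with shape and domain contracts enforced at the boundary."""
    # Shape contract: inputs and weights must have the same length.
    if len(x) != len(w):
        raise ValueError(f"shape contract violated: len(x)={len(x)} != len(w)={len(w)}")
    # Domain constraint: every value must be a finite number.
    for v in list(x) + list(w) + [b]:
        if not math.isfinite(v):
            raise ValueError(f"domain constraint violated: non-finite value {v}")
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    a = 1.0 / (1.0 + math.exp(-z))
    # Output-range invariant: sigmoid output must lie strictly in (0, 1).
    assert 0.0 < a < 1.0, "invariant violated: activation outside (0, 1)"
    return a
```

Each check fails loudly at the boundary, so a bad input produces a local diagnostic instead of a silent error three stages downstream.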

Mental model diagram

User Input -> Representation Layer -> Validated Operation -> Observable Output
              (tokens/shapes)        (invariants pass)       (tests/plots/logs)

How it works

  1. Parse/ingest data into typed structures.
  2. Validate shape/domain invariants.
  3. Execute operation.
  4. Compare observed output with expected behavior.
  5. Record failure signature if mismatch appears.

Minimal concrete example

PSEUDOCODE
read expression
tokenize with precedence rules
if token sequence invalid -> return syntax error
evaluate tree
if domain violation -> return bounded diagnostic
print value and confidence check

Common misconceptions

  • “If it runs once, representation is correct.” -> false.
  • “Type checks are enough without shape checks.” -> false.

Check-your-understanding questions

  1. Which invariant catches division-by-zero earliest?
  2. Why does shape validation belong at boundaries rather than only in core logic?
  3. Predict failure if tokenization ignores unary minus.

Check-your-understanding answers

  1. Domain check on denominator before operation execution.
  2. Boundary validation keeps errors local and diagnostic.
  3. Expressions like -2^2 get misinterpreted and produce wrong precedence behavior.

Real-world applications: Feature preprocessing, model-serving input validation, and experiment-tracking schema enforcement.

Where you’ll apply it: This project and every downstream project in the sprint.

References

  • CSAPP (Bryant & O’Hallaron), floating-point chapter
  • Math for Programmers (Paul Orland), representation-oriented chapters

Key insight: Correct representation reduces the complexity of every later decision.

Summary: Stable ML math implementations start with explicit contracts, not implicit assumptions.

Homework/Exercises

  1. Write five invariants for your project.
  2. Build a failing test input for each invariant.

Solutions

  1. Include at least one shape, one domain, one convergence, one reproducibility, and one output-range invariant.
  2. Each failing input should trigger exactly one diagnostic to keep root-cause analysis clean.

3. Build Blueprint

  1. Scope the smallest end-to-end slice that produces visible output.
  2. Add deterministic tests and edge-case probes.
  3. Layer complexity only after baseline behavior is stable.
  4. Add metrics logging before optimization.
  5. Run failure drills: perturb inputs, scale values, and check stability.

4. Real-World Outcome (Target)

$ python neuron.py

Training single neuron to learn AND gate:
Inputs: [[0,0], [0,1], [1,0], [1,1]]
Targets: [0, 0, 0, 1]

Initial weights: w1=0.5, w2=-0.3, bias=-0.1
Initial predictions: [0.475, 0.401, 0.599, 0.525]
Initial loss: 0.243

Epoch 100:
  Forward:  input=[1,1] → z = 1*0.8 + 1*0.7 + (-0.5) = 1.0 → σ(1.0) = 0.731
  Loss:     (0.731 - 1)² = 0.072
  Backward: ∂L/∂z = 2(0.731-1) * σ'(1.0) = -0.106
            ∂L/∂w1 = -0.106 * 1 = -0.106  [input was 1]
            ∂L/∂w2 = -0.106 * 1 = -0.106
  Update:   w1 = 0.8 - 0.1 * (-0.106) = 0.811

Epoch 1000:
  Predictions: [0.02, 0.08, 0.07, 0.91]  ✓ (AND gate learned!)
  Final weights: w1=5.2, w2=5.1, bias=-7.8

[Visual: Decision boundary moving during training]

Implementation Hints: Neuron computation:

z = w1*x1 + w2*x2 + bias  (linear combination)
a = sigmoid(z) = 1 / (1 + exp(-z))  (activation)

Sigmoid derivative: sigmoid'(z) = sigmoid(z) * (1 - sigmoid(z))

Chain rule for weight gradient:

∂L/∂w1 = ∂L/∂a * ∂a/∂z * ∂z/∂w1
       = 2(a - target) * sigmoid'(z) * x1

This is backpropagation! The gradient “flows backward” through the computation.
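Before trusting a hand-written backward pass, it is standard practice to check it numerically: nudge a weight by a tiny h and the resulting loss change should match the analytic gradient. A minimal sketch (helper names are illustrative):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def loss(w1, w2, b, x1, x2, target):
    # Forward pass followed by MSE loss for one sample
    a = sigmoid(w1 * x1 + w2 * x2 + b)
    return (a - target) ** 2

def analytic_grad_w1(w1, w2, b, x1, x2, target):
    # Chain rule: dL/dw1 = 2(a - target) * a * (1 - a) * x1
    a = sigmoid(w1 * x1 + w2 * x2 + b)
    return 2 * (a - target) * a * (1 - a) * x1

def numeric_grad_w1(w1, w2, b, x1, x2, target, h=1e-6):
    # Central difference: (L(w1 + h) - L(w1 - h)) / (2h)
    return (loss(w1 + h, w2, b, x1, x2, target)
            - loss(w1 - h, w2, b, x1, x2, target)) / (2 * h)
```

If the two values agree to several decimal places, the chain-rule implementation is almost certainly correct.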

Learning milestones:

  1. Forward pass produces output → You understand function composition
  2. Gradients computed correctly → You’ve mastered the chain rule
  3. Neuron learns the AND gate → You’ve implemented learning from scratch!

5. Core Design Notes from Main Guide

Core Question

How does a machine “learn” from its mistakes?

When you train a neuron, you’re answering one of the most profound questions in machine learning: how can an algorithm improve itself by measuring how wrong it was? The answer lies in the beautiful connection between calculus (the chain rule) and optimization (gradient descent). Every time your neuron updates its weights, it’s answering: “In which direction should I change to be less wrong next time?”

Concepts You Must Understand First

Stop and research these before coding:

  1. The Forward Pass as Function Composition
    • What does it mean to “compose” two functions like f(g(x))?
    • Why is a neuron just a composition of a linear function and an activation?
    • How do you trace an input through multiple operations to get an output?
    • Book Reference: “Neural Networks and Deep Learning” Chapter 1 - Michael Nielsen
  2. Loss Functions and Their Purpose
    • Why do we need a single number to represent “how wrong” we are?
    • What’s the difference between MSE and cross-entropy loss? When should you use each?
    • Why does squaring errors in MSE penalize large errors more than small ones?
    • Book Reference: “Deep Learning” Chapter 3.13 - Goodfellow et al.
  3. The Chain Rule from Calculus
    • If y = f(g(x)), how do you compute dy/dx?
    • Why is the chain rule called “chain”? What’s being chained?
    • Can you apply the chain rule to three or more nested functions?
    • Book Reference: “Calculus” Chapter 3.4 - James Stewart
  4. Activation Functions and Their Derivatives
    • Why does sigmoid(x) output values between 0 and 1?
    • What is the derivative of sigmoid? Why is it sigmoid(x) * (1 - sigmoid(x))?
    • What is ReLU and why is its derivative so simple?
    • Where does the gradient “vanish” for sigmoid? Why is this a problem?
    • Book Reference: “Neural Networks and Deep Learning” Chapter 3 - Michael Nielsen
  5. Gradient Descent Weight Updates
    • Why do we subtract the gradient rather than add it?
    • What role does the learning rate play? What happens if it’s too big/small?
    • Why do we multiply the gradient by the input to get the weight gradient?
    • Book Reference: “Hands-On Machine Learning” Chapter 4 - Aurelien Geron
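For question 4 above, the sigmoid derivative identity follows in two steps from the chain rule; a short derivation:

```latex
\sigma(x) = \frac{1}{1 + e^{-x}}
\qquad\Rightarrow\qquad
\sigma'(x) = \frac{e^{-x}}{(1 + e^{-x})^2}
           = \frac{1}{1 + e^{-x}} \cdot \frac{e^{-x}}{1 + e^{-x}}
           = \sigma(x)\,\bigl(1 - \sigma(x)\bigr)
```

The last step uses the identity 1 - σ(x) = e^(-x) / (1 + e^(-x)), which follows directly from the definition.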

Questions to Guide Your Design

Before implementing, think through these:

  1. Data Representation: How will you represent inputs, weights, and the bias? As separate variables or bundled together?

  2. Forward Pass Structure: What’s the exact sequence of operations? Linear combination first, then activation? How will you store intermediate values for the backward pass?

  3. Loss Computation: Will you compute loss for each sample individually or as a batch? How does this affect your gradient calculation?

  4. Backward Pass Flow: In what order do you compute derivatives? Start from the loss and work backward, or start from the inputs?

  5. Weight Update Timing: Do you update weights after each sample, or accumulate gradients over a batch first?

  6. Convergence Detection: How will you know when to stop training? Fixed epochs? When loss plateaus?

Thinking Exercise

Hand-trace this backpropagation before coding:

Given a single neuron with:

  • Input: x = [1.0, 0.5]
  • Weights: w = [0.3, -0.2]
  • Bias: b = 0.1
  • Target: y = 1
  • Activation: sigmoid

Step 1: Forward Pass

z = w[0]*x[0] + w[1]*x[1] + b = ?
a = sigmoid(z) = 1 / (1 + exp(-z)) = ?

Step 2: Loss (MSE)

L = (a - y)^2 = ?

Step 3: Backward Pass

dL/da = 2 * (a - y) = ?
da/dz = sigmoid(z) * (1 - sigmoid(z)) = ?
dL/dz = dL/da * da/dz = ?

dz/dw[0] = x[0] = ?
dz/dw[1] = x[1] = ?
dz/db = 1

dL/dw[0] = dL/dz * dz/dw[0] = ?
dL/dw[1] = dL/dz * dz/dw[1] = ?
dL/db = dL/dz * 1 = ?

Step 4: Update (learning rate = 0.1)

w[0]_new = w[0] - 0.1 * dL/dw[0] = ?
w[1]_new = w[1] - 0.1 * dL/dw[1] = ?
b_new = b - 0.1 * dL/db = ?

Work through this completely by hand. Check: After update, does the neuron predict closer to 1?
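Once you have worked it by hand, a short script can check your trace; the final print verifies the sanity check at the end (run it only after filling in the ?s yourself):

```python
import math

# Setup from the exercise
x = [1.0, 0.5]
w = [0.3, -0.2]
b, y, lr = 0.1, 1.0, 0.1

# Step 1: forward pass
z = w[0] * x[0] + w[1] * x[1] + b
a = 1.0 / (1.0 + math.exp(-z))

# Step 2: loss (MSE)
L = (a - y) ** 2

# Step 3: backward pass via the chain rule
dL_da = 2 * (a - y)
da_dz = a * (1 - a)
dL_dz = dL_da * da_dz
dL_dw = [dL_dz * x[0], dL_dz * x[1]]
dL_db = dL_dz

# Step 4: gradient-descent update
w_new = [w[0] - lr * dL_dw[0], w[1] - lr * dL_dw[1]]
b_new = b - lr * dL_db

# Sanity check: the updated neuron should predict closer to the target y = 1
z2 = w_new[0] * x[0] + w_new[1] * x[1] + b_new
a_new = 1.0 / (1.0 + math.exp(-z2))
print(f"z={z:.4f} a={a:.4f} L={L:.4f} a_new={a_new:.4f}")
```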

Interview Questions

  1. “What is backpropagation, and why is it efficient?”
    • Expected: It’s applying the chain rule to compute gradients efficiently by reusing intermediate derivatives.
  2. “Walk me through the gradient computation for a single neuron with sigmoid activation.”
    • Expected: Forward pass (z = wx + b, a = sigmoid(z)), loss, then chain rule backward.
  3. “Why do we use log loss (cross-entropy) instead of MSE for classification?”
    • Expected: MSE gradients become very small when predictions are confident but wrong; cross-entropy doesn’t have this problem.
  4. “What is the vanishing gradient problem?”
    • Expected: Sigmoid squashes outputs to (0,1), and its derivative is at most 0.25. Multiplying small numbers across many layers makes gradients vanishingly small.
  5. “Implement the backward pass for a neuron with ReLU activation.”
    • Expected: dL/dz = dL/da * 1 if z > 0 else 0. Much simpler than sigmoid!
  6. “What happens if you initialize all weights to zero?”
    • Expected: All neurons compute the same thing. Gradients are identical. They stay identical forever. Symmetry breaking is essential.
  7. “How does batch size affect gradient updates?”
    • Expected: Larger batches give smoother, more accurate gradients but slower updates. Smaller batches are noisier but can escape local minima.
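The vanishing-gradient claim in question 4 can be demonstrated in a few lines: the sigmoid derivative never exceeds 0.25, so a chain of sigmoid layers scales the gradient by at most 0.25 per layer. A minimal sketch:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1 - s)

# The derivative peaks at z = 0, where it equals exactly 0.25
peak = max(sigmoid_prime(z / 100) for z in range(-1000, 1001))

# Worst-case gradient scaling through n sigmoid layers: at most 0.25**n
for n in (1, 5, 10, 20):
    print(f"{n:>2} layers: gradient scaled by at most {0.25 ** n:.2e}")
```

By 20 layers the worst-case factor is below 1e-12, which is why deep sigmoid networks train so poorly compared to ReLU ones.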

Hints in Layers (Treat as pseudocode guidance)

Hint 1: The Core Structure. Your neuron has three learnable parameters: w1, w2 (weights for the two inputs) and b (bias). The forward pass computes z = w1*x1 + w2*x2 + b, then a = sigmoid(z). Start by getting this working.

Hint 2: The Loss and Its Derivative. For MSE: L = (a - target)^2. The derivative is dL/da = 2*(a - target). This is where the “error signal” starts.

Hint 3: The Chain Rule Application. You need dL/dw1. By the chain rule: dL/dw1 = dL/da * da/dz * dz/dw1. You already have dL/da. Compute da/dz = a*(1 - a) (the sigmoid derivative), and dz/dw1 = x1 (the input!).

Hint 4: The Complete Gradient Flow

import numpy as np

# Forward
z = w1*x1 + w2*x2 + b
a = 1 / (1 + np.exp(-z))

# Backward
dL_da = 2 * (a - target)
da_dz = a * (1 - a)
dL_dz = dL_da * da_dz  # This is the key "delta" term

dL_dw1 = dL_dz * x1
dL_dw2 = dL_dz * x2
dL_db = dL_dz * 1

Hint 5: The Training Loop

for epoch in range(1000):
    total_loss = 0
    for (x1, x2), target in training_data:
        # Forward pass
        # Compute loss
        # Backward pass
        # Update weights
        w1 -= learning_rate * dL_dw1
        w2 -= learning_rate * dL_dw2
        b -= learning_rate * dL_db
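The hints above can be assembled into one runnable sketch. The hyperparameters here (learning rate 0.5, 5000 epochs) are illustrative choices, not values prescribed by the guide:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_and_gate(lr=0.5, epochs=5000):
    # AND-gate training data: ((x1, x2), target)
    training_data = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
    w1, w2, b = 0.5, -0.3, -0.1  # initial parameters from the example output

    for epoch in range(epochs):
        for (x1, x2), target in training_data:
            # Forward pass
            z = w1 * x1 + w2 * x2 + b
            a = sigmoid(z)
            # Backward pass: chain rule gives dL/dz = 2(a - t) * a(1 - a)
            dL_dz = 2 * (a - target) * a * (1 - a)
            # Per-sample gradient-descent update
            w1 -= lr * dL_dz * x1
            w2 -= lr * dL_dz * x2
            b -= lr * dL_dz
    return w1, w2, b

w1, w2, b = train_and_gate()
preds = [sigmoid(w1 * x1 + w2 * x2 + b) for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]]
print([round(p, 2) for p in preds])  # should approach [0, 0, 0, 1]
```

Per-sample updates keep the code simple; accumulating gradients over the batch before updating (see design question 5) is an equally valid choice.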

Books That Will Help

  • Backpropagation Intuition: “Neural Networks and Deep Learning,” Chapter 2 (Michael Nielsen)
  • Chain Rule Fundamentals: “Calculus,” Chapter 3 (James Stewart)
  • Computational Graphs: “Deep Learning,” Chapter 6.5 (Goodfellow et al.)
  • Loss Functions: “Deep Learning,” Chapter 3 (Goodfellow et al.)
  • Sigmoid and Activation Functions: “Hands-On Machine Learning,” Chapter 10 (Aurelien Geron)
  • Gradient Descent Variations: “Deep Learning,” Chapter 8 (Goodfellow et al.)
  • Weight Initialization: “Neural Networks and Deep Learning,” Chapter 3 (Michael Nielsen)


6. Validation, Pitfalls, and Completion

Common Pitfalls and Debugging

Problem 1: “Outputs drift after a few iterations”

  • Why: Hidden numerical instability (unscaled features, aggressive step size, or repeated subtraction of nearly equal values).
  • Fix: Normalize inputs, reduce step size, and track relative error rather than only absolute error.
  • Quick test: Run the same task with two scales of input (for example x and 10x) and compare normalized error curves.

Problem 2: “Results are inconsistent across runs”

  • Why: Random seeds, data split randomness, or non-deterministic ordering are uncontrolled.
  • Fix: Set seeds, log configuration, and store split indices and hyperparameters with each run.
  • Quick test: Re-run three times with the same seed and confirm metrics remain inside a tight tolerance band.
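For this project, the reproducibility quick test might look like the sketch below (the seeding scheme and step count are illustrative choices):

```python
import math
import random

def run_experiment(seed, steps=200):
    """Seeded init + fixed data order should give bit-identical results across runs."""
    rng = random.Random(seed)  # local RNG instance avoids global-state leakage
    w1, w2, b = (rng.uniform(-1, 1) for _ in range(3))
    data = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
    lr = 0.1
    for _ in range(steps):
        for (x1, x2), t in data:
            a = 1 / (1 + math.exp(-(w1 * x1 + w2 * x2 + b)))
            d = 2 * (a - t) * a * (1 - a)  # dL/dz via the chain rule
            w1 -= lr * d * x1
            w2 -= lr * d * x2
            b -= lr * d
    return w1, w2, b

# Three runs with the same seed must agree exactly; a different seed must not.
runs = [run_experiment(seed=42) for _ in range(3)]
```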

Problem 3: “The project works on the demo case but fails on edge cases”

  • Why: Tests only cover happy-path inputs.
  • Fix: Add adversarial inputs (empty values, extreme ranges, near-singular matrices, rare classes).
  • Quick test: Build an edge-case test matrix and ensure every scenario reports expected behavior.

Definition of Done

  • Core functionality works on reference inputs
  • Edge cases are tested and documented
  • Results are reproducible (seeded and versioned configuration)
  • Performance or convergence behavior is measured and explained
  • A short retrospective explains what failed first and how you fixed it

7. Extension Ideas

  1. Add a stress-test mode with adversarial inputs.
  2. Add a short benchmark report (runtime + memory + error trend).
  3. Add a reproducibility bundle (seed, config, and fixed test corpus).

8. Why This Project Matters

This project is valuable because it creates observable evidence of mathematical reasoning under real implementation constraints.