Project 1: The Manual Neuron

Learn how machines “learn” by building a single neuron that teaches itself logic gates - no libraries, no shortcuts, just raw math becoming intelligence


Project Overview

Attribute              Value
Difficulty             Beginner
Time Estimate          Weekend (8-16 hours)
Language               Python (Pure, NO NumPy)
Alternative Languages  C, Rust
Prerequisites          Basic Python, high school algebra
Main Book              Grokking Deep Learning by Andrew Trask
Knowledge Area         Artificial Neurons / Logic Gates

Learning Objectives

After completing this project, you will be able to:

  1. Explain the perceptron algorithm - Describe how a single neuron computes its output from inputs, weights, and bias
  2. Implement forward propagation manually - Write output = (input1 * weight1) + (input2 * weight2) + bias without any library help
  3. Derive and apply the Delta Rule - Calculate weight updates based on error and learning rate
  4. Understand linear separability - Explain why single neurons can solve AND/OR but not XOR
  5. Train a model to convergence - Iterate until the neuron correctly predicts all truth table entries
  6. Connect math to AI intuition - See exactly how numbers changing leads to “learning”

The Core Question You’re Answering

“How can multiplying numbers lead to ‘decisions’?”

Before you write a single line of code, internalize this truth: a single neuron making a decision is just drawing a line.

Think of the input space as a 2D plane where the x-axis is input1 and the y-axis is input2. The four possible inputs for a logic gate are the corners of a unit square:

    input2
      ^
    1 |   (0,1)-----(1,1)
      |     |         |
      |     |         |
    0 |   (0,0)-----(1,0)
      +----------------------> input1
          0         1

A single neuron draws a line (or in higher dimensions, a hyperplane) that separates “positive” examples from “negative” examples. The weights and bias define where that line sits.

When you train a perceptron, you’re adjusting the line until it correctly separates all the positive examples from the negative ones.

Your task: Build the machine that finds that line automatically.


Concepts You Must Understand First

Stop and research these before coding:

1. The Dot Product and Weighted Sum

The fundamental operation of a neuron is the weighted sum: multiply each input by its corresponding weight, then add everything together (including the bias).

z = (x1 * w1) + (x2 * w2) + ... + (xn * wn) + b

This is a dot product plus a bias term. The dot product measures “how aligned” two vectors are.
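A minimal sketch of the weighted sum in pure Python. The function name `weighted_sum` and the sample values are illustrative assumptions, not part of the project spec.

```python
def weighted_sum(inputs, weights, bias):
    """Return the dot product of inputs and weights, plus the bias."""
    total = bias
    for x, w in zip(inputs, weights):
        total += x * w
    return total


# Two inputs, both "on", with hand-picked weights (illustrative values).
z = weighted_sum([1, 1], [0.5, 0.5], -0.75)
print(z)  # 0.25
```

The loop works for any number of inputs, so the same function covers the 2-input gates in this project and the general n-input formula.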

Why it matters: The dot product is the building block of nearly every neural network. Hidden layers and attention mechanisms are built from dot products, and even an embedding lookup can be written as a dot product with a one-hot vector.

Book Reference: “Grokking Deep Learning” by Andrew Trask - Chapter 3: “Introduction to Neural Prediction”

2. The Step Activation Function

After computing the weighted sum, we need to make a decision: is this input “positive” or “negative”? The step function does exactly this:

         1  if z >= threshold
step(z) =
         0  if z < threshold

Often, we set the threshold to 0 and absorb it into the bias:

         1  if z >= 0
step(z) =
         0  if z < 0

Visualization:

  output
    ^
  1 |         +------------
    |         |
    |         |
  0 |---------+
    +-------------------> z
              0

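The step function jumps at z=0 and is flat everywhere else, so its slope gives gradient-based training nothing to work with. That is why modern networks use activations such as sigmoid or ReLU instead. But for perceptrons learning logic gates, step works perfectly.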
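In code, the step function is a one-liner. This sketch keeps an optional `threshold` parameter (an illustrative addition) to show the general form; with the default of 0 it matches the second definition above.

```python
def step(z, threshold=0.0):
    """Return 1 when z reaches the threshold, otherwise 0."""
    return 1 if z >= threshold else 0


print([step(z) for z in (-0.75, -0.25, 0.0, 0.25)])  # [0, 0, 1, 1]
```

Note that z = 0 maps to 1 because the comparison is `>=`. That convention matters when you trace updates by hand.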

Book Reference: “Neural Networks and Deep Learning” by Michael Nielsen - Chapter 1, Section on “Perceptrons”

3. Error Calculation

Error is the difference between what you wanted and what you got:

error = target - prediction

For binary outputs (0 or 1):

  • If target=1 and prediction=0: error = 1 (we need to increase the output)
  • If target=0 and prediction=1: error = -1 (we need to decrease the output)
  • If target=prediction: error = 0 (no change needed)

Why it matters: Error is the signal that drives learning. Without knowing how wrong you are, you can’t improve.

Book Reference: “Grokking Deep Learning” by Andrew Trask - Chapter 4: “Introduction to Neural Learning”

4. The Perceptron Learning Algorithm (Delta Rule)

The Perceptron Learning Rule states:

w_new = w_old + (learning_rate * error * input)
b_new = b_old + (learning_rate * error)

Intuition:

  • If error > 0 (predicted too low), increase weights for inputs that were “on” (input=1)
  • If error < 0 (predicted too high), decrease weights for inputs that were “on”
  • Inputs that were “off” (input=0) don’t change their weights (multiplying by 0)

Why this works: When an input contributed to a wrong prediction:

  • If the input was 1 and we predicted 0 (should be 1), increase that weight so next time the weighted sum is higher
  • If the input was 1 and we predicted 1 (should be 0), decrease that weight so next time the weighted sum is lower
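The rule above fits in a few lines. This is a sketch rather than the required design: the function name `perceptron_update`, the list-based signature, and the sample numbers are all illustrative assumptions.

```python
def perceptron_update(weights, bias, inputs, target, prediction, learning_rate):
    """Apply one perceptron update and return the new (weights, bias)."""
    error = target - prediction  # -1, 0, or +1 for binary outputs
    new_weights = [w + learning_rate * error * x for w, x in zip(weights, inputs)]
    new_bias = bias + learning_rate * error
    return new_weights, new_bias


# Input (0, 1) was predicted 0 but should be 1: only w2 and the bias move.
weights, bias = perceptron_update([0.5, 0.5], 0.0, [0, 1], target=1,
                                  prediction=0, learning_rate=0.25)
print(weights, bias)  # [0.5, 0.75] 0.25
```

The weight for the input that was "off" stays at 0.5, exactly as the intuition above predicts.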

Book Reference: “Neural Networks and Deep Learning” by Michael Nielsen - Chapter 1: “The Perceptron Learning Algorithm”

5. Linear Separability

A problem is linearly separable if you can draw a straight line (or hyperplane in higher dimensions) to separate the positive and negative examples.

AND Gate (linearly separable):

    x2
    ^
  1 |  O (0,1)     X (1,1)   <- One output is 1
    |
    |
  0 |  O (0,0)     O (1,0)   <- All these outputs are 0
    +----------------------> x1
       0           1

O = output 0
X = output 1

A line can separate the X from the Os:
    x2
    ^
  1 |  O     \     X
    |            \
    |                \
  0 |  O           O     \
    +----------------------------> x1

XOR Gate (NOT linearly separable):

    x2
    ^
  1 |  X (0,1)     O (1,1)
    |
    |
  0 |  O (0,0)     X (1,0)
    +----------------------> x1

No single straight line can separate the Xs from the Os!
They are diagonally opposite.

This is the limitation Minsky and Papert analyzed in 1969, and it contributed to the first “AI Winter” of the 1970s.
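You can check this claim experimentally before training anything. The sketch below tries every combination of weights on a coarse grid and reports whether any of them reproduces a truth table. The helper name `find_separating_line` and the grid range are assumptions made for illustration, and a grid search is only a demonstration: it cannot prove that no line exists.

```python
def step(z):
    return 1 if z >= 0 else 0


def find_separating_line(truth_table, grid):
    """Try every (w1, w2, b) on the grid; return the first that fits, else None."""
    for w1 in grid:
        for w2 in grid:
            for b in grid:
                if all(step(x1 * w1 + x2 * w2 + b) == target
                       for (x1, x2), target in truth_table):
                    return w1, w2, b
    return None


AND = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
XOR = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]
GRID = [i / 4 for i in range(-8, 9)]  # -2.0 to 2.0 in steps of 0.25

print(find_separating_line(AND, GRID) is not None)  # True
print(find_separating_line(XOR, GRID))              # None
```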

Book Reference: “Grokking Deep Learning” by Andrew Trask - Chapter 3: “Linear Separability”


Deep Theoretical Foundation

History of the Perceptron (Rosenblatt 1958)

In 1958, Frank Rosenblatt at Cornell Aeronautical Laboratory created the Perceptron - one of the first algorithms that could learn its weights from data. It was inspired by how neurons in the brain work.

   Historical Timeline of Neural Networks
   ┌─────────────────────────────────────────────────────────────────┐
   │                                                                 │
   │  1943: McCulloch-Pitts neuron (theoretical model)               │
   │    │                                                            │
   │    ▼                                                            │
   │  1958: Rosenblatt's Perceptron (first learning algorithm)       │
   │    │                                                            │
   │    ▼                                                            │
   │  1969: Minsky & Papert "Perceptrons" book (XOR problem)         │
   │    │                                                            │
   │    ▼                                                            │
   │  1969-1986: "AI Winter" (research funding dried up)             │
   │    │                                                            │
   │    ▼                                                            │
   │  1986: Rumelhart, Hinton, Williams (Backpropagation)            │
   │    │                                                            │
   │    ▼                                                            │
   │  2012: AlexNet (Deep Learning Renaissance)                      │
   │    │                                                            │
   │    ▼                                                            │
   │  Today: Transformers, LLMs, etc.                                │
   │                                                                 │
   └─────────────────────────────────────────────────────────────────┘

Rosenblatt’s perceptron was physical hardware - the Mark I Perceptron had 400 photocells connected to artificial neurons whose adjustable weights were implemented as potentiometers (variable resistors). It could learn to recognize letters.

The perceptron was overhyped. The New York Times declared it the “embryo of an electronic computer that [the Navy] expects will be able to walk, talk, see, write, reproduce itself and be conscious of its existence.”

Then came the crash.

The Minsky-Papert Book and the First AI Winter

In 1969, Marvin Minsky and Seymour Papert published “Perceptrons,” a mathematical analysis showing the fundamental limitations of single-layer perceptrons.

Their key result: A single perceptron cannot learn XOR because XOR is not linearly separable.

This devastated AI research funding. If neural networks couldn’t even learn XOR, how could they learn anything useful?

What Minsky and Papert actually proved was technically correct but practically misleading. They acknowledged that multi-layer perceptrons (what we now call neural networks) could solve XOR, but dismissed them because no learning algorithm for training multi-layer perceptrons was known at the time.

They were wrong. The backpropagation algorithm was discovered (and forgotten, and rediscovered) multiple times before being popularized in 1986.

The lesson: Understanding the perceptron deeply - including its limitations - is essential for understanding why we need multiple layers and more sophisticated architectures.

Mathematical Formulation

A perceptron with n inputs computes:

                   n
           z = b + Σ (xi * wi)
                  i=1

           y = step(z)

Where:

  • xi = input i (binary: 0 or 1 for logic gates)
  • wi = weight for input i (real number, learned)
  • b = bias (real number, learned)
  • z = weighted sum (real number)
  • y = output (binary: 0 or 1 after step function)

ASCII Diagram of a 2-Input Perceptron:

                    ┌─────────────────────────────────────────────┐
                    │                                             │
    Input x1 ──────►│  x1 * w1 ──────┐                            │
                    │                ▼                            │
                    │             ┌─────┐   ┌─────────┐  Output   │
    Input x2 ──────►│  x2 * w2 ──►│  Σ  │──►│ step(z) │───────────┼──► y
                    │             └─────┘   └─────────┘           │
                    │                ▲                            │
    Bias 1 ────────►│  b ────────────┘                            │
                    │                                             │
                    └─────────────────────────────────────────────┘

    z = (x1 * w1) + (x2 * w2) + b
    y = step(z) = 1 if z >= 0 else 0

The Decision Boundary

The perceptron decides y = 1 when:

z >= 0
(x1 * w1) + (x2 * w2) + b >= 0

Rearranging to see the line equation (assuming w2 > 0; if w2 < 0, dividing by it flips the inequality):

x2 >= (-w1/w2)*x1 + (-b/w2)

This is a line with:

  • Slope: -w1/w2
  • Intercept: -b/w2

Example: Trained OR Gate

After training, let’s say: w1 = 1.5, w2 = 1.5, b = -1.0

Decision boundary: 1.5*x1 + 1.5*x2 - 1.0 = 0

Rearranging: x2 = -x1 + 0.67

    x2
    ^
  1 |  X (0,1)           X (1,1)    <- both have output 1
    |
0.67|  \                            <- boundary crosses the x2 axis at 0.67
    |      \
    |          \
  0 |  O (0,0)     \     X (1,0)    <- (0,0) is 0, (1,0) is 1
    +---------------------------------> x1
       0          0.67   1

Points above/right of line → output 1
Points below/left of line → output 0
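To connect the rearranged inequality to the numbers in this example, the sketch below computes the slope and intercept from the weights and then classifies each corner by checking which side of the line it falls on. `boundary_line` is a hypothetical helper name, and the side test as written is only valid when w2 > 0.

```python
def boundary_line(w1, w2, b):
    """Return (slope, intercept) of the line x1*w1 + x2*w2 + b = 0.

    Assumes w2 != 0; a vertical boundary has no slope-intercept form.
    """
    return -w1 / w2, -b / w2


# The example OR weights from the text.
slope, intercept = boundary_line(1.5, 1.5, -1.0)
print(slope, round(intercept, 2))  # -1.0 0.67

# With w2 > 0, a point is classified 1 when it lies on or above the line.
corners = [(0, 0), (0, 1), (1, 0), (1, 1)]
sides = [1 if x2 >= slope * x1 + intercept else 0 for x1, x2 in corners]
print(sides)  # [0, 1, 1, 1]
```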

Why XOR Fails

For XOR:

  • (0,0) → 0
  • (0,1) → 1
  • (1,0) → 1
  • (1,1) → 0
    x2
    ^
  1 |  X (0,1)     O (1,1)
    |     ┌─────────────┐
    |     │ No single   │
    |     │ line works! │
    |     └─────────────┘
  0 |  O (0,0)     X (1,0)
    +----------------------> x1

The X points are on opposite corners.
Any line that puts both X points on one side
also puts at least one O point on that side.

This is why XOR required multi-layer perceptrons (hidden layers) - they can combine several lines into a non-linear decision boundary.
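The same conclusion follows from algebra. Writing one inequality per row of the XOR truth table, with the convention step(z) = 1 when z >= 0, gives a system that no choice of w1, w2, b can satisfy. A short derivation:

```latex
\begin{aligned}
(0,0) \to 0 &: \quad b < 0 \\
(0,1) \to 1 &: \quad w_2 + b \ge 0 \\
(1,0) \to 1 &: \quad w_1 + b \ge 0 \\
(1,1) \to 0 &: \quad w_1 + w_2 + b < 0 \\[4pt]
\text{Adding the middle two:} &\quad w_1 + w_2 + 2b \ge 0
  \;\Rightarrow\; w_1 + w_2 + b \ge -b > 0
\end{aligned}
```

The last line says w1 + w2 + b is positive, which contradicts the fourth inequality. So the failure is not a matter of training longer or picking a better learning rate.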

The Delta Rule Derivation

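The perceptron learning rule has the same form as the delta rule (Widrow-Hoff), which is gradient descent on the squared error of a linear unit. Rosenblatt didn’t frame it that way.

For the step function, we can’t compute a useful gradient: its slope is zero everywhere except at z = 0, where it jumps. So the perceptron keeps the same update but plugs in the thresholded output y, which works as a heuristic: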

Update Rule:

Δwi = η * (t - y) * xi
wi(new) = wi(old) + Δwi

Where:

  • η (eta) = learning rate (typically 0.1 to 1.0)
  • t = target (expected output)
  • y = predicted output
  • xi = input

Intuition:

  • If t = 1 and y = 0: error = 1, so we add η * xi to each weight. This makes z larger next time for this input pattern.
  • If t = 0 and y = 1: error = -1, so we subtract η * xi from each weight. This makes z smaller next time.
  • If t = y: error = 0, no change.

Convergence Theorem: The perceptron convergence theorem (Novikoff, 1962) proves that if the training data is linearly separable, the perceptron learning algorithm will converge to a solution in finite iterations.
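To watch the theorem play out, here is a minimal training sketch that applies the update rule until an epoch passes with no errors. The function signature, the return value, and the fixed starting weights are assumptions chosen so the run is repeatable; your own design may differ.

```python
def step(z):
    return 1 if z >= 0 else 0


def train(samples, w1, w2, b, learning_rate=0.1, max_epochs=1000):
    """Train until an epoch has no errors; return (w1, w2, b, epochs_used).

    epochs_used is None if max_epochs passes without a clean epoch.
    """
    for epoch in range(1, max_epochs + 1):
        errors = 0
        for (x1, x2), target in samples:
            prediction = step(x1 * w1 + x2 * w2 + b)
            error = target - prediction
            if error != 0:
                w1 += learning_rate * error * x1
                w2 += learning_rate * error * x2
                b += learning_rate * error
                errors += 1
        if errors == 0:
            return w1, w2, b, epoch
    return w1, w2, b, None


AND = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
XOR = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]

# Fixed starting weights (arbitrary) so the run is repeatable.
print(train(AND, w1=-0.15, w2=0.32, b=0.05)[3])  # 6
print(train(XOR, w1=-0.15, w2=0.32, b=0.05)[3])  # None
```

AND is linearly separable, so the loop stops after a handful of epochs. XOR is not, so the theorem gives no guarantee and the loop runs out of epochs.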


Real World Outcome

You’ll run a script that starts with random garbage weights (guessing randomly) and prints its “learning process” until it perfectly mimics a logic gate.

Example Output (OR Gate):

$ python manual_neuron.py --gate OR

========================================
        PERCEPTRON TRAINING: OR GATE
========================================

Truth Table for OR:
  [0, 0] -> 0
  [0, 1] -> 1
  [1, 0] -> 1
  [1, 1] -> 1

Initial Weights (random):
  w1 = 0.23
  w2 = -0.47
  b  = 0.15
  Learning Rate: 0.1

----------------------------------------
Epoch 1:
  Input=[0, 0] z=0.15 Predicted=1 Target=0 Error=-1
    -> UPDATING: w1=0.23->0.23, w2=-0.47->-0.47, b=0.15->0.05
  Input=[0, 1] z=-0.42 Predicted=0 Target=1 Error=1
    -> UPDATING: w1=0.23->0.23, w2=-0.47->-0.37, b=0.05->0.15
  Input=[1, 0] z=0.38 Predicted=1 Target=1 Error=0 (Correct!)
  Input=[1, 1] z=0.01 Predicted=1 Target=1 Error=0 (Correct!)
  Epoch 1 Errors: 2/4

Epoch 2:
  Input=[0, 0] z=0.15 Predicted=1 Target=0 Error=-1
    -> UPDATING: w1=0.23->0.23, w2=-0.37->-0.37, b=0.15->0.05
  Input=[0, 1] z=-0.32 Predicted=0 Target=1 Error=1
    -> UPDATING: w1=0.23->0.23, w2=-0.37->-0.27, b=0.05->0.15
  Input=[1, 0] z=0.38 Predicted=1 Target=1 Error=0 (Correct!)
  Input=[1, 1] z=0.11 Predicted=1 Target=1 Error=0 (Correct!)
  Epoch 2 Errors: 2/4

... (epochs 3 to 8 omitted) ...

Epoch 9:
  Input=[0, 0] z=-0.05 Predicted=0 Target=0 Error=0 (Correct!)
  Input=[0, 1] z=0.08 Predicted=1 Target=1 Error=0 (Correct!)
  Input=[1, 0] z=0.18 Predicted=1 Target=1 Error=0 (Correct!)
  Input=[1, 1] z=0.31 Predicted=1 Target=1 Error=0 (Correct!)
  Epoch 9 Errors: 0/4

========================================
           TRAINING COMPLETE!
========================================

Final Weights:
  w1 = 0.23
  w2 = 0.13
  b  = -0.05

Decision Boundary Equation:
  0.23*x1 + 0.13*x2 - 0.05 = 0

----------------------------------------
            TESTING MODEL
----------------------------------------

[0, 0] -> z=-0.05 -> step -> 0 (Expected: 0) ✓
[0, 1] -> z=0.08  -> step -> 1 (Expected: 1) ✓
[1, 0] -> z=0.18  -> step -> 1 (Expected: 1) ✓
[1, 1] -> z=0.31  -> step -> 1 (Expected: 1) ✓

ALL TESTS PASSED!
The perceptron has learned the OR function.

Example Output (AND Gate):

$ python manual_neuron.py --gate AND

========================================
        PERCEPTRON TRAINING: AND GATE
========================================

Truth Table for AND:
  [0, 0] -> 0
  [0, 1] -> 0
  [1, 0] -> 0
  [1, 1] -> 1

Initial Weights (random):
  w1 = -0.15
  w2 = 0.32
  b  = 0.05
  Learning Rate: 0.1

... (training epochs) ...

Epoch 5:
  Input=[0, 0] z=-0.15 Predicted=0 Target=0 Error=0 (Correct!)
  Input=[0, 1] z=0.07 Predicted=1 Target=0 Error=-1
    -> UPDATING...
  ...

Epoch 6: SOLVED!

Final Weights:
  w1 = 0.15
  w2 = 0.12
  b  = -0.25

Testing:
[0, 0] -> 0 (Expected: 0) ✓
[0, 1] -> 0 (Expected: 0) ✓
[1, 0] -> 0 (Expected: 0) ✓
[1, 1] -> 1 (Expected: 1) ✓

ALL TESTS PASSED!

Example Output (XOR - Expected Failure):

$ python manual_neuron.py --gate XOR

========================================
        PERCEPTRON TRAINING: XOR GATE
========================================

Truth Table for XOR:
  [0, 0] -> 0
  [0, 1] -> 1
  [1, 0] -> 1
  [1, 1] -> 0

Initial Weights (random):
  w1 = 0.12
  w2 = 0.45
  b  = -0.08

... (training) ...

Epoch 100: Errors: 1/4
Epoch 200: Errors: 2/4
Epoch 500: Errors: 1/4
Epoch 1000: Still not converged!

========================================
       TRAINING FAILED (as expected)
========================================

XOR is not linearly separable.
A single perceptron cannot learn XOR.
You need hidden layers (multi-layer perceptron).

This is the Minsky-Papert limitation!
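The sample runs above select the gate with a `--gate` flag. One way to support that is the standard library's `argparse` module, sketched below. Only `--gate` appears in the sample runs; the `--learning-rate` and `--max-epochs` flags and the `parse_args` helper are hypothetical additions for illustration.

```python
import argparse

GATE_NAMES = ["AND", "OR", "NAND", "NOR", "XOR"]


def parse_args(argv=None):
    """Parse command-line options; argv=None means read sys.argv."""
    parser = argparse.ArgumentParser(
        description="Train a single perceptron on a logic gate.")
    parser.add_argument("--gate", choices=GATE_NAMES, default="OR",
                        help="truth table to learn")
    # The two flags below are hypothetical additions, not shown in the sample runs.
    parser.add_argument("--learning-rate", type=float, default=0.1)
    parser.add_argument("--max-epochs", type=int, default=1000)
    return parser.parse_args(argv)


args = parse_args(["--gate", "XOR"])
print(args.gate, args.learning_rate, args.max_epochs)  # XOR 0.1 1000
```

Passing an explicit list to `parse_args` makes the parser easy to exercise without touching the real command line.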

Solution Architecture

High-Level Design Approach

This section describes what your solution should look like, not how to implement it.

Architecture Diagram:

┌─────────────────────────────────────────────────────────────────────────┐
│                         PERCEPTRON TRAINING SYSTEM                      │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                         │
│   ┌─────────────────┐                                                   │
│   │  Training Data  │                                                   │
│   │  ┌───────────┐  │                                                   │
│   │  │ Inputs    │  │         ┌──────────────────────────────┐          │
│   │  │ [0,0]     │  │         │         PERCEPTRON           │          │
│   │  │ [0,1]     │──┼────────►│  ┌────┐    ┌────┐   ┌────┐   │          │
│   │  │ [1,0]     │  │         │  │ w1 │    │ w2 │   │ b  │   │          │
│   │  │ [1,1]     │  │         │  └──┬─┘    └──┬─┘   └──┬─┘   │          │
│   │  └───────────┘  │         │     │         │        │     │          │
│   │  ┌───────────┐  │         │     ▼         ▼        ▼     │          │
│   │  │ Targets   │  │         │    ┌──────────────────────┐  │          │
│   │  │ 0,1,1,1   │  │         │    │ z = x1*w1 + x2*w2 + b│  │          │
│   │  └───────────┘  │         │    └──────────┬───────────┘  │          │
│   └────────┬────────┘         │               │              │          │
│            │                  │               ▼              │          │
│            │                  │    ┌───────────────────┐     │          │
│            │                  │    │ y = step(z)       │     │          │
│            │                  │    │   1 if z >= 0     │     │          │
│            │                  │    │   0 if z < 0      │     │          │
│            │                  │    └────────┬──────────┘     │          │
│            │                  │             │                │          │
│            │                  └─────────────┼────────────────┘          │
│            │                                │                           │
│            │                                ▼                           │
│            │                    ┌────────────────────┐                  │
│            │                    │    Prediction y    │                  │
│            │                    └─────────┬──────────┘                  │
│            │                              │                             │
│            ▼                              ▼                             │
│   ┌─────────────────────────────────────────────────┐                   │
│   │           ERROR CALCULATION                     │                   │
│   │                                                 │                   │
│   │   error = target - prediction                   │                   │
│   │                                                 │                   │
│   └───────────────────────┬─────────────────────────┘                   │
│                           │                                             │
│                           │ if error != 0                               │
│                           ▼                                             │
│   ┌─────────────────────────────────────────────────┐                   │
│   │           WEIGHT UPDATE (Delta Rule)            │                   │
│   │                                                 │                   │
│   │   w1 = w1 + (learning_rate * error * x1)        │                   │
│   │   w2 = w2 + (learning_rate * error * x2)        │                   │
│   │   b  = b  + (learning_rate * error * 1)         │                   │
│   │                                                 │                   │
│   └─────────────────────────────────────────────────┘                   │
│                           │                                             │
│                           │ Loop until all predictions correct          │
│                           ▼                                             │
│   ┌─────────────────────────────────────────────────┐                   │
│   │           CONVERGENCE CHECK                     │                   │
│   │                                                 │                   │
│   │   If all 4 inputs predict correctly:            │                   │
│   │       STOP - Model is trained                   │                   │
│   │   Else:                                         │                   │
│   │       Continue to next epoch                    │                   │
│   │                                                 │                   │
│   └─────────────────────────────────────────────────┘                   │
│                                                                         │
└─────────────────────────────────────────────────────────────────────────┘

Data Structures Needed

┌───────────────────────────────────────────────────────────┐
│                    DATA STRUCTURES                        │
├───────────────────────────────────────────────────────────┤
│                                                           │
│  1. TRAINING DATA                                         │
│     ┌───────────┬───────────┐                             │
│     │  inputs   │  targets  │                             │
│     ├───────────┼───────────┤                             │
│     │  [0, 0]   │   0 or 1  │                             │
│     │  [0, 1]   │   0 or 1  │  <- depends on gate         │
│     │  [1, 0]   │   0 or 1  │                             │
│     │  [1, 1]   │   0 or 1  │                             │
│     └───────────┴───────────┘                             │
│                                                           │
│  2. MODEL PARAMETERS (floats, updated during training)    │
│     • w1: weight for input 1                              │
│     • w2: weight for input 2                              │
│     • b:  bias                                            │
│                                                           │
│  3. HYPERPARAMETERS (constants, set before training)      │
│     • learning_rate: typically 0.1 to 1.0                 │
│     • max_epochs: limit iterations (e.g., 1000)           │
│                                                           │
└───────────────────────────────────────────────────────────┘
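One possible way to lay these structures out in plain Python is shown below. The names `INPUTS`, `GATE_TARGETS`, `params`, and `hyperparams` are illustrative choices, not requirements, and the zero starting values are placeholders.

```python
# The four input pairs are the same for every gate.
INPUTS = [(0, 0), (0, 1), (1, 0), (1, 1)]

# Targets listed in the same order as INPUTS.
GATE_TARGETS = {
    "AND":  [0, 0, 0, 1],
    "OR":   [0, 1, 1, 1],
    "NAND": [1, 1, 1, 0],
    "NOR":  [1, 0, 0, 0],
    "XOR":  [0, 1, 1, 0],
}

# Model parameters: plain floats that training will overwrite.
params = {"w1": 0.0, "w2": 0.0, "b": 0.0}

# Hyperparameters: fixed before training starts.
hyperparams = {"learning_rate": 0.1, "max_epochs": 1000}

for (x1, x2), target in zip(INPUTS, GATE_TARGETS["OR"]):
    print([x1, x2], "->", target)
```

Keeping inputs and targets in matching order lets a single `zip` pair them up, which is the pattern the training loop in Phase 4 relies on.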

Function Breakdown

┌───────────────────────────────────────────────────────────┐
│                     FUNCTION DESIGN                       │
├───────────────────────────────────────────────────────────┤
│                                                           │
│  step(z) -> int                                           │
│    Input:  z (weighted sum, float)                        │
│    Output: 0 or 1                                         │
│    Logic:  return 1 if z >= 0 else 0                      │
│                                                           │
│  ─────────────────────────────────────────────────────    │
│                                                           │
│  forward(x1, x2, w1, w2, b) -> (z, y)                     │
│    Input:  inputs x1, x2; weights w1, w2; bias b          │
│    Output: weighted sum z, prediction y                   │
│    Logic:  z = x1*w1 + x2*w2 + b                          │
│            y = step(z)                                    │
│                                                           │
│  ─────────────────────────────────────────────────────    │
│                                                           │
│  update_weights(w1, w2, b, x1, x2, error, lr) -> tuple    │
│    Input:  current weights, inputs, error, learning rate  │
│    Output: new (w1, w2, b)                                │
│    Logic:  Apply Delta Rule                               │
│                                                           │
│  ─────────────────────────────────────────────────────    │
│                                                           │
│  train(data, targets, lr, max_epochs) -> (w1, w2, b)      │
│    Input:  training data, targets, hyperparameters        │
│    Output: trained weights and bias                       │
│    Logic:  Loop epochs, update on errors, check converge  │
│                                                           │
│  ─────────────────────────────────────────────────────    │
│                                                           │
│  test(data, targets, w1, w2, b) -> bool                   │
│    Input:  test data, expected outputs, trained params    │
│    Output: True if all correct, False otherwise           │
│    Logic:  Run forward pass on each input, compare        │
│                                                           │
└───────────────────────────────────────────────────────────┘

Data Flow Diagram

┌───────────────────────────────────────────────────────────────────────┐
│                         TRAINING FLOW                                 │
└───────────────────────────────────────────────────────────────────────┘

  Initialize           For each epoch          For each sample
  ┌────────┐           ┌────────────┐          ┌────────────────┐
  │ Random │           │  Reset     │          │ Get (x1, x2),  │
  │ w1,w2,b│──────────►│  epoch_err │─────────►│ target         │
  └────────┘           │  counter   │          └───────┬────────┘
                       └────────────┘                  │
                                                       ▼
                                            ┌────────────────────┐
                                            │ Forward Pass       │
                                            │ z = x1*w1+x2*w2+b  │
                                            │ y = step(z)        │
                                            └───────┬────────────┘
                                                    │
                                                    ▼
                                            ┌────────────────────┐
                                            │ Calculate Error    │
                                            │ err = target - y   │
                                            └───────┬────────────┘
                                                    │
                                          ┌─────────┴─────────┐
                                          │                   │
                                    err != 0?           err == 0
                                          │                   │
                                          ▼                   ▼
                                   ┌─────────────┐    ┌─────────────┐
                                   │ Update      │    │ No change   │
                                   │ weights     │    │ continue    │
                                   └──────┬──────┘    └──────┬──────┘
                                          │                  │
                                          └────────┬─────────┘
                                                   │
                                                   ▼
                                          ┌────────────────┐
                                          │ Next sample or │
                                          │ next epoch     │
                                          └────────┬───────┘
                                                   │
                                         ┌─────────┴─────────┐
                                         │                   │
                                   All correct?        Still errors
                                         │                   │
                                         ▼                   ▼
                                  ┌─────────────┐    ┌─────────────┐
                                  │ STOP        │    │ Continue    │
                                  │ Return      │    │ training    │
                                  │ weights     │    └─────────────┘
                                  └─────────────┘

Phased Implementation Guide

Phase 1: Forward Pass (1-2 hours)

Goal: Implement the core computation of a neuron.

  1. Write the step function:
    • Takes a single float z
    • Returns 1 if z >= 0, else 0
  2. Write the forward function:
    • Takes inputs x1, x2 and parameters w1, w2, b
    • Computes z = x1*w1 + x2*w2 + b
    • Returns both the weighted sum and the prediction, as (z, step(z))
  3. Test manually:
    # With w1=0.5, w2=0.5, b=-0.75
    # forward(0, 0, 0.5, 0.5, -0.75) should return (-0.75, 0)
    # forward(1, 1, 0.5, 0.5, -0.75) should return (0.25, 1)
    

Checkpoint: You should be able to manually set weights that make the forward function behave like AND or OR.
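A quick way to run this checkpoint is to print the whole truth table for a set of hand-picked weights. The sketch below assumes `forward` returns the pair (z, y) as in the function breakdown; `truth_table` is a hypothetical helper, and the OR bias of -0.25 is just one value that happens to work.

```python
def step(z):
    return 1 if z >= 0 else 0


def forward(x1, x2, w1, w2, b):
    """Return the weighted sum z and the prediction step(z)."""
    z = x1 * w1 + x2 * w2 + b
    return z, step(z)


def truth_table(w1, w2, b):
    """Collect the predictions for all four input pairs, in standard order."""
    return [forward(x1, x2, w1, w2, b)[1]
            for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]]


print(truth_table(0.5, 0.5, -0.75))  # AND: [0, 0, 0, 1]
print(truth_table(0.5, 0.5, -0.25))  # OR:  [0, 1, 1, 1]
```

Only the bias differs between the two gates here: moving it shifts the same sloped line closer to or further from the origin.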

Phase 2: Error Calculation (30 minutes)

Goal: Compute how wrong the prediction is.

  1. Write an error function or just compute inline:
    • error = target - prediction
  2. Verify all cases:
    target=0, pred=0 -> error=0  (correct, no update)
    target=1, pred=1 -> error=0  (correct, no update)
    target=1, pred=0 -> error=1  (increase output)
    target=0, pred=1 -> error=-1 (decrease output)
    

Checkpoint: Given a prediction and target, you should know whether and how to update.

Phase 3: Weight Updates (1 hour)

Goal: Implement the Delta Rule.

  1. Write the update_weights function:
    def update_weights(w1, w2, b, x1, x2, error, learning_rate):
        w1_new = w1 + learning_rate * error * x1
        w2_new = w2 + learning_rate * error * x2
        b_new = b + learning_rate * error * 1  # bias input is always 1
        return w1_new, w2_new, b_new
    
  2. Test the update logic:
    • If error=1, x1=1, lr=0.1: w1 should increase by 0.1
    • If error=-1, x1=1, lr=0.1: w1 should decrease by 0.1
    • If x1=0: w1 should not change (0 * anything = 0)

Checkpoint: Weights change in the right direction based on error.
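The three test cases listed above can be turned into assertions. This sketch repeats the `update_weights` function from the text so it runs on its own, and uses `math.isclose` because sums of 0.1 are not always exact in floating point. The starting values of 0.5 and 0.0 are arbitrary.

```python
import math


def update_weights(w1, w2, b, x1, x2, error, learning_rate):
    """Same rule as in the text: each parameter moves by lr * error * its input."""
    w1_new = w1 + learning_rate * error * x1
    w2_new = w2 + learning_rate * error * x2
    b_new = b + learning_rate * error  # bias input is always 1
    return w1_new, w2_new, b_new


# error=+1 with x1 on: w1 rises by 0.1, w2 (input off) stays put.
w1, w2, b = update_weights(0.5, 0.5, 0.0, x1=1, x2=0, error=1, learning_rate=0.1)
assert math.isclose(w1, 0.6) and w2 == 0.5 and math.isclose(b, 0.1)

# error=-1 with x1 on: w1 falls by 0.1.
w1, w2, b = update_weights(0.5, 0.5, 0.0, x1=1, x2=0, error=-1, learning_rate=0.1)
assert math.isclose(w1, 0.4) and w2 == 0.5 and math.isclose(b, -0.1)
print("update checks passed")
```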

Phase 4: Training Loop (1-2 hours)

Goal: Repeat forward → error → update until convergence.

  1. Define training data for AND, OR, NAND, NOR gates
  2. Initialize weights randomly (small values, e.g., -1 to 1)
  3. Implement the epoch loop:
    for epoch in range(max_epochs):
        errors_this_epoch = 0
        for (x1, x2), target in zip(inputs, targets):
            z, prediction = forward(x1, x2, w1, w2, b)
            error = target - prediction
            if error != 0:
                # update_weights returns new values; it does not modify in place
                w1, w2, b = update_weights(w1, w2, b, x1, x2, error, learning_rate)
                errors_this_epoch += 1
        if errors_this_epoch == 0:
            print("Converged!")
            break
    
  4. Add verbose logging to see learning progress

Checkpoint: Running the training loop on OR should converge within ~100 epochs.

Phase 5: Testing and Validation (1 hour)

Goal: Verify the trained perceptron works correctly.

  1. After training, run all 4 inputs through forward pass
  2. Compare to expected truth table
  3. Print pass/fail for each

Checkpoint: All 4 tests pass for AND, OR, NAND, NOR. XOR should fail to converge.
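The validation step might look like the sketch below. It is named `evaluate` here (the function breakdown calls it `test`), and the hand-set AND weights stand in for weights your training loop would produce; both choices are assumptions for illustration.

```python
def step(z):
    return 1 if z >= 0 else 0


def evaluate(data, targets, w1, w2, b):
    """Print one line per input and return True only if every row matches."""
    all_passed = True
    for (x1, x2), target in zip(data, targets):
        z = x1 * w1 + x2 * w2 + b
        prediction = step(z)
        passed = prediction == target
        status = "PASS" if passed else "FAIL"
        print(f"[{x1}, {x2}] -> z={z:.2f} -> step -> {prediction} "
              f"(Expected: {target}) {status}")
        all_passed = all_passed and passed
    return all_passed


DATA = [(0, 0), (0, 1), (1, 0), (1, 1)]
print(evaluate(DATA, [0, 0, 0, 1], 0.5, 0.5, -0.75))  # True: these weights implement AND
print(evaluate(DATA, [0, 1, 1, 0], 0.5, 0.5, -0.75))  # False: same weights fail on XOR
```

Returning a single boolean keeps the final "ALL TESTS PASSED!" message a simple `if` in the calling code.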


Questions to Guide Your Design

Before implementing, think through these:

Understanding the Algorithm

  1. Why random initialization?
    • What happens if you start with all zeros?
    • Why not start with “good” weights?
  2. What does the learning rate control?
    • What happens if learning_rate = 0?
    • What happens if learning_rate = 100?
    • Why is 0.1 a common choice?
  3. Why iterate through all samples before checking convergence?
    • Could you check after each sample?
    • What’s the difference between “epoch” and “iteration”?

Understanding the Math

  1. Why multiply error by input in the update rule?
    • What happens to w1 when x1=0?
    • Why is this mathematically correct?
  2. How does the bias differ from weights?
    • What does the bias “shift”?
    • Why don’t we multiply bias update by an input?
  3. What does the decision boundary look like geometrically?
    • Draw the boundary for a trained AND gate
    • How do the weights define its slope?
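If you want to check your drawing, the boundary is the set of points where z = 0. Solving w1*x1 + w2*x2 + b = 0 for x2 gives x2 = -(w1/w2)*x1 - b/w2, valid when w2 is not zero. A small sketch, where `boundary_line` is a hypothetical helper and the weights are hand-picked AND weights rather than trained ones:

```python
def boundary_line(w1, w2, b):
    """Return (slope, intercept) of w1*x1 + w2*x2 + b = 0 written as
    x2 = slope * x1 + intercept. Only defined when w2 != 0."""
    if w2 == 0:
        # The boundary is the vertical line x1 = -b / w1, which has no slope
        raise ValueError("w2 is zero: boundary is vertical")
    return -w1 / w2, -b / w2

# Hand-picked AND weights (assumption, not a trained result)
slope, intercept = boundary_line(1.0, 1.0, -1.5)
print(f"x2 = {slope} * x1 + {intercept}")
```

The ratio of the weights sets the slope, while the bias (divided by w2) sets where the line crosses the input2 axis.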

Understanding the Limits

  1. Why can’t a perceptron learn XOR?
    • Draw the 4 XOR points and try to separate them with a line
    • What would you need to separate them?
  2. What’s the minimum number of weights to learn a 2-input gate?
    • Could you do it with just w1 and w2 (no bias)?
    • When is bias essential?

Thinking Exercise

Before coding, trace this by hand:

Starting with:

  • w1 = 0.5
  • w2 = 0.5
  • b = -0.75
  • learning_rate = 0.1
  • Training for AND gate: (0,0)→0, (0,1)→0, (1,0)→0, (1,1)→1

Epoch 1 Trace:

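Input | z = x1*w1 + x2*w2 + b          | y = step(z) | Target | Error | New w1 | New w2 | New b
------+--------------------------------+-------------+--------+-------+--------+--------+------
(0,0) | 0*0.5 + 0*0.5 - 0.75 = -0.75   | 0           | 0      | 0     | 0.5    | 0.5    | -0.75
(0,1) | 0*0.5 + 1*0.5 - 0.75 = -0.25   | 0           | 0      | 0     | 0.5    | 0.5    | -0.75
(1,0) | 1*0.5 + 0*0.5 - 0.75 = -0.25   | 0           | 0      | 0     | 0.5    | 0.5    | -0.75
(1,1) | 1*0.5 + 1*0.5 - 0.75 = 0.25    | 1           | 1      | 0     | 0.5    | 0.5    | -0.75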

Result: All correct on epoch 1! The initial weights happened to be good.

Now try with different starting weights:

  • w1 = -0.2
  • w2 = 0.3
  • b = 0.1

Trace Epoch 1:

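Input | z                              | y | Target | Error | Update | New w1 | New w2 | New b
------+--------------------------------+---+--------+-------+--------+--------+--------+------
(0,0) | 0*(-0.2) + 0*0.3 + 0.1 = 0.1   | 1 | 0      | -1    | Yes    | ?      | ?      | ?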
               

Your task: Complete this trace for all 4 inputs of epoch 1. Then continue to epoch 2.

Questions while tracing:

  • Which weight changed the most after the first error?
  • Why didn’t w1 change when processing (0,0)?
  • How many epochs until all 4 are correct?
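Once you have finished the trace by hand, a short script can check your arithmetic. The sketch below is one possible checker, with `trace_epoch` as a hypothetical helper name. It is run here on the first set of starting weights (the case already worked out above), so it does not give away the second trace; swap in -0.2, 0.3, 0.1 when you are ready to compare.

```python
def step(z):
    return 1 if z >= 0 else 0

def trace_epoch(w1, w2, b, learning_rate, inputs, targets):
    """Run one epoch and return one row per sample plus the final weights."""
    rows = []
    for (x1, x2), target in zip(inputs, targets):
        z = x1 * w1 + x2 * w2 + b
        y = step(z)
        error = target - y
        if error != 0:
            # Delta Rule: each weight moves by lr * error * its input
            w1 += learning_rate * error * x1
            w2 += learning_rate * error * x2
            b += learning_rate * error  # bias input is always 1
        rows.append((x1, x2, z, y, target, error, w1, w2, b))
    return rows, (w1, w2, b)

INPUTS = [(0, 0), (0, 1), (1, 0), (1, 1)]
AND_TARGETS = [0, 0, 0, 1]

rows, final = trace_epoch(0.5, 0.5, -0.75, 0.1, INPUTS, AND_TARGETS)
for x1, x2, z, y, target, error, w1, w2, b in rows:
    print(f"({x1},{x2}) z={z:+.2f} y={y} target={target} error={error:+d} "
          f"-> w1={w1:.2f} w2={w2:.2f} b={b:.2f}")
```

To continue to epoch 2, pass the returned `final` weights back into `trace_epoch`.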

Testing Strategy

Unit Tests for Each Function

# Test step function
assert step(-1) == 0
assert step(0) == 1  # boundary case: z >= 0
assert step(0.001) == 1
assert step(-0.001) == 0

# Test forward pass
z, y = forward(0, 0, 1, 1, -1.5)  # Mimics AND
assert y == 0
z, y = forward(1, 1, 1, 1, -1.5)
assert y == 1

# Test update rule (compare floats with a tolerance rather than ==)
w1, w2, b = 0.5, 0.5, 0
w1, w2, b = update_weights(w1, w2, b, 1, 0, 1, 0.1)
assert abs(w1 - 0.6) < 1e-9  # increased because x1=1, error=1
assert abs(w2 - 0.5) < 1e-9  # unchanged because x2=0
assert abs(b - 0.1) < 1e-9   # increased because error=1

Integration Test: Train and Verify

# Train on OR gate
inputs = [(0,0), (0,1), (1,0), (1,1)]
targets = [0, 1, 1, 1]
w1, w2, b = train(inputs, targets, learning_rate=0.1, max_epochs=1000)

# Verify all predictions
for (x1, x2), target in zip(inputs, targets):
    _, prediction = forward(x1, x2, w1, w2, b)
    assert prediction == target, f"Failed on {(x1, x2)}"

Convergence Test

# AND, OR, NAND, NOR should all converge
for gate_name, gate_targets in [("AND", [0,0,0,1]), ("OR", [0,1,1,1]), ...]:
    w1, w2, b, epochs = train_with_count(inputs, gate_targets, ...)
    assert epochs < 1000, f"{gate_name} didn't converge"

# XOR should NOT converge
w1, w2, b, epochs = train_with_count(inputs, [0,1,1,0], max_epochs=1000)
assert epochs == 1000, "XOR unexpectedly converged!"
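The convergence test calls `train_with_count`, which this guide does not define. One possible shape for it is sketched below. The signature, the `seed` parameter, and the return convention are all assumptions: it returns the index of the first error-free epoch, or `max_epochs` if no epoch was ever clean, which is what the two assertions above rely on.

```python
import random

def step(z):
    return 1 if z >= 0 else 0

def train_with_count(inputs, targets, learning_rate=0.1, max_epochs=1000, seed=0):
    """Train a 2-input perceptron; return (w1, w2, b, epochs).

    epochs is the index of the first epoch with zero errors, or max_epochs
    if the loop never saw a clean epoch (assumed convention).
    """
    rng = random.Random(seed)  # seeded so runs are repeatable
    w1, w2, b = (rng.uniform(-1, 1) for _ in range(3))
    for epoch in range(max_epochs):
        errors = 0
        for (x1, x2), target in zip(inputs, targets):
            error = target - step(x1 * w1 + x2 * w2 + b)
            if error != 0:
                w1 += learning_rate * error * x1
                w2 += learning_rate * error * x2
                b += learning_rate * error
                errors += 1
        if errors == 0:
            return w1, w2, b, epoch
    return w1, w2, b, max_epochs

INPUTS = [(0, 0), (0, 1), (1, 0), (1, 1)]
*_, and_epochs = train_with_count(INPUTS, [0, 0, 0, 1])
*_, xor_epochs = train_with_count(INPUTS, [0, 1, 1, 0])
print(and_epochs, xor_epochs)
```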

Common Pitfalls and Debugging Tips

Pitfall 1: Off-by-One in Step Function

  • Symptom: Inconsistent results at z=0
  • Cause: Using > instead of >= or vice versa
  • Fix: Decide on convention (usually z >= 0 → 1) and stick to it

Pitfall 2: Forgetting to Update Bias

  • Symptom: Model doesn’t converge or converges slowly
  • Cause: Only updating w1 and w2, not b
  • Fix: Remember: b = b + lr * error * 1

Pitfall 3: Wrong Sign in Update Rule

  • Symptom: Error gets worse instead of better
  • Cause: Using prediction - target instead of target - prediction
  • Fix: Error should be positive when prediction is too low

Pitfall 4: Not Iterating Until Convergence

  • Symptom: Model seems random
  • Cause: Only running one epoch
  • Fix: Loop until zero errors in an epoch (or max epochs)

Pitfall 5: Learning Rate Too High

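  • Symptom: Weights swing wildly from one update to the next
  • Cause: learning_rate > 1 or very large values
  • Fix: Use lr in range 0.01 to 1.0 (start with 0.1)
  • Note: On a linearly separable gate, a step-function perceptron still converges for any positive rate; never settling at all is typical of gradient-based training, not of this algorithm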

Pitfall 6: Learning Rate Too Low

  • Symptom: Takes thousands of epochs to converge
  • Cause: learning_rate too small (e.g., 0.001)
  • Fix: For simple logic gates, 0.1 to 1.0 works well

Debugging Technique: Print Everything

When stuck, print at each step:

print(f"Input: ({x1}, {x2})")
print(f"Weights before: w1={w1:.3f}, w2={w2:.3f}, b={b:.3f}")
print(f"z = {x1}*{w1} + {x2}*{w2} + {b} = {z:.3f}")
print(f"y = step({z:.3f}) = {y}")
print(f"Target: {target}, Error: {error}")
if error != 0:
    print(f"Updating: w1 += {lr}*{error}*{x1} = {lr*error*x1:.3f}")

The Interview Questions They’ll Ask

Prepare to answer these:

1. “Explain how a perceptron learns. Walk me through one update step.”

Key points to cover:

  • Forward pass: weighted sum + step function
  • Error calculation: target - prediction
  • Weight update: Delta Rule (w += lr * error * input)
  • Why inputs of 0 don’t change their weights

2. “What is the decision boundary of a perceptron?”

Key insight:

  • It’s a hyperplane (line in 2D) defined by w1*x1 + w2*x2 + b = 0
  • Weights define the orientation (slope)
  • Bias shifts the line

3. “Why can’t a single perceptron learn XOR?”

Key insight:

  • XOR is not linearly separable
  • Positive examples are on opposite corners
  • No single line can separate them
  • Need hidden layers (MLP) to create non-linear boundaries
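The claim can be argued directly from the four constraints. With the z >= 0 → 1 convention, XOR needs b < 0, w2 + b >= 0, w1 + b >= 0, and w1 + w2 + b < 0. Adding the two middle inequalities gives w1 + w2 + 2b >= 0, so w1 + w2 + b >= -b > 0, which contradicts the last one. As an empirical companion to that argument, the sketch below searches a coarse grid of weights; the grid range and the helper names are assumptions, and a grid search is evidence rather than a proof.

```python
def step(z):
    return 1 if z >= 0 else 0

INPUTS = [(0, 0), (0, 1), (1, 0), (1, 1)]

def solves(w1, w2, b, targets):
    """True if these weights reproduce every row of the truth table."""
    return all(step(x1 * w1 + x2 * w2 + b) == t
               for (x1, x2), t in zip(INPUTS, targets))

# Quarter steps from -2.0 to 2.0 are exact in binary floating point
GRID = [i / 4 for i in range(-8, 9)]

def count_solutions(targets):
    return sum(solves(w1, w2, b, targets)
               for w1 in GRID for w2 in GRID for b in GRID)

and_count = count_solutions([0, 0, 0, 1])
xor_count = count_solutions([0, 1, 1, 0])
print(f"AND: {and_count} working weight triples, XOR: {xor_count}")
```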

4. “What’s the difference between a perceptron and a modern neural network neuron?”

Key insight:

  • Perceptron: step function (non-differentiable)
  • Modern: sigmoid/ReLU (differentiable for gradient descent)
  • Perceptron: single layer
  • Modern: multiple layers with backpropagation

5. “What is the role of the bias term?”

Key insight:

  • Bias shifts the decision boundary away from the origin
  • Without bias, the hyperplane must pass through origin
  • Example: AND needs a negative bias so that the neuron fires only when both inputs are 1
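A quick way to see the "must pass through the origin" point: with no bias, the input (0,0) always gives z = 0 whatever the weights are, so under the z >= 0 → 1 convention the output there is stuck at 1. The sketch below uses a hypothetical `forward_no_bias` helper and a handful of arbitrary sample weights.

```python
def step(z):
    return 1 if z >= 0 else 0

def forward_no_bias(x1, x2, w1, w2):
    # Same neuron as in this project, but with the bias term removed
    return step(x1 * w1 + x2 * w2)

SAMPLE_WEIGHTS = (-5.0, -0.3, 0.0, 2.0)  # arbitrary values for illustration

# Collect every output the bias-free neuron can produce at the origin
outputs_at_origin = {forward_no_bias(0, 0, w1, w2)
                     for w1 in SAMPLE_WEIGHTS for w2 in SAMPLE_WEIGHTS}
print(outputs_at_origin)
```

Because AND and OR both require output 0 at (0,0), no choice of w1 and w2 alone can produce them under this convention.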

6. “How does the learning rate affect training?”

Key insight:

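  • Too high: each update moves the boundary a long way, so it can overshoot and oscillate; in gradient-based training it may never converge
  • Too low: converges slowly, because each update barely moves the boundary
  • Just right: smooth convergence to solution
  • For this step-function perceptron on linearly separable data, any positive rate converges eventually; the rate mainly changes how many updates that takes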
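For the step-function perceptron in this project specifically, the effect of the learning rate is easy to measure. The sketch below makes two assumptions that differ from the rest of the guide: weights start at zero rather than random values, and the rates are powers of two so the float arithmetic is exact. Under those assumptions, changing the rate only rescales every weight by the same factor, so the sign of z, and therefore the epoch count, stays the same. `epochs_to_converge` is a hypothetical helper.

```python
def step(z):
    return 1 if z >= 0 else 0

INPUTS = [(0, 0), (0, 1), (1, 0), (1, 1)]
AND_TARGETS = [0, 0, 0, 1]

def epochs_to_converge(targets, learning_rate, max_epochs=1000):
    """Index of the first error-free epoch, starting from all-zero weights."""
    w1 = w2 = b = 0.0
    for epoch in range(max_epochs):
        errors = 0
        for (x1, x2), target in zip(INPUTS, targets):
            error = target - step(x1 * w1 + x2 * w2 + b)
            if error != 0:
                w1 += learning_rate * error * x1
                w2 += learning_rate * error * x2
                b += learning_rate * error
                errors += 1
        if errors == 0:
            return epoch
    return max_epochs

counts = {lr: epochs_to_converge(AND_TARGETS, lr) for lr in (0.25, 1.0, 64.0)}
print(counts)
```

With random starting weights the picture changes: a small rate needs many updates to overcome a poor starting point, which is where the rate starts to matter in practice.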

7. “What guarantees that a perceptron will converge?”

Key insight:

  • The Perceptron Convergence Theorem (Novikoff, 1962)
  • IF data is linearly separable
  • THEN algorithm will converge in finite steps
  • If not separable, it will loop forever (hence XOR failure)

Hints in Layers

Use these hints only when stuck. Try for at least 15 minutes before reading each hint.

Hint 1: Structure

Your main file should have:

  1. A function for the step activation
  2. A function for forward pass
  3. A function for weight updates
  4. A training loop that calls these
  5. A test function that verifies correctness

Hint 2: Initialization

Random initialization should be small values:

import random
w1 = random.uniform(-1, 1)
w2 = random.uniform(-1, 1)
b = random.uniform(-1, 1)
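While debugging, it can help to make the random start repeatable so that two runs produce the same trace. A minimal sketch, assuming a hypothetical `init_weights` helper with an optional seed:

```python
import random

def init_weights(seed=None):
    """Return (w1, w2, b) drawn uniformly from [-1, 1].

    Passing the same seed gives the same weights every time;
    seed=None falls back to an unpredictable start.
    """
    rng = random.Random(seed)  # private generator, leaves global state alone
    return rng.uniform(-1, 1), rng.uniform(-1, 1), rng.uniform(-1, 1)

w1, w2, b = init_weights(seed=42)
print(w1, w2, b)
```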

Hint 3: Training Data

Define your gates in a dictionary that maps each gate name to its target outputs:

GATES = {
    'AND': [0, 0, 0, 1],
    'OR':  [0, 1, 1, 1],
    'NAND': [1, 1, 1, 0],
    'NOR': [1, 0, 0, 0],
    'XOR': [0, 1, 1, 0],  # Will not converge!
}
INPUTS = [(0, 0), (0, 1), (1, 0), (1, 1)]

Hint 4: The Training Loop Pattern

for epoch in range(max_epochs):
    total_error = 0
    for (x1, x2), target in zip(inputs, targets):
        # forward pass
        # calculate error
        # if error != 0: update weights
        # accumulate error count
    if total_error == 0:
        break  # Converged!

Hint 5: Edge Case - No Error

When prediction equals target, error is 0. The update equation:

w = w + lr * 0 * x = w + 0 = w

Weights don’t change when you’re already correct. This is important!


Extensions and Challenges

After completing the basic perceptron, try these:

Extension 1: 3-Input Gates

Implement AND3, OR3, MAJORITY (output 1 if 2+ inputs are 1).

  • Now you have z = x1*w1 + x2*w2 + x3*w3 + b
  • Visualize in 3D (the decision boundary is a plane!)
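One way to generalize the forward pass is to take the inputs and weights as lists, so the same function handles two, three, or more inputs. The sketch below is illustrative only: `forward_n` is a hypothetical name, and the MAJORITY weights are chosen by hand rather than learned, just to show that a working set exists.

```python
from itertools import product

def step(z):
    return 1 if z >= 0 else 0

def forward_n(xs, ws, b):
    """Weighted sum over any number of inputs, followed by the step function."""
    z = sum(x * w for x, w in zip(xs, ws)) + b
    return z, step(z)

# All 8 combinations of three binary inputs, in counting order
INPUTS3 = list(product([0, 1], repeat=3))
MAJORITY = [1 if sum(xs) >= 2 else 0 for xs in INPUTS3]

# Hand-picked weights (assumption): fire when at least two inputs are 1
ws, b = [1.0, 1.0, 1.0], -1.5
predictions = [forward_n(xs, ws, b)[1] for xs in INPUTS3]
print(predictions == MAJORITY)
```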

Extension 2: NAND as Universal Gate

NAND is a universal gate - you can build any other gate from NANDs.

  • Train a NAND perceptron
  • Show how to compose them (manually) to make AND, OR, NOT
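The manual composition might look like the sketch below. The NAND weights here are picked by hand (an assumption); the weights from your own trained NAND perceptron can be substituted. The function names `not_`, `and_`, and `or_` are illustrative.

```python
def step(z):
    return 1 if z >= 0 else 0

def nand(a, b):
    # Hand-picked NAND weights: w1 = -1, w2 = -1, bias = 1.5
    return step(-1.0 * a + -1.0 * b + 1.5)

def not_(a):
    # NAND of a signal with itself inverts it
    return nand(a, a)

def and_(a, b):
    # AND is NAND followed by NOT
    return not_(nand(a, b))

def or_(a, b):
    # De Morgan: a OR b == NOT(NOT a AND NOT b)
    return nand(not_(a), not_(b))

for a in (0, 1):
    for b in (0, 1):
        print(a, b, "AND:", and_(a, b), "OR:", or_(a, b))
```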

Extension 3: Visualization

Plot the decision boundary as training progresses:

  • Use matplotlib to show the 2D input space
  • Draw the line w1*x1 + w2*x2 + b = 0
  • Update the plot each epoch to see the line move

Extension 4: Multi-class (One-vs-All)

Instead of binary output, classify into 4 categories:

  • Train 4 perceptrons, one for each class
  • Output the class with highest weighted sum (before step)
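The "highest weighted sum" rule can be sketched as follows. The extension does not specify the four categories, so this example assumes a toy task in which class k means "the input is corner k of the unit square". The weights are constructed by hand for illustration rather than trained, and `predict_class` is a hypothetical name.

```python
def weighted_sum(x1, x2, w1, w2, b):
    # Raw score before the step function
    return x1 * w1 + x2 * w2 + b

def predict_class(x1, x2, perceptrons):
    """Return the index of the perceptron with the highest raw score."""
    scores = [weighted_sum(x1, x2, *p) for p in perceptrons]
    return scores.index(max(scores))

CORNERS = [(0, 0), (0, 1), (1, 0), (1, 1)]

# One (w1, w2, b) triple per class, built so each scores highest at its own corner
PERCEPTRONS = [(2 * c1 - 1, 2 * c2 - 1, 0.5 - c1 - c2) for c1, c2 in CORNERS]

predicted = [predict_class(x1, x2, PERCEPTRONS) for x1, x2 in CORNERS]
print(predicted)
```

Comparing raw sums instead of step outputs matters because several perceptrons may output 1 (or none may), and the sum breaks that tie.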

Extension 5: Implement in C or Rust

Rewrite the perceptron in a low-level language:

  • No garbage collection, manual memory
  • Appreciate how simple the actual computation is
  • Time the training - it should be microseconds

Extension 6: Two-Layer Perceptron

Build a simple 2-layer network to solve XOR:

  • Hidden layer with 2 neurons
  • Output layer with 1 neuron
  • You’ll need to implement backpropagation (preview of Project 5)
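Before tackling backpropagation, it may help to confirm that this 2-2-1 architecture can represent XOR at all. The sketch below wires the weights by hand: one hidden neuron acts as OR, the other as NAND, and the output neuron ANDs them together. These particular weights are an assumption chosen for illustration; nothing here is learned.

```python
def step(z):
    return 1 if z >= 0 else 0

def neuron(x1, x2, w1, w2, b):
    return step(x1 * w1 + x2 * w2 + b)

def xor_network(x1, x2):
    # Hidden layer (hand-set weights)
    h_or = neuron(x1, x2, 1.0, 1.0, -0.5)      # fires if at least one input is 1
    h_nand = neuron(x1, x2, -1.0, -1.0, 1.5)   # fires unless both inputs are 1
    # Output layer: AND of the two hidden outputs
    return neuron(h_or, h_nand, 1.0, 1.0, -1.5)

INPUTS = [(0, 0), (0, 1), (1, 0), (1, 1)]
outputs = [xor_network(x1, x2) for x1, x2 in INPUTS]
print(outputs)
```

Each hidden neuron draws its own line, and the output neuron keeps only the region between the two lines. Finding such weights automatically is the part that needs backpropagation.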

Real-World Connections

Where Perceptrons Appear Today

  1. Spam Filters (Early Versions)
    • Before deep learning, spam filters used linear classifiers
    • Features: word counts, sender reputation
    • Perceptron-style updates on misclassifications
  2. Credit Scoring (Logistic Regression)
    • Banks use linear models for interpretability
    • Similar to perceptron but with sigmoid activation
    • Weights show which factors matter (income, debt ratio)
  3. Sentiment Analysis (Baseline)
    • Count positive/negative words → weighted sum → decision
    • Perceptron is the simplest baseline to beat
  4. Medical Triage
    • Simple rule-based systems are essentially perceptrons
    • “If blood pressure > X AND temperature > Y, alert doctor”

Why This Foundation Matters

Understanding the perceptron is essential because:

  1. Every deep learning layer IS a perceptron (plus non-linearity)
    • A dense layer: each output neuron is z = w1*x1 + w2*x2 + … + b
    • You just learned the atom of neural networks
  2. Debugging deep networks requires this intuition
    • When a model can’t fit the data no matter how long it trains, you may be seeing the XOR problem at scale: the model can’t represent the function
    • When weights explode, it’s learning rate issues
  3. Interpretable AI often means simpler models
    • Regulators want to know WHY a loan was denied
    • Perceptrons are explainable: “these factors with these weights”
  4. Edge/embedded AI needs efficient models
    • IoT devices can’t run transformers
    • Simple perceptron-style models fit in kilobytes

Books That Will Help

Topic | Book | Chapter/Section
------+------+----------------
Perceptron fundamentals | Grokking Deep Learning by Andrew Trask | Ch. 3: “Introduction to Neural Prediction”
Mathematical foundations | Neural Networks and Deep Learning by Michael Nielsen | Ch. 1: “Using neural nets to recognize handwritten digits”
The Perceptron algorithm | Grokking Deep Learning by Andrew Trask | Ch. 4: “Introduction to Neural Learning”
Linear separability | Pattern Recognition and Machine Learning by Christopher Bishop | Ch. 4: “Linear Models for Classification”
History and context | Perceptrons by Minsky & Papert | Introduction and Ch. 1-3 (historical document)
Optimization theory | Deep Learning by Goodfellow, Bengio, Courville | Ch. 4.3: “Gradient-Based Optimization”
Python implementation | Data Science from Scratch by Joel Grus | Ch. 18: “Neural Networks”

Online Resources

  • 3Blue1Brown: “But what is a neural network?” (YouTube) - Excellent visualization
  • Andrej Karpathy: “Neural Networks: Zero to Hero” - Modern perspective
  • Michael Nielsen: neuralnetworksanddeeplearning.com - Free online book

Self-Assessment Checklist

Before moving to Project 2, verify you can:

Implementation Skills

  • Write the step function without looking at notes
  • Implement forward pass from scratch
  • Apply the Delta Rule correctly
  • Train to convergence on AND, OR, NAND, NOR
  • Explain why XOR doesn’t converge

Conceptual Understanding

  • Draw the decision boundary for a trained perceptron
  • Explain what each weight controls geometrically
  • Describe what the bias shifts
  • Define linear separability with an example

Mathematical Foundations

  • Derive the Delta Rule update from error minimization intuition
  • Calculate z by hand for given weights and inputs
  • Predict whether a point is above or below the decision boundary

Conceptual Questions (Answer Without Looking)

  1. What’s the output of step(-0.001)?
  2. If error=1 and x1=0, how much does w1 change?
  3. Why doesn’t XOR work with a single perceptron?
  4. What happens if learning_rate = 0?
  5. How many parameters does a 2-input perceptron have?
  6. What’s the role of bias in the decision boundary?
  7. Can a perceptron with 3 inputs learn the MAJORITY function?

Code Challenges (Try Without Hints)

  1. Modify your code to work with 3 inputs
  2. Add a function that plots the decision boundary
  3. Count how many epochs each gate needs on average (run 100 trials)
  4. Find the minimum learning rate that still converges in < 1000 epochs

What’s Next

You’ve built the atom of neural networks. But real learning happens when atoms combine into molecules.

Project 2: Gradient Descent Visualizer will show you:

  • How optimization works in continuous (not binary) spaces
  • Why we need derivatives
  • What a “loss landscape” looks like
  • How learning rate affects convergence

The perceptron used a simple error and discrete step function. Modern networks use continuous loss functions and smooth activations - that’s where calculus enters the picture.


Next: P02: Gradient Descent Visualizer - See optimization in action


Appendix: Logic Gate Truth Tables

For reference:

AND Gate:            OR Gate:             NAND Gate:           NOR Gate:
x1 x2 | y            x1 x2 | y            x1 x2 | y            x1 x2 | y
------+--            ------+--            ------+--            ------+--
0  0  | 0            0  0  | 0            0  0  | 1            0  0  | 1
0  1  | 0            0  1  | 1            0  1  | 1            0  1  | 0
1  0  | 0            1  0  | 1            1  0  | 1            1  0  | 0
1  1  | 1            1  1  | 1            1  1  | 0            1  1  | 0

XOR Gate (NOT linearly separable):
x1 x2 | y
------+--
0  0  | 0
0  1  | 1
1  0  | 1
1  1  | 0

XNOR Gate (NOT linearly separable):
x1 x2 | y
------+--
0  0  | 1
0  1  | 0
1  0  | 0
1  1  | 1

This project is part of the “AI Prediction & Neural Networks: From Math to Machine” learning path.