Project 1: The Manual Neuron
Learn how machines "learn" by building a single neuron that teaches itself logic gates - no libraries, no shortcuts, just raw math becoming intelligence
Project Overview
| Attribute | Value |
|---|---|
| Difficulty | Beginner |
| Time Estimate | Weekend (8-16 hours) |
| Language | Python (Pure, NO NumPy) |
| Alternative Languages | C, Rust |
| Prerequisites | Basic Python, high school algebra |
| Main Book | Grokking Deep Learning by Andrew Trask |
| Knowledge Area | Artificial Neurons / Logic Gates |
Learning Objectives
After completing this project, you will be able to:
- Explain the perceptron algorithm - Describe how a single neuron computes its output from inputs, weights, and bias
- Implement forward propagation manually - Write output = (input1 * weight1) + (input2 * weight2) + bias without any library help
- Derive and apply the Delta Rule - Calculate weight updates based on error and learning rate
- Understand linear separability - Explain why single neurons can solve AND/OR but not XOR
- Train a model to convergence - Iterate until the neuron correctly predicts all truth table entries
- Connect math to AI intuition - See exactly how changing numbers leads to "learning"
The Core Question You're Answering
"How can multiplying numbers lead to 'decisions'?"
Before you write a single line of code, internalize this truth: a neural network making a decision is just drawing a line.
Think of the input space as a 2D plane where the x-axis is input1 and the y-axis is input2. The four possible inputs for a logic gate are the corners of a unit square:
input2
^
1 | (0,1)-----(1,1)
| | |
| | |
0 | (0,0)-----(1,0)
+----------------------> input1
0 1
A single neuron draws a line (or in higher dimensions, a hyperplane) that separates "positive" examples from "negative" examples. The weights and bias define where that line sits.
When you train a perceptron, youโre adjusting the line until it correctly separates all the positive examples from the negative ones.
Your task: Build the machine that finds that line automatically.
Concepts You Must Understand First
Stop and research these before coding:
1. The Dot Product and Weighted Sum
The fundamental operation of a neuron is the weighted sum: multiply each input by its corresponding weight, then add everything together (including the bias).
z = (x1 * w1) + (x2 * w2) + ... + (xn * wn) + b
This is a dot product plus a bias term. The dot product measures "how aligned" two vectors are.
Why it matters: The dot product is the building block of ALL neural networks. Every hidden layer, every attention mechanism, every embedding lookup - they all reduce to dot products.
Book Reference: "Grokking Deep Learning" by Andrew Trask - Chapter 3: "Introduction to Neural Prediction"
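To make this concrete, here is a minimal pure-Python sketch of the weighted sum (the function name weighted_sum and the example numbers are illustrative, not part of the project spec):

def weighted_sum(inputs, weights, bias):
    # Multiply each input by its matching weight, add them up, then add the bias.
    total = bias
    for x, w in zip(inputs, weights):
        total += x * w
    return total

# Two inputs, two weights, one bias:
print(weighted_sum([1, 0], [0.6, -0.4], 0.1))  # 1*0.6 + 0*(-0.4) + 0.1 = 0.7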
2. The Step Activation Function
After computing the weighted sum, we need to make a decision: is this input "positive" or "negative"? The step function does exactly this:
step(z) = 1 if z >= threshold
          0 if z <  threshold

Often, we set the threshold to 0 and absorb it into the bias:

step(z) = 1 if z >= 0
          0 if z <  0
Visualization:
output
^
1 | +------------
| |
| |
0 |---------+
+-------------------> z
0
The step function is non-differentiable at z=0 and has zero gradient everywhere else, so it gives gradient descent nothing to work with - which is why modern networks use sigmoid or ReLU. But for perceptrons learning logic gates, step works perfectly.
Book Reference: "Neural Networks and Deep Learning" by Michael Nielsen - Chapter 1, Section on "Perceptrons"
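A possible pure-Python version, using the z >= 0 convention this project adopts throughout:

def step(z):
    # Fire (output 1) when the weighted sum is at or above zero.
    return 1 if z >= 0 else 0

print(step(-0.75))  # 0
print(step(0.0))    # 1 (boundary case: z >= 0 counts as firing)
print(step(0.25))   # 1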
3. Error Calculation
Error is the difference between what you wanted and what you got:
error = target - prediction
For binary outputs (0 or 1):
- If target=1 and prediction=0: error = 1 (we need to increase the output)
- If target=0 and prediction=1: error = -1 (we need to decrease the output)
- If target=prediction: error = 0 (no change needed)
Why it matters: Error is the signal that drives learning. Without knowing how wrong you are, you can't improve.
Book Reference: "Grokking Deep Learning" by Andrew Trask - Chapter 4: "Introduction to Neural Learning"
4. The Perceptron Learning Algorithm (Delta Rule)
The Perceptron Learning Rule states:
w_new = w_old + (learning_rate * error * input)
b_new = b_old + (learning_rate * error)
Intuition:
- If error > 0 (predicted too low), increase weights for inputs that were "on" (input=1)
- If error < 0 (predicted too high), decrease weights for inputs that were "on"
- Inputs that were "off" (input=0) don't change their weights (multiplying by 0)
Why this works: When an input contributed to a wrong prediction:
- If the input was 1 and we predicted 0 (should be 1), increase that weight so next time the weighted sum is higher
- If the input was 1 and we predicted 1 (should be 0), decrease that weight so next time the weighted sum is lower
Book Reference: "Neural Networks and Deep Learning" by Michael Nielsen - Chapter 1, section on "Perceptrons"
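One way this update could look in pure Python, treated as a sketch rather than the reference solution (the list-based delta_update signature is an illustrative choice):

def delta_update(weights, bias, inputs, error, learning_rate):
    # Nudge each weight in proportion to the error and its own input;
    # an input of 0 leaves its weight untouched.
    new_weights = [w + learning_rate * error * x for w, x in zip(weights, inputs)]
    new_bias = bias + learning_rate * error  # the bias "input" is always 1
    return new_weights, new_bias

# Example: the neuron predicted 0 but the target was 1 (error = +1)
weights, bias = delta_update([0.2, -0.5], 0.1, [1, 0], error=1, learning_rate=0.1)
print(weights, bias)  # w1 rises by 0.1, w2 is unchanged (its input was 0), bias rises by 0.1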
5. Linear Separability
A problem is linearly separable if you can draw a straight line (or hyperplane in higher dimensions) to separate the positive and negative examples.
AND Gate (linearly separable):
x2
^
1 | O (0,1) X (1,1) <- One output is 1
|
|
0 | O (0,0) O (1,0) <- All these outputs are 0
+----------------------> x1
0 1
O = output 0
X = output 1
A line can separate the X from the Os:
x2
^
1 | O \ X
| \
| \
0 | O \ O
+-----------\-------> x1
XOR Gate (NOT linearly separable):
x2
^
1 | X (0,1) O (1,1)
|
|
0 | O (0,0) X (1,0)
+----------------------> x1
No single straight line can separate the Xs from the Os!
They are diagonally opposite.
This is the Minsky-Papert limitation that contributed to the first "AI Winter" in the 1970s.
Book Reference: "Grokking Deep Learning" by Andrew Trask - Chapter 3: "Linear Separability"
Deep Theoretical Foundation
History of the Perceptron (Rosenblatt 1958)
In 1958, Frank Rosenblatt at Cornell Aeronautical Laboratory created the Perceptron - one of the first algorithms that could learn from data, inspired by how neurons in the brain work.
Historical Timeline of Neural Networks

1943        McCulloch-Pitts neuron (theoretical model)
   |
   v
1958        Rosenblatt's Perceptron (first learning algorithm)
   |
   v
1969        Minsky & Papert publish "Perceptrons" (the XOR problem)
   |
   v
1969-1986   "AI Winter" (research funding dried up)
   |
   v
1986        Rumelhart, Hinton, Williams (backpropagation)
   |
   v
2012        AlexNet (deep learning renaissance)
   |
   v
Today       Transformers, LLMs, etc.
Rosenblatt's perceptron was physical hardware - the Mark I Perceptron had 400 photocells connected to neurons implemented as potentiometers (variable resistors). It could learn to recognize letters.
The perceptron was overhyped. The New York Times declared it the "embryo of an electronic computer that [the Navy] expects will be able to walk, talk, see, write, reproduce itself and be conscious of its existence."
Then came the crash.
The Minsky-Papert Book and the First AI Winter
In 1969, Marvin Minsky and Seymour Papert published "Perceptrons," a mathematical analysis showing the fundamental limitations of single-layer perceptrons.
Their key result: A single perceptron cannot learn XOR because XOR is not linearly separable.
This devastated AI research funding. If neural networks couldn't even learn XOR, how could they learn anything useful?
What Minsky and Papert actually proved was technically correct but practically misleading. They acknowledged that multi-layer perceptrons (what we now call neural networks) could solve XOR, but dismissed them because "there is no learning algorithm for multi-layer perceptrons."
They were wrong. The backpropagation algorithm was discovered (and forgotten, and rediscovered) multiple times before being popularized in 1986.
The lesson: Understanding the perceptron deeply - including its limitations - is essential for understanding why we need multiple layers and more sophisticated architectures.
Mathematical Formulation
A perceptron with n inputs computes:

z = b + Σ (xi * wi)   for i = 1 to n
y = step(z)

Where:
- xi = input i (binary: 0 or 1 for logic gates)
- wi = weight for input i (real number, learned)
- b = bias (real number, learned)
- z = weighted sum (real number)
- y = output (binary: 0 or 1 after the step function)
ASCII Diagram of a 2-Input Perceptron:

Input x1 ----> [ x1 * w1 ] ----+
                               |
Input x2 ----> [ x2 * w2 ] ----+--> [ Σ ] --> [ step(z) ] --> Output y
                               |
Bias  1  ----> [    b     ] ---+

z = (x1 * w1) + (x2 * w2) + b
y = step(z) = 1 if z >= 0 else 0
The Decision Boundary
The perceptron decides y = 1 when:
z >= 0
(x1 * w1) + (x2 * w2) + b >= 0
Rearranging to see the line equation:
x2 >= (-w1/w2) * x1 + (-b/w2)        (assuming w2 > 0; the inequality flips if w2 is negative)

This is a line with:
- Slope: -w1/w2
- Intercept: -b/w2
Example: Trained OR Gate
After training, let's say: w1 = 1.5, w2 = 1.5, b = -1.0
Decision boundary: 1.5*x1 + 1.5*x2 - 1.0 = 0
Rearranging: x2 = -x1 + 0.67
x2
^
1 | X (0,1) \ X (1,1) <- Both have output 1
| \
| \
0.67| \ <- Decision boundary
| \
0 | O (0,0) \ X (1,0) <- (0,0) is 0, (1,0) is 1
+------------------\-----> x1
0 0.67 1
Points above/right of line -> output 1
Points below/left of line -> output 0
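You can sanity-check these example weights against the OR truth table with a few lines of pure Python (a quick check, not part of the required deliverable):

w1, w2, b = 1.5, 1.5, -1.0  # the example OR weights above

def step(z):
    return 1 if z >= 0 else 0

for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    z = x1 * w1 + x2 * w2 + b
    print((x1, x2), "->", step(z))
# Prints 0, 1, 1, 1 - exactly the OR truth table.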
Why XOR Fails
For XOR:
- (0,0) -> 0
- (0,1) -> 1
- (1,0) -> 1
- (1,1) -> 0
x2
^
1 |  X (0,1)        O (1,1)
  |      +---------------+
  |      |  No single    |
  |      |  line works!  |
  |      +---------------+
0 |  O (0,0)        X (1,0)
  +----------------------> x1
The X points are on opposite corners.
Any line that separates (0,1) from (1,1)
will also separate (0,0) from (1,0) incorrectly.
This is why XOR required multi-layer perceptrons (hidden layers) - stacking neurons lets the network combine several lines into a non-linear decision boundary.
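If you want evidence beyond the geometric argument, a small brute-force experiment (throwaway code, not part of the project itself) searches a grid of weights and finds a single-neuron solution for AND but none for XOR:

def solves(targets, w1, w2, b):
    # Does this weight setting reproduce the whole truth table?
    inputs = [(0, 0), (0, 1), (1, 0), (1, 1)]
    return all((1 if x1 * w1 + x2 * w2 + b >= 0 else 0) == t
               for (x1, x2), t in zip(inputs, targets))

def search(targets):
    grid = [i / 4 for i in range(-8, 9)]  # weights and bias from -2.0 to 2.0 in steps of 0.25
    return any(solves(targets, w1, w2, b)
               for w1 in grid for w2 in grid for b in grid)

print("AND solvable:", search([0, 0, 0, 1]))  # True
print("XOR solvable:", search([0, 1, 1, 0]))  # False

The grid search is not a proof, but together with the picture above it makes the limitation hard to miss.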
The Delta Rule Derivation
The perceptron learning algorithm minimizes error through gradient descent (though Rosenblatt didn't frame it that way).
For the step function, we can't compute a true gradient (it's not differentiable). But we can use a heuristic:
Update Rule:

Δwi = η * (t - y) * xi
wi(new) = wi(old) + Δwi

Where:
- η (eta) = learning rate (typically 0.1 to 1.0)
- t = target (expected output)
- y = predicted output
- xi = input i

Intuition:
- If t = 1 and y = 0: error = 1, so we add η * xi to each weight. This makes z larger next time for this input pattern.
- If t = 0 and y = 1: error = -1, so we subtract η * xi from each weight. This makes z smaller next time.
- If t = y: error = 0, no change.
Convergence Theorem: The perceptron convergence theorem (Novikoff, 1962) proves that if the training data is linearly separable, the perceptron learning algorithm will converge to a solution in finite iterations.
Real World Outcome
You'll run a script that starts with random garbage weights (guessing randomly) and prints its "learning process" until it perfectly mimics a logic gate.
Example Output (OR Gate):
$ python manual_neuron.py --gate OR
========================================
PERCEPTRON TRAINING: OR GATE
========================================
Truth Table for OR:
[0, 0] -> 0
[0, 1] -> 1
[1, 0] -> 1
[1, 1] -> 1
Initial Weights (random):
w1 = 0.23
w2 = -0.47
b = 0.15
Learning Rate: 0.1
----------------------------------------
Epoch 1:
Input=[0, 0] z=0.15 Predicted=1 Target=0 Error=-1
-> UPDATING: w1=0.23->0.23, w2=-0.47->-0.47, b=0.15->0.05
Input=[0, 1] z=-0.42 Predicted=0 Target=1 Error=1
-> UPDATING: w1=0.23->0.23, w2=-0.47->-0.37, b=0.05->0.15
Input=[1, 0] z=0.38 Predicted=1 Target=1 Error=0 (Correct!)
Input=[1, 1] z=0.01 Predicted=1 Target=1 Error=0 (Correct!)
Epoch 1 Errors: 2/4
Epoch 2:
Input=[0, 0] z=0.15 Predicted=1 Target=0 Error=-1
-> UPDATING: w1=0.23->0.23, w2=-0.37->-0.37, b=0.15->0.05
Input=[0, 1] z=-0.32 Predicted=0 Target=1 Error=1
-> UPDATING: w1=0.23->0.23, w2=-0.37->-0.27, b=0.05->0.15
Input=[1, 0] z=0.38 Predicted=1 Target=1 Error=0 (Correct!)
Input=[1, 1] z=0.11 Predicted=1 Target=1 Error=0 (Correct!)
Epoch 2 Errors: 2/4
... (many epochs later) ...
Epoch 43:
Input=[0, 0] z=-0.12 Predicted=0 Target=0 Error=0 (Correct!)
Input=[0, 1] z=0.78 Predicted=1 Target=1 Error=0 (Correct!)
Input=[1, 0] z=0.95 Predicted=1 Target=1 Error=0 (Correct!)
Input=[1, 1] z=1.85 Predicted=1 Target=1 Error=0 (Correct!)
Epoch 43 Errors: 0/4
========================================
TRAINING COMPLETE!
========================================
Final Weights:
w1 = 1.07
w2 = 0.90
b = -0.12
Decision Boundary Equation:
1.07*x1 + 0.90*x2 - 0.12 = 0
----------------------------------------
TESTING MODEL
----------------------------------------
[0, 0] -> z=-0.12 -> step -> 0 (Expected: 0) ✓
[0, 1] -> z=0.78 -> step -> 1 (Expected: 1) ✓
[1, 0] -> z=0.95 -> step -> 1 (Expected: 1) ✓
[1, 1] -> z=1.85 -> step -> 1 (Expected: 1) ✓
ALL TESTS PASSED!
The perceptron has learned the OR function.
Example Output (AND Gate):
$ python manual_neuron.py --gate AND
========================================
PERCEPTRON TRAINING: AND GATE
========================================
Truth Table for AND:
[0, 0] -> 0
[0, 1] -> 0
[1, 0] -> 0
[1, 1] -> 1
Initial Weights (random):
w1 = -0.15
w2 = 0.32
b = 0.05
Learning Rate: 0.1
... (training epochs) ...
Epoch 28:
Input=[0, 0] z=-0.45 Predicted=0 Target=0 Error=0 (Correct!)
Input=[0, 1] z=0.15 Predicted=1 Target=0 Error=-1
-> UPDATING...
...
Epoch 67: SOLVED!
Final Weights:
w1 = 0.80
w2 = 0.75
b = -1.20
Testing:
[0, 0] -> 0 (Expected: 0) ✓
[0, 1] -> 0 (Expected: 0) ✓
[1, 0] -> 0 (Expected: 0) ✓
[1, 1] -> 1 (Expected: 1) ✓
ALL TESTS PASSED!
Example Output (XOR - Expected Failure):
$ python manual_neuron.py --gate XOR
========================================
PERCEPTRON TRAINING: XOR GATE
========================================
Truth Table for XOR:
[0, 0] -> 0
[0, 1] -> 1
[1, 0] -> 1
[1, 1] -> 0
Initial Weights (random):
w1 = 0.12
w2 = 0.45
b = -0.08
... (training) ...
Epoch 100: Errors: 1/4
Epoch 200: Errors: 2/4
Epoch 500: Errors: 1/4
Epoch 1000: Still not converged!
========================================
TRAINING FAILED (as expected)
========================================
XOR is not linearly separable.
A single perceptron cannot learn XOR.
You need hidden layers (multi-layer perceptron).
This is the Minsky-Papert limitation!
Solution Architecture
High-Level Design Approach
This section describes what your solution should look like, not how to implement it.
Architecture Diagram:

                  PERCEPTRON TRAINING SYSTEM

Training Data                      PERCEPTRON
+--------------------+     +---------------------------+
| Inputs   | Targets |     |   w1     w2     b         |
| [0,0]    |   0     |     |    \      |     /         |
| [0,1]    |   1     |---->|  z = x1*w1 + x2*w2 + b    |
| [1,0]    |   1     |     |  y = step(z)              |
| [1,1]    |   1     |     |    = 1 if z >= 0 else 0   |
+----------+---------+     +-------------+-------------+
        |                                |
        |                                v
        |                       +------------------+
        |                       |   Prediction y   |
        |                       +--------+---------+
        |                                |
        v                                v
+-----------------------------------------------------+
|                  ERROR CALCULATION                   |
|              error = target - prediction             |
+--------------------------+--------------------------+
                           | if error != 0
                           v
+-----------------------------------------------------+
|              WEIGHT UPDATE (Delta Rule)              |
|     w1 = w1 + (learning_rate * error * x1)           |
|     w2 = w2 + (learning_rate * error * x2)           |
|     b  = b  + (learning_rate * error * 1)            |
+--------------------------+--------------------------+
                           | loop until all predictions correct
                           v
+-----------------------------------------------------+
|                  CONVERGENCE CHECK                   |
|  If all 4 inputs predict correctly: STOP (trained)   |
|  Else: continue to the next epoch                    |
+-----------------------------------------------------+
Data Structures Needed

1. TRAINING DATA

   +---------+---------+
   | inputs  | targets |
   +---------+---------+
   | [0, 0]  | 0 or 1  |
   | [0, 1]  | 0 or 1  |   <- targets depend on the gate
   | [1, 0]  | 0 or 1  |
   | [1, 1]  | 0 or 1  |
   +---------+---------+

2. MODEL PARAMETERS (floats, updated during training)
   - w1: weight for input 1
   - w2: weight for input 2
   - b: bias

3. HYPERPARAMETERS (constants, set before training)
   - learning_rate: typically 0.1 to 1.0
   - max_epochs: limit iterations (e.g., 1000)
Function Breakdown

step(z) -> int
    Input:  z (weighted sum, float)
    Output: 0 or 1
    Logic:  return 1 if z >= 0 else 0

forward(x1, x2, w1, w2, b) -> (z, y)
    Input:  inputs x1, x2; weights w1, w2; bias b
    Output: weighted sum z, prediction y
    Logic:  z = x1*w1 + x2*w2 + b
            y = step(z)

update_weights(w1, w2, b, x1, x2, error, lr) -> tuple
    Input:  current weights, inputs, error, learning rate
    Output: new (w1, w2, b)
    Logic:  apply the Delta Rule

train(data, targets, lr, max_epochs) -> (w1, w2, b)
    Input:  training data, targets, hyperparameters
    Output: trained weights and bias
    Logic:  loop over epochs, update on errors, check convergence

test(data, targets, w1, w2, b) -> bool
    Input:  test data, expected outputs, trained params
    Output: True if all correct, False otherwise
    Logic:  run the forward pass on each input and compare
Data Flow Diagram

TRAINING FLOW

Initialize: random w1, w2, b
        |
        v
For each epoch: reset the epoch error counter
        |
        v
For each sample: get (x1, x2) and its target
        |
        v
Forward pass:  z = x1*w1 + x2*w2 + b
               y = step(z)
        |
        v
Calculate error:  err = target - y
        |
        +--- err != 0 --> update weights
        +--- err == 0 --> no change, continue
        |
        v
Next sample, or next epoch when all samples are done
        |
        +--- all correct this epoch --> STOP, return weights
        +--- still errors           --> continue training
Phased Implementation Guide
Phase 1: Forward Pass (1-2 hours)
Goal: Implement the core computation of a neuron.
- Write the step function:
  - Takes a single float z
  - Returns 1 if z >= 0, else 0
- Write the forward function:
  - Takes inputs x1, x2 and parameters w1, w2, b
  - Computes z = x1*w1 + x2*w2 + b
  - Returns step(z)
- Test manually:

  # With w1=0.5, w2=0.5, b=-0.75
  # forward(0, 0, 0.5, 0.5, -0.75) should return 0 (z=-0.75)
  # forward(1, 1, 0.5, 0.5, -0.75) should return 1 (z=0.25)
Checkpoint: You should be able to manually set weights that make the forward function behave like AND or OR.
Phase 2: Error Calculation (30 minutes)
Goal: Compute how wrong the prediction is.
- Write an error function or just compute inline:

  error = target - prediction

- Verify all cases:

  target=0, pred=0 -> error=0  (correct, no update)
  target=1, pred=1 -> error=0  (correct, no update)
  target=1, pred=0 -> error=1  (increase output)
  target=0, pred=1 -> error=-1 (decrease output)
Checkpoint: Given a prediction and target, you should know whether and how to update.
Phase 3: Weight Updates (1 hour)
Goal: Implement the Delta Rule.
- Write the update_weights function:

  def update_weights(w1, w2, b, x1, x2, error, learning_rate):
      w1_new = w1 + learning_rate * error * x1
      w2_new = w2 + learning_rate * error * x2
      b_new = b + learning_rate * error * 1  # bias input is always 1
      return w1_new, w2_new, b_new

- Test the update logic:
- If error=1, x1=1, lr=0.1: w1 should increase by 0.1
- If error=-1, x1=1, lr=0.1: w1 should decrease by 0.1
- If x1=0: w1 should not change (0 * anything = 0)
Checkpoint: Weights change in the right direction based on error.
Phase 4: Training Loop (1-2 hours)
Goal: Repeat forward -> error -> update until convergence.
- Define training data for AND, OR, NAND, NOR gates
- Initialize weights randomly (small values, e.g., -1 to 1)
- Implement the epoch loop:
  for epoch in range(max_epochs):
      errors_this_epoch = 0
      for (x1, x2), target in zip(inputs, targets):
          z, prediction = forward(...)
          error = target - prediction
          if error != 0:
              update_weights(...)
              errors_this_epoch += 1
      if errors_this_epoch == 0:
          print("Converged!")
          break

- Add verbose logging to see learning progress
Checkpoint: Running the training loop on OR should converge within ~100 epochs.
Phase 5: Testing and Validation (1 hour)
Goal: Verify the trained perceptron works correctly.
- After training, run all 4 inputs through forward pass
- Compare to expected truth table
- Print pass/fail for each
Checkpoint: All 4 tests pass for AND, OR, NAND, NOR. XOR should fail to converge.
Questions to Guide Your Design
Before implementing, think through these:
Understanding the Algorithm
- Why random initialization?
- What happens if you start with all zeros?
- Why not start with "good" weights?
- What does the learning rate control?
- What happens if learning_rate = 0?
- What happens if learning_rate = 100?
- Why is 0.1 a common choice?
- Why iterate through all samples before checking convergence?
- Could you check after each sample?
- What's the difference between "epoch" and "iteration"?
Understanding the Math
- Why multiply error by input in the update rule?
- What happens to w1 when x1=0?
- Why is this mathematically correct?
- How does the bias differ from weights?
- What does the bias "shift"?
- Why don't we multiply the bias update by an input?
- What does the decision boundary look like geometrically?
- Draw the boundary for a trained AND gate
- How do the weights define its slope?
Understanding the Limits
- Why can't a perceptron learn XOR?
- Draw the 4 XOR points and try to separate them with a line
- What would you need to separate them?
- What's the minimum number of weights to learn a 2-input gate?
- Could you do it with just w1 and w2 (no bias)?
- When is bias essential?
Thinking Exercise
Before coding, trace this by hand:
Starting with:
- w1 = 0.5
- w2 = 0.5
- b = -0.75
- learning_rate = 0.1
- Training for AND gate: (0,0)->0, (0,1)->0, (1,0)->0, (1,1)->1
Epoch 1 Trace:
| Input | z = x1*w1 + x2*w2 + b | y = step(z) | Target | Error | New w1 | New w2 | New b |
|---|---|---|---|---|---|---|---|
| (0,0) | 0*0.5 + 0*0.5 - 0.75 = -0.75 | 0 | 0 | 0 | 0.5 | 0.5 | -0.75 |
| (0,1) | 0*0.5 + 1*0.5 - 0.75 = -0.25 | 0 | 0 | 0 | 0.5 | 0.5 | -0.75 |
| (1,0) | 1*0.5 + 0*0.5 - 0.75 = -0.25 | 0 | 0 | 0 | 0.5 | 0.5 | -0.75 |
| (1,1) | 1*0.5 + 1*0.5 - 0.75 = 0.25 | 1 | 1 | 0 | 0.5 | 0.5 | -0.75 |
Result: All correct on epoch 1! The initial weights happened to be good.
Now try with different starting weights:
- w1 = -0.2
- w2 = 0.3
- b = 0.1
Trace Epoch 1:
| Input | z | y | Target | Error | Update | New w1 | New w2 | New b |
|---|---|---|---|---|---|---|---|---|
| (0,0) | 0*(-0.2) + 0*0.3 + 0.1 = 0.1 | 1 | 0 | -1 | Yes | ? | ? | ? |
| ... |  |  |  |  |  |  |  |  |
Your task: Complete this trace for all 4 inputs of epoch 1. Then continue to epoch 2.
Questions while tracing:
- Which weight changed the most after the first error?
- Why didn't w1 change when processing (0,0)?
- How many epochs until all 4 are correct? (You can check your trace with the script below.)
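Once you have completed the trace by hand, a short script like the sketch below (variable names are illustrative) can confirm your numbers epoch by epoch:

# Checks the hand trace for the AND gate, starting from the second set of weights above.
w1, w2, b, lr = -0.2, 0.3, 0.1, 0.1
inputs = [(0, 0), (0, 1), (1, 0), (1, 1)]
targets = [0, 0, 0, 1]  # AND

for epoch in range(1, 11):
    errors = 0
    for (x1, x2), t in zip(inputs, targets):
        z = x1 * w1 + x2 * w2 + b
        y = 1 if z >= 0 else 0    # note: a z of exactly 0 can come out slightly
        error = t - y             # different in floats than in hand arithmetic
        if error != 0:
            w1 += lr * error * x1
            w2 += lr * error * x2
            b += lr * error
            errors += 1
        print(f"epoch {epoch}  ({x1},{x2})  z={z:+.2f}  y={y}  target={t}  "
              f"error={error:+d}  ->  w1={w1:.2f} w2={w2:.2f} b={b:.2f}")
    if errors == 0:
        break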
Testing Strategy
Unit Tests for Each Function
# Test step function
assert step(-1) == 0
assert step(0) == 1 # boundary case: z >= 0
assert step(0.001) == 1
assert step(-0.001) == 0
# Test forward pass
z, y = forward(0, 0, 1, 1, -1.5) # Mimics AND
assert y == 0
z, y = forward(1, 1, 1, 1, -1.5)
assert y == 1
# Test update rule
w1, w2, b = 0.5, 0.5, 0
w1, w2, b = update_weights(w1, w2, b, 1, 0, 1, 0.1)
assert w1 == 0.6 # increased because x1=1, error=1
assert w2 == 0.5 # unchanged because x2=0
assert b == 0.1 # increased because error=1
Integration Test: Train and Verify
# Train on OR gate
inputs = [(0,0), (0,1), (1,0), (1,1)]
targets = [0, 1, 1, 1]
w1, w2, b = train(inputs, targets, learning_rate=0.1, max_epochs=1000)
# Verify all predictions
for (x1, x2), target in zip(inputs, targets):
_, prediction = forward(x1, x2, w1, w2, b)
assert prediction == target, f"Failed on {(x1, x2)}"
Convergence Test
# AND, OR, NAND, NOR should all converge
for gate_name, gate_targets in [("AND", [0,0,0,1]), ("OR", [0,1,1,1]), ...]:
w1, w2, b, epochs = train_with_count(inputs, gate_targets, ...)
assert epochs < 1000, f"{gate_name} didn't converge"
# XOR should NOT converge
w1, w2, b, epochs = train_with_count(inputs, [0,1,1,0], max_epochs=1000)
assert epochs == 1000, "XOR unexpectedly converged!"
Common Pitfalls and Debugging Tips
Pitfall 1: Off-by-One in Step Function
Symptom: Inconsistent results at z=0
Cause: Using > instead of >= or vice versa
Fix: Decide on convention (usually z >= 0 -> 1) and stick to it
Pitfall 2: Forgetting to Update Bias
Symptom: Model doesn't converge or converges slowly
Cause: Only updating w1 and w2, not b
Fix: Remember: b = b + lr * error * 1
Pitfall 3: Wrong Sign in Update Rule
Symptom: Error gets worse instead of better
Cause: Using prediction - target instead of target - prediction
Fix: Error should be positive when prediction is too low
Pitfall 4: Not Iterating Until Convergence
Symptom: Model seems random
Cause: Only running one epoch
Fix: Loop until zero errors in an epoch (or max epochs)
Pitfall 5: Learning Rate Too High
Symptom: Weights oscillate wildly, never settle
Cause: learning_rate > 1 or very large values
Fix: Use lr in range 0.01 to 1.0 (start with 0.1)
Pitfall 6: Learning Rate Too Low
Symptom: Takes thousands of epochs to converge
Cause: learning_rate too small (e.g., 0.001)
Fix: For simple logic gates, 0.1 to 1.0 works well
Debugging Technique: Print Everything
When stuck, print at each step:
print(f"Input: ({x1}, {x2})")
print(f"Weights before: w1={w1:.3f}, w2={w2:.3f}, b={b:.3f}")
print(f"z = {x1}*{w1} + {x2}*{w2} + {b} = {z:.3f}")
print(f"y = step({z:.3f}) = {y}")
print(f"Target: {target}, Error: {error}")
if error != 0:
print(f"Updating: w1 += {lr}*{error}*{x1} = {lr*error*x1:.3f}")
The Interview Questions They'll Ask
Prepare to answer these:
1. "Explain how a perceptron learns. Walk me through one update step."
Key points to cover:
- Forward pass: weighted sum + step function
- Error calculation: target - prediction
- Weight update: Delta Rule (w += lr * error * input)
- Why inputs of 0 don't change their weights
2. "What is the decision boundary of a perceptron?"
Key insight:
- It's a hyperplane (a line in 2D) defined by w1*x1 + w2*x2 + b = 0
- Weights define the orientation (slope)
- Bias shifts the line
3. "Why can't a single perceptron learn XOR?"
Key insight:
- XOR is not linearly separable
- Positive examples are on opposite corners
- No single line can separate them
- Need hidden layers (MLP) to create non-linear boundaries
4. "What's the difference between a perceptron and a modern neural network neuron?"
Key insight:
- Perceptron: step function (non-differentiable)
- Modern: sigmoid/ReLU (differentiable for gradient descent)
- Perceptron: single layer
- Modern: multiple layers with backpropagation
5. "What is the role of the bias term?"
Key insight:
- Bias shifts the decision boundary away from the origin
- Without bias, the hyperplane must pass through origin
- Example: the AND gate needs a negative bias so the neuron fires only when both inputs are high
6. "How does the learning rate affect training?"
Key insight:
- Too high: overshoots, oscillates, may not converge
- Too low: converges slowly, may get stuck
- Just right: smooth convergence to solution
7. "What guarantees that a perceptron will converge?"
Key insight:
- The Perceptron Convergence Theorem (Novikoff, 1962)
- IF data is linearly separable
- THEN algorithm will converge in finite steps
- If not separable, it will loop forever (hence XOR failure)
Hints in Layers
Use these hints only when stuck. Try for at least 15 minutes before reading each hint.
Hint 1: Structure
Your main file should have:
- A function for the step activation
- A function for forward pass
- A function for weight updates
- A training loop that calls these
- A test function that verifies correctness
Hint 2: Initialization
Random initialization should be small values:
import random
w1 = random.uniform(-1, 1)
w2 = random.uniform(-1, 1)
b = random.uniform(-1, 1)
Hint 3: Training Data
Define your gates as dictionaries:
GATES = {
'AND': [0, 0, 0, 1],
'OR': [0, 1, 1, 1],
'NAND': [1, 1, 1, 0],
'NOR': [1, 0, 0, 0],
'XOR': [0, 1, 1, 0], # Will not converge!
}
INPUTS = [(0, 0), (0, 1), (1, 0), (1, 1)]
Hint 4: The Training Loop Pattern
for epoch in range(max_epochs):
total_error = 0
for (x1, x2), target in zip(inputs, targets):
# forward pass
# calculate error
# if error != 0: update weights
# accumulate error count
if total_error == 0:
break # Converged!
Hint 5: Edge Case - No Error
When prediction equals target, error is 0. The update equation:
w = w + lr * 0 * x = w + 0 = w
Weights don't change when you're already correct. This is important!
Extensions and Challenges
After completing the basic perceptron, try these:
Extension 1: 3-Input Gates
Implement AND3, OR3, MAJORITY (output 1 if 2+ inputs are 1).
- Now you have z = x1*w1 + x2*w2 + x3*w3 + b (a generalized forward-pass sketch follows below)
- Visualize in 3D (the decision boundary is a plane!)
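One way to generalize the forward pass to any number of inputs; this list-based forward_n signature is a sketch, not a required interface:

def forward_n(inputs, weights, bias):
    # Works for 2, 3, or any number of inputs.
    z = bias
    for x, w in zip(inputs, weights):
        z += x * w
    return 1 if z >= 0 else 0

# Hand-set weights for 3-input MAJORITY: fire when at least 2 of the 3 inputs are 1.
print(forward_n([1, 1, 0], [1, 1, 1], -1.5))  # 1
print(forward_n([1, 0, 0], [1, 1, 1], -1.5))  # 0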
Extension 2: NAND as Universal Gate
NAND is a universal gate - you can build any other gate from NANDs.
- Train a NAND perceptron
- Show how to compose them (manually) to make AND, OR, NOT
Extension 3: Visualization
Plot the decision boundary as training progresses:
- Use matplotlib to show the 2D input space
- Draw the line w1*x1 + w2*x2 + b = 0
- Update the plot each epoch to see the line move (a minimal plotting sketch follows below)
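A minimal plotting sketch, assuming matplotlib is installed; the weights shown are the example OR values from earlier and would be replaced by your own trained values:

import matplotlib.pyplot as plt

w1, w2, b = 1.07, 0.90, -0.12  # example trained OR weights; substitute your own

# The four inputs, colored by the OR gate's target output.
points = [(0, 0), (0, 1), (1, 0), (1, 1)]
targets = [0, 1, 1, 1]
for (x1, x2), t in zip(points, targets):
    plt.scatter(x1, x2, color="red" if t else "blue", s=100)

# Decision boundary w1*x1 + w2*x2 + b = 0, i.e. x2 = (-w1*x1 - b) / w2  (assumes w2 != 0)
xs = [-0.5, 1.5]
ys = [(-w1 * x - b) / w2 for x in xs]
plt.plot(xs, ys, "k--", label="decision boundary")

plt.xlabel("x1")
plt.ylabel("x2")
plt.legend()
plt.show()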
Extension 4: Multi-class (One-vs-All)
Instead of binary output, classify into 4 categories:
- Train 4 perceptrons, one for each class
- Output the class with highest weighted sum (before step)
Extension 5: Implement in C or Rust
Rewrite the perceptron in a low-level language:
- No garbage collection, manual memory
- Appreciate how simple the actual computation is
- Time the training - it should be microseconds
Extension 6: Two-Layer Perceptron
Build a simple 2-layer network to solve XOR:
- Hidden layer with 2 neurons
- Output layer with 1 neuron
- You'll need to implement backpropagation (preview of Project 5); a hand-wired sketch of the structure follows below
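Before tackling backpropagation, it can help to confirm that a two-layer structure really does solve XOR. The sketch below hand-wires the weights (XOR = AND(OR, NAND)) instead of learning them; learning them is the actual extension:

def neuron(x1, x2, w1, w2, b):
    return 1 if x1 * w1 + x2 * w2 + b >= 0 else 0

def xor_two_layer(x1, x2):
    h_or = neuron(x1, x2, 1.0, 1.0, -0.5)      # hidden neuron 1: OR
    h_nand = neuron(x1, x2, -1.0, -1.0, 1.5)   # hidden neuron 2: NAND
    return neuron(h_or, h_nand, 1.0, 1.0, -1.5)  # output neuron: AND

for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print((x1, x2), "->", xor_two_layer(x1, x2))
# Prints 0, 1, 1, 0 - the XOR truth table.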
Real-World Connections
Where Perceptrons Appear Today
- Spam Filters (Early Versions)
- Before deep learning, spam filters used linear classifiers
- Features: word counts, sender reputation
- Perceptron-style updates on misclassifications
- Credit Scoring (Logistic Regression)
- Banks use linear models for interpretability
- Similar to perceptron but with sigmoid activation
- Weights show which factors matter (income, debt ratio)
- Sentiment Analysis (Baseline)
- Count positive/negative words -> weighted sum -> decision
- Perceptron is the simplest baseline to beat
- Medical Triage
- Simple rule-based systems are essentially perceptrons
- "If blood pressure > X AND temperature > Y, alert doctor"
Why This Foundation Matters
Understanding the perceptron is essential because:
- Every deep learning layer IS a perceptron (plus non-linearity)
- A dense layer: each output neuron is z = w1*x1 + w2*x2 + ... + b
- You just learned the atom of neural networks
- Debugging deep networks requires this intuition
- When a model keeps underfitting no matter how long you train, you may be hitting a representational limit - the XOR problem at scale
- When weights explode or oscillate, it's usually a learning rate issue
- Interpretable AI often means simpler models
- Regulators want to know WHY a loan was denied
- Perceptrons are explainable: "these factors with these weights"
- Edge/embedded AI needs efficient models
- IoT devices can't run transformers
- Simple perceptron-style models fit in kilobytes
Books That Will Help
| Topic | Book | Chapter/Section |
|---|---|---|
| Perceptron fundamentals | Grokking Deep Learning by Andrew Trask | Ch. 3: "Introduction to Neural Prediction" |
| Mathematical foundations | Neural Networks and Deep Learning by Michael Nielsen | Ch. 1: "Using neural nets to recognize handwritten digits" |
| The Perceptron algorithm | Grokking Deep Learning by Andrew Trask | Ch. 4: "Introduction to Neural Learning" |
| Linear separability | Pattern Recognition and Machine Learning by Christopher Bishop | Ch. 4: "Linear Models for Classification" |
| History and context | Perceptrons by Minsky & Papert | Introduction and Ch. 1-3 (historical document) |
| Optimization theory | Deep Learning by Goodfellow, Bengio, Courville | Ch. 4.3: "Gradient-Based Optimization" |
| Python implementation | Data Science from Scratch by Joel Grus | Ch. 18: "Neural Networks" |
Online Resources
- 3Blue1Brown: "But what is a neural network?" (YouTube) - Excellent visualization
- Andrej Karpathy: "Neural Networks: Zero to Hero" - Modern perspective
- Michael Nielsen: neuralnetworksanddeeplearning.com - Free online book
Self-Assessment Checklist
Before moving to Project 2, verify you can:
Implementation Skills
- Write the step function without looking at notes
- Implement forward pass from scratch
- Apply the Delta Rule correctly
- Train to convergence on AND, OR, NAND, NOR
- Explain why XOR doesn't converge
Conceptual Understanding
- Draw the decision boundary for a trained perceptron
- Explain what each weight controls geometrically
- Describe what the bias shifts
- Define linear separability with an example
Mathematical Foundations
- Derive the Delta Rule update from error minimization intuition
- Calculate z by hand for given weights and inputs
- Predict whether a point is above or below the decision boundary
Conceptual Questions (Answer Without Looking)
- What's the output of step(-0.001)?
- If error=1 and x1=0, how much does w1 change?
- Why doesn't XOR work with a single perceptron?
- What happens if learning_rate = 0?
- How many parameters does a 2-input perceptron have?
- What's the role of bias in the decision boundary?
- Can a perceptron with 3 inputs learn the MAJORITY function?
Code Challenges (Try Without Hints)
- Modify your code to work with 3 inputs
- Add a function that plots the decision boundary
- Count how many epochs each gate needs on average (run 100 trials)
- Find the minimum learning rate that still converges in < 1000 epochs
Whatโs Next
You've built the atom of neural networks. But real learning happens when atoms combine into molecules.
Project 2: Gradient Descent Visualizer will show you:
- How optimization works in continuous (not binary) spaces
- Why we need derivatives
- What a "loss landscape" looks like
- How learning rate affects convergence
The perceptron used a simple error and a discrete step function. Modern networks use continuous loss functions and smooth activations - that's where calculus enters the picture.
Next: P02: Gradient Descent Visualizer - See optimization in action
Appendix: Logic Gate Truth Tables
For reference:
AND Gate: OR Gate: NAND Gate: NOR Gate:
x1 x2 | y x1 x2 | y x1 x2 | y x1 x2 | y
------+-- ------+-- ------+-- ------+--
0 0 | 0 0 0 | 0 0 0 | 1 0 0 | 1
0 1 | 0 0 1 | 1 0 1 | 1 0 1 | 0
1 0 | 0 1 0 | 1 1 0 | 1 1 0 | 0
1 1 | 1 1 1 | 1 1 1 | 0 1 1 | 0
XOR Gate (NOT linearly separable):
x1 x2 | y
------+--
0 0 | 0
0 1 | 1
1 0 | 1
1 1 | 0
XNOR Gate (NOT linearly separable):
x1 x2 | y
------+--
0 0 | 1
0 1 | 0
1 0 | 0
1 1 | 1
This project is part of the "AI Prediction & Neural Networks: From Math to Machine" learning path.