Project 1: The Manual Neuron
Learn how machines "learn" by building a single neuron that teaches itself logic gates - no libraries, no shortcuts, just raw math becoming intelligence
Project Overview
| Attribute | Value |
|---|---|
| Difficulty | Beginner |
| Time Estimate | Weekend (8-16 hours) |
| Language | Python (Pure, NO NumPy) |
| Alternative Languages | C, Rust |
| Prerequisites | Basic Python, high school algebra |
| Main Book | Grokking Deep Learning by Andrew Trask |
| Knowledge Area | Artificial Neurons / Logic Gates |
Learning Objectives
After completing this project, you will be able to:
- Explain the perceptron algorithm - Describe how a single neuron computes its output from inputs, weights, and bias
- Implement forward propagation manually - Write output = (input1 * weight1) + (input2 * weight2) + bias without any library help
- Derive and apply the Delta Rule - Calculate weight updates based on error and learning rate
- Understand linear separability - Explain why single neurons can solve AND/OR but not XOR
- Train a model to convergence - Iterate until the neuron correctly predicts all truth table entries
- Connect math to AI intuition - See exactly how changing numbers leads to "learning"
The Core Question You're Answering
"How can multiplying numbers lead to 'decisions'?"
Before you write a single line of code, internalize this truth: a neural network making a decision is just drawing a line.
Think of the input space as a 2D plane where the x-axis is input1 and the y-axis is input2. The four possible inputs for a logic gate are the corners of a unit square:
input2
^
1 | (0,1)-----(1,1)
| | |
| | |
0 | (0,0)-----(1,0)
+----------------------> input1
0 1
A single neuron draws a line (or in higher dimensions, a hyperplane) that separates "positive" examples from "negative" examples. The weights and bias define where that line sits.
When you train a perceptron, youโre adjusting the line until it correctly separates all the positive examples from the negative ones.
Your task: Build the machine that finds that line automatically.
Concepts You Must Understand First
Stop and research these before coding:
1. The Dot Product and Weighted Sum
The fundamental operation of a neuron is the weighted sum: multiply each input by its corresponding weight, then add everything together (including the bias).
z = (x1 * w1) + (x2 * w2) + ... + (xn * wn) + b
This is a dot product plus a bias term. The dot product measures "how aligned" two vectors are.
Why it matters: The dot product is the building block of ALL neural networks. Every hidden layer, every attention mechanism, every embedding lookup - they all reduce to dot products.
Book Reference: "Grokking Deep Learning" by Andrew Trask - Chapter 3: "Introduction to Neural Prediction"
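To make this concrete, here is a minimal pure-Python sketch of the weighted sum (the function name weighted_sum and the example numbers are illustrative, not part of the project spec):

def weighted_sum(inputs, weights, bias):
    # Multiply each input by its matching weight, add them up, then add the bias.
    total = bias
    for x, w in zip(inputs, weights):
        total += x * w
    return total

# Two inputs, two weights, one bias:
print(weighted_sum([1, 0], [0.6, -0.4], 0.1))  # 1*0.6 + 0*(-0.4) + 0.1 = 0.7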
2. The Step Activation Function
After computing the weighted sum, we need to make a decision: is this input "positive" or "negative"? The step function does exactly this:
step(z) = 1 if z >= threshold
          0 if z <  threshold

Often, we set the threshold to 0 and absorb it into the bias:

step(z) = 1 if z >= 0
          0 if z <  0
Visualization:
output
^
1 | +------------
| |
| |
0 |---------+
+-------------------> z
0
The step function is non-differentiable at z=0 and has zero gradient everywhere else, so it gives gradient descent nothing to work with - which is why modern networks use sigmoid or ReLU. But for perceptrons learning logic gates, step works perfectly.
Book Reference: "Neural Networks and Deep Learning" by Michael Nielsen - Chapter 1, Section on "Perceptrons"
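A possible pure-Python version, using the z >= 0 convention this project adopts throughout:

def step(z):
    # Fire (output 1) when the weighted sum is at or above zero.
    return 1 if z >= 0 else 0

print(step(-0.75))  # 0
print(step(0.0))    # 1 (boundary case: z >= 0 counts as firing)
print(step(0.25))   # 1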
3. Error Calculation
Error is the difference between what you wanted and what you got:
error = target - prediction
For binary outputs (0 or 1):
- If target=1 and prediction=0: error = 1 (we need to increase the output)
- If target=0 and prediction=1: error = -1 (we need to decrease the output)
- If target=prediction: error = 0 (no change needed)
Why it matters: Error is the signal that drives learning. Without knowing how wrong you are, you can't improve.
Book Reference: "Grokking Deep Learning" by Andrew Trask - Chapter 4: "Introduction to Neural Learning"
4. The Perceptron Learning Algorithm (Delta Rule)
The Perceptron Learning Rule states:
w_new = w_old + (learning_rate * error * input)
b_new = b_old + (learning_rate * error)
Intuition:
- If error > 0 (predicted too low), increase weights for inputs that were "on" (input=1)
- If error < 0 (predicted too high), decrease weights for inputs that were "on"
- Inputs that were "off" (input=0) don't change their weights (multiplying by 0)
Why this works: When an input contributed to a wrong prediction:
- If the input was 1 and we predicted 0 (should be 1), increase that weight so next time the weighted sum is higher
- If the input was 1 and we predicted 1 (should be 0), decrease that weight so next time the weighted sum is lower
Book Reference: "Neural Networks and Deep Learning" by Michael Nielsen - Chapter 1, section on "Perceptrons"
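One way this update could look in pure Python, treated as a sketch rather than the reference solution (the list-based delta_update signature is an illustrative choice):

def delta_update(weights, bias, inputs, error, learning_rate):
    # Nudge each weight in proportion to the error and its own input;
    # an input of 0 leaves its weight untouched.
    new_weights = [w + learning_rate * error * x for w, x in zip(weights, inputs)]
    new_bias = bias + learning_rate * error  # the bias "input" is always 1
    return new_weights, new_bias

# Example: the neuron predicted 0 but the target was 1 (error = +1)
weights, bias = delta_update([0.2, -0.5], 0.1, [1, 0], error=1, learning_rate=0.1)
print(weights, bias)  # w1 rises by 0.1, w2 is unchanged (its input was 0), bias rises by 0.1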
5. Linear Separability
A problem is linearly separable if you can draw a straight line (or hyperplane in higher dimensions) to separate the positive and negative examples.
AND Gate (linearly separable):
x2
^
1 | O (0,1) X (1,1) <- One output is 1
|
|
0 | O (0,0) O (1,0) <- All these outputs are 0
+----------------------> x1
0 1
O = output 0
X = output 1
A line can separate the X from the Os:
x2
^
1 | O \ X
| \
| \
0 | O \ O
+-----------\-------> x1
XOR Gate (NOT linearly separable):
x2
^
1 | X (0,1) O (1,1)
|
|
0 | O (0,0) X (1,0)
+----------------------> x1
No single straight line can separate the Xs from the Os!
They are diagonally opposite.
This is the Minsky-Papert limitation that contributed to the first "AI Winter" in the 1970s.
Book Reference: "Grokking Deep Learning" by Andrew Trask - Chapter 3: "Linear Separability"
Deep Theoretical Foundation
History of the Perceptron (Rosenblatt 1958)
In 1958, Frank Rosenblatt at Cornell Aeronautical Laboratory created the Perceptron - one of the first algorithms that could learn from data, inspired by how neurons in the brain work.
Historical Timeline of Neural Networks

1943        McCulloch-Pitts neuron (theoretical model)
   |
   v
1958        Rosenblatt's Perceptron (first learning algorithm)
   |
   v
1969        Minsky & Papert publish "Perceptrons" (the XOR problem)
   |
   v
1969-1986   "AI Winter" (research funding dried up)
   |
   v
1986        Rumelhart, Hinton, Williams (backpropagation)
   |
   v
2012        AlexNet (deep learning renaissance)
   |
   v
Today       Transformers, LLMs, etc.
Rosenblatt's perceptron was physical hardware - the Mark I Perceptron had 400 photocells connected to neurons implemented as potentiometers (variable resistors). It could learn to recognize letters.
The perceptron was overhyped. The New York Times declared it the "embryo of an electronic computer that [the Navy] expects will be able to walk, talk, see, write, reproduce itself and be conscious of its existence."
Then came the crash.
The Minsky-Papert Book and the First AI Winter
In 1969, Marvin Minsky and Seymour Papert published "Perceptrons," a mathematical analysis showing the fundamental limitations of single-layer perceptrons.
Their key result: A single perceptron cannot learn XOR because XOR is not linearly separable.
This devastated AI research funding. If neural networks couldn't even learn XOR, how could they learn anything useful?
What Minsky and Papert actually proved was technically correct but practically misleading. They acknowledged that multi-layer perceptrons (what we now call neural networks) could solve XOR, but dismissed them because "there is no learning algorithm for multi-layer perceptrons."
They were wrong. The backpropagation algorithm was discovered (and forgotten, and rediscovered) multiple times before being popularized in 1986.
The lesson: Understanding the perceptron deeply - including its limitations - is essential for understanding why we need multiple layers and more sophisticated architectures.
Mathematical Formulation
A perceptron with n inputs computes:

z = b + Σ (xi * wi)   for i = 1 to n
y = step(z)

Where:
- xi = input i (binary: 0 or 1 for logic gates)
- wi = weight for input i (real number, learned)
- b = bias (real number, learned)
- z = weighted sum (real number)
- y = output (binary: 0 or 1 after the step function)
ASCII Diagram of a 2-Input Perceptron:

Input x1 ----> [ x1 * w1 ] ----+
                               |
Input x2 ----> [ x2 * w2 ] ----+--> [ Σ ] --> [ step(z) ] --> Output y
                               |
Bias  1  ----> [    b     ] ---+

z = (x1 * w1) + (x2 * w2) + b
y = step(z) = 1 if z >= 0 else 0
The Decision Boundary
The perceptron decides y = 1 when:
z >= 0
(x1 * w1) + (x2 * w2) + b >= 0
Rearranging to see the line equation:
x2 >= (-w1/w2) * x1 + (-b/w2)        (assuming w2 > 0; the inequality flips if w2 is negative)

This is a line with:
- Slope: -w1/w2
- Intercept: -b/w2
Example: Trained OR Gate
After training, let's say: w1 = 1.5, w2 = 1.5, b = -1.0
Decision boundary: 1.5*x1 + 1.5*x2 - 1.0 = 0
Rearranging: x2 = -x1 + 0.67
x2
^
1 | X (0,1) \ X (1,1) <- Both have output 1
| \
| \
0.67| \ <- Decision boundary
| \
0 | O (0,0) \ X (1,0) <- (0,0) is 0, (1,0) is 1
+------------------\-----> x1
0 0.67 1
Points above/right of line -> output 1
Points below/left of line -> output 0
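You can sanity-check these example weights against the OR truth table with a few lines of pure Python (a quick check, not part of the required deliverable):

w1, w2, b = 1.5, 1.5, -1.0  # the example OR weights above

def step(z):
    return 1 if z >= 0 else 0

for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    z = x1 * w1 + x2 * w2 + b
    print((x1, x2), "->", step(z))
# Prints 0, 1, 1, 1 - exactly the OR truth table.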
Why XOR Fails
For XOR:
- (0,0) -> 0
- (0,1) -> 1
- (1,0) -> 1
- (1,1) -> 0
x2
^
1 |  X (0,1)        O (1,1)
  |      +---------------+
  |      |  No single    |
  |      |  line works!  |
  |      +---------------+
0 |  O (0,0)        X (1,0)
  +----------------------> x1
The X points are on opposite corners.
Any line that separates (0,1) from (1,1)
will also separate (0,0) from (1,0) incorrectly.
This is why XOR required multi-layer perceptrons (hidden layers) - stacking neurons lets the network combine several lines into a non-linear decision boundary.
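If you want evidence beyond the geometric argument, a small brute-force experiment (throwaway code, not part of the project itself) searches a grid of weights and finds a single-neuron solution for AND but none for XOR:

def solves(targets, w1, w2, b):
    # Does this weight setting reproduce the whole truth table?
    inputs = [(0, 0), (0, 1), (1, 0), (1, 1)]
    return all((1 if x1 * w1 + x2 * w2 + b >= 0 else 0) == t
               for (x1, x2), t in zip(inputs, targets))

def search(targets):
    grid = [i / 4 for i in range(-8, 9)]  # weights and bias from -2.0 to 2.0 in steps of 0.25
    return any(solves(targets, w1, w2, b)
               for w1 in grid for w2 in grid for b in grid)

print("AND solvable:", search([0, 0, 0, 1]))  # True
print("XOR solvable:", search([0, 1, 1, 0]))  # False

The grid search is not a proof, but together with the picture above it makes the limitation hard to miss.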
The Delta Rule Derivation
The perceptron learning algorithm minimizes error through gradient descent (though Rosenblatt didn't frame it that way).
For the step function, we can't compute a true gradient (it's not differentiable). But we can use a heuristic:
Update Rule:

Δwi = η * (t - y) * xi
wi(new) = wi(old) + Δwi

Where:
- η (eta) = learning rate (typically 0.1 to 1.0)
- t = target (expected output)
- y = predicted output
- xi = input i

Intuition:
- If t = 1 and y = 0: error = 1, so we add η * xi to each weight. This makes z larger next time for this input pattern.
- If t = 0 and y = 1: error = -1, so we subtract η * xi from each weight. This makes z smaller next time.
- If t = y: error = 0, no change.
Convergence Theorem: The perceptron convergence theorem (Novikoff, 1962) proves that if the training data is linearly separable, the perceptron learning algorithm will converge to a solution in finite iterations.
Real World Outcome
You'll run a script that starts with random garbage weights (guessing randomly) and prints its "learning process" until it perfectly mimics a logic gate.
Example Output (OR Gate):
$ python manual_neuron.py --gate OR
========================================
PERCEPTRON TRAINING: OR GATE
========================================
Truth Table for OR:
[0, 0] -> 0
[0, 1] -> 1
[1, 0] -> 1
[1, 1] -> 1
Initial Weights (random):
w1 = 0.23
w2 = -0.47
b = 0.15
Learning Rate: 0.1
----------------------------------------
Epoch 1:
Input=[0, 0] z=0.15 Predicted=1 Target=0 Error=-1
-> UPDATING: w1=0.23->0.23, w2=-0.47->-0.47, b=0.15->0.05
Input=[0, 1] z=-0.42 Predicted=0 Target=1 Error=1
-> UPDATING: w1=0.23->0.23, w2=-0.47->-0.37, b=0.05->0.15
Input=[1, 0] z=0.38 Predicted=1 Target=1 Error=0 (Correct!)
Input=[1, 1] z=0.01 Predicted=1 Target=1 Error=0 (Correct!)
Epoch 1 Errors: 2/4
Epoch 2:
Input=[0, 0] z=0.15 Predicted=1 Target=0 Error=-1
-> UPDATING: w1=0.23->0.23, w2=-0.37->-0.37, b=0.15->0.05
Input=[0, 1] z=-0.32 Predicted=0 Target=1 Error=1
-> UPDATING: w1=0.23->0.23, w2=-0.37->-0.27, b=0.05->0.15
Input=[1, 0] z=0.38 Predicted=1 Target=1 Error=0 (Correct!)
Input=[1, 1] z=0.11 Predicted=1 Target=1 Error=0 (Correct!)
Epoch 2 Errors: 2/4
... (many epochs later) ...
Epoch 43:
Input=[0, 0] z=-0.12 Predicted=0 Target=0 Error=0 (Correct!)
Input=[0, 1] z=0.78 Predicted=1 Target=1 Error=0 (Correct!)
Input=[1, 0] z=0.95 Predicted=1 Target=1 Error=0 (Correct!)
Input=[1, 1] z=1.85 Predicted=1 Target=1 Error=0 (Correct!)
Epoch 43 Errors: 0/4
========================================
TRAINING COMPLETE!
========================================
Final Weights:
w1 = 1.07
w2 = 0.90
b = -0.12
Decision Boundary Equation:
1.07*x1 + 0.90*x2 - 0.12 = 0
----------------------------------------
TESTING MODEL
----------------------------------------
[0, 0] -> z=-0.12 -> step -> 0 (Expected: 0) ✓
[0, 1] -> z=0.78 -> step -> 1 (Expected: 1) ✓
[1, 0] -> z=0.95 -> step -> 1 (Expected: 1) ✓
[1, 1] -> z=1.85 -> step -> 1 (Expected: 1) ✓
ALL TESTS PASSED!
The perceptron has learned the OR function.
Example Output (AND Gate):
$ python manual_neuron.py --gate AND
========================================
PERCEPTRON TRAINING: AND GATE
========================================
Truth Table for AND:
[0, 0] -> 0
[0, 1] -> 0
[1, 0] -> 0
[1, 1] -> 1
Initial Weights (random):
w1 = -0.15
w2 = 0.32
b = 0.05
Learning Rate: 0.1
... (training epochs) ...
Epoch 28:
Input=[0, 0] z=-0.45 Predicted=0 Target=0 Error=0 (Correct!)
Input=[0, 1] z=0.15 Predicted=1 Target=0 Error=-1
-> UPDATING...
...
Epoch 67: SOLVED!
Final Weights:
w1 = 0.80
w2 = 0.75
b = -1.20
Testing:
[0, 0] -> 0 (Expected: 0) ✓
[0, 1] -> 0 (Expected: 0) ✓
[1, 0] -> 0 (Expected: 0) ✓
[1, 1] -> 1 (Expected: 1) ✓
ALL TESTS PASSED!
Example Output (XOR - Expected Failure):
$ python manual_neuron.py --gate XOR
========================================
PERCEPTRON TRAINING: XOR GATE
========================================
Truth Table for XOR:
[0, 0] -> 0
[0, 1] -> 1
[1, 0] -> 1
[1, 1] -> 0
Initial Weights (random):
w1 = 0.12
w2 = 0.45
b = -0.08
... (training) ...
Epoch 100: Errors: 1/4
Epoch 200: Errors: 2/4
Epoch 500: Errors: 1/4
Epoch 1000: Still not converged!
========================================
TRAINING FAILED (as expected)
========================================
XOR is not linearly separable.
A single perceptron cannot learn XOR.
You need hidden layers (multi-layer perceptron).
This is the Minsky-Papert limitation!
Solution Architecture
High-Level Design Approach
This section describes what your solution should look like, not how to implement it.
Architecture Diagram:

                  PERCEPTRON TRAINING SYSTEM

Training Data                      PERCEPTRON
+--------------------+     +---------------------------+
| Inputs   | Targets |     |   w1     w2     b         |
| [0,0]    |   0     |     |    \      |     /         |
| [0,1]    |   1     |---->|  z = x1*w1 + x2*w2 + b    |
| [1,0]    |   1     |     |  y = step(z)              |
| [1,1]    |   1     |     |    = 1 if z >= 0 else 0   |
+----------+---------+     +-------------+-------------+
        |                                |
        |                                v
        |                       +------------------+
        |                       |   Prediction y   |
        |                       +--------+---------+
        |                                |
        v                                v
+-----------------------------------------------------+
|                  ERROR CALCULATION                   |
|              error = target - prediction             |
+--------------------------+--------------------------+
                           | if error != 0
                           v
+-----------------------------------------------------+
|              WEIGHT UPDATE (Delta Rule)              |
|     w1 = w1 + (learning_rate * error * x1)           |
|     w2 = w2 + (learning_rate * error * x2)           |
|     b  = b  + (learning_rate * error * 1)            |
+--------------------------+--------------------------+
                           | loop until all predictions correct
                           v
+-----------------------------------------------------+
|                  CONVERGENCE CHECK                   |
|  If all 4 inputs predict correctly: STOP (trained)   |
|  Else: continue to the next epoch                    |
+-----------------------------------------------------+
Data Structures Needed

1. TRAINING DATA

   +---------+---------+
   | inputs  | targets |
   +---------+---------+
   | [0, 0]  | 0 or 1  |
   | [0, 1]  | 0 or 1  |   <- targets depend on the gate
   | [1, 0]  | 0 or 1  |
   | [1, 1]  | 0 or 1  |
   +---------+---------+

2. MODEL PARAMETERS (floats, updated during training)
   - w1: weight for input 1
   - w2: weight for input 2
   - b: bias

3. HYPERPARAMETERS (constants, set before training)
   - learning_rate: typically 0.1 to 1.0
   - max_epochs: limit iterations (e.g., 1000)
Function Breakdown

step(z) -> int
    Input:  z (weighted sum, float)
    Output: 0 or 1
    Logic:  return 1 if z >= 0 else 0

forward(x1, x2, w1, w2, b) -> (z, y)
    Input:  inputs x1, x2; weights w1, w2; bias b
    Output: weighted sum z, prediction y
    Logic:  z = x1*w1 + x2*w2 + b
            y = step(z)

update_weights(w1, w2, b, x1, x2, error, lr) -> tuple
    Input:  current weights, inputs, error, learning rate
    Output: new (w1, w2, b)
    Logic:  apply the Delta Rule

train(data, targets, lr, max_epochs) -> (w1, w2, b)
    Input:  training data, targets, hyperparameters
    Output: trained weights and bias
    Logic:  loop over epochs, update on errors, check convergence

test(data, targets, w1, w2, b) -> bool
    Input:  test data, expected outputs, trained params
    Output: True if all correct, False otherwise
    Logic:  run the forward pass on each input and compare
Data Flow Diagram

TRAINING FLOW

Initialize: random w1, w2, b
        |
        v
For each epoch: reset the epoch error counter
        |
        v
For each sample: get (x1, x2) and its target
        |
        v
Forward pass:  z = x1*w1 + x2*w2 + b
               y = step(z)
        |
        v
Calculate error:  err = target - y
        |
        +--- err != 0 --> update weights
        +--- err == 0 --> no change, continue
        |
        v
Next sample, or next epoch when all samples are done
        |
        +--- all correct this epoch --> STOP, return weights
        +--- still errors           --> continue training
Phased Implementation Guide
Phase 1: Forward Pass (1-2 hours)
Goal: Implement the core computation of a neuron.
- Write the step function:
  - Takes a single float z
  - Returns 1 if z >= 0, else 0
- Write the forward function:
  - Takes inputs x1, x2 and parameters w1, w2, b
  - Computes z = x1*w1 + x2*w2 + b
  - Returns step(z)
- Test manually:

  # With w1=0.5, w2=0.5, b=-0.75
  # forward(0, 0, 0.5, 0.5, -0.75) should return 0 (z=-0.75)
  # forward(1, 1, 0.5, 0.5, -0.75) should return 1 (z=0.25)
Checkpoint: You should be able to manually set weights that make the forward function behave like AND or OR.
Phase 2: Error Calculation (30 minutes)
Goal: Compute how wrong the prediction is.
- Write an error function or just compute inline:

  error = target - prediction

- Verify all cases:

  target=0, pred=0 -> error=0  (correct, no update)
  target=1, pred=1 -> error=0  (correct, no update)
  target=1, pred=0 -> error=1  (increase output)
  target=0, pred=1 -> error=-1 (decrease output)
Checkpoint: Given a prediction and target, you should know whether and how to update.
Phase 3: Weight Updates (1 hour)
Goal: Implement the Delta Rule.
- Write the update_weights function:

  def update_weights(w1, w2, b, x1, x2, error, learning_rate):
      w1_new = w1 + learning_rate * error * x1
      w2_new = w2 + learning_rate * error * x2
      b_new = b + learning_rate * error * 1  # bias input is always 1
      return w1_new, w2_new, b_new

- Test the update logic:
- If error=1, x1=1, lr=0.1: w1 should increase by 0.1
- If error=-1, x1=1, lr=0.1: w1 should decrease by 0.1
- If x1=0: w1 should not change (0 * anything = 0)
Checkpoint: Weights change in the right direction based on error.
Phase 4: Training Loop (1-2 hours)
Goal: Repeat forward -> error -> update until convergence.
- Define training data for AND, OR, NAND, NOR gates
- Initialize weights randomly (small values, e.g., -1 to 1)
- Implement the epoch loop:
  for epoch in range(max_epochs):
      errors_this_epoch = 0
      for (x1, x2), target in zip(inputs, targets):
          z, prediction = forward(...)
          error = target - prediction
          if error != 0:
              update_weights(...)
              errors_this_epoch += 1
      if errors_this_epoch == 0:
          print("Converged!")
          break

- Add verbose logging to see learning progress
Checkpoint: Running the training loop on OR should converge within ~100 epochs.
Phase 5: Testing and Validation (1 hour)
Goal: Verify the trained perceptron works correctly.
- After training, run all 4 inputs through forward pass
- Compare to expected truth table
- Print pass/fail for each
Checkpoint: All 4 tests pass for AND, OR, NAND, NOR. XOR should fail to converge.
Questions to Guide Your Design
Before implementing, think through these:
Understanding the Algorithm
- Why random initialization?
- What happens if you start with all zeros?
- Why not start with "good" weights?
- What does the learning rate control?
- What happens if learning_rate = 0?
- What happens if learning_rate = 100?
- Why is 0.1 a common choice?
- Why iterate through all samples before checking convergence?
- Could you check after each sample?
- What's the difference between "epoch" and "iteration"?
Understanding the Math
- Why multiply error by input in the update rule?
- What happens to w1 when x1=0?
- Why is this mathematically correct?
- How does the bias differ from weights?
- What does the bias "shift"?
- Why don't we multiply the bias update by an input?
- What does the decision boundary look like geometrically?
- Draw the boundary for a trained AND gate
- How do the weights define its slope?
Understanding the Limits
- Why can't a perceptron learn XOR?
- Draw the 4 XOR points and try to separate them with a line
- What would you need to separate them?
- What's the minimum number of weights to learn a 2-input gate?
- Could you do it with just w1 and w2 (no bias)?
- When is bias essential?
Thinking Exercise
Before coding, trace this by hand:
Starting with:
- w1 = 0.5
- w2 = 0.5
- b = -0.75
- learning_rate = 0.1
- Training for AND gate: (0,0)->0, (0,1)->0, (1,0)->0, (1,1)->1
Epoch 1 Trace:
| Input | z = x1*w1 + x2*w2 + b | y = step(z) | Target | Error | New w1 | New w2 | New b |
|---|---|---|---|---|---|---|---|
| (0,0) | 0*0.5 + 0*0.5 - 0.75 = -0.75 | 0 | 0 | 0 | 0.5 | 0.5 | -0.75 |
| (0,1) | 0*0.5 + 1*0.5 - 0.75 = -0.25 | 0 | 0 | 0 | 0.5 | 0.5 | -0.75 |
| (1,0) | 1*0.5 + 0*0.5 - 0.75 = -0.25 | 0 | 0 | 0 | 0.5 | 0.5 | -0.75 |
| (1,1) | 1*0.5 + 1*0.5 - 0.75 = 0.25 | 1 | 1 | 0 | 0.5 | 0.5 | -0.75 |
Result: All correct on epoch 1! The initial weights happened to be good.
Now try with different starting weights:
- w1 = -0.2
- w2 = 0.3
- b = 0.1
Trace Epoch 1:
| Input | z | y | Target | Error | Update | New w1 | New w2 | New b |
|---|---|---|---|---|---|---|---|---|
| (0,0) | 0*(-0.2) + 0*0.3 + 0.1 = 0.1 | 1 | 0 | -1 | Yes | ? | ? | ? |
| ... |  |  |  |  |  |  |  |  |
Your task: Complete this trace for all 4 inputs of epoch 1. Then continue to epoch 2.
Questions while tracing:
- Which weight changed the most after the first error?
- Why didn't w1 change when processing (0,0)?
- How many epochs until all 4 are correct? (You can check your trace with the script below.)
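Once you have completed the trace by hand, a short script like the sketch below (variable names are illustrative) can confirm your numbers epoch by epoch:

# Checks the hand trace for the AND gate, starting from the second set of weights above.
w1, w2, b, lr = -0.2, 0.3, 0.1, 0.1
inputs = [(0, 0), (0, 1), (1, 0), (1, 1)]
targets = [0, 0, 0, 1]  # AND

for epoch in range(1, 11):
    errors = 0
    for (x1, x2), t in zip(inputs, targets):
        z = x1 * w1 + x2 * w2 + b
        y = 1 if z >= 0 else 0    # note: a z of exactly 0 can come out slightly
        error = t - y             # different in floats than in hand arithmetic
        if error != 0:
            w1 += lr * error * x1
            w2 += lr * error * x2
            b += lr * error
            errors += 1
        print(f"epoch {epoch}  ({x1},{x2})  z={z:+.2f}  y={y}  target={t}  "
              f"error={error:+d}  ->  w1={w1:.2f} w2={w2:.2f} b={b:.2f}")
    if errors == 0:
        break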
Testing Strategy
Unit Tests for Each Function
# Test step function
assert step(-1) == 0
assert step(0) == 1 # boundary case: z >= 0
assert step(0.001) == 1
assert step(-0.001) == 0
# Test forward pass
z, y = forward(0, 0, 1, 1, -1.5) # Mimics AND
assert y == 0
z, y = forward(1, 1, 1, 1, -1.5)
assert y == 1
# Test update rule
w1, w2, b = 0.5, 0.5, 0
w1, w2, b = update_weights(w1, w2, b, 1, 0, 1, 0.1)
assert w1 == 0.6 # increased because x1=1, error=1
assert w2 == 0.5 # unchanged because x2=0
assert b == 0.1 # increased because error=1
Integration Test: Train and Verify
# Train on OR gate
inputs = [(0,0), (0,1), (1,0), (1,1)]
targets = [0, 1, 1, 1]
w1, w2, b = train(inputs, targets, learning_rate=0.1, max_epochs=1000)
# Verify all predictions
for (x1, x2), target in zip(inputs, targets):
_, prediction = forward(x1, x2, w1, w2, b)
assert prediction == target, f"Failed on {(x1, x2)}"
Convergence Test
# AND, OR, NAND, NOR should all converge
for gate_name, gate_targets in [("AND", [0,0,0,1]), ("OR", [0,1,1,1]), ...]:
w1, w2, b, epochs = train_with_count(inputs, gate_targets, ...)
assert epochs < 1000, f"{gate_name} didn't converge"
# XOR should NOT converge
w1, w2, b, epochs = train_with_count(inputs, [0,1,1,0], max_epochs=1000)
assert epochs == 1000, "XOR unexpectedly converged!"
Common Pitfalls and Debugging Tips
Pitfall 1: Off-by-One in Step Function
Symptom: Inconsistent results at z=0
Cause: Using > instead of >= or vice versa
Fix: Decide on convention (usually z >= 0 -> 1) and stick to it
Pitfall 2: Forgetting to Update Bias
Symptom: Model doesn't converge or converges slowly
Cause: Only updating w1 and w2, not b
Fix: Remember: b = b + lr * error * 1
Pitfall 3: Wrong Sign in Update Rule
Symptom: Error gets worse instead of better
Cause: Using prediction - target instead of target - prediction
Fix: Error should be positive when prediction is too low
Pitfall 4: Not Iterating Until Convergence
Symptom: Model seems random
Cause: Only running one epoch
Fix: Loop until zero errors in an epoch (or max epochs)
Pitfall 5: Learning Rate Too High
Symptom: Weights oscillate wildly, never settle
Cause: learning_rate > 1 or very large values
Fix: Use lr in range 0.01 to 1.0 (start with 0.1)
Pitfall 6: Learning Rate Too Low
Symptom: Takes thousands of epochs to converge
Cause: learning_rate too small (e.g., 0.001)
Fix: For simple logic gates, 0.1 to 1.0 works well
Debugging Technique: Print Everything
When stuck, print at each step:
print(f"Input: ({x1}, {x2})")
print(f"Weights before: w1={w1:.3f}, w2={w2:.3f}, b={b:.3f}")
print(f"z = {x1}*{w1} + {x2}*{w2} + {b} = {z:.3f}")
print(f"y = step({z:.3f}) = {y}")
print(f"Target: {target}, Error: {error}")
if error != 0:
print(f"Updating: w1 += {lr}*{error}*{x1} = {lr*error*x1:.3f}")
The Interview Questions They'll Ask
Prepare to answer these:
1. "Explain how a perceptron learns. Walk me through one update step."
Key points to cover:
- Forward pass: weighted sum + step function
- Error calculation: target - prediction
- Weight update: Delta Rule (w += lr * error * input)
- Why inputs of 0 don't change their weights
2. "What is the decision boundary of a perceptron?"
Key insight:
- It's a hyperplane (a line in 2D) defined by w1*x1 + w2*x2 + b = 0
- Weights define the orientation (slope)
- Bias shifts the line
3. "Why can't a single perceptron learn XOR?"
Key insight:
- XOR is not linearly separable
- Positive examples are on opposite corners
- No single line can separate them
- Need hidden layers (MLP) to create non-linear boundaries
4. "What's the difference between a perceptron and a modern neural network neuron?"
Key insight:
- Perceptron: step function (non-differentiable)
- Modern: sigmoid/ReLU (differentiable for gradient descent)
- Perceptron: single layer
- Modern: multiple layers with backpropagation
5. "What is the role of the bias term?"
Key insight:
- Bias shifts the decision boundary away from the origin
- Without bias, the hyperplane must pass through origin
- Example: the AND gate needs a negative bias so the neuron fires only when both inputs are high
6. "How does the learning rate affect training?"
Key insight:
- Too high: overshoots, oscillates, may not converge
- Too low: converges slowly, may get stuck
- Just right: smooth convergence to solution
7. "What guarantees that a perceptron will converge?"
Key insight:
- The Perceptron Convergence Theorem (Novikoff, 1962)
- IF data is linearly separable
- THEN algorithm will converge in finite steps
- If not separable, it will loop forever (hence XOR failure)
Hints in Layers
Use these hints only when stuck. Try for at least 15 minutes before reading each hint.
Hint 1: Structure
Your main file should have:
- A function for the step activation
- A function for forward pass
- A function for weight updates
- A training loop that calls these
- A test function that verifies correctness
Hint 2: Initialization
Random initialization should be small values:
import random
w1 = random.uniform(-1, 1)
w2 = random.uniform(-1, 1)
b = random.uniform(-1, 1)
Hint 3: Training Data
Define your gates as dictionaries:
GATES = {
'AND': [0, 0, 0, 1],
'OR': [0, 1, 1, 1],
'NAND': [1, 1, 1, 0],
'NOR': [1, 0, 0, 0],
'XOR': [0, 1, 1, 0], # Will not converge!
}
INPUTS = [(0, 0), (0, 1), (1, 0), (1, 1)]
Hint 4: The Training Loop Pattern
for epoch in range(max_epochs):
total_error = 0
for (x1, x2), target in zip(inputs, targets):
# forward pass
# calculate error
# if error != 0: update weights
# accumulate error count
if total_error == 0:
break # Converged!
Hint 5: Edge Case - No Error
When prediction equals target, error is 0. The update equation:
w = w + lr * 0 * x = w + 0 = w
Weights don't change when you're already correct. This is important!
Extensions and Challenges
After completing the basic perceptron, try these:
Extension 1: 3-Input Gates
Implement AND3, OR3, MAJORITY (output 1 if 2+ inputs are 1).
- Now you have z = x1*w1 + x2*w2 + x3*w3 + b (a generalized forward-pass sketch follows below)
- Visualize in 3D (the decision boundary is a plane!)
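One way to generalize the forward pass to any number of inputs; this list-based forward_n signature is a sketch, not a required interface:

def forward_n(inputs, weights, bias):
    # Works for 2, 3, or any number of inputs.
    z = bias
    for x, w in zip(inputs, weights):
        z += x * w
    return 1 if z >= 0 else 0

# Hand-set weights for 3-input MAJORITY: fire when at least 2 of the 3 inputs are 1.
print(forward_n([1, 1, 0], [1, 1, 1], -1.5))  # 1
print(forward_n([1, 0, 0], [1, 1, 1], -1.5))  # 0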
Extension 2: NAND as Universal Gate
NAND is a universal gate - you can build any other gate from NANDs.
- Train a NAND perceptron
- Show how to compose them (manually) to make AND, OR, NOT
Extension 3: Visualization
Plot the decision boundary as training progresses:
- Use matplotlib to show the 2D input space
- Draw the line w1*x1 + w2*x2 + b = 0
- Update the plot each epoch to see the line move (a minimal plotting sketch follows below)
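A minimal plotting sketch, assuming matplotlib is installed; the weights shown are the example OR values from earlier and would be replaced by your own trained values:

import matplotlib.pyplot as plt

w1, w2, b = 1.07, 0.90, -0.12  # example trained OR weights; substitute your own

# The four inputs, colored by the OR gate's target output.
points = [(0, 0), (0, 1), (1, 0), (1, 1)]
targets = [0, 1, 1, 1]
for (x1, x2), t in zip(points, targets):
    plt.scatter(x1, x2, color="red" if t else "blue", s=100)

# Decision boundary w1*x1 + w2*x2 + b = 0, i.e. x2 = (-w1*x1 - b) / w2  (assumes w2 != 0)
xs = [-0.5, 1.5]
ys = [(-w1 * x - b) / w2 for x in xs]
plt.plot(xs, ys, "k--", label="decision boundary")

plt.xlabel("x1")
plt.ylabel("x2")
plt.legend()
plt.show()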
Extension 4: Multi-class (One-vs-All)
Instead of binary output, classify into 4 categories:
- Train 4 perceptrons, one for each class
- Output the class with highest weighted sum (before step)
Extension 5: Implement in C or Rust
Rewrite the perceptron in a low-level language:
- No garbage collection, manual memory
- Appreciate how simple the actual computation is
- Time the training - it should be microseconds
Extension 6: Two-Layer Perceptron
Build a simple 2-layer network to solve XOR:
- Hidden layer with 2 neurons
- Output layer with 1 neuron
- You'll need to implement backpropagation (preview of Project 5); a hand-wired sketch of the structure follows below
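Before tackling backpropagation, it can help to confirm that a two-layer structure really does solve XOR. The sketch below hand-wires the weights (XOR = AND(OR, NAND)) instead of learning them; learning them is the actual extension:

def neuron(x1, x2, w1, w2, b):
    return 1 if x1 * w1 + x2 * w2 + b >= 0 else 0

def xor_two_layer(x1, x2):
    h_or = neuron(x1, x2, 1.0, 1.0, -0.5)      # hidden neuron 1: OR
    h_nand = neuron(x1, x2, -1.0, -1.0, 1.5)   # hidden neuron 2: NAND
    return neuron(h_or, h_nand, 1.0, 1.0, -1.5)  # output neuron: AND

for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print((x1, x2), "->", xor_two_layer(x1, x2))
# Prints 0, 1, 1, 0 - the XOR truth table.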
Real-World Connections
Where Perceptrons Appear Today
- Spam Filters (Early Versions)
- Before deep learning, spam filters used linear classifiers
- Features: word counts, sender reputation
- Perceptron-style updates on misclassifications
- Credit Scoring (Logistic Regression)
- Banks use linear models for interpretability
- Similar to perceptron but with sigmoid activation
- Weights show which factors matter (income, debt ratio)
- Sentiment Analysis (Baseline)
- Count positive/negative words -> weighted sum -> decision
- Perceptron is the simplest baseline to beat
- Medical Triage
- Simple rule-based systems are essentially perceptrons
- "If blood pressure > X AND temperature > Y, alert doctor"
Why This Foundation Matters
Understanding the perceptron is essential because:
- Every deep learning layer IS a perceptron (plus non-linearity)
- A dense layer: each output neuron is z = w1*x1 + w2*x2 + ... + b
- You just learned the atom of neural networks
- Debugging deep networks requires this intuition
- When a model keeps underfitting no matter how long you train, you may be hitting a representational limit - the XOR problem at scale
- When weights explode or oscillate, it's usually a learning rate issue
- Interpretable AI often means simpler models
- Regulators want to know WHY a loan was denied
- Perceptrons are explainable: "these factors with these weights"
- Edge/embedded AI needs efficient models
- IoT devices can't run transformers
- Simple perceptron-style models fit in kilobytes
Books That Will Help
| Topic | Book | Chapter/Section |
|---|---|---|
| Perceptron fundamentals | Grokking Deep Learning by Andrew Trask | Ch. 3: "Introduction to Neural Prediction" |
| Mathematical foundations | Neural Networks and Deep Learning by Michael Nielsen | Ch. 1: "Using neural nets to recognize handwritten digits" |
| The Perceptron algorithm | Grokking Deep Learning by Andrew Trask | Ch. 4: "Introduction to Neural Learning" |
| Linear separability | Pattern Recognition and Machine Learning by Christopher Bishop | Ch. 4: "Linear Models for Classification" |
| History and context | Perceptrons by Minsky & Papert | Introduction and Ch. 1-3 (historical document) |
| Optimization theory | Deep Learning by Goodfellow, Bengio, Courville | Ch. 4.3: "Gradient-Based Optimization" |
| Python implementation | Data Science from Scratch by Joel Grus | Ch. 18: "Neural Networks" |
Online Resources
- 3Blue1Brown: "But what is a neural network?" (YouTube) - Excellent visualization
- Andrej Karpathy: "Neural Networks: Zero to Hero" - Modern perspective
- Michael Nielsen: neuralnetworksanddeeplearning.com - Free online book
Self-Assessment Checklist
Before moving to Project 2, verify you can:
Implementation Skills
- Write the step function without looking at notes
- Implement forward pass from scratch
- Apply the Delta Rule correctly
- Train to convergence on AND, OR, NAND, NOR
- Explain why XOR doesn't converge
Conceptual Understanding
- Draw the decision boundary for a trained perceptron
- Explain what each weight controls geometrically
- Describe what the bias shifts
- Define linear separability with an example
Mathematical Foundations
- Derive the Delta Rule update from error minimization intuition
- Calculate z by hand for given weights and inputs
- Predict whether a point is above or below the decision boundary
Conceptual Questions (Answer Without Looking)
- What's the output of step(-0.001)?
- If error=1 and x1=0, how much does w1 change?
- Why doesn't XOR work with a single perceptron?
- What happens if learning_rate = 0?
- How many parameters does a 2-input perceptron have?
- What's the role of bias in the decision boundary?
- Can a perceptron with 3 inputs learn the MAJORITY function?
Code Challenges (Try Without Hints)
- Modify your code to work with 3 inputs
- Add a function that plots the decision boundary
- Count how many epochs each gate needs on average (run 100 trials)
- Find the minimum learning rate that still converges in < 1000 epochs
Whatโs Next
You've built the atom of neural networks. But real learning happens when atoms combine into molecules.
Project 2: Gradient Descent Visualizer will show you:
- How optimization works in continuous (not binary) spaces
- Why we need derivatives
- What a "loss landscape" looks like
- How learning rate affects convergence
The perceptron used a simple error and a discrete step function. Modern networks use continuous loss functions and smooth activations - that's where calculus enters the picture.
Next: P02: Gradient Descent Visualizer - See optimization in action
Appendix: Logic Gate Truth Tables
For reference:
AND Gate: OR Gate: NAND Gate: NOR Gate:
x1 x2 | y x1 x2 | y x1 x2 | y x1 x2 | y
------+-- ------+-- ------+-- ------+--
0 0 | 0 0 0 | 0 0 0 | 1 0 0 | 1
0 1 | 0 0 1 | 1 0 1 | 1 0 1 | 0
1 0 | 0 1 0 | 1 1 0 | 1 1 0 | 0
1 1 | 1 1 1 | 1 1 1 | 0 1 1 | 0
XOR Gate (NOT linearly separable):
x1 x2 | y
------+--
0 0 | 0
0 1 | 1
1 0 | 1
1 1 | 0
XNOR Gate (NOT linearly separable):
x1 x2 | y
------+--
0 0 | 1
0 1 | 0
1 0 | 0
1 1 | 1
This project is part of the "AI Prediction & Neural Networks: From Math to Machine" learning path.