MATH CONCEPTS DEEP DIVE
This section provides detailed explanations of the core mathematical concepts that all 20 projects in this guide teach. Understanding these concepts deeply—not just procedurally—will transform you from someone who uses ML libraries to someone who truly understands what happens inside them.
Deep Dive: The Mathematics Behind Machine Learning
Why This Math Matters for Machine Learning
Before diving into each mathematical area, let’s understand why these specific topics are essential:
┌─────────────────────────────────────────────────────────────────────────┐
│ THE ML MATHEMATICS ECOSYSTEM │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ ALGEBRA LINEAR ALGEBRA CALCULUS │
│ ──────── ────────────── ──────── │
│ Variables as Vectors & Matrices Derivatives │
│ unknowns we solve ──▶ as data containers ──▶ measure change │
│ for in equations & transformations in predictions │
│ │ │ │ │
│ │ │ │ │
│ ▼ ▼ ▼ │
│ ┌─────────────────────────────────────────────────────────────────┐ │
│ │ OPTIMIZATION (GRADIENT DESCENT) │ │
│ │ The algorithm that makes neural networks "learn" │ │
│ │ by minimizing prediction errors │ │
│ └─────────────────────────────────────────────────────────────────┘ │
│ ▲ ▲ ▲ │
│ │ │ │ │
│ FUNCTIONS PROBABILITY EXPONENTS/LOGS │
│ ───────── ─────────── ────────────── │
│ Map inputs to Quantify uncertainty Scale data, │
│ outputs (the core in predictions, measure info, │
│ of all ML models) model noise enable learning │
│ │
└─────────────────────────────────────────────────────────────────────────┘
Every neural network is fundamentally:
- A composition of functions (algebra)
- Represented by matrices of weights (linear algebra)
- Trained by computing gradients (calculus)
- Making probabilistic predictions (probability)
- Optimized by gradient descent (optimization)
The math is not abstract theory; it is the actual implementation. When you call model.fit() in Keras, or run a training loop in PyTorch, these mathematical operations are what execute inside.
Algebra: The Language of Relationships
Algebra is the foundation upon which all higher mathematics rests. At its core, algebra is about expressing relationships between quantities using symbols, then manipulating those symbols to discover new truths.
Variables: Placeholders for Unknown or Changing Quantities
In ML, we use variables constantly:
- x represents input features (a single number or a vector of thousands)
- y represents the target we want to predict
- w (weights) and b (bias) are the parameters we learn
- θ (theta) represents all learnable parameters
THE FUNDAMENTAL ML EQUATION
ŷ = f(x; θ)
│ │ │ │
│ │ │ └── Parameters we learn (weights, biases)
│ │ └───── Input data (features)
│ └──────── The model (a function)
└──────────── Predicted output
Example: Linear model
ŷ = w₁x₁ + w₂x₂ + w₃x₃ + b
└────────┬─────────┘ │
│ │
Weighted sum of Bias term
input features (intercept)
Equations: Statements of Equality We Solve
An equation states that two expressions are equal. Solving equations means finding values that make this true.
SOLVING A LINEAR EQUATION
Find x such that: 3x + 7 = 22
Step 1: Subtract 7 from both sides
3x + 7 - 7 = 22 - 7
3x = 15
Step 2: Divide both sides by 3
3x/3 = 15/3
x = 5
Verification: 3(5) + 7 = 15 + 7 = 22 ✓
In ML, we don’t solve single equations—we solve systems of equations represented as matrices, or we use iterative methods (gradient descent) to find approximate solutions.
Inverse Operations: The Key to Solving
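Most operations have an inverse that “undoes” them, sometimes only on a restricted domain (the square root undoes squaring only for non-negative inputs, and only invertible matrices have an inverse):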
INVERSE OPERATIONS
Operation Inverse Why It Matters in ML
─────────────────────────────────────────────────────────────────
Addition (+) Subtraction (−) Bias adjustment
Multiplication (×) Division (÷) Weight scaling
Exponentiation (xⁿ) Roots (ⁿ√x) Feature engineering
Exponentiation (eˣ) Logarithm (ln x) Loss functions, gradients
Squaring (x²) Square root (√x) Distance metrics
Matrix mult (AB) Matrix inverse (A⁻¹) Solving linear systems
Reference: “Math for Programmers” by Paul Orland, Chapter 2, provides an excellent programmer-focused treatment of algebraic fundamentals.
Functions: The Heart of Computation
A function is a rule that takes an input and produces exactly one output. This is the most important concept for ML because every ML model is a function.
┌─────────────────────┐
│ │
INPUT ────────▶ │ FUNCTION │ ────────▶ OUTPUT
x │ f(x) │ y
│ │
│ "A machine that │
│ transforms x │
│ into y" │
└─────────────────────┘
┌────────────────────────────────────────────────────────────────────────┐
│ │
│ THE FUNCTION MACHINE ANALOGY │
│ ═══════════════════════════ │
│ │
│ INPUT │
│ │ │
│ ▼ │
│ ┌──────────────┐ │
│ │ ░░░░░░░░ │ ◄── Internal mechanism (the rule) │
│ │ ░ f(x) ░ │ │
│ │ ░░░░░░░░ │ │
│ └──────────────┘ │
│ │ │
│ ▼ │
│ OUTPUT │
│ │
│ Example: f(x) = x² │
│ │
│ f(2) = 4 "Put in 2, get out 4" │
│ f(3) = 9 "Put in 3, get out 9" │
│ f(-2) = 4 "Put in -2, still get 4" (same output!) │
│ │
└────────────────────────────────────────────────────────────────────────┘
Domain and Range: What Goes In, What Comes Out
DOMAIN AND RANGE VISUALIZATION
DOMAIN RANGE
(valid inputs) (possible outputs)
┌───────────┐ ┌───────────┐
│ │ │ │
│ x = 1 ──┼──────── f(x) = x² ─────────▶│── 1 │
│ x = 2 ──┼──────────────────── ───────▶│── 4 │
│ x = 3 ──┼────────────────────────────▶│── 9 │
│ x = -1 ──┼────────────────────────────▶│── 1 │
│ x = -2 ──┼────────────────────────────▶│── 4 │
│ │ │ │
│ All real │ │ Only y≥0 │
│ numbers │ │ (non-neg)│
└───────────┘ └───────────┘
Domain of x²: all real numbers ℝ
Range of x²: [0, ∞) non-negative reals
Function Composition: Combining Functions
Machine learning models are compositions of many functions. A neural network layer applies a linear function followed by a non-linear activation:
FUNCTION COMPOSITION: (g ∘ f)(x) = g(f(x))
"First apply f, then apply g to the result"
Example: f(x) = 2x, g(x) = x + 3
(g ∘ f)(4) = g(f(4))
= g(2·4)
= g(8)
= 8 + 3
= 11
Neural Network Layer as Composition:
────────────────────────────────────
output = σ(Wx + b)
│ └──┬──┘
│ │
│ └── Linear function: f(x) = Wx + b
│
└─────── Activation function: g(z) = σ(z)
Layer = g ∘ f = σ(Wx + b)
VISUAL: How composition works in a neural network
x ──▶ [W·x + b] ──▶ z ──▶ [σ(z)] ──▶ output
└───┬───┘ └──┬──┘
│ │
Linear part Nonlinear part
(matrix) (activation)
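The composition idea above can be sketched in a few lines of Python. This is a minimal illustration, not how any library implements layers: the helper name compose and the weight and bias values are assumptions chosen for the example, and the "layer" takes a single number rather than a vector.

```python
import math

def compose(g, f):
    """Return the function x -> g(f(x)): first apply f, then g."""
    return lambda x: g(f(x))

def f(x):
    return 2 * x

def g(x):
    return x + 3

g_after_f = compose(g, f)
print(g_after_f(4))  # 11, matching the worked example

# A one-input "layer": linear part followed by a sigmoid activation.
# The weight 0.5 and bias -1.0 are arbitrary illustrative values.
def linear(x, w=0.5, b=-1.0):
    return w * x + b

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

layer = compose(sigmoid, linear)
print(layer(2.0))  # linear(2.0) is 0.0, and sigmoid(0.0) is 0.5
```

Order matters: compose(f, g)(4) applies g first and gives 2·(4 + 3) = 14, not 11.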
Inverse Functions: Undoing Transformations
If f takes x to y, then f⁻¹ (the inverse) takes y back to x:
INVERSE FUNCTIONS
f
x ─────────────▶ y
◀───────────────
f⁻¹
f(x) = y means f⁻¹(y) = x
Example: f(x) = 2x + 3
Finding the inverse:
1. Write y = 2x + 3
2. Solve for x: x = (y - 3)/2
3. Swap x and y: y = (x - 3)/2
Therefore: f⁻¹(x) = (x - 3)/2
Verification: f(f⁻¹(x)) = 2·((x-3)/2) + 3 = x - 3 + 3 = x ✓
ML APPLICATION: Encoder-Decoder Networks
Input ──▶ [Encoder] ──▶ Latent Code ──▶ [Decoder] ──▶ Reconstruction
(compress) (decompress)
Encoder: z = f(x)    Decoder: x̂ = g(z), where g ≈ f⁻¹ (an approximate inverse)
Reference: “Math for Programmers” by Paul Orland, Chapter 3, covers functions from a visual, computational perspective.
Exponents and Logarithms: Growth and Scale
Exponents and logarithms are inverse operations that appear throughout ML—in activation functions, loss functions, learning rate schedules, and information theory.
Exponential Growth: The Power of Repeated Multiplication
EXPONENTIAL GROWTH
2¹ = 2
2² = 4
2³ = 8
2⁴ = 16
2⁵ = 32
2⁶ = 64
2⁷ = 128
2⁸ = 256
2⁹ = 512
2¹⁰ = 1024
│
1024 ┼ ╭
│ ╱
│ ╱
512 ┼ ╱
│ ╱
256 ┼ ╱
│ ╱
128 ┼ ╱
│ ╱
64 ┼ ╱
32 ┼ ╱╱
16 ┼ ╱╱╱
8 ┼ ╱╱╱
4 ┼ ╱╱╱
2 ┼ ╱╱╱
└───────────────────────────────────────────────
0 2 4 6 8 10
Key insight: Exponential growth starts slow, then EXPLODES
This is why:
- Neural network gradients can "explode" during training
- Compound interest seems slow then suddenly huge
- Viruses spread slowly, then overwhelm
The Natural Exponential: e^x
The number e ≈ 2.71828… is special because the derivative of eˣ is itself:
THE SPECIAL PROPERTY OF e^x
d/dx [eˣ] = eˣ
"The rate of growth is equal to the current value"
This is why e^x appears in:
- Sigmoid activation: σ(x) = 1/(1 + e^(-x))
- Softmax: softmax(xᵢ) = e^(xᵢ) / Σⱼ e^(xⱼ)
- Probability distributions: Normal(x) ∝ e^(-x²/2)
- Learning rate decay: lr(t) = lr₀ · e^(-λt)
SIGMOID FUNCTION (used in logistic regression, neural networks)
1 ┼─────────────────────────────────────────
│ ╭───────────
│ ╱╱╱
0.5 ┼────────────────────╱╱╱───────────────
│ ╱╱╱
│ ╱╱╱╱
0 ┼────╱╱╱──────────────────────────────────
└──────┼───────┼───────┼───────┼──────────
-4 -2 0 2 4
σ(x) = 1 / (1 + e^(-x))
- Maps any real number to (0, 1)
- Used for probabilities
- Derivative: σ'(x) = σ(x)(1 - σ(x))
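As a quick sanity check of the derivative identity above, the sketch below compares σ(x)(1 - σ(x)) with a finite-difference estimate. The helper name central_difference and the step size h are assumptions made for this illustration.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_prime(x):
    # The closed form from the text: sigma'(x) = sigma(x) * (1 - sigma(x))
    s = sigmoid(x)
    return s * (1.0 - s)

def central_difference(f, x, h=1e-5):
    # Numerical slope estimate; h is an arbitrary small step.
    return (f(x + h) - f(x - h)) / (2 * h)

for x in (-4.0, 0.0, 2.0):
    print(x, sigmoid(x), sigmoid_prime(x), central_difference(sigmoid, x))
```

The last two columns should agree to many decimal places, and the slope is largest at x = 0, where it equals 0.25.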
Logarithms: The Inverse of Exponentiation
Exponentiation computes “2 raised to the 3rd power is 8”; a logarithm answers the reverse question, “2 raised to what power gives 8?”:
LOGARITHMS AS INVERSE OPERATIONS
Exponential: 2³ = 8
Logarithmic: log₂(8) = 3
"2 to the power of WHAT equals 8?"
Answer: 3
THE RELATIONSHIP:
If b^y = x, then log_b(x) = y
b^(log_b(x)) = x (they undo each other)
log_b(b^x) = x (they undo each other)
COMMON LOGARITHMS IN ML:
log₂(x) - Base 2, used in information theory (bits)
log₁₀(x) - Base 10, used for order of magnitude
ln(x) - Natural log (base e), used in calculus/ML
ln(e) = 1
ln(1) = 0
ln(x) → -∞ as x → 0⁺ (ln(0) itself is undefined)
Why Logarithms Appear in Machine Learning
LOGARITHMS IN ML: THREE CRITICAL USES
1. CROSS-ENTROPY LOSS (Classification)
───────────────────────────────────────
L = -Σᵢ yᵢ · log(ŷᵢ)
Why log? Penalizes confident wrong predictions heavily:
Predicted True Loss Contribution
─────────────────────────────────────
ŷ = 0.99 y = 1 -log(0.99) = 0.01 (small penalty, correct!)
ŷ = 0.50 y = 1 -log(0.50) = 0.69 (medium penalty)
ŷ = 0.01 y = 1 -log(0.01) = 4.61 (HUGE penalty, very wrong!)
2. INFORMATION THEORY (Entropy, Mutual Information)
────────────────────────────────────────────────────
H(X) = -Σᵢ p(xᵢ) · log₂(p(xᵢ))
"How many bits do we need to encode X?"
Fair coin (50/50): H = -2·(0.5·log₂(0.5)) = 1 bit
Biased coin (99/1): H ≈ 0.08 bits (very predictable)
3. NUMERICAL STABILITY (Log-Sum-Exp Trick)
───────────────────────────────────────────
Problem: Computing Σᵢ e^(xᵢ) can overflow
Solution: Use log-sum-exp
log(Σᵢ e^(xᵢ)) = max(x) + log(Σᵢ e^(xᵢ - max(x)))
This is how softmax is actually computed in practice!
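The log-sum-exp identity above can be sketched with the standard library alone. This is one common way to write a stable softmax, offered as an illustration rather than as any particular framework's implementation; the function names and the example scores are assumptions.

```python
import math

def logsumexp(xs):
    # Shift by the maximum so the largest exponent is exp(0) = 1.
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))

def softmax(xs):
    # exp(x_i - logsumexp(x)) equals exp(x_i) / sum_j exp(x_j).
    lse = logsumexp(xs)
    return [math.exp(x - lse) for x in xs]

scores = [1000.0, 1001.0, 1002.0]
# Calling math.exp(1000.0) directly raises OverflowError;
# the shifted version below never evaluates it.
print(logsumexp(scores))
print(softmax(scores))
```

Because only differences between scores matter, softmax(scores) gives the same probabilities as softmax([0.0, 1.0, 2.0]).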
Logarithm Properties (Essential for ML Derivations)
LOGARITHM RULES
log(a·b) = log(a) + log(b) Product → Sum
log(a/b) = log(a) - log(b) Quotient → Difference
log(aⁿ) = n·log(a) Power → Multiply
log(1) = 0 Always true
log(base) = 1 log_b(b) = 1
WHY THIS MATTERS:
Computing P(w₁) × P(w₂) × P(w₃) × ... × P(wₙ)
Problem: This product gets TINY (underflows to 0)
Solution: Use logs!
log(P(w₁)·P(w₂)·...·P(wₙ)) = log(P(w₁)) + log(P(w₂)) + ... + log(P(wₙ))
Products become sums. No underflow!
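The underflow problem is easy to reproduce. In this sketch the list of 100 probabilities of 1e-5 each is made up purely for illustration.

```python
import math

# Hypothetical per-word probabilities (100 words, each with P = 1e-5).
probs = [1e-5] * 100

product = 1.0
for p in probs:
    product *= p
print(product)   # 0.0: the true value 1e-500 is too small for a float

# Summing logs keeps the same information in a representable range.
log_prob = sum(math.log(p) for p in probs)
print(log_prob)  # roughly -1151.29
```

Comparing two such sequences by log_prob works fine, whereas both raw products would collapse to 0.0 and become indistinguishable.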
Reference: “C Programming: A Modern Approach” by K. N. King, Chapter 7, covers the numerical representation of these values, while “Math for Programmers” Chapter 2 provides the mathematical intuition.
Trigonometry: Circles and Waves
Trigonometry connects angles to ratios, circles to waves, and appears in ML through signal processing, attention mechanisms, and positional encodings.
The Unit Circle: Where It All Begins
THE UNIT CIRCLE (radius = 1)
90° (π/2)
│
(0,1) │
╱╲ │
╱ ╲ │
╱ ╲ │
╱ ╲│
180° (π) ─────●────────●────────● 0° (0)
(-1,0) │(0,0) (1,0)
╲│╱
│
│
(0,-1)
270° (3π/2)
For any angle θ, the point on the unit circle is:
(cos(θ), sin(θ))
KEY VALUES:
θ = 0°: (cos(0), sin(0)) = (1, 0)
θ = 90°: (cos(90°), sin(90°)) = (0, 1)
θ = 180°: (cos(180°), sin(180°)) = (-1, 0)
θ = 270°: (cos(270°), sin(270°)) = (0, -1)
Sine and Cosine as Projections
SINE AND COSINE AS COORDINATES
│ ╱ point on circle
│ ╱ at angle θ
│ ●
│ ╱│
│ ╱ │
sin(θ) │ ╱ │ sin(θ) = y-coordinate
│ ╱ │ = "vertical projection"
│ ╱ θ │
│╱─────┼─────
cos(θ)
cos(θ) = x-coordinate
= "horizontal projection"
PYTHAGORAS ON THE UNIT CIRCLE:
sin²(θ) + cos²(θ) = 1 (always true!)
This is just x² + y² = r² = 1² on the unit circle
Sine and Cosine as Waves
SINE WAVE
1 ┼ ╭───╮ ╭───╮
│ ╱ ╲ ╱ ╲
│ ╱ ╲ ╱ ╲
0 ┼────────●─────────●─────────●─────────●────
│ 0 π 2π 3π
│ ╲ ╱ ╲ ╱
│ ╲ ╱ ╲ ╱
-1 ┼ ╰───╯ ╰───╯
y = sin(x)
Properties:
- Periodic: repeats every 2π
- Bounded: always between -1 and 1
- Smooth: infinitely differentiable
d/dx[sin(x)] = cos(x)
d/dx[cos(x)] = -sin(x)
COSINE WAVE (sine shifted by π/2)
1 ┼───╮ ╭───╮
│ ╲ ╱ ╲
│ ╲ ╱ ╲
0 ┼──────●─────────●─────────●──────
│ π/2 3π/2 5π/2
│ ╲ ╱ ╲
│ ╲ ╱ ╲
-1 ┼ ╰───╯ ╰
y = cos(x)
Why Trigonometry Matters for ML
TRIGONOMETRY IN MACHINE LEARNING
1. POSITIONAL ENCODING (Transformers)
─────────────────────────────────────
PE(pos, 2i) = sin(pos / 10000^(2i/d))
PE(pos, 2i+1) = cos(pos / 10000^(2i/d))
Why? Sine/cosine create unique patterns for each position
that the model can learn to decode.
Position 0: [sin(0), cos(0), sin(0), cos(0), ...]
= [0, 1, 0, 1, ...]
Position 1: [sin(1), cos(1), sin(1/10000^(2/d)), cos(1/10000^(2/d)), ...]
≈ [0.841, 0.540, ...]
2. ROTATION MATRICES (Computer Vision, Robotics)
────────────────────────────────────────────────
R(θ) = ┌ cos(θ) -sin(θ) ┐
│ │
└ sin(θ) cos(θ) ┘
Rotates any vector by angle θ counterclockwise.
3. FOURIER TRANSFORMS (Signal Processing, Audio)
────────────────────────────────────────────────
Any periodic signal (under mild conditions) can be decomposed into sine and cosine waves:
f(t) = Σₙ [aₙ·cos(nωt) + bₙ·sin(nωt)]
Used in: Audio processing, image compression, feature extraction
4. NEURAL NETWORK ACTIVATIONS
─────────────────────────────
Some networks use sin(x) as an activation (SIREN networks)
for implicit neural representations.
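The positional encoding formula from item 1 can be written out directly. This is a minimal sketch for a single position; the function name positional_encoding and the tiny model dimension d_model = 4 are assumptions made for readability.

```python
import math

def positional_encoding(pos, d_model):
    """Sinusoidal encoding for one position; d_model is assumed even."""
    pe = []
    for i in range(d_model // 2):
        angle = pos / (10000 ** (2 * i / d_model))
        pe.append(math.sin(angle))   # dimension 2i
        pe.append(math.cos(angle))   # dimension 2i + 1
    return pe

print(positional_encoding(0, 4))  # [0.0, 1.0, 0.0, 1.0]
print(positional_encoding(1, 4))  # starts with sin(1), cos(1)
```

Early dimension pairs oscillate quickly with position and later pairs slowly, which is what gives each position a distinct pattern.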
Reference: “Computer Graphics from Scratch” by Gabriel Gambetta covers trigonometry in the context of rotations and projections, which directly applies to ML transformations.
Linear Algebra: The Backbone of ML
Linear algebra is not optional for ML—it IS the implementation. Every neural network forward pass is matrix multiplication. Every weight update is vector arithmetic. Every dataset is a matrix.
Vectors: Direction and Magnitude
VECTORS AS ARROWS
A vector has both direction and magnitude (length).
v = [3, 4] means "go 3 right and 4 up"
4 │ ↗ v = [3,4]
│ ╱
│ ╱
│ ╱ ||v|| = √(3² + 4²) = √25 = 5
│ ╱
│ ╱
│ ╱
│ ╱
│╱θ
└────────────────────
3
Magnitude (length): ||v|| = √(v₁² + v₂² + ... + vₙ²)
Direction: θ = arctan(v₂/v₁) = arctan(4/3) ≈ 53.1°
VECTORS IN ML:
Feature vector: x = [height, weight, age]
[ 5.9, 160, 25 ]
Weight vector: w = [w₁, w₂, w₃]
[0.5, 0.3, 0.2]
Prediction: ŷ = w·x = w₁x₁ + w₂x₂ + w₃x₃
= 0.5(5.9) + 0.3(160) + 0.2(25)
= 2.95 + 48 + 5 = 55.95
Vector Operations
VECTOR OPERATIONS
1. ADDITION (element-wise)
───────────────────────────
[1, 2, 3] + [4, 5, 6] = [1+4, 2+5, 3+6] = [5, 7, 9]
Geometrically: "chain" the arrows
↗ b ↗ a+b
a ↗ ╱ ╱
╲ ╱ ╱
╲╱ = ╱
╱
a ↗
2. SCALAR MULTIPLICATION
─────────────────────────
3 × [1, 2] = [3, 6]
"Stretch the arrow by factor 3"
───▶ v
─────────────▶ 3v
3. DOT PRODUCT (crucial for ML!)
─────────────────────────────────
a·b = a₁b₁ + a₂b₂ + ... + aₙbₙ
[1, 2, 3] · [4, 5, 6] = 1×4 + 2×5 + 3×6 = 4 + 10 + 18 = 32
Geometric interpretation:
a·b = ||a|| × ||b|| × cos(θ)
where θ is the angle between vectors.
If a·b = 0, vectors are PERPENDICULAR (orthogonal)
If a·b > 0, the angle between them is less than 90° (similar directions)
If a·b < 0, the angle between them is greater than 90° (opposing directions)
DOT PRODUCT IS THE NEURON:
inputs weights dot product activation
[x₁] · [w₁] = Σ wᵢxᵢ + b → σ(z)
[x₂] [w₂] │
[x₃] [w₃] ▼
output
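The vector operations above translate almost one-to-one into code. The sketch below uses plain Python lists; the helper names are illustrative choices, and a numerical library would normally be used for real workloads.

```python
import math

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def norm(v):
    # Magnitude: square root of the dot product of v with itself.
    return math.sqrt(dot(v, v))

def cosine_similarity(a, b):
    # Rearranged from a.b = ||a|| * ||b|| * cos(theta).
    return dot(a, b) / (norm(a) * norm(b))

print(dot([1, 2, 3], [4, 5, 6]))          # 32
print(norm([3, 4]))                       # 5.0
print(cosine_similarity([1, 0], [0, 1]))  # 0.0, the vectors are perpendicular

# The prediction example from the text: features dotted with weights.
x = [5.9, 160, 25]
w = [0.5, 0.3, 0.2]
print(dot(w, x))  # approximately 55.95
```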
Matrices: Collections of Vectors, or Transformations
MATRICES AS DATA
A dataset with n samples and d features is an n×d matrix:
Feature 1 Feature 2 Feature 3
X = ┌ 5.1 3.5 1.4 ┐ Sample 1
│ 4.9 3.0 1.4 │ Sample 2
│ 4.7 3.2 1.3 │ Sample 3
│ ... ... ... │ ...
└ 5.9 3.0 5.1 ┘ Sample n
Shape: (n_samples, n_features)
MATRICES AS TRANSFORMATIONS
A 2×2 matrix transforms 2D vectors:
T = ┌ a b ┐ v = ┌ x ┐
│ │ │ │
└ c d ┘ └ y ┘
Tv = ┌ ax + by ┐
│ │
└ cx + dy ┘
Examples:
┌ 2 0 ┐ Scales x by 2, y unchanged
│ │
└ 0 1 ┘
┌ cos θ -sin θ ┐ Rotates by angle θ
│ │
└ sin θ cos θ ┘
┌ 1 k ┐ Shears (slants) by factor k
│ │
└ 0 1 ┘
Matrix Multiplication: The Core of Neural Networks
MATRIX MULTIPLICATION
C = A × B
If A is (m × n) and B is (n × p), then C is (m × p)
The (i,j) entry of C is the dot product of:
- Row i of A
- Column j of B
Example:
A = ┌ 1 2 ┐ B = ┌ 5 6 ┐
│ │ │ │
└ 3 4 ┘ └ 7 8 ┘
C = A × B = ┌ 1×5+2×7 1×6+2×8 ┐ = ┌ 19 22 ┐
│ │ │ │
└ 3×5+4×7 3×6+4×8 ┘ └ 43 50 ┘
C[0,0] = Row 0 of A · Col 0 of B = [1,2]·[5,7] = 5+14 = 19
C[0,1] = Row 0 of A · Col 1 of B = [1,2]·[6,8] = 6+16 = 22
...
NEURAL NETWORK LAYER AS MATRIX MULTIPLICATION:
Input: x = [x₁, x₂, x₃] (1×3 vector, or batch of them)
Weights: W = ┌ w₁₁ w₁₂ w₁₃ w₁₄ ┐
│ w₂₁ w₂₂ w₂₃ w₂₄ │ (3×4 matrix)
└ w₃₁ w₃₂ w₃₃ w₃₄ ┘
Output: z = xW + b (1×4 vector)
Each output neuron computes one dot product:
z₁ = x₁w₁₁ + x₂w₂₁ + x₃w₃₁ + b₁
z₂ = x₁w₁₂ + x₂w₂₂ + x₃w₃₂ + b₂
...
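To make the row-times-column rule concrete, here is a small sketch using nested lists. The function name matmul and the particular weight and bias values in the layer example are assumptions for illustration only.

```python
def matmul(A, B):
    """Multiply an (m x n) matrix by an (n x p) matrix, both as nested lists."""
    n = len(B)
    if any(len(row) != n for row in A):
        raise ValueError("inner dimensions must match")
    p = len(B[0])
    # Entry (i, j) is the dot product of row i of A and column j of B.
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(p)]
            for i in range(len(A))]

A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
print(matmul(A, B))  # [[19, 22], [43, 50]]

# Layer z = xW + b with a 1x3 input and 3x4 weights (arbitrary values).
x = [[1.0, 2.0, 3.0]]
W = [[1, 0, 0, 1],
     [0, 1, 0, 1],
     [0, 0, 1, 1]]
b = [0.5, 0.5, 0.5, 0.5]
z = [zj + bj for zj, bj in zip(matmul(x, W)[0], b)]
print(z)  # [1.5, 2.5, 3.5, 6.5]
```

Swapping the arguments gives a different result (matmul(B, A) is [[23, 34], [31, 46]]), a reminder that matrix multiplication is not commutative.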
Eigenvalues and Eigenvectors: The Directions That Don’t Rotate
EIGENVECTORS: SPECIAL DIRECTIONS
For a matrix A, an eigenvector v satisfies:
A·v = λ·v
"When A transforms v, v doesn't change direction,
only scales by factor λ (the eigenvalue)"
┌─────────────────────────────────────────────────────────┐
│ │
│ Regular vector: Eigenvector: │
│ │
│ v → Av → v → Av = λv → │
│ ↗ ╲↘ ↗ ─────────→ │
│ ╱ ╲ │ (same direction, │
│ ╱ ╲ │ just stretched) │
│ ╱ ╲ │ │
│ Direction Direction │ │
│ CHANGES SAME │ │
│ │
└─────────────────────────────────────────────────────────┘
WHY EIGENVECTORS MATTER FOR ML:
1. PCA: Eigenvectors of covariance matrix = principal components
(directions of maximum variance in data)
2. PageRank: The ranking vector is the dominant eigenvector
of the link matrix
3. Spectral Clustering: Uses eigenvectors of similarity matrix
4. Stability: Eigenvalues tell if gradients will explode/vanish
|λ| > 1: grows exponentially (exploding gradients)
|λ| < 1: shrinks exponentially (vanishing gradients)
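One simple way to find a dominant eigenvector numerically is power iteration: multiply a vector by A repeatedly and renormalize. The text does not describe this method, so treat it as a supplementary sketch; the 2×2 matrix, the starting vector, and the step count are arbitrary choices.

```python
import math

def matvec(A, v):
    return [sum(a * x for a, x in zip(row, v)) for row in A]

def power_iteration(A, steps=100):
    v = [1.0] + [0.0] * (len(A) - 1)   # arbitrary starting direction
    for _ in range(steps):
        w = matvec(A, v)
        length = math.sqrt(sum(x * x for x in w))
        v = [x / length for x in w]    # keep only the direction
    Av = matvec(A, v)
    # For a unit vector v, the Rayleigh quotient v.Av estimates the eigenvalue.
    eigenvalue = sum(a * x for a, x in zip(Av, v))
    return eigenvalue, v

A = [[2.0, 1.0], [1.0, 2.0]]
lam, v = power_iteration(A)
print(lam, v)  # close to 3 and to [0.707, 0.707]
```

For this matrix the eigenvalues are 3 and 1, so repeated multiplication amplifies the [1, 1] direction three times faster than the [1, -1] direction, which is the same growth-versus-shrink behavior described in item 4.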
Reference: “Math for Programmers” by Paul Orland, Chapters 5-7, covers vectors and matrices with visual intuition. “Linear Algebra Done Right” by Sheldon Axler provides deeper theoretical foundations.
Calculus: The Mathematics of Change
Calculus answers the question: “How does the output change when I change the input?” This is fundamental to ML because training is about adjusting parameters to change (reduce) the loss.
Derivatives: Rate of Change
THE DERIVATIVE AS SLOPE
The derivative f'(x) tells us the instantaneous rate of change:
f'(x) = lim (h→0) [f(x + h) - f(x)] / h
"As we make h infinitely small, what is the slope?"
GEOMETRIC INTERPRETATION:
f(x)
│ ╱ tangent line at x=a
│ ╱ (slope = f'(a))
│ ╱
│ ●─────────────
│ ╱╱│
│ ╱╱ │
│ ╱╱ │ f(a)
│╱╱──────┼─────────────────
a
The derivative f'(a) is the slope of the tangent line at x=a.
EXAMPLE: f(x) = x²
f'(x) = 2x
At x = 3: f'(3) = 6
"At x=3, f is increasing at rate 6"
"If we move right by 0.01, f increases by about 0.06"
At x = 0: f'(0) = 0
"At x=0, f is flat (minimum!)"
At x = -2: f'(-2) = -4
"At x=-2, f is decreasing at rate 4"
Common Derivatives (Memorize These)
DERIVATIVE RULES
Function Derivative Why it matters in ML
────────────────────────────────────────────────────────────
f(x) = c f'(x) = 0 Constant has no change
f(x) = x f'(x) = 1 Identity
f(x) = xⁿ f'(x) = nxⁿ⁻¹ Power rule (polynomials)
f(x) = eˣ f'(x) = eˣ Exponential (special!)
f(x) = ln(x) f'(x) = 1/x Log (in loss functions)
f(x) = sin(x) f'(x) = cos(x) Positional encoding
f(x) = cos(x) f'(x) = -sin(x) Positional encoding
σ(x) = 1/(1+e⁻ˣ) σ'(x) = σ(x)(1-σ(x)) Sigmoid activation
ReLU(x) = max(0,x) ReLU'(x) = {1 if x>0 ReLU activation
{0 if x≤0
COMBINATION RULES
Sum: (f + g)' = f' + g'
Product: (f·g)' = f'·g + f·g'
Chain: (f(g(x)))' = f'(g(x)) · g'(x) ← CRITICAL for backprop!
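The entries in the derivative table can be spot-checked numerically. The sketch below compares each closed form with a central-difference estimate at one arbitrarily chosen point, x = 1.5; the helper name derivative and the step size are assumptions.

```python
import math

def derivative(f, x, h=1e-6):
    """Central-difference estimate of f'(x)."""
    return (f(x + h) - f(x - h)) / (2 * h)

x = 1.5  # arbitrary test point (positive, so ln(x) and ReLU are well behaved)
checks = {
    "x^3":    (lambda t: t ** 3,      3 * x ** 2),
    "e^x":    (math.exp,              math.exp(x)),
    "ln(x)":  (math.log,              1 / x),
    "sin(x)": (math.sin,              math.cos(x)),
    "cos(x)": (math.cos,              -math.sin(x)),
    "ReLU":   (lambda t: max(0.0, t), 1.0),
}
for name, (f, exact) in checks.items():
    print(f"{name:7s} numeric={derivative(f, x):.6f} exact={exact:.6f}")
```

A check like this is also a handy debugging tool later on: comparing hand-derived gradients against finite differences catches most sign and indexing mistakes.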
The Chain Rule: The Heart of Backpropagation
THE CHAIN RULE
If y = f(g(x)), then:
dy/dx = (dy/du) · (du/dx)
where u = g(x)
"The derivative of a composition is the product of derivatives"
EXAMPLE: y = (3x + 2)⁵
Let u = 3x + 2, so y = u⁵
dy/du = 5u⁴
du/dx = 3
dy/dx = 5u⁴ · 3 = 15(3x + 2)⁴
WHY THIS IS BACKPROPAGATION:
In a neural network:
x → [Layer 1] → h → [Layer 2] → ŷ → [Loss] → L
To update Layer 1's weights, we need ∂L/∂W₁.
Chain rule:
∂L/∂W₁ = (∂L/∂ŷ) · (∂ŷ/∂h) · (∂h/∂W₁)
└──┬──┘ └──┬──┘ └──┬──┘
│ │ │
From loss Through Through
Layer 2 Layer 1
Gradients "flow backward" through the network!
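The worked example y = (3x + 2)⁵ can be checked in code by multiplying the two local derivatives and comparing with a numerical slope. The evaluation point x = 1 and the step size h are arbitrary choices for this sketch.

```python
def g(x):
    # Inner function: u = 3x + 2
    return 3 * x + 2

def f(u):
    # Outer function: y = u^5
    return u ** 5

def dy_dx(x):
    u = g(x)
    dy_du = 5 * u ** 4     # derivative of the outer function, evaluated at u
    du_dx = 3              # derivative of the inner function
    return dy_du * du_dx   # chain rule: multiply the local derivatives

x = 1.0
h = 1e-6
numeric = (f(g(x + h)) - f(g(x - h))) / (2 * h)
print(dy_dx(x), numeric)   # both close to 15 * 5**4 = 9375
```

Backpropagation does the same thing at scale: each layer contributes one local derivative, and the product is accumulated from the loss backward.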
Gradients: Derivatives in Multiple Dimensions
PARTIAL DERIVATIVES
For f(x, y), we can take derivatives with respect to each variable:
∂f/∂x = derivative treating y as constant
∂f/∂y = derivative treating x as constant
EXAMPLE: f(x, y) = x² + 3xy + y²
∂f/∂x = 2x + 3y (treating y as constant)
∂f/∂y = 3x + 2y (treating x as constant)
THE GRADIENT
The gradient ∇f collects all partial derivatives into a vector:
∇f = [∂f/∂x, ∂f/∂y, ∂f/∂z, ...]
For f(x, y) = x² + y²:
∇f = [2x, 2y]
At point (3, 4): ∇f = [6, 8]
CRITICAL PROPERTY:
The gradient points in the direction of STEEPEST ASCENT.
Therefore, to minimize f, we move in direction -∇f (steepest descent).
VISUALIZATION:
┌───────────────────────────────────────────────────────────┐
│ │
│ ∇f points "uphill" │
│ ↗ │
│ ╱ │
│ ●─╱─── Current point │
│ ╲ │
│ ╲ │
│ ↘ │
│ -∇f points "downhill" (direction we move) │
│ │
└───────────────────────────────────────────────────────────┘
Reference: “Calculus” by James Stewart provides comprehensive coverage. “Math for Programmers” by Paul Orland, Chapter 8, gives a programmer-focused treatment. “Neural Networks and Deep Learning” by Michael Nielsen (free online) explains backpropagation beautifully.
Probability: Reasoning Under Uncertainty
ML models don’t just make predictions—they reason about uncertainty. Probability provides the framework for this reasoning.
Random Variables and Distributions
RANDOM VARIABLES
A random variable X assigns numbers to random outcomes.
Example: X = "sum of two dice"
Possible values: 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12
P(X = 2) = 1/36 (only one way: 1+1)
P(X = 7) = 6/36 (six ways: 1+6, 2+5, 3+4, 4+3, 5+2, 6+1)
PROBABILITY DISTRIBUTIONS
A distribution describes how likely each value is.
DISCRETE (countable outcomes):
P(X=k)
│
0.17│ ●
│ ● ● ●
0.11│ ● ● ● ● ●
│ ● ● ● ● ● ● ● ●
│● ● ● ● ● ● ● ● ● ● ●
└──────────────────────
2 3 4 5 6 7 8 9 10 11 12
CONTINUOUS (any value in a range):
p(x)
│ ╭────╮
│ ╱ ╲
│ ╱ ╲
│ ╱ ╲
│ ╱ ╲
│╱ ╲
└──────────────────────────
μ
Normal Distribution
N(μ, σ²)
Expected Value: The Average Outcome
EXPECTED VALUE (MEAN)
E[X] = Σᵢ xᵢ · P(X = xᵢ) (discrete)
E[X] = ∫ x · p(x) dx (continuous)
"The weighted average of all possible outcomes"
EXAMPLE: Fair 6-sided die
E[X] = 1·(1/6) + 2·(1/6) + 3·(1/6) + 4·(1/6) + 5·(1/6) + 6·(1/6)
= (1 + 2 + 3 + 4 + 5 + 6) / 6
= 21/6
= 3.5
You'll never roll 3.5, but it's the "center of mass" of outcomes.
WHY IT MATTERS FOR ML:
Loss function = E[L(y, ŷ)]
We want to minimize the EXPECTED loss over the data distribution.
In practice:
Train loss ≈ (1/n) Σᵢ L(yᵢ, ŷᵢ)
This is a Monte Carlo estimate of E[L]!
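The link between an exact expectation and its Monte Carlo estimate can be seen with the die example. The sample size and random seed below are arbitrary choices for this sketch.

```python
import random

outcomes = [1, 2, 3, 4, 5, 6]

# Exact expected value: each outcome weighted by its probability 1/6.
expected = sum(x * (1 / 6) for x in outcomes)
print(expected)  # 3.5 up to floating-point rounding

# Monte Carlo estimate: average many simulated rolls.
random.seed(0)
rolls = [random.choice(outcomes) for _ in range(100_000)]
estimate = sum(rolls) / len(rolls)
print(estimate)  # close to 3.5, but not exactly equal
```

Averaging the loss over training examples works the same way: the samples stand in for a distribution that cannot be summed over exactly.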
Bayes’ Theorem: Updating Beliefs
BAYES' THEOREM
P(A|B) = P(B|A) · P(A)
─────────────
P(B)
┌─────────┐ ┌─────────┐ ┌─────────┐
│Posterior│ = │Likelihood│ × │ Prior │ ÷ P(Evidence)
│ P(A|B) │ │ P(B|A) │ │ P(A) │
└─────────┘ └─────────┘ └─────────┘
"Updated belief after seeing evidence"
SPAM FILTER EXAMPLE:
A = email is spam
B = email contains word "free"
Given:
P(spam) = 0.3 (30% of emails are spam)
P("free" | spam) = 0.8 (80% of spam has "free")
P("free" | not spam) = 0.1 (10% of ham has "free")
Question: Email contains "free". What's P(spam | "free")?
P("free") = P("free"|spam)·P(spam) + P("free"|not spam)·P(not spam)
= 0.8 × 0.3 + 0.1 × 0.7
= 0.24 + 0.07 = 0.31
P(spam | "free") = P("free"|spam) · P(spam) / P("free")
= 0.8 × 0.3 / 0.31
= 0.24 / 0.31
≈ 0.77
Seeing "free" raises spam probability from 30% to 77%!
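The spam calculation above fits in a few lines. The function and parameter names here are illustrative, and the sketch covers only the two-hypothesis case (spam versus not spam).

```python
def posterior(prior, likelihood, likelihood_given_not):
    """P(A|B) from P(A), P(B|A), and P(B|not A)."""
    # Total probability of the evidence B across both hypotheses.
    evidence = likelihood * prior + likelihood_given_not * (1 - prior)
    return likelihood * prior / evidence

p = posterior(prior=0.3, likelihood=0.8, likelihood_given_not=0.1)
print(round(p, 4))  # 0.7742
```

Feeding the result back in as the new prior models seeing a second, independent piece of evidence, which is how beliefs get updated step by step.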
Key Probability Distributions for ML
DISTRIBUTIONS YOU'LL ENCOUNTER
1. BERNOULLI: Single binary outcome
────────────────────────────────
P(X=1) = p, P(X=0) = 1-p
Used for: Binary classification output
2. NORMAL (GAUSSIAN): Bell curve
────────────────────────────────
p(x) = (1/√(2πσ²)) exp(-(x-μ)²/(2σ²))
Parameters: μ (mean), σ² (variance)
│ ╭───╮
│ ╱ ╲ 68% within ±1σ
│ ╱ ╲ 95% within ±2σ
│ ╱ ╲ 99.7% within ±3σ
│ ╱ ╲
└───────────────────
μ-σ μ μ+σ
Used for: Prior distributions, noise modeling, VAEs
3. CATEGORICAL: Multiple discrete outcomes
─────────────────────────────────────────
P(X=k) = pₖ, where Σₖ pₖ = 1
Used for: Multi-class classification (softmax output)
4. EXPONENTIAL: Time between events
───────────────────────────────────
p(x) = λe^(-λx) for x ≥ 0
Used for: Waiting times between events (the same e^(-λx) shape appears in exponential learning rate decay)
Reference: “Think Bayes” by Allen Downey provides an intuitive, computational approach to probability. “All of Statistics” by Larry Wasserman is a comprehensive reference.
Optimization: Making Machines Learn
All of machine learning reduces to optimization: define a loss function that measures how wrong your model is, then find parameters that minimize it.
Loss Functions: Measuring Error
LOSS FUNCTIONS
The loss L(y, ŷ) measures the difference between:
- True value y
- Predicted value ŷ
REGRESSION LOSSES:
Mean Squared Error (MSE):
L = (1/n) Σᵢ (yᵢ - ŷᵢ)²
- Penalizes large errors heavily (squared)
- Gradient: ∂L/∂ŷ = -2(y - ŷ)
Mean Absolute Error (MAE):
L = (1/n) Σᵢ |yᵢ - ŷᵢ|
- More robust to outliers
- Gradient: ∂L/∂ŷ = -sign(y - ŷ)
CLASSIFICATION LOSSES:
Binary Cross-Entropy:
L = -(1/n) Σᵢ [yᵢ log(ŷᵢ) + (1-yᵢ) log(1-ŷᵢ)]
- ŷ is predicted probability (from sigmoid)
- Heavily penalizes confident wrong predictions
- Gradient: ∂L/∂ŷ = (ŷ - y) / (ŷ(1-ŷ))
Categorical Cross-Entropy:
L = -(1/n) Σᵢ Σₖ yᵢₖ log(ŷᵢₖ)
- For multi-class (softmax output)
- y is one-hot encoded
LOSS LANDSCAPE VISUALIZATION:
Loss
│
│ ╲ ╱╲ ╱
│ ╲ ╱ ╲╱ Local minima
│ ● ╲
│ ╲
│ ● Global minimum (we want to find this!)
└────────────────────
Parameter θ
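The loss formulas above can be sketched directly in Python. The clipping constant eps in the cross-entropy function is an added safeguard that the formulas themselves do not include, and the sample values are made up.

```python
import math

def mse(y_true, y_pred):
    return sum((y - p) ** 2 for y, p in zip(y_true, y_pred)) / len(y_true)

def mae(y_true, y_pred):
    return sum(abs(y - p) for y, p in zip(y_true, y_pred)) / len(y_true)

def binary_cross_entropy(y_true, y_pred, eps=1e-12):
    total = 0.0
    for y, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1 - eps)   # clip so log(0) never occurs
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return -total / len(y_true)

y_true = [1.0, 2.0, 3.0]
y_pred = [1.0, 2.0, 6.0]    # two perfect predictions and one error of 3
print(mse(y_true, y_pred))  # 3.0: the single error is squared to 9
print(mae(y_true, y_pred))  # 1.0: the same error counts only as 3

# Three predictions for a true label of 1, from confident-right to confident-wrong.
print(binary_cross_entropy([1, 1, 1], [0.99, 0.5, 0.01]))
```

The MSE and MAE outputs show the outlier sensitivity described above: one large error dominates the squared loss far more than the absolute loss.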
Gradient Descent: Walking Downhill
GRADIENT DESCENT ALGORITHM
Goal: Find θ* that minimizes L(θ)
Algorithm:
1. Start with initial guess θ₀
2. Compute gradient ∇L(θ)
3. Update: θ ← θ - α·∇L(θ)
4. Repeat until convergence
α = learning rate (step size)
┌─────────────────────────────────────────────────────────────┐
│ │
│ GRADIENT DESCENT INTUITION │
│ │
│ Imagine you're blindfolded on a hill and want to find │
│ the lowest point. You can only feel the slope under │
│ your feet. │
│ │
│ Strategy: Always step in the direction that goes down │
│ most steeply. Eventually you'll reach a valley. │
│ │
│ Start here │
│ ↓ │
│ ●───→ Step 1 │
│ ╲ │
│ ●───→ Step 2 │
│ ╲ │
│ ●───→ Step 3 │
│ ╲ │
│ ● Minimum! │
│ │
└─────────────────────────────────────────────────────────────┘
THE UPDATE RULE IN DETAIL:
θ_new = θ_old - α · ∇L(θ_old)
- ∇L points "uphill" (direction of steepest increase)
- Subtracting moves us "downhill"
- α controls step size:
- Too small: slow convergence
- Too large: oscillation or divergence
- Just right: smooth convergence
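The update rule can be tried on the simple bowl f(x, y) = x² + y², whose gradient is [2x, 2y]. The starting point, step count, and the three learning rates below are illustrative choices.

```python
def grad_f(theta):
    """Gradient of f(x, y) = x^2 + y^2, which is [2x, 2y]."""
    return [2 * t for t in theta]

def gradient_descent(theta, lr, steps):
    for _ in range(steps):
        g = grad_f(theta)
        # theta_new = theta_old - lr * gradient
        theta = [t - lr * gi for t, gi in zip(theta, g)]
    return theta

start = [3.0, 4.0]
print(gradient_descent(start, lr=0.1, steps=100))    # very close to [0, 0]
print(gradient_descent(start, lr=0.001, steps=100))  # barely moved: too small
print(gradient_descent(start, lr=1.1, steps=100))    # enormous values: diverged
```

For this function each step multiplies the parameters by (1 - 2·lr), so lr = 0.1 shrinks them by 0.8 per step while lr = 1.1 flips the sign and grows them by 1.2 per step.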
Learning Rate: The Most Important Hyperparameter
LEARNING RATE EFFECTS
α too small: α too large:
Loss Loss
│ │ ╱╲ ╱╲
│╲ │ ╱ ╲ ╱ ╲
│ ╲ │ ╱ ╲╱ ╲
│ ╲ │ ╱ ↗ Diverges!
│ ╲ │╱
│ ╲ └────────────────
│ ╲ Iteration
│ ╲
│ ╲ Very slow!
└────────╲─────────────
Iteration
α just right:
Loss
│╲
│ ╲
│ ╲
│ ╲
│ ╲
│ ╲_______________ Converges!
└──────────────────────
Iteration
LEARNING RATE SCHEDULES:
Constant: α(t) = α₀
Step decay: α(t) = α₀ · 0.1^⌊t/step⌋ (drops by 10× every "step" iterations)
Exponential: α(t) = α₀ · e^(-λt)
Cosine: α(t) = α₀ · (1 + cos(πt/T)) / 2
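The schedules above are short enough to write as plain functions. In this sketch the default values (lr0, step, lam, T) are arbitrary, and step decay uses integer division so the rate drops in discrete jumps rather than continuously, which is an assumption about the intended behavior.

```python
import math

def step_decay(t, lr0=0.1, step=10, factor=0.1):
    # Integer division keeps the rate constant within each block of `step` iterations.
    return lr0 * factor ** (t // step)

def exponential_decay(t, lr0=0.1, lam=0.05):
    return lr0 * math.exp(-lam * t)

def cosine_schedule(t, lr0=0.1, T=100):
    # Starts at lr0 when t = 0 and reaches 0 when t = T.
    return lr0 * (1 + math.cos(math.pi * t / T)) / 2

for t in (0, 10, 50, 100):
    print(t, step_decay(t), exponential_decay(t), cosine_schedule(t))
```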
Convexity: When Optimization is Easy
CONVEX VS NON-CONVEX
CONVEX (bowl-shaped): NON-CONVEX (complex):
╲ ╱ ╱╲ ╱╲
╲ ╱ ╱ ╲ ╱ ╲
╲ ╱ ╱ ● ╲
● ● ●
Global min Local Local
(only one!) minima minima
Convex: Every local minimum is a global minimum, so gradient descent with a suitable learning rate converges to it.
Non-convex: May get stuck in local minima.
GOOD NEWS: Linear regression is convex!
L(w) = ||y - Xw||²
This is a quadratic in w, which is convex.
Gradient descent (or the normal equation) finds global optimum.
BAD NEWS: Neural networks are non-convex!
The loss landscape has many local minima, saddle points, and plateaus.
In practice, we often find "good enough" solutions.
SADDLE POINTS (in high dimensions):
╲ ╱
●
╱ ╲
Gradient = 0, but not a minimum.
Common in high-dimensional spaces.
Momentum-based and adaptive optimizers (Adam, RMSprop) help move past these.
Stochastic Gradient Descent: Scaling Up
BATCH VS STOCHASTIC GRADIENT DESCENT
Batch GD: Use ALL data to compute gradient
∇L = (1/n) Σᵢ ∇Lᵢ
Pro: Accurate gradient
Con: Slow for large datasets
Stochastic GD (SGD): Use ONE sample
∇L ≈ ∇Lᵢ (for random i)
Pro: Fast updates
Con: Noisy gradient, may not converge smoothly
Mini-batch GD: Use SOME samples (e.g., 32, 64, 128)
∇L ≈ (1/B) Σᵢ∈batch ∇Lᵢ
Best of both worlds!
- Fast (GPU can process batches in parallel)
- Smooth enough to converge
- Noise can help escape local minima!
VISUALIZATION:
Batch GD: Mini-batch SGD:
●──→──→──→──● ●──→──↗──↙──→──●
(smooth path) (noisy but gets there)
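Mini-batch gradient descent can be sketched on a one-feature linear regression. Everything specific here is an assumption for illustration: the synthetic data drawn around y = 2x + 1, the noise level, the batch size of 32, the learning rate, and the number of epochs.

```python
import random

random.seed(0)
# Synthetic data from y = 2x + 1 plus small Gaussian noise (made up for illustration).
xs = [random.uniform(-1, 1) for _ in range(200)]
data = [(x, 2 * x + 1 + random.gauss(0, 0.1)) for x in xs]

def minibatch_sgd(data, lr=0.1, batch_size=32, epochs=50):
    w, b = 0.0, 0.0
    data = list(data)  # local copy so shuffling does not affect the caller
    for _ in range(epochs):
        random.shuffle(data)
        for start in range(0, len(data), batch_size):
            batch = data[start:start + batch_size]
            # Gradient of the batch MSE with respect to w and b.
            grad_w = sum(-2 * x * (y - (w * x + b)) for x, y in batch) / len(batch)
            grad_b = sum(-2 * (y - (w * x + b)) for x, y in batch) / len(batch)
            w -= lr * grad_w
            b -= lr * grad_b
    return w, b

w, b = minibatch_sgd(data)
print(w, b)  # should land near 2 and 1
```

Setting batch_size to len(data) turns this into batch gradient descent, and setting it to 1 gives pure stochastic gradient descent, so the same loop covers all three variants.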
Reference: “Hands-On Machine Learning” by Aurélien Géron, Chapter 4, provides practical coverage of gradient descent and its variants. “Deep Learning” by Goodfellow, Bengio, and Courville, Chapters 4-8, gives theoretical depth.
Putting It All Together: The Mathematical Flow of a Neural Network
Now let’s see how all these concepts combine in a single neural network forward and backward pass:
COMPLETE MATHEMATICAL FLOW OF TRAINING
INPUT: x ∈ ℝⁿ (feature vector)
TARGET: y ∈ ℝ (true label)
PARAMETERS: W₁, b₁, W₂, b₂ (weight matrices and bias vectors)
═══════════════════════════════════════════════════════════════════
FORWARD PASS (Linear Algebra + Functions)
──────────────────────────────────────────
Layer 1:
z₁ = W₁ · x + b₁ ← Matrix multiplication (linear algebra)
a₁ = σ(z₁) ← Activation function (functions)
Layer 2 (output):
z₂ = W₂ · a₁ + b₂ ← Matrix multiplication
ŷ = σ(z₂) ← Sigmoid for probability (exp/log)
═══════════════════════════════════════════════════════════════════
LOSS COMPUTATION (Probability)
───────────────────────────────
L = -[y·log(ŷ) + (1-y)·log(1-ŷ)] ← Cross-entropy (probability)
═══════════════════════════════════════════════════════════════════
BACKWARD PASS (Calculus - Chain Rule)
──────────────────────────────────────
Output layer gradient:
∂L/∂z₂ = ŷ - y ← Derivative of loss + sigmoid
∂L/∂W₂ = ∂L/∂z₂ · a₁ᵀ ← Chain rule (outer product, same shape as W₂)
∂L/∂b₂ = ∂L/∂z₂
Hidden layer gradient (chain rule through):
∂L/∂a₁ = W₂ᵀ · ∂L/∂z₂ ← Gradient flows backward
∂L/∂z₁ = ∂L/∂a₁ ⊙ σ'(z₁) ← Element-wise with activation derivative
∂L/∂W₁ = ∂L/∂z₁ · xᵀ ← Chain rule (outer product, same shape as W₁)
∂L/∂b₁ = ∂L/∂z₁
═══════════════════════════════════════════════════════════════════
PARAMETER UPDATE (Optimization)
─────────────────────────────────
W₁ ← W₁ - α · ∂L/∂W₁ ← Gradient descent
b₁ ← b₁ - α · ∂L/∂b₁
W₂ ← W₂ - α · ∂L/∂W₂
b₂ ← b₂ - α · ∂L/∂b₂
═══════════════════════════════════════════════════════════════════
REPEAT for each batch until loss converges!
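The whole flow can be sketched for a tiny network (2 inputs, 2 sigmoid hidden units, 1 sigmoid output) in pure Python. This is a teaching sketch under stated assumptions, not a reference implementation: the starting weights, the single training example, the learning rate, and all function names are made up, and x is treated as a column vector as in z₁ = W₁ · x + b₁, so each weight gradient is an outer product with the same shape as its weight matrix.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(params, x):
    W1, b1, W2, b2 = params
    z1 = [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(W1, b1)]
    a1 = [sigmoid(z) for z in z1]
    z2 = sum(w * a for w, a in zip(W2, a1)) + b2
    return a1, sigmoid(z2)

def loss(params, x, y):
    _, y_hat = forward(params, x)
    return -(y * math.log(y_hat) + (1 - y) * math.log(1 - y_hat))

def backward(params, x, y):
    W1, b1, W2, b2 = params
    a1, y_hat = forward(params, x)
    dz2 = y_hat - y                                     # dL/dz2
    dW2 = [dz2 * a for a in a1]                         # dL/dW2 = dz2 * a1^T
    db2 = dz2
    da1 = [w * dz2 for w in W2]                         # dL/da1 = W2^T * dz2
    dz1 = [da * a * (1 - a) for da, a in zip(da1, a1)]  # times sigma'(z1)
    dW1 = [[d * xi for xi in x] for d in dz1]           # dL/dW1 = dz1 * x^T
    db1 = dz1
    return dW1, db1, dW2, db2

def step(params, grads, lr):
    W1, b1, W2, b2 = params
    dW1, db1, dW2, db2 = grads
    W1 = [[w - lr * g for w, g in zip(wr, gr)] for wr, gr in zip(W1, dW1)]
    b1 = [b - lr * g for b, g in zip(b1, db1)]
    W2 = [w - lr * g for w, g in zip(W2, dW2)]
    b2 = b2 - lr * db2
    return W1, b1, W2, b2

# Arbitrary small starting weights and one made-up training example.
initial = ([[0.1, -0.2], [0.4, 0.3]], [0.0, 0.0], [0.5, -0.5], 0.0)
x, y = [1.0, 2.0], 1.0

params = initial
loss_before = loss(params, x, y)
for _ in range(100):
    params = step(params, backward(params, x, y), lr=0.5)
loss_after = loss(params, x, y)
print(loss_before, loss_after)  # the loss should drop substantially
```

Because the output is a single neuron, W₂ is stored as a flat list of two weights rather than a 1×2 nested list. A real training loop would iterate over batches of many examples instead of repeating one.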
This is what happens inside model.fit(). Every concept we’ve covered—algebra, functions, exponents, linear algebra, calculus, probability, and optimization—comes together in this elegant mathematical dance.
When you complete these 20 projects, you won’t just understand this diagram—you’ll have built every component yourself.