MATH CONCEPTS DEEP DIVE

This section provides detailed explanations of the core mathematical concepts that all 20 projects in this guide teach. Understanding these concepts deeply—not just procedurally—will transform you from someone who uses ML libraries to someone who truly understands what happens inside them.

Deep Dive: The Mathematics Behind Machine Learning

Why This Math Matters for Machine Learning

Before diving into each mathematical area, let’s understand why these specific topics are essential:

┌─────────────────────────────────────────────────────────────────────────┐
│                    THE ML MATHEMATICS ECOSYSTEM                         │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                         │
│  ALGEBRA                    LINEAR ALGEBRA              CALCULUS        │
│  ────────                   ──────────────              ────────        │
│  Variables as              Vectors & Matrices          Derivatives      │
│  unknowns we solve     ──▶ as data containers     ──▶  measure change   │
│  for in equations          & transformations          in predictions   │
│        │                         │                          │          │
│        │                         │                          │          │
│        ▼                         ▼                          ▼          │
│  ┌─────────────────────────────────────────────────────────────────┐   │
│  │                    OPTIMIZATION (GRADIENT DESCENT)              │   │
│  │         The algorithm that makes neural networks "learn"        │   │
│  │               by minimizing prediction errors                   │   │
│  └─────────────────────────────────────────────────────────────────┘   │
│        ▲                         ▲                          ▲          │
│        │                         │                          │          │
│  FUNCTIONS                 PROBABILITY              EXPONENTS/LOGS     │
│  ─────────                 ───────────              ──────────────     │
│  Map inputs to            Quantify uncertainty      Scale data,        │
│  outputs (the core        in predictions,           measure info,      │
│  of all ML models)        model noise               enable learning    │
│                                                                         │
└─────────────────────────────────────────────────────────────────────────┘

Every neural network is fundamentally:

A composition of functions (algebra)
Represented by matrices of weights (linear algebra)
Trained by computing gradients (calculus)
Making probabilistic predictions (probability)
Optimized by gradient descent (optimization)

The math is not abstract theory; it is the actual implementation. When you call model.fit() in Keras, or run a training loop in PyTorch, these mathematical operations are what execute inside.

Algebra: The Language of Relationships

Algebra is the foundation upon which all higher mathematics rests. At its core, algebra is about expressing relationships between quantities using symbols, then manipulating those symbols to discover new truths.

Variables: Placeholders for Unknown or Changing Quantities

In ML, we use variables constantly:

x represents input features (a single number or a vector of thousands)
y represents the target we want to predict
w (weights) and b (bias) are the parameters we learn
θ (theta) represents all learnable parameters

THE FUNDAMENTAL ML EQUATION

     ŷ = f(x; θ)
     │   │  │  │
     │   │  │  └── Parameters we learn (weights, biases)
     │   │  └───── Input data (features)
     │   └──────── The model (a function)
     └──────────── Predicted output

Example: Linear model
     ŷ = w₁x₁ + w₂x₂ + w₃x₃ + b
         └────────┬─────────┘   │
                  │             │
         Weighted sum of    Bias term
         input features     (intercept)
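
The linear model above translates almost directly into code. The following is a minimal sketch: the function name `predict` and the sample weights, bias, and features are illustrative assumptions, not values from this guide.

```python
def predict(x, w, b):
    """Linear model: y_hat = w1*x1 + w2*x2 + ... + b."""
    if len(x) != len(w):
        raise ValueError("x and w must have the same length")
    return sum(wi * xi for wi, xi in zip(w, x)) + b


# Hypothetical parameters and one input with three features.
weights = [2.0, -1.0, 0.5]
bias = 1.0
features = [1.0, 3.0, 4.0]

y_hat = predict(features, weights, bias)
print(y_hat)  # 2*1 - 1*3 + 0.5*4 + 1 = 2.0
```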

Equations: Statements of Equality We Solve

An equation states that two expressions are equal. Solving equations means finding values that make this true.

SOLVING A LINEAR EQUATION

Find x such that: 3x + 7 = 22

Step 1: Subtract 7 from both sides
        3x + 7 - 7 = 22 - 7
        3x = 15

Step 2: Divide both sides by 3
        3x/3 = 15/3
        x = 5

Verification: 3(5) + 7 = 15 + 7 = 22 ✓

In ML, we don’t solve single equations—we solve systems of equations represented as matrices, or we use iterative methods (gradient descent) to find approximate solutions.
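
To make the "iterative methods" idea concrete, here is a small sketch that finds the same x = 5 without algebra, by repeatedly nudging a guess to shrink the squared error of 3x + 7 = 22. The starting guess, step size `lr`, and step count are arbitrary assumptions chosen so the loop converges.

```python
def solve_iteratively(steps=200, lr=0.01):
    """Approximate the solution of 3x + 7 = 22 by gradient descent."""
    x = 0.0  # arbitrary starting guess
    for _ in range(steps):
        residual = 3 * x + 7 - 22      # zero when the equation holds
        grad = 2 * residual * 3        # derivative of residual**2 w.r.t. x
        x -= lr * grad                 # step against the gradient
    return x


x_approx = solve_iteratively()
print(round(x_approx, 6))  # 5.0
```

The algebraic solution is exact and instant here; the iterative version matters because it still works when there are millions of unknowns and no closed form.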

Inverse Operations: The Key to Solving

Every operation has an inverse that “undoes” it:

INVERSE OPERATIONS

Operation           Inverse              Why It Matters in ML
─────────────────────────────────────────────────────────────────
Addition (+)        Subtraction (−)      Bias adjustment
Multiplication (×)  Division (÷)         Weight scaling
Powers (xⁿ)         Roots (ⁿ√x)          Feature engineering
Exponentiation (eˣ) Logarithm (ln x)     Loss functions, gradients
Squaring (x²)       Square root (√x)     Distance metrics
Matrix mult (AB)    Matrix inverse (A⁻¹) Solving linear systems

Reference: “Math for Programmers” by Paul Orland, Chapter 2, provides an excellent programmer-focused treatment of algebraic fundamentals.

Functions: The Heart of Computation

A function is a rule that takes an input and produces exactly one output. This is the most important concept for ML because every ML model is a function.

                    ┌─────────────────────┐
                    │                     │
    INPUT ────────▶ │      FUNCTION       │ ────────▶ OUTPUT
      x             │       f(x)          │             y
                    │                     │
                    │  "A machine that    │
                    │   transforms x      │
                    │   into y"           │
                    └─────────────────────┘

┌────────────────────────────────────────────────────────────────────────┐
│                                                                        │
│  THE FUNCTION MACHINE ANALOGY                                          │
│  ═══════════════════════════                                           │
│                                                                        │
│           INPUT                                                        │
│             │                                                          │
│             ▼                                                          │
│      ┌──────────────┐                                                  │
│      │   ░░░░░░░░   │  ◄── Internal mechanism (the rule)               │
│      │   ░ f(x) ░   │                                                  │
│      │   ░░░░░░░░   │                                                  │
│      └──────────────┘                                                  │
│             │                                                          │
│             ▼                                                          │
│          OUTPUT                                                        │
│                                                                        │
│  Example: f(x) = x²                                                    │
│                                                                        │
│      f(2) = 4       "Put in 2, get out 4"                              │
│      f(3) = 9       "Put in 3, get out 9"                              │
│      f(-2) = 4      "Put in -2, still get 4" (same output!)            │
│                                                                        │
└────────────────────────────────────────────────────────────────────────┘

Domain and Range: What Goes In, What Comes Out

DOMAIN AND RANGE VISUALIZATION

         DOMAIN                                    RANGE
    (valid inputs)                            (possible outputs)

    ┌───────────┐                             ┌───────────┐
    │           │                             │           │
    │  x = 1  ──┼──────── f(x) = x² ─────────▶│── 1       │
    │  x = 2  ──┼────────────────────────────▶│── 4       │
    │  x = 3  ──┼────────────────────────────▶│── 9       │
    │  x = -1 ──┼────────────────────────────▶│── 1       │
    │  x = -2 ──┼────────────────────────────▶│── 4       │
    │           │                             │           │
    │  All real │                             │  Only y≥0 │
    │  numbers  │                             │  (non-neg)│
    └───────────┘                             └───────────┘

    Domain of x²: all real numbers ℝ
    Range of x²: [0, ∞) non-negative reals

Function Composition: Combining Functions

Machine learning models are compositions of many functions. A neural network layer applies a linear function followed by a non-linear activation:

FUNCTION COMPOSITION: (g ∘ f)(x) = g(f(x))

"First apply f, then apply g to the result"

Example: f(x) = 2x, g(x) = x + 3

    (g ∘ f)(4) = g(f(4))
               = g(2·4)
               = g(8)
               = 8 + 3
               = 11

Neural Network Layer as Composition:
────────────────────────────────────

    output = σ(Wx + b)
             │ └──┬──┘
             │    │
             │    └── Linear function: f(x) = Wx + b
             │
             └─────── Activation function: g(z) = σ(z)

    Layer = g ∘ f = σ(Wx + b)

VISUAL: How composition works in a neural network

    x ──▶ [W·x + b] ──▶ z ──▶ [σ(z)] ──▶ output
          └───┬───┘          └──┬──┘
              │                 │
         Linear part      Nonlinear part
          (matrix)        (activation)
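
Composition is easy to express in code. The sketch below reproduces the worked example (g ∘ f)(4) = 11 and then builds a one-input "layer" the same way. The helper `compose` and the values of `w` and `b` are assumptions made for illustration.

```python
import math


def compose(g, f):
    """Return the function x -> g(f(x))."""
    return lambda x: g(f(x))


f = lambda x: 2 * x       # f(x) = 2x
g = lambda x: x + 3       # g(x) = x + 3
g_after_f = compose(g, f)
print(g_after_f(4))       # 11, matching the worked example


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


# A one-input "layer": linear part followed by activation.
# w and b are hypothetical values chosen so the result is easy to check.
w, b = 0.5, -1.0
linear = lambda x: w * x + b
layer = compose(sigmoid, linear)
print(layer(2.0))         # sigmoid(0.0) = 0.5
```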

Inverse Functions: Undoing Transformations

If f takes x to y, then f⁻¹ (the inverse) takes y back to x:

INVERSE FUNCTIONS

           f
    x ─────────────▶ y
    ◀───────────────
          f⁻¹

    f(x) = y   means   f⁻¹(y) = x

Example: f(x) = 2x + 3

    Finding the inverse:
    1. Write y = 2x + 3
    2. Solve for x: x = (y - 3)/2
    3. Swap x and y: y = (x - 3)/2

    Therefore: f⁻¹(x) = (x - 3)/2

    Verification: f(f⁻¹(x)) = 2·((x-3)/2) + 3 = x - 3 + 3 = x ✓
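
The same verification can be run numerically. This is a small sketch with hypothetical function names `f` and `f_inv`; the test points are arbitrary.

```python
def f(x):
    return 2 * x + 3


def f_inv(y):
    return (y - 3) / 2


# Applying f then f_inv (or the reverse) should return the starting value.
for x in [-2.0, 0.0, 1.5, 10.0]:
    assert abs(f_inv(f(x)) - x) < 1e-12
    assert abs(f(f_inv(x)) - x) < 1e-12

print(f_inv(f(1.5)))  # 1.5
```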

ML APPLICATION: Encoder-Decoder Networks

    Input ──▶ [Encoder] ──▶ Latent Code ──▶ [Decoder] ──▶ Reconstruction
              (compress)                    (decompress)
                f(x)                          ≈ f⁻¹(z)

Reference: “Math for Programmers” by Paul Orland, Chapter 3, covers functions from a visual, computational perspective.

Exponents and Logarithms: Growth and Scale

Exponents and logarithms are inverse operations that appear throughout ML—in activation functions, loss functions, learning rate schedules, and information theory.

Exponential Growth: The Power of Repeated Multiplication

EXPONENTIAL GROWTH

    2¹ = 2
    2² = 4
    2³ = 8
    2⁴ = 16
    2⁵ = 32
    2⁶ = 64
    2⁷ = 128
    2⁸ = 256
    2⁹ = 512
    2¹⁰ = 1024

              │
         1024 ┼                                           ╭
              │                                          ╱
              │                                         ╱
          512 ┼                                        ╱
              │                                       ╱
          256 ┼                                     ╱
              │                                   ╱
          128 ┼                                 ╱
              │                              ╱
           64 ┼                           ╱
           32 ┼                       ╱╱
           16 ┼                  ╱╱╱
            8 ┼             ╱╱╱
            4 ┼        ╱╱╱
            2 ┼   ╱╱╱
              └───────────────────────────────────────────────
              0    2    4    6    8   10

    Key insight: Exponential growth starts slow, then EXPLODES

    This is why:
    - Neural network gradients can "explode" during training
    - Compound interest seems slow then suddenly huge
    - Viruses spread slowly, then overwhelm

The Natural Exponential: e^x

The number e ≈ 2.71828… is special because the derivative of eˣ is itself:

THE SPECIAL PROPERTY OF e^x

    d/dx [eˣ] = eˣ

    "The rate of growth is equal to the current value"

This is why e^x appears in:
    - Sigmoid activation: σ(x) = 1/(1 + e^(-x))
    - Softmax: softmax(xᵢ) = e^(xᵢ) / Σⱼ e^(xⱼ)
    - Probability distributions: Normal(x) ∝ e^(-x²/2)
    - Learning rate decay: lr(t) = lr₀ · e^(-λt)

SIGMOID FUNCTION (used in logistic regression, neural networks)

         1   ┼─────────────────────────────────────────
             │                           ╭───────────
             │                        ╱╱╱
         0.5 ┼────────────────────╱╱╱───────────────
             │               ╱╱╱
             │          ╱╱╱╱
         0   ┼────╱╱╱──────────────────────────────────
             └──────┼───────┼───────┼───────┼──────────
                   -4      -2       0       2       4

    σ(x) = 1 / (1 + e^(-x))

    - Maps any real number to (0, 1)
    - Used for probabilities
    - Derivative: σ'(x) = σ(x)(1 - σ(x))
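
The sigmoid and its derivative identity can be checked numerically. The sketch below is one possible implementation, not a library's; it branches on the sign of x so that `math.exp` is never called with a large positive argument, and the step size `h` in the finite-difference check is an assumption.

```python
import math


def sigmoid(x):
    """Sigmoid that avoids overflow in exp for large |x|."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def sigmoid_prime(x):
    """Uses the identity sigma'(x) = sigma(x) * (1 - sigma(x))."""
    s = sigmoid(x)
    return s * (1.0 - s)


def numeric_derivative(fn, x, h=1e-6):
    """Central-difference estimate of the derivative."""
    return (fn(x + h) - fn(x - h)) / (2 * h)


for x in [-4.0, 0.0, 2.0]:
    # The last two columns should agree to several decimal places.
    print(x, sigmoid(x), sigmoid_prime(x), numeric_derivative(sigmoid, x))
```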

Logarithms: The Inverse of Exponentiation

Exponentiation computes a power from a base and an exponent; a logarithm recovers the exponent from the base and the result:

LOGARITHMS AS INVERSE OPERATIONS

    Exponential:    2³ = 8
    Logarithmic:    log₂(8) = 3

    "2 to the power of WHAT equals 8?"
    Answer: 3

THE RELATIONSHIP:

    If b^y = x, then log_b(x) = y

    b^(log_b(x)) = x     (they undo each other)
    log_b(b^x) = x       (they undo each other)

COMMON LOGARITHMS IN ML:

    log₂(x)   - Base 2, used in information theory (bits)
    log₁₀(x)  - Base 10, used for order of magnitude
    ln(x)     - Natural log (base e), used in calculus/ML

    ln(e) = 1
    ln(1) = 0
    ln(0) is undefined  (ln(x) → -∞ as x → 0⁺)

Why Logarithms Appear in Machine Learning

LOGARITHMS IN ML: THREE CRITICAL USES

1. CROSS-ENTROPY LOSS (Classification)
───────────────────────────────────────
    L = -Σᵢ yᵢ · log(ŷᵢ)

    Why log? Penalizes confident wrong predictions heavily:

    Predicted   True    Loss Contribution
    ─────────────────────────────────────
    ŷ = 0.99    y = 1   -log(0.99) = 0.01  (small penalty, correct!)
    ŷ = 0.50    y = 1   -log(0.50) = 0.69  (medium penalty)
    ŷ = 0.01    y = 1   -log(0.01) = 4.61  (HUGE penalty, very wrong!)


2. INFORMATION THEORY (Entropy, Mutual Information)
────────────────────────────────────────────────────
    H(X) = -Σᵢ p(xᵢ) · log₂(p(xᵢ))

    "How many bits do we need, on average, to encode an outcome of X?"

    Fair coin (50/50): H = -2·(0.5·log₂(0.5)) = 1 bit
    Biased coin (99/1): H ≈ 0.08 bits (very predictable)


3. NUMERICAL STABILITY (Log-Sum-Exp Trick)
───────────────────────────────────────────
    Problem: Computing Σᵢ e^(xᵢ) can overflow

    Solution: Use log-sum-exp

    log(Σᵢ e^(xᵢ)) = max(x) + log(Σᵢ e^(xᵢ - max(x)))

    This is how softmax is actually computed in practice!
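
The three uses above can be sketched with the standard library alone. The function names are assumptions for illustration; the numbers reproduce the values quoted in the tables above.

```python
import math


def cross_entropy(y_true, y_pred):
    """L = -sum_i y_i * log(y_hat_i), natural log; terms with y_i = 0 are skipped."""
    return -sum(y * math.log(p) for y, p in zip(y_true, y_pred) if y > 0)


def entropy_bits(probs):
    """H(X) = -sum_i p_i * log2(p_i), with 0 * log(0) treated as 0."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def logsumexp(xs):
    """log(sum_i exp(x_i)), computed by factoring out max(xs)."""
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))


def softmax(xs):
    """Softmax via log-sum-exp, so large inputs do not overflow."""
    lse = logsumexp(xs)
    return [math.exp(x - lse) for x in xs]


print(round(cross_entropy([1], [0.01]), 2))   # 4.61
print(entropy_bits([0.5, 0.5]))               # 1.0
print(round(entropy_bits([0.99, 0.01]), 2))   # 0.08
print(logsumexp([1000.0, 1000.0]))            # 1000 + ln 2; math.exp(1000) alone would overflow
print(softmax([1000.0, 1000.0]))              # two equal probabilities
```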

Logarithm Properties (Essential for ML Derivations)

LOGARITHM RULES

    log(a·b) = log(a) + log(b)     Product → Sum
    log(a/b) = log(a) - log(b)     Quotient → Difference
    log(aⁿ) = n·log(a)             Power → Multiply
    log(1) = 0                      Always true
    log(base) = 1                   log_b(b) = 1

WHY THIS MATTERS:

    Computing P(w₁) × P(w₂) × P(w₃) × ... × P(wₙ)

    Problem: This product gets TINY (underflows to 0)

    Solution: Use logs!
    log(P(w₁)·P(w₂)·...·P(wₙ)) = log(P(w₁)) + log(P(w₂)) + ... + log(P(wₙ))

    Products become sums. No underflow!
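
A quick experiment shows the underflow problem and the fix. The 1000 probabilities of 0.001 each are hypothetical values chosen so the true product (1e-3000) is far smaller than any float can represent.

```python
import math

# 1000 hypothetical word probabilities, each 0.001.
probs = [1e-3] * 1000

product = 1.0
for p in probs:
    product *= p            # true value is 1e-3000, below the float range
print(product)              # 0.0 (underflow)

log_prob = sum(math.log(p) for p in probs)
print(log_prob)             # about -6907.76, easily representable
```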

Reference: “C Programming: A Modern Approach” by K. N. King, Chapter 7, covers the numerical representation of these values, while “Math for Programmers” Chapter 2 provides the mathematical intuition.

Trigonometry: Circles and Waves

Trigonometry connects angles to ratios, circles to waves, and appears in ML through signal processing, attention mechanisms, and positional encodings.

The Unit Circle: Where It All Begins

THE UNIT CIRCLE (radius = 1)

                        90° (π/2)
                           │
                    (0,1)  │
                      ╱╲   │
                     ╱  ╲  │
                    ╱    ╲ │
                   ╱      ╲│
    180° (π) ─────●────────●────────● 0° (0)
              (-1,0)       │(0,0)  (1,0)
                          ╲│╱
                           │
                           │
                        (0,-1)
                        270° (3π/2)

For any angle θ, the point on the unit circle is:
    (cos(θ), sin(θ))

KEY VALUES:
    θ = 0°:    (cos(0), sin(0)) = (1, 0)
    θ = 90°:   (cos(90°), sin(90°)) = (0, 1)
    θ = 180°:  (cos(180°), sin(180°)) = (-1, 0)
    θ = 270°:  (cos(270°), sin(270°)) = (0, -1)

Sine and Cosine as Projections

SINE AND COSINE AS COORDINATES

           │        ╱ point on circle
           │       ╱  at angle θ
           │      ●
           │     ╱│
           │    ╱ │
    sin(θ) │   ╱  │ sin(θ) = y-coordinate
           │  ╱   │         = "vertical projection"
           │ ╱ θ  │
           │╱─────┼─────
                  cos(θ)

           cos(θ) = x-coordinate
                  = "horizontal projection"

PYTHAGORAS ON THE UNIT CIRCLE:

    sin²(θ) + cos²(θ) = 1  (always true!)

    This is just x² + y² = r² = 1² on the unit circle

Sine and Cosine as Waves

SINE WAVE

    1 ┼           ╭───╮               ╭───╮
      │          ╱     ╲             ╱     ╲
      │         ╱       ╲           ╱       ╲
    0 ┼────────●─────────●─────────●─────────●────
      │       0          π         2π         3π
      │         ╲       ╱           ╲       ╱
      │          ╲     ╱             ╲     ╱
   -1 ┼           ╰───╯               ╰───╯

    y = sin(x)

    Properties:
    - Periodic: repeats every 2π
    - Bounded: always between -1 and 1
    - Smooth: infinitely differentiable

    d/dx[sin(x)] = cos(x)
    d/dx[cos(x)] = -sin(x)

COSINE WAVE (sine shifted by π/2)

    1 ┼───╮               ╭───╮
      │    ╲             ╱     ╲
      │     ╲           ╱       ╲
    0 ┼──────●─────────●─────────●──────
      │      π/2       3π/2      5π/2
      │       ╲       ╱           ╲
      │        ╲     ╱             ╲
   -1 ┼         ╰───╯               ╰

    y = cos(x)

Why Trigonometry Matters for ML

TRIGONOMETRY IN MACHINE LEARNING

1. POSITIONAL ENCODING (Transformers)
─────────────────────────────────────
    PE(pos, 2i) = sin(pos / 10000^(2i/d))
    PE(pos, 2i+1) = cos(pos / 10000^(2i/d))

    Why? Sine/cosine create unique patterns for each position
    that the model can learn to decode.

    Position 0: [sin(0), cos(0), sin(0), cos(0), ...]
              = [0, 1, 0, 1, ...]

    Position 1: [sin(1), cos(1), sin(1/10000^(2/d)), cos(1/10000^(2/d)), ...]
              ≈ [0.841, 0.540, ...]


2. ROTATION MATRICES (Computer Vision, Robotics)
────────────────────────────────────────────────
    R(θ) = ┌ cos(θ)  -sin(θ) ┐
           │                 │
           └ sin(θ)   cos(θ) ┘

    Rotates any vector by angle θ counterclockwise.


3. FOURIER TRANSFORMS (Signal Processing, Audio)
────────────────────────────────────────────────
    Any periodic signal (under mild conditions) can be decomposed into sine and cosine waves:

    f(t) = Σₙ [aₙ·cos(nωt) + bₙ·sin(nωt)]

    Used in: Audio processing, image compression, feature extraction


4. NEURAL NETWORK ACTIVATIONS
─────────────────────────────
    Some networks use sin(x) as an activation (SIREN networks)
    for implicit neural representations.
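
The first two items can be sketched directly from their formulas. The function names and the tiny model dimension `d = 4` are assumptions for illustration; real Transformers use much larger dimensions.

```python
import math


def positional_encoding(pos, d):
    """Sinusoidal encoding for one position; d is the (even) model dimension."""
    pe = []
    for i in range(d // 2):
        angle = pos / (10000 ** (2 * i / d))
        pe.append(math.sin(angle))   # dimension 2i
        pe.append(math.cos(angle))   # dimension 2i + 1
    return pe


def rotate(v, theta):
    """Rotate the 2D vector v counterclockwise by theta radians."""
    x, y = v
    c, s = math.cos(theta), math.sin(theta)
    return (c * x - s * y, s * x + c * y)


print(positional_encoding(0, 4))          # [0.0, 1.0, 0.0, 1.0]
print(positional_encoding(1, 4))          # starts with sin(1), cos(1)
print(rotate((1.0, 0.0), math.pi / 2))    # close to (0, 1)
```

Note that the lowest dimensions oscillate fastest (their divisor is 10000^0 = 1), while the highest dimensions change very slowly with position.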

Reference: “Computer Graphics from Scratch” by Gabriel Gambetta covers trigonometry in the context of rotations and projections, which directly applies to ML transformations.

Linear Algebra: The Backbone of ML

Linear algebra is not optional for ML—it IS the implementation. Every neural network forward pass is matrix multiplication. Every weight update is vector arithmetic. Every dataset is a matrix.

Vectors: Direction and Magnitude

VECTORS AS ARROWS

    A vector has both direction and magnitude (length).

    v = [3, 4]  means "go 3 right and 4 up"

        4 │        ↗ v = [3,4]
          │       ╱
          │      ╱
          │     ╱   ||v|| = √(3² + 4²) = √25 = 5
          │    ╱
          │   ╱
          │  ╱
          │ ╱
          │╱θ
          └────────────────────
                   3

    Magnitude (length):  ||v|| = √(v₁² + v₂² + ... + vₙ²)
    Direction: θ = arctan(v₂/v₁) = arctan(4/3) ≈ 53.1°

VECTORS IN ML:

    Feature vector: x = [height, weight, age]
                        [  5.9,    160,  25 ]

    Weight vector:  w = [w₁, w₂, w₃]
                        [0.5, 0.3, 0.2]

    Prediction:     ŷ = w·x = w₁x₁ + w₂x₂ + w₃x₃
                        = 0.5(5.9) + 0.3(160) + 0.2(25)
                        = 2.95 + 48 + 5 = 55.95
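
These calculations can be reproduced in a few lines. This sketch uses plain Python lists as vectors; in practice a library such as NumPy would be used, but the arithmetic is the same.

```python
import math

v = [3.0, 4.0]
magnitude = math.sqrt(sum(c * c for c in v))
direction_deg = math.degrees(math.atan2(v[1], v[0]))
print(magnitude)                  # 5.0
print(round(direction_deg, 1))    # 53.1

# Feature and weight vectors from the example above.
x = [5.9, 160.0, 25.0]
w = [0.5, 0.3, 0.2]
y_hat = sum(wi * xi for wi, xi in zip(w, x))
print(round(y_hat, 2))            # 55.95
```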

Vector Operations

VECTOR OPERATIONS

1. ADDITION (element-wise)
───────────────────────────
    [1, 2, 3] + [4, 5, 6] = [1+4, 2+5, 3+6] = [5, 7, 9]

    Geometrically: "chain" the arrows

           ↗ b         ↗ a+b
    a ↗   ╱           ╱
      ╲  ╱           ╱
       ╲╱    =      ╱
                   ╱
         a ↗


2. SCALAR MULTIPLICATION
─────────────────────────
    3 × [1, 2] = [3, 6]

    "Stretch the arrow by factor 3"

    ───▶ v
    ─────────────▶ 3v


3. DOT PRODUCT (crucial for ML!)
─────────────────────────────────
    a·b = a₁b₁ + a₂b₂ + ... + aₙbₙ

    [1, 2, 3] · [4, 5, 6] = 1×4 + 2×5 + 3×6 = 4 + 10 + 18 = 32

    Geometric interpretation:
    a·b = ||a|| × ||b|| × cos(θ)

    where θ is the angle between vectors.

    If a·b = 0, the vectors are PERPENDICULAR (orthogonal)
    If a·b > 0, the angle between them is less than 90° (similar directions)
    If a·b < 0, the angle between them is more than 90° (opposing directions)

DOT PRODUCT IS THE NEURON:

    inputs     weights       dot product      activation
    [x₁]   ·   [w₁]    =    Σ wᵢxᵢ + b   →     σ(z)
    [x₂]       [w₂]            │
    [x₃]       [w₃]            ▼
                              output
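
Both views of the dot product, the sum of products and the angle formula, fit in a short sketch. The helper names and the sigmoid choice for the neuron are assumptions for illustration.

```python
import math


def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))


def norm(a):
    return math.sqrt(dot(a, a))


def angle_deg(a, b):
    """Angle between two nonzero vectors, from a.b = |a||b|cos(theta)."""
    cos_theta = dot(a, b) / (norm(a) * norm(b))
    cos_theta = max(-1.0, min(1.0, cos_theta))   # guard against rounding
    return math.degrees(math.acos(cos_theta))


def neuron(x, w, b):
    """One neuron: dot product plus bias, passed through a sigmoid."""
    z = dot(w, x) + b
    return 1.0 / (1.0 + math.exp(-z))


print(dot([1, 2, 3], [4, 5, 6]))      # 32
print(angle_deg([1, 0], [0, 1]))      # 90 degrees: orthogonal
print(angle_deg([1, 0], [-1, 0]))     # 180 degrees: opposite
print(neuron([1.0, 2.0], [0.5, -0.25], 0.0))  # z = 0, so the output is 0.5
```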

Matrices: Collections of Vectors, or Transformations

MATRICES AS DATA

    A dataset with n samples and d features is an n×d matrix:

         Feature 1  Feature 2  Feature 3
    X = ┌   5.1        3.5        1.4    ┐  Sample 1
        │   4.9        3.0        1.4    │  Sample 2
        │   4.7        3.2        1.3    │  Sample 3
        │   ...        ...        ...    │  ...
        └   5.9        3.0        5.1    ┘  Sample n

    Shape: (n_samples, n_features)


MATRICES AS TRANSFORMATIONS

    A 2×2 matrix transforms 2D vectors:

    T = ┌ a  b ┐     v = ┌ x ┐
        │      │         │   │
        └ c  d ┘         └ y ┘

    Tv = ┌ ax + by ┐
         │         │
         └ cx + dy ┘

    Examples:

    ┌ 2  0 ┐  Scales x by 2, y unchanged
    │      │
    └ 0  1 ┘

    ┌ cos θ  -sin θ ┐  Rotates by angle θ
    │               │
    └ sin θ   cos θ ┘

    ┌ 1  k ┐  Shears (slants) by factor k
    │      │
    └ 0  1 ┘

Matrix Multiplication: The Core of Neural Networks

MATRIX MULTIPLICATION

    C = A × B

    If A is (m × n) and B is (n × p), then C is (m × p)

    The (i,j) entry of C is the dot product of:
        - Row i of A
        - Column j of B

    Example:

    A = ┌ 1  2 ┐    B = ┌ 5  6 ┐
        │      │        │      │
        └ 3  4 ┘        └ 7  8 ┘

    C = A × B = ┌ 1×5+2×7  1×6+2×8 ┐ = ┌ 19  22 ┐
                │                   │   │        │
                └ 3×5+4×7  3×6+4×8 ┘   └ 43  50 ┘

    C[0,0] = Row 0 of A · Col 0 of B = [1,2]·[5,7] = 5+14 = 19
    C[0,1] = Row 0 of A · Col 1 of B = [1,2]·[6,8] = 6+16 = 22
    ...

NEURAL NETWORK LAYER AS MATRIX MULTIPLICATION:

    Input: x = [x₁, x₂, x₃]  (1×3 vector, or batch of them)

    Weights: W = ┌ w₁₁  w₁₂  w₁₃  w₁₄ ┐
                 │ w₂₁  w₂₂  w₂₃  w₂₄ │  (3×4 matrix)
                 └ w₃₁  w₃₂  w₃₃  w₃₄ ┘

    Output: z = xW + b  (1×4 vector)

    Each output neuron computes one dot product:
        z₁ = x₁w₁₁ + x₂w₂₁ + x₃w₃₁ + b₁
        z₂ = x₁w₁₂ + x₂w₂₂ + x₃w₃₂ + b₂
        ...
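
Written out with explicit loops, matrix multiplication is just the row-times-column rule above. This sketch represents matrices as lists of rows; the weight matrix `W` and bias `b` are hypothetical values picked so the layer output is easy to verify by hand.

```python
def matmul(A, B):
    """Multiply an (m x n) matrix by an (n x p) matrix, both lists of rows."""
    n = len(B)
    if any(len(row) != n for row in A):
        raise ValueError("inner dimensions must match")
    p = len(B[0])
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(p)]
            for i in range(len(A))]


A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
print(matmul(A, B))     # [[19, 22], [43, 50]]

# Layer z = xW + b with a 1x3 input and a 3x4 weight matrix.
x = [[1, 2, 3]]
W = [[1, 0, 0, 1],
     [0, 1, 0, 1],
     [0, 0, 1, 1]]
b = [10, 20, 30, 40]
z = [zi + bi for zi, bi in zip(matmul(x, W)[0], b)]
print(z)                # [11, 22, 33, 46]
```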

Eigenvalues and Eigenvectors: The Directions That Don’t Rotate

EIGENVECTORS: SPECIAL DIRECTIONS

    For a matrix A, an eigenvector v satisfies:

        A·v = λ·v

    "When A transforms v, v doesn't change direction,
     only scales by factor λ (the eigenvalue)"

    ┌─────────────────────────────────────────────────────────┐
    │                                                         │
    │   Regular vector:           Eigenvector:                │
    │                                                         │
    │      v →    Av →           v →    Av = λv →             │
    │      ↗      ╲↘             ↗      ─────────→            │
    │     ╱        ╲             │      (same direction,      │
    │    ╱          ╲            │       just stretched)      │
    │   ╱            ╲           │                            │
    │  Direction      Direction  │                            │
    │  CHANGES        SAME       │                            │
    │                                                         │
    └─────────────────────────────────────────────────────────┘

WHY EIGENVECTORS MATTER FOR ML:

1. PCA: Eigenvectors of covariance matrix = principal components
        (directions of maximum variance in data)

2. PageRank: The ranking vector is the dominant eigenvector
             of the link matrix

3. Spectral Clustering: Uses eigenvectors of similarity matrix

4. Stability: Eigenvalues of a repeatedly applied matrix (e.g., recurrent
              weights) indicate whether gradients explode or vanish
              |λ| > 1: grows exponentially (exploding gradients)
              |λ| < 1: shrinks exponentially (vanishing gradients)
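
One simple way to find a dominant eigenvector is power iteration: multiply a vector by A over and over and renormalize. The sketch below is an illustration under stated assumptions (a symmetric 2×2 example matrix, a unique largest eigenvalue, and a start vector that is not orthogonal to the answer); it is not how production libraries compute eigenvalues.

```python
import math


def matvec(A, v):
    return [sum(a * x for a, x in zip(row, v)) for row in A]


def power_iteration(A, steps=100):
    """Estimate the dominant eigenvalue and eigenvector by repeated multiplication."""
    v = [1.0] + [0.0] * (len(A) - 1)       # arbitrary nonzero start
    for _ in range(steps):
        w = matvec(A, v)
        length = math.sqrt(sum(x * x for x in w))
        v = [x / length for x in w]        # keep the vector at unit length
    Av = matvec(A, v)
    eigenvalue = sum(a * x for a, x in zip(Av, v))   # Rayleigh quotient, |v| = 1
    return eigenvalue, v


# Example matrix with eigenvalues 3 (direction [1, 1]) and 1 (direction [1, -1]).
A = [[2.0, 1.0],
     [1.0, 2.0]]
lam, v = power_iteration(A)
print(lam)    # close to 3
print(v)      # close to [0.707, 0.707]
```

The repeated multiplication is also why eigenvalue magnitude governs growth: each step scales the dominant component by λ.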

Reference: “Math for Programmers” by Paul Orland, Chapters 5-7, covers vectors and matrices with visual intuition. “Linear Algebra Done Right” by Sheldon Axler provides deeper theoretical foundations.

Calculus: The Mathematics of Change

Calculus answers the question: “How does the output change when I change the input?” This is fundamental to ML because training is about adjusting parameters to change (reduce) the loss.

Derivatives: Rate of Change

THE DERIVATIVE AS SLOPE

    The derivative f'(x) tells us the instantaneous rate of change:

        f'(x) = lim    f(x + h) - f(x)
               h→0    ─────────────────
                            h

    "As we make h infinitely small, what is the slope?"

GEOMETRIC INTERPRETATION:

    f(x)
      │             ╱ tangent line at x=a
      │           ╱   (slope = f'(a))
      │          ╱
      │        ●─────────────
      │      ╱╱│
      │    ╱╱  │
      │  ╱╱    │ f(a)
      │╱╱──────┼─────────────────
              a

    The derivative f'(a) is the slope of the tangent line at x=a.

EXAMPLE: f(x) = x²

    f'(x) = 2x

    At x = 3: f'(3) = 6
        "At x=3, f is increasing at rate 6"
        "If we move right by 0.01, f increases by about 0.06"

    At x = 0: f'(0) = 0
        "At x=0, f is flat (minimum!)"

    At x = -2: f'(-2) = -4
        "At x=-2, f is decreasing at rate 4"
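
The limit definition suggests a direct numerical check: pick a small h and compute the slope. This sketch uses a central difference, and the step size `h` is an assumption; too large loses accuracy, too small amplifies rounding error.

```python
def derivative(f, x, h=1e-5):
    """Central-difference approximation of f'(x)."""
    return (f(x + h) - f(x - h)) / (2 * h)


square = lambda x: x ** 2

for a in [3.0, 0.0, -2.0]:
    print(a, round(derivative(square, a), 6))   # approximately 2 * a

print(square(3.01) - square(3.0))   # about 0.06, as predicted by f'(3) * 0.01
```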

Common Derivatives (Memorize These)

DERIVATIVE RULES

    Function          Derivative          Why it matters in ML
    ────────────────────────────────────────────────────────────
    f(x) = c          f'(x) = 0          Constant has no change
    f(x) = x          f'(x) = 1          Identity
    f(x) = xⁿ         f'(x) = nxⁿ⁻¹      Power rule (polynomials)
    f(x) = eˣ         f'(x) = eˣ         Exponential (special!)
    f(x) = ln(x)      f'(x) = 1/x        Log (in loss functions)
    f(x) = sin(x)     f'(x) = cos(x)     Positional encoding
    f(x) = cos(x)     f'(x) = -sin(x)    Positional encoding

    σ(x) = 1/(1+e⁻ˣ)  σ'(x) = σ(x)(1-σ(x))   Sigmoid activation
    ReLU(x) = max(0,x) ReLU'(x) = {1 if x>0   ReLU activation
                                  {0 if x≤0

COMBINATION RULES

    Sum:      (f + g)' = f' + g'
    Product:  (f·g)' = f'·g + f·g'
    Chain:    (f(g(x)))' = f'(g(x)) · g'(x)   ← CRITICAL for backprop!

The Chain Rule: The Heart of Backpropagation

THE CHAIN RULE

    If y = f(g(x)), then:

        dy/dx = (dy/du) · (du/dx)

    where u = g(x)

    "The derivative of a composition is the product of derivatives"

EXAMPLE: y = (3x + 2)⁵

    Let u = 3x + 2, so y = u⁵

    dy/du = 5u⁴
    du/dx = 3

    dy/dx = 5u⁴ · 3 = 15(3x + 2)⁴

WHY THIS IS BACKPROPAGATION:

    In a neural network:

    x → [Layer 1] → h → [Layer 2] → ŷ → [Loss] → L

    To update Layer 1's weights, we need ∂L/∂W₁.

    Chain rule:
    ∂L/∂W₁ = (∂L/∂ŷ) · (∂ŷ/∂h) · (∂h/∂W₁)
             └──┬──┘   └──┬──┘   └──┬──┘
                │         │         │
           From loss  Through  Through
                      Layer 2  Layer 1

    Gradients "flow backward" through the network!
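
Both the worked example and the backpropagation idea can be checked numerically. In this sketch the scalar "network" (h = w1·x, ŷ = w2·h, L = (ŷ − t)²) and its default values are hypothetical, chosen only to keep the chain of derivatives visible.

```python
def y(x):
    return (3 * x + 2) ** 5


def dy_dx(x):
    u = 3 * x + 2
    dy_du = 5 * u ** 4     # outer derivative
    du_dx = 3              # inner derivative
    return dy_du * du_dx   # chain rule


def numeric(f, x, h=1e-6):
    """Central-difference estimate used as an independent check."""
    return (f(x + h) - f(x - h)) / (2 * h)


print(dy_dx(1.0))          # 15 * 5**4 = 9375.0
print(numeric(y, 1.0))     # close to 9375


# Scalar "network": h = w1 * x, y_hat = w2 * h, L = (y_hat - t)**2
def loss(w1, w2=0.5, x=2.0, t=1.0):
    return (w2 * (w1 * x) - t) ** 2


def dloss_dw1(w1, w2=0.5, x=2.0, t=1.0):
    y_hat = w2 * (w1 * x)
    # dL/dy_hat * dy_hat/dh * dh/dw1
    return 2 * (y_hat - t) * w2 * x


print(dloss_dw1(3.0))      # 4.0
print(numeric(loss, 3.0))  # close to 4
```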

Gradients: Derivatives in Multiple Dimensions

PARTIAL DERIVATIVES

    For f(x, y), we can take derivatives with respect to each variable:

        ∂f/∂x = derivative treating y as constant
        ∂f/∂y = derivative treating x as constant

EXAMPLE: f(x, y) = x² + 3xy + y²

    ∂f/∂x = 2x + 3y   (treating y as constant)
    ∂f/∂y = 3x + 2y   (treating x as constant)

THE GRADIENT

    The gradient ∇f collects all partial derivatives into a vector:

        ∇f = [∂f/∂x, ∂f/∂y, ∂f/∂z, ...]

    For f(x, y) = x² + y²:

        ∇f = [2x, 2y]

        At point (3, 4): ∇f = [6, 8]

    CRITICAL PROPERTY:
    The gradient points in the direction of STEEPEST ASCENT.

    Therefore, to minimize f, we move in direction -∇f (steepest descent).

VISUALIZATION:

    ┌───────────────────────────────────────────────────────────┐
    │                                                           │
    │        ∇f points "uphill"                                 │
    │             ↗                                             │
    │            ╱                                              │
    │         ●─╱───  Current point                             │
    │          ╲                                                │
    │           ╲                                               │
    │            ↘                                              │
    │        -∇f points "downhill" (direction we move)          │
    │                                                           │
    └───────────────────────────────────────────────────────────┘
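
Stepping repeatedly in the direction -∇f is gradient descent. Here is a minimal sketch on f(x, y) = x² + y², starting from the point (3, 4) used above; the learning rate 0.1 and the 100 steps are arbitrary illustrative choices.

```python
def f(x, y):
    return x ** 2 + y ** 2


def grad_f(x, y):
    return (2 * x, 2 * y)


print(grad_f(3, 4))    # (6, 8)

# Gradient descent: repeatedly step against the gradient.
x, y = 3.0, 4.0
lr = 0.1
for _ in range(100):
    gx, gy = grad_f(x, y)
    x, y = x - lr * gx, y - lr * gy

print(x, y, f(x, y))   # all very close to 0, the minimum
```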

Reference: “Calculus” by James Stewart provides comprehensive coverage. “Math for Programmers” by Paul Orland, Chapter 8, gives a programmer-focused treatment. “Neural Networks and Deep Learning” by Michael Nielsen (free online) explains backpropagation beautifully.

Probability: Reasoning Under Uncertainty

ML models don’t just make predictions—they reason about uncertainty. Probability provides the framework for this reasoning.

Random Variables and Distributions

RANDOM VARIABLES

    A random variable X assigns numbers to random outcomes.

    Example: X = "sum of two dice"

    Possible values: 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12

    P(X = 2) = 1/36  (only one way: 1+1)
    P(X = 7) = 6/36  (six ways: 1+6, 2+5, 3+4, 4+3, 5+2, 6+1)
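
These probabilities can be confirmed by enumerating all 36 outcomes. The sketch uses exact fractions from the standard library, so no rounding is involved.

```python
from collections import Counter
from fractions import Fraction

# Count how many of the 36 equally likely (a, b) pairs give each sum.
counts = Counter(a + b for a in range(1, 7) for b in range(1, 7))
pmf = {total: Fraction(n, 36) for total, n in sorted(counts.items())}

print(pmf[2])               # 1/36
print(pmf[7])               # 1/6  (that is, 6/36)
print(sum(pmf.values()))    # 1
```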

PROBABILITY DISTRIBUTIONS

    A distribution describes how likely each value is.

    DISCRETE (countable outcomes):

    P(X=k)
      │
    0.17│       ●
      │      ● ● ●
    0.11│    ● ● ● ● ●
      │  ● ● ● ● ● ● ● ●
      │● ● ● ● ● ● ● ● ● ● ●
      └──────────────────────
        2 3 4 5 6 7 8 9 10 11 12

    CONTINUOUS (any value in a range):

    p(x)
      │        ╭────╮
      │       ╱      ╲
      │      ╱        ╲
      │    ╱            ╲
      │  ╱                ╲
      │╱                    ╲
      └──────────────────────────
                μ
           Normal Distribution
           N(μ, σ²)

Expected Value: The Average Outcome

EXPECTED VALUE (MEAN)

    E[X] = Σᵢ xᵢ · P(X = xᵢ)    (discrete)
    E[X] = ∫ x · p(x) dx        (continuous)

    "The weighted average of all possible outcomes"

EXAMPLE: Fair 6-sided die

    E[X] = 1·(1/6) + 2·(1/6) + 3·(1/6) + 4·(1/6) + 5·(1/6) + 6·(1/6)
         = (1 + 2 + 3 + 4 + 5 + 6) / 6
         = 21/6
         = 3.5

    You'll never roll 3.5, but it's the "center of mass" of outcomes.

WHY IT MATTERS FOR ML:

    Loss function = E[L(y, ŷ)]

    We want to minimize the EXPECTED loss over the data distribution.

    In practice:
        Train loss ≈ (1/n) Σᵢ L(yᵢ, ŷᵢ)

    This is a Monte Carlo estimate of E[L]!
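
The die example shows both sides of this idea: the exact expectation, and a sample average that approximates it. The seed and sample size below are arbitrary assumptions.

```python
import random
from fractions import Fraction

# Exact expected value of a fair die.
expected = sum(Fraction(k, 6) for k in range(1, 7))
print(expected)            # 7/2

# Monte Carlo estimate: average many simulated rolls.
rng = random.Random(0)     # fixed seed so the run is repeatable
n = 100_000
sample_mean = sum(rng.randint(1, 6) for _ in range(n)) / n
print(sample_mean)         # near 3.5; the exact digits depend on the seed
```

A training loss averaged over a dataset plays the role of `sample_mean` here: an estimate of an expectation that cannot be computed exactly.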

Bayes’ Theorem: Updating Beliefs

BAYES' THEOREM

    P(A|B) = P(B|A) · P(A)
             ─────────────
                P(B)

    ┌───────────┐   ┌────────────┐   ┌─────────┐
    │ Posterior │ = │ Likelihood │ × │  Prior  │  ÷  P(Evidence)
    │  P(A|B)   │   │   P(B|A)   │   │  P(A)   │
    └───────────┘   └────────────┘   └─────────┘

    "Updated belief after seeing evidence"

SPAM FILTER EXAMPLE:

    A = email is spam
    B = email contains word "free"

    Given:
        P(spam) = 0.3                    (30% of emails are spam)
        P("free" | spam) = 0.8           (80% of spam has "free")
        P("free" | not spam) = 0.1       (10% of ham has "free")

    Question: Email contains "free". What's P(spam | "free")?

    P("free") = P("free"|spam)·P(spam) + P("free"|not spam)·P(not spam)
              = 0.8 × 0.3 + 0.1 × 0.7
              = 0.24 + 0.07 = 0.31

    P(spam | "free") = P("free"|spam) · P(spam) / P("free")
                     = 0.8 × 0.3 / 0.31
                     = 0.24 / 0.31
                     ≈ 0.77

    Seeing "free" raises spam probability from 30% to 77%!
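The spam calculation above can be written as a short function. This is a sketch only: the function name `posterior` and its parameter names are hypothetical, and it handles just the two-hypothesis case (spam vs. not spam) used in the example.

```python
def posterior(prior, likelihood, likelihood_if_not):
    """Bayes' theorem for a binary hypothesis A given evidence B.

    prior             = P(A)
    likelihood        = P(B | A)
    likelihood_if_not = P(B | not A)
    """
    # Law of total probability gives the evidence term P(B).
    evidence = likelihood * prior + likelihood_if_not * (1 - prior)
    return likelihood * prior / evidence

p_spam = posterior(prior=0.3, likelihood=0.8, likelihood_if_not=0.1)
print(f"P(spam | 'free') = {p_spam:.2f}")  # 0.24 / 0.31, about 0.77
```

Note that when the evidence is uninformative (the word is equally likely in spam and ham), the function returns the prior unchanged.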

Key Probability Distributions for ML

DISTRIBUTIONS YOU'LL ENCOUNTER

1. BERNOULLI: Single binary outcome
   ────────────────────────────────
   P(X=1) = p, P(X=0) = 1-p

   Used for: Binary classification output


2. NORMAL (GAUSSIAN): Bell curve
   ────────────────────────────────
   p(x) = (1/√(2πσ²)) exp(-(x-μ)²/(2σ²))

   Parameters: μ (mean), σ² (variance)

         │     ╭───╮
         │    ╱     ╲     68% within ±1σ
         │   ╱       ╲    95% within ±2σ
         │  ╱         ╲   99.7% within ±3σ
         │ ╱           ╲
         └───────────────────
              μ-σ  μ  μ+σ

   Used for: Prior distributions, noise modeling, VAEs


3. CATEGORICAL: Multiple discrete outcomes
   ─────────────────────────────────────────
   P(X=k) = pₖ, where Σₖ pₖ = 1

   Used for: Multi-class classification (softmax output)


4. EXPONENTIAL: Time between events
   ───────────────────────────────────
   p(x) = λe^(-λx) for x ≥ 0

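   Used for: Waiting times between events
   (Exponential learning-rate decay has the same e^(-λt) shape,
   but it is a schedule, not a probability distribution.)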
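The 68 / 95 / 99.7 rule for the normal distribution can be verified numerically. The sketch below uses only the standard library; `normal_pdf` and `prob_within` are illustrative names. It relies on the standard identity P(|X - μ| < kσ) = erf(k/√2).

```python
import math

def normal_pdf(x, mu=0.0, sigma=1.0):
    """Density of N(mu, sigma^2) at x, matching the formula above."""
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / math.sqrt(2 * math.pi * sigma ** 2)

def prob_within(k):
    """P(|X - mu| < k * sigma) for a normally distributed X."""
    return math.erf(k / math.sqrt(2))

for k in (1, 2, 3):
    print(f"within ±{k}σ: {prob_within(k):.4f}")

print(f"peak density of N(0, 1): {normal_pdf(0.0):.4f}")
```

The probabilities do not depend on μ or σ, which is why the rule can be stated once for every normal distribution.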

Reference: “Think Bayes” by Allen Downey provides an intuitive, computational approach to probability. “All of Statistics” by Larry Wasserman is a comprehensive reference.

Optimization: Making Machines Learn

Training almost any machine learning model reduces to optimization: define a loss function that measures how wrong your model is, then find parameters that minimize it.

Loss Functions: Measuring Error

LOSS FUNCTIONS

The loss L(y, ŷ) measures the difference between:
    - True value y
    - Predicted value ŷ

REGRESSION LOSSES:

    Mean Squared Error (MSE):
    L = (1/n) Σᵢ (yᵢ - ŷᵢ)²

    - Penalizes large errors heavily (squared)
    - Gradient: ∂L/∂ŷᵢ = -(2/n)(yᵢ - ŷᵢ)

    Mean Absolute Error (MAE):
    L = (1/n) Σᵢ |yᵢ - ŷᵢ|

    - More robust to outliers
    - Gradient: ∂L/∂ŷᵢ = -(1/n) sign(yᵢ - ŷᵢ)   (undefined at yᵢ = ŷᵢ)

CLASSIFICATION LOSSES:

    Binary Cross-Entropy:
    L = -(1/n) Σᵢ [yᵢ log(ŷᵢ) + (1-yᵢ) log(1-ŷᵢ)]

    - ŷ is predicted probability (from sigmoid)
    - Heavily penalizes confident wrong predictions
    - Gradient: ∂L/∂ŷᵢ = (1/n) (ŷᵢ - yᵢ) / (ŷᵢ(1-ŷᵢ))

    Categorical Cross-Entropy:
    L = -(1/n) Σᵢ Σₖ yᵢₖ log(ŷᵢₖ)

    - For multi-class (softmax output)
    - y is one-hot encoded

LOSS LANDSCAPE VISUALIZATION:

    Loss
      │
      │  ╲   ╱╲  ╱
      │   ╲ ╱  ╲╱  Local minima
      │    ●    ╲
      │          ╲
      │           ●  Global minimum (we want to find this!)
      └────────────────────
         Parameter θ
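The regression and classification losses listed above can be sketched in a few lines of plain Python. The function names are illustrative, and the `eps` clipping in `binary_cross_entropy` is an added assumption (not part of the formula) that avoids log(0) when a prediction is exactly 0 or 1.

```python
import math

def mse(y_true, y_pred):
    """Mean squared error: average of squared differences."""
    return sum((y - p) ** 2 for y, p in zip(y_true, y_pred)) / len(y_true)

def mae(y_true, y_pred):
    """Mean absolute error: average of absolute differences."""
    return sum(abs(y - p) for y, p in zip(y_true, y_pred)) / len(y_true)

def binary_cross_entropy(y_true, y_pred, eps=1e-12):
    """Binary cross-entropy; y_pred holds probabilities in (0, 1)."""
    total = 0.0
    for y, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1 - eps)  # assumption: clip to avoid log(0)
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return -total / len(y_true)

# One large error (an outlier) dominates MSE far more than MAE.
targets, preds = [1.0, 2.0, 10.0], [1.0, 2.0, 0.0]
print("MSE:", mse(targets, preds), " MAE:", mae(targets, preds))

# Confident wrong predictions are penalized much more than confident right ones.
labels = [1, 0, 1]
print("BCE, confident and right:", binary_cross_entropy(labels, [0.9, 0.1, 0.8]))
print("BCE, confident and wrong:", binary_cross_entropy(labels, [0.1, 0.9, 0.2]))
```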

Gradient Descent: Walking Downhill

GRADIENT DESCENT ALGORITHM

    Goal: Find θ* that minimizes L(θ)

    Algorithm:
    1. Start with initial guess θ₀
    2. Compute gradient ∇L(θ)
    3. Update: θ ← θ - α·∇L(θ)
    4. Repeat until convergence

    α = learning rate (step size)

    ┌─────────────────────────────────────────────────────────────┐
    │                                                             │
    │   GRADIENT DESCENT INTUITION                                │
    │                                                             │
    │   Imagine you're blindfolded on a hill and want to find     │
    │   the lowest point. You can only feel the slope under       │
    │   your feet.                                                │
    │                                                             │
    │   Strategy: Always step in the direction that goes down     │
    │   most steeply. Eventually you'll reach a valley.           │
    │                                                             │
    │        Start here                                           │
    │            ↓                                                │
    │          ●───→ Step 1                                       │
    │              ╲                                              │
    │               ●───→ Step 2                                  │
    │                   ╲                                         │
    │                    ●───→ Step 3                             │
    │                        ╲                                    │
    │                         ● Minimum!                          │
    │                                                             │
    └─────────────────────────────────────────────────────────────┘

THE UPDATE RULE IN DETAIL:

    θ_new = θ_old - α · ∇L(θ_old)

    - ∇L points "uphill" (direction of steepest increase)
    - Subtracting moves us "downhill"
    - α controls step size:
        - Too small: slow convergence
        - Too large: oscillation or divergence
        - Just right: smooth convergence
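The update rule can be tried on a one-dimensional toy problem. This is a minimal sketch under stated assumptions: the loss (θ - 3)², the starting point, and the three learning rates are chosen purely for illustration, and `gradient_descent` is a hypothetical helper name.

```python
def gradient_descent(grad, theta0, lr, steps):
    """Repeat theta <- theta - lr * grad(theta); return every visited theta."""
    path = [theta0]
    for _ in range(steps):
        path.append(path[-1] - lr * grad(path[-1]))
    return path

def grad(theta):
    """Derivative of the toy loss L(theta) = (theta - 3)^2, minimized at theta = 3."""
    return 2 * (theta - 3)

for lr in (0.001, 0.1, 1.1):
    final = gradient_descent(grad, theta0=0.0, lr=lr, steps=50)[-1]
    print(f"lr = {lr}: theta after 50 steps = {final:.4g}")
```

On this loss each step multiplies the distance to the minimum by (1 - 2·lr), so a tiny rate barely moves, a moderate rate converges quickly, and any rate above 1.0 makes the distance grow with every step.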

Learning Rate: The Most Important Hyperparameter

LEARNING RATE EFFECTS

    α too small:                α too large:

    Loss                        Loss
      │                           │    ╱╲    ╱╲
      │╲                          │   ╱  ╲  ╱  ╲
      │ ╲                         │  ╱    ╲╱    ╲
      │  ╲                        │ ╱            ↗ Diverges!
      │   ╲                       │╱
      │    ╲                      └────────────────
      │     ╲                         Iteration
      │      ╲
      │       ╲    Very slow!
      └────────╲─────────────
                Iteration


    α just right:

    Loss
      │╲
      │ ╲
      │  ╲
      │   ╲
      │    ╲
      │     ╲_______________  Converges!
      └──────────────────────
          Iteration

LEARNING RATE SCHEDULES:

    Constant:     α(t) = α₀
    Step decay:   α(t) = α₀ · 0.1^⌊t/step⌋
    Exponential:  α(t) = α₀ · e^(-λt)
    Cosine:       α(t) = α₀ · (1 + cos(πt/T)) / 2
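The four schedules can be sketched as plain functions of the iteration count t. The function and parameter names here are hypothetical. For step decay, the sketch assumes the rate should drop in discrete jumps every `step` iterations, so it uses integer division for the exponent.

```python
import math

def constant(t, lr0):
    """Same learning rate at every iteration."""
    return lr0

def step_decay(t, lr0, step, factor=0.1):
    """Multiply by `factor` once every `step` iterations (assumed: floor of t/step)."""
    return lr0 * factor ** (t // step)

def exponential_decay(t, lr0, lam):
    """Smooth decay: lr0 * e^(-lam * t)."""
    return lr0 * math.exp(-lam * t)

def cosine(t, lr0, total):
    """Half a cosine wave from lr0 at t = 0 down to 0 at t = total."""
    return lr0 * (1 + math.cos(math.pi * t / total)) / 2

for t in (0, 25, 50, 75, 100):
    print(t,
          f"{step_decay(t, 0.1, step=50):.4f}",
          f"{exponential_decay(t, 0.1, lam=0.02):.4f}",
          f"{cosine(t, 0.1, total=100):.4f}")
```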

Convexity: When Optimization is Easy

CONVEX VS NON-CONVEX

    CONVEX (bowl-shaped):          NON-CONVEX (complex):

         ╲     ╱                        ╱╲   ╱╲
          ╲   ╱                        ╱  ╲ ╱  ╲
           ╲ ╱                        ╱    ●    ╲
            ●                        ●           ●
       Global min                  Local      Local
    (only one!)                   minima      minima

    Convex: Every local minimum is a global minimum, so gradient descent
            with a suitable learning rate converges to it.
    Non-convex: May get stuck in local minima.

GOOD NEWS: Linear regression is convex!

    L(w) = ||y - Xw||²

    This is a quadratic in w, which is convex.
    Gradient descent (or the normal equation) finds global optimum.

BAD NEWS: Neural networks are non-convex!

    The loss landscape has many local minima, saddle points, and plateaus.
    In practice, we often find "good enough" solutions.

SADDLE POINTS (in high dimensions):

         ╲ ╱
          ●
         ╱ ╲

    Gradient = 0, but not a minimum.
    Common in high-dimensional spaces.
    Momentum and adaptive optimizers (Adam, RMSprop) help move past these.
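The convexity of linear regression can be seen experimentally: gradient descent reaches the same answer from any starting point, and that answer matches the closed-form least-squares solution. The sketch below assumes a single input feature plus an intercept; the data points, learning rate, and function names are made up for illustration.

```python
def closed_form(xs, ys):
    """Least-squares slope and intercept for y ≈ w*x + b (one feature)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    w = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))
    return w, mean_y - w * mean_x

def fit_gd(xs, ys, w, b, lr=0.05, steps=5000):
    """Batch gradient descent on the mean squared error, starting from (w, b)."""
    n = len(xs)
    for _ in range(steps):
        errors = [w * x + b - y for x, y in zip(xs, ys)]
        grad_w = 2 / n * sum(e * x for e, x in zip(errors, xs))
        grad_b = 2 / n * sum(errors)
        w, b = w - lr * grad_w, b - lr * grad_b
    return w, b

xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.1, 2.9, 5.2, 6.8, 9.1]

print("closed form:      ", closed_form(xs, ys))
print("GD from (-10, 10):", fit_gd(xs, ys, w=-10.0, b=10.0))
print("GD from (5, -5):  ", fit_gd(xs, ys, w=5.0, b=-5.0))
```

On a non-convex loss the same experiment would generally give different answers for different starting points.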

Stochastic Gradient Descent: Scaling Up

BATCH VS STOCHASTIC GRADIENT DESCENT

    Batch GD: Use ALL data to compute gradient
        ∇L = (1/n) Σᵢ ∇Lᵢ

        Pro: Accurate gradient
        Con: Slow for large datasets

    Stochastic GD (SGD): Use ONE sample
        ∇L ≈ ∇Lᵢ (for random i)

        Pro: Fast updates
        Con: Noisy gradient, may not converge smoothly

    Mini-batch GD: Use SOME samples (e.g., 32, 64, 128)
        ∇L ≈ (1/B) Σᵢ∈batch ∇Lᵢ

        Best of both worlds!

        - Fast (GPU can process batches in parallel)
        - Smooth enough to converge
        - Noise can help escape local minima!

VISUALIZATION:

    Batch GD:        Mini-batch SGD:

    ●──→──→──→──●    ●──→──↗──↙──→──●
    (smooth path)     (noisy but gets there)
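The three variants differ only in how many samples feed each gradient estimate, so one function with a `batch_size` argument can cover all of them. This is a sketch on synthetic data: the true line y = 2x + 1, the noise level, the learning rate, and the name `minibatch_sgd` are assumptions made for the example.

```python
import random

def minibatch_sgd(data, batch_size, lr=0.05, epochs=200, seed=0):
    """Fit y ≈ w*x + b by SGD on squared error, using `batch_size` samples per step."""
    rng = random.Random(seed)
    data = list(data)          # copy so shuffling does not affect the caller
    w, b = 0.0, 0.0
    for _ in range(epochs):
        rng.shuffle(data)      # new random batches every epoch
        for start in range(0, len(data), batch_size):
            batch = data[start:start + batch_size]
            errors = [(w * x + b - y, x) for x, y in batch]
            grad_w = 2 / len(batch) * sum(e * x for e, x in errors)
            grad_b = 2 / len(batch) * sum(e for e, _ in errors)
            w, b = w - lr * grad_w, b - lr * grad_b
    return w, b

# Synthetic data (assumed for the demo): y = 2x + 1 plus Gaussian noise.
rng = random.Random(42)
data = [(x, 2 * x + 1 + rng.gauss(0, 0.1))
        for x in (rng.uniform(-1, 1) for _ in range(200))]

for batch_size, name in [(len(data), "batch"), (1, "stochastic"), (32, "mini-batch")]:
    w, b = minibatch_sgd(data, batch_size)
    print(f"{name:>10}: w = {w:.3f}, b = {b:.3f}")
```

All three settings should land near w = 2 and b = 1. With a constant learning rate, the single-sample version keeps jittering around the optimum rather than settling exactly on it.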

Reference: “Hands-On Machine Learning” by Aurélien Géron, Chapter 4, provides practical coverage of gradient descent and its variants. “Deep Learning” by Goodfellow, Bengio, and Courville, Chapters 4-8, gives theoretical depth.

Putting It All Together: The Mathematical Flow of a Neural Network

Now let’s see how all these concepts combine in a single neural network forward and backward pass:

COMPLETE MATHEMATICAL FLOW OF TRAINING

INPUT: x ∈ ℝⁿ (feature vector)
TARGET: y ∈ {0, 1} (true binary label)
PARAMETERS: W₁, b₁, W₂, b₂ (weight matrices and bias vectors)

═══════════════════════════════════════════════════════════════════

FORWARD PASS (Linear Algebra + Functions)
──────────────────────────────────────────

Layer 1:
    z₁ = W₁ · x + b₁         ← Matrix multiplication (linear algebra)
    a₁ = σ(z₁)               ← Activation function (functions)

Layer 2 (output):
    z₂ = W₂ · a₁ + b₂        ← Matrix multiplication
    ŷ = σ(z₂)                ← Sigmoid for probability (exp/log)

═══════════════════════════════════════════════════════════════════

LOSS COMPUTATION (Probability)
───────────────────────────────

    L = -[y·log(ŷ) + (1-y)·log(1-ŷ)]    ← Cross-entropy (probability)

═══════════════════════════════════════════════════════════════════

BACKWARD PASS (Calculus - Chain Rule)
──────────────────────────────────────

Output layer gradient:
    ∂L/∂z₂ = ŷ - y           ← Derivative of loss + sigmoid
    ∂L/∂W₂ = ∂L/∂z₂ · a₁ᵀ    ← Chain rule (outer product, same shape as W₂)
    ∂L/∂b₂ = ∂L/∂z₂

Hidden layer gradient (chain rule through the activation):
    ∂L/∂a₁ = W₂ᵀ · ∂L/∂z₂    ← Gradient flows backward
    ∂L/∂z₁ = ∂L/∂a₁ ⊙ σ'(z₁) ← Element-wise with activation derivative
    ∂L/∂W₁ = ∂L/∂z₁ · xᵀ     ← Chain rule (outer product, same shape as W₁)
    ∂L/∂b₁ = ∂L/∂z₁

═══════════════════════════════════════════════════════════════════

PARAMETER UPDATE (Optimization)
─────────────────────────────────

    W₁ ← W₁ - α · ∂L/∂W₁     ← Gradient descent
    b₁ ← b₁ - α · ∂L/∂b₁
    W₂ ← W₂ - α · ∂L/∂W₂
    b₂ ← b₂ - α · ∂L/∂b₂

═══════════════════════════════════════════════════════════════════

REPEAT for each batch until loss converges!
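The whole flow fits in a short pure-Python sketch. Everything here is an illustrative assumption rather than the guide's reference implementation: the helper names, the 2-3-1 layer sizes, the learning rate, and the toy logical-AND dataset. Weights are stored as lists of rows, so each weight gradient is the outer product of the layer's ∂L/∂z with that layer's input and has the same shape as the weight matrix.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def matvec(W, v):
    """Matrix-vector product; W is a list of rows."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def init_params(n_in, n_hidden, seed=0):
    rng = random.Random(seed)
    W1 = [[rng.uniform(-1, 1) for _ in range(n_in)] for _ in range(n_hidden)]
    W2 = [[rng.uniform(-1, 1) for _ in range(n_hidden)]]
    return [W1, [0.0] * n_hidden, W2, [0.0]]

def forward(params, x):
    W1, b1, W2, b2 = params
    a1 = [sigmoid(z + b) for z, b in zip(matvec(W1, x), b1)]  # layer 1
    y_hat = sigmoid(matvec(W2, a1)[0] + b2[0])                # output layer
    return a1, y_hat

def loss(y, y_hat):
    """Binary cross-entropy for one example."""
    return -(y * math.log(y_hat) + (1 - y) * math.log(1 - y_hat))

def backward(params, x, y):
    """Gradients of the loss w.r.t. W1, b1, W2, b2 for one example."""
    _, _, W2, _ = params
    a1, y_hat = forward(params, x)
    dz2 = y_hat - y                                   # dL/dz2
    dW2 = [[dz2 * a for a in a1]]                     # dz2 times a1 transposed
    da1 = [w * dz2 for w in W2[0]]                    # W2 transposed times dz2
    dz1 = [d * a * (1 - a) for d, a in zip(da1, a1)]  # sigmoid'(z1) = a1 * (1 - a1)
    dW1 = [[d * xi for xi in x] for d in dz1]         # dz1 times x transposed
    return [dW1, dz1, dW2, [dz2]]

def update(param, grad, lr):
    """Gradient descent step for a matrix (list of rows) or a vector."""
    if isinstance(param[0], list):
        return [[p - lr * g for p, g in zip(pr, gr)] for pr, gr in zip(param, grad)]
    return [p - lr * g for p, g in zip(param, grad)]

def mean_loss(params, data):
    return sum(loss(y, forward(params, x)[1]) for x, y in data) / len(data)

# Toy dataset (assumed for the demo): logical AND.
data = [([0, 0], 0), ([0, 1], 0), ([1, 0], 0), ([1, 1], 1)]
params = init_params(n_in=2, n_hidden=3)
loss_before = mean_loss(params, data)

for _ in range(2000):                 # repeat until the loss is small enough
    for x, y in data:                 # one example per update (stochastic GD)
        grads = backward(params, x, y)
        params = [update(p, g, lr=0.5) for p, g in zip(params, grads)]

loss_after = mean_loss(params, data)
print(f"mean loss before: {loss_before:.4f}, after: {loss_after:.4f}")
```

A useful sanity check when writing a backward pass by hand is to compare each analytic gradient against a finite-difference estimate: nudge one weight by a tiny amount, recompute the loss, and confirm the slope matches.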

This is what happens inside model.fit(). Every concept we’ve covered—algebra, functions, exponents, linear algebra, calculus, probability, and optimization—comes together in this elegant mathematical dance.

When you complete these 20 projects, you won’t just understand this diagram—you’ll have built every component yourself.