Project 3: Predict the Price (Linear Regression Engine)

Build a vectorized linear regression library that predicts house prices using matrix operations, gradient descent, and proper feature scaling

Project Overview

Attribute	Value
Difficulty	Intermediate
Time Estimate	1 Week
Languages	Python (NumPy primary)
Prerequisites	Projects 1 & 2 (Manual Neuron, Gradient Descent Visualizer), basic linear algebra
Main Reference	“Data Science from Scratch” by Joel Grus
Knowledge Area	Regression / Vectors / Optimization

Learning Objectives

After completing this project, you will be able to:

Vectorize mathematical operations - Replace Python loops with NumPy matrix operations for 100x speedups
Implement feature scaling - Apply standardization and normalization to prepare data for training
Derive and implement MSE loss - Understand why Mean Squared Error works and how to compute gradients
Build batch gradient descent - Update weights using the full dataset gradient
Handle multi-feature regression - Extend single-variable regression to N input features
Debug convergence issues - Diagnose exploding gradients, slow convergence, and local minima

The Core Question You’re Answering

“How do we handle multiple inputs at once?”

In Project 1, you predicted logic gates with two inputs. Real-world problems have dozens or thousands of inputs. A house price depends on square footage, bedrooms, bathrooms, location, age, lot size, and countless other factors.

The naive approach would be nested loops:

for sample in 10000_samples:
    prediction = 0
    for feature in 50_features:
        prediction += weights[feature] * sample[feature]

This is unbearably slow. Linear algebra gives us a single expression:

predictions = X @ W  # 10,000 predictions in one operation

This project teaches you why that works and how to use it.

Concepts You Must Understand First

Before writing code, ensure you have solid mental models for these concepts:

1. Vectors and Matrices (Row vs Column)

A vector is a 1D array of numbers. A matrix is a 2D grid.

Row vector (1 x 3):     Column vector (3 x 1):    Matrix (3 x 2):
[1, 2, 3]               [1]                       [1, 4]
                        [2]                       [2, 5]
                        [3]                       [3, 6]

Critical rule: In machine learning, we follow the convention:

Rows = samples (individual data points)
Columns = features (input variables)

         Feature1  Feature2  Feature3
Sample 1    1200       3        2       <- House 1: 1200 sqft, 3 beds, 2 baths
Sample 2    1800       4        3       <- House 2: 1800 sqft, 4 beds, 3 baths
Sample 3    2500       5        4       <- House 3: 2500 sqft, 5 beds, 4 baths

Matrix X has shape (n_samples, n_features) or (m, n).

2. Matrix Multiplication Mechanics

Matrix multiplication A @ B works when A has shape (m, n) and B has shape (n, p).

     A (2x3)         B (3x1)           Result (2x1)
  [a11 a12 a13]     [b1]       [a11*b1 + a12*b2 + a13*b3]
  [a21 a22 a23]  @  [b2]   =   [a21*b1 + a22*b2 + a23*b3]
                    [b3]

Each row of A "dots" with the column of B to produce one output element.

The key insight: X @ W computes the weighted sum for EVERY sample simultaneously.

       X (3 samples, 2 features)      W (2 weights)
         sqft   beds                     w1
       [1200,   3]                      [0.5]
       [1800,   4]        @             [100]
       [2500,   5]

       = [1200*0.5 + 3*100]   =   [900]   <- Prediction for house 1
         [1800*0.5 + 4*100]   =   [1300]  <- Prediction for house 2
         [2500*0.5 + 5*100]   =   [1750]  <- Prediction for house 3

Three predictions, one operation. No loops.

3. Why Vectorization is Faster (SIMD, Cache Locality)

Your CPU has special SIMD (Single Instruction, Multiple Data) units that can perform 4, 8, or even 16 floating-point operations simultaneously.

Sequential (Python loop):           Vectorized (NumPy):

for i in range(4):                  result = a * b  # All 4 at once!
    result[i] = a[i] * b[i]
    |-> ~4 CPU cycles each          |-> ~1 CPU cycle total (SIMD)
    Total: ~16 cycles               Total: ~1-2 cycles

Additionally, cache locality matters:

NumPy stores arrays contiguously in memory
When you access a[0], the CPU loads a[0:7] into cache
Subsequent accesses hit cache (fast) instead of RAM (slow)

Python loops don’t benefit because:

Each iteration has interpreter overhead
Python floats are heap-allocated objects, scattered in memory

Bottom line: A 1000-element vector operation runs ~100x faster in NumPy than in pure Python.

4. Feature Scaling (Standardization vs Normalization)

Raw features have wildly different scales:

Square footage: 500 - 10,000
Bedrooms: 1 - 10
Year built: 1900 - 2024

If we use these directly, gradient descent struggles:

Before scaling:                 After scaling:
  ^                               ^
  |  /  (steep along sqft)        |    /  (equal gradients)
  | /                             |   /
  |/___________>                  |__/____>
   sqft (0-10000)                  sqft (-2 to 2)
   beds (0-10)                     beds (-2 to 2)

Standardization (Z-score): Centers at 0, scales to unit variance

x_scaled = (x - mean) / std

Normalization (Min-Max): Scales to [0, 1] range

x_scaled = (x - min) / (max - min)

When to use which:

Standardization: When you don’t know the bounds, handles outliers better
Normalization: When you need bounded outputs (e.g., image pixels)

For gradient descent, standardization is usually preferred.

5. MSE Loss Function Derivation

Mean Squared Error measures how wrong our predictions are:

                          1   m
MSE Loss = L(W) = ---- * Sum (y_hat_i - y_i)^2
                   m   i=1

where:
  m = number of samples
  y_hat_i = predicted value for sample i
  y_i = actual value for sample i

Why squared?

Penalizes large errors more than small errors
Is differentiable everywhere (unlike absolute value)
Has a unique minimum (convex for linear regression)

In matrix form:

predictions = X @ W                    # Shape: (m, 1)
errors = predictions - y               # Shape: (m, 1)
L = (1/m) * (errors.T @ errors)        # Scalar: sum of squared errors

The errors.T @ errors computes the sum of squared errors in one operation.

Deep Theoretical Foundation

The Linear Regression Model

Linear regression assumes a linear relationship between inputs and output:

y = w1*x1 + w2*x2 + ... + wn*xn + b

Or in matrix form:
y = X @ W + b

We often absorb the bias b into W by adding a column of 1s to X:

     Original X            Augmented X          Weights W
  [sqft, beds]            [1, sqft, beds]       [b, w1, w2]

  [1200,  3]      ->      [1, 1200, 3]          [50000]   <- bias
  [1800,  4]      ->      [1, 1800, 4]   @      [100]     <- weight for sqft
  [2500,  5]      ->      [1, 2500, 5]          [10000]   <- weight for beds

  Now: y = X @ W (no separate bias term!)

The Normal Equation (Closed-Form Solution)

There’s actually a formula that gives the optimal weights directly:

W* = (X.T @ X)^(-1) @ X.T @ y

This finds the exact solution in one step! So why bother with gradient descent?

Problems with the normal equation:

Matrix inversion is O(n^3) - For 100,000 features, that’s 10^15 operations
Memory intensive - X.T @ X is (n_features x n_features)
Singular matrices - If features are correlated, the inverse doesn’t exist
No regularization - Can’t easily add L2 penalty

Gradient descent advantages:

Scales to millions of features - Each step is O(m * n)
Works for any differentiable loss - Not just MSE
Naturally extends to neural networks - Same algorithm everywhere
Easy to add regularization - Just modify the gradient

Rule of thumb: Use normal equation for n_features < 10,000. Use gradient descent otherwise.

MSE Gradient Derivation with Matrices

The gradient tells us which direction to move weights to reduce loss.

Starting with:

L = (1/m) * ||X @ W - y||^2

Let e = X @ W - y (the error vector). Then:

L = (1/m) * e.T @ e

Using matrix calculus (chain rule for matrices):

dL/dW = (1/m) * d(e.T @ e)/dW
      = (1/m) * 2 * X.T @ e       (derivation involves Jacobians)
      = (2/m) * X.T @ (X @ W - y)

The gradient formula:

gradient = (2/m) * X.T @ (X @ W - y)

where:
  X.T has shape (n_features, m)
  (X @ W - y) has shape (m, 1)
  gradient has shape (n_features, 1)

Each element of the gradient tells us how much to adjust that weight.

Batch vs Stochastic vs Mini-batch

Batch Gradient Descent (this project):

Uses ALL samples to compute gradient
Smooth convergence, but slow per step
Memory intensive for large datasets

for epoch in epochs:
    gradient = (2/m) * X.T @ (X @ W - y)  # Full dataset
    W = W - learning_rate * gradient

Stochastic Gradient Descent (SGD):

Uses ONE sample per update
Noisy but fast per step
Can escape local minima

for epoch in epochs:
    for i in shuffle(samples):
        gradient = 2 * X[i].T * (X[i] @ W - y[i])  # Single sample
        W = W - learning_rate * gradient

Mini-batch Gradient Descent:

Uses a batch of 32-256 samples
Best of both: stable yet efficient
Standard in deep learning

for epoch in epochs:
    for batch in split(samples, batch_size=32):
        gradient = (2/len(batch)) * X_batch.T @ (X_batch @ W - y_batch)
        W = W - learning_rate * gradient

Feature Scaling Mathematics

Without scaling, features with larger magnitudes dominate the gradient:

Suppose: sqft ~ 2000, beds ~ 3

Gradient for sqft weight: proportional to sqft values ~ 2000
Gradient for beds weight: proportional to beds values ~ 3

The sqft gradient is ~700x larger!

This creates an elongated loss surface:

Before scaling:             After scaling:
      Loss                        Loss
       ^                           ^
       |   ___                     |     ___
       |  /   \                    |    /   \
       | |  *  |   (hard to       |   |  *  |  (easy to
       |  \___/     navigate)      |    \___/   navigate)
       |________>                  |________>
         w_sqft                      w_sqft'
         (tiny step)                 (balanced step)

Standardization formula:

For each feature column j:
    mean_j = (1/m) * sum(X[:, j])
    std_j = sqrt((1/m) * sum((X[:, j] - mean_j)^2))
    X_scaled[:, j] = (X[:, j] - mean_j) / std_j

Critical: Save the mean and std during training. Apply the SAME transformation to new data at prediction time!

Real-World Outcome

When your linear regression engine is complete, you will have a tool that produces accurate predictions with detailed training feedback. Here is exactly what running your tool should produce:

Example 1: Training on House Data

Command and Output:

$ python predict_house.py --train data.csv

================================================================================
                    LINEAR REGRESSION ENGINE v1.0
================================================================================

Loading data from: data.csv
Records loaded: 500
Features detected: 2 (sqft, bedrooms)

--------------------------------------------------------------------------------
DATA EXPLORATION
--------------------------------------------------------------------------------
Feature Statistics:
                     sqft        bedrooms
  min:              850.00          1.00
  max:             4200.00          6.00
  mean:            1856.32          3.12
  std:              687.45          1.24

Target Statistics (price):
  min:          $125,000
  max:          $875,000
  mean:         $342,156
  std:          $156,234

--------------------------------------------------------------------------------
PREPROCESSING
--------------------------------------------------------------------------------
Applying standardization (Z-score normalization)...
  sqft:     mean=1856.32, std=687.45 -> scaled to mean=0, std=1
  bedrooms: mean=3.12, std=1.24 -> scaled to mean=0, std=1

Adding bias column (column of 1s)...
Feature matrix shape: (500, 3)

--------------------------------------------------------------------------------
TRAINING
--------------------------------------------------------------------------------
Hyperparameters:
  Learning rate: 0.01
  Max epochs: 10000
  Convergence threshold: 0.0001

Initializing weights: [0.0, 0.0, 0.0]

Training Progress:
  Epoch 1:       Loss = 116,849,251,904.00  (Starting point)
  Epoch 10:      Loss =  12,345,678,901.23
  Epoch 50:      Loss =     456,789,012.34
  Epoch 100:     Loss =      45,000,234.56
  Epoch 500:     Loss =       2,300,567.89
  Epoch 1000:    Loss =         150,234.56  (Converged!)

  [==================================================] 100%

Convergence achieved at epoch 1000
  Final loss: 150,234.56
  Loss reduction: 99.99%

Learned weights (scaled space):
  w0 (bias):     342156.0
  w1 (sqft):     127834.5
  w2 (bedrooms):  18234.2

Unscaled interpretation:
  Base price:           $342,156 (when all features are at mean)
  Per sqft:             $186.00 ($127,834.5 / 687.45)
  Per bedroom:          $14,705 ($18,234.2 / 1.24)

--------------------------------------------------------------------------------
MODEL EVALUATION
--------------------------------------------------------------------------------
Training set metrics:
  Mean Squared Error:    150,234.56
  Root MSE:              $387.73
  Mean Absolute Error:   $298.45
  R-squared:             0.9823

Sample predictions vs actuals:
  Sample 1: Predicted $352,000 | Actual $350,000 | Error: $2,000
  Sample 2: Predicted $425,500 | Actual $430,000 | Error: $4,500
  Sample 3: Predicted $289,000 | Actual $285,000 | Error: $4,000

================================================================================
MODEL SAVED
================================================================================
Saved model to: model.pkl
Saved scaler to: scaler.pkl

Ready for predictions! Run: python predict_house.py --predict

Example 2: Making Predictions

Command and Output:

$ python predict_house.py --predict

================================================================================
                    LINEAR REGRESSION ENGINE v1.0
================================================================================

Loading model from: model.pkl
Loading scaler from: scaler.pkl

Model loaded successfully.
  Features: sqft, bedrooms
  Trained on: 500 samples
  R-squared: 0.9823

--------------------------------------------------------------------------------
INTERACTIVE PREDICTION MODE
--------------------------------------------------------------------------------

Enter house features (or 'quit' to exit):

Enter sqft: 2500
Enter bedrooms: 3

Processing...
  Raw input: [2500, 3]
  Scaled input: [0.936, -0.097]  (using training statistics)
  Augmented: [1, 0.936, -0.097]

Prediction Breakdown:
  Base price (bias):        $342,156.00
  + sqft contribution:      $119,688.90  (2500 sqft)
  + bedrooms contribution:  -$1,768.10   (3 beds, below average)
  -----------------------------------------
  PREDICTED PRICE:          $460,076.80

Confidence interval (95%): $452,300 - $467,850

--------------------------------------------------------------------------------

Enter sqft: 1200
Enter bedrooms: 2

Processing...
  Raw input: [1200, 2]
  Scaled input: [-0.954, -0.903]

Prediction Breakdown:
  Base price (bias):        $342,156.00
  + sqft contribution:      -$121,977.24
  + bedrooms contribution:  -$16,463.50
  -----------------------------------------
  PREDICTED PRICE:          $203,715.26

Confidence interval (95%): $196,000 - $211,430

--------------------------------------------------------------------------------

Enter sqft: quit

Thank you for using Linear Regression Engine!

Example 3: Timing Comparison (Loop vs Vectorized)

$ python predict_house.py --benchmark

================================================================================
                    VECTORIZATION BENCHMARK
================================================================================

Generating synthetic data...
  Samples: 100,000
  Features: 50

--------------------------------------------------------------------------------
LOOP-BASED IMPLEMENTATION
--------------------------------------------------------------------------------
def predict_loop(X, W):
    predictions = []
    for i in range(len(X)):
        pred = 0
        for j in range(len(W)):
            pred += X[i, j] * W[j]
        predictions.append(pred)
    return predictions

Running 100 iterations...
  Average time: 2.847 seconds
  Predictions per second: 35,124

--------------------------------------------------------------------------------
VECTORIZED IMPLEMENTATION
--------------------------------------------------------------------------------
def predict_vectorized(X, W):
    return X @ W

Running 100 iterations...
  Average time: 0.023 seconds
  Predictions per second: 4,347,826

================================================================================
COMPARISON
================================================================================
Speedup: 123.8x faster with vectorization!

For 100,000 samples with 50 features:
  Loop version:       2,847 ms
  Vectorized version:    23 ms
  Time saved:         2,824 ms per inference

At scale (1M predictions/day):
  Loop version:       7.9 hours
  Vectorized version: 3.8 minutes

This is why AI uses GPUs (which excel at matrix operations).

Example 4: Convergence Visualization

$ python predict_house.py --train data.csv --visualize

================================================================================
                    TRAINING VISUALIZATION
================================================================================

[Training progress as before...]

Generating visualizations...

Loss Curve:
                    Training Loss over Epochs
  Loss  ^
  1e11  |****
        |   ***
  1e9   |      ***
        |         ***
  1e7   |            ***
        |               ****
  1e5   |                   ****************************
        |______________________________________________|____>
        0        200      400      600      800      1000  Epoch

Feature Weights over Training:
  Weight ^
  +150k  |                                          .....w_sqft
         |                                    .....
  +100k  |                              .....
         |                        .....
  +50k   |                  .....
         |           ......                         _____w_beds
  0      |.....-----
         |______________________________________________|____>
         0        200      400      600      800      1000  Epoch

Prediction vs Actual Scatter:
  Predicted ^
  $800k     |                                    .*
            |                               .*.*
  $600k     |                          *.*.*
            |                     *.*.*
  $400k     |                *.*.*
            |           *.*.*
  $200k     |     *.*.*                    Perfect fit line
            |*.*                           would be diagonal
            |______________________________________|___>
            $200k    $400k    $600k    $800k     Actual

Saved to: training_visualization.png

Solution Architecture

High-Level Design

+-------------------------------------------------------------------------+
|                          LINEAR REGRESSION ENGINE                        |
+-------------------------------------------------------------------------+
|                                                                         |
|  +--------------+    +----------------+    +------------------+         |
|  |  DataLoader  |--->|  Preprocessor  |--->|    Optimizer     |         |
|  | (CSV parser) |    | (scaling, bias)|    | (gradient desc.) |         |
|  +--------------+    +----------------+    +------------------+         |
|         |                    |                      |                   |
|         v                    v                      v                   |
|  +-------------+    +-----------------+    +------------------+         |
|  |   X, y      |    | X_scaled, means |    | W (weights)      |         |
|  | (raw data)  |    | stds, X_aug     |    | loss_history     |         |
|  +-------------+    +-----------------+    +------------------+         |
|                                                    |                    |
|                          +-------------------------+                    |
|                          |                                              |
|                          v                                              |
|                  +------------------+    +-----------------+            |
|                  |    Predictor     |--->|    Evaluator    |            |
|                  | (inference)      |    | (metrics, viz)  |            |
|                  +------------------+    +-----------------+            |
|                                                                         |
+-------------------------------------------------------------------------+

Data Flow:
1. DataLoader reads CSV, extracts features (X) and target (y)
2. Preprocessor computes statistics, scales X, adds bias column
3. Optimizer runs gradient descent, updates weights, tracks loss
4. Predictor applies model to new data (with same scaling)
5. Evaluator computes R^2, RMSE, generates plots

Data Structures

class LinearRegression:
    """Core linear regression model."""

    # Model parameters
    weights: np.ndarray          # Shape: (n_features + 1,) includes bias

    # Scaling parameters (saved for prediction)
    feature_means: np.ndarray    # Shape: (n_features,)
    feature_stds: np.ndarray     # Shape: (n_features,)

    # Training hyperparameters
    learning_rate: float
    max_epochs: int
    convergence_threshold: float

    # Training history
    loss_history: List[float]
    weight_history: List[np.ndarray]


class Dataset:
    """Holds training/test data."""

    X_raw: np.ndarray           # Original features, shape (m, n)
    y: np.ndarray               # Targets, shape (m,) or (m, 1)
    feature_names: List[str]    # Column names
    X_scaled: np.ndarray        # Scaled and augmented (m, n+1)


class Metrics:
    """Evaluation metrics."""

    mse: float                  # Mean Squared Error
    rmse: float                 # Root MSE
    mae: float                  # Mean Absolute Error
    r_squared: float            # Coefficient of determination

Matrix Shape Reference

Understanding shapes is critical. Here’s a cheat sheet:

Given:
  m = number of samples (e.g., 500 houses)
  n = number of features (e.g., 2: sqft, bedrooms)

Matrices:
  X_raw       : (m, n)       = (500, 2)
  X_scaled    : (m, n)       = (500, 2)
  X_augmented : (m, n+1)     = (500, 3)    # After adding bias column
  y           : (m, 1)       = (500, 1)
  W           : (n+1, 1)     = (3, 1)

Operations and their shapes:
  X_aug @ W               : (500, 3) @ (3, 1) = (500, 1)    predictions
  X_aug @ W - y           : (500, 1) - (500, 1) = (500, 1)  errors
  X_aug.T @ (X_aug@W - y) : (3, 500) @ (500, 1) = (3, 1)    gradient

Common errors:

Shape mismatch: Check that inner dimensions match for @
Dimension issues: y as (m,) vs (m, 1) can cause broadcasting bugs
Forgetting bias: W should have n+1 elements, not n

Phased Implementation Guide

Phase 1: Data Loading and Exploration (Day 1)

Goal: Load CSV data, understand its structure, visualize distributions.

Tasks:

Load CSV using pandas or numpy
Separate features (X) from target (y)
Compute basic statistics (min, max, mean, std)
Plot histograms of each feature
Create scatter plots of features vs target

Code structure:

def load_data(filepath: str) -> Tuple[np.ndarray, np.ndarray, List[str]]:
    """Load CSV and return X, y, and feature names."""
    # Use pandas for convenience, convert to numpy
    df = pd.read_csv(filepath)
    feature_names = list(df.columns[:-1])  # All but last column
    X = df.iloc[:, :-1].values  # All but last column as numpy
    y = df.iloc[:, -1].values   # Last column as numpy
    return X, y, feature_names

def explore_data(X, y, feature_names):
    """Print statistics and create visualizations."""
    print("Feature Statistics:")
    for i, name in enumerate(feature_names):
        print(f"  {name}: min={X[:, i].min():.2f}, max={X[:, i].max():.2f}, "
              f"mean={X[:, i].mean():.2f}, std={X[:, i].std():.2f}")

    print(f"\nTarget: min=${y.min():,.0f}, max=${y.max():,.0f}, "
          f"mean=${y.mean():,.0f}")

Testing:

Create a small CSV by hand (5 rows, 2 features)
Verify statistics match manual calculations
Check shapes: X.shape should be (n_samples, n_features)

Phase 2: Feature Scaling Implementation (Day 2)

Goal: Implement standardization that can be applied consistently.

The key insight: You must save the mean and std from training data to apply the SAME transformation to test data later.

Code structure:

class Scaler:
    """Standardizes features using Z-score normalization."""

    def __init__(self):
        self.means = None
        self.stds = None
        self.is_fitted = False

    def fit(self, X: np.ndarray):
        """Compute mean and std for each feature."""
        self.means = X.mean(axis=0)  # Mean of each column
        self.stds = X.std(axis=0)    # Std of each column
        # Handle zero std (constant features)
        self.stds[self.stds == 0] = 1
        self.is_fitted = True

    def transform(self, X: np.ndarray) -> np.ndarray:
        """Apply standardization using saved statistics."""
        if not self.is_fitted:
            raise ValueError("Scaler must be fitted before transform")
        return (X - self.means) / self.stds

    def fit_transform(self, X: np.ndarray) -> np.ndarray:
        """Fit and transform in one step."""
        self.fit(X)
        return self.transform(X)

Testing:

# Test: After scaling, each column should have mean ~0, std ~1
X_scaled = scaler.fit_transform(X)
assert np.allclose(X_scaled.mean(axis=0), 0, atol=1e-10)
assert np.allclose(X_scaled.std(axis=0), 1, atol=1e-10)

Phase 3: Vectorized Forward Pass (Day 3)

Goal: Compute predictions for all samples in one matrix operation.

The forward pass:

def add_bias_column(X: np.ndarray) -> np.ndarray:
    """Prepend a column of 1s to X for the bias term."""
    ones = np.ones((X.shape[0], 1))
    return np.hstack([ones, X])

def predict(X_augmented: np.ndarray, W: np.ndarray) -> np.ndarray:
    """Compute predictions: y_hat = X @ W"""
    return X_augmented @ W

Shape verification:

# Given X with shape (500, 2):
X_aug = add_bias_column(X)  # Shape: (500, 3)
W = np.zeros((3, 1))        # Shape: (3, 1) - initialized to zeros
y_hat = predict(X_aug, W)   # Shape: (500, 1)

print(f"X_aug shape: {X_aug.shape}")  # (500, 3)
print(f"W shape: {W.shape}")          # (3, 1)
print(f"y_hat shape: {y_hat.shape}")  # (500, 1)

No loops allowed! The entire prediction for 500 samples happens in X_augmented @ W.

Phase 4: MSE Loss Calculation (Day 4)

Goal: Implement the loss function to measure prediction quality.

The MSE formula in code:

def compute_loss(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Compute Mean Squared Error."""
    m = len(y_true)
    errors = y_pred - y_true
    mse = (1 / m) * np.sum(errors ** 2)
    return mse

# Alternative using matrix operations:
def compute_loss_matrix(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Compute MSE using matrix form."""
    m = len(y_true)
    errors = y_pred - y_true
    mse = (1 / m) * (errors.T @ errors).item()  # .item() extracts scalar
    return mse

Testing:

# Test with known values
y_true = np.array([[100], [200], [300]])
y_pred = np.array([[110], [190], [310]])
# Errors: [10, -10, 10], squared: [100, 100, 100], mean: 100
loss = compute_loss(y_true, y_pred)
assert loss == 100.0

Phase 5: Gradient Computation (Vectorized) (Day 5)

Goal: Compute the gradient of the loss with respect to weights.

The gradient formula:

gradient = (2/m) * X.T @ (X @ W - y)

Implementation:

def compute_gradient(X: np.ndarray, y: np.ndarray, W: np.ndarray) -> np.ndarray:
    """Compute gradient of MSE loss with respect to W."""
    m = X.shape[0]
    predictions = X @ W
    errors = predictions - y
    gradient = (2 / m) * (X.T @ errors)
    return gradient

Shape verification:

# Given: X is (500, 3), y is (500, 1), W is (3, 1)
gradient = compute_gradient(X, y, W)
print(f"Gradient shape: {gradient.shape}")  # Should be (3, 1)

# Each element tells us how to adjust that weight:
# gradient[0] -> adjustment for bias
# gradient[1] -> adjustment for w_sqft
# gradient[2] -> adjustment for w_bedrooms

Testing:

# Numerical gradient check (slow but verifies correctness)
def numerical_gradient(X, y, W, epsilon=1e-7):
    num_grad = np.zeros_like(W)
    for i in range(len(W)):
        W_plus = W.copy()
        W_plus[i] += epsilon
        W_minus = W.copy()
        W_minus[i] -= epsilon

        loss_plus = compute_loss(y, X @ W_plus)
        loss_minus = compute_loss(y, X @ W_minus)

        num_grad[i] = (loss_plus - loss_minus) / (2 * epsilon)
    return num_grad

# Compare analytical and numerical gradients
analytical = compute_gradient(X, y, W)
numerical = numerical_gradient(X, y, W)
assert np.allclose(analytical, numerical, rtol=1e-5)

Phase 6: Training Loop with Convergence Check (Day 6-7)

Goal: Put it all together with proper stopping criteria.

The training loop:

def train(X: np.ndarray, y: np.ndarray,
          learning_rate: float = 0.01,
          max_epochs: int = 10000,
          convergence_threshold: float = 1e-6) -> Tuple[np.ndarray, List[float]]:
    """Train linear regression using batch gradient descent."""

    # Initialize weights to zeros
    n_features = X.shape[1]
    W = np.zeros((n_features, 1))

    loss_history = []
    prev_loss = float('inf')

    for epoch in range(max_epochs):
        # Forward pass
        predictions = X @ W

        # Compute loss
        current_loss = compute_loss(y, predictions)
        loss_history.append(current_loss)

        # Check convergence
        if abs(prev_loss - current_loss) < convergence_threshold:
            print(f"Converged at epoch {epoch}")
            break
        prev_loss = current_loss

        # Compute gradient
        gradient = compute_gradient(X, y, W)

        # Update weights
        W = W - learning_rate * gradient

        # Print progress periodically
        if epoch % 100 == 0:
            print(f"Epoch {epoch}: Loss = {current_loss:.2f}")

    return W, loss_history

The complete pipeline:

def fit(X_raw: np.ndarray, y_raw: np.ndarray):
    """Full training pipeline."""

    # 1. Reshape y if needed
    y = y_raw.reshape(-1, 1) if y_raw.ndim == 1 else y_raw

    # 2. Scale features
    scaler = Scaler()
    X_scaled = scaler.fit_transform(X_raw)

    # 3. Add bias column
    X_aug = add_bias_column(X_scaled)

    # 4. Train
    W, loss_history = train(X_aug, y)

    # 5. Return model
    return {
        'weights': W,
        'scaler': scaler,
        'loss_history': loss_history
    }

Questions to Guide Your Design

Before implementing, think through these questions:

Data representation:

Should y be shape (m,) or (m, 1)? Why does it matter?
How will you handle missing values in the CSV?
What if a feature column has zero variance (all same value)?

Scaling decisions:

Should you scale the target y as well? What are the tradeoffs?
What happens if you forget to scale test data the same way?
How will you store the scaling parameters for deployment?

Gradient descent tuning:

How will you know if the learning rate is too high?
What symptoms indicate the learning rate is too low?
Should you decrease the learning rate over time? (Learning rate scheduling)

Convergence criteria:

Is checking loss difference sufficient? What about weight changes?
What if the loss oscillates instead of decreasing?
How many epochs without improvement should trigger early stopping?

Memory efficiency:

For 1 million samples, how much RAM does X.T @ X need?
Could you implement mini-batch gradient descent to reduce memory?

Thinking Exercise: Matrix Multiplication by Hand

Before trusting NumPy, understand what it’s doing. Compute these by hand:

Exercise 1: Simple Prediction

Given:

X = [[1, 2, 3],     (2 samples, 3 features including bias)
     [1, 4, 1]]

W = [[10],          (3 weights)
     [5],
     [2]]

Compute X @ W:

Step by step:
Row 1 of X dot W: 1*10 + 2*5 + 3*2 = 10 + 10 + 6 = 26
Row 2 of X dot W: 1*10 + 4*5 + 1*2 = 10 + 20 + 2 = 32

Result: [[26],
         [32]]

Exercise 2: Gradient Computation

Given:

X = [[1, 2],
     [1, 3]]

y = [[5],
     [7]]

W = [[1],
     [1]]

Compute the gradient:

Step 1: Predictions = X @ W
  Row 1: 1*1 + 2*1 = 3
  Row 2: 1*1 + 3*1 = 4
  predictions = [[3], [4]]

Step 2: Errors = predictions - y
  [[3], [4]] - [[5], [7]] = [[-2], [-3]]

Step 3: X.T @ errors
  X.T = [[1, 1],
         [2, 3]]

  X.T @ errors:
    Row 1: 1*(-2) + 1*(-3) = -5
    Row 2: 2*(-2) + 3*(-3) = -4 + -9 = -13

  = [[-5], [-13]]

Step 4: Multiply by 2/m (m=2)
  gradient = (2/2) * [[-5], [-13]] = [[-5], [-13]]

The negative gradient tells us: increase both weights to reduce error.

Exercise 3: Shape Reasoning

Without computing, predict the output shape:

(100, 5) @ (5, 1) = ?
(5, 100) @ (100, 1) = ?
(100, 5).T @ (100, 1) = ?
(100, 1) @ (1, 5) = ?

Answers:

(100, 1) - 100 predictions from 100 samples
(5, 1) - 5 sums of 100 elements each (gradient shape)
Same as 2: (5, 100) @ (100, 1) = (5, 1)
(100, 5) - Outer product, creates a matrix

Testing Strategy

Unit Tests for Core Functions

import numpy as np
import pytest

class TestScaler:
    def test_fit_computes_correct_stats(self):
        X = np.array([[1, 10], [2, 20], [3, 30]])
        scaler = Scaler()
        scaler.fit(X)

        assert np.allclose(scaler.means, [2, 20])
        assert np.allclose(scaler.stds, [0.816, 8.165], rtol=0.01)

    def test_transform_scales_correctly(self):
        X = np.array([[0], [10]])
        scaler = Scaler()
        scaler.fit(X)

        X_scaled = scaler.transform(X)
        assert np.allclose(X_scaled.mean(), 0)
        assert np.allclose(X_scaled.std(), 1)

    def test_transform_before_fit_raises(self):
        scaler = Scaler()
        with pytest.raises(ValueError):
            scaler.transform(np.array([[1, 2]]))


class TestGradient:
    def test_gradient_matches_numerical(self):
        np.random.seed(42)
        X = np.random.randn(10, 3)
        y = np.random.randn(10, 1)
        W = np.random.randn(3, 1)

        analytical = compute_gradient(X, y, W)
        numerical = numerical_gradient(X, y, W)

        assert np.allclose(analytical, numerical, rtol=1e-5)

    def test_gradient_zero_at_optimum(self):
        # For a simple case we can solve analytically
        X = np.array([[1, 1], [1, 2]])
        y = np.array([[3], [5]])
        W_optimal = np.array([[1], [2]])  # y = 1 + 2*x

        gradient = compute_gradient(X, y, W_optimal)
        assert np.allclose(gradient, 0, atol=1e-10)


class TestTraining:
    def test_loss_decreases(self):
        np.random.seed(42)
        X = np.random.randn(100, 3)
        y = X @ np.array([[1], [2], [3]]) + np.random.randn(100, 1) * 0.1

        W, loss_history = train(X, y, learning_rate=0.1, max_epochs=100)

        assert loss_history[-1] < loss_history[0]
        assert all(loss_history[i] >= loss_history[i+1] - 1e-10
                   for i in range(len(loss_history)-1))

    def test_recovers_known_weights(self):
        # Generate data from known linear relationship
        X = add_bias_column(np.random.randn(1000, 2))
        true_W = np.array([[5], [3], [-2]])  # bias=5, w1=3, w2=-2
        y = X @ true_W

        W_learned, _ = train(X, y, learning_rate=0.1, max_epochs=1000)

        assert np.allclose(W_learned, true_W, rtol=0.01)

Integration Tests

class TestFullPipeline:
    def test_pipeline_on_synthetic_data(self):
        # Generate house-like data
        np.random.seed(42)
        n_samples = 500

        sqft = np.random.uniform(1000, 3000, n_samples)
        beds = np.random.randint(1, 6, n_samples)
        price = 50000 + 100 * sqft + 20000 * beds + np.random.randn(n_samples) * 10000

        X = np.column_stack([sqft, beds])
        y = price

        model = fit(X, y)

        # Test prediction
        test_house = np.array([[2000, 3]])
        expected_price = 50000 + 100 * 2000 + 20000 * 3  # ~310,000

        test_scaled = model['scaler'].transform(test_house)
        test_aug = add_bias_column(test_scaled)
        prediction = (test_aug @ model['weights']).item()

        assert abs(prediction - expected_price) < 20000  # Within $20k

    def test_model_saves_and_loads(self):
        import pickle

        X = np.random.randn(100, 2)
        y = np.random.randn(100, 1)
        model = fit(X, y)

        # Save
        with open('test_model.pkl', 'wb') as f:
            pickle.dump(model, f)

        # Load
        with open('test_model.pkl', 'rb') as f:
            loaded = pickle.load(f)

        assert np.allclose(model['weights'], loaded['weights'])
        os.remove('test_model.pkl')

Common Pitfalls and Debugging Tips

Pitfall 1: Exploding Loss

Symptom: Loss shoots to infinity or becomes NaN

Epoch 1: Loss = 150000
Epoch 2: Loss = 1500000000
Epoch 3: Loss = inf
Epoch 4: Loss = nan

Cause: Learning rate is too high. Gradient steps overshoot the minimum.

Fix: Reduce learning rate by 10x or 100x. Try 0.01, 0.001, or 0.0001.

Diagram:

Learning rate too high:
    Loss
      ^
      |      * <-- overshoot
      |     / \
      |    /   \
      |   /     * <-- overshoot again
      |  *       \
      |   \       * <-- explodes!
      |____\______\_____>
            optimum

Pitfall 2: Loss Not Decreasing

Symptom: Loss stays flat or decreases imperceptibly

Epoch 1: Loss = 150000.00
Epoch 100: Loss = 149999.99
Epoch 1000: Loss = 149998.50

Causes:

Learning rate too small
Features not scaled
Weights initialized poorly

Fixes:

Increase learning rate
Verify features have mean 0, std 1
Try different weight initialization

Pitfall 3: Shape Mismatch Errors

Symptom:

ValueError: matmul: Input operand 1 has a mismatch in its core dimension

Common causes:

Forgot to add bias column
y is shape (m,) instead of (m, 1)
Mixed up X.T with X

Debugging:

print(f"X shape: {X.shape}")     # Should be (m, n+1)
print(f"y shape: {y.shape}")     # Should be (m, 1)
print(f"W shape: {W.shape}")     # Should be (n+1, 1)

# Reshape y if needed
y = y.reshape(-1, 1)

Pitfall 4: Forgot to Scale Test Data

Symptom: Model trains well but predictions on new data are wildly wrong

Cause: New data uses original scale, but model learned in scaled space

Fix: Always apply the SAME scaler to test data:

# During training:
scaler = Scaler()
X_train_scaled = scaler.fit_transform(X_train)

# During prediction:
X_test_scaled = scaler.transform(X_test)  # Use same scaler!
prediction = X_test_scaled @ W

Pitfall 5: Numerical Precision Issues

Symptom: Gradient checks fail with small but non-zero differences

Cause: Floating-point arithmetic has limited precision

Fix: Use appropriate tolerances:

# Bad: Exact equality
assert gradient == numerical_gradient

# Good: Approximate equality
assert np.allclose(gradient, numerical_gradient, rtol=1e-5, atol=1e-8)

Debugging Workflow

When something goes wrong:

Print shapes at every step:

print(f"After loading: X={X.shape}, y={y.shape}")
print(f"After scaling: X={X_scaled.shape}")
print(f"After augmenting: X={X_aug.shape}")
print(f"Weights: W={W.shape}")
print(f"Predictions: pred={pred.shape}")
print(f"Gradient: grad={grad.shape}")

Test with tiny data:

# Use 3 samples, 2 features - small enough to verify by hand
X_tiny = np.array([[1, 2], [3, 4], [5, 6]])
y_tiny = np.array([[10], [20], [30]])

Check intermediate values:

print(f"First 5 predictions: {predictions[:5].flatten()}")
print(f"First 5 errors: {(predictions - y)[:5].flatten()}")
print(f"Gradient: {gradient.flatten()}")

Verify with known solution:

# For y = 2*x + 1, we know W should be [1, 2]
X = add_bias_column(np.array([[0], [1], [2]]))
y = np.array([[1], [3], [5]])
# Train and verify W converges to [1, 2]

Interview Questions

Binary parsing skills translate directly to interview questions. Here’s what you’ll be prepared to answer:

Gradient Descent and Optimization:

“Explain gradient descent in your own words. Why does it work?”
“What happens if the learning rate is too high? Too low?”
“How would you choose an appropriate learning rate?”
“What is the difference between batch, stochastic, and mini-batch gradient descent?”
“Can gradient descent get stuck? How would you detect and address this?”

Linear Algebra and Vectorization:

“Why is matrix multiplication used in machine learning?”
“Explain why vectorized code is faster than loops.”
“Given matrices of shapes (100, 50) and (50, 1), what is the output shape of their product?”
“How would you implement matrix multiplication from scratch?”
“What is the computational complexity of matrix multiplication?”

Feature Engineering:

“Why do we need to scale features before training?”
“What is the difference between normalization and standardization?”
“How do you handle new data that falls outside the training range?”
“What happens if you have correlated features?”
“How would you add polynomial features to capture non-linear relationships?”

Loss Functions:

“Why do we use squared error instead of absolute error?”
“Derive the gradient of Mean Squared Error.”
“What is the closed-form solution for linear regression?”
“When would you use gradient descent instead of the normal equation?”
“How does regularization modify the loss function?”

Practical Machine Learning:

“How do you know if your model is overfitting?”
“What is R-squared and how do you interpret it?”
“How would you handle missing values in your training data?”
“Describe your debugging process when a model isn’t learning.”
“How would you deploy this model to production?”

Coding Questions:

“Implement standardization without using numpy’s built-in functions.”
“Write a function to compute MSE without loops.”
“Implement gradient descent for linear regression.”
“How would you add L2 regularization to your implementation?”

Hints in Layers

If you get stuck, reveal hints progressively. Try to solve problems yourself first.

Layer 1: Getting Started

Hint: Data loading structure

Use pandas for CSV parsing, then convert to numpy:

import pandas as pd
import numpy as np

df = pd.read_csv('data.csv')

# Assume last column is target
X = df.iloc[:, :-1].values  # All columns except last
y = df.iloc[:, -1].values.reshape(-1, 1)  # Last column, reshaped

print(f"X shape: {X.shape}, y shape: {y.shape}")

Hint: Bias column addition

Prepend a column of 1s to X:

def add_bias(X):
    m = X.shape[0]  # Number of samples
    ones = np.ones((m, 1))
    return np.hstack([ones, X])

# After: X_aug[:, 0] is all 1s
# When we do X_aug @ W, W[0] becomes the bias term

Layer 2: Scaling

Hint: Handle zero standard deviation

If a feature column is constant, std=0 causes division by zero:

self.stds = X.std(axis=0)
# Replace zeros with 1 to avoid NaN
self.stds[self.stds == 0] = 1.0

Constant features contribute nothing to the model anyway.

Hint: Keeping y in original scale

For interpretability, don’t scale the target:

# Scale X only
X_scaled = scaler.fit_transform(X)

# Keep y in original scale (dollars, not z-scores)
# Predictions will also be in dollars

If you do scale y, you must un-scale predictions for interpretation.

Layer 3: Gradient Computation

Hint: Matrix dimensions for gradient

The gradient must have the same shape as W:

X: (m, n)    where m=samples, n=features (including bias)
y: (m, 1)
W: (n, 1)

predictions = X @ W          # (m, n) @ (n, 1) = (m, 1)
errors = predictions - y     # (m, 1) - (m, 1) = (m, 1)
gradient = X.T @ errors      # (n, m) @ (m, 1) = (n, 1)

gradient has shape (n, 1), same as W.

Hint: Don't forget the 2/m factor

The full MSE gradient formula includes a factor of 2/m:

gradient = (2 / m) * X.T @ (X @ W - y)

If you forget this, your learning rate will be off by a constant factor. Some implementations absorb the 2 into the learning rate.

Layer 4: Training Loop

Hint: Convergence detection

Check if loss stopped decreasing:

convergence_threshold = 1e-6

prev_loss = float('inf')
for epoch in range(max_epochs):
    current_loss = compute_loss(y, X @ W)

    if abs(prev_loss - current_loss) < convergence_threshold:
        print(f"Converged at epoch {epoch}")
        break

    prev_loss = current_loss
    # ... gradient update

Hint: Debugging non-converging models

If loss isn’t decreasing after 10 epochs:

Print the learning rate and gradient magnitude:

print(f"LR: {learning_rate}, Grad norm: {np.linalg.norm(gradient)}")

Check if features are scaled:

print(f"X mean: {X.mean(axis=0)}, X std: {X.std(axis=0)}")
# Should be close to 0 and 1

Try a much smaller learning rate:
```
learning_rate = 0.0001
```

Extensions and Challenges

Extension 1: Add Polynomial Features

Linear regression assumes linear relationships. But house price might increase quadratically with size. Add polynomial features:

def add_polynomial_features(X, degree=2):
    """Add polynomial terms up to specified degree."""
    from itertools import combinations_with_replacement

    m, n = X.shape
    features = [X]

    for d in range(2, degree + 1):
        for combo in combinations_with_replacement(range(n), d):
            new_feature = np.ones((m, 1))
            for idx in combo:
                new_feature *= X[:, idx:idx+1]
            features.append(new_feature)

    return np.hstack(features)

# Example: X = [sqft, beds]
# After degree=2: [sqft, beds, sqft^2, sqft*beds, beds^2]

Challenge: Fit a model to non-linear data and compare polynomial degrees.

Extension 2: Implement L2 Regularization (Ridge Regression)

Regularization prevents overfitting by penalizing large weights:

Loss_regularized = MSE + lambda * ||W||^2

Gradient_regularized = gradient_MSE + 2 * lambda * W

def train_with_regularization(X, y, lambda_reg=0.01):
    # ... same training loop, but:

    mse_gradient = (2 / m) * X.T @ (X @ W - y)
    reg_gradient = 2 * lambda_reg * W

    # Don't regularize the bias term!
    reg_gradient[0] = 0

    gradient = mse_gradient + reg_gradient
    W = W - learning_rate * gradient

Challenge: Experiment with different lambda_reg values. Plot training vs validation loss.

Extension 3: Compare with sklearn

Verify your implementation matches the industry standard:

from sklearn.linear_model import LinearRegression, Ridge
from sklearn.preprocessing import StandardScaler

# Your implementation
my_model = fit(X, y)
my_predictions = predict_with_model(my_model, X_test)

# sklearn
sk_scaler = StandardScaler()
X_scaled = sk_scaler.fit_transform(X)
sk_model = LinearRegression()
sk_model.fit(X_scaled, y)
sk_predictions = sk_model.predict(sk_scaler.transform(X_test))

# Compare
print(f"Prediction difference: {np.abs(my_predictions - sk_predictions).mean()}")
# Should be < 0.01 if both are working correctly

Extension 4: Learning Rate Scheduling

Decrease the learning rate over time for finer convergence:

def learning_rate_schedule(initial_lr, epoch, decay=0.001):
    return initial_lr / (1 + decay * epoch)

# In training loop:
lr = learning_rate_schedule(0.1, epoch)
W = W - lr * gradient

Extension 5: Mini-Batch Gradient Descent

Implement batching for memory efficiency and faster convergence:

def train_minibatch(X, y, batch_size=32):
    n_batches = len(X) // batch_size

    for epoch in range(max_epochs):
        # Shuffle data each epoch
        indices = np.random.permutation(len(X))
        X_shuffled = X[indices]
        y_shuffled = y[indices]

        for batch in range(n_batches):
            start = batch * batch_size
            end = start + batch_size

            X_batch = X_shuffled[start:end]
            y_batch = y_shuffled[start:end]

            gradient = compute_gradient(X_batch, y_batch, W)
            W = W - learning_rate * gradient

Extension 6: Real Dataset

Apply your model to actual housing data:

California Housing Dataset
Boston Housing Dataset (use carefully, has ethical concerns)
Ames Housing Dataset

Real-World Connections

How This Relates to Neural Networks

Linear regression is a neural network with:

No hidden layers
No activation function (or identity activation)
MSE loss

When you add hidden layers and non-linear activations, the same gradient descent principles apply:

Linear Regression:   Input -> Output
Neural Network:      Input -> Hidden -> Hidden -> Output
                            (ReLU)    (ReLU)

Both use: Loss -> Gradient -> Update Weights

Why GPUs Power Modern AI

Matrix multiplication is embarrassingly parallel:

A @ B: Each output element is independent!

output[i,j] = sum(A[i,:] * B[:,j])

GPU with 1000 cores computes 1000 elements simultaneously.

Your vectorized code runs on CPU SIMD units (4-8 parallel ops). A GPU has thousands of cores, enabling 100-1000x further speedup.

This is why:

X @ W in NumPy: milliseconds
X @ W on GPU (PyTorch/TensorFlow): microseconds
Nested Python loops: seconds

Linear Regression in Industry

Despite being “simple,” linear regression is ubiquitous:

Finance: Stock price prediction, risk assessment
Real Estate: Property valuation (exactly what you built!)
Marketing: Customer lifetime value prediction
Healthcare: Treatment effect estimation
Tech: Ad click-through rate prediction

Why linear?

Interpretable: “Each bedroom adds $15,000 to price”
Fast: Milliseconds for prediction
Robust: Fewer hyperparameters than deep learning
Baseline: Always compare complex models against linear regression

Vectorization in Real Codebases

Every ML library uses the same patterns you learned:

PyTorch:

predictions = model(inputs)  # Internally: matrix multiplications
loss = criterion(predictions, targets)
loss.backward()  # Computes gradients via autograd
optimizer.step()  # W = W - lr * gradient

TensorFlow:

with tf.GradientTape() as tape:
    predictions = model(inputs)
    loss = loss_fn(targets, predictions)
gradients = tape.gradient(loss, model.trainable_variables)
optimizer.apply_gradients(zip(gradients, model.trainable_variables))

Same concepts, more abstraction.

Books That Will Help

Book	Author(s)	Why It’s Relevant
Data Science from Scratch	Joel Grus	Implements linear regression and gradient descent in Python with minimal dependencies. Chapter 14 covers exactly what this project teaches.
Hands-On Machine Learning	Aurelien Geron	Chapter 4 covers linear regression in depth, including the normal equation, regularization, and gradient descent variants.
Linear Algebra and Its Applications	Gilbert Strang	The definitive text on linear algebra. Understanding matrix operations deeply helps debug vectorized code.
Numerical Recipes	Press et al.	Covers computational aspects of linear regression including numerical stability and efficient algorithms.
An Introduction to Statistical Learning	James, Witten, Hastie, Tibshirani	Free online. Chapter 3 covers linear regression from a statistical perspective.
The Elements of Statistical Learning	Hastie, Tibshirani, Friedman	More advanced. Chapters 2-3 provide rigorous treatment of linear methods.

Self-Assessment Checklist

Before moving to Project 4, verify you can:

Explain why vectorization is faster than loops (SIMD, cache locality)
Implement standardization and explain when to use it
Derive the MSE gradient by hand
Implement batch gradient descent without loops in the math
Diagnose exploding loss (learning rate too high)
Diagnose non-converging loss (learning rate too low, unscaled features)
Add a bias column to a feature matrix
Verify gradient implementation with numerical approximation
Explain when to use the normal equation vs gradient descent

Conceptual Questions

Why do we add a column of 1s to the feature matrix?
What would happen if we scaled one feature but not another?
Why might the loss oscillate instead of smoothly decreasing?
How does the gradient tell us which direction to move?
Why is (X.T @ X)^-1 @ X.T @ y expensive for many features?

Timing Challenge

Run a timing experiment on your implementation:

import time

sizes = [100, 1000, 10000, 100000]
features = 10

for size in sizes:
    X = np.random.randn(size, features)
    W = np.random.randn(features, 1)

    # Time vectorized version
    start = time.time()
    for _ in range(100):
        result = X @ W
    vectorized_time = (time.time() - start) / 100

    # Time loop version
    start = time.time()
    for _ in range(100 if size < 10000 else 1):
        result_loop = np.zeros((size, 1))
        for i in range(size):
            for j in range(features):
                result_loop[i] += X[i, j] * W[j]
    loop_time = (time.time() - start) / (100 if size < 10000 else 1)

    print(f"Size {size:>6}: Vectorized {vectorized_time*1000:.3f}ms, "
          f"Loop {loop_time*1000:.3f}ms, Speedup {loop_time/vectorized_time:.1f}x")

Expected output:

Size    100: Vectorized 0.002ms, Loop 0.180ms, Speedup 90.0x
Size   1000: Vectorized 0.008ms, Loop 1.820ms, Speedup 227.5x
Size  10000: Vectorized 0.070ms, Loop 18.500ms, Speedup 264.3x
Size 100000: Vectorized 0.650ms, Loop 185.000ms, Speedup 284.6x

Next: P04: The Spam Filter - Apply sigmoid activation and cross-entropy loss for classification