Project 3: Predict the Price (Linear Regression Engine)
Build a vectorized linear regression library that predicts house prices using matrix operations, gradient descent, and proper feature scaling
Project Overview
| Attribute | Value |
|---|---|
| Difficulty | Intermediate |
| Time Estimate | 1 Week |
| Languages | Python (NumPy primary) |
| Prerequisites | Projects 1 & 2 (Manual Neuron, Gradient Descent Visualizer), basic linear algebra |
| Main Reference | “Data Science from Scratch” by Joel Grus |
| Knowledge Area | Regression / Vectors / Optimization |
Learning Objectives
After completing this project, you will be able to:
- Vectorize mathematical operations - Replace Python loops with NumPy matrix operations for 100x speedups
- Implement feature scaling - Apply standardization and normalization to prepare data for training
- Derive and implement MSE loss - Understand why Mean Squared Error works and how to compute gradients
- Build batch gradient descent - Update weights using the full dataset gradient
- Handle multi-feature regression - Extend single-variable regression to N input features
- Debug convergence issues - Diagnose exploding gradients, slow convergence, and local minima
The Core Question You’re Answering
“How do we handle multiple inputs at once?”
In Project 1, you predicted logic gates with two inputs. Real-world problems have dozens or thousands of inputs. A house price depends on square footage, bedrooms, bathrooms, location, age, lot size, and countless other factors.
The naive approach would be nested loops:
for sample in samples:                      # 10,000 samples
    prediction = 0
    for feature in range(n_features):       # 50 features
        prediction += weights[feature] * sample[feature]
This is unbearably slow. Linear algebra gives us a single expression:
predictions = X @ W # 10,000 predictions in one operation
This project teaches you why that works and how to use it.
Concepts You Must Understand First
Before writing code, ensure you have solid mental models for these concepts:
1. Vectors and Matrices (Row vs Column)
A vector is a 1D array of numbers. A matrix is a 2D grid.
Row vector (1 x 3):    [1, 2, 3]

Column vector (3 x 1): [1]
                       [2]
                       [3]

Matrix (3 x 2):        [1, 4]
                       [2, 5]
                       [3, 6]
Critical rule: In machine learning, we follow the convention:
- Rows = samples (individual data points)
- Columns = features (input variables)
Feature1 Feature2 Feature3
Sample 1 1200 3 2 <- House 1: 1200 sqft, 3 beds, 2 baths
Sample 2 1800 4 3 <- House 2: 1800 sqft, 4 beds, 3 baths
Sample 3 2500 5 4 <- House 3: 2500 sqft, 5 beds, 4 baths
Matrix X has shape (n_samples, n_features) or (m, n).
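Here is that convention in NumPy, using the houses from the table above (a quick sketch, not part of the project code):
import numpy as np

# Rows = samples (houses), columns = features (sqft, beds)
X = np.array([[1200, 3],
              [1800, 4],
              [2500, 5]])

print(X.shape)    # (3, 2) -> (n_samples, n_features)
print(X[0])       # First sample: [1200 3]
print(X[:, 0])    # First feature across all samples: [1200 1800 2500]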
2. Matrix Multiplication Mechanics
Matrix multiplication A @ B works when A has shape (m, n) and B has shape (n, p).
A (2x3) B (3x1) Result (2x1)
[a11 a12 a13] [b1] [a11*b1 + a12*b2 + a13*b3]
[a21 a22 a23] @ [b2] = [a21*b1 + a22*b2 + a23*b3]
[b3]
Each row of A "dots" with the column of B to produce one output element.
The key insight: X @ W computes the weighted sum for EVERY sample simultaneously.
X (3 samples, 2 features) W (2 weights)
sqft beds w1
[1200, 3] [0.5]
[1800, 4] @ [100]
[2500, 5]
= [1200*0.5 + 3*100] = [900] <- Prediction for house 1
[1800*0.5 + 4*100] = [1300] <- Prediction for house 2
[2500*0.5 + 5*100] = [1750] <- Prediction for house 3
Three predictions, one operation. No loops.
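You can verify this example directly in NumPy (a minimal sketch using the same numbers):
import numpy as np

X = np.array([[1200, 3],
              [1800, 4],
              [2500, 5]])    # (3 samples, 2 features)
W = np.array([[0.5],
              [100]])        # (2, 1) column of weights

print(X @ W)                 # [[ 900.] [1300.] [1750.]] - all three at once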
3. Why Vectorization is Faster (SIMD, Cache Locality)
Your CPU has special SIMD (Single Instruction, Multiple Data) units that can perform 4, 8, or even 16 floating-point operations simultaneously.
Sequential (Python loop):
    for i in range(4):
        result[i] = a[i] * b[i]   # ~4 CPU cycles each -> ~16 cycles total

Vectorized (NumPy):
    result = a * b                # All 4 at once (SIMD) -> ~1-2 cycles total
Additionally, cache locality matters:
- NumPy stores arrays contiguously in memory
- When you access a[0], the CPU loads a whole cache line (roughly a[0:8] for float64) into cache
- Subsequent accesses hit cache (fast) instead of RAM (slow)
Python loops don’t benefit because:
- Each iteration has interpreter overhead
- Python floats are heap-allocated objects, scattered in memory
Bottom line: A 1000-element vector operation runs ~100x faster in NumPy than in pure Python.
4. Feature Scaling (Standardization vs Normalization)
Raw features have wildly different scales:
- Square footage: 500 - 10,000
- Bedrooms: 1 - 10
- Year built: 1900 - 2024
If we use these directly, gradient descent struggles:
Before scaling, the loss surface is steep along sqft (range 0-10,000) and nearly flat along beds (range 0-10), so gradient descent zig-zags. After scaling, both features span roughly -2 to 2, the gradients are comparable, and descent heads straight for the minimum.
Standardization (Z-score): Centers at 0, scales to unit variance
x_scaled = (x - mean) / std
Normalization (Min-Max): Scales to [0, 1] range
x_scaled = (x - min) / (max - min)
When to use which:
- Standardization: When you don’t know the bounds, handles outliers better
- Normalization: When you need bounded outputs (e.g., image pixels)
For gradient descent, standardization is usually preferred.
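A minimal sketch of both transforms on one feature column (the numbers are illustrative):
import numpy as np

x = np.array([500.0, 1200.0, 2500.0, 10000.0])    # e.g., square footage

z = (x - x.mean()) / x.std()                      # standardization: mean ~0, std ~1
mm = (x - x.min()) / (x.max() - x.min())          # normalization: bounded [0, 1]

print(round(z.mean(), 10), round(z.std(), 10))    # ~0.0, 1.0
print(mm.min(), mm.max())                         # 0.0, 1.0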
5. MSE Loss Function Derivation
Mean Squared Error measures how wrong our predictions are:
MSE Loss: L(W) = (1/m) * sum_{i=1..m} (y_hat_i - y_i)^2
where:
m = number of samples
y_hat_i = predicted value for sample i
y_i = actual value for sample i
Why squared?
- Penalizes large errors more than small errors
- Is differentiable everywhere (unlike absolute value)
- Has a unique minimum (convex for linear regression)
In matrix form:
predictions = X @ W # Shape: (m, 1)
errors = predictions - y # Shape: (m, 1)
L = (1/m) * (errors.T @ errors) # Scalar: sum of squared errors
The errors.T @ errors computes the sum of squared errors in one operation.
Deep Theoretical Foundation
The Linear Regression Model
Linear regression assumes a linear relationship between inputs and output:
y = w1*x1 + w2*x2 + ... + wn*xn + b
Or in matrix form:
y = X @ W + b
We often absorb the bias b into W by adding a column of 1s to X:
Original X Augmented X Weights W
[sqft, beds] [1, sqft, beds] [b, w1, w2]
[1200, 3] -> [1, 1200, 3] [50000] <- bias
[1800, 4] -> [1, 1800, 4] @ [100] <- weight for sqft
[2500, 5] -> [1, 2500, 5] [10000] <- weight for beds
Now: y = X @ W (no separate bias term!)
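A sketch confirming the two forms agree (the weight and bias values are made up):
import numpy as np

X = np.array([[1200., 3.],
              [1800., 4.],
              [2500., 5.]])
w = np.array([[100.], [10000.]])   # weights for sqft, beds (made up)
b = 50000.0                        # separate bias term

X_aug = np.hstack([np.ones((X.shape[0], 1)), X])   # prepend column of 1s
W = np.vstack([[b], w])                            # prepend bias to weights

print(np.allclose(X @ w + b, X_aug @ W))           # True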
The Normal Equation (Closed-Form Solution)
There’s actually a formula that gives the optimal weights directly:
W* = (X.T @ X)^(-1) @ X.T @ y
This finds the exact solution in one step! So why bother with gradient descent?
Problems with the normal equation:
- Matrix inversion is O(n^3) - for 100,000 features, that's 10^15 operations
- Memory intensive - X.T @ X is (n_features x n_features)
- Singular matrices - if features are correlated, the inverse doesn't exist
- No regularization - can't easily add an L2 penalty
Gradient descent advantages:
- Scales to millions of features - Each step is O(m * n)
- Works for any differentiable loss - Not just MSE
- Naturally extends to neural networks - Same algorithm everywhere
- Easy to add regularization - Just modify the gradient
Rule of thumb: Use normal equation for n_features < 10,000. Use gradient descent otherwise.
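For small feature counts you can sanity-check gradient descent against the closed form. A sketch, assuming noise-free synthetic data so the recovery is exact:
import numpy as np

rng = np.random.default_rng(0)
X = np.hstack([np.ones((100, 1)), rng.normal(size=(100, 2))])
y = X @ np.array([[5.0], [3.0], [-2.0]])   # known weights: bias=5, w1=3, w2=-2

# Normal equation via solve (avoids explicitly inverting X.T @ X)
W_star = np.linalg.solve(X.T @ X, X.T @ y)
print(W_star.ravel())                      # ~[ 5.  3. -2.]

# np.linalg.lstsq solves the same least-squares problem more stably
W_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.allclose(W_star, W_lstsq))        # True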
MSE Gradient Derivation with Matrices
The gradient tells us which direction to move weights to reduce loss.
Starting with:
L = (1/m) * ||X @ W - y||^2
Let e = X @ W - y (the error vector). Then:
L = (1/m) * e.T @ e
Using matrix calculus (chain rule for matrices):
dL/dW = (1/m) * d(e.T @ e)/dW
= (1/m) * 2 * X.T @ e (derivation involves Jacobians)
= (2/m) * X.T @ (X @ W - y)
The gradient formula:
gradient = (2/m) * X.T @ (X @ W - y)
where:
X.T has shape (n_features, m)
(X @ W - y) has shape (m, 1)
gradient has shape (n_features, 1)
Each element of the gradient tells us how much to adjust that weight.
Batch vs Stochastic vs Mini-batch
Batch Gradient Descent (this project):
- Uses ALL samples to compute gradient
- Smooth convergence, but slow per step
- Memory intensive for large datasets
for epoch in epochs:
gradient = (2/m) * X.T @ (X @ W - y) # Full dataset
W = W - learning_rate * gradient
Stochastic Gradient Descent (SGD):
- Uses ONE sample per update
- Noisy but fast per step
- Can escape local minima
for epoch in epochs:
for i in shuffle(samples):
gradient = 2 * X[i].T * (X[i] @ W - y[i]) # Single sample
W = W - learning_rate * gradient
Mini-batch Gradient Descent:
- Uses a batch of 32-256 samples
- Best of both: stable yet efficient
- Standard in deep learning
for epoch in epochs:
for batch in split(samples, batch_size=32):
gradient = (2/len(batch)) * X_batch.T @ (X_batch @ W - y_batch)
W = W - learning_rate * gradient
Feature Scaling Mathematics
Without scaling, features with larger magnitudes dominate the gradient:
Suppose: sqft ~ 2000, beds ~ 3
Gradient for sqft weight: proportional to sqft values ~ 2000
Gradient for beds weight: proportional to beds values ~ 3
The sqft gradient is ~700x larger!
This creates an elongated loss surface:
Before scaling, the loss contours form a long, narrow valley along w_sqft: only tiny steps avoid overshooting, making the surface hard to navigate. After scaling, the contours are roughly circular, so one learning rate works for every weight and each step makes balanced progress.
Standardization formula:
For each feature column j:
mean_j = (1/m) * sum(X[:, j])
std_j = sqrt((1/m) * sum((X[:, j] - mean_j)^2))
X_scaled[:, j] = (X[:, j] - mean_j) / std_j
Critical: Save the mean and std during training. Apply the SAME transformation to new data at prediction time!
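To see the imbalance concretely, compare MSE gradient magnitudes on raw versus standardized features (a sketch with made-up synthetic data):
import numpy as np

rng = np.random.default_rng(1)
sqft = rng.uniform(1000, 3000, size=(200, 1))
beds = rng.integers(1, 6, size=(200, 1)).astype(float)
X = np.hstack([sqft, beds])
y = 100 * sqft + 20000 * beds + rng.normal(0, 5000, size=(200, 1))

def mse_gradient(X, y, W):
    m = X.shape[0]
    return (2 / m) * X.T @ (X @ W - y)

W0 = np.zeros((2, 1))
print(np.abs(mse_gradient(X, y, W0)).ravel())      # sqft component dwarfs beds

X_std = (X - X.mean(axis=0)) / X.std(axis=0)
print(np.abs(mse_gradient(X_std, y, W0)).ravel())  # comparable magnitudes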
Real-World Outcome
When your linear regression engine is complete, you will have a tool that produces accurate predictions with detailed training feedback. Here is exactly what running your tool should produce:
Example 1: Training on House Data
Command and Output:
$ python predict_house.py --train data.csv
================================================================================
LINEAR REGRESSION ENGINE v1.0
================================================================================
Loading data from: data.csv
Records loaded: 500
Features detected: 2 (sqft, bedrooms)
--------------------------------------------------------------------------------
DATA EXPLORATION
--------------------------------------------------------------------------------
Feature Statistics:
sqft bedrooms
min: 850.00 1.00
max: 4200.00 6.00
mean: 1856.32 3.12
std: 687.45 1.24
Target Statistics (price):
min: $125,000
max: $875,000
mean: $342,156
std: $156,234
--------------------------------------------------------------------------------
PREPROCESSING
--------------------------------------------------------------------------------
Applying standardization (Z-score normalization)...
sqft: mean=1856.32, std=687.45 -> scaled to mean=0, std=1
bedrooms: mean=3.12, std=1.24 -> scaled to mean=0, std=1
Adding bias column (column of 1s)...
Feature matrix shape: (500, 3)
--------------------------------------------------------------------------------
TRAINING
--------------------------------------------------------------------------------
Hyperparameters:
Learning rate: 0.01
Max epochs: 10000
Convergence threshold: 0.0001
Initializing weights: [0.0, 0.0, 0.0]
Training Progress:
Epoch 1: Loss = 116,849,251,904.00 (Starting point)
Epoch 10: Loss = 12,345,678,901.23
Epoch 50: Loss = 456,789,012.34
Epoch 100: Loss = 45,000,234.56
Epoch 500: Loss = 2,300,567.89
Epoch 1000: Loss = 150,234.56 (Converged!)
[==================================================] 100%
Convergence achieved at epoch 1000
Final loss: 150,234.56
Loss reduction: 99.99%
Learned weights (scaled space):
w0 (bias): 342156.0
w1 (sqft): 127834.5
w2 (bedrooms): 18234.2
Unscaled interpretation:
Base price: $342,156 (when all features are at mean)
Per sqft: $186.00 ($127,834.5 / 687.45)
Per bedroom: $14,705 ($18,234.2 / 1.24)
--------------------------------------------------------------------------------
MODEL EVALUATION
--------------------------------------------------------------------------------
Training set metrics:
Mean Squared Error: 150,234.56
Root MSE: $387.73
Mean Absolute Error: $298.45
R-squared: 0.9823
Sample predictions vs actuals:
Sample 1: Predicted $352,000 | Actual $350,000 | Error: $2,000
Sample 2: Predicted $425,500 | Actual $430,000 | Error: $4,500
Sample 3: Predicted $289,000 | Actual $285,000 | Error: $4,000
================================================================================
MODEL SAVED
================================================================================
Saved model to: model.pkl
Saved scaler to: scaler.pkl
Ready for predictions! Run: python predict_house.py --predict
Example 2: Making Predictions
Command and Output:
$ python predict_house.py --predict
================================================================================
LINEAR REGRESSION ENGINE v1.0
================================================================================
Loading model from: model.pkl
Loading scaler from: scaler.pkl
Model loaded successfully.
Features: sqft, bedrooms
Trained on: 500 samples
R-squared: 0.9823
--------------------------------------------------------------------------------
INTERACTIVE PREDICTION MODE
--------------------------------------------------------------------------------
Enter house features (or 'quit' to exit):
Enter sqft: 2500
Enter bedrooms: 3
Processing...
Raw input: [2500, 3]
Scaled input: [0.936, -0.097] (using training statistics)
Augmented: [1, 0.936, -0.097]
Prediction Breakdown:
Base price (bias): $342,156.00
+ sqft contribution: $119,688.90 (2500 sqft)
+ bedrooms contribution: -$1,768.10 (3 beds, below average)
-----------------------------------------
PREDICTED PRICE: $460,076.80
Confidence interval (95%): $452,300 - $467,850
--------------------------------------------------------------------------------
Enter sqft: 1200
Enter bedrooms: 2
Processing...
Raw input: [1200, 2]
Scaled input: [-0.954, -0.903]
Prediction Breakdown:
Base price (bias): $342,156.00
+ sqft contribution: -$121,977.24
+ bedrooms contribution: -$16,463.50
-----------------------------------------
PREDICTED PRICE: $203,715.26
Confidence interval (95%): $196,000 - $211,430
--------------------------------------------------------------------------------
Enter sqft: quit
Thank you for using Linear Regression Engine!
Example 3: Timing Comparison (Loop vs Vectorized)
$ python predict_house.py --benchmark
================================================================================
VECTORIZATION BENCHMARK
================================================================================
Generating synthetic data...
Samples: 100,000
Features: 50
--------------------------------------------------------------------------------
LOOP-BASED IMPLEMENTATION
--------------------------------------------------------------------------------
def predict_loop(X, W):
predictions = []
for i in range(len(X)):
pred = 0
for j in range(len(W)):
pred += X[i, j] * W[j]
predictions.append(pred)
return predictions
Running 100 iterations...
Average time: 2.847 seconds
Predictions per second: 35,124
--------------------------------------------------------------------------------
VECTORIZED IMPLEMENTATION
--------------------------------------------------------------------------------
def predict_vectorized(X, W):
return X @ W
Running 100 iterations...
Average time: 0.023 seconds
Predictions per second: 4,347,826
================================================================================
COMPARISON
================================================================================
Speedup: 123.8x faster with vectorization!
For 100,000 samples with 50 features:
Loop version: 2,847 ms
Vectorized version: 23 ms
Time saved: 2,824 ms per inference
At scale (1M predictions/day):
Loop version: ~28.5 seconds (at 35,124 predictions/sec)
Vectorized version: ~0.23 seconds (at 4,347,826 predictions/sec)
This is why AI uses GPUs (which excel at matrix operations).
Example 4: Convergence Visualization
$ python predict_house.py --train data.csv --visualize
================================================================================
TRAINING VISUALIZATION
================================================================================
[Training progress as before...]
Generating visualizations...
Loss Curve:
Training Loss over Epochs
Loss ^
1e11 |****
| ***
1e9 | ***
| ***
1e7 | ***
| ****
1e5 | ****************************
|______________________________________________|____>
0 200 400 600 800 1000 Epoch
Feature Weights over Training:
Weight ^
+150k | .....w_sqft
| .....
+100k | .....
| .....
+50k | .....
| ...... _____w_beds
0 |.....-----
|______________________________________________|____>
0 200 400 600 800 1000 Epoch
Prediction vs Actual Scatter:
Predicted ^
$800k | .*
| .*.*
$600k | *.*.*
| *.*.*
$400k | *.*.*
| *.*.*
$200k | *.*.* Perfect fit line
|*.* would be diagonal
|______________________________________|___>
$200k $400k $600k $800k Actual
Saved to: training_visualization.png
Solution Architecture
High-Level Design
+-------------------------------------------------------------------------+
| LINEAR REGRESSION ENGINE |
+-------------------------------------------------------------------------+
| |
| +--------------+ +----------------+ +------------------+ |
| | DataLoader |--->| Preprocessor |--->| Optimizer | |
| | (CSV parser) | | (scaling, bias)| | (gradient desc.) | |
| +--------------+ +----------------+ +------------------+ |
| | | | |
| v v v |
| +-------------+ +-----------------+ +------------------+ |
| | X, y | | X_scaled, means | | W (weights) | |
| | (raw data) | | stds, X_aug | | loss_history | |
| +-------------+ +-----------------+ +------------------+ |
| | |
| +-------------------------+ |
| | |
| v |
| +------------------+ +-----------------+ |
| | Predictor |--->| Evaluator | |
| | (inference) | | (metrics, viz) | |
| +------------------+ +-----------------+ |
| |
+-------------------------------------------------------------------------+
Data Flow:
1. DataLoader reads CSV, extracts features (X) and target (y)
2. Preprocessor computes statistics, scales X, adds bias column
3. Optimizer runs gradient descent, updates weights, tracks loss
4. Predictor applies model to new data (with same scaling)
5. Evaluator computes R^2, RMSE, generates plots
Data Structures
class LinearRegression:
"""Core linear regression model."""
# Model parameters
weights: np.ndarray # Shape: (n_features + 1,) includes bias
# Scaling parameters (saved for prediction)
feature_means: np.ndarray # Shape: (n_features,)
feature_stds: np.ndarray # Shape: (n_features,)
# Training hyperparameters
learning_rate: float
max_epochs: int
convergence_threshold: float
# Training history
loss_history: List[float]
weight_history: List[np.ndarray]
class Dataset:
"""Holds training/test data."""
X_raw: np.ndarray # Original features, shape (m, n)
y: np.ndarray # Targets, shape (m,) or (m, 1)
feature_names: List[str] # Column names
X_scaled: np.ndarray # Scaled and augmented (m, n+1)
class Metrics:
"""Evaluation metrics."""
mse: float # Mean Squared Error
rmse: float # Root MSE
mae: float # Mean Absolute Error
r_squared: float # Coefficient of determination
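A sketch of how these metrics might be computed (R-squared uses the standard 1 - SS_res/SS_tot definition):
import numpy as np

def compute_metrics(y_true: np.ndarray, y_pred: np.ndarray) -> dict:
    """MSE, RMSE, MAE, and R-squared for regression predictions."""
    errors = y_pred - y_true
    mse = float(np.mean(errors ** 2))
    ss_res = float(np.sum(errors ** 2))                    # residual sum of squares
    ss_tot = float(np.sum((y_true - y_true.mean()) ** 2))  # total sum of squares
    return {
        'mse': mse,
        'rmse': float(np.sqrt(mse)),
        'mae': float(np.mean(np.abs(errors))),
        'r_squared': 1.0 - ss_res / ss_tot,
    }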
Matrix Shape Reference
Understanding shapes is critical. Here’s a cheat sheet:
Given:
m = number of samples (e.g., 500 houses)
n = number of features (e.g., 2: sqft, bedrooms)
Matrices:
X_raw : (m, n) = (500, 2)
X_scaled : (m, n) = (500, 2)
X_augmented : (m, n+1) = (500, 3) # After adding bias column
y : (m, 1) = (500, 1)
W : (n+1, 1) = (3, 1)
Operations and their shapes:
X_aug @ W : (500, 3) @ (3, 1) = (500, 1) predictions
X_aug @ W - y : (500, 1) - (500, 1) = (500, 1) errors
X_aug.T @ (X_aug@W - y) : (3, 500) @ (500, 1) = (3, 1) gradient
Common errors:
- Shape mismatch: Check that inner dimensions match for @
- Dimension issues: y as (m,) vs (m, 1) can cause broadcasting bugs
- Forgetting bias: W should have n+1 elements, not n
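One way to catch these early is a tiny validation helper run before training (a sketch; the helper name is ours):
import numpy as np

def check_shapes(X_aug: np.ndarray, y: np.ndarray, W: np.ndarray) -> None:
    """Fail fast with readable messages instead of a cryptic matmul error."""
    m, n_plus_1 = X_aug.shape
    assert np.allclose(X_aug[:, 0], 1), "first column of X_aug should be the bias 1s"
    assert y.shape == (m, 1), f"y should be ({m}, 1), got {y.shape}; try y.reshape(-1, 1)"
    assert W.shape == (n_plus_1, 1), f"W should be ({n_plus_1}, 1), got {W.shape}"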
Phased Implementation Guide
Phase 1: Data Loading and Exploration (Day 1)
Goal: Load CSV data, understand its structure, visualize distributions.
Tasks:
- Load CSV using pandas or numpy
- Separate features (X) from target (y)
- Compute basic statistics (min, max, mean, std)
- Plot histograms of each feature
- Create scatter plots of features vs target
Code structure:
def load_data(filepath: str) -> Tuple[np.ndarray, np.ndarray, List[str]]:
"""Load CSV and return X, y, and feature names."""
# Use pandas for convenience, convert to numpy
df = pd.read_csv(filepath)
feature_names = list(df.columns[:-1]) # All but last column
X = df.iloc[:, :-1].values # All but last column as numpy
y = df.iloc[:, -1].values # Last column as numpy
return X, y, feature_names
def explore_data(X, y, feature_names):
"""Print statistics and create visualizations."""
print("Feature Statistics:")
for i, name in enumerate(feature_names):
print(f" {name}: min={X[:, i].min():.2f}, max={X[:, i].max():.2f}, "
f"mean={X[:, i].mean():.2f}, std={X[:, i].std():.2f}")
print(f"\nTarget: min=${y.min():,.0f}, max=${y.max():,.0f}, "
f"mean=${y.mean():,.0f}")
Testing:
- Create a small CSV by hand (5 rows, 2 features)
- Verify statistics match manual calculations
- Check shapes: X.shape should be (n_samples, n_features)
Phase 2: Feature Scaling Implementation (Day 2)
Goal: Implement standardization that can be applied consistently.
The key insight: You must save the mean and std from training data to apply the SAME transformation to test data later.
Code structure:
class Scaler:
"""Standardizes features using Z-score normalization."""
def __init__(self):
self.means = None
self.stds = None
self.is_fitted = False
def fit(self, X: np.ndarray):
"""Compute mean and std for each feature."""
self.means = X.mean(axis=0) # Mean of each column
self.stds = X.std(axis=0) # Std of each column
# Handle zero std (constant features)
self.stds[self.stds == 0] = 1
self.is_fitted = True
def transform(self, X: np.ndarray) -> np.ndarray:
"""Apply standardization using saved statistics."""
if not self.is_fitted:
raise ValueError("Scaler must be fitted before transform")
return (X - self.means) / self.stds
def fit_transform(self, X: np.ndarray) -> np.ndarray:
"""Fit and transform in one step."""
self.fit(X)
return self.transform(X)
Testing:
# Test: After scaling, each column should have mean ~0, std ~1
X_scaled = scaler.fit_transform(X)
assert np.allclose(X_scaled.mean(axis=0), 0, atol=1e-10)
assert np.allclose(X_scaled.std(axis=0), 1, atol=1e-10)
Phase 3: Vectorized Forward Pass (Day 3)
Goal: Compute predictions for all samples in one matrix operation.
The forward pass:
def add_bias_column(X: np.ndarray) -> np.ndarray:
"""Prepend a column of 1s to X for the bias term."""
ones = np.ones((X.shape[0], 1))
return np.hstack([ones, X])
def predict(X_augmented: np.ndarray, W: np.ndarray) -> np.ndarray:
"""Compute predictions: y_hat = X @ W"""
return X_augmented @ W
Shape verification:
# Given X with shape (500, 2):
X_aug = add_bias_column(X) # Shape: (500, 3)
W = np.zeros((3, 1)) # Shape: (3, 1) - initialized to zeros
y_hat = predict(X_aug, W) # Shape: (500, 1)
print(f"X_aug shape: {X_aug.shape}") # (500, 3)
print(f"W shape: {W.shape}") # (3, 1)
print(f"y_hat shape: {y_hat.shape}") # (500, 1)
No loops allowed! The entire prediction for 500 samples happens in X_augmented @ W.
Phase 4: MSE Loss Calculation (Day 4)
Goal: Implement the loss function to measure prediction quality.
The MSE formula in code:
def compute_loss(y_true: np.ndarray, y_pred: np.ndarray) -> float:
"""Compute Mean Squared Error."""
m = len(y_true)
errors = y_pred - y_true
mse = (1 / m) * np.sum(errors ** 2)
return mse
# Alternative using matrix operations:
def compute_loss_matrix(y_true: np.ndarray, y_pred: np.ndarray) -> float:
"""Compute MSE using matrix form."""
m = len(y_true)
errors = y_pred - y_true
mse = (1 / m) * (errors.T @ errors).item() # .item() extracts scalar
return mse
Testing:
# Test with known values
y_true = np.array([[100], [200], [300]])
y_pred = np.array([[110], [190], [310]])
# Errors: [10, -10, 10], squared: [100, 100, 100], mean: 100
loss = compute_loss(y_true, y_pred)
assert loss == 100.0
Phase 5: Gradient Computation (Vectorized) (Day 5)
Goal: Compute the gradient of the loss with respect to weights.
The gradient formula:
gradient = (2/m) * X.T @ (X @ W - y)
Implementation:
def compute_gradient(X: np.ndarray, y: np.ndarray, W: np.ndarray) -> np.ndarray:
"""Compute gradient of MSE loss with respect to W."""
m = X.shape[0]
predictions = X @ W
errors = predictions - y
gradient = (2 / m) * (X.T @ errors)
return gradient
Shape verification:
# Given: X is (500, 3), y is (500, 1), W is (3, 1)
gradient = compute_gradient(X, y, W)
print(f"Gradient shape: {gradient.shape}") # Should be (3, 1)
# Each element tells us how to adjust that weight:
# gradient[0] -> adjustment for bias
# gradient[1] -> adjustment for w_sqft
# gradient[2] -> adjustment for w_bedrooms
Testing:
# Numerical gradient check (slow but verifies correctness)
def numerical_gradient(X, y, W, epsilon=1e-7):
num_grad = np.zeros_like(W)
for i in range(len(W)):
W_plus = W.copy()
W_plus[i] += epsilon
W_minus = W.copy()
W_minus[i] -= epsilon
loss_plus = compute_loss(y, X @ W_plus)
loss_minus = compute_loss(y, X @ W_minus)
num_grad[i] = (loss_plus - loss_minus) / (2 * epsilon)
return num_grad
# Compare analytical and numerical gradients
analytical = compute_gradient(X, y, W)
numerical = numerical_gradient(X, y, W)
assert np.allclose(analytical, numerical, rtol=1e-5)
Phase 6: Training Loop with Convergence Check (Day 6-7)
Goal: Put it all together with proper stopping criteria.
The training loop:
def train(X: np.ndarray, y: np.ndarray,
learning_rate: float = 0.01,
max_epochs: int = 10000,
convergence_threshold: float = 1e-6) -> Tuple[np.ndarray, List[float]]:
"""Train linear regression using batch gradient descent."""
# Initialize weights to zeros
n_features = X.shape[1]
W = np.zeros((n_features, 1))
loss_history = []
prev_loss = float('inf')
for epoch in range(max_epochs):
# Forward pass
predictions = X @ W
# Compute loss
current_loss = compute_loss(y, predictions)
loss_history.append(current_loss)
# Check convergence
if abs(prev_loss - current_loss) < convergence_threshold:
print(f"Converged at epoch {epoch}")
break
prev_loss = current_loss
# Compute gradient
gradient = compute_gradient(X, y, W)
# Update weights
W = W - learning_rate * gradient
# Print progress periodically
if epoch % 100 == 0:
print(f"Epoch {epoch}: Loss = {current_loss:.2f}")
return W, loss_history
The complete pipeline:
def fit(X_raw: np.ndarray, y_raw: np.ndarray):
"""Full training pipeline."""
# 1. Reshape y if needed
y = y_raw.reshape(-1, 1) if y_raw.ndim == 1 else y_raw
# 2. Scale features
scaler = Scaler()
X_scaled = scaler.fit_transform(X_raw)
# 3. Add bias column
X_aug = add_bias_column(X_scaled)
# 4. Train
W, loss_history = train(X_aug, y)
# 5. Return model
return {
'weights': W,
'scaler': scaler,
'loss_history': loss_history
}
Questions to Guide Your Design
Before implementing, think through these questions:
Data representation:
- Should y be shape (m,) or (m, 1)? Why does it matter?
- How will you handle missing values in the CSV?
- What if a feature column has zero variance (all same value)?
Scaling decisions:
- Should you scale the target y as well? What are the tradeoffs?
- What happens if you forget to scale test data the same way?
- How will you store the scaling parameters for deployment?
Gradient descent tuning:
- How will you know if the learning rate is too high?
- What symptoms indicate the learning rate is too low?
- Should you decrease the learning rate over time? (Learning rate scheduling)
Convergence criteria:
- Is checking loss difference sufficient? What about weight changes?
- What if the loss oscillates instead of decreasing?
- How many epochs without improvement should trigger early stopping?
Memory efficiency:
- For 1 million samples, how much RAM does X.T @ X need?
- Could you implement mini-batch gradient descent to reduce memory?
Thinking Exercise: Matrix Multiplication by Hand
Before trusting NumPy, understand what it’s doing. Compute these by hand:
Exercise 1: Simple Prediction
Given:
X = [[1, 2, 3], (2 samples, 3 features including bias)
[1, 4, 1]]
W = [[10], (3 weights)
[5],
[2]]
Compute X @ W:
Step by step:
Row 1 of X dot W: 1*10 + 2*5 + 3*2 = 10 + 10 + 6 = 26
Row 2 of X dot W: 1*10 + 4*5 + 1*2 = 10 + 20 + 2 = 32
Result: [[26],
[32]]
Exercise 2: Gradient Computation
Given:
X = [[1, 2],
[1, 3]]
y = [[5],
[7]]
W = [[1],
[1]]
Compute the gradient:
Step 1: Predictions = X @ W
Row 1: 1*1 + 2*1 = 3
Row 2: 1*1 + 3*1 = 4
predictions = [[3], [4]]
Step 2: Errors = predictions - y
[[3], [4]] - [[5], [7]] = [[-2], [-3]]
Step 3: X.T @ errors
X.T = [[1, 1],
[2, 3]]
X.T @ errors:
Row 1: 1*(-2) + 1*(-3) = -5
Row 2: 2*(-2) + 3*(-3) = -4 + -9 = -13
= [[-5], [-13]]
Step 4: Multiply by 2/m (m=2)
gradient = (2/2) * [[-5], [-13]] = [[-5], [-13]]
The negative gradient tells us: increase both weights to reduce error.
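Once you have the hand result, a few lines of NumPy confirm it (a sketch of Exercise 2):
import numpy as np

X = np.array([[1., 2.], [1., 3.]])
y = np.array([[5.], [7.]])
W = np.array([[1.], [1.]])

gradient = (2 / X.shape[0]) * X.T @ (X @ W - y)
print(gradient)   # [[ -5.] [-13.]] - matches the hand computation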
Exercise 3: Shape Reasoning
Without computing, predict the output shape:
1. (100, 5) @ (5, 1) = ?
2. (5, 100) @ (100, 1) = ?
3. (100, 5).T @ (100, 1) = ?
4. (100, 1) @ (1, 5) = ?
Answers:
1. (100, 1) - 100 predictions from 100 samples
2. (5, 1) - 5 sums of 100 elements each (gradient shape)
3. Same as 2: (100, 5).T is (5, 100), so (5, 100) @ (100, 1) = (5, 1)
4. (100, 5) - outer product, creates a matrix
Testing Strategy
Unit Tests for Core Functions
import numpy as np
import pytest
class TestScaler:
def test_fit_computes_correct_stats(self):
X = np.array([[1, 10], [2, 20], [3, 30]])
scaler = Scaler()
scaler.fit(X)
assert np.allclose(scaler.means, [2, 20])
assert np.allclose(scaler.stds, [0.816, 8.165], rtol=0.01)
def test_transform_scales_correctly(self):
X = np.array([[0], [10]])
scaler = Scaler()
scaler.fit(X)
X_scaled = scaler.transform(X)
assert np.allclose(X_scaled.mean(), 0)
assert np.allclose(X_scaled.std(), 1)
def test_transform_before_fit_raises(self):
scaler = Scaler()
with pytest.raises(ValueError):
scaler.transform(np.array([[1, 2]]))
class TestGradient:
def test_gradient_matches_numerical(self):
np.random.seed(42)
X = np.random.randn(10, 3)
y = np.random.randn(10, 1)
W = np.random.randn(3, 1)
analytical = compute_gradient(X, y, W)
numerical = numerical_gradient(X, y, W)
assert np.allclose(analytical, numerical, rtol=1e-5)
def test_gradient_zero_at_optimum(self):
# For a simple case we can solve analytically
X = np.array([[1, 1], [1, 2]])
y = np.array([[3], [5]])
W_optimal = np.array([[1], [2]]) # y = 1 + 2*x
gradient = compute_gradient(X, y, W_optimal)
assert np.allclose(gradient, 0, atol=1e-10)
class TestTraining:
def test_loss_decreases(self):
np.random.seed(42)
X = np.random.randn(100, 3)
y = X @ np.array([[1], [2], [3]]) + np.random.randn(100, 1) * 0.1
W, loss_history = train(X, y, learning_rate=0.1, max_epochs=100)
assert loss_history[-1] < loss_history[0]
assert all(loss_history[i] >= loss_history[i+1] - 1e-10
for i in range(len(loss_history)-1))
def test_recovers_known_weights(self):
# Generate data from known linear relationship
X = add_bias_column(np.random.randn(1000, 2))
true_W = np.array([[5], [3], [-2]]) # bias=5, w1=3, w2=-2
y = X @ true_W
W_learned, _ = train(X, y, learning_rate=0.1, max_epochs=1000)
assert np.allclose(W_learned, true_W, rtol=0.01)
Integration Tests
class TestFullPipeline:
def test_pipeline_on_synthetic_data(self):
# Generate house-like data
np.random.seed(42)
n_samples = 500
sqft = np.random.uniform(1000, 3000, n_samples)
beds = np.random.randint(1, 6, n_samples)
price = 50000 + 100 * sqft + 20000 * beds + np.random.randn(n_samples) * 10000
X = np.column_stack([sqft, beds])
y = price
model = fit(X, y)
# Test prediction
test_house = np.array([[2000, 3]])
expected_price = 50000 + 100 * 2000 + 20000 * 3 # ~310,000
test_scaled = model['scaler'].transform(test_house)
test_aug = add_bias_column(test_scaled)
prediction = (test_aug @ model['weights']).item()
assert abs(prediction - expected_price) < 20000 # Within $20k
def test_model_saves_and_loads(self):
import os
import pickle
X = np.random.randn(100, 2)
y = np.random.randn(100, 1)
model = fit(X, y)
# Save
with open('test_model.pkl', 'wb') as f:
pickle.dump(model, f)
# Load
with open('test_model.pkl', 'rb') as f:
loaded = pickle.load(f)
assert np.allclose(model['weights'], loaded['weights'])
os.remove('test_model.pkl')
Common Pitfalls and Debugging Tips
Pitfall 1: Exploding Loss
Symptom: Loss shoots to infinity or becomes NaN
Epoch 1: Loss = 150000
Epoch 2: Loss = 1500000000
Epoch 3: Loss = inf
Epoch 4: Loss = nan
Cause: Learning rate is too high. Gradient steps overshoot the minimum.
Fix: Reduce learning rate by 10x or 100x. Try 0.01, 0.001, or 0.0001.
Diagram:
Learning rate too high:
Loss
^
| * <-- overshoot
| / \
| / \
| / * <-- overshoot again
| * \
| \ * <-- explodes!
|____\______\_____>
optimum
Pitfall 2: Loss Not Decreasing
Symptom: Loss stays flat or decreases imperceptibly
Epoch 1: Loss = 150000.00
Epoch 100: Loss = 149999.99
Epoch 1000: Loss = 149998.50
Causes:
- Learning rate too small
- Features not scaled
- Weights initialized poorly
Fixes:
- Increase learning rate
- Verify features have mean 0, std 1
- Try different weight initialization
Pitfall 3: Shape Mismatch Errors
Symptom:
ValueError: matmul: Input operand 1 has a mismatch in its core dimension
Common causes:
- Forgot to add bias column
- y is shape (m,) instead of (m, 1)
- Mixed up X.T with X
Debugging:
print(f"X shape: {X.shape}") # Should be (m, n+1)
print(f"y shape: {y.shape}") # Should be (m, 1)
print(f"W shape: {W.shape}") # Should be (n+1, 1)
# Reshape y if needed
y = y.reshape(-1, 1)
Pitfall 4: Forgot to Scale Test Data
Symptom: Model trains well but predictions on new data are wildly wrong
Cause: New data uses original scale, but model learned in scaled space
Fix: Always apply the SAME scaler to test data:
# During training:
scaler = Scaler()
X_train_scaled = scaler.fit_transform(X_train)
# During prediction:
X_test_scaled = scaler.transform(X_test) # Use same scaler!
prediction = X_test_scaled @ W
Pitfall 5: Numerical Precision Issues
Symptom: Gradient checks fail with small but non-zero differences
Cause: Floating-point arithmetic has limited precision
Fix: Use appropriate tolerances:
# Bad: Exact equality
assert gradient == numerical_gradient
# Good: Approximate equality
assert np.allclose(gradient, numerical_gradient, rtol=1e-5, atol=1e-8)
Debugging Workflow
When something goes wrong:
- Print shapes at every step:
  print(f"After loading: X={X.shape}, y={y.shape}")
  print(f"After scaling: X={X_scaled.shape}")
  print(f"After augmenting: X={X_aug.shape}")
  print(f"Weights: W={W.shape}")
  print(f"Predictions: pred={pred.shape}")
  print(f"Gradient: grad={grad.shape}")
- Test with tiny data:
  # Use 3 samples, 2 features - small enough to verify by hand
  X_tiny = np.array([[1, 2], [3, 4], [5, 6]])
  y_tiny = np.array([[10], [20], [30]])
- Check intermediate values:
  print(f"First 5 predictions: {predictions[:5].flatten()}")
  print(f"First 5 errors: {(predictions - y)[:5].flatten()}")
  print(f"Gradient: {gradient.flatten()}")
- Verify with known solution:
  # For y = 2*x + 1, we know W should converge to [1, 2]
  X = add_bias_column(np.array([[0], [1], [2]]))
  y = np.array([[1], [3], [5]])
Interview Questions
The regression and optimization skills from this project translate directly to interview questions. Here’s what you’ll be prepared to answer:
Gradient Descent and Optimization:
- “Explain gradient descent in your own words. Why does it work?”
- “What happens if the learning rate is too high? Too low?”
- “How would you choose an appropriate learning rate?”
- “What is the difference between batch, stochastic, and mini-batch gradient descent?”
- “Can gradient descent get stuck? How would you detect and address this?”
Linear Algebra and Vectorization:
- “Why is matrix multiplication used in machine learning?”
- “Explain why vectorized code is faster than loops.”
- “Given matrices of shapes (100, 50) and (50, 1), what is the output shape of their product?”
- “How would you implement matrix multiplication from scratch?”
- “What is the computational complexity of matrix multiplication?”
Feature Engineering:
- “Why do we need to scale features before training?”
- “What is the difference between normalization and standardization?”
- “How do you handle new data that falls outside the training range?”
- “What happens if you have correlated features?”
- “How would you add polynomial features to capture non-linear relationships?”
Loss Functions:
- “Why do we use squared error instead of absolute error?”
- “Derive the gradient of Mean Squared Error.”
- “What is the closed-form solution for linear regression?”
- “When would you use gradient descent instead of the normal equation?”
- “How does regularization modify the loss function?”
Practical Machine Learning:
- “How do you know if your model is overfitting?”
- “What is R-squared and how do you interpret it?”
- “How would you handle missing values in your training data?”
- “Describe your debugging process when a model isn’t learning.”
- “How would you deploy this model to production?”
Coding Questions:
- “Implement standardization without using numpy’s built-in functions.”
- “Write a function to compute MSE without loops.”
- “Implement gradient descent for linear regression.”
- “How would you add L2 regularization to your implementation?”
Hints in Layers
If you get stuck, reveal hints progressively. Try to solve problems yourself first.
Layer 1: Getting Started
Hint: Data loading structure
Use pandas for CSV parsing, then convert to numpy:
import pandas as pd
import numpy as np
df = pd.read_csv('data.csv')
# Assume last column is target
X = df.iloc[:, :-1].values # All columns except last
y = df.iloc[:, -1].values.reshape(-1, 1) # Last column, reshaped
print(f"X shape: {X.shape}, y shape: {y.shape}")
Hint: Bias column addition
Prepend a column of 1s to X:
def add_bias(X):
m = X.shape[0] # Number of samples
ones = np.ones((m, 1))
return np.hstack([ones, X])
# After: X_aug[:, 0] is all 1s
# When we do X_aug @ W, W[0] becomes the bias term
Layer 2: Scaling
Hint: Handle zero standard deviation
If a feature column is constant, std=0 causes division by zero:
self.stds = X.std(axis=0)
# Replace zeros with 1 to avoid NaN
self.stds[self.stds == 0] = 1.0
Constant features contribute nothing to the model anyway.
Hint: Keeping y in original scale
For interpretability, don’t scale the target:
# Scale X only
X_scaled = scaler.fit_transform(X)
# Keep y in original scale (dollars, not z-scores)
# Predictions will also be in dollars
If you do scale y, you must un-scale predictions for interpretation.
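The inverse transform is two lines (a sketch; X_aug and W come from the usual pipeline):
# Sketch: train on z-scored targets, then map predictions back to dollars
y_mean, y_std = y.mean(), y.std()
y_scaled = (y - y_mean) / y_std          # use y_scaled during training

predictions_scaled = X_aug @ W           # model output in z-score space
predictions = predictions_scaled * y_std + y_mean   # back to original units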
Layer 3: Gradient Computation
Hint: Matrix dimensions for gradient
The gradient must have the same shape as W:
X: (m, n) where m=samples, n=features (including bias)
y: (m, 1)
W: (n, 1)
predictions = X @ W # (m, n) @ (n, 1) = (m, 1)
errors = predictions - y # (m, 1) - (m, 1) = (m, 1)
gradient = X.T @ errors # (n, m) @ (m, 1) = (n, 1)
gradient has shape (n, 1), same as W.
Hint: Don't forget the 2/m factor
The full MSE gradient formula includes a factor of 2/m:
gradient = (2 / m) * X.T @ (X @ W - y)
If you forget this, your learning rate will be off by a constant factor. Some implementations absorb the 2 into the learning rate.
Layer 4: Training Loop
Hint: Convergence detection
Check if loss stopped decreasing:
convergence_threshold = 1e-6
prev_loss = float('inf')
for epoch in range(max_epochs):
current_loss = compute_loss(y, X @ W)
if abs(prev_loss - current_loss) < convergence_threshold:
print(f"Converged at epoch {epoch}")
break
prev_loss = current_loss
# ... gradient update
Hint: Debugging non-converging models
If loss isn’t decreasing after 10 epochs:
- Print the learning rate and gradient magnitude:
  print(f"LR: {learning_rate}, Grad norm: {np.linalg.norm(gradient)}")
- Check if features are scaled:
  print(f"X mean: {X.mean(axis=0)}, X std: {X.std(axis=0)}")  # Should be close to 0 and 1
- Try a much smaller learning rate:
  learning_rate = 0.0001
Extensions and Challenges
Extension 1: Add Polynomial Features
Linear regression assumes linear relationships. But house price might increase quadratically with size. Add polynomial features:
def add_polynomial_features(X, degree=2):
"""Add polynomial terms up to specified degree."""
from itertools import combinations_with_replacement
m, n = X.shape
features = [X]
for d in range(2, degree + 1):
for combo in combinations_with_replacement(range(n), d):
new_feature = np.ones((m, 1))
for idx in combo:
new_feature *= X[:, idx:idx+1]
features.append(new_feature)
return np.hstack(features)
# Example: X = [sqft, beds]
# After degree=2: [sqft, beds, sqft^2, sqft*beds, beds^2]
Challenge: Fit a model to non-linear data and compare polynomial degrees.
Extension 2: Implement L2 Regularization (Ridge Regression)
Regularization prevents overfitting by penalizing large weights:
Loss_regularized = MSE + lambda * ||W||^2
Gradient_regularized = gradient_MSE + 2 * lambda * W
def train_with_regularization(X, y, learning_rate=0.01, lambda_reg=0.01, max_epochs=1000):
    """Batch gradient descent with an L2 penalty (ridge regression)."""
    m, n = X.shape
    W = np.zeros((n, 1))
    for epoch in range(max_epochs):
        mse_gradient = (2 / m) * X.T @ (X @ W - y)
        reg_gradient = 2 * lambda_reg * W
        reg_gradient[0] = 0  # Don't regularize the bias term!
        W = W - learning_rate * (mse_gradient + reg_gradient)
    return W
Challenge: Experiment with different lambda_reg values. Plot training vs validation loss.
Extension 3: Compare with sklearn
Verify your implementation matches the industry standard:
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.preprocessing import StandardScaler
# Your implementation
my_model = fit(X, y)
my_predictions = predict_with_model(my_model, X_test)
# sklearn
sk_scaler = StandardScaler()
X_scaled = sk_scaler.fit_transform(X)
sk_model = LinearRegression()
sk_model.fit(X_scaled, y)
sk_predictions = sk_model.predict(sk_scaler.transform(X_test))
# Compare
print(f"Prediction difference: {np.abs(my_predictions - sk_predictions).mean()}")
# Should be < 0.01 if both are working correctly
Extension 4: Learning Rate Scheduling
Decrease the learning rate over time for finer convergence:
def learning_rate_schedule(initial_lr, epoch, decay=0.001):
return initial_lr / (1 + decay * epoch)
# In training loop:
lr = learning_rate_schedule(0.1, epoch)
W = W - lr * gradient
Extension 5: Mini-Batch Gradient Descent
Implement batching for memory efficiency and faster convergence:
def train_minibatch(X, y, learning_rate=0.01, max_epochs=100, batch_size=32):
    """Mini-batch gradient descent: shuffle each epoch, update per batch."""
    m = X.shape[0]
    W = np.zeros((X.shape[1], 1))
    n_batches = m // batch_size
    for epoch in range(max_epochs):
        # Shuffle data each epoch so batches differ between epochs
        indices = np.random.permutation(m)
        X_shuffled = X[indices]
        y_shuffled = y[indices]
        for batch in range(n_batches):
            start = batch * batch_size
            end = start + batch_size
            X_batch = X_shuffled[start:end]
            y_batch = y_shuffled[start:end]
            gradient = compute_gradient(X_batch, y_batch, W)
            W = W - learning_rate * gradient
    return W
Extension 6: Real Dataset
Apply your model to actual housing data:
- California Housing Dataset
- Boston Housing Dataset (use carefully, has ethical concerns)
- Ames Housing Dataset
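For example, the California Housing data ships with scikit-learn (a sketch; fetch_california_housing returns plain NumPy arrays that drop straight into your pipeline):
from sklearn.datasets import fetch_california_housing

X, y = fetch_california_housing(return_X_y=True)
print(X.shape)   # (20640, 8) - 8 features per district
print(y.shape)   # (20640,)   - median house value in units of $100,000

model = fit(X, y)   # your pipeline: scale, add bias column, train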
Real-World Connections
How This Relates to Neural Networks
Linear regression is a neural network with:
- No hidden layers
- No activation function (or identity activation)
- MSE loss
When you add hidden layers and non-linear activations, the same gradient descent principles apply:
Linear Regression: Input -> Output
Neural Network: Input -> Hidden -> Hidden -> Output
(ReLU) (ReLU)
Both use: Loss -> Gradient -> Update Weights
Why GPUs Power Modern AI
Matrix multiplication is embarrassingly parallel:
A @ B: Each output element is independent!
output[i,j] = sum(A[i,:] * B[:,j])
GPU with 1000 cores computes 1000 elements simultaneously.
Your vectorized code runs on CPU SIMD units (4-8 parallel ops). A GPU has thousands of cores, enabling 100-1000x further speedup.
This is why:
- X @ W in NumPy: milliseconds
- X @ W on GPU (PyTorch/TensorFlow): microseconds
- Nested Python loops: seconds
Linear Regression in Industry
Despite being “simple,” linear regression is ubiquitous:
- Finance: Stock price prediction, risk assessment
- Real Estate: Property valuation (exactly what you built!)
- Marketing: Customer lifetime value prediction
- Healthcare: Treatment effect estimation
- Tech: Ad click-through rate prediction
Why linear?
- Interpretable: “Each bedroom adds $15,000 to price”
- Fast: Milliseconds for prediction
- Robust: Fewer hyperparameters than deep learning
- Baseline: Always compare complex models against linear regression
Vectorization in Real Codebases
Every ML library uses the same patterns you learned:
PyTorch:
predictions = model(inputs) # Internally: matrix multiplications
loss = criterion(predictions, targets)
loss.backward() # Computes gradients via autograd
optimizer.step() # W = W - lr * gradient
TensorFlow:
with tf.GradientTape() as tape:
predictions = model(inputs)
loss = loss_fn(targets, predictions)
gradients = tape.gradient(loss, model.trainable_variables)
optimizer.apply_gradients(zip(gradients, model.trainable_variables))
Same concepts, more abstraction.
Books That Will Help
| Book | Author(s) | Why It’s Relevant |
|---|---|---|
| Data Science from Scratch | Joel Grus | Implements linear regression and gradient descent in Python with minimal dependencies. Chapter 14 covers exactly what this project teaches. |
| Hands-On Machine Learning | Aurelien Geron | Chapter 4 covers linear regression in depth, including the normal equation, regularization, and gradient descent variants. |
| Linear Algebra and Its Applications | Gilbert Strang | The definitive text on linear algebra. Understanding matrix operations deeply helps debug vectorized code. |
| Numerical Recipes | Press et al. | Covers computational aspects of linear regression including numerical stability and efficient algorithms. |
| An Introduction to Statistical Learning | James, Witten, Hastie, Tibshirani | Free online. Chapter 3 covers linear regression from a statistical perspective. |
| The Elements of Statistical Learning | Hastie, Tibshirani, Friedman | More advanced. Chapters 2-3 provide rigorous treatment of linear methods. |
Self-Assessment Checklist
Before moving to Project 4, verify you can:
- Explain why vectorization is faster than loops (SIMD, cache locality)
- Implement standardization and explain when to use it
- Derive the MSE gradient by hand
- Implement batch gradient descent without loops in the math
- Diagnose exploding loss (learning rate too high)
- Diagnose non-converging loss (learning rate too low, unscaled features)
- Add a bias column to a feature matrix
- Verify gradient implementation with numerical approximation
- Explain when to use the normal equation vs gradient descent
Conceptual Questions
- Why do we add a column of 1s to the feature matrix?
- What would happen if we scaled one feature but not another?
- Why might the loss oscillate instead of smoothly decreasing?
- How does the gradient tell us which direction to move?
- Why is (X.T @ X)^(-1) @ X.T @ y expensive for many features?
Timing Challenge
Run a timing experiment on your implementation:
import time
sizes = [100, 1000, 10000, 100000]
features = 10
for size in sizes:
X = np.random.randn(size, features)
W = np.random.randn(features, 1)
# Time vectorized version
start = time.time()
for _ in range(100):
result = X @ W
vectorized_time = (time.time() - start) / 100
# Time loop version
start = time.time()
for _ in range(100 if size < 10000 else 1):
result_loop = np.zeros((size, 1))
for i in range(size):
for j in range(features):
result_loop[i] += X[i, j] * W[j]
loop_time = (time.time() - start) / (100 if size < 10000 else 1)
print(f"Size {size:>6}: Vectorized {vectorized_time*1000:.3f}ms, "
f"Loop {loop_time*1000:.3f}ms, Speedup {loop_time/vectorized_time:.1f}x")
Expected output:
Size 100: Vectorized 0.002ms, Loop 0.180ms, Speedup 90.0x
Size 1000: Vectorized 0.008ms, Loop 1.820ms, Speedup 227.5x
Size 10000: Vectorized 0.070ms, Loop 18.500ms, Speedup 264.3x
Size 100000: Vectorized 0.650ms, Loop 185.000ms, Speedup 284.6x
Next: P04: The Spam Filter - Apply sigmoid activation and cross-entropy loss for classification