Project 17: Linear Regression from Scratch
Linear regression implemented two ways: (1) analytically using the normal equation, and (2) iteratively using gradient descent. Compare their performance and understand when to use each.
Quick Reference
| Attribute | Value |
|---|---|
| Difficulty | Level 2: Intermediate (The Developer) |
| Main Programming Language | Python |
| Alternative Programming Languages | C, Julia, Rust |
| Coolness Level | Level 3: Genuinely Clever |
| Business Potential | 1. The “Resume Gold” (Educational/Personal Brand) |
| Knowledge Area | Regression / Optimization |
| Software or Tool | Linear Regression |
| Main Book | “Hands-On Machine Learning” by Aurélien Géron |
1. Learning Objectives
By completing this project, you will:
- Translate math definitions into deterministic implementation steps.
- Build validation checks that make correctness observable.
- Diagnose numerical, logical, and data-shape failures early.
- Explain tradeoffs in interviews using evidence from your own build.
2. All Theory Needed (Per-Concept Breakdown)
This project applies the following theory clusters:
- Symbolic-to-numeric translation (expressions, data shapes, invariants)
- Stability constraints (precision, scaling, stopping criteria)
- Optimization or inference logic (depending on project objective)
- Evaluation discipline (error analysis, test coverage, reproducibility)
Concept A: Mathematical Representation Discipline
Fundamentals
A math expression is not executable until you define representation, ordering, and domain constraints. The same equation can be represented as a token stream, tree, matrix pipeline, or probability graph. Choosing representation determines what bugs you can catch early.
Deep Dive into the concept
Most project failures begin before algorithm selection: they start with ambiguous representation. If your parser cannot distinguish unary minus from subtraction, your calculator fails. If your matrix dimensions are implicit rather than validated, your linear algebra pipeline fails silently. If your probabilistic assumptions (independence, stationarity, or class priors) are not explicit, your inference can look accurate on one split and collapse on another. The core implementation move is to treat representation as a contract. Define each object with shape, domain, and semantic intent. Then enforce invariants at boundaries: input parser, preprocessing, training loop, evaluation stage. This makes debugging local instead of global.
How this fits this project
You will encode each operation with explicit contracts and invariant checks.
Definitions & key terms
- Invariant: Property that must hold before and after each operation.
- Shape contract: Expected dimensional structure of vectors/matrices/tensors.
- Domain constraint: Allowed value range (for example log input > 0).
Mental model diagram
User Input -> Representation Layer (tokens/shapes) -> Validated Operation (invariants pass) -> Observable Output (tests/plots/logs)
How it works
- Parse/ingest data into typed structures.
- Validate shape/domain invariants.
- Execute operation.
- Compare observed output with expected behavior.
- Record failure signature if mismatch appears.
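The steps above can be sketched as a boundary-validation helper. This is a minimal sketch: the function name `validate_design_matrix` and the specific checks are illustrative assumptions, not part of the project spec.

```python
import numpy as np

def validate_design_matrix(X, y):
    """Enforce shape and domain invariants at the pipeline boundary.
    Illustrative sketch: the name and the exact checks are assumptions."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    if X.ndim != 2:
        raise ValueError(f"shape contract: X must be 2-D, got {X.ndim}-D")
    if y.ndim != 1 or y.shape[0] != X.shape[0]:
        raise ValueError("shape contract: y must be 1-D with one target per row of X")
    if not (np.isfinite(X).all() and np.isfinite(y).all()):
        raise ValueError("domain constraint: inputs contain NaN or inf")
    return X, y

# Valid input passes through as typed arrays; invalid input fails loudly.
X, y = validate_design_matrix([[1.0], [2.0], [3.0]], [2.0, 4.0, 5.0])
```

Because the check runs at the boundary, a shape mismatch is reported before it can propagate into the training loop.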
Minimal concrete example
PSEUDOCODE
read expression
tokenize with precedence rules
if token sequence invalid -> return syntax error
evaluate tree
if domain violation -> return bounded diagnostic
print value and confidence check
Common misconceptions
- “If it runs once, representation is correct.” -> false.
- “Type checks are enough without shape checks.” -> false.
Check-your-understanding questions
- Which invariant catches division-by-zero earliest?
- Why does shape validation belong at boundaries rather than only in core logic?
- Predict failure if tokenization ignores unary minus.
Check-your-understanding answers
- Domain check on denominator before operation execution.
- Boundary validation keeps errors local and diagnostic.
- Expressions like -2^2 get misinterpreted and produce wrong precedence behavior.
Real-world applications
Feature preprocessing, model-serving input validation, and experiment-tracking schema enforcement.
Where you’ll apply it
This project and every downstream project in the sprint.
References
- CSAPP (Bryant & O’Hallaron), floating-point chapter
- Math for Programmers (Paul Orland), representation-oriented chapters
Key insight
Correct representation reduces the complexity of every later decision.
Summary
Stable ML math implementations start with explicit contracts, not implicit assumptions.
Homework/Exercises
- Write five invariants for your project.
- Build a failing test input for each invariant.
Solutions
- Include at least one shape, one domain, one convergence, one reproducibility, and one output-range invariant.
- Each failing input should trigger exactly one diagnostic to keep root-cause analysis clean.
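One possible set of the five suggested invariants, sketched as assertions. The helper name and the tolerance are assumptions made for illustration.

```python
import numpy as np

def check_invariants(X_aug, y, w, losses, seed):
    """Sketch of five invariants: shape, domain, convergence,
    reproducibility, and output range (names and thresholds assumed)."""
    # Shape invariant: one weight per column of the augmented design matrix.
    assert w.shape == (X_aug.shape[1],), "shape invariant failed"
    # Domain invariant: no NaN or inf anywhere in the data.
    assert np.isfinite(X_aug).all() and np.isfinite(y).all(), "domain invariant failed"
    # Convergence invariant: training loss never increases.
    assert all(b <= a + 1e-9 for a, b in zip(losses, losses[1:])), "convergence invariant failed"
    # Reproducibility invariant: every run records its seed.
    assert seed is not None, "reproducibility invariant failed"
    # Output-range invariant: final loss is a finite, non-negative scalar.
    assert np.isfinite(losses[-1]) and losses[-1] >= 0, "output-range invariant failed"
```

Each assertion carries its own message, so a failing input triggers exactly one diagnostic, as the solution note requires.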
3. Build Blueprint
- Scope the smallest end-to-end slice that produces visible output.
- Add deterministic tests and edge-case probes.
- Layer complexity only after baseline behavior is stable.
- Add metrics logging before optimization.
- Run failure drills: perturb inputs, scale values, and check stability.
4. Real-World Outcome (Target)
$ python linear_regression.py housing.csv --target=price
Loading data: 500 samples, 5 features
Method 1: Normal Equation (analytical)
Computation time: 0.003s
Weights: [intercept=5.2, sqft=0.0012, bedrooms=2.3, ...]
Method 2: Gradient Descent (iterative)
Learning rate: 0.01
Iterations: 1000
Computation time: 0.15s
Final loss: 0.0234
Weights: [intercept=5.1, sqft=0.0012, bedrooms=2.4, ...]
[Plot: Gradient descent loss decreasing over iterations]
[Plot: Predicted vs actual prices scatter plot]
Test set performance:
R² Score: 0.87
RMSE: $45,230
$ python linear_regression.py --predict "sqft=2000, bedrooms=3, ..."
Predicted price: $425,000
Implementation Hints
Normal equation:
# X is (n_samples, n_features+1) with column of 1s for intercept
# y is (n_samples,)
w = np.linalg.inv(X.T @ X) @ X.T @ y
Gradient descent:
w = np.zeros(n_features + 1)
for _ in range(iterations):
    predictions = X @ w
    error = predictions - y
    gradient = (2/n_samples) * X.T @ error
    w = w - learning_rate * gradient
Feature scaling (important for gradient descent!):
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)
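Stitching the hints together, a runnable end-to-end sketch on synthetic data. The data, weights, and variable names are illustrative; `np.linalg.solve` is used instead of an explicit inverse for stability.

```python
import numpy as np

# Synthetic regression problem with known weights (intercept first).
rng = np.random.default_rng(0)
n_samples, n_features = 200, 3
X_raw = rng.normal(size=(n_samples, n_features))
true_w = np.array([5.0, 1.5, -2.0, 0.7])
y = true_w[0] + X_raw @ true_w[1:] + rng.normal(scale=0.01, size=n_samples)

# Bias column of ones.
X = np.column_stack([np.ones(n_samples), X_raw])

# Method 1: normal equation (solve the system rather than invert).
w_ne = np.linalg.solve(X.T @ X, X.T @ y)

# Method 2: batch gradient descent.
w_gd = np.zeros(n_features + 1)
learning_rate, iterations = 0.1, 2000
for _ in range(iterations):
    gradient = (2 / n_samples) * X.T @ (X @ w_gd - y)
    w_gd -= learning_rate * gradient
```

With standard-normal features (already effectively scaled), both methods land on the same weights, which is the first learning milestone below.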
Learning milestones:
- Both methods give same answer → You understand they solve the same problem
- Gradient descent needs feature scaling → You understand optimization dynamics
- You know when to use each → Normal equation for small data, GD for large
5. Core Design Notes from Main Guide
Core Question
“When can we solve a problem exactly, and when must we iterate toward the answer?”
Linear regression presents you with a fundamental dichotomy in optimization. The normal equation gives you the exact answer in one matrix computation: no iteration, no learning rate, no convergence worries. But it requires inverting a matrix, which becomes prohibitively expensive for large feature counts. Gradient descent trades exactness for scalability: you get arbitrarily close to the answer through iteration, and it works even when you have millions of features. This tension between closed-form solutions and iterative methods pervades all of machine learning. Understanding when each approach applies, and why, is essential.
Concepts You Must Understand First
Stop and research these before coding:
- Mean Squared Error (MSE) Loss
  - Why do we square the errors instead of using absolute values?
  - What is the geometric interpretation of MSE?
  - How does MSE relate to the assumption that errors are normally distributed?
  - Book Reference: “Hands-On Machine Learning” Chapter 4 - Aurélien Géron
- The Normal Equation Derivation
  - What does it mean to “set the gradient to zero”?
  - Why does the solution involve (X^T X)^(-1)?
  - When is X^T X invertible? When is it not?
  - Book Reference: “Machine Learning” (Coursera) Week 2 - Andrew Ng
- Matrix Inversion Complexity
  - What is the time complexity of inverting an n x n matrix?
  - Why does this become prohibitive for large feature counts?
  - What is the Moore-Penrose pseudoinverse and when do you need it?
  - Book Reference: “Linear Algebra Done Right” Chapter 3 - Sheldon Axler
- Gradient Descent Mechanics
  - Why is the gradient of MSE equal to 2X^T(Xw - y)/n?
  - What does the learning rate control, geometrically?
  - Why does feature scaling help gradient descent converge faster?
  - Book Reference: “Deep Learning” Chapter 4 - Goodfellow et al.
- R-squared Score
  - What does R^2 = 0.87 actually mean in terms of explained variance?
  - How is R^2 related to MSE?
  - Why can R^2 be negative on test data?
  - Book Reference: “Data Science for Business” Chapter 4 - Provost & Fawcett
- The Bias Term (Intercept)
  - Why do we add a column of ones to X?
  - What happens if you forget the bias term?
  - How does centering data relate to the intercept?
  - Book Reference: “Hands-On Machine Learning” Chapter 4 - Aurélien Géron
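The bias-term questions can be answered with a three-point sketch: dropping the bias column forces the fit through the origin, so the slope absorbs part of the intercept. The data values here are toy assumptions.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = 5.0 + 2.0 * x                          # true intercept 5, true slope 2

# With the bias column: intercept and slope are recovered exactly.
X_aug = np.column_stack([np.ones(3), x])
w = np.linalg.solve(X_aug.T @ X_aug, X_aug.T @ y)

# Without it: the line must pass through the origin, and the
# least-squares slope becomes 58/14, not 2.
slope_no_bias = (x @ y) / (x @ x)
```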
Questions to Guide Your Design
Before implementing, think through these:
- Data augmentation for bias: How do you add the column of ones to X to handle the intercept term? Do you add it as the first or last column?
- Feature scaling strategy: Will you standardize (mean=0, std=1) or normalize (min=0, max=1)? Why does gradient descent care but normal equation does not?
- Normal equation numerical stability: What happens when X^T X is nearly singular? How do you detect and handle this?
- Gradient descent stopping criteria: When do you stop iterating? Fixed epochs? Convergence threshold? Both?
- Batch vs stochastic vs mini-batch: Will you compute the gradient on all data (batch) or subsets? How does this affect convergence?
- Comparison methodology: How will you verify that both methods give the same answer? What tolerance is acceptable?
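For the comparison question, a minimal sketch of a tolerance check; the `weights_agree` name and the default tolerances are assumptions, not requirements.

```python
import numpy as np

def weights_agree(w_a, w_b, rtol=1e-3, atol=1e-6):
    """True if two weight vectors match elementwise within mixed tolerance."""
    return np.allclose(w_a, w_b, rtol=rtol, atol=atol)
```

A relative tolerance matters because weights can differ by orders of magnitude (compare the intercept and sqft weights in the target output above); an absolute tolerance alone would be too strict for one and too loose for the other.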
Thinking Exercise
Work through this 2D example by hand:
Data points:
X = [[1], [2], [3]] (one feature)
y = [2, 4, 5] (targets)
Step 1: Add bias column
X_aug = [[1, 1], [1, 2], [1, 3]] (column of ones added)
Step 2: Normal equation
X^T X = ?
(X^T X)^(-1) = ?
X^T y = ?
w = (X^T X)^(-1) X^T y = ?
Step 3: Verify with gradient descent
Start with w = [0, 0], learning_rate = 0.1
Iteration 1:
predictions = X_aug @ w = ?
errors = predictions - y = ?
gradient = (2/3) * X_aug^T @ errors = ?
w_new = w - 0.1 * gradient = ?
After many iterations, w should converge to the same answer as the normal equation.
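To check the hand computation, the same example can be run numerically; the commented values are the expected intermediate results.

```python
import numpy as np

X_aug = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([2.0, 4.0, 5.0])

# Step 2: normal equation.
XtX = X_aug.T @ X_aug                # [[3, 6], [6, 14]]
Xty = X_aug.T @ y                    # [11, 25]
w_exact = np.linalg.solve(XtX, Xty)  # [2/3, 3/2]

# Step 3: gradient descent from w = [0, 0] with learning_rate = 0.1.
w = np.zeros(2)
for _ in range(5000):
    w -= 0.1 * (2 / 3) * X_aug.T @ (X_aug @ w - y)
```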
Interview Questions
- “Derive the normal equation for linear regression.”
  - Expected answer: Start with the MSE loss, take the gradient with respect to w, set it to zero, and solve for w to get w = (X^T X)^(-1) X^T y.
- “When would you choose gradient descent over the normal equation?”
  - Expected answer: Large number of features (n > 10,000), need for online learning, memory constraints, or when X^T X is ill-conditioned.
- “What is the computational complexity of both approaches?”
  - Expected answer: The normal equation is O(n^3) for the matrix inversion (plus O(m * n^2) to form X^T X), where n is the number of features and m the number of samples. Gradient descent is O(m * n * k), where k is the number of iterations.
- “Why must you scale features for gradient descent but not for the normal equation?”
  - Expected answer: Unscaled features create elongated contours in the loss surface, causing gradient descent to oscillate. The normal equation solves the system algebraically regardless of scale.
- “What is the difference between R^2 on training data vs test data?”
  - Expected answer: Training R^2 can only increase with more features (overfitting). Test R^2 measures generalization and can decrease or go negative if the model is overfit.
- “How would you extend this to polynomial regression?”
  - Expected answer: Create polynomial features (x^2, x^3, x1*x2, etc.) and apply the same linear regression. The model is still “linear in parameters.”
- “What is regularization and why might you need it?”
  - Expected answer: Adding a penalty term (L1 or L2) to the loss prevents large weights and reduces overfitting. L2 regularization has a closed-form solution: w = (X^T X + lambda*I)^(-1) X^T y.
Hints in Layers (Treat as pseudocode guidance)
Hint 1: Start simple. Implement the normal equation first since it is just matrix operations:
X_aug = np.column_stack([np.ones(len(X)), X])
w = np.linalg.inv(X_aug.T @ X_aug) @ X_aug.T @ y
Hint 2: For gradient descent, the key formula is:
gradient = (2/n_samples) * X_aug.T @ (X_aug @ w - y)
w = w - learning_rate * gradient
Hint 3: Feature scaling for gradient descent:
X_mean, X_std = X.mean(axis=0), X.std(axis=0)
X_scaled = (X - X_mean) / X_std
# Remember to transform test data with training statistics!
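To see why Hint 3 matters, a toy divergence check: with one feature three orders of magnitude larger than the other, a step size that converges on scaled data makes the loss blow up on raw data. The feature scales, learning rate, and iteration count are assumed toy values.

```python
import numpy as np

rng = np.random.default_rng(1)
# Second feature is ~1000x larger than the first.
X = np.column_stack([np.ones(100), rng.normal(size=100), 1000 * rng.normal(size=100)])
y = X @ np.array([1.0, 2.0, 3.0])

def gd_final_loss(X, y, lr=0.01, iters=500):
    w = np.zeros(X.shape[1])
    for _ in range(iters):
        w -= lr * (2 / len(y)) * X.T @ (X @ w - y)
    return float(np.mean((X @ w - y) ** 2))

loss_raw = gd_final_loss(X, y)  # blows up: step size too large for the big feature

X_scaled = X.copy()
X_scaled[:, 1:] = (X[:, 1:] - X[:, 1:].mean(axis=0)) / X[:, 1:].std(axis=0)
loss_scaled = gd_final_loss(X_scaled, y)  # converges with the same step size
```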
Hint 4: For numerical stability with the normal equation, use np.linalg.pinv (pseudoinverse) or add a small ridge term (note the identity must match the width of X, including the bias column):
w = np.linalg.inv(X.T @ X + 1e-8 * np.eye(X.shape[1])) @ X.T @ y
Hint 5: To compute R^2:
ss_res = np.sum((y - predictions)**2)
ss_tot = np.sum((y - y.mean())**2)
r2 = 1 - ss_res / ss_tot
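The same formula makes it easy to see when R^2 goes negative: whenever the model is worse than predicting the mean, ss_res exceeds ss_tot. A tiny worked case with assumed toy values:

```python
import numpy as np

y = np.array([1.0, 2.0, 3.0])
pred = np.array([5.0, 5.0, 5.0])      # worse than the mean baseline (2.0)

ss_res = np.sum((y - pred) ** 2)      # 16 + 9 + 4 = 29
ss_tot = np.sum((y - y.mean()) ** 2)  # 1 + 0 + 1 = 2
r2 = 1 - ss_res / ss_tot              # 1 - 14.5 = -13.5
```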
Books That Will Help
| Topic | Book | Chapter |
|---|---|---|
| Linear Regression Theory | “Hands-On Machine Learning” by Aurélien Géron | Chapter 4: Training Models |
| Normal Equation Derivation | “Machine Learning” (Coursera) by Andrew Ng | Week 2: Linear Regression |
| Gradient Descent Intuition | “Deep Learning” by Goodfellow et al. | Chapter 4: Numerical Computation |
| Matrix Calculus | “Math for Programmers” by Paul Orland | Chapter 12: Multivariable Calculus |
| Feature Scaling | “Data Science for Business” by Provost & Fawcett | Chapter 4: Modeling |
| Regularization | “The Elements of Statistical Learning” by Hastie et al. | Chapter 3: Linear Regression |
6. Validation, Pitfalls, and Completion
Common Pitfalls and Debugging
Problem 1: “Outputs drift after a few iterations”
- Why: Hidden numerical instability (unscaled features, aggressive step size, or repeated subtraction of nearly equal values).
- Fix: Normalize inputs, reduce step size, and track relative error rather than only absolute error.
- Quick test: Run the same task with two scales of input (for example x and 10x) and compare normalized error curves.
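The quick test above can be sketched as follows; the helper name, data, and learning rate are illustrative assumptions. Standardizing inside the fit makes the x and 10x runs numerically equivalent, so their error curves should match.

```python
import numpy as np

def gd_fit_predict(feature, y, lr=0.1, iters=2000):
    """Standardize one feature, fit by batch gradient descent, return predictions."""
    z = (feature - feature.mean()) / feature.std()
    Xa = np.column_stack([np.ones(len(y)), z])
    w = np.zeros(2)
    for _ in range(iters):
        w -= lr * (2 / len(y)) * Xa.T @ (Xa @ w - y)
    return Xa @ w

rng = np.random.default_rng(0)
x = rng.normal(size=50)
y = 1.0 + 2.0 * x + rng.normal(scale=0.1, size=50)

# Same task at two input scales: after standardization the errors should match.
err_1x = np.mean((gd_fit_predict(x, y) - y) ** 2)
err_10x = np.mean((gd_fit_predict(10 * x, y) - y) ** 2)
```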
Problem 2: “Results are inconsistent across runs”
- Why: Random seeds, data split randomness, or non-deterministic ordering are uncontrolled.
- Fix: Set seeds, log configuration, and store split indices and hyperparameters with each run.
- Quick test: Re-run three times with the same seed and confirm metrics remain inside a tight tolerance band.
Problem 3: “The project works on the demo case but fails on edge cases”
- Why: Tests only cover happy-path inputs.
- Fix: Add adversarial inputs (empty values, extreme ranges, near-singular matrices, rare classes).
- Quick test: Build an edge-case test matrix and ensure every scenario reports expected behavior.
Definition of Done
- Core functionality works on reference inputs
- Edge cases are tested and documented
- Results are reproducible (seeded and versioned configuration)
- Performance or convergence behavior is measured and explained
- A short retrospective explains what failed first and how you fixed it
7. Extension Ideas
- Add a stress-test mode with adversarial inputs.
- Add a short benchmark report (runtime + memory + error trend).
- Add a reproducibility bundle (seed, config, and fixed test corpus).
8. Why This Project Matters
This project is valuable because it creates observable evidence of mathematical reasoning under real implementation constraints.