Project 9: Data Detective (Linear Regression From Scratch)

Build a from-first-principles linear regression workflow that fits a line to data, reports interpretable error metrics, and visualizes when the model is useful versus misleading.

Quick Reference

  • Difficulty: Level 2 (Intermediate)
  • Time Estimate: 12-18 hours
  • Main Programming Language: Python
  • Alternative Languages: JavaScript, C++
  • Knowledge Area: Statistics, algebra, model evaluation
  • Recommended Libraries/Tools: CSV reader, plotting library, CLI argument parser
  • Main Book: “An Introduction to Statistical Learning” (linear regression chapter)
  • Deliverable: CLI tool that reads dataset, fits line, reports metrics, and exports fit/residual plots

Learning Objectives

By the end of this project, you should be able to:

  1. Derive and implement slope/intercept for simple linear regression.
  2. Explain the least squares objective and why squared error is used.
  3. Interpret residuals, MSE/RMSE, and R^2 in plain language.
  4. Detect when a line is a poor model despite a seemingly good numeric metric.
  5. Build a reproducible analysis workflow from raw CSV to final report.
  6. Communicate model limitations and assumptions clearly.

All Theory Needed per Concept

Concept 1: Linear Relationship and Model Form

What you need:

  • Simple model: y_hat = b0 + b1*x
  • b1 is slope (change in predicted y per unit x).
  • b0 is intercept (predicted y when x=0).

Why it matters here:

  • Every output in your project derives from b0 and b1; if these are wrong, all metrics are meaningless.

Failure mode to watch:

  • Treating the intercept as always physically meaningful; in many datasets, x=0 lies outside the realistic range of the data.

Practical check:

  • Plot the data first. If the relationship is clearly curved, a linear fit may be structurally wrong.
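As a minimal sketch, the model form is a one-line computation; the helper name `predict` is illustrative, not part of the spec:

```python
def predict(b0, b1, xs):
    """Predicted y for each x under the simple linear model y_hat = b0 + b1*x."""
    return [b0 + b1 * x for x in xs]

# With intercept 1 and slope 2, x = 3 predicts y = 7.
print(predict(1.0, 2.0, [0.0, 3.0]))  # [1.0, 7.0]
```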

Concept 2: Least Squares Optimization

What you need:

  • Residual for point i: e_i = y_i - y_hat_i
  • Objective: minimize sum(e_i^2).
  • Closed-form slope: b1 = sum((x_i-x_bar)(y_i-y_bar)) / sum((x_i-x_bar)^2)
  • Intercept: b0 = y_bar - b1*x_bar

Why it matters here:

  • This is the mathematical engine of the project.

Failure mode to watch:

  • Division by near-zero variance in x (all x-values almost equal).

Practical check:

  • If var(x) is zero, the slope is undefined; the tool must fail with an explicit message.
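The closed-form formulas, including the zero-variance guard, translate directly into Python (a sketch; the function name is illustrative):

```python
def fit_line(xs, ys):
    """Closed-form least-squares slope and intercept for simple regression."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    num = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    den = sum((x - x_bar) ** 2 for x in xs)
    if den == 0:
        # All x-values are identical: slope is undefined, fail explicitly.
        raise ValueError("x variance is zero; slope is undefined")
    b1 = num / den
    b0 = y_bar - b1 * x_bar
    return b0, b1

# Points exactly on y = 2x + 1 recover intercept 1 and slope 2.
print(fit_line([0, 1, 2, 3], [1, 3, 5, 7]))  # (1.0, 2.0)
```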

Concept 3: Fit Quality Metrics and Residual Analysis

What you need:

  • MSE: average squared residual.
  • RMSE: square root of MSE (same units as y).
  • R^2 = 1 - SSE/SST where:
    • SSE = sum((y_i - y_hat_i)^2)
    • SST = sum((y_i - y_bar)^2)

Why it matters here:

  • Metrics summarize fit, but residual plots reveal patterns metrics can hide.

Failure mode to watch:

  • Interpreting a high R^2 as causal proof or as universal evidence of model quality.

Practical check:

  • Inspect residual plot for patterns (curvature, funnel shape, clusters).
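All three metrics fall out of one pass over the residuals; this sketch also guards the SST = 0 edge case (constant y), where R^2 is undefined:

```python
def metrics(ys, y_hats):
    """MSE, RMSE, and R^2 from observed and predicted values."""
    n = len(ys)
    residuals = [y - yh for y, yh in zip(ys, y_hats)]
    sse = sum(e ** 2 for e in residuals)
    mse = sse / n
    rmse = mse ** 0.5  # same units as y
    y_bar = sum(ys) / n
    sst = sum((y - y_bar) ** 2 for y in ys)
    # R^2 is undefined when y is constant (SST = 0); report NaN rather than crash.
    r2 = 1 - sse / sst if sst > 0 else float("nan")
    return mse, rmse, r2

# A perfect fit gives zero error and R^2 = 1.
print(metrics([1, 3, 5], [1, 3, 5]))  # (0.0, 0.0, 1.0)
```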

Concept 4: Data Quality, Outliers, and Generalization

What you need:

  • Missing values and malformed rows can silently bias the model.
  • Outliers can strongly pull the least-squares line.
  • Train-only evaluation can overstate reliability.

Why it matters here:

  • Real data is messy. A reliable regression workflow includes data checks and a validation mindset that keeps fitting separate from evaluation.

Failure mode to watch:

  • Reporting only training metrics and claiming strong predictive performance.

Practical check:

  • Compare fit metrics with and without the most extreme outlier to understand sensitivity.
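The sensitivity check can be sketched by fitting the slope with and without an injected outlier; the data values below are made up for illustration:

```python
def slope(xs, ys):
    """Least-squares slope (assumes non-constant x)."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    num = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    den = sum((x - x_bar) ** 2 for x in xs)
    return num / den

xs = [1, 2, 3, 4, 10]
ys = [2, 4, 6, 8, 40]            # the last point is an injected outlier
full = slope(xs, ys)             # fit on everything
trimmed = slope(xs[:-1], ys[:-1])  # fit without the outlier
print(full, trimmed)             # 4.4 2.0 -- one point more than doubles the slope
```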

Project Specification

Build a command-line tool named “Data Detective” with these requirements:

  • Inputs:
    • CSV file with two required columns (x, y) or configurable column names.
    • Optional flags for delimiter, header presence, output directory.
  • Processing:
    • Parse and validate dataset.
    • Compute means, slope, intercept.
    • Generate predictions and residuals.
    • Compute MSE/RMSE and R^2.
    • Create scatter + fitted line plot and residual plot.
  • Outputs:
    • Text summary (sample count, slope, intercept, metrics).
    • Saved visualization artifacts.
    • Warning messages for tiny var(x), outlier leverage, or suspicious residual patterns.

Non-negotiable constraints:

  • Must implement formulas directly (not hidden auto-fit API).
  • Must produce residual-oriented diagnostic output, not only one score.
  • Must include at least one test dataset with expected coefficients.

Solution Architecture (ASCII)

+------------------------------+
| CLI / Dataset Input          |
| csv path, columns, options   |
+---------------+--------------+
                |
                v
+------------------------------+
| Data Validation Layer        |
| parse, type checks, missing  |
+---------------+--------------+
                |
                v
+------------------------------+
| Stats Core                   |
| means, variance, covariance  |
+---------------+--------------+
                |
                v
+------------------------------+
| Regression Engine            |
| b1, b0, predictions          |
+-------+----------------------+
        |
        +----------------------------+
        |                            |
        v                            v
+-------------------------+   +---------------------------+
| Metrics Module          |   | Visualization Module      |
| SSE, MSE, RMSE, R^2     |   | scatter+line, residuals   |
+------------+------------+   +-------------+-------------+
             |                              |
             +--------------+---------------+
                            v
                   +------------------------+
                   | Report + Artifacts     |
                   +------------------------+

Implementation Guide

Phase 1: Data Contract and Parsing

  1. Define expected schema (x, y, numeric only).
  2. Fail fast on missing columns or non-numeric rows.
  3. Decide explicit policy for missing values (drop with count report or reject file).

Pseudocode:

read csv
validate required columns
coerce x,y to numeric
drop/flag invalid rows
ensure sample_count >= minimum threshold
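The pseudocode above can be sketched with Python's standard csv module; the column names and minimum-row threshold are illustrative assumptions:

```python
import csv

MIN_ROWS = 3  # illustrative threshold, tune for your tool

def load_xy(path, x_col="x", y_col="y"):
    """Parse a CSV, coerce x/y to float, and count dropped rows."""
    rows, dropped = [], 0
    with open(path, newline="") as f:
        reader = csv.DictReader(f)
        if x_col not in reader.fieldnames or y_col not in reader.fieldnames:
            raise ValueError(f"missing required columns: {x_col}, {y_col}")
        for row in reader:
            try:
                rows.append((float(row[x_col]), float(row[y_col])))
            except (TypeError, ValueError):
                dropped += 1  # flag and count, never silently skip
    if len(rows) < MIN_ROWS:
        raise ValueError(f"need at least {MIN_ROWS} valid rows, got {len(rows)}")
    return rows, dropped
```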

Phase 2: Core Regression Math

  1. Compute x_bar, y_bar.
  2. Compute numerator/denominator for slope.
  3. Compute intercept and predictions.

Pseudocode:

num = sum((x-x_bar)*(y-y_bar))
den = sum((x-x_bar)^2)
if den == 0: fail("x variance is zero")
b1 = num/den
b0 = y_bar - b1*x_bar

Phase 3: Metrics and Diagnostics

  1. Compute residuals and squared residuals.
  2. Compute SSE, MSE, RMSE, R^2.
  3. Flag suspicious diagnostics:
    • very large residual relative to others,
    • obvious trend in residuals,
    • small sample size warning.
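The flagging rules might be sketched as follows; the thresholds are illustrative assumptions, not statistical standards:

```python
def flag_residuals(residuals, z_thresh=3.0, min_n=10):
    """Return human-readable warnings for suspicious residual patterns."""
    warnings = []
    n = len(residuals)
    if n < min_n:
        warnings.append(f"small sample ({n} points); metrics are unstable")
    mean = sum(residuals) / n
    sd = (sum((e - mean) ** 2 for e in residuals) / n) ** 0.5
    # One residual far beyond the others suggests an outlier.
    if sd > 0 and max(abs(e - mean) for e in residuals) / sd > z_thresh:
        warnings.append("at least one residual is an extreme outlier")
    # Crude trend check: one half all positive, the other all negative.
    half = n // 2
    first = [e > 0 for e in residuals[:half]]
    second = [e > 0 for e in residuals[half:]]
    if half and len(set(first)) == 1 and len(set(second)) == 1 and first[0] != second[0]:
        warnings.append("residual signs are grouped; possible trend or curvature")
    return warnings
```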

Phase 4: Visualization and Reporting

  1. Plot scatter with fitted line.
  2. Plot residuals vs x (or index).
  3. Print clear textual summary with units/context notes.

Phase 5: Reproducibility

  1. Save outputs with deterministic naming.
  2. Include metadata in report (dataset name, row count, timestamp or run id).
  3. Keep command examples in README for reruns.
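Deterministic naming can be derived by hashing the dataset path and run parameters; this sketch uses hashlib, and the field names are assumptions:

```python
import hashlib
import json

def run_id(dataset_path, params):
    """Deterministic short id from the dataset path and run parameters."""
    payload = json.dumps({"data": dataset_path, "params": params}, sort_keys=True)
    return hashlib.sha256(payload.encode()).hexdigest()[:12]

# Hypothetical metadata block for the report.
meta = {
    "dataset": "study_hours.csv",
    "rows": 120,
    "run_id": run_id("study_hours.csv", {"x": "hours", "y": "score"}),
}
print(json.dumps(meta, sort_keys=True))
```

Hashing instead of a timestamp makes reruns on the same inputs land in the same artifact names, which is what reproducibility needs.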

Testing Strategy

Deterministic Math Tests

  1. Perfect line dataset:
    • Example pattern: points exactly on y=2x+1
    • Expected: slope 2, intercept 1, R^2=1, near-zero residuals.
  2. Flat response dataset:
    • Constant y values.
    • Expected slope 0; note that R^2 is undefined here because SST = 0, so interpret it carefully.
  3. Shift invariance test:
    • Add constant to all y values.
    • Expected slope unchanged, intercept shifted by that constant.
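The three cases above can be written as plain assertions against a `fit_line(xs, ys) -> (b0, b1)` helper; the minimal helper shown here is a sketch, so adapt the name to your implementation:

```python
import math

def fit_line(xs, ys):
    """Minimal closed-form fit (assumes non-constant x)."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    den = sum((x - x_bar) ** 2 for x in xs)
    b1 = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / den
    return y_bar - b1 * x_bar, b1

# 1. Perfect line: points exactly on y = 2x + 1.
b0, b1 = fit_line([0, 1, 2, 3], [1, 3, 5, 7])
assert math.isclose(b1, 2.0) and math.isclose(b0, 1.0)

# 2. Flat response: constant y gives slope 0.
_, b1 = fit_line([1, 2, 3, 4], [5, 5, 5, 5])
assert math.isclose(b1, 0.0, abs_tol=1e-12)

# 3. Shift invariance: adding 10 to every y shifts only the intercept.
b0a, b1a = fit_line([0, 1, 2], [1, 3, 5])
b0b, b1b = fit_line([0, 1, 2], [11, 13, 15])
assert math.isclose(b1a, b1b) and math.isclose(b0b - b0a, 10.0)
```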

Robustness Tests

  1. Malformed row handling (missing numeric values) is explicit and counted.
  2. Near-zero x-variance dataset triggers clear error/warning.
  3. Outlier injection test demonstrates coefficient sensitivity.

Diagnostics Tests

  1. Residual sum close to zero (with intercept model).
  2. Residual plot generated and labeled.
  3. Text report includes all required metrics and row counts.

Common Pitfalls

  1. Confusing correlation with causation:
    • Cause: treating linear association as proof of mechanism.
    • Fix: document that regression here is descriptive/predictive, not causal proof.
  2. Blind trust in R^2:
    • Cause: single-metric thinking.
    • Fix: always inspect residual plot and domain plausibility.
  3. Hidden data leakage:
    • Cause: preprocessing that uses future or test-set information in extended setups.
    • Fix: keep clear separation if you add train/test extension.
  4. Ignoring scale and units:
    • Cause: reporting coefficients without context.
    • Fix: annotate what one unit change in x means physically.

Extensions

  1. Add train/test split and report out-of-sample error.
  2. Add robust regression comparison to reduce outlier sensitivity.
  3. Add polynomial features to compare linear vs curved fits.
  4. Add confidence intervals for coefficients.
  5. Generalize to multiple regression with matrix form.

Real-World Connections

  1. Education analytics: studying the relation between study hours and exam scores.
  2. Business forecasting: basic demand vs price sensitivity estimate.
  3. Engineering calibration: mapping sensor input to measured output.
  4. Public policy: trend analysis on historical indicators (with caution about causality).

Resources

  • James, Witten, Hastie, Tibshirani, An Introduction to Statistical Learning (linear regression chapter).
  • Freedman, Pisani, Purves, Statistics (correlation/regression intuition).
  • Montgomery, Peck, Vining, Introduction to Linear Regression Analysis.
  • Khan Academy / MIT OCW linear regression lectures for visual intuition.

Self-Assessment

  1. Why does least squares use squared residuals instead of raw residuals?
  2. What does a slope of -0.8 mean in plain language?
  3. Can you have a high R^2 and still have a bad model? Give a case.
  4. What does a curved residual pattern tell you?
  5. What should your tool do when all x values are identical?

Submission Criteria

A submission is complete only if all items below are satisfied:

  • Tool ingests a CSV and validates numeric x,y columns.
  • Slope/intercept are computed from explicit formulas (not black-box fitting).
  • Report includes at least MSE, RMSE, and R^2.
  • Produces both scatter+fit plot and residual plot.
  • Includes at least 3 deterministic tests with expected outcomes.
  • Documents assumptions, limitations, and outlier sensitivity.
  • Includes one reproducible sample run log and generated artifacts.