Project 9: Data Detective (Linear Regression From Scratch)
Build a from-first-principles linear regression workflow that fits a line to data, reports interpretable error metrics, and visualizes when the model is useful versus misleading.
Quick Reference
| Attribute | Value |
|---|---|
| Difficulty | Level 2 (Intermediate) |
| Time Estimate | 12-18 hours |
| Main Programming Language | Python |
| Alternative Languages | JavaScript, C++ |
| Knowledge Area | Statistics, algebra, model evaluation |
| Recommended Libraries/Tools | CSV reader, plotting library, CLI argument parser |
| Main Book | “An Introduction to Statistical Learning” (linear regression chapter) |
| Deliverable | CLI tool that reads dataset, fits line, reports metrics, and exports fit/residual plots |
Learning Objectives
By the end of this project, you should be able to:
- Derive and implement slope/intercept for simple linear regression.
- Explain least squares objective and why squared error is used.
- Interpret residuals, MSE/RMSE, and `R^2` in plain language.
- Detect when a line is a poor model despite a favorable numeric metric.
- Build a reproducible analysis workflow from raw CSV to final report.
- Communicate model limitations and assumptions clearly.
All Theory Needed per Concept
Concept 1: Linear Relationship and Model Form
What you need:
- Simple model: `y_hat = b0 + b1*x`
- `b1` is the slope (change in predicted `y` per unit of `x`).
- `b0` is the intercept (predicted `y` when `x = 0`).
Why it matters here:
- Every output in your project derives from `b0` and `b1`; if these are wrong, all metrics are meaningless.
Failure mode to watch:
- Treating the intercept as always physically meaningful. In many datasets, `x = 0` is outside the realistic range.
Practical check:
- Plot data first. If relationship is clearly curved, linear fit may be structurally wrong.
Concept 2: Least Squares Optimization
What you need:
- Residual for point `i`: `e_i = y_i - y_hat_i`
- Objective: minimize `sum(e_i^2)`.
- Closed-form slope: `b1 = sum((x_i - x_bar)*(y_i - y_bar)) / sum((x_i - x_bar)^2)`
- Intercept: `b0 = y_bar - b1*x_bar`
Why it matters here:
- This is the mathematical engine of the project.
Failure mode to watch:
- Division by near-zero variance in `x` (all x-values almost equal).
Practical check:
- If `var(x)` is zero, the slope is undefined; the tool must fail with an explicit message.
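The closed-form formulas above translate directly into plain Python. The sketch below is a minimal illustration (the function and variable names are mine, not prescribed by the spec):

```python
def fit_line(xs, ys):
    """Least-squares slope/intercept via the closed-form formulas."""
    n = len(xs)
    if n < 2:
        raise ValueError("need at least two points")
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    num = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    den = sum((x - x_bar) ** 2 for x in xs)
    if den == 0:
        raise ValueError("x variance is zero; slope is undefined")
    b1 = num / den
    b0 = y_bar - b1 * x_bar
    return b0, b1

# Points exactly on y = 2x + 1 recover slope 2 and intercept 1:
b0, b1 = fit_line([0, 1, 2, 3], [1, 3, 5, 7])
```

Note that the zero-variance check is performed before dividing, which is exactly the failure mode this concept warns about.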
Concept 3: Fit Quality Metrics and Residual Analysis
What you need:
- MSE: average squared residual.
- RMSE: square root of MSE (same units as `y`).
- `R^2 = 1 - SSE/SST`, where:
  - `SSE = sum((y_i - y_hat_i)^2)`
  - `SST = sum((y_i - y_bar)^2)`
Why it matters here:
- Metrics summarize fit, but residual plots reveal patterns metrics can hide.
Failure mode to watch:
- High `R^2` interpreted as causal proof or as universal model quality.
Practical check:
- Inspect residual plot for patterns (curvature, funnel shape, clusters).
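These metrics are direct translations of the formulas above. A sketch (the function name and return format are illustrative choices):

```python
import math

def regression_metrics(ys, y_hats):
    """Compute MSE, RMSE, and R^2 from observed and predicted values."""
    n = len(ys)
    residuals = [y - yh for y, yh in zip(ys, y_hats)]
    sse = sum(e ** 2 for e in residuals)       # sum of squared errors
    y_bar = sum(ys) / n
    sst = sum((y - y_bar) ** 2 for y in ys)    # total sum of squares
    mse = sse / n
    rmse = math.sqrt(mse)
    # R^2 is undefined when y is constant (SST == 0); report NaN rather than crash.
    r2 = 1 - sse / sst if sst > 0 else float("nan")
    return {"mse": mse, "rmse": rmse, "r2": r2}
```

The `sst > 0` guard matters for the flat-response test dataset described later: a constant `y` makes `R^2` a 0/0 expression, so the tool should report that explicitly instead of dividing by zero.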
Concept 4: Data Quality, Outliers, and Generalization
What you need:
- Missing values and malformed rows can silently bias the model.
- Outliers can strongly pull the least-squares line.
- Evaluating only on training data can overstate reliability.
Why it matters here:
- Real data is messy. A reliable regression workflow includes data checks and a validation mindset separate from fitting.
Failure mode to watch:
- Reporting only training metrics and claiming strong predictive performance.
Practical check:
- Compare fit metrics with and without top outlier to understand sensitivity.
Project Specification
Build a command-line tool named “Data Detective” with these requirements:
- Inputs:
- CSV file with two required columns (`x`, `y`) or configurable column names.
- Optional flags for delimiter, header presence, and output directory.
- Processing:
- Parse and validate dataset.
- Compute means, slope, intercept.
- Generate predictions and residuals.
- Compute MSE/RMSE and `R^2`.
- Create a scatter + fitted-line plot and a residual plot.
- Outputs:
- Text summary (sample count, slope, intercept, metrics).
- Saved visualization artifacts.
- Warning messages for tiny `var(x)`, outlier leverage, or suspicious residual patterns.
Non-negotiable constraints:
- Must implement the formulas directly (not a hidden auto-fit API).
- Must produce residual-oriented diagnostic output, not only one score.
- Must include at least one test dataset with expected coefficients.
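One possible CLI surface for these inputs, using `argparse` from the standard library. The flag names here are illustrative assumptions, not part of the spec:

```python
import argparse

def build_parser():
    """CLI skeleton covering the required inputs and optional flags."""
    p = argparse.ArgumentParser(
        prog="data-detective",
        description="Fit a line to CSV data and report fit diagnostics.")
    p.add_argument("csv_path", help="path to the input CSV file")
    p.add_argument("--x-col", default="x", help="name of the x column")
    p.add_argument("--y-col", default="y", help="name of the y column")
    p.add_argument("--delimiter", default=",", help="CSV delimiter character")
    p.add_argument("--no-header", action="store_true",
                   help="treat the first row as data, not column names")
    p.add_argument("--out-dir", default=".",
                   help="directory for plots and the text report")
    return p

args = build_parser().parse_args(["data.csv", "--x-col", "hours"])
```

Keeping parser construction in its own function makes the CLI easy to unit-test by passing an argument list directly to `parse_args`.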
Solution Architecture (ASCII)
+------------------------------+
| CLI / Dataset Input |
| csv path, columns, options |
+---------------+--------------+
|
v
+------------------------------+
| Data Validation Layer |
| parse, type checks, missing |
+---------------+--------------+
|
v
+------------------------------+
| Stats Core |
| means, variance, covariance |
+---------------+--------------+
|
v
+------------------------------+
| Regression Engine |
| b1, b0, predictions |
+-------+----------------------+
|
+----------------------------+
| |
v v
+-------------------------+ +---------------------------+
| Metrics Module | | Visualization Module |
| SSE, MSE, RMSE, R^2 | | scatter+line, residuals |
+------------+------------+ +-------------+-------------+
| |
+--------------+---------------+
v
+------------------------+
| Report + Artifacts |
+------------------------+
Implementation Guide
Phase 1: Data Contract and Parsing
- Define the expected schema (`x`, `y`, numeric only).
- Fail fast on missing columns or non-numeric rows.
- Decide on an explicit policy for missing values (drop with a count report, or reject the file).
Pseudocode:
```
read csv
validate required columns
coerce x, y to numeric
drop/flag invalid rows
ensure sample_count >= minimum threshold
```
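The parsing steps above might look like this with the standard-library `csv` module, under the "drop invalid rows and count them" policy (function name, column defaults, and the minimum-row threshold are illustrative):

```python
import csv

def load_xy(path, x_col="x", y_col="y", min_rows=3):
    """Parse a CSV, coerce x/y to float, drop invalid rows, and count drops."""
    xs, ys, dropped = [], [], 0
    with open(path, newline="") as f:
        reader = csv.DictReader(f)
        missing = {x_col, y_col} - set(reader.fieldnames or [])
        if missing:
            raise ValueError(f"missing required columns: {sorted(missing)}")
        for row in reader:
            # Coerce both values before appending so the lists never desync.
            try:
                x_val = float(row[x_col])
                y_val = float(row[y_col])
            except (TypeError, ValueError):
                dropped += 1
                continue
            xs.append(x_val)
            ys.append(y_val)
    if len(xs) < min_rows:
        raise ValueError(f"only {len(xs)} valid rows; need at least {min_rows}")
    return xs, ys, dropped
```

Returning the dropped-row count lets the report state explicitly how much data was discarded, which is part of the data contract.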
Phase 2: Core Regression Math
- Compute `x_bar` and `y_bar`.
- Compute the numerator and denominator for the slope.
- Compute the intercept and predictions.
Pseudocode:
```
num = sum((x - x_bar) * (y - y_bar))
den = sum((x - x_bar)^2)
if den == 0: fail("x variance is zero")
b1 = num / den
b0 = y_bar - b1 * x_bar
```
Phase 3: Metrics and Diagnostics
- Compute residuals and squared residuals.
- Compute `SSE`, `MSE`, `RMSE`, and `R^2`.
- Flag suspicious diagnostics:
- very large residual relative to others,
- obvious trend in residuals,
- small sample size warning.
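The flags above can start as simple heuristics. In this sketch the thresholds (minimum sample size, outlier factor) are arbitrary illustrations, not established standards:

```python
import statistics

def diagnostic_warnings(residuals, n_min=10, outlier_factor=3.0):
    """Return human-readable warnings for suspicious diagnostics."""
    warnings = []
    if len(residuals) < n_min:
        warnings.append(f"small sample: only {len(residuals)} rows")
    if len(residuals) >= 2:
        spread = statistics.pstdev(residuals)   # population std dev of residuals
        worst = max(residuals, key=abs)
        if spread > 0 and abs(worst) > outlier_factor * spread:
            warnings.append(
                f"residual {worst:.3g} is unusually large relative to the rest")
    return warnings
```

Trend detection in residuals (curvature, funnel shape) is harder to automate reliably, which is why the spec also requires the residual plot for visual inspection.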
Phase 4: Visualization and Reporting
- Plot scatter with fitted line.
- Plot residuals vs. `x` (or index).
- Print a clear textual summary with units/context notes.
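With matplotlib (one possible choice; the spec only requires some plotting library), the two plots might be produced like this. The function name and output filenames are illustrative:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend: save files without a display
import matplotlib.pyplot as plt

def save_plots(xs, ys, y_hats, out_dir="."):
    """Save the scatter+fit plot and the residual plot as PNG artifacts."""
    residuals = [y - yh for y, yh in zip(ys, y_hats)]

    # Scatter with fitted line.
    fig, ax = plt.subplots()
    ax.scatter(xs, ys, label="data")
    ax.plot(xs, y_hats, color="red", label="fitted line")
    ax.set_xlabel("x")
    ax.set_ylabel("y")
    ax.legend()
    fig.savefig(f"{out_dir}/fit.png")
    plt.close(fig)

    # Residuals vs. x, with a zero reference line.
    fig, ax = plt.subplots()
    ax.scatter(xs, residuals)
    ax.axhline(0, linestyle="--")
    ax.set_xlabel("x")
    ax.set_ylabel("residual")
    fig.savefig(f"{out_dir}/residuals.png")
    plt.close(fig)
```

The dashed zero line in the residual plot makes curvature and funnel shapes easier to spot by eye.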
Phase 5: Reproducibility
- Save outputs with deterministic naming.
- Include metadata in report (dataset name, row count, timestamp or run id).
- Keep command examples in README for reruns.
Testing Strategy
Deterministic Math Tests
- Perfect line dataset:
  - Example pattern: points exactly on `y = 2x + 1`.
  - Expected: slope `2`, intercept `1`, `R^2 = 1`, near-zero residuals.
- Flat response dataset:
  - Constant `y` values.
  - Expected slope near `0`; discuss the `R^2` interpretation carefully (`SST` is zero, so `R^2` is undefined).
- Shift invariance test:
  - Add a constant to all `y` values.
  - Expected: slope unchanged, intercept shifted by that constant.
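The three deterministic tests can be written as plain assertions. In this sketch, `fit_line` is a hypothetical name for your Phase 2 function, inlined here so the example runs on its own:

```python
def fit_line(xs, ys):
    # Closed-form least squares (same formulas as Phase 2).
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    den = sum((x - x_bar) ** 2 for x in xs)
    assert den != 0, "x variance is zero"
    b1 = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / den
    return y_bar - b1 * x_bar, b1

xs = [0.0, 1.0, 2.0, 3.0, 4.0]

# Perfect line: y = 2x + 1 recovers slope 2 and intercept 1.
b0, b1 = fit_line(xs, [2 * x + 1 for x in xs])
assert abs(b1 - 2) < 1e-9 and abs(b0 - 1) < 1e-9

# Flat response: constant y gives slope near 0.
b0, b1 = fit_line(xs, [5.0] * len(xs))
assert abs(b1) < 1e-9

# Shift invariance: adding 10 to every y shifts only the intercept.
b0s, b1s = fit_line(xs, [2 * x + 1 + 10 for x in xs])
assert abs(b1s - 2) < 1e-9 and abs(b0s - 11) < 1e-9
```

Tolerance-based comparisons (`abs(... ) < 1e-9`) are used rather than exact equality because floating-point sums are not exact.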
Robustness Tests
- Malformed row handling (missing numeric values) is explicit and counted.
- Near-zero x-variance dataset triggers clear error/warning.
- Outlier injection test demonstrates coefficient sensitivity.
Diagnostics Tests
- Residual sum close to zero (with intercept model).
- Residual plot generated and labeled.
- Text report includes all required metrics and row counts.
Common Pitfalls
- Confusing correlation with causation:
- Cause: treating linear association as proof of mechanism.
- Fix: document that regression here is descriptive/predictive, not causal proof.
- Blind trust in `R^2`:
- Cause: single-metric thinking.
- Fix: always inspect residual plot and domain plausibility.
- Hidden data leakage:
- Cause: preprocessing with future/test info in advanced setups.
- Fix: keep clear separation if you add train/test extension.
- Ignoring scale and units:
- Cause: reporting coefficients without context.
- Fix: annotate what a one-unit change in `x` means physically.
Extensions
- Add train/test split and report out-of-sample error.
- Add robust regression comparison to reduce outlier sensitivity.
- Add polynomial features to compare linear vs curved fits.
- Add confidence intervals for coefficients.
- Generalize to multiple regression with matrix form.
Real-World Connections
- Education analytics: studying the relation between study hours and scores.
- Business forecasting: basic demand vs price sensitivity estimate.
- Engineering calibration: mapping sensor input to measured output.
- Public policy: trend analysis on historical indicators (with caution about causality).
Resources
- James, Witten, Hastie, Tibshirani, An Introduction to Statistical Learning (linear regression chapter).
- Freedman, Pisani, Purves, Statistics (correlation/regression intuition).
- Montgomery, Peck, Vining, Introduction to Linear Regression Analysis.
- Khan Academy / MIT OCW linear regression lectures for visual intuition.
Self-Assessment
- Why does least squares use squared residuals instead of raw residuals?
- What does a slope of `-0.8` mean in plain language?
- Can you have a high `R^2` and still have a bad model? Give a case.
- What does a curved residual pattern tell you?
- What should your tool do when all `x` values are identical?
Submission Criteria
A submission is complete only if all items below are satisfied:
- The tool ingests a CSV and validates numeric `x` and `y` columns.
- Slope/intercept are computed from explicit formulas (not black-box fitting).
- The report includes at least `MSE`, `RMSE`, and `R^2`.
- Produces both a scatter+fit plot and a residual plot.
- Includes at least 3 deterministic tests with expected outcomes.
- Documents assumptions, limitations, and outlier sensitivity.
- Includes one reproducible sample run log and generated artifacts.