Project 9: The Data Detective (Linear Regression from Scratch)

Build a linear regression model that fits a line to data and reports how well that line fits.


Project Overview

Attribute              Value
Difficulty             Level 2: Intermediate
Time Estimate          1-2 weeks
Main Language          Python
Alternative Languages  JavaScript, C++
Knowledge Area         Statistics and modeling
Tools                  Plotting tool
Main Book              “An Introduction to Statistical Learning” by James et al.

What you’ll build: A tool that fits a best-fit line to a dataset and reports slope, intercept, and error.

Why it teaches math: Regression connects algebra to real data. You learn how models are judged by error.

Core challenges you’ll face:

  • Computing slope and intercept analytically
  • Measuring residuals and total error
  • Visualizing the fit

Real World Outcome

You will supply a dataset of (x, y) points and get back a regression line, its error metrics, and a plot.

Example Output:

$ python regression.py data.csv
Slope: 2.43
Intercept: -1.12
R^2: 0.91
Saved plot to regression.png

Verification steps:

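  • Compare the output against a dataset whose slope and intercept you already know
  • Confirm that the residuals sum to approximately zero (this holds for a least-squares line that includes an intercept)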
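Both verification steps can be scripted. The following is a minimal sketch, not a reference implementation: the helper name `fit_line` and the small dataset built around y = 2x + 1 are assumptions made up for illustration.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-deviations and sum of squared x-deviations.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical; the slope is undefined")
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical known dataset: y = 2x + 1 plus small hand-picked offsets.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
offsets = [0.1, -0.2, 0.05, 0.2, -0.1]
ys = [2 * x + 1 + e for x, e in zip(xs, offsets)]

slope, intercept = fit_line(xs, ys)
residuals = [y - (slope * x + intercept) for x, y in zip(xs, ys)]
residual_sum = sum(residuals)

print(f"slope={slope:.3f} intercept={intercept:.3f} residual sum={residual_sum:.2e}")
```

The slope and intercept should land close to 2 and 1, and the residual sum should differ from zero only by floating-point rounding.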

The Core Question You’re Answering

“How do I find the line that best explains a set of points?”

This is the foundation of predictive modeling.


Concepts You Must Understand First

Stop and research these before coding:

  1. Least squares
    • Why do we minimize squared error instead of absolute error?
    • Book Reference: “An Introduction to Statistical Learning” by James et al., Ch. 3
  2. Correlation
    • What does correlation say about linear relationships?
    • Book Reference: “Statistics” by Freedman, Pisani, and Purves, Ch. 4
  3. R-squared
    • How does R^2 measure explained variance?
    • Book Reference: “An Introduction to Statistical Learning” by James et al., Ch. 3
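To see how correlation and R^2 relate, it can help to compute both on a tiny dataset. The sketch below uses made-up numbers and plain Python; for a single predictor fitted with an intercept, R^2 should equal the square of the Pearson correlation r.

```python
import math

# Illustrative data only (not from the text): roughly linear with mild scatter.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n

# Sums of squared deviations and cross-deviations.
sxx = sum((x - mean_x) ** 2 for x in xs)
syy = sum((y - mean_y) ** 2 for y in ys)  # total sum of squares (SS_tot)
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))

# Pearson correlation coefficient.
r = sxy / math.sqrt(sxx * syy)

# Least-squares line and its residual sum of squares (SS_res).
slope = sxy / sxx
intercept = mean_y - slope * mean_x
ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))

mse = ss_res / n              # mean squared error
r_squared = 1 - ss_res / syy  # share of the variance in y explained by the line

print(f"r = {r:.4f}, r^2 = {r * r:.4f}, R^2 = {r_squared:.4f}, MSE = {mse:.4f}")
```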

Questions to Guide Your Design

  1. Data handling
    • How will you handle missing or malformed data?
    • Will you allow multiple datasets?
  2. Error reporting
    • Will you compute mean squared error and R^2?
    • How will you visualize residuals?
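One possible answer to the data-handling question is to skip bad rows and report them rather than abort. The sketch below assumes a two-column CSV of x,y values; the function name `load_points` and the skip-and-report policy are illustrative choices, not requirements of the project.

```python
import csv
import io
import math


def load_points(lines):
    """Parse 'x,y' rows into two float lists, skipping rows that cannot be used.

    `lines` is any iterable of text lines, such as an open file.
    Returns (xs, ys, skipped), where skipped holds (row_number, reason) pairs.
    """
    xs, ys, skipped = [], [], []
    for row_no, row in enumerate(csv.reader(lines), start=1):
        if not row or all(not cell.strip() for cell in row):
            continue  # ignore blank rows silently
        if len(row) < 2:
            skipped.append((row_no, "expected two columns"))
            continue
        try:
            x, y = float(row[0]), float(row[1])
        except ValueError:
            # A header row such as "x,y" also ends up here.
            skipped.append((row_no, "non-numeric value"))
            continue
        if not (math.isfinite(x) and math.isfinite(y)):
            skipped.append((row_no, "NaN or infinite value"))
            continue
        xs.append(x)
        ys.append(y)
    return xs, ys, skipped


# In-memory sample standing in for a file on disk.
sample = io.StringIO("x,y\n1,2\n2,\n3,5\nfoo,7\n4,nan\n")
xs, ys, skipped = load_points(sample)
print(xs, ys)
print(skipped)
```

Returning the skipped rows lets the caller decide whether a few bad lines are acceptable or whether the whole file should be rejected.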

Thinking Exercise

Small Dataset

Given points (1,2), (2,3), (3,5), compute the best-fit line by hand.

Questions while working:

  • What is the slope?
  • How far is each point from the line?
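Once you have worked the exercise by hand, a short script can check your arithmetic. This sketch uses Python's `fractions` module so the results print as exact fractions rather than rounded decimals; the variable names are illustrative.

```python
from fractions import Fraction

points = [(1, 2), (2, 3), (3, 5)]
n = len(points)

# Exact means of x and y.
mean_x = Fraction(sum(x for x, _ in points), n)
mean_y = Fraction(sum(y for _, y in points), n)

# Sum of cross-deviations and sum of squared x-deviations.
sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
sxx = sum((x - mean_x) ** 2 for x, _ in points)

slope = sxy / sxx
intercept = mean_y - slope * mean_x

# Vertical distance of each point from the fitted line (signed).
residuals = [y - (slope * x + intercept) for x, y in points]

print("slope =", slope)
print("intercept =", intercept)
print("residuals =", [str(r) for r in residuals])
print("sum of residuals =", sum(residuals))
```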

The Interview Questions They’ll Ask

Prepare to answer these:

  1. “What is least squares regression?”
  2. “Why do we square the errors?”
  3. “What does R-squared mean?”
  4. “How do you detect outliers?”
  5. “How would you fit a nonlinear model?”

Hints in Layers

Hint 1 (Starting Point): Compute the means of x and y first.

Hint 2 (Next Level): Use the slope formula derived from covariance and variance.
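For reference, the standard closed form for simple linear regression is written out below. The symbols are a notational assumption: \hat{\beta}_1 for the slope, \hat{\beta}_0 for the intercept, and \bar{x}, \bar{y} for the sample means. The 1/n (or 1/(n-1)) factors in the covariance and variance cancel, so only the sums remain.

```latex
\hat{\beta}_1
  = \frac{\operatorname{Cov}(x, y)}{\operatorname{Var}(x)}
  = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}
         {\sum_{i=1}^{n} (x_i - \bar{x})^2},
\qquad
\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}
```

The intercept formula says the fitted line always passes through the point (\bar{x}, \bar{y}).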

Hint 3 (Technical Details): Compute residuals and verify their sum is close to zero.

Hint 4 (Tools/Debugging): Plot the line over the scatter to visually confirm the fit.


Books That Will Help

Topic          Book                                                       Chapter
Least squares  “An Introduction to Statistical Learning” by James et al.  Ch. 3
Correlation    “Statistics” by Freedman, Pisani, and Purves               Ch. 4
R-squared      “An Introduction to Statistical Learning” by James et al.  Ch. 3

Implementation Hints

  • Use plain formulas before relying on libraries.
  • Validate with a dataset that has an obvious trend.
  • Add residual plots to spot outliers.
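Before a plotting tool is wired in, residuals can also be screened numerically. The sketch below flags points whose residual is more than a chosen multiple of the residual standard deviation. The function names, the default threshold of 2, and the dataset are all assumptions for illustration; this is a rough rule of thumb, not a formal outlier test.

```python
import math


def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def flag_outliers(xs, ys, threshold=2.0):
    """Return indices of points whose |residual| exceeds threshold * s.

    s is the residual standard deviation with n - 2 degrees of freedom.
    """
    n = len(xs)
    if n <= 2:
        return []  # two points always fit a line exactly; nothing to flag
    slope, intercept = fit_line(xs, ys)
    residuals = [y - (slope * x + intercept) for x, y in zip(xs, ys)]
    s = math.sqrt(sum(r * r for r in residuals) / (n - 2))
    if s == 0:
        return []  # perfect fit
    return [i for i, r in enumerate(residuals) if abs(r) > threshold * s]


# Hypothetical data: an exact line y = 2x + 1 with one point pushed off it.
xs = [float(i) for i in range(10)]
ys = [2 * x + 1 for x in xs]
ys[5] += 10.0

flagged = flag_outliers(xs, ys)
print("flagged indices:", flagged)
```

Keep in mind that a large outlier also pulls the fitted line and inflates s, so this check can miss points that a residual plot would make obvious.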

Learning Milestones

  1. First milestone: You can compute slope and intercept correctly.
  2. Second milestone: You can measure error and R^2.
  3. Final milestone: You can explain regression assumptions and limits.