Project 3: California Housing Regression Pipeline

Build an end-to-end regression baseline with clean train/validation/test boundaries and reproducible artifacts.

Quick Reference

  • Difficulty: Level 2: Intermediate
  • Time Estimate: 8-12 hours
  • Main Programming Language: Python
  • Alternative Programming Languages: R, Julia
  • Coolness Level: Level 2: Practical but Forgettable
  • Business Potential: 1. Resume Gold
  • Prerequisites: EDA basics, train/test split, regression metrics
  • Key Topics: Pipelines, scaling, regression metrics, residual analysis

1. Learning Objectives

  1. Build a leak-resistant regression workflow.
  2. Compare baseline and improved preprocessing strategies.
  3. Interpret MAE/RMSE and residual plots.
  4. Save artifact + report for reproducibility.

2. All Theory Needed (Per-Concept Breakdown)

2.1 Leak-Resistant Pipeline Design

Fundamentals

A pipeline should enforce training-only fitting for transformations and preserve identical inference behavior later.

Deep Dive into the concept

Regression pipelines for tabular data have three core boundaries: split boundary, transformation boundary, and evaluation boundary. Split boundary isolates test data. Transformation boundary ensures imputation/scaling parameters are learned only on training folds. Evaluation boundary protects final metrics by separating hyperparameter tuning from test evaluation.

In regression tasks, numeric scale and skew can strongly affect linear model behavior and optimization. Pipelines let you standardize these operations safely while keeping configuration reproducible. Column-wise transformations are useful when features need different treatment (for example, median imputation for one group, one-hot encoding for another).
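A minimal sketch of this column-wise setup with scikit-learn's `Pipeline` and `ColumnTransformer`. The feature names here are illustrative placeholders, not the actual dataset schema:

```python
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Illustrative feature groups; real column names depend on your dataset.
numeric_features = ["median_income", "rooms_per_household"]
categorical_features = ["ocean_proximity"]

preprocess = ColumnTransformer([
    # Median imputation + scaling; parameters are learned only when fit() runs.
    ("num", Pipeline([
        ("impute", SimpleImputer(strategy="median")),
        ("scale", StandardScaler()),
    ]), numeric_features),
    # Unknown categories at inference time are encoded as all-zeros, not errors.
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_features),
])

# One object: fitting it fits preprocessing and model on the same training data,
# and transform/predict later reuse exactly those fitted parameters.
model = Pipeline([("preprocess", preprocess), ("regress", LinearRegression())])
```

Because preprocessing and estimator live in one object, cross-validation refits the transformations inside each fold automatically, which is what keeps the transformation boundary leak-free.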

Residual analysis is integral, not optional. Aggregate metrics alone hide systematic bias (underprediction for high-value houses, for instance). Residuals by quantile bands reveal where model assumptions fail.
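One way to sketch those quantile bands, assuming pandas is available (the helper name is ours, not a library function):

```python
import numpy as np
import pandas as pd

def residuals_by_quantile(y_true, y_pred, n_bands=5):
    """Mean residual (actual - predicted) per target-quantile band."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    df = pd.DataFrame({"residual": y_true - y_pred})
    # Band 0 holds the lowest targets, the last band the highest.
    df["band"] = pd.qcut(y_true, q=n_bands, labels=False, duplicates="drop")
    return df.groupby("band")["residual"].mean()
```

Consistently positive residuals in the top band mean the model systematically underpredicts the most expensive houses, which an aggregate RMSE would never show.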

How this fits into other projects

Main focus in P03, then reused in P05 and P06.

Definitions & key terms

  • MAE: average absolute error.
  • RMSE: square root of the mean squared error.
  • Residual: actual minus predicted.
  • Cross-validation: repeated train/validation splits.
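The two error metrics follow directly from these definitions; a minimal NumPy sketch:

```python
import numpy as np

def mae(y_true, y_pred):
    """Mean absolute error: average magnitude of residuals."""
    return float(np.mean(np.abs(np.asarray(y_true) - np.asarray(y_pred))))

def rmse(y_true, y_pred):
    """Root mean squared error: penalizes large residuals more heavily."""
    return float(np.sqrt(np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2)))

# One large error among four predictions:
# mae([0, 0, 0, 0], [0, 0, 0, 4])  -> 1.0
# rmse([0, 0, 0, 0], [0, 0, 0, 4]) -> 2.0
```

The gap between the two values on the same predictions is itself diagnostic: a much larger RMSE than MAE points to a few big misses rather than uniformly mediocre predictions.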

Mental model diagram

Train split -> fit preprocess -> fit model -> validate
Test split  -> transform only  -> predict    -> final metrics

How it works

  1. Split data.
  2. Define preprocessors by feature type.
  3. Compose pipeline with estimator.
  4. CV on training split.
  5. Final fit and test evaluation.
  6. Save model + report.
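The six steps above can be sketched as one function, assuming scikit-learn and joblib; `run_training` and its report keys are our own naming, not part of any library:

```python
import joblib
import numpy as np
from sklearn.model_selection import cross_val_score, train_test_split

def run_training(X, y, pipeline, seed=42, folds=5, artifact_path=None):
    # Step 1: split; the test set stays untouched until the very end.
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.2, random_state=seed)
    # Step 4: CV on the training split only; the pipeline refits
    # preprocessing inside every fold, enforcing the transformation boundary.
    scores = -cross_val_score(pipeline, X_tr, y_tr, cv=folds,
                              scoring="neg_root_mean_squared_error")
    # Step 5: final fit on the full training split, one-shot test evaluation.
    pipeline.fit(X_tr, y_tr)
    test_mae = float(np.mean(np.abs(np.asarray(y_te) - pipeline.predict(X_te))))
    # Step 6: persist preprocessing + model as a single artifact.
    if artifact_path:
        joblib.dump(pipeline, artifact_path)
    return {"cv_rmse_mean": float(scores.mean()),
            "cv_rmse_std": float(scores.std()),
            "test_mae": test_mae}
```

Steps 2-3 (defining preprocessors and composing the pipeline) happen before this function is called, so the same flow works unchanged for every pipeline variant you want to compare.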

Failure modes:

  • fitting scaler before split,
  • test-set tuning,
  • missing residual diagnostics.

Minimal concrete example

CV RMSE: 0.74 +/- 0.03
Test MAE: 0.51
Residual risk: top decile underpredicted

Common misconceptions

  • ”"”Good RMSE means no bias.””” Correction: residual slices may reveal systematic errors.

Check-your-understanding questions

  1. Why report both MAE and RMSE?
  2. What makes a pipeline reproducible?
  3. Why keep a baseline linear model?

Check-your-understanding answers

  1. They reflect different penalty sensitivity to large errors.
  2. Fixed split, fixed config, saved artifacts, versioned environment.
  3. It anchors complexity decisions with interpretable reference performance.

Real-world applications

  • Price forecasting
  • Demand estimation
  • Cost prediction

Where you’ll apply it

  • This project, P05, P06.

References

  • scikit-learn getting started: https://scikit-learn.org/stable/getting_started.html
  • scikit-learn model evaluation: https://scikit-learn.org/stable/modules/model_evaluation.html

Key insights

Pipeline boundaries are the main defense against accidental optimism.

Summary

A trustworthy regression baseline is reproducible, validated, and diagnostically rich.

Homework/Exercises to practice the concept

  1. Compare two preprocess variants and explain metric changes.
  2. Slice residuals by target quantile.
  3. Re-run from clean environment and compare results.

Solutions to the homework/exercises

  1. Scaling often improves linear stability.
  2. High quantile underprediction indicates model misspecification or missing features.
  3. Minor variation acceptable; major drift indicates reproducibility gaps.

3. Project Specification

3.1 What You Will Build

A command-based regression training flow that outputs metrics, diagnostics, and reusable model artifacts.

3.2 Functional Requirements

  1. Load and validate dataset schema.
  2. Execute train/validation/test strategy.
  3. Train baseline pipeline and at least one alternative.
  4. Report MAE, RMSE, and residual insights.
  5. Save artifact and report files.
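Requirement 1 can be satisfied with a fail-fast validator. A sketch under our own naming (`SchemaError` and `validate_schema` are hypothetical; the CLI layer would map the exception to the exit code 2 shown later):

```python
import pandas as pd

class SchemaError(ValueError):
    """Raised for invalid input; the CLI layer maps this to exit code 2."""

def validate_schema(df: pd.DataFrame, target: str) -> None:
    # Target column must exist and be numeric for a regression task.
    if target not in df.columns:
        raise SchemaError(f"missing target column: {target}")
    if not pd.api.types.is_numeric_dtype(df[target]):
        raise SchemaError(f"column {target} contains non-numeric values")
    # All-null feature columns carry no signal and break imputers.
    all_null = [c for c in df.columns if df[c].isna().all()]
    if all_null:
        raise SchemaError(f"all-null feature columns: {all_null}")
```

Validating before any split or fit keeps failures cheap and keeps error handling out of the training path.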

3.3 Non-Functional Requirements

  • Performance: run completes under 2 minutes on laptop hardware.
  • Reliability: deterministic outputs with fixed seed.
  • Usability: clear report sections and config-driven runs.

3.4 Example Usage / Output

$ ml-lab train-regression --dataset housing.csv --target median_value
CV RMSE: 0.742 +/- 0.031
Test MAE: 0.512
Artifact saved

3.5 Data Formats / Schemas / Protocols

  • CSV numeric + categorical features.
  • Numeric target column.

3.6 Edge Cases

  • Missing target.
  • All-null feature columns.
  • Tiny dataset with insufficient folds.

3.7 Real World Outcome

3.7.1 How to Run (Copy/Paste)

ml-lab train-regression --dataset data/california_housing.csv --target median_house_value

3.7.2 Golden Path Demo (Deterministic)

$ ml-lab train-regression --dataset fixtures/housing_fixed.csv --target target
CV RMSE: 0.741 +/- 0.030
Test MAE: 0.510
Exit code: 0

3.7.3 If CLI: exact terminal transcript

$ ml-lab train-regression --dataset fixtures/housing_fixed.csv --target target
Split: train=80% test=20% seed=42
CV RMSE: 0.741 +/- 0.030
Test MAE: 0.510
Residual note: high-target underprediction
Saved: artifacts/regression.joblib
Exit code: 0

$ ml-lab train-regression --dataset fixtures/housing_bad.csv --target target
ERROR: column target contains non-numeric values
Exit code: 2

4. Solution Architecture

4.1 High-Level Design

Config Loader -> Schema Validator -> Split Engine -> Pipeline Trainer -> Evaluator -> Artifact Writer

4.2 Key Components

Component         Responsibility              Key Decisions
Split Engine      holdout/CV strategy         fixed seed and fold count
Pipeline Trainer  fit preprocessing + model   one object for both
Evaluator         MAE/RMSE/residual slices    include risk notes
Artifact Writer   save model/report           versioned filenames

4.3 Data Structures (No Full Code)

config: {seed, target, folds, metric_primary}
report: {cv_mean, cv_std, test_metrics, residual_notes}
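These two structures map naturally onto dataclasses; a sketch with illustrative defaults (field names follow the shapes above, the default values are assumptions):

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)  # frozen: config must not mutate mid-run
class RunConfig:
    seed: int = 42
    target: str = "median_house_value"
    folds: int = 5
    metric_primary: str = "rmse"

@dataclass
class RunReport:
    cv_mean: float
    cv_std: float
    test_metrics: dict
    residual_notes: list = field(default_factory=list)
```

Freezing the config and writing the report as a plain dataclass makes both trivial to serialize into the versioned report files the Artifact Writer produces.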

4.4 Algorithm Overview

  1. Validate config and schema.
  2. Split dataset.
  3. Cross-validate pipeline.
  4. Fit full train and evaluate test.
  5. Save report and artifact.

Overall cost is dominated by the estimator's fit time multiplied by the number of CV folds.


5. Implementation Guide

5.1 Development Environment Setup

# install dependencies and prepare working directories
pip install scikit-learn pandas numpy joblib
mkdir -p data artifacts reports

5.2 Project Structure

regression-project/
|-- configs/
|-- artifacts/
|-- reports/
`-- src/

5.3 The Core Question You’re Answering

Can this model provide reliable numeric estimates under controlled evaluation?

5.4 Concepts You Must Understand First

  • Regression metrics
  • CV split strategy
  • Residual diagnostics

5.5 Questions to Guide Your Design

  • Which metric aligns with business tolerance?
  • Which feature transformations are justified?
  • What constitutes an acceptable residual pattern?

5.6 Thinking Exercise

Draw expected residual shape for well-specified vs misspecified model.

5.7 The Interview Questions They’ll Ask

  1. MAE vs RMSE: when do you prefer each?
  2. Why CV plus holdout?
  3. How do you diagnose heteroscedasticity?

5.8 Hints in Layers

  • Baseline first.
  • Keep split deterministic.
  • Add residual quantile table.

5.9 Books That Will Help

Topic                        Book                       Chapter
End-to-end tabular workflow  Hands-On Machine Learning  Ch. 2
Regression diagnostics       ISL                        Ch. 3

5.10 Implementation Phases

  • Phase 1: split + baseline pipeline.
  • Phase 2: alternative preprocessing/model.
  • Phase 3: diagnostics + artifacts.

5.11 Key Implementation Decisions

Decision        Options                       Recommendation                  Rationale
Primary metric  MAE, RMSE                     RMSE primary + MAE secondary    captures large-error risk + interpretability
Model baseline  linear only, tree only, both  linear baseline + one nonlinear defensible comparison

6. Testing Strategy

6.1 Test Categories

Category     Purpose                   Examples
Unit         schema and metric checks  non-numeric target failure
Integration  full train command        artifact/report written
Edge         tiny datasets             fold validation failure

6.2 Critical Test Cases

  1. Golden fixture metrics within tolerance.
  2. Missing target fails with exit code 2.
  3. Re-run produces same metrics under fixed seed.
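Test case 3 can be verified mechanically. A pytest-style sketch using a toy stand-in for the real training run (`train_once` is hypothetical; in the project it would invoke the actual training entry point):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

def train_once(seed=42):
    """Toy stand-in for a training run; same seed must give same metrics."""
    rng = np.random.default_rng(0)  # data generation is fixed
    X = rng.normal(size=(200, 3))
    y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=200)
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.2, random_state=seed)
    model = LinearRegression().fit(X_tr, y_tr)
    return float(np.mean(np.abs(y_te - model.predict(X_te))))

def test_rerun_is_deterministic():
    # Identical seed, identical metrics: the reproducibility contract.
    assert train_once(seed=42) == train_once(seed=42)

def test_different_seed_changes_split():
    # A different seed must actually change the split, proving the seed is live.
    assert train_once(seed=42) != train_once(seed=7)
```

The golden-fixture check (test case 1) follows the same pattern, comparing against a committed metric value within a small tolerance rather than against a second run.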

6.3 Test Data

  • fixed housing fixture
  • malformed schema fixture

7. Common Pitfalls & Debugging

7.1 Frequent Mistakes

Pitfall           Symptom                     Solution
Test leakage      unrealistically high score  isolate holdout and audit transforms
Metric-only view  hidden bias remains         inspect residual slices
Unlocked seeds    unstable outputs            fixed seed and recorded config

7.2 Debugging Strategies

  • Compare training vs validation metric gaps.
  • Slice errors by feature/target quantiles.

7.3 Performance Traps

Overly large hyperparameter grids before establishing baseline.


8. Extensions & Challenges

8.1 Beginner Extensions

  • Add MAE by segment table.

8.2 Intermediate Extensions

  • Add robust regression variant.

8.3 Advanced Extensions

  • Add uncertainty intervals via bootstrapping.
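One way to approach this extension: percentile intervals from models refit on bootstrap resamples of the training set. A sketch assuming array inputs (`bootstrap_intervals` is our own helper, not a library call); note it captures model variance, not irreducible noise:

```python
import numpy as np
from sklearn.base import clone

def bootstrap_intervals(model, X_train, y_train, X_new,
                        n_boot=200, alpha=0.1, seed=42):
    """Percentile prediction intervals from bootstrap-refit models."""
    rng = np.random.default_rng(seed)
    n = len(X_train)
    preds = np.empty((n_boot, len(X_new)))
    for b in range(n_boot):
        # Resample training rows with replacement, refit a fresh clone.
        idx = rng.integers(0, n, size=n)
        preds[b] = clone(model).fit(X_train[idx], y_train[idx]).predict(X_new)
    lo = np.quantile(preds, alpha / 2, axis=0)
    hi = np.quantile(preds, 1 - alpha / 2, axis=0)
    return lo, hi
```

For full predictive intervals you would additionally add a residual-noise term (e.g. resampled training residuals); the pure model-variance bands here are still useful for flagging inputs where the model itself is unstable.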

9. Real-World Connections

9.1 Industry Applications

  • Property valuation support
  • Demand and cost estimation

9.2 Open-Source Tooling

  • scikit-learn examples
  • statsmodels diagnostics workflows

9.3 Interview Relevance

Covers leak-resistant evaluation and regression error analysis.


10. Resources

10.1 Essential Reading

  • Hands-On Machine Learning Ch. 2
  • ISL Ch. 3 and Ch. 5

10.2 Video Resources

  • Regression diagnostics tutorials

10.3 Tools & Documentation

  • https://scikit-learn.org/

11. Self-Assessment Checklist

11.1 Understanding

  • I can explain why this split strategy is valid.

11.2 Implementation

  • Artifacts and report regenerate deterministically.

11.3 Growth

  • I documented next model-improvement hypothesis.

12. Submission / Completion Criteria

Minimum Viable Completion:

  • Baseline pipeline + MAE/RMSE report.

Full Completion:

  • Includes CV variance and residual slices.

Excellence (Going Above & Beyond):

  • Includes clear go/no-go recommendation with risk narrative.