Project 3: California Housing Regression Pipeline

Build an end-to-end regression baseline with clean train/validation/test boundaries and reproducible artifacts.

Quick Reference

  • Difficulty: Level 2: Intermediate
  • Time Estimate: 8-12 hours
  • Main Programming Language: Python
  • Alternative Programming Languages: R, Julia
  • Coolness Level: Level 2: Practical but Forgettable
  • Business Potential: 1. Resume Gold
  • Prerequisites: EDA basics, train/test split, regression metrics
  • Key Topics: Pipelines, scaling, regression metrics, residual analysis

1. Learning Objectives

  1. Build a leak-resistant regression workflow.
  2. Compare baseline and improved preprocessing strategies.
  3. Interpret MAE/RMSE and residual plots.
  4. Save artifact + report for reproducibility.

2. All Theory Needed (Per-Concept Breakdown)

2.1 Leak-Resistant Pipeline Design

Fundamentals

A pipeline should enforce training-only fitting for transformations and preserve identical inference behavior later.

Deep Dive into the concept

Regression pipelines for tabular data have three core boundaries: split boundary, transformation boundary, and evaluation boundary. Split boundary isolates test data. Transformation boundary ensures imputation/scaling parameters are learned only on training folds. Evaluation boundary protects final metrics by separating hyperparameter tuning from test evaluation.

In regression tasks, numeric scale and skew can strongly affect linear model behavior and optimization. Pipelines let you standardize these operations safely while keeping configuration reproducible. Column-wise transformations are useful when features need different treatment (for example, median imputation for one group, one-hot encoding for another).
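A minimal sketch of this column-wise setup with scikit-learn's `Pipeline` and `ColumnTransformer`. The feature names here are illustrative placeholders, not the actual dataset schema:

```python
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Illustrative feature groups; real column names depend on your dataset.
numeric_features = ["median_income", "rooms_per_household"]
categorical_features = ["ocean_proximity"]

preprocess = ColumnTransformer([
    # Median imputation + scaling; parameters are learned only when fit() runs.
    ("num", Pipeline([
        ("impute", SimpleImputer(strategy="median")),
        ("scale", StandardScaler()),
    ]), numeric_features),
    # Unknown categories at inference time are encoded as all-zeros, not errors.
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_features),
])

# One object: fitting it fits preprocessing and model on the same training data,
# and transform/predict later reuse exactly those fitted parameters.
model = Pipeline([("preprocess", preprocess), ("regress", LinearRegression())])
```

Because preprocessing and estimator live in one object, cross-validation refits the transformations inside each fold automatically, which is what keeps the transformation boundary leak-free.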

Residual analysis is integral, not optional. Aggregate metrics alone hide systematic bias (underprediction for high-value houses, for instance). Residuals by quantile bands reveal where model assumptions fail.
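One way to sketch those quantile bands, assuming pandas is available (the helper name is ours, not a library function):

```python
import numpy as np
import pandas as pd

def residuals_by_quantile(y_true, y_pred, n_bands=5):
    """Mean residual (actual - predicted) per target-quantile band."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    df = pd.DataFrame({"residual": y_true - y_pred})
    # Band 0 holds the lowest targets, the last band the highest.
    df["band"] = pd.qcut(y_true, q=n_bands, labels=False, duplicates="drop")
    return df.groupby("band")["residual"].mean()
```

Consistently positive residuals in the top band mean the model systematically underpredicts the most expensive houses, which an aggregate RMSE would never show.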

How this fits into other projects

Main focus in P03, then reused in P05 and P06.

Definitions & key terms

  • MAE: average absolute error.
  • RMSE: square root of the mean squared error.
  • Residual: actual minus predicted.
  • Cross-validation: repeated train/validation splits.
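The two error metrics follow directly from these definitions; a minimal NumPy sketch:

```python
import numpy as np

def mae(y_true, y_pred):
    """Mean absolute error: average magnitude of residuals."""
    return float(np.mean(np.abs(np.asarray(y_true) - np.asarray(y_pred))))

def rmse(y_true, y_pred):
    """Root mean squared error: penalizes large residuals more heavily."""
    return float(np.sqrt(np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2)))

# One large error among four predictions:
# mae([0, 0, 0, 0], [0, 0, 0, 4])  -> 1.0
# rmse([0, 0, 0, 0], [0, 0, 0, 4]) -> 2.0
```

The gap between the two values on the same predictions is itself diagnostic: a much larger RMSE than MAE points to a few big misses rather than uniformly mediocre predictions.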

Mental model diagram

Train split -> fit preprocess -> fit model -> validate
Test split  -> transform only  -> predict    -> final metrics

How it works

  1. Split data.
  2. Define preprocessors by feature type.
  3. Compose pipeline with estimator.
  4. CV on training split.
  5. Final fit and test evaluation.
  6. Save model + report.
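The six steps above can be sketched as one function, assuming scikit-learn and joblib; `run_training` and its report keys are our own naming, not part of any library:

```python
import joblib
import numpy as np
from sklearn.model_selection import cross_val_score, train_test_split

def run_training(X, y, pipeline, seed=42, folds=5, artifact_path=None):
    # Step 1: split; the test set stays untouched until the very end.
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.2, random_state=seed)
    # Step 4: CV on the training split only; the pipeline refits
    # preprocessing inside every fold, enforcing the transformation boundary.
    scores = -cross_val_score(pipeline, X_tr, y_tr, cv=folds,
                              scoring="neg_root_mean_squared_error")
    # Step 5: final fit on the full training split, one-shot test evaluation.
    pipeline.fit(X_tr, y_tr)
    test_mae = float(np.mean(np.abs(np.asarray(y_te) - pipeline.predict(X_te))))
    # Step 6: persist preprocessing + model as a single artifact.
    if artifact_path:
        joblib.dump(pipeline, artifact_path)
    return {"cv_rmse_mean": float(scores.mean()),
            "cv_rmse_std": float(scores.std()),
            "test_mae": test_mae}
```

Steps 2-3 (defining preprocessors and composing the pipeline) happen before this function is called, so the same flow works unchanged for every pipeline variant you want to compare.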

Failure modes:

  • fitting scaler before split,
  • test-set tuning,
  • missing residual diagnostics.

Minimal concrete example

CV RMSE: 0.74 +/- 0.03
Test MAE: 0.51
Residual risk: top decile underpredicted

Common misconceptions

  • ”"”Good RMSE means no bias.””” Correction: residual slices may reveal systematic errors.

Check-your-understanding questions

  1. Why report both MAE and RMSE?
  2. What makes a pipeline reproducible?
  3. Why keep a baseline linear model?

Check-your-understanding answers

  1. They reflect different penalty sensitivity to large errors.
  2. Fixed split, fixed config, saved artifacts, versioned environment.
  3. It anchors complexity decisions with interpretable reference performance.

Real-world applications

  • Price forecasting
  • Demand estimation
  • Cost prediction

Where you’ll apply it

  • This project, P05, P06.

References

  • scikit-learn getting started: https://scikit-learn.org/stable/getting_started.html
  • scikit-learn model evaluation: https://scikit-learn.org/stable/modules/model_evaluation.html

Key insights

Pipeline boundaries are the main defense against accidental optimism.

Summary

A trustworthy regression baseline is reproducible, validated, and diagnostically rich.

Homework/Exercises to practice the concept

  1. Compare two preprocess variants and explain metric changes.
  2. Slice residuals by target quantile.
  3. Re-run from clean environment and compare results.

Solutions to the homework/exercises

  1. Scaling often improves linear stability.
  2. High quantile underprediction indicates model misspecification or missing features.
  3. Minor variation acceptable; major drift indicates reproducibility gaps.

3. Project Specification

3.1 What You Will Build

A command-based regression training flow that outputs metrics, diagnostics, and reusable model artifacts.

3.2 Functional Requirements

  1. Load and validate dataset schema.
  2. Execute train/validation/test strategy.
  3. Train baseline pipeline and at least one alternative.
  4. Report MAE, RMSE, and residual insights.
  5. Save artifact and report files.
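Requirement 1 can be satisfied with a fail-fast validator. A sketch under our own naming (`SchemaError` and `validate_schema` are hypothetical; the CLI layer would map the exception to the exit code 2 shown later):

```python
import pandas as pd

class SchemaError(ValueError):
    """Raised for invalid input; the CLI layer maps this to exit code 2."""

def validate_schema(df: pd.DataFrame, target: str) -> None:
    # Target column must exist and be numeric for a regression task.
    if target not in df.columns:
        raise SchemaError(f"missing target column: {target}")
    if not pd.api.types.is_numeric_dtype(df[target]):
        raise SchemaError(f"column {target} contains non-numeric values")
    # All-null feature columns carry no signal and break imputers.
    all_null = [c for c in df.columns if df[c].isna().all()]
    if all_null:
        raise SchemaError(f"all-null feature columns: {all_null}")
```

Validating before any split or fit keeps failures cheap and keeps error handling out of the training path.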

3.3 Non-Functional Requirements

  • Performance: run completes under 2 minutes on laptop hardware.
  • Reliability: deterministic outputs with fixed seed.
  • Usability: clear report sections and config-driven runs.

3.4 Example Usage / Output

$ ml-lab train-regression --dataset housing.csv --target median_value
CV RMSE: 0.742 +/- 0.031
Test MAE: 0.512
Artifact saved

3.5 Data Formats / Schemas / Protocols

  • CSV numeric + categorical features.
  • Numeric target column.

3.6 Edge Cases

  • Missing target.
  • All-null feature columns.
  • Tiny dataset with insufficient folds.

3.7 Real World Outcome

3.7.1 How to Run (Copy/Paste)

ml-lab train-regression --dataset data/california_housing.csv --target median_house_value

3.7.2 Golden Path Demo (Deterministic)

$ ml-lab train-regression --dataset fixtures/housing_fixed.csv --target target
CV RMSE: 0.741 +/- 0.030
Test MAE: 0.510
Exit code: 0

3.7.3 If CLI: exact terminal transcript

$ ml-lab train-regression --dataset fixtures/housing_fixed.csv --target target
Split: train=80% test=20% seed=42
CV RMSE: 0.741 +/- 0.030
Test MAE: 0.510
Residual note: high-target underprediction
Saved: artifacts/regression.joblib
Exit code: 0

$ ml-lab train-regression --dataset fixtures/housing_bad.csv --target target
ERROR: column target contains non-numeric values
Exit code: 2

4. Solution Architecture

4.1 High-Level Design

Config Loader -> Schema Validator -> Split Engine -> Pipeline Trainer -> Evaluator -> Artifact Writer

4.2 Key Components

Component         Responsibility              Key Decisions
Split Engine      holdout/CV strategy         fixed seed and fold count
Pipeline Trainer  fit preprocessing + model   one object for both
Evaluator         MAE/RMSE/residual slices    include risk notes
Artifact Writer   save model/report           versioned filenames

4.3 Data Structures (No Full Code)

config: {seed, target, folds, metric_primary}
report: {cv_mean, cv_std, test_metrics, residual_notes}
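These two structures map naturally onto dataclasses; a sketch with illustrative defaults (field names follow the shapes above, the default values are assumptions):

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)  # frozen: config must not mutate mid-run
class RunConfig:
    seed: int = 42
    target: str = "median_house_value"
    folds: int = 5
    metric_primary: str = "rmse"

@dataclass
class RunReport:
    cv_mean: float
    cv_std: float
    test_metrics: dict
    residual_notes: list = field(default_factory=list)
```

Freezing the config and writing the report as a plain dataclass makes both trivial to serialize into the versioned report files the Artifact Writer produces.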

4.4 Algorithm Overview

  1. Validate config and schema.
  2. Split dataset.
  3. Cross-validate pipeline.
  4. Fit full train and evaluate test.
  5. Save report and artifact.

Overall cost is dominated by the estimator's fit time multiplied by the number of CV folds.


5. Implementation Guide

5.1 Development Environment Setup

# install dependencies and prepare working directories
pip install scikit-learn pandas numpy joblib
mkdir -p data artifacts reports

5.2 Project Structure

regression-project/
|-- configs/
|-- artifacts/
|-- reports/
`-- src/

5.3 The Core Question You’re Answering

Can this model provide reliable numeric estimates under controlled evaluation?

5.4 Concepts You Must Understand First

  • Regression metrics
  • CV split strategy
  • Residual diagnostics

5.5 Questions to Guide Your Design

  • Which metric aligns with business tolerance?
  • Which feature transformations are justified?
  • What constitutes an acceptable residual pattern?

5.6 Thinking Exercise

Draw expected residual shape for well-specified vs misspecified model.

5.7 The Interview Questions They’ll Ask

  1. MAE vs RMSE: when do you prefer each?
  2. Why CV plus holdout?
  3. How do you diagnose heteroscedasticity?

5.8 Hints in Layers

  • Baseline first.
  • Keep split deterministic.
  • Add residual quantile table.

5.9 Books That Will Help

Topic                        Book                       Chapter
End-to-end tabular workflow  Hands-On Machine Learning  Ch. 2
Regression diagnostics       ISL                        Ch. 3

5.10 Implementation Phases

  • Phase 1: split + baseline pipeline.
  • Phase 2: alternative preprocessing/model.
  • Phase 3: diagnostics + artifacts.

5.11 Key Implementation Decisions

Decision        Options                       Recommendation                  Rationale
Primary metric  MAE, RMSE                     RMSE primary + MAE secondary    captures large-error risk + interpretability
Model baseline  linear only, tree only, both  linear baseline + one nonlinear defensible comparison

6. Testing Strategy

6.1 Test Categories

Category     Purpose                   Examples
Unit         schema and metric checks  non-numeric target failure
Integration  full train command        artifact/report written
Edge         tiny datasets             fold validation failure

6.2 Critical Test Cases

  1. Golden fixture metrics within tolerance.
  2. Missing target fails with exit code 2.
  3. Re-run produces same metrics under fixed seed.
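Test case 3 can be verified mechanically. A pytest-style sketch using a toy stand-in for the real training run (`train_once` is hypothetical; in the project it would invoke the actual training entry point):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

def train_once(seed=42):
    """Toy stand-in for a training run; same seed must give same metrics."""
    rng = np.random.default_rng(0)  # data generation is fixed
    X = rng.normal(size=(200, 3))
    y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=200)
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.2, random_state=seed)
    model = LinearRegression().fit(X_tr, y_tr)
    return float(np.mean(np.abs(y_te - model.predict(X_te))))

def test_rerun_is_deterministic():
    # Identical seed, identical metrics: the reproducibility contract.
    assert train_once(seed=42) == train_once(seed=42)

def test_different_seed_changes_split():
    # A different seed must actually change the split, proving the seed is live.
    assert train_once(seed=42) != train_once(seed=7)
```

The golden-fixture check (test case 1) follows the same pattern, comparing against a committed metric value within a small tolerance rather than against a second run.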

6.3 Test Data

  • fixed housing fixture
  • malformed schema fixture

7. Common Pitfalls & Debugging

7.1 Frequent Mistakes

Pitfall           Symptom                     Solution
Test leakage      unrealistically high score  isolate holdout and audit transforms
Metric-only view  hidden bias remains         inspect residual slices
Unlocked seeds    unstable outputs            fixed seed and recorded config

7.2 Debugging Strategies

  • Compare training vs validation metric gaps.
  • Slice errors by feature/target quantiles.

7.3 Performance Traps

Overly large hyperparameter grids before establishing baseline.


8. Extensions & Challenges

8.1 Beginner Extensions

  • Add MAE by segment table.

8.2 Intermediate Extensions

  • Add robust regression variant.

8.3 Advanced Extensions

  • Add uncertainty intervals via bootstrapping.
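One way to approach this extension: percentile intervals from models refit on bootstrap resamples of the training set. A sketch assuming array inputs (`bootstrap_intervals` is our own helper, not a library call); note it captures model variance, not irreducible noise:

```python
import numpy as np
from sklearn.base import clone

def bootstrap_intervals(model, X_train, y_train, X_new,
                        n_boot=200, alpha=0.1, seed=42):
    """Percentile prediction intervals from bootstrap-refit models."""
    rng = np.random.default_rng(seed)
    n = len(X_train)
    preds = np.empty((n_boot, len(X_new)))
    for b in range(n_boot):
        # Resample training rows with replacement, refit a fresh clone.
        idx = rng.integers(0, n, size=n)
        preds[b] = clone(model).fit(X_train[idx], y_train[idx]).predict(X_new)
    lo = np.quantile(preds, alpha / 2, axis=0)
    hi = np.quantile(preds, 1 - alpha / 2, axis=0)
    return lo, hi
```

For full predictive intervals you would additionally add a residual-noise term (e.g. resampled training residuals); the pure model-variance bands here are still useful for flagging inputs where the model itself is unstable.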

9. Real-World Connections

9.1 Industry Applications

  • Property valuation support
  • Demand and cost estimation

9.2 Open-Source Tooling

  • scikit-learn examples
  • statsmodels diagnostics workflows

9.3 Interview Relevance

Covers leak-resistant evaluation and regression error analysis.


10. Resources

10.1 Essential Reading

  • Hands-On Machine Learning Ch. 2
  • ISL Ch. 3 and Ch. 5

10.2 Video Resources

  • Regression diagnostics tutorials

10.3 Tools & Documentation

  • https://scikit-learn.org/

11. Self-Assessment Checklist

11.1 Understanding

  • I can explain why this split strategy is valid.

11.2 Implementation

  • Artifacts and report regenerate deterministically.

11.3 Growth

  • I documented next model-improvement hypothesis.

12. Submission / Completion Criteria

Minimum Viable Completion:

  • Baseline pipeline + MAE/RMSE report.

Full Completion:

  • Includes CV variance and residual slices.

Excellence (Going Above & Beyond):

  • Includes clear go/no-go recommendation with risk narrative.