Project 3: California Housing Regression Pipeline
Build an end-to-end regression baseline with clean train/validation/test boundaries and reproducible artifacts.
Quick Reference
| Attribute | Value |
|---|---|
| Difficulty | Level 2: Intermediate |
| Time Estimate | 8-12 hours |
| Main Programming Language | Python |
| Alternative Programming Languages | R, Julia |
| Coolness Level | Level 2: Practical but Forgettable |
| Business Potential | 1. Resume Gold |
| Prerequisites | EDA basics, train/test split, regression metrics |
| Key Topics | Pipelines, scaling, regression metrics, residual analysis |
1. Learning Objectives
- Build a leak-resistant regression workflow.
- Compare baseline and improved preprocessing strategies.
- Interpret MAE/RMSE and residual plots.
- Save artifact + report for reproducibility.
2. All Theory Needed (Per-Concept Breakdown)
2.1 Leak-Resistant Pipeline Design
Fundamentals
A pipeline should enforce training-only fitting for transformations and preserve identical inference behavior later.
Deep Dive into the concept
Regression pipelines for tabular data have three core boundaries: split boundary, transformation boundary, and evaluation boundary. Split boundary isolates test data. Transformation boundary ensures imputation/scaling parameters are learned only on training folds. Evaluation boundary protects final metrics by separating hyperparameter tuning from test evaluation.
In regression tasks, numeric scale and skew can strongly affect linear model behavior and optimization. Pipelines let you standardize these operations safely while keeping configuration reproducible. Column-wise transformations are useful when features need different treatment (for example, median imputation for one group, one-hot encoding for another).
Residual analysis is integral, not optional. Aggregate metrics alone hide systematic bias (underprediction for high-value houses, for instance). Residuals by quantile bands reveal where model assumptions fail.
How this fits into the projects
Main focus in P03, then reused in P05 and P06.
Definitions & key terms
- MAE: mean absolute error, the average of absolute residuals.
- RMSE: square root of the mean squared error.
- Residual: actual minus predicted.
- Cross-validation: repeated train/validation splits.
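The metric definitions above can be made concrete with a few lines of NumPy; the numbers here are purely illustrative:

```python
import numpy as np

# Illustrative actual and predicted values
y_true = np.array([2.0, 3.5, 1.0, 4.0])
y_pred = np.array([2.5, 3.0, 1.5, 2.5])

residuals = y_true - y_pred              # actual minus predicted
mae = np.mean(np.abs(residuals))         # mean absolute error
rmse = np.sqrt(np.mean(residuals ** 2))  # square root of mean squared error

print(residuals)           # [-0.5  0.5 -0.5  1.5]
print(round(mae, 3))       # 0.75
print(round(rmse, 3))      # 0.866
```

Note that the single large residual (1.5) pulls RMSE above MAE, which is exactly the penalty-sensitivity difference discussed in the check-your-understanding answers.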
Mental model diagram
Train split -> fit preprocess -> fit model -> validate
Test split -> transform only -> predict -> final metrics
How it works
- Split data.
- Define preprocessors by feature type.
- Compose pipeline with estimator.
- CV on training split.
- Final fit and test evaluation.
- Save model + report.
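The steps above can be sketched with scikit-learn. This is a minimal illustration using synthetic data and made-up column names, not the real California Housing schema:

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in for the dataset (column names are illustrative)
rng = np.random.default_rng(42)
df = pd.DataFrame({
    "rooms": rng.normal(5, 1, 200),
    "income": rng.normal(3, 0.5, 200),
    "ocean": rng.choice(["near", "inland"], 200),
})
df["target"] = 0.5 * df["rooms"] + df["income"] + rng.normal(0, 0.1, 200)

# 1. Split data (test set isolated before any fitting).
X_train, X_test, y_train, y_test = train_test_split(
    df.drop(columns="target"), df["target"], test_size=0.2, random_state=42
)

# 2. Define preprocessors by feature type.
preprocess = ColumnTransformer([
    ("num", Pipeline([("impute", SimpleImputer(strategy="median")),
                      ("scale", StandardScaler())]), ["rooms", "income"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["ocean"]),
])

# 3. Compose the pipeline with the estimator.
model = Pipeline([("prep", preprocess), ("reg", LinearRegression())])

# 4. Cross-validate on the training split only.
cv_rmse = -cross_val_score(model, X_train, y_train,
                           scoring="neg_root_mean_squared_error", cv=5)
print(f"CV RMSE: {cv_rmse.mean():.3f} +/- {cv_rmse.std():.3f}")

# 5. Final fit on the full training split, then test evaluation.
model.fit(X_train, y_train)
test_mae = np.mean(np.abs(y_test - model.predict(X_test)))
print(f"Test MAE: {test_mae:.3f}")
```

Because preprocessing and estimator live in one `Pipeline` object, cross-validation refits the imputer and scaler inside each training fold, which is what enforces the transformation boundary.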
Failure modes:
- Fitting the scaler before the split.
- Tuning on the test set.
- Skipping residual diagnostics.
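The first failure mode can be shown directly: a scaler fitted before the split absorbs test-set statistics. A small sketch with synthetic data:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
X = rng.normal(10, 2, size=(200, 1))
X_train, X_test = train_test_split(X, test_size=0.2, random_state=1)

# Leaky: statistics computed over train AND test rows.
leaky = StandardScaler().fit(X)
# Correct: statistics computed on the training split only.
correct = StandardScaler().fit(X_train)

print(leaky.mean_, correct.mean_)  # the fitted means differ
```

The numeric gap is often small, which is why this leak survives code review; the audit has to check *where* `fit` is called, not the metric values.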
Minimal concrete example
CV RMSE: 0.74 +/- 0.03
Test MAE: 0.51
Residual risk: top decile underpredicted
Common misconceptions
- "Good RMSE means no bias." Correction: residual slices may reveal systematic errors.
Check-your-understanding questions
- Why report both MAE and RMSE?
- What makes a pipeline reproducible?
- Why keep a baseline linear model?
Check-your-understanding answers
- They reflect different penalty sensitivity to large errors.
- Fixed split, fixed config, saved artifacts, versioned environment.
- It anchors complexity decisions with interpretable reference performance.
Real-world applications
- Price forecasting
- Demand estimation
- Cost prediction
Where you’ll apply it
- This project, P05, P06.
References
- scikit-learn getting started: https://scikit-learn.org/stable/getting_started.html
- scikit-learn model evaluation: https://scikit-learn.org/stable/modules/model_evaluation.html
Key insights
Pipeline boundaries are the main defense against accidental optimism.
Summary
A trustworthy regression baseline is reproducible, validated, and diagnostically rich.
Homework/Exercises to practice the concept
- Compare two preprocess variants and explain metric changes.
- Slice residuals by target quantile.
- Re-run from clean environment and compare results.
Solutions to the homework/exercises
- Scaling often improves linear stability.
- High quantile underprediction indicates model misspecification or missing features.
- Minor variation acceptable; major drift indicates reproducibility gaps.
3. Project Specification
3.1 What You Will Build
A command-based regression training flow that outputs metrics, diagnostics, and reusable model artifacts.
3.2 Functional Requirements
- Load and validate dataset schema.
- Execute train/validation/test strategy.
- Train baseline pipeline and at least one alternative.
- Report MAE, RMSE, and residual insights.
- Save artifact and report files.
3.3 Non-Functional Requirements
- Performance: run completes under 2 minutes on laptop hardware.
- Reliability: deterministic outputs with fixed seed.
- Usability: clear report sections and config-driven runs.
3.4 Example Usage / Output
$ ml-lab train-regression --dataset housing.csv --target median_house_value
CV RMSE: 0.742 +/- 0.031
Test MAE: 0.512
Artifact saved
3.5 Data Formats / Schemas / Protocols
- CSV numeric + categorical features.
- Numeric target column.
3.6 Edge Cases
- Missing target.
- All-null feature columns.
- Tiny dataset with insufficient folds.
3.7 Real World Outcome
3.7.1 How to Run (Copy/Paste)
ml-lab train-regression --dataset data/california_housing.csv --target median_house_value
3.7.2 Golden Path Demo (Deterministic)
$ ml-lab train-regression --dataset fixtures/housing_fixed.csv --target target
CV RMSE: 0.741 +/- 0.030
Test MAE: 0.510
Exit code: 0
3.7.3 If CLI: exact terminal transcript
$ ml-lab train-regression --dataset fixtures/housing_fixed.csv --target target
Split: train=80% test=20% seed=42
CV RMSE: 0.741 +/- 0.030
Test MAE: 0.510
Residual note: high-target underprediction
Saved: artifacts/regression.joblib
Exit code: 0
$ ml-lab train-regression --dataset fixtures/housing_bad.csv --target target
ERROR: column target contains non-numeric values
Exit code: 2
4. Solution Architecture
4.1 High-Level Design
Config Loader -> Schema Validator -> Split Engine -> Pipeline Trainer -> Evaluator -> Artifact Writer
4.2 Key Components
| Component | Responsibility | Key Decisions |
|---|---|---|
| Split Engine | holdout/CV strategy | fixed seed and fold count |
| Pipeline Trainer | fit preprocessing + model | one object for both |
| Evaluator | MAE/RMSE/residual slices | include risk notes |
| Artifact Writer | save model/report | versioned filenames |
4.3 Data Structures (No Full Code)
config: {seed, target, folds, metric_primary}
report: {cv_mean, cv_std, test_metrics, residual_notes}
4.4 Algorithm Overview
- Validate config and schema.
- Split dataset.
- Cross-validate pipeline.
- Fit full train and evaluate test.
- Save report and artifact.
Runtime complexity depends on the estimator's fit cost multiplied by the number of CV folds.
5. Implementation Guide
5.1 Development Environment Setup
# install dependencies and prepare the working directories
pip install scikit-learn pandas
mkdir -p data artifacts reports
5.2 Project Structure
regression-project/
|-- configs/
|-- artifacts/
|-- reports/
`-- src/
5.3 The Core Question You’re Answering
Can this model provide reliable numeric estimates under controlled evaluation?
5.4 Concepts You Must Understand First
- Regression metrics
- CV split strategy
- Residual diagnostics
5.5 Questions to Guide Your Design
- Which metric aligns with business tolerance?
- Which feature transformations are justified?
- What constitutes an acceptable residual pattern?
5.6 Thinking Exercise
Draw expected residual shape for well-specified vs misspecified model.
5.7 The Interview Questions They’ll Ask
- MAE vs RMSE: when would you prefer each?
- Why CV plus holdout?
- How do you diagnose heteroscedasticity?
5.8 Hints in Layers
- Baseline first.
- Keep split deterministic.
- Add residual quantile table.
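One way to sketch the residual quantile table from the last hint. The data here is synthetic, with underprediction deliberately injected at high targets to mimic the "top decile underpredicted" finding:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
y_true = rng.lognormal(0, 0.5, 1000)
# Predictions that systematically undershoot large targets
y_pred = y_true * 0.9 + rng.normal(0, 0.1, 1000)

df = pd.DataFrame({"y": y_true, "residual": y_true - y_pred})
# Slice residuals into five target-quantile bands
df["band"] = pd.qcut(df["y"], q=5, labels=[f"Q{i+1}" for i in range(5)])
table = df.groupby("band", observed=True)["residual"].agg(["mean", "std", "count"])
print(table)  # mean residual grows toward the top band -> underprediction
```

A near-zero mean residual in every band is the healthy pattern; a monotone trend like the one produced here is the systematic bias that aggregate MAE/RMSE hide.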
5.9 Books That Will Help
| Topic | Book | Chapter |
|---|---|---|
| End-to-end tabular workflow | Hands-On Machine Learning | Ch. 2 |
| Regression diagnostics | ISL | Ch. 3 |
5.10 Implementation Phases
- Phase 1: split + baseline pipeline.
- Phase 2: alternative preprocessing/model.
- Phase 3: diagnostics + artifacts.
5.11 Key Implementation Decisions
| Decision | Options | Recommendation | Rationale |
|---|---|---|---|
| Primary metric | MAE, RMSE | RMSE primary + MAE secondary | captures large-error risk + interpretability |
| Model baseline | linear only, tree only, both | linear baseline + one nonlinear | defensible comparison |
6. Testing Strategy
6.1 Test Categories
| Category | Purpose | Examples |
|---|---|---|
| Unit | schema and metric checks | non-numeric target failure |
| Integration | full train command | artifact/report written |
| Edge | tiny datasets | fold validation failure |
6.2 Critical Test Cases
- Golden fixture metrics within tolerance.
- Missing target fails with exit code 2.
- Re-run produces same metrics under fixed seed.
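The determinism case can be sketched as a self-contained check; `run_once` here is a hypothetical stand-in for the project's real training entry point:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

def run_once(seed: int) -> float:
    """Train on synthetic data and return the test MAE for a given seed."""
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(100, 3))
    y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(0, 0.1, 100)
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.2, random_state=seed
    )
    model = LinearRegression().fit(X_tr, y_tr)
    return float(np.mean(np.abs(y_te - model.predict(X_te))))

# Same seed -> identical metric, bit for bit.
assert run_once(42) == run_once(42)
print("determinism check passed")
```

In the real test suite the same pattern applies: invoke the training command twice with the fixed seed and assert the reported metrics are equal (or within a documented tolerance if any nondeterministic estimator is used).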
6.3 Test Data
- fixed housing fixture
- malformed schema fixture
7. Common Pitfalls & Debugging
7.1 Frequent Mistakes
| Pitfall | Symptom | Solution |
|---|---|---|
| Test leakage | unrealistically high score | isolate holdout and audit transforms |
| Metric-only view | hidden bias remains | inspect residual slices |
| Unlocked seeds | unstable outputs | fixed seed and recorded config |
7.2 Debugging Strategies
- Compare training vs validation metric gaps.
- Slice errors by feature/target quantiles.
7.3 Performance Traps
Searching overly large hyperparameter grids before establishing a baseline.
8. Extensions & Challenges
8.1 Beginner Extensions
- Add MAE by segment table.
8.2 Intermediate Extensions
- Add robust regression variant.
8.3 Advanced Extensions
- Add uncertainty intervals via bootstrapping.
9. Real-World Connections
9.1 Industry Applications
- Property valuation support
- Demand and cost estimation
9.2 Related Open Source Projects
- scikit-learn examples
- statsmodels diagnostics workflows
9.3 Interview Relevance
Covers leak-resistant evaluation and regression error analysis.
10. Resources
10.1 Essential Reading
- Hands-On Machine Learning Ch. 2
- ISL Ch. 3 and Ch. 5
10.2 Video Resources
- Regression diagnostics tutorials
10.3 Tools & Documentation
- https://scikit-learn.org/
10.4 Related Projects in This Series
- P05 and P06, which reuse this leak-resistant pipeline design.
11. Self-Assessment Checklist
11.1 Understanding
- I can explain why this split strategy is valid.
11.2 Implementation
- Artifacts and report regenerate deterministically.
11.3 Growth
- I documented next model-improvement hypothesis.
12. Submission / Completion Criteria
Minimum Viable Completion:
- Baseline pipeline + MAE/RMSE report.
Full Completion:
- Includes CV variance and residual slices.
Excellence (Going Above & Beyond):
- Includes clear go/no-go recommendation with risk narrative.