Project 5: Feature Engineering and Model Selection Lab

Run controlled experiments across feature sets and model families to separate real gains from noise.

Quick Reference

Attribute Value
Difficulty Level 3: Advanced
Time Estimate 12-18 hours
Main Programming Language Python
Alternative Programming Languages R, Julia
Coolness Level Level 3: Genuinely Clever
Business Potential 2. Micro-SaaS / Pro Tool
Prerequisites CV design, pipeline composition, metric interpretation
Key Topics Feature hypotheses, model comparison, variance-aware ranking

1. Learning Objectives

  1. Design a hypothesis-driven feature experiment matrix.
  2. Compare models fairly under a fixed evaluation protocol.
  3. Rank results by both mean performance and stability.
  4. Produce a short list of deployment-candidate configurations.

2. All Theory Needed (Per-Concept Breakdown)

2.1 Experiment Design for Tabular ML

Fundamentals

Model improvement must come from controlled experiments where only intended factors change.

Deep Dive into the concept

Feature engineering can outperform model-switching when grounded in domain semantics. However, without disciplined experiment design, apparent gains are often noise. You need a fixed evaluation protocol (same folds, same seeds, same primary metric) and a predefined experiment matrix to avoid p-hacking.

Each feature candidate should include rationale, expected gain, and risk (for example, leakage risk, sparsity risk, instability risk). This converts random trial-and-error into testable hypotheses. Model families should then be compared under this fixed feature matrix.

Stability matters. A model with highest mean metric but high variance across folds can fail unpredictably in production. Ranking should combine central tendency and spread. Tie-breakers can include interpretability, inference cost, and operational constraints.
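The fixed-protocol idea can be sketched with scikit-learn: build the fold object once, with one seed, and score every configuration on identical splits. The dataset and configs below are illustrative stand-ins, not the project's real inputs:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in for the real dataset
X, y = make_classification(n_samples=500, n_features=10, random_state=42)

# One fold object, one seed: every config is scored on identical splits
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

configs = {
    "cfg_logreg": LogisticRegression(max_iter=1000),
    "cfg_forest": RandomForestClassifier(n_estimators=50, random_state=42),
}

results = {}
for name, model in configs.items():
    scores = cross_val_score(model, X, y, cv=cv, scoring="roc_auc")
    results[name] = (scores.mean(), scores.std())
    print(f"{name}: ROC-AUC={scores.mean():.3f} +/- {scores.std():.3f}")
```

Because `StratifiedKFold` is created with a fixed `random_state`, every call to `cross_val_score` sees the same splits, so differences between configs cannot come from fold luck.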

How this fits into the projects

Primary concept for P05; it feeds directly into P06's production-style reporting.

Definitions & key terms

  • experiment matrix: structured list of feature/model combinations.
  • variance-aware ranking: compare mean metric and fold variance.
  • p-hacking: unconstrained searching until random win appears.
  • baseline lift: improvement over simple reference model.

Mental model diagram

Hypotheses -> Experiment Matrix -> Fixed CV Protocol -> Ranking -> Recommendation

How it works

  1. Define baseline.
  2. Define candidate features/models before running.
  3. Execute matrix with fixed CV protocol.
  4. Rank by metric + variance.
  5. Validate top candidates on holdout test.
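Step 4 can be made concrete with a simple variance-penalized score. The penalty weight below is an illustrative choice, not a fixed rule:

```python
# Mock fold results: (config_id, mean ROC-AUC, std across folds)
results = [
    ("cfg_01", 0.781, 0.020),
    ("cfg_05", 0.842, 0.014),
    ("cfg_07", 0.845, 0.041),  # slightly higher mean, much less stable
]

PENALTY = 1.0  # illustrative weight on instability

def constrained_score(mean, std, penalty=PENALTY):
    """Penalize unstable configs: higher is better."""
    return mean - penalty * std

ranked = sorted(results, key=lambda r: constrained_score(r[1], r[2]), reverse=True)
best = ranked[0][0]
print(best)  # cfg_05 edges out cfg_07 once variance is penalized
```

With this penalty, cfg_05 (0.842 - 0.014 = 0.828) beats cfg_07 (0.845 - 0.041 = 0.804) despite the lower mean, which is exactly the stability trade-off the ranking step is meant to capture.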

Failure modes:

  • changing folds between experiments,
  • tuning directly on test,
  • undocumented feature logic.

Minimal concrete example

12 configs tested
baseline ROC-AUC = 0.781
best mean ROC-AUC = 0.842 (std=0.014)
selected config: highest constrained score with stable variance

Common misconceptions

  • "Best average score always wins." Correction: high variance and high complexity can make the top-scoring config fragile in production.

Check-your-understanding questions

  1. Why predefine the experiment matrix before running?
  2. How does variance influence deployment confidence?
  3. Why keep a baseline throughout?

Check-your-understanding answers

  1. Prevents selective reporting and p-hacking.
  2. High variance implies unstable field performance.
  3. Baseline anchors whether complexity adds real value.

Real-world applications

  • Churn model tuning
  • Lead-scoring improvement cycles
  • Fraud detection feature iteration

Where you’ll apply it

  • This project and P06 governance/reporting.

References

  • scikit-learn cross-validation docs: https://scikit-learn.org/stable/modules/cross_validation.html

Key insights

Improvement without controlled comparison is not trustworthy improvement.

Summary

This project teaches experiment discipline, not only model tweaking.

Homework/Exercises to practice the concept

  1. Write a 10-config experiment matrix with rationale fields.
  2. Rank mock results using mean and variance.
  3. Identify one likely leakage-prone feature candidate.

Solutions to the homework/exercises

  1. Include feature hypothesis, expected impact, and risk score.
  2. Penalize unstable configs when means are close.
  3. Features derived after target event are leakage risks.

3. Project Specification

3.1 What You Will Build

An experiment runner and ranking report for feature/model combinations using a fixed validation protocol.

3.2 Functional Requirements

  1. Support configurable feature sets and estimators.
  2. Run CV for each config under same folds.
  3. Log mean/std metrics and runtime.
  4. Export ranking and top-candidate recommendations.

3.3 Non-Functional Requirements

  • Performance: complete the full matrix in under 10 minutes on the reference dataset.
  • Reliability: deterministic rankings with fixed seeds.
  • Usability: clear CSV/markdown outputs.

3.4 Example Usage / Output

$ ml-lab feature-lab --config configs/feature_lab.yaml
Executed 12 configs
Best constrained score: cfg_07
Report: reports/feature_lab_rankings.csv

3.5 Data Formats / Schemas / Protocols

  • Input dataset with explicit target.
  • Config file describing feature transforms and estimator params.
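A minimal config entry might look like the following; field names follow the `experiment_config` structure in section 4.3, but the exact schema is up to you. It is shown here as the Python dict equivalent of one YAML entry, with a strict validation check as recommended for the Config Parser:

```python
# One experiment_config entry, mirroring a YAML config file entry
# (feature and param names are illustrative)
config = {
    "id": "cfg_05",
    "features": ["tenure_months", "monthly_charges", "tenure_x_charges"],
    "model": "gradient_boosting",
    "params": {"n_estimators": 200, "learning_rate": 0.05},
}

REQUIRED_KEYS = {"id", "features", "model", "params"}

def validate_config(cfg):
    """Strict schema check: fail fast on missing fields."""
    missing = REQUIRED_KEYS - cfg.keys()
    if missing:
        raise ValueError(f"config {cfg.get('id', '?')} missing fields: {sorted(missing)}")
    return True

print(validate_config(config))
```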

3.6 Edge Cases

  • Config references missing column.
  • Model training failure for one config should be logged, not crash full matrix.
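The second edge case (one failing config must not abort the run) comes down to a guard around each execution. `run_config` below is a hypothetical per-config runner standing in for the real one:

```python
def run_config(cfg):
    """Hypothetical per-config runner; raises on broken configs."""
    if cfg["id"] == "cfg_bad":
        raise KeyError("feature 'contract_age_days' not found in schema")
    return {"id": cfg["id"], "metric_mean": 0.80, "metric_std": 0.02}

configs = [{"id": "cfg_01"}, {"id": "cfg_bad"}, {"id": "cfg_02"}]

results, failures = [], []
for cfg in configs:
    try:
        results.append(run_config(cfg))
    except Exception as exc:
        # Log the failure and continue: the rest of the matrix still runs
        failures.append({"id": cfg["id"], "error": str(exc)})

print(len(results), len(failures))  # 2 1
```

The failures list then becomes the `notes` material for the report, so a broken config is visible in the output rather than silently missing.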

3.7 Real World Outcome

3.7.1 How to Run (Copy/Paste)

ml-lab feature-lab --dataset data/churn.csv --config configs/feature_lab.yaml

3.7.2 Golden Path Demo (Deterministic)

$ ml-lab feature-lab --dataset fixtures/churn_fixed.csv --config fixtures/lab_fixed.yaml
Configs run: 8
Best config: cfg_05
Exit code: 0

3.7.3 If CLI: exact terminal transcript

$ ml-lab feature-lab --dataset fixtures/churn_fixed.csv --config fixtures/lab_fixed.yaml
Running 8 configurations...
cfg_01 ROC-AUC=0.781 +/- 0.020
cfg_05 ROC-AUC=0.842 +/- 0.014
Top recommendation: cfg_05
Saved ranking: reports/rankings.csv
Exit code: 0

$ ml-lab feature-lab --dataset fixtures/churn_fixed.csv --config fixtures/bad_feature.yaml
ERROR: feature "contract_age_days" not found in schema
Exit code: 2

4. Solution Architecture

4.1 High-Level Design

Config Parser -> Feature Builder -> CV Runner -> Metrics Aggregator -> Ranker -> Reporter

4.2 Key Components

Component Responsibility Key Decisions
Config Parser load experiment definitions strict schema checks
CV Runner execute consistent folds shared fold indices
Ranker score configs mean + variance strategy
Reporter write artifacts CSV + markdown summary

4.3 Data Structures (No Full Code)

experiment_config: {id, features, model, params}
result_row: {id, metric_mean, metric_std, runtime, notes}
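The two structures above translate directly into dataclasses; this is a sketch, and the field types are assumptions:

```python
from dataclasses import dataclass, field

@dataclass
class ExperimentConfig:
    id: str
    features: list
    model: str
    params: dict = field(default_factory=dict)

@dataclass
class ResultRow:
    id: str
    metric_mean: float
    metric_std: float
    runtime: float
    notes: str = ""

# Example row as the Metrics Aggregator might emit it
row = ResultRow(id="cfg_05", metric_mean=0.842, metric_std=0.014, runtime=12.3)
print(row.id, row.metric_mean)
```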

4.4 Algorithm Overview

  1. Parse and validate configs.
  2. Build consistent folds.
  3. Execute each config.
  4. Aggregate metrics.
  5. Rank and export recommendations.

5. Implementation Guide

5.1 Development Environment Setup

# create a virtual environment, install dependencies (e.g., pandas, scikit-learn, pyyaml),
# then create the configs/, reports/, and artifacts/ directories

5.2 Project Structure

feature-lab/
|-- configs/
|-- reports/
|-- artifacts/
`-- src/

5.3 The Core Question You’re Answering

Which feature decisions consistently improve generalization under fair comparison?

5.4 Concepts You Must Understand First

  • CV protocol control
  • Feature transformation semantics
  • Ranking under uncertainty

5.5 Questions to Guide Your Design

  • Which features are hypothesis-backed?
  • What metric threshold defines meaningful lift?
  • How much variance is acceptable?

5.6 Thinking Exercise

Create a feature hypothesis sheet with expected direction and risk.
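One possible shape for the sheet, with each row carrying a feature name, rationale, expected direction, and risk (the feature names and risks below are illustrative):

```python
# A feature hypothesis sheet as structured rows
hypotheses = [
    {
        "feature": "tenure_x_charges",
        "rationale": "long-tenure, high-spend customers churn differently",
        "expected_direction": "increase ROC-AUC",
        "risk": "instability from sparse interactions",
    },
    {
        "feature": "days_since_last_ticket",
        "rationale": "recent support contact signals dissatisfaction",
        "expected_direction": "increase ROC-AUC",
        "risk": "leakage if tickets are logged after the churn event",
    },
]

for h in hypotheses:
    print(f"{h['feature']}: {h['expected_direction']} (risk: {h['risk']})")
```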

5.7 The Interview Questions They’ll Ask

  1. How do you avoid overfitting during feature engineering?
  2. Why can cross-validation still mislead?
  3. What does stable lift look like?

5.8 Hints in Layers

  • lock protocol before experiments
  • log failures per config
  • rank with variance penalties

5.9 Books That Will Help

Topic Book Chapter
Feature engineering patterns Hands-On Machine Learning Ch. 2
Model assessment ISL Ch. 5

5.10 Implementation Phases

  • Phase 1: config + runner skeleton.
  • Phase 2: metrics aggregation and ranking.
  • Phase 3: recommendation report and test holdout validation.

5.11 Key Implementation Decisions

Decision Options Recommendation Rationale
Ranking rule mean only, mean+std mean+std stability-aware
Config format ad hoc, structured file structured file reproducibility

6. Testing Strategy

6.1 Test Categories

Category Purpose Examples
Unit config parsing missing fields
Integration full matrix run ranking output
Edge partial failures skip + continue behavior

6.2 Critical Test Cases

  1. Fixed matrix produces stable ranking order.
  2. Broken config fails with explicit error.
  3. One failing estimator does not abort full run.
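Test case 1 (a fixed matrix yields a stable ranking) can be expressed as a plain assertion. `rank_configs` here is a hypothetical ranking function sketched inline; the real one lives in the Ranker component:

```python
def rank_configs(results, penalty=1.0):
    """Deterministic variance-penalized ranking (illustrative)."""
    return sorted(results, key=lambda r: r["mean"] - penalty * r["std"], reverse=True)

fixed_results = [
    {"id": "cfg_01", "mean": 0.781, "std": 0.020},
    {"id": "cfg_05", "mean": 0.842, "std": 0.014},
]

# Same inputs must always yield the same order
first = [r["id"] for r in rank_configs(fixed_results)]
second = [r["id"] for r in rank_configs(fixed_results)]
assert first == second == ["cfg_05", "cfg_01"]
print("stable ranking:", first)
```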

6.3 Test Data

  • fixed churn fixture
  • bad-config fixture

7. Common Pitfalls & Debugging

7.1 Frequent Mistakes

Pitfall Symptom Solution
Protocol drift unfair comparisons reuse same folds
Undocumented features irreproducible gains enforce config schema
Mean-only ranking unstable winner include variance penalty

7.2 Debugging Strategies

  • print config IDs in every metric line.
  • verify fold reuse across configs.

7.3 Performance Traps

Running large search spaces before pruning low-value candidates.


8. Extensions & Challenges

8.1 Beginner Extensions

  • Add simple HTML leaderboard export.

8.2 Intermediate Extensions

  • Add nested CV mode for stricter tuning.

8.3 Advanced Extensions

  • Add experiment tracking backend integration.

9. Real-World Connections

9.1 Industry Applications

  • Growth model optimization
  • Risk model iteration programs

9.2 Industry Tools

  • Optuna (for search workflows)
  • MLflow (for tracking)

9.3 Interview Relevance

Demonstrates experiment rigor and practical model selection discipline.


10. Resources

10.1 Essential Reading

  • scikit-learn cross-validation docs
  • Hands-On ML Ch. 2

10.2 Video Resources

  • practical model selection talks

10.3 Tools & Documentation

  • https://scikit-learn.org/stable/modules/cross_validation.html

11. Self-Assessment Checklist

11.1 Understanding

  • I can explain why my top config beat baseline.

11.2 Implementation

  • Ranking is reproducible with fixed seed and folds.

11.3 Growth

  • I documented one rejected config and why.

12. Submission / Completion Criteria

Minimum Viable Completion:

  • Complete matrix run + ranking export.

Full Completion:

  • Includes variance-aware recommendation and holdout check.

Excellence (Going Above & Beyond):

  • Includes structured experiment logs and clear governance notes.