Project 5: Feature Engineering and Model Selection Lab

Run controlled experiments across feature sets and model families to separate real gains from noise.

Quick Reference

Attribute Value
Difficulty Level 3: Advanced
Time Estimate 12-18 hours
Main Programming Language Python
Alternative Programming Languages R, Julia
Coolness Level Level 3: Genuinely Clever
Business Potential 2. Micro-SaaS / Pro Tool
Prerequisites CV design, pipeline composition, metric interpretation
Key Topics Feature hypotheses, model comparison, variance-aware ranking

1. Learning Objectives

  1. Design a hypothesis-driven feature experiment matrix.
  2. Compare models fairly under a fixed evaluation protocol.
  3. Rank results by both mean performance and stability.
  4. Produce a short list of deployment-candidate configurations.

2. All Theory Needed (Per-Concept Breakdown)

2.1 Experiment Design for Tabular ML

Fundamentals

Model improvement must come from controlled experiments where only intended factors change.

Deep Dive into the concept

Feature engineering can outperform model-switching when grounded in domain semantics. However, without disciplined experiment design, apparent gains are often noise. You need a fixed evaluation protocol (same folds, same seeds, same primary metric) and a predefined experiment matrix to avoid p-hacking.

Each feature candidate should include rationale, expected gain, and risk (for example, leakage risk, sparsity risk, instability risk). This converts random trial-and-error into testable hypotheses. Model families should then be compared under this fixed feature matrix.

Stability matters. A model with highest mean metric but high variance across folds can fail unpredictably in production. Ranking should combine central tendency and spread. Tie-breakers can include interpretability, inference cost, and operational constraints.
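The fixed-protocol idea can be sketched with scikit-learn: build the fold object once, with one seed, and score every configuration on identical splits. The dataset and configs below are illustrative stand-ins, not the project's real inputs:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in for the real dataset
X, y = make_classification(n_samples=500, n_features=10, random_state=42)

# One fold object, one seed: every config is scored on identical splits
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

configs = {
    "cfg_logreg": LogisticRegression(max_iter=1000),
    "cfg_forest": RandomForestClassifier(n_estimators=50, random_state=42),
}

results = {}
for name, model in configs.items():
    scores = cross_val_score(model, X, y, cv=cv, scoring="roc_auc")
    results[name] = (scores.mean(), scores.std())
    print(f"{name}: ROC-AUC={scores.mean():.3f} +/- {scores.std():.3f}")
```

Because `StratifiedKFold` is created with a fixed `random_state`, every call to `cross_val_score` sees the same splits, so differences between configs cannot come from fold luck.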

How this fits into the projects

Primary concept for P05; it feeds directly into P06's production-style reporting.

Definitions & key terms

  • experiment matrix: structured list of feature/model combinations.
  • variance-aware ranking: compare mean metric and fold variance.
  • p-hacking: unconstrained searching until random win appears.
  • baseline lift: improvement over simple reference model.

Mental model diagram

Hypotheses -> Experiment Matrix -> Fixed CV Protocol -> Ranking -> Recommendation

How it works

  1. Define baseline.
  2. Define candidate features/models before running.
  3. Execute matrix with fixed CV protocol.
  4. Rank by metric + variance.
  5. Validate top candidates on holdout test.
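Step 4 can be made concrete with a simple variance-penalized score. The penalty weight below is an illustrative choice, not a fixed rule:

```python
# Mock fold results: (config_id, mean ROC-AUC, std across folds)
results = [
    ("cfg_01", 0.781, 0.020),
    ("cfg_05", 0.842, 0.014),
    ("cfg_07", 0.845, 0.041),  # slightly higher mean, much less stable
]

PENALTY = 1.0  # illustrative weight on instability

def constrained_score(mean, std, penalty=PENALTY):
    """Penalize unstable configs: higher is better."""
    return mean - penalty * std

ranked = sorted(results, key=lambda r: constrained_score(r[1], r[2]), reverse=True)
best = ranked[0][0]
print(best)  # cfg_05 edges out cfg_07 once variance is penalized
```

With this penalty, cfg_05 (0.842 - 0.014 = 0.828) beats cfg_07 (0.845 - 0.041 = 0.804) despite the lower mean, which is exactly the stability trade-off the ranking step is meant to capture.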

Failure modes:

  • changing folds between experiments,
  • tuning directly on test,
  • undocumented feature logic.

Minimal concrete example

12 configs tested
baseline ROC-AUC = 0.781
best mean ROC-AUC = 0.842 (std=0.014)
selected config: highest constrained score with stable variance

Common misconceptions

  • "Best average score always wins." Correction: high variance and high complexity can make the top-scoring config fragile in production.

Check-your-understanding questions

  1. Why predefine the experiment matrix before running?
  2. How does variance influence deployment confidence?
  3. Why keep a baseline throughout?

Check-your-understanding answers

  1. Prevents selective reporting and p-hacking.
  2. High variance implies unstable field performance.
  3. Baseline anchors whether complexity adds real value.

Real-world applications

  • Churn model tuning
  • Lead-scoring improvement cycles
  • Fraud detection feature iteration

Where you’ll apply it

  • This project and P06 governance/reporting.

References

  • scikit-learn cross-validation docs: https://scikit-learn.org/stable/modules/cross_validation.html

Key insights

Improvement without controlled comparison is not trustworthy improvement.

Summary

This project teaches experiment discipline, not only model tweaking.

Homework/Exercises to practice the concept

  1. Write a 10-config experiment matrix with rationale fields.
  2. Rank mock results using mean and variance.
  3. Identify one likely leakage-prone feature candidate.

Solutions to the homework/exercises

  1. Include feature hypothesis, expected impact, and risk score.
  2. Penalize unstable configs when means are close.
  3. Features derived after target event are leakage risks.

3. Project Specification

3.1 What You Will Build

An experiment runner and ranking report for feature/model combinations using a fixed validation protocol.

3.2 Functional Requirements

  1. Support configurable feature sets and estimators.
  2. Run CV for each config under same folds.
  3. Log mean/std metrics and runtime.
  4. Export ranking and top-candidate recommendations.

3.3 Non-Functional Requirements

  • Performance: complete the full matrix in under 10 minutes on the reference dataset.
  • Reliability: deterministic rankings with fixed seeds.
  • Usability: clear CSV/markdown outputs.

3.4 Example Usage / Output

$ ml-lab feature-lab --config configs/feature_lab.yaml
Executed 12 configs
Best constrained score: cfg_07
Report: reports/feature_lab_rankings.csv

3.5 Data Formats / Schemas / Protocols

  • Input dataset with explicit target.
  • Config file describing feature transforms and estimator params.
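A minimal config entry might look like the following; field names follow the `experiment_config` structure in section 4.3, but the exact schema is up to you. It is shown here as the Python dict equivalent of one YAML entry, with a strict validation check as recommended for the Config Parser:

```python
# One experiment_config entry, mirroring a YAML config file entry
# (feature and param names are illustrative)
config = {
    "id": "cfg_05",
    "features": ["tenure_months", "monthly_charges", "tenure_x_charges"],
    "model": "gradient_boosting",
    "params": {"n_estimators": 200, "learning_rate": 0.05},
}

REQUIRED_KEYS = {"id", "features", "model", "params"}

def validate_config(cfg):
    """Strict schema check: fail fast on missing fields."""
    missing = REQUIRED_KEYS - cfg.keys()
    if missing:
        raise ValueError(f"config {cfg.get('id', '?')} missing fields: {sorted(missing)}")
    return True

print(validate_config(config))
```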

3.6 Edge Cases

  • Config references missing column.
  • Model training failure for one config should be logged, not crash full matrix.
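The second edge case (one failing config must not abort the run) comes down to a guard around each execution. `run_config` below is a hypothetical per-config runner standing in for the real one:

```python
def run_config(cfg):
    """Hypothetical per-config runner; raises on broken configs."""
    if cfg["id"] == "cfg_bad":
        raise KeyError("feature 'contract_age_days' not found in schema")
    return {"id": cfg["id"], "metric_mean": 0.80, "metric_std": 0.02}

configs = [{"id": "cfg_01"}, {"id": "cfg_bad"}, {"id": "cfg_02"}]

results, failures = [], []
for cfg in configs:
    try:
        results.append(run_config(cfg))
    except Exception as exc:
        # Log the failure and continue: the rest of the matrix still runs
        failures.append({"id": cfg["id"], "error": str(exc)})

print(len(results), len(failures))  # 2 1
```

The failures list then becomes the `notes` material for the report, so a broken config is visible in the output rather than silently missing.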

3.7 Real World Outcome

3.7.1 How to Run (Copy/Paste)

ml-lab feature-lab --dataset data/churn.csv --config configs/feature_lab.yaml

3.7.2 Golden Path Demo (Deterministic)

$ ml-lab feature-lab --dataset fixtures/churn_fixed.csv --config fixtures/lab_fixed.yaml
Configs run: 8
Best config: cfg_05
Exit code: 0

3.7.3 If CLI: exact terminal transcript

$ ml-lab feature-lab --dataset fixtures/churn_fixed.csv --config fixtures/lab_fixed.yaml
Running 8 configurations...
cfg_01 ROC-AUC=0.781 +/- 0.020
cfg_05 ROC-AUC=0.842 +/- 0.014
Top recommendation: cfg_05
Saved ranking: reports/rankings.csv
Exit code: 0

$ ml-lab feature-lab --dataset fixtures/churn_fixed.csv --config fixtures/bad_feature.yaml
ERROR: feature "contract_age_days" not found in schema
Exit code: 2

4. Solution Architecture

4.1 High-Level Design

Config Parser -> Feature Builder -> CV Runner -> Metrics Aggregator -> Ranker -> Reporter

4.2 Key Components

Component Responsibility Key Decisions
Config Parser load experiment definitions strict schema checks
CV Runner execute consistent folds shared fold indices
Ranker score configs mean + variance strategy
Reporter write artifacts CSV + markdown summary

4.3 Data Structures (No Full Code)

experiment_config: {id, features, model, params}
result_row: {id, metric_mean, metric_std, runtime, notes}
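The two structures above translate directly into dataclasses; this is a sketch, and the field types are assumptions:

```python
from dataclasses import dataclass, field

@dataclass
class ExperimentConfig:
    id: str
    features: list
    model: str
    params: dict = field(default_factory=dict)

@dataclass
class ResultRow:
    id: str
    metric_mean: float
    metric_std: float
    runtime: float
    notes: str = ""

# Example row as the Metrics Aggregator might emit it
row = ResultRow(id="cfg_05", metric_mean=0.842, metric_std=0.014, runtime=12.3)
print(row.id, row.metric_mean)
```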

4.4 Algorithm Overview

  1. Parse and validate configs.
  2. Build consistent folds.
  3. Execute each config.
  4. Aggregate metrics.
  5. Rank and export recommendations.

5. Implementation Guide

5.1 Development Environment Setup

# create a virtual environment, install dependencies (e.g., pandas, scikit-learn, pyyaml),
# then create the configs/, reports/, and artifacts/ directories

5.2 Project Structure

feature-lab/
|-- configs/
|-- reports/
|-- artifacts/
`-- src/

5.3 The Core Question You’re Answering

Which feature decisions consistently improve generalization under fair comparison?

5.4 Concepts You Must Understand First

  • CV protocol control
  • Feature transformation semantics
  • Ranking under uncertainty

5.5 Questions to Guide Your Design

  • Which features are hypothesis-backed?
  • What metric threshold defines meaningful lift?
  • How much variance is acceptable?

5.6 Thinking Exercise

Create a feature hypothesis sheet with expected direction and risk.
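One possible shape for the sheet, with each row carrying a feature name, rationale, expected direction, and risk (the feature names and risks below are illustrative):

```python
# A feature hypothesis sheet as structured rows
hypotheses = [
    {
        "feature": "tenure_x_charges",
        "rationale": "long-tenure, high-spend customers churn differently",
        "expected_direction": "increase ROC-AUC",
        "risk": "instability from sparse interactions",
    },
    {
        "feature": "days_since_last_ticket",
        "rationale": "recent support contact signals dissatisfaction",
        "expected_direction": "increase ROC-AUC",
        "risk": "leakage if tickets are logged after the churn event",
    },
]

for h in hypotheses:
    print(f"{h['feature']}: {h['expected_direction']} (risk: {h['risk']})")
```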

5.7 The Interview Questions They’ll Ask

  1. How do you avoid overfitting during feature engineering?
  2. Why can cross-validation still mislead?
  3. What does stable lift look like?

5.8 Hints in Layers

  • lock protocol before experiments
  • log failures per config
  • rank with variance penalties

5.9 Books That Will Help

Topic Book Chapter
Feature engineering patterns Hands-On Machine Learning Ch. 2
Model assessment ISL Ch. 5

5.10 Implementation Phases

  • Phase 1: config + runner skeleton.
  • Phase 2: metrics aggregation and ranking.
  • Phase 3: recommendation report and test holdout validation.

5.11 Key Implementation Decisions

Decision Options Recommendation Rationale
Ranking rule mean only, mean+std mean+std stability-aware
Config format ad hoc, structured file structured file reproducibility

6. Testing Strategy

6.1 Test Categories

Category Purpose Examples
Unit config parsing missing fields
Integration full matrix run ranking output
Edge partial failures skip + continue behavior

6.2 Critical Test Cases

  1. Fixed matrix produces stable ranking order.
  2. Broken config fails with explicit error.
  3. One failing estimator does not abort full run.
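Test case 1 (a fixed matrix yields a stable ranking) can be expressed as a plain assertion. `rank_configs` here is a hypothetical ranking function sketched inline; the real one lives in the Ranker component:

```python
def rank_configs(results, penalty=1.0):
    """Deterministic variance-penalized ranking (illustrative)."""
    return sorted(results, key=lambda r: r["mean"] - penalty * r["std"], reverse=True)

fixed_results = [
    {"id": "cfg_01", "mean": 0.781, "std": 0.020},
    {"id": "cfg_05", "mean": 0.842, "std": 0.014},
]

# Same inputs must always yield the same order
first = [r["id"] for r in rank_configs(fixed_results)]
second = [r["id"] for r in rank_configs(fixed_results)]
assert first == second == ["cfg_05", "cfg_01"]
print("stable ranking:", first)
```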

6.3 Test Data

  • fixed churn fixture
  • bad-config fixture

7. Common Pitfalls & Debugging

7.1 Frequent Mistakes

Pitfall Symptom Solution
Protocol drift unfair comparisons reuse same folds
Undocumented features irreproducible gains enforce config schema
Mean-only ranking unstable winner include variance penalty

7.2 Debugging Strategies

  • print config IDs in every metric line.
  • verify fold reuse across configs.

7.3 Performance Traps

Running large search spaces before pruning low-value candidates.


8. Extensions & Challenges

8.1 Beginner Extensions

  • Add simple HTML leaderboard export.

8.2 Intermediate Extensions

  • Add nested CV mode for stricter tuning.

8.3 Advanced Extensions

  • Add experiment tracking backend integration.

9. Real-World Connections

9.1 Industry Applications

  • Growth model optimization
  • Risk model iteration programs

9.2 Industry Tools

  • Optuna (for search workflows)
  • MLflow (for tracking)

9.3 Interview Relevance

Demonstrates experiment rigor and practical model selection discipline.


10. Resources

10.1 Essential Reading

  • scikit-learn cross-validation docs
  • Hands-On ML Ch. 2

10.2 Video Resources

  • practical model selection talks

10.3 Tools & Documentation

  • https://scikit-learn.org/stable/modules/cross_validation.html

11. Self-Assessment Checklist

11.1 Understanding

  • I can explain why my top config beat baseline.

11.2 Implementation

  • Ranking is reproducible with fixed seed and folds.

11.3 Growth

  • I documented one rejected config and why.

12. Submission / Completion Criteria

Minimum Viable Completion:

  • Complete matrix run + ranking export.

Full Completion:

  • Includes variance-aware recommendation and holdout check.

Excellence (Going Above & Beyond):

  • Includes structured experiment logs and clear governance notes.