Project 4: Iris/Wine Classification Benchmark
Compare classification models with threshold-aware metrics and class-level error analysis.
Quick Reference
| Attribute | Value |
|---|---|
| Difficulty | Level 2: Intermediate |
| Time Estimate | 8-12 hours |
| Main Programming Language | Python |
| Alternative Programming Languages | R, Julia |
| Coolness Level | Level 2: Practical but Forgettable |
| Business Potential | Level 1: Resume Gold |
| Prerequisites | Classification metrics, confusion matrix basics, CV awareness |
| Key Topics | Precision/recall trade-offs, thresholding, model comparison |
1. Learning Objectives
- Benchmark two classifier families under identical split policies.
- Choose threshold based on operational constraints.
- Interpret confusion matrix beyond overall accuracy.
- Produce class-wise risk notes for decision-makers.
2. All Theory Needed (Per-Concept Breakdown)
2.1 Metric and Threshold Decision Framework
Fundamentals
Classification metrics capture different error trade-offs. A threshold converts predicted probabilities into decisions and should reflect cost and capacity constraints.
Deep Dive into the concept
Accuracy is simple but often misleading when class frequencies are uneven. Precision reflects positive prediction quality; recall reflects positive coverage. F1 balances both but still hides class-level asymmetries. Macro and weighted variants change how class imbalance affects aggregate scores.
Thresholding is where modeling meets operations. A lower threshold increases recall but may flood manual review queues with false positives. A higher threshold improves precision but may miss critical positives. Therefore the threshold should be selected on validation data using explicit constraints (for example, a precision floor or a maximum daily alert count).
The confusion matrix is the fastest way to diagnose systematic class failures. If one class has poor recall, additional data, better features, or class weighting may be required. ROC-AUC and PR-AUC help compare ranking quality, but they do not, by themselves, choose an operating point.
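The ranking-vs-operating-point distinction can be made concrete: ROC-AUC equals the probability that a randomly chosen positive outranks a randomly chosen negative. A minimal stdlib sketch with made-up scores:

```python
# Sketch: ROC-AUC as a pairwise ranking probability. Scores below are
# toy values, not project data. Ties count as half a win.
def roc_auc(pos_scores, neg_scores):
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos_scores
        for n in neg_scores
    )
    return wins / (len(pos_scores) * len(neg_scores))

auc = roc_auc([0.9, 0.7, 0.4], [0.6, 0.3, 0.2])
print(round(auc, 3))  # 0.889
```

Two models can share this pairwise ranking quality yet behave very differently at any single threshold, which is why the operating point is a separate decision.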
How this fits into the projects
Core for P04 and reused in P05/P06 reporting.
Definitions & key terms
- Precision: TP / (TP + FP)
- Recall: TP / (TP + FN)
- Threshold: probability cutoff for positive class
- Macro-F1: class-average F1 treating classes equally
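A minimal sketch of these definitions applied per class, with invented counts for a hypothetical three-class problem (the project itself can lean on sklearn.metrics instead):

```python
# Sketch: precision, recall, and macro-F1 from raw per-class counts.
# The (TP, FP, FN) numbers are made-up illustration values.

def precision(tp, fp):
    return tp / (tp + fp) if (tp + fp) else 0.0

def recall(tp, fn):
    return tp / (tp + fn) if (tp + fn) else 0.0

def f1(p, r):
    return 2 * p * r / (p + r) if (p + r) else 0.0

# Per-class (TP, FP, FN) counts for a hypothetical 3-class problem.
counts = {"A": (40, 10, 5), "B": (30, 5, 15), "C": (8, 2, 12)}

f1s = [f1(precision(tp, fp), recall(tp, fn)) for tp, fp, fn in counts.values()]
macro_f1 = sum(f1s) / len(f1s)  # macro: every class weighted equally
print(round(macro_f1, 3))  # 0.708
```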
Mental model diagram
probabilities -> threshold -> confusion matrix -> precision/recall -> operational decision
How it works
- Train models and get probability outputs.
- Sweep candidate thresholds.
- Compute precision/recall and workload proxy.
- Select threshold under constraints.
- Validate on held-out test split.
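The steps above can be sketched end-to-end with toy validation scores; the 0.60 precision floor is the example policy used later in this section, and every number is illustrative:

```python
# Sketch: sweep thresholds on validation scores, then select under a
# precision-floor policy. Scores and labels are toy values.
scores = [0.95, 0.80, 0.70, 0.55, 0.45, 0.30, 0.20, 0.10]
labels = [1, 1, 0, 1, 0, 1, 0, 0]  # ground truth on the validation split

def pr_at(threshold):
    preds = [int(s >= threshold) for s in scores]
    tp = sum(p and y for p, y in zip(preds, labels))
    fp = sum(p and not y for p, y in zip(preds, labels))
    fn = sum((not p) and y for p, y in zip(preds, labels))
    prec = tp / (tp + fp) if (tp + fp) else 0.0
    rec = tp / (tp + fn) if (tp + fn) else 0.0
    return prec, rec, sum(preds)  # workload proxy = alerts raised

PRECISION_FLOOR = 0.60
candidates = [t / 100 for t in range(5, 100, 5)]
feasible = [(t, *pr_at(t)) for t in candidates if pr_at(t)[0] >= PRECISION_FLOOR]
# Among thresholds meeting the floor, keep the one with the best recall.
best = max(feasible, key=lambda row: row[2])
print(best)
```

On real data the final step is to freeze this threshold and confirm behavior once on the held-out test split, never to re-tune there.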
Failure modes:
- relying on accuracy alone
- tuning the threshold on the test set
- ignoring class-wise performance
Minimal concrete example
threshold 0.30: precision 0.58, recall 0.80
threshold 0.42: precision 0.65, recall 0.71
policy: precision floor 0.60 -> choose 0.42
Common misconceptions
- "Highest AUC means the best production model." Correction: threshold choice and operational constraints still decide viability.
Check-your-understanding questions
- Why can two models with similar AUC behave differently at a chosen threshold?
- Why is macro-F1 useful in imbalance?
- Why tune threshold on validation only?
Check-your-understanding answers
- Local curve shape near operating region can differ.
- It prevents majority class dominance in aggregate score.
- To keep final test estimate unbiased.
Real-world applications
- Fraud/abuse screening
- Medical triage support
- Customer churn alerts
Where you’ll apply it
- This project and P06 decision report section.
References
- scikit-learn metrics docs: https://scikit-learn.org/stable/modules/model_evaluation.html
Key insights
A classifier is only useful when its thresholded behavior fits operational reality.
Summary
Model ranking and decision thresholding are distinct tasks that must both be handled rigorously.
Homework/Exercises to practice the concept
- Build threshold table with precision/recall/workload.
- Compare macro-F1 vs weighted-F1 on imbalanced data.
- Explain one class-specific failure from confusion matrix.
Solutions to the homework/exercises
- Choose threshold meeting policy constraints.
- Macro-F1 reveals minority class degradation hidden in weighted metrics.
- Use class-level feature coverage analysis to hypothesize cause.
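For the macro-vs-weighted exercise, a toy imbalanced label set makes the effect visible; the helper below is a plain-Python stand-in for sklearn.metrics.f1_score with average="macro" / "weighted":

```python
# Sketch: macro-F1 exposes a minority-class failure that weighted-F1
# hides. Labels below are invented for illustration.

def f1_per_class(y_true, y_pred, cls):
    tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
    fp = sum(t != cls and p == cls for t, p in zip(y_true, y_pred))
    fn = sum(t == cls and p != cls for t, p in zip(y_true, y_pred))
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    return 2 * prec * rec / (prec + rec) if prec + rec else 0.0

y_true = [0] * 18 + [1] * 2
y_pred = [0] * 20  # model never predicts the minority class

f1s = {c: f1_per_class(y_true, y_pred, c) for c in (0, 1)}
macro = sum(f1s.values()) / 2                       # treats classes equally
weighted = sum(f1s[c] * y_true.count(c) / 20 for c in (0, 1))  # support-weighted
print(round(macro, 2), round(weighted, 2))  # 0.47 0.85
```

The weighted score looks healthy because the majority class dominates it; the macro score drops sharply, flagging the minority-class collapse.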
3. Project Specification
3.1 What You Will Build
A benchmark script/report comparing two classifiers and one threshold policy recommendation.
3.2 Functional Requirements
- Train at least two classifiers with same split/CV plan.
- Generate confusion matrix and class-wise metrics.
- Sweep thresholds and pick one by policy.
- Output recommendation report with caveats.
3.3 Non-Functional Requirements
- Performance: full benchmark under 2 minutes for reference dataset.
- Reliability: deterministic output for fixed seeds.
- Usability: one command to reproduce all outputs.
3.4 Example Usage / Output
$ ml-lab classify-benchmark --dataset wine.csv --target quality
Best model: random_forest
Chosen threshold: 0.42
Macro-F1: 0.76
3.5 Data Formats / Schemas / Protocols
- Input CSV with categorical target.
- Output report + confusion matrix image.
3.6 Edge Cases
- Single-class training fold should be detected and handled.
- Missing class labels in test set should be reported with caution.
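A possible shape for both guards, with illustrative function names (not a required API); the error message mirrors the CLI transcript in 3.7.3:

```python
# Sketch: detect single-class folds and classes absent from the test
# split. Function names here are placeholders, not a required API.
from collections import Counter

def check_fold_classes(y, min_classes=2):
    counts = Counter(y)
    if len(counts) < min_classes:
        raise ValueError(f"target class count < {min_classes} after filtering")
    return counts

def missing_test_classes(train_labels, test_labels):
    # Classes seen in training but absent from test should get a caution note.
    return sorted(set(train_labels) - set(test_labels))

print(missing_test_classes(["a", "b", "c"], ["a", "b"]))  # ['c']
```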
3.7 Real World Outcome
3.7.1 How to Run (Copy/Paste)
ml-lab classify-benchmark --dataset data/wine.csv --target quality_label
3.7.2 Golden Path Demo (Deterministic)
$ ml-lab classify-benchmark --dataset fixtures/wine_fixed.csv --target label
Best model: random_forest
Threshold: 0.42
Exit code: 0
3.7.3 If CLI: exact terminal transcript
$ ml-lab classify-benchmark --dataset fixtures/wine_fixed.csv --target label
Model A macro-F1: 0.71
Model B macro-F1: 0.76
Threshold policy: precision>=0.60
Chosen threshold: 0.42
Saved confusion matrix: reports/confusion.png
Exit code: 0
$ ml-lab classify-benchmark --dataset fixtures/wine_bad.csv --target label
ERROR: target class count < 2 after filtering
Exit code: 2
4. Solution Architecture
4.1 High-Level Design
Loader -> Splitter -> Trainer(2 models) -> Threshold Evaluator -> Reporter
4.2 Key Components
| Component | Responsibility | Key Decisions |
|---|---|---|
| Trainer | fit and score models | same folds for fairness |
| Threshold Evaluator | policy-based cutoff selection | validation-only tuning |
| Reporter | class metrics + recommendation | include caveats |
4.3 Data Structures (No Full Code)
scores_table: model x metric
threshold_table: threshold x {precision, recall, workload}
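One plausible in-memory shape for the two tables (plain dicts; a DataFrame works equally well). The thresholds echo the earlier example; the macro-F1 values match the demo transcript, and the workload counts are invented:

```python
# Sketch: minimal dict-based versions of scores_table and threshold_table.
# All numeric values are illustrative.
scores_table = {
    "logreg": {"macro_f1": 0.71},
    "random_forest": {"macro_f1": 0.76},
}
threshold_table = {
    0.30: {"precision": 0.58, "recall": 0.80, "workload": 120},
    0.42: {"precision": 0.65, "recall": 0.71, "workload": 95},
}

# Model selection reads straight off the scores table.
best_model = max(scores_table, key=lambda m: scores_table[m]["macro_f1"])
print(best_model)  # random_forest
```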
4.4 Algorithm Overview
- Split and train models.
- Compare CV metrics.
- Sweep threshold on validation probabilities.
- Evaluate selected setup on test.
5. Implementation Guide
5.1 Development Environment Setup
pip install scikit-learn matplotlib seaborn
5.2 Project Structure
classification-benchmark/
|-- data/
|-- reports/
`-- src/
5.3 The Core Question You’re Answering
Which model-threshold pair best balances error costs under constraints?
5.4 Concepts You Must Understand First
- Precision/recall and F1 variants
- Threshold tuning boundaries
- Confusion matrix diagnosis
5.5 Questions to Guide Your Design
- What precision floor is operationally required?
- How much recall loss is acceptable?
- Which class failures are highest risk?
5.6 Thinking Exercise
Draft a cost table for FP/FN and map to threshold choice.
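A sketch of the exercise under assumed unit costs (all numbers invented): price each FP and FN, then pick the threshold that minimizes expected cost.

```python
# Sketch: map a FP/FN cost table to a threshold choice. Costs and the
# per-threshold error counts are made-up illustration values.
COST_FP = 1.0   # wasted review effort per false positive
COST_FN = 10.0  # missed critical positive

# threshold -> (false positives, false negatives) on validation data
errors = {0.30: (42, 8), 0.42: (25, 14), 0.50: (15, 22)}

costs = {t: fp * COST_FP + fn * COST_FN for t, (fp, fn) in errors.items()}
best_t = min(costs, key=costs.get)
print(best_t, costs[best_t])  # 0.3 122.0
```

Note how the 10:1 cost ratio pushes the choice toward the lowest threshold; with symmetric costs the ranking would flip.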
5.7 The Interview Questions They’ll Ask
- Why is a threshold of 0.5 not universal?
- How do you compare classifiers fairly?
- What metric would you use for extreme class imbalance?
5.8 Hints in Layers
- Benchmark before threshold tuning.
- Tune threshold on validation only.
- Always include class-wise metrics in report.
5.9 Books That Will Help
| Topic | Book | Chapter |
|---|---|---|
| Classification workflow | Hands-On Machine Learning | Ch. 3 |
| Model assessment | ISL | Ch. 4-5 |
5.10 Implementation Phases
- Phase 1: baseline model comparison.
- Phase 2: threshold policy implementation.
- Phase 3: reporting and diagnostics.
5.11 Key Implementation Decisions
| Decision | Options | Recommendation | Rationale |
|---|---|---|---|
| Primary metric | accuracy, macro-F1 | macro-F1 | class balance robustness |
| Threshold selection | fixed 0.5, policy-based | policy-based | operational alignment |
6. Testing Strategy
6.1 Test Categories
| Category | Purpose | Examples |
|---|---|---|
| Unit | metric correctness | confusion matrix counts |
| Integration | end-to-end run | report + figure generation |
| Edge | class scarcity | low support warning |
6.2 Critical Test Cases
- Golden fixture ranking and threshold stable.
- Policy constraint satisfied in report.
- Missing class scenario warns/fails correctly.
6.3 Test Data
- fixed wine fixture
- low-support-class fixture
7. Common Pitfalls & Debugging
7.1 Frequent Mistakes
| Pitfall | Symptom | Solution |
|---|---|---|
| Accuracy obsession | hidden minority failures | add macro metrics |
| Test-set threshold tuning | optimistic final result | keep threshold tuning on validation |
| Missing caveats | unsafe recommendation | include class-risk notes |
7.2 Debugging Strategies
- Verify confusion matrix sums equal test count.
- Plot threshold curves and inspect chosen point.
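The first check can be a one-line assertion; a sketch with an invented 3-class matrix:

```python
# Sketch: sanity-check a confusion matrix (rows = true class,
# columns = predicted class). Counts below are illustrative.
cm = [[50, 3, 2], [4, 40, 6], [1, 5, 39]]
n_test = 150

total = sum(sum(row) for row in cm)
assert total == n_test, f"confusion matrix sums to {total}, expected {n_test}"

# Per-class support (row sums) should match label counts in the test split.
support = [sum(row) for row in cm]
print(support)  # [55, 50, 45]
```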
7.3 Performance Traps
Sweeping an overly wide model grid before the baseline comparison and threshold policy are settled.
8. Extensions & Challenges
8.1 Beginner Extensions
- Add weighted metrics table.
8.2 Intermediate Extensions
- Add calibration plot and Brier score.
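A sketch of the Brier score computed directly (toy labels and scores); in practice this extension would more likely call sklearn.metrics.brier_score_loss and sklearn.calibration.calibration_curve:

```python
# Sketch: Brier score = mean squared error between predicted probability
# and the binary outcome. Labels and scores are toy values.
y_true = [0, 0, 1, 1, 0, 1, 0, 1, 1, 0]
y_prob = [0.1, 0.2, 0.8, 0.7, 0.3, 0.9, 0.4, 0.6, 0.5, 0.2]

brier = sum((p - y) ** 2 for p, y in zip(y_prob, y_true)) / len(y_true)
print(round(brier, 3))  # 0.089  (lower is better; 0 is perfect)
```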
8.3 Advanced Extensions
- Add cost-sensitive training variant.
9. Real-World Connections
9.1 Industry Applications
- Risk scoring
- Quality classification
9.2 Related Open Source Projects
- scikit-learn classification examples
9.3 Interview Relevance
Strong preparation for precision/recall and confusion matrix discussions.
10. Resources
10.1 Essential Reading
- scikit-learn evaluation docs
10.2 Video Resources
- classification metric deep-dive lectures
10.3 Tools & Documentation
- https://scikit-learn.org/stable/modules/model_evaluation.html
10.4 Related Projects in This Series
- P05 and P06 (reuse this project's metric and threshold reporting)
11. Self-Assessment Checklist
11.1 Understanding
- I can justify threshold choice using explicit constraints.
11.2 Implementation
- Benchmark output is reproducible and complete.
11.3 Growth
- I documented one class-specific improvement plan.
12. Submission / Completion Criteria
Minimum Viable Completion:
- Two-model benchmark + confusion matrix.
Full Completion:
- Threshold policy and class-risk notes included.
Excellence (Going Above & Beyond):
- Includes calibration analysis and cost-sensitive recommendation.