Project 4: Iris/Wine Classification Benchmark
Compare classification models with threshold-aware metrics and class-level error analysis.
Quick Reference
| Attribute | Value |
|---|---|
| Difficulty | Level 2: Intermediate |
| Time Estimate | 8-12 hours |
| Main Programming Language | Python |
| Alternative Programming Languages | R, Julia |
| Coolness Level | Level 2: Practical but Forgettable |
| Business Potential | Level 1: Resume Gold |
| Prerequisites | Classification metrics, confusion matrix basics, CV awareness |
| Key Topics | Precision/recall trade-offs, thresholding, model comparison |
1. Learning Objectives
- Benchmark two classifier families under identical split policies.
- Choose threshold based on operational constraints.
- Interpret confusion matrix beyond overall accuracy.
- Produce class-wise risk notes for decision-makers.
2. All Theory Needed (Per-Concept Breakdown)
2.1 Metric and Threshold Decision Framework
Fundamentals
Classification metrics capture different error trade-offs. A threshold converts predicted probabilities into decisions and should reflect cost and capacity constraints.
Deep Dive into the concept
Accuracy is simple but often misleading when class frequencies are uneven. Precision reflects positive prediction quality; recall reflects positive coverage. F1 balances both but still hides class-level asymmetries. Macro and weighted variants change how class imbalance affects aggregate scores.
Thresholding is where modeling meets operations. A lower threshold increases recall but may flood manual review queues with false positives. A higher threshold improves precision but may miss critical positives. Therefore the threshold should be selected on validation data using explicit constraints (for example, a precision floor or a maximum daily alert count).
The confusion matrix is the fastest way to diagnose systematic class failures. If one class has poor recall, additional data, better features, or class weighting may be required. ROC-AUC and PR-AUC help compare ranking quality, but they do not, by themselves, choose an operating point.
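The ranking-vs-operating-point distinction can be made concrete: ROC-AUC equals the probability that a randomly chosen positive outranks a randomly chosen negative. A minimal stdlib sketch with made-up scores:

```python
# Sketch: ROC-AUC as a pairwise ranking probability. Scores below are
# toy values, not project data. Ties count as half a win.
def roc_auc(pos_scores, neg_scores):
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos_scores
        for n in neg_scores
    )
    return wins / (len(pos_scores) * len(neg_scores))

auc = roc_auc([0.9, 0.7, 0.4], [0.6, 0.3, 0.2])
print(round(auc, 3))  # 0.889
```

Two models can share this pairwise ranking quality yet behave very differently at any single threshold, which is why the operating point is a separate decision.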
How this fits into the projects
Core for P04 and reused in P05/P06 reporting.
Definitions & key terms
- Precision: TP / (TP + FP)
- Recall: TP / (TP + FN)
- Threshold: probability cutoff for positive class
- Macro-F1: class-average F1 treating classes equally
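A minimal sketch of these definitions applied per class, with invented counts for a hypothetical three-class problem (the project itself can lean on sklearn.metrics instead):

```python
# Sketch: precision, recall, and macro-F1 from raw per-class counts.
# The (TP, FP, FN) numbers are made-up illustration values.

def precision(tp, fp):
    return tp / (tp + fp) if (tp + fp) else 0.0

def recall(tp, fn):
    return tp / (tp + fn) if (tp + fn) else 0.0

def f1(p, r):
    return 2 * p * r / (p + r) if (p + r) else 0.0

# Per-class (TP, FP, FN) counts for a hypothetical 3-class problem.
counts = {"A": (40, 10, 5), "B": (30, 5, 15), "C": (8, 2, 12)}

f1s = [f1(precision(tp, fp), recall(tp, fn)) for tp, fp, fn in counts.values()]
macro_f1 = sum(f1s) / len(f1s)  # macro: every class weighted equally
print(round(macro_f1, 3))  # 0.708
```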
Mental model diagram
probabilities -> threshold -> confusion matrix -> precision/recall -> operational decision
How it works
- Train models and get probability outputs.
- Sweep candidate thresholds.
- Compute precision/recall and workload proxy.
- Select threshold under constraints.
- Validate on held-out test split.
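The steps above can be sketched end-to-end with toy validation scores; the 0.60 precision floor is the example policy used later in this section, and every number is illustrative:

```python
# Sketch: sweep thresholds on validation scores, then select under a
# precision-floor policy. Scores and labels are toy values.
scores = [0.95, 0.80, 0.70, 0.55, 0.45, 0.30, 0.20, 0.10]
labels = [1, 1, 0, 1, 0, 1, 0, 0]  # ground truth on the validation split

def pr_at(threshold):
    preds = [int(s >= threshold) for s in scores]
    tp = sum(p and y for p, y in zip(preds, labels))
    fp = sum(p and not y for p, y in zip(preds, labels))
    fn = sum((not p) and y for p, y in zip(preds, labels))
    prec = tp / (tp + fp) if (tp + fp) else 0.0
    rec = tp / (tp + fn) if (tp + fn) else 0.0
    return prec, rec, sum(preds)  # workload proxy = alerts raised

PRECISION_FLOOR = 0.60
candidates = [t / 100 for t in range(5, 100, 5)]
feasible = [(t, *pr_at(t)) for t in candidates if pr_at(t)[0] >= PRECISION_FLOOR]
# Among thresholds meeting the floor, keep the one with the best recall.
best = max(feasible, key=lambda row: row[2])
print(best)
```

On real data the final step is to freeze this threshold and confirm behavior once on the held-out test split, never to re-tune there.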
Failure modes:
- relying on accuracy alone
- tuning the threshold on the test set
- ignoring class-wise performance
Minimal concrete example
threshold 0.30: precision 0.58, recall 0.80
threshold 0.42: precision 0.65, recall 0.71
policy: precision floor 0.60 -> choose 0.42
Common misconceptions
- "Highest AUC means the best production model." Correction: threshold choice and operational constraints still decide viability.
Check-your-understanding questions
- Why can two models with similar AUC behave differently at a chosen threshold?
- Why is macro-F1 useful in imbalance?
- Why tune threshold on validation only?
Check-your-understanding answers
- Local curve shape near operating region can differ.
- It prevents majority class dominance in aggregate score.
- To keep final test estimate unbiased.
Real-world applications
- Fraud/abuse screening
- Medical triage support
- Customer churn alerts
Where you’ll apply it
- This project and P06 decision report section.
References
- scikit-learn metrics docs: https://scikit-learn.org/stable/modules/model_evaluation.html
Key insights
A classifier is only useful when its thresholded behavior fits operational reality.
Summary
Model ranking and decision thresholding are distinct tasks that must both be handled rigorously.
Homework/Exercises to practice the concept
- Build threshold table with precision/recall/workload.
- Compare macro-F1 vs weighted-F1 on imbalanced data.
- Explain one class-specific failure from confusion matrix.
Solutions to the homework/exercises
- Choose threshold meeting policy constraints.
- Macro-F1 reveals minority class degradation hidden in weighted metrics.
- Use class-level feature coverage analysis to hypothesize cause.
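For the macro-vs-weighted exercise, a toy imbalanced label set makes the effect visible; the helper below is a plain-Python stand-in for sklearn.metrics.f1_score with average="macro" / "weighted":

```python
# Sketch: macro-F1 exposes a minority-class failure that weighted-F1
# hides. Labels below are invented for illustration.

def f1_per_class(y_true, y_pred, cls):
    tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
    fp = sum(t != cls and p == cls for t, p in zip(y_true, y_pred))
    fn = sum(t == cls and p != cls for t, p in zip(y_true, y_pred))
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    return 2 * prec * rec / (prec + rec) if prec + rec else 0.0

y_true = [0] * 18 + [1] * 2
y_pred = [0] * 20  # model never predicts the minority class

f1s = {c: f1_per_class(y_true, y_pred, c) for c in (0, 1)}
macro = sum(f1s.values()) / 2                       # treats classes equally
weighted = sum(f1s[c] * y_true.count(c) / 20 for c in (0, 1))  # support-weighted
print(round(macro, 2), round(weighted, 2))  # 0.47 0.85
```

The weighted score looks healthy because the majority class dominates it; the macro score drops sharply, flagging the minority-class collapse.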
3. Project Specification
3.1 What You Will Build
A benchmark script/report comparing two classifiers and one threshold policy recommendation.
3.2 Functional Requirements
- Train at least two classifiers with same split/CV plan.
- Generate confusion matrix and class-wise metrics.
- Sweep thresholds and pick one by policy.
- Output recommendation report with caveats.
3.3 Non-Functional Requirements
- Performance: full benchmark under 2 minutes for reference dataset.
- Reliability: deterministic output for fixed seeds.
- Usability: one command to reproduce all outputs.
3.4 Example Usage / Output
$ ml-lab classify-benchmark --dataset wine.csv --target quality
Best model: random_forest
Chosen threshold: 0.42
Macro-F1: 0.76
3.5 Data Formats / Schemas / Protocols
- Input CSV with categorical target.
- Output report + confusion matrix image.
3.6 Edge Cases
- Single-class training fold should be detected and handled.
- Missing class labels in test set should be reported with caution.
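A possible shape for both guards, with illustrative function names (not a required API); the error message mirrors the CLI transcript in 3.7.3:

```python
# Sketch: detect single-class folds and classes absent from the test
# split. Function names here are placeholders, not a required API.
from collections import Counter

def check_fold_classes(y, min_classes=2):
    counts = Counter(y)
    if len(counts) < min_classes:
        raise ValueError(f"target class count < {min_classes} after filtering")
    return counts

def missing_test_classes(train_labels, test_labels):
    # Classes seen in training but absent from test should get a caution note.
    return sorted(set(train_labels) - set(test_labels))

print(missing_test_classes(["a", "b", "c"], ["a", "b"]))  # ['c']
```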
3.7 Real World Outcome
3.7.1 How to Run (Copy/Paste)
ml-lab classify-benchmark --dataset data/wine.csv --target quality_label
3.7.2 Golden Path Demo (Deterministic)
$ ml-lab classify-benchmark --dataset fixtures/wine_fixed.csv --target label
Best model: random_forest
Threshold: 0.42
Exit code: 0
3.7.3 If CLI: exact terminal transcript
$ ml-lab classify-benchmark --dataset fixtures/wine_fixed.csv --target label
Model A macro-F1: 0.71
Model B macro-F1: 0.76
Threshold policy: precision>=0.60
Chosen threshold: 0.42
Saved confusion matrix: reports/confusion.png
Exit code: 0
$ ml-lab classify-benchmark --dataset fixtures/wine_bad.csv --target label
ERROR: target class count < 2 after filtering
Exit code: 2
4. Solution Architecture
4.1 High-Level Design
Loader -> Splitter -> Trainer(2 models) -> Threshold Evaluator -> Reporter
4.2 Key Components
| Component | Responsibility | Key Decisions |
|---|---|---|
| Trainer | fit and score models | same folds for fairness |
| Threshold Evaluator | policy-based cutoff selection | validation-only tuning |
| Reporter | class metrics + recommendation | include caveats |
4.3 Data Structures (No Full Code)
scores_table: model x metric
threshold_table: threshold x {precision, recall, workload}
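One plausible in-memory shape for the two tables (plain dicts; a DataFrame works equally well). The thresholds echo the earlier example; the macro-F1 values match the demo transcript, and the workload counts are invented:

```python
# Sketch: minimal dict-based versions of scores_table and threshold_table.
# All numeric values are illustrative.
scores_table = {
    "logreg": {"macro_f1": 0.71},
    "random_forest": {"macro_f1": 0.76},
}
threshold_table = {
    0.30: {"precision": 0.58, "recall": 0.80, "workload": 120},
    0.42: {"precision": 0.65, "recall": 0.71, "workload": 95},
}

# Model selection reads straight off the scores table.
best_model = max(scores_table, key=lambda m: scores_table[m]["macro_f1"])
print(best_model)  # random_forest
```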
4.4 Algorithm Overview
- Split and train models.
- Compare CV metrics.
- Sweep threshold on validation probabilities.
- Evaluate selected setup on test.
5. Implementation Guide
5.1 Development Environment Setup
pip install scikit-learn matplotlib seaborn
5.2 Project Structure
classification-benchmark/
|-- data/
|-- reports/
`-- src/
5.3 The Core Question You’re Answering
Which model-threshold pair best balances error costs under constraints?
5.4 Concepts You Must Understand First
- Precision/recall and F1 variants
- Threshold tuning boundaries
- Confusion matrix diagnosis
5.5 Questions to Guide Your Design
- What precision floor is operationally required?
- How much recall loss is acceptable?
- Which class failures are highest risk?
5.6 Thinking Exercise
Draft a cost table for FP/FN and map to threshold choice.
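A sketch of the exercise under assumed unit costs (all numbers invented): price each FP and FN, then pick the threshold that minimizes expected cost.

```python
# Sketch: map a FP/FN cost table to a threshold choice. Costs and the
# per-threshold error counts are made-up illustration values.
COST_FP = 1.0   # wasted review effort per false positive
COST_FN = 10.0  # missed critical positive

# threshold -> (false positives, false negatives) on validation data
errors = {0.30: (42, 8), 0.42: (25, 14), 0.50: (15, 22)}

costs = {t: fp * COST_FP + fn * COST_FN for t, (fp, fn) in errors.items()}
best_t = min(costs, key=costs.get)
print(best_t, costs[best_t])  # 0.3 122.0
```

Note how the 10:1 cost ratio pushes the choice toward the lowest threshold; with symmetric costs the ranking would flip.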
5.7 The Interview Questions They’ll Ask
- Why is a threshold of 0.5 not universal?
- How do you compare classifiers fairly?
- What metric would you use for extreme class imbalance?
5.8 Hints in Layers
- Benchmark before threshold tuning.
- Tune threshold on validation only.
- Always include class-wise metrics in report.
5.9 Books That Will Help
| Topic | Book | Chapter |
|---|---|---|
| Classification workflow | Hands-On Machine Learning | Ch. 3 |
| Model assessment | ISL | Ch. 4-5 |
5.10 Implementation Phases
- Phase 1: baseline model comparison.
- Phase 2: threshold policy implementation.
- Phase 3: reporting and diagnostics.
5.11 Key Implementation Decisions
| Decision | Options | Recommendation | Rationale |
|---|---|---|---|
| Primary metric | accuracy, macro-F1 | macro-F1 | class balance robustness |
| Threshold selection | fixed 0.5, policy-based | policy-based | operational alignment |
6. Testing Strategy
6.1 Test Categories
| Category | Purpose | Examples |
|---|---|---|
| Unit | metric correctness | confusion matrix counts |
| Integration | end-to-end run | report + figure generation |
| Edge | class scarcity | low support warning |
6.2 Critical Test Cases
- Golden fixture ranking and threshold stable.
- Policy constraint satisfied in report.
- Missing class scenario warns/fails correctly.
6.3 Test Data
- fixed wine fixture
- low-support-class fixture
7. Common Pitfalls & Debugging
7.1 Frequent Mistakes
| Pitfall | Symptom | Solution |
|---|---|---|
| Accuracy obsession | hidden minority failures | add macro metrics |
| Test-set threshold tuning | optimistic final result | keep threshold tuning on validation |
| Missing caveats | unsafe recommendation | include class-risk notes |
7.2 Debugging Strategies
- Verify confusion matrix sums equal test count.
- Plot threshold curves and inspect chosen point.
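The first check can be a one-line assertion; a sketch with an invented 3-class matrix:

```python
# Sketch: sanity-check a confusion matrix (rows = true class,
# columns = predicted class). Counts below are illustrative.
cm = [[50, 3, 2], [4, 40, 6], [1, 5, 39]]
n_test = 150

total = sum(sum(row) for row in cm)
assert total == n_test, f"confusion matrix sums to {total}, expected {n_test}"

# Per-class support (row sums) should match label counts in the test split.
support = [sum(row) for row in cm]
print(support)  # [55, 50, 45]
```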
7.3 Performance Traps
Sweeping an overly wide model grid before the baseline comparison and threshold policy are settled.
8. Extensions & Challenges
8.1 Beginner Extensions
- Add weighted metrics table.
8.2 Intermediate Extensions
- Add calibration plot and Brier score.
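A sketch of the Brier score computed directly (toy labels and scores); in practice this extension would more likely call sklearn.metrics.brier_score_loss and sklearn.calibration.calibration_curve:

```python
# Sketch: Brier score = mean squared error between predicted probability
# and the binary outcome. Labels and scores are toy values.
y_true = [0, 0, 1, 1, 0, 1, 0, 1, 1, 0]
y_prob = [0.1, 0.2, 0.8, 0.7, 0.3, 0.9, 0.4, 0.6, 0.5, 0.2]

brier = sum((p - y) ** 2 for p, y in zip(y_prob, y_true)) / len(y_true)
print(round(brier, 3))  # 0.089  (lower is better; 0 is perfect)
```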
8.3 Advanced Extensions
- Add cost-sensitive training variant.
9. Real-World Connections
9.1 Industry Applications
- Risk scoring
- Quality classification
9.2 Related Open Source Projects
- scikit-learn classification examples
9.3 Interview Relevance
Strong preparation for precision/recall and confusion matrix discussions.
10. Resources
10.1 Essential Reading
- scikit-learn evaluation docs
10.2 Video Resources
- classification metric deep-dive lectures
10.3 Tools & Documentation
- https://scikit-learn.org/stable/modules/model_evaluation.html
10.4 Related Projects in This Series
- P05 and P06 (reuse this project's metric and threshold reporting)
11. Self-Assessment Checklist
11.1 Understanding
- I can justify threshold choice using explicit constraints.
11.2 Implementation
- Benchmark output is reproducible and complete.
11.3 Growth
- I documented one class-specific improvement plan.
12. Submission / Completion Criteria
Minimum Viable Completion:
- Two-model benchmark + confusion matrix.
Full Completion:
- Threshold policy and class-risk notes included.
Excellence (Going Above & Beyond):
- Includes calibration analysis and cost-sensitive recommendation.