Project 6: Reproducible ML Baseline and Reporting System
Package a full tabular ML baseline workflow with deterministic runs, artifacts, and decision-ready reporting.
Quick Reference
| Attribute | Value |
|---|---|
| Difficulty | Level 3: Advanced |
| Time Estimate | 14-20 hours |
| Main Programming Language | Python |
| Alternative Programming Languages | R, Julia |
| Coolness Level | Level 3: Genuinely Clever |
| Business Potential | 3. Service & Support Model |
| Prerequisites | Projects 1-5, pipeline and metric discipline |
| Key Topics | Reproducibility contracts, artifact packaging, model reporting |
1. Learning Objectives
- Build one-command reproducible ML baseline execution.
- Enforce schema and config contracts before training.
- Export model artifact + metadata + decision report.
- Demonstrate deterministic golden path and explicit failure path.
2. All Theory Needed (Per-Concept Breakdown)
2.1 Reproducibility as an Engineering Contract
Fundamentals
Reproducibility means same code + same data + same config -> same conclusion within defined tolerance.
Deep Dive into the concept
Notebook success often fails in team settings because hidden local assumptions are not captured: implicit random seeds, unpinned dependencies, undocumented preprocessing, and ad hoc data filters. Reproducibility formalizes these assumptions into explicit contracts.
A robust baseline run should include:
- config hash,
- dataset snapshot hash,
- dependency versions,
- deterministic split/seed policy,
- full metric and threshold report,
- serialized model + preprocessing package.
Schema validation is the first gate. If target column or required features are missing, fail early with explicit exit codes. This protects teams from silent quality regressions and misconfigured jobs.
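The fail-early gate described above can be sketched in a few lines. This is a minimal sketch: the column names are illustrative assumptions, and the exit code 2 matches the failure demo shown later in this document.

```python
# Minimal schema gate sketch; the column names are illustrative assumptions.
REQUIRED_COLUMNS = {"churned", "tenure_months", "plan_type"}
EXIT_SCHEMA_VIOLATION = 2  # failure-path exit code used in this project's demos

def validate_schema(observed_columns, required=frozenset(REQUIRED_COLUMNS)):
    """Return human-readable violations; an empty list means the gate passes."""
    missing = sorted(set(required) - set(observed_columns))
    return [f'required column "{c}" not found' for c in missing]

# A dataset missing the target column trips the gate:
violations = validate_schema(["tenure_months", "plan_type"])
# In the real runner: print each violation to stderr, then sys.exit(EXIT_SCHEMA_VIOLATION).
```

Returning a list of violations (rather than raising on the first one) lets the runner report every contract breach in a single failed run.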
Decision reporting is the second gate. Metrics should include baseline comparison, threshold rationale, and residual risks. A model card-style summary helps product and operations interpret limits.
Finally, include both success and failure demos. A system that only proves the happy path is not production-ready.
How this fits into the project series
Capstone concept that integrates all prior projects.
Definitions & key terms
- artifact: saved output of training (model, preprocessing, metadata).
- model card: structured summary of model behavior and limits.
- schema drift: mismatch between expected and observed columns/types.
- deterministic run: repeatable execution under fixed conditions.
Mental model diagram
Config + Data + Code -> Validation -> Train/Eval -> Artifacts -> Decision Report
                           |-> Fail-fast errors with explicit exit codes
How it works
- Parse config and validate schema.
- Compute hashes and environment metadata.
- Execute training/evaluation pipeline.
- Export artifacts and reports.
- Return deterministic exit code.
Failure modes:
- hidden defaults,
- missing metadata,
- artifact without schema contract.
Minimal concrete example
run_id: 2026-02-11T09:30Z
config_hash: 9f5d1a2
dataset_hash: 72a83c1
metric: ROC-AUC=0.829
threshold: 0.37 (precision floor policy)
status: DEPLOY_CANDIDATE_WITH_NOTES
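The threshold line above reflects a precision-floor policy. A minimal, dependency-free sketch of how such a threshold can be chosen (the labels, scores, and 0.65 floor are illustrative; a real runner would use validation-set predictions):

```python
def pick_threshold(y_true, scores, precision_floor=0.65):
    """Lowest score threshold whose precision meets the floor.

    Recall only shrinks as the threshold rises, so the lowest qualifying
    threshold keeps the most recall while honoring the precision floor.
    """
    for t in sorted(set(scores)):
        kept = [y for y, s in zip(y_true, scores) if s >= t]
        precision = sum(kept) / len(kept)  # TP / (TP + FP) among predicted positives
        if precision >= precision_floor:
            return t
    raise ValueError("no threshold satisfies the precision floor")

# Tiny illustrative labels and scores, not the project's churn data:
labels = [0, 0, 0, 1, 0, 1, 1, 1]
scores = [0.10, 0.20, 0.30, 0.35, 0.40, 0.60, 0.70, 0.90]
threshold = pick_threshold(labels, scores)
```

In practice the same frontier can be read off `sklearn.metrics.precision_recall_curve`; the explicit loop just makes the policy auditable.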
Common misconceptions
- ”"”Saving model file alone is enough.””” Correction: without preprocessing/schema metadata, inference can break silently.
Check-your-understanding questions
- Why hash both config and dataset?
- What should trigger non-zero exit codes?
- Why include failure demos in project output?
Check-your-understanding answers
- To prove exact run provenance and reproducibility.
- Schema violations, missing target, invalid config, critical metric computation failures.
- Reliability includes known-failure behavior, not just success path.
Real-world applications
- Baseline pipelines for product scoring.
- Team handoff workflows.
- Governance and audit readiness.
Where you’ll apply it
- This capstone and any production-oriented ML workflow.
References
- scikit-learn pipeline docs: https://scikit-learn.org/stable/modules/compose.html
Key insights
A reproducible baseline is a productized experiment, not a notebook snapshot.
Summary
This project turns your ML skills into dependable team output.
Homework/Exercises to practice the concept
- Define required metadata fields for every run.
- Create two failure fixtures and expected exit codes.
- Re-run baseline in fresh environment and compare outputs.
Solutions to the homework/exercises
- Include config hash, data hash, seed, versions, metrics, threshold.
- Example: missing target column (exit code 2), schema mismatch (exit code 2).
- Metrics should match within tolerance; large drift signals environment issues.
3. Project Specification
3.1 What You Will Build
A CLI-style baseline runner producing deterministic training/evaluation outputs and structured artifacts.
3.2 Functional Requirements
- Validate config and schema before training.
- Execute deterministic split/train/evaluate flow.
- Export model, schema contract, metrics JSON, and model card markdown.
- Return explicit exit codes for success/failure paths.
3.3 Non-Functional Requirements
- Performance: complete a standard run in under 3 minutes.
- Reliability: deterministic metrics under fixed seeds.
- Usability: one command and clear report paths.
3.4 Example Usage / Output
$ ml-lab run-baseline --config configs/churn_baseline.yaml
Status: success
Artifacts written
Decision: deploy_candidate_with_notes
3.5 Data Formats / Schemas / Protocols
- YAML/JSON config schema.
- CSV/parquet input with required column contract.
- Output artifacts under timestamped run directory.
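A minimal YAML config consistent with the requirements above. The keys and values here are assumptions for illustration, not a fixed schema:

```yaml
# configs/churn_baseline.yaml (illustrative layout, not a fixed schema)
run:
  seed: 42
  output_dir: artifacts/
data:
  path: data/churn.parquet
  target: churned
  required_features: [tenure_months, plan_type, monthly_spend]
split:
  test_size: 0.2
  stratify: true
model:
  type: logistic_regression
threshold_policy:
  precision_floor: 0.65
```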
3.6 Edge Cases
- Missing target column.
- New unseen category in required strict mode.
- Zero variance feature set.
3.7 Real World Outcome
3.7.1 How to Run (Copy/Paste)
ml-lab run-baseline --config configs/churn_baseline.yaml
3.7.2 Golden Path Demo (Deterministic)
$ ml-lab run-baseline --config fixtures/churn_fixed.yaml
Validation ROC-AUC: 0.836 +/- 0.018
Test ROC-AUC: 0.829
Exit code: 0
3.7.3 If CLI: exact terminal transcript
$ ml-lab run-baseline --config fixtures/churn_fixed.yaml
Config hash: 9f5d1a2
Dataset hash: 72a83c1
Validation ROC-AUC: 0.836 +/- 0.018
Test ROC-AUC: 0.829
Threshold: 0.37 (precision>=0.65)
Saved: artifacts/model.joblib
Saved: artifacts/preprocess_schema.json
Saved: reports/model_card.md
Exit code: 0
$ ml-lab run-baseline --config fixtures/missing_target.yaml
ERROR: required target column "churned" not found
Exit code: 2
4. Solution Architecture
4.1 High-Level Design
Config Loader -> Schema Gate -> Pipeline Runner -> Artifact Packager -> Report Generator
4.2 Key Components
| Component | Responsibility | Key Decisions |
|---|---|---|
| Schema Gate | input contract validation | strict vs warn mode |
| Pipeline Runner | deterministic train/eval | fixed seeds and split |
| Artifact Packager | model + schema + metrics | versioned directory layout |
| Report Generator | model card and decision notes | include residual risks |
4.3 Data Structures (No Full Code)
run_metadata: {run_id, hashes, versions, seed}
artifact_index: {model_path, schema_path, metrics_path, report_path}
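The two structures above can be pinned down as frozen dataclasses, keeping run records immutable and easy to serialize. The field names follow the sketch above; the concrete values are taken from the minimal example earlier and are otherwise illustrative:

```python
from dataclasses import dataclass, asdict

@dataclass(frozen=True)
class RunMetadata:
    run_id: str
    config_hash: str
    dataset_hash: str
    versions: dict  # e.g. {"python": "3.11.6", "scikit-learn": "1.4.2"} -- illustrative
    seed: int

@dataclass(frozen=True)
class ArtifactIndex:
    model_path: str
    schema_path: str
    metrics_path: str
    report_path: str

# Values taken from the minimal concrete example earlier in this document:
meta = RunMetadata(
    run_id="2026-02-11T09:30Z",
    config_hash="9f5d1a2",
    dataset_hash="72a83c1",
    versions={"python": "3.11.6"},
    seed=42,
)
record = asdict(meta)  # plain dict, ready to dump as JSON next to the artifacts
```

`frozen=True` means a run record cannot be mutated after creation, which is exactly the provenance guarantee the metadata is supposed to give.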
4.4 Algorithm Overview
- Load config and validate fields.
- Validate dataset schema.
- Run training and evaluation.
- Export artifacts and decision report.
- Exit with status code.
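The five steps map onto a small exit-code-driven entry point. This is a structural sketch: the component callables and exit code 3 are assumptions, while code 2 matches the failure demo earlier in this document.

```python
import sys

EXIT_OK = 0
EXIT_CONTRACT_VIOLATION = 2  # schema/config failures, matching the demo transcripts
EXIT_RUNTIME_FAILURE = 3     # assumed code for unexpected pipeline errors

class ContractError(Exception):
    """Raised by config or schema validation when an input contract is broken."""

def run_baseline(config_path, load_config, validate, train_eval, export):
    """Orchestrate the five steps; the four callables stand in for real components."""
    try:
        config = load_config(config_path)
        validate(config)              # fail fast before any training happens
        results = train_eval(config)  # deterministic under the config's seed
        export(config, results)       # artifacts + decision report
    except ContractError as err:
        print(f"ERROR: {err}", file=sys.stderr)
        return EXIT_CONTRACT_VIOLATION
    except Exception as err:  # last-resort guard at the CLI boundary
        print(f"ERROR: unexpected failure: {err}", file=sys.stderr)
        return EXIT_RUNTIME_FAILURE
    return EXIT_OK
```

Passing the components in as callables keeps the orchestration testable: failure fixtures become stub functions that raise `ContractError`.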
5. Implementation Guide
5.1 Development Environment Setup
# create reproducible environment and lock dependencies
# (example commands; venv + pip shown, but any pinned-environment tool works)
python -m venv .venv
source .venv/bin/activate
pip install scikit-learn pandas pyyaml
pip freeze > requirements.lock
5.2 Project Structure
baseline-system/
|-- configs/
|-- artifacts/
|-- reports/
|-- fixtures/
`-- src/
5.3 The Core Question You’re Answering
Can this workflow be rerun by another engineer with the same conclusions?
5.4 Concepts You Must Understand First
- Deterministic experiments
- Schema contracts
- Decision reporting
5.5 Questions to Guide Your Design
- Which metadata is mandatory for auditability?
- Which failures should stop execution immediately?
- How do you communicate residual risk clearly?
5.6 Thinking Exercise
Write a runbook section called "why two runs can diverge" with a mitigation for each cause.
5.7 The Interview Questions They’ll Ask
- What defines reproducibility in ML pipelines?
- Why is schema validation critical before model loading?
- What belongs in a model card?
5.8 Hints in Layers
- Start with config schema and fail-fast rules.
- Hash config/data before training.
- Treat report generation as required output, not optional.
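The second hint is a few lines of `hashlib`. A sketch follows; the seven-character truncation mirrors the demo transcripts and is otherwise an arbitrary choice:

```python
import hashlib
import json

def hash_bytes(data: bytes) -> str:
    """Short, stable content hash for provenance records."""
    return hashlib.sha256(data).hexdigest()[:7]

def hash_config(config: dict) -> str:
    # Canonical JSON (sorted keys) so dict ordering never changes the hash.
    return hash_bytes(json.dumps(config, sort_keys=True).encode("utf-8"))

def hash_dataset_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Stream the file so large datasets hash without loading into memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            digest.update(chunk)
    return digest.hexdigest()[:7]

# Key order must not matter for provenance:
assert hash_config({"seed": 42, "target": "churned"}) == hash_config({"target": "churned", "seed": 42})
```

Hashing a canonical serialization, not the raw file bytes of the config, is what makes "same config" a checkable claim rather than a convention.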
5.9 Books That Will Help
| Topic | Book | Chapter |
|---|---|---|
| Practical ML workflows | Hands-On Machine Learning | Ch. 2-3 |
| Engineering reliability | The Pragmatic Programmer | Ch. 8-9 |
5.10 Implementation Phases
- Phase 1: config + schema validation.
- Phase 2: deterministic training/eval runner.
- Phase 3: artifact packaging + model card.
5.11 Key Implementation Decisions
| Decision | Options | Recommendation | Rationale |
|---|---|---|---|
| Failure mode | warn, fail-fast | fail-fast for contract violations | prevents silent corruption |
| Artifact structure | flat, run-based | run-based directories | traceability |
6. Testing Strategy
6.1 Test Categories
| Category | Purpose | Examples |
|---|---|---|
| Unit | schema/config validation | missing key failure |
| Integration | full baseline run | artifact set complete |
| Edge | failure fixtures | non-zero exit code checks |
6.2 Critical Test Cases
- Golden config produces fixed metric output.
- Missing target fixture exits with code 2.
- Clean environment rerun reproduces report.
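Exit-code checks like the second case can be automated with `subprocess`. Since the real `ml-lab` binary is outside the scope of this sketch, the example fakes a failing CLI with an inline script to show the pattern:

```python
import subprocess
import sys

def exit_code_of(argv):
    """Run a command and return its exit code; the core of a failure-fixture test."""
    return subprocess.run(argv, capture_output=True).returncode

# Stand-in for: ml-lab run-baseline --config fixtures/missing_target.yaml
# The inline script simply exits 2, the contract-violation code from the demos.
fake_failing_cli = [sys.executable, "-c", "import sys; sys.exit(2)"]
assert exit_code_of(fake_failing_cli) == 2
```

In the real suite, `argv` would be the actual CLI invocation against each failure fixture, with one assertion per documented exit code.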
6.3 Test Data
- fixed churn config + dataset
- schema-failure fixture
7. Common Pitfalls & Debugging
7.1 Frequent Mistakes
| Pitfall | Symptom | Solution |
|---|---|---|
| Unpinned dependencies | metric drift across machines | lock versions and log them |
| Missing schema contract | inference-time crashes | save and validate schema artifact |
| Report optionality | unclear deployment decisions | require report generation in success path |
7.2 Debugging Strategies
- Compare run metadata first (hashes/versions/seeds).
- Diff reports across runs before debugging model details.
7.3 Performance Traps
Generating heavy plots on every run; keep the baseline report lightweight.
8. Extensions & Challenges
8.1 Beginner Extensions
- Add run summary index page.
8.2 Intermediate Extensions
- Add drift check against previous run artifacts.
8.3 Advanced Extensions
- Integrate with CI pipeline and scheduled baseline checks.
9. Real-World Connections
9.1 Industry Applications
- Team ML baseline templates
- Governance and compliance evidence trails
9.2 Related Open Source Projects
- MLflow
- DVC
9.3 Interview Relevance
Demonstrates transition from experimentation to production-minded ML engineering.
10. Resources
10.1 Essential Reading
- scikit-learn pipeline documentation
10.2 Video Resources
- MLOps baseline and reproducibility talks
10.3 Tools & Documentation
- https://scikit-learn.org/stable/modules/compose.html
10.4 Related Projects in This Series
- Previous: Project 5
11. Self-Assessment Checklist
11.1 Understanding
- I can explain every field in my run metadata.
11.2 Implementation
- Golden path and failure path both work as documented.
11.3 Growth
- I documented one production hardening step for next iteration.
12. Submission / Completion Criteria
Minimum Viable Completion:
- One-command baseline run with complete artifacts.
Full Completion:
- Includes deterministic rerun proof and failure handling.
Excellence (Going Above & Beyond):
- Includes governance-ready model card with threshold and risk sections.