Project 6: Reproducible ML Baseline and Reporting System
Package a full tabular ML baseline workflow with deterministic runs, artifacts, and decision-ready reporting.
Quick Reference
| Attribute | Value |
|---|---|
| Difficulty | Level 3: Advanced |
| Time Estimate | 14-20 hours |
| Main Programming Language | Python |
| Alternative Programming Languages | R, Julia |
| Coolness Level | Level 3: Genuinely Clever |
| Business Potential | 3. Service & Support Model |
| Prerequisites | Projects 1-5, pipeline and metric discipline |
| Key Topics | Reproducibility contracts, artifact packaging, model reporting |
1. Learning Objectives
- Build one-command reproducible ML baseline execution.
- Enforce schema and config contracts before training.
- Export model artifact + metadata + decision report.
- Demonstrate deterministic golden path and explicit failure path.
2. All Theory Needed (Per-Concept Breakdown)
2.1 Reproducibility as an Engineering Contract
Fundamentals
Reproducibility means same code + same data + same config -> same conclusion within defined tolerance.
Deep Dive into the concept
Notebook success often fails in team settings because hidden local assumptions are not captured: implicit random seeds, unpinned dependencies, undocumented preprocessing, and ad hoc data filters. Reproducibility formalizes these assumptions into explicit contracts.
A robust baseline run should include:
- config hash,
- dataset snapshot hash,
- dependency versions,
- deterministic split/seed policy,
- full metric and threshold report,
- serialized model + preprocessing package.
Schema validation is the first gate. If target column or required features are missing, fail early with explicit exit codes. This protects teams from silent quality regressions and misconfigured jobs.
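The fail-early gate described above can be sketched in a few lines. This is a minimal sketch: the column names are illustrative assumptions, and the exit code 2 matches the failure demo shown later in this document.

```python
# Minimal schema gate sketch; the column names are illustrative assumptions.
REQUIRED_COLUMNS = {"churned", "tenure_months", "plan_type"}
EXIT_SCHEMA_VIOLATION = 2  # failure-path exit code used in this project's demos

def validate_schema(observed_columns, required=frozenset(REQUIRED_COLUMNS)):
    """Return human-readable violations; an empty list means the gate passes."""
    missing = sorted(set(required) - set(observed_columns))
    return [f'required column "{c}" not found' for c in missing]

# A dataset missing the target column trips the gate:
violations = validate_schema(["tenure_months", "plan_type"])
# In the real runner: print each violation to stderr, then sys.exit(EXIT_SCHEMA_VIOLATION).
```

Returning a list of violations (rather than raising on the first one) lets the runner report every contract breach in a single failed run.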
Decision reporting is the second gate. Metrics should include baseline comparison, threshold rationale, and residual risks. A model card-style summary helps product and operations interpret limits.
Finally, include both success and failure demos. A system that only proves the happy path is not production-ready.
How this fits into the project series
Capstone concept that integrates all prior projects.
Definitions & key terms
- artifact: saved output of training (model, preprocessing, metadata).
- model card: structured summary of model behavior and limits.
- schema drift: mismatch between expected and observed columns/types.
- deterministic run: repeatable execution under fixed conditions.
Mental model diagram
Config + Data + Code -> Validation -> Train/Eval -> Artifacts -> Decision Report
                           |-> Fail-fast errors with explicit exit codes
How it works
- Parse config and validate schema.
- Compute hashes and environment metadata.
- Execute training/evaluation pipeline.
- Export artifacts and reports.
- Return deterministic exit code.
Failure modes:
- hidden defaults,
- missing metadata,
- artifact without schema contract.
Minimal concrete example
run_id: 2026-02-11T09:30Z
config_hash: 9f5d1a2
dataset_hash: 72a83c1
metric: ROC-AUC=0.829
threshold: 0.37 (precision floor policy)
status: DEPLOY_CANDIDATE_WITH_NOTES
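The threshold line above reflects a precision-floor policy. A minimal, dependency-free sketch of how such a threshold can be chosen (the labels, scores, and 0.65 floor are illustrative; a real runner would use validation-set predictions):

```python
def pick_threshold(y_true, scores, precision_floor=0.65):
    """Lowest score threshold whose precision meets the floor.

    Recall only shrinks as the threshold rises, so the lowest qualifying
    threshold keeps the most recall while honoring the precision floor.
    """
    for t in sorted(set(scores)):
        kept = [y for y, s in zip(y_true, scores) if s >= t]
        precision = sum(kept) / len(kept)  # TP / (TP + FP) among predicted positives
        if precision >= precision_floor:
            return t
    raise ValueError("no threshold satisfies the precision floor")

# Tiny illustrative labels and scores, not the project's churn data:
labels = [0, 0, 0, 1, 0, 1, 1, 1]
scores = [0.10, 0.20, 0.30, 0.35, 0.40, 0.60, 0.70, 0.90]
threshold = pick_threshold(labels, scores)
```

In practice the same frontier can be read off `sklearn.metrics.precision_recall_curve`; the explicit loop just makes the policy auditable.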
Common misconceptions
- ”"”Saving model file alone is enough.””” Correction: without preprocessing/schema metadata, inference can break silently.
Check-your-understanding questions
- Why hash both config and dataset?
- What should trigger non-zero exit codes?
- Why include failure demos in project output?
Check-your-understanding answers
- To prove exact run provenance and reproducibility.
- Schema violations, missing target, invalid config, critical metric computation failures.
- Reliability includes known-failure behavior, not just success path.
Real-world applications
- Baseline pipelines for product scoring.
- Team handoff workflows.
- Governance and audit readiness.
Where you’ll apply it
- This capstone and any production-oriented ML workflow.
References
- scikit-learn pipeline docs: https://scikit-learn.org/stable/modules/compose.html
Key insights
A reproducible baseline is a productized experiment, not a notebook snapshot.
Summary
This project turns your ML skills into dependable team output.
Homework/Exercises to practice the concept
- Define required metadata fields for every run.
- Create two failure fixtures and expected exit codes.
- Re-run baseline in fresh environment and compare outputs.
Solutions to the homework/exercises
- Include config hash, data hash, seed, versions, metrics, threshold.
- Example: missing target column (exit code 2), schema mismatch (exit code 2).
- Metrics should match within tolerance; large drift signals environment issues.
3. Project Specification
3.1 What You Will Build
A CLI-style baseline runner producing deterministic training/evaluation outputs and structured artifacts.
3.2 Functional Requirements
- Validate config and schema before training.
- Execute deterministic split/train/evaluate flow.
- Export model, schema contract, metrics JSON, and model card markdown.
- Return explicit exit codes for success/failure paths.
3.3 Non-Functional Requirements
- Performance: complete a standard run in under 3 minutes.
- Reliability: deterministic metrics under fixed seeds.
- Usability: one command and clear report paths.
3.4 Example Usage / Output
$ ml-lab run-baseline --config configs/churn_baseline.yaml
Status: success
Artifacts written
Decision: deploy_candidate_with_notes
3.5 Data Formats / Schemas / Protocols
- YAML/JSON config schema.
- CSV/parquet input with required column contract.
- Output artifacts under timestamped run directory.
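A minimal YAML config consistent with the requirements above. The keys and values here are assumptions for illustration, not a fixed schema:

```yaml
# configs/churn_baseline.yaml (illustrative layout, not a fixed schema)
run:
  seed: 42
  output_dir: artifacts/
data:
  path: data/churn.parquet
  target: churned
  required_features: [tenure_months, plan_type, monthly_spend]
split:
  test_size: 0.2
  stratify: true
model:
  type: logistic_regression
threshold_policy:
  precision_floor: 0.65
```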
3.6 Edge Cases
- Missing target column.
- New unseen category in required strict mode.
- Zero variance feature set.
3.7 Real World Outcome
3.7.1 How to Run (Copy/Paste)
ml-lab run-baseline --config configs/churn_baseline.yaml
3.7.2 Golden Path Demo (Deterministic)
$ ml-lab run-baseline --config fixtures/churn_fixed.yaml
Validation ROC-AUC: 0.836 +/- 0.018
Test ROC-AUC: 0.829
Exit code: 0
3.7.3 If CLI: exact terminal transcript
$ ml-lab run-baseline --config fixtures/churn_fixed.yaml
Config hash: 9f5d1a2
Dataset hash: 72a83c1
Validation ROC-AUC: 0.836 +/- 0.018
Test ROC-AUC: 0.829
Threshold: 0.37 (precision>=0.65)
Saved: artifacts/model.joblib
Saved: artifacts/preprocess_schema.json
Saved: reports/model_card.md
Exit code: 0
$ ml-lab run-baseline --config fixtures/missing_target.yaml
ERROR: required target column "churned" not found
Exit code: 2
4. Solution Architecture
4.1 High-Level Design
Config Loader -> Schema Gate -> Pipeline Runner -> Artifact Packager -> Report Generator
4.2 Key Components
| Component | Responsibility | Key Decisions |
|---|---|---|
| Schema Gate | input contract validation | strict vs warn mode |
| Pipeline Runner | deterministic train/eval | fixed seeds and split |
| Artifact Packager | model + schema + metrics | versioned directory layout |
| Report Generator | model card and decision notes | include residual risks |
4.3 Data Structures (No Full Code)
run_metadata: {run_id, hashes, versions, seed}
artifact_index: {model_path, schema_path, metrics_path, report_path}
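The two structures above can be pinned down as frozen dataclasses, keeping run records immutable and easy to serialize. The field names follow the sketch above; the concrete values are taken from the minimal example earlier and are otherwise illustrative:

```python
from dataclasses import dataclass, asdict

@dataclass(frozen=True)
class RunMetadata:
    run_id: str
    config_hash: str
    dataset_hash: str
    versions: dict  # e.g. {"python": "3.11.6", "scikit-learn": "1.4.2"} -- illustrative
    seed: int

@dataclass(frozen=True)
class ArtifactIndex:
    model_path: str
    schema_path: str
    metrics_path: str
    report_path: str

# Values taken from the minimal concrete example earlier in this document:
meta = RunMetadata(
    run_id="2026-02-11T09:30Z",
    config_hash="9f5d1a2",
    dataset_hash="72a83c1",
    versions={"python": "3.11.6"},
    seed=42,
)
record = asdict(meta)  # plain dict, ready to dump as JSON next to the artifacts
```

`frozen=True` means a run record cannot be mutated after creation, which is exactly the provenance guarantee the metadata is supposed to give.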
4.4 Algorithm Overview
- Load config and validate fields.
- Validate dataset schema.
- Run training and evaluation.
- Export artifacts and decision report.
- Exit with status code.
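The five steps map onto a small exit-code-driven entry point. This is a structural sketch: the component callables and exit code 3 are assumptions, while code 2 matches the failure demo earlier in this document.

```python
import sys

EXIT_OK = 0
EXIT_CONTRACT_VIOLATION = 2  # schema/config failures, matching the demo transcripts
EXIT_RUNTIME_FAILURE = 3     # assumed code for unexpected pipeline errors

class ContractError(Exception):
    """Raised by config or schema validation when an input contract is broken."""

def run_baseline(config_path, load_config, validate, train_eval, export):
    """Orchestrate the five steps; the four callables stand in for real components."""
    try:
        config = load_config(config_path)
        validate(config)              # fail fast before any training happens
        results = train_eval(config)  # deterministic under the config's seed
        export(config, results)       # artifacts + decision report
    except ContractError as err:
        print(f"ERROR: {err}", file=sys.stderr)
        return EXIT_CONTRACT_VIOLATION
    except Exception as err:  # last-resort guard at the CLI boundary
        print(f"ERROR: unexpected failure: {err}", file=sys.stderr)
        return EXIT_RUNTIME_FAILURE
    return EXIT_OK
```

Passing the components in as callables keeps the orchestration testable: failure fixtures become stub functions that raise `ContractError`.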
5. Implementation Guide
5.1 Development Environment Setup
# create reproducible environment and lock dependencies
# (example commands; venv + pip shown, but any pinned-environment tool works)
python -m venv .venv
source .venv/bin/activate
pip install scikit-learn pandas pyyaml
pip freeze > requirements.lock
5.2 Project Structure
baseline-system/
|-- configs/
|-- artifacts/
|-- reports/
|-- fixtures/
`-- src/
5.3 The Core Question You’re Answering
Can this workflow be rerun by another engineer with the same conclusions?
5.4 Concepts You Must Understand First
- Deterministic experiments
- Schema contracts
- Decision reporting
5.5 Questions to Guide Your Design
- Which metadata is mandatory for auditability?
- Which failures should stop execution immediately?
- How do you communicate residual risk clearly?
5.6 Thinking Exercise
Write a runbook section called "why two runs can diverge" with a mitigation for each cause.
5.7 The Interview Questions They’ll Ask
- What defines reproducibility in ML pipelines?
- Why is schema validation critical before model loading?
- What belongs in a model card?
5.8 Hints in Layers
- Start with config schema and fail-fast rules.
- Hash config/data before training.
- Treat report generation as required output, not optional.
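The second hint is a few lines of `hashlib`. A sketch follows; the seven-character truncation mirrors the demo transcripts and is otherwise an arbitrary choice:

```python
import hashlib
import json

def hash_bytes(data: bytes) -> str:
    """Short, stable content hash for provenance records."""
    return hashlib.sha256(data).hexdigest()[:7]

def hash_config(config: dict) -> str:
    # Canonical JSON (sorted keys) so dict ordering never changes the hash.
    return hash_bytes(json.dumps(config, sort_keys=True).encode("utf-8"))

def hash_dataset_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Stream the file so large datasets hash without loading into memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            digest.update(chunk)
    return digest.hexdigest()[:7]

# Key order must not matter for provenance:
assert hash_config({"seed": 42, "target": "churned"}) == hash_config({"target": "churned", "seed": 42})
```

Hashing a canonical serialization, not the raw file bytes of the config, is what makes "same config" a checkable claim rather than a convention.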
5.9 Books That Will Help
| Topic | Book | Chapter |
|---|---|---|
| Practical ML workflows | Hands-On Machine Learning | Ch. 2-3 |
| Engineering reliability | The Pragmatic Programmer | Ch. 8-9 |
5.10 Implementation Phases
- Phase 1: config + schema validation.
- Phase 2: deterministic training/eval runner.
- Phase 3: artifact packaging + model card.
5.11 Key Implementation Decisions
| Decision | Options | Recommendation | Rationale |
|---|---|---|---|
| Failure mode | warn, fail-fast | fail-fast for contract violations | prevents silent corruption |
| Artifact structure | flat, run-based | run-based directories | traceability |
6. Testing Strategy
6.1 Test Categories
| Category | Purpose | Examples |
|---|---|---|
| Unit | schema/config validation | missing key failure |
| Integration | full baseline run | artifact set complete |
| Edge | failure fixtures | non-zero exit code checks |
6.2 Critical Test Cases
- Golden config produces fixed metric output.
- Missing target fixture exits with code 2.
- Clean environment rerun reproduces report.
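Exit-code checks like the second case can be automated with `subprocess`. Since the real `ml-lab` binary is outside the scope of this sketch, the example fakes a failing CLI with an inline script to show the pattern:

```python
import subprocess
import sys

def exit_code_of(argv):
    """Run a command and return its exit code; the core of a failure-fixture test."""
    return subprocess.run(argv, capture_output=True).returncode

# Stand-in for: ml-lab run-baseline --config fixtures/missing_target.yaml
# The inline script simply exits 2, the contract-violation code from the demos.
fake_failing_cli = [sys.executable, "-c", "import sys; sys.exit(2)"]
assert exit_code_of(fake_failing_cli) == 2
```

In the real suite, `argv` would be the actual CLI invocation against each failure fixture, with one assertion per documented exit code.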
6.3 Test Data
- fixed churn config + dataset
- schema-failure fixture
7. Common Pitfalls & Debugging
7.1 Frequent Mistakes
| Pitfall | Symptom | Solution |
|---|---|---|
| Unpinned dependencies | metric drift across machines | lock versions and log them |
| Missing schema contract | inference-time crashes | save and validate schema artifact |
| Report optionality | unclear deployment decisions | require report generation in success path |
7.2 Debugging Strategies
- Compare run metadata first (hashes/versions/seeds).
- Diff reports across runs before debugging model details.
7.3 Performance Traps
Generating heavy plots on every run; keep the baseline report lightweight.
8. Extensions & Challenges
8.1 Beginner Extensions
- Add run summary index page.
8.2 Intermediate Extensions
- Add drift check against previous run artifacts.
8.3 Advanced Extensions
- Integrate with CI pipeline and scheduled baseline checks.
9. Real-World Connections
9.1 Industry Applications
- Team ML baseline templates
- Governance and compliance evidence trails
9.2 Related Open Source Projects
- MLflow
- DVC
9.3 Interview Relevance
Demonstrates transition from experimentation to production-minded ML engineering.
10. Resources
10.1 Essential Reading
- scikit-learn pipeline documentation
10.2 Video Resources
- MLOps baseline and reproducibility talks
10.3 Tools & Documentation
- https://scikit-learn.org/stable/modules/compose.html
10.4 Related Projects in This Series
- Previous: Project 5
11. Self-Assessment Checklist
11.1 Understanding
- I can explain every field in my run metadata.
11.2 Implementation
- Golden path and failure path both work as documented.
11.3 Growth
- I documented one production hardening step for next iteration.
12. Submission / Completion Criteria
Minimum Viable Completion:
- One-command baseline run with complete artifacts.
Full Completion:
- Includes deterministic rerun proof and failure handling.
Excellence (Going Above & Beyond):
- Includes governance-ready model card with threshold and risk sections.