Project 6: Reproducible ML Baseline and Reporting System

Package a full tabular ML baseline workflow with deterministic runs, artifacts, and decision-ready reporting.

Quick Reference

| Attribute | Value |
| --- | --- |
| Difficulty | Level 3: Advanced |
| Time Estimate | 14-20 hours |
| Main Programming Language | Python |
| Alternative Programming Languages | R, Julia |
| Coolness Level | Level 3: Genuinely Clever |
| Business Potential | 3. Service & Support Model |
| Prerequisites | Projects 1-5, pipeline and metric discipline |
| Key Topics | Reproducibility contracts, artifact packaging, model reporting |

1. Learning Objectives

  1. Build one-command reproducible ML baseline execution.
  2. Enforce schema and config contracts before training.
  3. Export model artifact + metadata + decision report.
  4. Demonstrate deterministic golden path and explicit failure path.

2. All Theory Needed (Per-Concept Breakdown)

2.1 Reproducibility as an Engineering Contract

Fundamentals

Reproducibility means same code + same data + same config -> same conclusion within defined tolerance.

Deep Dive into the concept

Notebook success often fails in team settings because hidden local assumptions are not captured: implicit random seeds, unpinned dependencies, undocumented preprocessing, and ad hoc data filters. Reproducibility formalizes these assumptions into explicit contracts.

A robust baseline run should include:

  • config hash,
  • dataset snapshot hash,
  • dependency versions,
  • deterministic split/seed policy,
  • full metric and threshold report,
  • serialized model + preprocessing package.
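The config and dataset hashes above can be computed with the standard library alone. A minimal sketch (names like `sha256_short` and the 7-character truncation are illustrative choices, matching the short hashes shown in the example report below):

```python
import hashlib
import json

def sha256_short(data: bytes, length: int = 7) -> str:
    """Short content hash, similar to the truncated hashes in the run report."""
    return hashlib.sha256(data).hexdigest()[:length]

def hash_config(config: dict) -> str:
    # Canonical JSON (sorted keys) so logically equal configs hash identically.
    canonical = json.dumps(config, sort_keys=True).encode("utf-8")
    return sha256_short(canonical)

def hash_dataset(path: str) -> str:
    # Hash raw file bytes; stream in chunks so large files fit in memory.
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()[:7]

config = {"target": "churned", "seed": 42, "test_size": 0.2}
print(hash_config(config))
```

Sorting keys before hashing matters: two configs that differ only in key order should produce the same hash, or provenance comparisons will report spurious drift.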

Schema validation is the first gate. If target column or required features are missing, fail early with explicit exit codes. This protects teams from silent quality regressions and misconfigured jobs.
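A fail-fast schema gate can be as simple as collecting violations before training starts. A sketch, assuming the exit-code convention used later in this document (0 = success, 2 = contract violation); the function and constant names are illustrative:

```python
import sys

# Exit codes assumed for illustration: 0 = success, 2 = contract violation.
EXIT_OK, EXIT_CONTRACT = 0, 2

def validate_schema(columns: list[str], required: list[str], target: str) -> list[str]:
    """Return a list of violations; an empty list means the gate passes."""
    errors = []
    if target not in columns:
        errors.append(f'required target column "{target}" not found')
    for col in required:
        if col not in columns:
            errors.append(f'required feature column "{col}" not found')
    return errors

errors = validate_schema(["age", "plan"], ["age", "plan", "tenure"], "churned")
for e in errors:
    print(f"ERROR: {e}", file=sys.stderr)
# In the real runner this would end with:
# sys.exit(EXIT_CONTRACT if errors else EXIT_OK)
```

Collecting all violations before exiting (rather than stopping at the first) gives operators one complete error report per failed job.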

Decision reporting is the second gate. Metrics should include baseline comparison, threshold rationale, and residual risks. A model card-style summary helps product and operations interpret limits.
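Threshold rationale can be made mechanical rather than ad hoc. A sketch of a precision-floor policy like the one in the demo transcript (the 0.65 floor and the `choose_threshold` helper are illustrative, not a prescribed API):

```python
def choose_threshold(scores, labels, precision_floor=0.65):
    """Lowest threshold whose precision meets the floor.

    Recall decreases monotonically as the threshold rises, so the lowest
    passing threshold maximizes recall subject to the precision constraint.
    """
    for t in sorted(set(scores)):
        predicted = [s >= t for s in scores]
        tp = sum(p and y for p, y in zip(predicted, labels))
        fp = sum(p and not y for p, y in zip(predicted, labels))
        if tp + fp == 0:
            continue
        if tp / (tp + fp) >= precision_floor:
            return t
    return None  # no threshold satisfies the floor: a reportable finding itself

t = choose_threshold([0.9, 0.8, 0.3, 0.2], [True, True, False, False])
```

Recording the policy ("precision >= 0.65") alongside the chosen value in the report lets reviewers re-derive the threshold instead of trusting a bare number.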

Finally, include both success and failure demos. A system that only proves happy path is not production-ready.

How this fits into the project series

Capstone concept that integrates all prior projects.

Definitions & key terms

  • artifact: saved output of training (model, preprocessing, metadata).
  • model card: structured summary of model behavior and limits.
  • schema drift: mismatch between expected and observed columns/types.
  • deterministic run: repeatable execution under fixed conditions.

Mental model diagram

Config + Data + Code -> Validation -> Train/Eval -> Artifacts -> Decision Report
                 |-> Fail-fast errors with explicit exit codes

How it works

  1. Parse config and validate schema.
  2. Compute hashes and environment metadata.
  3. Execute training/evaluation pipeline.
  4. Export artifacts and reports.
  5. Return deterministic exit code.
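The deterministic split/seed policy from step 3 can be sketched with the standard library; the point is a local, seeded RNG rather than global random state:

```python
import random

def deterministic_split(n_rows: int, test_fraction: float, seed: int):
    """Same seed -> same index split, independent of process or machine."""
    indices = list(range(n_rows))
    # A local Random instance avoids leaking or depending on global RNG state.
    random.Random(seed).shuffle(indices)
    cut = int(n_rows * (1 - test_fraction))
    return indices[:cut], indices[cut:]

train_idx, test_idx = deterministic_split(100, 0.2, seed=42)
```

The seed belongs in the run metadata: without it, "deterministic run" is an unverifiable claim.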

Failure modes:

  • hidden defaults,
  • missing metadata,
  • artifact without schema contract.

Minimal concrete example

run_id: 2026-02-11T09:30Z
config_hash: 9f5d1a2
dataset_hash: 72a83c1
metric: ROC-AUC=0.829
threshold: 0.37 (precision floor policy)
status: DEPLOY_CANDIDATE_WITH_NOTES

Common misconceptions

  • "Saving the model file alone is enough." Correction: without preprocessing/schema metadata, inference can break silently.

Check-your-understanding questions

  1. Why hash both config and dataset?
  2. What should trigger non-zero exit codes?
  3. Why include failure demos in project output?

Check-your-understanding answers

  1. To prove exact run provenance and reproducibility.
  2. Schema violations, missing target, invalid config, critical metric computation failures.
  3. Reliability includes known-failure behavior, not just success path.

Real-world applications

  • Baseline pipelines for product scoring.
  • Team handoff workflows.
  • Governance and audit readiness.

Where you’ll apply it

  • This capstone and any production-oriented ML workflow.

References

  • scikit-learn pipeline docs: https://scikit-learn.org/stable/modules/compose.html

Key insights

A reproducible baseline is a productized experiment, not a notebook snapshot.

Summary

This project turns your ML skills into dependable team output.

Homework/Exercises to practice the concept

  1. Define required metadata fields for every run.
  2. Create two failure fixtures and expected exit codes.
  3. Re-run baseline in fresh environment and compare outputs.

Solutions to the homework/exercises

  1. Include config hash, data hash, seed, versions, metrics, threshold.
  2. Example: missing target (2), schema mismatch (2).
  3. Metrics should match within tolerance; large drift signals environment issues.
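For exercise 3, "match within tolerance" can be made precise with `math.isclose`. A minimal sketch (the `metrics_match` helper and the 1e-3 relative tolerance are illustrative choices):

```python
import math

def metrics_match(run_a: dict, run_b: dict, rel_tol: float = 1e-3) -> bool:
    """True when both runs report the same metrics within relative tolerance."""
    if run_a.keys() != run_b.keys():
        return False
    return all(math.isclose(run_a[k], run_b[k], rel_tol=rel_tol) for k in run_a)

print(metrics_match({"roc_auc": 0.829}, {"roc_auc": 0.8291}))  # small drift: OK
```

Pick the tolerance deliberately: too tight and benign floating-point noise fails reruns; too loose and real environment drift slips through.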

3. Project Specification

3.1 What You Will Build

A CLI-style baseline runner producing deterministic training/evaluation outputs and structured artifacts.

3.2 Functional Requirements

  1. Validate config and schema before training.
  2. Execute deterministic split/train/evaluate flow.
  3. Export model, schema contract, metrics JSON, and model card markdown.
  4. Return explicit exit codes for success/failure paths.

3.3 Non-Functional Requirements

  • Performance: complete standard run under 3 minutes.
  • Reliability: deterministic metrics under fixed seeds.
  • Usability: one command and clear report paths.

3.4 Example Usage / Output

$ ml-lab run-baseline --config configs/churn_baseline.yaml
Status: success
Artifacts written
Decision: deploy_candidate_with_notes

3.5 Data Formats / Schemas / Protocols

  • YAML/JSON config schema.
  • CSV/parquet input with required column contract.
  • Output artifacts under timestamped run directory.
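The timestamped run directory can be created up front so every later step writes into one traceable location. A sketch, assuming the `artifacts/` and `reports/` subdirectories shown in the transcripts (the `make_run_dir` name and timestamp format are illustrative):

```python
import tempfile
from datetime import datetime, timezone
from pathlib import Path

def make_run_dir(root: str) -> Path:
    """Create a timestamped run directory with the assumed artifact layout."""
    run_id = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H%M%SZ")
    run_dir = Path(root) / run_id
    for sub in ("artifacts", "reports"):
        (run_dir / sub).mkdir(parents=True, exist_ok=True)
    return run_dir

# Demo against a throwaway root; the real runner would use something like "runs/".
run_dir = make_run_dir(tempfile.mkdtemp())
```

Using UTC in the run ID avoids ambiguous orderings when runs come from machines in different time zones.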

3.6 Edge Cases

  • Missing target column.
  • New unseen category in required strict mode.
  • Zero variance feature set.

3.7 Real World Outcome

3.7.1 How to Run (Copy/Paste)

ml-lab run-baseline --config configs/churn_baseline.yaml

3.7.2 Golden Path Demo (Deterministic)

$ ml-lab run-baseline --config fixtures/churn_fixed.yaml
Validation ROC-AUC: 0.836 +/- 0.018
Test ROC-AUC: 0.829
Exit code: 0

3.7.3 Exact Terminal Transcript (CLI)

$ ml-lab run-baseline --config fixtures/churn_fixed.yaml
Config hash: 9f5d1a2
Dataset hash: 72a83c1
Validation ROC-AUC: 0.836 +/- 0.018
Test ROC-AUC: 0.829
Threshold: 0.37 (precision>=0.65)
Saved: artifacts/model.joblib
Saved: artifacts/preprocess_schema.json
Saved: reports/model_card.md
Exit code: 0

$ ml-lab run-baseline --config fixtures/missing_target.yaml
ERROR: required target column "churned" not found
Exit code: 2

4. Solution Architecture

4.1 High-Level Design

Config Loader -> Schema Gate -> Pipeline Runner -> Artifact Packager -> Report Generator

4.2 Key Components

| Component | Responsibility | Key Decisions |
| --- | --- | --- |
| Schema Gate | input contract validation | strict vs warn mode |
| Pipeline Runner | deterministic train/eval | fixed seeds and split |
| Artifact Packager | model + schema + metrics | versioned directory layout |
| Report Generator | model card and decision notes | include residual risks |

4.3 Data Structures (No Full Code)

run_metadata: {run_id, hashes, versions, seed}
artifact_index: {model_path, schema_path, metrics_path, report_path}
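These two structures map naturally onto frozen dataclasses (field names follow the sketch above; the types are illustrative):

```python
from dataclasses import dataclass, asdict

@dataclass(frozen=True)
class RunMetadata:
    run_id: str
    config_hash: str
    dataset_hash: str
    versions: dict   # e.g. {"python": "3.12", "scikit-learn": "1.5"}
    seed: int

@dataclass(frozen=True)
class ArtifactIndex:
    model_path: str
    schema_path: str
    metrics_path: str
    report_path: str

meta = RunMetadata("2026-02-11T0930Z", "9f5d1a2", "72a83c1",
                   {"python": "3.12", "scikit-learn": "1.5"}, seed=42)
```

`frozen=True` makes the metadata immutable after creation, so no later pipeline stage can quietly rewrite provenance fields, and `asdict` gives a direct path to the JSON artifact.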

4.4 Algorithm Overview

  1. Load config and validate fields.
  2. Validate dataset schema.
  3. Run training and evaluation.
  4. Export artifacts and decision report.
  5. Exit with status code.

5. Implementation Guide

5.1 Development Environment Setup

# create a reproducible environment and lock dependencies
# (package list illustrative; pin whatever your baseline actually imports)
python -m venv .venv
source .venv/bin/activate
pip install scikit-learn pandas pyyaml joblib
pip freeze > requirements.lock

5.2 Project Structure

baseline-system/
|-- configs/
|-- artifacts/
|-- reports/
|-- fixtures/
`-- src/

5.3 The Core Question You’re Answering

Can this workflow be rerun by another engineer with the same conclusions?

5.4 Concepts You Must Understand First

  • Deterministic experiments
  • Schema contracts
  • Decision reporting

5.5 Questions to Guide Your Design

  • Which metadata is mandatory for auditability?
  • Which failures should stop execution immediately?
  • How do you communicate residual risk clearly?

5.6 Thinking Exercise

Write a runbook section called "why two runs can diverge" with a mitigation for each cause.

5.7 The Interview Questions They’ll Ask

  1. What defines reproducibility in ML pipelines?
  2. Why is schema validation critical before model loading?
  3. What belongs in a model card?

5.8 Hints in Layers

  • Start with config schema and fail-fast rules.
  • Hash config/data before training.
  • Treat report generation as required output, not optional.

5.9 Books That Will Help

| Topic | Book | Chapter |
| --- | --- | --- |
| Practical ML workflows | Hands-On Machine Learning | Ch. 2-3 |
| Engineering reliability | The Pragmatic Programmer | Ch. 8-9 |

5.10 Implementation Phases

  • Phase 1: config + schema validation.
  • Phase 2: deterministic training/eval runner.
  • Phase 3: artifact packaging + model card.

5.11 Key Implementation Decisions

| Decision | Options | Recommendation | Rationale |
| --- | --- | --- | --- |
| Failure mode | warn, fail-fast | fail-fast for contract violations | prevents silent corruption |
| Artifact structure | flat, run-based | run-based directories | traceability |

6. Testing Strategy

6.1 Test Categories

| Category | Purpose | Examples |
| --- | --- | --- |
| Unit | schema/config validation | missing key failure |
| Integration | full baseline run | artifact set complete |
| Edge | failure fixtures | non-zero exit code checks |

6.2 Critical Test Cases

  1. Golden config produces fixed metric output.
  2. Missing target fixture exits with code 2.
  3. Clean environment rerun reproduces report.
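Test cases 1 and 2 might look like the following. The `run_baseline` function here is a stub standing in for the real runner, returning the exit codes the contract specifies (0 = success, 2 = contract violation); in a real test suite you would invoke the CLI and inspect its return code instead:

```python
# Stub standing in for the real runner (illustrative, not the actual CLI).
def run_baseline(config: dict) -> int:
    if config.get("target") not in config.get("columns", []):
        return 2  # contract violation: missing target column
    return 0      # success

def test_missing_target_exits_2():
    assert run_baseline({"target": "churned", "columns": ["age"]}) == 2

def test_golden_config_exits_0():
    assert run_baseline({"target": "churned", "columns": ["age", "churned"]}) == 0

test_missing_target_exits_2()
test_golden_config_exits_0()
```

Asserting on exit codes, rather than on error-message text, keeps the failure-path tests stable as wording evolves.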

6.3 Test Data

  • fixed churn config + dataset
  • schema-failure fixture

7. Common Pitfalls & Debugging

7.1 Frequent Mistakes

| Pitfall | Symptom | Solution |
| --- | --- | --- |
| Unpinned dependencies | metric drift across machines | lock versions and log them |
| Missing schema contract | inference-time crashes | save and validate schema artifact |
| Report optionality | unclear deployment decisions | require report generation in success path |

7.2 Debugging Strategies

  • Compare run metadata first (hashes/versions/seeds).
  • Diff reports across runs before debugging model details.
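"Compare run metadata first" can be a one-function tool. A sketch (the `diff_run_metadata` name is illustrative):

```python
def diff_run_metadata(a: dict, b: dict) -> dict:
    """Fields that differ between two runs; check these before touching the model."""
    keys = a.keys() | b.keys()
    return {k: (a.get(k), b.get(k)) for k in keys if a.get(k) != b.get(k)}

delta = diff_run_metadata(
    {"config_hash": "9f5d1a2", "seed": 42, "sklearn": "1.4"},
    {"config_hash": "9f5d1a2", "seed": 42, "sklearn": "1.5"},
)
print(delta)  # only the differing field survives: a dependency version mismatch
```

An empty diff with divergent metrics points at true nondeterminism (e.g. unseeded parallelism); a non-empty diff usually names the culprit directly.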

7.3 Performance Traps

Generating heavy plots in all runs; keep baseline report lightweight.


8. Extensions & Challenges

8.1 Beginner Extensions

  • Add run summary index page.

8.2 Intermediate Extensions

  • Add drift check against previous run artifacts.

8.3 Advanced Extensions

  • Integrate with CI pipeline and scheduled baseline checks.

9. Real-World Connections

9.1 Industry Applications

  • Team ML baseline templates
  • Governance and compliance evidence trails

9.2 Related Tools

  • MLflow
  • DVC

9.3 Interview Relevance

Demonstrates transition from experimentation to production-minded ML engineering.


10. Resources

10.1 Essential Reading

  • scikit-learn pipeline documentation

10.2 Video Resources

  • MLOps baseline and reproducibility talks

10.3 Tools & Documentation

  • https://scikit-learn.org/stable/modules/compose.html

11. Self-Assessment Checklist

11.1 Understanding

  • I can explain every field in my run metadata.

11.2 Implementation

  • Golden path and failure path both work as documented.

11.3 Growth

  • I documented one production hardening step for next iteration.

12. Submission / Completion Criteria

Minimum Viable Completion:

  • One-command baseline run with complete artifacts.

Full Completion:

  • Includes deterministic rerun proof and failure handling.

Excellence (Going Above & Beyond):

  • Includes governance-ready model card with threshold and risk sections.