Project 20: Complete ML Pipeline from Scratch

A complete machine learning pipeline that takes raw data and produces a trained, evaluated, deployable model—all from scratch. No sklearn, no pandas, no frameworks. Just your mathematical implementations from the previous projects, integrated into a cohesive system.

Quick Reference

Attribute | Value
Difficulty | Level 5: Master (The First-Principles Wizard)
Main Programming Language | Python
Alternative Programming Languages | C++, Julia, Rust
Coolness Level | Level 5: Pure Magic (Super Cool)
Business Potential | 4. The “Open Core” Infrastructure (Enterprise Scale)
Knowledge Area | Machine Learning / Full Stack ML
Software or Tool | Complete ML System
Main Book | “Designing Machine Learning Systems” by Chip Huyen

1. Learning Objectives

By completing this project, you will:

  1. Translate math definitions into deterministic implementation steps.
  2. Build validation checks that make correctness observable.
  3. Diagnose numerical, logical, and data-shape failures early.
  4. Explain tradeoffs in interviews using evidence from your own build.

2. All Theory Needed (Per-Concept Breakdown)

This project applies the following theory clusters:

  • Symbolic-to-numeric translation (expressions, data shapes, invariants)
  • Stability constraints (precision, scaling, stopping criteria)
  • Optimization or inference logic (depending on project objective)
  • Evaluation discipline (error analysis, test coverage, reproducibility)

Concept A: Mathematical Representation Discipline

Fundamentals: A math expression is not executable until you define representation, ordering, and domain constraints. The same equation can be represented as a token stream, tree, matrix pipeline, or probability graph. Choosing representation determines what bugs you can catch early.

Deep dive into the concept: Most project failures begin before algorithm selection: they start with ambiguous representation. If your parser cannot distinguish unary minus from subtraction, your calculator fails. If your matrix dimensions are implicit rather than validated, your linear algebra pipeline fails silently. If your probabilistic assumptions (independence, stationarity, or class priors) are not explicit, your inference can look accurate on one split and collapse on another. The core implementation move is to treat representation as a contract. Define each object with shape, domain, and semantic intent. Then enforce invariants at boundaries: input parser, preprocessing, training loop, evaluation stage. This makes debugging local instead of global.

How this fits this project: You will encode each operation with explicit contracts and invariant checks.

Definitions & key terms

  • Invariant: Property that must hold before and after each operation.
  • Shape contract: Expected dimensional structure of vectors/matrices/tensors.
  • Domain constraint: Allowed value range (for example log input > 0).

Mental model diagram

User Input -> Representation Layer -> Validated Operation -> Observable Output
              (tokens/shapes)        (invariants pass)       (tests/plots/logs)

How it works

  1. Parse/ingest data into typed structures.
  2. Validate shape/domain invariants.
  3. Execute operation.
  4. Compare observed output with expected behavior.
  5. Record failure signature if mismatch appears.
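These steps can be sketched directly in Python. The helper below is illustrative (its name and error messages are not a fixed API), but it shows a shape contract and a domain constraint enforced at a stage boundary:

```python
import numpy as np

def validate_features(X, n_features, name="X"):
    """Boundary check: enforce the shape contract and domain constraint
    before any math runs. Name and messages are illustrative."""
    X = np.asarray(X, dtype=float)
    if X.ndim != 2 or X.shape[1] != n_features:
        raise ValueError(f"{name}: shape contract violated, "
                         f"expected (*, {n_features}), got {X.shape}")
    if not np.isfinite(X).all():
        raise ValueError(f"{name}: domain constraint violated (NaN or inf)")
    return X

# A valid input passes through unchanged.
X_ok = validate_features([[1.0, 2.0], [3.0, 4.0]], n_features=2)

# An invalid input fails loudly at the boundary, not deep in the math.
try:
    validate_features([[1.0, float("nan")]], n_features=2)
except ValueError as err:
    print("caught:", err)
```

Because the check runs at the boundary, the failure signature names the offending stage, which keeps debugging local.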

Minimal concrete example

PSEUDOCODE
read expression
tokenize with precedence rules
if token sequence invalid -> return syntax error
evaluate tree
if domain violation -> return bounded diagnostic
print value and confidence check
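A runnable Python version of the domain-violation branch, using a hypothetical safe_eval_log helper that returns a bounded diagnostic instead of crashing:

```python
import math

def safe_eval_log(x):
    """Hypothetical helper: domain check before execution, returning a
    bounded diagnostic tuple instead of raising on bad input."""
    if x <= 0:
        return ("domain_error", f"log undefined for x={x}")
    return ("ok", math.log(x))

print(safe_eval_log(1.0))   # ("ok", 0.0)
print(safe_eval_log(-2.0))  # bounded diagnostic, no crash
```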

Common misconceptions

  • “If it runs once, representation is correct.” -> false.
  • “Type checks are enough without shape checks.” -> false.

Check-your-understanding questions

  1. Which invariant catches division-by-zero earliest?
  2. Why does shape validation belong at boundaries rather than only in core logic?
  3. Predict failure if tokenization ignores unary minus.

Check-your-understanding answers

  1. Domain check on denominator before operation execution.
  2. Boundary validation keeps errors local and diagnostic.
  3. Expressions like -2^2 get misinterpreted and produce wrong precedence behavior.

Real-world applications: Feature preprocessing, model-serving input validation, and experiment-tracking schema enforcement.

Where you’ll apply it: This project and every downstream project in the sprint.

References

  • CSAPP (Bryant & O’Hallaron), floating-point chapter
  • Math for Programmers (Paul Orland), representation-oriented chapters

Key insight: Correct representation reduces the complexity of every later decision.

Summary: Stable ML math implementations start with explicit contracts, not implicit assumptions.

Homework/Exercises

  1. Write five invariants for your project.
  2. Build a failing test input for each invariant.

Solutions

  1. Include at least one shape, one domain, one convergence, one reproducibility, and one output-range invariant.
  2. Each failing input should trigger exactly one diagnostic to keep root-cause analysis clean.

3. Build Blueprint

  1. Scope the smallest end-to-end slice that produces visible output.
  2. Add deterministic tests and edge-case probes.
  3. Layer complexity only after baseline behavior is stable.
  4. Add metrics logging before optimization.
  5. Run failure drills: perturb inputs, scale values, and check stability.
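Step 5's failure drill can be sketched as a scale-perturbation probe (the helper names are illustrative):

```python
import numpy as np

def relative_drift(fn, X, scale=10.0):
    """Stability probe (illustrative): run fn at two input scales and
    report the worst normalized disagreement after rescaling."""
    base = np.asarray(fn(X))
    rescaled = np.asarray(fn(X * scale)) / scale
    return float(np.max(np.abs(base - rescaled) / (np.abs(base) + 1e-12)))

# Example: the mean is scale-equivariant, so the drill should report
# drift near machine epsilon; a large value would flag instability.
X = np.linspace(1.0, 100.0, 50)
drift = relative_drift(np.mean, X)
print(f"relative drift: {drift:.2e}")
```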

4. Real-World Outcome (Target)

$ python ml_pipeline.py train titanic.csv --target=survived

=== ML Pipeline: Titanic Survival Prediction ===

Step 1: Data Loading
  Loaded 891 samples, 12 features
  Missing values: age (177), cabin (687), embarked (2)

Step 2: Preprocessing (your implementations!)
  - Imputed missing ages with median
  - One-hot encoded categorical features
  - Normalized numerical features (mean=0, std=1)
  Final feature matrix: 891 × 24

Step 3: Feature Engineering
  - Applied PCA: kept 15 components (95% variance)
  - Created polynomial features (degree 2) for top 5

Step 4: Model Training (5-fold cross-validation)
  Logistic Regression:  Accuracy = 0.782 ± 0.034
  Neural Network (1 layer): Accuracy = 0.798 ± 0.041
  Neural Network (2 layers): Accuracy = 0.812 ± 0.038

Step 5: Hyperparameter Tuning (Neural Network)
  Grid search over learning_rate, hidden_size, regularization
  Best: lr=0.01, hidden=64, reg=0.001
  Tuned accuracy: 0.823 ± 0.029

Step 6: Final Evaluation
  Test set accuracy: 0.793
  Confusion matrix:
                    Predicted
                    Died  Survived
    Actual Died       98        15
    Actual Survived   22        44

  Precision: 0.75, Recall: 0.67, F1: 0.71

Step 7: Model Saved
  → model.pkl (contains weights, normalization params, feature names)

$ python ml_pipeline.py predict model.pkl passenger.json
Prediction: SURVIVED (probability: 0.73)
Key factors: Sex (female), Pclass (1), Age (29)

Implementation hints: the pipeline architecture composes components you built in earlier projects:

class MLPipeline:
    def __init__(self):
        self.preprocessor = Preprocessor()  # Project 13 (stats)
        self.pca = PCA()                     # Project 7
        self.model = NeuralNetwork()         # Project 19

    def fit(self, X, y):
        X = self.preprocessor.fit_transform(X)
        X = self.pca.fit_transform(X)
        self.model.train(X, y)

    def predict(self, X):
        X = self.preprocessor.transform(X)
        X = self.pca.transform(X)
        return self.model.predict(X)

Cross-validation splits the data into k folds, trains on k-1 of them, tests on the remaining fold, and rotates through all k. Averaging the fold scores estimates generalization.

Learning milestones:

  1. Pipeline runs end-to-end → You can integrate ML components
  2. Cross-validation gives reliable estimates → You understand proper evaluation
  3. You can explain every mathematical operation → You’ve truly learned ML from first principles

5. Core Design Notes from Main Guide

Core Question

“What does it really mean to build a machine learning system from nothing?”

Most ML practitioners grab sklearn, call model.fit(), and move on. But what happens inside? This capstone project forces you to answer that question completely. You will build every component yourself: loading and cleaning data, engineering features, splitting into train/validation/test, implementing models, selecting hyperparameters, and measuring performance. When you finish, you will have a system that truly belongs to you, not because you downloaded it, but because you built every mathematical piece. This is the difference between using ML and understanding ML.

Concepts You Must Understand First

Stop and research these before coding:

  1. Data Preprocessing and Cleaning
    • How do you handle missing values mathematically (mean imputation, mode imputation)?
    • What is feature scaling and why do different models need different scaling?
    • How do you encode categorical variables (one-hot encoding, label encoding)?
    • Book Reference: “Designing Machine Learning Systems” Chapter 4 - Chip Huyen
  2. Feature Engineering
    • What makes a good feature? How do you create polynomial features?
    • When and why should you use PCA for dimensionality reduction?
    • How do you select features (correlation analysis, mutual information)?
    • Book Reference: “Feature Engineering for Machine Learning” Chapters 1-3 - Zheng & Casari
  3. Train/Validation/Test Split Philosophy
    • Why do we need three sets, not just train and test?
    • What is data leakage and how does it invalidate your results?
    • How does time-series data change the splitting strategy?
    • Book Reference: “Hands-On Machine Learning” Chapter 2 - Aurélien Géron
  4. K-Fold Cross-Validation
    • Why is single train/test split unreliable?
    • How does K-fold give you a better estimate of generalization?
    • What is stratified K-fold and when do you need it?
    • Book Reference: “The Elements of Statistical Learning” Chapter 7 - Hastie et al.
  5. The Bias-Variance Tradeoff
    • What is the mathematical decomposition: Error = Bias^2 + Variance + Noise?
    • How does model complexity affect bias vs variance?
    • How do you diagnose if your model is underfitting or overfitting?
    • Book Reference: “Machine Learning” (Coursera) Week 6 - Andrew Ng
  6. Hyperparameter Tuning Strategies
    • What is the difference between model parameters and hyperparameters?
    • How does grid search work? What about random search?
    • What is Bayesian optimization and when is it worth the complexity?
    • Book Reference: “Designing Machine Learning Systems” Chapter 6 - Chip Huyen
  7. Model Evaluation Metrics
    • When should you use accuracy vs precision vs recall vs F1?
    • What is a confusion matrix and how do you interpret it?
    • What is the ROC curve and AUC? When are they misleading?
    • Book Reference: “Data Science for Business” Chapter 7 - Provost & Fawcett
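Concept 3's data-leakage warning is easy to demonstrate. This sketch uses synthetic data and shows why scaling statistics must come from the training split only:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=2.0, size=(100, 3))

# Correct order: split FIRST, then fit scaling statistics on train only.
X_train, X_test = X[:80], X[80:]
mean = X_train.mean(axis=0)
std = X_train.std(axis=0) + 1e-8

X_train_scaled = (X_train - mean) / std
X_test_scaled = (X_test - mean) / std   # test reuses train statistics

# Leaky alternative (do NOT do this): statistics from the full dataset,
# so the held-out rows have influenced the estimate.
leaky_mean = X.mean(axis=0)
print("train/full means differ:", not np.allclose(mean, leaky_mean))
```

The two mean vectors differ, which is exactly the information that would leak into evaluation if you scaled before splitting.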

Questions to Guide Your Design

Before implementing, think through these:

  1. Pipeline architecture: How will you chain preprocessing -> feature engineering -> model -> evaluation? Will you use classes or functions?

  2. Configuration management: How will you specify hyperparameters for tuning? A config file? Function arguments?

  3. Reproducibility: How will you ensure the same random seed gives the same results? What about saving/loading models?

  4. Metrics storage: How will you store and compare results across different models and hyperparameters?

  5. Early stopping: For iterative models, how do you decide when to stop training? Validation loss plateau?

  6. Model persistence: How will you save your trained model for later use? What format?

Thinking Exercise

Design the cross-validation loop on paper:

You have 100 samples and want to do 5-fold cross-validation for a model with two hyperparameters: learning_rate in [0.01, 0.1, 1.0] and regularization in [0.001, 0.01, 0.1].

  1. How many total model training runs will you perform?
    • 5 folds x 3 learning_rates x 3 regularizations = 45 runs
  2. For each fold, what data goes where?
    • Fold 1: samples 0-19 test, samples 20-99 train
    • Fold 2: samples 20-39 test, samples 0-19 + 40-99 train
    • (and so on…)
  3. How do you aggregate the results?
    • For each hyperparameter combo, average the 5 fold scores
    • Select the combo with best average score
    • Final evaluation: retrain on ALL training data, test on held-out test set
  4. What can go wrong?
    • Data leakage if you scale using the whole dataset before splitting
    • Overfitting to validation if you tune too many hyperparameters
    • Not shuffling data before splitting (problematic for ordered data)
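A quick script can confirm the arithmetic in this exercise (the variable names are illustrative):

```python
from itertools import product

n_samples, k = 100, 5
param_grid = {"learning_rate": [0.01, 0.1, 1.0],
              "regularization": [0.001, 0.01, 0.1]}

# 9 hyperparameter combos, each trained k times during tuning.
combos = list(product(*param_grid.values()))
total_runs = k * len(combos)
print(total_runs)  # 45

fold_size = n_samples // k
folds = [list(range(i * fold_size, (i + 1) * fold_size)) for i in range(k)]
print(folds[0][0], folds[0][-1])  # fold 1 holds out samples 0 through 19
print(folds[1][0], folds[1][-1])  # fold 2 holds out samples 20 through 39
```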

Interview Questions

  1. “Walk me through how you would build an ML pipeline from scratch.”
    • Expected answer: Load data, explore and clean, engineer features, split train/val/test, implement models, tune hyperparameters with cross-validation, evaluate on test set, save model.
  2. “What is data leakage and how do you prevent it?”
    • Expected answer: When information from test data influences training. Prevent by: fitting scalers/encoders only on training data, not using future information for time series, being careful with target-dependent features.
  3. “How do you choose between different models for a problem?”
    • Expected answer: Start simple (linear/logistic regression), measure baseline. Try more complex models if underfitting. Use cross-validation to compare fairly. Consider interpretability and computational cost.
  4. “Explain the bias-variance tradeoff with a concrete example.”
    • Expected answer: High bias = underfitting (model too simple). High variance = overfitting (model memorizes training data). Example: polynomial degree 1 has high bias, degree 20 has high variance, degree 3-5 might be optimal.
  5. “When would you use precision vs recall as your primary metric?”
    • Expected answer: High precision when false positives are costly (spam filtering: a legitimate email flagged as spam is a false positive you cannot afford). High recall when false negatives are costly (cancer screening: you do not want to miss a case).
  6. “How do you handle imbalanced datasets?”
    • Expected answer: Stratified sampling, class weights in loss function, oversampling minority (SMOTE), undersampling majority, or use appropriate metrics (F1, AUC instead of accuracy).
  7. “What is the purpose of a validation set vs test set?”
    • Expected answer: Validation set guides model selection and hyperparameter tuning. Test set is only touched once at the end to estimate true generalization. If you repeatedly use the test set, you overfit to it.

Hints in Layers (Treat as pseudocode guidance)

Hint 1: Start with the data pipeline:

import numpy as np

class DataPipeline:
    def __init__(self):
        self.scaler_mean = None
        self.scaler_std = None

    def fit(self, X):
        # Learn scaling statistics from the training data only.
        self.scaler_mean = np.mean(X, axis=0)
        self.scaler_std = np.std(X, axis=0)

    def transform(self, X):
        # Epsilon guards against division by zero for constant features.
        return (X - self.scaler_mean) / (self.scaler_std + 1e-8)

    def fit_transform(self, X):
        self.fit(X)
        return self.transform(X)

Hint 2: Cross-validation structure:

import numpy as np

def cross_validate(X, y, model_class, hyperparams, k=5):
    # Assumes X and y are numpy arrays; shuffle beforehand for ordered data.
    n = len(X)
    fold_size = n // k  # note: leftover samples (n % k) are never validated on
    scores = []

    for i in range(k):
        val_idx = np.arange(i * fold_size, (i + 1) * fold_size)
        train_idx = np.setdiff1d(np.arange(n), val_idx)

        X_train, y_train = X[train_idx], y[train_idx]
        X_val, y_val = X[val_idx], y[val_idx]

        model = model_class(**hyperparams)
        model.fit(X_train, y_train)
        scores.append(model.evaluate(X_val, y_val))

    return np.mean(scores), np.std(scores)

Hint 3: Grid search over hyperparameters:

import numpy as np
from itertools import product

def grid_search(X, y, model_class, param_grid, k=5):
    best_score = -np.inf
    best_params = None

    # Cartesian product of all hyperparameter values
    for params in product(*param_grid.values()):
        hyperparams = dict(zip(param_grid.keys(), params))
        mean_score, std_score = cross_validate(X, y, model_class, hyperparams, k)

        if mean_score > best_score:
            best_score = mean_score
            best_params = hyperparams

    return best_params, best_score

Hint 4: Evaluation metrics:

import numpy as np

def confusion_matrix(y_true, y_pred):
    # Binary labels assumed: numpy arrays of 0s and 1s
    tp = np.sum((y_true == 1) & (y_pred == 1))
    tn = np.sum((y_true == 0) & (y_pred == 0))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    return tp, tn, fp, fn

def precision(y_true, y_pred):
    tp, tn, fp, fn = confusion_matrix(y_true, y_pred)
    return tp / (tp + fp + 1e-8)

def recall(y_true, y_pred):
    tp, tn, fp, fn = confusion_matrix(y_true, y_pred)
    return tp / (tp + fn + 1e-8)

def f1_score(y_true, y_pred):
    p, r = precision(y_true, y_pred), recall(y_true, y_pred)
    return 2 * p * r / (p + r + 1e-8)
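A hand-checkable toy case helps verify these formulas; the counts below are small enough to confirm on paper:

```python
import numpy as np

y_true = np.array([1, 1, 1, 0, 0, 0])
y_pred = np.array([1, 1, 0, 1, 0, 0])

tp = int(np.sum((y_true == 1) & (y_pred == 1)))  # 2
fp = int(np.sum((y_true == 0) & (y_pred == 1)))  # 1
fn = int(np.sum((y_true == 1) & (y_pred == 0)))  # 1

precision = tp / (tp + fp)                       # 2/3
recall = tp / (tp + fn)                          # 2/3
f1 = 2 * precision * recall / (precision + recall)
print(round(precision, 3), round(recall, 3), round(f1, 3))  # 0.667 0.667 0.667
```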

Hint 5: Complete pipeline orchestration:

# 1. Load and preprocess (load_data, train_test_split, and MyModel are
#    your own implementations from earlier steps)
X, y = load_data('dataset.csv')
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

pipeline = DataPipeline()
X_train = pipeline.fit_transform(X_train)
X_test = pipeline.transform(X_test)  # Use training statistics!

# 2. Hyperparameter tuning with cross-validation
param_grid = {'learning_rate': [0.01, 0.1], 'regularization': [0.001, 0.01]}
best_params, cv_score = grid_search(X_train, y_train, MyModel, param_grid)

# 3. Final training and evaluation
final_model = MyModel(**best_params)
final_model.fit(X_train, y_train)
test_score = final_model.evaluate(X_test, y_test)

print(f"CV Score: {cv_score:.4f}, Test Score: {test_score:.4f}")

Books That Will Help

Topic | Book | Chapter
ML System Design | “Designing Machine Learning Systems” by Chip Huyen | Chapters 2, 4, 6: Data, Features, Evaluation
Cross-Validation Theory | “The Elements of Statistical Learning” by Hastie et al. | Chapter 7: Model Assessment
Feature Engineering | “Feature Engineering for ML” by Zheng & Casari | Chapters 1-3: Numeric, Categorical, Text
Bias-Variance Tradeoff | “Machine Learning” (Coursera) by Andrew Ng | Week 6: Advice for Applying ML
Evaluation Metrics | “Data Science for Business” by Provost & Fawcett | Chapter 7: Evaluation Methods
Practical Pipeline | “Hands-On Machine Learning” by Aurélien Géron | Chapter 2: End-to-End Project


6. Validation, Pitfalls, and Completion

Common Pitfalls and Debugging

Problem 1: “Outputs drift after a few iterations”

  • Why: Hidden numerical instability (unscaled features, aggressive step size, or repeated subtraction of nearly equal values).
  • Fix: Normalize inputs, reduce step size, and track relative error rather than only absolute error.
  • Quick test: Run the same task with two scales of input (for example x and 10x) and compare normalized error curves.

Problem 2: “Results are inconsistent across runs”

  • Why: Random seeds, data split randomness, or non-deterministic ordering are uncontrolled.
  • Fix: Set seeds, log configuration, and store split indices and hyperparameters with each run.
  • Quick test: Re-run three times with the same seed and confirm metrics remain inside a tight tolerance band.
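The fix and quick test can be sketched as seeded, logged runs (the config schema is illustrative):

```python
import json
import random

import numpy as np

def seeded_run(seed, lr, reg):
    """Illustrative: seed all RNGs in use and log the full configuration
    so any run can be replayed exactly."""
    random.seed(seed)
    np.random.seed(seed)
    config_line = json.dumps({"seed": seed, "lr": lr, "reg": reg})
    return np.random.rand(3), config_line

draw_a, log_a = seeded_run(42, 0.01, 0.001)
draw_b, log_b = seeded_run(42, 0.01, 0.001)
print(np.allclose(draw_a, draw_b))  # True: same seed, same random stream
```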

Problem 3: “The project works on the demo case but fails on edge cases”

  • Why: Tests only cover happy-path inputs.
  • Fix: Add adversarial inputs (empty values, extreme ranges, near-singular matrices, rare classes).
  • Quick test: Build an edge-case test matrix and ensure every scenario reports expected behavior.

Definition of Done

  • Core functionality works on reference inputs
  • Edge cases are tested and documented
  • Results are reproducible (seeded and versioned configuration)
  • Performance or convergence behavior is measured and explained
  • A short retrospective explains what failed first and how you fixed it

7. Extension Ideas

  1. Add a stress-test mode with adversarial inputs.
  2. Add a short benchmark report (runtime + memory + error trend).
  3. Add a reproducibility bundle (seed, config, and fixed test corpus).

8. Why This Project Matters

This project is valuable because it creates observable evidence of mathematical reasoning under real implementation constraints.