Project 6: Fraud Detection Neural Net (MLP From Scratch)

Sprint: AI Prediction & Neural Networks - From Math to Machine
Focus Area: Multi-Layer Perceptrons and Class Imbalance

Project Metadata

Attribute	Value
Difficulty	Level 3: Advanced
Main Programming Language	Python (Using your Autograd or NumPy)
Alternative Languages	C, Rust, Julia
Coolness Level	Level 3: Genuinely Clever
Business Potential	3. Service & Support (FinTech)
Knowledge Area	Multi-Layer Perceptrons (MLP)
Software/Tools	NumPy, Matplotlib, Your Autograd Engine (from Project 5)
Main Book	“Neural Networks and Deep Learning” Ch. 2 - Michael Nielsen
Estimated Time	1 Week
Prerequisites	Project 3 (Linear Regression), Project 5 (Autograd Engine)

What You Will Build

A fully connected neural network (Multi-Layer Perceptron) that detects fraudulent credit card transactions. Unlike previous projects where data was linearly separable, fraud detection requires learning complex decision boundaries that no single line can capture.

Your MLP will:

Stack multiple Layer objects to create depth
Use ReLU activation to introduce non-linearity
Handle extreme class imbalance (99.8% legitimate, 0.2% fraud)
Implement Stochastic Gradient Descent with mini-batches
Evaluate using Precision, Recall, and F1-score (not just accuracy!)

This project forces you to confront WHY we need “deep” learning - because the real world is messy, non-linear, and imbalanced.

Learning Objectives

By completing this project, you will:

Implement the Layer class - Build a reusable abstraction for fully connected layers with weights, biases, and activations
Stack layers into an MLP class - Compose multiple layers into a network that performs forward and backward passes automatically
Understand why depth matters - Prove to yourself that 1 layer cannot solve non-linear problems, but 2+ layers can
Master ReLU activation - Implement the activation that solved the vanishing gradient problem and enabled deep learning
Handle class imbalance correctly - Learn why 99% accuracy can mean 0% utility, and how to fix it with class weights and sampling
Implement mini-batch SGD - Train efficiently by processing data in small batches rather than one sample or all at once
Evaluate with real metrics - Use confusion matrices, precision, recall, and F1 to measure what actually matters

The Core Question You’re Answering

“Why do we need ‘Deep’ learning?”

A single neuron draws a line. A single layer of neurons draws multiple lines. But no matter how many lines you draw, you cannot circle a cluster of points - you cannot learn “shapes.”

Consider the XOR problem: inputs (0,0) and (1,1) produce output 0, while (0,1) and (1,0) produce output 1. No single straight line can separate these. You need to fold the space - to transform the inputs so that what was inseparable becomes separable.

This is what hidden layers do. They learn transformations. The first layer might learn “are both inputs similar?” and “are both inputs different?” The second layer can then draw a simple line in this new feature space.

Fraud detection is the same. A fraudulent transaction might look legitimate on any single feature. But the combination of features - high amount, late night, foreign country, new card - creates a pattern that a deep network can learn to recognize as a “suspicious shape” in high-dimensional space.

When you build this MLP, you will see the magic: adding a single hidden layer transforms an impossible problem into a solvable one.

Concepts You Must Understand First

Before writing code, ensure you have solid grounding in these foundational concepts:

1. Why Single Layers Cannot Solve Non-Linear Problems (XOR)

The XOR problem proves the limitations of single-layer networks:

XOR Truth Table:
Input A   Input B   Output
   0         0        0
   0         1        1
   1         0        1
   1         1        0

Plotting in 2D space:
      B
      |
    1 + X       O
      |
    0 + O       X
      +----+----+
          0     1   A

O = Output 0
X = Output 1

No single line can separate the X's from the O's!

What happens with a single neuron:

A single neuron computes: output = step(w1*A + w2*B + bias), where step(z) = 1 if z > 0, else 0
This equation describes a line: w1*A + w2*B + bias = 0
Points on one side of the line output 1, the other side output 0
XOR requires a non-linear decision boundary - impossible with one line

The solution: Add a hidden layer

Hidden neurons transform the input space
The output layer then operates on this transformed space
In the new space, the problem becomes linearly separable
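To make the "transform, then separate" idea concrete, here is a minimal sketch of a two-layer ReLU network that computes XOR. The weights are hand-picked for illustration, not learned, and the variable names are assumptions of this example rather than part of the project code.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

# The four XOR inputs and their targets.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 1, 1, 0], dtype=float)

# Hand-picked (not learned) weights, chosen only for illustration.
W1 = np.array([[1.0, 1.0],
               [1.0, 1.0]])   # both hidden neurons start from A + B
b1 = np.array([0.0, -1.0])    # the second neuron only fires when A + B > 1
W2 = np.array([1.0, -2.0])    # output = h1 - 2 * h2

hidden = relu(X @ W1 + b1)    # the transformed feature space
output = hidden @ W2          # a plain linear readout of the hidden features

print(hidden)   # rows: [0, 0], [1, 0], [1, 0], [2, 1]
print(output)   # → [0. 1. 1. 0.]
```

In the hidden space, the two "output 1" inputs land on the same point [1, 0], so a single linear readout is enough to separate the classes.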

2. Universal Approximation Theorem (Intuition)

“A neural network with a single hidden layer containing a finite number of neurons can approximate any continuous function on compact subsets of R^n, under mild assumptions on the activation function.”

What this means in plain English:

Given enough hidden neurons, a 2-layer network (one hidden layer) can approximate ANY continuous pattern as closely as you like
It’s like having infinite LEGO bricks - you can build any shape
BUT: “can approximate” doesn’t mean “will learn efficiently”
Deeper networks learn hierarchical features more naturally

One hidden layer (wide and shallow):
Input → [1000 neurons] → Output
Can approximate anything but may need exponentially many neurons

Multiple hidden layers (narrow and deep):
Input → [16] → [16] → [8] → Output
Learns hierarchical features efficiently:
  Layer 1: Basic patterns (edges, thresholds)
  Layer 2: Combinations of patterns
  Layer 3: High-level concepts

3. ReLU vs Sigmoid vs Tanh Trade-offs

SIGMOID: f(x) = 1 / (1 + e^(-x))
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
       1.0 ┤                          ████████████████
           │                     █████
           │                  ███
       0.5 ┤               ███
           │            ███
           │        ████
       0.0 ┤████████
           ┼───────────────────────────────────────────────────────
          -6                    0                               +6

Pros: Smooth, bounded in (0,1), good for output probabilities
Cons: VANISHING GRADIENT! Derivative → 0 for large |x|
      Max derivative = 0.25 at x=0
      Slow training, gradients disappear in deep networks

TANH: f(x) = (e^x - e^(-x)) / (e^x + e^(-x))
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
       1.0 ┤                          ████████████████
           │                     █████
           │                  ███
       0.0 ┤───────────────███───────────────────────
           │            ███
           │        ████
      -1.0 ┤████████
           ┼───────────────────────────────────────────────────────
          -6                    0                               +6

Pros: Zero-centered (unlike sigmoid), stronger gradients
Cons: Still saturates! Vanishing gradient for large |x|

RELU: f(x) = max(0, x)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
           │                          ████████████████
           │                     █████
           │                █████
       0.0 ┤████████████████
           │
           │
           ┼───────────────────────────────────────────────────────
          -6                    0                               +6

Pros: NO vanishing gradient for positive inputs!
      Derivative = 1 for x > 0 (gradients flow freely)
      Computationally simple: just max(0, x)
      Sparse activation (some neurons output 0)
Cons: "Dead neurons" - if always negative, gradient = 0 forever
      Leaky ReLU fixes this: f(x) = max(0.01*x, x)
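The trade-offs above are easier to see with the derivatives written out. The following is a small NumPy sketch of the three activations plus Leaky ReLU. The function names are this example's own and need not match your Project 5 engine.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)              # peaks at 0.25 when x = 0

def tanh_grad(x):
    return 1.0 - np.tanh(x) ** 2      # peaks at 1.0 when x = 0, still saturates

def relu(x):
    return np.maximum(0.0, x)

def relu_grad(x):
    return (x > 0).astype(float)      # 1 for positive inputs, 0 otherwise

def leaky_relu(x, slope=0.01):
    return np.where(x > 0, x, slope * x)

def leaky_relu_grad(x, slope=0.01):
    return np.where(x > 0, 1.0, slope)  # never exactly zero, so no dead neurons

xs = np.array([-6.0, 0.0, 6.0])
for name, grad in [("sigmoid", sigmoid_grad), ("tanh", tanh_grad),
                   ("relu", relu_grad), ("leaky_relu", leaky_relu_grad)]:
    print(f"{name:>10}: {grad(xs)}")
```

At x = ±6 the sigmoid and tanh derivatives are already close to zero, while ReLU keeps a derivative of exactly 1 on the positive side.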

Why ReLU enabled deep learning:

Before ReLU: training networks with 5+ layers was nearly impossible
Sigmoids squash gradients: 0.25^10 ≈ 0.00000095 (vanished!)
ReLU: 1^10 = 1 (gradients flow)
This is a major reason “deep” learning became practical in the 2010s

4. Class Imbalance and Its Dangers

CREDIT CARD FRAUD: The Imbalance Problem
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Total Transactions: 284,807 (real Kaggle dataset)
Legitimate (Class 0): 284,315 (99.83%)
Fraudulent (Class 1): 492 (0.17%)

                    ████████████████████████████████████████  99.83%
Legitimate          ████████████████████████████████████████
                    ████████████████████████████████████████

Fraudulent          █                                          0.17%


THE LAZY MODEL PROBLEM:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
A model that ALWAYS predicts "Legitimate" achieves:
  Accuracy = 284,315 / 284,807 = 99.83%

This is TERRIBLE! It catches 0% of fraud!

The bank loses money on every fraudulent transaction it misses.
A 99.83% accuracy model is WORTHLESS for fraud detection.

Why standard accuracy fails:

Accuracy = (Correct Predictions) / (Total Predictions)
With 99.83% legitimate, guessing “all legitimate” gives 99.83% accuracy
The model never learns to detect the minority class
It takes the path of least resistance: predict the majority
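The lazy model is easy to reproduce. This sketch builds label arrays from the class counts quoted above (no real data is loaded) and scores a predictor that always answers "legitimate".

```python
import numpy as np

n_legit, n_fraud = 284_315, 492

# Labels built from the class counts only; no transaction features are needed here.
y_true = np.concatenate([np.zeros(n_legit, dtype=int), np.ones(n_fraud, dtype=int)])
y_pred = np.zeros_like(y_true)    # the lazy model: always predict class 0

accuracy = np.mean(y_pred == y_true)
fraud_caught = np.sum((y_pred == 1) & (y_true == 1))
recall = fraud_caught / n_fraud

print(f"accuracy = {accuracy:.2%}, fraud caught = {fraud_caught}, recall = {recall:.0%}")
```

Accuracy comes out at the majority-class share, while recall on fraud is zero.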

5. Precision, Recall, F1, Confusion Matrix

CONFUSION MATRIX
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
                        PREDICTED
                    Legitimate    Fraud
              ┌────────────────────────────┐
    ACTUAL    │                            │
  Legitimate  │    TN = 284,000    FP = 315│  True Neg / False Pos
              │                            │
              ├────────────────────────────┤
    ACTUAL    │                            │
  Fraud       │    FN = 42         TP = 450│  False Neg / True Pos
              │                            │
              └────────────────────────────┘

TN (True Negative): Correctly predicted Legitimate
FP (False Positive): Predicted Fraud, but was Legitimate (annoys customer)
FN (False Negative): Predicted Legitimate, but was Fraud (MONEY LOST!)
TP (True Positive): Correctly predicted Fraud (MONEY SAVED!)


PRECISION: Of all predicted fraud, how many were actually fraud?
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Precision = TP / (TP + FP) = 450 / (450 + 315) = 0.588 = 58.8%

"When we flag something as fraud, we're right 58.8% of the time"
Low precision = Many false alarms (customer complaints)


RECALL (Sensitivity): Of all actual fraud, how many did we catch?
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Recall = TP / (TP + FN) = 450 / (450 + 42) = 0.915 = 91.5%

"We catch 91.5% of all fraud"
Low recall = Missing fraudulent transactions (bank loses money)


F1-SCORE: Harmonic mean of Precision and Recall
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
F1 = 2 * (Precision * Recall) / (Precision + Recall)
F1 = 2 * (0.588 * 0.915) / (0.588 + 0.915) = 0.716 = 71.6%

Why harmonic mean? Penalizes extremes.
  If Precision=1.0 and Recall=0.0, F1=0 (not 0.5!)
  You can't game F1 by ignoring one metric.
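As a check on the arithmetic above, here is a minimal sketch that derives the four confusion-matrix counts from label arrays and turns them into precision, recall, and F1. The helper names are hypothetical, and the zero-division guards are a design choice of this example.

```python
import numpy as np

def confusion_counts(y_true, y_pred):
    """Return (TN, FP, FN, TP) for binary labels where 1 = fraud."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    tn = int(np.sum((y_true == 0) & (y_pred == 0)))
    fp = int(np.sum((y_true == 0) & (y_pred == 1)))
    fn = int(np.sum((y_true == 1) & (y_pred == 0)))
    tp = int(np.sum((y_true == 1) & (y_pred == 1)))
    return tn, fp, fn, tp

def precision_recall_f1(tp, fp, fn):
    # Guard against division by zero, e.g. when the model never predicts fraud.
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom > 0 else 0.0
    return precision, recall, f1

# The counts from the confusion matrix above.
p, r, f1 = precision_recall_f1(tp=450, fp=315, fn=42)
print(f"precision={p:.3f} recall={r:.3f} f1={f1:.3f}")
# → precision=0.588 recall=0.915 f1=0.716
```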

6. Batch vs SGD vs Mini-batch

GRADIENT DESCENT VARIANTS
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

BATCH (Full) Gradient Descent:
────────────────────────────────
• Use ALL samples to compute gradient
• One update per epoch
• Gradient is exact average over entire dataset

Dataset: [████████████████████████████████████████]
          ↑ Compute loss for all samples
          ↑ Compute gradient (average over all)
          ↑ Single weight update

Pros: Stable, consistent direction
Cons: SLOW! Memory intensive. Gets stuck in sharp minima.


STOCHASTIC Gradient Descent (SGD):
────────────────────────────────
• Use ONE sample at a time
• N updates per epoch (N = dataset size)
• Gradient is noisy estimate

Dataset: [█|█|█|█|█|█|█|█|█|█|█|█|█|█|█|█|█|█|█|█]
          ↑ Update weights after each sample

Pros: Fast updates. Noise helps escape local minima.
Cons: Very noisy! Oscillates around minimum.


MINI-BATCH Gradient Descent (The Winner):
────────────────────────────────
• Use B samples at a time (e.g., B=32)
• N/B updates per epoch
• Gradient is average over mini-batch

Dataset: [████|████|████|████|████|████|████|████]
          ↑    ↑    ↑    ↑    ↑    ↑    ↑    ↑
          Update weights after each mini-batch

Pros: Best of both worlds!
      - Some noise (helps generalization)
      - Vectorized computation (fast on GPU)
      - Manageable memory usage
Cons: Introduces hyperparameter B (batch size)


COMMON BATCH SIZES:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Batch Size    Use Case
────────────────────────────────
32            Standard starting point
64-128        Common for image classification
256-1024      Large datasets, powerful GPUs
1-4           Extreme memory constraints
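A mini-batch loop usually comes down to shuffling indices once per epoch and slicing. The generator below is one possible sketch. `iterate_minibatches` is a name invented for this example, and it makes no attempt at class balancing.

```python
import numpy as np

def iterate_minibatches(X, y, batch_size, rng):
    """Yield shuffled (X_batch, y_batch) pairs that cover the data exactly once."""
    indices = rng.permutation(len(X))
    for start in range(0, len(X), batch_size):
        batch_idx = indices[start:start + batch_size]
        yield X[batch_idx], y[batch_idx]

rng = np.random.default_rng(0)
X = np.arange(100, dtype=float).reshape(100, 1)
y = (np.arange(100) % 2).astype(float)

batches = list(iterate_minibatches(X, y, batch_size=32, rng=rng))
print([len(xb) for xb, _ in batches])  # → [32, 32, 32, 4]
```

The last batch is smaller whenever N is not a multiple of B. If you average the loss per batch, that short batch still gets a full-sized update, which is usually acceptable.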

Deep Theoretical Foundation

Hidden Layers as Feature Extractors

Think of each hidden layer as learning a new “language” to describe the data:

INPUT LAYER: Raw Features
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Transaction Amount: $5000
Time: 02:34:17 (late night)
Location: Nigeria (IP-based)
Card Age: 2 days
Merchant Category: Electronics
V1-V28: PCA-transformed features

These are just numbers. No meaning yet.


HIDDEN LAYER 1: Basic Pattern Detectors
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Neuron 1: "Is this a large amount?" (High amount = high activation)
Neuron 2: "Is this at an unusual time?" (2-5 AM = high activation)
Neuron 3: "Is this a high-risk country?"
Neuron 4: "Is this a new card?"
Neuron 5: "Is this a high-risk merchant category?"
...

The layer learns THRESHOLDS - when does "large" become suspicious?


HIDDEN LAYER 2: Pattern Combinations
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Neuron 1: "Large amount + Unusual time" (both trigger = high activation)
Neuron 2: "New card + High-risk country"
Neuron 3: "Electronics + Late night + Large amount"
...

The layer learns COMBINATIONS that are suspicious together


OUTPUT LAYER: Final Decision
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Combines all the learned patterns into a single probability:
P(Fraud) = 0.97

If the "Large + Late + New Card + Electronics" pattern fires strongly,
the output is high, regardless of which individual features triggered it.

Why Depth Helps: Hierarchical Representations

DEPTH = ABSTRACTION LEVELS
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

2 Layers (Shallow):
  Input → [Simple Patterns] → Output

  Must learn: "If (A AND B) OR (C AND D) OR (E AND F AND G) → Fraud"
  Each hidden neuron must capture one full rule


4 Layers (Deep):
  Input → [Primitives] → [Combinations] → [Complex Rules] → Output

  Layer 1: "Is A high?", "Is B unusual?", etc.
  Layer 2: "A AND B together", "C AND D together"
  Layer 3: "(A AND B) combined with (C AND D)"
  Layer 4: Final decision

  Each layer builds on the previous, like LEGO


ANALOGY: Language
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Letters → Words → Phrases → Sentences → Paragraphs → Meaning

You don't learn "when these 500 letters appear in this order, it's spam"
You learn: letters → words → "cheap meds" → spam

Deep networks learn hierarchical features naturally.
Shallow networks must memorize everything at once.

ReLU: Solving the Vanishing Gradient Problem

THE VANISHING GRADIENT DISASTER
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Sigmoid derivative: σ'(x) = σ(x) * (1 - σ(x))
Maximum value: σ'(0) = 0.25

In backpropagation, gradients MULTIPLY through layers:

Layer 5: gradient = 0.25
Layer 4: gradient = 0.25 * 0.25 = 0.0625
Layer 3: gradient = 0.25^3 = 0.0156
Layer 2: gradient = 0.25^4 = 0.0039
Layer 1: gradient = 0.25^5 = 0.00097

By Layer 1, the gradient is 0.1% of what it was!
The early layers learn NOTHING. Training stalls.


RELU TO THE RESCUE
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

ReLU derivative:
  if x > 0: derivative = 1
  if x <= 0: derivative = 0

In backpropagation:

Layer 5: gradient = 1.0 (if active)
Layer 4: gradient = 1.0 * 1.0 = 1.0
Layer 3: gradient = 1.0^3 = 1.0
Layer 2: gradient = 1.0^4 = 1.0
Layer 1: gradient = 1.0^5 = 1.0

Gradients flow unchanged! Deep learning becomes possible.
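The two tables above can be reproduced in a few lines. This sketch multiplies only the activation derivatives and deliberately ignores the weight matrices, so it shows the best case for sigmoid rather than a full backprop simulation.

```python
import numpy as np

n_layers = 10
layers = np.arange(1, n_layers + 1)

# Best case for sigmoid: every unit sits at z = 0, where the derivative peaks at 0.25.
sigmoid_factor = 0.25 ** layers
# ReLU along a path where every unit is active (derivative exactly 1).
relu_factor = 1.0 ** layers

for depth, s, r in zip(layers, sigmoid_factor, relu_factor):
    print(f"{depth:2d} layers: sigmoid <= {s:.2e}   relu (active path) = {r:.0f}")
```

After five layers the sigmoid factor is already below 0.001, and after ten it is below one millionth.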


DEAD NEURON PROBLEM
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

If ReLU input is ALWAYS negative:
  output = 0
  gradient = 0
  weights never update
  Neuron is "dead" forever

This happens when:
  - Learning rate too high (weights become very negative)
  - Poor initialization

Solutions:
  1. Careful weight initialization (He initialization)
  2. Leaky ReLU: f(x) = max(0.01*x, x)
  3. PReLU: f(x) = max(α*x, x) where α is learned

Weight Initialization Strategies

WHY INITIALIZATION MATTERS
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Weights too small:
  Signals shrink as they pass through layers
  Output → 0, gradients → 0

Weights too large:
  Signals explode as they pass through layers
  Output → ∞, gradients → ∞ (NaN errors)


XAVIER/GLOROT INITIALIZATION (for sigmoid/tanh)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
W ~ Uniform(-sqrt(6/(n_in + n_out)), +sqrt(6/(n_in + n_out)))

or

W ~ Normal(0, sqrt(2/(n_in + n_out)))

Keeps variance of activations consistent across layers.


HE INITIALIZATION (for ReLU) - YOU SHOULD USE THIS
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
W ~ Normal(0, sqrt(2/n_in))

Why different? ReLU zeroes out half the neurons on average.
To maintain variance, we need 2x the weight variance (2/n_in instead of 1/n_in).

In code:
  weights = np.random.randn(n_in, n_out) * np.sqrt(2.0 / n_in)


BIAS INITIALIZATION
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Initialize to 0. Or small positive value (0.01) for ReLU to prevent
dead neurons at initialization.

  biases = np.zeros(n_out)  # Simple and works
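To see why the scale matters, the sketch below pushes random inputs through ten ReLU layers and compares He initialization with a naive small constant scale. The layer width, depth, and the 0.01 scale are arbitrary choices for this demonstration.

```python
import numpy as np

def he_init(n_in, n_out, rng):
    # Standard deviation sqrt(2 / n_in), as in the formula above.
    return rng.standard_normal((n_in, n_out)) * np.sqrt(2.0 / n_in)

def tiny_init(n_in, n_out, rng):
    # A deliberately poor choice: weights far too small for this width.
    return rng.standard_normal((n_in, n_out)) * 0.01

def final_activation_std(init_fn, n_layers=10, width=256, seed=0):
    """Spread of the activations after n_layers of linear + ReLU (no biases)."""
    rng = np.random.default_rng(seed)
    a = rng.standard_normal((512, width))
    for _ in range(n_layers):
        a = np.maximum(0.0, a @ init_fn(width, width, rng))
    return a.std()

he_std = final_activation_std(he_init)
tiny_std = final_activation_std(tiny_init)
print(f"He init:   std after 10 layers = {he_std:.3f}")
print(f"0.01 init: std after 10 layers = {tiny_std:.3e}")
```

With He initialization the signal should stay on the order of 1. With the small scale it collapses toward zero, and so would the gradients.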

Handling Class Imbalance

METHOD 1: CLASS WEIGHTS
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Multiply the loss for each class by a weight.
Minority class gets higher weight → its errors hurt more.

Weight formula:
  w_class = total_samples / (n_classes * samples_in_class)

Example:
  Total: 1000 samples
  Class 0 (legitimate): 990 samples
  Class 1 (fraud): 10 samples

  w_0 = 1000 / (2 * 990) = 0.505
  w_1 = 1000 / (2 * 10) = 50.0

Fraud errors are penalized about 99x more than legitimate errors (50.0 / 0.505).

In code:
  loss = -(y * class_weight_1 * log(y_pred) +
           (1-y) * class_weight_0 * log(1 - y_pred))
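Method 1 can be sketched in NumPy as follows. `compute_class_weights` and `weighted_bce` are hypothetical helper names, the loss is averaged over the batch, and both classes are assumed to be present in `y`.

```python
import numpy as np

def compute_class_weights(y):
    """w_class = total_samples / (n_classes * samples_in_class), for classes 0 and 1."""
    y = np.asarray(y).astype(int)
    counts = np.bincount(y, minlength=2)   # assumes both classes occur at least once
    return len(y) / (2 * counts)

def weighted_bce(y_true, y_pred, class_weights, eps=1e-12):
    """Mean weighted binary cross-entropy; note the leading minus sign."""
    y_pred = np.clip(y_pred, eps, 1.0 - eps)   # avoid log(0)
    w0, w1 = class_weights
    per_sample = -(w1 * y_true * np.log(y_pred)
                   + w0 * (1.0 - y_true) * np.log(1.0 - y_pred))
    return per_sample.mean()

y = np.array([0] * 990 + [1] * 10)
weights = compute_class_weights(y)
print(weights)   # approximately [0.505, 50.0]

# An untrained model that outputs 0.5 everywhere:
print(weighted_bce(y, np.full(len(y), 0.5), weights))
```

With these weights each class contributes equally to the total loss, so a model that outputs 0.5 everywhere scores ln(2) ≈ 0.693, the same as in the balanced case.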


METHOD 2: OVERSAMPLING (SMOTE)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Create synthetic minority samples by interpolating between existing ones.

SMOTE Algorithm:
1. For each minority sample x:
2.   Find k nearest minority neighbors
3.   Pick one neighbor x_n randomly
4.   Create synthetic: x_new = x + random(0,1) * (x_n - x)

Before SMOTE:
  ██████████████████████████████████████  Class 0: 990
  █                                       Class 1: 10

After SMOTE:
  ██████████████████████████████████████  Class 0: 990
  ████████████████████████████████████    Class 1: 900 (synthetic)
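The SMOTE steps above can be sketched in plain NumPy. This is a simplified illustration, not a replacement for a tested library implementation: base samples are drawn at random rather than visited one by one, neighbors are found by brute force, and at least two minority samples are assumed.

```python
import numpy as np

def smote(X_min, n_synthetic, k=5, rng=None):
    """Create n_synthetic points by interpolating between minority-class neighbors."""
    rng = np.random.default_rng() if rng is None else rng
    n = len(X_min)
    k = min(k, n - 1)                       # cannot have more neighbors than other points

    # Pairwise squared distances between minority samples (brute force).
    sq_norms = np.sum(X_min ** 2, axis=1)
    dist = sq_norms[:, None] + sq_norms[None, :] - 2.0 * X_min @ X_min.T
    np.fill_diagonal(dist, np.inf)          # a point is not its own neighbor
    neighbors = np.argsort(dist, axis=1)[:, :k]

    base = rng.integers(0, n, size=n_synthetic)      # which x to start from
    choice = rng.integers(0, k, size=n_synthetic)    # which of its k neighbors to use
    neighbor = neighbors[base, choice]
    gap = rng.random((n_synthetic, 1))               # random(0,1) per synthetic point
    return X_min[base] + gap * (X_min[neighbor] - X_min[base])

rng = np.random.default_rng(0)
X_min = rng.standard_normal((10, 2))
synthetic = smote(X_min, n_synthetic=90, k=3, rng=rng)
print(synthetic.shape)  # → (90, 2)
```

Every synthetic point lies on a segment between two real fraud samples, so apply this to the training split only. Synthetic points in the test set would inflate your metrics.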


METHOD 3: UNDERSAMPLING
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Randomly remove majority class samples to balance.

Before:
  ██████████████████████████████████████  Class 0: 990
  █                                       Class 1: 10

After random undersampling:
  ██                                      Class 0: 10 (kept)
  █                                       Class 1: 10

Problem: Throws away 98% of data!
Use only if you have TONS of data.
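Random undersampling is only a few lines. The sketch below assumes class 1 is the minority and keeps an equal number of randomly chosen class 0 rows. The function name is made up for this example.

```python
import numpy as np

def random_undersample(X, y, rng):
    """Keep all minority (class 1) rows and an equal-sized random subset of class 0."""
    minority_idx = np.flatnonzero(y == 1)
    majority_idx = np.flatnonzero(y == 0)
    kept_majority = rng.choice(majority_idx, size=len(minority_idx), replace=False)
    keep = rng.permutation(np.concatenate([minority_idx, kept_majority]))
    return X[keep], y[keep]

rng = np.random.default_rng(0)
y = np.array([0] * 990 + [1] * 10)
X = np.arange(1000, dtype=float).reshape(-1, 1)

X_bal, y_bal = random_undersample(X, y, rng)
print(np.bincount(y_bal))  # → [10 10]
```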


METHOD 4: THRESHOLD ADJUSTMENT
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Instead of: y_pred > 0.5 → Fraud
Use:        y_pred > 0.1 → Fraud

This catches more fraud (higher recall) at cost of more false positives.
Tune threshold based on business requirements:
  - Banks may prefer low threshold (catch all fraud, accept false alarms)
  - Customers may prefer higher threshold (fewer card declines)
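The threshold trade-off shows up even on a toy example. The probabilities below are invented for illustration and do not come from a trained model.

```python
import numpy as np

def recall_and_false_alarms(y_true, y_prob, threshold):
    """Recall on the fraud class and the number of false positives at a threshold."""
    y_pred = (y_prob > threshold).astype(int)
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    return tp / (tp + fn), int(fp)

# Made-up scores: six legitimate transactions followed by four fraudulent ones.
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
y_prob = np.array([0.02, 0.05, 0.08, 0.15, 0.30, 0.60, 0.12, 0.40, 0.70, 0.90])

for t in (0.5, 0.1):
    recall, false_alarms = recall_and_false_alarms(y_true, y_prob, t)
    print(f"threshold={t}: recall={recall:.2f}, false alarms={false_alarms}")
# → threshold=0.5: recall=0.50, false alarms=1
# → threshold=0.1: recall=1.00, false alarms=3
```

Choose the threshold on a validation set, not the test set, so the final metrics remain an honest estimate.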

The Forward and Backward Pass Through Multiple Layers

FORWARD PASS: Input → Output
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Input X (shape: batch_size × n_features)
         │
         ▼
┌─────────────────────────────────┐
│  Layer 1: Linear + ReLU        │
│  Z1 = X @ W1 + b1              │  Pre-activation
│  A1 = ReLU(Z1)                 │  Post-activation
└─────────────────────────────────┘
         │
         ▼
┌─────────────────────────────────┐
│  Layer 2: Linear + ReLU        │
│  Z2 = A1 @ W2 + b2             │
│  A2 = ReLU(Z2)                 │
└─────────────────────────────────┘
         │
         ▼
┌─────────────────────────────────┐
│  Output Layer: Linear + Sigmoid│
│  Z3 = A2 @ W3 + b3             │
│  A3 = Sigmoid(Z3)              │  Probability output
└─────────────────────────────────┘
         │
         ▼
      Y_pred (probability of fraud)


BACKWARD PASS: Output → Input (Gradients)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Loss = Binary Cross-Entropy(Y_true, Y_pred)
         │
         ▼
┌─────────────────────────────────┐
│  dL/dZ3 = A3 - Y               │  Gradient of loss w.r.t. output
│  dL/dW3 = A2.T @ dL/dZ3        │  Gradient for weights
│  dL/db3 = sum(dL/dZ3, axis=0)  │  Gradient for biases
│  dL/dA2 = dL/dZ3 @ W3.T        │  Pass gradient backward
└─────────────────────────────────┘
         │
         ▼
┌─────────────────────────────────┐
│  dL/dZ2 = dL/dA2 * ReLU'(Z2)   │  Apply ReLU derivative
│  dL/dW2 = A1.T @ dL/dZ2        │
│  dL/db2 = sum(dL/dZ2, axis=0)  │
│  dL/dA1 = dL/dZ2 @ W2.T        │
└─────────────────────────────────┘
         │
         ▼
┌─────────────────────────────────┐
│  dL/dZ1 = dL/dA1 * ReLU'(Z1)   │
│  dL/dW1 = X.T @ dL/dZ1         │
│  dL/db1 = sum(dL/dZ1, axis=0)  │
└─────────────────────────────────┘


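Where:
  dL/dZ3 = A3 - Y assumes unweighted cross-entropy summed over the batch
    (with class weights: w * (A3 - Y), w = w1 for fraud rows, w0 otherwise)
  ReLU'(Z) = 1 if Z > 0, else 0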
  @ = matrix multiplication
  .T = transpose
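The equations in the two diagrams translate almost line for line into NumPy. The sketch below uses a tiny 4-5-3-1 network (sizes picked arbitrarily), the unweighted cross-entropy summed over the batch, and a finite-difference check on one weight to confirm the backward pass. All names are local to this example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(X, params):
    W1, b1, W2, b2, W3, b3 = params
    Z1 = X @ W1 + b1;  A1 = np.maximum(0.0, Z1)
    Z2 = A1 @ W2 + b2; A2 = np.maximum(0.0, Z2)
    Z3 = A2 @ W3 + b3; A3 = sigmoid(Z3)
    return A3, (X, Z1, A1, Z2, A2)

def bce_sum(Y, A3):
    # Unweighted binary cross-entropy, summed (not averaged) over the batch.
    return -np.sum(Y * np.log(A3) + (1 - Y) * np.log(1 - A3))

def backward(Y, A3, cache, params):
    X, Z1, A1, Z2, A2 = cache
    _, _, W2, _, W3, _ = params
    dZ3 = A3 - Y
    dW3 = A2.T @ dZ3; db3 = dZ3.sum(axis=0)
    dA2 = dZ3 @ W3.T
    dZ2 = dA2 * (Z2 > 0)                  # ReLU'(Z2)
    dW2 = A1.T @ dZ2; db2 = dZ2.sum(axis=0)
    dA1 = dZ2 @ W2.T
    dZ1 = dA1 * (Z1 > 0)                  # ReLU'(Z1)
    dW1 = X.T @ dZ1;  db1 = dZ1.sum(axis=0)
    return [dW1, db1, dW2, db2, dW3, db3]

rng = np.random.default_rng(0)
params = []
for n_in, n_out in [(4, 5), (5, 3), (3, 1)]:
    params += [rng.standard_normal((n_in, n_out)) * np.sqrt(2.0 / n_in), np.zeros(n_out)]
X = rng.standard_normal((8, 4))
Y = rng.integers(0, 2, size=(8, 1)).astype(float)

A3, cache = forward(X, params)
grads = backward(Y, A3, cache, params)

# Finite-difference check on a single entry of W1.
eps = 1e-6
W1 = params[0]
original = W1[0, 0]
W1[0, 0] = original + eps; loss_plus = bce_sum(Y, forward(X, params)[0])
W1[0, 0] = original - eps; loss_minus = bce_sum(Y, forward(X, params)[0])
W1[0, 0] = original
numeric = (loss_plus - loss_minus) / (2 * eps)
print(f"analytic = {grads[0][0, 0]:.6f}, numeric = {numeric:.6f}")
```

If the two numbers disagree beyond small floating-point noise, the usual suspects are a missing transpose or a bias gradient summed over the wrong axis.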

Batch Size Effects on Convergence

BATCH SIZE SPECTRUM
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Batch Size = 1 (Pure SGD)
━━━━━━━━━━━━━━━━━━━━━━━━━
Loss landscape trajectory:
  ∿∿∿∿∿∿∿∿∿∿∿∿∿∿∿∿∿∿∿∿→ minimum
  Very noisy but escapes local minima

Update frequency: Every sample
Gradient variance: HIGH
Generalization: Good (noise regularizes)
Training speed: Slow (no parallelism)


Batch Size = 32 (Common Choice)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Loss landscape trajectory:
  ~~~~~~~~~~→ minimum
  Some noise, mostly consistent direction

Update frequency: Every 32 samples
Gradient variance: Moderate
Generalization: Good
Training speed: Fast (vectorized)


Batch Size = Full Dataset (Batch GD)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Loss landscape trajectory:
  ──────────→ minimum
  Smooth, deterministic path

Update frequency: Once per epoch
Gradient variance: Zero
Generalization: Worse (may overfit)
Training speed: Very slow (no frequent updates)


EMPIRICAL FINDINGS:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
• Batch size 32-256 works well for most problems
• Larger batches need larger learning rates
• Larger batches tend to find sharper minima, which often generalize worse
• For imbalanced data: ensure each batch has minority samples!

Your fraud detector: Use batch_size=64 and stratified sampling
  to ensure each batch contains at least one fraud sample (1 of 64 is about 1.6%).
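One way to guarantee minority samples in every batch is to fill most of the batch from shuffled legitimate rows and top it up with fraud rows drawn with replacement. The sketch below does this. `stratified_batches` is a hypothetical helper, and because it oversamples fraud, you may want smaller class weights when you combine the two.

```python
import numpy as np

def stratified_batches(y, batch_size, n_fraud_per_batch, rng):
    """Yield index arrays in which every batch holds n_fraud_per_batch fraud rows."""
    fraud_idx = np.flatnonzero(y == 1)
    legit_idx = rng.permutation(np.flatnonzero(y == 0))
    n_legit_per_batch = batch_size - n_fraud_per_batch
    for start in range(0, len(legit_idx), n_legit_per_batch):
        legit_part = legit_idx[start:start + n_legit_per_batch]
        # Fraud rows are reused across batches, which is a form of oversampling.
        fraud_part = rng.choice(fraud_idx, size=n_fraud_per_batch, replace=True)
        yield rng.permutation(np.concatenate([legit_part, fraud_part]))

rng = np.random.default_rng(0)
y = np.zeros(1000, dtype=int)
y[:5] = 1   # 5 fraud rows among 1000

batches = list(stratified_batches(y, batch_size=64, n_fraud_per_batch=1, rng=rng))
print(len(batches), min(int(y[b].sum()) for b in batches))  # → 16 1
```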

Real World Outcome

When you run your fraud detector, you will see output like this:

$ python train_fraud.py --data creditcard.csv

============================================================
   Fraud Detection MLP - Training
============================================================

Loading data: creditcard.csv
  Total samples: 284,807
  Legitimate (0): 284,315 (99.83%)
  Fraudulent (1): 492 (0.17%)

Class imbalance ratio: 578:1
Applying class weights: {0: 0.50, 1: 289.44}

Network Architecture:
  Input Layer:  30 features
  Hidden Layer 1: 16 neurons (ReLU)
  Hidden Layer 2: 16 neurons (ReLU)
  Output Layer: 1 neuron (Sigmoid)

  Total Parameters: 785

Training Configuration:
  Optimizer: Mini-batch SGD
  Learning Rate: 0.01
  Batch Size: 64
  Epochs: 50

------------------------------------------------------------
Training Progress:
------------------------------------------------------------

Epoch 1/50:
  Loss: 0.6931 | Acc: 99.8% | Precision: 0.00 | Recall: 0.00 | F1: 0.00
  ⚠️  WARNING: Model predicting all legitimate (Recall = 0%)!
      This happens early - class weights will fix it.

Epoch 5/50:
  Loss: 0.2847 | Acc: 98.2% | Precision: 0.32 | Recall: 0.58 | F1: 0.41
  ✓ Model starting to detect fraud

Epoch 10/50:
  Loss: 0.1234 | Acc: 99.1% | Precision: 0.56 | Recall: 0.72 | F1: 0.63

Epoch 25/50:
  Loss: 0.0523 | Acc: 99.6% | Precision: 0.78 | Recall: 0.85 | F1: 0.81

Epoch 50/50:
  Loss: 0.0312 | Acc: 99.8% | Precision: 0.86 | Recall: 0.92 | F1: 0.89
  ✓ Training complete!

============================================================
   Final Evaluation (Test Set)
============================================================

Confusion Matrix:
                 Predicted
              Neg      Pos
Actual Neg   56,824     40
Actual Pos       5     93

Metrics:
  Accuracy:  99.92%
  Precision: 69.92%  (When we flag fraud, we're right 70% of the time)
  Recall:    94.90%  (We catch 95% of all fraud!)
  F1-Score:  80.52%

Business Impact:
  Fraud Caught: 93 of 98 transactions ($186,000 saved)
  Fraud Missed: 5 transactions ($8,500 lost)
  False Alarms: 40 customers temporarily inconvenienced

============================================================

$ python predict.py --transaction "Time=0,Amount=5000,V1=-1.36,..."

┌─────────────────────────────────────────────────────────┐
│  TRANSACTION ANALYSIS                                    │
├─────────────────────────────────────────────────────────┤
│                                                          │
│  Input Features:                                         │
│    Time: 0 seconds                                       │
│    Amount: $5,000.00                                     │
│    V1-V28: [PCA components shown]                        │
│                                                          │
│  Network Activations:                                    │
│    Hidden Layer 1: [0.0, 2.3, 0.0, 1.8, 0.0, 4.2, ...]  │
│    Hidden Layer 2: [1.2, 0.0, 3.1, 0.0, 2.8, 0.0, ...]  │
│                                                          │
│  ┌─────────────────────────────────────────────────────┐ │
│  │                                                      │ │
│  │   PREDICTION: FRAUD                                  │ │
│  │   Probability: 0.983 (98.3% confidence)              │ │
│  │                                                      │ │
│  │   ████████████████████████████████████████ 98.3%    │ │
│  │                                                      │ │
│  │   Recommendation: BLOCK TRANSACTION                  │ │
│  │                                                      │ │
│  └─────────────────────────────────────────────────────┘ │
│                                                          │
└─────────────────────────────────────────────────────────┘
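The header numbers in a training log like this are worth sanity-checking by hand. The sketch below counts parameters for a 30-16-16-1 network and recomputes the imbalance ratio and class weights from the dataset counts. It assumes the weights are computed on the full dataset, so a train split would shift them slightly.

```python
def count_parameters(layer_sizes):
    """Weights (n_in * n_out) plus biases (n_out) for each fully connected layer."""
    return sum(n_in * n_out + n_out
               for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]))

print(count_parameters([30, 16, 16, 1]))  # → 785

n_total, n_legit, n_fraud = 284_807, 284_315, 492
ratio = n_legit / n_fraud
w0 = n_total / (2 * n_legit)
w1 = n_total / (2 * n_fraud)
print(f"ratio = {ratio:.0f}:1, w0 = {w0:.2f}, w1 = {w1:.2f}")
# → ratio = 578:1, w0 = 0.50, w1 = 289.44
```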

Solution Architecture

Class Design

MLP ARCHITECTURE
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

┌──────────────────────────────────────────────────────────────────────┐
│                              MLP Class                                │
├──────────────────────────────────────────────────────────────────────┤
│  layers: List[Layer]          # Stack of Layer objects              │
│  loss_fn: Callable            # Binary Cross-Entropy                 │
│  optimizer: Optimizer         # SGD with learning rate               │
├──────────────────────────────────────────────────────────────────────┤
│  forward(X) → Y_pred          # Propagate input through all layers   │
│  backward(Y_true, Y_pred)     # Compute gradients for all layers     │
│  train_step(X_batch, Y_batch) # One forward-backward-update cycle   │
│  fit(X, Y, epochs, batch_size)# Full training loop                   │
│  predict(X) → Y_pred          # Inference only (no gradients)        │
│  evaluate(X, Y) → metrics     # Compute Precision, Recall, F1        │
└──────────────────────────────────────────────────────────────────────┘
        │
        │ contains
        ▼
┌──────────────────────────────────────────────────────────────────────┐
│                             Layer Class                               │
├──────────────────────────────────────────────────────────────────────┤
│  weights: np.ndarray          # Shape: (n_in, n_out)                 │
│  biases: np.ndarray           # Shape: (n_out,)                      │
│  activation: str              # "relu", "sigmoid", or None          │
│                                                                       │
│  # Cached for backprop:                                               │
│  input_cache: np.ndarray      # Input received during forward        │
│  z_cache: np.ndarray          # Pre-activation (before ReLU/Sigmoid) │
├──────────────────────────────────────────────────────────────────────┤
│  forward(X) → A               # Z = X @ W + b; A = activation(Z)     │
│  backward(dA) → dX            # Compute dW, db, and return dX        │
│  update(lr)                   # W -= lr * dW; b -= lr * db           │
└──────────────────────────────────────────────────────────────────────┘
        │
        │ uses
        ▼
┌──────────────────────────────────────────────────────────────────────┐
│                          Activation Functions                         │
├──────────────────────────────────────────────────────────────────────┤
│  relu(Z) = max(0, Z)                                                 │
│  relu_derivative(Z) = (Z > 0).astype(float)                          │
│                                                                       │
│  sigmoid(Z) = 1 / (1 + exp(-Z))                                      │
│  sigmoid_derivative(Z) = sigmoid(Z) * (1 - sigmoid(Z))               │
└──────────────────────────────────────────────────────────────────────┘
        │
        │ outputs to
        ▼
┌──────────────────────────────────────────────────────────────────────┐
│                            Loss Function                              │
├──────────────────────────────────────────────────────────────────────┤
│  Binary Cross-Entropy (with class weights):                          │
│                                                                       │
│  L = -1/N * Σ [ w1 * y * log(ŷ) + w0 * (1-y) * log(1-ŷ) ]           │
│                                                                       │
│  Gradient:                                                            │
│  dL/dŷ = -w1 * y/ŷ + w0 * (1-y)/(1-ŷ)                                │
│                                                                       │
│  For sigmoid output, simplifies to:                                   │
│  dL/dZ = w * (ŷ - y),  where w = w1 if y = 1, else w0                │
│  (this equals ŷ - y only when w0 = w1 = 1)                           │
└──────────────────────────────────────────────────────────────────────┘
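The class design above might look like this in NumPy. This is a deliberately reduced sketch: the optimizer and loss are folded into `train_step`, `fit`/`predict`/`evaluate` are omitted, and the demo uses a synthetic XOR-like dataset, not the fraud data.

```python
import numpy as np

class Layer:
    def __init__(self, n_in, n_out, activation, rng):
        self.weights = rng.standard_normal((n_in, n_out)) * np.sqrt(2.0 / n_in)  # He init
        self.biases = np.zeros(n_out)
        self.activation = activation          # "relu", "sigmoid", or None

    def forward(self, X):
        self.input_cache = X
        self.z_cache = X @ self.weights + self.biases
        if self.activation == "relu":
            return np.maximum(0.0, self.z_cache)
        if self.activation == "sigmoid":
            return 1.0 / (1.0 + np.exp(-self.z_cache))
        return self.z_cache

    def backward(self, dA):
        if self.activation == "relu":
            dZ = dA * (self.z_cache > 0)
        elif self.activation == "sigmoid":
            s = 1.0 / (1.0 + np.exp(-self.z_cache))
            dZ = dA * s * (1.0 - s)
        else:
            dZ = dA
        self.dW = self.input_cache.T @ dZ
        self.db = dZ.sum(axis=0)
        return dZ @ self.weights.T            # gradient w.r.t. this layer's input

    def update(self, lr):
        self.weights -= lr * self.dW
        self.biases -= lr * self.db

class MLP:
    def __init__(self, sizes, rng):
        activations = ["relu"] * (len(sizes) - 2) + ["sigmoid"]
        self.layers = [Layer(n_in, n_out, act, rng)
                       for n_in, n_out, act in zip(sizes[:-1], sizes[1:], activations)]

    def forward(self, X):
        for layer in self.layers:
            X = layer.forward(X)
        return X

    def train_step(self, X, Y, lr, w0=1.0, w1=1.0, eps=1e-12):
        Y_pred = np.clip(self.forward(X), eps, 1.0 - eps)
        loss = -np.mean(w1 * Y * np.log(Y_pred) + w0 * (1 - Y) * np.log(1 - Y_pred))
        # dL/dŷ of the mean weighted cross-entropy, hence the division by N.
        dA = (-w1 * Y / Y_pred + w0 * (1 - Y) / (1 - Y_pred)) / len(X)
        for layer in reversed(self.layers):
            dA = layer.backward(dA)
        for layer in self.layers:
            layer.update(lr)
        return loss

# Synthetic XOR-like data: label 1 when both features share a sign.
rng = np.random.default_rng(0)
X = rng.standard_normal((64, 2))
Y = (X[:, :1] * X[:, 1:] > 0).astype(float)

model = MLP([2, 8, 1], rng)
losses = [model.train_step(X, Y, lr=0.1) for _ in range(300)]
print(f"loss: {losses[0]:.3f} -> {losses[-1]:.3f}")
```

Passing the raw dL/dŷ into the sigmoid layer keeps `Layer.backward` generic. Fusing sigmoid and cross-entropy into the simplified dL/dZ form is more numerically stable once that matters.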

Data Flow Diagram

COMPLETE DATA FLOW
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

              ┌─────────────────────────────────────────────────────┐
              │                   Training Loop                      │
              └───────────────────────┬─────────────────────────────┘
                                      │
                                      ▼
    ┌────────────────┐         ┌─────────────┐         ┌────────────┐
    │   Load Data    │────────▶│  Preprocess │────────▶│   Split    │
    │ creditcard.csv │         │  Normalize  │         │ Train/Test │
    └────────────────┘         └─────────────┘         └────────────┘
                                                              │
                    ┌─────────────────────────────────────────┘
                    │
                    ▼
    ┌──────────────────────────────────────────────────────────────┐
    │                    For Each Epoch                             │
    │  ┌──────────────────────────────────────────────────────────┐ │
    │  │                For Each Mini-Batch                       │ │
    │  │                                                          │ │
    │  │   X_batch        ┌─────────────┐                         │ │
    │  │      │           │   Forward   │          Y_pred         │ │
    │  │      └──────────▶│    Pass     │──────────────┐          │ │
    │  │                  └─────────────┘              │          │ │
    │  │                                               ▼          │ │
    │  │                  ┌─────────────┐      ┌─────────────┐    │ │
    │  │                  │  Backward   │◀─────│ Compute Loss│    │ │
    │  │                  │    Pass     │      │ (Weighted)  │    │ │
    │  │                  └──────┬──────┘      └─────────────┘    │ │
    │  │                         │                                │ │
    │  │                         ▼                                │ │
    │  │                  ┌─────────────┐                         │ │
    │  │                  │   Update    │                         │ │
    │  │                  │   Weights   │                         │ │
    │  │                  │  W -= lr*dW │                         │ │
    │  │                  └─────────────┘                         │ │
    │  └──────────────────────────────────────────────────────────┘ │
    │                                                               │
    │  ┌──────────────────────────────────────────────────────────┐ │
    │  │  Evaluate on Validation Set                              │ │
    │  │  Log: Loss, Accuracy, Precision, Recall, F1              │ │
    │  └──────────────────────────────────────────────────────────┘ │
    └──────────────────────────────────────────────────────────────┘
                    │
                    ▼
    ┌──────────────────────────────────────────────────────────────┐
    │           Final Evaluation on Test Set                        │
    │  • Confusion Matrix                                          │
    │  • Precision, Recall, F1-Score                               │
    │  • ROC-AUC Curve                                             │
    └──────────────────────────────────────────────────────────────┘
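One detail the diagram compresses: normalization statistics are usually computed on the training split only and then reused for validation and test data, so no information leaks from held-out rows. A minimal sketch under that assumption follows. `stratified_split` and `standardize` are hypothetical helper names, and the data is synthetic rather than creditcard.csv.

```python
import numpy as np

def stratified_split(y, test_frac=0.2, seed=0):
    """Return (train_idx, test_idx) keeping the class ratio in both parts."""
    rng = np.random.default_rng(seed)
    y = np.asarray(y).ravel()
    train_idx, test_idx = [], []
    for cls in np.unique(y):
        idx = rng.permutation(np.flatnonzero(y == cls))
        n_test = max(1, int(round(test_frac * len(idx))))
        test_idx.append(idx[:n_test])
        train_idx.append(idx[n_test:])
    return np.concatenate(train_idx), np.concatenate(test_idx)

def standardize(X_train, X_other):
    """Scale both arrays with the mean and std of the training data only."""
    mean = X_train.mean(axis=0)
    std = X_train.std(axis=0) + 1e-8  # guard against constant columns
    return (X_train - mean) / std, (X_other - mean) / std

# Synthetic stand-in for the real data: 1000 rows, 2% positives
rng = np.random.default_rng(1)
X = rng.normal(loc=50.0, scale=10.0, size=(1000, 3))
y = np.zeros(1000, dtype=int)
y[:20] = 1

train_idx, test_idx = stratified_split(y)
X_train, X_test = standardize(X[train_idx], X[test_idx])
y_train, y_test = y[train_idx], y[test_idx]
print(f"train fraud: {y_train.sum()}, test fraud: {y_test.sum()}")
```

The stratified split matters here because a plain random split of very rare positives can leave the test set with almost no fraud cases.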

Phased Implementation Guide

Phase 1: Layer Class with Weights and Biases (Day 1)

Goal: Create the fundamental building block

import numpy as np

class Layer:
    """A single fully-connected layer with optional activation."""

    def __init__(self, n_in: int, n_out: int, activation: str = None):
        """
        Initialize layer with He initialization for weights.

        Args:
            n_in: Number of input features
            n_out: Number of output neurons
            activation: "relu", "sigmoid", or None for linear
        """
        # He initialization (good for ReLU)
        self.weights = np.random.randn(n_in, n_out) * np.sqrt(2.0 / n_in)
        self.biases = np.zeros(n_out)
        self.activation = activation

        # Cache for backpropagation
        self.input_cache = None
        self.z_cache = None  # Pre-activation values

        # Gradient storage
        self.dW = None
        self.db = None

Checkpoint: Verify weights have correct shape, biases are zeros, activation is stored.
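To see why the initialization scale matters, the sketch below pushes random data through ten ReLU layers and measures the size of the final activations, once with He scaling and once with a fixed small scale of 0.01. The depth, width, and the `final_rms` helper are assumptions chosen for illustration and do not match the project's architecture.

```python
import numpy as np

def final_rms(scale_fn, depth=10, width=256, batch=512, seed=0):
    """RMS of activations after `depth` ReLU layers with the given init scale."""
    rng = np.random.default_rng(seed)
    A = rng.standard_normal((batch, width))
    for _ in range(depth):
        W = rng.standard_normal((width, width)) * scale_fn(width)
        A = np.maximum(0, A @ W)  # linear transform followed by ReLU
    return float(np.sqrt(np.mean(A ** 2)))

he_rms = final_rms(lambda n_in: np.sqrt(2.0 / n_in))  # He initialization
small_rms = final_rms(lambda n_in: 0.01)              # naive small weights

print(f"He init:    final RMS = {he_rms:.3f}")
print(f"0.01 init:  final RMS = {small_rms:.3e}")
```

With He scaling the signal stays on the order of the input. With the small fixed scale it collapses toward zero, and the gradients flowing back through those layers shrink the same way.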

Phase 2: Forward Pass Through Layer (Day 1)

Goal: Implement the forward computation

def forward(self, X: np.ndarray) -> np.ndarray:
    """
    Forward pass: Z = X @ W + b, then apply activation.

    Args:
        X: Input array, shape (batch_size, n_in)

    Returns:
        A: Output array, shape (batch_size, n_out)
    """
    # Cache input for backprop
    self.input_cache = X

    # Linear transformation
    Z = X @ self.weights + self.biases
    self.z_cache = Z

    # Apply activation
    if self.activation == "relu":
        A = np.maximum(0, Z)
    elif self.activation == "sigmoid":
        A = 1 / (1 + np.exp(-np.clip(Z, -500, 500)))  # Clip for stability
    else:
        A = Z  # Linear/no activation

    return A

Checkpoint: Test with random input, verify output shape is (batch_size, n_out).

Phase 3: ReLU Activation and Derivative (Day 2)

Goal: Implement ReLU properly with its derivative

def relu(Z: np.ndarray) -> np.ndarray:
    """ReLU activation: max(0, x)"""
    return np.maximum(0, Z)

def relu_derivative(Z: np.ndarray) -> np.ndarray:
    """
    ReLU derivative: 1 if x > 0, else 0

    Note: Derivative at exactly 0 is undefined, but we use 0.
    """
    return (Z > 0).astype(float)

def sigmoid(Z: np.ndarray) -> np.ndarray:
    """Sigmoid activation: 1 / (1 + e^-x)"""
    # Clip to prevent overflow
    Z = np.clip(Z, -500, 500)
    return 1 / (1 + np.exp(-Z))

def sigmoid_derivative(Z: np.ndarray) -> np.ndarray:
    """Sigmoid derivative: sigmoid(x) * (1 - sigmoid(x))"""
    s = sigmoid(Z)
    return s * (1 - s)

Checkpoint: Test that relu(np.array([-1, 0, 1])) = [0, 0, 1].

Phase 4: MLP Class Stacking Layers (Day 2-3)

Goal: Create the network container

class MLP:
    """Multi-Layer Perceptron for binary classification."""

    def __init__(self, layer_sizes: list, activations: list = None):
        """
        Initialize MLP with specified architecture.

        Args:
            layer_sizes: [input_size, hidden1_size, ..., output_size]
            activations: ["relu", "relu", ..., "sigmoid"] per layer

        Example:
            MLP([30, 16, 16, 1], ["relu", "relu", "sigmoid"])
        """
        if activations is None:
            activations = ["relu"] * (len(layer_sizes) - 2) + ["sigmoid"]

        self.layers = []
        for i in range(len(layer_sizes) - 1):
            layer = Layer(
                n_in=layer_sizes[i],
                n_out=layer_sizes[i + 1],
                activation=activations[i]
            )
            self.layers.append(layer)

    def forward(self, X: np.ndarray) -> np.ndarray:
        """Forward pass through all layers."""
        A = X
        for layer in self.layers:
            A = layer.forward(A)
        return A

    def predict(self, X: np.ndarray, threshold: float = 0.5) -> np.ndarray:
        """Return binary predictions."""
        probs = self.forward(X)
        return (probs >= threshold).astype(int)

Checkpoint: Create MLP([30, 16, 16, 1]), forward pass with random input, verify output shape.

Phase 5: Backward Pass (Day 3-4)

Goal: Implement backpropagation through all layers

def backward(self, dA: np.ndarray) -> np.ndarray:
    """
    Backward pass for a single layer.

    Args:
        dA: Gradient of loss w.r.t. this layer's output

    Returns:
        dX: Gradient of loss w.r.t. this layer's input
    """
    m = dA.shape[0]  # Batch size

    # Compute dZ based on activation
    if self.activation == "relu":
        dZ = dA * relu_derivative(self.z_cache)
    elif self.activation == "sigmoid":
        dZ = dA * sigmoid_derivative(self.z_cache)
    else:
        dZ = dA  # Linear

    # Compute gradients for weights and biases
    self.dW = (1/m) * (self.input_cache.T @ dZ)
    self.db = (1/m) * np.sum(dZ, axis=0)

    # Compute gradient for previous layer
    dX = dZ @ self.weights.T

    return dX

# In MLP class:
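def backward(self, y_true: np.ndarray, y_pred: np.ndarray,
             class_weights: dict = None):
    """
    Full backward pass through all layers.

    Assumes the last layer uses a sigmoid activation trained with BCE.

    Args:
        y_true: Ground truth labels, shape (batch_size, 1)
        y_pred: Predicted probabilities, shape (batch_size, 1)
        class_weights: {0: weight_0, 1: weight_1} for imbalance
    """
    m = y_true.shape[0]

    # For sigmoid output with BCE loss, the gradient w.r.t. the output
    # layer's pre-activation Z simplifies to:
    # dL/dZ = w * (y_pred - y_true), with w = 1 when unweighted
    if class_weights:
        weights = np.where(y_true == 1, class_weights[1], class_weights[0])
        dZ = (y_pred - y_true) * weights
    else:
        dZ = y_pred - y_true

    # dZ already includes the sigmoid derivative, so handle the output
    # layer here instead of calling its backward(), which would multiply
    # by sigmoid_derivative a second time
    out = self.layers[-1]
    out.dW = (1/m) * (out.input_cache.T @ dZ)
    out.db = (1/m) * np.sum(dZ, axis=0)
    dA = dZ @ out.weights.T

    # Backpropagate through the remaining layers in reverse
    for layer in reversed(self.layers[:-1]):
        dA = layer.backward(dA)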

Checkpoint: After backward, every layer should have non-zero dW and db.

Phase 6: SGD Optimizer (Day 4)

Goal: Update weights using gradients

def update_weights(self, learning_rate: float):
    """Update all layer weights using SGD."""
    for layer in self.layers:
        layer.weights -= learning_rate * layer.dW
        layer.biases -= learning_rate * layer.db

def train_step(self, X_batch: np.ndarray, y_batch: np.ndarray,
               learning_rate: float, class_weights: dict = None) -> float:
    """
    Single training step: forward, loss, backward, update.

    Returns:
        loss: Binary cross-entropy loss for this batch
    """
    # Forward pass
    y_pred = self.forward(X_batch)

    # Compute loss (BCE)
    epsilon = 1e-7  # Prevent log(0)
    y_pred = np.clip(y_pred, epsilon, 1 - epsilon)

    if class_weights:
        weights = np.where(y_batch == 1, class_weights[1], class_weights[0])
        loss = -np.mean(weights * (
            y_batch * np.log(y_pred) +
            (1 - y_batch) * np.log(1 - y_pred)
        ))
    else:
        loss = -np.mean(
            y_batch * np.log(y_pred) +
            (1 - y_batch) * np.log(1 - y_pred)
        )

    # Backward pass
    self.backward(y_batch, y_pred, class_weights)

    # Update weights
    self.update_weights(learning_rate)

    return loss

Checkpoint: Train on small batch, verify loss decreases over iterations.

Phase 7: Class Weighting for Imbalance (Day 5)

Goal: Implement balanced training

def compute_class_weights(y: np.ndarray) -> dict:
    """
    Compute class weights inversely proportional to class frequencies.

    Returns:
        {0: weight_0, 1: weight_1}
    """
    n_samples = len(y)
    n_classes = 2

    counts = np.bincount(y.flatten().astype(int))
    weights = n_samples / (n_classes * counts)

    return {0: weights[0], 1: weights[1]}

# Example usage:
# y_train = np.array([0, 0, 0, 0, 0, 0, 0, 0, 0, 1])  # 90% class 0, 10% class 1
# weights = compute_class_weights(y_train)
# weights -> {0: 0.556, 1: 5.0}  # Class 1 weighted 9x more
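The same formula can be worked through at the scale of the fraud data. The sketch below assumes the commonly cited counts for the Kaggle credit card dataset (284,315 legitimate and 492 fraud rows) and includes its own copy of `compute_class_weights` so it runs on its own. Treat the counts as an assumption and verify them against your copy of creditcard.csv.

```python
import numpy as np

def compute_class_weights(y):
    """Balanced weights: n_samples / (n_classes * count_per_class)."""
    y = np.asarray(y).ravel().astype(int)
    counts = np.bincount(y, minlength=2)
    weights = len(y) / (2 * counts)
    return {0: float(weights[0]), 1: float(weights[1])}

# Assumed class counts for the full dataset (check against your file)
n_legit, n_fraud = 284_315, 492
y = np.concatenate([np.zeros(n_legit, dtype=int), np.ones(n_fraud, dtype=int)])

w = compute_class_weights(y)
total_legit = w[0] * n_legit   # total loss weight carried by class 0
total_fraud = w[1] * n_fraud   # total loss weight carried by class 1

print(f"weights: {w}")
print(f"total weight per class: {total_legit:.1f} vs {total_fraud:.1f}")
```

The point of this weighting is visible in the last line: after weighting, both classes contribute the same total weight to the loss even though one class is hundreds of times larger.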

Phase 8: Evaluation Metrics (Day 5-6)

Goal: Implement proper evaluation

def evaluate(self, X: np.ndarray, y: np.ndarray, threshold: float = 0.5) -> dict:
    """
    Compute classification metrics.

    Returns:
        dict with accuracy, precision, recall, f1, confusion_matrix
    """
    y_pred = (self.forward(X) >= threshold).astype(int).flatten()
    y_true = y.flatten().astype(int)

    # Confusion matrix elements
    tp = np.sum((y_pred == 1) & (y_true == 1))
    tn = np.sum((y_pred == 0) & (y_true == 0))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))

    # Metrics
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) > 0 else 0

    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "confusion_matrix": {"TP": tp, "TN": tn, "FP": fp, "FN": fn}
    }
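The final evaluation step also lists ROC-AUC, which `evaluate` does not compute. One way to get it without extra libraries is the rank formulation: AUC equals the probability that a randomly chosen fraud case scores higher than a randomly chosen legitimate one. The sketch below is one possible implementation. `roc_auc` is a hypothetical helper name, and ties are handled with average ranks.

```python
import numpy as np

def roc_auc(y_true, scores):
    """ROC AUC via the rank-sum (Mann-Whitney U) identity."""
    y_true = np.asarray(y_true).ravel().astype(int)
    scores = np.asarray(scores, dtype=float).ravel()
    n_pos = int(np.sum(y_true == 1))
    n_neg = int(np.sum(y_true == 0))
    if n_pos == 0 or n_neg == 0:
        raise ValueError("ROC AUC needs at least one sample of each class")

    # Average 1-based rank for every score; tied scores share the same rank
    _, inverse, counts = np.unique(scores, return_inverse=True, return_counts=True)
    last_rank = np.cumsum(counts)
    avg_rank = last_rank - (counts - 1) / 2.0
    ranks = avg_rank[inverse]

    rank_sum_pos = ranks[y_true == 1].sum()
    u_stat = rank_sum_pos - n_pos * (n_pos + 1) / 2.0
    return float(u_stat / (n_pos * n_neg))

# Small hand-checkable example: 3 of the 4 (fraud, legit) pairs are ordered correctly
auc_example = roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
print(auc_example)
```

To use it with the network, pass the raw probabilities from `forward`, not the thresholded predictions.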

Phase 8b: Full Training Loop (Day 6-7)

Goal: Put it all together

def fit(self, X_train: np.ndarray, y_train: np.ndarray,
        X_val: np.ndarray = None, y_val: np.ndarray = None,
        epochs: int = 50, batch_size: int = 64,
        learning_rate: float = 0.01,
        use_class_weights: bool = True,
        verbose: bool = True):
    """
    Full training loop with mini-batches.
    """
    n_samples = X_train.shape[0]

    # Compute class weights
    class_weights = compute_class_weights(y_train) if use_class_weights else None

    if verbose and class_weights:
        print(f"Class weights: {class_weights}")

    history = {"loss": [], "val_metrics": []}

    for epoch in range(epochs):
        # Shuffle data each epoch
        indices = np.random.permutation(n_samples)
        X_shuffled = X_train[indices]
        y_shuffled = y_train[indices]

        epoch_losses = []

        # Mini-batch training
        for i in range(0, n_samples, batch_size):
            X_batch = X_shuffled[i:i+batch_size]
            y_batch = y_shuffled[i:i+batch_size]

            loss = self.train_step(X_batch, y_batch, learning_rate, class_weights)
            epoch_losses.append(loss)

        avg_loss = np.mean(epoch_losses)
        history["loss"].append(avg_loss)

        # Validation (recorded every epoch, printed only when verbose)
        if X_val is not None and y_val is not None:
            metrics = self.evaluate(X_val, y_val)
            history["val_metrics"].append(metrics)

            if verbose:
                print(f"Epoch {epoch+1}/{epochs}: "
                      f"Loss={avg_loss:.4f} | "
                      f"Acc={metrics['accuracy']:.3f} | "
                      f"Prec={metrics['precision']:.3f} | "
                      f"Rec={metrics['recall']:.3f} | "
                      f"F1={metrics['f1']:.3f}")

    return history

Questions to Guide Your Design

Before writing code, think through these questions:

Architecture Decisions

How many hidden layers do you need?
- Start with 2 hidden layers (16 neurons each)
- Add more if underfitting
- The architecture [30 → 16 → 16 → 1] is a good starting point
Why not use Sigmoid in hidden layers?
- Vanishing gradients would kill learning in deep networks
- ReLU keeps gradients flowing
- Only use Sigmoid for the output (probability interpretation)
What batch size should you use?
- 64 is a good default
- Too small: noisy gradients and slow epochs (little benefit from vectorization)
- Too large: smoother gradients, but fewer updates per epoch and often worse generalization
- For imbalanced data: ensure batches contain minority samples
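The last point about batches can be sketched as an index generator that places a fixed number of fraud rows in every mini-batch. This is one possible approach and not the project's required method. `balanced_batches` and `n_minority` are hypothetical names, and fraud rows are drawn with replacement, so each one is seen many times per epoch.

```python
import numpy as np

def balanced_batches(y, batch_size=64, n_minority=4, seed=0):
    """Yield index arrays; each batch contains exactly n_minority fraud rows."""
    rng = np.random.default_rng(seed)
    y = np.asarray(y).ravel()
    majority = rng.permutation(np.flatnonzero(y == 0))
    minority = np.flatnonzero(y == 1)
    n_major = batch_size - n_minority
    for start in range(0, len(majority), n_major):
        major_part = majority[start:start + n_major]
        # Drawn with replacement: fraud rows repeat across batches
        minor_part = rng.choice(minority, size=n_minority, replace=True)
        yield rng.permutation(np.concatenate([major_part, minor_part]))

# Synthetic labels: 10,000 rows with 0.2% fraud
y = np.zeros(10_000, dtype=int)
y[:20] = 1
batches = list(balanced_batches(y))

print(f"{len(batches)} batches, fraud per batch: {int(y[batches[0]].sum())}")
```

Oversampling like this already raises the influence of the fraud class. If you combine it with class weights, reduce the weights accordingly, otherwise the minority class is boosted twice.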

Imbalance Handling

Why is 99% accuracy worthless here?
- Because predicting “all legitimate” gives 99.83% accuracy
- We care about catching fraud, not overall correctness
- Recall matters more than accuracy
Class weights vs SMOTE - which to use?
- Class weights: Simple, no synthetic data, works well in practice
- SMOTE: Creates synthetic minority samples, can help but risks overfitting
- Start with class weights, try SMOTE if recall is too low
What threshold should you use for prediction?
- Default 0.5 assumes equal costs for errors
- Lower threshold (0.3): Catch more fraud, more false alarms
- Higher threshold (0.7): Fewer false alarms, miss more fraud
- Tune based on business requirements
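The threshold trade-off is easy to see on a small hand-made example. The sketch below evaluates precision and recall at the three thresholds mentioned above. The labels and probabilities are invented for illustration, and `precision_recall_at` is a hypothetical helper.

```python
import numpy as np

def precision_recall_at(y_true, probs, threshold):
    """Precision and recall of the positive class at a given threshold."""
    y_true = np.asarray(y_true).ravel().astype(int)
    y_pred = (np.asarray(probs).ravel() >= threshold).astype(int)
    tp = int(np.sum((y_pred == 1) & (y_true == 1)))
    fp = int(np.sum((y_pred == 1) & (y_true == 0)))
    fn = int(np.sum((y_pred == 0) & (y_true == 1)))
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    return precision, recall

# Invented scores: 6 legitimate transactions followed by 4 fraud cases
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
probs = np.array([0.05, 0.10, 0.20, 0.35, 0.45, 0.60, 0.25, 0.40, 0.65, 0.90])

results = {t: precision_recall_at(y_true, probs, t) for t in (0.3, 0.5, 0.7)}
for t, (p, r) in results.items():
    print(f"threshold={t:.1f}  precision={p:.2f}  recall={r:.2f}")
```

On this toy data recall falls from 0.75 to 0.25 as the threshold rises, while precision climbs from 0.50 to 1.00. Run the same sweep on your validation set (never the test set) to pick an operating point.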

Debugging

How do you know if training is working?
- Loss should decrease
- Recall should increase (model learning to detect fraud)
- If Recall stays at 0 for many epochs, class weights may be too low
What if all neurons output the same value?
- Check initialization (weights should be diverse)
- Check for dead ReLUs (too many neurons stuck at 0)
- Reduce learning rate

Thinking Exercise

Before implementing, work through this exercise by hand:

The XOR Problem with a Hidden Layer

Setup: Build a network to solve XOR

Network: 2 inputs → 2 hidden (ReLU) → 1 output (Sigmoid)

Initial weights (random example):
  Hidden layer: W1 = [[0.5, -0.5],   b1 = [0, 0]
                      [0.5, -0.5]]
  Output layer: W2 = [[1.0],         b2 = [0]
                      [-1.0]]

Training data:
  X = [[0, 0], [0, 1], [1, 0], [1, 1]]
  Y = [[0],    [1],    [1],    [0]]

Task: Hand-trace the forward pass for input [0, 1]

Hidden layer pre-activation (Z1):

Z1 = [0, 1] @ [[0.5, -0.5], [0.5, -0.5]] + [0, 0]
Z1 = [0*0.5 + 1*0.5, 0*(-0.5) + 1*(-0.5)]
Z1 = [0.5, -0.5]

Hidden layer activation (A1) - ReLU:

A1 = ReLU([0.5, -0.5])
A1 = [0.5, 0]  (negative value becomes 0)

Output pre-activation (Z2):

Z2 = [0.5, 0] @ [[1.0], [-1.0]] + [0]
Z2 = [0.5*1.0 + 0*(-1.0)]
Z2 = [0.5]

Output activation (A2) - Sigmoid:

A2 = Sigmoid(0.5)
A2 = 1 / (1 + e^(-0.5))
A2 ≈ 0.62

Loss (BCE, target = 1):

Loss = -(1 * log(0.62) + 0 * log(0.38))
Loss ≈ 0.48

Now you try: Trace the forward pass for input [1, 1] (target = 0).

Draw the decision boundary: After training, the hidden layer transforms the 2D input space such that XOR becomes linearly separable. Sketch what this transformation might look like.
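Once you have traced the forward pass by hand, a few lines of NumPy can confirm your numbers. The sketch below hard-codes the example weights from the setup. `trace` is a hypothetical helper that returns every intermediate value. It is run here only for the worked input [0, 1], so the [1, 1] exercise is left for you.

```python
import numpy as np

# Example weights from the exercise setup
W1 = np.array([[0.5, -0.5],
               [0.5, -0.5]])
b1 = np.zeros(2)
W2 = np.array([[1.0],
               [-1.0]])
b2 = np.zeros(1)

def trace(x, target):
    """Forward pass through the 2-2-1 network, returning all intermediates."""
    x = np.asarray(x, dtype=float)
    z1 = x @ W1 + b1
    a1 = np.maximum(0, z1)             # ReLU
    z2 = a1 @ W2 + b2
    a2 = 1.0 / (1.0 + np.exp(-z2))     # Sigmoid
    loss = -(target * np.log(a2) + (1 - target) * np.log(1 - a2))
    return {"Z1": z1, "A1": a1, "Z2": z2, "A2": a2, "loss": loss}

result = trace([0, 1], target=1)
for name, value in result.items():
    print(f"{name} = {value}")
```

The script prints a loss of about 0.474 because it uses the unrounded sigmoid output (0.6225). The hand calculation above rounds A2 to 0.62 first, which gives 0.48. After finishing your own trace, call `trace([1, 1], target=0)` to compare.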

Testing Strategy

Unit Tests

def test_layer_forward():
    """Test Layer forward pass produces correct shape."""
    layer = Layer(n_in=10, n_out=5, activation="relu")
    X = np.random.randn(32, 10)  # Batch of 32
    A = layer.forward(X)

    assert A.shape == (32, 5), f"Expected (32, 5), got {A.shape}"
    assert np.all(A >= 0), "ReLU should produce non-negative outputs"
    assert layer.input_cache is not None, "Should cache input"
    assert layer.z_cache is not None, "Should cache pre-activation"

def test_layer_backward():
    """Test Layer backward pass computes gradients."""
    layer = Layer(n_in=10, n_out=5, activation="relu")
    X = np.random.randn(32, 10)

    # Forward
    A = layer.forward(X)

    # Backward (fake gradient from next layer)
    dA = np.random.randn(32, 5)
    dX = layer.backward(dA)

    assert dX.shape == X.shape, "Gradient should match input shape"
    assert layer.dW.shape == layer.weights.shape
    assert layer.db.shape == layer.biases.shape

def test_mlp_forward():
    """Test MLP produces probability output."""
    mlp = MLP([30, 16, 16, 1], ["relu", "relu", "sigmoid"])
    X = np.random.randn(64, 30)

    y_pred = mlp.forward(X)

    assert y_pred.shape == (64, 1)
    assert np.all(y_pred >= 0) and np.all(y_pred <= 1), "Should be probabilities"

def test_relu():
    """Test ReLU activation."""
    Z = np.array([-2, -1, 0, 1, 2])
    expected = np.array([0, 0, 0, 1, 2])

    assert np.allclose(relu(Z), expected)

def test_class_weights():
    """Test class weight computation."""
    y = np.array([0, 0, 0, 0, 0, 0, 0, 0, 0, 1]).reshape(-1, 1)
    weights = compute_class_weights(y)

    # Class 1 should have ~9x the weight of class 0
    assert weights[1] > 5 * weights[0]

Integration Tests

def test_training_reduces_loss():
    """Test that training actually reduces loss."""
    np.random.seed(42)

    # Synthetic linearly separable data
    X = np.random.randn(1000, 10)
    y = (X[:, 0] + X[:, 1] > 0).astype(float).reshape(-1, 1)

    mlp = MLP([10, 8, 1], ["relu", "sigmoid"])

    initial_loss = mlp.train_step(X[:100], y[:100], learning_rate=0.01)

    # Train for 50 more steps on the same batch
    for _ in range(50):
        mlp.train_step(X[:100], y[:100], learning_rate=0.01)

    final_loss = mlp.train_step(X[:100], y[:100], learning_rate=0.01)

    assert final_loss < initial_loss, "Loss should decrease with training"

def test_xor_solved():
    """Test that MLP can solve XOR (proves hidden layers work)."""
    X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
    y = np.array([[0], [1], [1], [0]])

    mlp = MLP([2, 4, 1], ["relu", "sigmoid"])

    # Train
    for _ in range(1000):
        mlp.train_step(X, y, learning_rate=0.1)

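    # Evaluate
    predictions = mlp.predict(X)
    accuracy = np.mean(predictions == y)

    # Note: a purely linear model can already reach 0.75 on XOR, so only an
    # accuracy of 1.0 proves the hidden layer is doing its job. Fix a random
    # seed and raise this threshold once training is stable.
    assert accuracy >= 0.75, f"Should solve XOR, got accuracy {accuracy}"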

Smoke Test on Real Data

def test_fraud_data_loading():
    """Test that creditcard.csv loads correctly."""
    import pandas as pd

    df = pd.read_csv("creditcard.csv")

    assert "Class" in df.columns, "Should have Class column"
    assert df.shape[1] == 31, "Should have 30 features + 1 label"
    assert df["Class"].isin([0, 1]).all(), "Labels should be 0 or 1"

    fraud_ratio = df["Class"].mean()
    assert fraud_ratio < 0.01, "Fraud should be <1% of data"

Common Pitfalls and Debugging Tips

1. Recall Stays at 0%

Symptom: Model predicts “legitimate” for everything.

Epoch 10: Accuracy 99.8%, Recall 0.0%
Epoch 20: Accuracy 99.8%, Recall 0.0%
...
Model never learns to detect fraud!

Causes:

Class weights not applied or too small
Learning rate too low
Network too small to learn the pattern

Fix:

# Verify class weights are correct
weights = compute_class_weights(y_train)
print(f"Class weights: {weights}")
# Should be something like {0: 0.5, 1: 289}

# Increase learning rate
learning_rate = 0.1  # Start higher, reduce if unstable

# Verify weights are being used in loss
# Add debug print in train_step()

2. Loss is NaN or Infinity

Symptom: Training explodes.

Epoch 1: Loss = 0.69
Epoch 2: Loss = 15.4
Epoch 3: Loss = inf
Epoch 4: Loss = nan

Causes:

Learning rate too high
No gradient clipping
Log of 0 in BCE loss

Fix:

# Clip predictions to avoid log(0)
epsilon = 1e-7
y_pred = np.clip(y_pred, epsilon, 1 - epsilon)

# Reduce learning rate
learning_rate = 0.001

# Clip gradients (optional)
for layer in self.layers:
    layer.dW = np.clip(layer.dW, -1, 1)
    layer.db = np.clip(layer.db, -1, 1)

3. Dead ReLU Neurons

Symptom: Many hidden layer outputs are exactly 0 for all inputs.

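A = mlp.layers[0].forward(X)
dead = np.all(A == 0, axis=0)  # A neuron is dead if it outputs 0 for every input
print(dead.mean())  # Fraction of dead neurons; should be close to 0
# Note: roughly half of all individual activations being 0 is normal for ReLU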

Causes:

Poor weight initialization
Learning rate too high caused weights to go very negative
Bias initialization issue

Fix:

# Use He initialization properly
self.weights = np.random.randn(n_in, n_out) * np.sqrt(2.0 / n_in)

# Initialize biases to small positive value
self.biases = np.full(n_out, 0.01)  # Helps ReLU stay active initially

# Try Leaky ReLU instead
def leaky_relu(Z, alpha=0.01):
    return np.where(Z > 0, Z, alpha * Z)

4. Gradients Vanishing or Exploding

Symptom: Early layers don’t learn, or training is unstable.

Diagnosis:

# Check gradient magnitudes after backward pass
for i, layer in enumerate(mlp.layers):
    print(f"Layer {i}: |dW| mean = {np.abs(layer.dW).mean():.6f}")
# Should be similar order of magnitude across layers

Fix:

Use proper initialization (He for ReLU, Xavier for sigmoid/tanh)
Use ReLU instead of sigmoid in hidden layers
Add batch normalization (advanced)
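A rough back-of-the-envelope calculation shows why the second fix helps. The sketch below looks only at the activation-derivative factor that a gradient picks up along one path through the network, ignoring the weight matrices entirely (a deliberate simplification). It uses the best case for sigmoid, where every unit sits at z = 0.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# The sigmoid derivative s * (1 - s) peaks at z = 0
z_grid = np.linspace(-10, 10, 2001)
max_sigmoid_slope = float(np.max(sigmoid(z_grid) * (1 - sigmoid(z_grid))))

depths = [1, 5, 10, 20]
# Best case for sigmoid: every layer contributes its maximum slope
sigmoid_best = {d: max_sigmoid_slope ** d for d in depths}
# ReLU along a path of active units contributes a factor of exactly 1 per layer
relu_active = {d: 1.0 ** d for d in depths}

for d in depths:
    print(f"depth {d:2d}: sigmoid <= {sigmoid_best[d]:.2e}, relu (active) = {relu_active[d]:.1f}")
```

Even in the best case, ten sigmoid layers scale the gradient by less than one millionth. ReLU avoids this only along active units. Inactive units pass no gradient at all, which is the dead-neuron problem described in pitfall 3.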

5. Overfitting

Symptom: Training metrics are great, but validation metrics are worse.

Epoch 50: Train F1=0.95, Val F1=0.65

Causes:

Network too large for dataset
Training too long
Not enough regularization

Fix:

# Reduce network size
mlp = MLP([30, 8, 8, 1])  # Fewer neurons

# Add L2 regularization to weight updates
lambda_l2 = 0.001
self.dW += lambda_l2 * self.weights

# Use early stopping
if val_f1 < best_val_f1:
    patience_counter += 1
    if patience_counter >= patience:
        break
else:
    best_val_f1 = val_f1
    patience_counter = 0

Interview Questions This Project Prepares You For

Understanding Questions

“Why can’t a single-layer network solve XOR?”
- A single layer computes a linear combination of inputs
- The decision boundary is a hyperplane (straight line in 2D)
- XOR requires a non-linear boundary (you need to “fold” the space)
- Adding a hidden layer allows learning feature transformations that make XOR linearly separable in the new space
“Explain the vanishing gradient problem and how ReLU solves it.”
- The sigmoid derivative is at most 0.25 (tanh's is at most 1 and usually smaller), so gradients shrink at every layer they backpropagate through
- In deep networks, early layer gradients become ~0, preventing learning
- ReLU has derivative = 1 for positive inputs, so gradients pass through active units unscaled
- This enabled training of networks with many layers
“What’s wrong with using accuracy for imbalanced classification?”
- A naive model that always predicts the majority class achieves high accuracy
- For 99:1 imbalance, predicting “always 0” gives 99% accuracy but 0% utility
- Precision and Recall measure what matters: catching the minority class without too many false alarms
- F1-score balances precision and recall
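The accuracy trap from the last question takes only a few lines to reproduce. The sketch below scores a "model" that always predicts legitimate on synthetic labels with 0.2% fraud. The counts are made up for illustration.

```python
import numpy as np

# Synthetic labels: 9,980 legitimate and 20 fraud (0.2% fraud)
n_legit, n_fraud = 9_980, 20
y_true = np.concatenate([np.zeros(n_legit, dtype=int), np.ones(n_fraud, dtype=int)])

# A "model" that never flags anything
y_pred = np.zeros_like(y_true)

accuracy = float(np.mean(y_pred == y_true))
tp = int(np.sum((y_pred == 1) & (y_true == 1)))
fn = int(np.sum((y_pred == 0) & (y_true == 1)))
recall = tp / (tp + fn)

print(f"accuracy = {accuracy:.3f}, recall = {recall:.3f}, missed fraud = {fn}")
```

Accuracy comes out at 0.998 while every single fraud case is missed, which is why the project reports precision, recall, and F1 alongside it.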

Implementation Questions

“Walk me through one forward-backward pass of your MLP.”
- Forward: Input → (linear transform + activation) for each layer → output probability
- Loss: Compare prediction to label using weighted BCE
- Backward: Compute dL/dA for output, propagate through each layer computing dW, db, dX
- Update: W -= learning_rate * dW for each layer
“How do you handle the class imbalance problem?”
- Class weights: Multiply loss by inverse class frequency
- SMOTE: Generate synthetic minority samples
- Threshold tuning: Lower classification threshold to catch more minority class
- Stratified sampling: Ensure each batch contains minority samples
“What’s the difference between batch, mini-batch, and stochastic gradient descent?”
- Batch: Entire dataset per update - stable gradients, but each update is slow and memory-hungry on large datasets
- SGD: One sample per update - noisy but fast, helps generalization
- Mini-batch: N samples per update - best of both worlds, vectorizable

Design Questions

“How would you decide on the network architecture?”
- Start simple: 2 hidden layers, 16-32 neurons each
- Increase if underfitting (low train AND val performance)
- Decrease if overfitting (high train, low val performance)
- For tabular data, 2-4 layers usually sufficient
- Use validation set to tune, not test set
“How would you deploy this model in production?”
- Save trained weights (np.save or pickle)
- Wrap in prediction API (Flask/FastAPI)
- Apply same preprocessing (normalization) to new transactions
- Log predictions and actual outcomes for monitoring
- Retrain periodically as fraud patterns change
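The first and third points go together: the normalization statistics must travel with the weights, or new transactions will be scaled differently from the training data. The sketch below stores both in a single file with `np.savez` (a variant of the `np.save` option mentioned above). `save_model`, `load_model`, and the file name are hypothetical, and the arrays are placeholders, not trained values.

```python
import os
import tempfile

import numpy as np

def save_model(path, weights, biases, mean, std):
    """Store layer parameters and normalization statistics in one .npz file."""
    arrays = {"mean": mean, "std": std}
    for i, (W, b) in enumerate(zip(weights, biases)):
        arrays[f"W{i}"] = W
        arrays[f"b{i}"] = b
    np.savez(path, **arrays)

def load_model(path):
    """Load parameters back in layer order, plus the normalization statistics."""
    with np.load(path) as data:
        n_layers = sum(1 for key in data.files if key.startswith("W"))
        weights = [data[f"W{i}"] for i in range(n_layers)]
        biases = [data[f"b{i}"] for i in range(n_layers)]
        return weights, biases, data["mean"], data["std"]

# Placeholder parameters for a 3-2-1 network
weights = [np.arange(6.0).reshape(3, 2), np.ones((2, 1))]
biases = [np.zeros(2), np.zeros(1)]
mean, std = np.array([1.0, 2.0, 3.0]), np.array([0.5, 0.5, 2.0])

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "fraud_mlp.npz")
    save_model(path, weights, biases, mean, std)
    loaded_w, loaded_b, loaded_mean, loaded_std = load_model(path)

print(f"restored {len(loaded_w)} layers")
```

Plain arrays in an `.npz` file are also safer to load than a pickle, which can execute arbitrary code when opened.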

Hints in Layers

Stuck? Read only the hint level you need.

Challenge: Model Predicts All Same Class

Hint Level 1 (Conceptual): The model found a shortcut. It’s easier to always predict the majority class than to learn patterns.

Hint Level 2 (Direction): You need to penalize errors on the minority class more heavily. The loss function should “care more” about fraud.

Hint Level 3 (Specific): Multiply the loss for each sample by a weight that’s inversely proportional to class frequency. For this dataset that makes the weight for class 1 (fraud) several hundred times larger than the weight for class 0 (about 289 vs. 0.5).

Hint Level 4 (Code):

# Compute weights
class_weights = compute_class_weights(y_train)  # {0: 0.5, 1: 289}

# Apply in loss
sample_weights = np.where(y_batch == 1, class_weights[1], class_weights[0])
loss = -np.mean(sample_weights * (y_batch * np.log(y_pred) + (1 - y_batch) * np.log(1 - y_pred)))

Challenge: Backward Pass is Wrong

Hint Level 1 (Conceptual): The chain rule must be applied correctly. Each layer’s gradient depends on the gradient from the next layer.

Hint Level 2 (Direction): For ReLU, the gradient is 0 where input was negative, and 1 where positive. You need to multiply by this mask.

Hint Level 3 (Specific): Store Z (pre-activation) during forward pass. In backward, compute dZ = dA * (Z > 0) for ReLU.

Hint Level 4 (Code):

def backward(self, dA):
    # dA: gradient from next layer (or loss)

    if self.activation == "relu":
        dZ = dA * (self.z_cache > 0).astype(float)
    elif self.activation == "sigmoid":
        s = sigmoid(self.z_cache)
        dZ = dA * s * (1 - s)
    else:
        dZ = dA

    m = dA.shape[0]
    self.dW = (1/m) * self.input_cache.T @ dZ
    self.db = (1/m) * np.sum(dZ, axis=0)
    dX = dZ @ self.weights.T

    return dX

Challenge: Loss Not Decreasing

Hint Level 1 (Conceptual): Either the learning rate is wrong, or the gradients are wrong.

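Hint Level 2 (Direction): Try a higher learning rate (0.1 or 1.0) to see if loss moves at all. If it changes or explodes, gradients are reaching the weights and the original rate was probably too low; if nothing happens, the gradients may be zero or wrong. A large change in loss does not prove the gradients are correct, so compare them against finite differences before trusting them.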

Hint Level 3 (Specific): Print gradient magnitudes. They should be non-zero and roughly similar across layers.

Hint Level 4 (Code):

# Debug: Print gradient stats
for i, layer in enumerate(mlp.layers):
    print(f"Layer {i}:")
    print(f"  |dW| mean: {np.abs(layer.dW).mean():.6f}")
    print(f"  |db| mean: {np.abs(layer.db).mean():.6f}")
    print(f"  dW range: [{layer.dW.min():.4f}, {layer.dW.max():.4f}]")
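Gradient statistics tell you whether gradients are flowing, not whether they are right. A numerical gradient check answers the second question. The sketch below builds a tiny stand-alone 3-4-1 network (ReLU hidden layer, sigmoid output, mean BCE) and compares its analytic gradients against central finite differences. All names and sizes are assumptions for illustration. Adapt the idea to your own `Layer` and `MLP` classes.

```python
import numpy as np

def forward_loss(params, X, y):
    """Tiny 2-layer net (ReLU hidden, sigmoid output) with mean BCE loss."""
    W1, b1, W2, b2 = params
    Z1 = X @ W1 + b1
    A1 = np.maximum(0, Z1)
    Z2 = A1 @ W2 + b2
    y_hat = 1.0 / (1.0 + np.exp(-Z2))
    loss = -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))
    return loss, (Z1, A1, y_hat)

def analytic_grads(params, X, y):
    """Backpropagation by hand for the network above."""
    W1, b1, W2, b2 = params
    _, (Z1, A1, y_hat) = forward_loss(params, X, y)
    m = X.shape[0]
    dZ2 = (y_hat - y) / m              # sigmoid + BCE shortcut, averaged over batch
    dW2 = A1.T @ dZ2
    db2 = dZ2.sum(axis=0)
    dZ1 = (dZ2 @ W2.T) * (Z1 > 0)      # ReLU mask
    dW1 = X.T @ dZ1
    db1 = dZ1.sum(axis=0)
    return [dW1, db1, dW2, db2]

def numeric_grads(params, X, y, h=1e-5):
    """Central finite differences, perturbing one parameter at a time."""
    grads = []
    for p in params:
        g = np.zeros_like(p)
        for idx in np.ndindex(p.shape):
            old = p[idx]
            p[idx] = old + h
            loss_plus, _ = forward_loss(params, X, y)
            p[idx] = old - h
            loss_minus, _ = forward_loss(params, X, y)
            p[idx] = old               # restore the original value
            g[idx] = (loss_plus - loss_minus) / (2 * h)
        grads.append(g)
    return grads

rng = np.random.default_rng(0)
X = rng.standard_normal((8, 3))
y = rng.integers(0, 2, size=(8, 1)).astype(float)
params = [rng.standard_normal((3, 4)) * 0.5, np.zeros(4),
          rng.standard_normal((4, 1)) * 0.5, np.zeros(1)]

analytic = analytic_grads(params, X, y)
numeric = numeric_grads(params, X, y)
max_diff = max(float(np.max(np.abs(a - n))) for a, n in zip(analytic, numeric))
print(f"max |analytic - numeric| = {max_diff:.2e}")
```

A difference well below 1e-6 suggests the backward pass is correct. Values around 1e-2 or larger usually point to a missing transpose, a forgotten 1/m factor, or a wrong activation derivative. Keep the check to small networks, since it needs two forward passes per parameter.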

Extensions and Challenges

1. Add Dropout Regularization

Dropout randomly “turns off” neurons during training, preventing co-adaptation.

class DropoutLayer:
    def __init__(self, p: float = 0.5):
        """p = probability of KEEPING a neuron (not dropping)."""
        self.p = p
        self.mask = None

    def forward(self, X, training=True):
        if training:
            self.mask = (np.random.rand(*X.shape) < self.p) / self.p
            return X * self.mask
        else:
            return X  # No dropout during inference

    def backward(self, dA):
        return dA * self.mask
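The division by `self.p` in `forward` is what lets inference skip dropout entirely: scaling the surviving activations keeps their expected value unchanged. The sketch below demonstrates this on an array of ones with an assumed keep probability of 0.8.

```python
import numpy as np

rng = np.random.default_rng(0)
p_keep = 0.8                      # probability of keeping a neuron (assumed value)
X = np.ones((1000, 100))

# Same mask construction as in DropoutLayer.forward
mask = (rng.random(X.shape) < p_keep) / p_keep
dropped = X * mask

mean_train = float(dropped.mean())          # stays close to 1.0
kept_fraction = float((mask > 0).mean())    # close to p_keep

print(f"kept fraction = {kept_fraction:.3f}, mean activation = {mean_train:.3f}")
```

Kept units are scaled to 1 / 0.8 = 1.25 and dropped units are 0, so the average stays near 1. Without the division the training-time average would be about 0.8, and the next layer would see larger inputs at inference than it was trained on.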

2. Implement Adam Optimizer

Adam adapts learning rate per-parameter using momentum and second moments.

class AdamOptimizer:
    def __init__(self, lr=0.001, beta1=0.9, beta2=0.999, epsilon=1e-8):
        self.lr = lr
        self.beta1 = beta1
        self.beta2 = beta2
        self.epsilon = epsilon
        self.m = {}  # First moment, per layer
        self.v = {}  # Second moment, per layer
        self.t = {}  # Timestep, per layer

    def update(self, layer, layer_id):
        if layer_id not in self.m:
            self.m[layer_id] = {"W": np.zeros_like(layer.weights),
                                "b": np.zeros_like(layer.biases)}
            self.v[layer_id] = {"W": np.zeros_like(layer.weights),
                                "b": np.zeros_like(layer.biases)}
            self.t[layer_id] = 0

        # Count steps per layer: update() is called once per layer per
        # training step, so a single shared counter would advance too fast
        # and distort the bias correction
        self.t[layer_id] += 1
        t = self.t[layer_id]

        for param, grad, key in [(layer.weights, layer.dW, "W"),
                                  (layer.biases, layer.db, "b")]:
            # Update moments
            self.m[layer_id][key] = self.beta1 * self.m[layer_id][key] + (1-self.beta1) * grad
            self.v[layer_id][key] = self.beta2 * self.v[layer_id][key] + (1-self.beta2) * grad**2

            # Bias correction
            m_hat = self.m[layer_id][key] / (1 - self.beta1**t)
            v_hat = self.v[layer_id][key] / (1 - self.beta2**t)

            # Update in place so the layer's own arrays are modified
            param -= self.lr * m_hat / (np.sqrt(v_hat) + self.epsilon)

3. Try Different Architectures

Experiment with:

Wider networks: [30, 64, 64, 1]
Deeper networks: [30, 16, 16, 16, 1]
Bottleneck: [30, 8, 16, 8, 1] (compression in middle)
Residual connections (advanced)
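Before training each variant, it helps to know how many parameters it has, since that is a rough guide to capacity and overfitting risk. The sketch below counts weights and biases for the fully connected architectures listed above. `count_parameters` is a hypothetical helper.

```python
def count_parameters(layer_sizes):
    """Weights (n_in * n_out) plus biases (n_out) for each fully connected layer."""
    return sum(n_in * n_out + n_out
               for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]))

architectures = {
    "baseline":   [30, 16, 16, 1],
    "wider":      [30, 64, 64, 1],
    "deeper":     [30, 16, 16, 16, 1],
    "bottleneck": [30, 8, 16, 8, 1],
}

param_counts = {name: count_parameters(sizes) for name, sizes in architectures.items()}
for name, n_params in param_counts.items():
    print(f"{name:10s} {architectures[name]} -> {n_params} parameters")
```

The wider network has about eight times the parameters of the baseline (6209 vs. 785), while the deeper one adds only 272. With few fraud examples in the training set, watch validation F1 closely when moving to the larger variants.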

4. Implement Learning Rate Scheduling

Reduce learning rate as training progresses:

def lr_schedule(epoch, initial_lr=0.1):
    """Decay learning rate by 10x every 20 epochs."""
    return initial_lr * (0.1 ** (epoch // 20))

# Step decay
def step_decay(epoch, initial_lr, drop=0.5, epochs_drop=10):
    return initial_lr * (drop ** (epoch // epochs_drop))

# Exponential decay
def exponential_decay(epoch, initial_lr, decay_rate=0.95):
    return initial_lr * (decay_rate ** epoch)

5. Visualize Decision Boundaries

For 2D synthetic data, visualize what the network learns:

import matplotlib.pyplot as plt

def plot_decision_boundary(mlp, X, y, resolution=100):
    x_min, x_max = X[:, 0].min() - 0.5, X[:, 0].max() + 0.5
    y_min, y_max = X[:, 1].min() - 0.5, X[:, 1].max() + 0.5

    xx, yy = np.meshgrid(
        np.linspace(x_min, x_max, resolution),
        np.linspace(y_min, y_max, resolution)
    )

    grid = np.c_[xx.ravel(), yy.ravel()]
    probs = mlp.forward(grid).reshape(xx.shape)

    plt.contourf(xx, yy, probs, levels=50, cmap='RdBu', alpha=0.8)
    plt.scatter(X[:, 0], X[:, 1], c=y.flatten(), cmap='RdBu', edgecolors='black')
    plt.title("Decision Boundary")
    plt.show()

Real-World Connections

FinTech Fraud Detection Systems

How real companies do it:

Feature Engineering: Beyond raw transaction data, companies use:
- Velocity features (transactions per hour/day)
- Behavioral patterns (typical spending categories)
- Device fingerprinting
- Geolocation anomalies
- Network analysis (connected accounts)
Model Architecture:
- Ensemble of models (gradient boosting + neural nets)
- Real-time scoring (<100ms latency requirement)
- Explainability layers (why was this flagged?)
Deployment Considerations:
- Models retrained weekly/monthly (fraud patterns evolve)
- A/B testing new models against production
- Feedback loops from confirmed fraud
- Cost-sensitive learning (missed fraud costs more than false alarms)

Companies using ML for fraud detection:

Stripe Radar: ML-based fraud prevention for payments
PayPal: Real-time risk scoring for transactions
Capital One: Credit card fraud detection
Featurespace: Adaptive behavioral analytics

Beyond Binary Classification

This project teaches fundamentals applicable to:

Anomaly Detection: Autoencoders for unsupervised fraud detection
Sequence Models: RNNs/LSTMs for transaction sequences
Graph Neural Networks: Detecting fraud rings
Federated Learning: Training across banks without sharing data

Books That Will Help

Book	Relevant Chapters	What You’ll Learn
“Neural Networks and Deep Learning” by Michael Nielsen	Ch. 2: “How the backpropagation algorithm works”	Visual, intuitive explanation of backprop. Free online at neuralnetworksanddeeplearning.com
“Deep Learning” by Goodfellow, Bengio, Courville	Ch. 6: “Deep Feedforward Networks”	Mathematical foundation of MLPs, activation functions, loss functions
“Grokking Deep Learning” by Andrew Trask	Ch. 4-7: Gradient descent through backprop	Extremely beginner-friendly, builds everything from scratch
“Hands-On Machine Learning” by Aurélien Géron	Ch. 10: “Introduction to Artificial Neural Networks”	Practical Keras implementation with scikit-learn integration
“Pattern Recognition and Machine Learning” by Bishop	Ch. 5: “Neural Networks”	Rigorous statistical treatment of MLPs

Online Resources

3Blue1Brown Neural Networks Series - Beautiful visual explanations
Andrej Karpathy’s micrograd - Tiny autograd engine (reference for Project 5)
Kaggle Credit Card Fraud Dataset - Real anonymized data

Self-Assessment Checklist

Before considering this project complete, verify you can:

Implementation

Build a Layer class with forward and backward passes
Stack layers into an MLP that trains end-to-end
Implement ReLU and Sigmoid activations with their derivatives
Compute binary cross-entropy loss with class weights
Train using mini-batch gradient descent
Evaluate using Precision, Recall, F1-score (not just accuracy)
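
One way to check the class-weighted loss item is to compare your implementation against a small reference. The sketch below is a minimal NumPy version; the name `weighted_bce` and the single `pos_weight` parameter are assumptions for illustration, and your own Layer/MLP code may structure this differently.

```python
import numpy as np

def weighted_bce(y_true, y_prob, pos_weight=1.0, eps=1e-7):
    """Mean binary cross-entropy with the positive (fraud) class up-weighted."""
    y_true = np.asarray(y_true, dtype=float).ravel()
    y_prob = np.asarray(y_prob, dtype=float).ravel()
    # Clip so log(0) can never produce inf or NaN
    p = np.clip(y_prob, eps, 1.0 - eps)
    per_sample = -(pos_weight * y_true * np.log(p)
                   + (1.0 - y_true) * np.log(1.0 - p))
    return per_sample.mean()
```

With `pos_weight=1.0` this reduces to ordinary binary cross-entropy. A common starting point for `pos_weight` is the ratio of negatives to positives, though the best value usually needs tuning against validation precision and recall.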

Understanding

Explain why XOR cannot be solved with a single layer
Draw the decision boundary of a 2-layer network on paper
Describe how hidden layers transform feature space
Explain the vanishing gradient problem and how ReLU mitigates it
Justify why accuracy is misleading for imbalanced data
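
For the XOR item, a small numerical check can back up the explanation. The sketch below uses a least-squares linear fit as a stand-in for a single-layer model, then shows a one-hidden-layer ReLU network with hand-picked (not learned) weights that reproduces XOR exactly. The specific weights are one illustrative choice among many.

```python
import numpy as np

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 1, 1, 0], dtype=float)

# Best linear fit (with bias) in the least-squares sense
A = np.hstack([X, np.ones((4, 1))])
w, *_ = np.linalg.lstsq(A, y, rcond=None)
linear_pred = A @ w  # every point lands on 0.5: no usable boundary

# One hidden ReLU layer with hand-picked weights (illustrative, not trained)
W1 = np.array([[1.0, 1.0],
               [1.0, 1.0]])
b1 = np.array([0.0, -1.0])
w2 = np.array([1.0, -2.0])
hidden = np.maximum(0.0, X @ W1 + b1)  # h1 = relu(x1 + x2), h2 = relu(x1 + x2 - 1)
mlp_pred = hidden @ w2                 # h1 - 2 * h2 equals XOR on these inputs
```

The linear fit predicts 0.5 for all four inputs, while the hidden layer bends the input space so that a single linear readout separates the classes.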

Debugging

Diagnose why the model predicts all one class
Identify dead ReLU neurons
Fix NaN/Inf in loss
Tune hyperparameters (learning rate, batch size, architecture)
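
To identify dead ReLU neurons, one simple diagnostic is to look at a layer's pre-activation values over a batch and flag units that never go positive. The helper below is a hypothetical sketch; how you collect `pre_activations` depends on what your Layer class caches during the forward pass.

```python
import numpy as np

def dead_relu_fraction(pre_activations):
    """Fraction of units whose ReLU input is <= 0 for every sample.

    pre_activations: array of shape (n_samples, n_units) holding the
    values fed into ReLU for one layer (hypothetical input format).
    """
    z = np.asarray(pre_activations)
    # A unit is "dead" on this batch if it never produces a positive input
    dead = np.all(z <= 0, axis=0)
    return float(dead.mean())
```

A unit that is dead on one batch may still fire on another, so it is worth running this over a large sample. A persistently high fraction often points to a learning rate that is too large or to poor weight initialization.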

Extensions (Choose at least 1)

Add dropout regularization
Implement Adam optimizer
Try SMOTE for oversampling
Visualize decision boundaries on 2D synthetic data
Achieve >85% F1 on the credit card fraud dataset
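
For the Adam extension, the update rule can be sketched as a single function applied to one parameter array. The name `adam_step` and the choice to pass and return the moment estimates explicitly are assumptions for this example; an optimizer class that stores state per parameter would work equally well. The default hyperparameters are the commonly used ones.

```python
import numpy as np

def adam_step(param, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update. Returns the new (param, m, v). The step count t starts at 1."""
    # Exponential moving averages of the gradient and its square
    m = beta1 * m + (1.0 - beta1) * grad
    v = beta2 * v + (1.0 - beta2) * grad ** 2
    # Bias correction for the zero-initialized averages
    m_hat = m / (1.0 - beta1 ** t)
    v_hat = v / (1.0 - beta2 ** t)
    param = param - lr * m_hat / (np.sqrt(v_hat) + eps)
    return param, m, v
```

Initialize `m` and `v` as zero arrays with the same shape as the parameter, and increment `t` once per update. On the very first step the update has magnitude close to `lr` regardless of the gradient's scale, which is part of why Adam tends to be less sensitive to learning-rate choice than plain SGD.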

Key Insights

“Deep” means feature extraction, not just more parameters. Each layer learns increasingly abstract representations. Layer 1 might detect “large amount,” Layer 2 detects “large + unusual time,” and so on. Depth creates a hierarchy of features.

Accuracy is a lie in imbalanced settings. A model that predicts “all negative” achieves 99.8% accuracy on the fraud dataset but catches 0% of fraud. Always use metrics that measure what you care about: catching the minority class.
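
The arithmetic behind this claim can be checked on synthetic labels with the same 0.2% fraud rate. The helper `precision_recall_f1` below is a minimal sketch; returning 0.0 when a metric is undefined (no predicted or no actual positives) is a convention chosen for this example.

```python
import numpy as np

def precision_recall_f1(y_true, y_pred):
    """Precision, recall, and F1 for the positive class (label 1)."""
    y_true = np.asarray(y_true).ravel().astype(bool)
    y_pred = np.asarray(y_pred).ravel().astype(bool)
    tp = np.sum(y_true & y_pred)
    fp = np.sum(~y_true & y_pred)
    fn = np.sum(y_true & ~y_pred)
    # Convention for this sketch: undefined ratios are reported as 0.0
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) > 0 else 0.0)
    return float(precision), float(recall), float(f1)

# Synthetic labels: 1000 transactions, 2 of them fraudulent (0.2%)
y_true = np.zeros(1000, dtype=int)
y_true[:2] = 1
y_pred = np.zeros(1000, dtype=int)  # a model that always says "legitimate"

accuracy = float(np.mean(y_true == y_pred))
precision, recall, f1 = precision_recall_f1(y_true, y_pred)
```

Here `accuracy` is 0.998 while `precision`, `recall`, and `f1` are all 0.0, which is the gap the paragraph above describes.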

Class imbalance is not just a data problem; it’s a loss problem. You often don’t need more data - you need to tell the model that minority class errors hurt more. Class weights are simple and usually effective, and they are worth trying before resampling.
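
A simple way to pick class weights is inverse class frequency. The sketch below uses the formula n_samples / (n_classes * class_count), the same heuristic scikit-learn applies for `class_weight="balanced"`. The function name `balanced_class_weights` is an assumption for this example, and it assumes binary labels with both classes present.

```python
import numpy as np

def balanced_class_weights(y):
    """Inverse-frequency weights for binary labels: index 0 and index 1.

    Assumes both classes appear at least once; otherwise this divides by zero.
    """
    y = np.asarray(y).ravel().astype(int)
    counts = np.bincount(y, minlength=2)
    return len(y) / (2.0 * counts)
```

For 998 legitimate and 2 fraudulent samples this yields roughly 0.5 for the majority class and 250 for the minority class, so each fraud example contributes about 499 times as much to the loss. Weights this large can destabilize training, so treat them as a starting point and consider scaling them down.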

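ReLU enabled deep learning. Before ReLU became the default, training networks more than a few layers deep with sigmoid or tanh activations was difficult because the gradient shrank at every layer. The simple function max(0, x) helps because its derivative is exactly 1 for positive inputs, so gradients pass through active units unscaled.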
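
A back-of-the-envelope calculation shows the size of the effect. The sketch below multiplies activation derivatives across 10 layers, ignoring the weight matrices (which also scale the gradient), so it is an idealized comparison rather than a measurement of a real network.

```python
import numpy as np

def sigmoid_grad(z):
    s = 1.0 / (1.0 + np.exp(-z))
    return s * (1.0 - s)

def relu_grad(z):
    return np.where(np.asarray(z) > 0, 1.0, 0.0)

depth = 10
# Sigmoid's derivative peaks at z = 0, where it equals 0.25
best_case_sigmoid = sigmoid_grad(0.0) ** depth  # 0.25 ** 10, under 1e-6
# An active ReLU unit (z > 0) passes the gradient through unchanged
active_relu = float(relu_grad(1.0) ** depth)    # 1.0 ** 10
```

Even in sigmoid's best case the activation factor falls below one millionth after 10 layers, while a chain of active ReLU units leaves it at 1. The trade-off is that inactive ReLU units contribute a factor of exactly 0, which is the dead-neuron problem from the debugging checklist.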

After completing this project, you will understand WHY neural networks need depth, HOW to handle the class imbalance that plagues real-world data, and WHAT metrics actually matter for production systems. The MLP you build here is a core building block of the models banks use to score millions of transactions daily, typically alongside other models such as gradient boosting.