Project 6: Fraud Detection Neural Net (MLP From Scratch)
Sprint: AI Prediction & Neural Networks - From Math to Machine Focus Area: Multi-Layer Perceptrons and Class Imbalance
Project Metadata
| Attribute | Value |
|---|---|
| Difficulty | Level 3: Advanced |
| Main Programming Language | Python (Using your Autograd or NumPy) |
| Alternative Languages | C, Rust, Julia |
| Coolness Level | Level 3: Genuinely Clever |
| Business Potential | 3. Service & Support (FinTech) |
| Knowledge Area | Multi-Layer Perceptrons (MLP) |
| Software/Tools | NumPy, Matplotlib, Your Autograd Engine (from Project 5) |
| Main Book | "Neural Networks and Deep Learning" Ch. 2 - Michael Nielsen |
| Estimated Time | 1 Week |
| Prerequisites | Project 3 (Linear Regression), Project 5 (Autograd Engine) |
What You Will Build
A fully connected neural network (Multi-Layer Perceptron) that detects fraudulent credit card transactions. Unlike previous projects where data was linearly separable, fraud detection requires learning complex decision boundaries that no single line can capture.
Your MLP will:
- Stack multiple `Layer` objects to create depth
- Use ReLU activation to introduce non-linearity
- Handle extreme class imbalance (99.8% legitimate, 0.2% fraud)
- Implement Stochastic Gradient Descent with mini-batches
- Evaluate using Precision, Recall, and F1-score (not just accuracy!)
This project forces you to confront WHY we need "deep" learning - because the real world is messy, non-linear, and imbalanced.
Learning Objectives
By completing this project, you will:
- Implement the `Layer` class - Build a reusable abstraction for fully connected layers with weights, biases, and activations
- Stack layers into an `MLP` class - Compose multiple layers into a network that performs forward and backward passes automatically
- Understand why depth matters - Prove to yourself that 1 layer cannot solve non-linear problems, but 2+ layers can
- Master ReLU activation - Implement the activation that solved the vanishing gradient problem and enabled deep learning
- Handle class imbalance correctly - Learn why 99% accuracy can mean 0% utility, and how to fix it with class weights and sampling
- Implement mini-batch SGD - Train efficiently by processing data in small batches rather than one sample or all at once
- Evaluate with real metrics - Use confusion matrices, precision, recall, and F1 to measure what actually matters
The Core Question You're Answering
"Why do we need 'Deep' learning?"
A single neuron draws a line. A single layer of neurons draws multiple lines. But no matter how many lines you draw, you cannot circle a cluster of points - you cannot learn "shapes."
Consider the XOR problem: inputs (0,0) and (1,1) produce output 0, while (0,1) and (1,0) produce output 1. No single straight line can separate these. You need to fold the space - to transform the inputs so that what was inseparable becomes separable.
This is what hidden layers do. They learn transformations. The first layer might learn "are both inputs similar?" and "are both inputs different?" The second layer can then draw a simple line in this new feature space.
Fraud detection is the same. A fraud transaction might look legitimate on any single feature. But the combination of features - high amount, late night, foreign country, new card - creates a pattern that a deep network can learn to recognize as a "suspicious shape" in high-dimensional space.
When you build this MLP, you will see the magic: adding a single hidden layer transforms an impossible problem into a solvable one.
Concepts You Must Understand First
Before writing code, ensure you have solid grounding in these foundational concepts:
1. Why Single Layers Cannot Solve Non-Linear Problems (XOR)
The XOR problem proves the limitations of single-layer networks:
XOR Truth Table:
Input A Input B Output
0 0 0
0 1 1
1 0 1
1 1 0
Plotting in 2D space:
B
|
1 + X O
|
0 + O X
+----+----+
0 1 A
O = Output 0
X = Output 1
No single line can separate the X's from the O's!
What happens with a single neuron:
- A single neuron computes: output = sign(w1*A + w2*B + bias)
- This equation describes a line: w1*A + w2*B + bias = 0
- Points on one side of the line output 1, the other side output 0
- XOR requires a non-linear decision boundary - impossible with one line
The solution: Add a hidden layer
- Hidden neurons transform the input space
- The output layer then operates on this transformed space
- In the new space, the problem becomes linearly separable
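To see this concretely, here is a minimal sketch with hand-picked (not learned) weights showing that one hidden ReLU layer makes XOR linearly separable; the specific values are illustrative and many other solutions exist:

import numpy as np

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)  # XOR inputs

# Hidden layer (hand-picked weights, purely illustrative):
#   h1 = ReLU(a + b)       counts how many inputs are on
#   h2 = ReLU(a + b - 1)   fires only when BOTH inputs are on
W1 = np.array([[1.0, 1.0],
               [1.0, 1.0]])
b1 = np.array([0.0, -1.0])
H = np.maximum(0.0, X @ W1 + b1)

# Output layer: y = h1 - 2*h2 is a plain LINEAR readout of the transformed space
W2 = np.array([[1.0], [-2.0]])
y = H @ W2
print(y.ravel())  # [0. 1. 1. 0.]  -> exactly XOR, using a single hidden layer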
2. Universal Approximation Theorem (Intuition)
"A neural network with a single hidden layer containing a finite number of neurons can approximate any continuous function on compact subsets of R^n, under mild assumptions on the activation function."
What this means in plain English:
- Given enough hidden neurons, a 2-layer network can learn ANY pattern
- It's like having infinite LEGO bricks - you can build any shape
- BUT: "can approximate" doesn't mean "will learn efficiently"
- Deeper networks learn hierarchical features more naturally
One hidden layer (wide and shallow):
Input → [1000 neurons] → Output
Can approximate anything but may need exponentially many neurons
Multiple hidden layers (narrow and deep):
Input → [16] → [16] → [8] → Output
Learns hierarchical features efficiently:
Layer 1: Basic patterns (edges, thresholds)
Layer 2: Combinations of patterns
Layer 3: High-level concepts
3. ReLU vs Sigmoid vs Tanh Trade-offs
SIGMOID: f(x) = 1 / (1 + e^(-x))
(Plot: S-shaped curve rising from 0 to 1, crossing 0.5 at x = 0, nearly flat for |x| > 4)
Pros: Smooth, bounded [0,1], good for output probabilities
Cons: VANISHING GRADIENT! Derivative ≈ 0 for large |x|
      Max derivative = 0.25 at x=0
      Slow training, gradients disappear in deep networks
TANH: f(x) = (e^x - e^(-x)) / (e^x + e^(-x))
(Plot: S-shaped curve rising from -1 to +1, crossing 0 at x = 0, nearly flat for |x| > 3)
Pros: Zero-centered (unlike sigmoid), stronger gradients
Cons: Still saturates! Vanishing gradient for large |x|
RELU: f(x) = max(0, x)
(Plot: flat at 0 for x < 0, then the straight line y = x for x > 0)
Pros: NO vanishing gradient for positive inputs!
Derivative = 1 for x > 0 (gradients flow freely)
Computationally simple: just max(0, x)
Sparse activation (some neurons output 0)
Cons: "Dead neurons" - if always negative, gradient = 0 forever
Leaky ReLU fixes this: f(x) = max(0.01*x, x)
Why ReLU enabled deep learning:
- Before ReLU: training networks with 5+ layers was nearly impossible
- Sigmoids squash gradients: 0.25^10 ≈ 0.000001 (vanished!)
- ReLU: 1^10 = 1 (gradients flow)
- This is why "deep" learning became possible in the 2010s
4. Class Imbalance and Its Dangers
CREDIT CARD FRAUD: The Imbalance Problem
---------------------------------------------------------------------
Total Transactions: 284,807 (real Kaggle dataset)
Legitimate (Class 0): 284,315 (99.83%)
Fraudulent (Class 1): 492 (0.17%)
Legitimate  ########################################  99.83%
Fraudulent  #                                          0.17%
THE LAZY MODEL PROBLEM:
โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
A model that ALWAYS predicts "Legitimate" achieves:
Accuracy = 284,315 / 284,807 = 99.83%
This is TERRIBLE! It catches 0% of fraud!
The bank loses money on every fraudulent transaction it misses.
A 99.83% accuracy model is WORTHLESS for fraud detection.
Why standard accuracy fails:
- Accuracy = (Correct Predictions) / (Total Predictions)
- With 99.83% legitimate, guessing "all legitimate" gives 99.83% accuracy
- The model never learns to detect the minority class
- It takes the path of least resistance: predict the majority
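A two-line check makes the lazy-model problem concrete; the counts below are just the class totals quoted above:

import numpy as np

n_legit, n_fraud = 284_315, 492
y_true = np.concatenate([np.zeros(n_legit), np.ones(n_fraud)])
y_pred = np.zeros_like(y_true)        # the lazy model: always predict "legitimate"

accuracy = np.mean(y_pred == y_true)  # ~0.9983 -> looks great
recall = y_pred[y_true == 1].mean()   # 0.0     -> catches zero fraud
print(f"accuracy={accuracy:.4f}, recall={recall:.1f}")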
5. Precision, Recall, F1, Confusion Matrix
CONFUSION MATRIX
---------------------------------------------------------------------
                            PREDICTED
                      Legitimate        Fraud
ACTUAL Legitimate     TN = 284,000      FP = 315      True Neg / False Pos
ACTUAL Fraud          FN = 42           TP = 450      False Neg / True Pos
TN (True Negative): Correctly predicted Legitimate
FP (False Positive): Predicted Fraud, but was Legitimate (annoys customer)
FN (False Negative): Predicted Legitimate, but was Fraud (MONEY LOST!)
TP (True Positive): Correctly predicted Fraud (MONEY SAVED!)
PRECISION: Of all predicted fraud, how many were actually fraud?
โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
Precision = TP / (TP + FP) = 450 / (450 + 315) = 0.588 = 58.8%
"When we flag something as fraud, we're right 58.8% of the time"
Low precision = Many false alarms (customer complaints)
RECALL (Sensitivity): Of all actual fraud, how many did we catch?
โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
Recall = TP / (TP + FN) = 450 / (450 + 42) = 0.915 = 91.5%
"We catch 91.5% of all fraud"
Low recall = Missing fraudulent transactions (bank loses money)
F1-SCORE: Harmonic mean of Precision and Recall
โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
F1 = 2 * (Precision * Recall) / (Precision + Recall)
F1 = 2 * (0.588 * 0.915) / (0.588 + 0.915) = 0.716 = 71.6%
Why harmonic mean? Penalizes extremes.
If Precision=1.0 and Recall=0.0, F1=0 (not 0.5!)
You can't game F1 by ignoring one metric.
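The short sketch below just re-derives those numbers from the confusion-matrix counts above, so you can sanity-check your own evaluate() implementation later:

tp, fp, fn = 450, 315, 42                            # counts from the confusion matrix above

precision = tp / (tp + fp)                           # 0.588
recall = tp / (tp + fn)                              # 0.915
f1 = 2 * precision * recall / (precision + recall)   # 0.716
print(f"precision={precision:.3f}, recall={recall:.3f}, f1={f1:.3f}")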
6. Batch vs SGD vs Mini-batch
GRADIENT DESCENT VARIANTS
---------------------------------------------------------------------
BATCH (Full) Gradient Descent:
- Use ALL samples to compute gradient
- One update per epoch
- Gradient is exact average over entire dataset
Dataset: [########################################]
  -> Compute loss for all samples
  -> Compute gradient (average over all)
  -> Single weight update
Pros: Stable, consistent direction
Cons: SLOW! Memory intensive. Gets stuck in sharp minima.
STOCHASTIC Gradient Descent (SGD):
- Use ONE sample at a time
- N updates per epoch (N = dataset size)
- Gradient is noisy estimate
Dataset: [#|#|#|#|#|#|#|#|#|#|#|#|#|#|#|#|#|#|#|#]
  -> Update weights after each sample
Pros: Fast updates. Noise helps escape local minima.
Cons: Very noisy! Oscillates around minimum.
MINI-BATCH Gradient Descent (The Winner):
- Use B samples at a time (e.g., B=32)
- N/B updates per epoch
- Gradient is average over mini-batch
Dataset: [####|####|####|####|####|####|####|####]
  -> Update weights after each mini-batch
Pros: Best of both worlds!
- Some noise (helps generalization)
- Vectorized computation (fast on GPU)
- Manageable memory usage
Cons: Introduces hyperparameter B (batch size)
COMMON BATCH SIZES:
---------------------------------------------------------------------
Batch Size      Use Case
------------------------------
32              Standard starting point
64-128          Common for image classification
256-1024        Large datasets, powerful GPUs
1-4             Extreme memory constraints
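A minimal sketch of how the three variants differ - only in how you slice the data per update (the array sizes here are made up for illustration):

import numpy as np

X, y = np.random.randn(1000, 30), np.random.randint(0, 2, (1000, 1))

def batches(X, y, batch_size):
    """Yield shuffled batches; batch_size=len(X) gives full-batch GD, 1 gives pure SGD."""
    idx = np.random.permutation(len(X))
    for start in range(0, len(X), batch_size):
        take = idx[start:start + batch_size]
        yield X[take], y[take]

# Full batch: 1 update/epoch, SGD: 1000 updates/epoch, mini-batch (B=32): 32 updates/epoch
for B in (len(X), 1, 32):
    print(f"B={B}: {sum(1 for _ in batches(X, y, B))} updates per epoch")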
Deep Theoretical Foundation
Hidden Layers as Feature Extractors
Think of each hidden layer as learning a new "language" to describe the data:
INPUT LAYER: Raw Features
โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
Transaction Amount: $5000
Time: 02:34:17 (late night)
Location: Nigeria (IP-based)
Card Age: 2 days
Merchant Category: Electronics
V1-V28: PCA-transformed features
These are just numbers. No meaning yet.
HIDDEN LAYER 1: Basic Pattern Detectors
โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
Neuron 1: "Is this a large amount?" (High amount = high activation)
Neuron 2: "Is this at an unusual time?" (2-5 AM = high activation)
Neuron 3: "Is this a high-risk country?"
Neuron 4: "Is this a new card?"
Neuron 5: "Is this a high-risk merchant category?"
...
The layer learns THRESHOLDS - when does "large" become suspicious?
HIDDEN LAYER 2: Pattern Combinations
โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
Neuron 1: "Large amount + Unusual time" (both trigger = high activation)
Neuron 2: "New card + High-risk country"
Neuron 3: "Electronics + Late night + Large amount"
...
The layer learns COMBINATIONS that are suspicious together
OUTPUT LAYER: Final Decision
โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
Combines all the learned patterns into a single probability:
P(Fraud) = 0.97
If the "Large + Late + New Card + Electronics" pattern fires strongly,
the output is high, regardless of which individual features triggered it.
Why Depth Helps: Hierarchical Representations
DEPTH = ABSTRACTION LEVELS
โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
2 Layers (Shallow):
Input → [Simple Patterns] → Output
Must learn: "If (A AND B) OR (C AND D) OR (E AND F AND G) โ Fraud"
Each hidden neuron must capture one full rule
4 Layers (Deep):
Input → [Primitives] → [Combinations] → [Complex Rules] → Output
Layer 1: "Is A high?", "Is B unusual?", etc.
Layer 2: "A AND B together", "C AND D together"
Layer 3: "(A AND B) combined with (C AND D)"
Layer 4: Final decision
Each layer builds on the previous, like LEGO
ANALOGY: Language
โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
Letters → Words → Phrases → Sentences → Paragraphs → Meaning
You don't learn "when these 500 letters appear in this order, it's spam"
You learn: letters → words → "cheap meds" → spam
Deep networks learn hierarchical features naturally.
Shallow networks must memorize everything at once.
ReLU: Solving the Vanishing Gradient Problem
THE VANISHING GRADIENT DISASTER
---------------------------------------------------------------------
Sigmoid derivative: σ'(x) = σ(x) * (1 - σ(x))
Maximum value: σ'(0) = 0.25
In backpropagation, gradients MULTIPLY through layers:
Layer 5: gradient = 0.25
Layer 4: gradient = 0.25 * 0.25 = 0.0625
Layer 3: gradient = 0.25^3 = 0.0156
Layer 2: gradient = 0.25^4 = 0.0039
Layer 1: gradient = 0.25^5 = 0.00097
By Layer 1, the gradient is 0.1% of what it was!
The early layers learn NOTHING. Training stalls.
RELU TO THE RESCUE
โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
ReLU derivative:
if x > 0: derivative = 1
if x <= 0: derivative = 0
In backpropagation:
Layer 5: gradient = 1.0 (if active)
Layer 4: gradient = 1.0 * 1.0 = 1.0
Layer 3: gradient = 1.0^3 = 1.0
Layer 2: gradient = 1.0^4 = 1.0
Layer 1: gradient = 1.0^5 = 1.0
Gradients flow unchanged! Deep learning becomes possible.
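A quick numerical sketch of the two scenarios above, using sigmoid's best case (its maximum derivative of 0.25) versus an always-active ReLU path:

depth = 10
sigmoid_chain = 0.25 ** depth   # every sigmoid at its maximum derivative
relu_chain = 1.0 ** depth       # every ReLU active

print(f"sigmoid gradient factor after {depth} layers: {sigmoid_chain:.1e}")  # ~9.5e-07
print(f"relu    gradient factor after {depth} layers: {relu_chain:.1f}")     # 1.0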
DEAD NEURON PROBLEM
โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
If ReLU input is ALWAYS negative:
output = 0
gradient = 0
weights never update
Neuron is "dead" forever
This happens when:
- Learning rate too high (weights become very negative)
- Poor initialization
Solutions:
1. Careful weight initialization (He initialization)
2. Leaky ReLU: f(x) = max(0.01*x, x)
3. PReLU: f(x) = max(α*x, x) where α is learned
Weight Initialization Strategies
WHY INITIALIZATION MATTERS
โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
Weights too small:
Signals shrink as they pass through layers
Output → 0, gradients → 0
Weights too large:
Signals explode as they pass through layers
Output → ∞, gradients → ∞ (NaN errors)
XAVIER/GLOROT INITIALIZATION (for sigmoid/tanh)
โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
W ~ Uniform(-sqrt(6/(n_in + n_out)), +sqrt(6/(n_in + n_out)))
or
W ~ Normal(0, sqrt(2/(n_in + n_out)))
Keeps variance of activations consistent across layers.
HE INITIALIZATION (for ReLU) - YOU SHOULD USE THIS
โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
W ~ Normal(0, sqrt(2/n_in))
Why different? ReLU zeroes out half the neurons on average.
To maintain variance, we need 2x larger initial weights.
In code:
weights = np.random.randn(n_in, n_out) * np.sqrt(2.0 / n_in)
BIAS INITIALIZATION
โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
Initialize to 0. Or small positive value (0.01) for ReLU to prevent
dead neurons at initialization.
biases = np.zeros(n_out) # Simple and works
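A small sketch of why the sqrt(2/n_in) factor matters: push random inputs through a stack of ReLU layers and compare He initialization against a naive small constant. The depth and widths below are arbitrary choices for illustration:

import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((256, 30))

for name, scale in [("He (sqrt(2/n_in))", None), ("naive (0.01)", 0.01)]:
    A = X
    for _ in range(10):                      # 10 hidden layers of 64 units
        n_in = A.shape[1]
        s = np.sqrt(2.0 / n_in) if scale is None else scale
        W = rng.standard_normal((n_in, 64)) * s
        A = np.maximum(0, A @ W)             # linear + ReLU, no biases for simplicity
    print(f"{name}: activation std after 10 layers = {A.std():.6f}")
# He keeps the activation scale roughly constant; the naive init shrinks it toward zero.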
Handling Class Imbalance
METHOD 1: CLASS WEIGHTS
โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
Multiply the loss for each class by a weight.
Minority class gets higher weight โ its errors hurt more.
Weight formula:
w_class = total_samples / (n_classes * samples_in_class)
Example:
Total: 1000 samples
Class 0 (legitimate): 990 samples
Class 1 (fraud): 10 samples
w_0 = 1000 / (2 * 990) = 0.505
w_1 = 1000 / (2 * 10) = 50.0
Fraud errors are penalized 100x more than legitimate errors.
In code:
loss = -mean( y * class_weight_1 * log(y_pred) +
              (1-y) * class_weight_0 * log(1 - y_pred) )
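Here is a minimal NumPy version of that weighted loss (a sketch; the train_step in Phase 6 applies the same idea):

import numpy as np

def weighted_bce(y_true, y_pred, w0, w1, eps=1e-7):
    """Binary cross-entropy where each sample is scaled by its class weight."""
    y_pred = np.clip(y_pred, eps, 1 - eps)          # avoid log(0)
    sample_w = np.where(y_true == 1, w1, w0)
    return -np.mean(sample_w * (y_true * np.log(y_pred) +
                                (1 - y_true) * np.log(1 - y_pred)))

y_true = np.array([0, 0, 0, 1])
y_pred = np.array([0.1, 0.2, 0.1, 0.3])
print(weighted_bce(y_true, y_pred, w0=0.505, w1=50.0))  # the single fraud mistake dominates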
METHOD 2: OVERSAMPLING (SMOTE)
โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
Create synthetic minority samples by interpolating between existing ones.
SMOTE Algorithm:
1. For each minority sample x:
2. Find k nearest minority neighbors
3. Pick one neighbor x_n randomly
4. Create synthetic: x_new = x + random(0,1) * (x_n - x)
Before SMOTE:
Class 0: ######################################  990
Class 1: #                                        10
After SMOTE:
Class 0: ######################################  990
Class 1: ###################################      900 (synthetic)
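A minimal SMOTE-style oversampler following the four steps above (a sketch, not the full SMOTE algorithm; k and the sample counts are illustrative):

import numpy as np

def smote_like(X_min, n_new, k=5, seed=0):
    """Create n_new synthetic minority samples by interpolating toward nearest neighbors."""
    rng = np.random.default_rng(seed)
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))
        x = X_min[i]
        d = np.linalg.norm(X_min - x, axis=1)            # distances to all minority samples
        neighbors = np.argsort(d)[1:k + 1]               # k nearest, excluding x itself
        x_n = X_min[rng.choice(neighbors)]
        synthetic.append(x + rng.random() * (x_n - x))   # interpolate between x and x_n
    return np.array(synthetic)

X_fraud = np.random.randn(10, 30)           # pretend these are the 10 fraud rows
print(smote_like(X_fraud, n_new=90).shape)  # (90, 30) synthetic rows to add back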
METHOD 3: UNDERSAMPLING
โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
Randomly remove majority class samples to balance.
Before:
Class 0: ######################################  990
Class 1: #                                        10
After random undersampling:
Class 0: #   10 (kept)
Class 1: #   10
Problem: Throws away 98% of data!
Use only if you have TONS of data.
METHOD 4: THRESHOLD ADJUSTMENT
โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
Instead of:  y_pred > 0.5 → Fraud
Use:         y_pred > 0.1 → Fraud
This catches more fraud (higher recall) at cost of more false positives.
Tune threshold based on business requirements:
- Banks may prefer low threshold (catch all fraud, accept false alarms)
- Customers may prefer higher threshold (fewer card declines)
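A small sketch of how you might sweep thresholds on a validation set and read off the trade-off (mlp, X_val, and y_val stand in for your own trained model and data):

import numpy as np

def sweep_thresholds(mlp, X_val, y_val, thresholds=(0.1, 0.3, 0.5, 0.7, 0.9)):
    """Print precision/recall on the fraud class at several decision thresholds."""
    probs = mlp.forward(X_val).flatten()
    y = y_val.flatten().astype(int)
    for t in thresholds:
        pred = (probs >= t).astype(int)
        tp = np.sum((pred == 1) & (y == 1))
        fp = np.sum((pred == 1) & (y == 0))
        fn = np.sum((pred == 0) & (y == 1))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        print(f"threshold={t:.1f}  precision={precision:.3f}  recall={recall:.3f}")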
The Forward and Backward Pass Through Multiple Layers
FORWARD PASS: Input → Output
---------------------------------------------------------------------
Input X (shape: batch_size × n_features)
    |
    v
Layer 1: Linear + ReLU
    Z1 = X @ W1 + b1        <- Pre-activation
    A1 = ReLU(Z1)           <- Post-activation
    |
    v
Layer 2: Linear + ReLU
    Z2 = A1 @ W2 + b2
    A2 = ReLU(Z2)
    |
    v
Output Layer: Linear + Sigmoid
    Z3 = A2 @ W3 + b3
    A3 = Sigmoid(Z3)        <- Probability output
    |
    v
Y_pred (probability of fraud)
BACKWARD PASS: Output → Input (Gradients)
---------------------------------------------------------------------
Loss = Binary Cross-Entropy(Y_true, Y_pred)
    |
    v
Output layer:
    dL/dZ3 = A3 - Y                  <- Gradient of loss w.r.t. output pre-activation
    dL/dW3 = A2.T @ dL/dZ3           <- Gradient for weights
    dL/db3 = sum(dL/dZ3, axis=0)     <- Gradient for biases
    dL/dA2 = dL/dZ3 @ W3.T           <- Pass gradient backward
    |
    v
Layer 2:
    dL/dZ2 = dL/dA2 * ReLU'(Z2)      <- Apply ReLU derivative
    dL/dW2 = A1.T @ dL/dZ2
    dL/db2 = sum(dL/dZ2, axis=0)
    dL/dA1 = dL/dZ2 @ W2.T
    |
    v
Layer 1:
    dL/dZ1 = dL/dA1 * ReLU'(Z1)
    dL/dW1 = X.T @ dL/dZ1
    dL/db1 = sum(dL/dZ1, axis=0)
Where:
ReLU'(Z) = 1 if Z > 0, else 0
@ = matrix multiplication
.T = transpose
Batch Size Effects on Convergence
BATCH SIZE SPECTRUM
---------------------------------------------------------------------
Batch Size = 1 (Pure SGD)
Loss landscape trajectory: noisy zig-zag toward the minimum
Very noisy but escapes local minima
Update frequency: Every sample
Gradient variance: HIGH
Generalization: Good (noise regularizes)
Training speed: Slow (no parallelism)
Batch Size = 32 (Common Choice)
Loss landscape trajectory: small wobbles toward the minimum
Some noise, mostly consistent direction
Update frequency: Every 32 samples
Gradient variance: Moderate
Generalization: Good
Training speed: Fast (vectorized)
Batch Size = Full Dataset (Batch GD)
Loss landscape trajectory: smooth straight path to the minimum
Smooth, deterministic path
Update frequency: Once per epoch
Gradient variance: Zero
Generalization: Worse (may overfit)
Training speed: Very slow (no frequent updates)
EMPIRICAL FINDINGS:
---------------------------------------------------------------------
- Batch size 32-256 works well for most problems
- Larger batches need larger learning rates
- Larger batches → sharper minima → worse generalization
- For imbalanced data: ensure each batch has minority samples!
Your fraud detector: Use batch_size=64 and stratified sampling
to ensure each batch contains ~1% fraud samples.
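One simple way to get that stratification is sketched below: force a fixed number of fraud rows into every batch (the one-per-64 ratio is an illustrative choice, not a rule):

import numpy as np

def stratified_batches(X, y, batch_size=64, n_fraud_per_batch=1, seed=0):
    """Yield mini-batches that each contain at least n_fraud_per_batch fraud samples."""
    rng = np.random.default_rng(seed)
    pos = np.where(y.flatten() == 1)[0]
    neg = np.where(y.flatten() == 0)[0]
    rng.shuffle(neg)
    n_neg = batch_size - n_fraud_per_batch
    for start in range(0, len(neg), n_neg):
        batch_neg = neg[start:start + n_neg]
        batch_pos = rng.choice(pos, size=n_fraud_per_batch, replace=True)
        idx = np.concatenate([batch_neg, batch_pos])
        rng.shuffle(idx)
        yield X[idx], y[idx]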
Real World Outcome
When you run your fraud detector, you will see output like this:
$ python train_fraud.py --data creditcard.csv
============================================================
Fraud Detection MLP - Training
============================================================
Loading data: creditcard.csv
Total samples: 284,807
Legitimate (0): 284,315 (99.83%)
Fraudulent (1): 492 (0.17%)
Class imbalance ratio: 578:1
Applying class weights: {0: 0.50, 1: 289.07}
Network Architecture:
Input Layer: 30 features
Hidden Layer 1: 16 neurons (ReLU)
Hidden Layer 2: 16 neurons (ReLU)
Output Layer: 1 neuron (Sigmoid)
Total Parameters: 785
Training Configuration:
Optimizer: Mini-batch SGD
Learning Rate: 0.01
Batch Size: 64
Epochs: 50
------------------------------------------------------------
Training Progress:
------------------------------------------------------------
Epoch 1/50:
Loss: 0.6931 | Acc: 99.0% | Precision: 0.00 | Recall: 0.00 | F1: 0.00
WARNING: Model predicting all legitimate (Recall = 0%)!
This happens early - class weights will fix it.
Epoch 5/50:
Loss: 0.2847 | Acc: 98.2% | Precision: 0.32 | Recall: 0.58 | F1: 0.41
✓ Model starting to detect fraud
Epoch 10/50:
Loss: 0.1234 | Acc: 99.1% | Precision: 0.56 | Recall: 0.72 | F1: 0.63
Epoch 25/50:
Loss: 0.0523 | Acc: 99.6% | Precision: 0.78 | Recall: 0.85 | F1: 0.81
Epoch 50/50:
Loss: 0.0312 | Acc: 99.8% | Precision: 0.86 | Recall: 0.92 | F1: 0.89
✓ Training complete!
============================================================
Final Evaluation (Test Set)
============================================================
Confusion Matrix:
Predicted
Neg Pos
Actual Neg 56,824 40
Actual Pos 5 93
Metrics:
Accuracy: 99.92%
Precision: 69.92% (When we flag fraud, we're right 70% of the time)
Recall: 94.90% (We catch 95% of all fraud!)
F1-Score: 80.52%
Business Impact:
Fraud Caught: 93 of 98 transactions ($186,000 saved)
Fraud Missed: 5 transactions ($8,500 lost)
False Alarms: 40 customers temporarily inconvenienced
============================================================
$ python predict.py --transaction "Time=0,Amount=5000,V1=-1.36,..."
TRANSACTION ANALYSIS
---------------------------------------------------------------------
Input Features:
    Time: 0 seconds
    Amount: $5,000.00
    V1-V28: [PCA components shown]
Network Activations:
    Hidden Layer 1: [0.0, 2.3, 0.0, 1.8, 0.0, 4.2, ...]
    Hidden Layer 2: [1.2, 0.0, 3.1, 0.0, 2.8, 0.0, ...]
PREDICTION: FRAUD
Probability: 0.983 (98.3% confidence)
Recommendation: BLOCK TRANSACTION
Solution Architecture
Class Design
MLP ARCHITECTURE
---------------------------------------------------------------------
MLP Class
  Attributes:
    layers: List[Layer]            # Stack of Layer objects
    loss_fn: Callable              # Binary Cross-Entropy
    optimizer: Optimizer           # SGD with learning rate
  Methods:
    forward(X) -> Y_pred           # Propagate input through all layers
    backward(Y_true, Y_pred)       # Compute gradients for all layers
    train_step(X_batch, Y_batch)   # One forward-backward-update cycle
    fit(X, Y, epochs, batch_size)  # Full training loop
    predict(X) -> Y_pred           # Inference only (no gradients)
    evaluate(X, Y) -> metrics      # Compute Precision, Recall, F1
      |
      | contains
      v
Layer Class
  Attributes:
    weights: np.ndarray            # Shape: (n_in, n_out)
    biases: np.ndarray             # Shape: (n_out,)
    activation: str                # "relu", "sigmoid", or None
    input_cache: np.ndarray        # Cached for backprop: input received during forward
    z_cache: np.ndarray            # Cached for backprop: pre-activation (before ReLU/Sigmoid)
  Methods:
    forward(X) -> A                # Z = X @ W + b; A = activation(Z)
    backward(dA) -> dX             # Compute dW, db, and return dX
    update(lr)                     # W -= lr * dW; b -= lr * db
      |
      | uses
      v
Activation Functions
  relu(Z) = max(0, Z)
  relu_derivative(Z) = (Z > 0).astype(float)
  sigmoid(Z) = 1 / (1 + exp(-Z))
  sigmoid_derivative(Z) = sigmoid(Z) * (1 - sigmoid(Z))
      |
      | outputs to
      v
Loss Function
  Binary Cross-Entropy (with class weights):
    L = -1/N * Σ [ w1 * y * log(ŷ) + w0 * (1-y) * log(1-ŷ) ]
  Gradient:
    dL/dŷ = -w1 * y/ŷ + w0 * (1-y)/(1-ŷ)
  For a sigmoid output this simplifies to:
    dL/dZ = w_class * (ŷ - y)   (just ŷ - y in the unweighted case)
Data Flow Diagram
COMPLETE DATA FLOW
---------------------------------------------------------------------
Training Loop
    |
    v
Load Data (creditcard.csv)  ->  Preprocess (Normalize)  ->  Split (Train/Test)
    |
    v
For Each Epoch:
    For Each Mini-Batch:
        X_batch --> Forward Pass --> Y_pred
        Y_pred  --> Compute Loss (Weighted)
        Loss    --> Backward Pass
        Backward Pass --> Update Weights (W -= lr*dW)
    Evaluate on Validation Set
        Log: Loss, Accuracy, Precision, Recall, F1
    |
    v
Final Evaluation on Test Set
    - Confusion Matrix
    - Precision, Recall, F1-Score
    - ROC-AUC Curve
Phased Implementation Guide
Phase 1: Layer Class with Weights and Biases (Day 1)
Goal: Create the fundamental building block
import numpy as np
class Layer:
"""A single fully-connected layer with optional activation."""
def __init__(self, n_in: int, n_out: int, activation: str = None):
"""
Initialize layer with He initialization for weights.
Args:
n_in: Number of input features
n_out: Number of output neurons
activation: "relu", "sigmoid", or None for linear
"""
# He initialization (good for ReLU)
self.weights = np.random.randn(n_in, n_out) * np.sqrt(2.0 / n_in)
self.biases = np.zeros(n_out)
self.activation = activation
# Cache for backpropagation
self.input_cache = None
self.z_cache = None # Pre-activation values
# Gradient storage
self.dW = None
self.db = None
Checkpoint: Verify weights have correct shape, biases are zeros, activation is stored.
Phase 2: Forward Pass Through Layer (Day 1)
Goal: Implement the forward computation
def forward(self, X: np.ndarray) -> np.ndarray:
"""
Forward pass: Z = X @ W + b, then apply activation.
Args:
X: Input array, shape (batch_size, n_in)
Returns:
A: Output array, shape (batch_size, n_out)
"""
# Cache input for backprop
self.input_cache = X
# Linear transformation
Z = X @ self.weights + self.biases
self.z_cache = Z
# Apply activation
if self.activation == "relu":
A = np.maximum(0, Z)
elif self.activation == "sigmoid":
A = 1 / (1 + np.exp(-np.clip(Z, -500, 500))) # Clip for stability
else:
A = Z # Linear/no activation
return A
Checkpoint: Test with random input, verify output shape is (batch_size, n_out).
Phase 3: ReLU Activation and Derivative (Day 2)
Goal: Implement ReLU properly with its derivative
def relu(Z: np.ndarray) -> np.ndarray:
"""ReLU activation: max(0, x)"""
return np.maximum(0, Z)
def relu_derivative(Z: np.ndarray) -> np.ndarray:
"""
ReLU derivative: 1 if x > 0, else 0
Note: Derivative at exactly 0 is undefined, but we use 0.
"""
return (Z > 0).astype(float)
def sigmoid(Z: np.ndarray) -> np.ndarray:
"""Sigmoid activation: 1 / (1 + e^-x)"""
# Clip to prevent overflow
Z = np.clip(Z, -500, 500)
return 1 / (1 + np.exp(-Z))
def sigmoid_derivative(Z: np.ndarray) -> np.ndarray:
"""Sigmoid derivative: sigmoid(x) * (1 - sigmoid(x))"""
s = sigmoid(Z)
return s * (1 - s)
Checkpoint: Test that relu(np.array([-1, 0, 1])) = [0, 0, 1].
Phase 4: MLP Class Stacking Layers (Day 2-3)
Goal: Create the network container
class MLP:
"""Multi-Layer Perceptron for binary classification."""
def __init__(self, layer_sizes: list, activations: list = None):
"""
Initialize MLP with specified architecture.
Args:
layer_sizes: [input_size, hidden1_size, ..., output_size]
activations: ["relu", "relu", ..., "sigmoid"] per layer
Example:
MLP([30, 16, 16, 1], ["relu", "relu", "sigmoid"])
"""
if activations is None:
activations = ["relu"] * (len(layer_sizes) - 2) + ["sigmoid"]
self.layers = []
for i in range(len(layer_sizes) - 1):
layer = Layer(
n_in=layer_sizes[i],
n_out=layer_sizes[i + 1],
activation=activations[i]
)
self.layers.append(layer)
def forward(self, X: np.ndarray) -> np.ndarray:
"""Forward pass through all layers."""
A = X
for layer in self.layers:
A = layer.forward(A)
return A
def predict(self, X: np.ndarray, threshold: float = 0.5) -> np.ndarray:
"""Return binary predictions."""
probs = self.forward(X)
return (probs >= threshold).astype(int)
Checkpoint: Create MLP([30, 16, 16, 1]), forward pass with random input, verify output shape.
Phase 5: Backward Pass (Day 3-4)
Goal: Implement backpropagation through all layers
def backward(self, dA: np.ndarray) -> np.ndarray:
"""
Backward pass for a single layer.
Args:
dA: Gradient of loss w.r.t. this layer's output
Returns:
dX: Gradient of loss w.r.t. this layer's input
"""
m = dA.shape[0] # Batch size
# Compute dZ based on activation
if self.activation == "relu":
dZ = dA * relu_derivative(self.z_cache)
elif self.activation == "sigmoid":
dZ = dA * sigmoid_derivative(self.z_cache)
else:
dZ = dA # Linear
# Compute gradients for weights and biases
self.dW = (1/m) * (self.input_cache.T @ dZ)
self.db = (1/m) * np.sum(dZ, axis=0)
# Compute gradient for previous layer
dX = dZ @ self.weights.T
return dX
# In MLP class:
def backward(self, y_true: np.ndarray, y_pred: np.ndarray,
             class_weights: dict = None):
    """
    Full backward pass through all layers.
    Args:
        y_true: Ground truth labels, shape (batch_size, 1)
        y_pred: Predicted probabilities, shape (batch_size, 1)
        class_weights: {0: weight_0, 1: weight_1} for imbalance
    """
    m = y_true.shape[0]
    # For a sigmoid output with BCE loss, the gradient simplifies to:
    # dL/dZ = (y_pred - y_true), scaled by each sample's class weight.
    dZ = y_pred - y_true
    if class_weights:
        weights = np.where(y_true == 1, class_weights[1], class_weights[0])
        dZ = dZ * weights
    # dZ already folds in the sigmoid derivative (that IS the simplification),
    # so compute the output layer's gradients directly here; calling its
    # backward() would multiply by sigmoid'(Z) a second time.
    out = self.layers[-1]
    out.dW = (1/m) * (out.input_cache.T @ dZ)
    out.db = (1/m) * np.sum(dZ, axis=0)
    dA = dZ @ out.weights.T
    # Backpropagate through the remaining layers in reverse
    for layer in reversed(self.layers[:-1]):
        dA = layer.backward(dA)
Checkpoint: After backward, every layer should have non-zero dW and db.
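One extra check worth doing once is a numerical gradient check: compare the analytic dW against a finite-difference estimate on a tiny batch. This sketch assumes the unweighted loss and spot-checks a single weight; eps is a typical step size, not a requirement:

import numpy as np

def numerical_grad_check(mlp, X, y, eps=1e-5):
    """Compare the first layer's analytic dW[0,0] against a finite-difference estimate."""
    def bce():
        p = np.clip(mlp.forward(X), 1e-7, 1 - 1e-7)
        return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

    mlp.backward(y, mlp.forward(X))      # fills layer.dW (no class weights here)
    layer = mlp.layers[0]
    i, j = 0, 0                          # spot-check a single weight
    old = layer.weights[i, j]
    layer.weights[i, j] = old + eps; loss_plus = bce()
    layer.weights[i, j] = old - eps; loss_minus = bce()
    layer.weights[i, j] = old            # restore

    numeric = (loss_plus - loss_minus) / (2 * eps)
    print(f"analytic={layer.dW[i, j]:.6f}  numeric={numeric:.6f}  (should agree closely)")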
Phase 6: SGD Optimizer (Day 4)
Goal: Update weights using gradients
def update_weights(self, learning_rate: float):
"""Update all layer weights using SGD."""
for layer in self.layers:
layer.weights -= learning_rate * layer.dW
layer.biases -= learning_rate * layer.db
def train_step(self, X_batch: np.ndarray, y_batch: np.ndarray,
learning_rate: float, class_weights: dict = None) -> float:
"""
Single training step: forward, loss, backward, update.
Returns:
loss: Binary cross-entropy loss for this batch
"""
# Forward pass
y_pred = self.forward(X_batch)
# Compute loss (BCE)
epsilon = 1e-7 # Prevent log(0)
y_pred = np.clip(y_pred, epsilon, 1 - epsilon)
if class_weights:
weights = np.where(y_batch == 1, class_weights[1], class_weights[0])
loss = -np.mean(weights * (
y_batch * np.log(y_pred) +
(1 - y_batch) * np.log(1 - y_pred)
))
else:
loss = -np.mean(
y_batch * np.log(y_pred) +
(1 - y_batch) * np.log(1 - y_pred)
)
# Backward pass
self.backward(y_batch, y_pred, class_weights)
# Update weights
self.update_weights(learning_rate)
return loss
Checkpoint: Train on small batch, verify loss decreases over iterations.
Phase 7: Class Weighting for Imbalance (Day 5)
Goal: Implement balanced training
def compute_class_weights(y: np.ndarray) -> dict:
"""
Compute class weights inversely proportional to class frequencies.
Returns:
{0: weight_0, 1: weight_1}
"""
n_samples = len(y)
n_classes = 2
counts = np.bincount(y.flatten().astype(int))
weights = n_samples / (n_classes * counts)
return {0: weights[0], 1: weights[1]}
# Example usage:
# y_train = [0, 0, 0, 0, 0, 0, 0, 0, 0, 1] # 90% class 0, 10% class 1
# weights = compute_class_weights(y_train)
# weights = {0: 0.556, 1: 5.0} # Class 1 weighted 9x more
Phase 8: Evaluation Metrics (Day 5-6)
Goal: Implement proper evaluation
def evaluate(self, X: np.ndarray, y: np.ndarray, threshold: float = 0.5) -> dict:
"""
Compute classification metrics.
Returns:
dict with accuracy, precision, recall, f1, confusion_matrix
"""
y_pred = (self.forward(X) >= threshold).astype(int).flatten()
y_true = y.flatten().astype(int)
# Confusion matrix elements
tp = np.sum((y_pred == 1) & (y_true == 1))
tn = np.sum((y_pred == 0) & (y_true == 0))
fp = np.sum((y_pred == 1) & (y_true == 0))
fn = np.sum((y_pred == 0) & (y_true == 1))
# Metrics
accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp) if (tp + fp) > 0 else 0
recall = tp / (tp + fn) if (tp + fn) > 0 else 0
f1 = 2 * precision * recall / (precision + recall) if (precision + recall) > 0 else 0
return {
"accuracy": accuracy,
"precision": precision,
"recall": recall,
"f1": f1,
"confusion_matrix": {"TP": tp, "TN": tn, "FP": fp, "FN": fn}
}
Phase 8b: Full Training Loop (Day 6-7)
Goal: Put it all together
def fit(self, X_train: np.ndarray, y_train: np.ndarray,
X_val: np.ndarray = None, y_val: np.ndarray = None,
epochs: int = 50, batch_size: int = 64,
learning_rate: float = 0.01,
use_class_weights: bool = True,
verbose: bool = True):
"""
Full training loop with mini-batches.
"""
n_samples = X_train.shape[0]
# Compute class weights
class_weights = compute_class_weights(y_train) if use_class_weights else None
if verbose and class_weights:
print(f"Class weights: {class_weights}")
history = {"loss": [], "val_metrics": []}
for epoch in range(epochs):
# Shuffle data each epoch
indices = np.random.permutation(n_samples)
X_shuffled = X_train[indices]
y_shuffled = y_train[indices]
epoch_losses = []
# Mini-batch training
for i in range(0, n_samples, batch_size):
X_batch = X_shuffled[i:i+batch_size]
y_batch = y_shuffled[i:i+batch_size]
loss = self.train_step(X_batch, y_batch, learning_rate, class_weights)
epoch_losses.append(loss)
avg_loss = np.mean(epoch_losses)
history["loss"].append(avg_loss)
# Validation
if X_val is not None and verbose:
metrics = self.evaluate(X_val, y_val)
history["val_metrics"].append(metrics)
print(f"Epoch {epoch+1}/{epochs}: "
f"Loss={avg_loss:.4f} | "
f"Acc={metrics['accuracy']:.3f} | "
f"Prec={metrics['precision']:.3f} | "
f"Rec={metrics['recall']:.3f} | "
f"F1={metrics['f1']:.3f}")
return history
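Putting the phases together, a typical call might look like the sketch below; the 80/20 split and the already-normalized X, y arrays are placeholders for your own data pipeline:

import numpy as np

# X, y: already loaded and normalized, shapes (n, 30) and (n, 1)
rng = np.random.default_rng(42)
idx = rng.permutation(len(X))
split = int(0.8 * len(X))
X_train, X_test = X[idx[:split]], X[idx[split:]]
y_train, y_test = y[idx[:split]], y[idx[split:]]

mlp = MLP([30, 16, 16, 1], ["relu", "relu", "sigmoid"])
history = mlp.fit(X_train, y_train, X_val=X_test, y_val=y_test,
                  epochs=50, batch_size=64, learning_rate=0.01,
                  use_class_weights=True)
print(mlp.evaluate(X_test, y_test))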
Questions to Guide Your Design
Before writing code, think through these questions:
Architecture Decisions
- How many hidden layers do you need?
- Start with 2 hidden layers (16 neurons each)
- Add more if underfitting
- The architecture [30 → 16 → 16 → 1] is a good starting point
- Why not use Sigmoid in hidden layers?
- Vanishing gradients would kill learning in deep networks
- ReLU keeps gradients flowing
- Only use Sigmoid for the output (probability interpretation)
- What batch size should you use?
- 64 is a good default
- Too small: noisy gradients, slow training
- Too large: smooth gradients but may miss minima
- For imbalanced data: ensure batches contain minority samples
Imbalance Handling
- Why is 99% accuracy worthless here?
- Because predicting "all legitimate" gives 99.83% accuracy
- We care about catching fraud, not overall correctness
- Recall matters more than accuracy
- Class weights vs SMOTE - which to use?
- Class weights: Simple, no synthetic data, works well in practice
- SMOTE: Creates synthetic minority samples, can help but risks overfitting
- Start with class weights, try SMOTE if recall is too low
- What threshold should you use for prediction?
- Default 0.5 assumes equal costs for errors
- Lower threshold (0.3): Catch more fraud, more false alarms
- Higher threshold (0.7): Fewer false alarms, miss more fraud
- Tune based on business requirements
Debugging
- How do you know if training is working?
- Loss should decrease
- Recall should increase (model learning to detect fraud)
- If Recall stays at 0 for many epochs, class weights may be too low
- What if all neurons output the same value?
- Check initialization (weights should be diverse)
- Check for dead ReLUs (too many neurons stuck at 0)
- Reduce learning rate
Thinking Exercise
Before implementing, work through this exercise by hand:
The XOR Problem with a Hidden Layer
Setup: Build a network to solve XOR
Network: 2 inputs → 2 hidden (ReLU) → 1 output (Sigmoid)
Initial weights (random example):
Hidden layer: W1 = [[0.5, -0.5], b1 = [0, 0]
[0.5, -0.5]]
Output layer: W2 = [[1.0], b2 = [0]
[-1.0]]
Training data:
X = [[0, 0], [0, 1], [1, 0], [1, 1]]
Y = [[0], [1], [1], [0]]
Task: Hand-trace the forward pass for input [0, 1]
- Hidden layer pre-activation (Z1):
  Z1 = [0, 1] @ [[0.5, -0.5], [0.5, -0.5]] + [0, 0]
  Z1 = [0*0.5 + 1*0.5, 0*(-0.5) + 1*(-0.5)]
  Z1 = [0.5, -0.5]
- Hidden layer activation (A1) - ReLU:
  A1 = ReLU([0.5, -0.5])
  A1 = [0.5, 0]   (negative value becomes 0)
- Output pre-activation (Z2):
  Z2 = [0.5, 0] @ [[1.0], [-1.0]] + [0]
  Z2 = [0.5*1.0 + 0*(-1.0)]
  Z2 = [0.5]
- Output activation (A2) - Sigmoid:
  A2 = Sigmoid(0.5)
  A2 = 1 / (1 + e^(-0.5))
  A2 ≈ 0.62
- Loss (BCE, target = 1):
  Loss = -(1 * log(0.62) + 0 * log(0.38))
  Loss ≈ 0.48
Now you try: Trace the forward pass for input [1, 1] (target = 0).
Draw the decision boundary: After training, the hidden layer transforms the 2D input space such that XOR becomes linearly separable. Sketch what this transformation might look like.
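You can check your hand-trace (and the [1, 1] case) with a few lines of NumPy using the same weights:

import numpy as np

W1 = np.array([[0.5, -0.5], [0.5, -0.5]]); b1 = np.array([0.0, 0.0])
W2 = np.array([[1.0], [-1.0]]);            b2 = np.array([0.0])

def trace(x):
    z1 = x @ W1 + b1
    a1 = np.maximum(0, z1)          # ReLU
    z2 = a1 @ W2 + b2
    a2 = 1 / (1 + np.exp(-z2))      # Sigmoid
    print(f"x={x}  Z1={z1}  A1={a1}  Z2={z2}  output={a2.round(3)}")

trace(np.array([0.0, 1.0]))   # matches the worked example (output ~0.62)
trace(np.array([1.0, 1.0]))   # compare this against your own hand-trace (target 0)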
Testing Strategy
Unit Tests
def test_layer_forward():
"""Test Layer forward pass produces correct shape."""
layer = Layer(n_in=10, n_out=5, activation="relu")
X = np.random.randn(32, 10) # Batch of 32
A = layer.forward(X)
assert A.shape == (32, 5), f"Expected (32, 5), got {A.shape}"
assert np.all(A >= 0), "ReLU should produce non-negative outputs"
assert layer.input_cache is not None, "Should cache input"
assert layer.z_cache is not None, "Should cache pre-activation"
def test_layer_backward():
"""Test Layer backward pass computes gradients."""
layer = Layer(n_in=10, n_out=5, activation="relu")
X = np.random.randn(32, 10)
# Forward
A = layer.forward(X)
# Backward (fake gradient from next layer)
dA = np.random.randn(32, 5)
dX = layer.backward(dA)
assert dX.shape == X.shape, "Gradient should match input shape"
assert layer.dW.shape == layer.weights.shape
assert layer.db.shape == layer.biases.shape
def test_mlp_forward():
"""Test MLP produces probability output."""
mlp = MLP([30, 16, 16, 1], ["relu", "relu", "sigmoid"])
X = np.random.randn(64, 30)
y_pred = mlp.forward(X)
assert y_pred.shape == (64, 1)
assert np.all(y_pred >= 0) and np.all(y_pred <= 1), "Should be probabilities"
def test_relu():
"""Test ReLU activation."""
Z = np.array([-2, -1, 0, 1, 2])
expected = np.array([0, 0, 0, 1, 2])
assert np.allclose(relu(Z), expected)
def test_class_weights():
"""Test class weight computation."""
y = np.array([0, 0, 0, 0, 0, 0, 0, 0, 0, 1]).reshape(-1, 1)
weights = compute_class_weights(y)
# Class 1 should have ~9x the weight of class 0
assert weights[1] > 5 * weights[0]
Integration Tests
def test_training_reduces_loss():
"""Test that training actually reduces loss."""
np.random.seed(42)
# Synthetic linearly separable data
X = np.random.randn(1000, 10)
y = (X[:, 0] + X[:, 1] > 0).astype(float).reshape(-1, 1)
mlp = MLP([10, 8, 1], ["relu", "sigmoid"])
initial_loss = mlp.train_step(X[:100], y[:100], learning_rate=0.01)
# Train for 50 epochs
for _ in range(50):
mlp.train_step(X[:100], y[:100], learning_rate=0.01)
final_loss = mlp.train_step(X[:100], y[:100], learning_rate=0.01)
assert final_loss < initial_loss, "Loss should decrease with training"
def test_xor_solved():
"""Test that MLP can solve XOR (proves hidden layers work)."""
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([[0], [1], [1], [0]])
mlp = MLP([2, 4, 1], ["relu", "sigmoid"])
# Train
for _ in range(1000):
mlp.train_step(X, y, learning_rate=0.1)
# Evaluate
predictions = mlp.predict(X)
accuracy = np.mean(predictions == y)
assert accuracy >= 0.75, f"Should solve XOR, got accuracy {accuracy}"
Smoke Test on Real Data
def test_fraud_data_loading():
"""Test that creditcard.csv loads correctly."""
import pandas as pd
df = pd.read_csv("creditcard.csv")
assert "Class" in df.columns, "Should have Class column"
assert df.shape[1] == 31, "Should have 30 features + 1 label"
assert df["Class"].isin([0, 1]).all(), "Labels should be 0 or 1"
fraud_ratio = df["Class"].mean()
assert fraud_ratio < 0.01, "Fraud should be <1% of data"
Common Pitfalls and Debugging Tips
1. Recall Stays at 0%
Symptom: Model predicts "legitimate" for everything.
Epoch 10: Accuracy 99.8%, Recall 0.0%
Epoch 20: Accuracy 99.8%, Recall 0.0%
...
Model never learns to detect fraud!
Causes:
- Class weights not applied or too small
- Learning rate too low
- Network too small to learn the pattern
Fix:
# Verify class weights are correct
weights = compute_class_weights(y_train)
print(f"Class weights: {weights}")
# Should be something like {0: 0.5, 1: 289}
# Increase learning rate
learning_rate = 0.1 # Start higher, reduce if unstable
# Verify weights are being used in loss
# Add debug print in train_step()
2. Loss is NaN or Infinity
Symptom: Training explodes.
Epoch 1: Loss = 0.69
Epoch 2: Loss = 15.4
Epoch 3: Loss = inf
Epoch 4: Loss = nan
Causes:
- Learning rate too high
- No gradient clipping
- Log of 0 in BCE loss
Fix:
# Clip predictions to avoid log(0)
epsilon = 1e-7
y_pred = np.clip(y_pred, epsilon, 1 - epsilon)
# Reduce learning rate
learning_rate = 0.001
# Clip gradients (optional)
for layer in self.layers:
layer.dW = np.clip(layer.dW, -1, 1)
layer.db = np.clip(layer.db, -1, 1)
3. Dead ReLU Neurons
Symptom: Many hidden layer outputs are exactly 0 for all inputs.
A = mlp.layers[0].forward(X)
print(np.sum(A == 0) / A.size) # If > 50%, too many dead neurons
Causes:
- Poor weight initialization
- Learning rate too high caused weights to go very negative
- Bias initialization issue
Fix:
# Use He initialization properly
self.weights = np.random.randn(n_in, n_out) * np.sqrt(2.0 / n_in)
# Initialize biases to small positive value
self.biases = np.full(n_out, 0.01) # Helps ReLU stay active initially
# Try Leaky ReLU instead
def leaky_relu(Z, alpha=0.01):
return np.where(Z > 0, Z, alpha * Z)
4. Gradients Vanishing or Exploding
Symptom: Early layers don't learn, or training is unstable.
Diagnosis:
# Check gradient magnitudes after backward pass
for i, layer in enumerate(mlp.layers):
print(f"Layer {i}: |dW| mean = {np.abs(layer.dW).mean():.6f}")
# Should be similar order of magnitude across layers
Fix:
- Use proper initialization (He for ReLU, Xavier for sigmoid/tanh)
- Use ReLU instead of sigmoid in hidden layers
- Add batch normalization (advanced)
5. Overfitting
Symptom: Training metrics are great, but validation metrics are worse.
Epoch 50: Train F1=0.95, Val F1=0.65
Causes:
- Network too large for dataset
- Training too long
- Not enough regularization
Fix:
# Reduce network size
mlp = MLP([30, 8, 8, 1]) # Fewer neurons
# Add L2 regularization to weight updates
lambda_l2 = 0.001
self.dW += lambda_l2 * self.weights
# Use early stopping
if val_f1 < best_val_f1:
patience_counter += 1
if patience_counter >= patience:
break
else:
best_val_f1 = val_f1
patience_counter = 0
Interview Questions This Project Prepares You For
Understanding Questions
- "Why can't a single-layer network solve XOR?"
- A single layer computes a linear combination of inputs
- The decision boundary is a hyperplane (straight line in 2D)
- XOR requires a non-linear boundary (you need to "fold" the space)
- Adding a hidden layer allows learning feature transformations that make XOR linearly separable in the new space
- "Explain the vanishing gradient problem and how ReLU solves it."
- Sigmoid/tanh derivatives are < 1, so gradients shrink as they backpropagate
- In deep networks, early layer gradients become ~0, preventing learning
- ReLU has derivative = 1 for positive inputs, so gradients flow unchanged
- This enabled training of networks with many layers
- "What's wrong with using accuracy for imbalanced classification?"
- A naive model that always predicts the majority class achieves high accuracy
- For 99:1 imbalance, predicting "always 0" gives 99% accuracy but 0% utility
- Precision and Recall measure what matters: catching the minority class without too many false alarms
- F1-score balances precision and recall
Implementation Questions
- "Walk me through one forward-backward pass of your MLP."
- Forward: Input → (linear transform + activation) for each layer → output probability
- Loss: Compare prediction to label using weighted BCE
- Backward: Compute dL/dA for output, propagate through each layer computing dW, db, dX
- Update: W -= learning_rate * dW for each layer
- "How do you handle the class imbalance problem?"
- Class weights: Multiply loss by inverse class frequency
- SMOTE: Generate synthetic minority samples
- Threshold tuning: Lower classification threshold to catch more minority class
- Stratified sampling: Ensure each batch contains minority samples
- "What's the difference between batch, mini-batch, and stochastic gradient descent?"
- Batch: Entire dataset per update - stable but slow, may overfit
- SGD: One sample per update - noisy but fast, helps generalization
- Mini-batch: N samples per update - best of both worlds, vectorizable
Design Questions
- "How would you decide on the network architecture?"
- Start simple: 2 hidden layers, 16-32 neurons each
- Increase if underfitting (low train AND val performance)
- Decrease if overfitting (high train, low val performance)
- For tabular data, 2-4 layers usually sufficient
- Use validation set to tune, not test set
- "How would you deploy this model in production?"
- Save trained weights (np.save or pickle)
- Wrap in prediction API (Flask/FastAPI)
- Apply same preprocessing (normalization) to new transactions
- Log predictions and actual outcomes for monitoring
- Retrain periodically as fraud patterns change
Hints in Layers
Stuck? Read only the hint level you need.
Challenge: Model Predicts All Same Class
Hint Level 1 (Conceptual): The model found a shortcut. It's easier to always predict the majority class than to learn patterns.
Hint Level 2 (Direction): You need to penalize errors on the minority class more heavily. The loss function should "care more" about fraud.
Hint Level 3 (Specific): Multiply the loss for each sample by a weight that's inversely proportional to class frequency. Class 1 (fraud) should have weight ~100-500x larger than class 0.
Hint Level 4 (Code):
# Compute weights
class_weights = compute_class_weights(y_train) # {0: 0.5, 1: 289}
# Apply in loss
sample_weights = np.where(y_batch == 1, class_weights[1], class_weights[0])
loss = -np.mean(sample_weights * (y_batch * np.log(y_pred) + (1 - y_batch) * np.log(1 - y_pred)))
Challenge: Backward Pass is Wrong
Hint Level 1 (Conceptual): The chain rule must be applied correctly. Each layerโs gradient depends on the gradient from the next layer.
Hint Level 2 (Direction): For ReLU, the gradient is 0 where input was negative, and 1 where positive. You need to multiply by this mask.
Hint Level 3 (Specific): Store Z (pre-activation) during forward pass. In backward, compute dZ = dA * (Z > 0) for ReLU.
Hint Level 4 (Code):
def backward(self, dA):
# dA: gradient from next layer (or loss)
if self.activation == "relu":
dZ = dA * (self.z_cache > 0).astype(float)
elif self.activation == "sigmoid":
s = sigmoid(self.z_cache)
dZ = dA * s * (1 - s)
else:
dZ = dA
m = dA.shape[0]
self.dW = (1/m) * self.input_cache.T @ dZ
self.db = (1/m) * np.sum(dZ, axis=0)
dX = dZ @ self.weights.T
return dX
Challenge: Loss Not Decreasing
Hint Level 1 (Conceptual): Either the learning rate is wrong, or the gradients are wrong.
Hint Level 2 (Direction): Try a higher learning rate (0.1 or 1.0) to see if loss moves at all. If it explodes, your gradients are correct; if nothing happens, gradients may be wrong.
Hint Level 3 (Specific): Print gradient magnitudes. They should be non-zero and roughly similar across layers.
Hint Level 4 (Code):
# Debug: Print gradient stats
for i, layer in enumerate(mlp.layers):
print(f"Layer {i}:")
print(f" |dW| mean: {np.abs(layer.dW).mean():.6f}")
print(f" |db| mean: {np.abs(layer.db).mean():.6f}")
print(f" dW range: [{layer.dW.min():.4f}, {layer.dW.max():.4f}]")
Extensions and Challenges
1. Add Dropout Regularization
Dropout randomly "turns off" neurons during training, preventing co-adaptation.
class DropoutLayer:
def __init__(self, p: float = 0.5):
"""p = probability of KEEPING a neuron (not dropping)."""
self.p = p
self.mask = None
def forward(self, X, training=True):
if training:
self.mask = (np.random.rand(*X.shape) < self.p) / self.p
return X * self.mask
else:
return X # No dropout during inference
def backward(self, dA):
return dA * self.mask
2. Implement Adam Optimizer
Adam adapts learning rate per-parameter using momentum and second moments.
class AdamOptimizer:
def __init__(self, lr=0.001, beta1=0.9, beta2=0.999, epsilon=1e-8):
self.lr = lr
self.beta1 = beta1
self.beta2 = beta2
self.epsilon = epsilon
self.m = {} # First moment
self.v = {} # Second moment
self.t = 0 # Timestep
def update(self, layer, layer_id):
if layer_id not in self.m:
self.m[layer_id] = {"W": np.zeros_like(layer.weights),
"b": np.zeros_like(layer.biases)}
self.v[layer_id] = {"W": np.zeros_like(layer.weights),
"b": np.zeros_like(layer.biases)}
self.t += 1
for param, grad, key in [(layer.weights, layer.dW, "W"),
(layer.biases, layer.db, "b")]:
# Update moments
self.m[layer_id][key] = self.beta1 * self.m[layer_id][key] + (1-self.beta1) * grad
self.v[layer_id][key] = self.beta2 * self.v[layer_id][key] + (1-self.beta2) * grad**2
# Bias correction
m_hat = self.m[layer_id][key] / (1 - self.beta1**self.t)
v_hat = self.v[layer_id][key] / (1 - self.beta2**self.t)
# Update
param -= self.lr * m_hat / (np.sqrt(v_hat) + self.epsilon)
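To plug it in, replace the plain SGD update with per-layer calls to the optimizer; a sketch, where X_batch and y_batch come from your own mini-batch loop:

mlp = MLP([30, 16, 16, 1], ["relu", "relu", "sigmoid"])
adam = AdamOptimizer(lr=0.001)

y_pred = mlp.forward(X_batch)
mlp.backward(y_batch, y_pred)        # fills layer.dW / layer.db
for i, layer in enumerate(mlp.layers):
    adam.update(layer, layer_id=i)   # replaces `layer.weights -= lr * layer.dW`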
3. Try Different Architectures
Experiment with:
- Wider networks: [30, 64, 64, 1]
- Deeper networks: [30, 16, 16, 16, 1]
- Bottleneck: [30, 8, 16, 8, 1] (compression in middle)
- Residual connections (advanced)
4. Implement Learning Rate Scheduling
Reduce learning rate as training progresses:
def lr_schedule(epoch, initial_lr=0.1):
"""Decay learning rate by 10x every 20 epochs."""
return initial_lr * (0.1 ** (epoch // 20))
# Step decay
def step_decay(epoch, initial_lr, drop=0.5, epochs_drop=10):
return initial_lr * (drop ** (epoch // epochs_drop))
# Exponential decay
def exponential_decay(epoch, initial_lr, decay_rate=0.95):
return initial_lr * (decay_rate ** epoch)
5. Visualize Decision Boundaries
For 2D synthetic data, visualize what the network learns:
import numpy as np
import matplotlib.pyplot as plt

def plot_decision_boundary(mlp, X, y, resolution=100):
x_min, x_max = X[:, 0].min() - 0.5, X[:, 0].max() + 0.5
y_min, y_max = X[:, 1].min() - 0.5, X[:, 1].max() + 0.5
xx, yy = np.meshgrid(
np.linspace(x_min, x_max, resolution),
np.linspace(y_min, y_max, resolution)
)
grid = np.c_[xx.ravel(), yy.ravel()]
probs = mlp.forward(grid).reshape(xx.shape)
plt.contourf(xx, yy, probs, levels=50, cmap='RdBu', alpha=0.8)
plt.scatter(X[:, 0], X[:, 1], c=y.flatten(), cmap='RdBu', edgecolors='black')
plt.title("Decision Boundary")
plt.show()
Real-World Connections
FinTech Fraud Detection Systems
How real companies do it:
- Feature Engineering: Beyond raw transaction data, companies use:
- Velocity features (transactions per hour/day)
- Behavioral patterns (typical spending categories)
- Device fingerprinting
- Geolocation anomalies
- Network analysis (connected accounts)
- Model Architecture:
- Ensemble of models (gradient boosting + neural nets)
- Real-time scoring (<100ms latency requirement)
- Explainability layers (why was this flagged?)
- Deployment Considerations:
- Models retrained weekly/monthly (fraud patterns evolve)
- A/B testing new models against production
- Feedback loops from confirmed fraud
- Cost-sensitive learning (missed fraud costs more than false alarms)
Companies using ML for fraud detection:
- Stripe Radar: ML-based fraud prevention for payments
- PayPal: Real-time risk scoring for transactions
- Capital One: Credit card fraud detection
- Featurespace: Adaptive behavioral analytics
Beyond Binary Classification
This project teaches fundamentals applicable to:
- Anomaly Detection: Autoencoders for unsupervised fraud detection
- Sequence Models: RNNs/LSTMs for transaction sequences
- Graph Neural Networks: Detecting fraud rings
- Federated Learning: Training across banks without sharing data
Books That Will Help
| Book | Relevant Chapters | What You'll Learn |
|---|---|---|
| "Neural Networks and Deep Learning" by Michael Nielsen | Ch. 2: "How the backpropagation algorithm works" | Visual, intuitive explanation of backprop. Free online at neuralnetworksanddeeplearning.com |
| "Deep Learning" by Goodfellow, Bengio, Courville | Ch. 6: "Deep Feedforward Networks" | Mathematical foundation of MLPs, activation functions, loss functions |
| "Grokking Deep Learning" by Andrew Trask | Ch. 4-7: Gradient descent through backprop | Extremely beginner-friendly, builds everything from scratch |
| "Hands-On Machine Learning" by Aurelien Geron | Ch. 10: "Introduction to Artificial Neural Networks" | Practical Keras implementation with sklearn integration |
| "Pattern Recognition and Machine Learning" by Bishop | Ch. 5: "Neural Networks" | Rigorous statistical treatment of MLPs |
Online Resources
- 3Blue1Brown Neural Networks Series - Beautiful visual explanations
- Andrej Karpathy's micrograd - Tiny autograd engine (reference for Project 5)
- Kaggle Credit Card Fraud Dataset - Real anonymized data
Self-Assessment Checklist
Before considering this project complete, verify you can:
Implementation
- Build a `Layer` class with forward and backward passes
- Stack layers into an `MLP` that trains end-to-end
- Implement ReLU and Sigmoid activations with their derivatives
- Compute binary cross-entropy loss with class weights
- Train using mini-batch gradient descent
- Evaluate using Precision, Recall, F1-score (not just accuracy)
Understanding
- Explain why XOR cannot be solved with a single layer
- Draw the decision boundary of a 2-layer network on paper
- Describe how hidden layers transform feature space
- Explain the vanishing gradient problem and how ReLU solves it
- Justify why accuracy is misleading for imbalanced data
Debugging
- Diagnose why the model predicts all one class
- Identify dead ReLU neurons
- Fix NaN/Inf in loss
- Tune hyperparameters (learning rate, batch size, architecture)
Extensions (Choose at least 1)
- Add dropout regularization
- Implement Adam optimizer
- Try SMOTE for oversampling
- Visualize decision boundaries on 2D synthetic data
- Achieve >85% F1 on the credit card fraud dataset
Key Insights
"Deep" means feature extraction, not just more parameters. Each layer learns increasingly abstract representations. Layer 1 might detect "large amount," Layer 2 detects "large + unusual time," and so on. Depth creates a hierarchy of features.
Accuracy is a lie in imbalanced settings. A model that predicts "all negative" achieves 99.8% accuracy on the fraud dataset but catches 0% of fraud. Always use metrics that measure what you care about: catching the minority class.
Class imbalance is not a data problem; it's a loss problem. You don't need more data - you need to tell the model that minority class errors hurt more. Class weights are simple and effective.
ReLU enabled deep learning. Before ReLU, training networks deeper than 3-4 layers was impractical. The simple function max(0, x) changed everything by allowing gradients to flow.
After completing this project, you will understand WHY neural networks need depth, HOW to handle the class imbalance that plagues real-world data, and WHAT metrics actually matter for production systems. You're building the same architecture used by banks processing millions of transactions daily.