Project 4: The Spam Filter (Logistic Regression)
Build a text classifier that reads emails and predicts "Spam" or "Ham" (Not Spam) using Sigmoid Activation and Cross-Entropy Loss
Project Overview
| Attribute | Value |
|---|---|
| Difficulty | Level 2: Intermediate |
| Time Estimate | 1 Week |
| Language | Python |
| Prerequisites | Project 3 (Linear Regression), Basic NumPy |
| Main Reference | "Grokking Deep Learning" by Andrew Trask, Chapter 3 |
| Knowledge Area | Classification / Probability / NLP Basics |
Learning Objectives
After completing this project, you will be able to:
- Distinguish regression from classification - Understand why predicting categories requires different math than predicting continuous values
- Implement the Sigmoid function - Write 1 / (1 + e^-z) and understand its properties
- Explain probability interpretation - Know why sigmoid outputs are probabilities between 0 and 1
- Derive Cross-Entropy Loss - Understand why MSE fails for classification and how log loss penalizes confident wrong answers
- Build a Bag of Words representation - Convert raw text into numerical vectors the machine can process
- Train a binary classifier - Implement gradient descent for logistic regression
- Evaluate with proper metrics - Calculate accuracy, precision, recall, and F1-score
The Core Question You're Answering
"How does a computer understand 'concepts' like Spam?"
The short answer: It doesn't.
A computer has no concept of "spam" or "ham." It doesn't understand language, context, or intent. What it can do is count:
- How often does the word "FREE" appear in spam emails? Very often.
- How often does "FREE" appear in legitimate emails? Rarely.
- How often does "mom" appear in spam? Almost never.
- How often does "mom" appear in legitimate emails? Sometimes.
This is the profound insight behind machine learning: understanding is approximated by statistics. The classifier learns that certain word combinations correlate with certain labels. It's pattern matching at scale, not comprehension.
The Machine's "Understanding" of Spam:
Human View: Machine View:
"Buy cheap meds now!!" --> [buy:1, cheap:1, meds:1, now:1, click:0, mom:0, dinner:0, ...]
Dot product with weights --> 4.5
Sigmoid(4.5) --> 0.989
0.989 > 0.5 --> SPAM
The machine sees only:
- A vector of numbers (word counts)
- A weighted sum (how "spam-like" the vector is)
- A probability (how confident the prediction is)
It has no idea what "cheap meds" means. It only knows the pattern.
This project teaches you to build this statistical pattern matcher from scratch.
Concepts You Must Understand First
Before writing code, you need mental models for these foundational concepts:
1. Why Classification Differs from Regression
Regression predicts a continuous value: "This house costs $450,000."
Classification predicts a discrete category: "This email is SPAM."
The difference isn't superficial. It fundamentally changes the math:
Regression (Project 3): Classification (This Project):
Output: Any real number Output: 0 or 1 (Binary)
Example: -5.3, 0, 100.7, ... Example: SPAM (1) or HAM (0)
Loss: Mean Squared Error Loss: Cross-Entropy (Log Loss)
L = (y - y_hat)^2 L = -[y*log(p) + (1-y)*log(1-p)]
Why MSE? Why Cross-Entropy?
- Punishes big errors more - Punishes confident wrong answers
- Smooth, differentiable - Designed for probabilities
- Makes sense for continuous - MSE breaks for classification
targets (vanishing gradients)
Book Reference: "Pattern Recognition and Machine Learning" by Christopher Bishop, Chapter 4.3 covers the theoretical foundations of logistic regression.
2. The Sigmoid Function and Its Properties
The sigmoid function âsquashesâ any real number into the range (0, 1):
Sigmoid: sigma(z) = 1 / (1 + e^(-z))
1.0 ___________________________
| ------
| ----
0.5 | ----
| ----
| ----
0.0 |---------------------------
-6 -4 -2 0 2 4 6
z
Key Properties:
- Domain: All real numbers (-inf, +inf)
- Range: (0, 1) - Perfect for probabilities!
- sigma(0) = 0.5 (decision boundary)
- sigma(-z) = 1 - sigma(z) (symmetric)
- Derivative: sigma'(z) = sigma(z) * (1 - sigma(z))
This simple derivative makes gradient computation elegant.
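A quick numerical check of these properties (a minimal sketch in plain NumPy; the naive formula is fine here because z stays small):
import numpy as np
z = np.linspace(-6, 6, 13)
sig = 1 / (1 + np.exp(-z))                      # sigma(z)
# Range stays strictly inside (0, 1), and sigma(0) = 0.5
assert np.all((sig > 0) & (sig < 1)) and sig[6] == 0.5
# Symmetry: sigma(-z) = 1 - sigma(z)
assert np.allclose(1 / (1 + np.exp(z)), 1 - sig)
# Derivative identity: sigma'(z) = sigma(z) * (1 - sigma(z)), checked by finite difference
h = 1e-6
numeric = (1 / (1 + np.exp(-(z + h))) - sig) / h
assert np.allclose(numeric, sig * (1 - sig), atol=1e-4)
print("sigmoid properties verified")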
Why Sigmoid for Classification?
1. Output is always between 0 and 1 (valid probability)
2. Large positive z --> output near 1 (high confidence positive)
3. Large negative z --> output near 0 (high confidence negative)
4. z near 0 --> output near 0.5 (uncertain)
ASCII Art - Sigmoid in Detail:
sigma(z) = 1 / (1 + e^(-z))
1.0 | . . . . . . . .
| . .
| .
0.8 | .
| .
| .
0.6 | .
| .
| .
0.5 |. . . . . . + (Decision Boundary: sigma(0) = 0.5)
| .
| .
0.4 | .
| .
| .
0.2 | .
| .
| .
0.0 |. . .
+---------------------------------------------------
-6 -4 -2 0 2 4 6 z
Interpretation:
z = -6: sigma(-6) = 0.002 --> 0.2% chance of SPAM (very confident HAM)
z = -2: sigma(-2) = 0.119 --> 11.9% chance of SPAM
z = 0: sigma(0) = 0.500 --> 50% (totally uncertain)
z = 2: sigma(2) = 0.881 --> 88.1% chance of SPAM
z = 6: sigma(6) = 0.998 --> 99.8% chance of SPAM (very confident)
Book Reference: "Deep Learning" by Goodfellow, Bengio, Courville, Section 6.2.2.2 covers sigmoid and its variants.
3. Probability Interpretation of Outputs
The sigmoid output isn't just a number between 0 and 1. It has a precise probabilistic meaning:
p = sigma(w * x + b)
This p is the model's estimate of:
P(y = 1 | x) = "The probability that the email is SPAM given the features x"
The complement:
P(y = 0 | x) = 1 - p = "The probability that the email is HAM"
Example:
Email: "Cheap meds! Buy now!"
Features: [cheap:1, meds:1, buy:1, now:1, mom:0, ...]
After training:
z = w * x + b = 4.2
p = sigma(4.2) = 0.985
Interpretation: "I am 98.5% confident this is SPAM"
The remaining 1.5% represents:
- Uncertainty in the model
- Possible edge cases
- Training data limitations
The Decision Boundary:
When p = 0.5, the model is exactly uncertain.
This happens when z = 0, i.e., when w * x + b = 0.
The decision rule:
If p >= 0.5: Predict SPAM (class 1)
If p < 0.5: Predict HAM (class 0)
In practice, you might adjust this threshold:
- High-stakes spam: Lower threshold (catch more spam, more false positives)
- Important emails: Higher threshold (catch less spam, fewer false positives)
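A tiny sketch of how moving the threshold changes the decisions (the probabilities below are made-up examples, not real model output):
import numpy as np
probs = np.array([0.35, 0.55, 0.72, 0.48])   # hypothetical spam probabilities
print((probs >= 0.5).astype(int))            # default threshold:                     [0 1 1 0]
print((probs >= 0.3).astype(int))            # aggressive: more spam caught, more false positives  [1 1 1 1]
print((probs >= 0.7).astype(int))            # conservative: protects legitimate mail              [0 0 1 0]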
Book Reference: "Machine Learning: A Probabilistic Perspective" by Kevin Murphy, Chapter 8 provides rigorous probability theory for classification.
4. Cross-Entropy Loss Derivation
Why MSE Fails for Classification:
Consider: True label y = 1 (SPAM), Predicted p = 0.01 (1% confident)
MSE Loss: (y - p)^2 = (1 - 0.01)^2 = 0.98
Now consider: p = 0.0001 (0.01% confident)
MSE Loss: (1 - 0.0001)^2 = 0.9998
The loss barely changed! MSE doesn't adequately punish
confident wrong predictions.
Gradient problem:
MSE gradient involves sigma'(z), which is tiny when z is large.
When the model is very confident (large |z|), the gradient vanishes,
and learning stops precisely when we need it most.
Cross-Entropy Loss:
L = -[y * log(p) + (1-y) * log(1-p)]
Case 1: True label y = 1 (SPAM)
L = -log(p)
If p = 0.99: L = -log(0.99) = 0.01 (small loss, good!)
If p = 0.50: L = -log(0.50) = 0.69 (medium loss)
If p = 0.01: L = -log(0.01) = 4.61 (huge loss!)
Case 2: True label y = 0 (HAM)
L = -log(1-p)
If p = 0.01: L = -log(0.99) = 0.01 (small loss, good!)
If p = 0.50: L = -log(0.50) = 0.69 (medium loss)
If p = 0.99: L = -log(0.01) = 4.61 (huge loss!)
Visualizing Cross-Entropy:
Loss when y = 1 (True SPAM): Loss when y = 0 (True HAM):
L = -log(p) L = -log(1-p)
5 |. 5 | .
| . | .
4 | . 4 | .
| . | .
3 | . 3 | .
| . | .
2 | . 2 | .
| .. | ..
1 | .. 1 | ..
| .... | ....
0 |______________________ 0 |______________________
0 0.5 1.0 0 0.5 1.0
p (predicted) p (predicted)
Confident and correct = Low loss
Confident and WRONG = Massive loss (goes to infinity!)
The Beautiful Gradient Simplification:
When you combine sigmoid with cross-entropy, magic happens:
Forward pass:
z = w * x + b
p = sigma(z)
L = -[y * log(p) + (1-y) * log(1-p)]
Backward pass (derivation):
dL/dp = -y/p + (1-y)/(1-p)
dp/dz = p * (1 - p) (sigmoid derivative)
dL/dz = dL/dp * dp/dz
= (-y/p + (1-y)/(1-p)) * p * (1-p)
= -y*(1-p) + (1-y)*p
= p - y
That's it! The gradient is simply: p - y (prediction minus truth)
dL/dw = (p - y) * x
dL/db = (p - y)
This elegant simplification is one reason logistic regression is so popular.
No vanishing gradients. No complex derivatives. Just (p - y).
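If you want to convince yourself, the (p - y) result can be checked against a finite-difference gradient (a minimal sketch; the label y and logit z are arbitrary values):
import numpy as np
y, z = 1.0, 0.8                                   # arbitrary label and logit
p = 1 / (1 + np.exp(-z))
def loss(zz):
    pp = 1 / (1 + np.exp(-zz))
    return -(y * np.log(pp) + (1 - y) * np.log(1 - pp))
h = 1e-6
numeric_grad = (loss(z + h) - loss(z - h)) / (2 * h)   # central difference
print(numeric_grad, p - y)                             # both roughly -0.31
assert abs(numeric_grad - (p - y)) < 1e-6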
Book Reference: "Pattern Recognition and Machine Learning" by Christopher Bishop, Section 4.3.2 derives the cross-entropy gradient.
5. Bag of Words Representation
Text is strings. Machines need numbers. Bag of Words (BoW) is the simplest bridge:
Step 1: Build a vocabulary from all training emails
Training emails:
"Buy cheap meds now"
"Hey mom, dinner tonight?"
"FREE money click here"
"Are we still on for lunch?"
Vocabulary: {buy, cheap, meds, now, hey, mom, dinner, tonight,
free, money, click, here, are, we, still, on, for, lunch}
Index map:
buy=0, cheap=1, meds=2, now=3, hey=4, mom=5, dinner=6, ...
Step 2: Convert each email to a vector (word counts)
"Buy cheap meds now"
--> [1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
buy cheap meds now (rest are zeros)
"Hey mom, dinner tonight?"
--> [0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
Step 3: The classifier sees only these vectors, not the text
Email text: "CHEAP MEDS FREE CHEAP"
|
v
[0, 2, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
^ ^ ^
| | |
cheap=2 meds=1 free=1
What Bag of Words Loses:
These two sentences have IDENTICAL Bag of Words vectors:
"The cat sat on the mat"
"The mat sat on the cat"
BoW ignores:
- Word order
- Grammar
- Context
- Semantics
BoW keeps:
- Word presence/frequency
Despite these limitations, BoW works surprisingly well for spam detection
because spam has distinctive vocabulary regardless of order.
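The word-order example is easy to check with nothing but the standard library (a minimal sketch):
from collections import Counter
a = Counter("the cat sat on the mat".split())
b = Counter("the mat sat on the cat".split())
print(a == b)   # True -- identical bag of words, very different sentences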
Book Reference: "Speech and Language Processing" by Jurafsky & Martin, Chapter 4 covers text representation.
6. Tokenization and Vocabulary Building
Before BoW, you need to clean and split the text:
Raw email: "BUY cheap M3DS now!! Click HERE for $$$ savings..."
Step 1: Lowercase
"buy cheap m3ds now!! click here for $$$ savings..."
Step 2: Remove punctuation
"buy cheap m3ds now click here for savings"
Step 3: Tokenize (split on whitespace)
["buy", "cheap", "m3ds", "now", "click", "here", "for", "savings"]
Step 4: (Optional) Remove stopwords
Common words like "the", "a", "for", "is" carry little meaning.
["buy", "cheap", "m3ds", "click", "savings"]
Step 5: (Optional) Stemming/Lemmatization
Reduce words to roots: "savings" -> "save", "clicking" -> "click"
["buy", "cheap", "m3ds", "click", "save"]
For this project, we'll do basic preprocessing:
1. Lowercase
2. Split on whitespace
3. Remove punctuation
4. Optionally remove stopwords
Deep Theoretical Foundation
From Linear to Logistic Regression
Logistic regression is linear regression with a twist:
Linear Regression (Project 3):
y_hat = w * x + b
Output: Any real number
Use case: Predicting prices, temperatures, etc.
Logistic Regression (This Project):
z = w * x + b (Same linear combination!)
p = sigma(z) (Squash through sigmoid)
Output: Number between 0 and 1
Use case: Predicting probabilities, classifications
The only difference is adding sigmoid to the output.
Everything else (gradient descent, weight updates) works the same way,
just with different gradients.
Linear Regression Logistic Regression
+-------------+ +-------------+
x ---------> | w * x + b | ---> y_hat | w * x + b | ---> z
+-------------+ +-------------+
|
v
+-------------+
| sigmoid(z) | ---> p
+-------------+
Why Sigmoid? The Deep Reasons
- Probability Bounds: Outputs are always valid probabilities (0 to 1)
- Monotonic: Higher z always means higher probability
- Smooth and Differentiable: Allows gradient-based optimization
- Nice Gradient: sigma'(z) = sigma(z) * (1 - sigma(z)) is beautiful
- Natural Log-Odds Interpretation: z = log(p / (1-p)) (log-odds or logit)
The Log-Odds Connection:
If p = sigma(z), then z = logit(p) = log(p / (1-p))
Example: If an email is 80% likely to be spam (p = 0.8)
log-odds = log(0.8 / 0.2) = log(4) = 1.39
This means: "The email is e^1.39 = 4x more likely to be spam than ham"
The linear model z = w * x + b directly computes log-odds.
Each weight tells you how much that feature changes the log-odds.
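A short check that the logit and the sigmoid are inverses, using the 80% example above (a minimal sketch in plain NumPy):
import numpy as np
p = 0.8
z = np.log(p / (1 - p))           # log-odds (logit): about 1.386
print(z, np.exp(z))               # 1.386..., 4.0 -> "4x more likely spam than ham"
p_back = 1 / (1 + np.exp(-z))     # sigmoid undoes the logit
print(p_back)                     # 0.8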
Cross-Entropy: Penalizing Confident Wrong Answers
The asymmetry of cross-entropy is intentional:
Loss Comparison: True label = 1 (SPAM)
Prediction p MSE Loss Cross-Entropy Loss
-----------------------------------------------
0.99 0.0001 0.01
0.90 0.01 0.11
0.70 0.09 0.36
0.50 0.25 0.69
0.30 0.49 1.20
0.10 0.81 2.30
0.01 0.98 4.61
0.001 0.998 6.91
0.0001 0.9998 9.21
Notice:
- MSE plateaus around 1.0 for wrong predictions
- Cross-Entropy goes to INFINITY as prediction goes to 0
- Cross-Entropy heavily punishes confident wrong answers
Why This Matters for Learning:
Scenario: The model sees a spam email and predicts 0.01 (very confident HAM)
With MSE:
Loss = (1 - 0.01)^2 = 0.98
Gradient is relatively small
Model makes a tiny update
Learning is slow
With Cross-Entropy:
Loss = -log(0.01) = 4.61
Gradient = (p - y) * x = (0.01 - 1) * x = -0.99 * x
Large gradient, large update
Model quickly corrects its mistake
Cross-entropy forces the model to take confident mistakes seriously.
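The same scenario in a few lines of arithmetic (a sketch; x is a single made-up feature value):
y, p, x = 1.0, 0.01, 1.0                     # true spam, model 99% sure it's ham
ce_grad = (p - y) * x                        # cross-entropy-through-sigmoid gradient: -0.99
mse_grad = 2 * (p - y) * p * (1 - p) * x     # MSE-through-sigmoid gradient:           -0.0196
print(ce_grad, mse_grad)                     # the cross-entropy update is roughly 50x larger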
Decision Boundary Visualization
The decision boundary is where p = 0.5, which means z = 0:
With two features (x1 and x2):
z = w1*x1 + w2*x2 + b = 0
This is a LINE in 2D space.
Solving for x2:
x2 = (-w1*x1 - b) / w2
Example: w1 = 2, w2 = 1, b = -3
Boundary: 2*x1 + x2 - 3 = 0
x2 = -2*x1 + 3
x2
|
5 | SPAM region (z > 0)
| .
4 | . Key:
| . . = Decision boundary
3 | . <-- boundary crosses here X = SPAM emails (above)
| X X O = HAM emails (below)
2 | X .
| X .
1 | O O .
| O O O .
0 |_____________.______ x1
0 1 2 3 4
SPAM region (z > 0): above the boundary line
HAM region (z < 0): below the boundary line
Boundary: The line where z = 0
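You can recover the boundary line directly from the weights (a minimal sketch for the w1 = 2, w2 = 1, b = -3 example above):
import numpy as np
w1, w2, b = 2.0, 1.0, -3.0
x1 = np.array([0.0, 1.0, 1.5])
x2_boundary = (-w1 * x1 - b) / w2        # points where z = 0
print(x2_boundary)                       # [3.  1.  0.] -- the line in the sketch
z = w1 * 1.0 + w2 * 2.5 + b              # test point (x1, x2) = (1, 2.5)
print("SPAM" if z > 0 else "HAM")        # z = 1.5 > 0 -> SPAM (above the line)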
Real World Outcome
When you complete this project, your spam filter will process raw email text and output probability scores:
Example Session
$ python spam_filter.py "Buy cheap meds now!! Click here"
Preprocessing... [buy, cheap, meds, now, click, here]
Vocabulary encoding... Vector shape: (5000,)
Non-zero features: [buy:1, cheap:1, meds:1, now:1, click:1, here:1]
Computing z = w.dot(x) + b
z = 4.237
Applying sigmoid: p = 1 / (1 + e^(-4.237))
Probability: 0.986
Classification: SPAM (98.6% confident)
$ python spam_filter.py "Hey mom, are we still on for dinner?"
Preprocessing... [hey, mom, are, we, still, on, for, dinner]
Vocabulary encoding... Vector shape: (5000,)
Non-zero features: [hey:1, mom:1, dinner:1]
Computing z = w.dot(x) + b
z = -6.124
Applying sigmoid: p = 1 / (1 + e^(-(-6.124)))
Probability: 0.002
Classification: HAM (99.8% confident)
$ python spam_filter.py "Meeting tomorrow at 3pm"
Preprocessing... [meeting, tomorrow, at, 3pm]
Probability: 0.089
Classification: HAM (91.1% confident)
$ python spam_filter.py "WINNER! You have been selected for a FREE prize"
Preprocessing... [winner, you, have, been, selected, for, a, free, prize]
Probability: 0.997
Classification: SPAM (99.7% confident)
Training Output
$ python spam_filter.py --train data/spam_ham.csv
Loading dataset...
Total emails: 5,572
Spam: 747 (13.4%)
Ham: 4,825 (86.6%)
Building vocabulary from training data...
Total unique words: 8,923
Keeping top 5,000 most frequent words
Converting emails to bag-of-words vectors...
Training set shape: (4,457, 5000)
Test set shape: (1,115, 5000)
Training logistic regression...
Learning rate: 0.1
Epochs: 100
Epoch 10: Loss = 0.412 | Accuracy = 93.2%
Epoch 20: Loss = 0.289 | Accuracy = 95.8%
Epoch 30: Loss = 0.224 | Accuracy = 96.9%
Epoch 40: Loss = 0.185 | Accuracy = 97.4%
Epoch 50: Loss = 0.159 | Accuracy = 97.8%
...
Epoch 100: Loss = 0.098 | Accuracy = 98.5%
Evaluating on test set...
Accuracy: 97.8%
Precision: 90.2% (Of predicted spam, 90.2% were actually spam)
Recall: 93.2% (Of actual spam, 93.2% were caught)
F1-Score: 91.7%
Confusion Matrix:
Predicted
HAM SPAM
Actual HAM [ 952 15 ]
Actual SPAM [ 10 138 ]
Most "spammy" words (highest positive weights):
free: +2.34
click: +2.12
winner: +2.01
prize: +1.89
urgent: +1.76
Most "hammy" words (highest negative weights):
meeting: -1.45
thanks: -1.38
dinner: -1.21
project: -1.15
please: -1.02
Saving model to spam_model.pkl...
Done!
Solution Architecture
+------------------------------------------------------------------+
| Spam Filter Architecture |
+------------------------------------------------------------------+
| |
| 1. DATA LOADING |
| +----------+ +---------+ +-------------+ |
| | CSV File | ---> | Pandas | ---> | texts, labels| |
| +----------+ +---------+ +-------------+ |
| |
| 2. PREPROCESSING PIPELINE |
| +---------+ +----------+ +------------+ +--------+ |
| | Raw Text| -> | Lowercase| -> | Remove | -> | Tokenize| |
| | | | | | Punctuation| | | |
| +---------+ +----------+ +------------+ +--------+ |
| |
| 3. VOCABULARY BUILDING |
| +--------+ +-------------+ +------------+ |
| | Tokens | -> | Count Freq | -> | Top N Words| -> vocab |
| +--------+ +-------------+ +------------+ |
| |
| 4. BAG OF WORDS ENCODING |
| +---------+ +-----------+ +------------+ |
| | Tokens | -> | Vocab Map | -> | Count Vec | -> X (n, v) |
| +---------+ +-----------+ +------------+ |
| |
| 5. MODEL |
| +------------+ +----------+ +---------+ |
| | Weights w | + | Bias b | = | z | |
| | (v,) | | (1,) | | (n,) | |
| +------------+ +----------+ +---------+ |
| | | |
| | X.dot(w) + b | |
| +------------------------------------------+ |
| | |
| v |
| +----------------------------------------------------------+ |
| | p = sigmoid(z) | |
| | p = 1 / (1 + exp(-z)) | |
| +----------------------------------------------------------+ |
| | |
| v |
| 6. TRAINING LOOP |
| +-------------+ +----------+ +-----------+ |
| | Predictions | -> | CE Loss | -> | Gradients | -> Update |
| | p | | | | dw, db | w, b |
| +-------------+ +----------+ +-----------+ |
| |
| 7. INFERENCE |
| +------------+ +---------+ +------------+ |
| | New Email | -> | Encode | -> | p = f(x) | -> SPAM/HAM |
| +------------+ +---------+ +------------+ |
| |
+------------------------------------------------------------------+
Class Structure
class SpamFilter:
"""
Logistic Regression-based Spam Classifier
Attributes:
vocab: dict # word -> index mapping
vocab_size: int # number of unique words
weights: np.array # (vocab_size,) learned weights
bias: float # learned bias term
Methods:
fit(texts, labels, epochs, lr) # Train the model
predict(text) -> float # Get spam probability
classify(text) -> str # Get "SPAM" or "HAM"
evaluate(texts, labels) -> dict # Get accuracy, precision, recall
"""
Phased Implementation Guide
Phase 1: Text Preprocessing Pipeline (Day 1)
Goal: Convert raw email text into clean tokens.
import re
import string
def preprocess(text):
"""
Clean and tokenize text for spam classification.
Steps:
1. Lowercase
2. Remove punctuation
3. Split into words
4. (Optional) Remove stopwords
Args:
text: Raw email string
Returns:
List of cleaned tokens
"""
# Step 1: Lowercase
text = text.lower()
# Step 2: Remove punctuation
text = text.translate(str.maketrans('', '', string.punctuation))
# Step 3: Tokenize (split on whitespace)
tokens = text.split()
# Step 4: (Optional) Remove stopwords
# stopwords = {'the', 'a', 'an', 'is', 'are', 'was', 'were', ...}
# tokens = [t for t in tokens if t not in stopwords]
return tokens
# Test
text = "BUY cheap M3DS now!! Click HERE for $$$ savings..."
print(preprocess(text))
# Expected: ['buy', 'cheap', 'm3ds', 'now', 'click', 'here', 'for', 'savings']
Checkpoint: Can preprocess 100 emails in under 1 second.
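One way to check this checkpoint (a sketch that assumes the preprocess function above; the sample text is arbitrary):
import time
emails = ["BUY cheap M3DS now!! Click HERE for $$$ savings..."] * 100   # stand-in batch of 100 emails
start = time.perf_counter()
tokens = [preprocess(e) for e in emails]
print(f"Preprocessed {len(emails)} emails in {time.perf_counter() - start:.3f}s")   # should be well under 1s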
Phase 2: Vocabulary Building and Encoding (Day 1-2)
Goal: Build a vocabulary from training data and map words to indices.
from collections import Counter
def build_vocabulary(texts, max_vocab_size=5000):
"""
Build vocabulary from list of texts.
Args:
texts: List of raw text strings
max_vocab_size: Keep only top N most frequent words
Returns:
Dictionary mapping word -> index
"""
# Count all words across all texts
word_counts = Counter()
for text in texts:
tokens = preprocess(text)
word_counts.update(tokens)
# Keep top N most frequent words
most_common = word_counts.most_common(max_vocab_size)
# Create word -> index mapping
vocab = {word: idx for idx, (word, count) in enumerate(most_common)}
return vocab
# Test
texts = [
"Buy cheap meds now",
"Hey mom dinner tonight",
"Free money click here",
]
vocab = build_vocabulary(texts, max_vocab_size=100)
print(vocab)
# {'buy': 0, 'cheap': 1, 'meds': 2, 'now': 3, 'hey': 4, ...}
Checkpoint: Build vocabulary of 5,000 words from 5,000 emails in under 10 seconds.
Phase 3: Bag of Words Transformation (Day 2)
Goal: Convert tokenized text into numerical vectors.
import numpy as np
def text_to_vector(text, vocab):
"""
Convert text to bag-of-words vector.
Args:
text: Raw text string
vocab: Word -> index dictionary
Returns:
numpy array of shape (vocab_size,)
"""
tokens = preprocess(text)
vector = np.zeros(len(vocab))
for token in tokens:
if token in vocab:
vector[vocab[token]] += 1
return vector
def texts_to_matrix(texts, vocab):
"""
Convert list of texts to matrix of vectors.
Args:
texts: List of raw text strings
vocab: Word -> index dictionary
Returns:
numpy array of shape (n_texts, vocab_size)
"""
return np.array([text_to_vector(text, vocab) for text in texts])
# Test
vocab = {'buy': 0, 'cheap': 1, 'meds': 2, 'free': 3, 'click': 4}
text = "buy cheap cheap free"
vector = text_to_vector(text, vocab)
print(vector)
# Expected: [1., 2., 0., 1., 0.]
# buy cheap meds free click
Checkpoint: Convert 1,000 emails to vectors in under 5 seconds.
Phase 4: Sigmoid Implementation (Day 2-3)
Goal: Implement the sigmoid activation function with numerical stability.
import numpy as np
def sigmoid(z):
"""
Numerically stable sigmoid function.
sigma(z) = 1 / (1 + exp(-z))
For numerical stability:
- For z >= 0: 1 / (1 + exp(-z))
- For z < 0: exp(z) / (1 + exp(z))
This avoids overflow when z is a large negative number.
"""
# Convert to a float array (so scalar inputs like sigmoid(0) also work) and clip to prevent overflow
z = np.clip(np.asarray(z, dtype=float), -500, 500)
# Numerically stable computation
positive_mask = z >= 0
negative_mask = ~positive_mask
result = np.zeros_like(z, dtype=float)
# For positive z: standard formula
result[positive_mask] = 1 / (1 + np.exp(-z[positive_mask]))
# For negative z: equivalent but stable formula
exp_z = np.exp(z[negative_mask])
result[negative_mask] = exp_z / (1 + exp_z)
return result
# Test
print(sigmoid(0)) # 0.5
print(sigmoid(10)) # ~0.99995
print(sigmoid(-10)) # ~0.00005
print(sigmoid(-1000)) # Should not overflow, returns ~0
Verification: Test with extreme values like +/-1000 without errors.
Phase 5: Cross-Entropy Loss (Day 3)
Goal: Implement cross-entropy loss function.
import numpy as np
def cross_entropy_loss(y_true, y_pred, epsilon=1e-15):
"""
Compute binary cross-entropy loss.
L = -[y * log(p) + (1-y) * log(1-p)]
Args:
y_true: True labels (0 or 1), shape (n,)
y_pred: Predicted probabilities, shape (n,)
epsilon: Small value to prevent log(0)
Returns:
Average loss across all samples
"""
# Clip predictions to prevent log(0)
y_pred = np.clip(y_pred, epsilon, 1 - epsilon)
# Compute cross-entropy
loss = -np.mean(
y_true * np.log(y_pred) +
(1 - y_true) * np.log(1 - y_pred)
)
return loss
# Test cases
# Perfect prediction
print(cross_entropy_loss(np.array([1]), np.array([0.99]))) # ~0.01
# Terrible prediction
print(cross_entropy_loss(np.array([1]), np.array([0.01]))) # ~4.6
# Mixed
y_true = np.array([1, 0, 1, 0])
y_pred = np.array([0.9, 0.1, 0.8, 0.2])
print(cross_entropy_loss(y_true, y_pred)) # ~0.16
Verification: Loss should be near 0 for perfect predictions, high for wrong predictions.
Phase 6: Training Loop (Day 3-4)
Goal: Implement gradient descent to train the model.
import numpy as np
class LogisticRegression:
def __init__(self, n_features):
"""Initialize weights and bias."""
self.weights = np.zeros(n_features)
self.bias = 0.0
def forward(self, X):
"""Compute predictions for input X."""
z = X.dot(self.weights) + self.bias
return sigmoid(z)
def compute_gradients(self, X, y_true, y_pred):
"""
Compute gradients for weights and bias.
Gradient of cross-entropy with sigmoid:
dL/dw = (1/n) * X.T.dot(y_pred - y_true)
dL/db = (1/n) * sum(y_pred - y_true)
"""
n = len(y_true)
error = y_pred - y_true # Shape: (n,)
dw = (1/n) * X.T.dot(error) # Shape: (n_features,)
db = (1/n) * np.sum(error) # Scalar
return dw, db
def fit(self, X, y, epochs=100, learning_rate=0.1, verbose=True):
"""
Train the model using gradient descent.
Args:
X: Feature matrix, shape (n_samples, n_features)
y: Labels, shape (n_samples,)
epochs: Number of training iterations
learning_rate: Step size for updates
verbose: Print progress
"""
history = {'loss': [], 'accuracy': []}
for epoch in range(epochs):
# Forward pass
y_pred = self.forward(X)
# Compute loss
loss = cross_entropy_loss(y, y_pred)
# Compute accuracy
predictions = (y_pred >= 0.5).astype(int)
accuracy = np.mean(predictions == y)
# Store history
history['loss'].append(loss)
history['accuracy'].append(accuracy)
# Compute gradients
dw, db = self.compute_gradients(X, y, y_pred)
# Update weights
self.weights -= learning_rate * dw
self.bias -= learning_rate * db
# Print progress
if verbose and (epoch + 1) % 10 == 0:
print(f"Epoch {epoch+1}: Loss = {loss:.4f} | Accuracy = {accuracy:.2%}")
return history
def predict_proba(self, X):
"""Get probability predictions."""
return self.forward(X)
def predict(self, X, threshold=0.5):
"""Get class predictions (0 or 1)."""
return (self.predict_proba(X) >= threshold).astype(int)
Checkpoint: Training accuracy should increase over epochs. Loss should decrease.
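A quick smoke test on synthetic, linearly separable data (a sketch that assumes the sigmoid and cross_entropy_loss functions from the earlier phases are already defined):
import numpy as np
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
true_w = np.array([3.0, -2.0, 0.0, 1.0, 0.0])
y = (X.dot(true_w) > 0).astype(int)                 # labels generated from a known rule
model = LogisticRegression(n_features=5)
history = model.fit(X, y, epochs=200, learning_rate=0.5, verbose=False)
print(history['loss'][0], history['loss'][-1])      # loss should drop sharply
print(history['accuracy'][-1])                      # accuracy should approach 1.0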
Phase 7: Inference and Evaluation (Day 5)
Goal: Build the complete spam filter with evaluation metrics.
import numpy as np
def evaluate(model, X_test, y_test):
"""
Compute classification metrics.
Returns:
Dictionary with accuracy, precision, recall, f1
"""
y_pred = model.predict(X_test)
# True positives, false positives, etc.
tp = np.sum((y_pred == 1) & (y_test == 1))
fp = np.sum((y_pred == 1) & (y_test == 0))
fn = np.sum((y_pred == 0) & (y_test == 1))
tn = np.sum((y_pred == 0) & (y_test == 0))
# Metrics
accuracy = (tp + tn) / len(y_test)
precision = tp / (tp + fp) if (tp + fp) > 0 else 0
recall = tp / (tp + fn) if (tp + fn) > 0 else 0
f1 = 2 * precision * recall / (precision + recall) if (precision + recall) > 0 else 0
return {
'accuracy': accuracy,
'precision': precision,
'recall': recall,
'f1': f1,
'confusion_matrix': [[tn, fp], [fn, tp]]
}
class SpamFilter:
"""Complete spam filter combining all components."""
def __init__(self, max_vocab_size=5000):
self.max_vocab_size = max_vocab_size
self.vocab = None
self.model = None
def fit(self, texts, labels, epochs=100, learning_rate=0.1):
"""Train the spam filter on text data."""
# Build vocabulary
self.vocab = build_vocabulary(texts, self.max_vocab_size)
# Convert texts to vectors
X = texts_to_matrix(texts, self.vocab)
y = np.array(labels)
# Initialize and train model
self.model = LogisticRegression(len(self.vocab))
history = self.model.fit(X, y, epochs, learning_rate)
return history
def predict(self, text):
"""Predict spam probability for a single text."""
vector = text_to_vector(text, self.vocab)
prob = self.model.predict_proba(vector.reshape(1, -1))[0]
return prob
def classify(self, text, threshold=0.5):
"""Classify text as SPAM or HAM."""
prob = self.predict(text)
label = "SPAM" if prob >= threshold else "HAM"
confidence = prob if prob >= 0.5 else 1 - prob
return label, confidence
def get_top_features(self, n=10):
"""Get most spam-indicative and ham-indicative words."""
idx_to_word = {idx: word for word, idx in self.vocab.items()}
# Sort by weight (the learned weights live on the inner LogisticRegression model)
weights = self.model.weights
sorted_indices = np.argsort(weights)
# Most spammy (highest positive weights)
spammy = [(idx_to_word[i], weights[i])
for i in sorted_indices[-n:][::-1]]
# Most hammy (most negative weights)
hammy = [(idx_to_word[i], weights[i])
for i in sorted_indices[:n]]
return {'spam_words': spammy, 'ham_words': hammy}
Checkpoint: Achieve >95% accuracy on a test set. F1-score > 0.90.
Questions to Guide Your Design
Use these questions as checkpoints during implementation:
Text Processing Questions
- How should I handle unknown words during inference?
- Answer: Ignore them. If a word isn't in the vocabulary, it contributes 0 to the prediction.
- Should I use word counts or binary presence?
- Counts: "free free free" gets higher spam signal
- Binary: "free" present or not, regardless of frequency
- Try both! Binary often works better for spam.
- How do I handle very rare or very common words?
- Very rare: Probably noise, remove (min_df parameter)
- Very common: "the", "a", "is" - stopwords, remove them
Model Questions
- Why initialize weights to zero instead of random?
- For logistic regression, zero initialization works fine
- All features start with equal importance
- The gradient will differentiate them
- What learning rate should I use?
- Start with 0.1, adjust based on convergence
- Too high: Loss oscillates or diverges
- Too low: Training is very slow
- How many epochs are enough?
- Watch the loss curve
- Stop when loss stops decreasing (early stopping)
- For this dataset, 100-200 epochs is usually enough
Evaluation Questions
- Why is accuracy not enough for spam detection?
- Class imbalance: 90% ham, 10% spam
- A model predicting "HAM" always gets 90% accuracy!
- Need precision and recall
- What's more important: precision or recall?
- High precision: Few false positives (legitimate emails marked spam)
- High recall: Few false negatives (spam getting through)
- For email: High precision is often preferred (don't lose important emails)
Thinking Exercise
Manual Probability Calculation
Task: Work through the spam classification by hand.
Setup:
- Vocabulary: {buy: 0, cheap: 1, free: 2, meeting: 3, dinner: 4}
- Weights: [2.0, 1.5, 2.5, -1.0, -0.8]
- Bias: -3.0
Email: "Free cheap cheap meeting"
Step 1: Tokenize
["free", "cheap", "cheap", "meeting"]
Step 2: Create BoW vector
buy cheap free meeting dinner
x = [ 0, 2, 1, 1, 0 ]
Step 3: Compute z
z = w.dot(x) + b
z = (2.0*0) + (1.5*2) + (2.5*1) + (-1.0*1) + (-0.8*0) + (-3.0)
z = 0 + 3.0 + 2.5 - 1.0 + 0 - 3.0
z = 1.5
Step 4: Apply sigmoid
p = 1 / (1 + e^(-1.5))
p = 1 / (1 + 0.223)
p = 1 / 1.223
p = 0.817
Step 5: Classify
0.817 > 0.5, so predict SPAM with 81.7% confidence
Question: If we add "dinner" to the email ("Free cheap cheap meeting dinner"), what happens?
New x: [0, 2, 1, 1, 1]
New z = 1.5 + (-0.8 * 1) = 0.7
New p = sigmoid(0.7) = 0.668
Still SPAM, but confidence dropped from 81.7% to 66.8%!
The word "dinner" has negative weight, making the email seem less spammy.
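The same calculation in NumPy, to check the hand-worked numbers (a minimal sketch):
import numpy as np
w = np.array([2.0, 1.5, 2.5, -1.0, -0.8])    # buy, cheap, free, meeting, dinner
b = -3.0
x = np.array([0, 2, 1, 1, 0])                # "Free cheap cheap meeting"
z = w.dot(x) + b
print(z, 1 / (1 + np.exp(-z)))               # 1.5, 0.817...
x2 = np.array([0, 2, 1, 1, 1])               # ...with "dinner" added
z2 = w.dot(x2) + b
print(z2, 1 / (1 + np.exp(-z2)))             # 0.7, 0.668...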
Testing Strategy
Unit Tests
def test_preprocess():
"""Test text preprocessing."""
assert preprocess("Hello World!") == ["hello", "world"]
assert preprocess("BUY NOW!!!") == ["buy", "now"]
assert preprocess("") == []
def test_sigmoid():
"""Test sigmoid function."""
assert abs(sigmoid(0) - 0.5) < 1e-6
assert sigmoid(100) > 0.99
assert sigmoid(-100) < 0.01
# Should not overflow
assert 0.0 <= sigmoid(-1000) < 1e-6  # must not overflow or raise
def test_cross_entropy():
"""Test cross-entropy loss."""
# Perfect prediction
assert cross_entropy_loss(np.array([1]), np.array([0.9999])) < 0.01
# Terrible prediction
assert cross_entropy_loss(np.array([1]), np.array([0.0001])) > 4
def test_gradient():
"""Test gradient computation."""
model = LogisticRegression(2)
X = np.array([[1, 0], [0, 1]])
y = np.array([1, 0])
y_pred = np.array([0.7, 0.3])
dw, db = model.compute_gradients(X, y, y_pred)
# Gradient should be (p - y) * x
expected_dw = np.array([-0.15, 0.15]) # [(-0.3*1 + 0.3*0)/2, (-0.3*0 + 0.3*1)/2]
assert np.allclose(dw, expected_dw)
Integration Tests
def test_training_improves():
"""Test that training reduces loss."""
# Simple dataset
texts = ["buy free money", "meeting dinner thanks"] * 50
labels = [1, 0] * 50
filter = SpamFilter(max_vocab_size=100)
history = filter.fit(texts, labels, epochs=50, learning_rate=0.5)
# Loss should decrease
assert history['loss'][-1] < history['loss'][0]
# Accuracy should improve
assert history['accuracy'][-1] > history['accuracy'][0]
def test_spam_prediction():
"""Test predictions on obvious cases."""
# Train on clear examples
texts = [
"FREE money click here",
"Win prize now",
"Cheap meds buy now",
"Hey mom dinner tonight",
"Meeting at 3pm",
"Thanks for your help",
]
labels = [1, 1, 1, 0, 0, 0]
filter = SpamFilter(max_vocab_size=50)
filter.fit(texts, labels, epochs=100, learning_rate=1.0)
# Test predictions
assert filter.predict("Free prize win") > 0.5 # Should be spam
assert filter.predict("Thanks for the meeting") < 0.5 # Should be ham
Accuracy, Precision, Recall
def test_evaluation_metrics():
"""Test that metrics are computed correctly."""
# Create a model with known predictions
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0])
y_pred = np.array([1, 1, 0, 0, 0, 0, 1, 1])
# TP TP FN FN TN TN FP FP
# Manual calculation:
# TP = 2, FN = 2, TN = 2, FP = 2
# Accuracy = (TP + TN) / 8 = 4/8 = 0.5
# Precision = TP / (TP + FP) = 2/4 = 0.5
# Recall = TP / (TP + FN) = 2/4 = 0.5
# F1 = 2 * 0.5 * 0.5 / (0.5 + 0.5) = 0.5
# compute_metrics is assumed to take (y_true, y_pred) and return the same dict of metrics as evaluate()
metrics = compute_metrics(y_true, y_pred)
assert metrics['accuracy'] == 0.5
assert metrics['precision'] == 0.5
assert metrics['recall'] == 0.5
assert metrics['f1'] == 0.5
Common Pitfalls and Debugging Tips
1. Numerical Instability in Sigmoid
Problem: exp(-z) overflows for large negative z.
# BAD: Will overflow for z = -1000
def sigmoid_naive(z):
return 1 / (1 + np.exp(-z))
# GOOD: Numerically stable
def sigmoid(z):
z = np.clip(z, -500, 500) # Prevent overflow
return np.where(z >= 0,
1 / (1 + np.exp(-z)),
np.exp(z) / (1 + np.exp(z)))
2. Log of Zero in Cross-Entropy
Problem: log(0) is negative infinity.
# BAD: Will crash if y_pred is exactly 0 or 1
loss = -np.mean(y_true * np.log(y_pred))
# GOOD: Clip predictions away from 0 and 1
epsilon = 1e-15
y_pred = np.clip(y_pred, epsilon, 1 - epsilon)
loss = -np.mean(y_true * np.log(y_pred))
3. Vocabulary Mismatch Between Training and Inference
Problem: New words in test data not in vocabulary.
def text_to_vector(text, vocab):
tokens = preprocess(text)
vector = np.zeros(len(vocab))
for token in tokens:
if token in vocab: # IMPORTANT: Check if word exists!
vector[vocab[token]] += 1
# else: Ignore unknown words
return vector
4. Class Imbalance
Problem: 90% ham, 10% spam. Model predicts all ham.
# Check class distribution
print(f"Spam: {sum(labels)} ({sum(labels)/len(labels):.1%})")
print(f"Ham: {len(labels) - sum(labels)}")
# Solutions:
# 1. Class weights: Penalize mistakes on minority class more
class_weights = {0: 1.0, 1: 9.0} # Weight spam 9x more
# 2. Oversampling: Duplicate minority class samples
# 3. Undersampling: Remove majority class samples
# 4. SMOTE: Synthetic minority oversampling
5. Learning Rate Too High
Symptom: Loss oscillates wildly or increases.
Epoch 1: Loss = 0.693
Epoch 2: Loss = 2.345 <-- Went UP!
Epoch 3: Loss = 1.234
Epoch 4: Loss = 5.678 <-- Wild oscillation
Fix: Reduce learning rate by 10x. Start with 0.01 or 0.001.
6. Not Normalizing Features
Problem: Word counts have very different scales.
Word "the": appears 50 times
Word "free": appears 2 times
If you don't normalize:
- "the" dominates the gradient
- Rare but important words ("free") are ignored
Solutions:
# Binary encoding (presence/absence, not count)
vector = (vector > 0).astype(float)
# TF-IDF weighting (bonus challenge)
# Term frequency * inverse document frequency
Interview Questions
Conceptual Questions
Q1: "Why do we use sigmoid instead of just thresholding?"
Expected answer: Sigmoid produces probabilities between 0 and 1, which allows us to:
- Interpret outputs as confidence levels
- Set custom decision thresholds based on the application
- Use gradient descent because sigmoid is differentiable (threshold is not)
- Combine multiple models by averaging probabilities
Q2: "Why is cross-entropy better than MSE for classification?"
Expected answer:
- Cross-entropy loss gradient is (p - y), which doesn't vanish when the model is confident but wrong
- MSE gradient involves sigma'(z), which approaches zero for large |z|, causing vanishing gradients
- Cross-entropy penalizes confident wrong answers much more heavily (goes to infinity)
- Cross-entropy has a probabilistic interpretation (negative log-likelihood)
Q3: "Explain the Bag of Words representation. What are its limitations?"
Expected answer: BoW represents text as a vector of word counts/frequencies.
Limitations:
- Ignores word order ("not good" and "good not" are identical)
- Ignores semantics/meaning
- Creates very sparse, high-dimensional vectors
- Can't handle out-of-vocabulary words
- No understanding of synonyms or context
Technical Questions
Q4: "How would you handle a word like 'free' that appears in both spam and ham?"
Expected answer: The model learns the optimal weight based on training data. If "free" appears in 90% of spam but only 10% of ham, it will get a positive weight (spam-indicative). The weight roughly corresponds to how much the word shifts the log-odds of spam, on the order of log(P(spam|free) / P(ham|free)).
Q5: "Your model has 95% accuracy but 0% recall. What's happening?"
Expected answer: Class imbalance. If 95% of emails are ham, the model can achieve 95% accuracy by predicting ham for everything. Recall is 0 because it catches no spam.
Solutions:
- Use class weights to penalize spam misses more
- Oversample the minority class
- Use F1-score or balanced accuracy instead of accuracy
- Lower the decision threshold
Q6: "How do you choose the vocabulary size?"
Expected answer: Trade-off between:
- Too small: Miss important words
- Too large: Overfit to rare words, slow training, high memory
Typical approach:
- Keep top N most frequent words (5,000-10,000)
- Remove words appearing in < K documents (min_df)
- Remove words appearing in > X% of documents (max_df)
Coding Questions
Q7: "Implement the sigmoid derivative."
def sigmoid_derivative(z):
s = sigmoid(z)
return s * (1 - s)
Q8: "Write code to compute precision, recall, and F1."
def precision_recall_f1(y_true, y_pred):
tp = sum((p == 1) and (t == 1) for p, t in zip(y_pred, y_true))
fp = sum((p == 1) and (t == 0) for p, t in zip(y_pred, y_true))
fn = sum((p == 0) and (t == 1) for p, t in zip(y_pred, y_true))
precision = tp / (tp + fp) if (tp + fp) > 0 else 0
recall = tp / (tp + fn) if (tp + fn) > 0 else 0
f1 = 2 * precision * recall / (precision + recall) if (precision + recall) > 0 else 0
return precision, recall, f1
Hints in Layers
For when you're stuck, reveal hints progressively:
Layer 1: Getting Started
Hint: How do I structure the project?
Start with this skeleton:
# spam_filter.py
import numpy as np
from collections import Counter
# 1. Preprocessing
def preprocess(text):
pass
# 2. Vocabulary
def build_vocabulary(texts, max_vocab_size):
pass
# 3. Vectorization
def text_to_vector(text, vocab):
pass
# 4. Sigmoid
def sigmoid(z):
pass
# 5. Loss
def cross_entropy_loss(y_true, y_pred):
pass
# 6. Model class
class SpamFilter:
def fit(self, texts, labels):
pass
def predict(self, text):
pass
# 7. Main
if __name__ == "__main__":
# Load data, train, evaluate
pass
Hint: What dataset should I use?
Use the UCI SMS Spam Collection:
- Download from: https://archive.ics.uci.edu/ml/datasets/SMS+Spam+Collection
- 5,574 SMS messages labeled "spam" or "ham"
- Simple tab-separated format: "label\ttext"
Or use the Enron email dataset for a more realistic challenge.
Layer 2: Preprocessing Issues
Hint: My vocabulary is too large
Reduce vocabulary size by:
- Lowercasing (already done)
- Removing stopwords: {"the", "a", "an", "is", "are", "was", ...}
- Keeping only top N most frequent words
- Removing words that appear in fewer than K documents
# Example stopwords
STOPWORDS = {
'the', 'a', 'an', 'is', 'are', 'was', 'were', 'be', 'been',
'being', 'have', 'has', 'had', 'do', 'does', 'did', 'will',
'would', 'could', 'should', 'may', 'might', 'must', 'shall',
'can', 'to', 'of', 'in', 'for', 'on', 'with', 'at', 'by',
'from', 'as', 'into', 'through', 'during', 'before', 'after',
'above', 'below', 'between', 'under', 'again', 'further',
'then', 'once', 'here', 'there', 'when', 'where', 'why',
'how', 'all', 'each', 'few', 'more', 'most', 'other', 'some',
'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so',
'than', 'too', 'very', 'just', 'and', 'but', 'if', 'or',
'because', 'until', 'while', 'this', 'that', 'these', 'those',
}
Layer 3: Training Issues
Hint: My loss is not decreasing
Check these:
- Learning rate: Try 0.01, 0.1, 1.0 - find what works
- Feature scale: Normalize vectors if using counts
- Gradient check: Print gradients to ensure they're not zero or NaN
- Initial weights: Try small random values instead of zeros
# Debug: Print gradient magnitudes
dw, db = model.compute_gradients(X, y, y_pred)
print(f"dw mean: {np.mean(np.abs(dw)):.6f}")
print(f"db: {db:.6f}")
Hint: My loss is NaN or Inf
Numerical stability issues:
- Clip sigmoid input: z = np.clip(z, -500, 500)
- Clip predictions for log: y_pred = np.clip(y_pred, 1e-15, 1-1e-15)
- Use the stable sigmoid implementation
# Check for NaN/Inf
if np.isnan(loss) or np.isinf(loss):
print(f"z range: {z.min()} to {z.max()}")
print(f"p range: {y_pred.min()} to {y_pred.max()}")
Layer 4: Evaluation Issues
Hint: High accuracy but low recall
This is the class imbalance problem. Solutions:
# 1. Class weights in gradient
# Weight the gradient by class frequency
weights = np.where(y == 1, n_ham / n_spam, 1.0)
weighted_error = (y_pred - y) * weights
# 2. Lower the threshold
# Instead of 0.5, try 0.3 or lower
y_pred = (y_proba >= 0.3).astype(int)
# 3. Use different metric for evaluation
# Optimize for F1 or balanced accuracy instead of accuracy
Layer 5: Advanced Issues
Hint: How do I know if my model is overfitting?
Split your data:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42, stratify=y
)
# Train on the train split, evaluate on both splits.
# Note: X here is the bag-of-words matrix, so use the LogisticRegression
# model directly (SpamFilter.fit expects raw texts, not vectors).
model = LogisticRegression(X_train.shape[1])
model.fit(X_train, y_train)
train_metrics = evaluate(model, X_train, y_train)
test_metrics = evaluate(model, X_test, y_test)
print(f"Train accuracy: {train_metrics['accuracy']:.2%}")
print(f"Test accuracy: {test_metrics['accuracy']:.2%}")
# If train >> test, you're overfitting
# Solutions: Smaller vocabulary, regularization, more data
Extensions and Challenges
Extension 1: Implement TF-IDF Weighting
Bag of Words treats all word occurrences equally. TF-IDF weights by importance:
TF-IDF = Term Frequency * Inverse Document Frequency
TF(word, doc) = Count of word in doc / Total words in doc
IDF(word) = log(Total docs / Docs containing word)
Example:
- "the" appears in 95% of documents: IDF = log(100/95) = 0.05 (low weight)
- "viagra" appears in 2% of documents: IDF = log(100/2) = 3.9 (high weight)
from collections import Counter
import numpy as np
def compute_idf(texts, vocab):
"""Compute IDF for each word in vocabulary."""
n_docs = len(texts)
doc_counts = Counter()
for text in texts:
unique_words = set(preprocess(text))
doc_counts.update(unique_words)
idf = {}
for word, idx in vocab.items():
df = doc_counts.get(word, 1) # Avoid division by zero
idf[word] = np.log(n_docs / df)
return idf
def text_to_tfidf(text, vocab, idf):
"""Convert text to TF-IDF vector."""
tokens = preprocess(text)
tf = Counter(tokens)
total = len(tokens)
vector = np.zeros(len(vocab))
for token in tokens:
if token in vocab:
tf_score = tf[token] / total
idf_score = idf.get(token, 1)
vector[vocab[token]] = tf_score * idf_score
return vector
Extension 2: Multi-Class Classification
Extend from binary (SPAM/HAM) to multiple categories:
Categories: [SPAM, PROMO, IMPORTANT, NORMAL]
Instead of sigmoid (binary), use softmax (multi-class):
softmax(z_i) = exp(z_i) / sum(exp(z_j))
This gives a probability distribution over all classes.
def softmax(z):
"""Multi-class sigmoid: softmax."""
# Subtract max for numerical stability
z = z - np.max(z, axis=-1, keepdims=True)
exp_z = np.exp(z)
return exp_z / np.sum(exp_z, axis=-1, keepdims=True)
# Now weights are a matrix: (vocab_size, n_classes)
# Output is a vector of probabilities: (n_classes,)
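A quick usage check of the softmax above (a minimal sketch; the scores are made-up values):
import numpy as np
z = np.array([2.0, 1.0, 0.1, -1.0])     # raw scores for [SPAM, PROMO, IMPORTANT, NORMAL]
probs = softmax(z)
print(probs)                             # about [0.64, 0.23, 0.10, 0.03]
print(probs.sum())                       # 1.0 -- a valid probability distribution
print(probs.argmax())                    # 0 -> SPAM is the most likely class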
Extension 3: Character-Level Features
Instead of words, use character n-grams:
Word-level: "free" -> ["free"]
Char-level (n=3): "free" -> ["fre", "ree"]
Advantage: Handles misspellings and obfuscation
"fr33" (word-level) -> Unknown word, ignored
"fr33" (char-level) -> ["fr3", "r33"] - might still match spam patterns
Extension 4: Regularization
Prevent overfitting with L2 regularization:
# Add penalty for large weights
L2_lambda = 0.01
# Modified loss
loss = cross_entropy + (L2_lambda / 2) * np.sum(weights ** 2)
# Modified gradient
dw = gradient + L2_lambda * weights
Extension 5: Learning Curves
Visualize how the model learns:
import matplotlib.pyplot as plt
def plot_learning_curves(history):
fig, axes = plt.subplots(1, 2, figsize=(12, 4))
axes[0].plot(history['loss'])
axes[0].set_title('Loss Over Epochs')
axes[0].set_xlabel('Epoch')
axes[0].set_ylabel('Cross-Entropy Loss')
axes[1].plot(history['accuracy'])
axes[1].set_title('Accuracy Over Epochs')
axes[1].set_xlabel('Epoch')
axes[1].set_ylabel('Accuracy')
plt.tight_layout()
plt.savefig('learning_curves.png')
Real-World Connections
Gmail Spam Filter
Gmailâs spam filter is vastly more sophisticated, but it builds on these principles:
Gmail's Approach (Simplified):
1. Text features (like BoW, but much more advanced)
2. Sender reputation (history of sending spam)
3. Link analysis (known malicious URLs)
4. User behavior (what users mark as spam)
5. Network analysis (patterns across millions of users)
6. Deep learning models (not just logistic regression)
But the core idea is the same:
Extract features -> Compute weighted sum -> Apply activation -> Predict
Content Moderation
The same classification approach powers:
- Toxic comment detection
- Hate speech filtering
- Fake news detection
- Phishing email detection
Recommendation Systems
Binary classification underpins:
- "Will user click this ad?" (Click-through rate prediction)
- "Will user like this movie?" (Binary like/dislike)
- "Will user churn?" (Customer retention)
Books That Will Help
| Book | Author(s) | Relevance | Key Chapters |
|---|---|---|---|
| Grokking Deep Learning | Andrew Trask | Primary reference for this project | Ch. 3: Forward Propagation, Ch. 5: Gradient Descent |
| Pattern Recognition and Machine Learning | Christopher Bishop | Theoretical foundations | Ch. 4: Linear Models for Classification |
| Machine Learning: A Probabilistic Perspective | Kevin Murphy | Rigorous probability theory | Ch. 8: Logistic Regression |
| Speech and Language Processing | Jurafsky & Martin | NLP fundamentals | Ch. 4: Naive Bayes and Sentiment |
| Deep Learning | Goodfellow, Bengio, Courville | Modern deep learning bible | Ch. 6.2: Activation Functions |
Reading Order Recommendation
- Start with Grokking Deep Learning Ch. 3 - Intuitive introduction to forward propagation
- Then read Speech and Language Processing Ch. 4 - Text classification context
- Reference Pattern Recognition Ch. 4.3 - Mathematical derivation of logistic regression
- Deep dive Deep Learning Ch. 6 - Modern perspective on activation functions and loss
Self-Assessment Checklist
Conceptual Understanding
- Explain why classification uses sigmoid while regression doesn't
- Draw the sigmoid curve and mark the decision boundary
- Derive the gradient of cross-entropy loss with sigmoid
- Explain why cross-entropy is better than MSE for classification
- Describe what Bag of Words loses from the original text
Implementation Skills
- Implement numerically stable sigmoid
- Implement cross-entropy loss with proper clipping
- Build vocabulary from a corpus of texts
- Convert text to BoW vectors
- Train a logistic regression model from scratch
- Compute accuracy, precision, recall, and F1-score
Practical Application
- Achieve >95% accuracy on the SMS Spam dataset
- Handle class imbalance appropriately
- Explain what the model learned (top spam/ham words)
- Debug training issues (NaN loss, non-decreasing loss)
- Split data into train/test sets properly
Extensions Attempted
- Implement TF-IDF weighting
- Try binary features instead of counts
- Add regularization
- Plot learning curves
- Experiment with different vocabulary sizes
Key Insights
Classification is not a minor variation of regression. The change from MSE to cross-entropy, and the addition of sigmoid, fundamentally changes how the model learns. Don't treat logistic regression as "linear regression with an extra step."
Text is just patterns of numbers. The machine has no understanding of language. It sees word frequencies and learns correlations. This is both humbling (AI doesn't "understand") and empowering (simple math can achieve impressive results).
The sigmoid-cross-entropy combination is elegant. The gradient simplifies to (p - y), which is remarkably clean. This mathematical elegance is one reason logistic regression has stood the test of time.
Class imbalance will break your model. Always check your class distribution. A model that predicts the majority class for everything will have high accuracy but zero utility.
Preprocessing matters. The quality of your text cleaning (tokenization, stopwords, normalization) often matters more than the model complexity. Garbage in, garbage out.
Connecting Forward
This project builds directly on Project 3 (Linear Regression) by adding:
- Sigmoid activation for probability outputs
- Cross-entropy loss for classification
- Text preprocessing and vectorization
The next step, Project 5 (Autograd Engine), will show you how to automate gradient computation. Instead of manually deriving dL/dw = (p - y) * x, you'll build a system that computes gradients automatically for any computational graph.
Project 6 (Fraud Detection MLP) will extend classification to non-linear problems by adding hidden layers. When a single sigmoid can't separate the data, you'll stack layers to learn complex decision boundaries.
After completing this project, you'll understand the fundamental building block of classification: converting raw data into probabilities through learned weights. Every spam filter, content moderator, and recommendation system builds on this foundation.