Project 4: The Spam Filter (Logistic Regression)

Build a text classifier that reads emails and predicts “Spam” or “Ham” (Not Spam) using Sigmoid Activation and Cross-Entropy Loss


Project Overview

Attribute        Value
------------------------------------------------------------------
Difficulty       Level 2: Intermediate
Time Estimate    1 Week
Language         Python
Prerequisites    Project 3 (Linear Regression), Basic NumPy
Main Reference   “Grokking Deep Learning” by Andrew Trask, Chapter 3
Knowledge Area   Classification / Probability / NLP Basics

Learning Objectives

After completing this project, you will be able to:

  1. Distinguish regression from classification - Understand why predicting categories requires different math than predicting continuous values
  2. Implement the Sigmoid function - Write 1 / (1 + e^-z) and understand its properties
  3. Explain probability interpretation - Know why sigmoid outputs are probabilities between 0 and 1
  4. Derive Cross-Entropy Loss - Understand why MSE fails for classification and how log loss penalizes confident wrong answers
  5. Build a Bag of Words representation - Convert raw text into numerical vectors the machine can process
  6. Train a binary classifier - Implement gradient descent for logistic regression
  7. Evaluate with proper metrics - Calculate accuracy, precision, recall, and F1-score

The Core Question You’re Answering

“How does a computer understand ‘concepts’ like Spam?”

The short answer: It doesn’t.

A computer has no concept of “spam” or “ham.” It doesn’t understand language, context, or intent. What it can do is count:

  • How often does the word “FREE” appear in spam emails? Very often.
  • How often does “FREE” appear in legitimate emails? Rarely.
  • How often does “mom” appear in spam? Almost never.
  • How often does “mom” appear in legitimate emails? Sometimes.

This is the profound insight behind machine learning: understanding is approximated by statistics. The classifier learns that certain word combinations correlate with certain labels. It’s pattern matching at scale, not comprehension.

The Machine's "Understanding" of Spam:

Human View:                          Machine View:
"Buy cheap meds now!!"         -->   [buy:1, cheap:1, meds:1, now:1, click:0, mom:0, dinner:0, ...]
                                     Dot product with weights --> 4.5
                                     Sigmoid(4.5) --> 0.989
                                     0.989 > 0.5 --> SPAM

The machine sees only:
- A vector of numbers (word counts)
- A weighted sum (how "spam-like" the vector is)
- A probability (how confident the prediction is)

It has no idea what "cheap meds" means. It only knows the pattern.

This project teaches you to build this statistical pattern matcher from scratch.
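
Here is that pipeline as a minimal sketch, with a toy vocabulary and hand-picked illustrative weights (the real vocabulary and weights are learned in the phases below):

import math

# Toy vocabulary and made-up weights, for illustration only
vocab   = ["buy", "cheap", "meds", "now", "mom", "dinner"]
weights = [ 1.2,   1.1,    1.4,    0.8,  -1.5,  -1.3]
bias    = 0.0

def spam_score(tokens):
    # Bag of words: count each vocabulary word in the email
    counts = [tokens.count(word) for word in vocab]
    # Weighted sum, then squash through the sigmoid
    z = sum(w * c for w, c in zip(weights, counts)) + bias
    return 1 / (1 + math.exp(-z))

p = spam_score("buy cheap meds now".split())
print(f"P(spam) = {p:.3f} -> {'SPAM' if p >= 0.5 else 'HAM'}")
# P(spam) = 0.989 -> SPAM  (z = 4.5, matching the diagram above)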


Concepts You Must Understand First

Before writing code, you need mental models for these foundational concepts:

1. Why Classification Differs from Regression

Regression predicts a continuous value: “This house costs $450,000.”

Classification predicts a discrete category: “This email is SPAM.”

The difference isn’t superficial. It fundamentally changes the math:

Regression (Project 3):                Classification (This Project):
  Output: Any real number                Output: 0 or 1 (Binary)
  Example: -5.3, 0, 100.7, ...          Example: SPAM (1) or HAM (0)

  Loss: Mean Squared Error               Loss: Cross-Entropy (Log Loss)
  L = (y - y_hat)^2                      L = -[y*log(p) + (1-y)*log(1-p)]

  Why MSE?                               Why Cross-Entropy?
  - Punishes big errors more             - Punishes confident wrong answers
  - Smooth, differentiable               - Designed for probabilities
  - Makes sense for continuous           - MSE breaks for classification
    targets                                (vanishing gradients)

Book Reference: “Pattern Recognition and Machine Learning” by Christopher Bishop, Chapter 4.3 covers the theoretical foundations of logistic regression.

2. The Sigmoid Function and Its Properties

The sigmoid function “squashes” any real number into the range (0, 1):

         Sigmoid: sigma(z) = 1 / (1 + e^(-z))

                    1.0 ___________________________
                        |                   ------
                        |               ----
                    0.5 |           ----
                        |       ----
                        |   ----
                    0.0 |---------------------------
                       -6  -4  -2   0   2   4   6
                                   z

Key Properties:
- Domain: All real numbers (-inf, +inf)
- Range: (0, 1) - Perfect for probabilities!
- sigma(0) = 0.5 (decision boundary)
- sigma(-z) = 1 - sigma(z) (symmetric)
- Derivative: sigma'(z) = sigma(z) * (1 - sigma(z))
  This simple derivative makes gradient computation elegant.

Why Sigmoid for Classification?
1. Output is always between 0 and 1 (valid probability)
2. Large positive z --> output near 1 (high confidence positive)
3. Large negative z --> output near 0 (high confidence negative)
4. z near 0 --> output near 0.5 (uncertain)
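
A quick numeric check of these properties with plain Python floats (the numerically stable, vectorized version is built in Phase 4):

import math

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

print(sigmoid(0))                        # 0.5 (the decision boundary)
print(sigmoid(2), 1 - sigmoid(-2))       # symmetry: both ~0.881
# Derivative check: sigma'(z) = sigma(z) * (1 - sigma(z))
z, h = 1.0, 1e-6
numeric  = (sigmoid(z + h) - sigmoid(z - h)) / (2 * h)
analytic = sigmoid(z) * (1 - sigmoid(z))
print(numeric, analytic)                 # both ~0.1966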

ASCII Art - Sigmoid in Detail:

            sigma(z) = 1 / (1 + e^(-z))

    1.0 |                              . . . . . . . .
        |                         . .
        |                      .
    0.8 |                    .
        |                  .
        |                 .
    0.6 |               .
        |              .
        |             .
    0.5 |. . . . . . +  (Decision Boundary: sigma(0) = 0.5)
        |           .
        |          .
    0.4 |         .
        |        .
        |       .
    0.2 |     .
        |   .
        | .
    0.0 |. . .
        +---------------------------------------------------
           -6    -4    -2     0     2     4     6     z

    Interpretation:
    z = -6: sigma(-6) = 0.002  --> 0.2% chance of SPAM (very confident HAM)
    z = -2: sigma(-2) = 0.119  --> 11.9% chance of SPAM
    z =  0: sigma(0)  = 0.500  --> 50% (totally uncertain)
    z =  2: sigma(2)  = 0.881  --> 88.1% chance of SPAM
    z =  6: sigma(6)  = 0.998  --> 99.8% chance of SPAM (very confident)

Book Reference: “Deep Learning” by Goodfellow, Bengio, Courville, Section 6.2.2.2 covers sigmoid and its variants.

3. Probability Interpretation of Outputs

The sigmoid output isn’t just a number between 0 and 1. It has a precise probabilistic meaning:

p = sigma(w * x + b)

This p is the model's estimate of:
  P(y = 1 | x) = "The probability that the email is SPAM given the features x"

The complement:
  P(y = 0 | x) = 1 - p = "The probability that the email is HAM"

Example:
  Email: "Cheap meds! Buy now!"
  Features: [cheap:1, meds:1, buy:1, now:1, mom:0, ...]

  After training:
  z = w * x + b = 4.2
  p = sigma(4.2) = 0.985

  Interpretation: "I am 98.5% confident this is SPAM"

  The remaining 1.5% represents:
  - Uncertainty in the model
  - Possible edge cases
  - Training data limitations

The Decision Boundary:

When p = 0.5, the model is exactly uncertain.
This happens when z = 0, i.e., when w * x + b = 0.

The decision rule:
  If p >= 0.5: Predict SPAM (class 1)
  If p <  0.5: Predict HAM  (class 0)

In practice, you might adjust this threshold:
  - High-stakes spam: Lower threshold (catch more spam, more false positives)
  - Important emails: Higher threshold (catch less spam, fewer false positives)
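
A small sketch of how the threshold changes the decision for a borderline probability (the numbers are illustrative):

def classify(p, threshold=0.5):
    return "SPAM" if p >= threshold else "HAM"

p = 0.42  # mildly suspicious, but below the default threshold
print(classify(p))                  # HAM  (default 0.5)
print(classify(p, threshold=0.3))   # SPAM (aggressive: catch more spam, risk false positives)
print(classify(p, threshold=0.7))   # HAM  (conservative: protect legitimate email)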

Book Reference: “Machine Learning: A Probabilistic Perspective” by Kevin Murphy, Chapter 8 provides rigorous probability theory for classification.

4. Cross-Entropy Loss Derivation

Why MSE Fails for Classification:

Consider: True label y = 1 (SPAM), Predicted p = 0.01 (1% confident)

MSE Loss: (y - p)^2 = (1 - 0.01)^2 = 0.98

Now consider: p = 0.0001 (0.01% confident)
MSE Loss: (1 - 0.0001)^2 = 0.9998

The loss barely changed! MSE doesn't adequately punish
confident wrong predictions.

Gradient problem:
MSE gradient involves sigma'(z), which is tiny when z is large.
When the model is very confident (large |z|), the gradient vanishes,
and learning stops precisely when we need it most.

Cross-Entropy Loss:

L = -[y * log(p) + (1-y) * log(1-p)]

Case 1: True label y = 1 (SPAM)
  L = -log(p)
  If p = 0.99: L = -log(0.99) = 0.01  (small loss, good!)
  If p = 0.50: L = -log(0.50) = 0.69  (medium loss)
  If p = 0.01: L = -log(0.01) = 4.61  (huge loss!)

Case 2: True label y = 0 (HAM)
  L = -log(1-p)
  If p = 0.01: L = -log(0.99) = 0.01  (small loss, good!)
  If p = 0.50: L = -log(0.50) = 0.69  (medium loss)
  If p = 0.99: L = -log(0.01) = 4.61  (huge loss!)

Visualizing Cross-Entropy:

Loss when y = 1 (True SPAM):           Loss when y = 0 (True HAM):

    L = -log(p)                            L = -log(1-p)

  5 |.                                   5 |                              .
    | .                                    |                             .
  4 |  .                                 4 |                            .
    |   .                                  |                           .
  3 |    .                               3 |                          .
    |     .                                |                         .
  2 |       .                            2 |                       .
    |        ..                            |                     ..
  1 |           ..                       1 |                  ..
    |              ....                    |             ....
  0 |______________________              0 |______________________
    0    0.5    1.0                        0    0.5    1.0
         p (predicted)                          p (predicted)

Confident and correct = Low loss
Confident and WRONG = Massive loss (goes to infinity!)

The Beautiful Gradient Simplification:

When you combine sigmoid with cross-entropy, magic happens:

Forward pass:
  z = w * x + b
  p = sigma(z)
  L = -[y * log(p) + (1-y) * log(1-p)]

Backward pass (derivation):
  dL/dp = -y/p + (1-y)/(1-p)
  dp/dz = p * (1 - p)  (sigmoid derivative)

  dL/dz = dL/dp * dp/dz
        = (-y/p + (1-y)/(1-p)) * p * (1-p)
        = -y*(1-p) + (1-y)*p
        = p - y

That's it! The gradient is simply: p - y (prediction minus truth)

dL/dw = (p - y) * x
dL/db = (p - y)

This elegant simplification is one reason logistic regression is so popular.
No vanishing gradients. No complex derivatives. Just (p - y).
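
You can confirm the simplification with a finite-difference check on dL/dz; a minimal sketch with scalar values:

import math

def loss_from_z(z, y):
    # Cross-entropy of sigmoid(z) against the label y
    p = 1 / (1 + math.exp(-z))
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

z, y, h = 1.3, 1, 1e-6
p = 1 / (1 + math.exp(-z))

numeric  = (loss_from_z(z + h, y) - loss_from_z(z - h, y)) / (2 * h)
analytic = p - y   # the claimed gradient
print(numeric, analytic)   # both ~ -0.2142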

Book Reference: “Pattern Recognition and Machine Learning” by Christopher Bishop, Section 4.3.2 derives the cross-entropy gradient.

5. Bag of Words Representation

Text is strings. Machines need numbers. Bag of Words (BoW) is the simplest bridge:

Step 1: Build a vocabulary from all training emails

Training emails:
  "Buy cheap meds now"
  "Hey mom, dinner tonight?"
  "FREE money click here"
  "Are we still on for lunch?"

Vocabulary: {buy, cheap, meds, now, hey, mom, dinner, tonight,
             free, money, click, here, are, we, still, on, for, lunch}

Index map:
  buy=0, cheap=1, meds=2, now=3, hey=4, mom=5, dinner=6, ...

Step 2: Convert each email to a vector (word counts)

"Buy cheap meds now"
  --> [1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
       buy cheap meds now (rest are zeros)

"Hey mom, dinner tonight?"
  --> [0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]

Step 3: The classifier sees only these vectors, not the text

Email text: "CHEAP MEDS FREE CHEAP"
                    |
                    v
              [0, 2, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
                   ^       ^          ^
                   |       |          |
               cheap=2   meds=1    free=1

What Bag of Words Loses:

These two sentences have IDENTICAL Bag of Words vectors:
  "The cat sat on the mat"
  "The mat sat on the cat"

BoW ignores:
- Word order
- Grammar
- Context
- Semantics

BoW keeps:
- Word presence/frequency

Despite these limitations, BoW works surprisingly well for spam detection
because spam has distinctive vocabulary regardless of order.
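
You can verify the loss of word order in a couple of lines with a Counter:

from collections import Counter

a = Counter("the cat sat on the mat".split())
b = Counter("the mat sat on the cat".split())
print(a == b)   # True: identical bag-of-words counts, very different meanings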

Book Reference: “Speech and Language Processing” by Jurafsky & Martin, Chapter 4 covers text representation.

6. Tokenization and Vocabulary Building

Before BoW, you need to clean and split the text:

Raw email: "BUY cheap M3DS now!! Click HERE for $$$ savings..."

Step 1: Lowercase
  "buy cheap m3ds now!! click here for $$$ savings..."

Step 2: Remove punctuation
  "buy cheap m3ds now click here for savings"

Step 3: Tokenize (split on whitespace)
  ["buy", "cheap", "m3ds", "now", "click", "here", "for", "savings"]

Step 4: (Optional) Remove stopwords
  Common words like "the", "a", "for", "is" carry little meaning.
  ["buy", "cheap", "m3ds", "click", "savings"]

Step 5: (Optional) Stemming/Lemmatization
  Reduce words to roots: "savings" -> "save", "clicking" -> "click"
  ["buy", "cheap", "m3ds", "click", "save"]

For this project, we'll do basic preprocessing:
  1. Lowercase
  2. Split on whitespace
  3. Remove punctuation
  4. Optionally remove stopwords

Deep Theoretical Foundation

From Linear to Logistic Regression

Logistic regression is linear regression with a twist:

Linear Regression (Project 3):
  y_hat = w * x + b
  Output: Any real number
  Use case: Predicting prices, temperatures, etc.

Logistic Regression (This Project):
  z = w * x + b          (Same linear combination!)
  p = sigma(z)           (Squash through sigmoid)
  Output: Number between 0 and 1
  Use case: Predicting probabilities, classifications

The only difference is adding sigmoid to the output.
Everything else (gradient descent, weight updates) works the same way,
just with different gradients.

                Linear Regression              Logistic Regression

                +-------------+               +-------------+
   x ---------> | w * x + b   | ---> y_hat    | w * x + b   | ---> z
                +-------------+               +-------------+
                                                    |
                                                    v
                                              +-------------+
                                              |  sigmoid(z) | ---> p
                                              +-------------+

Why Sigmoid? The Deep Reasons

  1. Probability Bounds: Outputs are always valid probabilities (0 to 1)
  2. Monotonic: Higher z always means higher probability
  3. Smooth and Differentiable: Allows gradient-based optimization
  4. Nice Gradient: sigma'(z) = sigma(z) * (1 - sigma(z)) is beautiful
  5. Natural Log-Odds Interpretation: z = log(p / (1-p)) (log-odds or logit)

The Log-Odds Connection:

If p = sigma(z), then z = logit(p) = log(p / (1-p))

Example: If an email is 80% likely to be spam (p = 0.8)
  log-odds = log(0.8 / 0.2) = log(4) = 1.39

This means: "The email is e^1.39 = 4x more likely to be spam than ham"

The linear model z = w * x + b directly computes log-odds.
Each weight tells you how much that feature changes the log-odds.
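
A short sketch of the logit/sigmoid round trip for the p = 0.8 example above:

import math

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

def logit(p):
    return math.log(p / (1 - p))

p = 0.8
z = logit(p)
print(z)              # ~1.39 = log(4)
print(math.exp(z))    # ~4.0  (the odds: 4x more likely spam than ham)
print(sigmoid(z))     # 0.8   (round trip back to the probability)
# In the model, each occurrence of a word with weight w adds w to this log-odds.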

Cross-Entropy: Penalizing Confident Wrong Answers

The asymmetry of cross-entropy is intentional:

Loss Comparison: True label = 1 (SPAM)

Prediction p    MSE Loss       Cross-Entropy Loss
-----------------------------------------------
0.99            0.0001         0.01
0.90            0.01           0.11
0.70            0.09           0.36
0.50            0.25           0.69
0.30            0.49           1.20
0.10            0.81           2.30
0.01            0.98           4.61
0.001           0.998          6.91
0.0001          0.9998         9.21

Notice:
- MSE plateaus around 1.0 for wrong predictions
- Cross-Entropy goes to INFINITY as prediction goes to 0
- Cross-Entropy heavily punishes confident wrong answers

Why This Matters for Learning:

Scenario: The model sees a spam email and predicts 0.01 (very confident HAM)

With MSE:
  Loss = (1 - 0.01)^2 = 0.98
  Gradient is relatively small
  Model makes a tiny update
  Learning is slow

With Cross-Entropy:
  Loss = -log(0.01) = 4.61
  Gradient = (p - y) * x = (0.01 - 1) * x = -0.99 * x
  Large gradient, large update
  Model quickly corrects its mistake

Cross-entropy forces the model to take confident mistakes seriously.
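
A sketch comparing the two gradients with respect to z in this exact scenario (y = 1, p = 0.01):

p, y = 0.01, 1

# MSE: L = (y - p)^2 and dp/dz = p * (1 - p), so dL/dz = 2 * (p - y) * p * (1 - p)
mse_grad = 2 * (p - y) * p * (1 - p)

# Cross-entropy with sigmoid: dL/dz = p - y
ce_grad = p - y

print(mse_grad)   # ~ -0.0196  (tiny: learning crawls)
print(ce_grad)    # -0.99      (about 50x larger: the model corrects quickly)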

Decision Boundary Visualization

The decision boundary is where p = 0.5, which means z = 0:

With two features (x1 and x2):
  z = w1*x1 + w2*x2 + b = 0

  This is a LINE in 2D space.

  Solving for x2:
  x2 = (-w1*x1 - b) / w2

Example: w1 = 2, w2 = 1, b = -3
  Boundary: 2*x1 + x2 - 3 = 0
            x2 = -2*x1 + 3

       x2
        |
      5 |                                 Key:
        |                                 . = Decision boundary (z = 0)
      4 |    X         X                  X = SPAM emails (z > 0)
        |                                 O = HAM emails (z < 0)
      3 |.       X     (boundary meets the x2-axis at x2 = 3)
        | .
      2 |  .       X         X
        | O .
      1 |   O .          X
        | O    .
      0 |_______.________________ x1
        0     1    2    3    4

  SPAM region (z > 0): above the boundary line (e.g., x1 = 2, x2 = 2 gives z = 3)
  HAM region (z < 0): below the boundary line (e.g., x1 = 0.5, x2 = 1 gives z = -1)
  Boundary: The line where z = 0, i.e., x2 = -2*x1 + 3
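
A quick check of which side of this boundary a few points fall on, using the example weights above:

w1, w2, b = 2.0, 1.0, -3.0

for x1, x2 in [(0.5, 1.0), (1.0, 1.0), (2.0, 2.0)]:
    z = w1 * x1 + w2 * x2 + b
    side = "SPAM (z > 0)" if z > 0 else "HAM (z < 0)" if z < 0 else "on the boundary (z = 0)"
    print(f"({x1}, {x2}): z = {z:+.1f} -> {side}")
# (0.5, 1.0): z = -1.0 -> HAM (z < 0)
# (1.0, 1.0): z = +0.0 -> on the boundary (z = 0)
# (2.0, 2.0): z = +3.0 -> SPAM (z > 0)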

Real World Outcome

When you complete this project, your spam filter will process raw email text and output probability scores:

Example Session

$ python spam_filter.py "Buy cheap meds now!! Click here"
Preprocessing... [buy, cheap, meds, now, click, here]
Vocabulary encoding... Vector shape: (5000,)
Non-zero features: [buy:1, cheap:1, meds:1, now:1, click:1, here:1]

Computing z = w.dot(x) + b
z = 4.237

Applying sigmoid: p = 1 / (1 + e^(-4.237))
Probability: 0.986

Classification: SPAM (98.6% confident)

$ python spam_filter.py "Hey mom, are we still on for dinner?"
Preprocessing... [hey, mom, are, we, still, on, for, dinner]
Vocabulary encoding... Vector shape: (5000,)
Non-zero features: [hey:1, mom:1, dinner:1]

Computing z = w.dot(x) + b
z = -6.124

Applying sigmoid: p = 1 / (1 + e^(-(-6.124)))
Probability: 0.002

Classification: HAM (99.8% confident)

$ python spam_filter.py "Meeting tomorrow at 3pm"
Preprocessing... [meeting, tomorrow, at, 3pm]
Probability: 0.089
Classification: HAM (91.1% confident)

$ python spam_filter.py "WINNER! You have been selected for a FREE prize"
Preprocessing... [winner, you, have, been, selected, for, a, free, prize]
Probability: 0.997
Classification: SPAM (99.7% confident)

Training Output

$ python spam_filter.py --train data/spam_ham.csv
Loading dataset...
  Total emails: 5,572
  Spam: 747 (13.4%)
  Ham: 4,825 (86.6%)

Building vocabulary from training data...
  Total unique words: 8,923
  Keeping top 5,000 most frequent words

Converting emails to bag-of-words vectors...
  Training set shape: (4,457, 5000)
  Test set shape: (1,115, 5000)

Training logistic regression...
  Learning rate: 0.1
  Epochs: 100

Epoch 10:  Loss = 0.412 | Accuracy = 93.2%
Epoch 20:  Loss = 0.289 | Accuracy = 95.8%
Epoch 30:  Loss = 0.224 | Accuracy = 96.9%
Epoch 40:  Loss = 0.185 | Accuracy = 97.4%
Epoch 50:  Loss = 0.159 | Accuracy = 97.8%
...
Epoch 100: Loss = 0.098 | Accuracy = 98.5%

Evaluating on test set...
  Accuracy:  97.8%
  Precision: 90.2%  (Of predicted spam, 90.2% were actually spam)
  Recall:    93.2%  (Of actual spam, 93.2% were caught)
  F1-Score:  91.7%

Confusion Matrix:
                 Predicted
              HAM    SPAM
Actual HAM  [ 952     15 ]
Actual SPAM [  10    138 ]

Most "spammy" words (highest positive weights):
  free:    +2.34
  click:   +2.12
  winner:  +2.01
  prize:   +1.89
  urgent:  +1.76

Most "hammy" words (highest negative weights):
  meeting: -1.45
  thanks:  -1.38
  dinner:  -1.21
  project: -1.15
  please:  -1.02

Saving model to spam_model.pkl...
Done!

Solution Architecture

+------------------------------------------------------------------+
|                    Spam Filter Architecture                        |
+------------------------------------------------------------------+
|                                                                    |
|  1. DATA LOADING                                                   |
|     +----------+      +---------+      +-------------+            |
|     | CSV File | ---> | Pandas  | ---> | texts, labels|           |
|     +----------+      +---------+      +-------------+            |
|                                                                    |
|  2. PREPROCESSING PIPELINE                                         |
|     +---------+    +----------+    +------------+    +--------+   |
|     | Raw Text| -> | Lowercase| -> | Remove     | -> | Tokenize|  |
|     |         |    |          |    | Punctuation|    |         |  |
|     +---------+    +----------+    +------------+    +--------+   |
|                                                                    |
|  3. VOCABULARY BUILDING                                            |
|     +--------+    +-------------+    +------------+               |
|     | Tokens | -> | Count Freq  | -> | Top N Words| -> vocab     |
|     +--------+    +-------------+    +------------+               |
|                                                                    |
|  4. BAG OF WORDS ENCODING                                          |
|     +---------+    +-----------+    +------------+                |
|     | Tokens  | -> | Vocab Map | -> | Count Vec  | -> X (n, v)   |
|     +---------+    +-----------+    +------------+                |
|                                                                    |
|  5. MODEL                                                          |
|     +------------+         +----------+         +---------+       |
|     | Weights w  |   +     | Bias b   |   =     |   z     |       |
|     | (v,)       |         | (1,)     |         | (n,)    |       |
|     +------------+         +----------+         +---------+       |
|           |                                          |            |
|           |     X.dot(w) + b                        |            |
|           +------------------------------------------+            |
|                              |                                     |
|                              v                                     |
|     +----------------------------------------------------------+  |
|     |                  p = sigmoid(z)                           |  |
|     |                  p = 1 / (1 + exp(-z))                   |  |
|     +----------------------------------------------------------+  |
|                              |                                     |
|                              v                                     |
|  6. TRAINING LOOP                                                  |
|     +-------------+    +----------+    +-----------+              |
|     | Predictions | -> | CE Loss  | -> | Gradients | -> Update   |
|     |     p       |    |          |    | dw, db    |    w, b     |
|     +-------------+    +----------+    +-----------+              |
|                                                                    |
|  7. INFERENCE                                                      |
|     +------------+    +---------+    +------------+               |
|     | New Email  | -> | Encode  | -> | p = f(x)   | -> SPAM/HAM  |
|     +------------+    +---------+    +------------+               |
|                                                                    |
+------------------------------------------------------------------+

Class Structure

class SpamFilter:
    """
    Logistic Regression-based Spam Classifier

    Attributes:
        vocab: dict         # word -> index mapping
        vocab_size: int     # number of unique words
        weights: np.array   # (vocab_size,) learned weights
        bias: float         # learned bias term

    Methods:
        fit(texts, labels, epochs, lr)  # Train the model
        predict(text) -> float          # Get spam probability
        classify(text) -> str           # Get "SPAM" or "HAM"
        evaluate(texts, labels) -> dict # Get accuracy, precision, recall
    """

Phased Implementation Guide

Phase 1: Text Preprocessing Pipeline (Day 1)

Goal: Convert raw email text into clean tokens.

import re
import string

def preprocess(text):
    """
    Clean and tokenize text for spam classification.

    Steps:
    1. Lowercase
    2. Remove punctuation
    3. Split into words
    4. (Optional) Remove stopwords

    Args:
        text: Raw email string

    Returns:
        List of cleaned tokens
    """
    # Step 1: Lowercase
    text = text.lower()

    # Step 2: Remove punctuation
    text = text.translate(str.maketrans('', '', string.punctuation))

    # Step 3: Tokenize (split on whitespace)
    tokens = text.split()

    # Step 4: (Optional) Remove stopwords
    # stopwords = {'the', 'a', 'an', 'is', 'are', 'was', 'were', ...}
    # tokens = [t for t in tokens if t not in stopwords]

    return tokens

# Test
text = "BUY cheap M3DS now!! Click HERE for $$$ savings..."
print(preprocess(text))
# Expected: ['buy', 'cheap', 'm3ds', 'now', 'click', 'here', 'for', 'savings']

Checkpoint: Can preprocess 100 emails in under 1 second.

Phase 2: Vocabulary Building and Encoding (Day 1-2)

Goal: Build a vocabulary from training data and map words to indices.

from collections import Counter

def build_vocabulary(texts, max_vocab_size=5000):
    """
    Build vocabulary from list of texts.

    Args:
        texts: List of raw text strings
        max_vocab_size: Keep only top N most frequent words

    Returns:
        Dictionary mapping word -> index
    """
    # Count all words across all texts
    word_counts = Counter()
    for text in texts:
        tokens = preprocess(text)
        word_counts.update(tokens)

    # Keep top N most frequent words
    most_common = word_counts.most_common(max_vocab_size)

    # Create word -> index mapping
    vocab = {word: idx for idx, (word, count) in enumerate(most_common)}

    return vocab

# Test
texts = [
    "Buy cheap meds now",
    "Hey mom dinner tonight",
    "Free money click here",
]
vocab = build_vocabulary(texts, max_vocab_size=100)
print(vocab)
# {'buy': 0, 'cheap': 1, 'meds': 2, 'now': 3, 'hey': 4, ...}

Checkpoint: Build vocabulary of 5,000 words from 5,000 emails in under 10 seconds.

Phase 3: Bag of Words Transformation (Day 2)

Goal: Convert tokenized text into numerical vectors.

import numpy as np

def text_to_vector(text, vocab):
    """
    Convert text to bag-of-words vector.

    Args:
        text: Raw text string
        vocab: Word -> index dictionary

    Returns:
        numpy array of shape (vocab_size,)
    """
    tokens = preprocess(text)
    vector = np.zeros(len(vocab))

    for token in tokens:
        if token in vocab:
            vector[vocab[token]] += 1

    return vector

def texts_to_matrix(texts, vocab):
    """
    Convert list of texts to matrix of vectors.

    Args:
        texts: List of raw text strings
        vocab: Word -> index dictionary

    Returns:
        numpy array of shape (n_texts, vocab_size)
    """
    return np.array([text_to_vector(text, vocab) for text in texts])

# Test
vocab = {'buy': 0, 'cheap': 1, 'meds': 2, 'free': 3, 'click': 4}
text = "buy cheap cheap free"
vector = text_to_vector(text, vocab)
print(vector)
# Expected: [1., 2., 0., 1., 0.]
#            buy  cheap meds free click

Checkpoint: Convert 1,000 emails to vectors in under 5 seconds.

Phase 4: Sigmoid Implementation (Day 2-3)

Goal: Implement the sigmoid activation function with numerical stability.

import numpy as np

def sigmoid(z):
    """
    Numerically stable sigmoid function.

    sigma(z) = 1 / (1 + exp(-z))

    For numerical stability:
    - For z >= 0: 1 / (1 + exp(-z))
    - For z < 0:  exp(z) / (1 + exp(z))

    This avoids overflow when z is a large negative number.
    """
    # Work on a float array so masking works for scalar inputs too
    z = np.asarray(z, dtype=float)

    # Clip z to prevent overflow
    z = np.clip(z, -500, 500)

    # Numerically stable computation
    positive_mask = z >= 0
    negative_mask = ~positive_mask

    result = np.zeros_like(z, dtype=float)

    # For positive z: standard formula
    result[positive_mask] = 1 / (1 + np.exp(-z[positive_mask]))

    # For negative z: equivalent but stable formula
    exp_z = np.exp(z[negative_mask])
    result[negative_mask] = exp_z / (1 + exp_z)

    return result

# Test
print(sigmoid(0))     # 0.5
print(sigmoid(10))    # ~0.99995
print(sigmoid(-10))   # ~0.00005
print(sigmoid(-1000)) # Should not overflow, returns ~0

Verification: Test with extreme values like +/-1000 without errors.

Phase 5: Cross-Entropy Loss (Day 3)

Goal: Implement cross-entropy loss function.

import numpy as np

def cross_entropy_loss(y_true, y_pred, epsilon=1e-15):
    """
    Compute binary cross-entropy loss.

    L = -[y * log(p) + (1-y) * log(1-p)]

    Args:
        y_true: True labels (0 or 1), shape (n,)
        y_pred: Predicted probabilities, shape (n,)
        epsilon: Small value to prevent log(0)

    Returns:
        Average loss across all samples
    """
    # Clip predictions to prevent log(0)
    y_pred = np.clip(y_pred, epsilon, 1 - epsilon)

    # Compute cross-entropy
    loss = -np.mean(
        y_true * np.log(y_pred) +
        (1 - y_true) * np.log(1 - y_pred)
    )

    return loss

# Test cases
# Perfect prediction
print(cross_entropy_loss(np.array([1]), np.array([0.99])))  # ~0.01

# Terrible prediction
print(cross_entropy_loss(np.array([1]), np.array([0.01])))  # ~4.6

# Mixed
y_true = np.array([1, 0, 1, 0])
y_pred = np.array([0.9, 0.1, 0.8, 0.2])
print(cross_entropy_loss(y_true, y_pred))  # ~0.16

Verification: Loss should be near 0 for perfect predictions, high for wrong predictions.

Phase 6: Training Loop (Day 3-4)

Goal: Implement gradient descent to train the model.

import numpy as np

class LogisticRegression:
    def __init__(self, n_features):
        """Initialize weights and bias."""
        self.weights = np.zeros(n_features)
        self.bias = 0.0

    def forward(self, X):
        """Compute predictions for input X."""
        z = X.dot(self.weights) + self.bias
        return sigmoid(z)

    def compute_gradients(self, X, y_true, y_pred):
        """
        Compute gradients for weights and bias.

        Gradient of cross-entropy with sigmoid:
        dL/dw = (1/n) * X.T.dot(y_pred - y_true)
        dL/db = (1/n) * sum(y_pred - y_true)
        """
        n = len(y_true)
        error = y_pred - y_true  # Shape: (n,)

        dw = (1/n) * X.T.dot(error)  # Shape: (n_features,)
        db = (1/n) * np.sum(error)   # Scalar

        return dw, db

    def fit(self, X, y, epochs=100, learning_rate=0.1, verbose=True):
        """
        Train the model using gradient descent.

        Args:
            X: Feature matrix, shape (n_samples, n_features)
            y: Labels, shape (n_samples,)
            epochs: Number of training iterations
            learning_rate: Step size for updates
            verbose: Print progress
        """
        history = {'loss': [], 'accuracy': []}

        for epoch in range(epochs):
            # Forward pass
            y_pred = self.forward(X)

            # Compute loss
            loss = cross_entropy_loss(y, y_pred)

            # Compute accuracy
            predictions = (y_pred >= 0.5).astype(int)
            accuracy = np.mean(predictions == y)

            # Store history
            history['loss'].append(loss)
            history['accuracy'].append(accuracy)

            # Compute gradients
            dw, db = self.compute_gradients(X, y, y_pred)

            # Update weights
            self.weights -= learning_rate * dw
            self.bias -= learning_rate * db

            # Print progress
            if verbose and (epoch + 1) % 10 == 0:
                print(f"Epoch {epoch+1}: Loss = {loss:.4f} | Accuracy = {accuracy:.2%}")

        return history

    def predict_proba(self, X):
        """Get probability predictions."""
        return self.forward(X)

    def predict(self, X, threshold=0.5):
        """Get class predictions (0 or 1)."""
        return (self.predict_proba(X) >= threshold).astype(int)

Checkpoint: Training accuracy should increase over epochs. Loss should decrease.

Phase 7: Inference and Evaluation (Day 5)

Goal: Build the complete spam filter with evaluation metrics.

import numpy as np

def evaluate(model, X_test, y_test):
    """
    Compute classification metrics.

    Returns:
        Dictionary with accuracy, precision, recall, f1
    """
    y_pred = model.predict(X_test)

    # True positives, false positives, etc.
    tp = np.sum((y_pred == 1) & (y_test == 1))
    fp = np.sum((y_pred == 1) & (y_test == 0))
    fn = np.sum((y_pred == 0) & (y_test == 1))
    tn = np.sum((y_pred == 0) & (y_test == 0))

    # Metrics
    accuracy = (tp + tn) / len(y_test)
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) > 0 else 0

    return {
        'accuracy': accuracy,
        'precision': precision,
        'recall': recall,
        'f1': f1,
        'confusion_matrix': [[tn, fp], [fn, tp]]
    }

class SpamFilter:
    """Complete spam filter combining all components."""

    def __init__(self, max_vocab_size=5000):
        self.max_vocab_size = max_vocab_size
        self.vocab = None
        self.model = None

    def fit(self, texts, labels, epochs=100, learning_rate=0.1):
        """Train the spam filter on text data."""
        # Build vocabulary
        self.vocab = build_vocabulary(texts, self.max_vocab_size)

        # Convert texts to vectors
        X = texts_to_matrix(texts, self.vocab)
        y = np.array(labels)

        # Initialize and train model
        self.model = LogisticRegression(len(self.vocab))
        history = self.model.fit(X, y, epochs, learning_rate)

        return history

    def predict(self, text):
        """Predict spam probability for a single text."""
        vector = text_to_vector(text, self.vocab)
        prob = self.model.predict_proba(vector.reshape(1, -1))[0]
        return prob

    def classify(self, text, threshold=0.5):
        """Classify text as SPAM or HAM."""
        prob = self.predict(text)
        label = "SPAM" if prob >= threshold else "HAM"
        confidence = prob if label == "SPAM" else 1 - prob
        return label, confidence

    def get_top_features(self, n=10):
        """Get most spam-indicative and ham-indicative words."""
        idx_to_word = {idx: word for word, idx in self.vocab.items()}

        # Sort by weight (the weights live on the underlying model)
        sorted_indices = np.argsort(self.model.weights)

        # Most spammy (highest positive weights)
        spammy = [(idx_to_word[i], self.model.weights[i])
                  for i in sorted_indices[-n:][::-1]]

        # Most hammy (most negative weights)
        hammy = [(idx_to_word[i], self.model.weights[i])
                 for i in sorted_indices[:n]]

        return {'spam_words': spammy, 'ham_words': hammy}

Checkpoint: Achieve >95% accuracy on a test set. F1-score > 0.90.


Questions to Guide Your Design

Use these questions as checkpoints during implementation:

Text Processing Questions

  1. How should I handle unknown words during inference?
    • Answer: Ignore them. If a word isn’t in the vocabulary, it contributes 0 to the prediction.
  2. Should I use word counts or binary presence?
    • Counts: “free free free” gets higher spam signal
    • Binary: “free” present or not, regardless of frequency
    • Try both! Binary often works better for spam.
  3. How do I handle very rare or very common words?
    • Very rare: Probably noise, remove (min_df parameter)
    • Very common: “the”, “a”, “is” - stopwords, remove them

Model Questions

  1. Why initialize weights to zero instead of random?
    • For logistic regression, zero initialization works fine
    • All features start with equal importance
    • The gradient will differentiate them
  2. What learning rate should I use?
    • Start with 0.1, adjust based on convergence
    • Too high: Loss oscillates or diverges
    • Too low: Training is very slow
  3. How many epochs are enough?
    • Watch the loss curve
    • Stop when loss stops decreasing (early stopping)
    • For this dataset, 100-200 epochs is usually enough

Evaluation Questions

  1. Why is accuracy not enough for spam detection?
    • Class imbalance: 90% ham, 10% spam
    • A model predicting “HAM” always gets 90% accuracy!
    • Need precision and recall
  2. What’s more important: precision or recall?
    • High precision: Few false positives (legitimate emails marked spam)
    • High recall: Few false negatives (spam getting through)
    • For email: High precision is often preferred (don’t lose important emails)

Thinking Exercise

Manual Probability Calculation

Task: Work through the spam classification by hand.

Setup:

  • Vocabulary: {buy: 0, cheap: 1, free: 2, meeting: 3, dinner: 4}
  • Weights: [2.0, 1.5, 2.5, -1.0, -0.8]
  • Bias: -3.0

Email: “Free cheap cheap meeting”

Step 1: Tokenize

["free", "cheap", "cheap", "meeting"]

Step 2: Create BoW vector

          buy  cheap  free  meeting  dinner
x =      [ 0,    2,     1,     1,       0  ]

Step 3: Compute z

z = w.dot(x) + b
z = (2.0*0) + (1.5*2) + (2.5*1) + (-1.0*1) + (-0.8*0) + (-3.0)
z = 0 + 3.0 + 2.5 - 1.0 + 0 - 3.0
z = 1.5

Step 4: Apply sigmoid

p = 1 / (1 + e^(-1.5))
p = 1 / (1 + 0.223)
p = 1 / 1.223
p = 0.817

Step 5: Classify

0.817 > 0.5, so predict SPAM with 81.7% confidence

Question: If we add “dinner” to the email (“Free cheap cheap meeting dinner”), what happens?

New x: [0, 2, 1, 1, 1]
New z = 1.5 + (-0.8 * 1) = 0.7
New p = sigmoid(0.7) = 0.668

Still SPAM, but confidence dropped from 81.7% to 66.8%!
The word "dinner" has negative weight, making the email seem less spammy.
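
You can double-check the arithmetic with a few lines of NumPy, reusing the vocabulary, weights, and bias from the setup above:

import numpy as np

w = np.array([2.0, 1.5, 2.5, -1.0, -0.8])   # buy, cheap, free, meeting, dinner
b = -3.0

x_original    = np.array([0, 2, 1, 1, 0])   # "Free cheap cheap meeting"
x_with_dinner = np.array([0, 2, 1, 1, 1])   # ... plus "dinner"

for x in (x_original, x_with_dinner):
    z = w.dot(x) + b
    p = 1 / (1 + np.exp(-z))
    print(f"z = {z:.1f}, p = {p:.3f}")
# z = 1.5, p = 0.818
# z = 0.7, p = 0.668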

Testing Strategy

Unit Tests

def test_preprocess():
    """Test text preprocessing."""
    assert preprocess("Hello World!") == ["hello", "world"]
    assert preprocess("BUY NOW!!!") == ["buy", "now"]
    assert preprocess("") == []

def test_sigmoid():
    """Test sigmoid function."""
    assert abs(sigmoid(0) - 0.5) < 1e-6
    assert sigmoid(100) > 0.99
    assert sigmoid(-100) < 0.01
    # Should not overflow
    assert sigmoid(-1000) == 0.0 or sigmoid(-1000) > 0

def test_cross_entropy():
    """Test cross-entropy loss."""
    # Perfect prediction
    assert cross_entropy_loss(np.array([1]), np.array([0.9999])) < 0.01
    # Terrible prediction
    assert cross_entropy_loss(np.array([1]), np.array([0.0001])) > 4

def test_gradient():
    """Test gradient computation."""
    model = LogisticRegression(2)
    X = np.array([[1, 0], [0, 1]])
    y = np.array([1, 0])
    y_pred = np.array([0.7, 0.3])

    dw, db = model.compute_gradients(X, y, y_pred)

    # Gradient should be (p - y) * x
    expected_dw = np.array([-0.15, 0.15])  # [(-0.3*1 + 0.3*0)/2, (-0.3*0 + 0.3*1)/2]
    assert np.allclose(dw, expected_dw)

Integration Tests

def test_training_improves():
    """Test that training reduces loss."""
    # Simple dataset
    texts = ["buy free money", "meeting dinner thanks"] * 50
    labels = [1, 0] * 50

    filter = SpamFilter(max_vocab_size=100)
    history = filter.fit(texts, labels, epochs=50, learning_rate=0.5)

    # Loss should decrease
    assert history['loss'][-1] < history['loss'][0]
    # Accuracy should improve
    assert history['accuracy'][-1] > history['accuracy'][0]

def test_spam_prediction():
    """Test predictions on obvious cases."""
    # Train on clear examples
    texts = [
        "FREE money click here",
        "Win prize now",
        "Cheap meds buy now",
        "Hey mom dinner tonight",
        "Meeting at 3pm",
        "Thanks for your help",
    ]
    labels = [1, 1, 1, 0, 0, 0]

    filter = SpamFilter(max_vocab_size=50)
    filter.fit(texts, labels, epochs=100, learning_rate=1.0)

    # Test predictions
    assert filter.predict("Free prize win") > 0.5  # Should be spam
    assert filter.predict("Thanks for the meeting") < 0.5  # Should be ham

Accuracy, Precision, Recall

def test_evaluation_metrics():
    """Test that metrics are computed correctly."""
    # Create a model with known predictions
    y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0])
    y_pred = np.array([1, 1, 0, 0, 0, 0, 1, 1])
    #                  TP  TP  FN  FN  TN  TN  FP  FP

    # Manual calculation:
    # TP = 2, FN = 2, TN = 2, FP = 2
    # Accuracy = (TP + TN) / 8 = 4/8 = 0.5
    # Precision = TP / (TP + FP) = 2/4 = 0.5
    # Recall = TP / (TP + FN) = 2/4 = 0.5
    # F1 = 2 * 0.5 * 0.5 / (0.5 + 0.5) = 0.5

    # Assumes a compute_metrics(y_true, y_pred) helper that mirrors evaluate()
    # but takes prediction arrays directly instead of a model and feature matrix
    metrics = compute_metrics(y_true, y_pred)

    assert metrics['accuracy'] == 0.5
    assert metrics['precision'] == 0.5
    assert metrics['recall'] == 0.5
    assert metrics['f1'] == 0.5

Common Pitfalls and Debugging Tips

1. Numerical Instability in Sigmoid

Problem: exp(-z) overflows for large negative z.

# BAD: Will overflow for z = -1000
def sigmoid_naive(z):
    return 1 / (1 + np.exp(-z))

# GOOD: Numerically stable result (note: np.where evaluates both branches,
# so NumPy may still warn about overflow; the masked version in Phase 4 avoids that)
def sigmoid(z):
    z = np.clip(z, -500, 500)  # Prevent overflow
    return np.where(z >= 0,
                    1 / (1 + np.exp(-z)),
                    np.exp(z) / (1 + np.exp(z)))

2. Log of Zero in Cross-Entropy

Problem: log(0) is negative infinity.

# BAD: Will crash if y_pred is exactly 0 or 1
loss = -np.mean(y_true * np.log(y_pred))

# GOOD: Clip predictions away from 0 and 1
epsilon = 1e-15
y_pred = np.clip(y_pred, epsilon, 1 - epsilon)
loss = -np.mean(y_true * np.log(y_pred))

3. Vocabulary Mismatch Between Training and Inference

Problem: New words in test data not in vocabulary.

def text_to_vector(text, vocab):
    tokens = preprocess(text)
    vector = np.zeros(len(vocab))

    for token in tokens:
        if token in vocab:  # IMPORTANT: Check if word exists!
            vector[vocab[token]] += 1
        # else: Ignore unknown words

    return vector

4. Class Imbalance

Problem: 90% ham, 10% spam. Model predicts all ham.

# Check class distribution
print(f"Spam: {sum(labels)} ({sum(labels)/len(labels):.1%})")
print(f"Ham: {len(labels) - sum(labels)}")

# Solutions:
# 1. Class weights: Penalize mistakes on minority class more
class_weights = {0: 1.0, 1: 9.0}  # Weight spam 9x more

# 2. Oversampling: Duplicate minority class samples
# 3. Undersampling: Remove majority class samples
# 4. SMOTE: Synthetic minority oversampling

5. Learning Rate Too High

Symptom: Loss oscillates wildly or increases.

Epoch 1:  Loss = 0.693
Epoch 2:  Loss = 2.345    <-- Went UP!
Epoch 3:  Loss = 1.234
Epoch 4:  Loss = 5.678    <-- Wild oscillation

Fix: Reduce learning rate by 10x. Start with 0.01 or 0.001.

6. Not Normalizing Features

Problem: Word counts have very different scales.

Word "the": appears 50 times
Word "free": appears 2 times

If you don't normalize:
- "the" dominates the gradient
- Rare but important words ("free") are ignored

Solutions:

# Binary encoding (presence/absence, not count)
vector = (vector > 0).astype(float)

# TF-IDF weighting (bonus challenge)
# Term frequency * inverse document frequency

Interview Questions

Conceptual Questions

Q1: “Why do we use sigmoid instead of just thresholding?”

Expected answer: Sigmoid produces probabilities between 0 and 1, which allows us to:

  1. Interpret outputs as confidence levels
  2. Set custom decision thresholds based on the application
  3. Use gradient descent because sigmoid is differentiable (threshold is not)
  4. Combine multiple models by averaging probabilities

Q2: “Why is cross-entropy better than MSE for classification?”

Expected answer:

  1. Cross-entropy loss gradient is (p - y), which doesn’t vanish when the model is confident but wrong
  2. MSE gradient involves sigma'(z), which approaches zero for large |z|, causing vanishing gradients
  3. Cross-entropy penalizes confident wrong answers much more heavily (goes to infinity)
  4. Cross-entropy has a probabilistic interpretation (negative log-likelihood)

Q3: “Explain the Bag of Words representation. What are its limitations?”

Expected answer: BoW represents text as a vector of word counts/frequencies.

Limitations:

  • Ignores word order (“not good” and “good not” are identical)
  • Ignores semantics/meaning
  • Creates very sparse, high-dimensional vectors
  • Can’t handle out-of-vocabulary words
  • No understanding of synonyms or context

Technical Questions

Q4: “How would you handle a word like ‘free’ that appears in both spam and ham?”

Expected answer: The model learns the weight from the training data. If “free” appears in 90% of spam but only 10% of ham, it will get a positive weight (spam-indicative). In logistic regression, each occurrence of “free” adds w_free to the log-odds z, so the weight measures how strongly the word shifts the prediction toward spam once the other words in the email are accounted for.

Q5: “Your model has 95% accuracy but 0% recall. What’s happening?”

Expected answer: Class imbalance. If 95% of emails are ham, the model can achieve 95% accuracy by predicting ham for everything. Recall is 0 because it catches no spam.

Solutions:

  • Use class weights to penalize spam misses more
  • Oversample the minority class
  • Use F1-score or balanced accuracy instead of accuracy
  • Lower the decision threshold

Q6: “How do you choose the vocabulary size?”

Expected answer: Trade-off between:

  • Too small: Miss important words
  • Too large: Overfit to rare words, slow training, high memory

Typical approach:

  1. Keep top N most frequent words (5,000-10,000)
  2. Remove words appearing in < K documents (min_df)
  3. Remove words appearing in > X% of documents (max_df)

Coding Questions

Q7: “Implement the sigmoid derivative.”

def sigmoid_derivative(z):
    s = sigmoid(z)
    return s * (1 - s)

Q8: “Write code to compute precision, recall, and F1.”

def precision_recall_f1(y_true, y_pred):
    tp = sum((p == 1) and (t == 1) for p, t in zip(y_pred, y_true))
    fp = sum((p == 1) and (t == 0) for p, t in zip(y_pred, y_true))
    fn = sum((p == 0) and (t == 1) for p, t in zip(y_pred, y_true))

    precision = tp / (tp + fp) if (tp + fp) > 0 else 0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) > 0 else 0

    return precision, recall, f1

Hints in Layers

For when you’re stuck, reveal hints progressively:

Layer 1: Getting Started

Hint: How do I structure the project?

Start with this skeleton:

# spam_filter.py

import numpy as np
from collections import Counter

# 1. Preprocessing
def preprocess(text):
    pass

# 2. Vocabulary
def build_vocabulary(texts, max_vocab_size):
    pass

# 3. Vectorization
def text_to_vector(text, vocab):
    pass

# 4. Sigmoid
def sigmoid(z):
    pass

# 5. Loss
def cross_entropy_loss(y_true, y_pred):
    pass

# 6. Model class
class SpamFilter:
    def fit(self, texts, labels):
        pass
    def predict(self, text):
        pass

# 7. Main
if __name__ == "__main__":
    # Load data, train, evaluate
    pass

Hint: What dataset should I use?

Use the UCI SMS Spam Collection:

  • Download from: https://archive.ics.uci.edu/ml/datasets/SMS+Spam+Collection
  • 5,574 SMS messages labeled “spam” or “ham”
  • Simple tab-separated format: “label\ttext”

Or use the Enron email dataset for a more realistic challenge.

Layer 2: Preprocessing Issues

Hint: My vocabulary is too large

Reduce vocabulary size by:

  1. Lowercasing (already done)
  2. Removing stopwords: {"the", "a", "an", "is", "are", "was", ...}
  3. Keeping only top N most frequent words
  4. Removing words that appear in fewer than K documents

# Example stopwords
STOPWORDS = {
    'the', 'a', 'an', 'is', 'are', 'was', 'were', 'be', 'been',
    'being', 'have', 'has', 'had', 'do', 'does', 'did', 'will',
    'would', 'could', 'should', 'may', 'might', 'must', 'shall',
    'can', 'to', 'of', 'in', 'for', 'on', 'with', 'at', 'by',
    'from', 'as', 'into', 'through', 'during', 'before', 'after',
    'above', 'below', 'between', 'under', 'again', 'further',
    'then', 'once', 'here', 'there', 'when', 'where', 'why',
    'how', 'all', 'each', 'few', 'more', 'most', 'other', 'some',
    'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so',
    'than', 'too', 'very', 'just', 'and', 'but', 'if', 'or',
    'because', 'until', 'while', 'this', 'that', 'these', 'those',
}

Layer 3: Training Issues

Hint: My loss is not decreasing

Check these:

  1. Learning rate: Try 0.01, 0.1, 1.0 - find what works
  2. Feature scale: Normalize vectors if using counts
  3. Gradient check: Print gradients to ensure they’re not zero or NaN
  4. Initial weights: Try small random values instead of zeros

# Debug: Print gradient magnitudes
dw, db = model.compute_gradients(X, y, y_pred)
print(f"dw mean: {np.mean(np.abs(dw)):.6f}")
print(f"db: {db:.6f}")

Hint: My loss is NaN or Inf

Numerical stability issues:

  1. Clip sigmoid input: z = np.clip(z, -500, 500)
  2. Clip predictions for log: y_pred = np.clip(y_pred, 1e-15, 1-1e-15)
  3. Use the stable sigmoid implementation

# Check for NaN/Inf
if np.isnan(loss) or np.isinf(loss):
    print(f"z range: {z.min()} to {z.max()}")
    print(f"p range: {y_pred.min()} to {y_pred.max()}")

Layer 4: Evaluation Issues

Hint: High accuracy but low recall

This is the class imbalance problem. Solutions:

# 1. Class weights in gradient
# Weight the gradient by class frequency (n_spam, n_ham = number of spam/ham examples)
weights = np.where(y == 1, n_ham / n_spam, 1.0)
weighted_error = (y_pred - y) * weights

# 2. Lower the threshold
# Instead of 0.5, try 0.3 or lower
y_pred = (y_proba >= 0.3).astype(int)

# 3. Use different metric for evaluation
# Optimize for F1 or balanced accuracy instead of accuracy

Layer 5: Advanced Issues

Hint: How do I know if my model is overfitting?

Split your data:

from sklearn.model_selection import train_test_split

# X = texts_to_matrix(texts, vocab), y = np.array(labels)
# (ideally, build the vocabulary from the training texts only)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Train on the train set, evaluate on both sets
model = LogisticRegression(X_train.shape[1])
model.fit(X_train, y_train)
train_metrics = evaluate(model, X_train, y_train)
test_metrics = evaluate(model, X_test, y_test)

print(f"Train accuracy: {train_metrics['accuracy']:.2%}")
print(f"Test accuracy: {test_metrics['accuracy']:.2%}")

# If train >> test, you're overfitting
# Solutions: Smaller vocabulary, regularization, more data

Extensions and Challenges

Extension 1: Implement TF-IDF Weighting

Bag of Words treats all word occurrences equally. TF-IDF weights by importance:

TF-IDF = Term Frequency * Inverse Document Frequency

TF(word, doc) = Count of word in doc / Total words in doc
IDF(word) = log(Total docs / Docs containing word)

Example:
- "the" appears in 95% of documents: IDF = log(100/95) = 0.05 (low weight)
- "viagra" appears in 2% of documents: IDF = log(100/2) = 3.9 (high weight)

from collections import Counter
import numpy as np

def compute_idf(texts, vocab):
    """Compute IDF for each word in vocabulary."""
    n_docs = len(texts)
    doc_counts = Counter()

    for text in texts:
        unique_words = set(preprocess(text))
        doc_counts.update(unique_words)

    idf = {}
    for word, idx in vocab.items():
        df = doc_counts.get(word, 1)  # Avoid division by zero
        idf[word] = np.log(n_docs / df)

    return idf

def text_to_tfidf(text, vocab, idf):
    """Convert text to TF-IDF vector."""
    tokens = preprocess(text)
    tf = Counter(tokens)
    total = len(tokens)

    vector = np.zeros(len(vocab))
    for token in tokens:
        if token in vocab:
            tf_score = tf[token] / total
            idf_score = idf.get(token, 1)
            vector[vocab[token]] = tf_score * idf_score

    return vector

Extension 2: Multi-Class Classification

Extend from binary (SPAM/HAM) to multiple categories:

Categories: [SPAM, PROMO, IMPORTANT, NORMAL]

Instead of sigmoid (binary), use softmax (multi-class):
  softmax(z_i) = exp(z_i) / sum(exp(z_j))

This gives a probability distribution over all classes.

def softmax(z):
    """Multi-class sigmoid: softmax."""
    # Subtract max for numerical stability
    z = z - np.max(z, axis=-1, keepdims=True)
    exp_z = np.exp(z)
    return exp_z / np.sum(exp_z, axis=-1, keepdims=True)

# Now weights are a matrix: (vocab_size, n_classes)
# Output is a vector of probabilities: (n_classes,)

Extension 3: Character-Level Features

Instead of words, use character n-grams:

Word-level: "free" -> ["free"]
Char-level (n=3): "free" -> ["fre", "ree"]

Advantage: Handles misspellings and obfuscation
  "fr33" (word-level) -> Unknown word, ignored
  "fr33" (char-level) -> ["fr3", "r33"] - might still match spam patterns

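A minimal sketch of character n-gram extraction; the rest of the pipeline is unchanged, with n-grams taking the place of words in the vocabulary:

def char_ngrams(word, n=3):
    """Return all overlapping character n-grams of a word."""
    return [word[i:i + n] for i in range(len(word) - n + 1)]

print(char_ngrams("free"))   # ['fre', 'ree']
print(char_ngrams("fr33"))   # ['fr3', 'r33']
# 'fr33' is an unknown word, but its n-grams can still overlap with spammy patterns
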
Extension 4: Regularization

Prevent overfitting with L2 regularization:

# Add penalty for large weights
L2_lambda = 0.01

# Modified loss
loss = cross_entropy + (L2_lambda / 2) * np.sum(weights ** 2)

# Modified gradient
dw = gradient + L2_lambda * weights

Extension 5: Learning Curves

Visualize how the model learns:

import matplotlib.pyplot as plt

def plot_learning_curves(history):
    fig, axes = plt.subplots(1, 2, figsize=(12, 4))

    axes[0].plot(history['loss'])
    axes[0].set_title('Loss Over Epochs')
    axes[0].set_xlabel('Epoch')
    axes[0].set_ylabel('Cross-Entropy Loss')

    axes[1].plot(history['accuracy'])
    axes[1].set_title('Accuracy Over Epochs')
    axes[1].set_xlabel('Epoch')
    axes[1].set_ylabel('Accuracy')

    plt.tight_layout()
    plt.savefig('learning_curves.png')

Real-World Connections

Gmail Spam Filter

Gmail’s spam filter is vastly more sophisticated, but it builds on these principles:

Gmail's Approach (Simplified):
1. Text features (like BoW, but much more advanced)
2. Sender reputation (history of sending spam)
3. Link analysis (known malicious URLs)
4. User behavior (what users mark as spam)
5. Network analysis (patterns across millions of users)
6. Deep learning models (not just logistic regression)

But the core idea is the same:
  Extract features -> Compute weighted sum -> Apply activation -> Predict

Content Moderation

The same classification approach powers:

  • Toxic comment detection
  • Hate speech filtering
  • Fake news detection
  • Phishing email detection

Recommendation Systems

Binary classification underpins:

  • “Will user click this ad?” (Click-through rate prediction)
  • “Will user like this movie?” (Binary like/dislike)
  • “Will user churn?” (Customer retention)

Books That Will Help

Book                                           Author(s)                      Relevance                            Key Chapters
----------------------------------------------------------------------------------------------------------------------------------------------------------
Grokking Deep Learning                         Andrew Trask                   Primary reference for this project   Ch. 3: Forward Propagation, Ch. 5: Gradient Descent
Pattern Recognition and Machine Learning       Christopher Bishop             Theoretical foundations              Ch. 4: Linear Models for Classification
Machine Learning: A Probabilistic Perspective  Kevin Murphy                   Rigorous probability theory          Ch. 8: Logistic Regression
Speech and Language Processing                 Jurafsky & Martin              NLP fundamentals                     Ch. 4: Naive Bayes and Sentiment
Deep Learning                                  Goodfellow, Bengio, Courville  Modern deep learning bible           Ch. 6.2: Activation Functions

Reading Order Recommendation

  1. Start with Grokking Deep Learning Ch. 3 - Intuitive introduction to forward propagation
  2. Then read Speech and Language Processing Ch. 4 - Text classification context
  3. Reference Pattern Recognition Ch. 4.3 - Mathematical derivation of logistic regression
  4. Deep dive Deep Learning Ch. 6 - Modern perspective on activation functions and loss

Self-Assessment Checklist

Conceptual Understanding

  • Explain why classification uses sigmoid while regression doesn’t
  • Draw the sigmoid curve and mark the decision boundary
  • Derive the gradient of cross-entropy loss with sigmoid
  • Explain why cross-entropy is better than MSE for classification
  • Describe what Bag of Words loses from the original text

Implementation Skills

  • Implement numerically stable sigmoid
  • Implement cross-entropy loss with proper clipping
  • Build vocabulary from a corpus of texts
  • Convert text to BoW vectors
  • Train a logistic regression model from scratch
  • Compute accuracy, precision, recall, and F1-score

Practical Application

  • Achieve >95% accuracy on the SMS Spam dataset
  • Handle class imbalance appropriately
  • Explain what the model learned (top spam/ham words)
  • Debug training issues (NaN loss, non-decreasing loss)
  • Split data into train/test sets properly

Extensions Attempted

  • Implement TF-IDF weighting
  • Try binary features instead of counts
  • Add regularization
  • Plot learning curves
  • Experiment with different vocabulary sizes

Key Insights

Classification is not a minor variation of regression. The change from MSE to cross-entropy, and the addition of sigmoid, fundamentally changes how the model learns. Don’t treat logistic regression as “linear regression with an extra step.”

Text is just patterns of numbers. The machine has no understanding of language. It sees word frequencies and learns correlations. This is both humbling (AI doesn’t “understand”) and empowering (simple math can achieve impressive results).

The sigmoid-cross-entropy combination is elegant. The gradient simplifies to (p - y), which is remarkably clean. This mathematical elegance is one reason logistic regression has stood the test of time.

Class imbalance will break your model. Always check your class distribution. A model that predicts the majority class for everything will have high accuracy but zero utility.

Preprocessing matters. The quality of your text cleaning (tokenization, stopwords, normalization) often matters more than the model complexity. Garbage in, garbage out.


Connecting Forward

This project builds directly on Project 3 (Linear Regression) by adding:

  • Sigmoid activation for probability outputs
  • Cross-entropy loss for classification
  • Text preprocessing and vectorization

The next step, Project 5 (Autograd Engine), will show you how to automate gradient computation. Instead of manually deriving dL/dw = (p - y) * x, you’ll build a system that computes gradients automatically for any computational graph.

Project 6 (Fraud Detection MLP) will extend classification to non-linear problems by adding hidden layers. When a single sigmoid can’t separate the data, you’ll stack layers to learn complex decision boundaries.


After completing this project, you’ll understand the fundamental building block of classification: converting raw data into probabilities through learned weights. Every spam filter, content moderator, and recommendation system builds on this foundation.