Project 14: Naive Bayes Spam Filter

A spam filter that classifies emails using Naive Bayes. Train on labeled emails, then predict whether new emails are spam or ham based on word probabilities.

Quick Reference

Attribute                          Value
Difficulty                         Level 2: Intermediate (The Developer)
Main Programming Language          Python
Alternative Programming Languages  C, JavaScript, Go
Coolness Level                     Level 3: Genuinely Clever
Business Potential                 3. The “Service & Support” Model (B2B Utility)
Knowledge Area                     Bayesian Inference / Text Classification
Software or Tool                   Spam Filter
Main Book                          “Hands-On Machine Learning” by Aurélien Géron

1. Learning Objectives

By completing this project, you will:

  1. Translate math definitions into deterministic implementation steps.
  2. Build validation checks that make correctness observable.
  3. Diagnose numerical, logical, and data-shape failures early.
  4. Explain tradeoffs in interviews using evidence from your own build.

2. All Theory Needed (Per-Concept Breakdown)

This project applies the following theory clusters:

  • Symbolic-to-numeric translation (expressions, data shapes, invariants)
  • Stability constraints (precision, scaling, stopping criteria)
  • Optimization or inference logic (depending on project objective)
  • Evaluation discipline (error analysis, test coverage, reproducibility)

Concept A: Mathematical Representation Discipline

Fundamentals

A math expression is not executable until you define representation, ordering, and domain constraints. The same equation can be represented as a token stream, tree, matrix pipeline, or probability graph. Choosing the representation determines what bugs you can catch early.

Deep Dive into the concept

Most project failures begin before algorithm selection: they start with ambiguous representation. If your parser cannot distinguish unary minus from subtraction, your calculator fails. If your matrix dimensions are implicit rather than validated, your linear algebra pipeline fails silently. If your probabilistic assumptions (independence, stationarity, or class priors) are not explicit, your inference can look accurate on one split and collapse on another.

The core implementation move is to treat representation as a contract. Define each object with shape, domain, and semantic intent. Then enforce invariants at boundaries: input parser, preprocessing, training loop, evaluation stage. This makes debugging local instead of global.

How this fits this project

You will encode each operation with explicit contracts and invariant checks.

Definitions & key terms

  • Invariant: Property that must hold before and after each operation.
  • Shape contract: Expected dimensional structure of vectors/matrices/tensors.
  • Domain constraint: Allowed value range (for example log input > 0).

Mental model diagram

User Input -> Representation Layer -> Validated Operation -> Observable Output
              (tokens/shapes)        (invariants pass)       (tests/plots/logs)

How it works

  1. Parse/ingest data into typed structures.
  2. Validate shape/domain invariants.
  3. Execute operation.
  4. Compare observed output with expected behavior.
  5. Record failure signature if mismatch appears.
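Steps 1–2 of this loop can be sketched in Python; the function name and the specific checks are illustrative, not part of the project spec:

```python
import math

def validate_features(x, expected_len):
    """Boundary checks: enforce shape and domain invariants before any math runs."""
    if len(x) != expected_len:                  # shape contract
        raise ValueError(f"expected {expected_len} features, got {len(x)}")
    for v in x:
        if not v > 0:                           # domain constraint: log input > 0
            raise ValueError(f"log requires positive input, got {v}")
    return [math.log(v) for v in x]
```

Because the checks live at the boundary, a bad input fails with a local, descriptive error instead of surfacing later as a NaN deep inside the training loop.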

Minimal concrete example

PSEUDOCODE
read expression
tokenize with precedence rules
if token sequence invalid -> return syntax error
evaluate tree
if domain violation -> return bounded diagnostic
print value and confidence check
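A runnable fragment of the domain-violation branch, reduced to a single operation for brevity (`safe_divide` is a hypothetical helper, not part of the project spec):

```python
def safe_divide(a: float, b: float):
    """Validate the domain invariant (denominator != 0) before executing."""
    if b == 0:
        return None, "domain violation: division by zero"
    return a / b, None

value, diagnostic = safe_divide(10.0, 4.0)  # (2.5, None)
value, diagnostic = safe_divide(1.0, 0.0)   # (None, "domain violation: ...")
```

The point is the return shape: every operation yields either a value or a bounded diagnostic, never an unhandled exception from deep inside evaluation.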

Common misconceptions

  • “If it runs once, representation is correct.” -> false.
  • “Type checks are enough without shape checks.” -> false.

Check-your-understanding questions

  1. Which invariant catches division-by-zero earliest?
  2. Why does shape validation belong at boundaries rather than only in core logic?
  3. Predict failure if tokenization ignores unary minus.

Check-your-understanding answers

  1. Domain check on denominator before operation execution.
  2. Boundary validation keeps errors local and diagnostic.
  3. Expressions like -2^2 get misinterpreted and produce wrong precedence behavior.

Real-world applications

Feature preprocessing, model-serving input validation, and experiment-tracking schema enforcement.

Where you’ll apply it

This project and every downstream project in the sprint.

References

  • CSAPP (Bryant & O’Hallaron), floating-point chapter
  • Math for Programmers (Paul Orland), representation-oriented chapters

Key insight

Correct representation reduces the complexity of every later decision.

Summary

Stable ML math implementations start with explicit contracts, not implicit assumptions.

Homework/Exercises

  1. Write five invariants for your project.
  2. Build a failing test input for each invariant.

Solutions

  1. Include at least one shape, one domain, one convergence, one reproducibility, and one output-range invariant.
  2. Each failing input should trigger exactly one diagnostic to keep root-cause analysis clean.

3. Build Blueprint

  1. Scope the smallest end-to-end slice that produces visible output.
  2. Add deterministic tests and edge-case probes.
  3. Layer complexity only after baseline behavior is stable.
  4. Add metrics logging before optimization.
  5. Run failure drills: perturb inputs, scale values, and check stability.

4. Real-World Outcome (Target)

$ python spam_filter.py train spam_dataset/

Training on 5000 emails (2500 spam, 2500 ham)...

Most spammy words:     Most hammy words:
  "viagra"    0.99       "meeting"   0.91
  "free"      0.89       "project"   0.88
  "winner"    0.87       "attached"  0.85
  "click"     0.84       "thanks"    0.82

$ python spam_filter.py predict "Congratulations! You've won a FREE iPhone! Click here!"

Analysis:
  P(spam | text) = 0.9987
  P(ham | text)  = 0.0013

  Key signals:
    "free" → strongly indicates spam
    "congratulations" → moderately indicates spam
    "click" → strongly indicates spam

Classification: SPAM (confidence: 99.87%)

$ python spam_filter.py evaluate test_dataset/

Precision: 0.94  (of predicted spam, 94% was actually spam)
Recall:    0.91  (of actual spam, 91% was caught)
F1 Score:  0.92
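The three metrics in the target output fall out of the confusion counts. A minimal sketch, assuming predictions and labels are parallel lists of 'spam'/'ham' strings (the function name is illustrative):

```python
def evaluate(predictions, labels, positive="spam"):
    """Compute precision, recall, and F1 for the positive (spam) class."""
    tp = sum(1 for p, y in zip(predictions, labels) if p == positive and y == positive)
    fp = sum(1 for p, y in zip(predictions, labels) if p == positive and y != positive)
    fn = sum(1 for p, y in zip(predictions, labels) if p != positive and y == positive)
    precision = tp / (tp + fp) if tp + fp else 0.0   # of predicted spam, how much was spam
    recall = tp / (tp + fn) if tp + fn else 0.0      # of actual spam, how much was caught
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```

Guarding the zero-denominator cases matters: on a test split with no predicted spam, a naive implementation divides by zero.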

Implementation Hints

Training:

P(word | spam) = (count of word in spam + 1) / (total spam words + vocab_size)

The +1 is Laplace smoothing (avoids zero probabilities).

Classification using log probabilities:

log P(spam | words) ∝ log P(spam) + Σ log P(word_i | spam)
Compare log P(spam | words) with log P(ham | words).
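A quick demonstration of why the log form is necessary: a product of a few hundred small likelihoods underflows to exactly zero, while the equivalent sum of logs stays finite.

```python
import math

likelihood = 1e-5   # a typical smoothed word probability
n_words = 400       # a long email

product = likelihood ** n_words           # 1e-2000 underflows float64 to 0.0
log_sum = n_words * math.log(likelihood)  # stays a finite negative number

print(product)   # 0.0
print(log_sum)   # ≈ -4605.17
```

Once both classes underflow to 0.0, the comparison becomes meaningless; in log space the two scores remain comparable.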

The “naive” assumption: words are independent given the class. Obviously false, but works surprisingly well!

Learning milestones:

  1. Classifier makes reasonable predictions → You understand Bayes’ theorem
  2. Log probabilities prevent underflow → You understand numerical stability
  3. You can explain why it’s “naive” → You understand conditional independence

5. Core Design Notes from Main Guide

Core Question

How do we update our beliefs when we see new evidence?

Bayes’ theorem is the mathematical answer to one of humanity’s deepest questions: how should rational beings learn from experience? When you see the word “free” in an email, how much should that update your belief that it’s spam? This project makes you confront the mathematics of belief revision, transforming prior assumptions into posterior knowledge.

Concepts You Must Understand First

Stop and research these before coding:

  1. Bayes’ Theorem
    • What is the exact formula: P(A | B) = P(B | A) * P(A) / P(B)?
    • What do “prior,” “likelihood,” “evidence,” and “posterior” mean?
    • Why is Bayes’ theorem just a restatement of the definition of conditional probability?
    • Book Reference: “Think Bayes” Chapter 1 - Allen Downey
  2. Conditional Probability
    • What does P(spam | “free”) mean in plain English?
    • How is P(A | B) different from P(B | A)? (The “prosecutor’s fallacy”)
    • How do you compute P(A | B) from a contingency table?
    • Book Reference: “Introduction to Probability” Chapter 2 - Blitzstein & Hwang
  3. Maximum Likelihood Estimation
    • What does it mean to estimate P(word | spam) from training data?
    • Why is counting occurrences a maximum likelihood estimate?
    • What happens when a word never appears in spam training data?
    • Book Reference: “All of Statistics” Chapter 9 - Larry Wasserman
  4. The “Naive” Independence Assumption
    • Why do we assume words are independent given the class?
    • Is this assumption ever true in real text?
    • Why does Naive Bayes work well despite this false assumption?
    • Book Reference: “Machine Learning” Chapter 6 - Tom Mitchell
  5. Log Probabilities and Numerical Stability
    • Why do we use log probabilities instead of regular probabilities?
    • What is numerical underflow and when does it happen?
    • How does log(a * b) = log(a) + log(b) save us?
    • Book Reference: “Speech and Language Processing” Chapter 4 - Jurafsky & Martin
  6. Laplace Smoothing (Add-One Smoothing)
    • What problem does smoothing solve?
    • Why add 1 to numerator and vocabulary to denominator?
    • What prior belief does Laplace smoothing encode?
    • Book Reference: “Introduction to Information Retrieval” Chapter 13 - Manning et al.

Questions to Guide Your Design

Before implementing, think through these:

  1. Text Preprocessing: How will you tokenize emails? Lowercase? Remove punctuation? Handle numbers?

  2. Vocabulary Management: Will you use all words or limit vocabulary size? How do you handle words seen at test time but not training time?

  3. Probability Estimation: For each word, how do you compute P(word | spam) and P(word | ham)?
  4. Classification Formula: How does the final classification decision work? What exactly are you comparing?

  5. Smoothing Strategy: What value of smoothing will you use? How does it affect rare vs common words?

  6. Evaluation Metrics: How will you measure success? Why might accuracy alone be misleading for spam detection?
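One possible answer to the preprocessing question, as a sketch (lowercase, alphabetic tokens only; your design may reasonably differ, e.g. you may want to keep numbers or handle apostrophes):

```python
import re

def tokenize(text: str):
    """Lowercase and keep runs of letters; drops numbers and punctuation."""
    return re.findall(r"[a-z]+", text.lower())

tokenize("Congratulations! You've won a FREE iPhone!")
# → ['congratulations', 'you', 've', 'won', 'a', 'free', 'iphone']
```

Note the tradeoff visible even in this tiny example: "You've" splits into "you" and "ve", and a different regex would change the vocabulary your classifier learns.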

Thinking Exercise

Work through Bayes’ theorem by hand:

Suppose you have training data with:

  • 100 spam emails, 100 ham emails
  • The word “free” appears in 60 spam emails and 5 ham emails
  • The word “meeting” appears in 10 spam emails and 50 ham emails

Now classify an email containing: “free meeting tomorrow”

Step 1: Compute Priors

  • P(spam) = 100/200 = ?
  • P(ham) = 100/200 = ?

Step 2: Compute Likelihoods (with add-1 smoothing, assume vocab size = 10000)

  • P(“free” | spam) = (60 + 1) / (100 + 10000) = ?
  • P(“free” | ham) = (5 + 1) / (100 + 10000) = ?
  • P(“meeting” | spam) = (10 + 1) / (100 + 10000) = ?
  • P(“meeting” | ham) = (50 + 1) / (100 + 10000) = ?

Step 3: Apply Bayes (ignoring “tomorrow” since it cancels)

P(spam | "free meeting") ∝ P(spam) * P("free"|spam) * P("meeting"|spam)
                        = 0.5 * ? * ?
                        = ?

P(ham | "free meeting") ∝ P(ham) * P("free"|ham) * P("meeting"|ham)
                       = 0.5 * ? * ?
                       = ?

Step 4: Normalize to get probabilities

Which is larger? What’s the classification?

Step 5: Now use log probabilities

Redo the calculation using log(P) = log(prior) + log(likelihood1) + log(likelihood2). Verify you get the same answer.
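If you want to verify your hand calculation, the exercise’s numbers can be scripted directly:

```python
import math

vocab = 10_000
prior_spam = prior_ham = 100 / 200   # 0.5 each

# Smoothed likelihoods, straight from Step 2 of the exercise.
p_free_spam    = (60 + 1) / (100 + vocab)
p_free_ham     = (5 + 1)  / (100 + vocab)
p_meeting_spam = (10 + 1) / (100 + vocab)
p_meeting_ham  = (50 + 1) / (100 + vocab)

spam_score = prior_spam * p_free_spam * p_meeting_spam
ham_score  = prior_ham  * p_free_ham  * p_meeting_ham

# Same comparison in log space; the winner must not change.
log_spam = math.log(prior_spam) + math.log(p_free_spam) + math.log(p_meeting_spam)
log_ham  = math.log(prior_ham)  + math.log(p_free_ham)  + math.log(p_meeting_ham)

print(spam_score > ham_score)                  # True → classified as SPAM
print(log_spam > log_ham)                      # True → log space agrees
print(spam_score / (spam_score + ham_score))   # normalized posterior ≈ 0.69
```

The margin is smaller than you might expect: “free” pushes toward spam, but “meeting” pushes back hard toward ham.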

Interview Questions

  1. “Explain Naive Bayes classification.”
    • Expected: Use Bayes’ theorem to compute P(class | features). “Naive” assumes feature independence given class. Multiply likelihoods, pick highest posterior class.
  2. “Why is it called ‘naive’?”
    • Expected: It assumes features are conditionally independent given the class. This is almost never true (e.g., “New York” words are dependent), but it works well anyway.
  3. “How do you handle words not seen during training?”
    • Expected: Laplace smoothing: add pseudocounts so no probability is zero. P(word | class) = (count + 1) / (total + vocab_size).
  4. “Why use log probabilities?”
    • Expected: Multiplying many small probabilities causes underflow. Logs convert products to sums, keeping numbers manageable.
  5. “What are the limitations of Naive Bayes?”
    • Expected: Independence assumption, struggles with correlated features, assumes features are equally important, calibration of probabilities can be poor.
  6. “Compare Naive Bayes to Logistic Regression.”
    • Expected: Both are linear classifiers! NB estimates parameters separately (generative), LR optimizes directly (discriminative). LR can learn feature interactions, NB cannot.
  7. “How would you handle imbalanced spam/ham ratio?”
    • Expected: Adjust priors based on real-world ratios. Or threshold adjustment on posterior probabilities. Or resampling training data.

Hints in Layers (Treat as pseudocode guidance)

Hint 1: Training Data Structure

Build two dictionaries: spam_word_counts and ham_word_counts. Also track total word counts in each class.

Hint 2: The Training Phase

from collections import defaultdict

def train(emails, labels):
    # defaultdict(int) gives unseen words a count of 0 automatically;
    # initializing counters locally avoids relying on module-level state.
    spam_word_counts = defaultdict(int)
    ham_word_counts = defaultdict(int)
    spam_total = ham_total = 0
    for email, label in zip(emails, labels):
        for word in tokenize(email):
            if label == 'spam':
                spam_word_counts[word] += 1
                spam_total += 1
            else:
                ham_word_counts[word] += 1
                ham_total += 1
    return spam_word_counts, ham_word_counts, spam_total, ham_total

Hint 3: The Probability Calculation

def word_probability(word, class_word_counts, class_total, vocab_size, alpha=1):
    count = class_word_counts.get(word, 0)
    return (count + alpha) / (class_total + alpha * vocab_size)

Hint 4: The Classification with Log Probabilities

import math

def classify(email):
    words = tokenize(email)

    # Start from the log priors, then add one log-likelihood per word.
    log_prob_spam = math.log(prior_spam)
    log_prob_ham = math.log(prior_ham)

    for word in words:
        log_prob_spam += math.log(word_probability(word, spam_word_counts, spam_total, vocab_size))
        log_prob_ham += math.log(word_probability(word, ham_word_counts, ham_total, vocab_size))

    return 'spam' if log_prob_spam > log_prob_ham else 'ham'

Hint 5: Getting Actual Probabilities

# Convert log probabilities to actual probabilities (for confidence)
from scipy.special import softmax
probs = softmax([log_prob_spam, log_prob_ham])
confidence = max(probs)
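If you prefer to avoid the scipy dependency, the two-class case is a small log-sum-exp; a stdlib-only sketch:

```python
import math

def posterior_from_logs(log_spam: float, log_ham: float) -> float:
    """Return P(spam | text) from unnormalized log scores, numerically stable."""
    m = max(log_spam, log_ham)            # subtract the max so exp() never overflows
    exp_spam = math.exp(log_spam - m)
    exp_ham = math.exp(log_ham - m)
    return exp_spam / (exp_spam + exp_ham)

posterior_from_logs(-100.0, -105.0)   # ≈ 0.9933
```

Subtracting the max first is the key trick: raw log scores like -4600 would otherwise make exp() underflow to 0.0 for both classes.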

Books That Will Help

Topic                      Book Chapter
Bayes’ Theorem Intuition   “Think Bayes” Chapters 1-2 - Allen Downey
Naive Bayes Classifier     “Machine Learning” Chapter 6 - Tom Mitchell
Text Classification        “Speech and Language Processing” Chapter 4 - Jurafsky & Martin
Smoothing Techniques       “Introduction to Information Retrieval” Chapter 13 - Manning et al.
Maximum Likelihood         “All of Statistics” Chapter 9 - Larry Wasserman
Probabilistic Classifiers  “Pattern Recognition and ML” Chapter 4 - Bishop


6. Validation, Pitfalls, and Completion

Common Pitfalls and Debugging

Problem 1: “Outputs drift after a few iterations”

  • Why: Hidden numerical instability (unscaled features, aggressive step size, or repeated subtraction of nearly equal values).
  • Fix: Normalize inputs, reduce step size, and track relative error rather than only absolute error.
  • Quick test: Run the same task with two scales of input (for example x and 10x) and compare normalized error curves.

Problem 2: “Results are inconsistent across runs”

  • Why: Random seeds, data split randomness, or non-deterministic ordering are uncontrolled.
  • Fix: Set seeds, log configuration, and store split indices and hyperparameters with each run.
  • Quick test: Re-run three times with the same seed and confirm metrics remain inside a tight tolerance band.

Problem 3: “The project works on the demo case but fails on edge cases”

  • Why: Tests only cover happy-path inputs.
  • Fix: Add adversarial inputs (empty values, extreme ranges, near-singular matrices, rare classes).
  • Quick test: Build an edge-case test matrix and ensure every scenario reports expected behavior.
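A starter edge-case matrix for the tokenization stage (the tokenizer here is illustrative; extend the same table pattern to cover training and classification):

```python
import re

def tokenize(text):
    # Illustrative tokenizer: lowercase, alphabetic runs only.
    return re.findall(r"[a-z]+", text.lower())

# Each row: (description, input, expected tokens).
edge_cases = [
    ("empty string",     "",            []),
    ("punctuation only", "!!! ??? ...", []),
    ("numbers only",     "12345 678",   []),
    ("mixed case",       "FrEe FREE",   ["free", "free"]),
]

for name, text, expected in edge_cases:
    assert tokenize(text) == expected, f"edge case failed: {name}"
```

An empty token list is worth special attention downstream: classify() on an empty email should fall back to comparing priors, not crash.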

Definition of Done

  • Core functionality works on reference inputs
  • Edge cases are tested and documented
  • Results are reproducible (seeded and versioned configuration)
  • Performance or convergence behavior is measured and explained
  • A short retrospective explains what failed first and how you fixed it

7. Extension Ideas

  1. Add a stress-test mode with adversarial inputs.
  2. Add a short benchmark report (runtime + memory + error trend).
  3. Add a reproducibility bundle (seed, config, and fixed test corpus).

8. Why This Project Matters


This project is valuable because it creates observable evidence of mathematical reasoning under real implementation constraints.