Project 5: Bayesian Spam Filter

Build a naive Bayes spam filter that classifies emails by combining word-level evidence.


Project Overview

Attribute               Value
Difficulty              Level 2: Intermediate
Time Estimate           Weekend
Main Language           Python
Alternative Languages   JavaScript, C++
Knowledge Area          Probability and statistics
Tools                   CLI, text dataset
Main Book               “Pattern Recognition and Machine Learning” by Bishop

What you’ll build: A classifier that labels messages as spam or not spam using Bayes’ rule.

Why it teaches math: You must compute conditional probabilities and combine evidence logically.

Core challenges you’ll face:

  • Tokenizing text consistently (a minimal tokenizer is sketched after this list)
  • Computing probabilities with smoothing
  • Evaluating accuracy and false positives
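
A minimal tokenizer, as promised above. The regex and lowercasing are one consistent scheme, not the only one; stemming and stop-word handling are left as design choices.

import re

def tokenize(text):
    # Lowercase, then keep runs of letters (and apostrophes) as tokens.
    return re.findall(r"[a-z']+", text.lower())

print(tokenize("FREE money!!! Click now."))  # ['free', 'money', 'click', 'now']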

Real World Outcome

You will train on a small dataset and classify new messages with confidence scores.

Example Output:

$ python spam_filter.py --train data/ --test sample.txt
Spam probability: 0.94
Classification: SPAM

Verification steps:

  • Test against known labeled messages
  • Measure precision and recall

The Core Question You’re Answering

“How do you combine multiple pieces of evidence into a single probability?”

This is Bayes’ rule in practice.
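
Written out for the spam case, the combination rule looks like this; the “naive” step is assuming the words w_i are conditionally independent given the class:

$$
P(\mathrm{spam} \mid w_1, \dots, w_n) \propto P(\mathrm{spam}) \prod_{i=1}^{n} P(w_i \mid \mathrm{spam})
$$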


Concepts You Must Understand First

Stop and research these before coding:

  1. Bayes’ theorem
    • How do you compute P(spam | words)?
    • Book Reference: “Pattern Recognition and Machine Learning” by Bishop, Ch. 1
  2. Conditional independence
    • Why does naive Bayes assume word independence?
    • Book Reference: “Pattern Recognition and Machine Learning” by Bishop, Ch. 4
  3. Smoothing
    • Why do you need Laplace smoothing for rare words? (One common estimate is sketched after this list.)
    • Book Reference: “Speech and Language Processing” by Jurafsky & Martin, Ch. 4
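
The add-one (Laplace) estimate referenced in item 3, where N_spam is the total number of word tokens seen in spam and |V| is the vocabulary size. This is one common form, not the only smoothing choice:

$$
P(w \mid \mathrm{spam}) = \frac{\mathrm{count}(w, \mathrm{spam}) + 1}{N_{\mathrm{spam}} + |V|}
$$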

Questions to Guide Your Design

  1. Feature choice
    • Will you use word counts or binary presence?
    • How will you handle stop words?
  2. Evaluation
    • How will you split training and test data? (A minimal split is sketched after this list.)
    • Which metric matters most for spam filtering?
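
A minimal shuffle-and-split, as promised in question 2. The 80/20 ratio and the fixed seed are assumptions made for reproducibility, not requirements:

import random

def split(messages, test_fraction=0.2, seed=42):
    # Shuffle labeled (text, label) pairs, then cut once.
    data = list(messages)
    random.Random(seed).shuffle(data)
    cut = int(len(data) * (1 - test_fraction))
    return data[:cut], data[cut:]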

Thinking Exercise

Bayes by Hand

Given that the word “free” appears in 30% of spam and 2% of non-spam, compute how it changes the spam probability. (A worked check follows the questions below.)

Questions while working:

  • Why can a rare word be strong evidence?
  • How does prior probability affect the result?
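
One way to check your hand computation. The 50% prior below is an assumption; the exercise deliberately leaves the prior open:

# Assumes prior P(spam) = 0.5; only the single word "free" is observed.
p_free_spam, p_free_ham, prior = 0.30, 0.02, 0.5
evidence = p_free_spam * prior + p_free_ham * (1 - prior)
print(p_free_spam * prior / evidence)  # 0.9375: one word moves 0.5 to ~0.94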

The Interview Questions They’ll Ask

Prepare to answer these:

  1. “What is Bayes’ theorem?”
  2. “Why does naive Bayes work well for text?”
  3. “What is Laplace smoothing?”
  4. “What is the difference between precision and recall?”
  5. “How do you handle unseen words?”

Hints in Layers

Hint 1: Starting Point. Start with word counts and compute class probabilities.
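
A sketch of that starting point, assuming messages arrive as (token_list, label) pairs with labels "spam" and "ham" (both label names are assumptions of this sketch):

from collections import Counter

def train(messages):
    # Per-class word counts plus class priors from label frequencies.
    word_counts = {"spam": Counter(), "ham": Counter()}
    class_counts = Counter()
    for tokens, label in messages:
        class_counts[label] += 1
        word_counts[label].update(tokens)
    total = sum(class_counts.values())
    priors = {c: n / total for c, n in class_counts.items()}
    return word_counts, priors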

Hint 2: Next Level. Use log probabilities to avoid underflow.
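
In log space the product of word probabilities becomes a sum, so hundreds of small factors no longer underflow to zero. A sketch building on the train() output above, with add-one smoothing folded in:

import math

def score(tokens, word_counts, priors, vocab_size):
    # Returns a log score per class; the largest score wins.
    scores = {}
    for label, prior in priors.items():
        counts = word_counts[label]
        n = sum(counts.values())
        s = math.log(prior)
        for w in tokens:
            s += math.log((counts[w] + 1) / (n + vocab_size))
        scores[label] = s
    return scores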

Hint 3: Technical Details. Apply Laplace smoothing to every word count so a single unseen word cannot zero out the whole product.

Hint 4: Tools/Debugging. Inspect the top-weighted words for each class to validate your logic.
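
One way to run that check, assuming the counts from the sketches above: rank words by the log ratio of their smoothed per-class probabilities and eyeball the extremes. Obvious spam words should surface at the top; if they don't, the counting or smoothing is wrong.

import math

def top_words(word_counts, vocab, k=10):
    # Largest spam/ham log-likelihood ratios under add-one smoothing.
    n_spam = sum(word_counts["spam"].values())
    n_ham = sum(word_counts["ham"].values())
    def ratio(w):
        p_spam = (word_counts["spam"][w] + 1) / (n_spam + len(vocab))
        p_ham = (word_counts["ham"][w] + 1) / (n_ham + len(vocab))
        return math.log(p_spam / p_ham)
    return sorted(vocab, key=ratio, reverse=True)[:k]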


Books That Will Help

Topic            Book                                                      Chapter
Bayes’ theorem   “Pattern Recognition and Machine Learning” by Bishop      Ch. 1
Naive Bayes      “Pattern Recognition and Machine Learning” by Bishop      Ch. 4
Smoothing        “Speech and Language Processing” by Jurafsky & Martin     Ch. 4

Implementation Hints

  • Sum log probabilities rather than multiplying raw ones to keep numerical stability.
  • Cap the vocabulary size to limit noise from rare tokens.
  • Track a confusion matrix for evaluation (a minimal sketch follows this list).
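
A minimal evaluation sketch, assuming string labels "spam" and "ham"; precision and recall both fall out of the four confusion-matrix cells:

def evaluate(predictions, labels):
    # Counts for the spam class: true/false positives, false/true negatives.
    pairs = list(zip(predictions, labels))
    tp = sum(p == "spam" and y == "spam" for p, y in pairs)
    fp = sum(p == "spam" and y == "ham" for p, y in pairs)
    fn = sum(p == "ham" and y == "spam" for p, y in pairs)
    tn = sum(p == "ham" and y == "ham" for p, y in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return {"tp": tp, "fp": fp, "fn": fn, "tn": tn,
            "precision": precision, "recall": recall}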

Learning Milestones

  1. First milestone: You can compute spam probabilities for test messages.
  2. Second milestone: You can measure precision and recall.
  3. Final milestone: You can explain why naive Bayes works despite assumptions.