Project 5: Bayesian Spam Filter
Build a naive Bayes spam filter that classifies emails as spam or not spam based on word-level evidence.
Project Overview
| Attribute | Value |
|---|---|
| Difficulty | Level 2: Intermediate |
| Time Estimate | Weekend |
| Main Language | Python |
| Alternative Languages | JavaScript, C++ |
| Knowledge Area | Probability and statistics |
| Tools | CLI, text dataset |
| Main Book | “Pattern Recognition and Machine Learning” by Bishop |
What you’ll build: A classifier that labels messages as spam or not using Bayes’ rule.
Why it teaches math: You must compute conditional probabilities and combine evidence logically.
Core challenges you’ll face:
- Tokenizing text consistently (a minimal tokenizer sketch follows this list)
- Computing probabilities with smoothing
- Evaluating accuracy and false positives
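One reasonable tokenizer is sketched below; the lowercase, letters-only rules are an assumption, and whatever rules you choose must be applied identically at training and test time.

```python
import re

def tokenize(text):
    # Lowercase and keep runs of letters/apostrophes: one reasonable
    # convention, not the only one. Use the same rules everywhere.
    return re.findall(r"[a-z']+", text.lower())

# tokenize("FREE offer!!! Click now") -> ['free', 'offer', 'click', 'now']
```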
Real World Outcome
You will train on a small dataset and classify new messages with confidence scores.
Example Output:
```
$ python spam_filter.py --train data/ --test sample.txt
Spam probability: 0.94
Classification: SPAM
```
Verification steps:
- Test against known labeled messages
- Measure precision and recall
The Core Question You’re Answering
“How do you combine multiple pieces of evidence into a single probability?”
This is Bayes’ rule in practice.
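Concretely, naive Bayes multiplies the class prior by each word's likelihood and then normalizes: P(spam | words) is proportional to P(spam) × P(w1 | spam) × ... × P(wn | spam). A minimal sketch of that combination, assuming you already have per-word likelihood tables (the dict arguments here are hypothetical):

```python
def spam_posterior(words, prior_spam, p_word_spam, p_word_ham):
    # Combine evidence under the naive independence assumption:
    # multiply the prior by each word's likelihood for each class,
    # then normalize over the two classes to get a probability.
    spam = prior_spam
    ham = 1.0 - prior_spam
    for w in words:
        spam *= p_word_spam.get(w, 1.0)  # unseen words skipped here; smoothing handles them properly
        ham *= p_word_ham.get(w, 1.0)
    return spam / (spam + ham)
```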
Concepts You Must Understand First
Stop and research these before coding:
- Bayes’ theorem
- How do you compute P(spam | words)?
- Book Reference: “Pattern Recognition and Machine Learning” by Bishop, Ch. 1
- Conditional independence
- Why does naive Bayes assume word independence?
- Book Reference: “Pattern Recognition and Machine Learning” by Bishop, Ch. 4
- Smoothing
- Why do you need Laplace smoothing for rare words?
- Book Reference: “Speech and Language Processing” by Jurafsky & Martin, Ch. 4
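To make the smoothing question concrete, here is a minimal sketch of add-one (Laplace) smoothing; `word_counts`, `total_words`, and `vocab_size` are hypothetical names for your per-class statistics:

```python
def smoothed_likelihood(word, word_counts, total_words, vocab_size, alpha=1.0):
    # P(word | class) with add-alpha smoothing. Without it, a word that
    # never appeared in one class would zero out the entire product.
    return (word_counts.get(word, 0) + alpha) / (total_words + alpha * vocab_size)
```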
Questions to Guide Your Design
- Feature choice
- Will you use word counts or binary presence? (see the sketch after this list)
- How will you handle stop words?
- Evaluation
- How will you split training and test data?
- Which metric matters most for spam filtering?
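The feature-choice question above can be made concrete with a small sketch; both helpers take a token list such as the output of the `tokenize` sketch earlier:

```python
from collections import Counter

def count_features(tokens):
    # Multinomial view: how many times each word occurs in the message.
    return Counter(tokens)

def binary_features(tokens):
    # Bernoulli-style view: only whether each word occurs at all.
    return set(tokens)
```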
Thinking Exercise
Bayes by Hand
Given that the word “free” appears in 30% of spam messages and 2% of non-spam messages, compute how observing it changes the probability that a message is spam.
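One way to check your hand computation, assuming a 50% prior (the exercise does not fix a prior, so that number is an assumption):

```python
p_free_given_spam = 0.30
p_free_given_ham = 0.02
prior_spam = 0.5  # assumed prior; vary it to see how much it matters

posterior = (p_free_given_spam * prior_spam) / (
    p_free_given_spam * prior_spam + p_free_given_ham * (1 - prior_spam)
)
print(round(posterior, 3))  # 0.938: one word moves the estimate from 0.50 to roughly 0.94
```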
Questions while working:
- Why can a rare word be strong evidence?
- How does prior probability affect the result?
The Interview Questions They’ll Ask
Prepare to answer these:
- “What is Bayes’ theorem?”
- “Why does naive Bayes work well for text?”
- “What is Laplace smoothing?”
- “What is the difference between precision and recall?”
- “How do you handle unseen words?”
Hints in Layers
Hint 1 (Starting Point): Start with word counts and compute class probabilities.
Hint 2 (Next Level): Use log probabilities to avoid underflow; a scoring sketch follows these hints.
Hint 3 (Technical Details): Apply Laplace smoothing to every word count.
Hint 4 (Tools/Debugging): Inspect the top-weighted words for each class to validate your logic.
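A minimal sketch of the log-probability scoring from Hint 2, assuming you already have a log prior and a dict of smoothed log likelihoods per class (the argument names are placeholders):

```python
def log_score(words, log_prior, log_likelihoods):
    # Sum log probabilities instead of multiplying raw probabilities so
    # that long messages do not underflow to 0.0.
    score = log_prior
    for w in words:
        if w in log_likelihoods:
            score += log_likelihoods[w]
    return score

# Classify by computing log_score with the spam tables and the ham tables;
# the class with the larger score wins.
```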
Books That Will Help
| Topic | Book | Chapter |
|---|---|---|
| Bayes’ theorem | “Pattern Recognition and Machine Learning” by Bishop | Ch. 1 |
| Naive Bayes | “Pattern Recognition and Machine Learning” by Bishop | Ch. 4 |
| Smoothing | “Speech and Language Processing” by Jurafsky & Martin | Ch. 4 |
Implementation Hints
- Sum log probabilities to maintain numerical stability.
- Cap the vocabulary size to reduce noise from rare tokens.
- Track a confusion matrix for evaluation (a sketch follows this list).
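A minimal evaluation sketch built on the confusion matrix counts, assuming a `predict(message)` function that returns "spam" or "ham" (a hypothetical name for your classifier):

```python
def precision_recall(test_set, predict):
    # test_set: list of (message, true_label) pairs with labels "spam"/"ham".
    tp = fp = fn = 0
    for message, label in test_set:
        guess = predict(message)
        if guess == "spam" and label == "spam":
            tp += 1
        elif guess == "spam" and label == "ham":
            fp += 1  # false positive: a legitimate message flagged as spam
        elif guess == "ham" and label == "spam":
            fn += 1  # false negative: spam that slipped through
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall
```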
Learning Milestones
- First milestone: You can compute spam probabilities for test messages.
- Second milestone: You can measure precision and recall.
- Final milestone: You can explain why naive Bayes works well despite its unrealistic independence assumption.