Project 5: Bayesian Spam Filter

Build a naive Bayes spam filter that classifies emails by combining word-level evidence.


Project Overview

Attribute               Value
Difficulty              Level 2: Intermediate
Time Estimate           Weekend
Main Language           Python
Alternative Languages   JavaScript, C++
Knowledge Area          Probability and statistics
Tools                   CLI, text dataset
Main Book               “Pattern Recognition and Machine Learning” by Bishop

What you’ll build: A classifier that labels messages as spam or not spam using Bayes’ rule.

Why it teaches math: You must compute conditional probabilities and combine evidence logically.

Core challenges you’ll face:

  • Tokenizing text consistently (a minimal tokenizer is sketched after this list)
  • Computing probabilities with smoothing
  • Evaluating accuracy and false positives
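
A minimal tokenizer, as promised above. The regex and lowercasing are one consistent scheme, not the only one; stemming and stop-word handling are left as design choices.

import re

def tokenize(text):
    # Lowercase, then keep runs of letters (and apostrophes) as tokens.
    return re.findall(r"[a-z']+", text.lower())

print(tokenize("FREE money!!! Click now."))  # ['free', 'money', 'click', 'now']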

Real World Outcome

You will train on a small dataset and classify new messages with confidence scores.

Example Output:

$ python spam_filter.py --train data/ --test sample.txt
Spam probability: 0.94
Classification: SPAM

Verification steps:

  • Test against known labeled messages
  • Measure precision and recall

The Core Question You’re Answering

“How do you combine multiple pieces of evidence into a single probability?”

This is Bayes’ rule in practice.
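
Written out for the spam case, the combination rule looks like this; the “naive” step is assuming the words w_i are conditionally independent given the class:

$$
P(\mathrm{spam} \mid w_1, \dots, w_n) \propto P(\mathrm{spam}) \prod_{i=1}^{n} P(w_i \mid \mathrm{spam})
$$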


Concepts You Must Understand First

Stop and research these before coding:

  1. Bayes’ theorem
    • How do you compute P(spam | words)?
    • Book Reference: “Pattern Recognition and Machine Learning” by Bishop, Ch. 1
  2. Conditional independence
    • Why does naive Bayes assume word independence?
    • Book Reference: “Pattern Recognition and Machine Learning” by Bishop, Ch. 4
  3. Smoothing
    • Why do you need Laplace smoothing for rare words? (One common estimate is sketched after this list.)
    • Book Reference: “Speech and Language Processing” by Jurafsky & Martin, Ch. 4
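
The add-one (Laplace) estimate referenced in item 3, where N_spam is the total number of word tokens seen in spam and |V| is the vocabulary size. This is one common form, not the only smoothing choice:

$$
P(w \mid \mathrm{spam}) = \frac{\mathrm{count}(w, \mathrm{spam}) + 1}{N_{\mathrm{spam}} + |V|}
$$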

Questions to Guide Your Design

  1. Feature choice
    • Will you use word counts or binary presence?
    • How will you handle stop words?
  2. Evaluation
    • How will you split training and test data? (A minimal split is sketched after this list.)
    • Which metric matters most for spam filtering?
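
A minimal shuffle-and-split, as promised in question 2. The 80/20 ratio and the fixed seed are assumptions made for reproducibility, not requirements:

import random

def split(messages, test_fraction=0.2, seed=42):
    # Shuffle labeled (text, label) pairs, then cut once.
    data = list(messages)
    random.Random(seed).shuffle(data)
    cut = int(len(data) * (1 - test_fraction))
    return data[:cut], data[cut:]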

Thinking Exercise

Bayes by Hand

Given that the word “free” appears in 30% of spam and 2% of non-spam, compute how it changes the spam probability. (A worked check follows the questions below.)

Questions while working:

  • Why can a rare word be strong evidence?
  • How does prior probability affect the result?
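
One way to check your hand computation. The 50% prior below is an assumption; the exercise deliberately leaves the prior open:

# Assumes prior P(spam) = 0.5; only the single word "free" is observed.
p_free_spam, p_free_ham, prior = 0.30, 0.02, 0.5
evidence = p_free_spam * prior + p_free_ham * (1 - prior)
print(p_free_spam * prior / evidence)  # 0.9375: one word moves 0.5 to ~0.94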

The Interview Questions They’ll Ask

Prepare to answer these:

  1. “What is Bayes’ theorem?”
  2. “Why does naive Bayes work well for text?”
  3. “What is Laplace smoothing?”
  4. “What is the difference between precision and recall?”
  5. “How do you handle unseen words?”

Hints in Layers

Hint 1: Starting Point. Start with word counts and compute class probabilities.
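
A sketch of that starting point, assuming messages arrive as (token_list, label) pairs with labels "spam" and "ham" (both label names are assumptions of this sketch):

from collections import Counter

def train(messages):
    # Per-class word counts plus class priors from label frequencies.
    word_counts = {"spam": Counter(), "ham": Counter()}
    class_counts = Counter()
    for tokens, label in messages:
        class_counts[label] += 1
        word_counts[label].update(tokens)
    total = sum(class_counts.values())
    priors = {c: n / total for c, n in class_counts.items()}
    return word_counts, priors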

Hint 2: Next Level. Use log probabilities to avoid underflow.
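
In log space the product of word probabilities becomes a sum, so hundreds of small factors no longer underflow to zero. A sketch building on the train() output above, with add-one smoothing folded in:

import math

def score(tokens, word_counts, priors, vocab_size):
    # Returns a log score per class; the largest score wins.
    scores = {}
    for label, prior in priors.items():
        counts = word_counts[label]
        n = sum(counts.values())
        s = math.log(prior)
        for w in tokens:
            s += math.log((counts[w] + 1) / (n + vocab_size))
        scores[label] = s
    return scores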

Hint 3: Technical Details. Apply Laplace smoothing to every word count so a single unseen word cannot zero out the whole product.

Hint 4: Tools/Debugging. Inspect the top-weighted words for each class to validate your logic.
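
One way to run that check, assuming the counts from the sketches above: rank words by the log ratio of their smoothed per-class probabilities and eyeball the extremes. Obvious spam words should surface at the top; if they don't, the counting or smoothing is wrong.

import math

def top_words(word_counts, vocab, k=10):
    # Largest spam/ham log-likelihood ratios under add-one smoothing.
    n_spam = sum(word_counts["spam"].values())
    n_ham = sum(word_counts["ham"].values())
    def ratio(w):
        p_spam = (word_counts["spam"][w] + 1) / (n_spam + len(vocab))
        p_ham = (word_counts["ham"][w] + 1) / (n_ham + len(vocab))
        return math.log(p_spam / p_ham)
    return sorted(vocab, key=ratio, reverse=True)[:k]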


Books That Will Help

Topic            Book                                                      Chapter
Bayes’ theorem   “Pattern Recognition and Machine Learning” by Bishop      Ch. 1
Naive Bayes      “Pattern Recognition and Machine Learning” by Bishop      Ch. 4
Smoothing        “Speech and Language Processing” by Jurafsky & Martin     Ch. 4

Implementation Hints

  • Sum log probabilities rather than multiplying raw ones to keep numerical stability.
  • Cap the vocabulary size to limit noise from rare tokens.
  • Track a confusion matrix for evaluation (a minimal sketch follows this list).
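
A minimal evaluation sketch, assuming string labels "spam" and "ham"; precision and recall both fall out of the four confusion-matrix cells:

def evaluate(predictions, labels):
    # Counts for the spam class: true/false positives, false/true negatives.
    pairs = list(zip(predictions, labels))
    tp = sum(p == "spam" and y == "spam" for p, y in pairs)
    fp = sum(p == "spam" and y == "ham" for p, y in pairs)
    fn = sum(p == "ham" and y == "spam" for p, y in pairs)
    tn = sum(p == "ham" and y == "ham" for p, y in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return {"tp": tp, "fp": fp, "fn": fn, "tn": tn,
            "precision": precision, "recall": recall}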

Learning Milestones

  1. First milestone: You can compute spam probabilities for test messages.
  2. Second milestone: You can measure precision and recall.
  3. Final milestone: You can explain why naive Bayes works despite assumptions.