# Project 3: A “Dumb” Spam Filter
Build a naive keyword-based spam filter and measure its accuracy.
## Project Overview
| Attribute | Value |
|---|---|
| Difficulty | Level 1: Beginner |
| Time Estimate | Weekend |
| Main Language | Python |
| Alternative Languages | R, JavaScript |
| Knowledge Area | Classification |
| Tools | Text dataset |
| Main Book | “Practical Statistics for Data Scientists” by Bruce & Gedeck |
What you’ll build: A simple rule-based spam filter and a report of false positives/negatives.
Why it teaches stats: Classification performance is all about errors and tradeoffs.
Core challenges you’ll face:
- Defining keyword rules
- Measuring accuracy, precision, recall
- Balancing false positives and negatives
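The first challenge above can be sketched as a tiny rule function. The keyword set here is purely illustrative, not a vetted spam vocabulary:

```python
# A minimal keyword rule: flag a message as spam if any of its words
# appears in the keyword set. Keywords are illustrative guesses.
SPAM_KEYWORDS = {"free", "winner", "prize", "urgent", "click"}

def is_spam(message: str) -> bool:
    # Lowercase and split on whitespace, then test for any overlap.
    # Note: punctuation is not stripped, so "winner," would not match --
    # a real rule set would normalize tokens first.
    return bool(set(message.lower().split()) & SPAM_KEYWORDS)

print(is_spam("Claim your free prize now"))      # True
print(is_spam("Meeting moved to 3pm tomorrow"))  # False
```

Even this toy version surfaces the design questions below: which words belong in the set, and what happens when a legitimate message contains one of them.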
## Real World Outcome
You will classify messages and report a confusion matrix.
Example Output:
Accuracy: 0.82
Precision: 0.76
Recall: 0.69
Verification steps:
- Inspect false positives manually
- Test with different keyword sets
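The reported numbers all come from the four cells of the confusion matrix. A minimal sketch of computing them from parallel lists of true and predicted labels (`True` = spam; the example labels are made up):

```python
# Count the confusion-matrix cells, then derive accuracy, precision,
# and recall from them.
def evaluate(y_true, y_pred):
    tp = sum(t and p for t, p in zip(y_true, y_pred))          # spam caught
    fp = sum((not t) and p for t, p in zip(y_true, y_pred))    # ham flagged
    tn = sum((not t) and (not p) for t, p in zip(y_true, y_pred))
    fn = sum(t and (not p) for t, p in zip(y_true, y_pred))    # spam missed
    return {
        "tp": tp, "fp": fp, "tn": tn, "fn": fn,
        "accuracy": (tp + tn) / len(y_true),
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }

y_true = [True, True, True, False, False, False]
y_pred = [True, True, False, True, False, False]
print(evaluate(y_true, y_pred))
```

Keeping the four raw counts alongside the derived metrics makes the manual inspection step easier: the `fp` and `fn` counts tell you how many messages to go read.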
## The Core Question You’re Answering
“How do you measure the quality of a classification rule?”
This is the practical side of statistics: the same rule can look good or bad depending on which errors you choose to count.
## Concepts You Must Understand First
Stop and research these before coding:
- Confusion matrix
  - What do TP, FP, TN, FN mean?
  - Book Reference: “Practical Statistics for Data Scientists” Ch. 5
- Precision vs recall
  - Why do they trade off?
  - Book Reference: “Practical Statistics for Data Scientists” Ch. 5
- Base rates
  - How does class imbalance affect accuracy?
  - Book Reference: “Naked Statistics” Ch. 8
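The precision/recall tradeoff can be seen directly in a keyword rule: requiring more keyword hits before flagging makes the rule stricter, which tends to raise precision (fewer false alarms) and lower recall (more missed spam). A sketch with an illustrative keyword set:

```python
# Score a message by how many spam keywords it contains, then compare
# a lenient threshold against a strict one.
SPAM_KEYWORDS = {"free", "winner", "prize", "urgent"}

def spam_score(message: str) -> int:
    return len(set(message.lower().split()) & SPAM_KEYWORDS)

msg = "free prize inside"            # contains 2 keyword hits
print(spam_score(msg) >= 1)          # True  -- lenient rule flags it
print(spam_score(msg) >= 3)          # False -- strict rule lets it through
```

The threshold is the knob: every setting trades one kind of error for the other, which is exactly what the interview question about choosing a threshold is probing.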
## Questions to Guide Your Design
- Rule design
  - Which keywords are strong spam indicators?
  - How will you handle messages that mix spammy and legitimate words?
- Evaluation
  - How will you split training and test sets?
  - Will you use cross-validation?
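One simple answer to the splitting question is a single held-out test set: shuffle once with a fixed seed for reproducibility, then reserve a fraction for testing. A sketch (the 20% fraction and seed are arbitrary choices):

```python
import random

# Shuffle indices with a fixed seed, then cut off the last test_frac
# of the data as the held-out test set.
def train_test_split(messages, labels, test_frac=0.2, seed=42):
    idx = list(range(len(messages)))
    random.Random(seed).shuffle(idx)
    cut = int(len(idx) * (1 - test_frac))
    train_idx, test_idx = idx[:cut], idx[cut:]
    train = ([messages[i] for i in train_idx], [labels[i] for i in train_idx])
    test = ([messages[i] for i in test_idx], [labels[i] for i in test_idx])
    return train, test
```

The important discipline is tuning keyword rules only on the training portion; metrics reported on messages you peeked at while writing rules will be optimistic.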
## Thinking Exercise: The Accuracy Trap

If 95% of emails are not spam, what accuracy do you get by labeling everything “not spam”?
Questions while working:
- Why is accuracy misleading here?
- What metric would you trust more?
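The arithmetic behind the trap, worked through on a hypothetical mailbox of 1,000 messages:

```python
# The do-nothing classifier: label everything "not spam".
n, spam_rate = 1000, 0.05
spam = int(n * spam_rate)   # 50 actual spam messages
ham = n - spam              # 950 actual ham messages

# All ham is classified correctly, all spam is missed.
accuracy = ham / n          # 0.95 -- looks impressive
recall = 0 / spam           # 0.0  -- every spam message gets through
print(accuracy, recall)     # 0.95 0.0
```

Recall (and precision, which is undefined or zero here since nothing is flagged) exposes what accuracy hides: the classifier does nothing useful.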
## The Interview Questions They’ll Ask
Prepare to answer these:
- “What is a confusion matrix?”
- “Why can accuracy be misleading?”
- “What is precision vs recall?”
- “How do you handle class imbalance?”
- “How do you choose a threshold?”
## Hints in Layers
- Hint 1 (Starting Point): Start with a short list of spam keywords.
- Hint 2 (Next Level): Compute precision and recall, not just accuracy.
- Hint 3 (Technical Details): Track false positives and false negatives explicitly.
- Hint 4 (Tools/Debugging): Review misclassified messages to improve rules.
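Hint 4 can be turned into a helper that collects the misclassified messages for manual review. A sketch, where `classifier` is whatever rule function you built and the toy rule and data are for illustration only:

```python
# Split errors into the two kinds so each can be read by hand:
# false positives (ham we flagged) and false negatives (spam we missed).
def misclassified(messages, labels, classifier):
    false_pos = [m for m, actually_spam in zip(messages, labels)
                 if classifier(m) and not actually_spam]
    false_neg = [m for m, actually_spam in zip(messages, labels)
                 if not classifier(m) and actually_spam]
    return false_pos, false_neg

rule = lambda m: "free" in m.lower()          # toy single-keyword rule
msgs = ["free tickets inside", "lunch at noon", "win big now"]
labels = [True, False, True]                  # True = actually spam
fp, fn = misclassified(msgs, labels, rule)
print(fp)  # []                -- no ham was flagged
print(fn)  # ['win big now']   -- spam the rule missed
```

Reading the false negatives suggests new keywords to add; reading the false positives warns you which keywords are too aggressive.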
## Books That Will Help
| Topic | Book | Chapter |
|---|---|---|
| Confusion matrix | “Practical Statistics for Data Scientists” | Ch. 5 |
| Precision/recall | “Practical Statistics for Data Scientists” | Ch. 5 |
| Base rates | “Naked Statistics” | Ch. 8 |
## Implementation Hints
- Keep rules configurable in a text file.
- Report metrics on a held-out test set.
- Document tradeoffs between precision and recall.
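Keeping rules in a text file can be as simple as one keyword per line. A sketch of a loader; the file format (blank lines and `#` comments skipped) and the temporary-file demo are assumptions for illustration:

```python
import os
import tempfile

# Load one keyword per line; skip blank lines and '#' comment lines.
def load_keywords(path):
    with open(path, encoding="utf-8") as f:
        return {line.strip().lower() for line in f
                if line.strip() and not line.lstrip().startswith("#")}

# Demo with a temporary file; in the project this would be a
# checked-in rules file you edit between experiments.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as f:
    f.write("# spam keywords\nFree\nprize\n\nurgent\n")
    path = f.name
keywords = load_keywords(path)
print(sorted(keywords))  # ['free', 'prize', 'urgent']
os.remove(path)
```

With the rules externalized, re-running the evaluation after each edit to the file gives a fast loop for the test-different-keyword-sets verification step.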
## Learning Milestones
- First milestone: You can build a rule-based classifier.
- Second milestone: You can compute precision and recall.
- Final milestone: You can explain metric tradeoffs.