# Project 3: A “Dumb” Spam Filter
Build a naive keyword-based spam filter and measure its accuracy.
## Project Overview
| Attribute | Value |
|---|---|
| Difficulty | Level 1: Beginner |
| Time Estimate | Weekend |
| Main Language | Python |
| Alternative Languages | R, JavaScript |
| Knowledge Area | Classification |
| Tools | Text dataset |
| Main Book | “Practical Statistics for Data Scientists” by Bruce & Gedeck |
What you’ll build: A simple rule-based spam filter and a report of false positives/negatives.
Why it teaches stats: Classification performance is all about errors and tradeoffs.
Core challenges you’ll face:
- Defining keyword rules
- Measuring accuracy, precision, recall
- Balancing false positives and negatives
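The first challenge above can be sketched as a tiny rule function. The keyword set here is purely illustrative, not a vetted spam vocabulary:

```python
# A minimal keyword rule: flag a message as spam if any of its words
# appears in the keyword set. Keywords are illustrative guesses.
SPAM_KEYWORDS = {"free", "winner", "prize", "urgent", "click"}

def is_spam(message: str) -> bool:
    # Lowercase and split on whitespace, then test for any overlap.
    # Note: punctuation is not stripped, so "winner," would not match --
    # a real rule set would normalize tokens first.
    return bool(set(message.lower().split()) & SPAM_KEYWORDS)

print(is_spam("Claim your free prize now"))      # True
print(is_spam("Meeting moved to 3pm tomorrow"))  # False
```

Even this toy version surfaces the design questions below: which words belong in the set, and what happens when a legitimate message contains one of them.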
## Real World Outcome
You will classify messages and report a confusion matrix.
Example Output:
Accuracy: 0.82
Precision: 0.76
Recall: 0.69
Verification steps:
- Inspect false positives manually
- Test with different keyword sets
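The reported numbers all come from the four cells of the confusion matrix. A minimal sketch of computing them from parallel lists of true and predicted labels (`True` = spam; the example labels are made up):

```python
# Count the confusion-matrix cells, then derive accuracy, precision,
# and recall from them.
def evaluate(y_true, y_pred):
    tp = sum(t and p for t, p in zip(y_true, y_pred))          # spam caught
    fp = sum((not t) and p for t, p in zip(y_true, y_pred))    # ham flagged
    tn = sum((not t) and (not p) for t, p in zip(y_true, y_pred))
    fn = sum(t and (not p) for t, p in zip(y_true, y_pred))    # spam missed
    return {
        "tp": tp, "fp": fp, "tn": tn, "fn": fn,
        "accuracy": (tp + tn) / len(y_true),
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }

y_true = [True, True, True, False, False, False]
y_pred = [True, True, False, True, False, False]
print(evaluate(y_true, y_pred))
```

Keeping the four raw counts alongside the derived metrics makes the manual inspection step easier: the `fp` and `fn` counts tell you how many messages to go read.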
## The Core Question You’re Answering
“How do you measure the quality of a classification rule?”
This is the practical side of statistics: the same rule can look good or bad depending on which errors you choose to count.
## Concepts You Must Understand First
Stop and research these before coding:
- Confusion matrix
  - What do TP, FP, TN, FN mean?
  - Book Reference: “Practical Statistics for Data Scientists” Ch. 5
- Precision vs recall
  - Why do they trade off?
  - Book Reference: “Practical Statistics for Data Scientists” Ch. 5
- Base rates
  - How does class imbalance affect accuracy?
  - Book Reference: “Naked Statistics” Ch. 8
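The precision/recall tradeoff can be seen directly in a keyword rule: requiring more keyword hits before flagging makes the rule stricter, which tends to raise precision (fewer false alarms) and lower recall (more missed spam). A sketch with an illustrative keyword set:

```python
# Score a message by how many spam keywords it contains, then compare
# a lenient threshold against a strict one.
SPAM_KEYWORDS = {"free", "winner", "prize", "urgent"}

def spam_score(message: str) -> int:
    return len(set(message.lower().split()) & SPAM_KEYWORDS)

msg = "free prize inside"            # contains 2 keyword hits
print(spam_score(msg) >= 1)          # True  -- lenient rule flags it
print(spam_score(msg) >= 3)          # False -- strict rule lets it through
```

The threshold is the knob: every setting trades one kind of error for the other, which is exactly what the interview question about choosing a threshold is probing.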
## Questions to Guide Your Design
- Rule design
  - Which keywords are strong spam indicators?
  - How will you handle messages that mix spammy and legitimate words?
- Evaluation
  - How will you split training and test sets?
  - Will you use cross-validation?
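One simple answer to the splitting question is a single held-out test set: shuffle once with a fixed seed for reproducibility, then reserve a fraction for testing. A sketch (the 20% fraction and seed are arbitrary choices):

```python
import random

# Shuffle indices with a fixed seed, then cut off the last test_frac
# of the data as the held-out test set.
def train_test_split(messages, labels, test_frac=0.2, seed=42):
    idx = list(range(len(messages)))
    random.Random(seed).shuffle(idx)
    cut = int(len(idx) * (1 - test_frac))
    train_idx, test_idx = idx[:cut], idx[cut:]
    train = ([messages[i] for i in train_idx], [labels[i] for i in train_idx])
    test = ([messages[i] for i in test_idx], [labels[i] for i in test_idx])
    return train, test
```

The important discipline is tuning keyword rules only on the training portion; metrics reported on messages you peeked at while writing rules will be optimistic.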
## Thinking Exercise: The Accuracy Trap

If 95% of emails are not spam, what accuracy do you get by labeling everything “not spam”?
Questions while working:
- Why is accuracy misleading here?
- What metric would you trust more?
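The arithmetic behind the trap, worked through on a hypothetical mailbox of 1,000 messages:

```python
# The do-nothing classifier: label everything "not spam".
n, spam_rate = 1000, 0.05
spam = int(n * spam_rate)   # 50 actual spam messages
ham = n - spam              # 950 actual ham messages

# All ham is classified correctly, all spam is missed.
accuracy = ham / n          # 0.95 -- looks impressive
recall = 0 / spam           # 0.0  -- every spam message gets through
print(accuracy, recall)     # 0.95 0.0
```

Recall (and precision, which is undefined or zero here since nothing is flagged) exposes what accuracy hides: the classifier does nothing useful.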
## The Interview Questions They’ll Ask
Prepare to answer these:
- “What is a confusion matrix?”
- “Why can accuracy be misleading?”
- “What is precision vs recall?”
- “How do you handle class imbalance?”
- “How do you choose a threshold?”
## Hints in Layers
- Hint 1 (Starting Point): Start with a short list of spam keywords.
- Hint 2 (Next Level): Compute precision and recall, not just accuracy.
- Hint 3 (Technical Details): Track false positives and false negatives explicitly.
- Hint 4 (Tools/Debugging): Review misclassified messages to improve rules.
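Hint 4 can be turned into a helper that collects the misclassified messages for manual review. A sketch, where `classifier` is whatever rule function you built and the toy rule and data are for illustration only:

```python
# Split errors into the two kinds so each can be read by hand:
# false positives (ham we flagged) and false negatives (spam we missed).
def misclassified(messages, labels, classifier):
    false_pos = [m for m, actually_spam in zip(messages, labels)
                 if classifier(m) and not actually_spam]
    false_neg = [m for m, actually_spam in zip(messages, labels)
                 if not classifier(m) and actually_spam]
    return false_pos, false_neg

rule = lambda m: "free" in m.lower()          # toy single-keyword rule
msgs = ["free tickets inside", "lunch at noon", "win big now"]
labels = [True, False, True]                  # True = actually spam
fp, fn = misclassified(msgs, labels, rule)
print(fp)  # []                -- no ham was flagged
print(fn)  # ['win big now']   -- spam the rule missed
```

Reading the false negatives suggests new keywords to add; reading the false positives warns you which keywords are too aggressive.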
## Books That Will Help
| Topic | Book | Chapter |
|---|---|---|
| Confusion matrix | “Practical Statistics for Data Scientists” | Ch. 5 |
| Precision/recall | “Practical Statistics for Data Scientists” | Ch. 5 |
| Base rates | “Naked Statistics” | Ch. 8 |
## Implementation Hints
- Keep rules configurable in a text file.
- Report metrics on a held-out test set.
- Document tradeoffs between precision and recall.
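Keeping rules in a text file can be as simple as one keyword per line. A sketch of a loader; the file format (blank lines and `#` comments skipped) and the temporary-file demo are assumptions for illustration:

```python
import os
import tempfile

# Load one keyword per line; skip blank lines and '#' comment lines.
def load_keywords(path):
    with open(path, encoding="utf-8") as f:
        return {line.strip().lower() for line in f
                if line.strip() and not line.lstrip().startswith("#")}

# Demo with a temporary file; in the project this would be a
# checked-in rules file you edit between experiments.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as f:
    f.write("# spam keywords\nFree\nprize\n\nurgent\n")
    path = f.name
keywords = load_keywords(path)
print(sorted(keywords))  # ['free', 'prize', 'urgent']
os.remove(path)
```

With the rules externalized, re-running the evaluation after each edit to the file gives a fast loop for the test-different-keyword-sets verification step.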
## Learning Milestones
- First milestone: You can build a rule-based classifier.
- Second milestone: You can compute precision and recall.
- Final milestone: You can explain metric tradeoffs.