Project 3: A “Dumb” Spam Filter

Build a naive keyword-based spam filter and measure its accuracy.


Project Overview

Attribute               Value
Difficulty              Level 1: Beginner
Time Estimate           Weekend
Main Language           Python
Alternative Languages   R, JavaScript
Knowledge Area          Classification
Tools                   Text dataset
Main Book               “Practical Statistics for Data Scientists” by Bruce & Gedeck

What you’ll build: A simple rule-based spam filter and a report of false positives/negatives.

Why it teaches stats: Classification performance is all about errors and tradeoffs.

Core challenges you’ll face:

  • Defining keyword rules
  • Measuring accuracy, precision, recall
  • Balancing false positives and negatives
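A minimal sketch of such a rule-based filter, assuming a small hand-picked keyword set (the keywords and the classify function name are illustrative, not prescribed):

```python
# Minimal keyword-based spam filter: flag a message as spam
# if it contains any keyword from a configurable set.
SPAM_KEYWORDS = {"free", "winner", "prize", "click here"}

def classify(message: str) -> str:
    """Return 'spam' if any keyword appears in the message, else 'ham'."""
    text = message.lower()
    return "spam" if any(kw in text for kw in SPAM_KEYWORDS) else "ham"

print(classify("Congratulations, you are a WINNER!"))  # spam
print(classify("Meeting moved to 3pm"))                # ham
```

Substring matching is deliberately crude; part of the project is discovering where it misfires (e.g. "free" inside "carefree").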

Real World Outcome

You will classify messages and report a confusion matrix.

Example Output:

Accuracy: 0.82
Precision: 0.76
Recall: 0.69
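Numbers like these fall straight out of confusion-matrix counts. A small sketch, using hypothetical counts chosen only so they round to the values above:

```python
def metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Compute accuracy, precision, and recall from confusion-matrix counts."""
    total = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / total,
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }

# Hypothetical counts for illustration only.
m = metrics(tp=22, fp=7, tn=55, fn=10)
print({k: round(v, 2) for k, v in m.items()})
# {'accuracy': 0.82, 'precision': 0.76, 'recall': 0.69}
```

Guarding the zero-denominator cases matters: a filter that never flags anything has an undefined precision otherwise.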

Verification steps:

  • Inspect false positives manually
  • Test with different keyword sets

The Core Question You’re Answering

“How do you measure the quality of a classification rule?”

This is the practical side of statistics.


Concepts You Must Understand First

Stop and research these before coding:

  1. Confusion matrix
    • What do TP, FP, TN, FN mean?
    • Book Reference: “Practical Statistics for Data Scientists” Ch. 5
  2. Precision vs recall
    • Why do they trade off?
    • Book Reference: “Practical Statistics for Data Scientists” Ch. 5
  3. Base rates
    • How does class imbalance affect accuracy?
    • Book Reference: “Naked Statistics” Ch. 8

Questions to Guide Your Design

  1. Rule design
    • Which keywords are strong spam indicators?
    • How will you handle mixed messages?
  2. Evaluation
    • How will you split training and test sets?
    • Will you use cross-validation?
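If you go with a plain random split, a standard-library sketch might look like this (the placeholder messages and the 80/20 fraction are assumptions, not requirements):

```python
import random

def train_test_split(data, test_frac=0.2, seed=42):
    """Shuffle and split a list of (message, label) pairs."""
    rng = random.Random(seed)        # fixed seed makes the split reproducible
    items = list(data)
    rng.shuffle(items)
    cut = int(len(items) * (1 - test_frac))
    return items[:cut], items[cut:]

# Placeholder dataset: every 5th message is spam.
pairs = [(f"message {i}", "spam" if i % 5 == 0 else "ham") for i in range(100)]
train, test = train_test_split(pairs)
print(len(train), len(test))  # 80 20
```

Tuning keywords against the test set defeats the purpose of the split; rules should be adjusted using training data only.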

Thinking Exercise

Accuracy Trap

If 95% of emails are not spam, what accuracy do you get by labeling everything “not spam”?

Questions while working:

  • Why is accuracy misleading here?
  • What metric would you trust more?
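The trap is quick to verify with arithmetic: on a 95%-ham dataset, a filter that labels everything "not spam" scores 95% accuracy while catching no spam at all.

```python
# 95 ham, 5 spam; predict "ham" for everything.
tn, fn = 95, 5   # all ham correct, all spam missed
tp, fp = 0, 0    # nothing is ever labeled spam

accuracy = (tp + tn) / (tp + fp + tn + fn)
recall = tp / (tp + fn) if (tp + fn) else 0.0
print(accuracy, recall)  # 0.95 0.0
```

Recall exposes the failure that accuracy hides, which is why it is the metric to trust under class imbalance.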

The Interview Questions They’ll Ask

Prepare to answer these:

  1. “What is a confusion matrix?”
  2. “Why can accuracy be misleading?”
  3. “What is precision vs recall?”
  4. “How do you handle class imbalance?”
  5. “How do you choose a threshold?”

Hints in Layers

Hint 1 (Starting Point): Start with a short list of spam keywords.

Hint 2 (Next Level): Compute precision and recall, not just accuracy.

Hint 3 (Technical Details): Track false positives and false negatives explicitly.

Hint 4 (Tools/Debugging): Review misclassified messages to improve your rules.
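The explicit tracking from Hint 3 can be done in one pass over (actual, predicted) label pairs; the example labels below are made up:

```python
from collections import Counter

def confusion_counts(actual, predicted):
    """Tally TP/FP/TN/FN, treating 'spam' as the positive class."""
    counts = Counter()
    for a, p in zip(actual, predicted):
        if a == "spam" and p == "spam":
            counts["TP"] += 1
        elif a == "ham" and p == "spam":
            counts["FP"] += 1
        elif a == "ham" and p == "ham":
            counts["TN"] += 1
        else:  # actual spam missed by the filter
            counts["FN"] += 1
    return counts

actual    = ["spam", "ham", "ham", "spam", "ham"]
predicted = ["spam", "spam", "ham", "ham", "ham"]
print(confusion_counts(actual, predicted))  # TP=1, FP=1, TN=2, FN=1
```

Keeping the raw (message, actual, predicted) triples alongside the counts makes Hint 4's review of misclassified messages trivial.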


Books That Will Help

Topic              Book                                          Chapter
Confusion matrix   “Practical Statistics for Data Scientists”    Ch. 5
Precision/recall   “Practical Statistics for Data Scientists”    Ch. 5
Base rates         “Naked Statistics”                            Ch. 8

Implementation Hints

  • Keep rules configurable in a text file.
  • Report metrics on a held-out test set.
  • Document tradeoffs between precision and recall.
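One possible rule-file format (the filename and comment convention are assumptions, not requirements): one keyword per line, with blank lines and "#" comments ignored.

```python
from pathlib import Path

def load_keywords(path="spam_keywords.txt"):
    """Read one keyword per line, skipping blanks and '#' comment lines."""
    keywords = set()
    for line in Path(path).read_text().splitlines():
        line = line.strip()
        if line and not line.startswith("#"):
            keywords.add(line.lower())
    return keywords

# Example: write a small rule file, then load it back.
Path("spam_keywords.txt").write_text("# spam rules\nfree\nWINNER\n\nprize\n")
print(sorted(load_keywords()))  # ['free', 'prize', 'winner']
```

A plain text file keeps the rules editable without touching code, which makes the "test with different keyword sets" verification step a one-line change.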

Learning Milestones

  1. First milestone: You can build a rule-based classifier.
  2. Second milestone: You can compute precision and recall.
  3. Final milestone: You can explain metric tradeoffs.