Project 8: Bayesian Spam Classifier
Build a naive Bayes classifier for spam detection using token and header features.
Quick Reference
| Attribute | Value |
|---|---|
| Difficulty | Advanced |
| Time Estimate | 2-3 weeks |
| Language | Python (Alternatives: Go, Rust) |
| Prerequisites | Probability basics, text processing |
| Key Topics | Naive Bayes, feature extraction, evaluation |
1. Learning Objectives
- Implement a naive Bayes classifier for email.
- Build a feature pipeline from headers and body.
- Train, evaluate, and tune thresholds for precision/recall.
- Output human-readable reasons for classification.
2. Theoretical Foundation
2.1 Core Concepts
- Naive Bayes: Assumes feature independence and computes P(spam | features) from per-token likelihoods; a worked sketch follows this list.
- Tokenization: Break text into tokens, normalize case, remove noise.
- Prior and likelihood: Use counts to estimate probabilities with smoothing.
- Thresholding: Choose a cutoff to balance false positives and negatives.
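To make the first two bullets concrete, here is a minimal worked example of Bayes' rule with Laplace smoothing. All counts, the vocabulary size, and the equal priors are made up for illustration:

```python
# Toy Bayes update with Laplace smoothing (all counts are made up).
spam_count, ham_count = {"free": 40, "meeting": 1}, {"free": 2, "meeting": 30}
spam_total, ham_total, vocab = 60, 60, 100  # token totals and vocabulary size

def likelihood(token, counts, total):
    # Laplace smoothing: add 1 to the count, add |V| to the denominator.
    return (counts.get(token, 0) + 1) / (total + vocab)

# P(spam | "free") via Bayes' rule with equal priors P(spam) = P(ham) = 0.5.
p_spam = 0.5 * likelihood("free", spam_count, spam_total)
p_ham = 0.5 * likelihood("free", ham_count, ham_total)
print(p_spam / (p_spam + p_ham))  # ~0.93: "free" pushes toward spam
```

With smoothing, a token never seen in one class still gets a small nonzero likelihood instead of zeroing out the whole product.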
2.2 Why This Matters
Spam detection is a classic application of probabilistic reasoning and provides a foundation for modern email filtering systems.
2.3 Historical Context / Background
Naive Bayes was one of the earliest effective spam filters and remains a core baseline in many systems.
2.4 Common Misconceptions
- Misconception: Bayes only uses body text. Reality: headers are strong signals.
- Misconception: Higher accuracy means better. Reality: false positives are costly.
3. Project Specification
3.1 What You Will Build
A tool that trains on labeled spam/ham datasets, classifies new messages, and reports a spam score with key contributing features.
3.2 Functional Requirements
- Parse messages and extract features.
- Train model with smoothing.
- Score messages and output spam probability.
- Provide a confusion matrix and metrics.
3.3 Non-Functional Requirements
- Performance: Train on 10k emails in under a minute.
- Reliability: Handle malformed messages.
- Usability: Report top features driving decision.
3.4 Example Usage / Output
```text
$ ./spam-classify --train data/train --test data/test
Accuracy: 0.95
Precision: 0.93
Recall: 0.91

$ ./spam-classify --message message.eml
Spam score: 0.98
Top signals: "free", "winner", "click", "http://"
```
3.5 Real World Outcome
You can classify inbound messages and explain why they were flagged, using interpretable signals.
4. Solution Architecture
4.1 High-Level Design
```text
Dataset Loader
  -> Feature Extractor
  -> Bayes Trainer
  -> Scorer
  -> Report Generator
```
4.2 Key Components
| Component | Responsibility | Key Decisions |
|---|---|---|
| Extractor | Tokens and header features | Use bag-of-words plus header flags |
| Trainer | Compute priors and likelihoods | Laplace smoothing |
| Scorer | Compute log-probabilities | Avoid underflow with logs |
| Reporter | Explain results | Show top contributing tokens |
4.3 Data Structures
```python
class Model:
    def __init__(self):
        self.spam_counts = {}  # token -> count across spam messages
        self.ham_counts = {}   # token -> count across ham messages
        self.spam_total = 0    # total tokens seen in spam
        self.ham_total = 0     # total tokens seen in ham
```
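A minimal training sketch that fills in the `Model` above; the `train` and `log_likelihood` names and the `(tokens, is_spam)` input format are illustrative choices, not requirements:

```python
import math

def train(model, messages):
    """Accumulate token counts; messages are (tokens, is_spam) pairs."""
    for tokens, is_spam in messages:
        counts = model.spam_counts if is_spam else model.ham_counts
        for tok in tokens:
            counts[tok] = counts.get(tok, 0) + 1
        if is_spam:
            model.spam_total += len(tokens)
        else:
            model.ham_total += len(tokens)

def log_likelihood(model, token, is_spam, vocab_size):
    """Laplace-smoothed log P(token | class)."""
    counts = model.spam_counts if is_spam else model.ham_counts
    total = model.spam_total if is_spam else model.ham_total
    return math.log((counts.get(token, 0) + 1) / (total + vocab_size))
```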
4.4 Algorithm Overview
Key Algorithm: Naive Bayes Scoring
- Compute log P(spam) and log P(ham).
- For each token, add log likelihoods.
- Normalize to produce a score.
Complexity Analysis:
- Time: O(n) per message for n tokens
- Space: O(v) for vocabulary size
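A sketch of the scoring step, assuming the `Model` class and the hypothetical `log_likelihood` helper from 4.3. Working in log space and normalizing against the larger of the two scores avoids the underflow noted above:

```python
import math

def score(model, tokens, p_spam_prior=0.5):
    """Return P(spam | tokens), computed entirely in log space."""
    vocab = len(set(model.spam_counts) | set(model.ham_counts))
    log_spam = math.log(p_spam_prior)
    log_ham = math.log(1 - p_spam_prior)
    for tok in tokens:
        log_spam += log_likelihood(model, tok, True, vocab)
        log_ham += log_likelihood(model, tok, False, vocab)
    # Subtract the max before exponentiating so long messages don't underflow.
    m = max(log_spam, log_ham)
    return math.exp(log_spam - m) / (math.exp(log_spam - m) + math.exp(log_ham - m))
```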
5. Implementation Guide
5.1 Development Environment Setup
```bash
python -m venv .venv
source .venv/bin/activate
```
5.2 Project Structure
```text
spam-classifier/
├── loader.py
├── features.py
├── train.py
├── score.py
└── report.py
```
5.3 The Core Question You’re Answering
“How can we estimate the probability that a message is spam based on its features?”
5.4 Concepts You Must Understand First
Stop and research these before coding:
- Bayes' theorem
- Log probabilities (see the underflow sketch after this list)
- Tokenization strategies
- Precision vs recall
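For the log-probabilities item above, here is a quick demonstration of why log space matters; the token count and per-token likelihood are arbitrary:

```python
import math

# 500 tokens, each with likelihood 0.01: the raw product underflows to 0.0,
# while the equivalent log-space sum stays perfectly representable.
probs = [0.01] * 500
product = 1.0
for p in probs:
    product *= p
print(product)                           # 0.0 (underflow)
print(sum(math.log(p) for p in probs))   # -2302.58...
```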
5.5 Questions to Guide Your Design
- Which headers should you include as features?
- How will you handle rare tokens?
- What threshold is acceptable to avoid false positives?
5.6 Thinking Exercise
If you increase the threshold from 0.8 to 0.95, what happens to precision and recall?
5.7 The Interview Questions They’ll Ask
- “Why use log probabilities in naive Bayes?”
- “What is Laplace smoothing and why is it needed?”
- “How do you balance false positives and negatives in spam filtering?”
5.8 Hints in Layers
Hint 1: Start with body tokens only
- Add headers after you have a working baseline.
Hint 2: Use a small dataset first
- Validate correctness before scaling.
Hint 3: Add feature explanations
- Track token contributions for transparency, as in the sketch below.
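One way to track contributions, reusing the hypothetical `log_likelihood` helper from Section 4.3: rank each token by its log-likelihood ratio between the two classes. The function name and signature are illustrative:

```python
def top_signals(model, tokens, vocab_size, k=5):
    """Rank tokens by log-likelihood ratio: how much each pushes toward spam."""
    contributions = {
        tok: log_likelihood(model, tok, True, vocab_size)
             - log_likelihood(model, tok, False, vocab_size)
        for tok in set(tokens)
    }
    return sorted(contributions, key=contributions.get, reverse=True)[:k]
```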
5.9 Books That Will Help
| Topic | Book | Chapter |
|---|---|---|
| Naive Bayes | Pattern Recognition and ML | Ch. 4 |
| Text processing | Speech and Language Processing | Ch. 4 |
| Email filtering | SpamAssassin docs | Concepts section |
5.10 Implementation Phases
Phase 1: Foundation (4-5 days)
Goals:
- Load dataset and tokenize
Tasks:
- Parse messages and extract tokens.
- Build initial counts.
Checkpoint: Vocabulary and token counts produced.
Phase 2: Core Functionality (1 week)
Goals:
- Train and score
Tasks:
- Implement priors and likelihoods.
- Score messages with log probs.
Checkpoint: Score outputs for test set.
Phase 3: Polish and Edge Cases (4-5 days)
Goals:
- Evaluation and reporting
Tasks:
- Compute precision, recall, F1 (see the metrics sketch after this phase).
- Provide top feature report.
Checkpoint: Metrics and explanations shown.
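A minimal sketch of the Phase 3 metrics, computed from parallel lists of true and predicted labels; the boolean-list representation is just one convenient choice:

```python
def metrics(y_true, y_pred):
    """Precision, recall, F1 from parallel lists of booleans (True = spam)."""
    tp = sum(t and p for t, p in zip(y_true, y_pred))
    fp = sum((not t) and p for t, p in zip(y_true, y_pred))
    fn = sum(t and (not p) for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```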
5.11 Key Implementation Decisions
| Decision | Options | Recommendation | Rationale |
|---|---|---|---|
| Tokenization | simple split vs regex | regex | Handles URLs and punctuation more robustly |
| Features | body only vs headers | both | Headers add strong, cheap signals |
| Threshold | fixed vs tunable | tunable | Adjust for false positives |
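The table recommends regex tokenization; here is one illustrative pattern (an assumption, not a required rule) that keeps URLs whole, since tokens like `http://` are listed as signals in 3.4:

```python
import re

# Match whole URLs first, then runs of lowercase letters, digits, and '$'.
TOKEN_RE = re.compile(r"https?://\S+|[a-z0-9$]+")

def tokenize(text):
    return TOKEN_RE.findall(text.lower())

print(tokenize("FREE!!! Click http://example.com to WIN $1000"))
# ['free', 'click', 'http://example.com', 'to', 'win', '$1000']
```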
6. Testing Strategy
6.1 Test Categories
| Category | Purpose | Examples |
|---|---|---|
| Unit Tests | Tokenization | punctuation and URLs |
| Integration Tests | Dataset evaluation | test split |
| Edge Case Tests | Empty body | headers only |
6.2 Critical Test Cases
- Empty message should not crash.
- Highly spammy message scores above threshold.
- Known ham scores below threshold.
6.3 Test Data
```text
Subject: Free winner
Body: Click here to claim your prize
```
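A sketch of how the critical cases in 6.2 might look as tests, assuming the `Model`, `train`, and `score` sketches from Section 4; the tiny corpus and the 0.5 cutoff are illustrative:

```python
def test_spammy_message_scores_high():
    # The sample message from 6.3 should land well above a ham message.
    model = Model()
    train(model, [
        ("free winner click here to claim your prize".split(), True),
        ("meeting agenda attached for thursday review".split(), False),
    ])
    spam_score = score(model, "free winner click here".split())
    ham_score = score(model, "meeting agenda thursday".split())
    assert spam_score > 0.5 > ham_score

def test_empty_message_does_not_crash():
    model = Model()
    train(model, [("free prize".split(), True), ("team meeting".split(), False)])
    assert 0.0 <= score(model, []) <= 1.0
```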
7. Common Pitfalls and Debugging
7.1 Frequent Mistakes
| Pitfall | Symptom | Solution |
|---|---|---|
| Probability underflow | scores are zero | use log space |
| No smoothing | zero probabilities | apply Laplace smoothing |
| Overfitting | high train, low test | use validation set |
7.2 Debugging Strategies
- Inspect top tokens for spam vs ham.
- Compare the confusion matrix across thresholds, as in the sweep sketch below.
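A small sweep helper for the second strategy, reusing the hypothetical `score` and `metrics` sketches from earlier sections:

```python
def sweep_thresholds(model, labeled, thresholds=(0.5, 0.8, 0.9, 0.95, 0.99)):
    """Print precision/recall at each cutoff; labeled is (tokens, is_spam) pairs."""
    scores = [(score(model, tokens), is_spam) for tokens, is_spam in labeled]
    for t in thresholds:
        preds = [s >= t for s, _ in scores]
        truth = [is_spam for _, is_spam in scores]
        p, r, f1 = metrics(truth, preds)
        print(f"threshold={t:.2f}  precision={p:.3f}  recall={r:.3f}  f1={f1:.3f}")
```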
7.3 Performance Traps
- Large vocabularies slow scoring and inflate memory; prune rare tokens.
8. Extensions and Challenges
8.1 Beginner Extensions
- Add stopword removal.
- Add n-gram features.
8.2 Intermediate Extensions
- Implement TF-IDF weighting.
- Add URL reputation features.
8.3 Advanced Extensions
- Compare with logistic regression.
- Train incremental updates on new data.
9. Real-World Connections
9.1 Industry Applications
- Spam filters still use Bayes as a baseline.
- Security gateways combine Bayes with reputation and auth.
9.2 Related Open Source Projects
- SpamAssassin: https://spamassassin.apache.org/
- Bogofilter: https://bogofilter.sourceforge.net/
9.3 Interview Relevance
- Probabilistic classifiers and evaluation metrics are common ML interview topics.
10. Resources
10.1 Essential Reading
- Naive Bayes tutorials and email filtering guides
10.2 Video Resources
- Bayesian spam filter walkthroughs
10.3 Tools and Documentation
- Public spam datasets (Enron, SpamAssassin)
10.4 Related Projects in This Series
11. Self-Assessment Checklist
11.1 Understanding
- I can explain Bayes' theorem
- I understand precision vs recall
- I can explain feature selection
11.2 Implementation
- Model trains and scores correctly
- Metrics are reported
- Explanations are shown
11.3 Growth
- I can tune thresholds for business needs
- I can reason about false positives
12. Submission / Completion Criteria
Minimum Viable Completion:
- Train a naive Bayes model and score messages
Full Completion:
- Provide metrics and top features
Excellence (Going Above and Beyond):
- Add advanced features and incremental training
- Compare multiple models
This guide was generated from EMAIL_SYSTEMS_DEEP_DIVE_PROJECTS.md. For the complete learning path, see the parent directory.