Project 8: Bayesian Spam Classifier
Build a naive Bayes classifier for spam detection using token and header features.
Quick Reference
| Attribute | Value |
|---|---|
| Difficulty | Advanced |
| Time Estimate | 2-3 weeks |
| Language | Python (Alternatives: Go, Rust) |
| Prerequisites | Probability basics, text processing |
| Key Topics | Naive Bayes, feature extraction, evaluation |
1. Learning Objectives
- Implement a naive Bayes classifier for email.
- Build a feature pipeline from headers and body.
- Train, evaluate, and tune thresholds for precision/recall.
- Output human-readable reasons for classification.
2. Theoretical Foundation
2.1 Core Concepts
- Naive Bayes: Assumes feature independence and computes P(spam | features) from per-token likelihoods; a worked sketch follows this list.
- Tokenization: Break text into tokens, normalize case, remove noise.
- Prior and likelihood: Use counts to estimate probabilities with smoothing.
- Thresholding: Choose a cutoff to balance false positives and negatives.
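To make the first two bullets concrete, here is a minimal worked example of Bayes' rule with Laplace smoothing. All counts, the vocabulary size, and the equal priors are made up for illustration:

```python
# Toy Bayes update with Laplace smoothing (all counts are made up).
spam_count, ham_count = {"free": 40, "meeting": 1}, {"free": 2, "meeting": 30}
spam_total, ham_total, vocab = 60, 60, 100  # token totals and vocabulary size

def likelihood(token, counts, total):
    # Laplace smoothing: add 1 to the count, add |V| to the denominator.
    return (counts.get(token, 0) + 1) / (total + vocab)

# P(spam | "free") via Bayes' rule with equal priors P(spam) = P(ham) = 0.5.
p_spam = 0.5 * likelihood("free", spam_count, spam_total)
p_ham = 0.5 * likelihood("free", ham_count, ham_total)
print(p_spam / (p_spam + p_ham))  # ~0.93: "free" pushes toward spam
```

With smoothing, a token never seen in one class still gets a small nonzero likelihood instead of zeroing out the whole product.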
2.2 Why This Matters
Spam detection is a classic application of probabilistic reasoning and provides a foundation for modern email filtering systems.
2.3 Historical Context / Background
Naive Bayes was one of the earliest effective spam filters and remains a core baseline in many systems.
2.4 Common Misconceptions
- Misconception: Bayes only uses body text. Reality: headers are strong signals.
- Misconception: Higher accuracy means better. Reality: false positives are costly.
3. Project Specification
3.1 What You Will Build
A tool that trains on labeled spam/ham datasets, classifies new messages, and reports a spam score with key contributing features.
3.2 Functional Requirements
- Parse messages and extract features.
- Train model with smoothing.
- Score messages and output spam probability.
- Provide a confusion matrix and metrics.
3.3 Non-Functional Requirements
- Performance: Train on 10k emails in under a minute.
- Reliability: Handle malformed messages.
- Usability: Report top features driving decision.
3.4 Example Usage / Output
```text
$ ./spam-classify --train data/train --test data/test
Accuracy: 0.95
Precision: 0.93
Recall: 0.91

$ ./spam-classify --message message.eml
Spam score: 0.98
Top signals: "free", "winner", "click", "http://"
```
3.5 Real World Outcome
You can classify inbound messages and explain why they were flagged, using interpretable signals.
4. Solution Architecture
4.1 High-Level Design
```text
Dataset Loader
  -> Feature Extractor
  -> Bayes Trainer
  -> Scorer
  -> Report Generator
```
4.2 Key Components
| Component | Responsibility | Key Decisions |
|---|---|---|
| Extractor | Tokens and header features | Use bag-of-words plus header flags |
| Trainer | Compute priors and likelihoods | Laplace smoothing |
| Scorer | Compute log-probabilities | Avoid underflow with logs |
| Reporter | Explain results | Show top contributing tokens |
4.3 Data Structures
```python
class Model:
    def __init__(self):
        self.spam_counts = {}  # token -> count across spam messages
        self.ham_counts = {}   # token -> count across ham messages
        self.spam_total = 0    # total tokens seen in spam
        self.ham_total = 0     # total tokens seen in ham
```
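A minimal training sketch that fills in the `Model` above; the `train` and `log_likelihood` names and the `(tokens, is_spam)` input format are illustrative choices, not requirements:

```python
import math

def train(model, messages):
    """Accumulate token counts; messages are (tokens, is_spam) pairs."""
    for tokens, is_spam in messages:
        counts = model.spam_counts if is_spam else model.ham_counts
        for tok in tokens:
            counts[tok] = counts.get(tok, 0) + 1
        if is_spam:
            model.spam_total += len(tokens)
        else:
            model.ham_total += len(tokens)

def log_likelihood(model, token, is_spam, vocab_size):
    """Laplace-smoothed log P(token | class)."""
    counts = model.spam_counts if is_spam else model.ham_counts
    total = model.spam_total if is_spam else model.ham_total
    return math.log((counts.get(token, 0) + 1) / (total + vocab_size))
```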
4.4 Algorithm Overview
Key Algorithm: Naive Bayes Scoring
- Compute log P(spam) and log P(ham).
- For each token, add log likelihoods.
- Normalize to produce a score.
Complexity Analysis:
- Time: O(n) per message for n tokens
- Space: O(v) for vocabulary size
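A sketch of the scoring step, assuming the `Model` class and the hypothetical `log_likelihood` helper from 4.3. Working in log space and normalizing against the larger of the two scores avoids the underflow noted above:

```python
import math

def score(model, tokens, p_spam_prior=0.5):
    """Return P(spam | tokens), computed entirely in log space."""
    vocab = len(set(model.spam_counts) | set(model.ham_counts))
    log_spam = math.log(p_spam_prior)
    log_ham = math.log(1 - p_spam_prior)
    for tok in tokens:
        log_spam += log_likelihood(model, tok, True, vocab)
        log_ham += log_likelihood(model, tok, False, vocab)
    # Subtract the max before exponentiating so long messages don't underflow.
    m = max(log_spam, log_ham)
    return math.exp(log_spam - m) / (math.exp(log_spam - m) + math.exp(log_ham - m))
```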
5. Implementation Guide
5.1 Development Environment Setup
```bash
python -m venv .venv
source .venv/bin/activate
```
5.2 Project Structure
```text
spam-classifier/
├── loader.py
├── features.py
├── train.py
├── score.py
└── report.py
```
5.3 The Core Question You’re Answering
“How can we estimate the probability that a message is spam based on its features?”
5.4 Concepts You Must Understand First
Stop and research these before coding:
- Bayes' theorem
- Log probabilities (see the underflow sketch after this list)
- Tokenization strategies
- Precision vs recall
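For the log-probabilities item above, here is a quick demonstration of why log space matters; the token count and per-token likelihood are arbitrary:

```python
import math

# 500 tokens, each with likelihood 0.01: the raw product underflows to 0.0,
# while the equivalent log-space sum stays perfectly representable.
probs = [0.01] * 500
product = 1.0
for p in probs:
    product *= p
print(product)                           # 0.0 (underflow)
print(sum(math.log(p) for p in probs))   # -2302.58...
```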
5.5 Questions to Guide Your Design
- Which headers should you include as features?
- How will you handle rare tokens?
- What threshold is acceptable to avoid false positives?
5.6 Thinking Exercise
If you increase the threshold from 0.8 to 0.95, what happens to precision and recall?
5.7 The Interview Questions They’ll Ask
- “Why use log probabilities in naive Bayes?”
- “What is Laplace smoothing and why is it needed?”
- “How do you balance false positives and negatives in spam filtering?”
5.8 Hints in Layers
Hint 1: Start with body tokens only
- Add headers after you have a working baseline.
Hint 2: Use a small dataset first
- Validate correctness before scaling.
Hint 3: Add feature explanations
- Track token contributions for transparency, as in the sketch below.
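One way to track contributions, reusing the hypothetical `log_likelihood` helper from Section 4.3: rank each token by its log-likelihood ratio between the two classes. The function name and signature are illustrative:

```python
def top_signals(model, tokens, vocab_size, k=5):
    """Rank tokens by log-likelihood ratio: how much each pushes toward spam."""
    contributions = {
        tok: log_likelihood(model, tok, True, vocab_size)
             - log_likelihood(model, tok, False, vocab_size)
        for tok in set(tokens)
    }
    return sorted(contributions, key=contributions.get, reverse=True)[:k]
```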
5.9 Books That Will Help
| Topic | Book | Chapter |
|---|---|---|
| Naive Bayes | Pattern Recognition and ML | Ch. 4 |
| Text processing | Speech and Language Processing | Ch. 4 |
| Email filtering | SpamAssassin docs | Concepts section |
5.10 Implementation Phases
Phase 1: Foundation (4-5 days)
Goals:
- Load dataset and tokenize
Tasks:
- Parse messages and extract tokens.
- Build initial counts.
Checkpoint: Vocabulary and token counts produced.
Phase 2: Core Functionality (1 week)
Goals:
- Train and score
Tasks:
- Implement priors and likelihoods.
- Score messages with log probs.
Checkpoint: Score outputs for test set.
Phase 3: Polish and Edge Cases (4-5 days)
Goals:
- Evaluation and reporting
Tasks:
- Compute precision, recall, F1 (see the metrics sketch after this phase).
- Provide top feature report.
Checkpoint: Metrics and explanations shown.
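A minimal sketch of the Phase 3 metrics, computed from parallel lists of true and predicted labels; the boolean-list representation is just one convenient choice:

```python
def metrics(y_true, y_pred):
    """Precision, recall, F1 from parallel lists of booleans (True = spam)."""
    tp = sum(t and p for t, p in zip(y_true, y_pred))
    fp = sum((not t) and p for t, p in zip(y_true, y_pred))
    fn = sum(t and (not p) for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```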
5.11 Key Implementation Decisions
| Decision | Options | Recommendation | Rationale |
|---|---|---|---|
| Tokenization | simple split vs regex | regex | Handles URLs and punctuation more robustly |
| Features | body only vs headers | both | Headers add strong, cheap signals |
| Threshold | fixed vs tunable | tunable | Adjust for false positives |
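The table recommends regex tokenization; here is one illustrative pattern (an assumption, not a required rule) that keeps URLs whole, since tokens like `http://` are listed as signals in 3.4:

```python
import re

# Match whole URLs first, then runs of lowercase letters, digits, and '$'.
TOKEN_RE = re.compile(r"https?://\S+|[a-z0-9$]+")

def tokenize(text):
    return TOKEN_RE.findall(text.lower())

print(tokenize("FREE!!! Click http://example.com to WIN $1000"))
# ['free', 'click', 'http://example.com', 'to', 'win', '$1000']
```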
6. Testing Strategy
6.1 Test Categories
| Category | Purpose | Examples |
|---|---|---|
| Unit Tests | Tokenization | punctuation and URLs |
| Integration Tests | Dataset evaluation | test split |
| Edge Case Tests | Empty body | headers only |
6.2 Critical Test Cases
- Empty message should not crash.
- Highly spammy message scores above threshold.
- Known ham scores below threshold.
6.3 Test Data
```text
Subject: Free winner
Body: Click here to claim your prize
```
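A sketch of how the critical cases in 6.2 might look as tests, assuming the `Model`, `train`, and `score` sketches from Section 4; the tiny corpus and the 0.5 cutoff are illustrative:

```python
def test_spammy_message_scores_high():
    # The sample message from 6.3 should land well above a ham message.
    model = Model()
    train(model, [
        ("free winner click here to claim your prize".split(), True),
        ("meeting agenda attached for thursday review".split(), False),
    ])
    spam_score = score(model, "free winner click here".split())
    ham_score = score(model, "meeting agenda thursday".split())
    assert spam_score > 0.5 > ham_score

def test_empty_message_does_not_crash():
    model = Model()
    train(model, [("free prize".split(), True), ("team meeting".split(), False)])
    assert 0.0 <= score(model, []) <= 1.0
```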
7. Common Pitfalls and Debugging
7.1 Frequent Mistakes
| Pitfall | Symptom | Solution |
|---|---|---|
| Probability underflow | scores are zero | use log space |
| No smoothing | zero probabilities | apply Laplace smoothing |
| Overfitting | high train, low test | use validation set |
7.2 Debugging Strategies
- Inspect top tokens for spam vs ham.
- Compare the confusion matrix across thresholds, as in the sweep sketch below.
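A small sweep helper for the second strategy, reusing the hypothetical `score` and `metrics` sketches from earlier sections:

```python
def sweep_thresholds(model, labeled, thresholds=(0.5, 0.8, 0.9, 0.95, 0.99)):
    """Print precision/recall at each cutoff; labeled is (tokens, is_spam) pairs."""
    scores = [(score(model, tokens), is_spam) for tokens, is_spam in labeled]
    for t in thresholds:
        preds = [s >= t for s, _ in scores]
        truth = [is_spam for _, is_spam in scores]
        p, r, f1 = metrics(truth, preds)
        print(f"threshold={t:.2f}  precision={p:.3f}  recall={r:.3f}  f1={f1:.3f}")
```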
7.3 Performance Traps
- Large vocabularies slow scoring and inflate memory; prune rare tokens.
8. Extensions and Challenges
8.1 Beginner Extensions
- Add stopword removal.
- Add n-gram features.
8.2 Intermediate Extensions
- Implement TF-IDF weighting.
- Add URL reputation features.
8.3 Advanced Extensions
- Compare with logistic regression.
- Train incremental updates on new data.
9. Real-World Connections
9.1 Industry Applications
- Spam filters still use Bayes as a baseline.
- Security gateways combine Bayes with reputation and auth.
9.2 Related Open Source Projects
- SpamAssassin: https://spamassassin.apache.org/
- Bogofilter: https://bogofilter.sourceforge.net/
9.3 Interview Relevance
- Probabilistic classifiers and evaluation metrics are common ML interview topics.
10. Resources
10.1 Essential Reading
- Naive Bayes tutorials and email filtering guides
10.2 Video Resources
- Bayesian spam filter walkthroughs
10.3 Tools and Documentation
- Public spam datasets (Enron, SpamAssassin)
10.4 Related Projects in This Series
11. Self-Assessment Checklist
11.1 Understanding
- I can explain Bayes' theorem
- I understand precision vs recall
- I can explain feature selection
11.2 Implementation
- Model trains and scores correctly
- Metrics are reported
- Explanations are shown
11.3 Growth
- I can tune thresholds for business needs
- I can reason about false positives
12. Submission / Completion Criteria
Minimum Viable Completion:
- Train a naive Bayes model and score messages
Full Completion:
- Provide metrics and top features
Excellence (Going Above and Beyond):
- Add advanced features and incremental training
- Compare multiple models
This guide was generated from EMAIL_SYSTEMS_DEEP_DIVE_PROJECTS.md. For the complete learning path, see the parent directory.