Project 14: Phishing Email Detector
Build a detector that scores emails for phishing risk using header, domain, and content signals.
Quick Reference
| Attribute | Value |
|---|---|
| Difficulty | Advanced |
| Time Estimate | 2-3 weeks |
| Language | Python (Alternatives: Go, Rust) |
| Prerequisites | Header parsing, URL parsing |
| Key Topics | Spoofing signals, domain similarity, URL analysis |
1. Learning Objectives
- Extract phishing signals from headers and body.
- Detect domain impersonation and lookalike domains.
- Analyze links for risk patterns.
- Produce a risk score and explanation.
2. Theoretical Foundation
2.1 Core Concepts
- Spoofing signals: Misaligned From/Reply-To, failed auth results.
- Lookalike domains: Typosquatting and homoglyph attacks.
- Link analysis: Mismatch between visible text and actual URL.
- Risk scoring: Combine multiple weak signals into a score.
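To make the lookalike-domain idea concrete, here is a minimal sketch of a homoglyph check. The `CONFUSABLES` map and the function names are hypothetical and deliberately tiny; a usable map would need far more entries than the handful shown.

```python
import unicodedata

# Hypothetical, deliberately small confusables map (an assumption for
# illustration); real coverage needs a much larger table.
CONFUSABLES = {
    "0": "o",
    "1": "l",
    "\u0430": "a",  # Cyrillic small a
    "\u0435": "e",  # Cyrillic small ie
    "\u043e": "o",  # Cyrillic small o
    "\u0440": "p",  # Cyrillic small er
}


def skeleton(domain: str) -> str:
    """Fold a domain to a comparison form by replacing confusable characters."""
    folded = unicodedata.normalize("NFKC", domain).lower()
    return "".join(CONFUSABLES.get(ch, ch) for ch in folded)


def is_homoglyph_of(candidate: str, trusted: str) -> bool:
    """True when candidate differs from trusted but folds to the same skeleton."""
    candidate_l, trusted_l = candidate.lower(), trusted.lower()
    return candidate_l != trusted_l and skeleton(candidate_l) == skeleton(trusted_l)


print(is_homoglyph_of("paypa1.com", "paypal.com"))
```

Both sides are folded before comparison, so a trusted domain that legitimately contains a digit is handled the same way as the candidate.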
2.2 Why This Matters
Phishing remains a top attack vector. Detecting it requires multi-signal analysis rather than a single rule.
2.3 Historical Context / Background
Modern phishing uses UI deception, link obfuscation, and compromised infrastructure. Authentication alone is not sufficient.
2.4 Common Misconceptions
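- Misconception: SPF/DKIM pass means safe. Reality: a pass only authenticates the sending domain; attackers can send from compromised or attacker-registered domains that pass.
- Misconception: One bad signal is enough. Reality: single signals also fire on legitimate mail, so acting on one alone produces false positives.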
3. Project Specification
3.1 What You Will Build
A CLI tool that reads a raw message, extracts signals, computes a risk score, and outputs a list of suspicious indicators.
3.2 Functional Requirements
- Parse headers and body.
- Compare From, Reply-To, Return-Path domains.
- Detect lookalike domains using edit distance and homoglyph checks.
- Extract URLs and compare visible link text vs actual URL.
- Output risk score and reasons.
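The header comparison requirement could be sketched as follows, using only the standard library `email` package. The sample message and the helper names `header_domain` and `domain_mismatches` are assumptions made for illustration, not part of the specification.

```python
from email import message_from_string
from email.utils import parseaddr

# Illustrative sample message (an assumption, not real data).
RAW = """\
From: PayPal Support <support@paypal.com>
Reply-To: billing@paypa1.com
Return-Path: <bounce@paypal.com>
Subject: Action required

Please verify your account.
"""


def header_domain(msg, name):
    """Return the lowercased domain of an address header, or None if absent."""
    value = msg.get(name)
    if not value:
        return None
    _, addr = parseaddr(value)
    if "@" not in addr:
        return None
    return addr.rsplit("@", 1)[1].lower()


def domain_mismatches(msg):
    """List (header, domain) pairs whose domain differs from the From domain."""
    from_domain = header_domain(msg, "From")
    mismatches = []
    for name in ("Reply-To", "Return-Path"):
        domain = header_domain(msg, name)
        if from_domain and domain and domain != from_domain:
            mismatches.append((name, domain))
    return mismatches


msg = message_from_string(RAW)
print(domain_mismatches(msg))
```

This sketch compares full domains exactly, so a legitimate subdomain such as a bounce host would also be reported. Comparing organizational domains instead would need the public suffix list.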
3.3 Non-Functional Requirements
- Performance: Analyze a message in under 300 ms.
- Reliability: Handle malformed HTML and MIME.
- Usability: Clear explanation for each risk signal.
3.4 Example Usage / Output
$ ./phish-detect message.eml
Risk score: 0.82 (High)
Signals:
- Reply-To domain mismatch: billing@paypa1.com
- URL domain mismatch: paypal.com (text) -> paypa1.com (link)
- DMARC failed
3.5 Real World Outcome
You can flag suspicious emails and explain why they are risky, helping users and analysts take action.
4. Solution Architecture
4.1 High-Level Design
Message Parser
-> Signal Extractor
-> Domain Analyzer
-> URL Analyzer
-> Risk Scorer
4.2 Key Components
| Component | Responsibility | Key Decisions |
|---|---|---|
| Signal Extractor | Pull header and body indicators | prioritize auth results |
| Domain Analyzer | Similarity and homoglyph checks | use edit distance threshold |
| URL Analyzer | Compare displayed vs actual | handle HTML anchors |
| Scorer | Combine signals | weighted sum |
4.3 Data Structures
class Signal:
    def __init__(self, name, weight, detail):
        self.name = name
        self.weight = weight
        self.detail = detail
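One way the Signal objects might feed a weighted-sum score is sketched below. The class is restated as a dataclass so the block runs on its own; the weights, the clamp to [0, 1], and the label cut-offs are illustrative assumptions rather than recommended values.

```python
from dataclasses import dataclass


@dataclass
class Signal:
    name: str
    weight: float
    detail: str


def risk_score(signals):
    """Sum signal weights and clamp the total to the range [0.0, 1.0]."""
    total = sum(s.weight for s in signals)
    return max(0.0, min(1.0, total))


def risk_label(score):
    """Map a score to a coarse label; the cut-offs are assumed, not tuned."""
    if score >= 0.7:
        return "High"
    if score >= 0.4:
        return "Medium"
    return "Low"


# Example weights are made up for illustration.
signals = [
    Signal("reply_to_mismatch", 0.3, "billing@paypa1.com"),
    Signal("url_domain_mismatch", 0.3, "paypal.com (text) -> paypa1.com (link)"),
    Signal("dmarc_fail", 0.25, "DMARC failed"),
]
score = risk_score(signals)
print(f"Risk score: {score:.2f} ({risk_label(score)})")
```

Keeping the detail string on each signal means the same list can drive both the score and the explanation shown to the user.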
4.4 Algorithm Overview
Key Algorithm: Domain Similarity
- Normalize domains (lowercase, strip leading and trailing dots).
- Compute edit distance or use confusables map.
- Flag if the distance is nonzero (an exact match is the trusted domain itself) and within the threshold.
Complexity Analysis:
- Time: O(n*m) for edit distance per comparison
- Space: O(n*m) for the full table, reducible to O(min(n, m)) by keeping only two rows
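The domain similarity steps above can be sketched with a plain Levenshtein distance. This version keeps only two rows of the table; the `normalize` rules, the inclusive comparison, and the default threshold of 2 are assumptions chosen for the example.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance using two rows instead of the full n*m table."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution
            ))
        previous = current
    return previous[-1]


def normalize(domain: str) -> str:
    """Lowercase and drop surrounding whitespace and a trailing root dot."""
    return domain.strip().lower().rstrip(".")


def is_lookalike(candidate: str, trusted: str, threshold: int = 2) -> bool:
    """Flag domains that are close to, but not equal to, a trusted domain."""
    distance = edit_distance(normalize(candidate), normalize(trusted))
    return 0 < distance <= threshold


print(is_lookalike("paypa1.com", "paypal.com"))
```

A fixed threshold is crude for short domains, where one or two edits can turn one unrelated name into another, so scaling it by length may be worth considering.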
5. Implementation Guide
5.1 Development Environment Setup
python -m venv .venv
source .venv/bin/activate
5.2 Project Structure
phishing-detector/
├── parser.py
├── signals.py
├── domains.py
├── urls.py
└── score.py
5.3 The Core Question You’re Answering
“Does this email try to impersonate a trusted sender or trick the user into unsafe actions?”
5.4 Concepts You Must Understand First
Stop and research these before coding:
- Authentication-Results
- Domain alignment and spoofing
- URL parsing and punycode
- Edit distance basics
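As a starting point for the Authentication-Results item, the sketch below pulls the SPF, DKIM, and DMARC results out of a header value with a regular expression. It is not a complete parser for the header grammar, the sample header is invented, and it assumes the header was added by your own receiving server rather than by the sender.

```python
import re

# Matches "spf=pass", "dkim=fail", "dmarc=none" and similar method=result pairs.
AUTH_RESULT = re.compile(r"\b(spf|dkim|dmarc)\s*=\s*([a-z]+)", re.IGNORECASE)


def parse_auth_results(header_value: str) -> dict:
    """Return the first result seen for each of spf, dkim, and dmarc."""
    results = {}
    for method, result in AUTH_RESULT.findall(header_value):
        results.setdefault(method.lower(), result.lower())
    return results


def failed_methods(results: dict) -> list:
    """Methods whose result is anything other than 'pass'."""
    return sorted(m for m, r in results.items() if r != "pass")


# Invented example value for illustration.
header = (
    "mx.example.net; spf=pass smtp.mailfrom=paypa1.com; "
    "dkim=pass header.d=paypa1.com; dmarc=fail header.from=paypal.com"
)
results = parse_auth_results(header)
print(results, failed_methods(results))
```

Treating every non-pass result as a failure is a simplification; results such as `none` or `neutral` may deserve a lower weight than an outright `fail`.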
5.5 Questions to Guide Your Design
- Which domains should be considered trusted targets?
- What signals are strong enough to trigger a high score?
- How will you handle missing headers?
5.6 Thinking Exercise
If From is example.com, Reply-To is example-support.com, and SPF/DKIM pass, should this be flagged? Why?
5.7 The Interview Questions They’ll Ask
- “What signals indicate phishing beyond authentication failures?”
- “How do you detect lookalike domains?”
- “Why are false positives dangerous in phishing detection?”
5.8 Hints in Layers
Hint 1: Start with header mismatches
- From vs Reply-To vs Return-Path.
Hint 2: Add URL checks
- Compare anchor text to href.
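A rough sketch of the anchor-text check, built on the standard library `html.parser` and `urllib.parse`, is shown here. The class and function names are hypothetical, and the pattern used to spot a domain inside link text is an intentionally loose assumption.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urlsplit


class AnchorCollector(HTMLParser):
    """Collect (href, visible text) pairs from <a> elements."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


# Rough pattern for link text that itself looks like a host name or URL.
DOMAIN_IN_TEXT = re.compile(
    r"(?:https?://)?([a-z0-9-]+(?:\.[a-z0-9-]+)+)", re.IGNORECASE
)


def link_mismatches(html: str):
    """Return (text_domain, href_domain) pairs where the two differ."""
    collector = AnchorCollector()
    collector.feed(html)
    mismatches = []
    for href, text in collector.links:
        try:
            href_host = (urlsplit(href).hostname or "").lower()
        except ValueError:
            continue  # skip hrefs that urlsplit cannot parse
        match = DOMAIN_IN_TEXT.search(text)
        if not href_host or not match:
            continue
        text_host = match.group(1).lower()
        if text_host != href_host:
            mismatches.append((text_host, href_host))
    return mismatches


sample = (
    '<p><a href="https://paypa1.com/login">paypal.com</a> '
    '<a href="https://example.com/">Click here</a></p>'
)
print(link_mismatches(sample))
```

Links whose text does not look like a domain are ignored, which keeps ordinary "Click here" anchors from being reported. Exact host comparison will also report `www.` variants, so some normalization is likely needed in practice.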
Hint 3: Combine signals with weights
- Avoid single-rule decisions.
5.9 Books That Will Help
| Topic | Book | Chapter |
|---|---|---|
| Email headers | RFC 5322 | Sections 2-3 |
| Phishing tactics | Practical Email Security | phishing section |
| URL parsing | Web security guides | URL section |
5.10 Implementation Phases
Phase 1: Foundation (4-5 days)
Goals:
- Parse headers and body
Tasks:
- Extract domains from headers.
- Parse HTML links.
Checkpoint: List headers and URLs.
Phase 2: Core Functionality (1 week)
Goals:
- Implement signals and scoring
Tasks:
- Add mismatch and auth failure signals.
- Add domain similarity checks.
Checkpoint: Risk score produced for test cases.
Phase 3: Polish and Edge Cases (4-5 days)
Goals:
- Improve explanations and robustness
Tasks:
- Add reason strings and severity.
- Handle malformed HTML and MIME.
Checkpoint: Clear report for multiple samples.
5.11 Key Implementation Decisions
| Decision | Options | Recommendation | Rationale |
|---|---|---|---|
| Similarity | edit distance vs confusables | edit distance + confusables | better coverage |
| Scoring | weighted sum vs rules | weighted sum | flexible |
| Output | text vs JSON | both | analysis and integration |
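Supporting both output formats can be as small as two render functions over the same data. The sketch below is one possible shape; the JSON key names (`risk_score`, `risk_level`, `signals`) are assumptions, not a defined schema.

```python
import json


def render_text(score, label, reasons):
    """Human-readable report in the style of the example output."""
    lines = [f"Risk score: {score:.2f} ({label})", "Signals:"]
    lines.extend(f"- {reason}" for reason in reasons)
    return "\n".join(lines)


def render_json(score, label, reasons):
    """Machine-readable report; the key names are illustrative."""
    payload = {
        "risk_score": round(score, 2),
        "risk_level": label,
        "signals": list(reasons),
    }
    return json.dumps(payload, indent=2)


reasons = [
    "Reply-To domain mismatch: billing@paypa1.com",
    "URL domain mismatch: paypal.com (text) -> paypa1.com (link)",
    "DMARC failed",
]
print(render_text(0.82, "High", reasons))
print(render_json(0.82, "High", reasons))
```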
6. Testing Strategy
6.1 Test Categories
| Category | Purpose | Examples |
|---|---|---|
| Unit Tests | Domain similarity | paypa1 vs paypal |
| Integration Tests | Real samples | known phishing samples |
| Edge Case Tests | Missing headers | fallback behavior |
6.2 Critical Test Cases
- Reply-To mismatch raises score.
- Lookalike domain triggers warning.
- Link text mismatch triggers warning.
6.3 Test Data
From: support@paypal.com
Reply-To: support@paypa1.com
7. Common Pitfalls and Debugging
7.1 Frequent Mistakes
| Pitfall | Symptom | Solution |
|---|---|---|
| Overweighting one signal | false positives | balance weights |
| Bad URL parsing | missed links | use an HTML parser rather than pattern matching on raw markup |
| Ignoring punycode | missed IDN attacks | decode xn-- labels and compare domains in one consistent form (Unicode or ASCII) |
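For the punycode pitfall, a minimal normalization sketch using Python's built-in `idna` codec follows. That codec implements the older IDNA 2003 rules, so treat this as a starting point; the helper names are hypothetical.

```python
def to_unicode(domain: str) -> str:
    """Decode any xn-- labels to Unicode; return the input if decoding fails."""
    try:
        return domain.encode("ascii").decode("idna")
    except UnicodeError:
        return domain  # already non-ASCII, or not decodable


def to_ascii(domain: str) -> str:
    """Encode a Unicode domain to its xn-- form; return the input on failure."""
    try:
        return domain.encode("idna").decode("ascii")
    except UnicodeError:
        return domain


def has_punycode_label(domain: str) -> bool:
    """True if any label of the domain carries the xn-- prefix."""
    return any(label.lower().startswith("xn--") for label in domain.split("."))


encoded = to_ascii("m\u00fcnchen.de")
print(encoded, to_unicode(encoded), has_punycode_label(encoded))
```

Whichever form is chosen, applying it to both the candidate and the trusted domain before any similarity check keeps the comparison consistent.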
7.2 Debugging Strategies
- Print extracted signals and weights.
- Compare results to known phishing examples.
7.3 Performance Traps
- Heavy edit-distance checks across many domains. Limit comparisons.
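One cheap way to limit comparisons is a length pre-filter, sketched below. It relies on the fact that edit distance is never smaller than the difference in string lengths; the function name, the trusted list, and the threshold of 2 are assumptions for the example.

```python
def candidates_within(domain: str, trusted_domains, threshold: int = 2):
    """Keep only trusted domains whose length is within threshold of domain.

    Edit distance is at least the difference in lengths, so anything outside
    this window cannot be within the threshold and can be skipped.
    """
    n = len(domain)
    return [t for t in trusted_domains if abs(len(t) - n) <= threshold]


# Hypothetical trusted list for illustration.
trusted = ["paypal.com", "google.com", "microsoftonline.com", "x.io"]
print(candidates_within("paypa1.com", trusted))
```

Only the domains that survive this filter need the full edit-distance computation.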
8. Extensions and Challenges
8.1 Beginner Extensions
- Add blacklist of known phishing domains.
- Highlight suspicious attachments.
8.2 Intermediate Extensions
- Add ML classifier for content.
- Integrate with reputation checker.
8.3 Advanced Extensions
- Build a browser extension that warns users.
- Add feedback loop to improve scoring.
9. Real-World Connections
9.1 Industry Applications
- Security teams use multi-signal detection for phishing.
- Email gateways combine rules and ML scores.
9.2 Related Open Source Projects
- PhishTank: https://phishtank.org/
- mail (Ruby mail library): https://github.com/mikel/mail
9.3 Interview Relevance
- Email security signals and spoofing detection are common interview topics.
10. Resources
10.1 Essential Reading
- Phishing and email security guides
10.2 Video Resources
- Phishing detection walkthroughs
10.3 Tools and Documentation
- punycode references
- public suffix list
10.4 Related Projects in This Series
11. Self-Assessment Checklist
11.1 Understanding
- I can explain phishing signals
- I understand lookalike domains
- I can parse and analyze URLs
11.2 Implementation
- Produces a risk score with reasons
- Handles malformed messages
- Flags common phishing patterns
11.3 Growth
- I can tune scoring to reduce false positives
- I can explain phishing tradeoffs to stakeholders
12. Submission / Completion Criteria
Minimum Viable Completion:
- Extract header signals and output risk score
Full Completion:
- Add domain similarity and URL mismatch detection
Excellence (Going Above and Beyond):
- Integrate ML and reputation feeds
This guide was generated from EMAIL_SYSTEMS_DEEP_DIVE_PROJECTS.md. For the complete learning path, see the parent directory.