Project 14: Phishing Email Detector

Build a detector that scores emails for phishing risk using header, domain, and content signals.

Quick Reference

Attribute	Value
Difficulty	Advanced
Time Estimate	2-3 weeks
Language	Python (Alternatives: Go, Rust)
Prerequisites	Header parsing, URL parsing
Key Topics	Spoofing signals, domain similarity, URL analysis

1. Learning Objectives

Extract phishing signals from headers and body.
Detect domain impersonation and lookalike domains.
Analyze links for risk patterns.
Produce a risk score and explanation.

2. Theoretical Foundation

2.1 Core Concepts

Spoofing signals: Misaligned From/Reply-To, failed auth results.
Lookalike domains: Typosquatting and homoglyph attacks.
Link analysis: Mismatch between visible text and actual URL.
Risk scoring: Combine multiple weak signals into a score.

2.2 Why This Matters

Phishing remains a top attack vector. Detecting it requires multi-signal analysis rather than a single rule.

2.3 Historical Context / Background

Modern phishing uses UI deception, link obfuscation, and compromised infrastructure. Authentication alone is not sufficient.

2.4 Common Misconceptions

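Misconception: SPF/DKIM pass means safe. Reality: a pass only authenticates the sending domain; attackers can send from compromised accounts or from lookalike domains they registered and configured correctly.
Misconception: One bad signal is enough. Reality: single signals often misfire on legitimate mail (for example, a newsletter with a different Reply-To), so single-rule decisions produce many false positives.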

3. Project Specification

3.1 What You Will Build

A CLI tool that reads a raw message, extracts signals, computes a risk score, and outputs a list of suspicious indicators.

3.2 Functional Requirements

Parse headers and body.
Compare From, Reply-To, Return-Path domains.
Detect lookalike domains using edit distance and homoglyph checks.
Extract URLs and compare visible link text vs actual URL.
Output risk score and reasons.

3.3 Non-Functional Requirements

Performance: Analyze a message in under 300 ms.
Reliability: Handle malformed HTML and MIME.
Usability: Clear explanation for each risk signal.

3.4 Example Usage / Output

$ ./phish-detect message.eml
Risk score: 0.82 (High)
Signals:
  - Reply-To domain mismatch: billing@paypa1.com
  - URL domain mismatch: paypal.com (text) -> paypa1.com (link)
  - DMARC failed

3.5 Real World Outcome

You can flag suspicious emails and explain why they are risky, helping users and analysts take action.

4. Solution Architecture

4.1 High-Level Design

Message Parser
  -> Signal Extractor
  -> Domain Analyzer
  -> URL Analyzer
  -> Risk Scorer

4.2 Key Components

Component	Responsibility	Key Decisions
Signal Extractor	Pull header and body indicators	prioritize auth results
Domain Analyzer	Similarity and homoglyph checks	use edit distance threshold
URL Analyzer	Compare displayed vs actual	handle HTML anchors
Scorer	Combine signals	weighted sum

4.3 Data Structures

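class Signal:
    """One piece of evidence that contributes to the risk score."""

    def __init__(self, name, weight, detail):
        self.name = name      # short identifier for the signal
        self.weight = weight  # contribution to the weighted-sum score
        self.detail = detail  # human-readable explanation shown in the report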

4.4 Algorithm Overview

Key Algorithm: Domain Similarity

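Normalize domains (lowercase, strip the trailing dot, convert punycode to one consistent form).
Compute edit distance and/or map confusable characters to a canonical form.
Flag if the domains differ but the distance is at or below the threshold.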

Complexity Analysis:

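Time: O(n*m) for edit distance per comparison, where n and m are the domain lengths
Space: O(n*m) for the full table, or O(min(n, m)) if only two rows are kept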
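
The domain similarity steps above can be sketched as follows. This is a minimal sketch, not a definitive implementation: the function names, the tiny CONFUSABLES map, and the default threshold of 1 are assumptions chosen for illustration. Identical domains are never flagged, and the edit distance keeps only two rows of the table.

```python
# Hypothetical, deliberately tiny confusables map; a real one would be far larger.
CONFUSABLES = str.maketrans({
    "0": "o", "1": "l", "3": "e", "5": "s",
    "\u0430": "a",  # Cyrillic a
    "\u0435": "e",  # Cyrillic e
    "\u043e": "o",  # Cyrillic o
})


def normalize(domain: str) -> str:
    """Lowercase and drop a trailing root dot."""
    return domain.strip().lower().rstrip(".")


def skeleton(domain: str) -> str:
    """Map confusable characters to a canonical form."""
    return normalize(domain).translate(CONFUSABLES)


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance, keeping only two rows of the table."""
    if len(a) < len(b):
        a, b = b, a
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def is_lookalike(candidate: str, trusted: str, max_distance: int = 1) -> bool:
    """True if candidate differs from trusted but looks like it."""
    c, t = normalize(candidate), normalize(trusted)
    if c == t:
        return False  # the real domain is not a lookalike of itself
    if skeleton(c) == skeleton(t):
        return True   # differs only by confusable characters
    return edit_distance(c, t) <= max_distance
```

For example, is_lookalike("paypa1.com", "paypal.com") is true because the two skeletons are equal, while is_lookalike("PayPal.com.", "paypal.com") is false because both normalize to the same domain.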

5. Implementation Guide

5.1 Development Environment Setup

python -m venv .venv
source .venv/bin/activate

5.2 Project Structure

phishing-detector/
├── parser.py
├── signals.py
├── domains.py
├── urls.py
└── score.py

5.3 The Core Question You’re Answering

“Does this email try to impersonate a trusted sender or trick the user into unsafe actions?”

5.4 Concepts You Must Understand First

Stop and research these before coding:

Authentication-Results
Domain alignment and spoofing
URL parsing and punycode
Edit distance basics
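
As a starting point for the first item, the sketch below pulls the spf, dkim, and dmarc verdicts out of an Authentication-Results header value. It is a simplification under stated assumptions: the function name auth_results is hypothetical, the regular expression is not a full RFC 8601 parser (it ignores comments and quoting), and the header should only be trusted when it was added by your own receiving server.

```python
import re

# Matches "spf=pass", "dkim=fail", "dmarc=none", and so on.
_RESULT_RE = re.compile(r"\b(spf|dkim|dmarc)\s*=\s*([a-z]+)", re.IGNORECASE)


def auth_results(header_value: str) -> dict:
    """Return {method: [results...]} found in an Authentication-Results value."""
    results = {}
    for method, result in _RESULT_RE.findall(header_value):
        # A message can carry several DKIM signatures, so keep a list per method.
        results.setdefault(method.lower(), []).append(result.lower())
    return results


if __name__ == "__main__":
    sample = ("mx.example.net; spf=pass smtp.mailfrom=example.com; "
              "dkim=pass header.d=example.com; dmarc=fail header.from=example.com")
    print(auth_results(sample))
```

A missing method (for example, no dmarc entry at all) simply does not appear in the returned dictionary, which lets the caller treat "absent" differently from "fail".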

5.5 Questions to Guide Your Design

Which domains should be considered trusted targets?
What signals are strong enough to trigger a high score?
How will you handle missing headers?

5.6 Thinking Exercise

If From is example.com, Reply-To is example-support.com, and SPF/DKIM pass, should this be flagged? Why?

5.7 The Interview Questions They’ll Ask

“What signals indicate phishing beyond authentication failures?”
“How do you detect lookalike domains?”
“Why are false positives dangerous in phishing detection?”

5.8 Hints in Layers

Hint 1: Start with header mismatches

From vs Reply-To vs Return-Path.
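
One way to start on this hint is sketched below using Python's standard email package. The helper names header_domain and header_mismatches are hypothetical. The comparison is an exact match on the full domain, which is an assumption: it will also flag legitimate subdomains, and comparing registrable domains would need the public suffix list.

```python
from email import message_from_string
from email.utils import parseaddr


def header_domain(msg, name):
    """Return the lowercased domain of an address header, or None if absent."""
    value = msg.get(name)
    if not value:
        return None  # missing header: nothing to compare
    _, addr = parseaddr(str(value))
    if "@" not in addr:
        return None  # unparseable or empty address
    return addr.rsplit("@", 1)[1].lower().rstrip(".")


def header_mismatches(raw_message: str):
    """List (header, domain) pairs whose domain differs from the From domain."""
    msg = message_from_string(raw_message)
    from_domain = header_domain(msg, "From")
    mismatches = []
    for name in ("Reply-To", "Return-Path"):
        other = header_domain(msg, name)
        if from_domain and other and other != from_domain:
            mismatches.append((name, other))
    return mismatches


if __name__ == "__main__":
    raw = "From: support@paypal.com\nReply-To: support@paypa1.com\n\nHello\n"
    print(header_mismatches(raw))
```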

Hint 2: Add URL checks

Compare anchor text to href.
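
The comparison can be sketched with the standard library's tolerant html.parser. The class and function names here are hypothetical, and the rule is an assumption: a link is flagged only when the visible text itself contains something that looks like a domain and the href host is neither that domain nor one of its subdomains.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urlsplit

# Rough pattern for "something that looks like a domain" in link text.
_DOMAIN_RE = re.compile(r"\b(?:[a-z0-9-]+\.)+[a-z]{2,}\b", re.IGNORECASE)


class LinkCollector(HTMLParser):
    """Collect (href, visible text) pairs for every <a> element."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def link_mismatches(html: str):
    """Return (shown_domain, actual_host) pairs where the two disagree."""
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    mismatches = []
    for href, text in parser.links:
        host = (urlsplit(href).hostname or "").lower()
        shown = _DOMAIN_RE.search(text)
        if not host or not shown:
            continue  # no usable host, or text like "Click here"
        shown_domain = shown.group(0).lower()
        if host != shown_domain and not host.endswith("." + shown_domain):
            mismatches.append((shown_domain, host))
    return mismatches
```

Calling link_mismatches on an anchor whose text is paypal.com and whose href points at https://paypa1.com/login yields the pair ("paypal.com", "paypa1.com"), which maps directly onto the "URL domain mismatch" reason in the example output.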

Hint 3: Combine signals with weights

Avoid single-rule decisions.
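
A weighted sum can be sketched as below. The Signal fields mirror section 4.3 (written as a dataclass for brevity), while the individual weights, the clamp to the 0-1 range, and the Low/Medium/High cutoffs are all assumptions picked for illustration; real values would come from tuning against labeled samples.

```python
from dataclasses import dataclass


@dataclass
class Signal:
    name: str
    weight: float
    detail: str


def risk_score(signals) -> float:
    """Sum the weights and clamp the result to the range 0.0-1.0."""
    total = sum(s.weight for s in signals)
    return round(min(1.0, max(0.0, total)), 2)


def risk_level(score: float) -> str:
    """Map a score to a label; the cutoffs are illustrative assumptions."""
    if score >= 0.7:
        return "High"
    if score >= 0.3:
        return "Medium"
    return "Low"


if __name__ == "__main__":
    found = [
        Signal("reply_to_mismatch", 0.30, "Reply-To domain mismatch"),
        Signal("url_mismatch", 0.30, "URL domain mismatch"),
        Signal("dmarc_fail", 0.22, "DMARC failed"),
    ]
    score = risk_score(found)
    print(f"Risk score: {score:.2f} ({risk_level(score)})")
```

With these hypothetical weights no single signal reaches the High cutoff on its own, which is the point of the hint: only a combination of signals pushes a message into the top band.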

5.9 Books That Will Help

Topic	Book	Chapter
Email headers	RFC 5322	Sections 2-3
Phishing tactics	Practical Email Security	phishing section
URL parsing	Web security guides	URL section

5.10 Implementation Phases

Phase 1: Foundation (4-5 days)

Goals:

Parse headers and body

Tasks:

Extract domains from headers.
Parse HTML links.

Checkpoint: List headers and URLs.

Phase 2: Core Functionality (1 week)

Goals:

Implement signals and scoring

Tasks:

Add mismatch and auth failure signals.
Add domain similarity checks.

Checkpoint: Risk score produced for test cases.

Phase 3: Polish and Edge Cases (4-5 days)

Goals:

Improve explanations and robustness

Tasks:

Add reason strings and severity.
Handle malformed HTML and MIME.

Checkpoint: Clear report for multiple samples.
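
For the malformed-MIME task in Phase 3, one defensive approach is sketched below with the standard email package. The function name extract_text_parts is hypothetical, and falling back to UTF-8 with replacement characters when a part cannot be decoded is an assumption, not the only reasonable choice.

```python
from email import message_from_bytes, policy


def extract_text_parts(raw: bytes):
    """Return ([(content_type, text), ...], [defect names]) without raising
    on unknown charsets."""
    msg = message_from_bytes(raw, policy=policy.default)
    bodies = []
    for part in msg.walk():
        if part.get_content_maintype() != "text":
            continue  # skip containers and attachments
        try:
            text = part.get_content()
        except (LookupError, UnicodeError, KeyError):
            # Unknown or broken charset: decode the raw bytes leniently instead.
            payload = part.get_payload(decode=True) or b""
            text = payload.decode("utf-8", errors="replace")
        bodies.append((part.get_content_type(), text))
    # The parser records structural problems it found on the top-level message.
    defects = [type(d).__name__ for d in msg.defects]
    return bodies, defects
```

Recorded defects can themselves be reported as a low-weight signal, since heavily malformed structure is sometimes used to confuse filters.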

5.11 Key Implementation Decisions

Decision	Options	Recommendation	Rationale
Similarity	edit distance vs confusables	edit distance + confusables	better coverage
Scoring	weighted sum vs rules	weighted sum	flexible
Output	text vs JSON	both	analysis and integration
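
Supporting both output formats can be as small as two render functions over the same data. The sketch below is illustrative: the function names and the JSON key names (risk_score, level, signals) are assumptions, and the text layout follows the example in section 3.4.

```python
import json


def render_text(score: float, level: str, reasons) -> str:
    """Human-readable report in the style of the example output."""
    lines = [f"Risk score: {score:.2f} ({level})", "Signals:"]
    lines += [f"  - {reason}" for reason in reasons]
    return "\n".join(lines)


def render_json(score: float, level: str, reasons) -> str:
    """Machine-readable report for integration with other tools."""
    report = {"risk_score": score, "level": level, "signals": list(reasons)}
    return json.dumps(report, indent=2)


if __name__ == "__main__":
    reasons = ["Reply-To domain mismatch: billing@paypa1.com", "DMARC failed"]
    print(render_text(0.82, "High", reasons))
    print(render_json(0.82, "High", reasons))
```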

6. Testing Strategy

6.1 Test Categories

Category	Purpose	Examples
Unit Tests	Domain similarity	paypa1 vs paypal
Integration Tests	Real samples	known phishing samples
Edge Case Tests	Missing headers	fallback behavior

6.2 Critical Test Cases

Reply-To mismatch raises score.
Lookalike domain triggers warning.
Link text mismatch triggers warning.

6.3 Test Data

From: support@paypal.com
Reply-To: support@paypa1.com

7. Common Pitfalls and Debugging

7.1 Frequent Mistakes

Pitfall	Symptom	Solution
Overweighting one signal	false positives	balance weights
Bad URL parsing	missed links	use HTML parser
Ignoring punycode	missed IDN attacks	normalize every domain to one form (Unicode or ASCII) before comparing
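
To illustrate the punycode pitfall, the sketch below converts domains between their ASCII (xn--) and Unicode forms with Python's built-in idna codec. The helper names are hypothetical. Note the assumption baked in: the built-in codec implements the older IDNA 2003 rules, and it raises UnicodeError on invalid labels, so callers should be ready to catch that.

```python
def to_ascii(domain: str) -> str:
    """Return the ASCII (punycode) form of a domain."""
    return domain.strip().lower().rstrip(".").encode("idna").decode("ascii")


def to_unicode(domain: str) -> str:
    """Return the Unicode form of a domain, whichever form it arrived in."""
    return to_ascii(domain).encode("ascii").decode("idna")


def is_idn(domain: str) -> bool:
    """True if any label is an internationalized (xn--) label."""
    return any(label.startswith("xn--") for label in to_ascii(domain).split("."))


if __name__ == "__main__":
    print(to_ascii("b\u00fccher.de"))      # ASCII form of an IDN
    print(to_unicode("xn--bcher-kva.de"))  # Unicode form of the same domain
    print(is_idn("paypal.com"))
```

Comparing domains only after passing both through the same function avoids the case where one side is stored as xn-- text and the other as Unicode.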

7.2 Debugging Strategies

Print extracted signals and weights.
Compare results to known phishing examples.

7.3 Performance Traps

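Running edit distance against every known domain is expensive. Limit comparisons to a short list of protected domains, and skip pairs whose lengths differ by more than the threshold.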
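
One cheap way to limit comparisons relies on a property of edit distance: it can never be smaller than the difference in length between the two strings. The sketch below uses that as a prefilter; the function name and the default threshold of 2 are assumptions.

```python
def worth_comparing(candidate: str, trusted_domains, max_distance: int = 2):
    """Return only the trusted domains that could be within max_distance.

    The length difference is a lower bound on edit distance, so any pair
    that fails this check can be skipped without running the full algorithm.
    """
    n = len(candidate)
    return [
        t for t in trusted_domains
        if t != candidate and abs(len(t) - n) <= max_distance
    ]


if __name__ == "__main__":
    trusted = ["paypal.com", "microsoft.com", "a.io"]
    print(worth_comparing("paypa1.com", trusted))
```

Only the survivors of this filter need the full O(n*m) comparison.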

8. Extensions and Challenges

8.1 Beginner Extensions

Add blacklist of known phishing domains.
Highlight suspicious attachments.

8.2 Intermediate Extensions

Add ML classifier for content.
Integrate with reputation checker.

8.3 Advanced Extensions

Build a browser extension that warns users.
Add feedback loop to improve scoring.

9. Real-World Connections

9.1 Industry Applications

Security teams use multi-signal detection for phishing.
Email gateways combine rules and ML scores.

9.2 Related Open Source Projects

PhishTank: https://phishtank.org/
mail (Ruby mail library): https://github.com/mikel/mail

9.3 Interview Relevance

Email security signals and spoofing detection are common interview topics.

10. Resources

10.1 Essential Reading

Phishing and email security guides

10.2 Video Resources

Phishing detection walkthroughs

10.3 Tools and Documentation

punycode references
public suffix list

11. Self-Assessment Checklist

11.1 Understanding

I can explain phishing signals
I understand lookalike domains
I can parse and analyze URLs

11.2 Implementation

Produces a risk score with reasons
Handles malformed messages
Flags common phishing patterns

11.3 Growth

I can tune scoring to reduce false positives
I can explain phishing tradeoffs to stakeholders

12. Submission / Completion Criteria

Minimum Viable Completion:

Extract header signals and output risk score

Full Completion:

Add domain similarity and URL mismatch detection

Excellence (Going Above and Beyond):

Integrate ML and reputation feeds

This guide was generated from EMAIL_SYSTEMS_DEEP_DIVE_PROJECTS.md. For the complete learning path, see the parent directory.
