Project 14: Phishing Email Detector

Build a detector that scores emails for phishing risk using header, domain, and content signals.

Quick Reference

Attribute       Value
Difficulty      Advanced
Time Estimate   2-3 weeks
Language        Python (Alternatives: Go, Rust)
Prerequisites   Header parsing, URL parsing
Key Topics      Spoofing signals, domain similarity, URL analysis

1. Learning Objectives

  1. Extract phishing signals from headers and body.
  2. Detect domain impersonation and lookalike domains.
  3. Analyze links for risk patterns.
  4. Produce a risk score and explanation.

2. Theoretical Foundation

2.1 Core Concepts

  • Spoofing signals: Misaligned From/Reply-To, failed auth results.
  • Lookalike domains: Typosquatting and homoglyph attacks.
  • Link analysis: Mismatch between visible text and actual URL.
  • Risk scoring: Combine multiple weak signals into a score.

2.2 Why This Matters

Phishing remains a top attack vector. Detecting it requires multi-signal analysis rather than a single rule.

2.3 Historical Context / Background

Modern phishing uses UI deception, link obfuscation, and compromised infrastructure. Authentication alone is not sufficient.

2.4 Common Misconceptions

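  • Misconception: SPF/DKIM pass means safe. Reality: a pass only shows that the sending domain authorized the message. Attackers can send from compromised domains or from lookalike domains they registered and configured correctly.
  • Misconception: One bad signal is enough. Reality: any single signal also fires on legitimate mail (for example, a newsletter with a different Reply-To), so acting on one signal alone produces false positives.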

3. Project Specification

3.1 What You Will Build

A CLI tool that reads a raw message, extracts signals, computes a risk score, and outputs a list of suspicious indicators.

3.2 Functional Requirements

  1. Parse headers and body.
  2. Compare From, Reply-To, Return-Path domains.
  3. Detect lookalike domains using edit distance and homoglyph checks.
  4. Extract URLs and compare visible link text vs actual URL.
  5. Output risk score and reasons.
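
As a starting point for requirement 1, the sketch below parses a raw message with Python's standard `email` package. The sample message, the `parse_message` name, and the choice of which headers to keep are assumptions made for illustration. The parser does not raise on malformed input; it records problems in `msg.defects`, which helps with the reliability requirement.

```python
from email import message_from_string, policy

# Hypothetical sample message used only for this illustration.
RAW = """\
From: Support <support@paypal.com>
Reply-To: support@paypa1.com
Subject: Verify your account
MIME-Version: 1.0
Content-Type: text/html; charset="utf-8"

<p>Please <a href="http://paypa1.com/login">sign in</a>.</p>
"""

def parse_message(raw: str):
    """Return (selected headers, body text) for a raw RFC 5322 message."""
    msg = message_from_string(raw, policy=policy.default)
    # Missing headers become empty strings so later checks can skip them.
    headers = {
        name: str(msg.get(name, ""))
        for name in ("From", "Reply-To", "Return-Path")
    }
    # Prefer the HTML part (it carries the anchors), fall back to plain text.
    part = msg.get_body(preferencelist=("html", "plain"))
    body = part.get_content() if part is not None else ""
    return headers, body

headers, body = parse_message(RAW)
```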

3.3 Non-Functional Requirements

  • Performance: Analyze a message in under 300 ms.
  • Reliability: Handle malformed HTML and MIME.
  • Usability: Clear explanation for each risk signal.

3.4 Example Usage / Output

$ ./phish-detect message.eml
Risk score: 0.82 (High)
Signals:
  - Reply-To domain mismatch: billing@paypa1.com
  - URL domain mismatch: paypal.com (text) -> paypa1.com (link)
  - DMARC failed

3.5 Real World Outcome

You can flag suspicious emails and explain why they are risky, helping users and analysts take action.


4. Solution Architecture

4.1 High-Level Design

Message Parser
  -> Signal Extractor
  -> Domain Analyzer
  -> URL Analyzer
  -> Risk Scorer

4.2 Key Components

Component          Responsibility                    Key Decisions
Signal Extractor   Pull header and body indicators   prioritize auth results
Domain Analyzer    Similarity and homoglyph checks   use edit distance threshold
URL Analyzer       Compare displayed vs actual       handle HTML anchors
Scorer             Combine signals                   weighted sum

4.3 Data Structures

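from dataclasses import dataclass

@dataclass
class Signal:
    name: str      # short identifier, e.g. "reply_to_mismatch"
    weight: float  # contribution of this signal to the risk score
    detail: str    # human-readable explanation shown in the report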

4.4 Algorithm Overview

Key Algorithm: Domain Similarity

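  1. Normalize domains (lowercase, strip any trailing dot, convert punycode to one consistent form).
  2. Compute the edit distance to each trusted domain, and also compare after mapping confusable characters (for example, "1" to "l").
  3. Flag the domain if it is not an exact match but the distance is at or below the threshold, or the confusable-mapped forms are equal.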

Complexity Analysis:

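  • Time: O(n*m) for edit distance per comparison, where n and m are the lengths of the two domains
  • Space: O(n*m) for the full matrix, or O(min(n, m)) when only two rows are kept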
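
The domain similarity steps can be sketched roughly as follows. The `CONFUSABLES` map is a tiny illustrative sample rather than a complete homoglyph table, and the function names and the default threshold of 1 are assumptions to tune against real data.

```python
# Tiny illustrative map; a real confusables table is much larger.
CONFUSABLES = {"0": "o", "1": "l", "3": "e", "5": "s", "rn": "m", "vv": "w"}

def normalize(domain: str) -> str:
    """Lowercase and drop surrounding whitespace and any trailing dot."""
    return domain.strip().lower().rstrip(".")

def skeleton(domain: str) -> str:
    """Replace confusable sequences with the characters they imitate."""
    out = normalize(domain)
    for fake, real in CONFUSABLES.items():
        out = out.replace(fake, real)
    return out

def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance, keeping only two rows of the matrix."""
    if len(a) < len(b):
        a, b = b, a  # make b the shorter string so rows stay small
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def is_lookalike(candidate: str, trusted: str, max_distance: int = 1) -> bool:
    """True if candidate resembles trusted without being identical."""
    c, t = normalize(candidate), normalize(trusted)
    if c == t:
        return False  # the genuine domain is not a lookalike of itself
    if skeleton(c) == skeleton(t):
        return True
    return edit_distance(c, t) <= max_distance
```

Because both sides pass through `skeleton`, two legitimate domains that differ only by a confusable pair (such as "rn" and "m") will also match, so a hit is best treated as one weighted signal rather than a verdict.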

5. Implementation Guide

5.1 Development Environment Setup

python -m venv .venv
source .venv/bin/activate

5.2 Project Structure

phishing-detector/
├── parser.py
├── signals.py
├── domains.py
├── urls.py
└── score.py

5.3 The Core Question You’re Answering

“Does this email try to impersonate a trusted sender or trick the user into unsafe actions?”

5.4 Concepts You Must Understand First

Stop and research these before coding:

  1. Authentication-Results
  2. Domain alignment and spoofing
  3. URL parsing and punycode
  4. Edit distance basics
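
To make the first item concrete, here is a minimal sketch that pulls the SPF, DKIM, and DMARC verdicts out of an Authentication-Results header value (the header defined in RFC 8601). The regular expression is a simplification, not a full parser for the header grammar, and treating only `fail` and `softfail` as failures is an assumption. The header should only be trusted when it was added by your own receiving server.

```python
import re

# Simplified pattern: method=result pairs such as "dmarc=fail".
AUTH_RE = re.compile(r"\b(spf|dkim|dmarc)\s*=\s*([a-z]+)", re.IGNORECASE)

# Assumption: which result values count as a failure signal.
FAILURES = {"fail", "softfail"}

def auth_results(header_value: str) -> dict:
    """Map each method to its first reported result, lowercased."""
    results = {}
    for method, result in AUTH_RE.findall(header_value):
        results.setdefault(method.lower(), result.lower())
    return results

def failed_methods(results: dict) -> list:
    """Return the methods whose result is in FAILURES, sorted by name."""
    return sorted(m for m, r in results.items() if r in FAILURES)

# Hypothetical header value for illustration.
sample = ("mx.example.net; spf=pass smtp.mailfrom=paypa1.com; "
          "dkim=pass header.d=paypa1.com; dmarc=fail header.from=paypal.com")
parsed = auth_results(sample)
```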

5.5 Questions to Guide Your Design

  1. Which domains should be considered trusted targets?
  2. What signals are strong enough to trigger a high score?
  3. How will you handle missing headers?

5.6 Thinking Exercise

If From is example.com, Reply-To is example-support.com, and SPF/DKIM pass, should this be flagged? Why?

5.7 The Interview Questions They’ll Ask

  1. “What signals indicate phishing beyond authentication failures?”
  2. “How do you detect lookalike domains?”
  3. “Why are false positives dangerous in phishing detection?”

5.8 Hints in Layers

Hint 1: Start with header mismatches

  • From vs Reply-To vs Return-Path.
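
A minimal sketch of this hint, assuming the headers are already available as a plain dict of strings (the function names and message wording are illustrative). It compares full domains; reducing them to registrable domains with the public suffix list would be a refinement. Return-Path often differs legitimately when mail goes through a third-party sending service, so that mismatch probably deserves a lower weight.

```python
from email.utils import parseaddr

def addr_domain(header_value: str) -> str:
    """Extract the lowercased domain from an address header, or ''."""
    _, addr = parseaddr(header_value)
    return addr.rpartition("@")[2].lower() if "@" in addr else ""

def header_mismatches(headers: dict) -> list:
    """List headers whose domain differs from the From domain."""
    from_domain = addr_domain(headers.get("From", ""))
    findings = []
    for name in ("Reply-To", "Return-Path"):
        other = addr_domain(headers.get(name, ""))
        # Missing or unparseable headers are skipped, not flagged.
        if from_domain and other and other != from_domain:
            findings.append(
                f"{name} domain mismatch: {other} (From: {from_domain})")
    return findings
```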

Hint 2: Add URL checks

  • Compare anchor text to href.
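
One way to approach this with the standard library is sketched below. `AnchorCollector`, `same_site`, and `link_mismatches` are hypothetical names, and the domain regex is deliberately loose. Only anchors whose visible text itself looks like a domain are compared, so text such as "Click here" is ignored. `html.parser` tolerates malformed markup, although an anchor that is never closed is not recorded by this sketch.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urlparse

# Loose pattern for something that looks like a hostname in visible text.
DOMAIN_RE = re.compile(r"\b((?:[a-z0-9-]+\.)+[a-z]{2,})\b", re.IGNORECASE)

class AnchorCollector(HTMLParser):
    """Collect (href, visible text) pairs for each closed <a> element."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None  # None means "not currently inside an anchor"
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href") or ""
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None

def same_site(a: str, b: str) -> bool:
    """Treat a host and its subdomains (e.g. www.) as the same site."""
    return a == b or a.endswith("." + b) or b.endswith("." + a)

def link_mismatches(html: str) -> list:
    parser = AnchorCollector()
    parser.feed(html)
    parser.close()
    findings = []
    for href, text in parser.links:
        host = (urlparse(href).hostname or "").lower()
        shown = DOMAIN_RE.search(text)
        if host and shown:
            shown_host = shown.group(1).lower()
            if not same_site(host, shown_host):
                findings.append(
                    f"URL domain mismatch: {shown_host} (text) -> {host} (link)")
    return findings
```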

Hint 3: Combine signals with weights

  • Avoid single-rule decisions.
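
A weighted sum could look like the sketch below. The weights, the cap at 1.0, and the label thresholds are invented for illustration and would need tuning against labeled samples. The `Signal` fields follow the data structure from section 4.3.

```python
from dataclasses import dataclass

@dataclass
class Signal:
    name: str
    weight: float
    detail: str

def risk_score(signals) -> float:
    """Sum the weights and cap the result at 1.0."""
    return min(1.0, sum(s.weight for s in signals))

def risk_label(score: float) -> str:
    # Hypothetical thresholds; adjust to your false-positive tolerance.
    if score >= 0.7:
        return "High"
    if score >= 0.4:
        return "Medium"
    return "Low"

def format_report(signals) -> str:
    """Render the score followed by one reason line per signal."""
    score = risk_score(signals)
    lines = [f"Risk score: {score:.2f} ({risk_label(score)})", "Signals:"]
    lines += [f"  - {s.detail}" for s in signals]
    return "\n".join(lines)

# Hypothetical weights chosen so that no single signal reaches "High".
signals = [
    Signal("reply_to_mismatch", 0.30, "Reply-To domain mismatch: billing@paypa1.com"),
    Signal("url_mismatch", 0.35, "URL domain mismatch: paypal.com (text) -> paypa1.com (link)"),
    Signal("dmarc_fail", 0.17, "DMARC failed"),
]
report = format_report(signals)
```

With these example weights, each signal on its own stays in the "Low" band, and only the combination crosses into "High".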

5.9 Books That Will Help

Topic              Book                       Chapter
Email headers      RFC 5322                   Sections 2-3
Phishing tactics   Practical Email Security   phishing section
URL parsing        Web security guides        URL section

5.10 Implementation Phases

Phase 1: Foundation (4-5 days)

Goals:

  • Parse headers and body

Tasks:

  1. Extract domains from headers.
  2. Parse HTML links.

Checkpoint: List headers and URLs.

Phase 2: Core Functionality (1 week)

Goals:

  • Implement signals and scoring

Tasks:

  1. Add mismatch and auth failure signals.
  2. Add domain similarity checks.

Checkpoint: Risk score produced for test cases.

Phase 3: Polish and Edge Cases (4-5 days)

Goals:

  • Improve explanations and robustness

Tasks:

  1. Add reason strings and severity.
  2. Handle malformed HTML and MIME.

Checkpoint: Clear report for multiple samples.

5.11 Key Implementation Decisions

Decision     Options                        Recommendation                Rationale
Similarity   edit distance vs confusables   edit distance + confusables   better coverage
Scoring      weighted sum vs rules          weighted sum                  flexible
Output       text vs JSON                   both                          analysis and integration
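
Supporting both output formats can be as small as one function with a flag. The sketch below assumes the score and the reason strings have already been computed; the `render` name and the JSON field names are illustrative choices, not a fixed schema.

```python
import json

def render(score: float, reasons: list, as_json: bool = False) -> str:
    """Return the report as JSON for integrations or as text for analysts."""
    if as_json:
        payload = {"risk_score": round(score, 2), "signals": reasons}
        return json.dumps(payload, indent=2)
    lines = [f"Risk score: {score:.2f}", "Signals:"]
    lines += [f"  - {reason}" for reason in reasons]
    return "\n".join(lines)
```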

6. Testing Strategy

6.1 Test Categories

Category            Purpose             Examples
Unit Tests          Domain similarity   paypa1 vs paypal
Integration Tests   Real samples        known phishing samples
Edge Case Tests     Missing headers     fallback behavior

6.2 Critical Test Cases

  1. Reply-To mismatch raises score.
  2. Lookalike domain triggers warning.
  3. Link text mismatch triggers warning.

6.3 Test Data

From: support@paypal.com
Reply-To: support@paypa1.com

7. Common Pitfalls and Debugging

7.1 Frequent Mistakes

Pitfall                    Symptom              Solution
Overweighting one signal   false positives      balance weights
Bad URL parsing            missed links         use an HTML parser instead of regexes
Ignoring punycode          missed IDN attacks   convert every domain to one form (all Unicode or all ASCII) before comparing
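
For the punycode pitfall, a minimal sketch using Python's built-in `idna` codec is shown below. That codec implements the older IDNA 2003 rules; the third-party `idna` package covers IDNA 2008 if stricter handling is needed. The helper names are assumptions.

```python
def to_unicode(domain: str) -> str:
    """Decode xn-- labels so later checks see the real characters."""
    domain = domain.strip().lower().rstrip(".")
    try:
        return domain.encode("ascii").decode("idna")
    except UnicodeError:
        # Already Unicode, or not valid punycode: return it unchanged.
        return domain

def has_non_ascii(domain: str) -> bool:
    """True if the decoded domain contains non-ASCII characters."""
    return not to_unicode(domain).isascii()
```

A domain that decodes to non-ASCII characters is not malicious by itself, but it is a reasonable trigger for running the homoglyph comparison against your trusted list.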

7.2 Debugging Strategies

  • Print extracted signals and weights.
  • Compare results to known phishing examples.

7.3 Performance Traps

  • Heavy edit-distance checks across many domains. Limit comparisons.
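
One cheap way to limit comparisons: the edit distance between two strings is never smaller than the difference in their lengths, so trusted domains whose length differs by more than the threshold can be skipped before running the O(n*m) computation. The function name and default threshold below are assumptions.

```python
def candidate_targets(domain: str, trusted: list, max_distance: int = 1) -> list:
    """Keep only trusted domains that could be within max_distance edits."""
    return [t for t in trusted if abs(len(t) - len(domain)) <= max_distance]
```

This filter only removes pairs that cannot match under the edit-distance threshold; confusable substitutions that change length (such as "rn" for "m") still need their own comparison.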

8. Extensions and Challenges

8.1 Beginner Extensions

  • Add blacklist of known phishing domains.
  • Highlight suspicious attachments.

8.2 Intermediate Extensions

  • Add ML classifier for content.
  • Integrate with reputation checker.

8.3 Advanced Extensions

  • Build a browser extension that warns users.
  • Add feedback loop to improve scoring.

9. Real-World Connections

9.1 Industry Applications

  • Security teams use multi-signal detection for phishing.
  • Email gateways combine rules and ML scores.

9.2 Related Projects and Resources

  • PhishTank: https://phishtank.org/
  • mail (Ruby email library): https://github.com/mikel/mail

9.3 Interview Relevance

  • Email security signals and spoofing detection are common interview topics.

10. Resources

10.1 Essential Reading

  • Phishing and email security guides

10.2 Video Resources

  • Phishing detection walkthroughs

10.3 Tools and Documentation

  • punycode references
  • public suffix list

11. Self-Assessment Checklist

11.1 Understanding

  • I can explain phishing signals
  • I understand lookalike domains
  • I can parse and analyze URLs

11.2 Implementation

  • Produces a risk score with reasons
  • Handles malformed messages
  • Flags common phishing patterns

11.3 Growth

  • I can tune scoring to reduce false positives
  • I can explain phishing tradeoffs to stakeholders

12. Submission / Completion Criteria

Minimum Viable Completion:

  • Extract header signals and output risk score

Full Completion:

  • Add domain similarity and URL mismatch detection

Excellence (Going Above and Beyond):

  • Integrate ML and reputation feeds

This guide was generated from EMAIL_SYSTEMS_DEEP_DIVE_PROJECTS.md. For the complete learning path, see the parent directory.