Project 6: Continuous Authentication Monitor (Behavioral Zero Trust)
The Core Zero Trust Principle: "Trust is not a one-time decision. It is continuously evaluated based on behavior and context." This project teaches you that authentication at login is just the beginning: real security requires watching every action, learning normal patterns, and detecting when something feels wrong.
Project Overview
| Attribute | Value |
|---|---|
| Difficulty | Level 4: Expert |
| Time Estimate | Extended Sprint (40-60 hours) |
| Primary Language | Python |
| Alternative Languages | Go, Node.js |
| Coolness Level | Level 5: Pure Magic |
| Business Potential | Industry Disruptor |
| Knowledge Area | Data Science / Security Analytics |
| Software/Tools | Log Processing, Anomaly Detection, Geolocation APIs |
| Main Book | "Security in Computing" by Pfleeger - Chapter 7 |
Learning Objectives
By completing this project, you will be able to:
- Implement User and Entity Behavior Analytics (UEBA): Build a system that learns "normal" behavior patterns and detects deviations that indicate account compromise.
- Calculate Continuous Trust Scores: Move beyond binary authentication to real-time risk assessment that considers location, time, device, and behavior.
- Solve the Impossible Travel Problem: Implement the Haversine formula to detect physically impossible login sequences (NYC to London in 20 minutes).
- Build Statistical Baselines: Use moving averages, standard deviations, and z-scores to define what "normal" means for each user.
- Design Real-Time Event Processing: Architect a stream processing system that can analyze access logs and trigger alerts within seconds.
- Implement Token Revocation at Scale: Propagate security decisions to all Policy Enforcement Points (PEPs) in your infrastructure simultaneously.
- Balance Security and Usability: Tune detection thresholds to minimize false positives while catching real threats.
Deep Theoretical Foundation
User and Entity Behavior Analytics (UEBA)
UEBA represents a paradigm shift in security from signature-based detection to behavior-based detection. Traditional security asks "Is this a known attack pattern?" UEBA asks "Is this behavior normal for this entity?"
+------------------------------------------------------------------+
| EVOLUTION OF THREAT DETECTION |
+------------------------------------------------------------------+
GENERATION 1: Signature-Based (1990s-2000s)
==========================================
Known Attack Patterns Database
+---------------------------+
| SQL Injection: ' OR 1=1 |
| XSS: <script>alert()</script> |
| Known Malware Hashes |
+---------------------------+
|
v
[Request] ---> [Pattern Match?] ---> [Alert if Match]
Weakness: Cannot detect unknown attacks (zero-days)
Weakness: Attackers simply modify patterns slightly
GENERATION 2: Rule-Based (2000s-2010s)
=====================================
Static Rules
+---------------------------+
| IF failed_logins > 5 |
| THEN lock_account |
| IF request_rate > 100/s |
| THEN rate_limit |
+---------------------------+
|
v
[Request] ---> [Rule Evaluation] ---> [Action]
Weakness: Requires human to write every rule
Weakness: Attackers learn the thresholds
GENERATION 3: Behavior Analytics (2015-Present) - This Project
==============================================================
Learned Behavioral Baselines (per-user)
+---------------------------+
| Alice: Login NYC 9-5 |
| Alice: Uses Chrome/macOS |
| Alice: Accesses /api/v2 |
| Alice: ~50 requests/hour |
+---------------------------+
|
v
[Request] ---> [Compare to Baseline] ---> [Calculate Risk Score]
|
v
[Risk Score > Threshold?] ---> [Alert + Revoke Access]
Strength: Detects novel attacks (no signature needed)
Strength: Personalized to each user's patterns
Strength: Adapts as behavior naturally changes
+------------------------------------------------------------------+
Why UEBA Matters for Zero Trust:
Zero Trust mandates continuous verification. But what does "verification" mean after initial authentication? You cannot ask users to re-enter passwords every 5 minutes. Instead, you verify their behavior matches their identity.
Traditional Authentication:
===========================
08:00 - User: alice, Password: ******, MFA: 123456
--> ACCESS GRANTED (for 8 hours)
08:01 - 16:00: No further verification
(Attacker could have taken over at 08:05)
Continuous Authentication (UEBA):
=================================
08:00 - User: alice, Password: ******, MFA: 123456
--> Initial access granted, monitoring begins
--> Trust Score: 100
08:05 - Request from NYC, Chrome/macOS, normal pattern
--> Trust Score: 100
08:10 - Request from NYC, same device
--> Trust Score: 100
08:15 - ALERT: Request from LONDON, different device
--> Impossible Travel detected (NYC->London in 15 min)
--> Trust Score: 0
--> ACCESS REVOKED
--> MFA step-up required
Continuous Authentication vs Point-in-Time
The phrase "authenticate once, trust forever" is the antithesis of Zero Trust. Here's why continuous authentication is essential:
+------------------------------------------------------------------+
| THE PROBLEM WITH POINT-IN-TIME AUTH |
+------------------------------------------------------------------+
Attack Scenario: Session Hijacking
==================================
08:00 - Legitimate user Alice authenticates
Browser receives session token: JWT_TOKEN_ABC
08:01 - Alice's laptop infected with malware
Malware extracts JWT_TOKEN_ABC
08:02 - Attacker uses JWT_TOKEN_ABC from different location
Server sees: Valid token, correct signature, not expired
Result: ACCESS GRANTED
08:03 - 16:00: Attacker has full access as "Alice"
WHY TRADITIONAL AUTH FAILED:
- Token was valid (correct signature)
- Token was not expired
- No mechanism to detect context change
+------------------------------------------------------------------+
Solution: Continuous Context Evaluation
=======================================
Every request is evaluated against:
1. LOCATION CONTEXT
- Is this IP address consistent with recent activity?
- Is travel time between locations physically possible?
- Is this a known VPN exit node or Tor node?
2. TEMPORAL CONTEXT
- Does this match the user's typical working hours?
- Is this an unusual day of week for this user?
- How long since the last activity? (session staleness)
3. DEVICE CONTEXT
- Is this the same browser fingerprint?
- Is this the same user-agent string?
- Has the device health changed?
4. BEHAVIORAL CONTEXT
- Is the request rate normal for this user?
- Are they accessing typical resources?
- Is the data volume expected?
+------------------------------------------------------------------+
The Impossible Travel Algorithm
The "Impossible Travel" detection is one of the most powerful signals in UEBA. It catches scenarios where a user's credentials appear to be used from two distant locations within an impossibly short time.
The Haversine Formula:
The Haversine formula calculates the great-circle distance between two points on a sphere given their latitude and longitude. This is the shortest distance between two points on Earthâs surface.
+------------------------------------------------------------------+
| HAVERSINE FORMULA |
+------------------------------------------------------------------+
Given:
- Point A: (lat1, lon1) in radians
- Point B: (lat2, lon2) in radians
- R = Earth's radius = 6,371 km
Formula:
d = 2R * arcsin( sqrt( sin^2((lat2-lat1)/2) +
                       cos(lat1) * cos(lat2) * sin^2((lon2-lon1)/2) ) )
Step-by-step calculation:
1. Convert degrees to radians:
lat_rad = lat_deg * (pi / 180)
2. Calculate differences:
dlat = lat2 - lat1
dlon = lon2 - lon1
3. Apply Haversine:
a = sin^2(dlat/2) + cos(lat1) * cos(lat2) * sin^2(dlon/2)
c = 2 * arcsin(sqrt(a))
distance = R * c
+------------------------------------------------------------------+
| EXAMPLE CALCULATION |
+------------------------------------------------------------------+
Location A: New York City
Latitude: 40.7128 N
Longitude: 74.0060 W
Location B: London
Latitude: 51.5074 N
Longitude: 0.1278 W
Step 1: Convert to radians
NYC: lat1 = 0.7102 rad, lon1 = -1.2918 rad
London: lat2 = 0.8997 rad, lon2 = -0.0022 rad
Step 2: Calculate
dlat = 0.1895 rad
dlon = 1.2896 rad
a = sin^2(0.0948) + cos(0.7102) * cos(0.8997) * sin^2(0.6448)
a = 0.00897 + 0.764 * 0.621 * 0.361
a = 0.00897 + 0.171
a = 0.180
c = 2 * arcsin(sqrt(0.180))
c = 2 * arcsin(0.424)
c = 2 * 0.438
c = 0.876 rad
distance = 6371 * 0.876 = 5,581 km
Result: NYC to London = 5,581 km (approximately)
+------------------------------------------------------------------+
Implementation in Python:

import math

def haversine_distance(lat1: float, lon1: float,
                       lat2: float, lon2: float) -> float:
    """
    Calculate the great-circle distance between two points
    on Earth using the Haversine formula.

    Args:
        lat1, lon1: Coordinates of point 1 (in degrees)
        lat2, lon2: Coordinates of point 2 (in degrees)

    Returns:
        Distance in kilometers
    """
    R = 6371  # Earth's radius in kilometers

    # Convert to radians
    lat1_rad = math.radians(lat1)
    lat2_rad = math.radians(lat2)
    dlat = math.radians(lat2 - lat1)
    dlon = math.radians(lon2 - lon1)

    # Haversine formula
    a = (math.sin(dlat / 2) ** 2 +
         math.cos(lat1_rad) * math.cos(lat2_rad) *
         math.sin(dlon / 2) ** 2)
    c = 2 * math.asin(math.sqrt(a))
    return R * c
Speed Calculation and Thresholds:
+------------------------------------------------------------------+
| IMPOSSIBLE TRAVEL DETECTION |
+------------------------------------------------------------------+
Given two log entries:
Entry 1: User Alice, IP 1.1.1.1 (NYC), Time: 10:00:00
Entry 2: User Alice, IP 8.8.8.8 (London), Time: 10:20:00
Step 1: Geolocate IPs
1.1.1.1 -> NYC (40.7128, -74.0060)
8.8.8.8 -> London (51.5074, -0.1278)
Step 2: Calculate distance
Distance = haversine(40.7128, -74.0060, 51.5074, -0.1278)
Distance = 5,581 km
Step 3: Calculate time difference
Time diff = 10:20:00 - 10:00:00 = 20 minutes = 0.333 hours
Step 4: Calculate required speed
Speed = Distance / Time = 5,581 km / 0.333 hours = 16,743 km/h
Step 5: Compare to threshold
Commercial aircraft: ~900 km/h
Supersonic jet: ~2,200 km/h
"Impossible" threshold: 1,500 km/h (generous buffer)
16,743 km/h >> 1,500 km/h
VERDICT: IMPOSSIBLE TRAVEL DETECTED
ACTION: Revoke session, require re-authentication
+------------------------------------------------------------------+
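The full detection pipeline above (geolocate, distance, time delta, speed, threshold) can be sketched end to end. The names haversine_km and is_impossible_travel, and the one-minute clamp on the time gap (a guard against clock skew), are illustrative choices, not a prescribed API:

```python
import math

EARTH_RADIUS_KM = 6371
IMPOSSIBLE_SPEED_KMH = 1500  # generous buffer over any commercial flight

def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in kilometers between two points in degrees."""
    lat1_rad, lat2_rad = math.radians(lat1), math.radians(lat2)
    dlat = math.radians(lat2 - lat1)
    dlon = math.radians(lon2 - lon1)
    a = (math.sin(dlat / 2) ** 2 +
         math.cos(lat1_rad) * math.cos(lat2_rad) * math.sin(dlon / 2) ** 2)
    return EARTH_RADIUS_KM * 2 * math.asin(math.sqrt(a))

def is_impossible_travel(lat1, lon1, t1, lat2, lon2, t2,
                         threshold_kmh=IMPOSSIBLE_SPEED_KMH):
    """Return (flagged, speed_kmh) for two sightings with Unix timestamps."""
    hours = abs(t2 - t1) / 3600
    hours = max(hours, 1 / 60)  # clamp sub-minute gaps (clock skew guard)
    speed = haversine_km(lat1, lon1, lat2, lon2) / hours
    return speed > threshold_kmh, speed

# NYC at 10:00:00, London at 10:20:00 -> roughly 16,700 km/h, flagged
flagged, speed = is_impossible_travel(40.7128, -74.0060, 0,
                                      51.5074, -0.1278, 20 * 60)
```

A legitimate NYC-to-Boston trip (~306 km over two hours) comes out around 150 km/h and passes cleanly under the same threshold.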
| THRESHOLD CONSIDERATIONS |
+------------------------------------------------------------------+
Conservative (Fewer False Positives):
Speed > 2,500 km/h = Impossible
Rationale: Accounts for supersonic travel, timing errors
Aggressive (More Security):
Speed > 1,000 km/h = Suspicious
Rationale: Commercial flights average 800-900 km/h
Adaptive (Best Practice):
Calculate based on:
- User's typical travel patterns
- Known airports near each location
- Time of day (red-eye flights vs business hours)
- Account sensitivity level
+------------------------------------------------------------------+
Edge Cases to Handle:
+------------------------------------------------------------------+
| IMPOSSIBLE TRAVEL: EDGE CASES |
+------------------------------------------------------------------+
CASE 1: VPN Usage
=================
User connects to VPN, appears to "travel" instantly.
Solution:
- Maintain list of known VPN/proxy IP ranges
- Flag VPN traffic but don't immediately alert
- Use device fingerprint as secondary signal
- Allow user to register "I use VPN" preference
CASE 2: Mobile Network Handoff
==============================
Mobile carrier assigns different exit IPs that geolocate differently.
Solution:
- Use ASN (Autonomous System Number) grouping
- Same carrier ASN = same logical location
- Set minimum distance threshold (e.g., >500km to alert)
CASE 3: Shared IP (NAT/CGNAT)
=============================
Multiple users share the same public IP (corporate NAT, CGNAT).
Solution:
- Combine with device fingerprint
- Use behavioral signals (typing patterns, request patterns)
- Maintain "known office IP" allowlist
CASE 4: Legitimate Fast Travel
==============================
User genuinely flies from NYC to Boston (1 hour flight).
Solution:
- Speed threshold must account for fastest commercial travel
- Use 1,500 km/h as baseline (faster than any commercial flight)
- Consider "airport proximity" in calculations
CASE 5: Clock Synchronization
=============================
Log timestamps from different servers have clock skew.
Solution:
- Use centralized time service (NTP synchronized)
- Include microsecond precision in timestamps
- Require minimum time difference (e.g., 1 minute) before analysis
+------------------------------------------------------------------+
Statistical Anomaly Detection
Building a baseline of "normal" behavior requires understanding statistical concepts. Here's how to detect when behavior deviates significantly from the norm.
Standard Deviation and Z-Scores:
+------------------------------------------------------------------+
| STATISTICAL BASELINE BUILDING |
+------------------------------------------------------------------+
THE CONCEPT:
============
Normal behavior clusters around an average (mean).
The spread of that behavior is the standard deviation.
A z-score tells us how many standard deviations from the mean a
value is.
value - mean
z-score = ---------------
standard_deviation
If |z-score| > 2: Value is unusual (outside ~95% of a normal distribution)
If |z-score| > 3: Value is extremely unusual (outside ~99.7%)
EXAMPLE: Login Hour Detection
=============================
Alice's login history (30 days):
Hours: [9, 9, 10, 9, 8, 9, 10, 9, 9, 10, 8, 9, 9, 10, 9,
9, 8, 9, 10, 9, 9, 9, 10, 8, 9, 9, 10, 9, 9, 9]
Calculate mean (average):
mean = sum(hours) / count = 273 / 30 = 9.1
Calculate standard deviation:
variance = sum((x - mean)^2) / count
variance = 10.7 / 30 = 0.36
std_dev = sqrt(0.36) = 0.60
Now evaluate new logins:
Login at 9:00 AM:
z-score = (9 - 9.1) / 0.60 = -0.17
Interpretation: Perfectly normal
Login at 10:00 AM:
z-score = (10 - 9.1) / 0.60 = 1.5
Interpretation: Slightly unusual but acceptable
Login at 3:00 AM:
z-score = (3 - 9.1) / 0.60 = -10.2
Interpretation: EXTREMELY UNUSUAL (10+ standard deviations!)
Action: Trigger MFA step-up
+------------------------------------------------------------------+
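The baseline math above maps directly onto Python's standard library: statistics.pstdev is the population standard deviation the variance formula describes. zscore is an illustrative helper name, fed with Alice's 30 days of login hours from the example:

```python
import statistics

def zscore(value: float, history: list) -> float:
    """Standard deviations between `value` and the mean of `history`."""
    mean = statistics.fmean(history)
    std = statistics.pstdev(history)  # population std dev, as in the text
    return 0.0 if std == 0 else (value - mean) / std

# Alice's 30 days of login hours
hours = [9, 9, 10, 9, 8, 9, 10, 9, 9, 10, 8, 9, 9, 10, 9,
         9, 8, 9, 10, 9, 9, 9, 10, 8, 9, 9, 10, 9, 9, 9]
```

A 9 AM login scores near zero, a 10 AM login stays under two standard deviations, and a 3 AM login lands far past any reasonable threshold.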
Exponential Moving Average (EMA):
For behavior that changes over time, use an exponential moving average to weight recent behavior more heavily:
+------------------------------------------------------------------+
| EXPONENTIAL MOVING AVERAGE (EMA) |
+------------------------------------------------------------------+
THE FORMULA:
============
EMA_today = (value_today * alpha) + (EMA_yesterday * (1 - alpha))
Where alpha (smoothing factor) = 2 / (N + 1)
N = number of periods (e.g., 30 days)
alpha for 30-day EMA = 2/31 = 0.0645
WHY EMA vs SIMPLE AVERAGE:
==========================
Simple Average: All historical data weighted equally
- User's behavior from 6 months ago = today's behavior
- Doesn't adapt well to legitimate changes
Exponential Moving Average: Recent data weighted more heavily
- Yesterday's behavior matters more than last month's
- Naturally adapts as user's patterns evolve
- Older data "decays" exponentially
EXAMPLE: Request Rate Baseline
==============================
Day 1: Requests = 50 -> EMA = 50 (initialization)
Day 2: Requests = 60 -> EMA = 60*0.065 + 50*0.935 = 50.6
Day 3: Requests = 55 -> EMA = 55*0.065 + 50.6*0.935 = 50.9
...
Day 30: Requests = 80 -> EMA = ~65
If Day 31 shows 500 requests:
z-score = (500 - 65) / baseline_std_dev = Very High
Action: Alert - unusual activity volume
+------------------------------------------------------------------+
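One EMA update step, as a minimal sketch of the formula above (ema_update is an illustrative name; the loop replays Days 2 and 3 of the walkthrough):

```python
def ema_update(prev_ema: float, value: float, n_periods: int = 30) -> float:
    """One EMA step: alpha = 2 / (N + 1) weights recent observations more."""
    alpha = 2 / (n_periods + 1)
    return value * alpha + prev_ema * (1 - alpha)

ema = 50.0                  # Day 1: initialize with the first observation
for requests in (60, 55):   # Days 2 and 3 from the walkthrough
    ema = ema_update(ema, requests)
```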
Time-Based Patterns:
+------------------------------------------------------------------+
| TIME-BASED BEHAVIORAL PATTERNS |
+------------------------------------------------------------------+
PATTERN 1: Hour of Day
======================
Track when user typically accesses the system.
Hour | Frequency | Probability
------|-----------|------------
08:00 | 12 | 0.02
09:00 | 85 | 0.17
10:00 | 95 | 0.19
11:00 | 78 | 0.16
... | ... | ...
03:00 | 0 | 0.00
Request at 3 AM: Probability = 0.00
Action: High-risk flag
PATTERN 2: Day of Week
======================
Track weekday vs weekend behavior.
Day | Requests | Typical
----------|----------|--------
Monday | 120 | Yes
Tuesday | 115 | Yes
Wednesday | 118 | Yes
Thursday | 122 | Yes
Friday | 110 | Yes
Saturday | 2 | No
Sunday | 0 | No
Request on Sunday: Atypical
Combine with other signals for risk score
PATTERN 3: Request Velocity
===========================
Track how quickly requests come in bursts.
Normal pattern:
09:00-09:05: 5 requests
09:05-09:10: 8 requests
09:10-09:15: 4 requests
Average: 5.7 requests per 5 minutes
Anomalous pattern:
09:00-09:05: 150 requests <- Automated script?
z-score = (150 - 5.7) / 2.1 = 68.7
Action: Rate limit + alert
+------------------------------------------------------------------+
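Pattern 1 reduces to an empirical hour-of-day histogram. The sketch below uses invented sample counts loosely shaped like the table (not its exact figures); hour_probabilities is an illustrative name:

```python
from collections import Counter

def hour_probabilities(login_hours: list) -> dict:
    """Empirical probability of a login at each hour of day (0-23)."""
    counts = Counter(login_hours)
    total = len(login_hours)
    return {hour: counts[hour] / total for hour in range(24)}

# Invented sample history concentrated in business hours
history = [8] * 12 + [9] * 85 + [10] * 95 + [11] * 78
probs = hour_probabilities(history)
# probs[3] is 0.0: a 3 AM request has no precedent -> high-risk flag
```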
Trust Score Calculation
A composite trust score combines multiple signals into a single value that the Policy Decision Point (PDP) can use for access decisions.
+------------------------------------------------------------------+
| TRUST SCORE ARCHITECTURE |
+------------------------------------------------------------------+
SCORE COMPONENTS:
=================
1. LOCATION SCORE (0-100)
- Same city as usual: 100
- Same country, different city: 80
- Different country, expected travel: 60
- Different country, unexpected: 20
- Impossible travel detected: 0
2. TEMPORAL SCORE (0-100)
- Normal working hours: 100
- Off-hours but precedent exists: 70
- First time at this hour: 40
- 3 AM on a holiday: 10
3. DEVICE SCORE (0-100)
- Known device, healthy: 100
- Known device, unhealthy: 50
- Unknown device, expected type: 40
- Unknown device, unexpected type: 20
4. BEHAVIORAL SCORE (0-100)
- Request pattern matches baseline: 100
- Slightly elevated activity: 80
- Unusual resources accessed: 50
- Massive deviation from norm: 10
COMPOSITE CALCULATION:
======================
Method 1: Weighted Average
--------------------------
trust_score = (
location_score * 0.30 +
temporal_score * 0.20 +
device_score * 0.25 +
behavioral_score * 0.25
)
Method 2: Minimum Signal (More Secure)
--------------------------------------
trust_score = min(location, temporal, device, behavioral)
Rationale: One critical failure = high risk
Method 3: Multiplicative (Most Sensitive)
-----------------------------------------
trust_score = (loc/100) * (temp/100) * (dev/100) * (behav/100) * 100
Effect: Any low score dramatically reduces final score
POLICY ACTIONS BASED ON SCORE:
==============================
Trust Score | Action
------------|--------------------------------------------
90-100 | Full access, no additional verification
70-89 | Access granted, activity logged
50-69 | Limited access, step-up MFA for sensitive ops
30-49 | Read-only access, alert security team
0-29 | Access denied, session terminated, MFA required
+------------------------------------------------------------------+
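All three composite methods fit in one small sketch; composite_trust and its method strings are illustrative names, with the weights taken from Method 1 above:

```python
def composite_trust(location: float, temporal: float,
                    device: float, behavioral: float,
                    method: str = "weighted") -> float:
    """Combine four 0-100 signal scores using one of the three methods."""
    scores = (location, temporal, device, behavioral)
    if method == "weighted":
        weights = (0.30, 0.20, 0.25, 0.25)
        return sum(s * w for s, w in zip(scores, weights))
    if method == "minimum":
        return min(scores)  # one critical failure = high risk
    if method == "multiplicative":
        result = 100.0
        for s in scores:
            result *= s / 100  # any low score drags the product down
        return result
    raise ValueError(f"unknown method: {method}")
```

With impossible travel zeroing the location score, the weighted method still returns 70 (logged access), while the minimum and multiplicative methods drop straight to 0: a concrete illustration of why the stricter methods suit critical signals.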
Score Decay Over Time:
Trust should decay when there's no activity to verify:
+------------------------------------------------------------------+
| TRUST SCORE DECAY |
+------------------------------------------------------------------+
CONCEPT:
========
The longer a session is idle, the less confident we are that the
original user is still in control.
Decay Formula (Exponential):
current_score = initial_score * e^(-lambda * time_since_activity)
Where lambda = decay rate constant
Example with lambda = 0.01 per minute:
Time Idle | Trust Score (started at 100)
----------|------------------------------
0 min | 100
5 min | 95
15 min | 86
30 min | 74
60 min | 55
120 min | 30
At 120 minutes idle: Force re-authentication
IMPLEMENTATION:
===============
import math
import time

class TrustScore:
    def __init__(self, initial_score: float, decay_rate: float = 0.01):
        self.initial_score = initial_score
        self.decay_rate = decay_rate
        self.last_activity = time.time()

    def get_current_score(self) -> float:
        time_idle = time.time() - self.last_activity
        minutes_idle = time_idle / 60
        decay_factor = math.exp(-self.decay_rate * minutes_idle)
        return self.initial_score * decay_factor

    def refresh(self, new_score: float):
        self.initial_score = new_score
        self.last_activity = time.time()
+------------------------------------------------------------------+
Real-Time Event Processing
Continuous authentication requires processing log events in real-time, not in batch. Here are the key architectural patterns:
+------------------------------------------------------------------+
| REAL-TIME EVENT PROCESSING ARCHITECTURE |
+------------------------------------------------------------------+
STREAM PROCESSING CONCEPTS:
===========================
Unlike batch processing (process all logs nightly), stream processing
handles events as they arrive, with sub-second latency.
Key Components:
1. EVENT PRODUCER
- PEP proxies emit access logs
- Each log entry = one event
- Format: JSON with timestamp, user, IP, resource, action
2. MESSAGE QUEUE / EVENT BUS
- Kafka, Redis Streams, RabbitMQ
- Decouples producers from consumers
- Provides durability and replay capability
3. STREAM PROCESSOR
- Consumes events in real-time
- Maintains state (user baselines, current sessions)
- Emits alerts and revocation signals
4. SINK / ACTION LAYER
- Writes alerts to database
- Publishes revocation to PEPs
- Updates dashboards
ARCHITECTURE DIAGRAM:
=====================
+--------+ +--------+ +--------+
| PEP 1 | | PEP 2 | | PEP 3 | (Policy Enforcement Points)
+---+----+ +---+----+ +---+----+
| | |
v v v
+----------------------------------+
| MESSAGE QUEUE | (Redis Streams / Kafka)
| "access-log" topic/stream |
+----------------+-----------------+
|
v
+----------------------------------+
| CONTINUOUS AUTH MONITOR | (This Project)
| |
| +----------------------------+ |
| | Event Parser | |
| +-------------+--------------+ |
| | |
| +-------------v--------------+ |
| | User Baseline Store | | (Per-user behavioral models)
| +-------------+--------------+ |
| | |
| +-------------v--------------+ |
| | Anomaly Detector | | (Z-scores, impossible travel)
| +-------------+--------------+ |
| | |
| +-------------v--------------+ |
| | Trust Score Calculator | |
| +-------------+--------------+ |
| | |
+----------------+-----------------+
|
+--------+--------+
| |
v v
+---------------+ +---------------+
| Dashboard | | Revocation | (Real-time UI + PEP signals)
| (WebSocket) | | Publisher |
+---------------+ +-------+-------+
|
+----------------+----------------+
v v v
+--------+ +--------+ +--------+
| PEP 1 | | PEP 2 | | PEP 3 |
+--------+ +--------+ +--------+
+------------------------------------------------------------------+
Event-Driven Architecture:
+------------------------------------------------------------------+
| PUB/SUB FOR REVOCATION |
+------------------------------------------------------------------+
PROBLEM:
========
When impossible travel is detected, we must revoke the session
across ALL PEP proxies within seconds.
Option 1: Poll-Based (Slow)
---------------------------
PEPs periodically check: "Is this token still valid?"
- Poll interval: 30 seconds
- Worst case: 30 second window for attacker
Option 2: Push-Based (Fast) - Preferred
---------------------------------------
Monitor publishes: "REVOKE user:alice session:XYZ"
All PEPs receive immediately and terminate session.
- Latency: < 100ms
- No polling overhead
IMPLEMENTATION WITH REDIS PUB/SUB:
==================================
Monitor (Publisher):
-------------------
def revoke_user_session(user_id: str, session_id: str, reason: str):
    message = {
        "action": "REVOKE",
        "user_id": user_id,
        "session_id": session_id,
        "reason": reason,
        "timestamp": time.time()
    }
    redis_client.publish("session-revocations", json.dumps(message))

PEP (Subscriber):
-----------------
def handle_revocation(message):
    data = json.loads(message)
    if data["action"] == "REVOKE":
        session_store.invalidate(
            user_id=data["user_id"],
            session_id=data["session_id"]
        )
        log.warning(f"Session revoked: {data['reason']}")

# Subscribe to channel
pubsub = redis_client.pubsub()
pubsub.subscribe("session-revocations")
for message in pubsub.listen():
    if message["type"] == "message":
        handle_revocation(message["data"])
FAILSAFE: SHORT-LIVED TOKENS
============================
Even with push revocation, use short-lived tokens (5-15 minutes)
as defense in depth. If push fails, token expires quickly.
+------------------------------------------------------------------+
Token Revocation Strategies
When the monitor detects an anomaly, it must revoke the compromised token. Here are the strategies:
+------------------------------------------------------------------+
| TOKEN REVOCATION STRATEGIES |
+------------------------------------------------------------------+
STRATEGY 1: Token Blacklist (Redis-Based)
=========================================
How it works:
- Maintain a set of revoked token IDs (JTI claims)
- On each request, PEP checks: Is this JTI in the blacklist?
- TTL on blacklist entries matches token lifetime
Redis Implementation:
# Revoke a token
SADD revoked_tokens:<user_id> <token_jti>
EXPIRE revoked_tokens:<user_id> 3600 # 1 hour TTL
# Check if revoked
SISMEMBER revoked_tokens:<user_id> <token_jti>
Pros:
+ Simple to implement
+ Fast O(1) lookup
+ Distributed across PEPs
Cons:
- Requires network call on every request
- Blacklist can grow large
STRATEGY 2: Short-Lived Tokens with Refresh
============================================
How it works:
- Access tokens live only 5-15 minutes
- Refresh tokens live longer (hours/days)
- On revocation: invalidate refresh token
- Access token expires naturally
Implementation:
Access Token: { exp: now + 15 minutes, ... }
Refresh Token: { jti: "refresh-xyz", exp: now + 24 hours }
# To revoke: Add refresh token to blacklist
SADD revoked_refresh_tokens <refresh_token_jti>
Pros:
+ Small blast radius (15 min max)
+ Fewer blacklist entries (only refresh tokens)
+ Can skip blacklist check for access tokens
Cons:
- 15 minute window of exposure
- More token refresh traffic
STRATEGY 3: Pushed Invalidation (Recommended for ZTA)
=====================================================
How it works:
- On revocation: Publish message to all PEPs immediately
- PEPs maintain local cache of revoked sessions
- No centralized check needed after initial push
Implementation:
Monitor:
redis.publish("revocations", {
"user": "alice",
"session": "sess-123",
"reason": "impossible_travel"
})
PEP:
on_revocation_message(msg):
    local_revocation_cache.add(msg.session)

on_request(token):
    if token.session_id in local_revocation_cache:
        return DENY
    # Continue normal validation
Pros:
+ Near-instant revocation (< 100ms)
+ No centralized bottleneck
+ PEPs remain autonomous
Cons:
- Requires reliable message delivery
- PEPs must subscribe to revocation channel
STRATEGY 4: JWT `jti` Claim Tracking
====================================
Every JWT should include a unique identifier (jti claim):
{
"sub": "alice@example.com",
"jti": "550e8400-e29b-41d4-a716-446655440000",
"exp": 1703980800,
"iat": 1703977200
}
This enables:
- Per-token revocation (not per-user)
- Audit trail of specific tokens
- Replay attack prevention
+------------------------------------------------------------------+
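Strategy 1 can be sketched in memory to make the TTL semantics concrete; a production deployment would use the shared Redis SADD/EXPIRE commands shown above so all PEPs see the same blacklist. TokenBlacklist is an illustrative name:

```python
import time

class TokenBlacklist:
    """In-memory sketch of Strategy 1; production would use shared Redis."""

    def __init__(self):
        self._revoked = {}  # token jti -> blacklist entry expiry (epoch secs)

    def revoke(self, jti: str, ttl_seconds: float = 3600) -> None:
        # TTL should match the remaining token lifetime
        self._revoked[jti] = time.time() + ttl_seconds

    def is_revoked(self, jti: str) -> bool:
        expiry = self._revoked.get(jti)
        if expiry is None:
            return False
        if time.time() >= expiry:
            del self._revoked[jti]  # entry outlived the token; drop it
            return False
        return True
```

Expiring blacklist entries alongside the tokens they shadow keeps the set from growing without bound, which is the main operational cost of this strategy.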
IP Geolocation
Converting IP addresses to geographic coordinates is fundamental to impossible travel detection.
+------------------------------------------------------------------+
| IP GEOLOCATION |
+------------------------------------------------------------------+
HOW IT WORKS:
=============
1. IP Address Registration
- IANA allocates IP blocks to Regional Internet Registries (RIRs)
- RIRs allocate to ISPs
- ISPs assign to customers in specific regions
2. Geolocation Databases
- Companies like MaxMind map IP ranges to locations
- Based on: ISP registration data, user-submitted data,
network topology analysis
3. Lookup Process
- Input: IP address (e.g., 8.8.8.8)
- Output: Latitude, Longitude, City, Country, ASN
MAXMIND GEOLITE2 DATABASE:
==========================
Free database with city-level accuracy.
Installation (Python):
pip install geoip2
# Download GeoLite2-City.mmdb from MaxMind
Usage:
import geoip2.database
reader = geoip2.database.Reader('GeoLite2-City.mmdb')
response = reader.city('128.101.101.101')
print(response.city.name) # Minneapolis
print(response.country.name) # United States
print(response.location.latitude) # 44.9778
print(response.location.longitude) # -93.2650
print(response.location.accuracy_radius) # 20 (km)
ACCURACY CONSIDERATIONS:
========================
Accuracy varies significantly:
Location Type | Typical Accuracy
-------------------|------------------
Home ISP | City-level (~50km)
Mobile carrier | Region-level (~100km)
Corporate network | Building-level (if registered)
VPN/Proxy | Exit node location (not user)
Tor | Random exit node
CGNAT | ISP hub location (not user)
HANDLING INACCURACY:
====================
1. Use accuracy_radius from database
- If accuracy_radius > 100km, reduce confidence
- Don't trigger impossible travel if both locations uncertain
2. Buffer distance calculations
- NYC to Boston = 340km
- If accuracy_radius = 50km each, effective minimum = 240km
- Use: effective_distance = distance - (radius1 + radius2)
3. Flag VPN/proxy IPs
- Maintain list of known VPN provider IP ranges
- Don't use these for location-based decisions
- Fall back to other signals (device, behavior)
+------------------------------------------------------------------+
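The distance-buffering rule from point 2 is one line of arithmetic; effective_distance_km is a hypothetical helper name:

```python
def effective_distance_km(distance_km: float,
                          radius_a_km: float, radius_b_km: float) -> float:
    """Shrink a measured distance by both accuracy radii; clamp at zero."""
    return max(0.0, distance_km - (radius_a_km + radius_b_km))

# NYC -> Boston measured at 340 km with 50 km uncertainty on each end
buffered = effective_distance_km(340, 50, 50)  # 240 km
```

Clamping at zero matters: two uncertain geolocations of the same physical place should never produce a negative distance that confuses the speed calculation.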
| PRIVACY CONSIDERATIONS |
+------------------------------------------------------------------+
IP geolocation involves privacy-sensitive data:
1. DATA RETENTION
- Don't store raw IPs longer than necessary
- Hash or anonymize for long-term analytics
- Follow GDPR/CCPA requirements
2. USER TRANSPARENCY
- Inform users their location is monitored
- Provide mechanism to report false positives
- Allow users to pre-register travel
3. GRANULARITY
- Store city/region, not exact coordinates
- Precision for security, not surveillance
+------------------------------------------------------------------+
References to Books in User's Library
The following books provide deeper theoretical foundations:
"Security in Computing" by Charles Pfleeger - Chapter 7: Network Security
- Authentication protocols and their weaknesses
- Session management and replay attacks
- Understanding why continuous verification matters
"Designing Data-Intensive Applications" by Martin Kleppmann - Chapter 11: Stream Processing
- Event streaming architectures
- Exactly-once processing semantics
- Time handling in distributed systems
- Why batch processing is insufficient for security
"Foundations of Information Security" by Jason Andress - Chapter 8: Intrusion Detection
- Signature vs anomaly-based detection
- False positive and false negative tradeoffs
- Building detection systems that scale
Complete Project Specification
Functional Requirements
| ID | Requirement | Priority | Acceptance Criteria |
|---|---|---|---|
| FR-1 | Ingest access logs in real-time | P0 | Process logs within 1 second of generation |
| FR-2 | Detect impossible travel | P0 | Flag when travel speed > 1500 km/h |
| FR-3 | Build user behavioral baselines | P0 | Track login times, locations, request patterns per user |
| FR-4 | Calculate continuous trust scores | P0 | Composite score from location, time, device, behavior |
| FR-5 | Trigger session revocation | P0 | Publish revocation signal within 100ms of detection |
| FR-6 | Expose real-time dashboard | P1 | WebSocket-based trust score visualization |
| FR-7 | Handle VPN/proxy edge cases | P1 | Flag but don't auto-revoke for known VPN IPs |
| FR-8 | Support MFA step-up triggers | P1 | Return step-up recommendation for medium-risk scores |
| FR-9 | Persist audit trail | P2 | Store all detections for forensic analysis |
| FR-10 | Configure thresholds | P2 | Admin API to adjust detection sensitivity |
Non-Functional Requirements
| ID | Requirement | Target | Rationale |
|---|---|---|---|
| NFR-1 | Detection latency | < 2 seconds | Minimize attacker window |
| NFR-2 | Revocation propagation | < 100ms | All PEPs must receive signal fast |
| NFR-3 | False positive rate | < 1% | Avoid user frustration |
| NFR-4 | Throughput | 10,000 events/sec | Handle enterprise scale |
| NFR-5 | Availability | 99.9% | Security system must be reliable |
| NFR-6 | Baseline convergence | < 7 days | Learn user patterns within a week |
Log Event Schema
{
"event_id": "evt-550e8400-e29b-41d4",
"timestamp": "2024-12-27T10:15:00.123Z",
"user_id": "alice@example.com",
"session_id": "sess-4412-XA",
"token_jti": "jwt-889923",
"source_ip": "203.0.113.45",
"user_agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7)",
"request": {
"method": "GET",
"path": "/api/v2/sensitive-data",
"query_params": {},
"body_size_bytes": 0
},
"response": {
"status_code": 200,
"body_size_bytes": 4532
},
"pep_id": "proxy-east-1",
"device_fingerprint": "fp-abc123"
}
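Before wiring in a full validation library, the required-field check for this schema can be sketched with the stdlib alone (the real implementation can use pydantic, which is in the dependency list; the subset of required fields below is chosen for illustration):

```python
# Sketch: stdlib-only validation of an access-log event.
# The required-field subset below is illustrative, not the full schema.
from datetime import datetime

REQUIRED_FIELDS = {"event_id", "timestamp", "user_id", "session_id", "source_ip"}

def parse_event(raw: dict) -> dict:
    missing = REQUIRED_FIELDS - raw.keys()
    if missing:
        raise ValueError(f"missing required fields: {sorted(missing)}")
    event = dict(raw)
    # Normalize the RFC 3339 "Z" suffix so fromisoformat() accepts it
    event["timestamp"] = datetime.fromisoformat(
        raw["timestamp"].replace("Z", "+00:00")
    )
    return event

sample = {
    "event_id": "evt-550e8400-e29b-41d4",
    "timestamp": "2024-12-27T10:15:00.123Z",
    "user_id": "alice@example.com",
    "session_id": "sess-4412-XA",
    "source_ip": "203.0.113.45",
}
parsed = parse_event(sample)
```

Rejecting malformed events at the door keeps the downstream detectors free of defensive None-checks.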
Alert Event Schema
{
"alert_id": "alert-123456",
"timestamp": "2024-12-27T10:20:00.456Z",
"user_id": "alice@example.com",
"alert_type": "impossible_travel",
"severity": "critical",
"details": {
"location_a": {
"ip": "203.0.113.45",
"city": "New York",
"country": "US",
"coordinates": [40.7128, -74.0060]
},
"location_b": {
"ip": "185.34.22.11",
"city": "London",
"country": "GB",
"coordinates": [51.5074, -0.1278]
},
"time_difference_seconds": 1200,
"distance_km": 5581,
"required_speed_kmh": 16743
},
"trust_score_before": 100,
"trust_score_after": 0,
"action_taken": "session_revoked",
"session_id": "sess-4412-XA"
}
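You can sanity-check the alert's numbers directly: required speed is just distance divided by elapsed time.

```python
# Verifying the alert fields above: 5,581 km in 1,200 seconds (20 minutes)
distance_km = 5581
time_difference_seconds = 1200
required_speed_kmh = distance_km / (time_difference_seconds / 3600)
# -> 16743.0 km/h, matching required_speed_kmh in the schema
```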
Architecture Diagram
+------------------------------------------------------------------+
| CONTINUOUS AUTHENTICATION MONITOR |
+------------------------------------------------------------------+
POLICY ENFORCEMENT POINTS
+--------+ +--------+ +--------+ +--------+ +--------+
| PEP 1 | | PEP 2 | | PEP 3 | | PEP 4 | | PEP 5 |
+---+----+ +---+----+ +---+----+ +---+----+ +---+----+
| | | | |
| Access Logs (JSON over Redis Streams) |
v v v v v
+------------------------------------------------------------------+
| REDIS STREAMS |
| "access-events" stream |
+--------------------------------+---------------------------------+
|
v
+------------------------------------------------------------------+
| CONTINUOUS AUTH MONITOR (Python) |
| |
| +------------------------+ +--------------------------+ |
| | Stream Consumer | | Baseline Store | |
| | (asyncio/aioredis) | | (Redis Hash per user) | |
| +------------+-----------+ +-----------+--------------+ |
| | | |
| v v |
| +------------------------+ +--------------------------+ |
| | Event Parser |--->| Baseline Updater | |
| | - Extract user, IP | | - EMA calculations | |
| | - Geolocate IP | | - Time pattern updates | |
| +------------+-----------+ +--------------------------+ |
| | |
| v |
| +------------------------+ |
| | Anomaly Detectors | |
| | +------------------+ | |
| | | Impossible Travel| | |
| | +------------------+ | |
| | | Temporal Anomaly | | |
| | +------------------+ | |
| | | Behavior Anomaly | | |
| | +------------------+ | |
| +------------+-----------+ |
| | |
| v |
| +------------------------+ |
| | Trust Score Engine | |
| | - Weighted composite | |
| | - Score decay | |
| +------------+-----------+ |
| | |
| +-------+-------+ |
| | | |
| v v |
| +---------+ +------------+ |
| | Alert | | Revocation | |
| | Writer | | Publisher | |
| +---------+ +-----+------+ |
| | |
+------------------------+------------------------------------------+
|
| Redis Pub/Sub "session-revocations"
v
+--------+ +--------+ +--------+ +--------+ +--------+
| PEP 1 | | PEP 2 | | PEP 3 | | PEP 4 | | PEP 5 |
+--------+ +--------+ +--------+ +--------+ +--------+
(All PEPs subscribe and instantly terminate revoked sessions)
+------------------------------------------------------------------+
| SUPPORTING SYSTEMS |
+------------------------------------------------------------------+
+----------------+ +----------------+ +----------------+
| GeoIP DB | | PostgreSQL | | Dashboard UI |
| (MaxMind) | | (Audit Logs) | | (React/WS) |
+----------------+ +----------------+ +----------------+
Real World Outcome
When you complete this project, you will have an intelligent security monitor that detects account takeover in real-time. Here is exactly what you will see:
Starting the Monitor
$ ./zta-monitor --config /etc/zta-monitor/config.yaml
+------------------------------------------------------------------+
| ZERO TRUST CONTINUOUS AUTHENTICATION MONITOR |
+------------------------------------------------------------------+
[INFO] 2024-12-27T10:00:00Z Monitor v1.0.0 starting...
[INFO] 2024-12-27T10:00:00Z Connecting to Redis at localhost:6379
[INFO] 2024-12-27T10:00:00Z Loading GeoIP database: /data/GeoLite2-City.mmdb
[INFO] 2024-12-27T10:00:01Z GeoIP database loaded (3.2M entries)
[INFO] 2024-12-27T10:00:01Z Subscribing to stream: access-events
[INFO] 2024-12-27T10:00:01Z Publishing revocations to: session-revocations
[INFO] 2024-12-27T10:00:01Z Dashboard WebSocket on port 8080
[INFO] 2024-12-27T10:00:01Z Ready. Monitoring for anomalies...
[INFO] 2024-12-27T10:00:01Z Loaded baselines for 50 users from cache.
Normal Activity
[LOG] 2024-12-27T10:05:00Z User: alice@corp.com | IP: 203.0.113.45 (NYC) | Trust: 100
[LOG] 2024-12-27T10:05:05Z User: alice@corp.com | IP: 203.0.113.45 (NYC) | Trust: 100
[LOG] 2024-12-27T10:05:10Z User: bob@corp.com | IP: 198.51.100.22 (LA) | Trust: 100
[LOG] 2024-12-27T10:05:15Z User: alice@corp.com | IP: 203.0.113.45 (NYC) | Trust: 100
[BASELINE] alice@corp.com: Updated location baseline (NYC confirmed)
[BASELINE] alice@corp.com: Updated temporal baseline (10:00-11:00 typical)
Anomaly Detection: Impossible Travel
[LOG] 2024-12-27T10:20:00Z User: alice@corp.com | IP: 185.34.22.11 (LONDON)
+------------------------------------------------------------------+
| CRITICAL ALERT: IMPOSSIBLE TRAVEL |
+------------------------------------------------------------------+
| |
| User: alice@corp.com |
| Session ID: sess-4412-XA |
| |
| Location A: New York, US (203.0.113.45) |
| Time A: 2024-12-27T10:05:15Z |
| |
| Location B: London, GB (185.34.22.11) |
| Time B: 2024-12-27T10:20:00Z |
| |
| Time Elapsed: 15 minutes |
| Distance: 5,581 km |
| Required Speed: 22,324 km/h |
| |
| Verdict: PHYSICALLY IMPOSSIBLE |
| (Max feasible: 1,500 km/h) |
| |
| Trust Score: 100 -> 0 |
| |
+------------------------------------------------------------------+
[ACTION] 2024-12-27T10:20:00.052Z Publishing revocation for alice@corp.com (sess-4412-XA)
[ACTION] 2024-12-27T10:20:00.053Z Revocation published to 'session-revocations' channel
[AUDIT] 2024-12-27T10:20:00.055Z Alert written to PostgreSQL (alert-123456)
PEP Response (In Another Terminal)
# Terminal 2: PEP Proxy Log
$ ./zta-proxy --log-level info
[INFO] 2024-12-27T10:00:00Z PEP Proxy listening on :8080
[INFO] 2024-12-27T10:00:00Z Subscribed to 'session-revocations' channel
[INFO] 2024-12-27T10:05:15Z ALLOW alice@corp.com GET /api/data (Trust: 100)
[INFO] 2024-12-27T10:10:22Z ALLOW alice@corp.com GET /api/users (Trust: 100)
[REVOCATION] 2024-12-27T10:20:00.053Z Received GLOBAL_REVOKE:
User: alice@corp.com
Session: sess-4412-XA
Reason: Impossible travel detected
[INFO] 2024-12-27T10:20:00.054Z Session terminated: sess-4412-XA
[INFO] 2024-12-27T10:20:00.054Z Removed from active sessions cache
Attacker Experience
# Terminal 3: Attacker's request (using stolen token from London)
$ curl -i -H "Authorization: Bearer eyJhbG..." http://api.corp.com/data
HTTP/1.1 401 Unauthorized
Content-Type: application/json
X-ZT-Reason: session_revoked
{
"error": "Session Terminated",
"reason": "Suspicious login activity detected (Impossible Travel).",
"alert_id": "alert-123456",
"remediation": "Your session has been terminated for security. Please authenticate with MFA to restore access.",
"mfa_url": "https://auth.corp.com/mfa?user=alice@corp.com"
}
Trust Score Dashboard
$ curl http://localhost:8080/api/dashboard/sessions | jq .
{
"active_sessions": [
{
"user": "bob@corp.com",
"session_id": "sess-7721-BC",
"trust_score": 95,
"last_activity": "2024-12-27T10:19:55Z",
"location": "Los Angeles, US",
"device": "Chrome/macOS",
"status": "active"
},
{
"user": "charlie@corp.com",
"session_id": "sess-9921-DE",
"trust_score": 72,
"last_activity": "2024-12-27T10:18:30Z",
"location": "Chicago, US",
"device": "Firefox/Windows",
"status": "active",
"warnings": ["Unusual access time (outside 9-5)"]
}
],
"recent_alerts": [
{
"alert_id": "alert-123456",
"user": "alice@corp.com",
"type": "impossible_travel",
"severity": "critical",
"timestamp": "2024-12-27T10:20:00Z",
"action_taken": "session_revoked"
}
],
"statistics": {
"total_sessions_monitored": 147,
"alerts_last_hour": 1,
"average_trust_score": 94.2,
"false_positive_rate_30d": 0.3
}
}
The Core Question You're Answering
"How can I continuously verify a user's identity throughout their session, detecting anomalies that suggest account compromise or session hijacking?"
One-time login authentication creates a dangerous assumption: that whoever presented valid credentials at 9 AM is still the legitimate user at 3 PM. In reality, session tokens can be stolen, devices can be compromised, and attackers can hijack authenticated sessions within minutes of the initial login. Continuous authentication closes this gap by treating every action as an opportunity to verify identity, not through intrusive re-authentication, but through behavioral signals that are unique to each user.
Concepts You Must Understand First
Before diving into implementation, ensure you have a solid grasp of these foundational concepts:
1. User and Entity Behavior Analytics (UEBA)
UEBA is the practice of establishing behavioral baselines for users (and entities like devices, services) and detecting deviations that may indicate compromise. Unlike signature-based detection that looks for known attack patterns, UEBA asks "is this behavior normal for THIS specific user?" This personalization is what makes it powerful against novel attacks.
Key insight: Every user has a unique behavioral fingerprint - the hours they work, the resources they access, the locations they connect from, and the patterns of their requests.
2. Behavioral Baselines and Anomaly Detection
A baseline represents "normal" behavior for a user, built from historical observations. Anomaly detection identifies when current behavior deviates significantly from this baseline. The challenge is distinguishing between legitimate changes (user traveling for work) and malicious deviations (attacker using stolen credentials).
Key insight: Baselines must evolve over time to accommodate legitimate behavioral changes while remaining sensitive to sudden, unexplained shifts.
3. Impossible Travel Detection and the Haversine Formula
Impossible travel detection catches scenarios where a user appears to be in two geographically distant locations within an impossibly short time frame. The Haversine formula calculates the great-circle distance between two points on Earth's surface, which combined with elapsed time gives the required travel speed.
Key insight: A login from New York at 10:00 AM followed by a login from London at 10:15 AM requires travel at 22,000+ km/h - clearly impossible and strong evidence of credential theft.
4. Session Management and Token Lifecycle
Sessions represent authenticated user contexts, typically maintained via tokens (JWTs, session cookies). Understanding how tokens are issued, validated, refreshed, and revoked is crucial. Continuous authentication adds a new dimension: tokens that were valid can become invalid based on behavioral signals, not just expiration.
Key insight: Revocation must be fast (sub-second) and distributed across all Policy Enforcement Points simultaneously.
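On the PEP side, consuming that revocation signal can be as simple as a local set of revoked session IDs, fed by the pub/sub subscriber and consulted on every request. A minimal sketch (class and method names are illustrative):

```python
# Sketch: PEP-local revocation cache. The subscriber loop that feeds
# on_revocation_message() from Redis Pub/Sub is omitted here.
class RevocationCache:
    def __init__(self):
        self._revoked: set[str] = set()

    def on_revocation_message(self, message: dict) -> None:
        # Called by the pub/sub subscriber for each REVOKE message
        if message.get("action") == "REVOKE":
            self._revoked.add(message["session_id"])

    def is_revoked(self, session_id: str) -> bool:
        # O(1) membership check on the request hot path
        return session_id in self._revoked

cache = RevocationCache()
cache.on_revocation_message({"action": "REVOKE", "session_id": "sess-4412-XA"})
```

Because the check is a local set lookup, a PEP can reject a revoked token without adding a network round-trip to every request.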
5. Risk Scoring Algorithms
Risk scores combine multiple signals into a single value that drives access decisions. Signals include location familiarity, temporal patterns, device recognition, and behavioral consistency. The scoring algorithm must balance sensitivity (catching real attacks) with specificity (avoiding false positives).
Key insight: A single suspicious signal may reduce trust moderately, but multiple concurrent anomalies should trigger immediate revocation.
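That key insight can be made concrete: a weighted average absorbs a single mild anomaly, while a penalty for multiple concurrent low component scores collapses the composite. The weights and the 0.25 penalty factor here are illustrative, not prescribed values:

```python
# Sketch: weighted composite trust score with multi-anomaly escalation.
def composite_trust(location: int, temporal: int, behavioral: int) -> int:
    weights = {"location": 0.4, "temporal": 0.3, "behavioral": 0.3}
    score = (weights["location"] * location
             + weights["temporal"] * temporal
             + weights["behavioral"] * behavioral)
    # Count signals that are individually suspicious (below 50)
    anomalies = sum(1 for s in (location, temporal, behavioral) if s < 50)
    if anomalies >= 2:
        score *= 0.25  # multiple concurrent anomalies: collapse trust
    return round(score)
```

One suspicious signal (location 20) leaves a user in "step-up MFA" territory; two suspicious signals push the score into revocation range.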
6. Real-Time Event Processing
Continuous authentication requires processing access logs as they happen, not in nightly batches. Stream processing architectures consume events in real-time, maintain stateful computations (user baselines), and emit decisions within milliseconds. Understanding event streaming patterns (pub/sub, consumer groups) is essential.
Key insight: The latency between an attack and detection must be measured in seconds, not hours.
Questions to Guide Your Design
As you design your Continuous Authentication Monitor, work through these questions:
Event Collection Architecture
- What data points will you capture in each access log event?
- How will you ensure PEPs (Policy Enforcement Points) reliably deliver logs to your monitor?
- What happens if log delivery is delayed or out of order?
- How will you handle high-volume event streams without dropping events?
Scoring and Decision Logic
- How will you weight different signals (location, time, device, behavior) in your composite score?
- What thresholds trigger different responses (step-up MFA vs immediate revocation)?
- How should trust scores decay over time during periods of inactivity?
- How will you handle the cold-start problem for new users with no baseline?
Response Actions
- How will you propagate revocation decisions to all PEPs in sub-second time?
- What remediation options will you offer users who are flagged (MFA step-up, support contact)?
- How will you prevent alert fatigue for security teams?
- What audit trail will you maintain for compliance and forensics?
Edge Cases
- How will you handle legitimate VPN usage that changes apparent location?
- What about mobile users whose carrier assigns different IP ranges?
- How will you detect and handle coordinated attacks across multiple accounts?
- What if your monitor itself is targeted for denial of service?
Thinking Exercise
Before writing any code, work through this design exercise on paper:
Design a Behavioral Baseline on Paper
Imagine you're building a baseline for user "Alice" who works as a software engineer:
- Temporal Profile:
- What hours does Alice typically log in? Draw a histogram of login times.
- What days of the week does she work? How would you represent this?
- How would you update this histogram with new data while giving more weight to recent behavior?
- Location Profile:
- Alice primarily works from home (NYC) but visits the office (Boston) twice a month and occasionally travels. How would you represent her "normal" locations?
- What data structure captures location familiarity with confidence levels?
- How far can she be from a known location before you consider it anomalous?
- Behavioral Profile:
- Alice typically makes 50-80 API requests per hour during work hours. How would you represent this as a baseline?
- She accesses the /api/code and /api/docs endpoints most frequently. How would you track resource access patterns?
- Calculate: If her mean is 65 requests/hour with standard deviation of 12, what z-score would 200 requests/hour produce?
- Anomaly Scenarios:
- Write out the detection logic for: âAlice logs in at 3 AM from an IP in Romaniaâ
- What signals fire? What would you set her trust score to?
- What action would you take?
This paper exercise forces you to think through the data structures and algorithms before touching code.
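To check your answer to the z-score calculation in the behavioral profile above:

```python
# z-score: how many standard deviations an observation is from the mean.
def z_score(x: float, mean: float, stddev: float) -> float:
    return (x - mean) / stddev

z = z_score(200, mean=65, stddev=12)
# (200 - 65) / 12 = 11.25 -- far beyond the usual 3-sigma anomaly cutoff
```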
Hints in Layers
Use these hints progressively - try to solve problems yourself before revealing the next layer.
Hint 1: Start with the Haversine Function
Your first piece of working code should be the distance calculation. This is pure math with no external dependencies - perfect for test-driven development. Write tests first with known distances (NYC to London = ~5,580 km) and implement until tests pass.
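If you want a reference implementation to check your tests against, here is a minimal sketch (the 6,371 km mean Earth radius is a common convention; other radii shift results by a few tenths of a percent):

```python
import math

def haversine_distance(lat1: float, lon1: float,
                       lat2: float, lon2: float) -> float:
    """Great-circle distance between two (lat, lon) points in km."""
    r = 6371.0  # mean Earth radius in km
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * r * math.asin(math.sqrt(a))

# NYC -> London: roughly 5,570 km
d = haversine_distance(40.7128, -74.0060, 51.5074, -0.1278)
```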
Hint 2: Use Redis Streams for Event Ingestion
Don't build a custom message queue. Redis Streams (XREAD, XADD) provide durable, ordered event streaming with consumer groups. Your monitor can use XREAD BLOCK to wait for new events, process them, and acknowledge with XACK. This handles the "reliable delivery" problem for you.
Hint 3: Store Baselines as Redis Hashes
For each user, store their baseline as a Redis Hash (HSET/HGETALL). Keys like baseline:{user_id} with fields for last_location, hour_histogram, request_ema, etc. This gives you atomic updates and fast reads without a separate database for the hot path.
Hint 4: Use Pub/Sub for Revocation Broadcasting
When you detect an anomaly and need to revoke a session, publish to a Redis Pub/Sub channel. All PEPs subscribe to this channel and receive the revocation message within milliseconds. This is faster than polling a blacklist and scales horizontally.
Hint 5: Implement Score Components Independently First
Build and test each score component (location_score, temporal_score, behavioral_score) as independent functions before combining them. This makes debugging easier and allows you to tune each component separately. Only after each component works should you implement the composite scoring logic.
Solution Architecture
Component Breakdown
continuous-auth-monitor/
+-- main.py # Application entry point
+-- config/
| +-- config.py # Configuration loading
| +-- config.yaml # Default configuration
+-- ingestion/
| +-- stream_consumer.py # Redis Streams consumer
| +-- event_parser.py # Parse log events
+-- geolocation/
| +-- geoip_service.py # MaxMind GeoIP lookup
| +-- distance.py # Haversine calculations
+-- baseline/
| +-- store.py # User baseline storage
| +-- updater.py # EMA baseline updates
| +-- models.py # Baseline data structures
+-- detection/
| +-- impossible_travel.py # Travel speed detection
| +-- temporal_anomaly.py # Time-based detection
| +-- behavioral_anomaly.py # Request pattern detection
| +-- detector_pipeline.py # Orchestrates all detectors
+-- scoring/
| +-- trust_score.py # Composite score calculation
| +-- decay.py # Score decay over time
+-- actions/
| +-- revocation_publisher.py # Redis Pub/Sub publisher
| +-- alert_writer.py # PostgreSQL audit log
| +-- mfa_trigger.py # MFA step-up integration
+-- api/
| +-- dashboard.py # REST + WebSocket API
| +-- admin.py # Configuration endpoints
+-- tests/
+-- test_haversine.py
+-- test_baseline.py
+-- test_impossible_travel.py
+-- test_integration.py
Data Flow
+------------------------------------------------------------------+
| DETAILED DATA FLOW |
+------------------------------------------------------------------+
1. LOG EVENT ARRIVES
+----------------------------------------------------------+
| Redis Stream: XREAD BLOCK 0 STREAMS access-events $ |
+----------------------------------------------------------+
|
v
2. EVENT PARSING
+----------------------------------------------------------+
| Extract: user_id, session_id, source_ip, timestamp |
| Validate: Required fields present, timestamp parseable |
+----------------------------------------------------------+
|
v
3. GEOLOCATION
+----------------------------------------------------------+
| GeoIP Lookup: source_ip -> (latitude, longitude, city) |
| Cache: Store IP->Location mappings (1 hour TTL) |
+----------------------------------------------------------+
|
v
4. FETCH USER BASELINE
+----------------------------------------------------------+
| Redis Hash: HGETALL baseline:{user_id} |
| Contents: last_location, typical_hours, avg_requests, |
| request_stddev, last_activity_time |
+----------------------------------------------------------+
|
v
5. RUN ANOMALY DETECTORS (Parallel)
+----------------------------------------------------------+
| Impossible Travel: Compare current location to last |
| Temporal Anomaly: Compare login hour to baseline |
| Behavioral: Compare request patterns to baseline |
+----------------------------------------------------------+
|
v
6. CALCULATE TRUST SCORE
+----------------------------------------------------------+
| Composite: weight * location_score + |
| weight * temporal_score + |
| weight * behavioral_score |
| Apply decay based on idle time |
+----------------------------------------------------------+
|
v
7. DECISION
+----------------------------------------------------------+
| If trust_score < threshold: |
| -> Publish revocation |
| -> Write alert to audit log |
| -> Notify dashboard via WebSocket |
| Else: |
| -> Update baseline with new data |
| -> Update last_location, last_activity |
+----------------------------------------------------------+
|
v
8. UPDATE BASELINE (if not anomaly)
+----------------------------------------------------------+
| Redis Hash: HSET baseline:{user_id} |
| last_location = current_location |
| last_activity = current_time |
| hour_histogram = updated histogram |
| request_ema = new EMA value |
+----------------------------------------------------------+
Phased Implementation Guide
Phase 1: Log Ingestion and Geolocation (8-10 hours)
Goal: Consume access logs from Redis Streams and enrich with geolocation.
Milestone: Print each log entry with city and coordinates.
Steps:
- Set up Python project with dependencies:
pip install aioredis geoip2 pydantic
- Create Redis Streams consumer:
import asyncio
import aioredis

async def consume_events():
    # from_url() is synchronous in aioredis 2.x; do not await it
    redis = aioredis.from_url("redis://localhost")
    last_id = "0"
    while True:
        events = await redis.xread(
            {"access-events": last_id},
            block=1000  # Wait up to 1 second
        )
        for stream, messages in events:
            for msg_id, data in messages:
                yield data
                last_id = msg_id
- Integrate MaxMind GeoIP:
import geoip2.database
import geoip2.errors

class GeoIPService:
    def __init__(self, db_path: str):
        self.reader = geoip2.database.Reader(db_path)

    def lookup(self, ip: str) -> dict:
        try:
            response = self.reader.city(ip)
            return {
                "city": response.city.name,
                "country": response.country.iso_code,
                "latitude": response.location.latitude,
                "longitude": response.location.longitude,
                "accuracy_km": response.location.accuracy_radius
            }
        except geoip2.errors.AddressNotFoundError:
            return None
- Implement Haversine distance function (from theory section).
- Create event parser with geolocation enrichment.
Verification:
# Publish test event
$ redis-cli XADD access-events "*" user_id alice@corp.com source_ip 8.8.8.8
# Monitor should output:
[LOG] User: alice@corp.com | IP: 8.8.8.8 | Location: Mountain View, US (37.4056, -122.0775)
Phase 2: User Baseline Building (8-10 hours)
Goal: Build and update per-user behavioral baselines.
Milestone: Show userâs typical login hours and locations.
Steps:
- Design baseline data model:
from dataclasses import dataclass
from datetime import datetime
from typing import List, Dict

@dataclass
class UserBaseline:
    user_id: str
    last_location: tuple            # (lat, lon, city)
    last_activity: datetime
    hour_histogram: Dict[int, int]  # hour -> count
    location_history: List[tuple]   # Recent locations
    request_count_ema: float
    request_count_stddev: float
- Implement baseline storage in Redis:
class BaselineStore:
    def __init__(self, redis_client):
        self.redis = redis_client

    async def get_baseline(self, user_id: str) -> UserBaseline:
        data = await self.redis.hgetall(f"baseline:{user_id}")
        if not data:
            return None
        return UserBaseline.from_dict(data)

    async def update_baseline(self, user_id: str, event: dict):
        # Update hour histogram
        hour = event["timestamp"].hour
        await self.redis.hincrby(f"baseline:{user_id}", f"hour:{hour}", 1)
        # Update last location
        await self.redis.hset(f"baseline:{user_id}", mapping={
            "last_lat": event["latitude"],
            "last_lon": event["longitude"],
            "last_city": event["city"],
            "last_activity": event["timestamp"].isoformat()
        })
- Implement EMA calculation for request rates.
- Add baseline initialization for new users.
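The EMA update in the steps above can be sketched as a one-line recurrence; the smoothing factor alpha = 0.2 is an assumed tuning knob, not a required value:

```python
# Exponentially weighted moving average: recent observations count more.
def update_ema(previous_ema: float, observation: float,
               alpha: float = 0.2) -> float:
    return alpha * observation + (1 - alpha) * previous_ema

ema = 65.0  # requests/hour baseline so far
for hourly_count in [70, 60, 68]:
    ema = update_ema(ema, hourly_count)
```

Because only the running value is stored, the EMA fits naturally into a single Redis Hash field per user.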
Verification:
$ curl http://localhost:8080/api/baseline/alice@corp.com
{
"user_id": "alice@corp.com",
"typical_hours": [9, 10, 11, 14, 15, 16],
"typical_locations": ["New York", "Boston"],
"baseline_confidence": 0.85,
"events_analyzed": 127
}
Phase 3: Impossible Travel Detection (6-8 hours)
Goal: Detect when a user appears in two distant locations too quickly.
Milestone: Trigger alert for NYC->London in 15 minutes.
Steps:
- Implement the detector:
from dataclasses import dataclass
from datetime import datetime

@dataclass
class ImpossibleTravelAlert:
    user_id: str
    location_a: dict
    location_b: dict
    time_difference_seconds: int
    distance_km: float
    speed_kmh: float
    threshold_kmh: float = 1500

class ImpossibleTravelDetector:
    def __init__(self, speed_threshold_kmh: float = 1500):
        self.threshold = speed_threshold_kmh

    def detect(self, current_event: dict, last_location: tuple,
               last_time: datetime) -> ImpossibleTravelAlert:
        current_loc = (current_event["latitude"], current_event["longitude"])
        distance = haversine_distance(
            last_location[0], last_location[1],
            current_loc[0], current_loc[1]
        )
        time_diff = (current_event["timestamp"] - last_time).total_seconds()
        hours = time_diff / 3600
        if hours == 0:
            speed = float('inf')
        else:
            speed = distance / hours
        if speed > self.threshold:
            return ImpossibleTravelAlert(
                user_id=current_event["user_id"],
                location_a={"city": last_location[2], "coords": last_location[:2]},
                location_b={"city": current_event["city"], "coords": current_loc},
                time_difference_seconds=int(time_diff),
                distance_km=distance,
                speed_kmh=speed
            )
        return None
- Add edge case handling (VPN IPs, same-city tolerance).
- Integrate with baseline store.
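For the same-city tolerance in the edge-case step above, a minimal guard suffices: below a minimum distance, two different IPs are treated as local churn rather than travel. The 50 km default is an assumption:

```python
# Skip impossible-travel evaluation for short hops (same metro area).
def should_check_travel(distance_km: float,
                        min_distance_km: float = 50.0) -> bool:
    return distance_km >= min_distance_km
```

Calling this before the speed calculation avoids flagging a user who switches from office Wi-Fi to a mobile carrier across town.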
Verification:
# Simulate two events
$ redis-cli XADD access-events "*" user_id alice source_ip 1.1.1.1 timestamp "2024-12-27T10:00:00Z"
$ redis-cli XADD access-events "*" user_id alice source_ip 8.8.8.8 timestamp "2024-12-27T10:15:00Z"
# Monitor should detect:
[ALERT] Impossible Travel: alice | NYC -> London | 15 min | 5581 km | 22324 km/h
Phase 4: Trust Score Calculation (6-8 hours)
Goal: Calculate composite trust scores from multiple signals.
Milestone: Display real-time trust scores for active sessions.
Steps:
- Implement individual score calculators:
def location_score(current_loc: tuple, baseline: UserBaseline) -> int:
    """Score based on location familiarity."""
    known_cities = [loc[2] for loc in baseline.location_history]
    if current_loc[2] in known_cities:
        return 100
    # Calculate distance to nearest known location
    min_distance = min(
        haversine_distance(*current_loc[:2], *known[:2])
        for known in baseline.location_history
    )
    if min_distance < 100:     # Same metro area
        return 80
    elif min_distance < 500:   # Same region
        return 60
    elif min_distance < 1000:  # Same country, probably
        return 40
    else:
        return 20

def temporal_score(hour: int, baseline: UserBaseline) -> int:
    """Score based on login time typicality."""
    total_events = sum(baseline.hour_histogram.values())
    if total_events == 0:
        return 50  # No baseline yet
    hour_count = baseline.hour_histogram.get(hour, 0)
    frequency = hour_count / total_events
    if frequency > 0.1:  # >10% of logins at this hour
        return 100
    elif frequency > 0.05:
        return 80
    elif frequency > 0.01:
        return 50
    else:
        return 20
- Implement composite score calculator with weights.
- Add score decay function.
- Integrate with session state storage.
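The decay step above can be sketched as exponential decay toward a neutral floor, so an idle session's trust erodes and forces re-verification after long gaps. The 30-minute half-life and floor of 50 are illustrative choices:

```python
# Exponential trust decay during inactivity.
def decayed_score(score: float, idle_seconds: float,
                  half_life_seconds: float = 1800.0,
                  floor: float = 50.0) -> float:
    if score <= floor:
        return score
    factor = 0.5 ** (idle_seconds / half_life_seconds)
    return floor + (score - floor) * factor

# After one half-life, a score of 100 sits halfway to the floor: 75
s = decayed_score(100, idle_seconds=1800)
```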
Verification:
$ curl http://localhost:8080/api/sessions/active
[
{
"session_id": "sess-4412-XA",
"user": "alice@corp.com",
"trust_score": 95,
"score_breakdown": {
"location": 100,
"temporal": 90,
"behavioral": 95
}
}
]
Phase 5: Revocation Signal Broadcasting (4-6 hours)
Goal: Publish revocation signals that PEPs can consume.
Milestone: Verify PEP receives and acts on revocation within 100ms.
Steps:
- Implement revocation publisher:
import json
from datetime import datetime

class RevocationPublisher:
    def __init__(self, redis_client):
        self.redis = redis_client
        self.channel = "session-revocations"

    async def revoke(self, user_id: str, session_id: str, reason: str):
        message = {
            "action": "REVOKE",
            "user_id": user_id,
            "session_id": session_id,
            "reason": reason,
            "timestamp": datetime.utcnow().isoformat()
        }
        await self.redis.publish(self.channel, json.dumps(message))
- Create PEP subscription handler (for testing).
- Add latency measurement.
- Implement retry logic for failed publishes.
Verification:
# Terminal 1: Subscribe to revocations
$ redis-cli SUBSCRIBE session-revocations
# Terminal 2: Trigger revocation
$ curl -X POST http://localhost:8080/api/revoke/alice@corp.com/sess-4412-XA
# Terminal 1 should immediately show:
1) "message"
2) "session-revocations"
3) "{\"action\":\"REVOKE\",\"user_id\":\"alice@corp.com\",...}"
Phase 6: Dashboard and Alerting (6-8 hours)
Goal: Real-time visibility into trust scores and alerts.
Milestone: WebSocket-connected dashboard showing live session states.
Steps:
- Create FastAPI application with WebSocket support:
from fastapi import FastAPI, WebSocket
from fastapi.websockets import WebSocketDisconnect

app = FastAPI()
active_connections = []

@app.websocket("/ws/dashboard")
async def dashboard_websocket(websocket: WebSocket):
    await websocket.accept()
    active_connections.append(websocket)
    try:
        while True:
            # Keep the connection alive
            await websocket.receive_text()
    except WebSocketDisconnect:
        active_connections.remove(websocket)

async def broadcast_update(data: dict):
    for connection in active_connections:
        await connection.send_json(data)
- Implement REST endpoints for session listing.
- Create alert history endpoint.
- Add admin API for threshold configuration.
Testing Strategy
Unit Tests
# tests/test_haversine.py
import pytest
from geolocation.distance import haversine_distance
def test_haversine_nyc_to_london():
# NYC coordinates
nyc = (40.7128, -74.0060)
# London coordinates
london = (51.5074, -0.1278)
distance = haversine_distance(*nyc, *london)
# Expected: ~5,581 km (allow 1% tolerance)
assert 5500 < distance < 5650
def test_haversine_same_point():
point = (40.7128, -74.0060)
distance = haversine_distance(*point, *point)
assert distance == 0
def test_haversine_antipodal_points():
# Maximum possible distance
point_a = (0, 0)
point_b = (0, 180)
distance = haversine_distance(*point_a, *point_b)
# Half Earth circumference: ~20,000 km
assert 19900 < distance < 20100
# tests/test_impossible_travel.py
import pytest
from datetime import datetime, timedelta
from detection.impossible_travel import ImpossibleTravelDetector
def test_impossible_travel_detected():
detector = ImpossibleTravelDetector(speed_threshold_kmh=1500)
current_event = {
"user_id": "alice",
"latitude": 51.5074, # London
"longitude": -0.1278,
"city": "London",
"timestamp": datetime(2024, 12, 27, 10, 20, 0)
}
last_location = (40.7128, -74.0060, "New York") # NYC
last_time = datetime(2024, 12, 27, 10, 5, 0) # 15 min earlier
alert = detector.detect(current_event, last_location, last_time)
assert alert is not None
assert alert.speed_kmh > 20000 # Way over threshold
assert alert.distance_km > 5500
def test_legitimate_travel_not_flagged():
detector = ImpossibleTravelDetector(speed_threshold_kmh=1500)
# NYC to Boston, 6 hours later (possible by car)
current_event = {
"user_id": "alice",
"latitude": 42.3601, # Boston
"longitude": -71.0589,
"city": "Boston",
"timestamp": datetime(2024, 12, 27, 16, 0, 0)
}
last_location = (40.7128, -74.0060, "New York")
last_time = datetime(2024, 12, 27, 10, 0, 0) # 6 hours earlier
alert = detector.detect(current_event, last_location, last_time)
    assert alert is None  # No alert for ~50 km/h travel
Integration Tests
# tests/test_integration.py
import pytest
import asyncio
import json
import aioredis
@pytest.fixture
async def redis_client():
client = aioredis.from_url("redis://localhost")  # synchronous in aioredis 2.x
yield client
await client.close()
@pytest.mark.asyncio
async def test_end_to_end_impossible_travel(redis_client):
    """Test complete flow from log ingestion to revocation."""
    # Subscribe to revocations
    pubsub = redis_client.pubsub()
    await pubsub.subscribe("session-revocations")

    # Publish first event (NYC)
    await redis_client.xadd("access-events", {
        "user_id": "test-user",
        "source_ip": "203.0.113.45",
        "session_id": "test-session",
        "timestamp": "2024-12-27T10:00:00Z"
    })

    # Wait for baseline update
    await asyncio.sleep(1)

    # Publish second event (London, 15 min later)
    await redis_client.xadd("access-events", {
        "user_id": "test-user",
        "source_ip": "185.34.22.11",
        "session_id": "test-session",
        "timestamp": "2024-12-27T10:15:00Z"
    })

    # Wait for revocation
    message = await asyncio.wait_for(
        pubsub.get_message(ignore_subscribe_messages=True),
        timeout=5.0
    )
    assert message is not None
    data = json.loads(message["data"])
    assert data["action"] == "REVOKE"
    assert data["user_id"] == "test-user"
False Positive Testing
# tests/test_false_positives.py
from datetime import datetime, timedelta
from detection.impossible_travel import ImpossibleTravelDetector

def test_vpn_should_not_trigger_immediate_revoke():
    """VPN usage should flag but not auto-revoke."""
    # Known VPN IP range
    vpn_ip = "104.238.130.1"  # Example VPN provider
    detector = ImpossibleTravelDetector()
    detector.vpn_ips = load_vpn_ip_ranges()  # project helper for VPN ranges
    result = detector.detect_with_context(
        current_ip=vpn_ip,
        last_location=(40.7128, -74.0060, "NYC"),
        last_time=datetime.now() - timedelta(minutes=5)
    )
    assert result.action == "FLAG"  # Not REVOKE
    assert result.requires_mfa_stepup is True
def test_same_city_different_ip():
    """Different IPs in same city should not flag."""
    # Two different IPs both in NYC
    detector = ImpossibleTravelDetector(min_distance_km=50)
    current = {
        "latitude": 40.7580,  # Midtown
        "longitude": -73.9855,
        "timestamp": datetime.now()
    }
    last = (40.7128, -74.0060, "NYC")  # Downtown
    last_time = datetime.now() - timedelta(minutes=10)

    alert = detector.detect(current, last, last_time)
    assert alert is None  # Only ~5 km apart, below min_distance_km
Performance Testing
# Generate load test events
$ python scripts/generate_events.py --count 100000 --users 1000
# Run benchmark
$ hyperfine --warmup 3 'python -c "from main import process_batch; process_batch()"'
# Expected: Process 10,000 events/second on single core
Common Pitfalls and Debugging
Pitfall 1: VPN False Positives
Symptom: Users with VPNs constantly flagged for impossible travel.
Cause: VPN exit nodes in different countries than user's actual location.
Solution:
class VPNAwareDetector:
    def __init__(self):
        self.vpn_asns = self.load_vpn_asn_list()
        self.vpn_ip_ranges = self.load_vpn_ip_ranges()

    def is_vpn(self, ip: str) -> bool:
        # Check if IP belongs to known VPN provider
        if ip_in_ranges(ip, self.vpn_ip_ranges):
            return True
        # Check ASN
        asn = self.get_asn(ip)
        return asn in self.vpn_asns

    def detect(self, event, last_location, last_time):
        if self.is_vpn(event["source_ip"]):
            # Don't use location for VPN IPs;
            # fall back to other signals (device, behavior)
            return self.detect_via_behavior(event)
        return self.detect_via_location(event, last_location, last_time)
Additional: Allow users to pre-register VPN usage in their profile.
Pitfall 2: Clock Synchronization Issues
Symptom: Events appear out of order; impossible travel detected for legitimate sequences.
Cause: Different PEP servers have unsynchronized clocks.
Solution:
def validate_event_sequence(current: dict, previous: dict) -> bool:
    """Require minimum time gap before analysis."""
    time_diff = (current["timestamp"] - previous["timestamp"]).total_seconds()
    # If events are within 60 seconds, don't analyze for travel:
    # clock skew could cause false ordering
    if abs(time_diff) < 60:
        return False
    # If current event is before previous, log warning
    if time_diff < 0:
        log.warning(f"Event timestamp ordering issue: {current} before {previous}")
        return False
    return True
Infrastructure: Ensure all servers use NTP with tight synchronization (chrony or ntpd).
Pitfall 3: Baseline Cold Start
Symptom: New users immediately flagged for anomalies.
Cause: No baseline exists to compare against.
Solution:
class BaselineManager:
    MINIMUM_EVENTS_FOR_BASELINE = 10
    COLD_START_TRUST_SCORE = 70  # Not full trust, but not denied

    def has_sufficient_baseline(self, user_id: str) -> bool:
        baseline = self.get_baseline(user_id)
        if not baseline:
            return False
        return baseline.event_count >= self.MINIMUM_EVENTS_FOR_BASELINE

    def get_trust_score(self, user_id: str, event: dict) -> int:
        if not self.has_sufficient_baseline(user_id):
            # New user - can't detect anomalies yet.
            # Apply conservative trust score but allow access.
            return self.COLD_START_TRUST_SCORE
        return self.calculate_trust_score(user_id, event)
Consider: "Onboarding mode" for new users with reduced sensitivity.
Pitfall 4: Alert Fatigue
Symptom: Security team ignores alerts because too many are false positives.
Cause: Thresholds too aggressive; legitimate edge cases not handled.
Solution:
from collections import defaultdict

class AlertManager:
    def __init__(self):
        self.alert_counts = defaultdict(int)  # user_id -> alerts today
        self.suppression_rules = []

    def should_alert(self, alert: Alert) -> bool:
        # Check suppression rules
        for rule in self.suppression_rules:
            if rule.matches(alert):
                return False
        # Rate limit per user
        user_alerts_today = self.alert_counts[alert.user_id]
        if user_alerts_today > 5:
            log.info(f"Suppressing alert for {alert.user_id} - rate limited")
            return False
        # Require minimum severity
        if alert.severity < Severity.MEDIUM:
            return False
        return True

    def add_suppression_rule(self, rule: SuppressionRule):
        """Allow admins to suppress known false positive patterns."""
        self.suppression_rules.append(rule)
Tuning: Track false positive rate and adjust thresholds quarterly.
Pitfall 5: Memory Leak in Baseline Store
Symptom: Monitor memory grows unbounded over time.
Cause: Storing unlimited location history per user.
Solution:
class UserBaseline:
    MAX_LOCATION_HISTORY = 100
    MAX_HOUR_HISTOGRAM_SIZE = 24  # Always bounded

    def add_location(self, location: tuple):
        self.location_history.append(location)
        # Prune old entries
        if len(self.location_history) > self.MAX_LOCATION_HISTORY:
            self.location_history = self.location_history[-self.MAX_LOCATION_HISTORY:]

    def add_to_hour_histogram(self, hour: int):
        self.hour_histogram[hour] = self.hour_histogram.get(hour, 0) + 1
        # Use exponential decay to prevent unbounded growth
        if sum(self.hour_histogram.values()) > 1000:
            for h in self.hour_histogram:
                self.hour_histogram[h] = int(self.hour_histogram[h] * 0.9)
Debugging Commands
# Check user's current baseline
$ redis-cli HGETALL baseline:alice@corp.com
# View recent events for a user
$ redis-cli XRANGE access-events - + COUNT 10 | grep alice
# Test geolocation
$ python -c "from geolocation import GeoIPService; g = GeoIPService('GeoLite2-City.mmdb'); print(g.lookup('8.8.8.8'))"
# Monitor revocation channel
$ redis-cli SUBSCRIBE session-revocations
# Check detection latency
$ python scripts/measure_latency.py
# View alert history
$ curl "http://localhost:8080/api/alerts?user=alice@corp.com&limit=10"
Extensions and Challenges
Extension 1: Machine Learning for Behavior Modeling
Replace statistical baselines with ML models that learn complex patterns.
Approach:
from sklearn.ensemble import IsolationForest
import numpy as np

class MLBehaviorModel:
    def __init__(self):
        self.model = IsolationForest(
            contamination=0.01,  # Expect 1% anomalies
            random_state=42
        )

    def train(self, user_events: list):
        """Train on user's historical events."""
        features = self.extract_features(user_events)
        self.model.fit(features)

    def extract_features(self, events: list) -> np.ndarray:
        return np.array([
            [
                e["hour"],
                e["day_of_week"],
                e["latitude"],
                e["longitude"],
                e["request_rate"],
                e["session_duration"]
            ]
            for e in events
        ])

    def is_anomaly(self, event: dict) -> bool:
        features = self.extract_features([event])
        prediction = self.model.predict(features)
        return prediction[0] == -1  # -1 = anomaly
Why It's Better: Can detect complex, multi-dimensional anomalies that simple statistics miss.
Extension 2: User-Agent and Browser Fingerprinting
Add device fingerprinting as an additional signal.
Implementation:
import hashlib

class DeviceFingerprinter:
    def extract_fingerprint(self, headers: dict) -> str:
        """Create a hash of device characteristics."""
        components = [
            headers.get("User-Agent", ""),
            headers.get("Accept-Language", ""),
            headers.get("Accept-Encoding", ""),
            headers.get("Sec-CH-UA", ""),  # Client hints
            headers.get("Sec-CH-UA-Platform", "")
        ]
        return hashlib.sha256("|".join(components).encode()).hexdigest()[:16]

    def is_new_device(self, user_id: str, fingerprint: str) -> bool:
        known_devices = self.get_known_devices(user_id)
        return fingerprint not in known_devices
Use Case: Even if IP changes (VPN), same device fingerprint = higher trust.
Extension 3: Keystroke Dynamics
Analyze typing patterns for continuous authentication.
Concept:
+------------------------------------------------------------------+
| KEYSTROKE DYNAMICS |
+------------------------------------------------------------------+
Measurable characteristics:
- Key hold time (dwell time): How long each key is pressed
- Inter-key interval: Time between releasing one key and pressing next
- Typing speed: Characters per minute
- Error patterns: Backspace frequency, correction patterns
Example baseline for "alice":
- Average dwell time: 85ms
- Average inter-key interval: 120ms
- Typing speed: 65 WPM
- Error rate: 3%
Anomaly detection:
- Current session dwell time: 45ms (much faster)
- Current speed: 120 WPM
- Conclusion: Possibly a different person typing, or automated tool
+------------------------------------------------------------------+
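The dwell-time comparison sketched in the box can be reduced to a few lines. The following is a minimal illustration, not the project's implementation: the 85 ms baseline and the 30% deviation threshold are assumed example values.

```python
import statistics

def keystroke_anomaly(dwell_times_ms, baseline_mean_ms=85.0,
                      max_relative_deviation=0.30):
    """Flag a session whose mean key hold time deviates more than
    max_relative_deviation from the user's stored baseline."""
    session_mean = statistics.mean(dwell_times_ms)
    deviation = abs(session_mean - baseline_mean_ms) / baseline_mean_ms
    return deviation > max_relative_deviation

# A ~45 ms session against an 85 ms baseline deviates ~47% -> flagged
```

In practice you would compare the full distributions (or z-scores per key pair), not just the means, but the single-statistic version already catches the "much faster typist" case above.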
Extension 4: Mouse Movement Analysis
Track mouse movement patterns as behavioral biometric.
Signals to analyze:
- Mouse speed distribution
- Cursor path characteristics (straight lines vs curves)
- Click patterns (single vs double click timing)
- Scroll behavior
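One of the listed signals, cursor path characteristics, can be quantified as the ratio of straight-line distance to total distance travelled: scripted cursor movement tends to score near 1.0, while human movement is curvier. A minimal sketch (the function name is illustrative):

```python
import math

def path_efficiency(points):
    """Straight-line distance / travelled distance for a cursor path.
    1.0 means a perfectly straight line (suspiciously bot-like)."""
    if len(points) < 2:
        return 1.0
    travelled = sum(math.dist(points[i], points[i + 1])
                    for i in range(len(points) - 1))
    if travelled == 0:
        return 1.0
    return math.dist(points[0], points[-1]) / travelled
```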
Extension 5: Integration with SIEM
Send alerts to enterprise SIEM for correlation.
import requests

class SIEMIntegration:
    def __init__(self, siem_endpoint: str, api_key: str):
        self.endpoint = siem_endpoint
        self.api_key = api_key

    def send_alert(self, alert: Alert):
        # Format as Common Event Format (CEF)
        cef_message = self.format_cef(alert)
        requests.post(
            f"{self.endpoint}/api/events",
            headers={"Authorization": f"Bearer {self.api_key}"},
            json={"event": cef_message}
        )

    def format_cef(self, alert: Alert) -> str:
        return (
            f"CEF:0|ZeroTrust|AuthMonitor|1.0|"
            f"{alert.type}|{alert.severity}|"
            f"src={alert.source_ip} "
            f"duser={alert.user_id} "
            f"msg={alert.message}"
        )
Real-World Connections
Microsoft Defender for Identity
Microsoft's UEBA solution for detecting identity-based attacks.
How it compares:
- Uses similar signals (location, time, behavior)
- Integrates with Active Directory
- Your project: More customizable, works with any identity provider
Reference: https://docs.microsoft.com/en-us/defender-for-identity/
AWS GuardDuty
Amazon's threat detection service using ML.
Relevant features:
- Unusual API calls detection
- Impossible travel detection for AWS Console logins
- Automated remediation via EventBridge
Your project: Applies same concepts to your custom applications.
Okta ThreatInsight
Okta's built-in threat detection for identity.
Capabilities:
- Credential stuffing detection
- Brute force protection
- Suspicious location alerts
Your project: Building similar capabilities from scratch teaches you how these work.
Google BeyondCorp
Google's Zero Trust implementation.
Relevant concepts:
- Device trust signals
- Context-aware access
- Continuous verification
Reference: https://cloud.google.com/beyondcorp
Interview Questions
Question 1: What is User and Entity Behavior Analytics (UEBA)?
Strong Answer: UEBA is a security approach that establishes behavioral baselines for users and entities, then detects deviations that may indicate compromise. Unlike signature-based detection that looks for known attack patterns, UEBA asks "is this behavior normal for this user?"
Key components:
- Baseline building (what's normal)
- Anomaly detection (what's different)
- Risk scoring (how serious is the deviation)
- Response automation (what to do about it)
Follow-up: "What's the difference between UEBA and traditional IDS?"
Traditional IDS uses signatures and rules. UEBA learns per-entity baselines and detects novel attacks without prior signatures.
Question 2: Explain the Impossible Travel detection algorithm.
Strong Answer:
- Extract location from IP address using geolocation database
- Calculate great-circle distance between current and last location using Haversine formula
- Calculate elapsed time between events
- Compute required travel speed: distance / time
- If speed exceeds threshold (e.g., 1500 km/h), flag as impossible
Edge cases to handle:
- VPN usage (IP doesn't reflect actual location)
- Mobile network IP changes
- Clock synchronization between servers
- Same-city IP changes
Formula: Speed = Distance / Time, where Distance = 2R * arcsin(sqrt(a)) and a = sin^2(Δφ/2) + cos(φ1) * cos(φ2) * sin^2(Δλ/2), with φ = latitude, λ = longitude, R ≈ 6371 km.
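The five steps above can be sketched end to end as follows. This is a hypothetical standalone version for illustration, not the project's detector class:

```python
import math
from datetime import datetime

EARTH_RADIUS_KM = 6371.0

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two coordinates, in kilometres."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

def required_speed_kmh(lat1, lon1, t1, lat2, lon2, t2):
    """Speed needed to cover the distance in the elapsed time."""
    hours = (t2 - t1).total_seconds() / 3600
    if hours <= 0:
        return float("inf")  # zero or negative gap: treat as impossible
    return haversine_km(lat1, lon1, lat2, lon2) / hours

# NYC login at 10:05, London login at 10:20: ~5570 km in 15 minutes,
# far above any 1500 km/h threshold -> impossible travel
speed = required_speed_kmh(
    40.7128, -74.0060, datetime(2024, 12, 27, 10, 5),
    51.5074, -0.1278, datetime(2024, 12, 27, 10, 20),
)
```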
Question 3: How do you balance security with user experience in continuous authentication?
Strong Answer: Three key strategies:
- Risk-based responses:
- Low risk: No friction
- Medium risk: Step-up MFA for sensitive operations only
- High risk: Force re-authentication
- Critical (impossible travel): Immediate revocation
- False positive reduction:
- Combine multiple signals before acting
- Allow users to pre-register travel or VPN usage
- Implement suppression rules for known patterns
- Tune thresholds based on measured false positive rates
- Graceful degradation:
- Never completely lock out based on single signal
- Provide clear remediation path (MFA step-up)
- Track and learn from user feedback on false positives
Question 4: How would you handle token revocation at scale?
Strong Answer: Push-based approach using pub/sub:
- Monitor publishes revocation to Redis Pub/Sub
- All PEPs subscribe and receive within milliseconds
- PEPs maintain local cache of revoked sessions
- No centralized bottleneck
Defense in depth:
- Short-lived access tokens (5-15 minutes)
- Even without push, token expires quickly
- Blacklist only needs to store refresh token revocations
Challenges at scale:
- Network partitions could miss revocations
- Solution: Combine push with short TTL as fallback
- Consider consistent hashing for revocation distribution
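The "local cache of revoked sessions plus short token TTL" combination can be sketched like this (class and method names are illustrative; the 900-second TTL matches the 15-minute access tokens mentioned above):

```python
import time

class RevocationCache:
    """PEP-side cache of revoked session IDs, populated from pub/sub.
    Entries only need to outlive the access-token TTL: once the token
    has expired on its own, the normal expiry check rejects it anyway."""

    def __init__(self, token_ttl_seconds=900):
        self.ttl = token_ttl_seconds
        self._revoked = {}  # session_id -> revocation timestamp

    def revoke(self, session_id, now=None):
        self._revoked[session_id] = time.time() if now is None else now

    def is_revoked(self, session_id, now=None):
        now = time.time() if now is None else now
        revoked_at = self._revoked.get(session_id)
        if revoked_at is None:
            return False
        if now - revoked_at > self.ttl:
            # Safe to prune: the token itself expired before this point
            del self._revoked[session_id]
            return False
        return True
```

The pruning step is what keeps the cache bounded: it never holds more entries than sessions revoked within one token lifetime.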
Question 5: What is a trust score and how do you calculate it?
Strong Answer: A trust score is a composite value (0-100) representing confidence that a session is legitimate.
Components:
- Location score: Based on familiarity of current location
- Temporal score: Based on typicality of current time
- Device score: Based on known device fingerprint
- Behavioral score: Based on request pattern normality
Calculation methods:
- Weighted average: sum(score * weight)
- Minimum signal: min(all_scores) - most secure
- Multiplicative: product(scores/100) * 100 - most sensitive
Decay: Trust decays over time without activity, forcing re-verification.
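A sketch of the three combination methods plus decay; the signal weights and the 30-minute half-life are illustrative assumptions, not values from the project:

```python
def weighted_average(scores, weights):
    """Balanced: each signal contributes in proportion to its weight."""
    return sum(scores[k] * weights[k] for k in scores) / sum(weights.values())

def minimum_signal(scores):
    """Most secure: the weakest signal caps the whole score."""
    return min(scores.values())

def multiplicative(scores):
    """Most sensitive: any low signal drags the product down sharply."""
    product = 1.0
    for s in scores.values():
        product *= s / 100
    return product * 100

def decayed(score, idle_seconds, half_life_seconds=1800):
    """Halve trust for every half-life of inactivity (illustrative rate)."""
    return score * 0.5 ** (idle_seconds / half_life_seconds)

scores = {"location": 90, "temporal": 80, "device": 100, "behavior": 70}
weights = {"location": 0.3, "temporal": 0.2, "device": 0.3, "behavior": 0.2}
```

On this example the three methods give roughly 87, 70, and 50: the same signals, ordered from most lenient to most sensitive, which is exactly the tuning knob the answer describes.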
Question 6: How do you handle the cold start problem for new users?
Strong Answer: New users have no baseline to compare against, creating challenges:
Strategies:
- Onboarding period: Apply conservative trust score (e.g., 70) during first N events
- Organization-wide baseline: Compare to typical behavior for similar roles
- Explicit profiling: Ask users to register expected locations/devices
- Higher MFA frequency: Require more verification until baseline established
- Supervised learning: Use labeled data from existing users to train models
Minimum baseline threshold: Typically 10-20 events before anomaly detection activates.
Question 7: Describe the z-score and how it applies to anomaly detection.
Strong Answer: Z-score measures how many standard deviations a value is from the mean:
z = (value - mean) / standard_deviation
Interpretation:
- |z| < 2: Normal (within ~95% of data)
- |z| > 2: Unusual (outside ~95%)
- |z| > 3: Very unusual (outside ~99.7%)
Application in UEBA:
- Calculate mean and stddev of user's login hours
- For new login, compute z-score
- z > 3 for 3 AM login = anomaly alert
Limitations:
- Assumes normal distribution
- Sensitive to outliers in baseline
- Use robust alternatives (MAD) for non-normal data
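Both the plain z-score and the MAD-based robust alternative are short to implement; the 0.6745 constant rescales MAD so the result is comparable to a standard-deviation-based z-score:

```python
import statistics

def z_score(value, history):
    """Standard score: deviations from the mean in stdev units."""
    return (value - statistics.mean(history)) / statistics.stdev(history)

def robust_z_score(value, history):
    """Modified z-score using the median absolute deviation (MAD);
    far less sensitive to outliers contaminating the baseline."""
    med = statistics.median(history)
    mad = statistics.median([abs(x - med) for x in history])
    if mad == 0:
        return 0.0
    return 0.6745 * (value - med) / mad

# Typical login hours for a user who works 9-to-5;
# a 3 AM login scores |z| >> 3 under both measures -> anomaly alert
login_hours = [9, 9, 10, 10, 11, 9, 10, 10, 9, 11]
```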
Question 8: How does stream processing differ from batch processing for security monitoring?
Strong Answer: Batch processing:
- Process logs nightly
- Detection latency: hours to days
- Simpler architecture
- Good for forensics, not prevention
Stream processing:
- Process events as they arrive
- Detection latency: seconds
- More complex (state management, exactly-once)
- Essential for active defense
For security:
- Batch: Historical analysis, compliance reporting
- Stream: Real-time threat detection, active response
- Best practice: Use both (Lambda architecture)
Question 9: What signals beyond location would you use for continuous authentication?
Strong Answer:
- Device fingerprint: Browser/OS characteristics
- Temporal patterns: Typical login hours, day of week
- Request behavior: Frequency, resources accessed, data volume
- Session characteristics: Duration, activity patterns
- Biometrics (advanced): Keystroke dynamics, mouse movements
- Network context: VPN, corporate network, public WiFi
- Device health: From endpoint agent (antivirus, patch level)
Composite approach: No single signal is definitive; combine for confidence.
Question 10: How would you measure the effectiveness of a UEBA system?
Strong Answer: Metrics:
- Detection rate: % of real attacks caught (from red team exercises)
- False positive rate: % of legitimate users flagged
- Detection latency: Time from attack start to alert
- Mean time to respond: Alert to revocation
- User friction: MFA step-ups per user per day
Testing methods:
- Red team exercises (inject known attacks)
- Replay historical attacks with known labels
- A/B testing detection thresholds
- User feedback on false positives
Target: <1% false positive rate while maintaining >95% detection rate.
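Given labeled outcomes from a red-team replay, the two headline metrics reduce to a few lines. The `(was_attack, was_flagged)` pair format is an assumption for illustration:

```python
def evaluate(results):
    """results: list of (was_attack, was_flagged) pairs from a labeled run.
    Returns detection rate (true positive rate) and false positive rate."""
    attack_flags = [flagged for attack, flagged in results if attack]
    benign_flags = [flagged for attack, flagged in results if not attack]
    return {
        "detection_rate":
            sum(attack_flags) / len(attack_flags) if attack_flags else 0.0,
        "false_positive_rate":
            sum(benign_flags) / len(benign_flags) if benign_flags else 0.0,
    }

# 2 attacks (1 caught), 2 benign sessions (1 wrongly flagged)
metrics = evaluate([(True, True), (True, False), (False, False), (False, True)])
```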
Resources and Self-Assessment
Books
| Book | Author | Relevant Chapters |
|---|---|---|
| Security in Computing | Charles Pfleeger | Ch. 7: Network Security, Ch. 8: Intrusion Detection |
| Designing Data-Intensive Applications | Martin Kleppmann | Ch. 11: Stream Processing |
| Foundations of Information Security | Jason Andress | Ch. 8: Intrusion Detection |
| Zero Trust Networks | Gilman & Barth | Ch. 6: Runtime Security |
| The Art of Software Security Assessment | Dowd, McDonald, Schuh | Ch. 11: Network Protocols |
Tools and Libraries
| Tool | Purpose |
|---|---|
| MaxMind GeoLite2 | IP geolocation database |
| Redis Streams | Event streaming |
| geoip2 (Python) | GeoIP database client |
| aioredis | Async Redis client for Python |
| FastAPI | Dashboard API |
| pytest-asyncio | Async testing |
RFCs and Standards
| Document | Topic |
|---|---|
| RFC 8693 | OAuth 2.0 Token Exchange (for step-up auth) |
| NIST SP 800-207 | Zero Trust Architecture |
| MITRE ATT&CK | Credential access techniques |
Self-Assessment Checklist
Before considering this project complete, verify you can:
Conceptual Understanding:
- Explain UEBA and how it differs from signature-based detection
- Describe the impossible travel algorithm with mathematical formulas
- Explain trust scores and how multiple signals combine
- Articulate the tradeoff between security and usability
- Describe token revocation strategies and their tradeoffs
Implementation Skills:
- Calculate great-circle distance using Haversine formula
- Build behavioral baselines using exponential moving averages
- Calculate z-scores for anomaly detection
- Process events from a stream (Redis Streams or similar)
- Publish messages via pub/sub for distributed revocation
Integration Capability:
- Integrate with GeoIP database for location lookup
- Connect to Policy Enforcement Points for revocation
- Expose REST and WebSocket APIs for dashboard
- Store audit trails in persistent database
Operational Readiness:
- Handle VPN and proxy edge cases
- Manage the cold start problem for new users
- Tune thresholds to minimize false positives
- Measure and optimize detection latency
Security Thinking:
- Consider how an attacker might evade detection
- Design for fail-closed behavior
- Handle clock synchronization issues
- Protect the monitor itself from compromise
The Core Question You've Answered
"How do I know that the person using this session is still the same person who authenticated?"
This is THE fundamental question of continuous authentication. By building this project, you've learned that authentication is not an event - it's a continuous process. Every action provides evidence about identity, and your monitor is constantly evaluating that evidence.
You've built a system that watches behavior, learns what's normal, and raises the alarm when something feels wrong - not because it matches a known attack pattern, but because it deviates from expected behavior. This is the essence of Zero Trust.
The Trust Score is the new access decision. It's not "does this user have permission?" but "how confident are we that this is actually that user, right now, in this context?"
Your Continuous Authentication Monitor answers that question for every request, enabling truly adaptive, risk-based access control.
Project Guide Version 1.0 - December 2024