Project 13: Federated Edge Event Bus

Build cross-site event routing with bounded demand and gateway failover.

Quick Reference

Attribute	Value
Difficulty	Level 4
Time Estimate	24-40 hours
Main Programming Language	Elixir
Alternative Programming Languages	Erlang
Coolness Level	Level 5
Business Potential	Level 4
Prerequisites	GenStage, distributed nodes, supervision
Key Topics	WAN backpressure, event routing, gateway isolation

1. Learning Objectives

By completing this project, you will:

Route selected event streams across multiple BEAM sites.
Enforce demand-based backpressure on WAN links.
Isolate gateway failures from local producers.
Measure queue growth and failover behavior under load.

2. All Theory Needed (Per-Concept Breakdown)

Distributed Backpressure Across Sites

Fundamentals Backpressure ensures producers do not overwhelm consumers. In federated systems, this must extend across network boundaries where link latency and throughput vary significantly. Gateway processes are the control points that translate local producer speed into remote demand constraints.

Deep Dive into the concept Local backpressure is already a key BEAM practice with GenStage. Federated backpressure adds routing and transport variability. A producer in one site may generate events faster than another site can safely consume. If you treat WAN links as infinite buffers, queue growth becomes silent until failure.

The architecture should separate local event ingestion from cross-site replication. Local producers publish into local stages; gateway stages selectively replicate topics based on policy and remote demand. This design gives you one place to enforce bounded queues, compaction policies, and drop strategies.

Demand contracts must be explicit per route and per topic. High-priority operational events might reserve quota while low-priority telemetry is sampled or compacted. These policies are business decisions, not just technical tuning.

Failover is another core dimension. Gateway processes are critical-path components and must be supervised independently. A healthy design routes around failed gateways while keeping local processing alive. The invariant is that local workloads should not block merely because a remote site is slow or unavailable.

Observability should cover queue depth, end-to-end latency, demand fulfillment ratio, and drop counts. Each metric must be segmented by route and topic. Aggregate metrics hide bottlenecks.

Common failure patterns include unbounded buffers, missing per-route limits, and coupling gateway availability to local producer loops. Fix these by defining strict isolation boundaries and rejection/compaction behavior before implementing.

How this fit on projects This project extends P11 topology and uses P12 recovery discipline for degraded WAN scenarios.

Definitions & key terms

Federation: Multiple clusters sharing selected data flows.
Gateway stage: Process that forwards events between sites.
Demand window: Maximum events allowed in flight for a route.
Compaction policy: Rule to reduce event volume under pressure.

Mental model diagram

[Local Producers] -> [Local Stage] -> [Gateway Stage] ====WAN====> [Remote Gateway] -> [Remote Consumers]
          |                 |                 |                         |
       local SLA        local queue      demand window            remote demand

How it works (step-by-step, with invariants and failure modes)

Local producers emit events by topic.
Routing policy tags each event as local-only or replicated.
Gateway stage forwards only when remote demand exists.
Remote gateway acknowledges and releases demand.
Invariant: per-route queue remains bounded.
Failure modes: queue explosion, stale routing tables, gateway restart loops.

Minimal concrete example

topic = "alerts.eu-west"
if remote_demand(topic) > 0:
  forward(event)
else:
  compact_or_defer(event)

Common misconceptions

“Backpressure ends at the datacenter boundary.” It must be end-to-end.
“Dropping events is always wrong.” Controlled drops can protect system health.

Check-your-understanding questions

Why should WAN queues be bounded per route?
What is the role of gateway isolation in failure handling?
Which metrics best reveal hidden demand mismatch?

Check-your-understanding answers

To prevent one slow route from consuming all memory.
It keeps local services healthy during cross-site issues.
Queue depth, demand fulfillment ratio, and route latency.

Real-world applications

Multi-site telemetry pipelines
Edge analytics replication
Regional alert fan-out systems

Where you’ll apply it

Directly in Project 13
Operational tuning informed by Projects 11 and 12

References

GenStage documentation
Distributed Erlang documentation
OTP supervision principles

Key insights Federated event systems survive by enforcing local isolation and explicit remote demand.

Summary Cross-site event routing on BEAM is reliable when you treat backpressure, routing, and failover as one design problem.

Homework/Exercises to practice the concept

Define a topic policy matrix (replicate, sample, drop, local-only).
Design a queue budget per route and justify the limits.

Solutions to the homework/exercises

Assign business-critical topics strict replication; sample low-value telemetry.
Set per-route caps from memory budget and latency SLOs.

3. Project Specification

3.1 What You Will Build

A federated event bus with topic-based routing, route-level demand windows, and gateway failover behavior.

3.2 Functional Requirements

Topic Routing: Selectively replicate topics across sites.
Demand Enforcement: Bound in-flight events per route.
Gateway Failover: Continue service after primary gateway failure.

3.3 Non-Functional Requirements

Performance: Sustained throughput without unbounded queue growth.
Reliability: Local processing remains available during WAN degradation.
Usability: Operators can inspect route state and backlog clearly.

3.4 Example Usage / Output

$ busctl publish sensor.us-east temp=71
$ busctl stats

3.5 Data Formats / Schemas / Protocols

Topic metadata with replication policy
Route demand and queue status records
Gateway failover status events

3.6 Edge Cases

Remote consumer demand drops to zero
Gateway failover during burst traffic
Topic policy changes at runtime

3.7 Real World Outcome

3.7.1 How to Run (Copy/Paste)

Bring up multi-site topology from P11.
Start event producers and consumers by topic.
Run failover and demand-throttle drills.

3.7.2 Golden Path Demo (Deterministic)

A high-rate local producer is throttled safely while high-priority alerts remain replicated.

3.7.3 If CLI: exact terminal transcript

$ busctl stats
local_queue=12 wan_queue=3 dropped=0 backpressure=healthy

$ busctl failover gw-us
gateway gw-us down; traffic rerouted via gw-us-backup

4. Solution Architecture

4.1 High-Level Design

[Producer] -> [Local Router] -> [Gateway Stage] -> [Remote Gateway] -> [Consumer]

4.2 Key Components

Component	Responsibility	Key Decisions
Local Router	Topic policy application	Replicate vs local-only logic
Gateway Stage	Route-level backpressure	Queue caps and compaction strategy
Failover Controller	Gateway reroute	Promotion timing and health thresholds

4.4 Data Structures (No Full Code)

Topic policy table
Route demand windows
Queue counters and latency snapshots

4.4 Algorithm Overview

Key Algorithm: WAN Demand Gate

Read remote demand window.
Admit or defer outgoing events.
Update route metrics.
Trigger compaction/drops when queue cap is hit.

Complexity Analysis:

Time: O(E) for E events per route window
Space: O(R + T) for routes and topics

5. Implementation Guide

5.1 Development Environment Setup

# reuse multi-site cluster lab from P11

5.2 Project Structure

project-root/
├── lib/
│   ├── topic_router.ex
│   ├── gateway_stage.ex
│   ├── failover_controller.ex
│   └── metrics.ex
├── test/
└── README.md

5.3 The Core Question You’re Answering

“How do I preserve throughput and correctness when federating event streams across unreliable WAN links?”

5.4 Concepts You Must Understand First

GenStage demand semantics
Distributed routing and node failure semantics
Supervisor boundaries for gateway and failover controllers

5.5 Questions to Guide Your Design

Which topics deserve strict replication vs local processing?
What queue caps protect memory without losing critical signals?
How will you test failover while traffic is in flight?