Project 13: Federated Edge Event Bus
Build cross-site event routing with bounded demand and gateway failover.
Quick Reference
| Attribute | Value |
|---|---|
| Difficulty | Level 4 |
| Time Estimate | 24-40 hours |
| Main Programming Language | Elixir |
| Alternative Programming Languages | Erlang |
| Coolness Level | Level 5 |
| Business Potential | Level 4 |
| Prerequisites | GenStage, distributed nodes, supervision |
| Key Topics | WAN backpressure, event routing, gateway isolation |
1. Learning Objectives
By completing this project, you will:
- Route selected event streams across multiple BEAM sites.
- Enforce demand-based backpressure on WAN links.
- Isolate gateway failures from local producers.
- Measure queue growth and failover behavior under load.
2. All Theory Needed (Per-Concept Breakdown)
Distributed Backpressure Across Sites
Fundamentals Backpressure ensures producers do not overwhelm consumers. In federated systems, this must extend across network boundaries where link latency and throughput vary significantly. Gateway processes are the control points that translate local producer speed into remote demand constraints.
Deep Dive into the concept Local backpressure is already a key BEAM practice with GenStage. Federated backpressure adds routing and transport variability. A producer in one site may generate events faster than another site can safely consume. If you treat WAN links as infinite buffers, queue growth becomes silent until failure.
The architecture should separate local event ingestion from cross-site replication. Local producers publish into local stages; gateway stages selectively replicate topics based on policy and remote demand. This design gives you one place to enforce bounded queues, compaction policies, and drop strategies.
Demand contracts must be explicit per route and per topic. High-priority operational events might reserve quota while low-priority telemetry is sampled or compacted. These policies are business decisions, not just technical tuning.
Failover is another core dimension. Gateway processes are critical-path components and must be supervised independently. A healthy design routes around failed gateways while keeping local processing alive. The invariant is that local workloads should not block merely because a remote site is slow or unavailable.
Observability should cover queue depth, end-to-end latency, demand fulfillment ratio, and drop counts. Each metric must be segmented by route and topic. Aggregate metrics hide bottlenecks.
Common failure patterns include unbounded buffers, missing per-route limits, and coupling gateway availability to local producer loops. Fix these by defining strict isolation boundaries and rejection/compaction behavior before implementing.
How this fit on projects This project extends P11 topology and uses P12 recovery discipline for degraded WAN scenarios.
Definitions & key terms
- Federation: Multiple clusters sharing selected data flows.
- Gateway stage: Process that forwards events between sites.
- Demand window: Maximum events allowed in flight for a route.
- Compaction policy: Rule to reduce event volume under pressure.
Mental model diagram
[Local Producers] -> [Local Stage] -> [Gateway Stage] ====WAN====> [Remote Gateway] -> [Remote Consumers]
| | | |
local SLA local queue demand window remote demand
How it works (step-by-step, with invariants and failure modes)
- Local producers emit events by topic.
- Routing policy tags each event as local-only or replicated.
- Gateway stage forwards only when remote demand exists.
- Remote gateway acknowledges and releases demand.
- Invariant: per-route queue remains bounded.
- Failure modes: queue explosion, stale routing tables, gateway restart loops.
Minimal concrete example
topic = "alerts.eu-west"
if remote_demand(topic) > 0:
forward(event)
else:
compact_or_defer(event)
Common misconceptions
- “Backpressure ends at the datacenter boundary.” It must be end-to-end.
- “Dropping events is always wrong.” Controlled drops can protect system health.
Check-your-understanding questions
- Why should WAN queues be bounded per route?
- What is the role of gateway isolation in failure handling?
- Which metrics best reveal hidden demand mismatch?
Check-your-understanding answers
- To prevent one slow route from consuming all memory.
- It keeps local services healthy during cross-site issues.
- Queue depth, demand fulfillment ratio, and route latency.
Real-world applications
- Multi-site telemetry pipelines
- Edge analytics replication
- Regional alert fan-out systems
Where you’ll apply it
- Directly in Project 13
- Operational tuning informed by Projects 11 and 12
References
- GenStage documentation
- Distributed Erlang documentation
- OTP supervision principles
Key insights Federated event systems survive by enforcing local isolation and explicit remote demand.
Summary Cross-site event routing on BEAM is reliable when you treat backpressure, routing, and failover as one design problem.
Homework/Exercises to practice the concept
- Define a topic policy matrix (replicate, sample, drop, local-only).
- Design a queue budget per route and justify the limits.
Solutions to the homework/exercises
- Assign business-critical topics strict replication; sample low-value telemetry.
- Set per-route caps from memory budget and latency SLOs.
3. Project Specification
3.1 What You Will Build
A federated event bus with topic-based routing, route-level demand windows, and gateway failover behavior.
3.2 Functional Requirements
- Topic Routing: Selectively replicate topics across sites.
- Demand Enforcement: Bound in-flight events per route.
- Gateway Failover: Continue service after primary gateway failure.
3.3 Non-Functional Requirements
- Performance: Sustained throughput without unbounded queue growth.
- Reliability: Local processing remains available during WAN degradation.
- Usability: Operators can inspect route state and backlog clearly.
3.4 Example Usage / Output
$ busctl publish sensor.us-east temp=71
$ busctl stats
3.5 Data Formats / Schemas / Protocols
- Topic metadata with replication policy
- Route demand and queue status records
- Gateway failover status events
3.6 Edge Cases
- Remote consumer demand drops to zero
- Gateway failover during burst traffic
- Topic policy changes at runtime
3.7 Real World Outcome
3.7.1 How to Run (Copy/Paste)
- Bring up multi-site topology from P11.
- Start event producers and consumers by topic.
- Run failover and demand-throttle drills.
3.7.2 Golden Path Demo (Deterministic)
A high-rate local producer is throttled safely while high-priority alerts remain replicated.
3.7.3 If CLI: exact terminal transcript
$ busctl stats
local_queue=12 wan_queue=3 dropped=0 backpressure=healthy
$ busctl failover gw-us
gateway gw-us down; traffic rerouted via gw-us-backup
4. Solution Architecture
4.1 High-Level Design
[Producer] -> [Local Router] -> [Gateway Stage] -> [Remote Gateway] -> [Consumer]
4.2 Key Components
| Component | Responsibility | Key Decisions |
|---|---|---|
| Local Router | Topic policy application | Replicate vs local-only logic |
| Gateway Stage | Route-level backpressure | Queue caps and compaction strategy |
| Failover Controller | Gateway reroute | Promotion timing and health thresholds |
4.4 Data Structures (No Full Code)
- Topic policy table
- Route demand windows
- Queue counters and latency snapshots
4.4 Algorithm Overview
Key Algorithm: WAN Demand Gate
- Read remote demand window.
- Admit or defer outgoing events.
- Update route metrics.
- Trigger compaction/drops when queue cap is hit.
Complexity Analysis:
- Time: O(E) for E events per route window
- Space: O(R + T) for routes and topics
5. Implementation Guide
5.1 Development Environment Setup
# reuse multi-site cluster lab from P11
5.2 Project Structure
project-root/
├── lib/
│ ├── topic_router.ex
│ ├── gateway_stage.ex
│ ├── failover_controller.ex
│ └── metrics.ex
├── test/
└── README.md
5.3 The Core Question You’re Answering
“How do I preserve throughput and correctness when federating event streams across unreliable WAN links?”
5.4 Concepts You Must Understand First
- GenStage demand semantics
- Distributed routing and node failure semantics
- Supervisor boundaries for gateway and failover controllers
5.5 Questions to Guide Your Design
- Which topics deserve strict replication vs local processing?
- What queue caps protect memory without losing critical signals?
- How will you test failover while traffic is in flight?