LEARN CONSUL DEEP DIVE
Learn HashiCorp Consul: From Zero to Distributed Systems Master
Goal: Deeply understand Consul—not just how to use it, but how it works behind the scenes. Master the Raft consensus algorithm, SWIM gossip protocol, service mesh internals, and build your own implementations to truly understand distributed systems.
Why Consul Matters
Consul is more than a tool—it’s a masterclass in distributed systems engineering. It elegantly combines:
- Raft consensus for strong consistency
- SWIM gossip for scalable membership
- Service mesh for zero-trust networking
- DNS interface for universal compatibility
Understanding Consul’s internals means understanding:
- How distributed databases achieve consensus
- How large clusters detect failures without overwhelming the network
- How service mesh actually provides mTLS
- How DNS can be used for service discovery
After completing these projects, you will:
- Implement your own Raft consensus library
- Build a SWIM gossip protocol from scratch
- Create a working service discovery system
- Understand every layer from network to application
- Know exactly what happens when you run consul agent
Core Architecture Overview
┌─────────────────────────────────────────────────────────────────┐
│ CONSUL CLUSTER │
├─────────────────────────────────────────────────────────────────┤
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ SERVER AGENTS │ │
│ │ ┌─────────┐ ┌─────────┐ ┌─────────┐ │ │
│ │ │ Leader │◄─►│Follower │◄─►│Follower │ RAFT QUORUM │ │
│ │ │ Server │ │ Server │ │ Server │ (Consensus) │ │
│ │ └────┬────┘ └────┬────┘ └────┬────┘ │ │
│ │ │ │ │ │ │
│ │ └─────────────┴─────────────┘ │ │
│ │ │ │ │
│ │ ┌────────▼────────┐ │ │
│ │ │ State Store │ (MemDB + BoltDB) │ │
│ │ │ - Services │ │ │
│ │ │ - Nodes │ │ │
│ │ │ - KV Store │ │ │
│ │ │ - Intentions │ │ │
│ │ └─────────────────┘ │ │
│ └─────────────────────────────────────────────────────────┘ │
│ ▲ │
│ │ LAN Gossip (SWIM/Serf) │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ CLIENT AGENTS │ │
│ │ ┌────────┐ ┌────────┐ ┌────────┐ ┌────────┐ │ │
│ │ │ Client │ │ Client │ │ Client │ │ Client │ │ │
│ │ │ + App │ │ + App │ │ + App │ │ + App │ │ │
│ │ └────────┘ └────────┘ └────────┘ └────────┘ │ │
│ └─────────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────┘
WAN Gossip connects multiple datacenters (server-to-server only)
Fundamental Concepts
- Agents
- Server Agent: Participates in Raft consensus, stores state, serves queries
- Client Agent: Forwards requests to servers, runs local health checks, participates in gossip
- Raft Consensus (for Strong Consistency)
- Leader election with randomized timeouts
- Log replication to followers
- Quorum-based commits ((N/2)+1)
- Applied to finite state machine (MemDB)
- Gossip Protocol (for Membership & Failure Detection)
- Based on SWIM (Scalable Weakly-consistent Infection-style process group Membership)
- LAN pool: All agents in a datacenter
- WAN pool: Servers across datacenters
- Failure detection without central coordinator
- Service Mesh (Connect)
- Sidecar proxies (Envoy by default)
- mTLS for all service-to-service communication
- Intentions for authorization (allow/deny)
- Built-in or external Certificate Authority
- Service Discovery
- Service registration with health checks
- DNS interface (*.consul domain)
- HTTP API for programmatic access
- Prepared queries for advanced routing
- Key/Value Store
- Hierarchical key-value storage
- Watch for changes
- Distributed configuration management
- Locks and sessions for coordination
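Before diving into the projects, it helps to make the (N/2)+1 quorum rule from the Raft bullet concrete. A tiny standalone Go snippet (illustrative only, not part of any project) that prints quorum size and failure tolerance for common server counts:
package main

import "fmt"

func main() {
    // Raft commits require a majority: quorum = (N/2)+1 with integer division.
    // Fault tolerance is how many servers can fail while a quorum survives.
    for _, n := range []int{1, 3, 5, 7} {
        quorum := n/2 + 1
        fmt.Printf("servers=%d quorum=%d tolerates=%d failure(s)\n", n, quorum, n-quorum)
    }
}
Running it shows why 3- and 5-server clusters are the usual recommendation: they tolerate 1 and 2 failures respectively, while a single server tolerates none.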
Project List
Projects are ordered from foundational concepts to advanced implementations.
Project 1: Simple Key-Value Store with Replication
- File: LEARN_CONSUL_DEEP_DIVE.md
- Main Programming Language: Go
- Alternative Programming Languages: Rust, Python, Java
- Coolness Level: Level 3: Genuinely Clever
- Business Potential: 1. The “Resume Gold”
- Difficulty: Level 2: Intermediate
- Knowledge Area: Distributed Systems / Data Storage
- Software or Tool: Key-Value Store (like Consul’s KV)
- Main Book: “Designing Data-Intensive Applications” by Martin Kleppmann
What you’ll build: A simple in-memory key-value store with leader-based replication. One node is the leader, accepts writes, and replicates to followers. This is the simplest form of replication—before adding consensus.
Why it teaches Consul: Consul’s KV store is built on replicated state. Before understanding Raft, you need to understand why replication is hard. This project exposes the problems: What happens when the leader dies? How do you know writes succeeded?
Core challenges you’ll face:
- Leader forwarding → maps to how Consul clients forward to servers
- Replication lag → maps to eventual vs strong consistency
- Split brain → maps to why we need consensus
- Network partitions → maps to CAP theorem tradeoffs
Key Concepts:
- Leader-Follower Replication: “Designing Data-Intensive Applications” Chapter 5 - Martin Kleppmann
- Network Partitions: Jepsen: Call Me Maybe
- CAP Theorem: CAP Twelve Years Later
Difficulty: Intermediate Time estimate: 1 week Prerequisites: Basic networking (TCP), Go fundamentals, understanding of client-server architecture
Real world outcome:
# Start leader
$ ./kvstore --role=leader --port=8000
# Start followers
$ ./kvstore --role=follower --leader=localhost:8000 --port=8001
$ ./kvstore --role=follower --leader=localhost:8000 --port=8002
# Write to leader
$ curl -X PUT localhost:8000/kv/name -d 'Douglas'
{"success": true, "replicated_to": 2}
# Read from any node (including followers)
$ curl localhost:8001/kv/name
{"key": "name", "value": "Douglas"}
# Kill leader, observe what breaks
$ kill %1
$ curl localhost:8001/kv/name # Works (read)
$ curl -X PUT localhost:8001/kv/age -d '30' # Fails! No leader!
Implementation Hints:
Structure:
type KVStore struct {
data map[string]string
role string // "leader" or "follower"
leader string // leader address
followers []string
mu sync.RWMutex
}
Leader write flow:
- Accept write request
- Apply locally
- Send to all followers (sync or async)
- Return success (how many replicas confirm?)
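One possible shape for that write path, as a minimal sketch that waits synchronously for every follower (the strictest and slowest of the options below). The /kv/ route and the replicateTo helper are assumptions for illustration, not a required API; it builds on the KVStore struct above and needs fmt, io, net/http, and strings imported.
// handlePut applies a write locally, then replicates to every follower before answering.
func (s *KVStore) handlePut(w http.ResponseWriter, r *http.Request) {
    key := strings.TrimPrefix(r.URL.Path, "/kv/")
    value, _ := io.ReadAll(r.Body)

    if s.role != "leader" {
        // Followers refuse writes and point the client at the leader.
        http.Error(w, "not the leader; send writes to "+s.leader, http.StatusBadRequest)
        return
    }

    s.mu.Lock()
    s.data[key] = string(value)
    s.mu.Unlock()

    acked := 0
    for _, f := range s.followers {
        // replicateTo is a hypothetical helper that POSTs the entry to one follower.
        if err := s.replicateTo(f, key, string(value)); err == nil {
            acked++
        }
    }
    fmt.Fprintf(w, `{"success": true, "replicated_to": %d}`, acked)
}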
Questions to consider:
- Do you wait for all followers to confirm? (Strong consistency, but slow)
- Do you return immediately after local write? (Fast, but may lose data)
- What if a follower is down during replication?
- How does a follower become the new leader?
The goal is to feel the pain of distributed systems before learning the solutions.
Learning milestones:
- Basic replication works → You understand leader-follower architecture
- You lose data when leader dies → You understand why consensus matters
- Followers have stale data → You understand replication lag
- Split brain causes conflicts → You understand the consensus problem
Project 2: Implement the Raft Consensus Algorithm
- File: LEARN_CONSUL_DEEP_DIVE.md
- Main Programming Language: Go
- Alternative Programming Languages: Rust, C++, Java
- Coolness Level: Level 5: Pure Magic
- Business Potential: 1. The “Resume Gold”
- Difficulty: Level 4: Expert
- Knowledge Area: Distributed Systems / Consensus
- Software or Tool: Raft implementation (like hashicorp/raft)
- Main Book: “In Search of an Understandable Consensus Algorithm” (Raft paper) by Ongaro & Ousterhout
What you’ll build: A complete Raft implementation with leader election, log replication, and persistence. This is the same algorithm Consul uses for its server consensus.
Why it teaches Consul: Raft is the foundation of Consul’s consistency. Every write to Consul (service registration, KV updates) goes through Raft. Understanding Raft means understanding Consul’s guarantees.
Core challenges you’ll face:
- Leader election with randomized timeouts → maps to election timer implementation
- Log replication and commitment → maps to achieving consensus
- Persistence and recovery → maps to durable state
- Membership changes → maps to adding/removing servers
Resources for key challenges:
- The Raft Paper - The original paper, surprisingly readable
- Raft Visualization - Interactive demo of leader election
- hashicorp/raft source - Production Go implementation
Key Concepts:
- Leader Election: Raft Paper Section 5.2
- Log Replication: Raft Paper Section 5.3
- Safety: Raft Paper Section 5.4
- Persistence: Raft Paper Section 5.6
- Consul’s Use of Raft: HashiCorp Consul Consensus Protocol
Difficulty: Expert Time estimate: 3-4 weeks Prerequisites: Project 1 completed, strong understanding of concurrency, Go experience
Real world outcome:
# Start a 3-node Raft cluster
$ ./raftnode --id=1 --peers=localhost:8001,localhost:8002,localhost:8003 &
$ ./raftnode --id=2 --peers=localhost:8001,localhost:8002,localhost:8003 &
$ ./raftnode --id=3 --peers=localhost:8001,localhost:8002,localhost:8003 &
# Check cluster state
$ ./raftctl status
Node 1: LEADER (term 5, commit index 42)
Node 2: FOLLOWER (term 5, commit index 42)
Node 3: FOLLOWER (term 5, commit index 42)
# Submit a command (goes through leader)
$ ./raftctl apply "SET x = 10"
Applied at index 43
# Kill the leader, watch election happen
$ kill %1
# Wait 150-300ms (election timeout)
$ ./raftctl status
Node 2: LEADER (term 6, commit index 43)
Node 3: FOLLOWER (term 6, commit index 43)
# Bring node 1 back
$ ./raftnode --id=1 --peers=... &
$ ./raftctl status
Node 1: FOLLOWER (term 6, commit index 43) # Automatically catches up!
Node 2: LEADER (term 6, commit index 43)
Node 3: FOLLOWER (term 6, commit index 43)
Implementation Hints:
Core state (per Raft paper Figure 2):
type RaftNode struct {
// Persistent state (on all servers)
currentTerm int
votedFor *int
log []LogEntry
// Volatile state (on all servers)
commitIndex int
lastApplied int
// Volatile state (on leader)
nextIndex map[int]int // for each follower
matchIndex map[int]int // for each follower
// Additional
state State // FOLLOWER, CANDIDATE, LEADER
electionTimer *time.Timer
}
Leader election flow:
- Start as FOLLOWER with random election timeout (150-300ms)
- If timeout expires without heartbeat → become CANDIDATE
- Increment term, vote for self, request votes from peers
- If majority votes → become LEADER
- If receive AppendEntries from valid leader → return to FOLLOWER
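A hedged sketch of steps 2-4, assuming the RaftNode above also carries an integer id, a peers list, and a mutex; requestVoteFromPeer, becomeLeader, and resetElectionTimer are assumed helpers wrapping the RequestVote RPC and timer plumbing:
// startElection runs when the election timer fires without hearing a heartbeat.
func (n *RaftNode) startElection() {
    n.mu.Lock()
    n.state = CANDIDATE
    n.currentTerm++
    term := n.currentTerm
    me := n.id
    n.votedFor = &me       // vote for ourselves
    n.resetElectionTimer() // new random timeout in the 150-300ms range
    n.mu.Unlock()

    votes := 1 // our own vote
    for _, peer := range n.peers {
        go func(peer string) {
            granted := n.requestVoteFromPeer(peer, term)
            n.mu.Lock()
            defer n.mu.Unlock()
            // Ignore stale replies from an older term or after we changed state.
            if granted && n.state == CANDIDATE && n.currentTerm == term {
                votes++
                if votes > (len(n.peers)+1)/2 { // majority of the whole cluster
                    n.becomeLeader()
                }
            }
        }(peer)
    }
}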
Log replication flow:
- Client sends command to leader
- Leader appends to local log
- Leader sends AppendEntries RPC to all followers
- When majority acknowledge → mark as committed
- Apply committed entries to state machine
- Respond to client
Key implementation details:
- Election timeout randomization prevents split votes
- Term numbers detect stale leaders
- Log matching property ensures consistency
- Commit index only advances when leader’s term entries are replicated
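A hedged sketch of that last rule, using the struct fields above; it assumes LogEntry carries a Term and that log indexes equal slice indexes (no snapshotting yet):
// advanceCommitIndex runs on the leader whenever matchIndex changes.
func (n *RaftNode) advanceCommitIndex() {
    for idx := len(n.log) - 1; idx > n.commitIndex; idx-- {
        // Safety rule (Raft paper 5.4.2): only count replicas for entries from the current term.
        if n.log[idx].Term != n.currentTerm {
            continue
        }
        count := 1 // the leader always has the entry
        for _, match := range n.matchIndex {
            if match >= idx {
                count++
            }
        }
        if count > (len(n.matchIndex)+1)/2 { // replicated on a majority of the cluster
            n.commitIndex = idx
            return
        }
    }
}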
Use the Raft TLA+ spec or the paper’s pseudocode as your guide. Test with Jepsen if you want to be thorough!
Learning milestones:
- Leader election works → You understand randomized timeouts and voting
- Log replication works → You understand quorum-based commit (replicate, then commit once a majority acknowledges)
- Survives leader failure → You understand the safety properties
- Passes Raft paper’s Figure 8 scenario → You’ve implemented it correctly
Project 3: Implement the SWIM Gossip Protocol
- File: LEARN_CONSUL_DEEP_DIVE.md
- Main Programming Language: Go
- Alternative Programming Languages: Rust, C++, Python
- Coolness Level: Level 4: Hardcore Tech Flex
- Business Potential: 1. The “Resume Gold”
- Difficulty: Level 3: Advanced
- Knowledge Area: Distributed Systems / Membership
- Software or Tool: Gossip protocol (like Serf/memberlist)
- Main Book: “SWIM: Scalable Weakly-consistent Infection-style Process Group Membership Protocol”
What you’ll build: A SWIM-based gossip protocol that provides membership management and failure detection for a cluster. This is the protocol Consul uses (via Serf/memberlist) for its LAN and WAN gossip pools.
Why it teaches Consul: Consul uses gossip for everything that doesn’t need strong consistency: knowing which nodes are alive, detecting failures, broadcasting events. Raft is expensive; gossip is cheap and scalable.
Core challenges you’ll face:
- Failure detection with ping/ping-req → maps to indirect probing
- Infection-style dissemination → maps to piggybacking updates
- Suspicion mechanism → maps to reducing false positives
- Scalability (O(n) messages vs O(n²)) → maps to why SWIM beats heartbeats
Resources for key challenges:
- SWIM Paper - Original Cornell paper
- Serf Internals - How HashiCorp implements it
- Lifeguard Extensions - HashiCorp’s improvements
Key Concepts:
- SWIM Protocol: SWIM Paper Sections 3-4
- Failure Detection: SWIM Paper Section 3
- Dissemination: SWIM Paper Section 4
- Serf Implementation: HashiCorp memberlist source
- Consul Gossip: HashiCorp Consul Gossip Protocol
Difficulty: Advanced Time estimate: 2-3 weeks Prerequisites: Project 2 helpful but not required, understanding of UDP networking
Real world outcome:
# Start a gossip cluster
$ ./swimnode --name=node1 --port=7001 &
$ ./swimnode --name=node2 --port=7002 --join=localhost:7001 &
$ ./swimnode --name=node3 --port=7003 --join=localhost:7001 &
# List members (from any node)
$ ./swimctl members --node=localhost:7001
node1 alive 192.168.1.10:7001
node2 alive 192.168.1.10:7002
node3 alive 192.168.1.10:7003
# Kill a node, watch failure detection
$ kill %2 # Kill node2
# Wait for failure detection (~1-2 seconds)
$ ./swimctl members --node=localhost:7001
node1 alive 192.168.1.10:7001
node2 dead 192.168.1.10:7002 (detected: 1.2s ago)
node3 alive 192.168.1.10:7003
# Broadcast an event (propagates via gossip)
$ ./swimctl event "deploy:v1.2.3" --node=localhost:7001
Event broadcast to 3 nodes
# All nodes receive the event within milliseconds
$ ./swimctl events --node=localhost:7003
[2024-01-15 10:23:45] deploy:v1.2.3
Implementation Hints:
SWIM failure detection (per protocol round):
func (n *Node) protocolRound() {
    // 1. Pick a random node to probe
    target := n.randomMember()
    // 2. Send a direct ping and wait for the ack (probeTimeout is a config field)
    if n.ping(target, n.probeTimeout) {
        return // Node is alive
    }
    // 3. Ping failed - fall back to indirect probing:
    //    ask k random peers (another config knob) to ping the target for us
    for _, prober := range n.randomMembers(n.indirectProbes) {
        n.pingReq(prober, target) // "Please ping target for me"
    }
    // 4. Wait for any indirect ack
    if n.waitForIndirectAck(target, n.probeTimeout) {
        return // Node is alive (reached via an indirect probe)
    }
    // 5. Mark as suspect (not immediately dead!)
    n.suspect(target)
}
Infection-style dissemination:
- Piggyback membership updates on ping/ack messages
- Each update has a “lamport timestamp” or incarnation number
- Updates spread exponentially: each node tells a few others
- Eventually consistent: all nodes learn about join/leave/fail
Suspicion mechanism:
- Don’t immediately mark failed nodes as dead
- Broadcast “suspect” message, give node time to refute
- If node is actually alive, it sends “alive” with higher incarnation
- Only mark dead after suspicion timeout
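A hedged sketch of that refutation dance; Member, its State and Incarnation fields, the aliveMsg/deadMsg types, and broadcast are assumptions layered on the node code above, and locking is elided for brevity:
// handleSuspect processes a gossiped "suspect" rumor about a member.
func (n *Node) handleSuspect(name string, incarnation uint64) {
    if name == n.name {
        // The rumor is about us: refute it with a higher incarnation number.
        n.incarnation++
        n.broadcast(aliveMsg{Name: n.name, Incarnation: n.incarnation})
        return
    }
    m := n.members[name]
    if m == nil || incarnation < m.Incarnation {
        return // stale rumor, ignore
    }
    m.State = StateSuspect
    m.Incarnation = incarnation
    // If nobody refutes within the suspicion timeout, declare the member dead.
    time.AfterFunc(n.suspicionTimeout, func() {
        if m.State == StateSuspect && m.Incarnation == incarnation {
            m.State = StateDead
            n.broadcast(deadMsg{Name: name, Incarnation: incarnation})
        }
    })
}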
Key insight: SWIM achieves O(n) message complexity vs O(n²) for heartbeats!
Learning milestones:
- Nodes can join and discover each other → You understand gossip join
- Failure detection works → You understand ping/ping-req/suspect
- Events propagate to all nodes → You understand dissemination
- False positives are minimal → You understand suspicion and incarnation
Project 4: Build a Service Registry
- File: LEARN_CONSUL_DEEP_DIVE.md
- Main Programming Language: Go
- Alternative Programming Languages: Rust, Java, Python
- Coolness Level: Level 3: Genuinely Clever
- Business Potential: 3. The “Service & Support” Model
- Difficulty: Level 2: Intermediate
- Knowledge Area: Microservices / Service Discovery
- Software or Tool: Service Registry (like Consul catalog)
- Main Book: “Building Microservices, 2nd Edition” by Sam Newman
What you’ll build: A service registry where services register themselves, health checks run, and clients can discover healthy service instances. Combine your Raft (Project 2) and Gossip (Project 3) work, or use simpler in-memory storage.
Why it teaches Consul: This is Consul’s core use case. Service discovery seems simple until you handle: health checks, multiple instances, datacenter awareness, and the difference between “registered” and “healthy.”
Core challenges you’ll face:
- Service registration API → maps to Consul’s agent/service/register
- Health check execution → maps to TCP/HTTP/script checks
- Critical vs warning states → maps to Consul’s check states
- Service queries with filtering → maps to Consul’s catalog/service endpoint
Key Concepts:
- Service Discovery Patterns: “Building Microservices” Chapter 5 - Sam Newman
- Health Checking: Consul Health Checks
- Service Registration: Consul Service Definition
Difficulty: Intermediate Time estimate: 1-2 weeks Prerequisites: Basic HTTP API development, understanding of health checks
Real world outcome:
# Start registry server
$ ./registry --port=8500
# Register a service (via API)
$ curl -X PUT localhost:8500/v1/agent/service/register -d '{
"name": "web",
"id": "web-1",
"port": 8080,
"check": {
"http": "http://localhost:8080/health",
"interval": "10s"
}
}'
{"success": true}
# Register another instance
$ curl -X PUT localhost:8500/v1/agent/service/register -d '{
"name": "web",
"id": "web-2",
"port": 8081,
"check": {
"http": "http://localhost:8081/health",
"interval": "10s"
}
}'
# Query healthy instances
$ curl localhost:8500/v1/catalog/service/web?passing=true
[
{"id": "web-1", "address": "127.0.0.1", "port": 8080, "status": "passing"},
{"id": "web-2", "address": "127.0.0.1", "port": 8081, "status": "passing"}
]
# Stop one instance, health check fails
$ # (after 10s + grace period)
$ curl localhost:8500/v1/catalog/service/web?passing=true
[
{"id": "web-1", "address": "127.0.0.1", "port": 8080, "status": "passing"}
]
# web-2 is still registered, but filtered out due to failing health check
Implementation Hints:
Service and check structures:
type Service struct {
ID string
Name string
Address string
Port int
Tags []string
Meta map[string]string
}
type Check struct {
ID string
ServiceID string
Type string // "http", "tcp", "script"
Target string // URL, address, or command
Interval time.Duration
Timeout time.Duration
Status string // "passing", "warning", "critical"
LastCheck time.Time
Output string
}
Health check executor:
func (r *Registry) runHealthChecks() {
for _, check := range r.checks {
go func(c *Check) {
for {
result := r.executeCheck(c)
r.updateCheckStatus(c.ID, result)
time.Sleep(c.Interval)
}
}(check)
}
}
func (r *Registry) executeCheck(c *Check) CheckResult {
    switch c.Type {
    case "http":
        resp, err := http.Get(c.Target)
        if err != nil {
            return CheckResult{Status: "critical", Output: err.Error()}
        }
        defer resp.Body.Close()
        if resp.StatusCode >= 200 && resp.StatusCode < 300 {
            return CheckResult{Status: "passing", Output: resp.Status}
        }
        return CheckResult{Status: "critical", Output: resp.Status}
    case "tcp":
        // Try to dial; a successful connection means passing
        conn, err := net.DialTimeout("tcp", c.Target, c.Timeout)
        if err != nil {
            return CheckResult{Status: "critical", Output: err.Error()}
        }
        conn.Close()
        return CheckResult{Status: "passing", Output: "TCP connect OK"}
    case "script":
        // Execute the command; exit code 0 = passing, 1 = warning, anything else = critical
    }
    return CheckResult{Status: "critical", Output: "unknown or unimplemented check type"}
}
Key considerations:
- Separate “registered” from “healthy” in queries
- Support tags for filtering (web vs web-production)
- Handle deregistration (explicit and due to agent leaving)
- Think about anti-entropy (what if check state is lost?)
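A minimal sketch of the first consideration: a catalog endpoint that lists every registered instance but hides unhealthy ones when ?passing=true is set. The mu, services map, and statusFor helper are assumed Registry internals, not a fixed API:
// handleCatalogService serves GET /v1/catalog/service/{name}?passing=true
func (r *Registry) handleCatalogService(w http.ResponseWriter, req *http.Request) {
    name := path.Base(req.URL.Path)
    passingOnly := req.URL.Query().Get("passing") == "true"

    var out []*Service
    r.mu.RLock()
    for _, svc := range r.services {
        if svc.Name != name {
            continue
        }
        // statusFor folds the service's checks into passing/warning/critical.
        if passingOnly && r.statusFor(svc.ID) != "passing" {
            continue // registered but unhealthy: hidden from "passing" queries
        }
        out = append(out, svc)
    }
    r.mu.RUnlock()
    json.NewEncoder(w).Encode(out)
}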
Learning milestones:
- Services register and list → You understand the catalog concept
- Health checks run and update status → You understand active health checking
- Queries filter by health → You understand the passing/warning/critical model
- Multiple instances of same service work → You understand service vs instance
Project 5: DNS-Based Service Discovery
- File: LEARN_CONSUL_DEEP_DIVE.md
- Main Programming Language: Go
- Alternative Programming Languages: Rust, C, Python
- Coolness Level: Level 4: Hardcore Tech Flex
- Business Potential: 2. The “Micro-SaaS / Pro Tool”
- Difficulty: Level 3: Advanced
- Knowledge Area: Networking / DNS
- Software or Tool: DNS Server (like Consul’s DNS interface)
- Main Book: “DNS and BIND, 5th Edition” by Cricket Liu & Paul Albitz
What you’ll build: A DNS server that integrates with your service registry (Project 4). Queries like web.service.consul return A records for healthy instances. This is how Consul provides universal service discovery without client libraries.
Why it teaches Consul: Consul’s DNS interface is brilliant—every language/tool knows how to do DNS lookups. No SDK required. Understanding this means understanding why DNS is perfect for service discovery (and its limitations).
Core challenges you’ll face:
- DNS protocol parsing → maps to UDP/TCP message format
- A, AAAA, SRV record generation → maps to different record types
- TTL and caching → maps to trade-off between freshness and load
- Recursive vs authoritative → maps to forwarding non-.consul queries
Key Concepts:
- DNS Protocol: “DNS and BIND” Chapter 2 - Cricket Liu
- Consul DNS Interface: Consul DNS
- SRV Records: RFC 2782 - for port discovery
- DNS Message Format: RFC 1035 Sections 4-5
Difficulty: Advanced Time estimate: 2 weeks Prerequisites: Project 4 completed, understanding of DNS basics
Real world outcome:
# Start DNS server (integrated with registry)
$ ./registry-dns --port=8600
# Query service via DNS
$ dig @localhost -p 8600 web.service.consul
;; ANSWER SECTION:
web.service.consul. 30 IN A 192.168.1.10
web.service.consul. 30 IN A 192.168.1.11
# SRV records include port information
$ dig @localhost -p 8600 web.service.consul SRV
;; ANSWER SECTION:
web.service.consul. 30 IN SRV 1 1 8080 web-1.node.consul.
web.service.consul. 30 IN SRV 1 1 8081 web-2.node.consul.
;; ADDITIONAL SECTION:
web-1.node.consul. 30 IN A 192.168.1.10
web-2.node.consul. 30 IN A 192.168.1.11
# Query with tag filter
$ dig @localhost -p 8600 production.web.service.consul
# Only returns instances tagged "production"
# Use it from any application!
$ curl http://web.service.consul:8080/api
# (assuming DNS is configured to use our server)
Implementation Hints:
DNS message structure:
type DNSMessage struct {
Header DNSHeader
Questions []DNSQuestion
Answers []DNSRecord
Authority []DNSRecord
Additional []DNSRecord
}
type DNSHeader struct {
ID uint16
Flags uint16 // QR, Opcode, AA, TC, RD, RA, RCODE
QDCount uint16 // Questions
ANCount uint16 // Answers
NSCount uint16 // Authority
ARCount uint16 // Additional
}
Parsing .consul queries:
func (s *DNSServer) parseConsulQuery(name string) (queryType, service, tag, datacenter string) {
// web.service.consul -> service=web
// production.web.service.consul -> tag=production, service=web
// web.service.dc1.consul -> service=web, dc=dc1
parts := strings.Split(strings.TrimSuffix(name, ".consul."), ".")
// Parse based on structure...
}
Generating A records:
func (s *DNSServer) handleServiceQuery(service, tag string) []DNSRecord {
instances := s.registry.GetHealthyInstances(service, tag)
var records []DNSRecord
for _, inst := range instances {
records = append(records, DNSRecord{
Name: service + ".service.consul.",
Type: 1, // A record
Class: 1, // IN
TTL: 30, // Short TTL for dynamic services
Data: inst.Address,
})
}
return records
}
Key considerations:
- Use UDP for queries under 512 bytes, TCP for larger
- Set short TTLs (30s) for service records—they change frequently
- Support SRV records for port information
- Implement forwarding for non-.consul queries
- Handle negative caching (NXDOMAIN)
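A hedged sketch of the SRV side, reusing the DNSRecord shape above with Data kept as a string; a real server still has to pack these fields into RFC 2782 wire format:
// SRV answers carry priority, weight, port, and a target node name;
// the matching A records go into the Additional section.
func (s *DNSServer) handleSRVQuery(service, tag string) (answers, additional []DNSRecord) {
    for _, inst := range s.registry.GetHealthyInstances(service, tag) {
        nodeName := inst.ID + ".node.consul."
        answers = append(answers, DNSRecord{
            Name:  service + ".service.consul.",
            Type:  33, // SRV
            Class: 1,  // IN
            TTL:   30,
            Data:  fmt.Sprintf("1 1 %d %s", inst.Port, nodeName), // priority weight port target
        })
        additional = append(additional, DNSRecord{
            Name: nodeName, Type: 1, Class: 1, TTL: 30, Data: inst.Address,
        })
    }
    return answers, additional
}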
Learning milestones:
- Basic A record queries work → You understand DNS protocol
- SRV records include ports → You understand service discovery via DNS
- Tag filtering works → You understand Consul’s naming scheme
- Standard tools (dig, nslookup) work → Your implementation is correct
Project 6: Distributed Key-Value Store with Watches
- File: LEARN_CONSUL_DEEP_DIVE.md
- Main Programming Language: Go
- Alternative Programming Languages: Rust, Java, Python
- Coolness Level: Level 3: Genuinely Clever
- Business Potential: 3. The “Service & Support” Model
- Difficulty: Level 3: Advanced
- Knowledge Area: Distributed Systems / Configuration
- Software or Tool: Key-Value Store (like Consul KV)
- Main Book: “Designing Data-Intensive Applications” by Martin Kleppmann
What you’ll build: A distributed key-value store backed by your Raft implementation (Project 2), with the ability to watch for changes. Clients can block on a key and get notified when it changes—like Consul’s blocking queries.
Why it teaches Consul: Consul’s KV store with watches enables distributed configuration. Services watch their config keys and reconfigure automatically. The blocking query pattern is essential to understand.
Core challenges you’ll face:
- Integrating KV operations with Raft → maps to state machine application
- Blocking queries (long-polling) → maps to Consul’s ?index= parameter
- CAS (Check-And-Set) operations → maps to optimistic concurrency
- Hierarchical keys and prefix queries → maps to recurse parameter
Key Concepts:
- Consul KV Store: Consul KV API
- Blocking Queries: Consul Blocking Queries
- Check-And-Set: Optimistic locking pattern
- State Machine Replication: Raft paper Section 7
Difficulty: Advanced Time estimate: 2 weeks Prerequisites: Project 2 (Raft) completed
Real world outcome:
# Write a key
$ curl -X PUT localhost:8500/v1/kv/config/database/host -d 'db.example.com'
true
# Read a key
$ curl localhost:8500/v1/kv/config/database/host
[{"Key": "config/database/host", "Value": "ZGIuZXhhbXBsZS5jb20=", "ModifyIndex": 42}]
# (Value is base64 encoded)
# Watch for changes (blocking query)
$ curl "localhost:8500/v1/kv/config/database/host?index=42&wait=5m"
# This blocks until the key changes or 5 minutes pass...
# In another terminal, update the key
$ curl -X PUT localhost:8500/v1/kv/config/database/host -d 'newdb.example.com'
true
# The watch returns immediately with new value!
[{"Key": "config/database/host", "Value": "bmV3ZGIuZXhhbXBsZS5jb20=", "ModifyIndex": 43}]
# CAS operation (only update if at expected index)
$ curl -X PUT "localhost:8500/v1/kv/config/database/host?cas=43" -d 'finaldb.example.com'
true
$ curl -X PUT "localhost:8500/v1/kv/config/database/host?cas=43" -d 'wontwork.example.com'
false # Failed because ModifyIndex is now 44
# Recursive list
$ curl localhost:8500/v1/kv/config?recurse
[
{"Key": "config/database/host", ...},
{"Key": "config/database/port", ...},
{"Key": "config/cache/host", ...}
]
Implementation Hints:
KV entry with metadata:
type KVEntry struct {
Key string
Value []byte
Flags uint64
CreateIndex uint64 // Raft index when created
ModifyIndex uint64 // Raft index when last modified
LockIndex uint64 // For distributed locks
Session string // Associated session
}
Applying KV operations to Raft:
type KVCommand struct {
Op string // "set", "delete", "cas"
Key string
Value []byte
CAS uint64 // For check-and-set
}
func (kv *KVStore) Apply(log *raft.Log) interface{} {
var cmd KVCommand
json.Unmarshal(log.Data, &cmd)
switch cmd.Op {
case "set":
entry := &KVEntry{
Key: cmd.Key,
Value: cmd.Value,
ModifyIndex: log.Index,
}
if existing := kv.store[cmd.Key]; existing == nil {
entry.CreateIndex = log.Index
} else {
entry.CreateIndex = existing.CreateIndex
}
kv.store[cmd.Key] = entry
kv.notifyWatchers(cmd.Key) // Wake up blocking queries
return true
case "cas":
existing := kv.store[cmd.Key]
if existing == nil || existing.ModifyIndex != cmd.CAS {
return false // CAS failed
}
// Apply the update exactly like the "set" case, keeping the old CreateIndex...
}
return nil
}
Blocking query implementation:
func (kv *KVStore) WatchKey(key string, index uint64, timeout time.Duration) *KVEntry {
// If current ModifyIndex > requested index, return immediately
if entry := kv.store[key]; entry != nil && entry.ModifyIndex > index {
return entry
}
// Otherwise, wait for notification or timeout
ch := kv.addWatcher(key)
defer kv.removeWatcher(key, ch)
select {
case <-ch:
return kv.store[key]
case <-time.After(timeout):
return kv.store[key] // Return current value on timeout
}
}
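On top of WatchKey, the HTTP side of a blocking query could be wired roughly like this sketch (parsing and error handling trimmed; the 5-minute default mirrors Consul's documented default wait):
// handleKVGet serves GET /v1/kv/{key}?index=N&wait=5m as a blocking query.
func (kv *KVStore) handleKVGet(w http.ResponseWriter, r *http.Request) {
    key := strings.TrimPrefix(r.URL.Path, "/v1/kv/")
    index, _ := strconv.ParseUint(r.URL.Query().Get("index"), 10, 64)
    wait, err := time.ParseDuration(r.URL.Query().Get("wait"))
    if err != nil || wait <= 0 {
        wait = 5 * time.Minute
    }

    entry := kv.WatchKey(key, index, wait) // returns immediately when the key is already newer than index
    if entry == nil {
        http.NotFound(w, r)
        return
    }
    // json.Marshal base64-encodes []byte values, which matches the API output shown above.
    json.NewEncoder(w).Encode([]*KVEntry{entry})
}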
Learning milestones:
- Basic KV operations replicated via Raft → You understand state machine replication
- Blocking queries work → You understand the long-polling pattern
- CAS prevents race conditions → You understand optimistic concurrency
- Prefix queries work → You understand hierarchical key design
Project 7: Session and Lock Manager
- File: LEARN_CONSUL_DEEP_DIVE.md
- Main Programming Language: Go
- Alternative Programming Languages: Rust, Java, Python
- Coolness Level: Level 3: Genuinely Clever
- Business Potential: 3. The “Service & Support” Model
- Difficulty: Level 3: Advanced
- Knowledge Area: Distributed Systems / Coordination
- Software or Tool: Distributed Lock Manager (like Consul sessions)
- Main Book: “Designing Data-Intensive Applications” by Martin Kleppmann
What you’ll build: A distributed lock/session manager. Sessions are tied to node health, and locks are released when sessions expire. This enables leader election and distributed coordination patterns.
Why it teaches Consul: Consul’s sessions are the foundation for distributed locking. Understanding how sessions tie to health checks, and how lock release works, is crucial for building reliable distributed systems.
Core challenges you’ll face:
- Session lifecycle (create, renew, destroy) → maps to ephemeral state
- Lock acquisition and release → maps to distributed mutual exclusion
- Session invalidation on node failure → maps to gossip integration
- Leader election pattern → maps to contended lock acquisition
Key Concepts:
- Consul Sessions: Consul Sessions
- Distributed Locks: “Designing Data-Intensive Applications” Chapter 8 - Martin Kleppmann
- Leader Election: Consul Leader Election
- Session Behavior: “delete” vs “release” on invalidation
Difficulty: Advanced Time estimate: 2 weeks Prerequisites: Projects 3, 4, 6 completed
Real world outcome:
# Create a session (tied to node health)
$ curl -X PUT localhost:8500/v1/session/create -d '{
"Name": "my-leader-election",
"TTL": "15s",
"LockDelay": "10s",
"Behavior": "delete"
}'
{"ID": "session-abc-123"}
# Acquire a lock
$ curl -X PUT "localhost:8500/v1/kv/service/leader?acquire=session-abc-123" -d 'node-1'
true # Lock acquired!
# Another node tries to acquire the same lock
$ curl -X PUT "localhost:8500/v1/kv/service/leader?acquire=session-def-456" -d 'node-2'
false # Lock already held
# Check who holds the lock
$ curl localhost:8500/v1/kv/service/leader
[{"Key": "service/leader", "Value": "bm9kZS0x", "Session": "session-abc-123", ...}]
# If node-1's session expires (TTL) or node fails (gossip), lock is released
# After lock-delay, node-2 can acquire it
# Explicit release
$ curl -X PUT "localhost:8500/v1/kv/service/leader?release=session-abc-123"
true
# Renew session before TTL expires
$ curl -X PUT localhost:8500/v1/session/renew/session-abc-123
[{"ID": "session-abc-123", "TTL": "15s", ...}]
Implementation Hints:
Session structure:
type Session struct {
ID string
Name string
Node string // Owning node
TTL time.Duration
LockDelay time.Duration // Wait before reassigning locks
Behavior string // "release" or "delete"
Checks []string // Health checks that must pass
CreateIndex uint64
LastRenew time.Time
}
Lock acquisition flow:
func (s *SessionManager) AcquireLock(key, sessionID, value string) bool {
session := s.sessions[sessionID]
if session == nil || !s.isSessionValid(session) {
return false
}
entry := s.kv.Get(key)
// Already locked by another session?
if entry != nil && entry.Session != "" && entry.Session != sessionID {
return false
}
// Acquire the lock
s.kv.SetWithSession(key, value, sessionID)
return true
}
Session invalidation (on node failure or TTL):
func (s *SessionManager) InvalidateSession(sessionID string) {
session := s.sessions[sessionID]
if session == nil {
return
}
// Find all keys locked by this session
lockedKeys := s.kv.GetKeysBySession(sessionID)
for _, key := range lockedKeys {
switch session.Behavior {
case "release":
// Remove session association, keep value
s.kv.ClearSession(key)
case "delete":
// Delete the key entirely
s.kv.Delete(key)
}
}
// After LockDelay, these keys can be acquired by other sessions
delete(s.sessions, sessionID)
}
Integration with gossip:
- When a node is marked as failed by SWIM, invalidate its sessions
- Sessions can also be tied to specific health checks
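A minimal sketch of TTL enforcement, assuming the SessionManager keeps its sessions map guarded by a mutex and runs this loop in a background goroutine (Consul documents that expiry may lag the TTL, so a 2x cutoff is used here):
// reapExpiredSessions invalidates any session whose TTL lapsed without a renew.
func (s *SessionManager) reapExpiredSessions() {
    ticker := time.NewTicker(time.Second)
    defer ticker.Stop()
    for range ticker.C {
        now := time.Now()
        var expired []string
        s.mu.Lock()
        for id, sess := range s.sessions {
            if sess.TTL > 0 && now.Sub(sess.LastRenew) > 2*sess.TTL {
                expired = append(expired, id)
            }
        }
        s.mu.Unlock()
        for _, id := range expired {
            s.InvalidateSession(id) // releases or deletes held locks per Behavior
        }
    }
}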
Learning milestones:
- Session create/renew/destroy works → You understand session lifecycle
- Locks are exclusive → You understand distributed locking
- Session expiry releases locks → You understand TTL mechanism
- Node failure releases locks → You understand gossip integration
Project 8: Service Mesh Sidecar Proxy
- File: LEARN_CONSUL_DEEP_DIVE.md
- Main Programming Language: Go
- Alternative Programming Languages: Rust, C++
- Coolness Level: Level 4: Hardcore Tech Flex
- Business Potential: 4. The “Open Core” Infrastructure
- Difficulty: Level 4: Expert
- Knowledge Area: Networking / Security
- Software or Tool: Sidecar Proxy (like Envoy/Consul Connect proxy)
- Main Book: “Zero Trust Networks, 2nd Edition” by Evan Gilman & Doug Barth
What you’ll build: A simple sidecar proxy that intercepts service traffic, establishes mTLS connections with other proxies, and enforces service-to-service authorization (intentions).
Why it teaches Consul: Consul Connect is the service mesh. Understanding how the sidecar intercepts traffic, validates certificates, and checks intentions is understanding modern zero-trust networking.
Core challenges you’ll face:
- TLS termination and origination → maps to mTLS implementation
- Certificate management → maps to CA integration
- Intention checking → maps to authorization decisions
- Transparent proxying → maps to iptables/localhost binding
Key Concepts:
- Consul Connect: Consul Service Mesh
- mTLS: Mutual TLS Explained
- SPIFFE: SPIFFE Identity - Consul uses SPIFFE-compatible identities
- Intentions: Consul Intentions
Difficulty: Expert Time estimate: 3-4 weeks Prerequisites: Strong understanding of TLS, Projects 4-6 completed
Real world outcome:
# Start a service (not aware of mesh)
$ python3 -m http.server 8080 &
# Start sidecar for the service
$ ./sidecar --service=web --port=8080 --listen=21000
# The sidecar:
# 1. Gets certificates from CA (or Consul)
# 2. Listens on 21000 for inbound mTLS connections
# 3. Forwards decrypted traffic to localhost:8080
# Start client-side sidecar (for making outbound connections)
$ ./sidecar --service=client --upstream=web:9090
# The client app connects to localhost:9090 (plaintext)
# Sidecar establishes mTLS to web's sidecar
# Traffic flows encrypted between sidecars
# Test the connection
$ curl localhost:9090
# Works! Traffic went: curl -> client-sidecar -> mTLS -> web-sidecar -> web-service
# Add an intention (deny client -> web)
$ ./intentctl deny client web
# Now the request fails
$ curl localhost:9090
Error: Connection refused by intention
# Allow it
$ ./intentctl allow client web
$ curl localhost:9090
# Works again!
Implementation Hints:
Sidecar architecture:
type Sidecar struct {
serviceName string
localPort int // Where the actual service listens
listenPort int // Where we accept inbound mTLS
upstreams []Upstream // Outbound services we proxy to
cert tls.Certificate
caCert *x509.CertPool
}
type Upstream struct {
ServiceName string
LocalPort int // localhost:LocalPort -> remote service
}
Inbound connection handler:
func (s *Sidecar) handleInbound(conn net.Conn) {
// 1. TLS handshake (we require client cert)
tlsConn := tls.Server(conn, &tls.Config{
Certificates: []tls.Certificate{s.cert},
ClientAuth: tls.RequireAndVerifyClientCert,
ClientCAs: s.caCert,
})
if err := tlsConn.Handshake(); err != nil {
log.Printf("TLS handshake failed: %v", err)
return
}
// 2. Extract client service identity from cert
clientService := extractServiceName(tlsConn.ConnectionState().PeerCertificates[0])
// 3. Check intention
if !s.checkIntention(clientService, s.serviceName) {
log.Printf("Intention denied: %s -> %s", clientService, s.serviceName)
tlsConn.Close()
return
}
// 4. Forward to local service
localConn, _ := net.Dial("tcp", fmt.Sprintf("localhost:%d", s.localPort))
go io.Copy(localConn, tlsConn)
io.Copy(tlsConn, localConn)
}
Outbound proxy (upstream):
func (s *Sidecar) handleOutbound(upstream Upstream, conn net.Conn) {
    // 1. Discover the upstream service (via DNS or registry)
    instances := s.discover(upstream.ServiceName)
    target := instances[0] // Load balancing here
    // 2. Establish mTLS connection to target's sidecar
    tlsConn, err := tls.Dial("tcp", target, &tls.Config{
        Certificates: []tls.Certificate{s.cert},
        RootCAs:      s.caCert,
        ServerName:   upstream.ServiceName, // SNI for routing
    })
    if err != nil {
        log.Printf("mTLS dial to %s failed: %v", upstream.ServiceName, err)
        conn.Close()
        return
    }
    // 3. Proxy traffic in both directions
    go io.Copy(tlsConn, conn)
    io.Copy(conn, tlsConn)
}
Certificate identity (SPIFFE-style):
spiffe://trust-domain/ns/default/dc/dc1/svc/web
The service name is embedded in the certificate’s SAN (Subject Alternative Name).
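A hedged sketch of the extractServiceName helper used in the inbound handler above, pulling the last path segment out of the SPIFFE URI SAN (needs crypto/x509 and strings):
// extractServiceName returns the service encoded in a SPIFFE-style URI SAN,
// e.g. spiffe://consul/ns/default/dc/dc1/svc/web -> "web".
func extractServiceName(cert *x509.Certificate) string {
    for _, uri := range cert.URIs {
        if uri.Scheme != "spiffe" {
            continue
        }
        parts := strings.Split(strings.Trim(uri.Path, "/"), "/")
        if len(parts) >= 2 && parts[len(parts)-2] == "svc" {
            return parts[len(parts)-1]
        }
    }
    return "" // no SPIFFE identity: treat the peer as unauthenticated
}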
Learning milestones:
- mTLS connections work → You understand certificate-based identity
- Intentions block/allow correctly → You understand authorization
- Transparent to application → You understand the sidecar pattern
- Works with your service registry → You’ve integrated the pieces
Project 9: Certificate Authority (CA) for Service Mesh
- File: LEARN_CONSUL_DEEP_DIVE.md
- Main Programming Language: Go
- Alternative Programming Languages: Rust, Python
- Coolness Level: Level 4: Hardcore Tech Flex
- Business Potential: 4. The “Open Core” Infrastructure
- Difficulty: Level 4: Expert
- Knowledge Area: Security / PKI
- Software or Tool: Certificate Authority (like Consul’s built-in CA)
- Main Book: “Serious Cryptography, 2nd Edition” by Jean-Philippe Aumasson
What you’ll build: A Certificate Authority that issues short-lived certificates to services, supporting automatic rotation. This is the trust foundation of your service mesh.
Why it teaches Consul: Consul can act as a CA or integrate with Vault. Understanding how certificates are issued, rotated, and trusted is fundamental to understanding Connect’s security model.
Core challenges you’ll face:
- Generating CA root and intermediate certs → maps to PKI hierarchy
- Issuing leaf certificates on demand → maps to CSR processing
- Short-lived certs and rotation → maps to ephemeral identity
- Certificate distribution → maps to how sidecars get certs
Key Concepts:
- PKI Fundamentals: “Serious Cryptography” Chapter 12 - Jean-Philippe Aumasson
- SPIFFE/SPIRE: SPIFFE Standards
- Consul CA: Consul Connect CA
- Certificate Rotation: Short TTLs vs long-lived certs
Difficulty: Expert Time estimate: 2-3 weeks Prerequisites: Understanding of TLS/PKI, Projects 4-8 helpful
Real world outcome:
# Initialize CA (creates root and intermediate)
$ ./meshca init --domain=consul --ttl=8760h
Root CA created: /certs/root.pem
Intermediate CA created: /certs/intermediate.pem
# Start CA server
$ ./meshca serve --port=8200
# Service requests a certificate
$ curl -X POST localhost:8200/v1/connect/ca/leaf/web -d '{
"csr": "-----BEGIN CERTIFICATE REQUEST-----\n..."
}'
{
"certificate": "-----BEGIN CERTIFICATE-----\n...",
"private_key": null, # We don't see it (service generated CSR)
"valid_after": "2024-01-15T10:00:00Z",
"valid_before": "2024-01-15T22:00:00Z", # 12-hour TTL
"service": "web",
"spiffe_id": "spiffe://consul/ns/default/dc/dc1/svc/web"
}
# Get root certificates (for verification)
$ curl localhost:8200/v1/connect/ca/roots
{
"roots": [{
"id": "abc123",
"root_cert": "-----BEGIN CERTIFICATE-----\n...",
"active": true
}]
}
# Rotate root CA (zero downtime)
$ ./meshca rotate
New root created, old root still valid during transition...
Implementation Hints:
CA structure:
type MeshCA struct {
rootKey *ecdsa.PrivateKey
rootCert *x509.Certificate
interKey *ecdsa.PrivateKey
interCert *x509.Certificate
trustDomain string
}
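A hedged sketch of bootstrapping the self-signed root; the intermediate follows the same pattern but is signed by this root instead of by itself (imports: crypto/ecdsa, crypto/elliptic, crypto/rand, crypto/x509, crypto/x509/pkix, math/big, time):
// newRootCA generates an ECDSA key pair and a self-signed root certificate.
func newRootCA(trustDomain string, ttl time.Duration) (*ecdsa.PrivateKey, *x509.Certificate, error) {
    key, err := ecdsa.GenerateKey(elliptic.P256(), rand.Reader)
    if err != nil {
        return nil, nil, err
    }
    template := &x509.Certificate{
        SerialNumber:          big.NewInt(1), // use a random serial in anything real
        Subject:               pkix.Name{CommonName: trustDomain + " Root CA"},
        NotBefore:             time.Now(),
        NotAfter:              time.Now().Add(ttl),
        IsCA:                  true,
        BasicConstraintsValid: true,
        KeyUsage:              x509.KeyUsageCertSign | x509.KeyUsageCRLSign,
    }
    der, err := x509.CreateCertificate(rand.Reader, template, template, &key.PublicKey, key)
    if err != nil {
        return nil, nil, err
    }
    cert, err := x509.ParseCertificate(der)
    return key, cert, err
}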
Issuing a leaf certificate:
func (ca *MeshCA) IssueCert(serviceName string, csr *x509.CertificateRequest) (*x509.Certificate, error) {
// Generate SPIFFE ID
spiffeID := fmt.Sprintf("spiffe://%s/ns/default/dc/dc1/svc/%s",
ca.trustDomain, serviceName)
// Create certificate template
template := &x509.Certificate{
SerialNumber: generateSerial(),
Subject: csr.Subject,
NotBefore: time.Now(),
NotAfter: time.Now().Add(12 * time.Hour), // Short-lived!
KeyUsage: x509.KeyUsageDigitalSignature,
ExtKeyUsage: []x509.ExtKeyUsage{
x509.ExtKeyUsageClientAuth,
x509.ExtKeyUsageServerAuth,
},
URIs: []*url.URL{{Scheme: "spiffe", Host: ca.trustDomain,
Path: fmt.Sprintf("/ns/default/dc/dc1/svc/%s", serviceName)}},
}
// Sign with intermediate CA
certDER, _ := x509.CreateCertificate(
rand.Reader, template, ca.interCert, csr.PublicKey, ca.interKey)
return x509.ParseCertificate(certDER)
}
Root rotation strategy:
- Generate new root and intermediate
- Add new root to trust bundle (both old and new trusted)
- Start issuing certs signed by new intermediate
- After old certs expire (max TTL), remove old root from bundle
Key insight: Short-lived certs (hours, not years) make revocation unnecessary—just don’t renew!
Learning milestones:
- CA issues valid certificates → You understand PKI basics
- Sidecars can authenticate each other → Your certs work in practice
- Rotation doesn’t break connections → You understand trust bundle management
- SPIFFE IDs are correct → You understand service identity
Project 10: Multi-Datacenter Federation
- File: LEARN_CONSUL_DEEP_DIVE.md
- Main Programming Language: Go
- Alternative Programming Languages: Rust, Java
- Coolness Level: Level 4: Hardcore Tech Flex
- Business Potential: 4. The “Open Core” Infrastructure
- Difficulty: Level 4: Expert
- Knowledge Area: Distributed Systems / Networking
- Software or Tool: Multi-DC Federation (like Consul WAN)
- Main Book: “Designing Data-Intensive Applications” by Martin Kleppmann
What you’ll build: Extend your Consul clone to support multiple datacenters. Each DC has its own Raft cluster, but they’re connected via WAN gossip for cross-DC service discovery and failover.
Why it teaches Consul: Multi-DC is where distributed systems get really interesting. How do you maintain consistency? How do you route across DCs? Consul’s approach—independent Raft clusters with WAN gossip—is elegant.
Core challenges you’ll face:
- WAN gossip pool (servers only) → maps to separate gossip with higher latency tolerance
- Cross-DC queries → maps to query forwarding to remote DC
- Prepared queries with failover → maps to automatic DC failover
- Replication vs forwarding → maps to where to evaluate queries
Key Concepts:
- Consul Multi-DC: Multi-Datacenter Consul
- WAN Federation: Consul WAN Gossip
- Prepared Queries: Prepared Query Failover
- Network Partitions: What happens when DCs can’t communicate
Difficulty: Expert Time estimate: 3-4 weeks Prerequisites: Projects 2, 3, 4 completed, understanding of multi-region architecture
Real world outcome:
# DC1 cluster
$ ./consul-clone agent -server -datacenter=dc1 -wan-join=dc2-server:8302
# DC2 cluster
$ ./consul-clone agent -server -datacenter=dc2 -wan-join=dc1-server:8302
# List all DCs (via WAN gossip)
$ curl localhost:8500/v1/catalog/datacenters
["dc1", "dc2"]
# Register service in DC1
$ curl -X PUT localhost:8500/v1/agent/service/register -d '{
"name": "api",
"port": 8080
}'
# Query service from DC2 (cross-DC)
$ curl dc2-server:8500/v1/catalog/service/api?dc=dc1
[{"Node": "node1", "Datacenter": "dc1", "ServicePort": 8080, ...}]
# Create prepared query with failover
$ curl -X POST localhost:8500/v1/query -d '{
"Name": "api-failover",
"Service": {
"Service": "api",
"Failover": {
"Datacenters": ["dc2", "dc3"]
}
}
}'
# When DC1's api is unhealthy, query automatically returns DC2's api
$ curl localhost:8500/v1/query/api-failover/execute
# Returns DC2 instances if DC1 has none healthy
Implementation Hints:
WAN gossip configuration:
type WANGossipConfig struct {
BindPort int // 8302 by default
SuspicionMult int // Higher for WAN (network is less reliable)
ProbeInterval time.Duration // Longer for WAN
PushPullInterval time.Duration // Longer for WAN
}
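The GetServerInDC call used in the forwarding code below could be backed by something like this sketch; it assumes each WAN member advertises a dc tag in its gossip metadata, and Member, Members(), and StateAlive are illustrative names:
// GetServerInDC picks any alive server from the WAN pool that advertises the requested datacenter.
func (g *GossipPool) GetServerInDC(dc string) *Member {
    var candidates []*Member
    for _, m := range g.Members() {
        if m.State == StateAlive && m.Tags["dc"] == dc {
            candidates = append(candidates, m)
        }
    }
    if len(candidates) == 0 {
        return nil // datacenter unknown or currently unreachable
    }
    // Spread load across the remote DC's servers.
    return candidates[rand.Intn(len(candidates))]
}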
Cross-DC query forwarding:
func (s *Server) handleCatalogQuery(dc string, service string) []ServiceInstance {
if dc == s.datacenter {
// Local query
return s.catalog.GetService(service)
}
// Find a server in the target DC (via WAN gossip)
remoteServer := s.wanGossip.GetServerInDC(dc)
if remoteServer == nil {
return nil // DC unreachable
}
// Forward query
return s.forwardQuery(remoteServer, service)
}
Prepared query failover:
func (s *Server) executePreparedQuery(query *PreparedQuery) []ServiceInstance {
// Try primary DC first
results := s.catalog.GetService(query.Service.Service)
healthyResults := filterHealthy(results)
if len(healthyResults) > 0 {
return healthyResults
}
// Failover to other DCs in order
for _, dc := range query.Service.Failover.Datacenters {
results := s.handleCatalogQuery(dc, query.Service.Service)
healthyResults := filterHealthy(results)
if len(healthyResults) > 0 {
return healthyResults
}
}
return nil
}
Learning milestones:
- WAN gossip connects DCs → You understand the WAN pool
- Cross-DC queries work → You understand forwarding
- Failover routes to other DCs → You understand prepared queries
- Network partition is handled gracefully → You understand failure modes
Project 11: ACL System with Tokens and Policies
- File: LEARN_CONSUL_DEEP_DIVE.md
- Main Programming Language: Go
- Alternative Programming Languages: Rust, Java
- Coolness Level: Level 3: Genuinely Clever
- Business Potential: 3. The “Service & Support” Model
- Difficulty: Level 3: Advanced
- Knowledge Area: Security / Access Control
- Software or Tool: ACL System (like Consul ACLs)
- Main Book: “Foundations of Information Security” by Jason Andress
What you’ll build: An ACL system with tokens, policies, and roles. Tokens authenticate requests, policies define permissions, and roles group policies. Secure your Consul clone!
Why it teaches Consul: Security is not optional. Consul’s ACL system secures everything: who can register services, who can read KV, who can modify intentions. Understanding this completes your Consul knowledge.
Core challenges you’ll face:
- Token management (create, update, delete) → maps to identity management
- Policy language (HCL-like) → maps to permission specification
- Permission resolution → maps to policy inheritance and roles
- Token replication → maps to Raft-based security state
Key Concepts:
- Consul ACLs: Consul ACL System
- Policy Language: Consul ACL Rules
- RBAC: Role-Based Access Control concepts
- Capability-Based Security: Token as capability
Difficulty: Advanced Time estimate: 2 weeks Prerequisites: Projects 2, 4, 6 completed
Real world outcome:
# Bootstrap ACL system (creates initial management token)
$ curl -X PUT localhost:8500/v1/acl/bootstrap
{"AccessorID": "root-accessor", "SecretID": "root-secret-token"}
# Create a policy
$ curl -X PUT localhost:8500/v1/acl/policy -H "X-Consul-Token: root-secret-token" -d '{
"Name": "web-service-policy",
"Rules": "service \"web\" { policy = \"write\" }\nkey_prefix \"config/web/\" { policy = \"read\" }"
}'
# Create a token with this policy
$ curl -X PUT localhost:8500/v1/acl/token -H "X-Consul-Token: root-secret-token" -d '{
"Policies": [{"Name": "web-service-policy"}]
}'
{"SecretID": "web-token-123"}
# Use the token to register service (allowed)
$ curl -X PUT localhost:8500/v1/agent/service/register \
-H "X-Consul-Token: web-token-123" \
-d '{"name": "web", "port": 8080}'
# Success!
# Try to register a different service (denied)
$ curl -X PUT localhost:8500/v1/agent/service/register \
-H "X-Consul-Token: web-token-123" \
-d '{"name": "api", "port": 9090}'
{"error": "Permission denied: service \"api\" write"}
# Read allowed KV prefix
$ curl localhost:8500/v1/kv/config/web/setting -H "X-Consul-Token: web-token-123"
# Success!
# Read denied KV prefix
$ curl localhost:8500/v1/kv/secrets/password -H "X-Consul-Token: web-token-123"
{"error": "Permission denied"}
Implementation Hints:
Policy structure:
type Policy struct {
ID string
Name string
Description string
Rules []Rule
}
type Rule struct {
Resource string // "service", "key", "node", "agent", etc.
Segment string // The name/prefix (e.g., "web", "config/")
Policy string // "read", "write", "deny"
Intentions string // For service: "read", "write"
}
Token structure:
type Token struct {
AccessorID string // Public identifier
SecretID string // The actual token value
Description string
Policies []string // Policy IDs
Roles []string // Role IDs
Local bool // DC-local or global
CreateTime time.Time
ExpirationTime *time.Time
}
Permission checking:
func (acl *ACLResolver) CheckPermission(token, resource, segment, action string) bool {
// 1. Look up token
t := acl.tokens[token]
if t == nil {
return false // Anonymous - check default policy
}
// 2. Gather all policies (from token and roles)
policies := acl.gatherPolicies(t)
// 3. Check each policy for matching rule
// Real Consul resolves the most specific match and lets deny win; this sketch just takes the first matching rule
for _, policy := range policies {
for _, rule := range policy.Rules {
if rule.Resource == resource && matchesSegment(rule.Segment, segment) {
if rule.Policy == "deny" {
return false
}
if rule.Policy == action || rule.Policy == "write" {
return true
}
}
}
}
// 4. Default deny
return false
}
Policy language parsing (simplified):
service "web" { policy = "write" }
key_prefix "config/" { policy = "read" }
node "" { policy = "read" } # Empty = all nodes
Learning milestones:
- Tokens authenticate requests → You understand identity
- Policies grant permissions → You understand authorization
- Roles group policies → You understand RBAC
- Default-deny works → You understand security principles
Project 12: Consul Agent (Complete Implementation)
- File: LEARN_CONSUL_DEEP_DIVE.md
- Main Programming Language: Go
- Alternative Programming Languages: Rust
- Coolness Level: Level 5: Pure Magic
- Business Potential: 5. The “Industry Disruptor”
- Difficulty: Level 5: Master
- Knowledge Area: Distributed Systems / Full Stack
- Software or Tool: Consul Agent Clone
- Main Book: All previous books combined
What you’ll build: A complete Consul agent that can run as either server or client mode. It integrates all previous projects: Raft, gossip, service registry, DNS, KV, sessions, mesh, and ACLs.
Why it teaches Consul: This is the capstone. You’ll understand how all the pieces fit together, how the agent lifecycle works, and what happens when you run consul agent.
Core challenges you’ll face:
- Mode switching (server vs client) → maps to agent architecture
- Graceful shutdown → maps to leave vs fail
- Configuration management → maps to HCL/JSON config files
- HTTP/gRPC API serving → maps to API surface
Key Concepts:
- Agent Architecture: Consul Agent
- Configuration: Consul Configuration
- All Previous Concepts: Integration of everything
Difficulty: Master Time estimate: 2-3 months Prerequisites: All previous projects completed
Real world outcome:
# Start servers (just like real Consul!)
$ ./consul-clone agent -server -bootstrap-expect=3 -datacenter=dc1 -data-dir=/data/consul
# Start clients
$ ./consul-clone agent -datacenter=dc1 -join=server1:8301
# All standard Consul commands work
$ ./consul-clone members
Node Address Status Type DC
server1 192.168.1.10:8301 alive server dc1
server2 192.168.1.11:8301 alive server dc1
server3 192.168.1.12:8301 alive server dc1
client1 192.168.1.20:8301 alive client dc1
$ ./consul-clone kv put config/key value
Success!
$ ./consul-clone catalog services
api
web
$ ./consul-clone operator raft list-peers
Node ID Address State Voter
server1 abc123 192.168.1.10:8300 leader true
server2 def456 192.168.1.11:8300 follower true
server3 ghi789 192.168.1.12:8300 follower true
# DNS works
$ dig @localhost -p 8600 web.service.consul
...
# HTTP API works
$ curl localhost:8500/v1/status/leader
"192.168.1.10:8300"
Implementation Hints:
Agent structure:
type Agent struct {
config *Config
mode AgentMode // SERVER or CLIENT
// Server-only components
raft *Raft
fsm *StateMachine
leaderCh <-chan bool
// All agents
lanGossip *GossipPool
wanGossip *GossipPool // Servers only
// Services
catalog *Catalog
kvStore *KVStore
sessions *SessionManager
acl *ACLResolver
connect *ConnectManager
// Interfaces
httpServer *http.Server
grpcServer *grpc.Server
dnsServer *DNSServer
// Lifecycle
shutdownCh chan struct{}
}
Agent startup sequence:
func (a *Agent) Start() error {
// 1. Load configuration
if err := a.loadConfig(); err != nil {
return err
}
// 2. Initialize gossip
a.lanGossip = NewGossipPool(a.config.LANConfig)
if a.mode == SERVER {
a.wanGossip = NewGossipPool(a.config.WANConfig)
}
// 3. Initialize Raft (servers only)
if a.mode == SERVER {
a.raft = NewRaft(a.config.RaftConfig)
a.raft.Start()
}
// 4. Initialize services
a.catalog = NewCatalog(a.raft)
a.kvStore = NewKVStore(a.raft)
a.sessions = NewSessionManager(a.kvStore, a.lanGossip)
// 5. Start API servers
go a.httpServer.ListenAndServe()
go a.grpcServer.Serve()
go a.dnsServer.ListenAndServe()
// 6. Join cluster
if len(a.config.RetryJoin) > 0 {
a.lanGossip.Join(a.config.RetryJoin)
}
return nil
}
Graceful shutdown:
func (a *Agent) Shutdown() {
// 1. Leave gossip (so others know we're leaving intentionally)
a.lanGossip.Leave()
if a.wanGossip != nil {
a.wanGossip.Leave()
}
// 2. Stop accepting new connections
a.httpServer.Shutdown(context.Background())
a.grpcServer.GracefulStop()
a.dnsServer.Shutdown()
// 3. Step down as leader if applicable
if a.raft != nil {
a.raft.Shutdown()
}
// 4. Close remaining connections
close(a.shutdownCh)
}
Learning milestones:
- Agent starts and joins cluster → Integration works
- All APIs function correctly → You’ve built a Consul clone
- Graceful shutdown preserves data → Lifecycle is correct
- You can replace real Consul in tests → You truly understand Consul
Project Comparison Table
| Project | Difficulty | Time | Depth of Understanding | Fun Factor |
|---|---|---|---|---|
| 1. Simple KV with Replication | Intermediate | 1 week | ⭐⭐ | ⭐⭐⭐ |
| 2. Raft Consensus | Expert | 3-4 weeks | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ |
| 3. SWIM Gossip Protocol | Advanced | 2-3 weeks | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐ |
| 4. Service Registry | Intermediate | 1-2 weeks | ⭐⭐⭐ | ⭐⭐⭐ |
| 5. DNS Service Discovery | Advanced | 2 weeks | ⭐⭐⭐⭐ | ⭐⭐⭐⭐ |
| 6. KV Store with Watches | Advanced | 2 weeks | ⭐⭐⭐⭐ | ⭐⭐⭐ |
| 7. Session and Lock Manager | Advanced | 2 weeks | ⭐⭐⭐⭐ | ⭐⭐⭐ |
| 8. Sidecar Proxy | Expert | 3-4 weeks | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ |
| 9. Certificate Authority | Expert | 2-3 weeks | ⭐⭐⭐⭐ | ⭐⭐⭐⭐ |
| 10. Multi-DC Federation | Expert | 3-4 weeks | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐ |
| 11. ACL System | Advanced | 2 weeks | ⭐⭐⭐ | ⭐⭐⭐ |
| 12. Complete Agent | Master | 2-3 months | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ |
Recommended Learning Path
Foundational Track (Understand the Core)
- Project 1: Simple KV - Feel the pain of distributed systems
- Project 2: Raft - THE foundational project
- Project 3: SWIM Gossip - The other foundational protocol
Service Discovery Track
- Project 4: Service Registry - Core use case
- Project 5: DNS Interface - Universal access
- Project 6: KV with Watches - Configuration management
Service Mesh Track
- Project 8: Sidecar Proxy - How Connect works
- Project 9: CA - Trust foundation
- Project 7: Sessions/Locks - Coordination primitives
Production-Ready Track
- Project 11: ACL System - Security
- Project 10: Multi-DC - Scale out
- Project 12: Complete Agent - Everything together
Final Capstone: Production Consul Cluster Simulator
- File: LEARN_CONSUL_DEEP_DIVE.md
- Main Programming Language: Go
- Alternative Programming Languages: Rust
- Coolness Level: Level 5: Pure Magic
- Business Potential: 5. The “Industry Disruptor”
- Difficulty: Level 5: Master
- Knowledge Area: Distributed Systems / Chaos Engineering
- Software or Tool: Chaos Testing Framework
- Main Book: “Designing Data-Intensive Applications” by Martin Kleppmann
What you’ll build: A simulation environment that runs multiple Consul agents, injects failures (network partitions, node crashes, slow disks), and verifies correctness. Like Jepsen for your Consul clone.
Why it teaches mastery: Building systems is one thing; proving they work under failure is another. This project will stress-test your implementation and teach you what “production-ready” really means.
Real world outcome:
$ ./chaos-consul simulate --nodes=5 --duration=10m
[00:00] Starting 5 nodes (3 servers, 2 clients)
[00:05] All nodes healthy, leader elected: node-1
[00:30] INJECT: Network partition between node-1 and {node-3, node-4, node-5}
[00:32] Leader election started (node-1 isolated)
[00:33] New leader: node-3
[00:35] VERIFY: All writes succeed to new leader ✓
[00:35] VERIFY: Old leader rejects writes ✓
[01:00] HEAL: Network partition resolved
[01:02] node-1 becomes follower, syncs log ✓
[02:00] INJECT: node-2 crash
[02:01] Gossip detects failure within 2s ✓
[02:03] Services on node-2 marked unhealthy ✓
[05:00] INJECT: Slow disk on leader (100ms latency)
[05:10] Write latency increased but system operational ✓
[10:00] Simulation complete
Results:
- Writes: 10,234 success, 0 lost
- Reads: 50,123 success, 3 stale (during partition, expected)
- Leader elections: 3 (all completed < 5s)
- Consistency violations: 0
PASS: Your Consul clone is production-ready!
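One way the partition injection shown above could be implemented on Linux, as a hedged sketch that shells out to iptables (requires root; rule handling and cleanup are simplified; imports fmt and os/exec):
// partition drops all traffic from peerIP until the returned heal func is called.
func partition(peerIP string) (heal func() error, err error) {
    add := exec.Command("iptables", "-A", "INPUT", "-s", peerIP, "-j", "DROP")
    if out, err := add.CombinedOutput(); err != nil {
        return nil, fmt.Errorf("iptables add failed: %v: %s", err, out)
    }
    heal = func() error {
        // Delete the exact rule we added, restoring connectivity.
        return exec.Command("iptables", "-D", "INPUT", "-s", peerIP, "-j", "DROP").Run()
    }
    return heal, nil
}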
Summary
| # | Project | Main Language |
|---|---|---|
| 1 | Simple Key-Value Store with Replication | Go |
| 2 | Implement the Raft Consensus Algorithm | Go |
| 3 | Implement the SWIM Gossip Protocol | Go |
| 4 | Build a Service Registry | Go |
| 5 | DNS-Based Service Discovery | Go |
| 6 | Distributed Key-Value Store with Watches | Go |
| 7 | Session and Lock Manager | Go |
| 8 | Service Mesh Sidecar Proxy | Go |
| 9 | Certificate Authority (CA) for Service Mesh | Go |
| 10 | Multi-Datacenter Federation | Go |
| 11 | ACL System with Tokens and Policies | Go |
| 12 | Consul Agent (Complete Implementation) | Go |
| Capstone | Production Consul Cluster Simulator | Go |
Key Resources
Official Documentation
- Consul Architecture - Start here
- Consul Consensus (Raft) - How Raft is used
- Consul Gossip (Serf) - How gossip is used
- Consul Connect - Service mesh internals
Academic Papers
- Raft Paper - Essential reading
- SWIM Paper - Gossip fundamentals
- Lifeguard Extensions - HashiCorp’s improvements
Books
- “Designing Data-Intensive Applications” by Martin Kleppmann - The distributed systems bible
- “Building Microservices, 2nd Edition” by Sam Newman - Service discovery context
- “Zero Trust Networks, 2nd Edition” by Gilman & Barth - Service mesh security
HashiCorp Talks
- Everybody Talks: Gossip, Serf, memberlist, Raft, and SWIM - Excellent overview
- Consul Connect Deep Dive - Service mesh internals
Source Code
- hashicorp/raft - Production Raft library
- hashicorp/memberlist - SWIM implementation
- hashicorp/serf - Gossip layer
- hashicorp/consul - The real thing
“The best way to understand a distributed system is to build one. The second best way is to break one. By the end of this journey, you’ll have done both.”