
LEARN CONSUL DEEP DIVE

Learn HashiCorp Consul: From Zero to Distributed Systems Master

Goal: Deeply understand Consul—not just how to use it, but how it works behind the scenes. Master the Raft consensus algorithm, SWIM gossip protocol, service mesh internals, and build your own implementations to truly understand distributed systems.


Why Consul Matters

Consul is more than a tool—it’s a masterclass in distributed systems engineering. It elegantly combines:

  • Raft consensus for strong consistency
  • SWIM gossip for scalable membership
  • Service mesh for zero-trust networking
  • DNS interface for universal compatibility

Understanding Consul’s internals means understanding:

  • How distributed databases achieve consensus
  • How large clusters detect failures without overwhelming the network
  • How service mesh actually provides mTLS
  • How DNS can be used for service discovery

After completing these projects, you will:

  • Implement your own Raft consensus library
  • Build a SWIM gossip protocol from scratch
  • Create a working service discovery system
  • Understand every layer from network to application
  • Know exactly what happens when you run consul agent

Core Architecture Overview

┌─────────────────────────────────────────────────────────────────┐
│                         CONSUL CLUSTER                          │
├─────────────────────────────────────────────────────────────────┤
│  ┌─────────────────────────────────────────────────────────┐   │
│  │                    SERVER AGENTS                         │   │
│  │  ┌─────────┐   ┌─────────┐   ┌─────────┐                │   │
│  │  │ Leader  │◄─►│Follower │◄─►│Follower │  RAFT QUORUM   │   │
│  │  │ Server  │   │ Server  │   │ Server  │  (Consensus)   │   │
│  │  └────┬────┘   └────┬────┘   └────┬────┘                │   │
│  │       │             │             │                      │   │
│  │       └─────────────┴─────────────┘                      │   │
│  │                     │                                    │   │
│  │            ┌────────▼────────┐                          │   │
│  │            │   State Store   │ (MemDB + BoltDB)         │   │
│  │            │  - Services     │                          │   │
│  │            │  - Nodes        │                          │   │
│  │            │  - KV Store     │                          │   │
│  │            │  - Intentions   │                          │   │
│  │            └─────────────────┘                          │   │
│  └─────────────────────────────────────────────────────────┘   │
│                          ▲                                      │
│                          │ LAN Gossip (SWIM/Serf)              │
│                          ▼                                      │
│  ┌─────────────────────────────────────────────────────────┐   │
│  │                    CLIENT AGENTS                         │   │
│  │  ┌────────┐  ┌────────┐  ┌────────┐  ┌────────┐        │   │
│  │  │ Client │  │ Client │  │ Client │  │ Client │        │   │
│  │  │ + App  │  │ + App  │  │ + App  │  │ + App  │        │   │
│  │  └────────┘  └────────┘  └────────┘  └────────┘        │   │
│  └─────────────────────────────────────────────────────────┘   │
└─────────────────────────────────────────────────────────────────┘

WAN Gossip connects multiple datacenters (server-to-server only)

Fundamental Concepts

  1. Agents
    • Server Agent: Participates in Raft consensus, stores state, serves queries
    • Client Agent: Forwards requests to servers, runs local health checks, participates in gossip
  2. Raft Consensus (for Strong Consistency)
    • Leader election with randomized timeouts
    • Log replication to followers
    • Quorum-based commits ((N/2)+1)
    • Applied to finite state machine (MemDB)
  3. Gossip Protocol (for Membership & Failure Detection)
    • Based on SWIM (Scalable Weakly-consistent Infection-style process group Membership)
    • LAN pool: All agents in a datacenter
    • WAN pool: Servers across datacenters
    • Failure detection without central coordinator
  4. Service Mesh (Connect)
    • Sidecar proxies (Envoy by default)
    • mTLS for all service-to-service communication
    • Intentions for authorization (allow/deny)
    • Built-in or external Certificate Authority
  5. Service Discovery
    • Service registration with health checks
    • DNS interface (*.consul domain)
    • HTTP API for programmatic access
    • Prepared queries for advanced routing
  6. Key/Value Store
    • Hierarchical key-value storage
    • Watch for changes
    • Distributed configuration management
    • Locks and sessions for coordination

Project List

Projects are ordered from foundational concepts to advanced implementations.


Project 1: Simple Key-Value Store with Replication

  • File: LEARN_CONSUL_DEEP_DIVE.md
  • Main Programming Language: Go
  • Alternative Programming Languages: Rust, Python, Java
  • Coolness Level: Level 3: Genuinely Clever
  • Business Potential: 1. The “Resume Gold”
  • Difficulty: Level 2: Intermediate
  • Knowledge Area: Distributed Systems / Data Storage
  • Software or Tool: Key-Value Store (like Consul’s KV)
  • Main Book: “Designing Data-Intensive Applications” by Martin Kleppmann

What you’ll build: A simple in-memory key-value store with leader-based replication. One node is the leader, accepts writes, and replicates to followers. This is the simplest form of replication—before adding consensus.

Why it teaches Consul: Consul’s KV store is built on replicated state. Before understanding Raft, you need to understand why replication is hard. This project exposes the problems: What happens when the leader dies? How do you know writes succeeded?

Core challenges you’ll face:

  • Leader forwarding → maps to how Consul clients forward to servers
  • Replication lag → maps to eventual vs strong consistency
  • Split brain → maps to why we need consensus
  • Network partitions → maps to CAP theorem tradeoffs

Key Concepts:

Difficulty: Intermediate. Time estimate: 1 week. Prerequisites: Basic networking (TCP), Go fundamentals, understanding of client-server architecture.

Real world outcome:

# Start leader
$ ./kvstore --role=leader --port=8000

# Start followers
$ ./kvstore --role=follower --leader=localhost:8000 --port=8001
$ ./kvstore --role=follower --leader=localhost:8000 --port=8002

# Write to leader
$ curl -X PUT localhost:8000/kv/name -d 'Douglas'
{"success": true, "replicated_to": 2}

# Read from any node (including followers)
$ curl localhost:8001/kv/name
{"key": "name", "value": "Douglas"}

# Kill leader, observe what breaks
$ kill %1
$ curl localhost:8001/kv/name  # Works (read)
$ curl -X PUT localhost:8001/kv/age -d '30'  # Fails! No leader!

Implementation Hints:

Structure:

type KVStore struct {
    data       map[string]string
    role       string  // "leader" or "follower"
    leader     string  // leader address
    followers  []string
    mu         sync.RWMutex
}

Leader write flow:

  1. Accept write request
  2. Apply locally
  3. Send to all followers (sync or async)
  4. Return success (how many replicas confirm?)
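
A minimal sketch of that flow as an HTTP handler, assuming the KVStore struct above, synchronous replication, and a hypothetical /replicate endpoint exposed by followers:

func (kv *KVStore) handlePut(w http.ResponseWriter, r *http.Request) {
    key := strings.TrimPrefix(r.URL.Path, "/kv/")
    value, _ := io.ReadAll(r.Body)

    if kv.role != "leader" {
        // A fuller version would forward this to kv.leader instead
        http.Error(w, "not the leader", http.StatusBadRequest)
        return
    }

    // 1-2. Accept the write and apply it locally
    kv.mu.Lock()
    kv.data[key] = string(value)
    kv.mu.Unlock()

    // 3. Replicate synchronously to every follower
    acked := 0
    for _, addr := range kv.followers {
        resp, err := http.Post("http://"+addr+"/replicate/"+key, "text/plain", bytes.NewReader(value))
        if err != nil {
            continue // follower unreachable: does the write still count as successful?
        }
        resp.Body.Close()
        if resp.StatusCode == http.StatusOK {
            acked++
        }
    }

    // 4. Report how many replicas confirmed the write
    fmt.Fprintf(w, `{"success": true, "replicated_to": %d}`, acked)
}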

Questions to consider:

  • Do you wait for all followers to confirm? (Strong consistency, but slow)
  • Do you return immediately after local write? (Fast, but may lose data)
  • What if a follower is down during replication?
  • How does a follower become the new leader?

The goal is to feel the pain of distributed systems before learning the solutions.

Learning milestones:

  1. Basic replication works → You understand leader-follower architecture
  2. You lose data when leader dies → You understand why consensus matters
  3. Followers have stale data → You understand replication lag
  4. Split brain causes conflicts → You understand the consensus problem

Project 2: Implement the Raft Consensus Algorithm

  • File: LEARN_CONSUL_DEEP_DIVE.md
  • Main Programming Language: Go
  • Alternative Programming Languages: Rust, C++, Java
  • Coolness Level: Level 5: Pure Magic
  • Business Potential: 1. The “Resume Gold”
  • Difficulty: Level 4: Expert
  • Knowledge Area: Distributed Systems / Consensus
  • Software or Tool: Raft implementation (like hashicorp/raft)
  • Main Book: “In Search of an Understandable Consensus Algorithm” (Raft paper) by Ongaro & Ousterhout

What you’ll build: A complete Raft implementation with leader election, log replication, and persistence. This is the same algorithm Consul uses for its server consensus.

Why it teaches Consul: Raft is the foundation of Consul’s consistency. Every write to Consul (service registration, KV updates) goes through Raft. Understanding Raft means understanding Consul’s guarantees.

Core challenges you’ll face:

  • Leader election with randomized timeouts → maps to election timer implementation
  • Log replication and commitment → maps to achieving consensus
  • Persistence and recovery → maps to durable state
  • Membership changes → maps to adding/removing servers

Resources for key challenges:

Key Concepts:

  • Leader Election: Raft Paper Section 5.2
  • Log Replication: Raft Paper Section 5.3
  • Safety: Raft Paper Section 5.4
  • Persistence: Raft Paper Section 5.6
  • Consul’s Use of Raft: HashiCorp Consul Consensus Protocol

Difficulty: Expert. Time estimate: 3-4 weeks. Prerequisites: Project 1 completed, strong understanding of concurrency, Go experience.

Real world outcome:

# Start a 3-node Raft cluster
$ ./raftnode --id=1 --peers=localhost:8001,localhost:8002,localhost:8003 &
$ ./raftnode --id=2 --peers=localhost:8001,localhost:8002,localhost:8003 &
$ ./raftnode --id=3 --peers=localhost:8001,localhost:8002,localhost:8003 &

# Check cluster state
$ ./raftctl status
Node 1: LEADER  (term 5, commit index 42)
Node 2: FOLLOWER (term 5, commit index 42)
Node 3: FOLLOWER (term 5, commit index 42)

# Submit a command (goes through leader)
$ ./raftctl apply "SET x = 10"
Applied at index 43

# Kill the leader, watch election happen
$ kill %1
# Wait 150-300ms (election timeout)
$ ./raftctl status
Node 2: LEADER  (term 6, commit index 43)
Node 3: FOLLOWER (term 6, commit index 43)

# Bring node 1 back
$ ./raftnode --id=1 --peers=... &
$ ./raftctl status
Node 1: FOLLOWER (term 6, commit index 43)  # Automatically catches up!
Node 2: LEADER  (term 6, commit index 43)
Node 3: FOLLOWER (term 6, commit index 43)

Implementation Hints:

Core state (per Raft paper Figure 2):

type RaftNode struct {
    // Persistent state (on all servers)
    currentTerm int
    votedFor    *int
    log         []LogEntry

    // Volatile state (on all servers)
    commitIndex int
    lastApplied int

    // Volatile state (on leader)
    nextIndex   map[int]int  // for each follower
    matchIndex  map[int]int  // for each follower

    // Additional
    state       State  // FOLLOWER, CANDIDATE, LEADER
    electionTimer *time.Timer
}

Leader election flow:

  1. Start as FOLLOWER with random election timeout (150-300ms)
  2. If timeout expires without heartbeat → become CANDIDATE
  3. Increment term, vote for self, request votes from peers
  4. If majority votes → become LEADER
  5. If receive AppendEntries from valid leader → return to FOLLOWER
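
A minimal sketch of the randomized timer and the transition to CANDIDATE, assuming the RaftNode struct above plus an added mu sync.Mutex and id int field; broadcastRequestVote is a hypothetical helper that collects votes and promotes the node to LEADER on a majority:

func (n *RaftNode) resetElectionTimer() {
    // Randomize in [150ms, 300ms) so followers rarely time out simultaneously
    timeout := 150*time.Millisecond + time.Duration(rand.Intn(150))*time.Millisecond
    if n.electionTimer != nil {
        n.electionTimer.Stop()
    }
    n.electionTimer = time.AfterFunc(timeout, n.startElection)
}

func (n *RaftNode) startElection() {
    n.mu.Lock()
    n.state = CANDIDATE
    n.currentTerm++        // new term for this election
    me := n.id
    n.votedFor = &me       // vote for ourselves
    term := n.currentTerm
    n.mu.Unlock()

    n.resetElectionTimer() // if this election stalls, retry with a fresh timeout
    go n.broadcastRequestVote(term)
}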

Log replication flow:

  1. Client sends command to leader
  2. Leader appends to local log
  3. Leader sends AppendEntries RPC to all followers
  4. When majority acknowledge → mark as committed
  5. Apply committed entries to state machine
  6. Respond to client
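
Step 4 can be sketched as a commit-index advance over matchIndex, assuming LogEntry carries the Term it was written in and that the leader counts itself toward the majority:

func (n *RaftNode) advanceCommitIndex() {
    // Walk backwards from the newest entry to the current commit point
    for idx := len(n.log); idx > n.commitIndex; idx-- {
        // Safety rule: only entries from the leader's current term are
        // committed by counting replicas (older ones commit indirectly)
        if n.log[idx-1].Term != n.currentTerm {
            continue
        }
        votes := 1 // the leader's own copy
        for _, m := range n.matchIndex {
            if m >= idx {
                votes++
            }
        }
        if votes > (len(n.matchIndex)+1)/2 { // strict majority of the cluster
            n.commitIndex = idx
            return
        }
    }
}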

Key implementation details:

  • Election timeout randomization prevents split votes
  • Term numbers detect stale leaders
  • Log matching property ensures consistency
  • Commit index only advances when leader’s term entries are replicated

Use the Raft TLA+ spec or the paper’s pseudocode as your guide. Test with Jepsen if you want to be thorough!

Learning milestones:

  1. Leader election works → You understand randomized timeouts and voting
  2. Log replication works → You understand the replicate-then-commit pattern
  3. Survives leader failure → You understand the safety properties
  4. Passes Raft paper’s Figure 8 scenario → You’ve implemented it correctly

Project 3: Implement the SWIM Gossip Protocol

  • File: LEARN_CONSUL_DEEP_DIVE.md
  • Main Programming Language: Go
  • Alternative Programming Languages: Rust, C++, Python
  • Coolness Level: Level 4: Hardcore Tech Flex
  • Business Potential: 1. The “Resume Gold”
  • Difficulty: Level 3: Advanced
  • Knowledge Area: Distributed Systems / Membership
  • Software or Tool: Gossip protocol (like Serf/memberlist)
  • Main Book: “SWIM: Scalable Weakly-consistent Infection-style Process Group Membership Protocol”

What you’ll build: A SWIM-based gossip protocol that provides membership management and failure detection for a cluster. This is the protocol Consul uses (via Serf/memberlist) for its LAN and WAN gossip pools.

Why it teaches Consul: Consul uses gossip for everything that doesn’t need strong consistency: knowing which nodes are alive, detecting failures, broadcasting events. Raft is expensive; gossip is cheap and scalable.

Core challenges you’ll face:

  • Failure detection with ping/ping-req → maps to indirect probing
  • Infection-style dissemination → maps to piggybacking updates
  • Suspicion mechanism → maps to reducing false positives
  • Scalability (O(n) messages vs O(n²)) → maps to why SWIM beats heartbeats

Resources for key challenges:

Key Concepts:

Difficulty: Advanced. Time estimate: 2-3 weeks. Prerequisites: Project 2 helpful but not required, understanding of UDP networking.

Real world outcome:

# Start a gossip cluster
$ ./swimnode --name=node1 --port=7001 &
$ ./swimnode --name=node2 --port=7002 --join=localhost:7001 &
$ ./swimnode --name=node3 --port=7003 --join=localhost:7001 &

# List members (from any node)
$ ./swimctl members --node=localhost:7001
node1  alive  192.168.1.10:7001
node2  alive  192.168.1.10:7002
node3  alive  192.168.1.10:7003

# Kill a node, watch failure detection
$ kill %2  # Kill node2
# Wait for failure detection (~1-2 seconds)
$ ./swimctl members --node=localhost:7001
node1  alive  192.168.1.10:7001
node2  dead   192.168.1.10:7002  (detected: 1.2s ago)
node3  alive  192.168.1.10:7003

# Broadcast an event (propagates via gossip)
$ ./swimctl event "deploy:v1.2.3" --node=localhost:7001
Event broadcast to 3 nodes

# All nodes receive the event within milliseconds
$ ./swimctl events --node=localhost:7003
[2024-01-15 10:23:45] deploy:v1.2.3

Implementation Hints:

SWIM failure detection (per protocol round):

func (n *Node) protocolRound() {
    // 1. Pick random node to probe
    target := n.randomMember()

    // 2. Send direct ping, wait for ack
    if n.ping(target, timeout) {
        return  // Node is alive
    }

    // 3. Ping failed - use indirect probing
    // Pick k random nodes to help probe
    probers := n.randomMembers(k)
    for _, prober := range probers {
        n.pingReq(prober, target)  // "Please ping target for me"
    }

    // 4. Wait for any indirect ack
    if n.waitForIndirectAck(target, timeout) {
        return  // Node is alive (via indirect probe)
    }

    // 5. Mark as suspect (not immediately dead!)
    n.suspect(target)
}

Infection-style dissemination:

  • Piggyback membership updates on ping/ack messages
  • Each update carries an incarnation number (a per-node version counter used to order conflicting updates)
  • Updates spread exponentially: each node tells a few others
  • Eventually consistent: all nodes learn about join/leave/fail

Suspicion mechanism:

  • Don’t immediately mark failed nodes as dead
  • Broadcast “suspect” message, give node time to refute
  • If node is actually alive, it sends “alive” with higher incarnation
  • Only mark dead after suspicion timeout
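
A minimal sketch of how an incoming update could be merged using incarnation numbers; the Member type and the n.members map are illustrative additions to the Node above:

type Member struct {
    Name        string
    Addr        string
    State       string // "alive", "suspect", or "dead"
    Incarnation uint64
}

func (n *Node) applyUpdate(u Member) {
    cur, ok := n.members[u.Name]
    if !ok {
        n.members[u.Name] = u // first time we hear about this node
        return
    }

    // Refutation: an update with a higher incarnation overrides suspicion
    if u.Incarnation > cur.Incarnation {
        n.members[u.Name] = u
        return
    }

    // At the same incarnation, the "worse" state wins (dead is final)
    if u.Incarnation == cur.Incarnation && severity(u.State) > severity(cur.State) {
        n.members[u.Name] = u
    }
}

func severity(state string) int {
    switch state {
    case "alive":
        return 0
    case "suspect":
        return 1
    default: // "dead"
        return 2
    }
}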

Key insight: SWIM sends O(n) messages per protocol period across the whole cluster, versus O(n²) for all-to-all heartbeating!

Learning milestones:

  1. Nodes can join and discover each other → You understand gossip join
  2. Failure detection works → You understand ping/ping-req/suspect
  3. Events propagate to all nodes → You understand dissemination
  4. False positives are minimal → You understand suspicion and incarnation

Project 4: Build a Service Registry

  • File: LEARN_CONSUL_DEEP_DIVE.md
  • Main Programming Language: Go
  • Alternative Programming Languages: Rust, Java, Python
  • Coolness Level: Level 3: Genuinely Clever
  • Business Potential: 3. The “Service & Support” Model
  • Difficulty: Level 2: Intermediate
  • Knowledge Area: Microservices / Service Discovery
  • Software or Tool: Service Registry (like Consul catalog)
  • Main Book: “Building Microservices, 2nd Edition” by Sam Newman

What you’ll build: A service registry where services register themselves, health checks run, and clients can discover healthy service instances. Combine your Raft (Project 2) and Gossip (Project 3) work, or use simpler in-memory storage.

Why it teaches Consul: This is Consul’s core use case. Service discovery seems simple until you handle: health checks, multiple instances, datacenter awareness, and the difference between “registered” and “healthy.”

Core challenges you’ll face:

  • Service registration API → maps to Consul’s agent/service/register
  • Health check execution → maps to TCP/HTTP/script checks
  • Critical vs warning states → maps to Consul’s check states
  • Service queries with filtering → maps to Consul’s catalog/service endpoint

Key Concepts:

Difficulty: Intermediate. Time estimate: 1-2 weeks. Prerequisites: Basic HTTP API development, understanding of health checks.

Real world outcome:

# Start registry server
$ ./registry --port=8500

# Register a service (via API)
$ curl -X PUT localhost:8500/v1/agent/service/register -d '{
  "name": "web",
  "id": "web-1",
  "port": 8080,
  "check": {
    "http": "http://localhost:8080/health",
    "interval": "10s"
  }
}'
{"success": true}

# Register another instance
$ curl -X PUT localhost:8500/v1/agent/service/register -d '{
  "name": "web",
  "id": "web-2",
  "port": 8081,
  "check": {
    "http": "http://localhost:8081/health",
    "interval": "10s"
  }
}'

# Query healthy instances
$ curl localhost:8500/v1/catalog/service/web?passing=true
[
  {"id": "web-1", "address": "127.0.0.1", "port": 8080, "status": "passing"},
  {"id": "web-2", "address": "127.0.0.1", "port": 8081, "status": "passing"}
]

# Stop one instance, health check fails
$ # (after 10s + grace period)
$ curl localhost:8500/v1/catalog/service/web?passing=true
[
  {"id": "web-1", "address": "127.0.0.1", "port": 8080, "status": "passing"}
]
# web-2 is still registered, but filtered out due to failing health check

Implementation Hints:

Service and check structures:

type Service struct {
    ID      string
    Name    string
    Address string
    Port    int
    Tags    []string
    Meta    map[string]string
}

type Check struct {
    ID          string
    ServiceID   string
    Type        string  // "http", "tcp", "script"
    Target      string  // URL, address, or command
    Interval    time.Duration
    Timeout     time.Duration
    Status      string  // "passing", "warning", "critical"
    LastCheck   time.Time
    Output      string
}

Health check executor:

func (r *Registry) runHealthChecks() {
    for _, check := range r.checks {
        go func(c *Check) {
            for {
                result := r.executeCheck(c)
                r.updateCheckStatus(c.ID, result)
                time.Sleep(c.Interval)
            }
        }(check)
    }
}

func (r *Registry) executeCheck(c *Check) CheckResult {
    switch c.Type {
    case "http":
        client := &http.Client{Timeout: c.Timeout}
        resp, err := client.Get(c.Target)
        if err != nil {
            return CheckResult{Status: "critical", Output: err.Error()}
        }
        defer resp.Body.Close()
        if resp.StatusCode >= 200 && resp.StatusCode < 300 {
            return CheckResult{Status: "passing", Output: resp.Status}
        }
        return CheckResult{Status: "critical", Output: resp.Status}
    case "tcp":
        // Try to dial, success = passing
        conn, err := net.DialTimeout("tcp", c.Target, c.Timeout)
        if err != nil {
            return CheckResult{Status: "critical", Output: err.Error()}
        }
        conn.Close()
        return CheckResult{Status: "passing", Output: "TCP connection succeeded"}
    case "script":
        // Execute the command; exit code 0 = passing, 1 = warning, anything else = critical
    }
    return CheckResult{Status: "critical", Output: "check type not implemented: " + c.Type}
}

Key considerations:

  • Separate “registered” from “healthy” in queries
  • Support tags for filtering (web vs web-production)
  • Handle deregistration (explicit and due to agent leaving)
  • Think about anti-entropy (what if check state is lost?)
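
A minimal sketch of the query side, keeping "registered" and "healthy" separate and filtering by tag; the r.services map and r.mu mutex are assumed fields alongside the r.checks used above:

func (r *Registry) GetHealthyInstances(name, tag string) []*Service {
    r.mu.RLock()
    defer r.mu.RUnlock()

    var out []*Service
    for _, svc := range r.services {
        if svc.Name != name {
            continue
        }
        if tag != "" && !hasTag(svc.Tags, tag) {
            continue
        }
        // Registered is not the same as healthy: every check attached to
        // this instance must currently be passing
        if r.allChecksPassing(svc.ID) {
            out = append(out, svc)
        }
    }
    return out
}

func (r *Registry) allChecksPassing(serviceID string) bool {
    for _, c := range r.checks {
        if c.ServiceID == serviceID && c.Status != "passing" {
            return false
        }
    }
    return true
}

func hasTag(tags []string, tag string) bool {
    for _, t := range tags {
        if t == tag {
            return true
        }
    }
    return false
}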

Learning milestones:

  1. Services register and list → You understand the catalog concept
  2. Health checks run and update status → You understand active health checking
  3. Queries filter by health → You understand the passing/warning/critical model
  4. Multiple instances of same service work → You understand service vs instance

Project 5: DNS-Based Service Discovery

  • File: LEARN_CONSUL_DEEP_DIVE.md
  • Main Programming Language: Go
  • Alternative Programming Languages: Rust, C, Python
  • Coolness Level: Level 4: Hardcore Tech Flex
  • Business Potential: 2. The “Micro-SaaS / Pro Tool”
  • Difficulty: Level 3: Advanced
  • Knowledge Area: Networking / DNS
  • Software or Tool: DNS Server (like Consul’s DNS interface)
  • Main Book: “DNS and BIND, 5th Edition” by Cricket Liu & Paul Albitz

What you’ll build: A DNS server that integrates with your service registry (Project 4). Queries like web.service.consul return A records for healthy instances. This is how Consul provides universal service discovery without client libraries.

Why it teaches Consul: Consul’s DNS interface is brilliant—every language/tool knows how to do DNS lookups. No SDK required. Understanding this means understanding why DNS is perfect for service discovery (and its limitations).

Core challenges you’ll face:

  • DNS protocol parsing → maps to UDP/TCP message format
  • A, AAAA, SRV record generation → maps to different record types
  • TTL and caching → maps to trade-off between freshness and load
  • Recursive vs authoritative → maps to forwarding non-.consul queries

Key Concepts:

  • DNS Protocol: “DNS and BIND” Chapter 2 - Cricket Liu
  • Consul DNS Interface: Consul DNS
  • SRV Records: RFC 2782 - for port discovery
  • DNS Message Format: RFC 1035 Sections 4-5

Difficulty: Advanced. Time estimate: 2 weeks. Prerequisites: Project 4 completed, understanding of DNS basics.

Real world outcome:

# Start DNS server (integrated with registry)
$ ./registry-dns --port=8600

# Query service via DNS
$ dig @localhost -p 8600 web.service.consul

;; ANSWER SECTION:
web.service.consul.   30   IN   A   192.168.1.10
web.service.consul.   30   IN   A   192.168.1.11

# SRV records include port information
$ dig @localhost -p 8600 web.service.consul SRV

;; ANSWER SECTION:
web.service.consul.   30   IN   SRV   1 1 8080 web-1.node.consul.
web.service.consul.   30   IN   SRV   1 1 8081 web-2.node.consul.

;; ADDITIONAL SECTION:
web-1.node.consul.    30   IN   A   192.168.1.10
web-2.node.consul.    30   IN   A   192.168.1.11

# Query with tag filter
$ dig @localhost -p 8600 production.web.service.consul

# Only returns instances tagged "production"

# Use it from any application!
$ curl http://web.service.consul:8080/api
# (assuming DNS is configured to use our server)

Implementation Hints:

DNS message structure:

type DNSMessage struct {
    Header     DNSHeader
    Questions  []DNSQuestion
    Answers    []DNSRecord
    Authority  []DNSRecord
    Additional []DNSRecord
}

type DNSHeader struct {
    ID      uint16
    Flags   uint16  // QR, Opcode, AA, TC, RD, RA, RCODE
    QDCount uint16  // Questions
    ANCount uint16  // Answers
    NSCount uint16  // Authority
    ARCount uint16  // Additional
}

Parsing .consul queries:

func (s *DNSServer) parseConsulQuery(name string) (queryType, service, tag, datacenter string) {
    // web.service.consul           -> service=web
    // production.web.service.consul -> tag=production, service=web
    // web.service.dc1.consul       -> service=web, dc=dc1

    parts := strings.Split(strings.TrimSuffix(name, ".consul."), ".")
    // Parse based on structure...
}

Generating A records:

func (s *DNSServer) handleServiceQuery(service, tag string) []DNSRecord {
    instances := s.registry.GetHealthyInstances(service, tag)

    var records []DNSRecord
    for _, inst := range instances {
        records = append(records, DNSRecord{
            Name:  service + ".service.consul.",
            Type:  1,  // A record
            Class: 1,  // IN
            TTL:   30, // Short TTL for dynamic services
            Data:  inst.Address,
        })
    }
    return records
}

Key considerations:

  • Use UDP for queries under 512 bytes, TCP for larger
  • Set short TTLs (30s) for service records—they change frequently
  • Support SRV records for port information
  • Implement forwarding for non-.consul queries
  • Handle negative caching (NXDOMAIN)
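
SRV answers need both a port and a target name; a sketch mirroring handleServiceQuery above, with the additional A records clients expect (encoding the SRV data as a string here is a simplification of the real wire format, which packs priority, weight, port, and target):

func (s *DNSServer) handleServiceSRVQuery(service, tag string) (answers, extra []DNSRecord) {
    instances := s.registry.GetHealthyInstances(service, tag)

    for _, inst := range instances {
        nodeName := inst.ID + ".node.consul."

        // SRV answer: priority 1, weight 1, the instance's port, and a node name
        answers = append(answers, DNSRecord{
            Name:  service + ".service.consul.",
            Type:  33, // SRV
            Class: 1,  // IN
            TTL:   30,
            Data:  fmt.Sprintf("1 1 %d %s", inst.Port, nodeName),
        })

        // Matching A record in the additional section saves clients a second lookup
        extra = append(extra, DNSRecord{
            Name:  nodeName,
            Type:  1, // A
            Class: 1, // IN
            TTL:   30,
            Data:  inst.Address,
        })
    }
    return answers, extra
}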

Learning milestones:

  1. Basic A record queries work → You understand DNS protocol
  2. SRV records include ports → You understand service discovery via DNS
  3. Tag filtering works → You understand Consul’s naming scheme
  4. Standard tools (dig, nslookup) work → Your implementation is correct

Project 6: Distributed Key-Value Store with Watches

  • File: LEARN_CONSUL_DEEP_DIVE.md
  • Main Programming Language: Go
  • Alternative Programming Languages: Rust, Java, Python
  • Coolness Level: Level 3: Genuinely Clever
  • Business Potential: 3. The “Service & Support” Model
  • Difficulty: Level 3: Advanced
  • Knowledge Area: Distributed Systems / Configuration
  • Software or Tool: Key-Value Store (like Consul KV)
  • Main Book: “Designing Data-Intensive Applications” by Martin Kleppmann

What you’ll build: A distributed key-value store backed by your Raft implementation (Project 2), with the ability to watch for changes. Clients can block on a key and get notified when it changes—like Consul’s blocking queries.

Why it teaches Consul: Consul’s KV store with watches enables distributed configuration. Services watch their config keys and reconfigure automatically. The blocking query pattern is essential to understand.

Core challenges you’ll face:

  • Integrating KV operations with Raft → maps to state machine application
  • Blocking queries (long-polling) → maps to Consul’s ?index= parameter
  • CAS (Check-And-Set) operations → maps to optimistic concurrency
  • Hierarchical keys and prefix queries → maps to recurse parameter

Key Concepts:

Difficulty: Advanced. Time estimate: 2 weeks. Prerequisites: Project 2 (Raft) completed.

Real world outcome:

# Write a key
$ curl -X PUT localhost:8500/v1/kv/config/database/host -d 'db.example.com'
true

# Read a key
$ curl localhost:8500/v1/kv/config/database/host
[{"Key": "config/database/host", "Value": "ZGIuZXhhbXBsZS5jb20=", "ModifyIndex": 42}]
# (Value is base64 encoded)

# Watch for changes (blocking query)
$ curl "localhost:8500/v1/kv/config/database/host?index=42&wait=5m"
# This blocks until the key changes or 5 minutes pass...

# In another terminal, update the key
$ curl -X PUT localhost:8500/v1/kv/config/database/host -d 'newdb.example.com'
true

# The watch returns immediately with new value!
[{"Key": "config/database/host", "Value": "bmV3ZGIuZXhhbXBsZS5jb20=", "ModifyIndex": 43}]

# CAS operation (only update if at expected index)
$ curl -X PUT "localhost:8500/v1/kv/config/database/host?cas=43" -d 'finaldb.example.com'
true

$ curl -X PUT "localhost:8500/v1/kv/config/database/host?cas=43" -d 'wontwork.example.com'
false  # Failed because ModifyIndex is now 44

# Recursive list
$ curl localhost:8500/v1/kv/config?recurse
[
  {"Key": "config/database/host", ...},
  {"Key": "config/database/port", ...},
  {"Key": "config/cache/host", ...}
]

Implementation Hints:

KV entry with metadata:

type KVEntry struct {
    Key         string
    Value       []byte
    Flags       uint64
    CreateIndex uint64  // Raft index when created
    ModifyIndex uint64  // Raft index when last modified
    LockIndex   uint64  // For distributed locks
    Session     string  // Associated session
}

Applying KV operations to Raft:

type KVCommand struct {
    Op    string  // "set", "delete", "cas"
    Key   string
    Value []byte
    CAS   uint64  // For check-and-set
}

func (kv *KVStore) Apply(log *raft.Log) interface{} {
    var cmd KVCommand
    json.Unmarshal(log.Data, &cmd)

    switch cmd.Op {
    case "set":
        entry := &KVEntry{
            Key:         cmd.Key,
            Value:       cmd.Value,
            ModifyIndex: log.Index,
        }
        if existing := kv.store[cmd.Key]; existing == nil {
            entry.CreateIndex = log.Index
        } else {
            entry.CreateIndex = existing.CreateIndex
        }
        kv.store[cmd.Key] = entry
        kv.notifyWatchers(cmd.Key)  // Wake up blocking queries
        return true

    case "cas":
        existing := kv.store[cmd.Key]
        if existing == nil || existing.ModifyIndex != cmd.CAS {
            return false  // CAS failed
        }
        // Apply the update...
    }
    return nil
}

Blocking query implementation:

func (kv *KVStore) WatchKey(key string, index uint64, timeout time.Duration) *KVEntry {
    // If current ModifyIndex > requested index, return immediately
    if entry := kv.store[key]; entry != nil && entry.ModifyIndex > index {
        return entry
    }

    // Otherwise, wait for notification or timeout
    ch := kv.addWatcher(key)
    defer kv.removeWatcher(key, ch)

    select {
    case <-ch:
        return kv.store[key]
    case <-time.After(timeout):
        return kv.store[key]  // Return current value on timeout
    }
}
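
On the client side, the blocking query is just a loop that re-issues the GET with the last ModifyIndex it saw; a sketch against the HTTP API shown earlier (reconfigure is a hypothetical callback):

func watchConfig(key string) {
    var index uint64
    for {
        url := fmt.Sprintf("http://localhost:8500/v1/kv/%s?index=%d&wait=5m", key, index)
        resp, err := http.Get(url) // blocks until the key changes or the wait elapses
        if err != nil {
            time.Sleep(time.Second) // transport error: back off and retry
            continue
        }

        var entries []KVEntry
        json.NewDecoder(resp.Body).Decode(&entries)
        resp.Body.Close()

        if len(entries) > 0 && entries[0].ModifyIndex > index {
            index = entries[0].ModifyIndex
            reconfigure(entries[0].Value) // hypothetical: apply the new config bytes
        }
    }
}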

Learning milestones:

  1. Basic KV operations replicated via Raft → You understand state machine replication
  2. Blocking queries work → You understand the long-polling pattern
  3. CAS prevents race conditions → You understand optimistic concurrency
  4. Prefix queries work → You understand hierarchical key design

Project 7: Session and Lock Manager

  • File: LEARN_CONSUL_DEEP_DIVE.md
  • Main Programming Language: Go
  • Alternative Programming Languages: Rust, Java, Python
  • Coolness Level: Level 3: Genuinely Clever
  • Business Potential: 3. The “Service & Support” Model
  • Difficulty: Level 3: Advanced
  • Knowledge Area: Distributed Systems / Coordination
  • Software or Tool: Distributed Lock Manager (like Consul sessions)
  • Main Book: “Designing Data-Intensive Applications” by Martin Kleppmann

What you’ll build: A distributed lock/session manager. Sessions are tied to node health, and locks are released when sessions expire. This enables leader election and distributed coordination patterns.

Why it teaches Consul: Consul’s sessions are the foundation for distributed locking. Understanding how sessions tie to health checks, and how lock release works, is crucial for building reliable distributed systems.

Core challenges you’ll face:

  • Session lifecycle (create, renew, destroy) → maps to ephemeral state
  • Lock acquisition and release → maps to distributed mutual exclusion
  • Session invalidation on node failure → maps to gossip integration
  • Leader election pattern → maps to contended lock acquisition

Key Concepts:

  • Consul Sessions: Consul Sessions
  • Distributed Locks: “Designing Data-Intensive Applications” Chapter 8 - Martin Kleppmann
  • Leader Election: Consul Leader Election
  • Session Behavior: “delete” vs “release” on invalidation

Difficulty: Advanced. Time estimate: 2 weeks. Prerequisites: Projects 3, 4, 6 completed.

Real world outcome:

# Create a session (tied to node health)
$ curl -X PUT localhost:8500/v1/session/create -d '{
  "Name": "my-leader-election",
  "TTL": "15s",
  "LockDelay": "10s",
  "Behavior": "delete"
}'
{"ID": "session-abc-123"}

# Acquire a lock
$ curl -X PUT "localhost:8500/v1/kv/service/leader?acquire=session-abc-123" -d 'node-1'
true  # Lock acquired!

# Another node tries to acquire the same lock
$ curl -X PUT "localhost:8500/v1/kv/service/leader?acquire=session-def-456" -d 'node-2'
false  # Lock already held

# Check who holds the lock
$ curl localhost:8500/v1/kv/service/leader
[{"Key": "service/leader", "Value": "bm9kZS0x", "Session": "session-abc-123", ...}]

# If node-1's session expires (TTL) or node fails (gossip), lock is released
# After lock-delay, node-2 can acquire it

# Explicit release
$ curl -X PUT "localhost:8500/v1/kv/service/leader?release=session-abc-123"
true

# Renew session before TTL expires
$ curl -X PUT localhost:8500/v1/session/renew/session-abc-123
[{"ID": "session-abc-123", "TTL": "15s", ...}]

Implementation Hints:

Session structure:

type Session struct {
    ID          string
    Name        string
    Node        string        // Owning node
    TTL         time.Duration
    LockDelay   time.Duration // Wait before reassigning locks
    Behavior    string        // "release" or "delete"
    Checks      []string      // Health checks that must pass
    CreateIndex uint64
    LastRenew   time.Time
}

Lock acquisition flow:

func (s *SessionManager) AcquireLock(key, sessionID, value string) bool {
    session := s.sessions[sessionID]
    if session == nil || !s.isSessionValid(session) {
        return false
    }

    entry := s.kv.Get(key)

    // Already locked by another session?
    if entry != nil && entry.Session != "" && entry.Session != sessionID {
        return false
    }

    // Acquire the lock
    s.kv.SetWithSession(key, value, sessionID)
    return true
}

Session invalidation (on node failure or TTL):

func (s *SessionManager) InvalidateSession(sessionID string) {
    session := s.sessions[sessionID]
    if session == nil {
        return
    }

    // Find all keys locked by this session
    lockedKeys := s.kv.GetKeysBySession(sessionID)

    for _, key := range lockedKeys {
        switch session.Behavior {
        case "release":
            // Remove session association, keep value
            s.kv.ClearSession(key)
        case "delete":
            // Delete the key entirely
            s.kv.Delete(key)
        }
    }

    // After LockDelay, these keys can be acquired by other sessions
    delete(s.sessions, sessionID)
}

Integration with gossip:

  • When a node is marked as failed by SWIM, invalidate its sessions
  • Sessions can also be tied to specific health checks
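
TTL enforcement can be a background loop that invalidates any session whose TTL has elapsed since its last renewal; a minimal sketch using the Session fields above (locking around s.sessions is omitted):

func (s *SessionManager) reapExpiredSessions() {
    ticker := time.NewTicker(time.Second)
    defer ticker.Stop()

    for range ticker.C {
        now := time.Now()
        for id, sess := range s.sessions {
            // A TTL of zero means the session only dies via explicit destroy
            // or a node failure reported by gossip
            if sess.TTL > 0 && now.Sub(sess.LastRenew) > sess.TTL {
                s.InvalidateSession(id) // releases or deletes its locks per Behavior
            }
        }
    }
}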

Learning milestones:

  1. Session create/renew/destroy works → You understand session lifecycle
  2. Locks are exclusive → You understand distributed locking
  3. Session expiry releases locks → You understand TTL mechanism
  4. Node failure releases locks → You understand gossip integration

Project 8: Service Mesh Sidecar Proxy

  • File: LEARN_CONSUL_DEEP_DIVE.md
  • Main Programming Language: Go
  • Alternative Programming Languages: Rust, C++
  • Coolness Level: Level 4: Hardcore Tech Flex
  • Business Potential: 4. The “Open Core” Infrastructure
  • Difficulty: Level 4: Expert
  • Knowledge Area: Networking / Security
  • Software or Tool: Sidecar Proxy (like Envoy/Consul Connect proxy)
  • Main Book: “Zero Trust Networks, 2nd Edition” by Evan Gilman & Doug Barth

What you’ll build: A simple sidecar proxy that intercepts service traffic, establishes mTLS connections with other proxies, and enforces service-to-service authorization (intentions).

Why it teaches Consul: Consul Connect is the service mesh. Understanding how the sidecar intercepts traffic, validates certificates, and checks intentions is understanding modern zero-trust networking.

Core challenges you’ll face:

  • TLS termination and origination → maps to mTLS implementation
  • Certificate management → maps to CA integration
  • Intention checking → maps to authorization decisions
  • Transparent proxying → maps to iptables/localhost binding

Key Concepts:

Difficulty: Expert. Time estimate: 3-4 weeks. Prerequisites: Strong understanding of TLS, Projects 4-6 completed.

Real world outcome:

# Start a service (not aware of mesh)
$ python3 -m http.server 8080 &

# Start sidecar for the service
$ ./sidecar --service=web --port=8080 --listen=21000

# The sidecar:
# 1. Gets certificates from CA (or Consul)
# 2. Listens on 21000 for inbound mTLS connections
# 3. Forwards decrypted traffic to localhost:8080

# Start client-side sidecar (for making outbound connections)
$ ./sidecar --service=client --upstream=web:9090

# The client app connects to localhost:9090 (plaintext)
# Sidecar establishes mTLS to web's sidecar
# Traffic flows encrypted between sidecars

# Test the connection
$ curl localhost:9090
# Works! Traffic went: curl -> client-sidecar -> mTLS -> web-sidecar -> web-service

# Add an intention (deny client -> web)
$ ./intentctl deny client web

# Now the request fails
$ curl localhost:9090
Error: Connection refused by intention

# Allow it
$ ./intentctl allow client web
$ curl localhost:9090
# Works again!

Implementation Hints:

Sidecar architecture:

type Sidecar struct {
    serviceName string
    localPort   int          // Where the actual service listens
    listenPort  int          // Where we accept inbound mTLS
    upstreams   []Upstream   // Outbound services we proxy to
    cert        tls.Certificate
    caCert      *x509.CertPool
}

type Upstream struct {
    ServiceName string
    LocalPort   int    // localhost:LocalPort -> remote service
}

Inbound connection handler:

func (s *Sidecar) handleInbound(conn net.Conn) {
    // 1. TLS handshake (we require client cert)
    tlsConn := tls.Server(conn, &tls.Config{
        Certificates: []tls.Certificate{s.cert},
        ClientAuth:   tls.RequireAndVerifyClientCert,
        ClientCAs:    s.caCert,
    })

    if err := tlsConn.Handshake(); err != nil {
        log.Printf("TLS handshake failed: %v", err)
        return
    }

    // 2. Extract client service identity from cert
    clientService := extractServiceName(tlsConn.ConnectionState().PeerCertificates[0])

    // 3. Check intention
    if !s.checkIntention(clientService, s.serviceName) {
        log.Printf("Intention denied: %s -> %s", clientService, s.serviceName)
        tlsConn.Close()
        return
    }

    // 4. Forward to local service
    localConn, err := net.Dial("tcp", fmt.Sprintf("localhost:%d", s.localPort))
    if err != nil {
        log.Printf("dial to local service failed: %v", err)
        tlsConn.Close()
        return
    }
    go io.Copy(localConn, tlsConn)
    io.Copy(tlsConn, localConn)
}

Outbound proxy (upstream):

func (s *Sidecar) handleOutbound(upstream Upstream, conn net.Conn) {
    // 1. Discover the upstream service (via DNS or registry)
    instances := s.discover(upstream.ServiceName)
    target := instances[0]  // Load balancing here

    // 2. Establish mTLS connection to target's sidecar
    tlsConn, err := tls.Dial("tcp", target, &tls.Config{
        Certificates: []tls.Certificate{s.cert},
        RootCAs:      s.caCert,
        ServerName:   upstream.ServiceName,  // SNI for routing
    })
    if err != nil {
        log.Printf("mTLS dial to %s failed: %v", upstream.ServiceName, err)
        conn.Close()
        return
    }

    // 3. Proxy traffic in both directions
    go io.Copy(tlsConn, conn)
    io.Copy(conn, tlsConn)
}

Certificate identity (SPIFFE-style):

spiffe://trust-domain/ns/default/dc/dc1/svc/web

The service name is embedded in the certificate’s SAN (Subject Alternative Name).
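
The extractServiceName helper used in handleInbound can read that identity straight from the certificate's URI SANs; a sketch:

func extractServiceName(cert *x509.Certificate) string {
    for _, uri := range cert.URIs {
        // Expect spiffe://trust-domain/ns/default/dc/dc1/svc/<service>
        if uri.Scheme != "spiffe" {
            continue
        }
        parts := strings.Split(strings.Trim(uri.Path, "/"), "/")
        for i := 0; i+1 < len(parts); i++ {
            if parts[i] == "svc" {
                return parts[i+1]
            }
        }
    }
    return "" // no SPIFFE identity; caller should treat the peer as unauthenticated
}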

Learning milestones:

  1. mTLS connections work → You understand certificate-based identity
  2. Intentions block/allow correctly → You understand authorization
  3. Transparent to application → You understand the sidecar pattern
  4. Works with your service registry → You’ve integrated the pieces

Project 9: Certificate Authority (CA) for Service Mesh

  • File: LEARN_CONSUL_DEEP_DIVE.md
  • Main Programming Language: Go
  • Alternative Programming Languages: Rust, Python
  • Coolness Level: Level 4: Hardcore Tech Flex
  • Business Potential: 4. The “Open Core” Infrastructure
  • Difficulty: Level 4: Expert
  • Knowledge Area: Security / PKI
  • Software or Tool: Certificate Authority (like Consul’s built-in CA)
  • Main Book: “Serious Cryptography, 2nd Edition” by Jean-Philippe Aumasson

What you’ll build: A Certificate Authority that issues short-lived certificates to services, supporting automatic rotation. This is the trust foundation of your service mesh.

Why it teaches Consul: Consul can act as a CA or integrate with Vault. Understanding how certificates are issued, rotated, and trusted is fundamental to understanding Connect’s security model.

Core challenges you’ll face:

  • Generating CA root and intermediate certs → maps to PKI hierarchy
  • Issuing leaf certificates on demand → maps to CSR processing
  • Short-lived certs and rotation → maps to ephemeral identity
  • Certificate distribution → maps to how sidecars get certs

Key Concepts:

  • PKI Fundamentals: “Serious Cryptography” Chapter 12 - Jean-Philippe Aumasson
  • SPIFFE/SPIRE: SPIFFE Standards
  • Consul CA: Consul Connect CA
  • Certificate Rotation: Short TTLs vs long-lived certs

Difficulty: Expert. Time estimate: 2-3 weeks. Prerequisites: Understanding of TLS/PKI, Projects 4-8 helpful.

Real world outcome:

# Initialize CA (creates root and intermediate)
$ ./meshca init --domain=consul --ttl=8760h
Root CA created: /certs/root.pem
Intermediate CA created: /certs/intermediate.pem

# Start CA server
$ ./meshca serve --port=8200

# Service requests a certificate
$ curl -X POST localhost:8200/v1/connect/ca/leaf/web -d '{
  "csr": "-----BEGIN CERTIFICATE REQUEST-----\n..."
}'
{
  "certificate": "-----BEGIN CERTIFICATE-----\n...",
  "private_key": null,  # We don't see it (service generated CSR)
  "valid_after": "2024-01-15T10:00:00Z",
  "valid_before": "2024-01-15T22:00:00Z",  # 12-hour TTL
  "service": "web",
  "spiffe_id": "spiffe://consul/ns/default/dc/dc1/svc/web"
}

# Get root certificates (for verification)
$ curl localhost:8200/v1/connect/ca/roots
{
  "roots": [{
    "id": "abc123",
    "root_cert": "-----BEGIN CERTIFICATE-----\n...",
    "active": true
  }]
}

# Rotate root CA (zero downtime)
$ ./meshca rotate
New root created, old root still valid during transition...

Implementation Hints:

CA structure:

type MeshCA struct {
    rootKey     *ecdsa.PrivateKey
    rootCert    *x509.Certificate
    interKey    *ecdsa.PrivateKey
    interCert   *x509.Certificate
    trustDomain string
}

Issuing a leaf certificate:

func (ca *MeshCA) IssueCert(serviceName string, csr *x509.CertificateRequest) (*x509.Certificate, error) {
    // Generate the SPIFFE ID and embed it as the certificate's URI SAN
    spiffeID := &url.URL{
        Scheme: "spiffe",
        Host:   ca.trustDomain,
        Path:   fmt.Sprintf("/ns/default/dc/dc1/svc/%s", serviceName),
    }

    // Create certificate template
    template := &x509.Certificate{
        SerialNumber: generateSerial(),
        Subject:      csr.Subject,
        NotBefore:    time.Now(),
        NotAfter:     time.Now().Add(12 * time.Hour),  // Short-lived!
        KeyUsage:     x509.KeyUsageDigitalSignature,
        ExtKeyUsage: []x509.ExtKeyUsage{
            x509.ExtKeyUsageClientAuth,
            x509.ExtKeyUsageServerAuth,
        },
        URIs: []*url.URL{spiffeID},
    }

    // Sign with intermediate CA
    certDER, err := x509.CreateCertificate(
        rand.Reader, template, ca.interCert, csr.PublicKey, ca.interKey)
    if err != nil {
        return nil, err
    }

    return x509.ParseCertificate(certDER)
}
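
The init step boils down to a self-signed root built with crypto/x509; a minimal sketch (ECDSA P-256, ten-year lifetime, with the intermediate generated the same way but signed by this root):

func newRootCA(trustDomain string) (*x509.Certificate, *ecdsa.PrivateKey, error) {
    key, err := ecdsa.GenerateKey(elliptic.P256(), rand.Reader)
    if err != nil {
        return nil, nil, err
    }
    template := &x509.Certificate{
        SerialNumber:          big.NewInt(1),
        Subject:               pkix.Name{CommonName: trustDomain + " Root CA"},
        NotBefore:             time.Now(),
        NotAfter:              time.Now().AddDate(10, 0, 0),
        IsCA:                  true,
        KeyUsage:              x509.KeyUsageCertSign | x509.KeyUsageCRLSign,
        BasicConstraintsValid: true,
    }
    // Self-signed: the template is both subject and issuer
    der, err := x509.CreateCertificate(rand.Reader, template, template, &key.PublicKey, key)
    if err != nil {
        return nil, nil, err
    }
    cert, err := x509.ParseCertificate(der)
    return cert, key, err
}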

Root rotation strategy:

  1. Generate new root and intermediate
  2. Add new root to trust bundle (both old and new trusted)
  3. Start issuing certs signed by new intermediate
  4. After old certs expire (max TTL), remove old root from bundle

Key insight: Short-lived certs (hours, not years) make revocation unnecessary—just don’t renew!

Learning milestones:

  1. CA issues valid certificates → You understand PKI basics
  2. Sidecars can authenticate each other → Your certs work in practice
  3. Rotation doesn’t break connections → You understand trust bundle management
  4. SPIFFE IDs are correct → You understand service identity

Project 10: Multi-Datacenter Federation

  • File: LEARN_CONSUL_DEEP_DIVE.md
  • Main Programming Language: Go
  • Alternative Programming Languages: Rust, Java
  • Coolness Level: Level 4: Hardcore Tech Flex
  • Business Potential: 4. The “Open Core” Infrastructure
  • Difficulty: Level 4: Expert
  • Knowledge Area: Distributed Systems / Networking
  • Software or Tool: Multi-DC Federation (like Consul WAN)
  • Main Book: “Designing Data-Intensive Applications” by Martin Kleppmann

What you’ll build: Extend your Consul clone to support multiple datacenters. Each DC has its own Raft cluster, but they’re connected via WAN gossip for cross-DC service discovery and failover.

Why it teaches Consul: Multi-DC is where distributed systems get really interesting. How do you maintain consistency? How do you route across DCs? Consul’s approach—independent Raft clusters with WAN gossip—is elegant.

Core challenges you’ll face:

  • WAN gossip pool (servers only) → maps to separate gossip with higher latency tolerance
  • Cross-DC queries → maps to query forwarding to remote DC
  • Prepared queries with failover → maps to automatic DC failover
  • Replication vs forwarding → maps to where to evaluate queries

Key Concepts:

Difficulty: Expert. Time estimate: 3-4 weeks. Prerequisites: Projects 2, 3, 4 completed, understanding of multi-region architecture.

Real world outcome:

# DC1 cluster
$ ./consul-clone agent -server -datacenter=dc1 -wan-join=dc2-server:8302

# DC2 cluster
$ ./consul-clone agent -server -datacenter=dc2 -wan-join=dc1-server:8302

# List all DCs (via WAN gossip)
$ curl localhost:8500/v1/catalog/datacenters
["dc1", "dc2"]

# Register service in DC1
$ curl -X PUT localhost:8500/v1/agent/service/register -d '{
  "name": "api",
  "port": 8080
}'

# Query service from DC2 (cross-DC)
$ curl dc2-server:8500/v1/catalog/service/api?dc=dc1
[{"Node": "node1", "Datacenter": "dc1", "ServicePort": 8080, ...}]

# Create prepared query with failover
$ curl -X POST localhost:8500/v1/query -d '{
  "Name": "api-failover",
  "Service": {
    "Service": "api",
    "Failover": {
      "Datacenters": ["dc2", "dc3"]
    }
  }
}'

# When DC1's api is unhealthy, query automatically returns DC2's api
$ curl localhost:8500/v1/query/api-failover/execute
# Returns DC2 instances if DC1 has none healthy

Implementation Hints:

WAN gossip configuration:

type WANGossipConfig struct {
    BindPort        int    // 8302 by default
    SuspicionMult   int    // Higher for WAN (network is less reliable)
    ProbeInterval   time.Duration  // Longer for WAN
    PushPullInterval time.Duration // Longer for WAN
}

Cross-DC query forwarding:

func (s *Server) handleCatalogQuery(dc string, service string) []ServiceInstance {
    if dc == s.datacenter {
        // Local query
        return s.catalog.GetService(service)
    }

    // Find a server in the target DC (via WAN gossip)
    remoteServer := s.wanGossip.GetServerInDC(dc)
    if remoteServer == nil {
        return nil  // DC unreachable
    }

    // Forward query
    return s.forwardQuery(remoteServer, service)
}

Prepared query failover:

func (s *Server) executePreparedQuery(query *PreparedQuery) []ServiceInstance {
    // Try primary DC first
    results := s.catalog.GetService(query.Service.Service)
    healthyResults := filterHealthy(results)

    if len(healthyResults) > 0 {
        return healthyResults
    }

    // Failover to other DCs in order
    for _, dc := range query.Service.Failover.Datacenters {
        results := s.handleCatalogQuery(dc, query.Service.Service)
        healthyResults := filterHealthy(results)
        if len(healthyResults) > 0 {
            return healthyResults
        }
    }

    return nil
}

Learning milestones:

  1. WAN gossip connects DCs → You understand the WAN pool
  2. Cross-DC queries work → You understand forwarding
  3. Failover routes to other DCs → You understand prepared queries
  4. Network partition is handled gracefully → You understand failure modes

Project 11: ACL System with Tokens and Policies

  • File: LEARN_CONSUL_DEEP_DIVE.md
  • Main Programming Language: Go
  • Alternative Programming Languages: Rust, Java
  • Coolness Level: Level 3: Genuinely Clever
  • Business Potential: 3. The “Service & Support” Model
  • Difficulty: Level 3: Advanced
  • Knowledge Area: Security / Access Control
  • Software or Tool: ACL System (like Consul ACLs)
  • Main Book: “Foundations of Information Security” by Jason Andress

What you’ll build: An ACL system with tokens, policies, and roles. Tokens authenticate requests, policies define permissions, and roles group policies. Secure your Consul clone!

Why it teaches Consul: Security is not optional. Consul’s ACL system secures everything: who can register services, who can read KV, who can modify intentions. Understanding this completes your Consul knowledge.

Core challenges you’ll face:

  • Token management (create, update, delete) → maps to identity management
  • Policy language (HCL-like) → maps to permission specification
  • Permission resolution → maps to policy inheritance and roles
  • Token replication → maps to Raft-based security state

Key Concepts:

Difficulty: Advanced. Time estimate: 2 weeks. Prerequisites: Projects 2, 4, 6 completed.

Real world outcome:

# Bootstrap ACL system (creates initial management token)
$ curl -X PUT localhost:8500/v1/acl/bootstrap
{"AccessorID": "root-accessor", "SecretID": "root-secret-token"}

# Create a policy
$ curl -X PUT localhost:8500/v1/acl/policy -H "X-Consul-Token: root-secret-token" -d '{
  "Name": "web-service-policy",
  "Rules": "service \"web\" { policy = \"write\" }\nkey_prefix \"config/web/\" { policy = \"read\" }"
}'

# Create a token with this policy
$ curl -X PUT localhost:8500/v1/acl/token -H "X-Consul-Token: root-secret-token" -d '{
  "Policies": [{"Name": "web-service-policy"}]
}'
{"SecretID": "web-token-123"}

# Use the token to register service (allowed)
$ curl -X PUT localhost:8500/v1/agent/service/register \
  -H "X-Consul-Token: web-token-123" \
  -d '{"name": "web", "port": 8080}'
# Success!

# Try to register a different service (denied)
$ curl -X PUT localhost:8500/v1/agent/service/register \
  -H "X-Consul-Token: web-token-123" \
  -d '{"name": "api", "port": 9090}'
{"error": "Permission denied: service \"api\" write"}

# Read allowed KV prefix
$ curl localhost:8500/v1/kv/config/web/setting -H "X-Consul-Token: web-token-123"
# Success!

# Read denied KV prefix
$ curl localhost:8500/v1/kv/secrets/password -H "X-Consul-Token: web-token-123"
{"error": "Permission denied"}

Implementation Hints:

Policy structure:

type Policy struct {
    ID          string
    Name        string
    Description string
    Rules       []Rule
}

type Rule struct {
    Resource   string  // "service", "key", "node", "agent", etc.
    Segment    string  // The name/prefix (e.g., "web", "config/")
    Policy     string  // "read", "write", "deny"
    Intentions string  // For service: "read", "write"
}

Token structure:

type Token struct {
    AccessorID  string    // Public identifier
    SecretID    string    // The actual token value
    Description string
    Policies    []string  // Policy IDs
    Roles       []string  // Role IDs
    Local       bool      // DC-local or global
    CreateTime  time.Time
    ExpirationTime *time.Time
}

Permission checking:

func (acl *ACLResolver) CheckPermission(token, resource, segment, action string) bool {
    // 1. Look up token
    t := acl.tokens[token]
    if t == nil {
        return false  // Anonymous - check default policy
    }

    // 2. Gather all policies (from token and roles)
    policies := acl.gatherPolicies(t)

    // 3. Check each policy for a matching rule.
    // (This sketch returns on the first match; a fuller implementation would
    //  resolve the most specific rule and let an explicit deny override allow.)
    for _, policy := range policies {
        for _, rule := range policy.Rules {
            if rule.Resource == resource && matchesSegment(rule.Segment, segment) {
                if rule.Policy == "deny" {
                    return false
                }
                if rule.Policy == action || rule.Policy == "write" {
                    return true
                }
            }
        }
    }

    // 4. Default deny
    return false
}

Policy language parsing (simplified):

service "web" { policy = "write" }
key_prefix "config/" { policy = "read" }
node "" { policy = "read" }  # Empty = all nodes

Learning milestones:

  1. Tokens authenticate requests → You understand identity
  2. Policies grant permissions → You understand authorization
  3. Roles group policies → You understand RBAC
  4. Default-deny works → You understand security principles

Project 12: Consul Agent (Complete Implementation)

  • File: LEARN_CONSUL_DEEP_DIVE.md
  • Main Programming Language: Go
  • Alternative Programming Languages: Rust
  • Coolness Level: Level 5: Pure Magic
  • Business Potential: 5. The “Industry Disruptor”
  • Difficulty: Level 5: Master
  • Knowledge Area: Distributed Systems / Full Stack
  • Software or Tool: Consul Agent Clone
  • Main Book: All previous books combined

What you’ll build: A complete Consul agent that can run as either server or client mode. It integrates all previous projects: Raft, gossip, service registry, DNS, KV, sessions, mesh, and ACLs.

Why it teaches Consul: This is the capstone. You’ll understand how all the pieces fit together, how the agent lifecycle works, and what happens when you run consul agent.

Core challenges you’ll face:

  • Mode switching (server vs client) → maps to agent architecture
  • Graceful shutdown → maps to leave vs fail
  • Configuration management → maps to HCL/JSON config files
  • HTTP/gRPC API serving → maps to API surface

Key Concepts:

Difficulty: Master. Time estimate: 2-3 months. Prerequisites: All previous projects completed.

Real world outcome:

# Start servers (just like real Consul!)
$ ./consul-clone agent -server -bootstrap-expect=3 -datacenter=dc1 -data-dir=/data/consul

# Start clients
$ ./consul-clone agent -datacenter=dc1 -join=server1:8301

# All standard Consul commands work
$ ./consul-clone members
Node     Address             Status  Type    DC
server1  192.168.1.10:8301   alive   server  dc1
server2  192.168.1.11:8301   alive   server  dc1
server3  192.168.1.12:8301   alive   server  dc1
client1  192.168.1.20:8301   alive   client  dc1

$ ./consul-clone kv put config/key value
Success!

$ ./consul-clone catalog services
api
web

$ ./consul-clone operator raft list-peers
Node     ID      Address             State   Voter
server1  abc123  192.168.1.10:8300   leader  true
server2  def456  192.168.1.11:8300   follower true
server3  ghi789  192.168.1.12:8300   follower true

# DNS works
$ dig @localhost -p 8600 web.service.consul
...

# HTTP API works
$ curl localhost:8500/v1/status/leader
"192.168.1.10:8300"

Implementation Hints:

Agent structure:

type Agent struct {
    config      *Config
    mode        AgentMode  // SERVER or CLIENT

    // Server-only components
    raft        *Raft
    fsm         *StateMachine
    leaderCh    <-chan bool

    // All agents
    lanGossip   *GossipPool
    wanGossip   *GossipPool  // Servers only

    // Services
    catalog     *Catalog
    kvStore     *KVStore
    sessions    *SessionManager
    acl         *ACLResolver
    connect     *ConnectManager

    // Interfaces
    httpServer  *http.Server
    grpcServer  *grpc.Server
    dnsServer   *DNSServer

    // Lifecycle
    shutdownCh  chan struct{}
}

Agent startup sequence:

func (a *Agent) Start() error {
    // 1. Load configuration
    if err := a.loadConfig(); err != nil {
        return err
    }

    // 2. Initialize gossip
    a.lanGossip = NewGossipPool(a.config.LANConfig)
    if a.mode == SERVER {
        a.wanGossip = NewGossipPool(a.config.WANConfig)
    }

    // 3. Initialize Raft (servers only)
    if a.mode == SERVER {
        a.raft = NewRaft(a.config.RaftConfig)
        a.raft.Start()
    }

    // 4. Initialize services
    a.catalog = NewCatalog(a.raft)
    a.kvStore = NewKVStore(a.raft)
    a.sessions = NewSessionManager(a.kvStore, a.lanGossip)

    // 5. Start API servers (the gRPC server needs an explicit listener)
    go a.httpServer.ListenAndServe()
    grpcLis, _ := net.Listen("tcp", ":8502")
    go a.grpcServer.Serve(grpcLis)
    go a.dnsServer.ListenAndServe()

    // 6. Join cluster
    if len(a.config.RetryJoin) > 0 {
        a.lanGossip.Join(a.config.RetryJoin)
    }

    return nil
}
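
Step 1's loadConfig can be as simple as decoding a JSON file into a Config struct. The field names below are illustrative and trimmed (the gossip and Raft sub-configs referenced in Start would live here too), and a.configPath is an assumed field holding the config file path:

type Config struct {
    Datacenter      string   `json:"datacenter"`
    DataDir         string   `json:"data_dir"`
    Server          bool     `json:"server"`
    BootstrapExpect int      `json:"bootstrap_expect"`
    RetryJoin       []string `json:"retry_join"`
    Ports           struct {
        HTTP int `json:"http"`
        DNS  int `json:"dns"`
        Serf int `json:"serf_lan"`
    } `json:"ports"`
}

func (a *Agent) loadConfig() error {
    f, err := os.Open(a.configPath) // e.g. the value of a -config-file flag
    if err != nil {
        return err
    }
    defer f.Close()

    a.config = &Config{Datacenter: "dc1"} // set defaults, then overlay the file
    return json.NewDecoder(f).Decode(a.config)
}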

Graceful shutdown:

func (a *Agent) Shutdown() {
    // 1. Leave gossip (so others know we're leaving intentionally)
    a.lanGossip.Leave()
    if a.wanGossip != nil {
        a.wanGossip.Leave()
    }

    // 2. Stop accepting new connections
    a.httpServer.Shutdown(context.Background())
    a.grpcServer.GracefulStop()
    a.dnsServer.Shutdown()

    // 3. Step down as leader if applicable
    if a.raft != nil {
        a.raft.Shutdown()
    }

    // 4. Close remaining connections
    close(a.shutdownCh)
}

Learning milestones:

  1. Agent starts and joins cluster → Integration works
  2. All APIs function correctly → You’ve built a Consul clone
  3. Graceful shutdown preserves data → Lifecycle is correct
  4. You can replace real Consul in tests → You truly understand Consul

Project Comparison Table

| Project | Difficulty | Time | Depth of Understanding | Fun Factor |
| --- | --- | --- | --- | --- |
| 1. Simple KV with Replication | Intermediate | 1 week | ⭐⭐ | ⭐⭐⭐ |
| 2. Raft Consensus | Expert | 3-4 weeks | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ |
| 3. SWIM Gossip Protocol | Advanced | 2-3 weeks | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐ |
| 4. Service Registry | Intermediate | 1-2 weeks | ⭐⭐⭐ | ⭐⭐⭐ |
| 5. DNS Service Discovery | Advanced | 2 weeks | ⭐⭐⭐⭐ | ⭐⭐⭐⭐ |
| 6. KV Store with Watches | Advanced | 2 weeks | ⭐⭐⭐⭐ | ⭐⭐⭐ |
| 7. Session and Lock Manager | Advanced | 2 weeks | ⭐⭐⭐⭐ | ⭐⭐⭐ |
| 8. Sidecar Proxy | Expert | 3-4 weeks | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ |
| 9. Certificate Authority | Expert | 2-3 weeks | ⭐⭐⭐⭐ | ⭐⭐⭐⭐ |
| 10. Multi-DC Federation | Expert | 3-4 weeks | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐ |
| 11. ACL System | Advanced | 2 weeks | ⭐⭐⭐ | ⭐⭐⭐ |
| 12. Complete Agent | Master | 2-3 months | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ |

Foundational Track (Understand the Core)

  1. Project 1: Simple KV - Feel the pain of distributed systems
  2. Project 2: Raft - THE foundational project
  3. Project 3: SWIM Gossip - The other foundational protocol

Service Discovery Track

  1. Project 4: Service Registry - Core use case
  2. Project 5: DNS Interface - Universal access
  3. Project 6: KV with Watches - Configuration management

Service Mesh Track

  1. Project 8: Sidecar Proxy - How Connect works
  2. Project 9: CA - Trust foundation
  3. Project 7: Sessions/Locks - Coordination primitives

Production-Ready Track

  1. Project 11: ACL System - Security
  2. Project 10: Multi-DC - Scale out
  3. Project 12: Complete Agent - Everything together

Final Capstone: Production Consul Cluster Simulator

  • File: LEARN_CONSUL_DEEP_DIVE.md
  • Main Programming Language: Go
  • Alternative Programming Languages: Rust
  • Coolness Level: Level 5: Pure Magic
  • Business Potential: 5. The “Industry Disruptor”
  • Difficulty: Level 5: Master
  • Knowledge Area: Distributed Systems / Chaos Engineering
  • Software or Tool: Chaos Testing Framework
  • Main Book: “Designing Data-Intensive Applications” by Martin Kleppmann

What you’ll build: A simulation environment that runs multiple Consul agents, injects failures (network partitions, node crashes, slow disks), and verifies correctness. Like Jepsen for your Consul clone.

Why it teaches mastery: Building systems is one thing; proving they work under failure is another. This project will stress-test your implementation and teach you what “production-ready” really means.

Real world outcome:

$ ./chaos-consul simulate --nodes=5 --duration=10m

[00:00] Starting 5 nodes (3 servers, 2 clients)
[00:05] All nodes healthy, leader elected: node-1
[00:30] INJECT: Network partition between node-1 and {node-3, node-4, node-5}
[00:32] Leader election started (node-1 isolated)
[00:33] New leader: node-3
[00:35] VERIFY: All writes succeed to new leader ✓
[00:35] VERIFY: Old leader rejects writes ✓
[01:00] HEAL: Network partition resolved
[01:02] node-1 becomes follower, syncs log ✓
[02:00] INJECT: node-2 crash
[02:01] Gossip detects failure within 2s ✓
[02:03] Services on node-2 marked unhealthy ✓
[05:00] INJECT: Slow disk on leader (100ms latency)
[05:10] Write latency increased but system operational ✓
[10:00] Simulation complete

Results:
- Writes: 10,234 success, 0 lost
- Reads: 50,123 success, 3 stale (during partition, expected)
- Leader elections: 3 (all completed < 5s)
- Consistency violations: 0

PASS: Your Consul clone is production-ready!
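
If the simulated agents talk through an in-process transport, a partition is just dropped messages across the cut; a minimal sketch of that idea (deliver is a hypothetical method that hands the message to the target node's inbox):

type SimTransport struct {
    mu        sync.Mutex
    partition map[string]int // node name -> group; nodes in different groups cannot talk
}

func (t *SimTransport) Partition(groups ...[]string) {
    t.mu.Lock()
    defer t.mu.Unlock()
    t.partition = map[string]int{}
    for i, group := range groups {
        for _, node := range group {
            t.partition[node] = i
        }
    }
}

func (t *SimTransport) Heal() {
    t.mu.Lock()
    defer t.mu.Unlock()
    t.partition = nil
}

func (t *SimTransport) reachable(from, to string) bool {
    t.mu.Lock()
    defer t.mu.Unlock()
    return t.partition == nil || t.partition[from] == t.partition[to]
}

// Every Raft RPC and gossip message in the simulation goes through Send,
// so dropping here looks exactly like a network partition to the nodes.
func (t *SimTransport) Send(from, to string, msg []byte) error {
    if !t.reachable(from, to) {
        return errors.New("simulated partition: message dropped")
    }
    return t.deliver(to, msg)
}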

Summary

| # | Project | Main Language |
| --- | --- | --- |
| 1 | Simple Key-Value Store with Replication | Go |
| 2 | Implement the Raft Consensus Algorithm | Go |
| 3 | Implement the SWIM Gossip Protocol | Go |
| 4 | Build a Service Registry | Go |
| 5 | DNS-Based Service Discovery | Go |
| 6 | Distributed Key-Value Store with Watches | Go |
| 7 | Session and Lock Manager | Go |
| 8 | Service Mesh Sidecar Proxy | Go |
| 9 | Certificate Authority (CA) for Service Mesh | Go |
| 10 | Multi-Datacenter Federation | Go |
| 11 | ACL System with Tokens and Policies | Go |
| 12 | Consul Agent (Complete Implementation) | Go |
| Capstone | Production Consul Cluster Simulator | Go |

Key Resources

Official Documentation

Academic Papers

Books

  • “Designing Data-Intensive Applications” by Martin Kleppmann - The distributed systems bible
  • “Building Microservices, 2nd Edition” by Sam Newman - Service discovery context
  • “Zero Trust Networks, 2nd Edition” by Gilman & Barth - Service mesh security

HashiCorp Talks

Source Code


“The best way to understand a distributed system is to build one. The second best way is to break one. By the end of this journey, you’ll have done both.”