
LEARN NOMAD DEEP DIVE

Learn HashiCorp Nomad: From Zero to Mastery

Goal: Deeply understand Nomad—from basic job scheduling to implementing your own scheduler, understanding Raft consensus, building task drivers, and mastering the internals that make it a production-grade orchestrator.


Why Nomad Matters

Nomad is HashiCorp’s workload orchestrator, and it occupies a unique position in the infrastructure landscape:

  • Simpler than Kubernetes: Single binary, no etcd, no complex control plane
  • More flexible: Runs containers, VMs, Java apps, binaries, batch jobs—anything
  • Production-proven: Powers millions of containers at companies like Cloudflare, Roblox, and CircleCI
  • Inspired by Google: The scheduler design draws from Google’s Borg and Omega papers
  • HashiCorp Ecosystem: Seamless integration with Consul (service mesh), Vault (secrets), Terraform (provisioning)

Understanding Nomad deeply teaches you:

  1. Distributed systems fundamentals: Raft consensus, gossip protocols, leader election
  2. Scheduler design: Bin packing, spread algorithms, constraint satisfaction
  3. Container orchestration: Namespaces, cgroups, networking modes
  4. Service mesh concepts: Envoy sidecars, mutual TLS, service discovery
  5. Real production patterns: Multi-region, high availability, disaster recovery

Core Architecture Overview

┌─────────────────────────────────────────────────────────────────┐
│                           NOMAD CLUSTER                          │
├─────────────────────────────────────────────────────────────────┤
│                                                                  │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐           │
│  │   Server 1   │  │   Server 2   │  │   Server 3   │           │
│  │   (Leader)   │◄─┤  (Follower)  ├──┤  (Follower)  │           │
│  │              │  │              │  │              │           │
│  │ ┌──────────┐ │  │ ┌──────────┐ │  │ ┌──────────┐ │           │
│  │ │   Raft   │ │  │ │   Raft   │ │  │ │   Raft   │ │           │
│  │ │  (FSM)   │ │  │ │  (FSM)   │ │  │ │  (FSM)   │ │           │
│  │ └──────────┘ │  │ └──────────┘ │  │ └──────────┘ │           │
│  │ ┌──────────┐ │  │ ┌──────────┐ │  │ ┌──────────┐ │           │
│  │ │Scheduler │ │  │ │Scheduler │ │  │ │Scheduler │ │           │
│  │ │ Workers  │ │  │ │ Workers  │ │  │ │ Workers  │ │           │
│  │ └──────────┘ │  │ └──────────┘ │  │ └──────────┘ │           │
│  └──────┬───────┘  └──────┬───────┘  └──────┬───────┘           │
│         │                 │                 │                    │
│         └─────────────────┼─────────────────┘                    │
│                     Serf Gossip                                  │
│                     (Membership)                                 │
│         ┌─────────────────┼─────────────────┐                    │
│         │                 │                 │                    │
│  ┌──────▼───────┐  ┌──────▼───────┐  ┌──────▼───────┐           │
│  │   Client 1   │  │   Client 2   │  │   Client N   │           │
│  │              │  │              │  │              │           │
│  │ ┌──────────┐ │  │ ┌──────────┐ │  │ ┌──────────┐ │           │
│  │ │  Docker  │ │  │ │   exec   │ │  │ │ raw_exec │ │           │
│  │ │  Driver  │ │  │ │  Driver  │ │  │ │  Driver  │ │           │
│  │ └──────────┘ │  │ └──────────┘ │  │ └──────────┘ │           │
│  │ ┌──────────┐ │  │ ┌──────────┐ │  │ ┌──────────┐ │           │
│  │ │Allocation│ │  │ │Allocation│ │  │ │Allocation│ │           │
│  │ │  Runner  │ │  │ │  Runner  │ │  │ │  Runner  │ │           │
│  │ └──────────┘ │  │ └──────────┘ │  │ └──────────┘ │           │
│  └──────────────┘  └──────────────┘  └──────────────┘           │
│                                                                  │
└─────────────────────────────────────────────────────────────────┘

Key Components

Component     Purpose
---------     -------
Server        Control plane—stores state, runs schedulers, coordinates via Raft
Client        Data plane—runs workloads, reports node status
Raft          Consensus algorithm for leader election and state replication
Serf/Gossip   Membership protocol for cluster discovery
Scheduler     Places allocations on nodes using bin packing/spread
Task Drivers  Execute workloads (Docker, exec, raw_exec, Java, etc.)
Evaluation    Unit of scheduling work, triggered by state changes
Allocation    A task group placed on a specific node

The Scheduling Pipeline

Understanding this pipeline is key to understanding Nomad:

Job Submission → Evaluation Created → Scheduler Processes →
Plan Generated → Leader Applies Plan → Allocation Created →
Client Runs Tasks

  1. Job Registration: User submits a job (desired state)
  2. Evaluation Creation: Nomad creates an evaluation to process the change
  3. Scheduler Dequeues: A scheduler worker picks up the evaluation
  4. Feasibility Checking: Filter nodes that can’t run the job (constraints, resources)
  5. Ranking: Score feasible nodes (bin packing, affinity, spread)
  6. Plan Submission: Scheduler submits plan to leader
  7. Plan Application: Leader checks for conflicts, applies or rejects
  8. Allocation: Client receives allocation and starts tasks
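The filter and ranking phases (steps 4 and 5) can be sketched in a few lines of Go. This is a toy model with made-up node data; the types and function names are illustrative, not Nomad's internal API:

```go
package main

import "fmt"

// Node is a simplified view of what the scheduler sees per client.
type Node struct {
	Name             string
	FreeCPU, FreeMem int
}

// feasible implements the filter phase: drop nodes that cannot
// possibly run the task (step 4 of the pipeline).
func feasible(nodes []Node, cpu, mem int) []Node {
	var out []Node
	for _, n := range nodes {
		if n.FreeCPU >= cpu && n.FreeMem >= mem {
			out = append(out, n)
		}
	}
	return out
}

// binpackScore is a toy version of the ranking phase (step 5):
// prefer the node that would be MOST full after placement, so
// emptier nodes stay free for large jobs.
func binpackScore(n Node, cpu, mem int) float64 {
	cpuUsed := 1.0 - float64(n.FreeCPU-cpu)/float64(n.FreeCPU)
	memUsed := 1.0 - float64(n.FreeMem-mem)/float64(n.FreeMem)
	return (cpuUsed + memUsed) / 2
}

// place filters, then picks the highest-scoring candidate.
func place(nodes []Node, cpu, mem int) (Node, bool) {
	candidates := feasible(nodes, cpu, mem)
	if len(candidates) == 0 {
		return Node{}, false // the evaluation would block
	}
	best := candidates[0]
	for _, n := range candidates[1:] {
		if binpackScore(n, cpu, mem) > binpackScore(best, cpu, mem) {
			best = n
		}
	}
	return best, true
}

func main() {
	nodes := []Node{
		{"client-1", 400, 512},
		{"client-2", 2000, 4096},
		{"client-3", 100, 64}, // too small, filtered out
	}
	if n, ok := place(nodes, 200, 256); ok {
		fmt.Println("placed on", n.Name) // client-1: bin packing prefers the fuller node
	}
}
```

Note how bin packing picks the nearly-full client-1 over the mostly-empty client-2; a "spread" strategy would invert the score.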

Project List

Projects progress from understanding the basics to building sophisticated internal components.


Project 1: Local Development Cluster

  • File: LEARN_NOMAD_DEEP_DIVE.md
  • Main Programming Language: HCL (HashiCorp Configuration Language) + Bash
  • Alternative Programming Languages: Python, Go
  • Coolness Level: Level 2: Practical but Forgettable
  • Business Potential: 1. The “Resume Gold”
  • Difficulty: Level 1: Beginner
  • Knowledge Area: Cluster Setup / Agent Configuration
  • Software or Tool: Nomad, Docker
  • Main Book: HashiCorp Nomad Documentation

What you’ll build: A local 3-server, 2-client Nomad cluster running in Docker containers (or as processes), demonstrating leader election, automatic clustering, and basic job deployment.

Why it teaches Nomad: Before diving deep, you need a running cluster to experiment with. This project introduces the agent modes (server/client), configuration files, and the basic operational commands.

Core challenges you’ll face:

  • Configuring server vs client mode → maps to understanding agent roles
  • Setting up cluster bootstrapping → maps to bootstrap_expect and join
  • Networking between agents → maps to advertise addresses and ports
  • Running your first job → maps to job specification basics

Key Concepts:

Difficulty: Beginner
Time estimate: Weekend
Prerequisites: Docker installed. Basic understanding of configuration files. Command-line comfort.

Real world outcome:

# Your cluster is running:
$ nomad server members
Name          Address      Port  Status  Leader  Protocol
server-1.dc1  10.0.0.1     4648  alive   true    2
server-2.dc1  10.0.0.2     4648  alive   false   2
server-3.dc1  10.0.0.3     4648  alive   false   2

$ nomad node status
ID        DC   Name      Class   Drain  Eligibility  Status
a1b2c3d4  dc1  client-1  <none>  false  eligible     ready
e5f6g7h8  dc1  client-2  <none>  false  eligible     ready

# Deploy a simple job:
$ nomad run hello.nomad
==> Monitoring deployment...
    2024-01-15T10:00:00Z (ID: abc123)
    Deployment "abc123" successful

$ curl http://client-1:8080
Hello from Nomad!

Implementation Hints:

Directory structure:

nomad-cluster/
├── docker-compose.yml       # Or Vagrant, or shell scripts
├── config/
│   ├── server.hcl           # Server configuration
│   └── client.hcl           # Client configuration
└── jobs/
    └── hello.nomad          # First job

Server configuration (server.hcl):

# Minimal server config
data_dir = "/opt/nomad/data"
bind_addr = "0.0.0.0"

server {
  enabled          = true
  bootstrap_expect = 3  # Wait for 3 servers before electing leader
}

# Advertise address for other nodes to reach this one
advertise {
  http = "{{ GetInterfaceIP \"eth0\" }}"
  rpc  = "{{ GetInterfaceIP \"eth0\" }}"
  serf = "{{ GetInterfaceIP \"eth0\" }}"
}

Client configuration (client.hcl):

data_dir = "/opt/nomad/data"
bind_addr = "0.0.0.0"

client {
  enabled = true
  servers = ["server-1:4647", "server-2:4647", "server-3:4647"]
}

# Enable task drivers
plugin "docker" {
  config {
    allow_privileged = false
  }
}

First job (hello.nomad):

job "hello" {
  datacenters = ["dc1"]
  type        = "service"

  group "web" {
    count = 2

    network {
      port "http" {
        static = 8080  # Fixed host port so http://client-1:8080 is reachable
        to     = 8080
      }
    }

    task "server" {
      driver = "docker"

      config {
        image = "hashicorp/http-echo"
        # http-echo listens on 5678 by default; -listen moves it to the mapped port
        args  = ["-text", "Hello from Nomad!", "-listen", ":8080"]
        ports = ["http"]
      }

      resources {
        cpu    = 100
        memory = 64
      }
    }
  }
}

Learning milestones:

  1. Servers elect a leader → You understand Raft basics
  2. Clients join the cluster → You understand gossip membership
  3. Job deploys successfully → You understand the scheduling flow
  4. You scale the job → You understand desired vs actual state
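Milestone 1 rests on Raft's majority rule: bootstrap_expect = 3 waits for three servers because a leader needs votes from a quorum. A quick Go sketch of the arithmetic (illustrative helper names, not Nomad code):

```go
package main

import "fmt"

// quorum returns the majority needed in a Raft cluster of n servers.
// This is why bootstrap_expect = 3 waits for all three servers
// before a leader can be elected.
func quorum(n int) int { return n/2 + 1 }

// faultTolerance is how many servers can fail while the cluster
// can still elect a leader and commit writes.
func faultTolerance(n int) int { return n - quorum(n) }

func main() {
	for _, n := range []int{1, 3, 5} {
		fmt.Printf("%d servers: quorum=%d, tolerates %d failure(s)\n",
			n, quorum(n), faultTolerance(n))
	}
}
```

This is also why even-sized clusters are discouraged: 4 servers need a quorum of 3 and tolerate only 1 failure, the same as 3 servers.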

Project 2: Job Lifecycle Explorer

  • File: LEARN_NOMAD_DEEP_DIVE.md
  • Main Programming Language: HCL + Bash
  • Alternative Programming Languages: Go (API client), Python
  • Coolness Level: Level 2: Practical but Forgettable
  • Business Potential: 1. The “Resume Gold”
  • Difficulty: Level 1: Beginner
  • Knowledge Area: Job Types / Task Lifecycle / Update Strategies
  • Software or Tool: Nomad
  • Main Book: HashiCorp Nomad Documentation

What you’ll build: A suite of jobs demonstrating all Nomad job types (service, batch, system, sysbatch), task lifecycle hooks, update strategies, and health checks.

Why it teaches Nomad: Each job type has different scheduling behavior. Understanding when tasks start, stop, and restart—and how updates roll out—is essential for production deployments.

Core challenges you’ll face:

  • Understanding job types → maps to service vs batch vs system schedulers
  • Lifecycle hooks (prestart, poststart, poststop) → maps to task dependencies
  • Health checks and deployment health → maps to how Nomad knows tasks are ready
  • Update strategies → maps to rolling vs canary deployments

Key Concepts:

Difficulty: Beginner
Time estimate: 1 week
Prerequisites: Project 1 completed. Running cluster.

Real world outcome:

# Different job types in action:

# Service job - long-running, rescheduled on failure
$ nomad run api-service.nomad
$ nomad job status api-service
Allocations
ID        Node ID   Task Group  Version  Desired  Status   Created
abc123    a1b2c3    api         0        run      running  5m ago

# Batch job - runs to completion
$ nomad run data-import.nomad
$ nomad job status data-import
Allocations
ID        Node ID   Task Group  Version  Desired  Status    Created
def456    e5f6g7    import      0        run      complete  2m ago

# System job - runs on every node
$ nomad run node-exporter.nomad
$ nomad job status node-exporter
Allocations (2 nodes)
ID        Node ID   Task Group  Version  Desired  Status   Created
ghi789    a1b2c3    exporter    0        run      running  1m ago
jkl012    e5f6g7    exporter    0        run      running  1m ago

# Rolling update with health checks:
$ nomad run -var version=v2 api-service.nomad
==> Monitoring deployment...
    Deployment "xyz789" in progress...
    2024-01-15T10:05:00Z - 1/3 allocations healthy
    2024-01-15T10:06:00Z - 2/3 allocations healthy
    2024-01-15T10:07:00Z - 3/3 allocations healthy
    Deployment "xyz789" successful

Implementation Hints:

Service job with lifecycle hooks:

job "api-service" {
  type = "service"

  group "api" {
    count = 3

    update {
      max_parallel     = 1
      min_healthy_time = "30s"
      healthy_deadline = "5m"
      auto_revert      = true
      canary           = 1  # Deploy 1 canary first
    }

    # Prestart task - runs before main task
    task "db-migrate" {
      lifecycle {
        hook    = "prestart"
        sidecar = false
      }

      driver = "docker"
      config {
        image   = "myapp/migrate:${var.version}"
        command = "/migrate"
      }
    }

    # Main task
    task "api" {
      driver = "docker"

      config {
        image = "myapp/api:${var.version}"
        ports = ["http"]
      }

      # Health check for deployment
      service {
        name = "api"
        port = "http"

        check {
          type     = "http"
          path     = "/health"
          interval = "10s"
          timeout  = "2s"
        }
      }
    }

    # Sidecar - runs alongside main task
    task "log-shipper" {
      lifecycle {
        hook    = "poststart"
        sidecar = true
      }

      driver = "docker"
      config {
        image = "fluent/fluent-bit"
      }
    }
  }
}

Batch job with periodic scheduling:

job "daily-report" {
  type = "batch"

  periodic {
    cron             = "0 2 * * *"  # 2 AM daily
    prohibit_overlap = true
  }

  group "report" {
    task "generate" {
      driver = "docker"
      config {
        image   = "myapp/reporter"
        command = "/generate-report"
      }

      # Batch jobs can have restart policies
      restart {
        attempts = 3
        interval = "30m"
        delay    = "15s"
        mode     = "fail"
      }
    }
  }
}

Learning milestones:

  1. You run all 4 job types → You understand scheduler differences
  2. Lifecycle hooks execute in order → You understand task dependencies
  3. Health checks gate deployments → You understand deployment health
  4. Rolling updates work with canaries → You understand update strategies
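To build intuition for how max_parallel paces a rolling update, here is a small Go sketch (ignoring canaries and failed rounds; rollingBatches is an illustrative name, not a Nomad API):

```go
package main

import "fmt"

// rollingBatches returns how many update rounds a deployment needs
// when replacing `total` allocations `maxParallel` at a time. Each
// round must pass its health checks (min_healthy_time) before the
// next round starts, so fewer parallel updates = safer but slower.
func rollingBatches(total, maxParallel int) int {
	if maxParallel <= 0 || total <= 0 {
		return 0
	}
	return (total + maxParallel - 1) / maxParallel // ceiling division
}

func main() {
	// 3 allocations with max_parallel = 1, as in the api-service job:
	// the deployment replaces them one at a time over 3 gated rounds.
	fmt.Println("rounds:", rollingBatches(3, 1))
}
```

With min_healthy_time = "30s", the 3 rounds above mean the rollout takes at least 90 seconds even when everything is healthy.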

Project 3: Scheduler Visualization Tool

  • File: LEARN_NOMAD_DEEP_DIVE.md
  • Main Programming Language: Go
  • Alternative Programming Languages: Python, JavaScript
  • Coolness Level: Level 3: Genuinely Clever
  • Business Potential: 2. The “Micro-SaaS / Pro Tool”
  • Difficulty: Level 2: Intermediate
  • Knowledge Area: Scheduling Internals / API Usage
  • Software or Tool: Nomad API
  • Main Book: “Designing Data-Intensive Applications” by Martin Kleppmann (for distributed concepts)

What you’ll build: A tool that visualizes Nomad’s scheduling decisions in real-time—showing evaluations, plans, allocations, node scores, and why certain nodes were chosen or rejected.

Why it teaches Nomad: The scheduler is Nomad’s brain. By building a visualization tool, you’ll understand the evaluation → plan → allocation pipeline, see bin packing in action, and understand why placements happen where they do.

Core challenges you’ll face:

  • Streaming evaluations → maps to understanding the evaluation broker
  • Parsing allocation plans → maps to feasibility and ranking phases
  • Visualizing node scores → maps to bin packing algorithm
  • Understanding failures → maps to why allocations get blocked

Key Concepts:

Difficulty: Intermediate
Time estimate: 1-2 weeks
Prerequisites: Projects 1-2 completed. Go or Python proficiency.

Real world outcome:

┌─────────────────────────────────────────────────────────────────┐
│ Nomad Scheduler Visualizer                                       │
├─────────────────────────────────────────────────────────────────┤
│                                                                  │
│ Evaluation: abc123-def456 (Triggered by: job-register)          │
│ Job: api-service | Type: service | Priority: 50                 │
│                                                                  │
│ ┌─ Scheduling Timeline ──────────────────────────────────────┐  │
│ │                                                             │  │
│ │  10:00:00.100  Evaluation Created                          │  │
│ │  10:00:00.150  Scheduler Dequeued (worker-3)               │  │
│ │  10:00:00.200  Feasibility Check Started                   │  │
│ │                 - client-1: FEASIBLE                       │  │
│ │                 - client-2: FEASIBLE                       │  │
│ │                 - client-3: REJECTED (insufficient memory) │  │
│ │  10:00:00.250  Ranking Phase                               │  │
│ │                 - client-1: score=0.85 (binpack: 0.9,      │  │
│ │                              anti-affinity: 0.8)           │  │
│ │                 - client-2: score=0.72 (binpack: 0.7,      │  │
│ │                              anti-affinity: 0.75)          │  │
│ │  10:00:00.300  Plan Submitted                              │  │
│ │  10:00:00.350  Plan Accepted (no conflicts)                │  │
│ │  10:00:00.400  Allocation Created on client-1              │  │
│ │                                                             │  │
│ └─────────────────────────────────────────────────────────────┘  │
│                                                                  │
│ Node Resource Usage:                                             │
│ ┌──────────────────────────────────────────────────────────────┐ │
│ │ client-1 [████████████░░░░░░░░] CPU: 60% │ Mem: 2.1/4GB     │ │
│ │ client-2 [████████░░░░░░░░░░░░] CPU: 40% │ Mem: 1.5/4GB     │ │
│ │ client-3 [████████████████████] CPU: 95% │ Mem: 3.9/4GB     │ │
│ └──────────────────────────────────────────────────────────────┘ │
│                                                                  │
└─────────────────────────────────────────────────────────────────┘

Implementation Hints:

Using the Event Stream API:

package main

import (
    "context"
    "fmt"
    "github.com/hashicorp/nomad/api"
)

func main() {
    client, err := api.NewClient(api.DefaultConfig())
    if err != nil {
        panic(err)
    }

    // Subscribe to evaluation, allocation, and node events
    topics := map[api.Topic][]string{
        api.TopicEvaluation: {"*"},
        api.TopicAllocation: {"*"},
        api.TopicNode:       {"*"},
    }

    ctx := context.Background()
    eventCh, err := client.EventStream().Stream(ctx, topics, 0, nil)
    if err != nil {
        panic(err)
    }

    for events := range eventCh {
        if events.Err != nil {
            fmt.Println("stream error:", events.Err)
            continue
        }
        for _, event := range events.Events {
            switch event.Topic {
            case api.TopicEvaluation:
                handleEvaluation(event)
            case api.TopicAllocation:
                handleAllocation(event)
            }
        }
    }
}

func handleEvaluation(event api.Event) {
    // Event payloads are decoded via helper accessors
    eval, err := event.Evaluation()
    if err != nil || eval == nil {
        return
    }
    fmt.Printf("Eval %s: Status=%s TriggeredBy=%s\n",
        eval.ID[:8], eval.Status, eval.TriggeredBy)

    // Get detailed scheduling info once the evaluation completes
    if eval.Status == "complete" {
        showPlanDetails(eval.ID)
    }
}

func handleAllocation(event api.Event) {
    // Left as an exercise: decode with event.Allocation() and print status
}

Getting node scores and feasibility info:

func showPlanDetails(evalID string) {
    client, err := api.NewClient(api.DefaultConfig())
    if err != nil {
        return
    }

    // List allocations created by this evaluation (returns stubs)
    stubs, _, err := client.Evaluations().Allocations(evalID, nil)
    if err != nil {
        return
    }

    for _, stub := range stubs {
        // Fetch the full allocation to get the scheduler's
        // placement metrics (the list stubs omit them)
        alloc, _, err := client.Allocations().Info(stub.ID, nil)
        if err != nil || alloc.Metrics == nil {
            continue
        }
        fmt.Printf("  Allocation %s on %s\n", alloc.ID[:8], alloc.NodeID[:8])

        // Metrics show why this node was chosen
        fmt.Printf("    NodesEvaluated: %d\n", alloc.Metrics.NodesEvaluated)
        fmt.Printf("    NodesFiltered: %d\n", alloc.Metrics.NodesFiltered)

        // Filter reasons
        for class, count := range alloc.Metrics.ClassFiltered {
            fmt.Printf("    Filtered by class %s: %d\n", class, count)
        }

        // Scores
        for _, score := range alloc.Metrics.ScoreMetaData {
            fmt.Printf("    Node %s: score=%.2f\n",
                score.NodeID[:8], score.NormScore)
        }
    }
}

Learning milestones:

  1. You stream evaluation events → You understand the evaluation broker
  2. You show feasibility filtering → You understand constraint checking
  3. You display node scores → You understand bin packing
  4. You track blocked evaluations → You understand resource exhaustion

Project 4: Custom Resource Constraint System

  • File: LEARN_NOMAD_DEEP_DIVE.md
  • Main Programming Language: HCL + Go
  • Alternative Programming Languages: Python (for testing)
  • Coolness Level: Level 3: Genuinely Clever
  • Business Potential: 3. The “Service & Support” Model
  • Difficulty: Level 2: Intermediate
  • Knowledge Area: Constraints / Affinities / Node Metadata
  • Software or Tool: Nomad
  • Main Book: “Designing Data-Intensive Applications” by Martin Kleppmann

What you’ll build: A system that uses node attributes, custom metadata, constraints, and affinities to implement complex placement requirements like “run on GPU nodes,” “prefer SSD storage,” “spread across availability zones.”

Why it teaches Nomad: Real-world scheduling is rarely just “find a node with enough CPU.” This project teaches you how Nomad’s constraint system enables sophisticated placement logic without modifying the scheduler itself.

Core challenges you’ll face:

  • Node fingerprinting → maps to how Nomad discovers node capabilities
  • Custom node metadata → maps to operator-defined attributes
  • Hard constraints vs soft affinities → maps to must-have vs prefer
  • Spread across failure domains → maps to the spread block

Key Concepts:

Difficulty: Intermediate
Time estimate: 1 week
Prerequisites: Projects 1-3 completed. Understanding of constraint satisfaction.

Real world outcome:

# Configure nodes with metadata:
# client-1.hcl
client {
  meta {
    "zone"         = "us-east-1a"
    "instance"     = "gpu"
    "storage"      = "ssd"
  }
}

# Job with complex placement requirements:
$ cat ml-training.nomad

job "ml-training" {
  # Hard constraint: MUST have GPU
  constraint {
    attribute = "${meta.instance}"
    value     = "gpu"
  }

  # Soft affinity: PREFER SSD storage (weight: 50%)
  affinity {
    attribute = "${meta.storage}"
    value     = "ssd"
    weight    = 50
  }

  # Spread across availability zones
  spread {
    attribute = "${meta.zone}"
    weight    = 100

    target "us-east-1a" { percent = 33 }
    target "us-east-1b" { percent = 33 }
    target "us-east-1c" { percent = 34 }
  }

  group "training" {
    count = 6

    task "train" {
      driver = "docker"
      config {
        image = "tensorflow/tensorflow:latest-gpu"
      }

      resources {
        # Request GPU device
        device "nvidia/gpu" {
          count = 1
        }
      }
    }
  }
}

$ nomad run ml-training.nomad
$ nomad job status ml-training

Allocations
ID        Node ID   Zone        Status
abc123    node-1    us-east-1a  running   # GPU + SSD
def456    node-4    us-east-1b  running   # GPU + HDD (SSD preferred but not required)
ghi789    node-7    us-east-1c  running   # GPU + SSD
...

# Placement respects:
# ✓ All on GPU nodes (constraint)
# ✓ Prefers SSD nodes (affinity)
# ✓ Spread across zones (spread)

Implementation Hints:

Node metadata configuration:

# On each client node
client {
  enabled = true

  # Custom metadata operators can define
  meta {
    "zone"            = "us-east-1a"
    "rack"            = "rack-42"
    "instance_type"   = "gpu"
    "storage_type"    = "nvme"
    "network_speed"   = "25gbps"
    "cost_tier"       = "expensive"
  }

  # Host volumes for persistent storage
  host_volume "scratch" {
    path      = "/mnt/scratch"
    read_only = false
  }
}

Complex constraint examples:

job "complex-constraints" {
  # Must run on Linux
  constraint {
    attribute = "${attr.kernel.name}"
    value     = "linux"
  }

  # Must have at least 4 CPU cores
  constraint {
    attribute = "${attr.cpu.numcores}"
    operator  = ">="
    value     = "4"
  }

  # Must NOT run on nodes tagged "maintenance"
  constraint {
    attribute = "${meta.status}"
    operator  = "!="
    value     = "maintenance"
  }

  # Must run in specific datacenters
  constraint {
    attribute = "${node.datacenter}"
    operator  = "regexp"
    value     = "us-.*"
  }

  # Prefer nodes with recent kernel
  affinity {
    attribute = "${attr.kernel.version}"
    operator  = "version"
    value     = ">= 5.0"
    weight    = 25
  }

  group "app" {
    # Distinct hosts - never colocate instances
    constraint {
      operator = "distinct_hosts"
      value    = "true"
    }

    task "worker" {
      # ...
    }
  }
}

Learning milestones:

  1. You use built-in attributes → You understand fingerprinting
  2. You define custom metadata → You understand operator-defined placement
  3. You combine constraints and affinities → You understand hard vs soft requirements
  4. You spread across failure domains → You understand availability patterns
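A toy Go version of the feasibility check makes the hard-constraint semantics concrete: any single failed constraint rejects the node outright, while affinities only adjust scores. The operator set and function names here are illustrative and cover only a few of Nomad's operators:

```go
package main

import (
	"fmt"
	"regexp"
)

// matches evaluates one Nomad-style constraint operator against a
// node attribute value. Only =, !=, and regexp are sketched here.
func matches(attrValue, operator, target string) bool {
	switch operator {
	case "=", "":
		return attrValue == target
	case "!=":
		return attrValue != target
	case "regexp":
		ok, _ := regexp.MatchString(target, attrValue)
		return ok
	default:
		return false
	}
}

// feasibleNode applies all constraints; ANY failure rejects the
// node (constraints are hard requirements, unlike weighted
// affinities, which would only lower a node's rank).
func feasibleNode(attrs map[string]string, constraints [][3]string) bool {
	for _, c := range constraints {
		if !matches(attrs[c[0]], c[1], c[2]) {
			return false
		}
	}
	return true
}

func main() {
	node := map[string]string{
		"meta.instance":   "gpu",
		"node.datacenter": "us-east-1",
	}
	constraints := [][3]string{
		{"meta.instance", "=", "gpu"},
		{"node.datacenter", "regexp", "us-.*"},
	}
	fmt.Println("feasible:", feasibleNode(node, constraints))
}
```

Swap the gpu value for cpu and the node drops out of consideration entirely, which is exactly what a blocked evaluation with "constraint filtered" metrics reports.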

Project 5: Consul Service Mesh Integration

  • File: LEARN_NOMAD_DEEP_DIVE.md
  • Main Programming Language: HCL
  • Alternative Programming Languages: Go, Python
  • Coolness Level: Level 3: Genuinely Clever
  • Business Potential: 3. The “Service & Support” Model
  • Difficulty: Level 2: Intermediate
  • Knowledge Area: Service Mesh / mTLS / Envoy
  • Software or Tool: Nomad + Consul + Envoy
  • Main Book: “Service Mesh Patterns” by Sandeep Dinesh

What you’ll build: A microservices application with Consul Connect service mesh, demonstrating automatic mTLS, service intentions (authorization), and transparent proxy networking.

Why it teaches Nomad: Nomad’s integration with Consul Connect shows how the HashiCorp stack works together. You’ll understand sidecar injection, service identity, and how Envoy proxies are configured automatically.

Core challenges you’ll face:

  • Setting up Consul cluster → maps to understanding service discovery
  • Enabling Connect sidecar proxies → maps to Envoy injection
  • Configuring service intentions → maps to service-to-service authorization
  • Network modes (bridge) → maps to Linux network namespaces

Key Concepts:

Difficulty: Intermediate
Time estimate: 1-2 weeks
Prerequisites: Projects 1-4 completed. Basic understanding of TLS and service mesh concepts.

Real world outcome:

# Architecture:
#
#   ┌─────────────────────────────────────────────────────┐
#   │                    Nomad Job                         │
#   │  ┌─────────────────┐      ┌─────────────────┐       │
#   │  │     web         │      │     api         │       │
#   │  │   (frontend)    │ mTLS │   (backend)     │       │
#   │  │  ┌───────────┐  │◄────►│  ┌───────────┐  │       │
#   │  │  │  nginx    │  │      │  │  python   │  │       │
#   │  │  └───────────┘  │      │  └───────────┘  │       │
#   │  │  ┌───────────┐  │      │  ┌───────────┐  │       │
#   │  │  │  envoy    │  │      │  │  envoy    │  │       │
#   │  │  │ (sidecar) │  │      │  │ (sidecar) │  │       │
#   │  │  └───────────┘  │      │  └───────────┘  │       │
#   │  └─────────────────┘      └─────────────────┘       │
#   └─────────────────────────────────────────────────────┘

$ nomad run mesh-app.nomad

# Services register with Consul:
$ consul catalog services
api
api-sidecar-proxy
web
web-sidecar-proxy

# Intentions control access:
$ consul intention check web api
Allowed

$ consul intention check malicious-app api
Denied

# Traffic flows through mTLS:
$ curl http://web.service.consul
Hello from web! Got response from api: {"status": "ok"}

Implementation Hints:

Consul configuration for Connect:

# consul.hcl
connect {
  enabled = true
}

ports {
  grpc = 8502
}

Nomad job with Connect sidecar:

job "mesh-app" {
  datacenters = ["dc1"]

  group "api" {
    count = 2

    network {
      mode = "bridge"  # Required for Connect

      port "http" {
        to = 8080
      }
    }

    service {
      name = "api"
      port = "8080"

      connect {
        sidecar_service {}  # Injects Envoy sidecar
      }
    }

    task "api" {
      driver = "docker"

      config {
        image = "myapp/api:v1"
        ports = ["http"]
      }
    }
  }

  group "web" {
    count = 2

    network {
      mode = "bridge"

      port "http" {
        static = 8080
        to     = 80
      }
    }

    service {
      name = "web"
      port = "80"

      connect {
        sidecar_service {
          proxy {
            # Upstream connection to api service
            upstreams {
              destination_name = "api"
              local_bind_port  = 5000  # web connects to localhost:5000
            }
          }
        }
      }
    }

    task "web" {
      driver = "docker"

      config {
        image = "myapp/web:v1"
        ports = ["http"]
      }

      env {
        API_URL = "http://localhost:5000"  # Goes through Envoy → api
      }
    }
  }
}

Service intentions (Consul):

# Allow web → api
consul intention create web api

# Deny all other traffic to api
consul intention create -deny "*" api

# Or use HCL:
# intentions.hcl
Kind = "service-intentions"
Name = "api"
Sources = [
  {
    Name   = "web"
    Action = "allow"
  },
  {
    Name   = "*"
    Action = "deny"
  }
]

Learning milestones:

  1. Sidecars inject automatically → You understand sidecar injection
  2. Services discover each other → You understand service mesh networking
  3. mTLS encrypts traffic → You understand zero-trust networking
  4. Intentions control access → You understand service authorization
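The intention behavior above can be modeled in a few lines of Go. This is a simplification of Consul's precedence rules (an exact source match beats the "*" wildcard, and unmatched traffic is denied), not its actual implementation:

```go
package main

import "fmt"

// Intention is a simplified service-intentions entry: source
// service, destination service, and whether traffic is allowed.
type Intention struct {
	Source, Destination string
	Allow               bool
}

// allowed mimics intention precedence: an exact source match wins
// over the "*" wildcard; with no match at all we deny (zero trust).
func allowed(intentions []Intention, source, dest string) bool {
	var wildcard *Intention
	for i := range intentions {
		in := &intentions[i]
		if in.Destination != dest {
			continue
		}
		if in.Source == source {
			return in.Allow // exact match: highest precedence
		}
		if in.Source == "*" {
			wildcard = in
		}
	}
	if wildcard != nil {
		return wildcard.Allow
	}
	return false // default deny
}

func main() {
	intentions := []Intention{
		{"web", "api", true}, // consul intention create web api
		{"*", "api", false},  // consul intention create -deny "*" api
	}
	fmt.Println("web -> api:", allowed(intentions, "web", "api"))
	fmt.Println("malicious-app -> api:", allowed(intentions, "malicious-app", "api"))
}
```

This mirrors the `consul intention check` output in the outcome section: web is allowed, everything else falls through to the wildcard deny.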

Project 6: Autoscaler Implementation

  • File: LEARN_NOMAD_DEEP_DIVE.md
  • Main Programming Language: Go
  • Alternative Programming Languages: Python, HCL (for policies)
  • Coolness Level: Level 3: Genuinely Clever
  • Business Potential: 3. The “Service & Support” Model
  • Difficulty: Level 3: Advanced
  • Knowledge Area: Autoscaling / Metrics / Control Loops
  • Software or Tool: Nomad Autoscaler
  • Main Book: “Site Reliability Engineering” by Google

What you’ll build: An autoscaling system that adjusts job counts based on metrics (CPU, memory, queue depth, custom metrics), implementing both horizontal pod autoscaling and cluster autoscaling.

Why it teaches Nomad: Autoscaling requires understanding how to monitor allocations, interact with the Nomad API to scale jobs, and implement control loops that don’t oscillate. This teaches the operational side of Nomad.

Core challenges you’ll face:

  • Collecting metrics → maps to Prometheus integration or Nomad metrics
  • Implementing scaling policies → maps to target tracking algorithms
  • Avoiding thrashing → maps to cooldown periods and stabilization
  • Cluster autoscaling → maps to adding/removing nodes

Key Concepts:

Difficulty: Advanced
Time estimate: 2 weeks
Prerequisites: Projects 1-5 completed. Understanding of control systems basics.

Real world outcome:

# Your autoscaler in action:

# Initial state:
$ nomad job status api-service
Task Group  Queued  Starting  Running  Failed  Complete  Lost
api         0       0         2        0       0         0

# Load increases, autoscaler detects high CPU:
$ watch nomad-autoscaler agent -config=config.hcl

[INFO] policy: scaling api from 2 to 5: cpu_utilization=85%, target=70%
[INFO] scale: successfully scaled api to 5 allocations

$ nomad job status api-service
Task Group  Queued  Starting  Running  Failed  Complete  Lost
api         0       1         5        0       0         0

# Load decreases:
[INFO] policy: stabilization window not passed, holding at 5
[INFO] policy: scaling api from 5 to 3: cpu_utilization=35%, target=70%
[INFO] scale: successfully scaled api to 3 allocations

# Cluster autoscaling (add nodes when all are at capacity):
[INFO] cluster: no feasible nodes for pending allocations
[INFO] cluster: launching 2 new instances via AWS ASG
[INFO] cluster: new nodes joined: node-5, node-6

Implementation Hints:

Autoscaler configuration:

# autoscaler.hcl
nomad {
  address = "http://localhost:4646"
}

apm "prometheus" {
  driver = "prometheus"
  config = {
    address = "http://prometheus:9090"
  }
}

target "nomad" {
  driver = "nomad"
}

policy {
  default_cooldown            = "2m"
  default_evaluation_interval = "30s"
}

Scaling policy in job spec:

job "api-service" {
  group "api" {
    count = 2

    scaling {
      enabled = true
      min     = 2
      max     = 20

      policy {
        # Target 70% average CPU
        check "cpu" {
          source = "prometheus"
          query  = "avg(nomad_client_allocs_cpu_allocated{task_group='api'})"

          strategy "target-value" {
            target = 70
          }
        }

        # Also scale on queue depth
        check "queue_depth" {
          source = "prometheus"
          query  = "sum(rabbitmq_queue_messages{queue='tasks'})"

          strategy "target-value" {
            target = 100  # 100 messages per instance
          }
        }
      }
    }

    task "api" {
      # ...
    }
  }
}

Building your own autoscaler (simplified):

package main

import (
    "math"
    "time"

    "github.com/hashicorp/nomad/api"
)

type Autoscaler struct {
    client     *api.Client
    targetCPU  float64
    minCount   int
    maxCount   int
    cooldown   time.Duration
    lastScaled time.Time
}

func (a *Autoscaler) Run() {
    ticker := time.NewTicker(30 * time.Second)
    defer ticker.Stop()

    for range ticker.C {
        if time.Since(a.lastScaled) < a.cooldown {
            continue // Still in cooldown
        }

        currentCPU := a.getAverageCPU()
        currentCount := a.getCurrentCount()

        // Round up so a small overload still triggers a scale-up
        desiredCount := int(math.Ceil(float64(currentCount) * (currentCPU / a.targetCPU)))
        desiredCount = clamp(desiredCount, a.minCount, a.maxCount)

        if desiredCount != currentCount {
            a.scale(desiredCount)
            a.lastScaled = time.Now()
        }
    }
}

func (a *Autoscaler) scale(count int) error {
    job, _, err := a.client.Jobs().Info("api-service", nil)
    if err != nil {
        return err
    }

    // Update the task group count
    for _, tg := range job.TaskGroups {
        if *tg.Name == "api" {
            tg.Count = &count
        }
    }

    // Re-registering the job makes Nomad create an evaluation
    _, _, err = a.client.Jobs().Register(job, nil)
    return err
}

func clamp(v, min, max int) int {
    if v < min {
        return min
    }
    if v > max {
        return max
    }
    return v
}

Learning milestones:

  1. You collect metrics from allocations → You understand Nomad telemetry
  2. You implement target-value scaling → You understand control loops
  3. You handle cooldown correctly → You understand stabilization
  4. You add cluster autoscaling → You understand infrastructure automation

Project 7: Custom Task Driver Plugin

  • File: LEARN_NOMAD_DEEP_DIVE.md
  • Main Programming Language: Go
  • Alternative Programming Languages: N/A (must be Go for plugins)
  • Coolness Level: Level 4: Hardcore Tech Flex
  • Business Potential: 4. The “Open Core” Infrastructure
  • Difficulty: Level 3: Advanced
  • Knowledge Area: Plugin System / Task Isolation / Go-Plugin
  • Software or Tool: Nomad Plugin SDK
  • Main Book: “Go Programming Language” by Donovan & Kernighan

What you’ll build: A custom task driver that runs workloads in a way not supported by built-in drivers—for example, running WASM modules, systemd units, Firecracker microVMs, or Podman containers.

Why it teaches Nomad: Task drivers are how Nomad executes workloads. Building one teaches you the plugin architecture, how Nomad tracks task state, handles resource isolation, and manages the task lifecycle.

Core challenges you’ll face:

  • Implementing the DriverPlugin interface → maps to understanding the plugin contract
  • Fingerprinting node capabilities → maps to what the driver can do
  • Managing task state → maps to starting, stopping, recovering tasks
  • Resource isolation → maps to cgroups, namespaces, security

Key Concepts:

Difficulty: Advanced. Time estimate: 2-3 weeks. Prerequisites: Projects 1-6 completed; strong Go proficiency; understanding of process isolation.

Real world outcome:

# Your custom WASM driver in action:

$ cat wasm-job.nomad
job "wasm-app" {
  group "app" {
    task "handler" {
      driver = "wasm"  # Your custom driver!

      config {
        module = "https://example.com/handler.wasm"
        runtime = "wasmtime"

        # WASI capabilities
        allow_fs = false
        allow_net = true
      }

      resources {
        cpu    = 100
        memory = 128
      }
    }
  }
}

$ nomad run wasm-job.nomad
$ nomad alloc logs abc123
[wasm-driver] Loading module from https://example.com/handler.wasm
[wasm-driver] Runtime: wasmtime v14.0.0
[wasm-driver] Instantiating module...
[handler] Listening on :8080
[handler] Received request: GET /

Implementation Hints:

Basic driver structure:

package main

import (
    "context"
    "time"

    "github.com/hashicorp/go-hclog"
    "github.com/hashicorp/nomad/drivers/shared/eventer"
    "github.com/hashicorp/nomad/plugins/base"
    "github.com/hashicorp/nomad/plugins/drivers"
    "github.com/hashicorp/nomad/plugins/shared/hclspec"
    pstructs "github.com/hashicorp/nomad/plugins/shared/structs"
)

const pluginName = "wasm"

// WasmDriver implements drivers.DriverPlugin
type WasmDriver struct {
    eventer *eventer.Eventer
    config  *Config
    logger  hclog.Logger

    // Track running tasks
    tasks *taskStore
}

// PluginInfo returns information about the plugin
func (d *WasmDriver) PluginInfo() (*base.PluginInfoResponse, error) {
    return &base.PluginInfoResponse{
        Type:              base.PluginTypeDriver,
        PluginApiVersions: []string{drivers.ApiVersion010},
        PluginVersion:     "0.1.0",
        Name:              pluginName,
    }, nil
}

// ConfigSchema returns the schema for driver config
func (d *WasmDriver) ConfigSchema() (*hclspec.Spec, error) {
    return hclspec.NewObject(map[string]*hclspec.Spec{
        "runtime_path": hclspec.NewDefault(
            hclspec.NewAttr("runtime_path", "string", false),
            hclspec.NewLiteral(`"/usr/local/bin/wasmtime"`),
        ),
    }), nil
}

// TaskConfigSchema returns the schema for task config
func (d *WasmDriver) TaskConfigSchema() (*hclspec.Spec, error) {
    return hclspec.NewObject(map[string]*hclspec.Spec{
        "module":    hclspec.NewAttr("module", "string", true),
        "runtime":   hclspec.NewAttr("runtime", "string", false),
        "allow_fs":  hclspec.NewAttr("allow_fs", "bool", false),
        "allow_net": hclspec.NewAttr("allow_net", "bool", false),
    }), nil
}

// Fingerprint returns node capabilities
func (d *WasmDriver) Fingerprint(ctx context.Context) (<-chan *drivers.Fingerprint, error) {
    ch := make(chan *drivers.Fingerprint)

    go func() {
        defer close(ch)

        fp := &drivers.Fingerprint{
            Attributes: map[string]*pstructs.Attribute{
                "driver.wasm.version": pstructs.NewStringAttribute("0.1.0"),
            },
            Health:            drivers.HealthStateHealthy,
            HealthDescription: "WASM runtime available",
        }

        // Check if wasmtime is installed
        if !d.isRuntimeAvailable() {
            fp.Health = drivers.HealthStateUndetected
            fp.HealthDescription = "wasmtime not found"
        }

        ch <- fp
    }()

    return ch, nil
}

// StartTask launches the WASM module
func (d *WasmDriver) StartTask(cfg *drivers.TaskConfig) (*drivers.TaskHandle, *drivers.DriverNetwork, error) {
    var taskConfig TaskConfig
    if err := cfg.DecodeDriverConfig(&taskConfig); err != nil {
        return nil, nil, err
    }

    d.logger.Info("starting wasm task", "module", taskConfig.Module)

    // Download the module and start the wasmtime process. Store the
    // internal handle in d.tasks so StopTask/WaitTask can find it later.
    handle, err := d.startWasmProcess(cfg, &taskConfig)
    if err != nil {
        return nil, nil, err
    }

    return handle, nil, nil
}

// StopTask stops a running task
func (d *WasmDriver) StopTask(taskID string, timeout time.Duration, signal string) error {
    handle, ok := d.tasks.Get(taskID)
    if !ok {
        return drivers.ErrTaskNotFound
    }

    // Send signal to process
    return handle.Kill(signal, timeout)
}

// WaitTask blocks until task exits
func (d *WasmDriver) WaitTask(ctx context.Context, taskID string) (<-chan *drivers.ExitResult, error) {
    handle, ok := d.tasks.Get(taskID)
    if !ok {
        return nil, drivers.ErrTaskNotFound
    }

    return handle.WaitCh(), nil
}

Main function with plugin serving:

func main() {
    // plugins is "github.com/hashicorp/nomad/plugins"; Serve performs
    // the go-plugin handshake and serves the driver to the Nomad client
    plugins.Serve(factory)
}

func factory(log hclog.Logger) interface{} {
    return &WasmDriver{
        eventer: eventer.NewEventer(context.Background(), log),
        logger:  log.Named(pluginName),
        tasks:   newTaskStore(),
    }
}

Learning milestones:

  1. You implement Fingerprint → You understand capability detection
  2. You implement StartTask → You understand task lifecycle
  3. You handle recovery after restarts → You understand driver state
  4. Tasks run with isolation → You understand resource control

Project 8: Raft Consensus Simulator

  • File: LEARN_NOMAD_DEEP_DIVE.md
  • Main Programming Language: Go
  • Alternative Programming Languages: Python, Rust
  • Coolness Level: Level 4: Hardcore Tech Flex
  • Business Potential: 1. The “Resume Gold”
  • Difficulty: Level 4: Expert
  • Knowledge Area: Distributed Systems / Consensus / Raft
  • Software or Tool: Custom implementation
  • Main Book: “Designing Data-Intensive Applications” by Martin Kleppmann

What you’ll build: A visual Raft consensus simulator that demonstrates leader election, log replication, and fault tolerance—showing exactly how Nomad servers coordinate.

Why it teaches Nomad: Nomad uses Raft for all state management. Understanding Raft means understanding how Nomad maintains consistency, handles leader failures, and ensures that scheduling decisions are durable.

Core challenges you’ll face:

  • Implementing leader election → maps to terms, votes, timeouts
  • Log replication → maps to AppendEntries RPC
  • Handling network partitions → maps to split-brain scenarios
  • State machine application → maps to Nomad’s FSM

Key Concepts:

Difficulty: Expert. Time estimate: 3-4 weeks. Prerequisites: Projects 1-7 completed; strong understanding of distributed systems.

Real world outcome:

┌─────────────────────────────────────────────────────────────────┐
│ Raft Consensus Simulator                                         │
├─────────────────────────────────────────────────────────────────┤
│                                                                  │
│  Term: 5                                                         │
│                                                                  │
│  ┌─────────────┐    ┌─────────────┐    ┌─────────────┐          │
│  │   Node 1    │    │   Node 2    │    │   Node 3    │          │
│  │   LEADER    │    │  FOLLOWER   │    │  FOLLOWER   │          │
│  │  Term: 5    │    │  Term: 5    │    │  Term: 5    │          │
│  │             │    │             │    │             │          │
│  │ Log:        │    │ Log:        │    │ Log:        │          │
│  │ [1] x=1 ✓   │    │ [1] x=1 ✓   │    │ [1] x=1 ✓   │          │
│  │ [2] y=2 ✓   │    │ [2] y=2 ✓   │    │ [2] y=2 ✓   │          │
│  │ [3] z=3     │◄──►│ [3] z=3     │◄──►│ [3] z=3     │          │
│  │    ↑ commit │    │             │    │             │          │
│  └─────────────┘    └─────────────┘    └─────────────┘          │
│                                                                  │
│  ┌─ Event Log ─────────────────────────────────────────────────┐│
│  │ 10:00:00.100 Node1: Became leader for term 5                ││
│  │ 10:00:00.150 Node1: Received client request: SET z=3        ││
│  │ 10:00:00.200 Node1→Node2: AppendEntries(term=5, entry=z=3)  ││
│  │ 10:00:00.200 Node1→Node3: AppendEntries(term=5, entry=z=3)  ││
│  │ 10:00:00.250 Node2→Node1: AppendEntriesResp(success=true)   ││
│  │ 10:00:00.260 Node3→Node1: AppendEntriesResp(success=true)   ││
│  │ 10:00:00.300 Node1: Entry z=3 committed (majority ack)      ││
│  │ 10:00:00.350 Node1: Applied z=3 to state machine            ││
│  └─────────────────────────────────────────────────────────────┘│
│                                                                  │
│  [Partition Node 2] [Kill Leader] [Add Entry] [Step]            │
│                                                                  │
└─────────────────────────────────────────────────────────────────┘

Implementation Hints:

Raft node state:

type NodeState int

const (
    Follower NodeState = iota
    Candidate
    Leader
)

type RaftNode struct {
    id    int
    state NodeState
    term  int

    // Persistent state (survives restarts)
    log      []LogEntry
    votedFor int

    // Volatile state
    commitIndex int
    lastApplied int

    // Candidate state
    votesReceived int

    // Leader state (reinitialized after election)
    nextIndex  map[int]int
    matchIndex map[int]int

    // Timing
    electionTimeout time.Duration
    lastHeartbeat   time.Time

    // Communication
    peers  []int
    inbox  chan Message
    outbox chan Message
}

type LogEntry struct {
    Term    int
    Index   int
    Command interface{}
}

type Message struct {
    From    int
    To      int
    Type    MessageType
    Term    int
    Payload interface{}
}

Leader election:

func (n *RaftNode) becomeCandidate() {
    n.state = Candidate
    n.term++
    n.votedFor = n.id
    n.votesReceived = 1 // Vote for self

    // Guard against an empty log
    lastLogIndex := len(n.log) - 1
    lastLogTerm := 0
    if lastLogIndex >= 0 {
        lastLogTerm = n.log[lastLogIndex].Term
    }

    // Request votes from all peers
    for _, peer := range n.peers {
        n.send(Message{
            From: n.id,
            To:   peer,
            Type: RequestVote,
            Term: n.term,
            Payload: RequestVoteArgs{
                CandidateID:  n.id,
                LastLogIndex: lastLogIndex,
                LastLogTerm:  lastLogTerm,
            },
        })
    }

    n.resetElectionTimer()
}

func (n *RaftNode) handleRequestVote(msg Message) {
    args := msg.Payload.(RequestVoteArgs)

    // Deny if the candidate's term is stale
    if args.Term < n.term {
        n.sendVoteReply(msg.From, false)
        return
    }

    // A newer term converts us to follower and clears our vote
    if args.Term > n.term {
        n.term = args.Term
        n.state = Follower
        n.votedFor = -1
    }

    // Grant only if we haven't voted (or voted for this candidate)
    // and the candidate's log is at least as up-to-date as ours
    if n.votedFor == -1 || n.votedFor == args.CandidateID {
        if n.isLogUpToDate(args.LastLogIndex, args.LastLogTerm) {
            n.votedFor = args.CandidateID
            n.sendVoteReply(msg.From, true)
            n.resetElectionTimer()
            return
        }
    }

    n.sendVoteReply(msg.From, false)
}

Log replication:

func (n *RaftNode) appendEntry(command interface{}) {
    if n.state != Leader {
        return
    }

    entry := LogEntry{
        Term:    n.term,
        Index:   len(n.log),
        Command: command,
    }
    n.log = append(n.log, entry)

    // Replicate to followers
    n.replicateLog()
}

func (n *RaftNode) replicateLog() {
    for _, peer := range n.peers {
        nextIdx := n.nextIndex[peer]
        entries := n.log[nextIdx:]

        // Term of the entry preceding the new ones (0 at log start)
        prevLogTerm := 0
        if nextIdx > 0 {
            prevLogTerm = n.log[nextIdx-1].Term
        }

        n.send(Message{
            From: n.id,
            To:   peer,
            Type: AppendEntries,
            Term: n.term,
            Payload: AppendEntriesArgs{
                LeaderID:     n.id,
                PrevLogIndex: nextIdx - 1,
                PrevLogTerm:  prevLogTerm,
                Entries:      entries,
                LeaderCommit: n.commitIndex,
            },
        })
    }
}

Learning milestones:

  1. Leader election works → You understand Raft elections
  2. Logs replicate correctly → You understand AppendEntries
  3. Network partitions handled → You understand split-brain prevention
  4. State machine applies → You understand FSM in consensus

Project 9: Gossip Protocol Implementation

  • File: LEARN_NOMAD_DEEP_DIVE.md
  • Main Programming Language: Go
  • Alternative Programming Languages: Python, Rust
  • Coolness Level: Level 4: Hardcore Tech Flex
  • Business Potential: 1. The “Resume Gold”
  • Difficulty: Level 3: Advanced
  • Knowledge Area: Distributed Systems / Membership / SWIM
  • Software or Tool: Custom implementation (inspired by Serf)
  • Main Book: “Designing Data-Intensive Applications” by Martin Kleppmann

What you’ll build: A gossip-based membership protocol implementing SWIM (Scalable Weakly-consistent Infection-style Process Group Membership), which is what Nomad uses for cluster discovery.

Why it teaches Nomad: Before Raft can elect a leader, servers need to discover each other. Serf’s gossip protocol handles this. Understanding gossip means understanding how Nomad clusters form and how failure detection works.

Core challenges you’ll face:

  • Implementing gossip dissemination → maps to rumor spreading
  • Failure detection → maps to ping, ping-req, suspect, dead
  • Handling network partitions → maps to partition healing
  • Scalability → maps to O(log N) message complexity

Key Concepts:

Difficulty: Advanced. Time estimate: 2-3 weeks. Prerequisites: Projects 1-8 completed; understanding of networking and UDP.

Real world outcome:

┌─────────────────────────────────────────────────────────────────┐
│ Gossip Protocol Simulator                                        │
├─────────────────────────────────────────────────────────────────┤
│                                                                  │
│  Cluster Members:                                                │
│  ┌────────────────────────────────────────────────────────────┐ │
│  │ Node    │ State   │ Last Seen │ Incarnation │ Messages     │ │
│  ├─────────┼─────────┼───────────┼─────────────┼──────────────┤ │
│  │ node-1  │ ALIVE   │ 100ms     │ 5           │ ████████     │ │
│  │ node-2  │ ALIVE   │ 50ms      │ 3           │ ██████       │ │
│  │ node-3  │ SUSPECT │ 2.5s      │ 7           │ ██           │ │
│  │ node-4  │ ALIVE   │ 200ms     │ 2           │ ████         │ │
│  │ node-5  │ DEAD    │ 30s       │ 1           │              │ │
│  └────────────────────────────────────────────────────────────┘ │
│                                                                  │
│  Message Flow:                                                   │
│  ┌────────────────────────────────────────────────────────────┐ │
│  │ 10:00:00.100 node-1 → node-2: PING                         │ │
│  │ 10:00:00.110 node-2 → node-1: ACK                          │ │
│  │ 10:00:00.200 node-1 → node-4: PING (piggyback: node-3 sus) │ │
│  │ 10:00:00.210 node-4 → node-1: ACK                          │ │
│  │ 10:00:00.300 node-2 → node-3: PING                         │ │
│  │ 10:00:00.800 node-2: No ACK from node-3, starting PING-REQ │ │
│  │ 10:00:00.810 node-2 → node-4: PING-REQ(node-3)             │ │
│  │ 10:00:00.820 node-4 → node-3: PING                         │ │
│  │ 10:00:01.320 node-4: No ACK from node-3                    │ │
│  │ 10:00:01.330 node-4 → node-2: NACK(node-3)                 │ │
│  │ 10:00:01.340 node-2: Marking node-3 as SUSPECT             │ │
│  └────────────────────────────────────────────────────────────┘ │
│                                                                  │
│  [Kill Node] [Partition] [Rejoin] [Show Convergence]            │
│                                                                  │
└─────────────────────────────────────────────────────────────────┘

Implementation Hints:

SWIM node structure:

type NodeState int

const (
    Alive NodeState = iota
    Suspect
    Dead
)

type Member struct {
    ID          string
    Addr        string
    State       NodeState
    Incarnation int
    LastUpdate  time.Time
}

type GossipNode struct {
    self    Member
    members map[string]*Member

    // SWIM parameters
    probeInterval  time.Duration
    probeTimeout   time.Duration
    suspectTimeout time.Duration
    indirectNodes  int

    // Responses routed here by the UDP read loop
    ackChan         chan AckMessage
    pingReqRespChan chan PingReqResponse

    // Gossip queue for piggybacking
    gossipQueue []GossipMessage

    // UDP socket
    conn *net.UDPConn
}

type GossipMessage struct {
    Type        MessageType
    Member      Member
    Incarnation int
}

type MessageType int

// The state-bearing message types carry a Msg prefix to avoid
// clashing with the NodeState constants (Alive, Suspect, Dead) above
const (
    Ping MessageType = iota
    Ack
    PingReq
    MsgAlive
    MsgSuspect
    MsgDead
)

SWIM probe protocol:

func (n *GossipNode) runProbeLoop() {
    ticker := time.NewTicker(n.probeInterval)

    for range ticker.C {
        // Select random member to probe
        target := n.randomMember()
        if target == nil {
            continue
        }

        // Direct probe
        ack := n.probe(target)
        if ack {
            continue
        }

        // No direct ACK - try indirect probes
        indirectAck := n.indirectProbe(target)
        if indirectAck {
            continue
        }

        // No indirect ACK - mark as suspect
        n.suspect(target)
    }
}

func (n *GossipNode) probe(target *Member) bool {
    // Send PING with piggybacked gossip
    msg := n.buildPingMessage(target.Addr)
    n.send(msg)

    // Wait for ACK
    select {
    case ack := <-n.ackChan:
        if ack.From == target.ID {
            return true
        }
    case <-time.After(n.probeTimeout):
        return false
    }
    return false
}

func (n *GossipNode) indirectProbe(target *Member) bool {
    // Select k random members to probe on our behalf
    helpers := n.randomMembers(n.indirectNodes)

    for _, helper := range helpers {
        n.send(PingReqMessage{
            To:     helper.Addr,
            Target: target.Addr,
        })
    }

    // Wait for any positive response
    timeout := time.After(n.probeTimeout * 2)
    for i := 0; i < len(helpers); i++ {
        select {
        case resp := <-n.pingReqRespChan:
            if resp.Target == target.ID && resp.Ack {
                return true
            }
        case <-timeout:
            return false
        }
    }
    return false
}

Gossip dissemination:

func (n *GossipNode) gossip(msg GossipMessage) {
    // Add to gossip queue
    n.gossipQueue = append(n.gossipQueue, msg)

    // Piggyback on next probe messages
    // SWIM piggybacks gossip on PING/ACK for efficiency
}

func (n *GossipNode) handleAlive(msg GossipMessage) {
    member, exists := n.members[msg.Member.ID]

    if !exists {
        // New member joined
        n.members[msg.Member.ID] = &msg.Member
        n.gossip(msg)
        return
    }

    // Only update if incarnation is higher
    if msg.Incarnation > member.Incarnation {
        member.State = Alive
        member.Incarnation = msg.Incarnation
        n.gossip(msg)
    }
}

func (n *GossipNode) refute() {
    // When we hear we're suspected, refute with a higher incarnation
    n.self.Incarnation++
    n.gossip(GossipMessage{
        Type:        MsgAlive,
        Member:      n.self,
        Incarnation: n.self.Incarnation,
    })
}

Learning milestones:

  1. Members discover each other → You understand gossip joining
  2. Failures are detected → You understand SWIM probing
  3. Suspicion and death work → You understand failure state machine
  4. Gossip converges quickly → You understand epidemic dissemination

Project 10: Mini-Scheduler Implementation

  • File: LEARN_NOMAD_DEEP_DIVE.md
  • Main Programming Language: Go
  • Alternative Programming Languages: Rust, Python
  • Coolness Level: Level 5: Pure Magic (Super Cool)
  • Business Potential: 4. The “Open Core” Infrastructure
  • Difficulty: Level 4: Expert
  • Knowledge Area: Scheduling / Bin Packing / Constraint Satisfaction
  • Software or Tool: Custom implementation
  • Main Book: “Scheduling: Theory, Algorithms, and Systems” by Michael Pinedo

What you’ll build: Your own job scheduler from scratch that handles job registration, evaluation creation, feasibility checking, bin-packing placement, and plan application—essentially a mini-Nomad scheduler.

Why it teaches Nomad: The scheduler is the heart of Nomad. Building one teaches you constraint satisfaction, bin packing algorithms, optimistic concurrency, and the evaluation-plan-allocation pipeline.

Core challenges you’ll face:

  • Evaluation pipeline → maps to how scheduling work is queued
  • Feasibility checking → maps to constraint filtering
  • Bin packing algorithm → maps to node scoring and ranking
  • Plan queue and conflict resolution → maps to optimistic concurrency

Key Concepts:

Difficulty: Expert. Time estimate: 1 month. Prerequisites: Projects 1-9 completed; strong algorithm skills.

Real world outcome:

┌─────────────────────────────────────────────────────────────────┐
│ Mini-Scheduler                                                   │
├─────────────────────────────────────────────────────────────────┤
│                                                                  │
│ Job: api-service                                                 │
│ Task Group: api (count: 5)                                       │
│ Requirements: cpu=500, memory=1024, constraint[instance=large]   │
│                                                                  │
│ ┌─ Scheduling Trace ──────────────────────────────────────────┐ │
│ │                                                              │ │
│ │ Phase 1: Feasibility                                         │ │
│ │   node-1: PASS (large, cpu=2000 avail, mem=4096 avail)       │ │
│ │   node-2: PASS (large, cpu=1500 avail, mem=3072 avail)       │ │
│ │   node-3: FAIL (small - constraint mismatch)                 │ │
│ │   node-4: PASS (large, cpu=2000 avail, mem=2048 avail)       │ │
│ │   node-5: FAIL (insufficient memory: 512 avail)              │ │
│ │                                                              │ │
│ │ Phase 2: Ranking (Bin Pack + Anti-Affinity)                  │ │
│ │   node-1: binpack=0.75, anti_affinity=1.00, score=0.875     │ │
│ │   node-2: binpack=0.85, anti_affinity=1.00, score=0.925     │ │
│ │   node-4: binpack=0.60, anti_affinity=1.00, score=0.800     │ │
│ │                                                              │ │
│ │ Phase 3: Placement                                           │ │
│ │   Allocation 1 → node-2 (highest score)                      │ │
│ │   Allocation 2 → node-1 (anti-affinity reduces node-2)       │ │
│ │   Allocation 3 → node-4 (spreading load)                     │ │
│ │   Allocation 4 → node-2 (resources still fit)                │ │
│ │   Allocation 5 → node-1 (resources still fit)                │ │
│ │                                                              │ │
│ │ Phase 4: Plan Applied Successfully                           │ │
│ │   5 allocations created                                      │ │
│ │                                                              │ │
│ └──────────────────────────────────────────────────────────────┘ │
│                                                                  │
└─────────────────────────────────────────────────────────────────┘

Implementation Hints:

Core data structures:

type Job struct {
    ID         string
    Name       string
    Type       string // service, batch, system
    Priority   int
    TaskGroups []*TaskGroup
}

type TaskGroup struct {
    Name        string
    Count       int
    Constraints []*Constraint
    Tasks       []*Task
}

type Task struct {
    Name      string
    Driver    string
    Resources *Resources
}

type Resources struct {
    CPU    int // MHz
    Memory int // MB
}

type Constraint struct {
    Attribute string
    Operator  string // =, !=, >, <, in, not_in
    Value     string
}

type Node struct {
    ID         string
    Datacenter string
    Attributes map[string]string
    Resources  *Resources
    Allocated  *Resources
}

type Allocation struct {
    ID           string
    JobID        string
    TaskGroup    string
    NodeID       string
    DesiredState string
    Resources    *Resources
}

type Evaluation struct {
    ID          string
    JobID       string
    TriggeredBy string
    Status      string
}

Scheduler implementation:

type Scheduler struct {
    jobs        map[string]*Job
    nodes       map[string]*Node
    allocations map[string]*Allocation
    evalBroker  *EvalBroker
    planQueue   *PlanQueue
}

func (s *Scheduler) Schedule(eval *Evaluation) (*Plan, error) {
    job, ok := s.jobs[eval.JobID]
    if !ok {
        return nil, fmt.Errorf("unknown job %q", eval.JobID)
    }

    plan := &Plan{
        EvalID:      eval.ID,
        Allocations: make([]*Allocation, 0),
    }

    for _, tg := range job.TaskGroups {
        // Current allocations for this task group
        current := s.allocationsFor(job.ID, tg.Name)
        desired := tg.Count

        // Need to place (desired - current) new allocations
        toPlace := desired - len(current)

        for i := 0; i < toPlace; i++ {
            // Phase 1: Feasibility
            feasible := s.feasibleNodes(tg)

            if len(feasible) == 0 {
                // No feasible nodes - evaluation blocked
                return nil, ErrNoFeasibleNodes
            }

            // Phase 2: Ranking
            ranked := s.rankNodes(feasible, tg, plan)

            // Phase 3: Place on best node
            best := ranked[0]
            alloc := s.createAllocation(job, tg, best.Node)

            plan.Allocations = append(plan.Allocations, alloc)

            // Update our view for subsequent placements
            s.reserveResources(best.Node, tg)
        }
    }

    return plan, nil
}

func (s *Scheduler) feasibleNodes(tg *TaskGroup) []*Node {
    feasible := make([]*Node, 0)

    for _, node := range s.nodes {
        if s.checkConstraints(node, tg.Constraints) &&
           s.checkResources(node, tg) {
            feasible = append(feasible, node)
        }
    }

    return feasible
}

func (s *Scheduler) checkConstraints(node *Node, constraints []*Constraint) bool {
    for _, c := range constraints {
        value := node.Attributes[c.Attribute]

        switch c.Operator {
        case "=":
            if value != c.Value {
                return false
            }
        case "!=":
            if value == c.Value {
                return false
            }
        // ... more operators
        }
    }
    return true
}

Bin packing scorer:

type ScoredNode struct {
    Node  *Node
    Score float64
}

func (s *Scheduler) rankNodes(nodes []*Node, tg *TaskGroup, plan *Plan) []*ScoredNode {
    scored := make([]*ScoredNode, 0, len(nodes))

    for _, node := range nodes {
        score := s.calculateScore(node, tg, plan)
        scored = append(scored, &ScoredNode{Node: node, Score: score})
    }

    // Sort by score descending
    sort.Slice(scored, func(i, j int) bool {
        return scored[i].Score > scored[j].Score
    })

    return scored
}

func (s *Scheduler) calculateScore(node *Node, tg *TaskGroup, plan *Plan) float64 {
    // Bin packing score: prefer nodes with less available resources
    // (pack jobs tightly together)
    binpackScore := s.binpackScore(node, tg)

    // Anti-affinity score: prefer nodes without other allocations
    // of the same job (spread instances out)
    antiAffinityScore := s.antiAffinityScore(node, tg, plan)

    // Combine scores
    return (binpackScore * 0.5) + (antiAffinityScore * 0.5)
}

func (s *Scheduler) binpackScore(node *Node, tg *TaskGroup) float64 {
    // Sum resources across all tasks in the group
    var cpu, mem int
    for _, t := range tg.Tasks {
        cpu += t.Resources.CPU
        mem += t.Resources.Memory
    }

    // Fraction of the node's resources in use after placement
    cpuAfter := float64(node.Allocated.CPU+cpu) / float64(node.Resources.CPU)
    memAfter := float64(node.Allocated.Memory+mem) / float64(node.Resources.Memory)

    // Higher utilization = higher score (bin packing)
    return (cpuAfter + memAfter) / 2.0
}

Learning milestones:

  1. Feasibility filtering works → You understand constraint checking
  2. Bin packing places efficiently → You understand scoring algorithms
  3. Anti-affinity spreads allocations → You understand fault tolerance
  4. Plan queue handles conflicts → You understand optimistic concurrency

Project 11: Multi-Region Federation Lab

  • File: LEARN_NOMAD_DEEP_DIVE.md
  • Main Programming Language: HCL + Terraform
  • Alternative Programming Languages: Bash, Go
  • Coolness Level: Level 3: Genuinely Clever
  • Business Potential: 3. The “Service & Support” Model
  • Difficulty: Level 3: Advanced
  • Knowledge Area: Multi-Region / WAN Federation / Disaster Recovery
  • Software or Tool: Nomad + Consul + Terraform
  • Main Book: “Site Reliability Engineering” by Google

What you’ll build: A multi-region Nomad deployment with federated clusters, cross-region job deployment, and disaster recovery failover—simulating a real global infrastructure.

Why it teaches Nomad: Production Nomad often spans multiple regions. Understanding federation teaches you WAN gossip, authoritative regions, cross-region scheduling, and how to design for geographic redundancy.

Core challenges you’ll face:

  • Setting up WAN federation → maps to gossip across regions
  • Cross-region job deployment → maps to multiregion job specification
  • Authoritative region for ACLs → maps to single source of truth
  • Failover and recovery → maps to region outage handling

Key Concepts:

Difficulty: Advanced. Time estimate: 2 weeks. Prerequisites: Projects 1-10 completed; Terraform experience helpful; cloud account (AWS/GCP/Azure) or sufficient local resources.

Real world outcome:

# Two regions: us-east and eu-west

$ nomad server members -region=us-east
Name                 Address     Port  Status  Leader  Region
server-1.us-east.dc1 10.0.1.11   4648  alive   true    us-east
server-2.us-east.dc1 10.0.1.12   4648  alive   false   us-east
server-3.us-east.dc1 10.0.1.13   4648  alive   false   us-east

$ nomad server members -region=eu-west
Name                 Address     Port  Status  Leader  Region
server-1.eu-west.dc1 10.1.1.11   4648  alive   true    eu-west
server-2.eu-west.dc1 10.1.1.12   4648  alive   false   eu-west
server-3.eu-west.dc1 10.1.1.13   4648  alive   false   eu-west

# Federated - servers know about each other:
$ nomad server members -region=us-east
Name                 Address     Port  Status  Leader  Region
server-1.us-east.dc1 10.0.1.11   4648  alive   true    us-east
server-2.us-east.dc1 10.0.1.12   4648  alive   false   us-east
server-3.us-east.dc1 10.0.1.13   4648  alive   false   us-east
server-1.eu-west.dc1 10.1.1.11   4648  alive   true    eu-west (WAN)
server-2.eu-west.dc1 10.1.1.12   4648  alive   false   eu-west (WAN)
server-3.eu-west.dc1 10.1.1.13   4648  alive   false   eu-west (WAN)

# Deploy multiregion job:
$ nomad run global-api.nomad
==> Monitoring multiregion deployment...
    Region "us-east": 3/3 allocations healthy
    Region "eu-west": 3/3 allocations healthy
    Multiregion deployment successful!

Implementation Hints:

Multiregion job specification (note: the multiregion block requires Nomad Enterprise; with open-source Nomad you can approximate it by submitting the job to each region separately with -region):

job "global-api" {
  type = "service"

  multiregion {
    strategy {
      max_parallel = 1
      on_failure   = "fail_all"
    }

    region "us-east" {
      count       = 3
      datacenters = ["us-east-1a", "us-east-1b"]
    }

    region "eu-west" {
      count       = 3
      datacenters = ["eu-west-1a", "eu-west-1b"]
    }
  }

  group "api" {
    # count is set per-region above

    network {
      port "http" { to = 8080 }
    }

    task "api" {
      driver = "docker"

      config {
        image = "myapp/api:v2"
        ports = ["http"]
      }

      env {
        REGION = "${NOMAD_REGION}"
        DC     = "${NOMAD_DC}"
      }
    }
  }
}
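The strategy block above governs how regions roll out: max_parallel = 1 deploys one region at a time, and on_failure = "fail_all" aborts everything if any region fails. A toy simulation of that semantics (not Nomad code; deploy is a stand-in for a region's rollout result):

```go
package main

import "fmt"

// deployMultiregion models the strategy above: regions deploy one at a
// time (max_parallel = 1), and a single failed region marks the whole
// job failed (on_failure = "fail_all").
func deployMultiregion(regions []string, deploy func(string) bool) map[string]string {
	status := make(map[string]string)
	for _, r := range regions {
		status[r] = "pending"
	}
	for _, r := range regions {
		if deploy(r) {
			status[r] = "successful"
			continue
		}
		// fail_all: mark every region failed and stop the rollout.
		for _, other := range regions {
			status[other] = "failed"
		}
		break
	}
	return status
}

func main() {
	regions := []string{"us-east", "eu-west"}
	allHealthy := func(r string) bool { return true }
	fmt.Println(deployMultiregion(regions, allHealthy))
}
```

Swapping on_failure to "fail_local" in the real spec would correspond to failing only the unhealthy region and leaving the rest untouched.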

Federation setup:

# us-east server config
server {
  enabled              = true
  bootstrap_expect     = 3
  authoritative_region = "us-east"  # This region is the source of truth for ACLs

  # Auto-join peers in this region via cloud tags
  server_join {
    retry_join = ["provider=aws tag_key=nomad-server tag_value=true"]
  }
}

# eu-west server config
server {
  enabled              = true
  bootstrap_expect     = 3
  authoritative_region = "us-east"  # Points to us-east for ACL replication

  # Also join us-east servers to federate the regions over the WAN
  server_join {
    retry_join = ["provider=aws tag_key=nomad-server tag_value=true region=us-east-1"]
  }
}
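Cloud auto-join is optional; you can also federate manually by pointing any server at a server in the other region (the address here is illustrative, matching the lab above; 4648 is the default serf port):

```shell
# Run from any us-east server to join one eu-west server over the WAN
nomad server join 10.1.1.11:4648
```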

Terraform for multi-region:

# Root main.tf, wiring the two per-region modules together
module "us_east" {
  source = "./modules/nomad-region"

  region     = "us-east-1"
  servers    = 3
  clients    = 5
  server_ips = var.us_east_server_ips
}

module "eu_west" {
  source = "./modules/nomad-region"

  region     = "eu-west-1"
  servers    = 3
  clients    = 5
  server_ips = var.eu_west_server_ips

  # Join to us-east for federation
  wan_join = module.us_east.server_ips
}

Learning milestones:

  1. Regions discover each other → You understand WAN gossip
  2. Multiregion jobs deploy → You understand coordinated deployment
  3. ACLs replicate from authoritative region → You understand central control
  4. Failover works → You understand disaster recovery

Project 12: Nomad Source Code Deep Dive

  • File: LEARN_NOMAD_DEEP_DIVE.md
  • Main Programming Language: Go
  • Alternative Programming Languages: N/A
  • Coolness Level: Level 4: Hardcore Tech Flex
  • Business Potential: 1. The “Resume Gold”
  • Difficulty: Level 4: Expert
  • Knowledge Area: Source Code / Internal Architecture
  • Software or Tool: Nomad source code
  • Main Book: “The Go Programming Language” by Donovan & Kernighan

What you’ll build: A documented walkthrough of the Nomad source code, tracing a job from submission through evaluation, scheduling, and allocation—with annotations and diagrams.

Why it teaches Nomad: Reading the source code is the ultimate way to understand how Nomad works. This project takes you through the actual implementation of concepts you’ve learned, seeing how theory translates to production code.

Core challenges you’ll face:

  • Navigating the codebase → maps to understanding package structure
  • Tracing RPC flows → maps to how requests move through the system
  • Understanding the FSM → maps to state machine operations
  • Following the scheduler → maps to actual bin-packing code

Key Concepts:

Difficulty: Expert
Time estimate: 2-3 weeks
Prerequisites: All previous projects. Strong Go proficiency. Ability to read complex codebases.

Real world outcome:

# Your documented walkthrough:

docs/nomad-source-walkthrough/
├── 00-overview.md           # Package structure and architecture
├── 01-job-submission.md     # nomad/job_endpoint.go
├── 02-evaluation-broker.md  # nomad/eval_broker.go
├── 03-scheduler-workers.md  # scheduler/generic_sched.go
├── 04-feasibility.md        # scheduler/feasible.go
├── 05-ranking.md            # scheduler/rank.go
├── 06-plan-queue.md         # nomad/plan_queue.go
├── 07-state-store.md        # nomad/state/state_store.go
├── 08-client-alloc.md       # client/allocrunner/
├── diagrams/
│   ├── job-flow.png
│   ├── scheduler-internals.png
│   └── raft-fsm.png
└── README.md

Implementation Hints:

Key directories to explore:

nomad/
├── command/          # CLI commands
├── api/              # Go client library
├── nomad/            # Server-side code
│   ├── job_endpoint.go      # Job RPC handlers
│   ├── eval_broker.go       # Evaluation queue
│   ├── plan_queue.go        # Plan application
│   ├── fsm.go               # Raft state machine
│   └── state/               # State store (in-memory go-memdb)
├── scheduler/        # Scheduling logic
│   ├── generic_sched.go     # Service/batch scheduler
│   ├── system_sched.go      # System scheduler
│   ├── feasible.go          # Constraint checking
│   └── rank.go              # Node scoring
├── client/           # Client-side code
│   ├── client.go            # Main client
│   ├── allocrunner/         # Allocation execution
│   └── fingerprint/         # Node fingerprinting
└── drivers/          # Task drivers
    ├── docker/
    ├── exec/
    └── rawexec/

Tracing job submission:

// 1. Job endpoint receives request
// nomad/job_endpoint.go
func (j *Job) Register(args *structs.JobRegisterRequest, reply *structs.JobRegisterResponse) error {
    // Validate job
    // Persist to state store via Raft
    // Create evaluation
}

// 2. Evaluation is enqueued
// nomad/eval_broker.go
func (b *EvalBroker) Enqueue(eval *structs.Evaluation) {
    // Add to priority queue
    // Signal waiting workers
}

// 3. Scheduler worker dequeues
// nomad/worker.go
func (w *Worker) run() {
    for {
        eval, _ := w.srv.evalBroker.Dequeue(...)
        w.invokeScheduler(eval)
    }
}

// 4. Scheduler processes
// scheduler/generic_sched.go
func (s *GenericScheduler) Process(eval *structs.Evaluation) error {
    // Reconcile desired vs actual state
    // For each needed allocation:
    //   - Find feasible nodes
    //   - Rank nodes
    //   - Create allocation in plan
    // Submit plan
}

// 5. Plan is applied
// nomad/plan_apply.go
func (p *planner) applyPlan(...) {
    // Check for conflicts
    // Apply via Raft
    // Create allocations in state store
}

Questions to answer in your walkthrough:

  • How does optimistic concurrency work in the plan queue?
  • What happens when a node fails during scheduling?
  • How are blocked evaluations handled?
  • How does the client reconcile allocations?
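For the first question, it helps to model the idea before reading the real plan queue: schedulers work from a state snapshot in parallel, but the leader evaluates plans serially and rejects any plan computed from a snapshot that has since gone stale. A toy sketch (hypothetical types, not Nomad's actual structs):

```go
package main

import "fmt"

// Plan is a toy stand-in for a scheduler's proposed placement,
// stamped with the state snapshot index it was computed from.
type Plan struct {
	SnapshotIndex uint64
	NodeID        string
}

// State is a toy cluster state: the last index at which each node's
// allocations changed.
type State struct {
	Index        uint64
	NodeModified map[string]uint64
}

// evaluatePlan mimics the leader's serialized plan evaluation: a plan
// is accepted only if the target node has not changed since the
// scheduler's snapshot; otherwise it is rejected and must be retried.
func evaluatePlan(s *State, p Plan) bool {
	if s.NodeModified[p.NodeID] > p.SnapshotIndex {
		return false // stale: something landed on this node after our snapshot
	}
	s.Index++
	s.NodeModified[p.NodeID] = s.Index
	return true
}

func main() {
	s := &State{Index: 10, NodeModified: map[string]uint64{}}
	a := Plan{SnapshotIndex: 10, NodeID: "node-1"}
	b := Plan{SnapshotIndex: 10, NodeID: "node-1"} // computed concurrently
	fmt.Println(evaluatePlan(s, a), evaluatePlan(s, b)) // prints "true false"
}
```

This is the optimistic part: schedulers never lock nodes while computing; conflicts are detected and retried at commit time.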

Learning milestones:

  1. You trace job submission → You understand RPC flow
  2. You understand the FSM → You understand Raft integration
  3. You follow the scheduler → You understand actual algorithms
  4. You document the flow → You can teach others

Project Comparison Table

| Project | Difficulty | Time | Depth of Understanding | Fun Factor |
|---------|------------|------|------------------------|------------|
| 1. Local Dev Cluster | Beginner | Weekend | ⭐⭐ | ⭐⭐⭐ |
| 2. Job Lifecycle Explorer | Beginner | 1 week | ⭐⭐⭐ | ⭐⭐⭐ |
| 3. Scheduler Visualizer | Intermediate | 1-2 weeks | ⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ |
| 4. Custom Constraints | Intermediate | 1 week | ⭐⭐⭐ | ⭐⭐⭐ |
| 5. Consul Service Mesh | Intermediate | 1-2 weeks | ⭐⭐⭐⭐ | ⭐⭐⭐⭐ |
| 6. Autoscaler | Advanced | 2 weeks | ⭐⭐⭐⭐ | ⭐⭐⭐⭐ |
| 7. Custom Task Driver | Advanced | 2-3 weeks | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ |
| 8. Raft Simulator | Expert | 3-4 weeks | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐ |
| 9. Gossip Protocol | Advanced | 2-3 weeks | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐ |
| 10. Mini-Scheduler | Expert | 1 month | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ |
| 11. Multi-Region Lab | Advanced | 2 weeks | ⭐⭐⭐⭐ | ⭐⭐⭐⭐ |
| 12. Source Code Dive | Expert | 2-3 weeks | ⭐⭐⭐⭐⭐ | ⭐⭐⭐ |

If you’re new to container orchestration:

  1. Project 1 (Local Cluster) - Get hands-on experience
  2. Project 2 (Job Lifecycle) - Understand job types
  3. Project 4 (Constraints) - Learn placement logic
  4. Project 5 (Service Mesh) - Understand networking

If you know Kubernetes and want to understand Nomad’s differences:

  1. Project 3 (Scheduler Visualizer) - See the evaluation system
  2. Project 7 (Task Driver) - Understand the plugin model
  3. Project 10 (Mini-Scheduler) - Build your own scheduler

If you want to deeply understand distributed systems:

  1. Project 8 (Raft) - Consensus fundamentals
  2. Project 9 (Gossip) - Membership protocols
  3. Project 10 (Scheduler) - Distributed scheduling
  4. Project 12 (Source Code) - Production implementation

Quick path (2-3 weeks):

  1. Project 1 (Weekend) - Get running
  2. Project 2 (3 days) - Understand jobs
  3. Project 3 (1 week) - See scheduling
  4. Project 5 (1 week) - Service mesh

Final Capstone Project: Production-Grade Nomad Platform

  • File: LEARN_NOMAD_DEEP_DIVE.md
  • Main Programming Language: Go + HCL + Terraform
  • Alternative Programming Languages: Python, Bash
  • Coolness Level: Level 5: Pure Magic (Super Cool)
  • Business Potential: 4. The “Open Core” Infrastructure
  • Difficulty: Level 5: Master
  • Knowledge Area: Platform Engineering / Production Operations
  • Software or Tool: Nomad + Consul + Vault + Terraform
  • Main Book: “Site Reliability Engineering” by Google

What you’ll build: A complete production-grade Nomad platform with multi-region federation, Consul service mesh, Vault secrets integration, monitoring/alerting, CI/CD pipeline integration, and disaster recovery procedures.

Why it teaches everything: This project integrates all concepts: scheduling, service mesh, secrets, observability, security, and operations. It’s the capstone that proves you can run Nomad in production.

Core components:

  1. Multi-region clusters with Terraform provisioning
  2. Consul Connect for service mesh
  3. Vault integration for secrets management
  4. Prometheus + Grafana for monitoring
  5. CI/CD integration for deployment pipelines
  6. Runbooks for operational procedures
  7. Disaster recovery procedures and testing
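For component 4, Nomad agents can expose Prometheus-format metrics directly. A minimal telemetry stanza for the agent config (a sketch; tune collection_interval for your environment):

```hcl
telemetry {
  collection_interval        = "1s"
  prometheus_metrics         = true
  publish_allocation_metrics = true
  publish_node_metrics       = true
}
```

Prometheus can then scrape each agent's /v1/metrics endpoint, typically discovered via Consul.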

Real world outcome:

Production Nomad Platform
├── infrastructure/
│   ├── terraform/
│   │   ├── modules/
│   │   │   ├── nomad-cluster/
│   │   │   ├── consul-cluster/
│   │   │   └── vault-cluster/
│   │   ├── environments/
│   │   │   ├── production/
│   │   │   └── staging/
│   │   └── main.tf
│   └── packer/
│       └── nomad-ami.pkr.hcl
├── jobs/
│   ├── platform/
│   │   ├── prometheus.nomad
│   │   ├── grafana.nomad
│   │   ├── traefik.nomad
│   │   └── vault-agent.nomad
│   └── applications/
│       └── example-app.nomad
├── policies/
│   ├── nomad-acl/
│   ├── consul-intentions/
│   └── vault-policies/
├── monitoring/
│   ├── dashboards/
│   ├── alerts/
│   └── runbooks/
├── ci-cd/
│   ├── .github/workflows/
│   └── scripts/
└── docs/
    ├── architecture.md
    ├── operations.md
    └── disaster-recovery.md

Time estimate: 2-3 months
Prerequisites: All 12 projects completed


Essential Resources

Official Documentation

Key Papers and Articles

Courses

Books (Complementary)

  • “Designing Data-Intensive Applications” by Martin Kleppmann - Distributed systems fundamentals
  • “Site Reliability Engineering” by Google - Production operations
  • “The Go Programming Language” by Donovan & Kernighan - For source code reading

Source Code

Community


Summary

| # | Project | Main Programming Language |
|---|---------|---------------------------|
| 1 | Local Development Cluster | HCL + Bash |
| 2 | Job Lifecycle Explorer | HCL + Bash |
| 3 | Scheduler Visualization Tool | Go |
| 4 | Custom Resource Constraint System | HCL + Go |
| 5 | Consul Service Mesh Integration | HCL |
| 6 | Autoscaler Implementation | Go |
| 7 | Custom Task Driver Plugin | Go |
| 8 | Raft Consensus Simulator | Go |
| 9 | Gossip Protocol Implementation | Go |
| 10 | Mini-Scheduler Implementation | Go |
| 11 | Multi-Region Federation Lab | HCL + Terraform |
| 12 | Nomad Source Code Deep Dive | Go |
| Final | Production-Grade Nomad Platform | Go + HCL + Terraform |

Happy scheduling! Nomad’s simplicity hides powerful distributed systems concepts—enjoy the journey from “nomad run” to understanding every byte of the protocol. 🚀