
LEARN NOMAD DEEP DIVE

Learn HashiCorp Nomad: From Zero to Mastery

Goal: Deeply understand Nomad—from basic job scheduling to implementing your own scheduler, understanding Raft consensus, building task drivers, and mastering the internals that make it a production-grade orchestrator.


Why Nomad Matters

Nomad is HashiCorp’s workload orchestrator, and it occupies a unique position in the infrastructure landscape:

  • Simpler than Kubernetes: Single binary, no etcd, no complex control plane
  • More flexible: Runs containers, VMs, Java apps, binaries, batch jobs—anything
  • Production-proven: Powers millions of containers at companies like Cloudflare, Roblox, and CircleCI
  • Inspired by Google: The scheduler design draws from Google’s Borg and Omega papers
  • HashiCorp Ecosystem: Seamless integration with Consul (service mesh), Vault (secrets), Terraform (provisioning)

Understanding Nomad deeply teaches you:

  1. Distributed systems fundamentals: Raft consensus, gossip protocols, leader election
  2. Scheduler design: Bin packing, spread algorithms, constraint satisfaction
  3. Container orchestration: Namespaces, cgroups, networking modes
  4. Service mesh concepts: Envoy sidecars, mutual TLS, service discovery
  5. Real production patterns: Multi-region, high availability, disaster recovery

Core Architecture Overview

┌─────────────────────────────────────────────────────────────────┐
│                           NOMAD CLUSTER                          │
├─────────────────────────────────────────────────────────────────┤
│                                                                  │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐           │
│  │   Server 1   │  │   Server 2   │  │   Server 3   │           │
│  │   (Leader)   │◄─┤  (Follower)  ├──┤  (Follower)  │           │
│  │              │  │              │  │              │           │
│  │ ┌──────────┐ │  │ ┌──────────┐ │  │ ┌──────────┐ │           │
│  │ │   Raft   │ │  │ │   Raft   │ │  │ │   Raft   │ │           │
│  │ │  (FSM)   │ │  │ │  (FSM)   │ │  │ │  (FSM)   │ │           │
│  │ └──────────┘ │  │ └──────────┘ │  │ └──────────┘ │           │
│  │ ┌──────────┐ │  │ ┌──────────┐ │  │ ┌──────────┐ │           │
│  │ │Scheduler │ │  │ │Scheduler │ │  │ │Scheduler │ │           │
│  │ │ Workers  │ │  │ │ Workers  │ │  │ │ Workers  │ │           │
│  │ └──────────┘ │  │ └──────────┘ │  │ └──────────┘ │           │
│  └──────┬───────┘  └──────┬───────┘  └──────┬───────┘           │
│         │                 │                 │                    │
│         └─────────────────┼─────────────────┘                    │
│                     Serf Gossip                                  │
│                     (Membership)                                 │
│         ┌─────────────────┼─────────────────┐                    │
│         │                 │                 │                    │
│  ┌──────▼───────┐  ┌──────▼───────┐  ┌──────▼───────┐           │
│  │   Client 1   │  │   Client 2   │  │   Client N   │           │
│  │              │  │              │  │              │           │
│  │ ┌──────────┐ │  │ ┌──────────┐ │  │ ┌──────────┐ │           │
│  │ │  Docker  │ │  │ │   exec   │ │  │ │ raw_exec │ │           │
│  │ │  Driver  │ │  │ │  Driver  │ │  │ │  Driver  │ │           │
│  │ └──────────┘ │  │ └──────────┘ │  │ └──────────┘ │           │
│  │ ┌──────────┐ │  │ ┌──────────┐ │  │ ┌──────────┐ │           │
│  │ │Allocation│ │  │ │Allocation│ │  │ │Allocation│ │           │
│  │ │  Runner  │ │  │ │  Runner  │ │  │ │  Runner  │ │           │
│  │ └──────────┘ │  │ └──────────┘ │  │ └──────────┘ │           │
│  └──────────────┘  └──────────────┘  └──────────────┘           │
│                                                                  │
└─────────────────────────────────────────────────────────────────┘

Key Components

Component     Purpose
---------     -------
Server        Control plane—stores state, runs schedulers, coordinates via Raft
Client        Data plane—runs workloads, reports node status
Raft          Consensus algorithm for leader election and state replication
Serf/Gossip   Membership protocol for cluster discovery
Scheduler     Places allocations on nodes using bin packing/spread
Task Drivers  Execute workloads (Docker, exec, raw_exec, Java, etc.)
Evaluation    Unit of scheduling work, triggered by state changes
Allocation    A task group placed on a specific node

The Scheduling Pipeline

Understanding this pipeline is key to understanding Nomad:

Job Submission → Evaluation Created → Scheduler Processes →
Plan Generated → Leader Applies Plan → Allocation Created →
Client Runs Tasks

  1. Job Registration: User submits a job (desired state)
  2. Evaluation Creation: Nomad creates an evaluation to process the change
  3. Scheduler Dequeues: A scheduler worker picks up the evaluation
  4. Feasibility Checking: Filter nodes that can’t run the job (constraints, resources)
  5. Ranking: Score feasible nodes (bin packing, affinity, spread)
  6. Plan Submission: Scheduler submits plan to leader
  7. Plan Application: Leader checks for conflicts, applies or rejects
  8. Allocation: Client receives allocation and starts tasks
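The filter and ranking phases (steps 4 and 5) can be sketched in a few lines of Go. This is a toy model with made-up node data; the types and function names are illustrative, not Nomad's internal API:

```go
package main

import "fmt"

// Node is a simplified view of what the scheduler sees per client.
type Node struct {
	Name             string
	FreeCPU, FreeMem int
}

// feasible implements the filter phase: drop nodes that cannot
// possibly run the task (step 4 of the pipeline).
func feasible(nodes []Node, cpu, mem int) []Node {
	var out []Node
	for _, n := range nodes {
		if n.FreeCPU >= cpu && n.FreeMem >= mem {
			out = append(out, n)
		}
	}
	return out
}

// binpackScore is a toy version of the ranking phase (step 5):
// prefer the node that would be MOST full after placement, so
// emptier nodes stay free for large jobs.
func binpackScore(n Node, cpu, mem int) float64 {
	cpuUsed := 1.0 - float64(n.FreeCPU-cpu)/float64(n.FreeCPU)
	memUsed := 1.0 - float64(n.FreeMem-mem)/float64(n.FreeMem)
	return (cpuUsed + memUsed) / 2
}

// place filters, then picks the highest-scoring candidate.
func place(nodes []Node, cpu, mem int) (Node, bool) {
	candidates := feasible(nodes, cpu, mem)
	if len(candidates) == 0 {
		return Node{}, false // the evaluation would block
	}
	best := candidates[0]
	for _, n := range candidates[1:] {
		if binpackScore(n, cpu, mem) > binpackScore(best, cpu, mem) {
			best = n
		}
	}
	return best, true
}

func main() {
	nodes := []Node{
		{"client-1", 400, 512},
		{"client-2", 2000, 4096},
		{"client-3", 100, 64}, // too small, filtered out
	}
	if n, ok := place(nodes, 200, 256); ok {
		fmt.Println("placed on", n.Name) // client-1: bin packing prefers the fuller node
	}
}
```

Note how bin packing picks the nearly-full client-1 over the mostly-empty client-2; a "spread" strategy would invert the score.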

Project List

Projects progress from understanding the basics to building sophisticated internal components.


Project 1: Local Development Cluster

  • File: LEARN_NOMAD_DEEP_DIVE.md
  • Main Programming Language: HCL (HashiCorp Configuration Language) + Bash
  • Alternative Programming Languages: Python, Go
  • Coolness Level: Level 2: Practical but Forgettable
  • Business Potential: 1. The “Resume Gold”
  • Difficulty: Level 1: Beginner
  • Knowledge Area: Cluster Setup / Agent Configuration
  • Software or Tool: Nomad, Docker
  • Main Book: HashiCorp Nomad Documentation

What you’ll build: A local 3-server, 2-client Nomad cluster running in Docker containers (or as processes), demonstrating leader election, automatic clustering, and basic job deployment.

Why it teaches Nomad: Before diving deep, you need a running cluster to experiment with. This project introduces the agent modes (server/client), configuration files, and the basic operational commands.

Core challenges you’ll face:

  • Configuring server vs client mode → maps to understanding agent roles
  • Setting up cluster bootstrapping → maps to bootstrap_expect and join
  • Networking between agents → maps to advertise addresses and ports
  • Running your first job → maps to job specification basics

Key Concepts:

Difficulty: Beginner
Time estimate: Weekend
Prerequisites: Docker installed. Basic understanding of configuration files. Command-line comfort.

Real world outcome:

# Your cluster is running:
$ nomad server members
Name          Address      Port  Status  Leader  Protocol
server-1.dc1  10.0.0.1     4648  alive   true    2
server-2.dc1  10.0.0.2     4648  alive   false   2
server-3.dc1  10.0.0.3     4648  alive   false   2

$ nomad node status
ID        DC   Name      Class   Drain  Eligibility  Status
a1b2c3d4  dc1  client-1  <none>  false  eligible     ready
e5f6g7h8  dc1  client-2  <none>  false  eligible     ready

# Deploy a simple job:
$ nomad run hello.nomad
==> Monitoring deployment...
    2024-01-15T10:00:00Z (ID: abc123)
    Deployment "abc123" successful

$ curl http://client-1:8080
Hello from Nomad!

Implementation Hints:

Directory structure:

nomad-cluster/
├── docker-compose.yml       # Or Vagrant, or shell scripts
├── config/
│   ├── server.hcl           # Server configuration
│   └── client.hcl           # Client configuration
└── jobs/
    └── hello.nomad          # First job

Server configuration (server.hcl):

# Minimal server config
data_dir = "/opt/nomad/data"
bind_addr = "0.0.0.0"

server {
  enabled          = true
  bootstrap_expect = 3  # Wait for 3 servers before electing leader
}

# Advertise address for other nodes to reach this one
advertise {
  http = "{{ GetInterfaceIP \"eth0\" }}"
  rpc  = "{{ GetInterfaceIP \"eth0\" }}"
  serf = "{{ GetInterfaceIP \"eth0\" }}"
}

Client configuration (client.hcl):

data_dir = "/opt/nomad/data"
bind_addr = "0.0.0.0"

client {
  enabled = true
  servers = ["server-1:4647", "server-2:4647", "server-3:4647"]
}

# Enable task drivers
plugin "docker" {
  config {
    allow_privileged = false
  }
}

First job (hello.nomad):

job "hello" {
  datacenters = ["dc1"]
  type        = "service"

  group "web" {
    count = 2

    network {
      port "http" {
        static = 8080  # Fixed host port so http://client-1:8080 is reachable
        to     = 8080
      }
    }

    task "server" {
      driver = "docker"

      config {
        image = "hashicorp/http-echo"
        # http-echo listens on 5678 by default; -listen moves it to the mapped port
        args  = ["-text", "Hello from Nomad!", "-listen", ":8080"]
        ports = ["http"]
      }

      resources {
        cpu    = 100
        memory = 64
      }
    }
  }
}

Learning milestones:

  1. Servers elect a leader → You understand Raft basics
  2. Clients join the cluster → You understand gossip membership
  3. Job deploys successfully → You understand the scheduling flow
  4. You scale the job → You understand desired vs actual state
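Milestone 1 rests on Raft's majority rule: bootstrap_expect = 3 waits for three servers because a leader needs votes from a quorum. A quick Go sketch of the arithmetic (illustrative helper names, not Nomad code):

```go
package main

import "fmt"

// quorum returns the majority needed in a Raft cluster of n servers.
// This is why bootstrap_expect = 3 waits for all three servers
// before a leader can be elected.
func quorum(n int) int { return n/2 + 1 }

// faultTolerance is how many servers can fail while the cluster
// can still elect a leader and commit writes.
func faultTolerance(n int) int { return n - quorum(n) }

func main() {
	for _, n := range []int{1, 3, 5} {
		fmt.Printf("%d servers: quorum=%d, tolerates %d failure(s)\n",
			n, quorum(n), faultTolerance(n))
	}
}
```

This is also why even-sized clusters are discouraged: 4 servers need a quorum of 3 and tolerate only 1 failure, the same as 3 servers.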

Project 2: Job Lifecycle Explorer

  • File: LEARN_NOMAD_DEEP_DIVE.md
  • Main Programming Language: HCL + Bash
  • Alternative Programming Languages: Go (API client), Python
  • Coolness Level: Level 2: Practical but Forgettable
  • Business Potential: 1. The “Resume Gold”
  • Difficulty: Level 1: Beginner
  • Knowledge Area: Job Types / Task Lifecycle / Update Strategies
  • Software or Tool: Nomad
  • Main Book: HashiCorp Nomad Documentation

What you’ll build: A suite of jobs demonstrating all Nomad job types (service, batch, system, sysbatch), task lifecycle hooks, update strategies, and health checks.

Why it teaches Nomad: Each job type has different scheduling behavior. Understanding when tasks start, stop, and restart—and how updates roll out—is essential for production deployments.

Core challenges you’ll face:

  • Understanding job types → maps to service vs batch vs system schedulers
  • Lifecycle hooks (prestart, poststart, poststop) → maps to task dependencies
  • Health checks and deployment health → maps to how Nomad knows tasks are ready
  • Update strategies → maps to rolling vs canary deployments

Key Concepts:

Difficulty: Beginner
Time estimate: 1 week
Prerequisites: Project 1 completed. Running cluster.

Real world outcome:

# Different job types in action:

# Service job - long-running, rescheduled on failure
$ nomad run api-service.nomad
$ nomad job status api-service
Allocations
ID        Node ID   Task Group  Version  Desired  Status   Created
abc123    a1b2c3    api         0        run      running  5m ago

# Batch job - runs to completion
$ nomad run data-import.nomad
$ nomad job status data-import
Allocations
ID        Node ID   Task Group  Version  Desired  Status    Created
def456    e5f6g7    import      0        run      complete  2m ago

# System job - runs on every node
$ nomad run node-exporter.nomad
$ nomad job status node-exporter
Allocations (2 nodes)
ID        Node ID   Task Group  Version  Desired  Status   Created
ghi789    a1b2c3    exporter    0        run      running  1m ago
jkl012    e5f6g7    exporter    0        run      running  1m ago

# Rolling update with health checks:
$ nomad run -var version=v2 api-service.nomad
==> Monitoring deployment...
    Deployment "xyz789" in progress...
    2024-01-15T10:05:00Z - 1/3 allocations healthy
    2024-01-15T10:06:00Z - 2/3 allocations healthy
    2024-01-15T10:07:00Z - 3/3 allocations healthy
    Deployment "xyz789" successful

Implementation Hints:

Service job with lifecycle hooks:

job "api-service" {
  type = "service"

  group "api" {
    count = 3

    update {
      max_parallel     = 1
      min_healthy_time = "30s"
      healthy_deadline = "5m"
      auto_revert      = true
      canary           = 1  # Deploy 1 canary first
    }

    # Prestart task - runs before main task
    task "db-migrate" {
      lifecycle {
        hook    = "prestart"
        sidecar = false
      }

      driver = "docker"
      config {
        image   = "myapp/migrate:${var.version}"
        command = "/migrate"
      }
    }

    # Main task
    task "api" {
      driver = "docker"

      config {
        image = "myapp/api:${var.version}"
        ports = ["http"]
      }

      # Health check for deployment
      service {
        name = "api"
        port = "http"

        check {
          type     = "http"
          path     = "/health"
          interval = "10s"
          timeout  = "2s"
        }
      }
    }

    # Sidecar - runs alongside main task
    task "log-shipper" {
      lifecycle {
        hook    = "poststart"
        sidecar = true
      }

      driver = "docker"
      config {
        image = "fluent/fluent-bit"
      }
    }
  }
}

Batch job with periodic scheduling:

job "daily-report" {
  type = "batch"

  periodic {
    cron             = "0 2 * * *"  # 2 AM daily
    prohibit_overlap = true
  }

  group "report" {
    task "generate" {
      driver = "docker"
      config {
        image   = "myapp/reporter"
        command = "/generate-report"
      }

      # Batch jobs can have restart policies
      restart {
        attempts = 3
        interval = "30m"
        delay    = "15s"
        mode     = "fail"
      }
    }
  }
}

Learning milestones:

  1. You run all 4 job types → You understand scheduler differences
  2. Lifecycle hooks execute in order → You understand task dependencies
  3. Health checks gate deployments → You understand deployment health
  4. Rolling updates work with canaries → You understand update strategies
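To build intuition for how max_parallel paces a rolling update, here is a small Go sketch (ignoring canaries and failed rounds; rollingBatches is an illustrative name, not a Nomad API):

```go
package main

import "fmt"

// rollingBatches returns how many update rounds a deployment needs
// when replacing `total` allocations `maxParallel` at a time. Each
// round must pass its health checks (min_healthy_time) before the
// next round starts, so fewer parallel updates = safer but slower.
func rollingBatches(total, maxParallel int) int {
	if maxParallel <= 0 || total <= 0 {
		return 0
	}
	return (total + maxParallel - 1) / maxParallel // ceiling division
}

func main() {
	// 3 allocations with max_parallel = 1, as in the api-service job:
	// the deployment replaces them one at a time over 3 gated rounds.
	fmt.Println("rounds:", rollingBatches(3, 1))
}
```

With min_healthy_time = "30s", the 3 rounds above mean the rollout takes at least 90 seconds even when everything is healthy.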

Project 3: Scheduler Visualization Tool

  • File: LEARN_NOMAD_DEEP_DIVE.md
  • Main Programming Language: Go
  • Alternative Programming Languages: Python, JavaScript
  • Coolness Level: Level 3: Genuinely Clever
  • Business Potential: 2. The “Micro-SaaS / Pro Tool”
  • Difficulty: Level 2: Intermediate
  • Knowledge Area: Scheduling Internals / API Usage
  • Software or Tool: Nomad API
  • Main Book: “Designing Data-Intensive Applications” by Martin Kleppmann (for distributed concepts)

What you’ll build: A tool that visualizes Nomad’s scheduling decisions in real-time—showing evaluations, plans, allocations, node scores, and why certain nodes were chosen or rejected.

Why it teaches Nomad: The scheduler is Nomad’s brain. By building a visualization tool, you’ll understand the evaluation → plan → allocation pipeline, see bin packing in action, and understand why placements happen where they do.

Core challenges you’ll face:

  • Streaming evaluations → maps to understanding the evaluation broker
  • Parsing allocation plans → maps to feasibility and ranking phases
  • Visualizing node scores → maps to bin packing algorithm
  • Understanding failures → maps to why allocations get blocked

Key Concepts:

Difficulty: Intermediate
Time estimate: 1-2 weeks
Prerequisites: Projects 1-2 completed. Go or Python proficiency.

Real world outcome:

┌─────────────────────────────────────────────────────────────────┐
│ Nomad Scheduler Visualizer                                       │
├─────────────────────────────────────────────────────────────────┤
│                                                                  │
│ Evaluation: abc123-def456 (Triggered by: job-register)          │
│ Job: api-service | Type: service | Priority: 50                 │
│                                                                  │
│ ┌─ Scheduling Timeline ──────────────────────────────────────┐  │
│ │                                                             │  │
│ │  10:00:00.100  Evaluation Created                          │  │
│ │  10:00:00.150  Scheduler Dequeued (worker-3)               │  │
│ │  10:00:00.200  Feasibility Check Started                   │  │
│ │                 - client-1: FEASIBLE                       │  │
│ │                 - client-2: FEASIBLE                       │  │
│ │                 - client-3: REJECTED (insufficient memory) │  │
│ │  10:00:00.250  Ranking Phase                               │  │
│ │                 - client-1: score=0.85 (binpack: 0.9,      │  │
│ │                              anti-affinity: 0.8)           │  │
│ │                 - client-2: score=0.72 (binpack: 0.7,      │  │
│ │                              anti-affinity: 0.75)          │  │
│ │  10:00:00.300  Plan Submitted                              │  │
│ │  10:00:00.350  Plan Accepted (no conflicts)                │  │
│ │  10:00:00.400  Allocation Created on client-1              │  │
│ │                                                             │  │
│ └─────────────────────────────────────────────────────────────┘  │
│                                                                  │
│ Node Resource Usage:                                             │
│ ┌──────────────────────────────────────────────────────────────┐ │
│ │ client-1 [████████████░░░░░░░░] CPU: 60% │ Mem: 2.1/4GB     │ │
│ │ client-2 [████████░░░░░░░░░░░░] CPU: 40% │ Mem: 1.5/4GB     │ │
│ │ client-3 [████████████████████] CPU: 95% │ Mem: 3.9/4GB     │ │
│ └──────────────────────────────────────────────────────────────┘ │
│                                                                  │
└─────────────────────────────────────────────────────────────────┘

Implementation Hints:

Using the Event Stream API:

package main

import (
    "context"
    "fmt"
    "github.com/hashicorp/nomad/api"
)

func main() {
    client, err := api.NewClient(api.DefaultConfig())
    if err != nil {
        panic(err)
    }

    // Subscribe to evaluation, allocation, and node events
    topics := map[api.Topic][]string{
        api.TopicEvaluation: {"*"},
        api.TopicAllocation: {"*"},
        api.TopicNode:       {"*"},
    }

    ctx := context.Background()
    eventCh, err := client.EventStream().Stream(ctx, topics, 0, nil)
    if err != nil {
        panic(err)
    }

    for events := range eventCh {
        if events.Err != nil {
            fmt.Println("stream error:", events.Err)
            continue
        }
        for _, event := range events.Events {
            switch event.Topic {
            case api.TopicEvaluation:
                handleEvaluation(event)
            case api.TopicAllocation:
                handleAllocation(event)
            }
        }
    }
}

func handleEvaluation(event api.Event) {
    // Event payloads are decoded via helper accessors
    eval, err := event.Evaluation()
    if err != nil || eval == nil {
        return
    }
    fmt.Printf("Eval %s: Status=%s TriggeredBy=%s\n",
        eval.ID[:8], eval.Status, eval.TriggeredBy)

    // Get detailed scheduling info once the evaluation completes
    if eval.Status == "complete" {
        showPlanDetails(eval.ID)
    }
}

func handleAllocation(event api.Event) {
    // Left as an exercise: decode with event.Allocation() and print status
}

Getting node scores and feasibility info:

func showPlanDetails(evalID string) {
    client, err := api.NewClient(api.DefaultConfig())
    if err != nil {
        return
    }

    // List allocations created by this evaluation (returns stubs)
    stubs, _, err := client.Evaluations().Allocations(evalID, nil)
    if err != nil {
        return
    }

    for _, stub := range stubs {
        // Fetch the full allocation to get the scheduler's
        // placement metrics (the list stubs omit them)
        alloc, _, err := client.Allocations().Info(stub.ID, nil)
        if err != nil || alloc.Metrics == nil {
            continue
        }
        fmt.Printf("  Allocation %s on %s\n", alloc.ID[:8], alloc.NodeID[:8])

        // Metrics show why this node was chosen
        fmt.Printf("    NodesEvaluated: %d\n", alloc.Metrics.NodesEvaluated)
        fmt.Printf("    NodesFiltered: %d\n", alloc.Metrics.NodesFiltered)

        // Filter reasons
        for class, count := range alloc.Metrics.ClassFiltered {
            fmt.Printf("    Filtered by class %s: %d\n", class, count)
        }

        // Scores
        for _, score := range alloc.Metrics.ScoreMetaData {
            fmt.Printf("    Node %s: score=%.2f\n",
                score.NodeID[:8], score.NormScore)
        }
    }
}

Learning milestones:

  1. You stream evaluation events → You understand the evaluation broker
  2. You show feasibility filtering → You understand constraint checking
  3. You display node scores → You understand bin packing
  4. You track blocked evaluations → You understand resource exhaustion

Project 4: Custom Resource Constraint System

  • File: LEARN_NOMAD_DEEP_DIVE.md
  • Main Programming Language: HCL + Go
  • Alternative Programming Languages: Python (for testing)
  • Coolness Level: Level 3: Genuinely Clever
  • Business Potential: 3. The “Service & Support” Model
  • Difficulty: Level 2: Intermediate
  • Knowledge Area: Constraints / Affinities / Node Metadata
  • Software or Tool: Nomad
  • Main Book: “Designing Data-Intensive Applications” by Martin Kleppmann

What you’ll build: A system that uses node attributes, custom metadata, constraints, and affinities to implement complex placement requirements like “run on GPU nodes,” “prefer SSD storage,” “spread across availability zones.”

Why it teaches Nomad: Real-world scheduling is rarely just “find a node with enough CPU.” This project teaches you how Nomad’s constraint system enables sophisticated placement logic without modifying the scheduler itself.

Core challenges you’ll face:

  • Node fingerprinting → maps to how Nomad discovers node capabilities
  • Custom node metadata → maps to operator-defined attributes
  • Hard constraints vs soft affinities → maps to must-have vs prefer
  • Spread across failure domains → maps to the spread block

Key Concepts:

Difficulty: Intermediate
Time estimate: 1 week
Prerequisites: Projects 1-3 completed. Understanding of constraint satisfaction.

Real world outcome:

# Configure nodes with metadata:
# client-1.hcl
client {
  meta {
    "zone"         = "us-east-1a"
    "instance"     = "gpu"
    "storage"      = "ssd"
  }
}

# Job with complex placement requirements:
$ cat ml-training.nomad

job "ml-training" {
  # Hard constraint: MUST have GPU
  constraint {
    attribute = "${meta.instance}"
    value     = "gpu"
  }

  # Soft affinity: PREFER SSD storage (weight: 50%)
  affinity {
    attribute = "${meta.storage}"
    value     = "ssd"
    weight    = 50
  }

  # Spread across availability zones
  spread {
    attribute = "${meta.zone}"
    weight    = 100

    target "us-east-1a" { percent = 33 }
    target "us-east-1b" { percent = 33 }
    target "us-east-1c" { percent = 34 }
  }

  group "training" {
    count = 6

    task "train" {
      driver = "docker"
      config {
        image = "tensorflow/tensorflow:latest-gpu"
      }

      resources {
        # Request GPU device
        device "nvidia/gpu" {
          count = 1
        }
      }
    }
  }
}

$ nomad run ml-training.nomad
$ nomad job status ml-training

Allocations
ID        Node ID   Zone        Status
abc123    node-1    us-east-1a  running   # GPU + SSD
def456    node-4    us-east-1b  running   # GPU + HDD (SSD preferred but not required)
ghi789    node-7    us-east-1c  running   # GPU + SSD
...

# Placement respects:
# ✓ All on GPU nodes (constraint)
# ✓ Prefers SSD nodes (affinity)
# ✓ Spread across zones (spread)

Implementation Hints:

Node metadata configuration:

# On each client node
client {
  enabled = true

  # Custom metadata operators can define
  meta {
    "zone"            = "us-east-1a"
    "rack"            = "rack-42"
    "instance_type"   = "gpu"
    "storage_type"    = "nvme"
    "network_speed"   = "25gbps"
    "cost_tier"       = "expensive"
  }

  # Host volumes for persistent storage
  host_volume "scratch" {
    path      = "/mnt/scratch"
    read_only = false
  }
}

Complex constraint examples:

job "complex-constraints" {
  # Must run on Linux
  constraint {
    attribute = "${attr.kernel.name}"
    value     = "linux"
  }

  # Must have at least 4 CPU cores
  constraint {
    attribute = "${attr.cpu.numcores}"
    operator  = ">="
    value     = "4"
  }

  # Must NOT run on nodes tagged "maintenance"
  constraint {
    attribute = "${meta.status}"
    operator  = "!="
    value     = "maintenance"
  }

  # Must run in specific datacenters
  constraint {
    attribute = "${node.datacenter}"
    operator  = "regexp"
    value     = "us-.*"
  }

  # Prefer nodes with recent kernel
  affinity {
    attribute = "${attr.kernel.version}"
    operator  = "version"
    value     = ">= 5.0"
    weight    = 25
  }

  group "app" {
    # Distinct hosts - never colocate instances
    constraint {
      operator = "distinct_hosts"
      value    = "true"
    }

    task "worker" {
      # ...
    }
  }
}

Learning milestones:

  1. You use built-in attributes → You understand fingerprinting
  2. You define custom metadata → You understand operator-defined placement
  3. You combine constraints and affinities → You understand hard vs soft requirements
  4. You spread across failure domains → You understand availability patterns
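A toy Go version of the feasibility check makes the hard-constraint semantics concrete: any single failed constraint rejects the node outright, while affinities only adjust scores. The operator set and function names here are illustrative and cover only a few of Nomad's operators:

```go
package main

import (
	"fmt"
	"regexp"
)

// matches evaluates one Nomad-style constraint operator against a
// node attribute value. Only =, !=, and regexp are sketched here.
func matches(attrValue, operator, target string) bool {
	switch operator {
	case "=", "":
		return attrValue == target
	case "!=":
		return attrValue != target
	case "regexp":
		ok, _ := regexp.MatchString(target, attrValue)
		return ok
	default:
		return false
	}
}

// feasibleNode applies all constraints; ANY failure rejects the
// node (constraints are hard requirements, unlike weighted
// affinities, which would only lower a node's rank).
func feasibleNode(attrs map[string]string, constraints [][3]string) bool {
	for _, c := range constraints {
		if !matches(attrs[c[0]], c[1], c[2]) {
			return false
		}
	}
	return true
}

func main() {
	node := map[string]string{
		"meta.instance":   "gpu",
		"node.datacenter": "us-east-1",
	}
	constraints := [][3]string{
		{"meta.instance", "=", "gpu"},
		{"node.datacenter", "regexp", "us-.*"},
	}
	fmt.Println("feasible:", feasibleNode(node, constraints))
}
```

Swap the gpu value for cpu and the node drops out of consideration entirely, which is exactly what a blocked evaluation with "constraint filtered" metrics reports.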

Project 5: Consul Service Mesh Integration

  • File: LEARN_NOMAD_DEEP_DIVE.md
  • Main Programming Language: HCL
  • Alternative Programming Languages: Go, Python
  • Coolness Level: Level 3: Genuinely Clever
  • Business Potential: 3. The “Service & Support” Model
  • Difficulty: Level 2: Intermediate
  • Knowledge Area: Service Mesh / mTLS / Envoy
  • Software or Tool: Nomad + Consul + Envoy
  • Main Book: “Service Mesh Patterns” by Sandeep Dinesh

What you’ll build: A microservices application with Consul Connect service mesh, demonstrating automatic mTLS, service intentions (authorization), and transparent proxy networking.

Why it teaches Nomad: Nomad’s integration with Consul Connect shows how the HashiCorp stack works together. You’ll understand sidecar injection, service identity, and how Envoy proxies are configured automatically.

Core challenges you’ll face:

  • Setting up Consul cluster → maps to understanding service discovery
  • Enabling Connect sidecar proxies → maps to Envoy injection
  • Configuring service intentions → maps to service-to-service authorization
  • Network modes (bridge) → maps to Linux network namespaces

Key Concepts:

Difficulty: Intermediate
Time estimate: 1-2 weeks
Prerequisites: Projects 1-4 completed. Basic understanding of TLS and service mesh concepts.

Real world outcome:

# Architecture:
#
#   ┌─────────────────────────────────────────────────────┐
#   │                    Nomad Job                         │
#   │  ┌─────────────────┐      ┌─────────────────┐       │
#   │  │     web         │      │     api         │       │
#   │  │   (frontend)    │ mTLS │   (backend)     │       │
#   │  │  ┌───────────┐  │◄────►│  ┌───────────┐  │       │
#   │  │  │  nginx    │  │      │  │  python   │  │       │
#   │  │  └───────────┘  │      │  └───────────┘  │       │
#   │  │  ┌───────────┐  │      │  ┌───────────┐  │       │
#   │  │  │  envoy    │  │      │  │  envoy    │  │       │
#   │  │  │ (sidecar) │  │      │  │ (sidecar) │  │       │
#   │  │  └───────────┘  │      │  └───────────┘  │       │
#   │  └─────────────────┘      └─────────────────┘       │
#   └─────────────────────────────────────────────────────┘

$ nomad run mesh-app.nomad

# Services register with Consul:
$ consul catalog services
api
api-sidecar-proxy
web
web-sidecar-proxy

# Intentions control access:
$ consul intention check web api
Allowed

$ consul intention check malicious-app api
Denied

# Traffic flows through mTLS:
$ curl http://web.service.consul
Hello from web! Got response from api: {"status": "ok"}

Implementation Hints:

Consul configuration for Connect:

# consul.hcl
connect {
  enabled = true
}

ports {
  grpc = 8502
}

Nomad job with Connect sidecar:

job "mesh-app" {
  datacenters = ["dc1"]

  group "api" {
    count = 2

    network {
      mode = "bridge"  # Required for Connect

      port "http" {
        to = 8080
      }
    }

    service {
      name = "api"
      port = "8080"

      connect {
        sidecar_service {}  # Injects Envoy sidecar
      }
    }

    task "api" {
      driver = "docker"

      config {
        image = "myapp/api:v1"
        ports = ["http"]
      }
    }
  }

  group "web" {
    count = 2

    network {
      mode = "bridge"

      port "http" {
        static = 8080
        to     = 80
      }
    }

    service {
      name = "web"
      port = "80"

      connect {
        sidecar_service {
          proxy {
            # Upstream connection to api service
            upstreams {
              destination_name = "api"
              local_bind_port  = 5000  # web connects to localhost:5000
            }
          }
        }
      }
    }

    task "web" {
      driver = "docker"

      config {
        image = "myapp/web:v1"
        ports = ["http"]
      }

      env {
        API_URL = "http://localhost:5000"  # Goes through Envoy → api
      }
    }
  }
}

Service intentions (Consul):

# Allow web → api
consul intention create web api

# Deny all other traffic to api
consul intention create -deny "*" api

# Or use HCL:
# intentions.hcl
Kind = "service-intentions"
Name = "api"
Sources = [
  {
    Name   = "web"
    Action = "allow"
  },
  {
    Name   = "*"
    Action = "deny"
  }
]

Learning milestones:

  1. Sidecars inject automatically → You understand sidecar injection
  2. Services discover each other → You understand service mesh networking
  3. mTLS encrypts traffic → You understand zero-trust networking
  4. Intentions control access → You understand service authorization
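The intention behavior above can be modeled in a few lines of Go. This is a simplification of Consul's precedence rules (an exact source match beats the "*" wildcard, and unmatched traffic is denied), not its actual implementation:

```go
package main

import "fmt"

// Intention is a simplified service-intentions entry: source
// service, destination service, and whether traffic is allowed.
type Intention struct {
	Source, Destination string
	Allow               bool
}

// allowed mimics intention precedence: an exact source match wins
// over the "*" wildcard; with no match at all we deny (zero trust).
func allowed(intentions []Intention, source, dest string) bool {
	var wildcard *Intention
	for i := range intentions {
		in := &intentions[i]
		if in.Destination != dest {
			continue
		}
		if in.Source == source {
			return in.Allow // exact match: highest precedence
		}
		if in.Source == "*" {
			wildcard = in
		}
	}
	if wildcard != nil {
		return wildcard.Allow
	}
	return false // default deny
}

func main() {
	intentions := []Intention{
		{"web", "api", true}, // consul intention create web api
		{"*", "api", false},  // consul intention create -deny "*" api
	}
	fmt.Println("web -> api:", allowed(intentions, "web", "api"))
	fmt.Println("malicious-app -> api:", allowed(intentions, "malicious-app", "api"))
}
```

This mirrors the `consul intention check` output in the outcome section: web is allowed, everything else falls through to the wildcard deny.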

Project 6: Autoscaler Implementation

  • File: LEARN_NOMAD_DEEP_DIVE.md
  • Main Programming Language: Go
  • Alternative Programming Languages: Python, HCL (for policies)
  • Coolness Level: Level 3: Genuinely Clever
  • Business Potential: 3. The “Service & Support” Model
  • Difficulty: Level 3: Advanced
  • Knowledge Area: Autoscaling / Metrics / Control Loops
  • Software or Tool: Nomad Autoscaler
  • Main Book: “Site Reliability Engineering” by Google

What you’ll build: An autoscaling system that adjusts job counts based on metrics (CPU, memory, queue depth, custom metrics), implementing both horizontal pod autoscaling and cluster autoscaling.

Why it teaches Nomad: Autoscaling requires understanding how to monitor allocations, interact with the Nomad API to scale jobs, and implement control loops that don’t oscillate. This teaches the operational side of Nomad.

Core challenges you’ll face:

  • Collecting metrics → maps to Prometheus integration or Nomad metrics
  • Implementing scaling policies → maps to target tracking algorithms
  • Avoiding thrashing → maps to cooldown periods and stabilization
  • Cluster autoscaling → maps to adding/removing nodes

Key Concepts:

Difficulty: Advanced
Time estimate: 2 weeks
Prerequisites: Projects 1-5 completed. Understanding of control systems basics.

Real world outcome:

# Your autoscaler in action:

# Initial state:
$ nomad job status api-service
Task Group  Queued  Starting  Running  Failed  Complete  Lost
api         0       0         2        0       0         0

# Load increases, autoscaler detects high CPU:
$ watch nomad-autoscaler agent -config=config.hcl

[INFO] policy: scaling api from 2 to 5: cpu_utilization=85%, target=70%
[INFO] scale: successfully scaled api to 5 allocations

$ nomad job status api-service
Task Group  Queued  Starting  Running  Failed  Complete  Lost
api         0       1         5        0       0         0

# Load decreases:
[INFO] policy: stabilization window not passed, holding at 5
[INFO] policy: scaling api from 5 to 3: cpu_utilization=35%, target=70%
[INFO] scale: successfully scaled api to 3 allocations

# Cluster autoscaling (add nodes when all are at capacity):
[INFO] cluster: no feasible nodes for pending allocations
[INFO] cluster: launching 2 new instances via AWS ASG
[INFO] cluster: new nodes joined: node-5, node-6

Implementation Hints:

Autoscaler configuration:

# autoscaler.hcl
nomad {
  address = "http://localhost:4646"
}

apm "prometheus" {
  driver = "prometheus"
  config = {
    address = "http://prometheus:9090"
  }
}

target "nomad" {
  driver = "nomad"
}

policy {
  default_cooldown            = "2m"
  default_evaluation_interval = "30s"
}

Scaling policy in job spec:

job "api-service" {
  group "api" {
    count = 2

    scaling {
      enabled = true
      min     = 2
      max     = 20

      policy {
        # Target 70% average CPU
        check "cpu" {
          source = "prometheus"
          query  = "avg(nomad_client_allocs_cpu_allocated{task_group='api'})"

          strategy "target-value" {
            target = 70
          }
        }

        # Also scale on queue depth
        check "queue_depth" {
          source = "prometheus"
          query  = "sum(rabbitmq_queue_messages{queue='tasks'})"

          strategy "target-value" {
            target = 100  # 100 messages per instance
          }
        }
      }
    }

    task "api" {
      # ...
    }
  }
}

Building your own autoscaler (simplified):

package main

import (
    "math"
    "time"

    "github.com/hashicorp/nomad/api"
)

type Autoscaler struct {
    client     *api.Client
    targetCPU  float64
    minCount   int
    maxCount   int
    cooldown   time.Duration
    lastScaled time.Time
}

func (a *Autoscaler) Run() {
    ticker := time.NewTicker(30 * time.Second)
    defer ticker.Stop()

    for range ticker.C {
        if time.Since(a.lastScaled) < a.cooldown {
            continue // Still in cooldown
        }

        currentCPU := a.getAverageCPU()
        currentCount := a.getCurrentCount()

        // Round up so a small overload still triggers a scale-up
        desiredCount := int(math.Ceil(float64(currentCount) * (currentCPU / a.targetCPU)))
        desiredCount = clamp(desiredCount, a.minCount, a.maxCount)

        if desiredCount != currentCount {
            a.scale(desiredCount)
            a.lastScaled = time.Now()
        }
    }
}

func (a *Autoscaler) scale(count int) error {
    job, _, err := a.client.Jobs().Info("api-service", nil)
    if err != nil {
        return err
    }

    // Update the task group count
    for _, tg := range job.TaskGroups {
        if *tg.Name == "api" {
            tg.Count = &count
        }
    }

    // Re-registering the job makes Nomad create an evaluation
    _, _, err = a.client.Jobs().Register(job, nil)
    return err
}

func clamp(v, min, max int) int {
    if v < min {
        return min
    }
    if v > max {
        return max
    }
    return v
}

Learning milestones:

  1. You collect metrics from allocations → You understand Nomad telemetry
  2. You implement target-value scaling → You understand control loops
  3. You handle cooldown correctly → You understand stabilization
  4. You add cluster autoscaling → You understand infrastructure automation

Project 7: Custom Task Driver Plugin

  • File: LEARN_NOMAD_DEEP_DIVE.md
  • Main Programming Language: Go
  • Alternative Programming Languages: N/A (must be Go for plugins)
  • Coolness Level: Level 4: Hardcore Tech Flex
  • Business Potential: 4. The “Open Core” Infrastructure
  • Difficulty: Level 3: Advanced
  • Knowledge Area: Plugin System / Task Isolation / Go-Plugin
  • Software or Tool: Nomad Plugin SDK
  • Main Book: “Go Programming Language” by Donovan & Kernighan

What you’ll build: A custom task driver that runs workloads in a way not supported by built-in drivers—for example, running WASM modules, systemd units, Firecracker microVMs, or Podman containers.

Why it teaches Nomad: Task drivers are how Nomad executes workloads. Building one teaches you the plugin architecture, how Nomad tracks task state, handles resource isolation, and manages the task lifecycle.

Core challenges you’ll face:

  • Implementing the DriverPlugin interface → maps to understanding the plugin contract
  • Fingerprinting node capabilities → maps to what the driver can do
  • Managing task state → maps to starting, stopping, recovering tasks
  • Resource isolation → maps to cgroups, namespaces, security

Key Concepts:

Difficulty: Advanced. Time estimate: 2-3 weeks. Prerequisites: Projects 1-6 completed; strong Go proficiency; understanding of process isolation.

Real world outcome:

# Your custom WASM driver in action:

$ cat wasm-job.nomad
job "wasm-app" {
  group "app" {
    task "handler" {
      driver = "wasm"  # Your custom driver!

      config {
        module = "https://example.com/handler.wasm"
        runtime = "wasmtime"

        # WASI capabilities
        allow_fs = false
        allow_net = true
      }

      resources {
        cpu    = 100
        memory = 128
      }
    }
  }
}

$ nomad run wasm-job.nomad
$ nomad alloc logs abc123
[wasm-driver] Loading module from https://example.com/handler.wasm
[wasm-driver] Runtime: wasmtime v14.0.0
[wasm-driver] Instantiating module...
[handler] Listening on :8080
[handler] Received request: GET /

Implementation Hints:

Basic driver structure:

package main

import (
    "context"
    "time"

    "github.com/hashicorp/go-hclog"
    "github.com/hashicorp/nomad/drivers/shared/eventer"
    "github.com/hashicorp/nomad/plugins/base"
    "github.com/hashicorp/nomad/plugins/drivers"
    "github.com/hashicorp/nomad/plugins/shared/hclspec"
    pstructs "github.com/hashicorp/nomad/plugins/shared/structs"
)

const pluginName = "wasm"

// WasmDriver implements drivers.DriverPlugin
type WasmDriver struct {
    eventer *eventer.Eventer
    config  *Config
    logger  hclog.Logger

    // Track running tasks
    tasks *taskStore
}

// PluginInfo returns information about the plugin
func (d *WasmDriver) PluginInfo() (*base.PluginInfoResponse, error) {
    return &base.PluginInfoResponse{
        Type:              base.PluginTypeDriver,
        PluginApiVersions: []string{drivers.ApiVersion010},
        PluginVersion:     "0.1.0",
        Name:              pluginName,
    }, nil
}

// ConfigSchema returns the schema for driver config
func (d *WasmDriver) ConfigSchema() (*hclspec.Spec, error) {
    return hclspec.NewObject(map[string]*hclspec.Spec{
        "runtime_path": hclspec.NewDefault(
            hclspec.NewAttr("runtime_path", "string", false),
            hclspec.NewLiteral(`"/usr/local/bin/wasmtime"`),
        ),
    }), nil
}

// TaskConfigSchema returns the schema for task config
func (d *WasmDriver) TaskConfigSchema() (*hclspec.Spec, error) {
    return hclspec.NewObject(map[string]*hclspec.Spec{
        "module":    hclspec.NewAttr("module", "string", true),
        "runtime":   hclspec.NewAttr("runtime", "string", false),
        "allow_fs":  hclspec.NewAttr("allow_fs", "bool", false),
        "allow_net": hclspec.NewAttr("allow_net", "bool", false),
    }), nil
}

// Fingerprint returns node capabilities
func (d *WasmDriver) Fingerprint(ctx context.Context) (<-chan *drivers.Fingerprint, error) {
    ch := make(chan *drivers.Fingerprint)

    go func() {
        defer close(ch)

        fp := &drivers.Fingerprint{
            Attributes: map[string]*pstructs.Attribute{
                "driver.wasm.version": pstructs.NewStringAttribute("0.1.0"),
            },
            Health:            drivers.HealthStateHealthy,
            HealthDescription: "WASM runtime available",
        }

        // Check if wasmtime is installed
        if !d.isRuntimeAvailable() {
            fp.Health = drivers.HealthStateUndetected
            fp.HealthDescription = "wasmtime not found"
        }

        ch <- fp
    }()

    return ch, nil
}

// StartTask launches the WASM module
func (d *WasmDriver) StartTask(cfg *drivers.TaskConfig) (*drivers.TaskHandle, *drivers.DriverNetwork, error) {
    var taskConfig TaskConfig
    if err := cfg.DecodeDriverConfig(&taskConfig); err != nil {
        return nil, nil, err
    }

    d.logger.Info("starting wasm task", "module", taskConfig.Module)

    // Download the module and start the wasmtime process. Store the
    // internal handle in d.tasks so StopTask/WaitTask can find it later.
    handle, err := d.startWasmProcess(cfg, &taskConfig)
    if err != nil {
        return nil, nil, err
    }

    return handle, nil, nil
}

// StopTask stops a running task
func (d *WasmDriver) StopTask(taskID string, timeout time.Duration, signal string) error {
    handle, ok := d.tasks.Get(taskID)
    if !ok {
        return drivers.ErrTaskNotFound
    }

    // Send signal to process
    return handle.Kill(signal, timeout)
}

// WaitTask blocks until task exits
func (d *WasmDriver) WaitTask(ctx context.Context, taskID string) (<-chan *drivers.ExitResult, error) {
    handle, ok := d.tasks.Get(taskID)
    if !ok {
        return nil, drivers.ErrTaskNotFound
    }

    return handle.WaitCh(), nil
}

Main function with plugin serving:

func main() {
    // plugins is "github.com/hashicorp/nomad/plugins"; Serve performs
    // the go-plugin handshake and serves the driver to the Nomad client
    plugins.Serve(factory)
}

func factory(log hclog.Logger) interface{} {
    return &WasmDriver{
        eventer: eventer.NewEventer(context.Background(), log),
        logger:  log.Named(pluginName),
        tasks:   newTaskStore(),
    }
}

Learning milestones:

  1. You implement Fingerprint → You understand capability detection
  2. You implement StartTask → You understand task lifecycle
  3. You handle recovery after restarts → You understand driver state
  4. Tasks run with isolation → You understand resource control

Project 8: Raft Consensus Simulator

  • File: LEARN_NOMAD_DEEP_DIVE.md
  • Main Programming Language: Go
  • Alternative Programming Languages: Python, Rust
  • Coolness Level: Level 4: Hardcore Tech Flex
  • Business Potential: 1. The “Resume Gold”
  • Difficulty: Level 4: Expert
  • Knowledge Area: Distributed Systems / Consensus / Raft
  • Software or Tool: Custom implementation
  • Main Book: “Designing Data-Intensive Applications” by Martin Kleppmann

What you’ll build: A visual Raft consensus simulator that demonstrates leader election, log replication, and fault tolerance—showing exactly how Nomad servers coordinate.

Why it teaches Nomad: Nomad uses Raft for all state management. Understanding Raft means understanding how Nomad maintains consistency, handles leader failures, and ensures that scheduling decisions are durable.

Core challenges you’ll face:

  • Implementing leader election → maps to terms, votes, timeouts
  • Log replication → maps to AppendEntries RPC
  • Handling network partitions → maps to split-brain scenarios
  • State machine application → maps to Nomad’s FSM

Key Concepts:

Difficulty: Expert. Time estimate: 3-4 weeks. Prerequisites: Projects 1-7 completed; strong understanding of distributed systems.

Real world outcome:

┌─────────────────────────────────────────────────────────────────┐
│ Raft Consensus Simulator                                         │
├─────────────────────────────────────────────────────────────────┤
│                                                                  │
│  Term: 5                                                         │
│                                                                  │
│  ┌─────────────┐    ┌─────────────┐    ┌─────────────┐          │
│  │   Node 1    │    │   Node 2    │    │   Node 3    │          │
│  │   LEADER    │    │  FOLLOWER   │    │  FOLLOWER   │          │
│  │  Term: 5    │    │  Term: 5    │    │  Term: 5    │          │
│  │             │    │             │    │             │          │
│  │ Log:        │    │ Log:        │    │ Log:        │          │
│  │ [1] x=1 ✓   │    │ [1] x=1 ✓   │    │ [1] x=1 ✓   │          │
│  │ [2] y=2 ✓   │    │ [2] y=2 ✓   │    │ [2] y=2 ✓   │          │
│  │ [3] z=3     │◄──►│ [3] z=3     │◄──►│ [3] z=3     │          │
│  │    ↑ commit │    │             │    │             │          │
│  └─────────────┘    └─────────────┘    └─────────────┘          │
│                                                                  │
│  ┌─ Event Log ─────────────────────────────────────────────────┐│
│  │ 10:00:00.100 Node1: Became leader for term 5                ││
│  │ 10:00:00.150 Node1: Received client request: SET z=3        ││
│  │ 10:00:00.200 Node1→Node2: AppendEntries(term=5, entry=z=3)  ││
│  │ 10:00:00.200 Node1→Node3: AppendEntries(term=5, entry=z=3)  ││
│  │ 10:00:00.250 Node2→Node1: AppendEntriesResp(success=true)   ││
│  │ 10:00:00.260 Node3→Node1: AppendEntriesResp(success=true)   ││
│  │ 10:00:00.300 Node1: Entry z=3 committed (majority ack)      ││
│  │ 10:00:00.350 Node1: Applied z=3 to state machine            ││
│  └─────────────────────────────────────────────────────────────┘│
│                                                                  │
│  [Partition Node 2] [Kill Leader] [Add Entry] [Step]            │
│                                                                  │
└─────────────────────────────────────────────────────────────────┘

Implementation Hints:

Raft node state:

type NodeState int

const (
    Follower NodeState = iota
    Candidate
    Leader
)

type RaftNode struct {
    id    int
    state NodeState
    term  int

    // Persistent state (survives restarts)
    log      []LogEntry
    votedFor int

    // Volatile state
    commitIndex int
    lastApplied int

    // Candidate state
    votesReceived int

    // Leader state (reinitialized after election)
    nextIndex  map[int]int
    matchIndex map[int]int

    // Timing
    electionTimeout time.Duration
    lastHeartbeat   time.Time

    // Communication
    peers  []int
    inbox  chan Message
    outbox chan Message
}

type LogEntry struct {
    Term    int
    Index   int
    Command interface{}
}

type Message struct {
    From    int
    To      int
    Type    MessageType
    Term    int
    Payload interface{}
}

Leader election:

func (n *RaftNode) becomeCandidate() {
    n.state = Candidate
    n.term++
    n.votedFor = n.id
    n.votesReceived = 1 // Vote for self

    // Guard against an empty log
    lastLogIndex := len(n.log) - 1
    lastLogTerm := 0
    if lastLogIndex >= 0 {
        lastLogTerm = n.log[lastLogIndex].Term
    }

    // Request votes from all peers
    for _, peer := range n.peers {
        n.send(Message{
            From: n.id,
            To:   peer,
            Type: RequestVote,
            Term: n.term,
            Payload: RequestVoteArgs{
                CandidateID:  n.id,
                LastLogIndex: lastLogIndex,
                LastLogTerm:  lastLogTerm,
            },
        })
    }

    n.resetElectionTimer()
}

func (n *RaftNode) handleRequestVote(msg Message) {
    args := msg.Payload.(RequestVoteArgs)

    // Deny if the candidate's term is stale
    if args.Term < n.term {
        n.sendVoteReply(msg.From, false)
        return
    }

    // A newer term converts us to follower and clears our vote
    if args.Term > n.term {
        n.term = args.Term
        n.state = Follower
        n.votedFor = -1
    }

    // Grant only if we haven't voted (or voted for this candidate)
    // and the candidate's log is at least as up-to-date as ours
    if n.votedFor == -1 || n.votedFor == args.CandidateID {
        if n.isLogUpToDate(args.LastLogIndex, args.LastLogTerm) {
            n.votedFor = args.CandidateID
            n.sendVoteReply(msg.From, true)
            n.resetElectionTimer()
            return
        }
    }

    n.sendVoteReply(msg.From, false)
}

Log replication:

func (n *RaftNode) appendEntry(command interface{}) {
    if n.state != Leader {
        return
    }

    entry := LogEntry{
        Term:    n.term,
        Index:   len(n.log),
        Command: command,
    }
    n.log = append(n.log, entry)

    // Replicate to followers
    n.replicateLog()
}

func (n *RaftNode) replicateLog() {
    for _, peer := range n.peers {
        nextIdx := n.nextIndex[peer]
        entries := n.log[nextIdx:]

        // Term of the entry preceding the new ones (0 at log start)
        prevLogTerm := 0
        if nextIdx > 0 {
            prevLogTerm = n.log[nextIdx-1].Term
        }

        n.send(Message{
            From: n.id,
            To:   peer,
            Type: AppendEntries,
            Term: n.term,
            Payload: AppendEntriesArgs{
                LeaderID:     n.id,
                PrevLogIndex: nextIdx - 1,
                PrevLogTerm:  prevLogTerm,
                Entries:      entries,
                LeaderCommit: n.commitIndex,
            },
        })
    }
}

Learning milestones:

  1. Leader election works → You understand Raft elections
  2. Logs replicate correctly → You understand AppendEntries
  3. Network partitions handled → You understand split-brain prevention
  4. State machine applies → You understand FSM in consensus

Project 9: Gossip Protocol Implementation

  • File: LEARN_NOMAD_DEEP_DIVE.md
  • Main Programming Language: Go
  • Alternative Programming Languages: Python, Rust
  • Coolness Level: Level 4: Hardcore Tech Flex
  • Business Potential: 1. The “Resume Gold”
  • Difficulty: Level 3: Advanced
  • Knowledge Area: Distributed Systems / Membership / SWIM
  • Software or Tool: Custom implementation (inspired by Serf)
  • Main Book: “Designing Data-Intensive Applications” by Martin Kleppmann

What you’ll build: A gossip-based membership protocol implementing SWIM (Scalable Weakly-consistent Infection-style Process Group Membership), which is what Nomad uses for cluster discovery.

Why it teaches Nomad: Before Raft can elect a leader, servers need to discover each other. Serf’s gossip protocol handles this. Understanding gossip means understanding how Nomad clusters form and how failure detection works.

Core challenges you’ll face:

  • Implementing gossip dissemination → maps to rumor spreading
  • Failure detection → maps to ping, ping-req, suspect, dead
  • Handling network partitions → maps to partition healing
  • Scalability → maps to O(log N) message complexity

Key Concepts:

Difficulty: Advanced. Time estimate: 2-3 weeks. Prerequisites: Projects 1-8 completed; understanding of networking and UDP.

Real world outcome:

┌─────────────────────────────────────────────────────────────────┐
│ Gossip Protocol Simulator                                        │
├─────────────────────────────────────────────────────────────────┤
│                                                                  │
│  Cluster Members:                                                │
│  ┌────────────────────────────────────────────────────────────┐ │
│  │ Node    │ State   │ Last Seen │ Incarnation │ Messages     │ │
│  ├─────────┼─────────┼───────────┼─────────────┼──────────────┤ │
│  │ node-1  │ ALIVE   │ 100ms     │ 5           │ ████████     │ │
│  │ node-2  │ ALIVE   │ 50ms      │ 3           │ ██████       │ │
│  │ node-3  │ SUSPECT │ 2.5s      │ 7           │ ██           │ │
│  │ node-4  │ ALIVE   │ 200ms     │ 2           │ ████         │ │
│  │ node-5  │ DEAD    │ 30s       │ 1           │              │ │
│  └────────────────────────────────────────────────────────────┘ │
│                                                                  │
│  Message Flow:                                                   │
│  ┌────────────────────────────────────────────────────────────┐ │
│  │ 10:00:00.100 node-1 → node-2: PING                         │ │
│  │ 10:00:00.110 node-2 → node-1: ACK                          │ │
│  │ 10:00:00.200 node-1 → node-4: PING (piggyback: node-3 sus) │ │
│  │ 10:00:00.210 node-4 → node-1: ACK                          │ │
│  │ 10:00:00.300 node-2 → node-3: PING                         │ │
│  │ 10:00:00.800 node-2: No ACK from node-3, starting PING-REQ │ │
│  │ 10:00:00.810 node-2 → node-4: PING-REQ(node-3)             │ │
│  │ 10:00:00.820 node-4 → node-3: PING                         │ │
│  │ 10:00:01.320 node-4: No ACK from node-3                    │ │
│  │ 10:00:01.330 node-4 → node-2: NACK(node-3)                 │ │
│  │ 10:00:01.340 node-2: Marking node-3 as SUSPECT             │ │
│  └────────────────────────────────────────────────────────────┘ │
│                                                                  │
│  [Kill Node] [Partition] [Rejoin] [Show Convergence]            │
│                                                                  │
└─────────────────────────────────────────────────────────────────┘

Implementation Hints:

SWIM node structure:

type NodeState int

const (
    Alive NodeState = iota
    Suspect
    Dead
)

type Member struct {
    ID          string
    Addr        string
    State       NodeState
    Incarnation int
    LastUpdate  time.Time
}

type GossipNode struct {
    self    Member
    members map[string]*Member

    // SWIM parameters
    probeInterval  time.Duration
    probeTimeout   time.Duration
    suspectTimeout time.Duration
    indirectNodes  int

    // Responses routed here by the UDP read loop
    ackChan         chan AckMessage
    pingReqRespChan chan PingReqResponse

    // Gossip queue for piggybacking
    gossipQueue []GossipMessage

    // UDP socket
    conn *net.UDPConn
}

type GossipMessage struct {
    Type        MessageType
    Member      Member
    Incarnation int
}

type MessageType int

// The state-bearing message types carry a Msg prefix to avoid
// clashing with the NodeState constants (Alive, Suspect, Dead) above
const (
    Ping MessageType = iota
    Ack
    PingReq
    MsgAlive
    MsgSuspect
    MsgDead
)

SWIM probe protocol:

func (n *GossipNode) runProbeLoop() {
    ticker := time.NewTicker(n.probeInterval)

    for range ticker.C {
        // Select random member to probe
        target := n.randomMember()
        if target == nil {
            continue
        }

        // Direct probe
        ack := n.probe(target)
        if ack {
            continue
        }

        // No direct ACK - try indirect probes
        indirectAck := n.indirectProbe(target)
        if indirectAck {
            continue
        }

        // No indirect ACK - mark as suspect
        n.suspect(target)
    }
}

func (n *GossipNode) probe(target *Member) bool {
    // Send PING with piggybacked gossip
    msg := n.buildPingMessage(target.Addr)
    n.send(msg)

    // Wait for ACK
    select {
    case ack := <-n.ackChan:
        if ack.From == target.ID {
            return true
        }
    case <-time.After(n.probeTimeout):
        return false
    }
    return false
}

func (n *GossipNode) indirectProbe(target *Member) bool {
    // Select k random members to probe on our behalf
    helpers := n.randomMembers(n.indirectNodes)

    for _, helper := range helpers {
        n.send(PingReqMessage{
            To:     helper.Addr,
            Target: target.Addr,
        })
    }

    // Wait for any positive response
    timeout := time.After(n.probeTimeout * 2)
    for i := 0; i < len(helpers); i++ {
        select {
        case resp := <-n.pingReqRespChan:
            if resp.Target == target.ID && resp.Ack {
                return true
            }
        case <-timeout:
            return false
        }
    }
    return false
}

Gossip dissemination:

func (n *GossipNode) gossip(msg GossipMessage) {
    // Add to gossip queue
    n.gossipQueue = append(n.gossipQueue, msg)

    // Piggyback on next probe messages
    // SWIM piggybacks gossip on PING/ACK for efficiency
}

func (n *GossipNode) handleAlive(msg GossipMessage) {
    member, exists := n.members[msg.Member.ID]

    if !exists {
        // New member joined
        n.members[msg.Member.ID] = &msg.Member
        n.gossip(msg)
        return
    }

    // Only update if incarnation is higher
    if msg.Incarnation > member.Incarnation {
        member.State = Alive
        member.Incarnation = msg.Incarnation
        n.gossip(msg)
    }
}

func (n *GossipNode) refute() {
    // When we hear we're suspected, refute with a higher incarnation
    n.self.Incarnation++
    n.gossip(GossipMessage{
        Type:        MsgAlive,
        Member:      n.self,
        Incarnation: n.self.Incarnation,
    })
}

Learning milestones:

  1. Members discover each other → You understand gossip joining
  2. Failures are detected → You understand SWIM probing
  3. Suspicion and death work → You understand failure state machine
  4. Gossip converges quickly → You understand epidemic dissemination

Project 10: Mini-Scheduler Implementation

  • File: LEARN_NOMAD_DEEP_DIVE.md
  • Main Programming Language: Go
  • Alternative Programming Languages: Rust, Python
  • Coolness Level: Level 5: Pure Magic (Super Cool)
  • Business Potential: 4. The “Open Core” Infrastructure
  • Difficulty: Level 4: Expert
  • Knowledge Area: Scheduling / Bin Packing / Constraint Satisfaction
  • Software or Tool: Custom implementation
  • Main Book: “Scheduling: Theory, Algorithms, and Systems” by Michael Pinedo

What you’ll build: Your own job scheduler from scratch that handles job registration, evaluation creation, feasibility checking, bin-packing placement, and plan application—essentially a mini-Nomad scheduler.

Why it teaches Nomad: The scheduler is the heart of Nomad. Building one teaches you constraint satisfaction, bin packing algorithms, optimistic concurrency, and the evaluation-plan-allocation pipeline.

Core challenges you’ll face:

  • Evaluation pipeline → maps to how scheduling work is queued
  • Feasibility checking → maps to constraint filtering
  • Bin packing algorithm → maps to node scoring and ranking
  • Plan queue and conflict resolution → maps to optimistic concurrency

Key Concepts:

Difficulty: Expert. Time estimate: 1 month. Prerequisites: Projects 1-9 completed; strong algorithm skills.

Real world outcome:

┌─────────────────────────────────────────────────────────────────┐
│ Mini-Scheduler                                                   │
├─────────────────────────────────────────────────────────────────┤
│                                                                  │
│ Job: api-service                                                 │
│ Task Group: api (count: 5)                                       │
│ Requirements: cpu=500, memory=1024, constraint[instance=large]   │
│                                                                  │
│ ┌─ Scheduling Trace ──────────────────────────────────────────┐ │
│ │                                                              │ │
│ │ Phase 1: Feasibility                                         │ │
│ │   node-1: PASS (large, cpu=2000 avail, mem=4096 avail)       │ │
│ │   node-2: PASS (large, cpu=1500 avail, mem=3072 avail)       │ │
│ │   node-3: FAIL (small - constraint mismatch)                 │ │
│ │   node-4: PASS (large, cpu=2000 avail, mem=2048 avail)       │ │
│ │   node-5: FAIL (insufficient memory: 512 avail)              │ │
│ │                                                              │ │
│ │ Phase 2: Ranking (Bin Pack + Anti-Affinity)                  │ │
│ │   node-1: binpack=0.75, anti_affinity=1.00, score=0.875     │ │
│ │   node-2: binpack=0.85, anti_affinity=1.00, score=0.925     │ │
│ │   node-4: binpack=0.60, anti_affinity=1.00, score=0.800     │ │
│ │                                                              │ │
│ │ Phase 3: Placement                                           │ │
│ │   Allocation 1 → node-2 (highest score)                      │ │
│ │   Allocation 2 → node-1 (anti-affinity reduces node-2)       │ │
│ │   Allocation 3 → node-4 (spreading load)                     │ │
│ │   Allocation 4 → node-2 (resources still fit)                │ │
│ │   Allocation 5 → node-1 (resources still fit)                │ │
│ │                                                              │ │
│ │ Phase 4: Plan Applied Successfully                           │ │
│ │   5 allocations created                                      │ │
│ │                                                              │ │
│ └──────────────────────────────────────────────────────────────┘ │
│                                                                  │
└─────────────────────────────────────────────────────────────────┘

Implementation Hints:

Core data structures:

type Job struct {
    ID         string
    Name       string
    Type       string // service, batch, system
    Priority   int
    TaskGroups []*TaskGroup
}

type TaskGroup struct {
    Name        string
    Count       int
    Constraints []*Constraint
    Tasks       []*Task
}

type Task struct {
    Name      string
    Driver    string
    Resources *Resources
}

type Resources struct {
    CPU    int // MHz
    Memory int // MB
}

type Constraint struct {
    Attribute string
    Operator  string // =, !=, >, <, in, not_in
    Value     string
}

type Node struct {
    ID         string
    Datacenter string
    Attributes map[string]string
    Resources  *Resources
    Allocated  *Resources
}

type Allocation struct {
    ID           string
    JobID        string
    TaskGroup    string
    NodeID       string
    DesiredState string
    Resources    *Resources
}

type Evaluation struct {
    ID          string
    JobID       string
    TriggeredBy string
    Status      string
}

Scheduler implementation:

type Scheduler struct {
    jobs        map[string]*Job
    nodes       map[string]*Node
    allocations map[string]*Allocation
    evalBroker  *EvalBroker
    planQueue   *PlanQueue
}

func (s *Scheduler) Schedule(eval *Evaluation) (*Plan, error) {
    job, ok := s.jobs[eval.JobID]
    if !ok {
        return nil, fmt.Errorf("unknown job %q", eval.JobID)
    }

    plan := &Plan{
        EvalID:      eval.ID,
        Allocations: make([]*Allocation, 0),
    }

    for _, tg := range job.TaskGroups {
        // Current allocations for this task group
        current := s.allocationsFor(job.ID, tg.Name)
        desired := tg.Count

        // Need to place (desired - current) new allocations
        toPlace := desired - len(current)

        for i := 0; i < toPlace; i++ {
            // Phase 1: Feasibility
            feasible := s.feasibleNodes(tg)

            if len(feasible) == 0 {
                // No feasible nodes - evaluation blocked
                return nil, ErrNoFeasibleNodes
            }

            // Phase 2: Ranking
            ranked := s.rankNodes(feasible, tg, plan)

            // Phase 3: Place on best node
            best := ranked[0]
            alloc := s.createAllocation(job, tg, best.Node)

            plan.Allocations = append(plan.Allocations, alloc)

            // Update our view for subsequent placements
            s.reserveResources(best.Node, tg)
        }
    }

    return plan, nil
}

func (s *Scheduler) feasibleNodes(tg *TaskGroup) []*Node {
    feasible := make([]*Node, 0)

    for _, node := range s.nodes {
        if s.checkConstraints(node, tg.Constraints) &&
           s.checkResources(node, tg) {
            feasible = append(feasible, node)
        }
    }

    return feasible
}

func (s *Scheduler) checkConstraints(node *Node, constraints []*Constraint) bool {
    for _, c := range constraints {
        value := node.Attributes[c.Attribute]

        switch c.Operator {
        case "=":
            if value != c.Value {
                return false
            }
        case "!=":
            if value == c.Value {
                return false
            }
        // ... more operators
        }
    }
    return true
}

Bin packing scorer:

type ScoredNode struct {
    Node  *Node
    Score float64
}

func (s *Scheduler) rankNodes(nodes []*Node, tg *TaskGroup, plan *Plan) []*ScoredNode {
    scored := make([]*ScoredNode, 0, len(nodes))

    for _, node := range nodes {
        score := s.calculateScore(node, tg, plan)
        scored = append(scored, &ScoredNode{Node: node, Score: score})
    }

    // Sort by score descending
    sort.Slice(scored, func(i, j int) bool {
        return scored[i].Score > scored[j].Score
    })

    return scored
}

func (s *Scheduler) calculateScore(node *Node, tg *TaskGroup, plan *Plan) float64 {
    // Bin packing score: prefer nodes with less available resources
    // (pack jobs tightly together)
    binpackScore := s.binpackScore(node, tg)

    // Anti-affinity score: prefer nodes without other allocations
    // of the same job (spread instances out)
    antiAffinityScore := s.antiAffinityScore(node, tg, plan)

    // Combine scores
    return (binpackScore * 0.5) + (antiAffinityScore * 0.5)
}

func (s *Scheduler) binpackScore(node *Node, tg *TaskGroup) float64 {
    // Sum resources across all tasks in the group
    var cpu, mem int
    for _, t := range tg.Tasks {
        cpu += t.Resources.CPU
        mem += t.Resources.Memory
    }

    // Fraction of the node's resources in use after placement
    cpuAfter := float64(node.Allocated.CPU+cpu) / float64(node.Resources.CPU)
    memAfter := float64(node.Allocated.Memory+mem) / float64(node.Resources.Memory)

    // Higher utilization = higher score (bin packing)
    return (cpuAfter + memAfter) / 2.0
}

Learning milestones:

  1. Feasibility filtering works → You understand constraint checking
  2. Bin packing places efficiently → You understand scoring algorithms
  3. Anti-affinity spreads allocations → You understand fault tolerance
  4. Plan queue handles conflicts → You understand optimistic concurrency

Project 11: Multi-Region Federation Lab

  • File: LEARN_NOMAD_DEEP_DIVE.md
  • Main Programming Language: HCL + Terraform
  • Alternative Programming Languages: Bash, Go
  • Coolness Level: Level 3: Genuinely Clever
  • Business Potential: 3. The “Service & Support” Model
  • Difficulty: Level 3: Advanced
  • Knowledge Area: Multi-Region / WAN Federation / Disaster Recovery
  • Software or Tool: Nomad + Consul + Terraform
  • Main Book: “Site Reliability Engineering” by Google

What you’ll build: A multi-region Nomad deployment with federated clusters, cross-region job deployment, and disaster recovery failover—simulating a real global infrastructure.

Why it teaches Nomad: Production Nomad often spans multiple regions. Understanding federation teaches you WAN gossip, authoritative regions, cross-region scheduling, and how to design for geographic redundancy.

Core challenges you’ll face:

  • Setting up WAN federation → maps to gossip across regions
  • Cross-region job deployment → maps to multiregion job specification
  • Authoritative region for ACLs → maps to single source of truth
  • Failover and recovery → maps to region outage handling

Key Concepts:

Difficulty: Advanced. Time estimate: 2 weeks. Prerequisites: Projects 1-10 completed; Terraform experience helpful; cloud account (AWS/GCP/Azure) or sufficient local resources.

Real world outcome:

# Two regions: us-east and eu-west

$ nomad server members -region=us-east
Name                 Address     Port  Status  Leader  Region
server-1.us-east.dc1 10.0.1.11   4648  alive   true    us-east
server-2.us-east.dc1 10.0.1.12   4648  alive   false   us-east
server-3.us-east.dc1 10.0.1.13   4648  alive   false   us-east

$ nomad server members -region=eu-west
Name                 Address     Port  Status  Leader  Region
server-1.eu-west.dc1 10.1.1.11   4648  alive   true    eu-west
server-2.eu-west.dc1 10.1.1.12   4648  alive   false   eu-west
server-3.eu-west.dc1 10.1.1.13   4648  alive   false   eu-west

# Federated - servers know about each other:
$ nomad server members -region=us-east
Name                 Address     Port  Status  Leader  Region
server-1.us-east.dc1 10.0.1.11   4648  alive   true    us-east
server-2.us-east.dc1 10.0.1.12   4648  alive   false   us-east
server-3.us-east.dc1 10.0.1.13   4648  alive   false   us-east
server-1.eu-west.dc1 10.1.1.11   4648  alive   true    eu-west (WAN)
server-2.eu-west.dc1 10.1.1.12   4648  alive   false   eu-west (WAN)
server-3.eu-west.dc1 10.1.1.13   4648  alive   false   eu-west (WAN)

# Deploy multiregion job:
$ nomad run global-api.nomad
==> Monitoring multiregion deployment...
    Region "us-east": 3/3 allocations healthy
    Region "eu-west": 3/3 allocations healthy
    Multiregion deployment successful!

Implementation Hints:

Multiregion job specification (note: the multiregion block requires Nomad Enterprise; with open-source Nomad you can approximate it by submitting the job to each region separately with -region):

job "global-api" {
  type = "service"

  multiregion {
    strategy {
      max_parallel = 1
      on_failure   = "fail_all"
    }

    region "us-east" {
      count       = 3
      datacenters = ["us-east-1a", "us-east-1b"]
    }

    region "eu-west" {
      count       = 3
      datacenters = ["eu-west-1a", "eu-west-1b"]
    }
  }

  group "api" {
    # count is set per-region above

    network {
      port "http" { to = 8080 }
    }

    task "api" {
      driver = "docker"

      config {
        image = "myapp/api:v2"
        ports = ["http"]
      }

      env {
        REGION = "${NOMAD_REGION}"
        DC     = "${NOMAD_DC}"
      }
    }
  }
}
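The strategy block above governs how regions roll out: max_parallel = 1 deploys one region at a time, and on_failure = "fail_all" aborts everything if any region fails. A toy simulation of that semantics (not Nomad code; deploy is a stand-in for a region's rollout result):

```go
package main

import "fmt"

// deployMultiregion models the strategy above: regions deploy one at a
// time (max_parallel = 1), and a single failed region marks the whole
// job failed (on_failure = "fail_all").
func deployMultiregion(regions []string, deploy func(string) bool) map[string]string {
	status := make(map[string]string)
	for _, r := range regions {
		status[r] = "pending"
	}
	for _, r := range regions {
		if deploy(r) {
			status[r] = "successful"
			continue
		}
		// fail_all: mark every region failed and stop the rollout.
		for _, other := range regions {
			status[other] = "failed"
		}
		break
	}
	return status
}

func main() {
	regions := []string{"us-east", "eu-west"}
	allHealthy := func(r string) bool { return true }
	fmt.Println(deployMultiregion(regions, allHealthy))
}
```

Swapping on_failure to "fail_local" in the real spec would correspond to failing only the unhealthy region and leaving the rest untouched.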

Federation setup:

# us-east server config
server {
  enabled              = true
  bootstrap_expect     = 3
  authoritative_region = "us-east"  # This region is the source of truth for ACLs

  # Auto-join peers in this region via cloud tags
  server_join {
    retry_join = ["provider=aws tag_key=nomad-server tag_value=true"]
  }
}

# eu-west server config
server {
  enabled              = true
  bootstrap_expect     = 3
  authoritative_region = "us-east"  # Points to us-east for ACL replication

  # Also join us-east servers to federate the regions over the WAN
  server_join {
    retry_join = ["provider=aws tag_key=nomad-server tag_value=true region=us-east-1"]
  }
}
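Cloud auto-join is optional; you can also federate manually by pointing any server at a server in the other region (the address here is illustrative, matching the lab above; 4648 is the default serf port):

```shell
# Run from any us-east server to join one eu-west server over the WAN
nomad server join 10.1.1.11:4648
```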

Terraform for multi-region:

# Root main.tf, wiring the two per-region modules together
module "us_east" {
  source = "./modules/nomad-region"

  region     = "us-east-1"
  servers    = 3
  clients    = 5
  server_ips = var.us_east_server_ips
}

module "eu_west" {
  source = "./modules/nomad-region"

  region     = "eu-west-1"
  servers    = 3
  clients    = 5
  server_ips = var.eu_west_server_ips

  # Join to us-east for federation
  wan_join = module.us_east.server_ips
}

Learning milestones:

  1. Regions discover each other → You understand WAN gossip
  2. Multiregion jobs deploy → You understand coordinated deployment
  3. ACLs replicate from authoritative region → You understand central control
  4. Failover works → You understand disaster recovery

Project 12: Nomad Source Code Deep Dive

  • File: LEARN_NOMAD_DEEP_DIVE.md
  • Main Programming Language: Go
  • Alternative Programming Languages: N/A
  • Coolness Level: Level 4: Hardcore Tech Flex
  • Business Potential: 1. The “Resume Gold”
  • Difficulty: Level 4: Expert
  • Knowledge Area: Source Code / Internal Architecture
  • Software or Tool: Nomad source code
  • Main Book: “The Go Programming Language” by Donovan & Kernighan

What you’ll build: A documented walkthrough of the Nomad source code, tracing a job from submission through evaluation, scheduling, and allocation—with annotations and diagrams.

Why it teaches Nomad: Reading the source code is the ultimate way to understand how Nomad works. This project takes you through the actual implementation of concepts you’ve learned, seeing how theory translates to production code.

Core challenges you’ll face:

  • Navigating the codebase → maps to understanding package structure
  • Tracing RPC flows → maps to how requests move through the system
  • Understanding the FSM → maps to state machine operations
  • Following the scheduler → maps to actual bin-packing code

Key Concepts:

Difficulty: Expert
Time estimate: 2-3 weeks
Prerequisites: All previous projects. Strong Go proficiency. Ability to read complex codebases.

Real world outcome:

# Your documented walkthrough:

docs/nomad-source-walkthrough/
├── 00-overview.md           # Package structure and architecture
├── 01-job-submission.md     # nomad/job_endpoint.go
├── 02-evaluation-broker.md  # nomad/eval_broker.go
├── 03-scheduler-workers.md  # scheduler/generic_sched.go
├── 04-feasibility.md        # scheduler/feasible.go
├── 05-ranking.md            # scheduler/rank.go
├── 06-plan-queue.md         # nomad/plan_queue.go
├── 07-state-store.md        # nomad/state/state_store.go
├── 08-client-alloc.md       # client/allocrunner/
├── diagrams/
│   ├── job-flow.png
│   ├── scheduler-internals.png
│   └── raft-fsm.png
└── README.md

Implementation Hints:

Key directories to explore:

nomad/
├── command/          # CLI commands
├── api/              # Go client library
├── nomad/            # Server-side code
│   ├── job_endpoint.go      # Job RPC handlers
│   ├── eval_broker.go       # Evaluation queue
│   ├── plan_queue.go        # Plan application
│   ├── fsm.go               # Raft state machine
│   └── state/               # State store (in-memory go-memdb)
├── scheduler/        # Scheduling logic
│   ├── generic_sched.go     # Service/batch scheduler
│   ├── system_sched.go      # System scheduler
│   ├── feasible.go          # Constraint checking
│   └── rank.go              # Node scoring
├── client/           # Client-side code
│   ├── client.go            # Main client
│   ├── allocrunner/         # Allocation execution
│   └── fingerprint/         # Node fingerprinting
└── drivers/          # Task drivers
    ├── docker/
    ├── exec/
    └── rawexec/

Tracing job submission:

// 1. Job endpoint receives request
// nomad/job_endpoint.go
func (j *Job) Register(args *structs.JobRegisterRequest, reply *structs.JobRegisterResponse) error {
    // Validate job
    // Persist to state store via Raft
    // Create evaluation
}

// 2. Evaluation is enqueued
// nomad/eval_broker.go
func (b *EvalBroker) Enqueue(eval *structs.Evaluation) {
    // Add to priority queue
    // Signal waiting workers
}

// 3. Scheduler worker dequeues
// nomad/worker.go
func (w *Worker) run() {
    for {
        eval, _ := w.srv.evalBroker.Dequeue(...)
        w.invokeScheduler(eval)
    }
}

// 4. Scheduler processes
// scheduler/generic_sched.go
func (s *GenericScheduler) Process(eval *structs.Evaluation) error {
    // Reconcile desired vs actual state
    // For each needed allocation:
    //   - Find feasible nodes
    //   - Rank nodes
    //   - Create allocation in plan
    // Submit plan
}

// 5. Plan is applied
// nomad/plan_apply.go
func (p *planner) applyPlan(...) {
    // Check for conflicts
    // Apply via Raft
    // Create allocations in state store
}

Questions to answer in your walkthrough:

  • How does optimistic concurrency work in the plan queue?
  • What happens when a node fails during scheduling?
  • How are blocked evaluations handled?
  • How does the client reconcile allocations?
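For the first question, it helps to model the idea before reading the real plan queue: schedulers work from a state snapshot in parallel, but the leader evaluates plans serially and rejects any plan computed from a snapshot that has since gone stale. A toy sketch (hypothetical types, not Nomad's actual structs):

```go
package main

import "fmt"

// Plan is a toy stand-in for a scheduler's proposed placement,
// stamped with the state snapshot index it was computed from.
type Plan struct {
	SnapshotIndex uint64
	NodeID        string
}

// State is a toy cluster state: the last index at which each node's
// allocations changed.
type State struct {
	Index        uint64
	NodeModified map[string]uint64
}

// evaluatePlan mimics the leader's serialized plan evaluation: a plan
// is accepted only if the target node has not changed since the
// scheduler's snapshot; otherwise it is rejected and must be retried.
func evaluatePlan(s *State, p Plan) bool {
	if s.NodeModified[p.NodeID] > p.SnapshotIndex {
		return false // stale: something landed on this node after our snapshot
	}
	s.Index++
	s.NodeModified[p.NodeID] = s.Index
	return true
}

func main() {
	s := &State{Index: 10, NodeModified: map[string]uint64{}}
	a := Plan{SnapshotIndex: 10, NodeID: "node-1"}
	b := Plan{SnapshotIndex: 10, NodeID: "node-1"} // computed concurrently
	fmt.Println(evaluatePlan(s, a), evaluatePlan(s, b)) // prints "true false"
}
```

This is the optimistic part: schedulers never lock nodes while computing; conflicts are detected and retried at commit time.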

Learning milestones:

  1. You trace job submission → You understand RPC flow
  2. You understand the FSM → You understand Raft integration
  3. You follow the scheduler → You understand actual algorithms
  4. You document the flow → You can teach others

Project Comparison Table

| Project | Difficulty | Time | Depth of Understanding | Fun Factor |
|---------|------------|------|------------------------|------------|
| 1. Local Dev Cluster | Beginner | Weekend | ⭐⭐ | ⭐⭐⭐ |
| 2. Job Lifecycle Explorer | Beginner | 1 week | ⭐⭐⭐ | ⭐⭐⭐ |
| 3. Scheduler Visualizer | Intermediate | 1-2 weeks | ⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ |
| 4. Custom Constraints | Intermediate | 1 week | ⭐⭐⭐ | ⭐⭐⭐ |
| 5. Consul Service Mesh | Intermediate | 1-2 weeks | ⭐⭐⭐⭐ | ⭐⭐⭐⭐ |
| 6. Autoscaler | Advanced | 2 weeks | ⭐⭐⭐⭐ | ⭐⭐⭐⭐ |
| 7. Custom Task Driver | Advanced | 2-3 weeks | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ |
| 8. Raft Simulator | Expert | 3-4 weeks | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐ |
| 9. Gossip Protocol | Advanced | 2-3 weeks | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐ |
| 10. Mini-Scheduler | Expert | 1 month | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ |
| 11. Multi-Region Lab | Advanced | 2 weeks | ⭐⭐⭐⭐ | ⭐⭐⭐⭐ |
| 12. Source Code Dive | Expert | 2-3 weeks | ⭐⭐⭐⭐⭐ | ⭐⭐⭐ |

If you’re new to container orchestration:

  1. Project 1 (Local Cluster) - Get hands-on experience
  2. Project 2 (Job Lifecycle) - Understand job types
  3. Project 4 (Constraints) - Learn placement logic
  4. Project 5 (Service Mesh) - Understand networking

If you know Kubernetes and want to understand Nomad’s differences:

  1. Project 3 (Scheduler Visualizer) - See the evaluation system
  2. Project 7 (Task Driver) - Understand the plugin model
  3. Project 10 (Mini-Scheduler) - Build your own scheduler

If you want to deeply understand distributed systems:

  1. Project 8 (Raft) - Consensus fundamentals
  2. Project 9 (Gossip) - Membership protocols
  3. Project 10 (Scheduler) - Distributed scheduling
  4. Project 12 (Source Code) - Production implementation

Quick path (2-3 weeks):

  1. Project 1 (Weekend) - Get running
  2. Project 2 (3 days) - Understand jobs
  3. Project 3 (1 week) - See scheduling
  4. Project 5 (1 week) - Service mesh

Final Capstone Project: Production-Grade Nomad Platform

  • File: LEARN_NOMAD_DEEP_DIVE.md
  • Main Programming Language: Go + HCL + Terraform
  • Alternative Programming Languages: Python, Bash
  • Coolness Level: Level 5: Pure Magic (Super Cool)
  • Business Potential: 4. The “Open Core” Infrastructure
  • Difficulty: Level 5: Master
  • Knowledge Area: Platform Engineering / Production Operations
  • Software or Tool: Nomad + Consul + Vault + Terraform
  • Main Book: “Site Reliability Engineering” by Google

What you’ll build: A complete production-grade Nomad platform with multi-region federation, Consul service mesh, Vault secrets integration, monitoring/alerting, CI/CD pipeline integration, and disaster recovery procedures.

Why it teaches everything: This project integrates all concepts: scheduling, service mesh, secrets, observability, security, and operations. It’s the capstone that proves you can run Nomad in production.

Core components:

  1. Multi-region clusters with Terraform provisioning
  2. Consul Connect for service mesh
  3. Vault integration for secrets management
  4. Prometheus + Grafana for monitoring
  5. CI/CD integration for deployment pipelines
  6. Runbooks for operational procedures
  7. Disaster recovery procedures and testing
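For component 4, Nomad agents can expose Prometheus-format metrics directly. A minimal telemetry stanza for the agent config (a sketch; tune collection_interval for your environment):

```hcl
telemetry {
  collection_interval        = "1s"
  prometheus_metrics         = true
  publish_allocation_metrics = true
  publish_node_metrics       = true
}
```

Prometheus can then scrape each agent's /v1/metrics endpoint, typically discovered via Consul.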

Real world outcome:

Production Nomad Platform
├── infrastructure/
│   ├── terraform/
│   │   ├── modules/
│   │   │   ├── nomad-cluster/
│   │   │   ├── consul-cluster/
│   │   │   └── vault-cluster/
│   │   ├── environments/
│   │   │   ├── production/
│   │   │   └── staging/
│   │   └── main.tf
│   └── packer/
│       └── nomad-ami.pkr.hcl
├── jobs/
│   ├── platform/
│   │   ├── prometheus.nomad
│   │   ├── grafana.nomad
│   │   ├── traefik.nomad
│   │   └── vault-agent.nomad
│   └── applications/
│       └── example-app.nomad
├── policies/
│   ├── nomad-acl/
│   ├── consul-intentions/
│   └── vault-policies/
├── monitoring/
│   ├── dashboards/
│   ├── alerts/
│   └── runbooks/
├── ci-cd/
│   ├── .github/workflows/
│   └── scripts/
└── docs/
    ├── architecture.md
    ├── operations.md
    └── disaster-recovery.md

Time estimate: 2-3 months
Prerequisites: All 12 projects completed


Essential Resources

Official Documentation

Key Papers and Articles

Courses

Books (Complementary)

  • “Designing Data-Intensive Applications” by Martin Kleppmann - Distributed systems fundamentals
  • “Site Reliability Engineering” by Google - Production operations
  • “The Go Programming Language” by Donovan & Kernighan - For source code reading

Source Code

Community


Summary

| # | Project | Main Programming Language |
|---|---------|---------------------------|
| 1 | Local Development Cluster | HCL + Bash |
| 2 | Job Lifecycle Explorer | HCL + Bash |
| 3 | Scheduler Visualization Tool | Go |
| 4 | Custom Resource Constraint System | HCL + Go |
| 5 | Consul Service Mesh Integration | HCL |
| 6 | Autoscaler Implementation | Go |
| 7 | Custom Task Driver Plugin | Go |
| 8 | Raft Consensus Simulator | Go |
| 9 | Gossip Protocol Implementation | Go |
| 10 | Mini-Scheduler Implementation | Go |
| 11 | Multi-Region Federation Lab | HCL + Terraform |
| 12 | Nomad Source Code Deep Dive | Go |
| Final | Production-Grade Nomad Platform | Go + HCL + Terraform |

Happy scheduling! Nomad’s simplicity hides powerful distributed systems concepts—enjoy the journey from “nomad run” to understanding every byte of the protocol. 🚀