LEARN NOMAD DEEP DIVE
Learn HashiCorp Nomad: From Zero to Mastery
Goal: Deeply understand Nomad—from basic job scheduling to implementing your own scheduler, understanding Raft consensus, building task drivers, and mastering the internals that make it a production-grade orchestrator.
Why Nomad Matters
Nomad is HashiCorp’s workload orchestrator, and it occupies a unique position in the infrastructure landscape:
- Simpler than Kubernetes: Single binary, no etcd, no complex control plane
- More flexible: Runs containers, VMs, Java apps, binaries, batch jobs—anything
- Production-proven: Powers millions of containers at companies like Cloudflare, Roblox, and CircleCI
- Inspired by Google: The scheduler design draws from Google’s Borg and Omega papers
- HashiCorp Ecosystem: Seamless integration with Consul (service mesh), Vault (secrets), Terraform (provisioning)
Understanding Nomad deeply teaches you:
- Distributed systems fundamentals: Raft consensus, gossip protocols, leader election
- Scheduler design: Bin packing, spread algorithms, constraint satisfaction
- Container orchestration: Namespaces, cgroups, networking modes
- Service mesh concepts: Envoy sidecars, mutual TLS, service discovery
- Real production patterns: Multi-region, high availability, disaster recovery
Core Architecture Overview
┌─────────────────────────────────────────────────────────────────┐
│ NOMAD CLUSTER │
├─────────────────────────────────────────────────────────────────┤
│ │
│ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │
│ │ Server 1 │ │ Server 2 │ │ Server 3 │ │
│ │ (Leader) │◄─┤ (Follower) ├──┤ (Follower) │ │
│ │ │ │ │ │ │ │
│ │ ┌──────────┐ │ │ ┌──────────┐ │ │ ┌──────────┐ │ │
│ │ │ Raft │ │ │ │ Raft │ │ │ │ Raft │ │ │
│ │ │ (FSM) │ │ │ │ (FSM) │ │ │ │ (FSM) │ │ │
│ │ └──────────┘ │ │ └──────────┘ │ │ └──────────┘ │ │
│ │ ┌──────────┐ │ │ ┌──────────┐ │ │ ┌──────────┐ │ │
│ │ │Scheduler │ │ │ │Scheduler │ │ │ │Scheduler │ │ │
│ │ │ Workers │ │ │ │ Workers │ │ │ │ Workers │ │ │
│ │ └──────────┘ │ │ └──────────┘ │ │ └──────────┘ │ │
│ └──────┬───────┘ └──────┬───────┘ └──────┬───────┘ │
│ │ │ │ │
│ └─────────────────┼─────────────────┘ │
│ Serf Gossip │
│ (Membership) │
│ ┌─────────────────┼─────────────────┐ │
│ │ │ │ │
│ ┌──────▼───────┐ ┌──────▼───────┐ ┌──────▼───────┐ │
│ │ Client 1 │ │ Client 2 │ │ Client N │ │
│ │ │ │ │ │ │ │
│ │ ┌──────────┐ │ │ ┌──────────┐ │ │ ┌──────────┐ │ │
│ │ │ Docker │ │ │ │ exec │ │ │ │ raw_exec │ │ │
│ │ │ Driver │ │ │ │ Driver │ │ │ │ Driver │ │ │
│ │ └──────────┘ │ │ └──────────┘ │ │ └──────────┘ │ │
│ │ ┌──────────┐ │ │ ┌──────────┐ │ │ ┌──────────┐ │ │
│ │ │Allocation│ │ │ │Allocation│ │ │ │Allocation│ │ │
│ │ │ Runner │ │ │ │ Runner │ │ │ │ Runner │ │ │
│ │ └──────────┘ │ │ └──────────┘ │ │ └──────────┘ │ │
│ └──────────────┘ └──────────────┘ └──────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────┘
Key Components
| Component | Purpose |
|---|---|
| Server | Control plane—stores state, runs schedulers, coordinates via Raft |
| Client | Data plane—runs workloads, reports node status |
| Raft | Consensus algorithm for leader election and state replication |
| Serf/Gossip | Membership protocol for cluster discovery |
| Scheduler | Places allocations on nodes using bin-packing/spread |
| Task Drivers | Execute workloads (Docker, exec, raw_exec, Java, etc.) |
| Evaluation | Unit of scheduling work, triggered by state changes |
| Allocation | A task group placed on a specific node |
The Scheduling Pipeline
Understanding this pipeline is key to understanding Nomad:
Job Submission → Evaluation Created → Scheduler Processes →
Plan Generated → Leader Applies Plan → Allocation Created →
Client Runs Tasks
- Job Registration: User submits a job (desired state)
- Evaluation Creation: Nomad creates an evaluation to process the change
- Scheduler Dequeues: A scheduler worker picks up the evaluation
- Feasibility Checking: Filter nodes that can’t run the job (constraints, resources)
- Ranking: Score feasible nodes (bin packing, affinity, spread)
- Plan Submission: Scheduler submits plan to leader
- Plan Application: Leader checks for conflicts, applies or rejects
- Allocation: Client receives allocation and starts tasks
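You can watch this pipeline from the outside with the official Go client (github.com/hashicorp/nomad/api). A minimal sketch, assuming a local agent at the default address and a hello.nomad jobspec on disk (error handling elided for brevity):
package main
import (
"fmt"
"os"
"time"
"github.com/hashicorp/nomad/api"
)
func main() {
client, _ := api.NewClient(api.DefaultConfig())
// Parse the jobspec server-side, then register it (desired state)
hcl, _ := os.ReadFile("hello.nomad")
job, _ := client.Jobs().ParseHCL(string(hcl), true)
resp, _, _ := client.Jobs().Register(job, nil)
fmt.Println("evaluation created:", resp.EvalID)
// Poll the evaluation until a scheduler worker has processed it
for {
eval, _, _ := client.Evaluations().Info(resp.EvalID, nil)
fmt.Println("eval status:", eval.Status)
if eval.Status == "complete" {
break
}
time.Sleep(500 * time.Millisecond)
}
// Allocations are the output of the applied plan
allocs, _, _ := client.Evaluations().Allocations(resp.EvalID, nil)
for _, a := range allocs {
fmt.Printf("alloc %s placed on node %s\n", a.ID[:8], a.NodeID[:8])
}
}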
Project List
Projects progress from understanding the basics to building sophisticated internal components.
Project 1: Local Development Cluster
- File: LEARN_NOMAD_DEEP_DIVE.md
- Main Programming Language: HCL (HashiCorp Configuration Language) + Bash
- Alternative Programming Languages: Python, Go
- Coolness Level: Level 2: Practical but Forgettable
- Business Potential: 1. The “Resume Gold”
- Difficulty: Level 1: Beginner
- Knowledge Area: Cluster Setup / Agent Configuration
- Software or Tool: Nomad, Docker
- Main Book: HashiCorp Nomad Documentation
What you’ll build: A local 3-server, 2-client Nomad cluster running in Docker containers (or as processes), demonstrating leader election, automatic clustering, and basic job deployment.
Why it teaches Nomad: Before diving deep, you need a running cluster to experiment with. This project introduces the agent modes (server/client), configuration files, and the basic operational commands.
Core challenges you’ll face:
- Configuring server vs client mode → maps to understanding agent roles
- Setting up cluster bootstrapping → maps to bootstrap_expect and join
- Networking between agents → maps to advertise addresses and ports
- Running your first job → maps to job specification basics
Key Concepts:
- Agent Configuration: Nomad Agent Configuration
- Bootstrap Process: Clustering Tutorial
- Job Specification: Job Specification Docs
Difficulty: Beginner Time estimate: Weekend Prerequisites: Docker installed. Basic understanding of configuration files. Command-line comfort.
Real world outcome:
# Your cluster is running:
$ nomad server members
Name Address Port Status Leader Protocol
server-1.dc1 10.0.0.1 4648 alive true 2
server-2.dc1 10.0.0.2 4648 alive false 2
server-3.dc1 10.0.0.3 4648 alive false 2
$ nomad node status
ID DC Name Class Drain Eligibility Status
a1b2c3d4 dc1 client-1 <none> false eligible ready
e5f6g7h8 dc1 client-2 <none> false eligible ready
# Deploy a simple job:
$ nomad run hello.nomad
==> Monitoring deployment...
2024-01-15T10:00:00Z (ID: abc123)
Deployment "abc123" successful
$ curl http://client-1:8080
Hello from Nomad!
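The same checks work programmatically. A small sketch with the Go API client, assuming the default agent address:
package main
import (
"fmt"
"github.com/hashicorp/nomad/api"
)
func main() {
client, _ := api.NewClient(api.DefaultConfig())
// Raft leader and peer set, as reported by the servers
leader, _ := client.Status().Leader()
peers, _ := client.Status().Peers()
fmt.Println("leader:", leader)
fmt.Println("peers:", peers)
// Client nodes registered with the cluster
nodes, _, _ := client.Nodes().List(nil)
for _, n := range nodes {
fmt.Printf("%s %s %s\n", n.ID[:8], n.Name, n.Status)
}
}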
Implementation Hints:
Directory structure:
nomad-cluster/
├── docker-compose.yml # Or Vagrant, or shell scripts
├── config/
│ ├── server.hcl # Server configuration
│ └── client.hcl # Client configuration
└── jobs/
└── hello.nomad # First job
Server configuration (server.hcl):
# Minimal server config
data_dir = "/opt/nomad/data"
bind_addr = "0.0.0.0"
server {
enabled = true
bootstrap_expect = 3 # Wait for 3 servers before electing leader
}
# Advertise address for other nodes to reach this one
advertise {
http = "{{ GetInterfaceIP \"eth0\" }}"
rpc = "{{ GetInterfaceIP \"eth0\" }}"
serf = "{{ GetInterfaceIP \"eth0\" }}"
}
Client configuration (client.hcl):
data_dir = "/opt/nomad/data"
bind_addr = "0.0.0.0"
client {
enabled = true
servers = ["server-1:4647", "server-2:4647", "server-3:4647"]
}
# Enable task drivers
plugin "docker" {
config {
allow_privileged = false
}
}
First job (hello.nomad):
job "hello" {
datacenters = ["dc1"]
type = "service"
group "web" {
count = 2
network {
port "http" { to = 8080 }
}
task "server" {
driver = "docker"
config {
image = "hashicorp/http-echo"
args = ["-text", "Hello from Nomad!"]
ports = ["http"]
}
resources {
cpu = 100
memory = 64
}
}
}
}
Learning milestones:
- Servers elect a leader → You understand Raft basics
- Clients join the cluster → You understand gossip membership
- Job deploys successfully → You understand the scheduling flow
- You scale the job → You understand desired vs actual state
Project 2: Job Lifecycle Explorer
- File: LEARN_NOMAD_DEEP_DIVE.md
- Main Programming Language: HCL + Bash
- Alternative Programming Languages: Go (API client), Python
- Coolness Level: Level 2: Practical but Forgettable
- Business Potential: 1. The “Resume Gold”
- Difficulty: Level 1: Beginner
- Knowledge Area: Job Types / Task Lifecycle / Update Strategies
- Software or Tool: Nomad
- Main Book: HashiCorp Nomad Documentation
What you’ll build: A suite of jobs demonstrating all Nomad job types (service, batch, system, sysbatch), task lifecycle hooks, update strategies, and health checks.
Why it teaches Nomad: Each job type has different scheduling behavior. Understanding when tasks start, stop, and restart—and how updates roll out—is essential for production deployments.
Core challenges you’ll face:
- Understanding job types → maps to service vs batch vs system schedulers
- Lifecycle hooks (prestart, poststart, poststop) → maps to task dependencies
- Health checks and deployment health → maps to how Nomad knows tasks are ready
- Update strategies → maps to rolling vs canary deployments
Key Concepts:
- Job Types: Scheduler Types
- Lifecycle Hooks: lifecycle block
- Update Stanza: update block
- Health Checks: check block
Difficulty: Beginner Time estimate: 1 week Prerequisites: Project 1 completed. Running cluster.
Real world outcome:
# Different job types in action:
# Service job - long-running, rescheduled on failure
$ nomad run api-service.nomad
$ nomad job status api-service
Allocations
ID Node ID Task Group Version Desired Status Created
abc123 a1b2c3 api 0 run running 5m ago
# Batch job - runs to completion
$ nomad run data-import.nomad
$ nomad job status data-import
Allocations
ID Node ID Task Group Version Desired Status Created
def456 e5f6g7 import 0 run complete 2m ago
# System job - runs on every node
$ nomad run node-exporter.nomad
$ nomad job status node-exporter
Allocations (2 nodes)
ID Node ID Task Group Version Desired Status Created
ghi789 a1b2c3 exporter 0 run running 1m ago
jkl012 e5f6g7 exporter 0 run running 1m ago
# Rolling update with health checks:
$ nomad run -var version=v2 api-service.nomad
==> Monitoring deployment...
Deployment "xyz789" in progress...
2024-01-15T10:05:00Z - 1/3 allocations healthy
2024-01-15T10:06:00Z - 2/3 allocations healthy
2024-01-15T10:07:00Z - 3/3 allocations healthy
Deployment "xyz789" successful
Implementation Hints:
Service job with lifecycle hooks:
job "api-service" {
type = "service"
group "api" {
count = 3
update {
max_parallel = 1
min_healthy_time = "30s"
healthy_deadline = "5m"
auto_revert = true
canary = 1 # Deploy 1 canary first
}
# Prestart task - runs before main task
task "db-migrate" {
lifecycle {
hook = "prestart"
sidecar = false
}
driver = "docker"
config {
image = "myapp/migrate:${var.version}"
command = "/migrate"
}
}
# Main task
task "api" {
driver = "docker"
config {
image = "myapp/api:${var.version}"
ports = ["http"]
}
# Health check for deployment
service {
name = "api"
port = "http"
check {
type = "http"
path = "/health"
interval = "10s"
timeout = "2s"
}
}
}
# Sidecar - runs alongside main task
task "log-shipper" {
lifecycle {
hook = "poststart"
sidecar = true
}
driver = "docker"
config {
image = "fluent/fluent-bit"
}
}
}
}
Batch job with periodic scheduling:
job "daily-report" {
type = "batch"
periodic {
cron = "0 2 * * *" # 2 AM daily
prohibit_overlap = true
}
group "report" {
task "generate" {
driver = "docker"
config {
image = "myapp/reporter"
command = "/generate-report"
}
# Batch jobs can have restart policies
restart {
attempts = 3
interval = "30m"
delay = "15s"
mode = "fail"
}
}
}
}
Learning milestones:
- You run all 4 job types → You understand scheduler differences
- Lifecycle hooks execute in order → You understand task dependencies
- Health checks gate deployments → You understand deployment health
- Rolling updates work with canaries → You understand update strategies
Project 3: Scheduler Visualization Tool
- File: LEARN_NOMAD_DEEP_DIVE.md
- Main Programming Language: Go
- Alternative Programming Languages: Python, JavaScript
- Coolness Level: Level 3: Genuinely Clever
- Business Potential: 2. The “Micro-SaaS / Pro Tool”
- Difficulty: Level 2: Intermediate
- Knowledge Area: Scheduling Internals / API Usage
- Software or Tool: Nomad API
- Main Book: “Designing Data-Intensive Applications” by Martin Kleppmann (for distributed concepts)
What you’ll build: A tool that visualizes Nomad’s scheduling decisions in real-time—showing evaluations, plans, allocations, node scores, and why certain nodes were chosen or rejected.
Why it teaches Nomad: The scheduler is Nomad’s brain. By building a visualization tool, you’ll understand the evaluation → plan → allocation pipeline, see bin packing in action, and understand why placements happen where they do.
Core challenges you’ll face:
- Streaming evaluations → maps to understanding the evaluation broker
- Parsing allocation plans → maps to feasibility and ranking phases
- Visualizing node scores → maps to bin packing algorithm
- Understanding failures → maps to why allocations get blocked
Key Concepts:
- Scheduling Internals: How Scheduling Works
- Nomad API: API Documentation
- Event Stream: Event Stream API
Difficulty: Intermediate Time estimate: 1-2 weeks Prerequisites: Projects 1-2 completed. Go or Python proficiency.
Real world outcome:
┌─────────────────────────────────────────────────────────────────┐
│ Nomad Scheduler Visualizer │
├─────────────────────────────────────────────────────────────────┤
│ │
│ Evaluation: abc123-def456 (Triggered by: job-register) │
│ Job: api-service | Type: service | Priority: 50 │
│ │
│ ┌─ Scheduling Timeline ──────────────────────────────────────┐ │
│ │ │ │
│ │ 10:00:00.100 Evaluation Created │ │
│ │ 10:00:00.150 Scheduler Dequeued (worker-3) │ │
│ │ 10:00:00.200 Feasibility Check Started │ │
│ │ - client-1: FEASIBLE │ │
│ │ - client-2: FEASIBLE │ │
│ │ - client-3: REJECTED (insufficient memory) │ │
│ │ 10:00:00.250 Ranking Phase │ │
│ │ - client-1: score=0.85 (binpack: 0.9, │ │
│ │ anti-affinity: 0.8) │ │
│ │ - client-2: score=0.72 (binpack: 0.7, │ │
│ │ anti-affinity: 0.75) │ │
│ │ 10:00:00.300 Plan Submitted │ │
│ │ 10:00:00.350 Plan Accepted (no conflicts) │ │
│ │ 10:00:00.400 Allocation Created on client-1 │ │
│ │ │ │
│ └─────────────────────────────────────────────────────────────┘ │
│ │
│ Node Resource Usage: │
│ ┌──────────────────────────────────────────────────────────────┐ │
│ │ client-1 [████████████░░░░░░░░] CPU: 60% │ Mem: 2.1/4GB │ │
│ │ client-2 [████████░░░░░░░░░░░░] CPU: 40% │ Mem: 1.5/4GB │ │
│ │ client-3 [████████████████████] CPU: 95% │ Mem: 3.9/4GB │ │
│ └──────────────────────────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────┘
Implementation Hints:
Using the Event Stream API:
package main
import (
"context"
"fmt"
"github.com/hashicorp/nomad/api"
)
func main() {
client, err := api.NewClient(api.DefaultConfig())
if err != nil {
panic(err)
}
// Subscribe to evaluation, allocation, and node events
topics := map[api.Topic][]string{
api.TopicEvaluation: {"*"},
api.TopicAllocation: {"*"},
api.TopicNode: {"*"},
}
ctx := context.Background()
eventCh, err := client.EventStream().Stream(ctx, topics, 0, nil)
if err != nil {
panic(err)
}
// The channel closes when the stream ends or ctx is cancelled
for events := range eventCh {
if events.Err != nil {
fmt.Println("stream error:", events.Err)
continue
}
for _, event := range events.Events {
switch event.Topic {
case api.TopicEvaluation:
handleEvaluation(event)
case api.TopicAllocation:
handleAllocation(event)
}
}
}
}
func handleAllocation(event api.Event) {
// Decode with event.Allocation() and render placement changes
}
func handleEvaluation(event api.Event) {
// Event payloads are decoded via helper methods on api.Event
eval, err := event.Evaluation()
if err != nil || eval == nil {
return
}
fmt.Printf("Eval %s: Status=%s TriggeredBy=%s\n",
eval.ID[:8], eval.Status, eval.TriggeredBy)
// Get detailed scheduling info once the evaluation completes
if eval.Status == "complete" {
showPlanDetails(eval.ID)
}
}
Getting node scores and feasibility info:
func showPlanDetails(evalID string) {
client, _ := api.NewClient(api.DefaultConfig())
// List the allocations created by this evaluation
stubs, _, _ := client.Evaluations().Allocations(evalID, nil)
for _, stub := range stubs {
// Fetch the full allocation, which carries the placement metrics
alloc, _, err := client.Allocations().Info(stub.ID, nil)
if err != nil || alloc == nil {
continue
}
fmt.Printf(" Allocation %s on %s\n", alloc.ID[:8], alloc.NodeID[:8])
// Metrics record why this node was chosen
if alloc.Metrics != nil {
fmt.Printf(" NodesEvaluated: %d\n", alloc.Metrics.NodesEvaluated)
fmt.Printf(" NodesFiltered: %d\n", alloc.Metrics.NodesFiltered)
// Filter reasons by node class
for class, count := range alloc.Metrics.ClassFiltered {
fmt.Printf(" Filtered by class %s: %d\n", class, count)
}
// Normalized scores for each candidate node
for _, score := range alloc.Metrics.ScoreMetaData {
fmt.Printf(" Node %s: score=%.2f\n",
score.NodeID[:8], score.NormScore)
}
}
}
}
Learning milestones:
- You stream evaluation events → You understand the evaluation broker
- You show feasibility filtering → You understand constraint checking
- You display node scores → You understand bin packing
- You track blocked evaluations → You understand resource exhaustion
Project 4: Custom Resource Constraint System
- File: LEARN_NOMAD_DEEP_DIVE.md
- Main Programming Language: HCL + Go
- Alternative Programming Languages: Python (for testing)
- Coolness Level: Level 3: Genuinely Clever
- Business Potential: 3. The “Service & Support” Model
- Difficulty: Level 2: Intermediate
- Knowledge Area: Constraints / Affinities / Node Metadata
- Software or Tool: Nomad
- Main Book: “Designing Data-Intensive Applications” by Martin Kleppmann
What you’ll build: A system that uses node attributes, custom metadata, constraints, and affinities to implement complex placement requirements like “run on GPU nodes,” “prefer SSD storage,” “spread across availability zones.”
Why it teaches Nomad: Real-world scheduling is rarely just “find a node with enough CPU.” This project teaches you how Nomad’s constraint system enables sophisticated placement logic without modifying the scheduler itself.
Core challenges you’ll face:
- Node fingerprinting → maps to how Nomad discovers node capabilities
- Custom node metadata → maps to operator-defined attributes
- Hard constraints vs soft affinities → maps to must-have vs prefer
- Spread across failure domains → maps to the spread block
Key Concepts:
- Constraints: constraint block
- Affinities: affinity block
- Spread: spread block
- Node Metadata: Client metadata
Difficulty: Intermediate Time estimate: 1 week Prerequisites: Projects 1-3 completed. Understanding of constraint satisfaction.
Real world outcome:
# Configure nodes with metadata:
# client-1.hcl
client {
meta {
"zone" = "us-east-1a"
"instance" = "gpu"
"storage" = "ssd"
}
}
# Job with complex placement requirements:
$ cat ml-training.nomad
job "ml-training" {
# Hard constraint: MUST have GPU
constraint {
attribute = "${meta.instance}"
value = "gpu"
}
# Soft affinity: PREFER SSD storage (weight: 50%)
affinity {
attribute = "${meta.storage}"
value = "ssd"
weight = 50
}
# Spread across availability zones
spread {
attribute = "${meta.zone}"
weight = 100
target "us-east-1a" { percent = 33 }
target "us-east-1b" { percent = 33 }
target "us-east-1c" { percent = 34 }
}
group "training" {
count = 6
task "train" {
driver = "docker"
config {
image = "tensorflow/tensorflow:latest-gpu"
}
resources {
# Request GPU device
device "nvidia/gpu" {
count = 1
}
}
}
}
}
$ nomad run ml-training.nomad
$ nomad job status ml-training
Allocations
ID Node ID Zone Status
abc123 node-1 us-east-1a running # GPU + SSD
def456 node-4 us-east-1b running # GPU + HDD (SSD preferred but not required)
ghi789 node-7 us-east-1c running # GPU + SSD
...
# Placement respects:
# ✓ All on GPU nodes (constraint)
# ✓ Prefers SSD nodes (affinity)
# ✓ Spread across zones (spread)
Implementation Hints:
Node metadata configuration:
# On each client node
client {
enabled = true
# Custom metadata operators can define
meta {
"zone" = "us-east-1a"
"rack" = "rack-42"
"instance_type" = "gpu"
"storage_type" = "nvme"
"network_speed" = "25gbps"
"cost_tier" = "expensive"
}
# Host volumes for persistent storage
host_volume "scratch" {
path = "/mnt/scratch"
read_only = false
}
}
Complex constraint examples:
job "complex-constraints" {
# Must run on Linux
constraint {
attribute = "${attr.kernel.name}"
value = "linux"
}
# Must have at least 4 CPU cores
constraint {
attribute = "${attr.cpu.numcores}"
operator = ">="
value = "4"
}
# Must NOT run on nodes tagged "maintenance"
constraint {
attribute = "${meta.status}"
operator = "!="
value = "maintenance"
}
# Must run in specific datacenters
constraint {
attribute = "${node.datacenter}"
operator = "regexp"
value = "us-.*"
}
# Prefer nodes with recent kernel
affinity {
attribute = "${attr.kernel.version}"
operator = "version"
value = ">= 5.0"
weight = 25
}
group "app" {
# Distinct hosts - never colocate instances
constraint {
operator = "distinct_hosts"
value = "true"
}
task "worker" {
# ...
}
}
}
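Under the hood, each constraint is just a predicate over node attributes. A toy checker mirroring a few of the operators above (a sketch, not Nomad's actual implementation; version comparison borrows github.com/hashicorp/go-version):
package main
import (
"fmt"
"regexp"
"github.com/hashicorp/go-version"
)
// check applies one constraint operator to a node attribute value
func check(attrValue, operator, target string) bool {
switch operator {
case "=":
return attrValue == target
case "!=":
return attrValue != target
case "regexp":
ok, _ := regexp.MatchString(target, attrValue)
return ok
case "version":
v, err1 := version.NewVersion(attrValue)
c, err2 := version.NewConstraint(target)
return err1 == nil && err2 == nil && c.Check(v)
}
return false
}
func main() {
fmt.Println(check("linux", "=", "linux")) // true
fmt.Println(check("us-east-1", "regexp", "us-.*")) // true
fmt.Println(check("5.10.0", "version", ">= 5.0")) // true
}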
Learning milestones:
- You use built-in attributes → You understand fingerprinting
- You define custom metadata → You understand operator-defined placement
- You combine constraints and affinities → You understand hard vs soft requirements
- You spread across failure domains → You understand availability patterns
Project 5: Consul Service Mesh Integration
- File: LEARN_NOMAD_DEEP_DIVE.md
- Main Programming Language: HCL
- Alternative Programming Languages: Go, Python
- Coolness Level: Level 3: Genuinely Clever
- Business Potential: 3. The “Service & Support” Model
- Difficulty: Level 2: Intermediate
- Knowledge Area: Service Mesh / mTLS / Envoy
- Software or Tool: Nomad + Consul + Envoy
- Main Book: “Service Mesh Patterns” by Sandeep Dinesh
What you’ll build: A microservices application with Consul Connect service mesh, demonstrating automatic mTLS, service intentions (authorization), and transparent proxy networking.
Why it teaches Nomad: Nomad’s integration with Consul Connect shows how the HashiCorp stack works together. You’ll understand sidecar injection, service identity, and how Envoy proxies are configured automatically.
Core challenges you’ll face:
- Setting up Consul cluster → maps to understanding service discovery
- Enabling Connect sidecar proxies → maps to Envoy injection
- Configuring service intentions → maps to service-to-service authorization
- Network modes (bridge) → maps to Linux network namespaces
Key Concepts:
- Consul Connect: Service Mesh Integration
- Service Intentions: Consul Intentions
- Network Modes: Networking
- Envoy Sidecar: Sidecar Task
Difficulty: Intermediate Time estimate: 1-2 weeks Prerequisites: Projects 1-4 completed. Basic understanding of TLS and service mesh concepts.
Real world outcome:
# Architecture:
#
# ┌─────────────────────────────────────────────────────┐
# │ Nomad Job │
# │ ┌─────────────────┐ ┌─────────────────┐ │
# │ │ web │ │ api │ │
# │ │ (frontend) │ mTLS │ (backend) │ │
# │ │ ┌───────────┐ │◄────►│ ┌───────────┐ │ │
# │ │ │ nginx │ │ │ │ python │ │ │
# │ │ └───────────┘ │ │ └───────────┘ │ │
# │ │ ┌───────────┐ │ │ ┌───────────┐ │ │
# │ │ │ envoy │ │ │ │ envoy │ │ │
# │ │ │ (sidecar) │ │ │ │ (sidecar) │ │ │
# │ │ └───────────┘ │ │ └───────────┘ │ │
# │ └─────────────────┘ └─────────────────┘ │
# └─────────────────────────────────────────────────────┘
$ nomad run mesh-app.nomad
# Services register with Consul:
$ consul catalog services
api
api-sidecar-proxy
web
web-sidecar-proxy
# Intentions control access:
$ consul intention check web api
Allowed
$ consul intention check malicious-app api
Denied
# Traffic flows through mTLS:
$ curl http://web.service.consul
Hello from web! Got response from api: {"status": "ok"}
Implementation Hints:
Consul configuration for Connect:
# consul.hcl
connect {
enabled = true
}
ports {
grpc = 8502
}
Nomad job with Connect sidecar:
job "mesh-app" {
datacenters = ["dc1"]
group "api" {
count = 2
network {
mode = "bridge" # Required for Connect
port "http" {
to = 8080
}
}
service {
name = "api"
port = "8080"
connect {
sidecar_service {} # Injects Envoy sidecar
}
}
task "api" {
driver = "docker"
config {
image = "myapp/api:v1"
ports = ["http"]
}
}
}
group "web" {
count = 2
network {
mode = "bridge"
port "http" {
static = 8080
to = 80
}
}
service {
name = "web"
port = "80"
connect {
sidecar_service {
proxy {
# Upstream connection to api service
upstreams {
destination_name = "api"
local_bind_port = 5000 # web connects to localhost:5000
}
}
}
}
}
task "web" {
driver = "docker"
config {
image = "myapp/web:v1"
ports = ["http"]
}
env {
API_URL = "http://localhost:5000" # Goes through Envoy → api
}
}
}
}
Service intentions (Consul):
# Allow web → api
consul intention create web api
# Deny all other traffic to api
consul intention create -deny "*" api
# Or use HCL:
# intentions.hcl
Kind = "service-intentions"
Name = "api"
Sources = [
{
Name = "web"
Action = "allow"
},
{
Name = "*"
Action = "deny"
}
]
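You can confirm the registrations from code too. A sketch with the Consul Go client (github.com/hashicorp/consul/api), assuming a local Consul agent:
package main
import (
"fmt"
consul "github.com/hashicorp/consul/api"
)
func main() {
client, _ := consul.NewClient(consul.DefaultConfig())
// Every Connect-enabled service gets a companion sidecar proxy entry
services, _, _ := client.Catalog().Services(nil)
for name, tags := range services {
fmt.Println(name, tags)
}
}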
Learning milestones:
- Sidecars inject automatically → You understand sidecar injection
- Services discover each other → You understand service mesh networking
- mTLS encrypts traffic → You understand zero-trust networking
- Intentions control access → You understand service authorization
Project 6: Autoscaler Implementation
- File: LEARN_NOMAD_DEEP_DIVE.md
- Main Programming Language: Go
- Alternative Programming Languages: Python, HCL (for policies)
- Coolness Level: Level 3: Genuinely Clever
- Business Potential: 3. The “Service & Support” Model
- Difficulty: Level 3: Advanced
- Knowledge Area: Autoscaling / Metrics / Control Loops
- Software or Tool: Nomad Autoscaler
- Main Book: “Site Reliability Engineering” by Google
What you’ll build: An autoscaling system that adjusts job counts based on metrics (CPU, memory, queue depth, custom metrics), implementing both horizontal pod autoscaling and cluster autoscaling.
Why it teaches Nomad: Autoscaling requires understanding how to monitor allocations, interact with the Nomad API to scale jobs, and implement control loops that don’t oscillate. This teaches the operational side of Nomad.
Core challenges you’ll face:
- Collecting metrics → maps to Prometheus integration or Nomad metrics
- Implementing scaling policies → maps to target tracking algorithms
- Avoiding thrashing → maps to cooldown periods and stabilization
- Cluster autoscaling → maps to adding/removing nodes
Key Concepts:
- Nomad Autoscaler: Autoscaler Documentation
- Scaling Policies: Policy Configuration
- APM Plugins: APM Plugins
- Target Plugins: Target Plugins
Difficulty: Advanced Time estimate: 2 weeks Prerequisites: Projects 1-5 completed. Understanding of control systems basics.
Real world outcome:
# Your autoscaler in action:
# Initial state:
$ nomad job status api-service
Task Group Queued Starting Running Failed Complete Lost
api 0 0 2 0 0 0
# Load increases, autoscaler detects high CPU:
$ watch nomad-autoscaler agent -config=config.hcl
[INFO] policy: scaling api from 2 to 5: cpu_utilization=85%, target=70%
[INFO] scale: successfully scaled api to 5 allocations
$ nomad job status api-service
Task Group Queued Starting Running Failed Complete Lost
api 0 1 5 0 0 0
# Load decreases:
[INFO] policy: stabilization window not passed, holding at 5
[INFO] policy: scaling api from 5 to 3: cpu_utilization=35%, target=70%
[INFO] scale: successfully scaled api to 3 allocations
# Cluster autoscaling (add nodes when all are at capacity):
[INFO] cluster: no feasible nodes for pending allocations
[INFO] cluster: launching 2 new instances via AWS ASG
[INFO] cluster: new nodes joined: node-5, node-6
Implementation Hints:
Autoscaler configuration:
# autoscaler.hcl
nomad {
address = "http://localhost:4646"
}
apm "prometheus" {
driver = "prometheus"
config = {
address = "http://prometheus:9090"
}
}
target "nomad" {
driver = "nomad"
}
policy {
default_cooldown = "2m"
default_evaluation_interval = "30s"
}
Scaling policy in job spec:
job "api-service" {
group "api" {
count = 2
scaling {
enabled = true
min = 2
max = 20
policy {
# Target 70% average CPU
check "cpu" {
source = "prometheus"
query = "avg(nomad_client_allocs_cpu_allocated{task_group='api'})"
strategy "target-value" {
target = 70
}
}
# Also scale on queue depth
check "queue_depth" {
source = "prometheus"
query = "sum(rabbitmq_queue_messages{queue='tasks'})"
strategy "target-value" {
target = 100 # 100 messages per instance
}
}
}
}
task "api" {
# ...
}
}
}
Building your own autoscaler (simplified):
package main
import (
"time"
"github.com/hashicorp/nomad/api"
)
type Autoscaler struct {
client *api.Client
targetCPU float64
minCount int
maxCount int
cooldown time.Duration
lastScaled time.Time
}
func (a *Autoscaler) Run() {
ticker := time.NewTicker(30 * time.Second)
for range ticker.C {
if time.Since(a.lastScaled) < a.cooldown {
continue // Still in cooldown
}
currentCPU := a.getAverageCPU()
currentCount := a.getCurrentCount()
// Calculate desired count
desiredCount := int(float64(currentCount) * (currentCPU / a.targetCPU))
desiredCount = clamp(desiredCount, a.minCount, a.maxCount)
if desiredCount != currentCount {
a.scale(desiredCount)
a.lastScaled = time.Now()
}
}
}
func (a *Autoscaler) scale(count int) error {
job, _, err := a.client.Jobs().Info("api-service", nil)
if err != nil {
return err
}
// Update the task group count and re-register the job
for _, tg := range job.TaskGroups {
if *tg.Name == "api" {
tg.Count = &count
}
}
_, _, err = a.client.Jobs().Register(job, nil)
return err
}
// clamp bounds the desired count to [min, max]
func clamp(v, min, max int) int {
if v < min {
return min
}
if v > max {
return max
}
return v
}
Learning milestones:
- You collect metrics from allocations → You understand Nomad telemetry
- You implement target-value scaling → You understand control loops
- You handle cooldown correctly → You understand stabilization
- You add cluster autoscaling → You understand infrastructure automation
Project 7: Custom Task Driver Plugin
- File: LEARN_NOMAD_DEEP_DIVE.md
- Main Programming Language: Go
- Alternative Programming Languages: N/A (must be Go for plugins)
- Coolness Level: Level 4: Hardcore Tech Flex
- Business Potential: 4. The “Open Core” Infrastructure
- Difficulty: Level 3: Advanced
- Knowledge Area: Plugin System / Task Isolation / Go-Plugin
- Software or Tool: Nomad Plugin SDK
- Main Book: “Go Programming Language” by Donovan & Kernighan
What you’ll build: A custom task driver that runs workloads in a way not supported by built-in drivers—for example, running WASM modules, systemd units, Firecracker microVMs, or Podman containers.
Why it teaches Nomad: Task drivers are how Nomad executes workloads. Building one teaches you the plugin architecture, how Nomad tracks task state, handles resource isolation, and manages the task lifecycle.
Core challenges you’ll face:
- Implementing the DriverPlugin interface → maps to understanding the plugin contract
- Fingerprinting node capabilities → maps to what the driver can do
- Managing task state → maps to starting, stopping, recovering tasks
- Resource isolation → maps to cgroups, namespaces, security
Key Concepts:
- Task Driver SDK: Create Task Driver Plugin
- go-plugin: HashiCorp go-plugin
- Driver Interface: Browse the Nomad source drivers/ directory
- exec2 driver: Example Implementation
Difficulty: Advanced Time estimate: 2-3 weeks Prerequisites: Projects 1-6 completed. Strong Go proficiency. Understanding of process isolation.
Real world outcome:
# Your custom WASM driver in action:
$ cat wasm-job.nomad
job "wasm-app" {
group "app" {
task "handler" {
driver = "wasm" # Your custom driver!
config {
module = "https://example.com/handler.wasm"
runtime = "wasmtime"
# WASI capabilities
allow_fs = false
allow_net = true
}
resources {
cpu = 100
memory = 128
}
}
}
}
$ nomad run wasm-job.nomad
$ nomad alloc logs abc123
[wasm-driver] Loading module from https://example.com/handler.wasm
[wasm-driver] Runtime: wasmtime v14.0.0
[wasm-driver] Instantiating module...
[handler] Listening on :8080
[handler] Received request: GET /
Implementation Hints:
Basic driver structure:
package main
import (
"context"
"fmt"
"time"
"github.com/hashicorp/go-hclog"
"github.com/hashicorp/nomad/drivers/shared/eventer"
"github.com/hashicorp/nomad/plugins/base"
"github.com/hashicorp/nomad/plugins/drivers"
"github.com/hashicorp/nomad/plugins/shared/hclspec"
pstructs "github.com/hashicorp/nomad/plugins/shared/structs"
)
const pluginName = "wasm"
// WasmDriver implements drivers.DriverPlugin
type WasmDriver struct {
eventer *eventer.Eventer
config *Config
logger hclog.Logger
// Track running tasks
tasks *taskStore
}
// PluginInfo returns information about the plugin
func (d *WasmDriver) PluginInfo() (*base.PluginInfoResponse, error) {
return &base.PluginInfoResponse{
Type: base.PluginTypeDriver,
PluginApiVersions: []string{drivers.ApiVersion010},
PluginVersion: "0.1.0",
Name: pluginName,
}, nil
}
// ConfigSchema returns the schema for driver config
func (d *WasmDriver) ConfigSchema() (*hclspec.Spec, error) {
return hclspec.NewObject(map[string]*hclspec.Spec{
"runtime_path": hclspec.NewDefault(
hclspec.NewAttr("runtime_path", "string", false),
hclspec.NewLiteral(`"/usr/local/bin/wasmtime"`),
),
}), nil
}
// TaskConfigSchema returns the schema for task config
func (d *WasmDriver) TaskConfigSchema() (*hclspec.Spec, error) {
return hclspec.NewObject(map[string]*hclspec.Spec{
"module": hclspec.NewAttr("module", "string", true),
"runtime": hclspec.NewAttr("runtime", "string", false),
"allow_fs": hclspec.NewAttr("allow_fs", "bool", false),
"allow_net": hclspec.NewAttr("allow_net", "bool", false),
}), nil
}
// Fingerprint returns node capabilities
func (d *WasmDriver) Fingerprint(ctx context.Context) (<-chan *drivers.Fingerprint, error) {
ch := make(chan *drivers.Fingerprint)
go func() {
defer close(ch)
fp := &drivers.Fingerprint{
Attributes: map[string]*pstructs.Attribute{
"driver.wasm.version": pstructs.NewStringAttribute("0.1.0"),
},
Health: drivers.HealthStateHealthy,
HealthDescription: "WASM runtime available",
}
// Check if wasmtime is installed
if !d.isRuntimeAvailable() {
fp.Health = drivers.HealthStateUndetected
fp.HealthDescription = "wasmtime not found"
}
ch <- fp
}()
return ch, nil
}
// StartTask launches the WASM module
func (d *WasmDriver) StartTask(cfg *drivers.TaskConfig) (*drivers.TaskHandle, *drivers.DriverNetwork, error) {
var taskConfig TaskConfig
if err := cfg.DecodeDriverConfig(&taskConfig); err != nil {
return nil, nil, err
}
d.logger.Info("starting wasm task", "module", taskConfig.Module)
// Download module, start wasmtime process
handle, err := d.startWasmProcess(cfg, &taskConfig)
if err != nil {
return nil, nil, err
}
return handle, nil, nil
}
// StopTask stops a running task
func (d *WasmDriver) StopTask(taskID string, timeout time.Duration, signal string) error {
handle, ok := d.tasks.Get(taskID)
if !ok {
return drivers.ErrTaskNotFound
}
// Send signal to process
return handle.Kill(signal, timeout)
}
// WaitTask blocks until task exits
func (d *WasmDriver) WaitTask(ctx context.Context, taskID string) (<-chan *drivers.ExitResult, error) {
handle, ok := d.tasks.Get(taskID)
if !ok {
return nil, drivers.ErrTaskNotFound
}
return handle.WaitCh(), nil
}
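The interface also requires RecoverTask, which Nomad calls after a client or driver restart so running workloads are re-attached instead of killed. A sketch, assuming StartTask serialized a taskState (e.g. the runtime PID) into the handle, and a hypothetical reattach helper:
// RecoverTask re-attaches to a task that survived a driver restart
func (d *WasmDriver) RecoverTask(handle *drivers.TaskHandle) error {
if handle == nil {
return fmt.Errorf("handle cannot be nil")
}
// Already tracked - nothing to do
if _, ok := d.tasks.Get(handle.Config.ID); ok {
return nil
}
// Decode the state stored by StartTask (taskState is our own type)
var state taskState
if err := handle.GetDriverState(&state); err != nil {
return fmt.Errorf("failed to decode task state: %v", err)
}
// Re-attach to the running process and resume tracking it
return d.reattach(handle.Config.ID, &state)
}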
Main function with plugin serving:
func main() {
// Serve the plugin over go-plugin's RPC protocol
// (plugins here is github.com/hashicorp/nomad/plugins)
plugins.Serve(factory)
}
func factory(log hclog.Logger) interface{} {
return &WasmDriver{
eventer: eventer.NewEventer(context.Background(), log),
logger: log.Named(pluginName),
tasks: newTaskStore(),
}
}
Learning milestones:
- You implement Fingerprint → You understand capability detection
- You implement StartTask → You understand task lifecycle
- You handle recovery after restarts → You understand driver state
- Tasks run with isolation → You understand resource control
Project 8: Raft Consensus Simulator
- File: LEARN_NOMAD_DEEP_DIVE.md
- Main Programming Language: Go
- Alternative Programming Languages: Python, Rust
- Coolness Level: Level 4: Hardcore Tech Flex
- Business Potential: 1. The “Resume Gold”
- Difficulty: Level 4: Expert
- Knowledge Area: Distributed Systems / Consensus / Raft
- Software or Tool: Custom implementation
- Main Book: “Designing Data-Intensive Applications” by Martin Kleppmann
What you’ll build: A visual Raft consensus simulator that demonstrates leader election, log replication, and fault tolerance—showing exactly how Nomad servers coordinate.
Why it teaches Nomad: Nomad uses Raft for all state management. Understanding Raft means understanding how Nomad maintains consistency, handles leader failures, and ensures that scheduling decisions are durable.
Core challenges you’ll face:
- Implementing leader election → maps to terms, votes, timeouts
- Log replication → maps to AppendEntries RPC
- Handling network partitions → maps to split-brain scenarios
- State machine application → maps to Nomad’s FSM
Key Concepts:
- Raft Paper: In Search of an Understandable Consensus Algorithm
- Nomad Consensus: Consensus Protocol
- HashiCorp Raft: hashicorp/raft
- Raft Visualization: The Secret Lives of Data
Difficulty: Expert Time estimate: 3-4 weeks Prerequisites: Projects 1-7 completed. Strong understanding of distributed systems.
Real world outcome:
┌─────────────────────────────────────────────────────────────────┐
│ Raft Consensus Simulator │
├─────────────────────────────────────────────────────────────────┤
│ │
│ Term: 5 │
│ │
│ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │
│ │ Node 1 │ │ Node 2 │ │ Node 3 │ │
│ │ LEADER │ │ FOLLOWER │ │ FOLLOWER │ │
│ │ Term: 5 │ │ Term: 5 │ │ Term: 5 │ │
│ │ │ │ │ │ │ │
│ │ Log: │ │ Log: │ │ Log: │ │
│ │ [1] x=1 ✓ │ │ [1] x=1 ✓ │ │ [1] x=1 ✓ │ │
│ │ [2] y=2 ✓ │ │ [2] y=2 ✓ │ │ [2] y=2 ✓ │ │
│ │ [3] z=3 │◄──►│ [3] z=3 │◄──►│ [3] z=3 │ │
│ │ ↑ commit │ │ │ │ │ │
│ └─────────────┘ └─────────────┘ └─────────────┘ │
│ │
│ ┌─ Event Log ─────────────────────────────────────────────────┐│
│ │ 10:00:00.100 Node1: Became leader for term 5 ││
│ │ 10:00:00.150 Node1: Received client request: SET z=3 ││
│ │ 10:00:00.200 Node1→Node2: AppendEntries(term=5, entry=z=3) ││
│ │ 10:00:00.200 Node1→Node3: AppendEntries(term=5, entry=z=3) ││
│ │ 10:00:00.250 Node2→Node1: AppendEntriesResp(success=true) ││
│ │ 10:00:00.260 Node3→Node1: AppendEntriesResp(success=true) ││
│ │ 10:00:00.300 Node1: Entry z=3 committed (majority ack) ││
│ │ 10:00:00.350 Node1: Applied z=3 to state machine ││
│ └─────────────────────────────────────────────────────────────┘│
│ │
│ [Partition Node 2] [Kill Leader] [Add Entry] [Step] │
│ │
└─────────────────────────────────────────────────────────────────┘
Implementation Hints:
Raft node state:
type NodeState int
const (
Follower NodeState = iota
Candidate
Leader
)
type RaftNode struct {
id int
state NodeState
term int
// Persistent state (term, votedFor, and log survive restarts)
log []LogEntry
votedFor int
// Volatile state
commitIndex int
lastApplied int
// Candidate state
votesReceived int
// Leader state (reinitialized after election)
nextIndex map[int]int
matchIndex map[int]int
// Timing
electionTimeout time.Duration
lastHeartbeat time.Time
// Communication
peers []int
inbox chan Message
outbox chan Message
}
type LogEntry struct {
Term int
Index int
Command interface{}
}
type Message struct {
From int
To int
Type MessageType
Term int
Payload interface{}
}
Leader election:
func (n *RaftNode) becomeCandidate() {
n.state = Candidate
n.term++
n.votedFor = n.id
n.votesReceived = 1 // Vote for self
// Last log entry info (guard against an empty log)
lastIndex := len(n.log) - 1
lastTerm := 0
if lastIndex >= 0 {
lastTerm = n.log[lastIndex].Term
}
// Request votes from all peers
for _, peer := range n.peers {
n.send(Message{
From: n.id,
To: peer,
Type: RequestVote,
Term: n.term,
Payload: RequestVoteArgs{
CandidateID: n.id,
LastLogIndex: lastIndex,
LastLogTerm: lastTerm,
},
})
}
n.resetElectionTimer()
}
func (n *RaftNode) handleRequestVote(msg Message) {
args := msg.Payload.(RequestVoteArgs)
// Deny if the candidate's term is stale
if msg.Term < n.term {
n.sendVoteReply(msg.From, false)
return
}
// A newer term forces us back to follower state
if msg.Term > n.term {
n.term = msg.Term
n.state = Follower
n.votedFor = -1
}
// Grant the vote only if we haven't voted this term and the
// candidate's log is at least as up-to-date as ours
if n.votedFor == -1 || n.votedFor == args.CandidateID {
if n.isLogUpToDate(args.LastLogIndex, args.LastLogTerm) {
n.votedFor = args.CandidateID
n.sendVoteReply(msg.From, true)
n.resetElectionTimer()
return
}
}
n.sendVoteReply(msg.From, false)
}
Log replication:
func (n *RaftNode) appendEntry(command interface{}) {
if n.state != Leader {
return
}
entry := LogEntry{
Term: n.term,
Index: len(n.log),
Command: command,
}
n.log = append(n.log, entry)
// Replicate to followers
n.replicateLog()
}
func (n *RaftNode) replicateLog() {
for _, peer := range n.peers {
nextIdx := n.nextIndex[peer]
entries := n.log[nextIdx:]
// Term of the entry preceding the new batch (0 if none)
prevTerm := 0
if nextIdx > 0 {
prevTerm = n.log[nextIdx-1].Term
}
n.send(Message{
From: n.id,
To: peer,
Type: AppendEntries,
Term: n.term,
Payload: AppendEntriesArgs{
LeaderID: n.id,
PrevLogIndex: nextIdx - 1,
PrevLogTerm: prevTerm,
Entries: entries,
LeaderCommit: n.commitIndex,
},
})
}
}
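The snippets above stop short of commitment: the leader may only commit an entry once a majority of nodes hold it. A sketch of advancing the commit index from follower replies (an AppendEntriesResp type with Success/MatchIndex fields is assumed):
func (n *RaftNode) handleAppendEntriesResponse(msg Message) {
resp := msg.Payload.(AppendEntriesResp)
if !resp.Success {
// Follower's log diverged - back up and retry
n.nextIndex[msg.From]--
return
}
n.matchIndex[msg.From] = resp.MatchIndex
n.nextIndex[msg.From] = resp.MatchIndex + 1
// Find the highest index replicated on a majority of nodes
for idx := len(n.log) - 1; idx > n.commitIndex; idx-- {
count := 1 // The leader holds the entry
for _, peer := range n.peers {
if n.matchIndex[peer] >= idx {
count++
}
}
// Raft only commits current-term entries by counting replicas
if count > (len(n.peers)+1)/2 && n.log[idx].Term == n.term {
n.commitIndex = idx
break
}
}
}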
Learning milestones:
- Leader election works → You understand Raft elections
- Logs replicate correctly → You understand AppendEntries
- Network partitions handled → You understand split-brain prevention
- State machine applies → You understand FSM in consensus
Project 9: Gossip Protocol Implementation
- File: LEARN_NOMAD_DEEP_DIVE.md
- Main Programming Language: Go
- Alternative Programming Languages: Python, Rust
- Coolness Level: Level 4: Hardcore Tech Flex
- Business Potential: 1. The “Resume Gold”
- Difficulty: Level 3: Advanced
- Knowledge Area: Distributed Systems / Membership / SWIM
- Software or Tool: Custom implementation (inspired by Serf)
- Main Book: “Designing Data-Intensive Applications” by Martin Kleppmann
What you’ll build: A gossip-based membership protocol implementing SWIM (Scalable Weakly-consistent Infection-style Process Group Membership), which is what Nomad uses for cluster discovery.
Why it teaches Nomad: Before Raft can elect a leader, servers need to discover each other. Serf’s gossip protocol handles this. Understanding gossip means understanding how Nomad clusters form and how failure detection works.
Core challenges you’ll face:
- Implementing gossip dissemination → maps to rumor spreading
- Failure detection → maps to ping, ping-req, suspect, dead
- Handling network partitions → maps to partition healing
- Scalability → maps to O(log N) message complexity
Key Concepts:
- SWIM Paper: SWIM: Scalable Weakly-consistent Infection-style Process Group Membership Protocol
- Serf: HashiCorp Serf
- Nomad Gossip: Gossip Protocol
- memberlist: hashicorp/memberlist
Difficulty: Advanced Time estimate: 2-3 weeks Prerequisites: Projects 1-8 completed. Understanding of networking and UDP.
Real world outcome:
┌─────────────────────────────────────────────────────────────────┐
│ Gossip Protocol Simulator │
├─────────────────────────────────────────────────────────────────┤
│ │
│ Cluster Members: │
│ ┌────────────────────────────────────────────────────────────┐ │
│ │ Node │ State │ Last Seen │ Incarnation │ Messages │ │
│ ├─────────┼─────────┼───────────┼─────────────┼──────────────┤ │
│ │ node-1 │ ALIVE │ 100ms │ 5 │ ████████ │ │
│ │ node-2 │ ALIVE │ 50ms │ 3 │ ██████ │ │
│ │ node-3 │ SUSPECT │ 2.5s │ 7 │ ██ │ │
│ │ node-4 │ ALIVE │ 200ms │ 2 │ ████ │ │
│ │ node-5 │ DEAD │ 30s │ 1 │ │ │
│ └────────────────────────────────────────────────────────────┘ │
│ │
│ Message Flow: │
│ ┌────────────────────────────────────────────────────────────┐ │
│ │ 10:00:00.100 node-1 → node-2: PING │ │
│ │ 10:00:00.110 node-2 → node-1: ACK │ │
│ │ 10:00:00.200 node-1 → node-4: PING (piggyback: node-3 sus) │ │
│ │ 10:00:00.210 node-4 → node-1: ACK │ │
│ │ 10:00:00.300 node-2 → node-3: PING │ │
│ │ 10:00:00.800 node-2: No ACK from node-3, starting PING-REQ │ │
│ │ 10:00:00.810 node-2 → node-4: PING-REQ(node-3) │ │
│ │ 10:00:00.820 node-4 → node-3: PING │ │
│ │ 10:00:01.320 node-4: No ACK from node-3 │ │
│ │ 10:00:01.330 node-4 → node-2: NACK(node-3) │ │
│ │ 10:00:01.340 node-2: Marking node-3 as SUSPECT │ │
│ └────────────────────────────────────────────────────────────┘ │
│ │
│ [Kill Node] [Partition] [Rejoin] [Show Convergence] │
│ │
└─────────────────────────────────────────────────────────────────┘
Implementation Hints:
SWIM node structure:
type NodeState int
const (
Alive NodeState = iota
Suspect
Dead
)
type Member struct {
ID string
Addr string
State NodeState
Incarnation int
LastUpdate time.Time
}
type GossipNode struct {
self Member
members map[string]*Member
// SWIM parameters
probeInterval time.Duration
probeTimeout time.Duration
suspectTimeout time.Duration
indirectNodes int
// Gossip queue for piggybacking
gossipQueue []GossipMessage
// Response channels fed by the UDP read loop
ackChan chan AckMessage
pingReqRespChan chan PingReqResponse
// UDP socket
conn *net.UDPConn
}
type AckMessage struct {
From string
}
type PingReqResponse struct {
Target string
Ack bool
}
type GossipMessage struct {
Type MessageType
Member Member
Incarnation int
}
type MessageType int
const (
Ping MessageType = iota
Ack
PingReq
// Msg* prefixes avoid colliding with the NodeState constants above
MsgAlive
MsgSuspect
MsgDead
)
SWIM probe protocol:
func (n *GossipNode) runProbeLoop() {
ticker := time.NewTicker(n.probeInterval)
for range ticker.C {
// Select random member to probe
target := n.randomMember()
if target == nil {
continue
}
// Direct probe
ack := n.probe(target)
if ack {
continue
}
// No direct ACK - try indirect probes
indirectAck := n.indirectProbe(target)
if indirectAck {
continue
}
// No indirect ACK - mark as suspect
n.suspect(target)
}
}
func (n *GossipNode) probe(target *Member) bool {
// Send PING with piggybacked gossip
msg := n.buildPingMessage(target.Addr)
n.send(msg)
// Wait for ACK
select {
case ack := <-n.ackChan:
if ack.From == target.ID {
return true
}
case <-time.After(n.probeTimeout):
return false
}
return false
}
func (n *GossipNode) indirectProbe(target *Member) bool {
// Select k random members to probe on our behalf
helpers := n.randomMembers(n.indirectNodes)
for _, helper := range helpers {
n.send(PingReqMessage{
To: helper.Addr,
Target: target.Addr,
})
}
// Wait for any positive response
timeout := time.After(n.probeTimeout * 2)
for i := 0; i < len(helpers); i++ {
select {
case resp := <-n.pingReqRespChan:
if resp.Target == target.ID && resp.Ack {
return true
}
case <-timeout:
return false
}
}
return false
}
Gossip dissemination:
func (n *GossipNode) gossip(msg GossipMessage) {
// Add to gossip queue
n.gossipQueue = append(n.gossipQueue, msg)
// Piggyback on next probe messages
// SWIM piggybacks gossip on PING/ACK for efficiency
}
func (n *GossipNode) handleAlive(msg GossipMessage) {
member, exists := n.members[msg.Member.ID]
if !exists {
// New member joined
n.members[msg.Member.ID] = &msg.Member
n.gossip(msg)
return
}
// Only update if incarnation is higher
if msg.Incarnation > member.Incarnation {
member.State = Alive
member.Incarnation = msg.Incarnation
n.gossip(msg)
}
}
func (n *GossipNode) refute() {
// When we hear we're suspected, refute with a higher incarnation
n.self.Incarnation++
n.gossip(GossipMessage{
Type: MsgAlive,
Member: n.self,
Incarnation: n.self.Incarnation,
})
}
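The remaining transition is suspect → dead: a suspected member gets a grace period to refute before being declared dead. A sketch of that timer (a real implementation needs locking around member state):
func (n *GossipNode) suspect(target *Member) {
target.State = Suspect
n.gossip(GossipMessage{
Type: MsgSuspect,
Member: *target,
Incarnation: target.Incarnation,
})
// The member has suspectTimeout to refute with a higher incarnation
time.AfterFunc(n.suspectTimeout, func() {
if target.State == Suspect {
target.State = Dead
n.gossip(GossipMessage{
Type: MsgDead,
Member: *target,
Incarnation: target.Incarnation,
})
}
})
}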
Learning milestones:
- Members discover each other → You understand gossip joining
- Failures are detected → You understand SWIM probing
- Suspicion and death work → You understand failure state machine
- Gossip converges quickly → You understand epidemic dissemination
Project 10: Mini-Scheduler Implementation
- File: LEARN_NOMAD_DEEP_DIVE.md
- Main Programming Language: Go
- Alternative Programming Languages: Rust, Python
- Coolness Level: Level 5: Pure Magic (Super Cool)
- Business Potential: 4. The “Open Core” Infrastructure
- Difficulty: Level 4: Expert
- Knowledge Area: Scheduling / Bin Packing / Constraint Satisfaction
- Software or Tool: Custom implementation
- Main Book: “Scheduling: Theory, Algorithms, and Systems” by Michael Pinedo
What you’ll build: Your own job scheduler from scratch that handles job registration, evaluation creation, feasibility checking, bin-packing placement, and plan application—essentially a mini-Nomad scheduler.
Why it teaches Nomad: The scheduler is the heart of Nomad. Building one teaches you constraint satisfaction, bin packing algorithms, optimistic concurrency, and the evaluation-plan-allocation pipeline.
Core challenges you’ll face:
- Evaluation pipeline → maps to how scheduling work is queued
- Feasibility checking → maps to constraint filtering
- Bin packing algorithm → maps to node scoring and ranking
- Plan queue and conflict resolution → maps to optimistic concurrency
Key Concepts:
- Scheduling Internals: How Scheduling Works
- Google Borg: Large-scale cluster management at Google with Borg
- Google Omega: Omega: flexible, scalable schedulers
- Bin Packing: First Fit Decreasing, Best Fit algorithms
Difficulty: Expert Time estimate: 1 month Prerequisites: Projects 1-9 completed. Strong algorithm skills.
Real world outcome:
┌─────────────────────────────────────────────────────────────────┐
│ Mini-Scheduler │
├─────────────────────────────────────────────────────────────────┤
│ │
│ Job: api-service │
│ Task Group: api (count: 5) │
│ Requirements: cpu=500, memory=1024, constraint[instance=large] │
│ │
│ ┌─ Scheduling Trace ──────────────────────────────────────────┐ │
│ │ │ │
│ │ Phase 1: Feasibility │ │
│ │ node-1: PASS (large, cpu=2000 avail, mem=4096 avail) │ │
│ │ node-2: PASS (large, cpu=1500 avail, mem=3072 avail) │ │
│ │ node-3: FAIL (small - constraint mismatch) │ │
│ │ node-4: PASS (large, cpu=2000 avail, mem=2048 avail) │ │
│ │ node-5: FAIL (insufficient memory: 512 avail) │ │
│ │ │ │
│ │ Phase 2: Ranking (Bin Pack + Anti-Affinity) │ │
│ │ node-1: binpack=0.75, anti_affinity=1.00, score=0.875 │ │
│ │ node-2: binpack=0.85, anti_affinity=1.00, score=0.925 │ │
│ │ node-4: binpack=0.60, anti_affinity=1.00, score=0.800 │ │
│ │ │ │
│ │ Phase 3: Placement │ │
│ │ Allocation 1 → node-2 (highest score) │ │
│ │ Allocation 2 → node-1 (anti-affinity reduces node-2) │ │
│ │ Allocation 3 → node-4 (spreading load) │ │
│ │ Allocation 4 → node-2 (resources still fit) │ │
│ │ Allocation 5 → node-1 (resources still fit) │ │
│ │ │ │
│ │ Phase 4: Plan Applied Successfully │ │
│ │ 5 allocations created │ │
│ │ │ │
│ └──────────────────────────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────┘
Implementation Hints:
Core data structures:
type Job struct {
ID string
Name string
Type string // service, batch, system
Priority int
TaskGroups []*TaskGroup
}
type TaskGroup struct {
Name string
Count int
Constraints []*Constraint
Tasks []*Task
}
type Task struct {
Name string
Driver string
Resources *Resources
}
type Resources struct {
CPU int // MHz
Memory int // MB
}
type Constraint struct {
Attribute string
Operator string // =, !=, >, <, in, not_in
Value string
}
type Node struct {
ID string
Datacenter string
Attributes map[string]string
Resources *Resources
Allocated *Resources
}
type Allocation struct {
ID string
JobID string
TaskGroup string
NodeID string
DesiredState string
Resources *Resources
}
type Evaluation struct {
ID string
JobID string
TriggeredBy string
Status string
}
Scheduler implementation:
type Scheduler struct {
jobs map[string]*Job
nodes map[string]*Node
allocations map[string]*Allocation
evalBroker *EvalBroker
planQueue *PlanQueue
}
func (s *Scheduler) Schedule(eval *Evaluation) (*Plan, error) {
job := s.jobs[eval.JobID]
plan := &Plan{
EvalID: eval.ID,
Allocations: make([]*Allocation, 0),
}
for _, tg := range job.TaskGroups {
// Current allocations for this task group
current := s.allocationsFor(job.ID, tg.Name)
desired := tg.Count
// Need to place (desired - current) new allocations
toPlace := desired - len(current)
for i := 0; i < toPlace; i++ {
// Phase 1: Feasibility
feasible := s.feasibleNodes(tg)
if len(feasible) == 0 {
// No feasible nodes - evaluation blocked
return nil, ErrNoFeasibleNodes
}
// Phase 2: Ranking
ranked := s.rankNodes(feasible, tg, plan)
// Phase 3: Place on best node
best := ranked[0]
alloc := s.createAllocation(job, tg, best.Node)
plan.Allocations = append(plan.Allocations, alloc)
// Update our view for subsequent placements
s.reserveResources(best.Node, tg)
}
}
return plan, nil
}
func (s *Scheduler) feasibleNodes(tg *TaskGroup) []*Node {
feasible := make([]*Node, 0)
for _, node := range s.nodes {
if s.checkConstraints(node, tg.Constraints) &&
s.checkResources(node, tg) {
feasible = append(feasible, node)
}
}
return feasible
}
func (s *Scheduler) checkConstraints(node *Node, constraints []*Constraint) bool {
for _, c := range constraints {
value := node.Attributes[c.Attribute]
switch c.Operator {
case "=":
if value != c.Value {
return false
}
case "!=":
if value == c.Value {
return false
}
// ... more operators
}
}
return true
}
Bin packing scorer:
type ScoredNode struct {
Node *Node
Score float64
}
func (s *Scheduler) rankNodes(nodes []*Node, tg *TaskGroup, plan *Plan) []*ScoredNode {
scored := make([]*ScoredNode, 0, len(nodes))
for _, node := range nodes {
score := s.calculateScore(node, tg, plan)
scored = append(scored, &ScoredNode{Node: node, Score: score})
}
// Sort by score descending
sort.Slice(scored, func(i, j int) bool {
return scored[i].Score > scored[j].Score
})
return scored
}
func (s *Scheduler) calculateScore(node *Node, tg *TaskGroup, plan *Plan) float64 {
// Bin packing score: prefer nodes with less available resources
// (pack jobs tightly together)
binpackScore := s.binpackScore(node, tg)
// Anti-affinity score: prefer nodes without other allocations
// of the same job (spread instances out)
antiAffinityScore := s.antiAffinityScore(node, tg, plan)
// Combine scores
return (binpackScore * 0.5) + (antiAffinityScore * 0.5)
}
func (s *Scheduler) binpackScore(node *Node, tg *TaskGroup) float64 {
// Sum requirements across the group's tasks (TaskGroup has no Resources field)
need := &Resources{}
for _, t := range tg.Tasks {
need.CPU += t.Resources.CPU
need.Memory += t.Resources.Memory
}
// Percentage of node resources used after placement
cpuAfter := float64(node.Allocated.CPU+need.CPU) / float64(node.Resources.CPU)
memAfter := float64(node.Allocated.Memory+need.Memory) / float64(node.Resources.Memory)
// Higher utilization = higher score (bin packing)
return (cpuAfter + memAfter) / 2.0
}
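A matching sketch for the anti-affinity half of calculateScore, counting instances of the same task group already on (or planned for) the node:
func (s *Scheduler) antiAffinityScore(node *Node, tg *TaskGroup, plan *Plan) float64 {
count := 0
// Existing allocations on this node for the same task group
for _, alloc := range s.allocations {
if alloc.NodeID == node.ID && alloc.TaskGroup == tg.Name {
count++
}
}
// Placements already made earlier in this plan
for _, alloc := range plan.Allocations {
if alloc.NodeID == node.ID && alloc.TaskGroup == tg.Name {
count++
}
}
// 1.0 for an empty node, decaying toward 0 as instances pile up
return 1.0 / float64(count+1)
}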
Learning milestones:
- Feasibility filtering works → You understand constraint checking
- Bin packing places efficiently → You understand scoring algorithms
- Anti-affinity spreads allocations → You understand fault tolerance
- Plan queue handles conflicts → You understand optimistic concurrency
Project 11: Multi-Region Federation Lab
- File: LEARN_NOMAD_DEEP_DIVE.md
- Main Programming Language: HCL + Terraform
- Alternative Programming Languages: Bash, Go
- Coolness Level: Level 3: Genuinely Clever
- Business Potential: 3. The “Service & Support” Model
- Difficulty: Level 3: Advanced
- Knowledge Area: Multi-Region / WAN Federation / Disaster Recovery
- Software or Tool: Nomad + Consul + Terraform
- Main Book: “Site Reliability Engineering” by Google
What you’ll build: A multi-region Nomad deployment with federated clusters, cross-region job deployment, and disaster recovery failover—simulating a real global infrastructure.
Why it teaches Nomad: Production Nomad often spans multiple regions. Understanding federation teaches you WAN gossip, authoritative regions, cross-region scheduling, and how to design for geographic redundancy.
Core challenges you’ll face:
- Setting up WAN federation → maps to gossip across regions
- Cross-region job deployment → maps to multiregion job specification
- Authoritative region for ACLs → maps to single source of truth
- Failover and recovery → maps to region outage handling
Key Concepts:
- Federation: Federate Multi-Region Clusters
- Multiregion Jobs: multiregion block
- WAN Gossip: Gossip Protocol
- Authoritative Region: ACL Replication
Difficulty: Advanced Time estimate: 2 weeks Prerequisites: Projects 1-10 completed. Terraform experience helpful. Cloud account (AWS/GCP/Azure) or sufficient local resources.
Real world outcome:
# Two regions: us-east and eu-west
$ nomad server members -region=us-east
Name Address Port Status Leader Region
server-1.us-east.dc1 10.0.1.11 4648 alive true us-east
server-2.us-east.dc1 10.0.1.12 4648 alive false us-east
server-3.us-east.dc1 10.0.1.13 4648 alive false us-east
$ nomad server members -region=eu-west
Name Address Port Status Leader Region
server-1.eu-west.dc1 10.1.1.11 4648 alive true eu-west
server-2.eu-west.dc1 10.1.1.12 4648 alive false eu-west
server-3.eu-west.dc1 10.1.1.13 4648 alive false eu-west
# Federated - servers know about each other:
$ nomad server members -region=us-east
Name Address Port Status Leader Region
server-1.us-east.dc1 10.0.1.11 4648 alive true us-east
server-2.us-east.dc1 10.0.1.12 4648 alive false us-east
server-3.us-east.dc1 10.0.1.13 4648 alive false us-east
server-1.eu-west.dc1 10.1.1.11 4648 alive true eu-west (WAN)
server-2.eu-west.dc1 10.1.1.12 4648 alive false eu-west (WAN)
server-3.eu-west.dc1 10.1.1.13 4648 alive false eu-west (WAN)
# Deploy multiregion job:
$ nomad run global-api.nomad
==> Monitoring multiregion deployment...
Region "us-east": 3/3 allocations healthy
Region "eu-west": 3/3 allocations healthy
Multiregion deployment successful!
Implementation Hints:
Multiregion job specification:
job "global-api" {
type = "service"
multiregion {
strategy {
max_parallel = 1
on_failure = "fail_all"
}
region "us-east" {
count = 3
datacenters = ["us-east-1a", "us-east-1b"]
}
region "eu-west" {
count = 3
datacenters = ["eu-west-1a", "eu-west-1b"]
}
}
group "api" {
# count is set per-region above
network {
port "http" { to = 8080 }
}
task "api" {
driver = "docker"
config {
image = "myapp/api:v2"
ports = ["http"]
}
env {
REGION = "${NOMAD_REGION}"
DC = "${NOMAD_DC}"
}
}
}
}
Federation setup:
# us-east server config
server {
enabled = true
bootstrap_expect = 3
authoritative_region = "us-east" # This region is the source of truth for ACLs
}
# Enable WAN gossip encryption
server_join {
retry_join = ["provider=aws tag_key=nomad-server tag_value=true"]
}
# eu-west server config
server {
enabled = true
bootstrap_expect = 3
authoritative_region = "us-east" # Points to us-east
}
# Join to us-east for federation
server_join {
retry_join = ["provider=aws tag_key=nomad-server tag_value=true region=us-east-1"]
}
Terraform for multi-region:
# Root main.tf (one module instance per region)
module "us_east" {
source = "./modules/nomad-region"
region = "us-east-1"
servers = 3
clients = 5
server_ips = var.us_east_server_ips
}
module "eu_west" {
source = "./modules/nomad-region"
region = "eu-west-1"
servers = 3
clients = 5
server_ips = var.eu_west_server_ips
# Join to us-east for federation
wan_join = module.us_east.server_ips
}
Learning milestones:
- Regions discover each other → You understand WAN gossip
- Multiregion jobs deploy → You understand coordinated deployment
- ACLs replicate from authoritative region → You understand central control
- Failover works → You understand disaster recovery
Project 12: Nomad Source Code Deep Dive
- File: LEARN_NOMAD_DEEP_DIVE.md
- Main Programming Language: Go
- Alternative Programming Languages: N/A
- Coolness Level: Level 4: Hardcore Tech Flex
- Business Potential: 1. The “Resume Gold”
- Difficulty: Level 4: Expert
- Knowledge Area: Source Code / Internal Architecture
- Software or Tool: Nomad source code
- Main Book: “The Go Programming Language” by Donovan & Kernighan
What you’ll build: A documented walkthrough of the Nomad source code, tracing a job from submission through evaluation, scheduling, and allocation—with annotations and diagrams.
Why it teaches Nomad: Reading the source code is the ultimate way to understand how Nomad works. This project walks you through the actual implementation of the concepts you’ve learned, showing how theory translates to production code.
Core challenges you’ll face:
- Navigating the codebase → maps to understanding package structure
- Tracing RPC flows → maps to how requests move through the system
- Understanding the FSM → maps to state machine operations
- Following the scheduler → maps to actual bin-packing code
Key Concepts:
- Source Code: github.com/hashicorp/nomad
- Contributing Guide: Contributing to Nomad
- Architecture Docs: Architecture in Source
Difficulty: Expert. Time estimate: 2-3 weeks. Prerequisites: All previous projects. Strong Go proficiency. Ability to read complex codebases.
Real world outcome:
# Your documented walkthrough:
docs/nomad-source-walkthrough/
├── 00-overview.md # Package structure and architecture
├── 01-job-submission.md # nomad/job_endpoint.go
├── 02-evaluation-broker.md # nomad/eval_broker.go
├── 03-scheduler-workers.md # scheduler/generic_sched.go
├── 04-feasibility.md # scheduler/feasible.go
├── 05-ranking.md # scheduler/rank.go
├── 06-plan-queue.md # nomad/plan_queue.go
├── 07-state-store.md # nomad/state/state_store.go
├── 08-client-alloc.md # client/allocrunner/
├── diagrams/
│ ├── job-flow.png
│ ├── scheduler-internals.png
│ └── raft-fsm.png
└── README.md
Implementation Hints:
Key directories to explore:
nomad/
├── command/ # CLI commands
├── api/ # Go client library
├── nomad/ # Server-side code
│ ├── job_endpoint.go # Job RPC handlers
│ ├── eval_broker.go # Evaluation queue
│ ├── plan_queue.go # Plan application
│ ├── fsm.go # Raft state machine
│   └── state/            # State store (go-memdb, in-memory; rebuilt from Raft)
├── scheduler/ # Scheduling logic
│ ├── generic_sched.go # Service/batch scheduler
│ ├── system_sched.go # System scheduler
│ ├── feasible.go # Constraint checking
│ └── rank.go # Node scoring
├── client/ # Client-side code
│ ├── client.go # Main client
│ ├── allocrunner/ # Allocation execution
│ └── fingerprint/ # Node fingerprinting
└── drivers/ # Task drivers
├── docker/
├── exec/
└── rawexec/
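Before diving into feasible.go and rank.go, it helps to hold a toy model of the two-phase placement they implement. The sketch below is illustrative only (invented structs and formulas, not Nomad's code): feasibility applies hard filters first, then ranking scores the survivors, with bin packing favoring nodes that end up more utilized.

```go
package main

import (
	"fmt"
	"math"
	"sort"
)

// Node is a toy stand-in for a fingerprinted client node.
type Node struct {
	Name             string
	CPUFree, MemFree int // MHz / MB remaining
	HasDocker        bool
}

// Ask is a toy resource request.
type Ask struct {
	CPU, Mem    int
	NeedsDocker bool
}

// feasible mirrors the role of scheduler/feasible.go: drop any node
// that cannot possibly run the task.
func feasible(nodes []Node, ask Ask) []Node {
	var out []Node
	for _, n := range nodes {
		if n.CPUFree >= ask.CPU && n.MemFree >= ask.Mem && (!ask.NeedsDocker || n.HasDocker) {
			out = append(out, n)
		}
	}
	return out
}

// score mirrors the role of scheduler/rank.go: bin packing prefers
// nodes that end up MORE utilized, consolidating work onto fewer
// machines. (Nomad's real scoring formula differs in detail.)
func score(n Node, ask Ask) float64 {
	cpuUtil := 1 - float64(n.CPUFree-ask.CPU)/4000 // assume 4000 MHz nodes
	memUtil := 1 - float64(n.MemFree-ask.Mem)/8192 // assume 8192 MB nodes
	return math.Min(cpuUtil, 1) + math.Min(memUtil, 1)
}

func main() {
	nodes := []Node{
		{"node-1", 3000, 6000, true},
		{"node-2", 500, 7000, true},   // nearly full: bin packing likes it
		{"node-3", 3900, 8000, false}, // no Docker: filtered out
	}
	ask := Ask{CPU: 400, Mem: 512, NeedsDocker: true}

	candidates := feasible(nodes, ask)
	sort.Slice(candidates, func(i, j int) bool {
		return score(candidates[i], ask) > score(candidates[j], ask)
	})
	for _, n := range candidates {
		fmt.Printf("%s score=%.2f\n", n.Name, score(n, ask))
	}
}
```

Run it and node-2 outranks node-1 despite having far less headroom; consolidating work onto nearly-full nodes is exactly what bin packing is for.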
Tracing job submission:
// 1. Job endpoint receives request
// nomad/job_endpoint.go
func (j *Job) Register(args *structs.JobRegisterRequest, reply *structs.JobRegisterResponse) error {
// Validate job
// Persist to state store via Raft
// Create evaluation
}
// 2. Evaluation is enqueued
// nomad/eval_broker.go
func (b *EvalBroker) Enqueue(eval *structs.Evaluation) {
// Add to priority queue
// Signal waiting workers
}
// 3. Scheduler worker dequeues
// nomad/worker.go
func (w *Worker) run() {
for {
eval, _ := w.srv.evalBroker.Dequeue(...)
w.invokeScheduler(eval)
}
}
// 4. Scheduler processes
// scheduler/generic_sched.go
func (s *GenericScheduler) Process(eval *structs.Evaluation) error {
// Reconcile desired vs actual state
// For each needed allocation:
// - Find feasible nodes
// - Rank nodes
// - Create allocation in plan
// Submit plan
}
// 5. Plan is applied
// nomad/plan_apply.go
func (p *planner) applyPlan(...) {
// Check for conflicts
// Apply via Raft
// Create allocations in state store
}
Questions to answer in your walkthrough:
- How does optimistic concurrency work in the plan queue? (see the toy sketch after this list)
- What happens when a node fails during scheduling?
- How are blocked evaluations handled?
- How does the client reconcile allocations?
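As a warm-up for the first question, here is a deliberately tiny model of optimistic plan application (all names and numbers invented; the real logic lives in nomad/plan_queue.go and nomad/plan_apply.go). Scheduler workers plan in parallel against possibly stale snapshots, and a single serialized applier rejects placements that no longer fit:

```go
package main

import "fmt"

// Plan is what a scheduler worker proposes: resource claims per node,
// computed against a state snapshot taken at SnapshotIndex.
type Plan struct {
	SnapshotIndex uint64
	NodeAllocs    map[string]int // CPU MHz requested per node (simplified)
}

// State is the leader's authoritative view at the plan applier.
type State struct {
	Index        uint64
	NodeCapacity map[string]int
	NodeUsed     map[string]int
}

// applyPlan admits each placement only if it still fits *now*. Workers
// never hold a global lock while planning; the applier serializes
// commits and rejects stale placements. That reject-and-retry loop is
// the optimistic concurrency.
func (s *State) applyPlan(p *Plan) map[string]int {
	accepted := map[string]int{}
	for node, cpu := range p.NodeAllocs {
		if s.NodeUsed[node]+cpu <= s.NodeCapacity[node] {
			s.NodeUsed[node] += cpu
			accepted[node] = cpu
		}
		// A rejected placement sends the worker back to re-plan from a
		// fresh snapshot instead of blocking other workers.
	}
	s.Index++
	return accepted
}

func main() {
	s := &State{
		Index:        100,
		NodeCapacity: map[string]int{"node-1": 4000},
		NodeUsed:     map[string]int{"node-1": 3000},
	}
	// Two workers planned concurrently from the same snapshot (index 100):
	planA := &Plan{SnapshotIndex: 100, NodeAllocs: map[string]int{"node-1": 800}}
	planB := &Plan{SnapshotIndex: 100, NodeAllocs: map[string]int{"node-1": 800}}
	fmt.Println("plan A accepted:", s.applyPlan(planA)) // fits: 3800 <= 4000
	fmt.Println("plan B accepted:", s.applyPlan(planB)) // rejected: would be 4600
}
```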
Learning milestones:
- You trace job submission → You understand RPC flow
- You understand the FSM → You understand Raft integration
- You follow the scheduler → You understand actual algorithms
- You document the flow → You can teach others
Project Comparison Table
| Project | Difficulty | Time | Depth of Understanding | Fun Factor |
|---|---|---|---|---|
| 1. Local Dev Cluster | Beginner | Weekend | ⭐⭐ | ⭐⭐⭐ |
| 2. Job Lifecycle Explorer | Beginner | 1 week | ⭐⭐⭐ | ⭐⭐⭐ |
| 3. Scheduler Visualizer | Intermediate | 1-2 weeks | ⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ |
| 4. Custom Constraints | Intermediate | 1 week | ⭐⭐⭐ | ⭐⭐⭐ |
| 5. Consul Service Mesh | Intermediate | 1-2 weeks | ⭐⭐⭐⭐ | ⭐⭐⭐⭐ |
| 6. Autoscaler | Advanced | 2 weeks | ⭐⭐⭐⭐ | ⭐⭐⭐⭐ |
| 7. Custom Task Driver | Advanced | 2-3 weeks | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ |
| 8. Raft Simulator | Expert | 3-4 weeks | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐ |
| 9. Gossip Protocol | Advanced | 2-3 weeks | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐ |
| 10. Mini-Scheduler | Expert | 1 month | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ |
| 11. Multi-Region Lab | Advanced | 2 weeks | ⭐⭐⭐⭐ | ⭐⭐⭐⭐ |
| 12. Source Code Dive | Expert | 2-3 weeks | ⭐⭐⭐⭐⭐ | ⭐⭐⭐ |
Recommended Learning Path
If you’re new to container orchestration:
- Project 1 (Local Cluster) - Get hands-on experience
- Project 2 (Job Lifecycle) - Understand job types
- Project 4 (Constraints) - Learn placement logic
- Project 5 (Service Mesh) - Understand networking
If you know Kubernetes and want to understand Nomad’s differences:
- Project 3 (Scheduler Visualizer) - See the evaluation system
- Project 7 (Task Driver) - Understand the plugin model
- Project 10 (Mini-Scheduler) - Build your own scheduler
If you want to deeply understand distributed systems:
- Project 8 (Raft) - Consensus fundamentals
- Project 9 (Gossip) - Membership protocols
- Project 10 (Scheduler) - Distributed scheduling
- Project 12 (Source Code) - Production implementation
Quick path (2-3 weeks):
- Project 1 (Weekend) - Get running
- Project 2 (3 days) - Understand jobs
- Project 3 (1 week) - See scheduling
- Project 5 (1 week) - Service mesh
Final Capstone Project: Production-Grade Nomad Platform
- File: LEARN_NOMAD_DEEP_DIVE.md
- Main Programming Language: Go + HCL + Terraform
- Alternative Programming Languages: Python, Bash
- Coolness Level: Level 5: Pure Magic (Super Cool)
- Business Potential: 4. The “Open Core” Infrastructure
- Difficulty: Level 5: Master
- Knowledge Area: Platform Engineering / Production Operations
- Software or Tool: Nomad + Consul + Vault + Terraform
- Main Book: “Site Reliability Engineering” by Google
What you’ll build: A complete production-grade Nomad platform with multi-region federation, Consul service mesh, Vault secrets integration, monitoring/alerting, CI/CD pipeline integration, and disaster recovery procedures.
Why it teaches everything: This project brings together scheduling, service mesh, secrets, observability, security, and operations in one system. It’s the capstone that proves you can run Nomad in production.
Core components:
- Multi-region clusters with Terraform provisioning
- Consul Connect for service mesh
- Vault integration for secrets management
- Prometheus + Grafana for monitoring
- CI/CD integration for deployment pipelines
- Runbooks for operational procedures
- Disaster recovery procedures and testing
Real world outcome:
Production Nomad Platform
├── infrastructure/
│ ├── terraform/
│ │ ├── modules/
│ │ │ ├── nomad-cluster/
│ │ │ ├── consul-cluster/
│ │ │ └── vault-cluster/
│ │ ├── environments/
│ │ │ ├── production/
│ │ │ └── staging/
│ │ └── main.tf
│ └── packer/
│ └── nomad-ami.pkr.hcl
├── jobs/
│ ├── platform/
│ │ ├── prometheus.nomad
│ │ ├── grafana.nomad
│ │ ├── traefik.nomad
│ │ └── vault-agent.nomad
│ └── applications/
│ └── example-app.nomad
├── policies/
│ ├── nomad-acl/
│ ├── consul-intentions/
│ └── vault-policies/
├── monitoring/
│ ├── dashboards/
│ ├── alerts/
│ └── runbooks/
├── ci-cd/
│ ├── .github/workflows/
│ └── scripts/
└── docs/
├── architecture.md
├── operations.md
└── disaster-recovery.md
Time estimate: 2-3 months. Prerequisites: All 12 projects completed.
Essential Resources
Official Documentation
- Nomad Documentation - Comprehensive official docs
- Nomad Tutorials - Hands-on learning paths
- Nomad API Docs - Complete API reference
Key Papers and Articles
- Large-scale cluster management at Google with Borg - Nomad’s inspiration
- Omega: flexible, scalable schedulers - Optimistic concurrency model
- Raft Consensus Algorithm - The consensus paper
- SWIM Protocol - Gossip membership
Courses
- HashiCorp Nomad on Udemy - Comprehensive course by Bryan Krausen
- Nomad 101 - Official introduction
Books (Complementary)
- “Designing Data-Intensive Applications” by Martin Kleppmann - Distributed systems fundamentals
- “Site Reliability Engineering” by Google - Production operations
- “The Go Programming Language” by Donovan & Kernighan - For source code reading
Source Code
- github.com/hashicorp/nomad - Nomad source code
- github.com/hashicorp/raft - Raft implementation
- github.com/hashicorp/serf - Gossip library
Community
- Nomad Forum - Official discussion forum
- Nomad GitHub Issues - Bug reports and features
- HashiCorp User Groups - Local meetups
Summary
| # | Project | Main Programming Language |
|---|---|---|
| 1 | Local Development Cluster | HCL + Bash |
| 2 | Job Lifecycle Explorer | HCL + Bash |
| 3 | Scheduler Visualization Tool | Go |
| 4 | Custom Resource Constraint System | HCL + Go |
| 5 | Consul Service Mesh Integration | HCL |
| 6 | Autoscaler Implementation | Go |
| 7 | Custom Task Driver Plugin | Go |
| 8 | Raft Consensus Simulator | Go |
| 9 | Gossip Protocol Implementation | Go |
| 10 | Mini-Scheduler Implementation | Go |
| 11 | Multi-Region Federation Lab | HCL + Terraform |
| 12 | Nomad Source Code Deep Dive | Go |
| Final | Production-Grade Nomad Platform | Go + HCL + Terraform |
Happy scheduling! Nomad’s simplicity hides powerful distributed systems concepts—enjoy the journey from “nomad run” to understanding every byte of the protocol. 🚀