Project 4: Build a Service Registry

Build a service catalog with registration, checks, and healthy-instance queries.

Quick Reference

Attribute | Value
Difficulty | Level 2
Time Estimate | 1-2 weeks
Main Programming Language | Go
Alternative Programming Languages | Rust, Python, Java
Coolness Level | Level 3-5 depending on implementation depth
Business Potential | Resume Gold to Open Core Infrastructure
Prerequisites | Linux networking, HTTP/DNS basics, distributed systems vocabulary
Key Topics | catalog schema, health status, query filtering

1. Learning Objectives

By completing this project, you will:

  1. Explain the core failure model this project is designed to expose.
  2. Design a protocol-aware implementation strategy before touching code.
  3. Validate correctness using deterministic success and failure demos.
  4. Connect this project to production-grade Consul behaviors.

2. All Theory Needed (Per-Concept Breakdown)

Service Identity and Registration

Fundamentals

This concept defines the non-negotiable correctness boundary for the project. If you cannot state what must always be true, debugging becomes guesswork. In Consul-like systems, correctness comes from preserving invariants under imperfect networks, not from happy-path API output. The goal is to identify what state transition is authoritative, what is merely observed, and what can be delayed safely. You should be able to describe what signal means committed, which state may be stale, and which event must trigger immediate fail-safe behavior.

Deep Dive into the concept

Model a request lifecycle under stress. Start with a normal request and trace each boundary crossing: client to local agent, local agent to authoritative node, authoritative node to peers, then commit/apply to state machine. At each stage ask what happens if timeout, reorder, or partial loss occurs. Many implementation bugs come from collapsing distinct states into one boolean success. In distributed systems, accepted, replicated, and committed are different truths.

This concept also forces separation of control and data signals. A system may report a node as alive while its mutation path is unavailable, or report the mutation path as healthy while the query path returns stale data due to caching. The engineering work is not to eliminate all inconsistency windows, but to bound and communicate them. API design should include explicit metadata such as indexes, terms, and status epochs.

Tie this concept to operator workflow. During incidents, teams need direct answers: what invariant is currently violated, what metric confirms it, and what remediation restores it. If design lacks observability on invariant transitions, incident MTTR increases sharply.

How this fits into the project series

This concept is foundational in this project and reappears in later integration tasks.

Definitions & key terms

  • Invariant: property that must always hold.
  • Authoritative state: source of truth used for correctness decisions.
  • Stale state: delayed but potentially useful view.
  • Transition epoch/index: monotonic marker of state progression.

Mental model diagram

request -> accepted -> replicated -> committed -> applied
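The lifecycle above can be sketched as an explicit state machine. This is a minimal illustration, not Consul's internal implementation; the point is that each stage is a distinct value, so "accepted" can never be confused with "committed".

```go
package main

import "fmt"

// Stage models the lifecycle from the diagram above. Names are
// illustrative, not Consul's actual internal states.
type Stage int

const (
	Accepted Stage = iota
	Replicated
	Committed
	Applied
)

// Advance enforces that a request moves forward exactly one stage at a
// time, keeping accepted, replicated, and committed as distinct truths.
func Advance(cur, next Stage) (Stage, error) {
	if next != cur+1 {
		return cur, fmt.Errorf("illegal transition %d -> %d", cur, next)
	}
	return next, nil
}

func main() {
	s := Accepted
	for _, next := range []Stage{Replicated, Committed, Applied} {
		var err error
		if s, err = Advance(s, next); err != nil {
			panic(err)
		}
		fmt.Println("stage:", s)
	}
}
```

Collapsing these stages into one boolean success is exactly the bug class described in the next paragraph.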

How it works (step-by-step, with invariants and failure modes)

  1. Request enters system with intent.
  2. Local validation enforces schema and role constraints.
  3. Propagation path executes according to protocol.
  4. Commit condition is evaluated against invariant rule.
  5. State applies and becomes externally visible.

Failure modes:

  • premature success response
  • silent stale reads presented as fresh truth
  • missing monotonic metadata for conflict resolution

Minimal concrete example

PUT /resource/x
-> accepted index=41
-> replicated to peers
-> committed index=41
-> read path returns value with ModifyIndex=41
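The trace above can be reproduced with a toy store that stamps every write with a monotonic index. The `CreateIndex`/`ModifyIndex` field names mirror Consul's convention, but this type is an illustration, not Consul's API.

```go
package main

import "fmt"

// WriteResult carries the monotonic metadata from the example above.
type WriteResult struct {
	Key         string
	CreateIndex uint64
	ModifyIndex uint64
}

// Store simulates the commit side: each successful write bumps a shared
// monotonic index, so readers can compare freshness explicitly.
type Store struct {
	index uint64
	data  map[string]WriteResult
}

func NewStore() *Store { return &Store{data: map[string]WriteResult{}} }

// Apply commits a write and returns the metadata a caller would see.
func (s *Store) Apply(key string) WriteResult {
	s.index++
	r, ok := s.data[key]
	if !ok {
		r = WriteResult{Key: key, CreateIndex: s.index}
	}
	r.ModifyIndex = s.index
	s.data[key] = r
	return r
}

func main() {
	st := NewStore()
	fmt.Println(st.Apply("resource/x")) // first write: indexes equal
	fmt.Println(st.Apply("resource/x")) // rewrite: ModifyIndex advances, CreateIndex stays
}
```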

Common misconceptions

  • If API returned success, state is globally safe.
  • Eventually consistent means unpredictable.

Check-your-understanding questions

  1. What event marks authoritative success in this project?
  2. Which state can be stale but still useful?
  3. What metric would prove invariant health?

Check-your-understanding answers

  1. The protocol-defined commit condition.
  2. Read-side cache with explicit freshness metadata.
  3. Monotonic commit/index progression plus failure-rate bounds.

Real-world applications

  • control-plane mutation APIs
  • lock ownership verification
  • secure routing policy updates

Where you’ll apply it

  • This project's implementation, Sections 3 to 6.
  • Also used in: P12-consul-agent-capstone.md.

References

  • Consul architecture docs.
  • Raft or SWIM papers as applicable.
  • RFC standards relevant to this project.

Key insights

Correctness is preserved by explicit state transitions and invariants, not by retry volume.

Summary

You now have a model for what must always be true, where uncertainty exists, and how to expose both safely.

Homework/Exercises to practice the concept

  1. Write the invariant list for this project in one page.
  2. Define one metric and one log field for each invariant.

Solutions to the homework/exercises

  1. Strong lists include commit, ownership, and authorization constraints.
  2. Metrics/log fields should be monotonic and attributable by request id.

Health-Driven Discoverability

Fundamentals

This concept covers the main tradeoff surface in the project. Most failures happen because teams optimize one axis blindly and violate assumptions elsewhere. You need language for describing tradeoffs before implementation.

Deep Dive into the concept

Identify the dominant cost and risk in this project. Discovery projects emphasize freshness versus load. Security projects emphasize strictness versus operability. Consensus projects emphasize liveness versus safety. Map each API and timer configuration to one side of that tradeoff.

A robust design makes choices explicit and testable. If timeout is short, quantify false-positive risk. If policy defaults are permissive during migration, define rollback and audit controls. If stale reads are allowed, make stale metadata mandatory and document caller behavior.

Treat this concept as a policy engine for implementation decisions. Every parameter should have an owner, rationale, and failure drill. Production maturity is less about ideal defaults and more about predictable behavior under non-ideal conditions.

How this fits into the project series

Used directly in this project’s architecture and reused in federation, security, and capstone projects.

Definitions & key terms

  • Tradeoff axis: pair of competing priorities.
  • Guardrail: explicit constraint preventing unsafe optimization.
  • Operational envelope: expected safe range of runtime conditions.

Mental model diagram

strict/fresh  <---- SLO-guided choice ---->  available/cheap

How it works

  1. Pick explicit SLO target.
  2. Choose parameter defaults aligned to SLO.
  3. Define guardrails and rollback thresholds.
  4. Validate with deterministic stress scenarios.

Failure modes:

  • hidden defaults with no rationale
  • tuning without baseline measurement
  • policy changes without canary validation

Minimal concrete example

TTL=5s  -> fresher routing, higher query load
TTL=30s -> lower load, slower failure visibility
selected TTL=10s after benchmark and outage drill

Common misconceptions

  • There is one best timeout or TTL.
  • Policy strictness always increases reliability.

Check-your-understanding questions

  1. Which tradeoff axis is primary in this project?
  2. What metric shows you crossed safe envelope?
  3. What rollback action should be pre-defined?

Check-your-understanding answers

  1. Depends on project domain; document in architecture section.
  2. Error-rate, latency, or false-positive threshold crossing.
  3. Revert to last-known-good parameter or policy profile.

Real-world applications

  • timeout and TTL tuning
  • secure rollout of deny-by-default policies
  • failover trigger calibration

Where you’ll apply it

  • This project’s design decisions section.
  • Also used in: P10-multi-datacenter-federation.md and P11-acl-tokens-and-policies.md.

References

  • Consul operational docs for the relevant subsystem.
  • CNCF and HashiCorp survey findings.

Key insights

Operational excellence is explicit tradeoff management with measured feedback loops.

Summary

If you can state your tradeoff and prove its behavior under fault drills, your design is production-oriented.

Homework/Exercises to practice the concept

  1. Define one aggressive and one conservative parameter profile.
  2. Run both profiles and compare three key metrics.

Solutions to the homework/exercises

  1. Aggressive profile favors speed; conservative profile favors stability.
  2. Use measurements to set baseline and emergency fallback profiles.

3. Project Specification

3.1 What You Will Build

You will build a service catalog with registration, health checks, and healthy-instance queries. The project includes deterministic CLI or API flows, observable state transitions, and explicit failure-mode behavior. It intentionally excludes cloud-specific deployment automation so focus stays on protocol behavior.

3.2 Functional Requirements

  1. Implement the core protocol or workflow for this project domain.
  2. Expose user-visible interface with deterministic output.
  3. Emit observability data for transition debugging.
  4. Support one explicit failure scenario with predictable behavior.

3.3 Non-Functional Requirements

  • Performance: stable under local load tests without unbounded queue growth.
  • Reliability: deterministic success and failure responses.
  • Usability: clear output showing state and reason codes.

3.4 Example Usage / Output

$ run project start
state=ready
$ run golden path command
result=success index=...
$ run failure path command
result=expected_safe_failure

3.5 Data Formats / Schemas / Protocols

Define request and response shapes that include identity/resource fields, monotonic markers, and structured error envelopes.

3.6 Edge Cases

  • duplicate registration or idempotent replay
  • timeout and partial acknowledgement
  • stale token/cert/session/index depending on project type
  • restart recovery with durability checks

3.7 Real World Outcome

3.7.1 How to Run (Copy/Paste)

  • start control processes
  • run one golden path command
  • run one failure path command
  • verify output signatures

3.7.2 Golden Path Demo (Deterministic)

Use fixed fixture inputs and expected status transitions.

3.7.3 If CLI: provide an exact terminal transcript

$ ./project-04 start
ready
$ ./project-04 demo --golden
ok
$ ./project-04 demo --failure
expected_error

3.7.4 If Web App

N/A for this project family.

3.7.5 If API

Include endpoint table with one 2xx and one structured error example.

3.7.6 If Library

N/A.

3.7.7 If GUI/Desktop/Mobile

N/A.

3.7.8 If TUI

Optional if you add a terminal interface.


4. Solution Architecture

4.1 High-Level Design

client/test driver
      |
      v
project interface layer -> protocol/state engine -> persistence/cache
      |
      v
telemetry and diagnostics

4.2 Key Components

Component | Responsibility | Key Decisions
Interface layer | validate requests and shape responses | deterministic error contract
Core engine | enforce protocol transitions | invariant-first processing
Persistence/cache | store and retrieve state | explicit consistency model
Telemetry | provide root-cause signals | structured fields and counters

4.3 Data Structures (No Full Code)

ResourceState:
  id
  status
  index_or_term
  last_update
  owner_identity
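The field list above maps one-to-one onto a Go type; the comments and example values are illustrative assumptions.

```go
package main

import (
	"fmt"
	"time"
)

// ResourceState mirrors the sketch above field for field.
type ResourceState struct {
	ID            string
	Status        string    // e.g. "passing", "warning", "critical"
	IndexOrTerm   uint64    // monotonic marker for conflict resolution
	LastUpdate    time.Time // when this state was last modified
	OwnerIdentity string    // agent or node that registered the resource
}

func main() {
	rs := ResourceState{
		ID:            "web-1",
		Status:        "passing",
		IndexOrTerm:   41,
		LastUpdate:    time.Unix(0, 0).UTC(), // fixed timestamp for determinism
		OwnerIdentity: "node-a",
	}
	fmt.Printf("%+v\n", rs)
}
```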

4.4 Algorithm Overview

Key Algorithm: transition and commit path

  1. Validate and authorize request.
  2. Execute transition according to protocol rules.
  3. Persist and publish monotonic metadata.
  4. Return deterministic response.
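The four steps above can be sketched as a single commit function. The engine type, token check, and rules here are illustrative assumptions, not a prescribed design.

```go
package main

import (
	"errors"
	"fmt"
)

var errUnauthorized = errors.New("unauthorized")

// engine holds the minimal state needed to demonstrate the commit path.
type engine struct {
	index  uint64            // monotonic commit marker
	state  map[string]string // resource id -> status
	tokens map[string]bool   // toy authorization table
}

// commit follows the algorithm above: validate/authorize, transition,
// persist monotonic metadata, return a deterministic response.
func (e *engine) commit(token, id, status string) (uint64, error) {
	// 1. Validate and authorize the request.
	if !e.tokens[token] {
		return 0, errUnauthorized
	}
	if id == "" {
		return 0, errors.New("missing id")
	}
	// 2. Execute the transition according to protocol rules.
	e.state[id] = status
	// 3. Persist and publish monotonic metadata.
	e.index++
	// 4. Deterministic response: the index proves commit order.
	return e.index, nil
}

func main() {
	e := &engine{state: map[string]string{}, tokens: map[string]bool{"t1": true}}
	idx, err := e.commit("t1", "web-1", "passing")
	fmt.Println(idx, err)
}
```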

Complexity Analysis:

  • Time: proportional to peer/resource scope of operation.
  • Space: proportional to active state plus metadata history.

5. Implementation Guide

5.1 Development Environment Setup

# install runtime, build toolchain, and CLI dependencies
# verify command line tools and local networking setup

5.2 Project Structure

project-04/
  cmd/
  internal/api/
  internal/engine/
  internal/storage/
  internal/telemetry/
  tests/
  docs/

5.3 The Core Question You’re Answering

What exact mechanism turns uncertain distributed signals into safe, operable behavior for this project?

5.4 Concepts You Must Understand First

  1. Service Identity and Registration
  2. Health-Driven Discoverability
  3. Failure model and observability coupling

5.5 Questions to Guide Your Design

  1. Which transitions are authoritative versus eventually observed?
  2. What metadata proves correctness to clients and operators?
  3. How does the system fail safely when assumptions break?

5.6 Thinking Exercise

Before coding, draw transition timelines for:

  • normal operation
  • timeout or delayed operation
  • invalid authorization or ownership operation

5.7 The Interview Questions They’ll Ask

  1. Which invariant was hardest to preserve and why?
  2. How did you test failure behavior beyond happy path?
  3. What tradeoff did you choose and how did you validate it?
  4. How would you scale this design in production?
  5. Which metric predicts incident risk earliest?

5.8 Hints in Layers

  • Hint 1: implement explicit state model before I/O details.
  • Hint 2: add monotonic metadata to mutating responses.
  • Hint 3: build deterministic failure fixture before load tests.
  • Hint 4: keep protocol logs structured and correlation-id aware.

5.9 Books That Will Help

Topic | Book | Chapters
Distributed reasoning | Designing Data-Intensive Applications | 5-9
Network mechanics | Computer Networks | 5
Service systems | Building Microservices | discovery and resilience chapters

5.10 Implementation Phases

Phase 1: Foundation

  • Define state model and interface schema.
  • Build deterministic single-node behavior.

Checkpoint: local golden path passes.

Phase 2: Core Functionality

  • Implement protocol engine and state path.
  • Add observability fields.

Checkpoint: distributed or failure scenario passes.

Phase 3: Polish & Edge Cases

  • Handle retries, timeouts, and edge conditions.
  • Validate reproducibility and documentation.

Checkpoint: deterministic success and failure demos are stable.

5.11 Key Implementation Decisions

Decision | Options | Recommendation | Rationale
Consistency mode | strict, stale, hybrid | hybrid with explicit metadata | balances latency and correctness
Failure handling | retry-only, fail-safe, fail-open | fail-safe by default | prevents hidden corruption
Telemetry shape | ad-hoc logs, structured logs | structured logs + metrics | faster root-cause analysis

6. Testing Strategy

6.1 Test Categories

Category | Purpose | Examples
Unit | validate local transitions | parser, evaluator, index checks
Integration | verify component interaction | request lifecycle and commit path
Edge Case | exercise failure behavior | timeout, partition, stale metadata

6.2 Critical Test Cases

  1. Golden path success with deterministic output.
  2. Deliberate failure path with safe behavior.
  3. Recovery path that restores healthy state.

6.3 Test Data

fixed ids, fixed timestamps or seeds, known fixture responses

7. Common Pitfalls & Debugging

7.1 Frequent Mistakes

Pitfall | Symptom | Solution
Missing invariant checks | intermittent incorrect success | enforce transition guards
Unclear metadata | hard root-cause analysis | include index/term/owner in responses
Over-aggressive tuning | flapping and false positives | calibrate with baseline measurements

7.2 Debugging Strategies

  • Trace one request end-to-end with correlation id.
  • Compare observed state against invariant checklist.
  • Replay deterministic failure fixture until explanation is clear.

7.3 Performance Traps

  • unbounded watchers or retries
  • synchronized timers causing burst load
  • expensive per-request policy lookup without caching plan

8. Extensions & Challenges

8.1 Beginner Extensions

  • Add richer status and error diagnostics.
  • Add idempotency key handling for retries.
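A minimal sketch of the idempotency-key extension, assuming clients send a unique key per logical operation: a replayed request returns the stored first result instead of re-executing the mutation.

```go
package main

import "fmt"

// idempotentStore remembers the result index of each key's first
// execution so retries are safe.
type idempotentStore struct {
	seen map[string]uint64 // idempotency key -> index from first execution
	next uint64
}

// register performs the mutation once per key; replays get the original
// response back, preventing double registration.
func (s *idempotentStore) register(key, serviceID string) (index uint64, replay bool) {
	if idx, ok := s.seen[key]; ok {
		return idx, true // safe retry: same response, no second mutation
	}
	s.next++
	s.seen[key] = s.next
	return s.next, false
}

func main() {
	s := &idempotentStore{seen: map[string]uint64{}}
	i1, r1 := s.register("k-1", "web-1")
	i2, r2 := s.register("k-1", "web-1") // client retry with the same key
	fmt.Println(i1, r1)                  // first execution
	fmt.Println(i2, r2)                  // replay: same index, flagged as retry
}
```

A production version would also bound the key cache's lifetime, which ties back to the "unbounded watchers or retries" trap in Section 7.3.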

8.2 Intermediate Extensions

  • Add metrics endpoint with SLO-oriented counters.
  • Add chaos test script for one network and one resource fault.

8.3 Advanced Extensions

  • Add multi-tenant isolation controls.
  • Add production-like deployment topology simulation.

9. Real-World Connections

9.1 Industry Applications

  • platform control-plane services
  • secure service networking and governance
  • resilient configuration and discovery workflows

9.2 Related Open Source Projects

  • HashiCorp Consul
  • HashiCorp memberlist and raft libraries
  • Envoy for sidecar traffic control patterns

9.3 Interview Relevance

This project prepares discussion on invariants, failure tradeoffs, and operational readiness.


10. Resources

10.1 Essential Reading

  • Consul architecture and subsystem docs
  • Raft and SWIM papers
  • Relevant RFC standards for DNS and TLS

10.2 Video Resources

  • HashiCorp engineering talks on Consul internals
  • distributed systems talks on incident-safe design

10.3 Tools & Documentation

  • consul, curl, jq, dig, openssl
  • packet tracing tools such as tcpdump and Wireshark
  • Previous: P03-swim-gossip-membership.md
  • Next: P05-dns-service-discovery.md

11. Self-Assessment Checklist

11.1 Understanding

  • I can explain the primary invariant from memory.
  • I can explain the main tradeoff axis and chosen defaults.
  • I can predict behavior for one major failure scenario.

11.2 Implementation

  • Functional requirements are complete.
  • Golden and failure demos are deterministic.
  • Telemetry is sufficient for root-cause analysis.
  • Edge cases are handled and documented.

11.3 Growth

  • I documented one design change caused by evidence.
  • I can explain this project in interview-level depth.
  • I can map this project to production equivalents.

12. Submission / Completion Criteria

Minimum Viable Completion:

  • one deterministic happy path
  • one deterministic safe-failure path
  • one observability view that explains state

Full Completion:

  • all minimum criteria plus edge-case validation
  • implementation decision rationale documented

Excellence (Going Above & Beyond):

  • multi-fault scenario drill with stable behavior
  • concise incident runbook with remediation order

13. Additional Content Rules (Applied)

  • Determinism: fixed fixtures and repeatable demos.
  • Outcome completeness: success and failure demo intent included.
  • Cross-linking: related project references included.
  • No placeholders: all sections are actionable.