Project 4: Build a Service Registry

Build a service catalog with registration, checks, and healthy-instance queries.

Quick Reference

Attribute | Value
Difficulty | Level 2
Time Estimate | 1-2 weeks
Main Programming Language | Go
Alternative Programming Languages | Rust, Python, Java
Coolness Level | Level 3-5 depending on implementation depth
Business Potential | Resume Gold to Open Core Infrastructure
Prerequisites | Linux networking, HTTP/DNS basics, distributed systems vocabulary
Key Topics | catalog schema, health status, query filtering

1. Learning Objectives

By completing this project, you will:

  1. Explain the core failure model this project is designed to expose.
  2. Design a protocol-aware implementation strategy before touching code.
  3. Validate correctness using deterministic success and failure demos.
  4. Connect this project to production-grade Consul behaviors.

2. All Theory Needed (Per-Concept Breakdown)

Service Identity and Registration

Fundamentals

This concept defines the non-negotiable correctness boundary for the project. If you cannot state what must always be true, debugging becomes guesswork. In Consul-like systems, correctness comes from preserving invariants under imperfect networks, not from happy-path API output. The goal is to identify what state transition is authoritative, what is merely observed, and what can be delayed safely. You should be able to describe what signal means committed, which state may be stale, and which event must trigger immediate fail-safe behavior.

Deep Dive into the concept

Model a request lifecycle under stress. Start with a normal request and trace each boundary crossing: client to local agent, local agent to authoritative node, authoritative node to peers, then commit/apply to state machine. At each stage ask what happens if timeout, reorder, or partial loss occurs. Many implementation bugs come from collapsing distinct states into one boolean success. In distributed systems, accepted, replicated, and committed are different truths.

This concept also forces separation of control and data signals. A system may report a node as alive while its mutation path is unavailable, or report the mutation path as healthy while the query path returns stale data due to caching. The engineering work is not to eliminate all inconsistency windows, but to bound and communicate them. API design should include explicit metadata such as indexes, terms, and status epochs.

Tie this concept to operator workflow. During incidents, teams need direct answers: what invariant is currently violated, what metric confirms it, and what remediation restores it. If design lacks observability on invariant transitions, incident MTTR increases sharply.

How this fits into the project series

This concept is foundational in this project and reappears in later integration tasks.

Definitions & key terms

  • Invariant: property that must always hold.
  • Authoritative state: source of truth used for correctness decisions.
  • Stale state: delayed but potentially useful view.
  • Transition epoch/index: monotonic marker of state progression.

Mental model diagram

request -> accepted -> replicated -> committed -> applied
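The lifecycle above can be sketched as an explicit state machine. This is a minimal illustration, not Consul's internal implementation; the point is that each stage is a distinct value, so "accepted" can never be confused with "committed".

```go
package main

import "fmt"

// Stage models the lifecycle from the diagram above. Names are
// illustrative, not Consul's actual internal states.
type Stage int

const (
	Accepted Stage = iota
	Replicated
	Committed
	Applied
)

// Advance enforces that a request moves forward exactly one stage at a
// time, keeping accepted, replicated, and committed as distinct truths.
func Advance(cur, next Stage) (Stage, error) {
	if next != cur+1 {
		return cur, fmt.Errorf("illegal transition %d -> %d", cur, next)
	}
	return next, nil
}

func main() {
	s := Accepted
	for _, next := range []Stage{Replicated, Committed, Applied} {
		var err error
		if s, err = Advance(s, next); err != nil {
			panic(err)
		}
		fmt.Println("stage:", s)
	}
}
```

Collapsing these stages into one boolean success is exactly the bug class described in the next paragraph.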

How it works (step-by-step, with invariants and failure modes)

  1. Request enters system with intent.
  2. Local validation enforces schema and role constraints.
  3. Propagation path executes according to protocol.
  4. Commit condition is evaluated against invariant rule.
  5. State applies and becomes externally visible.

Failure modes:

  • premature success response
  • silent stale reads presented as fresh truth
  • missing monotonic metadata for conflict resolution

Minimal concrete example

PUT /resource/x
-> accepted index=41
-> replicated to peers
-> committed index=41
-> read path returns value with ModifyIndex=41
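The trace above can be reproduced with a toy store that stamps every write with a monotonic index. The `CreateIndex`/`ModifyIndex` field names mirror Consul's convention, but this type is an illustration, not Consul's API.

```go
package main

import "fmt"

// WriteResult carries the monotonic metadata from the example above.
type WriteResult struct {
	Key         string
	CreateIndex uint64
	ModifyIndex uint64
}

// Store simulates the commit side: each successful write bumps a shared
// monotonic index, so readers can compare freshness explicitly.
type Store struct {
	index uint64
	data  map[string]WriteResult
}

func NewStore() *Store { return &Store{data: map[string]WriteResult{}} }

// Apply commits a write and returns the metadata a caller would see.
func (s *Store) Apply(key string) WriteResult {
	s.index++
	r, ok := s.data[key]
	if !ok {
		r = WriteResult{Key: key, CreateIndex: s.index}
	}
	r.ModifyIndex = s.index
	s.data[key] = r
	return r
}

func main() {
	st := NewStore()
	fmt.Println(st.Apply("resource/x")) // first write: indexes equal
	fmt.Println(st.Apply("resource/x")) // rewrite: ModifyIndex advances, CreateIndex stays
}
```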

Common misconceptions

  • If API returned success, state is globally safe.
  • Eventually consistent means unpredictable.

Check-your-understanding questions

  1. What event marks authoritative success in this project?
  2. Which state can be stale but still useful?
  3. What metric would prove invariant health?

Check-your-understanding answers

  1. The protocol-defined commit condition.
  2. Read-side cache with explicit freshness metadata.
  3. Monotonic commit/index progression plus failure-rate bounds.

Real-world applications

  • control-plane mutation APIs
  • lock ownership verification
  • secure routing policy updates

Where you’ll apply it

  • This project's implementation, Sections 3 to 6.
  • Also used in: P12-consul-agent-capstone.md.

References

  • Consul architecture docs.
  • Raft or SWIM papers as applicable.
  • RFC standards relevant to this project.

Key insights

Correctness is preserved by explicit state transitions and invariants, not by retry volume.

Summary

You now have a model for what must always be true, where uncertainty exists, and how to expose both safely.

Homework/Exercises to practice the concept

  1. Write the invariant list for this project in one page.
  2. Define one metric and one log field for each invariant.

Solutions to the homework/exercises

  1. Strong lists include commit, ownership, and authorization constraints.
  2. Metrics/log fields should be monotonic and attributable by request id.

Health-Driven Discoverability

Fundamentals

This concept covers the main tradeoff surface in the project. Most failures happen because teams optimize one axis blindly and violate assumptions elsewhere. You need language for describing tradeoffs before implementation.

Deep Dive into the concept

Identify the dominant cost and risk in this project. Discovery projects emphasize freshness versus load. Security projects emphasize strictness versus operability. Consensus projects emphasize liveness versus safety. Map each API and timer configuration to one side of that tradeoff.

A robust design makes choices explicit and testable. If timeout is short, quantify false-positive risk. If policy defaults are permissive during migration, define rollback and audit controls. If stale reads are allowed, make stale metadata mandatory and document caller behavior.

Treat this concept as a policy engine for implementation decisions. Every parameter should have an owner, rationale, and failure drill. Production maturity is less about ideal defaults and more about predictable behavior under non-ideal conditions.

How this fits into the project series

Used directly in this project’s architecture and reused in federation, security, and capstone projects.

Definitions & key terms

  • Tradeoff axis: pair of competing priorities.
  • Guardrail: explicit constraint preventing unsafe optimization.
  • Operational envelope: expected safe range of runtime conditions.

Mental model diagram

strict/fresh  <---- SLO-guided choice ---->  available/cheap

How it works

  1. Pick explicit SLO target.
  2. Choose parameter defaults aligned to SLO.
  3. Define guardrails and rollback thresholds.
  4. Validate with deterministic stress scenarios.

Failure modes:

  • hidden defaults with no rationale
  • tuning without baseline measurement
  • policy changes without canary validation

Minimal concrete example

TTL=5s  -> fresher routing, higher query load
TTL=30s -> lower load, slower failure visibility
selected TTL=10s after benchmark and outage drill

Common misconceptions

  • There is one best timeout or TTL.
  • Policy strictness always increases reliability.

Check-your-understanding questions

  1. Which tradeoff axis is primary in this project?
  2. What metric shows you crossed safe envelope?
  3. What rollback action should be pre-defined?

Check-your-understanding answers

  1. Depends on project domain; document in architecture section.
  2. Error-rate, latency, or false-positive threshold crossing.
  3. Revert to last-known-good parameter or policy profile.

Real-world applications

  • timeout and TTL tuning
  • secure rollout of deny-by-default policies
  • failover trigger calibration

Where you’ll apply it

  • This project’s design decisions section.
  • Also used in: P10-multi-datacenter-federation.md and P11-acl-tokens-and-policies.md.

References

  • Consul operational docs for the relevant subsystem.
  • CNCF and HashiCorp survey findings.

Key insights

Operational excellence is explicit tradeoff management with measured feedback loops.

Summary

If you can state your tradeoff and prove its behavior under fault drills, your design is production-oriented.

Homework/Exercises to practice the concept

  1. Define one aggressive and one conservative parameter profile.
  2. Run both profiles and compare three key metrics.

Solutions to the homework/exercises

  1. Aggressive profile favors speed; conservative profile favors stability.
  2. Use measurements to set baseline and emergency fallback profiles.

3. Project Specification

3.1 What You Will Build

You will build a service catalog with registration, health checks, and healthy-instance queries. The project includes deterministic CLI or API flows, observable state transitions, and explicit failure-mode behavior. It intentionally excludes cloud-specific deployment automation so focus stays on protocol behavior.

3.2 Functional Requirements

  1. Implement the core protocol or workflow for this project domain.
  2. Expose user-visible interface with deterministic output.
  3. Emit observability data for transition debugging.
  4. Support one explicit failure scenario with predictable behavior.

3.3 Non-Functional Requirements

  • Performance: stable under local load tests without unbounded queue growth.
  • Reliability: deterministic success and failure responses.
  • Usability: clear output showing state and reason codes.

3.4 Example Usage / Output

$ run project start
state=ready
$ run golden path command
result=success index=...
$ run failure path command
result=expected_safe_failure

3.5 Data Formats / Schemas / Protocols

Define request and response shapes that include identity/resource fields, monotonic markers, and structured error envelopes.

3.6 Edge Cases

  • duplicate registration or idempotent replay
  • timeout and partial acknowledgement
  • stale token/cert/session/index depending on project type
  • restart recovery with durability checks

3.7 Real World Outcome

3.7.1 How to Run (Copy/Paste)

  • start control processes
  • run one golden path command
  • run one failure path command
  • verify output signatures

3.7.2 Golden Path Demo (Deterministic)

Use fixed fixture inputs and expected status transitions.

3.7.3 If CLI: provide an exact terminal transcript

$ ./project-04 start
ready
$ ./project-04 demo --golden
ok
$ ./project-04 demo --failure
expected_error

3.7.4 If Web App

N/A for this project family.

3.7.5 If API

Include endpoint table with one 2xx and one structured error example.

3.7.6 If Library

N/A.

3.7.7 If GUI/Desktop/Mobile

N/A.

3.7.8 If TUI

Optional if you add a terminal interface.


4. Solution Architecture

4.1 High-Level Design

client/test driver
      |
      v
project interface layer -> protocol/state engine -> persistence/cache
      |
      v
telemetry and diagnostics

4.2 Key Components

Component | Responsibility | Key Decisions
Interface layer | validate requests and shape responses | deterministic error contract
Core engine | enforce protocol transitions | invariant-first processing
Persistence/cache | store and retrieve state | explicit consistency model
Telemetry | provide root-cause signals | structured fields and counters

4.3 Data Structures (No Full Code)

ResourceState:
  id
  status
  index_or_term
  last_update
  owner_identity
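The field list above maps one-to-one onto a Go type; the comments and example values are illustrative assumptions.

```go
package main

import (
	"fmt"
	"time"
)

// ResourceState mirrors the sketch above field for field.
type ResourceState struct {
	ID            string
	Status        string    // e.g. "passing", "warning", "critical"
	IndexOrTerm   uint64    // monotonic marker for conflict resolution
	LastUpdate    time.Time // when this state was last modified
	OwnerIdentity string    // agent or node that registered the resource
}

func main() {
	rs := ResourceState{
		ID:            "web-1",
		Status:        "passing",
		IndexOrTerm:   41,
		LastUpdate:    time.Unix(0, 0).UTC(), // fixed timestamp for determinism
		OwnerIdentity: "node-a",
	}
	fmt.Printf("%+v\n", rs)
}
```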

4.4 Algorithm Overview

Key Algorithm: transition and commit path

  1. Validate and authorize request.
  2. Execute transition according to protocol rules.
  3. Persist and publish monotonic metadata.
  4. Return deterministic response.
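The four steps above can be sketched as a single commit function. The engine type, token check, and rules here are illustrative assumptions, not a prescribed design.

```go
package main

import (
	"errors"
	"fmt"
)

var errUnauthorized = errors.New("unauthorized")

// engine holds the minimal state needed to demonstrate the commit path.
type engine struct {
	index  uint64            // monotonic commit marker
	state  map[string]string // resource id -> status
	tokens map[string]bool   // toy authorization table
}

// commit follows the algorithm above: validate/authorize, transition,
// persist monotonic metadata, return a deterministic response.
func (e *engine) commit(token, id, status string) (uint64, error) {
	// 1. Validate and authorize the request.
	if !e.tokens[token] {
		return 0, errUnauthorized
	}
	if id == "" {
		return 0, errors.New("missing id")
	}
	// 2. Execute the transition according to protocol rules.
	e.state[id] = status
	// 3. Persist and publish monotonic metadata.
	e.index++
	// 4. Deterministic response: the index proves commit order.
	return e.index, nil
}

func main() {
	e := &engine{state: map[string]string{}, tokens: map[string]bool{"t1": true}}
	idx, err := e.commit("t1", "web-1", "passing")
	fmt.Println(idx, err)
}
```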

Complexity Analysis:

  • Time: proportional to peer/resource scope of operation.
  • Space: proportional to active state plus metadata history.

5. Implementation Guide

5.1 Development Environment Setup

# install runtime, build toolchain, and CLI dependencies
# verify command line tools and local networking setup

5.2 Project Structure

project-04/
  cmd/
  internal/api/
  internal/engine/
  internal/storage/
  internal/telemetry/
  tests/
  docs/

5.3 The Core Question You’re Answering

What exact mechanism turns uncertain distributed signals into safe, operable behavior for this project?

5.4 Concepts You Must Understand First

  1. Service Identity and Registration
  2. Health-Driven Discoverability
  3. Failure model and observability coupling

5.5 Questions to Guide Your Design

  1. Which transitions are authoritative versus eventually observed?
  2. What metadata proves correctness to clients and operators?
  3. How does the system fail safely when assumptions break?

5.6 Thinking Exercise

Before coding, draw transition timelines for:

  • normal operation
  • timeout or delayed operation
  • invalid authorization or ownership operation

5.7 The Interview Questions They’ll Ask

  1. Which invariant was hardest to preserve and why?
  2. How did you test failure behavior beyond happy path?
  3. What tradeoff did you choose and how did you validate it?
  4. How would you scale this design in production?
  5. Which metric predicts incident risk earliest?

5.8 Hints in Layers

  • Hint 1: implement explicit state model before I/O details.
  • Hint 2: add monotonic metadata to mutating responses.
  • Hint 3: build deterministic failure fixture before load tests.
  • Hint 4: keep protocol logs structured and correlation-id aware.

5.9 Books That Will Help

Topic | Book | Chapters
Distributed reasoning | Designing Data-Intensive Applications | 5-9
Network mechanics | Computer Networks | 5
Service systems | Building Microservices | discovery and resilience chapters

5.10 Implementation Phases

Phase 1: Foundation

  • Define state model and interface schema.
  • Build deterministic single-node behavior.

Checkpoint: local golden path passes.

Phase 2: Core Functionality

  • Implement protocol engine and state path.
  • Add observability fields.

Checkpoint: distributed or failure scenario passes.

Phase 3: Polish & Edge Cases

  • Handle retries, timeouts, and edge conditions.
  • Validate reproducibility and documentation.

Checkpoint: deterministic success and failure demos are stable.

5.11 Key Implementation Decisions

Decision | Options | Recommendation | Rationale
Consistency mode | strict, stale, hybrid | hybrid with explicit metadata | balances latency and correctness
Failure handling | retry-only, fail-safe, fail-open | fail-safe by default | prevents hidden corruption
Telemetry shape | ad-hoc logs, structured logs | structured logs + metrics | faster root-cause analysis

6. Testing Strategy

6.1 Test Categories

Category | Purpose | Examples
Unit | validate local transitions | parser, evaluator, index checks
Integration | verify component interaction | request lifecycle and commit path
Edge Case | exercise failure behavior | timeout, partition, stale metadata

6.2 Critical Test Cases

  1. Golden path success with deterministic output.
  2. Deliberate failure path with safe behavior.
  3. Recovery path that restores healthy state.

6.3 Test Data

fixed ids, fixed timestamps or seeds, known fixture responses

7. Common Pitfalls & Debugging

7.1 Frequent Mistakes

Pitfall | Symptom | Solution
Missing invariant checks | intermittent incorrect success | enforce transition guards
Unclear metadata | hard root-cause analysis | include index/term/owner in responses
Over-aggressive tuning | flapping and false positives | calibrate with baseline measurements

7.2 Debugging Strategies

  • Trace one request end-to-end with correlation id.
  • Compare observed state against invariant checklist.
  • Replay deterministic failure fixture until explanation is clear.

7.3 Performance Traps

  • unbounded watchers or retries
  • synchronized timers causing burst load
  • expensive per-request policy lookup without caching plan

8. Extensions & Challenges

8.1 Beginner Extensions

  • Add richer status and error diagnostics.
  • Add idempotency key handling for retries.
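A minimal sketch of the idempotency-key extension, assuming clients send a unique key per logical operation: a replayed request returns the stored first result instead of re-executing the mutation.

```go
package main

import "fmt"

// idempotentStore remembers the result index of each key's first
// execution so retries are safe.
type idempotentStore struct {
	seen map[string]uint64 // idempotency key -> index from first execution
	next uint64
}

// register performs the mutation once per key; replays get the original
// response back, preventing double registration.
func (s *idempotentStore) register(key, serviceID string) (index uint64, replay bool) {
	if idx, ok := s.seen[key]; ok {
		return idx, true // safe retry: same response, no second mutation
	}
	s.next++
	s.seen[key] = s.next
	return s.next, false
}

func main() {
	s := &idempotentStore{seen: map[string]uint64{}}
	i1, r1 := s.register("k-1", "web-1")
	i2, r2 := s.register("k-1", "web-1") // client retry with the same key
	fmt.Println(i1, r1)                  // first execution
	fmt.Println(i2, r2)                  // replay: same index, flagged as retry
}
```

A production version would also bound the key cache's lifetime, which ties back to the "unbounded watchers or retries" trap in Section 7.3.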

8.2 Intermediate Extensions

  • Add metrics endpoint with SLO-oriented counters.
  • Add chaos test script for one network and one resource fault.

8.3 Advanced Extensions

  • Add multi-tenant isolation controls.
  • Add production-like deployment topology simulation.

9. Real-World Connections

9.1 Industry Applications

  • platform control-plane services
  • secure service networking and governance
  • resilient configuration and discovery workflows

9.2 Related Open Source Projects

  • HashiCorp Consul
  • HashiCorp memberlist and raft libraries
  • Envoy for sidecar traffic control patterns

9.3 Interview Relevance

This project prepares discussion on invariants, failure tradeoffs, and operational readiness.


10. Resources

10.1 Essential Reading

  • Consul architecture and subsystem docs
  • Raft and SWIM papers
  • Relevant RFC standards for DNS and TLS

10.2 Video Resources

  • HashiCorp engineering talks on Consul internals
  • distributed systems talks on incident-safe design

10.3 Tools & Documentation

  • consul, curl, jq, dig, openssl
  • packet tracing tools such as tcpdump and Wireshark
  • Previous: P03-swim-gossip-membership.md
  • Next: P05-dns-service-discovery.md

11. Self-Assessment Checklist

11.1 Understanding

  • I can explain the primary invariant from memory.
  • I can explain the main tradeoff axis and chosen defaults.
  • I can predict behavior for one major failure scenario.

11.2 Implementation

  • Functional requirements are complete.
  • Golden and failure demos are deterministic.
  • Telemetry is sufficient for root-cause analysis.
  • Edge cases are handled and documented.

11.3 Growth

  • I documented one design change caused by evidence.
  • I can explain this project in interview-level depth.
  • I can map this project to production equivalents.

12. Submission / Completion Criteria

Minimum Viable Completion:

  • one deterministic happy path
  • one deterministic safe-failure path
  • one observability view that explains state

Full Completion:

  • all minimum criteria plus edge-case validation
  • implementation decision rationale documented

Excellence (Going Above & Beyond):

  • multi-fault scenario drill with stable behavior
  • concise incident runbook with remediation order

13. Additional Content Rules (Applied)

  • Determinism: fixed fixtures and repeatable demos.
  • Outcome completeness: success and failure demo intent included.
  • Cross-linking: related project references included.
  • No placeholders: all sections are actionable.