Project 7: Session and Lock Manager
Implement session-backed advisory locks for distributed coordination patterns.
Quick Reference
| Attribute | Value |
|---|---|
| Difficulty | Level 3 |
| Time Estimate | 1-2 weeks |
| Main Programming Language | Go |
| Alternative Programming Languages | Rust, Python, Java |
| Coolness Level | Level 3-5 depending on implementation depth |
| Business Potential | Resume Gold to Open Core Infrastructure |
| Prerequisites | Linux networking, HTTP/DNS basics, distributed systems vocabulary |
| Key Topics | sessions, TTL renewal, advisory lock ownership |
1. Learning Objectives
By completing this project, you will:
- Explain the core failure model this project is designed to expose.
- Design a protocol-aware implementation strategy before touching code.
- Validate correctness using deterministic success and failure demos.
- Connect this project to production-grade Consul behaviors.
2. All Theory Needed (Per-Concept Breakdown)
Leases and Lock Ownership
Fundamentals
This concept defines the non-negotiable correctness boundary for the project. If you cannot state what must always be true, debugging becomes guesswork. In Consul-like systems, correctness comes from preserving invariants under imperfect networks, not from happy-path API output. The goal is to identify which state transition is authoritative, which state is merely observed, and which work can safely be delayed. You should be able to describe which signal means committed, which state may be stale, and which event must trigger immediate fail-safe behavior.
Deep Dive into the concept
Model a request lifecycle under stress. Start with a normal request and trace each boundary crossing: client to local agent, local agent to authoritative node, authoritative node to peers, then commit/apply to the state machine. At each stage, ask what happens if a timeout, reordering, or partial loss occurs. Many implementation bugs come from collapsing distinct states into one boolean success. In distributed systems, accepted, replicated, and committed are different truths.
This concept also forces separation of control and data signals. A system may report a node as alive while the mutation path is unavailable, or report the mutation path as healthy while the query path returns stale data due to caching. The engineering work is not to eliminate all inconsistency windows but to bound and communicate them. API design should include explicit metadata such as indexes, terms, and status epochs.
Tie this concept to operator workflow. During incidents, teams need direct answers: what invariant is currently violated, what metric confirms it, and what remediation restores it. If design lacks observability on invariant transitions, incident MTTR increases sharply.
How this fits into the projects
This concept is foundational in this project and reappears in later integration tasks.
Definitions & key terms
- Invariant: property that must always hold.
- Authoritative state: source of truth used for correctness decisions.
- Stale state: delayed but potentially useful view.
- Transition epoch/index: monotonic marker of state progression.
Mental model diagram
request -> accepted -> replicated -> committed -> applied
How it works (step-by-step, with invariants and failure modes)
- Request enters system with intent.
- Local validation enforces schema and role constraints.
- Propagation path executes according to protocol.
- Commit condition is evaluated against invariant rule.
- State applies and becomes externally visible.
Failure modes:
- premature success response
- silent stale reads presented as fresh truth
- missing monotonic metadata for conflict resolution
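The step-by-step path above can be sketched as an explicit state type with a guarded transition function. This is a minimal illustration; the `State` and `Advance` names are assumptions, not part of any fixed API:

```go
// Sketch of the request lifecycle as explicit states, so that
// "accepted", "replicated", and "committed" cannot be collapsed
// into one boolean success.
package main

import (
	"errors"
	"fmt"
)

type State int

const (
	Accepted State = iota
	Replicated
	Committed
	Applied
)

// Advance enforces the invariant that state only moves forward one
// step at a time; any other transition is rejected.
func Advance(cur, next State) (State, error) {
	if next != cur+1 {
		return cur, errors.New("illegal transition")
	}
	return next, nil
}

func main() {
	s := Accepted
	for _, next := range []State{Replicated, Committed, Applied} {
		var err error
		if s, err = Advance(s, next); err != nil {
			panic(err)
		}
	}
	fmt.Println("final state:", s) // Applied is the last stage
}
```

The guard is what prevents the "premature success response" failure mode: a response can only claim a state the transition function actually reached.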
Minimal concrete example
PUT /resource/x
-> accepted index=41
-> replicated to peers
-> committed index=41
-> read path returns value with ModifyIndex=41
Common misconceptions
- If API returned success, state is globally safe.
- Eventually consistent means unpredictable.
Check-your-understanding questions
- What event marks authoritative success in this project?
- Which state can be stale but still useful?
- What metric would prove invariant health?
Check-your-understanding answers
- The protocol-defined commit condition.
- Read-side cache with explicit freshness metadata.
- Monotonic commit/index progression plus failure-rate bounds.
Real-world applications
- control-plane mutation APIs
- lock ownership verification
- secure routing policy updates
Where you’ll apply it
- This project implementation sections 3 to 6.
- Also used in: P12-consul-agent-capstone.md.
References
- Consul architecture docs.
- Raft or SWIM papers as applicable.
- RFC standards relevant to this project.
Key insights
Correctness is preserved by explicit state transitions and invariants, not by retry volume.
Summary
You now have a model for what must always be true, where uncertainty exists, and how to expose both safely.
Homework/Exercises to practice the concept
- Write the invariant list for this project in one page.
- Define one metric and one log field for each invariant.
Solutions to the homework/exercises
- Strong lists include commit, ownership, and authorization constraints.
- Metrics/log fields should be monotonic and attributable by request id.
Lock Delay and Safety
Fundamentals
This concept covers the main tradeoff surface in the project. Most failures happen because teams blindly optimize one axis and violate assumptions elsewhere. You need a shared language for describing tradeoffs before implementation.
Deep Dive into the concept
Identify the dominant cost and risk in this project. Discovery projects emphasize freshness versus load. Security projects emphasize strictness versus operability. Consensus projects emphasize liveness versus safety. Map each API and timer configuration to one side of that tradeoff.
A robust design makes choices explicit and testable. If the timeout is short, quantify the false-positive risk. If policy defaults are permissive during migration, define rollback and audit controls. If stale reads are allowed, make stale metadata mandatory and document the expected caller behavior.
Treat this concept as a policy engine for implementation decisions. Every parameter should have an owner, rationale, and failure drill. Production maturity is less about ideal defaults and more about predictable behavior under non-ideal conditions.
How this fits into the projects
Used directly in this project’s architecture and reused in federation, security, and capstone projects.
Definitions & key terms
- Tradeoff axis: pair of competing priorities.
- Guardrail: explicit constraint preventing unsafe optimization.
- Operational envelope: expected safe range of runtime conditions.
Mental model diagram
strict/fresh <---- SLO-guided choice ----> available/cheap
How it works
- Pick explicit SLO target.
- Choose parameter defaults aligned to SLO.
- Define guardrails and rollback thresholds.
- Validate with deterministic stress scenarios.
Failure modes:
- hidden defaults with no rationale
- tuning without baseline measurement
- policy changes without canary validation
Minimal concrete example
TTL=5s -> fresher routing, higher query load
TTL=30s -> lower load, slower failure visibility
selected TTL=10s after benchmark and outage drill
Common misconceptions
- There is one best timeout or TTL.
- Policy strictness always increases reliability.
Check-your-understanding questions
- Which tradeoff axis is primary in this project?
- What metric shows you have crossed the safe envelope?
- What rollback action should be pre-defined?
Check-your-understanding answers
- It depends on the project domain; document it in the architecture section.
- Error-rate, latency, or false-positive threshold crossing.
- Revert to last-known-good parameter or policy profile.
Real-world applications
- timeout and TTL tuning
- secure rollout of deny-by-default policies
- failover trigger calibration
Where you’ll apply it
- This project’s design decisions section.
- Also used in: P10-multi-datacenter-federation.md and P11-acl-tokens-and-policies.md.
References
- Consul operational docs for the relevant subsystem.
- CNCF and HashiCorp survey findings.
Key insights
Operational excellence is explicit tradeoff management with measured feedback loops.
Summary
If you can state your tradeoff and prove its behavior under fault drills, your design is production-oriented.
Homework/Exercises to practice the concept
- Define one aggressive and one conservative parameter profile.
- Run both profiles and compare three key metrics.
Solutions to the homework/exercises
- Aggressive profile favors speed; conservative profile favors stability.
- Use measurements to set baseline and emergency fallback profiles.
3. Project Specification
3.1 What You Will Build
You will implement session-backed advisory locks for distributed coordination patterns. The project includes deterministic CLI or API flows, observable state transitions, and explicit failure-mode behavior. It intentionally excludes cloud-specific deployment automation so the focus stays on protocol behavior.
3.2 Functional Requirements
- Implement the core protocol or workflow for this project domain.
- Expose user-visible interface with deterministic output.
- Emit observability data for transition debugging.
- Support one explicit failure scenario with predictable behavior.
3.3 Non-Functional Requirements
- Performance: stable under local load tests without unbounded queue growth.
- Reliability: deterministic success and failure responses.
- Usability: clear output showing state and reason codes.
3.4 Example Usage / Output
$ run project start
state=ready
$ run golden path command
result=success index=...
$ run failure path command
result=expected_safe_failure
3.5 Data Formats / Schemas / Protocols
Define request and response shapes that include identity/resource fields, monotonic markers, and structured error envelopes.
3.6 Edge Cases
- duplicate registration or idempotent replay
- timeout and partial acknowledgement
- stale token/cert/session/index depending on project type
- restart recovery with durability checks
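The duplicate-registration edge case is usually handled with request-id deduplication: a retried request replays the original outcome instead of mutating twice. A minimal sketch, assuming an in-memory cache keyed by request id (a durable engine would persist this map):

```go
package main

import "fmt"

// Dedup stores results of already-processed request ids.
type Dedup struct {
	seen map[string]string
}

func NewDedup() *Dedup { return &Dedup{seen: map[string]string{}} }

// Do runs op only the first time a request id appears; later calls
// with the same id replay the cached result.
func (d *Dedup) Do(reqID string, op func() string) string {
	if res, ok := d.seen[reqID]; ok {
		return res // replay: op is not re-run
	}
	res := op()
	d.seen[reqID] = res
	return res
}

func main() {
	d := NewDedup()
	calls := 0
	op := func() string { calls++; return "acquired" }
	fmt.Println(d.Do("req-1", op)) // acquired
	fmt.Println(d.Do("req-1", op)) // acquired (replayed)
	fmt.Println("op executed", calls, "time(s)")
}
```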
3.7 Real World Outcome
3.7.1 How to Run (Copy/Paste)
- start control processes
- run one golden path command
- run one failure path command
- verify output signatures
3.7.2 Golden Path Demo (Deterministic)
Use fixed fixture inputs and expected status transitions.
3.7.3 If CLI: provide an exact terminal transcript
$ ./project-07 start
ready
$ ./project-07 demo --golden
ok
$ ./project-07 demo --failure
expected_error
3.7.4 If Web App
N/A for this project family.
3.7.5 If API
Include endpoint table with one 2xx and one structured error example.
3.7.6 If Library
N/A.
3.7.7 If GUI/Desktop/Mobile
N/A.
3.7.8 If TUI
Optional if you add a terminal interface.
4. Solution Architecture
4.1 High-Level Design
client/test driver
|
v
project interface layer -> protocol/state engine -> persistence/cache
|
v
telemetry and diagnostics
4.2 Key Components
| Component | Responsibility | Key Decisions |
|---|---|---|
| Interface layer | validate requests and shape responses | deterministic error contract |
| Core engine | enforce protocol transitions | invariant-first processing |
| Persistence/cache | store and retrieve state | explicit consistency model |
| Telemetry | provide root-cause signals | structured fields and counters |
4.3 Data Structures (No Full Code)
ResourceState:
id
status
index_or_term
last_update
owner_identity
4.4 Algorithm Overview
Key Algorithm: transition and commit path
- Validate and authorize request.
- Execute transition according to protocol rules.
- Persist and publish monotonic metadata.
- Return deterministic response.
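The four steps above can be sketched as a single guarded apply function. The `Store` and `Apply` names are illustrative, and a real engine would persist durably rather than hold state in memory:

```go
package main

import (
	"errors"
	"fmt"
)

// Store keeps lock ownership plus a monotonic commit marker.
type Store struct {
	index uint64            // monotonic commit marker
	data  map[string]string // resource -> owner
}

// Apply validates the request, enforces the ownership invariant,
// persists the transition, and returns the new monotonic index as
// deterministic response metadata.
func (s *Store) Apply(resource, owner string) (uint64, error) {
	if resource == "" || owner == "" {
		return 0, errors.New("invalid request") // validation/authorization gate
	}
	if cur, held := s.data[resource]; held && cur != owner {
		return 0, errors.New("resource already owned") // invariant guard
	}
	s.index++ // publish monotonic metadata with the mutation
	s.data[resource] = owner
	return s.index, nil
}

func main() {
	s := &Store{data: map[string]string{}}
	idx, err := s.Apply("resource/x", "session-s1")
	fmt.Println(idx, err) // 1 <nil>
	_, err = s.Apply("resource/x", "session-s2")
	fmt.Println(err) // resource already owned
}
```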
Complexity Analysis:
- Time: proportional to peer/resource scope of operation.
- Space: proportional to active state plus metadata history.
5. Implementation Guide
5.1 Development Environment Setup
# install runtime, build toolchain, and CLI dependencies
# verify command line tools and local networking setup
5.2 Project Structure
project-07/
cmd/
internal/api/
internal/engine/
internal/storage/
internal/telemetry/
tests/
docs/
5.3 The Core Question You’re Answering
What exact mechanism turns uncertain distributed signals into safe, operable behavior for this project?
5.4 Concepts You Must Understand First
- Leases and Lock Ownership
- Lock Delay and Safety
- Failure model and observability coupling
5.5 Questions to Guide Your Design
- Which transitions are authoritative versus eventually observed?
- What metadata proves correctness to clients and operators?
- How does the system fail safely when assumptions break?
5.6 Thinking Exercise
Before coding, draw transition timelines for:
- normal operation
- timeout or delayed operation
- invalid authorization or ownership operation
5.7 The Interview Questions They’ll Ask
- Which invariant was hardest to preserve and why?
- How did you test failure behavior beyond happy path?
- What tradeoff did you choose and how did you validate it?
- How would you scale this design in production?
- Which metric predicts incident risk earliest?
5.8 Hints in Layers
- Hint 1: implement explicit state model before I/O details.
- Hint 2: add monotonic metadata to mutating responses.
- Hint 3: build deterministic failure fixture before load tests.
- Hint 4: keep protocol logs structured and correlation-id aware.
5.9 Books That Will Help
| Topic | Book | Chapter |
|---|---|---|
| Distributed reasoning | Designing Data-Intensive Applications | 5-9 |
| Network mechanics | Computer Networks | 5 |
| Service systems | Building Microservices | discovery and resilience chapters |
5.10 Implementation Phases
Phase 1: Foundation
- Define state model and interface schema.
- Build deterministic single-node behavior.
Checkpoint: local golden path passes.
Phase 2: Core Functionality
- Implement protocol engine and state path.
- Add observability fields.
Checkpoint: distributed or failure scenario passes.
Phase 3: Polish & Edge Cases
- Handle retries, timeouts, and edge conditions.
- Validate reproducibility and documentation.
Checkpoint: deterministic success and failure demos are stable.
5.11 Key Implementation Decisions
| Decision | Options | Recommendation | Rationale |
|---|---|---|---|
| Consistency mode | strict, stale, hybrid | hybrid with explicit metadata | balances latency and correctness |
| Failure handling | retry-only, fail-safe, fail-open | fail-safe by default | prevents hidden corruption |
| Telemetry shape | ad-hoc logs, structured logs | structured logs + metrics | faster root-cause analysis |
6. Testing Strategy
6.1 Test Categories
| Category | Purpose | Examples |
|---|---|---|
| Unit | validate local transitions | parser, evaluator, index checks |
| Integration | verify component interaction | request lifecycle and commit path |
| Edge Case | exercise failure behavior | timeout, partition, stale metadata |
6.2 Critical Test Cases
- Golden path success with deterministic output.
- Deliberate failure path with safe behavior.
- Recovery path that restores healthy state.
6.3 Test Data
fixed ids, fixed timestamps or seeds, known fixture responses
7. Common Pitfalls & Debugging
7.1 Frequent Mistakes
| Pitfall | Symptom | Solution |
|---|---|---|
| Missing invariant checks | intermittent incorrect success | enforce transition guards |
| Unclear metadata | hard root-cause analysis | include index/term/owner in responses |
| Over-aggressive tuning | flapping and false positives | calibrate with baseline measurements |
7.2 Debugging Strategies
- Trace one request end-to-end with correlation id.
- Compare observed state against invariant checklist.
- Replay deterministic failure fixture until explanation is clear.
7.3 Performance Traps
- unbounded watchers or retries
- synchronized timers causing burst load
- expensive per-request policy lookup without caching plan
8. Extensions & Challenges
8.1 Beginner Extensions
- Add richer status and error diagnostics.
- Add idempotency key handling for retries.
8.2 Intermediate Extensions
- Add metrics endpoint with SLO-oriented counters.
- Add chaos test script for one network and one resource fault.
8.3 Advanced Extensions
- Add multi-tenant isolation controls.
- Add production-like deployment topology simulation.
9. Real-World Connections
9.1 Industry Applications
- platform control-plane services
- secure service networking and governance
- resilient configuration and discovery workflows
9.2 Related Open Source Projects
- HashiCorp Consul
- HashiCorp memberlist and raft libraries
- Envoy for sidecar traffic control patterns
9.3 Interview Relevance
This project prepares discussion on invariants, failure tradeoffs, and operational readiness.
10. Resources
10.1 Essential Reading
- Consul architecture and subsystem docs
- Raft and SWIM papers
- Relevant RFC standards for DNS and TLS
10.2 Video Resources
- HashiCorp engineering talks on Consul internals
- distributed systems talks on incident-safe design
10.3 Tools & Documentation
- consul, curl, jq, dig, openssl
- packet tracing tools such as tcpdump and Wireshark
10.4 Related Projects in This Series
- Previous: P06-kv-watches-and-cas.md
- Next: P08-service-mesh-sidecar-proxy.md
11. Self-Assessment Checklist
11.1 Understanding
- I can explain the primary invariant from memory.
- I can explain the main tradeoff axis and chosen defaults.
- I can predict behavior for one major failure scenario.
11.2 Implementation
- Functional requirements are complete.
- Golden and failure demos are deterministic.
- Telemetry is sufficient for root-cause analysis.
- Edge cases are handled and documented.
11.3 Growth
- I documented one design change caused by evidence.
- I can explain this project in interview-level depth.
- I can map this project to production equivalents.
12. Submission / Completion Criteria
Minimum Viable Completion:
- one deterministic happy path
- one deterministic safe-failure path
- one observability view that explains state
Full Completion:
- all minimum criteria plus edge-case validation
- implementation decision rationale documented
Excellence (Going Above & Beyond):
- multi-fault scenario drill with stable behavior
- concise incident runbook with remediation order
13. Additional Content Rules (Applied)
- Determinism: fixed fixtures and repeatable demos.
- Outcome completeness: success and failure demo intent included.
- Cross-linking: related project references included.
- No placeholders: all sections are actionable.