Project 05: Hash Index (Extendable Hashing)
Implement extendable hashing with online growth and deterministic bucket split behavior.
Quick Reference
| Attribute | Value |
|---|---|
| Difficulty | Advanced |
| Time Estimate | 2 weeks |
| Main Programming Language | C++ |
| Alternative Programming Languages | Rust, Go |
| Coolness Level | See REFERENCE.md scale used in main sprint |
| Business Potential | Resume Gold |
| Prerequisites | Projects 1-3, hashing, bit-prefix logic |
| Key Topics | Directory depth, bucket split, equality lookup |
1. Learning Objectives
By completing this project, you will:
- Implement the core subsystem with deterministic behavior.
- Explain the subsystem invariants and failure boundaries.
- Build a repeatable validation workflow with golden outputs.
- Document tradeoffs and operational evidence.
2. All Theory Needed (Per-Concept Breakdown)
Core Storage and State Invariants
Fundamentals
Every database subsystem is a state machine over persistent and in-memory state. Correctness depends on explicit invariants, not just passing happy-path tests. You must define what state exists, what transitions are valid, and what recovery behavior is required when transitions are interrupted. For Hash Index (Extendable Hashing), this means identifying critical metadata, lifecycle transitions, and safety conditions before writing implementation details. A strong mental model starts from contracts: what this component promises to callers, and what it assumes from lower layers.
Deep Dive into the concept
Systems projects fail most often when invariants are implicit. The first hard requirement is state transparency: each important structure needs a schema, versioning strategy, and validation routine. For this project, define your state as a graph of entities and relationships, then classify each transition as idempotent, atomic, or recoverable. If you do not classify transitions, you cannot reason about crash points or retries.
The second requirement is boundary discipline. A subsystem should expose a small API with clear preconditions and postconditions. Inputs should be validated early; outputs should include enough metadata for diagnosis. Hidden side effects are expensive in database internals because downstream components assume deterministic behavior. If a method mutates shared state unexpectedly, concurrency and recovery failures follow.
Third, introduce deterministic test fixtures from day one. Randomized tests are useful, but deterministic seeds and golden transcripts are essential for debugging and regression checks. For each critical operation, define one normal path, one edge path, and one failure path. Capture expected outcomes in machine-readable form so you can rerun quickly after every change.
Fourth, treat observability as part of implementation. Add counters, state snapshots, and structured logs for transition boundaries. In database systems, poor observability turns simple bugs into multi-day hunts. Your logs should answer: what happened, in what order, to which state object, and under which operation ID.
Fifth, include compatibility planning. Today you implement a learning version; tomorrow you may add new metadata fields or algorithm variants. Reserve extension fields or version tags now to avoid destructive migrations later. This is practical risk management in storage-oriented systems.
Finally, connect theory to runtime cost. Correctness is mandatory, but efficiency determines usability. Understand where your design spends CPU, memory, and I/O. Track both asymptotic behavior and real bottlenecks under expected workloads.
How this fits into the projects
- Directly powers Project 05 in the main guide.
- Reused by later projects that depend on this subsystem.
Definitions & key terms
- Invariant: condition that must always hold.
- Transition: state change caused by one operation.
- Idempotence: repeated execution yields the same final result.
- Deterministic test: fixed seed/input produces identical output.
Mental model diagram
Input -> Validate -> State Transition -> Persist/Expose -> Verify
            |               |                  |              |
            v               v                  v              v
       reject bad     enforce rules      emit metrics   assert invariants
How it works (step-by-step, with invariants and failure modes)
- Define state structures and invariants.
- Implement operation transitions with explicit pre/post checks.
- Add deterministic tests and golden outputs.
- Add fault-injection tests for interrupted transitions.
- Add observability hooks and iterate on bottlenecks.
Minimal concrete example
Pseudo-operation:
OP(request)
- validate request shape and required references
- read current state snapshot
- compute next state deterministically
- persist next state with integrity marker
- emit operation metric and structured event
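The pseudo-operation above can be sketched in C++. This is a toy in-memory version under assumed names (`State`, `Request`, `integrity`); a real implementation would persist the state and use a proper checksum such as a CRC:

```cpp
#include <cstdint>
#include <optional>

// Illustrative state and request types; field names are assumptions.
struct State { uint64_t version = 0; uint64_t value = 0; uint64_t checksum = 0; };
struct Request { std::optional<uint64_t> delta; };

// Toy integrity marker standing in for a real checksum.
uint64_t integrity(const State& s) { return s.version * 31 + s.value; }

enum class Status { Ok, InvalidRequest, CorruptState };

Status op(const Request& req, State& s) {
    // 1. Validate request shape before touching state.
    if (!req.delta) return Status::InvalidRequest;
    // 2. Verify the current snapshot's integrity marker.
    if (s.checksum != integrity(s)) return Status::CorruptState;
    // 3. Compute the next state deterministically.
    State next = s;
    next.value += *req.delta;
    next.version += 1;
    // 4. "Persist" the next state with a fresh integrity marker.
    next.checksum = integrity(next);
    s = next;
    // 5. A real implementation would emit a metric/structured event here.
    return Status::Ok;
}
```

Note that the update is computed on a copy and applied in one assignment, so a failed validation never leaves partially mutated state.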
Common misconceptions
- “If tests pass once, invariants are fine.” -> Need restart and adversarial tests.
- “Observability can be added later.” -> Late telemetry creates blind spots.
Check-your-understanding questions
- Why must invariants be explicit before coding?
- What makes a transition safely retryable?
- Why is deterministic output critical?
Check-your-understanding answers
- They define correctness boundaries and prevent contract drift.
- Idempotence or explicit compensation logic.
- It enables reproducible debugging and regression detection.
Real-world applications
- Storage engines, lock managers, and query planners all depend on explicit invariants.
Where you’ll apply it
- Main guide Project 05 and downstream integrations.
References
- Database System Concepts Ch. 14
- Database Internals by Alex Petrov.
Key insights
Correctness in database internals is explicit state management plus disciplined transitions.
Summary
This concept provides the reliability and debuggability foundation for the project.
Homework/Exercises to practice the concept
- List five invariants specific to this project.
- Design one deterministic happy path and one deterministic failure path test.
- Identify one transition that must be idempotent and explain why.
Solutions to the homework/exercises
- Include identity, ordering, integrity, and lifecycle constraints.
- Use fixed seeds/datasets and record expected outputs.
- Recovery or retry transitions must be idempotent.
Algorithmic and Operational Tradeoffs
Fundamentals
Database engineering is a sequence of tradeoffs between throughput, latency, memory footprint, and correctness risk. For Hash Index (Extendable Hashing), each design choice changes operational behavior. A simpler algorithm may be easier to prove correct but slower; a faster algorithm may require richer metadata and more failure handling. Define workload assumptions and objective metrics before choosing implementation details.
Deep Dive into the concept
Operational tradeoffs are easiest to manage by writing objective functions. If your workload is read-heavy with skewed access, cache locality and selective paths dominate. If writes dominate, write amplification and contention control matter more. Start by stating what “good” means for this project: low p95 latency, bounded memory growth, deterministic recovery time, or weighted combinations.
Map algorithm choices to these objectives. Aggressive caching may reduce misses but increase metadata overhead. Rich indexing can reduce query latency but increase update cost. Stronger concurrency controls improve correctness but can reduce throughput under contention. None is universally correct.
Benchmarking should mirror expected usage. Include edge-case traces and realistic mixed traces. Measure averages and tail percentiles. Systems that look fast on average may still be operationally poor if they have unpredictable spikes. Tie benchmark results to root causes using metrics and logs.
Failure behavior is part of quality. A fast path that fails catastrophically under one crash point is worse than a slower but predictable path. Include interruption drills where relevant and capture recovery correctness and time.
Document decisions: options considered, chosen option, and rationale. This makes future evolution safer and interview explanations stronger.
How this fits into the projects
- Guides implementation choices and tuning strategy.
Definitions & key terms
- Write amplification: extra bytes written per logical write.
- Tail latency: p95/p99 response time.
- Contention: throughput loss from synchronization conflicts.
Mental model diagram
Workload assumptions -> candidate designs -> benchmark -> pick design -> validate failure behavior
How it works
- Define workload and objective metrics.
- Enumerate design options.
- Benchmark deterministic traces.
- Evaluate steady-state and failure-state behavior.
- Record decision and rationale.
Minimal concrete example
Decision: choose Clock over LRU.
Evidence: similar hit ratio, lower overhead, better mixed-scan behavior.
Risk: less precise recency tracking.
Mitigation: add scan-resistant admission rule.
Common misconceptions
- “Faster microbenchmark means better system.” -> Not if failure/tail behavior is weak.
- “One benchmark is enough.” -> Need mixed and adversarial scenarios.
Check-your-understanding questions
- Why are tail percentiles important?
- Give one correctness-performance coupling in this project.
- Why keep a decision log?
Check-your-understanding answers
- They reflect user-visible worst-case behavior.
- Durability or lock safety often increases runtime overhead.
- To preserve rationale and ease future tuning.
Real-world applications
- Production tuning of caches, indexes, and transaction settings.
Where you’ll apply it
- In implementation phases and performance validation.
References
- Designing Data-Intensive Applications by Martin Kleppmann.
- Database System Concepts Ch. 14
Key insights
Good subsystem choices come from evidence-backed tradeoffs, not intuition alone.
Summary
This concept helps you choose and defend implementation decisions.
Homework/Exercises to practice the concept
- Build a benchmark matrix with three workload profiles.
- Write one decision note including options and evidence.
Solutions to the homework/exercises
- Use read-heavy, write-heavy, and skewed mixed traces.
- Link selected option to p95 latency and correctness evidence.
3. Project Specification
3.1 What You Will Build
Implement extendable hashing with online growth and deterministic bucket split behavior.
Included:
- Working subsystem implementation with deterministic tests.
- Metrics and logging for key transitions.
- Failure-path verification.
Excluded:
- Full distributed behavior.
- Vendor-specific managed service integrations.
3.2 Functional Requirements
- Implement subsystem core API and state transitions.
- Provide deterministic golden scenario output.
- Support restart or re-run correctness where applicable.
- Emit enough metrics for debugging and comparison.
3.3 Non-Functional Requirements
- Performance: baseline scenario completes within project budget.
- Reliability: deterministic output under fixed seed.
- Usability: clear CLI diagnostics.
3.4 Example Usage / Output
$ ./db_lab hash-demo --seed 22
3.5 Data Formats / Schemas / Protocols
Use explicit schemas for persistent structures: version, identity fields, integrity markers, and status flags.
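One possible shape for such a schema is a fixed-width bucket page header. The field names and widths below are illustrative, not a prescribed format:

```cpp
#include <cstdint>

// Sketch of an on-disk bucket page header with version, identity,
// integrity, and status fields; layout is an assumption for illustration.
#pragma pack(push, 1)
struct BucketPageHeader {
    uint32_t magic;        // identity: fixed constant identifying page type
    uint16_t format_ver;   // version tag reserved for future evolution
    uint16_t status;       // status flags (e.g. mid-split, dirty)
    uint32_t local_depth;  // hash bits this bucket is responsible for
    uint32_t num_entries;  // live key-rid pairs in the page body
    uint64_t checksum;     // integrity marker over header + body
};
#pragma pack(pop)

static_assert(sizeof(BucketPageHeader) == 24, "header layout is fixed-size");
```

A `static_assert` on the size catches accidental layout changes at compile time, which is cheaper than discovering them during recovery.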
3.6 Edge Cases
- Empty initialization state.
- Maximum capacity boundaries.
- Interrupted operation and restart.
- Invalid input handling.
3.7 Real World Outcome
3.7.1 How to Run (Copy/Paste)
$ cmake -S . -B build && cmake --build build
$ ./db_lab hash-demo --seed 22
3.7.2 Golden Path Demo (Deterministic)
[insert] 50000 key-rid pairs
[stats] global_depth=11 buckets=231
[probe] key=701337 found=true
[split] total_bucket_splits=230
[result] PASS
3.7.3 If CLI: exact terminal transcript
$ ./db_lab hash-demo --seed 22
[insert] 50000 key-rid pairs
[stats] global_depth=11 buckets=231
[probe] key=701337 found=true
[split] total_bucket_splits=230
[result] PASS
3.7.4 Failure Demo
$ ./db_lab hash-demo --seed 22 --inject-failure metadata-corruption
[error] integrity validation failed
[exit] code=2
4. Solution Architecture
4.1 High-Level Design
Input/Request -> Validation -> Core State Machine -> Persistence/Cache -> Metrics + Output
4.2 Key Components
| Component | Responsibility | Key Decisions |
|---|---|---|
| API Layer | Validate input and invoke subsystem operations | Fail fast on invalid arguments |
| Core Engine | Execute deterministic transitions | Explicit invariants and status codes |
| Storage/State Adapter | Persist/load state | Versioned metadata and integrity checks |
| Observability | Metrics and trace events | Deterministic event format |
4.3 Data Structures (No Full Code)
Entity:
- id
- status
- payload
- version
- integrity_marker
4.4 Algorithm Overview
- Validate request and state.
- Compute deterministic transition.
- Apply transition with invariant checks.
- Persist and emit metrics.
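The four steps above, specialized to extendable hashing, can be sketched as a small in-memory structure. This is a learning sketch with an assumed bucket capacity of 4 and no persistence, integrity checks, or metrics; all names are ours:

```cpp
#include <cstdint>
#include <memory>
#include <vector>

struct Bucket {
    uint32_t local_depth = 0;
    std::vector<uint64_t> keys;
};

class ExtHash {
    uint32_t global_depth_ = 0;
    std::vector<std::shared_ptr<Bucket>> dir_{std::make_shared<Bucket>()};
    static constexpr size_t kCap = 4;  // tiny capacity to force splits

    // The low global_depth_ bits of the key index the directory.
    size_t slot(uint64_t key) const {
        return key & ((1ull << global_depth_) - 1);
    }

public:
    void insert(uint64_t key) {
        auto b = dir_[slot(key)];
        if (b->keys.size() < kCap) { b->keys.push_back(key); return; }
        if (b->local_depth == global_depth_) {
            // Directory doubles: copy pointers only, no keys move.
            size_t n = dir_.size();
            dir_.resize(n * 2);
            for (size_t i = 0; i < n; ++i) dir_[i + n] = dir_[i];
            ++global_depth_;
        }
        // Split: redistribute only this bucket's keys by the next bit.
        auto b0 = std::make_shared<Bucket>(), b1 = std::make_shared<Bucket>();
        b0->local_depth = b1->local_depth = b->local_depth + 1;
        uint64_t bit = 1ull << b->local_depth;  // b still has the old depth
        for (uint64_t k : b->keys) (k & bit ? b1 : b0)->keys.push_back(k);
        for (size_t i = 0; i < dir_.size(); ++i)
            if (dir_[i] == b) dir_[i] = (i & bit) ? b1 : b0;
        insert(key);  // retry; may split again under pathological skew
    }

    bool contains(uint64_t key) const {
        for (uint64_t k : dir_[slot(key)]->keys)
            if (k == key) return true;
        return false;
    }
    uint32_t global_depth() const { return global_depth_; }
};
```

The recursive retry after a split is the simplest way to stay correct when every key in the overflowing bucket lands on the same side; production code usually loops instead.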
5. Implementation Guide
5.1 Development Environment Setup
$ cmake -S . -B build
$ cmake --build build
$ ctest --test-dir build
5.2 Project Structure
project-root/
├── src/
│ ├── core/
│ ├── io/
│ ├── api/
│ └── metrics/
├── tests/
│ ├── unit/
│ └── integration/
└── docs/
5.3 The Core Question You’re Answering
“How does dynamic hashing scale without full-table rehash?”
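The answer hinges on who pays for growth. A doubling of the directory copies bucket pointers but rehashes zero keys, and a split rehashes only the one overflowing bucket. A toy cost model makes this explicit (names and the split trigger are illustrative):

```cpp
#include <cstdint>

// Work done by one grow-and-split step in extendable hashing, assuming
// a directory of 2^global_depth slots and an overflowing bucket holding
// bucket_keys entries.
struct GrowthCost {
    uint64_t pointer_copies = 0;  // directory work (doubling)
    uint64_t keys_rehashed = 0;   // bucket work (split)
};

GrowthCost split_cost(uint32_t global_depth, uint64_t bucket_keys,
                      bool directory_doubles) {
    GrowthCost c;
    if (directory_doubles) c.pointer_copies = 1ull << global_depth;
    c.keys_rehashed = bucket_keys;  // never the whole table
    return c;
}
```

Compare this with static hashing, where an overflowing table rehashes every stored key at once; extendable hashing amortizes growth into many small, bounded steps.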
5.4 Concepts You Must Understand First
- State invariants and transition boundaries.
- Failure modes and deterministic recovery behavior.
- Algorithmic tradeoffs for expected workloads.
5.5 Questions to Guide Your Design
- What invariants are non-negotiable?
- Which operation is highest risk under failure?
- Which metrics prove your design under load?
5.6 Thinking Exercise
Write one transition diagram and annotate failure points with expected behavior.
5.7 The Interview Questions They’ll Ask
- What invariant is hardest to maintain here?
- What performance tradeoff did you choose?
- How did you validate failure correctness?
- Which metric justified your architecture?
- How would this scale in production?
5.8 Hints in Layers
Hint 1: Build deterministic baseline first
Hint 2: Add invariants as executable assertions
Hint 3: Introduce failure injection early
Hint 4: Keep a concise design decision log
5.9 Books That Will Help
| Topic | Book | Chapter |
|---|---|---|
| Core concept | Database System Concepts | Ch. 14 |
| Practical internals | Database Internals | Related chapter |
| Reliability mindset | Designing Data-Intensive Applications | Reliability sections |
5.10 Implementation Phases
Phase 1: Foundation
- Define schemas, invariants, and baseline tests.
Phase 2: Core Functionality
- Implement operations and deterministic scenario.
Phase 3: Robustness
- Add failure-path tests and performance instrumentation.
5.11 Key Implementation Decisions
| Decision | Options | Recommendation | Rationale |
|---|---|---|---|
| Metadata versioning | implicit / explicit | explicit | safer evolution |
| Error handling | silent / structured | structured | deterministic debugging |
| Validation timing | late / early | early | fail-fast correctness |
6. Testing Strategy
6.1 Test Categories
| Category | Purpose | Examples |
|---|---|---|
| Unit Tests | Per-operation correctness | transition assertions |
| Integration Tests | End-to-end validation | golden scenario transcript |
| Edge Case Tests | Boundary/failure coverage | invalid inputs, restarts |
6.2 Critical Test Cases
- Golden deterministic scenario.
- Restart/re-run consistency.
- Failure injection near persistence boundary.
- Concurrency stress where applicable.
6.3 Test Data
Fixed-seed synthetic fixtures with expected summary hashes.
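A fixed-seed fixture generator with a summary hash might look like the following sketch. `std::mt19937_64` output is standardized, so the same seed yields the same keys on every platform; FNV-1a is used here only as a simple, stable summary:

```cpp
#include <cstdint>
#include <random>
#include <vector>

// Deterministic fixture: identical seed and size always reproduce
// the same key sequence, so golden outputs stay stable across runs.
std::vector<uint64_t> make_fixture(uint64_t seed, size_t n) {
    std::mt19937_64 rng(seed);
    std::vector<uint64_t> keys(n);
    for (auto& k : keys) k = rng();
    return keys;
}

// FNV-1a over the little-endian bytes of each key; the result is the
// fixture's expected summary hash recorded in the golden output.
uint64_t fnv1a_summary(const std::vector<uint64_t>& keys) {
    uint64_t h = 1469598103934665603ull;      // FNV-1a offset basis
    for (uint64_t k : keys)
        for (int i = 0; i < 8; ++i) {
            h ^= (k >> (i * 8)) & 0xff;
            h *= 1099511628211ull;            // FNV-1a prime
        }
    return h;
}
```

Store the summary hash alongside each fixture; a mismatch after a refactor tells you the input changed before you start debugging the subsystem itself.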
7. Common Pitfalls & Debugging
7.1 Frequent Mistakes
| Pitfall | Symptom | Solution |
|---|---|---|
| Implicit invariant | random wrong results | executable invariant checks |
| Missing deterministic fixture | flaky tests | fixed seeds and golden outputs |
| Weak observability | slow debugging | structured logs and counters |
7.2 Debugging Strategies
- Replay deterministic transcript and diff state snapshots.
- Enable verbose logs for one failing operation ID.
- Use invariant dump command before/after transition.
7.3 Performance Traps
- Optimizing before baseline correctness.
- Ignoring p95/p99 latency.
- Underestimating metadata overhead.
8. Extensions & Challenges
8.1 Beginner Extensions
- Add one extra diagnostic command.
- Add one boundary-case fixture.
8.2 Intermediate Extensions
- Implement a second algorithmic variant.
- Add machine-readable explain output.
8.3 Advanced Extensions
- Add online maintenance path.
- Add high-contention benchmark with mitigation.
9. Real-World Connections
9.1 Industry Applications
This project maps directly to production DBMS subsystems found in systems such as PostgreSQL, MySQL/InnoDB, and SQLite.
9.2 Related Open Source Projects
- PostgreSQL
- SQLite
- CMU Bustub
9.3 Interview Relevance
Covers system design, correctness, performance, and failure-mode reasoning expected in database/backend interviews.
10. Resources
10.1 Essential Reading
- Database System Concepts Ch. 14
- Database Internals by Alex Petrov
10.2 Video Resources
- CMU 15-445/645 Intro to Database Systems lectures.
10.3 Tools & Documentation
- Repository test harness and docs.
- PostgreSQL and SQLite official documentation.
10.4 Related Projects in This Series
- Previous project provides prerequisites.
- Next project consumes this subsystem.
11. Self-Assessment Checklist
11.1 Understanding
- I can explain two critical invariants from memory.
- I can explain one failure mode and mitigation.
- I can justify one design decision with metrics.
11.2 Implementation
- Functional requirements are met.
- Golden scenario passes.
- Edge and failure tests pass.
11.3 Growth
- I documented one tradeoff to revisit.
- I can explain this project in interview format.
12. Submission / Completion Criteria
Minimum Viable Completion:
- Deterministic golden scenario passes.
- Core invariants are enforced.
- Basic failure handling is documented.
Full Completion:
- Includes benchmark evidence and decision log.
- Includes restart/failure-path verification.
Excellence (Going Above & Beyond):
- Includes algorithm comparison with recommendation.
- Includes robust observability and concise technical report.