Project 12: Statistics and Cost-Based Optimizer

Build stats collection and cost-based plan selection with explain diagnostics.

Quick Reference

Attribute                         | Value
Difficulty                        | Master
Time Estimate                     | 3-4 weeks
Main Programming Language         | C++
Alternative Programming Languages | Rust, Go
Coolness Level                    | See REFERENCE.md scale used in main sprint
Business Potential                | Resume Gold
Prerequisites                     | Projects 6-8, probability basics, join planning
Key Topics                        | Cardinality estimation, cost models, plan search

1. Learning Objectives

By completing this project, you will:

  1. Implement the core subsystem with deterministic behavior.
  2. Explain the subsystem invariants and failure boundaries.
  3. Build a repeatable validation workflow with golden outputs.
  4. Document tradeoffs and operational evidence.

2. All Theory Needed (Per-Concept Breakdown)

Core Storage and State Invariants

Fundamentals Every database subsystem is a state machine over persistent and in-memory state. Correctness depends on explicit invariants, not just passing happy-path tests. You must define what state exists, what transitions are valid, and what recovery behavior is required when transitions are interrupted. For the statistics and cost-based optimizer, this means identifying critical metadata, lifecycle transitions, and safety conditions before writing implementation details. A strong mental model starts from contracts: what this component promises to callers, and what it assumes from lower layers.

Deep Dive into the concept Systems projects fail most often when invariants are implicit. The first hard requirement is state transparency: each important structure needs a schema, versioning strategy, and validation routine. For this project, define your state as a graph of entities and relationships, then classify each transition as idempotent, atomic, or recoverable. If you do not classify transitions, you cannot reason about crash points or retries.

The second requirement is boundary discipline. A subsystem should expose a small API with clear preconditions and postconditions. Inputs should be validated early; outputs should include enough metadata for diagnosis. Hidden side effects are expensive in database internals because downstream components assume deterministic behavior. If a method mutates shared state unexpectedly, concurrency and recovery failures follow.

Third, introduce deterministic test fixtures from day one. Randomized tests are useful, but deterministic seeds and golden transcripts are essential for debugging and regression checks. For each critical operation, define one normal path, one edge path, and one failure path. Capture expected outcomes in machine-readable form so you can rerun quickly after every change.

Fourth, treat observability as part of implementation. Add counters, state snapshots, and structured logs for transition boundaries. In database systems, poor observability turns simple bugs into multi-day hunts. Your logs should answer: what happened, in what order, to which state object, and under which operation ID.

Fifth, include compatibility planning. Today you implement a learning version; tomorrow you may add new metadata fields or algorithm variants. Reserve extension fields or version tags now to avoid destructive migrations later. This is practical risk management in storage-oriented systems.

Finally, connect theory to runtime cost. Correctness is mandatory, but efficiency determines usability. Understand where your design spends CPU, memory, and I/O. Track both asymptotic behavior and real bottlenecks under expected workloads.
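To make "explicit invariants" concrete, here is a minimal sketch of a per-column statistics record with an executable validation routine. `ColumnStats` and its field names are illustrative assumptions, not a prescribed schema; the point is that each invariant is a check the code can run on load and after every transition.

```cpp
#include <cstdint>
#include <string>

// Hypothetical per-column statistics record in the optimizer's catalog.
// Field names are illustrative, not a required schema.
struct ColumnStats {
    uint32_t version;        // schema version tag, reserved for migrations
    uint64_t table_rows;     // row count of the owning table
    uint64_t distinct_count; // estimated number of distinct values
    uint64_t null_count;     // number of NULL entries
};

// Executable invariant check, run on load and after every transition.
// Returns a reason string so logs can report *which* invariant failed.
std::string validate(const ColumnStats& s) {
    if (s.version == 0) return "version must be >= 1";
    if (s.distinct_count > s.table_rows) return "distinct_count exceeds table_rows";
    if (s.null_count > s.table_rows) return "null_count exceeds table_rows";
    return "";  // empty string: all invariants hold
}
```

Returning a reason string rather than a bare bool keeps the observability requirement above cheap to satisfy.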

How this fits into the projects

  • Directly powers Project 12 in the main guide.
  • Reused by later projects that depend on this subsystem.

Definitions & key terms

  • Invariant: condition that must always hold.
  • Transition: state change caused by one operation.
  • Idempotence: repeated execution yields the same final result.
  • Deterministic test: fixed seed/input produces identical output.

Mental model diagram

Input -> Validate -> State Transition -> Persist/Expose -> Verify
          |               |                  |            |
          v               v                  v            v
      reject bad      enforce rules      emit metrics   assert invariants

How it works (step-by-step, with invariants and failure modes)

  1. Define state structures and invariants.
  2. Implement operation transitions with explicit pre/post checks.
  3. Add deterministic tests and golden outputs.
  4. Add fault-injection tests for interrupted transitions.
  5. Add observability hooks and iterate on bottlenecks.

Minimal concrete example

Pseudo-operation:
OP(request)
- validate request shape and required references
- read current state snapshot
- compute next state deterministically
- persist next state with integrity marker
- emit operation metric and structured event
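The pseudo-operation above can be sketched as a deterministic transition in C++. The types (`AnalyzeRequest`, `StatsEntry`) and the in-memory catalog are assumptions for illustration; a real implementation would sit behind the persistence adapter.

```cpp
#include <cstdint>
#include <map>
#include <string>
#include <vector>

struct AnalyzeRequest { std::string table; uint64_t row_count; };
struct StatsEntry { uint64_t row_count; uint64_t version; };

// Illustrative in-memory state: one stats entry per table name.
using StatsCatalog = std::map<std::string, StatsEntry>;

// OP(request) as a deterministic transition; the event log stands in
// for structured telemetry.
bool apply_analyze(StatsCatalog& catalog, const AnalyzeRequest& req,
                   std::vector<std::string>& events) {
    // 1. Validate request shape.
    if (req.table.empty()) { events.push_back("reject: empty table name"); return false; }
    // 2. Read the current snapshot; 3. compute next state deterministically.
    auto it = catalog.find(req.table);
    StatsEntry next{req.row_count, it == catalog.end() ? 1 : it->second.version + 1};
    // 4. Persist next state (the version tag doubles as a simple marker).
    catalog[req.table] = next;
    // 5. Emit a structured event.
    events.push_back("analyze table=" + req.table +
                     " rows=" + std::to_string(next.row_count) +
                     " version=" + std::to_string(next.version));
    return true;
}
```

Note that every branch, including rejection, emits an event, so a transcript diff can localize a failing operation.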

Common misconceptions

  • “If tests pass once, invariants are fine.” -> Need restart and adversarial tests.
  • “Observability can be added later.” -> Late telemetry creates blind spots.

Check-your-understanding questions

  1. Why must invariants be explicit before coding?
  2. What makes a transition safely retryable?
  3. Why is deterministic output critical?

Check-your-understanding answers

  1. They define correctness boundaries and prevent contract drift.
  2. Idempotence or explicit compensation logic.
  3. It enables reproducible debugging and regression detection.

Real-world applications

  • Storage engines, lock managers, and query planners all depend on explicit invariants.

Where you’ll apply it

  • Main guide Project 12 and downstream integrations.

References

  • Database System Concepts, Ch. 16, plus the System R paper (Selinger et al., “Access Path Selection in a Relational Database Management System”)
  • Database Internals by Alex Petrov

Key insights Correctness in database internals is explicit state management plus disciplined transitions.

Summary This concept provides the reliability and debuggability foundation for the project.

Homework/Exercises to practice the concept

  1. List five invariants specific to this project.
  2. Design one deterministic happy path and one deterministic failure path test.
  3. Identify one transition that must be idempotent and explain why.

Solutions to the homework/exercises

  1. Candidate invariants include identity, ordering, integrity, lifecycle, and capacity constraints.
  2. Use fixed seeds/datasets and record expected outputs.
  3. Recovery or retry transitions must be idempotent.

Algorithmic and Operational Tradeoffs

Fundamentals Database engineering is a sequence of tradeoffs between throughput, latency, memory footprint, and correctness risk. For the statistics and cost-based optimizer, each design choice changes operational behavior. A simpler algorithm may be easier to prove correct but slower; a faster algorithm may require richer metadata and more failure handling. Define workload assumptions and objective metrics before choosing implementation details.

Deep Dive into the concept Operational tradeoffs are easiest to manage by writing objective functions. If your workload is read-heavy with skewed access, cache locality and selective paths dominate. If writes dominate, write amplification and contention control matter more. Start by stating what “good” means for this project: low p95 latency, bounded memory growth, deterministic recovery time, or weighted combinations.

Map algorithm choices to these objectives. Aggressive caching may reduce misses but increase metadata overhead. Rich indexing can reduce query latency but increase update cost. Stronger concurrency controls improve correctness but can reduce throughput under contention. None is universally correct.

Benchmarking should mirror expected usage. Include edge-case traces and realistic mixed traces. Measure averages and tail percentiles. Systems that look fast on average may still be operationally poor if they have unpredictable spikes. Tie benchmark results to root causes using metrics and logs.

Failure behavior is part of quality. A fast path that fails catastrophically under one crash point is worse than a slower but predictable path. Include interruption drills where relevant and capture recovery correctness and time.

Document decisions: options considered, chosen option, and rationale. This makes future evolution safer and interview explanations stronger.
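Since the benchmarking advice above stresses tail percentiles over averages, here is a small helper for computing them from a deterministic latency trace. It uses the nearest-rank method; other percentile definitions (interpolated ranks) exist, so treat the exact convention as an assumption.

```cpp
#include <algorithm>
#include <cmath>
#include <stdexcept>
#include <vector>

// Tail percentile from a benchmark trace (nearest-rank method).
// p is in (0, 100]; e.g. p = 95 gives the p95 latency.
double percentile(std::vector<double> samples, double p) {
    if (samples.empty()) throw std::invalid_argument("empty trace");
    std::sort(samples.begin(), samples.end());
    // Nearest rank: ceil(p/100 * N), converted to a 0-based index.
    size_t rank = static_cast<size_t>(std::ceil(p / 100.0 * samples.size()));
    if (rank == 0) rank = 1;
    return samples[rank - 1];
}
```

Record p50, p95, and p99 together for each trace; a design whose average looks fine can still show the unpredictable spikes the text warns about.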

How this fits into the projects

  • Guides implementation choices and tuning strategy.

Definitions & key terms

  • Write amplification: extra bytes written per logical write.
  • Tail latency: p95/p99 response time.
  • Contention: throughput loss from synchronization conflicts.

Mental model diagram

Workload assumptions -> candidate designs -> benchmark -> pick design -> validate failure behavior

How it works

  1. Define workload and objective metrics.
  2. Enumerate design options.
  3. Benchmark deterministic traces.
  4. Evaluate steady-state and failure-state behavior.
  5. Record decision and rationale.

Minimal concrete example

Decision: choose Clock over LRU.
Evidence: similar hit ratio, lower overhead, better mixed-scan behavior.
Risk: less precise recency tracking.
Mitigation: add scan-resistant admission rule.
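To ground the sample decision, a minimal sketch of Clock (second-chance) eviction is shown below. It illustrates the "lower overhead" claim: one reference bit per frame instead of LRU's recency-ordered list. The class shape is an assumption for illustration, not a prescribed interface.

```cpp
#include <cstddef>
#include <vector>

// Minimal Clock (second-chance) eviction over a fixed set of frames.
class ClockReplacer {
public:
    explicit ClockReplacer(size_t frames) : ref_(frames, false), hand_(0) {}

    // Called on access: give the frame a second chance.
    void touch(size_t frame) { ref_[frame] = true; }

    // Pick a victim: sweep the hand, clearing reference bits until a
    // frame with a cleared bit is found. Terminates within two sweeps.
    size_t victim() {
        while (true) {
            if (!ref_[hand_]) {
                size_t v = hand_;
                hand_ = (hand_ + 1) % ref_.size();
                return v;
            }
            ref_[hand_] = false;
            hand_ = (hand_ + 1) % ref_.size();
        }
    }

private:
    std::vector<bool> ref_;  // one reference bit per frame
    size_t hand_;            // current clock hand position
};
```

The mitigation in the decision note (a scan-resistant admission rule) would sit in front of `touch`, refusing to set the bit for one-shot scan accesses.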

Common misconceptions

  • “Faster microbenchmark means better system.” -> Not if failure/tail behavior is weak.
  • “One benchmark is enough.” -> Need mixed and adversarial scenarios.

Check-your-understanding questions

  1. Why are tail percentiles important?
  2. Give one correctness-performance coupling in this project.
  3. Why keep a decision log?

Check-your-understanding answers

  1. They reflect user-visible worst-case behavior.
  2. Durability or lock safety often increases runtime overhead.
  3. To preserve rationale and ease future tuning.

Real-world applications

  • Production tuning of caches, indexes, and transaction settings.

Where you’ll apply it

  • In implementation phases and performance validation.

References

  • Designing Data-Intensive Applications by Martin Kleppmann.
  • Database System Concepts, Ch. 16, plus the System R paper

Key insights Good subsystem choices come from evidence-backed tradeoffs, not intuition alone.

Summary This concept helps you choose and defend implementation decisions.

Homework/Exercises to practice the concept

  1. Build a benchmark matrix with three workload profiles.
  2. Write one decision note including options and evidence.

Solutions to the homework/exercises

  1. Use read-heavy, write-heavy, and skewed mixed traces.
  2. Link selected option to p95 latency and correctness evidence.

3. Project Specification

3.1 What You Will Build

Build stats collection and cost-based plan selection with explain diagnostics.

Included:

  • Working subsystem implementation with deterministic tests.
  • Metrics and logging for key transitions.
  • Failure-path verification.

Excluded:

  • Full distributed behavior.
  • Vendor-specific managed service integrations.

3.2 Functional Requirements

  1. Implement subsystem core API and state transitions.
  2. Provide deterministic golden scenario output.
  3. Support restart or re-run correctness where applicable.
  4. Emit enough metrics for debugging and comparison.

3.3 Non-Functional Requirements

  • Performance: baseline scenario completes within project budget.
  • Reliability: deterministic output under fixed seed.
  • Usability: clear CLI diagnostics.

3.4 Example Usage / Output

$ ./db_lab cbo-demo --seed 808

3.5 Data Formats / Schemas / Protocols

Use explicit schemas for persistent structures: version, identity fields, integrity markers, and status flags.
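A minimal sketch of such a record, assuming FNV-1a as the integrity marker (a production system would likely use CRC32 or a stronger checksum); the `Record` layout and field names are illustrative.

```cpp
#include <cstdint>
#include <string>

// Illustrative persistent record: version tag, identity, payload, and an
// integrity marker computed over the other fields.
struct Record {
    uint32_t version;
    std::string id;
    std::string payload;
    uint64_t checksum;  // integrity marker (FNV-1a over version|id|payload)
};

// FNV-1a: a simple deterministic hash, used here for illustration.
uint64_t fnv1a(const std::string& bytes) {
    uint64_t h = 1469598103934665603ULL;
    for (unsigned char c : bytes) { h ^= c; h *= 1099511628211ULL; }
    return h;
}

uint64_t compute_marker(const Record& r) {
    return fnv1a(std::to_string(r.version) + "|" + r.id + "|" + r.payload);
}

// Seal before persisting; verify on load to detect torn writes or bit rot.
void seal(Record& r) { r.checksum = compute_marker(r); }
bool verify(const Record& r) { return r.checksum == compute_marker(r); }
```

The failure demo in 3.7.4 corresponds to `verify` returning false on a corrupted record and the CLI exiting with a nonzero code.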

3.6 Edge Cases

  • Empty initialization state.
  • Maximum capacity boundaries.
  • Interrupted operation and restart.
  • Invalid input handling.

3.7 Real World Outcome

3.7.1 How to Run (Copy/Paste)

$ cmake -S . -B build && cmake --build build
$ ./db_lab cbo-demo --seed 808

3.7.2 Golden Path Demo (Deterministic)

[analyze] table=orders rows=5000000
[planA] SeqScan cost=145000
[planB] IndexScan cost=1800
[chosen] planB
[actual] rows=17 latency=19.4 ms
[result] PASS

3.7.3 If CLI: exact terminal transcript

$ ./db_lab cbo-demo --seed 808
[analyze] table=orders rows=5000000
[planA] SeqScan cost=145000
[planB] IndexScan cost=1800
[chosen] planB
[actual] rows=17 latency=19.4 ms
[result] PASS

3.7.4 Failure Demo

$ ./db_lab cbo-demo --seed 808 --inject-failure metadata-corruption
[error] integrity validation failed
[exit] code=2

4. Solution Architecture

4.1 High-Level Design

Input/Request -> Validation -> Core State Machine -> Persistence/Cache -> Metrics + Output

4.2 Key Components

Component             | Responsibility                                 | Key Decisions
API Layer             | Validate input and invoke subsystem operations | Fail fast on invalid arguments
Core Engine           | Execute deterministic transitions              | Explicit invariants and status codes
Storage/State Adapter | Persist/load state                             | Versioned metadata and integrity checks
Observability         | Metrics and trace events                       | Deterministic event format

4.3 Data Structures (No Full Code)

Entity:
- id
- status
- payload
- version
- integrity_marker

4.4 Algorithm Overview

  1. Validate request and state.
  2. Compute deterministic transition.
  3. Apply transition with invariant checks.
  4. Persist and emit metrics.

5. Implementation Guide

5.1 Development Environment Setup

$ cmake -S . -B build
$ cmake --build build
$ ctest --test-dir build

5.2 Project Structure

project-root/
├── src/
│   ├── core/
│   ├── io/
│   ├── api/
│   └── metrics/
├── tests/
│   ├── unit/
│   └── integration/
└── docs/

5.3 The Core Question You’re Answering

“How can the engine predict plan cost accurately enough to choose fast plans?”
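One concrete shape of an answer: estimate predicate selectivity from a histogram, feed the row estimate into per-plan cost formulas, and pick the cheapest plan. The sketch below assumes an equi-width histogram and toy cost constants (they are illustrative, not PostgreSQL's).

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Equi-width histogram over a column: bucket i counts values in
// [lo + i*width, lo + (i+1)*width). Counts come from an ANALYZE pass.
struct Histogram {
    double lo, width;
    std::vector<uint64_t> counts;
};

// Estimated rows satisfying "col < x", assuming uniformity inside buckets.
double estimate_lt(const Histogram& h, double x) {
    double rows = 0;
    for (size_t i = 0; i < h.counts.size(); ++i) {
        double b_lo = h.lo + i * h.width, b_hi = b_lo + h.width;
        if (x >= b_hi) rows += h.counts[i];                             // whole bucket
        else if (x > b_lo) rows += h.counts[i] * (x - b_lo) / h.width;  // partial bucket
    }
    return rows;
}

// Toy cost model; the constants are illustrative assumptions.
double seqscan_cost(uint64_t pages) { return 1.0 * pages; }
double indexscan_cost(double est_rows) { return 4.0 /*descent*/ + 1.5 * est_rows; }

// Pick the cheaper plan for "col < x".
std::string choose_plan(const Histogram& h, double x, uint64_t pages) {
    double est = estimate_lt(h, x);
    return seqscan_cost(pages) <= indexscan_cost(est) ? "SeqScan" : "IndexScan";
}
```

This mirrors the golden-path demo: a selective predicate yields a small row estimate, so the index scan's cost undercuts the sequential scan's page count.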

5.4 Concepts You Must Understand First

  1. State invariants and transition boundaries.
  2. Failure modes and deterministic recovery behavior.
  3. Algorithmic tradeoffs for expected workloads.

5.5 Questions to Guide Your Design

  1. What invariants are non-negotiable?
  2. Which operation is highest risk under failure?
  3. Which metrics prove your design under load?

5.6 Thinking Exercise

Write one transition diagram and annotate failure points with expected behavior.

5.7 The Interview Questions They’ll Ask

  1. What invariant is hardest to maintain here?
  2. What performance tradeoff did you choose?
  3. How did you validate failure correctness?
  4. Which metric justified your architecture?
  5. How would this scale in production?

5.8 Hints in Layers

Hint 1: Build deterministic baseline first

Hint 2: Add invariants as executable assertions

Hint 3: Introduce failure injection early

Hint 4: Keep a concise design decision log

5.9 Books That Will Help

Topic               | Book                                  | Chapter
Core concept        | Database System Concepts              | Ch. 16, plus the System R paper
Practical internals | Database Internals                    | Related chapter
Reliability mindset | Designing Data-Intensive Applications | Reliability sections

5.10 Implementation Phases

Phase 1: Foundation

  • Define schemas, invariants, and baseline tests.

Phase 2: Core Functionality

  • Implement operations and deterministic scenario.

Phase 3: Robustness

  • Add failure-path tests and performance instrumentation.

5.11 Key Implementation Decisions

Decision            | Options             | Recommendation | Rationale
Metadata versioning | implicit / explicit | explicit       | safer evolution
Error handling      | silent / structured | structured     | deterministic debugging
Validation timing   | late / early        | early          | fail-fast correctness

6. Testing Strategy

6.1 Test Categories

Category          | Purpose                   | Examples
Unit Tests        | Per-operation correctness | transition assertions
Integration Tests | End-to-end validation     | golden scenario transcript
Edge Case Tests   | Boundary/failure coverage | invalid inputs, restarts

6.2 Critical Test Cases

  1. Golden deterministic scenario.
  2. Restart/re-run consistency.
  3. Failure injection near persistence boundary.
  4. Concurrency stress where applicable.

6.3 Test Data

Fixed-seed synthetic fixtures with expected summary hashes.


7. Common Pitfalls & Debugging

7.1 Frequent Mistakes

Pitfall                       | Symptom              | Solution
Implicit invariant            | random wrong results | executable invariant checks
Missing deterministic fixture | flaky tests          | fixed seeds and golden outputs
Weak observability            | slow debugging       | structured logs and counters

7.2 Debugging Strategies

  • Replay deterministic transcript and diff state snapshots.
  • Enable verbose logs for one failing operation ID.
  • Use invariant dump command before/after transition.

7.3 Performance Traps

  • Optimizing before baseline correctness.
  • Ignoring p95/p99 latency.
  • Underestimating metadata overhead.

8. Extensions & Challenges

8.1 Beginner Extensions

  • Add one extra diagnostic command.
  • Add one boundary-case fixture.

8.2 Intermediate Extensions

  • Implement a second algorithmic variant.
  • Add machine-readable explain output.

8.3 Advanced Extensions

  • Add online maintenance path.
  • Add high-contention benchmark with mitigation.

9. Real-World Connections

9.1 Industry Applications

This project maps directly to production DBMS subsystems of the kind found in PostgreSQL, MySQL/InnoDB, and SQLite.

  • PostgreSQL
  • SQLite
  • CMU BusTub

9.2 Interview Relevance

Covers system design, correctness, performance, and failure-mode reasoning expected in database/backend interviews.


10. Resources

10.1 Essential Reading

  • Database System Concepts, Ch. 16, plus the System R paper
  • Database Internals by Alex Petrov

10.2 Video Resources

  • CMU 15-445/645 Intro to Database Systems lectures.

10.3 Tools & Documentation

  • Repository test harness and docs.
  • PostgreSQL and SQLite official documentation.
  • The previous project provides prerequisites.
  • The next project consumes this subsystem.

11. Self-Assessment Checklist

11.1 Understanding

  • I can explain two critical invariants from memory.
  • I can explain one failure mode and mitigation.
  • I can justify one design decision with metrics.

11.2 Implementation

  • Functional requirements are met.
  • Golden scenario passes.
  • Edge and failure tests pass.

11.3 Growth

  • I documented one tradeoff to revisit.
  • I can explain this project in interview format.

12. Submission / Completion Criteria

Minimum Viable Completion:

  • Deterministic golden scenario passes.
  • Core invariants are enforced.
  • Basic failure handling is documented.

Full Completion:

  • Includes benchmark evidence and decision log.
  • Includes restart/failure-path verification.

Excellence (Going Above & Beyond):

  • Includes algorithm comparison with recommendation.
  • Includes robust observability and concise technical report.