Project 12: Statistics and Cost-Based Optimizer

Build stats collection and cost-based plan selection with explain diagnostics.

Quick Reference

Attribute                         | Value
Difficulty                        | Master
Time Estimate                     | 3-4 weeks
Main Programming Language         | C++
Alternative Programming Languages | Rust, Go
Coolness Level                    | See REFERENCE.md scale used in main sprint
Business Potential                | Resume Gold
Prerequisites                     | Projects 6-8, probability basics, join planning
Key Topics                        | Cardinality estimation, cost models, plan search

1. Learning Objectives

By completing this project, you will:

  1. Implement the core subsystem with deterministic behavior.
  2. Explain the subsystem invariants and failure boundaries.
  3. Build a repeatable validation workflow with golden outputs.
  4. Document tradeoffs and operational evidence.

2. All Theory Needed (Per-Concept Breakdown)

Core Storage and State Invariants

Fundamentals Every database subsystem is a state machine over persistent and in-memory state. Correctness depends on explicit invariants, not just passing happy-path tests. You must define what state exists, what transitions are valid, and what recovery behavior is required when transitions are interrupted. For the statistics and cost-based optimizer, this means identifying critical metadata, lifecycle transitions, and safety conditions before writing implementation details. A strong mental model starts from contracts: what this component promises to callers, and what it assumes from lower layers.

Deep Dive into the concept Systems projects fail most often when invariants are implicit. The first hard requirement is state transparency: each important structure needs a schema, versioning strategy, and validation routine. For this project, define your state as a graph of entities and relationships, then classify each transition as idempotent, atomic, or recoverable. If you do not classify transitions, you cannot reason about crash points or retries.

The second requirement is boundary discipline. A subsystem should expose a small API with clear preconditions and postconditions. Inputs should be validated early; outputs should include enough metadata for diagnosis. Hidden side effects are expensive in database internals because downstream components assume deterministic behavior. If a method mutates shared state unexpectedly, concurrency and recovery failures follow.

Third, introduce deterministic test fixtures from day one. Randomized tests are useful, but deterministic seeds and golden transcripts are essential for debugging and regression checks. For each critical operation, define one normal path, one edge path, and one failure path. Capture expected outcomes in machine-readable form so you can rerun quickly after every change.

Fourth, treat observability as part of implementation. Add counters, state snapshots, and structured logs for transition boundaries. In database systems, poor observability turns simple bugs into multi-day hunts. Your logs should answer: what happened, in what order, to which state object, and under which operation ID.

Fifth, include compatibility planning. Today you implement a learning version; tomorrow you may add new metadata fields or algorithm variants. Reserve extension fields or version tags now to avoid destructive migrations later. This is practical risk management in storage-oriented systems.

Finally, connect theory to runtime cost. Correctness is mandatory, but efficiency determines usability. Understand where your design spends CPU, memory, and I/O. Track both asymptotic behavior and real bottlenecks under expected workloads.
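To make "explicit invariants" concrete, here is a minimal sketch of a per-column statistics record with an executable validation routine. `ColumnStats` and its field names are illustrative assumptions, not a prescribed schema; the point is that each invariant is a check the code can run on load and after every transition.

```cpp
#include <cstdint>
#include <string>

// Hypothetical per-column statistics record in the optimizer's catalog.
// Field names are illustrative, not a required schema.
struct ColumnStats {
    uint32_t version;        // schema version tag, reserved for migrations
    uint64_t table_rows;     // row count of the owning table
    uint64_t distinct_count; // estimated number of distinct values
    uint64_t null_count;     // number of NULL entries
};

// Executable invariant check, run on load and after every transition.
// Returns a reason string so logs can report *which* invariant failed.
std::string validate(const ColumnStats& s) {
    if (s.version == 0) return "version must be >= 1";
    if (s.distinct_count > s.table_rows) return "distinct_count exceeds table_rows";
    if (s.null_count > s.table_rows) return "null_count exceeds table_rows";
    return "";  // empty string: all invariants hold
}
```

Returning a reason string rather than a bare bool keeps the observability requirement above cheap to satisfy.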

How this fits into the projects

  • Directly powers Project 12 in the main guide.
  • Reused by later projects that depend on this subsystem.

Definitions & key terms

  • Invariant: condition that must always hold.
  • Transition: state change caused by one operation.
  • Idempotence: repeated execution yields the same final result.
  • Deterministic test: fixed seed/input produces identical output.

Mental model diagram

Input -> Validate -> State Transition -> Persist/Expose -> Verify
          |               |                  |            |
          v               v                  v            v
      reject bad      enforce rules      emit metrics   assert invariants

How it works (step-by-step, with invariants and failure modes)

  1. Define state structures and invariants.
  2. Implement operation transitions with explicit pre/post checks.
  3. Add deterministic tests and golden outputs.
  4. Add fault-injection tests for interrupted transitions.
  5. Add observability hooks and iterate on bottlenecks.

Minimal concrete example

Pseudo-operation:
OP(request)
- validate request shape and required references
- read current state snapshot
- compute next state deterministically
- persist next state with integrity marker
- emit operation metric and structured event
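The pseudo-operation above can be sketched as a deterministic transition in C++. The types (`AnalyzeRequest`, `StatsEntry`) and the in-memory catalog are assumptions for illustration; a real implementation would sit behind the persistence adapter.

```cpp
#include <cstdint>
#include <map>
#include <string>
#include <vector>

struct AnalyzeRequest { std::string table; uint64_t row_count; };
struct StatsEntry { uint64_t row_count; uint64_t version; };

// Illustrative in-memory state: one stats entry per table name.
using StatsCatalog = std::map<std::string, StatsEntry>;

// OP(request) as a deterministic transition; the event log stands in
// for structured telemetry.
bool apply_analyze(StatsCatalog& catalog, const AnalyzeRequest& req,
                   std::vector<std::string>& events) {
    // 1. Validate request shape.
    if (req.table.empty()) { events.push_back("reject: empty table name"); return false; }
    // 2. Read the current snapshot; 3. compute next state deterministically.
    auto it = catalog.find(req.table);
    StatsEntry next{req.row_count, it == catalog.end() ? 1 : it->second.version + 1};
    // 4. Persist next state (the version tag doubles as a simple marker).
    catalog[req.table] = next;
    // 5. Emit a structured event.
    events.push_back("analyze table=" + req.table +
                     " rows=" + std::to_string(next.row_count) +
                     " version=" + std::to_string(next.version));
    return true;
}
```

Note that every branch, including rejection, emits an event, so a transcript diff can localize a failing operation.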

Common misconceptions

  • “If tests pass once, invariants are fine.” -> Need restart and adversarial tests.
  • “Observability can be added later.” -> Late telemetry creates blind spots.

Check-your-understanding questions

  1. Why must invariants be explicit before coding?
  2. What makes a transition safely retryable?
  3. Why is deterministic output critical?

Check-your-understanding answers

  1. They define correctness boundaries and prevent contract drift.
  2. Idempotence or explicit compensation logic.
  3. It enables reproducible debugging and regression detection.

Real-world applications

  • Storage engines, lock managers, and query planners all depend on explicit invariants.

Where you’ll apply it

  • Main guide Project 12 and downstream integrations.

References

  • Database System Concepts, Ch. 16, plus the System R paper (Selinger et al., “Access Path Selection in a Relational Database Management System”)
  • Database Internals by Alex Petrov

Key insights Correctness in database internals is explicit state management plus disciplined transitions.

Summary This concept provides the reliability and debuggability foundation for the project.

Homework/Exercises to practice the concept

  1. List five invariants specific to this project.
  2. Design one deterministic happy path and one deterministic failure path test.
  3. Identify one transition that must be idempotent and explain why.

Solutions to the homework/exercises

  1. Candidate invariants include identity, ordering, integrity, lifecycle, and capacity constraints.
  2. Use fixed seeds/datasets and record expected outputs.
  3. Recovery or retry transitions must be idempotent.

Algorithmic and Operational Tradeoffs

Fundamentals Database engineering is a sequence of tradeoffs between throughput, latency, memory footprint, and correctness risk. For the statistics and cost-based optimizer, each design choice changes operational behavior. A simpler algorithm may be easier to prove correct but slower; a faster algorithm may require richer metadata and more failure handling. Define workload assumptions and objective metrics before choosing implementation details.

Deep Dive into the concept Operational tradeoffs are easiest to manage by writing objective functions. If your workload is read-heavy with skewed access, cache locality and selective paths dominate. If writes dominate, write amplification and contention control matter more. Start by stating what “good” means for this project: low p95 latency, bounded memory growth, deterministic recovery time, or weighted combinations.

Map algorithm choices to these objectives. Aggressive caching may reduce misses but increase metadata overhead. Rich indexing can reduce query latency but increase update cost. Stronger concurrency controls improve correctness but can reduce throughput under contention. None is universally correct.

Benchmarking should mirror expected usage. Include edge-case traces and realistic mixed traces. Measure averages and tail percentiles. Systems that look fast on average may still be operationally poor if they have unpredictable spikes. Tie benchmark results to root causes using metrics and logs.

Failure behavior is part of quality. A fast path that fails catastrophically under one crash point is worse than a slower but predictable path. Include interruption drills where relevant and capture recovery correctness and time.

Document decisions: options considered, chosen option, and rationale. This makes future evolution safer and interview explanations stronger.
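Since the benchmarking advice above stresses tail percentiles over averages, here is a small helper for computing them from a deterministic latency trace. It uses the nearest-rank method; other percentile definitions (interpolated ranks) exist, so treat the exact convention as an assumption.

```cpp
#include <algorithm>
#include <cmath>
#include <stdexcept>
#include <vector>

// Tail percentile from a benchmark trace (nearest-rank method).
// p is in (0, 100]; e.g. p = 95 gives the p95 latency.
double percentile(std::vector<double> samples, double p) {
    if (samples.empty()) throw std::invalid_argument("empty trace");
    std::sort(samples.begin(), samples.end());
    // Nearest rank: ceil(p/100 * N), converted to a 0-based index.
    size_t rank = static_cast<size_t>(std::ceil(p / 100.0 * samples.size()));
    if (rank == 0) rank = 1;
    return samples[rank - 1];
}
```

Record p50, p95, and p99 together for each trace; a design whose average looks fine can still show the unpredictable spikes the text warns about.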

How this fits into the projects

  • Guides implementation choices and tuning strategy.

Definitions & key terms

  • Write amplification: extra bytes written per logical write.
  • Tail latency: p95/p99 response time.
  • Contention: throughput loss from synchronization conflicts.

Mental model diagram

Workload assumptions -> candidate designs -> benchmark -> pick design -> validate failure behavior

How it works

  1. Define workload and objective metrics.
  2. Enumerate design options.
  3. Benchmark deterministic traces.
  4. Evaluate steady-state and failure-state behavior.
  5. Record decision and rationale.

Minimal concrete example

Decision: choose Clock over LRU.
Evidence: similar hit ratio, lower overhead, better mixed-scan behavior.
Risk: less precise recency tracking.
Mitigation: add scan-resistant admission rule.
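To ground the sample decision, a minimal sketch of Clock (second-chance) eviction is shown below. It illustrates the "lower overhead" claim: one reference bit per frame instead of LRU's recency-ordered list. The class shape is an assumption for illustration, not a prescribed interface.

```cpp
#include <cstddef>
#include <vector>

// Minimal Clock (second-chance) eviction over a fixed set of frames.
class ClockReplacer {
public:
    explicit ClockReplacer(size_t frames) : ref_(frames, false), hand_(0) {}

    // Called on access: give the frame a second chance.
    void touch(size_t frame) { ref_[frame] = true; }

    // Pick a victim: sweep the hand, clearing reference bits until a
    // frame with a cleared bit is found. Terminates within two sweeps.
    size_t victim() {
        while (true) {
            if (!ref_[hand_]) {
                size_t v = hand_;
                hand_ = (hand_ + 1) % ref_.size();
                return v;
            }
            ref_[hand_] = false;
            hand_ = (hand_ + 1) % ref_.size();
        }
    }

private:
    std::vector<bool> ref_;  // one reference bit per frame
    size_t hand_;            // current clock hand position
};
```

The mitigation in the decision note (a scan-resistant admission rule) would sit in front of `touch`, refusing to set the bit for one-shot scan accesses.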

Common misconceptions

  • “Faster microbenchmark means better system.” -> Not if failure/tail behavior is weak.
  • “One benchmark is enough.” -> Need mixed and adversarial scenarios.

Check-your-understanding questions

  1. Why are tail percentiles important?
  2. Give one correctness-performance coupling in this project.
  3. Why keep a decision log?

Check-your-understanding answers

  1. They reflect user-visible worst-case behavior.
  2. Durability or lock safety often increases runtime overhead.
  3. To preserve rationale and ease future tuning.

Real-world applications

  • Production tuning of caches, indexes, and transaction settings.

Where you’ll apply it

  • In implementation phases and performance validation.

References

  • Designing Data-Intensive Applications by Martin Kleppmann.
  • Database System Concepts, Ch. 16, plus the System R paper

Key insights Good subsystem choices come from evidence-backed tradeoffs, not intuition alone.

Summary This concept helps you choose and defend implementation decisions.

Homework/Exercises to practice the concept

  1. Build a benchmark matrix with three workload profiles.
  2. Write one decision note including options and evidence.

Solutions to the homework/exercises

  1. Use read-heavy, write-heavy, and skewed mixed traces.
  2. Link selected option to p95 latency and correctness evidence.

3. Project Specification

3.1 What You Will Build

Build stats collection and cost-based plan selection with explain diagnostics.

Included:

  • Working subsystem implementation with deterministic tests.
  • Metrics and logging for key transitions.
  • Failure-path verification.

Excluded:

  • Full distributed behavior.
  • Vendor-specific managed service integrations.

3.2 Functional Requirements

  1. Implement subsystem core API and state transitions.
  2. Provide deterministic golden scenario output.
  3. Support restart or re-run correctness where applicable.
  4. Emit enough metrics for debugging and comparison.

3.3 Non-Functional Requirements

  • Performance: baseline scenario completes within project budget.
  • Reliability: deterministic output under fixed seed.
  • Usability: clear CLI diagnostics.

3.4 Example Usage / Output

$ ./db_lab cbo-demo --seed 808

3.5 Data Formats / Schemas / Protocols

Use explicit schemas for persistent structures: version, identity fields, integrity markers, and status flags.
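A minimal sketch of such a record, assuming FNV-1a as the integrity marker (a production system would likely use CRC32 or a stronger checksum); the `Record` layout and field names are illustrative.

```cpp
#include <cstdint>
#include <string>

// Illustrative persistent record: version tag, identity, payload, and an
// integrity marker computed over the other fields.
struct Record {
    uint32_t version;
    std::string id;
    std::string payload;
    uint64_t checksum;  // integrity marker (FNV-1a over version|id|payload)
};

// FNV-1a: a simple deterministic hash, used here for illustration.
uint64_t fnv1a(const std::string& bytes) {
    uint64_t h = 1469598103934665603ULL;
    for (unsigned char c : bytes) { h ^= c; h *= 1099511628211ULL; }
    return h;
}

uint64_t compute_marker(const Record& r) {
    return fnv1a(std::to_string(r.version) + "|" + r.id + "|" + r.payload);
}

// Seal before persisting; verify on load to detect torn writes or bit rot.
void seal(Record& r) { r.checksum = compute_marker(r); }
bool verify(const Record& r) { return r.checksum == compute_marker(r); }
```

The failure demo in 3.7.4 corresponds to `verify` returning false on a corrupted record and the CLI exiting with a nonzero code.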

3.6 Edge Cases

  • Empty initialization state.
  • Maximum capacity boundaries.
  • Interrupted operation and restart.
  • Invalid input handling.

3.7 Real World Outcome

3.7.1 How to Run (Copy/Paste)

$ cmake -S . -B build && cmake --build build
$ ./db_lab cbo-demo --seed 808

3.7.2 Golden Path Demo (Deterministic)

[analyze] table=orders rows=5000000
[planA] SeqScan cost=145000
[planB] IndexScan cost=1800
[chosen] planB
[actual] rows=17 latency=19.4 ms
[result] PASS

3.7.3 If CLI: exact terminal transcript

$ ./db_lab cbo-demo --seed 808
[analyze] table=orders rows=5000000
[planA] SeqScan cost=145000
[planB] IndexScan cost=1800
[chosen] planB
[actual] rows=17 latency=19.4 ms
[result] PASS

3.7.4 Failure Demo

$ ./db_lab cbo-demo --seed 808 --inject-failure metadata-corruption
[error] integrity validation failed
[exit] code=2

4. Solution Architecture

4.1 High-Level Design

Input/Request -> Validation -> Core State Machine -> Persistence/Cache -> Metrics + Output

4.2 Key Components

Component             | Responsibility                                 | Key Decisions
API Layer             | Validate input and invoke subsystem operations | Fail fast on invalid arguments
Core Engine           | Execute deterministic transitions              | Explicit invariants and status codes
Storage/State Adapter | Persist/load state                             | Versioned metadata and integrity checks
Observability         | Metrics and trace events                       | Deterministic event format

4.3 Data Structures (No Full Code)

Entity:
- id
- status
- payload
- version
- integrity_marker

4.4 Algorithm Overview

  1. Validate request and state.
  2. Compute deterministic transition.
  3. Apply transition with invariant checks.
  4. Persist and emit metrics.

5. Implementation Guide

5.1 Development Environment Setup

$ cmake -S . -B build
$ cmake --build build
$ ctest --test-dir build

5.2 Project Structure

project-root/
├── src/
│   ├── core/
│   ├── io/
│   ├── api/
│   └── metrics/
├── tests/
│   ├── unit/
│   └── integration/
└── docs/

5.3 The Core Question You’re Answering

“How can the engine predict plan cost accurately enough to choose fast plans?”
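One concrete shape of an answer: estimate predicate selectivity from a histogram, feed the row estimate into per-plan cost formulas, and pick the cheapest plan. The sketch below assumes an equi-width histogram and toy cost constants (they are illustrative, not PostgreSQL's).

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Equi-width histogram over a column: bucket i counts values in
// [lo + i*width, lo + (i+1)*width). Counts come from an ANALYZE pass.
struct Histogram {
    double lo, width;
    std::vector<uint64_t> counts;
};

// Estimated rows satisfying "col < x", assuming uniformity inside buckets.
double estimate_lt(const Histogram& h, double x) {
    double rows = 0;
    for (size_t i = 0; i < h.counts.size(); ++i) {
        double b_lo = h.lo + i * h.width, b_hi = b_lo + h.width;
        if (x >= b_hi) rows += h.counts[i];                             // whole bucket
        else if (x > b_lo) rows += h.counts[i] * (x - b_lo) / h.width;  // partial bucket
    }
    return rows;
}

// Toy cost model; the constants are illustrative assumptions.
double seqscan_cost(uint64_t pages) { return 1.0 * pages; }
double indexscan_cost(double est_rows) { return 4.0 /*descent*/ + 1.5 * est_rows; }

// Pick the cheaper plan for "col < x".
std::string choose_plan(const Histogram& h, double x, uint64_t pages) {
    double est = estimate_lt(h, x);
    return seqscan_cost(pages) <= indexscan_cost(est) ? "SeqScan" : "IndexScan";
}
```

This mirrors the golden-path demo: a selective predicate yields a small row estimate, so the index scan's cost undercuts the sequential scan's page count.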

5.4 Concepts You Must Understand First

  1. State invariants and transition boundaries.
  2. Failure modes and deterministic recovery behavior.
  3. Algorithmic tradeoffs for expected workloads.

5.5 Questions to Guide Your Design

  1. What invariants are non-negotiable?
  2. Which operation is highest risk under failure?
  3. Which metrics prove your design under load?

5.6 Thinking Exercise

Write one transition diagram and annotate failure points with expected behavior.

5.7 The Interview Questions They’ll Ask

  1. What invariant is hardest to maintain here?
  2. What performance tradeoff did you choose?
  3. How did you validate failure correctness?
  4. Which metric justified your architecture?
  5. How would this scale in production?

5.8 Hints in Layers

Hint 1: Build deterministic baseline first

Hint 2: Add invariants as executable assertions

Hint 3: Introduce failure injection early

Hint 4: Keep a concise design decision log

5.9 Books That Will Help

Topic               | Book                                  | Chapter
Core concept        | Database System Concepts              | Ch. 16, plus the System R paper
Practical internals | Database Internals                    | Related chapter
Reliability mindset | Designing Data-Intensive Applications | Reliability sections

5.10 Implementation Phases

Phase 1: Foundation

  • Define schemas, invariants, and baseline tests.

Phase 2: Core Functionality

  • Implement operations and deterministic scenario.

Phase 3: Robustness

  • Add failure-path tests and performance instrumentation.

5.11 Key Implementation Decisions

Decision            | Options             | Recommendation | Rationale
Metadata versioning | implicit / explicit | explicit       | safer evolution
Error handling      | silent / structured | structured     | deterministic debugging
Validation timing   | late / early        | early          | fail-fast correctness

6. Testing Strategy

6.1 Test Categories

Category          | Purpose                   | Examples
Unit Tests        | Per-operation correctness | transition assertions
Integration Tests | End-to-end validation     | golden scenario transcript
Edge Case Tests   | Boundary/failure coverage | invalid inputs, restarts

6.2 Critical Test Cases

  1. Golden deterministic scenario.
  2. Restart/re-run consistency.
  3. Failure injection near persistence boundary.
  4. Concurrency stress where applicable.

6.3 Test Data

Fixed-seed synthetic fixtures with expected summary hashes.


7. Common Pitfalls & Debugging

7.1 Frequent Mistakes

Pitfall                       | Symptom              | Solution
Implicit invariant            | random wrong results | executable invariant checks
Missing deterministic fixture | flaky tests          | fixed seeds and golden outputs
Weak observability            | slow debugging       | structured logs and counters

7.2 Debugging Strategies

  • Replay deterministic transcript and diff state snapshots.
  • Enable verbose logs for one failing operation ID.
  • Use invariant dump command before/after transition.

7.3 Performance Traps

  • Optimizing before baseline correctness.
  • Ignoring p95/p99 latency.
  • Underestimating metadata overhead.

8. Extensions & Challenges

8.1 Beginner Extensions

  • Add one extra diagnostic command.
  • Add one boundary-case fixture.

8.2 Intermediate Extensions

  • Implement a second algorithmic variant.
  • Add machine-readable explain output.

8.3 Advanced Extensions

  • Add online maintenance path.
  • Add high-contention benchmark with mitigation.

9. Real-World Connections

9.1 Industry Applications

This project maps directly to production DBMS subsystems of the kind found in PostgreSQL, MySQL/InnoDB, and SQLite.

  • PostgreSQL
  • SQLite
  • CMU BusTub

9.2 Interview Relevance

Covers system design, correctness, performance, and failure-mode reasoning expected in database/backend interviews.


10. Resources

10.1 Essential Reading

  • Database System Concepts, Ch. 16, plus the System R paper
  • Database Internals by Alex Petrov

10.2 Video Resources

  • CMU 15-445/645 Intro to Database Systems lectures.

10.3 Tools & Documentation

  • Repository test harness and docs.
  • PostgreSQL and SQLite official documentation.
  • The previous project provides prerequisites.
  • The next project consumes this subsystem.

11. Self-Assessment Checklist

11.1 Understanding

  • I can explain two critical invariants from memory.
  • I can explain one failure mode and mitigation.
  • I can justify one design decision with metrics.

11.2 Implementation

  • Functional requirements are met.
  • Golden scenario passes.
  • Edge and failure tests pass.

11.3 Growth

  • I documented one tradeoff to revisit.
  • I can explain this project in interview format.

12. Submission / Completion Criteria

Minimum Viable Completion:

  • Deterministic golden scenario passes.
  • Core invariants are enforced.
  • Basic failure handling is documented.

Full Completion:

  • Includes benchmark evidence and decision log.
  • Includes restart/failure-path verification.

Excellence (Going Above & Beyond):

  • Includes algorithm comparison with recommendation.
  • Includes robust observability and concise technical report.