Sprint: Database Systems Mastery - Real World Projects
Goal: Build first-principles mastery of database systems by implementing the core modules of a mini relational DBMS, end-to-end, with observable outputs at every step. You will understand not only what each database subsystem does, but why it exists, which invariants keep it correct, and what fails when those invariants break. This guide takes you from bytes on disk to parsed SQL to optimized execution plans and crash-safe recovery. By the end, you will be able to design, build, debug, and reason about storage engines, indexing, query execution, concurrency control, and durability like a systems engineer.
Introduction
- What is a database system? A database management system (DBMS) is a software system that stores data durably, enforces correctness under concurrency, and executes declarative queries efficiently.
- What problem does it solve today? It gives applications a reliable contract for data integrity, fast retrieval, transactional updates, and operational recoverability.
- What will you build across this sprint? A layered mini-DBMS: disk/page manager, tuple/page formats, buffer pool, B+Tree and hash indexing, query execution operators, parser+optimizer, transaction manager, recovery manager, MVCC, and a cost-based optimizer.
- In scope: single-node database internals, architecture and algorithms, deterministic local testing, reliability drills.
- Out of scope: full distributed consensus implementation, cloud control plane features, and vendor-specific managed services.
Big-picture architecture
Application / SQL Client
|
v
Parser -> Logical Plan -> Optimizer -> Physical Plan
|
v
Execution Engine (iterators)
|
v
+---------------- Transaction Manager ----------------+
| (locks / MVCC / isolation) |
+----------------------+------------------------------+
|
v
Buffer Pool / Page Cache
|
+-------------+-------------+
| |
v v
Index Manager Heap/Table Manager
(B+Tree, Hash) (pages, tuples, slots)
| |
+-------------+-------------+
|
v
WAL / Log + Data Files on Disk
How to Use This Guide
- Read the Theory Primer first. It is the mini-book that prevents shallow implementation and blind trial/error.
- Build projects in order at least once; later you can jump by learning path.
- For every project, start from The Core Question You Are Answering, then do the Thinking Exercise, then implement.
- Treat each Real World Outcome as the acceptance contract; if your output differs, you are not done.
- Keep an engineering notebook: invariants, failures observed, what fixed them, and why.
- After every two projects, revisit the concept chapters and answer the check-your-understanding questions from memory.
Prerequisites & Background Knowledge
Essential Prerequisites (Must Have)
- Programming in a systems language (C++ or Rust): pointers/references, memory ownership, data structures.
- Basic operating systems concepts: files, virtual memory, processes, synchronization primitives.
- Basic algorithm analysis: asymptotic complexity, cache and I/O sensitivity.
- SQL basics: SELECT, JOIN, GROUP BY, INSERT, UPDATE, DELETE.
- Recommended Reading: “Database System Concepts” by Silberschatz, Korth, and Sudarshan - Ch. 1-3, 13-16.
Helpful But Not Required
- Compiler basics (tokenization/parsing) for Project 8.
- Probabilistic reasoning and histograms for Project 12.
- Linux tracing/perf tools for deeper observability.
Self-Assessment Questions
- Can you explain why a DBMS uses fixed-size pages instead of arbitrary record-sized disk writes?
- Can you explain what makes WAL durable and why ordering rules matter?
- Can you explain one anomaly prevented by serializable isolation?
Development Environment Setup
Required Tools:
- C++ compiler with C++20 support (clang++ or g++)
- CMake 3.24+
- Python 3.11+ for test harnesses and log generation
- sqlite3 CLI (for comparison experiments)
- jq and rg for diagnostics and text/data checks
Recommended Tools:
- perf, strace, dtrace/Instruments (platform dependent)
- lldb or gdb
- Docker for reproducible local labs
Testing Your Setup:
$ clang++ --version
clang version 16.x (or newer)
$ cmake --version
cmake version 3.24.x (or newer)
$ python3 --version
Python 3.11.x (or newer)
$ sqlite3 --version
3.4x.x
Time Investment
- Simple projects: 4-8 hours each
- Moderate projects: 10-20 hours each
- Complex projects: 20-40 hours each
- Total sprint: 4-6 months part-time
Important Reality Check Database internals are hard because correctness and performance are coupled. You can pass happy-path tests with a broken system. The real work is preserving invariants under partial failure, concurrency, and adversarial data distributions.
Big Picture / Mental Model
Database systems are a chain of contracts. Each layer hides complexity but depends on upstream and downstream invariants.
[SQL Contract]
"Result is logically correct"
|
v
[Planner Contract]
"Chosen plan is semantically equivalent"
|
v
[Executor Contract]
"Operators produce tuples in correct semantics"
|
v
[Txn Contract]
"Concurrent behavior respects isolation"
|
v
[Storage Contract]
"Data and metadata remain durable and recoverable"
If any contract is violated:
- planner bug -> wrong answers
- executor bug -> missing/duplicate tuples
- txn bug -> anomalies or corruption
- storage/recovery bug -> permanent data loss
Theory Primer
Chapter 1: Storage Engines, Pages, and Record Layouts
Fundamentals A DBMS stores and moves data in fixed-size pages because disk and SSD I/O are page-oriented and because stable units simplify caching, checksums, and recovery. Within a page, records are not just copied as structs; they are encoded according to a page format that supports variable-length columns, nullability, version metadata, and free-space management. The dominant design is a slotted page: a header and slot array point to tuple payloads, allowing in-page movement without changing logical record identifiers. Storage design is about balancing write amplification, read locality, and update friendliness. If your physical layout is wrong, every upper layer pays: indexes fragment, scans slow down, vacuum/compaction becomes expensive, and crash recovery complexity increases.
Deep Dive
The first deep principle in storage engines is that logical rows and physical bytes must be decoupled. Applications think in rows and columns, but the engine thinks in pages, offsets, and byte ranges. A slotted page is the bridge. The page header stores metadata such as free-space pointer, slot count, and sometimes checksum/version. Slots map stable slot IDs to variable tuple offsets and lengths. This allows you to move tuple bodies during compaction while keeping RecordId = (PageId, SlotId) stable. Without this indirection, updates and deletes would force cascading pointer rewrites across indexes and logs.
Second, you must define tuple encoding rules before writing any code. For fixed-size columns, offsets are deterministic. For variable-size columns, you need either an offset table or length-prefixed segments. Null handling usually requires a bitmap. Every tuple also needs visibility/version metadata once transactions arrive. Even if you postpone MVCC, reserve bits/fields early to avoid format migrations. A strong page format is intentionally forward-compatible: include version number and optional extension region so later projects can evolve without rewriting all data.
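The encoding rules above can be made concrete with a small sketch in Python (the sprint's test-harness language). The header layout here — a 1-byte field count, a 2-byte null bitmap, and 2-byte length prefixes — is an illustrative choice, not a prescribed format; note that the version/extension fields recommended above are omitted for brevity.

```python
import struct

def encode_tuple(values):
    """Encode a tuple of optional byte strings: null bitmap header plus
    length-prefixed segments for each non-null value."""
    null_bitmap = 0
    body = b""
    for i, v in enumerate(values):
        if v is None:
            null_bitmap |= 1 << i
        else:
            body += struct.pack("<H", len(v)) + v  # 2-byte length prefix
    # Header: 1-byte field count + 2-byte null bitmap (up to 16 fields).
    return struct.pack("<BH", len(values), null_bitmap) + body

def decode_tuple(buf):
    """Invert encode_tuple using only the bytes themselves."""
    count, null_bitmap = struct.unpack_from("<BH", buf, 0)
    off, out = 3, []
    for i in range(count):
        if null_bitmap & (1 << i):
            out.append(None)
            continue
        (n,) = struct.unpack_from("<H", buf, off)
        out.append(buf[off + 2:off + 2 + n])
        off += 2 + n
    return out
```

The round trip is self-describing: no schema is needed to walk the bytes, which is exactly the property a page format must preserve across restarts.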
Third, free-space management is not a micro-optimization; it is a core control loop. Insertion searches for a page with enough free bytes. Deletion creates holes. Update may require in-place rewrite, forwarding pointer, or delete+insert depending on size delta. If you do nothing, pages fragment and effective capacity collapses. A common policy is periodic compaction when free-space fragmentation exceeds a threshold. But compaction itself increases write cost and can interfere with active readers/writers, so it should be policy-driven and observable.
Fourth, understand physical locality. Sequential scans rely on contiguous pages and prefetch behavior. Secondary indexes rely on stable record identifiers and predictable page residency. If heap files are append-only without maintenance, hot rows scatter, cache hit rate drops, and write amplification rises during updates. Production systems address this with clustering, vacuum/reorg, and fill-factor controls. Your lab system can emulate this with explicit page reuse and periodic reorganization tasks.
Fifth, page checksums and torn-write defenses matter even in learning systems. A crash during page write can leave partial bytes on disk. WAL protects logical durability, but page-level corruption still needs detection. Include checksum field in page header and validate on read. If checksum fails, surface deterministic error behavior. This discipline makes your later recovery project realistic: not all failures are clean process crashes.
Sixth, storage metadata must be explicit. At minimum, maintain a catalog page or metadata file with page size, format version, next page ID, free-page list head, and optional table root pointers. Avoid hidden assumptions. Most student implementations fail when they hardcode “page 0 is always header” but then forget migration paths. The key invariant is: metadata must be recoverable and self-describing after restart.
Seventh, measure storage behavior continuously. Track insert occupancy, free-space histogram, average tuple size, and fragmentation ratio. These metrics guide tuning and reveal bugs. If a delete-heavy workload never recovers free space, your free-list or compaction algorithm is wrong. If updates frequently become delete+insert, your page fill strategy is too aggressive. A DBMS storage layer is not static; it is an adaptive system.
Finally, connect storage to future layers. Index entries will store record IDs that reference heap pages. Buffer pool replacement will evict pages based on usage. Recovery will replay log records against page IDs and LSNs. Every field you choose now affects those layers. Storage design is the foundation of everything else: correctness, performance, and maintainability.
How this fits into the projects
- Primary concept for Projects 1 and 2.
- Required underpinning for Projects 3-12.
Definitions & key terms
- Page: fixed-size I/O unit between memory and disk.
- Slotted page: page format using slot directory + tuple payload region.
- RecordId (RID): stable logical pointer, typically (PageId, SlotId).
- Fragmentation: free bytes exist but are not contiguous enough for the target insert.
- Compaction: rearranging tuples to coalesce free space.
Mental model diagram
+------------------------------ Page (8KB) -------------------------------+
| Header: page_id | lsn | checksum | free_start | free_end | slot_count |
+-------------------------------------------------------------------------+
| Tuple payload region (grows up) Free space |
| [tupleA][tupleB moved][tupleC]............................(fragmented?) |
| [slotN][slot2][slot1]|
+-------------------------------------------------------------------------+
Slot entry -> (offset, len, flags)
RID = (page_id, slot_id)
How it works (step-by-step, with invariants and failure modes)
- Allocate page ID and initialize header.
- Insert tuple: check contiguous space, write payload, create slot.
- Delete tuple: mark slot deleted/tombstone; adjust visibility metadata.
- Optional compaction: rewrite live tuple payloads, update slot offsets.
- Persist changes with WAL-before-data rule once recovery layer exists.
Invariants:
- Slot IDs are stable unless explicitly remapped with index/log coordination.
- Header free-space pointers must never overlap payload/slot regions.
- Checksum must match serialized bytes at write boundary.
Failure modes:
- Miscomputed offsets -> silent corruption.
- Missing tombstone semantics -> ghost tuples in scans.
- Non-atomic metadata update -> unrecoverable page state.
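The step sequence and invariants above can be exercised with a toy model. This sketch keeps the slot directory as a Python list instead of serializing it into the page bytes, and the header/slot sizes are placeholders — treat it as a behavioral model, not a file format:

```python
class SlottedPage:
    """Toy slotted page: a slot directory maps stable slot ids to
    (offset, length) entries; payloads are packed from the page end."""

    HEADER = 16      # stand-in fixed header size
    SLOT_SIZE = 4    # stand-in serialized slot entry size

    def __init__(self, size=4096):
        self.size = size
        self.data = bytearray(size)
        self.slots = []        # slot_id -> (offset, length), or None = tombstone
        self.free_end = size   # payload region grows downward from here

    def insert(self, payload):
        need = len(payload) + self.SLOT_SIZE
        if self.free_end - (self.HEADER + self.SLOT_SIZE * len(self.slots)) < need:
            raise RuntimeError("page full")
        self.free_end -= len(payload)
        self.data[self.free_end:self.free_end + len(payload)] = payload
        self.slots.append((self.free_end, len(payload)))
        return len(self.slots) - 1          # RID slot component stays stable

    def read(self, slot_id):
        entry = self.slots[slot_id]
        if entry is None:
            return None                     # deleted: ghost must not reappear
        off, n = entry
        return bytes(self.data[off:off + n])

    def delete(self, slot_id):
        self.slots[slot_id] = None          # tombstone; space reclaimed later

    def compact(self):
        """Rewrite live payloads contiguously; slot ids do not change."""
        live = [(sid, self.read(sid)) for sid, e in enumerate(self.slots)
                if e is not None]
        write = self.size
        for sid, payload in live:
            write -= len(payload)
            self.data[write:write + len(payload)] = payload
            self.slots[sid] = (write, len(payload))
        self.free_end = write
```

After a delete-then-compact cycle, survivors keep their slot ids while their offsets move — the indirection the text calls essential.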
Minimal concrete example
Pseudo-transcript:
INSERT tuple size=46 into page=17
- free contiguous bytes before: 112
- write payload at offset 2048
- slot_count: 13 -> 14
- slot[14] = {offset:2048,len:46,flags:live}
- free contiguous bytes after: 62
Common misconceptions
- “Rows are stored exactly as language structs.” -> DB tuple encoding is explicit and versioned.
- “Delete immediately frees bytes.” -> Usually deletion marks space reusable later; fragmentation persists.
- “Page format is internal and can be sloppy.” -> It is a long-term compatibility contract.
Check-your-understanding questions
- Why is slot indirection essential for stable record references?
- What changes if an updated tuple no longer fits in its original space?
- Why can fragmentation exist even when total free bytes look large?
- What minimum metadata must survive restart?
Check-your-understanding answers
- It allows tuple payload movement without changing logical IDs used by indexes.
- You need relocation policy: forwarding pointer or delete+insert semantics.
- Free space may be non-contiguous and therefore unusable for requested size.
- Page format version, next page ID, free page map/list, and root/catalog pointers.
Real-world applications
- PostgreSQL heap page + visibility design.
- SQLite B-tree page formats and freelist pages.
- InnoDB page directory and record headers.
Where you will apply it
- Project 1 and Project 2 (primary).
- Foundational for Projects 3-12.
References
- SQLite File Format
- Database Internals (Alex Petrov)
- “Database System Concepts” (Silberschatz et al.), storage chapters.
Key insights Physical layout decisions are not implementation details; they are system behavior decisions.
Summary You now have the physical mental model: pages, slots, tuple encoding, and free-space policy. Every later subsystem assumes these rules are correct.
Homework/Exercises to practice the concept
- Design a 4KB slotted page format that supports nullable variable-length columns.
- Simulate 1000 inserts, 300 deletes, 200 updates; compute fragmentation ratio over time.
- Propose a compaction threshold policy and justify tradeoffs.
Solutions to the homework/exercises
- Use header + slot array + payload region + null bitmap + varlen offset table.
- Fragmentation ratio should rise after deletes and size-increasing updates; compaction reduces it.
- Example: compact when fragmentation > 25% and page not pinned; this balances write cost and reuse.
Chapter 2: Buffer Pool, Caching Policy, and I/O Control
Fundamentals A buffer pool is the in-memory cache for disk pages and is usually the single largest determinant of DBMS performance. It mediates every read/write between operators and storage and enforces page lifecycle states: clean/dirty, pinned/unpinned, and resident/evicted. A page table maps page IDs to frames. When memory is full, a replacement policy selects eviction candidates while respecting pin counts and dirty flushing constraints. Correctness requires careful ordering: dirty pages cannot be flushed out-of-order relative to WAL, and active pages must not be evicted while in use. Performance requires minimizing random I/O and avoiding pathological replacement behavior under mixed scan + point-lookup workloads.
Deep Dive Think of the buffer pool as a traffic controller for scarce memory, not just a map. Queries generate page demand patterns with very different locality profiles: index probes are skewed and hot-set heavy; analytical scans are broad and streaming; joins may create bursty access to build/probe pages. A naive LRU can perform well in simple OLTP but fail under scans that pollute the cache. Modern systems often use Clock variants, segmented LRU, or admission controls to protect hot pages from one-time scan floods.
The core data structures are straightforward but subtle in behavior. The page table maps PageId -> FrameId. Each frame stores a page buffer and metadata: pin count, dirty flag, recency score/reference bit, and optionally pageLSN. On FetchPage(pid), if present, increment pin and return. If absent, choose victim frame from replacer among unpinned pages. If victim dirty, flush subject to WAL constraint. Then read requested page into frame, update mapping, set pin=1. On UnpinPage(pid, is_dirty), decrement pin and possibly mark dirty. Bugs often arise when pin counts are mishandled: negative counts, leaked pins, or eviction of pinned pages.
Dirty-page handling introduces durability coupling. With WAL, the invariant is flushed_log_lsn >= pageLSN before a dirty page is flushed. If violated, crash recovery cannot reconstruct missing updates. This means buffer manager cannot flush in isolation; it must coordinate with log manager. Even in a learning DBMS, explicitly encode this interface: CanFlush(page_lsn) or FlushLogUpTo(page_lsn). Without this, later recovery implementation becomes patchy and non-deterministic.
Replacement policy selection should be workload-driven. LRU is intuitive but expensive to maintain and vulnerable to scan thrash. Clock approximates LRU with low overhead and is common in teaching systems. ARC and LRU-K are more adaptive but complex. A pragmatic path is Clock with two enhancements: scan resistance (do not immediately admit sequential one-hit pages into protected set) and dirty-page awareness (prefer clean victims when equivalent). Track hit ratio by object class (index/table/temp) so policy tuning has evidence.
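The base Clock mechanism recommended above fits in a few lines. This sketch is single-threaded and omits the scan-resistance and dirty-page-awareness enhancements just described:

```python
class ClockReplacer:
    """Clock replacement over a fixed set of frames: a reference bit is
    set on access; the hand clears bits until it finds an unpinned frame
    whose bit is already clear (its 'second chance' is used up)."""

    def __init__(self, nframes):
        self.ref = [False] * nframes
        self.pinned = [False] * nframes
        self.hand = 0

    def record_access(self, f):
        self.ref[f] = True

    def pin(self, f):
        self.pinned[f] = True

    def unpin(self, f):
        self.pinned[f] = False

    def victim(self):
        n = len(self.ref)
        for _ in range(2 * n):              # at most two full sweeps
            f = self.hand
            self.hand = (self.hand + 1) % n
            if self.pinned[f]:
                continue                    # invariant: never evict pinned
            if self.ref[f]:
                self.ref[f] = False         # second chance granted
            else:
                return f
        return None                         # every frame pinned: no victim
```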
Concurrency is another major dimension. Multiple threads can fetch/unpin simultaneously. A global mutex is correct but scales poorly. Fine-grained latching (page table latch + frame latch + replacer latch) improves throughput but increases deadlock risk. Establish strict lock order early. Example: page table latch -> frame latch -> replacer latch. If violated, deadlocks appear only under stress and are difficult to reproduce.
Prefetching and background flushing convert reactive behavior into proactive behavior. Sequential scans benefit from read-ahead; checkpointer/background writer smooths write bursts by flushing old dirty pages before eviction pressure spikes. These mechanisms reduce tail latency and avoid stalls where worker threads do synchronous flushes in critical paths.
Observability should be built into the buffer pool API: counters for hit/miss, evictions, dirty flushes, no-victim events, average pin duration, and latch wait time. Most systems engineering gains come from measuring before optimizing. If hit ratio looks high but query latency remains high, you may have latch contention or write stalls.
Finally, the buffer pool is where storage, execution, and recovery intersect. Operators should request pages through a minimal API, not bypass it. Recovery must reapply updates through page fetch/update interfaces that preserve pageLSN semantics. Index and heap managers should never write raw files directly once buffer pool exists. Violating this layering creates split-brain state and impossible debugging sessions.
How this fits into the projects
- Primary concept for Project 3.
- Enables reliable behavior for Projects 4-12.
Definitions & key terms
- Frame: one memory slot that can hold one page.
- Pin count: number of active users preventing eviction.
- Dirty page: modified in memory, not yet persisted.
- Replacer: policy component that selects eviction victims.
- pageLSN: latest log sequence number reflected in page image.
Mental model diagram
Executor requests PageId=42
|
v
Page table lookup ---> hit? ---- yes ---> frame -> pin++ -> use page
| no
v
Choose victim from replacer (unpinned only)
|
+-> victim dirty? yes -> ensure WAL flushed to pageLSN -> write page
|
v
Read page 42 into frame -> update page table -> pin=1 -> return
How it works (step-by-step, with invariants and failure modes)
- FetchPage checks the page table under latch.
- Miss path reserves a victim frame from the replacer.
- Dirty victim flush with WAL ordering check.
- Load target page and register mapping.
- UnpinPage decrements the pin count, marks the page dirty if needed, and re-enqueues the frame.
Invariants:
- Pinned pages cannot be eviction victims.
- Every resident page has exactly one page-table mapping.
- Flushed dirty page must satisfy WAL-before-data.
Failure modes:
- Pin leak -> buffer pool exhaustion (no free victim).
- Missing mapping cleanup -> duplicate frame aliases.
- Concurrent dirty flush race -> lost update or stale page image.
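The fetch/unpin lifecycle and the WAL-before-data check can be modeled end-to-end. Here the "disk" is a dict, StubLog stands in for a real log manager, and the victim policy is simply the first unpinned frame — all simplifications for illustration:

```python
class StubLog:
    """Stand-in log manager tracking the highest flushed LSN."""
    def __init__(self):
        self.flushed_lsn = 0

    def flush_up_to(self, lsn):
        self.flushed_lsn = max(self.flushed_lsn, lsn)

class BufferPool:
    """Toy single-threaded buffer pool: page table, pin counts, dirty
    flags, and the flushed_log_lsn >= pageLSN check on dirty eviction."""

    def __init__(self, nframes, disk, log):
        self.disk, self.log = disk, log      # disk: dict PageId -> bytearray
        self.frames = [None] * nframes       # frame -> (pid, data) or None
        self.pin = [0] * nframes
        self.dirty = [False] * nframes
        self.page_lsn = [0] * nframes
        self.table = {}                      # PageId -> frame index

    def fetch(self, pid):
        if pid in self.table:                # hit: pin and return
            f = self.table[pid]
            self.pin[f] += 1
            return self.frames[f][1]
        f = self._victim()
        if f is None:
            raise RuntimeError("no free victim")
        if self.frames[f] is not None:       # evict current resident
            old_pid, old_data = self.frames[f]
            if self.dirty[f]:
                self.log.flush_up_to(self.page_lsn[f])  # WAL-before-data
                self.disk[old_pid] = old_data
            del self.table[old_pid]
        data = self.disk.get(pid, bytearray(64))
        self.frames[f] = (pid, data)
        self.table[pid] = f
        self.pin[f], self.dirty[f], self.page_lsn[f] = 1, False, 0
        return data

    def unpin(self, pid, is_dirty, lsn=0):
        f = self.table[pid]
        assert self.pin[f] > 0, "unpin without matching pin"
        self.pin[f] -= 1
        if is_dirty:
            self.dirty[f] = True
            self.page_lsn[f] = max(self.page_lsn[f], lsn)

    def _victim(self):
        for f, frame in enumerate(self.frames):
            if frame is None or self.pin[f] == 0:
                return f
        return None
```

Forcing an eviction of a dirty page shows the ordering: the log is flushed up to the page's LSN before the bytes reach the "disk".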
Minimal concrete example
Pseudo-events:
T1 FetchPage(7) miss -> frame3 loaded pin=1
T2 FetchPage(7) hit -> frame3 pin=2
T1 UnpinPage(7, dirty=true) -> pin=1 dirty=true
T2 UnpinPage(7, dirty=false) -> pin=0 dirty=true -> candidate for replacer
Common misconceptions
- “High hit ratio always means fast queries.” -> Contention and flush stalls can dominate.
- “LRU is always best.” -> Workload shape determines effectiveness.
- “Dirty page can flush anytime.” -> WAL ordering constraint applies.
Check-your-understanding questions
- Why is pin count separate from dirty flag?
- What causes scan pollution and how do you mitigate it?
- Why should executor/storage avoid bypassing buffer pool?
Check-your-understanding answers
- Usage lifetime and persistence status are different concerns.
- One-time scan pages evict hot pages; use admission/segmentation/read-ahead controls.
- Bypassing breaks coherence, durability ordering, and observability.
Real-world applications
- PostgreSQL shared buffers and background writer/checkpointer.
- InnoDB buffer pool instances and flush lists.
- Embedded DB cache tuning for edge devices.
Where you will apply it
- Project 3
- Critical dependency for Projects 4, 5, 6, 7, 10, 11, 12.
References
- PostgreSQL Buffer and WAL docs
- “Architecture of a Database System” (Hellerstein, Stonebraker, Hamilton).
Key insights Caching policy and correctness policy must be designed together; optimizing one without the other breaks the system.
Summary The buffer pool is your performance control plane and durability gatekeeper. Treat it as a first-class subsystem.
Homework/Exercises to practice the concept
- Compare LRU vs Clock on a synthetic mixed workload and record hit/miss and eviction churn.
- Inject a pin leak and explain how system symptoms differ from low-memory symptoms.
- Design metrics dashboard for buffer pool health.
Solutions to the homework/exercises
- Clock often matches LRU with lower overhead; mixed workloads reveal scan sensitivity.
- Pin leak shows persistent
no victimdespite moderate cache size and low eviction count. - Include hit ratio, no-victim events, dirty flush queue depth, and average latch wait.
Chapter 3: Index Structures and Access Paths (B+Tree and Hash)
Fundamentals Indexes are auxiliary structures that trade write cost and storage for faster reads. A relational DBMS relies on indexes to avoid full scans when predicates are selective. B+Trees are the default general-purpose index because they support both point lookups and ordered range scans. Hash indexes are optimized for equality predicates and can offer simpler probe behavior but poor ordered traversal. Access path selection is a planner decision: an index is useful only when combined with accurate selectivity estimates and realistic I/O/CPU cost models. Index correctness depends on key ordering/hash partitioning invariants and on synchronization rules during concurrent updates and structural changes.
Deep Dive
Start with B+Tree semantics. Internal nodes contain separator keys and child pointers; leaf nodes contain sorted keys and record references, often linked for in-order scans. The critical invariant is balanced height: all leaves are at the same depth, so search cost is O(log_f N) where f is fanout. Because nodes are page-sized, fanout is large and tree height remains small even for large datasets. Most point lookup latency then depends on cache residency of upper levels and locality of leaf pages.
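The O(log_f N) claim is easy to quantify; the fanout values below are illustrative guesses for 8KB pages with small keys:

```python
import math

def btree_height(nrows, fanout):
    """Smallest height h (root counted as one level) with fanout**h >= nrows."""
    return max(1, math.ceil(math.log(nrows, fanout)))

# 100 million rows at fanout ~300 needs only about 4 levels; the top
# levels are typically cache-resident, so a lookup touches few cold pages.
```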
Insert and delete are where complexity lives. During insert, if target leaf has space, place key and done. If overflow, split leaf roughly half/half, propagate separator upward, and potentially cascade splits to root. Delete symmetrically may cause underflow requiring redistribution with sibling or merge and parent key deletion. Structural modification operations must preserve search correctness at all times, even under concurrency. Production systems use latch coupling or optimistic methods to prevent readers from seeing broken paths.
Page layout inside index nodes is performance critical. Keys may be fixed or variable; entries may use prefix compression to increase fanout. Internal node pointers must remain aligned and update-safe. Leaf pages often include sibling pointers for fast range scans and rebalancing. You should design index page headers with node type, key count, parent pointer (optional), next-leaf pointer (for leaves), and low/high fence keys if doing advanced verification.
Hash indexing chooses a different tradeoff. Static hash tables degrade with growth, so dynamic variants like extendable hashing add directory depth and bucket split rules. Extendable hashing keeps global depth for directory and local depth per bucket. On overflow, split bucket and possibly double directory if local depth equals global depth. Equality lookups are efficient, but range queries are unsupported or expensive. Also, skewed keys can create uneven bucket occupancy.
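The directory/bucket mechanics are easier to internalize from a working model. This sketch uses Python dicts as buckets and Python's built-in hash() (the identity for small non-negative ints), so it models the depth bookkeeping, not a page layout:

```python
class ExtendableHash:
    """Toy extendable hashing: a directory of 2**global_depth entries
    points at buckets carrying a local depth; an overflowing bucket
    splits, doubling the directory when local depth == global depth."""

    def __init__(self, bucket_cap=2):
        self.cap = bucket_cap
        self.global_depth = 1
        self.dir = [{"depth": 1, "items": {}}, {"depth": 1, "items": {}}]

    def _bucket(self, key):
        return self.dir[hash(key) & ((1 << self.global_depth) - 1)]

    def get(self, key):
        return self._bucket(key)["items"].get(key)

    def put(self, key, val):
        self._bucket(key)["items"][key] = val
        while len((b := self._bucket(key))["items"]) > self.cap:
            self._split(b)

    def _split(self, b):
        if b["depth"] == self.global_depth:   # directory must double first
            self.dir = self.dir + self.dir
            self.global_depth += 1
        b["depth"] += 1
        bit = 1 << (b["depth"] - 1)           # newly significant hash bit
        new = {"depth": b["depth"], "items": {}}
        for k in list(b["items"]):            # redistribute by the new bit
            if hash(k) & bit:
                new["items"][k] = b["items"].pop(k)
        for i in range(len(self.dir)):        # repoint half of b's entries
            if self.dir[i] is b and (i & bit):
                self.dir[i] = new
```

Inserting a handful of keys forces a bucket split and a directory doubling while the invariant local depth <= global depth holds throughout.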
Concurrency for indexes differs from transaction locking. Index latches protect short critical sections on internal data structures. Transaction locks protect logical isolation across operations. Confusing the two creates either corruption (too weak latching) or severe contention (too coarse locking). During page split/merge, hold latches long enough to preserve structural invariants but release quickly to avoid throughput collapse.
Recovery integration requires logging structural changes carefully. A page split is not a single atomic write; it touches old leaf, new leaf, and parent. WAL records must make operations redo-safe and undo-safe according to your recovery model. Even in simplified labs, define operation sequence so crash at any point can be reasoned about. Example: log intent, create new page, redistribute keys, update parent, mark commit of structure change.
Access path quality is workload-specific. B+Tree excels for range and order-by operations and often for point lookups due to cache-friendly upper levels. Hash can outperform for pure equality on high-cardinality keys with low range requirements. But planner decisions must account for clustering, selectivity, and maintenance cost. Over-indexing harms write throughput and memory pressure.
Finally, index lifecycle management matters: build, validate, rebalance, and drop. A poorly maintained index can become larger than the table and still underperform due to low selectivity. Always measure not just query speedup but write penalty, space overhead, and cache effects.
How this fits into the projects
- Primary concept for Projects 4 and 5.
- Used by Projects 6, 7, 8, and 12.
Definitions & key terms
- Fanout: number of children per internal B+Tree node.
- Split/merge: structural changes after overflow/underflow.
- Extendable hashing: dynamic hash index with directory depth.
- Selectivity: fraction of rows matching a predicate.
- Access path: chosen physical method to retrieve tuples.
Mental model diagram
B+Tree
Root
|
+-- Internal node(s): [k1|k2|k3] -> child pages
|
+--> Leaf pages (sorted keys + RIDs) <-> linked for range scan
Hash Index (extendable)
Hash(key) -> prefix bits -> directory entry -> bucket page -> (key,RID)*
if overflow -> split bucket, maybe double directory
How it works (step-by-step, with invariants and failure modes)
- Lookup: traverse B+Tree separators or hash directory bits to target page.
- Insert: place entry; if overflow perform split and parent update.
- Delete: remove entry; if underflow redistribute/merge.
- Range scan: B+Tree leaf walk via sibling links.
Invariants:
- B+Tree leaves stay sorted and at same depth.
- Extendable hash bucket local depth <= global depth.
- Every index entry references a valid live record ID.
Failure modes:
- Bad split key promotion -> orphaned key ranges.
- Incorrect sibling pointers -> broken range scan.
- Directory depth bug -> unreachable hash buckets.
Minimal concrete example
Pseudo-log of B+Tree insert overflow:
insert key=52 into leaf L9 (full)
split L9 -> L9:[10,20,35] L23:[52,68,71]
promote separator=52 to parent P3
parent P3 insert separator; parent overflow? no
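The transcript above can be reproduced mechanically. This sketch models only the key movement; page allocation, sibling-pointer updates, and logging are deliberately left out:

```python
def split_leaf(keys, new_key):
    """Insert new_key into a full sorted leaf, split in half, and return
    (left_keys, right_keys, separator); the separator is the first key
    of the right leaf and is what gets promoted to the parent."""
    keys = sorted(keys + [new_key])
    mid = len(keys) // 2
    return keys[:mid], keys[mid:], keys[mid]
```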
Common misconceptions
- “Hash index is always faster.” -> Only for suitable equality patterns.
- “B+Tree and binary search tree are basically the same.” -> Page-oriented fanout changes behavior dramatically.
- “More indexes is always better.” -> Write and maintenance costs can dominate.
Check-your-understanding questions
- Why does large node fanout matter more than Big-O notation intuition?
- What planner info is needed to choose between table scan and index scan?
- Why are latches and locks different in index operations?
Check-your-understanding answers
- It keeps tree height tiny, reducing random I/O depth.
- Selectivity, table/index size, clustering, and cached page probability.
- Latches protect in-memory structure integrity; locks enforce transaction isolation.
Real-world applications
- OLTP primary/secondary indexing.
- Event stores with compound index keys.
- Time-series secondary access paths.
Where you will apply it
- Project 4
- Project 5
- Also Project 6, Project 12.
References
- B-tree and B+Tree origin paper metadata
- IBM System R Optimizer context
- “Database System Concepts” indexing chapters.
Key insights Indexes are performance multipliers only when structural invariants and planner cost assumptions are both correct.
Summary You now understand index mechanics, maintenance, and planner coupling for real workloads.
Homework/Exercises to practice the concept
- Compute B+Tree height for 100M rows with realistic page fanout.
- Simulate extendable hash growth under skewed keys.
- Design an experiment comparing range query latency on B+Tree vs hash index.
Solutions to the homework/exercises
- Height is usually small (often 3-4 levels) due to high fanout.
- Skew drives repeated splits in hot buckets; directory growth diverges from what an even key distribution would predict.
- B+Tree should dominate range queries because leaf links preserve order.
Chapter 4: Query Execution and Cost-Based Optimization
Fundamentals
A query engine transforms declarative SQL into executable operator pipelines. Logical plans describe relational algebra semantics; physical plans choose concrete algorithms and access paths. Execution commonly follows the iterator model (open-next-close) to stream tuples through operators. Performance depends on operator choice, join order, predicate pushdown, and data distribution awareness. Cost-based optimization uses statistics to estimate cardinalities and select the lowest-cost plan under a simplified model of I/O and CPU cost. This chapter matters because most real production incidents involving “slow database” are optimizer or plan-shape problems, not parser bugs.
Deep Dive The planner/executor boundary is where theory meets engineering. Logical optimization preserves equivalence while rewriting expressions into easier-to-execute forms: push predicates closer to scans, prune unused columns, simplify joins, and reorder commutative operations. Physical optimization chooses algorithms: nested loop vs hash join, index scan vs seq scan, in-memory sort vs external sort. Every choice has assumptions about row counts and memory availability.
The cardinality estimation problem is the hardest practical issue. The optimizer must estimate how many rows survive each filter and join. It uses table stats such as row count, distinct values, null fraction, and histograms. Independence assumptions are often wrong, especially with correlated columns, leading to large estimate errors and catastrophic plan choices. A robust learning project should expose this by generating skewed and correlated synthetic datasets and comparing estimated vs actual row counts.
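A few lines are enough to expose the independence assumption. The dataset below is artificially perfectly correlated (the second column duplicates the first) to make the estimate error stark:

```python
import random

random.seed(7)
# Two perfectly correlated columns: every row satisfies b == a.
rows = [(a, a) for a in (random.randrange(10) for _ in range(10_000))]

def selectivity(pred):
    return sum(1 for r in rows if pred(r)) / len(rows)

sel_a = selectivity(lambda r: r[0] == 3)                 # about 0.1
sel_b = selectivity(lambda r: r[1] == 3)                 # about 0.1
estimated = sel_a * sel_b                                # independence: ~0.01
actual = selectivity(lambda r: r[0] == 3 and r[1] == 3)  # ~0.1, 10x the estimate
```

A plan chosen from the 0.01 estimate (say, an operator sized for a tiny input) would then face roughly ten times the expected rows.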
Operator implementation details matter. In iterator engines, each operator produces one tuple at a time to parent. This supports pipelining and low memory use for many workloads. But blocking operators (sort, hash build, aggregate) consume full inputs before producing output. The optimizer must account for these memory and spill behaviors. If memory budget is exceeded, operators spill to disk and performance changes by orders of magnitude.
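The open-next-close protocol can be sketched with two pipelined operators; schemas, expression trees, and blocking operators (sort, hash build) are omitted:

```python
class SeqScan:
    """Leaf operator: streams tuples from an in-memory 'table'."""
    def __init__(self, rows):
        self.rows = rows

    def open(self):
        self.i = 0

    def next(self):
        if self.i >= len(self.rows):
            return None                      # end of stream
        row = self.rows[self.i]
        self.i += 1
        return row

    def close(self):
        pass

class Filter:
    """Pipelined operator: pulls from its child, emits matching tuples."""
    def __init__(self, child, pred):
        self.child, self.pred = child, pred

    def open(self):
        self.child.open()

    def next(self):
        while (row := self.child.next()) is not None:
            if self.pred(row):
                return row
        return None

    def close(self):
        self.child.close()

def run(plan):
    """Drive a plan to completion with open-next-close."""
    plan.open()
    out = []
    while (row := plan.next()) is not None:
        out.append(row)
    plan.close()
    return out
```

Because Filter never buffers its input, memory use stays constant no matter how large the scanned table is — the pipelining property the text describes.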
Join ordering has combinatorial explosion. For n joined relations, possible trees grow rapidly. Exhaustive dynamic programming is feasible for small n; heuristics are used for larger queries. Selinger-style dynamic programming remains foundational: compute best plan for subsets, then combine. Even a simplified implementation teaches crucial intuition: local best choices can produce global worst plans if cardinality estimates are wrong.
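A stripped-down Selinger-style dynamic program fits in a screenful. The cardinalities and the uniform join selectivity below are made-up illustration values; a real implementation would also distinguish left-deep plans, interesting orders, and per-edge selectivities.

```python
from itertools import combinations

# Toy Selinger-style DP: best plan per subset of relations, then combine.
card = {"c": 1000, "o": 10000, "l": 50000}   # invented base-table row counts
SEL = 0.001                                  # invented uniform join selectivity

def join_card(rels):
    rows = 1
    for r in rels:
        rows *= card[r]
    return rows * SEL ** (len(rels) - 1)     # one factor per join edge

best = {frozenset([r]): (0.0, r) for r in card}   # subset -> (cost, plan)

rels = list(card)
for size in range(2, len(rels) + 1):
    for combo in combinations(rels, size):
        s = frozenset(combo)
        candidates = []
        for k in range(1, size):
            for left in combinations(combo, k):
                left = frozenset(left)
                right = s - left
                cost = best[left][0] + best[right][0] + join_card(s)
                candidates.append((cost, (best[left][1], best[right][1])))
        best[s] = min(candidates, key=lambda t: t[0])

cost, plan = best[frozenset(rels)]
print(cost, plan)
```

Note that the DP's answer is only as good as `join_card`: feed it a wrong cardinality and the "optimal" subtree it memoizes poisons every larger subset built from it, which is the local-best/global-worst trap described above.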
Cost models are always approximations. A common first model: cost = page_reads * io_weight + tuples_processed * cpu_weight. This is simplistic but useful. Improve it gradually: random vs sequential I/O distinction, cache hit probability, sort spill penalties, and parallelism factors. The value is not perfect prediction; it is consistent ranking of alternatives.
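The first model above is directly executable. The weights, page counts, and tuple counts here are invented for illustration; the point is consistent ranking, not the absolute numbers.

```python
# First-cut cost model from the text:
#   cost = page_reads * io_weight + tuples_processed * cpu_weight
IO_WEIGHT, CPU_WEIGHT = 4.0, 0.01            # illustrative unit weights

def cost(page_reads, tuples_processed):
    return page_reads * IO_WEIGHT + tuples_processed * CPU_WEIGHT

# Plan A: sequential scan -- 10,000 pages, 1,000,000 tuples examined.
plan_a = cost(page_reads=10_000, tuples_processed=1_000_000)

# Plan B: selective index scan -- 120 pages, 5,000 tuples. A later
# refinement would weight these random reads higher than sequential ones.
plan_b = cost(page_reads=120, tuples_processed=5_000)

print(f"A={plan_a} B={plan_b} choose={'B' if plan_b < plan_a else 'A'}")
```

Improvements (random-vs-sequential I/O, cache hit probability, spill penalties) slot in as extra terms without changing this comparison structure.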
Rule-based and cost-based optimization should coexist. Rule-based rewrites are deterministic and cheap; cost-based choices are selective and data-dependent. For a mini DBMS, start with a small rule set (predicate pushdown, projection pruning, join commutativity) then add cost-based join order/access path selection. Ensure every rewrite is testable for equivalence.
Explainability is a feature, not an afterthought. Implement plan explanation output with estimated rows, operator cost, and chosen algorithm. Debugging optimizer errors without this is nearly impossible. Compare estimated_rows vs actual_rows at runtime to identify statistics drift or model flaws.
Finally, optimization must remain correct first, fast second. Never perform rewrite that changes query semantics around null handling, outer joins, or duplicate semantics unless fully validated. Planner bugs silently return wrong answers, which is worse than slow answers.
How this fits into the projects
- Primary concept for Projects 6, 7, 8, and 12.
- Impacts final integration project significantly.
Definitions & key terms
- Logical plan: algebraic representation independent of execution algorithm.
- Physical plan: executable operator tree with chosen algorithms.
- Cardinality estimation: predicted row counts at plan nodes.
- Blocking operator: needs full input before output.
- Selectivity: fraction of tuples passing a predicate.
Mental model diagram
SQL
|
v
Parser -> AST -> Logical Plan --(rules)--> Rewritten Logical Plan
|
v
Physical Enumerator + Cost Model
|
v
Chosen Plan (operators)
|
v
Runtime + Actual Metrics
|
v
Feedback for stats/model tuning
How it works (step-by-step, with invariants and failure modes)
- Parse SQL to AST and validate schema references.
- Build logical algebra tree.
- Apply safe logical rewrite rules.
- Enumerate physical alternatives and estimate costs.
- Execute chosen plan and collect runtime stats.
Invariants:
- Rewrites must preserve semantics.
- Physical plan nodes must satisfy required input/output schema.
- Cost model units must be internally consistent.
Failure modes:
- Bad cardinality estimate -> wrong join order.
- Invalid rewrite around outer joins -> wrong results.
- Missing spill logic -> out-of-memory or severe latency spikes.
Minimal concrete example
Pseudo-query:
SELECT c.name
FROM customers c JOIN orders o ON c.id=o.customer_id
WHERE c.region='NA' AND o.total > 100;
Rewrite idea:
Push c.region filter to customer scan before join.
Plan alternatives:
A) HashJoin(seqscan customers, seqscan orders)
B) IndexNestedLoop(indexscan customers(region), indexscan orders(customer_id))
Choose based on estimated cardinalities and index selectivity.
Common misconceptions
- “Optimizer is just a parser extension.” -> It is a separate search and estimation subsystem.
- “If query is slow, add index.” -> Wrong plan shape or stale stats can be root cause.
- “Rules are enough.” -> Data-dependent costs require cost-based decisions.
Check-your-understanding questions
- Why can cardinality error early in a plan cause huge downstream damage?
- What is a blocking operator and why does it matter for memory?
- Why do you need both estimated and actual row counts in explain output?
Check-your-understanding answers
- Join order and algorithm choices amplify upstream estimate error exponentially.
- It materializes input before output, increasing memory/spill risk.
- To detect estimator drift and improve stats/model iteratively.
Real-world applications
- SQL query tuning in production OLTP.
- Data warehouse plan optimization.
- API backend latency reduction through plan diagnostics.
Where you will apply it
- Project 6
- Project 7
- Project 8
- Project 12
References
- System R access path selection paper
- “Readings in Database Systems” optimizer chapters.
- “Database System Concepts” query processing and optimization chapters.
Key insights Query speed is mostly plan quality, and plan quality is mostly estimate quality.
Summary A robust optimizer couples equivalence-preserving rewrites with cost estimation grounded in real statistics.
Homework/Exercises to practice the concept
- Build two logically equivalent plans and estimate their costs under two different data distributions.
- Create a dataset with correlated columns and observe estimator error.
- Add a simple explain output and compare estimate vs actual rows.
Solutions to the homework/exercises
- Optimal plan flips when selectivity flips; this validates cost sensitivity.
- Independence assumptions overestimate/underestimate selectivity dramatically.
- Divergence pinpoints where stats or model need refinement.
Chapter 5: Transactions, Isolation, 2PL, and MVCC
Fundamentals Transactions package related operations into a correctness unit with ACID guarantees. Isolation is the most subtle ACID property because it defines behavior under concurrency. Lock-based methods like two-phase locking (2PL) enforce serializability by controlling conflicting access, while MVCC increases read/write concurrency by storing multiple versions and using snapshot visibility rules. Both approaches require precise state machines, conflict handling, and cleanup of obsolete state. If concurrency control is wrong, the system may appear to work but return inconsistent results, lose updates, or deadlock under load.
Deep Dive Concurrency control starts with conflict theory: two operations conflict if they access the same logical item and at least one is a write. Serializability demands that a concurrent execution be equivalent to some serial order. Strict 2PL enforces this by requiring transactions to acquire locks during a growing phase and release them only at commit/abort. Shared locks allow concurrent reads; exclusive locks serialize writes. Intention locks and lock hierarchies scale this model from table-level down to page/tuple-level granularity.
A lock manager maintains lock tables keyed by resource ID. Requests can be granted, queued, upgraded, or denied. Correctness requires compatibility matrix enforcement and deterministic wake-up ordering policies to avoid starvation. Deadlocks are inevitable in sufficiently concurrent systems; you need either prevention (ordering/timeouts/wait-die/wound-wait) or detection (wait-for graph cycle detection) plus victim selection. In a learning DBMS, periodic deadlock detection is the clearest model.
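Both ingredients of that paragraph are small enough to sketch: a standard IS/IX/S/X compatibility matrix (SIX omitted for brevity) and wait-for-graph cycle detection. Names are illustrative; a real lock manager also needs queues, upgrades, and wake-up ordering.

```python
# Lock compatibility matrix: COMPAT[(held, requested)] -> grantable?
COMPAT = {
    ("IS", "IS"): True,  ("IS", "IX"): True,  ("IS", "S"): True,  ("IS", "X"): False,
    ("IX", "IS"): True,  ("IX", "IX"): True,  ("IX", "S"): False, ("IX", "X"): False,
    ("S",  "IS"): True,  ("S",  "IX"): False, ("S",  "S"): True,  ("S",  "X"): False,
    ("X",  "IS"): False, ("X",  "IX"): False, ("X",  "S"): False, ("X",  "X"): False,
}

def compatible(held, requested):
    return COMPAT[(held, requested)]

def has_cycle(wait_for):
    """Detect a cycle in a txn -> [txn] wait-for graph via DFS."""
    visiting, done = set(), set()
    def dfs(t):
        if t in visiting:            # back edge: cycle found
            return True
        if t in done:
            return False
        visiting.add(t)
        if any(dfs(n) for n in wait_for.get(t, ())):
            return True
        visiting.discard(t)
        done.add(t)
        return False
    return any(dfs(t) for t in wait_for)

# T1 waits on T2 and T2 waits on T1: the classic deadlock.
print(has_cycle({"T1": ["T2"], "T2": ["T1"]}))   # True
print(has_cycle({"T1": ["T2"], "T2": []}))       # False
```

Running detection periodically over this graph, then aborting one cycle member as the victim, is the clearest model for a learning DBMS.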
2PL offers strong correctness but can reduce concurrency due to blocking and lock contention. Long transactions become especially harmful, holding locks and expanding conflict windows. This motivates MVCC. In MVCC, writes create new row versions with metadata (xmin, xmax, commit timestamps, etc.). Readers choose visible versions according to snapshot rules, so readers usually do not block writers. Writers still conflict with writers and require conflict resolution.
Snapshot isolation semantics are powerful but not equivalent to serializable isolation. Write skew anomalies can still occur unless additional checks or serializable snapshot isolation mechanisms are used. A good educational sprint should include explicit anomaly demonstrations: lost update, non-repeatable read, phantom, and write skew. Learners often assume “no blocking” means “fully correct”; this is false without formal isolation guarantees.
Version garbage collection is part of concurrency control, not just storage maintenance. Old versions remain needed while active snapshots may still see them. Once no active transaction can observe a version, GC can reclaim it. In MVCC systems this is often tied to global oldest active transaction timestamp or horizon. If GC is too conservative, storage bloat grows. If too aggressive, active snapshots break correctness.
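The safe-horizon rule can be stated as code. This sketch assumes a simplified version record carrying only the commit timestamp of the transaction that superseded it; real systems track more state.

```python
# Safe-horizon GC sketch: a version is reclaimable once it was superseded
# before the oldest active snapshot began, so no snapshot can still see it.
def safe_horizon(active_txn_start_ts):
    # Oldest active transaction start timestamp; +inf if system is idle.
    return min(active_txn_start_ts) if active_txn_start_ts else float("inf")

def reclaimable(version, horizon):
    # 'xmax_commit_ts' is None while this is still the live version.
    return (version["xmax_commit_ts"] is not None
            and version["xmax_commit_ts"] < horizon)

horizon = safe_horizon([100, 120, 150])      # oldest active snapshot: 100
v_old = {"xmax_commit_ts": 90}               # superseded before ts=100
v_live = {"xmax_commit_ts": None}
print(reclaimable(v_old, horizon), reclaimable(v_live, horizon))
```

The two failure directions in the text map directly onto this predicate: computing the horizon too low bloats storage; too high reclaims versions an active snapshot still needs.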
Integrating concurrency with indexes introduces additional challenges. Index entries may reference multiple versions, pending deletes, or visibility checks in heap. Some systems use index-level versioning, others keep index stable and perform visibility checks during heap fetch. Your mini-DBMS can use the latter for simplicity.
Transaction lifecycle must be explicit: BEGIN -> ACTIVE -> COMMITTING/ABORTING -> COMMITTED/ABORTED. Recovery manager relies on this state machine for undo/redo logic. Likewise, buffer pool latches and transaction locks must be clearly separated: latches are short critical-section protection; locks are logical isolation mechanisms across operation durations.
Observability is essential. Track lock wait times, deadlock count, abort reasons, average transaction duration, and MVCC version chain length. Without these, tuning isolation behavior is guesswork. Production teams use these signals to decide index changes, query rewrites, or transaction boundary refactoring.
Finally, concurrency control is a business requirement, not an academic exercise. Payment systems, inventory systems, and booking systems fail in expensive ways when isolation is wrong. Mastery here differentiates reliable systems engineers from basic CRUD developers.
How this fits into the projects
- Primary concept for Projects 9 and 11.
- Required context for Projects 10 and final integration.
Definitions & key terms
- Serializability: equivalent outcome to some serial execution.
- Strict 2PL: locks released at commit/abort, preventing dirty reads/writes.
- Deadlock: cycle of waiting transactions.
- Snapshot isolation: reads from consistent snapshot; not always serializable.
- Version GC: reclamation of obsolete row versions.
Mental model diagram
2PL path:
Txn -> lock request -> granted/queued -> execute -> commit -> unlock
MVCC path:
Txn start -> snapshot timestamp
read -> choose visible version
write -> append new version + conflict check
commit -> publish version visibility
GC -> remove versions older than safe horizon
How it works (step-by-step, with invariants and failure modes)
- Start transaction and assign ID/timestamp.
- For 2PL: acquire lock per operation according to compatibility matrix.
- For MVCC read: choose newest visible version under snapshot rules.
- On commit: validate conflicts, persist commit record, release resources.
- Run deadlock detection/GC periodically.
Invariants:
- Lock table state is consistent with transaction state.
- Visible-version rule is deterministic for given snapshot.
- Committed effects become durable in WAL order.
Failure modes:
- Lock upgrade race -> starvation/deadlock storms.
- Incorrect visibility check -> inconsistent reads.
- GC horizon bug -> use-after-free of needed versions.
Minimal concrete example
Pseudo-transaction timeline:
T1 starts at ts=100, reads row R version v1(ts=90)
T2 starts at ts=110, updates R -> creates v2(ts=110), commits
T1 rereads under snapshot isolation -> still sees v1
T3 starts at ts=120 -> sees v2
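The timeline above can be executed under a minimal snapshot-visibility rule: a reader with snapshot timestamp ts sees the newest version whose commit timestamp is <= ts. This deliberately simplifies commit-visibility details (in-progress transactions, own writes) to isolate the snapshot idea.

```python
# Version chain for row R, newest-first: (commit_ts, value).
versions = [
    (110, "v2"),   # created by T2, committed at ts=110
    (90,  "v1"),   # original version, committed at ts=90
]

def read(snapshot_ts):
    # Newest committed version visible to this snapshot.
    for commit_ts, value in versions:
        if commit_ts <= snapshot_ts:
            return value
    return None

assert read(100) == "v1"   # T1 (snapshot ts=100) sees v1, before and after T2
assert read(120) == "v2"   # T3 (snapshot ts=120) sees T2's committed v2
print(read(100), read(120))
```

Because T1's snapshot timestamp never changes, its reread returns `v1` even after T2 commits, which is exactly the repeatable-snapshot behavior in the timeline.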
Common misconceptions
- “MVCC means no conflicts.” -> Writer-writer conflicts remain.
- “Read committed and snapshot isolation are basically same.” -> Visibility semantics differ.
- “Deadlocks are bugs in lock manager only.” -> They are a normal outcome of lock acquisition patterns.
Check-your-understanding questions
- Why can snapshot isolation allow write skew?
- Why do you still need locks/latches in MVCC systems?
- What metrics reveal contention problems fastest?
Check-your-understanding answers
- Concurrent transactions may read stale-but-valid snapshots and update disjoint rows that violate joint invariants.
- Structural and writer conflict protection is still required.
- Lock wait time, deadlock rate, abort reasons, and long transaction count.
Real-world applications
- Banking and ledger consistency.
- Reservation/inventory systems.
- Multi-tenant SaaS correctness under high concurrency.
Where you will apply it
- Project 9
- Project 11
- Project 10 (required context)
References
- PostgreSQL MVCC Introduction
- “Transaction Processing: Concepts and Techniques” (Gray & Reuter).
- “Database System Concepts” transaction chapters.
Key insights Concurrency control is a set of invariants over time, not a single lock/unlock API.
Summary You can now reason about isolation semantics and design lock/MVCC behavior with explicit tradeoffs.
Homework/Exercises to practice the concept
- Construct a write-skew scenario and classify which isolation levels prevent it.
- Design a lock compatibility matrix including intention locks.
- Propose an MVCC GC safe-horizon rule.
Solutions to the homework/exercises
- Serializable prevents write skew; snapshot isolation may not.
- IS/IX/S/X compatibility should preserve hierarchical correctness.
- Safe horizon is minimum start timestamp among active transactions.
Chapter 6: WAL, Checkpoints, and Crash Recovery (ARIES-style)
Fundamentals Durability and atomicity under failure are delivered by write-ahead logging (WAL) and recovery algorithms. WAL requires log records to reach stable storage before corresponding data pages are flushed. Recovery then uses log history to restore consistent state after crashes: redo committed effects that were not persisted, and undo incomplete effects. Checkpoints reduce restart work by recording bounded recovery starting points. ARIES-style recovery formalizes this with Analysis, Redo, and Undo phases and pageLSN checks. Recovery correctness is non-negotiable: silent durability bugs can permanently lose or corrupt business data.
Deep Dive Think in failure models first. The common crash model is fail-stop process/system crash with storage persistence preserved up to last completed fsync. Under this model, memory state is lost; disk may contain partially updated data pages unless atomic write guarantees are assumed (usually unsafe to assume). WAL transforms uncertain data-page persistence into deterministic replay by logging intent and before/after information.
A log record typically includes LSN, transaction ID, page ID (if physical), operation type, and redo/undo payload. Page headers store pageLSN, indicating latest applied update. Before flushing a dirty page, ensure durableLSN >= pageLSN. This single inequality is the heart of WAL correctness. If violated, a page may contain changes whose log records were not durable at crash time, making recovery impossible to replay correctly.
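The "single inequality" can be made concrete with a toy log. Class and field names here are illustrative; the point is that the flush gate compares the page's LSN against what the log has actually made durable.

```python
# WAL-rule sketch: a dirty page may be flushed only if
# durable_lsn >= page_lsn (its log records are already on stable storage).
class Wal:
    def __init__(self):
        self.records, self.durable_lsn, self.next_lsn = [], 0, 1
    def append(self, payload):
        lsn = self.next_lsn
        self.next_lsn += 1
        self.records.append((lsn, payload))   # in-memory tail, not yet durable
        return lsn
    def flush(self):                          # models fsync of the log tail
        self.durable_lsn = self.next_lsn - 1

def can_flush_page(wal, page_lsn):
    return wal.durable_lsn >= page_lsn

wal = Wal()
page_lsn = wal.append("update page 7")        # page now carries this LSN
print(can_flush_page(wal, page_lsn))          # False: record not yet durable
wal.flush()
print(can_flush_page(wal, page_lsn))          # True: safe to flush the page
```

A buffer pool would call `can_flush_page` in its eviction/flush path and either force the log first or pick a different victim.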
Checkpointing trades normal runtime overhead for faster restart. A fuzzy checkpoint records active transactions and dirty page table without fully stopping writes. At restart, Analysis begins from checkpoint LSN, reconstructing transaction table and dirty page table. Redo replays idempotently from smallest recLSN needed; pageLSN checks skip already-applied records. Undo then rolls back loser transactions using compensation log records (CLRs) to make undo itself crash-safe.
Idempotence is crucial. Recovery may re-run after crash during recovery itself. Therefore redo and undo actions must be safe when repeated. ARIES solves this using LSN ordering and CLRs with undoNextLSN. Even simplified recovery implementations should preserve idempotent design: each applied action should be detectable as already applied.
Group commit is a performance optimization with correctness implications. Instead of fsync per commit, batch multiple commit log records and flush once. This improves throughput dramatically but increases tail latency variability. The invariant remains: a commit is acknowledged only after commit record is durable.
Interaction with buffer pool is continuous. Dirty pages can flush anytime subject to WAL rule. Steal/no-force policies are common: buffer manager may flush uncommitted updates (steal), and commit does not force all data pages (no-force). This combination requires both redo and undo capability and is why ARIES-like designs are prevalent.
Testing recovery requires adversarial crash injection, not just unit tests. Insert failpoints between critical steps: after log append before flush, after page write before metadata update, during undo, during checkpoint write. Then restart and verify invariants: committed data present, uncommitted effects absent, metadata consistent, and no double-application.
Observability for recovery includes log generation rate, flush latency percentiles, checkpoint duration, restart time, and redo/undo record counts. These metrics guide durability-performance tradeoffs and capacity planning.
Finally, recovery is not an isolated subsystem. Transaction manager, buffer pool, and storage format must all participate with clear contracts. If transaction status tracking is weak or pageLSN missing, recovery correctness cannot be proven.
How this fits into the projects
- Primary concept for Project 10.
- Essential for final integration project and production-path thinking.
Definitions & key terms
- WAL: write-ahead log protocol ensuring log durability before data flush.
- LSN: monotonic log sequence number.
- Checkpoint: durable marker reducing restart work.
- CLR: compensation log record for crash-safe undo.
- Steal/No-force: buffer policy pair requiring undo+redo recovery.
Mental model diagram
Normal runtime:
update -> append log record (LSN=500) -> pageLSN=500 in memory
commit -> flush log >= commitLSN -> ACK
background flush page when durableLSN >= pageLSN
Crash + restart:
Analysis -> determine winners/losers and dirty pages
Redo -> reapply needed updates in LSN order
Undo -> rollback losers using CLRs
How it works (step-by-step, with invariants and failure modes)
- Generate log record per update and assign LSN.
- Apply update to in-memory page, set pageLSN.
- On commit, force log flush through commit record.
- Background checkpoint records active txns + dirty pages.
- On restart, run Analysis -> Redo -> Undo.
Invariants:
- Commit acknowledged only after durable commit record.
- Data page flush allowed only if durableLSN >= pageLSN.
- Recovery replay order is non-decreasing LSN.
Failure modes:
- Missing fsync on commit -> acknowledged but lost transactions.
- Broken pageLSN checks -> double redo or missed redo.
- Incomplete undo tracking -> resurrected uncommitted data.
Minimal concrete example
Pseudo crash drill:
T7 UPDATE account balance -20 (LSN 900)
T7 COMMIT (LSN 901) -> fsync up to 901 -> ACK
crash before dirty page flush
restart:
Analysis sees T7 committed
Redo reapplies LSN 900 to page (if pageLSN<900)
Final state includes committed debit
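The drill above runs as code. This sketch keeps only the pieces needed to show the pageLSN check: T7's updates are durable in the log (LSN 900, commit 901) but the data page never flushed, and redo replays a record only when its LSN exceeds the page's LSN, which also makes replay idempotent. Record shapes are invented for illustration.

```python
# Durable log at crash time: update at LSN 900, commit at LSN 901.
durable_log = [
    {"lsn": 900, "txn": "T7", "redo": -20},
    {"lsn": 901, "txn": "T7", "commit": True},
]

# On-disk page is stale: the crash happened before the dirty-page flush.
disk_page = {"page_lsn": 0, "balance": 100}

def recover(page, log):
    committed = {r["txn"] for r in log if r.get("commit")}
    for r in log:                                  # replay in LSN order
        if ("redo" in r and r["txn"] in committed
                and r["lsn"] > page["page_lsn"]):  # pageLSN check
            page["balance"] += r["redo"]
            page["page_lsn"] = r["lsn"]

recover(disk_page, durable_log)
print(disk_page)            # committed debit applied: balance drops to 80
recover(disk_page, durable_log)
print(disk_page)            # rerun is a no-op: pageLSN check blocks re-apply
```

The second call models a crash during recovery itself: because LSN 900 is no longer greater than the pageLSN, nothing is double-applied.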
Common misconceptions
- “If using SSD, fsync ordering is less important.” -> Durability contract still depends on flush semantics.
- “Checkpoint means no recovery replay needed.” -> Fuzzy checkpoints still require redo/undo.
- “Recovery is only for rare crashes.” -> It is active in every clean restart path and backup restore path.
Check-your-understanding questions
- Why does steal/no-force require both redo and undo?
- What makes redo idempotent?
- Why are CLRs needed for robust undo?
Check-your-understanding answers
- Uncommitted updates may reach disk (need undo) and committed updates may remain only in log (need redo).
- LSN/pageLSN checks and deterministic operation ordering.
- They record undo progress so crash during undo can resume safely.
Real-world applications
- PostgreSQL WAL and crash recovery.
- InnoDB redo/undo log behavior.
- Embedded database reliability in edge devices.
Where you will apply it
- Project 10
- Final Overall Project
References
- ARIES paper summary (IBM)
- PostgreSQL WAL docs
- “Database System Concepts” recovery chapters.
Key insights Recovery correctness is mostly about strict write ordering and replay idempotence.
Summary You now have the crash-consistency mental model needed to build durable systems.
Homework/Exercises to practice the concept
- Draw log and pageLSN timeline for two concurrent transactions with one abort.
- Design a failpoint matrix for recovery testing.
- Explain why commit ACK before fsync is unsafe.
Solutions to the homework/exercises
- Winner/loser separation and pageLSN checks determine replay/rollback outcomes.
- Inject crashes after append, after flush, after page write, during checkpoint, during undo.
- The client observes success while durable state does not include commit record.
Glossary
- ACID: Atomicity, Consistency, Isolation, Durability transaction guarantees.
- Cardinality: number of rows at an operator boundary or table.
- CLRs: compensation log records used during undo.
- Fanout: branching factor of tree index nodes.
- Fuzzy checkpoint: non-blocking checkpoint capturing approximate runtime state.
- LSN: monotonic identifier for log record ordering.
- MVCC: concurrency model using multiple row versions and snapshot visibility.
- PageLSN: latest LSN reflected in a data page image.
- RID: record identifier, often (PageId, SlotId).
- Selectivity: fraction of rows satisfying a predicate.
- Steal/No-force: buffer/write policy that permits flushing uncommitted pages and defers forcing committed pages.
Why Database Systems Matter
- Modern software dependency: Almost every product stack depends on at least one transactional or analytical database.
- Developer adoption signal (2024-2025): Stack Overflow's 2024 survey press release highlights PostgreSQL at 49% as the most popular database; the 2025 press release reports PostgreSQL still leading among developers who want to keep using it (66% on the survey's admired/retention metric).
- Ecosystem breadth (February 2026 snapshot): DB-Engines lists 429 systems in its ranking, reflecting a broad and still-growing DBMS landscape.
- Deployment scale signal: SQLite’s official documentation states there are likely over one trillion SQLite databases in active use, emphasizing how deeply embedded database engines are in consumer and enterprise software.
Context and evolution (short)
- Early systems optimized disk I/O and transaction safety.
- Modern systems still keep those core invariants but add better concurrency, richer optimizers, and cloud-native operational models.
| Traditional app model | Modern data platform model |
|---|---|
| Single DB, ad-hoc tuning | Multiple engines by workload |
| Manual failover | Automated failover and PITR |
| Limited observability | Rich plan/lock/log telemetry |
| Schema changes as events | Continuous schema evolution discipline |
Sources:
- Stack Overflow Developer Survey 2024
- Stack Overflow 2024 Press Release
- Stack Overflow 2025 Press Release
- DB-Engines Ranking
- SQLite Most Deployed Database Engine
Concept Summary Table
| Concept Cluster | What You Need to Internalize |
|---|---|
| Storage Engines and Page Layouts | Stable record IDs require indirection and strict page-format invariants; free-space policy drives long-term performance. |
| Buffer Pool and I/O Control | Page lifecycle, replacement policy, and WAL-aware flushing determine both throughput and correctness. |
| Index Structures and Access Paths | B+Tree/Hash tradeoffs only pay off when structural invariants and planner assumptions are aligned. |
| Query Execution and Optimization | Query speed is mostly plan quality; plan quality is mostly estimate quality. |
| Transactions and Isolation (2PL/MVCC) | Concurrency control is about formal correctness over time, not just lock APIs. |
| WAL and Recovery | Durability requires strict ordering, idempotent replay, and disciplined crash testing. |
Project-to-Concept Map
| Project | Concepts Applied |
|---|---|
| Project 1 | Storage Engines and Page Layouts |
| Project 2 | Storage Engines and Page Layouts |
| Project 3 | Buffer Pool and I/O Control |
| Project 4 | Index Structures and Access Paths, Buffer Pool and I/O Control |
| Project 5 | Index Structures and Access Paths |
| Project 6 | Query Execution and Optimization, Index Structures |
| Project 7 | Query Execution and Optimization |
| Project 8 | Query Execution and Optimization |
| Project 9 | Transactions and Isolation |
| Project 10 | WAL and Recovery, Transactions and Isolation |
| Project 11 | Transactions and Isolation, Storage Versioning |
| Project 12 | Query Execution and Optimization, Index Structures |
| Final Overall Project | All concept clusters |
Deep Dive Reading by Concept
| Concept | Book and Chapter | Why This Matters |
|---|---|---|
| Storage Engines and Page Layouts | “Database Internals” by Alex Petrov - Ch. 3-4 | Explains file formats, page layouts, and storage tradeoffs in production systems. |
| Buffer Pool and I/O Control | “Database Internals” by Alex Petrov - Ch. 5 | Connects caching policy to performance and durability behavior. |
| Index Structures and Access Paths | “Database System Concepts” by Silberschatz et al. - Ch. 14 | Provides the formal and practical basis for B+Tree and hashing. |
| Query Execution and Optimization | “Readings in Database Systems” (5th ed.) - optimizer papers | Builds intuition for plan search, costing, and estimation failures. |
| Transactions and Isolation | “Transaction Processing” by Gray & Reuter - selected chapters | Canonical treatment of locking, serializability, and transaction design. |
| WAL and Recovery | ARIES paper + “Database System Concepts” Ch. 19 | Essential for implementing real crash-safe behavior. |
Quick Start
Your first 48 hours:
Day 1:
- Read Theory Primer Chapters 1 and 2.
- Sketch your page and metadata format on paper.
- Start Project 1 and complete deterministic page read/write tests.
Day 2:
- Run the Project 1 Definition of Done checklist.
- Read Chapter 3 and implement initial tuple slot metadata for Project 2.
- Document one failure mode you found and fixed.
Recommended Learning Paths
Path 1: The Systems Programmer
- Project 1 -> Project 2 -> Project 3 -> Project 4 -> Project 10 -> Final Overall Project
Path 2: The Query Performance Engineer
- Project 3 -> Project 4 -> Project 6 -> Project 7 -> Project 8 -> Project 12
Path 3: The Reliability Engineer
- Project 3 -> Project 9 -> Project 10 -> Project 11 -> Final Overall Project
Success Metrics
- You can explain each major DBMS layer with invariants and failure modes, without notes.
- You can reproduce deterministic demo outputs for all projects.
- You can diagnose at least three intentionally injected bugs (one storage, one concurrency, one recovery).
- Your final mini-DBMS survives crash-recovery drills with committed-data durability and aborted-data rollback.
Project Overview Table
| # | Project | Difficulty | Time | Primary Outcome |
|---|---|---|---|---|
| 1 | Disk-Oriented Storage Manager | Intermediate | 1 week | Deterministic page persistence API |
| 2 | Tuple and Page Manager | Intermediate | 1-2 weeks | Slotted-page tuple lifecycle |
| 3 | Buffer Pool Manager | Advanced | 2 weeks | WAL-aware page cache and eviction |
| 4 | B+Tree Index | Expert | 3-4 weeks | Range-capable disk index |
| 5 | Hash Index (Extendable Hashing) | Advanced | 2 weeks | Dynamic equality index |
| 6 | Relational Algebra Executor | Advanced | 2 weeks | Iterator-based operator engine |
| 7 | Join Operators | Expert | 3 weeks | Multiple join algorithm implementations |
| 8 | Query Parser and Rule Optimizer | Expert | 3 weeks | SQL-to-plan transformation pipeline |
| 9 | Transaction Manager (2PL) | Expert | 3 weeks | Serializable lock-based transaction layer |
| 10 | Recovery Manager (UNDO/REDO) | Master | 4+ weeks | Crash-safe WAL recovery workflow |
| 11 | Multi-Version Concurrency Control | Master | 3-4 weeks | Snapshot-based concurrency model |
| 12 | Statistics and Cost Optimizer | Master | 3-4 weeks | Cost-based plan selection |
Project List
The following projects guide you from raw bytes on disk to a crash-safe, query-optimized mini DBMS.
Project 1: Disk-Oriented Storage Manager
- File: `P01-disk-oriented-storage-manager.md`
- Main Programming Language: C++
- Alternative Programming Languages: Rust, Go
- Coolness Level: Level 4 - Hardcore Tech Flex
- Business Potential: Resume Gold
- Difficulty: Intermediate
- Knowledge Area: Filesystems, binary layouts, deterministic I/O
- Software or Tool: POSIX file APIs or equivalent
- Main Book: Database System Concepts
What you will build: A page-based persistence module that allocates, reads, writes, and recycles fixed-size pages.
Why it teaches database systems: It forces explicit physical data contracts and deterministic on-disk behavior.
Core challenges you will face:
- Mapping logical page IDs to file offsets.
- Handling partial writes and metadata consistency.
- Maintaining free-page tracking without corruption.
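The first two challenges can be sketched end to end: a page-id-to-offset mapping, a durable write, and a checksum verified across a "restart" (reopening the file). This assumes page 0 is reserved for a header, so data pages start at id 1; all names are illustrative.

```python
import os, tempfile, zlib

PAGE_SIZE = 8192

def page_offset(page_id):
    # The core identity contract: page_id maps to a fixed file offset.
    return page_id * PAGE_SIZE

def write_page(f, page_id, payload):
    assert len(payload) == PAGE_SIZE
    f.seek(page_offset(page_id))
    f.write(payload)
    f.flush()
    os.fsync(f.fileno())                 # durability point
    return zlib.crc32(payload)

def read_page(f, page_id):
    f.seek(page_offset(page_id))
    data = f.read(PAGE_SIZE)
    return data, zlib.crc32(data)

path = os.path.join(tempfile.mkdtemp(), "lab.db")
payload = bytes([7]) * PAGE_SIZE
with open(path, "wb") as f:
    crc_w = write_page(f, 1, payload)
with open(path, "rb") as f:              # "restart": reopen the file
    data, crc_r = read_page(f, 1)
print(crc_w == crc_r and data == payload)
```

In the real project the checksum would live inside the page header and a mismatch on read would fail fast rather than return undefined bytes.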
Real World Outcome
You run a storage CLI harness that creates a fresh data file, allocates pages, writes deterministic payloads, restarts, and verifies bytes exactly.
$ ./db_lab storage-demo --db ./lab.db --seed 42
[init] page_size=8192 format_version=1
[alloc] page_id=1
[alloc] page_id=2
[write] page_id=1 checksum=0x91a52f01
[write] page_id=2 checksum=0x77c4da10
[restart] reopening database file
[read] page_id=1 checksum=0x91a52f01 match=true
[read] page_id=2 checksum=0x77c4da10 match=true
[result] PASS deterministic-persistence
The Core Question You Are Answering
“How does a DBMS turn a file into a durable page address space that survives restarts without ambiguity?”
This question matters because every higher-level subsystem assumes reliable page identity and bytes.
Concepts You Must Understand First
- Fixed-size page abstraction
- Why is page size fixed?
- How does alignment affect I/O and checksums?
- Book Reference: Database System Concepts - Storage chapters.
- On-disk metadata invariants
- What must be stored in a header page?
- How do you version format changes?
- Book Reference: Database Internals - Storage chapter.
- Failure model basics
- What can crash mid-write do to your bytes?
- Why do checksums matter?
- Book Reference: Transaction Processing - reliability context.
Questions to Guide Your Design
- Page Identity
- Will `page_id * PAGE_SIZE` always be valid?
- How do you represent deallocated pages?
- Metadata Safety
- Which updates must be atomic or recoverable?
- Can header corruption be detected quickly?
Thinking Exercise
Sketch page allocation/deallocation state transitions and identify one state where a crash would create ambiguity.
The Interview Questions They Will Ask
- “Why do databases use fixed pages?”
- “What can go wrong with naive append-only page allocation?”
- “How would you detect on-disk corruption early?”
- “What metadata would you persist in page 0 and why?”
- “How would you evolve page format without data migration downtime?”
Hints in Layers
Hint 1: Start with a strict page header contract Define version, checksum, and free-space metadata before implementing allocation.
Hint 2: Make every write deterministic Use a deterministic seed and fixed payloads for tests.
Hint 3: Treat restart as first-class test Every operation should be validated across process restart.
Hint 4: Fail fast on checksum mismatch Do not continue with undefined bytes.
Books That Will Help
| Topic | Book | Chapter |
|---|---|---|
| Page layout and file organization | Database System Concepts | Ch. 13 |
| Real engine storage tradeoffs | Database Internals | Ch. 3-4 |
| Reliability mindset | Transaction Processing | Early chapters |
Common Pitfalls and Debugging
Problem 1: “Page reads return shifted bytes”
- Why: wrong offset math or stale page size assumption.
- Fix: centralize offset calculation and assert alignment.
- Quick test: read/write known pattern on page IDs 1, 2, 100.
Problem 2: “Restart loses free-page information”
- Why: free-list metadata not persisted atomically.
- Fix: persist free-page structure with version/checksum.
- Quick test: allocate/deallocate/restart/reallocate and verify reuse.
Definition of Done
- Deterministic read/write tests pass across restart.
- Header metadata validates with checksum/version.
- Deallocated pages can be reused safely.
- Corruption path is detected and surfaced clearly.
Project 2: Tuple and Page Manager
- File: `P02-tuple-and-page-manager.md`
- Main Programming Language: C++
- Alternative Programming Languages: Rust, Go
- Coolness Level: Level 3 - Genuinely Clever
- Business Potential: Resume Gold
- Difficulty: Intermediate
- Knowledge Area: Slotted pages, record encoding, compaction
- Software or Tool: Project 1 storage manager
- Main Book: Database System Concepts
What you will build: A slotted page module for insert/read/update/delete of variable-length tuples.
Why it teaches database systems: It bridges logical rows and physical bytes with stable record IDs.
Core challenges you will face:
- Tuple encoding with null and varlen support.
- Free-space fragmentation and compaction strategy.
- Stable slot IDs under updates/deletes.
Real World Outcome
$ ./db_lab tuple-demo --db ./lab.db --seed 11
[insert] rid=(3,1) bytes=38
[insert] rid=(3,2) bytes=56
[delete] rid=(3,1) tombstone=true
[update] rid=(3,2) old_bytes=56 new_bytes=64 relocated=false
[compact] page=3 moved=1 reclaimed=48
[scan] live_tuples=1 deleted_slots=1
[result] PASS slotted-page-lifecycle
You can inspect tuple lifecycle and verify slot stability after compaction.
The Core Question You Are Answering
“How do real databases represent variable rows in fixed pages without breaking references?”
Concepts You Must Understand First
- Slotted page structure
- Why slots must be separate from payload bytes.
- Book Reference: Database System Concepts - Ch. 13.
- Tuple serialization contracts
- How to encode nulls and variable lengths.
- Book Reference: Database Internals - storage record encoding.
- Fragmentation vs compaction tradeoff
- Why compaction timing affects throughput.
- Book Reference: Database Internals.
Questions to Guide Your Design
- RID stability
- When can a slot ID change?
- How do indexes survive compaction?
- Update policy
- In-place update vs relocation threshold?
- How will forwarding pointers work if needed?
Thinking Exercise
Draw one page before and after 10 inserts, 4 deletes, 3 updates. Mark contiguous free region and fragmented holes.
The Interview Questions They Will Ask
- “What is a slotted page and why not array-of-structs?”
- “How do you handle variable-length records safely?”
- “Why does delete usually not immediately reclaim contiguous space?”
- “What is tombstoning and when is it useful?”
- “How do you keep RID stable after compaction?”
Hints in Layers
Hint 1: Implement slot metadata first. Use explicit slot flags: empty/live/deleted.
Hint 2: Keep tuple payload append-from-front, slots from-back. This simplifies free-space checks.
Hint 3: Compaction should rewrite payload only. Do not change slot identity unexpectedly.
Hint 4: Build deterministic tuple fixtures. Make serialization failures easy to diff.
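A minimal in-memory sketch of this layout discipline (in a real page the slot directory grows from the back of a fixed-size page; here vectors stand in, and all names are illustrative):

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

enum class SlotFlag : uint8_t { Empty, Live, Deleted };

struct Slot { uint16_t offset = 0, len = 0; SlotFlag flag = SlotFlag::Empty; };

// Simplified slotted page: payload bytes plus a slot directory.
// Slot indices are the stable half of a RID and never move.
struct SlottedPage {
    std::vector<uint8_t> payload;   // tuple bytes, appended front-to-back
    std::vector<Slot> slots;        // directory; index == slot id

    int insert(const std::string& tuple) {
        Slot s{static_cast<uint16_t>(payload.size()),
               static_cast<uint16_t>(tuple.size()), SlotFlag::Live};
        payload.insert(payload.end(), tuple.begin(), tuple.end());
        slots.push_back(s);
        return static_cast<int>(slots.size()) - 1;   // stable slot id
    }
    void erase(int sid) { slots[sid].flag = SlotFlag::Deleted; }  // tombstone
    std::string read(int sid) const {
        const Slot& s = slots[sid];
        return std::string(payload.begin() + s.offset,
                           payload.begin() + s.offset + s.len);
    }
    // Compaction rewrites payload only; slot ids stay untouched (Hint 3).
    void compact() {
        std::vector<uint8_t> fresh;
        for (Slot& s : slots) {
            if (s.flag != SlotFlag::Live) continue;
            uint16_t off = static_cast<uint16_t>(fresh.size());
            fresh.insert(fresh.end(), payload.begin() + s.offset,
                         payload.begin() + s.offset + s.len);
            s.offset = off;
        }
        payload.swap(fresh);
    }
};
```

After compaction, every live RID still resolves to the same bytes while the deleted tuple's space is reclaimed.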
Books That Will Help
| Topic | Book | Chapter |
|---|---|---|
| Tuple/page formats | Database System Concepts | Ch. 13 |
| Storage record design | Database Internals | Ch. 4 |
| Data layout performance | CS:APP | Memory chapters |
Common Pitfalls and Debugging
Problem 1: “Tuple reads fail after compaction”
- Why: slot offsets not updated correctly.
- Fix: compaction must produce a new offset table then swap.
- Quick test: roundtrip all RIDs pre/post compaction.
Problem 2: “Update causes hidden slot leak”
- Why: relocated tuple created new slot but old slot not tombstoned.
- Fix: enforce update state transition invariants.
- Quick test: random update/delete fuzz run with slot-count assertions.
Definition of Done
- Variable-length tuple insert/update/delete works.
- RID lookup remains valid after compaction.
- Free-space accounting is accurate after mixed operations.
- Serialization/deserialization is deterministic.
Project 3: Buffer Pool Manager
- File: `P03-buffer-pool-manager.md`
- Main Programming Language: C++
- Alternative Programming Languages: Rust, Go
- Coolness Level: Level 4 - Hardcore Tech Flex
- Business Potential: Resume Gold
- Difficulty: Advanced
- Knowledge Area: caching, replacement, concurrency primitives
- Software or Tool: Projects 1-2
- Main Book: Database System Concepts
What you will build: A frame-based buffer pool with page table, pin/unpin, dirty tracking, and replacement policy.
Why it teaches database systems: It is the performance-critical layer connecting storage with execution.
Core challenges you will face:
- Victim selection under pin constraints.
- Dirty flush ordering (WAL-aware contract).
- Concurrent access without deadlock.
Real World Outcome
$ ./db_lab buffer-demo --frames 3 --workload mixed --seed 9
[fetch] miss page=1 -> frame=0
[fetch] miss page=2 -> frame=1
[fetch] miss page=3 -> frame=2
[unpin] page=1 dirty=true pin=0
[fetch] miss page=4 -> evict page=2 clean
[fetch] hit page=1 frame=0
[metrics] hits=1 misses=4 evictions=1 no_victim=0
[result] PASS buffer-pool-core
The Core Question You Are Answering
“How does a DBMS keep hot data in memory without violating durability or concurrency safety?”
Concepts You Must Understand First
- Cache replacement behavior - LRU vs Clock tradeoffs.
- Pin count semantics - page usage lifetime.
- Dirty-page lifecycle - when flushes happen and why.
Book Reference: Database Internals Ch. 5, Database System Concepts Ch. 11.
Questions to Guide Your Design
- How will you prevent eviction of active pages?
- What lock order will prevent latching deadlocks?
- How will you surface no-victim conditions deterministically?
Thinking Exercise
Given frame count=2 and access sequence 1,2,1,3,2, simulate hits/misses for LRU and Clock.
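The LRU half of the exercise can be checked mechanically; Clock's answer depends on hand position and reference-bit policy, so simulate that part by hand. A small sketch:

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Simulate an LRU cache of `frames` slots over an access sequence and
// return the hit count. Front of `cache` = most recently used.
int lru_hits(int frames, const std::vector<int>& accesses) {
    std::vector<int> cache;   // resident pages, MRU first
    int hits = 0;
    for (int page : accesses) {
        auto it = std::find(cache.begin(), cache.end(), page);
        if (it != cache.end()) {                 // hit: promote to MRU
            cache.erase(it);
            ++hits;
        } else if ((int)cache.size() == frames) {
            cache.pop_back();                    // miss: evict LRU (back)
        }
        cache.insert(cache.begin(), page);
    }
    return hits;
}
```

For the exercise's sequence 1,2,1,3,2 with 2 frames, LRU yields 1 hit and 4 misses: page 3 evicts page 2, so the final access to 2 misses again.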
The Interview Questions They Will Ask
- “What is the difference between pin count and dirty flag?”
- “Why can LRU fail on sequential scans?”
- “How do you avoid evicting pages still in use?”
- “Where does WAL ordering interact with buffer flush?”
- “What metrics would you expose for tuning?”
Hints in Layers
Hint 1: Keep the API small and explicit. Fetch, Unpin, Flush, NewPage, DeletePage.
Hint 2: Separate the replacer from page-table logic. Policy should not own storage mappings.
Hint 3: Log every eviction decision in tests. Diagnose policy behavior early.
Hint 4: Build a deterministic stress harness. Threaded tests with fixed seeds reveal race issues.
Books That Will Help
| Topic | Book | Chapter |
|---|---|---|
| Buffer managers | Database System Concepts | Ch. 11 |
| Cache/admission ideas | Database Internals | Ch. 5 |
| Concurrency primitives | C++ Concurrency in Action | Selected chapters |
Common Pitfalls and Debugging
Problem 1: “Buffer pool deadlocks under stress”
- Why: inconsistent latch acquisition order.
- Fix: define global lock order and enforce it.
- Quick test: 1000 randomized concurrent fetch/unpin loops.
Problem 2: “No victim available even with low usage”
- Why: leaked pins due to missing unpin on error paths.
- Fix: RAII-style pin guards or scoped unpin discipline.
- Quick test: pin-count invariant check at test teardown.
Definition of Done
- Deterministic hit/miss/eviction behavior matches policy expectation.
- No pinned-page evictions occur.
- Dirty page flush path is tested.
- Concurrent stress harness runs without deadlock.
Project 4: B+Tree Index
- File: `P04-bplus-tree-index.md`
- Main Programming Language: C++
- Alternative Programming Languages: Rust
- Coolness Level: Level 5 - Pure Magic
- Business Potential: Resume Gold
- Difficulty: Expert
- Knowledge Area: disk-oriented trees, split/merge algorithms
- Software or Tool: Projects 1-3
- Main Book: Database System Concepts
What you will build: A disk-backed B+Tree supporting point lookup, insert, delete, and range scan.
Why it teaches database systems: It is the canonical DB index and teaches page-level data structure engineering.
Core challenges you will face:
- Correct split/merge under boundary cases.
- Leaf sibling maintenance for range scans.
- Structural consistency under concurrent operations.
Real World Outcome
$ ./db_lab bptree-demo --seed 101
[insert] keys=1..10000
[stats] height=3 leaf_pages=129 internal_pages=3
[lookup] key=923 -> rid=(77,4) found=true
[range] 500..510 -> 11 rows in ascending order
[delete] keys=1..4000
[validate] balanced=true ordering=true sibling_links=true
[result] PASS bplus-tree-index
The Core Question You Are Answering
“How do databases maintain ordered access at scale with bounded lookup depth?”
Concepts You Must Understand First
- Fanout and tree height economics.
- Overflow/underflow handling.
- Latch coupling for structural changes.
Book Reference: Database System Concepts Ch. 14.
Questions to Guide Your Design
- Split policy: median vs biased split?
- Root split and root shrink handling?
- Validation invariants after each mutation?
Thinking Exercise
Manually simulate insert sequence causing two cascading splits and draw resulting tree pages.
The Interview Questions They Will Ask
- “Why B+Tree instead of binary search tree on disk?”
- “What happens when a leaf overflows?”
- “How do range scans work efficiently?”
- “How do you verify tree invariants automatically?”
- “What concurrency issues appear during page split?”
Hints in Layers
Hint 1: Start with single-thread correctness. Do not mix in concurrency until invariants pass.
Hint 2: Build node-level validators. Sorted-key and pointer-count checks save hours.
Hint 3: Add structural-operation tracing. Split/merge logs make debugging reproducible.
Hint 4: Keep leaf link updates explicit. Range scans fail silently if sibling pointers are wrong.
Books That Will Help
| Topic | Book | Chapter |
|---|---|---|
| B+Tree fundamentals | Database System Concepts | Ch. 14 |
| Engine implementation patterns | Database Internals | Index chapters |
| Practical algorithm rigor | CLRS | Trees sections |
Common Pitfalls and Debugging
Problem 1: “Lookup misses existing key after split”
- Why: wrong separator promoted to parent.
- Fix: define and test exact split promotion rule.
- Quick test: insert deterministic sequence and query all keys.
Problem 2: “Range scan loops forever”
- Why: broken next-leaf pointer chain.
- Fix: update sibling pointers atomically in split/merge path.
- Quick test: full-tree ascending scan with visited-page set.
Definition of Done
- Point lookup, insert, delete, and range scan pass.
- Structural invariants validated after random workloads.
- Tree remains balanced.
- Deterministic workload produces deterministic stats.
Project 5: Hash Index (Extendable Hashing)
- File: `P05-extendable-hash-index.md`
- Main Programming Language: C++
- Alternative Programming Languages: Rust
- Coolness Level: Level 3 - Genuinely Clever
- Business Potential: Resume Gold
- Difficulty: Advanced
- Knowledge Area: dynamic hashing, directory depth management
- Software or Tool: Projects 1-3
- Main Book: Database System Concepts
What you will build: A disk-backed extendable hash index for fast equality lookup.
Why it teaches database systems: It contrasts ordered vs hash access paths and dynamic growth behavior.
Core challenges you will face:
- Local/global depth correctness.
- Bucket split and directory doubling.
- Handling skewed key distributions.
Real World Outcome
$ ./db_lab hash-demo --seed 22
[insert] 50000 key-rid pairs
[stats] global_depth=11 buckets=231 load_factor=0.78
[probe] key=701337 found=true rid=(120,8)
[probe] key=999999 found=false
[split] total_bucket_splits=230 directory_doubles=5
[result] PASS extendable-hash-index
The Core Question You Are Answering
“How can a hash index grow online without full rehash of all data?”
Concepts You Must Understand First
- Hash prefix addressing.
- Directory and bucket depth semantics.
- Equality-lookup workload design.
Questions to Guide Your Design
- What triggers split vs directory doubling?
- How do you prevent split races under concurrency?
- How will you validate directory-to-bucket consistency?
Thinking Exercise
Start with global depth 2 and simulate inserts that repeatedly hit one bucket. Track directory and local depths.
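The exercise can be cross-checked with a small in-memory model; bucket capacity 2, an identity hash, and a starting global depth of 2 are deliberate simplifications:

```cpp
#include <cassert>
#include <initializer_list>
#include <memory>
#include <vector>

constexpr int BUCKET_CAP = 2;

struct Bucket {
    int local_depth;
    std::vector<int> keys;
};

struct ExtendableHash {
    int global_depth = 2;
    std::vector<std::shared_ptr<Bucket>> dir;

    ExtendableHash() {
        for (int i = 0; i < 4; ++i)
            dir.push_back(std::make_shared<Bucket>(Bucket{2, {}}));
    }
    size_t slot(int key) const { return key & ((1 << global_depth) - 1); }

    void insert(int key) {
        while (true) {
            auto b = dir[slot(key)];
            if ((int)b->keys.size() < BUCKET_CAP) { b->keys.push_back(key); return; }
            if (b->local_depth == global_depth) {    // directory must double
                size_t n = dir.size();
                for (size_t i = 0; i < n; ++i) dir.push_back(dir[i]);
                ++global_depth;
            }
            // Split: the distinguishing bit is the bucket's old local depth.
            int bit = b->local_depth;
            auto fresh = std::make_shared<Bucket>(Bucket{bit + 1, {}});
            b->local_depth = bit + 1;
            std::vector<int> keep;
            for (int k : b->keys)
                ((k >> bit) & 1 ? fresh->keys : keep).push_back(k);
            b->keys = keep;
            for (size_t i = 0; i < dir.size(); ++i)
                if (dir[i] == b && ((i >> bit) & 1)) dir[i] = fresh;
        }
    }
    bool contains(int key) const {
        for (int k : dir[slot(key)]->keys) if (k == key) return true;
        return false;
    }
};
```

Inserting keys 0, 4, 8, 12, 16 (all landing in the same initial bucket) forces two directory doublings, taking global depth from 2 to 4, while untouched buckets keep their lower local depth.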
The Interview Questions They Will Ask
- “Why choose extendable hashing over static hashing?”
- “What is local depth and why does it matter?”
- “Why are range queries poor for hash indexes?”
- “How would skewed keys affect this design?”
- “How do you test correctness after many splits?”
Hints in Layers
Hint 1: Encode depth in the bucket header. Never infer it indirectly.
Hint 2: Build a consistency checker. Each directory entry must map to a valid bucket with a compatible depth.
Hint 3: Use deterministic hash fixtures. Make split behavior reproducible.
Hint 4: Benchmark with skewed and uniform datasets. Observe the tradeoffs clearly.
Books That Will Help
| Topic | Book | Chapter |
|---|---|---|
| Dynamic hashing | Database System Concepts | Ch. 14 |
| Access-method tradeoffs | Database Internals | Index chapters |
| Hashing analysis | CLRS | Hash tables |
Common Pitfalls and Debugging
Problem 1: “Directory points to invalid bucket after split”
- Why: incorrect bitmask update during remap.
- Fix: centralize remap logic and test with binary traces.
- Quick test: verify every directory slot references existing bucket page.
Problem 2: “Split storm under skew”
- Why: load concentrated in few prefixes.
- Fix: evaluate bucket size and hash function quality; add skew tests.
- Quick test: run Zipfian key workload and inspect split distribution.
Definition of Done
- Equality lookups are correct after arbitrary inserts/deletes.
- Directory/bucket depth invariants hold.
- Split/doubling paths are deterministic under fixed seed.
- Performance measured on uniform and skewed key sets.
Project 6: Relational Algebra Executor (Operators)
- File: `P06-relational-algebra-executor.md`
- Main Programming Language: C++
- Alternative Programming Languages: Rust
- Coolness Level: Level 3 - Genuinely Clever
- Business Potential: Resume Gold
- Difficulty: Advanced
- Knowledge Area: iterator model, scans, selection/projection
- Software or Tool: Projects 1-4
- Main Book: Database System Concepts
What you will build: An iterator-based executor with SeqScan, IndexScan, Selection, and Projection.
Why it teaches database systems: It turns storage and indexing into query results through composable operators.
Core challenges you will face:
- Defining clean operator interfaces.
- Correct tuple flow and schema handling.
- Predicate evaluation and short-circuit behavior.
Real World Outcome
$ ./db_lab exec-demo --query "SELECT b FROM t WHERE a = 5" --seed 5
[plan] Projection(b)
Selection(a=5)
SeqScan(t)
[execute] input_rows=100000 output_rows=842
[latency] 14.2 ms
[result] PASS iterator-engine
The Core Question You Are Answering
“How does a DBMS turn a plan tree into a stream of correct tuples?”
Concepts You Must Understand First
- Iterator model (`open`-`next`-`close`).
- Tuple schema and projection semantics.
- Predicate evaluation cost and pushdown.
Questions to Guide Your Design
- Will operators own child lifecycle?
- How will null semantics be handled in predicates?
- How will you debug wrong-result bugs quickly?
Thinking Exercise
Trace one tuple from SeqScan through Selection and Projection; list every transformation.
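The same trace, written as a minimal open-next-close pipeline; tuples are reduced to vectors of ints and schemas to column indices for brevity:

```cpp
#include <cassert>
#include <functional>
#include <optional>
#include <vector>

using Tuple = std::vector<int>;   // toy schema: columns addressed by index

// Every operator exposes the same pull-based contract: Next() yields
// one tuple, or nullopt when exhausted.
struct Operator {
    virtual std::optional<Tuple> Next() = 0;
    virtual ~Operator() = default;
};

struct SeqScan : Operator {
    const std::vector<Tuple>& rows; size_t pos = 0;
    explicit SeqScan(const std::vector<Tuple>& r) : rows(r) {}
    std::optional<Tuple> Next() override {
        if (pos == rows.size()) return std::nullopt;
        return rows[pos++];
    }
};

struct Selection : Operator {
    Operator& child; std::function<bool(const Tuple&)> pred;
    Selection(Operator& c, std::function<bool(const Tuple&)> p)
        : child(c), pred(std::move(p)) {}
    std::optional<Tuple> Next() override {
        while (auto t = child.Next())       // pull until the predicate passes
            if (pred(*t)) return t;
        return std::nullopt;
    }
};

struct Projection : Operator {
    Operator& child; std::vector<int> cols;
    Projection(Operator& c, std::vector<int> cs)
        : child(c), cols(std::move(cs)) {}
    std::optional<Tuple> Next() override {
        auto t = child.Next();
        if (!t) return std::nullopt;
        Tuple out;
        for (int col : cols) out.push_back((*t)[col]);
        return out;
    }
};
```

Wiring SeqScan into Selection into Projection and draining `Next()` reproduces `SELECT b FROM t WHERE a = 5` over an in-memory table.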
The Interview Questions They Will Ask
- “What is the iterator model and why use it?”
- “What is predicate pushdown?”
- “How do you represent tuple schema through operators?”
- “What can cause duplicate/missing output rows?”
- “How do blocking operators differ from streaming operators?”
Hints in Layers
Hint 1: Start with a one-row-at-a-time contract. Avoid vectorization complexity initially.
Hint 2: Build operator-level golden tests. Test each operator in isolation.
Hint 3: Add plan-explain output early. Correctness debugging depends on visibility.
Hint 4: Keep null semantics explicit. Do not rely on host-language booleans.
Books That Will Help
| Topic | Book | Chapter |
|---|---|---|
| Query execution | Database System Concepts | Ch. 15 |
| Volcano model concepts | CMU DB lectures | Query execution modules |
| Practical engine internals | Database Internals | Execution chapters |
Common Pitfalls and Debugging
Problem 1: “Selection returns wrong row count”
- Why: incorrect null or comparison handling.
- Fix: define predicate truth table including null behavior.
- Quick test: run predicate suite with null and boundary values.
Problem 2: “Projection corrupts column mapping”
- Why: schema index mismatch between parent/child operators.
- Fix: use explicit schema objects and field IDs.
- Quick test: projection on reordered columns with known tuples.
Definition of Done
- Core operators produce correct output for deterministic fixtures.
- Plan explain output is available.
- Null and edge cases are tested.
- Integration with storage/index APIs is stable.
Project 7: Join Operators
- File: `P07-join-operators.md`
- Main Programming Language: C++
- Alternative Programming Languages: Rust
- Coolness Level: Level 4 - Hardcore Tech Flex
- Business Potential: Resume Gold
- Difficulty: Expert
- Knowledge Area: join algorithms, memory/I/O tradeoffs
- Software or Tool: Project 6
- Main Book: Database System Concepts
What you will build: Nested-loop, hash join, and sort-merge join operators.
Why it teaches database systems: Join strategy dominates many real query costs.
Core challenges you will face:
- Algorithm correctness across duplicates and nulls.
- Build/probe memory management.
- Choosing the right join algorithm for each data shape.
Real World Outcome
$ ./db_lab join-demo --seed 77
[data] left_rows=2_000_000 right_rows=500_000
[planA] NestedLoopJoin latency=24.8 s
[planB] HashJoin latency=1.9 s
[planC] SortMergeJoin latency=3.6 s
[validate] outputs_equal=true rows=410221
[result] PASS join-operators
The Core Question You Are Answering
“How does algorithm choice change join performance by orders of magnitude without changing query semantics?”
Concepts You Must Understand First
- Join semantics for inner and outer joins.
- Hash table build/probe behavior.
- External sorting basics for large inputs.
Questions to Guide Your Design
- Which side should be hash-build input and why?
- How will spills be handled when memory is insufficient?
- How do you prove two algorithms are semantically equivalent?
Thinking Exercise
Estimate cost of NLJ vs HashJoin for three cardinality/selectivity scenarios.
The Interview Questions They Will Ask
- “When is nested-loop join acceptable?”
- “Why does hash join usually outperform NLJ?”
- “How does sort-merge join help with ordered data?”
- “How do duplicates affect join output size?”
- “How would you debug incorrect join cardinality?”
Hints in Layers
Hint 1: Make one correct baseline first. Use simple NLJ to validate semantics.
Hint 2: Share a common join-output validator. Compare row sets across algorithms.
Hint 3: Instrument memory use. Hash build behavior must be observable.
Hint 4: Add skewed key datasets. Skew reveals algorithm edge cases quickly.
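The build/probe structure of hash join fits in a few lines (inner equijoin on integer keys; the multimap makes duplicate handling explicit):

```cpp
#include <cassert>
#include <unordered_map>
#include <utility>
#include <vector>

using Row = std::pair<int, int>;   // (join key, payload)

// Inner equijoin, hash style: build a multimap on the (ideally smaller)
// right input, then stream the left input and probe. Duplicates on either
// side multiply, exactly as join semantics require.
std::vector<std::pair<Row, Row>> hash_join(const std::vector<Row>& left,
                                           const std::vector<Row>& right) {
    std::unordered_multimap<int, Row> build;
    for (const Row& r : right) build.emplace(r.first, r);   // build phase
    std::vector<std::pair<Row, Row>> out;
    for (const Row& l : left) {                             // probe phase
        auto range = build.equal_range(l.first);
        for (auto it = range.first; it != range.second; ++it)
            out.emplace_back(l, it->second);
    }
    return out;
}
```

A nested-loop baseline over the same inputs should produce the same multiset of pairs; comparing the two is the validator Hint 2 describes.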
Books That Will Help
| Topic | Book | Chapter |
|---|---|---|
| Join algorithms | Database System Concepts | Ch. 15 |
| Sorting and hash foundations | CLRS | Sorting/Hash chapters |
| Practical execution tradeoffs | Database Internals | Execution sections |
Common Pitfalls and Debugging
Problem 1: “Hash join misses matches”
- Why: hash key normalization mismatch across inputs.
- Fix: canonicalize key representation before hash/probe.
- Quick test: equality join with mixed key encodings.
Problem 2: “Sort-merge duplicates incorrect”
- Why: merge cursor advancement logic wrong on repeated keys.
- Fix: explicit duplicate-run handling.
- Quick test: dataset with long duplicate runs.
Definition of Done
- All join algorithms return identical results on deterministic fixtures.
- Performance comparison is captured with explanation.
- Null and duplicate behavior is correct.
- Spill or memory-bound behavior is documented.
Project 8: Simple Query Parser & Rule-Based Optimizer
- File: `P08-sql-parser-rule-optimizer.md`
- Main Programming Language: C++
- Alternative Programming Languages: Rust, Python
- Coolness Level: Level 4 - Hardcore Tech Flex
- Business Potential: Resume Gold
- Difficulty: Expert
- Knowledge Area: parsing, AST, logical rewrites
- Software or Tool: Projects 6-7
- Main Book: Dragon Book + Database System Concepts
What you will build: SQL subset parser that produces AST, logical plan, and rule-optimized physical skeleton.
Why it teaches database systems: It connects human query text to executable operator trees.
Core challenges you will face:
- Grammar design and parse error handling.
- Semantic binding (table/column resolution).
- Correct logical rewrites.
Real World Outcome
$ ./db_lab planner-demo --sql "SELECT t1.b FROM t1 JOIN t2 ON t1.id=t2.id WHERE t1.a=5"
[parse] AST ok
[logical] Project -> Filter -> Join -> Scan(t1,t2)
[rules] predicate_pushdown=true projection_pruning=true
[physical] IndexScan(t1.a) + HashJoin + Projection
[result] PASS parser-and-rules
The Core Question You Are Answering
“How does a DBMS convert SQL text into a semantically correct and more efficient plan?”
Concepts You Must Understand First
- Tokenization and parsing basics.
- AST to relational algebra mapping.
- Rule rewrite safety conditions.
Questions to Guide Your Design
- Which SQL subset will be accepted and explicitly rejected?
- How will alias/column disambiguation work?
- Which rewrites are guaranteed safe in your subset?
Thinking Exercise
Take one SQL query and manually derive AST -> logical plan -> rewritten logical plan.
The Interview Questions They Will Ask
- “What is the difference between parsing and optimization?”
- “What is predicate pushdown and why is it safe?”
- “How do you handle ambiguous column references?”
- “Which rewrites can change semantics if done wrong?”
- “How do you test parser correctness systematically?”
Hints in Layers
Hint 1: Freeze SQL grammar scope early. A small, well-defined subset beats a half-implemented broad grammar.
Hint 2: Build an AST pretty-printer. It is your fastest parse-debug tool.
Hint 3: Separate binding from parsing. Name resolution is a semantic step.
Hint 4: Add rewrite equivalence tests. Execute both original and rewritten plans on the same fixtures.
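A sketch of the very first stage, tokenizing the SQL subset; the token classes and keyword list here are illustrative, not a prescribed grammar:

```cpp
#include <cassert>
#include <cctype>
#include <initializer_list>
#include <string>
#include <vector>

// Token classes for a small SQL subset; parsing builds on this stream.
enum class Tok { Keyword, Ident, Number, Symbol };
struct Token { Tok kind; std::string text; };

bool is_keyword(const std::string& s) {
    for (const char* k : {"SELECT", "FROM", "WHERE", "JOIN", "ON", "AND"})
        if (s == k) return true;
    return false;
}

std::vector<Token> tokenize(const std::string& sql) {
    std::vector<Token> out;
    size_t i = 0;
    while (i < sql.size()) {
        char c = sql[i];
        if (std::isspace((unsigned char)c)) { ++i; continue; }
        if (std::isalpha((unsigned char)c)) {          // identifier or keyword
            size_t j = i;
            while (j < sql.size() && (std::isalnum((unsigned char)sql[j]) ||
                                      sql[j] == '_' || sql[j] == '.')) ++j;
            std::string word = sql.substr(i, j - i);
            std::string upper = word;
            for (char& ch : upper) ch = (char)std::toupper((unsigned char)ch);
            out.push_back({is_keyword(upper) ? Tok::Keyword : Tok::Ident, word});
            i = j;
        } else if (std::isdigit((unsigned char)c)) {   // numeric literal
            size_t j = i;
            while (j < sql.size() && std::isdigit((unsigned char)sql[j])) ++j;
            out.push_back({Tok::Number, sql.substr(i, j - i)});
            i = j;
        } else {                                       // single-char symbol
            out.push_back({Tok::Symbol, std::string(1, c)});
            ++i;
        }
    }
    return out;
}
```

Keeping the lexer this dumb is deliberate: classification questions (is `t1.b` a valid column reference?) belong to the binding phase, per Hint 3.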
Books That Will Help
| Topic | Book | Chapter |
|---|---|---|
| Parsing fundamentals | Dragon Book | Ch. 2-4 |
| Relational algebra mapping | Database System Concepts | Ch. 6, 15 |
| Optimizer rewrite ideas | Readings in DB Systems | Optimizer papers |
Common Pitfalls and Debugging
Problem 1: “Parser accepts invalid query shapes”
- Why: grammar too permissive or semantic checks missing.
- Fix: add binding/validation phase.
- Quick test: negative-query corpus with expected errors.
Problem 2: “Rewrite changes output”
- Why: rule applied without null/outer-join safety checks.
- Fix: add preconditions per rewrite rule.
- Quick test: equivalence harness over random datasets.
Definition of Done
- SQL subset is documented and enforced.
- AST/logical/physical representations are inspectable.
- Core rewrites pass equivalence tests.
- Error paths provide clear diagnostics.
Project 9: Transaction Manager with Two-Phase Locking (2PL)
- File: `P09-transaction-manager-2pl.md`
- Main Programming Language: C++
- Alternative Programming Languages: Rust
- Coolness Level: Level 4 - Hardcore Tech Flex
- Business Potential: Resume Gold
- Difficulty: Expert
- Knowledge Area: lock manager, deadlocks, isolation
- Software or Tool: Projects 3, 6
- Main Book: Database System Concepts
What you will build: A transaction manager with strict 2PL lock manager and deadlock detection.
Why it teaches database systems: It enforces correctness under concurrent execution.
Core challenges you will face:
- Lock compatibility and upgrades.
- Wait-for graph cycle detection.
- Abort/rollback interaction with execution layer.
Real World Outcome
$ ./db_lab txn-demo --mode 2pl --seed 404
[txn] begin T1
[txn] begin T2
[lock] T1 X RID(10,2) granted
[lock] T2 X RID(10,3) granted
[lock] T1 X RID(10,3) waiting
[lock] T2 X RID(10,2) waiting
[deadlock-detector] cycle=[T1,T2] victim=T2
[txn] abort T2
[txn] commit T1
[result] PASS strict-2pl-deadlock-resolution
The Core Question You Are Answering
“How do we guarantee isolation with concurrent writes while preventing the system from freezing under deadlocks?”
Concepts You Must Understand First
- Conflict serializability.
- Lock compatibility matrices.
- Deadlock detection vs prevention tradeoffs.
Questions to Guide Your Design
- How are lock upgrades handled safely?
- What victim selection strategy minimizes wasted work?
- How often should deadlock detection run?
Thinking Exercise
Create a two-transaction lock timeline that deadlocks and annotate where detection can intervene.
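Detection itself is a plain depth-first search for a cycle in the wait-for graph; the moment the second wait edge is added is where the detector can intervene:

```cpp
#include <cassert>
#include <map>
#include <set>
#include <vector>

using Graph = std::map<int, std::vector<int>>;  // txn id -> txns it waits for

// DFS: an edge back to a node on the current path means a cycle (deadlock).
bool has_cycle(const Graph& g, int node, std::set<int>& path, std::set<int>& done) {
    if (done.count(node)) return false;          // already proven cycle-free
    if (!path.insert(node).second) return true;  // revisited on this DFS path
    auto it = g.find(node);
    if (it != g.end())
        for (int next : it->second)
            if (has_cycle(g, next, path, done)) return true;
    path.erase(node);
    done.insert(node);
    return false;
}

bool deadlocked(const Graph& g) {
    std::set<int> done;
    for (const auto& entry : g) {
        std::set<int> path;
        if (has_cycle(g, entry.first, path, done)) return true;
    }
    return false;
}
```

In the demo output above, the detector fires once both `T1 -> T2` and `T2 -> T1` edges exist; victim selection then picks one transaction on the cycle to abort.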
The Interview Questions They Will Ask
- “What does strict 2PL guarantee?”
- “Why are deadlocks normal, not exceptional?”
- “How does lock granularity affect throughput?”
- “What is lock escalation and when is it risky?”
- “How would you test serializability claims?”
Hints in Layers
Hint 1: Start with one lock table keyed by RID. Keep granularity simple first.
Hint 2: Build deterministic scheduler tests. You need reproducible deadlock scenarios.
Hint 3: Separate lock states from txn states. Conflation causes subtle rollback bugs.
Hint 4: Emit wait-for graph snapshots. Visualization helps verify cycle detection.
Books That Will Help
| Topic | Book | Chapter |
|---|---|---|
| Concurrency control | Database System Concepts | Ch. 18 |
| Transaction design | Gray & Reuter | Core chapters |
| Deadlock theory | Operating Systems concepts texts | Deadlock chapter |
Common Pitfalls and Debugging
Problem 1: “Transactions starve on lock upgrades”
- Why: unfair queueing policy.
- Fix: define FIFO or priority with starvation bounds.
- Quick test: many readers + one upgrader simulation.
Problem 2: “Abort leaves locks behind”
- Why: rollback path skips lock release on some errors.
- Fix: centralized transaction finalization routine.
- Quick test: force abort in every lock state and assert no residual locks.
Definition of Done
- Strict 2PL semantics verified with deterministic schedules.
- Deadlock cycle detection and victim abort work.
- Lock release is complete on commit/abort.
- Isolation anomalies for covered levels are demonstrably prevented.
Project 10: Recovery Manager (UNDO/REDO Logging)
- File: `P10-recovery-manager-wal-undo-redo.md`
- Main Programming Language: C++
- Alternative Programming Languages: Rust
- Coolness Level: Level 5 - Pure Magic
- Business Potential: Resume Gold
- Difficulty: Master
- Knowledge Area: WAL, checkpoints, crash recovery
- Software or Tool: Projects 3 and 9
- Main Book: Database System Concepts + ARIES paper
What you will build: A WAL-based recovery manager implementing analysis/redo/undo workflow.
Why it teaches database systems: It is the core durability engine.
Core challenges you will face:
- Log record format and LSN monotonicity.
- WAL-before-data enforcement.
- Correct restart behavior under partial crash states.
Real World Outcome
$ ./db_lab recovery-demo --seed 12 --inject-crash after-commit-before-page-flush
[txn] T15 update RID(8,2) old=100 new=120
[log] append lsn=7001 type=UPDATE
[log] append lsn=7002 type=COMMIT
[log] fsync up_to=7002
[crash] simulated
[restart] analysis_start_lsn=6800
[restart] redo_applied=1 undo_applied=0
[verify] committed_value=120
[result] PASS crash-recovery
The Core Question You Are Answering
“How can a database prove that committed data survives crashes while incomplete work is removed?”
Concepts You Must Understand First
- WAL ordering invariant.
- Steal/no-force consequences.
- Idempotent redo/undo design.
Questions to Guide Your Design
- What minimum fields are required in each log record?
- How will pageLSN gate redo idempotence?
- Which checkpoint model will you implement first?
Thinking Exercise
Draw a timeline with two transactions where one commits and one aborts; mark expected redo/undo sets after crash.
The Interview Questions They Will Ask
- “Why does commit require log flush?”
- “What is the role of pageLSN in redo?”
- “Why do steal/no-force systems need both undo and redo?”
- “What is a fuzzy checkpoint?”
- “How do you test recovery beyond unit tests?”
Hints in Layers
Hint 1: Start with an append-only log and strict LSN assignment. Do not optimize prematurely.
Hint 2: Implement deterministic crash-injection hooks. Recovery bugs hide without adversarial testing.
Hint 3: Separate analysis metadata from page replay. Keep phases explicit and test each independently.
Hint 4: Track replay counters and final invariants. Observable recovery makes debugging practical.
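The pageLSN gate reduced to its core; single-field pages stand in for real ones, and replaying the log a second time must change nothing:

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <vector>

struct Page { uint64_t page_lsn = 0; int value = 0; };      // toy one-field page
struct LogRecord { uint64_t lsn; int page_id; int new_value; };

// Redo pass: apply a record only if the page has not already absorbed it.
// The pageLSN comparison is what makes replay idempotent.
int redo(std::map<int, Page>& pages, const std::vector<LogRecord>& log) {
    int applied = 0;                 // replay counter: observable recovery
    for (const LogRecord& rec : log) {
        Page& p = pages[rec.page_id];
        if (p.page_lsn < rec.lsn) {  // page version predates this record
            p.value = rec.new_value;
            p.page_lsn = rec.lsn;
            ++applied;
        }                            // else: update already on the page, skip
    }
    return applied;
}
```

Running `redo` twice over the same log models the "repeat recovery twice and compare final state" quick test: the second pass must apply zero records.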
Books That Will Help
| Topic | Book | Chapter |
|---|---|---|
| Recovery fundamentals | Database System Concepts | Ch. 19 |
| ARIES method | IBM ARIES paper | Full paper |
| Durable systems engineering | Database Internals | Reliability sections |
Common Pitfalls and Debugging
Problem 1: “Committed transaction missing after restart”
- Why: commit ACK happened before fsync.
- Fix: enforce durable commit record before ACK.
- Quick test: crash immediately after ACK and verify commit persists.
Problem 2: “Redo reapplies updates twice”
- Why: missing pageLSN idempotence check.
- Fix: redo only if `pageLSN < recordLSN`.
- Quick test: repeat recovery twice and compare final state hash.
Definition of Done
- Recovery passes deterministic crash matrix.
- Committed effects survive crash.
- Uncommitted effects are undone.
- Replay is idempotent across repeated restart.
Project 11: Multi-Version Concurrency Control (MVCC)
- File: `P11-mvcc-engine.md`
- Main Programming Language: C++
- Alternative Programming Languages: Rust
- Coolness Level: Level 5 - Pure Magic
- Business Potential: Resume Gold
- Difficulty: Master
- Knowledge Area: snapshots, visibility rules, version GC
- Software or Tool: Projects 2, 3, 9, 10
- Main Book: Database System Concepts + transaction texts
What you will build: MVCC layer enabling snapshot reads with non-blocking reader/writer interactions.
Why it teaches database systems: It reflects modern high-concurrency DB architecture.
Core challenges you will face:
- Version chain storage and traversal.
- Snapshot visibility correctness.
- Safe garbage collection horizon.
Real World Outcome
$ ./db_lab mvcc-demo --seed 55
[txn] T1 start_ts=100
[txn] T2 start_ts=110
[T2] update key=42 value=900 -> commit_ts=115
[T1] read key=42 value=850 (snapshot)
[T3] start_ts=120 read key=42 value=900
[gc] reclaimed_versions=12 safe_horizon=100
[result] PASS mvcc-snapshot-visibility
The Core Question You Are Answering
“How can readers and writers proceed concurrently without sacrificing correctness?”
Concepts You Must Understand First
- Snapshot timestamp model.
- Version metadata (`xmin`/`xmax` or equivalent).
- Writer conflict detection and GC horizon rules.
Questions to Guide Your Design
- Where will version metadata live: tuple header, side table, or both?
- How do you resolve write-write conflicts deterministically?
- What metrics indicate version bloat risk?
Thinking Exercise
Simulate three transactions with overlapping lifetimes and determine visible version for each read.
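The visibility predicate at the heart of this exercise, in its simplest commit-timestamp form (begin/end commit timestamps play the role of `xmin`/`xmax`; in-progress writers are ignored for brevity):

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

constexpr uint64_t INF = UINT64_MAX;

// One physical version of a logical key. begin_ts/end_ts are commit
// timestamps: the version is current during [begin_ts, end_ts).
struct Version {
    uint64_t begin_ts;   // commit ts of the creating transaction
    uint64_t end_ts;     // commit ts of the overwriting/deleting one, or INF
    int value;
};

// A snapshot taken at snap_ts sees the version whose lifetime covers it.
bool visible(const Version& v, uint64_t snap_ts) {
    return v.begin_ts <= snap_ts && snap_ts < v.end_ts;
}

int read_key(const std::vector<Version>& chain, uint64_t snap_ts) {
    for (const Version& v : chain)
        if (visible(v, snap_ts)) return v.value;
    return -1;   // no visible version for this snapshot
}
```

This reproduces the demo above: T1's snapshot at 100 keeps seeing 850 even after T2 commits 900 at timestamp 115, while T3's later snapshot sees 900.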
The Interview Questions They Will Ask
- “How does MVCC reduce lock contention?”
- “Why can snapshot isolation allow write skew?”
- “What makes version GC safe?”
- “How do indexes interact with versioned tuples?”
- “When would you still need explicit locks with MVCC?”
Hints in Layers
Hint 1: Implement the visibility function first. Everything else depends on it.
Hint 2: Keep version metadata explicit and testable. Opaque flags make debugging painful.
Hint 3: Build anomaly demonstration tests. Show what each isolation level allows/prevents.
Hint 4: Add GC watermark instrumentation. Monitor version-chain lengths continuously.
Books That Will Help
| Topic | Book | Chapter |
|---|---|---|
| MVCC and snapshots | Database System Concepts | Concurrency chapters |
| Transaction correctness | Gray & Reuter | Isolation sections |
| Production perspective | PostgreSQL docs | MVCC sections |
Common Pitfalls and Debugging
Problem 1: “Readers see future versions”
- Why: incorrect visibility comparison against snapshot timestamp.
- Fix: centralize visibility predicate and test edge timestamps.
- Quick test: deterministic transaction timeline fixtures.
Problem 2: “GC removes live versions”
- Why: safe horizon computed without considering oldest active txn.
- Fix: horizon = min(active_start_ts).
- Quick test: long-running reader during GC run.
Definition of Done
- Snapshot reads are deterministic and correct.
- Writer conflicts are handled per policy.
- GC reclaims obsolete versions safely.
- Isolation anomaly behavior is documented by test.
Project 12: Simple Statistics and Cost-Based Optimizer
- File: `P12-statistics-cost-optimizer.md`
- Main Programming Language: C++
- Alternative Programming Languages: Rust, Python
- Coolness Level: Level 4 - Hardcore Tech Flex
- Business Potential: Resume Gold
- Difficulty: Master
- Knowledge Area: statistics, cardinality estimation, plan search
- Software or Tool: Projects 6-8
- Main Book: Database System Concepts + System R paper
What you will build: Statistics collector and cost model to choose better physical plans automatically.
Why it teaches database systems: It gives your engine query-planning intelligence.
Core challenges you will face:
- Stats model design (histograms, NDV, null fraction).
- Estimation errors under skew/correlation.
- Plan enumeration and cost ranking.
Real World Outcome
$ ./db_lab cbo-demo --seed 808
[analyze] table=orders rows=5_000_000 ndv(customer_id)=420000
[query] SELECT * FROM orders WHERE customer_id=42
[planA] SeqScan cost=145000 est_rows=12000
[planB] IndexScan cost=1800 est_rows=14
[chosen] planB
[actual] rows=17 latency=19.4 ms
[result] PASS cost-based-plan-selection
The Core Question You Are Answering
“How does a database predict query cost well enough to choose fast plans automatically?”
Concepts You Must Understand First
- Cardinality estimation basics.
- Cost-model calibration (I/O vs CPU weights).
- Join-order search complexity.
Questions to Guide Your Design
- Which stats are cheap enough to maintain and still useful?
- How often should stats refresh?
- How will you compare estimated vs actual to improve the model?
Thinking Exercise
Given two predicates on correlated columns, estimate the row count under the independence assumption and explain why the estimate can fail.
The Interview Questions They Will Ask
- “What statistics are most useful for plan selection?”
- “Why do bad cardinality estimates cause bad plans?”
- “What is a Selinger-style optimizer?”
- “How do you validate a cost model in practice?”
- “When should planner hints be avoided?”
Hints in Layers
Hint 1: Start with single-table selectivity estimation. Keep the scope narrow, then expand to joins.
Hint 2: Separate estimation from planning. It makes experimentation easier.
Hint 3: Keep `EXPLAIN` output machine-readable. A/B comparisons become straightforward.
Hint 4: Track the estimate/actual error distribution. Use data, not intuition, for improvements.
Books That Will Help
| Topic | Book | Chapter |
|---|---|---|
| Cost-based optimization | Database System Concepts | Ch. 16 |
| Foundational optimizer design | System R paper | Full paper |
| Modern optimizer thinking | Readings in Database Systems | Optimizer chapters |
Common Pitfalls and Debugging
Problem 1: “Planner always picks same join order”
- Why: enumeration or cost function bug.
- Fix: verify candidate generation and compare cost traces.
- Quick test: small 3-table query with obvious selectivity differences.
Problem 2: “Estimated rows far from actual on skewed data”
- Why: simplistic uniformity assumption.
- Fix: add histograms and/or frequency sketches.
- Quick test: Zipf-distributed dataset benchmark.
Definition of Done
- Stats collection pipeline exists and is repeatable.
- Cost model chooses better plan than baseline for benchmark set.
- Explain output includes estimated and actual rows.
- Estimation error analysis is documented.
Project Comparison Table
| Project | Difficulty | Time | Depth of Understanding | Fun Factor |
|---|---|---|---|---|
| 1. Disk-Oriented Storage Manager | Intermediate | 1 week | High storage intuition | ★★★☆☆ |
| 2. Tuple and Page Manager | Intermediate | 1-2 weeks | High layout intuition | ★★★☆☆ |
| 3. Buffer Pool Manager | Advanced | 2 weeks | High systems intuition | ★★★★☆ |
| 4. B+Tree Index | Expert | 3-4 weeks | Very high data-structure depth | ★★★★★ |
| 5. Extendable Hash Index | Advanced | 2 weeks | High index tradeoff depth | ★★★★☆ |
| 6. Relational Algebra Executor | Advanced | 2 weeks | High operator-model depth | ★★★★☆ |
| 7. Join Operators | Expert | 3 weeks | Very high algorithm depth | ★★★★★ |
| 8. Parser + Rule Optimizer | Expert | 3 weeks | High compiler+DB bridge depth | ★★★★☆ |
| 9. Transaction Manager (2PL) | Expert | 3 weeks | Very high correctness depth | ★★★★★ |
| 10. Recovery Manager | Master | 4+ weeks | Very high reliability depth | ★★★★★ |
| 11. MVCC Engine | Master | 3-4 weeks | Very high concurrency depth | ★★★★★ |
| 12. Statistics + CBO | Master | 3-4 weeks | Very high optimization depth | ★★★★★ |
Recommendation
If you are new to DB internals: Start with Project 1, then 2, then 3 to establish reliable page+cache mental models.
If you are a backend engineer focused on query speed: Start with Project 3, then 4, 6, 7, and 12.
If you want reliability engineering depth: Start with Project 9, then 10, then 11.
Final Overall Project
Capstone: Mini-Relational DBMS
The Goal: Integrate Projects 1-12 into one coherent mini DBMS with deterministic CLI behavior.
- Build a catalog and SQL front door (`CREATE TABLE`, `CREATE INDEX`, `INSERT`, `SELECT`, `UPDATE`, `DELETE`).
- Route statements through parser -> optimizer -> execution with transaction context.
- Enforce durability and recovery workflow with crash-injection test matrix.
- Publish an `EXPLAIN` path showing chosen plans and estimate/actual diagnostics.
Success Criteria: The system returns correct results, preserves committed data across crashes, rejects/rolls back invalid transactions safely, and demonstrates measurable plan-quality improvements from statistics.
From Learning to Production: What Is Next
| Your Project | Production Equivalent | Gap to Fill |
|---|---|---|
| Page/Tuple Manager | PostgreSQL heap + page format | Multi-table catalog maturity, migration tooling |
| Buffer Pool Manager | InnoDB/PostgreSQL shared buffer stack | NUMA tuning, background flush/checkpoint sophistication |
| B+Tree/Hash Index | Production index engines | Online rebuild, concurrent DDL, compression |
| Executor + Planner | PostgreSQL planner/executor | Advanced SQL semantics, vectorized execution, JIT |
| 2PL/MVCC | Production concurrency subsystem | Full serializable protocol, lock hierarchy completeness |
| WAL Recovery | ARIES-like recovery in major DBMS | Backup/PITR tooling, distributed replication, ops tooling |
Summary
This learning path covers database systems internals through 12 hands-on projects plus one capstone integration.
| # | Project Name | Main Language | Difficulty | Time Estimate |
|---|---|---|---|---|
| 1 | Disk-Oriented Storage Manager | C++ | Intermediate | 1 week |
| 2 | Tuple and Page Manager | C++ | Intermediate | 1-2 weeks |
| 3 | Buffer Pool Manager | C++ | Advanced | 2 weeks |
| 4 | B+Tree Index | C++ | Expert | 3-4 weeks |
| 5 | Hash Index (Extendable Hashing) | C++ | Advanced | 2 weeks |
| 6 | Relational Algebra Executor | C++ | Advanced | 2 weeks |
| 7 | Join Operators | C++ | Expert | 3 weeks |
| 8 | Query Parser and Rule Optimizer | C++ | Expert | 3 weeks |
| 9 | Transaction Manager (2PL) | C++ | Expert | 3 weeks |
| 10 | Recovery Manager (UNDO/REDO) | C++ | Master | 4+ weeks |
| 11 | MVCC Engine | C++ | Master | 3-4 weeks |
| 12 | Statistics and Cost Optimizer | C++ | Master | 3-4 weeks |
Expected Outcomes
- You can explain DBMS architecture with invariants, not just features.
- You can implement and validate critical DB subsystems deterministically.
- You can diagnose performance and correctness issues with evidence.
Additional Resources and References
Standards and Specifications
- SQL standards overview (ISO/IEC 9075 family)
- PostgreSQL WAL docs
- PostgreSQL MVCC docs
- SQLite file format
Industry Analysis and Adoption Signals
- Stack Overflow Developer Survey 2024 - Databases
- DB-Engines Ranking
- SQLite deployment scale statement
Foundational Papers
- "Access Path Selection in a Relational Database Management System" (Selinger et al., the System R optimizer paper used in Project 12)
- "ARIES: A Transaction Recovery Method Supporting Fine-Granularity Locking and Partial Rollbacks Using Write-Ahead Logging" (Mohan et al.)
Books
- “Database System Concepts” by Silberschatz, Korth, Sudarshan - Core textbook mapping for most projects.
- “Database Internals” by Alex Petrov - Practical engine implementation perspective.
- “Transaction Processing: Concepts and Techniques” by Gray and Reuter - Concurrency and recovery rigor.