Sprint: Database Systems Mastery - Real World Projects
Goal: Build first-principles mastery of database systems by implementing the core modules of a mini relational DBMS, end-to-end, with observable outputs at every step. You will understand not only what each database subsystem does, but why it exists, which invariants keep it correct, and what fails when those invariants break. This guide takes you from bytes on disk to parsed SQL to optimized execution plans and crash-safe recovery. By the end, you will be able to design, build, debug, and reason about storage engines, indexing, query execution, concurrency control, and durability like a systems engineer.
Introduction
- What is a database system? A database management system (DBMS) is a software system that stores data durably, enforces correctness under concurrency, and executes declarative queries efficiently.
- What problem does it solve today? It gives applications a reliable contract for data integrity, fast retrieval, transactional updates, and operational recoverability.
- What will you build across this sprint? A layered mini-DBMS: disk/page manager, tuple/page formats, buffer pool, B+Tree and hash indexing, query execution operators, parser+optimizer, transaction manager, recovery manager, MVCC, and a cost-based optimizer.
- In scope: single-node database internals, architecture and algorithms, deterministic local testing, reliability drills.
- Out of scope: full distributed consensus implementation, cloud control plane features, and vendor-specific managed services.
Big-picture architecture
Application / SQL Client
|
v
Parser -> Logical Plan -> Optimizer -> Physical Plan
|
v
Execution Engine (iterators)
|
v
+---------------- Transaction Manager ----------------+
| (locks / MVCC / isolation) |
+----------------------+------------------------------+
|
v
Buffer Pool / Page Cache
|
+-------------+-------------+
| |
v v
Index Manager Heap/Table Manager
(B+Tree, Hash) (pages, tuples, slots)
| |
+-------------+-------------+
|
v
WAL / Log + Data Files on Disk
How to Use This Guide
- Read the Theory Primer first. It is the mini-book that prevents shallow implementation and blind trial/error.
- Build projects in order at least once; later you can jump by learning path.
- For every project, start from The Core Question You Are Answering, then do the Thinking Exercise, then implement.
- Treat each Real World Outcome as the acceptance contract; if your output differs, you are not done.
- Keep an engineering notebook: invariants, failures observed, what fixed them, and why.
- After every two projects, revisit the concept chapters and answer the check-your-understanding questions from memory.
Prerequisites & Background Knowledge
Essential Prerequisites (Must Have)
- Programming in a systems language (C++ or Rust): pointers/references, memory ownership, data structures.
- Basic operating systems concepts: files, virtual memory, processes, synchronization primitives.
- Basic algorithm analysis: asymptotic complexity, cache and I/O sensitivity.
- SQL basics: SELECT, JOIN, GROUP BY, INSERT, UPDATE, DELETE.
- Recommended Reading: “Database System Concepts” by Silberschatz, Korth, and Sudarshan - Ch. 1-3, 13-16.
Helpful But Not Required
- Compiler basics (tokenization/parsing) for Project 8.
- Probabilistic reasoning and histograms for Project 12.
- Linux tracing/perf tools for deeper observability.
Self-Assessment Questions
- Can you explain why a DBMS uses fixed-size pages instead of arbitrary record-sized disk writes?
- Can you explain what makes WAL durable and why ordering rules matter?
- Can you explain one anomaly prevented by serializable isolation?
Development Environment Setup
Required Tools:
- C++ compiler with C++20 support (clang++ or g++)
- CMake 3.24+
- Python 3.11+ for test harnesses and log generation
- sqlite3 CLI (for comparison experiments)
- jq and rg for diagnostics and text/data checks
Recommended Tools:
- perf, strace, dtrace/Instruments (platform dependent)
- lldb or gdb
- Docker for reproducible local labs
Testing Your Setup:
$ clang++ --version
clang version 16.x (or newer)
$ cmake --version
cmake version 3.24.x (or newer)
$ python3 --version
Python 3.11.x (or newer)
$ sqlite3 --version
3.4x.x
Time Investment
- Simple projects: 4-8 hours each
- Moderate projects: 10-20 hours each
- Complex projects: 20-40 hours each
- Total sprint: 4-6 months part-time
Important Reality Check Database internals are hard because correctness and performance are coupled. You can pass happy-path tests with a broken system. The real work is preserving invariants under partial failure, concurrency, and adversarial data distributions.
Big Picture / Mental Model
Database systems are a chain of contracts. Each layer hides complexity but depends on upstream and downstream invariants.
[SQL Contract]
"Result is logically correct"
|
v
[Planner Contract]
"Chosen plan is semantically equivalent"
|
v
[Executor Contract]
"Operators produce tuples in correct semantics"
|
v
[Txn Contract]
"Concurrent behavior respects isolation"
|
v
[Storage Contract]
"Data and metadata remain durable and recoverable"
If any contract is violated:
- planner bug -> wrong answers
- executor bug -> missing/duplicate tuples
- txn bug -> anomalies or corruption
- storage/recovery bug -> permanent data loss
Theory Primer
Chapter 1: Storage Engines, Pages, and Record Layouts
Fundamentals A DBMS stores and moves data in fixed-size pages because disk and SSD I/O are page-oriented and because stable units simplify caching, checksums, and recovery. Within a page, records are not just copied as structs; they are encoded according to a page format that supports variable-length columns, nullability, version metadata, and free-space management. The dominant design is a slotted page: a header and slot array point to tuple payloads, allowing in-page movement without changing logical record identifiers. Storage design is about balancing write amplification, read locality, and update friendliness. If your physical layout is wrong, every upper layer pays: indexes fragment, scans slow down, vacuum/compaction becomes expensive, and crash recovery complexity increases.
Deep Dive
The first deep principle in storage engines is that logical rows and physical bytes must be decoupled. Applications think in rows and columns, but the engine thinks in pages, offsets, and byte ranges. A slotted page is the bridge. The page header stores metadata such as free-space pointer, slot count, and sometimes checksum/version. Slots map stable slot IDs to variable tuple offsets and lengths. This allows you to move tuple bodies during compaction while keeping RecordId = (PageId, SlotId) stable. Without this indirection, updates and deletes would force cascading pointer rewrites across indexes and logs.
Second, you must define tuple encoding rules before writing any code. For fixed-size columns, offsets are deterministic. For variable-size columns, you need either an offset table or length-prefixed segments. Null handling usually requires a bitmap. Every tuple also needs visibility/version metadata once transactions arrive. Even if you postpone MVCC, reserve bits/fields early to avoid format migrations. A strong page format is intentionally forward-compatible: include version number and optional extension region so later projects can evolve without rewriting all data.
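The encoding rules above can be made concrete with a small sketch in Python (the sprint's test-harness language). The header layout here — a 1-byte field count, a 2-byte null bitmap, and 2-byte length prefixes — is an illustrative choice, not a prescribed format; note that the version/extension fields recommended above are omitted for brevity.

```python
import struct

def encode_tuple(values):
    """Encode a tuple of optional byte strings: null bitmap header plus
    length-prefixed segments for each non-null value."""
    null_bitmap = 0
    body = b""
    for i, v in enumerate(values):
        if v is None:
            null_bitmap |= 1 << i
        else:
            body += struct.pack("<H", len(v)) + v  # 2-byte length prefix
    # Header: 1-byte field count + 2-byte null bitmap (up to 16 fields).
    return struct.pack("<BH", len(values), null_bitmap) + body

def decode_tuple(buf):
    """Invert encode_tuple using only the bytes themselves."""
    count, null_bitmap = struct.unpack_from("<BH", buf, 0)
    off, out = 3, []
    for i in range(count):
        if null_bitmap & (1 << i):
            out.append(None)
            continue
        (n,) = struct.unpack_from("<H", buf, off)
        out.append(buf[off + 2:off + 2 + n])
        off += 2 + n
    return out
```

The round trip is self-describing: no schema is needed to walk the bytes, which is exactly the property a page format must preserve across restarts.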
Third, free-space management is not a micro-optimization; it is a core control loop. Insertion searches for a page with enough free bytes. Deletion creates holes. Update may require in-place rewrite, forwarding pointer, or delete+insert depending on size delta. If you do nothing, pages fragment and effective capacity collapses. A common policy is periodic compaction when free-space fragmentation exceeds a threshold. But compaction itself increases write cost and can interfere with active readers/writers, so it should be policy-driven and observable.
Fourth, understand physical locality. Sequential scans rely on contiguous pages and prefetch behavior. Secondary indexes rely on stable record identifiers and predictable page residency. If heap files are append-only without maintenance, hot rows scatter, cache hit rate drops, and write amplification rises during updates. Production systems address this with clustering, vacuum/reorg, and fill-factor controls. Your lab system can emulate this with explicit page reuse and periodic reorganization tasks.
Fifth, page checksums and torn-write defenses matter even in learning systems. A crash during page write can leave partial bytes on disk. WAL protects logical durability, but page-level corruption still needs detection. Include checksum field in page header and validate on read. If checksum fails, surface deterministic error behavior. This discipline makes your later recovery project realistic: not all failures are clean process crashes.
Sixth, storage metadata must be explicit. At minimum, maintain a catalog page or metadata file with page size, format version, next page ID, free-page list head, and optional table root pointers. Avoid hidden assumptions. Most student implementations fail when they hardcode “page 0 is always header” but then forget migration paths. The key invariant is: metadata must be recoverable and self-describing after restart.
Seventh, measure storage behavior continuously. Track insert occupancy, free-space histogram, average tuple size, and fragmentation ratio. These metrics guide tuning and reveal bugs. If a delete-heavy workload never recovers free space, your free-list or compaction algorithm is wrong. If updates frequently become delete+insert, your page fill strategy is too aggressive. A DBMS storage layer is not static; it is an adaptive system.
Finally, connect storage to future layers. Index entries will store record IDs that reference heap pages. Buffer pool replacement will evict pages based on usage. Recovery will replay log records against page IDs and LSNs. Every field you choose now affects those layers. Storage design is the foundation of everything else: correctness, performance, and maintainability.
How this fits into the projects
- Primary concept for Projects 1 and 2.
- Required underpinning for Projects 3-12.
Definitions & key terms
- Page: fixed-size I/O unit between memory and disk.
- Slotted page: page format using slot directory + tuple payload region.
- RecordId (RID): stable logical pointer, typically (PageId, SlotId).
- Fragmentation: free bytes exist but are not contiguous enough for the target insert.
- Compaction: rearranging tuples to coalesce free space.
Mental model diagram
+------------------------------ Page (8KB) -------------------------------+
| Header: page_id | lsn | checksum | free_start | free_end | slot_count |
+-------------------------------------------------------------------------+
| Tuple payload region (grows up) Free space |
| [tupleA][tupleB moved][tupleC]............................(fragmented?) |
| [slotN][slot2][slot1]|
+-------------------------------------------------------------------------+
Slot entry -> (offset, len, flags)
RID = (page_id, slot_id)
How it works (step-by-step, with invariants and failure modes)
- Allocate page ID and initialize header.
- Insert tuple: check contiguous space, write payload, create slot.
- Delete tuple: mark slot deleted/tombstone; adjust visibility metadata.
- Optional compaction: rewrite live tuple payloads, update slot offsets.
- Persist changes with WAL-before-data rule once recovery layer exists.
Invariants:
- Slot IDs are stable unless explicitly remapped with index/log coordination.
- Header free-space pointers must never overlap payload/slot regions.
- Checksum must match serialized bytes at write boundary.
Failure modes:
- Miscomputed offsets -> silent corruption.
- Missing tombstone semantics -> ghost tuples in scans.
- Non-atomic metadata update -> unrecoverable page state.
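The step sequence and invariants above can be exercised with a toy model. This sketch keeps the slot directory as a Python list instead of serializing it into the page bytes, and the header/slot sizes are placeholders — treat it as a behavioral model, not a file format:

```python
class SlottedPage:
    """Toy slotted page: a slot directory maps stable slot ids to
    (offset, length) entries; payloads are packed from the page end."""

    HEADER = 16      # stand-in fixed header size
    SLOT_SIZE = 4    # stand-in serialized slot entry size

    def __init__(self, size=4096):
        self.size = size
        self.data = bytearray(size)
        self.slots = []        # slot_id -> (offset, length), or None = tombstone
        self.free_end = size   # payload region grows downward from here

    def insert(self, payload):
        need = len(payload) + self.SLOT_SIZE
        if self.free_end - (self.HEADER + self.SLOT_SIZE * len(self.slots)) < need:
            raise RuntimeError("page full")
        self.free_end -= len(payload)
        self.data[self.free_end:self.free_end + len(payload)] = payload
        self.slots.append((self.free_end, len(payload)))
        return len(self.slots) - 1          # RID slot component stays stable

    def read(self, slot_id):
        entry = self.slots[slot_id]
        if entry is None:
            return None                     # deleted: ghost must not reappear
        off, n = entry
        return bytes(self.data[off:off + n])

    def delete(self, slot_id):
        self.slots[slot_id] = None          # tombstone; space reclaimed later

    def compact(self):
        """Rewrite live payloads contiguously; slot ids do not change."""
        live = [(sid, self.read(sid)) for sid, e in enumerate(self.slots)
                if e is not None]
        write = self.size
        for sid, payload in live:
            write -= len(payload)
            self.data[write:write + len(payload)] = payload
            self.slots[sid] = (write, len(payload))
        self.free_end = write
```

After a delete-then-compact cycle, survivors keep their slot ids while their offsets move — the indirection the text calls essential.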
Minimal concrete example
Pseudo-transcript:
INSERT tuple size=46 into page=17
- free contiguous bytes before: 112
- write payload at offset 2048
- slot_count: 13 -> 14
- slot[14] = {offset:2048,len:46,flags:live}
- free contiguous bytes after: 62
Common misconceptions
- “Rows are stored exactly as language structs.” -> DB tuple encoding is explicit and versioned.
- “Delete immediately frees bytes.” -> Usually deletion marks space reusable later; fragmentation persists.
- “Page format is internal and can be sloppy.” -> It is a long-term compatibility contract.
Check-your-understanding questions
- Why is slot indirection essential for stable record references?
- What changes if an updated tuple no longer fits in its original space?
- Why can fragmentation exist even when total free bytes look large?
- What minimum metadata must survive restart?
Check-your-understanding answers
- It allows tuple payload movement without changing logical IDs used by indexes.
- You need relocation policy: forwarding pointer or delete+insert semantics.
- Free space may be non-contiguous and therefore unusable for requested size.
- Page format version, next page ID, free page map/list, and root/catalog pointers.
Real-world applications
- PostgreSQL heap page + visibility design.
- SQLite B-tree page formats and freelist pages.
- InnoDB page directory and record headers.
Where you will apply it
- Project 1 and Project 2 (primary).
- Foundational for Projects 3-12.
References
- SQLite File Format
- Database Internals (Alex Petrov)
- “Database System Concepts” (Silberschatz et al.), storage chapters.
Key insights Physical layout decisions are not implementation details; they are system behavior decisions.
Summary You now have the physical mental model: pages, slots, tuple encoding, and free-space policy. Every later subsystem assumes these rules are correct.
Homework/Exercises to practice the concept
- Design a 4KB slotted page format that supports nullable variable-length columns.
- Simulate 1000 inserts, 300 deletes, 200 updates; compute fragmentation ratio over time.
- Propose a compaction threshold policy and justify tradeoffs.
Solutions to the homework/exercises
- Use header + slot array + payload region + null bitmap + varlen offset table.
- Fragmentation ratio should rise after deletes and size-increasing updates; compaction reduces it.
- Example: compact when fragmentation > 25% and page not pinned; this balances write cost and reuse.
Chapter 2: Buffer Pool, Caching Policy, and I/O Control
Fundamentals A buffer pool is the in-memory cache for disk pages and is usually the single largest determinant of DBMS performance. It mediates every read/write between operators and storage and enforces page lifecycle states: clean/dirty, pinned/unpinned, and resident/evicted. A page table maps page IDs to frames. When memory is full, a replacement policy selects eviction candidates while respecting pin counts and dirty flushing constraints. Correctness requires careful ordering: dirty pages cannot be flushed out-of-order relative to WAL, and active pages must not be evicted while in use. Performance requires minimizing random I/O and avoiding pathological replacement behavior under mixed scan + point-lookup workloads.
Deep Dive Think of the buffer pool as a traffic controller for scarce memory, not just a map. Queries generate page demand patterns with very different locality profiles: index probes are skewed and hot-set heavy; analytical scans are broad and streaming; joins may create bursty access to build/probe pages. A naive LRU can perform well in simple OLTP but fail under scans that pollute the cache. Modern systems often use Clock variants, segmented LRU, or admission controls to protect hot pages from one-time scan floods.
The core data structures are straightforward but subtle in behavior. The page table maps PageId -> FrameId. Each frame stores a page buffer and metadata: pin count, dirty flag, recency score/reference bit, and optionally pageLSN. On FetchPage(pid), if present, increment pin and return. If absent, choose victim frame from replacer among unpinned pages. If victim dirty, flush subject to WAL constraint. Then read requested page into frame, update mapping, set pin=1. On UnpinPage(pid, is_dirty), decrement pin and possibly mark dirty. Bugs often arise when pin counts are mishandled: negative counts, leaked pins, or eviction of pinned pages.
Dirty-page handling introduces durability coupling. With WAL, the invariant is flushed_log_lsn >= pageLSN before a dirty page is flushed. If violated, crash recovery cannot reconstruct missing updates. This means buffer manager cannot flush in isolation; it must coordinate with log manager. Even in a learning DBMS, explicitly encode this interface: CanFlush(page_lsn) or FlushLogUpTo(page_lsn). Without this, later recovery implementation becomes patchy and non-deterministic.
Replacement policy selection should be workload-driven. LRU is intuitive but expensive to maintain and vulnerable to scan thrash. Clock approximates LRU with low overhead and is common in teaching systems. ARC and LRU-K are more adaptive but complex. A pragmatic path is Clock with two enhancements: scan resistance (do not immediately admit sequential one-hit pages into protected set) and dirty-page awareness (prefer clean victims when equivalent). Track hit ratio by object class (index/table/temp) so policy tuning has evidence.
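The base Clock mechanism recommended above fits in a few lines. This sketch is single-threaded and omits the scan-resistance and dirty-page-awareness enhancements just described:

```python
class ClockReplacer:
    """Clock replacement over a fixed set of frames: a reference bit is
    set on access; the hand clears bits until it finds an unpinned frame
    whose bit is already clear (its 'second chance' is used up)."""

    def __init__(self, nframes):
        self.ref = [False] * nframes
        self.pinned = [False] * nframes
        self.hand = 0

    def record_access(self, f):
        self.ref[f] = True

    def pin(self, f):
        self.pinned[f] = True

    def unpin(self, f):
        self.pinned[f] = False

    def victim(self):
        n = len(self.ref)
        for _ in range(2 * n):              # at most two full sweeps
            f = self.hand
            self.hand = (self.hand + 1) % n
            if self.pinned[f]:
                continue                    # invariant: never evict pinned
            if self.ref[f]:
                self.ref[f] = False         # second chance granted
            else:
                return f
        return None                         # every frame pinned: no victim
```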
Concurrency is another major dimension. Multiple threads can fetch/unpin simultaneously. A global mutex is correct but scales poorly. Fine-grained latching (page table latch + frame latch + replacer latch) improves throughput but increases deadlock risk. Establish strict lock order early. Example: page table latch -> frame latch -> replacer latch. If violated, deadlocks appear only under stress and are difficult to reproduce.
Prefetching and background flushing convert reactive behavior into proactive behavior. Sequential scans benefit from read-ahead; checkpointer/background writer smooths write bursts by flushing old dirty pages before eviction pressure spikes. These mechanisms reduce tail latency and avoid stalls where worker threads do synchronous flushes in critical paths.
Observability should be built into the buffer pool API: counters for hit/miss, evictions, dirty flushes, no-victim events, average pin duration, and latch wait time. Most systems engineering gains come from measuring before optimizing. If hit ratio looks high but query latency remains high, you may have latch contention or write stalls.
Finally, the buffer pool is where storage, execution, and recovery intersect. Operators should request pages through a minimal API, not bypass it. Recovery must reapply updates through page fetch/update interfaces that preserve pageLSN semantics. Index and heap managers should never write raw files directly once buffer pool exists. Violating this layering creates split-brain state and impossible debugging sessions.
How this fits into the projects
- Primary concept for Project 3.
- Enables reliable behavior for Projects 4-12.
Definitions & key terms
- Frame: one memory slot that can hold one page.
- Pin count: number of active users preventing eviction.
- Dirty page: modified in memory, not yet persisted.
- Replacer: policy component that selects eviction victims.
- pageLSN: latest log sequence number reflected in page image.
Mental model diagram
Executor requests PageId=42
|
v
Page table lookup ---> hit? ---- yes ---> frame -> pin++ -> use page
| no
v
Choose victim from replacer (unpinned only)
|
+-> victim dirty? yes -> ensure WAL flushed to pageLSN -> write page
|
v
Read page 42 into frame -> update page table -> pin=1 -> return
How it works (step-by-step, with invariants and failure modes)
- FetchPage checks the page table under latch.
- Miss path reserves a victim frame from the replacer.
- Dirty victim flush with WAL ordering check.
- Load target page and register mapping.
- UnpinPage decrements the pin count, marks the page dirty if needed, and re-enqueues the frame.
Invariants:
- Pinned pages cannot be eviction victims.
- Every resident page has exactly one page-table mapping.
- Flushed dirty page must satisfy WAL-before-data.
Failure modes:
- Pin leak -> buffer pool exhaustion (no free victim).
- Missing mapping cleanup -> duplicate frame aliases.
- Concurrent dirty flush race -> lost update or stale page image.
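The fetch/unpin lifecycle and the WAL-before-data check can be modeled end-to-end. Here the "disk" is a dict, StubLog stands in for a real log manager, and the victim policy is simply the first unpinned frame — all simplifications for illustration:

```python
class StubLog:
    """Stand-in log manager tracking the highest flushed LSN."""
    def __init__(self):
        self.flushed_lsn = 0

    def flush_up_to(self, lsn):
        self.flushed_lsn = max(self.flushed_lsn, lsn)

class BufferPool:
    """Toy single-threaded buffer pool: page table, pin counts, dirty
    flags, and the flushed_log_lsn >= pageLSN check on dirty eviction."""

    def __init__(self, nframes, disk, log):
        self.disk, self.log = disk, log      # disk: dict PageId -> bytearray
        self.frames = [None] * nframes       # frame -> (pid, data) or None
        self.pin = [0] * nframes
        self.dirty = [False] * nframes
        self.page_lsn = [0] * nframes
        self.table = {}                      # PageId -> frame index

    def fetch(self, pid):
        if pid in self.table:                # hit: pin and return
            f = self.table[pid]
            self.pin[f] += 1
            return self.frames[f][1]
        f = self._victim()
        if f is None:
            raise RuntimeError("no free victim")
        if self.frames[f] is not None:       # evict current resident
            old_pid, old_data = self.frames[f]
            if self.dirty[f]:
                self.log.flush_up_to(self.page_lsn[f])  # WAL-before-data
                self.disk[old_pid] = old_data
            del self.table[old_pid]
        data = self.disk.get(pid, bytearray(64))
        self.frames[f] = (pid, data)
        self.table[pid] = f
        self.pin[f], self.dirty[f], self.page_lsn[f] = 1, False, 0
        return data

    def unpin(self, pid, is_dirty, lsn=0):
        f = self.table[pid]
        assert self.pin[f] > 0, "unpin without matching pin"
        self.pin[f] -= 1
        if is_dirty:
            self.dirty[f] = True
            self.page_lsn[f] = max(self.page_lsn[f], lsn)

    def _victim(self):
        for f, frame in enumerate(self.frames):
            if frame is None or self.pin[f] == 0:
                return f
        return None
```

Forcing an eviction of a dirty page shows the ordering: the log is flushed up to the page's LSN before the bytes reach the "disk".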
Minimal concrete example
Pseudo-events:
T1 FetchPage(7) miss -> frame3 loaded pin=1
T2 FetchPage(7) hit -> frame3 pin=2
T1 UnpinPage(7, dirty=true) -> pin=1 dirty=true
T2 UnpinPage(7, dirty=false) -> pin=0 dirty=true -> candidate for replacer
Common misconceptions
- “High hit ratio always means fast queries.” -> Contention and flush stalls can dominate.
- “LRU is always best.” -> Workload shape determines effectiveness.
- “Dirty page can flush anytime.” -> WAL ordering constraint applies.
Check-your-understanding questions
- Why is pin count separate from dirty flag?
- What causes scan pollution and how do you mitigate it?
- Why should executor/storage avoid bypassing buffer pool?
Check-your-understanding answers
- Usage lifetime and persistence status are different concerns.
- One-time scan pages evict hot pages; use admission/segmentation/read-ahead controls.
- Bypassing breaks coherence, durability ordering, and observability.
Real-world applications
- PostgreSQL shared buffers and background writer/checkpointer.
- InnoDB buffer pool instances and flush lists.
- Embedded DB cache tuning for edge devices.
Where you will apply it
- Project 3
- Critical dependency for Projects 4, 5, 6, 7, 10, 11, 12.
References
- PostgreSQL Buffer and WAL docs
- “Architecture of a Database System” (Hellerstein, Stonebraker, Hamilton).
Key insights Caching policy and correctness policy must be designed together; optimizing one without the other breaks the system.
Summary The buffer pool is your performance control plane and durability gatekeeper. Treat it as a first-class subsystem.
Homework/Exercises to practice the concept
- Compare LRU vs Clock on a synthetic mixed workload and record hit/miss and eviction churn.
- Inject a pin leak and explain how system symptoms differ from low-memory symptoms.
- Design metrics dashboard for buffer pool health.
Solutions to the homework/exercises
- Clock often matches LRU with lower overhead; mixed workloads reveal scan sensitivity.
- Pin leak shows persistent
no victimdespite moderate cache size and low eviction count. - Include hit ratio, no-victim events, dirty flush queue depth, and average latch wait.
Chapter 3: Index Structures and Access Paths (B+Tree and Hash)
Fundamentals Indexes are auxiliary structures that trade write cost and storage for faster reads. A relational DBMS relies on indexes to avoid full scans when predicates are selective. B+Trees are the default general-purpose index because they support both point lookups and ordered range scans. Hash indexes are optimized for equality predicates and can offer simpler probe behavior but poor ordered traversal. Access path selection is a planner decision: an index is useful only when combined with accurate selectivity estimates and realistic I/O/CPU cost models. Index correctness depends on key ordering/hash partitioning invariants and on synchronization rules during concurrent updates and structural changes.
Deep Dive
Start with B+Tree semantics. Internal nodes contain separator keys and child pointers; leaf nodes contain sorted keys and record references, often linked for in-order scans. The critical invariant is balanced height: all leaves are at the same depth, so search cost is O(log_f N) where f is fanout. Because nodes are page-sized, fanout is large and tree height remains small even for large datasets. Most point lookup latency then depends on cache residency of upper levels and locality of leaf pages.
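The O(log_f N) claim is easy to quantify; the fanout values below are illustrative guesses for 8KB pages with small keys:

```python
import math

def btree_height(nrows, fanout):
    """Smallest height h (root counted as one level) with fanout**h >= nrows."""
    return max(1, math.ceil(math.log(nrows, fanout)))

# 100 million rows at fanout ~300 needs only about 4 levels; the top
# levels are typically cache-resident, so a lookup touches few cold pages.
```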
Insert and delete are where complexity lives. During insert, if target leaf has space, place key and done. If overflow, split leaf roughly half/half, propagate separator upward, and potentially cascade splits to root. Delete symmetrically may cause underflow requiring redistribution with sibling or merge and parent key deletion. Structural modification operations must preserve search correctness at all times, even under concurrency. Production systems use latch coupling or optimistic methods to prevent readers from seeing broken paths.
Page layout inside index nodes is performance critical. Keys may be fixed or variable; entries may use prefix compression to increase fanout. Internal node pointers must remain aligned and update-safe. Leaf pages often include sibling pointers for fast range scans and rebalancing. You should design index page headers with node type, key count, parent pointer (optional), next-leaf pointer (for leaves), and low/high fence keys if doing advanced verification.
Hash indexing chooses a different tradeoff. Static hash tables degrade with growth, so dynamic variants like extendable hashing add directory depth and bucket split rules. Extendable hashing keeps global depth for directory and local depth per bucket. On overflow, split bucket and possibly double directory if local depth equals global depth. Equality lookups are efficient, but range queries are unsupported or expensive. Also, skewed keys can create uneven bucket occupancy.
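The directory/bucket mechanics are easier to internalize from a working model. This sketch uses Python dicts as buckets and Python's built-in hash() (the identity for small non-negative ints), so it models the depth bookkeeping, not a page layout:

```python
class ExtendableHash:
    """Toy extendable hashing: a directory of 2**global_depth entries
    points at buckets carrying a local depth; an overflowing bucket
    splits, doubling the directory when local depth == global depth."""

    def __init__(self, bucket_cap=2):
        self.cap = bucket_cap
        self.global_depth = 1
        self.dir = [{"depth": 1, "items": {}}, {"depth": 1, "items": {}}]

    def _bucket(self, key):
        return self.dir[hash(key) & ((1 << self.global_depth) - 1)]

    def get(self, key):
        return self._bucket(key)["items"].get(key)

    def put(self, key, val):
        self._bucket(key)["items"][key] = val
        while len((b := self._bucket(key))["items"]) > self.cap:
            self._split(b)

    def _split(self, b):
        if b["depth"] == self.global_depth:   # directory must double first
            self.dir = self.dir + self.dir
            self.global_depth += 1
        b["depth"] += 1
        bit = 1 << (b["depth"] - 1)           # newly significant hash bit
        new = {"depth": b["depth"], "items": {}}
        for k in list(b["items"]):            # redistribute by the new bit
            if hash(k) & bit:
                new["items"][k] = b["items"].pop(k)
        for i in range(len(self.dir)):        # repoint half of b's entries
            if self.dir[i] is b and (i & bit):
                self.dir[i] = new
```

Inserting a handful of keys forces a bucket split and a directory doubling while the invariant local depth <= global depth holds throughout.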
Concurrency for indexes differs from transaction locking. Index latches protect short critical sections on internal data structures. Transaction locks protect logical isolation across operations. Confusing the two creates either corruption (too weak latching) or severe contention (too coarse locking). During page split/merge, hold latches long enough to preserve structural invariants but release quickly to avoid throughput collapse.
Recovery integration requires logging structural changes carefully. A page split is not a single atomic write; it touches old leaf, new leaf, and parent. WAL records must make operations redo-safe and undo-safe according to your recovery model. Even in simplified labs, define operation sequence so crash at any point can be reasoned about. Example: log intent, create new page, redistribute keys, update parent, mark commit of structure change.
Access path quality is workload-specific. B+Tree excels for range and order-by operations and often for point lookups due to cache-friendly upper levels. Hash can outperform for pure equality on high-cardinality keys with low range requirements. But planner decisions must account for clustering, selectivity, and maintenance cost. Over-indexing harms write throughput and memory pressure.
Finally, index lifecycle management matters: build, validate, rebalance, and drop. A poorly maintained index can become larger than the table and still underperform due to low selectivity. Always measure not just query speedup but write penalty, space overhead, and cache effects.
How this fits into the projects
- Primary concept for Projects 4 and 5.
- Used by Projects 6, 7, 8, and 12.
Definitions & key terms
- Fanout: number of children per internal B+Tree node.
- Split/merge: structural changes after overflow/underflow.
- Extendable hashing: dynamic hash index with directory depth.
- Selectivity: fraction of rows matching a predicate.
- Access path: chosen physical method to retrieve tuples.
Mental model diagram
B+Tree
Root
|
+-- Internal node(s): [k1|k2|k3] -> child pages
|
+--> Leaf pages (sorted keys + RIDs) <-> linked for range scan
Hash Index (extendable)
Hash(key) -> prefix bits -> directory entry -> bucket page -> (key,RID)*
if overflow -> split bucket, maybe double directory
How it works (step-by-step, with invariants and failure modes)
- Lookup: traverse B+Tree separators or hash directory bits to target page.
- Insert: place entry; if overflow perform split and parent update.
- Delete: remove entry; if underflow redistribute/merge.
- Range scan: B+Tree leaf walk via sibling links.
Invariants:
- B+Tree leaves stay sorted and at same depth.
- Extendable hash bucket local depth <= global depth.
- Every index entry references a valid live record ID.
Failure modes:
- Bad split key promotion -> orphaned key ranges.
- Incorrect sibling pointers -> broken range scan.
- Directory depth bug -> unreachable hash buckets.
Minimal concrete example
Pseudo-log of B+Tree insert overflow:
insert key=52 into leaf L9 (full)
split L9 -> L9:[10,20,35] L23:[52,68,71]
promote separator=52 to parent P3
parent P3 insert separator; parent overflow? no
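The transcript above can be reproduced mechanically. This sketch models only the key movement; page allocation, sibling-pointer updates, and logging are deliberately left out:

```python
def split_leaf(keys, new_key):
    """Insert new_key into a full sorted leaf, split in half, and return
    (left_keys, right_keys, separator); the separator is the first key
    of the right leaf and is what gets promoted to the parent."""
    keys = sorted(keys + [new_key])
    mid = len(keys) // 2
    return keys[:mid], keys[mid:], keys[mid]
```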
Common misconceptions
- “Hash index is always faster.” -> Only for suitable equality patterns.
- “B+Tree and binary search tree are basically the same.” -> Page-oriented fanout changes behavior dramatically.
- “More indexes is always better.” -> Write and maintenance costs can dominate.
Check-your-understanding questions
- Why does large node fanout matter more than Big-O notation intuition?
- What planner info is needed to choose between table scan and index scan?
- Why are latches and locks different in index operations?
Check-your-understanding answers
- It keeps tree height tiny, reducing random I/O depth.
- Selectivity, table/index size, clustering, and cached page probability.
- Latches protect in-memory structure integrity; locks enforce transaction isolation.
Real-world applications
- OLTP primary/secondary indexing.
- Event stores with compound index keys.
- Time-series secondary access paths.
Where you will apply it
- Project 4
- Project 5
- Also Project 6, Project 12.
References
- B-tree and B+Tree origin paper metadata
- IBM System R Optimizer context
- “Database System Concepts” indexing chapters.
Key insights Indexes are performance multipliers only when structural invariants and planner cost assumptions are both correct.
Summary You now understand index mechanics, maintenance, and planner coupling for real workloads.
Homework/Exercises to practice the concept
- Compute B+Tree height for 100M rows with realistic page fanout.
- Simulate extendable hash growth under skewed keys.
- Design an experiment comparing range query latency on B+Tree vs hash index.
Solutions to the homework/exercises
- Height is usually small (often 3-4 levels) due to high fanout.
- Skew drives repeated splits in hot buckets; directory growth diverges from what an even key distribution would predict.
- B+Tree should dominate range queries because leaf links preserve order.
Chapter 4: Query Execution and Cost-Based Optimization
Fundamentals
A query engine transforms declarative SQL into executable operator pipelines. Logical plans describe relational algebra semantics; physical plans choose concrete algorithms and access paths. Execution commonly follows the iterator model (open-next-close) to stream tuples through operators. Performance depends on operator choice, join order, predicate pushdown, and data distribution awareness. Cost-based optimization uses statistics to estimate cardinalities and select the lowest-cost plan under a simplified model of I/O and CPU cost. This chapter matters because most real production incidents involving “slow database” are optimizer or plan-shape problems, not parser bugs.
Deep Dive The planner/executor boundary is where theory meets engineering. Logical optimization preserves equivalence while rewriting expressions into easier-to-execute forms: push predicates closer to scans, prune unused columns, simplify joins, and reorder commutative operations. Physical optimization chooses algorithms: nested loop vs hash join, index scan vs seq scan, in-memory sort vs external sort. Every choice has assumptions about row counts and memory availability.
The cardinality estimation problem is the hardest practical issue. The optimizer must estimate how many rows survive each filter and join. It uses table stats such as row count, distinct values, null fraction, and histograms. Independence assumptions are often wrong, especially with correlated columns, leading to large estimate errors and catastrophic plan choices. A robust learning project should expose this by generating skewed and correlated synthetic datasets and comparing estimated vs actual row counts.
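A few lines are enough to expose the independence assumption. The dataset below is artificially perfectly correlated (the second column duplicates the first) to make the estimate error stark:

```python
import random

random.seed(7)
# Two perfectly correlated columns: every row satisfies b == a.
rows = [(a, a) for a in (random.randrange(10) for _ in range(10_000))]

def selectivity(pred):
    return sum(1 for r in rows if pred(r)) / len(rows)

sel_a = selectivity(lambda r: r[0] == 3)                 # about 0.1
sel_b = selectivity(lambda r: r[1] == 3)                 # about 0.1
estimated = sel_a * sel_b                                # independence: ~0.01
actual = selectivity(lambda r: r[0] == 3 and r[1] == 3)  # ~0.1, 10x the estimate
```

A plan chosen from the 0.01 estimate (say, an operator sized for a tiny input) would then face roughly ten times the expected rows.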
Operator implementation details matter. In iterator engines, each operator produces one tuple at a time to parent. This supports pipelining and low memory use for many workloads. But blocking operators (sort, hash build, aggregate) consume full inputs before producing output. The optimizer must account for these memory and spill behaviors. If memory budget is exceeded, operators spill to disk and performance changes by orders of magnitude.
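The open-next-close protocol can be sketched with two pipelined operators; schemas, expression trees, and blocking operators (sort, hash build) are omitted:

```python
class SeqScan:
    """Leaf operator: streams tuples from an in-memory 'table'."""
    def __init__(self, rows):
        self.rows = rows

    def open(self):
        self.i = 0

    def next(self):
        if self.i >= len(self.rows):
            return None                      # end of stream
        row = self.rows[self.i]
        self.i += 1
        return row

    def close(self):
        pass

class Filter:
    """Pipelined operator: pulls from its child, emits matching tuples."""
    def __init__(self, child, pred):
        self.child, self.pred = child, pred

    def open(self):
        self.child.open()

    def next(self):
        while (row := self.child.next()) is not None:
            if self.pred(row):
                return row
        return None

    def close(self):
        self.child.close()

def run(plan):
    """Drive a plan to completion with open-next-close."""
    plan.open()
    out = []
    while (row := plan.next()) is not None:
        out.append(row)
    plan.close()
    return out
```

Because Filter never buffers its input, memory use stays constant no matter how large the scanned table is — the pipelining property the text describes.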
Join ordering has combinatorial explosion. For n joined relations, possible trees grow rapidly. Exhaustive dynamic programming is feasible for small n; heuristics are used for larger queries. Selinger-style dynamic programming remains foundational: compute best plan for subsets, then combine. Even a simplified implementation teaches crucial intuition: local best choices can produce global worst plans if cardinality estimates are wrong.
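A stripped-down Selinger-style dynamic program fits in a screenful. The cardinalities and the uniform join selectivity below are made-up illustration values; a real implementation would also distinguish left-deep plans, interesting orders, and per-edge selectivities.

```python
from itertools import combinations

# Toy Selinger-style DP: best plan per subset of relations, then combine.
card = {"c": 1000, "o": 10000, "l": 50000}   # invented base-table row counts
SEL = 0.001                                  # invented uniform join selectivity

def join_card(rels):
    rows = 1
    for r in rels:
        rows *= card[r]
    return rows * SEL ** (len(rels) - 1)     # one factor per join edge

best = {frozenset([r]): (0.0, r) for r in card}   # subset -> (cost, plan)

rels = list(card)
for size in range(2, len(rels) + 1):
    for combo in combinations(rels, size):
        s = frozenset(combo)
        candidates = []
        for k in range(1, size):
            for left in combinations(combo, k):
                left = frozenset(left)
                right = s - left
                cost = best[left][0] + best[right][0] + join_card(s)
                candidates.append((cost, (best[left][1], best[right][1])))
        best[s] = min(candidates, key=lambda t: t[0])

cost, plan = best[frozenset(rels)]
print(cost, plan)
```

Note that the DP's answer is only as good as `join_card`: feed it a wrong cardinality and the "optimal" subtree it memoizes poisons every larger subset built from it, which is the local-best/global-worst trap described above.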
Cost models are always approximations. A common first model: cost = page_reads * io_weight + tuples_processed * cpu_weight. This is simplistic but useful. Improve it gradually: random vs sequential I/O distinction, cache hit probability, sort spill penalties, and parallelism factors. The value is not perfect prediction; it is consistent ranking of alternatives.
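The first model above is directly executable. The weights, page counts, and tuple counts here are invented for illustration; the point is consistent ranking, not the absolute numbers.

```python
# First-cut cost model from the text:
#   cost = page_reads * io_weight + tuples_processed * cpu_weight
IO_WEIGHT, CPU_WEIGHT = 4.0, 0.01            # illustrative unit weights

def cost(page_reads, tuples_processed):
    return page_reads * IO_WEIGHT + tuples_processed * CPU_WEIGHT

# Plan A: sequential scan -- 10,000 pages, 1,000,000 tuples examined.
plan_a = cost(page_reads=10_000, tuples_processed=1_000_000)

# Plan B: selective index scan -- 120 pages, 5,000 tuples. A later
# refinement would weight these random reads higher than sequential ones.
plan_b = cost(page_reads=120, tuples_processed=5_000)

print(f"A={plan_a} B={plan_b} choose={'B' if plan_b < plan_a else 'A'}")
```

Improvements (random-vs-sequential I/O, cache hit probability, spill penalties) slot in as extra terms without changing this comparison structure.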
Rule-based and cost-based optimization should coexist. Rule-based rewrites are deterministic and cheap; cost-based choices are selective and data-dependent. For a mini DBMS, start with a small rule set (predicate pushdown, projection pruning, join commutativity) then add cost-based join order/access path selection. Ensure every rewrite is testable for equivalence.
Explainability is a feature, not an afterthought. Implement plan explanation output with estimated rows, operator cost, and chosen algorithm. Debugging optimizer errors without this is nearly impossible. Compare estimated_rows vs actual_rows at runtime to identify statistics drift or model flaws.
Finally, optimization must remain correct first, fast second. Never perform rewrite that changes query semantics around null handling, outer joins, or duplicate semantics unless fully validated. Planner bugs silently return wrong answers, which is worse than slow answers.
How this fits into the projects
- Primary concept for Projects 6, 7, 8, and 12.
- Impacts final integration project significantly.
Definitions & key terms
- Logical plan: algebraic representation independent of execution algorithm.
- Physical plan: executable operator tree with chosen algorithms.
- Cardinality estimation: predicted row counts at plan nodes.
- Blocking operator: needs full input before output.
- Selectivity: fraction of tuples passing a predicate.
Mental model diagram
SQL
|
v
Parser -> AST -> Logical Plan --(rules)--> Rewritten Logical Plan
|
v
Physical Enumerator + Cost Model
|
v
Chosen Plan (operators)
|
v
Runtime + Actual Metrics
|
v
Feedback for stats/model tuning
How it works (step-by-step, with invariants and failure modes)
- Parse SQL to AST and validate schema references.
- Build logical algebra tree.
- Apply safe logical rewrite rules.
- Enumerate physical alternatives and estimate costs.
- Execute chosen plan and collect runtime stats.
Invariants:
- Rewrites must preserve semantics.
- Physical plan nodes must satisfy required input/output schema.
- Cost model units must be internally consistent.
Failure modes:
- Bad cardinality estimate -> wrong join order.
- Invalid rewrite around outer joins -> wrong results.
- Missing spill logic -> out-of-memory or severe latency spikes.
Minimal concrete example
Pseudo-query:
SELECT c.name
FROM customers c JOIN orders o ON c.id=o.customer_id
WHERE c.region='NA' AND o.total > 100;
Rewrite idea:
Push c.region filter to customer scan before join.
Plan alternatives:
A) HashJoin(seqscan customers, seqscan orders)
B) IndexNestedLoop(indexscan customers(region), indexscan orders(customer_id))
Choose based on estimated cardinalities and index selectivity.
Common misconceptions
- “Optimizer is just a parser extension.” -> It is a separate search and estimation subsystem.
- “If query is slow, add index.” -> Wrong plan shape or stale stats can be root cause.
- “Rules are enough.” -> Data-dependent costs require cost-based decisions.
Check-your-understanding questions
- Why can cardinality error early in a plan cause huge downstream damage?
- What is a blocking operator and why does it matter for memory?
- Why do you need both estimated and actual row counts in explain output?
Check-your-understanding answers
- Join order and algorithm choices amplify upstream estimate error exponentially.
- It materializes input before output, increasing memory/spill risk.
- To detect estimator drift and improve stats/model iteratively.
Real-world applications
- SQL query tuning in production OLTP.
- Data warehouse plan optimization.
- API backend latency reduction through plan diagnostics.
Where you will apply it
- Project 6
- Project 7
- Project 8
- Project 12
References
- System R access path selection paper
- “Readings in Database Systems” optimizer chapters.
- “Database System Concepts” query processing and optimization chapters.
Key insights Query speed is mostly plan quality, and plan quality is mostly estimate quality.
Summary A robust optimizer couples equivalence-preserving rewrites with cost estimation grounded in real statistics.
Homework/Exercises to practice the concept
- Build two logically equivalent plans and estimate their costs under two different data distributions.
- Create a dataset with correlated columns and observe estimator error.
- Add a simple explain output and compare estimate vs actual rows.
Solutions to the homework/exercises
- Optimal plan flips when selectivity flips; this validates cost sensitivity.
- Independence assumptions overestimate/underestimate selectivity dramatically.
- Divergence pinpoints where stats or model need refinement.
Chapter 5: Transactions, Isolation, 2PL, and MVCC
Fundamentals Transactions package related operations into a correctness unit with ACID guarantees. Isolation is the most subtle ACID property because it defines behavior under concurrency. Lock-based methods like two-phase locking (2PL) enforce serializability by controlling conflicting access, while MVCC increases read/write concurrency by storing multiple versions and using snapshot visibility rules. Both approaches require precise state machines, conflict handling, and cleanup of obsolete state. If concurrency control is wrong, the system may appear to work but return inconsistent results, lose updates, or deadlock under load.
Deep Dive Concurrency control starts with conflict theory: two operations conflict if they access the same logical item and at least one is a write. Serializability demands that a concurrent execution be equivalent to some serial order. Strict 2PL enforces this by requiring transactions to acquire locks during a growing phase and release them only at commit/abort. Shared locks allow concurrent reads; exclusive locks serialize writes. Intention locks and lock hierarchies scale this model from table-level down to page/tuple-level granularity.
A lock manager maintains lock tables keyed by resource ID. Requests can be granted, queued, upgraded, or denied. Correctness requires compatibility matrix enforcement and deterministic wake-up ordering policies to avoid starvation. Deadlocks are inevitable in sufficiently concurrent systems; you need either prevention (ordering/timeouts/wait-die/wound-wait) or detection (wait-for graph cycle detection) plus victim selection. In a learning DBMS, periodic deadlock detection is the clearest model.
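Both ingredients of that paragraph are small enough to sketch: a standard IS/IX/S/X compatibility matrix (SIX omitted for brevity) and wait-for-graph cycle detection. Names are illustrative; a real lock manager also needs queues, upgrades, and wake-up ordering.

```python
# Lock compatibility matrix: COMPAT[(held, requested)] -> grantable?
COMPAT = {
    ("IS", "IS"): True,  ("IS", "IX"): True,  ("IS", "S"): True,  ("IS", "X"): False,
    ("IX", "IS"): True,  ("IX", "IX"): True,  ("IX", "S"): False, ("IX", "X"): False,
    ("S",  "IS"): True,  ("S",  "IX"): False, ("S",  "S"): True,  ("S",  "X"): False,
    ("X",  "IS"): False, ("X",  "IX"): False, ("X",  "S"): False, ("X",  "X"): False,
}

def compatible(held, requested):
    return COMPAT[(held, requested)]

def has_cycle(wait_for):
    """Detect a cycle in a txn -> [txn] wait-for graph via DFS."""
    visiting, done = set(), set()
    def dfs(t):
        if t in visiting:            # back edge: cycle found
            return True
        if t in done:
            return False
        visiting.add(t)
        if any(dfs(n) for n in wait_for.get(t, ())):
            return True
        visiting.discard(t)
        done.add(t)
        return False
    return any(dfs(t) for t in wait_for)

# T1 waits on T2 and T2 waits on T1: the classic deadlock.
print(has_cycle({"T1": ["T2"], "T2": ["T1"]}))   # True
print(has_cycle({"T1": ["T2"], "T2": []}))       # False
```

Running detection periodically over this graph, then aborting one cycle member as the victim, is the clearest model for a learning DBMS.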
2PL offers strong correctness but can reduce concurrency due to blocking and lock contention. Long transactions become especially harmful, holding locks and expanding conflict windows. This motivates MVCC. In MVCC, writes create new row versions with metadata (xmin, xmax, commit timestamps, etc.). Readers choose visible versions according to snapshot rules, so readers usually do not block writers. Writers still conflict with writers and require conflict resolution.
Snapshot isolation semantics are powerful but not equivalent to serializable isolation. Write skew anomalies can still occur unless additional checks or serializable snapshot isolation mechanisms are used. A good educational sprint should include explicit anomaly demonstrations: lost update, non-repeatable read, phantom, and write skew. Learners often assume “no blocking” means “fully correct”; this is false without formal isolation guarantees.
Version garbage collection is part of concurrency control, not just storage maintenance. Old versions remain needed while active snapshots may still see them. Once no active transaction can observe a version, GC can reclaim it. In MVCC systems this is often tied to global oldest active transaction timestamp or horizon. If GC is too conservative, storage bloat grows. If too aggressive, active snapshots break correctness.
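The safe-horizon rule can be stated as code. This sketch assumes a simplified version record carrying only the commit timestamp of the transaction that superseded it; real systems track more state.

```python
# Safe-horizon GC sketch: a version is reclaimable once it was superseded
# before the oldest active snapshot began, so no snapshot can still see it.
def safe_horizon(active_txn_start_ts):
    # Oldest active transaction start timestamp; +inf if system is idle.
    return min(active_txn_start_ts) if active_txn_start_ts else float("inf")

def reclaimable(version, horizon):
    # 'xmax_commit_ts' is None while this is still the live version.
    return (version["xmax_commit_ts"] is not None
            and version["xmax_commit_ts"] < horizon)

horizon = safe_horizon([100, 120, 150])      # oldest active snapshot: 100
v_old = {"xmax_commit_ts": 90}               # superseded before ts=100
v_live = {"xmax_commit_ts": None}
print(reclaimable(v_old, horizon), reclaimable(v_live, horizon))
```

The two failure directions in the text map directly onto this predicate: computing the horizon too low bloats storage; too high reclaims versions an active snapshot still needs.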
Integrating concurrency with indexes introduces additional challenges. Index entries may reference multiple versions, pending deletes, or visibility checks in heap. Some systems use index-level versioning, others keep index stable and perform visibility checks during heap fetch. Your mini-DBMS can use the latter for simplicity.
Transaction lifecycle must be explicit: BEGIN -> ACTIVE -> COMMITTING/ABORTING -> COMMITTED/ABORTED. Recovery manager relies on this state machine for undo/redo logic. Likewise, buffer pool latches and transaction locks must be clearly separated: latches are short critical-section protection; locks are logical isolation mechanisms across operation durations.
Observability is essential. Track lock wait times, deadlock count, abort reasons, average transaction duration, and MVCC version chain length. Without these, tuning isolation behavior is guesswork. Production teams use these signals to decide index changes, query rewrites, or transaction boundary refactoring.
Finally, concurrency control is a business requirement, not an academic exercise. Payment systems, inventory systems, and booking systems fail in expensive ways when isolation is wrong. Mastery here differentiates reliable systems engineers from basic CRUD developers.
How this fits into the projects
- Primary concept for Projects 9 and 11.
- Required context for Projects 10 and final integration.
Definitions & key terms
- Serializability: equivalent outcome to some serial execution.
- Strict 2PL: locks released at commit/abort, preventing dirty reads/writes.
- Deadlock: cycle of waiting transactions.
- Snapshot isolation: reads from consistent snapshot; not always serializable.
- Version GC: reclamation of obsolete row versions.
Mental model diagram
2PL path:
Txn -> lock request -> granted/queued -> execute -> commit -> unlock
MVCC path:
Txn start -> snapshot timestamp
read -> choose visible version
write -> append new version + conflict check
commit -> publish version visibility
GC -> remove versions older than safe horizon
How it works (step-by-step, with invariants and failure modes)
- Start transaction and assign ID/timestamp.
- For 2PL: acquire lock per operation according to compatibility matrix.
- For MVCC read: choose newest visible version under snapshot rules.
- On commit: validate conflicts, persist commit record, release resources.
- Run deadlock detection/GC periodically.
Invariants:
- Lock table state is consistent with transaction state.
- Visible-version rule is deterministic for given snapshot.
- Committed effects become durable in WAL order.
Failure modes:
- Lock upgrade race -> starvation/deadlock storms.
- Incorrect visibility check -> inconsistent reads.
- GC horizon bug -> use-after-free of needed versions.
Minimal concrete example
Pseudo-transaction timeline:
T1 starts at ts=100, reads row R version v1(ts=90)
T2 starts at ts=110, updates R -> creates v2(ts=110), commits
T1 rereads under snapshot isolation -> still sees v1
T3 starts at ts=120 -> sees v2
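The timeline above can be executed under a minimal snapshot-visibility rule: a reader with snapshot timestamp ts sees the newest version whose commit timestamp is <= ts. This deliberately simplifies commit-visibility details (in-progress transactions, own writes) to isolate the snapshot idea.

```python
# Version chain for row R, newest-first: (commit_ts, value).
versions = [
    (110, "v2"),   # created by T2, committed at ts=110
    (90,  "v1"),   # original version, committed at ts=90
]

def read(snapshot_ts):
    # Newest committed version visible to this snapshot.
    for commit_ts, value in versions:
        if commit_ts <= snapshot_ts:
            return value
    return None

assert read(100) == "v1"   # T1 (snapshot ts=100) sees v1, before and after T2
assert read(120) == "v2"   # T3 (snapshot ts=120) sees T2's committed v2
print(read(100), read(120))
```

Because T1's snapshot timestamp never changes, its reread returns `v1` even after T2 commits, which is exactly the repeatable-snapshot behavior in the timeline.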
Common misconceptions
- “MVCC means no conflicts.” -> Writer-writer conflicts remain.
- “Read committed and snapshot isolation are basically same.” -> Visibility semantics differ.
- “Deadlocks are bugs in lock manager only.” -> They are a normal outcome of lock acquisition patterns.
Check-your-understanding questions
- Why can snapshot isolation allow write skew?
- Why do you still need locks/latches in MVCC systems?
- What metrics reveal contention problems fastest?
Check-your-understanding answers
- Concurrent transactions may read stale-but-valid snapshots and update disjoint rows that violate joint invariants.
- Structural and writer conflict protection is still required.
- Lock wait time, deadlock rate, abort reasons, and long transaction count.
Real-world applications
- Banking and ledger consistency.
- Reservation/inventory systems.
- Multi-tenant SaaS correctness under high concurrency.
Where you will apply it
- Project 9
- Project 11
- Project 10 (required context)
References
- PostgreSQL MVCC Introduction
- “Transaction Processing: Concepts and Techniques” (Gray & Reuter).
- “Database System Concepts” transaction chapters.
Key insights Concurrency control is a set of invariants over time, not a single lock/unlock API.
Summary You can now reason about isolation semantics and design lock/MVCC behavior with explicit tradeoffs.
Homework/Exercises to practice the concept
- Construct a write-skew scenario and classify which isolation levels prevent it.
- Design a lock compatibility matrix including intention locks.
- Propose an MVCC GC safe-horizon rule.
Solutions to the homework/exercises
- Serializable prevents write skew; snapshot isolation may not.
- IS/IX/S/X compatibility should preserve hierarchical correctness.
- Safe horizon is minimum start timestamp among active transactions.
Chapter 6: WAL, Checkpoints, and Crash Recovery (ARIES-style)
Fundamentals Durability and atomicity under failure are delivered by write-ahead logging (WAL) and recovery algorithms. WAL requires log records to reach stable storage before corresponding data pages are flushed. Recovery then uses log history to restore consistent state after crashes: redo committed effects that were not persisted, and undo incomplete effects. Checkpoints reduce restart work by recording bounded recovery starting points. ARIES-style recovery formalizes this with Analysis, Redo, and Undo phases and pageLSN checks. Recovery correctness is non-negotiable: silent durability bugs can permanently lose or corrupt business data.
Deep Dive Think in failure models first. The common crash model is fail-stop process/system crash with storage persistence preserved up to last completed fsync. Under this model, memory state is lost; disk may contain partially updated data pages unless atomic write guarantees are assumed (usually unsafe to assume). WAL transforms uncertain data-page persistence into deterministic replay by logging intent and before/after information.
A log record typically includes LSN, transaction ID, page ID (if physical), operation type, and redo/undo payload. Page headers store pageLSN, indicating latest applied update. Before flushing a dirty page, ensure durableLSN >= pageLSN. This single inequality is the heart of WAL correctness. If violated, a page may contain changes whose log records were not durable at crash time, making recovery impossible to replay correctly.
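The "single inequality" can be made concrete with a toy log. Class and field names here are illustrative; the point is that the flush gate compares the page's LSN against what the log has actually made durable.

```python
# WAL-rule sketch: a dirty page may be flushed only if
# durable_lsn >= page_lsn (its log records are already on stable storage).
class Wal:
    def __init__(self):
        self.records, self.durable_lsn, self.next_lsn = [], 0, 1
    def append(self, payload):
        lsn = self.next_lsn
        self.next_lsn += 1
        self.records.append((lsn, payload))   # in-memory tail, not yet durable
        return lsn
    def flush(self):                          # models fsync of the log tail
        self.durable_lsn = self.next_lsn - 1

def can_flush_page(wal, page_lsn):
    return wal.durable_lsn >= page_lsn

wal = Wal()
page_lsn = wal.append("update page 7")        # page now carries this LSN
print(can_flush_page(wal, page_lsn))          # False: record not yet durable
wal.flush()
print(can_flush_page(wal, page_lsn))          # True: safe to flush the page
```

A buffer pool would call `can_flush_page` in its eviction/flush path and either force the log first or pick a different victim.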
Checkpointing trades normal runtime overhead for faster restart. A fuzzy checkpoint records active transactions and dirty page table without fully stopping writes. At restart, Analysis begins from checkpoint LSN, reconstructing transaction table and dirty page table. Redo replays idempotently from smallest recLSN needed; pageLSN checks skip already-applied records. Undo then rolls back loser transactions using compensation log records (CLRs) to make undo itself crash-safe.
Idempotence is crucial. Recovery may re-run after crash during recovery itself. Therefore redo and undo actions must be safe when repeated. ARIES solves this using LSN ordering and CLRs with undoNextLSN. Even simplified recovery implementations should preserve idempotent design: each applied action should be detectable as already applied.
Group commit is a performance optimization with correctness implications. Instead of fsync per commit, batch multiple commit log records and flush once. This improves throughput dramatically but increases tail latency variability. The invariant remains: a commit is acknowledged only after commit record is durable.
Interaction with buffer pool is continuous. Dirty pages can flush anytime subject to WAL rule. Steal/no-force policies are common: buffer manager may flush uncommitted updates (steal), and commit does not force all data pages (no-force). This combination requires both redo and undo capability and is why ARIES-like designs are prevalent.
Testing recovery requires adversarial crash injection, not just unit tests. Insert failpoints between critical steps: after log append before flush, after page write before metadata update, during undo, during checkpoint write. Then restart and verify invariants: committed data present, uncommitted effects absent, metadata consistent, and no double-application.
Observability for recovery includes log generation rate, flush latency percentiles, checkpoint duration, restart time, and redo/undo record counts. These metrics guide durability-performance tradeoffs and capacity planning.
Finally, recovery is not an isolated subsystem. Transaction manager, buffer pool, and storage format must all participate with clear contracts. If transaction status tracking is weak or pageLSN missing, recovery correctness cannot be proven.
How this fits into the projects
- Primary concept for Project 10.
- Essential for final integration project and production-path thinking.
Definitions & key terms
- WAL: write-ahead log protocol ensuring log durability before data flush.
- LSN: monotonic log sequence number.
- Checkpoint: durable marker reducing restart work.
- CLR: compensation log record for crash-safe undo.
- Steal/No-force: buffer policy pair requiring undo+redo recovery.
Mental model diagram
Normal runtime:
update -> append log record (LSN=500) -> pageLSN=500 in memory
commit -> flush log >= commitLSN -> ACK
background flush page when durableLSN >= pageLSN
Crash + restart:
Analysis -> determine winners/losers and dirty pages
Redo -> reapply needed updates in LSN order
Undo -> rollback losers using CLRs
How it works (step-by-step, with invariants and failure modes)
- Generate log record per update and assign LSN.
- Apply update to in-memory page, set pageLSN.
- On commit, force log flush through commit record.
- Background checkpoint records active txns + dirty pages.
- On restart, run Analysis -> Redo -> Undo.
Invariants:
- Commit acknowledged only after durable commit record.
- Data page flush allowed only if durableLSN >= pageLSN.
- Recovery replay order is non-decreasing LSN.
Failure modes:
- Missing fsync on commit -> acknowledged but lost transactions.
- Broken pageLSN checks -> double redo or missed redo.
- Incomplete undo tracking -> resurrected uncommitted data.
Minimal concrete example
Pseudo crash drill:
T7 UPDATE account balance -20 (LSN 900)
T7 COMMIT (LSN 901) -> fsync up to 901 -> ACK
crash before dirty page flush
restart:
Analysis sees T7 committed
Redo reapplies LSN 900 to page (if pageLSN<900)
Final state includes committed debit
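The drill above runs as code. This sketch keeps only the pieces needed to show the pageLSN check: T7's updates are durable in the log (LSN 900, commit 901) but the data page never flushed, and redo replays a record only when its LSN exceeds the page's LSN, which also makes replay idempotent. Record shapes are invented for illustration.

```python
# Durable log at crash time: update at LSN 900, commit at LSN 901.
durable_log = [
    {"lsn": 900, "txn": "T7", "redo": -20},
    {"lsn": 901, "txn": "T7", "commit": True},
]

# On-disk page is stale: the crash happened before the dirty-page flush.
disk_page = {"page_lsn": 0, "balance": 100}

def recover(page, log):
    committed = {r["txn"] for r in log if r.get("commit")}
    for r in log:                                  # replay in LSN order
        if ("redo" in r and r["txn"] in committed
                and r["lsn"] > page["page_lsn"]):  # pageLSN check
            page["balance"] += r["redo"]
            page["page_lsn"] = r["lsn"]

recover(disk_page, durable_log)
print(disk_page)            # committed debit applied: balance drops to 80
recover(disk_page, durable_log)
print(disk_page)            # rerun is a no-op: pageLSN check blocks re-apply
```

The second call models a crash during recovery itself: because LSN 900 is no longer greater than the pageLSN, nothing is double-applied.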
Common misconceptions
- “If using SSD, fsync ordering is less important.” -> Durability contract still depends on flush semantics.
- “Checkpoint means no recovery replay needed.” -> Fuzzy checkpoints still require redo/undo.
- “Recovery is only for rare crashes.” -> It is active in every clean restart path and backup restore path.
Check-your-understanding questions
- Why does steal/no-force require both redo and undo?
- What makes redo idempotent?
- Why are CLRs needed for robust undo?
Check-your-understanding answers
- Uncommitted updates may reach disk (need undo) and committed updates may remain only in log (need redo).
- LSN/pageLSN checks and deterministic operation ordering.
- They record undo progress so crash during undo can resume safely.
Real-world applications
- PostgreSQL WAL and crash recovery.
- InnoDB redo/undo log behavior.
- Embedded database reliability in edge devices.
Where you will apply it
- Project 10
- Final Overall Project
References
- ARIES paper summary (IBM)
- PostgreSQL WAL docs
- “Database System Concepts” recovery chapters.
Key insights Recovery correctness is mostly about strict write ordering and replay idempotence.
Summary You now have the crash-consistency mental model needed to build durable systems.
Homework/Exercises to practice the concept
- Draw log and pageLSN timeline for two concurrent transactions with one abort.
- Design a failpoint matrix for recovery testing.
- Explain why commit ACK before fsync is unsafe.
Solutions to the homework/exercises
- Winner/loser separation and pageLSN checks determine replay/rollback outcomes.
- Inject crashes after append, after flush, after page write, during checkpoint, during undo.
- The client observes success while durable state does not include commit record.
Glossary
- ACID: Atomicity, Consistency, Isolation, Durability transaction guarantees.
- Cardinality: number of rows at an operator boundary or table.
- CLRs: compensation log records used during undo.
- Fanout: branching factor of tree index nodes.
- Fuzzy checkpoint: non-blocking checkpoint capturing approximate runtime state.
- LSN: monotonic identifier for log record ordering.
- MVCC: concurrency model using multiple row versions and snapshot visibility.
- PageLSN: latest LSN reflected in a data page image.
- RID: record identifier, often (PageId, SlotId).
- Selectivity: fraction of rows satisfying a predicate.
- Steal/No-force: buffer/write policy that permits flushing uncommitted pages and defers forcing committed pages.
Why Database Systems Matter
- Modern software dependency: Almost every product stack depends on at least one transactional or analytical database.
- Developer adoption signal (2024-2025): Stack Overflow's 2024 survey press release highlights PostgreSQL at 49% as the most popular database; the 2025 press release reports PostgreSQL still leading among developers who want to keep using it (66% on the survey's admired/retention metric).
- Ecosystem breadth (February 2026 snapshot): DB-Engines lists 429 systems in its ranking, reflecting a broad and still-growing DBMS landscape.
- Deployment scale signal: SQLite’s official documentation states there are likely over one trillion SQLite databases in active use, emphasizing how deeply embedded database engines are in consumer and enterprise software.
Context and evolution (short)
- Early systems optimized disk I/O and transaction safety.
- Modern systems still keep those core invariants but add better concurrency, richer optimizers, and cloud-native operational models.
| Traditional app model | Modern data platform model |
|---|---|
| Single DB, ad-hoc tuning | Multiple engines by workload |
| Manual failover | Automated failover and PITR |
| Limited observability | Rich plan/lock/log telemetry |
| Schema changes as events | Continuous schema evolution discipline |
Sources:
- Stack Overflow Developer Survey 2024
- Stack Overflow 2024 Press Release
- Stack Overflow 2025 Press Release
- DB-Engines Ranking
- SQLite Most Deployed Database Engine
Concept Summary Table
| Concept Cluster | What You Need to Internalize |
|---|---|
| Storage Engines and Page Layouts | Stable record IDs require indirection and strict page-format invariants; free-space policy drives long-term performance. |
| Buffer Pool and I/O Control | Page lifecycle, replacement policy, and WAL-aware flushing determine both throughput and correctness. |
| Index Structures and Access Paths | B+Tree/Hash tradeoffs only pay off when structural invariants and planner assumptions are aligned. |
| Query Execution and Optimization | Query speed is mostly plan quality; plan quality is mostly estimate quality. |
| Transactions and Isolation (2PL/MVCC) | Concurrency control is about formal correctness over time, not just lock APIs. |
| WAL and Recovery | Durability requires strict ordering, idempotent replay, and disciplined crash testing. |
Project-to-Concept Map
| Project | Concepts Applied |
|---|---|
| Project 1 | Storage Engines and Page Layouts |
| Project 2 | Storage Engines and Page Layouts |
| Project 3 | Buffer Pool and I/O Control |
| Project 4 | Index Structures and Access Paths, Buffer Pool and I/O Control |
| Project 5 | Index Structures and Access Paths |
| Project 6 | Query Execution and Optimization, Index Structures |
| Project 7 | Query Execution and Optimization |
| Project 8 | Query Execution and Optimization |
| Project 9 | Transactions and Isolation |
| Project 10 | WAL and Recovery, Transactions and Isolation |
| Project 11 | Transactions and Isolation, Storage Versioning |
| Project 12 | Query Execution and Optimization, Index Structures |
| Final Overall Project | All concept clusters |
Deep Dive Reading by Concept
| Concept | Book and Chapter | Why This Matters |
|---|---|---|
| Storage Engines and Page Layouts | “Database Internals” by Alex Petrov - Ch. 3-4 | Explains file formats, page layouts, and storage tradeoffs in production systems. |
| Buffer Pool and I/O Control | “Database Internals” by Alex Petrov - Ch. 5 | Connects caching policy to performance and durability behavior. |
| Index Structures and Access Paths | “Database System Concepts” by Silberschatz et al. - Ch. 14 | Provides the formal and practical basis for B+Tree and hashing. |
| Query Execution and Optimization | “Readings in Database Systems” (5th ed.) - optimizer papers | Builds intuition for plan search, costing, and estimation failures. |
| Transactions and Isolation | “Transaction Processing” by Gray & Reuter - selected chapters | Canonical treatment of locking, serializability, and transaction design. |
| WAL and Recovery | ARIES paper + “Database System Concepts” Ch. 19 | Essential for implementing real crash-safe behavior. |
Quick Start
Your first 48 hours:
Day 1:
- Read Theory Primer Chapters 1 and 2.
- Sketch your page and metadata format on paper.
- Start Project 1 and complete deterministic page read/write tests.
Day 2:
- Run the Project 1 Definition of Done checklist.
- Read Chapter 3 and implement initial tuple slot metadata for Project 2.
- Document one failure mode you found and fixed.
Recommended Learning Paths
Path 1: The Systems Programmer
- Project 1 -> Project 2 -> Project 3 -> Project 4 -> Project 10 -> Final Overall Project
Path 2: The Query Performance Engineer
- Project 3 -> Project 4 -> Project 6 -> Project 7 -> Project 8 -> Project 12
Path 3: The Reliability Engineer
- Project 3 -> Project 9 -> Project 10 -> Project 11 -> Final Overall Project
Success Metrics
- You can explain each major DBMS layer with invariants and failure modes, without notes.
- You can reproduce deterministic demo outputs for all projects.
- You can diagnose at least three intentionally injected bugs (one storage, one concurrency, one recovery).
- Your final mini-DBMS survives crash-recovery drills with committed-data durability and aborted-data rollback.
Project Overview Table
| # | Project | Difficulty | Time | Primary Outcome |
|---|---|---|---|---|
| 1 | Disk-Oriented Storage Manager | Intermediate | 1 week | Deterministic page persistence API |
| 2 | Tuple and Page Manager | Intermediate | 1-2 weeks | Slotted-page tuple lifecycle |
| 3 | Buffer Pool Manager | Advanced | 2 weeks | WAL-aware page cache and eviction |
| 4 | B+Tree Index | Expert | 3-4 weeks | Range-capable disk index |
| 5 | Hash Index (Extendable Hashing) | Advanced | 2 weeks | Dynamic equality index |
| 6 | Relational Algebra Executor | Advanced | 2 weeks | Iterator-based operator engine |
| 7 | Join Operators | Expert | 3 weeks | Multiple join algorithm implementations |
| 8 | Query Parser and Rule Optimizer | Expert | 3 weeks | SQL-to-plan transformation pipeline |
| 9 | Transaction Manager (2PL) | Expert | 3 weeks | Serializable lock-based transaction layer |
| 10 | Recovery Manager (UNDO/REDO) | Master | 4+ weeks | Crash-safe WAL recovery workflow |
| 11 | Multi-Version Concurrency Control | Master | 3-4 weeks | Snapshot-based concurrency model |
| 12 | Statistics and Cost Optimizer | Master | 3-4 weeks | Cost-based plan selection |
Project List
The following projects guide you from raw bytes on disk to a crash-safe, query-optimized mini DBMS.
Project 1: Disk-Oriented Storage Manager
- File: `P01-disk-oriented-storage-manager.md`
- Main Programming Language: C++
- Alternative Programming Languages: Rust, Go
- Coolness Level: Level 4 - Hardcore Tech Flex
- Business Potential: Resume Gold
- Difficulty: Intermediate
- Knowledge Area: Filesystems, binary layouts, deterministic I/O
- Software or Tool: POSIX file APIs or equivalent
- Main Book: Database System Concepts
What you will build: A page-based persistence module that allocates, reads, writes, and recycles fixed-size pages.
Why it teaches database systems: It forces explicit physical data contracts and deterministic on-disk behavior.
Core challenges you will face:
- Mapping logical page IDs to file offsets.
- Handling partial writes and metadata consistency.
- Maintaining free-page tracking without corruption.
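The first two challenges can be sketched end to end: a page-id-to-offset mapping, a durable write, and a checksum verified across a "restart" (reopening the file). This assumes page 0 is reserved for a header, so data pages start at id 1; all names are illustrative.

```python
import os, tempfile, zlib

PAGE_SIZE = 8192

def page_offset(page_id):
    # The core identity contract: page_id maps to a fixed file offset.
    return page_id * PAGE_SIZE

def write_page(f, page_id, payload):
    assert len(payload) == PAGE_SIZE
    f.seek(page_offset(page_id))
    f.write(payload)
    f.flush()
    os.fsync(f.fileno())                 # durability point
    return zlib.crc32(payload)

def read_page(f, page_id):
    f.seek(page_offset(page_id))
    data = f.read(PAGE_SIZE)
    return data, zlib.crc32(data)

path = os.path.join(tempfile.mkdtemp(), "lab.db")
payload = bytes([7]) * PAGE_SIZE
with open(path, "wb") as f:
    crc_w = write_page(f, 1, payload)
with open(path, "rb") as f:              # "restart": reopen the file
    data, crc_r = read_page(f, 1)
print(crc_w == crc_r and data == payload)
```

In the real project the checksum would live inside the page header and a mismatch on read would fail fast rather than return undefined bytes.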
Real World Outcome
You run a storage CLI harness that creates a fresh data file, allocates pages, writes deterministic payloads, restarts, and verifies bytes exactly.
$ ./db_lab storage-demo --db ./lab.db --seed 42
[init] page_size=8192 format_version=1
[alloc] page_id=1
[alloc] page_id=2
[write] page_id=1 checksum=0x91a52f01
[write] page_id=2 checksum=0x77c4da10
[restart] reopening database file
[read] page_id=1 checksum=0x91a52f01 match=true
[read] page_id=2 checksum=0x77c4da10 match=true
[result] PASS deterministic-persistence
The Core Question You Are Answering
“How does a DBMS turn a file into a durable page address space that survives restarts without ambiguity?”
This question matters because every higher-level subsystem assumes reliable page identity and bytes.
Concepts You Must Understand First
- Fixed-size page abstraction
- Why is page size fixed?
- How does alignment affect I/O and checksums?
- Book Reference: Database System Concepts - Storage chapters.
- On-disk metadata invariants
- What must be stored in a header page?
- How do you version format changes?
- Book Reference: Database Internals - Storage chapter.
- Failure model basics
- What can crash mid-write do to your bytes?
- Why do checksums matter?
- Book Reference: Transaction Processing - reliability context.
Questions to Guide Your Design
- Page Identity
- Will `page_id * PAGE_SIZE` always be valid?
- How do you represent deallocated pages?
- Metadata Safety
- Which updates must be atomic or recoverable?
- Can header corruption be detected quickly?
Thinking Exercise
Sketch page allocation/deallocation state transitions and identify one state where a crash would create ambiguity.
The Interview Questions They Will Ask
- “Why do databases use fixed pages?”
- “What can go wrong with naive append-only page allocation?”
- “How would you detect on-disk corruption early?”
- “What metadata would you persist in page 0 and why?”
- “How would you evolve page format without data migration downtime?”
Hints in Layers
Hint 1: Start with a strict page header contract Define version, checksum, and free-space metadata before implementing allocation.
Hint 2: Make every write deterministic Use a deterministic seed and fixed payloads for tests.
Hint 3: Treat restart as first-class test Every operation should be validated across process restart.
Hint 4: Fail fast on checksum mismatch Do not continue with undefined bytes.
Books That Will Help
| Topic | Book | Chapter |
|---|---|---|
| Page layout and file organization | Database System Concepts | Ch. 13 |
| Real engine storage tradeoffs | Database Internals | Ch. 3-4 |
| Reliability mindset | Transaction Processing | Early chapters |
Common Pitfalls and Debugging
Problem 1: “Page reads return shifted bytes”
- Why: wrong offset math or stale page size assumption.
- Fix: centralize offset calculation and assert alignment.
- Quick test: read/write known pattern on page IDs 1, 2, 100.
Problem 2: “Restart loses free-page information”
- Why: free-list metadata not persisted atomically.
- Fix: persist free-page structure with version/checksum.
- Quick test: allocate/deallocate/restart/reallocate and verify reuse.
Definition of Done
- Deterministic read/write tests pass across restart.
- Header metadata validates with checksum/version.
- Deallocated pages can be reused safely.
- Corruption path is detected and surfaced clearly.
Project 2: Tuple and Page Manager
- File: `P02-tuple-and-page-manager.md`
- Main Programming Language: C++
- Alternative Programming Languages: Rust, Go
- Coolness Level: Level 3 - Genuinely Clever
- Business Potential: Resume Gold
- Difficulty: Intermediate
- Knowledge Area: Slotted pages, record encoding, compaction
- Software or Tool: Project 1 storage manager
- Main Book: Database System Concepts
What you will build: A slotted page module for insert/read/update/delete of variable-length tuples.
Why it teaches database systems: It bridges logical rows and physical bytes with stable record IDs.
Core challenges you will face:
- Tuple encoding with null and varlen support.
- Free-space fragmentation and compaction strategy.
- Stable slot IDs under updates/deletes.
Real World Outcome
$ ./db_lab tuple-demo --db ./lab.db --seed 11
[insert] rid=(3,1) bytes=38
[insert] rid=(3,2) bytes=56
[delete] rid=(3,1) tombstone=true
[update] rid=(3,2) old_bytes=56 new_bytes=64 relocated=false
[compact] page=3 moved=1 reclaimed=48
[scan] live_tuples=1 deleted_slots=1
[result] PASS slotted-page-lifecycle
You can inspect tuple lifecycle and verify slot stability after compaction.
The Core Question You Are Answering
“How do real databases represent variable rows in fixed pages without breaking references?”
Concepts You Must Understand First
- Slotted page structure
- Why slots must be separate from payload bytes.
- Book Reference: Database System Concepts - Ch. 13.
- Tuple serialization contracts
- How to encode nulls and variable lengths.
- Book Reference: Database Internals - storage record encoding.
- Fragmentation vs compaction tradeoff
- Why compaction timing affects throughput.
- Book Reference: Database Internals.
Questions to Guide Your Design
- RID stability
- When can a slot ID change?
- How do indexes survive compaction?
- Update policy
- In-place update vs relocation threshold?
- How will forwarding pointers work if needed?
Thinking Exercise
Draw one page before and after 10 inserts, 4 deletes, 3 updates. Mark contiguous free region and fragmented holes.
The Interview Questions They Will Ask
- “What is a slotted page and why not array-of-structs?”
- “How do you handle variable-length records safely?”
- “Why does delete usually not immediately reclaim contiguous space?”
- “What is tombstoning and when is it useful?”
- “How do you keep RID stable after compaction?”
Hints in Layers
Hint 1: Implement slot metadata first. Use explicit slot flags: empty/live/deleted.
Hint 2: Keep tuple payload append-from-front, slots from-back. This simplifies free-space checks.
Hint 3: Compaction should rewrite payload only. Do not change slot identity unexpectedly.
Hint 4: Build deterministic tuple fixtures. Make serialization failures easy to diff.
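A minimal in-memory sketch of this layout discipline (in a real page the slot directory grows from the back of a fixed-size page; here vectors stand in, and all names are illustrative):

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

enum class SlotFlag : uint8_t { Empty, Live, Deleted };

struct Slot { uint16_t offset = 0, len = 0; SlotFlag flag = SlotFlag::Empty; };

// Simplified slotted page: payload bytes plus a slot directory.
// Slot indices are the stable half of a RID and never move.
struct SlottedPage {
    std::vector<uint8_t> payload;   // tuple bytes, appended front-to-back
    std::vector<Slot> slots;        // directory; index == slot id

    int insert(const std::string& tuple) {
        Slot s{static_cast<uint16_t>(payload.size()),
               static_cast<uint16_t>(tuple.size()), SlotFlag::Live};
        payload.insert(payload.end(), tuple.begin(), tuple.end());
        slots.push_back(s);
        return static_cast<int>(slots.size()) - 1;   // stable slot id
    }
    void erase(int sid) { slots[sid].flag = SlotFlag::Deleted; }  // tombstone
    std::string read(int sid) const {
        const Slot& s = slots[sid];
        return std::string(payload.begin() + s.offset,
                           payload.begin() + s.offset + s.len);
    }
    // Compaction rewrites payload only; slot ids stay untouched (Hint 3).
    void compact() {
        std::vector<uint8_t> fresh;
        for (Slot& s : slots) {
            if (s.flag != SlotFlag::Live) continue;
            uint16_t off = static_cast<uint16_t>(fresh.size());
            fresh.insert(fresh.end(), payload.begin() + s.offset,
                         payload.begin() + s.offset + s.len);
            s.offset = off;
        }
        payload.swap(fresh);
    }
};
```

After compaction, every live RID still resolves to the same bytes while the deleted tuple's space is reclaimed.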
Books That Will Help
| Topic | Book | Chapter |
|---|---|---|
| Tuple/page formats | Database System Concepts | Ch. 13 |
| Storage record design | Database Internals | Ch. 4 |
| Data layout performance | CS:APP | Memory chapters |
Common Pitfalls and Debugging
Problem 1: “Tuple reads fail after compaction”
- Why: slot offsets not updated correctly.
- Fix: compaction must produce a new offset table then swap.
- Quick test: roundtrip all RIDs pre/post compaction.
Problem 2: “Update causes hidden slot leak”
- Why: relocated tuple created new slot but old slot not tombstoned.
- Fix: enforce update state transition invariants.
- Quick test: random update/delete fuzz run with slot-count assertions.
Definition of Done
- Variable-length tuple insert/update/delete works.
- RID lookup remains valid after compaction.
- Free-space accounting is accurate after mixed operations.
- Serialization/deserialization is deterministic.
Project 3: Buffer Pool Manager
- File: `P03-buffer-pool-manager.md`
- Main Programming Language: C++
- Alternative Programming Languages: Rust, Go
- Coolness Level: Level 4 - Hardcore Tech Flex
- Business Potential: Resume Gold
- Difficulty: Advanced
- Knowledge Area: caching, replacement, concurrency primitives
- Software or Tool: Projects 1-2
- Main Book: Database System Concepts
What you will build: A frame-based buffer pool with page table, pin/unpin, dirty tracking, and replacement policy.
Why it teaches database systems: It is the performance-critical layer connecting storage with execution.
Core challenges you will face:
- Victim selection under pin constraints.
- Dirty flush ordering (WAL-aware contract).
- Concurrent access without deadlock.
Real World Outcome
$ ./db_lab buffer-demo --frames 3 --workload mixed --seed 9
[fetch] miss page=1 -> frame=0
[fetch] miss page=2 -> frame=1
[fetch] miss page=3 -> frame=2
[unpin] page=1 dirty=true pin=0
[fetch] miss page=4 -> evict page=2 clean
[fetch] hit page=1 frame=0
[metrics] hits=1 misses=4 evictions=1 no_victim=0
[result] PASS buffer-pool-core
The Core Question You Are Answering
“How does a DBMS keep hot data in memory without violating durability or concurrency safety?”
Concepts You Must Understand First
- Cache replacement behavior - LRU vs Clock tradeoffs.
- Pin count semantics - page usage lifetime.
- Dirty-page lifecycle - when flushes happen and why.
Book Reference: Database Internals Ch. 5, Database System Concepts Ch. 11.
Questions to Guide Your Design
- How will you prevent eviction of active pages?
- What lock order will prevent latching deadlocks?
- How will you surface no-victim conditions deterministically?
Thinking Exercise
Given frame count=2 and access sequence 1,2,1,3,2, simulate hits/misses for LRU and Clock.
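The LRU half of the exercise can be checked mechanically; Clock's answer depends on hand position and reference-bit policy, so simulate that part by hand. A small sketch:

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Simulate an LRU cache of `frames` slots over an access sequence and
// return the hit count. Front of `cache` = most recently used.
int lru_hits(int frames, const std::vector<int>& accesses) {
    std::vector<int> cache;   // resident pages, MRU first
    int hits = 0;
    for (int page : accesses) {
        auto it = std::find(cache.begin(), cache.end(), page);
        if (it != cache.end()) {                 // hit: promote to MRU
            cache.erase(it);
            ++hits;
        } else if ((int)cache.size() == frames) {
            cache.pop_back();                    // miss: evict LRU (back)
        }
        cache.insert(cache.begin(), page);
    }
    return hits;
}
```

For the exercise's sequence 1,2,1,3,2 with 2 frames, LRU yields 1 hit and 4 misses: page 3 evicts page 2, so the final access to 2 misses again.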
The Interview Questions They Will Ask
- “What is the difference between pin count and dirty flag?”
- “Why can LRU fail on sequential scans?”
- “How do you avoid evicting pages still in use?”
- “Where does WAL ordering interact with buffer flush?”
- “What metrics would you expose for tuning?”
Hints in Layers
Hint 1: Keep the API small and explicit. Fetch, Unpin, Flush, NewPage, DeletePage.
Hint 2: Separate the replacer from page-table logic. Policy should not own storage mappings.
Hint 3: Log every eviction decision in tests. Diagnose policy behavior early.
Hint 4: Build a deterministic stress harness. Threaded tests with fixed seeds reveal race issues.
Books That Will Help
| Topic | Book | Chapter |
|---|---|---|
| Buffer managers | Database System Concepts | Ch. 11 |
| Cache/admission ideas | Database Internals | Ch. 5 |
| Concurrency primitives | C++ Concurrency in Action | Selected chapters |
Common Pitfalls and Debugging
Problem 1: “Buffer pool deadlocks under stress”
- Why: inconsistent latch acquisition order.
- Fix: define global lock order and enforce it.
- Quick test: 1000 randomized concurrent fetch/unpin loops.
Problem 2: “No victim available even with low usage”
- Why: leaked pins due to missing unpin on error paths.
- Fix: RAII-style pin guards or scoped unpin discipline.
- Quick test: pin-count invariant check at test teardown.
Definition of Done
- Deterministic hit/miss/eviction behavior matches policy expectation.
- No pinned-page evictions occur.
- Dirty page flush path is tested.
- Concurrent stress harness runs without deadlock.
Project 4: B+Tree Index
- File: `P04-bplus-tree-index.md`
- Main Programming Language: C++
- Alternative Programming Languages: Rust
- Coolness Level: Level 5 - Pure Magic
- Business Potential: Resume Gold
- Difficulty: Expert
- Knowledge Area: disk-oriented trees, split/merge algorithms
- Software or Tool: Projects 1-3
- Main Book: Database System Concepts
What you will build: A disk-backed B+Tree supporting point lookup, insert, delete, and range scan.
Why it teaches database systems: It is the canonical DB index and teaches page-level data structure engineering.
Core challenges you will face:
- Correct split/merge under boundary cases.
- Leaf sibling maintenance for range scans.
- Structural consistency under concurrent operations.
Real World Outcome
$ ./db_lab bptree-demo --seed 101
[insert] keys=1..10000
[stats] height=3 leaf_pages=129 internal_pages=3
[lookup] key=923 -> rid=(77,4) found=true
[range] 500..510 -> 11 rows in ascending order
[delete] keys=1..4000
[validate] balanced=true ordering=true sibling_links=true
[result] PASS bplus-tree-index
The Core Question You Are Answering
“How do databases maintain ordered access at scale with bounded lookup depth?”
Concepts You Must Understand First
- Fanout and tree height economics.
- Overflow/underflow handling.
- Latch coupling for structural changes.
Book Reference: Database System Concepts Ch. 14.
Questions to Guide Your Design
- Split policy: median vs biased split?
- Root split and root shrink handling?
- Validation invariants after each mutation?
Thinking Exercise
Manually simulate insert sequence causing two cascading splits and draw resulting tree pages.
The Interview Questions They Will Ask
- “Why B+Tree instead of binary search tree on disk?”
- “What happens when a leaf overflows?”
- “How do range scans work efficiently?”
- “How do you verify tree invariants automatically?”
- “What concurrency issues appear during page split?”
Hints in Layers
Hint 1: Start with single-thread correctness. Do not mix in concurrency until invariants pass.
Hint 2: Build node-level validators. Sorted-key and pointer-count checks save hours.
Hint 3: Add structural-operation tracing. Split/merge logs make debugging reproducible.
Hint 4: Keep leaf link updates explicit. Range scans fail silently if sibling pointers are wrong.
Books That Will Help
| Topic | Book | Chapter |
|---|---|---|
| B+Tree fundamentals | Database System Concepts | Ch. 14 |
| Engine implementation patterns | Database Internals | Index chapters |
| Practical algorithm rigor | CLRS | Trees sections |
Common Pitfalls and Debugging
Problem 1: “Lookup misses existing key after split”
- Why: wrong separator promoted to parent.
- Fix: define and test exact split promotion rule.
- Quick test: insert deterministic sequence and query all keys.
Problem 2: “Range scan loops forever”
- Why: broken next-leaf pointer chain.
- Fix: update sibling pointers atomically in split/merge path.
- Quick test: full-tree ascending scan with visited-page set.
Definition of Done
- Point lookup, insert, delete, and range scan pass.
- Structural invariants validated after random workloads.
- Tree remains balanced.
- Deterministic workload produces deterministic stats.
Project 5: Hash Index (Extendable Hashing)
- File: `P05-extendable-hash-index.md`
- Main Programming Language: C++
- Alternative Programming Languages: Rust
- Coolness Level: Level 3 - Genuinely Clever
- Business Potential: Resume Gold
- Difficulty: Advanced
- Knowledge Area: dynamic hashing, directory depth management
- Software or Tool: Projects 1-3
- Main Book: Database System Concepts
What you will build: A disk-backed extendable hash index for fast equality lookup.
Why it teaches database systems: It contrasts ordered vs hash access paths and dynamic growth behavior.
Core challenges you will face:
- Local/global depth correctness.
- Bucket split and directory doubling.
- Handling skewed key distributions.
Real World Outcome
$ ./db_lab hash-demo --seed 22
[insert] 50000 key-rid pairs
[stats] global_depth=11 buckets=231 load_factor=0.78
[probe] key=701337 found=true rid=(120,8)
[probe] key=999999 found=false
[split] total_bucket_splits=230 directory_doubles=5
[result] PASS extendable-hash-index
The Core Question You Are Answering
“How can a hash index grow online without full rehash of all data?”
Concepts You Must Understand First
- Hash prefix addressing.
- Directory and bucket depth semantics.
- Equality-lookup workload design.
Questions to Guide Your Design
- What triggers split vs directory doubling?
- How do you prevent split races under concurrency?
- How will you validate directory-to-bucket consistency?
Thinking Exercise
Start with global depth 2 and simulate inserts that repeatedly hit one bucket. Track directory and local depths.
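The exercise can be cross-checked with a small in-memory model; bucket capacity 2, an identity hash, and a starting global depth of 2 are deliberate simplifications:

```cpp
#include <cassert>
#include <initializer_list>
#include <memory>
#include <vector>

constexpr int BUCKET_CAP = 2;

struct Bucket {
    int local_depth;
    std::vector<int> keys;
};

struct ExtendableHash {
    int global_depth = 2;
    std::vector<std::shared_ptr<Bucket>> dir;

    ExtendableHash() {
        for (int i = 0; i < 4; ++i)
            dir.push_back(std::make_shared<Bucket>(Bucket{2, {}}));
    }
    size_t slot(int key) const { return key & ((1 << global_depth) - 1); }

    void insert(int key) {
        while (true) {
            auto b = dir[slot(key)];
            if ((int)b->keys.size() < BUCKET_CAP) { b->keys.push_back(key); return; }
            if (b->local_depth == global_depth) {    // directory must double
                size_t n = dir.size();
                for (size_t i = 0; i < n; ++i) dir.push_back(dir[i]);
                ++global_depth;
            }
            // Split: the distinguishing bit is the bucket's old local depth.
            int bit = b->local_depth;
            auto fresh = std::make_shared<Bucket>(Bucket{bit + 1, {}});
            b->local_depth = bit + 1;
            std::vector<int> keep;
            for (int k : b->keys)
                ((k >> bit) & 1 ? fresh->keys : keep).push_back(k);
            b->keys = keep;
            for (size_t i = 0; i < dir.size(); ++i)
                if (dir[i] == b && ((i >> bit) & 1)) dir[i] = fresh;
        }
    }
    bool contains(int key) const {
        for (int k : dir[slot(key)]->keys) if (k == key) return true;
        return false;
    }
};
```

Inserting keys 0, 4, 8, 12, 16 (all landing in the same initial bucket) forces two directory doublings, taking global depth from 2 to 4, while untouched buckets keep their lower local depth.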
The Interview Questions They Will Ask
- “Why choose extendable hashing over static hashing?”
- “What is local depth and why does it matter?”
- “Why are range queries poor for hash indexes?”
- “How would skewed keys affect this design?”
- “How do you test correctness after many splits?”
Hints in Layers
Hint 1: Encode depth in the bucket header. Never infer it indirectly.
Hint 2: Build a consistency checker. Each directory entry must map to a valid bucket with a compatible depth.
Hint 3: Use deterministic hash fixtures. Make split behavior reproducible.
Hint 4: Benchmark with skewed and uniform datasets. Observe the tradeoffs clearly.
Books That Will Help
| Topic | Book | Chapter |
|---|---|---|
| Dynamic hashing | Database System Concepts | Ch. 14 |
| Access-method tradeoffs | Database Internals | Index chapters |
| Hashing analysis | CLRS | Hash tables |
Common Pitfalls and Debugging
Problem 1: “Directory points to invalid bucket after split”
- Why: incorrect bitmask update during remap.
- Fix: centralize remap logic and test with binary traces.
- Quick test: verify every directory slot references existing bucket page.
Problem 2: “Split storm under skew”
- Why: load concentrated in few prefixes.
- Fix: evaluate bucket size and hash function quality; add skew tests.
- Quick test: run Zipfian key workload and inspect split distribution.
Definition of Done
- Equality lookups are correct after arbitrary inserts/deletes.
- Directory/bucket depth invariants hold.
- Split/doubling paths are deterministic under fixed seed.
- Performance measured on uniform and skewed key sets.
Project 6: Relational Algebra Executor (Operators)
- File: `P06-relational-algebra-executor.md`
- Main Programming Language: C++
- Alternative Programming Languages: Rust
- Coolness Level: Level 3 - Genuinely Clever
- Business Potential: Resume Gold
- Difficulty: Advanced
- Knowledge Area: iterator model, scans, selection/projection
- Software or Tool: Projects 1-4
- Main Book: Database System Concepts
What you will build: An iterator-based executor with SeqScan, IndexScan, Selection, and Projection.
Why it teaches database systems: It turns storage and indexing into query results through composable operators.
Core challenges you will face:
- Defining clean operator interfaces.
- Correct tuple flow and schema handling.
- Predicate evaluation and short-circuit behavior.
Real World Outcome
$ ./db_lab exec-demo --query "SELECT b FROM t WHERE a = 5" --seed 5
[plan] Projection(b)
Selection(a=5)
SeqScan(t)
[execute] input_rows=100000 output_rows=842
[latency] 14.2 ms
[result] PASS iterator-engine
The Core Question You Are Answering
“How does a DBMS turn a plan tree into a stream of correct tuples?”
Concepts You Must Understand First
- Iterator model (`open`-`next`-`close`).
- Tuple schema and projection semantics.
- Predicate evaluation cost and pushdown.
Questions to Guide Your Design
- Will operators own child lifecycle?
- How will null semantics be handled in predicates?
- How will you debug wrong-result bugs quickly?
Thinking Exercise
Trace one tuple from SeqScan through Selection and Projection; list every transformation.
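The same trace, written as a minimal open-next-close pipeline; tuples are reduced to vectors of ints and schemas to column indices for brevity:

```cpp
#include <cassert>
#include <functional>
#include <optional>
#include <vector>

using Tuple = std::vector<int>;   // toy schema: columns addressed by index

// Every operator exposes the same pull-based contract: Next() yields
// one tuple, or nullopt when exhausted.
struct Operator {
    virtual std::optional<Tuple> Next() = 0;
    virtual ~Operator() = default;
};

struct SeqScan : Operator {
    const std::vector<Tuple>& rows; size_t pos = 0;
    explicit SeqScan(const std::vector<Tuple>& r) : rows(r) {}
    std::optional<Tuple> Next() override {
        if (pos == rows.size()) return std::nullopt;
        return rows[pos++];
    }
};

struct Selection : Operator {
    Operator& child; std::function<bool(const Tuple&)> pred;
    Selection(Operator& c, std::function<bool(const Tuple&)> p)
        : child(c), pred(std::move(p)) {}
    std::optional<Tuple> Next() override {
        while (auto t = child.Next())       // pull until the predicate passes
            if (pred(*t)) return t;
        return std::nullopt;
    }
};

struct Projection : Operator {
    Operator& child; std::vector<int> cols;
    Projection(Operator& c, std::vector<int> cs)
        : child(c), cols(std::move(cs)) {}
    std::optional<Tuple> Next() override {
        auto t = child.Next();
        if (!t) return std::nullopt;
        Tuple out;
        for (int col : cols) out.push_back((*t)[col]);
        return out;
    }
};
```

Wiring SeqScan into Selection into Projection and draining `Next()` reproduces `SELECT b FROM t WHERE a = 5` over an in-memory table.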
The Interview Questions They Will Ask
- “What is the iterator model and why use it?”
- “What is predicate pushdown?”
- “How do you represent tuple schema through operators?”
- “What can cause duplicate/missing output rows?”
- “How do blocking operators differ from streaming operators?”
Hints in Layers
Hint 1: Start with a one-row-at-a-time contract. Avoid vectorization complexity initially.
Hint 2: Build operator-level golden tests. Test each operator in isolation.
Hint 3: Add plan-explain output early. Correctness debugging depends on visibility.
Hint 4: Keep null semantics explicit. Do not rely on host-language booleans.
Books That Will Help
| Topic | Book | Chapter |
|---|---|---|
| Query execution | Database System Concepts | Ch. 15 |
| Volcano model concepts | CMU DB lectures | Query execution modules |
| Practical engine internals | Database Internals | Execution chapters |
Common Pitfalls and Debugging
Problem 1: “Selection returns wrong row count”
- Why: incorrect null or comparison handling.
- Fix: define predicate truth table including null behavior.
- Quick test: run predicate suite with null and boundary values.
Problem 2: “Projection corrupts column mapping”
- Why: schema index mismatch between parent/child operators.
- Fix: use explicit schema objects and field IDs.
- Quick test: projection on reordered columns with known tuples.
Definition of Done
- Core operators produce correct output for deterministic fixtures.
- Plan explain output is available.
- Null and edge cases are tested.
- Integration with storage/index APIs is stable.
Project 7: Join Operators
- File: `P07-join-operators.md`
- Main Programming Language: C++
- Alternative Programming Languages: Rust
- Coolness Level: Level 4 - Hardcore Tech Flex
- Business Potential: Resume Gold
- Difficulty: Expert
- Knowledge Area: join algorithms, memory/I/O tradeoffs
- Software or Tool: Project 6
- Main Book: Database System Concepts
What you will build: Nested-loop, hash join, and sort-merge join operators.
Why it teaches database systems: Join strategy dominates many real query costs.
Core challenges you will face:
- Algorithm correctness across duplicates and nulls.
- Build/probe memory management.
- Choosing the right join algorithm for each data shape.
Real World Outcome
$ ./db_lab join-demo --seed 77
[data] left_rows=2_000_000 right_rows=500_000
[planA] NestedLoopJoin latency=24.8 s
[planB] HashJoin latency=1.9 s
[planC] SortMergeJoin latency=3.6 s
[validate] outputs_equal=true rows=410221
[result] PASS join-operators
The Core Question You Are Answering
“How does algorithm choice change join performance by orders of magnitude without changing query semantics?”
Concepts You Must Understand First
- Join semantics for inner and outer joins.
- Hash table build/probe behavior.
- External sorting basics for large inputs.
Questions to Guide Your Design
- Which side should be hash-build input and why?
- How will spills be handled when memory is insufficient?
- How do you prove two algorithms are semantically equivalent?
Thinking Exercise
Estimate cost of NLJ vs HashJoin for three cardinality/selectivity scenarios.
The Interview Questions They Will Ask
- “When is nested-loop join acceptable?”
- “Why does hash join usually outperform NLJ?”
- “How does sort-merge join help with ordered data?”
- “How do duplicates affect join output size?”
- “How would you debug incorrect join cardinality?”
Hints in Layers
Hint 1: Make one correct baseline first. Use simple NLJ to validate semantics.
Hint 2: Share a common join-output validator. Compare row sets across algorithms.
Hint 3: Instrument memory use. Hash build behavior must be observable.
Hint 4: Add skewed key datasets. Skew reveals algorithm edge cases quickly.
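The build/probe structure of hash join fits in a few lines (inner equijoin on integer keys; the multimap makes duplicate handling explicit):

```cpp
#include <cassert>
#include <unordered_map>
#include <utility>
#include <vector>

using Row = std::pair<int, int>;   // (join key, payload)

// Inner equijoin, hash style: build a multimap on the (ideally smaller)
// right input, then stream the left input and probe. Duplicates on either
// side multiply, exactly as join semantics require.
std::vector<std::pair<Row, Row>> hash_join(const std::vector<Row>& left,
                                           const std::vector<Row>& right) {
    std::unordered_multimap<int, Row> build;
    for (const Row& r : right) build.emplace(r.first, r);   // build phase
    std::vector<std::pair<Row, Row>> out;
    for (const Row& l : left) {                             // probe phase
        auto range = build.equal_range(l.first);
        for (auto it = range.first; it != range.second; ++it)
            out.emplace_back(l, it->second);
    }
    return out;
}
```

A nested-loop baseline over the same inputs should produce the same multiset of pairs; comparing the two is the validator Hint 2 describes.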
Books That Will Help
| Topic | Book | Chapter |
|---|---|---|
| Join algorithms | Database System Concepts | Ch. 15 |
| Sorting and hash foundations | CLRS | Sorting/Hash chapters |
| Practical execution tradeoffs | Database Internals | Execution sections |
Common Pitfalls and Debugging
Problem 1: “Hash join misses matches”
- Why: hash key normalization mismatch across inputs.
- Fix: canonicalize key representation before hash/probe.
- Quick test: equality join with mixed key encodings.
Problem 2: “Sort-merge duplicates incorrect”
- Why: merge cursor advancement logic wrong on repeated keys.
- Fix: explicit duplicate-run handling.
- Quick test: dataset with long duplicate runs.
Definition of Done
- All join algorithms return identical results on deterministic fixtures.
- Performance comparison is captured with explanation.
- Null and duplicate behavior is correct.
- Spill or memory-bound behavior is documented.
Project 8: Simple Query Parser & Rule-Based Optimizer
- File: `P08-sql-parser-rule-optimizer.md`
- Main Programming Language: C++
- Alternative Programming Languages: Rust, Python
- Coolness Level: Level 4 - Hardcore Tech Flex
- Business Potential: Resume Gold
- Difficulty: Expert
- Knowledge Area: parsing, AST, logical rewrites
- Software or Tool: Projects 6-7
- Main Book: Dragon Book + Database System Concepts
What you will build: SQL subset parser that produces AST, logical plan, and rule-optimized physical skeleton.
Why it teaches database systems: It connects human query text to executable operator trees.
Core challenges you will face:
- Grammar design and parse error handling.
- Semantic binding (table/column resolution).
- Correct logical rewrites.
Real World Outcome
$ ./db_lab planner-demo --sql "SELECT t1.b FROM t1 JOIN t2 ON t1.id=t2.id WHERE t1.a=5"
[parse] AST ok
[logical] Project -> Filter -> Join -> Scan(t1,t2)
[rules] predicate_pushdown=true projection_pruning=true
[physical] IndexScan(t1.a) + HashJoin + Projection
[result] PASS parser-and-rules
The Core Question You Are Answering
“How does a DBMS convert SQL text into a semantically correct and more efficient plan?”
Concepts You Must Understand First
- Tokenization and parsing basics.
- AST to relational algebra mapping.
- Rule rewrite safety conditions.
Questions to Guide Your Design
- Which SQL subset will be accepted and explicitly rejected?
- How will alias/column disambiguation work?
- Which rewrites are guaranteed safe in your subset?
Thinking Exercise
Take one SQL query and manually derive AST -> logical plan -> rewritten logical plan.
The Interview Questions They Will Ask
- “What is the difference between parsing and optimization?”
- “What is predicate pushdown and why is it safe?”
- “How do you handle ambiguous column references?”
- “Which rewrites can change semantics if done wrong?”
- “How do you test parser correctness systematically?”
Hints in Layers
Hint 1: Freeze SQL grammar scope early. A small, well-defined subset beats a half-implemented broad grammar.
Hint 2: Build an AST pretty-printer. It is your fastest parse-debug tool.
Hint 3: Separate binding from parsing. Name resolution is a semantic step.
Hint 4: Add rewrite equivalence tests. Execute both original and rewritten plans on the same fixtures.
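A sketch of the very first stage, tokenizing the SQL subset; the token classes and keyword list here are illustrative, not a prescribed grammar:

```cpp
#include <cassert>
#include <cctype>
#include <initializer_list>
#include <string>
#include <vector>

// Token classes for a small SQL subset; parsing builds on this stream.
enum class Tok { Keyword, Ident, Number, Symbol };
struct Token { Tok kind; std::string text; };

bool is_keyword(const std::string& s) {
    for (const char* k : {"SELECT", "FROM", "WHERE", "JOIN", "ON", "AND"})
        if (s == k) return true;
    return false;
}

std::vector<Token> tokenize(const std::string& sql) {
    std::vector<Token> out;
    size_t i = 0;
    while (i < sql.size()) {
        char c = sql[i];
        if (std::isspace((unsigned char)c)) { ++i; continue; }
        if (std::isalpha((unsigned char)c)) {          // identifier or keyword
            size_t j = i;
            while (j < sql.size() && (std::isalnum((unsigned char)sql[j]) ||
                                      sql[j] == '_' || sql[j] == '.')) ++j;
            std::string word = sql.substr(i, j - i);
            std::string upper = word;
            for (char& ch : upper) ch = (char)std::toupper((unsigned char)ch);
            out.push_back({is_keyword(upper) ? Tok::Keyword : Tok::Ident, word});
            i = j;
        } else if (std::isdigit((unsigned char)c)) {   // numeric literal
            size_t j = i;
            while (j < sql.size() && std::isdigit((unsigned char)sql[j])) ++j;
            out.push_back({Tok::Number, sql.substr(i, j - i)});
            i = j;
        } else {                                       // single-char symbol
            out.push_back({Tok::Symbol, std::string(1, c)});
            ++i;
        }
    }
    return out;
}
```

Keeping the lexer this dumb is deliberate: classification questions (is `t1.b` a valid column reference?) belong to the binding phase, per Hint 3.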
Books That Will Help
| Topic | Book | Chapter |
|---|---|---|
| Parsing fundamentals | Dragon Book | Ch. 2-4 |
| Relational algebra mapping | Database System Concepts | Ch. 6, 15 |
| Optimizer rewrite ideas | Readings in DB Systems | Optimizer papers |
Common Pitfalls and Debugging
Problem 1: “Parser accepts invalid query shapes”
- Why: grammar too permissive or semantic checks missing.
- Fix: add binding/validation phase.
- Quick test: negative-query corpus with expected errors.
Problem 2: “Rewrite changes output”
- Why: rule applied without null/outer-join safety checks.
- Fix: add preconditions per rewrite rule.
- Quick test: equivalence harness over random datasets.
Definition of Done
- SQL subset is documented and enforced.
- AST/logical/physical representations are inspectable.
- Core rewrites pass equivalence tests.
- Error paths provide clear diagnostics.
Project 9: Transaction Manager with Two-Phase Locking (2PL)
- File: `P09-transaction-manager-2pl.md`
- Main Programming Language: C++
- Alternative Programming Languages: Rust
- Coolness Level: Level 4 - Hardcore Tech Flex
- Business Potential: Resume Gold
- Difficulty: Expert
- Knowledge Area: lock manager, deadlocks, isolation
- Software or Tool: Projects 3, 6
- Main Book: Database System Concepts
What you will build: A transaction manager with strict 2PL lock manager and deadlock detection.
Why it teaches database systems: It enforces correctness under concurrent execution.
Core challenges you will face:
- Lock compatibility and upgrades.
- Wait-for graph cycle detection.
- Abort/rollback interaction with execution layer.
Real World Outcome
$ ./db_lab txn-demo --mode 2pl --seed 404
[txn] begin T1
[txn] begin T2
[lock] T1 X RID(10,2) granted
[lock] T2 X RID(10,3) granted
[lock] T1 X RID(10,3) waiting
[lock] T2 X RID(10,2) waiting
[deadlock-detector] cycle=[T1,T2] victim=T2
[txn] abort T2
[txn] commit T1
[result] PASS strict-2pl-deadlock-resolution
The Core Question You Are Answering
“How do we guarantee isolation with concurrent writes while preventing the system from freezing under deadlocks?”
Concepts You Must Understand First
- Conflict serializability.
- Lock compatibility matrices.
- Deadlock detection vs prevention tradeoffs.
Questions to Guide Your Design
- How are lock upgrades handled safely?
- What victim selection strategy minimizes wasted work?
- How often should deadlock detection run?
Thinking Exercise
Create a two-transaction lock timeline that deadlocks and annotate where detection can intervene.
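Detection itself is a plain depth-first search for a cycle in the wait-for graph; the moment the second wait edge is added is where the detector can intervene:

```cpp
#include <cassert>
#include <map>
#include <set>
#include <vector>

using Graph = std::map<int, std::vector<int>>;  // txn id -> txns it waits for

// DFS: an edge back to a node on the current path means a cycle (deadlock).
bool has_cycle(const Graph& g, int node, std::set<int>& path, std::set<int>& done) {
    if (done.count(node)) return false;          // already proven cycle-free
    if (!path.insert(node).second) return true;  // revisited on this DFS path
    auto it = g.find(node);
    if (it != g.end())
        for (int next : it->second)
            if (has_cycle(g, next, path, done)) return true;
    path.erase(node);
    done.insert(node);
    return false;
}

bool deadlocked(const Graph& g) {
    std::set<int> done;
    for (const auto& entry : g) {
        std::set<int> path;
        if (has_cycle(g, entry.first, path, done)) return true;
    }
    return false;
}
```

In the demo output above, the detector fires once both `T1 -> T2` and `T2 -> T1` edges exist; victim selection then picks one transaction on the cycle to abort.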
The Interview Questions They Will Ask
- “What does strict 2PL guarantee?”
- “Why are deadlocks normal, not exceptional?”
- “How does lock granularity affect throughput?”
- “What is lock escalation and when is it risky?”
- “How would you test serializability claims?”
Hints in Layers
Hint 1: Start with one lock table keyed by RID. Keep granularity simple first.
Hint 2: Build deterministic scheduler tests. You need reproducible deadlock scenarios.
Hint 3: Separate lock states from txn states. Conflation causes subtle rollback bugs.
Hint 4: Emit wait-for graph snapshots. Visualization helps verify cycle detection.
Books That Will Help
| Topic | Book | Chapter |
|---|---|---|
| Concurrency control | Database System Concepts | Ch. 18 |
| Transaction design | Gray & Reuter | Core chapters |
| Deadlock theory | Operating Systems concepts texts | Deadlock chapter |
Common Pitfalls and Debugging
Problem 1: “Transactions starve on lock upgrades”
- Why: unfair queueing policy.
- Fix: define FIFO or priority with starvation bounds.
- Quick test: many readers + one upgrader simulation.
Problem 2: “Abort leaves locks behind”
- Why: rollback path skips lock release on some errors.
- Fix: centralized transaction finalization routine.
- Quick test: force abort in every lock state and assert no residual locks.
Definition of Done
- Strict 2PL semantics verified with deterministic schedules.
- Deadlock cycle detection and victim abort work.
- Lock release is complete on commit/abort.
- Isolation anomalies for covered levels are demonstrably prevented.
Project 10: Recovery Manager (UNDO/REDO Logging)
- File: `P10-recovery-manager-wal-undo-redo.md`
- Main Programming Language: C++
- Alternative Programming Languages: Rust
- Coolness Level: Level 5 - Pure Magic
- Business Potential: Resume Gold
- Difficulty: Master
- Knowledge Area: WAL, checkpoints, crash recovery
- Software or Tool: Projects 3 and 9
- Main Book: Database System Concepts + ARIES paper
What you will build: A WAL-based recovery manager implementing analysis/redo/undo workflow.
Why it teaches database systems: It is the core durability engine.
Core challenges you will face:
- Log record format and LSN monotonicity.
- WAL-before-data enforcement.
- Correct restart behavior under partial crash states.
Real World Outcome
$ ./db_lab recovery-demo --seed 12 --inject-crash after-commit-before-page-flush
[txn] T15 update RID(8,2) old=100 new=120
[log] append lsn=7001 type=UPDATE
[log] append lsn=7002 type=COMMIT
[log] fsync up_to=7002
[crash] simulated
[restart] analysis_start_lsn=6800
[restart] redo_applied=1 undo_applied=0
[verify] committed_value=120
[result] PASS crash-recovery
The Core Question You Are Answering
“How can a database prove that committed data survives crashes while incomplete work is removed?”
Concepts You Must Understand First
- WAL ordering invariant.
- Steal/no-force consequences.
- Idempotent redo/undo design.
Questions to Guide Your Design
- What minimum fields are required in each log record?
- How will pageLSN gate redo idempotence?
- Which checkpoint model will you implement first?
Thinking Exercise
Draw a timeline with two transactions where one commits and one aborts; mark expected redo/undo sets after crash.
The Interview Questions They Will Ask
- “Why does commit require log flush?”
- “What is the role of pageLSN in redo?”
- “Why do steal/no-force systems need both undo and redo?”
- “What is a fuzzy checkpoint?”
- “How do you test recovery beyond unit tests?”
Hints in Layers
Hint 1: Start with an append-only log and strict LSN assignment. Do not optimize prematurely.
Hint 2: Implement deterministic crash-injection hooks. Recovery bugs hide without adversarial testing.
Hint 3: Separate analysis metadata from page replay. Keep phases explicit and test each independently.
Hint 4: Track replay counters and final invariants. Observable recovery makes debugging practical.
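The pageLSN gate reduced to its core; single-field pages stand in for real ones, and replaying the log a second time must change nothing:

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <vector>

struct Page { uint64_t page_lsn = 0; int value = 0; };      // toy one-field page
struct LogRecord { uint64_t lsn; int page_id; int new_value; };

// Redo pass: apply a record only if the page has not already absorbed it.
// The pageLSN comparison is what makes replay idempotent.
int redo(std::map<int, Page>& pages, const std::vector<LogRecord>& log) {
    int applied = 0;                 // replay counter: observable recovery
    for (const LogRecord& rec : log) {
        Page& p = pages[rec.page_id];
        if (p.page_lsn < rec.lsn) {  // page version predates this record
            p.value = rec.new_value;
            p.page_lsn = rec.lsn;
            ++applied;
        }                            // else: update already on the page, skip
    }
    return applied;
}
```

Running `redo` twice over the same log models the "repeat recovery twice and compare final state" quick test: the second pass must apply zero records.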
Books That Will Help
| Topic | Book | Chapter |
|---|---|---|
| Recovery fundamentals | Database System Concepts | Ch. 19 |
| ARIES method | IBM ARIES paper | Full paper |
| Durable systems engineering | Database Internals | Reliability sections |
Common Pitfalls and Debugging
Problem 1: “Committed transaction missing after restart”
- Why: commit ACK happened before fsync.
- Fix: enforce durable commit record before ACK.
- Quick test: crash immediately after ACK and verify commit persists.
Problem 2: “Redo reapplies updates twice”
- Why: missing pageLSN idempotence check.
- Fix: redo only if `pageLSN < recordLSN`.
- Quick test: repeat recovery twice and compare final state hash.
Definition of Done
- Recovery passes deterministic crash matrix.
- Committed effects survive crash.
- Uncommitted effects are undone.
- Replay is idempotent across repeated restart.
Project 11: Multi-Version Concurrency Control (MVCC)
- File: `P11-mvcc-engine.md`
- Main Programming Language: C++
- Alternative Programming Languages: Rust
- Coolness Level: Level 5 - Pure Magic
- Business Potential: Resume Gold
- Difficulty: Master
- Knowledge Area: snapshots, visibility rules, version GC
- Software or Tool: Projects 2, 3, 9, 10
- Main Book: Database System Concepts + transaction texts
What you will build: MVCC layer enabling snapshot reads with non-blocking reader/writer interactions.
Why it teaches database systems: It reflects modern high-concurrency DB architecture.
Core challenges you will face:
- Version chain storage and traversal.
- Snapshot visibility correctness.
- Safe garbage collection horizon.
Real World Outcome
$ ./db_lab mvcc-demo --seed 55
[txn] T1 start_ts=100
[txn] T2 start_ts=110
[T2] update key=42 value=900 -> commit_ts=115
[T1] read key=42 value=850 (snapshot)
[T3] start_ts=120 read key=42 value=900
[gc] reclaimed_versions=12 safe_horizon=100
[result] PASS mvcc-snapshot-visibility
The Core Question You Are Answering
“How can readers and writers proceed concurrently without sacrificing correctness?”
Concepts You Must Understand First
- Snapshot timestamp model.
- Version metadata (`xmin`/`xmax` or equivalent).
- Writer conflict detection and GC horizon rules.
Questions to Guide Your Design
- Where will version metadata live: tuple header, side table, or both?
- How do you resolve write-write conflicts deterministically?
- What metrics indicate version bloat risk?
Thinking Exercise
Simulate three transactions with overlapping lifetimes and determine visible version for each read.
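The visibility predicate at the heart of this exercise, in its simplest commit-timestamp form (begin/end commit timestamps play the role of `xmin`/`xmax`; in-progress writers are ignored for brevity):

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

constexpr uint64_t INF = UINT64_MAX;

// One physical version of a logical key. begin_ts/end_ts are commit
// timestamps: the version is current during [begin_ts, end_ts).
struct Version {
    uint64_t begin_ts;   // commit ts of the creating transaction
    uint64_t end_ts;     // commit ts of the overwriting/deleting one, or INF
    int value;
};

// A snapshot taken at snap_ts sees the version whose lifetime covers it.
bool visible(const Version& v, uint64_t snap_ts) {
    return v.begin_ts <= snap_ts && snap_ts < v.end_ts;
}

int read_key(const std::vector<Version>& chain, uint64_t snap_ts) {
    for (const Version& v : chain)
        if (visible(v, snap_ts)) return v.value;
    return -1;   // no visible version for this snapshot
}
```

This reproduces the demo above: T1's snapshot at 100 keeps seeing 850 even after T2 commits 900 at timestamp 115, while T3's later snapshot sees 900.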
The Interview Questions They Will Ask
- “How does MVCC reduce lock contention?”
- “Why can snapshot isolation allow write skew?”
- “What makes version GC safe?”
- “How do indexes interact with versioned tuples?”
- “When would you still need explicit locks with MVCC?”
Hints in Layers
Hint 1: Implement the visibility function first. Everything else depends on it.
Hint 2: Keep version metadata explicit and testable. Opaque flags make debugging painful.
Hint 3: Build anomaly demonstration tests. Show what each isolation level allows/prevents.
Hint 4: Add GC watermark instrumentation. Monitor version-chain lengths continuously.
Books That Will Help
| Topic | Book | Chapter |
|---|---|---|
| MVCC and snapshots | Database System Concepts | Concurrency chapters |
| Transaction correctness | Gray & Reuter | Isolation sections |
| Production perspective | PostgreSQL docs | MVCC sections |
Common Pitfalls and Debugging
Problem 1: “Readers see future versions”
- Why: incorrect visibility comparison against snapshot timestamp.
- Fix: centralize visibility predicate and test edge timestamps.
- Quick test: deterministic transaction timeline fixtures.
Problem 2: “GC removes live versions”
- Why: safe horizon computed without considering oldest active txn.
- Fix: horizon = min(active_start_ts).
- Quick test: long-running reader during GC run.
Definition of Done
- Snapshot reads are deterministic and correct.
- Writer conflicts are handled per policy.
- GC reclaims obsolete versions safely.
- Isolation anomaly behavior is documented by test.
Project 12: Simple Statistics and Cost-Based Optimizer
- File: `P12-statistics-cost-optimizer.md`
- Main Programming Language: C++
- Alternative Programming Languages: Rust, Python
- Coolness Level: Level 4 - Hardcore Tech Flex
- Business Potential: Resume Gold
- Difficulty: Master
- Knowledge Area: statistics, cardinality estimation, plan search
- Software or Tool: Projects 6-8
- Main Book: Database System Concepts + System R paper
What you will build: Statistics collector and cost model to choose better physical plans automatically.
Why it teaches database systems: It gives your engine query-planning intelligence.
Core challenges you will face:
- Stats model design (histograms, NDV, null fraction).
- Estimation errors under skew/correlation.
- Plan enumeration and cost ranking.
Real World Outcome
$ ./db_lab cbo-demo --seed 808
[analyze] table=orders rows=5_000_000 ndv(customer_id)=420000
[query] SELECT * FROM orders WHERE customer_id=42
[planA] SeqScan cost=145000 est_rows=12000
[planB] IndexScan cost=1800 est_rows=14
[chosen] planB
[actual] rows=17 latency=19.4 ms
[result] PASS cost-based-plan-selection
The Core Question You Are Answering
“How does a database predict query cost well enough to choose fast plans automatically?”
Concepts You Must Understand First
- Cardinality estimation basics.
- Cost-model calibration (I/O vs CPU weights).
- Join-order search complexity.
Questions to Guide Your Design
- Which stats are cheap enough to maintain and still useful?
- How often should stats refresh?
- How will you compare estimated vs actual to improve the model?
Thinking Exercise
Given two predicates on correlated columns, estimate the row count under the independence assumption and explain why the estimate can fail.
The Interview Questions They Will Ask
- “What statistics are most useful for plan selection?”
- “Why do bad cardinality estimates cause bad plans?”
- “What is a Selinger-style optimizer?”
- “How do you validate a cost model in practice?”
- “When should planner hints be avoided?”
Hints in Layers
Hint 1: Start with single-table selectivity estimation. Keep the scope narrow, then expand to joins.
Hint 2: Separate estimation from planning. It makes experimentation easier.
Hint 3: Keep `EXPLAIN` output machine-readable. A/B comparisons become straightforward.
Hint 4: Track the estimate/actual error distribution. Use data, not intuition, for improvements.
Books That Will Help
| Topic | Book | Chapter |
|---|---|---|
| Cost-based optimization | Database System Concepts | Ch. 16 |
| Foundational optimizer design | System R paper | Full paper |
| Modern optimizer thinking | Readings in Database Systems | Optimizer chapters |
Common Pitfalls and Debugging
Problem 1: “Planner always picks same join order”
- Why: enumeration or cost function bug.
- Fix: verify candidate generation and compare cost traces.
- Quick test: small 3-table query with obvious selectivity differences.
Problem 2: “Estimated rows far from actual on skewed data”
- Why: simplistic uniformity assumption.
- Fix: add histograms and/or frequency sketches.
- Quick test: Zipf-distributed dataset benchmark.
Definition of Done
- Stats collection pipeline exists and is repeatable.
- Cost model chooses better plan than baseline for benchmark set.
- Explain output includes estimated and actual rows.
- Estimation error analysis is documented.
Project Comparison Table
| Project | Difficulty | Time | Depth of Understanding | Fun Factor |
|---|---|---|---|---|
| 1. Disk-Oriented Storage Manager | Intermediate | 1 week | High storage intuition | ★★★☆☆ |
| 2. Tuple and Page Manager | Intermediate | 1-2 weeks | High layout intuition | ★★★☆☆ |
| 3. Buffer Pool Manager | Advanced | 2 weeks | High systems intuition | ★★★★☆ |
| 4. B+Tree Index | Expert | 3-4 weeks | Very high data-structure depth | ★★★★★ |
| 5. Extendable Hash Index | Advanced | 2 weeks | High index tradeoff depth | ★★★★☆ |
| 6. Relational Algebra Executor | Advanced | 2 weeks | High operator-model depth | ★★★★☆ |
| 7. Join Operators | Expert | 3 weeks | Very high algorithm depth | ★★★★★ |
| 8. Parser + Rule Optimizer | Expert | 3 weeks | High compiler+DB bridge depth | ★★★★☆ |
| 9. Transaction Manager (2PL) | Expert | 3 weeks | Very high correctness depth | ★★★★★ |
| 10. Recovery Manager | Master | 4+ weeks | Very high reliability depth | ★★★★★ |
| 11. MVCC Engine | Master | 3-4 weeks | Very high concurrency depth | ★★★★★ |
| 12. Statistics + CBO | Master | 3-4 weeks | Very high optimization depth | ★★★★★ |
Recommendation
If you are new to DB internals: Start with Project 1, then 2, then 3 to establish reliable page+cache mental models.
If you are a backend engineer focused on query speed: Start with Project 3, then 4, 6, 7, and 12.
If you want reliability engineering depth: Start with Project 9, then 10, then 11.
Final Overall Project
Capstone: Mini-Relational DBMS
The Goal: Integrate Projects 1-12 into one coherent mini DBMS with deterministic CLI behavior.
- Build a catalog and SQL front door (`CREATE TABLE`, `CREATE INDEX`, `INSERT`, `SELECT`, `UPDATE`, `DELETE`).
- Route statements through parser -> optimizer -> execution with transaction context.
- Enforce durability and recovery workflow with crash-injection test matrix.
- Publish an `EXPLAIN` path showing chosen plans and estimate/actual diagnostics.
Success Criteria: The system returns correct results, preserves committed data across crashes, rejects/rolls back invalid transactions safely, and demonstrates measurable plan-quality improvements from statistics.
From Learning to Production: What Is Next
| Your Project | Production Equivalent | Gap to Fill |
|---|---|---|
| Page/Tuple Manager | PostgreSQL heap + page format | Multi-table catalog maturity, migration tooling |
| Buffer Pool Manager | InnoDB/PostgreSQL shared buffer stack | NUMA tuning, background flush/checkpoint sophistication |
| B+Tree/Hash Index | Production index engines | Online rebuild, concurrent DDL, compression |
| Executor + Planner | PostgreSQL planner/executor | Advanced SQL semantics, vectorized execution, JIT |
| 2PL/MVCC | Production concurrency subsystem | Full serializable protocol, lock hierarchy completeness |
| WAL Recovery | ARIES-like recovery in major DBMS | Backup/PITR tooling, distributed replication, ops tooling |
Summary
This learning path covers database systems internals through 12 hands-on projects plus one capstone integration.
| # | Project Name | Main Language | Difficulty | Time Estimate |
|---|---|---|---|---|
| 1 | Disk-Oriented Storage Manager | C++ | Intermediate | 1 week |
| 2 | Tuple and Page Manager | C++ | Intermediate | 1-2 weeks |
| 3 | Buffer Pool Manager | C++ | Advanced | 2 weeks |
| 4 | B+Tree Index | C++ | Expert | 3-4 weeks |
| 5 | Hash Index (Extendable Hashing) | C++ | Advanced | 2 weeks |
| 6 | Relational Algebra Executor | C++ | Advanced | 2 weeks |
| 7 | Join Operators | C++ | Expert | 3 weeks |
| 8 | Query Parser and Rule Optimizer | C++ | Expert | 3 weeks |
| 9 | Transaction Manager (2PL) | C++ | Expert | 3 weeks |
| 10 | Recovery Manager (UNDO/REDO) | C++ | Master | 4+ weeks |
| 11 | MVCC Engine | C++ | Master | 3-4 weeks |
| 12 | Statistics and Cost Optimizer | C++ | Master | 3-4 weeks |
Expected Outcomes
- You can explain DBMS architecture with invariants, not just features.
- You can implement and validate critical DB subsystems deterministically.
- You can diagnose performance and correctness issues with evidence.
Additional Resources and References
Standards and Specifications
- SQL standards overview (ISO/IEC 9075 family)
- PostgreSQL WAL docs
- PostgreSQL MVCC docs
- SQLite file format
Industry Analysis and Adoption Signals
- Stack Overflow Developer Survey 2024 - Databases
- DB-Engines Ranking
- SQLite deployment scale statement
Foundational Papers
- "Access Path Selection in a Relational Database Management System" (Selinger et al., the System R optimizer paper used in Project 12)
- "ARIES: A Transaction Recovery Method Supporting Fine-Granularity Locking and Partial Rollbacks Using Write-Ahead Logging" (Mohan et al.)
Books
- “Database System Concepts” by Silberschatz, Korth, Sudarshan - Core textbook mapping for most projects.
- “Database Internals” by Alex Petrov - Practical engine implementation perspective.
- “Transaction Processing: Concepts and Techniques” by Gray and Reuter - Concurrency and recovery rigor.