Sprint: Database Internals in C Mastery - Real World Projects
Goal: Build a first-principles mental model of how real databases work internally, from bytes on disk to concurrent SQL execution, by implementing core storage-engine subsystems in C. You will learn why page layout decisions shape performance, how B+ trees and buffer pools minimize I/O, and how WAL and recovery preserve correctness under crashes. You will also understand how isolation levels, locking, and query execution interact under real workloads. By the end, you will be able to design, debug, and benchmark a small but serious embedded database engine and explain each subsystem with production-grade rigor.
Introduction
Database internals is the discipline of understanding how a DBMS transforms persistent bytes into reliable query results under concurrency. A mature database is not one algorithm; it is a coordinated set of subsystems: storage layout, index structures, cache management, recovery logic, transaction control, and query execution.
This guide focuses on implementing those subsystems in C so you can see the real machine boundaries directly: memory ownership, I/O order, page boundaries, and failure behavior. You will build a sequence of projects that starts from a persistent key-value engine and ends with a mini SQL database architecture.
In scope:
- Slotted-page storage and binary formats
- B+ tree indexing and split/merge behavior
- Buffer pool design and eviction policies
- WAL, checkpoints, and crash recovery
- Transaction isolation and concurrency control patterns
- SQL parse-plan-execute pipeline
Out of scope:
- Distributed transactions and consensus
- Full ANSI SQL compliance
- Columnar/vectorized engines at warehouse scale
- Production replication/failover orchestration
┌──────────────────────────────────────────────────────┐
│ SQL Text Input │
└──────────────────────────┬───────────────────────────┘
│
v
┌─────────────────────┐ ┌───────────────────────┐ ┌──────────────────────┐
│ Tokenizer + Parser │-->│ Logical/Physical Plan │-->│ Operator Execution   │
└─────────────────────┘ └───────────────────────┘ └──────────┬───────────┘
│
v
┌─────────────────────┐ ┌───────────────────────┐ ┌──────────────────────┐
│ Transaction Manager │<->│ Buffer Pool Manager │<->│ B+Tree + Heap Pages │
└──────────┬──────────┘ └──────────┬────────────┘ └──────────┬───────────┘
│ │ │
v v v
┌─────────────────────┐ ┌───────────────────────┐ ┌──────────────────────┐
│ WAL + Checkpointing │<->│ File I/O + fsync │<->│ Main DB + WAL Files │
└─────────────────────┘ └───────────────────────┘ └──────────────────────┘
How to Use This Guide
- Read the Theory Primer before writing implementation code.
- For each project, answer the core question and thinking exercise first.
- Keep a single engineering notebook with: invariants, failure cases, and benchmark deltas.
- Use deterministic fixtures (fixed seeds, fixed dataset order, fixed timestamps in tests) so regressions are visible.
- Use project links in order on the first pass; then do a second pass by subsystem depth (storage, then recovery, then concurrency, then SQL).
Prerequisites & Background Knowledge
Essential Prerequisites (Must Have)
- C language fundamentals: pointers, structs, file I/O, memory lifecycle
- Data-structure fundamentals: trees, hash tables, amortized complexity
- OS fundamentals: virtual memory, page cache, file descriptors, fsync
- Basic command-line tooling: make, hexdump, od, strace, gdb or lldb
- Recommended Reading: The C Programming Language (K&R) Ch. 5-8
Helpful But Not Required
- Compiler pipeline basics (tokenizing/parsing/AST)
- Advanced concurrency patterns (lock ordering, deadlock detection)
- Linux performance tooling (perf, flame graphs)
Self-Assessment Questions
- Can you explain the difference between logical records and on-disk pages?
- Can you explain why write ordering (log before data) matters for durability?
- Can you reason about O(log n) B+ tree behavior in terms of disk page reads?
Development Environment Setup
Required Tools:
- C compiler: clang 15+ or gcc 12+
- Build tooling: make or ninja
- Debugging: gdb or lldb
- CLI diagnostics: hexdump, od, strace
Recommended Tools:
- perf for profiling hot paths
- valgrind or sanitizer builds for memory correctness
- sqlite3 binary for sanity-comparison experiments
Testing Your Setup:
$ cc --version
clang version 18.1.8
$ hexdump -C /tmp/sample.db | head -3
00000000 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
$ getconf PAGESIZE
4096
Time Investment
- Simple projects: 4-8 hours each
- Moderate projects: 10-20 hours each
- Complex projects: 20-40 hours each
- Total sprint: 3-5 months (part-time)
Important Reality Check
Database internals learning is nonlinear. You will often understand a concept only after failing an adjacent one (for example, understanding page formats only after debugging B+ tree split corruption). That friction is expected and productive.
Big Picture / Mental Model
Think of a database as a state machine constrained by three non-negotiables:
- Performance: minimize random I/O and unnecessary copies.
- Correctness: preserve invariants under concurrency.
- Durability: survive crashes without violating committed state.
USER VIEW ENGINE VIEW
┌─────────────────────────────────┐ ┌────────────────────────────────────────┐
│ SELECT ... WHERE ... │ │ Parse -> Plan -> Execute │
│ INSERT/UPDATE/DELETE │ │ │
└─────────────────────────────────┘ │ Read/Write Pages via Buffer Pool │
│ │ │
│ v │
│ B+Tree + Heap Layout │
│ │ │
│ v │
│ WAL + Checkpoint + Recovery │
└─────────────┬──────────────────────────┘
│
v
Crash-safe persistent files
Theory Primer
Concept 1: Page-Oriented Storage and Binary Encoding
Fundamentals
Relational databases are physically page stores. Even when you think in rows, indexes, and SQL operators, the machine persists fixed-size pages and updates those pages according to strict invariants. A page is the smallest durable read/write unit in most engines. Inside a page, records are laid out with metadata that enables insert/delete/update while preserving stable logical identifiers. Slotted pages solve this by separating the slot directory from the actual record payload. Slots point to payload offsets, so compaction can move payload bytes while keeping logical record references stable. Binary encoding matters because data must survive process restarts, architecture differences, and version upgrades. This means explicit endianness, explicit field sizes, and versioned headers. If you skip those disciplines, you create a format that appears to work in one process but breaks under upgrade, crash recovery, or external tooling.
Deep Dive
Page storage is the first place where abstraction collides with hardware behavior. A page-based engine deliberately trades flexibility for predictable I/O. Instead of writing arbitrary byte ranges, it writes page-aligned blocks. This design aligns with how filesystems and storage devices optimize reads and writes. In practice, page sizes usually sit in a bounded range (commonly 4 KiB to 16 KiB), and page-local data structures are engineered to maximize useful payload while keeping metadata small and deterministic.
A canonical slotted page has four regions: header, slot array, free space, and payload region. The slot array grows upward from the front, payload grows downward from the back, and free space exists between them. Inserts allocate payload bytes, then add a slot entry. Deletes can tombstone a slot or reclaim immediately depending on policy. Updates may be in-place when size is unchanged, or they may trigger relocation if variable-length fields grow. The slot indirection layer is the central stability trick: external references use (page_id, slot_id) while payload can move.
You must define invariants explicitly. Typical invariants include: (1) all slot offsets point within payload bounds, (2) payload records do not overlap, (3) free-space pointers are ordered and valid, (4) page checksum or LSN metadata matches write protocol expectations. Invariant checks are not optional; they are your earliest corruption detector. In production engines, page inspection tools exist largely because page-level corruption triage is common in outage analysis.
Binary encoding is equally strategic. If you serialize C structs directly, compiler padding and host endianness can make files non-portable or unstable across versions. Better pattern: define explicit field widths (u16, u32, etc.), encode/decode with deterministic byte order, include a version field, and document reserved bytes. SQLite’s on-disk format documentation is a strong model here because it treats file format as a stable external contract, not an incidental implementation detail.
Failure modes are subtle. Torn writes can leave partially updated pages; stale pointers can survive crashes if WAL/data write order is wrong; compaction bugs can produce “valid-looking” but semantically broken pages. This is why many engines avoid complex in-place transformations unless strictly necessary and rely on append-like techniques plus metadata flips for atomicity. Your educational engine can start simpler: conservative page rewrites with strict checksum validation and a repair/debug mode that reports first-invalid offset.
Another critical point is fragmentation economics. Frequent updates and deletes increase internal fragmentation, reducing effective page density and query performance. Compaction improves density but costs CPU and write amplification. Real engines balance this with background vacuum/cleanup or on-threshold compaction policies. In your projects, force yourself to measure page fill-factor over time under mixed workloads; this converts a theoretical concept into observable behavior.
Finally, page formats are long-lived commitments. A casual header change can invalidate old files or require migration tooling. Introduce format versioning from day one, and include compatibility tests that open historical files. This one decision drastically reduces future pain and mirrors production reality.
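The encoding discipline above can be sketched in C. This is a minimal illustration, not a prescribed format: the field names, the 24-byte layout, and the "TDB1" magic are assumptions chosen for the example. The key habits it shows are explicit field widths, a fixed byte order independent of the host CPU, and a version gate checked on decode.

```c
#include <stdint.h>
#include <string.h>

/* Illustrative header: every field has an explicit width, and bytes are
   written in a fixed (little-endian) order, never via a raw struct dump. */
typedef struct {
    uint8_t  magic[4];    /* "TDB1" in this example */
    uint16_t version;     /* format version gate */
    uint16_t flags;
    uint32_t page_id;
    uint64_t lsn;
    uint32_t checksum;
} PageHeader;

static void put_u16le(uint8_t *p, uint16_t v) { p[0] = v & 0xFF; p[1] = v >> 8; }
static uint16_t get_u16le(const uint8_t *p) { return (uint16_t)(p[0] | (p[1] << 8)); }
static void put_u32le(uint8_t *p, uint32_t v) {
    for (int i = 0; i < 4; i++) p[i] = (v >> (8 * i)) & 0xFF;
}
static uint32_t get_u32le(const uint8_t *p) {
    uint32_t v = 0;
    for (int i = 0; i < 4; i++) v |= (uint32_t)p[i] << (8 * i);
    return v;
}
static void put_u64le(uint8_t *p, uint64_t v) {
    for (int i = 0; i < 8; i++) p[i] = (v >> (8 * i)) & 0xFF;
}
static uint64_t get_u64le(const uint8_t *p) {
    uint64_t v = 0;
    for (int i = 0; i < 8; i++) v |= (uint64_t)p[i] << (8 * i);
    return v;
}

/* Serialize field-by-field; memcpy of the struct would leak padding
   and host endianness into the file. */
void header_encode(const PageHeader *h, uint8_t buf[24]) {
    memcpy(buf, h->magic, 4);
    put_u16le(buf + 4, h->version);
    put_u16le(buf + 6, h->flags);
    put_u32le(buf + 8, h->page_id);
    put_u64le(buf + 12, h->lsn);
    put_u32le(buf + 20, h->checksum);
}

int header_decode(const uint8_t buf[24], PageHeader *h) {
    memcpy(h->magic, buf, 4);
    if (memcmp(h->magic, "TDB1", 4) != 0) return -1;  /* not our file */
    h->version = get_u16le(buf + 4);
    if (h->version > 2) return -2;                    /* future format: refuse */
    h->flags = get_u16le(buf + 6);
    h->page_id = get_u32le(buf + 8);
    h->lsn = get_u64le(buf + 12);
    h->checksum = get_u32le(buf + 20);
    return 0;
}
```

Note the decode path returns distinct errors for "unknown file" versus "newer format than I understand"; that distinction is exactly what version gating buys you.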
How this fits on projects
- Projects 1, 3, 4, and 7 rely directly on page format choices.
- Every benchmark and recovery test is constrained by page-level behavior.
Definitions & key terms
- Page: Fixed-size block used as primary storage I/O unit.
- Slot directory: Array mapping logical slot IDs to payload offsets.
- Heap page: Page storing table tuples/records.
- Fill factor: Effective used-space ratio for a page.
- Format version: Header field indicating on-disk schema version.
Mental model diagram
Page N
┌──────────────────────────────────────────────────────────────┐
│ Header: magic | version | page_id | LSN | checksum | flags │
├──────────────────────────────────────────────────────────────┤
│ Slot[0] -> off=4012 len=40 │
│ Slot[1] -> off=3960 len=52 │
│ Slot[2] -> tombstone │
├──────────────────────────────────────────────────────────────┤
│ Free Space │
├──────────────────────────────────────────────────────────────┤
│ Record bytes packed from end of page toward middle │
└──────────────────────────────────────────────────────────────┘
Invariant: slot offsets always point to valid, non-overlapping record ranges
How it works (step-by-step, with invariants and failure modes)
- Read page header; verify magic, version, and checksum.
- Validate slot array bounds and monotonic free-space pointers.
- Resolve (page_id, slot_id) to record bytes via slot offset.
- For insert, allocate payload then write slot metadata atomically in memory.
- On flush, write page through buffer manager with WAL ordering guarantees.
Failure modes:
- Slot offset out of bounds -> corruption flag.
- Overlapping payload regions -> allocator/compaction bug.
- Header/version mismatch -> incompatible file or torn write.
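The invariant checks above can be expressed as a small C verifier. This is a sketch under assumed structures: the Page and Slot layouts, the capacity constants, and the tombstone marker are all illustrative, not a real engine's format. It enforces invariants (1) and (2) from the Deep Dive: every live slot points inside the payload region, and no two live records overlap.

```c
#include <stdint.h>

/* Hypothetical in-memory view of a slotted page; offsets are byte
   positions relative to the start of the page. */
#define PAGE_SIZE 4096
#define HDR_SIZE  24
#define SLOT_DEAD 0xFFFF   /* tombstone marker (assumed convention) */

typedef struct { uint16_t off; uint16_t len; } Slot;

typedef struct {
    uint16_t nslots;
    uint16_t free_lo;   /* end of slot array (grows up from the front) */
    uint16_t free_hi;   /* start of payload region (grows down from the back) */
    Slot     slots[256];
} Page;

/* Return 0 if all slot invariants hold, -1 for a bad free-space window,
   or (first bad slot index + 1) so the caller can report the offender. */
int page_check(const Page *pg) {
    if (pg->free_lo > pg->free_hi || pg->free_hi > PAGE_SIZE) return -1;
    for (int i = 0; i < pg->nslots; i++) {
        const Slot *s = &pg->slots[i];
        if (s->off == SLOT_DEAD) continue;               /* tombstone: skip */
        /* (1) payload must sit between free space end and page end */
        if (s->off < pg->free_hi || s->off + s->len > PAGE_SIZE) return i + 1;
        /* (2) no overlap with any other live record */
        for (int j = 0; j < i; j++) {
            const Slot *t = &pg->slots[j];
            if (t->off == SLOT_DEAD) continue;
            if (s->off < t->off + t->len && t->off < s->off + s->len)
                return i + 1;
        }
    }
    return 0;
}
```

Running this check on every page read during development is cheap insurance: it turns silent corruption into a loud, localized failure.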
Minimal concrete example
Pseudo-layout (little-endian chosen by your format contract):
header.magic = "TDB1"
header.version = 2
header.page_size = 4096
slot[17] = {offset: 0x0F20, length: 31, flags: LIVE}
record@0x0F20 = [key_len=5][val_len=14][key="email"][value="a@b.example"]
Common misconceptions
- “Rows are stored directly one after another forever.” Correction: updates/deletes require indirection and free-space control.
- “A struct dump is a stable file format.” Correction: padding and endianness make this unsafe.
- “Compaction is always good.” Correction: compaction costs CPU/I/O and can increase write amplification.
Check-your-understanding questions
- Why does slot indirection reduce logical record instability?
- What corruption class does checksum catch vs not catch?
- Why is format versioning needed even in personal projects?
Check-your-understanding answers
- Because (page, slot) remains stable while payload bytes can move.
- It catches many accidental/torn mutations, but not all logical semantic errors.
- Any field-layout change can break older files without version gates.
Real-world applications
- SQLite table/index pages
- PostgreSQL heap and index pages
- InnoDB index page structures
Where you’ll apply it
Project 1, Project 3, Project 4, Project 7
References
- SQLite file format: https://www.sqlite.org/fileformat.html
- PostgreSQL page layout: https://www.postgresql.org/docs/current/storage-page-layout.html
- MySQL InnoDB physical index structure: https://dev.mysql.com/doc/refman/8.4/en/innodb-physical-structure.html
Key insights
A database file format is a long-lived external contract, not an internal implementation detail.
Summary
Page format quality determines whether every later subsystem (indexing, caching, recovery) remains tractable or becomes fragile.
Homework/Exercises to practice the concept
- Design a 4 KiB page header with forward-compatible reserved fields.
- Simulate 10,000 random inserts/deletes and track fill factor drift.
- Write an offline checker spec that validates all page invariants.
Solutions to the homework/exercises
- Include magic/version/page_id/LSN/checksum/free-space offsets/reserved bytes.
- Expect fragmentation to rise; compaction threshold can restore density.
- Validation order: header -> slot bounds -> offset overlap -> checksum -> semantic field checks.
Concept 2: B+Tree Indexes and Access Paths
Fundamentals
B+ trees are balanced, high fan-out search trees optimized for block storage. Internal nodes route lookups; leaf nodes hold sorted keys and payload references. High fan-out keeps tree height small, so lookup cost is logarithmic with very low constant factors in terms of page reads. This is why they dominate row-store indexes. Unlike binary trees, B+ trees are page-aware: one node is typically one page, and node occupancy policy directly controls performance. Splits and merges preserve order and balance while maintaining invariants such as sorted key ranges and consistent child separators. In practical database engines, B+ trees are tightly coupled with page layout, buffer pools, and concurrency semantics.
Deep Dive
A B+ tree’s strength comes from matching algorithmic shape to storage economics. Random I/O is expensive relative to CPU arithmetic. So instead of pointer-heavy in-memory trees, a B+ tree packs many keys per page. If an internal page can hold hundreds of key separators, height stays tiny even for millions of records. For example, fan-out 200 with height 4 already addresses billions of key intervals. That means a lookup can often complete in a handful of page accesses, many of which may already be cached.
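The fan-out arithmetic is worth making concrete. The tiny helper below (illustrative, not part of any engine API) computes the minimum number of levels needed to address a keyspace, treating fan-out as fully packed; real trees run below 100% occupancy, so add roughly one level of slack when estimating.

```c
#include <stdint.h>

/* Smallest height h such that fanout^h >= nkeys. This is an idealized
   routing view: it ignores occupancy slack and leaf payload width, so a
   real tree may be one level deeper. */
int btree_height(uint64_t nkeys, uint64_t fanout) {
    int h = 0;
    uint64_t reach = 1;   /* key intervals addressable at height h */
    while (reach < nkeys) {
        reach *= fanout;
        h++;
    }
    return h;
}
```

For example, fan-out 200 reaches 1.6 billion intervals at height 4, while fan-out 64 needs five levels for 100 million keys and fan-out 256 needs only four: each level saved is one potential page read saved per lookup.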
Node format design is critical. Internal nodes need key separator arrays and child page IDs; leaf nodes need key-value references (or row IDs) plus sibling pointers for range scans. Prefix compression or fence keys may be used to improve occupancy, but early educational implementations should prioritize clarity: fixed-width keys first, then variable-length support. You should still model split policy explicitly (median split vs biased split for append-heavy workloads) because it shapes write amplification and cache behavior.
Insertion path: descend to leaf, insert key in order, split if overflow, propagate separator to parent, possibly cascading to root. Root split creates a new root and increases height by one. Deletion path is harder: remove key, detect underflow, then borrow from sibling or merge and update parent separators. Many educational implementations delay full deletion complexity because merge correctness is easy to get wrong. If you do implement deletion, keep invariant checks aggressive and run randomized differential tests.
Range scans are a B+ tree’s practical edge. Because leaves are linked, scanning a key interval avoids repeatedly returning to upper nodes. This makes ordered retrieval and prefix predicates efficient. It also enables index-only plans when leaf payload includes enough columns. In storage engines, this property is central to query performance for ORDER BY, BETWEEN, and incremental pagination patterns.
Failure modes in B+ trees usually come from separator/key-boundary mistakes, not the high-level algorithm itself. A single off-by-one in split propagation can cause missing or duplicated key visibility. Another common bug is parent separator stale after child mutation, which manifests as intermittent lookup misses. To defend against this, implement an offline verifier: in-order traversal must return globally sorted keys; each parent key must equal or bound child ranges; all leaves must be at equal depth; sibling links must be bidirectionally consistent.
Concurrency adds another layer. Production engines use latch coupling or optimistic lock-crabbing to avoid structural races. For learning, start single-threaded for correctness, then add coarse-grain latches at node level. The lesson is that structural modifications (splits/merges) are not just data updates; they are topology updates. WAL integration must capture enough info for redo/undo of both record mutations and structure changes.
Finally, B+ trees are not universally best. Hash indexes can beat them for point reads under stable conditions, and LSM trees can outperform on write-heavy ingestion. But for mixed read/write with order-sensitive queries, B+ remains the baseline architecture in major engines.
How this fits on projects
- Core of Project 3
- Used by Project 6 and integrated in Project 7
Definitions & key terms
- Fan-out: Number of child pointers in an internal node.
- Separator key: Parent key dividing child key ranges.
- Leaf chain: Linked list across leaf pages for range scans.
- Split propagation: Parent updates caused by child overflow.
- Underflow: Node occupancy below minimum threshold.
Mental model diagram
[ 40 | 80 ]
/ | \
[10|20|30] [50|60|70] [90|95]
| | |
Leaf A <-----> Leaf B <-----> Leaf C
Lookup(65): root -> middle child -> Leaf B
Range(55..92): start Leaf B, then follow leaf sibling links
How it works (step-by-step, with invariants and failure modes)
- Compare key with separators from root to leaf.
- Insert in leaf maintaining sorted order.
- If overflow, split node and promote separator.
- Repeat upward until no overflow or new root created.
Invariants:
- Keys sorted inside every node.
- All leaves at same depth.
- Parent separators correctly partition child ranges.
Failure modes:
- Separator stale after split -> missed lookups.
- Incorrect sibling links -> broken range scan.
- Partial structural write without WAL safety -> corruption after crash.
Minimal concrete example
Pseudo-event sequence for inserting key=58:
before: parent keys [40|80], middle leaf keys [50,55,60,70]
insert -> [50,55,58,60,70] (overflow)
split -> left [50,55], right [58,60,70]
promote separator 58 into parent
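The same event sequence can be written as a leaf-level C sketch. The Leaf structure, the tiny capacity, and fixed-width int keys are assumptions for readability; a real leaf would hold variable-length keys inside a page. It shows the B+ tree detail that trips people up: the promoted separator is the first key of the new right leaf and stays in the leaf, unlike a classic B-tree where it moves up.

```c
#include <string.h>

#define LEAF_CAP 4   /* tiny capacity so the example overflows quickly */

typedef struct {
    int nkeys;
    int keys[LEAF_CAP + 1];   /* one spare slot: insert first, then split */
} Leaf;

/* Insert in sorted order; return 1 if the leaf now overflows. */
int leaf_insert(Leaf *l, int key) {
    int i = l->nkeys;
    while (i > 0 && l->keys[i - 1] > key) {
        l->keys[i] = l->keys[i - 1];
        i--;
    }
    l->keys[i] = key;
    l->nkeys++;
    return l->nkeys > LEAF_CAP;
}

/* Median split: left keeps the low half, right takes the rest.
   The separator returned to the caller must be installed in the parent;
   forgetting that propagation is the classic "missing keys" bug. */
int leaf_split(Leaf *l, Leaf *right) {
    int mid = l->nkeys / 2;
    right->nkeys = l->nkeys - mid;
    memcpy(right->keys, l->keys + mid, (size_t)right->nkeys * sizeof(int));
    l->nkeys = mid;
    return right->keys[0];   /* separator to promote into the parent */
}
```

Replaying the transcript: inserting 58 into [50,55,60,70] overflows, the split leaves [50,55] and [58,60,70], and 58 is promoted.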
Common misconceptions
- “B-tree and B+ tree are practically identical.” Correction: leaf-only payload and linked leaves materially affect scan behavior.
- “O(log n) is enough to guarantee speed.” Correction: fan-out, cache hit rate, and page locality dominate constants.
- “Deletion is symmetric with insertion.” Correction: underflow/merge handling is often trickier.
Check-your-understanding questions
- Why does high fan-out lower disk I/O depth?
- Why are leaf links essential for range scans?
- What verifier checks would catch separator corruption?
Check-your-understanding answers
- More keys per node means fewer levels for same keyspace.
- They avoid repeated parent traversal for sequential ranges.
- Parent-child range consistency and full in-order traversal monotonicity.
Real-world applications
- SQLite B-tree tables/indexes
- InnoDB clustered and secondary indexes
- PostgreSQL B-tree access method
Where you’ll apply it
Project 3, Project 6, Project 7
References
- SQLite architecture: https://www.sqlite.org/arch.html
- InnoDB index structure: https://dev.mysql.com/doc/refman/8.4/en/innodb-physical-structure.html
- PostgreSQL storage internals: https://www.postgresql.org/docs/current/storage-page-layout.html
Key insights
B+ trees win because they are shaped for page I/O economics, not abstract pointer operations.
Summary
Correct separator logic and verification tooling matter more than clever split heuristics in early implementations.
Homework/Exercises to practice the concept
- Design an invariant checker for a persisted B+ tree file.
- Compare estimated depth for fan-out 64 vs 256 at 100M keys.
- Describe how split policy changes write amplification for append-heavy keys.
Solutions to the homework/exercises
- Validate depth uniformity, sortedness, parent bounds, leaf-link consistency.
- Height drops materially as fan-out rises; fewer I/O hops per lookup.
- Right-biased splits can reduce repeated rewrites for append-heavy patterns.
Concept 3: Buffer Pool, Caching, and I/O Control
Fundamentals
The buffer pool is the database’s controlled cache of pages in RAM. It mediates access between query operators and disk files while tracking dirty state, pin counts, and eviction eligibility. Without a buffer pool, every logical operation would hit disk directly, causing severe performance collapse. With one, hot pages stay in memory, updates are grouped, and I/O is shaped into more efficient patterns. Correct buffer-pool behavior depends on strict contracts: pin before use, unpin after use, never evict pinned frames, and write dirty pages safely. Buffer management is where many first engines become either fast-but-wrong or safe-but-slow.
Deep Dive
A buffer pool is often misunderstood as “just an LRU cache.” In reality, it is a concurrency-aware, durability-aware control plane for pages. Each frame tracks at least: page_id, dirty flag, pin count, reference metadata, and page LSN/checksum fields. Query and storage components request pages by page ID. On hit, frame is returned with pin incremented. On miss, a victim frame is selected, flushed if dirty, then reused for the requested page.
The pin/unpin contract is foundational. A pinned page indicates active use; eviction of pinned pages is forbidden. Most mysterious corruption in student engines comes from violating this indirectly: stale pointers after frame reuse, forgotten unpin paths on error, or double-unpin under retries. Build explicit guards and panic-mode assertions around pin count transitions.
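Those guards can be as small as this sketch. The Frame struct and function names are illustrative, but the discipline is the point: pin transitions are asserted loudly, a double-unpin aborts immediately instead of corrupting state later, and eviction eligibility is a single explicit predicate.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef struct {
    uint32_t page_id;
    int      pin_count;
    bool     dirty;
} Frame;

/* Pin before use: the caller now holds a reference that blocks eviction. */
void frame_pin(Frame *f) {
    assert(f->pin_count >= 0);
    f->pin_count++;
}

/* Unpin after use; a double-unpin is a caller bug, so fail loudly here
   rather than corrupt a reused frame much later. */
void frame_unpin(Frame *f, bool made_dirty) {
    assert(f->pin_count > 0 && "unpin without matching pin");
    f->pin_count--;
    if (made_dirty) f->dirty = true;
}

/* Eviction eligibility: only unpinned frames may ever be reused. */
bool frame_evictable(const Frame *f) {
    return f->pin_count == 0;
}
```

In debug builds these asserts stay on permanently; the bugs they catch (forgotten unpin on an error path, unpin-after-free) are far cheaper to find at the transition than at the symptom.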
Eviction policy is a design lever. Pure LRU can be scan-unfriendly because large sequential reads evict hot OLTP pages. CLOCK and variants reduce overhead and resist scan pollution. Production systems often use multi-queue or adaptive policies. For this guide, implement CLOCK or 2Q-like behavior after baseline LRU so you can observe the difference under mixed workload traces.
Dirty-page handling must align with WAL protocol. If WAL exists, you cannot flush a dirty page whose page-LSN exceeds durable WAL-LSN; doing so breaks recovery assumptions. This WAL-before-data constraint makes buffer flushing a correctness problem, not merely a performance one. Background flushers help smooth latency by writing dirty pages proactively instead of during foreground evictions.
Another deep issue is double caching: OS page cache plus DB buffer pool. Embedded engines sometimes rely heavily on OS cache; server engines prefer tighter control and may use direct I/O in some contexts. Even without direct I/O, you should reason about read amplification: a logical page miss can cause kernel-level reads, user-space copies, and metadata overhead. Instrumenting hit ratio alone is insufficient; track bytes read/written, flush batch sizes, and stall causes.
Prefetching and locality hints matter. If range scan is detected, prefetch next leaf pages can reduce misses. But aggressive prefetch can evict useful pages. As with all buffering choices, the right behavior depends on workload shape. This is why benchmark scenarios must include at least: random point lookups, sequential scans, and mixed read/write OLTP.
Failure modes include dirty-loss (dirty page evicted without flush), latch contention hotspots (single global mutex around all frames), and livelock during victim search when most pages are pinned. Detect these with metrics: eviction retries, average victim-search steps, flush queue depth, and pin lifetime histograms.
Finally, remember that buffer pools are testable in isolation. Build deterministic traces (sequence of page requests with pins/unpins) and assert final frame state. Isolated buffer tests give you confidence before integration with B+ tree or WAL.
How this fits on projects
- Core of Project 4
- Critical dependency in Project 5, Project 6, and Project 7
Definitions & key terms
- Frame: In-memory slot holding one page.
- Pin count: Number of active users holding a page.
- Dirty page: Page modified in memory but not persisted.
- Victim selection: Choosing an eviction candidate.
- Writeback: Flushing dirty pages to durable storage.
Mental model diagram
Disk file pages Buffer Pool Frames
┌───────────────┐ ┌──────────────────────────────┐
│ Page 41 │<--- miss/load ---│ Frame 0: page=41 pin=2 dirty │
│ Page 42 │<--- miss/load ---│ Frame 1: page=42 pin=0 clean │
│ Page 43 │<--- miss/load ---│ Frame 2: page=43 pin=1 dirty │
└───────────────┘ │ Victim policy: CLOCK hand -> │
└──────────────────────────────┘
Rule: never evict pinned frame; dirty flush must honor WAL ordering
How it works (step-by-step, with invariants and failure modes)
- fetch(page_id) checks hash map for frame hit.
- On miss, pick victim frame with pin_count == 0.
- If victim dirty, flush safely (WAL constraint if enabled).
- Read requested page into frame, pin it, return handle.
- Caller unpins after use; dirty mark set on modification.
Invariants:
- Pin count never negative.
- Evicted frame always unpinned.
- Dirty page flush success required before frame reuse.
Failure modes:
- Forgotten unpin -> throughput collapse/livelock.
- Flush without WAL safety -> unrecoverable state.
- Global lock bottleneck -> poor multicore scaling.
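The victim-selection step can be sketched as a CLOCK sweep. The Pool layout and frame count are assumptions for the example; what matters is the three-way decision at each tick: pinned frames are never eviction candidates, referenced frames get a second chance, and a bounded sweep turns the all-pinned livelock into a detectable error instead of an infinite loop.

```c
#include <stdbool.h>
#include <stdint.h>

#define NFRAMES 4

typedef struct {
    uint32_t page_id;
    int      pin_count;
    bool     dirty;
    bool     ref;       /* CLOCK reference bit, set on each access */
} Frame;

typedef struct {
    Frame frames[NFRAMES];
    int   hand;         /* CLOCK hand position */
} Pool;

/* Sweep the hand: skip pinned frames, clear-and-pass referenced frames,
   stop at an unpinned, unreferenced frame. Two full sweeps without a
   victim means every frame is pinned (a livelock condition to surface). */
int clock_victim(Pool *p) {
    for (int step = 0; step < 2 * NFRAMES; step++) {
        Frame *f = &p->frames[p->hand];
        int idx = p->hand;
        p->hand = (p->hand + 1) % NFRAMES;
        if (f->pin_count > 0) continue;            /* in use: never evict */
        if (f->ref) { f->ref = false; continue; }  /* second chance */
        return idx;  /* caller must flush first if f->dirty (WAL gate) */
    }
    return -1;       /* all frames pinned: report, don't spin forever */
}
```

Note the function only chooses a victim; the dirty flush (with its WAL-ordering precondition) remains the caller's responsibility, which keeps policy and durability concerns separable and testable.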
Minimal concrete example
Trace:
fetch(10) miss -> frame2 pin=1
fetch(11) miss -> frame3 pin=1
unpin(10, dirty=true)
fetch(12) miss -> CLOCK selects frame2 (pin=0, dirty) -> flush page 10 first, then reuse frame2
Common misconceptions
- “High hit ratio means good performance.” Correction: tail latency can still be bad due to flush stalls.
- “Eviction policy is purely optimization.” Correction: policy influences correctness pressure and lock behavior.
- “Pinned pages can be force-evicted if needed.” Correction: that breaks active readers/writers.
Check-your-understanding questions
- Why is pin discipline a correctness requirement?
- How can WAL rules constrain flush timing?
- Why can sequential scans punish naive LRU?
Check-your-understanding answers
- Frame reuse while in use causes use-after-free-like corruption.
- Data page with higher page-LSN cannot be flushed before durable WAL.
- Scan pages displace hot pages, creating avoidable miss storms.
Real-world applications
- PostgreSQL shared buffers
- InnoDB buffer pool
- SQLite pager + page cache
Where you’ll apply it
Project 4, Project 5, Project 6, Project 7
References
- SQLite architecture (pager/page cache): https://www.sqlite.org/arch.html
- PostgreSQL WAL details: https://www.postgresql.org/docs/current/wal-intro.html
Key insights
Buffer management is the runtime governor of both performance and correctness.
Summary
If page pins, flush ordering, and eviction invariants are weak, every higher-layer subsystem inherits instability.
Homework/Exercises to practice the concept
- Design metrics dashboard fields for buffer health.
- Write a deterministic request trace that breaks naive LRU.
- Specify WAL-aware flush preconditions.
Solutions to the homework/exercises
- Include hit ratio, dirty %, avg pin lifetime, eviction retries, flush latency.
- Alternate long scans with hot random set to show scan pollution.
- Only flush when durable_wal_lsn >= page_lsn.
Concept 4: WAL, Checkpointing, and Crash Recovery
Fundamentals
Write-Ahead Logging (WAL) is the core durability pattern in modern databases. The central rule is simple and strict: log records describing page modifications must be persisted before corresponding data pages are written. With this rule, engines can recover from crashes by replaying log records. Checkpoints shorten restart time by recording progress and reducing the portion of log that must be scanned. Recovery algorithms typically run analysis/redo/undo phases (or a reduced subset for simpler engines). WAL is not only about durability; it also unlocks performance by allowing data pages to flush lazily instead of forcing all dirty data on each commit.
Deep Dive
WAL exists because forcing every modified data page at commit is too expensive and unpredictable. A single transaction might touch many pages across the file; forcing all of them introduces heavy random writes and high latency variance. WAL flips the contract: commit durability is tied to sequential log persistence, not full data-page flush. Since logs are append-oriented, storage devices handle them efficiently.
A robust WAL design includes: log sequence numbers (LSNs), record types (BEGIN/UPDATE/COMMIT/ABORT/CHECKPOINT), transaction identifiers, and enough payload to redo (and optionally undo) actions. Each updated page stores page-LSN so recovery can decide whether a log record is already reflected on disk. This page-LSN bridge is essential for idempotent replay.
Recovery generally starts from last checkpoint. Analysis reconstructs in-flight transactions and dirty page state. Redo re-applies necessary committed updates (or all history with page-LSN checks, depending on model). Undo rolls back loser transactions if undo logging exists. Full ARIES-style recovery is advanced; for this sprint, a constrained redo/undo model is enough if invariants are explicit and test harnesses are strong.
Checkpoints are performance tradeoffs. Frequent checkpoints reduce restart time but increase runtime write pressure. Infrequent checkpoints reduce overhead but elongate recovery windows. Production systems tune this against RTO objectives and storage limits. In your learning engine, expose checkpoint interval as a config and benchmark recovery time vs steady-state throughput.
fsync semantics and ordering are non-negotiable. A common student bug is writing log records with buffered stdio and assuming durability. Durability requires explicit flush to stable storage and careful handling of partial writes. Group commit is a practical optimization: multiple transactions share one log flush, amortizing sync overhead. PostgreSQL documentation explicitly highlights this benefit.
Failure taxonomy matters: process crash, OS crash, power loss, media failure. WAL primarily addresses the first three for a healthy storage device; media failure requires backups/replication beyond this guide. Your tests should include kill -9 at randomized points, simulated torn log tails, and restart idempotence checks (running recovery twice should not diverge state).
Another subtlety is interaction with buffer flush. If a page dirty from LSN 120 is flushed while durable log only reaches LSN 100, recovery cannot reconstruct the exact state safely. This violates WAL. So buffer managers must gate flush by durable LSN watermark. This cross-subsystem dependency is where many implementations fail.
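That gate is small enough to write down directly. The names below (durable_wal_lsn, page_flush_allowed) are illustrative; the invariant they encode is the one from the example: a dirty page whose page-LSN is ahead of the durable log watermark must not reach disk yet.

```c
#include <stdbool.h>
#include <stdint.h>

/* Durable log watermark, advanced only after a successful WAL fsync.
   A single global is a toy simplification; a real engine shares this
   watermark between the log writer and the buffer manager. */
static uint64_t durable_wal_lsn = 0;

void wal_mark_durable(uint64_t lsn) {
    if (lsn > durable_wal_lsn) durable_wal_lsn = lsn;
}

/* WAL-before-data gate: the buffer manager must call this before any
   dirty-page writeback, whether foreground eviction or background flush. */
bool page_flush_allowed(uint64_t page_lsn) {
    return durable_wal_lsn >= page_lsn;
}
```

When the gate refuses, the flusher has two options: pick a different victim, or force the log forward first; either way, the cross-subsystem dependency is now an explicit check rather than an accident of timing.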
Finally, WAL format evolution should be versioned and self-describing enough for offline tools. During outages, operators rely on log inspection to answer: what committed, what was in flight, what was replayed. Build that observability early.
How this fits on projects
- Central in Project 5
- Dependency for correctness in Project 6 and Project 7
Definitions & key terms
- LSN: Monotonic identifier for log positions.
- Redo: Reapply updates that may not be on data pages yet.
- Undo: Roll back uncommitted updates.
- Checkpoint: Durable marker reducing recovery work.
- Group commit: Batching commits behind one log sync.
Mental model diagram
TXN updates page P
│
├──> append WAL record (LSN=9001)
│
├──> fsync WAL up to 9001 [COMMIT durable]
│
└──> data page flush can happen later
Crash -> restart -> replay WAL records with LSN > page_lsn
How it works (step-by-step, with invariants and failure modes)
- On logical update, emit WAL record with txn_id and redo payload.
- Assign new LSN and set page-LSN in dirty buffer copy.
- On commit, force WAL flush through commit LSN.
- Data pages flush asynchronously when allowed.
- On restart, scan from checkpoint and redo/undo as required.
Invariants:
- Durable WAL precedes durable data for same LSN.
- Recovery replay is idempotent using page-LSN checks.
- Committed state persists across crash tests.
Failure modes:
- Torn log record tail not detected -> invalid replay.
- Data flush before WAL flush -> durability violation.
- Missing transaction end markers -> ambiguous recovery state.
Minimal concrete example
WAL transcript:
[LSN 410] BEGIN txn=7
[LSN 418] UPDATE txn=7 page=12 slot=4 old=... new=...
[LSN 430] COMMIT txn=7
fsync(log, <=430)
# crash before page 12 flush
restart -> REDO LSN 418 onto page 12 (if page_lsn < 418)
Common misconceptions
- “If commit returns, data pages must already be written.” Correction: WAL allows commit before data-page flush.
- “Recovery replay is one-pass always.” Correction: depends on algorithm and metadata.
- “Checkpoint means no recovery needed.” Correction: it only narrows required log window.
Check-your-understanding questions
- Why is page-LSN required for idempotent redo?
- How does group commit reduce overhead?
- Which crash classes are not solved by WAL alone?
Check-your-understanding answers
- It tells whether update is already reflected on page.
- Many transactions amortize one sync call.
- Media loss/corruption without backup/replica strategy.
Real-world applications
- PostgreSQL WAL and point-in-time recovery
- SQLite rollback journal and WAL modes
- InnoDB redo logging
Where you’ll apply it
Project 5, Project 6, Project 7
References
- PostgreSQL WAL intro: https://www.postgresql.org/docs/current/wal-intro.html
- SQLite file format + WAL sections: https://www.sqlite.org/fileformat.html
- IBM ARIES publication page: https://research.ibm.com/publications/aries-a-transaction-recovery-method-supporting-fine-granularity-locking-and-partial-rollbacks-using-write-ahead-logging
Key insights WAL is a write-ordering contract that converts crashes from catastrophe into replay work.
Summary Recovery correctness depends on strict LSN discipline, not just “having a log file.”
Homework/Exercises to practice the concept
- Define a WAL record schema for update + commit.
- Design crash tests that kill process at three random points per transaction.
- Explain how checkpoint interval affects RTO and steady-state throughput.
Solutions to the homework/exercises
- Include txn_id, page_id, slot_id, before/after image or logical delta, checksum.
- Kill before commit flush, after commit flush, and during data flush.
- Short interval improves restart time but increases background write pressure.
Concept 5: Concurrency Control and Isolation Semantics
Fundamentals
Concurrency control prevents correctness anomalies when multiple transactions interleave on shared data. Isolation levels define which anomalies are tolerated: dirty reads, non-repeatable reads, phantoms, and serialization anomalies. Engines use lock-based methods, MVCC, or hybrids. Locking provides explicit exclusion; MVCC provides snapshot reads with version chains and conflict checks. Practical systems combine both to balance throughput and correctness. Understanding isolation is essential because many bugs are not crashes or wrong formulas; they are legal interleavings that violate business invariants.
Deep Dive
The hardest part of database correctness is not single-thread logic; it is behavior under interleaving. Transaction isolation is the lens that makes those behaviors explicit. SQL standard levels define minimal guarantees, but implementations differ. PostgreSQL, for example, maps these levels onto MVCC semantics and documents anomaly behavior clearly.
Two broad families dominate. Pessimistic locking acquires locks before conflicting actions, often with two-phase locking (2PL): growing phase acquires locks, shrinking phase releases. Strict 2PL delays release until commit/abort, simplifying recovery and serializability reasoning. Cost: blocking and deadlocks. Engines need deadlock detection or timeout policies.
MVCC avoids many read-write conflicts by storing row versions with visibility rules (created_xid/deleted_xid or equivalent metadata). Readers choose versions visible to their snapshot, while writers create new versions. This improves read concurrency but introduces garbage-collection/vacuum complexity and conflict detection rules at commit for stronger isolation levels.
Anomaly literacy is non-negotiable. Lost update, write skew, and read skew are common production failure modes. Developers often think “transaction” implies serial behavior, but default isolation levels usually do not provide full serializability. Your learning lab should include explicit anomaly scripts where two sessions execute deterministic interleavings and show outcomes at Read Committed vs Serializable.
Lock granularity and intent also matter. Row locks maximize concurrency but increase metadata overhead. Page/table locks are cheaper to manage but reduce parallelism. Index-range locking or predicate locking is needed to prevent phantoms in lock-based serializable schemes. MVCC engines often implement Serializable Snapshot Isolation with dependency tracking rather than pure predicate locks.
Recovery coupling: concurrency metadata and WAL must agree. If a transaction aborts after partial updates, undo visibility or rollback operations must restore logical consistency. In MVCC, abort may mark new versions invalid; in lock-based systems, undo may physically restore prior images. Either way, the transaction manager and recovery manager are not separable concerns.
Testing concurrency requires harnesses that control timing. Random stress alone rarely reproduces precise anomalies. Use barrier-based test scripts: session A step1, session B step1, session A step2, etc., with expected outcomes. Include retry logic tests for serializable failures, because production clients must handle retriable aborts.
Finally, isolation is a product decision, not just technical purity. Stronger isolation increases safety but may reduce throughput. Mature systems expose level selection and require workload-aware choices. The right answer depends on invariants you must protect.
How this fits on projects
- Core of Project 7
- Also influences update semantics in Project 5 and Project 6
Definitions & key terms
- Dirty read: Reading uncommitted data from another transaction.
- Non-repeatable read: Re-reading same row returns different value.
- Phantom: Re-running predicate query returns different row set.
- MVCC: Multi-version concurrency control.
- Serializable: Equivalent to some serial transaction order.
Mental model diagram
Time --->
Txn A: BEGIN ---- read(balance=100) -------- write(balance=90) ---- COMMIT
Txn B: BEGIN -------- read(balance=100) ---- write(balance=80) ---- COMMIT
Without conflict control: lost update possible
With proper control: one txn blocks/aborts/retries
How it works (step-by-step, with invariants and failure modes)
- Transaction starts with chosen isolation level.
- Reads/writes use lock or version visibility rules.
- Conflicts trigger blocking, abort, or retry.
- Commit publishes state according to isolation protocol.
- Abort restores or invalidates uncommitted effects.
Invariants:
- No dirty reads beyond allowed level.
- Conflict handling deterministic per policy.
- Commit order and visibility rules consistent with documented level.
Failure modes:
- Missing conflict check -> lost update.
- Deadlock without detection -> indefinite stalls.
- Retry-unaware clients -> user-visible failures.
Minimal concrete example
Interleaving script (read-committed):
A: read x=5
B: read x=5
A: write x=6, commit
B: write x=6, commit
Result: one increment lost
Serializable policy should force one transaction retry.
Common misconceptions
- “Read committed means always consistent business state.” Correction: complex invariants can still break.
- “MVCC means no locking.” Correction: write-write conflicts and metadata locks remain.
- “Serializable is always too slow.” Correction: overhead depends on workload and conflict shape.
Check-your-understanding questions
- Why can write skew happen under snapshot-style isolation?
- Why are retries part of serializable application logic?
- What extra complexity does MVCC introduce vs lock-only engines?
Check-your-understanding answers
- Concurrent txns can each observe valid snapshot and commit incompatible writes.
- Serializable engines may abort conflicting transactions to preserve serial order.
- Version visibility rules and garbage collection/vacuum lifecycle.
Real-world applications
- PostgreSQL transaction isolation and MVCC
- InnoDB lock + MVCC hybrid behavior
Where you’ll apply it
Project 7 (primary), Project 5, Project 6
References
- PostgreSQL transaction isolation docs: https://www.postgresql.org/docs/current/transaction-iso.html
Key insights Isolation level is a correctness budget; spend it consciously.
Summary Concurrency design is successful only when anomalies are explicitly tested and understood, not assumed away.
Homework/Exercises to practice the concept
- Design two-session scripts showing non-repeatable read and phantom.
- Specify retry policy for serializable aborts.
- Compare lock-based vs MVCC behavior on read-heavy workload.
Solutions to the homework/exercises
- Use ordered barriers to force deterministic interleavings.
- Retry whole transaction with capped exponential backoff.
- MVCC usually improves read concurrency at cost of version maintenance.
Concept 6: SQL Compilation and Volcano-Style Execution
Fundamentals
A SQL engine is a compiler plus runtime. Text query input is tokenized, parsed into an AST, transformed into logical operators, lowered to physical operators, and executed. The classic Volcano iterator model (open, next, close) composes operators into pipelines that pull tuples on demand. Even minimalist engines need this mental split: parsing is syntax, planning is algorithm selection, execution is dataflow over access methods. Query internals become understandable once you model operators as explicit state machines.
Deep Dive
SQL appears declarative, but execution is deeply procedural. The parser first checks grammar and builds structured representation (AST). At this stage, no data has been read. Then planning translates AST nodes into relational algebra primitives: scan, filter, project, join, aggregate, sort. Physical planning chooses concrete algorithms (index scan vs table scan, hash join vs nested-loop join) based on available indexes and crude cost estimates.
The iterator (Volcano) model remains pedagogically powerful because of simplicity and composability. Each operator exposes a next() call returning either one tuple or end-of-stream. Parent operators request tuples from children. A filter operator repeatedly pulls from child until predicate passes. A join operator coordinates two children according to chosen algorithm. This pull model makes execution order, state retention, and backpressure explicit.
For C implementations, memory ownership across operators is a recurring hazard. Decide whether tuples are materialized per operator, borrowed via arena, or referenced via page pointers. Ambiguous ownership causes leaks or use-after-free. A practical educational compromise: immutable tuple slices with operator-local scratch buffers and clear lifetime boundaries at pipeline stage transitions.
Planning can start rule-based. For example: if predicate is equality on indexed column, choose index scan; otherwise full scan. Add cost hints later (estimated rows, selectivity, I/O cost). Even simplistic costing teaches critical lessons: bad cardinality assumptions can produce catastrophic plan choices.
Projection and predicate pushdown are low-complexity, high-impact optimizations. Push filters closer to scans to reduce tuple volume early. Avoid materializing full rows when only selected columns are needed. These two rules provide large gains without complex optimizer machinery.
Failure modes include parser ambiguity bugs, incorrect null semantics, and mismatch between EXPLAIN plan output and actual operator execution. Your project must verify plan/execution consistency by logging operator events and comparing against expected sequence.
The SQL subsystem also closes the loop with storage internals. A plan node is only as good as access paths provided by B+ tree and buffer pool. If index lookup returns unstable row IDs or stale visibility info, query correctness collapses. This is why SQL should be introduced after storage/recovery foundations.
Finally, keep scope strict. A high-quality subset (SELECT, predicate filters, simple joins, inserts/updates/deletes) with reliable execution is far more educational than broad but unreliable syntax coverage.
How this fits on projects
- Core in Project 6
- Integrated end-to-end in Project 7
Definitions & key terms
- AST: Abstract syntax tree for parsed SQL.
- Logical plan: Engine-agnostic relational operator graph.
- Physical plan: Concrete algorithms + access paths.
- Iterator model: Pull-based operator interface.
- Predicate pushdown: Apply filters as early as possible.
Mental model diagram
SQL Text
-> Lexer/Parser -> AST
-> Planner -> Logical Plan
-> Optimizer -> Physical Plan
-> Executor -> Operator Pipeline
[Scan] -> [Filter] -> [Project] -> [Output]
How it works (step-by-step, with invariants and failure modes)
- Tokenize SQL into lexemes.
- Parse tokens into AST with syntax validation.
- Build logical operators from AST structure.
- Choose physical operators/access paths.
- Execute via iterator calls until end-of-stream.
Invariants:
- Parsed query and execution plan preserve query semantics.
- Operator contracts (next) are deterministic and composable.
- EXPLAIN describes the actual runtime operator order.
Failure modes:
- AST misinterpretation -> wrong results.
- Join predicate misuse -> cartesian explosion.
- Incorrect NULL handling -> subtle semantic bugs.
Minimal concrete example
Input SQL:
SELECT name FROM users WHERE age > 30
Physical plan (pseudo):
Project(name)
Filter(age > 30)
TableScan(users)
Common misconceptions
- “Parser is the hard part of SQL engines.” Correction: planning + runtime correctness are usually harder.
- “EXPLAIN is optional.” Correction: it is core observability for query behavior.
- “Small SQL subset means trivial engine.” Correction: correctness across storage + concurrency is still complex.
Check-your-understanding questions
- Why separate logical and physical planning?
- Why does predicate pushdown often provide large gains?
- What should EXPLAIN guarantee?
Check-your-understanding answers
- It separates semantics from algorithm choices.
- It reduces intermediate tuple volume early in pipeline.
- Faithful mapping to actual executed operator pipeline.
Real-world applications
- SQLite parse/codegen/VDBE pipeline
- PostgreSQL planner/executor architecture
Where you’ll apply it
Project 6 and Project 7
References
- SQLite architecture: https://www.sqlite.org/arch.html
Key insights SQL internals become tractable when treated as compiler + runtime, not a monolith.
Summary Reliable query execution depends more on operator contracts and observability than on syntax breadth.
Homework/Exercises to practice the concept
- Draw logical and physical plans for three sample queries.
- Define operator contract tests for scan/filter/project.
- Add one deterministic EXPLAIN snapshot test case.
Solutions to the homework/exercises
- Ensure each logical node maps to clear physical operator.
- Validate tuple counts, ordering assumptions, and EOS signaling.
- Snapshot should include operator tree and chosen access path.
Glossary
- ACID: Atomicity, Consistency, Isolation, Durability guarantees for transactions.
- B+ tree: Balanced tree where internal nodes route and leaves hold sorted entries.
- Buffer pool: Managed in-memory cache for database pages.
- Checkpoint: Recovery marker reducing WAL replay scope.
- Fan-out: Number of child pointers in an internal tree node.
- Heap page: Storage page holding table records.
- LSN: Log sequence number identifying WAL record order.
- MVCC: Multi-version concurrency control for snapshot reads.
- Slotted page: Page format using slot indirection to decouple record position.
- Volcano model: Pull-based operator execution interface.
Why Database Internals Matters
Modern systems are increasingly data-intensive, and backend reliability/performance decisions often reduce to database internals decisions.
Real-world signals:
- In the 2025 Stack Overflow Developer Survey, PostgreSQL was reported by 55.6% of all respondents and 58.2% of professional developers, while SQLite was 37.5% overall. These are implementation-facing technologies used directly by working developers (https://survey.stackoverflow.co/2025/technology).
- SQLite documents that it is likely “used more than all other database engines combined” and estimates over 1 trillion active SQLite databases due to its broad embedding footprint (https://www.sqlite.org/mostdeployed.html).
- Official PostgreSQL docs emphasize WAL as a standard integrity method and explain why sequential log flushes enable both durability and reduced write cost compared to forcing all modified data pages to disk at commit (https://www.postgresql.org/docs/current/wal-intro.html).
Context & Evolution:
- Early engines focused on disk-era constraints and strict page discipline.
- Modern engines add stronger concurrency models, richer optimizers, and cloud operational tooling.
- The core building blocks remain stable: pages, indexes, buffer managers, logs, and isolation models.
Then (naive file app) Now (engineered database)
┌───────────────────────────┐ ┌─────────────────────────────┐
│ append rows to file │ │ page layout + indexing │
│ linear scans for lookup │ ---> │ buffer pool + WAL + planner │
│ ad-hoc crash behavior │ │ explicit recovery semantics │
└───────────────────────────┘ └─────────────────────────────┘
Concept Summary Table
| Concept Cluster | What You Need to Internalize |
|---|---|
| Page-Oriented Storage and Binary Encoding | Stable on-disk contracts need explicit layouts, slot indirection, and invariant checks. |
| B+Tree Indexes and Access Paths | Fan-out and separator correctness dominate lookup and range-scan behavior. |
| Buffer Pool, Caching, and I/O Control | Pin/dirty/eviction discipline controls both latency and correctness. |
| WAL, Checkpointing, and Crash Recovery | Durability is a write-ordering protocol plus deterministic replay behavior. |
| Concurrency Control and Isolation Semantics | Correctness under interleaving requires explicit anomaly handling and retry semantics. |
| SQL Compilation and Volcano Execution | Query processing is a compiler/runtime pipeline tied to storage access paths. |
Project-to-Concept Map
| Project | Concepts Applied |
|---|---|
| Project 1 | Page-Oriented Storage and Binary Encoding |
| Project 2 | Page-Oriented Storage and Binary Encoding, API boundaries from SQL Compilation ecosystem |
| Project 3 | Page-Oriented Storage and Binary Encoding, B+Tree Indexes and Access Paths |
| Project 4 | Buffer Pool, Caching, and I/O Control, Page-Oriented Storage and Binary Encoding |
| Project 5 | WAL, Checkpointing, and Crash Recovery, Buffer Pool, Caching, and I/O Control |
| Project 6 | SQL Compilation and Volcano Execution, B+Tree Indexes and Access Paths |
| Project 7 | Concurrency Control and Isolation Semantics, WAL, Checkpointing, and Crash Recovery, SQL Compilation and Volcano Execution |
Deep Dive Reading by Concept
| Concept | Book and Chapter | Why This Matters |
|---|---|---|
| Page-Oriented Storage and Binary Encoding | Computer Systems: A Programmer’s Perspective (Bryant & O’Hallaron), Ch. 6 | Bridges memory hierarchy assumptions with page-level data layout decisions. |
| Page-Oriented Storage and Binary Encoding | The C Programming Language (K&R), Ch. 5-8 | Sharpens binary I/O, pointer arithmetic, and file discipline needed for on-disk formats. |
| B+Tree Indexes and Access Paths | Algorithms, Fourth Edition (Sedgewick & Wayne), balanced search tree sections | Gives formal tree invariants and complexity reasoning. |
| Buffer Pool, Caching, and I/O Control | Operating Systems: Three Easy Pieces, file-system and caching chapters | Connects database cache policy to OS/page-cache behavior. |
| WAL, Checkpointing, and Crash Recovery | Operating System Concepts (Silberschatz et al.), recovery concepts | Frames crash consistency and logging guarantees. |
| Concurrency Control and Isolation Semantics | Operating System Concepts (Silberschatz et al.), concurrency chapters | Builds lock ordering and deadlock mental models before database-specific variants. |
| SQL Compilation and Volcano Execution | Compilers: Principles and Practice (Dave & Dave), lexical/syntax analysis chapters | Translates parser/compiler mental models directly into query-engine design. |
Quick Start: Your First 48 Hours
Day 1:
- Read Theory Primer Concept 1 and Concept 2.
- Build the Project 1 scaffold and persist the first 100 deterministic key-value records.
- Inspect file bytes with hexdump and reconcile against your format document.
Day 2:
- Run corruption checks on Project 1 and add one malformed-record fixture.
- Read Concept 3 and sketch buffer frame state transitions.
- Start Project 3 lookup-only B+ tree traversal (no splits yet).
Recommended Learning Paths
Path 1: The Systems Builder
- Project 1 -> Project 3 -> Project 4 -> Project 5 -> Project 7
Path 2: The Query Engineer
- Project 1 -> Project 3 -> Project 6 -> Project 7
Path 3: The Reliability Engineer
- Project 1 -> Project 4 -> Project 5 -> Project 7
Success Metrics
- You can explain each committed byte in one page dump from your own engine.
- You can force-crash the engine and recover to identical post-commit state.
- You can demonstrate O(log n) index lookup depth trends across dataset growth.
- You can reproduce at least two isolation anomalies and show how your engine blocks/aborts/retries them.
- You can run EXPLAIN and map each operator to real execution behavior.
Project Overview Table
| Project | Main Focus | Difficulty | Time | Primary Output |
|---|---|---|---|---|
| Project 1 | Persistent heap-file KV store | Beginner | 1 weekend | Deterministic on-disk key-value file format |
| Project 2 | KV client C boundary/API | Intermediate | 4-6 days | Clean C client library with strict ownership rules |
| Project 3 | Disk-backed B+ tree | Advanced | 1-2 weeks | Indexed point lookup + range scan engine |
| Project 4 | Slotted pages + buffer pool | Advanced | 1-2 weeks | Pin/dirty/eviction-aware page manager |
| Project 5 | WAL + crash recovery | Expert | 1-2 weeks | Crash-safe durability harness and recovery tool |
| Project 6 | SQL parser + executor | Expert | 2-3 weeks | Subset SQL REPL with EXPLAIN |
| Project 7 | Isolation + end-to-end integration | Expert | 3-4 weeks | Multi-session TinyDB engine with transactional guarantees |
Project List
The following projects guide you from byte-level persistence to a crash-safe SQL-capable mini database engine.
Project 1: Persistent Key-Value Store
- File: P01-persistent-key-value-store.md
- Main Programming Language: C
- Alternative Programming Languages: Rust, Go
- Coolness Level: Level 3: Genuinely Clever
- Business Potential: Level 1: The “Resume Gold”
- Difficulty: Level 1: Beginner (The Tinkerer)
- Knowledge Area: Storage layout, binary encoding
- Software or Tool: Filesystem + hex inspection tooling
- Main Book: The C Programming Language (K&R)
What you will build: A deterministic key-value CLI backed by a versioned page file.
Why it teaches database internals: It forces direct confrontation with binary encoding, page headers, and corruption boundaries.
Core challenges you will face:
- Record encoding discipline -> maps to Page-Oriented Storage and Binary Encoding
- Page free-space accounting -> maps to Page-Oriented Storage and Binary Encoding
- Deterministic replay tests -> maps to WAL, Checkpointing, and Crash Recovery foundations
Real World Outcome
You run a CLI that persists values across restarts and can be audited at byte level.
$ ./kvdb put user:42:name "Ada"
OK txn=00000019 lsn=0000012A page=0003 slot=0017
$ ./kvdb get user:42:name
VALUE "Ada" page=0003 slot=0017
$ ./kvdb list --prefix user:42
user:42:name="Ada"
$ ./kvdb inspect-page 3
page=3 version=2 slots=18 free=1296 checksum=0x7a31
slot[17] off=3872 len=28 state=LIVE key=user:42:name
The Core Question You Are Answering
“What exact bytes must exist on disk so I can restart and reconstruct state without ambiguity?”
This question matters because every higher-level feature assumes persistent state can be read back unambiguously.
Concepts You Must Understand First
- Slotted page basics
- How can record payload move while slot identity remains stable?
- Book Reference: Computer Systems: A Programmer’s Perspective — Ch. 6
- Binary format versioning
- Which header fields allow future-compatible evolution?
- Book Reference: The C Programming Language — Ch. 5-8
- Checksums and corruption detection
- What failure classes can checksums detect?
- Book Reference: Operating System Concepts — file integrity concepts
Questions to Guide Your Design
- File contract
- Which header fields are mandatory on every page?
- How will you reject unknown format versions safely?
- Record lifecycle
- What does delete mean physically (tombstone vs reclaim)?
- When do you compact a fragmented page?
Thinking Exercise
Design the First 64 Bytes
Draft a fixed 64-byte page header and justify each field.
Questions to answer:
- Which fields are needed for correctness vs debugging?
- Which bytes are reserved for future compatibility?
The Interview Questions They Will Ask
- “Why not serialize structs directly with fwrite?”
- “How do you detect and isolate a corrupted page?”
- “Why are slot directories useful for variable-length records?”
- “How would you migrate file format version 1 to 2 safely?”
- “How do you validate reproducibility of persistence tests?”
Hints in Layers
Hint 1: Start with read-only inspector Define a page inspector first; writing is easier after visibility exists.
Hint 2: Pin down byte order early Choose one endianness in format docs and never mix.
Hint 3: Pseudocode for insert flow
load page -> validate header -> ensure free_space >= record_size+slot_size
write payload bytes at tail -> write slot entry -> update header counters
Hint 4: Corruption harness Flip one byte in a test page and ensure loader rejects it with actionable error.
Books That Will Help
| Topic | Book | Chapter |
|---|---|---|
| Binary I/O in C | The C Programming Language | Ch. 5-8 |
| Memory/page behavior | Computer Systems: A Programmer’s Perspective | Ch. 6 |
| Robust C interfaces | Effective C, 2nd Edition | Memory-related chapters |
Common Pitfalls and Debugging
Problem 1: “Data reads fine until restart, then random keys fail”
- Why: Header offsets were not updated after delete/insert sequence.
- Fix: Recompute and validate free-space boundaries on each mutation.
- Quick test:
./kvdb fuzz-page-layout --seed 42 --iterations 10000
Problem 2: “Page loads on one machine but not another”
- Why: Implicit host-endian struct serialization.
- Fix: Encode/decode fields explicitly with defined byte order.
- Quick test: Run cross-architecture decode fixture.
Definition of Done
- CRUD works across 100 restart cycles with deterministic fixtures
- On-disk format document matches byte-level inspector output
- Corrupted pages are rejected safely (no crashes, clear diagnostics)
- Performance baseline and page fill-factor metrics are recorded
Project 2: Key-Value Store Client Library
- File: P02-kv-client-library.md
- Main Programming Language: C
- Alternative Programming Languages: Rust, Go
- Coolness Level: Level 2: Practical but Useful
- Business Potential: Level 3: Service and Support
- Difficulty: Level 2: Intermediate
- Knowledge Area: API boundary design, protocol framing
- Software or Tool: TCP client + protocol codec
- Main Book: C Interfaces and Implementations
What you will build: A C client library for a KV service with strict ownership and error contracts.
Why it teaches database internals: Client boundaries shape operability, safety, and performance as much as core engine code.
Core challenges you will face:
- Opaque handle contracts -> maps to boundary discipline
- Memory ownership clarity -> maps to correctness and leak avoidance
- Protocol framing robustness -> maps to parse/execute infrastructure
Real World Outcome
$ ./kvclient_demo
connect host=127.0.0.1 port=7390 -> OK
set key=user:7 value="Lin" -> OK latency_ms=0.42
get key=user:7 -> VALUE "Lin" ownership=caller
free value buffer -> OK
disconnect -> OK
You can hand this library to another engineer and they can use it without reading any internal source code.
The Core Question You Are Answering
“How do I design a C boundary that is easy to use correctly and hard to misuse?”
Concepts You Must Understand First
- Opaque pointer interfaces
- Why hide struct internals from consumers?
- Book Reference: C Interfaces and Implementations — interface chapters
- Ownership transfer contracts
- Who allocates and who frees each returned object?
- Book Reference: Effective C, 2nd Edition — memory management chapters
- Protocol framing
- How do you parse partial reads safely?
- Book Reference: The Linux Programming Interface — socket I/O chapters
Questions to Guide Your Design
- API ergonomics
- Can a caller discover cleanup obligations from the header alone?
- Are error states machine-readable, not string-only?
- Failure behavior
- What happens to in-flight requests on disconnect?
- Can retries accidentally duplicate writes?
Thinking Exercise
Ownership Matrix
Before coding, draw a table of every function and ownership of each pointer on success/failure.
Questions:
- Which API calls transfer ownership?
- Which paths require idempotent cleanup?
The Interview Questions They Will Ask
- “What is an opaque handle and why does it matter in C APIs?”
- “How do you avoid ambiguous ownership in returned buffers?”
- “How do you handle partial TCP reads in a framed protocol?”
- “How do you design error codes so callers can recover programmatically?”
- “What makes an API boundary stable across versions?”
Hints in Layers
Hint 1: Header-first design
Write the public .h contract before internal implementation.
Hint 2: Explicit free function Use dedicated release functions for returned heap objects.
Hint 3: Framing pseudocode
read fixed header -> parse payload length -> read exact payload bytes -> decode
Hint 4: Timeout strategy Expose configurable connect/read/write timeouts in one options struct.
Books That Will Help
| Topic | Book | Chapter |
|---|---|---|
| Interface boundaries | C Interfaces and Implementations | Ch. 1-3 |
| Memory ownership | Effective C, 2nd Edition | Memory management chapters |
| Socket I/O patterns | The Linux Programming Interface | Networking chapters |
Common Pitfalls and Debugging
Problem 1: “Occasional segfault after get”
- Why: Caller freed memory owned by library internals.
- Fix: Return caller-owned copy or immutable borrowed view with explicit lifetime.
- Quick test: Run ASAN + stress retrieval loop.
Problem 2: “Protocol desync after long values”
- Why: Length prefix parsed incorrectly after partial reads.
- Fix: Use exact-byte read helper with bounded retries.
- Quick test: Replay mixed-size frame corpus.
Definition of Done
- Public API has explicit ownership notes for all pointer-returning functions
- Framed protocol handles partial reads/writes and malformed frames
- Error codes are stable and machine-parseable
- No leaks across reconnect stress test
Project 3: Disk-Backed B+Tree Library
- File: P03-disk-backed-bplustree.md
- Main Programming Language: C
- Alternative Programming Languages: Rust, C++
- Coolness Level: Level 4: Hardcore Tech Flex
- Business Potential: Level 4: Open Core Infrastructure
- Difficulty: Level 3: Advanced
- Knowledge Area: Index structures, storage-engine algorithms
- Software or Tool: Page manager + tree verifier
- Main Book: Algorithms, Fourth Edition
What you will build: A persistent B+ tree supporting point lookups, inserts, deletes, and range scans.
Why it teaches database internals: This is the canonical index structure in major row-store engines.
Core challenges you will face:
- Split/merge correctness -> maps to B+ tree invariants
- Separator propagation -> maps to search correctness
- Leaf-link consistency -> maps to range scan performance
Real World Outcome
$ ./bptree_demo load --keys 1000000 --seed 7
build complete pages=8421 height=4 avg_leaf_fill=0.73
$ ./bptree_demo get 884422
FOUND key=884422 value_ptr=(page=5812,slot=9) depth=4
$ ./bptree_demo range 884400 884460
rows=61 first=884400 last=884460 leaf_hops=2
$ ./bptree_demo verify
OK invariants=all depth=4 linked_leaves=consistent
The Core Question You Are Answering
“How do databases keep indexed lookup fast as data grows by orders of magnitude?”
Concepts You Must Understand First
- Fan-out and tree height
- Why does high fan-out beat pointer-heavy binary trees on disk?
- Book Reference: Algorithms, Fourth Edition — balanced tree sections
- Node split/merge semantics
- What parent separator is promoted on split?
- Book Reference: Algorithms, Fourth Edition
- Leaf chaining for scans
- Why are sibling links crucial for range scans?
- Book Reference: storage-engine indexing chapters (external resources)
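The fan-out point above is worth making concrete: height grows with log base fan-out of the key count, so a disk-friendly fan-out of a few hundred keeps even a billion keys within a handful of page reads. A rough sketch (illustrative ceiling-division level count; real trees also depend on fill factor):

```c
/* Count tree levels needed to index n_keys with a given fan-out:
 * each level up shrinks the key count by a factor of `fanout`. */
int tree_height(unsigned long long n_keys, unsigned long long fanout) {
    int h = 1;
    while (n_keys > fanout) {
        n_keys = (n_keys + fanout - 1) / fanout; /* ceiling division per level */
        h++;
    }
    return h;
}
```

With fan-out 300, a billion keys fit in a height-4 tree; with fan-out 2 (a binary tree) the same data needs about 30 levels, i.e. 30 dependent page reads per lookup. That gap is why pointer-heavy binary trees lose on disk.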
Questions to Guide Your Design
- Node layout decisions
- Fixed-width keys first or variable-width with indirection?
- How do you reserve metadata space for future format evolution?
- Correctness instrumentation
- Which verifier checks run after each structural mutation?
- How will you detect stale parent separators early?
Thinking Exercise
Split Propagation Trace
Trace by hand an insertion sequence that causes leaf split then root split.
Questions:
- Which key becomes parent separator at each step?
- How do child ranges change after propagation?
The Interview Questions They Will Ask
- “Why B+ trees over hash indexes for general-purpose queries?”
- “How does node fan-out influence tree height and I/O?”
- “What invariants must hold after split and merge?”
- “How do range scans leverage leaf sibling links?”
- “How would you detect index corruption offline?”
Hints in Layers
Hint 1: Read-only traversal first
Implement get on static tree pages before mutation logic.
Hint 2: Split in one node type first
Complete leaf split path before internal-node split.
Hint 3: Mutation pseudocode
insert into leaf -> if overflow split leaf -> insert separator into parent
if parent overflow -> split parent -> continue upward
Hint 4: Verifier every step
Run full invariant verifier after each randomized mutation batch.
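For the read-only traversal in Hint 1, the core primitive is choosing which child to descend into from an internal node's sorted separator array. A sketch under one common convention (child `i` holds keys in `[sep[i-1], sep[i])`, keys equal to a separator go right; the layout and names are hypothetical):

```c
#include <stddef.h>
#include <stdint.h>

/* Return the child slot to descend into for `key`, given nsep sorted
 * separators. Result is the index of the first separator > key, i.e.
 * 0..nsep inclusive (there is always one more child than separators). */
size_t child_index(const uint64_t *sep, size_t nsep, uint64_t key) {
    size_t lo = 0, hi = nsep;          /* binary search over separators */
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (key < sep[mid]) hi = mid;  /* key belongs left of sep[mid] */
        else lo = mid + 1;             /* key >= sep[mid], go right */
    }
    return lo;
}
```

Getting this boundary convention consistent between search, split, and the verifier is exactly what catches stale-separator bugs early.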
Books That Will Help
| Topic | Book | Chapter |
|---|---|---|
| Balanced trees | Algorithms, Fourth Edition | Balanced tree sections |
| Cache/storage effects | Computer Systems: A Programmer’s Perspective | Ch. 6 |
| C data structure discipline | Mastering Algorithms with C | Tree/data-structure chapters |
Common Pitfalls and Debugging
Problem 1: “Random missing keys after heavy inserts”
- Why: Parent separator not updated after child split.
- Fix: Enforce parent range verification post-mutation.
- Quick test: Differential test against sorted-map oracle.
Problem 2: “Range scans skip keys”
- Why: Leaf sibling pointers broken during merge.
- Fix: Update both forward and backward links atomically in page mutation logic.
- Quick test: Full in-order scan monotonicity assertion.
Definition of Done
- Point lookup and range scan results match oracle for randomized workloads
- Tree invariants validated after 100k mixed operations
- Restart/load preserves index correctness
- Depth trend aligns with expected O(log n) behavior
Project 4: Page-Based Storage Engine with Buffer Pool
- File: P04-page-storage-buffer-pool.md
- Main Programming Language: C
- Alternative Programming Languages: Rust, Go
- Coolness Level: Level 4: Hardcore Tech Flex
- Business Potential: Level 4: Open Core Infrastructure
- Difficulty: Level 3: Advanced
- Knowledge Area: Cache management, physical storage
- Software or Tool: Buffer pool metrics + eviction simulator
- Main Book: Operating Systems: Three Easy Pieces
What you will build: A slotted-page heap file with pin/unpin-aware buffer pool and eviction policy.
Why it teaches database internals: It reveals how real systems hide disk latency while preserving correctness.
Core challenges you will face:
- Pin discipline -> maps to buffer correctness
- Dirty flush policy -> maps to durability foundations
- Eviction behavior under mixed workloads -> maps to performance engineering
Real World Outcome
$ ./storage_engine_demo workload --profile mixed --ops 500000 --seed 11
buffer_frames=1024 policy=clock
hit_rate=0.934 evictions=27122 dirty_flushes=9180
p95_read_ms=1.8 p95_write_ms=2.4
$ ./storage_engine_demo inspect-frame 87
frame=87 page=14829 pin=0 dirty=1 refbit=0 last_access=1700001234
The Core Question You Are Answering
“How do I keep page access fast under pressure without violating correctness invariants?”
Concepts You Must Understand First
- Pin/unpin semantics
- Why can pinned pages never be evicted?
- Book Reference: Operating Systems: Three Easy Pieces — memory/file cache concepts
- Clock/LRU policy tradeoffs
- How does scan pollution harm naive policies?
- Book Reference: Computer Systems: A Programmer’s Perspective — cache behavior
- Dirty page lifecycle
- When is a dirty page eligible for flush?
- Book Reference: WAL docs and recovery chapters
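The pin/unpin and dirty-page concepts above reduce to a small amount of per-frame metadata plus one eviction predicate. A minimal sketch (field names are illustrative, not a fixed layout):

```c
#include <stdint.h>
#include <stdbool.h>

/* Minimal frame metadata a buffer pool needs for safe eviction. */
typedef struct frame {
    uint32_t page_id;    /* which on-disk page this frame caches */
    int      pin_count;  /* >0 means in use: never evictable */
    bool     dirty;      /* must be flushed durably before reuse */
    bool     ref_bit;    /* second-chance bit for the clock policy */
} frame;

/* A frame is a legal eviction victim only when unpinned and, under
 * clock, its reference bit has already been cleared. */
bool evictable(const frame *f) {
    return f->pin_count == 0 && !f->ref_bit;
}
```

Note that `dirty` does not block eviction; it only means the victim must be flushed before its frame is reused, which is where the WAL-before-data rule later hooks in.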
Questions to Guide Your Design
- Frame metadata
- Which fields are required for safe eviction?
- How will you detect pin leaks?
- I/O scheduling
- Do you flush on demand only or maintain background flush queue?
- How do you avoid bursty eviction stalls?
Thinking Exercise
Frame-State State-Machine
Draw states for free -> clean-pinned -> clean-unpinned -> dirty-pinned -> dirty-unpinned -> flushing.
Questions:
- Which transitions are legal?
- Which transitions must be atomic under lock?
The Interview Questions They Will Ask
- “Why isn’t cache hit ratio alone enough to evaluate a buffer pool?”
- “What happens if a pinned page is evicted by bug?”
- “Why can sequential scans hurt LRU?”
- “How do you choose between CLOCK and LRU?”
- “What metrics identify pin leaks?”
Hints in Layers
Hint 1: Isolate buffer tests
Build deterministic frame-trace tests before integrating storage engine.
Hint 2: Add pin assertions everywhere
Pin underflow/overflow checks save hours later.
Hint 3: Victim pseudocode
while true: inspect candidate frame
if pinned -> skip
if refbit=1 -> clear refbit, advance
else choose victim
Hint 4: Separate logical and physical errors
Return explicit error categories: pin error, I/O error, checksum error.
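The victim-selection pseudocode in Hint 3 can be sketched directly in C. This is one way to bound the scan (two sweeps suffice: one to clear reference bits, one to pick); the struct and names are illustrative:

```c
#include <stddef.h>
#include <stdbool.h>

typedef struct {
    int  pin_count;
    bool ref_bit;
} clock_frame;

/* CLOCK victim selection: skip pinned frames, give ref_bit=1 frames a
 * second chance, return the first unpinned frame with ref_bit=0.
 * Returns -1 if every frame is pinned (caller must wait or fail). */
int pick_victim(clock_frame *frames, size_t nframes, size_t *hand) {
    for (size_t scanned = 0; scanned < 2 * nframes; scanned++) {
        clock_frame *f = &frames[*hand];
        size_t cur = *hand;
        *hand = (*hand + 1) % nframes;      /* advance the clock hand */
        if (f->pin_count > 0) continue;     /* pinned: never a victim */
        if (f->ref_bit) {                   /* recently used: second chance */
            f->ref_bit = false;
            continue;
        }
        return (int)cur;                    /* unpinned, ref_bit clear */
    }
    return -1;
}
```

The `-1` path is important: a pool where every frame is pinned is a pin leak or an undersized pool, and surfacing it as an explicit error beats spinning forever.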
Books That Will Help
| Topic | Book | Chapter |
|---|---|---|
| Cache policy intuition | Computer Systems: A Programmer’s Perspective | Ch. 6 |
| Systems-level caching | Operating Systems: Three Easy Pieces | file/cache chapters |
| C implementation discipline | Effective C, 2nd Edition | reliability/safety chapters |
Common Pitfalls and Debugging
Problem 1: “Throughput degrades over time”
- Why: Pin leaks prevent victim selection; clock hand loops excessively.
- Fix: Track pin lifetimes and enforce timeout diagnostics in debug mode.
- Quick test:
./storage_engine_demo detect-pin-leaks --duration 120
Problem 2: “Occasional stale reads”
- Why: Dirty flush errors ignored and frame reused.
- Fix: Block frame reuse until durable flush success is confirmed.
- Quick test: Simulate disk write failures and assert no silent reuse.
Definition of Done
- Pin/unpin invariants hold under randomized concurrent request simulation
- Eviction never selects pinned frames
- Dirty flush pipeline handles injected I/O failures safely
- Metrics dashboard reports stable behavior across workload profiles
Project 5: Write-Ahead Logging (WAL) and Crash Recovery
- File: P05-wal-crash-recovery.md
- Main Programming Language: C
- Alternative Programming Languages: Rust, Go
- Coolness Level: Level 5: Pure Magic
- Business Potential: Level 4: Open Core Infrastructure
- Difficulty: Level 4: Expert
- Knowledge Area: Durability, crash consistency
- Software or Tool: WAL parser + crash harness
- Main Book: Recovery chapters from systems/database references
What you will build: A WAL subsystem with checkpoints and deterministic crash-recovery replay.
Why it teaches database internals: Durability is the dividing line between a toy data tool and a real database engine.
Core challenges you will face:
- Log/data ordering discipline -> maps to WAL invariants
- Replay idempotence -> maps to recovery correctness
- Checkpoint policy tuning -> maps to performance/reliability tradeoffs
Real World Outcome
$ ./wal_demo run --ops 20000 --checkpoint-every 500 --crash-at 13742
SIMULATED CRASH at op=13742
$ ./wal_demo recover
analysis start_lsn=0x00019000 dirty_pages=317 inflight_txns=4
redo applied=1298 skipped=9821
undo rolled_back_txns=4
RECOVERY COMPLETE state_hash=4d7a9c...
$ ./wal_demo verify
OK committed_state_consistent=true
The Core Question You Are Answering
“What precise write-ordering rules let me recover to a valid committed state after arbitrary crashes?”
Concepts You Must Understand First
- WAL-before-data invariant
- Why must log be durable before data page flush?
- Book Reference: crash consistency chapters in OS/database texts
- Redo idempotence with page-LSN
- How does page-LSN prevent duplicate application?
- Book Reference: WAL/recovery references
- Checkpoint economics
- How does checkpoint frequency trade throughput vs restart time?
- Book Reference: recovery-system chapters
Questions to Guide Your Design
- Record schema
- Logical redo records or physical before/after images?
- How will you checksum and version log records?
- Recovery algorithm
- What metadata do you persist at checkpoint?
- How do you detect torn-tail log records safely?
Thinking Exercise
Crash Timeline Audit
Map one transaction timeline with events: BEGIN -> UPDATE -> COMMIT -> fsync log -> crash -> data flush.
Questions:
- Which states are legal after restart?
- What changes if crash happens before commit flush?
The Interview Questions They Will Ask
- “What is the WAL-before-data rule and why is it necessary?”
- “How do you make redo replay idempotent?”
- “What does checkpointing optimize, and what does it not solve?”
- “How do you test crash recovery deterministically?”
- “What is group commit and when is it useful?”
Hints in Layers
Hint 1: Start with append-only log and parser
Recovery work is easier after log observability exists.
Hint 2: Persist durable LSN watermark
Buffer flush policy can consult one source of truth.
Hint 3: Replay pseudocode
scan checkpoint -> analysis tables -> redo by LSN with page_lsn guard
then undo loser txns if undo logs exist
Hint 4: Re-run recovery twice
Idempotence test: second recovery should make zero logical changes.
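The page-LSN guard from the replay pseudocode is small enough to sketch in full. This toy version uses a physical after-image as the redo payload (struct shapes are illustrative, not a log format):

```c
#include <stdint.h>
#include <stdbool.h>

typedef struct {
    uint64_t page_lsn;  /* LSN of the last record applied to this page */
    int      value;     /* stand-in for page contents */
} page;

typedef struct {
    uint64_t lsn;
    int      new_value; /* physical after-image */
} redo_record;

/* Apply a redo record only if the page has not already seen it.
 * This guard is what makes replay idempotent: a second recovery pass
 * applies zero records. Returns true if the record was applied. */
bool redo_apply(page *p, const redo_record *r) {
    if (p->page_lsn >= r->lsn) return false; /* already reflected on page */
    p->value = r->new_value;
    p->page_lsn = r->lsn;                    /* advance the page LSN */
    return true;
}
```

Hint 4's double-recovery test falls out directly: after one full replay, every page LSN is at or past the log tail, so the second replay returns `false` for every record.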
Books That Will Help
| Topic | Book | Chapter |
|---|---|---|
| Crash consistency | Operating Systems: Three Easy Pieces | crash-consistency chapter |
| Transaction durability | Operating System Concepts | recovery-focused sections |
| C robustness patterns | Effective C, 2nd Edition | error handling and reliability chapters |
Common Pitfalls and Debugging
Problem 1: “Committed rows disappear after crash”
- Why: Commit acknowledged before durable log flush.
- Fix: Gate commit ACK on `fsync` success up to commit LSN.
- Quick test: Kill process immediately after commit return in loop.
Problem 2: “Recovery duplicates updates”
- Why: Redo ignores page-LSN guard.
- Fix: Apply redo only when page-LSN < record LSN.
- Quick test: Run recovery twice and compare state hash.
Definition of Done
- Crash harness passes randomized kill-point tests
- Recovery is idempotent under repeated restarts
- Committed transactions persist; uncommitted effects do not
- Checkpoint interval tradeoff is benchmarked and documented
Project 6: SQL Query Engine (Parse -> Plan -> Execute)
- File: P06-sql-query-engine.md
- Main Programming Language: C
- Alternative Programming Languages: Rust, C++
- Coolness Level: Level 5: Pure Magic
- Business Potential: Level 4: Open Core Infrastructure
- Difficulty: Level 4: Expert
- Knowledge Area: Parsing, query planning, execution
- Software or Tool: SQL REPL + explain plan
- Main Book: Compilers: Principles and Practice
What you will build: A minimal SQL engine subset (SELECT, INSERT, UPDATE, DELETE, simple joins, aggregates) over previous storage/index layers.
Why it teaches database internals: It connects user-level SQL to physical storage and operator behavior.
Core challenges you will face:
- Grammar/AST correctness -> maps to compile pipeline
- Plan choice quality -> maps to index and cost reasoning
- Operator contract stability -> maps to execution correctness
Real World Outcome
$ ./tiny_sql
tinydb> CREATE TABLE users(id INT, name TEXT, age INT);
OK
tinydb> INSERT INTO users VALUES (1,'Ada',37),(2,'Lin',28);
OK rows=2
tinydb> EXPLAIN SELECT name FROM users WHERE age > 30;
Project(name)
Filter(age > 30)
SeqScan(users)
tinydb> SELECT name FROM users WHERE age > 30;
+------+
| name |
+------+
| Ada |
+------+
1 row
The Core Question You Are Answering
“How does declarative SQL become a deterministic operator pipeline over pages and indexes?”
Concepts You Must Understand First
- Tokenization and parsing
- How do tokens become AST nodes with validated syntax?
- Book Reference: Compilers: Principles and Practice — lexical/syntax chapters
- Logical vs physical plans
- Why separate semantic intent from execution algorithms?
- Book Reference: compiler architecture chapters
- Iterator execution
- What guarantees should `next()` provide?
- Book Reference: query execution model references
Questions to Guide Your Design
- Planner scope
- Which query patterns get index scans?
- How do you represent unsupported constructs clearly?
- Runtime contracts
- Who owns tuple memory across operators?
- How does executor report recoverable vs fatal errors?
Thinking Exercise
Plan Two Ways
For SELECT * FROM orders WHERE customer_id=7, draw both seq-scan and index-scan plans.
Questions:
- Under what selectivity assumptions does each win?
- How would bad cardinality estimates hurt performance?
The Interview Questions They Will Ask
- “What is the difference between logical and physical planning?”
- “How does a Volcano iterator model work?”
- “What should `EXPLAIN` guarantee in a query engine?”
- “Why can wrong cardinality estimates cause bad plans?”
- “How do indexes influence physical operator choice?”
Hints in Layers
Hint 1: Lock SQL subset
Freeze scope early to avoid grammar sprawl.
Hint 2: Build AST printer
Debugging the parser becomes easy with deterministic AST output.
Hint 3: Operator pseudocode
Project.next() -> pull tuple from child -> emit selected columns -> repeat
Hint 4: Plan/execution audit
Log operator open/next/close events and compare to EXPLAIN tree.
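The Volcano-style `next()` contract from Hint 3 can be sketched with a scan feeding a filter, mirroring the `Filter(age > 30) -> SeqScan(users)` plan shown earlier. Tuples are reduced to a single `int` column for brevity; all names are illustrative:

```c
#include <stdbool.h>
#include <stddef.h>

/* Leaf operator: sequential scan over an in-memory row array. */
typedef struct {
    const int *rows;
    size_t     nrows;
    size_t     pos;      /* scan cursor */
} seq_scan;

/* next() contract: fill *out and return true, or return false when
 * the operator is exhausted. Every operator exposes the same shape. */
bool scan_next(seq_scan *s, int *out) {
    if (s->pos >= s->nrows) return false;
    *out = s->rows[s->pos++];
    return true;
}

/* Filter operator: pulls from its child until the predicate passes. */
bool filter_next(seq_scan *child, int min_val, int *out) {
    int v;
    while (scan_next(child, &v)) {
        if (v > min_val) {           /* predicate: v > min_val */
            *out = v;
            return true;
        }
    }
    return false;                    /* child exhausted */
}
```

The key property to preserve as you add operators: each `next()` call yields exactly one tuple or signals exhaustion, and no operator buffers more than it must, so the pipeline stays pull-driven end to end.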
Books That Will Help
| Topic | Book | Chapter |
|---|---|---|
| Lexing/parsing | Compilers: Principles and Practice | Ch. 3-4 |
| C runtime safety | Effective C, 2nd Edition | runtime safety chapters |
| Data-structure choices | Algorithms, Fourth Edition | relevant search/join structures |
Common Pitfalls and Debugging
Problem 1: “Query parses but returns wrong columns”
- Why: AST projection mapping uses wrong source index offsets.
- Fix: Add schema-bound column resolution stage before planning.
- Quick test: Golden query corpus with known output order.
Problem 2: “EXPLAIN says index scan but runtime does seq scan”
- Why: Plan and executor operator enums diverged.
- Fix: Use a single shared plan representation for both explain and execute paths.
- Quick test: Snapshot assert on plan + runtime event sequence.
Definition of Done
- Supported SQL subset parses and executes deterministically
- `EXPLAIN` reflects actual runtime operator tree
- Index scans are chosen for defined predicate patterns
- Query correctness validated against fixture datasets
Project 7: Transaction Isolation and End-to-End TinyDB Integration
- File: P07-transaction-isolation-tinydb.md
- Main Programming Language: C
- Alternative Programming Languages: Rust, Go
- Coolness Level: Level 5: Pure Magic
- Business Potential: Level 5: Venture-Backed Product Candidate
- Difficulty: Level 4: Expert
- Knowledge Area: Concurrency, end-to-end systems integration
- Software or Tool: Multi-session harness + anomaly tests
- Main Book: Systems/OS concurrency references + DB docs
What you will build: A full TinyDB integration layer with transaction boundaries, isolation options, and retry-safe client semantics.
Why it teaches database internals: It forces all subsystems to cooperate under real contention and failure scenarios.
Core challenges you will face:
- Isolation semantics implementation -> maps to concurrency theory
- WAL + lock/version integration -> maps to durability/correctness coupling
- Anomaly test harness -> maps to practical reliability proof
Real World Outcome
$ ./tinydb_multi_session --scenario lost-update
scenario=lost-update isolation=read-committed
result=ANOMALY_OBSERVED
$ ./tinydb_multi_session --scenario lost-update --isolation serializable
result=RETRY_TRIGGERED txn=2 retries=1 final_state=CORRECT
$ ./tinydb_multi_session --scenario write-skew --isolation serializable
result=SERIALIZATION_FAILURE_HANDLED final_state=CORRECT
The Core Question You Are Answering
“Can this engine preserve business invariants under concurrent load and crashes, not just pass single-session tests?”
Concepts You Must Understand First
- Isolation anomalies
- Which anomalies are acceptable at each isolation level?
- Book Reference: OS concurrency chapters + official DB docs
- Conflict handling
- Block, abort, or retry?
- Book Reference: transaction and locking concepts
- End-to-end consistency checks
- How do WAL, buffer, and visibility rules compose?
- Book Reference: recovery and isolation references
Questions to Guide Your Design
- Concurrency model
- Lock-based, MVCC-like, or hybrid for your scope?
- How do you represent transaction snapshots?
- Client contract
- Which errors are retriable?
- How does API signal serialization failures clearly?
Thinking Exercise
Two-Session Invariant Simulation
Model two transactions that each read two rows and write one row.
Questions:
- Under which isolation setting can write skew appear?
- What conflict detection prevents it?
The Interview Questions They Will Ask
- “How do you demonstrate a lost update anomaly in a controlled test?”
- “What changes when moving from read committed to serializable?”
- “Why are transaction retries part of application-level correctness?”
- “How do WAL and isolation mechanisms interact?”
- “Which metrics indicate lock contention problems?”
Hints in Layers
Hint 1: Deterministic scheduler
Use explicit barriers to force interleavings.
Hint 2: Separate conflict taxonomy
Track read-write, write-write, and predicate conflicts independently.
Hint 3: Retry pseudocode
begin txn -> execute -> if serialization_failure then rollback + retry whole txn
Hint 4: Integration checksum
After each scenario, compute a canonical state hash for pass/fail.
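The retry pseudocode in Hint 3 is the contract your client API must make easy. A sketch of the client-side shape (the status enum and `body` callback are hypothetical stand-ins for your engine's API; the body is assumed to roll back internally before returning a serialization failure):

```c
#include <stdbool.h>
#include <stddef.h>

/* Illustrative status codes; a real engine defines these in its API. */
typedef enum { TXN_OK, TXN_SERIALIZATION_FAILURE, TXN_FATAL } txn_status;

/* Retry-safe client pattern: a serialization failure means the whole
 * transaction must be re-executed from the start, not resumed. */
txn_status run_with_retries(txn_status (*body)(void *), void *ctx,
                            int max_retries) {
    for (int attempt = 0; attempt <= max_retries; attempt++) {
        txn_status st = body(ctx);        /* begin + execute + commit */
        if (st != TXN_SERIALIZATION_FAILURE)
            return st;                    /* success or fatal: stop */
        /* rollback happened inside body; loop retries the whole txn */
    }
    return TXN_SERIALIZATION_FAILURE;     /* retry budget exhausted */
}

/* Demo body: fails its first attempt, then succeeds. */
static int flaky_calls = 0;
static txn_status flaky_body(void *ctx) {
    (void)ctx;
    return flaky_calls++ == 0 ? TXN_SERIALIZATION_FAILURE : TXN_OK;
}
```

This is why the section stresses machine-readable error categories: the loop can only exist if `TXN_SERIALIZATION_FAILURE` is distinguishable from fatal errors without string parsing.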
Books That Will Help
| Topic | Book | Chapter |
|---|---|---|
| Concurrency fundamentals | Operating System Concepts | concurrency chapters |
| Systems race reasoning | Operating Systems: Three Easy Pieces | concurrency chapters |
| C interface reliability | Effective C, 2nd Edition | safety/reliability chapters |
Common Pitfalls and Debugging
Problem 1: “Serializable mode still shows write skew”
- Why: Predicate conflicts not tracked, only row-level conflicts.
- Fix: Add range/predicate conflict metadata or stronger locking strategy.
- Quick test: Deterministic write-skew scenario replay.
Problem 2: “Throughput collapses with many readers”
- Why: Overly coarse locks serialize non-conflicting reads.
- Fix: Refine granularity or adopt snapshot reads for read paths.
- Quick test: Reader-heavy benchmark across lock strategies.
Definition of Done
- Deterministic anomaly scenarios produce expected outcomes per isolation level
- Retriable serialization failures are surfaced and recoverable by client
- End-to-end crash + recovery + concurrency tests pass
- Performance report includes contention and retry metrics
Project Comparison Table
| Project | Difficulty | Time | Depth of Understanding | Fun Factor |
|---|---|---|---|---|
| 1. Persistent Key-Value Store | Level 1 | Weekend | High physical storage intuition | ★★★★☆ |
| 2. KV Client Library | Level 2 | 4-6 days | High boundary/API discipline | ★★★☆☆ |
| 3. Disk-Backed B+Tree | Level 3 | 1-2 weeks | Very high indexing intuition | ★★★★★ |
| 4. Storage + Buffer Pool | Level 3 | 1-2 weeks | Very high runtime performance intuition | ★★★★☆ |
| 5. WAL + Recovery | Level 4 | 1-2 weeks | Very high reliability intuition | ★★★★★ |
| 6. SQL Query Engine | Level 4 | 2-3 weeks | Very high compiler/runtime intuition | ★★★★★ |
| 7. Isolation + Integration | Level 4 | 3-4 weeks | End-to-end systems mastery | ★★★★★ |
Recommendation
If you are new to database internals: Start with Project 1, then Project 3. This builds the storage/index core before higher complexity.
If you are a backend engineer who already uses SQL daily: Start with Project 3 and Project 6. You will connect query behavior directly to access-path mechanics.
If you want reliability engineering depth: Prioritize Project 5 and Project 7 after completing Project 1 and Project 4.
Final Overall Project
Final Overall Project: TinyDB — Crash-Safe Embedded SQL Engine
The Goal: Combine Projects 1 through 7 into a single binary that supports transactional SQL over a durable page file with indexes, WAL, and isolation controls.
- Integrate page storage, B+ tree, and buffer pool into one storage subsystem.
- Integrate WAL and restart recovery across all write paths.
- Integrate SQL compile/execute and transaction manager with retriable serializable semantics.
Success Criteria: Runs deterministic workload corpus, survives crash-injection tests, and passes concurrency anomaly matrix with documented isolation behavior.
From Learning to Production
| Your Project | Production Equivalent | Gap to Fill |
|---|---|---|
| Project 1 | SQLite pager/file format discipline | Online schema migration and compatibility tooling |
| Project 3 | InnoDB/PostgreSQL B-tree internals | Concurrency latching and online maintenance |
| Project 4 | Shared buffer managers in major DBMS | Advanced eviction heuristics and NUMA awareness |
| Project 5 | PostgreSQL WAL + PITR workflows | Replication, backup orchestration, operational tooling |
| Project 6 | PostgreSQL planner/executor, SQLite VDBE | Cost model sophistication and optimizer breadth |
| Project 7 | Serializable transaction implementations | Full predicate locking/SSI and distributed correctness |
Summary
This learning path covers Database Internals in C through 7 hands-on projects.
| # | Project Name | Main Language | Difficulty | Time Estimate |
|---|---|---|---|---|
| 1 | Persistent Key-Value Store | C | Beginner | 1 weekend |
| 2 | KV Client Library | C | Intermediate | 4-6 days |
| 3 | Disk-Backed B+Tree | C | Advanced | 1-2 weeks |
| 4 | Storage + Buffer Pool | C | Advanced | 1-2 weeks |
| 5 | WAL + Crash Recovery | C | Expert | 1-2 weeks |
| 6 | SQL Query Engine | C | Expert | 2-3 weeks |
| 7 | Isolation + TinyDB Integration | C | Expert | 3-4 weeks |
Expected Outcomes
- You can reason from SQL behavior to page-level I/O and back.
- You can build and validate crash-safe write paths.
- You can explain isolation semantics with deterministic anomaly evidence.
Additional Resources and References
Standards and Specifications
- SQLite database file format spec: https://www.sqlite.org/fileformat.html
- PostgreSQL WAL documentation: https://www.postgresql.org/docs/current/wal-intro.html
- PostgreSQL transaction isolation documentation: https://www.postgresql.org/docs/current/transaction-iso.html
- PostgreSQL page layout documentation: https://www.postgresql.org/docs/current/storage-page-layout.html
- MySQL InnoDB index/page structure documentation: https://dev.mysql.com/doc/refman/8.4/en/innodb-physical-structure.html
Industry Analysis
- Stack Overflow Developer Survey 2025 (technology adoption): https://survey.stackoverflow.co/2025/technology
- SQLite deployment notes (most deployed SQL engine context): https://www.sqlite.org/mostdeployed.html
Papers and Foundational References
- ARIES publication page (IBM Research): https://research.ibm.com/publications/aries-a-transaction-recovery-method-supporting-fine-granularity-locking-and-partial-rollbacks-using-write-ahead-logging
Books
- The C Programming Language by Kernighan & Ritchie — core C I/O and binary handling foundations.
- Computer Systems: A Programmer’s Perspective by Bryant & O’Hallaron — memory hierarchy and systems-level data movement intuition.
- Algorithms, Fourth Edition by Sedgewick & Wayne — balanced tree and search-structure fundamentals.
- Operating Systems: Three Easy Pieces by Arpaci-Dusseau & Arpaci-Dusseau — crash consistency and concurrency mindset.
- Compilers: Principles and Practice by Dave & Dave — parser/planner mental models useful for SQL engines.