Sprint: Database Internals in C Mastery - Real World Projects
Goal: Build a first-principles mental model of how real databases work internally, from bytes on disk to concurrent SQL execution, by implementing core storage-engine subsystems in C. You will learn why page layout decisions shape performance, how B+ trees and buffer pools minimize I/O, and how WAL and recovery preserve correctness under crashes. You will also understand how isolation levels, locking, and query execution interact under real workloads. By the end, you will be able to design, debug, and benchmark a small but serious embedded database engine and explain each subsystem with production-grade rigor.
Introduction
Database internals is the discipline of understanding how a DBMS transforms persistent bytes into reliable query results under concurrency. A mature database is not one algorithm; it is a coordinated set of subsystems: storage layout, index structures, cache management, recovery logic, transaction control, and query execution.
This guide focuses on implementing those subsystems in C so you can see the real machine boundaries directly: memory ownership, I/O order, page boundaries, and failure behavior. You will build a sequence of projects that starts from a persistent key-value engine and ends with a mini SQL database architecture.
In scope:
- Slotted-page storage and binary formats
- B+ tree indexing and split/merge behavior
- Buffer pool design and eviction policies
- WAL, checkpoints, and crash recovery
- Transaction isolation and concurrency control patterns
- SQL parse-plan-execute pipeline
Out of scope:
- Distributed transactions and consensus
- Full ANSI SQL compliance
- Columnar/vectorized engines at warehouse scale
- Production replication/failover orchestration
┌──────────────────────────────────────────────────────┐
│ SQL Text Input │
└──────────────────────────┬───────────────────────────┘
│
v
┌─────────────────────┐ ┌───────────────────────┐ ┌──────────────────────┐
│ Tokenizer + Parser │-->│ Logical/Physical Plan │-->│ Operator Execution   │
└─────────────────────┘ └───────────────────────┘ └──────────┬───────────┘
│
v
┌─────────────────────┐ ┌───────────────────────┐ ┌──────────────────────┐
│ Transaction Manager │<->│ Buffer Pool Manager │<->│ B+Tree + Heap Pages │
└──────────┬──────────┘ └──────────┬────────────┘ └──────────┬───────────┘
│ │ │
v v v
┌─────────────────────┐ ┌───────────────────────┐ ┌──────────────────────┐
│ WAL + Checkpointing │<->│ File I/O + fsync │<->│ Main DB + WAL Files │
└─────────────────────┘ └───────────────────────┘ └──────────────────────┘
How to Use This Guide
- Read the Theory Primer before writing implementation code.
- For each project, answer the core question and thinking exercise first.
- Keep a single engineering notebook with: invariants, failure cases, and benchmark deltas.
- Use deterministic fixtures (fixed seeds, fixed dataset order, fixed timestamps in tests) so regressions are visible.
- Use project links in order on the first pass; then do a second pass by subsystem depth (storage, then recovery, then concurrency, then SQL).
Prerequisites & Background Knowledge
Essential Prerequisites (Must Have)
- C language fundamentals: pointers, structs, file I/O, memory lifecycle
- Data-structure fundamentals: trees, hash tables, amortized complexity
- OS fundamentals: virtual memory, page cache, file descriptors, fsync
- Basic command-line tooling: make, hexdump, od, strace, gdb or lldb
- Recommended Reading: The C Programming Language (K&R) Ch. 5-8
Helpful But Not Required
- Compiler pipeline basics (tokenizing/parsing/AST)
- Advanced concurrency patterns (lock ordering, deadlock detection)
- Linux performance tooling (perf, flame graphs)
Self-Assessment Questions
- Can you explain the difference between logical records and on-disk pages?
- Can you explain why write ordering (log before data) matters for durability?
- Can you reason about O(log n) B+ tree behavior in terms of disk page reads?
Development Environment Setup
Required Tools:
- C compiler: clang 15+ or gcc 12+
- Build tooling: make or ninja
- Debugging: gdb or lldb
- CLI diagnostics: hexdump, od, strace
Recommended Tools:
- perf for profiling hot paths
- valgrind or sanitizer builds for memory correctness
- sqlite3 binary for sanity-comparison experiments
Testing Your Setup:
$ cc --version
clang version 18.1.8
$ hexdump -C /tmp/sample.db | head -3
00000000 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
$ getconf PAGESIZE
4096
Time Investment
- Simple projects: 4-8 hours each
- Moderate projects: 10-20 hours each
- Complex projects: 20-40 hours each
- Total sprint: 3-5 months (part-time)
Important Reality Check
Database internals learning is nonlinear. You will often understand a concept only after failing an adjacent one (for example, understanding page formats only after debugging B+ tree split corruption). That friction is expected and productive.
Big Picture / Mental Model
Think of a database as a state machine constrained by three non-negotiables:
- Performance: minimize random I/O and unnecessary copies.
- Correctness: preserve invariants under concurrency.
- Durability: survive crashes without violating committed state.
USER VIEW ENGINE VIEW
┌─────────────────────────────────┐ ┌────────────────────────────────────────┐
│ SELECT ... WHERE ... │ │ Parse -> Plan -> Execute │
│ INSERT/UPDATE/DELETE │ │ │
└─────────────────────────────────┘ │ Read/Write Pages via Buffer Pool │
│ │ │
│ v │
│ B+Tree + Heap Layout │
│ │ │
│ v │
│ WAL + Checkpoint + Recovery │
└─────────────┬──────────────────────────┘
│
v
Crash-safe persistent files
Theory Primer
Concept 1: Page-Oriented Storage and Binary Encoding
Fundamentals
Relational databases are physically page stores. Even when you think in rows, indexes, and SQL operators, the machine persists fixed-size pages and updates those pages according to strict invariants. A page is the smallest durable read/write unit in most engines. Inside a page, records are laid out with metadata that enables insert/delete/update while preserving stable logical identifiers. Slotted pages solve this by separating the slot directory from the actual record payload. Slots point to payload offsets, so compaction can move payload bytes while keeping logical record references stable. Binary encoding matters because data must survive process restarts, architecture differences, and version upgrades. This means explicit endianness, explicit field sizes, and versioned headers. If you skip those disciplines, you create a format that appears to work in one process but breaks under upgrade, crash recovery, or external tooling.
Deep Dive
Page storage is the first place where abstraction collides with hardware behavior. A page-based engine deliberately trades flexibility for predictable I/O. Instead of writing arbitrary byte ranges, it writes page-aligned blocks. This design aligns with how filesystems and storage devices optimize reads and writes. In practice, page sizes usually sit in a bounded range (commonly 4 KiB to 16 KiB), and page-local data structures are engineered to maximize useful payload while keeping metadata small and deterministic.
A canonical slotted page has four regions: header, slot array, free space, and payload region. The slot array grows upward from the front, payload grows downward from the back, and free space exists between them. Inserts allocate payload bytes, then add a slot entry. Deletes can tombstone a slot or reclaim immediately depending on policy. Updates may be in-place when size is unchanged, or they may trigger relocation if variable-length fields grow. The slot indirection layer is the central stability trick: external references use (page_id, slot_id) while payload can move.
You must define invariants explicitly. Typical invariants include: (1) all slot offsets point within payload bounds, (2) payload records do not overlap, (3) free-space pointers are ordered and valid, (4) page checksum or LSN metadata matches write protocol expectations. Invariant checks are not optional; they are your earliest corruption detector. In production engines, page inspection tools exist largely because page-level corruption triage is common in outage analysis.
Binary encoding is equally strategic. If you serialize C structs directly, compiler padding and host endianness can make files non-portable or unstable across versions. Better pattern: define explicit field widths (u16, u32, etc.), encode/decode with deterministic byte order, include a version field, and document reserved bytes. SQLite’s on-disk format documentation is a strong model here because it treats file format as a stable external contract, not an incidental implementation detail.
Failure modes are subtle. Torn writes can leave partially updated pages; stale pointers can survive crashes if WAL/data write order is wrong; compaction bugs can produce “valid-looking” but semantically broken pages. This is why many engines avoid complex in-place transformations unless strictly necessary and rely on append-like techniques plus metadata flips for atomicity. Your educational engine can start simpler: conservative page rewrites with strict checksum validation and a repair/debug mode that reports first-invalid offset.
Another critical point is fragmentation economics. Frequent updates and deletes increase internal fragmentation, reducing effective page density and query performance. Compaction improves density but costs CPU and write amplification. Real engines balance this with background vacuum/cleanup or on-threshold compaction policies. In your projects, force yourself to measure page fill-factor over time under mixed workloads; this converts a theoretical concept into observable behavior.
Finally, page formats are long-lived commitments. A casual header change can invalidate old files or require migration tooling. Introduce format versioning from day one, and include compatibility tests that open historical files. This one decision drastically reduces future pain and mirrors production reality.
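The encoding discipline above can be sketched in C. This is a minimal illustration, not a prescribed format: the field names, the 24-byte layout, and the "TDB1" magic are assumptions chosen for the example. The key habits it shows are explicit field widths, a fixed byte order independent of the host CPU, and a version gate checked on decode.

```c
#include <stdint.h>
#include <string.h>

/* Illustrative header: every field has an explicit width, and bytes are
   written in a fixed (little-endian) order, never via a raw struct dump. */
typedef struct {
    uint8_t  magic[4];    /* "TDB1" in this example */
    uint16_t version;     /* format version gate */
    uint16_t flags;
    uint32_t page_id;
    uint64_t lsn;
    uint32_t checksum;
} PageHeader;

static void put_u16le(uint8_t *p, uint16_t v) { p[0] = v & 0xFF; p[1] = v >> 8; }
static uint16_t get_u16le(const uint8_t *p) { return (uint16_t)(p[0] | (p[1] << 8)); }
static void put_u32le(uint8_t *p, uint32_t v) {
    for (int i = 0; i < 4; i++) p[i] = (v >> (8 * i)) & 0xFF;
}
static uint32_t get_u32le(const uint8_t *p) {
    uint32_t v = 0;
    for (int i = 0; i < 4; i++) v |= (uint32_t)p[i] << (8 * i);
    return v;
}
static void put_u64le(uint8_t *p, uint64_t v) {
    for (int i = 0; i < 8; i++) p[i] = (v >> (8 * i)) & 0xFF;
}
static uint64_t get_u64le(const uint8_t *p) {
    uint64_t v = 0;
    for (int i = 0; i < 8; i++) v |= (uint64_t)p[i] << (8 * i);
    return v;
}

/* Serialize field-by-field; memcpy of the struct would leak padding
   and host endianness into the file. */
void header_encode(const PageHeader *h, uint8_t buf[24]) {
    memcpy(buf, h->magic, 4);
    put_u16le(buf + 4, h->version);
    put_u16le(buf + 6, h->flags);
    put_u32le(buf + 8, h->page_id);
    put_u64le(buf + 12, h->lsn);
    put_u32le(buf + 20, h->checksum);
}

int header_decode(const uint8_t buf[24], PageHeader *h) {
    memcpy(h->magic, buf, 4);
    if (memcmp(h->magic, "TDB1", 4) != 0) return -1;  /* not our file */
    h->version = get_u16le(buf + 4);
    if (h->version > 2) return -2;                    /* future format: refuse */
    h->flags = get_u16le(buf + 6);
    h->page_id = get_u32le(buf + 8);
    h->lsn = get_u64le(buf + 12);
    h->checksum = get_u32le(buf + 20);
    return 0;
}
```

Note the decode path returns distinct errors for "unknown file" versus "newer format than I understand"; that distinction is exactly what version gating buys you.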
How this fits on projects
- Projects 1, 3, 4, and 7 rely directly on page format choices.
- Every benchmark and recovery test is constrained by page-level behavior.
Definitions & key terms
- Page: Fixed-size block used as primary storage I/O unit.
- Slot directory: Array mapping logical slot IDs to payload offsets.
- Heap page: Page storing table tuples/records.
- Fill factor: Effective used-space ratio for a page.
- Format version: Header field indicating on-disk schema version.
Mental model diagram
Page N
┌──────────────────────────────────────────────────────────────┐
│ Header: magic | version | page_id | LSN | checksum | flags │
├──────────────────────────────────────────────────────────────┤
│ Slot[0] -> off=4012 len=40 │
│ Slot[1] -> off=3960 len=52 │
│ Slot[2] -> tombstone │
├──────────────────────────────────────────────────────────────┤
│ Free Space │
├──────────────────────────────────────────────────────────────┤
│ Record bytes packed from end of page toward middle │
└──────────────────────────────────────────────────────────────┘
Invariant: slot offsets always point to valid, non-overlapping record ranges
How it works (step-by-step, with invariants and failure modes)
- Read page header; verify magic, version, and checksum.
- Validate slot array bounds and monotonic free-space pointers.
- Resolve (page_id, slot_id) to record bytes via slot offset.
- For insert, allocate payload then write slot metadata atomically in memory.
- On flush, write page through buffer manager with WAL ordering guarantees.
Failure modes:
- Slot offset out of bounds -> corruption flag.
- Overlapping payload regions -> allocator/compaction bug.
- Header/version mismatch -> incompatible file or torn write.
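The invariant checks above can be expressed as a small C verifier. This is a sketch under assumed structures: the Page and Slot layouts, the capacity constants, and the tombstone marker are all illustrative, not a real engine's format. It enforces invariants (1) and (2) from the Deep Dive: every live slot points inside the payload region, and no two live records overlap.

```c
#include <stdint.h>

/* Hypothetical in-memory view of a slotted page; offsets are byte
   positions relative to the start of the page. */
#define PAGE_SIZE 4096
#define HDR_SIZE  24
#define SLOT_DEAD 0xFFFF   /* tombstone marker (assumed convention) */

typedef struct { uint16_t off; uint16_t len; } Slot;

typedef struct {
    uint16_t nslots;
    uint16_t free_lo;   /* end of slot array (grows up from the front) */
    uint16_t free_hi;   /* start of payload region (grows down from the back) */
    Slot     slots[256];
} Page;

/* Return 0 if all slot invariants hold, -1 for a bad free-space window,
   or (first bad slot index + 1) so the caller can report the offender. */
int page_check(const Page *pg) {
    if (pg->free_lo > pg->free_hi || pg->free_hi > PAGE_SIZE) return -1;
    for (int i = 0; i < pg->nslots; i++) {
        const Slot *s = &pg->slots[i];
        if (s->off == SLOT_DEAD) continue;               /* tombstone: skip */
        /* (1) payload must sit between free space end and page end */
        if (s->off < pg->free_hi || s->off + s->len > PAGE_SIZE) return i + 1;
        /* (2) no overlap with any other live record */
        for (int j = 0; j < i; j++) {
            const Slot *t = &pg->slots[j];
            if (t->off == SLOT_DEAD) continue;
            if (s->off < t->off + t->len && t->off < s->off + s->len)
                return i + 1;
        }
    }
    return 0;
}
```

Running this check on every page read during development is cheap insurance: it turns silent corruption into a loud, localized failure.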
Minimal concrete example
Pseudo-layout (little-endian chosen by your format contract):
header.magic = "TDB1"
header.version = 2
header.page_size = 4096
slot[17] = {offset: 0x0F20, length: 31, flags: LIVE}
record@0x0F20 = [key_len=5][val_len=14][key="email"][value="a@b.example"]
Common misconceptions
- “Rows are stored directly one after another forever.” Correction: updates/deletes require indirection and free-space control.
- “A struct dump is a stable file format.” Correction: padding and endianness make this unsafe.
- “Compaction is always good.” Correction: compaction costs CPU/I/O and can increase write amplification.
Check-your-understanding questions
- Why does slot indirection reduce logical record instability?
- What corruption class does checksum catch vs not catch?
- Why is format versioning needed even in personal projects?
Check-your-understanding answers
- Because (page, slot) remains stable while payload bytes can move.
- It catches many accidental/torn mutations, but not all logical semantic errors.
- Any field-layout change can break older files without version gates.
Real-world applications
- SQLite table/index pages
- PostgreSQL heap and index pages
- InnoDB index page structures
Where you’ll apply it
Project 1, Project 3, Project 4, Project 7
References
- SQLite file format: https://www.sqlite.org/fileformat.html
- PostgreSQL page layout: https://www.postgresql.org/docs/current/storage-page-layout.html
- MySQL InnoDB physical index structure: https://dev.mysql.com/doc/refman/8.4/en/innodb-physical-structure.html
Key insights
A database file format is a long-lived external contract, not an internal implementation detail.
Summary
Page format quality determines whether every later subsystem (indexing, caching, recovery) remains tractable or becomes fragile.
Homework/Exercises to practice the concept
- Design a 4 KiB page header with forward-compatible reserved fields.
- Simulate 10,000 random inserts/deletes and track fill factor drift.
- Write an offline checker spec that validates all page invariants.
Solutions to the homework/exercises
- Include magic/version/page_id/LSN/checksum/free-space offsets/reserved bytes.
- Expect fragmentation to rise; compaction threshold can restore density.
- Validation order: header -> slot bounds -> offset overlap -> checksum -> semantic field checks.
Concept 2: B+Tree Indexes and Access Paths
Fundamentals
B+ trees are balanced, high fan-out search trees optimized for block storage. Internal nodes route lookups; leaf nodes hold sorted keys and payload references. High fan-out keeps tree height small, so lookup cost is logarithmic with very low constant factors in terms of page reads. This is why they dominate row-store indexes. Unlike binary trees, B+ trees are page-aware: one node is typically one page, and node occupancy policy directly controls performance. Splits and merges preserve order and balance while maintaining invariants such as sorted key ranges and consistent child separators. In practical database engines, B+ trees are tightly coupled with page layout, buffer pools, and concurrency semantics.
Deep Dive
A B+ tree’s strength comes from matching algorithmic shape to storage economics. Random I/O is expensive relative to CPU arithmetic. So instead of pointer-heavy in-memory trees, a B+ tree packs many keys per page. If an internal page can hold hundreds of key separators, height stays tiny even for millions of records. For example, fan-out 200 with height 4 already addresses billions of key intervals. That means a lookup can often complete in a handful of page accesses, many of which may already be cached.
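The fan-out arithmetic is worth making concrete. The tiny helper below (illustrative, not part of any engine API) computes the minimum number of levels needed to address a keyspace, treating fan-out as fully packed; real trees run below 100% occupancy, so add roughly one level of slack when estimating.

```c
#include <stdint.h>

/* Smallest height h such that fanout^h >= nkeys. This is an idealized
   routing view: it ignores occupancy slack and leaf payload width, so a
   real tree may be one level deeper. */
int btree_height(uint64_t nkeys, uint64_t fanout) {
    int h = 0;
    uint64_t reach = 1;   /* key intervals addressable at height h */
    while (reach < nkeys) {
        reach *= fanout;
        h++;
    }
    return h;
}
```

For example, fan-out 200 reaches 1.6 billion intervals at height 4, while fan-out 64 needs five levels for 100 million keys and fan-out 256 needs only four: each level saved is one potential page read saved per lookup.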
Node format design is critical. Internal nodes need key separator arrays and child page IDs; leaf nodes need key-value references (or row IDs) plus sibling pointers for range scans. Prefix compression or fence keys may be used to improve occupancy, but early educational implementations should prioritize clarity: fixed-width keys first, then variable-length support. You should still model split policy explicitly (median split vs biased split for append-heavy workloads) because it shapes write amplification and cache behavior.
Insertion path: descend to leaf, insert key in order, split if overflow, propagate separator to parent, possibly cascading to root. Root split creates a new root and increases height by one. Deletion path is harder: remove key, detect underflow, then borrow from sibling or merge and update parent separators. Many educational implementations delay full deletion complexity because merge correctness is easy to get wrong. If you do implement deletion, keep invariant checks aggressive and run randomized differential tests.
Range scans are a B+ tree’s practical edge. Because leaves are linked, scanning a key interval avoids repeatedly returning to upper nodes. This makes ordered retrieval and prefix predicates efficient. It also enables index-only plans when leaf payload includes enough columns. In storage engines, this property is central to query performance for ORDER BY, BETWEEN, and incremental pagination patterns.
Failure modes in B+ trees usually come from separator/key-boundary mistakes, not the high-level algorithm itself. A single off-by-one in split propagation can cause missing or duplicated key visibility. Another common bug is parent separator stale after child mutation, which manifests as intermittent lookup misses. To defend against this, implement an offline verifier: in-order traversal must return globally sorted keys; each parent key must equal or bound child ranges; all leaves must be at equal depth; sibling links must be bidirectionally consistent.
Concurrency adds another layer. Production engines use latch coupling or optimistic lock-crabbing to avoid structural races. For learning, start single-threaded for correctness, then add coarse-grain latches at node level. The lesson is that structural modifications (splits/merges) are not just data updates; they are topology updates. WAL integration must capture enough info for redo/undo of both record mutations and structure changes.
Finally, B+ trees are not universally best. Hash indexes can beat them for point reads under stable conditions, and LSM trees can outperform on write-heavy ingestion. But for mixed read/write with order-sensitive queries, B+ remains the baseline architecture in major engines.
How this fits on projects
- Core of Project 3
- Used by Project 6 and integrated in Project 7
Definitions & key terms
- Fan-out: Number of child pointers in an internal node.
- Separator key: Parent key dividing child key ranges.
- Leaf chain: Linked list across leaf pages for range scans.
- Split propagation: Parent updates caused by child overflow.
- Underflow: Node occupancy below minimum threshold.
Mental model diagram
[ 40 | 80 ]
/ | \
[10|20|30] [50|60|70] [90|95]
| | |
Leaf A <-----> Leaf B <-----> Leaf C
Lookup(65): root -> middle child -> Leaf B
Range(55..92): start Leaf B, then follow leaf sibling links
How it works (step-by-step, with invariants and failure modes)
- Compare key with separators from root to leaf.
- Insert in leaf maintaining sorted order.
- If overflow, split node and promote separator.
- Repeat upward until no overflow or new root created.
Invariants:
- Keys sorted inside every node.
- All leaves at same depth.
- Parent separators correctly partition child ranges.
Failure modes:
- Separator stale after split -> missed lookups.
- Incorrect sibling links -> broken range scan.
- Partial structural write without WAL safety -> corruption after crash.
Minimal concrete example
Pseudo-event sequence for inserting key=58:
before: parent keys [40|80], middle leaf keys [50,55,60,70]
insert -> [50,55,58,60,70] (overflow)
split -> left [50,55], right [58,60,70]
promote separator 58 into parent
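The same event sequence can be written as a leaf-level C sketch. The Leaf structure, the tiny capacity, and fixed-width int keys are assumptions for readability; a real leaf would hold variable-length keys inside a page. It shows the B+ tree detail that trips people up: the promoted separator is the first key of the new right leaf and stays in the leaf, unlike a classic B-tree where it moves up.

```c
#include <string.h>

#define LEAF_CAP 4   /* tiny capacity so the example overflows quickly */

typedef struct {
    int nkeys;
    int keys[LEAF_CAP + 1];   /* one spare slot: insert first, then split */
} Leaf;

/* Insert in sorted order; return 1 if the leaf now overflows. */
int leaf_insert(Leaf *l, int key) {
    int i = l->nkeys;
    while (i > 0 && l->keys[i - 1] > key) {
        l->keys[i] = l->keys[i - 1];
        i--;
    }
    l->keys[i] = key;
    l->nkeys++;
    return l->nkeys > LEAF_CAP;
}

/* Median split: left keeps the low half, right takes the rest.
   The separator returned to the caller must be installed in the parent;
   forgetting that propagation is the classic "missing keys" bug. */
int leaf_split(Leaf *l, Leaf *right) {
    int mid = l->nkeys / 2;
    right->nkeys = l->nkeys - mid;
    memcpy(right->keys, l->keys + mid, (size_t)right->nkeys * sizeof(int));
    l->nkeys = mid;
    return right->keys[0];   /* separator to promote into the parent */
}
```

Replaying the transcript: inserting 58 into [50,55,60,70] overflows, the split leaves [50,55] and [58,60,70], and 58 is promoted.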
Common misconceptions
- “B-tree and B+ tree are practically identical.” Correction: leaf-only payload and linked leaves materially affect scan behavior.
- “O(log n) is enough to guarantee speed.” Correction: fan-out, cache hit rate, and page locality dominate constants.
- “Deletion is symmetric with insertion.” Correction: underflow/merge handling is often trickier.
Check-your-understanding questions
- Why does high fan-out lower disk I/O depth?
- Why are leaf links essential for range scans?
- What verifier checks would catch separator corruption?
Check-your-understanding answers
- More keys per node means fewer levels for same keyspace.
- They avoid repeated parent traversal for sequential ranges.
- Parent-child range consistency and full in-order traversal monotonicity.
Real-world applications
- SQLite B-tree tables/indexes
- InnoDB clustered and secondary indexes
- PostgreSQL B-tree access method
Where you’ll apply it
Project 3, Project 6, Project 7
References
- SQLite architecture: https://www.sqlite.org/arch.html
- InnoDB index structure: https://dev.mysql.com/doc/refman/8.4/en/innodb-physical-structure.html
- PostgreSQL storage internals: https://www.postgresql.org/docs/current/storage-page-layout.html
Key insights
B+ trees win because they are shaped for page I/O economics, not abstract pointer operations.
Summary
Correct separator logic and verification tooling matter more than clever split heuristics in early implementations.
Homework/Exercises to practice the concept
- Design an invariant checker for a persisted B+ tree file.
- Compare estimated depth for fan-out 64 vs 256 at 100M keys.
- Describe how split policy changes write amplification for append-heavy keys.
Solutions to the homework/exercises
- Validate depth uniformity, sortedness, parent bounds, leaf-link consistency.
- Height drops materially as fan-out rises; fewer I/O hops per lookup.
- Right-biased splits can reduce repeated rewrites for append-heavy patterns.
Concept 3: Buffer Pool, Caching, and I/O Control
Fundamentals
The buffer pool is the database’s controlled cache of pages in RAM. It mediates access between query operators and disk files while tracking dirty state, pin counts, and eviction eligibility. Without a buffer pool, every logical operation would hit disk directly, causing severe performance collapse. With one, hot pages stay in memory, updates are grouped, and I/O is shaped into more efficient patterns. Correct buffer-pool behavior depends on strict contracts: pin before use, unpin after use, never evict pinned frames, and write dirty pages safely. Buffer management is where many first engines become either fast-but-wrong or safe-but-slow.
Deep Dive
A buffer pool is often misunderstood as “just an LRU cache.” In reality, it is a concurrency-aware, durability-aware control plane for pages. Each frame tracks at least: page_id, dirty flag, pin count, reference metadata, and page LSN/checksum fields. Query and storage components request pages by page ID. On hit, frame is returned with pin incremented. On miss, a victim frame is selected, flushed if dirty, then reused for the requested page.
The pin/unpin contract is foundational. A pinned page indicates active use; eviction of pinned pages is forbidden. Most mysterious corruption in student engines comes from violating this indirectly: stale pointers after frame reuse, forgotten unpin paths on error, or double-unpin under retries. Build explicit guards and panic-mode assertions around pin count transitions.
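Those guards can be as small as this sketch. The Frame struct and function names are illustrative, but the discipline is the point: pin transitions are asserted loudly, a double-unpin aborts immediately instead of corrupting state later, and eviction eligibility is a single explicit predicate.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef struct {
    uint32_t page_id;
    int      pin_count;
    bool     dirty;
} Frame;

/* Pin before use: the caller now holds a reference that blocks eviction. */
void frame_pin(Frame *f) {
    assert(f->pin_count >= 0);
    f->pin_count++;
}

/* Unpin after use; a double-unpin is a caller bug, so fail loudly here
   rather than corrupt a reused frame much later. */
void frame_unpin(Frame *f, bool made_dirty) {
    assert(f->pin_count > 0 && "unpin without matching pin");
    f->pin_count--;
    if (made_dirty) f->dirty = true;
}

/* Eviction eligibility: only unpinned frames may ever be reused. */
bool frame_evictable(const Frame *f) {
    return f->pin_count == 0;
}
```

In debug builds these asserts stay on permanently; the bugs they catch (forgotten unpin on an error path, unpin-after-free) are far cheaper to find at the transition than at the symptom.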
Eviction policy is a design lever. Pure LRU can be scan-unfriendly because large sequential reads evict hot OLTP pages. CLOCK and variants reduce overhead and resist scan pollution. Production systems often use multi-queue or adaptive policies. For this guide, implement CLOCK or 2Q-like behavior after baseline LRU so you can observe the difference under mixed workload traces.
Dirty-page handling must align with WAL protocol. If WAL exists, you cannot flush a dirty page whose page-LSN exceeds durable WAL-LSN; doing so breaks recovery assumptions. This WAL-before-data constraint makes buffer flushing a correctness problem, not merely a performance one. Background flushers help smooth latency by writing dirty pages proactively instead of during foreground evictions.
Another deep issue is double caching: OS page cache plus DB buffer pool. Embedded engines sometimes rely heavily on OS cache; server engines prefer tighter control and may use direct I/O in some contexts. Even without direct I/O, you should reason about read amplification: a logical page miss can cause kernel-level reads, user-space copies, and metadata overhead. Instrumenting hit ratio alone is insufficient; track bytes read/written, flush batch sizes, and stall causes.
Prefetching and locality hints matter. If range scan is detected, prefetch next leaf pages can reduce misses. But aggressive prefetch can evict useful pages. As with all buffering choices, the right behavior depends on workload shape. This is why benchmark scenarios must include at least: random point lookups, sequential scans, and mixed read/write OLTP.
Failure modes include dirty-loss (dirty page evicted without flush), latch contention hotspots (single global mutex around all frames), and livelock during victim search when most pages are pinned. Detect these with metrics: eviction retries, average victim-search steps, flush queue depth, and pin lifetime histograms.
Finally, remember that buffer pools are testable in isolation. Build deterministic traces (sequence of page requests with pins/unpins) and assert final frame state. Isolated buffer tests give you confidence before integration with B+ tree or WAL.
How this fits on projects
- Core of Project 4
- Critical dependency in Project 5, Project 6, and Project 7
Definitions & key terms
- Frame: In-memory slot holding one page.
- Pin count: Number of active users holding a page.
- Dirty page: Page modified in memory but not persisted.
- Victim selection: Choosing an eviction candidate.
- Writeback: Flushing dirty pages to durable storage.
Mental model diagram
Disk file pages Buffer Pool Frames
┌───────────────┐ ┌──────────────────────────────┐
│ Page 41 │<--- miss/load ---│ Frame 0: page=41 pin=2 dirty │
│ Page 42 │<--- miss/load ---│ Frame 1: page=42 pin=0 clean │
│ Page 43 │<--- miss/load ---│ Frame 2: page=43 pin=1 dirty │
└───────────────┘ │ Victim policy: CLOCK hand -> │
└──────────────────────────────┘
Rule: never evict pinned frame; dirty flush must honor WAL ordering
How it works (step-by-step, with invariants and failure modes)
- fetch(page_id) checks hash map for frame hit.
- On miss, pick victim frame with pin_count == 0.
- If victim dirty, flush safely (WAL constraint if enabled).
- Read requested page into frame, pin it, return handle.
- Caller unpins after use; dirty mark set on modification.
Invariants:
- Pin count never negative.
- Evicted frame always unpinned.
- Dirty page flush success required before frame reuse.
Failure modes:
- Forgotten unpin -> throughput collapse/livelock.
- Flush without WAL safety -> unrecoverable state.
- Global lock bottleneck -> poor multicore scaling.
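The victim-selection step can be sketched as a CLOCK sweep. The Pool layout and frame count are assumptions for the example; what matters is the three-way decision at each tick: pinned frames are never eviction candidates, referenced frames get a second chance, and a bounded sweep turns the all-pinned livelock into a detectable error instead of an infinite loop.

```c
#include <stdbool.h>
#include <stdint.h>

#define NFRAMES 4

typedef struct {
    uint32_t page_id;
    int      pin_count;
    bool     dirty;
    bool     ref;       /* CLOCK reference bit, set on each access */
} Frame;

typedef struct {
    Frame frames[NFRAMES];
    int   hand;         /* CLOCK hand position */
} Pool;

/* Sweep the hand: skip pinned frames, clear-and-pass referenced frames,
   stop at an unpinned, unreferenced frame. Two full sweeps without a
   victim means every frame is pinned (a livelock condition to surface). */
int clock_victim(Pool *p) {
    for (int step = 0; step < 2 * NFRAMES; step++) {
        Frame *f = &p->frames[p->hand];
        int idx = p->hand;
        p->hand = (p->hand + 1) % NFRAMES;
        if (f->pin_count > 0) continue;            /* in use: never evict */
        if (f->ref) { f->ref = false; continue; }  /* second chance */
        return idx;  /* caller must flush first if f->dirty (WAL gate) */
    }
    return -1;       /* all frames pinned: report, don't spin forever */
}
```

Note the function only chooses a victim; the dirty flush (with its WAL-ordering precondition) remains the caller's responsibility, which keeps policy and durability concerns separable and testable.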
Minimal concrete example
Trace:
fetch(10) miss -> frame2 pin=1
fetch(11) miss -> frame3 pin=1
unpin(10, dirty=true)
fetch(12) miss -> CLOCK selects frame2 (pin=0, dirty) -> flush page 10 first, then reuse frame2
Common misconceptions
- “High hit ratio means good performance.” Correction: tail latency can still be bad due to flush stalls.
- “Eviction policy is purely optimization.” Correction: policy influences correctness pressure and lock behavior.
- “Pinned pages can be force-evicted if needed.” Correction: that breaks active readers/writers.
Check-your-understanding questions
- Why is pin discipline a correctness requirement?
- How can WAL rules constrain flush timing?
- Why can sequential scans punish naive LRU?
Check-your-understanding answers
- Frame reuse while in use causes use-after-free-like corruption.
- Data page with higher page-LSN cannot be flushed before durable WAL.
- Scan pages displace hot pages, creating avoidable miss storms.
Real-world applications
- PostgreSQL shared buffers
- InnoDB buffer pool
- SQLite pager + page cache
Where you’ll apply it
Project 4, Project 5, Project 6, Project 7
References
- SQLite architecture (pager/page cache): https://www.sqlite.org/arch.html
- PostgreSQL WAL details: https://www.postgresql.org/docs/current/wal-intro.html
Key insights
Buffer management is the runtime governor of both performance and correctness.
Summary
If page pins, flush ordering, and eviction invariants are weak, every higher-layer subsystem inherits instability.
Homework/Exercises to practice the concept
- Design metrics dashboard fields for buffer health.
- Write a deterministic request trace that breaks naive LRU.
- Specify WAL-aware flush preconditions.
Solutions to the homework/exercises
- Include hit ratio, dirty %, avg pin lifetime, eviction retries, flush latency.
- Alternate long scans with hot random set to show scan pollution.
- Only flush when durable_wal_lsn >= page_lsn.
Concept 4: WAL, Checkpointing, and Crash Recovery
Fundamentals
Write-Ahead Logging (WAL) is the core durability pattern in modern databases. The central rule is simple and strict: log records describing page modifications must be persisted before corresponding data pages are written. With this rule, engines can recover from crashes by replaying log records. Checkpoints shorten restart time by recording progress and reducing the portion of log that must be scanned. Recovery algorithms typically run analysis/redo/undo phases (or a reduced subset for simpler engines). WAL is not only about durability; it also unlocks performance by allowing data pages to flush lazily instead of forcing all dirty data on each commit.
Deep Dive
WAL exists because forcing every modified data page at commit is too expensive and unpredictable. A single transaction might touch many pages across the file; forcing all of them introduces heavy random writes and high latency variance. WAL flips the contract: commit durability is tied to sequential log persistence, not full data-page flush. Since logs are append-oriented, storage devices handle them efficiently.
A robust WAL design includes: log sequence numbers (LSNs), record types (BEGIN/UPDATE/COMMIT/ABORT/CHECKPOINT), transaction identifiers, and enough payload to redo (and optionally undo) actions. Each updated page stores page-LSN so recovery can decide whether a log record is already reflected on disk. This page-LSN bridge is essential for idempotent replay.
Recovery generally starts from last checkpoint. Analysis reconstructs in-flight transactions and dirty page state. Redo re-applies necessary committed updates (or all history with page-LSN checks, depending on model). Undo rolls back loser transactions if undo logging exists. Full ARIES-style recovery is advanced; for this sprint, a constrained redo/undo model is enough if invariants are explicit and test harnesses are strong.
Checkpoints are performance tradeoffs. Frequent checkpoints reduce restart time but increase runtime write pressure. Infrequent checkpoints reduce overhead but elongate recovery windows. Production systems tune this against RTO objectives and storage limits. In your learning engine, expose checkpoint interval as a config and benchmark recovery time vs steady-state throughput.
fsync semantics and ordering are non-negotiable. A common student bug is writing log records with buffered stdio and assuming durability. Durability requires explicit flush to stable storage and careful handling of partial writes. Group commit is a practical optimization: multiple transactions share one log flush, amortizing sync overhead. PostgreSQL documentation explicitly highlights this benefit.
Failure taxonomy matters: process crash, OS crash, power loss, media failure. WAL primarily addresses the first three for a healthy storage device; media failure requires backups/replication beyond this guide. Your tests should include kill -9 at randomized points, simulated torn log tails, and restart idempotence checks (running recovery twice should not diverge state).
Another subtlety is interaction with buffer flush. If a page dirty from LSN 120 is flushed while durable log only reaches LSN 100, recovery cannot reconstruct the exact state safely. This violates WAL. So buffer managers must gate flush by durable LSN watermark. This cross-subsystem dependency is where many implementations fail.
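That gate is small enough to write down directly. The names below (durable_wal_lsn, page_flush_allowed) are illustrative; the invariant they encode is the one from the example: a dirty page whose page-LSN is ahead of the durable log watermark must not reach disk yet.

```c
#include <stdbool.h>
#include <stdint.h>

/* Durable log watermark, advanced only after a successful WAL fsync.
   A single global is a toy simplification; a real engine shares this
   watermark between the log writer and the buffer manager. */
static uint64_t durable_wal_lsn = 0;

void wal_mark_durable(uint64_t lsn) {
    if (lsn > durable_wal_lsn) durable_wal_lsn = lsn;
}

/* WAL-before-data gate: the buffer manager must call this before any
   dirty-page writeback, whether foreground eviction or background flush. */
bool page_flush_allowed(uint64_t page_lsn) {
    return durable_wal_lsn >= page_lsn;
}
```

When the gate refuses, the flusher has two options: pick a different victim, or force the log forward first; either way, the cross-subsystem dependency is now an explicit check rather than an accident of timing.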
Finally, WAL format evolution should be versioned and self-describing enough for offline tools. During outages, operators rely on log inspection to answer: what committed, what was in flight, what was replayed. Build that observability early.
How this fits on projects
- Central in Project 5
- Dependency for correctness in Project 6 and Project 7
Definitions & key terms
- LSN: Monotonic identifier for log positions.
- Redo: Reapply updates that may not be on data pages yet.
- Undo: Roll back uncommitted updates.
- Checkpoint: Durable marker reducing recovery work.
- Group commit: Batching commits behind one log sync.
Mental model diagram
TXN updates page P
│
├──> append WAL record (LSN=9001)
│
├──> fsync WAL up to 9001 [COMMIT durable]
│
└──> data page flush can happen later
Crash -> restart -> replay WAL records with LSN > page_lsn
How it works (step-by-step, with invariants and failure modes)
- On logical update, emit WAL record with txn_id and redo payload.
- Assign new LSN and set page-LSN in dirty buffer copy.
- On commit, force WAL flush through commit LSN.
- Data pages flush asynchronously when allowed.
- On restart, scan from checkpoint and redo/undo as required.
Invariants:
- Durable WAL precedes durable data for same LSN.
- Recovery replay is idempotent using page-LSN checks.
- Committed state persists across crash tests.
Failure modes:
- Torn log record tail not detected -> invalid replay.
- Data flush before WAL flush -> durability violation.
- Missing transaction end markers -> ambiguous recovery state.
Minimal concrete example
WAL transcript:
[LSN 410] BEGIN txn=7
[LSN 418] UPDATE txn=7 page=12 slot=4 old=... new=...
[LSN 430] COMMIT txn=7
fsync(log, <=430)
# crash before page 12 flush
restart -> REDO LSN 418 onto page 12 (if page_lsn < 418)
Common misconceptions
- “If commit returns, data pages must already be written.” Correction: WAL allows commit before data-page flush.
- “Recovery replay is one-pass always.” Correction: depends on algorithm and metadata.
- “Checkpoint means no recovery needed.” Correction: it only narrows required log window.
Check-your-understanding questions
- Why is page-LSN required for idempotent redo?
- How does group commit reduce overhead?
- Which crash classes are not solved by WAL alone?
Check-your-understanding answers
- It tells whether update is already reflected on page.
- Many transactions amortize one sync call.
- Media loss/corruption without backup/replica strategy.
Real-world applications
- PostgreSQL WAL and point-in-time recovery
- SQLite rollback journal and WAL modes
- InnoDB redo logging
Where you’ll apply it
Project 5, Project 6, Project 7
References
- PostgreSQL WAL intro: https://www.postgresql.org/docs/current/wal-intro.html
- SQLite file format + WAL sections: https://www.sqlite.org/fileformat.html
- IBM ARIES publication page: https://research.ibm.com/publications/aries-a-transaction-recovery-method-supporting-fine-granularity-locking-and-partial-rollbacks-using-write-ahead-logging
Key insights WAL is a write-ordering contract that converts crashes from catastrophe into replay work.
Summary Recovery correctness depends on strict LSN discipline, not just “having a log file.”
Homework/Exercises to practice the concept
- Define a WAL record schema for update + commit.
- Design crash tests that kill process at three random points per transaction.
- Explain how checkpoint interval affects RTO and steady-state throughput.
Solutions to the homework/exercises
- Include txn_id, page_id, slot_id, before/after image or logical delta, checksum.
- Kill before commit flush, after commit flush, and during data flush.
- Short interval improves restart time but increases background write pressure.
Concept 5: Concurrency Control and Isolation Semantics
Fundamentals
Concurrency control prevents correctness anomalies when multiple transactions interleave on shared data. Isolation levels define which anomalies are tolerated: dirty reads, non-repeatable reads, phantoms, and serialization anomalies. Engines use lock-based methods, MVCC, or hybrids. Locking provides explicit exclusion; MVCC provides snapshot reads with version chains and conflict checks. Practical systems combine both to balance throughput and correctness. Understanding isolation is essential because many bugs are not crashes or wrong formulas; they are legal interleavings that violate business invariants.
Deep Dive
The hardest part of database correctness is not single-thread logic; it is behavior under interleaving. Transaction isolation is the lens that makes those behaviors explicit. SQL standard levels define minimal guarantees, but implementations differ. PostgreSQL, for example, maps these levels onto MVCC semantics and documents anomaly behavior clearly.
Two broad families dominate. Pessimistic locking acquires locks before conflicting actions, often with two-phase locking (2PL): growing phase acquires locks, shrinking phase releases. Strict 2PL delays release until commit/abort, simplifying recovery and serializability reasoning. Cost: blocking and deadlocks. Engines need deadlock detection or timeout policies.
MVCC avoids many read-write conflicts by storing row versions with visibility rules (created_xid/deleted_xid or equivalent metadata). Readers choose versions visible to their snapshot, while writers create new versions. This improves read concurrency but introduces garbage-collection/vacuum complexity and conflict detection rules at commit for stronger isolation levels.
Anomaly literacy is non-negotiable. Lost update, write skew, and read skew are common production failure modes. Developers often think “transaction” implies serial behavior, but default isolation levels usually do not provide full serializability. Your learning lab should include explicit anomaly scripts where two sessions execute deterministic interleavings and show outcomes at Read Committed vs Serializable.
Lock granularity and intent also matter. Row locks maximize concurrency but increase metadata overhead. Page/table locks are cheaper to manage but reduce parallelism. Index-range locking or predicate locking is needed to prevent phantoms in lock-based serializable schemes. MVCC engines often implement Serializable Snapshot Isolation with dependency tracking rather than pure predicate locks.
Recovery coupling: concurrency metadata and WAL must agree. If a transaction aborts after partial updates, undo visibility or rollback operations must restore logical consistency. In MVCC, abort may mark new versions invalid; in lock-based systems, undo may physically restore prior images. Either way, the transaction manager and recovery manager are not separable concerns.
Testing concurrency requires harnesses that control timing. Random stress alone rarely reproduces precise anomalies. Use barrier-based test scripts: session A step1, session B step1, session A step2, etc., with expected outcomes. Include retry logic tests for serializable failures, because production clients must handle retriable aborts.
Finally, isolation is a product decision, not just technical purity. Stronger isolation increases safety but may reduce throughput. Mature systems expose level selection and require workload-aware choices. The right answer depends on invariants you must protect.
How this fits on projects
- Core of Project 7
- Also influences update semantics in Project 5 and Project 6
Definitions & key terms
- Dirty read: Reading uncommitted data from another transaction.
- Non-repeatable read: Re-reading same row returns different value.
- Phantom: Re-running predicate query returns different row set.
- MVCC: Multi-version concurrency control.
- Serializable: Equivalent to some serial transaction order.
Mental model diagram
Time --->
Txn A: BEGIN ---- read(balance=100) -------- write(balance=90) ---- COMMIT
Txn B: BEGIN -------- read(balance=100) ---- write(balance=80) ---- COMMIT
Without conflict control: lost update possible
With proper control: one txn blocks/aborts/retries
How it works (step-by-step, with invariants and failure modes)
- Transaction starts with chosen isolation level.
- Reads/writes use lock or version visibility rules.
- Conflicts trigger blocking, abort, or retry.
- Commit publishes state according to isolation protocol.
- Abort restores or invalidates uncommitted effects.
Invariants:
- No dirty reads beyond allowed level.
- Conflict handling deterministic per policy.
- Commit order and visibility rules consistent with documented level.
Failure modes:
- Missing conflict check -> lost update.
- Deadlock without detection -> indefinite stalls.
- Retry-unaware clients -> user-visible failures.
Minimal concrete example
Interleaving script (read-committed):
A: read x=5
B: read x=5
A: write x=6, commit
B: write x=6, commit
Result: one increment lost
Serializable policy should force one transaction retry.
Common misconceptions
- “Read committed means always consistent business state.” Correction: complex invariants can still break.
- “MVCC means no locking.” Correction: write-write conflicts and metadata locks remain.
- “Serializable is always too slow.” Correction: overhead depends on workload and conflict shape.
Check-your-understanding questions
- Why can write skew happen under snapshot-style isolation?
- Why are retries part of serializable application logic?
- What extra complexity does MVCC introduce vs lock-only engines?
Check-your-understanding answers
- Concurrent txns can each observe valid snapshot and commit incompatible writes.
- Serializable engines may abort conflicting transactions to preserve serial order.
- Version visibility rules and garbage collection/vacuum lifecycle.
Real-world applications
- PostgreSQL transaction isolation and MVCC
- InnoDB lock + MVCC hybrid behavior
Where you’ll apply it
Project 7 (primary), Project 5, Project 6
References
- PostgreSQL transaction isolation docs: https://www.postgresql.org/docs/current/transaction-iso.html
Key insights Isolation level is a correctness budget; spend it consciously.
Summary Concurrency design is successful only when anomalies are explicitly tested and understood, not assumed away.
Homework/Exercises to practice the concept
- Design two-session scripts showing non-repeatable read and phantom.
- Specify retry policy for serializable aborts.
- Compare lock-based vs MVCC behavior on read-heavy workload.
Solutions to the homework/exercises
- Use ordered barriers to force deterministic interleavings.
- Retry whole transaction with capped exponential backoff.
- MVCC usually improves read concurrency at cost of version maintenance.
Concept 6: SQL Compilation and Volcano-Style Execution
Fundamentals
A SQL engine is a compiler plus runtime. Text query input is tokenized, parsed into an AST, transformed into logical operators, lowered to physical operators, and executed. The classic Volcano iterator model (open, next, close) composes operators into pipelines that pull tuples on demand. Even minimalist engines need this mental split: parsing is syntax, planning is algorithm selection, execution is dataflow over access methods. Query internals become understandable once you model operators as explicit state machines.
Deep Dive
SQL appears declarative, but execution is deeply procedural. The parser first checks grammar and builds structured representation (AST). At this stage, no data has been read. Then planning translates AST nodes into relational algebra primitives: scan, filter, project, join, aggregate, sort. Physical planning chooses concrete algorithms (index scan vs table scan, hash join vs nested-loop join) based on available indexes and crude cost estimates.
The iterator (Volcano) model remains pedagogically powerful because of simplicity and composability. Each operator exposes a next() call returning either one tuple or end-of-stream. Parent operators request tuples from children. A filter operator repeatedly pulls from child until predicate passes. A join operator coordinates two children according to chosen algorithm. This pull model makes execution order, state retention, and backpressure explicit.
For C implementations, memory ownership across operators is a recurring hazard. Decide whether tuples are materialized per operator, borrowed via arena, or referenced via page pointers. Ambiguous ownership causes leaks or use-after-free. A practical educational compromise: immutable tuple slices with operator-local scratch buffers and clear lifetime boundaries at pipeline stage transitions.
Planning can start rule-based. For example: if predicate is equality on indexed column, choose index scan; otherwise full scan. Add cost hints later (estimated rows, selectivity, I/O cost). Even simplistic costing teaches critical lessons: bad cardinality assumptions can produce catastrophic plan choices.
Projection and predicate pushdown are low-complexity, high-impact optimizations. Push filters closer to scans to reduce tuple volume early. Avoid materializing full rows when only selected columns are needed. These two rules provide large gains without complex optimizer machinery.
Failure modes include parser ambiguity bugs, incorrect null semantics, and mismatch between EXPLAIN plan output and actual operator execution. Your project must verify plan/execution consistency by logging operator events and comparing against expected sequence.
The SQL subsystem also closes the loop with storage internals. A plan node is only as good as access paths provided by B+ tree and buffer pool. If index lookup returns unstable row IDs or stale visibility info, query correctness collapses. This is why SQL should be introduced after storage/recovery foundations.
Finally, keep scope strict. A high-quality subset (SELECT, predicate filters, simple joins, inserts/updates/deletes) with reliable execution is far more educational than broad but unreliable syntax coverage.
How this fits on projects
- Core in Project 6
- Integrated end-to-end in Project 7
Definitions & key terms
- AST: Abstract syntax tree for parsed SQL.
- Logical plan: Engine-agnostic relational operator graph.
- Physical plan: Concrete algorithms + access paths.
- Iterator model: Pull-based operator interface.
- Predicate pushdown: Apply filters as early as possible.
Mental model diagram
SQL Text
-> Lexer/Parser -> AST
-> Planner -> Logical Plan
-> Optimizer -> Physical Plan
-> Executor -> Operator Pipeline
[Scan] -> [Filter] -> [Project] -> [Output]
How it works (step-by-step, with invariants and failure modes)
- Tokenize SQL into lexemes.
- Parse tokens into AST with syntax validation.
- Build logical operators from AST structure.
- Choose physical operators/access paths.
- Execute via iterator calls until end-of-stream.
Invariants:
- Parsed query and execution plan preserve query semantics.
- Operator contracts (next) are deterministic and composable.
- EXPLAIN describes the actual runtime operator order.
Failure modes:
- AST misinterpretation -> wrong results.
- Join predicate misuse -> cartesian explosion.
- Incorrect NULL handling -> subtle semantic bugs.
Minimal concrete example
Input SQL:
SELECT name FROM users WHERE age > 30
Physical plan (pseudo):
Project(name)
Filter(age > 30)
TableScan(users)
Common misconceptions
- “Parser is the hard part of SQL engines.” Correction: planning + runtime correctness are usually harder.
- “EXPLAIN is optional.” Correction: it is core observability for query behavior.
- “Small SQL subset means trivial engine.” Correction: correctness across storage + concurrency is still complex.
Check-your-understanding questions
- Why separate logical and physical planning?
- Why does predicate pushdown often provide large gains?
- What should EXPLAIN guarantee?
Check-your-understanding answers
- It separates semantics from algorithm choices.
- It reduces intermediate tuple volume early in pipeline.
- Faithful mapping to actual executed operator pipeline.
Real-world applications
- SQLite parse/codegen/VDBE pipeline
- PostgreSQL planner/executor architecture
Where you’ll apply it
Project 6 and Project 7
References
- SQLite architecture: https://www.sqlite.org/arch.html
Key insights SQL internals become tractable when treated as compiler + runtime, not a monolith.
Summary Reliable query execution depends more on operator contracts and observability than on syntax breadth.
Homework/Exercises to practice the concept
- Draw logical and physical plans for three sample queries.
- Define operator contract tests for scan/filter/project.
- Add one deterministic EXPLAIN snapshot test case.
Solutions to the homework/exercises
- Ensure each logical node maps to clear physical operator.
- Validate tuple counts, ordering assumptions, and EOS signaling.
- Snapshot should include operator tree and chosen access path.
Glossary
- ACID: Atomicity, Consistency, Isolation, Durability guarantees for transactions.
- B+ tree: Balanced tree where internal nodes route and leaves hold sorted entries.
- Buffer pool: Managed in-memory cache for database pages.
- Checkpoint: Recovery marker reducing WAL replay scope.
- Fan-out: Number of child pointers in an internal tree node.
- Heap page: Storage page holding table records.
- LSN: Log sequence number identifying WAL record order.
- MVCC: Multi-version concurrency control for snapshot reads.
- Slotted page: Page format using slot indirection to decouple record position.
- Volcano model: Pull-based operator execution interface.
Why Database Internals Matters
Modern systems are increasingly data-intensive, and backend reliability/performance decisions often reduce to database internals decisions.
Real-world signals:
- In the 2025 Stack Overflow Developer Survey, PostgreSQL was reported by 55.6% of all respondents and 58.2% of professional developers, while SQLite was 37.5% overall. These are implementation-facing technologies used directly by working developers (https://survey.stackoverflow.co/2025/technology).
- SQLite documents that it is likely “used more than all other database engines combined” and estimates over 1 trillion active SQLite databases due to its broad embedding footprint (https://www.sqlite.org/mostdeployed.html).
- Official PostgreSQL docs emphasize WAL as a standard integrity method and explain why sequential log flushes enable both durability and reduced write cost compared to forcing all modified data pages to disk at commit (https://www.postgresql.org/docs/current/wal-intro.html).
Context & Evolution:
- Early engines focused on disk-era constraints and strict page discipline.
- Modern engines add stronger concurrency models, richer optimizers, and cloud operational tooling.
- The core building blocks remain stable: pages, indexes, buffer managers, logs, and isolation models.
Then (naive file app) Now (engineered database)
┌───────────────────────────┐ ┌─────────────────────────────┐
│ append rows to file │ │ page layout + indexing │
│ linear scans for lookup │ ---> │ buffer pool + WAL + planner │
│ ad-hoc crash behavior │ │ explicit recovery semantics │
└───────────────────────────┘ └─────────────────────────────┘
Concept Summary Table
| Concept Cluster | What You Need to Internalize |
|---|---|
| Page-Oriented Storage and Binary Encoding | Stable on-disk contracts need explicit layouts, slot indirection, and invariant checks. |
| B+Tree Indexes and Access Paths | Fan-out and separator correctness dominate lookup and range-scan behavior. |
| Buffer Pool, Caching, and I/O Control | Pin/dirty/eviction discipline controls both latency and correctness. |
| WAL, Checkpointing, and Crash Recovery | Durability is a write-ordering protocol plus deterministic replay behavior. |
| Concurrency Control and Isolation Semantics | Correctness under interleaving requires explicit anomaly handling and retry semantics. |
| SQL Compilation and Volcano Execution | Query processing is a compiler/runtime pipeline tied to storage access paths. |
Project-to-Concept Map
| Project | Concepts Applied |
|---|---|
| Project 1 | Page-Oriented Storage and Binary Encoding |
| Project 2 | Page-Oriented Storage and Binary Encoding, API boundaries from SQL Compilation ecosystem |
| Project 3 | Page-Oriented Storage and Binary Encoding, B+Tree Indexes and Access Paths |
| Project 4 | Buffer Pool, Caching, and I/O Control, Page-Oriented Storage and Binary Encoding |
| Project 5 | WAL, Checkpointing, and Crash Recovery, Buffer Pool, Caching, and I/O Control |
| Project 6 | SQL Compilation and Volcano Execution, B+Tree Indexes and Access Paths |
| Project 7 | Concurrency Control and Isolation Semantics, WAL, Checkpointing, and Crash Recovery, SQL Compilation and Volcano Execution |
Deep Dive Reading by Concept
| Concept | Book and Chapter | Why This Matters |
|---|---|---|
| Page-Oriented Storage and Binary Encoding | Computer Systems: A Programmer’s Perspective (Bryant & O’Hallaron), Ch. 6 | Bridges memory hierarchy assumptions with page-level data layout decisions. |
| Page-Oriented Storage and Binary Encoding | The C Programming Language (K&R), Ch. 5-8 | Sharpens binary I/O, pointer arithmetic, and file discipline needed for on-disk formats. |
| B+Tree Indexes and Access Paths | Algorithms, Fourth Edition (Sedgewick & Wayne), balanced search tree sections | Gives formal tree invariants and complexity reasoning. |
| Buffer Pool, Caching, and I/O Control | Operating Systems: Three Easy Pieces, file-system and caching chapters | Connects database cache policy to OS/page-cache behavior. |
| WAL, Checkpointing, and Crash Recovery | Operating System Concepts (Silberschatz et al.), recovery concepts | Frames crash consistency and logging guarantees. |
| Concurrency Control and Isolation Semantics | Operating System Concepts (Silberschatz et al.), concurrency chapters | Builds lock ordering and deadlock mental models before database-specific variants. |
| SQL Compilation and Volcano Execution | Compilers: Principles and Practice (Dave & Dave), lexical/syntax analysis chapters | Translates parser/compiler mental models directly into query-engine design. |
Quick Start: Your First 48 Hours
Day 1:
- Read Theory Primer Concept 1 and Concept 2.
- Build the Project 1 scaffold and persist the first 100 deterministic key-value records.
- Inspect file bytes with hexdump and reconcile against your format document.
Day 2:
- Run corruption checks on Project 1 and add one malformed-record fixture.
- Read Concept 3 and sketch buffer frame state transitions.
- Start Project 3 lookup-only B+ tree traversal (no splits yet).
Recommended Learning Paths
Path 1: The Systems Builder
- Project 1 -> Project 3 -> Project 4 -> Project 5 -> Project 7
Path 2: The Query Engineer
- Project 1 -> Project 3 -> Project 6 -> Project 7
Path 3: The Reliability Engineer
- Project 1 -> Project 4 -> Project 5 -> Project 7
Success Metrics
- You can explain each committed byte in one page dump from your own engine.
- You can force-crash the engine and recover to identical post-commit state.
- You can demonstrate O(log n) index lookup depth trends across dataset growth.
- You can reproduce at least two isolation anomalies and show how your engine blocks/aborts/retries them.
- You can run EXPLAIN and map each operator to real execution behavior.
Project Overview Table
| Project | Main Focus | Difficulty | Time | Primary Output |
|---|---|---|---|---|
| Project 1 | Persistent heap-file KV store | Beginner | 1 weekend | Deterministic on-disk key-value file format |
| Project 2 | KV client C boundary/API | Intermediate | 4-6 days | Clean C client library with strict ownership rules |
| Project 3 | Disk-backed B+ tree | Advanced | 1-2 weeks | Indexed point lookup + range scan engine |
| Project 4 | Slotted pages + buffer pool | Advanced | 1-2 weeks | Pin/dirty/eviction-aware page manager |
| Project 5 | WAL + crash recovery | Expert | 1-2 weeks | Crash-safe durability harness and recovery tool |
| Project 6 | SQL parser + executor | Expert | 2-3 weeks | Subset SQL REPL with EXPLAIN |
| Project 7 | Isolation + end-to-end integration | Expert | 3-4 weeks | Multi-session TinyDB engine with transactional guarantees |
Project List
The following projects guide you from byte-level persistence to a crash-safe SQL-capable mini database engine.
Project 1: Persistent Key-Value Store
- File: P01-persistent-key-value-store.md
- Main Programming Language: C
- Alternative Programming Languages: Rust, Go
- Coolness Level: Level 3: Genuinely Clever
- Business Potential: Level 1: The “Resume Gold”
- Difficulty: Level 1: Beginner (The Tinkerer)
- Knowledge Area: Storage layout, binary encoding
- Software or Tool: Filesystem + hex inspection tooling
- Main Book: The C Programming Language (K&R)
What you will build: A deterministic key-value CLI backed by a versioned page file.
Why it teaches database internals: It forces direct confrontation with binary encoding, page headers, and corruption boundaries.
Core challenges you will face:
- Record encoding discipline -> maps to Page-Oriented Storage and Binary Encoding
- Page free-space accounting -> maps to Page-Oriented Storage and Binary Encoding
- Deterministic replay tests -> maps to WAL, Checkpointing, and Crash Recovery foundations
Real World Outcome
You run a CLI that persists values across restarts and can be audited at byte level.
$ ./kvdb put user:42:name "Ada"
OK txn=00000019 lsn=0000012A page=0003 slot=0017
$ ./kvdb get user:42:name
VALUE "Ada" page=0003 slot=0017
$ ./kvdb list --prefix user:42
user:42:name="Ada"
$ ./kvdb inspect-page 3
page=3 version=2 slots=18 free=1296 checksum=0x7a31
slot[17] off=3872 len=28 state=LIVE key=user:42:name
The Core Question You Are Answering
“What exact bytes must exist on disk so I can restart and reconstruct state without ambiguity?”
This question matters because every higher-level feature assumes persistent state can be read back unambiguously.
Concepts You Must Understand First
- Slotted page basics
- How can record payload move while slot identity remains stable?
- Book Reference: Computer Systems: A Programmer’s Perspective — Ch. 6
- Binary format versioning
- Which header fields allow future-compatible evolution?
- Book Reference: The C Programming Language — Ch. 5-8
- Checksums and corruption detection
- What failure classes can checksums detect?
- Book Reference: Operating System Concepts — file integrity concepts
Questions to Guide Your Design
- File contract
- Which header fields are mandatory on every page?
- How will you reject unknown format versions safely?
- Record lifecycle
- What does delete mean physically (tombstone vs reclaim)?
- When do you compact a fragmented page?
Thinking Exercise
Design the First 64 Bytes
Draft a fixed 64-byte page header and justify each field.
Questions to answer:
- Which fields are needed for correctness vs debugging?
- Which bytes are reserved for future compatibility?
The Interview Questions They Will Ask
- “Why not serialize structs directly with fwrite?”
- “How do you detect and isolate a corrupted page?”
- “Why are slot directories useful for variable-length records?”
- “How would you migrate file format version 1 to 2 safely?”
- “How do you validate reproducibility of persistence tests?”
Hints in Layers
Hint 1: Start with read-only inspector Define a page inspector first; writing is easier after visibility exists.
Hint 2: Pin down byte order early Choose one endianness in format docs and never mix.
Hint 3: Pseudocode for insert flow
load page -> validate header -> ensure free_space >= record_size+slot_size
write payload bytes at tail -> write slot entry -> update header counters
Hint 4: Corruption harness Flip one byte in a test page and ensure loader rejects it with actionable error.
Books That Will Help
| Topic | Book | Chapter |
|---|---|---|
| Binary I/O in C | The C Programming Language | Ch. 5-8 |
| Memory/page behavior | Computer Systems: A Programmer’s Perspective | Ch. 6 |
| Robust C interfaces | Effective C, 2nd Edition | Memory-related chapters |
Common Pitfalls and Debugging
Problem 1: “Data reads fine until restart, then random keys fail”
- Why: Header offsets were not updated after delete/insert sequence.
- Fix: Recompute and validate free-space boundaries on each mutation.
- Quick test:
./kvdb fuzz-page-layout --seed 42 --iterations 10000
Problem 2: “Page loads on one machine but not another”
- Why: Implicit host-endian struct serialization.
- Fix: Encode/decode fields explicitly with defined byte order.
- Quick test: Run cross-architecture decode fixture.
Definition of Done
- CRUD works across 100 restart cycles with deterministic fixtures
- On-disk format document matches byte-level inspector output
- Corrupted pages are rejected safely (no crashes, clear diagnostics)
- Performance baseline and page fill-factor metrics are recorded
Project 2: Key-Value Store Client Library
- File: P02-kv-client-library.md
- Main Programming Language: C
- Alternative Programming Languages: Rust, Go
- Coolness Level: Level 2: Practical but Useful
- Business Potential: Level 3: Service and Support
- Difficulty: Level 2: Intermediate
- Knowledge Area: API boundary design, protocol framing
- Software or Tool: TCP client + protocol codec
- Main Book: C Interfaces and Implementations
What you will build: A C client library for a KV service with strict ownership and error contracts.
Why it teaches database internals: Client boundaries shape operability, safety, and performance as much as core engine code.
Core challenges you will face:
- Opaque handle contracts -> maps to boundary discipline
- Memory ownership clarity -> maps to correctness and leak avoidance
- Protocol framing robustness -> maps to parse/execute infrastructure
Real World Outcome
$ ./kvclient_demo
connect host=127.0.0.1 port=7390 -> OK
set key=user:7 value="Lin" -> OK latency_ms=0.42
get key=user:7 -> VALUE "Lin" ownership=caller
free value buffer -> OK
disconnect -> OK
You can hand this library to another engineer and they can use it without reading any internal source code.
The Core Question You Are Answering
“How do I design a C boundary that is easy to use correctly and hard to misuse?”
Concepts You Must Understand First
- Opaque pointer interfaces
- Why hide struct internals from consumers?
- Book Reference: C Interfaces and Implementations — interface chapters
- Ownership transfer contracts
- Who allocates and who frees each returned object?
- Book Reference: Effective C, 2nd Edition — memory management chapters
- Protocol framing
- How do you parse partial reads safely?
- Book Reference: The Linux Programming Interface — socket I/O chapters
Questions to Guide Your Design
- API ergonomics
- Can a caller discover cleanup obligations from the header alone?
- Are error states machine-readable, not string-only?
- Failure behavior
- What happens to in-flight requests on disconnect?
- Can retries accidentally duplicate writes?
Thinking Exercise
Ownership Matrix
Before coding, draw a table of every function and ownership of each pointer on success/failure.
Questions:
- Which API calls transfer ownership?
- Which paths require idempotent cleanup?
The Interview Questions They Will Ask
- “What is an opaque handle and why does it matter in C APIs?”
- “How do you avoid ambiguous ownership in returned buffers?”
- “How do you handle partial TCP reads in a framed protocol?”
- “How do you design error codes so callers can recover programmatically?”
- “What makes an API boundary stable across versions?”
Hints in Layers
Hint 1: Header-first design
Write the public .h contract before internal implementation.
Hint 2: Explicit free function Use dedicated release functions for returned heap objects.
Hint 3: Framing pseudocode
read fixed header -> parse payload length -> read exact payload bytes -> decode
Hint 4: Timeout strategy Expose configurable connect/read/write timeouts in one options struct.
Books That Will Help
| Topic | Book | Chapter |
|---|---|---|
| Interface boundaries | C Interfaces and Implementations | Ch. 1-3 |
| Memory ownership | Effective C, 2nd Edition | Memory management chapters |
| Socket I/O patterns | The Linux Programming Interface | Networking chapters |
Common Pitfalls and Debugging
Problem 1: “Occasional segfault after get”
- Why: Caller freed memory owned by library internals.
- Fix: Return caller-owned copy or immutable borrowed view with explicit lifetime.
- Quick test: Run ASAN + stress retrieval loop.
Problem 2: “Protocol desync after long values”
- Why: Length prefix parsed incorrectly after partial reads.
- Fix: Use exact-byte read helper with bounded retries.
- Quick test: Replay mixed-size frame corpus.
Definition of Done
- Public API has explicit ownership notes for all pointer-returning functions
- Framed protocol handles partial reads/writes and malformed frames
- Error codes are stable and machine-parseable
- No leaks across reconnect stress test
Project 3: Disk-Backed B+Tree Library
- File: P03-disk-backed-bplustree.md
- Main Programming Language: C
- Alternative Programming Languages: Rust, C++
- Coolness Level: Level 4: Hardcore Tech Flex
- Business Potential: Level 4: Open Core Infrastructure
- Difficulty: Level 3: Advanced
- Knowledge Area: Index structures, storage-engine algorithms
- Software or Tool: Page manager + tree verifier
- Main Book: Algorithms, Fourth Edition
What you will build: A persistent B+ tree supporting point lookups, inserts, deletes, and range scans.
Why it teaches database internals: This is the canonical index structure in major row-store engines.
Core challenges you will face:
- Split/merge correctness -> maps to B+ tree invariants
- Separator propagation -> maps to search correctness
- Leaf-link consistency -> maps to range scan performance
Real World Outcome
$ ./bptree_demo load --keys 1000000 --seed 7
build complete pages=8421 height=4 avg_leaf_fill=0.73
$ ./bptree_demo get 884422
FOUND key=884422 value_ptr=(page=5812,slot=9) depth=4
$ ./bptree_demo range 884400 884460
rows=61 first=884400 last=884460 leaf_hops=2
$ ./bptree_demo verify
OK invariants=all depth=4 linked_leaves=consistent
The Core Question You Are Answering
“How do databases keep indexed lookup fast as data grows by orders of magnitude?”
Concepts You Must Understand First
- Fan-out and tree height
- Why does high fan-out beat pointer-heavy binary trees on disk?
- Book Reference: Algorithms, Fourth Edition — balanced tree sections
- Node split/merge semantics
- What parent separator is promoted on split?
- Book Reference: Algorithms, Fourth Edition
- Leaf chaining for scans
- Why are sibling links crucial for range scans?
- Book Reference: storage-engine indexing chapters (external resources)
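The fan-out point above is worth making concrete: height grows with log base fan-out of the key count, so a disk-friendly fan-out of a few hundred keeps even a billion keys within a handful of page reads. A rough sketch (illustrative ceiling-division level count; real trees also depend on fill factor):

```c
/* Count tree levels needed to index n_keys with a given fan-out:
 * each level up shrinks the key count by a factor of `fanout`. */
int tree_height(unsigned long long n_keys, unsigned long long fanout) {
    int h = 1;
    while (n_keys > fanout) {
        n_keys = (n_keys + fanout - 1) / fanout; /* ceiling division per level */
        h++;
    }
    return h;
}
```

With fan-out 300, a billion keys fit in a height-4 tree; with fan-out 2 (a binary tree) the same data needs about 30 levels, i.e. 30 dependent page reads per lookup. That gap is why pointer-heavy binary trees lose on disk.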
Questions to Guide Your Design
- Node layout decisions
- Fixed-width keys first or variable-width with indirection?
- How do you reserve metadata space for future format evolution?
- Correctness instrumentation
- Which verifier checks run after each structural mutation?
- How will you detect stale parent separators early?
Thinking Exercise
Split Propagation Trace
Trace by hand an insertion sequence that causes leaf split then root split.
Questions:
- Which key becomes parent separator at each step?
- How do child ranges change after propagation?
The Interview Questions They Will Ask
- “Why B+ trees over hash indexes for general-purpose queries?”
- “How does node fan-out influence tree height and I/O?”
- “What invariants must hold after split and merge?”
- “How do range scans leverage leaf sibling links?”
- “How would you detect index corruption offline?”
Hints in Layers
Hint 1: Read-only traversal first
Implement get on static tree pages before mutation logic.
Hint 2: Split in one node type first
Complete leaf split path before internal-node split.
Hint 3: Mutation pseudocode
insert into leaf -> if overflow split leaf -> insert separator into parent
if parent overflow -> split parent -> continue upward
Hint 4: Verifier every step
Run full invariant verifier after each randomized mutation batch.
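For the read-only traversal in Hint 1, the core primitive is choosing which child to descend into from an internal node's sorted separator array. A sketch under one common convention (child `i` holds keys in `[sep[i-1], sep[i])`, keys equal to a separator go right; the layout and names are hypothetical):

```c
#include <stddef.h>
#include <stdint.h>

/* Return the child slot to descend into for `key`, given nsep sorted
 * separators. Result is the index of the first separator > key, i.e.
 * 0..nsep inclusive (there is always one more child than separators). */
size_t child_index(const uint64_t *sep, size_t nsep, uint64_t key) {
    size_t lo = 0, hi = nsep;          /* binary search over separators */
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (key < sep[mid]) hi = mid;  /* key belongs left of sep[mid] */
        else lo = mid + 1;             /* key >= sep[mid], go right */
    }
    return lo;
}
```

Getting this boundary convention consistent between search, split, and the verifier is exactly what catches stale-separator bugs early.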
Books That Will Help
| Topic | Book | Chapter |
|---|---|---|
| Balanced trees | Algorithms, Fourth Edition | Balanced tree sections |
| Cache/storage effects | Computer Systems: A Programmer’s Perspective | Ch. 6 |
| C data structure discipline | Mastering Algorithms with C | Tree/data-structure chapters |
Common Pitfalls and Debugging
Problem 1: “Random missing keys after heavy inserts”
- Why: Parent separator not updated after child split.
- Fix: Enforce parent range verification post-mutation.
- Quick test: Differential test against sorted-map oracle.
Problem 2: “Range scans skip keys”
- Why: Leaf sibling pointers broken during merge.
- Fix: Update both forward and backward links atomically in page mutation logic.
- Quick test: Full in-order scan monotonicity assertion.
Definition of Done
- Point lookup and range scan results match oracle for randomized workloads
- Tree invariants validated after 100k mixed operations
- Restart/load preserves index correctness
- Depth trend aligns with expected O(log n) behavior
Project 4: Page-Based Storage Engine with Buffer Pool
- File: P04-page-storage-buffer-pool.md
- Main Programming Language: C
- Alternative Programming Languages: Rust, Go
- Coolness Level: Level 4: Hardcore Tech Flex
- Business Potential: Level 4: Open Core Infrastructure
- Difficulty: Level 3: Advanced
- Knowledge Area: Cache management, physical storage
- Software or Tool: Buffer pool metrics + eviction simulator
- Main Book: Operating Systems: Three Easy Pieces
What you will build: A slotted-page heap file with pin/unpin-aware buffer pool and eviction policy.
Why it teaches database internals: It reveals how real systems hide disk latency while preserving correctness.
Core challenges you will face:
- Pin discipline -> maps to buffer correctness
- Dirty flush policy -> maps to durability foundations
- Eviction behavior under mixed workloads -> maps to performance engineering
Real World Outcome
$ ./storage_engine_demo workload --profile mixed --ops 500000 --seed 11
buffer_frames=1024 policy=clock
hit_rate=0.934 evictions=27122 dirty_flushes=9180
p95_read_ms=1.8 p95_write_ms=2.4
$ ./storage_engine_demo inspect-frame 87
frame=87 page=14829 pin=0 dirty=1 refbit=0 last_access=1700001234
The Core Question You Are Answering
“How do I keep page access fast under pressure without violating correctness invariants?”
Concepts You Must Understand First
- Pin/unpin semantics
- Why can pinned pages never be evicted?
- Book Reference: Operating Systems: Three Easy Pieces — memory/file cache concepts
- Clock/LRU policy tradeoffs
- How does scan pollution harm naive policies?
- Book Reference: Computer Systems: A Programmer’s Perspective — cache behavior
- Dirty page lifecycle
- When is a dirty page eligible for flush?
- Book Reference: WAL docs and recovery chapters
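The pin/unpin and dirty-page concepts above reduce to a small amount of per-frame metadata plus one eviction predicate. A minimal sketch (field names are illustrative, not a fixed layout):

```c
#include <stdint.h>
#include <stdbool.h>

/* Minimal frame metadata a buffer pool needs for safe eviction. */
typedef struct frame {
    uint32_t page_id;    /* which on-disk page this frame caches */
    int      pin_count;  /* >0 means in use: never evictable */
    bool     dirty;      /* must be flushed durably before reuse */
    bool     ref_bit;    /* second-chance bit for the clock policy */
} frame;

/* A frame is a legal eviction victim only when unpinned and, under
 * clock, its reference bit has already been cleared. */
bool evictable(const frame *f) {
    return f->pin_count == 0 && !f->ref_bit;
}
```

Note that `dirty` does not block eviction; it only means the victim must be flushed before its frame is reused, which is where the WAL-before-data rule later hooks in.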
Questions to Guide Your Design
- Frame metadata
- Which fields are required for safe eviction?
- How will you detect pin leaks?
- I/O scheduling
- Do you flush on demand only or maintain background flush queue?
- How do you avoid bursty eviction stalls?
Thinking Exercise
Frame-State State-Machine
Draw states for free -> clean-pinned -> clean-unpinned -> dirty-pinned -> dirty-unpinned -> flushing.
Questions:
- Which transitions are legal?
- Which transitions must be atomic under lock?
The Interview Questions They Will Ask
- “Why isn’t cache hit ratio alone enough to evaluate a buffer pool?”
- “What happens if a pinned page is evicted by bug?”
- “Why can sequential scans hurt LRU?”
- “How do you choose between CLOCK and LRU?”
- “What metrics identify pin leaks?”
Hints in Layers
Hint 1: Isolate buffer tests
Build deterministic frame-trace tests before integrating storage engine.
Hint 2: Add pin assertions everywhere
Pin underflow/overflow checks save hours later.
Hint 3: Victim pseudocode
while true: inspect candidate frame
if pinned -> skip
if refbit=1 -> clear refbit, advance
else choose victim
Hint 4: Separate logical and physical errors
Return explicit error categories: pin error, I/O error, checksum error.
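The victim-selection pseudocode in Hint 3 can be sketched directly in C. This is one way to bound the scan (two sweeps suffice: one to clear reference bits, one to pick); the struct and names are illustrative:

```c
#include <stddef.h>
#include <stdbool.h>

typedef struct {
    int  pin_count;
    bool ref_bit;
} clock_frame;

/* CLOCK victim selection: skip pinned frames, give ref_bit=1 frames a
 * second chance, return the first unpinned frame with ref_bit=0.
 * Returns -1 if every frame is pinned (caller must wait or fail). */
int pick_victim(clock_frame *frames, size_t nframes, size_t *hand) {
    for (size_t scanned = 0; scanned < 2 * nframes; scanned++) {
        clock_frame *f = &frames[*hand];
        size_t cur = *hand;
        *hand = (*hand + 1) % nframes;      /* advance the clock hand */
        if (f->pin_count > 0) continue;     /* pinned: never a victim */
        if (f->ref_bit) {                   /* recently used: second chance */
            f->ref_bit = false;
            continue;
        }
        return (int)cur;                    /* unpinned, ref_bit clear */
    }
    return -1;
}
```

The `-1` path is important: a pool where every frame is pinned is a pin leak or an undersized pool, and surfacing it as an explicit error beats spinning forever.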
Books That Will Help
| Topic | Book | Chapter |
|---|---|---|
| Cache policy intuition | Computer Systems: A Programmer’s Perspective | Ch. 6 |
| Systems-level caching | Operating Systems: Three Easy Pieces | file/cache chapters |
| C implementation discipline | Effective C, 2nd Edition | reliability/safety chapters |
Common Pitfalls and Debugging
Problem 1: “Throughput degrades over time”
- Why: Pin leaks prevent victim selection; clock hand loops excessively.
- Fix: Track pin lifetimes and enforce timeout diagnostics in debug mode.
- Quick test:
./storage_engine_demo detect-pin-leaks --duration 120
Problem 2: “Occasional stale reads”
- Why: Dirty flush errors ignored and frame reused.
- Fix: Block frame reuse until durable flush success is confirmed.
- Quick test: Simulate disk write failures and assert no silent reuse.
Definition of Done
- Pin/unpin invariants hold under randomized concurrent request simulation
- Eviction never selects pinned frames
- Dirty flush pipeline handles injected I/O failures safely
- Metrics dashboard reports stable behavior across workload profiles
Project 5: Write-Ahead Logging (WAL) and Crash Recovery
- File: P05-wal-crash-recovery.md
- Main Programming Language: C
- Alternative Programming Languages: Rust, Go
- Coolness Level: Level 5: Pure Magic
- Business Potential: Level 4: Open Core Infrastructure
- Difficulty: Level 4: Expert
- Knowledge Area: Durability, crash consistency
- Software or Tool: WAL parser + crash harness
- Main Book: Recovery chapters from systems/database references
What you will build: A WAL subsystem with checkpoints and deterministic crash-recovery replay.
Why it teaches database internals: Durability is the dividing line between a toy data tool and a real database engine.
Core challenges you will face:
- Log/data ordering discipline -> maps to WAL invariants
- Replay idempotence -> maps to recovery correctness
- Checkpoint policy tuning -> maps to performance/reliability tradeoffs
Real World Outcome
$ ./wal_demo run --ops 20000 --checkpoint-every 500 --crash-at 13742
SIMULATED CRASH at op=13742
$ ./wal_demo recover
analysis start_lsn=0x00019000 dirty_pages=317 inflight_txns=4
redo applied=1298 skipped=9821
undo rolled_back_txns=4
RECOVERY COMPLETE state_hash=4d7a9c...
$ ./wal_demo verify
OK committed_state_consistent=true
The Core Question You Are Answering
“What precise write-ordering rules let me recover to a valid committed state after arbitrary crashes?”
Concepts You Must Understand First
- WAL-before-data invariant
- Why must log be durable before data page flush?
- Book Reference: crash consistency chapters in OS/database texts
- Redo idempotence with page-LSN
- How does page-LSN prevent duplicate application?
- Book Reference: WAL/recovery references
- Checkpoint economics
- How does checkpoint frequency trade throughput vs restart time?
- Book Reference: recovery-system chapters
Questions to Guide Your Design
- Record schema
- Logical redo records or physical before/after images?
- How will you checksum and version log records?
- Recovery algorithm
- What metadata do you persist at checkpoint?
- How do you detect torn-tail log records safely?
Thinking Exercise
Crash Timeline Audit
Map one transaction timeline with events: BEGIN -> UPDATE -> COMMIT -> fsync log -> crash -> data flush.
Questions:
- Which states are legal after restart?
- What changes if crash happens before commit flush?
The Interview Questions They Will Ask
- “What is the WAL-before-data rule and why is it necessary?”
- “How do you make redo replay idempotent?”
- “What does checkpointing optimize, and what does it not solve?”
- “How do you test crash recovery deterministically?”
- “What is group commit and when is it useful?”
Hints in Layers
Hint 1: Start with append-only log and parser
Recovery work is easier after log observability exists.
Hint 2: Persist durable LSN watermark
Buffer flush policy can consult one source of truth.
Hint 3: Replay pseudocode
scan checkpoint -> analysis tables -> redo by LSN with page_lsn guard
then undo loser txns if undo logs exist
Hint 4: Re-run recovery twice
Idempotence test: second recovery should make zero logical changes.
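The page-LSN guard from the replay pseudocode is small enough to sketch in full. This toy version uses a physical after-image as the redo payload (struct shapes are illustrative, not a log format):

```c
#include <stdint.h>
#include <stdbool.h>

typedef struct {
    uint64_t page_lsn;  /* LSN of the last record applied to this page */
    int      value;     /* stand-in for page contents */
} page;

typedef struct {
    uint64_t lsn;
    int      new_value; /* physical after-image */
} redo_record;

/* Apply a redo record only if the page has not already seen it.
 * This guard is what makes replay idempotent: a second recovery pass
 * applies zero records. Returns true if the record was applied. */
bool redo_apply(page *p, const redo_record *r) {
    if (p->page_lsn >= r->lsn) return false; /* already reflected on page */
    p->value = r->new_value;
    p->page_lsn = r->lsn;                    /* advance the page LSN */
    return true;
}
```

Hint 4's double-recovery test falls out directly: after one full replay, every page LSN is at or past the log tail, so the second replay returns `false` for every record.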
Books That Will Help
| Topic | Book | Chapter |
|---|---|---|
| Crash consistency | Operating Systems: Three Easy Pieces | crash-consistency chapter |
| Transaction durability | Operating System Concepts | recovery-focused sections |
| C robustness patterns | Effective C, 2nd Edition | error handling and reliability chapters |
Common Pitfalls and Debugging
Problem 1: “Committed rows disappear after crash”
- Why: Commit acknowledged before durable log flush.
- Fix: Gate commit ACK on `fsync` success up to commit LSN.
- Quick test: Kill process immediately after commit return in loop.
Problem 2: “Recovery duplicates updates”
- Why: Redo ignores page-LSN guard.
- Fix: Apply redo only when page-LSN < record LSN.
- Quick test: Run recovery twice and compare state hash.
Definition of Done
- Crash harness passes randomized kill-point tests
- Recovery is idempotent under repeated restarts
- Committed transactions persist; uncommitted effects do not
- Checkpoint interval tradeoff is benchmarked and documented
Project 6: SQL Query Engine (Parse -> Plan -> Execute)
- File: P06-sql-query-engine.md
- Main Programming Language: C
- Alternative Programming Languages: Rust, C++
- Coolness Level: Level 5: Pure Magic
- Business Potential: Level 4: Open Core Infrastructure
- Difficulty: Level 4: Expert
- Knowledge Area: Parsing, query planning, execution
- Software or Tool: SQL REPL + explain plan
- Main Book: Compilers: Principles and Practice
What you will build: A minimal SQL engine subset (SELECT, INSERT, UPDATE, DELETE, simple joins, aggregates) over previous storage/index layers.
Why it teaches database internals: It connects user-level SQL to physical storage and operator behavior.
Core challenges you will face:
- Grammar/AST correctness -> maps to compile pipeline
- Plan choice quality -> maps to index and cost reasoning
- Operator contract stability -> maps to execution correctness
Real World Outcome
$ ./tiny_sql
tinydb> CREATE TABLE users(id INT, name TEXT, age INT);
OK
tinydb> INSERT INTO users VALUES (1,'Ada',37),(2,'Lin',28);
OK rows=2
tinydb> EXPLAIN SELECT name FROM users WHERE age > 30;
Project(name)
Filter(age > 30)
SeqScan(users)
tinydb> SELECT name FROM users WHERE age > 30;
+------+
| name |
+------+
| Ada |
+------+
1 row
The Core Question You Are Answering
“How does declarative SQL become a deterministic operator pipeline over pages and indexes?”
Concepts You Must Understand First
- Tokenization and parsing
- How do tokens become AST nodes with validated syntax?
- Book Reference: Compilers: Principles and Practice — lexical/syntax chapters
- Logical vs physical plans
- Why separate semantic intent from execution algorithms?
- Book Reference: compiler architecture chapters
- Iterator execution
- What guarantees should `next()` provide?
- Book Reference: query execution model references
Questions to Guide Your Design
- Planner scope
- Which query patterns get index scans?
- How do you represent unsupported constructs clearly?
- Runtime contracts
- Who owns tuple memory across operators?
- How does executor report recoverable vs fatal errors?
Thinking Exercise
Plan Two Ways
For SELECT * FROM orders WHERE customer_id=7, draw both seq-scan and index-scan plans.
Questions:
- Under what selectivity assumptions does each win?
- How would bad cardinality estimates hurt performance?
The Interview Questions They Will Ask
- “What is the difference between logical and physical planning?”
- “How does a Volcano iterator model work?”
- “What should `EXPLAIN` guarantee in a query engine?”
- “Why can wrong cardinality estimates cause bad plans?”
- “How do indexes influence physical operator choice?”
Hints in Layers
Hint 1: Lock SQL subset
Freeze scope early to avoid grammar sprawl.
Hint 2: Build AST printer
Debugging the parser becomes easy with deterministic AST output.
Hint 3: Operator pseudocode
Project.next() -> pull tuple from child -> emit selected columns -> repeat
Hint 4: Plan/execution audit
Log operator open/next/close events and compare to EXPLAIN tree.
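The Volcano-style `next()` contract from Hint 3 can be sketched with a scan feeding a filter, mirroring the `Filter(age > 30) -> SeqScan(users)` plan shown earlier. Tuples are reduced to a single `int` column for brevity; all names are illustrative:

```c
#include <stdbool.h>
#include <stddef.h>

/* Leaf operator: sequential scan over an in-memory row array. */
typedef struct {
    const int *rows;
    size_t     nrows;
    size_t     pos;      /* scan cursor */
} seq_scan;

/* next() contract: fill *out and return true, or return false when
 * the operator is exhausted. Every operator exposes the same shape. */
bool scan_next(seq_scan *s, int *out) {
    if (s->pos >= s->nrows) return false;
    *out = s->rows[s->pos++];
    return true;
}

/* Filter operator: pulls from its child until the predicate passes. */
bool filter_next(seq_scan *child, int min_val, int *out) {
    int v;
    while (scan_next(child, &v)) {
        if (v > min_val) {           /* predicate: v > min_val */
            *out = v;
            return true;
        }
    }
    return false;                    /* child exhausted */
}
```

The key property to preserve as you add operators: each `next()` call yields exactly one tuple or signals exhaustion, and no operator buffers more than it must, so the pipeline stays pull-driven end to end.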
Books That Will Help
| Topic | Book | Chapter |
|---|---|---|
| Lexing/parsing | Compilers: Principles and Practice | Ch. 3-4 |
| C runtime safety | Effective C, 2nd Edition | runtime safety chapters |
| Data-structure choices | Algorithms, Fourth Edition | relevant search/join structures |
Common Pitfalls and Debugging
Problem 1: “Query parses but returns wrong columns”
- Why: AST projection mapping uses wrong source index offsets.
- Fix: Add schema-bound column resolution stage before planning.
- Quick test: Golden query corpus with known output order.
Problem 2: “EXPLAIN says index scan but runtime does seq scan”
- Why: Plan and executor operator enums diverged.
- Fix: Use a single shared plan representation for both explain and execute paths.
- Quick test: Snapshot assert on plan + runtime event sequence.
Definition of Done
- Supported SQL subset parses and executes deterministically
- `EXPLAIN` reflects actual runtime operator tree
- Index scans are chosen for defined predicate patterns
- Query correctness validated against fixture datasets
Project 7: Transaction Isolation and End-to-End TinyDB Integration
- File: P07-transaction-isolation-tinydb.md
- Main Programming Language: C
- Alternative Programming Languages: Rust, Go
- Coolness Level: Level 5: Pure Magic
- Business Potential: Level 5: Venture-Backed Product Candidate
- Difficulty: Level 4: Expert
- Knowledge Area: Concurrency, end-to-end systems integration
- Software or Tool: Multi-session harness + anomaly tests
- Main Book: Systems/OS concurrency references + DB docs
What you will build: A full TinyDB integration layer with transaction boundaries, isolation options, and retry-safe client semantics.
Why it teaches database internals: It forces all subsystems to cooperate under real contention and failure scenarios.
Core challenges you will face:
- Isolation semantics implementation -> maps to concurrency theory
- WAL + lock/version integration -> maps to durability/correctness coupling
- Anomaly test harness -> maps to practical reliability proof
Real World Outcome
$ ./tinydb_multi_session --scenario lost-update
scenario=lost-update isolation=read-committed
result=ANOMALY_OBSERVED
$ ./tinydb_multi_session --scenario lost-update --isolation serializable
result=RETRY_TRIGGERED txn=2 retries=1 final_state=CORRECT
$ ./tinydb_multi_session --scenario write-skew --isolation serializable
result=SERIALIZATION_FAILURE_HANDLED final_state=CORRECT
The Core Question You Are Answering
“Can this engine preserve business invariants under concurrent load and crashes, not just pass single-session tests?”
Concepts You Must Understand First
- Isolation anomalies
- Which anomalies are acceptable at each isolation level?
- Book Reference: OS concurrency chapters + official DB docs
- Conflict handling
- Block, abort, or retry?
- Book Reference: transaction and locking concepts
- End-to-end consistency checks
- How do WAL, buffer, and visibility rules compose?
- Book Reference: recovery and isolation references
Questions to Guide Your Design
- Concurrency model
- Lock-based, MVCC-like, or hybrid for your scope?
- How do you represent transaction snapshots?
- Client contract
- Which errors are retriable?
- How does API signal serialization failures clearly?
Thinking Exercise
Two-Session Invariant Simulation
Model two transactions that each read two rows and write one row.
Questions:
- Under which isolation setting can write skew appear?
- What conflict detection prevents it?
The Interview Questions They Will Ask
- “How do you demonstrate a lost update anomaly in a controlled test?”
- “What changes when moving from read committed to serializable?”
- “Why are transaction retries part of application-level correctness?”
- “How do WAL and isolation mechanisms interact?”
- “Which metrics indicate lock contention problems?”
Hints in Layers
Hint 1: Deterministic scheduler
Use explicit barriers to force interleavings.
Hint 2: Separate conflict taxonomy
Track read-write, write-write, and predicate conflicts independently.
Hint 3: Retry pseudocode
begin txn -> execute -> if serialization_failure then rollback + retry whole txn
Hint 4: Integration checksum
After each scenario, compute a canonical state hash for pass/fail.
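The retry pseudocode in Hint 3 is the contract your client API must make easy. A sketch of the client-side shape (the status enum and `body` callback are hypothetical stand-ins for your engine's API; the body is assumed to roll back internally before returning a serialization failure):

```c
#include <stdbool.h>
#include <stddef.h>

/* Illustrative status codes; a real engine defines these in its API. */
typedef enum { TXN_OK, TXN_SERIALIZATION_FAILURE, TXN_FATAL } txn_status;

/* Retry-safe client pattern: a serialization failure means the whole
 * transaction must be re-executed from the start, not resumed. */
txn_status run_with_retries(txn_status (*body)(void *), void *ctx,
                            int max_retries) {
    for (int attempt = 0; attempt <= max_retries; attempt++) {
        txn_status st = body(ctx);        /* begin + execute + commit */
        if (st != TXN_SERIALIZATION_FAILURE)
            return st;                    /* success or fatal: stop */
        /* rollback happened inside body; loop retries the whole txn */
    }
    return TXN_SERIALIZATION_FAILURE;     /* retry budget exhausted */
}

/* Demo body: fails its first attempt, then succeeds. */
static int flaky_calls = 0;
static txn_status flaky_body(void *ctx) {
    (void)ctx;
    return flaky_calls++ == 0 ? TXN_SERIALIZATION_FAILURE : TXN_OK;
}
```

This is why the section stresses machine-readable error categories: the loop can only exist if `TXN_SERIALIZATION_FAILURE` is distinguishable from fatal errors without string parsing.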
Books That Will Help
| Topic | Book | Chapter |
|---|---|---|
| Concurrency fundamentals | Operating System Concepts | concurrency chapters |
| Systems race reasoning | Operating Systems: Three Easy Pieces | concurrency chapters |
| C interface reliability | Effective C, 2nd Edition | safety/reliability chapters |
Common Pitfalls and Debugging
Problem 1: “Serializable mode still shows write skew”
- Why: Predicate conflicts not tracked, only row-level conflicts.
- Fix: Add range/predicate conflict metadata or stronger locking strategy.
- Quick test: Deterministic write-skew scenario replay.
Problem 2: “Throughput collapses with many readers”
- Why: Overly coarse locks serialize non-conflicting reads.
- Fix: Refine granularity or adopt snapshot reads for read paths.
- Quick test: Reader-heavy benchmark across lock strategies.
Definition of Done
- Deterministic anomaly scenarios produce expected outcomes per isolation level
- Retriable serialization failures are surfaced and recoverable by client
- End-to-end crash + recovery + concurrency tests pass
- Performance report includes contention and retry metrics
Project Comparison Table
| Project | Difficulty | Time | Depth of Understanding | Fun Factor |
|---|---|---|---|---|
| 1. Persistent Key-Value Store | Level 1 | Weekend | High physical storage intuition | ★★★★☆ |
| 2. KV Client Library | Level 2 | 4-6 days | High boundary/API discipline | ★★★☆☆ |
| 3. Disk-Backed B+Tree | Level 3 | 1-2 weeks | Very high indexing intuition | ★★★★★ |
| 4. Storage + Buffer Pool | Level 3 | 1-2 weeks | Very high runtime performance intuition | ★★★★☆ |
| 5. WAL + Recovery | Level 4 | 1-2 weeks | Very high reliability intuition | ★★★★★ |
| 6. SQL Query Engine | Level 4 | 2-3 weeks | Very high compiler/runtime intuition | ★★★★★ |
| 7. Isolation + Integration | Level 4 | 3-4 weeks | End-to-end systems mastery | ★★★★★ |
Recommendation
If you are new to database internals: Start with Project 1, then Project 3. This builds the storage/index core before higher complexity.
If you are a backend engineer who already uses SQL daily: Start with Project 3 and Project 6. You will connect query behavior directly to access-path mechanics.
If you want reliability engineering depth: Prioritize Project 5 and Project 7 after completing Project 1 and Project 4.
Final Overall Project
Final Overall Project: TinyDB — Crash-Safe Embedded SQL Engine
The Goal: Combine Projects 1 through 7 into a single binary that supports transactional SQL over a durable page file with indexes, WAL, and isolation controls.
- Integrate page storage, B+ tree, and buffer pool into one storage subsystem.
- Integrate WAL and restart recovery across all write paths.
- Integrate SQL compile/execute and transaction manager with retriable serializable semantics.
Success Criteria: Runs deterministic workload corpus, survives crash-injection tests, and passes concurrency anomaly matrix with documented isolation behavior.
From Learning to Production
| Your Project | Production Equivalent | Gap to Fill |
|---|---|---|
| Project 1 | SQLite pager/file format discipline | Online schema migration and compatibility tooling |
| Project 3 | InnoDB/PostgreSQL B-tree internals | Concurrency latching and online maintenance |
| Project 4 | Shared buffer managers in major DBMS | Advanced eviction heuristics and NUMA awareness |
| Project 5 | PostgreSQL WAL + PITR workflows | Replication, backup orchestration, operational tooling |
| Project 6 | PostgreSQL planner/executor, SQLite VDBE | Cost model sophistication and optimizer breadth |
| Project 7 | Serializable transaction implementations | Full predicate locking/SSI and distributed correctness |
Summary
This learning path covers Database Internals in C through 7 hands-on projects.
| # | Project Name | Main Language | Difficulty | Time Estimate |
|---|---|---|---|---|
| 1 | Persistent Key-Value Store | C | Beginner | 1 weekend |
| 2 | KV Client Library | C | Intermediate | 4-6 days |
| 3 | Disk-Backed B+Tree | C | Advanced | 1-2 weeks |
| 4 | Storage + Buffer Pool | C | Advanced | 1-2 weeks |
| 5 | WAL + Crash Recovery | C | Expert | 1-2 weeks |
| 6 | SQL Query Engine | C | Expert | 2-3 weeks |
| 7 | Isolation + TinyDB Integration | C | Expert | 3-4 weeks |
Expected Outcomes
- You can reason from SQL behavior to page-level I/O and back.
- You can build and validate crash-safe write paths.
- You can explain isolation semantics with deterministic anomaly evidence.
Additional Resources and References
Standards and Specifications
- SQLite database file format spec: https://www.sqlite.org/fileformat.html
- PostgreSQL WAL documentation: https://www.postgresql.org/docs/current/wal-intro.html
- PostgreSQL transaction isolation documentation: https://www.postgresql.org/docs/current/transaction-iso.html
- PostgreSQL page layout documentation: https://www.postgresql.org/docs/current/storage-page-layout.html
- MySQL InnoDB index/page structure documentation: https://dev.mysql.com/doc/refman/8.4/en/innodb-physical-structure.html
Industry Analysis
- Stack Overflow Developer Survey 2025 (technology adoption): https://survey.stackoverflow.co/2025/technology
- SQLite deployment notes (most deployed SQL engine context): https://www.sqlite.org/mostdeployed.html
Papers and Foundational References
- ARIES publication page (IBM Research): https://research.ibm.com/publications/aries-a-transaction-recovery-method-supporting-fine-granularity-locking-and-partial-rollbacks-using-write-ahead-logging
Books
- The C Programming Language by Kernighan & Ritchie — core C I/O and binary handling foundations.
- Computer Systems: A Programmer’s Perspective by Bryant & O’Hallaron — memory hierarchy and systems-level data movement intuition.
- Algorithms, Fourth Edition by Sedgewick & Wayne — balanced tree and search-structure fundamentals.
- Operating Systems: Three Easy Pieces by Arpaci-Dusseau & Arpaci-Dusseau — crash consistency and concurrency mindset.
- Compilers: Principles and Practice by Dave & Dave — parser/planner mental models useful for SQL engines.