Project 5: B-Tree File Index (Mini Filesystem Index)

Build a disk-based B-tree index that supports prefix queries and range scans.

Quick Reference

| Attribute | Value |
|-----------|-------|
| Difficulty | Advanced |
| Time Estimate | 25-40 hours |
| Main Programming Language | C |
| Alternative Programming Languages | Rust |
| Coolness Level | Level 4 |
| Business Potential | Open-core infrastructure |
| Prerequisites | File IO, basic trees, serialization |
| Key Topics | B-tree layout, pages, splits, range scans, crash safety |

1. Learning Objectives

By completing this project, you will:

  1. Design a page-aligned B-tree node layout for disk.
  2. Implement insertion with node splits and tree height growth.
  3. Support prefix and range scans using linked leaves.
  4. Implement page IO with a simple buffer cache.
  5. Add crash-safe update ordering (write-ahead or copy-on-write).

2. All Theory Needed (Per-Concept Breakdown)

B-Tree Node Layout and Fanout

Fundamentals

A B-tree is a balanced tree optimized for disk access. Each node stores many keys and child pointers, so the tree height stays small even for huge datasets. The number of children per node is called the fanout, and it is determined by the page size and key size: a 4096-byte page can hold on the order of a hundred fixed-length keys and child pointers. The goal is to maximize fanout so that each lookup requires as few page reads as possible. B-trees are used in databases and filesystems because they minimize disk IO while maintaining sorted order for range queries.

Deep Dive into the concept

A B-tree node is essentially a fixed-size page. It contains a header (node type, key count, pointers) and an array of keys and child pointers. The layout must be compact and predictable because you will read and write it directly to disk. The size of each key determines the fanout: if keys are fixed-length, you can store more per page; if variable-length, you need offsets or a separate key area. For this project, use fixed-length keys (e.g., file names limited to 64 bytes) so the layout is simple and deterministic. This lets you compute how many keys fit in a page: max_keys = (page_size - header_size) / (key_size + child_ptr_size).
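The fanout computation above can be written down directly. A minimal sketch in C; the header size, key size, and pointer size passed in are assumed example values, not part of the project spec:

```c
#include <stdint.h>

/* How many fixed-length keys fit in one page, given the layout
   described in the text. Integer division gives the floor, which is
   what we want: a partial key slot is unusable. */
static uint32_t max_keys(uint32_t page_size, uint32_t header_size,
                         uint32_t key_size, uint32_t ptr_size)
{
    return (page_size - header_size) / (key_size + ptr_size);
}
```

For example, a 4096-byte page with a 64-byte header and 32-byte keys plus 8-byte child pointers yields a fanout of 100; doubling the key size to 64 bytes drops it to 56, which is the key-size/height tradeoff the text describes.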

Fanout controls tree height. With a fanout of 100, a tree can index 100^3 = 1,000,000 keys in three levels. That means a lookup requires only three page reads. This is why B-trees are powerful: they trade larger nodes for fewer levels. The invariant is that each internal node with k keys has k+1 child pointers, and keys are sorted. Leaf nodes store key-value pairs and are linked for range scans. Insertion must maintain the B-tree properties: nodes are kept between half-full and full (except the root), ensuring the tree remains balanced.

Designing the node layout involves careful alignment and serialization. You must choose an on-disk format and a corresponding in-memory struct. The on-disk format should avoid pointers; instead, store child page numbers or offsets. This ensures that the file can be reopened later. Each node should include a page id, a flag indicating leaf vs internal, and the number of keys. Keys and child pointers follow. For leaves, you store key and value (e.g., filename to inode id). For internal nodes, you store keys and child page ids.

The layout must also include space for a “next leaf” pointer for range scans. This is a page id pointing to the next leaf. This turns the leaf level into a linked list, enabling efficient range queries. The next-leaf pointer reduces the need to traverse back up the tree during a scan, which saves IO. This is a key advantage of B+ trees and is commonly used in storage engines.

Finally, you must consider how the node layout affects insert and split operations. When a node is full, you split it into two nodes and promote the middle key to the parent. The split logic depends on the array layout. If you have fixed-size keys and pointers, splitting is straightforward: copy half the keys to a new node. This is easier to implement and reason about than variable-length keys. The key insight is that the node layout is both a storage format and an algorithmic structure. It determines fanout, split behavior, and disk IO cost, so it must be designed deliberately.

Another practical detail is endianness and portability. If you write pages on one machine and read them on another, you need a stable byte order. For this project, define all integer fields as little-endian or network byte order and convert on read/write. This makes your on-disk format explicit and future-proof. Even if you never move the files, writing them with a defined endianness forces you to think about serialization as a first-class design concern.
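A sketch of what the header and explicit little-endian helpers might look like in C. The struct mirrors the `page_header` fields shown in §4.4; the encode/decode helpers are illustrative, and make the byte order independent of the host:

```c
#include <stdint.h>

/* In-memory view of the on-disk node header (see §4.4). On disk, each
   integer field is serialized little-endian via the helpers below. */
struct page_header {
    uint16_t is_leaf;    /* 1 = leaf, 0 = internal */
    uint16_t key_count;  /* number of keys stored in this node */
    uint32_t next_leaf;  /* page id of next leaf, 0 if none */
};

/* Write v into buf as 4 little-endian bytes, regardless of host order. */
static void put_le32(uint8_t *buf, uint32_t v)
{
    buf[0] = (uint8_t)(v & 0xff);
    buf[1] = (uint8_t)((v >> 8) & 0xff);
    buf[2] = (uint8_t)((v >> 16) & 0xff);
    buf[3] = (uint8_t)((v >> 24) & 0xff);
}

/* Read 4 little-endian bytes back into a host integer. */
static uint32_t get_le32(const uint8_t *buf)
{
    return (uint32_t)buf[0] | ((uint32_t)buf[1] << 8) |
           ((uint32_t)buf[2] << 16) | ((uint32_t)buf[3] << 24);
}
```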

How this fits into the project

Node layout is required in §3.2 and §3.5, implemented in §5.10 Phase 1, and tested in §6.2. It is contrasted with P02-log-structured-kv-store.md as an alternative storage structure.

Definitions & key terms

  • fanout -> number of children per node
  • page -> fixed-size block of storage (e.g., 4096 bytes)
  • B-tree -> balanced tree optimized for block devices
  • B+ tree -> variant where all values are in leaves and leaves are linked

Mental model diagram

[internal node]
 keys: k1 k2 k3
 children: c0 c1 c2 c3

[leaf node] -> [leaf node] -> [leaf node]

How it works (step-by-step, with invariants and failure modes)

  1. Choose fixed page size and key size.
  2. Define header with type, key count, and next leaf pointer.
  3. Store keys and child page ids in arrays.
  4. Ensure nodes are at least half full (except root).
  5. On split, distribute keys evenly and update parent.

Invariants: keys sorted; internal nodes have k+1 children; leaves linked. Failure modes: incorrect split leads to lost keys or broken ordering.

Minimal concrete example

Page size 4096, header size 64, key size 32, ptr size 8 -> max_keys = (4096-64)/(32+8) = 100 (100.8 rounded down)

Common misconceptions

  • “B-trees are binary trees” -> they have high fanout, not just two children.
  • “Key size doesn’t matter” -> it directly affects fanout and IO.
  • “Leaf linking is optional” -> it is critical for range scans.

Check-your-understanding questions

  1. Why does higher fanout reduce tree height?
  2. Why use fixed-length keys in this project?
  3. What is the benefit of linking leaf nodes?

Check-your-understanding answers

  1. Each node covers more keys, so fewer levels are needed.
  2. It simplifies serialization and split logic.
  3. It enables efficient range scans without traversing internal nodes.

Real-world applications

  • Filesystem indexes (NTFS, ext4).
  • Database indexes (MySQL InnoDB).

References

  • Algorithms, Fourth Edition, Ch. 6.1
  • Database Internals, Ch. 2-3

Key insights

B-tree performance is dominated by page layout and fanout, not by pointer operations.

Summary

Designing node layout is the first and most important step in building a disk-based B-tree because it determines IO cost and split behavior.

Homework/Exercises to practice the concept

  1. Compute fanout for different key sizes and page sizes.
  2. Design a page header and estimate how many keys fit.

Solutions to the homework/exercises

  1. Use max_keys = (page_size - header_size) / (key_size + ptr_size).
  2. Create a header of fixed fields and compute remaining bytes.

Page-Oriented IO and Buffer Cache

Fundamentals

Disk IO happens in pages (blocks), not individual bytes. A B-tree node is stored in a page, so every read or write loads or writes a full page. A buffer cache keeps recently used pages in memory to avoid repeated disk reads. Without a cache, each tree lookup would require multiple disk reads, which is slow. A simple buffer cache maps page ids to in-memory buffers and uses LRU to evict when full.

From a systems perspective, you should be able to explain the invariants this concept maintains, the resources it consumes (CPU, memory, IO), and the typical failure modes when it is misused. This basic framing will help you reason about correctness before you optimize or extend the design.

Deep Dive into the concept

Page-oriented IO is the core performance constraint for disk-based data structures. When you call pread or pwrite, the OS loads data in page-sized chunks (often 4KB). The B-tree is designed to align nodes with pages to minimize wasted IO. However, even with a well-designed tree, repeated accesses to the same pages can be expensive if each access requires a disk read. This is why buffer caches are essential. A buffer cache stores recently accessed pages in memory. If a page is requested and already in cache, you get a cache hit and avoid disk IO. If not, you read the page from disk and store it in the cache, possibly evicting another page.

A minimal cache design uses a hash table mapping page id to cache entries and an LRU list to manage eviction. This is similar to the cache you built in Project 1. The cache entry stores the page data, a dirty flag, and a reference count. When you modify a page, you mark it dirty. On eviction, if the page is dirty, you write it back to disk. This is called write-back caching. Alternatively, you can use write-through: write to disk immediately on modification. Write-back is faster but needs careful crash safety; write-through is simpler but slower. For this project, you can use write-through or write-back with explicit flush commands.

The buffer cache also needs pinning. When you read a page for modification, you must prevent it from being evicted while you are using it. A reference count or pin flag can handle this. After you finish using the page, you unpin it. If you forget to unpin, you can leak cache entries and never evict, which is a bug. For simplicity, you can use a single-threaded design and avoid complex pinning, but you should still reason about the lifecycle of pages.
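The hit/miss/evict cycle can be sketched with a tiny fixed-size cache. To keep the example self-contained it reads from an in-memory fake disk instead of a file, and it implements LRU with last-use ticks rather than a linked list (same policy, less bookkeeping); the sizes and names are illustrative and deliberately shrunk:

```c
#include <stdint.h>
#include <string.h>

#define CACHE_SLOTS 2   /* tiny on purpose, to force evictions */
#define PAGE_SIZE   16  /* a real index would use 4096 */
#define DISK_PAGES  8

static uint8_t fake_disk[DISK_PAGES][PAGE_SIZE]; /* stands in for the file */

struct slot {
    uint32_t page_id;
    int      valid;
    uint64_t last_used; /* tick of most recent access, 0 = never */
    uint8_t  data[PAGE_SIZE];
};

static struct slot slots[CACHE_SLOTS];
static uint64_t tick;
static int hits, misses; /* instrumentation, as the text recommends */

/* Return the cached page, loading it (and evicting the LRU slot) on a miss. */
static uint8_t *cache_get(uint32_t page_id)
{
    int victim = 0;
    for (int i = 0; i < CACHE_SLOTS; i++) {
        if (slots[i].valid && slots[i].page_id == page_id) {
            hits++;
            slots[i].last_used = ++tick;
            return slots[i].data;
        }
        /* track the eviction candidate: empty slot, else oldest tick */
        if (!slots[i].valid || slots[i].last_used < slots[victim].last_used)
            victim = i;
    }
    misses++;
    memcpy(slots[victim].data, fake_disk[page_id], PAGE_SIZE);
    slots[victim].page_id = page_id;
    slots[victim].valid = 1;
    slots[victim].last_used = ++tick;
    return slots[victim].data;
}
```

A write-back version would add a dirty flag per slot and flush the victim before the `memcpy`; pinning would add a reference count checked before a slot can become the victim.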

Another key concept is page numbering. Pages are addressed by page id, not file offsets. The page id is multiplied by page size to compute the offset. When you allocate a new page (e.g., during a split), you append to the file and assign a new page id. This is a simple form of page allocation. A more advanced system would maintain a free page list, but for this project, append-only allocation is fine.
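The page-id-to-offset rule and a thin read wrapper might look like this; a sketch with error handling trimmed, using POSIX `pread` so reads do not disturb the file position:

```c
#include <stdint.h>
#include <sys/types.h>
#include <unistd.h>

/* Pages are addressed by id; the file offset is derived, never stored.
   The off_t cast guards against 32-bit overflow on large files. */
static off_t page_offset(uint32_t page_id, uint32_t page_size)
{
    return (off_t)page_id * page_size;
}

/* Read one full page into a caller-owned buffer of page_size bytes. */
static ssize_t read_page(int fd, uint32_t page_id, uint32_t page_size,
                         void *buf)
{
    return pread(fd, buf, page_size, page_offset(page_id, page_size));
}
```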

The buffer cache directly impacts performance metrics. You should instrument cache hit rate, miss rate, and writeback count. These metrics allow you to reason about the working set size and verify that your cache is effective. For deterministic tests, you can use a fixed sequence of lookups and report exact hit/miss counts.

If you want to go one step further, add simple prefetching: when you read a leaf page for a range scan, also read the next leaf page into cache. This exploits sequential access and reduces stalls during scans. Even a naive prefetch can dramatically improve scan throughput. It also reinforces the idea that page-oriented IO is about anticipating access patterns, not just reacting to misses.

How this fits into the project

The buffer cache is required in §3.2 and §3.7. It is implemented in §5.10 Phase 2 and tested in §6.2. The LRU technique is shared with P01-in-memory-cache-lru.md.

Definitions & key terms

  • page -> fixed-size block of storage
  • buffer cache -> in-memory cache of pages
  • dirty page -> page modified in cache but not yet on disk
  • write-back -> write dirty pages on eviction

Mental model diagram

lookup -> cache hit? yes -> use page
                     no  -> read page -> cache -> use

How it works (step-by-step, with invariants and failure modes)

  1. Request page id from cache.
  2. If in cache, return pointer.
  3. If not, read page from disk and insert into cache.
  4. If cache full, evict LRU page (write back if dirty).
  5. Mark page dirty on modification.

Invariants: cache contains at most N pages; dirty pages are written before eviction. Failure modes: lost writes if dirty pages not flushed.

Minimal concrete example

page_t *p = cache_get(page_id);      /* NULL on a miss */
if (!p) { p = cache_load(page_id); } /* read from disk, insert into cache */

Common misconceptions

  • “OS cache is enough” -> you still need your own for correctness and instrumentation.
  • “Write-back is always safe” -> it requires crash-safe ordering.
  • “No cache needed” -> performance will be dominated by disk IO.

Check-your-understanding questions

  1. Why does a buffer cache improve B-tree performance?
  2. What is a dirty page and why does it matter?
  3. How do you compute file offset from page id?

Check-your-understanding answers

  1. It avoids repeated disk reads for hot pages.
  2. Dirty pages contain updates that must be written back to disk.
  3. offset = page_id * page_size.

Real-world applications

  • Database buffer pools.
  • Filesystem page caches.

References

  • Database Internals, buffer pool chapter
  • Operating Systems: Three Easy Pieces, caching chapters

Key insights

A B-tree is only as fast as its page cache; without it, disk IO dominates.

Summary

Buffer caching reduces IO and enables efficient B-tree operations. It requires careful dirty tracking and eviction policies.

Homework/Exercises to practice the concept

  1. Implement a simple page cache with LRU eviction.
  2. Measure hit rate on a repeated lookup workload.

Solutions to the homework/exercises

  1. Use a hash map for lookup and a doubly linked list for LRU.
  2. Run the same queries twice and compare hits vs misses.

Split, Merge, and Tree Balancing

Fundamentals

B-trees remain balanced by splitting full nodes during insert and merging or rebalancing nodes during delete. Insertion descends to a leaf, inserts a key, and if the node exceeds capacity, splits it into two nodes and promotes the middle key to the parent. This may propagate upward, possibly creating a new root. This process keeps the tree height small and balanced. Understanding split logic is critical to a correct B-tree.

Deep Dive into the concept

Insertion in a B-tree is a carefully structured process. You start at the root and traverse down to the appropriate leaf by comparing keys and following child pointers. Once at the leaf, you insert the new key in sorted order. If the leaf is still within capacity, you are done. If it overflows (key count exceeds max), you split. Splitting a node involves creating a new node, moving half the keys to it, and promoting the middle key to the parent. For leaves in a B+ tree, you keep the promoted key in the left leaf and copy it to the parent (or promote the smallest key of the right node). The exact rule must be consistent across your implementation.
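A leaf split on fixed-length keys is mostly array arithmetic. A minimal in-memory sketch, assuming the B+-tree rule mentioned above of using the first key of the right node as the separator; the capacities are illustrative:

```c
#include <string.h>

#define MAX_KEYS 4
#define KEY_LEN  8

/* In-memory stand-in for a leaf page. The +1 slot lets us insert the
   overflowing key first and split afterwards. */
struct mleaf {
    int  nkeys;
    char keys[MAX_KEYS + 1][KEY_LEN];
};

/* Split an overfull `left` into left/right halves. The separator key
   (to be inserted into the parent) is the smallest key of the right
   node, copied into `sep`. */
static void leaf_split(struct mleaf *left, struct mleaf *right, char *sep)
{
    int mid = (MAX_KEYS + 1) / 2;          /* left keeps the lower half */
    right->nkeys = left->nkeys - mid;
    for (int i = 0; i < right->nkeys; i++)
        memcpy(right->keys[i], left->keys[mid + i], KEY_LEN);
    left->nkeys = mid;
    memcpy(sep, right->keys[0], KEY_LEN);  /* separator goes to the parent */
}
```

Both halves end up at least half full, which is exactly the invariant the balancing argument relies on.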

Split propagation is the tricky part. If the parent has room, you insert the promoted key and a pointer to the new node. If the parent is also full, you must split it as well. This can propagate to the root, potentially increasing the tree height by one. This is how B-trees grow while maintaining balance. The invariants are: all leaves are at the same depth, and each internal node (except root) is at least half full. These invariants guarantee that tree height is O(log_f n), where f is fanout.

Deletion is more complex and can be skipped for this project. If you want to implement deletes, you must handle underflow by borrowing from siblings or merging nodes. This is a good extension but not required. Instead, focus on insert and search correctness.

Splits must be crash safe. If you write a new node to disk and update the parent, you must ensure that the new node is durable before the parent points to it. Otherwise, a crash could leave a parent pointing to garbage. A simple approach is to write the new node first, fsync, then update the parent. This is write-ahead ordering and is critical for correctness. If you also update the root, the order matters even more. This ties into the crash safety concept below.

Testing split logic requires deterministic data. A good test inserts keys in sorted order, which is the worst-case for many tree structures. B-trees should handle it gracefully. You should verify that tree height grows predictably and that all keys remain in order. You can add a tree validation function that checks node sizes and key ordering across the tree.

A small but important design choice is whether to split on the way down or on the way up. Some B-tree algorithms pre-split full nodes while descending, so that when you reach the leaf, you are guaranteed room. This can simplify insertion because you never have to propagate splits upward. The alternative is to split on the way up after inserting, which is easier to implement in a simple recursive function but requires handling cascading splits. Either approach is valid; choose one and document it. The key is consistency so that your validation logic can check invariants deterministically.

How this fits into the project

Split logic is required in §3.2 and implemented in §5.10 Phase 2. It is tested in §6.2. It is also a prerequisite for P06-capstone-simple-database-engine.md.

Definitions & key terms

  • split -> divide a full node into two nodes
  • promote -> move a key up to the parent
  • fanout -> maximum children per node
  • balance -> all leaves at same depth

Mental model diagram

Before split: [k1 k2 k3 k4]
After split:  [k1 k2] [k3 k4] with k3 promoted

How it works (step-by-step, with invariants and failure modes)

  1. Insert key into leaf in sorted order.
  2. If node overflows, split into left and right.
  3. Promote middle key to parent.
  4. If parent overflows, repeat.
  5. If root splits, create new root.

Invariants: nodes (except root) are at least half full; all leaves at same depth. Failure modes: lost keys or broken parent-child pointers.

Minimal concrete example

max_keys=4, insert 5th key -> split 2/2, promote middle

Common misconceptions

  • “Splitting only affects leaves” -> it can propagate up to the root.
  • “B-trees are always full” -> nodes are only guaranteed half full.
  • “Order doesn’t matter” -> key order is critical for search correctness.

Check-your-understanding questions

  1. Why does splitting keep the tree balanced?
  2. What happens if you update parent before writing new node?
  3. Why might sorted inserts be a good test?

Check-your-understanding answers

  1. It maintains node size constraints and equal leaf depth.
  2. A crash could leave a parent pointing to an uninitialized page.
  3. It forces frequent splits and exposes split bugs.

Real-world applications

  • Database index insert logic.
  • Filesystem metadata trees.

References

  • Algorithms, Fourth Edition, Ch. 6.1
  • Database Internals, Ch. 2-3

Key insights

Splits are the mechanism that preserves B-tree balance and low height.

Summary

Insertion and splitting keep the B-tree balanced. Correct split ordering and crash-safe writes are critical.

Homework/Exercises to practice the concept

  1. Implement split logic for a small in-memory B-tree and validate ordering.
  2. Insert keys in sorted order and verify tree height.

Solutions to the homework/exercises

  1. Split at median and promote the middle key to parent.
  2. Count levels after inserts and compare to expected height.

Range Scans and Linked Leaves

Fundamentals

Range scans are queries that return all keys between two bounds or with a common prefix. B-trees support range scans efficiently by linking leaf nodes in order. Once you locate the starting leaf, you follow the linked list of leaves and collect keys until the range ends. This avoids re-traversing the tree for each key and keeps IO sequential.

Deep Dive into the concept

Range scans are a primary reason B-trees are used in storage engines. The leaf nodes store keys in sorted order. By linking leaves with a next-leaf pointer, you can scan sequentially across leaves. This is essentially a linked list at the leaf level, which is called a B+ tree (all values in leaves). For this project, you should implement leaf links even if you keep values in leaves only. The scan algorithm is: perform a lookup for the start key, find the leaf that would contain it, then iterate forward across keys in that leaf and subsequent leaves until the end key or prefix boundary is reached.

Prefix queries are a special case of range scans. For example, a prefix “log_” corresponds to the range of keys starting with that prefix. You can compute the end key by incrementing the prefix lexicographically or by scanning until a key no longer matches the prefix. The latter is simpler and works for fixed-length keys. The key is that range scans are sequential and can take advantage of prefetching and sequential IO. This is why linked leaves are critical: without them, you would need to traverse the tree for each next key, which would require repeated internal node reads.

Linked leaves also simplify ordered iteration. Many database engines provide cursors that iterate through keys. This is implemented by storing a pointer to the current leaf and index, then moving forward in the leaf or to the next leaf. This is efficient and avoids re-searching. You can model a simple cursor in this project to support range scans and prefix queries.
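Such a cursor can be sketched with in-memory leaf structs; on disk, `next` would be a page id resolved through the cache rather than a pointer, and the field names are illustrative:

```c
#include <stddef.h>
#include <string.h>

#define LEAF_CAP 4

/* In-memory stand-in for a linked leaf page. */
struct leaf {
    int          nkeys;
    const char  *keys[LEAF_CAP]; /* sorted within the leaf */
    struct leaf *next;           /* next leaf in key order, NULL at end */
};

/* Cursor = (current leaf, index within it), as described in the text. */
struct cursor {
    struct leaf *leaf;
    int          idx;
};

/* Return the current key and advance; NULL once the chain is exhausted. */
static const char *cursor_next(struct cursor *c)
{
    while (c->leaf && c->idx >= c->leaf->nkeys) {
        c->leaf = c->leaf->next; /* hop across the leaf link */
        c->idx = 0;
    }
    if (!c->leaf)
        return NULL;
    return c->leaf->keys[c->idx++];
}
```

A range or prefix scan is then just this loop plus a stop condition on the returned key.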

A pitfall is maintaining leaf links during splits. When a leaf splits, you must update the next-leaf pointer of the left leaf to point to the new right leaf, and the new right leaf should point to the old next leaf. This must be crash-safe: you should write the new right leaf and update pointers in a safe order. If you do it wrong, range scans can skip or duplicate keys. For this project, you can keep it simple by ensuring that splits update links in memory and then write pages in order (new leaf, old leaf, parent).

Range scans are also a good place to think about output pagination. In real systems, you rarely return all results at once; you return a page of results and a cursor. You can model this by adding a --limit option to your query that stops after N keys and prints a "next cursor" (leaf id and index). This is not required, but it shows you understand how range scans are exposed in APIs. It also makes your range scan implementation exercise the leaf-link logic across multiple calls, which is a good stress test for correctness.

You can also optimize prefix scans by computing an upper bound. For ASCII keys, the upper bound can be constructed by incrementing the last character of the prefix and treating that as the end key. This lets you stop as soon as you exceed the bound rather than checking each key for prefix match. Even if you do not implement this optimization, describing it shows you understand how range scans can be tightened for performance.
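The upper-bound trick fits in a few lines. `prefix_end_key` is a hypothetical helper name; it returns 0 when no simple bound exists (empty prefix, or last byte already 0xff), signalling the caller to fall back to per-key prefix checks:

```c
#include <stddef.h>
#include <string.h>

/* Build an exclusive end key for an ASCII prefix scan by incrementing
   the last character: keys in [prefix, end) are exactly the matches. */
static int prefix_end_key(const char *prefix, char *end, size_t cap)
{
    size_t n = strlen(prefix);
    if (n == 0 || n >= cap || (unsigned char)prefix[n - 1] == 0xff)
        return 0;                 /* no simple bound; scan with strncmp */
    memcpy(end, prefix, n + 1);   /* copy including the NUL terminator */
    end[n - 1]++;                 /* "log_" -> "log`" */
    return 1;
}
```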

How this fits into the project

Range scans are required in §3.2 and demonstrated in §3.7. They are implemented in §5.10 Phase 3 and tested in §6.2. The idea also appears in P02-log-structured-kv-store.md as an alternative range query strategy.

Definitions & key terms

  • range scan -> retrieve keys within a range
  • prefix query -> retrieve keys with a common prefix
  • linked leaves -> leaf nodes connected by next pointers
  • cursor -> stateful iterator over keys

Mental model diagram

[leaf A] -> [leaf B] -> [leaf C]
   | keys in order |

How it works (step-by-step, with invariants and failure modes)

  1. Search tree for start key and locate leaf.
  2. Scan keys in leaf from start position.
  3. Follow next-leaf pointer to next leaf.
  4. Stop when key exceeds end or prefix no longer matches.

Invariants: leaves are linked in key order; keys sorted within each leaf. Failure modes: broken links or incorrect split update.

Minimal concrete example

prefix "log_" -> scan until key no longer starts with "log_"

Common misconceptions

  • “Range scans require repeated lookups” -> linked leaves avoid that.
  • “Prefix scans are special” -> they are just range scans.
  • “Leaf links are optional” -> they are essential for performance.

Check-your-understanding questions

  1. Why are linked leaves critical for range scans?
  2. How do you find the start position for a prefix scan?
  3. What must be updated when a leaf splits?

Check-your-understanding answers

  1. They allow sequential traversal without re-searching the tree.
  2. Search for the prefix as if it were a key and start from that position.
  3. The left leaf’s next pointer and the new leaf’s next pointer.

Real-world applications

  • Database range queries and ORDER BY.
  • Filesystem directory scans.

References

  • Database Internals, Ch. 2-3
  • Algorithms, Fourth Edition, B-tree sections

Key insights

Linked leaves turn a B-tree into an efficient ordered iterator with minimal IO.

Summary

Range scans rely on linked leaves to avoid repeated tree traversals. Maintain links carefully during splits for correctness.

Homework/Exercises to practice the concept

  1. Implement a cursor that scans keys across linked leaves.
  2. Add a prefix scan and verify output ordering.

Solutions to the homework/exercises

  1. Store (leaf_id, index) and move to next leaf when index reaches end.
  2. Scan until key no longer starts with prefix.

Crash-Safe Page Updates

Fundamentals

A B-tree update often touches multiple pages: a leaf, its parent, and possibly new nodes. If a crash occurs during an update, the tree can become inconsistent. Crash safety requires a strict update order or a write-ahead log. A simple approach is copy-on-write: write new pages first, then atomically update the root pointer. Another approach is to use a WAL to log changes before applying them. For this project, implement a minimal write ordering policy: write new pages, fsync, then update parents.

Deep Dive into the concept

Disk-based data structures must survive crashes. When you split a node, you create a new page and update the parent. If you update the parent before the new page is safely written, a crash can leave a parent pointing to an invalid page. The simplest fix is to ensure write order: write the new child page to disk and fsync it, then update the parent and fsync again. This guarantees that any pointer stored in the parent refers to a valid page. The cost is extra fsync calls, but the correctness is worth it for this project.
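The child-before-parent ordering can be sketched as two writes separated by `fsync`. This is a sketch: real code would also handle short writes, `EINTR`, and report `errno`; the function names and page ids are illustrative:

```c
#include <fcntl.h>
#include <stdint.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Write one full page at its derived offset; -1 on any short write. */
static int write_page(int fd, uint32_t page_id, uint32_t page_size,
                      const void *buf)
{
    off_t off = (off_t)page_id * page_size;
    return pwrite(fd, buf, page_size, off) == (ssize_t)page_size ? 0 : -1;
}

/* Commit a split with safe ordering: the new child must be durable
   before any parent pointer to it reaches disk. */
static int split_commit(int fd, uint32_t page_size,
                        uint32_t new_child_id, const void *new_child,
                        uint32_t parent_id, const void *parent)
{
    if (write_page(fd, new_child_id, page_size, new_child)) return -1;
    if (fsync(fd) != 0) return -1;  /* child durable first */
    if (write_page(fd, parent_id, page_size, parent)) return -1;
    if (fsync(fd) != 0) return -1;  /* parent update durable */
    return 0;
}
```

If a crash lands between the two fsyncs, the parent still points at its old children and the new page is merely unreferenced, which is the recoverable state the text describes.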

Another approach is copy-on-write (CoW). Instead of modifying pages in place, you write new versions and update a root pointer when all new pages are durable. The root pointer is stored in a small “superblock” page at the beginning of the file. Updating the superblock is done atomically by writing a new version and using a checksum or sequence number. This is more complex but offers strong crash safety without WAL. For this project, you can use simple write ordering, but you should document the limitation: partial updates can still occur if a crash happens after some pages are written but before all are updated. The ordering rules reduce this risk but do not eliminate it fully without WAL.

If you choose to implement a WAL, you must log logical changes (insert key) or physical changes (page diff). Logging physical changes is simpler because you can replay them directly. Each log record contains the page id, offset, and new data. On crash, you replay the log to apply updates. You must write the log before modifying pages (write-ahead). This is more advanced but aligns with the LSM project. For this project, you can keep a minimal WAL for page updates during splits if you want to push further.

Crash safety also involves metadata such as the root page id, page count, and free page list. These should be stored in a superblock and updated atomically. A simple approach is to store two copies of the superblock with a checksum and version number, and always write the newer version. On startup, choose the valid superblock with the highest version. This is a common technique in filesystems and databases.

For deterministic testing, you can simulate crashes by stopping the process at specific points and verifying the index with a checker. For example, insert a key, crash after writing the child but before updating the parent, then restart and verify that the tree is still valid. This is a powerful way to see how write ordering affects correctness.

You should also consider checksums on pages. A simple CRC in the page header lets you detect torn or corrupted pages on startup. If a checksum fails, you can either treat the page as missing and rebuild the index or report a corruption error. Even if you do not implement repair, detecting corruption is a valuable feature and reinforces the importance of write ordering. Checksums also make your crash tests more informative because you can distinguish between logical corruption and physical write errors.

How this fits into the project

Crash safety is required in §3.2 and described in §3.7. It is implemented in §5.10 Phase 3 and tested in §6.2. It is also central to P06-capstone-simple-database-engine.md.

Definitions & key terms

  • write ordering -> sequence of writes to ensure durability
  • superblock -> metadata block with root pointer
  • copy-on-write -> write new pages, then switch root
  • WAL -> write-ahead log

Mental model diagram

write child page -> fsync -> update parent -> fsync

How it works (step-by-step, with invariants and failure modes)

  1. Create new page for split and write it to disk.
  2. fsync new page.
  3. Update parent to point to new page.
  4. fsync parent or superblock.

Invariants: parent pointers always refer to valid pages. Failure modes: crash between writes leaves partial updates.

Minimal concrete example

Split leaf: write new leaf -> fsync -> update parent -> fsync

Common misconceptions

  • “Disk writes are atomic” -> they are not; partial writes happen.
  • “Order doesn’t matter” -> ordering is essential for consistency.
  • “OS cache is enough” -> fsync is required for durability.

Check-your-understanding questions

  1. Why must new pages be written before parent updates?
  2. What does a superblock store?
  3. How does copy-on-write provide crash safety?

Check-your-understanding answers

  1. Otherwise the parent may reference an unwritten page after a crash.
  2. Root page id, page size, version, and metadata.
  3. It writes a new consistent tree and flips the root pointer atomically.

Real-world applications

  • Filesystem metadata updates.
  • Database index maintenance.

References

  • Database Internals, recovery chapters
  • Filesystem design papers (e.g., WAFL)

Key insights

Crash safety is mostly about write ordering and careful metadata updates.

Summary

B-tree updates span multiple pages, so you must enforce a safe write order or use WAL/CoW to prevent corruption.

Homework/Exercises to practice the concept

  1. Implement a superblock with checksum and version.
  2. Simulate a crash during a split and verify recovery behavior.

Solutions to the homework/exercises

  1. Store two copies and choose the latest valid checksum at startup.
  2. Stop after writing child, restart, and run a tree checker.

3. Project Specification

3.1 What You Will Build

A disk-based B-tree index that stores fixed-length file-name keys and maps them to file IDs. The index supports prefix queries and range scans and persists across restarts. It uses fixed-size pages, linked leaves, and a simple buffer cache. The system provides a CLI for building the index from a directory and querying it.

3.2 Functional Requirements

  1. Build an index file from a list of file names.
  2. Support point lookups and prefix queries.
  3. Support range scans between two keys.
  4. Persist pages to disk and reopen index without rebuild.
  5. Provide stats: tree height, fanout, page count, cache hit rate.

3.3 Non-Functional Requirements

  • Performance: O(log_f n) lookups with minimal IO.
  • Reliability: crash-safe updates with write ordering.
  • Usability: deterministic CLI output.

3.4 Example Usage / Output

$ ./btree_index --build ./data
[index] pagesize=4096
[index] inserted 120000 keys
[index] height=3 fanout=96

$ ./btree_index --query "log_"
log_2024_12_31
log_2025_01_01

3.5 Data Formats / Schemas / Protocols

On-disk page layout:

[page_header][keys...][child_ptrs...][next_leaf]
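With fixed-length keys, this layout gives a computable fanout via max_keys = (page_size - header_size) / (key_size + child_ptr_size). A sketch, where the 64-byte key and 4-byte page-id sizes are illustrative:

```c
#include <stddef.h>
#include <stdint.h>

#define PAGE_SIZE 4096
#define KEY_SIZE  64   /* fixed-length file-name keys (illustrative) */
#define PTR_SIZE  4    /* 32-bit page ids as child pointers (illustrative) */

/* Matches the [page_header][keys...][child_ptrs...][next_leaf] sketch;
 * exact fields are up to your design. */
struct page_header {
    uint16_t is_leaf;
    uint16_t key_count;
    uint32_t next_leaf;
};

/* max_keys = (page_size - header_size) / (key_size + child_ptr_size) */
static size_t max_keys_per_page(void) {
    return (PAGE_SIZE - sizeof(struct page_header)) / (KEY_SIZE + PTR_SIZE);
}
```

With these sizes an 8-byte header leaves room for 60 keys per 4096-byte page; a smaller per-key budget raises the fanout (a fanout of 96, as in the sample output, implies roughly 42 bytes per key-plus-pointer entry).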

3.6 Edge Cases

  • Duplicate keys overwrite existing values.
  • Prefix query with no matches returns empty output.
  • Crash during split preserves previous valid tree.

3.7 Real World Outcome

3.7.1 How to Run (Copy/Paste)

cc -O2 -Wall -Wextra -o btree_index src/main.c src/btree.c src/cache.c
./btree_index --build ./data --page-size 4096

3.7.2 Golden Path Demo (Deterministic)

$ ./btree_index --build ./data
[index] pagesize=4096
[index] inserted 10 keys
[index] height=1 fanout=96

$ ./btree_index --query "img_"
img_001
img_002

3.7.3 Failure Demo (Deterministic)

$ ./btree_index --query "missing_"
(no matches)

Exit codes:

  • 0 on normal exit
  • 2 on invalid arguments

4. Solution Architecture

4.1 High-Level Design

CLI -> B-tree -> Page Cache -> Disk File

4.2 Key Components

| Component | Responsibility | Key Decisions |
|-----------|----------------|---------------|
| Page layout | node serialization | fixed-size keys |
| B-tree ops | search/insert/split | promote middle key |
| Cache | page reuse | LRU eviction |
| Crash safety | write ordering | child before parent |
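For the cache component, the LRU decision can be sketched with a small fixed slot array and a logical clock. This is a minimal sketch: a real cache would also track dirty pages for write-back and use a hash map instead of a linear scan.

```c
#include <stdint.h>

#define PAGE_SIZE   4096
#define CACHE_SLOTS 8

/* Minimal LRU page cache: slots are scanned linearly and recency is
 * tracked with a monotonically increasing logical clock. */
struct cache {
    uint32_t page_id[CACHE_SLOTS];
    uint64_t last_used[CACHE_SLOTS];
    int      valid[CACHE_SLOTS];
    uint64_t clock;
    uint64_t hits, misses;
    uint8_t  data[CACHE_SLOTS][PAGE_SIZE];
};

/* Return the cached slot for page_id, or -1 on a miss, in which case
 * *evict_slot names the LRU slot the caller should load into. */
static int cache_lookup(struct cache *c, uint32_t page_id, int *evict_slot) {
    int lru = 0;
    for (int i = 0; i < CACHE_SLOTS; i++) {
        if (c->valid[i] && c->page_id[i] == page_id) {
            c->last_used[i] = ++c->clock; /* refresh recency on a hit */
            c->hits++;
            return i;
        }
        if (!c->valid[i] || c->last_used[i] < c->last_used[lru])
            lru = i; /* empty slots and stale slots are eviction targets */
    }
    c->misses++;
    *evict_slot = lru;
    return -1;
}
```

The hits/misses counters feed directly into the `--build` stats requirement (cache hit rate).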

4.3 Data Structures (No Full Code)

struct page_header {
    uint16_t is_leaf;
    uint16_t key_count;
    uint32_t next_leaf; // page id of right sibling (leaf nodes only)
};

4.4 Algorithm Overview

Key Algorithm: B-tree insert

  1. Descend to leaf and insert key.
  2. If overflow, split and promote key.
  3. Repeat up the tree.

Complexity Analysis:

  • Time: O(log_f n)
  • Space: O(n)
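The insert-and-split steps can be prototyped in memory first (see Hint 1), with a deliberately tiny fanout so splits are easy to trigger. The struct and sizes below are illustrative; in a B+-style tree the promoted separator is copied (not moved) up, and a real split also relinks next_leaf pointers, which is omitted here.

```c
#include <string.h>

#define MAX_KEYS 4   /* tiny fanout so splits are easy to trigger */
#define KEY_SIZE 16

/* In-memory leaf sketch; one spare slot lets us insert, then split. */
struct leaf {
    int nkeys;
    char keys[MAX_KEYS + 1][KEY_SIZE];
};

/* Insert in sorted position; returns 1 if the leaf now overflows. */
static int leaf_insert(struct leaf *l, const char *key) {
    int i = l->nkeys;
    while (i > 0 && strcmp(l->keys[i - 1], key) > 0) {
        memcpy(l->keys[i], l->keys[i - 1], KEY_SIZE); /* shift right */
        i--;
    }
    strncpy(l->keys[i], key, KEY_SIZE - 1);
    l->keys[i][KEY_SIZE - 1] = '\0';
    l->nkeys++;
    return l->nkeys > MAX_KEYS;
}

/* Split: move the upper half into `right`; the separator promoted to
 * the parent is the first key of `right`. */
static const char *leaf_split(struct leaf *l, struct leaf *right) {
    int mid = l->nkeys / 2;
    right->nkeys = l->nkeys - mid;
    memcpy(right->keys, l->keys[mid], (size_t)right->nkeys * KEY_SIZE);
    l->nkeys = mid;
    return right->keys[0];
}
```

Once this passes in memory, serializing each `struct leaf` into a page and writing the new sibling before the parent gives you steps 2 and 3 of the algorithm on disk.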

5. Implementation Guide

5.1 Development Environment Setup

cc --version

5.2 Project Structure

btree_index/
├── src/
│   ├── main.c
│   ├── btree.c
│   ├── page.c
│   └── cache.c
└── README.md

5.3 The Core Question You’re Answering

“How do you design a tree that minimizes disk IO rather than CPU time?”

5.4 Concepts You Must Understand First

  1. Node layout and fanout
  2. Page-oriented IO and caching
  3. Split logic and balancing
  4. Range scans with linked leaves
  5. Crash-safe write ordering

5.5 Questions to Guide Your Design

  1. What page size fits your disk and key sizes?
  2. How will you serialize nodes to disk?
  3. What is your split policy and how do you ensure crash safety?

5.6 Thinking Exercise

If a page holds 100 keys, how many levels do you need for 1 billion keys? What does that imply about IO?
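A quick way to check the arithmetic: 100^4 = 10^8 < 10^9 <= 100^5 = 10^10, so about 5 levels suffice, and if the cache holds the top levels, only the bottom one or two page reads typically hit disk.

```c
/* Levels needed so that fanout^height >= n: each level down multiplies
 * the number of reachable keys by the fanout. */
static int btree_height(unsigned long long n, unsigned long long fanout) {
    int h = 0;
    unsigned long long reach = 1;
    while (reach < n) {
        reach *= fanout;
        h++;
    }
    return h;
}
```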

5.7 The Interview Questions They’ll Ask

  1. Why are B-trees used for disk indexes?
  2. What is the difference between B-tree and B+ tree?
  3. How do you handle node splits safely?

5.8 Hints in Layers

Hint 1: Start with an in-memory B-tree.
Hint 2: Add serialization and page IO.
Hint 3: Add linked leaves and range scans.

5.9 Books That Will Help

| Topic | Book | Chapter |
|-------|------|---------|
| B-tree design | Algorithms, Fourth Edition | Ch. 6.1 |
| Storage engines | Database Internals | Ch. 2-4 |

5.10 Implementation Phases

Phase 1: Foundation (6-10 hours)

Goals: page layout and serialization
Tasks: define header, encode/decode pages
Checkpoint: write and read a page correctly

Phase 2: Core Functionality (8-12 hours)

Goals: insert and split
Tasks: implement search/insert, split logic
Checkpoint: ordered keys and stable height

Phase 3: Polish & Edge Cases (8-12 hours)

Goals: range scans and crash safety
Tasks: linked leaves, write ordering, stats
Checkpoint: prefix queries work and survive restart

5.11 Key Implementation Decisions

| Decision | Options | Recommendation | Rationale |
|----------|---------|----------------|-----------|
| Key size | variable, fixed | fixed | simpler layout |
| Cache policy | none, LRU | LRU | reduces IO |
| Crash safety | write ordering, WAL | write ordering | simpler for project |


6. Testing Strategy

6.1 Test Categories

| Category | Purpose | Examples |
|----------|---------|----------|
| Unit Tests | page encode/decode | header fields |
| Integration Tests | insert + lookup | random keys |
| Edge Case Tests | split boundary | full node insert |

6.2 Critical Test Cases

  1. Insert sorted keys and verify tree height.
  2. Crash during split and verify old tree intact.
  3. Prefix scan returns ordered results.

6.3 Test Data

log_001
log_002
img_001

7. Common Pitfalls & Debugging

7.1 Frequent Mistakes

| Pitfall | Symptom | Solution |
|---------|---------|----------|
| Bad split | missing keys | validate parent updates |
| Broken leaf links | range scan skips | update next pointers on split |
| Cache bugs | stale data | flush dirty pages correctly |

7.2 Debugging Strategies

  • Add a tree checker to validate ordering and node sizes.
  • Log page reads/writes with page ids.
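The tree checker can be a short recursive walk. The node struct here is hypothetical (adapt the field names to your layout) and assumes the B+-style convention that a separator equals the first key of its right subtree, so child i holds keys in [keys[i-1], keys[i]).

```c
#include <string.h>

#define MAX_KEYS 64
#define KEY_SIZE 64

/* Hypothetical in-memory view of a node for checking. */
struct node {
    int is_leaf;
    int nkeys;
    char keys[MAX_KEYS][KEY_SIZE];
    struct node *children[MAX_KEYS + 1];
};

/* Validate ordering and key bounds: every key must satisfy
 * lo <= key < hi (NULL bound = unbounded). Returns 0 on violation. */
static int check_node(const struct node *n, const char *lo, const char *hi) {
    for (int i = 0; i < n->nkeys; i++) {
        if (i > 0 && strcmp(n->keys[i - 1], n->keys[i]) >= 0) return 0;
        if (lo && strcmp(n->keys[i], lo) < 0) return 0;
        if (hi && strcmp(n->keys[i], hi) >= 0) return 0;
    }
    if (n->is_leaf) return 1;
    for (int i = 0; i <= n->nkeys; i++) {
        const char *clo = i > 0 ? n->keys[i - 1] : lo;
        const char *chi = i < n->nkeys ? n->keys[i] : hi;
        if (!check_node(n->children[i], clo, chi)) return 0;
    }
    return 1;
}
```

Running this after every insert during development catches bad splits immediately, rather than as a missing key thousands of operations later.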

7.3 Performance Traps

  • Too small pages increase height and IO.
  • No cache leads to repeated disk reads.

8. Extensions & Challenges

8.1 Beginner Extensions

  • Add exact key lookup tests.
  • Add range scan output limits.

8.2 Intermediate Extensions

  • Implement delete with merge/borrow.
  • Add a free page list.

8.3 Advanced Extensions

  • Implement WAL for page updates.
  • Add prefix compression in leaf nodes.

9. Real-World Connections

9.1 Industry Applications

  • Filesystem indexes (ext4, XFS).
  • Database primary indexes (InnoDB).
  • SQLite’s B-tree implementation.
  • LevelDB (a contrasting LSM-tree design).

9.2 Interview Relevance

  • Disk vs memory data structures
  • B-tree invariants

10. Resources

10.1 Essential Reading

  • Algorithms, Fourth Edition, Ch. 6.1
  • Database Internals, Ch. 2-4

10.2 Video Resources

  • “B-tree Internals” talks

10.3 Tools & Documentation

  • strace to observe page IO

11. Self-Assessment Checklist

11.1 Understanding

  • I can explain fanout and page size impact.
  • I can explain split logic.
  • I can explain range scans.

11.2 Implementation

  • All functional requirements are met.
  • Index persists across restart.
  • Range scans are ordered.

11.3 Growth

  • I can compare B-tree and LSM tradeoffs.

12. Submission / Completion Criteria

Minimum Viable Completion:

  • Disk-based B-tree with lookup
  • Basic insert with splits

Full Completion:

  • Range scans and prefix queries
  • Buffer cache with stats

Excellence (Going Above & Beyond):

  • Crash-safe WAL or CoW updates
  • Delete support with merge/borrow