Project 3: Disk-Backed B+Tree Library

Build a persistent B+ tree that supports point lookups and range scans with verifiable structural invariants.

Quick Reference

Attribute Value
Difficulty Advanced
Time Estimate 1-2 weeks
Main Programming Language C (Alternatives: Rust, C++)
Alternative Programming Languages Rust, C++
Coolness Level Level 4: Hardcore Tech Flex
Business Potential Level 4: Open Core Infrastructure
Prerequisites Project 1, sorted data structures, page serialization
Key Topics Fan-out economics, split/merge, separator correctness

1. Learning Objectives

  1. Implement B+ tree search/insert/delete on page-backed nodes.
  2. Enforce and verify tree invariants after structural changes.
  3. Support deterministic range scans via leaf sibling links.
  4. Measure lookup depth and compare against linear scan baseline.

2. All Theory Needed (Per-Concept Breakdown)

B+Tree Structural Invariants

Fundamentals

A B+ tree keeps all leaves at equal depth and stores separators in internal nodes that partition child key ranges. Leaves store sorted keys and references to values or row locations. Every operation must preserve sorted order, depth balance, and separator validity.

Deep Dive into the concept

B+ trees are tuned for block devices. One node is usually one page; this allows high fan-out and low depth. Internal nodes route, leaves store data references, and linked leaves enable efficient in-order traversal. Insertions may overflow nodes and trigger splits that can propagate to ancestors. Deletions may underflow nodes and require borrow/merge with siblings. The complexity is not in the high-level idea, but in edge cases where separators and sibling links diverge.

Verifier-first development is essential. After each randomized operation batch, run checks for: (1) key order inside nodes, (2) parent separator bounds, (3) equal leaf depth, (4) leaf-link consistency, and (5) key-set equivalence to oracle map. This discipline catches subtle corruption before it becomes persistent.

Choice of split policy affects behavior. Median split is stable and balanced; right-biased split can help append-heavy workloads. For learning, median split is safest. Deletion logic can be staged: tombstone delete first, full rebalance later. Regardless, invariant checks remain mandatory.

How this fit on projects

  • Primary index project.
  • Reused by SQL engine project for indexed lookups.

Definitions & key terms

  • Separator key: Boundary value in parent node.
  • Leaf chain: Linked list across leaf pages.
  • Underflow: Node occupancy below minimum threshold.

Mental model diagram

                [40 | 80]
              /     |     \
      [10 20 30] [50 60 70] [90 95]
          |           |         |
      Leaf A <-----> Leaf B <-> Leaf C

How it works (step-by-step, with invariants and failure modes)

  1. Descend root-to-leaf using separators.
  2. Insert key in sorted order.
  3. Split if overflow; promote separator.
  4. Propagate until stable.
  5. Verify tree invariants.

Failure modes:

  • Wrong promoted separator causes invisible keys.
  • Broken leaf chain skips range rows.

Minimal concrete example

insert 58 into leaf [50,55,60,70]
overflow -> split into [50,55] + [58,60,70]
promote 58 to parent

Common misconceptions

  • “O(log n) guarantees good performance always.” -> Constants matter (fan-out/cache).
  • “Deletion is easy inverse of insert.” -> Underflow logic is trickier.

Check-your-understanding questions

  1. Why are leaf links needed if internal nodes already route keys?
  2. What invariant catches stale separators?
  3. How can split bugs survive small tests?

Check-your-understanding answers

  1. They make sequential range traversal efficient.
  2. Parent bounds vs child min/max key checks.
  3. Edge cases appear only at larger depth/occupancy boundaries.

Real-world applications

  • SQLite table/index B-trees.
  • InnoDB secondary indexes.

Where you’ll apply it

  • This project and Project 6 query planning/execution paths.

References

Key insights

Separator correctness is the heart of index correctness.

Summary

A B+ tree is simple conceptually, but production-grade correctness requires verifier-driven development.

Homework/Exercises to practice the concept

  1. Compute expected height for fan-out 128 at 50M keys.
  2. Define verifier checks and failure messages.
  3. Hand-simulate one split cascade and one merge cascade.

Solutions to the homework/exercises

  1. Height remains small (few levels) with high fan-out.
  2. Include order, depth, bounds, leaf-link, and key-count checks.
  3. Track parent updates at every step to avoid separator drift.

3. Project Specification

3.1 What You Will Build

Persistent B+ tree library with insert, get, delete, range, and verify commands.

3.2 Functional Requirements

  1. Point lookup for existing/missing keys.
  2. Ordered range query.
  3. Insert with split propagation.
  4. Delete with rebalance policy.
  5. Offline invariant verification.

3.3 Non-Functional Requirements

  • Performance: Depth grows logarithmically.
  • Reliability: No structural corruption under stress tests.
  • Usability: Verifier outputs actionable diagnostics.

3.4 Example Usage / Output

load 1M keys -> lookup key -> range scan -> verify all invariants

3.5 Data Formats / Schemas / Protocols

Node header + sorted key array + pointer/value references + sibling metadata.

3.6 Edge Cases

Duplicate inserts, root split, root shrink, leaf underflow, separator edge equality.

3.7 Real World Outcome

3.7.1 How to Run (Copy/Paste)

make bptree_demo
./bptree_demo load --keys 1000000 --seed 7
./bptree_demo verify

3.7.2 Golden Path Demo (Deterministic)

Fixed key corpus and deterministic operation sequence.

3.7.3 If CLI: exact terminal transcript

$ ./bptree_demo load --keys 1000000 --seed 7
build complete pages=8421 height=4 avg_leaf_fill=0.73
$ ./bptree_demo get 884422
FOUND depth=4 page=5812 slot=9
$ ./bptree_demo verify
OK invariants=all

4. Solution Architecture

4.1 High-Level Design

B+Tree API -> Node Manager -> Page I/O Layer -> Data File

4.2 Key Components

Component Responsibility Key Decisions
Tree API Public operations deterministic errors
Node manager split/merge logic strict invariant checks
Leaf scanner range traversal sibling-link correctness
Verifier offline validation fail-fast diagnostics

4.4 Data Structures (No Full Code)

NodeHeader { node_type, key_count, page_id, parent_id }
InternalNode { keys[], child_page_ids[] }
LeafNode { keys[], value_refs[], next_leaf, prev_leaf }

4.4 Algorithm Overview

  • Search: descend via separators.
  • Insert: leaf insert + split propagation.
  • Delete: remove + borrow/merge as needed.

Complexity:

  • Lookup: O(log n)
  • Range scan: O(log n + k)

5. Implementation Guide

5.1 Development Environment Setup

make bptree_demo

5.2 Project Structure

project/
├── src/bptree/
├── src/page/
├── tests/
└── tools/verify

5.3 The Core Question You’re Answering

How can I keep indexed lookup fast while continuously mutating on-disk structure?

5.4 Concepts You Must Understand First

  • Tree balance invariants
  • Split/merge edge handling
  • Leaf-chain semantics

5.5 Questions to Guide Your Design

  • What separator do you promote on split?
  • How do you detect and recover from sibling-link corruption?

5.6 Thinking Exercise

Trace a full insert sequence until root split and justify each parent update.

5.7 The Interview Questions They’ll Ask

  1. Why are B+ trees preferred over binary search trees on disk?
  2. What invariants must always hold?
  3. How do range scans work efficiently?
  4. How do you verify index integrity offline?

5.8 Hints in Layers

  • Build read-only traversal first.
  • Add leaf split path before internal split.
  • Run verifier after every mutation batch.

5.9 Books That Will Help

Topic Book Chapter
Balanced trees Algorithms, Fourth Edition relevant sections
Memory/cache effects Computer Systems: A Programmer’s Perspective Ch. 6

5.10 Implementation Phases

  • Phase 1: search and static tree loading
  • Phase 2: insertion + split
  • Phase 3: deletion + verifier hardening

5.11 Key Implementation Decisions

Decision Options Recommendation Rationale
Split policy median/right-biased median simpler correctness
Delete strategy immediate rebalance/lazy staged rebalance easier debugging

6. Testing Strategy

6.1 Test Categories

  • Unit: node encode/decode
  • Integration: random operations vs oracle map
  • Property: invariant preservation under fuzz sequences

6.2 Critical Test Cases

  1. Root split and height growth.
  2. Leaf merge and parent separator updates.
  3. Range scan monotonicity across leaf boundaries.

6.3 Test Data

Deterministic random sequences with fixed seed and operation script.


7. Common Pitfalls & Debugging

7.1 Frequent Mistakes

Pitfall Symptom Solution
Stale separators Missing keys in lookups parent-child bound checks
Broken leaf links range skips/loops bidirectional link validator

7.2 Debugging Strategies

  • Dump path trace for failed lookup.
  • Visualize tree levels after each split.

7.3 Performance Traps

Overly small node capacity inflates depth and I/O.


8. Extensions & Challenges

8.1 Beginner Extensions

  • Add tree visualizer by level.

8.2 Intermediate Extensions

  • Add prefix compression for keys.

8.3 Advanced Extensions

  • Add latch coupling for concurrent reads.

9. Real-World Connections

9.1 Industry Applications

B+ trees back core indexing in major relational databases.

  • SQLite source/docs for B-tree internals.

9.3 Interview Relevance

Common interviews ask split/merge mechanics and invariants.


10. Resources

10.1 Essential Reading

  • SQLite architecture and file format docs.
  • InnoDB index structure docs.

10.2 Video Resources

  • Storage engine internals talks focused on indexing.

10.3 Tools & Documentation

  • Tree visualizer scripts, deterministic fuzz harnesses.

11. Self-Assessment Checklist

11.1 Understanding

  • I can explain split and merge invariants.
  • I can reason about fan-out and depth quantitatively.

11.2 Implementation

  • Invariant verifier passes randomized workloads.
  • Restart persistence keeps tree consistent.

11.3 Growth

  • I can compare B+ tree vs alternative index tradeoffs.

12. Submission / Completion Criteria

Minimum Viable Completion:

  • Insert/get/range + persistent node format

Full Completion:

  • Plus delete/rebalance and robust verifier

Excellence:

  • Plus performance report and concurrent-read extension