Project 3: Disk-Backed B+Tree Library

Build a persistent B+ tree that supports point lookups and range scans with verifiable structural invariants.

Quick Reference

Attribute	Value
Difficulty	Advanced
Time Estimate	1-2 weeks
Main Programming Language	C (Alternatives: Rust, C++)
Alternative Programming Languages	Rust, C++
Coolness Level	Level 4: Hardcore Tech Flex
Business Potential	Level 4: Open Core Infrastructure
Prerequisites	Project 1, sorted data structures, page serialization
Key Topics	Fan-out economics, split/merge, separator correctness

1. Learning Objectives

Implement B+ tree search/insert/delete on page-backed nodes.
Enforce and verify tree invariants after structural changes.
Support deterministic range scans via leaf sibling links.
Measure lookup depth and compare against linear scan baseline.

2. All Theory Needed (Per-Concept Breakdown)

B+Tree Structural Invariants

Fundamentals

A B+ tree keeps all leaves at equal depth and stores separators in internal nodes that partition child key ranges. Leaves store sorted keys and references to values or row locations. Every operation must preserve sorted order, depth balance, and separator validity.

Deep Dive into the concept

B+ trees are tuned for block devices. One node is usually one page; this allows high fan-out and low depth. Internal nodes route, leaves store data references, and linked leaves enable efficient in-order traversal. Insertions may overflow nodes and trigger splits that can propagate to ancestors. Deletions may underflow nodes and require borrow/merge with siblings. The complexity is not in the high-level idea, but in edge cases where separators and sibling links diverge.

Verifier-first development is essential. After each randomized operation batch, run checks for: (1) key order inside nodes, (2) parent separator bounds, (3) equal leaf depth, (4) leaf-link consistency, and (5) key-set equivalence to oracle map. This discipline catches subtle corruption before it becomes persistent.

Choice of split policy affects behavior. Median split is stable and balanced; right-biased split can help append-heavy workloads. For learning, median split is safest. Deletion logic can be staged: tombstone delete first, full rebalance later. Regardless, invariant checks remain mandatory.

How this fit on projects

Primary index project.
Reused by SQL engine project for indexed lookups.

Definitions & key terms

Separator key: Boundary value in parent node.
Leaf chain: Linked list across leaf pages.
Underflow: Node occupancy below minimum threshold.

Mental model diagram

                [40 | 80]
              /     |     \
      [10 20 30] [50 60 70] [90 95]
          |           |         |
      Leaf A <-----> Leaf B <-> Leaf C

How it works (step-by-step, with invariants and failure modes)

Descend root-to-leaf using separators.
Insert key in sorted order.
Split if overflow; promote separator.
Propagate until stable.
Verify tree invariants.

Failure modes:

Wrong promoted separator causes invisible keys.
Broken leaf chain skips range rows.

Minimal concrete example

insert 58 into leaf [50,55,60,70]
overflow -> split into [50,55] + [58,60,70]
promote 58 to parent

Common misconceptions

“O(log n) guarantees good performance always.” -> Constants matter (fan-out/cache).
“Deletion is easy inverse of insert.” -> Underflow logic is trickier.

Check-your-understanding questions

Why are leaf links needed if internal nodes already route keys?
What invariant catches stale separators?
How can split bugs survive small tests?

Check-your-understanding answers

They make sequential range traversal efficient.
Parent bounds vs child min/max key checks.
Edge cases appear only at larger depth/occupancy boundaries.

Real-world applications

SQLite table/index B-trees.
InnoDB secondary indexes.

Where you’ll apply it

This project and Project 6 query planning/execution paths.

References

Key insights

Separator correctness is the heart of index correctness.

Summary

A B+ tree is simple conceptually, but production-grade correctness requires verifier-driven development.

Homework/Exercises to practice the concept

Compute expected height for fan-out 128 at 50M keys.
Define verifier checks and failure messages.
Hand-simulate one split cascade and one merge cascade.

Solutions to the homework/exercises

Height remains small (few levels) with high fan-out.
Include order, depth, bounds, leaf-link, and key-count checks.
Track parent updates at every step to avoid separator drift.

3. Project Specification

3.1 What You Will Build

Persistent B+ tree library with insert, get, delete, range, and verify commands.

3.2 Functional Requirements

Point lookup for existing/missing keys.
Ordered range query.
Insert with split propagation.
Delete with rebalance policy.
Offline invariant verification.

3.3 Non-Functional Requirements

Performance: Depth grows logarithmically.
Reliability: No structural corruption under stress tests.
Usability: Verifier outputs actionable diagnostics.

3.4 Example Usage / Output

load 1M keys -> lookup key -> range scan -> verify all invariants

3.5 Data Formats / Schemas / Protocols

Node header + sorted key array + pointer/value references + sibling metadata.

3.6 Edge Cases

Duplicate inserts, root split, root shrink, leaf underflow, separator edge equality.

3.7 Real World Outcome

3.7.1 How to Run (Copy/Paste)

make bptree_demo
./bptree_demo load --keys 1000000 --seed 7
./bptree_demo verify

3.7.2 Golden Path Demo (Deterministic)

Fixed key corpus and deterministic operation sequence.

3.7.3 If CLI: exact terminal transcript

$ ./bptree_demo load --keys 1000000 --seed 7
build complete pages=8421 height=4 avg_leaf_fill=0.73
$ ./bptree_demo get 884422
FOUND depth=4 page=5812 slot=9
$ ./bptree_demo verify
OK invariants=all

4. Solution Architecture

4.1 High-Level Design

B+Tree API -> Node Manager -> Page I/O Layer -> Data File

4.2 Key Components

Component	Responsibility	Key Decisions
Tree API	Public operations	deterministic errors
Node manager	split/merge logic	strict invariant checks
Leaf scanner	range traversal	sibling-link correctness
Verifier	offline validation	fail-fast diagnostics

4.4 Data Structures (No Full Code)

NodeHeader { node_type, key_count, page_id, parent_id }
InternalNode { keys[], child_page_ids[] }
LeafNode { keys[], value_refs[], next_leaf, prev_leaf }

4.4 Algorithm Overview

Search: descend via separators.
Insert: leaf insert + split propagation.
Delete: remove + borrow/merge as needed.

Complexity:

Lookup: O(log n)
Range scan: O(log n + k)

5. Implementation Guide

5.1 Development Environment Setup

make bptree_demo

5.2 Project Structure

project/
├── src/bptree/
├── src/page/
├── tests/
└── tools/verify

5.3 The Core Question You’re Answering

How can I keep indexed lookup fast while continuously mutating on-disk structure?

5.4 Concepts You Must Understand First

Tree balance invariants
Split/merge edge handling
Leaf-chain semantics

5.5 Questions to Guide Your Design

What separator do you promote on split?
How do you detect and recover from sibling-link corruption?

5.6 Thinking Exercise

Trace a full insert sequence until root split and justify each parent update.

5.7 The Interview Questions They’ll Ask

Why are B+ trees preferred over binary search trees on disk?
What invariants must always hold?
How do range scans work efficiently?
How do you verify index integrity offline?

5.8 Hints in Layers

Build read-only traversal first.
Add leaf split path before internal split.
Run verifier after every mutation batch.

5.9 Books That Will Help

Topic	Book	Chapter
Balanced trees	Algorithms, Fourth Edition	relevant sections
Memory/cache effects	Computer Systems: A Programmer’s Perspective	Ch. 6

5.10 Implementation Phases

Phase 1: search and static tree loading
Phase 2: insertion + split
Phase 3: deletion + verifier hardening

5.11 Key Implementation Decisions

Decision	Options	Recommendation	Rationale
Split policy	median/right-biased	median	simpler correctness
Delete strategy	immediate rebalance/lazy	staged rebalance	easier debugging

6. Testing Strategy

6.1 Test Categories

Unit: node encode/decode
Integration: random operations vs oracle map
Property: invariant preservation under fuzz sequences

6.2 Critical Test Cases

Root split and height growth.
Leaf merge and parent separator updates.
Range scan monotonicity across leaf boundaries.

6.3 Test Data

Deterministic random sequences with fixed seed and operation script.

7. Common Pitfalls & Debugging

7.1 Frequent Mistakes

Pitfall	Symptom	Solution
Stale separators	Missing keys in lookups	parent-child bound checks
Broken leaf links	range skips/loops	bidirectional link validator

7.2 Debugging Strategies

Dump path trace for failed lookup.
Visualize tree levels after each split.

7.3 Performance Traps

Overly small node capacity inflates depth and I/O.

8. Extensions & Challenges

8.1 Beginner Extensions

Add tree visualizer by level.

8.2 Intermediate Extensions

Add prefix compression for keys.

8.3 Advanced Extensions

Add latch coupling for concurrent reads.

9. Real-World Connections

9.1 Industry Applications

B+ trees back core indexing in major relational databases.

SQLite source/docs for B-tree internals.

9.3 Interview Relevance

Common interviews ask split/merge mechanics and invariants.

10. Resources

10.1 Essential Reading

SQLite architecture and file format docs.
InnoDB index structure docs.

10.2 Video Resources

Storage engine internals talks focused on indexing.

10.3 Tools & Documentation

Tree visualizer scripts, deterministic fuzz harnesses.

Previous: Project 1
Next: Project 4

11. Self-Assessment Checklist

11.1 Understanding

I can explain split and merge invariants.
I can reason about fan-out and depth quantitatively.

11.2 Implementation

Invariant verifier passes randomized workloads.
Restart persistence keeps tree consistent.

11.3 Growth

I can compare B+ tree vs alternative index tradeoffs.

12. Submission / Completion Criteria

Minimum Viable Completion:

Insert/get/range + persistent node format

Full Completion:

Plus delete/rebalance and robust verifier

Excellence:

Plus performance report and concurrent-read extension