Project 5: B-Tree File Index (Mini Filesystem Index)
Build a disk-backed B-tree that indexes file names and supports range queries.
Quick Reference
| Attribute | Value |
|---|---|
| Difficulty | Expert |
| Time Estimate | 2-3 weeks |
| Language | C |
| Prerequisites | Trees, file I/O, disk vs memory basics |
| Key Topics | B-tree nodes, page layout, range queries |
1. Learning Objectives
By completing this project, you will:
- Implement a B-tree node format sized to a disk page.
- Support insert and search with node splits.
- Persist the tree to disk and recover on restart.
- Implement prefix/range queries over sorted keys.
2. Theoretical Foundation
2.1 Core Concepts
- High fanout: B-trees reduce disk I/O by storing many keys per node.
- Page-oriented design: Nodes match disk page size to minimize reads.
- Balanced height: Guarantees logarithmic search depth.
2.2 Why This Matters
B-trees are the backbone of filesystems and database indexes. This project teaches why binary trees fail on disk.
2.3 Historical Context / Background
B-trees were invented for efficient storage on block devices and are still the standard in modern databases.
2.4 Common Misconceptions
- “Binary trees are enough”: On disk, they cause too many random reads.
- “Balancing is optional”: It is required for predictable performance.
3. Project Specification
3.1 What You Will Build
A persistent B-tree index stored in a file. It supports insert, exact lookup, and prefix range queries.
3.2 Functional Requirements
- Insert key/value pairs (file name -> metadata or id).
- Search for exact keys.
- Perform prefix or range queries.
- Persist all nodes to disk and recover on startup.
3.3 Non-Functional Requirements
- Performance: O(log n) searches and inserts.
- Reliability: Durable node writes.
- Usability: Simple CLI for insert/search/range.
3.4 Example Usage / Output
$ ./btree_index insert log_2024_01.txt 42
OK
$ ./btree_index range log_
log_2024_01.txt -> 42
3.5 Real World Outcome
You can insert thousands of file names and query by prefix after restarting:
$ ./btree_index insert log_2024_01.txt 42
OK
$ ./btree_index range log_
log_2024_01.txt -> 42
4. Solution Architecture
4.1 High-Level Design
insert/search -> load node -> modify/split -> write node -> update root
4.2 Key Components
| Component | Responsibility | Key Decisions |
|---|---|---|
| Page manager | Read/write nodes | Fixed page size |
| B-tree logic | Search/insert/split | B-tree order |
| Serializer | Node <-> bytes | Fixed layout |
| CLI | Command handling | Simple text commands |
4.3 Data Structures
typedef struct {
uint16_t num_keys;
char keys[MAX_KEYS][KEY_SIZE];
uint32_t children[MAX_KEYS + 1];
} btree_node_t;
4.4 Algorithm Overview
Key Algorithm: Insert with Split
- Descend to leaf.
- Insert key.
- If overflow, split and promote middle key.
Complexity Analysis:
- Time: O(log n)
- Space: O(1) per insert (amortized)
5. Implementation Guide
5.1 Development Environment Setup
gcc --version
5.2 Project Structure
project-root/
├── btree.c
├── pager.c
├── main.c
└── README.md
5.3 The Core Question You’re Answering
“Why do disk-based indexes use B-trees instead of binary trees or hash tables?”
5.4 Concepts You Must Understand First
Stop and research these before coding:
- B-tree order and node capacity
- Disk page sizes and alignment
- Range query traversal
5.5 Questions to Guide Your Design
Before implementing, think through these:
- What page size will you target (4KB)?
- How will you represent variable-length keys?
- How do you ensure atomic updates on crash?
5.6 Thinking Exercise
Node split
For a B-tree of order 4, insert keys in sorted order. When does the first split happen and which key is promoted?
5.7 The Interview Questions They’ll Ask
Prepare to answer these:
- “Why do B-trees have high fanout?”
- “How does a B-tree split work?”
- “Why are B-trees better for range queries than hash tables?”
5.8 Hints in Layers
Hint 1: Fixed key size Start with fixed-size keys for simpler node layout.
Hint 2: Pager abstraction Hide file I/O behind read_page/write_page functions.
Hint 3: Start in-memory Implement the B-tree in memory first, then add persistence.
5.9 Books That Will Help
| Topic | Book | Chapter |
|---|---|---|
| B-tree basics | “Database Internals” | Ch. 2-4 |
| Balanced trees | “CLRS” | Ch. 18 |
| Disk I/O | “Designing Data-Intensive Applications” | Ch. 3 |
5.10 Implementation Phases
Phase 1: Foundation (4-5 days)
Goals:
- In-memory B-tree search/insert.
Tasks:
- Implement node structure.
- Implement search and insert.
Checkpoint: Insert and lookup work in memory.
Phase 2: Core Functionality (1 week)
Goals:
- Disk-backed nodes and paging.
Tasks:
- Write nodes to fixed-size pages.
- Load nodes on demand.
Checkpoint: Data persists across restart.
Phase 3: Polish & Edge Cases (4-5 days)
Goals:
- Range queries and split correctness.
Tasks:
- Implement range scan.
- Validate split and merge behavior.
Checkpoint: Prefix queries return correct results.
5.11 Key Implementation Decisions
| Decision | Options | Recommendation | Rationale |
|---|---|---|---|
| Key format | fixed vs variable | fixed | Easier page layout |
| Durability | fsync every write vs batch | batch | Performance |
6. Testing Strategy
6.1 Test Categories
| Category | Purpose | Examples |
|---|---|---|
| Insert/search | Correctness | random key inserts |
| Split | Node overflow | ordered inserts |
| Range | Prefix queries | log_ prefix |
6.2 Critical Test Cases
- Insert enough keys to trigger multiple splits.
- Search returns correct values after restart.
- Range query returns sorted results.
6.3 Test Data
Keys: log_0001 .. log_1000
7. Common Pitfalls & Debugging
7.1 Frequent Mistakes
| Pitfall | Symptom | Solution |
|---|---|---|
| Wrong split index | Missing keys | Verify median selection |
| Misaligned pages | Corrupt reads | Use fixed-size structs |
| No root update | Tree broken | Handle root split carefully |
7.2 Debugging Strategies
- Add a tree dump command to print nodes.
- Validate ordering after each insert.
7.3 Performance Traps
Reading nodes repeatedly without caching increases I/O; add a small page cache.
8. Extensions & Challenges
8.1 Beginner Extensions
- Add delete support.
- Add statistics about tree height.
8.2 Intermediate Extensions
- Add a page cache with LRU eviction.
- Support variable-length keys.
8.3 Advanced Extensions
- Add WAL for crash recovery.
- Implement B+tree leaf chaining.
9. Real-World Connections
9.1 Industry Applications
- Filesystem indexes, database secondary indexes, directory services.
9.2 Related Open Source Projects
- SQLite: https://sqlite.org
- LMDB: https://symas.com/lmdb
9.3 Interview Relevance
- B-trees and disk-based indexes are common in database interviews.
10. Resources
10.1 Essential Reading
- Database Internals by Alex Petrov
- B-tree sections in CLRS Ch. 18
10.2 Video Resources
- B-tree visualizations (search “B-tree insert split”)
10.3 Tools & Documentation
- fsync(2) -
man 2 fsync
10.4 Related Projects in This Series
- Log-Structured Key-Value Store: compare disk structures.
11. Self-Assessment Checklist
11.1 Understanding
- I can explain why B-trees are disk-friendly.
- I can describe split and promote logic.
- I can explain range query traversal.
11.2 Implementation
- Inserts and searches are correct.
- Data persists across restart.
- Range queries return sorted results.
11.3 Growth
- I can extend to B+tree or add deletion.
- I can reason about crash recovery.
12. Submission / Completion Criteria
Minimum Viable Completion:
- Disk-backed B-tree with insert and search.
Full Completion:
- Range queries and persistence across restart.
Excellence (Going Above & Beyond):
- Add WAL and B+tree leaf chaining.
This guide was generated from SPRINT_4_5_REAL_WORLD_PROJECTS.md. For the complete learning path, see the parent directory.