Project 5: B-Tree File Index (Mini Filesystem Index)

Build a disk-backed B-tree that indexes file names and supports range queries.

Quick Reference

Attribute Value
Difficulty Expert
Time Estimate 2-3 weeks
Language C
Prerequisites Trees, file I/O, disk vs memory basics
Key Topics B-tree nodes, page layout, range queries

1. Learning Objectives

By completing this project, you will:

  1. Implement a B-tree node format sized to a disk page.
  2. Support insert and search with node splits.
  3. Persist the tree to disk and recover on restart.
  4. Implement prefix/range queries over sorted keys.

2. Theoretical Foundation

2.1 Core Concepts

  • High fanout: B-trees reduce disk I/O by storing many keys per node.
  • Page-oriented design: Nodes match disk page size to minimize reads.
  • Balanced height: Guarantees logarithmic search depth.

2.2 Why This Matters

B-trees are the backbone of filesystems and database indexes. This project teaches why binary trees fail on disk.

2.3 Historical Context / Background

B-trees were invented for efficient storage on block devices and are still the standard in modern databases.

2.4 Common Misconceptions

  • “Binary trees are enough”: On disk, they cause too many random reads.
  • “Balancing is optional”: It is required for predictable performance.

3. Project Specification

3.1 What You Will Build

A persistent B-tree index stored in a file. It supports insert, exact lookup, and prefix range queries.

3.2 Functional Requirements

  1. Insert key/value pairs (file name -> metadata or id).
  2. Search for exact keys.
  3. Perform prefix or range queries.
  4. Persist all nodes to disk and recover on startup.

3.3 Non-Functional Requirements

  • Performance: O(log n) searches and inserts.
  • Reliability: Durable node writes.
  • Usability: Simple CLI for insert/search/range.

3.4 Example Usage / Output

$ ./btree_index insert log_2024_01.txt 42
OK
$ ./btree_index range log_
log_2024_01.txt -> 42

3.5 Real World Outcome

You can insert thousands of file names and query by prefix after restarting:

$ ./btree_index insert log_2024_01.txt 42
OK
$ ./btree_index range log_
log_2024_01.txt -> 42

4. Solution Architecture

4.1 High-Level Design

insert/search -> load node -> modify/split -> write node -> update root

4.2 Key Components

Component Responsibility Key Decisions
Page manager Read/write nodes Fixed page size
B-tree logic Search/insert/split B-tree order
Serializer Node <-> bytes Fixed layout
CLI Command handling Simple text commands

4.3 Data Structures

typedef struct {
    uint16_t num_keys;
    char keys[MAX_KEYS][KEY_SIZE];
    uint32_t children[MAX_KEYS + 1];
} btree_node_t;

4.4 Algorithm Overview

Key Algorithm: Insert with Split

  1. Descend to leaf.
  2. Insert key.
  3. If overflow, split and promote middle key.

Complexity Analysis:

  • Time: O(log n)
  • Space: O(1) per insert (amortized)

5. Implementation Guide

5.1 Development Environment Setup

gcc --version

5.2 Project Structure

project-root/
├── btree.c
├── pager.c
├── main.c
└── README.md

5.3 The Core Question You’re Answering

“Why do disk-based indexes use B-trees instead of binary trees or hash tables?”

5.4 Concepts You Must Understand First

Stop and research these before coding:

  1. B-tree order and node capacity
  2. Disk page sizes and alignment
  3. Range query traversal

5.5 Questions to Guide Your Design

Before implementing, think through these:

  1. What page size will you target (4KB)?
  2. How will you represent variable-length keys?
  3. How do you ensure atomic updates on crash?

5.6 Thinking Exercise

Node split

For a B-tree of order 4, insert keys in sorted order. When does the first split happen and which key is promoted?

5.7 The Interview Questions They’ll Ask

Prepare to answer these:

  1. “Why do B-trees have high fanout?”
  2. “How does a B-tree split work?”
  3. “Why are B-trees better for range queries than hash tables?”

5.8 Hints in Layers

Hint 1: Fixed key size Start with fixed-size keys for simpler node layout.

Hint 2: Pager abstraction Hide file I/O behind read_page/write_page functions.

Hint 3: Start in-memory Implement the B-tree in memory first, then add persistence.

5.9 Books That Will Help

Topic Book Chapter
B-tree basics “Database Internals” Ch. 2-4
Balanced trees “CLRS” Ch. 18
Disk I/O “Designing Data-Intensive Applications” Ch. 3

5.10 Implementation Phases

Phase 1: Foundation (4-5 days)

Goals:

  • In-memory B-tree search/insert.

Tasks:

  1. Implement node structure.
  2. Implement search and insert.

Checkpoint: Insert and lookup work in memory.

Phase 2: Core Functionality (1 week)

Goals:

  • Disk-backed nodes and paging.

Tasks:

  1. Write nodes to fixed-size pages.
  2. Load nodes on demand.

Checkpoint: Data persists across restart.

Phase 3: Polish & Edge Cases (4-5 days)

Goals:

  • Range queries and split correctness.

Tasks:

  1. Implement range scan.
  2. Validate split and merge behavior.

Checkpoint: Prefix queries return correct results.

5.11 Key Implementation Decisions

Decision Options Recommendation Rationale
Key format fixed vs variable fixed Easier page layout
Durability fsync every write vs batch batch Performance

6. Testing Strategy

6.1 Test Categories

Category Purpose Examples
Insert/search Correctness random key inserts
Split Node overflow ordered inserts
Range Prefix queries log_ prefix

6.2 Critical Test Cases

  1. Insert enough keys to trigger multiple splits.
  2. Search returns correct values after restart.
  3. Range query returns sorted results.

6.3 Test Data

Keys: log_0001 .. log_1000

7. Common Pitfalls & Debugging

7.1 Frequent Mistakes

Pitfall Symptom Solution
Wrong split index Missing keys Verify median selection
Misaligned pages Corrupt reads Use fixed-size structs
No root update Tree broken Handle root split carefully

7.2 Debugging Strategies

  • Add a tree dump command to print nodes.
  • Validate ordering after each insert.

7.3 Performance Traps

Reading nodes repeatedly without caching increases I/O; add a small page cache.


8. Extensions & Challenges

8.1 Beginner Extensions

  • Add delete support.
  • Add statistics about tree height.

8.2 Intermediate Extensions

  • Add a page cache with LRU eviction.
  • Support variable-length keys.

8.3 Advanced Extensions

  • Add WAL for crash recovery.
  • Implement B+tree leaf chaining.

9. Real-World Connections

9.1 Industry Applications

  • Filesystem indexes, database secondary indexes, directory services.
  • SQLite: https://sqlite.org
  • LMDB: https://symas.com/lmdb

9.3 Interview Relevance

  • B-trees and disk-based indexes are common in database interviews.

10. Resources

10.1 Essential Reading

  • Database Internals by Alex Petrov
  • B-tree sections in CLRS Ch. 18

10.2 Video Resources

  • B-tree visualizations (search “B-tree insert split”)

10.3 Tools & Documentation

  • fsync(2) - man 2 fsync

11. Self-Assessment Checklist

11.1 Understanding

  • I can explain why B-trees are disk-friendly.
  • I can describe split and promote logic.
  • I can explain range query traversal.

11.2 Implementation

  • Inserts and searches are correct.
  • Data persists across restart.
  • Range queries return sorted results.

11.3 Growth

  • I can extend to B+tree or add deletion.
  • I can reason about crash recovery.

12. Submission / Completion Criteria

Minimum Viable Completion:

  • Disk-backed B-tree with insert and search.

Full Completion:

  • Range queries and persistence across restart.

Excellence (Going Above & Beyond):

  • Add WAL and B+tree leaf chaining.

This guide was generated from SPRINT_4_5_REAL_WORLD_PROJECTS.md. For the complete learning path, see the parent directory.