Project 16: Simple Database Library with Record Locking

Build a persistent key-value database library with record locking for concurrent access, demonstrating how databases use file locking, memory-mapped I/O, and B-tree indexing. Understand exactly how SQLite, Berkeley DB, and LMDB implement ACID properties using UNIX primitives.


Quick Reference

Attribute Value
Difficulty Level 5 - Master
Time 3-4 Weeks
Language C (alt: Rust)
Prerequisites Projects 1-3 (File I/O), Projects 7-8 (Threading), Understanding of data structures
Key Topics fcntl() record locking, mmap(), B-tree indexing, write-ahead logging, crash recovery
APUE Chapters 14, 20
Coolness Level 5 - Pure Magic
Portfolio Value Open Core Infrastructure

1. Learning Objectives

By completing this project, you will:

  1. Master fcntl() record locking: Understand the difference between read and write locks, byte-range locking, and how to prevent deadlocks in concurrent database access
  2. Implement memory-mapped I/O: Use mmap() for high-performance file access, understand MAP_SHARED vs MAP_PRIVATE, and properly synchronize with msync()
  3. Build a B-tree index: Implement the foundational data structure that powers virtually all databases, understanding node splitting, merging, and disk-optimized layouts
  4. Design for crash recovery: Implement write-ahead logging (WAL) or copy-on-write semantics to ensure data integrity even during power failures
  5. Handle concurrent access: Design a locking protocol that allows multiple readers and writers without data corruption or starvation
  6. Understand ACID properties: Learn how Atomicity, Consistency, Isolation, and Durability are achieved at the system level
  7. Profile database performance: Measure and optimize read/write throughput, understand the tradeoffs between different storage strategies

2. Theoretical Foundation

2.1 Core Concepts

The Database Problem

Databases solve a fundamental challenge: how to store and retrieve data efficiently while allowing multiple processes to access it concurrently without corruption.

The Database Challenge

Multiple Processes              Single Database File
┌─────────────┐                ┌─────────────────────┐
│  Process A  │──────┐         │                     │
│  (writing)  │      │         │   ┌─────────────┐   │
└─────────────┘      │         │   │  Header     │   │
                     ├────────►│   ├─────────────┤   │
┌─────────────┐      │         │   │  Index      │   │
│  Process B  │──────┤         │   │  (B-tree)   │   │
│  (reading)  │      │         │   ├─────────────┤   │
└─────────────┘      │         │   │  Data       │   │
                     │         │   │  Pages      │   │
┌─────────────┐      │         │   └─────────────┘   │
│  Process C  │──────┘         │                     │
│  (writing)  │                └─────────────────────┘
└─────────────┘

Problems we must prevent:
1. Process A overwriting Process C's changes (lost update)
2. Process B reading partially-written data (dirty read)
3. A power failure corrupting the file
4. Readers starving while writers hold locks

fcntl() Record Locking

UNIX provides advisory file locking through fcntl(). Unlike mandatory locking, advisory locking only works when all processes cooperate.

fcntl() Lock Types

Shared Lock (Read Lock)          Exclusive Lock (Write Lock)
┌────────────────────────┐       ┌────────────────────────┐
│ Multiple processes can │       │ Only ONE process can   │
│ hold read locks on the │       │ hold a write lock on   │
│ same byte range        │       │ the same byte range    │
│                        │       │                        │
│  Process A: F_RDLCK ✓  │       │  Process A: F_WRLCK ✓  │
│  Process B: F_RDLCK ✓  │       │  Process B: F_WRLCK ✗  │
│  Process C: F_RDLCK ✓  │       │  (blocked or EAGAIN)   │
│                        │       │                        │
│  Process D: F_WRLCK ✗  │       │  Process C: F_RDLCK ✗  │
│  (blocked or EAGAIN)   │       │  (blocked or EAGAIN)   │
└────────────────────────┘       └────────────────────────┘

Lock Compatibility Matrix:
┌─────────────┬─────────────┬─────────────┐
│ Requested   │ Existing    │ Existing    │
│             │ Read Lock   │ Write Lock  │
├─────────────┼─────────────┼─────────────┤
│ Read Lock   │   GRANTED   │   BLOCKED   │
├─────────────┼─────────────┼─────────────┤
│ Write Lock  │   BLOCKED   │   BLOCKED   │
└─────────────┴─────────────┴─────────────┘

Byte-range locking allows fine-grained control:

Byte-Range Locking

File:  ┌────┬────┬────┬────┬────┬────┬────┬────┐
Bytes: │ 0  │ 1  │ 2  │ 3  │ 4  │ 5  │ 6  │ 7  │
       └────┴────┴────┴────┴────┴────┴────┴────┘

Process A locks bytes 0-3 (header):
       ┌────────────────────┬────────────────────┐
       │  LOCKED BY A       │     UNLOCKED       │
       └────────────────────┴────────────────────┘

Process B can lock bytes 4-7 (data) simultaneously:
       ┌────────────────────┬────────────────────┐
       │  LOCKED BY A       │  LOCKED BY B       │
       └────────────────────┴────────────────────┘

This enables concurrent access to different regions!

Memory-Mapped Files (mmap)

Instead of read()/write(), mmap() maps the file directly into your process’s address space:

Traditional I/O vs Memory-Mapped I/O

Traditional I/O (read/write):
┌──────────────┐    ┌──────────────┐    ┌──────────────┐
│  User Buffer │◄───│  Kernel      │◄───│  Disk File   │
│  (malloc'd)  │    │  Buffer Cache│    │              │
└──────────────┘    └──────────────┘    └──────────────┘
     Copy 2              Copy 1

     Two data copies, explicit read/write calls

Memory-Mapped I/O (mmap):
┌──────────────────────────────────────────────────────┐
│                   Virtual Address Space              │
│  ┌──────────────┐                                    │
│  │ mmap'd region│────────────────────────────────┐   │
│  └──────────────┘                                │   │
└──────────────────────────────────────────────────────┘
                                                   │
         Page Table                                │
         ┌────────┐                                │
         │  PTE   │────────────────────────────────┘
         └────────┘
                   \
                    \
                     ▼
         ┌──────────────────┐
         │  Page Cache      │◄───── Disk File
         │  (Kernel manages)│
         └──────────────────┘

         Zero copies! Access file like memory.

B-Tree Structure

B-trees are the standard data structure for databases because they minimize disk I/O:

B-Tree Structure (Order 3 - max 3 children per node)

                           ┌─────────────┐
                           │   50, 85    │ Root
                           └──────┬──────┘
                                  │
             ┌────────────────────┼────────────────────┐
             │                    │                    │
             ▼                    ▼                    ▼
       ┌──────────┐        ┌──────────┐        ┌──────────┐
       │  20, 30  │        │  60, 70  │        │  90, 100 │
       └────┬─────┘        └────┬─────┘        └────┬─────┘
            │                   │                   │
   ┌────────┼────────┐   ┌──────┼──────┐   ┌───────┼───────┐
   │        │        │   │      │      │   │       │       │
   ▼        ▼        ▼   ▼      ▼      ▼   ▼       ▼       ▼
┌─────┐ ┌─────┐ ┌─────┐ ┌────┐ ┌────┐ ┌─────┐ ┌────┐ ┌────┐ ┌─────┐
│10,15│ │22,25│ │35,40│ │ 55 │ │ 65 │ │75,80│ │ 85 │ │ 95 │ │ 105 │
└─────┘ └─────┘ └─────┘ └────┘ └────┘ └─────┘ └────┘ └────┘ └─────┘
  Leaf nodes contain actual data (or pointers to data)

Why B-Trees for Disk?
1. Wide, shallow tree = fewer disk seeks
2. Nodes sized to match disk block/page size
3. Good cache utilization
4. Efficient range queries (leaves are linked)

Write-Ahead Logging (WAL)

WAL ensures data integrity across crashes:

Write-Ahead Logging Protocol

BEFORE modification:
1. Write intended change to LOG file
2. Flush LOG to disk (fsync)
3. Only THEN modify the database file

┌─────────────────────────────────────────────────────────────────┐
│                            TIMELINE                             │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  ① Write log entry   ② fsync(log)   ③ Write to DB   ④ Commit    │
│       │                   │              │              │       │
│       ▼                   ▼              ▼              ▼       │
│  ┌─────────┐        ┌─────────┐     ┌─────────┐    ┌─────────┐  │
│  │LOG:     │        │LOG:     │     │DB:      │    │LOG:     │  │
│  │"set A=5"│  ───►  │on disk  │ ──► │A=5      │ ─► │commit   │  │
│  │(memory) │        │         │     │         │    │         │  │
│  └─────────┘        └─────────┘     └─────────┘    └─────────┘  │
│                                                                 │
│  Crash before ②: log entry not durable, DB unchanged            │
│                  → update simply lost; DB stays consistent      │
│  Crash after ②:  log durable, no commit record → UNDO/discard   │
│  Crash during ③: DB partially written, no commit → UNDO via log │
│  Crash after ④:  commit durable → REDO any unapplied changes    │
└─────────────────────────────────────────────────────────────────┘

Recovery on Startup:
1. Read LOG from beginning
2. For each uncommitted transaction: UNDO
3. For each committed but not applied: REDO
4. Truncate/checkpoint LOG

2.2 Why This Matters

Real-World Impact:

  • SQLite (used in every iPhone, Android device, browser, and billions of embedded systems) uses these exact techniques
  • LMDB (Lightning Memory-Mapped Database) uses mmap for ultra-fast reads
  • Berkeley DB pioneered these patterns and influenced decades of database design
  • PostgreSQL and MySQL use similar locking and WAL mechanisms

Industry Usage:

  • Financial systems require ACID guarantees for transaction processing
  • Every application with local storage (browsers, mobile apps) uses embedded databases
  • Blockchain systems implement similar persistence patterns
  • Configuration management tools need atomic file updates

Career Relevance:

  • Database internals knowledge commands premium salaries ($150K-$250K)
  • Essential for roles at database companies (MongoDB, Cockroach Labs, Redis Labs)
  • Required for storage engine development at cloud providers
  • Critical for debugging production database issues

2.3 Historical Context

1960s: Early File Systems

  • Files were sequential tape-based
  • No concept of concurrent access
  • Single-user systems

1970s: Indexed Sequential Access Method (ISAM)

  • IBM developed for mainframes
  • B-trees introduced for databases
  • Still single-user or batch processing

1980s: Berkeley DB Origins

  • Unix needed a simple, embeddable database to replace the aging dbm/ndbm
  • Margo Seltzer and colleagues at Berkeley wrote the original Berkeley DB
  • fcntl() locking standardized in POSIX

1990s-2000s: SQLite Revolution

  • D. Richard Hipp created SQLite (2000)
  • “The most widely deployed database in the world”
  • Proved you could have ACID without a server

2011: LMDB

  • Howard Chu created for OpenLDAP
  • Pure mmap-based design
  • Influenced many modern embedded databases

Today:

  • RocksDB (Facebook), LevelDB (Google), BadgerDB (Dgraph)
  • All build on these fundamental UNIX primitives
  • Cloud databases still use the same locking and logging techniques

2.4 Common Misconceptions

Misconception 1: “File locking prevents all race conditions”

  • Reality: fcntl() locking is advisory. Processes that don’t check locks can still corrupt data. Your application must consistently acquire locks before all file access.

Misconception 2: “mmap is always faster”

  • Reality: mmap has overhead for small random reads. For sequential access or large files, traditional I/O can be faster. mmap shines for random access patterns typical in databases.

Misconception 3: “fsync guarantees data is on physical disk”

  • Reality: Some drives acknowledge fsync() while data still sits in a volatile on-drive write cache. fsync() guarantees only that the OS has pushed the data to the device; true durability requires disabling the drive's write cache (or hardware with power-loss protection), or accepting that a badly timed failure can lose recent writes.

Misconception 4: “B-trees are complicated”

  • Reality: The basic algorithm is straightforward. What’s hard is handling concurrent modifications, crash recovery, and optimizing for your specific workload.

Misconception 5: “Embedded databases don’t need ACID”

  • Reality: Even a phone’s contacts database needs atomicity. Imagine corruption if a call comes in while saving a contact. ACID is essential everywhere.

3. Project Specification

3.1 What You Will Build

A key-value database library named mydb that:

  1. Provides a simple API: db_open(), db_get(), db_put(), db_delete(), db_close()
  2. Stores data persistently in a single file
  3. Uses B-tree indexing for efficient key lookup
  4. Supports concurrent access from multiple processes with record locking
  5. Survives process crashes without data corruption
  6. Can be used as both a library and a CLI tool

3.2 Functional Requirements

  1. Database Operations
    • Create new databases
    • Open existing databases (with read-only or read-write mode)
    • Store key-value pairs (variable length keys and values)
    • Retrieve values by key
    • Delete key-value pairs
    • Iterate over all keys (prefix scan)
    • Get database statistics
  2. Concurrency
    • Multiple processes can read simultaneously
    • Writes are serialized with record locking
    • No data corruption under concurrent access
    • Deadlock detection or avoidance
  3. Persistence
    • Data survives process restart
    • Crash recovery using WAL or COW
    • Atomic updates (all-or-nothing)
  4. Performance
    • O(log n) key lookup via B-tree
    • Memory-mapped I/O for efficiency
    • Configurable page size

3.3 Non-Functional Requirements

  1. Reliability
    • Zero data loss on clean shutdown
    • Recovery from process crashes
    • Detection of file corruption
  2. Performance
    • 50,000+ reads/second for in-cache data
    • 10,000+ writes/second with synchronous commits
    • Sub-millisecond latency for cached reads
  3. Portability
    • Works on Linux, macOS, and FreeBSD
    • Uses only POSIX-standard APIs
    • No external dependencies (except standard C library)
  4. Resource Limits
    • Maximum database size: Limited by filesystem (2^63 bytes on 64-bit)
    • Maximum key size: 255 bytes
    • Maximum value size: 10MB
    • Efficient memory usage (mmap handles caching)

3.4 Example Usage / Output

# 1. Create and populate database
$ ./mydb create mydata.db
Database created: mydata.db

$ ./mydb put mydata.db user:1 '{"name": "Alice", "email": "alice@example.com"}'
Inserted: user:1

$ ./mydb put mydata.db user:2 '{"name": "Bob", "email": "bob@example.com"}'
Inserted: user:2

$ ./mydb put mydata.db config:timeout '30'
Inserted: config:timeout

# 2. Query data
$ ./mydb get mydata.db user:1
{"name": "Alice", "email": "alice@example.com"}

$ ./mydb get mydata.db user:999
Key not found: user:999

# 3. List all keys with prefix
$ ./mydb scan mydata.db "user:"
user:1
user:2

# 4. Concurrent access test
$ ./mydb_stress_test mydata.db --writers 4 --readers 8 --ops 10000
Running stress test...
Writers: 4, Readers: 8
Total operations: 10000

Results:
  Writes: 4000 (100% success)
  Reads: 6000 (100% success)
  Lock contentions: 127
  Deadlocks: 0
  Throughput: 25,432 ops/sec

# 5. Show database stats
$ ./mydb stats mydata.db
Database: mydata.db
File size: 4.2 MB
Records: 10,000
B-tree depth: 3
Fill factor: 67%
Free space: 1.4 MB

# 6. Delete a key
$ ./mydb delete mydata.db user:1
Deleted: user:1

$ ./mydb get mydata.db user:1
Key not found: user:1

# 7. Use as library
$ cat myapp.c
#include <stdio.h>
#include <stdlib.h>
#include "mydb.h"

int main() {
    DB *db = db_open("mydata.db", DB_CREATE | DB_RDWR);
    if (!db) {
        perror("db_open");
        return 1;
    }

    // Put a value
    if (db_put(db, "key1", 4, "value1", 6) != 0) {
        fprintf(stderr, "db_put failed\n");
    }

    // Get a value
    char *value;
    size_t len;
    if (db_get(db, "key1", 4, &value, &len) == 0) {
        printf("Got: %.*s\n", (int)len, value);
        free(value);
    }

    // Iterate with prefix
    char *key;
    size_t key_len, value_len;
    db_cursor_t *cursor = db_cursor_open(db, "user:", 5);
    while (db_cursor_next(cursor, &key, &key_len, &value, &value_len) == 0) {
        printf("%.*s = %.*s\n", (int)key_len, key, (int)value_len, value);
    }
    db_cursor_close(cursor);

    db_close(db);
    return 0;
}

$ gcc -o myapp myapp.c -lmydb
$ ./myapp
Got: value1

3.5 Real World Outcome

When this project is complete, you will have:

  1. A working database library that can be linked into any C program
  2. A CLI tool for database operations and testing
  3. Stress test utilities proving concurrent correctness
  4. Documentation explaining the file format and recovery procedures

You can verify success by:

  • Running multiple writers simultaneously without corruption
  • Killing processes mid-write and recovering without data loss
  • Comparing performance to SQLite for simple key-value workloads
  • Using strace to verify proper locking and fsync behavior

4. Solution Architecture

4.1 High-Level Design

                    Database Library Architecture

┌─────────────────────────────────────────────────────────────────────┐
│                         APPLICATION LAYER                            │
│  ┌────────────────────────────────────────────────────────────────┐ │
│  │  db_open()  db_put()  db_get()  db_delete()  db_close()        │ │
│  │  db_cursor_open()  db_cursor_next()  db_cursor_close()         │ │
│  └────────────────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────────┘
                                   │
                                   ▼
┌─────────────────────────────────────────────────────────────────────┐
│                         B-TREE LAYER                                 │
│  ┌─────────────────────────────────────────────────────────────┐    │
│  │  btree_insert()  btree_search()  btree_delete()             │    │
│  │  btree_split_node()  btree_merge_nodes()                    │    │
│  └─────────────────────────────────────────────────────────────┘    │
└─────────────────────────────────────────────────────────────────────┘
                                   │
                                   ▼
┌─────────────────────────────────────────────────────────────────────┐
│                         PAGE CACHE LAYER                             │
│  ┌─────────────────────────────────────────────────────────────┐    │
│  │  page_get()  page_dirty()  page_flush()  page_alloc()       │    │
│  │  Uses mmap() for file access                                 │    │
│  └─────────────────────────────────────────────────────────────┘    │
└─────────────────────────────────────────────────────────────────────┘
                                   │
                                   ▼
┌─────────────────────────────────────────────────────────────────────┐
│                         LOCKING LAYER                                │
│  ┌─────────────────────────────────────────────────────────────┐    │
│  │  lock_page_read()  lock_page_write()  unlock_page()         │    │
│  │  Uses fcntl() for byte-range locking                        │    │
│  └─────────────────────────────────────────────────────────────┘    │
└─────────────────────────────────────────────────────────────────────┘
                                   │
                                   ▼
┌─────────────────────────────────────────────────────────────────────┐
│                         WAL LAYER                                    │
│  ┌─────────────────────────────────────────────────────────────┐    │
│  │  wal_begin()  wal_write()  wal_commit()  wal_rollback()     │    │
│  │  wal_recover()  wal_checkpoint()                            │    │
│  └─────────────────────────────────────────────────────────────┘    │
└─────────────────────────────────────────────────────────────────────┘
                                   │
                                   ▼
┌─────────────────────────────────────────────────────────────────────┐
│                         FILE LAYER                                   │
│  ┌─────────────────────────────────────────────────────────────┐    │
│  │  Database File (.db)          WAL File (.db-wal)            │    │
│  │  ┌─────────────────────┐      ┌─────────────────────┐       │    │
│  │  │ Header (4KB)        │      │ Log records         │       │    │
│  │  │ B-tree pages        │      │ Checksums           │       │    │
│  │  │ Data pages          │      │ Transaction markers │       │    │
│  │  │ Free list           │      │                     │       │    │
│  │  └─────────────────────┘      └─────────────────────┘       │    │
│  └─────────────────────────────────────────────────────────────┘    │
└─────────────────────────────────────────────────────────────────────┘

4.2 Key Components

1. API Layer (mydb.h, mydb.c)

  • Public interface for applications
  • Handles parameter validation
  • Manages database handle lifecycle

2. B-Tree Layer (btree.c, btree.h)

  • In-memory and on-disk B-tree operations
  • Node splitting and merging
  • Cursor-based iteration

3. Page Cache Layer (page.c, page.h)

  • Memory-mapped file access
  • Dirty page tracking
  • Buffer management

4. Locking Layer (lock.c, lock.h)

  • fcntl() based record locking
  • Header lock for metadata
  • Page-level locks for B-tree nodes

5. WAL Layer (wal.c, wal.h)

  • Write-ahead logging for durability
  • Transaction management
  • Crash recovery

4.3 Data Structures

/* Database file header - first 4KB of file */
typedef struct {
    char magic[4];            /* "MYDB" */
    uint32_t version;         /* Format version (1) */
    uint32_t page_size;       /* Page size in bytes (4096) */
    uint64_t root_page;       /* Offset of B-tree root page */
    uint64_t freelist_head;   /* Offset of first free page */
    uint64_t record_count;    /* Total number of records */
    uint64_t file_size;       /* Current file size */
    uint32_t btree_order;     /* B-tree order (fanout) */
    uint32_t flags;           /* Database flags */
    uint64_t wal_offset;      /* Current WAL position */
    uint8_t reserved[4032];   /* Pad to 4KB (fields above occupy 64 bytes with alignment) */
} db_header_t;

/* B-tree node - one page (4KB) */
typedef struct {
    uint8_t node_type;        /* BTREE_LEAF or BTREE_INTERNAL */
    uint16_t key_count;       /* Number of keys in node */
    uint64_t next_leaf;       /* Next leaf (for leaf nodes) */
    uint64_t prev_leaf;       /* Previous leaf (for leaf nodes) */
    /*
     * For internal nodes:
     *   keys[0..n-1] and children[0..n]
     * For leaf nodes:
     *   keys[0..n-1] and values[0..n-1]
     *
     * Keys grow from start, data from end
     */
    uint8_t data[4072];       /* Key/value/child storage (24-byte header with padding) */
} btree_node_t;

/* Key-value entry in leaf node */
typedef struct {
    uint16_t key_offset;      /* Offset into data[] for key */
    uint16_t key_len;         /* Key length */
    uint16_t value_offset;    /* Offset into data[] for value */
    uint32_t value_len;       /* Value length */
} kv_entry_t;

/* Database handle */
typedef struct {
    int fd;                   /* File descriptor */
    int wal_fd;               /* WAL file descriptor */
    void *mmap_addr;          /* mmap base address */
    size_t mmap_size;         /* Current mmap size */
    db_header_t *header;      /* Pointer to header in mmap */
    int flags;                /* Open flags */
    struct flock header_lock; /* Current header lock */
    int in_transaction;       /* Transaction active flag */
} DB;

/* Cursor for iteration */
typedef struct {
    DB *db;                   /* Database handle */
    uint64_t current_page;    /* Current leaf page */
    int current_idx;          /* Current entry index */
    char *prefix;             /* Key prefix filter */
    size_t prefix_len;        /* Prefix length */
} db_cursor_t;

/* WAL log record */
typedef struct {
    uint64_t lsn;             /* Log sequence number */
    uint32_t type;            /* WAL_INSERT, WAL_DELETE, etc. */
    uint32_t key_len;
    uint32_t value_len;
    uint32_t checksum;        /* CRC32 of record */
    /* Followed by: key data, value data */
} wal_record_t;

4.4 Algorithm Overview

B-Tree Search:

function btree_search(root, key):
    node = read_page(root)
    while node is INTERNAL:
        i = binary_search(node.keys, key)
        node = read_page(node.children[i])

    i = binary_search(node.keys, key)
    if i < node.key_count and node.keys[i] == key:
        return node.values[i]
    return NOT_FOUND

B-Tree Insert:

function btree_insert(root, key, value):
    # Find leaf and path
    path = []
    node = root
    while node is INTERNAL:
        path.push(node)
        i = find_child_index(node, key)
        node = node.children[i]

    # Insert into leaf
    if leaf_has_space(node):
        insert_into_leaf(node, key, value)
        return

    # Split leaf and propagate up
    (left, right, separator) = split_leaf(node, key, value)

    while path not empty:
        parent = path.pop()
        if parent_has_space(parent):
            insert_separator(parent, separator, right)
            return
        (left, right, separator) = split_internal(parent, separator, right)

    # Create new root
    new_root = create_internal_node()
    new_root.keys[0] = separator
    new_root.children[0] = left
    new_root.children[1] = right

Locking Protocol:

For READ operation:
    1. Acquire shared lock on header
    2. Read root page offset
    3. Traverse tree, acquiring shared lock on each page
    4. Release locks in reverse order

For WRITE operation:
    1. Acquire exclusive lock on header
    2. Begin WAL transaction
    3. Traverse tree with exclusive locks
    4. Modify pages, write to WAL
    5. Commit WAL (fsync)
    6. Apply changes to database
    7. Release locks

5. Implementation Guide

5.1 Development Environment Setup

# Required packages (Ubuntu/Debian)
sudo apt-get install build-essential

# macOS
xcode-select --install

# Create project structure
mkdir -p mydb/src mydb/tests mydb/include
cd mydb

# Create initial files
touch include/mydb.h
touch src/mydb.c src/btree.c src/page.c src/lock.c src/wal.c
touch src/cli.c
touch tests/test_basic.c tests/test_concurrent.c tests/stress_test.c
touch Makefile

# Initial Makefile
cat > Makefile << 'EOF'
CC = gcc
CFLAGS = -Wall -Wextra -O2 -g -Iinclude
LDFLAGS = -lpthread

SRC = src/mydb.c src/btree.c src/page.c src/lock.c src/wal.c
OBJ = $(SRC:.c=.o)

all: libmydb.a mydb stress_test

libmydb.a: $(OBJ)
	ar rcs $@ $^

mydb: src/cli.c libmydb.a
	$(CC) $(CFLAGS) -o $@ $< -L. -lmydb

stress_test: tests/stress_test.c libmydb.a
	$(CC) $(CFLAGS) -o $@ $< -L. -lmydb $(LDFLAGS)

test: tests/test_basic.c tests/test_concurrent.c libmydb.a
	$(CC) $(CFLAGS) -o test_basic tests/test_basic.c -L. -lmydb
	$(CC) $(CFLAGS) -o test_concurrent tests/test_concurrent.c -L. -lmydb $(LDFLAGS)
	./test_basic
	./test_concurrent

clean:
	rm -f $(OBJ) libmydb.a mydb stress_test test_basic test_concurrent *.db *.db-wal

.PHONY: all clean test
EOF

5.2 Project Structure

mydb/
├── include/
│   └── mydb.h              # Public API header
├── src/
│   ├── mydb.c              # Main API implementation
│   ├── btree.c             # B-tree operations
│   ├── btree_internal.h    # B-tree internal definitions
│   ├── page.c              # Page/mmap management
│   ├── page_internal.h     # Page internal definitions
│   ├── lock.c              # fcntl() locking
│   ├── lock_internal.h     # Lock internal definitions
│   ├── wal.c               # Write-ahead logging
│   ├── wal_internal.h      # WAL internal definitions
│   └── cli.c               # Command-line interface
├── tests/
│   ├── test_basic.c        # Unit tests
│   ├── test_concurrent.c   # Concurrency tests
│   └── stress_test.c       # Load testing
├── Makefile
└── README.md

5.3 The Core Question You’re Answering

“How do databases ensure data integrity when multiple processes access the same file concurrently?”

This question is fundamental to understanding database internals. The answer involves:

  1. Advisory file locking - fcntl() to coordinate access between processes
  2. Write-ahead logging - Record changes before applying them
  3. Careful ordering - fsync at correct points to survive crashes
  4. B-tree structure - Efficient indexing with predictable I/O patterns

5.4 Concepts You Must Understand First

Stop and research these before coding:

  1. fcntl() Record Locking
    • F_SETLK, F_SETLKW, F_GETLK operations
    • F_RDLCK (shared), F_WRLCK (exclusive), F_UNLCK
    • Byte-range locking with l_start, l_len
    • Book Reference: “APUE” Ch. 14.3
  2. Memory-Mapped Files
    • mmap(), munmap(), msync() system calls
    • MAP_SHARED vs MAP_PRIVATE semantics
    • When mmap() is faster than read/write
    • SIGBUS handling for truncated files
    • Book Reference: “APUE” Ch. 14.8
  3. B-tree Data Structures
    • Node structure: keys, children/values, metadata
    • Search algorithm (binary search within nodes)
    • Insert with node splitting
    • Delete with node merging or redistribution
    • Book Reference: “Algorithms” by Sedgewick, Ch. 6
  4. Write-Ahead Logging (WAL)
    • Log record format
    • Commit protocol (log fsync before data write)
    • Recovery: redo/undo
    • Checkpointing
    • Book Reference: “Transaction Processing” or database internals docs

5.5 Questions to Guide Your Design

Before implementing, think through these:

  1. File Layout
    • How to structure the database file (header, pages, free list)?
    • What page size to use (4KB matches most filesystems)?
    • How to handle file growth?
  2. Locking Granularity
    • Lock whole file (simple but poor concurrency)?
    • Lock pages (better concurrency, more complex)?
    • Lock individual records (best concurrency, most complex)?
    • What about lock ordering to prevent deadlocks?
  3. Crash Recovery
    • What if process crashes mid-write?
    • How to detect incomplete operations?
    • How to atomically update multiple pages?
    • Redo, undo, or both?
  4. Memory Management
    • How large should the mmap region be?
    • When to extend or remap?
    • How to handle files larger than address space (32-bit)?

5.6 Thinking Exercise

Design the File Layout

Consider this database file structure:

Offset 0                                           File End
┌─────────────┬────────────────┬─────────────────────────────┐
│   Header    │   B-tree Index │        Data Pages           │
│  (4 KB)     │   (variable)   │        (variable)           │
└─────────────┴────────────────┴─────────────────────────────┘

Header (4KB):
┌─────────────────────────────────────────────────────────────┐
│ Magic: "MYDB" (4 bytes)                                     │
│ Version: 1 (4 bytes)                                        │
│ Page size: 4096 (4 bytes)                                   │
│ Root page: offset of B-tree root (8 bytes)                  │
│ Free list head: offset of first free page (8 bytes)         │
│ Record count: total number of records (8 bytes)             │
│ ... padding to 4KB ...                                      │
└─────────────────────────────────────────────────────────────┘

B-tree Node (4KB page):
┌─────────────────────────────────────────────────────────────┐
│ Node type: LEAF or INTERNAL (1 byte)                        │
│ Key count: number of keys (2 bytes)                         │
│ Next leaf: offset (for leaf nodes) (8 bytes)                │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ Key 0 (offset, length) │ Child/Value pointer 0          │ │
│ │ Key 1 (offset, length) │ Child/Value pointer 1          │ │
│ │ ...                    │ ...                            │ │
│ └─────────────────────────────────────────────────────────┘ │
│ [Key data stored at end of page, growing backward]          │
└─────────────────────────────────────────────────────────────┘

Locking strategy:
- Header lock: protects metadata updates
- Page locks: each B-tree page can be locked independently
- Record locks: for individual key-value pairs

Questions to answer:

  1. Why put the header at offset 0?
  2. Why store keys at the end of the page growing backward?
  3. What’s the advantage of linked leaf nodes?
  4. How do you implement the free list?

5.7 Hints in Layers

Hint 1: Simple Record Locking

Start with basic fcntl() locking to protect entire operations:

#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int lock_region(int fd, int type, off_t start, off_t len) {
    struct flock fl;
    memset(&fl, 0, sizeof(fl));  /* zero any extra, system-specific fields */
    fl.l_type = type;        /* F_RDLCK, F_WRLCK, F_UNLCK */
    fl.l_whence = SEEK_SET;
    fl.l_start = start;      /* Start of lock region */
    fl.l_len = len;          /* Length (0 = to EOF) */
                             /* fl.l_pid is only filled in by F_GETLK */

    /* F_SETLKW = blocking wait for lock */
    /* F_SETLK = non-blocking (fails with EAGAIN or EACCES if can't lock) */
    if (fcntl(fd, F_SETLKW, &fl) == -1) {
        if (errno == EDEADLK) {
            /* Deadlock detected! */
            return -1;
        }
        perror("fcntl");
        return -1;
    }
    return 0;
}

int unlock_region(int fd, off_t start, off_t len) {
    struct flock fl;
    memset(&fl, 0, sizeof(fl));
    fl.l_type = F_UNLCK;
    fl.l_whence = SEEK_SET;
    fl.l_start = start;
    fl.l_len = len;
    return fcntl(fd, F_SETLK, &fl);  /* unlocking never blocks */
}

/* Example usage: lock header for exclusive access */
#define HEADER_OFFSET 0
#define HEADER_SIZE 4096

int lock_header_write(DB *db) {
    return lock_region(db->fd, F_WRLCK, HEADER_OFFSET, HEADER_SIZE);
}

int lock_header_read(DB *db) {
    return lock_region(db->fd, F_RDLCK, HEADER_OFFSET, HEADER_SIZE);
}

Hint 2: Memory-Mapped Access

Use mmap() for efficient file access:

#include <sys/mman.h>

void *map_file(int fd, size_t size) {
    /* Extend file if necessary */
    if (ftruncate(fd, size) == -1) {
        perror("ftruncate");
        return MAP_FAILED;
    }

    void *addr = mmap(NULL, size,
                      PROT_READ | PROT_WRITE,
                      MAP_SHARED, fd, 0);
    if (addr == MAP_FAILED) {
        perror("mmap");
        return MAP_FAILED;
    }

    return addr;
}

/* Access header directly through mmap */
db_header_t *get_header(DB *db) {
    return (db_header_t *)db->mmap_addr;
}

/* Get a B-tree node by page offset (cast: arithmetic on void * is a GNU
   extension, not standard C) */
btree_node_t *get_page(DB *db, uint64_t offset) {
    return (btree_node_t *)((char *)db->mmap_addr + offset);
}

/* Flush one page to disk; msync() requires a page-aligned address, which
   4 KB page offsets guarantee */
int sync_page(DB *db, uint64_t offset) {
    return msync((char *)db->mmap_addr + offset, db->header->page_size, MS_SYNC);
}

/* Sync entire database */
int sync_all(DB *db) {
    return msync(db->mmap_addr, db->mmap_size, MS_SYNC);
}

Hint 3: Simple B-tree Insert

Start with a simplified insert algorithm:

/*
 * Simplified B-tree insert (no overflow handling yet)
 *
 * 1. Search for leaf where key belongs
 * 2. If leaf has room: insert key-value, done
 * 3. If leaf full: split leaf
 *    - Create new leaf
 *    - Move half the keys to new leaf
 *    - Insert separator key into parent
 *    - If parent full: recursively split
 */

int btree_insert(DB *db, const char *key, size_t key_len,
                 const char *value, size_t value_len) {

    /* Start at root */
    uint64_t page_offset = db->header->root_page;
    btree_node_t *node = get_page(db, page_offset);

    /* Stack to track path from root to leaf */
    uint64_t path[32];
    int path_len = 0;

    /* Search for leaf */
    while (node->node_type == BTREE_INTERNAL) {
        path[path_len++] = page_offset;

        /* Binary search for child */
        int i = find_child_index(node, key, key_len);
        page_offset = get_child_offset(node, i);
        node = get_page(db, page_offset);
    }

    /* At leaf node */
    if (leaf_has_space(node, key_len, value_len)) {
        /* Simple case: just insert */
        insert_into_leaf(node, key, key_len, value, value_len);
        sync_page(db, page_offset);
        return 0;
    }

    /* Need to split - more complex */
    return split_and_insert(db, path, path_len, page_offset,
                           key, key_len, value, value_len);
}

Hint 4: Crash Recovery with WAL

Implement a simple WAL for durability:

/* WAL Record Types */
#define WAL_BEGIN    1
#define WAL_INSERT   2
#define WAL_DELETE   3
#define WAL_COMMIT   4
#define WAL_ROLLBACK 5

typedef struct {
    uint32_t magic;          /* "WAL1" = 0x57414C31 */
    uint64_t lsn;            /* Log Sequence Number */
    uint32_t type;           /* Record type */
    uint32_t key_len;
    uint32_t value_len;
    uint32_t checksum;       /* CRC32 of entire record */
    /* Followed by: key, value */
} wal_record_t;

int wal_write_insert(DB *db, const char *key, size_t key_len,
                     const char *value, size_t value_len) {
    size_t record_size = sizeof(wal_record_t) + key_len + value_len;
    wal_record_t *record = malloc(record_size);
    if (record == NULL)
        return -1;

    record->magic = 0x57414C31;  /* "WAL1" */
    record->lsn = db->next_lsn++;
    record->type = WAL_INSERT;
    record->key_len = key_len;
    record->value_len = value_len;

    /* Copy key and value after header */
    memcpy((char *)record + sizeof(wal_record_t), key, key_len);
    memcpy((char *)record + sizeof(wal_record_t) + key_len, value, value_len);

    /* Checksum with the checksum field zeroed, so recovery can recompute
       it the same way (checksumming over itself would never verify) */
    record->checksum = 0;
    record->checksum = crc32(record, record_size);

    /* Write to WAL file; a real implementation must also handle short writes */
    if (write(db->wal_fd, record, record_size) != (ssize_t)record_size) {
        free(record);
        return -1;
    }
    free(record);

    return 0;
}

int wal_commit(DB *db) {
    wal_record_t commit;
    memset(&commit, 0, sizeof(commit));  /* zero the padding bytes too */
    commit.magic = 0x57414C31;
    commit.lsn = db->next_lsn++;
    commit.type = WAL_COMMIT;
    commit.checksum = crc32(&commit, sizeof(commit));  /* field is still 0 here */

    if (write(db->wal_fd, &commit, sizeof(commit)) != (ssize_t)sizeof(commit))
        return -1;

    /* CRITICAL: fsync before reporting success to the caller */
    if (fsync(db->wal_fd) == -1)
        return -1;

    return 0;
}

/*
 * Recovery on startup:
 * 1. Read the WAL from the beginning
 * 2. Find the last committed transaction
 * 3. Replay all committed operations
 * 4. Discard uncommitted operations
 * 5. Truncate/checkpoint the WAL
 */
int wal_recover(DB *db) {
    /* Seek to beginning of WAL */
    lseek(db->wal_fd, 0, SEEK_SET);

    /* Track transactions and their states */
    /* ... implementation ... */

    return 0;
}

5.8 The Interview Questions They’ll Ask

  1. “How does fcntl() locking differ from flock()?”
    • fcntl() supports byte-range locking; flock() locks entire files
    • fcntl() locks are per-process; flock() locks are per-open-file-description
    • fcntl() locks are released on any close(); flock() only when all FDs closed
    • fcntl() is POSIX; flock() is BSD-origin, behavior varies
  2. “What is a write-ahead log and why is it used?”
    • WAL records changes before they’re applied to the main database
    • Enables atomicity: if crash during write, WAL can redo or undo
    • Enables durability: committed changes are in WAL even if not yet in DB
    • Allows lazy writes to main DB (better performance)
  3. “How do databases handle the ‘phantom read’ problem?”
    • Phantom reads occur when a query returns different rows due to concurrent inserts
    • Solution: Predicate locks (lock the condition, not just rows)
    • Or: Serializable isolation level (often via MVCC)
    • Or: Gap locks (lock ranges between existing keys)
  4. “What’s the difference between mmap() and read/write for databases?”
    • mmap() eliminates double copying (kernel buffer -> user buffer)
    • mmap() leverages kernel page cache automatically
    • read/write gives more control over I/O timing
    • mmap() can cause unpredictable page faults (latency spikes)
    • mmap() has address space limits on 32-bit systems
  5. “How would you implement range queries efficiently?”
    • B-tree leaf nodes are linked for sequential access
    • Find start of range with normal B-tree search
    • Follow next-leaf pointers until end of range
    • O(log n) to find start, O(k) for k results
  6. “What are the ACID properties and how do you implement each?”
    • Atomicity: WAL with undo capability, or shadow paging
    • Consistency: Application-level constraints, checked on commit
    • Isolation: Locking or MVCC
    • Durability: WAL committed to disk before success returned
  7. “How do you prevent deadlocks in a database?”
    • Lock ordering (always acquire locks in same order)
    • Timeout detection (abort if lock not acquired in time)
    • Wait-for graph analysis (detect cycles)
    • Wound-wait or wait-die protocols

5.9 Books That Will Help

Topic Book Chapter/Section
Record locking “APUE” by Stevens Ch. 14.3
mmap() “APUE” by Stevens Ch. 14.8
Database library “APUE” by Stevens Ch. 20
B-trees “Algorithms” by Sedgewick Ch. 6
B-trees “Introduction to Algorithms” (CLRS) Ch. 18
SQLite internals SQLite documentation Architecture document
Database design “Database Internals” by Petrov Ch. 1-5
LMDB design “LMDB Technical Information” All
Transaction processing “Transaction Processing” by Gray Ch. 7-9

5.10 Implementation Phases

Phase 1: Basic File Operations (Week 1)

  • Implement db_open() and db_close()
  • Create database file with header
  • mmap() the file
  • Implement basic locking (header lock only)
  • Test: Create database, reopen, verify header

Phase 2: Simple Key-Value Store (Week 1-2)

  • Implement single-page leaf node
  • Store all KV pairs in one page (no B-tree yet)
  • Implement db_put() and db_get() for small datasets
  • Add db_delete()
  • Test: Store and retrieve 10-20 key-value pairs

Phase 3: B-Tree Index (Week 2-3)

  • Implement B-tree node structure
  • Implement search (traversal from root)
  • Implement insert with node splitting
  • Implement delete with node merging
  • Test: Store 10,000+ records, verify lookup performance

Phase 4: Concurrency (Week 3)

  • Implement page-level locking
  • Add deadlock avoidance (lock ordering)
  • Multi-process concurrent reads
  • Serialized writes with proper locking
  • Test: Multiple processes accessing same database

Phase 5: Durability (Week 3-4)

  • Implement WAL file format
  • Implement wal_write(), wal_commit()
  • Implement recovery on startup
  • Test: Kill process mid-write, verify recovery

Phase 6: CLI and Polish (Week 4)

  • Implement command-line interface
  • Add statistics command
  • Write stress test utility
  • Documentation and cleanup

5.11 Key Implementation Decisions

Decision 1: Page Size

  • Options: 512B, 1KB, 4KB, 8KB, 16KB
  • Recommendation: 4KB (matches most filesystem blocks)
  • Trade-off: Larger = fewer tree levels, but more wasted space for sparse data

Decision 2: B-Tree Order

  • Higher order = shallower tree = fewer disk seeks
  • But: More complex splitting/merging
  • Recommendation: Order that fills ~70% of a page

Decision 3: WAL vs Copy-on-Write

  • WAL: Standard, well-understood, good write performance
  • COW: Never overwrites data, simpler recovery, slower writes
  • Recommendation: Start with WAL (SQLite model)

Decision 4: Lock Granularity

  • Coarse (whole file): Simple, poor concurrency
  • Page-level: Good balance
  • Record-level: Best concurrency, complex implementation
  • Recommendation: Page-level locking

Decision 5: Variable vs Fixed-Length Values

  • Variable: More flexible, complex page layout
  • Fixed: Simple, wastes space for small values
  • Recommendation: Variable-length with overflow pages for large values

6. Testing Strategy

6.1 Unit Tests

Test individual components in isolation:

/* test_basic.c */

void test_header_format(void) {
    DB *db = db_open("test.db", DB_CREATE | DB_RDWR);
    assert(db != NULL);

    /* Verify header */
    assert(memcmp(db->header->magic, "MYDB", 4) == 0);
    assert(db->header->version == 1);
    assert(db->header->page_size == 4096);

    db_close(db);
    unlink("test.db");
    printf("test_header_format: PASSED\n");
}

void test_put_get(void) {
    DB *db = db_open("test.db", DB_CREATE | DB_RDWR);

    /* Insert */
    assert(db_put(db, "key1", 4, "value1", 6) == 0);

    /* Retrieve */
    char *value;
    size_t len;
    assert(db_get(db, "key1", 4, &value, &len) == 0);
    assert(len == 6);
    assert(memcmp(value, "value1", 6) == 0);
    free(value);

    db_close(db);
    unlink("test.db");
    printf("test_put_get: PASSED\n");
}

void test_overwrite(void) {
    DB *db = db_open("test.db", DB_CREATE | DB_RDWR);

    db_put(db, "key", 3, "old", 3);
    db_put(db, "key", 3, "new", 3);

    char *value;
    size_t len;
    db_get(db, "key", 3, &value, &len);
    assert(memcmp(value, "new", 3) == 0);
    free(value);

    db_close(db);
    unlink("test.db");
    printf("test_overwrite: PASSED\n");
}

void test_delete(void) {
    DB *db = db_open("test.db", DB_CREATE | DB_RDWR);

    db_put(db, "key", 3, "value", 5);
    assert(db_delete(db, "key", 3) == 0);

    char *value;
    size_t len;
    assert(db_get(db, "key", 3, &value, &len) == DB_NOT_FOUND);

    db_close(db);
    unlink("test.db");
    printf("test_delete: PASSED\n");
}

void test_many_records(void) {
    DB *db = db_open("test.db", DB_CREATE | DB_RDWR);

    /* Insert 10,000 records */
    char key[32], value[64];
    for (int i = 0; i < 10000; i++) {
        snprintf(key, sizeof(key), "key%06d", i);
        snprintf(value, sizeof(value), "value%06d", i);
        assert(db_put(db, key, strlen(key), value, strlen(value)) == 0);
    }

    /* Verify all records */
    for (int i = 0; i < 10000; i++) {
        snprintf(key, sizeof(key), "key%06d", i);
        snprintf(value, sizeof(value), "value%06d", i);

        char *got_value;
        size_t got_len;
        assert(db_get(db, key, strlen(key), &got_value, &got_len) == 0);
        assert(got_len == strlen(value));
        assert(memcmp(got_value, value, got_len) == 0);
        free(got_value);
    }

    db_close(db);
    unlink("test.db");
    printf("test_many_records: PASSED\n");
}

6.2 Integration Tests

Test complete workflows:

/* test_concurrent.c */

void test_multiprocess_read(void) {
    /* Create and populate database */
    DB *db = db_open("test.db", DB_CREATE | DB_RDWR);
    for (int i = 0; i < 1000; i++) {
        char key[32], value[64];
        snprintf(key, sizeof(key), "key%d", i);
        snprintf(value, sizeof(value), "value%d", i);
        db_put(db, key, strlen(key), value, strlen(value));
    }
    db_close(db);

    /* Fork multiple reader processes */
    for (int p = 0; p < 4; p++) {
        pid_t pid = fork();
        if (pid == 0) {
            /* Child: read random keys */
            DB *child_db = db_open("test.db", DB_RDONLY);
            for (int i = 0; i < 1000; i++) {
                int key_num = rand() % 1000;
                char key[32];
                snprintf(key, sizeof(key), "key%d", key_num);

                char *value;
                size_t len;
                assert(db_get(child_db, key, strlen(key), &value, &len) == 0);
                free(value);
            }
            db_close(child_db);
            _exit(0);
        }
    }

    /* Wait for all children */
    for (int p = 0; p < 4; p++) {
        int status;
        wait(&status);
        assert(WIFEXITED(status) && WEXITSTATUS(status) == 0);
    }

    unlink("test.db");
    printf("test_multiprocess_read: PASSED\n");
}

void test_concurrent_writes(void) {
    /* Start with empty database */
    DB *db = db_open("test.db", DB_CREATE | DB_RDWR);
    db_close(db);

    /* Fork multiple writer processes */
    for (int p = 0; p < 4; p++) {
        pid_t pid = fork();
        if (pid == 0) {
            DB *child_db = db_open("test.db", DB_RDWR);
            for (int i = 0; i < 100; i++) {
                char key[32], value[64];
                snprintf(key, sizeof(key), "proc%d-key%d", p, i);
                snprintf(value, sizeof(value), "proc%d-value%d", p, i);
                db_put(child_db, key, strlen(key), value, strlen(value));
            }
            db_close(child_db);
            _exit(0);
        }
    }

    /* Wait for all children */
    for (int p = 0; p < 4; p++) {
        int status;
        wait(&status);
        assert(WIFEXITED(status) && WEXITSTATUS(status) == 0);
    }

    /* Verify all records present */
    db = db_open("test.db", DB_RDONLY);
    for (int p = 0; p < 4; p++) {
        for (int i = 0; i < 100; i++) {
            char key[32];
            snprintf(key, sizeof(key), "proc%d-key%d", p, i);

            char *value;
            size_t len;
            assert(db_get(db, key, strlen(key), &value, &len) == 0);
            free(value);
        }
    }
    db_close(db);

    unlink("test.db");
    printf("test_concurrent_writes: PASSED\n");
}

6.3 Edge Cases to Test

  1. Empty database - Can you read from empty database without crash?
  2. Single record - Does B-tree work with just one key?
  3. Maximum key size - What happens at 255-byte keys?
  4. Large values - Can you store 10MB values?
  5. Duplicate keys - How do you handle inserting existing key?
  6. Missing keys - Proper error for non-existent key?
  7. Disk full - Graceful failure when disk space exhausted?
  8. Read-only filesystem - Handle gracefully?
  9. Process crash mid-transaction - Does recovery work?
  10. Power failure simulation - Kill with SIGKILL during write

6.4 Verification Commands

# Trace system calls to verify locking
strace -e fcntl,flock,open,close,mmap,msync,fsync ./mydb put test.db key value

# Verify no memory leaks
valgrind --leak-check=full ./mydb put test.db key value

# Check file descriptor usage
lsof -p $(pgrep mydb)

# Monitor lock contention (Linux)
cat /proc/locks | grep $(pgrep mydb)

# Verify database file format
hexdump -C test.db | head -20

# Test crash recovery
./mydb put test.db key1 value1 &
PID=$!
sleep 0.001
kill -9 $PID
./mydb get test.db key1  # Should work or gracefully fail

7. Common Pitfalls & Debugging

Problem 1: “Deadlock between processes”

  • Why: Circular lock dependency (Process A holds page 1, wants page 2; Process B holds page 2, wants page 1)
  • Fix: Always acquire locks in consistent order (e.g., by page offset ascending)
  • Test: Use non-blocking F_SETLK (or wrap F_SETLKW in an alarm() timeout) so contention is detected instead of hanging forever
/* Detect deadlock potential with non-blocking lock */
#include <errno.h>
#include <fcntl.h>

enum { LOCK_OK, LOCK_BUSY, LOCK_ERROR };

int try_lock(int fd, off_t offset, off_t len) {
    struct flock fl = {.l_type = F_WRLCK, .l_whence = SEEK_SET,
                       .l_start = offset, .l_len = len};
    if (fcntl(fd, F_SETLK, &fl) == -1) {
        if (errno == EAGAIN || errno == EACCES) {
            return LOCK_BUSY;  /* Someone else holds it */
        }
        return LOCK_ERROR;
    }
    return LOCK_OK;
}

Problem 2: “Data corruption after crash”

  • Why: Wrote data pages but not WAL commit, or no fsync() between operations
  • Fix: WAL commit must fsync() before returning; only then apply to database
  • Test: Kill process with SIGKILL during write, verify recovery on restart

Problem 3: “mmap changes not persisted”

  • Why: Forgot to call msync() before munmap() or close()
  • Fix: Call msync(addr, size, MS_SYNC) before any operation that expects durability
  • Test: Write data, kill with SIGKILL, check if data present
/* WAL ordering: the log must be durable BEFORE the data pages it describes */
int commit_changes(DB *db) {
    /* 1. Make the WAL record durable first */
    if (fsync(db->wal_fd) == -1) {
        perror("fsync wal");
        return -1;
    }

    /* 2. Only then flush the modified data pages */
    if (msync(db->mmap_addr, db->mmap_size, MS_SYNC) == -1) {
        perror("msync");
        return -1;
    }

    return 0;
}

Problem 4: “Performance degrades over time”

  • Why: B-tree fragmentation, deleted space not reclaimed, WAL growing forever
  • Fix: Implement compaction (vacuum), WAL checkpointing, page rebalancing
  • Test: Insert 1M records, delete 900K, measure file size and lookup time

Problem 5: “SIGBUS when accessing mmap region”

  • Why: File was truncated after mmap, or accessing beyond current file size
  • Fix: Never truncate a file while it is mapped; check the file size with fstat() before mapping or growing, and keep a SIGBUS handler only as a last-resort safety net (msync() does not protect against this)
static void sigbus_handler(int sig) {
    (void)sig;
    /* File was truncated while we had it mapped */
    fprintf(stderr, "SIGBUS: Database file corrupted or truncated\n");
    _exit(1);
}

void setup_sigbus_handler(void) {
    struct sigaction sa;
    sa.sa_handler = sigbus_handler;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = 0;
    sigaction(SIGBUS, &sa, NULL);
}

Problem 6: “Locks not released after process crash”

  • Why it shouldn’t happen: fcntl() locks are automatically released when the process terminates, so a crashed process cannot leave locks behind
  • Real causes: another live process still holds them, often a forked child that inherited the open descriptor and took its own locks (fcntl() locks themselves are NOT inherited across fork())
  • Related gotcha: closing ANY descriptor for a file releases ALL of the process’s fcntl() locks on it, so a library routine that quietly opens and closes the same file can drop your locks

8. Extensions & Challenges

8.1 Easy Extensions

  1. Key prefix scans - Return all keys starting with a given prefix
  2. Database statistics - Key count, page count, tree depth, fill factor
  3. Read-only mode - Open database without write locks
  4. Backup utility - Copy database while it’s in use (using locks)

8.2 Advanced Challenges

  1. Multi-Version Concurrency Control (MVCC)
    • Readers never block writers
    • Maintain multiple versions of values
    • Garbage collection of old versions
  2. Secondary indexes
    • Index values as well as keys
    • Support queries like “find all users with age > 30”
  3. Transactions
    • Begin/commit/rollback across multiple operations
    • Savepoints for partial rollback
  4. Replication
    • Stream WAL to a replica
    • Support failover to replica
  5. Compression
    • Compress pages before writing
    • Decompress on read
    • Trade CPU for I/O

8.3 Research Topics

  1. LMDB’s copy-on-write design - How does it achieve ACID without WAL?
  2. SQLite’s b-tree variants - Interior pages vs leaf pages
  3. RocksDB’s LSM trees - Why some databases use log-structured merge trees instead of B-trees
  4. PostgreSQL’s MVCC - How do real databases implement snapshot isolation?
  5. io_uring - How could you use io_uring for async database I/O?

9. Real-World Connections

9.1 Production Systems Using This

  1. SQLite - Uses exactly these techniques (B-tree, WAL, page locking)
  2. LMDB - Pure mmap-based, copy-on-write, no WAL needed
  3. Berkeley DB - The original embeddable database, influenced all others
  4. BoltDB (Go) - LMDB-inspired, used by etcd
  5. BadgerDB - LSM-tree variant, used by Dgraph
  6. PostgreSQL - Uses similar B-tree and WAL mechanisms (but server-based)

9.2 How the Pros Do It

SQLite’s approach:

  • B-tree pages are 4KB (configurable)
  • Two-phase commit with journal or WAL
  • Uses fcntl() locking by default
  • Database is a single file (plus optional WAL file)

LMDB’s approach:

  • Full mmap of entire database
  • Copy-on-write: never overwrite existing data
  • Lock-free reads via MVCC
  • Append-only with periodic compaction

PostgreSQL’s approach:

  • Shared buffer pool (not mmap)
  • Sophisticated locking (MVCC + locks)
  • Background writers and checkpoints
  • Separate files for tables/indexes

9.3 Reading the Source

  • SQLite source: https://www.sqlite.org/download.html (exceptionally well-commented)
  • LMDB source: https://git.openldap.org/openldap/openldap (single file, ~11K lines)
  • BoltDB source: https://github.com/boltdb/bolt (clean Go implementation)
  • LevelDB source: https://github.com/google/leveldb (LSM-tree reference)

10. Resources

10.1 Man Pages

Essential reading:

  • man 2 fcntl - File locking
  • man 2 mmap - Memory mapping
  • man 2 msync - Synchronize mapped memory
  • man 2 fsync - Synchronize file
  • man 2 open - Open flags (O_SYNC, O_DSYNC)
  • man 2 ftruncate - Resize file

10.2 Online Resources

10.3 Book Chapters

Topic Book Chapter
File locking “APUE” by Stevens Ch. 14.3
Memory mapping “APUE” by Stevens Ch. 14.8
Database library “APUE” by Stevens Ch. 20
B-trees “Algorithms” by Sedgewick Ch. 6
B-trees detailed “CLRS” Ch. 18
Database design “Database Internals” by Petrov Part I

11. Self-Assessment Checklist

Before considering this project complete, verify:

  • I can explain how fcntl() byte-range locking works
  • I understand the difference between advisory and mandatory locking
  • I can describe when mmap() is better than read/write and vice versa
  • I understand why msync() is necessary and when to use it
  • I can explain B-tree node splitting algorithm
  • I understand write-ahead logging and crash recovery
  • My database survives process crashes without corruption
  • Multiple readers can access simultaneously without blocking
  • Writers are properly serialized
  • I can answer all the interview questions confidently
  • My code compiles without warnings (gcc -Wall -Wextra -Werror)
  • valgrind reports zero memory leaks
  • strace shows proper use of fcntl() and fsync()

12. Submission / Completion Criteria

This project is complete when:

  1. Core Functionality:
    • db_open(), db_close() work correctly
    • db_put(), db_get(), db_delete() work for single and many records
    • B-tree indexing provides O(log n) lookups
    • Database file persists across restarts
  2. Concurrency:
    • Multiple processes can read simultaneously
    • Writes are properly locked
    • No data corruption under concurrent access
    • Stress test with 4+ processes passes
  3. Durability:
    • WAL or COW implemented
    • Kill -9 during write doesn’t corrupt database
    • Recovery on restart works
  4. Quality:
    • All unit tests pass
    • All integration tests pass
    • Zero valgrind errors
    • strace shows expected locking behavior
  5. Documentation:
    • README explains usage
    • File format documented
    • Recovery procedure documented

Connection to Other Projects

This project builds on:

  • Project 1-3 (File I/O): Low-level file operations
  • Project 7-8 (Threading): Concurrency concepts (applicable to multi-threaded variant)
  • Project 11 (epoll): I/O efficiency concepts

This project connects to the final shell (Project 17):

  • Your shell could use this database to store command history
  • Environment variables could be persisted
  • Shell configuration could be database-backed