Project 16: Simple Database Library with Record Locking
Build a persistent key-value database library with record locking for concurrent access, demonstrating how databases use file locking, memory-mapped I/O, and B-tree indexing. Understand exactly how SQLite, Berkeley DB, and LMDB implement ACID properties using UNIX primitives.
Quick Reference
| Attribute | Value |
|---|---|
| Difficulty | Level 5 - Master |
| Time | 3-4 Weeks |
| Language | C (alt: Rust) |
| Prerequisites | Projects 1-3 (File I/O), Projects 7-8 (Threading), Understanding of data structures |
| Key Topics | fcntl() record locking, mmap(), B-tree indexing, write-ahead logging, crash recovery |
| APUE Chapters | 14, 20 |
| Coolness | Level 5 - Pure Magic |
| Portfolio Value | Open Core Infrastructure |
1. Learning Objectives
By completing this project, you will:
- Master fcntl() record locking: Understand the difference between read and write locks, byte-range locking, and how to prevent deadlocks in concurrent database access
- Implement memory-mapped I/O: Use mmap() for high-performance file access, understand MAP_SHARED vs MAP_PRIVATE, and properly synchronize with msync()
- Build a B-tree index: Implement the foundational data structure that powers virtually all databases, understanding node splitting, merging, and disk-optimized layouts
- Design for crash recovery: Implement write-ahead logging (WAL) or copy-on-write semantics to ensure data integrity even during power failures
- Handle concurrent access: Design a locking protocol that allows multiple readers and writers without data corruption or starvation
- Understand ACID properties: Learn how Atomicity, Consistency, Isolation, and Durability are achieved at the system level
- Profile database performance: Measure and optimize read/write throughput, understand the tradeoffs between different storage strategies
2. Theoretical Foundation
2.1 Core Concepts
The Database Problem
Databases solve a fundamental challenge: how to store and retrieve data efficiently while allowing multiple processes to access it concurrently without corruption.
The Database Challenge
Multiple Processes Single Database File
┌─────────────┐ ┌─────────────────────┐
│ Process A │──────┐ │ │
│ (writing) │ │ │ ┌─────────────┐ │
└─────────────┘ │ │ │ Header │ │
├────────►│ ├─────────────┤ │
┌─────────────┐ │ │ │ Index │ │
│ Process B │──────┤ │ │ (B-tree) │ │
│ (reading) │ │ │ ├─────────────┤ │
└─────────────┘ │ │ │ Data │ │
│ │ │ Pages │ │
┌─────────────┐ │ │ └─────────────┘ │
│ Process C │──────┘ │ │
│ (writing) │ └─────────────────────┘
└─────────────┘
Problem: How do we prevent:
1. Process A overwrites Process C's changes
2. Process B reads partially-written data
3. Power failure corrupts the file
4. Readers starve while writers hold locks
fcntl() Record Locking
UNIX provides advisory file locking through fcntl(). Unlike mandatory locking, advisory locking only works when all processes cooperate.
fcntl() Lock Types
Shared Lock (Read Lock) Exclusive Lock (Write Lock)
┌────────────────────────┐ ┌────────────────────────┐
│ Multiple processes can │ │ Only ONE process can │
│ hold read locks on the │ │ hold a write lock on │
│ same byte range │ │ the same byte range │
│ │ │ │
│ Process A: F_RDLCK ✓ │ │ Process A: F_WRLCK ✓ │
│ Process B: F_RDLCK ✓ │ │ Process B: F_WRLCK ✗ │
│ Process C: F_RDLCK ✓ │ │ (blocked or EAGAIN) │
│ │ │ │
│ Process D: F_WRLCK ✗ │ │ Process C: F_RDLCK ✗ │
│ (blocked or EAGAIN) │ │ (blocked or EAGAIN) │
└────────────────────────┘ └────────────────────────┘
Lock Compatibility Matrix:
┌─────────────┬─────────────┬─────────────┐
│ Requested │ Existing │ Existing │
│ │ Read Lock │ Write Lock │
├─────────────┼─────────────┼─────────────┤
│ Read Lock │ GRANTED │ BLOCKED │
├─────────────┼─────────────┼─────────────┤
│ Write Lock │ BLOCKED │ BLOCKED │
└─────────────┴─────────────┴─────────────┘
Byte-range locking allows fine-grained control:
Byte-Range Locking
File: ┌────┬────┬────┬────┬────┬────┬────┬────┐
Bytes: │ 0 │ 1 │ 2 │ 3 │ 4 │ 5 │ 6 │ 7 │
└────┴────┴────┴────┴────┴────┴────┴────┘
Process A locks bytes 0-3 (header):
┌────────────────────┬────────────────────┐
│ LOCKED BY A │ UNLOCKED │
└────────────────────┴────────────────────┘
Process B can lock bytes 4-7 (data) simultaneously:
┌────────────────────┬────────────────────┐
│ LOCKED BY A │ LOCKED BY B │
└────────────────────┴────────────────────┘
This enables concurrent access to different regions!
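Beyond acquiring locks, fcntl() can also tell you who is in your way: F_GETLK fills in the PID of a conflicting lock holder without actually taking the lock. A minimal sketch (the helper name who_holds_lock is ours):

```c
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/* Ask the kernel whether another process holds a conflicting lock on
 * [start, start+len). Returns the holder's PID, 0 if the region is free
 * from this process's point of view, or -1 on error. Note: locks held by
 * the calling process never conflict with its own request, so F_GETLK
 * reports F_UNLCK for them. */
pid_t who_holds_lock(int fd, off_t start, off_t len) {
    struct flock fl;
    memset(&fl, 0, sizeof(fl));
    fl.l_type = F_WRLCK;      /* test against the strongest lock type */
    fl.l_whence = SEEK_SET;
    fl.l_start = start;
    fl.l_len = len;
    if (fcntl(fd, F_GETLK, &fl) == -1)
        return (pid_t)-1;
    return (fl.l_type == F_UNLCK) ? 0 : fl.l_pid;  /* kernel fills l_pid */
}
```

This is handy in debugging tools: a `mydb whoami`-style command could report which process is blocking a stuck writer.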
Memory-Mapped Files (mmap)
Instead of read()/write(), mmap() maps the file directly into your process’s address space:
Traditional I/O vs Memory-Mapped I/O
Traditional I/O (read/write):
┌──────────────┐ ┌──────────────┐ ┌──────────────┐
│ User Buffer │◄───│ Kernel │◄───│ Disk File │
│ (malloc'd) │ │ Buffer Cache│ │ │
└──────────────┘ └──────────────┘ └──────────────┘
Copy 2 Copy 1
Two data copies, explicit read/write calls
Memory-Mapped I/O (mmap):
┌──────────────────────────────────────────────────────┐
│ Virtual Address Space │
│ ┌──────────────┐ │
│ │ mmap'd region│────────────────────────────────┐ │
│ └──────────────┘ │ │
└──────────────────────────────────────────────────────┘
│
Page Table │
┌────────┐ │
│ PTE │────────────────────────────────┘
└────────┘
\
\
▼
┌──────────────────┐
│ Page Cache │◄───── Disk File
│ (Kernel manages)│
└──────────────────┘
Zero copies! Access file like memory.
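As a concrete illustration of the MAP_SHARED path, the sketch below writes to a file purely through a mapped pointer — no read() or write() calls. The function name mmap_store is ours, not part of the project API:

```c
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Map a file MAP_SHARED, store through the pointer, and msync() to push the
 * dirty page to disk. The plain memcpy() below is the only "I/O" we do. */
int mmap_store(const char *path, const char *msg) {
    int fd = open(path, O_RDWR | O_CREAT, 0644);
    if (fd < 0) return -1;
    size_t len = strlen(msg) + 1;
    if (ftruncate(fd, (off_t)len) == -1) { close(fd); return -1; }
    char *p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (p == MAP_FAILED) { close(fd); return -1; }
    memcpy(p, msg, len);              /* a plain store, not write() */
    int rc = msync(p, len, MS_SYNC);  /* force the page to disk */
    munmap(p, len);
    close(fd);
    return rc;
}
```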
B-Tree Structure
B-trees are the standard data structure for databases because they minimize disk I/O:
B-Tree Structure (Order 3 - max 3 children per node)
┌─────────────┐
│   50, 85    │ Root
└──────┬──────┘
│
┌────────────────────┼────────────────────┐
│ │ │
▼ ▼ ▼
┌──────────┐ ┌──────────┐ ┌──────────┐
│ 20, 30 │ │ 60, 70 │ │ 90, 100 │
└────┬─────┘ └────┬─────┘ └────┬─────┘
│ │ │
┌────────┼────────┐ ┌──────┼──────┐ ┌───────┼───────┐
│ │ │ │ │ │ │ │ │
▼ ▼ ▼ ▼ ▼ ▼ ▼ ▼ ▼
┌─────┐ ┌─────┐ ┌─────┐ ┌────┐ ┌────┐ ┌────┐ ┌────┐ ┌────┐ ┌─────┐
│10,15│ │22,25│ │35,40│ │55 │ │65 │ │75,80│ │85 │ │95 │ │105 │
└─────┘ └─────┘ └─────┘ └────┘ └────┘ └────┘ └────┘ └────┘ └─────┘
Leaf nodes contain actual data (or pointers to data)
Why B-Trees for Disk?
1. Wide, shallow tree = fewer disk seeks
2. Nodes sized to match disk block/page size
3. Good cache utilization
4. Efficient range queries (leaves are linked)
Write-Ahead Logging (WAL)
WAL ensures data integrity across crashes:
Write-Ahead Logging Protocol
BEFORE modification:
1. Write intended change to LOG file
2. Flush LOG to disk (fsync)
3. Only THEN modify the database file
┌────────────────────────────────────────────────────────────────┐
│ TIME LINE │
├─────────────────────────────────────────────────────────────────┤
│ │
│ ① Write log entry ② fsync(log) ③ Write to DB ④ Commit│
│ │ │ │ │ │
│ ▼ ▼ ▼ ▼ │
│ ┌─────────┐ ┌─────────┐ ┌─────────┐ ┌───────┐ │
│ │LOG: │ │LOG: │ │DB: │ │LOG: │ │
│ │"set A=5"│ ───► │on disk │ ──► │A=5 │ ──► │commit │ │
│ │(memory) │ │ │ │ │ │ │ │
│ └─────────┘ └─────────┘ └─────────┘ └───────┘ │
│ │
│ CRASH after ①: log entry only in memory → nothing durable, discard │
│ CRASH after ②: log durable, DB unchanged, no commit → discard/UNDO │
│ CRASH after ③: DB partially written, no commit → UNDO on recovery │
│ CRASH after ④: commit record durable → REDO if needed, else done │
└─────────────────────────────────────────────────────────────────┘
Recovery on Startup:
1. Read LOG from beginning
2. For each uncommitted transaction: UNDO
3. For each committed but not applied: REDO
4. Truncate/checkpoint LOG
2.2 Why This Matters
Real-World Impact:
- SQLite (used in every iPhone, Android device, browser, and billions of embedded systems) uses these exact techniques
- LMDB (Lightning Memory-Mapped Database) uses mmap for ultra-fast reads
- Berkeley DB pioneered these patterns and influenced decades of database design
- PostgreSQL and MySQL use similar locking and WAL mechanisms
Industry Usage:
- Financial systems require ACID guarantees for transaction processing
- Every application with local storage (browsers, mobile apps) uses embedded databases
- Blockchain systems implement similar persistence patterns
- Configuration management tools need atomic file updates
Career Relevance:
- Database internals knowledge commands premium salaries ($150K-$250K)
- Essential for roles at database companies (MongoDB, Cockroach Labs, Redis Labs)
- Required for storage engine development at cloud providers
- Critical for debugging production database issues
2.3 Historical Context
1960s: Early File Systems
- Files were sequential tape-based
- No concept of concurrent access
- Single-user systems
1970s: Indexed Sequential Access Method (ISAM)
- IBM developed for mainframes
- B-trees introduced for databases
- Still single-user or batch processing
1980s: Berkeley DB Origins
- Unix needed a simple, embeddable database
- Margo Seltzer and Keith Bostic at Berkeley created Berkeley DB as a replacement for the older dbm/ndbm libraries
- fcntl() locking standardized in POSIX
1990s-2000s: SQLite Revolution
- D. Richard Hipp created SQLite (2000)
- “The most widely deployed database in the world”
- Proved you could have ACID without a server
2011: LMDB
- Howard Chu created for OpenLDAP
- Pure mmap-based design
- Influenced many modern embedded databases
Today:
- RocksDB (Facebook), LevelDB (Google), BadgerDB (Dgraph)
- All build on these fundamental UNIX primitives
- Cloud databases still use the same locking and logging techniques
2.4 Common Misconceptions
Misconception 1: “File locking prevents all race conditions”
- Reality: fcntl() locking is advisory. Processes that don’t check locks can still corrupt data. Your application must consistently acquire locks before all file access.
Misconception 2: “mmap is always faster”
- Reality: mmap has overhead for small random reads. For sequential access or large files, traditional I/O can be faster. mmap shines for random access patterns typical in databases.
Misconception 3: “fsync guarantees data is on physical disk”
- Reality: Some disks lie about fsync (write caches). Use O_DIRECT or O_DSYNC for true durability, or accept that hardware failures can lose recent data.
Misconception 4: “B-trees are complicated”
- Reality: The basic algorithm is straightforward. What’s hard is handling concurrent modifications, crash recovery, and optimizing for your specific workload.
Misconception 5: “Embedded databases don’t need ACID”
- Reality: Even a phone’s contacts database needs atomicity. Imagine corruption if a call comes in while saving a contact. ACID is essential everywhere.
3. Project Specification
3.1 What You Will Build
A key-value database library named mydb that:
- Provides a simple API: db_open(), db_get(), db_put(), db_delete(), db_close()
- Stores data persistently in a single file
- Uses B-tree indexing for efficient key lookup
- Supports concurrent access from multiple processes with record locking
- Survives process crashes without data corruption
- Can be used as both a library and CLI tool
3.2 Functional Requirements
- Database Operations
- Create new databases
- Open existing databases (with read-only or read-write mode)
- Store key-value pairs (variable length keys and values)
- Retrieve values by key
- Delete key-value pairs
- Iterate over all keys (prefix scan)
- Get database statistics
- Concurrency
- Multiple processes can read simultaneously
- Writes are serialized with record locking
- No data corruption under concurrent access
- Deadlock detection or avoidance
- Persistence
- Data survives process restart
- Crash recovery using WAL or COW
- Atomic updates (all-or-nothing)
- Performance
- O(log n) key lookup via B-tree
- Memory-mapped I/O for efficiency
- Configurable page size
3.3 Non-Functional Requirements
- Reliability
- Zero data loss on clean shutdown
- Recovery from process crashes
- Detection of file corruption
- Performance
- 50,000+ reads/second for in-cache data
- 10,000+ writes/second with synchronous commits
- Sub-millisecond latency for cached reads
- Portability
- Works on Linux, macOS, and FreeBSD
- Uses only POSIX-standard APIs
- No external dependencies (except standard C library)
- Resource Limits
- Maximum database size: Limited by filesystem (2^63 bytes on 64-bit)
- Maximum key size: 255 bytes
- Maximum value size: 10MB
- Efficient memory usage (mmap handles caching)
3.4 Example Usage / Output
# 1. Create and populate database
$ ./mydb create mydata.db
Database created: mydata.db
$ ./mydb put mydata.db user:1 '{"name": "Alice", "email": "alice@example.com"}'
Inserted: user:1
$ ./mydb put mydata.db user:2 '{"name": "Bob", "email": "bob@example.com"}'
Inserted: user:2
$ ./mydb put mydata.db config:timeout '30'
Inserted: config:timeout
# 2. Query data
$ ./mydb get mydata.db user:1
{"name": "Alice", "email": "alice@example.com"}
$ ./mydb get mydata.db user:999
Key not found: user:999
# 3. List all keys with prefix
$ ./mydb scan mydata.db "user:"
user:1
user:2
# 4. Concurrent access test
$ ./mydb_stress_test mydata.db --writers 4 --readers 8 --ops 10000
Running stress test...
Writers: 4, Readers: 8
Total operations: 10000
Results:
Writes: 4000 (100% success)
Reads: 6000 (100% success)
Lock contentions: 127
Deadlocks: 0
Throughput: 25,432 ops/sec
# 5. Show database stats
$ ./mydb stats mydata.db
Database: mydata.db
File size: 4.2 MB
Records: 10,000
B-tree depth: 3
Fill factor: 67%
Free space: 1.4 MB
# 6. Delete a key
$ ./mydb delete mydata.db user:1
Deleted: user:1
$ ./mydb get mydata.db user:1
Key not found: user:1
# 7. Use as library
$ cat myapp.c
#include "mydb.h"
#include <stdio.h>
#include <stdlib.h>

int main(void) {
    DB *db = db_open("mydata.db", DB_CREATE | DB_RDWR);
    if (!db) {
        perror("db_open");
        return 1;
    }

    /* Put a value */
    if (db_put(db, "key1", 4, "value1", 6) != 0) {
        fprintf(stderr, "db_put failed\n");
    }

    /* Get a value */
    char *value;
    size_t len;
    if (db_get(db, "key1", 4, &value, &len) == 0) {
        printf("Got: %.*s\n", (int)len, value);
        free(value);
    }

    /* Iterate with prefix */
    char *key;
    size_t key_len, value_len;
    db_cursor_t *cursor = db_cursor_open(db, "user:", 5);
    while (db_cursor_next(cursor, &key, &key_len, &value, &value_len) == 0) {
        printf("%.*s = %.*s\n", (int)key_len, key, (int)value_len, value);
    }
    db_cursor_close(cursor);

    db_close(db);
    return 0;
}
$ gcc -o myapp myapp.c -lmydb
$ ./myapp
Got: value1
3.5 Real World Outcome
When this project is complete, you will have:
- A working database library that can be linked into any C program
- A CLI tool for database operations and testing
- Stress test utilities proving concurrent correctness
- Documentation explaining the file format and recovery procedures
You can verify success by:
- Running multiple writers simultaneously without corruption
- Killing processes mid-write and recovering without data loss
- Comparing performance to SQLite for simple key-value workloads
- Using strace to verify proper locking and fsync behavior
4. Solution Architecture
4.1 High-Level Design
Database Library Architecture
┌─────────────────────────────────────────────────────────────────────┐
│ APPLICATION LAYER │
│ ┌────────────────────────────────────────────────────────────────┐ │
│ │ db_open() db_put() db_get() db_delete() db_close() │ │
│ │ db_cursor_open() db_cursor_next() db_cursor_close() │ │
│ └────────────────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────────┐
│ B-TREE LAYER │
│ ┌─────────────────────────────────────────────────────────────┐ │
│ │ btree_insert() btree_search() btree_delete() │ │
│ │ btree_split_node() btree_merge_nodes() │ │
│ └─────────────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────────┐
│ PAGE CACHE LAYER │
│ ┌─────────────────────────────────────────────────────────────┐ │
│ │ page_get() page_dirty() page_flush() page_alloc() │ │
│ │ Uses mmap() for file access │ │
│ └─────────────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────────┐
│ LOCKING LAYER │
│ ┌─────────────────────────────────────────────────────────────┐ │
│ │ lock_page_read() lock_page_write() unlock_page() │ │
│ │ Uses fcntl() for byte-range locking │ │
│ └─────────────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────────┐
│ WAL LAYER │
│ ┌─────────────────────────────────────────────────────────────┐ │
│ │ wal_begin() wal_write() wal_commit() wal_rollback() │ │
│ │ wal_recover() wal_checkpoint() │ │
│ └─────────────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────────┐
│ FILE LAYER │
│ ┌─────────────────────────────────────────────────────────────┐ │
│ │ Database File (.db) WAL File (.db-wal) │ │
│ │ ┌─────────────────────┐ ┌─────────────────────┐ │ │
│ │ │ Header (4KB) │ │ Log records │ │ │
│ │ │ B-tree pages │ │ Checksums │ │ │
│ │ │ Data pages │ │ Transaction markers │ │ │
│ │ │ Free list │ │ │ │ │
│ │ └─────────────────────┘ └─────────────────────┘ │ │
│ └─────────────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────────┘
4.2 Key Components
1. API Layer (mydb.h, mydb.c)
- Public interface for applications
- Handles parameter validation
- Manages database handle lifecycle
2. B-Tree Layer (btree.c, btree.h)
- In-memory and on-disk B-tree operations
- Node splitting and merging
- Cursor-based iteration
3. Page Cache Layer (page.c, page.h)
- Memory-mapped file access
- Dirty page tracking
- Buffer management
4. Locking Layer (lock.c, lock.h)
- fcntl() based record locking
- Header lock for metadata
- Page-level locks for B-tree nodes
5. WAL Layer (wal.c, wal.h)
- Write-ahead logging for durability
- Transaction management
- Crash recovery
4.3 Data Structures
/* Database file header - first 4KB of file */
typedef struct {
char magic[4]; /* "MYDB" */
uint32_t version; /* Format version (1) */
uint32_t page_size; /* Page size in bytes (4096) */
uint64_t root_page; /* Offset of B-tree root page */
uint64_t freelist_head; /* Offset of first free page */
uint64_t record_count; /* Total number of records */
uint64_t file_size; /* Current file size */
uint32_t btree_order; /* B-tree order (fanout) */
uint32_t flags; /* Database flags */
uint64_t wal_offset; /* Current WAL position */
uint8_t reserved[4032]; /* Pad to 4KB (fields above occupy 64 bytes with natural alignment) */
} db_header_t;
/* B-tree node - one page (4KB) */
typedef struct {
uint8_t node_type; /* BTREE_LEAF or BTREE_INTERNAL */
uint16_t key_count; /* Number of keys in node */
uint64_t next_leaf; /* Next leaf (for leaf nodes) */
uint64_t prev_leaf; /* Previous leaf (for leaf nodes) */
/*
* For internal nodes:
* keys[0..n-1] and children[0..n]
* For leaf nodes:
* keys[0..n-1] and values[0..n-1]
*
* Keys grow from start, data from end
*/
uint8_t data[4072]; /* Key/value/child storage (header fields occupy 24 bytes with alignment) */
} btree_node_t;
/* Key-value entry in leaf node */
typedef struct {
uint16_t key_offset; /* Offset into data[] for key */
uint16_t key_len; /* Key length */
uint16_t value_offset; /* Offset into data[] for value */
uint32_t value_len; /* Value length */
} kv_entry_t;
/* Database handle */
typedef struct {
int fd; /* File descriptor */
int wal_fd; /* WAL file descriptor */
void *mmap_addr; /* mmap base address */
size_t mmap_size; /* Current mmap size */
db_header_t *header; /* Pointer to header in mmap */
int flags; /* Open flags */
struct flock header_lock; /* Current header lock */
uint64_t next_lsn; /* Next WAL log sequence number */
int in_transaction; /* Transaction active flag */
} DB;
/* Cursor for iteration */
typedef struct {
DB *db; /* Database handle */
uint64_t current_page; /* Current leaf page */
int current_idx; /* Current entry index */
char *prefix; /* Key prefix filter */
size_t prefix_len; /* Prefix length */
} db_cursor_t;
/* WAL log record */
typedef struct {
uint64_t lsn; /* Log sequence number */
uint32_t type; /* WAL_INSERT, WAL_DELETE, etc. */
uint32_t key_len;
uint32_t value_len;
uint32_t checksum; /* CRC32 of record */
/* Followed by: key data, value data */
} wal_record_t;
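The freelist_head field above is all the on-disk state page recycling needs: each free page stores the offset of the next free page in its first 8 bytes. A toy in-memory sketch (the store_t type and function names are ours, standing in for the mmap'd file):

```c
#include <stdint.h>
#include <string.h>

#define PAGE_SIZE 4096

/* In-memory stand-in for the database file: base points at the mapping,
 * freelist_head is the offset of the first free page (0 = list empty). */
typedef struct { uint8_t *base; uint64_t freelist_head, file_size; } store_t;

uint64_t page_alloc(store_t *s) {
    if (s->freelist_head != 0) {              /* reuse a freed page */
        uint64_t off = s->freelist_head;
        memcpy(&s->freelist_head, s->base + off, sizeof(uint64_t));
        return off;
    }
    uint64_t off = s->file_size;              /* otherwise grow the file */
    s->file_size += PAGE_SIZE;
    return off;
}

void page_free(store_t *s, uint64_t off) {
    memcpy(s->base + off, &s->freelist_head, sizeof(uint64_t));
    s->freelist_head = off;                   /* push onto the free list */
}
```

The same push/pop logic works against the real mmap'd region; the only extra step there is extending the file with ftruncate() before handing out a brand-new page.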
4.4 Algorithm Overview
B-Tree Search:
function btree_search(root, key):
    node = read_page(root)
    while node is INTERNAL:
        i = find_child_index(node, key)
        node = read_page(node.children[i])
    i = binary_search(node.keys, key)
    if i < node.key_count and node.keys[i] == key:
        return node.values[i]
    return NOT_FOUND
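The find_child_index step inside a node can be sketched as a lower-bound binary search. Fixed-size integer keys keep the sketch short; the real code compares variable-length byte strings via the kv_entry_t offsets:

```c
/* Return the index of the child to descend into: the number of separator
 * keys <= target (convention: keys less than a separator go left, keys
 * greater than or equal go right). */
int find_child_index(const int *keys, int nkeys, int target) {
    int lo = 0, hi = nkeys;
    while (lo < hi) {
        int mid = (lo + hi) / 2;
        if (keys[mid] <= target)
            lo = mid + 1;
        else
            hi = mid;
    }
    return lo;
}
```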
B-Tree Insert:
function btree_insert(root, key, value):
    # Find leaf, remembering the path
    path = []
    node = root
    while node is INTERNAL:
        path.push(node)
        i = find_child_index(node, key)
        node = node.children[i]

    # Insert into leaf
    if leaf_has_space(node):
        insert_into_leaf(node, key, value)
        return

    # Split leaf and propagate separators upward
    (left, right, separator) = split_leaf(node, key, value)
    while path not empty:
        parent = path.pop()
        if parent_has_space(parent):
            insert_separator(parent, separator, right)
            return
        (left, right, separator) = split_internal(parent, separator, right)

    # Root itself split: create a new root and update the header
    new_root = create_internal_node()
    new_root.keys[0] = separator
    new_root.children[0] = left
    new_root.children[1] = right
    header.root_page = new_root
Locking Protocol:
For READ operation:
1. Acquire shared lock on header
2. Read root page offset
3. Traverse tree, acquiring shared lock on each page
4. Release locks in reverse order
For WRITE operation:
1. Acquire exclusive lock on header
2. Begin WAL transaction
3. Traverse tree with exclusive locks
4. Modify pages, write to WAL
5. Commit WAL (fsync)
6. Apply changes to database
7. Release locks
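Steps 1-4 of the read protocol reduce to a pair of helpers bracketing the header region. A sketch (function names are ours):

```c
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

#define HDR_SIZE 4096

static int set_hdr_lock(int fd, short type) {
    struct flock fl;
    memset(&fl, 0, sizeof(fl));
    fl.l_type = type;
    fl.l_whence = SEEK_SET;
    fl.l_start = 0;
    fl.l_len = HDR_SIZE;
    return fcntl(fd, F_SETLKW, &fl);   /* block until granted */
}

/* Readers bracket their tree traversal with these two calls. Many readers
 * can hold the shared lock at once; a writer's F_WRLCK excludes them all. */
int db_read_lock(int fd)   { return set_hdr_lock(fd, F_RDLCK); }
int db_read_unlock(int fd) { return set_hdr_lock(fd, F_UNLCK); }
```

The write path is the same shape with F_WRLCK, plus the WAL steps in between acquire and release.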
5. Implementation Guide
5.1 Development Environment Setup
# Required packages (Ubuntu/Debian)
sudo apt-get install build-essential
# macOS
xcode-select --install
# Create project structure
mkdir -p mydb/src mydb/tests mydb/include
cd mydb
# Create initial files
touch include/mydb.h
touch src/mydb.c src/btree.c src/page.c src/lock.c src/wal.c
touch src/cli.c
touch tests/test_basic.c tests/test_concurrent.c tests/stress_test.c
touch Makefile
# Initial Makefile
cat > Makefile << 'EOF'
# NOTE: recipe lines below must be indented with a TAB character
CC = gcc
CFLAGS = -Wall -Wextra -O2 -g -Iinclude
LDFLAGS = -lpthread

SRC = src/mydb.c src/btree.c src/page.c src/lock.c src/wal.c
OBJ = $(SRC:.c=.o)

all: libmydb.a mydb stress_test

libmydb.a: $(OBJ)
	ar rcs $@ $^

mydb: src/cli.c libmydb.a
	$(CC) $(CFLAGS) -o $@ $< -L. -lmydb

stress_test: tests/stress_test.c libmydb.a
	$(CC) $(CFLAGS) -o $@ $< -L. -lmydb $(LDFLAGS)

test: tests/test_basic.c tests/test_concurrent.c libmydb.a
	$(CC) $(CFLAGS) -o test_basic tests/test_basic.c -L. -lmydb
	$(CC) $(CFLAGS) -o test_concurrent tests/test_concurrent.c -L. -lmydb $(LDFLAGS)
	./test_basic
	./test_concurrent

clean:
	rm -f $(OBJ) libmydb.a mydb stress_test test_basic test_concurrent *.db *.db-wal

.PHONY: all clean test
EOF
5.2 Project Structure
mydb/
├── include/
│ └── mydb.h # Public API header
├── src/
│ ├── mydb.c # Main API implementation
│ ├── btree.c # B-tree operations
│ ├── btree_internal.h # B-tree internal definitions
│ ├── page.c # Page/mmap management
│ ├── page_internal.h # Page internal definitions
│ ├── lock.c # fcntl() locking
│ ├── lock_internal.h # Lock internal definitions
│ ├── wal.c # Write-ahead logging
│ ├── wal_internal.h # WAL internal definitions
│ └── cli.c # Command-line interface
├── tests/
│ ├── test_basic.c # Unit tests
│ ├── test_concurrent.c # Concurrency tests
│ └── stress_test.c # Load testing
├── Makefile
└── README.md
5.3 The Core Question You’re Answering
“How do databases ensure data integrity when multiple processes access the same file concurrently?”
This question is fundamental to understanding database internals. The answer involves:
- Advisory file locking - fcntl() to coordinate access between processes
- Write-ahead logging - Record changes before applying them
- Careful ordering - fsync at correct points to survive crashes
- B-tree structure - Efficient indexing with predictable I/O patterns
5.4 Concepts You Must Understand First
Stop and research these before coding:
- fcntl() Record Locking
- F_SETLK, F_SETLKW, F_GETLK operations
- F_RDLCK (shared), F_WRLCK (exclusive), F_UNLCK
- Byte-range locking with l_start, l_len
- Book Reference: “APUE” Ch. 14.3
- Memory-Mapped Files
- mmap(), munmap(), msync() system calls
- MAP_SHARED vs MAP_PRIVATE semantics
- When mmap() is faster than read/write
- SIGBUS handling for truncated files
- Book Reference: “APUE” Ch. 14.8
- B-tree Data Structures
- Node structure: keys, children/values, metadata
- Search algorithm (binary search within nodes)
- Insert with node splitting
- Delete with node merging or redistribution
- Book Reference: “Algorithms” by Sedgewick, Ch. 6
- Write-Ahead Logging (WAL)
- Log record format
- Commit protocol (log fsync before data write)
- Recovery: redo/undo
- Checkpointing
- Book Reference: “Transaction Processing” or database internals docs
5.5 Questions to Guide Your Design
Before implementing, think through these:
- File Layout
- How to structure the database file (header, pages, free list)?
- What page size to use (4KB matches most filesystems)?
- How to handle file growth?
- Locking Granularity
- Lock whole file (simple but poor concurrency)?
- Lock pages (better concurrency, more complex)?
- Lock individual records (best concurrency, most complex)?
- What about lock ordering to prevent deadlocks?
- Crash Recovery
- What if process crashes mid-write?
- How to detect incomplete operations?
- How to atomically update multiple pages?
- Redo, undo, or both?
- Memory Management
- How large should the mmap region be?
- When to extend or remap?
- How to handle files larger than address space (32-bit)?
5.6 Thinking Exercise
Design the File Layout
Consider this database file structure:
Offset 0 File End
┌─────────────┬────────────────┬─────────────────────────────┐
│ Header │ B-tree Index │ Data Pages │
│ (4 KB) │ (variable) │ (variable) │
└─────────────┴────────────────┴─────────────────────────────┘
Header (4KB):
┌─────────────────────────────────────────────────────────────┐
│ Magic: "MYDB" (4 bytes) │
│ Version: 1 (4 bytes) │
│ Page size: 4096 (4 bytes) │
│ Root page: offset of B-tree root (8 bytes) │
│ Free list head: offset of first free page (8 bytes) │
│ Record count: total number of records (8 bytes) │
│ ... padding to 4KB ... │
└─────────────────────────────────────────────────────────────┘
B-tree Node (4KB page):
┌─────────────────────────────────────────────────────────────┐
│ Node type: LEAF or INTERNAL (1 byte) │
│ Key count: number of keys (2 bytes) │
│ Next leaf: offset (for leaf nodes) (8 bytes) │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ Key 0 (offset, length) │ Child/Value pointer 0 │ │
│ │ Key 1 (offset, length) │ Child/Value pointer 1 │ │
│ │ ... │ ... │ │
│ └─────────────────────────────────────────────────────────┘ │
│ [Key data stored at end of page, growing backward] │
└─────────────────────────────────────────────────────────────┘
Locking strategy:
- Header lock: protects metadata updates
- Page locks: each B-tree page can be locked independently
- Record locks: for individual key-value pairs
Questions to answer:
- Why put the header at offset 0?
- Why store keys at the end of the page growing backward?
- What’s the advantage of linked leaf nodes?
- How do you implement the free list?
5.7 Hints in Layers
Hint 1: Simple Record Locking
Start with basic fcntl() locking to protect entire operations:
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int lock_region(int fd, int type, off_t start, off_t len) {
    struct flock fl;

    memset(&fl, 0, sizeof(fl));
    fl.l_type = type;        /* F_RDLCK, F_WRLCK, F_UNLCK */
    fl.l_whence = SEEK_SET;
    fl.l_start = start;      /* Start of lock region */
    fl.l_len = len;          /* Length (0 = to EOF) */

    /* F_SETLKW = blocking wait for lock */
    /* F_SETLK  = non-blocking (fails with EACCES/EAGAIN if held) */
    if (fcntl(fd, F_SETLKW, &fl) == -1) {
        if (errno == EDEADLK) {
            /* Deadlock detected! */
            return -1;
        }
        perror("fcntl");
        return -1;
    }
    return 0;
}

int unlock_region(int fd, off_t start, off_t len) {
    struct flock fl;

    memset(&fl, 0, sizeof(fl));
    fl.l_type = F_UNLCK;
    fl.l_whence = SEEK_SET;
    fl.l_start = start;
    fl.l_len = len;
    return fcntl(fd, F_SETLK, &fl);
}

/* Example usage: lock header for exclusive access */
#define HEADER_OFFSET 0
#define HEADER_SIZE 4096

int lock_header_write(DB *db) {
    return lock_region(db->fd, F_WRLCK, HEADER_OFFSET, HEADER_SIZE);
}

int lock_header_read(DB *db) {
    return lock_region(db->fd, F_RDLCK, HEADER_OFFSET, HEADER_SIZE);
}
Hint 2: Memory-Mapped Access
Use mmap() for efficient file access:
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

void *map_file(int fd, size_t size) {
    /* Extend file if necessary */
    if (ftruncate(fd, size) == -1) {
        perror("ftruncate");
        return MAP_FAILED;
    }
    void *addr = mmap(NULL, size,
                      PROT_READ | PROT_WRITE,
                      MAP_SHARED, fd, 0);
    if (addr == MAP_FAILED) {
        perror("mmap");
        return MAP_FAILED;
    }
    return addr;
}

/* Access header directly through mmap */
db_header_t *get_header(DB *db) {
    return (db_header_t *)db->mmap_addr;
}

/* Get a B-tree node by page offset (cast first: void* arithmetic is a
 * GNU extension, not standard C) */
btree_node_t *get_page(DB *db, uint64_t offset) {
    return (btree_node_t *)((char *)db->mmap_addr + offset);
}

/* Flush one page's changes to disk */
int sync_page(DB *db, uint64_t offset) {
    return msync((char *)db->mmap_addr + offset, db->header->page_size, MS_SYNC);
}

/* Sync entire database */
int sync_all(DB *db) {
    return msync(db->mmap_addr, db->mmap_size, MS_SYNC);
}
Hint 3: Simple B-tree Insert
Start with a simplified insert algorithm:
/*
* Simplified B-tree insert (no overflow handling yet)
*
* 1. Search for leaf where key belongs
* 2. If leaf has room: insert key-value, done
* 3. If leaf full: split leaf
* - Create new leaf
* - Move half the keys to new leaf
* - Insert separator key into parent
* - If parent full: recursively split
*/
int btree_insert(DB *db, const char *key, size_t key_len,
                 const char *value, size_t value_len) {
    /* Start at root */
    uint64_t page_offset = db->header->root_page;
    btree_node_t *node = get_page(db, page_offset);

    /* Stack to track path from root to leaf */
    uint64_t path[32];
    int path_len = 0;

    /* Search for leaf */
    while (node->node_type == BTREE_INTERNAL) {
        path[path_len++] = page_offset;
        /* Binary search for child */
        int i = find_child_index(node, key, key_len);
        page_offset = get_child_offset(node, i);
        node = get_page(db, page_offset);
    }

    /* At leaf node */
    if (leaf_has_space(node, key_len, value_len)) {
        /* Simple case: just insert */
        insert_into_leaf(node, key, key_len, value, value_len);
        sync_page(db, page_offset);
        return 0;
    }

    /* Need to split - more complex */
    return split_and_insert(db, path, path_len, page_offset,
                            key, key_len, value, value_len);
}
Hint 4: Crash Recovery with WAL
Implement a simple WAL for durability:
/* WAL Record Types */
#define WAL_BEGIN 1
#define WAL_INSERT 2
#define WAL_DELETE 3
#define WAL_COMMIT 4
#define WAL_ROLLBACK 5
typedef struct {
uint32_t magic; /* 0xWAL1 */
uint64_t lsn; /* Log Sequence Number */
uint32_t type; /* Record type */
uint32_t key_len;
uint32_t value_len;
uint32_t checksum; /* CRC32 of entire record */
/* Followed by: key, value */
} wal_record_t;
int wal_write_insert(DB *db, const char *key, size_t key_len,
const char *value, size_t value_len) {
size_t record_size = sizeof(wal_record_t) + key_len + value_len;
wal_record_t *record = malloc(record_size);
record->magic = 0x57414C31; /* "WAL1" */
record->lsn = db->next_lsn++;
record->type = WAL_INSERT;
record->key_len = key_len;
record->value_len = value_len;
/* Copy key and value after header */
memcpy((char *)record + sizeof(wal_record_t), key, key_len);
memcpy((char *)record + sizeof(wal_record_t) + key_len, value, value_len);
/* Calculate checksum */
record->checksum = crc32(record, record_size);
/* Write to WAL file */
write(db->wal_fd, record, record_size);
free(record);
return 0;
}
int wal_commit(DB *db) {
    wal_record_t commit;
    memset(&commit, 0, sizeof(commit));  /* also zeroes struct padding,
                                            so the checksum is deterministic */
    commit.magic = 0x57414C31;
    commit.lsn = db->next_lsn++;
    commit.type = WAL_COMMIT;
    /* checksum field is already zero; compute CRC over the record */
    commit.checksum = crc32(&commit, sizeof(commit));
    if (write(db->wal_fd, &commit, sizeof(commit)) != sizeof(commit))
        return -1;
    /* CRITICAL: fsync before reporting success -- durability hinges here */
    if (fsync(db->wal_fd) == -1)
        return -1;
    return 0;
}
/*
 * Recovery on startup:
 * 1. Read the WAL from the beginning
 * 2. Find the last committed transaction
 * 3. Replay all committed operations
 * 4. Discard uncommitted operations
 * 5. Truncate/checkpoint the WAL
 */
int wal_recover(DB *db) {
/* Seek to beginning of WAL */
lseek(db->wal_fd, 0, SEEK_SET);
/* Track transactions and their states */
/* ... implementation ... */
return 0;
}
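The recovery steps in the comment above boil down to a sequential scan that remembers the last valid COMMIT record. The sketch below operates on an in-memory image of the log for clarity; scan_wal() is a hypothetical helper, and a real wal_recover() would read() from db->wal_fd and also verify each record's checksum:

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define WAL_MAGIC  0x57414C31u  /* "WAL1" */
#define WAL_INSERT 2
#define WAL_COMMIT 4

typedef struct {
    uint32_t magic;
    uint64_t lsn;
    uint32_t type;
    uint32_t key_len;
    uint32_t value_len;
    uint32_t checksum;
    /* followed by key, value */
} wal_record_t;

/* Return the LSN of the last well-formed COMMIT record, or 0 if none.
 * The scan stops at the first torn or corrupt record: everything from
 * there on belongs to a transaction that never committed. */
uint64_t scan_wal(const uint8_t *buf, size_t len)
{
    size_t off = 0;
    uint64_t last_commit = 0;

    while (off + sizeof(wal_record_t) <= len) {
        wal_record_t rec;
        memcpy(&rec, buf + off, sizeof(rec));   /* avoids unaligned reads */

        size_t total = sizeof(rec) + rec.key_len + rec.value_len;
        if (rec.magic != WAL_MAGIC || off + total > len)
            break;                              /* torn tail: stop here */
        /* A full version would also recompute and compare rec.checksum. */
        if (rec.type == WAL_COMMIT)
            last_commit = rec.lsn;
        off += total;
    }
    return last_commit;
}
```

Replay is then a second pass that applies every operation whose LSN is at or below the returned value and discards the rest.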
5.8 The Interview Questions They’ll Ask
- “How does fcntl() locking differ from flock()?”
- fcntl() supports byte-range locking; flock() locks entire files
- fcntl() locks are per-process; flock() locks are per-open-file-description
- fcntl() locks are released on any close(); flock() only when all FDs closed
- fcntl() is POSIX; flock() is BSD-origin, behavior varies
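The byte-range point is easy to demonstrate: lock part of a scratch file in the parent, then probe from a forked child with F_GETLK. (Probing from the same process would show nothing, because a process never conflicts with its own locks.) A self-contained sketch; the path argument is just a throwaway file:

```c
#include <fcntl.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Show that fcntl() locks cover byte ranges, not whole files.
 * Returns 0 on success, -1 on any unexpected result. */
int demo_byte_range(const char *path)
{
    int fd = open(path, O_RDWR | O_CREAT, 0600);
    if (fd < 0) return -1;
    if (ftruncate(fd, 4096) < 0) { close(fd); return -1; }

    /* Parent write-locks bytes 0..99 only. */
    struct flock fl = { .l_type = F_WRLCK, .l_whence = SEEK_SET,
                        .l_start = 0, .l_len = 100 };
    if (fcntl(fd, F_SETLK, &fl) < 0) { close(fd); return -1; }

    pid_t pid = fork();
    if (pid < 0) { close(fd); return -1; }
    if (pid == 0) {
        /* Child probes via its own descriptor; F_GETLK reports the
         * conflicting lock, or F_UNLCK if the range is free. */
        int cfd = open(path, O_RDWR);
        if (cfd < 0) _exit(1);

        struct flock probe = { .l_type = F_WRLCK, .l_whence = SEEK_SET,
                               .l_start = 50, .l_len = 1 };
        fcntl(cfd, F_GETLK, &probe);
        int in_range_locked = (probe.l_type != F_UNLCK);

        probe = (struct flock){ .l_type = F_WRLCK, .l_whence = SEEK_SET,
                                .l_start = 200, .l_len = 1 };
        fcntl(cfd, F_GETLK, &probe);
        int out_of_range_free = (probe.l_type == F_UNLCK);

        _exit((in_range_locked && out_of_range_free) ? 0 : 1);
    }

    int status;
    waitpid(pid, &status, 0);
    close(fd);
    unlink(path);
    return (WIFEXITED(status) && WEXITSTATUS(status) == 0) ? 0 : -1;
}
```

flock() has no per-range equivalent of this probe; it is all-or-nothing on the whole file.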
- “What is a write-ahead log and why is it used?”
- WAL records changes before they’re applied to the main database
- Enables atomicity: if crash during write, WAL can redo or undo
- Enables durability: committed changes are in WAL even if not yet in DB
- Allows lazy writes to main DB (better performance)
- “How do databases handle the ‘phantom read’ problem?”
- Phantom reads occur when a query returns different rows due to concurrent inserts
- Solution: Predicate locks (lock the condition, not just rows)
- Or: Serializable isolation level (often via MVCC)
- Or: Gap locks (lock ranges between existing keys)
- “What’s the difference between mmap() and read/write for databases?”
- mmap() eliminates double copying (kernel buffer -> user buffer)
- mmap() leverages kernel page cache automatically
- read/write gives more control over I/O timing
- mmap() can cause unpredictable page faults (latency spikes)
- mmap() has address space limits on 32-bit systems
- “How would you implement range queries efficiently?”
- B-tree leaf nodes are linked for sequential access
- Find start of range with normal B-tree search
- Follow next-leaf pointers until end of range
- O(log n) to find start, O(k) for k results
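The scan over linked leaves can be sketched with an in-memory stand-in for pages; the struct layout here is illustrative, not a required format (on disk, the next pointer would be a next-leaf page offset):

```c
#include <stddef.h>
#include <string.h>

#define LEAF_MAX 4

/* Simplified leaf: a small sorted key array plus a next-leaf link. */
typedef struct leaf {
    int          nkeys;
    const char  *keys[LEAF_MAX];
    struct leaf *next;          /* on disk: next-leaf page offset */
} leaf_t;

/* Visit every key in [lo, hi]; returns the number visited.
 * 'start' is the leaf a B-tree descent for 'lo' would land on. */
int range_scan(leaf_t *start, const char *lo, const char *hi,
               void (*visit)(const char *key))
{
    int count = 0;
    for (leaf_t *l = start; l != NULL; l = l->next) {
        for (int i = 0; i < l->nkeys; i++) {
            if (strcmp(l->keys[i], lo) < 0) continue;     /* before range */
            if (strcmp(l->keys[i], hi) > 0) return count; /* past it: done */
            if (visit) visit(l->keys[i]);
            count++;
        }
    }
    return count;
}
```

Because leaves are kept sorted and linked, the inner loop never revisits the tree's interior: one descent, then pure sequential I/O.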
- “What are the ACID properties and how do you implement each?”
- Atomicity: WAL with undo capability, or shadow paging
- Consistency: Application-level constraints, checked on commit
- Isolation: Locking or MVCC
- Durability: WAL committed to disk before success returned
- “How do you prevent deadlocks in a database?”
- Lock ordering (always acquire locks in same order)
- Timeout detection (abort if lock not acquired in time)
- Wait-for graph analysis (detect cycles)
- Wound-wait or wait-die protocols
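Lock ordering is the simplest of these to adopt: sort the pages a transaction needs, then acquire them in ascending offset order, so every process agrees on one global order and no wait-for cycle can form. A sketch with a caller-supplied lock_page callback (in a real build this would be a blocking fcntl() F_SETLKW on that page's byte range):

```c
#include <stdlib.h>

static int cmp_off(const void *a, const void *b)
{
    const long long x = *(const long long *)a, y = *(const long long *)b;
    return (x > y) - (x < y);
}

/* Acquire all page locks in ascending offset order. */
int lock_pages_in_order(long long *offsets, int n,
                        int (*lock_page)(long long off))
{
    qsort(offsets, n, sizeof offsets[0], cmp_off);
    for (int i = 0; i < n; i++)
        if (lock_page(offsets[i]) != 0)
            return -1;  /* caller must release the locks already held */
    return 0;
}

/* Self-check: record the order a fake lock function is called in. */
static long long order[8];
static int order_n;
static int fake_lock(long long off) { order[order_n++] = off; return 0; }

int demo_lock_order(void)
{
    long long offs[3] = { 8192, 0, 4096 };
    order_n = 0;
    if (lock_pages_in_order(offs, 3, fake_lock) != 0) return -1;
    return (order[0] == 0 && order[1] == 4096 && order[2] == 8192) ? 0 : -1;
}
```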
5.9 Books That Will Help
| Topic | Book | Chapter/Section |
|---|---|---|
| Record locking | “APUE” by Stevens | Ch. 14.3 |
| mmap() | “APUE” by Stevens | Ch. 14.8 |
| Database internals | “APUE” by Stevens | Ch. 20 |
| B-trees | “Algorithms” by Sedgewick | Ch. 6 |
| B-trees | “Introduction to Algorithms” (CLRS) | Ch. 18 |
| SQLite internals | SQLite documentation | Architecture document |
| Database design | “Database Internals” by Petrov | Ch. 1-5 |
| LMDB design | “LMDB Technical Information” | All |
| Transaction processing | “Transaction Processing” by Gray | Ch. 7-9 |
5.10 Implementation Phases
Phase 1: Basic File Operations (Week 1)
- Implement db_open() and db_close()
- Create database file with header
- mmap() the file
- Implement basic locking (header lock only)
- Test: Create database, reopen, verify header
Phase 2: Simple Key-Value Store (Week 1-2)
- Implement single-page leaf node
- Store all KV pairs in one page (no B-tree yet)
- Implement db_put() and db_get() for small datasets
- Add db_delete()
- Test: Store and retrieve 10-20 key-value pairs
Phase 3: B-Tree Index (Week 2-3)
- Implement B-tree node structure
- Implement search (traversal from root)
- Implement insert with node splitting
- Implement delete with node merging
- Test: Store 10,000+ records, verify lookup performance
Phase 4: Concurrency (Week 3)
- Implement page-level locking
- Add deadlock avoidance (lock ordering)
- Multi-process concurrent reads
- Serialized writes with proper locking
- Test: Multiple processes accessing same database
Phase 5: Durability (Week 3-4)
- Implement WAL file format
- Implement wal_write(), wal_commit()
- Implement recovery on startup
- Test: Kill process mid-write, verify recovery
Phase 6: CLI and Polish (Week 4)
- Implement command-line interface
- Add statistics command
- Write stress test utility
- Documentation and cleanup
5.11 Key Implementation Decisions
Decision 1: Page Size
- Options: 512B, 1KB, 4KB, 8KB, 16KB
- Recommendation: 4KB (matches most filesystem blocks)
- Trade-off: Larger = fewer tree levels, but more wasted space for sparse data
Decision 2: B-Tree Order
- Higher order = shallower tree = fewer disk seeks
- But: More complex splitting/merging
- Recommendation: Order that fills ~70% of a page
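A back-of-envelope helper for that decision; the 16-byte node header and 8-byte child offsets are illustrative assumptions, not a prescribed layout:

```c
#include <stddef.h>

/* How many keys fit in one internal node?  Each entry is a key plus a
 * child page offset, and the node needs one extra child pointer plus a
 * small header.  Header and pointer sizes are assumptions. */
int max_keys_per_page(size_t page_size, size_t max_key_len)
{
    const size_t header = 16, child_ptr = 8;
    size_t per_entry = max_key_len + child_ptr;
    return (int)((page_size - header - child_ptr) / per_entry);
}
```

With 4KB pages and 32-byte keys this gives (4096 − 16 − 8) / 40 = 101 keys per internal node, so a tree only three levels deep already addresses on the order of a million leaf pages.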
Decision 3: WAL vs Copy-on-Write
- WAL: Standard, well-understood, good write performance
- COW: Never overwrites data, simpler recovery, slower writes
- Recommendation: Start with WAL (SQLite model)
Decision 4: Lock Granularity
- Coarse (whole file): Simple, poor concurrency
- Page-level: Good balance
- Record-level: Best concurrency, complex implementation
- Recommendation: Page-level locking
Decision 5: Variable vs Fixed-Length Values
- Variable: More flexible, complex page layout
- Fixed: Simple, wastes space for small values
- Recommendation: Variable-length with overflow pages for large values
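The usual layout for variable-length records is a slotted page: a slot array grows forward from the header, record bytes grow backward from the end of the page, and the gap in between is the free space. A minimal in-memory sketch (a real page would also store each record's length and support deletion and compaction; values too large for one page go to overflow pages):

```c
#include <stdint.h>
#include <string.h>

#define PAGE_SIZE 4096

typedef struct {
    uint16_t nslots;
    uint16_t free_end;   /* offset where record data begins */
    uint16_t slots[];    /* each slot holds a record's start offset */
} page_hdr_t;

void page_init(uint8_t *page)
{
    page_hdr_t *h = (page_hdr_t *)page;
    h->nslots = 0;
    h->free_end = PAGE_SIZE;
}

/* Append a record; returns its slot index, or -1 if it doesn't fit
 * (which is when a real implementation would spill to an overflow page). */
int page_insert(uint8_t *page, const void *rec, uint16_t len)
{
    page_hdr_t *h = (page_hdr_t *)page;
    size_t used_front = sizeof(page_hdr_t) + (h->nslots + 1) * sizeof(uint16_t);
    if (used_front + len > h->free_end)
        return -1;
    h->free_end -= len;
    memcpy(page + h->free_end, rec, len);
    h->slots[h->nslots] = h->free_end;
    return h->nslots++;
}

const void *page_get(const uint8_t *page, int slot)
{
    const page_hdr_t *h = (const page_hdr_t *)page;
    return (slot < 0 || slot >= h->nslots) ? NULL : page + h->slots[slot];
}
```

The indirection through slots is what makes variable-length records cheap to relocate during compaction: record data moves, slot indices stay stable.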
6. Testing Strategy
6.1 Unit Tests
Test individual components in isolation:
/* test_basic.c */
void test_header_format(void) {
    DB *db = db_open("test.db", DB_CREATE | DB_RDWR);
    assert(db != NULL);
    /* Verify header */
    assert(memcmp(db->header->magic, "MYDB", 4) == 0);
    assert(db->header->version == 1);
    assert(db->header->page_size == 4096);
    db_close(db);
    unlink("test.db");
    printf("test_header_format: PASSED\n");
}
void test_put_get(void) {
    DB *db = db_open("test.db", DB_CREATE | DB_RDWR);
    /* Insert */
    assert(db_put(db, "key1", 4, "value1", 6) == 0);
    /* Retrieve */
    char *value;
    size_t len;
    assert(db_get(db, "key1", 4, &value, &len) == 0);
    assert(len == 6);
    assert(memcmp(value, "value1", 6) == 0);
    free(value);
    db_close(db);
    unlink("test.db");
    printf("test_put_get: PASSED\n");
}
void test_overwrite(void) {
    DB *db = db_open("test.db", DB_CREATE | DB_RDWR);
    db_put(db, "key", 3, "old", 3);
    db_put(db, "key", 3, "new", 3);
    char *value;
    size_t len;
    db_get(db, "key", 3, &value, &len);
    assert(memcmp(value, "new", 3) == 0);
    free(value);
    db_close(db);
    unlink("test.db");
    printf("test_overwrite: PASSED\n");
}
void test_delete(void) {
    DB *db = db_open("test.db", DB_CREATE | DB_RDWR);
    db_put(db, "key", 3, "value", 5);
    assert(db_delete(db, "key", 3) == 0);
    char *value;
    size_t len;
    assert(db_get(db, "key", 3, &value, &len) == DB_NOT_FOUND);
    db_close(db);
    unlink("test.db");
    printf("test_delete: PASSED\n");
}
void test_many_records(void) {
    DB *db = db_open("test.db", DB_CREATE | DB_RDWR);
    /* Insert 10,000 records */
    char key[32], value[64];
    for (int i = 0; i < 10000; i++) {
        snprintf(key, sizeof(key), "key%06d", i);
        snprintf(value, sizeof(value), "value%06d", i);
        assert(db_put(db, key, strlen(key), value, strlen(value)) == 0);
    }
    /* Verify all records */
    for (int i = 0; i < 10000; i++) {
        snprintf(key, sizeof(key), "key%06d", i);
        snprintf(value, sizeof(value), "value%06d", i);
        char *got_value;
        size_t got_len;
        assert(db_get(db, key, strlen(key), &got_value, &got_len) == 0);
        assert(got_len == strlen(value));
        assert(memcmp(got_value, value, got_len) == 0);
        free(got_value);
    }
    db_close(db);
    unlink("test.db");
    printf("test_many_records: PASSED\n");
}
6.2 Integration Tests
Test complete workflows:
/* test_concurrent.c */
void test_multiprocess_read(void) {
    /* Create and populate database */
    DB *db = db_open("test.db", DB_CREATE | DB_RDWR);
    for (int i = 0; i < 1000; i++) {
        char key[32], value[64];
        snprintf(key, sizeof(key), "key%d", i);
        snprintf(value, sizeof(value), "value%d", i);
        db_put(db, key, strlen(key), value, strlen(value));
    }
    db_close(db);

    /* Fork multiple reader processes */
    for (int p = 0; p < 4; p++) {
        pid_t pid = fork();
        if (pid == 0) {
            /* Child: read random keys.  Seed per process, otherwise
               every child follows the same rand() sequence. */
            srand((unsigned)getpid());
            DB *child_db = db_open("test.db", DB_RDONLY);
            for (int i = 0; i < 1000; i++) {
                int key_num = rand() % 1000;
                char key[32];
                snprintf(key, sizeof(key), "key%d", key_num);
                char *value;
                size_t len;
                assert(db_get(child_db, key, strlen(key), &value, &len) == 0);
                free(value);
            }
            db_close(child_db);
            _exit(0);
        }
    }

    /* Wait for all children */
    for (int p = 0; p < 4; p++) {
        int status;
        wait(&status);
        assert(WIFEXITED(status) && WEXITSTATUS(status) == 0);
    }
    unlink("test.db");
    printf("test_multiprocess_read: PASSED\n");
}
void test_concurrent_writes(void) {
    /* Start with empty database */
    DB *db = db_open("test.db", DB_CREATE | DB_RDWR);
    db_close(db);

    /* Fork multiple writer processes */
    for (int p = 0; p < 4; p++) {
        pid_t pid = fork();
        if (pid == 0) {
            DB *child_db = db_open("test.db", DB_RDWR);
            for (int i = 0; i < 100; i++) {
                char key[32], value[64];
                snprintf(key, sizeof(key), "proc%d-key%d", p, i);
                snprintf(value, sizeof(value), "proc%d-value%d", p, i);
                db_put(child_db, key, strlen(key), value, strlen(value));
            }
            db_close(child_db);
            _exit(0);
        }
    }

    /* Wait for all children */
    for (int p = 0; p < 4; p++) {
        int status;
        wait(&status);
        assert(WIFEXITED(status) && WEXITSTATUS(status) == 0);
    }

    /* Verify all records present */
    db = db_open("test.db", DB_RDONLY);
    for (int p = 0; p < 4; p++) {
        for (int i = 0; i < 100; i++) {
            char key[32];
            snprintf(key, sizeof(key), "proc%d-key%d", p, i);
            char *value;
            size_t len;
            assert(db_get(db, key, strlen(key), &value, &len) == 0);
            free(value);
        }
    }
    db_close(db);
    unlink("test.db");
    printf("test_concurrent_writes: PASSED\n");
}
6.3 Edge Cases to Test
- Empty database - Can you read from empty database without crash?
- Single record - Does B-tree work with just one key?
- Maximum key size - What happens at 255-byte keys?
- Large values - Can you store 10MB values?
- Duplicate keys - How do you handle inserting existing key?
- Missing keys - Proper error for non-existent key?
- Disk full - Graceful failure when disk space exhausted?
- Read-only filesystem - Handle gracefully?
- Process crash mid-transaction - Does recovery work?
- Power failure simulation - Kill with SIGKILL during write
6.4 Verification Commands
# Trace system calls to verify locking
strace -e fcntl,flock,open,close,mmap,msync,fsync ./mydb put test.db key value
# Verify no memory leaks
valgrind --leak-check=full ./mydb put test.db key value
# Check file descriptor usage
lsof -p $(pgrep mydb)
# Monitor lock contention (Linux)
cat /proc/locks | grep $(pgrep mydb)
# Verify database file format
hexdump -C test.db | head -20
# Test crash recovery
./mydb put test.db key1 value1 &
PID=$!
sleep 0.001
kill -9 $PID
./mydb get test.db key1 # Should work or gracefully fail
7. Common Pitfalls & Debugging
Problem 1: “Deadlock between processes”
- Why: Circular lock dependency (Process A holds page 1, wants page 2; Process B holds page 2, wants page 1)
- Fix: Always acquire locks in consistent order (e.g., by page offset ascending)
- Test: Use fcntl(fd, F_SETLK, ...) (non-blocking) instead of F_SETLKW, retrying with a timeout, so contention is detected instead of blocking forever
/* Detect deadlock potential with non-blocking lock */
int try_lock(int fd, off_t offset, off_t len) {
    struct flock fl = { .l_type = F_WRLCK, .l_whence = SEEK_SET,
                        .l_start = offset, .l_len = len };
    if (fcntl(fd, F_SETLK, &fl) == -1) {
        if (errno == EAGAIN || errno == EACCES) {
            return LOCK_BUSY;  /* Someone else holds it */
        }
        return LOCK_ERROR;
    }
    return LOCK_OK;
}
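On top of a non-blocking attempt like this you can layer a timeout policy: retry with a short backoff and give up after a bounded number of tries, treating LOCK_BUSY at the deadline as a probable deadlock and rolling the transaction back. A sketch with a pluggable attempt function (try_fn stands in for a real per-page try-lock):

```c
#include <time.h>

#define LOCK_OK    0
#define LOCK_BUSY  1
#define LOCK_ERROR 2

/* Retry a non-blocking lock attempt with a short sleep between tries;
 * after max_tries the caller should abort (roll back) the transaction
 * rather than wait forever on a possible deadlock. */
int lock_with_timeout(int (*try_fn)(void *ctx), void *ctx, int max_tries)
{
    for (int i = 0; i < max_tries; i++) {
        int rc = try_fn(ctx);
        if (rc != LOCK_BUSY)
            return rc;                        /* LOCK_OK or LOCK_ERROR */
        struct timespec ts = { 0, 1000000 };  /* 1 ms backoff */
        nanosleep(&ts, NULL);
    }
    return LOCK_BUSY;                         /* deadline hit: treat as deadlock */
}

/* Self-check: a fake attempt that is busy twice, then succeeds. */
static int calls;
static int fake_try(void *ctx) { (void)ctx; return (++calls < 3) ? LOCK_BUSY : LOCK_OK; }

int demo_timeout(void)
{
    calls = 0;
    if (lock_with_timeout(fake_try, NULL, 10) != LOCK_OK) return -1;
    calls = 0;
    if (lock_with_timeout(fake_try, NULL, 2) != LOCK_BUSY) return -1;
    return 0;
}
```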
Problem 2: “Data corruption after crash”
- Why: Wrote data pages but not WAL commit, or no fsync() between operations
- Fix: WAL commit must fsync() before returning; only then apply to database
- Test: Kill process with SIGKILL during write, verify recovery on restart
Problem 3: “mmap changes not persisted”
- Why: Forgot to call msync() before munmap() or close()
- Fix: Call msync(addr, size, MS_SYNC) before any operation that expects durability
- Test: Write data, kill with SIGKILL, check whether the data is present after restart
/* Always sync before critical operations */
int commit_changes(DB *db) {
    /* Sync modified pages first */
    if (msync(db->mmap_addr, db->mmap_size, MS_SYNC) == -1) {
        perror("msync");
        return -1;
    }
    /* Then sync WAL */
    if (fsync(db->wal_fd) == -1) {
        perror("fsync wal");
        return -1;
    }
    return 0;
}
Problem 4: “Performance degrades over time”
- Why: B-tree fragmentation, deleted space not reclaimed, WAL growing forever
- Fix: Implement compaction (vacuum), WAL checkpointing, page rebalancing
- Test: Insert 1M records, delete 900K, measure file size and lookup time
Problem 5: “SIGBUS when accessing mmap region”
- Why: File was truncated after mmap, or accessing beyond current file size
- Fix: Handle SIGBUS gracefully, use msync to ensure file size matches mmap expectations
static void sigbus_handler(int sig) {
    (void)sig;
    /* File was truncated while we had it mapped.  Only async-signal-safe
       functions may be called here, so use write(), not fprintf(). */
    const char msg[] = "SIGBUS: Database file corrupted or truncated\n";
    (void)!write(STDERR_FILENO, msg, sizeof(msg) - 1);
    _exit(1);
}
void setup_sigbus_handler(void) {
    struct sigaction sa;
    sa.sa_handler = sigbus_handler;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = 0;
    sigaction(SIGBUS, &sa, NULL);
}
Problem 6: “Locks not released after process crash”
- Why this should not happen: the kernel releases a process’s fcntl() locks automatically when the process terminates or closes any descriptor referring to the file
- Real causes if you see it: a surviving helper process still holds an open descriptor, or a fork() misunderstanding — the child does NOT inherit the parent’s fcntl() record locks, so code that assumes it does may appear to leak or silently drop locks
8. Extensions & Challenges
8.1 Easy Extensions
- Key prefix scans - Return all keys starting with a given prefix
- Database statistics - Key count, page count, tree depth, fill factor
- Read-only mode - Open database without write locks
- Backup utility - Copy database while it’s in use (using locks)
8.2 Advanced Challenges
- Multi-Version Concurrency Control (MVCC)
- Readers never block writers
- Maintain multiple versions of values
- Garbage collection of old versions
- Secondary indexes
- Index values as well as keys
- Support queries like “find all users with age > 30”
- Transactions
- Begin/commit/rollback across multiple operations
- Savepoints for partial rollback
- Replication
- Stream WAL to a replica
- Support failover to replica
- Compression
- Compress pages before writing
- Decompress on read
- Trade CPU for I/O
8.3 Research Topics
- LMDB’s copy-on-write design - How does it achieve ACID without WAL?
- SQLite’s b-tree variants - Interior pages vs leaf pages
- RocksDB’s LSM trees - Why some databases use log-structured merge trees instead of B-trees
- PostgreSQL’s MVCC - How do real databases implement snapshot isolation?
- io_uring - How could you use io_uring for async database I/O?
9. Real-World Connections
9.1 Production Systems Using This
- SQLite - Uses exactly these techniques (B-tree, WAL, page locking)
- LMDB - Pure mmap-based, copy-on-write, no WAL needed
- Berkeley DB - The original embeddable database, influenced all others
- BoltDB (Go) - LMDB-inspired, used by etcd
- BadgerDB - LSM-tree variant, used by Dgraph
- PostgreSQL - Uses similar B-tree and WAL mechanisms (but server-based)
9.2 How the Pros Do It
SQLite’s approach:
- B-tree pages are 4KB (configurable)
- Two-phase commit with journal or WAL
- Uses fcntl() locking by default
- Database is a single file (plus optional WAL file)
LMDB’s approach:
- Full mmap of entire database
- Copy-on-write: never overwrite existing data
- Lock-free reads via MVCC
- Append-only with periodic compaction
PostgreSQL’s approach:
- Shared buffer pool (not mmap)
- Sophisticated locking (MVCC + locks)
- Background writers and checkpoints
- Separate files for tables/indexes
9.3 Reading the Source
- SQLite source: https://www.sqlite.org/download.html (exceptionally well-commented)
- LMDB source: https://git.openldap.org/openldap/openldap (single file, ~11K lines)
- BoltDB source: https://github.com/boltdb/bolt (clean Go implementation)
- LevelDB source: https://github.com/google/leveldb (LSM-tree reference)
10. Resources
10.1 Man Pages
Essential reading:
- man 2 fcntl - File locking
- man 2 mmap - Memory mapping
- man 2 msync - Synchronize mapped memory
- man 2 fsync - Synchronize file
- man 2 open - Open flags (O_SYNC, O_DSYNC)
- man 2 ftruncate - Resize file
10.2 Online Resources
- SQLite Architecture
- SQLite File Format
- LMDB Technical Information
- PostgreSQL Internals
- How Databases Work
10.3 Book Chapters
| Topic | Book | Chapter |
|---|---|---|
| File locking | “APUE” by Stevens | Ch. 14.3 |
| Memory mapping | “APUE” by Stevens | Ch. 14.8 |
| Database library | “APUE” by Stevens | Ch. 20 |
| B-trees | “Algorithms” by Sedgewick | Ch. 6 |
| B-trees detailed | “CLRS” | Ch. 18 |
| Database design | “Database Internals” by Petrov | Part I |
11. Self-Assessment Checklist
Before considering this project complete, verify:
- I can explain how fcntl() byte-range locking works
- I understand the difference between advisory and mandatory locking
- I can describe when mmap() is better than read/write and vice versa
- I understand why msync() is necessary and when to use it
- I can explain B-tree node splitting algorithm
- I understand write-ahead logging and crash recovery
- My database survives process crashes without corruption
- Multiple readers can access simultaneously without blocking
- Writers are properly serialized
- I can answer all the interview questions confidently
- My code compiles without warnings (gcc -Wall -Wextra -Werror)
- valgrind reports zero memory leaks
- strace shows proper use of fcntl() and fsync()
12. Submission / Completion Criteria
This project is complete when:
- Core Functionality:
- db_open(), db_close() work correctly
- db_put(), db_get(), db_delete() work for single and many records
- B-tree indexing provides O(log n) lookups
- Database file persists across restarts
- Concurrency:
- Multiple processes can read simultaneously
- Writes are properly locked
- No data corruption under concurrent access
- Stress test with 4+ processes passes
- Durability:
- WAL or COW implemented
- Kill -9 during write doesn’t corrupt database
- Recovery on restart works
- Quality:
- All unit tests pass
- All integration tests pass
- Zero valgrind errors
- strace shows expected locking behavior
- Documentation:
- README explains usage
- File format documented
- Recovery procedure documented
Connection to Other Projects
This project builds on:
- Project 1-3 (File I/O): Low-level file operations
- Project 7-8 (Threading): Concurrency concepts (applicable to multi-threaded variant)
- Project 11 (epoll): I/O efficiency concepts
This project connects to the final shell (Project 17):
- Your shell could use this database to store command history
- Environment variables could be persisted
- Shell configuration could be database-backed