Capstone Project: Production-Grade Copy-on-Write Filesystem
Capstone Project: Production-Grade Copy-on-Write Filesystem
Project Overview
| Attribute | Details |
|---|---|
| Difficulty | Master |
| Time Estimate | 1-2 months |
| Language | C |
| Knowledge Area | Modern Filesystem Design |
| Main Book | โDesigning Data-Intensive Applicationsโ by Martin Kleppmann |
| Prerequisites | Projects 1-4 completed |
What Youโll Build
A complete, persistent, FUSE-based filesystem with copy-on-write semantics (like ZFS/Btrfs), snapshots, and data integrity verification (checksums). This is a โrealโ filesystem you could actually use.
Real World Output:
$ ./mkfs.cowfs -s 1G mydisk.img
$ ./cowfs mydisk.img /mnt/cow
# Normal operations work
$ cp -r ~/projects /mnt/cow/
$ ls /mnt/cow/projects
myapp/ webapp/ scripts/
# Create a snapshot before dangerous operation
$ ./cowctl /mnt/cow snapshot create before-refactor
Snapshot 'before-refactor' created (root block: 4521)
# Make changes
$ rm -rf /mnt/cow/projects/myapp/src/*
$ echo "oops" > /mnt/cow/projects/myapp/src/main.c
# Restore from snapshot instantly
$ ./cowctl /mnt/cow snapshot restore before-refactor
Restored to snapshot 'before-refactor'
$ ls /mnt/cow/projects/myapp/src/
main.c utils.c config.h # Everything back!
# Verify data integrity
$ ./cowctl /mnt/cow scrub
Scrubbing filesystem...
Checked 15231 blocks
Verified 15231 checksums
0 errors found
Learning Objectives
By completing this project, you will:
- Implement copy-on-write semantics โ never overwrite, always append
- Build block-pointer trees similar to Btrfs B-trees
- Create instant snapshots by saving root pointers
- Add data integrity verification with checksums
- Handle garbage collection of orphaned blocks
- Understand modern filesystem design principles
The Core Question Youโre Answering
โHow can we build a filesystem that never corrupts data, supports instant snapshots, and can detect and repair bit rot?โ
Traditional filesystems like ext4 overwrite data in place. This creates problems:
- Crash during write = corrupted data
- Silent bit rot goes undetected
- Snapshots require complex block tracking
Copy-on-write (COW) filesystems take a radically different approach: never overwrite anything. Every write creates new blocks, then atomically updates a root pointer. This elegant model solves journaling, snapshots, and checksums in one design.
Deep Theoretical Foundation
1. The Copy-on-Write Principle
Traditional write (overwrite in place):
Before: [Block 100: "Hello"]
Write: [Block 100: "World"] โ Overwrites existing data
After: [Block 100: "World"]
If crash during write: [Block 100: "Horld"] โ Corruption!
Copy-on-write:
Before: Root โ [Block 100: "Hello"]
Write: Allocate Block 200
[Block 200: "World"]
Update Root โ Block 200 (atomic)
After: Root โ [Block 200: "World"]
[Block 100: "Hello"] โ Still intact, now garbage
If crash during write:
- Block 200 partially written but unreferenced
- Root still โ Block 100 ("Hello")
- Filesystem is consistent!
Key insight: The atomic pointer update is the commit point. Everything before it is preparation; everything after is cleanup.
2. Tree Structure for Data Organization
COW filesystems use trees where nodes can be shared:
Initial state:
Root v1
/ \
Node A Node B
/ \ / \
Data1 Data2 Data3 Data4
After modifying Data2:
Root v2 Root v1 (snapshot)
/ \ / \
Node A' Node B Node A
/ \ / \ / \
Data1 Data2' Data3 Data4 Data1 Data2
(old)
Notice:
- Root v2 shares Node B with Root v1
- Node A' shares Data1 with Node A
- Only modified path is copied
Space efficiency: Snapshots share unmodified blocks. A 1GB filesystem with 100 snapshots doesnโt need 100GBโonly changed blocks are duplicated.
3. Block Reference Counting
Since blocks can be shared between snapshots, we need reference counting:
struct cow_block_info {
uint32_t checksum; // CRC32 of block data
uint32_t refcount; // Number of references
uint64_t generation; // When block was created
};
Lifecycle:
- Block allocated: refcount = 1
- Snapshot created: refcount++ for all reachable blocks
- Block overwritten (COW): old block refcountโ, new block refcount = 1
- Snapshot deleted: refcountโ for all blocks in snapshot
- refcount reaches 0: block is garbage, can be reused
4. Checksum-Based Integrity
Every block has a checksum stored separately (or in parent pointer):
โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
โ Parent Node โ
โโโโโโโโโโโโโโโโฌโโโโโโโโโโโโโโโฌโโโโโโโโโโโโโโโฌโโโโโโโโโโโโโโโโโโโโค
โ Child 0 ptr โ Child 0 csum โ Child 1 ptr โ Child 1 csum โ
โ Block 100 โ 0xA1B2C3D4 โ Block 200 โ 0x12345678 โ
โโโโโโโโโโโโโโโโดโโโโโโโโโโโโโโโดโโโโโโโโโโโโโโโดโโโโโโโโโโโโโโโโโโโโ
On read:
1. Read block 100
2. Calculate CRC32 of block 100
3. Compare with stored 0xA1B2C3D4
4. If mismatch: bit rot detected!
Self-healing: With RAID or mirrors, the good copy can repair the bad one.
5. Snapshot Implementation
A snapshot is just a saved root pointer:
struct snapshot {
char name[64];
uint64_t root_block; // Root of tree at snapshot time
uint64_t timestamp;
uint32_t id;
};
Creating a snapshot:
- Get current root block number
- Increment refcount of root (and transitively, all blocks)
- Save (name, root_block, timestamp) in snapshot table
Restoring a snapshot:
- Find snapshot by name
- Set active root = snapshotโs root_block
- (Optional) Increment refcounts to prevent GC
Deleting a snapshot:
- Decrement refcount of snapshotโs root block
- Garbage collect any blocks that reach refcount = 0
6. Garbage Collection
Blocks with refcount = 0 are garbage. Collection strategies:
Strategy 1: Immediate Free
void dec_refcount(uint32_t block) {
block_info[block].refcount--;
if (block_info[block].refcount == 0) {
// Recursively decrement children
for (child in get_children(block)) {
dec_refcount(child);
}
// Add to free list
free_block(block);
}
}
Strategy 2: Background GC
- Mark orphaned blocks during operation
- Periodically scan and collect
- Less I/O spike, more complexity
Strategy 3: Generation-Based
- Track when blocks were allocated
- Collect old generations first
- Similar to generational garbage collection
7. On-Disk Layout
โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
โ Block 0: Superblock โ
โ - Magic number, version โ
โ - Total blocks, block size โ
โ - Current root block pointer โ
โ - Free block pointer (head of free list) โ
โ - Snapshot table location โ
โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
โ Block 1-N: Block Info Table โ
โ - One entry per data block โ
โ - Checksum + refcount + generation โ
โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
โ Block N+1-M: Snapshot Table โ
โ - Array of snapshot entries โ
โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
โ Block M+1-End: Data Blocks โ
โ - Tree nodes (inode trees, data extent trees) โ
โ - File data โ
โ - Free blocks (linked list through free space) โ
โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
8. Tree Node Structure
Btrfs-style B-tree nodes:
#define NODE_SIZE 4096
#define MAX_ITEMS 170 // Depends on item size
struct tree_header {
uint32_t checksum; // Checksum of entire node
uint64_t block_num; // This block's number (for verification)
uint64_t generation; // Transaction ID when written
uint8_t level; // 0 = leaf, >0 = internal
uint16_t num_items; // Number of items in node
};
struct tree_item {
struct {
uint64_t objectid; // Inode number or similar
uint8_t type; // Item type (INODE, EXTENT, DIR_ENTRY, etc.)
uint64_t offset; // Type-specific offset
} key;
uint32_t data_offset; // Offset to data in node (from end)
uint32_t data_size; // Size of data
};
// Leaf node layout:
// [header][item0][item1]...[itemN]...[free space]...[dataN]...[data1][data0]
// Items grow forward, data grows backward
// Internal node items point to child blocks instead of containing data
struct internal_item {
struct key key;
uint64_t child_block;
uint32_t child_checksum;
};
Project Specification
Functional Requirements
- Copy-on-write operations
- All writes create new blocks
- Atomic root pointer updates
- Never overwrite data in place
- Snapshots
- Create named snapshots instantly
- List existing snapshots
- Restore filesystem to snapshot state
- Delete snapshots (with GC)
- Data integrity
- CRC32 checksum per block
- Verify checksums on read
- Scrub command to check all blocks
- Report corruption
- Standard filesystem operations
- All operations from Projects 2-3
- Files, directories, permissions
- Hard links (with refcounting)
- Garbage collection
- Track block reference counts
- Reclaim unreferenced blocks
- Handle out-of-space gracefully
- Utilities
mkfs.cowfs: Format new filesystemcowctl: Snapshot management, scrubfsck.cowfs: Consistency check
Non-Functional Requirements
- Snapshot creation in O(1) time
- Support thousands of snapshots
- Checksums add < 10% overhead
- Graceful degradation when nearly full
Solution Architecture
Component Design
โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
โ cowctl Utility โ
โ (snapshot create/list/restore/delete, scrub) โ
โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโฌโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
โ
โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโผโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
โ โผ โ
โ โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ โ
โ โ FUSE Callbacks โ โ
โ โ (getattr, read, write, create, unlink, etc.) โ โ
โ โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโฌโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ โ
โ โ โ
โ โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโผโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ โ
โ โ โผ โ โ
โ โ โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ โ โ
โ โ โ Transaction Manager โ โ โ
โ โ โ (begin, commit, atomic root update) โ โ โ
โ โ โโโโโโโโโโโโโโโโโโโโโโโโโโฌโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ โ โ
โ โ โ โ โ
โ โ โโโโโโโโโโโโโโโโโโโโโโโโโโผโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ โ โ
โ โ โ โผ โ โ โ
โ โ โ โโโโโโโโโโโโโโโ โโโโโโโโโโโโโโโ โโโโโโโโโโโโโโโ โ โ โ
โ โ โ โ B-Tree โ โ Snapshot โ โ Checksum โ โ โ โ
โ โ โ โ Manager โ โ Manager โ โ Engine โ โ โ โ
โ โ โ โโโโโโโโฌโโโโโโโ โโโโโโโโฌโโโโโโโ โโโโโโโโฌโโโโโโโ โ โ โ
โ โ โ โ โ โ โ โ โ
โ โ โ โโโโโโโโโโโโโโโโโผโโโโโโโโโโโโโโโโ โ โ โ
โ โ โ โผ โ โ โ
โ โ โ โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ โ โ
โ โ โ โ Block Allocator โโ โ โ
โ โ โ โ (COW alloc, refcount, free list, GC) โโ โ โ
โ โ โ โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ โ โ
โ โ โ โ โ โ
โ โ โ COW Core Layer โ โ โ
โ โ โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ โ โ
โ โ โ โ
โ โ Filesystem Core โ โ
โ โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ โ
โ โ โ
โ โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโผโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ โ
โ โ โผ โ โ
โ โ Block I/O Layer โ โ
โ โ (read, write, cache, checksum verify) โ โ
โ โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ โ
โ โ โ
โ cowfs (FUSE daemon) โ
โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโผโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
โ
โผ
โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
โ Disk Image File โ
โ [Superblock][Block Info Table][Snapshots][Data Blocks...] โ
โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
Key Data Structures
// Superblock
struct cowfs_superblock {
uint32_t magic; // COWFS_MAGIC
uint32_t version;
uint32_t block_size;
uint64_t total_blocks;
uint64_t free_blocks;
uint64_t root_block; // Current filesystem root
uint64_t info_table_start; // Block info table
uint64_t snapshot_table_start;
uint64_t data_start; // First data block
uint64_t generation; // Current transaction ID
uint32_t checksum;
};
// Block info (one per data block)
struct block_info {
uint32_t checksum; // CRC32 of block content
uint32_t refcount;
uint64_t generation; // When written
};
// Snapshot entry
struct snapshot_entry {
char name[64];
uint64_t root_block;
uint64_t generation;
uint64_t timestamp;
uint32_t flags;
uint32_t reserved;
};
// Tree node header
struct node_header {
uint32_t checksum; // Of entire node
uint64_t block_num;
uint64_t generation;
uint8_t level; // 0 = leaf
uint16_t num_items;
uint16_t free_space;
uint32_t flags;
};
// Key for tree items
struct cowfs_key {
uint64_t objectid; // Inode number typically
uint8_t type; // INODE_ITEM, DIR_ITEM, EXTENT_ITEM, etc.
uint64_t offset; // Type-dependent
};
// Inode item (stored in leaf)
struct inode_item {
uint64_t size;
uint64_t blocks;
uint32_t mode;
uint32_t uid;
uint32_t gid;
uint32_t nlink;
uint64_t atime;
uint64_t mtime;
uint64_t ctime;
};
// Extent item (file data location)
struct extent_item {
uint64_t block_start; // First block of extent
uint64_t num_blocks;
uint32_t checksum; // Optional inline checksum
};
// Directory item
struct dir_item {
uint64_t child_inode;
uint8_t file_type;
uint16_t name_len;
char name[]; // Variable length
};
Key Functions
// Transaction management
struct transaction* txn_begin(struct cowfs_state *fs);
int txn_commit(struct transaction *txn); // Atomic root update
void txn_abort(struct transaction *txn);
// COW block allocation
uint64_t cow_alloc_block(struct cowfs_state *fs);
void cow_free_block(struct cowfs_state *fs, uint64_t block);
int cow_get_block(struct cowfs_state *fs, uint64_t block, void *buf);
int cow_put_block(struct cowfs_state *fs, uint64_t block, void *buf);
// Reference counting
void block_inc_ref(struct cowfs_state *fs, uint64_t block);
void block_dec_ref(struct cowfs_state *fs, uint64_t block);
uint32_t block_get_ref(struct cowfs_state *fs, uint64_t block);
// Checksum
uint32_t checksum_block(void *data, size_t size);
int verify_block(struct cowfs_state *fs, uint64_t block, void *buf);
// B-tree operations
int btree_lookup(struct cowfs_state *fs, uint64_t root,
struct cowfs_key *key, void *value);
int btree_insert(struct transaction *txn, uint64_t *root,
struct cowfs_key *key, void *value, size_t size);
int btree_delete(struct transaction *txn, uint64_t *root,
struct cowfs_key *key);
// Snapshot operations
int snapshot_create(struct cowfs_state *fs, const char *name);
int snapshot_restore(struct cowfs_state *fs, const char *name);
int snapshot_delete(struct cowfs_state *fs, const char *name);
int snapshot_list(struct cowfs_state *fs, struct snapshot_entry *out, int max);
// Integrity
int scrub_filesystem(struct cowfs_state *fs, void (*callback)(uint64_t, int));
Implementation Guide
Phase 1: COW Block Layer (Week 1)
Goals:
- Implement block allocator with free list
- Add checksum computation and verification
- Create atomic block write function
// Allocate a new block (never reuses immediately)
uint64_t cow_alloc_block(struct cowfs_state *fs) {
pthread_mutex_lock(&fs->alloc_lock);
// Get from free list
uint64_t block = fs->free_head;
if (block == 0) {
pthread_mutex_unlock(&fs->alloc_lock);
return 0; // Out of space
}
// Update free list head
uint64_t next_free;
pread(fs->fd, &next_free, 8, block * fs->block_size);
fs->free_head = next_free;
fs->sb.free_blocks--;
// Initialize block info
fs->block_info[block].refcount = 1;
fs->block_info[block].generation = fs->sb.generation;
fs->block_info[block].checksum = 0; // Updated on write
pthread_mutex_unlock(&fs->alloc_lock);
return block;
}
// Write block with checksum
int cow_put_block(struct cowfs_state *fs, uint64_t block, void *buf) {
// Calculate checksum
uint32_t csum = crc32(0, buf, fs->block_size);
fs->block_info[block].checksum = csum;
// Write to disk
ssize_t ret = pwrite(fs->fd, buf, fs->block_size,
block * fs->block_size);
return (ret == fs->block_size) ? 0 : -EIO;
}
// Read block with checksum verification
int cow_get_block(struct cowfs_state *fs, uint64_t block, void *buf) {
ssize_t ret = pread(fs->fd, buf, fs->block_size,
block * fs->block_size);
if (ret != fs->block_size) return -EIO;
// Verify checksum
uint32_t expected = fs->block_info[block].checksum;
uint32_t actual = crc32(0, buf, fs->block_size);
if (expected != actual) {
fprintf(stderr, "Checksum mismatch: block %lu, expected %08x, got %08x\n",
block, expected, actual);
return -EIO;
}
return 0;
}
Phase 2: B-Tree Implementation (Week 2)
Goals:
- Implement leaf and internal node structure
- Add search, insert, delete operations
- Handle node splits with COW
// Search for key in tree
int btree_lookup(struct cowfs_state *fs, uint64_t root,
struct cowfs_key *key, void *value) {
if (root == 0) return -ENOENT;
uint8_t buf[fs->block_size];
cow_get_block(fs, root, buf);
struct node_header *hdr = (struct node_header *)buf;
// Binary search for key
int idx = node_search(buf, key);
if (hdr->level == 0) {
// Leaf node: check for exact match
struct tree_item *item = get_item(buf, idx);
if (key_compare(&item->key, key) == 0) {
memcpy(value, get_item_data(buf, item), item->data_size);
return 0;
}
return -ENOENT;
} else {
// Internal node: recurse to child
struct internal_item *item = get_internal_item(buf, idx);
return btree_lookup(fs, item->child_block, key, value);
}
}
// Insert with COW
int btree_insert(struct transaction *txn, uint64_t *root,
struct cowfs_key *key, void *value, size_t size) {
// COW path: copy modified nodes
uint64_t path[MAX_TREE_DEPTH];
int depth = 0;
// Find insertion point, recording path
uint64_t node = *root;
while (node != 0) {
path[depth++] = node;
uint8_t buf[fs->block_size];
cow_get_block(fs, node, buf);
struct node_header *hdr = (struct node_header *)buf;
if (hdr->level == 0) break;
int idx = node_search(buf, key);
struct internal_item *item = get_internal_item(buf, idx);
node = item->child_block;
}
// COW: allocate new nodes for entire path
for (int i = depth - 1; i >= 0; i--) {
uint64_t old_block = path[i];
uint64_t new_block = cow_alloc_block(txn->fs);
uint8_t buf[fs->block_size];
cow_get_block(txn->fs, old_block, buf);
// Copy and modify
if (i == depth - 1) {
// Leaf: insert item
insert_leaf_item(buf, key, value, size);
} else {
// Internal: update child pointer
update_child_pointer(buf, path[i+1], txn->new_blocks[i+1]);
}
cow_put_block(txn->fs, new_block, buf);
txn->new_blocks[i] = new_block;
// Decrement old block refcount (might trigger GC)
block_dec_ref(txn->fs, old_block);
}
// New root is top of new path
*root = txn->new_blocks[0];
return 0;
}
Phase 3: Transaction Manager (Week 3)
Goals:
- Implement transaction begin/commit/abort
- Ensure atomic root pointer update
- Handle rollback on failure
struct transaction* txn_begin(struct cowfs_state *fs) {
struct transaction *txn = calloc(1, sizeof(*txn));
txn->fs = fs;
txn->generation = ++fs->sb.generation;
txn->old_root = fs->sb.root_block;
txn->new_root = fs->sb.root_block;
return txn;
}
int txn_commit(struct transaction *txn) {
struct cowfs_state *fs = txn->fs;
// All modifications are done, new tree built
// Atomically update superblock with new root
pthread_mutex_lock(&fs->sb_lock);
// Write superblock with new root
fs->sb.root_block = txn->new_root;
fs->sb.generation = txn->generation;
fs->sb.checksum = checksum_superblock(&fs->sb);
// Write superblock (atomic because it fits in one sector)
pwrite(fs->fd, &fs->sb, sizeof(fs->sb), 0);
fsync(fs->fd);
pthread_mutex_unlock(&fs->sb_lock);
// Transaction is now committed
free(txn);
return 0;
}
void txn_abort(struct transaction *txn) {
// Decrement refcounts on any newly allocated blocks
for (int i = 0; i < txn->num_new_blocks; i++) {
block_dec_ref(txn->fs, txn->new_blocks[i]);
}
free(txn);
}
Phase 4: Snapshot System (Week 4)
Goals:
- Create snapshots by saving root pointer
- Implement snapshot restore
- Handle reference counting across snapshots
int snapshot_create(struct cowfs_state *fs, const char *name) {
pthread_mutex_lock(&fs->snapshot_lock);
// Find free snapshot slot
int slot = find_free_snapshot_slot(fs);
if (slot < 0) {
pthread_mutex_unlock(&fs->snapshot_lock);
return -ENOSPC;
}
// Increment refcount of current root (and all reachable blocks)
inc_tree_refcount(fs, fs->sb.root_block);
// Create snapshot entry
struct snapshot_entry snap = {
.root_block = fs->sb.root_block,
.generation = fs->sb.generation,
.timestamp = time(NULL),
};
strncpy(snap.name, name, sizeof(snap.name) - 1);
// Write snapshot entry
write_snapshot_entry(fs, slot, &snap);
pthread_mutex_unlock(&fs->snapshot_lock);
return 0;
}
// Recursively increment refcounts
void inc_tree_refcount(struct cowfs_state *fs, uint64_t block) {
if (block == 0) return;
block_inc_ref(fs, block);
uint8_t buf[fs->block_size];
cow_get_block(fs, block, buf);
struct node_header *hdr = (struct node_header *)buf;
if (hdr->level == 0) {
// Leaf: increment data block refs
for (int i = 0; i < hdr->num_items; i++) {
struct tree_item *item = get_item(buf, i);
if (item->key.type == EXTENT_ITEM) {
struct extent_item *ext = get_item_data(buf, item);
for (uint64_t b = ext->block_start;
b < ext->block_start + ext->num_blocks; b++) {
block_inc_ref(fs, b);
}
}
}
} else {
// Internal: recurse to children
for (int i = 0; i < hdr->num_items; i++) {
struct internal_item *child = get_internal_item(buf, i);
inc_tree_refcount(fs, child->child_block);
}
}
}
int snapshot_restore(struct cowfs_state *fs, const char *name) {
pthread_mutex_lock(&fs->snapshot_lock);
// Find snapshot
struct snapshot_entry snap;
if (find_snapshot_by_name(fs, name, &snap) < 0) {
pthread_mutex_unlock(&fs->snapshot_lock);
return -ENOENT;
}
// Decrement current tree refcounts
dec_tree_refcount(fs, fs->sb.root_block);
// Set root to snapshot
fs->sb.root_block = snap.root_block;
// Increment snapshot tree refcounts (now active)
inc_tree_refcount(fs, snap.root_block);
// Update and sync superblock
write_superblock(fs);
pthread_mutex_unlock(&fs->snapshot_lock);
return 0;
}
Phase 5: Garbage Collection (Week 5)
Goals:
- Track block reference counts accurately
- Implement lazy garbage collection
- Handle circular references (shouldnโt exist with trees)
void block_dec_ref(struct cowfs_state *fs, uint64_t block) {
if (block == 0) return;
pthread_mutex_lock(&fs->ref_lock);
fs->block_info[block].refcount--;
if (fs->block_info[block].refcount == 0) {
// Block is now garbage
// For tree nodes, decrement children first
uint8_t buf[fs->block_size];
if (cow_get_block(fs, block, buf) == 0) {
struct node_header *hdr = (struct node_header *)buf;
if (hdr->level == 0) {
// Leaf: free data extents
for (int i = 0; i < hdr->num_items; i++) {
struct tree_item *item = get_item(buf, i);
if (item->key.type == EXTENT_ITEM) {
struct extent_item *ext = get_item_data(buf, item);
for (uint64_t b = ext->block_start;
b < ext->block_start + ext->num_blocks; b++) {
block_dec_ref_unlocked(fs, b);
}
}
}
} else {
// Internal: decrement child refs
for (int i = 0; i < hdr->num_items; i++) {
struct internal_item *child = get_internal_item(buf, i);
block_dec_ref_unlocked(fs, child->child_block);
}
}
}
// Add block to free list
uint64_t old_head = fs->free_head;
pwrite(fs->fd, &old_head, 8, block * fs->block_size);
fs->free_head = block;
fs->sb.free_blocks++;
}
pthread_mutex_unlock(&fs->ref_lock);
}
Phase 6: Scrub and Integrity (Week 6)
Goals:
- Verify all block checksums
- Report and optionally repair errors
- Background scrubbing support
int scrub_filesystem(struct cowfs_state *fs,
void (*callback)(uint64_t block, int status)) {
int errors = 0;
printf("Starting filesystem scrub...\n");
// Walk all reachable blocks from root
errors += scrub_tree(fs, fs->sb.root_block, callback);
// Also walk all snapshot trees
for (int i = 0; i < MAX_SNAPSHOTS; i++) {
struct snapshot_entry snap;
if (read_snapshot_entry(fs, i, &snap) == 0 && snap.root_block != 0) {
errors += scrub_tree(fs, snap.root_block, callback);
}
}
printf("Scrub complete: %d errors found\n", errors);
return errors;
}
int scrub_tree(struct cowfs_state *fs, uint64_t block,
void (*callback)(uint64_t, int)) {
if (block == 0) return 0;
int errors = 0;
// Verify this block
uint8_t buf[fs->block_size];
if (pread(fs->fd, buf, fs->block_size, block * fs->block_size)
!= fs->block_size) {
if (callback) callback(block, SCRUB_READ_ERROR);
return 1;
}
uint32_t expected = fs->block_info[block].checksum;
uint32_t actual = crc32(0, buf, fs->block_size);
if (expected != actual) {
if (callback) callback(block, SCRUB_CHECKSUM_MISMATCH);
errors++;
} else {
if (callback) callback(block, SCRUB_OK);
}
// Recurse for tree nodes
struct node_header *hdr = (struct node_header *)buf;
if (hdr->level > 0) {
for (int i = 0; i < hdr->num_items; i++) {
struct internal_item *child = get_internal_item(buf, i);
errors += scrub_tree(fs, child->child_block, callback);
}
}
return errors;
}
Phase 7: FUSE Integration (Weeks 7-8)
Goals:
- Connect B-tree operations to FUSE callbacks
- Implement all standard file operations
- Test with real workloads
static int cowfs_getattr(const char *path, struct stat *stbuf,
struct fuse_file_info *fi) {
struct cowfs_state *fs = get_fs_state();
// Lookup inode for path
uint64_t ino;
if (path_to_inode(fs, path, &ino) < 0) {
return -ENOENT;
}
// Get inode item from tree
struct cowfs_key key = {
.objectid = ino,
.type = INODE_ITEM,
.offset = 0,
};
struct inode_item item;
if (btree_lookup(fs, fs->sb.root_block, &key, &item) < 0) {
return -ENOENT;
}
// Fill stat buffer
memset(stbuf, 0, sizeof(*stbuf));
stbuf->st_ino = ino;
stbuf->st_mode = item.mode;
stbuf->st_nlink = item.nlink;
stbuf->st_uid = item.uid;
stbuf->st_gid = item.gid;
stbuf->st_size = item.size;
stbuf->st_blocks = item.blocks;
stbuf->st_atime = item.atime;
stbuf->st_mtime = item.mtime;
stbuf->st_ctime = item.ctime;
return 0;
}
static int cowfs_write(const char *path, const char *buf, size_t size,
off_t offset, struct fuse_file_info *fi) {
struct cowfs_state *fs = get_fs_state();
// Start transaction
struct transaction *txn = txn_begin(fs);
// Get inode
uint64_t ino;
path_to_inode(fs, path, &ino);
// Allocate new blocks for data (COW)
uint64_t new_blocks[size / fs->block_size + 1];
size_t num_blocks = 0;
for (off_t pos = offset; pos < offset + size; pos += fs->block_size) {
uint64_t blk = cow_alloc_block(fs);
if (blk == 0) {
txn_abort(txn);
return -ENOSPC;
}
// Write data to new block
size_t write_size = MIN(fs->block_size, offset + size - pos);
cow_put_block(fs, blk, buf + (pos - offset));
new_blocks[num_blocks++] = blk;
}
// Update extent tree with new blocks
update_extents(txn, ino, offset, new_blocks, num_blocks);
// Update inode
update_inode_size(txn, ino, MAX(old_size, offset + size));
// Commit
txn_commit(txn);
return size;
}
Testing Strategy
Unit Tests
// Test COW block allocation
void test_cow_alloc() {
struct cowfs_state fs;
init_test_fs(&fs, 100); // 100 blocks
uint64_t b1 = cow_alloc_block(&fs);
assert(b1 != 0);
assert(fs.block_info[b1].refcount == 1);
uint64_t b2 = cow_alloc_block(&fs);
assert(b2 != 0);
assert(b1 != b2);
}
// Test reference counting
void test_refcount() {
struct cowfs_state fs;
init_test_fs(&fs, 100);
uint64_t b = cow_alloc_block(&fs);
assert(block_get_ref(&fs, b) == 1);
block_inc_ref(&fs, b);
assert(block_get_ref(&fs, b) == 2);
block_dec_ref(&fs, b);
assert(block_get_ref(&fs, b) == 1);
block_dec_ref(&fs, b);
// Block should be freed
assert(block_is_free(&fs, b));
}
// Test snapshot
void test_snapshot() {
struct cowfs_state fs;
init_test_fs(&fs, 1000);
// Create some data
create_file(&fs, "/test.txt", "original");
// Take snapshot
snapshot_create(&fs, "snap1");
// Modify file
write_file(&fs, "/test.txt", "modified");
// Verify current is modified
assert(strcmp(read_file(&fs, "/test.txt"), "modified") == 0);
// Restore snapshot
snapshot_restore(&fs, "snap1");
// Verify restored to original
assert(strcmp(read_file(&fs, "/test.txt"), "original") == 0);
}
Crash Tests
#!/bin/bash
# Test crash recovery
for i in $(seq 1 50); do
echo "Crash test iteration $i"
# Start filesystem
./cowfs disk.img /mnt/cow &
PID=$!
sleep 0.5
# Do some work
(
for j in $(seq 1 10); do
dd if=/dev/urandom of=/mnt/cow/file_$j bs=1K count=10 2>/dev/null
done
) &
# Crash after random time
sleep $(echo "scale=2; $RANDOM/32768" | bc)
kill -9 $PID
# Remount
./cowfs disk.img /mnt/cow
# Verify scrub passes
./cowctl /mnt/cow scrub
if [ $? -ne 0 ]; then
echo "FAIL: Corruption after crash"
exit 1
fi
fusermount -u /mnt/cow
done
echo "PASS: All crash tests passed"
Common Pitfalls
Pitfall 1: Reference Count Leaks
Problem: Blocks never freed because refcount never reaches 0.
Solution: Careful tracking of all operations that inc/dec refs:
- Allocation: refcount = 1
- Snapshot: inc tree refcounts
- COW write: inc new block, dec old block
- Snapshot delete: dec tree refcounts
Pitfall 2: Transaction Abort Cleanup
Problem: Aborted transaction leaves orphaned blocks.
Solution: Track all allocations in transaction, free on abort.
Pitfall 3: Checksum of Modified Block
Problem: Checksum computed before final modification.
Solution: Always compute checksum immediately before write.
Interview Questions
- โWhat is copy-on-write, and why is it useful for filesystems?โ
- Expected: Never overwrite, atomic updates, enables snapshots
- โHow are snapshots instant in COW filesystems?โ
- Expected: Just save root pointer, increment refcounts
- โWhat is the space overhead of having many snapshots?โ
- Expected: Only changed blocks are duplicated, shared blocks counted once
- โHow does garbage collection work in a COW filesystem?โ
- Expected: Reference counting, free when refcount=0
- โHow do checksums enable self-healing?โ
- Expected: Detect corruption, use mirror/RAID copy to repair
Self-Assessment Checklist
- COW allocation never overwrites existing blocks
- Atomic root pointer update commits transactions
- Snapshots create instantly (O(1))
- All blocks have verified checksums
- Garbage collection correctly frees unreferenced blocks
- Filesystem survives simulated crashes
- Scrub detects bit rot
- Standard file operations work correctly
Resources
Books
- โDesigning Data-Intensive Applicationsโ by Kleppmann - Ch. 3, 7
- โDatabase Internalsโ by Petrov - Ch. 5-7
Papers/Docs
- Btrfs Design
- ZFS On-Disk Specification
- โAnalysis of Modern Filesystemsโ - Various academic papers