Capstone Project: Production-Grade Copy-on-Write Filesystem

Capstone Project: Production-Grade Copy-on-Write Filesystem

Project Overview

Attribute Details
Difficulty Master
Time Estimate 1-2 months
Language C
Knowledge Area Modern Filesystem Design
Main Book โ€œDesigning Data-Intensive Applicationsโ€ by Martin Kleppmann
Prerequisites Projects 1-4 completed

What Youโ€™ll Build

A complete, persistent, FUSE-based filesystem with copy-on-write semantics (like ZFS/Btrfs), snapshots, and data integrity verification (checksums). This is a โ€œrealโ€ filesystem you could actually use.

Real World Output:

$ ./mkfs.cowfs -s 1G mydisk.img
$ ./cowfs mydisk.img /mnt/cow

# Normal operations work
$ cp -r ~/projects /mnt/cow/
$ ls /mnt/cow/projects
myapp/ webapp/ scripts/

# Create a snapshot before dangerous operation
$ ./cowctl /mnt/cow snapshot create before-refactor
Snapshot 'before-refactor' created (root block: 4521)

# Make changes
$ rm -rf /mnt/cow/projects/myapp/src/*
$ echo "oops" > /mnt/cow/projects/myapp/src/main.c

# Restore from snapshot instantly
$ ./cowctl /mnt/cow snapshot restore before-refactor
Restored to snapshot 'before-refactor'
$ ls /mnt/cow/projects/myapp/src/
main.c utils.c config.h  # Everything back!

# Verify data integrity
$ ./cowctl /mnt/cow scrub
Scrubbing filesystem...
Checked 15231 blocks
Verified 15231 checksums
0 errors found

Learning Objectives

By completing this project, you will:

  1. Implement copy-on-write semantics โ€” never overwrite, always append
  2. Build block-pointer trees similar to Btrfs B-trees
  3. Create instant snapshots by saving root pointers
  4. Add data integrity verification with checksums
  5. Handle garbage collection of orphaned blocks
  6. Understand modern filesystem design principles

The Core Question Youโ€™re Answering

โ€œHow can we build a filesystem that never corrupts data, supports instant snapshots, and can detect and repair bit rot?โ€

Traditional filesystems like ext4 overwrite data in place. This creates problems:

  • Crash during write = corrupted data
  • Silent bit rot goes undetected
  • Snapshots require complex block tracking

Copy-on-write (COW) filesystems take a radically different approach: never overwrite anything. Every write creates new blocks, then atomically updates a root pointer. This elegant model solves journaling, snapshots, and checksums in one design.


Deep Theoretical Foundation

1. The Copy-on-Write Principle

Traditional write (overwrite in place):

Before: [Block 100: "Hello"]
Write:  [Block 100: "World"]  โ† Overwrites existing data
After:  [Block 100: "World"]

If crash during write: [Block 100: "Horld"] โ† Corruption!

Copy-on-write:

Before: Root โ†’ [Block 100: "Hello"]
Write:  Allocate Block 200
        [Block 200: "World"]
        Update Root โ†’ Block 200 (atomic)
After:  Root โ†’ [Block 200: "World"]
        [Block 100: "Hello"] โ† Still intact, now garbage

If crash during write:
  - Block 200 partially written but unreferenced
  - Root still โ†’ Block 100 ("Hello")
  - Filesystem is consistent!

Key insight: The atomic pointer update is the commit point. Everything before it is preparation; everything after is cleanup.

2. Tree Structure for Data Organization

COW filesystems use trees where nodes can be shared:

Initial state:
                    Root v1
                   /        \
                Node A      Node B
               /    \      /    \
            Data1  Data2  Data3  Data4

After modifying Data2:
                    Root v2          Root v1 (snapshot)
                   /        \       /        \
           Node A'          Node B           Node A
          /      \         /    \           /    \
       Data1   Data2'   Data3  Data4     Data1  Data2
                                                 (old)

Notice:
- Root v2 shares Node B with Root v1
- Node A' shares Data1 with Node A
- Only modified path is copied

Space efficiency: Snapshots share unmodified blocks. A 1GB filesystem with 100 snapshots doesnโ€™t need 100GBโ€”only changed blocks are duplicated.

3. Block Reference Counting

Since blocks can be shared between snapshots, we need reference counting:

struct cow_block_info {
    uint32_t checksum;      // CRC32 of block data
    uint32_t refcount;      // Number of references
    uint64_t generation;    // When block was created
};

Lifecycle:

  1. Block allocated: refcount = 1
  2. Snapshot created: refcount++ for all reachable blocks
  3. Block overwritten (COW): old block refcountโ€“, new block refcount = 1
  4. Snapshot deleted: refcountโ€“ for all blocks in snapshot
  5. refcount reaches 0: block is garbage, can be reused

4. Checksum-Based Integrity

Every block has a checksum stored separately (or in parent pointer):

โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
โ”‚ Parent Node                                                      โ”‚
โ”œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ค
โ”‚ Child 0 ptr  โ”‚ Child 0 csum โ”‚ Child 1 ptr  โ”‚ Child 1 csum      โ”‚
โ”‚ Block 100    โ”‚ 0xA1B2C3D4   โ”‚ Block 200    โ”‚ 0x12345678        โ”‚
โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜

On read:
1. Read block 100
2. Calculate CRC32 of block 100
3. Compare with stored 0xA1B2C3D4
4. If mismatch: bit rot detected!

Self-healing: With RAID or mirrors, the good copy can repair the bad one.

5. Snapshot Implementation

A snapshot is just a saved root pointer:

struct snapshot {
    char name[64];
    uint64_t root_block;      // Root of tree at snapshot time
    uint64_t timestamp;
    uint32_t id;
};

Creating a snapshot:

  1. Get current root block number
  2. Increment refcount of root (and transitively, all blocks)
  3. Save (name, root_block, timestamp) in snapshot table

Restoring a snapshot:

  1. Find snapshot by name
  2. Set active root = snapshotโ€™s root_block
  3. (Optional) Increment refcounts to prevent GC

Deleting a snapshot:

  1. Decrement refcount of snapshotโ€™s root block
  2. Garbage collect any blocks that reach refcount = 0

6. Garbage Collection

Blocks with refcount = 0 are garbage. Collection strategies:

Strategy 1: Immediate Free

void dec_refcount(uint32_t block) {
    block_info[block].refcount--;
    if (block_info[block].refcount == 0) {
        // Recursively decrement children
        for (child in get_children(block)) {
            dec_refcount(child);
        }
        // Add to free list
        free_block(block);
    }
}

Strategy 2: Background GC

  • Mark orphaned blocks during operation
  • Periodically scan and collect
  • Less I/O spike, more complexity

Strategy 3: Generation-Based

  • Track when blocks were allocated
  • Collect old generations first
  • Similar to generational garbage collection

7. On-Disk Layout

โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
โ”‚ Block 0: Superblock                                               โ”‚
โ”‚   - Magic number, version                                         โ”‚
โ”‚   - Total blocks, block size                                      โ”‚
โ”‚   - Current root block pointer                                    โ”‚
โ”‚   - Free block pointer (head of free list)                        โ”‚
โ”‚   - Snapshot table location                                       โ”‚
โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜
โ”‚ Block 1-N: Block Info Table                                       โ”‚
โ”‚   - One entry per data block                                      โ”‚
โ”‚   - Checksum + refcount + generation                              โ”‚
โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜
โ”‚ Block N+1-M: Snapshot Table                                       โ”‚
โ”‚   - Array of snapshot entries                                     โ”‚
โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜
โ”‚ Block M+1-End: Data Blocks                                        โ”‚
โ”‚   - Tree nodes (inode trees, data extent trees)                   โ”‚
โ”‚   - File data                                                     โ”‚
โ”‚   - Free blocks (linked list through free space)                  โ”‚
โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜

8. Tree Node Structure

Btrfs-style B-tree nodes:

#define NODE_SIZE 4096
#define MAX_ITEMS 170  // Depends on item size

struct tree_header {
    uint32_t checksum;        // Checksum of entire node
    uint64_t block_num;       // This block's number (for verification)
    uint64_t generation;      // Transaction ID when written
    uint8_t  level;           // 0 = leaf, >0 = internal
    uint16_t num_items;       // Number of items in node
};

struct tree_item {
    struct {
        uint64_t objectid;    // Inode number or similar
        uint8_t  type;        // Item type (INODE, EXTENT, DIR_ENTRY, etc.)
        uint64_t offset;      // Type-specific offset
    } key;
    uint32_t data_offset;     // Offset to data in node (from end)
    uint32_t data_size;       // Size of data
};

// Leaf node layout:
// [header][item0][item1]...[itemN]...[free space]...[dataN]...[data1][data0]
// Items grow forward, data grows backward

// Internal node items point to child blocks instead of containing data
struct internal_item {
    struct key key;
    uint64_t child_block;
    uint32_t child_checksum;
};

Project Specification

Functional Requirements

  1. Copy-on-write operations
    • All writes create new blocks
    • Atomic root pointer updates
    • Never overwrite data in place
  2. Snapshots
    • Create named snapshots instantly
    • List existing snapshots
    • Restore filesystem to snapshot state
    • Delete snapshots (with GC)
  3. Data integrity
    • CRC32 checksum per block
    • Verify checksums on read
    • Scrub command to check all blocks
    • Report corruption
  4. Standard filesystem operations
    • All operations from Projects 2-3
    • Files, directories, permissions
    • Hard links (with refcounting)
  5. Garbage collection
    • Track block reference counts
    • Reclaim unreferenced blocks
    • Handle out-of-space gracefully
  6. Utilities
    • mkfs.cowfs: Format new filesystem
    • cowctl: Snapshot management, scrub
    • fsck.cowfs: Consistency check

Non-Functional Requirements

  • Snapshot creation in O(1) time
  • Support thousands of snapshots
  • Checksums add < 10% overhead
  • Graceful degradation when nearly full

Solution Architecture

Component Design

โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
โ”‚                        cowctl Utility                            โ”‚
โ”‚           (snapshot create/list/restore/delete, scrub)          โ”‚
โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜
                             โ”‚
โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
โ”‚                            โ–ผ                                     โ”‚
โ”‚  โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” โ”‚
โ”‚  โ”‚                    FUSE Callbacks                           โ”‚ โ”‚
โ”‚  โ”‚        (getattr, read, write, create, unlink, etc.)        โ”‚ โ”‚
โ”‚  โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ โ”‚
โ”‚                               โ”‚                                  โ”‚
โ”‚  โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” โ”‚
โ”‚  โ”‚                            โ–ผ                               โ”‚ โ”‚
โ”‚  โ”‚  โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” โ”‚ โ”‚
โ”‚  โ”‚  โ”‚                  Transaction Manager                   โ”‚ โ”‚ โ”‚
โ”‚  โ”‚  โ”‚       (begin, commit, atomic root update)              โ”‚ โ”‚ โ”‚
โ”‚  โ”‚  โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ โ”‚ โ”‚
โ”‚  โ”‚                           โ”‚                                โ”‚ โ”‚
โ”‚  โ”‚  โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”  โ”‚ โ”‚
โ”‚  โ”‚  โ”‚                        โ–ผ                            โ”‚  โ”‚ โ”‚
โ”‚  โ”‚  โ”‚  โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”   โ”‚  โ”‚ โ”‚
โ”‚  โ”‚  โ”‚  โ”‚   B-Tree    โ”‚ โ”‚  Snapshot   โ”‚ โ”‚  Checksum   โ”‚   โ”‚  โ”‚ โ”‚
โ”‚  โ”‚  โ”‚  โ”‚   Manager   โ”‚ โ”‚   Manager   โ”‚ โ”‚   Engine    โ”‚   โ”‚  โ”‚ โ”‚
โ”‚  โ”‚  โ”‚  โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”˜ โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”˜ โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”˜   โ”‚  โ”‚ โ”‚
โ”‚  โ”‚  โ”‚         โ”‚               โ”‚               โ”‚          โ”‚  โ”‚ โ”‚
โ”‚  โ”‚  โ”‚         โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜          โ”‚  โ”‚ โ”‚
โ”‚  โ”‚  โ”‚                         โ–ผ                          โ”‚  โ”‚ โ”‚
โ”‚  โ”‚  โ”‚  โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”โ”‚  โ”‚ โ”‚
โ”‚  โ”‚  โ”‚  โ”‚              Block Allocator                   โ”‚โ”‚  โ”‚ โ”‚
โ”‚  โ”‚  โ”‚  โ”‚    (COW alloc, refcount, free list, GC)       โ”‚โ”‚  โ”‚ โ”‚
โ”‚  โ”‚  โ”‚  โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜โ”‚  โ”‚ โ”‚
โ”‚  โ”‚  โ”‚                                                    โ”‚  โ”‚ โ”‚
โ”‚  โ”‚  โ”‚                  COW Core Layer                    โ”‚  โ”‚ โ”‚
โ”‚  โ”‚  โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜  โ”‚ โ”‚
โ”‚  โ”‚                                                           โ”‚ โ”‚
โ”‚  โ”‚                    Filesystem Core                        โ”‚ โ”‚
โ”‚  โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ โ”‚
โ”‚                               โ”‚                                  โ”‚
โ”‚  โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” โ”‚
โ”‚  โ”‚                            โ–ผ                               โ”‚ โ”‚
โ”‚  โ”‚                     Block I/O Layer                        โ”‚ โ”‚
โ”‚  โ”‚           (read, write, cache, checksum verify)            โ”‚ โ”‚
โ”‚  โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ โ”‚
โ”‚                               โ”‚                                  โ”‚
โ”‚                       cowfs (FUSE daemon)                        โ”‚
โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜
                                โ”‚
                                โ–ผ
โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
โ”‚                        Disk Image File                           โ”‚
โ”‚  [Superblock][Block Info Table][Snapshots][Data Blocks...]      โ”‚
โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜

Key Data Structures

// Superblock
struct cowfs_superblock {
    uint32_t magic;              // COWFS_MAGIC
    uint32_t version;
    uint32_t block_size;
    uint64_t total_blocks;
    uint64_t free_blocks;
    uint64_t root_block;         // Current filesystem root
    uint64_t info_table_start;   // Block info table
    uint64_t snapshot_table_start;
    uint64_t data_start;         // First data block
    uint64_t generation;         // Current transaction ID
    uint32_t checksum;
};

// Block info (one per data block)
struct block_info {
    uint32_t checksum;           // CRC32 of block content
    uint32_t refcount;
    uint64_t generation;         // When written
};

// Snapshot entry
struct snapshot_entry {
    char name[64];
    uint64_t root_block;
    uint64_t generation;
    uint64_t timestamp;
    uint32_t flags;
    uint32_t reserved;
};

// Tree node header
struct node_header {
    uint32_t checksum;           // Of entire node
    uint64_t block_num;
    uint64_t generation;
    uint8_t  level;              // 0 = leaf
    uint16_t num_items;
    uint16_t free_space;
    uint32_t flags;
};

// Key for tree items
struct cowfs_key {
    uint64_t objectid;           // Inode number typically
    uint8_t  type;               // INODE_ITEM, DIR_ITEM, EXTENT_ITEM, etc.
    uint64_t offset;             // Type-dependent
};

// Inode item (stored in leaf)
struct inode_item {
    uint64_t size;
    uint64_t blocks;
    uint32_t mode;
    uint32_t uid;
    uint32_t gid;
    uint32_t nlink;
    uint64_t atime;
    uint64_t mtime;
    uint64_t ctime;
};

// Extent item (file data location)
struct extent_item {
    uint64_t block_start;        // First block of extent
    uint64_t num_blocks;
    uint32_t checksum;           // Optional inline checksum
};

// Directory item
struct dir_item {
    uint64_t child_inode;
    uint8_t  file_type;
    uint16_t name_len;
    char     name[];             // Variable length
};

Key Functions

// Transaction management
struct transaction* txn_begin(struct cowfs_state *fs);
int txn_commit(struct transaction *txn);  // Atomic root update
void txn_abort(struct transaction *txn);

// COW block allocation
uint64_t cow_alloc_block(struct cowfs_state *fs);
void cow_free_block(struct cowfs_state *fs, uint64_t block);
int cow_get_block(struct cowfs_state *fs, uint64_t block, void *buf);
int cow_put_block(struct cowfs_state *fs, uint64_t block, void *buf);

// Reference counting
void block_inc_ref(struct cowfs_state *fs, uint64_t block);
void block_dec_ref(struct cowfs_state *fs, uint64_t block);
uint32_t block_get_ref(struct cowfs_state *fs, uint64_t block);

// Checksum
uint32_t checksum_block(void *data, size_t size);
int verify_block(struct cowfs_state *fs, uint64_t block, void *buf);

// B-tree operations
int btree_lookup(struct cowfs_state *fs, uint64_t root,
                 struct cowfs_key *key, void *value);
int btree_insert(struct transaction *txn, uint64_t *root,
                 struct cowfs_key *key, void *value, size_t size);
int btree_delete(struct transaction *txn, uint64_t *root,
                 struct cowfs_key *key);

// Snapshot operations
int snapshot_create(struct cowfs_state *fs, const char *name);
int snapshot_restore(struct cowfs_state *fs, const char *name);
int snapshot_delete(struct cowfs_state *fs, const char *name);
int snapshot_list(struct cowfs_state *fs, struct snapshot_entry *out, int max);

// Integrity
int scrub_filesystem(struct cowfs_state *fs, void (*callback)(uint64_t, int));

Implementation Guide

Phase 1: COW Block Layer (Week 1)

Goals:

  • Implement block allocator with free list
  • Add checksum computation and verification
  • Create atomic block write function
// Allocate a new block (never reuses immediately)
uint64_t cow_alloc_block(struct cowfs_state *fs) {
    pthread_mutex_lock(&fs->alloc_lock);

    // Get from free list
    uint64_t block = fs->free_head;
    if (block == 0) {
        pthread_mutex_unlock(&fs->alloc_lock);
        return 0;  // Out of space
    }

    // Update free list head
    uint64_t next_free;
    pread(fs->fd, &next_free, 8, block * fs->block_size);
    fs->free_head = next_free;
    fs->sb.free_blocks--;

    // Initialize block info
    fs->block_info[block].refcount = 1;
    fs->block_info[block].generation = fs->sb.generation;
    fs->block_info[block].checksum = 0;  // Updated on write

    pthread_mutex_unlock(&fs->alloc_lock);
    return block;
}

// Write block with checksum
int cow_put_block(struct cowfs_state *fs, uint64_t block, void *buf) {
    // Calculate checksum
    uint32_t csum = crc32(0, buf, fs->block_size);
    fs->block_info[block].checksum = csum;

    // Write to disk
    ssize_t ret = pwrite(fs->fd, buf, fs->block_size,
                         block * fs->block_size);
    return (ret == fs->block_size) ? 0 : -EIO;
}

// Read block with checksum verification
int cow_get_block(struct cowfs_state *fs, uint64_t block, void *buf) {
    ssize_t ret = pread(fs->fd, buf, fs->block_size,
                        block * fs->block_size);
    if (ret != fs->block_size) return -EIO;

    // Verify checksum
    uint32_t expected = fs->block_info[block].checksum;
    uint32_t actual = crc32(0, buf, fs->block_size);

    if (expected != actual) {
        fprintf(stderr, "Checksum mismatch: block %lu, expected %08x, got %08x\n",
                block, expected, actual);
        return -EIO;
    }

    return 0;
}

Phase 2: B-Tree Implementation (Week 2)

Goals:

  • Implement leaf and internal node structure
  • Add search, insert, delete operations
  • Handle node splits with COW
// Search for key in tree
int btree_lookup(struct cowfs_state *fs, uint64_t root,
                 struct cowfs_key *key, void *value) {
    if (root == 0) return -ENOENT;

    uint8_t buf[fs->block_size];
    cow_get_block(fs, root, buf);

    struct node_header *hdr = (struct node_header *)buf;

    // Binary search for key
    int idx = node_search(buf, key);

    if (hdr->level == 0) {
        // Leaf node: check for exact match
        struct tree_item *item = get_item(buf, idx);
        if (key_compare(&item->key, key) == 0) {
            memcpy(value, get_item_data(buf, item), item->data_size);
            return 0;
        }
        return -ENOENT;
    } else {
        // Internal node: recurse to child
        struct internal_item *item = get_internal_item(buf, idx);
        return btree_lookup(fs, item->child_block, key, value);
    }
}

// Insert with COW
int btree_insert(struct transaction *txn, uint64_t *root,
                 struct cowfs_key *key, void *value, size_t size) {
    // COW path: copy modified nodes
    uint64_t path[MAX_TREE_DEPTH];
    int depth = 0;

    // Find insertion point, recording path
    uint64_t node = *root;
    while (node != 0) {
        path[depth++] = node;

        uint8_t buf[fs->block_size];
        cow_get_block(fs, node, buf);

        struct node_header *hdr = (struct node_header *)buf;
        if (hdr->level == 0) break;

        int idx = node_search(buf, key);
        struct internal_item *item = get_internal_item(buf, idx);
        node = item->child_block;
    }

    // COW: allocate new nodes for entire path
    for (int i = depth - 1; i >= 0; i--) {
        uint64_t old_block = path[i];
        uint64_t new_block = cow_alloc_block(txn->fs);

        uint8_t buf[fs->block_size];
        cow_get_block(txn->fs, old_block, buf);

        // Copy and modify
        if (i == depth - 1) {
            // Leaf: insert item
            insert_leaf_item(buf, key, value, size);
        } else {
            // Internal: update child pointer
            update_child_pointer(buf, path[i+1], txn->new_blocks[i+1]);
        }

        cow_put_block(txn->fs, new_block, buf);
        txn->new_blocks[i] = new_block;

        // Decrement old block refcount (might trigger GC)
        block_dec_ref(txn->fs, old_block);
    }

    // New root is top of new path
    *root = txn->new_blocks[0];
    return 0;
}

Phase 3: Transaction Manager (Week 3)

Goals:

  • Implement transaction begin/commit/abort
  • Ensure atomic root pointer update
  • Handle rollback on failure
struct transaction* txn_begin(struct cowfs_state *fs) {
    struct transaction *txn = calloc(1, sizeof(*txn));
    txn->fs = fs;
    txn->generation = ++fs->sb.generation;
    txn->old_root = fs->sb.root_block;
    txn->new_root = fs->sb.root_block;
    return txn;
}

int txn_commit(struct transaction *txn) {
    struct cowfs_state *fs = txn->fs;

    // All modifications are done, new tree built

    // Atomically update superblock with new root
    pthread_mutex_lock(&fs->sb_lock);

    // Write superblock with new root
    fs->sb.root_block = txn->new_root;
    fs->sb.generation = txn->generation;
    fs->sb.checksum = checksum_superblock(&fs->sb);

    // Write superblock (atomic because it fits in one sector)
    pwrite(fs->fd, &fs->sb, sizeof(fs->sb), 0);
    fsync(fs->fd);

    pthread_mutex_unlock(&fs->sb_lock);

    // Transaction is now committed
    free(txn);
    return 0;
}

void txn_abort(struct transaction *txn) {
    // Decrement refcounts on any newly allocated blocks
    for (int i = 0; i < txn->num_new_blocks; i++) {
        block_dec_ref(txn->fs, txn->new_blocks[i]);
    }
    free(txn);
}

Phase 4: Snapshot System (Week 4)

Goals:

  • Create snapshots by saving root pointer
  • Implement snapshot restore
  • Handle reference counting across snapshots
int snapshot_create(struct cowfs_state *fs, const char *name) {
    pthread_mutex_lock(&fs->snapshot_lock);

    // Find free snapshot slot
    int slot = find_free_snapshot_slot(fs);
    if (slot < 0) {
        pthread_mutex_unlock(&fs->snapshot_lock);
        return -ENOSPC;
    }

    // Increment refcount of current root (and all reachable blocks)
    inc_tree_refcount(fs, fs->sb.root_block);

    // Create snapshot entry
    struct snapshot_entry snap = {
        .root_block = fs->sb.root_block,
        .generation = fs->sb.generation,
        .timestamp = time(NULL),
    };
    strncpy(snap.name, name, sizeof(snap.name) - 1);

    // Write snapshot entry
    write_snapshot_entry(fs, slot, &snap);

    pthread_mutex_unlock(&fs->snapshot_lock);
    return 0;
}

// Recursively increment refcounts
void inc_tree_refcount(struct cowfs_state *fs, uint64_t block) {
    if (block == 0) return;

    block_inc_ref(fs, block);

    uint8_t buf[fs->block_size];
    cow_get_block(fs, block, buf);

    struct node_header *hdr = (struct node_header *)buf;
    if (hdr->level == 0) {
        // Leaf: increment data block refs
        for (int i = 0; i < hdr->num_items; i++) {
            struct tree_item *item = get_item(buf, i);
            if (item->key.type == EXTENT_ITEM) {
                struct extent_item *ext = get_item_data(buf, item);
                for (uint64_t b = ext->block_start;
                     b < ext->block_start + ext->num_blocks; b++) {
                    block_inc_ref(fs, b);
                }
            }
        }
    } else {
        // Internal: recurse to children
        for (int i = 0; i < hdr->num_items; i++) {
            struct internal_item *child = get_internal_item(buf, i);
            inc_tree_refcount(fs, child->child_block);
        }
    }
}

int snapshot_restore(struct cowfs_state *fs, const char *name) {
    pthread_mutex_lock(&fs->snapshot_lock);

    // Find snapshot
    struct snapshot_entry snap;
    if (find_snapshot_by_name(fs, name, &snap) < 0) {
        pthread_mutex_unlock(&fs->snapshot_lock);
        return -ENOENT;
    }

    // Decrement current tree refcounts
    dec_tree_refcount(fs, fs->sb.root_block);

    // Set root to snapshot
    fs->sb.root_block = snap.root_block;

    // Increment snapshot tree refcounts (now active)
    inc_tree_refcount(fs, snap.root_block);

    // Update and sync superblock
    write_superblock(fs);

    pthread_mutex_unlock(&fs->snapshot_lock);
    return 0;
}

Phase 5: Garbage Collection (Week 5)

Goals:

  • Track block reference counts accurately
  • Implement lazy garbage collection
  • Handle circular references (shouldnโ€™t exist with trees)
void block_dec_ref(struct cowfs_state *fs, uint64_t block) {
    if (block == 0) return;

    pthread_mutex_lock(&fs->ref_lock);

    fs->block_info[block].refcount--;

    if (fs->block_info[block].refcount == 0) {
        // Block is now garbage

        // For tree nodes, decrement children first
        uint8_t buf[fs->block_size];
        if (cow_get_block(fs, block, buf) == 0) {
            struct node_header *hdr = (struct node_header *)buf;
            if (hdr->level == 0) {
                // Leaf: free data extents
                for (int i = 0; i < hdr->num_items; i++) {
                    struct tree_item *item = get_item(buf, i);
                    if (item->key.type == EXTENT_ITEM) {
                        struct extent_item *ext = get_item_data(buf, item);
                        for (uint64_t b = ext->block_start;
                             b < ext->block_start + ext->num_blocks; b++) {
                            block_dec_ref_unlocked(fs, b);
                        }
                    }
                }
            } else {
                // Internal: decrement child refs
                for (int i = 0; i < hdr->num_items; i++) {
                    struct internal_item *child = get_internal_item(buf, i);
                    block_dec_ref_unlocked(fs, child->child_block);
                }
            }
        }

        // Add block to free list
        uint64_t old_head = fs->free_head;
        pwrite(fs->fd, &old_head, 8, block * fs->block_size);
        fs->free_head = block;
        fs->sb.free_blocks++;
    }

    pthread_mutex_unlock(&fs->ref_lock);
}

Phase 6: Scrub and Integrity (Week 6)

Goals:

  • Verify all block checksums
  • Report and optionally repair errors
  • Background scrubbing support
int scrub_filesystem(struct cowfs_state *fs,
                     void (*callback)(uint64_t block, int status)) {
    int errors = 0;

    printf("Starting filesystem scrub...\n");

    // Walk all reachable blocks from root
    errors += scrub_tree(fs, fs->sb.root_block, callback);

    // Also walk all snapshot trees
    for (int i = 0; i < MAX_SNAPSHOTS; i++) {
        struct snapshot_entry snap;
        if (read_snapshot_entry(fs, i, &snap) == 0 && snap.root_block != 0) {
            errors += scrub_tree(fs, snap.root_block, callback);
        }
    }

    printf("Scrub complete: %d errors found\n", errors);
    return errors;
}

int scrub_tree(struct cowfs_state *fs, uint64_t block,
               void (*callback)(uint64_t, int)) {
    if (block == 0) return 0;

    int errors = 0;

    // Verify this block
    uint8_t buf[fs->block_size];
    if (pread(fs->fd, buf, fs->block_size, block * fs->block_size)
        != fs->block_size) {
        if (callback) callback(block, SCRUB_READ_ERROR);
        return 1;
    }

    uint32_t expected = fs->block_info[block].checksum;
    uint32_t actual = crc32(0, buf, fs->block_size);

    if (expected != actual) {
        if (callback) callback(block, SCRUB_CHECKSUM_MISMATCH);
        errors++;
    } else {
        if (callback) callback(block, SCRUB_OK);
    }

    // Recurse for tree nodes
    struct node_header *hdr = (struct node_header *)buf;
    if (hdr->level > 0) {
        for (int i = 0; i < hdr->num_items; i++) {
            struct internal_item *child = get_internal_item(buf, i);
            errors += scrub_tree(fs, child->child_block, callback);
        }
    }

    return errors;
}

Phase 7: FUSE Integration (Weeks 7-8)

Goals:

  • Connect B-tree operations to FUSE callbacks
  • Implement all standard file operations
  • Test with real workloads
static int cowfs_getattr(const char *path, struct stat *stbuf,
                         struct fuse_file_info *fi) {
    struct cowfs_state *fs = get_fs_state();

    // Lookup inode for path
    uint64_t ino;
    if (path_to_inode(fs, path, &ino) < 0) {
        return -ENOENT;
    }

    // Get inode item from tree
    struct cowfs_key key = {
        .objectid = ino,
        .type = INODE_ITEM,
        .offset = 0,
    };
    struct inode_item item;
    if (btree_lookup(fs, fs->sb.root_block, &key, &item) < 0) {
        return -ENOENT;
    }

    // Fill stat buffer
    memset(stbuf, 0, sizeof(*stbuf));
    stbuf->st_ino = ino;
    stbuf->st_mode = item.mode;
    stbuf->st_nlink = item.nlink;
    stbuf->st_uid = item.uid;
    stbuf->st_gid = item.gid;
    stbuf->st_size = item.size;
    stbuf->st_blocks = item.blocks;
    stbuf->st_atime = item.atime;
    stbuf->st_mtime = item.mtime;
    stbuf->st_ctime = item.ctime;

    return 0;
}

static int cowfs_write(const char *path, const char *buf, size_t size,
                       off_t offset, struct fuse_file_info *fi) {
    struct cowfs_state *fs = get_fs_state();

    // Start transaction
    struct transaction *txn = txn_begin(fs);

    // Get inode
    uint64_t ino;
    path_to_inode(fs, path, &ino);

    // Allocate new blocks for data (COW)
    uint64_t new_blocks[size / fs->block_size + 1];
    size_t num_blocks = 0;

    for (off_t pos = offset; pos < offset + size; pos += fs->block_size) {
        uint64_t blk = cow_alloc_block(fs);
        if (blk == 0) {
            txn_abort(txn);
            return -ENOSPC;
        }

        // Write data to new block
        size_t write_size = MIN(fs->block_size, offset + size - pos);
        cow_put_block(fs, blk, buf + (pos - offset));
        new_blocks[num_blocks++] = blk;
    }

    // Update extent tree with new blocks
    update_extents(txn, ino, offset, new_blocks, num_blocks);

    // Update inode
    update_inode_size(txn, ino, MAX(old_size, offset + size));

    // Commit
    txn_commit(txn);

    return size;
}

Testing Strategy

Unit Tests

// Test COW block allocation
void test_cow_alloc() {
    struct cowfs_state fs;
    init_test_fs(&fs, 100);  // 100 blocks

    uint64_t b1 = cow_alloc_block(&fs);
    assert(b1 != 0);
    assert(fs.block_info[b1].refcount == 1);

    uint64_t b2 = cow_alloc_block(&fs);
    assert(b2 != 0);
    assert(b1 != b2);
}

// Test reference counting
void test_refcount() {
    struct cowfs_state fs;
    init_test_fs(&fs, 100);

    uint64_t b = cow_alloc_block(&fs);
    assert(block_get_ref(&fs, b) == 1);

    block_inc_ref(&fs, b);
    assert(block_get_ref(&fs, b) == 2);

    block_dec_ref(&fs, b);
    assert(block_get_ref(&fs, b) == 1);

    block_dec_ref(&fs, b);
    // Block should be freed
    assert(block_is_free(&fs, b));
}

// Test snapshot
void test_snapshot() {
    struct cowfs_state fs;
    init_test_fs(&fs, 1000);

    // Create some data
    create_file(&fs, "/test.txt", "original");

    // Take snapshot
    snapshot_create(&fs, "snap1");

    // Modify file
    write_file(&fs, "/test.txt", "modified");

    // Verify current is modified
    assert(strcmp(read_file(&fs, "/test.txt"), "modified") == 0);

    // Restore snapshot
    snapshot_restore(&fs, "snap1");

    // Verify restored to original
    assert(strcmp(read_file(&fs, "/test.txt"), "original") == 0);
}

Crash Tests

#!/bin/bash
# Test crash recovery

for i in $(seq 1 50); do
    echo "Crash test iteration $i"

    # Start filesystem
    ./cowfs disk.img /mnt/cow &
    PID=$!
    sleep 0.5

    # Do some work
    (
        for j in $(seq 1 10); do
            dd if=/dev/urandom of=/mnt/cow/file_$j bs=1K count=10 2>/dev/null
        done
    ) &

    # Crash after random time
    sleep $(echo "scale=2; $RANDOM/32768" | bc)
    kill -9 $PID

    # Remount
    ./cowfs disk.img /mnt/cow

    # Verify scrub passes
    ./cowctl /mnt/cow scrub
    if [ $? -ne 0 ]; then
        echo "FAIL: Corruption after crash"
        exit 1
    fi

    fusermount -u /mnt/cow
done

echo "PASS: All crash tests passed"

Common Pitfalls

Pitfall 1: Reference Count Leaks

Problem: Blocks never freed because refcount never reaches 0.

Solution: Careful tracking of all operations that inc/dec refs:

  • Allocation: refcount = 1
  • Snapshot: inc tree refcounts
  • COW write: inc new block, dec old block
  • Snapshot delete: dec tree refcounts

Pitfall 2: Transaction Abort Cleanup

Problem: Aborted transaction leaves orphaned blocks.

Solution: Track all allocations in transaction, free on abort.

Pitfall 3: Checksum of Modified Block

Problem: Checksum computed before final modification.

Solution: Always compute checksum immediately before write.


Interview Questions

  1. โ€œWhat is copy-on-write, and why is it useful for filesystems?โ€
    • Expected: Never overwrite, atomic updates, enables snapshots
  2. โ€œHow are snapshots instant in COW filesystems?โ€
    • Expected: Just save root pointer, increment refcounts
  3. โ€œWhat is the space overhead of having many snapshots?โ€
    • Expected: Only changed blocks are duplicated, shared blocks counted once
  4. โ€œHow does garbage collection work in a COW filesystem?โ€
    • Expected: Reference counting, free when refcount=0
  5. โ€œHow do checksums enable self-healing?โ€
    • Expected: Detect corruption, use mirror/RAID copy to repair

Self-Assessment Checklist

  • COW allocation never overwrites existing blocks
  • Atomic root pointer update commits transactions
  • Snapshots create instantly (O(1))
  • All blocks have verified checksums
  • Garbage collection correctly frees unreferenced blocks
  • Filesystem survives simulated crashes
  • Scrub detects bit rot
  • Standard file operations work correctly

Resources

Books

  • โ€œDesigning Data-Intensive Applicationsโ€ by Kleppmann - Ch. 3, 7
  • โ€œDatabase Internalsโ€ by Petrov - Ch. 5-7

Papers/Docs

Code