Project 4: Journaling Layer

Project Overview

Attribute        Details
─────────────────────────────────────────────────
Difficulty       Master
Time Estimate    2-3 weeks
Language         C
Knowledge Area   Crash Consistency / Databases
Main Book        “Database Internals” by Alex Petrov
Prerequisites    Project 3 completed

What You’ll Build

Add write-ahead logging (journaling) to your block-based filesystem. Before any metadata change, log the operation so crashes can be recovered.

Real World Output:

$ ./myfs disk.img /mnt/myfs
$ dd if=/dev/urandom of=/mnt/myfs/bigfile bs=1M count=50 &
# Kill power mid-write (or kill -9 the process)
$ ./myfs disk.img /mnt/myfs
Journal recovery: replaying 3 transactions...
Transaction 1: inode 15 allocation - COMMITTED, replaying
Transaction 2: block 1024-1087 allocation - COMMITTED, replaying
Transaction 3: inode 15 size update - INCOMPLETE, discarding
Recovery complete. Filesystem consistent.
$ ls /mnt/myfs  # No corruption!

Learning Objectives

By completing this project, you will:

  1. Understand the crash consistency problem and why filesystems need journaling
  2. Implement write-ahead logging (WAL) — the same technique used by databases
  3. Design transaction semantics for atomic filesystem operations
  4. Build a journal recovery system that replays committed transactions
  5. Handle checkpointing to reclaim journal space
  6. Balance performance and durability tradeoffs
  7. Connect filesystem concepts to database internals — ACID properties in action

The Core Question You’re Answering

“How do you guarantee filesystem consistency when power can fail at any millisecond during a write operation?”

Before journaling, crashes during writes left filesystems corrupted. fsck would scan the entire disk to find and fix inconsistencies—taking hours on large drives.

In 2001, ext3 added one feature to ext2: journaling. That single addition made ext3 mount almost instantly after crashes. You’re about to implement that same insight, write-ahead logging: the technique PostgreSQL, MySQL, and every serious database uses.

The Database Connection:

This project bridges filesystems and databases. The WAL (Write-Ahead Logging) protocol you’ll implement is, in essence, the same protocol PostgreSQL uses. The ARIES recovery algorithm (with its Analysis, Redo, and Undo phases) that production databases use is a more sophisticated version of what you’ll build here. Understanding this prepares you for database internals.


Deep Theoretical Foundation

1. The Crash Consistency Problem

Consider creating a file /newfile.txt:

Operations required:
1. Allocate inode (update inode bitmap)
2. Initialize inode (write inode structure)
3. Allocate data block (update block bitmap)
4. Write file data (write data block)
5. Update directory (add entry to parent)
6. Update superblock (decrement free counts)

What if power fails between operations?

Crash After   Result
───────────────────────────────────────────────────────────────
Step 1        Inode allocated but uninitialized - garbage metadata
Step 2        Inode exists but no directory entry - orphaned inode
Step 3        Block allocated but unused - space leak
Step 4        Data written but inode not updated - data lost
Step 5        Directory points to partial inode - potential corruption

The fundamental issue: Disk writes are not atomic across multiple blocks. You cannot guarantee all updates complete before a crash.

The Crash Window Problem:

Time ────────────────────────────────────────────────────────────►

     ┌─────────┐   ┌─────────┐   ┌─────────┐   ┌─────────┐
     │ Write   │   │ Write   │   │ Write   │   │ Write   │
     │ Bitmap  │   │ Inode   │   │ DirEnt  │   │ Data    │
     └────┬────┘   └────┬────┘   └────┬────┘   └────┬────┘
          │             │             │             │
          ▼             ▼             ▼             ▼
     ←────────────── CRASH WINDOW ──────────────────►

     Power can fail ANYWHERE in this window.
     Partial updates = inconsistent filesystem.
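
The crash window can be made concrete with a toy simulation. Everything here (the four-block layout, the helper names) is illustrative scaffolding, not project code:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define NBLOCKS 4
enum { B_IBITMAP, B_INODE, B_DIRENT, B_SUPER };  /* toy on-disk layout */

/* Simulate a file create as four separate block writes; a "crash"
 * stops the sequence after crash_after writes have reached disk. */
static void simulate_create(uint8_t disk[NBLOCKS], int crash_after) {
    int step = 0;
    if (step++ < crash_after) disk[B_IBITMAP] = 1;  /* mark inode used   */
    if (step++ < crash_after) disk[B_INODE]   = 1;  /* init inode        */
    if (step++ < crash_after) disk[B_DIRENT]  = 1;  /* add dir entry     */
    if (step++ < crash_after) disk[B_SUPER]   = 1;  /* update free count */
}

/* The consistency invariant: all structures updated, or none. */
static int consistent(const uint8_t disk[NBLOCKS]) {
    int sum = disk[0] + disk[1] + disk[2] + disk[3];
    return sum == 0 || sum == NBLOCKS;
}
```

Any crash strictly inside the sequence (after 1, 2, or 3 of the 4 writes) leaves the invariant violated, which is exactly what journaling exists to prevent.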

2. ACID Properties (Database Connection)

Journaling provides ACID guarantees — the same properties databases ensure:

Property      Database Meaning                          Filesystem Meaning
──────────────────────────────────────────────────────────────────────────────
Atomicity     Transaction all-or-nothing                File create either happens completely or not at all
Consistency   DB constraints preserved                  Filesystem invariants maintained (bitmaps match usage)
Isolation     Concurrent transactions don’t interfere   Multiple file operations don’t corrupt each other
Durability    Committed data survives crash             Once fsync returns, data is on disk

The key insight: A filesystem operation (create, delete, rename) is a transaction that must be atomic.

3. Solution Approaches Compared

Approach 1: Write Order + fsck (ext2)

Carefully order writes to limit damage:

  1. Write data blocks first
  2. Then write metadata
  3. On crash, run fsck to scan and fix

Problems:

  • fsck scans entire disk (slow for large filesystems)
  • Some inconsistencies can’t be fixed
  • User must wait for fsck to complete

ext2 Recovery Time:

Filesystem Size    fsck Time
────────────────────────────
10 GB              ~1 minute
100 GB             ~10 minutes
1 TB               ~1-2 hours
10 TB              ~10-20 hours  ← Unacceptable!

Approach 2: Journaling (ext3/ext4)

Write operations twice:

  1. First, log the changes to a journal
  2. Then, apply changes to their final locations

On crash:

  1. Scan small journal (fast)
  2. Replay committed transactions
  3. Discard incomplete transactions

Key insight: The journal is small, so recovery is fast. Only committed transactions are replayed, so consistency is guaranteed.

Journaling Recovery:

Journal Size: 128 MB (fixed, regardless of FS size)
Recovery Time: ~1-5 seconds (always)

10 TB filesystem crashes at 3 AM?
Mount in 5 seconds, not 20 hours.

Approach 3: Copy-on-Write (ZFS/Btrfs)

Never overwrite data. Always write to new locations, then update pointer atomically. (This is Project 6.)

4. Write-Ahead Logging (WAL) Principles

The fundamental rules of WAL:

  1. Log before apply: Write the log entry before making the actual change
  2. Commit before success: Don’t report success until the commit record is durable
  3. Idempotent operations: Replaying a transaction multiple times = same result
  4. Recovery replays committed: On mount, replay all committed transactions
WAL Protocol Visualization:

Normal write flow:
┌─────────────────────────────────────────────────────────────────┐
│ 1. Application calls write()                                     │
│ 2. Filesystem prepares changes                                   │
│ 3. BEGIN transaction written to journal                          │
│ 4. All changes logged to journal                                 │
│ 5. COMMIT written to journal                                     │
│ 6. fsync() journal — changes are now durable                     │
│ 7. Apply changes to actual locations (checkpoint)                │
│ 8. Return success to application                                 │
└─────────────────────────────────────────────────────────────────┘

Recovery flow:
┌─────────────────────────────────────────────────────────────────┐
│ 1. Mount filesystem                                              │
│ 2. Scan journal from last checkpoint                             │
│ 3. For each transaction:                                         │
│    - If COMMIT found: replay all operations                      │
│    - If no COMMIT: discard (incomplete transaction)              │
│ 4. Mark journal as empty                                         │
│ 5. Continue normal operation                                     │
└─────────────────────────────────────────────────────────────────┘
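
Rule 3 above (idempotency) is what makes this recovery flow safe to interrupt and restart: each journal record carries the full after-image of a block, so replay is a plain overwrite. A toy sketch, with illustrative names and tiny blocks:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define BLOCK_SIZE 16   /* tiny blocks for the demo */
#define NBLOCKS    8

/* A journal record holds the full after-image of one block, so redo is
 * a plain overwrite: applying it once or many times gives one result. */
struct log_rec {
    uint32_t block;
    uint8_t  data[BLOCK_SIZE];
};

static void replay(uint8_t disk[NBLOCKS][BLOCK_SIZE],
                   const struct log_rec *log, size_t n) {
    for (size_t i = 0; i < n; i++)
        memcpy(disk[log[i].block], log[i].data, BLOCK_SIZE);
}
```

This is why a crash during recovery is harmless: on the next mount, recovery simply replays the same committed records again.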

5. The WAL Guarantee

Theorem: If a transaction has a COMMIT record in the durable journal, all its changes can be recovered.

Proof:

  1. COMMIT is written after all data entries
  2. COMMIT is fsync’d to disk before returning success
  3. If COMMIT exists, all data entries must exist (WAL write order)
  4. Recovery reads all data entries and applies them
  5. Therefore: COMMIT → Recoverable

6. Journal Structure

A journal is typically a fixed-size circular buffer on disk:

┌──────────────────────────────────────────────────────────────────┐
│                         JOURNAL                                   │
├─────────────┬─────────────────────────────────────┬──────────────┤
│ Journal     │            Journal Entries          │   Free       │
│ Superblock  │ [T1-Begin][Data][Commit][T2-Begin]..│   Space      │
│             │     ↑                         ↑     │              │
│             │   head                       tail   │              │
└─────────────┴─────────────────────────────────────┴──────────────┘

Journal wraps around when it reaches the end (circular buffer):
┌─────────────────────────────────────────────────────────────────┐
│ ...old data...][T5-data][T5-COMMIT][T6-BEGIN...   |  ...old...  │
│                                               ↑   ↑             │
│                                             tail head           │
└─────────────────────────────────────────────────────────────────┘

Journal superblock:

struct journal_superblock {
    uint32_t j_magic;           // Magic number (0x4A484653 "JHFS")
    uint32_t j_version;         // Journal format version
    uint32_t j_start;           // First block of journal
    uint32_t j_size;            // Size in blocks
    uint32_t j_head;            // Oldest valid transaction
    uint32_t j_tail;            // Next write position
    uint64_t j_sequence;        // Next transaction ID
    uint32_t j_checkpoint;      // Last checkpointed transaction
    uint32_t j_flags;           // Status flags
    uint8_t  j_uuid[16];        // Journal UUID
};

Journal entries:

#define J_BEGIN     1
#define J_COMMIT    2
#define J_DATA      3
#define J_REVOKE    4  // For recovery optimization

struct journal_entry_header {
    uint32_t je_type;           // Entry type
    uint32_t je_length;         // Entry length (including header)
    uint64_t je_sequence;       // Transaction ID
    uint32_t je_checksum;       // CRC32 of entry
};

struct journal_begin {
    struct journal_entry_header header;
    uint32_t jb_num_blocks;     // Expected blocks in transaction
    uint64_t jb_timestamp;      // Transaction start time
};

struct journal_data {
    struct journal_entry_header header;
    uint32_t jd_block;          // Target block number
    uint32_t jd_flags;          // Block flags
    // Followed by block data (BLOCK_SIZE bytes)
};

struct journal_commit {
    struct journal_entry_header header;
    uint64_t jc_sequence;       // Transaction ID (redundant for verification)
    uint32_t jc_checksum;       // Checksum of entire transaction
    uint64_t jc_timestamp;      // Commit time
};
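
The `je_checksum` convention and the entry-walking helper (`entry_size`, used in the recovery code later) can be sketched as follows. This is a simplified variant that checksums only the header with the checksum field zeroed (the full design also covers payload bytes), and it redeclares a local copy of the header so the sketch is self-contained. The table-less CRC-32 uses the same polynomial as zlib, so a real build could call zlib’s `crc32()` instead:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Local copy of the entry header so this sketch is self-contained. */
struct jhdr {
    uint32_t je_type;
    uint32_t je_length;     /* total length, header included */
    uint64_t je_sequence;
    uint32_t je_checksum;
};

/* Table-less CRC-32 (zlib's polynomial, reflected form). */
static uint32_t crc32_buf(const void *buf, size_t len) {
    const uint8_t *p = buf;
    uint32_t crc = 0xFFFFFFFFu;
    while (len--) {
        crc ^= *p++;
        for (int k = 0; k < 8; k++)
            crc = (crc >> 1) ^ (0xEDB88320u & (-(crc & 1)));
    }
    return ~crc;
}

/* je_length includes the header, so walking the journal is just a
 * matter of trusting a verified length field. */
static size_t entry_size(const struct jhdr *h) { return h->je_length; }

/* Checksum is computed over a copy with je_checksum (and padding) zeroed. */
static uint32_t jhdr_checksum(const struct jhdr *h) {
    struct jhdr tmp;
    memset(&tmp, 0, sizeof tmp);
    tmp.je_type = h->je_type;
    tmp.je_length = h->je_length;
    tmp.je_sequence = h->je_sequence;   /* je_checksum stays 0 */
    return crc32_buf(&tmp, sizeof tmp);
}

static int verify_entry_checksum(const struct jhdr *h) {
    return jhdr_checksum(h) == h->je_checksum;
}
```

Verification rejects any single-bit corruption of a header field, which is what lets recovery stop scanning at the first torn or garbage entry.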

7. Transaction Semantics

A transaction groups related filesystem operations:

// Creating a file is one transaction:
transaction_begin();
{
    alloc_inode();           // Logged
    init_inode();            // Logged
    alloc_blocks();          // Logged
    write_data();            // Maybe logged (depends on mode)
    add_dir_entry();         // Logged
    update_superblock();     // Logged
}
transaction_commit();

Atomicity guarantee: Either ALL changes commit, or NONE do.

Visualization of a Transaction:

Transaction T47: Create file "/documents/report.txt"

Journal Contents:
┌─────────────────────────────────────────────────────────────────┐
│ T47-BEGIN (seq=47, expecting 4 blocks)                          │
├─────────────────────────────────────────────────────────────────┤
│ T47-DATA: Block 5 (inode bitmap, set bit 23)                    │
├─────────────────────────────────────────────────────────────────┤
│ T47-DATA: Block 105 (inode 23 structure)                        │
├─────────────────────────────────────────────────────────────────┤
│ T47-DATA: Block 200 (directory /documents, add entry)           │
├─────────────────────────────────────────────────────────────────┤
│ T47-DATA: Block 0 (superblock, decrement free inode count)      │
├─────────────────────────────────────────────────────────────────┤
│ T47-COMMIT (seq=47, checksum=0x12345678)                        │
└─────────────────────────────────────────────────────────────────┘

8. Journaling Modes

Mode        What’s Journaled                    Speed    Safety
───────────────────────────────────────────────────────────────
Journal     Metadata + Data                     Slow     Highest
Ordered     Metadata only, data written first   Medium   Good
Writeback   Metadata only                       Fast     Lower

Journal mode: Everything is logged. File content AND metadata. Slowest but safest.

Ordered mode (ext4 default): Only metadata is journaled. But data blocks are written before the metadata journal entry commits. This ensures you never see stale data.

Writeback mode: Only metadata journaled. Data might not be flushed before commit. Fastest, but stale data possible after crash.

Journaling Modes Comparison:

Full Data Journaling:
  write("hello") → [Journal: data + metadata] → [Disk: data + metadata]
  Write amplification: 2x (data written twice)

Ordered Mode:
  write("hello") → [Disk: data] → [Journal: metadata] → [Disk: metadata]
  Write amplification: 1x for data, 2x for metadata
  Guarantee: Data on disk before metadata commit

Writeback Mode:
  write("hello") → [Journal: metadata] → [Disk: data + metadata] (any order)
  Write amplification: 2x for metadata only
  Risk: Stale/garbage data visible after crash

9. Checkpointing

The journal has limited size. To reclaim space, checkpoint transactions:

  1. Flush all logged changes to their final disk locations
  2. Verify writes completed (fsync)
  3. Advance journal head pointer
  4. Space between old head and new head is now reusable

Before checkpoint:
head                                                    tail
 ↓                                                       ↓
[T1-complete][T2-complete][T3-complete][T4-in-progress][...]
 ↑_________________________________________↑
          Can be checkpointed

After checkpoint:
                          new head                      tail
                              ↓                          ↓
[recyclable space          ][T4-in-progress][T5-new][...]

Checkpoint Process:
1. Identify complete transactions (T1, T2, T3)
2. Write T1, T2, T3 data to final disk locations
3. fsync() to ensure durability
4. Update journal superblock (move head past T3)
5. Journal space before new head is now reusable

10. Recovery Algorithm

void journal_recover(struct journal_state *j, struct myfs_state *fs) {
    uint64_t pos = j->superblock.j_head;
    uint64_t current_txn = 0;
    bool has_commit = false;

    // First pass: identify complete transactions
    while (pos < j->superblock.j_tail) {
        struct journal_entry_header hdr;
        read_journal(j, pos, &hdr, sizeof(hdr));

        switch (hdr.je_type) {
        case J_BEGIN:
            current_txn = hdr.je_sequence;
            has_commit = false;
            break;

        case J_COMMIT:
            if (hdr.je_sequence == current_txn) {
                has_commit = true;
                // Mark transaction as complete
            }
            break;
        }

        pos += entry_size(&hdr);
    }

    // Second pass: replay complete transactions
    pos = j->superblock.j_head;
    while (pos < j->superblock.j_tail) {
        struct journal_entry_header hdr;
        read_journal(j, pos, &hdr, sizeof(hdr));

        if (hdr.je_type == J_DATA && transaction_complete(hdr.je_sequence)) {
            struct journal_data jd;
            read_journal(j, pos, &jd, sizeof(jd));
            // Block data follows the header in the journal;
            // copy it from the journal to its final location
            uint8_t data[BLOCK_SIZE];
            read_journal(j, pos + sizeof(jd), data, BLOCK_SIZE);
            write_block(fs, jd.jd_block, data);
        }

        pos += entry_size(&hdr);
    }

    // Clear journal
    j->superblock.j_head = j->superblock.j_tail;
    write_journal_superblock(j);
}

Project Specification

Functional Requirements

  1. Journal area on disk
    • Reserve blocks for journal
    • Configurable size (32MB default)
    • Circular buffer implementation
  2. Transaction support
    • Begin/commit transaction API
    • Log metadata changes
    • Optional data journaling mode
  3. Recovery on mount
    • Detect unclean shutdown
    • Scan and replay committed transactions
    • Discard incomplete transactions
  4. Checkpointing
    • Periodic flush of journal to final locations
    • Reclaim journal space
    • Handle journal full condition
  5. Integration with filesystem
    • Wrap all metadata operations in transactions
    • Handle nested operations correctly

Non-Functional Requirements

  • Recovery completes in < 5 seconds
  • Journal overhead < 20% on normal operations
  • Support concurrent transactions
  • Graceful degradation when journal is full

Solution Architecture

Component Design

┌─────────────────────────────────────────────────────────────────┐
│                    Filesystem Operations                         │
│            (create, unlink, write, mkdir, etc.)                  │
└────────────────────────────┬────────────────────────────────────┘
                             │
                             ▼
┌─────────────────────────────────────────────────────────────────┐
│                    Transaction Manager                           │
│              (begin, log, commit, abort)                         │
└────────────────────────────┬────────────────────────────────────┘
                             │
                             ▼
┌─────────────────────────────────────────────────────────────────┐
│                    Journal Writer                                │
│           (append entries, manage space)                         │
└────────────────────────────┬────────────────────────────────────┘
                             │
          ┌──────────────────┼──────────────────┐
          ▼                  ▼                  ▼
┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐
│ Journal Buffer  │ │  Checkpointer   │ │ Recovery Engine │
│  (in-memory)    │ │ (background)    │ │ (on mount)      │
└────────┬────────┘ └────────┬────────┘ └────────┬────────┘
         │                   │                   │
         └───────────────────┼───────────────────┘
                             ▼
┌─────────────────────────────────────────────────────────────────┐
│                    Journal Blocks (on disk)                      │
│   [JSuper][Entry][Entry][Entry]...[unused]...[oldest entries]   │
└─────────────────────────────────────────────────────────────────┘

Data Structures

// Journal state (in-memory)
struct journal_state {
    struct journal_superblock jsuper;
    uint8_t *buffer;              // Write buffer
    size_t buffer_pos;            // Current position in buffer
    size_t buffer_size;           // Buffer size (typically one block)
    uint64_t current_txn;         // Current transaction ID
    int txn_depth;                // Nesting level (for compound ops)
    pthread_mutex_t lock;         // Concurrency control
    int dirty;                    // Journal superblock needs writing
    int disk_fd;                  // File descriptor for journal
};

// Transaction handle
struct transaction {
    uint64_t txn_id;
    struct journal_state *journal;
    struct logged_block *blocks;  // List of blocks to log
    size_t num_blocks;
    int state;                    // TXN_ACTIVE, TXN_COMMITTING, TXN_COMPLETE
    struct transaction *next;     // For transaction list
};

// Logged block info
struct logged_block {
    uint32_t block_num;           // Target block number
    uint8_t data[BLOCK_SIZE];     // Block content (copy)
    uint32_t flags;               // JOURNAL_METADATA, JOURNAL_DATA, etc.
    struct logged_block *next;
};

// Recovery state
struct recovery_state {
    uint64_t *complete_txns;      // Array of complete transaction IDs
    size_t num_complete;
    size_t capacity;
    uint64_t max_seq;             // Highest sequence seen
};

// Transaction states
#define TXN_ACTIVE      1
#define TXN_COMMITTING  2
#define TXN_COMMITTED   3
#define TXN_ABORTED     4

// Block flags
#define JOURNAL_METADATA    0x01
#define JOURNAL_DATA        0x02
#define JOURNAL_ESCAPE      0x04  // Block contains journal magic
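
The JOURNAL_ESCAPE flag deserves a sketch: if a data block happens to begin with the journal magic, a recovery scan could misparse it as an entry header. The usual fix (jbd2 does the same) is to clear the magic word in the logged copy, set the flag, and restore the word during replay. A minimal version, taking the first-4-bytes convention as an assumption:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define BLOCK_SIZE     4096
#define JOURNAL_MAGIC  0x4A484653u

/* Does this block's first word collide with the journal magic? */
static int needs_escape(const uint8_t *block) {
    uint32_t first;
    memcpy(&first, block, sizeof first);
    return first == JOURNAL_MAGIC;
}

/* Clear the magic in the logged copy (caller sets JOURNAL_ESCAPE). */
static void escape_block(uint8_t *block) {
    memset(block, 0, sizeof(uint32_t));
}

/* Restore the original first word during replay/checkpoint. */
static void unescape_block(uint8_t *block) {
    uint32_t magic = JOURNAL_MAGIC;
    memcpy(block, &magic, sizeof magic);
}
```

Note that setting the flag alone is not enough: the commit path must also call the escape step on the logged copy, or recovery will restore the wrong bytes.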

Key Functions

// Journal initialization
int journal_init(struct myfs_state *fs, uint32_t start_block, uint32_t size);
int journal_load(struct myfs_state *fs);
int journal_close(struct myfs_state *fs);

// Transaction API
struct transaction* txn_begin(struct myfs_state *fs);
int txn_log_block(struct transaction *txn, uint32_t block_num, void *data);
int txn_commit(struct transaction *txn);
void txn_abort(struct transaction *txn);

// Journal operations
int journal_append(struct journal_state *j, void *entry, size_t len);
int journal_sync(struct journal_state *j);
int journal_checkpoint(struct journal_state *j);
int journal_needs_checkpoint(struct journal_state *j);

// Recovery
int journal_recover(struct myfs_state *fs);
int journal_needs_recovery(struct myfs_state *fs);
int journal_mark_clean(struct myfs_state *fs);

Implementation Guide

Phase 1: Journal Layout (Days 1-3)

Goals:

  • Reserve journal blocks in mkfs
  • Define journal superblock format
  • Initialize journal on format

Steps:

  1. Modify mkfs to allocate journal blocks
  2. Define journal superblock structure
  3. Write initial journal superblock
  4. Update filesystem superblock to point to journal

// In mkfs: allocate journal
uint32_t journal_blocks = 8192;  // 32MB with 4KB blocks
uint32_t journal_start = sb.s_first_data;
sb.s_journal_start = journal_start;
sb.s_journal_size = journal_blocks;
sb.s_first_data += journal_blocks;

// Initialize journal superblock
struct journal_superblock jsb = {
    .j_magic = JOURNAL_MAGIC,
    .j_version = 1,
    .j_start = journal_start + 1,  // After journal superblock
    .j_size = journal_blocks - 1,
    .j_head = 0,
    .j_tail = 0,
    .j_sequence = 1,
    .j_checkpoint = 0,
    .j_flags = JOURNAL_CLEAN,
};

// Generate UUID for journal
generate_uuid(jsb.j_uuid);

write_block(fd, journal_start, &jsb);

Verification:

# After mkfs, verify journal exists
./mkfs.myfs -s 100M -j 32M disk.img
hexdump -C disk.img | grep -E "JHFS|SFHJ"  # journal magic (byte order depends on endianness)

Phase 2: Transaction Begin/Commit (Days 3-6)

Goals:

  • Implement transaction lifecycle
  • Write BEGIN and COMMIT records
  • Ensure durability with fsync

struct transaction* txn_begin(struct myfs_state *fs) {
    struct journal_state *j = fs->journal;

    pthread_mutex_lock(&j->lock);

    // Check for journal space
    if (journal_space_low(j)) {
        journal_checkpoint(j);
    }

    // Start new transaction
    struct transaction *txn = calloc(1, sizeof(*txn));
    txn->txn_id = j->jsuper.j_sequence++;
    txn->journal = j;
    txn->state = TXN_ACTIVE;

    // Write BEGIN record
    struct journal_begin begin = {
        .header = {
            .je_type = J_BEGIN,
            .je_length = sizeof(begin),
            .je_sequence = txn->txn_id,
            .je_checksum = 0,  // Filled after
        },
        .jb_num_blocks = 0,  // Updated at commit
        .jb_timestamp = time(NULL),
    };
    begin.header.je_checksum = crc32(&begin, sizeof(begin));
    journal_append(j, &begin, sizeof(begin));

    pthread_mutex_unlock(&j->lock);

    return txn;
}

int txn_log_block(struct transaction *txn, uint32_t block_num, void *data) {
    if (txn->state != TXN_ACTIVE) {
        return -EINVAL;
    }

    // Allocate logged block (copy-on-log: snapshot the block content)
    struct logged_block *lb = malloc(sizeof(*lb));
    if (!lb) {
        return -ENOMEM;
    }
    lb->block_num = block_num;
    memcpy(lb->data, data, BLOCK_SIZE);
    lb->flags = JOURNAL_METADATA;

    // Add to transaction's block list
    lb->next = txn->blocks;
    txn->blocks = lb;
    txn->num_blocks++;

    return 0;
}

int txn_commit(struct transaction *txn) {
    struct journal_state *j = txn->journal;

    pthread_mutex_lock(&j->lock);

    txn->state = TXN_COMMITTING;

    // Write all logged blocks to journal
    struct logged_block *blk = txn->blocks;
    uint32_t txn_checksum = 0;

    while (blk) {
        struct journal_data jd = {
            .header = {
                .je_type = J_DATA,
                .je_length = sizeof(jd) + BLOCK_SIZE,
                .je_sequence = txn->txn_id,
            },
            .jd_block = blk->block_num,
            .jd_flags = blk->flags,
        };

        // Handle blocks that contain journal magic (escape them)
        if (needs_escape(blk->data)) {
            jd.jd_flags |= JOURNAL_ESCAPE;
        }

        jd.header.je_checksum = crc32(&jd, sizeof(jd));
        journal_append(j, &jd, sizeof(jd));
        journal_append(j, blk->data, BLOCK_SIZE);

        // Running CRC over all logged block data (a project helper,
        // not zlib's crc32_combine, which has a different signature)
        txn_checksum = crc32_combine(txn_checksum, blk->data, BLOCK_SIZE);
        blk = blk->next;
    }

    // Write COMMIT record
    struct journal_commit commit = {
        .header = {
            .je_type = J_COMMIT,
            .je_length = sizeof(commit),
            .je_sequence = txn->txn_id,
        },
        .jc_sequence = txn->txn_id,
        .jc_checksum = txn_checksum,
        .jc_timestamp = time(NULL),
    };
    commit.header.je_checksum = crc32(&commit, sizeof(commit));
    journal_append(j, &commit, sizeof(commit));

    // CRITICAL: Sync journal to disk before returning
    int ret = journal_sync(j);
    if (ret < 0) {
        pthread_mutex_unlock(&j->lock);
        return ret;
    }

    txn->state = TXN_COMMITTED;
    pthread_mutex_unlock(&j->lock);

    // Free transaction resources
    free_logged_blocks(txn->blocks);
    free(txn);

    return 0;
}

void txn_abort(struct transaction *txn) {
    txn->state = TXN_ABORTED;
    free_logged_blocks(txn->blocks);
    free(txn);
}

Phase 3: Logging Filesystem Operations (Days 6-10)

Goals:

  • Wrap metadata updates in transactions
  • Log block changes before applying
  • Handle inode, bitmap, and directory updates

int myfs_create_journaled(struct myfs_state *fs, const char *path, mode_t mode) {
    struct transaction *txn = txn_begin(fs);
    if (!txn) {
        return -ENOMEM;
    }

    // Allocate inode (update bitmap)
    int ino = find_free_inode(fs);
    if (ino < 0) {
        txn_abort(txn);
        return -ENOSPC;
    }

    // Read and modify inode bitmap
    uint8_t bitmap_block[BLOCK_SIZE];
    uint32_t bitmap_blk = fs->sb.s_inode_bitmap + ino / (BLOCK_SIZE * 8);
    read_block(fs, bitmap_blk, bitmap_block);
    BITMAP_SET(bitmap_block, ino % (BLOCK_SIZE * 8));
    txn_log_block(txn, bitmap_blk, bitmap_block);

    // Initialize inode
    struct myfs_inode inode = {
        .i_mode = S_IFREG | mode,
        .i_uid = getuid(),
        .i_gid = getgid(),
        .i_links_count = 1,
        .i_size = 0,
        .i_ctime = time(NULL),
        .i_mtime = time(NULL),
        .i_atime = time(NULL),
    };

    uint8_t inode_block[BLOCK_SIZE];
    uint32_t inode_blk = inode_to_block(fs, ino);
    read_block(fs, inode_blk, inode_block);
    write_inode_to_block(inode_block, ino, &inode);
    txn_log_block(txn, inode_blk, inode_block);

    // Add directory entry
    uint32_t parent_ino = path_to_parent_ino(fs, path);
    const char *filename = basename(path);
    uint8_t dir_block[BLOCK_SIZE];
    uint32_t dir_blk = get_dir_data_block(fs, parent_ino);
    read_block(fs, dir_blk, dir_block);
    add_dir_entry(dir_block, ino, filename, DT_REG);
    txn_log_block(txn, dir_blk, dir_block);

    // Update superblock (decrement free inode count)
    uint8_t sb_block[BLOCK_SIZE];
    read_block(fs, 0, sb_block);
    struct myfs_superblock *sb = (struct myfs_superblock *)sb_block;
    sb->s_free_inodes--;
    txn_log_block(txn, 0, sb_block);

    // Commit transaction (on success, txn_commit frees the handle)
    int ret = txn_commit(txn);
    if (ret < 0) {
        txn_abort(txn);
        return ret;
    }

    // The logged blocks reach their final disk locations at checkpoint
    // time. Do NOT touch txn here: txn_commit has already freed it.

    return 0;
}

Phase 4: Recovery Implementation (Days 10-14)

Goals:

  • Detect unclean shutdown
  • Scan journal for complete transactions
  • Replay changes to filesystem

int journal_needs_recovery(struct myfs_state *fs) {
    struct journal_state *j = fs->journal;

    // Check if journal was marked clean
    if (j->jsuper.j_flags & JOURNAL_CLEAN) {
        return 0;
    }

    // Check if there's any content in journal
    if (j->jsuper.j_head == j->jsuper.j_tail) {
        return 0;
    }

    return 1;  // Recovery needed
}

int journal_recover(struct myfs_state *fs) {
    struct journal_state *j = fs->journal;
    struct recovery_state rs = {0};

    printf("Journal recovery starting...\n");
    printf("  Journal head: %u, tail: %u\n",
           j->jsuper.j_head, j->jsuper.j_tail);

    // Allocate recovery state
    rs.capacity = 256;
    rs.complete_txns = malloc(rs.capacity * sizeof(uint64_t));
    rs.num_complete = 0;

    // Phase 1: Scan for complete transactions
    printf("Phase 1: Scanning for complete transactions...\n");

    uint32_t pos = j->jsuper.j_head;
    uint64_t current_txn = 0;
    int in_transaction = 0;

    while (pos < j->jsuper.j_tail) {
        struct journal_entry_header hdr;
        read_journal_at(j, pos, &hdr, sizeof(hdr));

        // Verify entry checksum
        if (!verify_entry_checksum(&hdr)) {
            printf("  Warning: Corrupt entry at position %u, stopping scan\n", pos);
            break;
        }

        // Track the highest sequence seen, to resume numbering afterwards
        if (hdr.je_sequence > rs.max_seq) {
            rs.max_seq = hdr.je_sequence;
        }

        switch (hdr.je_type) {
        case J_BEGIN:
            current_txn = hdr.je_sequence;
            in_transaction = 1;
            printf("  Transaction %lu: BEGIN\n", current_txn);
            break;

        case J_COMMIT:
            if (in_transaction && hdr.je_sequence == current_txn) {
                struct journal_commit commit;
                read_journal_at(j, pos, &commit, sizeof(commit));

                // Verify transaction checksum
                if (verify_txn_checksum(j, current_txn, commit.jc_checksum)) {
                    add_complete_txn(&rs, current_txn);
                    printf("  Transaction %lu: COMPLETE (checksum OK)\n", current_txn);
                } else {
                    printf("  Transaction %lu: CHECKSUM FAILED\n", current_txn);
                }
                in_transaction = 0;
            }
            break;

        case J_DATA:
            // Just skip during scan phase
            break;
        }

        pos += hdr.je_length;
    }

    if (in_transaction) {
        printf("  Transaction %lu: INCOMPLETE (no commit)\n", current_txn);
    }

    // Phase 2: Replay complete transactions
    printf("Phase 2: Replaying %zu complete transactions...\n", rs.num_complete);

    pos = j->jsuper.j_head;
    int blocks_replayed = 0;

    while (pos < j->jsuper.j_tail) {
        struct journal_entry_header hdr;
        read_journal_at(j, pos, &hdr, sizeof(hdr));

        if (hdr.je_type == J_DATA && is_complete(&rs, hdr.je_sequence)) {
            struct journal_data jd;
            read_journal_at(j, pos, &jd, sizeof(jd));

            // Read block data from journal
            uint8_t block_data[BLOCK_SIZE];
            read_journal_at(j, pos + sizeof(jd), block_data, BLOCK_SIZE);

            // Handle escaped blocks
            if (jd.jd_flags & JOURNAL_ESCAPE) {
                unescape_block(block_data);
            }

            printf("  Replaying block %u from transaction %lu\n",
                   jd.jd_block, hdr.je_sequence);
            write_block(fs, jd.jd_block, block_data);
            blocks_replayed++;
        }

        pos += hdr.je_length;
    }

    // Sync replayed blocks
    fsync(fs->disk_fd);

    // Clear journal
    j->jsuper.j_head = j->jsuper.j_tail;
    j->jsuper.j_sequence = rs.max_seq + 1;
    j->jsuper.j_flags = JOURNAL_CLEAN;
    write_journal_superblock(j);

    printf("Recovery complete: %zu transactions, %d blocks replayed\n",
           rs.num_complete, blocks_replayed);

    free(rs.complete_txns);
    return 0;
}
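
The scan phase uses `add_complete_txn()` and `is_complete()`, which are left undefined above. A minimal sketch against a pared-down recovery state (mirroring `struct recovery_state` from the data-structures section):

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Pared-down recovery state for the sketch. */
struct recovery_state_min {
    uint64_t *complete_txns;
    size_t num_complete, capacity;
};

/* Record a transaction whose COMMIT (and checksum) checked out. */
static void add_complete_txn(struct recovery_state_min *rs, uint64_t id) {
    if (rs->num_complete == rs->capacity) {      /* grow on demand */
        rs->capacity = rs->capacity ? rs->capacity * 2 : 16;
        uint64_t *p = realloc(rs->complete_txns,
                              rs->capacity * sizeof *p);
        if (!p) abort();   /* sketch: real code would propagate -ENOMEM */
        rs->complete_txns = p;
    }
    rs->complete_txns[rs->num_complete++] = id;
}

/* Is this transaction safe to replay? Linear scan is fine: the set of
 * transactions in a journal is small. */
static int is_complete(const struct recovery_state_min *rs, uint64_t id) {
    for (size_t i = 0; i < rs->num_complete; i++)
        if (rs->complete_txns[i] == id)
            return 1;
    return 0;
}
```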

Phase 5: Checkpointing (Days 14-17)

Goals:

  • Apply logged changes to final locations
  • Reclaim journal space
  • Handle journal-full condition

int journal_checkpoint(struct journal_state *j) {
    pthread_mutex_lock(&j->lock);

    printf("Checkpointing journal...\n");

    uint32_t pos = j->jsuper.j_head;
    uint32_t checkpoint_pos = pos;
    uint64_t current_txn = 0;
    int in_complete_txn = 0;

    // First pass: identify checkpoint boundary
    while (pos < j->jsuper.j_tail) {
        struct journal_entry_header hdr;
        read_journal_at(j, pos, &hdr, sizeof(hdr));

        switch (hdr.je_type) {
        case J_BEGIN:
            current_txn = hdr.je_sequence;
            in_complete_txn = 0;
            break;

        case J_COMMIT:
            if (hdr.je_sequence == current_txn) {
                in_complete_txn = 1;
            }
            break;
        }

        pos += hdr.je_length;

        if (in_complete_txn && hdr.je_type == J_COMMIT) {
            checkpoint_pos = pos;  // Can checkpoint up to here
        }
    }

    // Second pass: apply all complete transactions
    pos = j->jsuper.j_head;
    int blocks_written = 0;

    while (pos < checkpoint_pos) {
        struct journal_entry_header hdr;
        read_journal_at(j, pos, &hdr, sizeof(hdr));

        if (hdr.je_type == J_DATA) {
            struct journal_data jd;
            read_journal_at(j, pos, &jd, sizeof(jd));

            uint8_t block_data[BLOCK_SIZE];
            read_journal_at(j, pos + sizeof(jd), block_data, BLOCK_SIZE);

            if (jd.jd_flags & JOURNAL_ESCAPE) {
                unescape_block(block_data);
            }

            write_block(fs, jd.jd_block, block_data);
            blocks_written++;
        }

        pos += hdr.je_length;
    }

    // Sync filesystem blocks
    fsync(fs->disk_fd);

    // Advance head pointer
    j->jsuper.j_head = checkpoint_pos;
    j->jsuper.j_checkpoint = j->jsuper.j_sequence;
    write_journal_superblock(j);

    printf("Checkpoint complete: %d blocks written, journal space reclaimed\n",
           blocks_written);

    pthread_mutex_unlock(&j->lock);
    return 0;
}

// Check if journal needs checkpoint
bool journal_needs_checkpoint(struct journal_state *j) {
    uint32_t used = j->jsuper.j_tail - j->jsuper.j_head;
    uint32_t total = j->jsuper.j_size;
    return used > (total * 3 / 4);  // Checkpoint at 75% full
}

// Handle journal full condition
int journal_wait_for_space(struct journal_state *j, size_t needed_blocks) {
    uint32_t available = j->jsuper.j_size -
                        (j->jsuper.j_tail - j->jsuper.j_head);
    if (available >= needed_blocks) {
        return 0;
    }

    // Try to free space by checkpointing
    int ret = journal_checkpoint(j);
    if (ret < 0) {
        return ret;
    }

    // Recheck after the checkpoint advanced the head
    available = j->jsuper.j_size - (j->jsuper.j_tail - j->jsuper.j_head);
    if (available >= needed_blocks) {
        return 0;
    }

    // Still no space: caller must retry once running transactions finish
    return -EAGAIN;
}
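
Returning -EAGAIN pushes the retry to the caller. A blocking variant is also possible; the sketch below uses illustrative field names and assumes checkpointing signals a condition variable after advancing the head, so the wait loop can recompute free space with wrap-aware arithmetic:

```c
#include <pthread.h>
#include <stdint.h>

/* Minimal stand-ins for the journal fields used here (names are
 * illustrative, not from the real implementation). */
struct journal_space {
    uint32_t size, head, tail;
    pthread_mutex_t lock;
    pthread_cond_t space_cond;   /* signaled when head advances */
};

/* Block until at least `needed` blocks are free.
 * Caller must NOT already hold j->lock. */
int journal_block_for_space(struct journal_space *j, uint32_t needed)
{
    pthread_mutex_lock(&j->lock);
    for (;;) {
        uint32_t used = (j->tail >= j->head)
                      ? j->tail - j->head
                      : j->size - j->head + j->tail;
        if (j->size - used >= needed)
            break;
        /* Woken by checkpoint after it advances j->head. */
        pthread_cond_wait(&j->space_cond, &j->lock);
    }
    pthread_mutex_unlock(&j->lock);
    return 0;
}
```

The checkpoint path would call pthread_cond_broadcast on space_cond after moving the head, waking every blocked writer at once.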

Phase 6: Integration and Testing (Days 17-21)

Goals:

  • Connect journaling to all filesystem operations
  • Test crash scenarios
  • Verify consistency after recovery

Crash simulation script:

#!/bin/bash
# test_crash_recovery.sh

DISK=/tmp/crash_test.img
MNT=/tmp/crash_mnt
MYFS=./myfs
MKFS=./mkfs.myfs
FSCK=./fsck.myfs

# Cleanup function
cleanup() {
    fusermount -u $MNT 2>/dev/null
    rm -f $DISK
    rmdir $MNT 2>/dev/null
}
trap cleanup EXIT

echo "=== Crash Recovery Test Suite ==="

# Test 1: Clean mount/unmount
echo -e "\n[Test 1: Clean mount/unmount]"
$MKFS -s 100M -j 32M $DISK
mkdir -p $MNT
$MYFS $DISK $MNT &
FSPID=$!
sleep 1

echo "file 1 content" > $MNT/file1.txt
mkdir $MNT/subdir
echo "file 2 content" > $MNT/subdir/file2.txt

fusermount -u $MNT
wait $FSPID

$MYFS $DISK $MNT &
sleep 1

if [ -f "$MNT/file1.txt" ] && [ -f "$MNT/subdir/file2.txt" ]; then
    echo "PASS: Files persist after clean unmount"
else
    echo "FAIL: Files missing after remount"
    exit 1
fi

fusermount -u $MNT

# Test 2: Crash during file creation
echo -e "\n[Test 2: Crash during file creation]"
$MKFS -s 100M -j 32M $DISK
$MYFS $DISK $MNT &
FSPID=$!
sleep 1

# Start background writes
for i in $(seq 1 100); do
    echo "file $i content" > $MNT/file_$i.txt
done &
WRITEPID=$!

# Kill filesystem mid-write (simulates crash)
sleep 0.5
kill -9 $FSPID
kill -9 $WRITEPID 2>/dev/null

# Remount (triggers recovery)
$MYFS $DISK $MNT > /tmp/recovery.log 2>&1 &
sleep 1
grep -q -i "recovery" /tmp/recovery.log && echo "Recovery triggered"

# Verify consistency
$FSCK $DISK
if [ $? -eq 0 ]; then
    echo "PASS: Filesystem consistent after crash"
else
    echo "FAIL: Filesystem inconsistent"
    exit 1
fi

# Check files are complete or absent (not corrupted)
for f in $MNT/file_*.txt; do
    if [ -f "$f" ]; then
        content=$(cat "$f" 2>/dev/null)
        if [[ ! "$content" =~ ^file\ [0-9]+\ content$ ]]; then
            echo "CORRUPTION: $f has invalid content: $content"
            exit 1
        fi
    fi
done

echo "PASS: All existing files have valid content"

fusermount -u $MNT

# Test 3: Crash during large file write
echo -e "\n[Test 3: Crash during large file write]"
$MKFS -s 100M -j 32M $DISK
$MYFS $DISK $MNT &
FSPID=$!
sleep 1

# Start large file write
dd if=/dev/urandom of=$MNT/largefile bs=1M count=50 2>/dev/null &
DDPID=$!

# Crash mid-write
sleep 2
kill -9 $FSPID
kill -9 $DDPID 2>/dev/null

# Remount
$MYFS $DISK $MNT > /tmp/recovery.log 2>&1 &
sleep 1
cat /tmp/recovery.log

# Verify
$FSCK $DISK
if [ $? -eq 0 ]; then
    echo "PASS: Filesystem consistent after large file crash"
else
    echo "FAIL: Filesystem inconsistent"
    exit 1
fi

fusermount -u $MNT

echo -e "\n=== All crash recovery tests passed ==="

Testing Strategy

Unit Tests

// test_journal.c

#include "journal.h"
#include <assert.h>

// Test journal append and read
void test_journal_io() {
    struct journal_state j;
    journal_init_test(&j, 1024);  // 1024 block journal

    struct journal_begin begin = {
        .header = {
            .je_type = J_BEGIN,
            .je_length = sizeof(begin),
            .je_sequence = 1,
        },
        .jb_num_blocks = 3,
    };

    int ret = journal_append(&j, &begin, sizeof(begin));
    assert(ret == 0);

    struct journal_begin read_begin;
    read_journal_at(&j, 0, &read_begin, sizeof(read_begin));

    assert(read_begin.header.je_type == J_BEGIN);
    assert(read_begin.header.je_sequence == 1);
    assert(read_begin.jb_num_blocks == 3);

    journal_close(&j);
    printf("test_journal_io: PASS\n");
}

// Test transaction lifecycle
void test_txn_commit() {
    struct myfs_state fs;
    init_test_fs(&fs);

    struct transaction *txn = txn_begin(&fs);
    assert(txn != NULL);
    assert(txn->state == TXN_ACTIVE);

    uint8_t data[BLOCK_SIZE];
    memset(data, 0xAB, sizeof(data));
    int ret = txn_log_block(txn, 100, data);
    assert(ret == 0);
    assert(txn->num_blocks == 1);

    ret = txn_commit(txn);
    assert(ret == 0);

    // Verify journal contains the transaction
    struct journal_state *j = fs.journal;
    assert(j->jsuper.j_tail > j->jsuper.j_head);

    cleanup_test_fs(&fs);
    printf("test_txn_commit: PASS\n");
}

// Test transaction abort
void test_txn_abort() {
    struct myfs_state fs;
    init_test_fs(&fs);

    uint32_t initial_tail = fs.journal->jsuper.j_tail;

    struct transaction *txn = txn_begin(&fs);
    uint8_t test_data[BLOCK_SIZE] = {0};
    txn_log_block(txn, 100, test_data);
    txn_abort(txn);

    // Journal should not have grown (aborted transaction)
    // Note: BEGIN was written, but that's harmless without COMMIT
    assert(fs.journal->jsuper.j_tail == initial_tail + sizeof(struct journal_begin));

    cleanup_test_fs(&fs);
    printf("test_txn_abort: PASS\n");
}

// Test recovery of complete transaction
void test_recovery_complete() {
    struct myfs_state fs;
    init_test_fs(&fs);

    // Create and commit a transaction
    struct transaction *txn = txn_begin(&fs);
    uint8_t data[BLOCK_SIZE];
    memset(data, 0x55, sizeof(data));
    txn_log_block(txn, 50, data);
    txn_commit(txn);

    // Simulate crash: don't apply to final location
    // Clear the target block
    uint8_t zeros[BLOCK_SIZE] = {0};
    write_block(&fs, 50, zeros);

    // Mark journal as needing recovery
    fs.journal->jsuper.j_flags &= ~JOURNAL_CLEAN;

    // Recover
    int ret = journal_recover(&fs);
    assert(ret == 0);

    // Verify block 50 has correct content
    uint8_t verify[BLOCK_SIZE];
    read_block(&fs, 50, verify);
    assert(memcmp(verify, data, BLOCK_SIZE) == 0);

    cleanup_test_fs(&fs);
    printf("test_recovery_complete: PASS\n");
}

// Test recovery of incomplete transaction (should be discarded)
void test_recovery_incomplete() {
    struct myfs_state fs;
    init_test_fs(&fs);

    // Write BEGIN and DATA, but no COMMIT
    struct journal_begin begin = {
        .header = { .je_type = J_BEGIN, .je_length = sizeof(begin), .je_sequence = 1 },
    };
    journal_append(fs.journal, &begin, sizeof(begin));

    struct journal_data jd = {
        .header = { .je_type = J_DATA, .je_length = sizeof(jd) + BLOCK_SIZE, .je_sequence = 1 },
        .jd_block = 60,
    };
    journal_append(fs.journal, &jd, sizeof(jd));
    uint8_t data[BLOCK_SIZE];
    memset(data, 0xAA, sizeof(data));
    journal_append(fs.journal, data, BLOCK_SIZE);

    // Mark as needing recovery
    fs.journal->jsuper.j_flags &= ~JOURNAL_CLEAN;

    // Clear target block
    uint8_t zeros[BLOCK_SIZE] = {0};
    write_block(&fs, 60, zeros);

    // Recover
    journal_recover(&fs);

    // Block 60 should NOT have the data (transaction was incomplete)
    uint8_t verify[BLOCK_SIZE];
    read_block(&fs, 60, verify);
    assert(memcmp(verify, zeros, BLOCK_SIZE) == 0);

    cleanup_test_fs(&fs);
    printf("test_recovery_incomplete: PASS\n");
}

// Test checksum verification
void test_checksum_validation() {
    struct myfs_state fs;
    init_test_fs(&fs);

    // Create transaction with known data
    struct transaction *txn = txn_begin(&fs);
    uint8_t data[BLOCK_SIZE];
    memset(data, 0x42, sizeof(data));
    txn_log_block(txn, 70, data);
    txn_commit(txn);

    // Corrupt the data in journal
    uint32_t data_pos = sizeof(struct journal_begin) + sizeof(struct journal_data);
    fs.journal->buffer[data_pos] ^= 0xFF;  // Flip bits

    // Mark as needing recovery
    fs.journal->jsuper.j_flags &= ~JOURNAL_CLEAN;

    // Clear the target block so a wrongful replay would be visible
    uint8_t zeros[BLOCK_SIZE] = {0};
    write_block(&fs, 70, zeros);

    // Recovery should detect the checksum mismatch and skip the txn
    journal_recover(&fs);

    // Block 70 must still be zero (corrupted transaction not replayed)
    uint8_t verify[BLOCK_SIZE];
    read_block(&fs, 70, verify);
    assert(memcmp(verify, zeros, BLOCK_SIZE) == 0);

    cleanup_test_fs(&fs);
    printf("test_checksum_validation: PASS\n");
}

int main() {
    test_journal_io();
    test_txn_commit();
    test_txn_abort();
    test_recovery_complete();
    test_recovery_incomplete();
    test_checksum_validation();

    printf("\nAll journal tests passed!\n");
    return 0;
}
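
test_checksum_validation assumes the COMMIT record carries a checksum over the logged data. A minimal bitwise CRC-32 (IEEE polynomial; journal_crc32 is an illustrative name) that both commit and recovery could call:

```c
#include <stdint.h>
#include <stddef.h>

/* Bitwise CRC-32 (the zlib/IEEE polynomial, reflected form).
 * Table-driven versions are faster, but this is plenty for a
 * journal that checksums a handful of blocks per commit. */
uint32_t journal_crc32(uint32_t crc, const void *buf, size_t len)
{
    const uint8_t *p = buf;
    crc = ~crc;                         /* standard pre-inversion */
    while (len--) {
        crc ^= *p++;
        for (int k = 0; k < 8; k++)
            crc = (crc >> 1) ^ ((crc & 1) ? 0xEDB88320u : 0);
    }
    return ~crc;                        /* standard post-inversion */
}
```

Seed with 0, feed the concatenated block images at commit time, store the result in the COMMIT record; recovery recomputes the same way and compares.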

Crash Point Testing

// Systematic crash point testing

#include <setjmp.h>

static jmp_buf crash_jmp;   // jump target set in test_all_crash_points
static int crash_point = -1;
static int write_count = 0;

int write_block_with_crash(struct myfs_state *fs, uint32_t blk, void *data) {
    write_count++;

    if (crash_point >= 0 && write_count >= crash_point) {
        printf("SIMULATED CRASH at write #%d to block %u\n", write_count, blk);
        longjmp(crash_jmp, 1);  // Simulate crash
    }

    return write_block_real(fs, blk, data);
}

void test_all_crash_points() {
    // For each operation, test crashes at every possible point
    for (int cp = 1; cp <= 20; cp++) {
        printf("\n=== Testing crash at point %d ===\n", cp);

        struct myfs_state fs;
        init_test_fs(&fs);

        crash_point = cp;
        write_count = 0;

        if (setjmp(crash_jmp) == 0) {
            // Normal path
            myfs_create_journaled(&fs, "/test.txt", 0644);
            printf("Operation completed without crash\n");
        } else {
            // Crash occurred
            printf("Crash occurred, attempting recovery\n");

            // Simulate remount
            close_test_fs(&fs);
            init_test_fs(&fs);  // Reload from disk

            // Recovery
            if (journal_needs_recovery(&fs)) {
                journal_recover(&fs);
            }

            // Verify consistency
            int consistent = fsck(&fs);
            assert(consistent == 0);

            // File either exists completely or not at all
            struct stat st;
            int exists = myfs_getattr(&fs, "/test.txt", &st, NULL) == 0;
            if (exists) {
                assert(S_ISREG(st.st_mode));
                printf("File exists and is valid\n");
            } else {
                printf("File does not exist (transaction was incomplete)\n");
            }
        }

        cleanup_test_fs(&fs);
    }

    crash_point = -1;  // Disable crash simulation
}

Common Pitfalls and Debugging

Pitfall 1: Forgetting to Sync Journal

Problem: Transaction appears committed but journal isn’t on disk.

// WRONG
journal_append(j, &commit, sizeof(commit));
return 0;  // Commit record still in buffer!

// CORRECT
journal_append(j, &commit, sizeof(commit));
int ret = fsync(j->fd);  // CRITICAL: Ensure commit record is on disk
if (ret < 0) {
    perror("journal fsync failed");
    return -EIO;
}
return 0;

Detection: Data loss after crashes despite “successful” operations.

Pitfall 2: Non-Idempotent Operations

Problem: Replaying transaction twice causes issues.

// WRONG: Increment is not idempotent
superblock.free_blocks--;  // First replay: 1000 -> 999
                           // Second replay: 999 -> 998  WRONG!

// CORRECT: Set to absolute value (idempotent)
superblock.free_blocks = 999;  // Always results in 999

Key principle: Log the AFTER state, not the delta.

Pitfall 3: Transaction Order in Recovery

Problem: Replaying transactions in wrong order causes inconsistency.

// WRONG: Process entries as encountered
while (pos < tail) {
    if (entry.type == J_DATA) {
        write_block(entry.block, entry.data);  // Wrong order!
    }
}

// CORRECT: Process by transaction ID order
qsort(data_entries, count, sizeof(entry), compare_by_txn_id);
for (int i = 0; i < count; i++) {
    if (is_complete(data_entries[i].txn_id)) {
        write_block(data_entries[i].block, data_entries[i].data);
    }
}

Pitfall 4: Escaping Magic Numbers

Problem: Block data containing journal magic number corrupts recovery.

// WRONG: Direct copy
journal_append(j, block_data, BLOCK_SIZE);

// CORRECT: Escape blocks containing magic
if (contains_journal_magic(block_data)) {
    jd.jd_flags |= JOURNAL_ESCAPE;
    escape_block(block_data);  // XOR or other escape
}
journal_append(j, &jd, sizeof(jd));
journal_append(j, block_data, BLOCK_SIZE);
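
One concrete escape scheme, modeled on what JBD2 does (a sketch; the magic value matches the journal-dump example below): zero the first word before journaling the block, and restore it on replay when JOURNAL_ESCAPE is set.

```c
#include <stdint.h>
#include <string.h>

#define JOURNAL_MAGIC 0x4A484653u  /* "JHFS" */

/* A block "contains" the magic only if its first word matches --
 * that is the only position the recovery scanner interprets as an
 * entry header. */
static int contains_journal_magic(const uint8_t *blk)
{
    uint32_t first;
    memcpy(&first, blk, sizeof(first));
    return first == JOURNAL_MAGIC;
}

/* Zero the first word before writing the block into the journal. */
static void escape_block(uint8_t *blk)
{
    memset(blk, 0, sizeof(uint32_t));
}

/* Restore the magic; only called when JOURNAL_ESCAPE is set, so the
 * original first word is known to have been the magic. */
static void unescape_block(uint8_t *blk)
{
    uint32_t magic = JOURNAL_MAGIC;
    memcpy(blk, &magic, sizeof(magic));
}
```

Escaping is lossless precisely because the flag records which blocks were touched: the pair (zeroed word, JOURNAL_ESCAPE) uniquely encodes "this word was the magic".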

Pitfall 5: Journal Wraparound

Problem: Circular buffer calculations overflow.

// WRONG: Simple subtraction
uint32_t available = j->j_size - j->j_tail + j->j_head;  // Wrong when wrapped

// CORRECT: Handle wrap
uint32_t used = (j->j_tail >= j->j_head) ?
                (j->j_tail - j->j_head) :
                (j->j_size - j->j_head + j->j_tail);
uint32_t available = j->j_size - used;
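
With wraparound, reads can also straddle the end of the journal area. A sketch of a wrap-aware copy (ring_read is an illustrative helper over an in-memory ring; the on-disk version would split the pread the same way):

```c
#include <stdint.h>
#include <string.h>

/* Copy `len` bytes out of a circular buffer starting at logical
 * offset `pos`. When the span crosses the end of the ring, the copy
 * is split into two memcpys. */
static void ring_read(const uint8_t *ring, uint32_t ring_size,
                      uint32_t pos, void *out, uint32_t len)
{
    pos %= ring_size;
    uint32_t first = ring_size - pos;   /* bytes until the wrap point */
    if (first >= len) {
        memcpy(out, ring + pos, len);
    } else {
        memcpy(out, ring + pos, first);
        memcpy((uint8_t *)out + first, ring, len - first);
    }
}
```

The same split applies to writes: one logical journal_append may become two physical writes when the tail wraps.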

Debugging Techniques

1. Journal dump tool:

./journal-dump disk.img
Journal Superblock:
  Magic: 0x4A484653 (JHFS)
  Head: 0, Tail: 4096
  Sequence: 47

Entries:
  [0000] T45 BEGIN (3 blocks expected)
  [0064] T45 DATA: block 5 (inode bitmap)
  [4160] T45 DATA: block 105 (inode)
  [8256] T45 DATA: block 200 (directory)
  [12352] T45 COMMIT (checksum: 0x12345678)
  [12416] T46 BEGIN (2 blocks expected)
  [12480] T46 DATA: block 6 (inode bitmap)
  [16576] INCOMPLETE (no commit for T46)

2. Verbose logging:

#define JLOG(level, fmt, ...) do { \
    if (journal_log_level >= level) \
        fprintf(stderr, "[JOURNAL:%s] " fmt "\n", \
                level_names[level], ##__VA_ARGS__); \
} while(0)

JLOG(DEBUG, "txn %lu: logging block %u", txn->txn_id, block_num);
JLOG(INFO, "txn %lu: commit, %zu blocks", txn->txn_id, txn->num_blocks);
JLOG(WARN, "txn %lu: checksum mismatch", txn->txn_id);

3. Artificial crash injection:

// Environment variable controlled crashes
static void check_crash_injection(const char *location) {
    const char *crash_at = getenv("JOURNAL_CRASH_AT");
    if (crash_at && strcmp(crash_at, location) == 0) {
        fprintf(stderr, "INJECTED CRASH at %s\n", location);
        _exit(1);  // Immediate exit, no cleanup
    }
}

// Usage:
check_crash_injection("after_commit_write");

Extensions and Challenges

Extension 1: Ordered Mode

Write data blocks before journaling metadata:

int txn_commit_ordered(struct transaction *txn) {
    // First: write all data blocks to final locations
    for (struct data_write *dw = txn->data_writes; dw; dw = dw->next) {
        write_block(fs, dw->block, dw->data);
    }
    fsync(fs->disk_fd);  // Data is durable

    // Then: journal metadata only
    for (struct logged_block *lb = txn->metadata; lb; lb = lb->next) {
        journal_log_block(txn, lb);
    }
    return journal_commit(txn);
}

Extension 2: Asynchronous Commit

Return immediately, commit in background:

int txn_commit_async(struct transaction *txn) {
    // Add to commit queue
    pthread_mutex_lock(&commit_queue_lock);
    txn->next = commit_queue;
    commit_queue = txn;
    pthread_cond_signal(&commit_queue_cond);
    pthread_mutex_unlock(&commit_queue_lock);

    return 0;  // Returns immediately
}

// Background commit thread
void* commit_thread(void *arg) {
    while (running) {
        pthread_mutex_lock(&commit_queue_lock);
        while (!commit_queue && running) {
            pthread_cond_wait(&commit_queue_cond, &commit_queue_lock);
        }

        struct transaction *txn = commit_queue;
        if (txn) {              // queue can be empty if woken for shutdown
            commit_queue = txn->next;
        }
        pthread_mutex_unlock(&commit_queue_lock);

        if (txn) {
            do_sync_commit(txn);
        }
    }
    return NULL;
}

Extension 3: Group Commit

Batch multiple transactions into one disk write:

void commit_group() {
    pthread_mutex_lock(&group_lock);

    // Wait for batch to fill or timeout
    struct timespec deadline;
    clock_gettime(CLOCK_REALTIME, &deadline);
    deadline.tv_nsec += 5000000;  // 5ms timeout
    if (deadline.tv_nsec >= 1000000000L) {  // keep tv_nsec normalized
        deadline.tv_sec++;
        deadline.tv_nsec -= 1000000000L;
    }

    while (pending_count < MAX_BATCH) {
        int ret = pthread_cond_timedwait(&group_cond, &group_lock, &deadline);
        if (ret == ETIMEDOUT) break;
    }

    // Gather all pending transactions
    struct transaction *batch[MAX_BATCH];
    int count = collect_pending(batch, MAX_BATCH);

    pthread_mutex_unlock(&group_lock);

    // Write all to journal
    for (int i = 0; i < count; i++) {
        write_txn_to_journal(batch[i]);
    }

    // Single fsync for all
    fsync(journal_fd);

    // Signal all waiters
    for (int i = 0; i < count; i++) {
        signal_commit_complete(batch[i]);
    }
}

Extension 4: Intent Logging (REDO-only)

Instead of logging full blocks, log operations:

struct journal_intent {
    uint32_t ji_type;       // INTENT_SET_BIT, INTENT_CLEAR_BIT, etc.
    uint32_t ji_block;      // Target block
    uint32_t ji_offset;     // Offset within block
    uint32_t ji_old_value;  // For UNDO (optional)
    uint32_t ji_new_value;  // For REDO
};

// Much smaller journal entries, but recovery more complex
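
Replay for intent records is just "write the absolute after-value", which keeps it idempotent. A sketch of applying one intent to a block image (the struct is repeated so the example is self-contained; INTENT_SET_U32 is an illustrative type):

```c
#include <stdint.h>
#include <string.h>

#define BLOCK_SIZE 4096

enum { INTENT_SET_U32 };  /* real designs define a richer set */

struct journal_intent {
    uint32_t ji_type;       /* INTENT_SET_U32, ... */
    uint32_t ji_block;      /* target block */
    uint32_t ji_offset;     /* offset within block */
    uint32_t ji_old_value;  /* for UNDO (optional) */
    uint32_t ji_new_value;  /* for REDO */
};

/* REDO-only replay: unconditionally store the new value. Replaying
 * twice yields the same block -- the record holds the absolute
 * after-state, not a delta. */
static void apply_intent(uint8_t *block, const struct journal_intent *in)
{
    if (in->ji_type == INTENT_SET_U32 &&
        in->ji_offset + sizeof(uint32_t) <= BLOCK_SIZE) {
        memcpy(block + in->ji_offset, &in->ji_new_value, sizeof(uint32_t));
    }
}
```

Recovery would read the target block, apply every intent from complete transactions in order, and write it back; ji_old_value enables rollback if you later add UNDO.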

Extension 5: Compound Transactions

Support nested/compound operations:

txn_begin();          // Level 1
  create_file();
  txn_begin();        // Level 2 (nested)
    write_data();
  txn_commit();       // Level 2 commits to parent
txn_commit();         // Level 1 commits all
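
A simple way to get these semantics, similar in spirit to how JBD2 attaches nested handles to one running transaction, is to flatten nesting into the outermost transaction with a depth counter (a sketch with illustrative names):

```c
#include <assert.h>

/* Nested begins just bump a depth counter; only the outermost
 * commit performs the real journal commit. */
struct nested_txn {
    int depth;       /* outstanding txn_begin calls */
    int committed;   /* set once the real commit has run */
};

void ntxn_begin(struct nested_txn *t)
{
    t->depth++;
}

int ntxn_commit(struct nested_txn *t)
{
    assert(t->depth > 0);
    if (--t->depth == 0) {
        /* Outermost level: this is where journal_commit() would run. */
        t->committed = 1;
    }
    return 0;
}
```

The inner commit at level 2 therefore only merges into the parent; nothing hits the journal until level 1 commits, so a crash before that point discards the whole compound operation atomically.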

Real-World Connections

ext3/ext4 Journaling

Your implementation mirrors ext3/ext4 JBD2 (Journaling Block Device):

ext4 Journal Architecture:
┌─────────────────────────────────────────────────────────────┐
│                        ext4 Core                             │
│  (create, unlink, write, mkdir, rename, etc.)               │
└────────────────────────────┬────────────────────────────────┘
                             │
                             ▼
┌─────────────────────────────────────────────────────────────┐
│                     JBD2 Layer                               │
│  (journal_start, journal_get_write_access,                  │
│   journal_dirty_metadata, journal_stop)                      │
└────────────────────────────┬────────────────────────────────┘
                             │
                             ▼
┌─────────────────────────────────────────────────────────────┐
│                  Journal Device                              │
│  (internal file, inode 8, or an external device)            │
└─────────────────────────────────────────────────────────────┘

Key similarities:

  • Circular journal buffer
  • BEGIN/COMMIT transaction markers
  • Checksums for integrity
  • Checkpoint to reclaim space
  • Recovery on mount

View ext4 journal:

# See journal status
sudo dumpe2fs /dev/sda1 | grep -i journal

# Detailed journal info
sudo debugfs -R "stat <8>" /dev/sda1  # Inode 8 is journal

PostgreSQL WAL

PostgreSQL uses identical WAL principles:

PostgreSQL WAL:
┌─────────────────────────────────────────────────────────────┐
│                   Transaction                                │
│              INSERT INTO users ...                          │
└────────────────────────────┬────────────────────────────────┘
                             │
                             ▼
┌─────────────────────────────────────────────────────────────┐
│                    WAL Writer                                │
│  XLogInsert() → WAL Buffer → XLogFlush() → WAL Segment     │
└────────────────────────────┬────────────────────────────────┘
                             │
                             ▼
┌─────────────────────────────────────────────────────────────┐
│              WAL Segment Files (pg_wal/)                     │
│  000000010000000000000001, 000000010000000000000002, ...    │
└─────────────────────────────────────────────────────────────┘

View PostgreSQL WAL:

# List WAL files
ls -la /var/lib/postgresql/14/main/pg_wal/

# Decode WAL records
pg_waldump -p /var/lib/postgresql/14/main/pg_wal/ 000000010000000000000001

MySQL/InnoDB Redo Log

InnoDB uses similar double-write + redo log:

// InnoDB simplified flow
mtr_start(&mtr);              // Mini-transaction start
btr_insert(index, entry);     // B-tree modification
mtr_commit(&mtr);             // Write to redo log + flush

The pattern repeats: All serious storage systems use WAL for durability.


Interview Questions

Conceptual

  1. “What problem does journaling solve?”
    • Expected: Crash consistency. Guarantees filesystem is always consistent, even after crashes. Avoids lengthy fsck scans.
  2. “Explain write-ahead logging.”
    • Expected: Log changes before applying. On crash, replay committed entries. Key: commit record makes transaction durable.
  3. “What’s the difference between metadata and full data journaling?”
    • Expected: Metadata-only logs structure changes. Full logs file content too. Trade-off: speed vs data safety. Ordered mode is middle ground.
  4. “Why is journaling faster than fsck?”
    • Expected: Journal is small (MB) vs entire filesystem (TB). Recovery scans journal only. O(journal size) vs O(filesystem size).
  5. “How does this relate to database transactions?”
    • Expected: Same ACID principles. WAL is used by PostgreSQL, MySQL. Filesystem transaction = database transaction conceptually.

Implementation

  1. “What makes an operation idempotent?”
    • Expected: Same result when applied multiple times. Log absolute state, not deltas. Important because recovery might replay.
  2. “How do you know a transaction is complete?”
    • Expected: COMMIT record present with valid checksum. No COMMIT = discard transaction.
  3. “What is checkpointing?”
    • Expected: Writing journal contents to final locations to reclaim journal space. Periodic, or when journal fills.
  4. “What happens if the commit record itself is corrupted?”
    • Expected: Transaction treated as incomplete (no valid commit). This is safe - we err on the side of discarding.
  5. “How do you handle journal wraparound?”
    • Expected: Circular buffer with head/tail pointers. Checkpoint before overwriting. Handle wrap in arithmetic.

Design

  1. “How would you implement group commit?”
    • Expected: Batch multiple transactions, single fsync. Latency tradeoff for throughput.
  2. “What are the tradeoffs of different journaling modes?”
    • Expected: Full = safest, slowest (2x write amplification). Ordered = good balance. Writeback = fastest, risk of stale data.

Resources

Books

  • “Operating Systems: Three Easy Pieces” Ch. 42 - Crash Consistency and Journaling (primary)
  • “Database Internals” by Alex Petrov - Ch. 3-5 - Transaction Processing (excellent for ARIES)
  • “Designing Data-Intensive Applications” by Kleppmann - Ch. 7 - Transactions
  • “Understanding the Linux Kernel” Ch. 18 - ext3/ext4 journaling implementation

Papers

  • “Journaling the Linux ext2fs Filesystem” - Stephen Tweedie (1998) - Original ext3 design
  • “Analysis and Evolution of Journaling File Systems” - USENIX ATC 2005
  • “ARIES: A Transaction Recovery Method” - Mohan et al. (1992) - Database recovery gold standard

Online

Videos

  • “Journaling in Filesystems” - MIT 6.828 Operating Systems Engineering
  • “Database Crash Recovery” - CMU 15-445 Database Systems

Self-Assessment Checklist

Core Functionality

  • Journal is created during mkfs with configurable size
  • Journal superblock is correctly formatted and readable
  • Transactions can be started (BEGIN record written)
  • Blocks can be logged within a transaction
  • Transactions can be committed (COMMIT record with checksum)
  • Journal sync (fsync) occurs before commit returns

Recovery

  • Unclean shutdown is detected on mount
  • Complete transactions are identified (have COMMIT)
  • Incomplete transactions are identified (no COMMIT)
  • Complete transactions are replayed correctly
  • Incomplete transactions are discarded
  • Checksums are verified during recovery
  • Journal is cleared after successful recovery

Integration

  • All metadata operations use transactions
  • File create wraps inode+bitmap+directory updates
  • File delete wraps inode+bitmap+directory updates
  • Directory operations are transactional
  • Nested/compound operations work correctly

Robustness

  • Filesystem is consistent after simulated crashes at any point
  • Checkpointing reclaims journal space
  • Journal-full condition is handled gracefully
  • Performance overhead is reasonable (< 20%)

Submission/Completion Criteria

To consider this project complete, you must demonstrate:

1. Working Implementation

# Create filesystem with journal
./mkfs.myfs -s 100M -j 32M disk.img

# Mount and use normally
./myfs disk.img /mnt/myfs
echo "test" > /mnt/myfs/file.txt
mkdir /mnt/myfs/subdir
cp /etc/passwd /mnt/myfs/subdir/

# Clean unmount
fusermount -u /mnt/myfs

# Remount - files persist
./myfs disk.img /mnt/myfs
cat /mnt/myfs/file.txt  # Shows "test"

2. Crash Recovery Demonstration

# Start writes, then crash
./myfs disk.img /mnt/myfs &
PID=$!
dd if=/dev/urandom of=/mnt/myfs/bigfile bs=1M count=10 &
sleep 0.5
kill -9 $PID

# Remount - recovery happens
./myfs disk.img /mnt/myfs
# Output should show:
# "Journal recovery: replaying N transactions..."
# "Recovery complete. Filesystem consistent."

3. fsck Passes

./fsck.myfs disk.img
# Output: "Filesystem consistent. 0 errors."

4. Test Suite Passes

make test
# All journal unit tests pass
# All crash recovery tests pass

5. Documentation

  • README explaining journal design choices
  • Comments in code explaining recovery algorithm
  • Performance measurements (overhead of journaling)

Bonus Achievements

  • Implement ordered mode (data before metadata)
  • Implement async commit with background thread
  • Implement group commit for better throughput
  • Add journal dump tool for debugging
  • Performance comparison: journaled vs non-journaled