Project 4: Journaling Layer
Project Overview
| Attribute | Details |
|---|---|
| Difficulty | Master |
| Time Estimate | 2-3 weeks |
| Language | C |
| Knowledge Area | Crash Consistency / Databases |
| Main Book | “Database Internals” by Alex Petrov |
| Prerequisites | Project 3 completed |
What You’ll Build
Add write-ahead logging (journaling) to your block-based filesystem. Before any metadata change, log the operation so crashes can be recovered.
Real World Output:
$ ./myfs disk.img /mnt/myfs
$ dd if=/dev/urandom of=/mnt/myfs/bigfile bs=1M count=50 &
# Kill power mid-write (or kill -9 the process)
$ ./myfs disk.img /mnt/myfs
Journal recovery: replaying 3 transactions...
Transaction 1: inode 15 allocation - COMMITTED, replaying
Transaction 2: block 1024-1087 allocation - COMMITTED, replaying
Transaction 3: inode 15 size update - INCOMPLETE, discarding
Recovery complete. Filesystem consistent.
$ ls /mnt/myfs # No corruption!
Learning Objectives
By completing this project, you will:
- Understand the crash consistency problem and why filesystems need journaling
- Implement write-ahead logging (WAL) — the same technique used by databases
- Design transaction semantics for atomic filesystem operations
- Build a journal recovery system that replays committed transactions
- Handle checkpointing to reclaim journal space
- Balance performance and durability tradeoffs
- Connect filesystem concepts to database internals — ACID properties in action
The Core Question You’re Answering
“How do you guarantee filesystem consistency when power can fail at any millisecond during a write operation?”
Before journaling, crashes during writes left filesystems corrupted. fsck would scan the entire disk to find and fix inconsistencies—taking hours on large drives.
In 2001, ext3 added one feature to ext2: journaling. That single addition made ext3 mount almost instantly after crashes. You’re about to implement that same insight—write-ahead logging. It’s the same technique PostgreSQL, MySQL, and every serious database uses.
The Database Connection:
This project bridges filesystems and databases. The WAL (Write-Ahead Logging) protocol you’ll implement is identical to what PostgreSQL uses. The ARIES recovery algorithm (Analysis, Redo, Undo) that databases use is a more sophisticated version of what you’ll build here. Understanding this prepares you for database internals.
Deep Theoretical Foundation
1. The Crash Consistency Problem
Consider creating a file /newfile.txt:
Operations required:
1. Allocate inode (update inode bitmap)
2. Initialize inode (write inode structure)
3. Allocate data block (update block bitmap)
4. Write file data (write data block)
5. Update directory (add entry to parent)
6. Update superblock (decrement free counts)
What if power fails between operations?
| Crash After | Result |
|---|---|
| Step 1 | Inode allocated but uninitialized - garbage metadata |
| Step 2 | Inode exists but no directory entry - orphaned inode |
| Step 3 | Block allocated but unused - space leak |
| Step 4 | Data written but inode not updated - data lost |
| Step 5 | Directory points to partial inode - potential corruption |
The fundamental issue: Disk writes are not atomic across multiple blocks. You cannot guarantee all updates complete before a crash.
The Crash Window Problem:
Time ────────────────────────────────────────────────────────────►
┌─────────┐ ┌─────────┐ ┌─────────┐ ┌─────────┐
│ Write │ │ Write │ │ Write │ │ Write │
│ Bitmap │ │ Inode │ │ DirEnt │ │ Data │
└────┬────┘ └────┬────┘ └────┬────┘ └────┬────┘
│ │ │ │
▼ ▼ ▼ ▼
←────────────── CRASH WINDOW ──────────────────►
Power can fail ANYWHERE in this window.
Partial updates = inconsistent filesystem.
2. ACID Properties (Database Connection)
Journaling provides ACID guarantees — the same properties databases ensure:
| Property | Database Meaning | Filesystem Meaning |
|---|---|---|
| Atomicity | Transaction all-or-nothing | File create either happens completely or not at all |
| Consistency | DB constraints preserved | Filesystem invariants maintained (bitmaps match usage) |
| Isolation | Concurrent transactions don’t interfere | Multiple file operations don’t corrupt each other |
| Durability | Committed data survives crash | Once fsync returns, data is on disk |
The key insight: A filesystem operation (create, delete, rename) is a transaction that must be atomic.
3. Solution Approaches Compared
Approach 1: Write Order + fsck (ext2)
Carefully order writes to limit damage:
- Write data blocks first
- Then write metadata
- On crash, run fsck to scan and fix
Problems:
- fsck scans the entire disk (slow for large filesystems)
- Some inconsistencies can’t be fixed
- User must wait for fsck to complete
ext2 Recovery Time:
Filesystem Size fsck Time
────────────────────────────
10 GB ~1 minute
100 GB ~10 minutes
1 TB ~1-2 hours
10 TB ~10-20 hours ← Unacceptable!
Approach 2: Journaling (ext3/ext4)
Write operations twice:
- First, log the changes to a journal
- Then, apply changes to their final locations
On crash:
- Scan small journal (fast)
- Replay committed transactions
- Discard incomplete transactions
Key insight: The journal is small, so recovery is fast. Only committed transactions are replayed, so consistency is guaranteed.
Journaling Recovery:
Journal Size: 128 MB (fixed, regardless of FS size)
Recovery Time: ~1-5 seconds (always)
10 TB filesystem crashes at 3 AM?
Mount in 5 seconds, not 20 hours.
Approach 3: Copy-on-Write (ZFS/Btrfs)
Never overwrite data. Always write to new locations, then update pointer atomically. (This is Project 6.)
4. Write-Ahead Logging (WAL) Principles
The fundamental rules of WAL:
- Log before apply: Write the log entry before making the actual change
- Commit before success: Don’t report success until the commit record is durable
- Idempotent operations: Replaying a transaction multiple times = same result
- Recovery replays committed: On mount, replay all committed transactions
WAL Protocol Visualization:
Normal write flow:
┌─────────────────────────────────────────────────────────────────┐
│ 1. Application calls write() │
│ 2. Filesystem prepares changes │
│ 3. BEGIN transaction written to journal │
│ 4. All changes logged to journal │
│ 5. COMMIT written to journal │
│ 6. fsync() journal — changes are now durable │
│ 7. Apply changes to actual locations (checkpoint) │
│ 8. Return success to application │
└─────────────────────────────────────────────────────────────────┘
Recovery flow:
┌─────────────────────────────────────────────────────────────────┐
│ 1. Mount filesystem │
│ 2. Scan journal from last checkpoint │
│ 3. For each transaction: │
│ - If COMMIT found: replay all operations │
│ - If no COMMIT: discard (incomplete transaction) │
│ 4. Mark journal as empty │
│ 5. Continue normal operation │
└─────────────────────────────────────────────────────────────────┘
5. The WAL Guarantee
Theorem: If a transaction has a COMMIT record in the durable journal, all its changes can be recovered.
Proof:
- COMMIT is written after all data entries
- COMMIT is fsync’d to disk before returning success
- If COMMIT exists, all data entries must exist (WAL write order)
- Recovery reads all data entries and applies them
- Therefore: COMMIT → Recoverable
6. Journal Structure
A journal is typically a fixed-size circular buffer on disk:
┌──────────────────────────────────────────────────────────────────┐
│ JOURNAL │
├─────────────┬─────────────────────────────────────┬──────────────┤
│ Journal │ Journal Entries │ Free │
│ Superblock │ [T1-Begin][Data][Commit][T2-Begin]..│ Space │
│ │ ↑ ↑ │ │
│ │ head tail │ │
└─────────────┴─────────────────────────────────────┴──────────────┘
Journal wraps around when it reaches the end (circular buffer):
┌─────────────────────────────────────────────────────────────────┐
│ ...old data...][T5-data][T5-COMMIT][T6-BEGIN... | ...old... │
│ ↑ ↑ │
│ tail head │
└─────────────────────────────────────────────────────────────────┘
Journal superblock:
struct journal_superblock {
uint32_t j_magic; // Magic number (0x4A484653 "JHFS")
uint32_t j_version; // Journal format version
uint32_t j_start; // First block of journal
uint32_t j_size; // Size in blocks
uint32_t j_head; // Oldest valid transaction
uint32_t j_tail; // Next write position
uint64_t j_sequence; // Next transaction ID
uint32_t j_checkpoint; // Last checkpointed transaction
uint32_t j_flags; // Status flags
uint8_t j_uuid[16]; // Journal UUID
};
Journal entries:
#define J_BEGIN 1
#define J_COMMIT 2
#define J_DATA 3
#define J_REVOKE 4 // For recovery optimization
struct journal_entry_header {
uint32_t je_type; // Entry type
uint32_t je_length; // Entry length (including header)
uint64_t je_sequence; // Transaction ID
uint32_t je_checksum; // CRC32 of entry
};
struct journal_begin {
struct journal_entry_header header;
uint32_t jb_num_blocks; // Expected blocks in transaction
uint64_t jb_timestamp; // Transaction start time
};
struct journal_data {
struct journal_entry_header header;
uint32_t jd_block; // Target block number
uint32_t jd_flags; // Block flags
// Followed by block data (BLOCK_SIZE bytes)
};
struct journal_commit {
struct journal_entry_header header;
uint64_t jc_sequence; // Transaction ID (redundant for verification)
uint32_t jc_checksum; // Checksum of entire transaction
uint64_t jc_timestamp; // Commit time
};
7. Transaction Semantics
A transaction groups related filesystem operations:
// Creating a file is one transaction:
transaction_begin();
{
alloc_inode(); // Logged
init_inode(); // Logged
alloc_blocks(); // Logged
write_data(); // Maybe logged (depends on mode)
add_dir_entry(); // Logged
update_superblock(); // Logged
}
transaction_commit();
Atomicity guarantee: Either ALL changes commit, or NONE do.
Visualization of a Transaction:
Transaction T47: Create file "/documents/report.txt"
Journal Contents:
┌─────────────────────────────────────────────────────────────────┐
│ T47-BEGIN (seq=47, expecting 4 blocks) │
├─────────────────────────────────────────────────────────────────┤
│ T47-DATA: Block 5 (inode bitmap, set bit 23) │
├─────────────────────────────────────────────────────────────────┤
│ T47-DATA: Block 105 (inode 23 structure) │
├─────────────────────────────────────────────────────────────────┤
│ T47-DATA: Block 200 (directory /documents, add entry) │
├─────────────────────────────────────────────────────────────────┤
│ T47-DATA: Block 0 (superblock, decrement free inode count) │
├─────────────────────────────────────────────────────────────────┤
│ T47-COMMIT (seq=47, checksum=0x12345678) │
└─────────────────────────────────────────────────────────────────┘
8. Journaling Modes
| Mode | What’s Journaled | Speed | Safety |
|---|---|---|---|
| Journal | Metadata + Data | Slow | Highest |
| Ordered | Metadata only, data written first | Medium | Good |
| Writeback | Metadata only | Fast | Lower |
Journal mode: Everything is logged. File content AND metadata. Slowest but safest.
Ordered mode (ext4 default): Only metadata is journaled. But data blocks are written before the metadata journal entry commits. This ensures you never see stale data.
Writeback mode: Only metadata journaled. Data might not be flushed before commit. Fastest, but stale data possible after crash.
Journaling Modes Comparison:
Full Data Journaling:
write("hello") → [Journal: data + metadata] → [Disk: data + metadata]
Write amplification: 2x (data written twice)
Ordered Mode:
write("hello") → [Disk: data] → [Journal: metadata] → [Disk: metadata]
Write amplification: 1x for data, 2x for metadata
Guarantee: Data on disk before metadata commit
Writeback Mode:
write("hello") → [Journal: metadata] → [Disk: data + metadata] (any order)
Write amplification: 2x for metadata only
Risk: Stale/garbage data visible after crash
9. Checkpointing
The journal has limited size. To reclaim space, checkpoint transactions:
- Flush all logged changes to their final disk locations
- Verify writes completed (fsync)
- Advance journal head pointer
- Space between old head and new head is now reusable
Before checkpoint:
head tail
↓ ↓
[T1-complete][T2-complete][T3-complete][T4-in-progress][...]
↑_________________________________________↑
Can be checkpointed
After checkpoint:
new head tail
↓ ↓
[recyclable space ][T4-in-progress][T5-new][...]
Checkpoint Process:
1. Identify complete transactions (T1, T2, T3)
2. Write T1, T2, T3 data to final disk locations
3. fsync() to ensure durability
4. Update journal superblock (move head past T3)
5. Journal space before new head is now reusable
10. Recovery Algorithm
void journal_recover(struct journal_state *j) {
uint64_t pos = j->superblock.j_head;
uint64_t current_txn = 0;
bool has_commit = false;
// First pass: identify complete transactions
while (pos < j->superblock.j_tail) {
struct journal_entry_header hdr;
read_journal(j, pos, &hdr, sizeof(hdr));
switch (hdr.je_type) {
case J_BEGIN:
current_txn = hdr.je_sequence;
has_commit = false;
break;
case J_COMMIT:
if (hdr.je_sequence == current_txn) {
has_commit = true;
// Mark transaction as complete
}
break;
}
pos += entry_size(&hdr);
}
// Second pass: replay complete transactions
pos = j->superblock.j_head;
while (pos < j->superblock.j_tail) {
struct journal_entry_header hdr;
read_journal(j, pos, &hdr, sizeof(hdr));
        if (hdr.je_type == J_DATA && transaction_complete(hdr.je_sequence)) {
            struct journal_data *jd = read_data_entry(j, pos);
            // The block image immediately follows the journal_data header
            // in the journal; fs is the owning filesystem state, reachable
            // from j via a backpointer
            write_block(fs, jd->jd_block, (uint8_t *)(jd + 1));
}
pos += entry_size(&hdr);
}
// Clear journal
j->superblock.j_head = j->superblock.j_tail;
write_journal_superblock(j);
}
Project Specification
Functional Requirements
- Journal area on disk
- Reserve blocks for journal
- Configurable size (32MB default)
- Circular buffer implementation
- Transaction support
- Begin/commit transaction API
- Log metadata changes
- Optional data journaling mode
- Recovery on mount
- Detect unclean shutdown
- Scan and replay committed transactions
- Discard incomplete transactions
- Checkpointing
- Periodic flush of journal to final locations
- Reclaim journal space
- Handle journal full condition
- Integration with filesystem
- Wrap all metadata operations in transactions
- Handle nested operations correctly
Non-Functional Requirements
- Recovery completes in < 5 seconds
- Journal overhead < 20% on normal operations
- Support concurrent transactions
- Graceful degradation when journal is full
Solution Architecture
Component Design
┌─────────────────────────────────────────────────────────────────┐
│ Filesystem Operations │
│ (create, unlink, write, mkdir, etc.) │
└────────────────────────────┬────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────┐
│ Transaction Manager │
│ (begin, log, commit, abort) │
└────────────────────────────┬────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────┐
│ Journal Writer │
│ (append entries, manage space) │
└────────────────────────────┬────────────────────────────────────┘
│
┌──────────────────┼──────────────────┐
▼ ▼ ▼
┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐
│ Journal Buffer │ │ Checkpointer │ │ Recovery Engine │
│ (in-memory) │ │ (background) │ │ (on mount) │
└────────┬────────┘ └────────┬────────┘ └────────┬────────┘
│ │ │
└───────────────────┼───────────────────┘
▼
┌─────────────────────────────────────────────────────────────────┐
│ Journal Blocks (on disk) │
│ [JSuper][Entry][Entry][Entry]...[unused]...[oldest entries] │
└─────────────────────────────────────────────────────────────────┘
Data Structures
// Journal state (in-memory)
struct journal_state {
struct journal_superblock jsuper;
uint8_t *buffer; // Write buffer
size_t buffer_pos; // Current position in buffer
size_t buffer_size; // Buffer size (typically one block)
uint64_t current_txn; // Current transaction ID
int txn_depth; // Nesting level (for compound ops)
pthread_mutex_t lock; // Concurrency control
int dirty; // Journal superblock needs writing
    int disk_fd;                  // File descriptor for journal
    struct myfs_state *fs;        // Backpointer to owning filesystem
};
// Transaction handle
struct transaction {
uint64_t txn_id;
struct journal_state *journal;
struct logged_block *blocks; // List of blocks to log
size_t num_blocks;
    int state;                    // TXN_ACTIVE, TXN_COMMITTING, TXN_COMMITTED, TXN_ABORTED
struct transaction *next; // For transaction list
};
// Logged block info
struct logged_block {
uint32_t block_num; // Target block number
uint8_t data[BLOCK_SIZE]; // Block content (copy)
uint32_t flags; // JOURNAL_METADATA, JOURNAL_DATA, etc.
struct logged_block *next;
};
// Recovery state
struct recovery_state {
uint64_t *complete_txns; // Array of complete transaction IDs
size_t num_complete;
size_t capacity;
uint64_t max_seq; // Highest sequence seen
};
// Transaction states
#define TXN_ACTIVE 1
#define TXN_COMMITTING 2
#define TXN_COMMITTED 3
#define TXN_ABORTED 4
// Block flags
#define JOURNAL_METADATA 0x01
#define JOURNAL_DATA 0x02
#define JOURNAL_ESCAPE 0x04 // Block contains journal magic
Key Functions
// Journal initialization
int journal_init(struct myfs_state *fs, uint32_t start_block, uint32_t size);
int journal_load(struct myfs_state *fs);
int journal_close(struct myfs_state *fs);
// Transaction API
struct transaction* txn_begin(struct myfs_state *fs);
int txn_log_block(struct transaction *txn, uint32_t block_num, void *data);
int txn_commit(struct transaction *txn);
void txn_abort(struct transaction *txn);
// Journal operations
int journal_append(struct journal_state *j, void *entry, size_t len);
int journal_sync(struct journal_state *j);
int journal_checkpoint(struct journal_state *j);
int journal_needs_checkpoint(struct journal_state *j);
// Recovery
int journal_recover(struct myfs_state *fs);
int journal_needs_recovery(struct myfs_state *fs);
int journal_mark_clean(struct myfs_state *fs);
Implementation Guide
Phase 1: Journal Layout (Days 1-3)
Goals:
- Reserve journal blocks in mkfs
- Define journal superblock format
- Initialize journal on format
Steps:
- Modify mkfs to allocate journal blocks
- Define journal superblock structure
- Write initial journal superblock
- Update filesystem superblock to point to journal
// In mkfs: allocate journal
uint32_t journal_blocks = 8192; // 32MB with 4KB blocks
uint32_t journal_start = sb.s_first_data;
sb.s_journal_start = journal_start;
sb.s_journal_size = journal_blocks;
sb.s_first_data += journal_blocks;
// Initialize journal superblock
struct journal_superblock jsb = {
.j_magic = JOURNAL_MAGIC,
.j_version = 1,
.j_start = journal_start + 1, // After journal superblock
.j_size = journal_blocks - 1,
.j_head = 0,
.j_tail = 0,
.j_sequence = 1,
.j_checkpoint = 0,
.j_flags = JOURNAL_CLEAN,
};
// Generate UUID for journal
generate_uuid(jsb.j_uuid);
write_block(fd, journal_start, &jsb);
Verification:
# After mkfs, verify journal exists
./mkfs.myfs -s 100M -j 32M disk.img
hexdump -C disk.img | grep "SFHJ" # Magic 0x4A484653 is stored little-endian, so the bytes read "SFHJ"
Phase 2: Transaction Begin/Commit (Days 3-6)
Goals:
- Implement transaction lifecycle
- Write BEGIN and COMMIT records
- Ensure durability with fsync
struct transaction* txn_begin(struct myfs_state *fs) {
struct journal_state *j = fs->journal;
    // Reclaim space first if needed. journal_checkpoint() acquires
    // j->lock itself, so call it before taking the lock here to avoid
    // self-deadlock on a non-recursive mutex.
    if (journal_space_low(j)) {
        journal_checkpoint(j);
    }
    pthread_mutex_lock(&j->lock);
// Start new transaction
struct transaction *txn = calloc(1, sizeof(*txn));
txn->txn_id = j->jsuper.j_sequence++;
txn->journal = j;
txn->state = TXN_ACTIVE;
// Write BEGIN record
struct journal_begin begin = {
.header = {
.je_type = J_BEGIN,
.je_length = sizeof(begin),
.je_sequence = txn->txn_id,
.je_checksum = 0, // Filled after
},
.jb_num_blocks = 0, // Updated at commit
.jb_timestamp = time(NULL),
};
begin.header.je_checksum = crc32(&begin, sizeof(begin));
journal_append(j, &begin, sizeof(begin));
pthread_mutex_unlock(&j->lock);
return txn;
}
int txn_log_block(struct transaction *txn, uint32_t block_num, void *data) {
if (txn->state != TXN_ACTIVE) {
return -EINVAL;
}
// Allocate logged block
struct logged_block *lb = malloc(sizeof(*lb));
lb->block_num = block_num;
memcpy(lb->data, data, BLOCK_SIZE);
lb->flags = JOURNAL_METADATA;
// Add to transaction's block list
lb->next = txn->blocks;
txn->blocks = lb;
txn->num_blocks++;
return 0;
}
int txn_commit(struct transaction *txn) {
struct journal_state *j = txn->journal;
pthread_mutex_lock(&j->lock);
txn->state = TXN_COMMITTING;
// Write all logged blocks to journal
struct logged_block *blk = txn->blocks;
uint32_t txn_checksum = 0;
while (blk) {
struct journal_data jd = {
.header = {
.je_type = J_DATA,
.je_length = sizeof(jd) + BLOCK_SIZE,
.je_sequence = txn->txn_id,
},
.jd_block = blk->block_num,
.jd_flags = blk->flags,
};
        // Handle blocks that start with the journal magic: escape the
        // journal copy so the recovery scan can't mistake it for an
        // entry header; replay restores the original bytes
        if (needs_escape(blk->data)) {
            jd.jd_flags |= JOURNAL_ESCAPE;
            escape_block(blk->data);
        }
jd.header.je_checksum = crc32(&jd, sizeof(jd));
journal_append(j, &jd, sizeof(jd));
journal_append(j, blk->data, BLOCK_SIZE);
txn_checksum = crc32_combine(txn_checksum, blk->data, BLOCK_SIZE);
blk = blk->next;
}
// Write COMMIT record
struct journal_commit commit = {
.header = {
.je_type = J_COMMIT,
.je_length = sizeof(commit),
.je_sequence = txn->txn_id,
},
.jc_sequence = txn->txn_id,
.jc_checksum = txn_checksum,
.jc_timestamp = time(NULL),
};
commit.header.je_checksum = crc32(&commit, sizeof(commit));
journal_append(j, &commit, sizeof(commit));
// CRITICAL: Sync journal to disk before returning
int ret = journal_sync(j);
if (ret < 0) {
pthread_mutex_unlock(&j->lock);
return ret;
}
txn->state = TXN_COMMITTED;
pthread_mutex_unlock(&j->lock);
// Free transaction resources
free_logged_blocks(txn->blocks);
free(txn);
return 0;
}
void txn_abort(struct transaction *txn) {
txn->state = TXN_ABORTED;
free_logged_blocks(txn->blocks);
free(txn);
}
Phase 3: Logging Filesystem Operations (Days 6-10)
Goals:
- Wrap metadata updates in transactions
- Log block changes before applying
- Handle inode, bitmap, and directory updates
int myfs_create_journaled(struct myfs_state *fs, const char *path, mode_t mode) {
struct transaction *txn = txn_begin(fs);
if (!txn) {
return -ENOMEM;
}
// Allocate inode (update bitmap)
int ino = find_free_inode(fs);
if (ino < 0) {
txn_abort(txn);
return -ENOSPC;
}
// Read and modify inode bitmap
uint8_t bitmap_block[BLOCK_SIZE];
uint32_t bitmap_blk = fs->sb.s_inode_bitmap + ino / (BLOCK_SIZE * 8);
read_block(fs, bitmap_blk, bitmap_block);
BITMAP_SET(bitmap_block, ino % (BLOCK_SIZE * 8));
txn_log_block(txn, bitmap_blk, bitmap_block);
// Initialize inode
struct myfs_inode inode = {
.i_mode = S_IFREG | mode,
.i_uid = getuid(),
.i_gid = getgid(),
.i_links_count = 1,
.i_size = 0,
.i_ctime = time(NULL),
.i_mtime = time(NULL),
.i_atime = time(NULL),
};
uint8_t inode_block[BLOCK_SIZE];
uint32_t inode_blk = inode_to_block(fs, ino);
read_block(fs, inode_blk, inode_block);
write_inode_to_block(inode_block, ino, &inode);
txn_log_block(txn, inode_blk, inode_block);
// Add directory entry
uint32_t parent_ino = path_to_parent_ino(fs, path);
const char *filename = basename(path);
uint8_t dir_block[BLOCK_SIZE];
uint32_t dir_blk = get_dir_data_block(fs, parent_ino);
read_block(fs, dir_blk, dir_block);
add_dir_entry(dir_block, ino, filename, DT_REG);
txn_log_block(txn, dir_blk, dir_block);
// Update superblock (decrement free inode count)
uint8_t sb_block[BLOCK_SIZE];
read_block(fs, 0, sb_block);
struct myfs_superblock *sb = (struct myfs_superblock *)sb_block;
sb->s_free_inodes--;
txn_log_block(txn, 0, sb_block);
// Commit transaction
int ret = txn_commit(txn);
if (ret < 0) {
txn_abort(txn);
return ret;
}
    // Changes reach their final locations at checkpoint time. Note that
    // txn_commit() freed txn, so it must not be dereferenced here; to
    // apply eagerly instead, do it inside txn_commit() before the free.
return 0;
}
Phase 4: Recovery Implementation (Days 10-14)
Goals:
- Detect unclean shutdown
- Scan journal for complete transactions
- Replay changes to filesystem
int journal_needs_recovery(struct myfs_state *fs) {
struct journal_state *j = fs->journal;
// Check if journal was marked clean
if (j->jsuper.j_flags & JOURNAL_CLEAN) {
return 0;
}
// Check if there's any content in journal
if (j->jsuper.j_head == j->jsuper.j_tail) {
return 0;
}
return 1; // Recovery needed
}
int journal_recover(struct myfs_state *fs) {
struct journal_state *j = fs->journal;
struct recovery_state rs = {0};
printf("Journal recovery starting...\n");
printf(" Journal head: %u, tail: %u\n",
j->jsuper.j_head, j->jsuper.j_tail);
// Allocate recovery state
rs.capacity = 256;
rs.complete_txns = malloc(rs.capacity * sizeof(uint64_t));
rs.num_complete = 0;
// Phase 1: Scan for complete transactions
printf("Phase 1: Scanning for complete transactions...\n");
uint32_t pos = j->jsuper.j_head;
uint64_t current_txn = 0;
int in_transaction = 0;
while (pos < j->jsuper.j_tail) {
struct journal_entry_header hdr;
read_journal_at(j, pos, &hdr, sizeof(hdr));
// Verify entry checksum
if (!verify_entry_checksum(&hdr)) {
printf(" Warning: Corrupt entry at position %u, stopping scan\n", pos);
break;
}
switch (hdr.je_type) {
case J_BEGIN:
current_txn = hdr.je_sequence;
in_transaction = 1;
printf(" Transaction %lu: BEGIN\n", current_txn);
break;
case J_COMMIT:
if (in_transaction && hdr.je_sequence == current_txn) {
struct journal_commit commit;
read_journal_at(j, pos, &commit, sizeof(commit));
// Verify transaction checksum
if (verify_txn_checksum(j, current_txn, commit.jc_checksum)) {
add_complete_txn(&rs, current_txn);
printf(" Transaction %lu: COMPLETE (checksum OK)\n", current_txn);
} else {
printf(" Transaction %lu: CHECKSUM FAILED\n", current_txn);
}
in_transaction = 0;
}
break;
case J_DATA:
// Just skip during scan phase
break;
}
pos += hdr.je_length;
}
if (in_transaction) {
printf(" Transaction %lu: INCOMPLETE (no commit)\n", current_txn);
}
// Phase 2: Replay complete transactions
printf("Phase 2: Replaying %zu complete transactions...\n", rs.num_complete);
pos = j->jsuper.j_head;
int blocks_replayed = 0;
while (pos < j->jsuper.j_tail) {
struct journal_entry_header hdr;
read_journal_at(j, pos, &hdr, sizeof(hdr));
if (hdr.je_type == J_DATA && is_complete(&rs, hdr.je_sequence)) {
struct journal_data jd;
read_journal_at(j, pos, &jd, sizeof(jd));
// Read block data from journal
uint8_t block_data[BLOCK_SIZE];
read_journal_at(j, pos + sizeof(jd), block_data, BLOCK_SIZE);
// Handle escaped blocks
if (jd.jd_flags & JOURNAL_ESCAPE) {
unescape_block(block_data);
}
printf(" Replaying block %u from transaction %lu\n",
jd.jd_block, hdr.je_sequence);
write_block(fs, jd.jd_block, block_data);
blocks_replayed++;
}
pos += hdr.je_length;
}
// Sync replayed blocks
fsync(fs->disk_fd);
// Clear journal
j->jsuper.j_head = j->jsuper.j_tail;
j->jsuper.j_sequence = rs.max_seq + 1;
j->jsuper.j_flags = JOURNAL_CLEAN;
write_journal_superblock(j);
printf("Recovery complete: %zu transactions, %d blocks replayed\n",
rs.num_complete, blocks_replayed);
free(rs.complete_txns);
return 0;
}
Phase 5: Checkpointing (Days 14-17)
Goals:
- Apply logged changes to final locations
- Reclaim journal space
- Handle journal-full condition
int journal_checkpoint(struct journal_state *j) {
pthread_mutex_lock(&j->lock);
printf("Checkpointing journal...\n");
    // Note: fs below is the owning filesystem state, assumed reachable
    // from j (e.g. via a backpointer stored at journal_load time)
    uint32_t pos = j->jsuper.j_head;
uint32_t checkpoint_pos = pos;
uint64_t current_txn = 0;
int in_complete_txn = 0;
// First pass: identify checkpoint boundary
while (pos < j->jsuper.j_tail) {
struct journal_entry_header hdr;
read_journal_at(j, pos, &hdr, sizeof(hdr));
switch (hdr.je_type) {
case J_BEGIN:
current_txn = hdr.je_sequence;
in_complete_txn = 0;
break;
case J_COMMIT:
if (hdr.je_sequence == current_txn) {
in_complete_txn = 1;
}
break;
}
pos += hdr.je_length;
if (in_complete_txn && hdr.je_type == J_COMMIT) {
checkpoint_pos = pos; // Can checkpoint up to here
}
}
// Second pass: apply all complete transactions
pos = j->jsuper.j_head;
int blocks_written = 0;
while (pos < checkpoint_pos) {
struct journal_entry_header hdr;
read_journal_at(j, pos, &hdr, sizeof(hdr));
if (hdr.je_type == J_DATA) {
struct journal_data jd;
read_journal_at(j, pos, &jd, sizeof(jd));
uint8_t block_data[BLOCK_SIZE];
read_journal_at(j, pos + sizeof(jd), block_data, BLOCK_SIZE);
if (jd.jd_flags & JOURNAL_ESCAPE) {
unescape_block(block_data);
}
write_block(fs, jd.jd_block, block_data);
blocks_written++;
}
pos += hdr.je_length;
}
// Sync filesystem blocks
fsync(fs->disk_fd);
// Advance head pointer
j->jsuper.j_head = checkpoint_pos;
j->jsuper.j_checkpoint = j->jsuper.j_sequence;
write_journal_superblock(j);
printf("Checkpoint complete: %d blocks written, journal space reclaimed\n",
blocks_written);
pthread_mutex_unlock(&j->lock);
return 0;
}
// Check if journal needs checkpoint
int journal_needs_checkpoint(struct journal_state *j) {
uint32_t used = j->jsuper.j_tail - j->jsuper.j_head;
uint32_t total = j->jsuper.j_size;
return used > (total * 3 / 4); // Checkpoint at 75% full
}
// Handle journal full condition
int journal_wait_for_space(struct journal_state *j, size_t needed_blocks) {
    uint32_t available = j->jsuper.j_size -
                         (j->jsuper.j_tail - j->jsuper.j_head);
    if (available >= needed_blocks) {
        return 0;
    }
    // Try to reclaim space by checkpointing completed transactions
    int ret = journal_checkpoint(j);
    if (ret < 0) {
        return ret;
    }
    // Recheck after the checkpoint
    available = j->jsuper.j_size - (j->jsuper.j_tail - j->jsuper.j_head);
    if (available >= needed_blocks) {
        return 0;
    }
    // Still no space: active transactions must commit before more of
    // the journal can be checkpointed, so report back instead of spinning
    return -EAGAIN;
}
Phase 6: Integration and Testing (Days 17-21)
Goals:
- Connect journaling to all filesystem operations
- Test crash scenarios
- Verify consistency after recovery
Crash simulation script:
#!/bin/bash
# test_crash_recovery.sh
DISK=/tmp/crash_test.img
MNT=/tmp/crash_mnt
MYFS=./myfs
MKFS=./mkfs.myfs
FSCK=./fsck.myfs
# Cleanup function
cleanup() {
fusermount -u $MNT 2>/dev/null
rm -f $DISK
rmdir $MNT 2>/dev/null
}
trap cleanup EXIT
echo "=== Crash Recovery Test Suite ==="
# Test 1: Clean mount/unmount
echo -e "\n[Test 1: Clean mount/unmount]"
$MKFS -s 100M -j 32M $DISK
mkdir -p $MNT
$MYFS $DISK $MNT &
FSPID=$!
sleep 1
echo "file 1 content" > $MNT/file1.txt
mkdir $MNT/subdir
echo "file 2 content" > $MNT/subdir/file2.txt
fusermount -u $MNT
wait $FSPID
$MYFS $DISK $MNT
sleep 1
if [ -f "$MNT/file1.txt" ] && [ -f "$MNT/subdir/file2.txt" ]; then
echo "PASS: Files persist after clean unmount"
else
echo "FAIL: Files missing after remount"
exit 1
fi
fusermount -u $MNT
# Test 2: Crash during file creation
echo -e "\n[Test 2: Crash during file creation]"
$MKFS -s 100M -j 32M $DISK
$MYFS $DISK $MNT &
FSPID=$!
sleep 1
# Start background writes
for i in $(seq 1 100); do
echo "file $i content" > $MNT/file_$i.txt
done &
WRITEPID=$!
# Kill filesystem mid-write (simulates crash)
sleep 0.5
kill -9 $FSPID
kill -9 $WRITEPID 2>/dev/null
# Remount (triggers recovery)
$MYFS $DISK $MNT 2>&1 | grep -q "recovery" && echo "Recovery triggered"
sleep 1
# Verify consistency
$FSCK $DISK
if [ $? -eq 0 ]; then
echo "PASS: Filesystem consistent after crash"
else
echo "FAIL: Filesystem inconsistent"
exit 1
fi
# Check files are complete or absent (not corrupted)
for f in $MNT/file_*.txt; do
if [ -f "$f" ]; then
content=$(cat "$f" 2>/dev/null)
if [[ ! "$content" =~ ^file\ [0-9]+\ content$ ]]; then
echo "CORRUPTION: $f has invalid content: $content"
exit 1
fi
fi
done
echo "PASS: All existing files have valid content"
fusermount -u $MNT
# Test 3: Crash during large file write
echo -e "\n[Test 3: Crash during large file write]"
$MKFS -s 100M -j 32M $DISK
$MYFS $DISK $MNT &
FSPID=$!
sleep 1
# Start large file write
dd if=/dev/urandom of=$MNT/largefile bs=1M count=50 2>/dev/null &
DDPID=$!
# Crash mid-write
sleep 2
kill -9 $FSPID
kill -9 $DDPID 2>/dev/null
# Remount
$MYFS $DISK $MNT 2>&1 | tee /tmp/recovery.log
sleep 1
# Verify
$FSCK $DISK
if [ $? -eq 0 ]; then
echo "PASS: Filesystem consistent after large file crash"
else
echo "FAIL: Filesystem inconsistent"
exit 1
fi
fusermount -u $MNT
echo -e "\n=== All crash recovery tests passed ==="
Testing Strategy
Unit Tests
// test_journal.c
#include "journal.h"
#include <assert.h>
// Test journal append and read
void test_journal_io() {
struct journal_state j;
journal_init_test(&j, 1024); // 1024 block journal
struct journal_begin begin = {
.header = {
.je_type = J_BEGIN,
.je_length = sizeof(begin),
.je_sequence = 1,
},
.jb_num_blocks = 3,
};
int ret = journal_append(&j, &begin, sizeof(begin));
assert(ret == 0);
struct journal_begin read_begin;
read_journal_at(&j, 0, &read_begin, sizeof(read_begin));
assert(read_begin.header.je_type == J_BEGIN);
assert(read_begin.header.je_sequence == 1);
assert(read_begin.jb_num_blocks == 3);
journal_close(&j);
printf("test_journal_io: PASS\n");
}
// Test transaction lifecycle
void test_txn_commit() {
struct myfs_state fs;
init_test_fs(&fs);
struct transaction *txn = txn_begin(&fs);
assert(txn != NULL);
assert(txn->state == TXN_ACTIVE);
uint8_t data[BLOCK_SIZE];
memset(data, 0xAB, sizeof(data));
int ret = txn_log_block(txn, 100, data);
assert(ret == 0);
assert(txn->num_blocks == 1);
ret = txn_commit(txn);
assert(ret == 0);
// Verify journal contains the transaction
struct journal_state *j = fs.journal;
assert(j->jsuper.j_tail > j->jsuper.j_head);
cleanup_test_fs(&fs);
printf("test_txn_commit: PASS\n");
}
// Test transaction abort
void test_txn_abort() {
struct myfs_state fs;
init_test_fs(&fs);
uint32_t initial_tail = fs.journal->jsuper.j_tail;
struct transaction *txn = txn_begin(&fs);
uint8_t test_data[BLOCK_SIZE];
memset(test_data, 0xCD, sizeof(test_data));
txn_log_block(txn, 100, test_data);
txn_abort(txn);
// Only the BEGIN record reached the journal; without a COMMIT it is
// harmless and recovery will discard it
assert(fs.journal->jsuper.j_tail == initial_tail + sizeof(struct journal_begin));
cleanup_test_fs(&fs);
printf("test_txn_abort: PASS\n");
}
// Test recovery of complete transaction
void test_recovery_complete() {
struct myfs_state fs;
init_test_fs(&fs);
// Create and commit a transaction
struct transaction *txn = txn_begin(&fs);
uint8_t data[BLOCK_SIZE];
memset(data, 0x55, sizeof(data));
txn_log_block(txn, 50, data);
txn_commit(txn);
// Simulate crash: don't apply to final location
// Clear the target block
uint8_t zeros[BLOCK_SIZE] = {0};
write_block(&fs, 50, zeros);
// Mark journal as needing recovery
fs.journal->jsuper.j_flags &= ~JOURNAL_CLEAN;
// Recover
int ret = journal_recover(&fs);
assert(ret == 0);
// Verify block 50 has correct content
uint8_t verify[BLOCK_SIZE];
read_block(&fs, 50, verify);
assert(memcmp(verify, data, BLOCK_SIZE) == 0);
cleanup_test_fs(&fs);
printf("test_recovery_complete: PASS\n");
}
// Test recovery of incomplete transaction (should be discarded)
void test_recovery_incomplete() {
struct myfs_state fs;
init_test_fs(&fs);
// Write BEGIN and DATA, but no COMMIT
struct journal_begin begin = {
.header = { .je_type = J_BEGIN, .je_length = sizeof(begin), .je_sequence = 1 },
};
journal_append(fs.journal, &begin, sizeof(begin));
struct journal_data jd = {
.header = { .je_type = J_DATA, .je_length = sizeof(jd) + BLOCK_SIZE, .je_sequence = 1 },
.jd_block = 60,
};
journal_append(fs.journal, &jd, sizeof(jd));
uint8_t data[BLOCK_SIZE];
memset(data, 0xAA, sizeof(data));
journal_append(fs.journal, data, BLOCK_SIZE);
// Mark as needing recovery
fs.journal->jsuper.j_flags &= ~JOURNAL_CLEAN;
// Clear target block
uint8_t zeros[BLOCK_SIZE] = {0};
write_block(&fs, 60, zeros);
// Recover
journal_recover(&fs);
// Block 60 should NOT have the data (transaction was incomplete)
uint8_t verify[BLOCK_SIZE];
read_block(&fs, 60, verify);
assert(memcmp(verify, zeros, BLOCK_SIZE) == 0);
cleanup_test_fs(&fs);
printf("test_recovery_incomplete: PASS\n");
}
// Test checksum verification
void test_checksum_validation() {
struct myfs_state fs;
init_test_fs(&fs);
// Create transaction with known data
struct transaction *txn = txn_begin(&fs);
uint8_t data[BLOCK_SIZE];
memset(data, 0x42, sizeof(data));
txn_log_block(txn, 70, data);
txn_commit(txn);
// Corrupt the data in journal
uint32_t data_pos = sizeof(struct journal_begin) + sizeof(struct journal_data);
fs.journal->buffer[data_pos] ^= 0xFF; // Flip bits
// Clear the target block so an (incorrect) replay would be detectable
uint8_t zeros[BLOCK_SIZE] = {0};
write_block(&fs, 70, zeros);
// Mark as needing recovery
fs.journal->jsuper.j_flags &= ~JOURNAL_CLEAN;
// Recovery should detect the corruption and skip the transaction
int ret = journal_recover(&fs);
assert(ret == 0);
// Block 70 must still be zeros: the corrupted transaction was not replayed
uint8_t verify[BLOCK_SIZE];
read_block(&fs, 70, verify);
assert(memcmp(verify, zeros, BLOCK_SIZE) == 0);
cleanup_test_fs(&fs);
printf("test_checksum_validation: PASS\n");
}
int main() {
test_journal_io();
test_txn_commit();
test_txn_abort();
test_recovery_complete();
test_recovery_incomplete();
test_checksum_validation();
printf("\nAll journal tests passed!\n");
return 0;
}
Crash Point Testing
// Systematic crash point testing
#include <setjmp.h>
static jmp_buf crash_jmp;      // Jump target for simulated crashes
static int crash_point = -1;   // Crash at the Nth block write (-1 = disabled)
static int write_count = 0;
int write_block_with_crash(struct myfs_state *fs, uint32_t blk, void *data) {
write_count++;
if (crash_point >= 0 && write_count >= crash_point) {
printf("SIMULATED CRASH at write #%d to block %u\n", write_count, blk);
longjmp(crash_jmp, 1); // Simulate crash
}
return write_block_real(fs, blk, data);
}
void test_all_crash_points() {
// For each operation, test crashes at every possible point
for (int cp = 1; cp <= 20; cp++) {
printf("\n=== Testing crash at point %d ===\n", cp);
struct myfs_state fs;
init_test_fs(&fs);
crash_point = cp;
write_count = 0;
if (setjmp(crash_jmp) == 0) {
// Normal path
myfs_create_journaled(&fs, "/test.txt", 0644);
printf("Operation completed without crash\n");
} else {
// Crash occurred
printf("Crash occurred, attempting recovery\n");
// Simulate remount
close_test_fs(&fs);
init_test_fs(&fs); // Reload from disk
// Recovery
if (journal_needs_recovery(&fs)) {
journal_recover(&fs);
}
// Verify consistency
int consistent = fsck(&fs);
assert(consistent == 0);
// File either exists completely or not at all
struct stat st;
int exists = myfs_getattr(&fs, "/test.txt", &st, NULL) == 0;
if (exists) {
assert(S_ISREG(st.st_mode));
printf("File exists and is valid\n");
} else {
printf("File does not exist (transaction was incomplete)\n");
}
}
cleanup_test_fs(&fs);
}
crash_point = -1; // Disable crash simulation
}
Common Pitfalls and Debugging
Pitfall 1: Forgetting to Sync Journal
Problem: Transaction appears committed but journal isn’t on disk.
// WRONG
journal_append(j, &commit, sizeof(commit));
return 0; // Commit record still in buffer!
// CORRECT
journal_append(j, &commit, sizeof(commit));
int ret = fsync(j->fd); // CRITICAL: Ensure commit record is on disk
if (ret < 0) {
perror("journal fsync failed");
return -EIO;
}
return 0;
Detection: Data loss after crashes despite “successful” operations.
Pitfall 2: Non-Idempotent Operations
Problem: Replaying transaction twice causes issues.
// WRONG: Increment is not idempotent
superblock.free_blocks--; // First replay: 1000 -> 999
// Second replay: 999 -> 998 WRONG!
// CORRECT: Set to absolute value (idempotent)
superblock.free_blocks = 999; // Always results in 999
Key principle: Log the AFTER state, not the delta.
Pitfall 3: Transaction Order in Recovery
Problem: Replaying transactions in wrong order causes inconsistency.
// WRONG: Process entries as encountered
while (pos < tail) {
if (entry.type == J_DATA) {
write_block(entry.block, entry.data); // Wrong order!
}
}
// CORRECT: Process by transaction ID order
qsort(data_entries, count, sizeof(entry), compare_by_txn_id);
for (int i = 0; i < count; i++) {
if (is_complete(data_entries[i].txn_id)) {
write_block(data_entries[i].block, data_entries[i].data);
}
}
Pitfall 4: Escaping Magic Numbers
Problem: Block data containing journal magic number corrupts recovery.
// WRONG: Direct copy
journal_append(j, block_data, BLOCK_SIZE);
// CORRECT: Escape blocks containing magic
if (contains_journal_magic(block_data)) {
jd.jd_flags |= JOURNAL_ESCAPE;
escape_block(block_data); // XOR or other escape
}
journal_append(j, &jd, sizeof(jd));
journal_append(j, block_data, BLOCK_SIZE);
Pitfall 5: Journal Wraparound
Problem: Circular buffer calculations overflow.
// WRONG: Simple subtraction
uint32_t available = j->j_size - j->j_tail + j->j_head; // Wrong when wrapped
// CORRECT: Handle wrap
uint32_t used = (j->j_tail >= j->j_head) ?
(j->j_tail - j->j_head) :
(j->j_size - j->j_head + j->j_tail);
uint32_t available = j->j_size - used;
Debugging Techniques
1. Journal dump tool:
./journal-dump disk.img
Journal Superblock:
Magic: 0x4A484653 (JHFS)
Head: 0, Tail: 4096
Sequence: 47
Entries:
[0000] T45 BEGIN (3 blocks expected)
[0064] T45 DATA: block 5 (inode bitmap)
[4160] T45 DATA: block 105 (inode)
[8256] T45 DATA: block 200 (directory)
[12352] T45 COMMIT (checksum: 0x12345678)
[12416] T46 BEGIN (2 blocks expected)
[12480] T46 DATA: block 6 (inode bitmap)
[16576] INCOMPLETE (no commit for T46)
2. Verbose logging:
#define JLOG(level, fmt, ...) do { \
if (journal_log_level >= level) \
fprintf(stderr, "[JOURNAL:%s] " fmt "\n", \
level_names[level], ##__VA_ARGS__); \
} while(0)
JLOG(DEBUG, "txn %lu: logging block %u", txn->txn_id, block_num);
JLOG(INFO, "txn %lu: commit, %zu blocks", txn->txn_id, txn->num_blocks);
JLOG(WARN, "txn %lu: checksum mismatch", txn->txn_id);
3. Artificial crash injection:
// Environment variable controlled crashes
static void check_crash_injection(const char *location) {
const char *crash_at = getenv("JOURNAL_CRASH_AT");
if (crash_at && strcmp(crash_at, location) == 0) {
fprintf(stderr, "INJECTED CRASH at %s\n", location);
_exit(1); // Immediate exit, no cleanup
}
}
// Usage:
check_crash_injection("after_commit_write");
Extensions and Challenges
Extension 1: Ordered Mode
Write data blocks before journaling metadata:
int txn_commit_ordered(struct transaction *txn) {
struct myfs_state *fs = txn->fs; // Owning filesystem (field name illustrative)
// First: write all data blocks to their final locations
for (struct data_write *dw = txn->data_writes; dw; dw = dw->next) {
write_block(fs, dw->block, dw->data);
}
fsync(fs->disk_fd); // Data is durable before any metadata is journaled
// Then: journal metadata only
for (struct logged_block *lb = txn->metadata; lb; lb = lb->next) {
journal_log_block(txn, lb);
}
return journal_commit(txn);
}
Extension 2: Asynchronous Commit
Return immediately, commit in background:
int txn_commit_async(struct transaction *txn) {
// Add to commit queue
pthread_mutex_lock(&commit_queue_lock);
txn->next = commit_queue;
commit_queue = txn;
pthread_cond_signal(&commit_queue_cond);
pthread_mutex_unlock(&commit_queue_lock);
return 0; // Returns immediately
}
// Background commit thread
void* commit_thread(void *arg) {
while (running) {
pthread_mutex_lock(&commit_queue_lock);
while (!commit_queue && running) {
pthread_cond_wait(&commit_queue_cond, &commit_queue_lock);
}
struct transaction *txn = commit_queue;
commit_queue = txn->next;
pthread_mutex_unlock(&commit_queue_lock);
if (txn) {
do_sync_commit(txn);
}
}
return NULL;
}
Extension 3: Group Commit
Batch multiple transactions into one disk write:
void commit_group() {
pthread_mutex_lock(&group_lock);
// Wait for batch to fill or timeout
struct timespec deadline;
clock_gettime(CLOCK_REALTIME, &deadline);
deadline.tv_nsec += 5000000; // 5ms timeout
if (deadline.tv_nsec >= 1000000000L) { // Normalize: tv_nsec must stay below 1s
deadline.tv_sec++;
deadline.tv_nsec -= 1000000000L;
}
while (pending_count < MAX_BATCH) {
int ret = pthread_cond_timedwait(&group_cond, &group_lock, &deadline);
if (ret == ETIMEDOUT) break;
}
// Gather all pending transactions
struct transaction *batch[MAX_BATCH];
int count = collect_pending(batch, MAX_BATCH);
pthread_mutex_unlock(&group_lock);
// Write all to journal
for (int i = 0; i < count; i++) {
write_txn_to_journal(batch[i]);
}
// Single fsync for all
fsync(journal_fd);
// Signal all waiters
for (int i = 0; i < count; i++) {
signal_commit_complete(batch[i]);
}
}
Extension 4: Intent Logging (REDO-only)
Instead of logging full blocks, log operations:
struct journal_intent {
uint32_t ji_type; // INTENT_SET_BIT, INTENT_CLEAR_BIT, etc.
uint32_t ji_block; // Target block
uint32_t ji_offset; // Offset within block
uint32_t ji_old_value; // For UNDO (optional)
uint32_t ji_new_value; // For REDO
};
// Much smaller journal entries, but recovery more complex
Extension 5: Compound Transactions
Support nested/compound operations:
txn_begin(); // Level 1
create_file();
txn_begin(); // Level 2 (nested)
write_data();
txn_commit(); // Level 2 commits to parent
txn_commit(); // Level 1 commits all
Real-World Connections
ext3/ext4 Journaling
Your implementation mirrors ext3/ext4 JBD2 (Journaling Block Device):
ext4 Journal Architecture:
┌─────────────────────────────────────────────────────────────┐
│ ext4 Core │
│ (create, unlink, write, mkdir, rename, etc.) │
└────────────────────────────┬────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────┐
│ JBD2 Layer │
│ (journal_start, journal_get_write_access, │
│ journal_dirty_metadata, journal_stop) │
└────────────────────────────┬────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────┐
│ Journal Device │
│ (typically at /dev/sda1 or internal to filesystem) │
└─────────────────────────────────────────────────────────────┘
Key similarities:
- Circular journal buffer
- BEGIN/COMMIT transaction markers
- Checksums for integrity
- Checkpoint to reclaim space
- Recovery on mount
View ext4 journal:
# See journal status
sudo dumpe2fs /dev/sda1 | grep -i journal
# Detailed journal info
sudo debugfs -R "stat <8>" /dev/sda1 # Inode 8 is journal
PostgreSQL WAL
PostgreSQL uses identical WAL principles:
PostgreSQL WAL:
┌─────────────────────────────────────────────────────────────┐
│ Transaction │
│ INSERT INTO users ... │
└────────────────────────────┬────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────┐
│ WAL Writer │
│ XLogInsert() → WAL Buffer → XLogFlush() → WAL Segment │
└────────────────────────────┬────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────┐
│ WAL Segment Files (pg_wal/) │
│ 000000010000000000000001, 000000010000000000000002, ... │
└─────────────────────────────────────────────────────────────┘
View PostgreSQL WAL:
# List WAL files
ls -la /var/lib/postgresql/14/main/pg_wal/
# Decode WAL records
pg_waldump -p /var/lib/postgresql/14/main/pg_wal/ 000000010000000000000001
MySQL/InnoDB Redo Log
InnoDB uses similar double-write + redo log:
// InnoDB simplified flow
mtr_start(&mtr); // Mini-transaction start
btr_insert(index, entry); // B-tree modification
mtr_commit(&mtr); // Write to redo log + flush
The pattern repeats: All serious storage systems use WAL for durability.
Interview Questions
Conceptual
- “What problem does journaling solve?”
- Expected: Crash consistency. Guarantees filesystem is always consistent, even after crashes. Avoids lengthy fsck scans.
- “Explain write-ahead logging.”
- Expected: Log changes before applying. On crash, replay committed entries. Key: commit record makes transaction durable.
- “What’s the difference between metadata and full data journaling?”
- Expected: Metadata-only logs structure changes. Full logs file content too. Trade-off: speed vs data safety. Ordered mode is middle ground.
- “Why is journaling faster than fsck?”
- Expected: Journal is small (MB) vs entire filesystem (TB). Recovery scans journal only. O(journal size) vs O(filesystem size).
- “How does this relate to database transactions?”
- Expected: Same ACID principles. WAL is used by PostgreSQL, MySQL. Filesystem transaction = database transaction conceptually.
Implementation
- “What makes an operation idempotent?”
- Expected: Same result when applied multiple times. Log absolute state, not deltas. Important because recovery might replay.
- “How do you know a transaction is complete?”
- Expected: COMMIT record present with valid checksum. No COMMIT = discard transaction.
- “What is checkpointing?”
- Expected: Writing journal contents to final locations to reclaim journal space. Periodic, or when journal fills.
- “What happens if the commit record itself is corrupted?”
- Expected: Transaction treated as incomplete (no valid commit). This is safe - we err on the side of discarding.
- “How do you handle journal wraparound?”
- Expected: Circular buffer with head/tail pointers. Checkpoint before overwriting. Handle wrap in arithmetic.
Design
- “How would you implement group commit?”
- Expected: Batch multiple transactions, single fsync. Latency tradeoff for throughput.
- “What are the tradeoffs of different journaling modes?”
- Expected: Full = safest, slowest (2x write amplification). Ordered = good balance. Writeback = fastest, risk of stale data.
Resources
Books
- “Operating Systems: Three Easy Pieces” Ch. 42 - Crash Consistency and Journaling (primary)
- “Database Internals” by Alex Petrov - Ch. 3-5 - Transaction Processing (excellent for ARIES)
- “Designing Data-Intensive Applications” by Kleppmann - Ch. 7 - Transactions
- “Understanding the Linux Kernel” Ch. 18 - ext3/ext4 journaling implementation
Papers
- “Journaling the Linux ext2fs Filesystem” - Stephen Tweedie (1998) - Original ext3 design
- “Analysis and Evolution of Journaling File Systems” - USENIX ATC 2005
- “ARIES: A Transaction Recovery Method” - Mohan et al. (1992) - Database recovery gold standard
Online
Videos
- “Journaling in Filesystems” - MIT 6.828 Operating Systems Engineering
- “Database Crash Recovery” - CMU 15-445 Database Systems
Self-Assessment Checklist
Core Functionality
- Journal is created during mkfs with configurable size
- Journal superblock is correctly formatted and readable
- Transactions can be started (BEGIN record written)
- Blocks can be logged within a transaction
- Transactions can be committed (COMMIT record with checksum)
- Journal sync (fsync) occurs before commit returns
Recovery
- Unclean shutdown is detected on mount
- Complete transactions are identified (have COMMIT)
- Incomplete transactions are identified (no COMMIT)
- Complete transactions are replayed correctly
- Incomplete transactions are discarded
- Checksums are verified during recovery
- Journal is cleared after successful recovery
Integration
- All metadata operations use transactions
- File create wraps inode+bitmap+directory updates
- File delete wraps inode+bitmap+directory updates
- Directory operations are transactional
- Nested/compound operations work correctly
Robustness
- Filesystem is consistent after simulated crashes at any point
- Checkpointing reclaims journal space
- Journal-full condition is handled gracefully
- Performance overhead is reasonable (< 20%)
Submission/Completion Criteria
To consider this project complete, you must demonstrate:
1. Working Implementation
# Create filesystem with journal
./mkfs.myfs -s 100M -j 32M disk.img
# Mount and use normally
./myfs disk.img /mnt/myfs
echo "test" > /mnt/myfs/file.txt
mkdir /mnt/myfs/subdir
cp /etc/passwd /mnt/myfs/subdir/
# Clean unmount
fusermount -u /mnt/myfs
# Remount - files persist
./myfs disk.img /mnt/myfs
cat /mnt/myfs/file.txt # Shows "test"
2. Crash Recovery Demonstration
# Start writes, then crash
./myfs disk.img /mnt/myfs &
PID=$!
dd if=/dev/urandom of=/mnt/myfs/bigfile bs=1M count=10 &
sleep 0.5
kill -9 $PID
# Remount - recovery happens
./myfs disk.img /mnt/myfs
# Output should show:
# "Journal recovery: replaying N transactions..."
# "Recovery complete. Filesystem consistent."
3. fsck Passes
./fsck.myfs disk.img
# Output: "Filesystem consistent. 0 errors."
4. Test Suite Passes
make test
# All journal unit tests pass
# All crash recovery tests pass
5. Documentation
- README explaining journal design choices
- Comments in code explaining recovery algorithm
- Performance measurements (overhead of journaling)
Bonus Achievements
- Implement ordered mode (data before metadata)
- Implement async commit with background thread
- Implement group commit for better throughput
- Add journal dump tool for debugging
- Performance comparison: journaled vs non-journaled