Project 4: Journaling Layer
Project Overview
| Attribute | Details |
|---|---|
| Difficulty | Master |
| Time Estimate | 2-3 weeks |
| Language | C |
| Knowledge Area | Crash Consistency / Databases |
| Main Book | "Database Internals" by Alex Petrov |
| Prerequisites | Project 3 completed |
What You'll Build
Add write-ahead logging (journaling) to your block-based filesystem. Before any metadata change, log the operation so crashes can be recovered.
Real World Output:
$ ./myfs disk.img /mnt/myfs
$ dd if=/dev/urandom of=/mnt/myfs/bigfile bs=1M count=50 &
# Kill power mid-write (or kill -9 the process)
$ ./myfs disk.img /mnt/myfs
Journal recovery: scanning 3 transactions...
Transaction 1: inode 15 allocation - COMMITTED, replaying
Transaction 2: block 1024-1087 allocation - COMMITTED, replaying
Transaction 3: inode 15 size update - INCOMPLETE, discarding
Recovery complete. Filesystem consistent.
$ ls /mnt/myfs # No corruption!
Learning Objectives
By completing this project, you will:
- Understand the crash consistency problem and why filesystems need journaling
- Implement write-ahead logging (WAL), the same technique used by databases
- Design transaction semantics for atomic filesystem operations
- Build a journal recovery system that replays committed transactions
- Handle checkpointing to reclaim journal space
- Balance performance and durability tradeoffs
The Core Question You're Answering
"How do you guarantee filesystem consistency when power can fail at any millisecond during a write operation?"
Before journaling, crashes during writes left filesystems corrupted. fsck would scan the entire disk to find and fix inconsistencies, which could take hours on large drives.
In 1999, ext3 added one feature to ext2: journaling. That single addition let ext3 come back up in seconds after a crash instead of waiting on a full fsck. You're about to implement that same insight, write-ahead logging. It's the same technique PostgreSQL, MySQL, and every serious database uses.
Deep Theoretical Foundation
1. The Crash Consistency Problem
Consider creating a file /newfile.txt:
Operations required:
1. Allocate inode (update inode bitmap)
2. Initialize inode (write inode structure)
3. Allocate data block (update block bitmap)
4. Write file data (write data block)
5. Update directory (add entry to parent)
6. Update superblock (decrement free counts)
What if power fails between operations?
| Crash After | Result |
|---|---|
| Step 1 | Inode allocated but uninitialized → garbage metadata |
| Step 2 | Inode exists but no directory entry → orphaned inode |
| Step 3 | Block allocated but unused → space leak |
| Step 4 | Data written but inode not updated → data lost |
| Step 5 | Directory points to partial inode → potential corruption |
The fundamental issue: Disk writes are not atomic across multiple blocks. You cannot guarantee all updates complete before a crash.
2. Solution Approaches
Approach 1: Write Order + fsck (ext2)
Carefully order writes to limit damage:
- Write data blocks first
- Then write metadata
- On crash, run fsck to scan and fix
Problems:
- fsck scans the entire disk (slow for large filesystems)
- Some inconsistencies can't be fixed
- User must wait for fsck to complete
Approach 2: Journaling (ext3/ext4)
Write operations twice:
- First, log the changes to a journal
- Then, apply changes to their final locations
On crash:
- Scan small journal (fast)
- Replay committed transactions
- Discard incomplete transactions
Key insight: The journal is small, so recovery is fast. Only committed transactions are replayed, so consistency is guaranteed.
Approach 3: Copy-on-Write (ZFS/Btrfs)
Never overwrite data. Always write to new locations, then update pointer atomically. (This is Project 6.)
3. Write-Ahead Logging (WAL) Principles
The fundamental rules of WAL:
- Log before apply: Write the log entry before making the actual change
- Commit before success: Donโt report success until the commit record is durable
- Idempotent operations: Replaying a transaction multiple times = same result
- Recovery replays committed: On mount, replay all committed transactions
Normal write flow:
1. Application calls write()
2. Filesystem prepares changes
3. BEGIN transaction written to journal
4. All changes logged to journal
5. COMMIT written to journal
6. fsync() journal -> changes are now durable
7. Apply changes to actual locations (checkpoint)
8. Return success to application
Recovery flow:
1. Mount filesystem
2. Scan journal from last checkpoint
3. For each transaction:
   - If COMMIT found: replay all operations
   - If no COMMIT: discard (incomplete transaction)
4. Mark journal as empty
5. Continue normal operation
4. Journal Structure
A journal is typically a fixed-size circular buffer on disk:
[Journal Superblock][T1-Begin][Data][Commit][T2-Begin]...     [Free Space]
                     ^                                         ^
                     head (oldest valid transaction)           tail (next write position)
Journal wraps around when it reaches the end:
...old data...][T5-data][T5-COMMIT][T6-BEGIN...      |      ...old data...
                                            ^                 ^
                                            tail              head
(new entries have wrapped to the start of the journal area; the gap between tail and head is free space)
Journal superblock:
struct journal_superblock {
uint32_t j_magic; // Magic number
uint32_t j_version; // Journal format version
uint32_t j_start; // First block of journal
uint32_t j_size; // Size in blocks
uint32_t j_head; // Oldest valid transaction
uint32_t j_tail; // Next write position
uint64_t j_sequence; // Next transaction ID
uint32_t j_checkpoint; // Last checkpointed transaction
};
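A detail worth pinning down early is how much room is left in the circular journal. Below is a minimal sketch that treats j_head and j_tail as byte offsets into the entry area (which is how the recovery code later advances through entries) and treats head == tail as "empty", matching the recovery code; the helper name is illustrative, not a required API.
// Sketch: bytes available for new entries in the circular journal.
// head == tail means empty, so a full journal must stop writing just
// before tail catches up with head again.
static uint64_t journal_free_bytes(const struct journal_superblock *jsb) {
    uint64_t capacity = (uint64_t)jsb->j_size * BLOCK_SIZE;
    if (jsb->j_tail >= jsb->j_head)
        return capacity - (jsb->j_tail - jsb->j_head);
    return jsb->j_head - jsb->j_tail;    // tail has wrapped around behind head
}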
Journal entries:
#define J_BEGIN 1
#define J_COMMIT 2
#define J_DATA 3
#define J_REVOKE 4 // For recovery optimization
struct journal_entry_header {
uint32_t je_type; // Entry type
uint32_t je_length; // Entry length
uint64_t je_sequence; // Transaction ID
uint32_t je_checksum; // CRC32 of entry
};
struct journal_begin {
struct journal_entry_header header;
uint32_t jb_num_blocks; // Expected blocks in transaction
};
struct journal_data {
struct journal_entry_header header;
uint32_t jd_block; // Target block number
// Followed by block data
};
struct journal_commit {
struct journal_entry_header header;
uint64_t jc_sequence; // Transaction ID (redundant for verification)
uint32_t jc_checksum; // Checksum of entire transaction
};
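The je_checksum and jc_checksum fields need a concrete routine. One reasonable choice is zlib's crc32(); the helper below is a sketch under the assumption that the checksum covers the entry header (with its own checksum field zeroed) plus the entry payload, and the helper name is illustrative.
#include <zlib.h>   // crc32()

// Sketch: CRC32 of an entry header plus its payload; the checksum field
// itself is zeroed before hashing so verification can recompute it.
static uint32_t journal_entry_crc(const struct journal_entry_header *hdr,
                                  const void *payload, size_t payload_len) {
    struct journal_entry_header tmp = *hdr;
    tmp.je_checksum = 0;
    uLong crc = crc32(0L, Z_NULL, 0);
    crc = crc32(crc, (const Bytef *)&tmp, sizeof(tmp));
    if (payload_len > 0)
        crc = crc32(crc, (const Bytef *)payload, (uInt)payload_len);
    return (uint32_t)crc;
}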
5. Transaction Semantics
A transaction groups related filesystem operations:
// Creating a file is one transaction:
transaction_begin();
{
alloc_inode(); // Logged
init_inode(); // Logged
alloc_blocks(); // Logged
write_data(); // Maybe logged (depends on mode)
add_dir_entry(); // Logged
update_superblock(); // Logged
}
transaction_commit();
Atomicity guarantee: Either ALL changes commit, or NONE do.
6. Journaling Modes
| Mode | Whatโs Journaled | Speed | Safety |
|---|---|---|---|
| Journal | Metadata + Data | Slow | Highest |
| Ordered | Metadata only, data written first | Medium | Good |
| Writeback | Metadata only | Fast | Lower |
Journal mode: Everything is logged. File content AND metadata. Slowest but safest.
Ordered mode (ext4 default): Only metadata is journaled. But data blocks are written before the metadata journal entry commits. This ensures you never see stale data.
Writeback mode: Only metadata journaled. Data might not be flushed before commit. Fastest, but stale data possible after crash.
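One way to wire the mode into the write path is a small enum consulted wherever a data block is dirtied. This is only a sketch of that decision point; journal_mode as a field of myfs_state and the txn_add_data_write() helper are assumptions of the sketch, not part of the project's required API.
// Sketch: how the journaling mode could steer a dirtied data block
enum journal_mode { MODE_JOURNAL, MODE_ORDERED, MODE_WRITEBACK };

static int handle_data_block(struct myfs_state *fs, struct transaction *txn,
                             uint32_t block_num, void *data) {
    switch (fs->journal_mode) {
    case MODE_JOURNAL:
        // Full data journaling: data goes through the journal like metadata
        return txn_log_block(txn, block_num, data);
    case MODE_ORDERED:
        // Remember the block so it is flushed to its final location
        // before the metadata commit record is written
        return txn_add_data_write(txn, block_num, data);
    case MODE_WRITEBACK:
    default:
        // Write data directly; no ordering guarantee relative to the commit
        return write_block(fs, block_num, data);
    }
}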
7. Checkpointing
The journal has limited size. To reclaim space, checkpoint transactions:
- Flush all logged changes to their final disk locations
- Verify writes completed (fsync)
- Advance journal head pointer
- Space between old head and new head is now reusable
Before checkpoint:
head                                    tail
 v                                       v
[T1-complete][T2-complete][T3-complete][T4-in-progress][...]
\_____________________________________/
          can be checkpointed
After checkpoint:
                  new head              tail
                  v                      v
[recyclable space][T4-in-progress][T5-new][...]
8. Recovery Algorithm
void journal_recover(struct journal_state *j) {
    uint64_t pos = j->jsuper.j_head;
    uint64_t current_txn = 0;
    // First pass: identify complete transactions
    while (pos < j->jsuper.j_tail) {
        struct journal_entry_header hdr;
        read_journal(j, pos, &hdr, sizeof(hdr));
        switch (hdr.je_type) {
        case J_BEGIN:
            current_txn = hdr.je_sequence;
            break;
        case J_COMMIT:
            if (hdr.je_sequence == current_txn) {
                // Record current_txn as complete (e.g. store its ID);
                // the replay pass checks this via transaction_complete()
            }
            break;
        }
        pos += entry_size(&hdr);
    }
    // Second pass: replay complete transactions
    pos = j->jsuper.j_head;
    while (pos < j->jsuper.j_tail) {
        struct journal_entry_header hdr;
        read_journal(j, pos, &hdr, sizeof(hdr));
        if (hdr.je_type == J_DATA && transaction_complete(hdr.je_sequence)) {
            struct journal_data *jd = read_data_entry(j, pos);
            // Copy block data from journal to its final location
            write_block(j->fs, jd->jd_block, jd->data);
        }
        pos += entry_size(&hdr);
    }
    // Clear journal: an empty journal is head == tail
    j->jsuper.j_head = j->jsuper.j_tail;
    write_journal_superblock(j);
}
Project Specification
Functional Requirements
- Journal area on disk
  - Reserve blocks for journal
  - Configurable size (32MB default)
  - Circular buffer implementation
- Transaction support
  - Begin/commit transaction API
  - Log metadata changes
  - Optional data journaling mode
- Recovery on mount
  - Detect unclean shutdown
  - Scan and replay committed transactions
  - Discard incomplete transactions
- Checkpointing
  - Periodic flush of journal to final locations
  - Reclaim journal space
  - Handle journal-full condition
- Integration with filesystem
  - Wrap all metadata operations in transactions
  - Handle nested operations correctly
Non-Functional Requirements
- Recovery completes in < 5 seconds
- Journal overhead < 20% on normal operations
- Support concurrent transactions
- Graceful degradation when the journal is full (see the reservation sketch below)
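For the journal-full requirement, a common shape is to reserve space before a transaction starts and force a checkpoint when there is not enough room. The sketch below assumes the journal_free_bytes() helper from the journal-layout section and a byte-based reservation; the function name and the blocking behavior are illustrative, not a required API.
// Sketch: make room in the journal before starting a transaction, forcing a
// checkpoint instead of failing the caller. -ENOSPC comes from <errno.h>.
static int journal_reserve(struct journal_state *j, size_t bytes_needed) {
    uint64_t capacity = (uint64_t)j->jsuper.j_size * BLOCK_SIZE;
    if (bytes_needed >= capacity)
        return -ENOSPC;                  // this transaction can never fit
    pthread_mutex_lock(&j->lock);
    while (journal_free_bytes(&j->jsuper) < bytes_needed) {
        // Not enough room: checkpoint to advance j_head, then re-check.
        // A fuller version would also wait for open transactions to finish.
        pthread_mutex_unlock(&j->lock);
        journal_checkpoint(j);
        pthread_mutex_lock(&j->lock);
    }
    pthread_mutex_unlock(&j->lock);
    return 0;
}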
Solution Architecture
Component Design
              Filesystem Operations
       (create, unlink, write, mkdir, etc.)
                       |
                       v
               Transaction Manager
           (begin, log, commit, abort)
                       |
                       v
                 Journal Writer
          (append entries, manage space)
                       |
         +-------------+--------------+
         v             v              v
  Journal Buffer   Checkpointer   Recovery Engine
   (in-memory)     (background)     (on mount)
         |             |              |
         +-------------+--------------+
                       |
                       v
             Journal Blocks (on disk)
  [JSuper][Entry][Entry][Entry]...[unused]...[oldest entries]
Data Structures
// Journal state
struct journal_state {
    struct journal_superblock jsuper;
    struct myfs_state *fs;         // Back-pointer to the owning filesystem
    uint8_t *buffer;               // Write buffer
    size_t buffer_pos;             // Current position in buffer
    uint64_t current_txn;          // Current transaction ID
    int txn_depth;                 // Nesting level
    pthread_mutex_t lock;
    int dirty;
};
// Transaction handle
struct transaction {
uint64_t txn_id;
struct journal_state *journal;
struct logged_block *blocks; // List of blocks to log
size_t num_blocks;
};
// Logged block info
struct logged_block {
uint32_t block_num; // Target block number
uint8_t data[BLOCK_SIZE]; // Block content
struct logged_block *next;
};
// Recovery state
struct recovery_state {
uint64_t *complete_txns; // Array of complete transaction IDs
size_t num_complete;
};
Key Functions
// Journal initialization
int journal_init(struct myfs_state *fs, uint32_t start_block, uint32_t size);
int journal_load(struct myfs_state *fs);
int journal_close(struct myfs_state *fs);
// Transaction API
struct transaction* txn_begin(struct myfs_state *fs);
int txn_log_block(struct transaction *txn, uint32_t block_num, void *data);
int txn_commit(struct transaction *txn);
void txn_abort(struct transaction *txn);
// Journal operations
int journal_append(struct journal_state *j, void *entry, size_t len);
int journal_sync(struct journal_state *j);
int journal_checkpoint(struct journal_state *j);
// Recovery
int journal_recover(struct myfs_state *fs);
int journal_needs_recovery(struct myfs_state *fs);
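To make the intended call pattern concrete, here is how a caller might wrap a single metadata update with this API; the choice of block 0 as the superblock and the error handling are illustrative only.
// Sketch: the basic begin -> log -> commit pattern
void example_update_superblock(struct myfs_state *fs) {
    uint8_t block[BLOCK_SIZE];

    struct transaction *txn = txn_begin(fs);   // BEGIN record appended
    read_block(fs, 0, block);                  // block 0: superblock (illustrative)
    /* ... modify the in-memory copy ... */
    txn_log_block(txn, 0, block);              // journal the new contents
    if (txn_commit(txn) < 0) {                 // COMMIT record + fsync
        /* the change never becomes visible; the handle is gone either way */
        return;
    }
    // The block reaches its final location later, at checkpoint time
}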
Implementation Guide
Phase 1: Journal Layout (Days 1-3)
Goals:
- Reserve journal blocks in mkfs
- Define journal superblock format
- Initialize journal on format
Steps:
- Modify mkfs to allocate journal blocks
- Define journal superblock structure
- Write initial journal superblock
- Update filesystem superblock to point to journal
// In mkfs: allocate journal
uint32_t journal_blocks = 8192; // 32MB with 4KB blocks
uint32_t journal_start = sb.s_first_data;
sb.s_journal_start = journal_start;
sb.s_journal_size = journal_blocks;
sb.s_first_data += journal_blocks;
// Initialize journal superblock
struct journal_superblock jsb = {
.j_magic = JOURNAL_MAGIC,
.j_version = 1,
.j_start = journal_start + 1, // After journal superblock
.j_size = journal_blocks - 1,
.j_head = 0,
.j_tail = 0,
.j_sequence = 1,
.j_checkpoint = 0,
};
write_block(fd, journal_start, &jsb);
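On every mount the journal superblock has to be read back and validated. Below is a minimal sketch of journal_load() and journal_needs_recovery(), assuming sb.s_journal_start from the mkfs code above, a staging buffer whose size is an arbitrary choice, and that a non-empty journal (head != tail) means the previous shutdown was unclean.
// Sketch: load the journal state at mount time
int journal_load(struct myfs_state *fs) {
    struct journal_state *j = calloc(1, sizeof(*j));
    if (!j)
        return -ENOMEM;
    j->fs = fs;                                   // back-pointer used by checkpoint/recovery

    uint8_t buf[BLOCK_SIZE];
    read_block(fs, fs->sb.s_journal_start, buf);  // journal superblock is its first block
    memcpy(&j->jsuper, buf, sizeof(j->jsuper));
    if (j->jsuper.j_magic != JOURNAL_MAGIC) {
        free(j);
        return -EINVAL;                           // not a journal we recognize
    }

    j->buffer = malloc(256 * BLOCK_SIZE);         // staging buffer; size is an arbitrary choice
    if (!j->buffer) {
        free(j);
        return -ENOMEM;
    }
    pthread_mutex_init(&j->lock, NULL);
    fs->journal = j;
    return 0;
}

int journal_needs_recovery(struct myfs_state *fs) {
    struct journal_state *j = fs->journal;
    return j->jsuper.j_head != j->jsuper.j_tail;  // entries left behind by a crash
}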
Phase 2: Transaction Begin/Commit (Days 3-6)
Goals:
- Implement transaction lifecycle
- Write BEGIN and COMMIT records
- Ensure durability with fsync
struct transaction* txn_begin(struct myfs_state *fs) {
struct journal_state *j = fs->journal;
pthread_mutex_lock(&j->lock);   // held until txn_commit()/txn_abort() releases it
// Start new transaction
struct transaction *txn = calloc(1, sizeof(*txn));
txn->txn_id = j->jsuper.j_sequence++;
txn->journal = j;
// Write BEGIN record
struct journal_begin begin = {
.header = {
.je_type = J_BEGIN,
.je_length = sizeof(begin),
.je_sequence = txn->txn_id,
},
};
journal_append(j, &begin, sizeof(begin));
return txn;
}
int txn_commit(struct transaction *txn) {
struct journal_state *j = txn->journal;
// Write all logged blocks
struct logged_block *blk = txn->blocks;
while (blk) {
struct journal_data jd = {
.header = {
.je_type = J_DATA,
.je_length = sizeof(jd) + BLOCK_SIZE,
.je_sequence = txn->txn_id,
},
.jd_block = blk->block_num,
};
journal_append(j, &jd, sizeof(jd));
journal_append(j, blk->data, BLOCK_SIZE);
blk = blk->next;
}
// Write COMMIT record
struct journal_commit commit = {
.header = {
.je_type = J_COMMIT,
.je_length = sizeof(commit),
.je_sequence = txn->txn_id,
},
.jc_sequence = txn->txn_id,
.jc_checksum = calculate_txn_checksum(txn),
};
journal_append(j, &commit, sizeof(commit));
// CRITICAL: Sync journal to disk
journal_sync(j);
pthread_mutex_unlock(&j->lock);
// Free transaction resources
free_logged_blocks(txn->blocks);
free(txn);
return 0;
}
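txn_begin() and txn_commit() lean on journal_append() and journal_sync(), which are not shown above. One straightforward sketch buffers entries in memory and flushes them with pwrite() plus fsync() at commit; treating j_tail as a byte offset into the entry area, ignoring wrap-around, and writing through fs->disk_fd are all simplifying assumptions here.
// Sketch: stage entries in memory, flush and fsync them at commit time
int journal_append(struct journal_state *j, void *entry, size_t len) {
    memcpy(j->buffer + j->buffer_pos, entry, len);   // nothing hits disk yet
    j->buffer_pos += len;
    j->dirty = 1;
    return 0;
}

int journal_sync(struct journal_state *j) {
    if (!j->dirty)
        return 0;
    // j_start points at the first entry block (set in mkfs); j_tail is the
    // byte offset of the next free position within the entry area
    off_t off = (off_t)j->jsuper.j_start * BLOCK_SIZE + j->jsuper.j_tail;
    if (pwrite(j->fs->disk_fd, j->buffer, j->buffer_pos, off) < 0)
        return -1;
    j->jsuper.j_tail += j->buffer_pos;
    write_journal_superblock(j);
    // CRITICAL: nothing counts as committed until this fsync() returns.
    // If we crash before it, recovery sees no commit record (or a bad
    // checksum) and discards the transaction.
    if (fsync(j->fs->disk_fd) < 0)
        return -1;
    j->buffer_pos = 0;
    j->dirty = 0;
    return 0;
}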
Phase 3: Logging Filesystem Operations (Days 6-10)
Goals:
- Wrap metadata updates in transactions
- Log block changes before applying
- Handle inode, bitmap, and directory updates
int myfs_create_journaled(struct myfs_state *fs, const char *path, mode_t mode) {
struct transaction *txn = txn_begin(fs);
// Allocate and log inode bitmap change
int ino = alloc_inode(fs);
uint8_t bitmap_block[BLOCK_SIZE];
read_block(fs, fs->sb.s_inode_bitmap + ino / (BLOCK_SIZE * 8), bitmap_block);
BITMAP_SET(bitmap_block, ino % (BLOCK_SIZE * 8));
txn_log_block(txn, fs->sb.s_inode_bitmap + ino / (BLOCK_SIZE * 8), bitmap_block);
// Initialize and log inode
struct myfs_inode inode = {
.i_mode = S_IFREG | mode,
.i_uid = getuid(),
.i_gid = getgid(),
.i_links_count = 1,
.i_ctime = time(NULL),
.i_mtime = time(NULL),
};
uint8_t inode_block[BLOCK_SIZE];
// ... read inode table block, update, log
// Add directory entry and log
// ... update parent directory block, log
// Commit transaction. txn_commit() syncs the journal and frees the handle,
// so the transaction must not be touched again after this call.
int ret = txn_commit(txn);
if (ret < 0) {
    return ret;
}
// The logged blocks reach their final on-disk locations later, at
// checkpoint time; until then the journal copy is authoritative.
return 0;
}
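txn_log_block() is used throughout but never defined. Here is a minimal sketch that snapshots the block's new contents into the transaction's logged_block list; prepending to the list and skipping de-duplication of repeated logs of the same block are simplifications.
// Sketch: remember "block block_num will become this data" in the transaction
int txn_log_block(struct transaction *txn, uint32_t block_num, void *data) {
    struct logged_block *lb = malloc(sizeof(*lb));
    if (!lb)
        return -ENOMEM;
    lb->block_num = block_num;
    memcpy(lb->data, data, BLOCK_SIZE);   // snapshot the block as it should become
    lb->next = txn->blocks;
    txn->blocks = lb;
    txn->num_blocks++;
    return 0;
}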
Phase 4: Recovery Implementation (Days 10-14)
Goals:
- Detect unclean shutdown
- Scan journal for complete transactions
- Replay changes to filesystem
int journal_recover(struct myfs_state *fs) {
struct journal_state *j = fs->journal;
printf("Journal recovery starting...\n");
// Phase 1: Scan for complete transactions
struct recovery_state rs = {0};
uint32_t pos = j->jsuper.j_head;
uint64_t current_txn = 0;
while (pos < j->jsuper.j_tail) {
struct journal_entry_header hdr;
read_journal_at(j, pos, &hdr, sizeof(hdr));
if (hdr.je_type == J_BEGIN) {
current_txn = hdr.je_sequence;
} else if (hdr.je_type == J_COMMIT && hdr.je_sequence == current_txn) {
// Verify checksum
struct journal_commit *commit = read_commit(j, pos);
if (verify_txn_checksum(j, current_txn, commit->jc_checksum)) {
add_complete_txn(&rs, current_txn);
printf("Transaction %lu: COMPLETE\n", current_txn);
} else {
printf("Transaction %lu: CHECKSUM FAILED\n", current_txn);
}
}
pos += hdr.je_length;
}
// Phase 2: Replay complete transactions
pos = j->jsuper.j_head;
while (pos < j->jsuper.j_tail) {
struct journal_entry_header hdr;
read_journal_at(j, pos, &hdr, sizeof(hdr));
if (hdr.je_type == J_DATA && is_complete(&rs, hdr.je_sequence)) {
struct journal_data *jd = read_data_entry(j, pos);
printf("Replaying block %u from transaction %lu\n",
jd->jd_block, hdr.je_sequence);
write_block(fs, jd->jd_block, jd->data);
}
pos += hdr.je_length;
}
// Clear journal
j->jsuper.j_head = j->jsuper.j_tail;
j->jsuper.j_sequence++;
write_journal_superblock(j);
printf("Recovery complete: %zu transactions replayed\n", rs.num_complete);
return 0;
}
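The recovery code above relies on add_complete_txn() and is_complete(). A small sketch of both over struct recovery_state; the realloc growth strategy and the linear search are fine here because the journal holds at most a handful of transactions.
// Sketch: helpers for struct recovery_state used by journal_recover()
static void add_complete_txn(struct recovery_state *rs, uint64_t txn_id) {
    uint64_t *tmp = realloc(rs->complete_txns,
                            (rs->num_complete + 1) * sizeof(uint64_t));
    if (!tmp)
        return;                           // out of memory: treat txn as incomplete
    rs->complete_txns = tmp;
    rs->complete_txns[rs->num_complete++] = txn_id;
}

static bool is_complete(struct recovery_state *rs, uint64_t txn_id) {
    for (size_t i = 0; i < rs->num_complete; i++)
        if (rs->complete_txns[i] == txn_id)
            return true;
    return false;
}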
Phase 5: Checkpointing (Days 14-17)
Goals:
- Apply logged changes to final locations
- Reclaim journal space
- Handle journal-full condition
int journal_checkpoint(struct journal_state *j) {
pthread_mutex_lock(&j->lock);
uint32_t pos = j->jsuper.j_head;
uint32_t checkpoint_pos = pos;
// Apply all complete transactions
while (pos < j->jsuper.j_tail) {
struct journal_entry_header hdr;
read_journal_at(j, pos, &hdr, sizeof(hdr));
if (hdr.je_type == J_DATA) {
struct journal_data *jd = read_data_entry(j, pos);
write_block(j->fs, jd->jd_block, jd->data); // j->fs: back-pointer to the filesystem state
} else if (hdr.je_type == J_COMMIT) {
checkpoint_pos = pos + hdr.je_length;
}
pos += hdr.je_length;
}
// Sync filesystem blocks
fsync(j->fs->disk_fd);
// Advance head
j->jsuper.j_head = checkpoint_pos;
j->jsuper.j_checkpoint = j->jsuper.j_sequence;
write_journal_superblock(j);
pthread_mutex_unlock(&j->lock);
return 0;
}
// Check if journal needs checkpoint
bool journal_needs_checkpoint(struct journal_state *j) {
uint32_t used = j->jsuper.j_tail - j->jsuper.j_head;
uint32_t total = j->jsuper.j_size;
return used > (total * 3 / 4); // 75% full
}
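The architecture diagram lists a background "Checkpointer". A minimal sketch of that thread, assuming it is started at mount, polls journal_needs_checkpoint() once per second, and is stopped with a flag at unmount; the polling interval and the flag are arbitrary choices of the sketch.
// Sketch: background checkpointer (polling keeps the example short; a real
// one would use a condition variable signalled when the journal fills up)
static volatile int checkpointer_running = 1;

void *checkpoint_thread(void *arg) {
    struct journal_state *j = arg;
    while (checkpointer_running) {
        if (journal_needs_checkpoint(j))
            journal_checkpoint(j);
        sleep(1);                          // arbitrary polling interval
    }
    return NULL;
}

// At mount:   pthread_create(&tid, NULL, checkpoint_thread, fs->journal);
// At unmount: checkpointer_running = 0; pthread_join(tid, NULL);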
Phase 6: Integration and Testing (Days 17-21)
Goals:
- Connect journaling to all filesystem operations
- Test crash scenarios
- Verify consistency after recovery
Crash simulation script:
#!/bin/bash
# test_crash_recovery.sh
DISK=/tmp/crash_test.img
MNT=/tmp/crash_mnt
# Create and mount
./mkfs.myfs -s 100M -j 32M $DISK
mkdir -p $MNT
./myfs $DISK $MNT &
FSPID=$!
sleep 1
# Start background writes
for i in $(seq 1 100); do
echo "file $i content" > $MNT/file_$i.txt
done &
WRITEPID=$!
# Kill filesystem mid-write
sleep 0.5
kill -9 $FSPID
kill -9 $WRITEPID 2>/dev/null
# Remount (triggers recovery); first clear the stale mountpoint left by kill -9
fusermount -u $MNT 2>/dev/null
./myfs $DISK $MNT &
sleep 1
# Verify consistency
./fsck.myfs $DISK
# Check files are complete or absent (not corrupted)
for f in $MNT/file_*.txt; do
content=$(cat "$f" 2>/dev/null)
if [ -n "$content" ]; then
if [[ ! "$content" =~ ^file\ [0-9]+\ content$ ]]; then
echo "CORRUPTION: $f has invalid content"
exit 1
fi
fi
done
echo "PASS: All files are consistent"
fusermount -u $MNT
Testing Strategy
Unit Tests
// Test journal append and read
void test_journal_io() {
struct journal_state j;
journal_init_test(&j);
struct journal_begin begin = {
.header = { .je_type = J_BEGIN, .je_length = sizeof(begin) },
};
journal_append(&j, &begin, sizeof(begin));
struct journal_begin read_begin;
read_journal_at(&j, 0, &read_begin, sizeof(read_begin));
assert(read_begin.header.je_type == J_BEGIN);
}
// Test transaction commit
void test_txn_commit() {
struct myfs_state fs;
init_test_fs(&fs);
struct transaction *txn = txn_begin(&fs);
assert(txn != NULL);
uint8_t data[BLOCK_SIZE];
memset(data, 0xAB, sizeof(data));
txn_log_block(txn, 100, data);
int ret = txn_commit(txn);
assert(ret == 0);
// Verify journal contains the block
// ...
}
// Test recovery
void test_recovery() {
struct myfs_state fs;
init_test_fs(&fs);
// Create committed transaction
struct transaction *txn = txn_begin(&fs);
uint8_t data[BLOCK_SIZE];
memset(data, 0x55, sizeof(data));
txn_log_block(txn, 50, data);
txn_commit(txn);
// Simulate crash (don't checkpoint)
// Recover
journal_recover(&fs);
// Verify block 50 has correct content
uint8_t verify[BLOCK_SIZE];
read_block(&fs, 50, verify);
assert(memcmp(verify, data, BLOCK_SIZE) == 0);
}
Crash Simulation Tests
// Test all crash points in file creation
void test_crash_during_create() {
for (int crash_point = 0; crash_point < 10; crash_point++) {
// Set crash point
set_crash_after_n_writes(crash_point);
// Attempt operation
signal_handler_installed = true;
if (setjmp(crash_jmp) == 0) {
myfs_create(&fs, "/test.txt", 0644);
}
// Simulate remount
journal_recover(&fs);
// Verify consistency
assert(fsck(&fs) == 0);
// Verify file either exists completely or not at all
struct stat st;
int exists = myfs_getattr(&fs, "/test.txt", &st, NULL) == 0;
if (exists) {
assert(st.st_size >= 0);
assert(S_ISREG(st.st_mode));
}
}
}
Common Pitfalls and Debugging
Pitfall 1: Forgetting to Sync Journal
Problem: Transaction appears committed but journal isnโt on disk.
// WRONG
journal_append(j, &commit, sizeof(commit));
return 0; // Commit record still in buffer!
// CORRECT
journal_append(j, &commit, sizeof(commit));
fsync(j->fs->disk_fd); // CRITICAL: Ensure commit record is on disk
return 0;
Pitfall 2: Non-Idempotent Operations
Problem: Replaying transaction twice causes issues.
// WRONG: Increment is not idempotent
superblock.free_blocks--;
// CORRECT: Set to known value (idempotent)
superblock.free_blocks = calculate_free_blocks();
Pitfall 3: Transaction Order in Recovery
Problem: Replaying in wrong order causes inconsistency.
Solution: Use sequence numbers and process in order.
Debugging Techniques
1. Journal dump tool:
./journal-dump disk.img
Transaction 1: BEGIN
Block 5: inode bitmap
Block 105: inode table
Block 200: directory
Transaction 1: COMMIT (checksum: 0x12345678)
Transaction 2: BEGIN
Block 6: inode bitmap
Transaction 2: INCOMPLETE (no commit)
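The dump tool is little more than the recovery scan with printf. Below is a sketch of its core loop, assuming the journal state has been loaded as in journal_load(); mapping block numbers to labels like "inode bitmap", and flagging INCOMPLETE transactions, would additionally need the filesystem superblock and the same two-pass scan recovery uses.
// Sketch: core loop of journal-dump
void dump_journal(struct journal_state *j) {
    uint32_t pos = j->jsuper.j_head;
    while (pos < j->jsuper.j_tail) {
        struct journal_entry_header hdr;
        read_journal_at(j, pos, &hdr, sizeof(hdr));
        switch (hdr.je_type) {
        case J_BEGIN:
            printf("Transaction %lu: BEGIN\n", (unsigned long)hdr.je_sequence);
            break;
        case J_DATA: {
            struct journal_data *jd = read_data_entry(j, pos);
            printf("  Block %u\n", jd->jd_block);
            break;
        }
        case J_COMMIT:
            printf("Transaction %lu: COMMIT (checksum: 0x%08x)\n",
                   (unsigned long)hdr.je_sequence, hdr.je_checksum);
            break;
        }
        pos += hdr.je_length;
    }
}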
2. Verbose logging:
#define JLOG(fmt, ...) \
fprintf(stderr, "[JOURNAL] " fmt "\n", ##__VA_ARGS__)
JLOG("txn %lu: logging block %u", txn->txn_id, block_num);
JLOG("txn %lu: commit, %zu blocks", txn->txn_id, txn->num_blocks);
3. Artificial crashes:
static int writes_until_crash = -1;
int write_block(struct myfs_state *fs, uint32_t blk, void *data) {
if (writes_until_crash == 0) {
exit(1); // Simulate crash
}
if (writes_until_crash > 0) {
writes_until_crash--;
}
// ... actual write
}
Extensions and Challenges
Extension 1: Ordered Mode
Write data blocks before journaling metadata:
int txn_commit_ordered(struct transaction *txn) {
    struct myfs_state *fs = txn->journal->fs;   // filesystem back-pointer
    // First: write all data blocks to their final locations
    for (struct data_write *dw = txn->data_writes; dw; dw = dw->next) {
        write_block(fs, dw->block, dw->data);
    }
    fsync(fs->disk_fd); // Data is durable before any metadata commits
    // Then: journal metadata only
    for (struct logged_block *lb = txn->metadata; lb; lb = lb->next) {
        journal_log_block(txn, lb);
    }
    return journal_commit(txn);
}
Extension 2: Asynchronous Commit
Return immediately, commit in background:
int txn_commit_async(struct transaction *txn) {
queue_for_commit(txn);
return 0; // Returns immediately
}
// Background thread
void* commit_thread(void *arg) {
while (running) {
struct transaction *txn = dequeue_commit();
if (txn) {
do_sync_commit(txn);
}
}
return NULL;   // satisfy the pthread start-routine signature
}
Extension 3: Group Commit
Batch multiple transactions into one disk write:
void commit_group() {
// Gather all pending transactions
struct transaction *batch[MAX_BATCH];
int count = collect_pending(batch, MAX_BATCH);
// Write all to journal
for (int i = 0; i < count; i++) {
write_txn_to_journal(batch[i]);
}
// Single fsync for all
fsync(journal_fd);
// Signal all waiters
for (int i = 0; i < count; i++) {
signal_commit_complete(batch[i]);
}
}
Interview Questions
Conceptual
- "What problem does journaling solve?"
- Expected: Crash consistency. Guarantees filesystem is always consistent, even after crashes.
- "Explain write-ahead logging."
- Expected: Log changes before applying. On crash, replay committed entries.
- "What's the difference between metadata and full data journaling?"
- Expected: Metadata-only logs structure changes. Full logs file content too. Trade-off: speed vs data safety.
- "Why is journaling faster than fsck?"
- Expected: Journal is small (MB) vs entire filesystem (TB). Recovery scans journal only.
Implementation
- "What makes an operation idempotent?"
- Expected: Same result when applied multiple times. Important because recovery might replay.
- "How do you know a transaction is complete?"
- Expected: COMMIT record present with valid checksum.
- "What is checkpointing?"
- Expected: Writing journal contents to final locations to reclaim journal space.
- "How is journaling related to database WAL?"
- Expected: Same principle. Both ensure atomicity and durability through logging.
Self-Assessment Checklist
- Journal is created during mkfs
- Transactions can be started and committed
- Metadata changes are logged before applying
- Recovery replays committed transactions correctly
- Incomplete transactions are discarded
- Filesystem is consistent after simulated crashes
- Checkpointing reclaims journal space
- Journal-full condition is handled gracefully
Resources
Books
- "Operating Systems: Three Easy Pieces" Ch. 42 - Crash Consistency and Journaling
- "Database Internals" by Alex Petrov - Ch. 3-5 - Transaction Processing
- "Designing Data-Intensive Applications" by Kleppmann - Ch. 7 - Transactions
Papers
- "Journaling the Linux ext2fs Filesystem" - Stephen Tweedie
- "Analysis and Evolution of Journaling File Systems" - USENIX ATC 2005