Project 4: Journaling Layer
Project Overview
| Attribute | Details |
|---|---|
| Difficulty | Master |
| Time Estimate | 2-3 weeks |
| Language | C |
| Knowledge Area | Crash Consistency / Databases |
| Main Book | “Database Internals” by Alex Petrov |
| Prerequisites | Project 3 completed |
What You’ll Build
Add write-ahead logging (journaling) to your block-based filesystem. Before any metadata change, log the operation so crashes can be recovered.
Real World Output:
$ ./myfs disk.img /mnt/myfs
$ dd if=/dev/urandom of=/mnt/myfs/bigfile bs=1M count=50 &
# Kill power mid-write (or kill -9 the process)
$ ./myfs disk.img /mnt/myfs
Journal recovery: replaying 3 transactions...
Transaction 1: inode 15 allocation - COMMITTED, replaying
Transaction 2: block 1024-1087 allocation - COMMITTED, replaying
Transaction 3: inode 15 size update - INCOMPLETE, discarding
Recovery complete. Filesystem consistent.
$ ls /mnt/myfs # No corruption!
Learning Objectives
By completing this project, you will:
- Understand the crash consistency problem and why filesystems need journaling
- Implement write-ahead logging (WAL) — the same technique used by databases
- Design transaction semantics for atomic filesystem operations
- Build a journal recovery system that replays committed transactions
- Handle checkpointing to reclaim journal space
- Balance performance and durability tradeoffs
- Connect filesystem concepts to database internals — ACID properties in action
The Core Question You’re Answering
“How do you guarantee filesystem consistency when power can fail at any millisecond during a write operation?”
Before journaling, crashes during writes left filesystems corrupted. fsck would scan the entire disk to find and fix inconsistencies—taking hours on large drives.
In 2001, ext3 added one feature to ext2: journaling. That single addition made ext3 mount almost instantly after crashes. You’re about to implement that same insight—write-ahead logging. It’s the same technique PostgreSQL, MySQL, and every serious database uses.
The Database Connection:
This project bridges filesystems and databases. The WAL (Write-Ahead Logging) protocol you’ll implement is identical to what PostgreSQL uses. The ARIES recovery algorithm (Analysis, Redo, Undo) that databases use is a more sophisticated version of what you’ll build here. Understanding this prepares you for database internals.
Deep Theoretical Foundation
1. The Crash Consistency Problem
Consider creating a file /newfile.txt:
Operations required:
1. Allocate inode (update inode bitmap)
2. Initialize inode (write inode structure)
3. Allocate data block (update block bitmap)
4. Write file data (write data block)
5. Update directory (add entry to parent)
6. Update superblock (decrement free counts)
What if power fails between operations?
| Crash After | Result |
|---|---|
| Step 1 | Inode allocated but uninitialized - garbage metadata |
| Step 2 | Inode exists but no directory entry - orphaned inode |
| Step 3 | Block allocated but unused - space leak |
| Step 4 | Data written but inode not updated - data lost |
| Step 5 | Directory points to partial inode - potential corruption |
The fundamental issue: Disk writes are not atomic across multiple blocks. You cannot guarantee all updates complete before a crash.
The Crash Window Problem:
Time ────────────────────────────────────────────────────────────►
┌─────────┐ ┌─────────┐ ┌─────────┐ ┌─────────┐
│ Write │ │ Write │ │ Write │ │ Write │
│ Bitmap │ │ Inode │ │ DirEnt │ │ Data │
└────┬────┘ └────┬────┘ └────┬────┘ └────┬────┘
│ │ │ │
▼ ▼ ▼ ▼
←────────────── CRASH WINDOW ──────────────────►
Power can fail ANYWHERE in this window.
Partial updates = inconsistent filesystem.
2. ACID Properties (Database Connection)
Journaling provides ACID guarantees — the same properties databases ensure:
| Property | Database Meaning | Filesystem Meaning |
|---|---|---|
| Atomicity | Transaction all-or-nothing | File create either happens completely or not at all |
| Consistency | DB constraints preserved | Filesystem invariants maintained (bitmaps match usage) |
| Isolation | Concurrent transactions don’t interfere | Multiple file operations don’t corrupt each other |
| Durability | Committed data survives crash | Once fsync returns, data is on disk |
The key insight: A filesystem operation (create, delete, rename) is a transaction that must be atomic.
3. Solution Approaches Compared
Approach 1: Write Order + fsck (ext2)
Carefully order writes to limit damage:
- Write data blocks first
- Then write metadata
- On crash, run fsck to scan and fix
Problems:
- fsck scans the entire disk (slow for large filesystems)
- Some inconsistencies can’t be fixed
- User must wait for fsck to complete
ext2 Recovery Time:
Filesystem Size fsck Time
────────────────────────────
10 GB ~1 minute
100 GB ~10 minutes
1 TB ~1-2 hours
10 TB ~10-20 hours ← Unacceptable!
Approach 2: Journaling (ext3/ext4)
Write operations twice:
- First, log the changes to a journal
- Then, apply changes to their final locations
On crash:
- Scan small journal (fast)
- Replay committed transactions
- Discard incomplete transactions
Key insight: The journal is small, so recovery is fast. Only committed transactions are replayed, so consistency is guaranteed.
Journaling Recovery:
Journal Size: 128 MB (fixed, regardless of FS size)
Recovery Time: ~1-5 seconds (always)
10 TB filesystem crashes at 3 AM?
Mount in 5 seconds, not 20 hours.
Approach 3: Copy-on-Write (ZFS/Btrfs)
Never overwrite data. Always write to new locations, then update pointer atomically. (This is Project 6.)
4. Write-Ahead Logging (WAL) Principles
The fundamental rules of WAL:
- Log before apply: Write the log entry before making the actual change
- Commit before success: Don’t report success until the commit record is durable
- Idempotent operations: Replaying a transaction multiple times = same result
- Recovery replays committed: On mount, replay all committed transactions
WAL Protocol Visualization:
Normal write flow:
┌─────────────────────────────────────────────────────────────────┐
│ 1. Application calls write() │
│ 2. Filesystem prepares changes │
│ 3. BEGIN transaction written to journal │
│ 4. All changes logged to journal │
│ 5. COMMIT written to journal │
│ 6. fsync() journal — changes are now durable │
│ 7. Apply changes to actual locations (checkpoint) │
│ 8. Return success to application │
└─────────────────────────────────────────────────────────────────┘
Recovery flow:
┌─────────────────────────────────────────────────────────────────┐
│ 1. Mount filesystem │
│ 2. Scan journal from last checkpoint │
│ 3. For each transaction: │
│ - If COMMIT found: replay all operations │
│ - If no COMMIT: discard (incomplete transaction) │
│ 4. Mark journal as empty │
│ 5. Continue normal operation │
└─────────────────────────────────────────────────────────────────┘
5. The WAL Guarantee
Theorem: If a transaction has a COMMIT record in the durable journal, all its changes can be recovered.
Proof:
- COMMIT is written after all data entries
- COMMIT is fsync’d to disk before returning success
- If COMMIT exists, all data entries must exist (WAL write order)
- Recovery reads all data entries and applies them
- Therefore: COMMIT → Recoverable
6. Journal Structure
A journal is typically a fixed-size circular buffer on disk:
┌──────────────────────────────────────────────────────────────────┐
│ JOURNAL │
├─────────────┬─────────────────────────────────────┬──────────────┤
│ Journal │ Journal Entries │ Free │
│ Superblock │ [T1-Begin][Data][Commit][T2-Begin]..│ Space │
│ │ ↑ ↑ │ │
│ │ head tail │ │
└─────────────┴─────────────────────────────────────┴──────────────┘
Journal wraps around when it reaches the end (circular buffer):
┌─────────────────────────────────────────────────────────────────┐
│ ...old data...][T5-data][T5-COMMIT][T6-BEGIN... | ...old... │
│ ↑ ↑ │
│ tail head │
└─────────────────────────────────────────────────────────────────┘
Journal superblock:
struct journal_superblock {
uint32_t j_magic; // Magic number (0x4A484653 "JHFS")
uint32_t j_version; // Journal format version
uint32_t j_start; // First block of journal
uint32_t j_size; // Size in blocks
uint32_t j_head; // Oldest valid transaction
uint32_t j_tail; // Next write position
uint64_t j_sequence; // Next transaction ID
uint32_t j_checkpoint; // Last checkpointed transaction
uint32_t j_flags; // Status flags
uint8_t j_uuid[16]; // Journal UUID
};
Journal entries:
#define J_BEGIN 1
#define J_COMMIT 2
#define J_DATA 3
#define J_REVOKE 4 // For recovery optimization
struct journal_entry_header {
uint32_t je_type; // Entry type
uint32_t je_length; // Entry length (including header)
uint64_t je_sequence; // Transaction ID
uint32_t je_checksum; // CRC32 of entry
};
struct journal_begin {
struct journal_entry_header header;
uint32_t jb_num_blocks; // Expected blocks in transaction
uint64_t jb_timestamp; // Transaction start time
};
struct journal_data {
struct journal_entry_header header;
uint32_t jd_block; // Target block number
uint32_t jd_flags; // Block flags
// Followed by block data (BLOCK_SIZE bytes)
};
struct journal_commit {
struct journal_entry_header header;
uint64_t jc_sequence; // Transaction ID (redundant for verification)
uint32_t jc_checksum; // Checksum of entire transaction
uint64_t jc_timestamp; // Commit time
};
7. Transaction Semantics
A transaction groups related filesystem operations:
// Creating a file is one transaction:
transaction_begin();
{
alloc_inode(); // Logged
init_inode(); // Logged
alloc_blocks(); // Logged
write_data(); // Maybe logged (depends on mode)
add_dir_entry(); // Logged
update_superblock(); // Logged
}
transaction_commit();
Atomicity guarantee: Either ALL changes commit, or NONE do.
Visualization of a Transaction:
Transaction T47: Create file "/documents/report.txt"
Journal Contents:
┌─────────────────────────────────────────────────────────────────┐
│ T47-BEGIN (seq=47, expecting 4 blocks) │
├─────────────────────────────────────────────────────────────────┤
│ T47-DATA: Block 5 (inode bitmap, set bit 23) │
├─────────────────────────────────────────────────────────────────┤
│ T47-DATA: Block 105 (inode 23 structure) │
├─────────────────────────────────────────────────────────────────┤
│ T47-DATA: Block 200 (directory /documents, add entry) │
├─────────────────────────────────────────────────────────────────┤
│ T47-DATA: Block 0 (superblock, decrement free inode count) │
├─────────────────────────────────────────────────────────────────┤
│ T47-COMMIT (seq=47, checksum=0x12345678) │
└─────────────────────────────────────────────────────────────────┘
8. Journaling Modes
| Mode | What’s Journaled | Speed | Safety |
|---|---|---|---|
| Journal | Metadata + Data | Slow | Highest |
| Ordered | Metadata only, data written first | Medium | Good |
| Writeback | Metadata only | Fast | Lower |
Journal mode: Everything is logged. File content AND metadata. Slowest but safest.
Ordered mode (ext4 default): Only metadata is journaled. But data blocks are written before the metadata journal entry commits. This ensures you never see stale data.
Writeback mode: Only metadata journaled. Data might not be flushed before commit. Fastest, but stale data possible after crash.
Journaling Modes Comparison:
Full Data Journaling:
write("hello") → [Journal: data + metadata] → [Disk: data + metadata]
Write amplification: 2x (data written twice)
Ordered Mode:
write("hello") → [Disk: data] → [Journal: metadata] → [Disk: metadata]
Write amplification: 1x for data, 2x for metadata
Guarantee: Data on disk before metadata commit
Writeback Mode:
write("hello") → [Journal: metadata] → [Disk: data + metadata] (any order)
Write amplification: 2x for metadata only
Risk: Stale/garbage data visible after crash
9. Checkpointing
The journal has limited size. To reclaim space, checkpoint transactions:
- Flush all logged changes to their final disk locations
- Verify writes completed (fsync)
- Advance journal head pointer
- Space between old head and new head is now reusable
Before checkpoint:
head tail
↓ ↓
[T1-complete][T2-complete][T3-complete][T4-in-progress][...]
↑_________________________________________↑
Can be checkpointed
After checkpoint:
new head tail
↓ ↓
[recyclable space ][T4-in-progress][T5-new][...]
Checkpoint Process:
1. Identify complete transactions (T1, T2, T3)
2. Write T1, T2, T3 data to final disk locations
3. fsync() to ensure durability
4. Update journal superblock (move head past T3)
5. Journal space before new head is now reusable
10. Recovery Algorithm
void journal_recover(struct journal_state *j) {
uint64_t pos = j->superblock.j_head;
uint64_t current_txn = 0;
bool has_commit = false;
// First pass: identify complete transactions
while (pos < j->superblock.j_tail) {
struct journal_entry_header hdr;
read_journal(j, pos, &hdr, sizeof(hdr));
switch (hdr.je_type) {
case J_BEGIN:
current_txn = hdr.je_sequence;
has_commit = false;
break;
case J_COMMIT:
if (hdr.je_sequence == current_txn) {
has_commit = true;
// Mark transaction as complete
}
break;
}
pos += entry_size(&hdr);
}
// Second pass: replay complete transactions
pos = j->superblock.j_head;
while (pos < j->superblock.j_tail) {
struct journal_entry_header hdr;
read_journal(j, pos, &hdr, sizeof(hdr));
        if (hdr.je_type == J_DATA && transaction_complete(hdr.je_sequence)) {
            struct journal_data *jd = read_data_entry(j, pos);
            // The block image immediately follows the journal_data header
            // in the journal; fs is the owning filesystem state, reachable
            // from j via a backpointer
            write_block(fs, jd->jd_block, (uint8_t *)(jd + 1));
}
pos += entry_size(&hdr);
}
// Clear journal
j->superblock.j_head = j->superblock.j_tail;
write_journal_superblock(j);
}
Project Specification
Functional Requirements
- Journal area on disk
- Reserve blocks for journal
- Configurable size (32MB default)
- Circular buffer implementation
- Transaction support
- Begin/commit transaction API
- Log metadata changes
- Optional data journaling mode
- Recovery on mount
- Detect unclean shutdown
- Scan and replay committed transactions
- Discard incomplete transactions
- Checkpointing
- Periodic flush of journal to final locations
- Reclaim journal space
- Handle journal full condition
- Integration with filesystem
- Wrap all metadata operations in transactions
- Handle nested operations correctly
Non-Functional Requirements
- Recovery completes in < 5 seconds
- Journal overhead < 20% on normal operations
- Support concurrent transactions
- Graceful degradation when journal is full
Solution Architecture
Component Design
┌─────────────────────────────────────────────────────────────────┐
│ Filesystem Operations │
│ (create, unlink, write, mkdir, etc.) │
└────────────────────────────┬────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────┐
│ Transaction Manager │
│ (begin, log, commit, abort) │
└────────────────────────────┬────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────┐
│ Journal Writer │
│ (append entries, manage space) │
└────────────────────────────┬────────────────────────────────────┘
│
┌──────────────────┼──────────────────┐
▼ ▼ ▼
┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐
│ Journal Buffer │ │ Checkpointer │ │ Recovery Engine │
│ (in-memory) │ │ (background) │ │ (on mount) │
└────────┬────────┘ └────────┬────────┘ └────────┬────────┘
│ │ │
└───────────────────┼───────────────────┘
▼
┌─────────────────────────────────────────────────────────────────┐
│ Journal Blocks (on disk) │
│ [JSuper][Entry][Entry][Entry]...[unused]...[oldest entries] │
└─────────────────────────────────────────────────────────────────┘
Data Structures
// Journal state (in-memory)
struct journal_state {
struct journal_superblock jsuper;
uint8_t *buffer; // Write buffer
size_t buffer_pos; // Current position in buffer
size_t buffer_size; // Buffer size (typically one block)
uint64_t current_txn; // Current transaction ID
int txn_depth; // Nesting level (for compound ops)
pthread_mutex_t lock; // Concurrency control
int dirty; // Journal superblock needs writing
    int disk_fd;                  // File descriptor for journal
    struct myfs_state *fs;        // Backpointer to owning filesystem
};
// Transaction handle
struct transaction {
uint64_t txn_id;
struct journal_state *journal;
struct logged_block *blocks; // List of blocks to log
size_t num_blocks;
    int state;                    // TXN_ACTIVE, TXN_COMMITTING, TXN_COMMITTED, TXN_ABORTED
struct transaction *next; // For transaction list
};
// Logged block info
struct logged_block {
uint32_t block_num; // Target block number
uint8_t data[BLOCK_SIZE]; // Block content (copy)
uint32_t flags; // JOURNAL_METADATA, JOURNAL_DATA, etc.
struct logged_block *next;
};
// Recovery state
struct recovery_state {
uint64_t *complete_txns; // Array of complete transaction IDs
size_t num_complete;
size_t capacity;
uint64_t max_seq; // Highest sequence seen
};
// Transaction states
#define TXN_ACTIVE 1
#define TXN_COMMITTING 2
#define TXN_COMMITTED 3
#define TXN_ABORTED 4
// Block flags
#define JOURNAL_METADATA 0x01
#define JOURNAL_DATA 0x02
#define JOURNAL_ESCAPE 0x04 // Block contains journal magic
Key Functions
// Journal initialization
int journal_init(struct myfs_state *fs, uint32_t start_block, uint32_t size);
int journal_load(struct myfs_state *fs);
int journal_close(struct myfs_state *fs);
// Transaction API
struct transaction* txn_begin(struct myfs_state *fs);
int txn_log_block(struct transaction *txn, uint32_t block_num, void *data);
int txn_commit(struct transaction *txn);
void txn_abort(struct transaction *txn);
// Journal operations
int journal_append(struct journal_state *j, void *entry, size_t len);
int journal_sync(struct journal_state *j);
int journal_checkpoint(struct journal_state *j);
int journal_needs_checkpoint(struct journal_state *j);
// Recovery
int journal_recover(struct myfs_state *fs);
int journal_needs_recovery(struct myfs_state *fs);
int journal_mark_clean(struct myfs_state *fs);
Implementation Guide
Phase 1: Journal Layout (Days 1-3)
Goals:
- Reserve journal blocks in mkfs
- Define journal superblock format
- Initialize journal on format
Steps:
- Modify mkfs to allocate journal blocks
- Define journal superblock structure
- Write initial journal superblock
- Update filesystem superblock to point to journal
// In mkfs: allocate journal
uint32_t journal_blocks = 8192; // 32MB with 4KB blocks
uint32_t journal_start = sb.s_first_data;
sb.s_journal_start = journal_start;
sb.s_journal_size = journal_blocks;
sb.s_first_data += journal_blocks;
// Initialize journal superblock
struct journal_superblock jsb = {
.j_magic = JOURNAL_MAGIC,
.j_version = 1,
.j_start = journal_start + 1, // After journal superblock
.j_size = journal_blocks - 1,
.j_head = 0,
.j_tail = 0,
.j_sequence = 1,
.j_checkpoint = 0,
.j_flags = JOURNAL_CLEAN,
};
// Generate UUID for journal
generate_uuid(jsb.j_uuid);
write_block(fd, journal_start, &jsb);
Verification:
# After mkfs, verify journal exists
./mkfs.myfs -s 100M -j 32M disk.img
hexdump -C disk.img | grep "SFHJ" # Magic 0x4A484653 is stored little-endian, so the bytes read "SFHJ"
Phase 2: Transaction Begin/Commit (Days 3-6)
Goals:
- Implement transaction lifecycle
- Write BEGIN and COMMIT records
- Ensure durability with fsync
struct transaction* txn_begin(struct myfs_state *fs) {
struct journal_state *j = fs->journal;
    // Reclaim space first if needed. journal_checkpoint() acquires
    // j->lock itself, so call it before taking the lock here to avoid
    // self-deadlock on a non-recursive mutex.
    if (journal_space_low(j)) {
        journal_checkpoint(j);
    }
    pthread_mutex_lock(&j->lock);
// Start new transaction
struct transaction *txn = calloc(1, sizeof(*txn));
txn->txn_id = j->jsuper.j_sequence++;
txn->journal = j;
txn->state = TXN_ACTIVE;
// Write BEGIN record
struct journal_begin begin = {
.header = {
.je_type = J_BEGIN,
.je_length = sizeof(begin),
.je_sequence = txn->txn_id,
.je_checksum = 0, // Filled after
},
.jb_num_blocks = 0, // Updated at commit
.jb_timestamp = time(NULL),
};
begin.header.je_checksum = crc32(&begin, sizeof(begin));
journal_append(j, &begin, sizeof(begin));
pthread_mutex_unlock(&j->lock);
return txn;
}
int txn_log_block(struct transaction *txn, uint32_t block_num, void *data) {
if (txn->state != TXN_ACTIVE) {
return -EINVAL;
}
// Allocate logged block
struct logged_block *lb = malloc(sizeof(*lb));
lb->block_num = block_num;
memcpy(lb->data, data, BLOCK_SIZE);
lb->flags = JOURNAL_METADATA;
// Add to transaction's block list
lb->next = txn->blocks;
txn->blocks = lb;
txn->num_blocks++;
return 0;
}
int txn_commit(struct transaction *txn) {
struct journal_state *j = txn->journal;
pthread_mutex_lock(&j->lock);
txn->state = TXN_COMMITTING;
// Write all logged blocks to journal
struct logged_block *blk = txn->blocks;
uint32_t txn_checksum = 0;
while (blk) {
struct journal_data jd = {
.header = {
.je_type = J_DATA,
.je_length = sizeof(jd) + BLOCK_SIZE,
.je_sequence = txn->txn_id,
},
.jd_block = blk->block_num,
.jd_flags = blk->flags,
};
        // Handle blocks that start with the journal magic: escape the
        // journal copy so the recovery scan can't mistake it for an
        // entry header; replay restores the original bytes
        if (needs_escape(blk->data)) {
            jd.jd_flags |= JOURNAL_ESCAPE;
            escape_block(blk->data);
        }
jd.header.je_checksum = crc32(&jd, sizeof(jd));
journal_append(j, &jd, sizeof(jd));
journal_append(j, blk->data, BLOCK_SIZE);
txn_checksum = crc32_combine(txn_checksum, blk->data, BLOCK_SIZE);
blk = blk->next;
}
// Write COMMIT record
struct journal_commit commit = {
.header = {
.je_type = J_COMMIT,
.je_length = sizeof(commit),
.je_sequence = txn->txn_id,
},
.jc_sequence = txn->txn_id,
.jc_checksum = txn_checksum,
.jc_timestamp = time(NULL),
};
commit.header.je_checksum = crc32(&commit, sizeof(commit));
journal_append(j, &commit, sizeof(commit));
// CRITICAL: Sync journal to disk before returning
int ret = journal_sync(j);
if (ret < 0) {
pthread_mutex_unlock(&j->lock);
return ret;
}
txn->state = TXN_COMMITTED;
pthread_mutex_unlock(&j->lock);
// Free transaction resources
free_logged_blocks(txn->blocks);
free(txn);
return 0;
}
void txn_abort(struct transaction *txn) {
txn->state = TXN_ABORTED;
free_logged_blocks(txn->blocks);
free(txn);
}
Phase 3: Logging Filesystem Operations (Days 6-10)
Goals:
- Wrap metadata updates in transactions
- Log block changes before applying
- Handle inode, bitmap, and directory updates
int myfs_create_journaled(struct myfs_state *fs, const char *path, mode_t mode) {
struct transaction *txn = txn_begin(fs);
if (!txn) {
return -ENOMEM;
}
// Allocate inode (update bitmap)
int ino = find_free_inode(fs);
if (ino < 0) {
txn_abort(txn);
return -ENOSPC;
}
// Read and modify inode bitmap
uint8_t bitmap_block[BLOCK_SIZE];
uint32_t bitmap_blk = fs->sb.s_inode_bitmap + ino / (BLOCK_SIZE * 8);
read_block(fs, bitmap_blk, bitmap_block);
BITMAP_SET(bitmap_block, ino % (BLOCK_SIZE * 8));
txn_log_block(txn, bitmap_blk, bitmap_block);
// Initialize inode
struct myfs_inode inode = {
.i_mode = S_IFREG | mode,
.i_uid = getuid(),
.i_gid = getgid(),
.i_links_count = 1,
.i_size = 0,
.i_ctime = time(NULL),
.i_mtime = time(NULL),
.i_atime = time(NULL),
};
uint8_t inode_block[BLOCK_SIZE];
uint32_t inode_blk = inode_to_block(fs, ino);
read_block(fs, inode_blk, inode_block);
write_inode_to_block(inode_block, ino, &inode);
txn_log_block(txn, inode_blk, inode_block);
// Add directory entry
uint32_t parent_ino = path_to_parent_ino(fs, path);
const char *filename = basename(path);
uint8_t dir_block[BLOCK_SIZE];
uint32_t dir_blk = get_dir_data_block(fs, parent_ino);
read_block(fs, dir_blk, dir_block);
add_dir_entry(dir_block, ino, filename, DT_REG);
txn_log_block(txn, dir_blk, dir_block);
// Update superblock (decrement free inode count)
uint8_t sb_block[BLOCK_SIZE];
read_block(fs, 0, sb_block);
struct myfs_superblock *sb = (struct myfs_superblock *)sb_block;
sb->s_free_inodes--;
txn_log_block(txn, 0, sb_block);
// Commit transaction
int ret = txn_commit(txn);
if (ret < 0) {
txn_abort(txn);
return ret;
}
    // Changes reach their final locations at checkpoint time. Note that
    // txn_commit() freed txn, so it must not be dereferenced here; to
    // apply eagerly instead, do it inside txn_commit() before the free.
return 0;
}
Phase 4: Recovery Implementation (Days 10-14)
Goals:
- Detect unclean shutdown
- Scan journal for complete transactions
- Replay changes to filesystem
int journal_needs_recovery(struct myfs_state *fs) {
struct journal_state *j = fs->journal;
// Check if journal was marked clean
if (j->jsuper.j_flags & JOURNAL_CLEAN) {
return 0;
}
// Check if there's any content in journal
if (j->jsuper.j_head == j->jsuper.j_tail) {
return 0;
}
return 1; // Recovery needed
}
int journal_recover(struct myfs_state *fs) {
struct journal_state *j = fs->journal;
struct recovery_state rs = {0};
printf("Journal recovery starting...\n");
printf(" Journal head: %u, tail: %u\n",
j->jsuper.j_head, j->jsuper.j_tail);
// Allocate recovery state
rs.capacity = 256;
rs.complete_txns = malloc(rs.capacity * sizeof(uint64_t));
rs.num_complete = 0;
// Phase 1: Scan for complete transactions
printf("Phase 1: Scanning for complete transactions...\n");
uint32_t pos = j->jsuper.j_head;
uint64_t current_txn = 0;
int in_transaction = 0;
while (pos < j->jsuper.j_tail) {
struct journal_entry_header hdr;
read_journal_at(j, pos, &hdr, sizeof(hdr));
// Verify entry checksum
if (!verify_entry_checksum(&hdr)) {
printf(" Warning: Corrupt entry at position %u, stopping scan\n", pos);
break;
}
switch (hdr.je_type) {
case J_BEGIN:
current_txn = hdr.je_sequence;
in_transaction = 1;
printf(" Transaction %lu: BEGIN\n", current_txn);
break;
case J_COMMIT:
if (in_transaction && hdr.je_sequence == current_txn) {
struct journal_commit commit;
read_journal_at(j, pos, &commit, sizeof(commit));
// Verify transaction checksum
if (verify_txn_checksum(j, current_txn, commit.jc_checksum)) {
add_complete_txn(&rs, current_txn);
printf(" Transaction %lu: COMPLETE (checksum OK)\n", current_txn);
} else {
printf(" Transaction %lu: CHECKSUM FAILED\n", current_txn);
}
in_transaction = 0;
}
break;
case J_DATA:
// Just skip during scan phase
break;
}
pos += hdr.je_length;
}
if (in_transaction) {
printf(" Transaction %lu: INCOMPLETE (no commit)\n", current_txn);
}
// Phase 2: Replay complete transactions
printf("Phase 2: Replaying %zu complete transactions...\n", rs.num_complete);
pos = j->jsuper.j_head;
int blocks_replayed = 0;
while (pos < j->jsuper.j_tail) {
struct journal_entry_header hdr;
read_journal_at(j, pos, &hdr, sizeof(hdr));
if (hdr.je_type == J_DATA && is_complete(&rs, hdr.je_sequence)) {
struct journal_data jd;
read_journal_at(j, pos, &jd, sizeof(jd));
// Read block data from journal
uint8_t block_data[BLOCK_SIZE];
read_journal_at(j, pos + sizeof(jd), block_data, BLOCK_SIZE);
// Handle escaped blocks
if (jd.jd_flags & JOURNAL_ESCAPE) {
unescape_block(block_data);
}
printf(" Replaying block %u from transaction %lu\n",
jd.jd_block, hdr.je_sequence);
write_block(fs, jd.jd_block, block_data);
blocks_replayed++;
}
pos += hdr.je_length;
}
// Sync replayed blocks
fsync(fs->disk_fd);
// Clear journal
j->jsuper.j_head = j->jsuper.j_tail;
j->jsuper.j_sequence = rs.max_seq + 1;
j->jsuper.j_flags = JOURNAL_CLEAN;
write_journal_superblock(j);
printf("Recovery complete: %zu transactions, %d blocks replayed\n",
rs.num_complete, blocks_replayed);
free(rs.complete_txns);
return 0;
}
Phase 5: Checkpointing (Days 14-17)
Goals:
- Apply logged changes to final locations
- Reclaim journal space
- Handle journal-full condition
int journal_checkpoint(struct journal_state *j) {
pthread_mutex_lock(&j->lock);
printf("Checkpointing journal...\n");
    // Note: fs below is the owning filesystem state, assumed reachable
    // from j (e.g. via a backpointer stored at journal_load time)
    uint32_t pos = j->jsuper.j_head;
uint32_t checkpoint_pos = pos;
uint64_t current_txn = 0;
int in_complete_txn = 0;
// First pass: identify checkpoint boundary
while (pos < j->jsuper.j_tail) {
struct journal_entry_header hdr;
read_journal_at(j, pos, &hdr, sizeof(hdr));
switch (hdr.je_type) {
case J_BEGIN:
current_txn = hdr.je_sequence;
in_complete_txn = 0;
break;
case J_COMMIT:
if (hdr.je_sequence == current_txn) {
in_complete_txn = 1;
}
break;
}
pos += hdr.je_length;
if (in_complete_txn && hdr.je_type == J_COMMIT) {
checkpoint_pos = pos; // Can checkpoint up to here
}
}
// Second pass: apply all complete transactions
pos = j->jsuper.j_head;
int blocks_written = 0;
while (pos < checkpoint_pos) {
struct journal_entry_header hdr;
read_journal_at(j, pos, &hdr, sizeof(hdr));
if (hdr.je_type == J_DATA) {
struct journal_data jd;
read_journal_at(j, pos, &jd, sizeof(jd));
uint8_t block_data[BLOCK_SIZE];
read_journal_at(j, pos + sizeof(jd), block_data, BLOCK_SIZE);
if (jd.jd_flags & JOURNAL_ESCAPE) {
unescape_block(block_data);
}
write_block(fs, jd.jd_block, block_data);
blocks_written++;
}
pos += hdr.je_length;
}
// Sync filesystem blocks
fsync(fs->disk_fd);
// Advance head pointer
j->jsuper.j_head = checkpoint_pos;
j->jsuper.j_checkpoint = j->jsuper.j_sequence;
write_journal_superblock(j);
printf("Checkpoint complete: %d blocks written, journal space reclaimed\n",
blocks_written);
pthread_mutex_unlock(&j->lock);
return 0;
}
// Check if journal needs checkpoint
int journal_needs_checkpoint(struct journal_state *j) {
uint32_t used = j->jsuper.j_tail - j->jsuper.j_head;
uint32_t total = j->jsuper.j_size;
return used > (total * 3 / 4); // Checkpoint at 75% full
}
// Handle journal full condition
int journal_wait_for_space(struct journal_state *j, size_t needed_blocks) {
    uint32_t available = j->jsuper.j_size -
                         (j->jsuper.j_tail - j->jsuper.j_head);
    if (available >= needed_blocks) {
        return 0;
    }
    // Try to reclaim space by checkpointing completed transactions
    int ret = journal_checkpoint(j);
    if (ret < 0) {
        return ret;
    }
    // Recheck after the checkpoint
    available = j->jsuper.j_size - (j->jsuper.j_tail - j->jsuper.j_head);
    if (available >= needed_blocks) {
        return 0;
    }
    // Still no space: active transactions must commit before more of
    // the journal can be checkpointed, so report back instead of spinning
    return -EAGAIN;
}
Phase 6: Integration and Testing (Days 17-21)
Goals:
- Connect journaling to all filesystem operations
- Test crash scenarios
- Verify consistency after recovery
Crash simulation script:
#!/bin/bash
# test_crash_recovery.sh
DISK=/tmp/crash_test.img
MNT=/tmp/crash_mnt
MYFS=./myfs
MKFS=./mkfs.myfs
FSCK=./fsck.myfs
# Cleanup function
cleanup() {
fusermount -u $MNT 2>/dev/null
rm -f $DISK
rmdir $MNT 2>/dev/null
}
trap cleanup EXIT
echo "=== Crash Recovery Test Suite ==="
# Test 1: Clean mount/unmount
echo -e "\n[Test 1: Clean mount/unmount]"
$MKFS -s 100M -j 32M $DISK
mkdir -p $MNT
$MYFS $DISK $MNT &
FSPID=$!
sleep 1
echo "file 1 content" > $MNT/file1.txt
mkdir $MNT/subdir
echo "file 2 content" > $MNT/subdir/file2.txt
fusermount -u $MNT
wait $FSPID
$MYFS $DISK $MNT
sleep 1
if [ -f "$MNT/file1.txt" ] && [ -f "$MNT/subdir/file2.txt" ]; then
echo "PASS: Files persist after clean unmount"
else
echo "FAIL: Files missing after remount"
exit 1
fi
fusermount -u $MNT
# Test 2: Crash during file creation
echo -e "\n[Test 2: Crash during file creation]"
$MKFS -s 100M -j 32M $DISK
$MYFS $DISK $MNT &
FSPID=$!
sleep 1
# Start background writes
for i in $(seq 1 100); do
echo "file $i content" > $MNT/file_$i.txt
done &
WRITEPID=$!
# Kill filesystem mid-write (simulates crash)
sleep 0.5
kill -9 $FSPID
kill -9 $WRITEPID 2>/dev/null
# Remount (triggers recovery)
$MYFS $DISK $MNT 2>&1 | grep -q "recovery" && echo "Recovery triggered"
sleep 1
# Verify consistency
$FSCK $DISK
if [ $? -eq 0 ]; then
echo "PASS: Filesystem consistent after crash"
else
echo "FAIL: Filesystem inconsistent"
exit 1
fi
# Check files are complete or absent (not corrupted)
for f in $MNT/file_*.txt; do
if [ -f "$f" ]; then
content=$(cat "$f" 2>/dev/null)
if [[ ! "$content" =~ ^file\ [0-9]+\ content$ ]]; then
echo "CORRUPTION: $f has invalid content: $content"
exit 1
fi
fi
done
echo "PASS: All existing files have valid content"
fusermount -u $MNT
# Test 3: Crash during large file write
echo -e "\n[Test 3: Crash during large file write]"
$MKFS -s 100M -j 32M $DISK
$MYFS $DISK $MNT &
FSPID=$!
sleep 1
# Start large file write
dd if=/dev/urandom of=$MNT/largefile bs=1M count=50 2>/dev/null &
DDPID=$!
# Crash mid-write
sleep 2
kill -9 $FSPID
kill -9 $DDPID 2>/dev/null
# Remount
$MYFS $DISK $MNT 2>&1 | tee /tmp/recovery.log
sleep 1
# Verify
$FSCK $DISK
if [ $? -eq 0 ]; then
echo "PASS: Filesystem consistent after large file crash"
else
echo "FAIL: Filesystem inconsistent"
exit 1
fi
fusermount -u $MNT
echo -e "\n=== All crash recovery tests passed ==="
Testing Strategy
Unit Tests
// test_journal.c
#include "journal.h"
#include <assert.h>
// Test journal append and read
void test_journal_io() {
struct journal_state j;
journal_init_test(&j, 1024); // 1024 block journal
struct journal_begin begin = {
.header = {
.je_type = J_BEGIN,
.je_length = sizeof(begin),
.je_sequence = 1,
},
.jb_num_blocks = 3,
};
int ret = journal_append(&j, &begin, sizeof(begin));
assert(ret == 0);
struct journal_begin read_begin;
read_journal_at(&j, 0, &read_begin, sizeof(read_begin));
assert(read_begin.header.je_type == J_BEGIN);
assert(read_begin.header.je_sequence == 1);
assert(read_begin.jb_num_blocks == 3);
journal_close(&j);
printf("test_journal_io: PASS\n");
}
// Test transaction lifecycle
void test_txn_commit() {
struct myfs_state fs;
init_test_fs(&fs);
struct transaction *txn = txn_begin(&fs);
assert(txn != NULL);
assert(txn->state == TXN_ACTIVE);
uint8_t data[BLOCK_SIZE];
memset(data, 0xAB, sizeof(data));
int ret = txn_log_block(txn, 100, data);
assert(ret == 0);
assert(txn->num_blocks == 1);
ret = txn_commit(txn);
assert(ret == 0);
// Verify journal contains the transaction
struct journal_state *j = fs.journal;
assert(j->jsuper.j_tail > j->jsuper.j_head);
cleanup_test_fs(&fs);
printf("test_txn_commit: PASS\n");
}
// Test transaction abort
void test_txn_abort() {
struct myfs_state fs;
init_test_fs(&fs);
uint32_t initial_tail = fs.journal->jsuper.j_tail;
struct transaction *txn = txn_begin(&fs);
uint8_t test_data[BLOCK_SIZE];
memset(test_data, 0xCD, sizeof(test_data));
txn_log_block(txn, 100, test_data);
txn_abort(txn);
// Only the BEGIN record reached the journal; without a COMMIT it is
// harmless and recovery will discard it
assert(fs.journal->jsuper.j_tail == initial_tail + sizeof(struct journal_begin));
cleanup_test_fs(&fs);
printf("test_txn_abort: PASS\n");
}
// Test recovery of complete transaction
void test_recovery_complete() {
struct myfs_state fs;
init_test_fs(&fs);
// Create and commit a transaction
struct transaction *txn = txn_begin(&fs);
uint8_t data[BLOCK_SIZE];
memset(data, 0x55, sizeof(data));
txn_log_block(txn, 50, data);
txn_commit(txn);
// Simulate crash: don't apply to final location
// Clear the target block
uint8_t zeros[BLOCK_SIZE] = {0};
write_block(&fs, 50, zeros);
// Mark journal as needing recovery
fs.journal->jsuper.j_flags &= ~JOURNAL_CLEAN;
// Recover
int ret = journal_recover(&fs);
assert(ret == 0);
// Verify block 50 has correct content
uint8_t verify[BLOCK_SIZE];
read_block(&fs, 50, verify);
assert(memcmp(verify, data, BLOCK_SIZE) == 0);
cleanup_test_fs(&fs);
printf("test_recovery_complete: PASS\n");
}
// Test recovery of incomplete transaction (should be discarded)
void test_recovery_incomplete() {
struct myfs_state fs;
init_test_fs(&fs);
// Write BEGIN and DATA, but no COMMIT
struct journal_begin begin = {
.header = { .je_type = J_BEGIN, .je_length = sizeof(begin), .je_sequence = 1 },
};
journal_append(fs.journal, &begin, sizeof(begin));
struct journal_data jd = {
.header = { .je_type = J_DATA, .je_length = sizeof(jd) + BLOCK_SIZE, .je_sequence = 1 },
.jd_block = 60,
};
journal_append(fs.journal, &jd, sizeof(jd));
uint8_t data[BLOCK_SIZE];
memset(data, 0xAA, sizeof(data));
journal_append(fs.journal, data, BLOCK_SIZE);
// Mark as needing recovery
fs.journal->jsuper.j_flags &= ~JOURNAL_CLEAN;
// Clear target block
uint8_t zeros[BLOCK_SIZE] = {0};
write_block(&fs, 60, zeros);
// Recover
journal_recover(&fs);
// Block 60 should NOT have the data (transaction was incomplete)
uint8_t verify[BLOCK_SIZE];
read_block(&fs, 60, verify);
assert(memcmp(verify, zeros, BLOCK_SIZE) == 0);
cleanup_test_fs(&fs);
printf("test_recovery_incomplete: PASS\n");
}
// Test checksum verification
void test_checksum_validation() {
struct myfs_state fs;
init_test_fs(&fs);
// Create transaction with known data
struct transaction *txn = txn_begin(&fs);
uint8_t data[BLOCK_SIZE];
memset(data, 0x42, sizeof(data));
txn_log_block(txn, 70, data);
txn_commit(txn);
// Corrupt the data in journal
uint32_t data_pos = sizeof(struct journal_begin) + sizeof(struct journal_data);
fs.journal->buffer[data_pos] ^= 0xFF; // Flip bits
// Clear the target block so an (incorrect) replay would be detectable
uint8_t zeros[BLOCK_SIZE] = {0};
write_block(&fs, 70, zeros);
// Mark as needing recovery
fs.journal->jsuper.j_flags &= ~JOURNAL_CLEAN;
// Recovery should detect the corruption and skip the transaction
int ret = journal_recover(&fs);
assert(ret == 0);
// Block 70 must still be zeros: the corrupted transaction was not replayed
uint8_t verify[BLOCK_SIZE];
read_block(&fs, 70, verify);
assert(memcmp(verify, zeros, BLOCK_SIZE) == 0);
cleanup_test_fs(&fs);
printf("test_checksum_validation: PASS\n");
}
int main() {
test_journal_io();
test_txn_commit();
test_txn_abort();
test_recovery_complete();
test_recovery_incomplete();
test_checksum_validation();
printf("\nAll journal tests passed!\n");
return 0;
}
Crash Point Testing
// Systematic crash point testing
#include <setjmp.h>
static jmp_buf crash_jmp;      // Jump target for simulated crashes
static int crash_point = -1;   // Crash at the Nth block write (-1 = disabled)
static int write_count = 0;
int write_block_with_crash(struct myfs_state *fs, uint32_t blk, void *data) {
write_count++;
if (crash_point >= 0 && write_count >= crash_point) {
printf("SIMULATED CRASH at write #%d to block %u\n", write_count, blk);
longjmp(crash_jmp, 1); // Simulate crash
}
return write_block_real(fs, blk, data);
}
void test_all_crash_points() {
// For each operation, test crashes at every possible point
for (int cp = 1; cp <= 20; cp++) {
printf("\n=== Testing crash at point %d ===\n", cp);
struct myfs_state fs;
init_test_fs(&fs);
crash_point = cp;
write_count = 0;
if (setjmp(crash_jmp) == 0) {
// Normal path
myfs_create_journaled(&fs, "/test.txt", 0644);
printf("Operation completed without crash\n");
} else {
// Crash occurred
printf("Crash occurred, attempting recovery\n");
// Simulate remount
close_test_fs(&fs);
init_test_fs(&fs); // Reload from disk
// Recovery
if (journal_needs_recovery(&fs)) {
journal_recover(&fs);
}
// Verify consistency
int consistent = fsck(&fs);
assert(consistent == 0);
// File either exists completely or not at all
struct stat st;
int exists = myfs_getattr(&fs, "/test.txt", &st, NULL) == 0;
if (exists) {
assert(S_ISREG(st.st_mode));
printf("File exists and is valid\n");
} else {
printf("File does not exist (transaction was incomplete)\n");
}
}
cleanup_test_fs(&fs);
}
crash_point = -1; // Disable crash simulation
}
Common Pitfalls and Debugging
Pitfall 1: Forgetting to Sync Journal
Problem: Transaction appears committed but journal isn’t on disk.
// WRONG
journal_append(j, &commit, sizeof(commit));
return 0; // Commit record still in buffer!
// CORRECT
journal_append(j, &commit, sizeof(commit));
int ret = fsync(j->fd); // CRITICAL: Ensure commit record is on disk
if (ret < 0) {
perror("journal fsync failed");
return -EIO;
}
return 0;
Detection: Data loss after crashes despite “successful” operations.
Pitfall 2: Non-Idempotent Operations
Problem: Replaying transaction twice causes issues.
// WRONG: Increment is not idempotent
superblock.free_blocks--; // First replay: 1000 -> 999
// Second replay: 999 -> 998 WRONG!
// CORRECT: Set to absolute value (idempotent)
superblock.free_blocks = 999; // Always results in 999
Key principle: Log the AFTER state, not the delta.
Pitfall 3: Transaction Order in Recovery
Problem: Replaying transactions in wrong order causes inconsistency.
// WRONG: Process entries as encountered
while (pos < tail) {
if (entry.type == J_DATA) {
write_block(entry.block, entry.data); // Wrong order!
}
}
// CORRECT: Process by transaction ID order
qsort(data_entries, count, sizeof(entry), compare_by_txn_id);
for (int i = 0; i < count; i++) {
if (is_complete(data_entries[i].txn_id)) {
write_block(data_entries[i].block, data_entries[i].data);
}
}
Pitfall 4: Escaping Magic Numbers
Problem: Block data containing journal magic number corrupts recovery.
// WRONG: Direct copy
journal_append(j, block_data, BLOCK_SIZE);
// CORRECT: Escape blocks containing magic
if (contains_journal_magic(block_data)) {
jd.jd_flags |= JOURNAL_ESCAPE;
escape_block(block_data); // XOR or other escape
}
journal_append(j, &jd, sizeof(jd));
journal_append(j, block_data, BLOCK_SIZE);
Pitfall 5: Journal Wraparound
Problem: Circular buffer calculations overflow.
// WRONG: Simple subtraction
uint32_t available = j->j_size - j->j_tail + j->j_head; // Wrong when wrapped
// CORRECT: Handle wrap
uint32_t used = (j->j_tail >= j->j_head) ?
(j->j_tail - j->j_head) :
(j->j_size - j->j_head + j->j_tail);
uint32_t available = j->j_size - used;
Debugging Techniques
1. Journal dump tool:
./journal-dump disk.img
Journal Superblock:
Magic: 0x4A484653 (JHFS)
Head: 0, Tail: 4096
Sequence: 47
Entries:
[0000] T45 BEGIN (3 blocks expected)
[0064] T45 DATA: block 5 (inode bitmap)
[4160] T45 DATA: block 105 (inode)
[8256] T45 DATA: block 200 (directory)
[12352] T45 COMMIT (checksum: 0x12345678)
[12416] T46 BEGIN (2 blocks expected)
[12480] T46 DATA: block 6 (inode bitmap)
[16576] INCOMPLETE (no commit for T46)
2. Verbose logging:
#define JLOG(level, fmt, ...) do { \
if (journal_log_level >= level) \
fprintf(stderr, "[JOURNAL:%s] " fmt "\n", \
level_names[level], ##__VA_ARGS__); \
} while(0)
JLOG(DEBUG, "txn %lu: logging block %u", txn->txn_id, block_num);
JLOG(INFO, "txn %lu: commit, %zu blocks", txn->txn_id, txn->num_blocks);
JLOG(WARN, "txn %lu: checksum mismatch", txn->txn_id);
3. Artificial crash injection:
// Environment variable controlled crashes
static void check_crash_injection(const char *location) {
const char *crash_at = getenv("JOURNAL_CRASH_AT");
if (crash_at && strcmp(crash_at, location) == 0) {
fprintf(stderr, "INJECTED CRASH at %s\n", location);
_exit(1); // Immediate exit, no cleanup
}
}
// Usage:
check_crash_injection("after_commit_write");
Extensions and Challenges
Extension 1: Ordered Mode
Write data blocks before journaling metadata:
int txn_commit_ordered(struct transaction *txn) {
struct myfs_state *fs = txn->fs; // Owning filesystem (field name illustrative)
// First: write all data blocks to their final locations
for (struct data_write *dw = txn->data_writes; dw; dw = dw->next) {
write_block(fs, dw->block, dw->data);
}
fsync(fs->disk_fd); // Data is durable before any metadata is journaled
// Then: journal metadata only
for (struct logged_block *lb = txn->metadata; lb; lb = lb->next) {
journal_log_block(txn, lb);
}
return journal_commit(txn);
}
Extension 2: Asynchronous Commit
Return immediately, commit in background:
int txn_commit_async(struct transaction *txn) {
// Add to commit queue
pthread_mutex_lock(&commit_queue_lock);
txn->next = commit_queue;
commit_queue = txn;
pthread_cond_signal(&commit_queue_cond);
pthread_mutex_unlock(&commit_queue_lock);
return 0; // Returns immediately
}
// Background commit thread
void* commit_thread(void *arg) {
while (running) {
pthread_mutex_lock(&commit_queue_lock);
while (!commit_queue && running) {
pthread_cond_wait(&commit_queue_cond, &commit_queue_lock);
}
struct transaction *txn = commit_queue;
commit_queue = txn->next;
pthread_mutex_unlock(&commit_queue_lock);
if (txn) {
do_sync_commit(txn);
}
}
return NULL;
}
Extension 3: Group Commit
Batch multiple transactions into one disk write:
void commit_group() {
pthread_mutex_lock(&group_lock);
// Wait for batch to fill or timeout
struct timespec deadline;
clock_gettime(CLOCK_REALTIME, &deadline);
deadline.tv_nsec += 5000000; // 5ms timeout
if (deadline.tv_nsec >= 1000000000L) { // Normalize: tv_nsec must stay below 1s
deadline.tv_sec++;
deadline.tv_nsec -= 1000000000L;
}
while (pending_count < MAX_BATCH) {
int ret = pthread_cond_timedwait(&group_cond, &group_lock, &deadline);
if (ret == ETIMEDOUT) break;
}
// Gather all pending transactions
struct transaction *batch[MAX_BATCH];
int count = collect_pending(batch, MAX_BATCH);
pthread_mutex_unlock(&group_lock);
// Write all to journal
for (int i = 0; i < count; i++) {
write_txn_to_journal(batch[i]);
}
// Single fsync for all
fsync(journal_fd);
// Signal all waiters
for (int i = 0; i < count; i++) {
signal_commit_complete(batch[i]);
}
}
Extension 4: Intent Logging (REDO-only)
Instead of logging full blocks, log operations:
struct journal_intent {
uint32_t ji_type; // INTENT_SET_BIT, INTENT_CLEAR_BIT, etc.
uint32_t ji_block; // Target block
uint32_t ji_offset; // Offset within block
uint32_t ji_old_value; // For UNDO (optional)
uint32_t ji_new_value; // For REDO
};
// Much smaller journal entries, but recovery more complex
Extension 5: Compound Transactions
Support nested/compound operations:
txn_begin(); // Level 1
create_file();
txn_begin(); // Level 2 (nested)
write_data();
txn_commit(); // Level 2 commits to parent
txn_commit(); // Level 1 commits all
Real-World Connections
ext3/ext4 Journaling
Your implementation mirrors ext3/ext4 JBD2 (Journaling Block Device):
ext4 Journal Architecture:
┌─────────────────────────────────────────────────────────────┐
│ ext4 Core │
│ (create, unlink, write, mkdir, rename, etc.) │
└────────────────────────────┬────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────┐
│ JBD2 Layer │
│ (journal_start, journal_get_write_access, │
│ journal_dirty_metadata, journal_stop) │
└────────────────────────────┬────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────┐
│ Journal Device │
│ (typically at /dev/sda1 or internal to filesystem) │
└─────────────────────────────────────────────────────────────┘
Key similarities:
- Circular journal buffer
- BEGIN/COMMIT transaction markers
- Checksums for integrity
- Checkpoint to reclaim space
- Recovery on mount
View ext4 journal:
# See journal status
sudo dumpe2fs /dev/sda1 | grep -i journal
# Detailed journal info
sudo debugfs -R "stat <8>" /dev/sda1 # Inode 8 is journal
PostgreSQL WAL
PostgreSQL uses identical WAL principles:
PostgreSQL WAL:
┌─────────────────────────────────────────────────────────────┐
│ Transaction │
│ INSERT INTO users ... │
└────────────────────────────┬────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────┐
│ WAL Writer │
│ XLogInsert() → WAL Buffer → XLogFlush() → WAL Segment │
└────────────────────────────┬────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────┐
│ WAL Segment Files (pg_wal/) │
│ 000000010000000000000001, 000000010000000000000002, ... │
└─────────────────────────────────────────────────────────────┘
View PostgreSQL WAL:
# List WAL files
ls -la /var/lib/postgresql/14/main/pg_wal/
# Decode WAL records
pg_waldump -p /var/lib/postgresql/14/main/pg_wal/ 000000010000000000000001
MySQL/InnoDB Redo Log
InnoDB uses similar double-write + redo log:
// InnoDB simplified flow
mtr_start(&mtr); // Mini-transaction start
btr_insert(index, entry); // B-tree modification
mtr_commit(&mtr); // Write to redo log + flush
The pattern repeats: All serious storage systems use WAL for durability.
Interview Questions
Conceptual
- “What problem does journaling solve?”
- Expected: Crash consistency. Guarantees filesystem is always consistent, even after crashes. Avoids lengthy fsck scans.
- “Explain write-ahead logging.”
- Expected: Log changes before applying. On crash, replay committed entries. Key: commit record makes transaction durable.
- “What’s the difference between metadata and full data journaling?”
- Expected: Metadata-only logs structure changes. Full logs file content too. Trade-off: speed vs data safety. Ordered mode is middle ground.
- “Why is journaling faster than fsck?”
- Expected: Journal is small (MB) vs entire filesystem (TB). Recovery scans journal only. O(journal size) vs O(filesystem size).
- “How does this relate to database transactions?”
- Expected: Same ACID principles. WAL is used by PostgreSQL, MySQL. Filesystem transaction = database transaction conceptually.
Implementation
- “What makes an operation idempotent?”
- Expected: Same result when applied multiple times. Log absolute state, not deltas. Important because recovery might replay.
- “How do you know a transaction is complete?”
- Expected: COMMIT record present with valid checksum. No COMMIT = discard transaction.
- “What is checkpointing?”
- Expected: Writing journal contents to final locations to reclaim journal space. Periodic, or when journal fills.
- “What happens if the commit record itself is corrupted?”
- Expected: Transaction treated as incomplete (no valid commit). This is safe - we err on the side of discarding.
- “How do you handle journal wraparound?”
- Expected: Circular buffer with head/tail pointers. Checkpoint before overwriting. Handle wrap in arithmetic.
Design
- “How would you implement group commit?”
- Expected: Batch multiple transactions, single fsync. Latency tradeoff for throughput.
- “What are the tradeoffs of different journaling modes?”
- Expected: Full = safest, slowest (2x write amplification). Ordered = good balance. Writeback = fastest, risk of stale data.
Resources
Books
- “Operating Systems: Three Easy Pieces” Ch. 42 - Crash Consistency and Journaling (primary)
- “Database Internals” by Alex Petrov - Ch. 3-5 - Transaction Processing (excellent for ARIES)
- “Designing Data-Intensive Applications” by Kleppmann - Ch. 7 - Transactions
- “Understanding the Linux Kernel” Ch. 18 - ext3/ext4 journaling implementation
Papers
- “Journaling the Linux ext2fs Filesystem” - Stephen Tweedie (1998) - Original ext3 design
- “Analysis and Evolution of Journaling File Systems” - USENIX ATC 2005
- “ARIES: A Transaction Recovery Method” - Mohan et al. (1992) - Database recovery gold standard
Online
Videos
- “Journaling in Filesystems” - MIT 6.828 Operating Systems Engineering
- “Database Crash Recovery” - CMU 15-445 Database Systems
Self-Assessment Checklist
Core Functionality
- Journal is created during mkfs with configurable size
- Journal superblock is correctly formatted and readable
- Transactions can be started (BEGIN record written)
- Blocks can be logged within a transaction
- Transactions can be committed (COMMIT record with checksum)
- Journal sync (fsync) occurs before commit returns
Recovery
- Unclean shutdown is detected on mount
- Complete transactions are identified (have COMMIT)
- Incomplete transactions are identified (no COMMIT)
- Complete transactions are replayed correctly
- Incomplete transactions are discarded
- Checksums are verified during recovery
- Journal is cleared after successful recovery
Integration
- All metadata operations use transactions
- File create wraps inode+bitmap+directory updates
- File delete wraps inode+bitmap+directory updates
- Directory operations are transactional
- Nested/compound operations work correctly
Robustness
- Filesystem is consistent after simulated crashes at any point
- Checkpointing reclaims journal space
- Journal-full condition is handled gracefully
- Performance overhead is reasonable (< 20%)
Submission/Completion Criteria
To consider this project complete, you must demonstrate:
1. Working Implementation
# Create filesystem with journal
./mkfs.myfs -s 100M -j 32M disk.img
# Mount and use normally
./myfs disk.img /mnt/myfs
echo "test" > /mnt/myfs/file.txt
mkdir /mnt/myfs/subdir
cp /etc/passwd /mnt/myfs/subdir/
# Clean unmount
fusermount -u /mnt/myfs
# Remount - files persist
./myfs disk.img /mnt/myfs
cat /mnt/myfs/file.txt # Shows "test"
2. Crash Recovery Demonstration
# Start writes, then crash
./myfs disk.img /mnt/myfs &
PID=$!
dd if=/dev/urandom of=/mnt/myfs/bigfile bs=1M count=10 &
sleep 0.5
kill -9 $PID
# Remount - recovery happens
./myfs disk.img /mnt/myfs
# Output should show:
# "Journal recovery: replaying N transactions..."
# "Recovery complete. Filesystem consistent."
3. fsck Passes
./fsck.myfs disk.img
# Output: "Filesystem consistent. 0 errors."
4. Test Suite Passes
make test
# All journal unit tests pass
# All crash recovery tests pass
5. Documentation
- README explaining journal design choices
- Comments in code explaining recovery algorithm
- Performance measurements (overhead of journaling)
Bonus Achievements
- Implement ordered mode (data before metadata)
- Implement async commit with background thread
- Implement group commit for better throughput
- Add journal dump tool for debugging
- Performance comparison: journaled vs non-journaled