Project 6: Build a Git-like Version Control System

Build a simplified version control system with content-addressable storage, a commit DAG, and crash-safe operations.

Quick Reference

Attribute	Value
Difficulty	Advanced
Time Estimate	4-6 weeks
Main Programming Language	C (Alternatives: Rust, Go)
Alternative Programming Languages	Rust, Go
Coolness Level	Level 4: Serious Systems Skill
Business Potential	Level 3: Foundational Infrastructure
Prerequisites	All prior projects, file I/O, hashing, data structures
Key Topics	Content-addressable storage, DAGs, atomicity, crash recovery

1. Learning Objectives

By completing this project, you will:

Implement a content-addressable object database (blobs, trees, commits).
Build a commit DAG with refs and traversal.
Design a staging area (index) with strict invariants.
Implement atomic ref updates and crash-safe commits.
Build merge and recovery workflows with deterministic behavior.

2. All Theory Needed (Per-Concept Breakdown)

2.1 Content-Addressable Storage and Object Model

Fundamentals

A Git-like system stores data by content, not by filename. Every object (file content, tree, commit) is hashed, and its hash becomes its ID. This is content-addressable storage (CAS). It guarantees integrity: if the content changes, the hash changes. The object model is simple: blobs store file contents, trees store directory listings (name -> hash), and commits store a tree hash plus parent commit hashes. This is the foundation of the entire system and is what enables branching, merging, and history integrity.

Because objects are addressed by hash, you get de-duplication and integrity verification automatically. Two identical files map to the same blob, and any bit flip changes the ID. This property turns storage into a verifiable ledger of history, which is why version control can trust data across machines and time.

Deep Dive into the concept

Content-addressable storage is a clever inversion of traditional filesystems. Instead of “put this file at this path,” you say “store these bytes and give me their hash.” The hash is deterministic, so the same content yields the same ID, which gives you de-duplication for free. In a Git-like system, the object database is append-only: objects are never modified, only added. This is a powerful control flow simplification because write operations become idempotent. If you store the same blob twice, you get the same hash and do not need to modify existing data.

The object model builds a hierarchy from simple primitives. A blob represents file content. A tree represents a directory and contains entries of (mode, name, hash). A commit object references a tree and includes metadata (author, message, timestamp) and one or more parent commit hashes. This structure forms a directed acyclic graph (DAG) of commits. The key insight is that commits are immutable because they are defined by their content, including their parent references. This immutability is what makes history robust.

The object database is often stored as files under .mygit/objects/xx/yy... where xx is the first two hex characters of the hash. This directory sharding avoids too many files in one directory. The object file usually stores a header (type and length) plus the raw content, often compressed. For this project, you can skip compression and store raw bytes with a header, but you must still ensure the hash is computed over a canonical representation. The hash should include the object type and length to avoid collisions between different object types with identical bytes.

Writing objects requires strict ordering. You must first compute the hash, then write the file to a temporary path, and finally rename it to the object path. This ensures atomicity: the object either exists or does not. Because objects are immutable, you never need to update them, which greatly simplifies crash recovery. If a crash happens during write, you either have a missing object (safe) or a partially written file (detectable if you write to temp and rename only on success).

A robust CAS also requires integrity verification. When reading an object, you should recompute its hash and verify it matches the expected hash. This ensures that disk corruption is detected. This verification is the basis for fsck (file system check) operations later. It is also why Git-like systems are resilient: corruption is detected immediately when an object is accessed.

Finally, CAS changes how you think about state. The “current state” of the repository is just a pointer (ref) to a commit hash. Everything else is immutable data. This makes the state machine small and controllable: you only need to update refs to change repository state. This is the core reason a Git-like system can be crash-safe: all mutable state is concentrated in a small set of ref files.

In practice, this also enables safe caching. Because objects are immutable, you can memoize reads and keep hot objects in memory without fear of stale data. This dramatically simplifies performance tuning and testing because any object read is guaranteed to be the same forever.

How this fit on projects

This concept defines the storage engine for the VCS. Every other operation (add, commit, checkout, merge) depends on storing and reading objects correctly.

Definitions & key terms

Content-addressable storage: Data addressed by hash of content.
Blob: Object storing file content.
Tree: Object storing directory entries.
Commit: Object pointing to a tree and parent commits.

Mental model diagram (ASCII)

file.txt -> [blob hash]
  dir/   -> [tree hash] -> entries(name->hash)
 commit  -> [commit hash] -> tree + parent(s)

How it works (step-by-step)

Compute hash of “type + length + content”.
Write object to temp file.
Atomically rename to object path.
Verify by re-hashing on read.

Failure modes: hash collision (theoretical), partial writes, incorrect header encoding.

Minimal concrete example

// hash = sha1("blob" + "\0" + len + data)
write_object(hash, data, len);

Common misconceptions

“Objects can be updated.” They are immutable; update by writing a new object.
“Hashes are only for IDs.” They also provide integrity checks.

Check-your-understanding questions

Why does including type and length in the hash matter?
What happens if you store the same blob twice?
How do you detect corruption?

Check-your-understanding answers

It prevents collisions between different object types and lengths.
The hash is identical, so the object is reused.
Recompute the hash and compare with expected ID.

Real-world applications

Git object database
Content-addressable storage systems (IPFS)

Where you’ll apply it

In this project: see §3.2 Functional Requirements and §4.3 Data Structures.
Also used in: P04-undo-redo-engine.md (append-only logs).

References

“Pro Git” by Scott Chacon, Chapter 10 (Git internals)
SHA-1 documentation and hashing standards

Key insights

Immutability plus hashing turns storage into an append-only, crash-safe system.

Summary

Content-addressable objects are the foundation of all Git-like operations.

Homework/Exercises to practice the concept

Write a tiny tool that hashes a file and stores it by hash.
Implement a cat-object command that verifies hashes on read.

Solutions to the homework/exercises

Compute sha1 over type+len+content and write to .mygit/objects.
Read object, recompute hash, and compare with filename hash.

2.2 Commit DAG, Refs, and Index Invariants

Fundamentals

Commits form a directed acyclic graph. Each commit points to one or more parents. Branches and tags are just refs: files that store a commit hash. The working directory and the index (staging area) must remain consistent with refs. The index is a snapshot of the next commit; the working directory is a mutable checkout of the current commit. The invariants between these three states (HEAD commit, index, working tree) define the correctness of commands like status, add, and checkout.

When those invariants are explicit, every command becomes a safe transition. add moves changes into the index, commit moves the index into a new commit, and checkout moves the working tree to match a commit. The clarity of these transitions is what makes a Git-like system predictable instead of mysterious.

Deep Dive into the concept

The commit graph is a DAG because commits only point backward in time. This makes history traversable, mergeable, and verifiable. Implementing the DAG requires a graph traversal algorithm: starting from a ref, recursively visit parents. This is the basis for log and fsck. It also ensures that unreachable objects can be identified for garbage collection.

Refs are the mutable pointers that define the current state. In a Git-like system, HEAD points to a branch ref (e.g., refs/heads/master). The branch ref points to the latest commit. A detached HEAD points directly to a commit hash. This small set of files is the only mutable state in the repository, which makes the system robust. When you update a ref, you effectively change the “current” history. This is why ref updates must be atomic.

The index (staging area) is the most subtle part. It represents the exact tree that will be committed. When you run add, you compute blobs for the specified files and update index entries (path -> blob hash + mode). When you run commit, you build a tree from the index and create a commit. The index must always represent a consistent tree; it cannot contain duplicate paths or invalid modes. The invariant is: index entries are sorted by path, unique, and represent a tree that can be built without ambiguity.

The working tree is a checkout of the commit pointed to by HEAD. When you run checkout, you update the working tree and the index to match the target commit. When you run status, you compare working tree vs. index vs. HEAD. These comparisons define three states: clean, staged changes, and unstaged changes. This is another state machine with explicit invariants. The system must prevent operations that would violate these invariants. For example, checkout should refuse to proceed if it would overwrite uncommitted changes, unless forced.

Branching and merging are higher-level operations built on these invariants. Creating a new branch is just writing a ref file. Switching branches updates HEAD and updates the working tree to match the target commit. The index serves as the staging area for merge resolution, with stage numbers for conflicts. Even if you do not implement full stage numbers, you must explicitly mark conflicts and prevent commit until resolved.

Path normalization is another important detail. If the index allows duplicate or non-normalized paths (like a/../b), you can build trees that do not correspond to a real working directory. Normalizing paths at add time keeps the index consistent and makes tree creation deterministic across platforms.

The key design principle is to keep invariants explicit. The index is the single source of truth for staging. The working tree reflects a checkout of a commit, plus any unstaged changes. The ref points to the current commit. If you can always answer “what commit does HEAD point to, what does the index represent, and how does working tree differ,” then the system is consistent. This is why status is such a fundamental command: it validates that your invariants are intact.

How this fit on projects

This concept defines the core correctness rules of the VCS: how refs, commits, the index, and the working tree relate to each other.

Definitions & key terms

DAG: Directed acyclic graph of commits.
Ref: Pointer file that stores a commit hash.
Index: Staging area for the next commit.
Working tree: Checked-out files on disk.

Mental model diagram (ASCII)

HEAD -> refs/heads/master -> commit C
            |
            v
         Index (staged tree)
            |
            v
       Working Tree (files)

How it works (step-by-step)

Add files: update index entries.
Commit: build tree from index, create commit, update ref.
Status: compare working tree vs index vs HEAD.
Checkout: update HEAD and sync index + working tree.

Failure modes: index out of sync, ref pointing to missing commit, working tree overwritten without warning.

Minimal concrete example

typedef struct IndexEntry {
    char path[256];
    char hash[41];
    int mode;
} IndexEntry;

Common misconceptions

“The index is just a cache.” It is the authoritative staging state.
“Refs are just labels.” They are the only mutable pointers and must be consistent.

Check-your-understanding questions

What does git add actually change?
Why is the commit graph a DAG and not a tree?
What should status compare?

Check-your-understanding answers

It updates index entries with new blob hashes.
Because merges create commits with multiple parents.
Working tree vs index vs HEAD commit.

Real-world applications

Git, Mercurial, and other DVCS systems
Any snapshot-based versioning tool

Where you’ll apply it

In this project: see §3.2 Functional Requirements and §4.2 Key Components.
Also used in: P03-connection-pool.md (invariant tracking).

References

“Pro Git” Chapter 10 (refs, index)
“Version Control with Git” by Loeliger (merge and staging concepts)

Key insights

The index is the staging truth; refs and working tree must stay consistent with it.

Summary

Once you define invariants between HEAD, index, and working tree, every command becomes a controlled state transition.

Homework/Exercises to practice the concept

Build a small index file format and parse it.
Write a function that compares working tree to index and lists differences.

Solutions to the homework/exercises

Store entries as fixed-length records with path and hash.
Hash files and compare to stored hashes, report changes.

2.3 Atomic Operations, Merges, and Recovery

Fundamentals

The VCS must never corrupt the repository, even if it crashes mid-operation. This is achieved by strict ordering and atomic updates. The golden rule is: write new objects first, update refs last. Merges introduce a state machine (CLEAN -> MERGING -> CLEAN or ABORTED). Recovery must detect incomplete operations and offer a safe path forward. Integrity checks (fsck) and garbage collection ensure that corrupted or unreachable objects are handled deterministically.

Atomicity is not optional here because users trust the repository as a source of truth. A single corrupted ref can make history appear to vanish. By treating ref updates as the only visible state change, you can make every complex operation recoverable even if the process is killed at arbitrary points.

Deep Dive into the concept

Atomicity is about defining a single “commit point” where an operation becomes visible. In a Git-like system, that point is the ref update. Object writes are append-only, so they are safe to do first. If the system crashes after writing objects but before updating refs, the objects are simply unreachable; the repository is still consistent. This is the same pattern used in database write-ahead logging and in the undo persistence project. The ref update should be done by writing to a temporary file and renaming it atomically. This guarantees that the ref is either old or new, never corrupted.

Merges are a state machine because they are multi-step operations. A merge may fast-forward (no new commit), produce a merge commit, or result in conflicts. If conflicts occur, the repository enters a MERGING state and must record this in a durable marker file (e.g., .mygit/MERGE_HEAD). The working tree contains conflict markers, and the index records unresolved paths. The merge is completed only when conflicts are resolved and a merge commit is created. Alternatively, the user can abort, which restores the working tree to the original state and removes MERGE_HEAD. This is a classic state machine with explicit transitions and cleanup paths.

Crash recovery must detect incomplete operations. On startup, if MERGE_HEAD exists, the repo is in MERGING state. The system should refuse normal operations and prompt for continue or abort. If a lock file exists, it may indicate an interrupted ref update. You should detect stale locks and remove them carefully. Recovery logic is part of the control flow: it defines the legal transitions out of a partial state.

Integrity checking (fsck) verifies that object hashes match their content and that all referenced objects exist. It also verifies that the commit graph has no cycles. Garbage collection (GC) identifies unreachable objects and removes them after a grace period. This is necessary because crashes or aborted operations can leave orphaned objects. The GC itself should be safe: it should only remove objects that are provably unreachable and old enough to not be part of an in-progress operation.

The hardest part is ensuring that every operation has a safe interruption point. For example, in commit creation: write blobs -> write tree -> write commit -> update ref. The ref update is the only state change visible to the user. If any earlier step fails, you abort without updating the ref, and the repo remains unchanged. This is the key to crash-safe control flow. The same principle applies to merges: you only update the branch ref after the merge commit is fully written. If conflicts occur, you record MERGE_HEAD and stop, leaving the repo in a well-defined MERGING state.

Deterministic tests for atomicity involve simulating crashes at each step. For each step, kill the process and then run a recovery check. The repo should always be consistent and recoverable. These tests are essential because atomicity bugs can be rare and catastrophic. By enumerating all interruption points, you make atomicity a testable property rather than a hope.

How this fit on projects

This concept defines the durability guarantees of the VCS: atomic commit ordering, merge state machine, and recovery tools.

Definitions & key terms

Atomic ref update: Update a ref via temp file + rename.
Merge state: Repository state when conflicts are unresolved.
fsck: Integrity checker for object database.
Garbage collection: Removal of unreachable objects.

Mental model diagram (ASCII)

Write objects -> Write commit -> Update ref (commit point)
Crash before ref update = safe (objects orphaned)

How it works (step-by-step)

Write new objects (blob/tree/commit).
Update refs atomically (temp + rename).
On merge conflict, write MERGE_HEAD and stop.
On startup, detect MERGE_HEAD and require continue/abort.
Run fsck to verify hashes and graph integrity.

Failure modes: ref updated before objects, stale lock files, merge state not recorded, GC deleting live objects.

Minimal concrete example

int update_ref_atomic(const char *ref, const char *hash) {
    write_temp(ref, hash);
    return rename(temp_path, ref);
}

Common misconceptions

“Writing objects and refs can be in any order.” Order is everything for crash safety.
“Merge conflicts are just working tree changes.” They are a repository state.

Check-your-understanding questions

Why must refs be updated last?
How does MERGE_HEAD help recovery?
What should fsck verify?

Check-your-understanding answers

Because refs are the commit point; updating them too early can point to missing objects.
It records that a merge is in progress, so recovery can resume or abort.
Hash integrity, object existence, and DAG correctness.

Real-world applications

Git commit and merge implementations
Crash-safe metadata updates in databases

Where you’ll apply it

In this project: see §3.7 Real World Outcome and §5.10 Phase 3.
Also used in: P04-undo-redo-engine.md (atomic logs).

References

“Pro Git” Chapter 10 (commit ordering)
“Designing Data-Intensive Applications” (atomicity and durability)

Key insights

Atomicity is a control-flow promise: only update the visible pointer once all data is durable.

Summary

Crash safety is achieved by strict ordering, explicit merge states, and integrity checks.

Homework/Exercises to practice the concept

Simulate a crash between object write and ref update; verify repo consistency.
Implement a minimal MERGE_HEAD marker and abort logic.

Solutions to the homework/exercises

Kill the process before ref update and check that refs still point to old commit.
Create MERGE_HEAD and restore working tree from HEAD on abort.

3. Project Specification

3.1 What You Will Build

A simplified VCS named mygit that supports init, add, commit, status, log, checkout, branch, and merge. It stores objects in a content-addressable database, maintains refs, and guarantees crash-safe commit and merge operations.

Included:

Object database (blobs, trees, commits)
Index (staging area)
Atomic commit creation
Merge with conflict detection
fsck and GC basics

Excluded:

Networking or remote push/pull
Packfiles

3.2 Functional Requirements

init: create .mygit/ with objects, refs, HEAD, index.
add: write blobs and update index entries.
commit: create tree + commit, update ref atomically.
status: show clean/staged/unstaged changes.
checkout: update HEAD and working tree safely.
branch: create refs with atomic updates.
merge: fast-forward or conflict state with MERGE_HEAD.
fsck: verify hashes and DAG integrity.

3.3 Non-Functional Requirements

Performance: handles 1k files with acceptable latency.
Reliability: no repository corruption after crash.
Usability: clear error messages and exit codes.

3.4 Example Usage / Output

$ ./mygit init
Initialized empty repository in .mygit/
$ ./mygit add file.txt
$ ./mygit commit -m "Initial"
[master abc123] Initial

3.5 Data Formats / Schemas / Protocols

Object header: “type\0len” followed by raw content.
Index entry: path, hash, mode.

JSON error shape (if --json):

{ "error": { "code": "REPO_CORRUPT", "message": "missing object" } }

3.6 Edge Cases

Commit with empty index
Merge conflict and abort
Checkout with uncommitted changes
Crash during ref update

3.7 Real World Outcome

Deterministic demo uses fixed files and commit messages.

3.7.1 How to Run (Copy/Paste)

cc -std=c11 -O2 -o mygit src/mygit.c
./mygit init

3.7.2 Golden Path Demo (Deterministic)

echo "hello" > file.txt
./mygit add file.txt
./mygit commit -m "c1"
./mygit log

Expected: log shows one commit with message “c1”.

3.7.3 CLI Transcript (Success + Failure)

$ ./mygit init
Initialized empty repository in .mygit/
$ ./mygit add file.txt
$ ./mygit commit -m "c1"
[master abc123] c1
$ echo $?
0

$ ./mygit commit -m "c2"
ERROR: NOTHING_TO_COMMIT
$ echo $?
1

Exit codes:

0 success
1 user error
2 system error (I/O, corruption)

4. Solution Architecture

4.1 High-Level Design

[Working Tree] <-> [Index] <-> [Object DB]
        |               |            |
        v               v            v
      HEAD/refs ----> Commits (DAG)

4.2 Key Components

4.3 Data Structures (No Full Code)

typedef struct {
    char path[256];
    char hash[41];
    int mode;
} IndexEntry;

typedef struct {
    char tree_hash[41];
    char parent_hash[41];
    char message[256];
} Commit;

4.4 Algorithm Overview

Key Algorithm: Crash-Safe Commit

Write blob objects for staged files.
Write tree object.
Write commit object.
Update branch ref atomically.

Complexity Analysis:

Time: O(n) for number of files
Space: O(n) for object storage

5. Implementation Guide

5.1 Development Environment Setup

cc --version

5.2 Project Structure

project-root/
├── src/
│   ├── mygit.c
│   ├── object.c
│   ├── index.c
│   ├── refs.c
│   └── merge.c
├── tests/
│   └── test_repo.c
└── Makefile

5.3 The Core Question You’re Answering

“How can I make repository state changes atomic and recoverable, even if the process crashes?”

5.4 Concepts You Must Understand First

Content-addressable objects
Commit DAG and refs
Atomic ref updates

5.5 Questions to Guide Your Design

What is the exact commit point for a commit operation?
How do you prevent ref corruption during crashes?
How do you detect and handle merge conflicts?

5.6 Thinking Exercise

Simulate a crash after writing the commit object but before updating the ref. What is repo state?

5.7 The Interview Questions They’ll Ask

“Why is Git crash-safe even if it crashes mid-commit?”
“How do you implement a three-way merge?”
“What does fsck actually verify?”

5.8 Hints in Layers

Hint 1: Implement blob hashing and storage first.

Hint 2: Build the index and commit creation.

Hint 3: Add merge state handling after basic commit works.

5.9 Books That Will Help

5.10 Implementation Phases

Phase 1: Object Store (1-2 weeks)

Implement hash and object writing.
Add cat-object and hash-object.

Phase 2: Index + Commit (1-2 weeks)

Implement staging and commit creation.
Add refs and HEAD.

Phase 3: Merge + Recovery (1-2 weeks)

Implement merge state machine.
Add fsck and GC basics.

5.11 Key Implementation Decisions

6. Testing Strategy

6.1 Test Categories

6.2 Critical Test Cases

Crash before ref update leaves repo consistent.
Merge conflict enters MERGING state.
fsck detects missing object.

6.3 Test Data

Files: a.txt, b.txt
Commits: c1, c2

7. Common Pitfalls & Debugging

7.1 Frequent Mistakes

7.2 Debugging Strategies

Add a fsck command early to validate objects.
Log ref updates and lock files.

7.3 Performance Traps

Recomputing hashes repeatedly can be slow; cache them in the index.

8. Extensions & Challenges

8.1 Beginner Extensions

Add tag support.
Add a simple diff command.

8.2 Intermediate Extensions

Implement packfiles for storage efficiency.
Add rebase support with REBASE_HEAD state.

8.3 Advanced Extensions

Implement push/pull over a custom protocol.
Add delta compression for objects.

9. Real-World Connections

9.1 Industry Applications

Version control systems
Content-addressable storage systems

Git (reference implementation)
Jujutsu (modern VCS built on Git concepts)

9.3 Interview Relevance

State machines, atomicity, and graph traversal are core systems interview topics.

10. Resources

10.1 Essential Reading

“Pro Git” by Scott Chacon
“Version Control with Git” by Loeliger

10.2 Video Resources

Git internals walkthroughs

10.3 Tools & Documentation

git itself for comparison

Project 2: HTTP Parser for streaming parsing.
Project 4: Undo/Redo Engine for crash-safe history.

11. Self-Assessment Checklist

11.1 Understanding

I can explain content-addressable storage.
I can explain commit DAG traversal.
I can explain why ref updates are atomic.

11.2 Implementation

All core commands work.
fsck detects corruption.
Merge conflicts are recoverable.

11.3 Growth

I can describe the crash-safe commit sequence without notes.

12. Submission / Completion Criteria

Minimum Viable Completion:

init, add, commit, status, and log work.
Objects are stored by hash and verified.

Full Completion:

Branching, checkout, and merge supported with MERGE_HEAD.
Crash recovery and fsck implemented.

Excellence (Going Above & Beyond):

Packfiles, rebase, and remote push/pull.

13. Additional Content Rules (Compliance)

Deterministic demos in §3.7.
Failure demo with exit codes included.
Cross-links included in §2.1, §2.2, and §2.3.