Project 6: Build a Git-like Version Control System
Build a simplified version control system with content-addressable storage, a commit DAG, and crash-safe operations.
Quick Reference
| Attribute | Value |
|---|---|
| Difficulty | Advanced |
| Time Estimate | 4-6 weeks |
| Main Programming Language | C (Alternatives: Rust, Go) |
| Alternative Programming Languages | Rust, Go |
| Coolness Level | Level 4: Serious Systems Skill |
| Business Potential | Level 3: Foundational Infrastructure |
| Prerequisites | All prior projects, file I/O, hashing, data structures |
| Key Topics | Content-addressable storage, DAGs, atomicity, crash recovery |
1. Learning Objectives
By completing this project, you will:
- Implement a content-addressable object database (blobs, trees, commits).
- Build a commit DAG with refs and traversal.
- Design a staging area (index) with strict invariants.
- Implement atomic ref updates and crash-safe commits.
- Build merge and recovery workflows with deterministic behavior.
2. All Theory Needed (Per-Concept Breakdown)
2.1 Content-Addressable Storage and Object Model
Fundamentals
A Git-like system stores data by content, not by filename. Every object (file content, tree, commit) is hashed, and its hash becomes its ID. This is content-addressable storage (CAS). It guarantees integrity: if the content changes, the hash changes. The object model is simple: blobs store file contents, trees store directory listings (name -> hash), and commits store a tree hash plus parent commit hashes. This is the foundation of the entire system and is what enables branching, merging, and history integrity.
Because objects are addressed by hash, you get de-duplication and integrity verification automatically. Two identical files map to the same blob, and any bit flip changes the ID. This property turns storage into a verifiable ledger of history, which is why version control can trust data across machines and time.
Deep Dive into the concept
Content-addressable storage is a clever inversion of traditional filesystems. Instead of “put this file at this path,” you say “store these bytes and give me their hash.” The hash is deterministic, so the same content yields the same ID, which gives you de-duplication for free. In a Git-like system, the object database is append-only: objects are never modified, only added. This is a powerful control flow simplification because write operations become idempotent. If you store the same blob twice, you get the same hash and do not need to modify existing data.
The object model builds a hierarchy from simple primitives. A blob represents file content. A tree represents a directory and contains entries of (mode, name, hash). A commit object references a tree and includes metadata (author, message, timestamp) and one or more parent commit hashes. This structure forms a directed acyclic graph (DAG) of commits. The key insight is that commits are immutable because they are defined by their content, including their parent references. This immutability is what makes history robust.
The object database is often stored as files under .mygit/objects/xx/yy... where xx is the first two hex characters of the hash. This directory sharding avoids too many files in one directory. The object file usually stores a header (type and length) plus the raw content, often compressed. For this project, you can skip compression and store raw bytes with a header, but you must still ensure the hash is computed over a canonical representation. The hash should include the object type and length to avoid collisions between different object types with identical bytes.
Writing objects requires strict ordering. You must first compute the hash, then write the file to a temporary path, and finally rename it to the object path. This ensures atomicity: the object either exists or does not. Because objects are immutable, you never need to update them, which greatly simplifies crash recovery. If a crash happens during write, you either have a missing object (safe) or a partially written file (detectable if you write to temp and rename only on success).
A robust CAS also requires integrity verification. When reading an object, you should recompute its hash and verify it matches the expected hash. This ensures that disk corruption is detected. This verification is the basis for fsck (file system check) operations later. It is also why Git-like systems are resilient: corruption is detected immediately when an object is accessed.
Finally, CAS changes how you think about state. The “current state” of the repository is just a pointer (ref) to a commit hash. Everything else is immutable data. This makes the state machine small and controllable: you only need to update refs to change repository state. This is the core reason a Git-like system can be crash-safe: all mutable state is concentrated in a small set of ref files.
In practice, this also enables safe caching. Because objects are immutable, you can memoize reads and keep hot objects in memory without fear of stale data. This dramatically simplifies performance tuning and testing because any object read is guaranteed to be the same forever.
How this fit on projects
This concept defines the storage engine for the VCS. Every other operation (add, commit, checkout, merge) depends on storing and reading objects correctly.
Definitions & key terms
- Content-addressable storage: Data addressed by hash of content.
- Blob: Object storing file content.
- Tree: Object storing directory entries.
- Commit: Object pointing to a tree and parent commits.
Mental model diagram (ASCII)
file.txt -> [blob hash]
dir/ -> [tree hash] -> entries(name->hash)
commit -> [commit hash] -> tree + parent(s)
How it works (step-by-step)
- Compute hash of “type + length + content”.
- Write object to temp file.
- Atomically rename to object path.
- Verify by re-hashing on read.
Failure modes: hash collision (theoretical), partial writes, incorrect header encoding.
Minimal concrete example
// hash = sha1("blob" + "\0" + len + data)
write_object(hash, data, len);
Common misconceptions
- “Objects can be updated.” They are immutable; update by writing a new object.
- “Hashes are only for IDs.” They also provide integrity checks.
Check-your-understanding questions
- Why does including type and length in the hash matter?
- What happens if you store the same blob twice?
- How do you detect corruption?
Check-your-understanding answers
- It prevents collisions between different object types and lengths.
- The hash is identical, so the object is reused.
- Recompute the hash and compare with expected ID.
Real-world applications
- Git object database
- Content-addressable storage systems (IPFS)
Where you’ll apply it
- In this project: see §3.2 Functional Requirements and §4.3 Data Structures.
- Also used in: P04-undo-redo-engine.md (append-only logs).
References
- “Pro Git” by Scott Chacon, Chapter 10 (Git internals)
- SHA-1 documentation and hashing standards
Key insights
Immutability plus hashing turns storage into an append-only, crash-safe system.
Summary
Content-addressable objects are the foundation of all Git-like operations.
Homework/Exercises to practice the concept
- Write a tiny tool that hashes a file and stores it by hash.
- Implement a
cat-objectcommand that verifies hashes on read.
Solutions to the homework/exercises
- Compute sha1 over type+len+content and write to
.mygit/objects. - Read object, recompute hash, and compare with filename hash.
2.2 Commit DAG, Refs, and Index Invariants
Fundamentals
Commits form a directed acyclic graph. Each commit points to one or more parents. Branches and tags are just refs: files that store a commit hash. The working directory and the index (staging area) must remain consistent with refs. The index is a snapshot of the next commit; the working directory is a mutable checkout of the current commit. The invariants between these three states (HEAD commit, index, working tree) define the correctness of commands like status, add, and checkout.
When those invariants are explicit, every command becomes a safe transition. add moves changes into the index, commit moves the index into a new commit, and checkout moves the working tree to match a commit. The clarity of these transitions is what makes a Git-like system predictable instead of mysterious.
Deep Dive into the concept
The commit graph is a DAG because commits only point backward in time. This makes history traversable, mergeable, and verifiable. Implementing the DAG requires a graph traversal algorithm: starting from a ref, recursively visit parents. This is the basis for log and fsck. It also ensures that unreachable objects can be identified for garbage collection.
Refs are the mutable pointers that define the current state. In a Git-like system, HEAD points to a branch ref (e.g., refs/heads/master). The branch ref points to the latest commit. A detached HEAD points directly to a commit hash. This small set of files is the only mutable state in the repository, which makes the system robust. When you update a ref, you effectively change the “current” history. This is why ref updates must be atomic.
The index (staging area) is the most subtle part. It represents the exact tree that will be committed. When you run add, you compute blobs for the specified files and update index entries (path -> blob hash + mode). When you run commit, you build a tree from the index and create a commit. The index must always represent a consistent tree; it cannot contain duplicate paths or invalid modes. The invariant is: index entries are sorted by path, unique, and represent a tree that can be built without ambiguity.
The working tree is a checkout of the commit pointed to by HEAD. When you run checkout, you update the working tree and the index to match the target commit. When you run status, you compare working tree vs. index vs. HEAD. These comparisons define three states: clean, staged changes, and unstaged changes. This is another state machine with explicit invariants. The system must prevent operations that would violate these invariants. For example, checkout should refuse to proceed if it would overwrite uncommitted changes, unless forced.
Branching and merging are higher-level operations built on these invariants. Creating a new branch is just writing a ref file. Switching branches updates HEAD and updates the working tree to match the target commit. The index serves as the staging area for merge resolution, with stage numbers for conflicts. Even if you do not implement full stage numbers, you must explicitly mark conflicts and prevent commit until resolved.
Path normalization is another important detail. If the index allows duplicate or non-normalized paths (like a/../b), you can build trees that do not correspond to a real working directory. Normalizing paths at add time keeps the index consistent and makes tree creation deterministic across platforms.
The key design principle is to keep invariants explicit. The index is the single source of truth for staging. The working tree reflects a checkout of a commit, plus any unstaged changes. The ref points to the current commit. If you can always answer “what commit does HEAD point to, what does the index represent, and how does working tree differ,” then the system is consistent. This is why status is such a fundamental command: it validates that your invariants are intact.
How this fit on projects
This concept defines the core correctness rules of the VCS: how refs, commits, the index, and the working tree relate to each other.
Definitions & key terms
- DAG: Directed acyclic graph of commits.
- Ref: Pointer file that stores a commit hash.
- Index: Staging area for the next commit.
- Working tree: Checked-out files on disk.
Mental model diagram (ASCII)
HEAD -> refs/heads/master -> commit C
|
v
Index (staged tree)
|
v
Working Tree (files)
How it works (step-by-step)
- Add files: update index entries.
- Commit: build tree from index, create commit, update ref.
- Status: compare working tree vs index vs HEAD.
- Checkout: update HEAD and sync index + working tree.
Failure modes: index out of sync, ref pointing to missing commit, working tree overwritten without warning.
Minimal concrete example
typedef struct IndexEntry {
char path[256];
char hash[41];
int mode;
} IndexEntry;
Common misconceptions
- “The index is just a cache.” It is the authoritative staging state.
- “Refs are just labels.” They are the only mutable pointers and must be consistent.
Check-your-understanding questions
- What does
git addactually change? - Why is the commit graph a DAG and not a tree?
- What should
statuscompare?
Check-your-understanding answers
- It updates index entries with new blob hashes.
- Because merges create commits with multiple parents.
- Working tree vs index vs HEAD commit.
Real-world applications
- Git, Mercurial, and other DVCS systems
- Any snapshot-based versioning tool
Where you’ll apply it
- In this project: see §3.2 Functional Requirements and §4.2 Key Components.
- Also used in: P03-connection-pool.md (invariant tracking).
References
- “Pro Git” Chapter 10 (refs, index)
- “Version Control with Git” by Loeliger (merge and staging concepts)
Key insights
The index is the staging truth; refs and working tree must stay consistent with it.
Summary
Once you define invariants between HEAD, index, and working tree, every command becomes a controlled state transition.
Homework/Exercises to practice the concept
- Build a small index file format and parse it.
- Write a function that compares working tree to index and lists differences.
Solutions to the homework/exercises
- Store entries as fixed-length records with path and hash.
- Hash files and compare to stored hashes, report changes.
2.3 Atomic Operations, Merges, and Recovery
Fundamentals
The VCS must never corrupt the repository, even if it crashes mid-operation. This is achieved by strict ordering and atomic updates. The golden rule is: write new objects first, update refs last. Merges introduce a state machine (CLEAN -> MERGING -> CLEAN or ABORTED). Recovery must detect incomplete operations and offer a safe path forward. Integrity checks (fsck) and garbage collection ensure that corrupted or unreachable objects are handled deterministically.
Atomicity is not optional here because users trust the repository as a source of truth. A single corrupted ref can make history appear to vanish. By treating ref updates as the only visible state change, you can make every complex operation recoverable even if the process is killed at arbitrary points.
Deep Dive into the concept
Atomicity is about defining a single “commit point” where an operation becomes visible. In a Git-like system, that point is the ref update. Object writes are append-only, so they are safe to do first. If the system crashes after writing objects but before updating refs, the objects are simply unreachable; the repository is still consistent. This is the same pattern used in database write-ahead logging and in the undo persistence project. The ref update should be done by writing to a temporary file and renaming it atomically. This guarantees that the ref is either old or new, never corrupted.
Merges are a state machine because they are multi-step operations. A merge may fast-forward (no new commit), produce a merge commit, or result in conflicts. If conflicts occur, the repository enters a MERGING state and must record this in a durable marker file (e.g., .mygit/MERGE_HEAD). The working tree contains conflict markers, and the index records unresolved paths. The merge is completed only when conflicts are resolved and a merge commit is created. Alternatively, the user can abort, which restores the working tree to the original state and removes MERGE_HEAD. This is a classic state machine with explicit transitions and cleanup paths.
Crash recovery must detect incomplete operations. On startup, if MERGE_HEAD exists, the repo is in MERGING state. The system should refuse normal operations and prompt for continue or abort. If a lock file exists, it may indicate an interrupted ref update. You should detect stale locks and remove them carefully. Recovery logic is part of the control flow: it defines the legal transitions out of a partial state.
Integrity checking (fsck) verifies that object hashes match their content and that all referenced objects exist. It also verifies that the commit graph has no cycles. Garbage collection (GC) identifies unreachable objects and removes them after a grace period. This is necessary because crashes or aborted operations can leave orphaned objects. The GC itself should be safe: it should only remove objects that are provably unreachable and old enough to not be part of an in-progress operation.
The hardest part is ensuring that every operation has a safe interruption point. For example, in commit creation: write blobs -> write tree -> write commit -> update ref. The ref update is the only state change visible to the user. If any earlier step fails, you abort without updating the ref, and the repo remains unchanged. This is the key to crash-safe control flow. The same principle applies to merges: you only update the branch ref after the merge commit is fully written. If conflicts occur, you record MERGE_HEAD and stop, leaving the repo in a well-defined MERGING state.
Deterministic tests for atomicity involve simulating crashes at each step. For each step, kill the process and then run a recovery check. The repo should always be consistent and recoverable. These tests are essential because atomicity bugs can be rare and catastrophic. By enumerating all interruption points, you make atomicity a testable property rather than a hope.
How this fit on projects
This concept defines the durability guarantees of the VCS: atomic commit ordering, merge state machine, and recovery tools.
Definitions & key terms
- Atomic ref update: Update a ref via temp file + rename.
- Merge state: Repository state when conflicts are unresolved.
- fsck: Integrity checker for object database.
- Garbage collection: Removal of unreachable objects.
Mental model diagram (ASCII)
Write objects -> Write commit -> Update ref (commit point)
Crash before ref update = safe (objects orphaned)
How it works (step-by-step)
- Write new objects (blob/tree/commit).
- Update refs atomically (temp + rename).
- On merge conflict, write MERGE_HEAD and stop.
- On startup, detect MERGE_HEAD and require continue/abort.
- Run fsck to verify hashes and graph integrity.
Failure modes: ref updated before objects, stale lock files, merge state not recorded, GC deleting live objects.
Minimal concrete example
int update_ref_atomic(const char *ref, const char *hash) {
write_temp(ref, hash);
return rename(temp_path, ref);
}
Common misconceptions
- “Writing objects and refs can be in any order.” Order is everything for crash safety.
- “Merge conflicts are just working tree changes.” They are a repository state.
Check-your-understanding questions
- Why must refs be updated last?
- How does MERGE_HEAD help recovery?
- What should fsck verify?
Check-your-understanding answers
- Because refs are the commit point; updating them too early can point to missing objects.
- It records that a merge is in progress, so recovery can resume or abort.
- Hash integrity, object existence, and DAG correctness.
Real-world applications
- Git commit and merge implementations
- Crash-safe metadata updates in databases
Where you’ll apply it
- In this project: see §3.7 Real World Outcome and §5.10 Phase 3.
- Also used in: P04-undo-redo-engine.md (atomic logs).
References
- “Pro Git” Chapter 10 (commit ordering)
- “Designing Data-Intensive Applications” (atomicity and durability)
Key insights
Atomicity is a control-flow promise: only update the visible pointer once all data is durable.
Summary
Crash safety is achieved by strict ordering, explicit merge states, and integrity checks.
Homework/Exercises to practice the concept
- Simulate a crash between object write and ref update; verify repo consistency.
- Implement a minimal MERGE_HEAD marker and abort logic.
Solutions to the homework/exercises
- Kill the process before ref update and check that refs still point to old commit.
- Create MERGE_HEAD and restore working tree from HEAD on abort.
3. Project Specification
3.1 What You Will Build
A simplified VCS named mygit that supports init, add, commit, status, log, checkout, branch, and merge. It stores objects in a content-addressable database, maintains refs, and guarantees crash-safe commit and merge operations.
Included:
- Object database (blobs, trees, commits)
- Index (staging area)
- Atomic commit creation
- Merge with conflict detection
- fsck and GC basics
Excluded:
- Networking or remote push/pull
- Packfiles
3.2 Functional Requirements
- init: create
.mygit/with objects, refs, HEAD, index. - add: write blobs and update index entries.
- commit: create tree + commit, update ref atomically.
- status: show clean/staged/unstaged changes.
- checkout: update HEAD and working tree safely.
- branch: create refs with atomic updates.
- merge: fast-forward or conflict state with MERGE_HEAD.
- fsck: verify hashes and DAG integrity.
3.3 Non-Functional Requirements
- Performance: handles 1k files with acceptable latency.
- Reliability: no repository corruption after crash.
- Usability: clear error messages and exit codes.
3.4 Example Usage / Output
$ ./mygit init
Initialized empty repository in .mygit/
$ ./mygit add file.txt
$ ./mygit commit -m "Initial"
[master abc123] Initial
3.5 Data Formats / Schemas / Protocols
- Object header: “type\0len” followed by raw content.
- Index entry: path, hash, mode.
JSON error shape (if --json):
{ "error": { "code": "REPO_CORRUPT", "message": "missing object" } }
3.6 Edge Cases
- Commit with empty index
- Merge conflict and abort
- Checkout with uncommitted changes
- Crash during ref update
3.7 Real World Outcome
Deterministic demo uses fixed files and commit messages.
3.7.1 How to Run (Copy/Paste)
cc -std=c11 -O2 -o mygit src/mygit.c
./mygit init
3.7.2 Golden Path Demo (Deterministic)
echo "hello" > file.txt
./mygit add file.txt
./mygit commit -m "c1"
./mygit log
Expected: log shows one commit with message “c1”.
3.7.3 CLI Transcript (Success + Failure)
$ ./mygit init
Initialized empty repository in .mygit/
$ ./mygit add file.txt
$ ./mygit commit -m "c1"
[master abc123] c1
$ echo $?
0
$ ./mygit commit -m "c2"
ERROR: NOTHING_TO_COMMIT
$ echo $?
1
Exit codes:
0success1user error2system error (I/O, corruption)
4. Solution Architecture
4.1 High-Level Design
[Working Tree] <-> [Index] <-> [Object DB]
| | |
v v v
HEAD/refs ----> Commits (DAG)
4.2 Key Components
| Component | Responsibility | Key Decisions | |———–|—————-|—————| | Object DB | Store blobs/trees/commits | Append-only, hash-verified | | Index | Stage file snapshots | Sorted unique entries | | Refs/HEAD | Point to current commit | Atomic updates | | Merge Engine | 3-way merge and conflict markers | MERGE_HEAD state | | Integrity | fsck and GC | Hash verification, reachability |
4.3 Data Structures (No Full Code)
typedef struct {
char path[256];
char hash[41];
int mode;
} IndexEntry;
typedef struct {
char tree_hash[41];
char parent_hash[41];
char message[256];
} Commit;
4.4 Algorithm Overview
Key Algorithm: Crash-Safe Commit
- Write blob objects for staged files.
- Write tree object.
- Write commit object.
- Update branch ref atomically.
Complexity Analysis:
- Time: O(n) for number of files
- Space: O(n) for object storage
5. Implementation Guide
5.1 Development Environment Setup
cc --version
5.2 Project Structure
project-root/
├── src/
│ ├── mygit.c
│ ├── object.c
│ ├── index.c
│ ├── refs.c
│ └── merge.c
├── tests/
│ └── test_repo.c
└── Makefile
5.3 The Core Question You’re Answering
“How can I make repository state changes atomic and recoverable, even if the process crashes?”
5.4 Concepts You Must Understand First
- Content-addressable objects
- Commit DAG and refs
- Atomic ref updates
5.5 Questions to Guide Your Design
- What is the exact commit point for a commit operation?
- How do you prevent ref corruption during crashes?
- How do you detect and handle merge conflicts?
5.6 Thinking Exercise
Simulate a crash after writing the commit object but before updating the ref. What is repo state?
5.7 The Interview Questions They’ll Ask
- “Why is Git crash-safe even if it crashes mid-commit?”
- “How do you implement a three-way merge?”
- “What does fsck actually verify?”
5.8 Hints in Layers
Hint 1: Implement blob hashing and storage first.
Hint 2: Build the index and commit creation.
Hint 3: Add merge state handling after basic commit works.
5.9 Books That Will Help
| Topic | Book | Chapter | |——-|——|———| | Git internals | “Pro Git” | Ch. 10 | | DAG algorithms | “Algorithms” (Sedgewick) | Ch. 4.2 | | Atomic file ops | “The Linux Programming Interface” | Ch. 5 |
5.10 Implementation Phases
Phase 1: Object Store (1-2 weeks)
- Implement hash and object writing.
- Add
cat-objectandhash-object.
Phase 2: Index + Commit (1-2 weeks)
- Implement staging and commit creation.
- Add refs and HEAD.
Phase 3: Merge + Recovery (1-2 weeks)
- Implement merge state machine.
- Add fsck and GC basics.
5.11 Key Implementation Decisions
| Decision | Options | Recommendation | Rationale | |———-|———|—————-|———–| | Hash algo | SHA-1, SHA-256 | SHA-1 | Match Git behavior | | Index format | Binary, text | Binary | Smaller and closer to Git | | Merge strategy | 3-way, recursive | 3-way | Simpler for this project |
6. Testing Strategy
6.1 Test Categories
| Category | Purpose | Examples | |———-|———|———-| | Unit Tests | Hash/object IO | write/read object | | Integration Tests | Commit flow | add -> commit -> log | | Crash Tests | Atomicity | kill mid-commit |
6.2 Critical Test Cases
- Crash before ref update leaves repo consistent.
- Merge conflict enters MERGING state.
- fsck detects missing object.
6.3 Test Data
Files: a.txt, b.txt
Commits: c1, c2
7. Common Pitfalls & Debugging
7.1 Frequent Mistakes
| Pitfall | Symptom | Solution | |——–|———|———-| | Updating ref too early | Corruption on crash | Update ref last | | Index corruption | status wrong | Write index to temp + rename | | Merge state forgotten | stuck repo | Use MERGE_HEAD marker |
7.2 Debugging Strategies
- Add a
fsckcommand early to validate objects. - Log ref updates and lock files.
7.3 Performance Traps
Recomputing hashes repeatedly can be slow; cache them in the index.
8. Extensions & Challenges
8.1 Beginner Extensions
- Add
tagsupport. - Add a simple
diffcommand.
8.2 Intermediate Extensions
- Implement packfiles for storage efficiency.
- Add rebase support with REBASE_HEAD state.
8.3 Advanced Extensions
- Implement push/pull over a custom protocol.
- Add delta compression for objects.
9. Real-World Connections
9.1 Industry Applications
- Version control systems
- Content-addressable storage systems
9.2 Related Open Source Projects
- Git (reference implementation)
- Jujutsu (modern VCS built on Git concepts)
9.3 Interview Relevance
- State machines, atomicity, and graph traversal are core systems interview topics.
10. Resources
10.1 Essential Reading
- “Pro Git” by Scott Chacon
- “Version Control with Git” by Loeliger
10.2 Video Resources
- Git internals walkthroughs
10.3 Tools & Documentation
gititself for comparison
10.4 Related Projects in This Series
- Project 2: HTTP Parser for streaming parsing.
- Project 4: Undo/Redo Engine for crash-safe history.
11. Self-Assessment Checklist
11.1 Understanding
- I can explain content-addressable storage.
- I can explain commit DAG traversal.
- I can explain why ref updates are atomic.
11.2 Implementation
- All core commands work.
- fsck detects corruption.
- Merge conflicts are recoverable.
11.3 Growth
- I can describe the crash-safe commit sequence without notes.
12. Submission / Completion Criteria
Minimum Viable Completion:
init,add,commit,status, andlogwork.- Objects are stored by hash and verified.
Full Completion:
- Branching, checkout, and merge supported with MERGE_HEAD.
- Crash recovery and fsck implemented.
Excellence (Going Above & Beyond):
- Packfiles, rebase, and remote push/pull.
13. Additional Content Rules (Compliance)
- Deterministic demos in §3.7.
- Failure demo with exit codes included.
- Cross-links included in §2.1, §2.2, and §2.3.