Project 1: Persistent Key-Value Store
Build a deterministic key-value engine that persists records in a versioned page file and can be audited at byte level.
Quick Reference
| Attribute | Value |
|---|---|
| Difficulty | Beginner |
| Time Estimate | 1 weekend |
| Main Programming Language | C (Alternatives: Rust, Go) |
| Alternative Programming Languages | Rust, Go |
| Coolness Level | Level 3: Genuinely Clever |
| Business Potential | Level 1: Resume Gold |
| Prerequisites | C pointers/structs, file I/O, binary formats |
| Key Topics | Slotted pages, record encoding, checksums, corruption handling |
1. Learning Objectives
By completing this project, you will:
- Design a stable on-disk page format with explicit versioning.
- Implement deterministic PUT/GET/DELETE behavior across process restarts.
- Build a page inspector that can decode raw bytes into structured records.
- Detect and isolate corruption without crashing.
- Establish baseline storage metrics (fill factor, fragmentation, read/write latency).
2. All Theory Needed (Per-Concept Breakdown)
Slotted Page Layout
Fundamentals
A slotted page separates logical record identity from physical byte location. The slot array stores offsets/lengths to payload records that live elsewhere in the page. This indirection allows record movement during compaction while preserving stable record handles like (page_id, slot_id). Without slots, deletes and variable-length updates quickly create fragile pointer assumptions and expensive full-page rewrites. The page header tracks metadata such as format version, free-space boundaries, checksums, and LSN markers.
Deep Dive into the concept
Page layout is a long-lived contract. Every field in your header becomes part of compatibility behavior for future files. If the format is under-specified, future features become migration emergencies. The robust approach is explicit widths, explicit byte order, and reserved bytes. Page mutation should be expressed as invariant-preserving transitions: allocate payload bytes, write record payload, write/update slot entry, adjust free-space metadata, recalculate checksum. If any step fails, the page should remain either old-valid or new-valid, never undefined.
Fragmentation management is a practical challenge. Updates and deletes create internal holes. You can choose immediate compaction (simple logic, higher write amplification) or deferred compaction triggered by thresholds (more complex, better amortized cost). In either strategy, slot IDs remain logical anchors. Corruption detection should include structural checks (slot bounds, overlap) and integrity checks (checksum). Structural checks catch semantic corruption that checksums can miss when data was consistently but incorrectly written.
A useful way to reason about correctness is to maintain two classes of invariants: structural invariants (all offsets in bounds, non-overlapping payload) and semantic invariants (decoded keys unique if your model requires uniqueness, lengths consistent with encoded fields). Testing should include malformed fixtures: invalid magic, wrong version, truncated payload, impossible slot lengths. If the engine fails closed (rejects bad pages safely) instead of fail-open, you have a solid storage foundation.
How this fit on projects
- Directly required in this project.
- Reused by P03 (B+ tree nodes) and P04 (buffer-managed pages).
Definitions & key terms
- Slot: Metadata entry referencing one record payload.
- Tombstone: Marker indicating logical deletion.
- Fill factor: Proportion of page occupied by live payload.
Mental model diagram
┌─────────────────────────────────────────────┐
│ Header: magic/version/free_start/free_end │
├─────────────────────────────────────────────┤
│ Slot[0] -> off=4012 len=24 LIVE │
│ Slot[1] -> off=3980 len=32 TOMBSTONE │
├─────────────────────────────────────────────┤
│ Free Space │
├─────────────────────────────────────────────┤
│ Payload bytes (records) from page tail │
└─────────────────────────────────────────────┘
How it works (step-by-step, with invariants and failure modes)
- Validate page header and checksum.
- For insert, allocate payload from free tail and then create slot entry.
- For delete, mark slot tombstone and reclaim lazily or compact immediately.
- For read, resolve slot offset and decode record fields.
Invariants:
- Slot offsets always in payload region.
- Slots never point to overlapping ranges.
- Header free-space pointers remain ordered.
Failure modes:
- Offset overflow due to integer truncation.
- Tombstoned slot accidentally returned as live.
- Compaction forgetting to update one slot.
Minimal concrete example
Pseudo-record bytes:
[key_len=0x0004][val_len=0x0003][key="name"][val="Ada"]
Common misconceptions
- “Struct dump equals portable file format.” -> False due padding/endianness.
- “Delete means immediate byte erase.” -> Often false; tombstones are common.
Check-your-understanding questions
- Why does slot indirection simplify compaction?
- Which bugs can checksums fail to detect?
- Why keep reserved header bytes?
Check-your-understanding answers
- Slot IDs remain stable while payload moves.
- Logically wrong but internally consistent writes.
- To evolve format without breaking old files.
Real-world applications
- SQLite page-level storage contracts.
- PostgreSQL heap page design.
Where you’ll apply it
- Primary in this project, then reused throughout P03/P04/P05.
References
- https://www.sqlite.org/fileformat.html
- https://www.postgresql.org/docs/current/storage-page-layout.html
Key insights
A reliable file format is engineered, not implied by compiler layout.
Summary
Slotted pages make variable-size records, updates, and compaction manageable while preserving logical identity.
Homework/Exercises to practice the concept
- Design a 48-byte page header with future-proof fields.
- Simulate insert/delete/update workload and plot fill factor every 100 ops.
- Write a page-checker spec listing structural and semantic validations.
Solutions to the homework/exercises
- Include magic, version, page_id, free pointers, checksum, reserved bytes.
- Expect fragmentation rise without compaction; threshold compaction stabilizes density.
- Validate header first, then slot bounds, then decode-level constraints.
3. Project Specification
3.1 What You Will Build
A CLI tool with put, get, del, list, and inspect-page commands over a single durable data file using fixed-size pages and slotted records.
3.2 Functional Requirements
putinserts or overwrites a key deterministically.getreturns exact stored value or explicit not-found.delmarks key deleted and prevents future visibility.listsupports prefix filtering.inspect-pagedecodes header and slots for debugging.
3.3 Non-Functional Requirements
- Performance: Stable behavior up to at least 100k keys.
- Reliability: No process crash on malformed page fixtures.
- Usability: Diagnostics include page/slot identifiers.
3.4 Example Usage / Output
$ ./kvdb put user:42:name "Ada"
OK txn=19 page=3 slot=17
$ ./kvdb get user:42:name
VALUE "Ada"
3.5 Data Formats / Schemas / Protocols
- Page header schema with version and checksum.
- Record schema: key length, value length, key bytes, value bytes.
3.6 Edge Cases
- Empty value, long key, duplicate key overwrite, tombstoned key reinsert.
3.7 Real World Outcome
3.7.1 How to Run (Copy/Paste)
make kvdb
./kvdb put user:42:name "Ada"
./kvdb get user:42:name
./kvdb inspect-page 3
3.7.2 Golden Path Demo (Deterministic)
Fixed key insertion order, fixed seed for generated keys.
3.7.3 If CLI: exact terminal transcript
$ ./kvdb put acct:001:status "active"
OK txn=00021 lsn=00000180 page=0004 slot=0009
$ ./kvdb get acct:001:status
VALUE "active" page=0004 slot=0009
$ ./kvdb del acct:001:status
OK tombstoned page=0004 slot=0009
$ ./kvdb get acct:001:status
NOT_FOUND
4. Solution Architecture
4.1 High-Level Design
CLI -> Command Router -> Storage API -> Page Manager -> File I/O
4.2 Key Components
| Component | Responsibility | Key Decisions |
|---|---|---|
| CLI parser | Parse commands and args | Reject ambiguous usage early |
| Storage API | Expose CRUD operations | Stable error categories |
| Page manager | Slot allocation and decode | Strict invariant checks |
| File layer | Read/write pages | Fixed page-size boundaries |
4.4 Data Structures (No Full Code)
PageHeader { magic, version, page_id, free_start, free_end, checksum }
SlotEntry { offset, length, state }
Record { key_len, val_len, key_bytes, val_bytes }
4.4 Algorithm Overview
- Lookup: scan slots per page and compare keys.
- Insert: pick page with space, write payload and slot.
- Delete: mark tombstone and optionally compact.
Complexity:
- Time: O(n) lookup without index
- Space: O(number_of_pages)
5. Implementation Guide
5.1 Development Environment Setup
make clean && make kvdb
5.2 Project Structure
project/
├── src/
├── include/
├── tests/
└── docs/format.md
5.3 The Core Question You’re Answering
How do I represent persistent key-value data so every byte has defined meaning across restarts?
5.4 Concepts You Must Understand First
- Slotted page model
- Deterministic binary encoding
- Corruption-safe parsing
5.5 Questions to Guide Your Design
- Where do you store free-space pointers?
- How do you recover from malformed record payload?
5.6 Thinking Exercise
Trace one insert and one delete by drawing resulting page bytes.
5.7 The Interview Questions They’ll Ask
- Why not append plain JSON lines?
- How would you evolve format version safely?
- How do you validate on-disk integrity quickly?
- What tradeoff exists between tombstones and compaction?
5.8 Hints in Layers
- Start with read-only page decoder.
- Add insert path before delete/overwrite.
- Add corruption fixtures before benchmarks.
5.9 Books That Will Help
| Topic | Book | Chapter |
|---|---|---|
| Binary I/O | The C Programming Language | Ch. 5-8 |
| Memory/cache behavior | Computer Systems: A Programmer’s Perspective | Ch. 6 |
5.10 Implementation Phases
- Phase 1: Header + slot parser
- Phase 2: CRUD operations
- Phase 3: Corruption handling + metrics
5.11 Key Implementation Decisions
| Decision | Options | Recommendation | Rationale |
|---|---|---|---|
| Delete policy | tombstone / immediate compact | tombstone first | simpler correctness |
| Endianness | host / fixed | fixed | portability |
6. Testing Strategy
6.1 Test Categories
- Unit: encode/decode functions
- Integration: CRUD across restart
- Edge: malformed pages and truncated files
6.2 Critical Test Cases
- Duplicate key overwrite persists correctly.
- Tombstoned keys are not returned.
- Corrupted checksum page is rejected safely.
6.3 Test Data
Deterministic fixtures with fixed key/value corpus.
7. Common Pitfalls & Debugging
7.1 Frequent Mistakes
| Pitfall | Symptom | Solution |
|---|---|---|
| Implicit struct serialization | Cross-machine decode failures | Explicit field encoding |
| Free-space pointer bug | Random key misses | Invariant checks per write |
7.2 Debugging Strategies
- Use
inspect-pageafter each mutation in debug mode. - Keep a failing fixture corpus for regression.
7.3 Performance Traps
Linear scans become expensive quickly; this project intentionally exposes that pain before indexing.
8. Extensions & Challenges
8.1 Beginner Extensions
- Add
statscommand showing fill factor. - Add optional in-place compaction command.
8.2 Intermediate Extensions
- Add append-only segment mode.
- Add CRC32 vs xxHash integrity mode toggle.
8.3 Advanced Extensions
- Add lightweight secondary index sidecar file.
- Add online format-upgrade utility.
9. Real-World Connections
9.1 Industry Applications
- Embedded local state stores.
- Offline-first mobile caches.
9.2 Related Open Source Projects
- SQLite format docs: https://www.sqlite.org/fileformat.html
9.3 Interview Relevance
This project prepares byte-level persistence, invariants, and durability reasoning questions.
10. Resources
10.1 Essential Reading
- SQLite file format documentation.
- PostgreSQL storage page layout docs.
10.2 Video Resources
- Database internals talks focused on page layout and storage.
10.3 Tools & Documentation
hexdump,od,strace, and sanitizer builds.
10.4 Related Projects in This Series
11. Self-Assessment Checklist
11.1 Understanding
- I can explain each page header field and why it exists.
- I can explain tombstone vs compaction tradeoffs.
11.2 Implementation
- CRUD behavior is deterministic across restarts.
- Corruption fixtures fail safely.
11.3 Growth
- I can defend format evolution strategy in an interview.
12. Submission / Completion Criteria
Minimum Viable Completion:
- CRUD + restart persistence
- Page inspector
- Corruption-safe parser
Full Completion:
- Plus deterministic test suite and fill-factor metrics
Excellence:
- Plus migration tooling and format compatibility tests