LEARN GRAPH DATABASES DEEP DIVE

Learn Graph Databases: From Zero to Graph Database Master

Goal: Deeply understand graph databases—from basic graph data structures to building a complete graph database engine from scratch in C, including storage engines, query processing, indexing, and ACID transactions.


Why Graph Databases Matter

Relational databases struggle when data is highly connected: every JOIN requires index lookups or table scans, and that cost grows with the size of the data. Graph databases solve this by storing relationships as first-class citizens—each node directly points to its neighbors (a concept called index-free adjacency), so each traversal hop is an O(1) pointer dereference instead of an O(log n) index lookup.

After completing these projects, you will:

  • Understand every byte in a graph database’s storage format
  • Know how queries traverse from node to node without index lookups
  • Be able to implement graph algorithms (BFS, DFS, Dijkstra, PageRank)
  • Build your own query parser and execution engine
  • Understand ACID transactions and concurrency control in graph contexts
  • Have built a working graph database from scratch in C

Core Concept Analysis

The Property Graph Model

A property graph consists of:

┌─────────────────────────────────────────────────────────────────┐
│                     PROPERTY GRAPH MODEL                        │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│   ┌─────────────────┐         KNOWS          ┌───────────────┐  │
│   │     :Person     │ ────────────────────▶  │    :Person    │  │
│   │  name: "Alice"  │    {since: 2015}       │  name: "Bob"  │  │
│   │  age: 30        │                        │  age: 28      │  │
│   └─────────────────┘                        └───────────────┘  │
│           │                                         │           │
│           │ WORKS_AT                                │ WORKS_AT  │
│           ▼                                         ▼           │
│   ┌─────────────────────────────────────────────────────────┐   │
│   │                      :Company                           │   │
│   │                   name: "TechCorp"                      │   │
│   │                   founded: 2010                         │   │
│   └─────────────────────────────────────────────────────────┘   │
│                                                                 │
│   NODES: Entities with labels (:Person, :Company)               │
│   EDGES: Directed relationships with types (KNOWS, WORKS_AT)    │
│   PROPERTIES: Key-value pairs on both nodes and edges           │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

Fundamental Concepts

  1. Graph Data Structures
    • Adjacency List: Each node stores a list of its neighbors
    • Adjacency Matrix: 2D array where matrix[i][j] = 1 if edge exists
    • Edge List: Simple list of (source, target) pairs
  2. Index-Free Adjacency
    • Each node directly references its neighbors via pointers
    • Traversing a relationship is O(1) pointer dereference
    • No index lookup required—the graph is the index
    • As graph grows, local traversal cost remains constant
  3. Native Graph Storage
    • Fixed-size records for nodes and relationships
    • Nodes: ID, first_relationship_id, first_property_id, labels
    • Relationships: ID, start_node, end_node, type, next_rel_start, next_rel_end
    • Properties stored separately as linked chains
  4. Graph Traversal Algorithms
    • BFS (Breadth-First Search): Level-by-level exploration, uses queue
    • DFS (Depth-First Search): Deep exploration first, uses stack/recursion
    • Dijkstra: Shortest weighted path, uses priority queue
    • A*: Heuristic-guided shortest path
  5. Query Processing
    • Pattern matching: (a)-[:KNOWS]->(b) finds all KNOWS relationships
    • Query planning: Choose optimal traversal order
    • Query execution: Walk the graph, collect results
  6. ACID in Graph Databases
    • Write-Ahead Logging (WAL): Log changes before applying them
    • MVCC: Multiple versions for concurrent reads
    • Locking: Prevent conflicting concurrent modifications

Project List

Projects are ordered from fundamental understanding to advanced implementations. Each project builds on the previous ones.


Project 1: In-Memory Graph Library (The Foundation)

  • File: LEARN_GRAPH_DATABASES_DEEP_DIVE.md
  • Main Programming Language: C
  • Alternative Programming Languages: Rust, Go, C++
  • Coolness Level: Level 3: Genuinely Clever
  • Business Potential: 1. The “Resume Gold”
  • Difficulty: Level 1: Beginner
  • Knowledge Area: Data Structures / Memory Management
  • Software or Tool: Building: Graph Library
  • Main Book: “C Programming: A Modern Approach” by K. N. King

What you’ll build: A C library that can create, modify, and traverse graphs in memory—supporting nodes with labels, edges with types, and properties on both.

Why it teaches graph databases: Every graph database starts here. Before you worry about persistence or queries, you need to understand how graphs live in memory. This project forces you to think about pointer relationships, memory layout, and the trade-offs between adjacency lists and adjacency matrices.

Core challenges you’ll face:

  • Designing the node/edge data structures → maps to understanding property graph model
  • Managing dynamic memory for growing graphs → maps to memory allocation strategies
  • Implementing efficient neighbor lookups → maps to index-free adjacency concept
  • Handling bidirectional edge references → maps to doubly-linked relationship lists

Key Concepts:

  • Pointers and Dynamic Memory: “C Programming: A Modern Approach” Chapter 17 - K. N. King
  • Linked Lists in C: “Mastering Algorithms with C” Chapter 5 - Kyle Loudon
  • Graph Data Structures: “Algorithms, Fourth Edition” Section 4.1 - Sedgewick & Wayne
  • Memory Layout: “Computer Systems: A Programmer’s Perspective” Chapter 9 - Bryant & O’Hallaron

Difficulty: Beginner
Time estimate: 1 week
Prerequisites:

  • Basic C programming (pointers, structs, malloc/free)
  • Understanding of linked lists
  • No prior graph knowledge required

Real world outcome:

$ ./graph_demo
Creating graph...
Added node 1: Person {name: "Alice", age: 30}
Added node 2: Person {name: "Bob", age: 28}
Added edge: (1)-[:KNOWS {since: 2015}]->(2)

Traversing from Alice:
  Alice --KNOWS--> Bob

Node count: 2
Edge count: 1
Memory used: 256 bytes

Implementation Hints:

Start with these core structures:

Node structure should contain:
- Unique ID (uint64_t)
- Pointer to first outgoing edge
- Pointer to first incoming edge
- Pointer to first property
- Label(s)

Edge structure should contain:
- Unique ID
- Start node pointer
- End node pointer
- Relationship type
- Next edge from same start node
- Next edge to same end node
- Pointer to first property

Property structure (linked list):
- Key (string)
- Value (union of types: int, float, string, bool)
- Next property pointer
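
A minimal C sketch of these three structures (field and type names are illustrative, not prescribed):

#include <stdint.h>

typedef enum { PROP_INT, PROP_FLOAT, PROP_STRING, PROP_BOOL } PropType;

typedef struct Property {
    char *key;                     /* property name                          */
    PropType type;
    union { int64_t i; double f; char *s; int b; } value;
    struct Property *next;         /* next property in the chain             */
} Property;

typedef struct Edge Edge;

typedef struct Node {
    uint64_t id;
    char *label;                   /* single label, for simplicity           */
    Edge *first_out;               /* head of outgoing-edge list             */
    Edge *first_in;                /* head of incoming-edge list             */
    Property *props;
} Node;

struct Edge {
    uint64_t id;
    char *type;                    /* relationship type, e.g. "KNOWS"        */
    Node *start;
    Node *end;
    Edge *next_from_start;         /* next edge sharing the same start node  */
    Edge *next_to_end;             /* next edge sharing the same end node    */
    Property *props;
};

With this layout, iterating all outgoing edges of a node n is just: for (Edge *e = n->first_out; e; e = e->next_from_start).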

Think about these questions:

  • How do you iterate all edges FROM a node? (Follow first_outgoing, then next pointers)
  • How do you iterate all edges TO a node? (Follow first_incoming chain)
  • Why do edges need both “next from start” AND “next to end” pointers?
  • What happens to edges when you delete a node?

Learning milestones:

  1. You can add nodes and edges → You understand basic graph structure
  2. You can traverse neighbors → You understand adjacency representation
  3. You can handle properties → You understand the property graph model
  4. You handle memory correctly (no leaks) → You understand ownership in C

Project 2: Graph Traversal Algorithms (BFS, DFS, Shortest Path)

  • File: LEARN_GRAPH_DATABASES_DEEP_DIVE.md
  • Main Programming Language: C
  • Alternative Programming Languages: Rust, Go, Python
  • Coolness Level: Level 3: Genuinely Clever
  • Business Potential: 1. The “Resume Gold”
  • Difficulty: Level 2: Intermediate
  • Knowledge Area: Graph Algorithms
  • Software or Tool: Building: Algorithm Library
  • Main Book: “Algorithms, Fourth Edition” by Sedgewick & Wayne

What you’ll build: Implementation of core graph algorithms—BFS, DFS, Dijkstra’s shortest path, and connected components—that operate on your in-memory graph library.

Why it teaches graph databases: Graph databases are only useful because of graph algorithms. When a user asks “find shortest path from A to B” or “find all friends-of-friends,” the database executes these algorithms. Understanding them deeply is essential.

Core challenges you’ll face:

  • Implementing BFS with proper queue management → maps to level-by-level exploration
  • Managing visited state without modifying nodes → maps to external state tracking
  • Priority queue for Dijkstra → maps to weighted path optimization
  • Detecting cycles during DFS → maps to graph classification problems

Key Concepts:

  • BFS Algorithm: “Algorithms, Fourth Edition” Section 4.1 - Sedgewick & Wayne
  • DFS Algorithm: “Introduction to Algorithms” (CLRS) Chapter 22
  • Dijkstra’s Algorithm: “Algorithms, Fourth Edition” Section 4.4 - Sedgewick & Wayne
  • Priority Queues (Heaps): “Algorithms, Fourth Edition” Section 2.4 - Sedgewick & Wayne

Difficulty: Intermediate
Time estimate: 1 week
Prerequisites:

  • Project 1 completed (in-memory graph library)
  • Understanding of queues and stacks
  • Basic algorithm complexity analysis

Real world outcome:

$ ./graph_algo social_network.graph
Loading graph with 1000 nodes, 5000 edges...

BFS from node 1:
  Level 0: Alice
  Level 1: Bob, Carol, Dave
  Level 2: Eve, Frank, Grace...
  Visited 847 nodes in 6 levels

DFS from node 1:
  Path: Alice -> Bob -> Eve -> ... (depth 23)

Shortest path from Alice to Zack:
  Alice -[2]-> Bob -[3]-> Carol -[1]-> Zack
  Total weight: 6
  Path length: 3 edges

Connected components: 3
  Component 1: 847 nodes
  Component 2: 142 nodes
  Component 3: 11 nodes

Implementation Hints:

For BFS, you need an external data structure:

Queue: FIFO structure for nodes to visit
Visited set: Hash set or boolean array to track visited nodes
Parent map: To reconstruct the path afterward

The key insight: BFS visits nodes in order of distance from start. When you first reach a node, you’ve found the shortest path (in unweighted graphs).

For Dijkstra, the difference is:

  • Use a priority queue (min-heap) instead of a regular queue
  • Priority = cumulative distance from start
  • Can update priority when finding shorter path (relaxation)
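
For concreteness, here is a compact BFS sketch in C over the Project 1 structures (it assumes the Node/Edge fields from the earlier sketch and that node ids are dense in [0, node_count)):

#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

void bfs(Node *start, uint64_t node_count, void (*visit)(Node *node, int level)) {
    bool  *seen  = calloc(node_count, sizeof(bool));
    Node **queue = malloc(node_count * sizeof(Node *));
    int   *level = malloc(node_count * sizeof(int));
    size_t head = 0, tail = 0;

    queue[tail] = start; level[tail++] = 0;
    seen[start->id] = true;

    while (head < tail) {
        Node *n = queue[head];
        int   d = level[head++];
        visit(n, d);
        for (Edge *e = n->first_out; e; e = e->next_from_start) {
            Node *m = e->end;
            if (!seen[m->id]) {               /* first time we reach m:        */
                seen[m->id] = true;           /* this is its shortest distance */
                queue[tail] = m; level[tail++] = d + 1;
            }
        }
    }
    free(seen); free(queue); free(level);
}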

Ask yourself:

  • What’s the time complexity of BFS? (O(V + E))
  • Why doesn’t BFS work for weighted graphs?
  • What makes Dijkstra fail with negative edges?
  • How would you modify DFS to detect cycles?

Learning milestones:

  1. BFS finds shortest unweighted path → You understand level-order traversal
  2. DFS explores deeply and backtracks → You understand stack-based traversal
  3. Dijkstra handles weights → You understand greedy algorithms and priority queues
  4. You can find connected components → You understand graph structure analysis

Project 3: Graph Serialization & File Persistence

  • File: LEARN_GRAPH_DATABASES_DEEP_DIVE.md
  • Main Programming Language: C
  • Alternative Programming Languages: Rust, C++, Go
  • Coolness Level: Level 3: Genuinely Clever
  • Business Potential: 2. The “Micro-SaaS / Pro Tool”
  • Difficulty: Level 2: Intermediate
  • Knowledge Area: File I/O / Binary Formats / Persistence
  • Software or Tool: Building: Graph File Format
  • Main Book: “Computer Systems: A Programmer’s Perspective” by Bryant & O’Hallaron

What you’ll build: A binary file format for persisting graphs to disk, with save/load functionality. Your format will use fixed-size records like real graph databases do.

Why it teaches graph databases: Neo4j stores nodes as 9-byte fixed-size records. Understanding why fixed-size records matter (O(1) lookups via offset = id * record_size) is fundamental. This project bridges in-memory graphs to persistent storage.

Core challenges you’ll face:

  • Designing a binary file format → maps to understanding database storage layouts
  • Fixed-size records vs. variable-size data → maps to property storage strategies
  • Handling string storage separately → maps to string table/pool concepts
  • File offset arithmetic → maps to direct record access

Key Concepts:

  • Binary File I/O in C: “C Programming: A Modern Approach” Chapter 22 - K. N. King
  • Fixed-Size Record Design: “Database Internals” Chapter 3 - Alex Petrov
  • File Format Design: “Practical Binary Analysis” Chapter 2 - Dennis Andriesse
  • Endianness: “Computer Systems: A Programmer’s Perspective” Section 2.1 - Bryant & O’Hallaron

Difficulty: Intermediate
Time estimate: 1 week
Prerequisites:

  • Project 1 completed
  • Basic understanding of binary file I/O (fread, fwrite)
  • Understanding of struct packing and alignment

Real world outcome:

$ ./graph_save social.graph
Saving graph to social.graph.db
  Nodes: 1000 (9000 bytes)
  Edges: 5000 (50000 bytes)
  Properties: 8500 (variable, in properties.db)
  String pool: 45000 bytes
Total: 104000 bytes
Save complete.

$ ls -la *.db
-rw-r--r--  1 user  staff   9000 Dec 21 10:00 nodes.db
-rw-r--r--  1 user  staff  50000 Dec 21 10:00 rels.db
-rw-r--r--  1 user  staff  45000 Dec 21 10:00 strings.db

$ ./graph_load social.graph
Loading from social.graph.db...
Loaded 1000 nodes, 5000 edges in 12ms
Random lookup node 500: O(1), 0.001ms

Implementation Hints:

Design your file format like Neo4j does:

Node Record (9 bytes):
┌─────────┬──────────────┬──────────────┐
│ in_use  │ first_rel_id │ first_prop_id│
│ 1 byte  │   4 bytes    │    4 bytes   │
└─────────┴──────────────┴──────────────┘
(Labels are variable-length, so keep them out of the fixed record: store them
 via the property/string storage and reference them by ID.)

To find node N: seek to offset N * 9, read 9 bytes

Relationship Record (25 bytes):
┌─────────┬──────────┬──────────┬──────────┬─────────────┬─────────────┬──────────┐
│ in_use  │ start_id │  end_id  │ rel_type │ next_rel_s  │ next_rel_e  │ prop_id  │
│ 1 byte  │ 4 bytes  │ 4 bytes  │ 4 bytes  │  4 bytes    │   4 bytes   │ 4 bytes  │
└─────────┴──────────┴──────────┴──────────┴─────────────┴─────────────┴──────────┘

For variable-length data (strings), use a separate string pool file:

  • Store offset into string pool in the property record
  • String pool: length-prefixed strings back to back
  • Or: null-terminated strings with index file
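
A small C sketch of the fixed-record idea for the 9-byte node record above (#pragma pack keeps the struct exactly 9 bytes; names are illustrative):

#include <stdint.h>
#include <stdio.h>

#pragma pack(push, 1)
typedef struct {
    uint8_t  in_use;
    uint32_t first_rel_id;
    uint32_t first_prop_id;
} NodeRecord;
#pragma pack(pop)

/* O(1) lookup: seek to id * sizeof(NodeRecord) and read one record. */
int read_node_record(FILE *f, uint32_t id, NodeRecord *out) {
    if (fseek(f, (long)id * (long)sizeof(NodeRecord), SEEK_SET) != 0) return -1;
    return fread(out, sizeof(NodeRecord), 1, f) == 1 ? 0 : -1;
}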

Questions to consider:

  • Why not store strings inline in nodes/edges?
  • What happens when you delete a node? (Mark as unused, don’t shift)
  • How do you handle graph updates? (In-place for fixed fields, append for strings)
  • What about alignment and padding?

Learning milestones:

  1. You can save and load a graph → You understand file I/O
  2. Random access by ID is O(1) → You understand fixed-size record design
  3. Variable data is stored separately → You understand storage decoupling
  4. Format is portable across systems → You understand endianness and packing

Project 4: Memory-Mapped Graph Storage

  • File: LEARN_GRAPH_DATABASES_DEEP_DIVE.md
  • Main Programming Language: C
  • Alternative Programming Languages: Rust, C++
  • Coolness Level: Level 4: Hardcore Tech Flex
  • Business Potential: 2. The “Micro-SaaS / Pro Tool”
  • Difficulty: Level 3: Advanced
  • Knowledge Area: Operating Systems / Virtual Memory / mmap
  • Software or Tool: Building: mmap-based storage
  • Main Book: “The Linux Programming Interface” by Michael Kerrisk

What you’ll build: A graph storage engine that uses memory-mapped files (mmap), allowing the OS to handle paging and making your graph appear as a simple array in memory.

Why it teaches graph databases: Many databases (SQLite, LMDB) use mmap for I/O. It’s a powerful technique but comes with trade-offs (less control over flushing, OS-dependent behavior). Understanding when mmap works well vs. when explicit I/O is better is crucial database knowledge.

Core challenges you’ll face:

  • Using mmap() correctly → maps to understanding virtual memory
  • Handling file growth (mremap or remap) → maps to dynamic storage expansion
  • Ensuring data durability (msync) → maps to understanding persistence guarantees
  • Dealing with page faults → maps to understanding I/O cost

Key Concepts:

  • mmap System Call: “The Linux Programming Interface” Chapter 49 - Michael Kerrisk
  • Virtual Memory: “Computer Systems: A Programmer’s Perspective” Chapter 9 - Bryant & O’Hallaron
  • File-Backed Memory: “Advanced Programming in the UNIX Environment” Chapter 14 - Stevens & Rago
  • msync and Durability: SQLite mmap documentation

Difficulty: Advanced
Time estimate: 1-2 weeks
Prerequisites:

  • Project 3 completed (file-based persistence)
  • Understanding of virtual memory concepts
  • Familiarity with Unix system calls

Real world outcome:

$ ./mmap_graph create large.graph 10000000
Creating mmap-backed graph...
Mapped file: 1.2 GB
Node capacity: 10,000,000
Ready.

$ ./mmap_graph load large.graph
Loading via mmap...
Mapped in 0.8ms (just setting up mapping, no actual I/O!)

$ ./mmap_graph query large.graph
> get_node 5000000
Node 5000000 accessed in 0.02ms (first access, page fault)
> get_node 5000001
Node 5000001 accessed in 0.001ms (same page, cached)

> traverse 1 depth 3
Traversed 15,847 nodes in 45ms
Pages faulted: 234

Implementation Hints:

Basic mmap usage in C:

Open file with proper size
mmap() with MAP_SHARED for persistence
Cast pointer to your record array
Access records directly via pointer arithmetic
msync() to ensure writes reach disk
munmap() when done
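
Those steps in a hedged C sketch (error handling trimmed; nodes.db and the packed NodeRecord layout are assumptions carried over from the Project 3 sketch, and the file must already exist at its full size):

#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

#pragma pack(push, 1)
typedef struct { uint8_t in_use; uint32_t first_rel_id, first_prop_id; } NodeRecord;
#pragma pack(pop)

int main(void) {
    int fd = open("nodes.db", O_RDWR);
    if (fd < 0) { perror("open"); return 1; }

    struct stat st;
    fstat(fd, &st);

    void *base = mmap(NULL, st.st_size, PROT_READ | PROT_WRITE,
                      MAP_SHARED, fd, 0);           /* MAP_SHARED: writes go to file */
    if (base == MAP_FAILED) { perror("mmap"); return 1; }

    NodeRecord *nodes = (NodeRecord *)base;
    nodes[42].in_use = 1;                           /* plain array access; the OS    */
                                                    /* pages the record in on demand */
    msync(base, st.st_size, MS_SYNC);               /* force dirty pages to disk     */
    munmap(base, st.st_size);
    close(fd);
    return 0;
}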

The beauty of mmap: your nodes become a simple array access:

Node* nodes = (Node*)mmap(...);
Node n = nodes[node_id];  // OS handles loading from disk

But understand the dangers:

  • You can’t control when dirty pages are written
  • Page faults happen at unpredictable times
  • No control over eviction order
  • Signal handlers needed for SIGBUS

Questions to think about:

  • What happens if you access node 1, then node 1,000,000? (Two page faults)
  • How big is a page? (Usually 4KB) How many nodes fit in one page?
  • When should you call msync()? What flags (MS_SYNC vs MS_ASYNC)?
  • Why does SQLite make mmap optional?

Learning milestones:

  1. Graph loads “instantly” → You understand memory mapping
  2. You observe page faults during traversal → You understand demand paging
  3. msync ensures durability → You understand persistence guarantees
  4. You understand mmap’s limitations → You’re ready for buffer pools

Project 5: Buffer Pool Manager

  • File: LEARN_GRAPH_DATABASES_DEEP_DIVE.md
  • Main Programming Language: C
  • Alternative Programming Languages: Rust, C++
  • Coolness Level: Level 4: Hardcore Tech Flex
  • Business Potential: 3. The “Service & Support” Model
  • Difficulty: Level 4: Expert
  • Knowledge Area: Database Internals / Caching / I/O Management
  • Software or Tool: Building: Buffer Pool
  • Main Book: “Database Internals” by Alex Petrov

What you’ll build: A buffer pool manager that caches disk pages in memory, implementing LRU eviction, pin counting, and dirty page tracking—the heart of any serious database.

Why it teaches graph databases: The buffer pool is what separates toy databases from real ones. It gives you control over memory usage, I/O scheduling, and durability. Neo4j, SQLite, and every serious database has a buffer pool. This is where databases get fast.

Core challenges you’ll face:

  • Page caching with fixed memory → maps to memory management constraints
  • LRU eviction policy → maps to cache management strategies
  • Pin counting for safe page access → maps to concurrency-safe reference counting
  • Dirty page tracking and flushing → maps to write-back strategies

Key Concepts:

  • Buffer Pool Architecture: “Database Internals” Chapter 5 - Alex Petrov
  • LRU Cache Implementation: “Algorithms, Fourth Edition” Section 1.3 - Sedgewick & Wayne
  • Page Management: “Operating Systems: Three Easy Pieces” Chapter 22 - OSTEP
  • Cache-Conscious Design: “Computer Systems: A Programmer’s Perspective” Chapter 6 - Bryant & O’Hallaron

Difficulty: Expert
Time estimate: 2 weeks
Prerequisites:

  • Projects 3-4 completed
  • Understanding of cache concepts
  • Solid C programming (hash tables, linked lists)

Real world outcome:

$ ./buffered_graph --pool-size 1000 --page-size 4096
Buffer pool initialized: 1000 pages, ~4MB memory

> load graph.db
Graph loaded. Pages on disk: 50,000
Pages in buffer: 0 (cold start)

> traverse node 1 depth 5
Traversed 12,453 nodes
Buffer stats:
  Pages fetched from disk: 234
  Pages found in buffer: 11,219 (97.9% hit rate)
  Dirty pages: 0
  Pinned pages: 0

> add_edge 1 500 "KNOWS"
Edge added.
Buffer stats:
  Dirty pages: 2 (will be written on checkpoint)

> checkpoint
Flushing 2 dirty pages to disk...
Done.

Implementation Hints:

Core buffer pool components:

Page structure:
- page_id (which disk page this is)
- data[PAGE_SIZE] (the actual bytes)
- pin_count (number of active users)
- is_dirty (has been modified)
- prev/next (for LRU list)

Buffer Pool:
- pages[POOL_SIZE] (the page slots)
- page_table (hash map: disk_page_id -> buffer_slot)
- lru_head, lru_tail (doubly-linked list of unpinned pages)

Key operations:

  • fetch_page(page_id): Get page, loading from disk if needed
  • unpin_page(page_id, is_dirty): Release page, mark dirty if modified
  • flush_page(page_id): Write dirty page to disk
  • evict(): Find LRU unpinned page, write if dirty, reclaim slot

The pin count is crucial: you can’t evict a page while someone is using it!
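
A deliberately tiny, self-contained sketch of fetch/unpin with LRU-by-counter eviction (a real pool would use a hash table plus a doubly-linked LRU list instead of these linear scans; the struct is assumed zero-initialized):

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define PAGE_SIZE 4096
#define POOL_SIZE 8                 /* tiny pool, just for illustration          */

typedef struct {
    uint64_t page_id;
    bool     in_use;                /* slot holds a valid page                   */
    bool     is_dirty;              /* modified since it was read from disk      */
    int      pin_count;             /* active users; only 0 is evictable         */
    uint64_t last_used;             /* crude LRU via a global access counter     */
    char     data[PAGE_SIZE];
} Page;

typedef struct {
    Page     pages[POOL_SIZE];
    uint64_t clock;                 /* incremented on every access               */
    FILE    *disk;                  /* backing file, one 4KB page per slot       */
} BufferPool;

static void disk_io(FILE *f, uint64_t page_id, char *buf, bool write) {
    fseek(f, (long)(page_id * PAGE_SIZE), SEEK_SET);
    if (write) fwrite(buf, PAGE_SIZE, 1, f);
    else       fread(buf, PAGE_SIZE, 1, f);
}

/* Pin and return the requested page, reading it from disk on a miss. */
Page *fetch_page(BufferPool *bp, uint64_t page_id) {
    Page *victim = NULL;
    for (int i = 0; i < POOL_SIZE; i++) {
        Page *p = &bp->pages[i];
        if (p->in_use && p->page_id == page_id) {          /* cache hit            */
            p->pin_count++;
            p->last_used = ++bp->clock;
            return p;
        }
        if (p->pin_count == 0 &&                           /* candidate victim:    */
            (victim == NULL || p->last_used < victim->last_used))
            victim = p;                                    /* least recently used  */
    }
    if (victim == NULL) return NULL;                       /* every slot is pinned */
    if (victim->in_use && victim->is_dirty)                /* write back before    */
        disk_io(bp->disk, victim->page_id, victim->data, true);   /* reusing slot  */
    disk_io(bp->disk, page_id, victim->data, false);
    victim->page_id   = page_id;
    victim->in_use    = true;
    victim->is_dirty  = false;
    victim->pin_count = 1;
    victim->last_used = ++bp->clock;
    return victim;
}

/* Release a page; mark it dirty if the caller modified it. */
void unpin_page(Page *p, bool dirty) {
    p->is_dirty = p->is_dirty || dirty;
    if (p->pin_count > 0) p->pin_count--;
}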

Questions to explore:

  • What happens if all pages are pinned? (Can’t evict, must wait or error)
  • What’s the difference between LRU and LRU-K?
  • How does dirty page batching improve performance?
  • What’s a checkpoint and why is it important?

Learning milestones:

  1. Pages are cached in memory → You understand caching basics
  2. LRU eviction works correctly → You understand eviction policies
  3. Pin counts prevent unsafe eviction → You understand concurrency safety
  4. Dirty pages are tracked and flushed → You understand write-back caching

Project 6: B+Tree Index for Node Lookups

  • File: LEARN_GRAPH_DATABASES_DEEP_DIVE.md
  • Main Programming Language: C
  • Alternative Programming Languages: Rust, C++, Go
  • Coolness Level: Level 4: Hardcore Tech Flex
  • Business Potential: 3. The “Service & Support” Model
  • Difficulty: Level 4: Expert
  • Knowledge Area: Data Structures / Indexing / Databases
  • Software or Tool: Building: B+Tree Index
  • Main Book: “Database Internals” by Alex Petrov

What you’ll build: A disk-based B+Tree index that allows fast lookups of nodes by property values (e.g., “find all nodes where name = ‘Alice’”).

Why it teaches graph databases: While index-free adjacency makes traversals fast, finding your starting nodes still requires indexes. “Find the person named Alice” needs a B+Tree lookup before you can start traversing. Every graph database supports property indexes.

Core challenges you’ll face:

  • Implementing B+Tree insert with splits → maps to balanced tree algorithms
  • Disk-based nodes with fixed-size pages → maps to database page design
  • Range scans via leaf traversal → maps to understanding B+Tree structure
  • Handling variable-length keys → maps to key encoding strategies

Key Concepts:

  • B+Tree Structure: “Database Internals” Chapter 4 - Alex Petrov
  • B-Tree Algorithms: “Introduction to Algorithms” (CLRS) Chapter 18
  • Disk-Based Trees: Let’s Build a Simple Database - Part 7
  • Key Encoding: “Designing Data-Intensive Applications” Chapter 3 - Martin Kleppmann

Difficulty: Expert
Time estimate: 2-3 weeks
Prerequisites:

  • Project 5 completed (buffer pool)
  • Understanding of binary search trees
  • Strong grasp of recursion

Real world outcome:

$ ./btree_index create name_index.db
B+Tree index created. Order: 100 (fits 4KB page)

$ ./btree_index bulk_load nodes.csv
Loading 1,000,000 entries...
Tree height: 3
Internal nodes: 127
Leaf nodes: 10,214
Load time: 8.3 seconds

$ ./btree_index lookup name_index.db "Alice"
Found key "Alice" at node_id: 42
Lookup time: 0.02ms (3 page reads)

$ ./btree_index range name_index.db "A" "B"
Range scan A-B:
  Aaron -> node 15
  Abel -> node 892
  ...
  Beatrice -> node 4521
Found 3,847 entries in 12ms

Implementation Hints:

B+Tree structure:

Internal Node:
┌────────┬────────┬────────┬────────┬────────┐
│  key1  │  key2  │  key3  │  ...   │  keyN  │
└────────┴────────┴────────┴────────┴────────┘
    │        │        │                 │
   ptr0     ptr1     ptr2             ptrN
    ↓        ↓        ↓                 ↓
  child0  child1   child2           childN

Leaf Node (also linked for range scans):
┌────────────┬────────────┬────────────┐
│ key1:val1  │ key2:val2  │ key3:val3  │ → next_leaf
└────────────┴────────────┴────────────┘

For graph DB: key = property value, val = node_id

Insert algorithm outline:

  1. Search to find correct leaf
  2. If leaf has space, insert key-value
  3. If leaf is full, split: create new leaf, distribute keys
  4. Propagate split upward: parent gets new key, may also split
  5. If root splits, create new root
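
A C sketch of the lookup path only, with keys as property values and leaf values as node ids (in-memory for clarity; in the disk-based version each BTNode is a page and the child pointers become page ids resolved through your buffer pool):

#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define ORDER 100                      /* max keys per node (fits a 4KB page)  */

typedef struct BTNode {
    bool     is_leaf;
    int      nkeys;
    char    *keys[ORDER];              /* sorted property values               */
    union {
        struct BTNode *children[ORDER + 1];  /* internal node: child pointers  */
        uint64_t       node_ids[ORDER];      /* leaf: graph node ids           */
    } u;
    struct BTNode *next_leaf;          /* leaf chain for range scans           */
} BTNode;

/* Descend from the root to the leaf that must contain `key`, then scan it.
 * Returns true and fills *out if the key is present. */
bool btree_lookup(BTNode *root, const char *key, uint64_t *out) {
    BTNode *n = root;
    while (!n->is_leaf) {
        int i = 0;
        while (i < n->nkeys && strcmp(key, n->keys[i]) >= 0) i++;
        n = n->u.children[i];          /* keys[i-1] <= key < keys[i]           */
    }
    for (int i = 0; i < n->nkeys; i++)
        if (strcmp(key, n->keys[i]) == 0) { *out = n->u.node_ids[i]; return true; }
    return false;
}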

Key questions:

  • Why B+Tree and not B-Tree? (All values in leaves, better for range scans)
  • What determines the “order” of the tree? (How many keys fit in a page)
  • How do you handle duplicate keys? (Allow duplicates in leaves, or use lists)
  • What happens during concurrent access? (Need latching/locking—future project!)

Learning milestones:

  1. Lookups are O(log n) → You understand balanced tree lookups
  2. Inserts handle splits correctly → You understand tree rebalancing
  3. Range scans follow leaf links → You understand B+Tree leaf chain
  4. Tree survives restart → You’ve integrated with buffer pool

Project 7: Simple Query Language Parser

  • File: LEARN_GRAPH_DATABASES_DEEP_DIVE.md
  • Main Programming Language: C
  • Alternative Programming Languages: Rust, Go, Python
  • Coolness Level: Level 4: Hardcore Tech Flex
  • Business Potential: 3. The “Service & Support” Model
  • Difficulty: Level 3: Advanced
  • Knowledge Area: Parsing / Compilers / Language Design
  • Software or Tool: Building: Query Parser
  • Main Book: “Writing a C Compiler” by Nora Sandler

What you’ll build: A parser for a simplified graph query language inspired by Cypher. Users can write queries like MATCH (a:Person)-[:KNOWS]->(b) WHERE a.name = "Alice" RETURN b.name.

Why it teaches graph databases: Cypher is to graph databases what SQL is to relational databases. Building a parser teaches you how query languages work, how patterns are represented internally, and prepares you for query execution.

Core challenges you’ll face:

  • Tokenizing the query string → maps to lexical analysis
  • Parsing graph patterns → maps to recursive descent parsing
  • Building an AST → maps to abstract syntax tree design
  • Handling property predicates → maps to expression parsing

Key Concepts:

  • Lexical Analysis: “Writing a C Compiler” Chapter 1 - Nora Sandler
  • Recursive Descent Parsing: “Language Implementation Patterns” Chapter 2 - Terence Parr
  • AST Design: “Compilers: Principles and Practice” Chapter 5 - Dave & Dave
  • Cypher Syntax: openCypher Specification

Difficulty: Advanced
Time estimate: 2 weeks
Prerequisites:

  • Basic understanding of parsing concepts
  • String manipulation in C
  • Familiarity with tree data structures

Real world outcome:

$ ./query_parser
graph> MATCH (a:Person)-[:KNOWS]->(b:Person) WHERE a.name = "Alice" RETURN b

Tokens:
  MATCH, LPAREN, IDENT(a), COLON, IDENT(Person), RPAREN,
  DASH, LBRACKET, COLON, IDENT(KNOWS), RBRACKET, ARROW,
  LPAREN, IDENT(b), COLON, IDENT(Person), RPAREN,
  WHERE, IDENT(a), DOT, IDENT(name), EQ, STRING("Alice"),
  RETURN, IDENT(b)

AST:
  MatchClause
  ├── Pattern
  │   ├── Node(var=a, label=Person)
  │   ├── Edge(type=KNOWS, direction=OUT)
  │   └── Node(var=b, label=Person)
  ├── Where
  │   └── Equals
  │       ├── PropertyAccess(a, name)
  │       └── StringLiteral("Alice")
  └── Return
      └── Variable(b)

Parse successful!

Implementation Hints:

Define your grammar (simplified):

query       := match_clause where_clause? return_clause
match_clause := "MATCH" pattern
pattern     := node_pattern (edge_pattern node_pattern)*
node_pattern := "(" variable? (":" label)? properties? ")"
edge_pattern := ("-[" edge_type "]->") | ("<-[" edge_type "]-")
where_clause := "WHERE" expression
return_clause := "RETURN" expression_list

Lexer produces tokens:

  • Keywords: MATCH, WHERE, RETURN, AND, OR
  • Symbols: (, ), [, ], -, ->, <-, :, ., =, <>
  • Literals: strings, numbers
  • Identifiers: variable and label names

Parser builds AST nodes:

MatchClause { patterns, where, return }
Pattern { nodes[], edges[] }
NodePattern { variable, label, properties }
EdgePattern { type, direction, variable }
WhereExpr { left, op, right }
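
A hand-rolled lexer can stay small. This sketch covers only the tokens listed above (no '<-' arrow, no numbers) and uses fixed-size token text for brevity:

#include <ctype.h>
#include <string.h>

typedef enum {
    TOK_MATCH, TOK_WHERE, TOK_RETURN, TOK_IDENT, TOK_STRING,
    TOK_LPAREN, TOK_RPAREN, TOK_LBRACKET, TOK_RBRACKET,
    TOK_COLON, TOK_DOT, TOK_EQ, TOK_DASH, TOK_ARROW, TOK_EOF
} TokenType;

typedef struct {
    TokenType type;
    char      text[64];                 /* identifier or string contents       */
} Token;

/* Return the next token starting at *src and advance the cursor. */
Token next_token(const char **src) {
    const char *s = *src;
    while (isspace((unsigned char)*s)) s++;
    Token t = { TOK_EOF, "" };
    if (*s == '\0')                     { *src = s; return t; }
    if      (*s == '(')                 { t.type = TOK_LPAREN;   s++; }
    else if (*s == ')')                 { t.type = TOK_RPAREN;   s++; }
    else if (*s == '[')                 { t.type = TOK_LBRACKET; s++; }
    else if (*s == ']')                 { t.type = TOK_RBRACKET; s++; }
    else if (*s == ':')                 { t.type = TOK_COLON;    s++; }
    else if (*s == '.')                 { t.type = TOK_DOT;      s++; }
    else if (*s == '=')                 { t.type = TOK_EQ;       s++; }
    else if (*s == '-' && s[1] == '>')  { t.type = TOK_ARROW;    s += 2; }
    else if (*s == '-')                 { t.type = TOK_DASH;     s++; }
    else if (*s == '"') {                              /* string literal        */
        s++; int i = 0;
        while (*s && *s != '"' && i < 63) t.text[i++] = *s++;
        if (*s == '"') s++;
        t.text[i] = '\0'; t.type = TOK_STRING;
    } else if (isalpha((unsigned char)*s)) {           /* keyword or identifier */
        int i = 0;
        while ((isalnum((unsigned char)*s) || *s == '_') && i < 63) t.text[i++] = *s++;
        t.text[i] = '\0';
        if      (strcmp(t.text, "MATCH")  == 0) t.type = TOK_MATCH;
        else if (strcmp(t.text, "WHERE")  == 0) t.type = TOK_WHERE;
        else if (strcmp(t.text, "RETURN") == 0) t.type = TOK_RETURN;
        else                                    t.type = TOK_IDENT;
    } else {
        s++;                            /* unrecognized character: skipped here */
    }
    *src = s;
    return t;
}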

Think about:

  • How do you handle syntax errors gracefully?
  • What does your AST look like for (a)-[:KNOWS]->(b)-[:LIKES]->(c)?
  • How would you add support for optional patterns OPTIONAL MATCH?

Learning milestones:

  1. Lexer produces correct tokens → You understand tokenization
  2. Parser builds valid AST → You understand grammar and parsing
  3. Error messages are helpful → You understand error recovery
  4. AST can represent complex patterns → You’re ready for query execution

Project 8: Query Execution Engine

  • File: LEARN_GRAPH_DATABASES_DEEP_DIVE.md
  • Main Programming Language: C
  • Alternative Programming Languages: Rust, Go, C++
  • Coolness Level: Level 4: Hardcore Tech Flex
  • Business Potential: 3. The “Service & Support” Model
  • Difficulty: Level 4: Expert
  • Knowledge Area: Query Processing / Execution Engines
  • Software or Tool: Building: Query Executor
  • Main Book: “Database Internals” by Alex Petrov

What you’ll build: An execution engine that takes parsed ASTs and executes them against your graph storage, returning results. This includes pattern matching, filtering, and projecting results.

Why it teaches graph databases: This is where everything comes together. Your parser creates plans, your storage provides data, and the execution engine connects them. Understanding how pattern matching translates to traversal operations is core graph database knowledge.

Core challenges you’ll face:

  • Pattern matching algorithm → maps to subgraph isomorphism
  • Variable binding during traversal → maps to execution context management
  • Backtracking for complex patterns → maps to search space exploration
  • Integrating with storage layer → maps to I/O-efficient execution

Key Concepts:

  • Query Execution Models: “Database Internals” Chapter 8 - Alex Petrov
  • Pattern Matching: Cypher: An Evolving Query Language for Property Graphs
  • Iterator Model: “Architecture of a Database System” Section 4.2 - Hellerstein et al.
  • Backtracking Algorithms: “Algorithms, Fourth Edition” Section 6.5 - Sedgewick & Wayne

Difficulty: Expert
Time estimate: 2-3 weeks
Prerequisites:

  • Project 7 completed (parser)
  • Projects 1-5 completed (storage layer)
  • Understanding of iterator patterns

Real world outcome:

$ ./graph_db my_graph.db
Connected to graph: 100,000 nodes, 500,000 edges

graph> MATCH (a:Person {name: "Alice"})-[:KNOWS]->(b:Person) RETURN b.name, b.age

Execution Plan:
  1. IndexScan(Person, name="Alice") -> a
  2. ExpandOut(a, KNOWS) -> b
  3. Filter(b.label = Person)
  4. Project(b.name, b.age)

Executing...

| b.name    | b.age |
|-----------|-------|
| Bob       | 28    |
| Carol     | 32    |
| Dave      | 25    |

3 rows returned in 2.3ms
Nodes scanned: 1 (via index)
Edges traversed: 47

Implementation Hints:

Execution uses an iterator/operator model:

Operator interface:
  init()    - Prepare for execution
  next()    - Get next result tuple
  close()   - Cleanup

Common operators:
  IndexScan - Find starting nodes via index
  AllNodesScan - Iterate all nodes (expensive!)
  Expand - Follow edges from current node
  Filter - Check predicate, pass or skip
  Project - Select which fields to output

Pattern matching approach:

For pattern (a)-[:KNOWS]->(b)-[:WORKS_AT]->(c):
1. Find all starting points for 'a'
2. For each 'a', expand KNOWS edges to find 'b' candidates
3. For each 'b', expand WORKS_AT edges to find 'c' candidates
4. Each complete (a,b,c) tuple is a result
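
A C sketch of the iterator (Volcano-style) interface; the Row shape and the 8-binding limit are arbitrary simplifications:

#include <stdbool.h>
#include <stdint.h>

/* One row flowing between operators: the ids bound to pattern variables. */
typedef struct {
    uint64_t bindings[8];               /* node/edge ids bound to a, b, c, ... */
} Row;

/* Every operator exposes the same three calls and pulls from its child. */
typedef struct Operator {
    void (*init) (struct Operator *self);
    bool (*next) (struct Operator *self, Row *out);   /* false = exhausted     */
    void (*close)(struct Operator *self);
    struct Operator *child;             /* upstream operator (NULL for scans)  */
    void *state;                        /* operator-specific state             */
} Operator;

/* Pull rows through the whole plan: the root repeatedly asks its child chain. */
static long run_plan(Operator *root, void (*emit)(const Row *row)) {
    long count = 0;
    Row row;
    root->init(root);
    while (root->next(root, &row)) { emit(&row); count++; }
    root->close(root);
    return count;
}

IndexScan, Expand, Filter, and Project each implement this interface; a plan is just a chain of Operator structs with the scan at the bottom.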

Think about:

  • What if pattern has no indexed start point? (Full scan, then filter)
  • How do you handle cycles in the pattern?
  • What’s the difference between DFS and BFS execution order?
  • How would you add aggregation (COUNT, SUM)?

Learning milestones:

  1. Simple patterns execute → You understand basic pattern matching
  2. Filters work correctly → You understand predicate evaluation
  3. Multi-hop patterns work → You understand traversal chaining
  4. Query uses index when available → You understand query optimization basics

Project 9: Write-Ahead Logging (WAL)

  • File: LEARN_GRAPH_DATABASES_DEEP_DIVE.md
  • Main Programming Language: C
  • Alternative Programming Languages: Rust, C++
  • Coolness Level: Level 4: Hardcore Tech Flex
  • Business Potential: 3. The “Service & Support” Model
  • Difficulty: Level 4: Expert
  • Knowledge Area: ACID / Durability / Recovery
  • Software or Tool: Building: WAL System
  • Main Book: “Database Internals” by Alex Petrov

What you’ll build: A Write-Ahead Log (WAL) that records all modifications before they’re applied to data files, enabling crash recovery and durability guarantees.

Why it teaches graph databases: Durability (the “D” in ACID) requires WAL. Without it, a crash can corrupt your database. Understanding the “log first, apply later” principle is fundamental to all database systems.

Core challenges you’ll face:

  • Append-only log format design → maps to log record structure
  • Ensuring log is durable before commit → maps to fsync semantics
  • Crash recovery algorithm → maps to redo/undo processing
  • Log truncation and checkpointing → maps to log space management

Key Concepts:

  • Write-Ahead Logging: “Database Internals” Chapter 6 - Alex Petrov
  • Crash Recovery: “Database Internals” Chapter 13 - Alex Petrov
  • ARIES Algorithm: Architecture of a Database System Section 6
  • fsync Semantics: “The Linux Programming Interface” Chapter 13 - Michael Kerrisk

Difficulty: Expert
Time estimate: 2 weeks
Prerequisites:

  • Project 5 completed (buffer pool with dirty tracking)
  • Understanding of file system durability
  • Solid C programming

Real world outcome:

$ ./graph_db_wal my_graph.db
WAL enabled: wal.log

graph> BEGIN
graph> CREATE (:Person {name: "Eve"})
Node 1001 created (WAL: LSN 5001)
graph> CREATE (a)-[:KNOWS]->(b)
Edge 5001 created (WAL: LSN 5002)
graph> COMMIT
Transaction committed (WAL synced to LSN 5002)

$ kill -9 <pid>  # Simulate crash

$ ./graph_db_wal my_graph.db
Recovering from WAL...
  Last checkpoint: LSN 5000
  WAL contains records up to: LSN 5002
  Redoing 2 log records...
  Recovery complete: 2 operations replayed
Database ready.

graph> MATCH (p:Person {name: "Eve"}) RETURN p
| p                        |
|--------------------------|
| (:Person {name: "Eve"})  |

Eve survived the crash!

Implementation Hints:

WAL record format:

┌─────────────────────────────────────────────────────┐
│ LSN (8 bytes) - Log Sequence Number                 │
│ TxnID (8 bytes) - Transaction ID                    │
│ Type (1 byte) - INSERT/UPDATE/DELETE/COMMIT/ABORT   │
│ Length (4 bytes) - Length of data                   │
│ Data (variable) - The actual change                 │
│ Checksum (4 bytes) - CRC32 of record                │
└─────────────────────────────────────────────────────┘

Core operations:

wal_append(type, data) -> LSN
  - Write record to log buffer
  - Return the LSN for this record

wal_flush(lsn)
  - Ensure all records up to LSN are on disk
  - Call fsync() on log file

recover()
  - Read log from last checkpoint
  - For each record, redo the operation
  - Handle incomplete transactions

Key insight: You MUST write to the log and fsync BEFORE telling the user “commit successful”. Otherwise a crash after “success” but before the data is on disk means data loss.
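
A sketch of append and flush in C (the packed header mirrors the record format above; the CRC32 is left as a stub, and next_lsn would really be restored from the log on startup):

#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

#pragma pack(push, 1)
typedef struct {
    uint64_t lsn;
    uint64_t txn_id;
    uint8_t  type;               /* INSERT / UPDATE / DELETE / COMMIT / ABORT   */
    uint32_t length;             /* length of the payload that follows          */
} WalHeader;
#pragma pack(pop)

static uint64_t next_lsn = 1;

/* Append one record (header + payload + checksum); returns its LSN.
 * The bytes sit in the stdio buffer until wal_flush() is called. */
uint64_t wal_append(FILE *log, uint64_t txn_id, uint8_t type,
                    const void *data, uint32_t len) {
    WalHeader h = { next_lsn++, txn_id, type, len };
    uint32_t checksum = 0;       /* stub: compute CRC32 over header + payload   */
    fwrite(&h, sizeof h, 1, log);
    fwrite(data, len, 1, log);
    fwrite(&checksum, sizeof checksum, 1, log);
    return h.lsn;
}

/* Durability point: COMMIT must not be acknowledged before this returns 0. */
int wal_flush(FILE *log) {
    if (fflush(log) != 0) return -1;     /* stdio buffer -> kernel page cache   */
    return fsync(fileno(log));           /* kernel page cache -> disk           */
}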

Questions to consider:

  • Why append-only? (Sequential writes are fast, no in-place updates)
  • What’s a checkpoint? (A snapshot of which LSN has been applied to data files)
  • When can you delete old log records? (After checkpoint covers them)
  • What about ABORT? (Either don’t apply to data files, or write UNDO records)

Learning milestones:

  1. Operations are logged before applying → You understand WAL principle
  2. Recovery replays log correctly → You understand crash recovery
  3. Database survives kill -9 → You’ve achieved durability
  4. Checkpoints limit recovery time → You understand log management

Project 10: Transaction Manager with MVCC

  • File: LEARN_GRAPH_DATABASES_DEEP_DIVE.md
  • Main Programming Language: C
  • Alternative Programming Languages: Rust, C++
  • Coolness Level: Level 5: Pure Magic
  • Business Potential: 4. The “Open Core” Infrastructure
  • Difficulty: Level 5: Master
  • Knowledge Area: Concurrency / ACID / Transactions
  • Software or Tool: Building: MVCC Transaction Manager
  • Main Book: “Designing Data-Intensive Applications” by Martin Kleppmann

What you’ll build: A transaction manager implementing Multi-Version Concurrency Control (MVCC), allowing concurrent readers and writers without blocking each other.

Why it teaches graph databases: MVCC is how modern databases achieve high concurrency. Readers see consistent snapshots while writers modify data. This is the “I” (Isolation) in ACID. Neo4j, PostgreSQL, and most modern databases use MVCC.

Core challenges you’ll face:

  • Version chain management → maps to row versioning
  • Snapshot visibility rules → maps to transaction isolation
  • Garbage collection of old versions → maps to version cleanup
  • Write-write conflict detection → maps to transaction validation

Key Concepts:

  • MVCC Fundamentals: “Designing Data-Intensive Applications” Chapter 7 - Martin Kleppmann
  • Snapshot Isolation: “Database Internals” Chapter 12 - Alex Petrov
  • Version Chains: DuckDB Transaction Paper
  • Transaction IDs: “Database Internals” Chapter 11 - Alex Petrov

Difficulty: Master
Time estimate: 3-4 weeks
Prerequisites:

  • Project 9 completed (WAL)
  • Strong understanding of concurrency
  • Experience with thread-safe programming

Real world outcome:

$ ./graph_db_mvcc my_graph.db &
Graph database started with MVCC enabled.

# Terminal 1 (Reader)
graph> BEGIN READ ONLY
graph> MATCH (p:Person) RETURN count(p)
| count(p) |
|----------|
| 1000     |
Started long report...

# Terminal 2 (Writer - concurrent)
graph> BEGIN
graph> CREATE (:Person {name: "NewPerson"})
graph> COMMIT
Committed!

# Terminal 1 (Reader - still sees old snapshot!)
graph> MATCH (p:Person) RETURN count(p)
| count(p) |
|----------|
| 1000     |  <- Still 1000! Snapshot isolation working.
graph> COMMIT
graph> MATCH (p:Person) RETURN count(p)
| count(p) |
|----------|
| 1001     |  <- New transaction sees update.

Implementation Hints:

MVCC version chain:

Each node/edge has a version chain:
┌──────────────────┐    ┌──────────────────┐    ┌──────────────────┐
│ Version 3        │───▶│ Version 2        │───▶│ Version 1        │
│ txn_id: 100      │    │ txn_id: 50       │    │ txn_id: 10       │
│ data: {age: 31}  │    │ data: {age: 30}  │    │ data: {age: 29}  │
│ begin_ts: 99     │    │ begin_ts: 49     │    │ begin_ts: 9      │
│ end_ts: ∞        │    │ end_ts: 99       │    │ end_ts: 49       │
└──────────────────┘    └──────────────────┘    └──────────────────┘

Visibility rules:

A version V is visible to transaction T if:
  - V.begin_ts <= T.snapshot_ts
  - V.end_ts > T.snapshot_ts
  - V.txn_id is committed (not in active_txns set)
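
The visibility rules translate almost directly into code. In this sketch, txn_committed_before() is a hypothetical helper standing in for the active-transaction-set check:

#include <stdbool.h>
#include <stdint.h>

#define TS_INFINITY UINT64_MAX

typedef struct Version {
    uint64_t txn_id;             /* transaction that wrote this version             */
    uint64_t begin_ts;           /* commit timestamp of that writer                 */
    uint64_t end_ts;             /* commit timestamp of the overwriter, or infinity */
    void    *data;               /* the node/edge payload for this version          */
    struct Version *older;       /* next (older) version in the chain               */
} Version;

typedef struct {
    uint64_t snapshot_ts;        /* assigned at BEGIN                               */
    /* plus: own txn id, set of txn ids active at BEGIN, ...                        */
} Txn;

/* Hypothetical helper: true if the writing transaction had committed before
 * this transaction took its snapshot (i.e. it is not in the active set).    */
bool txn_committed_before(const Txn *t, uint64_t writer_txn_id);

/* Walk the chain newest-to-oldest; return the version this transaction may
 * see, or NULL if nothing is visible. */
Version *visible_version(const Txn *t, Version *chain) {
    for (Version *v = chain; v != NULL; v = v->older) {
        if (v->begin_ts <= t->snapshot_ts &&
            v->end_ts   >  t->snapshot_ts &&
            txn_committed_before(t, v->txn_id))
            return v;
    }
    return NULL;
}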

Key components:

  • Transaction ID generator (monotonic)
  • Active transaction table
  • Version chain per record
  • Snapshot timestamp assignment at BEGIN
  • Commit/abort tracking

Think about:

  • What happens if two transactions try to update the same node?
  • How do you garbage collect old versions no one can see?
  • What’s the difference between snapshot isolation and serializable?
  • How does this interact with your WAL?

Learning milestones:

  1. Readers don’t block writers → You understand MVCC basics
  2. Snapshots are consistent → You understand isolation levels
  3. Write conflicts are detected → You understand transaction validation
  4. Old versions are cleaned up → You understand garbage collection

Project 11: Graph-Specific Algorithms (PageRank, Community Detection)

  • File: LEARN_GRAPH_DATABASES_DEEP_DIVE.md
  • Main Programming Language: C
  • Alternative Programming Languages: Rust, Python, Go
  • Coolness Level: Level 4: Hardcore Tech Flex
  • Business Potential: 3. The “Service & Support” Model
  • Difficulty: Level 3: Advanced
  • Knowledge Area: Graph Analytics / Linear Algebra
  • Software or Tool: Building: Graph Algorithm Library
  • Main Book: “Graph Algorithms” by Mark Needham & Amy Hodler

What you’ll build: Implementation of classic graph analytics algorithms—PageRank (for importance), Label Propagation (for community detection), and Betweenness Centrality (for bridge identification).

Why it teaches graph databases: Neo4j’s Graph Data Science library exists because these algorithms are why people use graph databases. Fraud detection, recommendation systems, and social network analysis all rely on these algorithms. This is where graph databases shine.

Core challenges you’ll face:

  • PageRank iteration until convergence → maps to power iteration method
  • Sparse matrix representation → maps to efficient graph representation
  • Community detection heuristics → maps to graph clustering
  • Betweenness via shortest paths → maps to centrality measures

Key Concepts:

  • PageRank Algorithm: “Graph Algorithms” Chapter 5 - Needham & Hodler
  • Community Detection: “Graph Algorithms” Chapter 6 - Needham & Hodler
  • Centrality Measures: “Graph Algorithms” Chapter 4 - Needham & Hodler
  • Power Iteration: “Introduction to Linear Algebra” Chapter 6 - Gilbert Strang

Difficulty: Advanced
Time estimate: 2 weeks
Prerequisites:

  • Project 2 completed (basic graph algorithms)
  • Basic linear algebra understanding
  • Familiarity with iterative algorithms

Real world outcome:

$ ./graph_analytics social.graph

PageRank (damping=0.85, iterations=20):
Top 10 influential users:
  1. user_42     score: 0.0847
  2. user_17     score: 0.0721
  3. user_891    score: 0.0698
  ...
Converged in 12 iterations.

Community Detection (Label Propagation):
Found 47 communities
  Community 1: 2,341 members (tech enthusiasts)
  Community 2: 1,892 members (sports fans)
  Community 3: 1,456 members (music lovers)
  ...
Modularity score: 0.73

Betweenness Centrality:
Top 5 bridge nodes:
  1. user_103    betweenness: 0.234 (connects communities 1-2)
  2. user_567    betweenness: 0.198 (connects communities 2-3)
  ...

Implementation Hints:

PageRank algorithm:

1. Initialize all nodes with rank 1/N
2. For each iteration:
   For each node v:
     new_rank[v] = (1-d)/N + d * Σ(rank[u]/out_degree[u])
                   for all u that link to v
3. Repeat until ranks converge (change < epsilon)

d = damping factor (typically 0.85)
N = total number of nodes
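
A compact C version of this loop over an edge list (assumes dense node ids in [0, n); dangling nodes simply leak their rank mass here, which a fuller version would redistribute):

#include <math.h>
#include <stdlib.h>

void pagerank(int n, int m, const int *src, const int *dst,
              double d, double eps, double *rank) {
    double *out_deg = calloc((size_t)n, sizeof(double));
    double *next    = malloc((size_t)n * sizeof(double));

    for (int i = 0; i < m; i++) out_deg[src[i]] += 1.0;
    for (int v = 0; v < n; v++) rank[v] = 1.0 / n;

    for (int iter = 0; iter < 100; iter++) {
        for (int v = 0; v < n; v++) next[v] = (1.0 - d) / n;
        for (int i = 0; i < m; i++)                   /* each node hands its rank   */
            next[dst[i]] += d * rank[src[i]] / out_deg[src[i]];  /* to its targets  */
        double delta = 0.0;
        for (int v = 0; v < n; v++) { delta += fabs(next[v] - rank[v]); rank[v] = next[v]; }
        if (delta < eps) break;                       /* ranks stopped changing     */
    }
    free(out_deg); free(next);
}

Call it with d = 0.85 and eps around 1e-6; rank must have room for n doubles.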

Label Propagation (simple community detection):

1. Assign each node a unique label
2. For each iteration:
   For each node (in random order):
     Set node's label to most frequent label among neighbors
     (break ties randomly)
3. Repeat until labels stabilize
4. Nodes with same label = same community

Questions to explore:

  • Why does PageRank converge? (It’s power iteration: the ranks converge to the dominant eigenvector of the link matrix)
  • What’s the time complexity of PageRank? (O(iterations * edges))
  • How would you parallelize these algorithms?
  • How does Neo4j’s GDS library optimize these?

Learning milestones:

  1. PageRank converges and ranks make sense → You understand iterative algorithms
  2. Communities are meaningful → You understand graph clustering
  3. Centrality identifies key nodes → You understand network analysis
  4. Algorithms scale to large graphs → You understand performance optimization

Project 12: Distributed Graph Partitioning

  • File: LEARN_GRAPH_DATABASES_DEEP_DIVE.md
  • Main Programming Language: C
  • Alternative Programming Languages: Rust, Go, C++
  • Coolness Level: Level 5: Pure Magic
  • Business Potential: 4. The “Open Core” Infrastructure
  • Difficulty: Level 5: Master
  • Knowledge Area: Distributed Systems / Graph Partitioning
  • Software or Tool: Building: Distributed Graph Engine
  • Main Book: “Designing Data-Intensive Applications” by Martin Kleppmann

What you’ll build: A system that partitions a large graph across multiple machines, enabling horizontal scaling while minimizing cross-partition traversals.

Why it teaches graph databases: When graphs get too big for one machine, you need distribution. But graphs are hard to partition—unlike independent rows in a table, nodes are linked to each other, and every edge cut by a partition boundary turns into expensive cross-machine communication. Understanding graph partitioning is key to scaling graph databases.

Core challenges you’ll face:

  • Partitioning algorithm (METIS-style) → maps to balanced graph cuts
  • Cross-partition queries → maps to distributed query execution
  • Replication for fault tolerance → maps to data durability
  • Minimizing network hops → maps to locality optimization

Key Concepts:

  • Graph Partitioning: “Designing Data-Intensive Applications” Chapter 6 - Martin Kleppmann
  • Distributed Query Processing: “Database Internals” Chapter 9 - Alex Petrov
  • Consistent Hashing: “Designing Data-Intensive Applications” Chapter 6 - Martin Kleppmann
  • Balanced Cuts: METIS paper and documentation

Difficulty: Master
Time estimate: 1 month+
Prerequisites:

  • All previous projects completed
  • Understanding of network programming
  • Experience with distributed systems concepts

Real world outcome:

# On coordinator node
$ ./graph_cluster init 3
Initializing 3-node cluster...
  Node 1: 192.168.1.10 (primary)
  Node 2: 192.168.1.11
  Node 3: 192.168.1.12
Cluster ready.

$ ./graph_cluster load billion_edges.graph
Loading graph with 1B nodes, 10B edges...
Partitioning using METIS...
  Partition 1: 334M nodes, 3.2B edges, 2.1% edge cut
  Partition 2: 333M nodes, 3.4B edges, 1.8% edge cut
  Partition 3: 333M nodes, 3.4B edges, 1.9% edge cut
Distribution complete.

$ ./graph_cluster query
cluster> MATCH (a:Person {name:"Alice"})-[:KNOWS*1..3]->(b) RETURN count(b)

Query plan:
  1. Find Alice (hash -> Node 2)
  2. Local 1-hop expansion on Node 2
  3. Parallel 2-hop expansion (spans all nodes)
  4. Aggregate counts

| count(b) |
|----------|
| 4,523,891|

Query time: 234ms (47ms local, 187ms cross-partition)

Implementation Hints:

Partitioning strategies:

1. Hash partitioning: node_id % num_partitions
   - Simple but ignores graph structure
   - Many edges will be cut

2. Label/range partitioning: nodes with same label together
   - Good for label-heavy workloads
   - May be imbalanced

3. Graph-aware (METIS-style):
   - Minimize edge cuts while balancing partition sizes
   - Pre-compute partitions
   - Better locality but expensive to compute
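
A tiny C sketch for comparing strategies: assign partitions with strategy 1 (hash by node id) and count how many edges the assignment cuts, since every cut edge is a network hop you pay for later:

#include <stdint.h>

static inline int partition_of(uint64_t node_id, int num_partitions) {
    return (int)(node_id % (uint64_t)num_partitions);   /* strategy 1: hash by id */
}

/* Count how many edges (src[i] -> dst[i]) cross a partition boundary. */
long count_edge_cuts(int m, const uint64_t *src, const uint64_t *dst,
                     int num_partitions) {
    long cuts = 0;
    for (int i = 0; i < m; i++)
        if (partition_of(src[i], num_partitions) != partition_of(dst[i], num_partitions))
            cuts++;
    return cuts;
}

Running the same counter against a label-based or METIS-style assignment shows concretely why graph-aware partitioning is worth the extra work.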

Distributed traversal:

For pattern (a)-[:KNOWS]->(b)-[:WORKS_AT]->(c):
1. Find 'a' locally
2. Expand KNOWS - some 'b' nodes may be remote
3. Send requests to remote partitions for remote 'b' nodes
4. Each partition expands WORKS_AT locally
5. Aggregate results at coordinator

Think about:

  • What’s the trade-off between partition balance and edge cuts?
  • How do you handle a traversal that touches all partitions?
  • What about nodes that have connections everywhere (super-nodes)?
  • How do you replicate popular nodes to reduce network traffic?

Learning milestones:

  1. Graph is partitioned across nodes → You understand distributed data
  2. Queries span partitions correctly → You understand distributed execution
  3. Edge cuts are minimized → You understand graph partitioning algorithms
  4. System survives node failure → You understand replication

Project 13: Query Optimizer with Cost Model

  • File: LEARN_GRAPH_DATABASES_DEEP_DIVE.md
  • Main Programming Language: C
  • Alternative Programming Languages: Rust, Go, C++
  • Coolness Level: Level 4: Hardcore Tech Flex
  • Business Potential: 4. The “Open Core” Infrastructure
  • Difficulty: Level 4: Expert
  • Knowledge Area: Query Optimization / Cost Modeling
  • Software or Tool: Building: Query Optimizer
  • Main Book: “Database Internals” by Alex Petrov

What you’ll build: A query optimizer that chooses the best execution plan by estimating costs based on statistics (cardinality, selectivity, index availability).

Why it teaches graph databases: The same query can be executed many ways. (a)-[:KNOWS]->(b)-[:WORKS_AT]->(c) could start from a, from c, or use an index. Good optimizers make queries 100x faster. Bad ones make databases unusable.

Core challenges you’ll face:

  • Cardinality estimation → maps to statistics collection
  • Plan enumeration → maps to search space exploration
  • Cost modeling → maps to I/O and CPU cost estimation
  • Plan selection → maps to optimization algorithms

Key Concepts:

  • Query Optimization: “Database Internals” Chapter 10 - Alex Petrov
  • Cardinality Estimation: “Architecture of a Database System” Section 4.4 - Hellerstein et al.
  • Cost Models: “Designing Data-Intensive Applications” Chapter 3 - Martin Kleppmann
  • Join Ordering: “Database Internals” Chapter 10 - Alex Petrov

Difficulty: Expert
Time estimate: 3 weeks
Prerequisites:

  • Project 8 completed (query executor)
  • Project 6 completed (indexes)
  • Understanding of statistics and estimation

Real world outcome:

$ ./graph_db --explain my_graph.db
graph> MATCH (a:Person)-[:KNOWS]->(b:Person)-[:WORKS_AT]->(c:Company {name:"TechCorp"})
       RETURN a.name

Statistics:
  Person nodes: 100,000
  Company nodes: 5,000
  KNOWS edges: 500,000
  WORKS_AT edges: 200,000
  TechCorp employees: 50

Plan 1 (Start from Person, no index):
  AllNodesScan(Person) -> a              cost: 100,000
  Expand(KNOWS) -> b                     cost: 500,000
  Expand(WORKS_AT) -> c                  cost: 200,000
  Filter(name="TechCorp")                cost: 200,000
  TOTAL: 1,000,000

Plan 2 (Start from TechCorp via index):
  IndexScan(Company.name="TechCorp") -> c  cost: 1
  ExpandIn(WORKS_AT) -> b                  cost: 50
  ExpandIn(KNOWS) -> a                     cost: 250
  TOTAL: 301                             ← WINNER!

Executing Plan 2...
50 rows in 3ms

Implementation Hints:

Statistics you need:

- Node count per label
- Edge count per type
- Property value histograms
- Index selectivity estimates

Cost model factors:

- Sequential scan: num_pages * page_read_cost
- Index scan: selectivity * num_rows * index_lookup_cost
- Expand: avg_degree * input_rows
- Filter: input_rows * filter_cost (output = selectivity * input)

Plan enumeration for pattern (a)-[r1]->(b)-[r2]->(c):

Possible orderings:
1. Start from a, expand r1, expand r2
2. Start from c, expand r2 backward, expand r1 backward
3. Start from b, expand r1 backward, expand r2 forward

Each can use AllNodesScan or IndexScan
Compare costs, pick cheapest
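
A toy C sketch of the cost arithmetic (the statistics structure and cost units are illustrative; real optimizers weight I/O and CPU separately):

/* Illustrative per-label statistics the optimizer would keep. */
typedef struct {
    double node_count;           /* nodes with this label                         */
    double avg_out_degree;       /* average edges of the relevant type per node   */
    double index_selectivity;    /* expected fraction matching an equality filter */
} Stats;

typedef struct { double rows; double cost; } Estimate;

/* Compare "label scan then expand" against "index seek then expand".
 * Cost units are abstract (roughly: records touched), as in the plans above. */
Estimate scan_then_expand(const Stats *s, int use_index) {
    Estimate e;
    if (use_index) {
        e.rows = s->node_count * s->index_selectivity;  /* e.g. 5,000 * 0.0002 = 1 */
        e.cost = e.rows;                                /* index lookups           */
    } else {
        e.rows = s->node_count;                         /* AllNodesScan            */
        e.cost = s->node_count;
    }
    e.cost += e.rows * s->avg_out_degree;               /* the Expand step         */
    e.rows  = e.rows * s->avg_out_degree;
    return e;
}

The optimizer builds an estimate like this for every candidate ordering and picks the smallest total cost.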

Questions:

  • How do you handle unknown selectivity? (Use defaults or samples)
  • What if statistics are stale? (Re-analyze periodically)
  • How do you handle variable-length patterns [:KNOWS*1..5]?
  • Is the optimizer overhead worth it for simple queries?

Learning milestones:

  1. Optimizer picks obviously better plans → You understand basic cost modeling
  2. Statistics affect plan choice → You understand cardinality estimation
  3. Complex patterns optimize correctly → You understand plan enumeration
  4. Query times improve dramatically → You’ve built a real optimizer

Project 14: Full-Text Search Integration

  • File: LEARN_GRAPH_DATABASES_DEEP_DIVE.md
  • Main Programming Language: C
  • Alternative Programming Languages: Rust, Go
  • Coolness Level: Level 3: Genuinely Clever
  • Business Potential: 3. The “Service & Support” Model
  • Difficulty: Level 3: Advanced
  • Knowledge Area: Information Retrieval / Text Search
  • Software or Tool: Building: Full-Text Index
  • Main Book: “Introduction to Information Retrieval” by Manning, Raghavan & Schütze

What you’ll build: A full-text search index for graph properties, enabling queries like “find all Person nodes where bio contains ‘machine learning’”.

Why it teaches graph databases: Neo4j integrates with Lucene for full-text search. Combining graph traversal with text search is powerful—find experts by searching their profiles, then traverse to their companies and co-workers.

Core challenges you’ll face:

  • Tokenization and stemming → maps to text processing
  • Inverted index construction → maps to index data structures
  • TF-IDF scoring → maps to relevance ranking
  • Integration with graph queries → maps to hybrid query execution

Key Concepts:

  • Inverted Indexes: “Introduction to Information Retrieval” Chapter 1 - Manning et al.
  • TF-IDF Scoring: “Introduction to Information Retrieval” Chapter 6 - Manning et al.
  • Text Tokenization: “Introduction to Information Retrieval” Chapter 2 - Manning et al.
  • Lucene Architecture: Lucene documentation

Difficulty: Advanced
Time estimate: 2 weeks
Prerequisites:

  • Project 6 completed (B+Tree)
  • Understanding of text processing
  • Basic information retrieval concepts

Real world outcome:

$ ./graph_db my_graph.db

graph> CREATE FULLTEXT INDEX bio_index FOR (p:Person) ON (p.bio)
Index created on Person.bio
Indexing 100,000 documents...
  Vocabulary size: 45,000 terms
  Index size: 12MB
Done.

graph> CALL db.index.fulltext.queryNodes("bio_index", "machine learning")
       YIELD node, score
       RETURN node.name, score
       ORDER BY score DESC
       LIMIT 5

| node.name    | score  |
|--------------|--------|
| Alice Chen   | 8.234  |
| Bob ML       | 7.891  |
| Carol Data   | 6.543  |
...

graph> MATCH (p:Person)-[:WORKS_AT]->(c:Company)
       WHERE p.bio CONTAINS "machine learning"
       RETURN c.name, count(p) as ml_experts
       ORDER BY ml_experts DESC

| c.name     | ml_experts |
|------------|------------|
| TechCorp   | 23         |
| DataInc    | 18         |
...

Implementation Hints:

Inverted index structure:

term -> [(doc_id, positions), (doc_id, positions), ...]

Example:
"machine" -> [(doc_42, [0, 156]), (doc_89, [23]), ...]
"learning" -> [(doc_42, [1, 157]), (doc_89, [24]), ...]

To find "machine learning":
  1. Get posting lists for both terms
  2. Intersect by doc_id
  3. Check if positions are adjacent

TF-IDF scoring:

TF(t, d) = frequency of term t in document d
IDF(t) = log(N / df(t))  where N = total docs, df = docs containing t
Score = TF * IDF

Higher score = term appears often in this doc, rarely in others
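
The scoring formulas in C (a sketch; real engines add length normalization and smoothing):

#include <math.h>

/* TF-IDF for one (term, document) pair:
 *   tf       = occurrences of the term in this document
 *   doc_freq = number of documents containing the term
 *   num_docs = total documents in the index                        */
double tfidf(double tf, double doc_freq, double num_docs) {
    if (tf <= 0.0 || doc_freq <= 0.0) return 0.0;
    return tf * log(num_docs / doc_freq);
}

/* Score a document for a multi-term query by summing per-term TF-IDF. */
double score_document(int nterms, const double *tf, const double *doc_freq,
                      double num_docs) {
    double s = 0.0;
    for (int i = 0; i < nterms; i++) s += tfidf(tf[i], doc_freq[i], num_docs);
    return s;
}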

Integration with graph queries:

MATCH (p:Person) WHERE p.bio CONTAINS "machine learning"

Execution:
1. Query full-text index for "machine learning"
2. Get list of node IDs with scores
3. Fetch nodes, continue with graph pattern

Think about:

  • How do you handle updates to indexed properties?
  • How do you tokenize “machine-learning” vs “machine learning”?
  • Should you support phrase queries? Fuzzy matching?
  • How large can the inverted index get?

Learning milestones:

  1. Basic term search works → You understand inverted indexes
  2. Phrase search works → You understand position-aware indexing
  3. Scores reflect relevance → You understand TF-IDF
  4. Integrates with graph queries → You understand hybrid execution

Project 15: Complete Graph Database Engine

  • File: LEARN_GRAPH_DATABASES_DEEP_DIVE.md
  • Main Programming Language: C
  • Alternative Programming Languages: Rust, C++
  • Coolness Level: Level 5: Pure Magic
  • Business Potential: 5. The “Industry Disruptor”
  • Difficulty: Level 5: Master
  • Knowledge Area: Database Systems / Systems Programming
  • Software or Tool: Building: Graph Database
  • Main Book: “Database Internals” by Alex Petrov

What you’ll build: A complete graph database engine combining all previous projects: storage engine, buffer pool, B+Tree indexes, query parser, executor, optimizer, WAL, and transactions. A mini Neo4j from scratch.

Why it teaches graph databases: This is the capstone. You’ll integrate every component into a cohesive system. The challenges of making components work together teach you as much as building them individually. You’ll understand graph databases at the deepest level.

Core challenges you’ll face:

  • Component integration → maps to systems architecture
  • API design → maps to usability and developer experience
  • Performance tuning → maps to system optimization
  • Bug hunting across layers → maps to debugging complex systems

Key Concepts:

  • Database Architecture: “Database Internals” throughout - Alex Petrov
  • Systems Integration: “Designing Data-Intensive Applications” throughout - Martin Kleppmann
  • API Design: “The Pragmatic Programmer” Chapter 7 - Hunt & Thomas
  • Performance Analysis: “Computer Systems: A Programmer’s Perspective” Chapters 5 & 6 - Bryant & O’Hallaron

Difficulty: Master Time estimate: 1 month+ Prerequisites:

  • All previous projects completed
  • Strong systems programming skills
  • Patience and determination

Real world outcome:

$ ./myneo4j --data-dir /var/myneo4j --port 7687
MyNeo4j Graph Database v0.1
  Storage engine: Native fixed-record
  Buffer pool: 1GB (262,144 pages)
  WAL: Enabled (wal.log)
  Transactions: MVCC with snapshot isolation
  Indexes: B+Tree
  Query language: MiniCypher

Server started on bolt://localhost:7687

# In another terminal
$ ./myneo4j-shell localhost:7687
Connected to MyNeo4j.

neo4j> CREATE INDEX person_name FOR (p:Person) ON (p.name);
Index created.

neo4j> BEGIN;
neo4j> CREATE (a:Person {name: "Alice", age: 30});
Node created.
neo4j> CREATE (b:Person {name: "Bob", age: 28});
Node created.
neo4j> MATCH (a:Person {name: "Alice"}), (b:Person {name: "Bob"})
       CREATE (a)-[:KNOWS {since: 2020}]->(b);
Relationship created.
neo4j> COMMIT;
Transaction committed.

neo4j> EXPLAIN MATCH (a:Person {name: "Alice"})-[:KNOWS]->(b) RETURN b.name;
+----------------------------+
| Operator           | Rows  |
+----------------------------+
| Projection         | 1     |
| Expand(KNOWS OUT)  | 1     |
| IndexSeek(name)    | 1     |
+----------------------------+

neo4j> MATCH (a:Person {name: "Alice"})-[:KNOWS]->(b) RETURN b.name;
+--------+
| b.name |
+--------+
| Bob    |
+--------+
1 row, 0.8ms

neo4j> :quit
Goodbye!

Implementation Hints:

System architecture:

┌─────────────────────────────────────────────────────────────┐
│                     Client Connection                       │
│                     (TCP/Bolt protocol)                     │
└─────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌─────────────────────────────────────────────────────────────┐
│                      Query Parser                           │
│              (Lexer → Parser → AST)                         │
└─────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌─────────────────────────────────────────────────────────────┐
│                     Query Optimizer                         │
│        (Statistics → Plan Enumeration → Cost Model)         │
└─────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌─────────────────────────────────────────────────────────────┐
│                    Execution Engine                         │
│            (Operators: Scan, Expand, Filter, etc.)          │
└─────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌─────────────────────────────────────────────────────────────┐
│                   Transaction Manager                       │
│                    (MVCC, Isolation)                        │
└─────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌───────────────────┬─────────────────┬───────────────────────┐
│   Buffer Pool     │     Indexes     │         WAL           │
│   (Page Cache)    │    (B+Tree)     │   (Write-Ahead Log)   │
└───────────────────┴─────────────────┴───────────────────────┘
                              │
                              ▼
┌─────────────────────────────────────────────────────────────┐
│                     Storage Engine                          │
│        (Fixed-size records: nodes.db, rels.db, etc.)        │
└─────────────────────────────────────────────────────────────┘
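
One way to express this layering in C is a single top-level handle that owns every component. The type and field names below are purely illustrative; the real components come from Projects 4-13:

#include <stdint.h>

typedef struct StorageEngine StorageEngine;  /* fixed-size records: nodes.db, rels.db */
typedef struct BufferPool    BufferPool;     /* page cache over the storage files     */
typedef struct Wal           Wal;            /* write-ahead log                       */
typedef struct TxnManager    TxnManager;     /* MVCC snapshots and isolation          */
typedef struct IndexManager  IndexManager;   /* B+Tree indexes                        */
typedef struct QueryEngine   QueryEngine;    /* parser + optimizer + executor         */

/* Requests flow top-down exactly as in the diagram: the query engine
   asks the transaction manager for a snapshot, operators read pages
   through the buffer pool, and every modification hits the WAL before
   the storage engine applies it.                                      */
typedef struct {
    StorageEngine *storage;
    BufferPool    *pool;
    Wal           *wal;
    TxnManager    *txns;
    IndexManager  *indexes;
    QueryEngine   *query;
    uint16_t       port;      /* Bolt-style listener port */
} GraphDb;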

Integration checklist:

  • Storage engine creates/reads fixed-size records
  • Buffer pool caches pages, tracks dirty pages
  • WAL logs all modifications before apply
  • Transactions use MVCC for isolation
  • Indexes support CREATE INDEX / DROP INDEX
  • Parser handles MiniCypher subset
  • Optimizer chooses between scan and index
  • Executor runs plans, returns results (a toy iterator model is sketched after this list)
  • Server accepts network connections
  • Shell provides interactive access
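
To make “executor runs plans, returns results” concrete, here is a self-contained toy of the Volcano-style iterator model an executor can use: each operator exposes next() and pulls rows from its child. All names are illustrative, not the actual Project 8 API.

#include <stdio.h>

typedef struct Operator Operator;
struct Operator {
    /* Returns 1 and fills *out while rows remain, 0 when exhausted. */
    int (*next)(Operator *self, long *out);
    Operator *child;
    long      state;     /* scratch space for this toy example */
};

/* Leaf operator: emits node IDs 1..3, standing in for a label scan. */
int scan_next(Operator *self, long *out)
{
    if (self->state > 3) return 0;
    *out = self->state++;
    return 1;
}

/* Filter operator: keeps only even IDs, standing in for a predicate
   such as p.name = "Alice".                                          */
int filter_next(Operator *self, long *out)
{
    long v;
    while (self->child->next(self->child, &v))
        if (v % 2 == 0) { *out = v; return 1; }
    return 0;
}

int main(void)
{
    Operator scan   = { scan_next,   NULL,  1 };
    Operator filter = { filter_next, &scan, 0 };
    long row;
    while (filter.next(&filter, &row))
        printf("row: %ld\n", row);    /* prints: row: 2 */
    return 0;
}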

Think about:

  • How do you test the integrated system?
  • What benchmarks should you run?
  • How do you handle errors at each layer?
  • What’s your upgrade/migration story?

Learning milestones:

  1. Basic CRUD operations work → System is functional
  2. Transactions provide ACID → System is reliable
  3. Indexes speed up lookups → System is optimized
  4. Concurrent clients work → System is production-ready
  5. Benchmarks show good performance → System is competitive

Project Comparison Table

| Project                        | Difficulty   | Time      | Depth of Understanding | Fun Factor |
|--------------------------------|--------------|-----------|-------------------------|------------|
| 1. In-Memory Graph Library     | Beginner     | 1 week    | ⭐⭐⭐                  | ⭐⭐⭐     |
| 2. Graph Traversal Algorithms  | Intermediate | 1 week    | ⭐⭐⭐                  | ⭐⭐⭐⭐   |
| 3. Graph Serialization         | Intermediate | 1 week    | ⭐⭐⭐                  | ⭐⭐⭐     |
| 4. Memory-Mapped Storage       | Advanced     | 1-2 weeks | ⭐⭐⭐⭐                | ⭐⭐⭐⭐   |
| 5. Buffer Pool Manager         | Expert       | 2 weeks   | ⭐⭐⭐⭐⭐              | ⭐⭐⭐     |
| 6. B+Tree Index                | Expert       | 2-3 weeks | ⭐⭐⭐⭐⭐              | ⭐⭐⭐⭐   |
| 7. Query Language Parser       | Advanced     | 2 weeks   | ⭐⭐⭐⭐                | ⭐⭐⭐⭐⭐ |
| 8. Query Execution Engine      | Expert       | 2-3 weeks | ⭐⭐⭐⭐⭐              | ⭐⭐⭐⭐   |
| 9. Write-Ahead Logging         | Expert       | 2 weeks   | ⭐⭐⭐⭐⭐              | ⭐⭐⭐     |
| 10. MVCC Transactions          | Master       | 3-4 weeks | ⭐⭐⭐⭐⭐              | ⭐⭐⭐⭐   |
| 11. Graph Algorithms           | Advanced     | 2 weeks   | ⭐⭐⭐⭐                | ⭐⭐⭐⭐⭐ |
| 12. Distributed Graph          | Master       | 1 month+  | ⭐⭐⭐⭐⭐              | ⭐⭐⭐⭐⭐ |
| 13. Query Optimizer            | Expert       | 3 weeks   | ⭐⭐⭐⭐⭐              | ⭐⭐⭐⭐   |
| 14. Full-Text Search           | Advanced     | 2 weeks   | ⭐⭐⭐⭐                | ⭐⭐⭐⭐   |
| 15. Complete Engine            | Master       | 1 month+  | ⭐⭐⭐⭐⭐              | ⭐⭐⭐⭐⭐ |

Recommended learning path, based on building a graph database from scratch in C:

Phase 1: Foundations (Weeks 1-3)

Start here to understand the basics.

  1. Project 1: In-Memory Graph Library - Get comfortable with graph data structures
  2. Project 2: Graph Traversal Algorithms - Understand how to navigate graphs
  3. Project 3: Graph Serialization - Learn to persist graphs to disk

Phase 2: Storage Engine (Weeks 4-7)

Build the foundation of a real database.

  1. Project 4: Memory-Mapped Storage - Understand OS-level I/O
  2. Project 5: Buffer Pool Manager - Master database I/O management
  3. Project 6: B+Tree Index - Implement efficient property lookups

Phase 3: Query Processing (Weeks 8-11)

Make your database usable.

  1. Project 7: Query Language Parser - Build a Cypher-like parser
  2. Project 8: Query Execution Engine - Execute queries against storage
  3. Project 13: Query Optimizer - Make queries fast automatically

Phase 4: ACID & Reliability (Weeks 12-16)

Make your database reliable.

  1. Project 9: Write-Ahead Logging - Achieve durability
  2. Project 10: MVCC Transactions - Handle concurrent access

Phase 5: Advanced Features (Weeks 17+)

Polish and extend.

  1. Project 11: Graph Algorithms - Add analytics capabilities
  2. Project 14: Full-Text Search - Enable text search
  3. Project 12: Distributed Graph - Scale horizontally
  4. Project 15: Complete Engine - Integrate everything

Final Capstone Project: Production-Ready Graph Database

Following the same pattern, this project represents the culmination of all your learning.

  • File: LEARN_GRAPH_DATABASES_DEEP_DIVE.md
  • Main Programming Language: C
  • Alternative Programming Languages: Rust, C++
  • Coolness Level: Level 5: Pure Magic (Super Cool)
  • Business Potential: 5. The “Industry Disruptor”
  • Difficulty: Level 5: Master
  • Knowledge Area: Complete Database Systems
  • Software or Tool: Building: Production Graph Database
  • Main Book: “Database Internals” by Alex Petrov + “Designing Data-Intensive Applications” by Martin Kleppmann

What you’ll build: A production-grade graph database that could genuinely be used for real applications—with proper error handling, configuration, monitoring, backup/restore, and documentation.

Why this is the ultimate goal: Anyone can build a toy database. Building something production-worthy requires handling edge cases, providing operational tools, ensuring security, and creating a polished user experience. This is what separates engineers who merely understand databases from those who have mastered them.

Features beyond Project 15:

  • Configuration file support
  • Proper logging and monitoring hooks
  • Backup and restore functionality
  • Data import/export (CSV, JSON)
  • Authentication and authorization
  • Query timeout and resource limits
  • Comprehensive error messages
  • Unit tests, integration tests, stress tests
  • Documentation and examples

Real world outcome:

$ myneo4j --config myneo4j.conf
Loading configuration from myneo4j.conf
  Data directory: /var/lib/myneo4j
  Buffer pool: 4GB
  Max connections: 100
  Auth: enabled (bcrypt)
  TLS: enabled
  Backup schedule: daily at 03:00

[2024-01-15 10:00:00] INFO  MyNeo4j 1.0 starting...
[2024-01-15 10:00:01] INFO  Recovery: WAL contains 0 uncommitted transactions
[2024-01-15 10:00:01] INFO  Indexes: Loading 5 indexes
[2024-01-15 10:00:02] INFO  Server listening on bolt://0.0.0.0:7687 (TLS)
[2024-01-15 10:00:02] INFO  Server listening on http://0.0.0.0:7474 (metrics)
[2024-01-15 10:00:02] INFO  Ready to accept connections

# Monitoring
$ curl http://localhost:7474/metrics
{
  "nodes": 15000000,
  "edges": 75000000,
  "transactions_committed": 45123,
  "buffer_pool_hit_rate": 0.987,
  "avg_query_time_ms": 2.3,
  "active_connections": 12
}

# Backup
$ myneo4j-admin backup --to /backups/$(date +%Y%m%d)
Backup started...
  Phase 1: Checkpoint
  Phase 2: Copy data files
  Phase 3: Copy indexes
  Phase 4: Verify checksum
Backup complete: 12.3GB in 4m23s

Learning milestones:

  1. Database survives chaos testing → You’ve built robust software
  2. Operations team can manage it → You’ve built operator-friendly software
  3. Developers find it easy to use → You’ve built user-friendly software
  4. Performance is competitive → You’ve built efficient software
  5. You can explain every design decision → You’ve mastered the domain

Summary

| #     | Project Name                                             | Main Programming Language |
|-------|----------------------------------------------------------|---------------------------|
| 1     | In-Memory Graph Library                                  | C                         |
| 2     | Graph Traversal Algorithms (BFS, DFS, Shortest Path)     | C                         |
| 3     | Graph Serialization & File Persistence                   | C                         |
| 4     | Memory-Mapped Graph Storage                              | C                         |
| 5     | Buffer Pool Manager                                      | C                         |
| 6     | B+Tree Index for Node Lookups                            | C                         |
| 7     | Simple Query Language Parser                             | C                         |
| 8     | Query Execution Engine                                   | C                         |
| 9     | Write-Ahead Logging (WAL)                                | C                         |
| 10    | Transaction Manager with MVCC                            | C                         |
| 11    | Graph-Specific Algorithms (PageRank, Community Detection) | C                        |
| 12    | Distributed Graph Partitioning                           | C                         |
| 13    | Query Optimizer with Cost Model                          | C                         |
| 14    | Full-Text Search Integration                             | C                         |
| 15    | Complete Graph Database Engine                           | C                         |
| Final | Production-Ready Graph Database                          | C                         |

Key Resources

Books

  • “Database Internals” by Alex Petrov - The essential guide to how databases work
  • “Designing Data-Intensive Applications” by Martin Kleppmann - Understanding distributed data systems
  • “Graph Databases” by Ian Robinson, Jim Webber & Emil Eifrem - Neo4j-focused but excellent concepts
  • “C Programming: A Modern Approach” by K. N. King - Essential C reference
  • “Computer Systems: A Programmer’s Perspective” by Bryant & O’Hallaron - Understanding systems

Online Resources

Papers


You’re ready to build a graph database from scratch. Start with Project 1 and work your way through. By the end, you’ll understand graph databases at the deepest level—the level of the engineers who built Neo4j, TigerGraph, and Amazon Neptune.