LEARN GRAPH DATABASES DEEP DIVE
Learn Graph Databases: From Zero to Graph Database Master
Goal: Deeply understand graph databases—from basic graph data structures to building a complete graph database engine from scratch in C, including storage engines, query processing, indexing, and ACID transactions.
Why Graph Databases Matter
Relational databases struggle when data is highly connected. Each JOIN in SQL is comparatively expensive because it requires index lookups or full table scans. Graph databases solve this by storing relationships as first-class citizens—each node directly points to its neighbors (a concept called index-free adjacency), making each traversal step an O(1) pointer hop instead of an O(log n) index lookup.
After completing these projects, you will:
- Understand every byte in a graph database’s storage format
- Know how queries traverse from node to node without index lookups
- Be able to implement graph algorithms (BFS, DFS, Dijkstra, PageRank)
- Build your own query parser and execution engine
- Understand ACID transactions and concurrency control in graph contexts
- Have built a working graph database from scratch in C
Core Concept Analysis
The Property Graph Model
A property graph consists of:
┌─────────────────────────────────────────────────────────────────┐
│ PROPERTY GRAPH MODEL │
├─────────────────────────────────────────────────────────────────┤
│ │
│ ┌─────────────────┐ KNOWS ┌───────────────┐ │
│ │ :Person │ ────────────────────▶ │ :Person │ │
│ │ name: "Alice" │ {since: 2015} │ name: "Bob" │ │
│ │ age: 30 │ │ age: 28 │ │
│ └─────────────────┘ └───────────────┘ │
│ │ │ │
│ │ WORKS_AT │ WORKS_AT │
│ ▼ ▼ │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ :Company │ │
│ │ name: "TechCorp" │ │
│ │ founded: 2010 │ │
│ └─────────────────────────────────────────────────────────┘ │
│ │
│ NODES: Entities with labels (:Person, :Company) │
│ EDGES: Directed relationships with types (KNOWS, WORKS_AT) │
│ PROPERTIES: Key-value pairs on both nodes and edges │
│ │
└─────────────────────────────────────────────────────────────────┘
Fundamental Concepts
- Graph Data Structures
- Adjacency List: Each node stores a list of its neighbors
- Adjacency Matrix: 2D array where matrix[i][j] = 1 if an edge exists
- Edge List: Simple list of (source, target) pairs
- Index-Free Adjacency
- Each node directly references its neighbors via pointers
- Traversing a relationship is O(1) pointer dereference
- No index lookup required—the graph is the index (see the C sketch after this concept list)
- As graph grows, local traversal cost remains constant
- Native Graph Storage
- Fixed-size records for nodes and relationships
- Nodes: ID, first_relationship_id, first_property_id, labels
- Relationships: ID, start_node, end_node, type, next_rel_start, next_rel_end
- Properties stored separately as linked chains
- Graph Traversal Algorithms
- BFS (Breadth-First Search): Level-by-level exploration, uses queue
- DFS (Depth-First Search): Deep exploration first, uses stack/recursion
- Dijkstra: Shortest weighted path, uses priority queue
- A*: Heuristic-guided shortest path
- Query Processing
- Pattern matching: (a)-[:KNOWS]->(b) finds all KNOWS relationships
- Query planning: Choose optimal traversal order
- Query execution: Walk the graph, collect results
- ACID in Graph Databases
- Write-Ahead Logging (WAL): Log changes before applying them
- MVCC: Multiple versions for concurrent reads
- Locking: Prevent conflicting concurrent modifications
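To make index-free adjacency concrete before the projects begin, here is a minimal C sketch (hypothetical struct and field names) of what "the graph is the index" means in practice: visiting a node's neighbors is nothing more than following pointers.

```c
#include <stdio.h>

/* Hypothetical minimal layout: each node points at its first outgoing edge,
   and each edge points at the next edge leaving the same node. */
typedef struct Edge Edge;
typedef struct Node {
    long id;
    Edge *first_out;           /* head of this node's outgoing-edge list */
} Node;
struct Edge {
    const char *type;
    Node *target;              /* direct pointer to the neighbor */
    Edge *next_out;            /* next edge from the same start node */
};

/* Visiting all neighbors is pointer chasing -- no index lookup involved. */
static void print_neighbors(const Node *n) {
    for (const Edge *e = n->first_out; e != NULL; e = e->next_out)
        printf("(%ld)-[:%s]->(%ld)\n", n->id, e->type, e->target->id);
}

int main(void) {
    Node alice = {1, NULL}, bob = {2, NULL};
    Edge knows = {"KNOWS", &bob, NULL};
    alice.first_out = &knows;
    print_neighbors(&alice);   /* prints: (1)-[:KNOWS]->(2) */
    return 0;
}
```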
Project List
Projects are ordered from fundamental understanding to advanced implementations. Each project builds on the previous ones.
Project 1: In-Memory Graph Library (The Foundation)
- File: LEARN_GRAPH_DATABASES_DEEP_DIVE.md
- Main Programming Language: C
- Alternative Programming Languages: Rust, Go, C++
- Coolness Level: Level 3: Genuinely Clever
- Business Potential: 1. The “Resume Gold”
- Difficulty: Level 1: Beginner
- Knowledge Area: Data Structures / Memory Management
- Software or Tool: Building: Graph Library
- Main Book: “C Programming: A Modern Approach” by K. N. King
What you’ll build: A C library that can create, modify, and traverse graphs in memory—supporting nodes with labels, edges with types, and properties on both.
Why it teaches graph databases: Every graph database starts here. Before you worry about persistence or queries, you need to understand how graphs live in memory. This project forces you to think about pointer relationships, memory layout, and the trade-offs between adjacency lists vs. matrices.
Core challenges you’ll face:
- Designing the node/edge data structures → maps to understanding property graph model
- Managing dynamic memory for growing graphs → maps to memory allocation strategies
- Implementing efficient neighbor lookups → maps to index-free adjacency concept
- Handling bidirectional edge references → maps to doubly-linked relationship lists
Key Concepts:
- Pointers and Dynamic Memory: “C Programming: A Modern Approach” Chapter 17 - K. N. King
- Linked Lists in C: “Mastering Algorithms with C” Chapter 5 - Kyle Loudon
- Graph Data Structures: “Algorithms, Fourth Edition” Section 4.1 - Sedgewick & Wayne
- Memory Layout: “Computer Systems: A Programmer’s Perspective” Chapter 9 - Bryant & O’Hallaron
Difficulty: Beginner Time estimate: 1 week Prerequisites:
- Basic C programming (pointers, structs, malloc/free)
- Understanding of linked lists
- No prior graph knowledge required
Real world outcome:
$ ./graph_demo
Creating graph...
Added node 1: Person {name: "Alice", age: 30}
Added node 2: Person {name: "Bob", age: 28}
Added edge: (1)-[:KNOWS {since: 2015}]->(2)
Traversing from Alice:
Alice --KNOWS--> Bob
Node count: 2
Edge count: 1
Memory used: 256 bytes
Implementation Hints:
Start with these core structures:
Node structure should contain:
- Unique ID (uint64_t)
- Pointer to first outgoing edge
- Pointer to first incoming edge
- Pointer to first property
- Label(s)
Edge structure should contain:
- Unique ID
- Start node pointer
- End node pointer
- Relationship type
- Next edge from same start node
- Next edge to same end node
- Pointer to first property
Property structure (linked list):
- Key (string)
- Value (union of types: int, float, string, bool)
- Next property pointer
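A minimal sketch of these three structures in C (field names are illustrative, not prescriptive—adapt them to your own design):

```c
#include <stdint.h>
#include <stdbool.h>

typedef enum { PROP_INT, PROP_FLOAT, PROP_STRING, PROP_BOOL } PropType;

typedef struct Property {
    char *key;
    PropType type;
    union { int64_t i; double f; char *s; bool b; } value;
    struct Property *next;          /* properties form a singly linked list */
} Property;

typedef struct Edge Edge;

typedef struct Node {
    uint64_t id;
    char *label;                    /* one label here; could be a list */
    Edge *first_out;                /* head of outgoing-edge chain */
    Edge *first_in;                 /* head of incoming-edge chain */
    Property *first_prop;
} Node;

struct Edge {
    uint64_t id;
    char *type;                     /* relationship type, e.g. "KNOWS" */
    Node *start, *end;
    Edge *next_from_start;          /* next edge sharing the same start node */
    Edge *next_to_end;              /* next edge sharing the same end node */
    Property *first_prop;
};
```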
Think about these questions:
- How do you iterate all edges FROM a node? (Follow first_outgoing, then next pointers)
- How do you iterate all edges TO a node? (Follow first_incoming chain)
- Why do edges need both “next from start” AND “next to end” pointers?
- What happens to edges when you delete a node?
Learning milestones:
- You can add nodes and edges → You understand basic graph structure
- You can traverse neighbors → You understand adjacency representation
- You can handle properties → You understand the property graph model
- You handle memory correctly (no leaks) → You understand ownership in C
Project 2: Graph Traversal Algorithms (BFS, DFS, Shortest Path)
- File: LEARN_GRAPH_DATABASES_DEEP_DIVE.md
- Main Programming Language: C
- Alternative Programming Languages: Rust, Go, Python
- Coolness Level: Level 3: Genuinely Clever
- Business Potential: 1. The “Resume Gold”
- Difficulty: Level 2: Intermediate
- Knowledge Area: Graph Algorithms
- Software or Tool: Building: Algorithm Library
- Main Book: “Algorithms, Fourth Edition” by Sedgewick & Wayne
What you’ll build: Implementation of core graph algorithms—BFS, DFS, Dijkstra’s shortest path, and connected components—that operate on your in-memory graph library.
Why it teaches graph databases: Graph databases are only useful because of graph algorithms. When a user asks “find shortest path from A to B” or “find all friends-of-friends,” the database executes these algorithms. Understanding them deeply is essential.
Core challenges you’ll face:
- Implementing BFS with proper queue management → maps to level-by-level exploration
- Managing visited state without modifying nodes → maps to external state tracking
- Priority queue for Dijkstra → maps to weighted path optimization
- Detecting cycles during DFS → maps to graph classification problems
Key Concepts:
- BFS Algorithm: “Algorithms, Fourth Edition” Section 4.1 - Sedgewick & Wayne
- DFS Algorithm: “Introduction to Algorithms” (CLRS) Chapter 22
- Dijkstra’s Algorithm: “Algorithms, Fourth Edition” Section 4.4 - Sedgewick & Wayne
- Priority Queues (Heaps): “Algorithms, Fourth Edition” Section 2.4 - Sedgewick & Wayne
Resources for key challenges:
- GeeksforGeeks - BFS for a Graph - Step-by-step BFS implementation
- Programiz - Dijkstra’s Algorithm - Clear visual explanation
Difficulty: Intermediate Time estimate: 1 week Prerequisites:
- Project 1 completed (in-memory graph library)
- Understanding of queues and stacks
- Basic algorithm complexity analysis
Real world outcome:
$ ./graph_algo social_network.graph
Loading graph with 1000 nodes, 5000 edges...
BFS from node 1:
Level 0: Alice
Level 1: Bob, Carol, Dave
Level 2: Eve, Frank, Grace...
Visited 847 nodes in 6 levels
DFS from node 1:
Path: Alice -> Bob -> Eve -> ... (depth 23)
Shortest path from Alice to Zack:
Alice -[2]-> Bob -[3]-> Carol -[1]-> Zack
Total weight: 6
Path length: 3 edges
Connected components: 3
Component 1: 847 nodes
Component 2: 142 nodes
Component 3: 11 nodes
Implementation Hints:
For BFS, you need an external data structure:
Queue: FIFO structure for nodes to visit
Visited set: Hash set or boolean array to track visited nodes
Parent map: To reconstruct the path afterward
The key insight: BFS visits nodes in order of distance from start. When you first reach a node, you’ve found the shortest path (in unweighted graphs).
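A sketch of BFS over an adjacency structure, assuming node IDs are small dense integers and a bounded fan-out so plain arrays suffice (a real implementation would use your Project 1 structures and a growable queue):

```c
#include <stdbool.h>

#define MAX_NODES 1024

/* Adjacency for this sketch: next_hop[u] lists u's neighbors (up to 8),
   degree[u] counts how many of those slots are used. */
int degree[MAX_NODES];
int next_hop[MAX_NODES][8];

/* BFS from `start`; parent[] lets you reconstruct the shortest unweighted path. */
void bfs(int start, int parent[MAX_NODES]) {
    bool visited[MAX_NODES] = {false};
    int queue[MAX_NODES], head = 0, tail = 0;

    visited[start] = true;
    parent[start] = -1;
    queue[tail++] = start;

    while (head < tail) {
        int u = queue[head++];                 /* dequeue (FIFO) */
        for (int i = 0; i < degree[u]; i++) {
            int v = next_hop[u][i];
            if (!visited[v]) {                 /* first visit = shortest path found */
                visited[v] = true;
                parent[v] = u;
                queue[tail++] = v;
            }
        }
    }
}
```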
For Dijkstra, the difference is:
- Use a priority queue (min-heap) instead of a regular queue
- Priority = cumulative distance from start
- Can update priority when finding shorter path (relaxation)
Ask yourself:
- What’s the time complexity of BFS? (O(V + E))
- Why doesn’t BFS work for weighted graphs?
- What makes Dijkstra fail with negative edges?
- How would you modify DFS to detect cycles?
Learning milestones:
- BFS finds shortest unweighted path → You understand level-order traversal
- DFS explores deeply and backtracks → You understand stack-based traversal
- Dijkstra handles weights → You understand greedy algorithms and priority queues
- You can find connected components → You understand graph structure analysis
Project 3: Graph Serialization & File Persistence
- File: LEARN_GRAPH_DATABASES_DEEP_DIVE.md
- Main Programming Language: C
- Alternative Programming Languages: Rust, C++, Go
- Coolness Level: Level 3: Genuinely Clever
- Business Potential: 2. The “Micro-SaaS / Pro Tool”
- Difficulty: Level 2: Intermediate
- Knowledge Area: File I/O / Binary Formats / Persistence
- Software or Tool: Building: Graph File Format
- Main Book: “Computer Systems: A Programmer’s Perspective” by Bryant & O’Hallaron
What you’ll build: A binary file format for persisting graphs to disk, with save/load functionality. Your format will use fixed-size records like real graph databases do.
Why it teaches graph databases: Neo4j's store format keeps each node in a small fixed-size record (9 bytes in early store versions). Understanding why fixed-size records matter (O(1) lookups via offset = id * record_size) is fundamental. This project bridges in-memory graphs to persistent storage.
Core challenges you’ll face:
- Designing a binary file format → maps to understanding database storage layouts
- Fixed-size records vs. variable-size data → maps to property storage strategies
- Handling string storage separately → maps to string table/pool concepts
- File offset arithmetic → maps to direct record access
Key Concepts:
- Binary File I/O in C: “C Programming: A Modern Approach” Chapter 22 - K. N. King
- Fixed-Size Record Design: “Database Internals” Chapter 3 - Alex Petrov
- File Format Design: “Practical Binary Analysis” Chapter 2 - Dennis Andriesse
- Endianness: “Computer Systems: A Programmer’s Perspective” Section 2.1 - Bryant & O’Hallaron
Difficulty: Intermediate Time estimate: 1 week Prerequisites:
- Project 1 completed
- Basic understanding of binary file I/O (fread, fwrite)
- Understanding of struct packing and alignment
Real world outcome:
$ ./graph_save social.graph
Saving graph to social.graph.db
Nodes: 1000 (9000 bytes)
Edges: 5000 (125000 bytes)
Properties: 8500 (variable, in properties.db)
String pool: 45000 bytes
Total: 179000 bytes
Save complete.
$ ls -la *.db
-rw-r--r-- 1 user staff 9000 Dec 21 10:00 nodes.db
-rw-r--r-- 1 user staff 125000 Dec 21 10:00 rels.db
-rw-r--r-- 1 user staff 45000 Dec 21 10:00 strings.db
$ ./graph_load social.graph
Loading from social.graph.db...
Loaded 1000 nodes, 5000 edges in 12ms
Random lookup node 500: O(1), 0.001ms
Implementation Hints:
Design your file format like Neo4j does:
Node Record (9 bytes):
┌─────────┬──────────────┬──────────────┬─────────┐
│ in_use │ first_rel_id │ first_prop_id│ labels │
│ 1 byte │ 4 bytes │ 4 bytes │ varies │
└─────────┴──────────────┴──────────────┴─────────┘
To find node N: seek to offset N * 9, read 9 bytes
Relationship Record (25 bytes):
┌─────────┬──────────┬──────────┬──────────┬─────────────┬─────────────┬──────────┐
│ in_use │ start_id │ end_id │ rel_type │ next_rel_s │ next_rel_e │ prop_id │
│ 1 byte │ 4 bytes │ 4 bytes │ 4 bytes │ 4 bytes │ 4 bytes │ 4 bytes │
└─────────┴──────────┴──────────┴──────────┴─────────────┴─────────────┴──────────┘
For variable-length data (strings), use a separate string pool file:
- Store offset into string pool in the property record
- String pool: length-prefixed strings back to back
- Or: null-terminated strings with index file
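A minimal sketch of offset-based access for the node record above (labels omitted; assumes a packed struct and a little-endian host, which is fine for a single-machine sketch—a portable format would serialize each field explicitly):

```c
#include <stdint.h>
#include <stdio.h>

#pragma pack(push, 1)
typedef struct {
    uint8_t  in_use;
    uint32_t first_rel_id;
    uint32_t first_prop_id;
} NodeRecord;                 /* exactly 9 bytes when packed; labels stored elsewhere */
#pragma pack(pop)

/* Fixed-size records make lookup by ID a single seek + read: offset = id * record_size. */
int read_node(FILE *f, uint64_t id, NodeRecord *out) {
    if (fseek(f, (long)(id * sizeof(NodeRecord)), SEEK_SET) != 0) return -1;
    return fread(out, sizeof(NodeRecord), 1, f) == 1 ? 0 : -1;
}
```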
Questions to consider:
- Why not store strings inline in nodes/edges?
- What happens when you delete a node? (Mark as unused, don’t shift)
- How do you handle graph updates? (In-place for fixed fields, append for strings)
- What about alignment and padding?
Learning milestones:
- You can save and load a graph → You understand file I/O
- Random access by ID is O(1) → You understand fixed-size record design
- Variable data is stored separately → You understand storage decoupling
- Format is portable across systems → You understand endianness and packing
Project 4: Memory-Mapped Graph Storage
- File: LEARN_GRAPH_DATABASES_DEEP_DIVE.md
- Main Programming Language: C
- Alternative Programming Languages: Rust, C++
- Coolness Level: Level 4: Hardcore Tech Flex
- Business Potential: 2. The “Micro-SaaS / Pro Tool”
- Difficulty: Level 3: Advanced
- Knowledge Area: Operating Systems / Virtual Memory / mmap
- Software or Tool: Building: mmap-based storage
- Main Book: “The Linux Programming Interface” by Michael Kerrisk
What you’ll build: A graph storage engine that uses memory-mapped files (mmap), allowing the OS to handle paging and making your graph appear as a simple array in memory.
Why it teaches graph databases: Many databases (SQLite, LMDB) use mmap for I/O. It’s a powerful technique but comes with trade-offs (less control over flushing, OS-dependent behavior). Understanding when mmap works well vs. when explicit I/O is better is crucial database knowledge.
Core challenges you’ll face:
- Using mmap() correctly → maps to understanding virtual memory
- Handling file growth (mremap or remap) → maps to dynamic storage expansion
- Ensuring data durability (msync) → maps to understanding persistence guarantees
- Dealing with page faults → maps to understanding I/O cost
Key Concepts:
- mmap System Call: “The Linux Programming Interface” Chapter 49 - Michael Kerrisk
- Virtual Memory: “Computer Systems: A Programmer’s Perspective” Chapter 9 - Bryant & O’Hallaron
- File-Backed Memory: “Advanced Programming in the UNIX Environment” Chapter 14 - Stevens & Rago
- msync and Durability: SQLite mmap documentation
Resources for key challenges:
- CMU - Are You Sure You Want to Use MMAP in Your DBMS? - Critical analysis of mmap trade-offs
Difficulty: Advanced Time estimate: 1-2 weeks Prerequisites:
- Project 3 completed (file-based persistence)
- Understanding of virtual memory concepts
- Familiarity with Unix system calls
Real world outcome:
$ ./mmap_graph create large.graph 10000000
Creating mmap-backed graph...
Mapped file: 1.2 GB
Node capacity: 10,000,000
Ready.
$ ./mmap_graph load large.graph
Loading via mmap...
Mapped in 0.8ms (just setting up mapping, no actual I/O!)
$ ./mmap_graph query large.graph
> get_node 5000000
Node 5000000 accessed in 0.02ms (first access, page fault)
> get_node 5000001
Node 5000001 accessed in 0.001ms (same page, cached)
> traverse 1 depth 3
Traversed 15,847 nodes in 45ms
Pages faulted: 234
Implementation Hints:
Basic mmap usage in C:
Open file with proper size
mmap() with MAP_SHARED for persistence
Cast pointer to your record array
Access records directly via pointer arithmetic
msync() to ensure writes reach disk
munmap() when done
The beauty of mmap: your nodes become a simple array access:
Node* nodes = (Node*)mmap(...);
Node n = nodes[node_id]; // OS handles loading from disk
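A slightly fuller sketch of the open/map/use/sync/unmap cycle (POSIX calls; error handling trimmed; the `Node` layout here is just a stand-in for whatever record type you defined in Project 3):

```c
#include <stddef.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

typedef struct { unsigned char in_use; unsigned int first_rel, first_prop; } Node;

int main(void) {
    size_t capacity = 1000000;                        /* nodes the file can hold */
    size_t len = capacity * sizeof(Node);

    int fd = open("nodes.db", O_RDWR | O_CREAT, 0644);
    ftruncate(fd, (off_t)len);                        /* size the backing file */

    /* MAP_SHARED: stores through this pointer eventually reach the file. */
    Node *nodes = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);

    nodes[42].in_use = 1;                             /* plain array access; the OS pages it in */

    msync(nodes, len, MS_SYNC);                       /* force dirty pages to disk now */
    munmap(nodes, len);
    close(fd);
    return 0;
}
```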
But understand the dangers:
- You can’t control when dirty pages are written
- Page faults happen at unpredictable times
- No control over eviction order
- Signal handlers needed for SIGBUS
Questions to think about:
- What happens if you access node 1, then node 1,000,000? (Two page faults)
- How big is a page? (Usually 4KB) How many nodes fit in one page?
- When should you call msync()? What flags (MS_SYNC vs MS_ASYNC)?
- Why does SQLite make mmap optional?
Learning milestones:
- Graph loads “instantly” → You understand memory mapping
- You observe page faults during traversal → You understand demand paging
- msync ensures durability → You understand persistence guarantees
- You understand mmap’s limitations → You’re ready for buffer pools
Project 5: Buffer Pool Manager
- File: LEARN_GRAPH_DATABASES_DEEP_DIVE.md
- Main Programming Language: C
- Alternative Programming Languages: Rust, C++
- Coolness Level: Level 4: Hardcore Tech Flex
- Business Potential: 3. The “Service & Support” Model
- Difficulty: Level 4: Expert
- Knowledge Area: Database Internals / Caching / I/O Management
- Software or Tool: Building: Buffer Pool
- Main Book: “Database Internals” by Alex Petrov
What you’ll build: A buffer pool manager that caches disk pages in memory, implementing LRU eviction, pin counting, and dirty page tracking—the heart of any serious database.
Why it teaches graph databases: The buffer pool is what separates toy databases from real ones. It gives you control over memory usage, I/O scheduling, and durability. Neo4j, SQLite, and every serious database has a buffer pool. This is where databases get fast.
Core challenges you’ll face:
- Page caching with fixed memory → maps to memory management constraints
- LRU eviction policy → maps to cache management strategies
- Pin counting for safe page access → maps to concurrency-safe reference counting
- Dirty page tracking and flushing → maps to write-back strategies
Key Concepts:
- Buffer Pool Architecture: “Database Internals” Chapter 5 - Alex Petrov
- LRU Cache Implementation: “Algorithms, Fourth Edition” Section 1.3 - Sedgewick & Wayne
- Page Management: “Operating Systems: Three Easy Pieces” Chapter 22 - OSTEP
- Cache-Conscious Design: “Computer Systems: A Programmer’s Perspective” Chapter 6 - Bryant & O’Hallaron
Resources for key challenges:
- Let’s Build a Simple Database - Part 3 - Pager implementation
Difficulty: Expert Time estimate: 2 weeks Prerequisites:
- Projects 3-4 completed
- Understanding of cache concepts
- Solid C programming (hash tables, linked lists)
Real world outcome:
$ ./buffered_graph --pool-size 1000 --page-size 4096
Buffer pool initialized: 1000 pages, ~4MB memory
> load graph.db
Graph loaded. Pages on disk: 50,000
Pages in buffer: 0 (cold start)
> traverse node 1 depth 5
Traversed 12,453 nodes
Buffer stats:
Pages fetched from disk: 234
Pages found in buffer: 11,219 (97.9% hit rate)
Dirty pages: 0
Pinned pages: 0
> add_edge 1 500 "KNOWS"
Edge added.
Buffer stats:
Dirty pages: 2 (will be written on checkpoint)
> checkpoint
Flushing 2 dirty pages to disk...
Done.
Implementation Hints:
Core buffer pool components:
Page structure:
- page_id (which disk page this is)
- data[PAGE_SIZE] (the actual bytes)
- pin_count (number of active users)
- is_dirty (has been modified)
- prev/next (for LRU list)
Buffer Pool:
- pages[POOL_SIZE] (the page slots)
- page_table (hash map: disk_page_id -> buffer_slot)
- lru_head, lru_tail (doubly-linked list of unpinned pages)
Key operations:
fetch_page(page_id): Get page, loading from disk if needed
unpin_page(page_id, is_dirty): Release page, mark dirty if modified
flush_page(page_id): Write dirty page to disk
evict(): Find LRU unpinned page, write if dirty, reclaim slot
The pin count is crucial: you can’t evict a page while someone is using it!
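A sketch of the core structures and the fetch path (hypothetical names; the page table hash map, LRU maintenance, and disk I/O are left as stubs you would implement):

```c
#include <stdbool.h>
#include <stdint.h>

#define PAGE_SIZE 4096

typedef struct Page {
    uint64_t page_id;              /* which disk page occupies this slot */
    char     data[PAGE_SIZE];
    int      pin_count;            /* >0 means in use: must not be evicted */
    bool     is_dirty;             /* modified since last flush */
    struct Page *lru_prev, *lru_next;
} Page;

/* Stubs the real implementation must provide. */
Page *page_table_lookup(uint64_t page_id);   /* hash: disk page -> slot, or NULL on miss */
Page *evict_lru_unpinned(void);              /* flush if dirty, reclaim the slot */
void  disk_read(uint64_t page_id, char *buf);

Page *fetch_page(uint64_t page_id) {
    Page *p = page_table_lookup(page_id);
    if (p == NULL) {                          /* miss: bring the page in from disk */
        p = evict_lru_unpinned();
        if (p == NULL) return NULL;           /* everything pinned: caller must wait */
        disk_read(page_id, p->data);
        p->page_id = page_id;
        p->is_dirty = false;
    }
    p->pin_count++;                           /* pinned until unpin_page() */
    return p;
}
```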
Questions to explore:
- What happens if all pages are pinned? (Can’t evict, must wait or error)
- What’s the difference between LRU and LRU-K?
- How does dirty page batching improve performance?
- What’s a checkpoint and why is it important?
Learning milestones:
- Pages are cached in memory → You understand caching basics
- LRU eviction works correctly → You understand eviction policies
- Pin counts prevent unsafe eviction → You understand concurrency safety
- Dirty pages are tracked and flushed → You understand write-back caching
Project 6: B+Tree Index for Node Lookups
- File: LEARN_GRAPH_DATABASES_DEEP_DIVE.md
- Main Programming Language: C
- Alternative Programming Languages: Rust, C++, Go
- Coolness Level: Level 4: Hardcore Tech Flex
- Business Potential: 3. The “Service & Support” Model
- Difficulty: Level 4: Expert
- Knowledge Area: Data Structures / Indexing / Databases
- Software or Tool: Building: B+Tree Index
- Main Book: “Database Internals” by Alex Petrov
What you’ll build: A disk-based B+Tree index that allows fast lookups of nodes by property values (e.g., “find all nodes where name = ‘Alice’”).
Why it teaches graph databases: While index-free adjacency makes traversals fast, finding your starting nodes still requires indexes. “Find the person named Alice” needs a B+Tree lookup before you can start traversing. Every graph database supports property indexes.
Core challenges you’ll face:
- Implementing B+Tree insert with splits → maps to balanced tree algorithms
- Disk-based nodes with fixed-size pages → maps to database page design
- Range scans via leaf traversal → maps to understanding B+Tree structure
- Handling variable-length keys → maps to key encoding strategies
Key Concepts:
- B+Tree Structure: “Database Internals” Chapter 4 - Alex Petrov
- B-Tree Algorithms: “Introduction to Algorithms” (CLRS) Chapter 18
- Disk-Based Trees: Let’s Build a Simple Database - Part 7
- Key Encoding: “Designing Data-Intensive Applications” Chapter 3 - Martin Kleppmann
Resources for key challenges:
- Let’s Build a Simple Database - B-Tree Parts - Excellent step-by-step B-Tree implementation
Difficulty: Expert Time estimate: 2-3 weeks Prerequisites:
- Project 5 completed (buffer pool)
- Understanding of binary search trees
- Strong grasp of recursion
Real world outcome:
$ ./btree_index create name_index.db
B+Tree index created. Order: 100 (fits 4KB page)
$ ./btree_index bulk_load name_index.db nodes.csv
Loading 1,000,000 entries...
Tree height: 3
Internal nodes: 127
Leaf nodes: 10,214
Load time: 8.3 seconds
$ ./btree_index lookup name_index.db "Alice"
Found key "Alice" at node_id: 42
Lookup time: 0.02ms (3 page reads)
$ ./btree_index range name_index.db "A" "B"
Range scan A-B:
Aaron -> node 15
Abel -> node 892
...
Beatrice -> node 4521
Found 3,847 entries in 12ms
Implementation Hints:
B+Tree structure:
Internal Node:
┌────────┬────────┬────────┬────────┬────────┐
│ key1 │ key2 │ key3 │ ... │ keyN │
└────────┴────────┴────────┴────────┴────────┘
│ │ │ │
ptr0 ptr1 ptr2 ptrN
↓ ↓ ↓ ↓
child0 child1 child2 childN
Leaf Node (also linked for range scans):
┌────────────┬────────────┬────────────┐
│ key1:val1 │ key2:val2 │ key3:val3 │ → next_leaf
└────────────┴────────────┴────────────┘
For graph DB: key = property value, val = node_id
Insert algorithm outline:
- Search to find correct leaf
- If leaf has space, insert key-value
- If leaf is full, split: create new leaf, distribute keys
- Propagate split upward: parent gets new key, may also split
- If root splits, create new root
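A sketch of the search step inside one page, assuming a fixed-order layout with sorted fixed-width keys (splits, internal-node descent, and disk I/O omitted; sizes chosen so one page fits in 4KB):

```c
#include <stdint.h>
#include <string.h>

#define ORDER 100                       /* max keys per page */

typedef struct {
    int      is_leaf;
    int      num_keys;
    char     keys[ORDER][32];           /* fixed-width keys for simplicity */
    uint64_t vals[ORDER];               /* leaf: node_id; internal: child page_id */
    uint64_t next_leaf;                 /* leaves are chained for range scans */
} BTreePage;                            /* ~4KB when ORDER = 100 */

/* Binary search within one page: index of the first key >= target. */
int page_lower_bound(const BTreePage *p, const char *key) {
    int lo = 0, hi = p->num_keys;
    while (lo < hi) {
        int mid = (lo + hi) / 2;
        if (strcmp(p->keys[mid], key) < 0) lo = mid + 1;
        else hi = mid;
    }
    return lo;
}

/* Leaf lookup: exact match returns the stored node_id, 0 if absent. */
uint64_t leaf_lookup(const BTreePage *leaf, const char *key) {
    int i = page_lower_bound(leaf, key);
    if (i < leaf->num_keys && strcmp(leaf->keys[i], key) == 0) return leaf->vals[i];
    return 0;
}
```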
Key questions:
- Why B+Tree and not B-Tree? (All values in leaves, better for range scans)
- What determines the “order” of the tree? (How many keys fit in a page)
- How do you handle duplicate keys? (Allow duplicates in leaves, or use lists)
- What happens during concurrent access? (Need latching/locking—future project!)
Learning milestones:
- Lookups are O(log n) → You understand balanced tree lookups
- Inserts handle splits correctly → You understand tree rebalancing
- Range scans follow leaf links → You understand B+Tree leaf chain
- Tree survives restart → You’ve integrated with buffer pool
Project 7: Simple Query Language Parser
- File: LEARN_GRAPH_DATABASES_DEEP_DIVE.md
- Main Programming Language: C
- Alternative Programming Languages: Rust, Go, Python
- Coolness Level: Level 4: Hardcore Tech Flex
- Business Potential: 3. The “Service & Support” Model
- Difficulty: Level 3: Advanced
- Knowledge Area: Parsing / Compilers / Language Design
- Software or Tool: Building: Query Parser
- Main Book: “Writing a C Compiler” by Nora Sandler
What you’ll build: A parser for a simplified graph query language inspired by Cypher. Users can write queries like MATCH (a:Person)-[:KNOWS]->(b) WHERE a.name = "Alice" RETURN b.name.
Why it teaches graph databases: Cypher is to graph databases what SQL is to relational databases. Building a parser teaches you how query languages work, how patterns are represented internally, and prepares you for query execution.
Core challenges you’ll face:
- Tokenizing the query string → maps to lexical analysis
- Parsing graph patterns → maps to recursive descent parsing
- Building an AST → maps to abstract syntax tree design
- Handling property predicates → maps to expression parsing
Key Concepts:
- Lexical Analysis: “Writing a C Compiler” Chapter 1 - Nora Sandler
- Recursive Descent Parsing: “Language Implementation Patterns” Chapter 2 - Terence Parr
- AST Design: “Compilers: Principles and Practice” Chapter 5 - Dave & Dave
- Cypher Syntax: openCypher Specification
Difficulty: Advanced Time estimate: 2 weeks Prerequisites:
- Basic understanding of parsing concepts
- String manipulation in C
- Familiarity with tree data structures
Real world outcome:
$ ./query_parser
graph> MATCH (a:Person)-[:KNOWS]->(b:Person) WHERE a.name = "Alice" RETURN b
Tokens:
MATCH, LPAREN, IDENT(a), COLON, IDENT(Person), RPAREN,
DASH, LBRACKET, COLON, IDENT(KNOWS), RBRACKET, ARROW,
LPAREN, IDENT(b), COLON, IDENT(Person), RPAREN,
WHERE, IDENT(a), DOT, IDENT(name), EQ, STRING("Alice"),
RETURN, IDENT(b)
AST:
MatchClause
└── Pattern
├── Node(var=a, label=Person)
├── Edge(type=KNOWS, direction=OUT)
└── Node(var=b, label=Person)
└── Where
└── Equals
├── PropertyAccess(a, name)
└── StringLiteral("Alice")
└── Return
└── Variable(b)
Parse successful!
Implementation Hints:
Define your grammar (simplified):
query := match_clause where_clause? return_clause
match_clause := "MATCH" pattern
pattern := node_pattern (edge_pattern node_pattern)*
node_pattern := "(" variable? (":" label)? properties? ")"
edge_pattern := ("-[" edge_type "]->") | ("<-[" edge_type "]-")
where_clause := "WHERE" expression
return_clause := "RETURN" expression_list
Lexer produces tokens:
- Keywords: MATCH, WHERE, RETURN, AND, OR
- Symbols: (, ), [, ], -, ->, <-, :, ., =, <>
- Literals: strings, numbers
- Identifiers: variable and label names
Parser builds AST nodes:
MatchClause { patterns, where, return }
Pattern { nodes[], edges[] }
NodePattern { variable, label, properties }
EdgePattern { type, direction, variable }
WhereExpr { left, op, right }
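A sketch of the token type and one recursive-descent rule, `node_pattern`, from the grammar above (the token stream helpers `peek`, `advance`, and `parse_error` are assumed to exist elsewhere; properties inside the parentheses are skipped here):

```c
#include <string.h>

typedef enum {
    TOK_MATCH, TOK_WHERE, TOK_RETURN,
    TOK_LPAREN, TOK_RPAREN, TOK_LBRACKET, TOK_RBRACKET,
    TOK_DASH, TOK_ARROW, TOK_COLON, TOK_DOT, TOK_EQ,
    TOK_IDENT, TOK_STRING, TOK_NUMBER, TOK_EOF
} TokenType;

typedef struct { TokenType type; char text[64]; } Token;
typedef struct { char variable[64]; char label[64]; } NodePattern;

/* Provided by the lexer / parser driver (assumed). */
Token peek(void);
Token advance(void);
void  parse_error(const char *msg);

/* node_pattern := "(" variable? (":" label)? ")" */
NodePattern parse_node_pattern(void) {
    NodePattern np = {{0}, {0}};
    if (advance().type != TOK_LPAREN) parse_error("expected '('");
    if (peek().type == TOK_IDENT) {
        Token t = advance();                         /* optional variable name */
        strncpy(np.variable, t.text, sizeof np.variable - 1);
    }
    if (peek().type == TOK_COLON) {
        advance();                                   /* consume ':' */
        Token t = advance();
        if (t.type != TOK_IDENT) parse_error("expected label after ':'");
        strncpy(np.label, t.text, sizeof np.label - 1);
    }
    if (advance().type != TOK_RPAREN) parse_error("expected ')'");
    return np;
}
```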
Think about:
- How do you handle syntax errors gracefully?
- What does your AST look like for (a)-[:KNOWS]->(b)-[:LIKES]->(c)?
- How would you add support for optional patterns (OPTIONAL MATCH)?
Learning milestones:
- Lexer produces correct tokens → You understand tokenization
- Parser builds valid AST → You understand grammar and parsing
- Error messages are helpful → You understand error recovery
- AST can represent complex patterns → You’re ready for query execution
Project 8: Query Execution Engine
- File: LEARN_GRAPH_DATABASES_DEEP_DIVE.md
- Main Programming Language: C
- Alternative Programming Languages: Rust, Go, C++
- Coolness Level: Level 4: Hardcore Tech Flex
- Business Potential: 3. The “Service & Support” Model
- Difficulty: Level 4: Expert
- Knowledge Area: Query Processing / Execution Engines
- Software or Tool: Building: Query Executor
- Main Book: “Database Internals” by Alex Petrov
What you’ll build: An execution engine that takes parsed ASTs and executes them against your graph storage, returning results. This includes pattern matching, filtering, and projecting results.
Why it teaches graph databases: This is where everything comes together. Your parser creates plans, your storage provides data, and the execution engine connects them. Understanding how pattern matching translates to traversal operations is core graph database knowledge.
Core challenges you’ll face:
- Pattern matching algorithm → maps to subgraph isomorphism
- Variable binding during traversal → maps to execution context management
- Backtracking for complex patterns → maps to search space exploration
- Integrating with storage layer → maps to I/O-efficient execution
Key Concepts:
- Query Execution Models: “Database Internals” Chapter 8 - Alex Petrov
- Pattern Matching: Cypher: An Evolving Query Language for Property Graphs
- Iterator Model: “Architecture of a Database System” Section 4.2 - Hellerstein et al.
- Backtracking Algorithms: “Algorithms, Fourth Edition” Section 6.5 - Sedgewick & Wayne
Difficulty: Expert Time estimate: 2-3 weeks Prerequisites:
- Project 7 completed (parser)
- Projects 1-5 completed (storage layer)
- Understanding of iterator patterns
Real world outcome:
$ ./graph_db my_graph.db
Connected to graph: 100,000 nodes, 500,000 edges
graph> MATCH (a:Person {name: "Alice"})-[:KNOWS]->(b:Person) RETURN b.name, b.age
Execution Plan:
1. IndexScan(Person, name="Alice") -> a
2. ExpandOut(a, KNOWS) -> b
3. Filter(b.label = Person)
4. Project(b.name, b.age)
Executing...
| b.name | b.age |
|-----------|-------|
| Bob | 28 |
| Carol | 32 |
| Dave | 25 |
3 rows returned in 2.3ms
Nodes scanned: 1 (via index)
Edges traversed: 47
Implementation Hints:
Execution uses an iterator/operator model:
Operator interface:
init() - Prepare for execution
next() - Get next result tuple
close() - Cleanup
Common operators:
IndexScan - Find starting nodes via index
AllNodesScan - Iterate all nodes (expensive!)
Expand - Follow edges from current node
Filter - Check predicate, pass or skip
Project - Select which fields to output
Pattern matching approach:
For pattern (a)-[:KNOWS]->(b)-[:WORKS_AT]->(c):
1. Find all starting points for 'a'
2. For each 'a', expand KNOWS edges to find 'b' candidates
3. For each 'b', expand WORKS_AT edges to find 'c' candidates
4. Each complete (a,b,c) tuple is a result
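A sketch of the pull-based operator interface described above; each operator draws rows from its child via `next()` (names and the fixed-size row are illustrative):

```c
#include <stdbool.h>
#include <stdint.h>

#define MAX_VARS 8

/* One result row: bindings for the pattern variables (a, b, c, ...). */
typedef struct { uint64_t bound_node[MAX_VARS]; } Row;

/* Every operator (IndexScan, Expand, Filter, Project, ...) exposes this interface. */
typedef struct Operator {
    void (*init)(struct Operator *self);
    bool (*next)(struct Operator *self, Row *out);   /* false = exhausted */
    void (*close)(struct Operator *self);
    struct Operator *child;                          /* upstream operator, if any */
    void *state;                                     /* operator-specific state */
} Operator;

/* Drive a finished plan: pull rows from the root operator until it runs dry. */
long run_plan(Operator *root, void (*emit)(const Row *)) {
    long count = 0;
    Row row;
    root->init(root);
    while (root->next(root, &row)) {
        emit(&row);
        count++;
    }
    root->close(root);
    return count;
}
```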
Think about:
- What if pattern has no indexed start point? (Full scan, then filter)
- How do you handle cycles in the pattern?
- What’s the difference between DFS and BFS execution order?
- How would you add aggregation (COUNT, SUM)?
Learning milestones:
- Simple patterns execute → You understand basic pattern matching
- Filters work correctly → You understand predicate evaluation
- Multi-hop patterns work → You understand traversal chaining
- Query uses index when available → You understand query optimization basics
Project 9: Write-Ahead Logging (WAL)
- File: LEARN_GRAPH_DATABASES_DEEP_DIVE.md
- Main Programming Language: C
- Alternative Programming Languages: Rust, C++
- Coolness Level: Level 4: Hardcore Tech Flex
- Business Potential: 3. The “Service & Support” Model
- Difficulty: Level 4: Expert
- Knowledge Area: ACID / Durability / Recovery
- Software or Tool: Building: WAL System
- Main Book: “Database Internals” by Alex Petrov
What you’ll build: A Write-Ahead Log (WAL) that records all modifications before they’re applied to data files, enabling crash recovery and durability guarantees.
Why it teaches graph databases: Durability (the “D” in ACID) requires WAL. Without it, a crash can corrupt your database. Understanding the “log first, apply later” principle is fundamental to all database systems.
Core challenges you’ll face:
- Append-only log format design → maps to log record structure
- Ensuring log is durable before commit → maps to fsync semantics
- Crash recovery algorithm → maps to redo/undo processing
- Log truncation and checkpointing → maps to log space management
Key Concepts:
- Write-Ahead Logging: “Database Internals” Chapter 6 - Alex Petrov
- Crash Recovery: “Database Internals” Chapter 13 - Alex Petrov
- ARIES Algorithm: Architecture of a Database System Section 6
- fsync Semantics: “The Linux Programming Interface” Chapter 13 - Michael Kerrisk
Resources for key challenges:
- Architecture Weekly - WAL Foundation - Overview of WAL principles
Difficulty: Expert Time estimate: 2 weeks Prerequisites:
- Project 5 completed (buffer pool with dirty tracking)
- Understanding of file system durability
- Solid C programming
Real world outcome:
$ ./graph_db_wal my_graph.db
WAL enabled: wal.log
graph> BEGIN
graph> CREATE (:Person {name: "Eve"})
Node 1001 created (WAL: LSN 5001)
graph> CREATE (a)-[:KNOWS]->(b)
Edge 5001 created (WAL: LSN 5002)
graph> COMMIT
Transaction committed (WAL synced to LSN 5002)
$ kill -9 <pid> # Simulate crash
$ ./graph_db_wal my_graph.db
Recovering from WAL...
Last checkpoint: LSN 4000
WAL contains records up to: LSN 5002
Scanning 1002 log records since the last checkpoint...
Recovery complete: 2 operations replayed
Database ready.
graph> MATCH (p:Person {name: "Eve"}) RETURN p
| p |
|--------------------------|
| (:Person {name: "Eve"}) |
Eve survived the crash!
Implementation Hints:
WAL record format:
┌─────────────────────────────────────────────────────┐
│ LSN (8 bytes) - Log Sequence Number │
│ TxnID (8 bytes) - Transaction ID │
│ Type (1 byte) - INSERT/UPDATE/DELETE/COMMIT/ABORT │
│ Length (4 bytes) - Length of data │
│ Data (variable) - The actual change │
│ Checksum (4 bytes) - CRC32 of record │
└─────────────────────────────────────────────────────┘
Core operations:
wal_append(type, data) -> LSN
- Write record to log buffer
- Return the LSN for this record
wal_flush(lsn)
- Ensure all records up to LSN are on disk
- Call fsync() on log file
recover()
- Read log from last checkpoint
- For each record, redo the operation
- Handle incomplete transactions
Key insight: You MUST write to the log and fsync BEFORE telling the user “commit successful”. Otherwise a crash after “success” but before the data is on disk means data loss.
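A sketch of `wal_append` and `wal_flush` showing the ordering that matters: everything up to the commit record must be fsync'd before you acknowledge the commit. Buffering and struct padding are simplified, and `compute_crc32` is a stand-in for a real checksum routine:

```c
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

typedef struct {
    uint64_t lsn;
    uint64_t txn_id;
    uint8_t  type;              /* INSERT / UPDATE / DELETE / COMMIT / ABORT */
    uint32_t length;
} WalHeader;

static FILE    *wal_file;       /* opened in append mode at startup */
static uint64_t next_lsn = 1;

uint32_t compute_crc32(const void *data, size_t len);   /* assumed helper */

uint64_t wal_append(uint64_t txn_id, uint8_t type, const void *data, uint32_t len) {
    WalHeader h = { next_lsn++, txn_id, type, len };
    uint32_t crc = compute_crc32(data, len);
    fwrite(&h, sizeof h, 1, wal_file);       /* LSN, TxnID, Type, Length */
    fwrite(data, len, 1, wal_file);          /* the actual change */
    fwrite(&crc, sizeof crc, 1, wal_file);   /* checksum last, as in the format above */
    return h.lsn;
}

/* Must return only once every record up to `lsn` is durable on disk. */
void wal_flush(uint64_t lsn) {
    (void)lsn;                               /* single append stream: flush everything */
    fflush(wal_file);                        /* push the stdio buffer to the kernel */
    fsync(fileno(wal_file));                 /* force the kernel to write to the device */
}
```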
Questions to consider:
- Why append-only? (Sequential writes are fast, no in-place updates)
- What’s a checkpoint? (A snapshot of which LSN has been applied to data files)
- When can you delete old log records? (After checkpoint covers them)
- What about ABORT? (Either don’t apply to data files, or write UNDO records)
Learning milestones:
- Operations are logged before applying → You understand WAL principle
- Recovery replays log correctly → You understand crash recovery
- Database survives kill -9 → You’ve achieved durability
- Checkpoints limit recovery time → You understand log management
Project 10: Transaction Manager with MVCC
- File: LEARN_GRAPH_DATABASES_DEEP_DIVE.md
- Main Programming Language: C
- Alternative Programming Languages: Rust, C++
- Coolness Level: Level 5: Pure Magic
- Business Potential: 4. The “Open Core” Infrastructure
- Difficulty: Level 5: Master
- Knowledge Area: Concurrency / ACID / Transactions
- Software or Tool: Building: MVCC Transaction Manager
- Main Book: “Designing Data-Intensive Applications” by Martin Kleppmann
What you’ll build: A transaction manager implementing Multi-Version Concurrency Control (MVCC), allowing concurrent readers and writers without blocking each other.
Why it teaches graph databases: MVCC is how modern databases achieve high concurrency. Readers see consistent snapshots while writers modify data. This is the “I” (Isolation) in ACID. Neo4j, PostgreSQL, and most modern databases use MVCC.
Core challenges you’ll face:
- Version chain management → maps to row versioning
- Snapshot visibility rules → maps to transaction isolation
- Garbage collection of old versions → maps to version cleanup
- Write-write conflict detection → maps to transaction validation
Key Concepts:
- MVCC Fundamentals: “Designing Data-Intensive Applications” Chapter 7 - Martin Kleppmann
- Snapshot Isolation: “Database Internals” Chapter 12 - Alex Petrov
- Version Chains: DuckDB Transaction Paper
- Transaction IDs: “Database Internals” Chapter 11 - Alex Petrov
Difficulty: Master Time estimate: 3-4 weeks Prerequisites:
- Project 9 completed (WAL)
- Strong understanding of concurrency
- Experience with thread-safe programming
Real world outcome:
$ ./graph_db_mvcc my_graph.db &
Graph database started with MVCC enabled.
# Terminal 1 (Reader)
graph> BEGIN READ ONLY
graph> MATCH (p:Person) RETURN count(p)
| count(p) |
|----------|
| 1000 |
Started long report...
# Terminal 2 (Writer - concurrent)
graph> BEGIN
graph> CREATE (:Person {name: "NewPerson"})
graph> COMMIT
Committed!
# Terminal 1 (Reader - still sees old snapshot!)
graph> MATCH (p:Person) RETURN count(p)
| count(p) |
|----------|
| 1000 | <- Still 1000! Snapshot isolation working.
graph> COMMIT
graph> MATCH (p:Person) RETURN count(p)
| count(p) |
|----------|
| 1001 | <- New transaction sees update.
Implementation Hints:
MVCC version chain:
Each node/edge has a version chain:
┌──────────────────┐ ┌──────────────────┐ ┌──────────────────┐
│ Version 3 │───▶│ Version 2 │───▶│ Version 1 │
│ txn_id: 100 │ │ txn_id: 50 │ │ txn_id: 10 │
│ data: {age: 31} │ │ data: {age: 30} │ │ data: {age: 29} │
│ begin_ts: 99 │ │ begin_ts: 49 │ │ begin_ts: 9 │
│ end_ts: ∞ │ │ end_ts: 99 │ │ end_ts: 49 │
└──────────────────┘ └──────────────────┘ └──────────────────┘
Visibility rules:
A version V is visible to transaction T if:
- V.begin_ts <= T.snapshot_ts
- V.end_ts > T.snapshot_ts
- V.txn_id is committed (not in active_txns set)
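A sketch of the visibility check implied by the rules above (hypothetical types; `is_committed` stands in for a lookup against the active-transaction table):

```c
#include <stdbool.h>
#include <stdint.h>

typedef struct Version {
    uint64_t txn_id;        /* transaction that wrote this version */
    uint64_t begin_ts;      /* becomes visible at this timestamp */
    uint64_t end_ts;        /* superseded at this timestamp (UINT64_MAX = current) */
    void    *data;
    struct Version *older;  /* next link in the version chain */
} Version;

typedef struct { uint64_t snapshot_ts; } Txn;

bool is_committed(uint64_t txn_id);   /* assumed: consults the active-transaction table */

/* Walk the chain from newest to oldest and return the first visible version. */
Version *visible_version(Version *newest, const Txn *t) {
    for (Version *v = newest; v != NULL; v = v->older) {
        if (v->begin_ts <= t->snapshot_ts &&
            v->end_ts   >  t->snapshot_ts &&
            is_committed(v->txn_id))
            return v;
    }
    return NULL;            /* the record did not exist in this snapshot */
}
```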
Key components:
- Transaction ID generator (monotonic)
- Active transaction table
- Version chain per record
- Snapshot timestamp assignment at BEGIN
- Commit/abort tracking
Think about:
- What happens if two transactions try to update the same node?
- How do you garbage collect old versions no one can see?
- What’s the difference between snapshot isolation and serializable?
- How does this interact with your WAL?
Learning milestones:
- Readers don’t block writers → You understand MVCC basics
- Snapshots are consistent → You understand isolation levels
- Write conflicts are detected → You understand transaction validation
- Old versions are cleaned up → You understand garbage collection
Project 11: Graph-Specific Algorithms (PageRank, Community Detection)
- File: LEARN_GRAPH_DATABASES_DEEP_DIVE.md
- Main Programming Language: C
- Alternative Programming Languages: Rust, Python, Go
- Coolness Level: Level 4: Hardcore Tech Flex
- Business Potential: 3. The “Service & Support” Model
- Difficulty: Level 3: Advanced
- Knowledge Area: Graph Analytics / Linear Algebra
- Software or Tool: Building: Graph Algorithm Library
- Main Book: “Graph Algorithms” by Mark Needham & Amy Hodler
What you’ll build: Implementation of classic graph analytics algorithms—PageRank (for importance), Label Propagation (for community detection), and Betweenness Centrality (for bridge identification).
Why it teaches graph databases: Neo4j’s Graph Data Science library exists because these algorithms are why people use graph databases. Fraud detection, recommendation systems, and social network analysis all rely on these algorithms. This is where graph databases shine.
Core challenges you’ll face:
- PageRank iteration until convergence → maps to power iteration method
- Sparse matrix representation → maps to efficient graph representation
- Community detection heuristics → maps to graph clustering
- Betweenness via shortest paths → maps to centrality measures
Key Concepts:
- PageRank Algorithm: “Graph Algorithms” Chapter 5 - Needham & Hodler
- Community Detection: “Graph Algorithms” Chapter 6 - Needham & Hodler
- Centrality Measures: “Graph Algorithms” Chapter 4 - Needham & Hodler
- Power Iteration: “Introduction to Linear Algebra” Chapter 6 - Gilbert Strang
Difficulty: Advanced Time estimate: 2 weeks Prerequisites:
- Project 2 completed (basic graph algorithms)
- Basic linear algebra understanding
- Familiarity with iterative algorithms
Real world outcome:
$ ./graph_analytics social.graph
PageRank (damping=0.85, iterations=20):
Top 10 influential users:
1. user_42 score: 0.0847
2. user_17 score: 0.0721
3. user_891 score: 0.0698
...
Converged in 12 iterations.
Community Detection (Label Propagation):
Found 47 communities
Community 1: 2,341 members (tech enthusiasts)
Community 2: 1,892 members (sports fans)
Community 3: 1,456 members (music lovers)
...
Modularity score: 0.73
Betweenness Centrality:
Top 5 bridge nodes:
1. user_103 betweenness: 0.234 (connects communities 1-2)
2. user_567 betweenness: 0.198 (connects communities 2-3)
...
Implementation Hints:
PageRank algorithm:
1. Initialize all nodes with rank 1/N
2. For each iteration:
For each node v:
new_rank[v] = (1-d)/N + d * Σ(rank[u]/out_degree[u])
for all u that link to v
3. Repeat until ranks converge (change < epsilon)
d = damping factor (typically 0.85)
N = total number of nodes
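A sketch of one PageRank pass over an edge list, following the loop above (dangling nodes and convergence handling are left crude on purpose; array allocation belongs to the caller):

```c
#include <math.h>

/* One PageRank iteration over an edge list (src[i] -> dst[i]).
   Returns the total absolute change so the caller can test convergence. */
double pagerank_step(int n_nodes, int n_edges, const int *src, const int *dst,
                     const int *out_degree, const double *rank, double *new_rank,
                     double d) {
    for (int v = 0; v < n_nodes; v++)
        new_rank[v] = (1.0 - d) / n_nodes;           /* teleport term (1-d)/N */

    for (int i = 0; i < n_edges; i++)                /* each edge passes rank along */
        if (out_degree[src[i]] > 0)
            new_rank[dst[i]] += d * rank[src[i]] / out_degree[src[i]];

    double delta = 0.0;
    for (int v = 0; v < n_nodes; v++)
        delta += fabs(new_rank[v] - rank[v]);
    return delta;
}
```

A driver initializes every rank to 1/N, calls `pagerank_step` in a loop, swaps the two arrays, and stops once the returned delta drops below epsilon.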
Label Propagation (simple community detection):
1. Assign each node a unique label
2. For each iteration:
For each node (in random order):
Set node's label to most frequent label among neighbors
(break ties randomly)
3. Repeat until labels stabilize
4. Nodes with same label = same community
Questions to explore:
- Why does PageRank converge? (It’s computing the eigenvector)
- What’s the time complexity of PageRank? (O(iterations * edges))
- How would you parallelize these algorithms?
- How does Neo4j’s GDS library optimize these?
Learning milestones:
- PageRank converges and ranks make sense → You understand iterative algorithms
- Communities are meaningful → You understand graph clustering
- Centrality identifies key nodes → You understand network analysis
- Algorithms scale to large graphs → You understand performance optimization
Project 12: Distributed Graph Partitioning
- File: LEARN_GRAPH_DATABASES_DEEP_DIVE.md
- Main Programming Language: C
- Alternative Programming Languages: Rust, Go, C++
- Coolness Level: Level 5: Pure Magic
- Business Potential: 4. The “Open Core” Infrastructure
- Difficulty: Level 5: Master
- Knowledge Area: Distributed Systems / Graph Partitioning
- Software or Tool: Building: Distributed Graph Engine
- Main Book: “Designing Data-Intensive Applications” by Martin Kleppmann
What you’ll build: A system that partitions a large graph across multiple machines, enabling horizontal scaling while minimizing cross-partition traversals.
Why it teaches graph databases: When graphs get too big for one machine, you need distribution. But graphs are hard to partition—unlike rows in a table, cutting edges creates expensive cross-machine communication. Understanding graph partitioning is key to scaling graph databases.
Core challenges you’ll face:
- Partitioning algorithm (METIS-style) → maps to balanced graph cuts
- Cross-partition queries → maps to distributed query execution
- Replication for fault tolerance → maps to data durability
- Minimizing network hops → maps to locality optimization
Key Concepts:
- Graph Partitioning: “Designing Data-Intensive Applications” Chapter 6 - Martin Kleppmann
- Distributed Query Processing: “Database Internals” Chapter 9 - Alex Petrov
- Consistent Hashing: “Designing Data-Intensive Applications” Chapter 6 - Martin Kleppmann
- Balanced Cuts: METIS paper and documentation
Difficulty: Master Time estimate: 1 month+ Prerequisites:
- All previous projects completed
- Understanding of network programming
- Experience with distributed systems concepts
Real world outcome:
# On coordinator node
$ ./graph_cluster init 3
Initializing 3-node cluster...
Node 1: 192.168.1.10 (primary)
Node 2: 192.168.1.11
Node 3: 192.168.1.12
Cluster ready.
$ ./graph_cluster load billion_edges.graph
Loading graph with 1B nodes, 10B edges...
Partitioning using METIS...
Partition 1: 334M nodes, 3.2B edges, 2.1% edge cut
Partition 2: 333M nodes, 3.4B edges, 1.8% edge cut
Partition 3: 333M nodes, 3.4B edges, 1.9% edge cut
Distribution complete.
$ ./graph_cluster query
cluster> MATCH (a:Person {name:"Alice"})-[:KNOWS*1..3]->(b) RETURN count(b)
Query plan:
1. Find Alice (hash -> Node 2)
2. Local 1-hop expansion on Node 2
3. Parallel 2-hop expansion (spans all nodes)
4. Aggregate counts
| count(b) |
|----------|
| 4,523,891|
Query time: 234ms (47ms local, 187ms cross-partition)
Implementation Hints:
Partitioning strategies:
1. Hash partitioning: node_id % num_partitions
- Simple but ignores graph structure
- Many edges will be cut
2. Label/range partitioning: nodes with same label together
- Good for label-heavy workloads
- May be imbalanced
3. Graph-aware (METIS-style):
- Minimize edge cuts while balancing partition sizes
- Pre-compute partitions
- Better locality but expensive to compute
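A sketch contrasting the simplest placement rule (hash partitioning) with the check that decides whether a traversal hop stays local; a METIS-style partitioner would replace `owner_of` with a precomputed assignment table:

```c
#include <stdbool.h>
#include <stdint.h>

/* Strategy 1: hash partitioning. Cheap and balanced, but blind to structure. */
int owner_of(uint64_t node_id, int num_partitions) {
    return (int)(node_id % (uint64_t)num_partitions);
}

/* A hop is "local" only if both endpoints live on this partition;
   otherwise the edge is cut and the hop costs a network round trip. */
bool hop_is_local(uint64_t from, uint64_t to, int this_partition, int num_partitions) {
    return owner_of(from, num_partitions) == this_partition &&
           owner_of(to,   num_partitions) == this_partition;
}
```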
Distributed traversal:
For pattern (a)-[:KNOWS]->(b)-[:WORKS_AT]->(c):
1. Find 'a' locally
2. Expand KNOWS - some 'b' nodes may be remote
3. Send requests to remote partitions for remote 'b' nodes
4. Each partition expands WORKS_AT locally
5. Aggregate results at coordinator
Think about:
- What’s the trade-off between partition balance and edge cuts?
- How do you handle a traversal that touches all partitions?
- What about nodes that have connections everywhere (super-nodes)?
- How do you replicate popular nodes to reduce network traffic?
Learning milestones:
- Graph is partitioned across nodes → You understand distributed data
- Queries span partitions correctly → You understand distributed execution
- Edge cuts are minimized → You understand graph partitioning algorithms
- System survives node failure → You understand replication
Project 13: Query Optimizer with Cost Model
- File: LEARN_GRAPH_DATABASES_DEEP_DIVE.md
- Main Programming Language: C
- Alternative Programming Languages: Rust, Go, C++
- Coolness Level: Level 4: Hardcore Tech Flex
- Business Potential: 4. The “Open Core” Infrastructure
- Difficulty: Level 4: Expert
- Knowledge Area: Query Optimization / Cost Modeling
- Software or Tool: Building: Query Optimizer
- Main Book: “Database Internals” by Alex Petrov
What you’ll build: A query optimizer that chooses the best execution plan by estimating costs based on statistics (cardinality, selectivity, index availability).
Why it teaches graph databases: The same query can be executed many ways. (a)-[:KNOWS]->(b)-[:WORKS_AT]->(c) could start from a, from c, or use an index. Good optimizers make queries 100x faster. Bad ones make databases unusable.
Core challenges you’ll face:
- Cardinality estimation → maps to statistics collection
- Plan enumeration → maps to search space exploration
- Cost modeling → maps to I/O and CPU cost estimation
- Plan selection → maps to optimization algorithms
Key Concepts:
- Query Optimization: “Database Internals” Chapter 10 - Alex Petrov
- Cardinality Estimation: “Architecture of a Database System” Section 4.4 - Hellerstein et al.
- Cost Models: “Designing Data-Intensive Applications” Chapter 3 - Martin Kleppmann
- Join Ordering: “Database Internals” Chapter 10 - Alex Petrov
Difficulty: Expert Time estimate: 3 weeks Prerequisites:
- Project 8 completed (query executor)
- Project 6 completed (indexes)
- Understanding of statistics and estimation
Real world outcome:
$ ./graph_db --explain my_graph.db
graph> MATCH (a:Person)-[:KNOWS]->(b:Person)-[:WORKS_AT]->(c:Company {name:"TechCorp"})
RETURN a.name
Statistics:
Person nodes: 100,000
Company nodes: 5,000
KNOWS edges: 500,000
WORKS_AT edges: 200,000
TechCorp employees: 50
Plan 1 (Start from Person, no index):
AllNodesScan(Person) -> a cost: 100,000
Expand(KNOWS) -> b cost: 500,000
Expand(WORKS_AT) -> c cost: 200,000
Filter(name="TechCorp") cost: 200,000
TOTAL: 1,000,000
Plan 2 (Start from TechCorp via index):
IndexScan(Company.name="TechCorp") -> c cost: 1
ExpandIn(WORKS_AT) -> b cost: 50
ExpandIn(KNOWS) -> a cost: 250
TOTAL: 301 ← WINNER!
Executing Plan 2...
50 rows in 3ms
Implementation Hints:
Statistics you need:
- Node count per label
- Edge count per type
- Property value histograms
- Index selectivity estimates
Cost model factors:
- Sequential scan: num_pages * page_read_cost
- Index scan: selectivity * num_rows * index_lookup_cost
- Expand: avg_degree * input_rows
- Filter: input_rows * filter_cost (output = selectivity * input)
Plan enumeration for pattern (a)-[r1]->(b)-[r2]->(c):
Possible orderings:
1. Start from a, expand r1, expand r2
2. Start from c, expand r2 backward, expand r1 backward
3. Start from b, expand r1 backward, expand r2 forward
Each can use AllNodesScan or IndexScan
Compare costs, pick cheapest
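A sketch of the cost arithmetic behind the plans above, using the kinds of statistics listed earlier (the constants and formulas are illustrative, not a real cost model):

```c
/* Illustrative statistics, mirroring the example above. */
typedef struct {
    double label_count;       /* nodes carrying the starting label        */
    double avg_out_degree;    /* average edges of the expanded type       */
    double index_selectivity; /* fraction of rows an index scan returns   */
} Stats;

/* Cost of: scan starting nodes, then expand one relationship type.
   Unit cost = one row touched. */
double cost_scan_then_expand(const Stats *s, int use_index) {
    double start_rows = use_index ? s->label_count * s->index_selectivity
                                  : s->label_count;            /* AllNodesScan */
    double expand_rows = start_rows * s->avg_out_degree;
    return start_rows + expand_rows;
}
```

The optimizer prices each legal starting point and scan choice with formulas like this and keeps the cheapest plan.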
Questions:
- How do you handle unknown selectivity? (Use defaults or samples)
- What if statistics are stale? (Re-analyze periodically)
- How do you handle variable-length patterns like [:KNOWS*1..5]?
- Is the optimizer overhead worth it for simple queries?
Learning milestones:
- Optimizer picks obviously better plans → You understand basic cost modeling
- Statistics affect plan choice → You understand cardinality estimation
- Complex patterns optimize correctly → You understand plan enumeration
- Query times improve dramatically → You’ve built a real optimizer
Project 14: Full-Text Search Integration
- File: LEARN_GRAPH_DATABASES_DEEP_DIVE.md
- Main Programming Language: C
- Alternative Programming Languages: Rust, Go
- Coolness Level: Level 3: Genuinely Clever
- Business Potential: 3. The “Service & Support” Model
- Difficulty: Level 3: Advanced
- Knowledge Area: Information Retrieval / Text Search
- Software or Tool: Building: Full-Text Index
- Main Book: “Introduction to Information Retrieval” by Manning, Raghavan & Schütze
What you’ll build: A full-text search index for graph properties, enabling queries like “find all Person nodes where bio contains ‘machine learning’”.
Why it teaches graph databases: Neo4j integrates with Lucene for full-text search. Combining graph traversal with text search is powerful—find experts by searching their profiles, then traverse to their companies and co-workers.
Core challenges you’ll face:
- Tokenization and stemming → maps to text processing
- Inverted index construction → maps to index data structures
- TF-IDF scoring → maps to relevance ranking
- Integration with graph queries → maps to hybrid query execution
Key Concepts:
- Inverted Indexes: “Introduction to Information Retrieval” Chapter 1 - Manning et al.
- TF-IDF Scoring: “Introduction to Information Retrieval” Chapter 6 - Manning et al.
- Text Tokenization: “Introduction to Information Retrieval” Chapter 2 - Manning et al.
- Lucene Architecture: Lucene documentation
Difficulty: Advanced Time estimate: 2 weeks Prerequisites:
- Project 6 completed (B+Tree)
- Understanding of text processing
- Basic information retrieval concepts
Real world outcome:
$ ./graph_db my_graph.db
graph> CREATE FULLTEXT INDEX bio_index FOR (p:Person) ON (p.bio)
Index created on Person.bio
Indexing 100,000 documents...
Vocabulary size: 45,000 terms
Index size: 12MB
Done.
graph> CALL db.index.fulltext.queryNodes("bio_index", "machine learning")
YIELD node, score
RETURN node.name, score
ORDER BY score DESC
LIMIT 5
| node.name | score |
|--------------|--------|
| Alice Chen | 8.234 |
| Bob ML | 7.891 |
| Carol Data | 6.543 |
...
graph> MATCH (p:Person)-[:WORKS_AT]->(c:Company)
WHERE p.bio CONTAINS "machine learning"
RETURN c.name, count(p) as ml_experts
ORDER BY ml_experts DESC
| c.name | ml_experts |
|------------|------------|
| TechCorp | 23 |
| DataInc | 18 |
...
Implementation Hints:
Inverted index structure:
term -> [(doc_id, positions), (doc_id, positions), ...]
Example:
"machine" -> [(doc_42, [0, 156]), (doc_89, [23]), ...]
"learning" -> [(doc_42, [1, 157]), (doc_89, [24]), ...]
To find "machine learning":
1. Get posting lists for both terms
2. Intersect by doc_id
3. Check if positions are adjacent
TF-IDF scoring:
TF(t, d) = frequency of term t in document d
IDF(t) = log(N / df(t)) where N = total docs, df = docs containing t
Score = TF * IDF
Higher score = term appears often in this doc, rarely in others
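A sketch of the TF-IDF formula as code (natural log; real engines typically add smoothing and document-length normalization):

```c
#include <math.h>

/* term_freq: occurrences of the term in this document
   doc_freq:  number of documents containing the term
   num_docs:  total documents in the index                    */
double tf_idf(double term_freq, double doc_freq, double num_docs) {
    if (term_freq <= 0.0 || doc_freq <= 0.0) return 0.0;
    double idf = log(num_docs / doc_freq);   /* rare terms score higher */
    return term_freq * idf;                  /* frequent here, rare overall, wins */
}
```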
Integration with graph queries:
MATCH (p:Person) WHERE p.bio CONTAINS "machine learning"
Execution:
1. Query full-text index for "machine learning"
2. Get list of node IDs with scores
3. Fetch nodes, continue with graph pattern
Think about:
- How do you handle updates to indexed properties?
- How do you tokenize “machine-learning” vs “machine learning”?
- Should you support phrase queries? Fuzzy matching?
- How large can the inverted index get?
Learning milestones:
- Basic term search works → You understand inverted indexes
- Phrase search works → You understand position-aware indexing
- Scores reflect relevance → You understand TF-IDF
- Integrates with graph queries → You understand hybrid execution
Project 15: Complete Graph Database Engine
- File: LEARN_GRAPH_DATABASES_DEEP_DIVE.md
- Main Programming Language: C
- Alternative Programming Languages: Rust, C++
- Coolness Level: Level 5: Pure Magic
- Business Potential: 5. The “Industry Disruptor”
- Difficulty: Level 5: Master
- Knowledge Area: Database Systems / Systems Programming
- Software or Tool: Building: Graph Database
- Main Book: “Database Internals” by Alex Petrov
What you’ll build: A complete graph database engine combining all previous projects: storage engine, buffer pool, B+Tree indexes, query parser, executor, optimizer, WAL, and transactions. A mini Neo4j from scratch.
Why it teaches graph databases: This is the capstone. You’ll integrate every component into a cohesive system. The challenges of making components work together teach you as much as building them individually. You’ll understand graph databases at the deepest level.
Core challenges you’ll face:
- Component integration → maps to systems architecture
- API design → maps to usability and developer experience
- Performance tuning → maps to system optimization
- Bug hunting across layers → maps to debugging complex systems
Key Concepts:
- Database Architecture: “Database Internals” throughout - Alex Petrov
- Systems Integration: “Designing Data-Intensive Applications” throughout - Martin Kleppmann
- API Design: “The Pragmatic Programmer” Chapter 7 - Hunt & Thomas
- Performance Analysis: “Computer Systems: A Programmer’s Perspective” Chapter 5 & 6 - Bryant & O’Hallaron
Difficulty: Master Time estimate: 1 month+ Prerequisites:
- All previous projects completed
- Strong systems programming skills
- Patience and determination
Real world outcome:
$ ./myneo4j --data-dir /var/myneo4j --port 7687
MyNeo4j Graph Database v0.1
Storage engine: Native fixed-record
Buffer pool: 1GB (262,144 pages)
WAL: Enabled (wal.log)
Transactions: MVCC with snapshot isolation
Indexes: B+Tree
Query language: MiniCypher
Server started on bolt://localhost:7687
# In another terminal
$ ./myneo4j-shell localhost:7687
Connected to MyNeo4j.
neo4j> CREATE INDEX person_name FOR (p:Person) ON (p.name);
Index created.
neo4j> BEGIN;
neo4j> CREATE (a:Person {name: "Alice", age: 30});
Node created.
neo4j> CREATE (b:Person {name: "Bob", age: 28});
Node created.
neo4j> MATCH (a:Person {name: "Alice"}), (b:Person {name: "Bob"})
CREATE (a)-[:KNOWS {since: 2020}]->(b);
Relationship created.
neo4j> COMMIT;
Transaction committed.
neo4j> EXPLAIN MATCH (a:Person {name: "Alice"})-[:KNOWS]->(b) RETURN b.name;
+----------------------------+
| Operator | Rows |
+----------------------------+
| Projection | 1 |
| Expand(KNOWS OUT) | 1 |
| IndexSeek(name) | 1 |
+----------------------------+
neo4j> MATCH (a:Person {name: "Alice"})-[:KNOWS]->(b) RETURN b.name;
+--------+
| b.name |
+--------+
| Bob |
+--------+
1 row, 0.8ms
neo4j> :quit
Goodbye!
Implementation Hints:
System architecture:
┌─────────────────────────────────────────────────────────────┐
│ Client Connection │
│ (TCP/Bolt protocol) │
└─────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────┐
│ Query Parser │
│ (Lexer → Parser → AST) │
└─────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────┐
│ Query Optimizer │
│ (Statistics → Plan Enumeration → Cost Model) │
└─────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────┐
│ Execution Engine │
│ (Operators: Scan, Expand, Filter, etc.) │
└─────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────┐
│ Transaction Manager │
│ (MVCC, Isolation) │
└─────────────────────────────────────────────────────────────┘
│
▼
┌───────────────────┬─────────────────┬───────────────────────┐
│ Buffer Pool │ Indexes │ WAL │
│ (Page Cache) │ (B+Tree) │ (Write-Ahead Log) │
└───────────────────┴─────────────────┴───────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────┐
│ Storage Engine │
│ (Fixed-size records: nodes.db, rels.db, etc.) │
└─────────────────────────────────────────────────────────────┘
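One way to express that layering in code is a single engine handle that owns each component and wires them together at startup. The compile-only sketch below is purely illustrative; every struct and function name is a hypothetical placeholder for the pieces built in Projects 4-13.

```c
#include <stddef.h>

/* Opaque handles for the components from earlier projects (hypothetical names). */
typedef struct StorageEngine StorageEngine;  /* Project 4: fixed-size record files */
typedef struct BufferPool    BufferPool;     /* Project 5: page cache */
typedef struct BTreeIndex    BTreeIndex;     /* Project 6: property indexes */
typedef struct Wal           Wal;            /* Project 9: write-ahead log */
typedef struct TxnManager    TxnManager;     /* Project 10: MVCC transactions */

/* The engine itself is mostly glue: one handle that owns each layer. */
typedef struct GraphDb {
    StorageEngine *storage;
    BufferPool    *pool;
    BTreeIndex   **indexes;   /* one entry per CREATE INDEX */
    size_t         n_indexes;
    Wal           *wal;
    TxnManager    *txns;
} GraphDb;

/* Startup wires the layers bottom-up: open storage, put the buffer pool on
 * top of it, replay the WAL to recover, then load index metadata. */
GraphDb *graphdb_open(const char *data_dir);
void     graphdb_close(GraphDb *db);
```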
Integration checklist:
- Storage engine creates/reads fixed-size records
- Buffer pool caches pages, tracks dirty pages
- WAL logs all modifications before apply (see the write-path sketch after this checklist)
- Transactions use MVCC for isolation
- Indexes support CREATE INDEX / DROP INDEX
- Parser handles MiniCypher subset
- Optimizer chooses between scan and index
- Executor runs plans, returns results
- Server accepts network connections
- Shell provides interactive access
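For the WAL item above, the key property is ordering: the redo record must be durable before the corresponding page change can reach disk. A compile-only sketch of that write path, against hypothetical APIs standing in for the components from Projects 5 and 9, might look like this:

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical component APIs (Project 5: buffer pool, Project 9: WAL). */
int      wal_append(uint64_t txn_id, uint64_t page_id, const void *redo, uint32_t len);
int      wal_fsync(void);
uint8_t *bufferpool_fetch(uint64_t page_id);      /* pins the page in memory */
void     bufferpool_mark_dirty(uint64_t page_id);
void     bufferpool_unpin(uint64_t page_id);

/* Write path: log first, apply second. Standard WAL only requires the log
 * to be flushed before the dirty page (or the commit) hits disk, so the
 * fsync here could be deferred to commit time for better throughput. */
int set_bytes_in_page(uint64_t txn_id, uint64_t page_id,
                      uint32_t offset, const void *val, uint32_t len) {
    if (wal_append(txn_id, page_id, val, len) != 0) return -1;
    if (wal_fsync() != 0) return -1;

    uint8_t *page = bufferpool_fetch(page_id);
    if (!page) return -1;
    memcpy(page + offset, val, len);
    bufferpool_mark_dirty(page_id);
    bufferpool_unpin(page_id);
    return 0;
}
```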
Think about:
- How do you test the integrated system?
- What benchmarks should you run? (a minimal timing sketch follows this list)
- How do you handle errors at each layer?
- What’s your upgrade/migration story?
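On the benchmarking question, even a crude harness is enough to get started: time a batch of identical operations and report throughput and mean latency. In the sketch below, `run_query()` is a hypothetical stand-in for whatever you want to measure (an index seek, a 1-hop expand, a full MATCH).

```c
#include <stdio.h>
#include <time.h>

/* Placeholder for the operation under test. */
static void run_query(int i) { (void)i; /* e.g. index seek + 1-hop expand */ }

int main(void) {
    const int n = 100000;
    struct timespec t0, t1;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < n; i++)
        run_query(i);
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    printf("%d ops in %.3f s  (%.0f ops/s, %.4f ms/op)\n",
           n, secs, n / secs, secs * 1000.0 / n);
    return 0;
}
```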
Learning milestones:
- Basic CRUD operations work → System is functional
- Transactions provide ACID → System is reliable
- Indexes speed up lookups → System is optimized
- Concurrent clients work → System is production-ready
- Benchmarks show good performance → System is competitive
Project Comparison Table
| Project | Difficulty | Time | Depth of Understanding | Fun Factor |
|---|---|---|---|---|
| 1. In-Memory Graph Library | Beginner | 1 week | ⭐⭐⭐ | ⭐⭐⭐ |
| 2. Graph Traversal Algorithms | Intermediate | 1 week | ⭐⭐⭐ | ⭐⭐⭐⭐ |
| 3. Graph Serialization | Intermediate | 1 week | ⭐⭐⭐ | ⭐⭐⭐ |
| 4. Memory-Mapped Storage | Advanced | 1-2 weeks | ⭐⭐⭐⭐ | ⭐⭐⭐⭐ |
| 5. Buffer Pool Manager | Expert | 2 weeks | ⭐⭐⭐⭐⭐ | ⭐⭐⭐ |
| 6. B+Tree Index | Expert | 2-3 weeks | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐ |
| 7. Query Language Parser | Advanced | 2 weeks | ⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ |
| 8. Query Execution Engine | Expert | 2-3 weeks | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐ |
| 9. Write-Ahead Logging | Expert | 2 weeks | ⭐⭐⭐⭐⭐ | ⭐⭐⭐ |
| 10. MVCC Transactions | Master | 3-4 weeks | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐ |
| 11. Graph Algorithms | Advanced | 2 weeks | ⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ |
| 12. Distributed Graph | Master | 1 month+ | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ |
| 13. Query Optimizer | Expert | 3 weeks | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐ |
| 14. Full-Text Search | Advanced | 2 weeks | ⭐⭐⭐⭐ | ⭐⭐⭐⭐ |
| 15. Complete Engine | Master | 1 month+ | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ |
Recommended Learning Path
Based on building a graph database from scratch in C:
Phase 1: Foundations (Weeks 1-3)
Start here to understand the basics.
- Project 1: In-Memory Graph Library - Get comfortable with graph data structures
- Project 2: Graph Traversal Algorithms - Understand how to navigate graphs
- Project 3: Graph Serialization - Learn to persist graphs to disk
Phase 2: Storage Engine (Weeks 4-7)
Build the foundation of a real database.
- Project 4: Memory-Mapped Storage - Understand OS-level I/O
- Project 5: Buffer Pool Manager - Master database I/O management
- Project 6: B+Tree Index - Implement efficient property lookups
Phase 3: Query Processing (Weeks 8-11)
Make your database usable.
- Project 7: Query Language Parser - Build a Cypher-like parser
- Project 8: Query Execution Engine - Execute queries against storage
- Project 13: Query Optimizer - Make queries fast automatically
Phase 4: ACID & Reliability (Weeks 12-16)
Make your database reliable.
- Project 9: Write-Ahead Logging - Achieve durability
- Project 10: MVCC Transactions - Handle concurrent access
Phase 5: Advanced Features (Weeks 17+)
Polish and extend.
- Project 11: Graph Algorithms - Add analytics capabilities
- Project 14: Full-Text Search - Enable text search
- Project 12: Distributed Graph - Scale horizontally
- Project 15: Complete Engine - Integrate everything
Final Capstone Project: Production-Ready Graph Database
Following the same pattern, this project represents the culmination of all your learning.
- File: LEARN_GRAPH_DATABASES_DEEP_DIVE.md
- Main Programming Language: C
- Alternative Programming Languages: Rust, C++
- Coolness Level: Level 5: Pure Magic (Super Cool)
- Business Potential: 5. The “Industry Disruptor”
- Difficulty: Level 5: Master
- Knowledge Area: Complete Database Systems
- Software or Tool: Building: Production Graph Database
- Main Book: “Database Internals” by Alex Petrov + “Designing Data-Intensive Applications” by Martin Kleppmann
What you’ll build: A production-grade graph database that could genuinely be used for real applications—with proper error handling, configuration, monitoring, backup/restore, and documentation.
Why this is the ultimate goal: Anyone can build a toy database. Building something production-worthy requires handling edge cases, providing operational tools, ensuring security, and creating a polished user experience. This is what separates engineers who merely understand databases from those who have mastered them.
Features beyond Project 15:
- Configuration file support (see the parsing sketch after this list)
- Proper logging and monitoring hooks
- Backup and restore functionality
- Data import/export (CSV, JSON)
- Authentication and authorization
- Query timeout and resource limits
- Comprehensive error messages
- Unit tests, integration tests, stress tests
- Documentation and examples
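For the configuration item above, a line-oriented key=value format keeps the loader trivial. The sketch below assumes hypothetical keys such as `data_dir`, `buffer_pool_gb`, and `max_connections`; it is one workable choice, not a prescribed format.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

typedef struct Config {
    char data_dir[256];
    long buffer_pool_gb;
    long max_connections;
} Config;

/* Parse "key=value" lines; '#' starts a comment, unknown keys are ignored
 * in this sketch (a real loader should warn about them). */
static int config_load(const char *path, Config *cfg) {
    FILE *f = fopen(path, "r");
    if (!f) return -1;
    char line[512];
    while (fgets(line, sizeof line, f)) {
        char *eq = strchr(line, '=');
        if (line[0] == '#' || !eq) continue;       /* comment or malformed line */
        *eq = '\0';
        char *key = line, *val = eq + 1;
        val[strcspn(val, "\r\n")] = '\0';          /* strip trailing newline */
        if (strcmp(key, "data_dir") == 0)
            snprintf(cfg->data_dir, sizeof cfg->data_dir, "%s", val);
        else if (strcmp(key, "buffer_pool_gb") == 0)
            cfg->buffer_pool_gb = strtol(val, NULL, 10);
        else if (strcmp(key, "max_connections") == 0)
            cfg->max_connections = strtol(val, NULL, 10);
    }
    fclose(f);
    return 0;
}

int main(void) {
    Config cfg = {0};
    if (config_load("myneo4j.conf", &cfg) == 0)
        printf("data_dir=%s buffer_pool_gb=%ld max_connections=%ld\n",
               cfg.data_dir, cfg.buffer_pool_gb, cfg.max_connections);
    return 0;
}
```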
Real world outcome:
$ myneo4j --config myneo4j.conf
Loading configuration from myneo4j.conf
Data directory: /var/lib/myneo4j
Buffer pool: 4GB
Max connections: 100
Auth: enabled (bcrypt)
TLS: enabled
Backup schedule: daily at 03:00
[2024-01-15 10:00:00] INFO MyNeo4j 1.0 starting...
[2024-01-15 10:00:01] INFO Recovery: WAL contains 0 uncommitted transactions
[2024-01-15 10:00:01] INFO Indexes: Loading 5 indexes
[2024-01-15 10:00:02] INFO Server listening on bolt://0.0.0.0:7687 (TLS)
[2024-01-15 10:00:02] INFO Server listening on http://0.0.0.0:7474 (metrics)
[2024-01-15 10:00:02] INFO Ready to accept connections
# Monitoring
$ curl http://localhost:7474/metrics
{
"nodes": 15000000,
"edges": 75000000,
"transactions_committed": 45123,
"buffer_pool_hit_rate": 0.987,
"avg_query_time_ms": 2.3,
"active_connections": 12
}
# Backup
$ myneo4j-admin backup --to /backups/$(date +%Y%m%d)
Backup started...
Phase 1: Checkpoint
Phase 2: Copy data files
Phase 3: Copy indexes
Phase 4: Verify checksum
Backup complete: 12.3GB in 4m23s
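The “Verify checksum” phase shown above boils down to hashing every copied file and comparing the result against the hash recorded when the backup was taken. A minimal sketch using FNV-1a, chosen here only because it is tiny; a real backup tool would prefer a cryptographic hash such as SHA-256.

```c
#include <stdio.h>
#include <stdint.h>

/* 64-bit FNV-1a over a whole file; returns 0 if the file cannot be opened. */
static uint64_t fnv1a_file(const char *path) {
    FILE *f = fopen(path, "rb");
    if (!f) return 0;
    uint64_t h = 14695981039346656037ULL;          /* FNV offset basis */
    unsigned char buf[4096];
    size_t n;
    while ((n = fread(buf, 1, sizeof buf, f)) > 0)
        for (size_t i = 0; i < n; i++) {
            h ^= buf[i];
            h *= 1099511628211ULL;                 /* FNV prime */
        }
    fclose(f);
    return h;
}

int main(int argc, char **argv) {
    for (int i = 1; i < argc; i++)
        printf("%016llx  %s\n", (unsigned long long)fnv1a_file(argv[i]), argv[i]);
    return 0;
}
```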
Learning milestones:
- Database survives chaos testing → You’ve built robust software
- Operations team can manage it → You’ve built operator-friendly software
- Developers find it easy to use → You’ve built user-friendly software
- Performance is competitive → You’ve built efficient software
- You can explain every design decision → You’ve mastered the domain
Summary
| # | Project Name | Main Programming Language |
|---|---|---|
| 1 | In-Memory Graph Library | C |
| 2 | Graph Traversal Algorithms (BFS, DFS, Shortest Path) | C |
| 3 | Graph Serialization & File Persistence | C |
| 4 | Memory-Mapped Graph Storage | C |
| 5 | Buffer Pool Manager | C |
| 6 | B+Tree Index for Node Lookups | C |
| 7 | Simple Query Language Parser | C |
| 8 | Query Execution Engine | C |
| 9 | Write-Ahead Logging (WAL) | C |
| 10 | Transaction Manager with MVCC | C |
| 11 | Graph-Specific Algorithms (PageRank, Community Detection) | C |
| 12 | Distributed Graph Partitioning | C |
| 13 | Query Optimizer with Cost Model | C |
| 14 | Full-Text Search Integration | C |
| 15 | Complete Graph Database Engine | C |
| Final | Production-Ready Graph Database | C |
Key Resources
Books
- “Database Internals” by Alex Petrov - The essential guide to how databases work
- “Designing Data-Intensive Applications” by Martin Kleppmann - Understanding distributed data systems
- “Graph Databases” by Ian Robinson, Jim Webber & Emil Eifrem - Neo4j-focused but excellent concepts
- “C Programming: A Modern Approach” by K. N. King - Essential C reference
- “Computer Systems: A Programmer’s Perspective” by Bryant & O’Hallaron - Understanding systems
Online Resources
- Let’s Build a Simple Database - SQLite clone tutorial
- Neo4j Internals - How Neo4j works
- openCypher Specification - Query language reference
- Graph Database Internals (PDF) - Deep dive
Papers
- Cypher: An Evolving Query Language for Property Graphs
- Are You Sure You Want to Use MMAP in Your DBMS?
You’re ready to build a graph database from scratch. Start with Project 1 and work your way through. By the end, you’ll understand graph databases at the deepest level—the level of the engineers who built Neo4j, TigerGraph, and Amazon Neptune.