LEARN GRAPH DATABASES DEEP DIVE

Learn Graph Databases: From Zero to Graph Database Master

Goal: Deeply understand graph databases—from basic graph data structures to building a complete graph database engine from scratch in C, including storage engines, query processing, indexing, and ACID transactions.


Why Graph Databases Matter

Relational databases struggle when data is highly connected: every JOIN requires index lookups or table scans, and that cost grows with the size of the data. Graph databases solve this by storing relationships as first-class citizens—each node directly points to its neighbors (a concept called index-free adjacency), so each traversal hop is an O(1) pointer dereference instead of an O(log n) index lookup.

After completing these projects, you will:

  • Understand every byte in a graph database’s storage format
  • Know how queries traverse from node to node without index lookups
  • Be able to implement graph algorithms (BFS, DFS, Dijkstra, PageRank)
  • Build your own query parser and execution engine
  • Understand ACID transactions and concurrency control in graph contexts
  • Have built a working graph database from scratch in C

Core Concept Analysis

The Property Graph Model

A property graph consists of:

┌─────────────────────────────────────────────────────────────────┐
│                     PROPERTY GRAPH MODEL                        │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│   ┌─────────────────┐         KNOWS          ┌───────────────┐  │
│   │     :Person     │ ────────────────────▶  │    :Person    │  │
│   │  name: "Alice"  │    {since: 2015}       │  name: "Bob"  │  │
│   │  age: 30        │                        │  age: 28      │  │
│   └─────────────────┘                        └───────────────┘  │
│           │                                         │           │
│           │ WORKS_AT                                │ WORKS_AT  │
│           ▼                                         ▼           │
│   ┌─────────────────────────────────────────────────────────┐   │
│   │                      :Company                           │   │
│   │                   name: "TechCorp"                      │   │
│   │                   founded: 2010                         │   │
│   └─────────────────────────────────────────────────────────┘   │
│                                                                 │
│   NODES: Entities with labels (:Person, :Company)               │
│   EDGES: Directed relationships with types (KNOWS, WORKS_AT)    │
│   PROPERTIES: Key-value pairs on both nodes and edges           │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

Fundamental Concepts

  1. Graph Data Structures
    • Adjacency List: Each node stores a list of its neighbors
    • Adjacency Matrix: 2D array where matrix[i][j] = 1 if edge exists
    • Edge List: Simple list of (source, target) pairs
  2. Index-Free Adjacency
    • Each node directly references its neighbors via pointers
    • Traversing a relationship is O(1) pointer dereference
    • No index lookup required—the graph is the index
    • As graph grows, local traversal cost remains constant
  3. Native Graph Storage
    • Fixed-size records for nodes and relationships
    • Nodes: ID, first_relationship_id, first_property_id, labels
    • Relationships: ID, start_node, end_node, type, next_rel_start, next_rel_end
    • Properties stored separately as linked chains
  4. Graph Traversal Algorithms
    • BFS (Breadth-First Search): Level-by-level exploration, uses queue
    • DFS (Depth-First Search): Deep exploration first, uses stack/recursion
    • Dijkstra: Shortest weighted path, uses priority queue
    • A*: Heuristic-guided shortest path
  5. Query Processing
    • Pattern matching: (a)-[:KNOWS]->(b) finds all KNOWS relationships
    • Query planning: Choose optimal traversal order
    • Query execution: Walk the graph, collect results
  6. ACID in Graph Databases
    • Write-Ahead Logging (WAL): Log changes before applying them
    • MVCC: Multiple versions for concurrent reads
    • Locking: Prevent conflicting concurrent modifications

Project List

Projects are ordered from fundamental understanding to advanced implementations. Each project builds on the previous ones.


Project 1: In-Memory Graph Library (The Foundation)

  • File: LEARN_GRAPH_DATABASES_DEEP_DIVE.md
  • Main Programming Language: C
  • Alternative Programming Languages: Rust, Go, C++
  • Coolness Level: Level 3: Genuinely Clever
  • Business Potential: 1. The “Resume Gold”
  • Difficulty: Level 1: Beginner
  • Knowledge Area: Data Structures / Memory Management
  • Software or Tool: Building: Graph Library
  • Main Book: “C Programming: A Modern Approach” by K. N. King

What you’ll build: A C library that can create, modify, and traverse graphs in memory—supporting nodes with labels, edges with types, and properties on both.

Why it teaches graph databases: Every graph database starts here. Before you worry about persistence or queries, you need to understand how graphs live in memory. This project forces you to think about pointer relationships, memory layout, and the trade-offs between adjacency lists and adjacency matrices.

Core challenges you’ll face:

  • Designing the node/edge data structures → maps to understanding property graph model
  • Managing dynamic memory for growing graphs → maps to memory allocation strategies
  • Implementing efficient neighbor lookups → maps to index-free adjacency concept
  • Handling bidirectional edge references → maps to doubly-linked relationship lists

Key Concepts:

  • Pointers and Dynamic Memory: “C Programming: A Modern Approach” Chapter 17 - K. N. King
  • Linked Lists in C: “Mastering Algorithms with C” Chapter 5 - Kyle Loudon
  • Graph Data Structures: “Algorithms, Fourth Edition” Section 4.1 - Sedgewick & Wayne
  • Memory Layout: “Computer Systems: A Programmer’s Perspective” Chapter 9 - Bryant & O’Hallaron

Difficulty: Beginner
Time estimate: 1 week
Prerequisites:

  • Basic C programming (pointers, structs, malloc/free)
  • Understanding of linked lists
  • No prior graph knowledge required

Real world outcome:

$ ./graph_demo
Creating graph...
Added node 1: Person {name: "Alice", age: 30}
Added node 2: Person {name: "Bob", age: 28}
Added edge: (1)-[:KNOWS {since: 2015}]->(2)

Traversing from Alice:
  Alice --KNOWS--> Bob

Node count: 2
Edge count: 1
Memory used: 256 bytes

Implementation Hints:

Start with these core structures:

Node structure should contain:
- Unique ID (uint64_t)
- Pointer to first outgoing edge
- Pointer to first incoming edge
- Pointer to first property
- Label(s)

Edge structure should contain:
- Unique ID
- Start node pointer
- End node pointer
- Relationship type
- Next edge from same start node
- Next edge to same end node
- Pointer to first property

Property structure (linked list):
- Key (string)
- Value (union of types: int, float, string, bool)
- Next property pointer
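
A minimal C sketch of these three structures (field and type names are illustrative, not prescribed):

#include <stdint.h>

typedef enum { PROP_INT, PROP_FLOAT, PROP_STRING, PROP_BOOL } PropType;

typedef struct Property {
    char *key;                     /* property name                          */
    PropType type;
    union { int64_t i; double f; char *s; int b; } value;
    struct Property *next;         /* next property in the chain             */
} Property;

typedef struct Edge Edge;

typedef struct Node {
    uint64_t id;
    char *label;                   /* single label, for simplicity           */
    Edge *first_out;               /* head of outgoing-edge list             */
    Edge *first_in;                /* head of incoming-edge list             */
    Property *props;
} Node;

struct Edge {
    uint64_t id;
    char *type;                    /* relationship type, e.g. "KNOWS"        */
    Node *start;
    Node *end;
    Edge *next_from_start;         /* next edge sharing the same start node  */
    Edge *next_to_end;             /* next edge sharing the same end node    */
    Property *props;
};

With this layout, iterating all outgoing edges of a node n is just: for (Edge *e = n->first_out; e; e = e->next_from_start).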

Think about these questions:

  • How do you iterate all edges FROM a node? (Follow first_outgoing, then next pointers)
  • How do you iterate all edges TO a node? (Follow first_incoming chain)
  • Why do edges need both “next from start” AND “next to end” pointers?
  • What happens to edges when you delete a node?

Learning milestones:

  1. You can add nodes and edges → You understand basic graph structure
  2. You can traverse neighbors → You understand adjacency representation
  3. You can handle properties → You understand the property graph model
  4. You handle memory correctly (no leaks) → You understand ownership in C

Project 2: Graph Traversal Algorithms (BFS, DFS, Shortest Path)

  • File: LEARN_GRAPH_DATABASES_DEEP_DIVE.md
  • Main Programming Language: C
  • Alternative Programming Languages: Rust, Go, Python
  • Coolness Level: Level 3: Genuinely Clever
  • Business Potential: 1. The “Resume Gold”
  • Difficulty: Level 2: Intermediate
  • Knowledge Area: Graph Algorithms
  • Software or Tool: Building: Algorithm Library
  • Main Book: “Algorithms, Fourth Edition” by Sedgewick & Wayne

What you’ll build: Implementation of core graph algorithms—BFS, DFS, Dijkstra’s shortest path, and connected components—that operate on your in-memory graph library.

Why it teaches graph databases: Graph databases are only useful because of graph algorithms. When a user asks “find shortest path from A to B” or “find all friends-of-friends,” the database executes these algorithms. Understanding them deeply is essential.

Core challenges you’ll face:

  • Implementing BFS with proper queue management → maps to level-by-level exploration
  • Managing visited state without modifying nodes → maps to external state tracking
  • Priority queue for Dijkstra → maps to weighted path optimization
  • Detecting cycles during DFS → maps to graph classification problems

Key Concepts:

  • BFS Algorithm: “Algorithms, Fourth Edition” Section 4.1 - Sedgewick & Wayne
  • DFS Algorithm: “Introduction to Algorithms” (CLRS) Chapter 22
  • Dijkstra’s Algorithm: “Algorithms, Fourth Edition” Section 4.4 - Sedgewick & Wayne
  • Priority Queues (Heaps): “Algorithms, Fourth Edition” Section 2.4 - Sedgewick & Wayne

Difficulty: Intermediate
Time estimate: 1 week
Prerequisites:

  • Project 1 completed (in-memory graph library)
  • Understanding of queues and stacks
  • Basic algorithm complexity analysis

Real world outcome:

$ ./graph_algo social_network.graph
Loading graph with 1000 nodes, 5000 edges...

BFS from node 1:
  Level 0: Alice
  Level 1: Bob, Carol, Dave
  Level 2: Eve, Frank, Grace...
  Visited 847 nodes in 6 levels

DFS from node 1:
  Path: Alice -> Bob -> Eve -> ... (depth 23)

Shortest path from Alice to Zack:
  Alice -[2]-> Bob -[3]-> Carol -[1]-> Zack
  Total weight: 6
  Path length: 3 edges

Connected components: 3
  Component 1: 847 nodes
  Component 2: 142 nodes
  Component 3: 11 nodes

Implementation Hints:

For BFS, you need an external data structure:

Queue: FIFO structure for nodes to visit
Visited set: Hash set or boolean array to track visited nodes
Parent map: To reconstruct the path afterward

The key insight: BFS visits nodes in order of distance from start. When you first reach a node, you’ve found the shortest path (in unweighted graphs).

For Dijkstra, the difference is:

  • Use a priority queue (min-heap) instead of a regular queue
  • Priority = cumulative distance from start
  • Can update priority when finding shorter path (relaxation)
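
For concreteness, here is a compact BFS sketch in C over the Project 1 structures (it assumes the Node/Edge fields from the earlier sketch and that node ids are dense in [0, node_count)):

#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

void bfs(Node *start, uint64_t node_count, void (*visit)(Node *node, int level)) {
    bool  *seen  = calloc(node_count, sizeof(bool));
    Node **queue = malloc(node_count * sizeof(Node *));
    int   *level = malloc(node_count * sizeof(int));
    size_t head = 0, tail = 0;

    queue[tail] = start; level[tail++] = 0;
    seen[start->id] = true;

    while (head < tail) {
        Node *n = queue[head];
        int   d = level[head++];
        visit(n, d);
        for (Edge *e = n->first_out; e; e = e->next_from_start) {
            Node *m = e->end;
            if (!seen[m->id]) {               /* first time we reach m:        */
                seen[m->id] = true;           /* this is its shortest distance */
                queue[tail] = m; level[tail++] = d + 1;
            }
        }
    }
    free(seen); free(queue); free(level);
}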

Ask yourself:

  • What’s the time complexity of BFS? (O(V + E))
  • Why doesn’t BFS work for weighted graphs?
  • What makes Dijkstra fail with negative edges?
  • How would you modify DFS to detect cycles?

Learning milestones:

  1. BFS finds shortest unweighted path → You understand level-order traversal
  2. DFS explores deeply and backtracks → You understand stack-based traversal
  3. Dijkstra handles weights → You understand greedy algorithms and priority queues
  4. You can find connected components → You understand graph structure analysis

Project 3: Graph Serialization & File Persistence

  • File: LEARN_GRAPH_DATABASES_DEEP_DIVE.md
  • Main Programming Language: C
  • Alternative Programming Languages: Rust, C++, Go
  • Coolness Level: Level 3: Genuinely Clever
  • Business Potential: 2. The “Micro-SaaS / Pro Tool”
  • Difficulty: Level 2: Intermediate
  • Knowledge Area: File I/O / Binary Formats / Persistence
  • Software or Tool: Building: Graph File Format
  • Main Book: “Computer Systems: A Programmer’s Perspective” by Bryant & O’Hallaron

What you’ll build: A binary file format for persisting graphs to disk, with save/load functionality. Your format will use fixed-size records like real graph databases do.

Why it teaches graph databases: Neo4j stores nodes as 9-byte fixed-size records. Understanding why fixed-size records matter (O(1) lookups via offset = id * record_size) is fundamental. This project bridges in-memory graphs to persistent storage.

Core challenges you’ll face:

  • Designing a binary file format → maps to understanding database storage layouts
  • Fixed-size records vs. variable-size data → maps to property storage strategies
  • Handling string storage separately → maps to string table/pool concepts
  • File offset arithmetic → maps to direct record access

Key Concepts:

  • Binary File I/O in C: “C Programming: A Modern Approach” Chapter 22 - K. N. King
  • Fixed-Size Record Design: “Database Internals” Chapter 3 - Alex Petrov
  • File Format Design: “Practical Binary Analysis” Chapter 2 - Dennis Andriesse
  • Endianness: “Computer Systems: A Programmer’s Perspective” Section 2.1 - Bryant & O’Hallaron

Difficulty: Intermediate
Time estimate: 1 week
Prerequisites:

  • Project 1 completed
  • Basic understanding of binary file I/O (fread, fwrite)
  • Understanding of struct packing and alignment

Real world outcome:

$ ./graph_save social.graph
Saving graph to social.graph.db
  Nodes: 1000 (9000 bytes)
  Edges: 5000 (50000 bytes)
  Properties: 8500 (variable, in properties.db)
  String pool: 45000 bytes
Total: 104000 bytes
Save complete.

$ ls -la *.db
-rw-r--r--  1 user  staff   9000 Dec 21 10:00 nodes.db
-rw-r--r--  1 user  staff  50000 Dec 21 10:00 rels.db
-rw-r--r--  1 user  staff  45000 Dec 21 10:00 strings.db

$ ./graph_load social.graph
Loading from social.graph.db...
Loaded 1000 nodes, 5000 edges in 12ms
Random lookup node 500: O(1), 0.001ms

Implementation Hints:

Design your file format like Neo4j does:

Node Record (9 bytes):
┌─────────┬──────────────┬──────────────┐
│ in_use  │ first_rel_id │ first_prop_id│
│ 1 byte  │   4 bytes    │    4 bytes   │
└─────────┴──────────────┴──────────────┘
(Labels are variable-length, so keep them out of the fixed record: store them
 via the property/string storage and reference them by ID.)

To find node N: seek to offset N * 9, read 9 bytes

Relationship Record (25 bytes):
┌─────────┬──────────┬──────────┬──────────┬─────────────┬─────────────┬──────────┐
│ in_use  │ start_id │  end_id  │ rel_type │ next_rel_s  │ next_rel_e  │ prop_id  │
│ 1 byte  │ 4 bytes  │ 4 bytes  │ 4 bytes  │  4 bytes    │   4 bytes   │ 4 bytes  │
└─────────┴──────────┴──────────┴──────────┴─────────────┴─────────────┴──────────┘

For variable-length data (strings), use a separate string pool file:

  • Store offset into string pool in the property record
  • String pool: length-prefixed strings back to back
  • Or: null-terminated strings with index file
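
A small C sketch of the fixed-record idea for the 9-byte node record above (#pragma pack keeps the struct exactly 9 bytes; names are illustrative):

#include <stdint.h>
#include <stdio.h>

#pragma pack(push, 1)
typedef struct {
    uint8_t  in_use;
    uint32_t first_rel_id;
    uint32_t first_prop_id;
} NodeRecord;
#pragma pack(pop)

/* O(1) lookup: seek to id * sizeof(NodeRecord) and read one record. */
int read_node_record(FILE *f, uint32_t id, NodeRecord *out) {
    if (fseek(f, (long)id * (long)sizeof(NodeRecord), SEEK_SET) != 0) return -1;
    return fread(out, sizeof(NodeRecord), 1, f) == 1 ? 0 : -1;
}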

Questions to consider:

  • Why not store strings inline in nodes/edges?
  • What happens when you delete a node? (Mark as unused, don’t shift)
  • How do you handle graph updates? (In-place for fixed fields, append for strings)
  • What about alignment and padding?

Learning milestones:

  1. You can save and load a graph → You understand file I/O
  2. Random access by ID is O(1) → You understand fixed-size record design
  3. Variable data is stored separately → You understand storage decoupling
  4. Format is portable across systems → You understand endianness and packing

Project 4: Memory-Mapped Graph Storage

  • File: LEARN_GRAPH_DATABASES_DEEP_DIVE.md
  • Main Programming Language: C
  • Alternative Programming Languages: Rust, C++
  • Coolness Level: Level 4: Hardcore Tech Flex
  • Business Potential: 2. The “Micro-SaaS / Pro Tool”
  • Difficulty: Level 3: Advanced
  • Knowledge Area: Operating Systems / Virtual Memory / mmap
  • Software or Tool: Building: mmap-based storage
  • Main Book: “The Linux Programming Interface” by Michael Kerrisk

What you’ll build: A graph storage engine that uses memory-mapped files (mmap), allowing the OS to handle paging and making your graph appear as a simple array in memory.

Why it teaches graph databases: Many databases (SQLite, LMDB) use mmap for I/O. It’s a powerful technique but comes with trade-offs (less control over flushing, OS-dependent behavior). Understanding when mmap works well vs. when explicit I/O is better is crucial database knowledge.

Core challenges you’ll face:

  • Using mmap() correctly → maps to understanding virtual memory
  • Handling file growth (mremap or remap) → maps to dynamic storage expansion
  • Ensuring data durability (msync) → maps to understanding persistence guarantees
  • Dealing with page faults → maps to understanding I/O cost

Key Concepts:

  • mmap System Call: “The Linux Programming Interface” Chapter 49 - Michael Kerrisk
  • Virtual Memory: “Computer Systems: A Programmer’s Perspective” Chapter 9 - Bryant & O’Hallaron
  • File-Backed Memory: “Advanced Programming in the UNIX Environment” Chapter 14 - Stevens & Rago
  • msync and Durability: SQLite mmap documentation

Difficulty: Advanced
Time estimate: 1-2 weeks
Prerequisites:

  • Project 3 completed (file-based persistence)
  • Understanding of virtual memory concepts
  • Familiarity with Unix system calls

Real world outcome:

$ ./mmap_graph create large.graph 10000000
Creating mmap-backed graph...
Mapped file: 1.2 GB
Node capacity: 10,000,000
Ready.

$ ./mmap_graph load large.graph
Loading via mmap...
Mapped in 0.8ms (just setting up mapping, no actual I/O!)

$ ./mmap_graph query large.graph
> get_node 5000000
Node 5000000 accessed in 0.02ms (first access, page fault)
> get_node 5000001
Node 5000001 accessed in 0.001ms (same page, cached)

> traverse 1 depth 3
Traversed 15,847 nodes in 45ms
Pages faulted: 234

Implementation Hints:

Basic mmap usage in C:

Open file with proper size
mmap() with MAP_SHARED for persistence
Cast pointer to your record array
Access records directly via pointer arithmetic
msync() to ensure writes reach disk
munmap() when done
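
Those steps in a hedged C sketch (error handling trimmed; nodes.db and the packed NodeRecord layout are assumptions carried over from the Project 3 sketch, and the file must already exist at its full size):

#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

#pragma pack(push, 1)
typedef struct { uint8_t in_use; uint32_t first_rel_id, first_prop_id; } NodeRecord;
#pragma pack(pop)

int main(void) {
    int fd = open("nodes.db", O_RDWR);
    if (fd < 0) { perror("open"); return 1; }

    struct stat st;
    fstat(fd, &st);

    void *base = mmap(NULL, st.st_size, PROT_READ | PROT_WRITE,
                      MAP_SHARED, fd, 0);           /* MAP_SHARED: writes go to file */
    if (base == MAP_FAILED) { perror("mmap"); return 1; }

    NodeRecord *nodes = (NodeRecord *)base;
    nodes[42].in_use = 1;                           /* plain array access; the OS    */
                                                    /* pages the record in on demand */
    msync(base, st.st_size, MS_SYNC);               /* force dirty pages to disk     */
    munmap(base, st.st_size);
    close(fd);
    return 0;
}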

The beauty of mmap: your nodes become a simple array access:

Node* nodes = (Node*)mmap(...);
Node n = nodes[node_id];  // OS handles loading from disk

But understand the dangers:

  • You can’t control when dirty pages are written
  • Page faults happen at unpredictable times
  • No control over eviction order
  • Signal handlers needed for SIGBUS

Questions to think about:

  • What happens if you access node 1, then node 1,000,000? (Two page faults)
  • How big is a page? (Usually 4KB) How many nodes fit in one page?
  • When should you call msync()? What flags (MS_SYNC vs MS_ASYNC)?
  • Why does SQLite make mmap optional?

Learning milestones:

  1. Graph loads “instantly” → You understand memory mapping
  2. You observe page faults during traversal → You understand demand paging
  3. msync ensures durability → You understand persistence guarantees
  4. You understand mmap’s limitations → You’re ready for buffer pools

Project 5: Buffer Pool Manager

  • File: LEARN_GRAPH_DATABASES_DEEP_DIVE.md
  • Main Programming Language: C
  • Alternative Programming Languages: Rust, C++
  • Coolness Level: Level 4: Hardcore Tech Flex
  • Business Potential: 3. The “Service & Support” Model
  • Difficulty: Level 4: Expert
  • Knowledge Area: Database Internals / Caching / I/O Management
  • Software or Tool: Building: Buffer Pool
  • Main Book: “Database Internals” by Alex Petrov

What you’ll build: A buffer pool manager that caches disk pages in memory, implementing LRU eviction, pin counting, and dirty page tracking—the heart of any serious database.

Why it teaches graph databases: The buffer pool is what separates toy databases from real ones. It gives you control over memory usage, I/O scheduling, and durability. Neo4j, SQLite, and every serious database has a buffer pool. This is where databases get fast.

Core challenges you’ll face:

  • Page caching with fixed memory → maps to memory management constraints
  • LRU eviction policy → maps to cache management strategies
  • Pin counting for safe page access → maps to concurrency-safe reference counting
  • Dirty page tracking and flushing → maps to write-back strategies

Key Concepts:

  • Buffer Pool Architecture: “Database Internals” Chapter 5 - Alex Petrov
  • LRU Cache Implementation: “Algorithms, Fourth Edition” Section 1.3 - Sedgewick & Wayne
  • Page Management: “Operating Systems: Three Easy Pieces” Chapter 22 - OSTEP
  • Cache-Conscious Design: “Computer Systems: A Programmer’s Perspective” Chapter 6 - Bryant & O’Hallaron

Difficulty: Expert
Time estimate: 2 weeks
Prerequisites:

  • Projects 3-4 completed
  • Understanding of cache concepts
  • Solid C programming (hash tables, linked lists)

Real world outcome:

$ ./buffered_graph --pool-size 1000 --page-size 4096
Buffer pool initialized: 1000 pages, ~4MB memory

> load graph.db
Graph loaded. Pages on disk: 50,000
Pages in buffer: 0 (cold start)

> traverse node 1 depth 5
Traversed 12,453 nodes
Buffer stats:
  Pages fetched from disk: 234
  Pages found in buffer: 11,219 (97.9% hit rate)
  Dirty pages: 0
  Pinned pages: 0

> add_edge 1 500 "KNOWS"
Edge added.
Buffer stats:
  Dirty pages: 2 (will be written on checkpoint)

> checkpoint
Flushing 2 dirty pages to disk...
Done.

Implementation Hints:

Core buffer pool components:

Page structure:
- page_id (which disk page this is)
- data[PAGE_SIZE] (the actual bytes)
- pin_count (number of active users)
- is_dirty (has been modified)
- prev/next (for LRU list)

Buffer Pool:
- pages[POOL_SIZE] (the page slots)
- page_table (hash map: disk_page_id -> buffer_slot)
- lru_head, lru_tail (doubly-linked list of unpinned pages)

Key operations:

  • fetch_page(page_id): Get page, loading from disk if needed
  • unpin_page(page_id, is_dirty): Release page, mark dirty if modified
  • flush_page(page_id): Write dirty page to disk
  • evict(): Find LRU unpinned page, write if dirty, reclaim slot

The pin count is crucial: you can’t evict a page while someone is using it!
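
A deliberately tiny, self-contained sketch of fetch/unpin with LRU-by-counter eviction (a real pool would use a hash table plus a doubly-linked LRU list instead of these linear scans; the struct is assumed zero-initialized):

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define PAGE_SIZE 4096
#define POOL_SIZE 8                 /* tiny pool, just for illustration          */

typedef struct {
    uint64_t page_id;
    bool     in_use;                /* slot holds a valid page                   */
    bool     is_dirty;              /* modified since it was read from disk      */
    int      pin_count;             /* active users; only 0 is evictable         */
    uint64_t last_used;             /* crude LRU via a global access counter     */
    char     data[PAGE_SIZE];
} Page;

typedef struct {
    Page     pages[POOL_SIZE];
    uint64_t clock;                 /* incremented on every access               */
    FILE    *disk;                  /* backing file, one 4KB page per slot       */
} BufferPool;

static void disk_io(FILE *f, uint64_t page_id, char *buf, bool write) {
    fseek(f, (long)(page_id * PAGE_SIZE), SEEK_SET);
    if (write) fwrite(buf, PAGE_SIZE, 1, f);
    else       fread(buf, PAGE_SIZE, 1, f);
}

/* Pin and return the requested page, reading it from disk on a miss. */
Page *fetch_page(BufferPool *bp, uint64_t page_id) {
    Page *victim = NULL;
    for (int i = 0; i < POOL_SIZE; i++) {
        Page *p = &bp->pages[i];
        if (p->in_use && p->page_id == page_id) {          /* cache hit            */
            p->pin_count++;
            p->last_used = ++bp->clock;
            return p;
        }
        if (p->pin_count == 0 &&                           /* candidate victim:    */
            (victim == NULL || p->last_used < victim->last_used))
            victim = p;                                    /* least recently used  */
    }
    if (victim == NULL) return NULL;                       /* every slot is pinned */
    if (victim->in_use && victim->is_dirty)                /* write back before    */
        disk_io(bp->disk, victim->page_id, victim->data, true);   /* reusing slot  */
    disk_io(bp->disk, page_id, victim->data, false);
    victim->page_id   = page_id;
    victim->in_use    = true;
    victim->is_dirty  = false;
    victim->pin_count = 1;
    victim->last_used = ++bp->clock;
    return victim;
}

/* Release a page; mark it dirty if the caller modified it. */
void unpin_page(Page *p, bool dirty) {
    p->is_dirty = p->is_dirty || dirty;
    if (p->pin_count > 0) p->pin_count--;
}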

Questions to explore:

  • What happens if all pages are pinned? (Can’t evict, must wait or error)
  • What’s the difference between LRU and LRU-K?
  • How does dirty page batching improve performance?
  • What’s a checkpoint and why is it important?

Learning milestones:

  1. Pages are cached in memory → You understand caching basics
  2. LRU eviction works correctly → You understand eviction policies
  3. Pin counts prevent unsafe eviction → You understand concurrency safety
  4. Dirty pages are tracked and flushed → You understand write-back caching

Project 6: B+Tree Index for Node Lookups

  • File: LEARN_GRAPH_DATABASES_DEEP_DIVE.md
  • Main Programming Language: C
  • Alternative Programming Languages: Rust, C++, Go
  • Coolness Level: Level 4: Hardcore Tech Flex
  • Business Potential: 3. The “Service & Support” Model
  • Difficulty: Level 4: Expert
  • Knowledge Area: Data Structures / Indexing / Databases
  • Software or Tool: Building: B+Tree Index
  • Main Book: “Database Internals” by Alex Petrov

What you’ll build: A disk-based B+Tree index that allows fast lookups of nodes by property values (e.g., “find all nodes where name = ‘Alice’”).

Why it teaches graph databases: While index-free adjacency makes traversals fast, finding your starting nodes still requires indexes. “Find the person named Alice” needs a B+Tree lookup before you can start traversing. Every graph database supports property indexes.

Core challenges you’ll face:

  • Implementing B+Tree insert with splits → maps to balanced tree algorithms
  • Disk-based nodes with fixed-size pages → maps to database page design
  • Range scans via leaf traversal → maps to understanding B+Tree structure
  • Handling variable-length keys → maps to key encoding strategies

Key Concepts:

  • B+Tree Structure: “Database Internals” Chapter 4 - Alex Petrov
  • B-Tree Algorithms: “Introduction to Algorithms” (CLRS) Chapter 18
  • Disk-Based Trees: Let’s Build a Simple Database - Part 7
  • Key Encoding: “Designing Data-Intensive Applications” Chapter 3 - Martin Kleppmann

Difficulty: Expert
Time estimate: 2-3 weeks
Prerequisites:

  • Project 5 completed (buffer pool)
  • Understanding of binary search trees
  • Strong grasp of recursion

Real world outcome:

$ ./btree_index create name_index.db
B+Tree index created. Order: 100 (fits 4KB page)

$ ./btree_index bulk_load nodes.csv
Loading 1,000,000 entries...
Tree height: 3
Internal nodes: 127
Leaf nodes: 10,214
Load time: 8.3 seconds

$ ./btree_index lookup name_index.db "Alice"
Found key "Alice" at node_id: 42
Lookup time: 0.02ms (3 page reads)

$ ./btree_index range name_index.db "A" "B"
Range scan A-B:
  Aaron -> node 15
  Abel -> node 892
  ...
  Beatrice -> node 4521
Found 3,847 entries in 12ms

Implementation Hints:

B+Tree structure:

Internal Node:
┌────────┬────────┬────────┬────────┬────────┐
│  key1  │  key2  │  key3  │  ...   │  keyN  │
└────────┴────────┴────────┴────────┴────────┘
    │        │        │                 │
   ptr0     ptr1     ptr2             ptrN
    ↓        ↓        ↓                 ↓
  child0  child1   child2           childN

Leaf Node (also linked for range scans):
┌────────────┬────────────┬────────────┐
│ key1:val1  │ key2:val2  │ key3:val3  │ → next_leaf
└────────────┴────────────┴────────────┘

For graph DB: key = property value, val = node_id

Insert algorithm outline:

  1. Search to find correct leaf
  2. If leaf has space, insert key-value
  3. If leaf is full, split: create new leaf, distribute keys
  4. Propagate split upward: parent gets new key, may also split
  5. If root splits, create new root
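
A C sketch of the lookup path only, with keys as property values and leaf values as node ids (in-memory for clarity; in the disk-based version each BTNode is a page and the child pointers become page ids resolved through your buffer pool):

#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define ORDER 100                      /* max keys per node (fits a 4KB page)  */

typedef struct BTNode {
    bool     is_leaf;
    int      nkeys;
    char    *keys[ORDER];              /* sorted property values               */
    union {
        struct BTNode *children[ORDER + 1];  /* internal node: child pointers  */
        uint64_t       node_ids[ORDER];      /* leaf: graph node ids           */
    } u;
    struct BTNode *next_leaf;          /* leaf chain for range scans           */
} BTNode;

/* Descend from the root to the leaf that must contain `key`, then scan it.
 * Returns true and fills *out if the key is present. */
bool btree_lookup(BTNode *root, const char *key, uint64_t *out) {
    BTNode *n = root;
    while (!n->is_leaf) {
        int i = 0;
        while (i < n->nkeys && strcmp(key, n->keys[i]) >= 0) i++;
        n = n->u.children[i];          /* keys[i-1] <= key < keys[i]           */
    }
    for (int i = 0; i < n->nkeys; i++)
        if (strcmp(key, n->keys[i]) == 0) { *out = n->u.node_ids[i]; return true; }
    return false;
}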

Key questions:

  • Why B+Tree and not B-Tree? (All values in leaves, better for range scans)
  • What determines the “order” of the tree? (How many keys fit in a page)
  • How do you handle duplicate keys? (Allow duplicates in leaves, or use lists)
  • What happens during concurrent access? (Need latching/locking—future project!)

Learning milestones:

  1. Lookups are O(log n) → You understand balanced tree lookups
  2. Inserts handle splits correctly → You understand tree rebalancing
  3. Range scans follow leaf links → You understand B+Tree leaf chain
  4. Tree survives restart → You’ve integrated with buffer pool

Project 7: Simple Query Language Parser

  • File: LEARN_GRAPH_DATABASES_DEEP_DIVE.md
  • Main Programming Language: C
  • Alternative Programming Languages: Rust, Go, Python
  • Coolness Level: Level 4: Hardcore Tech Flex
  • Business Potential: 3. The “Service & Support” Model
  • Difficulty: Level 3: Advanced
  • Knowledge Area: Parsing / Compilers / Language Design
  • Software or Tool: Building: Query Parser
  • Main Book: “Writing a C Compiler” by Nora Sandler

What you’ll build: A parser for a simplified graph query language inspired by Cypher. Users can write queries like MATCH (a:Person)-[:KNOWS]->(b) WHERE a.name = "Alice" RETURN b.name.

Why it teaches graph databases: Cypher is to graph databases what SQL is to relational databases. Building a parser teaches you how query languages work, how patterns are represented internally, and prepares you for query execution.

Core challenges you’ll face:

  • Tokenizing the query string → maps to lexical analysis
  • Parsing graph patterns → maps to recursive descent parsing
  • Building an AST → maps to abstract syntax tree design
  • Handling property predicates → maps to expression parsing

Key Concepts:

  • Lexical Analysis: “Writing a C Compiler” Chapter 1 - Nora Sandler
  • Recursive Descent Parsing: “Language Implementation Patterns” Chapter 2 - Terence Parr
  • AST Design: “Compilers: Principles and Practice” Chapter 5 - Dave & Dave
  • Cypher Syntax: openCypher Specification

Difficulty: Advanced
Time estimate: 2 weeks
Prerequisites:

  • Basic understanding of parsing concepts
  • String manipulation in C
  • Familiarity with tree data structures

Real world outcome:

$ ./query_parser
graph> MATCH (a:Person)-[:KNOWS]->(b:Person) WHERE a.name = "Alice" RETURN b

Tokens:
  MATCH, LPAREN, IDENT(a), COLON, IDENT(Person), RPAREN,
  DASH, LBRACKET, COLON, IDENT(KNOWS), RBRACKET, ARROW,
  LPAREN, IDENT(b), COLON, IDENT(Person), RPAREN,
  WHERE, IDENT(a), DOT, IDENT(name), EQ, STRING("Alice"),
  RETURN, IDENT(b)

AST:
  MatchClause
  ├── Pattern
  │   ├── Node(var=a, label=Person)
  │   ├── Edge(type=KNOWS, direction=OUT)
  │   └── Node(var=b, label=Person)
  ├── Where
  │   └── Equals
  │       ├── PropertyAccess(a, name)
  │       └── StringLiteral("Alice")
  └── Return
      └── Variable(b)

Parse successful!

Implementation Hints:

Define your grammar (simplified):

query       := match_clause where_clause? return_clause
match_clause := "MATCH" pattern
pattern     := node_pattern (edge_pattern node_pattern)*
node_pattern := "(" variable? (":" label)? properties? ")"
edge_pattern := ("-[" edge_type "]->") | ("<-[" edge_type "]-")
where_clause := "WHERE" expression
return_clause := "RETURN" expression_list

Lexer produces tokens:

  • Keywords: MATCH, WHERE, RETURN, AND, OR
  • Symbols: (, ), [, ], -, ->, <-, :, ., =, <>
  • Literals: strings, numbers
  • Identifiers: variable and label names

Parser builds AST nodes:

MatchClause { patterns, where, return }
Pattern { nodes[], edges[] }
NodePattern { variable, label, properties }
EdgePattern { type, direction, variable }
WhereExpr { left, op, right }
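
A hand-rolled lexer can stay small. This sketch covers only the tokens listed above (no '<-' arrow, no numbers) and uses fixed-size token text for brevity:

#include <ctype.h>
#include <string.h>

typedef enum {
    TOK_MATCH, TOK_WHERE, TOK_RETURN, TOK_IDENT, TOK_STRING,
    TOK_LPAREN, TOK_RPAREN, TOK_LBRACKET, TOK_RBRACKET,
    TOK_COLON, TOK_DOT, TOK_EQ, TOK_DASH, TOK_ARROW, TOK_EOF
} TokenType;

typedef struct {
    TokenType type;
    char      text[64];                 /* identifier or string contents       */
} Token;

/* Return the next token starting at *src and advance the cursor. */
Token next_token(const char **src) {
    const char *s = *src;
    while (isspace((unsigned char)*s)) s++;
    Token t = { TOK_EOF, "" };
    if (*s == '\0')                     { *src = s; return t; }
    if      (*s == '(')                 { t.type = TOK_LPAREN;   s++; }
    else if (*s == ')')                 { t.type = TOK_RPAREN;   s++; }
    else if (*s == '[')                 { t.type = TOK_LBRACKET; s++; }
    else if (*s == ']')                 { t.type = TOK_RBRACKET; s++; }
    else if (*s == ':')                 { t.type = TOK_COLON;    s++; }
    else if (*s == '.')                 { t.type = TOK_DOT;      s++; }
    else if (*s == '=')                 { t.type = TOK_EQ;       s++; }
    else if (*s == '-' && s[1] == '>')  { t.type = TOK_ARROW;    s += 2; }
    else if (*s == '-')                 { t.type = TOK_DASH;     s++; }
    else if (*s == '"') {                              /* string literal        */
        s++; int i = 0;
        while (*s && *s != '"' && i < 63) t.text[i++] = *s++;
        if (*s == '"') s++;
        t.text[i] = '\0'; t.type = TOK_STRING;
    } else if (isalpha((unsigned char)*s)) {           /* keyword or identifier */
        int i = 0;
        while ((isalnum((unsigned char)*s) || *s == '_') && i < 63) t.text[i++] = *s++;
        t.text[i] = '\0';
        if      (strcmp(t.text, "MATCH")  == 0) t.type = TOK_MATCH;
        else if (strcmp(t.text, "WHERE")  == 0) t.type = TOK_WHERE;
        else if (strcmp(t.text, "RETURN") == 0) t.type = TOK_RETURN;
        else                                    t.type = TOK_IDENT;
    } else {
        s++;                            /* unrecognized character: skipped here */
    }
    *src = s;
    return t;
}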

Think about:

  • How do you handle syntax errors gracefully?
  • What does your AST look like for (a)-[:KNOWS]->(b)-[:LIKES]->(c)?
  • How would you add support for optional patterns OPTIONAL MATCH?

Learning milestones:

  1. Lexer produces correct tokens → You understand tokenization
  2. Parser builds valid AST → You understand grammar and parsing
  3. Error messages are helpful → You understand error recovery
  4. AST can represent complex patterns → You’re ready for query execution

Project 8: Query Execution Engine

  • File: LEARN_GRAPH_DATABASES_DEEP_DIVE.md
  • Main Programming Language: C
  • Alternative Programming Languages: Rust, Go, C++
  • Coolness Level: Level 4: Hardcore Tech Flex
  • Business Potential: 3. The “Service & Support” Model
  • Difficulty: Level 4: Expert
  • Knowledge Area: Query Processing / Execution Engines
  • Software or Tool: Building: Query Executor
  • Main Book: “Database Internals” by Alex Petrov

What you’ll build: An execution engine that takes parsed ASTs and executes them against your graph storage, returning results. This includes pattern matching, filtering, and projecting results.

Why it teaches graph databases: This is where everything comes together. Your parser creates plans, your storage provides data, and the execution engine connects them. Understanding how pattern matching translates to traversal operations is core graph database knowledge.

Core challenges you’ll face:

  • Pattern matching algorithm → maps to subgraph isomorphism
  • Variable binding during traversal → maps to execution context management
  • Backtracking for complex patterns → maps to search space exploration
  • Integrating with storage layer → maps to I/O-efficient execution

Key Concepts:

  • Query Execution Models: “Database Internals” Chapter 8 - Alex Petrov
  • Pattern Matching: Cypher: An Evolving Query Language for Property Graphs
  • Iterator Model: “Architecture of a Database System” Section 4.2 - Hellerstein et al.
  • Backtracking Algorithms: “Algorithms, Fourth Edition” Section 6.5 - Sedgewick & Wayne

Difficulty: Expert
Time estimate: 2-3 weeks
Prerequisites:

  • Project 7 completed (parser)
  • Projects 1-5 completed (storage layer)
  • Understanding of iterator patterns

Real world outcome:

$ ./graph_db my_graph.db
Connected to graph: 100,000 nodes, 500,000 edges

graph> MATCH (a:Person {name: "Alice"})-[:KNOWS]->(b:Person) RETURN b.name, b.age

Execution Plan:
  1. IndexScan(Person, name="Alice") -> a
  2. ExpandOut(a, KNOWS) -> b
  3. Filter(b.label = Person)
  4. Project(b.name, b.age)

Executing...

| b.name    | b.age |
|-----------|-------|
| Bob       | 28    |
| Carol     | 32    |
| Dave      | 25    |

3 rows returned in 2.3ms
Nodes scanned: 1 (via index)
Edges traversed: 47

Implementation Hints:

Execution uses an iterator/operator model:

Operator interface:
  init()    - Prepare for execution
  next()    - Get next result tuple
  close()   - Cleanup

Common operators:
  IndexScan - Find starting nodes via index
  AllNodesScan - Iterate all nodes (expensive!)
  Expand - Follow edges from current node
  Filter - Check predicate, pass or skip
  Project - Select which fields to output

Pattern matching approach:

For pattern (a)-[:KNOWS]->(b)-[:WORKS_AT]->(c):
1. Find all starting points for 'a'
2. For each 'a', expand KNOWS edges to find 'b' candidates
3. For each 'b', expand WORKS_AT edges to find 'c' candidates
4. Each complete (a,b,c) tuple is a result
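
A C sketch of the iterator (Volcano-style) interface; the Row shape and the 8-binding limit are arbitrary simplifications:

#include <stdbool.h>
#include <stdint.h>

/* One row flowing between operators: the ids bound to pattern variables. */
typedef struct {
    uint64_t bindings[8];               /* node/edge ids bound to a, b, c, ... */
} Row;

/* Every operator exposes the same three calls and pulls from its child. */
typedef struct Operator {
    void (*init) (struct Operator *self);
    bool (*next) (struct Operator *self, Row *out);   /* false = exhausted     */
    void (*close)(struct Operator *self);
    struct Operator *child;             /* upstream operator (NULL for scans)  */
    void *state;                        /* operator-specific state             */
} Operator;

/* Pull rows through the whole plan: the root repeatedly asks its child chain. */
static long run_plan(Operator *root, void (*emit)(const Row *row)) {
    long count = 0;
    Row row;
    root->init(root);
    while (root->next(root, &row)) { emit(&row); count++; }
    root->close(root);
    return count;
}

IndexScan, Expand, Filter, and Project each implement this interface; a plan is just a chain of Operator structs with the scan at the bottom.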

Think about:

  • What if pattern has no indexed start point? (Full scan, then filter)
  • How do you handle cycles in the pattern?
  • What’s the difference between DFS and BFS execution order?
  • How would you add aggregation (COUNT, SUM)?

Learning milestones:

  1. Simple patterns execute → You understand basic pattern matching
  2. Filters work correctly → You understand predicate evaluation
  3. Multi-hop patterns work → You understand traversal chaining
  4. Query uses index when available → You understand query optimization basics

Project 9: Write-Ahead Logging (WAL)

  • File: LEARN_GRAPH_DATABASES_DEEP_DIVE.md
  • Main Programming Language: C
  • Alternative Programming Languages: Rust, C++
  • Coolness Level: Level 4: Hardcore Tech Flex
  • Business Potential: 3. The “Service & Support” Model
  • Difficulty: Level 4: Expert
  • Knowledge Area: ACID / Durability / Recovery
  • Software or Tool: Building: WAL System
  • Main Book: “Database Internals” by Alex Petrov

What you’ll build: A Write-Ahead Log (WAL) that records all modifications before they’re applied to data files, enabling crash recovery and durability guarantees.

Why it teaches graph databases: Durability (the “D” in ACID) requires WAL. Without it, a crash can corrupt your database. Understanding the “log first, apply later” principle is fundamental to all database systems.

Core challenges you’ll face:

  • Append-only log format design → maps to log record structure
  • Ensuring log is durable before commit → maps to fsync semantics
  • Crash recovery algorithm → maps to redo/undo processing
  • Log truncation and checkpointing → maps to log space management

Key Concepts:

  • Write-Ahead Logging: “Database Internals” Chapter 6 - Alex Petrov
  • Crash Recovery: “Database Internals” Chapter 13 - Alex Petrov
  • ARIES Algorithm: Architecture of a Database System Section 6
  • fsync Semantics: “The Linux Programming Interface” Chapter 13 - Michael Kerrisk

Difficulty: Expert
Time estimate: 2 weeks
Prerequisites:

  • Project 5 completed (buffer pool with dirty tracking)
  • Understanding of file system durability
  • Solid C programming

Real world outcome:

$ ./graph_db_wal my_graph.db
WAL enabled: wal.log

graph> BEGIN
graph> CREATE (:Person {name: "Eve"})
Node 1001 created (WAL: LSN 5001)
graph> CREATE (a)-[:KNOWS]->(b)
Edge 5001 created (WAL: LSN 5002)
graph> COMMIT
Transaction committed (WAL synced to LSN 5002)

$ kill -9 <pid>  # Simulate crash

$ ./graph_db_wal my_graph.db
Recovering from WAL...
  Last checkpoint: LSN 5000
  WAL contains records up to: LSN 5002
  Redoing 2 log records...
  Recovery complete: 2 operations replayed
Database ready.

graph> MATCH (p:Person {name: "Eve"}) RETURN p
| p                        |
|--------------------------|
| (:Person {name: "Eve"})  |

Eve survived the crash!

Implementation Hints:

WAL record format:

┌─────────────────────────────────────────────────────┐
│ LSN (8 bytes) - Log Sequence Number                 │
│ TxnID (8 bytes) - Transaction ID                    │
│ Type (1 byte) - INSERT/UPDATE/DELETE/COMMIT/ABORT   │
│ Length (4 bytes) - Length of data                   │
│ Data (variable) - The actual change                 │
│ Checksum (4 bytes) - CRC32 of record                │
└─────────────────────────────────────────────────────┘

Core operations:

wal_append(type, data) -> LSN
  - Write record to log buffer
  - Return the LSN for this record

wal_flush(lsn)
  - Ensure all records up to LSN are on disk
  - Call fsync() on log file

recover()
  - Read log from last checkpoint
  - For each record, redo the operation
  - Handle incomplete transactions

Key insight: You MUST write to the log and fsync BEFORE telling the user “commit successful”. Otherwise a crash after “success” but before the data is on disk means data loss.
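
A sketch of append and flush in C (the packed header mirrors the record format above; the CRC32 is left as a stub, and next_lsn would really be restored from the log on startup):

#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

#pragma pack(push, 1)
typedef struct {
    uint64_t lsn;
    uint64_t txn_id;
    uint8_t  type;               /* INSERT / UPDATE / DELETE / COMMIT / ABORT   */
    uint32_t length;             /* length of the payload that follows          */
} WalHeader;
#pragma pack(pop)

static uint64_t next_lsn = 1;

/* Append one record (header + payload + checksum); returns its LSN.
 * The bytes sit in the stdio buffer until wal_flush() is called. */
uint64_t wal_append(FILE *log, uint64_t txn_id, uint8_t type,
                    const void *data, uint32_t len) {
    WalHeader h = { next_lsn++, txn_id, type, len };
    uint32_t checksum = 0;       /* stub: compute CRC32 over header + payload   */
    fwrite(&h, sizeof h, 1, log);
    fwrite(data, len, 1, log);
    fwrite(&checksum, sizeof checksum, 1, log);
    return h.lsn;
}

/* Durability point: COMMIT must not be acknowledged before this returns 0. */
int wal_flush(FILE *log) {
    if (fflush(log) != 0) return -1;     /* stdio buffer -> kernel page cache   */
    return fsync(fileno(log));           /* kernel page cache -> disk           */
}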

Questions to consider:

  • Why append-only? (Sequential writes are fast, no in-place updates)
  • What’s a checkpoint? (A snapshot of which LSN has been applied to data files)
  • When can you delete old log records? (After checkpoint covers them)
  • What about ABORT? (Either don’t apply to data files, or write UNDO records)

Learning milestones:

  1. Operations are logged before applying → You understand WAL principle
  2. Recovery replays log correctly → You understand crash recovery
  3. Database survives kill -9 → You’ve achieved durability
  4. Checkpoints limit recovery time → You understand log management

Project 10: Transaction Manager with MVCC

  • File: LEARN_GRAPH_DATABASES_DEEP_DIVE.md
  • Main Programming Language: C
  • Alternative Programming Languages: Rust, C++
  • Coolness Level: Level 5: Pure Magic
  • Business Potential: 4. The “Open Core” Infrastructure
  • Difficulty: Level 5: Master
  • Knowledge Area: Concurrency / ACID / Transactions
  • Software or Tool: Building: MVCC Transaction Manager
  • Main Book: “Designing Data-Intensive Applications” by Martin Kleppmann

What you’ll build: A transaction manager implementing Multi-Version Concurrency Control (MVCC), allowing concurrent readers and writers without blocking each other.

Why it teaches graph databases: MVCC is how modern databases achieve high concurrency. Readers see consistent snapshots while writers modify data. This is the “I” (Isolation) in ACID. Neo4j, PostgreSQL, and most modern databases use MVCC.

Core challenges you’ll face:

  • Version chain management → maps to row versioning
  • Snapshot visibility rules → maps to transaction isolation
  • Garbage collection of old versions → maps to version cleanup
  • Write-write conflict detection → maps to transaction validation

Key Concepts:

  • MVCC Fundamentals: “Designing Data-Intensive Applications” Chapter 7 - Martin Kleppmann
  • Snapshot Isolation: “Database Internals” Chapter 12 - Alex Petrov
  • Version Chains: DuckDB Transaction Paper
  • Transaction IDs: “Database Internals” Chapter 11 - Alex Petrov

Difficulty: Master
Time estimate: 3-4 weeks
Prerequisites:

  • Project 9 completed (WAL)
  • Strong understanding of concurrency
  • Experience with thread-safe programming

Real world outcome:

$ ./graph_db_mvcc my_graph.db &
Graph database started with MVCC enabled.

# Terminal 1 (Reader)
graph> BEGIN READ ONLY
graph> MATCH (p:Person) RETURN count(p)
| count(p) |
|----------|
| 1000     |
Started long report...

# Terminal 2 (Writer - concurrent)
graph> BEGIN
graph> CREATE (:Person {name: "NewPerson"})
graph> COMMIT
Committed!

# Terminal 1 (Reader - still sees old snapshot!)
graph> MATCH (p:Person) RETURN count(p)
| count(p) |
|----------|
| 1000     |  <- Still 1000! Snapshot isolation working.
graph> COMMIT
graph> MATCH (p:Person) RETURN count(p)
| count(p) |
|----------|
| 1001     |  <- New transaction sees update.

Implementation Hints:

MVCC version chain:

Each node/edge has a version chain:
┌──────────────────┐    ┌──────────────────┐    ┌──────────────────┐
│ Version 3        │───▶│ Version 2        │───▶│ Version 1        │
│ txn_id: 100      │    │ txn_id: 50       │    │ txn_id: 10       │
│ data: {age: 31}  │    │ data: {age: 30}  │    │ data: {age: 29}  │
│ begin_ts: 99     │    │ begin_ts: 49     │    │ begin_ts: 9      │
│ end_ts: ∞        │    │ end_ts: 99       │    │ end_ts: 49       │
└──────────────────┘    └──────────────────┘    └──────────────────┘

Visibility rules:

A version V is visible to transaction T if:
  - V.begin_ts <= T.snapshot_ts
  - V.end_ts > T.snapshot_ts
  - V.txn_id is committed (not in active_txns set)
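
The visibility rules translate almost directly into code. In this sketch, txn_committed_before() is a hypothetical helper standing in for the active-transaction-set check:

#include <stdbool.h>
#include <stdint.h>

#define TS_INFINITY UINT64_MAX

typedef struct Version {
    uint64_t txn_id;             /* transaction that wrote this version             */
    uint64_t begin_ts;           /* commit timestamp of that writer                 */
    uint64_t end_ts;             /* commit timestamp of the overwriter, or infinity */
    void    *data;               /* the node/edge payload for this version          */
    struct Version *older;       /* next (older) version in the chain               */
} Version;

typedef struct {
    uint64_t snapshot_ts;        /* assigned at BEGIN                               */
    /* plus: own txn id, set of txn ids active at BEGIN, ...                        */
} Txn;

/* Hypothetical helper: true if the writing transaction had committed before
 * this transaction took its snapshot (i.e. it is not in the active set).    */
bool txn_committed_before(const Txn *t, uint64_t writer_txn_id);

/* Walk the chain newest-to-oldest; return the version this transaction may
 * see, or NULL if nothing is visible. */
Version *visible_version(const Txn *t, Version *chain) {
    for (Version *v = chain; v != NULL; v = v->older) {
        if (v->begin_ts <= t->snapshot_ts &&
            v->end_ts   >  t->snapshot_ts &&
            txn_committed_before(t, v->txn_id))
            return v;
    }
    return NULL;
}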

Key components:

  • Transaction ID generator (monotonic)
  • Active transaction table
  • Version chain per record
  • Snapshot timestamp assignment at BEGIN
  • Commit/abort tracking

Think about:

  • What happens if two transactions try to update the same node?
  • How do you garbage collect old versions no one can see?
  • What’s the difference between snapshot isolation and serializable?
  • How does this interact with your WAL?

Learning milestones:

  1. Readers don’t block writers → You understand MVCC basics
  2. Snapshots are consistent → You understand isolation levels
  3. Write conflicts are detected → You understand transaction validation
  4. Old versions are cleaned up → You understand garbage collection

Project 11: Graph-Specific Algorithms (PageRank, Community Detection)

  • File: LEARN_GRAPH_DATABASES_DEEP_DIVE.md
  • Main Programming Language: C
  • Alternative Programming Languages: Rust, Python, Go
  • Coolness Level: Level 4: Hardcore Tech Flex
  • Business Potential: 3. The “Service & Support” Model
  • Difficulty: Level 3: Advanced
  • Knowledge Area: Graph Analytics / Linear Algebra
  • Software or Tool: Building: Graph Algorithm Library
  • Main Book: “Graph Algorithms” by Mark Needham & Amy Hodler

What you’ll build: Implementation of classic graph analytics algorithms—PageRank (for importance), Label Propagation (for community detection), and Betweenness Centrality (for bridge identification).

Why it teaches graph databases: Neo4j’s Graph Data Science library exists because these algorithms are why people use graph databases. Fraud detection, recommendation systems, and social network analysis all rely on these algorithms. This is where graph databases shine.

Core challenges you’ll face:

  • PageRank iteration until convergence → maps to power iteration method
  • Sparse matrix representation → maps to efficient graph representation
  • Community detection heuristics → maps to graph clustering
  • Betweenness via shortest paths → maps to centrality measures

Key Concepts:

  • PageRank Algorithm: “Graph Algorithms” Chapter 5 - Needham & Hodler
  • Community Detection: “Graph Algorithms” Chapter 6 - Needham & Hodler
  • Centrality Measures: “Graph Algorithms” Chapter 4 - Needham & Hodler
  • Power Iteration: “Introduction to Linear Algebra” Chapter 6 - Gilbert Strang

Difficulty: Advanced
Time estimate: 2 weeks
Prerequisites:

  • Project 2 completed (basic graph algorithms)
  • Basic linear algebra understanding
  • Familiarity with iterative algorithms

Real world outcome:

$ ./graph_analytics social.graph

PageRank (damping=0.85, iterations=20):
Top 10 influential users:
  1. user_42     score: 0.0847
  2. user_17     score: 0.0721
  3. user_891    score: 0.0698
  ...
Converged in 12 iterations.

Community Detection (Label Propagation):
Found 47 communities
  Community 1: 2,341 members (tech enthusiasts)
  Community 2: 1,892 members (sports fans)
  Community 3: 1,456 members (music lovers)
  ...
Modularity score: 0.73

Betweenness Centrality:
Top 5 bridge nodes:
  1. user_103    betweenness: 0.234 (connects communities 1-2)
  2. user_567    betweenness: 0.198 (connects communities 2-3)
  ...

Implementation Hints:

PageRank algorithm:

1. Initialize all nodes with rank 1/N
2. For each iteration:
   For each node v:
     new_rank[v] = (1-d)/N + d * Σ(rank[u]/out_degree[u])
                   for all u that link to v
3. Repeat until ranks converge (change < epsilon)

d = damping factor (typically 0.85)
N = total number of nodes
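
A compact C version of this loop over an edge list (assumes dense node ids in [0, n); dangling nodes simply leak their rank mass here, which a fuller version would redistribute):

#include <math.h>
#include <stdlib.h>

void pagerank(int n, int m, const int *src, const int *dst,
              double d, double eps, double *rank) {
    double *out_deg = calloc((size_t)n, sizeof(double));
    double *next    = malloc((size_t)n * sizeof(double));

    for (int i = 0; i < m; i++) out_deg[src[i]] += 1.0;
    for (int v = 0; v < n; v++) rank[v] = 1.0 / n;

    for (int iter = 0; iter < 100; iter++) {
        for (int v = 0; v < n; v++) next[v] = (1.0 - d) / n;
        for (int i = 0; i < m; i++)                   /* each node hands its rank   */
            next[dst[i]] += d * rank[src[i]] / out_deg[src[i]];  /* to its targets  */
        double delta = 0.0;
        for (int v = 0; v < n; v++) { delta += fabs(next[v] - rank[v]); rank[v] = next[v]; }
        if (delta < eps) break;                       /* ranks stopped changing     */
    }
    free(out_deg); free(next);
}

Call it with d = 0.85 and eps around 1e-6; rank must have room for n doubles.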

Label Propagation (simple community detection):

1. Assign each node a unique label
2. For each iteration:
   For each node (in random order):
     Set node's label to most frequent label among neighbors
     (break ties randomly)
3. Repeat until labels stabilize
4. Nodes with same label = same community

Questions to explore:

  • Why does PageRank converge? (It’s power iteration: the ranks converge to the dominant eigenvector of the link matrix)
  • What’s the time complexity of PageRank? (O(iterations * edges))
  • How would you parallelize these algorithms?
  • How does Neo4j’s GDS library optimize these?

Learning milestones:

  1. PageRank converges and ranks make sense → You understand iterative algorithms
  2. Communities are meaningful → You understand graph clustering
  3. Centrality identifies key nodes → You understand network analysis
  4. Algorithms scale to large graphs → You understand performance optimization

Project 12: Distributed Graph Partitioning

  • File: LEARN_GRAPH_DATABASES_DEEP_DIVE.md
  • Main Programming Language: C
  • Alternative Programming Languages: Rust, Go, C++
  • Coolness Level: Level 5: Pure Magic
  • Business Potential: 4. The “Open Core” Infrastructure
  • Difficulty: Level 5: Master
  • Knowledge Area: Distributed Systems / Graph Partitioning
  • Software or Tool: Building: Distributed Graph Engine
  • Main Book: “Designing Data-Intensive Applications” by Martin Kleppmann

What you’ll build: A system that partitions a large graph across multiple machines, enabling horizontal scaling while minimizing cross-partition traversals.

Why it teaches graph databases: When graphs get too big for one machine, you need distribution. But graphs are hard to partition—unlike independent rows in a table, nodes are linked to each other, and every edge cut by a partition boundary turns into expensive cross-machine communication. Understanding graph partitioning is key to scaling graph databases.

Core challenges you’ll face:

  • Partitioning algorithm (METIS-style) → maps to balanced graph cuts
  • Cross-partition queries → maps to distributed query execution
  • Replication for fault tolerance → maps to data durability
  • Minimizing network hops → maps to locality optimization

Key Concepts:

  • Graph Partitioning: “Designing Data-Intensive Applications” Chapter 6 - Martin Kleppmann
  • Distributed Query Processing: “Database Internals” Chapter 9 - Alex Petrov
  • Consistent Hashing: “Designing Data-Intensive Applications” Chapter 6 - Martin Kleppmann
  • Balanced Cuts: METIS paper and documentation

Difficulty: Master
Time estimate: 1 month+
Prerequisites:

  • All previous projects completed
  • Understanding of network programming
  • Experience with distributed systems concepts

Real world outcome:

# On coordinator node
$ ./graph_cluster init 3
Initializing 3-node cluster...
  Node 1: 192.168.1.10 (primary)
  Node 2: 192.168.1.11
  Node 3: 192.168.1.12
Cluster ready.

$ ./graph_cluster load billion_edges.graph
Loading graph with 1B nodes, 10B edges...
Partitioning using METIS...
  Partition 1: 334M nodes, 3.2B edges, 2.1% edge cut
  Partition 2: 333M nodes, 3.4B edges, 1.8% edge cut
  Partition 3: 333M nodes, 3.4B edges, 1.9% edge cut
Distribution complete.

$ ./graph_cluster query
cluster> MATCH (a:Person {name:"Alice"})-[:KNOWS*1..3]->(b) RETURN count(b)

Query plan:
  1. Find Alice (hash -> Node 2)
  2. Local 1-hop expansion on Node 2
  3. Parallel 2-hop expansion (spans all nodes)
  4. Aggregate counts

| count(b) |
|----------|
| 4,523,891|

Query time: 234ms (47ms local, 187ms cross-partition)

Implementation Hints:

Partitioning strategies:

1. Hash partitioning: node_id % num_partitions
   - Simple but ignores graph structure
   - Many edges will be cut

2. Label/range partitioning: nodes with same label together
   - Good for label-heavy workloads
   - May be imbalanced

3. Graph-aware (METIS-style):
   - Minimize edge cuts while balancing partition sizes
   - Pre-compute partitions
   - Better locality but expensive to compute
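
A tiny C sketch for comparing strategies: assign partitions with strategy 1 (hash by node id) and count how many edges the assignment cuts, since every cut edge is a network hop you pay for later:

#include <stdint.h>

static inline int partition_of(uint64_t node_id, int num_partitions) {
    return (int)(node_id % (uint64_t)num_partitions);   /* strategy 1: hash by id */
}

/* Count how many edges (src[i] -> dst[i]) cross a partition boundary. */
long count_edge_cuts(int m, const uint64_t *src, const uint64_t *dst,
                     int num_partitions) {
    long cuts = 0;
    for (int i = 0; i < m; i++)
        if (partition_of(src[i], num_partitions) != partition_of(dst[i], num_partitions))
            cuts++;
    return cuts;
}

Running the same counter against a label-based or METIS-style assignment shows concretely why graph-aware partitioning is worth the extra work.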

Distributed traversal:

For pattern (a)-[:KNOWS]->(b)-[:WORKS_AT]->(c):
1. Find 'a' locally
2. Expand KNOWS - some 'b' nodes may be remote
3. Send requests to remote partitions for remote 'b' nodes
4. Each partition expands WORKS_AT locally
5. Aggregate results at coordinator

Think about:

  • What’s the trade-off between partition balance and edge cuts?
  • How do you handle a traversal that touches all partitions?
  • What about nodes that have connections everywhere (super-nodes)?
  • How do you replicate popular nodes to reduce network traffic?

Learning milestones:

  1. Graph is partitioned across nodes → You understand distributed data
  2. Queries span partitions correctly → You understand distributed execution
  3. Edge cuts are minimized → You understand graph partitioning algorithms
  4. System survives node failure → You understand replication

Project 13: Query Optimizer with Cost Model

  • File: LEARN_GRAPH_DATABASES_DEEP_DIVE.md
  • Main Programming Language: C
  • Alternative Programming Languages: Rust, Go, C++
  • Coolness Level: Level 4: Hardcore Tech Flex
  • Business Potential: 4. The “Open Core” Infrastructure
  • Difficulty: Level 4: Expert
  • Knowledge Area: Query Optimization / Cost Modeling
  • Software or Tool: Building: Query Optimizer
  • Main Book: “Database Internals” by Alex Petrov

What you’ll build: A query optimizer that chooses the best execution plan by estimating costs based on statistics (cardinality, selectivity, index availability).

Why it teaches graph databases: The same query can be executed many ways. (a)-[:KNOWS]->(b)-[:WORKS_AT]->(c) could start from a, from c, or use an index. Good optimizers make queries 100x faster. Bad ones make databases unusable.

Core challenges you’ll face:

  • Cardinality estimation → maps to statistics collection
  • Plan enumeration → maps to search space exploration
  • Cost modeling → maps to I/O and CPU cost estimation
  • Plan selection → maps to optimization algorithms

Key Concepts:

  • Query Optimization: “Database Internals” Chapter 10 - Alex Petrov
  • Cardinality Estimation: “Architecture of a Database System” Section 4.4 - Hellerstein et al.
  • Cost Models: “Designing Data-Intensive Applications” Chapter 3 - Martin Kleppmann
  • Join Ordering: “Database Internals” Chapter 10 - Alex Petrov

Difficulty: Expert
Time estimate: 3 weeks
Prerequisites:

  • Project 8 completed (query executor)
  • Project 6 completed (indexes)
  • Understanding of statistics and estimation

Real world outcome:

$ ./graph_db --explain my_graph.db
graph> MATCH (a:Person)-[:KNOWS]->(b:Person)-[:WORKS_AT]->(c:Company {name:"TechCorp"})
       RETURN a.name

Statistics:
  Person nodes: 100,000
  Company nodes: 5,000
  KNOWS edges: 500,000
  WORKS_AT edges: 200,000
  TechCorp employees: 50

Plan 1 (Start from Person, no index):
  AllNodesScan(Person) -> a              cost: 100,000
  Expand(KNOWS) -> b                     cost: 500,000
  Expand(WORKS_AT) -> c                  cost: 200,000
  Filter(name="TechCorp")                cost: 200,000
  TOTAL: 1,000,000

Plan 2 (Start from TechCorp via index):
  IndexScan(Company.name="TechCorp") -> c  cost: 1
  ExpandIn(WORKS_AT) -> b                  cost: 50
  ExpandIn(KNOWS) -> a                     cost: 250
  TOTAL: 301                             ← WINNER!

Executing Plan 2...
50 rows in 3ms

Implementation Hints:

Statistics you need:

- Node count per label
- Edge count per type
- Property value histograms
- Index selectivity estimates

Cost model factors:

- Sequential scan: num_pages * page_read_cost
- Index scan: selectivity * num_rows * index_lookup_cost
- Expand: avg_degree * input_rows
- Filter: input_rows * filter_cost (output = selectivity * input)

Plan enumeration for pattern (a)-[r1]->(b)-[r2]->(c):

Possible orderings:
1. Start from a, expand r1, expand r2
2. Start from c, expand r2 backward, expand r1 backward
3. Start from b, expand r1 backward, expand r2 forward

Each can use AllNodesScan or IndexScan
Compare costs, pick cheapest
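
A toy C sketch of the cost arithmetic (the statistics structure and cost units are illustrative; real optimizers weight I/O and CPU separately):

/* Illustrative per-label statistics the optimizer would keep. */
typedef struct {
    double node_count;           /* nodes with this label                         */
    double avg_out_degree;       /* average edges of the relevant type per node   */
    double index_selectivity;    /* expected fraction matching an equality filter */
} Stats;

typedef struct { double rows; double cost; } Estimate;

/* Compare "label scan then expand" against "index seek then expand".
 * Cost units are abstract (roughly: records touched), as in the plans above. */
Estimate scan_then_expand(const Stats *s, int use_index) {
    Estimate e;
    if (use_index) {
        e.rows = s->node_count * s->index_selectivity;  /* e.g. 5,000 * 0.0002 = 1 */
        e.cost = e.rows;                                /* index lookups           */
    } else {
        e.rows = s->node_count;                         /* AllNodesScan            */
        e.cost = s->node_count;
    }
    e.cost += e.rows * s->avg_out_degree;               /* the Expand step         */
    e.rows  = e.rows * s->avg_out_degree;
    return e;
}

The optimizer builds an estimate like this for every candidate ordering and picks the smallest total cost.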

Questions:

  • How do you handle unknown selectivity? (Use defaults or samples)
  • What if statistics are stale? (Re-analyze periodically)
  • How do you handle variable-length patterns [:KNOWS*1..5]?
  • Is the optimizer overhead worth it for simple queries?

Learning milestones:

  1. Optimizer picks obviously better plans → You understand basic cost modeling
  2. Statistics affect plan choice → You understand cardinality estimation
  3. Complex patterns optimize correctly → You understand plan enumeration
  4. Query times improve dramatically → You’ve built a real optimizer

Project 14: Full-Text Search Integration

  • File: LEARN_GRAPH_DATABASES_DEEP_DIVE.md
  • Main Programming Language: C
  • Alternative Programming Languages: Rust, Go
  • Coolness Level: Level 3: Genuinely Clever
  • Business Potential: 3. The “Service & Support” Model
  • Difficulty: Level 3: Advanced
  • Knowledge Area: Information Retrieval / Text Search
  • Software or Tool: Building: Full-Text Index
  • Main Book: “Introduction to Information Retrieval” by Manning, Raghavan & Schütze

What you’ll build: A full-text search index for graph properties, enabling queries like “find all Person nodes where bio contains ‘machine learning’”.

Why it teaches graph databases: Neo4j integrates with Lucene for full-text search. Combining graph traversal with text search is powerful—find experts by searching their profiles, then traverse to their companies and co-workers.

Core challenges you’ll face:

  • Tokenization and stemming → maps to text processing
  • Inverted index construction → maps to index data structures
  • TF-IDF scoring → maps to relevance ranking
  • Integration with graph queries → maps to hybrid query execution

Key Concepts:

  • Inverted Indexes: “Introduction to Information Retrieval” Chapter 1 - Manning et al.
  • TF-IDF Scoring: “Introduction to Information Retrieval” Chapter 6 - Manning et al.
  • Text Tokenization: “Introduction to Information Retrieval” Chapter 2 - Manning et al.
  • Lucene Architecture: Lucene documentation

Difficulty: Advanced
Time estimate: 2 weeks
Prerequisites:

  • Project 6 completed (B+Tree)
  • Understanding of text processing
  • Basic information retrieval concepts

Real world outcome:

$ ./graph_db my_graph.db

graph> CREATE FULLTEXT INDEX bio_index FOR (p:Person) ON (p.bio)
Index created on Person.bio
Indexing 100,000 documents...
  Vocabulary size: 45,000 terms
  Index size: 12MB
Done.

graph> CALL db.index.fulltext.queryNodes("bio_index", "machine learning")
       YIELD node, score
       RETURN node.name, score
       ORDER BY score DESC
       LIMIT 5

| node.name    | score  |
|--------------|--------|
| Alice Chen   | 8.234  |
| Bob ML       | 7.891  |
| Carol Data   | 6.543  |
...

graph> MATCH (p:Person)-[:WORKS_AT]->(c:Company)
       WHERE p.bio CONTAINS "machine learning"
       RETURN c.name, count(p) as ml_experts
       ORDER BY ml_experts DESC

| c.name     | ml_experts |
|------------|------------|
| TechCorp   | 23         |
| DataInc    | 18         |
...

Implementation Hints:

Inverted index structure:

term -> [(doc_id, positions), (doc_id, positions), ...]

Example:
"machine" -> [(doc_42, [0, 156]), (doc_89, [23]), ...]
"learning" -> [(doc_42, [1, 157]), (doc_89, [24]), ...]

To find "machine learning":
  1. Get posting lists for both terms
  2. Intersect by doc_id
  3. Check if positions are adjacent

TF-IDF scoring:

TF(t, d) = frequency of term t in document d
IDF(t) = log(N / df(t))  where N = total docs, df = docs containing t
Score = TF * IDF

Higher score = term appears often in this doc, rarely in others
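
The scoring formulas in C (a sketch; real engines add length normalization and smoothing):

#include <math.h>

/* TF-IDF for one (term, document) pair:
 *   tf       = occurrences of the term in this document
 *   doc_freq = number of documents containing the term
 *   num_docs = total documents in the index                        */
double tfidf(double tf, double doc_freq, double num_docs) {
    if (tf <= 0.0 || doc_freq <= 0.0) return 0.0;
    return tf * log(num_docs / doc_freq);
}

/* Score a document for a multi-term query by summing per-term TF-IDF. */
double score_document(int nterms, const double *tf, const double *doc_freq,
                      double num_docs) {
    double s = 0.0;
    for (int i = 0; i < nterms; i++) s += tfidf(tf[i], doc_freq[i], num_docs);
    return s;
}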

Integration with graph queries:

MATCH (p:Person) WHERE p.bio CONTAINS "machine learning"

Execution:
1. Query full-text index for "machine learning"
2. Get list of node IDs with scores
3. Fetch nodes, continue with graph pattern

Think about:

  • How do you handle updates to indexed properties?
  • How do you tokenize “machine-learning” vs “machine learning”?
  • Should you support phrase queries? Fuzzy matching?
  • How large can the inverted index get?

Learning milestones:

  1. Basic term search works → You understand inverted indexes
  2. Phrase search works → You understand position-aware indexing
  3. Scores reflect relevance → You understand TF-IDF
  4. Integrates with graph queries → You understand hybrid execution

Project 15: Complete Graph Database Engine

  • File: LEARN_GRAPH_DATABASES_DEEP_DIVE.md
  • Main Programming Language: C
  • Alternative Programming Languages: Rust, C++
  • Coolness Level: Level 5: Pure Magic
  • Business Potential: 5. The “Industry Disruptor”
  • Difficulty: Level 5: Master
  • Knowledge Area: Database Systems / Systems Programming
  • Software or Tool: Building: Graph Database
  • Main Book: “Database Internals” by Alex Petrov

What you’ll build: A complete graph database engine combining all previous projects: storage engine, buffer pool, B+Tree indexes, query parser, executor, optimizer, WAL, and transactions. A mini Neo4j from scratch.

Why it teaches graph databases: This is the capstone. You’ll integrate every component into a cohesive system. The challenges of making components work together teach you as much as building them individually. You’ll understand graph databases at the deepest level.

Core challenges you’ll face:

  • Component integration → maps to systems architecture
  • API design → maps to usability and developer experience
  • Performance tuning → maps to system optimization
  • Bug hunting across layers → maps to debugging complex systems

Key Concepts:

  • Database Architecture: “Database Internals” throughout - Alex Petrov
  • Systems Integration: “Designing Data-Intensive Applications” throughout - Martin Kleppmann
  • API Design: “The Pragmatic Programmer” Chapter 7 - Hunt & Thomas
  • Performance Analysis: “Computer Systems: A Programmer’s Perspective” Chapters 5 & 6 - Bryant & O’Hallaron

Difficulty: Master Time estimate: 1 month+ Prerequisites:

  • All previous projects completed
  • Strong systems programming skills
  • Patience and determination

Real world outcome:

$ ./myneo4j --data-dir /var/myneo4j --port 7687
MyNeo4j Graph Database v0.1
  Storage engine: Native fixed-record
  Buffer pool: 1GB (262,144 pages)
  WAL: Enabled (wal.log)
  Transactions: MVCC with snapshot isolation
  Indexes: B+Tree
  Query language: MiniCypher

Server started on bolt://localhost:7687

# In another terminal
$ ./myneo4j-shell localhost:7687
Connected to MyNeo4j.

neo4j> CREATE INDEX person_name FOR (p:Person) ON (p.name);
Index created.

neo4j> BEGIN;
neo4j> CREATE (a:Person {name: "Alice", age: 30});
Node created.
neo4j> CREATE (b:Person {name: "Bob", age: 28});
Node created.
neo4j> MATCH (a:Person {name: "Alice"}), (b:Person {name: "Bob"})
       CREATE (a)-[:KNOWS {since: 2020}]->(b);
Relationship created.
neo4j> COMMIT;
Transaction committed.

neo4j> EXPLAIN MATCH (a:Person {name: "Alice"})-[:KNOWS]->(b) RETURN b.name;
+----------------------------+
| Operator           | Rows  |
+----------------------------+
| Projection         | 1     |
| Expand(KNOWS OUT)  | 1     |
| IndexSeek(name)    | 1     |
+----------------------------+

neo4j> MATCH (a:Person {name: "Alice"})-[:KNOWS]->(b) RETURN b.name;
+--------+
| b.name |
+--------+
| Bob    |
+--------+
1 row, 0.8ms

neo4j> :quit
Goodbye!

Implementation Hints:

System architecture:

┌─────────────────────────────────────────────────────────────┐
│                     Client Connection                       │
│                     (TCP/Bolt protocol)                     │
└─────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌─────────────────────────────────────────────────────────────┐
│                      Query Parser                           │
│              (Lexer → Parser → AST)                         │
└─────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌─────────────────────────────────────────────────────────────┐
│                     Query Optimizer                         │
│        (Statistics → Plan Enumeration → Cost Model)         │
└─────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌─────────────────────────────────────────────────────────────┐
│                    Execution Engine                         │
│            (Operators: Scan, Expand, Filter, etc.)          │
└─────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌─────────────────────────────────────────────────────────────┐
│                   Transaction Manager                       │
│                    (MVCC, Isolation)                        │
└─────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌───────────────────┬─────────────────┬───────────────────────┐
│   Buffer Pool     │     Indexes     │         WAL           │
│   (Page Cache)    │    (B+Tree)     │   (Write-Ahead Log)   │
└───────────────────┴─────────────────┴───────────────────────┘
                              │
                              ▼
┌─────────────────────────────────────────────────────────────┐
│                     Storage Engine                          │
│        (Fixed-size records: nodes.db, rels.db, etc.)        │
└─────────────────────────────────────────────────────────────┘
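
One way to express this layering in C is a single top-level handle that owns every component. The type and field names below are purely illustrative; the real components come from Projects 4-13:

#include <stdint.h>

typedef struct StorageEngine StorageEngine;  /* fixed-size records: nodes.db, rels.db */
typedef struct BufferPool    BufferPool;     /* page cache over the storage files     */
typedef struct Wal           Wal;            /* write-ahead log                       */
typedef struct TxnManager    TxnManager;     /* MVCC snapshots and isolation          */
typedef struct IndexManager  IndexManager;   /* B+Tree indexes                        */
typedef struct QueryEngine   QueryEngine;    /* parser + optimizer + executor         */

/* Requests flow top-down exactly as in the diagram: the query engine
   asks the transaction manager for a snapshot, operators read pages
   through the buffer pool, and every modification hits the WAL before
   the storage engine applies it.                                      */
typedef struct {
    StorageEngine *storage;
    BufferPool    *pool;
    Wal           *wal;
    TxnManager    *txns;
    IndexManager  *indexes;
    QueryEngine   *query;
    uint16_t       port;      /* Bolt-style listener port */
} GraphDb;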

Integration checklist:

  • Storage engine creates/reads fixed-size records
  • Buffer pool caches pages, tracks dirty pages
  • WAL logs all modifications before apply
  • Transactions use MVCC for isolation
  • Indexes support CREATE INDEX / DROP INDEX
  • Parser handles MiniCypher subset
  • Optimizer chooses between scan and index
  • Executor runs plans, returns results (a toy iterator model is sketched after this list)
  • Server accepts network connections
  • Shell provides interactive access
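
To make “executor runs plans, returns results” concrete, here is a self-contained toy of the Volcano-style iterator model an executor can use: each operator exposes next() and pulls rows from its child. All names are illustrative, not the actual Project 8 API.

#include <stdio.h>

typedef struct Operator Operator;
struct Operator {
    /* Returns 1 and fills *out while rows remain, 0 when exhausted. */
    int (*next)(Operator *self, long *out);
    Operator *child;
    long      state;     /* scratch space for this toy example */
};

/* Leaf operator: emits node IDs 1..3, standing in for a label scan. */
int scan_next(Operator *self, long *out)
{
    if (self->state > 3) return 0;
    *out = self->state++;
    return 1;
}

/* Filter operator: keeps only even IDs, standing in for a predicate
   such as p.name = "Alice".                                          */
int filter_next(Operator *self, long *out)
{
    long v;
    while (self->child->next(self->child, &v))
        if (v % 2 == 0) { *out = v; return 1; }
    return 0;
}

int main(void)
{
    Operator scan   = { scan_next,   NULL,  1 };
    Operator filter = { filter_next, &scan, 0 };
    long row;
    while (filter.next(&filter, &row))
        printf("row: %ld\n", row);    /* prints: row: 2 */
    return 0;
}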

Think about:

  • How do you test the integrated system?
  • What benchmarks should you run?
  • How do you handle errors at each layer?
  • What’s your upgrade/migration story?

Learning milestones:

  1. Basic CRUD operations work → System is functional
  2. Transactions provide ACID → System is reliable
  3. Indexes speed up lookups → System is optimized
  4. Concurrent clients work → System is production-ready
  5. Benchmarks show good performance → System is competitive

Project Comparison Table

| Project                        | Difficulty   | Time      | Depth of Understanding | Fun Factor |
|--------------------------------|--------------|-----------|-------------------------|------------|
| 1. In-Memory Graph Library     | Beginner     | 1 week    | ⭐⭐⭐                  | ⭐⭐⭐     |
| 2. Graph Traversal Algorithms  | Intermediate | 1 week    | ⭐⭐⭐                  | ⭐⭐⭐⭐   |
| 3. Graph Serialization         | Intermediate | 1 week    | ⭐⭐⭐                  | ⭐⭐⭐     |
| 4. Memory-Mapped Storage       | Advanced     | 1-2 weeks | ⭐⭐⭐⭐                | ⭐⭐⭐⭐   |
| 5. Buffer Pool Manager         | Expert       | 2 weeks   | ⭐⭐⭐⭐⭐              | ⭐⭐⭐     |
| 6. B+Tree Index                | Expert       | 2-3 weeks | ⭐⭐⭐⭐⭐              | ⭐⭐⭐⭐   |
| 7. Query Language Parser       | Advanced     | 2 weeks   | ⭐⭐⭐⭐                | ⭐⭐⭐⭐⭐ |
| 8. Query Execution Engine      | Expert       | 2-3 weeks | ⭐⭐⭐⭐⭐              | ⭐⭐⭐⭐   |
| 9. Write-Ahead Logging         | Expert       | 2 weeks   | ⭐⭐⭐⭐⭐              | ⭐⭐⭐     |
| 10. MVCC Transactions          | Master       | 3-4 weeks | ⭐⭐⭐⭐⭐              | ⭐⭐⭐⭐   |
| 11. Graph Algorithms           | Advanced     | 2 weeks   | ⭐⭐⭐⭐                | ⭐⭐⭐⭐⭐ |
| 12. Distributed Graph          | Master       | 1 month+  | ⭐⭐⭐⭐⭐              | ⭐⭐⭐⭐⭐ |
| 13. Query Optimizer            | Expert       | 3 weeks   | ⭐⭐⭐⭐⭐              | ⭐⭐⭐⭐   |
| 14. Full-Text Search           | Advanced     | 2 weeks   | ⭐⭐⭐⭐                | ⭐⭐⭐⭐   |
| 15. Complete Engine            | Master       | 1 month+  | ⭐⭐⭐⭐⭐              | ⭐⭐⭐⭐⭐ |

Recommended learning path, based on building a graph database from scratch in C:

Phase 1: Foundations (Weeks 1-3)

Start here to understand the basics.

  1. Project 1: In-Memory Graph Library - Get comfortable with graph data structures
  2. Project 2: Graph Traversal Algorithms - Understand how to navigate graphs
  3. Project 3: Graph Serialization - Learn to persist graphs to disk

Phase 2: Storage Engine (Weeks 4-7)

Build the foundation of a real database.

  1. Project 4: Memory-Mapped Storage - Understand OS-level I/O
  2. Project 5: Buffer Pool Manager - Master database I/O management
  3. Project 6: B+Tree Index - Implement efficient property lookups

Phase 3: Query Processing (Weeks 8-11)

Make your database usable.

  1. Project 7: Query Language Parser - Build a Cypher-like parser
  2. Project 8: Query Execution Engine - Execute queries against storage
  3. Project 13: Query Optimizer - Make queries fast automatically

Phase 4: ACID & Reliability (Weeks 12-16)

Make your database reliable.

  1. Project 9: Write-Ahead Logging - Achieve durability
  2. Project 10: MVCC Transactions - Handle concurrent access

Phase 5: Advanced Features (Weeks 17+)

Polish and extend.

  1. Project 11: Graph Algorithms - Add analytics capabilities
  2. Project 14: Full-Text Search - Enable text search
  3. Project 12: Distributed Graph - Scale horizontally
  4. Project 15: Complete Engine - Integrate everything

Final Capstone Project: Production-Ready Graph Database

Following the same pattern, this project represents the culmination of all your learning.

  • File: LEARN_GRAPH_DATABASES_DEEP_DIVE.md
  • Main Programming Language: C
  • Alternative Programming Languages: Rust, C++
  • Coolness Level: Level 5: Pure Magic (Super Cool)
  • Business Potential: 5. The “Industry Disruptor”
  • Difficulty: Level 5: Master
  • Knowledge Area: Complete Database Systems
  • Software or Tool: Building: Production Graph Database
  • Main Book: “Database Internals” by Alex Petrov + “Designing Data-Intensive Applications” by Martin Kleppmann

What you’ll build: A production-grade graph database that could genuinely be used for real applications—with proper error handling, configuration, monitoring, backup/restore, and documentation.

Why this is the ultimate goal: Anyone can build a toy database. Building something production-worthy requires handling edge cases, providing operational tools, ensuring security, and creating a polished user experience. This is what separates engineers who merely understand databases from those who have mastered them.

Features beyond Project 15:

  • Configuration file support
  • Proper logging and monitoring hooks
  • Backup and restore functionality
  • Data import/export (CSV, JSON)
  • Authentication and authorization
  • Query timeout and resource limits
  • Comprehensive error messages
  • Unit tests, integration tests, stress tests
  • Documentation and examples

Real world outcome:

$ myneo4j --config myneo4j.conf
Loading configuration from myneo4j.conf
  Data directory: /var/lib/myneo4j
  Buffer pool: 4GB
  Max connections: 100
  Auth: enabled (bcrypt)
  TLS: enabled
  Backup schedule: daily at 03:00

[2024-01-15 10:00:00] INFO  MyNeo4j 1.0 starting...
[2024-01-15 10:00:01] INFO  Recovery: WAL contains 0 uncommitted transactions
[2024-01-15 10:00:01] INFO  Indexes: Loading 5 indexes
[2024-01-15 10:00:02] INFO  Server listening on bolt://0.0.0.0:7687 (TLS)
[2024-01-15 10:00:02] INFO  Server listening on http://0.0.0.0:7474 (metrics)
[2024-01-15 10:00:02] INFO  Ready to accept connections

# Monitoring
$ curl http://localhost:7474/metrics
{
  "nodes": 15000000,
  "edges": 75000000,
  "transactions_committed": 45123,
  "buffer_pool_hit_rate": 0.987,
  "avg_query_time_ms": 2.3,
  "active_connections": 12
}

# Backup
$ myneo4j-admin backup --to /backups/$(date +%Y%m%d)
Backup started...
  Phase 1: Checkpoint
  Phase 2: Copy data files
  Phase 3: Copy indexes
  Phase 4: Verify checksum
Backup complete: 12.3GB in 4m23s

Learning milestones:

  1. Database survives chaos testing → You’ve built robust software
  2. Operations team can manage it → You’ve built operator-friendly software
  3. Developers find it easy to use → You’ve built user-friendly software
  4. Performance is competitive → You’ve built efficient software
  5. You can explain every design decision → You’ve mastered the domain

Summary

| #     | Project Name                                             | Main Programming Language |
|-------|----------------------------------------------------------|---------------------------|
| 1     | In-Memory Graph Library                                  | C                         |
| 2     | Graph Traversal Algorithms (BFS, DFS, Shortest Path)     | C                         |
| 3     | Graph Serialization & File Persistence                   | C                         |
| 4     | Memory-Mapped Graph Storage                              | C                         |
| 5     | Buffer Pool Manager                                      | C                         |
| 6     | B+Tree Index for Node Lookups                            | C                         |
| 7     | Simple Query Language Parser                             | C                         |
| 8     | Query Execution Engine                                   | C                         |
| 9     | Write-Ahead Logging (WAL)                                | C                         |
| 10    | Transaction Manager with MVCC                            | C                         |
| 11    | Graph-Specific Algorithms (PageRank, Community Detection) | C                        |
| 12    | Distributed Graph Partitioning                           | C                         |
| 13    | Query Optimizer with Cost Model                          | C                         |
| 14    | Full-Text Search Integration                             | C                         |
| 15    | Complete Graph Database Engine                           | C                         |
| Final | Production-Ready Graph Database                          | C                         |

Key Resources

Books

  • “Database Internals” by Alex Petrov - The essential guide to how databases work
  • “Designing Data-Intensive Applications” by Martin Kleppmann - Understanding distributed data systems
  • “Graph Databases” by Ian Robinson, Jim Webber & Emil Eifrem - Neo4j-focused but excellent concepts
  • “C Programming: A Modern Approach” by K. N. King - Essential C reference
  • “Computer Systems: A Programmer’s Perspective” by Bryant & O’Hallaron - Understanding systems

Online Resources

Papers


You’re ready to build a graph database from scratch. Start with Project 1 and work your way through. By the end, you’ll understand graph databases at the deepest level—the level of the engineers who built Neo4j, TigerGraph, and Amazon Neptune.