LEARN POSTGRESQL DEEP DIVE
Learn PostgreSQL: From Zero to Database Internals Master
Goal: Deeply understand PostgreSQL—not just SQL syntax, but how it works behind the scenes. Master MVCC, the query planner, storage engine, WAL, indexes, and understand why PostgreSQL makes the architectural choices that differentiate it from MySQL, Oracle, and other databases.
Why PostgreSQL Matters
PostgreSQL is not just another SQL database. It’s arguably the most advanced open-source relational database, and understanding its internals teaches you:
- MVCC (Multi-Version Concurrency Control): How readers and writers access the same data concurrently without blocking each other
- Write-Ahead Logging (WAL): The foundation of crash recovery and replication
- Cost-Based Query Planning: How databases decide between index scans and sequential scans
- B-tree and Beyond: Why different index types exist for different problems
- Transaction Isolation: What “ACID” actually means in practice
After completing these projects, you will:
- Understand every byte in a PostgreSQL page
- Know exactly what happens when you run SELECT, INSERT, and UPDATE
- Read EXPLAIN ANALYZE output like a book
- Build your own mini-database with PostgreSQL-style MVCC
- Know when PostgreSQL is the right choice vs MySQL, Oracle, or others
PostgreSQL vs Other Databases: The Key Differences
Before diving into the projects, it helps to understand the architectural choices that make PostgreSQL unique:
PostgreSQL vs MySQL
| Aspect | PostgreSQL | MySQL (InnoDB) |
|---|---|---|
| MVCC Implementation | Old versions stored in main table (heap) | Old versions in separate “undo logs” (rollback segment) |
| Cleanup | VACUUM required to reclaim space | Automatic purge of undo logs |
| Indexing | Indexes store TIDs (pointer to heap) | Clustered index stores data, secondary indexes point to PK |
| Read Performance | Faster (no version reconstruction) | May need to read undo log for old versions |
| Write Amplification | Higher (UPDATE = DELETE + INSERT) | Lower (in-place updates possible) |
| VACUUM Overhead | Must run periodically | No equivalent needed |
| Extensions | First-class (PostGIS, pg_trgm, etc.) | Limited |
| SQL Compliance | Very high | Moderate (silent truncations, etc.) |
| JSON Support | JSONB (binary, indexable) | JSON (text-based) |
PostgreSQL vs Oracle
| Aspect | PostgreSQL | Oracle |
|---|---|---|
| License | Open source (PostgreSQL License) | Commercial (expensive) |
| MVCC | Similar approach (multi-versioning) | Undo tablespace |
| Procedural Language | PL/pgSQL, PL/Python, etc. | PL/SQL |
| Partitioning | Declarative (since v10) | Highly mature |
| RAC Equivalent | None (use Citus, Patroni) | Real Application Clusters |
| Feature Parity | ~95% for most use cases | Industry leader in enterprise features |
PostgreSQL vs SQLite
| Aspect | PostgreSQL | SQLite |
|---|---|---|
| Architecture | Client-server | Embedded (single file) |
| Concurrency | Full MVCC, many writers | Single writer at a time |
| Use Case | Production applications | Mobile, embedded, testing |
| Scale | Petabytes | Gigabytes (practical) |
Core Architecture Overview
┌─────────────────────────────────────────────────────────────────────┐
│ PostgreSQL Architecture │
├─────────────────────────────────────────────────────────────────────┤
│ │
│ Client ───► Postmaster ───► Backend Process (per connection) │
│ │ │ │
│ │ ┌──────────┴──────────┐ │
│ │ │ │ │
│ ▼ ▼ ▼ │
│ ┌─────────────────────────────────────────┐ │
│ │ Shared Memory │ │
│ │ ┌─────────────┐ ┌─────────────────┐ │ │
│ │ │ Shared │ │ WAL Buffers │ │ │
│ │ │ Buffers │ │ (Write-Ahead │ │ │
│ │ │ (Page Cache)│ │ Log) │ │ │
│ │ └──────┬──────┘ └────────┬────────┘ │ │
│ └─────────┼──────────────────┼────────────┘ │
│ │ │ │
│ ┌─────────┼──────────────────┼────────────┐ │
│ │ ▼ ▼ │ │
│ Storage │ ┌─────────────┐ ┌─────────────┐ │ │
│ │ │ Data Files │ │ WAL Files │ │ │
│ │ │ (Tables, │ │ (pg_wal/) │ │ │
│ │ │ Indexes) │ │ │ │ │
│ │ └─────────────┘ └─────────────┘ │ │
│ └────────────────────────────────────────┘ │
│ │
│ Background Processes: │
│ • WAL Writer • Checkpointer • Autovacuum │
│ • Background Writer • Stats Collector • WAL Archiver │
│ │
└─────────────────────────────────────────────────────────────────────┘
Fundamental Concepts
- Page-Based Storage
  - Everything is stored in 8KB pages (configurable at compile time)
  - Tables (heaps), indexes, and WAL all use pages
  - Pages contain: a header (24 bytes), item pointers, free space, and tuples
- MVCC (Multi-Version Concurrency Control)
  - Each tuple has xmin (the transaction that inserted it) and xmax (the transaction that deleted or updated it)
  - Readers never block writers, and writers never block readers
  - Old versions accumulate → VACUUM is needed to reclaim them
- Write-Ahead Logging (WAL)
  - All changes are written to WAL before the data files
  - Enables crash recovery: replay the WAL from the last checkpoint
  - Foundation for replication (streaming and logical)
- Transaction Isolation
  - Read Committed (default): each statement sees data committed before it started
  - Repeatable Read: snapshot taken at the start of the transaction
  - Serializable: full isolation (SSI - Serializable Snapshot Isolation)
- Query Processing Pipeline
  - SQL Text → Parser → Rewriter → Planner/Optimizer → Executor → Results
- Index Types
  - B-tree: Default, for equality and range queries
  - Hash: Equality only (rarely used, WAL-logged since v10)
  - GiST: Generalized Search Tree (geometric, full-text)
  - GIN: Generalized Inverted Index (full-text, arrays, JSONB)
  - BRIN: Block Range Index (very large tables with naturally ordered data)
Project List
Projects are ordered from foundational concepts to advanced implementations.
Project 1: Build a Page-Based Storage Engine
- File: LEARN_POSTGRESQL_DEEP_DIVE.md
- Main Programming Language: C
- Alternative Programming Languages: Rust, Go, C++
- Coolness Level: Level 4: Hardcore Tech Flex
- Business Potential: 1. The “Resume Gold”
- Difficulty: Level 3: Advanced
- Knowledge Area: Storage Engines / File Systems
- Software or Tool: Custom Storage Engine
- Main Book: “Database Internals” by Alex Petrov
What you’ll build: A simple page-based storage engine that stores records in 8KB pages, manages free space, and supports insert/read operations. This is the foundation PostgreSQL (and most databases) build upon.
Why it teaches PostgreSQL: PostgreSQL’s heap storage is built on pages. Understanding page layout, tuple storage, and free space management is essential to understanding VACUUM, bloat, and performance.
Core challenges you’ll face:
- Page layout design → maps to PostgreSQL’s PageHeaderData, ItemId, tuples
- Free space management → maps to Free Space Map (FSM)
- Tuple addressing → maps to TID (tuple identifier) = (page, offset)
- Buffer management → maps to shared_buffers, page replacement
Key Concepts:
- Page Structure: PostgreSQL Docs - Database Page Layout
- Heap Files: “Database Internals” Chapter 3 - Alex Petrov
- Buffer Pool: “Database Internals” Chapter 5 - Alex Petrov
Difficulty: Advanced. Time estimate: 2 weeks. Prerequisites: C programming, understanding of file I/O, basic data structures.
Real world outcome:
# Initialize storage file
$ ./pagestore init mydb.dat
# Insert records
$ ./pagestore insert mydb.dat "Hello, World!"
Inserted at TID (0, 1)
$ ./pagestore insert mydb.dat "PostgreSQL rocks!"
Inserted at TID (0, 2)
# Read by TID
$ ./pagestore read mydb.dat 0 1
Hello, World!
# Show page info
$ ./pagestore pageinfo mydb.dat 0
Page 0:
Lower: 32 (items start)
Upper: 8156 (free space end)
Free Space: 8124 bytes
Items: 2
[1] offset=8176, len=16, data="Hello, World!"
[2] offset=8156, len=20, data="PostgreSQL rocks!"
# Fill up page, see it create a new one
$ for i in {1..400}; do ./pagestore insert mydb.dat "Record $i"; done
...
Inserted at TID (1, 45) # Now using page 1!
Implementation Hints:
PostgreSQL page layout (8KB):
+-------------------+
| PageHeaderData | 24 bytes
| (pd_lsn, pd_lower,|
| pd_upper, etc.) |
+-------------------+
| ItemId Array | 4 bytes each, grows downward
| [1] [2] [3] ... |
+-------------------+
| |
| Free Space |
| |
+-------------------+
| Tuple 3 | Tuples grow upward
| Tuple 2 |
| Tuple 1 |
+-------------------+
| Special Space | (Optional, for indexes)
+-------------------+
Core structures:
#define PAGE_SIZE 8192
typedef struct PageHeader {
uint32_t pd_lsn; // WAL position (we'll simplify)
uint16_t pd_lower; // Offset to start of free space
uint16_t pd_upper; // Offset to end of free space
uint16_t pd_special; // Offset to special space
// ... more fields in real PostgreSQL
} PageHeader;
typedef struct ItemId {
uint16_t lp_off; // Offset to tuple
uint16_t lp_len; // Length of tuple
// In PostgreSQL: also has lp_flags
} ItemId;
Insert algorithm:
- Check if the tuple fits: pd_upper - pd_lower >= sizeof(ItemId) + tuple_size
- If not, allocate a new page
- Write the tuple at pd_upper - tuple_size
- Add an ItemId at position pd_lower
- Update pd_lower and pd_upper
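A minimal C sketch of the insert steps above, assuming the PageHeader and ItemId structs from the previous listing and a raw 8KB page buffer (page_insert and its return convention are illustrative, not PostgreSQL's actual API):
#include <stdint.h>
#include <string.h>
// Returns the new item number (1-based), or 0 if the tuple does not fit.
int page_insert(char *page, const void *tuple, uint16_t tuple_len) {
    PageHeader *hdr = (PageHeader *)page;
    uint16_t needed = sizeof(ItemId) + tuple_len;
    if ((uint16_t)(hdr->pd_upper - hdr->pd_lower) < needed)
        return 0;                              // caller must allocate a new page
    hdr->pd_upper -= tuple_len;                // tuples grow down from the end of the page
    memcpy(page + hdr->pd_upper, tuple, tuple_len);
    ItemId *item = (ItemId *)(page + hdr->pd_lower);
    item->lp_off = hdr->pd_upper;              // points to where the tuple was placed
    item->lp_len = tuple_len;
    hdr->pd_lower += sizeof(ItemId);           // ItemId array grows up toward the tuples
    return (hdr->pd_lower - sizeof(PageHeader)) / sizeof(ItemId);
}
The returned item number plus the page number form the TID that stays stable for the tuple's lifetime.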
Learning milestones:
- Pages store and retrieve data → You understand basic page layout
- Free space is tracked correctly → You understand space management
- Multiple pages work → You understand file growth
- TIDs are stable → You understand tuple addressing
Project 2: Implement MVCC (Multi-Version Concurrency Control)
- File: LEARN_POSTGRESQL_DEEP_DIVE.md
- Main Programming Language: C
- Alternative Programming Languages: Rust, Go, C++
- Coolness Level: Level 5: Pure Magic
- Business Potential: 1. The “Resume Gold”
- Difficulty: Level 4: Expert
- Knowledge Area: Database Internals / Concurrency
- Software or Tool: MVCC Implementation
- Main Book: “Designing Data-Intensive Applications” by Martin Kleppmann
What you’ll build: Extend your storage engine with MVCC. Each tuple has xmin/xmax transaction IDs. Transactions see consistent snapshots. UPDATEs create new versions.
Why it teaches PostgreSQL: MVCC is PostgreSQL’s core concurrency mechanism. Understanding xmin/xmax, visibility rules, and why dead tuples accumulate explains VACUUM, bloat, and transaction isolation.
Core challenges you’ll face:
- Transaction ID management → maps to xid, xid wraparound
- Visibility rules → maps to snapshot, xmin/xmax checking
- UPDATE as DELETE+INSERT → maps to PostgreSQL’s HOT, bloat
- Snapshot isolation → maps to transaction isolation levels
Resources for key challenges:
- PostgreSQL MVCC Internals - Detailed xmin/xmax explanation
- The Internals of PostgreSQL - MVCC - Comprehensive guide
Key Concepts:
- MVCC: “Designing Data-Intensive Applications” Chapter 7 - Martin Kleppmann
- Transaction Snapshots: PostgreSQL Docs - MVCC Intro
- Tuple Visibility: PostgreSQL Source - heapam_visibility.c
Difficulty: Expert. Time estimate: 3-4 weeks. Prerequisites: Project 1 completed, understanding of transactions.
Real world outcome:
# Start two concurrent transactions
$ ./mvccdb shell
mvcc> BEGIN;
Transaction 100 started
mvcc> INSERT INTO users VALUES (1, 'Alice');
Inserted at TID (0, 1), xmin=100
mvcc> SELECT * FROM users;
(1, 'Alice')
# In another terminal, start transaction 101
mvcc2> BEGIN;
Transaction 101 started
mvcc2> SELECT * FROM users;
(empty) # Can't see uncommitted row from tx 100!
# Back in first terminal
mvcc> COMMIT;
Transaction 100 committed
# Now second transaction can see it
mvcc2> SELECT * FROM users;
(1, 'Alice')
# Update creates new version
mvcc2> UPDATE users SET name = 'Alicia' WHERE id = 1;
Updated: old TID (0,1) xmax=101, new TID (0,2) xmin=101
# Show tuple versions
mvcc2> SELECT * FROM users WITH VERSIONS;
TID(0,1): (1, 'Alice') xmin=100 xmax=101 [DEAD]
TID(0,2): (1, 'Alicia') xmin=101 xmax=0 [LIVE]
Implementation Hints:
Tuple header with MVCC fields:
typedef struct HeapTupleHeader {
uint32_t t_xmin; // Inserting transaction ID
uint32_t t_xmax; // Deleting/updating transaction ID (0 if none)
uint32_t t_cid; // Command ID within transaction
ItemPointerData t_ctid; // Current TID of this or newer version
// ... data follows
} HeapTupleHeader;
Transaction snapshot:
typedef struct Snapshot {
uint32_t xmin; // All xids < xmin are visible
uint32_t xmax; // All xids >= xmax are invisible
uint32_t *xip; // Array of in-progress xids
int xip_count;
} Snapshot;
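One way such a snapshot can be built, sketched under the assumption of a small global registry of running transactions (running_xids, num_running, and next_xid are placeholders for whatever bookkeeping your transaction manager keeps; they are not PostgreSQL names):
#include <stdint.h>
#include <stdlib.h>
#define MAX_BACKENDS 64
extern uint32_t running_xids[MAX_BACKENDS];   // xids currently in progress
extern int      num_running;
extern uint32_t next_xid;                     // next xid to be assigned
Snapshot *get_snapshot(void) {
    Snapshot *snap = malloc(sizeof(Snapshot));
    snap->xip = malloc(MAX_BACKENDS * sizeof(uint32_t));
    snap->xip_count = 0;
    snap->xmax = next_xid;                    // anything starting later is invisible
    snap->xmin = snap->xmax;
    for (int i = 0; i < num_running; i++) {
        uint32_t xid = running_xids[i];
        snap->xip[snap->xip_count++] = xid;   // in progress: invisible even if it commits later
        if (xid < snap->xmin)
            snap->xmin = xid;                 // oldest in-progress xid
    }
    return snap;
}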
Visibility check (simplified PostgreSQL logic):
bool tuple_is_visible(HeapTupleHeader *tuple, Snapshot *snapshot) {
// Tuple was inserted by...
if (tuple->t_xmin == my_transaction_id) {
// ...me, and not yet deleted by me
if (tuple->t_xmax == 0 || tuple->t_xmax != my_transaction_id)
return true;
return false; // I deleted it
}
// Inserted by committed transaction visible in my snapshot?
if (!xid_is_visible(tuple->t_xmin, snapshot))
return false;
// Not deleted, or deleted by transaction not visible to me?
if (tuple->t_xmax == 0)
return true;
if (!xid_is_visible(tuple->t_xmax, snapshot))
return true; // Deletion not visible yet
return false; // Deleted by visible transaction
}
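The xid_is_visible helper used above boils down to a check against the snapshot plus a commit-log lookup; a sketch (xid_is_committed stands in for your CLOG equivalent):
bool xid_is_visible(uint32_t xid, Snapshot *snapshot) {
    if (xid >= snapshot->xmax)
        return false;                         // started after the snapshot was taken
    for (int i = 0; i < snapshot->xip_count; i++)
        if (snapshot->xip[i] == xid)
            return false;                     // was still in progress at snapshot time
    return xid_is_committed(xid);             // aborted transactions are never visible
}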
UPDATE flow:
- Find the current tuple and set its t_xmax = current_xid
- Insert the new tuple with t_xmin = current_xid
- Update the old tuple's t_ctid to point to the new tuple
Learning milestones:
- Transactions see only committed data → You understand visibility
- Concurrent transactions don’t interfere → You understand isolation
- UPDATE creates new version → You understand why VACUUM is needed
- Snapshot sees consistent state → You understand snapshot isolation
Project 3: Build a Write-Ahead Log (WAL)
- File: LEARN_POSTGRESQL_DEEP_DIVE.md
- Main Programming Language: C
- Alternative Programming Languages: Rust, Go
- Coolness Level: Level 4: Hardcore Tech Flex
- Business Potential: 1. The “Resume Gold”
- Difficulty: Level 4: Expert
- Knowledge Area: Database Internals / Crash Recovery
- Software or Tool: WAL Implementation
- Main Book: “Database Internals” by Alex Petrov
What you’ll build: A Write-Ahead Log that records all changes before they hit data files. Implement crash recovery by replaying the log.
Why it teaches PostgreSQL: WAL is the foundation of PostgreSQL’s durability and replication. Understanding LSN, checkpoints, and recovery explains how PostgreSQL survives crashes and how streaming replication works.
Core challenges you’ll face:
- Log record format → maps to XLogRecord structure
- Write ordering (WAL before data) → maps to WAL protocol
- Checkpoints → maps to reducing recovery time
- Crash recovery → maps to redo from checkpoint
Key Concepts:
- WAL Internals: PostgreSQL Docs - WAL Internals
- Crash Recovery: “Database Internals” Chapter 9 - Alex Petrov
- ARIES Algorithm: The classic WAL algorithm (PostgreSQL is influenced by it)
Difficulty: Expert. Time estimate: 3 weeks. Prerequisites: Projects 1-2 completed.
Real world outcome:
# Insert some data
$ ./waldb insert "Record 1"
LSN 0/00000001: INSERT Record 1
Inserted at TID (0, 1)
$ ./waldb insert "Record 2"
LSN 0/00000028: INSERT Record 2
Inserted at TID (0, 2)
# Show WAL contents
$ ./waldb walinfo
WAL Segment: 000000010000000000000001
Record at 0/00000001: INSERT len=16 "Record 1"
Record at 0/00000028: INSERT len=16 "Record 2"
Record at 0/00000050: CHECKPOINT
# Simulate crash (kill without proper shutdown)
$ ./waldb insert "Record 3"
LSN 0/00000078: INSERT Record 3
$ kill -9 $(pidof waldb)
# Data file might be corrupted/incomplete, but WAL has the record
$ ./waldb recover
Recovering from checkpoint at LSN 0/00000050
Replaying: INSERT at 0/00000078
Recovery complete. 1 records replayed.
$ ./waldb select
TID(0,1): Record 1
TID(0,2): Record 2
TID(0,3): Record 3 # Recovered!
Implementation Hints:
WAL record structure:
typedef struct WALRecord {
uint32_t xl_tot_len; // Total length of record
uint32_t xl_xid; // Transaction ID
uint64_t xl_prev; // LSN of previous record
uint8_t xl_info; // Flag bits
uint8_t xl_rmid; // Resource manager ID (heap, btree, etc.)
// Followed by record-specific data
} WALRecord;
typedef struct HeapInsertRecord {
WALRecord header;
ItemPointerData target_tid;
uint16_t data_len;
// Followed by tuple data
} HeapInsertRecord;
Write-ahead protocol:
void insert_tuple(Page *page, Tuple *tuple) {
// 1. Generate WAL record FIRST
WALRecord *record = create_insert_record(page, tuple);
uint64_t lsn = wal_insert(record);
// 2. Modify page in buffer
page_insert_tuple(page, tuple);
// 3. Mark page with LSN (for checkpoint)
page->pd_lsn = lsn;
// Note: Page might not be written to disk yet!
// That's okay - WAL guarantees durability
}
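A sketch of wal_insert under a deliberately simplified model where the LSN is just a byte offset into one in-memory WAL buffer (wal_buffer, wal_flush, and the fixed 1 MB size are assumptions of this sketch; there is no segment switching or wraparound handling):
#include <stdint.h>
#include <string.h>
#include <unistd.h>
static char     wal_buffer[1 << 20];   // single in-memory WAL buffer
static uint64_t wal_write_pos = 0;     // next byte to write; doubles as the current LSN
static uint64_t wal_flushed_pos = 0;   // everything before this is safely on disk
static uint64_t wal_prev_lsn = 0;
// Appends a record and returns the LSN at which it starts.
uint64_t wal_insert(WALRecord *record) {
    uint64_t lsn = wal_write_pos;
    record->xl_prev = wal_prev_lsn;                     // back-chain to the previous record
    memcpy(wal_buffer + wal_write_pos, record, record->xl_tot_len);
    wal_write_pos += record->xl_tot_len;
    wal_prev_lsn = lsn;
    return lsn;
}
// Called at commit: the WAL must reach disk before the transaction is durable.
void wal_flush(int wal_fd) {
    write(wal_fd, wal_buffer + wal_flushed_pos, wal_write_pos - wal_flushed_pos);
    fsync(wal_fd);                                      // data pages may still be dirty in memory
    wal_flushed_pos = wal_write_pos;
}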
Checkpoint:
void checkpoint() {
// 1. Write checkpoint record to WAL
uint64_t redo_lsn = current_wal_lsn();
wal_insert(create_checkpoint_record(redo_lsn));
// 2. Flush all dirty pages to disk
for (each dirty_page in buffer_pool) {
write_page_to_disk(dirty_page);
}
// 3. Update control file with checkpoint location
update_control_file(redo_lsn);
}
Recovery:
void recover() {
// 1. Find last checkpoint from control file
uint64_t redo_lsn = read_control_file();
// 2. Replay all WAL records from that point
WALReader *reader = open_wal_at(redo_lsn);
WALRecord *record;
while ((record = wal_read_next(reader)) != NULL) {
replay_record(record); // Apply change to data page
}
}
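The replay_record step is where idempotence matters: a page whose pd_lsn is already at or past the record's LSN must not be modified again. A sketch for the heap-insert record type defined above (the record's starting LSN is passed in alongside it here, and RM_HEAP, read_page_for_tid, page_insert_at, and mark_page_dirty are illustrative helpers):
void replay_record(WALRecord *record, uint64_t lsn) {
    if (record->xl_rmid != RM_HEAP)
        return;                                   // this sketch only replays heap inserts
    HeapInsertRecord *rec = (HeapInsertRecord *)record;
    Page *page = read_page_for_tid(&rec->target_tid);
    if (page->pd_lsn >= lsn)
        return;                                   // page already contains this change
    char *tuple_data = (char *)rec + sizeof(HeapInsertRecord);
    page_insert_at(page, &rec->target_tid, tuple_data, rec->data_len);
    page->pd_lsn = lsn;                           // stamp the page so replay stays idempotent
    mark_page_dirty(page);
}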
Learning milestones:
- WAL records are written correctly → You understand the format
- Data survives simulated crash → You understand the protocol
- Recovery replays from checkpoint → You understand crash recovery
- Checkpoints limit recovery time → You understand the tradeoff
Project 4: Implement a B-tree Index
- File: LEARN_POSTGRESQL_DEEP_DIVE.md
- Main Programming Language: C
- Alternative Programming Languages: Rust, Go, C++
- Coolness Level: Level 4: Hardcore Tech Flex
- Business Potential: 1. The “Resume Gold”
- Difficulty: Level 4: Expert
- Knowledge Area: Data Structures / Indexing
- Software or Tool: B-tree Implementation
- Main Book: “Database Internals” by Alex Petrov
What you’ll build: A page-based B-tree index that stores keys and TIDs (pointers to heap tuples). Support insert, search, and range scans.
Why it teaches PostgreSQL: B-tree is PostgreSQL’s default index type. Understanding how keys map to TIDs, page splits, and leaf page linking explains index scans, index bloat, and REINDEX.
Core challenges you’ll face:
- Page-based B-tree structure → maps to internal vs leaf pages
- Page splits → maps to index growth, bloat
- TID storage → maps to non-clustered index design
- Range scans via leaf linking → maps to index-only scans
Key Concepts:
- B-tree Implementation: PostgreSQL Docs - B-tree Implementation
- B-tree Algorithms: “Database Internals” Chapters 2-4 - Alex Petrov
- PostgreSQL Indexes: PostgresPro - Indexes in PostgreSQL
Difficulty: Expert. Time estimate: 3-4 weeks. Prerequisites: Project 1 completed, B-tree theory.
Real world outcome:
# Create index on existing heap data
$ ./btreedb createindex users_name_idx users name
# Insert data (goes to heap, index updated)
$ ./btreedb insert users '(1, "Alice")'
Heap: TID (0, 1)
Index: "Alice" -> (0, 1)
$ ./btreedb insert users '(2, "Bob")'
$ ./btreedb insert users '(3, "Charlie")'
# Exact match lookup
$ ./btreedb lookup users_name_idx "Bob"
Found: "Bob" at TID (0, 2)
Fetching from heap: (2, "Bob")
# Range scan
$ ./btreedb range users_name_idx "A" "C"
"Alice" -> (0, 1) -> (1, "Alice")
"Bob" -> (0, 2) -> (2, "Bob")
# Show index structure
$ ./btreedb indexinfo users_name_idx
B-tree Index: users_name_idx
Height: 1
Root Page: 0
Leaf Pages: 1
Page 0 (LEAF):
[1] key="Alice" tid=(0,1)
[2] key="Bob" tid=(0,2)
[3] key="Charlie" tid=(0,3)
right_sibling=NULL
# Insert enough to cause page split
$ for i in {4..500}; do ./btreedb insert users "($i, \"User$i\")"; done
$ ./btreedb indexinfo users_name_idx
B-tree Index: users_name_idx
Height: 2
Root Page: 2 (INTERNAL)
Leaf Pages: 2
Page 2 (ROOT/INTERNAL):
[1] key="Alice" child_page=0
[2] key="User250" child_page=1
Page 0 (LEAF): [Alice..User249] right_sibling=1
Page 1 (LEAF): [User250..User99] right_sibling=NULL
Implementation Hints:
B-tree page types:
#define BT_LEAF 1
#define BT_INTERNAL 2
#define BT_ROOT 4
typedef struct BTPageOpaque {
uint32_t btpo_prev; // Left sibling (for range scans)
uint32_t btpo_next; // Right sibling
uint16_t btpo_flags; // LEAF, INTERNAL, ROOT
uint16_t btpo_level; // 0 for leaf, increases upward
} BTPageOpaque;
// Leaf tuple: key + TID
typedef struct BTLeafTuple {
ItemPointerData t_tid; // Points to heap tuple
// Followed by key data
} BTLeafTuple;
// Internal tuple: key + child page
typedef struct BTInternalTuple {
uint32_t child_page; // Points to child page
// Followed by key data (high key of child)
} BTInternalTuple;
Search algorithm:
ItemPointer btree_search(BTIndex *index, Datum key) {
Page *page = read_page(index, index->root_page);
// Descend to leaf
while (!is_leaf(page)) {
int child = find_child_for_key(page, key);
page = read_page(index, child);
}
// Binary search within leaf
int pos = binary_search(page, key);
if (pos >= 0) {
return get_tid_at(page, pos);
}
return NULL; // Not found
}
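The find_child_for_key step in the descent is a binary search over the internal page's separator keys; a sketch assuming the BTInternalTuple layout above plus a few illustrative accessors (get_internal_tuple, internal_tuple_key, compare_keys):
// Returns the child page to descend into: the child whose separator key
// is the greatest key less than or equal to the search key.
uint32_t find_child_for_key(Page *page, Datum key) {
    int lo = 0, hi = page_tuple_count(page) - 1, result = 0;
    while (lo <= hi) {
        int mid = (lo + hi) / 2;
        BTInternalTuple *tup = get_internal_tuple(page, mid);
        if (compare_keys(internal_tuple_key(tup), key) <= 0) {
            result = mid;         // candidate: separator <= key, try further right
            lo = mid + 1;
        } else {
            hi = mid - 1;
        }
    }
    return get_internal_tuple(page, result)->child_page;
}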
Insert with split:
void btree_insert(BTIndex *index, Datum key, ItemPointer tid) {
// Find leaf page
Page *leaf = find_leaf_for_key(index, key);
if (has_space(leaf, key)) {
insert_into_page(leaf, key, tid);
} else {
// Split!
Page *new_leaf = allocate_page();
Datum split_key = split_page(leaf, new_leaf, key, tid);
// Insert split_key into parent
btree_insert_parent(index, leaf, new_leaf, split_key);
}
}
Learning milestones:
- Search finds correct tuple → You understand B-tree traversal
- Page splits work correctly → You understand B-tree growth
- Range scans follow leaf links → You understand linked leaves
- Index stays balanced → You understand B-tree properties
Project 5: Build a VACUUM Process
- File: LEARN_POSTGRESQL_DEEP_DIVE.md
- Main Programming Language: C
- Alternative Programming Languages: Rust, Go
- Coolness Level: Level 3: Genuinely Clever
- Business Potential: 2. The “Micro-SaaS / Pro Tool”
- Difficulty: Level 3: Advanced
- Knowledge Area: Database Internals / Maintenance
- Software or Tool: VACUUM Implementation
- Main Book: “PostgreSQL 14 Internals” by Egor Rogov
What you’ll build: A VACUUM process that scans tables, identifies dead tuples (invisible to all transactions), and reclaims their space for reuse.
Why it teaches PostgreSQL: VACUUM is essential because PostgreSQL’s MVCC leaves dead tuples behind. Understanding VACUUM explains bloat, autovacuum tuning, and the famous “transaction ID wraparound” problem.
Core challenges you’ll face:
- Identifying dead tuples → maps to visibility rules, oldest active transaction
- Reclaiming space → maps to marking space free vs VACUUM FULL
- Updating visibility map → maps to optimization for future vacuums
- Transaction ID freezing → maps to wraparound prevention
Key Concepts:
- VACUUM Process: PostgreSQL Docs - VACUUM
- Dead Tuples: Dead Tuples and VACUUM
- Autovacuum: Tuning Autovacuum
Difficulty: Advanced. Time estimate: 2 weeks. Prerequisites: Projects 1-2 completed.
Real world outcome:
# Create table and do updates (creating dead tuples)
$ ./vacuumdb shell
> INSERT INTO users VALUES (1, 'Alice');
TID (0, 1), xmin=100
> UPDATE users SET name = 'Alicia' WHERE id = 1;
Old: TID (0, 1), xmax=101 [DEAD]
New: TID (0, 2), xmin=101
> UPDATE users SET name = 'ALICIA' WHERE id = 1;
Old: TID (0, 2), xmax=102 [DEAD]
New: TID (0, 3), xmin=102
# Check table stats
> \dt+ users
Table: users
Live tuples: 1
Dead tuples: 2
Table size: 8192 bytes (1 page)
# Run VACUUM
> VACUUM users;
Scanning page 0...
Tuple (0, 1): DEAD (xmax 101 < oldest_active 103) - marking for reuse
Tuple (0, 2): DEAD (xmax 102 < oldest_active 103) - marking for reuse
Tuple (0, 3): LIVE
Vacuum complete: 2 dead tuples reclaimed
> \dt+ users
Table: users
Live tuples: 1
Dead tuples: 0
Table size: 8192 bytes (1 page)
Free space: 168 bytes reclaimed
# New inserts reuse space
> INSERT INTO users VALUES (2, 'Bob');
TID (0, 1) # Reused!
Implementation Hints:
VACUUM algorithm:
void vacuum_table(Table *table) {
// 1. Get oldest active transaction (horizon)
uint32_t oldest_xid = get_oldest_active_xid();
// 2. Scan all pages
for (int pageno = 0; pageno < table->num_pages; pageno++) {
Page *page = read_page(table, pageno);
bool page_has_dead = false;
// Check each tuple
for (int i = 0; i < page_tuple_count(page); i++) {
HeapTuple *tuple = get_tuple(page, i);
if (tuple_is_dead(tuple, oldest_xid)) {
// Mark as dead - space can be reused
mark_dead(page, i);
page_has_dead = true;
stats.dead_tuples_removed++;
}
}
// 3. Compact page if needed (optional, defrag)
if (page_has_dead) {
compact_page(page);
}
// 4. Update visibility map
if (page_all_visible(page)) {
set_visibility_map_bit(table, pageno);
}
}
}
bool tuple_is_dead(HeapTuple *tuple, uint32_t oldest_xid) {
// Tuple is dead if:
// - It was deleted (xmax != 0)
// - The deleting transaction committed (xmax is committed)
// - No active transaction can see it (xmax < oldest_xid)
return tuple->t_xmax != 0 &&
xid_is_committed(tuple->t_xmax) &&
tuple->t_xmax < oldest_xid;
}
Visibility Map:
- 2 bits per page: “all-visible” and “all-frozen”
- “All-visible” pages can be skipped by VACUUM (no dead tuples possible)
- “All-frozen” pages can be skipped even for anti-wraparound vacuum
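A minimal sketch of that visibility map as a plain in-memory byte array, two flag bits per heap page (MAX_PAGES and the helper names are assumptions of this sketch; PostgreSQL stores the map in a separate fork of the relation):
#include <stdbool.h>
#include <stdint.h>
#define MAX_PAGES      65536
#define VM_ALL_VISIBLE 0x01
#define VM_ALL_FROZEN  0x02
static uint8_t vm_bits[MAX_PAGES];               // one byte per heap page, two bits used
void set_visibility_map_bit(Table *table, int pageno) {
    vm_bits[pageno] |= VM_ALL_VISIBLE;           // VACUUM found no dead tuples on the page
}
void clear_visibility_map_bit(Table *table, int pageno) {
    vm_bits[pageno] = 0;                         // any INSERT/UPDATE/DELETE must clear both bits
}
bool vacuum_can_skip_page(int pageno, bool antiwraparound) {
    // Regular VACUUM may skip all-visible pages; anti-wraparound
    // (freezing) VACUUM may only skip pages that are also all-frozen.
    return antiwraparound ? (vm_bits[pageno] & VM_ALL_FROZEN) != 0
                          : (vm_bits[pageno] & VM_ALL_VISIBLE) != 0;
}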
Learning milestones:
- Dead tuples are identified correctly → You understand visibility
- Space is reclaimed → You understand free space management
- Visibility map is updated → You understand optimization
- Transaction freezing works → You understand wraparound prevention
Project 6: Build a Query Parser
- File: LEARN_POSTGRESQL_DEEP_DIVE.md
- Main Programming Language: C (with Flex/Bison) or Python
- Alternative Programming Languages: Rust (nom), Go, Java (ANTLR)
- Coolness Level: Level 3: Genuinely Clever
- Business Potential: 2. The “Micro-SaaS / Pro Tool”
- Difficulty: Level 3: Advanced
- Knowledge Area: Compilers / Language Processing
- Software or Tool: SQL Parser
- Main Book: “Language Implementation Patterns” by Terence Parr
What you’ll build: A SQL parser that converts SQL text into an Abstract Syntax Tree (AST). Support SELECT, INSERT, UPDATE, DELETE with WHERE clauses.
Why it teaches PostgreSQL: Every query goes through the parser first. Understanding how SQL becomes a parse tree explains error messages, syntax variations, and the first step of query processing.
Core challenges you’ll face:
- Lexical analysis → maps to tokenizing SQL keywords, identifiers, literals
- Grammar definition → maps to SQL syntax rules
- AST construction → maps to parse tree structure
- Error reporting → maps to syntax error messages
Key Concepts:
- PostgreSQL Parser: PostgreSQL Source - gram.y
- SQL Grammar: SQL standard, PostgreSQL extensions
- Parser Generators: Flex/Bison, ANTLR
Difficulty: Advanced. Time estimate: 2 weeks. Prerequisites: Understanding of parsing theory, basic SQL knowledge.
Real world outcome:
$ ./sqlparse "SELECT name, age FROM users WHERE age > 21"
Parse Tree:
SelectStmt
├── targetList
│ ├── ResTarget(name="name")
│ └── ResTarget(name="age")
├── fromClause
│ └── RangeVar(relname="users")
└── whereClause
└── A_Expr(kind=AEXPR_OP)
├── name: ">"
├── lexpr: ColumnRef(name="age")
└── rexpr: A_Const(val=21)
$ ./sqlparse "INSERT INTO users (name, age) VALUES ('Alice', 30)"
Parse Tree:
InsertStmt
├── relation: RangeVar(relname="users")
├── cols
│ ├── ResTarget(name="name")
│ └── ResTarget(name="age")
└── selectStmt
└── SelectStmt(valuesLists)
└── List
├── A_Const(val='Alice')
└── A_Const(val=30)
$ ./sqlparse "SELEC * FROM users"
Error at line 1, column 1: syntax error, unexpected IDENT, expecting SELECT
Implementation Hints:
Token types:
typedef enum TokenType {
T_SELECT, T_INSERT, T_UPDATE, T_DELETE,
T_FROM, T_WHERE, T_INTO, T_VALUES, T_SET,
T_AND, T_OR, T_NOT,
T_EQ, T_NE, T_LT, T_GT, T_LE, T_GE,
T_LPAREN, T_RPAREN, T_COMMA, T_SEMICOLON,
T_IDENT, T_STRING, T_NUMBER,
T_EOF
} TokenType;
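A hand-rolled next_token sketch covering a few of those token types, to make the lexing step concrete (the Lexer struct and this keyword handling are illustrative; Flex generates the equivalent from regular-expression rules):
#include <ctype.h>
#include <strings.h>
typedef struct Lexer { const char *src; int pos; } Lexer;
TokenType next_token(Lexer *lx, char *text, int textcap) {
    while (isspace((unsigned char)lx->src[lx->pos])) lx->pos++;    // skip whitespace
    char c = lx->src[lx->pos];
    if (c == '\0') return T_EOF;
    if (c == ',') { lx->pos++; return T_COMMA; }
    if (c == '(') { lx->pos++; return T_LPAREN; }
    if (c == ')') { lx->pos++; return T_RPAREN; }
    if (isdigit((unsigned char)c)) {                               // numeric literal
        int i = 0;
        while (isdigit((unsigned char)lx->src[lx->pos]) && i < textcap - 1)
            text[i++] = lx->src[lx->pos++];
        text[i] = '\0';
        return T_NUMBER;
    }
    if (isalpha((unsigned char)c) || c == '_') {                   // keyword or identifier
        int i = 0;
        while ((isalnum((unsigned char)lx->src[lx->pos]) || lx->src[lx->pos] == '_')
               && i < textcap - 1)
            text[i++] = lx->src[lx->pos++];
        text[i] = '\0';
        if (strcasecmp(text, "SELECT") == 0) return T_SELECT;
        if (strcasecmp(text, "FROM")   == 0) return T_FROM;
        if (strcasecmp(text, "WHERE")  == 0) return T_WHERE;
        return T_IDENT;                                            // not a keyword
    }
    lx->pos++;                                                     // operators, string quotes, etc.
    return T_EOF;                                                  // a real lexer handles these too
}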
AST node types:
typedef struct Node {
NodeType type;
} Node;
typedef struct SelectStmt {
NodeType type; // T_SelectStmt
List *targetList; // What to select
List *fromClause; // Tables
Node *whereClause; // Conditions
List *groupClause; // GROUP BY
List *orderClause; // ORDER BY
Node *limitCount; // LIMIT
} SelectStmt;
typedef struct A_Expr {
NodeType type; // T_A_Expr
A_Expr_Kind kind; // AEXPR_OP, AEXPR_AND, AEXPR_OR
char *name; // Operator name
Node *lexpr; // Left operand
Node *rexpr; // Right operand
} A_Expr;
Using Bison (simplified):
select_stmt:
SELECT target_list FROM from_clause where_clause
{
SelectStmt *n = makeNode(SelectStmt);
n->targetList = $2;
n->fromClause = $4;
n->whereClause = $5;
$$ = (Node *)n;
}
;
where_clause:
/* empty */ { $$ = NULL; }
| WHERE a_expr { $$ = $2; }
;
Learning milestones:
- Simple SELECT parses → You understand basic grammar
- WHERE with operators works → You understand expression parsing
- All CRUD statements parse → You have complete coverage
- Error messages are helpful → You understand error recovery
Project 7: Build a Query Planner/Optimizer
- File: LEARN_POSTGRESQL_DEEP_DIVE.md
- Main Programming Language: C or Python
- Alternative Programming Languages: Rust, Go, Java
- Coolness Level: Level 4: Hardcore Tech Flex
- Business Potential: 3. The “Service & Support” Model
- Difficulty: Level 4: Expert
- Knowledge Area: Database Internals / Query Optimization
- Software or Tool: Query Planner
- Main Book: “Database Internals” by Alex Petrov
What you’ll build: A cost-based query planner that takes a parse tree, considers available indexes and table statistics, and produces an execution plan. Output EXPLAIN-style plans.
Why it teaches PostgreSQL: The planner decides whether to use an index scan or sequential scan, which join algorithm to use, and in what order to join tables. Understanding this is essential for query tuning.
Core challenges you’ll face:
- Plan enumeration → maps to generating alternative plans
- Cost estimation → maps to I/O cost, CPU cost
- Statistics usage → maps to pg_stats, row estimates
- Join ordering → maps to combinatorial explosion, GEQO
Key Concepts:
- PostgreSQL Planner: PostgreSQL Docs - Planner/Optimizer
- Cost Model: Bruce Momjian’s Optimizer PDF
- Join Algorithms: Nested Loop, Hash Join, Merge Join
Difficulty: Expert. Time estimate: 4 weeks. Prerequisites: Project 6 completed, statistics basics.
Real world outcome:
$ ./queryplan "SELECT * FROM users WHERE id = 42"
Table Statistics:
users: 100,000 rows, 1000 pages
users_pkey (btree on id): 300 pages, unique
Plan 1: Sequential Scan
Scan all 1000 pages, filter for id=42
Estimated cost: 1000.00 (I/O) + 1000.00 (CPU filter)
Estimated rows: 1
Plan 2: Index Scan using users_pkey
Descend B-tree (3 levels) + read 1 heap page
Estimated cost: 4.00 (index I/O) + 1.00 (heap I/O)
Estimated rows: 1
CHOSEN: Plan 2 (Index Scan) - Cost 5.00 vs 2000.00
EXPLAIN output:
Index Scan using users_pkey on users (cost=0.42..5.00 rows=1)
Index Cond: (id = 42)
$ ./queryplan "SELECT * FROM orders JOIN users ON orders.user_id = users.id"
Join Planning:
orders: 1,000,000 rows
users: 100,000 rows
Plan 1: Nested Loop (orders outer)
For each of 1M orders, index lookup on users
Cost: 1,000,000 * 5.00 = 5,000,000.00
Plan 2: Hash Join
Build hash on users (smaller): 100,000 * 0.01 = 1,000.00
Probe with orders: 1,000,000 * 0.001 = 1,000.00
Cost: 2,000.00
Plan 3: Merge Join
Sort orders: 1,000,000 * log(1M) * 0.01 = 200,000.00
Sort users: 100,000 * log(100K) * 0.01 = 17,000.00
Merge: 1,100,000 * 0.001 = 1,100.00
Cost: 218,100.00
CHOSEN: Plan 2 (Hash Join) - Cost 2,000.00
EXPLAIN output:
Hash Join (cost=2500.00..5000.00 rows=1000000)
Hash Cond: (orders.user_id = users.id)
-> Seq Scan on orders (cost=0.00..1500.00 rows=1000000)
-> Hash (cost=1000.00..1000.00 rows=100000)
-> Seq Scan on users (cost=0.00..1000.00 rows=100000)
Implementation Hints:
Cost model components:
typedef struct PlanCost {
double startup_cost; // Cost before first row
double total_cost; // Cost for all rows
double rows; // Estimated row count
double width; // Average row width
} PlanCost;
// Cost parameters (like PostgreSQL's)
#define SEQ_PAGE_COST 1.0
#define RANDOM_PAGE_COST 4.0
#define CPU_TUPLE_COST 0.01
#define CPU_INDEX_TUPLE_COST 0.005
#define CPU_OPERATOR_COST 0.0025
Sequential scan cost:
PlanCost cost_seqscan(Table *table, Qual *qual) {
PlanCost cost;
cost.startup_cost = 0;
cost.total_cost = table->pages * SEQ_PAGE_COST +
table->rows * CPU_TUPLE_COST;
cost.rows = estimate_rows(table, qual);
return cost;
}
Index scan cost:
PlanCost cost_indexscan(Index *index, Table *table, Qual *qual) {
PlanCost cost;
double selectivity = estimate_selectivity(qual, index);
double index_pages = index->pages * selectivity;
double heap_pages = table->pages * selectivity;
cost.startup_cost = 0;
cost.total_cost = index_pages * RANDOM_PAGE_COST + // Index I/O
heap_pages * RANDOM_PAGE_COST + // Heap I/O
(table->rows * selectivity) * CPU_INDEX_TUPLE_COST;
cost.rows = table->rows * selectivity;
return cost;
}
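The estimate_selectivity call above is where statistics enter the picture. A sketch using per-column statistics similar in spirit to what ANALYZE collects (ColumnStats, the Qual fields, and the fallback constant are assumptions of this sketch, and the signature differs slightly from the call above, which passes the index):
typedef struct ColumnStats {
    double n_distinct;     // estimated number of distinct values in the column
    double min_value;      // low end of the observed range (numeric columns)
    double max_value;      // high end of the observed range
} ColumnStats;
double estimate_selectivity(Qual *qual, ColumnStats *stats) {
    switch (qual->op) {
    case OP_EQ:
        // Equality: assume a uniform distribution over the distinct values.
        return 1.0 / stats->n_distinct;
    case OP_LT:
    case OP_GT: {
        // Range: linear interpolation of the constant within [min, max].
        double span = stats->max_value - stats->min_value;
        if (span <= 0)
            return 0.5;
        double frac = (qual->constant - stats->min_value) / span;
        if (frac < 0) frac = 0;
        if (frac > 1) frac = 1;
        return (qual->op == OP_LT) ? frac : 1.0 - frac;
    }
    default:
        return 0.33;       // coarse fallback when no better estimate is available
    }
}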
Join planning:
Plan *plan_join(Table *outer, Table *inner, JoinType type) {
Plan *best = NULL;
double best_cost = INFINITY;
// Try nested loop
Plan *nl = create_nested_loop(outer, inner);
if (nl->cost.total_cost < best_cost) {
best = nl;
best_cost = nl->cost.total_cost;
}
// Try hash join (if equijoin)
if (has_equality_condition(type)) {
Plan *hj = create_hash_join(outer, inner);
if (hj->cost.total_cost < best_cost) {
best = hj;
best_cost = hj->cost.total_cost;
}
}
// Try merge join (if sortable)
Plan *mj = create_merge_join(outer, inner);
if (mj->cost.total_cost < best_cost) {
best = mj;
best_cost = mj->cost.total_cost;
}
return best;
}
Learning milestones:
- Chooses index scan over seq scan appropriately → You understand cost modeling
- Join algorithm selection works → You understand join costs
- Estimates match reality (roughly) → You understand statistics
- EXPLAIN output matches PostgreSQL format → Full understanding
Project 8: Build a Query Executor
- File: LEARN_POSTGRESQL_DEEP_DIVE.md
- Main Programming Language: C or Python
- Alternative Programming Languages: Rust, Go
- Coolness Level: Level 3: Genuinely Clever
- Business Potential: 2. The “Micro-SaaS / Pro Tool”
- Difficulty: Level 3: Advanced
- Knowledge Area: Database Internals / Execution
- Software or Tool: Query Executor
- Main Book: “Database Internals” by Alex Petrov
What you’ll build: An executor that takes a query plan and produces results. Implement the iterator model (Volcano) with operators like SeqScan, IndexScan, Filter, Join, Sort, Aggregate.
Why it teaches PostgreSQL: The executor is where queries actually run. Understanding the iterator model explains EXPLAIN ANALYZE timing, memory usage, and why some queries are slow.
Core challenges you’ll face:
- Iterator model → maps to Open/Next/Close per operator
- Operator implementation → maps to scan, join, sort nodes
- Memory management → maps to work_mem, spilling to disk
- Pipeline vs blocking → maps to why Sort blocks, Filter doesn’t
Key Concepts:
- Volcano Iterator Model: Classic paper by Goetz Graefe
- PostgreSQL Executor: PostgreSQL Source - execMain.c
- Operator Types: Scan, Join, Sort, Aggregate, Limit
Difficulty: Advanced. Time estimate: 3 weeks. Prerequisites: Projects 1-7 helpful.
Real world outcome:
$ ./executor run "SELECT name FROM users WHERE age > 21 ORDER BY name LIMIT 10"
Execution Plan:
Limit (limit=10)
└── Sort (key=name)
└── Filter (age > 21)
└── SeqScan (users)
Execution Trace:
[Limit] Open()
[Sort] Open()
[Filter] Open()
[SeqScan] Open() - table users
[Filter] Next() -> ('Alice', 25) ✓ passes filter
[Filter] Next() -> ('Bob', 19) ✗ filtered out
[Filter] Next() -> ('Charlie', 30) ✓ passes filter
... (reading all rows into Sort)
[Sort] sorting 4,521 rows by name
[Sort] Next() -> ('Aaron', 22)
[Limit] Next() -> ('Aaron', 22)
Output: Aaron
[Limit] Next() -> ('Abby', 25)
Output: Abby
... (8 more)
[Limit] count=10, stopping
Results:
Aaron
Abby
Adam
...
(10 rows)
Execution Stats:
SeqScan: 10000 rows scanned, 50ms
Filter: 4521 rows passed (45.2%)
Sort: 4521 rows sorted, 120ms, 2MB memory
Limit: 10 rows returned
Total: 175ms
Implementation Hints:
Iterator interface:
typedef struct PlanState {
NodeType type;
Plan *plan;
struct PlanState *lefttree;
struct PlanState *righttree;
// Methods
void (*Open)(struct PlanState *);
TupleTableSlot *(*Next)(struct PlanState *);
void (*Close)(struct PlanState *);
} PlanState;
SeqScan implementation:
typedef struct SeqScanState {
PlanState ps;
HeapScanDesc scan;
int current_page;
int current_tuple;
} SeqScanState;
void SeqScan_Open(PlanState *ps) {
SeqScanState *ss = (SeqScanState *)ps;
ss->scan = heap_beginscan(ss->ps.plan->table);
ss->current_page = 0;
ss->current_tuple = 0;
}
TupleTableSlot *SeqScan_Next(PlanState *ps) {
SeqScanState *ss = (SeqScanState *)ps;
HeapTuple tuple = heap_getnext(ss->scan);
if (tuple == NULL)
return NULL;
return MakeTupleTableSlot(tuple);
}
Filter implementation:
TupleTableSlot *Filter_Next(PlanState *ps) {
FilterState *fs = (FilterState *)ps;
while (true) {
TupleTableSlot *slot = fs->ps.lefttree->Next(fs->ps.lefttree);
if (slot == NULL)
return NULL;
if (ExecQual(fs->qual, slot))
return slot; // Passes filter
// else continue to next tuple
}
}
Sort implementation (blocking):
void Sort_Open(PlanState *ps) {
SortState *ss = (SortState *)ps;
ss->ps.lefttree->Open(ss->ps.lefttree);
// Must read ALL input before producing output
ss->tuples = NIL;
TupleTableSlot *slot;
while ((slot = ss->ps.lefttree->Next(ss->ps.lefttree)) != NULL) {
ss->tuples = lappend(ss->tuples, copy_slot(slot));
}
// Sort in memory (or spill to disk if too large)
qsort(ss->tuples, ...);
ss->current = 0;
}
TupleTableSlot *Sort_Next(PlanState *ps) {
SortState *ss = (SortState *)ps;
if (ss->current >= list_length(ss->tuples))
return NULL;
return list_nth(ss->tuples, ss->current++);
}
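The join milestone below needs at least one join operator. A minimal nested-loop join Next() in the same iterator style (NestLoopState, ExecQualJoin, and MakeJoinedSlot are illustrative names following the pattern of the snippets above):
typedef struct NestLoopState {
    PlanState ps;                   // lefttree = outer input, righttree = inner input
    TupleTableSlot *outer_slot;     // current outer tuple, NULL before the first call
    Node *joinqual;                 // join condition, e.g. orders.user_id = users.id
} NestLoopState;
TupleTableSlot *NestLoop_Next(PlanState *ps) {
    NestLoopState *nl = (NestLoopState *)ps;
    for (;;) {
        if (nl->outer_slot == NULL) {
            // Advance the outer side and restart the inner side from the top.
            nl->outer_slot = nl->ps.lefttree->Next(nl->ps.lefttree);
            if (nl->outer_slot == NULL)
                return NULL;                          // outer exhausted: join is finished
            nl->ps.righttree->Close(nl->ps.righttree);
            nl->ps.righttree->Open(nl->ps.righttree); // rescan the inner input
        }
        TupleTableSlot *inner = nl->ps.righttree->Next(nl->ps.righttree);
        if (inner == NULL) {
            nl->outer_slot = NULL;                    // inner exhausted: next outer tuple
            continue;
        }
        if (ExecQualJoin(nl->joinqual, nl->outer_slot, inner))
            return MakeJoinedSlot(nl->outer_slot, inner);   // emit one matching pair
    }
}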
Learning milestones:
- SeqScan returns all rows → You understand basic iteration
- Filter works correctly → You understand predicate evaluation
- Join produces correct results → You understand join algorithms
- Sort handles large data → You understand blocking operators
Project 9: Implement Connection Pooling
- File: LEARN_POSTGRESQL_DEEP_DIVE.md
- Main Programming Language: Go
- Alternative Programming Languages: Rust, C, Python
- Coolness Level: Level 3: Genuinely Clever
- Business Potential: 3. The “Service & Support” Model
- Difficulty: Level 3: Advanced
- Knowledge Area: Networking / Database Administration
- Software or Tool: Connection Pooler (like PgBouncer)
- Main Book: “High Performance PostgreSQL” by Gregory Smith
What you’ll build: A connection pooler that sits between clients and PostgreSQL, reusing connections to reduce overhead. Support session, transaction, and statement pooling modes.
Why it teaches PostgreSQL: PostgreSQL’s process-per-connection model is expensive. Understanding connection pooling explains why PgBouncer exists, connection limits, and performance under high concurrency.
Core challenges you’ll face:
- PostgreSQL wire protocol → maps to message parsing
- Connection state management → maps to session vs transaction pooling
- Authentication forwarding → maps to md5, scram-sha-256
- Transaction boundaries → maps to detecting BEGIN/COMMIT
Key Concepts:
- PostgreSQL Protocol: PostgreSQL Docs - Frontend/Backend Protocol
- PgBouncer Internals: PgBouncer Source
- Connection Pooling Modes: Session, Transaction, Statement
Difficulty: Advanced. Time estimate: 2-3 weeks. Prerequisites: Network programming, understanding of SQL transactions.
Real world outcome:
# Start pooler
$ ./pgpool --listen=5433 --backend=localhost:5432 --pool-size=10 --mode=transaction
Pooler started on :5433
Backend: localhost:5432
Pool size: 10 connections
Mode: transaction
# Connect via pooler
$ psql -p 5433 -U myuser mydb
psql (15.0)
Type "help" for help.
mydb=> SELECT 1;
?column?
----------
1
(1 row)
# In another terminal, show pool status
$ ./pgpool status
Active connections: 1
Idle connections: 9
Waiting clients: 0
Client 1 (192.168.1.10:54321):
State: idle
Backend: conn#3
Queries: 5
Transaction: none
# Simulate load
$ pgbench -p 5433 -c 100 -j 10 -T 60 mydb
Pool Status During Load:
Active connections: 10 (max)
Waiting clients: 90
Queries/sec: 15,234
Avg wait time: 0.3ms
Implementation Hints:
PostgreSQL message format:
type Message struct {
Type byte // 'Q' for Query, 'X' for Terminate, etc.
Length int32 // Length including self
Data []byte
}
// Read a message from connection
func ReadMessage(conn net.Conn) (*Message, error) {
header := make([]byte, 5) // 1-byte message type + 4-byte length
if _, err := io.ReadFull(conn, header); err != nil { // io.ReadFull avoids short reads
return nil, err
}
msg := &Message{
Type: header[0],
Length: int32(binary.BigEndian.Uint32(header[1:5])),
}
msg.Data = make([]byte, msg.Length-4) // length includes itself but not the type byte
if _, err := io.ReadFull(conn, msg.Data); err != nil {
return nil, err
}
return msg, nil
}
Connection pool:
type Pool struct {
backend string
size int
mode PoolMode // SESSION, TRANSACTION, STATEMENT
idle chan *BackendConn
mu sync.Mutex
}
func (p *Pool) Acquire() *BackendConn {
select {
case conn := <-p.idle:
return conn
default:
// Wait or create new if under limit
return <-p.idle
}
}
func (p *Pool) Release(conn *BackendConn) {
if p.mode == TRANSACTION && conn.inTransaction {
// Can't reuse - connection has uncommitted transaction
conn.Query("ROLLBACK")
}
// Reset session state if needed
conn.Query("DISCARD ALL")
p.idle <- conn
}
Transaction detection:
func (c *ClientHandler) HandleQuery(query string) {
upperQuery := strings.ToUpper(strings.TrimSpace(query))
if strings.HasPrefix(upperQuery, "BEGIN") {
c.inTransaction = true
} else if strings.HasPrefix(upperQuery, "COMMIT") ||
strings.HasPrefix(upperQuery, "ROLLBACK") {
c.inTransaction = false
if c.pool.mode == TRANSACTION {
c.releaseBackend() // Return to pool
}
}
// Forward to backend
c.backend.Send(query)
response := c.backend.ReadResponse()
c.client.Send(response)
}
Learning milestones:
- Queries route through pooler → You understand the protocol
- Connection reuse works → You understand pooling benefit
- Transaction mode respects boundaries → You understand state management
- High concurrency is handled → You understand the scalability gain
Project 10: Implement Streaming Replication
- File: LEARN_POSTGRESQL_DEEP_DIVE.md
- Main Programming Language: Go or C
- Alternative Programming Languages: Rust, Python
- Coolness Level: Level 4: Hardcore Tech Flex
- Business Potential: 4. The “Open Core” Infrastructure
- Difficulty: Level 4: Expert
- Knowledge Area: Database Internals / Replication
- Software or Tool: Streaming Replication
- Main Book: “PostgreSQL 14 Internals” by Egor Rogov
What you’ll build: A streaming replication system where a primary sends WAL records to a standby in real-time. The standby applies them to stay synchronized.
Why it teaches PostgreSQL: Streaming replication is how PostgreSQL achieves high availability. Understanding WAL shipping, LSN, and replay explains hot standby, synchronous commits, and failover.
Core challenges you’ll face:
- Replication protocol → maps to START_REPLICATION command
- WAL streaming → maps to continuous WAL shipping
- LSN tracking → maps to knowing standby position
- Synchronous vs async → maps to durability vs performance
Key Concepts:
- Streaming Replication Protocol: PostgreSQL Docs - Protocol Replication
- WAL Shipping: PostgreSQL Docs - Log Shipping
- Logical vs Physical: Comparison of Solutions
Difficulty: Expert. Time estimate: 3-4 weeks. Prerequisites: Project 3 (WAL) completed.
Real world outcome:
# Start primary
$ ./repldb primary --port=5432 --wal-dir=/data/wal
# Start standby (connects to primary)
$ ./repldb standby --port=5433 --primary=localhost:5432
Standby connecting to primary...
Replication started from LSN 0/1000000
Streaming WAL...
# On primary: insert data
$ ./repldb query --port=5432 "INSERT INTO users VALUES (1, 'Alice')"
Inserted. WAL LSN: 0/1000100
# Standby receives it immediately
Standby log:
Received WAL record at 0/1000100: INSERT
Applied to local storage
Replay LSN: 0/1000100
# Query standby
$ ./repldb query --port=5433 "SELECT * FROM users"
(1, 'Alice')
# Show replication status
$ ./repldb repl-status --port=5432
Primary LSN: 0/1000200
Standby 1:
Address: localhost:5433
State: streaming
Sent LSN: 0/1000200
Write LSN: 0/1000200
Flush LSN: 0/1000180
Replay LSN: 0/1000180
Lag: 32 bytes
# Failover test: kill primary
$ kill $(pidof repldb-primary)
$ ./repldb promote --port=5433
Standby promoted to primary!
Now accepting writes on port 5433.
Implementation Hints:
Replication protocol flow:
Primary Standby
| |
|<------ IDENTIFY_SYSTEM --------|
|------- system_id, timeline --->|
| |
|<-- START_REPLICATION lsn ------|
| |
|-------- WAL data ------------->|
|-------- WAL data ------------->|
|<------ Standby status ---------|
|-------- WAL data ------------->|
...
WAL sender (primary):
func (s *WALSender) Stream(startLSN uint64) {
reader := OpenWAL(startLSN)
for {
// Read next WAL record
record, lsn := reader.Next()
// Send to standby
msg := &WALDataMessage{
StartLSN: lsn,
EndLSN: lsn + uint64(len(record)),
Data: record,
}
s.conn.Send(msg)
// Check for standby feedback
if feedback := s.conn.TryRecv(); feedback != nil {
s.updateStandbyStatus(feedback)
}
// If caught up, wait for new WAL
if reader.AtEnd() {
s.waitForNewWAL()
}
}
}
WAL receiver (standby):
func (r *WALReceiver) Receive() {
// Connect to primary
r.conn = Connect(r.primaryAddr)
r.conn.Send(&StartReplication{LSN: r.lastAppliedLSN})
for {
msg := r.conn.Recv()
switch m := msg.(type) {
case *WALDataMessage:
// Write to local WAL file
r.walWriter.Write(m.Data)
r.writeLSN = m.EndLSN
// Apply to local database
r.applyWALRecord(m.Data)
r.replayLSN = m.EndLSN
// Send feedback periodically
if time.Since(r.lastFeedback) > 10*time.Second {
r.sendFeedback()
}
}
}
}
Learning milestones:
- WAL streams continuously → You understand the protocol
- Standby stays synchronized → You understand replay
- Failover works → You understand promotion
- Lag is tracked → You understand monitoring
Project 11: Build a Full-Text Search Engine
- File: LEARN_POSTGRESQL_DEEP_DIVE.md
- Main Programming Language: Python or C
- Alternative Programming Languages: Rust, Go
- Coolness Level: Level 3: Genuinely Clever
- Business Potential: 3. The “Service & Support” Model
- Difficulty: Level 3: Advanced
- Knowledge Area: Information Retrieval / Indexing
- Software or Tool: Full-Text Search (like PostgreSQL tsvector/tsquery)
- Main Book: “Introduction to Information Retrieval” by Manning, Raghavan, Schütze
What you’ll build: A full-text search system with tokenization, stemming, stop words, and inverted index (GIN-style). Support queries with AND, OR, NOT, and phrase matching.
Why it teaches PostgreSQL: PostgreSQL’s tsvector/tsquery and GIN indexes are powerful but often misunderstood. Building your own explains when to use FTS vs LIKE, and how GIN indexes work.
Core challenges you’ll face:
- Text normalization → maps to tsvector creation
- Inverted index structure → maps to GIN index
- Query parsing → maps to tsquery operators
- Ranking → maps to ts_rank, relevance
Key Concepts:
- PostgreSQL Full Text Search: PostgreSQL Docs - Full Text Search
- GIN Index: PostgreSQL Docs - GIN Indexes
- Inverted Index: “Introduction to Information Retrieval” Chapter 1
Difficulty: Advanced. Time estimate: 2-3 weeks. Prerequisites: Basic understanding of text processing.
Real world outcome:
# Index some documents
$ ./ftsearch index doc1 "PostgreSQL is an advanced open source database"
Document 1 indexed. Tokens: [postgresql, advanc, open, sourc, databas]
$ ./ftsearch index doc2 "MySQL is also an open source database system"
Document 2 indexed. Tokens: [mysql, also, open, sourc, databas, system]
$ ./ftsearch index doc3 "Oracle is a commercial database"
Document 3 indexed. Tokens: [oracl, commerci, databas]
# Simple search
$ ./ftsearch query "open source"
Results (ranked by relevance):
1. doc1 (score: 0.85) - "PostgreSQL is an advanced open source database"
2. doc2 (score: 0.80) - "MySQL is also an open source database system"
# Boolean queries
$ ./ftsearch query "database & !open"
Results:
1. doc3 (score: 0.70) - "Oracle is a commercial database"
$ ./ftsearch query "postgresql | mysql"
Results:
1. doc1 (score: 0.90) - "PostgreSQL is an advanced..."
2. doc2 (score: 0.90) - "MySQL is also an open..."
# Show inverted index
$ ./ftsearch indexinfo
Inverted Index:
"databas" -> [doc1, doc2, doc3]
"open" -> [doc1, doc2]
"sourc" -> [doc1, doc2]
"postgresql" -> [doc1]
"mysql" -> [doc2]
...
Implementation Hints:
Text processing pipeline:
def normalize(text):
# 1. Lowercase
text = text.lower()
# 2. Tokenize (split on non-alphanumeric)
tokens = re.findall(r'\w+', text)
# 3. Remove stop words
tokens = [t for t in tokens if t not in STOP_WORDS]
# 4. Stem (Porter stemmer)
tokens = [stem(t) for t in tokens]
return tokens
# "PostgreSQL is an advanced open source database"
# -> ['postgresql', 'advanc', 'open', 'sourc', 'databas']
Inverted index (GIN-style):
class InvertedIndex:
def __init__(self):
self.index = {} # token -> [(doc_id, positions), ...]
def add_document(self, doc_id, text):
tokens = normalize(text)
for pos, token in enumerate(tokens):
if token not in self.index:
self.index[token] = []
self.index[token].append((doc_id, pos))
def search(self, token):
return self.index.get(stem(token.lower()), [])
def search_and(self, tokens):
results = None
for token in tokens:
docs = set(doc_id for doc_id, pos in self.search(token))
if results is None:
results = docs
else:
results &= docs
return results
Ranking (TF-IDF style):
def rank(doc_id, query_tokens, index, total_docs):
score = 0
for token in query_tokens:
postings = index.search(token)
if not postings:
continue
# TF: how often token appears in doc
tf = sum(1 for d, p in postings if d == doc_id)
# IDF: how rare is token across all docs
df = len(set(d for d, p in postings))
idf = math.log(total_docs / df)
score += tf * idf
return score
Learning milestones:
- Tokenization works → You understand text normalization
- Searches return correct documents → You understand inverted index
- Boolean queries work → You understand set operations
- Ranking makes sense → You understand relevance scoring
Project 12: Build a JSON Document Store
- File: LEARN_POSTGRESQL_DEEP_DIVE.md
- Main Programming Language: Python or Go
- Alternative Programming Languages: Rust, C
- Coolness Level: Level 3: Genuinely Clever
- Business Potential: 3. The “Service & Support” Model
- Difficulty: Level 3: Advanced
- Knowledge Area: Document Databases / JSON
- Software or Tool: JSONB Implementation
- Main Book: “Designing Data-Intensive Applications” by Martin Kleppmann
What you’ll build: A document store with JSONB-style binary storage, path queries ($.address.city), and GIN indexing for fast lookups.
Why it teaches PostgreSQL: PostgreSQL’s JSONB is powerful but different from MongoDB. Understanding binary storage, path operators, and GIN indexing explains when to use JSONB vs normalized tables.
Core challenges you’ll face:
- Binary JSON format → maps to JSONB vs JSON
- Path queries → maps to jsonb_path_query, @>
- GIN indexing on JSONB → maps to jsonb_path_ops, jsonb_ops
- Efficient updates → maps to jsonb_set, why updates can be costly
Key Concepts:
- PostgreSQL JSONB: PostgreSQL Docs - JSON Types
- JSON Path: PostgreSQL Docs - JSON Path
- JSONB Indexing: PostgreSQL Docs - JSONB Indexing
Difficulty: Advanced. Time estimate: 2 weeks. Prerequisites: Understanding of JSON, Project 4 (B-tree) helpful.
Real world outcome:
# Insert documents
$ ./jsondb insert users '{"name": "Alice", "age": 30, "address": {"city": "NYC"}}'
Inserted doc_id=1
$ ./jsondb insert users '{"name": "Bob", "age": 25, "address": {"city": "LA"}}'
Inserted doc_id=2
# Path query
$ ./jsondb query users '$.address.city'
doc_id=1: "NYC"
doc_id=2: "LA"
# Filter query
$ ./jsondb find users 'age > 25'
{"id": 1, "name": "Alice", "age": 30, "address": {"city": "NYC"}}
# Containment query (like @>)
$ ./jsondb find users --contains '{"address": {"city": "NYC"}}'
{"id": 1, "name": "Alice", ...}
# Create GIN index
$ ./jsondb createindex users_data_idx users --gin
# Now containment queries use index
$ ./jsondb explain find users --contains '{"name": "Alice"}'
Index Scan using users_data_idx
Index Cond: (data @> '{"name": "Alice"}'::jsonb)
# Update nested path
$ ./jsondb update users 1 --set '$.address.zip' '"10001"'
Updated. New value:
{"name": "Alice", "age": 30, "address": {"city": "NYC", "zip": "10001"}}
Implementation Hints:
JSONB binary format (simplified):
class JsonbValue:
OBJECT = 1
ARRAY = 2
STRING = 3
NUMBER = 4
BOOL = 5
NULL = 6
def to_jsonb(value):
"""Convert Python value to binary JSONB format."""
if isinstance(value, dict):
# Object: type + count + (key_offset, value_offset)* + keys + values
entries = []
for k, v in sorted(value.items()): # Keys sorted!
entries.append((k, to_jsonb(v)))
return encode_object(entries)
elif isinstance(value, list):
# Array: type + count + (value_offset)* + values
return encode_array([to_jsonb(v) for v in value])
elif isinstance(value, str):
return bytes([JsonbValue.STRING]) + encode_string(value)
# ... etc
GIN indexing for JSONB:
class JsonbGinIndex:
def __init__(self, ops='jsonb_path_ops'):
self.ops = ops
self.index = {} # key -> set of doc_ids
def index_document(self, doc_id, jsonb):
if self.ops == 'jsonb_path_ops':
# Index paths: hash of path to value
for path, value in extract_paths(jsonb):
key = hash(path + str(value))
self.index.setdefault(key, set()).add(doc_id)
else: # jsonb_ops
# Index all keys and values separately
for key in extract_all_keys(jsonb):
self.index.setdefault(('key', key), set()).add(doc_id)
for value in extract_all_values(jsonb):
self.index.setdefault(('val', value), set()).add(doc_id)
def search_contains(self, pattern):
# @> operator: find docs containing pattern
required_keys = set()
for path, value in extract_paths(pattern):
key = hash(path + str(value))
required_keys.add(key)
result = None
for key in required_keys:
docs = self.index.get(key, set())
if result is None:
result = docs
else:
result &= docs
return result or set()
Path query evaluation:
def evaluate_path(doc, path):
"""Evaluate JSONPath like '$.address.city' on document."""
parts = parse_path(path) # ['$', 'address', 'city']
current = doc
for part in parts[1:]: # Skip '$'
if part.isdigit():
current = current[int(part)] # Array index
else:
current = current.get(part) # Object key
if current is None:
return None
return current
Learning milestones:
- Documents store and retrieve → You understand JSONB storage
- Path queries work → You understand navigation
- GIN index speeds up containment → You understand JSONB indexing
- Updates work correctly → You understand immutability tradeoffs
Project Comparison Table
| Project | Difficulty | Time | Depth of Understanding | Fun Factor |
|---|---|---|---|---|
| 1. Page-Based Storage | Advanced | 2 weeks | ⭐⭐⭐ | ⭐⭐⭐ |
| 2. MVCC Implementation | Expert | 3-4 weeks | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ |
| 3. Write-Ahead Log | Expert | 3 weeks | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐ |
| 4. B-tree Index | Expert | 3-4 weeks | ⭐⭐⭐⭐ | ⭐⭐⭐⭐ |
| 5. VACUUM Process | Advanced | 2 weeks | ⭐⭐⭐⭐ | ⭐⭐⭐ |
| 6. Query Parser | Advanced | 2 weeks | ⭐⭐⭐ | ⭐⭐⭐ |
| 7. Query Planner | Expert | 4 weeks | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ |
| 8. Query Executor | Advanced | 3 weeks | ⭐⭐⭐⭐ | ⭐⭐⭐⭐ |
| 9. Connection Pooling | Advanced | 2-3 weeks | ⭐⭐⭐ | ⭐⭐⭐ |
| 10. Streaming Replication | Expert | 3-4 weeks | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ |
| 11. Full-Text Search | Advanced | 2-3 weeks | ⭐⭐⭐ | ⭐⭐⭐⭐ |
| 12. JSON Document Store | Advanced | 2 weeks | ⭐⭐⭐ | ⭐⭐⭐⭐ |
Recommended Learning Path
Core Track (The Heart of PostgreSQL)
Project 1 (Pages) → Project 2 (MVCC) → Project 3 (WAL) → Project 5 (VACUUM)
This path teaches you the storage engine fundamentals that make PostgreSQL unique.
Query Processing Track
Project 6 (Parser) → Project 7 (Planner) → Project 8 (Executor)
This path teaches you how SQL becomes results.
Advanced Features Track
Project 4 (B-tree) → Project 11 (FTS) → Project 12 (JSONB)
This path teaches you PostgreSQL’s powerful indexing and data types.
Production Track
Project 9 (Pooling) → Project 10 (Replication)
This path teaches you how to run PostgreSQL at scale.
Final Capstone: Build MiniPostgres
- File: LEARN_POSTGRESQL_DEEP_DIVE.md
- Main Programming Language: C or Rust
- Alternative Programming Languages: Go
- Coolness Level: Level 5: Pure Magic
- Business Potential: 5. The “Industry Disruptor”
- Difficulty: Level 5: Master
- Knowledge Area: Database Internals / Full Stack
- Software or Tool: Complete RDBMS
- Main Book: All previous books combined
What you’ll build: Combine all projects into a working mini-PostgreSQL that accepts SQL connections, parses queries, uses MVCC for concurrency, stores data in pages, uses WAL for durability, and supports B-tree indexes.
Why it teaches mastery: This is the ultimate integration. You’ll understand how every piece fits together, why PostgreSQL makes its design choices, and what tradeoffs exist.
Real world outcome:
$ ./minipostgres start
MiniPostgres Server v0.1
Listening on port 5432
Storage: page-based, 8KB pages
Concurrency: MVCC (xmin/xmax)
Durability: WAL with checkpoints
Indexes: B-tree
$ psql -h localhost -p 5432
minipostgres=> CREATE TABLE users (id INT, name TEXT);
CREATE TABLE
minipostgres=> INSERT INTO users VALUES (1, 'Alice'), (2, 'Bob');
INSERT 0 2
minipostgres=> SELECT * FROM users WHERE id = 1;
id | name
----+-------
1 | Alice
(1 row)
minipostgres=> CREATE INDEX users_id_idx ON users (id);
CREATE INDEX
minipostgres=> EXPLAIN SELECT * FROM users WHERE id = 1;
Index Scan using users_id_idx on users (cost=0.00..1.00 rows=1)
Index Cond: (id = 1)
minipostgres=> BEGIN;
BEGIN
minipostgres=> UPDATE users SET name = 'Alicia' WHERE id = 1;
UPDATE 1
-- In another session, still sees 'Alice' (MVCC!)
minipostgres=> COMMIT;
COMMIT
minipostgres=> VACUUM users;
VACUUM (removed 1 dead tuple)
Summary
| # | Project | Main Language |
|---|---|---|
| 1 | Build a Page-Based Storage Engine | C |
| 2 | Implement MVCC | C |
| 3 | Build a Write-Ahead Log (WAL) | C |
| 4 | Implement a B-tree Index | C |
| 5 | Build a VACUUM Process | C |
| 6 | Build a Query Parser | C (Flex/Bison) |
| 7 | Build a Query Planner/Optimizer | C |
| 8 | Build a Query Executor | C |
| 9 | Implement Connection Pooling | Go |
| 10 | Implement Streaming Replication | Go |
| 11 | Build a Full-Text Search Engine | Python |
| 12 | Build a JSON Document Store | Python |
| Capstone | Build MiniPostgres | C/Rust |
Key Resources
Books
- “Database Internals” by Alex Petrov - Best book on storage engine internals
- “Designing Data-Intensive Applications” by Martin Kleppmann - Context for all database decisions
- “PostgreSQL 14 Internals” by Egor Rogov - Free, PostgreSQL-specific deep dive
- “The Internals of PostgreSQL” by Hironobu Suzuki - Free online
Official Documentation
- PostgreSQL Docs - Comprehensive and excellent
- PostgreSQL Source Code - The ultimate reference
Articles & Blogs
- PostgresPro Blog - Excellent internals series
- Bruce Momjian’s Writings - From a core contributor
- How PostgreSQL Works - Architecture overview
Talks
- CMU Database Group - Academic database courses
- PGCon Videos - PostgreSQL conference talks
Comparison Resources
- MySQL vs PostgreSQL Storage - Architecture comparison
- PostgreSQL vs Oracle - Enterprise comparison
“To truly understand PostgreSQL, you must build PostgreSQL. Only then will you know why every line of its code exists.”