LEARN MYSQL DEEP DIVE
Learn MySQL: From Zero to Mastery
Goal: Deeply understand MySQL—from basic queries to storage engine internals, index structures, query optimization, transaction processing, and replication. Understand why MySQL makes the design choices it does and how it differs from PostgreSQL, SQLite, and other databases.
Why MySQL Matters
MySQL is one of the most widely deployed databases in the world, powering everything from WordPress blogs to Facebook’s infrastructure. Understanding MySQL deeply teaches you:
- Relational database fundamentals: ACID, normalization, SQL
- Storage engine architecture: How data is physically organized on disk
- Indexing internals: B+trees, clustered vs secondary indexes
- Query optimization: How the optimizer chooses execution plans
- Concurrency control: MVCC, locking, isolation levels
- Distributed systems: Replication, sharding, high availability
MySQL vs Other Databases: The Big Picture
Before diving in, let’s understand where MySQL fits:
| Feature | MySQL/InnoDB | PostgreSQL | SQLite | MariaDB |
|---|---|---|---|---|
| Architecture | Thread-per-connection | Process-per-connection | Embedded (in-process) | Thread-per-connection |
| Storage Engine | Pluggable (InnoDB default) | Single (heap + WAL) | Single (B-tree) | Pluggable (Aria, InnoDB) |
| Replication | Logical (binlog) | Physical (WAL) + Logical | None (single-file) | Logical (binlog) |
| MVCC | Undo logs + locks | Tuple versioning | WAL-mode | Undo logs + locks |
| Default Isolation | REPEATABLE READ | READ COMMITTED | SERIALIZABLE | REPEATABLE READ |
| JSON Support | Good | Excellent (JSONB) | Basic | Good |
| Full-Text Search | Basic | Excellent (GIN/GiST) | FTS5 extension | Basic |
| Geospatial | Basic | Excellent (PostGIS) | R*Tree | Basic |
Key Philosophical Differences
MySQL’s Philosophy:
- Simplicity and speed for common operations
- Pluggable storage engines for different workloads
- Optimized for read-heavy OLTP workloads
- “Good enough” defaults that work for most cases
PostgreSQL’s Philosophy:
- Correctness and standards compliance first
- Rich feature set (arrays, custom types, advanced indexing)
- Better for complex queries and analytics
- More sophisticated query planner
SQLite’s Philosophy:
- Embedded, zero-configuration
- Single-file database
- Perfect for local storage, testing, and embedded/edge deployments
- ACID-compliant despite simplicity
MySQL Architecture Overview
┌─────────────────────────────────────────────────────────────────────┐
│ MySQL Server │
├─────────────────────────────────────────────────────────────────────┤
│ │
│ ┌──────────────────────────────────────────────────────────────┐ │
│ │ Connection Layer │ │
│ │ ┌─────────┐ ┌─────────┐ ┌─────────┐ ┌─────────┐ │ │
│ │ │Thread 1 │ │Thread 2 │ │Thread 3 │ │Thread N │ │ │
│ │ │ (Conn) │ │ (Conn) │ │ (Conn) │ │ (Conn) │ │ │
│ │ └─────────┘ └─────────┘ └─────────┘ └─────────┘ │ │
│ └──────────────────────────────────────────────────────────────┘ │
│ │ │
│ ┌──────────────────────────▼───────────────────────────────────┐ │
│ │ SQL Layer │ │
│ │ ┌─────────────┐ ┌─────────────┐ ┌─────────────────────────┐ │ │
│ │ │ Parser │ │ Optimizer │ │ Query Executor │ │ │
│ │ │ (Lexer + │ │ (Cost-based │ │ (Iterates over rows, │ │ │
│ │ │ Parser) │ │ planning) │ │ applies operations) │ │ │
│ │ └─────────────┘ └─────────────┘ └─────────────────────────┘ │ │
│ │ ┌─────────────┐ ┌─────────────┐ ┌─────────────────────────┐ │ │
│ │ │Query Cache │ │ Metadata │ │ Table Definition │ │ │
│ │ │ (removed │ │ Cache │ │ Cache │ │ │
│ │ │ in 8.0) │ │ │ │ │ │ │
│ │ └─────────────┘ └─────────────┘ └─────────────────────────┘ │ │
│ └──────────────────────────────────────────────────────────────┘ │
│ │ │
│ ┌──────────────────────────▼───────────────────────────────────┐ │
│ │ Storage Engine API │ │
│ │ (Handler interface - pluggable engines) │ │
│ └──────────────────────────────────────────────────────────────┘ │
│ │ │ │ │
│ ┌──────▼──────┐ ┌───────▼───────┐ ┌───────▼───────┐ │
│ │ InnoDB │ │ MyISAM │ │ Memory │ │
│ │ (Default) │ │ (Legacy) │ │ (Temp tables)│ │
│ │ │ │ │ │ │ │
│ │ ┌─────────┐ │ │ ┌───────────┐ │ │ ┌───────────┐ │ │
│ │ │ Buffer │ │ │ │Key Buffer │ │ │ │ Hash or │ │ │
│ │ │ Pool │ │ │ │ (Index │ │ │ │ B-Tree │ │ │
│ │ │ │ │ │ │ Cache) │ │ │ │ Indexes │ │ │
│ │ └─────────┘ │ │ └───────────┘ │ │ └───────────┘ │ │
│ │ ┌─────────┐ │ │ │ │ │ │
│ │ │Redo Log │ │ │ No ACID, no │ │ In-memory │ │
│ │ │Undo Log │ │ │ transactions │ │ only │ │
│ │ └─────────┘ │ │ │ │ │ │
│ └─────────────┘ └───────────────┘ └───────────────┘ │
│ │ │ │ │
└─────────┼────────────────────┼────────────────────┼──────────────────┘
│ │ │
▼ ▼ ▼
┌─────────┐ ┌─────────┐ (Memory)
│.ibd file│ │.MYD/.MYI│
│(data + │ │(data + │
│ index) │ │ index) │
└─────────┘ └─────────┘
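The layers in the diagram above are easy to poke at from any client session. A minimal inspection sketch, assuming a local server on localhost:3306, the mysql-connector-python package, and placeholder credentials you should replace:
# Inspect the pluggable-engine layer and a few server-level settings.
# Assumes a local MySQL server and mysql-connector-python; credentials are placeholders.
import mysql.connector

conn = mysql.connector.connect(host="localhost", user="root", password="password")
cur = conn.cursor()

cur.execute("SHOW ENGINES")               # storage engines behind the handler API
for engine, support, comment, *_ in cur:
    print(f"{engine:<10} {support:<8} {comment}")

cur.execute("SHOW VARIABLES LIKE 'innodb_buffer_pool_size'")
print(cur.fetchone())                     # InnoDB's main in-memory cache

cur.execute("SELECT @@version, @@version_comment")
print(cur.fetchone())

conn.close()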
InnoDB Architecture Deep Dive
InnoDB is MySQL’s default storage engine and where most of the magic happens:
┌─────────────────────────────────────────────────────────────────────┐
│ InnoDB Engine │
├─────────────────────────────────────────────────────────────────────┤
│ │
│ ┌────────────────────── IN MEMORY ─────────────────────────────┐ │
│ │ │ │
│ │ ┌─────────────────────────────────────────────────────────┐ │ │
│ │ │ Buffer Pool │ │ │
│ │ │ ┌─────────────────────────────────────────────────────┐ │ │ │
│ │ │ │ Data Pages │ Index Pages │ Undo Pages │ Adaptive HI │ │ │ │
│ │ │ └─────────────────────────────────────────────────────┘ │ │ │
│ │ │ LRU list (young → old) for page eviction │ │ │
│ │ └─────────────────────────────────────────────────────────┘ │ │
│ │ │ │
│ │ ┌───────────────┐ ┌───────────────┐ ┌───────────────────┐ │ │
│ │ │ Change │ │ Log │ │ Adaptive Hash │ │ │
│ │ │ Buffer │ │ Buffer │ │ Index (AHI) │ │ │
│ │ │ (for sec.idx) │ │ (redo log │ │ (auto-built hash │ │ │
│ │ │ │ │ in memory) │ │ for hot pages) │ │ │
│ │ └───────────────┘ └───────────────┘ └───────────────────────┘ │
│ │ │ │
│ └───────────────────────────────────────────────────────────────┘ │
│ │ │
│ │ Flush │
│ ▼ │
│ ┌────────────────────── ON DISK ───────────────────────────────┐ │
│ │ │ │
│ │ ┌───────────────┐ ┌───────────────┐ ┌───────────────────┐ │ │
│ │ │ System │ │ File-per- │ │ Redo Log Files │ │ │
│ │ │ Tablespace │ │ Table (.ibd) │ │ (ib_logfile0/1) │ │ │
│ │ │ (ibdata1) │ │ │ │ │ │ │
│ │ │ │ │ Data + Index │ │ WAL for crash │ │ │
│ │ │ Change buffer │ │ in B+Tree │ │ recovery │ │ │
│ │ │ Undo logs │ │ structure │ │ │ │ │
│ │ │ Data dict │ │ │ │ │ │ │
│ │ └───────────────┘ └───────────────┘ └───────────────────┘ │ │
│ │ │ │
│ │ ┌───────────────┐ ┌───────────────┐ │ │
│ │ │ Doublewrite │ │ Undo │ │ │
│ │ │ Buffer │ │ Tablespace │ │ │
│ │ │ (crash safety)│ │ (undo_001) │ │ │
│ │ └───────────────┘ └───────────────┘ │ │
│ │ │ │
│ └───────────────────────────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────────┘
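The in-memory vs on-disk split above is observable on a live server. A quick sketch, assuming MySQL 8.0 (where these information_schema and status names exist), a local instance, and placeholder credentials:
# Observe buffer pool, redo log, and overall InnoDB state on a running instance.
import mysql.connector

conn = mysql.connector.connect(host="localhost", user="root", password="password")
cur = conn.cursor(dictionary=True)

cur.execute("""
    SELECT POOL_SIZE, FREE_BUFFERS, DATABASE_PAGES, MODIFIED_DATABASE_PAGES
    FROM information_schema.INNODB_BUFFER_POOL_STATS
""")
for row in cur:
    print("buffer pool:", row)            # cached pages and dirty pages awaiting flush

cur.execute("SHOW GLOBAL STATUS LIKE 'Innodb_os_log_written'")
print("redo bytes written:", cur.fetchone())

cur.execute("SHOW ENGINE INNODB STATUS")
status = cur.fetchone()["Status"]
print(status[:500])                       # LOG, BUFFER POOL AND MEMORY sections, etc.

conn.close()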
Project List
Projects progress from understanding fundamentals to building sophisticated database tools.
Project 1: SQL Query Executor (From Scratch)
- File: LEARN_MYSQL_DEEP_DIVE.md
- Main Programming Language: Python
- Alternative Programming Languages: Go, Rust, C
- Coolness Level: Level 4: Hardcore Tech Flex
- Business Potential: 1. The “Resume Gold”
- Difficulty: Level 2: Intermediate
- Knowledge Area: SQL Parsing / Query Execution / Relational Algebra
- Software or Tool: Custom implementation
- Main Book: “Database System Concepts” by Silberschatz, Korth, Sudarshan
What you’ll build: A simple SQL query executor that parses SQL (SELECT, WHERE, JOIN), operates on CSV files as tables, and executes queries using relational algebra operators (select, project, join).
Why it teaches MySQL: Before understanding MySQL’s optimizations, you need to understand what SQL actually means. This project forces you to implement the relational model and see why database engines make certain design choices.
Core challenges you’ll face:
- Parsing SQL syntax → maps to understanding SQL grammar
- Implementing relational operators → maps to selection, projection, join
- Query planning → maps to choosing join order, using indexes
- Understanding cost → maps to why some queries are slow
Key Concepts:
- Relational Algebra: “Database System Concepts” Chapter 6
- SQL Syntax: MySQL SELECT Syntax
- Query Processing: “Database System Concepts” Chapter 15
Difficulty: Intermediate Time estimate: 2 weeks Prerequisites: Basic SQL knowledge. Understanding of data structures.
Real world outcome:
# Your SQL executor in action:
$ python minisql.py
MiniSQL> CREATE TABLE users (id, name, email);
Table 'users' created.
MiniSQL> INSERT INTO users VALUES (1, 'Alice', 'alice@example.com');
MiniSQL> INSERT INTO users VALUES (2, 'Bob', 'bob@example.com');
MiniSQL> SELECT name, email FROM users WHERE id = 1;
+-------+-------------------+
| name | email |
+-------+-------------------+
| Alice | alice@example.com |
+-------+-------------------+
1 row in set (0.001 sec)
MiniSQL> CREATE TABLE orders (id, user_id, amount);
MiniSQL> INSERT INTO orders VALUES (1, 1, 100);
MiniSQL> INSERT INTO orders VALUES (2, 1, 200);
MiniSQL> SELECT users.name, orders.amount
FROM users
JOIN orders ON users.id = orders.user_id;
+-------+--------+
| name | amount |
+-------+--------+
| Alice | 100 |
| Alice | 200 |
+-------+--------+
2 rows in set (0.002 sec)
Implementation Hints:
Core relational operators:
class Table:
def __init__(self, name, columns, rows=None):
self.name = name
self.columns = columns # ['id', 'name', 'email']
self.rows = rows or [] # [[1, 'Alice', 'alice@...'], ...]
def select(self, predicate):
"""σ (sigma) - Filter rows matching predicate"""
return Table(
self.name,
self.columns,
[row for row in self.rows if predicate(row, self.columns)]
)
def project(self, column_names):
"""π (pi) - Select specific columns"""
indices = [self.columns.index(c) for c in column_names]
return Table(
self.name,
column_names,
[[row[i] for i in indices] for row in self.rows]
)
def join(self, other, on_left, on_right):
"""⋈ (bowtie) - Join two tables"""
left_idx = self.columns.index(on_left)
right_idx = other.columns.index(on_right)
result_rows = []
for left_row in self.rows:
for right_row in other.rows:
if left_row[left_idx] == right_row[right_idx]:
result_rows.append(left_row + right_row)
return Table(
f"{self.name}_{other.name}",
self.columns + other.columns,
result_rows
)
Simple SQL parser:
import re
def parse_select(sql):
"""Parse SELECT column1, column2 FROM table WHERE condition"""
pattern = r"SELECT\s+(.+?)\s+FROM\s+(\w+)(?:\s+WHERE\s+(.+))?"
match = re.match(pattern, sql, re.IGNORECASE)
if match:
columns = [c.strip() for c in match.group(1).split(',')]
table = match.group(2)
where = match.group(3)
return {'type': 'select', 'columns': columns, 'table': table, 'where': where}
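To tie the parser and the operators together, here is a sketch of a glue function (execute_select is my addition, not part of the hints above; the single-condition WHERE handling is deliberately naive):
def execute_select(parsed, tables):
    """Run a parsed SELECT against an in-memory catalog of Table objects."""
    table = tables[parsed['table']]

    # WHERE: handle a single "column = value" condition, nothing more.
    if parsed['where']:
        col, _, raw = parsed['where'].partition('=')
        col, raw = col.strip(), raw.strip().strip("'")

        def predicate(row, columns):
            return str(row[columns.index(col)]) == raw   # string compare keeps it simple
        table = table.select(predicate)

    # SELECT list: '*' keeps every column, otherwise project.
    if parsed['columns'] != ['*']:
        table = table.project(parsed['columns'])
    return table

# Usage with the classes above:
# users = Table('users', ['id', 'name', 'email'],
#               [[1, 'Alice', 'alice@example.com'], [2, 'Bob', 'bob@example.com']])
# execute_select(parse_select("SELECT name FROM users WHERE id = 1"),
#                {'users': users}).rows      # -> [['Alice']]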
Learning milestones:
- You parse basic SQL → You understand SQL syntax
- You implement SELECT/WHERE → You understand selection and projection
- You implement JOIN → You understand the most expensive operation
- You see why indexes matter → You understand scanning vs seeking
Project 2: B+Tree Index Implementation
- File: LEARN_MYSQL_DEEP_DIVE.md
- Main Programming Language: C
- Alternative Programming Languages: Rust, Go, Python
- Coolness Level: Level 4: Hardcore Tech Flex
- Business Potential: 1. The “Resume Gold”
- Difficulty: Level 3: Advanced
- Knowledge Area: Data Structures / Indexing / Disk-Based Storage
- Software or Tool: Custom implementation
- Main Book: “Database Internals” by Alex Petrov
What you’ll build: A disk-based B+Tree implementation that stores data on disk in pages, supports insertions, deletions, range scans, and demonstrates why B+Trees are ideal for databases.
Why it teaches MySQL: InnoDB stores all data in B+Tree structures. The clustered index IS the table. Understanding B+Trees means understanding how MySQL physically organizes data and why certain operations are fast or slow.
Core challenges you’ll face:
- Page-based storage → maps to InnoDB’s 16KB pages
- Node splitting and merging → maps to handling insertions and deletions
- Range scans via leaf links → maps to why B+Trees excel at range queries
- Understanding clustered vs secondary → maps to InnoDB’s index design
Key Concepts:
- B+Tree Structure: “Database Internals” by Alex Petrov, Chapters 2-4
- InnoDB Physical Structure: MySQL Docs - Physical Structure
- Clustered Index: Use The Index, Luke
Difficulty: Advanced Time estimate: 3-4 weeks Prerequisites: Project 1 completed. Understanding of tree data structures. C or systems programming.
Real world outcome:
// Your B+Tree in action:
$ ./btree_demo
Creating B+Tree with page size 4096, order 100...
Inserting 1,000,000 keys...
Insert time: 2.3 seconds
Point lookup (key = 500000): found in 3 page reads (tree height = 3)
Range scan (100000 to 100100): 100 rows in 5 page reads
Tree statistics:
- Height: 3
- Leaf pages: 10,000
- Internal pages: 101
- Fill factor: 67%
Demonstrating why B+Tree > Binary Tree for disk:
- B+Tree: 3 disk reads for any key in 1M rows
- Binary: 20 disk reads for any key in 1M rows (log2(1M))
Implementation Hints:
B+Tree node structure:
#define PAGE_SIZE 4096
#define MAX_KEYS ((PAGE_SIZE - sizeof(NodeHeader)) / (sizeof(Key) + sizeof(Value)))
typedef struct {
uint32_t is_leaf;
uint32_t num_keys;
uint32_t parent_page;
uint32_t next_leaf; // Only for leaf nodes (linked list)
uint32_t prev_leaf;
} NodeHeader;
typedef struct {
NodeHeader header;
Key keys[MAX_KEYS];
union {
Value values[MAX_KEYS]; // For leaf nodes
uint32_t children[MAX_KEYS + 1]; // For internal nodes (page IDs)
};
} BTreeNode;
typedef struct {
int fd; // File descriptor
uint32_t root_page;
uint32_t num_pages;
BTreeNode* cache[CACHE_SIZE]; // Simple page cache
} BTree;
Key operations:
// Search: traverse from root to leaf
Value* btree_search(BTree* tree, Key key) {
uint32_t page_id = tree->root_page;
while (true) {
BTreeNode* node = read_page(tree, page_id);
if (node->header.is_leaf) {
// Binary search in leaf
int idx = binary_search(node->keys, node->header.num_keys, key);
if (idx >= 0) {
return &node->values[idx];
}
return NULL;
}
// Find child to follow
int idx = find_child_index(node, key);
page_id = node->children[idx];
}
}
// Range scan: find start, follow leaf links
void btree_range_scan(BTree* tree, Key start, Key end, Callback cb) {
// Find leaf containing start key
uint32_t page_id = find_leaf(tree, start);
while (page_id != 0) {
BTreeNode* node = read_page(tree, page_id);
for (int i = 0; i < node->header.num_keys; i++) {
if (node->keys[i] > end) return;
if (node->keys[i] >= start) {
cb(node->keys[i], node->values[i]);
}
}
page_id = node->header.next_leaf; // Follow leaf link!
}
}
Why B+Tree is perfect for databases:
- High fanout → Tree stays shallow (3-4 levels for millions of rows; see the quick arithmetic after this list)
- Leaf links → Range scans don’t need to revisit internal nodes
- Sequential leaf access → Disk prefetching works well
- Consistent read performance → All leaves at same depth
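The fanout claim is easy to sanity-check with rough arithmetic. A sketch assuming InnoDB’s 16 KB default page size and an invented ~14 bytes per key/pointer entry:
import math

PAGE_SIZE = 16 * 1024                 # InnoDB's default page size
ENTRY_SIZE = 8 + 6                    # assumed: 8-byte key + ~6-byte page pointer
FANOUT = PAGE_SIZE // ENTRY_SIZE      # ~1170 children per internal page

for rows in (1_000_000, 100_000_000, 10_000_000_000):
    height = math.ceil(math.log(rows, FANOUT))
    print(f"{rows:>14,} rows -> height ~ {height}")

# Roughly: 1M rows in 2 levels, 100M in 3, 10B in 4 page reads per point lookup,
# versus ~log2(N) (about 20 for 1M keys) in a binary tree with one node per read.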
Learning milestones:
- You implement basic search → You understand tree traversal
- You implement insert with splitting → You understand node management
- You implement range scans → You understand leaf linking
- You measure disk I/O → You understand why B+Trees win
Project 3: Query Optimizer Simulator
- File: LEARN_MYSQL_DEEP_DIVE.md
- Main Programming Language: Python
- Alternative Programming Languages: Go, Java
- Coolness Level: Level 4: Hardcore Tech Flex
- Business Potential: 2. The “Micro-SaaS / Pro Tool”
- Difficulty: Level 3: Advanced
- Knowledge Area: Query Optimization / Cost Estimation / Statistics
- Software or Tool: Custom implementation
- Main Book: “Database System Concepts” by Silberschatz, Korth, Sudarshan
What you’ll build: A query optimizer that takes SQL queries, generates multiple execution plans, estimates costs based on table statistics, and chooses the cheapest plan—just like MySQL’s optimizer.
Why it teaches MySQL: The optimizer is MySQL’s brain. Understanding how it estimates costs and chooses plans explains why some queries are fast and others slow, and how to write SQL that the optimizer can optimize.
Core challenges you’ll face:
- Generating alternative plans → maps to join order, index choice
- Estimating cardinality → maps to how many rows will match?
- Cost modeling → maps to disk I/O vs CPU costs
- Understanding EXPLAIN → maps to reading MySQL’s execution plans
Key Concepts:
- Query Optimization: “Database System Concepts” Chapter 16
- MySQL Optimizer: How the MySQL Optimizer Works
- EXPLAIN Output: MySQL EXPLAIN
Difficulty: Advanced Time estimate: 2-3 weeks Prerequisites: Projects 1-2 completed. Understanding of SQL execution.
Real world outcome:
Query Optimizer Simulator
Query: SELECT * FROM orders o
JOIN customers c ON o.customer_id = c.id
JOIN products p ON o.product_id = p.id
WHERE c.country = 'USA'
AND p.category = 'Electronics'
Table Statistics:
orders: 1,000,000 rows, index on customer_id, product_id
customers: 100,000 rows, index on id, country
products: 10,000 rows, index on id, category
Generated Plans:
Plan 1: customers → orders → products
├─ Scan customers WHERE country='USA' (est. 10,000 rows)
├─ Index lookup orders.customer_id (est. 100,000 rows)
└─ Index lookup products.id WHERE category='Electronics' (est. 1,000 rows)
Cost: 15,230 (disk I/O: 1,200, CPU: 14,030)
Plan 2: products → orders → customers ★ CHOSEN
├─ Scan products WHERE category='Electronics' (est. 500 rows)
├─ Index lookup orders.product_id (est. 50,000 rows)
└─ Index lookup customers.id WHERE country='USA' (est. 5,000 rows)
Cost: 8,750 (disk I/O: 850, CPU: 7,900)
Plan 3: orders → customers → products
├─ Full scan orders (1,000,000 rows) ← EXPENSIVE!
...
Cost: 1,250,000
Optimizer chose Plan 2 (~1.7x cheaper than Plan 1)
Implementation Hints:
Table statistics:
class TableStats:
def __init__(self, name, row_count, columns, indexes):
self.name = name
self.row_count = row_count
self.columns = columns # {col_name: {'distinct': N, 'min': X, 'max': Y}}
self.indexes = indexes # {index_name: {'columns': [...], 'unique': bool}}
def estimate_selectivity(self, column, operator, value):
"""Estimate what fraction of rows match a predicate"""
col_stats = self.columns[column]
if operator == '=':
# Uniform distribution assumption
return 1.0 / col_stats['distinct']
elif operator == '<':
# Range selectivity
range_size = col_stats['max'] - col_stats['min']
return (value - col_stats['min']) / range_size
# ... more operators
def estimate_rows(self, predicates):
"""Estimate rows after applying predicates"""
selectivity = 1.0
for pred in predicates:
selectivity *= self.estimate_selectivity(
pred.column, pred.operator, pred.value
)
return int(self.row_count * selectivity)
Plan generation and costing:
class QueryOptimizer:
def __init__(self, stats):
self.stats = stats # {table_name: TableStats}
self.io_cost_per_page = 1.0
self.cpu_cost_per_row = 0.01
def generate_join_orders(self, tables):
"""Generate all possible join orders"""
from itertools import permutations
return list(permutations(tables))
def estimate_join_cost(self, left_rows, right_rows, join_type):
"""Estimate cost of a join operation"""
if join_type == 'nested_loop':
return left_rows * right_rows * self.cpu_cost_per_row
elif join_type == 'hash_join':
return (left_rows + right_rows) * self.cpu_cost_per_row * 2
elif join_type == 'index_lookup':
# Much cheaper if we have an index
return left_rows * 3 * self.io_cost_per_page # B+Tree depth
def cost_plan(self, plan):
"""Calculate total cost of an execution plan"""
total_cost = 0
current_rows = 0
for step in plan.steps:
if step.type == 'table_scan':
pages = step.table.row_count / 100 # ~100 rows per page
total_cost += pages * self.io_cost_per_page
current_rows = step.estimated_rows
elif step.type == 'index_lookup':
total_cost += 3 * self.io_cost_per_page # B+Tree traversal
total_cost += step.estimated_rows * self.cpu_cost_per_row
current_rows = step.estimated_rows
elif step.type == 'join':
total_cost += self.estimate_join_cost(
current_rows, step.right_rows, step.join_type
)
current_rows = step.estimated_output_rows
return total_cost
def optimize(self, query):
"""Find the cheapest execution plan"""
plans = self.generate_all_plans(query)
return min(plans, key=self.cost_plan)
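A sketch of wiring the two classes together; the statistics below are invented for illustration, and generate_all_plans is still yours to write:
# Hypothetical usage of the classes above -- all numbers are invented.
stats = {
    'orders': TableStats(
        'orders', 1_000_000,
        {'customer_id': {'distinct': 100_000, 'min': 1, 'max': 100_000}},
        {'idx_customer': {'columns': ['customer_id'], 'unique': False}},
    ),
    'customers': TableStats(
        'customers', 100_000,
        {'country': {'distinct': 200}},
        {'idx_country': {'columns': ['country'], 'unique': False}},
    ),
}

# Selectivity of country = 'USA' under the uniform-distribution assumption:
sel = stats['customers'].estimate_selectivity('country', '=', 'USA')
print(f"selectivity ~ {sel:.4f}, estimated rows ~ {int(100_000 * sel)}")   # 0.0050, 500

optimizer = QueryOptimizer(stats)
# best_plan = optimizer.optimize(parsed_query)   # once generate_all_plans is implemented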
Learning milestones:
- You generate alternative plans → You understand plan space
- You estimate cardinality → You understand statistics
- You calculate costs → You understand I/O vs CPU tradeoffs
- You read EXPLAIN output → You can optimize real MySQL queries
Project 4: MVCC Transaction Manager
- File: LEARN_MYSQL_DEEP_DIVE.md
- Main Programming Language: Python
- Alternative Programming Languages: Go, Rust, C++
- Coolness Level: Level 4: Hardcore Tech Flex
- Business Potential: 1. The “Resume Gold”
- Difficulty: Level 3: Advanced
- Knowledge Area: Transactions / Concurrency / Isolation
- Software or Tool: Custom implementation
- Main Book: “Designing Data-Intensive Applications” by Martin Kleppmann
What you’ll build: An MVCC (Multi-Version Concurrency Control) system that maintains multiple versions of data, implements snapshot isolation, and demonstrates how InnoDB allows readers and writers to not block each other.
Why it teaches MySQL: InnoDB’s MVCC is central to its concurrency model. Understanding how undo logs create “snapshots” and how visibility rules work explains isolation levels, long transaction problems, and lock behavior.
Core challenges you’ll face:
- Versioning data → maps to undo logs storing old versions
- Visibility rules → maps to which version can a transaction see?
- Garbage collection → maps to purging old versions
- Isolation levels → maps to READ COMMITTED vs REPEATABLE READ
Key Concepts:
- MVCC: Understanding MVCC
- InnoDB Locking: MySQL Transaction Isolation
- Snapshot Isolation: “Designing Data-Intensive Applications” Chapter 7
Difficulty: Advanced Time estimate: 2-3 weeks Prerequisites: Projects 1-3 completed. Understanding of transaction concepts.
Real world outcome:
MVCC Transaction Simulator
txn1> BEGIN; -- Transaction ID: 100
txn2> BEGIN; -- Transaction ID: 101
txn1> UPDATE accounts SET balance = 500 WHERE id = 1;
[Created version: {txn_id: 100, value: 500}]
[Old version: {txn_id: 50, value: 1000} → undo log]
txn2> SELECT balance FROM accounts WHERE id = 1;
→ 1000 (sees old version - txn 100 not committed)
txn1> COMMIT;
[Version {txn_id: 100} now visible to new transactions]
txn2> SELECT balance FROM accounts WHERE id = 1;
→ 1000 (REPEATABLE READ: still sees snapshot from txn start)
txn3> BEGIN; -- Transaction ID: 102
txn3> SELECT balance FROM accounts WHERE id = 1;
→ 500 (sees committed version from txn 100)
-- Demonstration of isolation levels:
SET ISOLATION LEVEL READ COMMITTED;
txn2> SELECT balance FROM accounts WHERE id = 1;
→ 500 (READ COMMITTED: sees latest committed value)
Implementation Hints:
Version chain and visibility:
from dataclasses import dataclass
from typing import Optional, List
import threading
@dataclass
class Version:
txn_id: int
value: any
created_at: int # Transaction ID that created this
deleted_at: Optional[int] = None # Transaction ID that deleted this
prev_version: Optional['Version'] = None # Link to older version
class MVCCRow:
def __init__(self, key):
self.key = key
self.version_chain: Optional[Version] = None
self.lock = threading.Lock()
def write(self, txn_id: int, value: any):
"""Create new version, link to previous"""
with self.lock:
new_version = Version(
txn_id=txn_id,
value=value,
created_at=txn_id,
prev_version=self.version_chain
)
self.version_chain = new_version
def read(self, txn: 'Transaction') -> any:
"""Find visible version for this transaction"""
version = self.version_chain
while version:
if txn.is_visible(version):
return version.value
version = version.prev_version
return None # No visible version
class Transaction:
def __init__(self, txn_id: int, isolation_level: str, active_txns: set):
self.txn_id = txn_id
self.isolation_level = isolation_level
self.snapshot = active_txns.copy() # Active txns at start
self.start_time = txn_id
def is_visible(self, version: Version) -> bool:
"""Can this transaction see this version?"""
# Created by a committed transaction before our snapshot
if version.created_at < self.start_time:
if version.created_at not in self.snapshot: # Was committed
if version.deleted_at is None:
return True
if version.deleted_at in self.snapshot: # Deleted but not committed
return True
if version.deleted_at > self.start_time:
return True
return False
class MVCCDatabase:
def __init__(self):
self.rows = {}
self.next_txn_id = 1
self.active_transactions = set()
self.committed_transactions = set()
def begin(self, isolation_level='REPEATABLE READ'):
txn_id = self.next_txn_id
self.next_txn_id += 1
self.active_transactions.add(txn_id)
return Transaction(txn_id, isolation_level, self.active_transactions)
def commit(self, txn: Transaction):
self.active_transactions.remove(txn.txn_id)
self.committed_transactions.add(txn.txn_id)
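A sketch of driving these classes through the scenario from the sample output above (the seed transaction just gives the row a committed starting version; transaction IDs are whatever next_txn_id assigns):
db = MVCCDatabase()

# Seed: an initial transaction writes balance=1000 and commits.
setup = db.begin()
row = db.rows.setdefault('accounts:1', MVCCRow('accounts:1'))
row.write(setup.txn_id, 1000)
db.commit(setup)

t1 = db.begin()                  # writer
t2 = db.begin()                  # concurrent reader (REPEATABLE READ)
row.write(t1.txn_id, 500)        # t1 updates but has not committed yet

print(row.read(t2))              # 1000 -- t1 was active in t2's snapshot, so its version is invisible
db.commit(t1)
print(row.read(t2))              # still 1000: t2 keeps the snapshot it took at begin()

t3 = db.begin()
print(row.read(t3))              # 500 -- t1 committed before t3 started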
Learning milestones:
- You implement version chains → You understand undo logs
- You implement visibility rules → You understand snapshot isolation
- You compare isolation levels → You understand READ COMMITTED vs REPEATABLE READ
- You see long transaction problems → You understand why long txns are bad
Project 5: Buffer Pool Manager
- File: LEARN_MYSQL_DEEP_DIVE.md
- Main Programming Language: C
- Alternative Programming Languages: Rust, C++
- Coolness Level: Level 4: Hardcore Tech Flex
- Business Potential: 1. The “Resume Gold”
- Difficulty: Level 3: Advanced
- Knowledge Area: Memory Management / Caching / I/O
- Software or Tool: Custom implementation
- Main Book: “Database Internals” by Alex Petrov
What you’ll build: A buffer pool manager that caches disk pages in memory, implements LRU eviction (with young/old list like InnoDB), handles dirty page flushing, and manages concurrent access.
Why it teaches MySQL: The buffer pool is InnoDB’s most important performance feature. Understanding it explains why “more RAM = faster database”, how to tune innodb_buffer_pool_size, and why certain access patterns perform poorly.
Core challenges you’ll face:
- Page caching → maps to InnoDB’s buffer pool
- LRU with midpoint insertion → maps to young/old list separation
- Dirty page management → maps to checkpoint and flush
- Concurrent access → maps to latching pages
Key Concepts:
- Buffer Pool: InnoDB Buffer Pool
- LRU Algorithm: InnoDB Buffer Pool LRU
- Page Replacement: “Database Internals” Chapter 5
Difficulty: Advanced Time estimate: 2-3 weeks Prerequisites: Projects 1-4 completed. Systems programming experience.
Real world outcome:
Buffer Pool Simulator
Configuration:
Buffer pool size: 128 MB (8,192 pages × 16 KB)
Young list: 5/8 of pool (5,120 pages)
Old list: 3/8 of pool (3,072 pages)
Running OLTP workload simulation...
Statistics after 1 million operations:
Hit ratio: 98.7%
Pages read from disk: 13,000
Pages written to disk: 4,500
Evictions: 45,000
Page distribution:
Young list: 5,120 pages (frequently accessed)
Old list: 3,072 pages (recently loaded)
Dirty pages: 1,234 (pending flush)
LRU behavior demonstration:
- Full table scan of 100K rows...
- Old pages: 3,072 → scan loaded here
- Young pages: 5,120 → protected from scan eviction!
- Hit ratio after scan: 97.2% (only 1.5% drop)
Without midpoint insertion (traditional LRU):
- Hit ratio after scan: 45% (catastrophic drop!)
Implementation Hints:
Buffer pool structure:
#define PAGE_SIZE 16384
#define POOL_SIZE 8192 // Number of pages
typedef struct BufferPage {
uint32_t page_id;
bool is_dirty;
int pin_count; // How many threads are using this page
pthread_rwlock_t latch;
uint8_t data[PAGE_SIZE];
// LRU list pointers
struct BufferPage* lru_prev;
struct BufferPage* lru_next;
bool in_young_list; // vs old list
} BufferPage;
typedef struct BufferPool {
BufferPage pages[POOL_SIZE];
// Hash table: page_id → BufferPage*
HashTable* page_table;
// LRU lists (InnoDB uses midpoint insertion)
BufferPage* young_head; // Most recently used
BufferPage* young_tail;
BufferPage* old_head; // Midpoint
BufferPage* old_tail; // Least recently used
// Free list
BufferPage* free_list;
pthread_mutex_t pool_latch;
// Stats
uint64_t hits;
uint64_t misses;
} BufferPool;
LRU with midpoint insertion:
BufferPage* buffer_pool_get(BufferPool* pool, uint32_t page_id) {
pthread_mutex_lock(&pool->pool_latch);
// Check if page is in pool
BufferPage* page = hash_table_get(pool->page_table, page_id);
if (page) {
pool->hits++;
// Move to young list if accessed again from old list
if (!page->in_young_list) {
// Only move if it's been in old list for a while
// (prevents one-time access from polluting young list)
if (should_promote_to_young(page)) {
move_to_young_head(pool, page);
}
} else {
// Already in young, move to head
move_to_young_head(pool, page);
}
page->pin_count++;
pthread_mutex_unlock(&pool->pool_latch);
return page;
}
pool->misses++;
// Need to load from disk - find victim page
BufferPage* victim = find_victim(pool);
if (victim->is_dirty) {
flush_page(victim); // Write to disk first
}
// Load new page
read_page_from_disk(victim, page_id);
victim->page_id = page_id;
victim->is_dirty = false;
victim->pin_count = 1;
// Insert at OLD list head (not young!)
// This is the midpoint insertion strategy
insert_at_old_head(pool, victim);
hash_table_put(pool->page_table, page_id, victim);
pthread_mutex_unlock(&pool->pool_latch);
return victim;
}
Why midpoint insertion matters:
- A full table scan would evict all hot pages in naive LRU
- With midpoint insertion, scan pages enter the OLD list
- They get evicted before hot pages in the YOUNG list
- Hot working set is protected from one-time scans (the toy simulation below shows the effect)
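Before writing the C version, the scan-resistance effect can be demonstrated with a toy Python model (a sketch, not InnoDB’s actual algorithm: no promotion delay, a fixed 3/8 old list, and pages evicted from the young list are simply dropped):
from collections import OrderedDict
import random

class MidpointLRU:
    """Toy buffer pool: new pages enter the old list; a second touch promotes to young."""
    def __init__(self, capacity, old_fraction=3 / 8):
        self.young = OrderedDict()               # page_id -> None, most recent at the end
        self.old = OrderedDict()
        self.old_cap = int(capacity * old_fraction)
        self.young_cap = capacity - self.old_cap
        self.hits = 0
        self.misses = 0

    def access(self, page_id):
        if page_id in self.young:                # hot page: refresh its position
            self.young.move_to_end(page_id)
            self.hits += 1
        elif page_id in self.old:                # second touch: promote to the young list
            del self.old[page_id]
            self.young[page_id] = None
            self.hits += 1
            if len(self.young) > self.young_cap:
                self.young.popitem(last=False)   # simplified: drop the young-list LRU page
        else:                                    # miss: midpoint insertion into the old list
            self.misses += 1
            self.old[page_id] = None
            if len(self.old) > self.old_cap:
                self.old.popitem(last=False)     # scans only ever evict other old pages

pool = MidpointLRU(capacity=1000)
hot_set = list(range(500))                       # hot working set fits in the young list
for _ in range(20_000):                          # warm up: hot pages get promoted to young
    pool.access(random.choice(hot_set))
for page_id in range(10_000, 60_000):            # one-time full "table scan"
    pool.access(page_id)
pool.hits = 0
pool.misses = 0
for _ in range(20_000):                          # hot workload again, after the scan
    pool.access(random.choice(hot_set))
print(f"hit ratio after scan: {pool.hits / (pool.hits + pool.misses):.1%}")
# Expect ~100%: the scan churned only the old list, so the hot pages survived.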
Learning milestones:
- You implement basic LRU → You understand page replacement
- You add midpoint insertion → You understand InnoDB’s optimization
- You handle dirty pages → You understand write-back caching
- You measure hit ratios → You can tune buffer pool size
Project 6: WAL and Crash Recovery
- File: LEARN_MYSQL_DEEP_DIVE.md
- Main Programming Language: C
- Alternative Programming Languages: Rust, Go
- Coolness Level: Level 5: Pure Magic (Super Cool)
- Business Potential: 1. The “Resume Gold”
- Difficulty: Level 4: Expert
- Knowledge Area: Durability / Crash Recovery / Logging
- Software or Tool: Custom implementation
- Main Book: “Database Internals” by Alex Petrov
What you’ll build: A Write-Ahead Log (WAL) system with redo logging that ensures durability, handles crash recovery using ARIES-style recovery, and demonstrates how InnoDB survives crashes.
Why it teaches MySQL: The redo log is how InnoDB achieves durability without flushing every page to disk on commit. Understanding WAL explains innodb_flush_log_at_trx_commit, crash recovery time, and why databases survive power failures.
Core challenges you’ll face:
- Write-ahead logging → maps to redo log before data
- Log sequence numbers (LSN) → maps to ordering operations
- Checkpointing → maps to limiting recovery time
- ARIES recovery → maps to analysis, redo, undo phases
Key Concepts:
- InnoDB Redo Log: Redo Log
- ARIES: “ARIES: A Transaction Recovery Method” (classic paper)
- WAL: “Database Internals” by Alex Petrov, Chapter 7
Difficulty: Expert Time estimate: 3-4 weeks Prerequisites: Projects 1-5 completed. Strong systems programming.
Real world outcome:
WAL and Recovery Simulator
Starting database...
Redo log size: 48 MB (2 files × 24 MB)
Checkpoint LSN: 1000000
Running transactions...
T1: UPDATE accounts SET balance = 500 WHERE id = 1
[LSN 1000100: REDO record written to log]
[Page modified in buffer pool, not yet on disk]
T1: COMMIT
[LSN 1000150: COMMIT record written to log]
[Log flushed to disk - durability guaranteed!]
[Data page still only in memory]
Simulating crash...
[Power failure! Data pages lost from memory]
[Redo log persisted on disk]
Recovery starting...
Phase 1: Analysis
Scanning log from checkpoint LSN 1000000...
Found 5 transactions:
T1: COMMITTED at LSN 1000150
T2: COMMITTED at LSN 1000300
T3: ACTIVE (no commit record) - will undo
Dirty pages to recover: 12
Phase 2: Redo
Replaying committed changes...
[LSN 1000100: Redo UPDATE on page 42]
[LSN 1000200: Redo INSERT on page 87]
... 50 operations replayed
Phase 3: Undo
Rolling back uncommitted transactions...
[T3: Undo INSERT on page 123]
[T3: Undo UPDATE on page 45]
Recovery complete in 0.3 seconds!
All committed transactions preserved ✓
All uncommitted transactions rolled back ✓
Implementation Hints:
Log record structure:
typedef enum {
LOG_UPDATE,
LOG_INSERT,
LOG_DELETE,
LOG_COMMIT,
LOG_ABORT,
LOG_CHECKPOINT
} LogRecordType;
typedef struct {
uint64_t lsn; // Log Sequence Number
uint32_t txn_id;
LogRecordType type;
uint32_t page_id;
uint16_t offset; // Offset within page
uint16_t length;
uint8_t before_image[MAX_DATA]; // For undo
uint8_t after_image[MAX_DATA]; // For redo
} LogRecord;
typedef struct {
int fd; // Log file descriptor
uint64_t current_lsn;
uint64_t flushed_lsn;
uint8_t buffer[LOG_BUFFER_SIZE];
size_t buffer_offset;
pthread_mutex_t latch;
} WAL;
Write-ahead logging protocol:
void wal_log_update(WAL* wal, Transaction* txn,
Page* page, int offset,
void* old_data, void* new_data, int len) {
LogRecord record = {
.lsn = ++wal->current_lsn,
.txn_id = txn->id,
.type = LOG_UPDATE,
.page_id = page->id,
.offset = offset,
.length = len
};
memcpy(record.before_image, old_data, len);
memcpy(record.after_image, new_data, len);
// Write to log buffer
append_to_buffer(wal, &record);
// Update page's LSN
page->page_lsn = record.lsn;
}
void wal_commit(WAL* wal, Transaction* txn) {
LogRecord record = {
.lsn = ++wal->current_lsn,
.txn_id = txn->id,
.type = LOG_COMMIT
};
append_to_buffer(wal, &record);
// CRITICAL: Flush log before returning to client
// This is the WAL guarantee!
wal_flush(wal);
}
ARIES-style recovery:
void recover(WAL* wal, BufferPool* pool) {
// Phase 1: Analysis
uint64_t checkpoint_lsn = read_checkpoint(wal);
HashSet* dirty_pages = new_hashset();
HashSet* active_txns = new_hashset();
LogRecord* record;
wal_seek(wal, checkpoint_lsn);
while ((record = wal_read_next(wal))) {
if (record->type == LOG_UPDATE || record->type == LOG_INSERT) {
hashset_add(dirty_pages, record->page_id);
hashset_add(active_txns, record->txn_id);
} else if (record->type == LOG_COMMIT) {
hashset_remove(active_txns, record->txn_id);
}
}
// Phase 2: Redo - bring all pages to crash state
wal_seek(wal, checkpoint_lsn);
while ((record = wal_read_next(wal))) {
if (record->type == LOG_UPDATE) {
Page* page = buffer_pool_get(pool, record->page_id);
// Only redo if page is older than log record
if (page->page_lsn < record->lsn) {
apply_redo(page, record);
page->page_lsn = record->lsn;
}
}
}
// Phase 3: Undo - rollback uncommitted transactions
// Read log backwards
wal_seek_end(wal);
while ((record = wal_read_prev(wal))) {
if (hashset_contains(active_txns, record->txn_id)) {
if (record->type == LOG_UPDATE) {
Page* page = buffer_pool_get(pool, record->page_id);
apply_undo(page, record); // Use before_image
}
}
}
}
Learning milestones:
- You implement WAL protocol → You understand durability
- You survive simulated crashes → You understand recovery
- You implement checkpointing → You understand recovery time limits
- You compare flush strategies → You understand innodb_flush_log_at_trx_commit (see the sketch below)
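To connect the simulator to a real server: innodb_flush_log_at_trx_commit=1 (the default) writes and fsyncs the redo log at every commit, 2 writes it to the OS at commit and fsyncs roughly once per second, and 0 writes and fsyncs roughly once per second. A rough latency-comparison sketch, assuming a disposable local instance, a test schema, and privileges to set global variables (do not run this against data you care about):
# Rough commit-latency comparison across redo-flush settings.
import time
import mysql.connector

conn = mysql.connector.connect(host="localhost", user="root",
                               password="password", database="test")
cur = conn.cursor()
cur.execute("CREATE TABLE IF NOT EXISTS t (id INT AUTO_INCREMENT PRIMARY KEY, v INT)")

for setting in (1, 2, 0):
    cur.execute(f"SET GLOBAL innodb_flush_log_at_trx_commit = {setting}")
    start = time.perf_counter()
    for _ in range(1000):                        # 1000 tiny transactions
        cur.execute("INSERT INTO t (v) VALUES (1)")
        conn.commit()                            # each commit exercises the flush policy
    elapsed = time.perf_counter() - start
    print(f"flush_log_at_trx_commit={setting}: {elapsed:.2f}s for 1000 commits")

cur.execute("SET GLOBAL innodb_flush_log_at_trx_commit = 1")   # restore the durable default
conn.close()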
Project 7: MySQL Protocol Implementation
- File: LEARN_MYSQL_DEEP_DIVE.md
- Main Programming Language: Python
- Alternative Programming Languages: Go, Rust, C
- Coolness Level: Level 3: Genuinely Clever
- Business Potential: 3. The “Service & Support” Model
- Difficulty: Level 2: Intermediate
- Knowledge Area: Network Protocols / Binary Parsing
- Software or Tool: Custom implementation + Wireshark
- Main Book: MySQL Internals Documentation
What you’ll build: A MySQL protocol client that connects to MySQL servers using the wire protocol, performs the handshake, authenticates, sends queries, and parses results—all without using libmysqlclient.
Why it teaches MySQL: The protocol is how applications talk to MySQL. Understanding it explains connection pooling, prepared statements, SSL/TLS negotiation, and why some drivers are faster than others.
Core challenges you’ll face:
- Packet framing → maps to length-prefixed packets
- Handshake and auth → maps to capability negotiation
- Query protocol → maps to COM_QUERY and result sets
- Prepared statements → maps to COM_STMT_PREPARE/EXECUTE
Key Concepts:
- MySQL Protocol: MySQL Protocol Documentation
- Wire Format: Understanding MySQL Protocol
- Packet Structure: Capture with Wireshark for analysis
Difficulty: Intermediate Time estimate: 1-2 weeks Prerequisites: Projects 1-3 completed. Network programming basics.
Real world outcome:
# Your MySQL protocol client:
from mysql_protocol import MySQLClient
client = MySQLClient()
client.connect('localhost', 3306, 'root', 'password', 'testdb')
# Behind the scenes:
# → Server: Initial Handshake (protocol version, server version, salt)
# ← Client: Handshake Response (username, auth, database, capabilities)
# → Server: OK or ERR packet
result = client.query("SELECT id, name FROM users WHERE active = 1")
# Behind the scenes:
# ← Client: COM_QUERY packet with SQL string
# → Server: Column count, column definitions, rows, EOF
for row in result:
print(f"User {row['id']}: {row['name']}")
# Prepared statement:
stmt = client.prepare("SELECT * FROM orders WHERE user_id = ?")
result = client.execute(stmt, [42])
# Behind the scenes:
# ← Client: COM_STMT_PREPARE
# → Server: Statement ID, parameter count, column definitions
# ← Client: COM_STMT_EXECUTE with binary parameters
# → Server: Binary result set
Implementation Hints:
Packet structure:
import struct
import socket
class MySQLPacket:
def __init__(self):
self.sequence_id = 0
def read_packet(self, sock):
"""Read one MySQL packet"""
# 4-byte header: 3 bytes length + 1 byte sequence ID
header = sock.recv(4)
length = struct.unpack('<I', header[:3] + b'\x00')[0]
seq_id = header[3]
payload = sock.recv(length)
return seq_id, payload
def write_packet(self, sock, payload):
"""Write one MySQL packet"""
length = len(payload)
header = struct.pack('<I', length)[:3] + bytes([self.sequence_id])
sock.sendall(header + payload)
self.sequence_id += 1
class MySQLClient:
def __init__(self):
self.sock = None
self.packet = MySQLPacket()
def connect(self, host, port, user, password, database):
self.sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
self.sock.connect((host, port))
# Read initial handshake
seq_id, payload = self.packet.read_packet(self.sock)
handshake = self.parse_handshake(payload)
# Send handshake response
response = self.build_handshake_response(
handshake, user, password, database
)
self.packet.write_packet(self.sock, response)
# Read OK/ERR
seq_id, payload = self.packet.read_packet(self.sock)
if payload[0] == 0x00: # OK packet
print("Connected!")
elif payload[0] == 0xff: # ERR packet
raise Exception(self.parse_error(payload))
def parse_handshake(self, payload):
"""Parse initial handshake packet"""
protocol_version = payload[0]
# Null-terminated server version string
nul_pos = payload.index(0, 1)
server_version = payload[1:nul_pos].decode()
# Connection ID (4 bytes)
conn_id = struct.unpack('<I', payload[nul_pos+1:nul_pos+5])[0]
# Auth plugin data (salt) - first 8 bytes
salt_part1 = payload[nul_pos+5:nul_pos+13]
# ... more fields
return {
'protocol_version': protocol_version,
'server_version': server_version,
'connection_id': conn_id,
'salt': salt_part1,
}
Query execution:
def query(self, sql):
"""Execute a query and return results"""
# Send COM_QUERY
payload = bytes([0x03]) + sql.encode('utf-8') # 0x03 = COM_QUERY
self.packet.sequence_id = 0
self.packet.write_packet(self.sock, payload)
# Read column count
seq_id, payload = self.packet.read_packet(self.sock)
if payload[0] == 0xff: # Error
raise Exception(self.parse_error(payload))
column_count = self.read_length_encoded_int(payload)
# Read column definitions
columns = []
for _ in range(column_count):
seq_id, payload = self.packet.read_packet(self.sock)
columns.append(self.parse_column_definition(payload))
# Read EOF packet (deprecated in newer protocol)
seq_id, payload = self.packet.read_packet(self.sock)
# Read rows until EOF
rows = []
while True:
seq_id, payload = self.packet.read_packet(self.sock)
if payload[0] == 0xfe: # EOF
break
rows.append(self.parse_row(payload, columns))
return rows
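The query path above leans on a read_length_encoded_int helper. A minimal sketch of MySQL’s length-encoded integer format (the standalone function below returns the value plus the bytes consumed, which the client code would need to adapt):
import struct

def read_length_encoded_int(data, offset=0):
    """Decode a MySQL length-encoded integer; returns (value, bytes_consumed)."""
    first = data[offset]
    if first < 0xfb:                      # 1-byte integer: the byte is the value
        return first, 1
    if first == 0xfc:                     # 2-byte integer follows
        return struct.unpack('<H', data[offset+1:offset+3])[0], 3
    if first == 0xfd:                     # 3-byte integer follows
        return struct.unpack('<I', data[offset+1:offset+4] + b'\x00')[0], 4
    if first == 0xfe:                     # 8-byte integer follows
        return struct.unpack('<Q', data[offset+1:offset+9])[0], 9
    raise ValueError("0xfb marks NULL in row data and 0xff marks an error packet")

# e.g. the column-count packet for a 3-column result set is just b'\x03':
# read_length_encoded_int(b'\x03')  ->  (3, 1)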
Learning milestones:
- You complete handshake → You understand connection establishment
- You execute queries → You understand the text protocol
- You use prepared statements → You understand the binary protocol
- You handle errors → You understand error packets
Project 8: Binlog Replication Subscriber
- File: LEARN_MYSQL_DEEP_DIVE.md
- Main Programming Language: Go
- Alternative Programming Languages: Python, Java
- Coolness Level: Level 4: Hardcore Tech Flex
- Business Potential: 4. The “Open Core” Infrastructure
- Difficulty: Level 3: Advanced
- Knowledge Area: Replication / CDC / Event Streaming
- Software or Tool: MySQL Binlog
- Main Book: “High Performance MySQL” by Silvia Botros
What you’ll build: A binlog replication subscriber that connects to MySQL as a replica, receives binlog events, parses them, and streams database changes to external systems (like Elasticsearch, Kafka, or a data warehouse).
Why it teaches MySQL: Replication is how MySQL achieves high availability and read scaling. Understanding binlog format and replication protocol explains CDC (Change Data Capture), read replicas, and database migration tools.
Core challenges you’ll face:
- Connecting as replica → maps to replication protocol handshake
- Parsing binlog events → maps to ROW vs STATEMENT format
- Handling DDL → maps to schema changes
- Position tracking → maps to GTID vs file/position
Key Concepts:
- Binlog Replication: MySQL Replication
- Binlog Formats: Row vs Statement
- GTID: Global Transaction Identifiers
Difficulty: Advanced Time estimate: 2-3 weeks Prerequisites: Projects 1-7 completed. Understanding of replication concepts.
Real world outcome:
MySQL Binlog Subscriber
Connected to MySQL 8.0.35 at localhost:3306
Binlog format: ROW
Starting from: mysql-bin.000003:154
Streaming events...
[2024-01-15 10:00:01] GTID: 3E11FA47-71CA-11E1-9E33-C80AA9429562:123
[2024-01-15 10:00:01] TABLE_MAP: test.users (table_id: 42)
[2024-01-15 10:00:01] WRITE_ROWS:
INSERT INTO users: {id: 1001, name: 'Alice', email: 'alice@example.com'}
→ Forwarded to Elasticsearch index 'users'
[2024-01-15 10:00:02] GTID: 3E11FA47-71CA-11E1-9E33-C80AA9429562:124
[2024-01-15 10:00:02] TABLE_MAP: test.orders (table_id: 43)
[2024-01-15 10:00:02] UPDATE_ROWS:
UPDATE orders SET status='shipped' WHERE id=5000
Before: {id: 5000, status: 'pending', amount: 99.99}
After: {id: 5000, status: 'shipped', amount: 99.99}
→ Published to Kafka topic 'order-events'
Events processed: 1,234
Current position: mysql-bin.000003:98765
Lag: 0.5 seconds
Implementation Hints:
Connecting as a replica:
package main
import (
	"context"
	"fmt"

	"github.com/go-mysql-org/go-mysql/mysql"
	"github.com/go-mysql-org/go-mysql/replication"
)
func main() {
cfg := replication.BinlogSyncerConfig{
ServerID: 100, // Unique replica ID
Flavor: "mysql",
Host: "localhost",
Port: 3306,
User: "replication_user",
Password: "password",
}
syncer := replication.NewBinlogSyncer(cfg)
// Start from a specific position
streamer, _ := syncer.StartSync(mysql.Position{
Name: "mysql-bin.000003",
Pos: 154,
})
// Or use GTID
// gtidSet, _ := mysql.ParseGTIDSet("mysql",
// "3E11FA47-71CA-11E1-9E33-C80AA9429562:1-123")
// streamer, _ := syncer.StartSyncGTID(gtidSet)
for {
event, _ := streamer.GetEvent(context.Background())
handleEvent(event)
}
}
func handleEvent(event *replication.BinlogEvent) {
switch e := event.Event.(type) {
case *replication.RotateEvent:
fmt.Printf("Rotate to: %s\n", string(e.NextLogName))
case *replication.TableMapEvent:
fmt.Printf("Table: %s.%s (id: %d)\n",
e.Schema, e.Table, e.TableID)
case *replication.RowsEvent:
for _, row := range e.Rows {
switch event.Header.EventType {
case replication.WRITE_ROWS_EVENTv2:
handleInsert(row)
case replication.UPDATE_ROWS_EVENTv2:
// e.Rows alternates before-image and after-image rows
handleUpdate(row)
case replication.DELETE_ROWS_EVENTv2:
handleDelete(row)
}
}
case *replication.QueryEvent:
// DDL statements
fmt.Printf("DDL: %s\n", string(e.Query))
}
}
Implementing from scratch (understanding the protocol):
# Binlog event header (common to all events)
class BinlogEventHeader:
def parse(self, data):
self.timestamp = struct.unpack('<I', data[0:4])[0]
self.type_code = data[4]
self.server_id = struct.unpack('<I', data[5:9])[0]
self.event_length = struct.unpack('<I', data[9:13])[0]
self.next_position = struct.unpack('<I', data[13:17])[0]
self.flags = struct.unpack('<H', data[17:19])[0]
# Row event types
WRITE_ROWS_EVENT = 30
UPDATE_ROWS_EVENT = 31
DELETE_ROWS_EVENT = 32
class RowsEvent:
def parse(self, data, table_map):
self.table_id = struct.unpack('<Q', data[0:6] + b'\x00\x00')[0]
# ... parse column bitmap and row data
# Row format depends on table schema from TABLE_MAP event
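Before any events arrive, a from-scratch subscriber has to ask the server to start streaming: it picks a unique server_id and sends a COM_BINLOG_DUMP command naming a binlog file and offset (real replicas also negotiate checksums and may send COM_REGISTER_SLAVE first; both are skipped here). A sketch of building that command on top of the Project 7 packet framing:
import struct

COM_BINLOG_DUMP = 0x12

def build_binlog_dump(binlog_file, position, server_id):
    """COM_BINLOG_DUMP payload: command byte, 4-byte position, 2-byte flags,
    4-byte server_id, then the binlog file name (rest of the packet)."""
    return (bytes([COM_BINLOG_DUMP])
            + struct.pack('<I', position)      # offset to start from within the file
            + struct.pack('<H', 0)             # flags = 0: block and wait for new events
            + struct.pack('<I', server_id)     # this subscriber's unique server_id
            + binlog_file.encode('utf-8'))

# Hypothetical glue with the Project 7 client:
# client.packet.sequence_id = 0
# client.packet.write_packet(client.sock,
#                            build_binlog_dump("mysql-bin.000003", 154, server_id=100))
# From here on, each packet from the server is an OK byte (0x00) followed by one
# binlog event, whose 19-byte header is what BinlogEventHeader above parses.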
Learning milestones:
- You connect as replica → You understand replication protocol
- You parse row events → You understand binlog format
- You handle schema changes → You understand DDL replication
- You build a CDC pipeline → You can replicate to external systems
Project 9: Lock and Deadlock Visualizer
- File: LEARN_MYSQL_DEEP_DIVE.md
- Main Programming Language: Python
- Alternative Programming Languages: Go, JavaScript
- Coolness Level: Level 3: Genuinely Clever
- Business Potential: 2. The “Micro-SaaS / Pro Tool”
- Difficulty: Level 2: Intermediate
- Knowledge Area: Locking / Deadlocks / Concurrency
- Software or Tool: MySQL + Custom visualization
- Main Book: “High Performance MySQL” by Silvia Botros
What you’ll build: A tool that monitors InnoDB locks in real-time, visualizes lock wait graphs, detects deadlock cycles, and explains why specific queries are blocking.
Why it teaches MySQL: Understanding InnoDB’s locking behavior is essential for debugging production issues. This project teaches you about row locks, gap locks, next-key locks, and why deadlocks happen.
Core challenges you’ll face:
- Querying lock information → maps to performance_schema and innodb_lock_waits
- Building wait-for graphs → maps to detecting cycles
- Understanding lock types → maps to shared, exclusive, gap, next-key
- Explaining lock conflicts → maps to which indexes cause which locks
Key Concepts:
- InnoDB Locking: InnoDB Locking
- Deadlock Detection: Deadlocks
- Gap Locks: Gap Locking
Difficulty: Intermediate Time estimate: 1-2 weeks Prerequisites: Projects 1-4 completed. Understanding of transaction isolation.
Real world outcome:
InnoDB Lock Visualizer
Current Locks:
┌─────────────────────────────────────────────────────────────────┐
│ Transaction │ Lock Type │ Table │ Index │ Lock Data │
├─────────────┼────────────┼──────────┼─────────┼────────────────┤
│ 12345 │ X (excl.) │ orders │ PRIMARY │ id=100 │
│ 12346 │ S (share) │ orders │ PRIMARY │ id=100 WAITING │
│ 12347 │ X,GAP │ orders │ idx_user│ user_id=42,GAP │
└─────────────────────────────────────────────────────────────────┘
Wait-For Graph:
┌──────────────┐
│ TXN 12346 │──waits for──→┌──────────────┐
│ SELECT...FOR │ │ TXN 12345 │
│ UPDATE │ │ UPDATE orders│
└──────────────┘ └──────────────┘
DEADLOCK DETECTED:
┌──────────────┐ ┌──────────────┐
│ TXN 12350 │──waits for──→│ TXN 12351 │
│ │ │ │
│ Has lock on: │ │ Has lock on: │
│ orders id=1 │ │ users id=5 │
│ │ │ │
│ Wants lock: │ │ Wants lock: │
│ users id=5 │◄──waits for──│ orders id=1 │
└──────────────┘ └──────────────┘
Victim: TXN 12351 (rolled back)
Cause: Classic cycle - each transaction holds what the other wants
Implementation Hints:
Querying lock information:
import mysql.connector
from collections import defaultdict
class LockMonitor:
def __init__(self, connection):
self.conn = connection
def get_current_locks(self):
"""Get all current InnoDB locks"""
cursor = self.conn.cursor(dictionary=True)
cursor.execute("""
SELECT
r.trx_id AS waiting_trx_id,
r.trx_query AS waiting_query,
b.trx_id AS blocking_trx_id,
b.trx_query AS blocking_query,
                l.OBJECT_NAME AS locked_table,
                l.INDEX_NAME AS locked_index,
                l.LOCK_TYPE AS lock_type,
                l.LOCK_MODE AS lock_mode,
                l.LOCK_DATA AS lock_data
FROM performance_schema.data_lock_waits w
JOIN performance_schema.data_locks l
ON w.BLOCKING_ENGINE_LOCK_ID = l.ENGINE_LOCK_ID
JOIN information_schema.innodb_trx r
ON w.REQUESTING_ENGINE_TRANSACTION_ID = r.trx_id
JOIN information_schema.innodb_trx b
ON w.BLOCKING_ENGINE_TRANSACTION_ID = b.trx_id
""")
return cursor.fetchall()
def build_wait_graph(self, locks):
"""Build a wait-for graph from lock data"""
graph = defaultdict(set)
for lock in locks:
graph[lock['waiting_trx_id']].add(lock['blocking_trx_id'])
return graph
def detect_deadlock(self, graph):
"""Find cycles in wait-for graph"""
visited = set()
path = []
def dfs(node):
if node in path:
# Found cycle!
cycle_start = path.index(node)
return path[cycle_start:]
if node in visited:
return None
visited.add(node)
path.append(node)
for neighbor in graph[node]:
result = dfs(neighbor)
if result:
return result
path.pop()
return None
for node in graph:
result = dfs(node)
if result:
return result
return None
Understanding gap locks:
def explain_gap_lock(self, lock_info):
"""Explain why a gap lock was taken"""
if 'GAP' in lock_info['lock_mode']:
return f"""
Gap Lock Explanation:
- A gap lock prevents insertions in a range
- This lock is on index: {lock_info['locked_index']}
- Gap is before: {lock_info['lock_data']}
Why it happens:
- REPEATABLE READ isolation level needs gap locks to prevent phantom reads
- A query like: SELECT * FROM t WHERE x BETWEEN 10 AND 20 FOR UPDATE
locks all rows AND the gaps between them
How to avoid:
- Use READ COMMITTED isolation (no gap locks)
- Use unique lookups instead of range scans
- Add appropriate indexes to reduce lock scope
"""
Learning milestones:
- You query lock tables → You understand where to find lock info
- You build wait-for graphs → You understand lock dependencies
- You detect deadlocks → You understand the deadlock algorithm
- You explain gap locks → You understand REPEATABLE READ locking
Project 10: MySQL vs PostgreSQL Comparison Lab
- File: LEARN_MYSQL_DEEP_DIVE.md
- Main Programming Language: SQL + Python
- Alternative Programming Languages: Go, Bash
- Coolness Level: Level 3: Genuinely Clever
- Business Potential: 2. The “Micro-SaaS / Pro Tool”
- Difficulty: Level 2: Intermediate
- Knowledge Area: Database Comparison / Benchmarking
- Software or Tool: MySQL, PostgreSQL, SQLite
- Main Book: “Designing Data-Intensive Applications” by Martin Kleppmann
What you’ll build: A comprehensive comparison lab that runs the same workloads on MySQL, PostgreSQL, and SQLite, measuring performance, consistency behavior, and feature differences.
Why it teaches MySQL: Understanding MySQL’s tradeoffs requires comparing it to alternatives. This project reveals why MySQL makes certain design choices and when to choose which database.
Core challenges you’ll face:
- Setting up equivalent schemas → maps to data type differences
- Measuring performance fairly → maps to benchmark design
- Testing isolation levels → maps to consistency behavior differences
- Comparing features → maps to JSON, arrays, full-text, etc.
Key Concepts:
- Comparison: Postgres vs MySQL
- Benchmarking: TPC-C and TPC-H benchmarks
- Feature Matrix: db-engines comparison
Difficulty: Intermediate Time estimate: 2 weeks Prerequisites: Projects 1-9 completed. Basic PostgreSQL knowledge.
Real world outcome:
Database Comparison Report
=== Performance Benchmarks ===
OLTP Workload (TPC-C like):
┌──────────────────┬─────────────┬─────────────┬─────────────┐
│ Metric │ MySQL │ PostgreSQL │ SQLite │
├──────────────────┼─────────────┼─────────────┼─────────────┤
│ Transactions/sec │ 12,500 │ 10,800 │ 45,000* │
│ Read latency p99 │ 2.1 ms │ 2.8 ms │ 0.1 ms* │
│ Write latency p99│ 5.2 ms │ 4.8 ms │ 0.5 ms* │
└──────────────────┴─────────────┴─────────────┴─────────────┘
* SQLite is embedded, so no network latency
Complex Query Workload (TPC-H like):
┌──────────────────┬─────────────┬─────────────┐
│ Query │ MySQL │ PostgreSQL │
├──────────────────┼─────────────┼─────────────┤
│ Aggregation │ 1.2 sec │ 0.8 sec │
│ Join (5 tables) │ 3.5 sec │ 2.1 sec │
│ Window functions │ 2.8 sec │ 1.5 sec │
│ CTE recursive │ 4.2 sec │ 2.8 sec │
└──────────────────┴─────────────┴─────────────┘
→ PostgreSQL wins on complex analytical queries
=== Isolation Level Behavior ===
Phantom Read Test (REPEATABLE READ):
┌──────────────────┬─────────────┬─────────────┐
│ │ MySQL │ PostgreSQL │
├──────────────────┼─────────────┼─────────────┤
│ Phantoms blocked │ Yes (gap │ Yes (SSI │
│ │ locking) │ serializable│
│ Method │ Pessimistic │ Optimistic │
│ Blocking? │ Yes │ No* │
└──────────────────┴─────────────┴─────────────┘
* PostgreSQL uses serialization failure instead
=== Feature Comparison ===
JSON Operations:
MySQL: JSON_EXTRACT(col, '$.field') - functional indexes available
PostgreSQL: col->>'field' (JSONB) - GIN indexes, more operators
Arrays:
MySQL: Not supported (use JSON arrays)
PostgreSQL: Native ARRAY type with operators
Full-Text Search:
MySQL: FULLTEXT indexes (basic)
PostgreSQL: ts_vector, ts_query, GIN indexes (advanced)
Implementation Hints:
Benchmark framework:
import random
import time
import mysql.connector
import psycopg2
import sqlite3
class DatabaseBenchmark:
def __init__(self, db_type, config):
self.db_type = db_type
if db_type == 'mysql':
self.conn = mysql.connector.connect(**config)
elif db_type == 'postgres':
self.conn = psycopg2.connect(**config)
elif db_type == 'sqlite':
self.conn = sqlite3.connect(config['database'])
    def run_query(self, sql, params=None):
        # Note: mysql.connector and psycopg2 use %s placeholders, sqlite3 uses ?,
        # so queries need minor adaptation per driver.
        cursor = self.conn.cursor()
start = time.perf_counter()
cursor.execute(sql, params or ())
result = cursor.fetchall()
elapsed = time.perf_counter() - start
return result, elapsed
def benchmark_oltp(self, iterations=10000):
"""Simple OLTP benchmark"""
results = {'reads': [], 'writes': []}
for i in range(iterations):
# Read
_, elapsed = self.run_query(
"SELECT * FROM users WHERE id = %s",
(random.randint(1, 100000),)
)
results['reads'].append(elapsed)
# Write
_, elapsed = self.run_query(
"UPDATE accounts SET balance = balance + 1 WHERE id = %s",
(random.randint(1, 100000),)
)
results['writes'].append(elapsed)
self.conn.commit()
return {
'read_avg': sum(results['reads']) / len(results['reads']),
'read_p99': sorted(results['reads'])[int(len(results['reads']) * 0.99)],
'write_avg': sum(results['writes']) / len(results['writes']),
'write_p99': sorted(results['writes'])[int(len(results['writes']) * 0.99)],
}
Isolation level comparison:
def test_phantom_reads(db):
"""Test if phantom reads are possible"""
    # Transaction 1: set the isolation level, then start and read
    # (MySQL rejects changing the isolation level once a transaction is open)
    db.execute("SET SESSION TRANSACTION ISOLATION LEVEL REPEATABLE READ")
    db.execute("BEGIN")
result1 = db.query("SELECT COUNT(*) FROM test WHERE value BETWEEN 1 AND 100")
# Transaction 2: Insert in another connection
db2 = get_connection()
db2.execute("INSERT INTO test (value) VALUES (50)")
db2.commit()
# Transaction 1: Read again
result2 = db.query("SELECT COUNT(*) FROM test WHERE value BETWEEN 1 AND 100")
db.commit()
return {
'phantom_possible': result1[0] != result2[0],
'before': result1[0],
'after': result2[0]
}
Learning milestones:
- You run equivalent benchmarks → You understand performance tradeoffs
- You test isolation differences → You understand consistency models
- You compare features → You know when to choose which database
- You document tradeoffs → You can make informed architecture decisions
Project 11: Query Profiler and Slow Query Analyzer
- File: LEARN_MYSQL_DEEP_DIVE.md
- Main Programming Language: Python
- Alternative Programming Languages: Go, JavaScript
- Coolness Level: Level 3: Genuinely Clever
- Business Potential: 3. The “Service & Support” Model
- Difficulty: Level 2: Intermediate
- Knowledge Area: Performance Tuning / Profiling / Optimization
- Software or Tool: MySQL + Performance Schema
- Main Book: “High Performance MySQL” by Silvia Botros
What you’ll build: A query profiler that analyzes slow query logs, parses EXPLAIN output, identifies missing indexes, and provides actionable optimization recommendations.
Why it teaches MySQL: Query optimization is the most practical MySQL skill. This project teaches you to read EXPLAIN plans, understand query costs, and fix performance problems systematically.
Core challenges you’ll face:
- Parsing slow query logs → maps to log format understanding
- Analyzing EXPLAIN output → maps to understanding access types
- Index recommendations → maps to which indexes would help
- Query rewriting → maps to transforming slow queries
Key Concepts:
- EXPLAIN: Optimizing with EXPLAIN
- Slow Query Log: Slow Query Log
- Index Usage: “High Performance MySQL” Chapters 5-6
Difficulty: Intermediate Time estimate: 2 weeks Prerequisites: Projects 1-4 completed. Basic query optimization knowledge.
Real world outcome:
Query Profiler Report
Analyzed: 10,000 queries from slow query log
Time range: 2024-01-01 to 2024-01-15
=== Top 10 Slowest Queries ===
1. SELECT * FROM orders WHERE status = 'pending' AND created_at > '2024-01-01'
Avg time: 4.5s | Executions: 1,234 | Total time: 5,553s
EXPLAIN Analysis:
┌────────┬─────────────┬──────────────┬─────────┬──────────────┐
│ table │ type │ possible_keys│ key │ rows examined│
├────────┼─────────────┼──────────────┼─────────┼──────────────┤
│ orders │ ALL │ idx_status │ NULL │ 5,000,000 │
└────────┴─────────────┴──────────────┴─────────┴──────────────┘
⚠️ PROBLEMS DETECTED:
• Full table scan (type: ALL)
• Index exists but not used
• Examining 5M rows to return 1,234
💡 RECOMMENDATIONS:
1. Add composite index: CREATE INDEX idx_status_created
ON orders(status, created_at)
2. Or rewrite query to use existing index better
Expected improvement: 4.5s → 0.05s (90x faster)
=== Index Recommendations Summary ===
Tables needing indexes:
• orders: (status, created_at) - would help 45 queries
• users: (country, created_at) - would help 23 queries
• products: (category_id, price) - would help 18 queries
=== Query Patterns ===
Anti-patterns detected:
• SELECT * usage: 234 queries (retrieve only needed columns)
• OFFSET pagination: 56 queries (use keyset pagination)
• OR conditions: 89 queries (consider UNION)
• Functions on indexed columns: 34 queries (can't use index)
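For instance, the OFFSET anti-pattern flagged above can usually be rewritten as keyset (seek) pagination: instead of making the server read and discard every skipped row, remember the last key you returned and continue from it. A sketch against an illustrative orders table:
# OFFSET pagination: the server still reads and discards every skipped row,
# so the OFFSET 100000 query below examines ~100,010 rows to return 10.
OFFSET_PAGE = """
    SELECT id, status, created_at FROM orders
    ORDER BY id
    LIMIT 10 OFFSET 100000
"""

# Keyset pagination: continue from the last id seen on the previous page.
# An index on the ORDER BY column (here the primary key) satisfies this directly.
KEYSET_PAGE = """
    SELECT id, status, created_at FROM orders
    WHERE id > %s          -- last id from the previous page
    ORDER BY id
    LIMIT 10
"""

def fetch_page(cursor, last_seen_id):
    cursor.execute(KEYSET_PAGE, (last_seen_id,))
    rows = cursor.fetchall()
    next_cursor = rows[-1][0] if rows else None   # id of the last row returned
    return rows, next_cursor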
Implementation Hints:
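First, make sure the server is actually writing the slow query log. A minimal setup sketch using the slow_query_log, long_query_time, and log_output system variables (the 1-second threshold and the admin credentials are illustrative; note that SET GLOBAL long_query_time only affects connections opened afterwards):
import mysql.connector

# Assumed admin connection; adjust credentials for your environment.
conn = mysql.connector.connect(host='localhost', user='root', password='')
cur = conn.cursor()

# Log any statement slower than 1 second to a file.
cur.execute("SET GLOBAL slow_query_log = 'ON'")
cur.execute("SET GLOBAL long_query_time = 1")
cur.execute("SET GLOBAL log_output = 'FILE'")

# Find where the server writes the log, so the parser below can read it.
cur.execute("SHOW VARIABLES LIKE 'slow_query_log_file'")
print(cur.fetchone())
conn.close()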
Slow query log parser:
import re
from dataclasses import dataclass
from typing import List
@dataclass
class SlowQuery:
    timestamp: str
    user: str
    host: str
    query_time: float
    lock_time: float
    rows_sent: int
    rows_examined: int
    query: str

def parse_slow_log(log_file) -> List[SlowQuery]:
    queries = []
    current = {}
    with open(log_file) as f:
        for line in f:
            if line.startswith('# Time:'):
                if current.get('query'):
                    queries.append(SlowQuery(**current))
                current = {'timestamp': line.split(':', 1)[1].strip()}
            elif line.startswith('# User@Host:'):
                match = re.match(r'# User@Host: (\w+)\[.*\] @ (\S+)', line)
                current['user'] = match.group(1)
                current['host'] = match.group(2)
            elif line.startswith('# Query_time:'):
                match = re.match(
                    r'# Query_time: ([\d.]+)\s+Lock_time: ([\d.]+)\s+'
                    r'Rows_sent: (\d+)\s+Rows_examined: (\d+)', line)
                current['query_time'] = float(match.group(1))
                current['lock_time'] = float(match.group(2))
                current['rows_sent'] = int(match.group(3))
                current['rows_examined'] = int(match.group(4))
            elif not line.startswith('#'):
                current['query'] = current.get('query', '') + line
    # Flush the final entry: there is no trailing '# Time:' line to trigger it.
    if current.get('query'):
        queries.append(SlowQuery(**current))
    return queries
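With the parser in place, the “Top 10 Slowest Queries” section of the report is mostly aggregation. A sketch that normalizes literals into a crude fingerprint so repeated executions of the same statement group together:
import re
from collections import defaultdict

def fingerprint(sql):
    """Collapse literals so 'WHERE id = 42' and 'WHERE id = 7' group together."""
    sql = re.sub(r"'[^']*'", '?', sql)     # string literals
    sql = re.sub(r'\b\d+\b', '?', sql)     # numeric literals
    return ' '.join(sql.split()).upper()

def top_slowest(queries, n=10):
    stats = defaultdict(lambda: {'count': 0, 'total': 0.0, 'sample': ''})
    for q in queries:
        s = stats[fingerprint(q.query)]
        s['count'] += 1
        s['total'] += q.query_time
        s['sample'] = q.query.strip()
    ranked = sorted(stats.values(), key=lambda s: s['total'], reverse=True)
    for s in ranked[:n]:
        avg = s['total'] / s['count']
        print(f"{s['sample'][:80]}\n  Avg: {avg:.2f}s | Executions: {s['count']} | Total: {s['total']:.0f}s")

# queries = parse_slow_log('/var/log/mysql/slow.log')   # path is illustrative
# top_slowest(queries)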
EXPLAIN analyzer:
class ExplainAnalyzer:
    def analyze(self, explain_rows):
        problems = []
        recommendations = []
        for row in explain_rows:
            # Check access type
            access_type = row['type']
            if access_type == 'ALL':
                problems.append({
                    'severity': 'HIGH',
                    'issue': 'Full table scan',
                    'table': row['table'],
                    'rows': row['rows']
                })
            elif access_type == 'index':
                problems.append({
                    'severity': 'MEDIUM',
                    'issue': 'Full index scan',
                    'table': row['table']
                })
            # Check key usage
            if row['possible_keys'] and not row['key']:
                problems.append({
                    'severity': 'HIGH',
                    'issue': 'Index available but not used',
                    'table': row['table'],
                    'available_keys': row['possible_keys']
                })
            # Check for filesort
            if 'filesort' in (row.get('Extra') or '').lower():
                problems.append({
                    'severity': 'MEDIUM',
                    'issue': 'Using filesort (sorting in memory/disk)',
                    'table': row['table']
                })
            # Check rows examined vs rows that actually survive the filter
            if row['rows'] > 1000 and 'filtered' in row:
                filter_pct = float(row['filtered'])
                if filter_pct < 10:
                    problems.append({
                        'severity': 'HIGH',
                        'issue': f'Examining many rows but returning few ({filter_pct}%)',
                        'table': row['table']
                    })
        return problems, recommendations

    def suggest_index(self, query, schema):
        """Suggest a composite index based on query structure.

        extract_where_columns / extract_order_columns are helpers left to
        implement (e.g. with the sqlparse library or a simple regex).
        """
        where_columns = self.extract_where_columns(query)
        order_columns = self.extract_order_columns(query)
        # Composite index rule of thumb: WHERE (equality) columns first,
        # then ORDER BY columns, so one index can satisfy both.
        suggested = where_columns + [c for c in order_columns if c not in where_columns]
        return suggested
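To feed the analyzer, run EXPLAIN through a dictionary cursor so each row is keyed by the column names the checks above expect (table, type, possible_keys, key, rows, filtered, Extra). A sketch assuming a mysql.connector connection; the database name and example query are illustrative and reuse the report above:
import mysql.connector

conn = mysql.connector.connect(host='localhost', user='root', password='', database='shop')
cur = conn.cursor(dictionary=True)   # rows come back as dicts keyed by column name

slow_sql = "SELECT * FROM orders WHERE status = 'pending' AND created_at > '2024-01-01'"
cur.execute("EXPLAIN " + slow_sql)
explain_rows = cur.fetchall()

analyzer = ExplainAnalyzer()
problems, _ = analyzer.analyze(explain_rows)
for p in problems:
    print(f"[{p['severity']}] {p['issue']} on {p['table']}")
conn.close()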
Learning milestones:
- You parse slow query logs → You understand query profiling
- You analyze EXPLAIN output → You understand access types
- You recommend indexes → You understand index design
- You identify anti-patterns → You can prevent problems
Project 12: Build a Mini-MySQL (Capstone)
- File: LEARN_MYSQL_DEEP_DIVE.md
- Main Programming Language: C or Rust
- Alternative Programming Languages: Go, C++
- Coolness Level: Level 5: Pure Magic (Super Cool)
- Business Potential: 1. The “Resume Gold”
- Difficulty: Level 5: Master
- Knowledge Area: Database Implementation / Systems Programming
- Software or Tool: Custom implementation
- Main Book: “Database Internals” by Alex Petrov
What you’ll build: A mini relational database from scratch that integrates all previous projects: B+Tree storage, SQL parsing, query execution, MVCC transactions, WAL recovery, and the MySQL wire protocol.
Why it teaches everything: This capstone project forces you to understand how all database components work together. Building a complete (if simple) database proves deep understanding of MySQL internals.
Core components (wired together in the sketch after this list):
- Storage Layer: B+Tree pages on disk (Project 2)
- Buffer Pool: Page caching with LRU (Project 5)
- SQL Parser: Parse and plan queries (Project 3)
- Query Executor: Execute plans (Project 1)
- Transaction Manager: MVCC isolation (Project 4)
- Recovery: WAL and crash recovery (Project 6)
- Wire Protocol: MySQL-compatible interface (Project 7)
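The list above reads as separate projects, but in the capstone they form a single request path: the connection handler parses a statement, the executor asks the transaction manager for a transaction id, logs intent to the WAL, and reads or writes pages through the buffer pool backed by the storage layer. A deliberately tiny Python sketch of that wiring (every class and method name here is made up for illustration; the real capstone would be in C or Rust):
# Toy composition of the capstone's layers; every name below is illustrative.
class Storage:                      # Project 2: B+Tree pages on disk (here: an in-memory dict)
    def __init__(self):
        self.pages = {0: []}
    def read_page(self, page_id):
        return self.pages[page_id]

class BufferPool:                   # Project 5: caches pages read from the storage layer
    def __init__(self, storage):
        self.storage, self.cache = storage, {}
    def get_page(self, page_id):
        if page_id not in self.cache:
            self.cache[page_id] = self.storage.read_page(page_id)
        return self.cache[page_id]

class WAL:                          # Project 6: append a log record before any page change
    def __init__(self):
        self.records = []
    def log(self, record):
        self.records.append(record)

class TransactionManager:           # Project 4: hands out transaction ids used for MVCC visibility
    def __init__(self):
        self.next_txid = 1
    def begin(self):
        txid, self.next_txid = self.next_txid, self.next_txid + 1
        return txid

class Executor:                     # Projects 1 & 3: run an already-parsed statement
    def __init__(self, pool, wal, txns):
        self.pool, self.wal, self.txns = pool, wal, txns
    def execute(self, stmt):
        txid = self.txns.begin()
        if stmt['op'] == 'INSERT':
            self.wal.log((txid, 'INSERT', stmt['row']))   # WAL first, then modify the page
            self.pool.get_page(0).append((txid, stmt['row']))
            return 'Query OK, 1 row affected'
        if stmt['op'] == 'SELECT':
            return [row for _, row in self.pool.get_page(0)]

# The connection handler (Project 7) would parse SQL into stmt dicts and call:
storage = Storage()
executor = Executor(BufferPool(storage), WAL(), TransactionManager())
print(executor.execute({'op': 'INSERT', 'row': (1, 'Alice', 'alice@example.com')}))
print(executor.execute({'op': 'SELECT'}))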
Real world outcome:
$ ./minidb --datadir=/tmp/minidb
MiniDB v0.1.0 - A Learning Database
Listening on port 3306...
# Connect with standard MySQL client!
$ mysql -h localhost -u root minidb
mysql> CREATE TABLE users (
    ->   id INT PRIMARY KEY,
    ->   name VARCHAR(100),
    ->   email VARCHAR(100)
    -> );
Query OK, 0 rows affected (0.01 sec)
mysql> INSERT INTO users VALUES (1, 'Alice', 'alice@example.com');
Query OK, 1 row affected (0.00 sec)
mysql> SELECT * FROM users WHERE id = 1;
+----+-------+-------------------+
| id | name | email |
+----+-------+-------------------+
| 1 | Alice | alice@example.com |
+----+-------+-------------------+
1 row in set (0.00 sec)
mysql> BEGIN;
mysql> UPDATE users SET name = 'Alice Smith' WHERE id = 1;
mysql> -- In another session, the old value is still visible (MVCC!)
mysql> COMMIT;
# Simulate crash and recovery
$ kill -9 $(pidof minidb)
$ ./minidb --datadir=/tmp/minidb
Recovering from WAL...
Replaying 15 log records...
Recovery complete in 0.1s
All committed transactions preserved.
Time estimate: 2-3 months. Prerequisites: All 11 previous projects completed.
Project Comparison Table
| Project | Difficulty | Time | Depth of Understanding | Fun Factor |
|---|---|---|---|---|
| 1. SQL Query Executor | Intermediate | 2 weeks | ⭐⭐⭐ | ⭐⭐⭐⭐ |
| 2. B+Tree Implementation | Advanced | 3-4 weeks | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐ |
| 3. Query Optimizer | Advanced | 2-3 weeks | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐ |
| 4. MVCC Transaction Manager | Advanced | 2-3 weeks | ⭐⭐⭐⭐⭐ | ⭐⭐⭐ |
| 5. Buffer Pool Manager | Advanced | 2-3 weeks | ⭐⭐⭐⭐ | ⭐⭐⭐ |
| 6. WAL and Crash Recovery | Expert | 3-4 weeks | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐ |
| 7. MySQL Protocol | Intermediate | 1-2 weeks | ⭐⭐⭐ | ⭐⭐⭐⭐ |
| 8. Binlog Replication | Advanced | 2-3 weeks | ⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ |
| 9. Lock Visualizer | Intermediate | 1-2 weeks | ⭐⭐⭐⭐ | ⭐⭐⭐⭐ |
| 10. DB Comparison Lab | Intermediate | 2 weeks | ⭐⭐⭐⭐ | ⭐⭐⭐⭐ |
| 11. Query Profiler | Intermediate | 2 weeks | ⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ |
| 12. Mini-MySQL | Master | 2-3 months | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ |
Recommended Learning Path
For practical MySQL skills (DBA/Backend Developer):
- Project 11 (Query Profiler) - Immediately useful
- Project 9 (Lock Visualizer) - Debug production issues
- Project 3 (Query Optimizer) - Understand EXPLAIN
- Project 10 (Comparison Lab) - Make architecture decisions
For deep understanding (Database Engineer):
- Project 2 (B+Tree) - Foundation of all indexing
- Project 4 (MVCC) - How transactions work
- Project 5 (Buffer Pool) - How caching works
- Project 6 (WAL) - How durability works
For building database tools (Developer):
- Project 7 (Protocol) - Build drivers/proxies
- Project 8 (Binlog) - Build CDC pipelines
- Project 1 (SQL Executor) - Build query tools
Quick path (2-3 weeks):
- Project 1 (1 week) - Understand SQL fundamentals
- Project 11 (1 week) - Practical optimization skills
- Project 9 (3-4 days) - Debug locking issues
Essential Resources
Books
- High Performance MySQL, 4th Edition by Silvia Botros and Jeremy Tinley - The definitive MySQL performance guide
- Efficient MySQL Performance by Daniel Nichter - Bridges everyday usage and advanced performance work
- Database Internals by Alex Petrov - Storage engine internals
- Designing Data-Intensive Applications by Martin Kleppmann - Distributed data concepts
Official Documentation
- MySQL Reference Manual - Comprehensive official docs
- InnoDB Documentation - Storage engine details
- MySQL Internals - For source code exploration
Online Resources
- Use The Index, Luke - SQL indexing tutorial
- Percona Blog - MySQL performance articles
- Planet MySQL - MySQL community aggregator
Source Code
- MySQL Source - Official MySQL
- MariaDB Source - MariaDB fork
- go-mysql - Go MySQL tools
Summary
| # | Project | Main Programming Language |
|---|---|---|
| 1 | SQL Query Executor | Python |
| 2 | B+Tree Index Implementation | C |
| 3 | Query Optimizer Simulator | Python |
| 4 | MVCC Transaction Manager | Python |
| 5 | Buffer Pool Manager | C |
| 6 | WAL and Crash Recovery | C |
| 7 | MySQL Protocol Implementation | Python |
| 8 | Binlog Replication Subscriber | Go |
| 9 | Lock and Deadlock Visualizer | Python |
| 10 | MySQL vs PostgreSQL Comparison Lab | SQL + Python |
| 11 | Query Profiler and Slow Query Analyzer | Python |
| 12 | Build a Mini-MySQL (Capstone) | C or Rust |
Happy querying! Understanding MySQL deeply means understanding how data really works—from SQL syntax down to the bits on disk. 🗄️