LEARN MYSQL DEEP DIVE

Learn MySQL: From Zero to Mastery

Goal: Deeply understand MySQL—from basic queries to storage engine internals, index structures, query optimization, transaction processing, and replication. Understand why MySQL makes the design choices it does and how it differs from PostgreSQL, SQLite, and other databases.


Why MySQL Matters

MySQL is one of the most widely-deployed databases in the world, powering everything from WordPress blogs to Facebook’s infrastructure. Understanding MySQL deeply teaches you:

  1. Relational database fundamentals: ACID, normalization, SQL
  2. Storage engine architecture: How data is physically organized on disk
  3. Indexing internals: B+trees, clustered vs secondary indexes
  4. Query optimization: How the optimizer chooses execution plans
  5. Concurrency control: MVCC, locking, isolation levels
  6. Distributed systems: Replication, sharding, high availability

MySQL vs Other Databases: The Big Picture

Before diving in, let’s understand where MySQL fits:

┌───────────────────┬────────────────────────────┬──────────────────────────┬───────────────────────┬──────────────────────────┐
│ Feature           │ MySQL/InnoDB               │ PostgreSQL               │ SQLite                │ MariaDB                  │
├───────────────────┼────────────────────────────┼──────────────────────────┼───────────────────────┼──────────────────────────┤
│ Architecture      │ Thread-per-connection      │ Process-per-connection   │ Embedded (in-process) │ Thread-per-connection    │
│ Storage Engine    │ Pluggable (InnoDB default) │ Single (heap + WAL)      │ Single (B-tree)       │ Pluggable (Aria, InnoDB) │
│ Replication       │ Logical (binlog)           │ Physical (WAL) + Logical │ None (single-file)    │ Logical (binlog)         │
│ MVCC              │ Undo logs + locks          │ Tuple versioning         │ WAL-mode              │ Undo logs + locks        │
│ Default Isolation │ REPEATABLE READ            │ READ COMMITTED           │ SERIALIZABLE          │ REPEATABLE READ          │
│ JSON Support      │ Good                       │ Excellent (JSONB)        │ Basic                 │ Good                     │
│ Full-Text Search  │ Basic                      │ Excellent (GIN/GiST)     │ FTS5 extension        │ Basic                    │
│ Geospatial        │ Basic                      │ Excellent (PostGIS)      │ R*Tree                │ Basic                    │
└───────────────────┴────────────────────────────┴──────────────────────────┴───────────────────────┴──────────────────────────┘

Key Philosophical Differences

MySQL’s Philosophy:

  • Simplicity and speed for common operations
  • Pluggable storage engines for different workloads
  • Optimized for read-heavy OLTP workloads
  • “Good enough” defaults that work for most cases

PostgreSQL’s Philosophy:

  • Correctness and standards compliance first
  • Rich feature set (arrays, custom types, advanced indexing)
  • Better for complex queries and analytics
  • More sophisticated query planner

SQLite’s Philosophy:

  • Embedded, zero-configuration
  • Single-file database
  • Perfect for local storage, testing, edge cases
  • ACID-compliant despite simplicity

MySQL Architecture Overview

┌─────────────────────────────────────────────────────────────────────┐
│                         MySQL Server                                 │
├─────────────────────────────────────────────────────────────────────┤
│                                                                      │
│  ┌──────────────────────────────────────────────────────────────┐   │
│  │                    Connection Layer                           │   │
│  │  ┌─────────┐ ┌─────────┐ ┌─────────┐ ┌─────────┐             │   │
│  │  │Thread 1 │ │Thread 2 │ │Thread 3 │ │Thread N │             │   │
│  │  │ (Conn)  │ │ (Conn)  │ │ (Conn)  │ │ (Conn)  │             │   │
│  │  └─────────┘ └─────────┘ └─────────┘ └─────────┘             │   │
│  └──────────────────────────────────────────────────────────────┘   │
│                              │                                       │
│  ┌──────────────────────────▼───────────────────────────────────┐   │
│  │                      SQL Layer                                │   │
│  │  ┌─────────────┐ ┌─────────────┐ ┌─────────────────────────┐ │   │
│  │  │   Parser    │ │  Optimizer  │ │     Query Executor      │ │   │
│  │  │  (Lexer +   │ │ (Cost-based │ │  (Iterates over rows,   │ │   │
│  │  │   Parser)   │ │  planning)  │ │   applies operations)   │ │   │
│  │  └─────────────┘ └─────────────┘ └─────────────────────────┘ │   │
│  │  ┌─────────────┐ ┌─────────────┐ ┌─────────────────────────┐ │   │
│  │  │Query Cache  │ │  Metadata   │ │    Table Definition     │ │   │
│  │  │ (removed    │ │   Cache     │ │        Cache            │ │   │
│  │  │  in 8.0)    │ │             │ │                         │ │   │
│  │  └─────────────┘ └─────────────┘ └─────────────────────────┘ │   │
│  └──────────────────────────────────────────────────────────────┘   │
│                              │                                       │
│  ┌──────────────────────────▼───────────────────────────────────┐   │
│  │                   Storage Engine API                          │   │
│  │            (Handler interface - pluggable engines)            │   │
│  └──────────────────────────────────────────────────────────────┘   │
│         │                    │                    │                  │
│  ┌──────▼──────┐     ┌───────▼───────┐    ┌───────▼───────┐        │
│  │   InnoDB    │     │    MyISAM     │    │    Memory     │        │
│  │  (Default)  │     │   (Legacy)    │    │  (Temp tables)│        │
│  │             │     │               │    │               │        │
│  │ ┌─────────┐ │     │ ┌───────────┐ │    │ ┌───────────┐ │        │
│  │ │ Buffer  │ │     │ │Key Buffer │ │    │ │  Hash or  │ │        │
│  │ │  Pool   │ │     │ │  (Index   │ │    │ │  B-Tree   │ │        │
│  │ │         │ │     │ │   Cache)  │ │    │ │  Indexes  │ │        │
│  │ └─────────┘ │     │ └───────────┘ │    │ └───────────┘ │        │
│  │ ┌─────────┐ │     │               │    │               │        │
│  │ │Redo Log │ │     │ No ACID, no   │    │ In-memory     │        │
│  │ │Undo Log │ │     │ transactions  │    │ only          │        │
│  │ └─────────┘ │     │               │    │               │        │
│  └─────────────┘     └───────────────┘    └───────────────┘        │
│         │                    │                    │                  │
└─────────┼────────────────────┼────────────────────┼──────────────────┘
          │                    │                    │
          ▼                    ▼                    ▼
     ┌─────────┐          ┌─────────┐          (Memory)
     │.ibd file│          │.MYD/.MYI│
     │(data +  │          │(data +  │
     │ index)  │          │ index)  │
     └─────────┘          └─────────┘

InnoDB Architecture Deep Dive

InnoDB is MySQL’s default storage engine and where most of the magic happens:

┌─────────────────────────────────────────────────────────────────────┐
│                         InnoDB Engine                                │
├─────────────────────────────────────────────────────────────────────┤
│                                                                      │
│  ┌────────────────────── IN MEMORY ─────────────────────────────┐   │
│  │                                                               │   │
│  │  ┌─────────────────────────────────────────────────────────┐ │   │
│  │  │                    Buffer Pool                           │ │   │
│  │  │  ┌─────────────────────────────────────────────────────┐ │ │   │
│  │  │  │ Data Pages │ Index Pages │ Undo Pages │ Adaptive HI │ │ │   │
│  │  │  └─────────────────────────────────────────────────────┘ │ │   │
│  │  │  LRU list (young → old) for page eviction                │ │   │
│  │  └─────────────────────────────────────────────────────────┘ │   │
│  │                                                               │   │
│  │  ┌───────────────┐ ┌───────────────┐ ┌───────────────────┐   │   │
│  │  │  Change       │ │   Log         │ │  Adaptive Hash    │   │   │
│  │  │  Buffer       │ │   Buffer      │ │  Index (AHI)      │   │   │
│  │  │ (for sec.idx) │ │ (redo log     │ │ (auto-built hash  │   │   │
│  │  │               │ │  in memory)   │ │  for hot pages)   │   │   │
│  │  └───────────────┘ └───────────────┘ └───────────────────────┘   │
│  │                                                               │   │
│  └───────────────────────────────────────────────────────────────┘   │
│                              │                                       │
│                              │ Flush                                 │
│                              ▼                                       │
│  ┌────────────────────── ON DISK ───────────────────────────────┐   │
│  │                                                               │   │
│  │  ┌───────────────┐ ┌───────────────┐ ┌───────────────────┐   │   │
│  │  │ System        │ │ File-per-     │ │ Redo Log Files    │   │   │
│  │  │ Tablespace    │ │ Table (.ibd)  │ │ (ib_logfile0/1)   │   │   │
│  │  │ (ibdata1)     │ │               │ │                   │   │   │
│  │  │               │ │ Data + Index  │ │ WAL for crash     │   │   │
│  │  │ Change buffer │ │ in B+Tree     │ │ recovery          │   │   │
│  │  │ Undo logs     │ │ structure     │ │                   │   │   │
│  │  │ Data dict     │ │               │ │                   │   │   │
│  │  └───────────────┘ └───────────────┘ └───────────────────┘   │   │
│  │                                                               │   │
│  │  ┌───────────────┐ ┌───────────────┐                         │   │
│  │  │ Doublewrite   │ │ Undo          │                         │   │
│  │  │ Buffer        │ │ Tablespace    │                         │   │
│  │  │ (crash safety)│ │ (undo_001)    │                         │   │
│  │  └───────────────┘ └───────────────┘                         │   │
│  │                                                               │   │
│  └───────────────────────────────────────────────────────────────┘   │
│                                                                      │
└─────────────────────────────────────────────────────────────────────┘

Project List

Projects progress from understanding fundamentals to building sophisticated database tools.


Project 1: SQL Query Executor (From Scratch)

  • File: LEARN_MYSQL_DEEP_DIVE.md
  • Main Programming Language: Python
  • Alternative Programming Languages: Go, Rust, C
  • Coolness Level: Level 4: Hardcore Tech Flex
  • Business Potential: 1. The “Resume Gold”
  • Difficulty: Level 2: Intermediate
  • Knowledge Area: SQL Parsing / Query Execution / Relational Algebra
  • Software or Tool: Custom implementation
  • Main Book: “Database System Concepts” by Silberschatz, Korth, Sudarshan

What you’ll build: A simple SQL query executor that parses SQL (SELECT, WHERE, JOIN), operates on CSV files as tables, and executes queries using relational algebra operators (select, project, join).

Why it teaches MySQL: Before understanding MySQL’s optimizations, you need to understand what SQL actually means. This project forces you to implement the relational model and see why database engines make certain design choices.

Core challenges you’ll face:

  • Parsing SQL syntax → maps to understanding SQL grammar
  • Implementing relational operators → maps to selection, projection, join
  • Query planning → maps to choosing join order, using indexes
  • Understanding cost → maps to why some queries are slow

Key Concepts:

  • Relational Algebra: “Database System Concepts” Chapter 6
  • SQL Syntax: MySQL SELECT Syntax
  • Query Processing: “Database System Concepts” Chapter 15

Difficulty: Intermediate. Time estimate: 2 weeks. Prerequisites: Basic SQL knowledge. Understanding of data structures.

Real world outcome:

# Your SQL executor in action:
$ python minisql.py

MiniSQL> CREATE TABLE users (id, name, email);
Table 'users' created.

MiniSQL> INSERT INTO users VALUES (1, 'Alice', 'alice@example.com');
MiniSQL> INSERT INTO users VALUES (2, 'Bob', 'bob@example.com');

MiniSQL> SELECT name, email FROM users WHERE id = 1;
+-------+-------------------+
| name  | email             |
+-------+-------------------+
| Alice | alice@example.com |
+-------+-------------------+
1 row in set (0.001 sec)

MiniSQL> CREATE TABLE orders (id, user_id, amount);
MiniSQL> INSERT INTO orders VALUES (1, 1, 100);
MiniSQL> INSERT INTO orders VALUES (2, 1, 200);

MiniSQL> SELECT users.name, orders.amount
         FROM users
         JOIN orders ON users.id = orders.user_id;
+-------+--------+
| name  | amount |
+-------+--------+
| Alice | 100    |
| Alice | 200    |
+-------+--------+
2 rows in set (0.002 sec)

Implementation Hints:

Core relational operators:

class Table:
    def __init__(self, name, columns, rows=None):
        self.name = name
        self.columns = columns  # ['id', 'name', 'email']
        self.rows = rows or []  # [[1, 'Alice', 'alice@...'], ...]

    def select(self, predicate):
        """σ (sigma) - Filter rows matching predicate"""
        return Table(
            self.name,
            self.columns,
            [row for row in self.rows if predicate(row, self.columns)]
        )

    def project(self, column_names):
        """π (pi) - Select specific columns"""
        indices = [self.columns.index(c) for c in column_names]
        return Table(
            self.name,
            column_names,
            [[row[i] for i in indices] for row in self.rows]
        )

    def join(self, other, on_left, on_right):
        """⋈ (bowtie) - Join two tables"""
        left_idx = self.columns.index(on_left)
        right_idx = other.columns.index(on_right)

        result_rows = []
        for left_row in self.rows:
            for right_row in other.rows:
                if left_row[left_idx] == right_row[right_idx]:
                    result_rows.append(left_row + right_row)

        return Table(
            f"{self.name}_{other.name}",
            self.columns + other.columns,
            result_rows
        )

Simple SQL parser:

import re

def parse_select(sql):
    """Parse SELECT column1, column2 FROM table WHERE condition"""
    pattern = r"SELECT\s+(.+?)\s+FROM\s+(\w+)(?:\s+WHERE\s+(.+))?"
    match = re.match(pattern, sql, re.IGNORECASE)

    if match:
        columns = [c.strip() for c in match.group(1).split(',')]
        table = match.group(2)
        where = match.group(3)
        return {'type': 'select', 'columns': columns, 'table': table, 'where': where}
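
A quick way to see the pieces fit together (a minimal sketch that assumes the Table class and parse_select above live in the same module; the string-splitting WHERE handling is for illustration only):

def run_select(tables, sql):
    """Tiny driver: parse a SELECT, then apply select (WHERE) and project."""
    plan = parse_select(sql)
    table = tables[plan['table']]

    if plan['where']:
        # Illustrative only: handles "col = value" style predicates
        col, _, raw = plan['where'].partition('=')
        col, raw = col.strip(), raw.strip().strip("'")
        table = table.select(
            lambda row, cols: str(row[cols.index(col)]) == raw
        )

    if plan['columns'] != ['*']:
        table = table.project(plan['columns'])
    return table

users = Table('users', ['id', 'name', 'email'],
              [[1, 'Alice', 'alice@example.com'], [2, 'Bob', 'bob@example.com']])
result = run_select({'users': users}, "SELECT name, email FROM users WHERE id = 1")
print(result.rows)   # [['Alice', 'alice@example.com']]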

Learning milestones:

  1. You parse basic SQL → You understand SQL syntax
  2. You implement SELECT/WHERE → You understand selection and projection
  3. You implement JOIN → You understand the most expensive operation
  4. You see why indexes matter → You understand scanning vs seeking

Project 2: B+Tree Index Implementation

  • File: LEARN_MYSQL_DEEP_DIVE.md
  • Main Programming Language: C
  • Alternative Programming Languages: Rust, Go, Python
  • Coolness Level: Level 4: Hardcore Tech Flex
  • Business Potential: 1. The “Resume Gold”
  • Difficulty: Level 3: Advanced
  • Knowledge Area: Data Structures / Indexing / Disk-Based Storage
  • Software or Tool: Custom implementation
  • Main Book: “Database Internals” by Alex Petrov

What you’ll build: A disk-based B+Tree implementation that stores data on disk in pages, supports insertions, deletions, range scans, and demonstrates why B+Trees are ideal for databases.

Why it teaches MySQL: InnoDB stores all data in B+Tree structures. The clustered index IS the table. Understanding B+Trees means understanding how MySQL physically organizes data and why certain operations are fast or slow.

Core challenges you’ll face:

  • Page-based storage → maps to InnoDB’s 16KB pages
  • Node splitting and merging → maps to handling insertions and deletions
  • Range scans via leaf links → maps to why B+Trees excel at range queries
  • Understanding clustered vs secondary → maps to InnoDB’s index design

Key Concepts:

Difficulty: Advanced. Time estimate: 3-4 weeks. Prerequisites: Project 1 completed. Understanding of tree data structures. C or systems programming.

Real world outcome:

// Your B+Tree in action:
$ ./btree_demo

Creating B+Tree with page size 4096, order 100...
Inserting 1,000,000 keys...
Insert time: 2.3 seconds

Point lookup (key = 500000): found in 3 page reads (tree height = 3)
Range scan (100000 to 100100): 100 rows in 5 page reads

Tree statistics:
  - Height: 3
  - Leaf pages: 10,000
  - Internal pages: 101
  - Fill factor: 67%

Demonstrating why B+Tree > Binary Tree for disk:
  - B+Tree:  3 disk reads for any key in 1M rows
  - Binary:  20 disk reads for any key in 1M rows (log2(1M))

Implementation Hints:

B+Tree node structure:

#include <stdint.h>

/* Example key/value types and cache size - adjust to your record format */
typedef uint64_t Key;
typedef uint64_t Value;
#define CACHE_SIZE 64

#define PAGE_SIZE 4096
#define MAX_KEYS ((PAGE_SIZE - sizeof(NodeHeader)) / (sizeof(Key) + sizeof(Value)))

typedef struct {
    uint32_t is_leaf;
    uint32_t num_keys;
    uint32_t parent_page;
    uint32_t next_leaf;  // Only for leaf nodes (linked list)
    uint32_t prev_leaf;
} NodeHeader;

typedef struct {
    NodeHeader header;
    Key keys[MAX_KEYS];
    union {
        Value values[MAX_KEYS];      // For leaf nodes
        uint32_t children[MAX_KEYS + 1];  // For internal nodes (page IDs)
    };
} BTreeNode;

typedef struct {
    int fd;              // File descriptor
    uint32_t root_page;
    uint32_t num_pages;
    BTreeNode* cache[CACHE_SIZE];  // Simple page cache
} BTree;

Key operations:

// Search: traverse from root to leaf
Value* btree_search(BTree* tree, Key key) {
    uint32_t page_id = tree->root_page;

    while (true) {
        BTreeNode* node = read_page(tree, page_id);

        if (node->header.is_leaf) {
            // Binary search in leaf
            int idx = binary_search(node->keys, node->header.num_keys, key);
            if (idx >= 0) {
                return &node->values[idx];
            }
            return NULL;
        }

        // Find child to follow
        int idx = find_child_index(node, key);
        page_id = node->children[idx];
    }
}

// Range scan: find start, follow leaf links
void btree_range_scan(BTree* tree, Key start, Key end, Callback cb) {
    // Find leaf containing start key
    uint32_t page_id = find_leaf(tree, start);

    while (page_id != 0) {
        BTreeNode* node = read_page(tree, page_id);

        for (int i = 0; i < node->header.num_keys; i++) {
            if (node->keys[i] > end) return;
            if (node->keys[i] >= start) {
                cb(node->keys[i], node->values[i]);
            }
        }

        page_id = node->header.next_leaf;  // Follow leaf link!
    }
}

Why B+Tree is perfect for databases:

  1. High fanout → Tree stays shallow (3-4 levels for millions of rows)
  2. Leaf links → Range scans don’t need to revisit internal nodes
  3. Sequential leaf access → Disk prefetching works well
  4. Consistent read performance → All leaves at same depth
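
A rough back-of-the-envelope check on the fanout point above (a sketch with assumed key/pointer and row sizes, not InnoDB's exact on-page format):

import math

def btree_height(n_rows, page_size=4096, key_ptr_bytes=16, row_bytes=160):
    """Estimate B+Tree height from internal fanout and rows per leaf page."""
    fanout = page_size // key_ptr_bytes        # key + child pointer pairs per internal page
    rows_per_leaf = page_size // row_bytes     # records per leaf page
    leaves = math.ceil(n_rows / rows_per_leaf)
    # Internal levels needed to address all leaves, plus the leaf level itself
    return 1 + max(1, math.ceil(math.log(leaves, fanout)))

for n in (1_000_000, 100_000_000):
    print(n, "rows -> height ~", btree_height(n))
# 1,000,000 rows   -> height ~ 3  (a binary tree would need ~20 levels: log2(1e6))
# 100,000,000 rows -> height ~ 4  (100x the data costs only one extra level)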

Learning milestones:

  1. You implement basic search → You understand tree traversal
  2. You implement insert with splitting → You understand node management
  3. You implement range scans → You understand leaf linking
  4. You measure disk I/O → You understand why B+Trees win

Project 3: Query Optimizer Simulator

  • File: LEARN_MYSQL_DEEP_DIVE.md
  • Main Programming Language: Python
  • Alternative Programming Languages: Go, Java
  • Coolness Level: Level 4: Hardcore Tech Flex
  • Business Potential: 2. The “Micro-SaaS / Pro Tool”
  • Difficulty: Level 3: Advanced
  • Knowledge Area: Query Optimization / Cost Estimation / Statistics
  • Software or Tool: Custom implementation
  • Main Book: “Database System Concepts” by Silberschatz, Korth, Sudarshan

What you’ll build: A query optimizer that takes SQL queries, generates multiple execution plans, estimates costs based on table statistics, and chooses the cheapest plan—just like MySQL’s optimizer.

Why it teaches MySQL: The optimizer is MySQL’s brain. Understanding how it estimates costs and chooses plans explains why some queries are fast and others slow, and how to write SQL that the optimizer can optimize.

Core challenges you’ll face:

  • Generating alternative plans → maps to join order, index choice
  • Estimating cardinality → maps to how many rows will match?
  • Cost modeling → maps to disk I/O vs CPU costs
  • Understanding EXPLAIN → maps to reading MySQL’s execution plans

Key Concepts:

Difficulty: Advanced. Time estimate: 2-3 weeks. Prerequisites: Projects 1-2 completed. Understanding of SQL execution.

Real world outcome:

Query Optimizer Simulator

Query: SELECT * FROM orders o
       JOIN customers c ON o.customer_id = c.id
       JOIN products p ON o.product_id = p.id
       WHERE c.country = 'USA'
         AND p.category = 'Electronics'

Table Statistics:
  orders:    1,000,000 rows, index on customer_id, product_id
  customers: 100,000 rows, index on id, country
  products:  10,000 rows, index on id, category

Generated Plans:

Plan 1: customers → orders → products
  ├─ Scan customers WHERE country='USA' (est. 10,000 rows)
  ├─ Index lookup orders.customer_id (est. 100,000 rows)
  └─ Index lookup products.id WHERE category='Electronics' (est. 1,000 rows)
  Cost: 15,230 (disk I/O: 1,200, CPU: 14,030)

Plan 2: products → orders → customers  ★ CHOSEN
  ├─ Scan products WHERE category='Electronics' (est. 500 rows)
  ├─ Index lookup orders.product_id (est. 50,000 rows)
  └─ Index lookup customers.id WHERE country='USA' (est. 5,000 rows)
  Cost: 8,750 (disk I/O: 850, CPU: 7,900)

Plan 3: orders → customers → products
  ├─ Full scan orders (1,000,000 rows) ← EXPENSIVE!
  ...
  Cost: 1,250,000

Optimizer chose Plan 2 (2x faster than Plan 1)

Implementation Hints:

Table statistics:

class TableStats:
    def __init__(self, name, row_count, columns, indexes):
        self.name = name
        self.row_count = row_count
        self.columns = columns  # {col_name: {'distinct': N, 'min': X, 'max': Y}}
        self.indexes = indexes  # {index_name: {'columns': [...], 'unique': bool}}

    def estimate_selectivity(self, column, operator, value):
        """Estimate what fraction of rows match a predicate"""
        col_stats = self.columns[column]

        if operator == '=':
            # Uniform distribution assumption
            return 1.0 / col_stats['distinct']
        elif operator == '<':
            # Range selectivity
            range_size = col_stats['max'] - col_stats['min']
            return (value - col_stats['min']) / range_size
        # ... more operators

    def estimate_rows(self, predicates):
        """Estimate rows after applying predicates"""
        selectivity = 1.0
        for pred in predicates:
            selectivity *= self.estimate_selectivity(
                pred.column, pred.operator, pred.value
            )
        return int(self.row_count * selectivity)

Plan generation and costing:

class QueryOptimizer:
    def __init__(self, stats):
        self.stats = stats  # {table_name: TableStats}
        self.io_cost_per_page = 1.0
        self.cpu_cost_per_row = 0.01

    def generate_join_orders(self, tables):
        """Generate all possible join orders"""
        from itertools import permutations
        return list(permutations(tables))

    def estimate_join_cost(self, left_rows, right_rows, join_type):
        """Estimate cost of a join operation"""
        if join_type == 'nested_loop':
            return left_rows * right_rows * self.cpu_cost_per_row
        elif join_type == 'hash_join':
            return (left_rows + right_rows) * self.cpu_cost_per_row * 2
        elif join_type == 'index_lookup':
            # Much cheaper if we have an index
            return left_rows * 3 * self.io_cost_per_page  # B+Tree depth

    def cost_plan(self, plan):
        """Calculate total cost of an execution plan"""
        total_cost = 0
        current_rows = 0

        for step in plan.steps:
            if step.type == 'table_scan':
                pages = step.table.row_count / 100  # ~100 rows per page
                total_cost += pages * self.io_cost_per_page
                current_rows = step.estimated_rows

            elif step.type == 'index_lookup':
                total_cost += 3 * self.io_cost_per_page  # B+Tree traversal
                total_cost += step.estimated_rows * self.cpu_cost_per_row
                current_rows = step.estimated_rows

            elif step.type == 'join':
                total_cost += self.estimate_join_cost(
                    current_rows, step.right_rows, step.join_type
                )
                current_rows = step.estimated_output_rows

        return total_cost

    def optimize(self, query):
        """Find the cheapest execution plan"""
        plans = self.generate_all_plans(query)
        return min(plans, key=self.cost_plan)
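
A small worked example for the cardinality step (a sketch; the Predicate tuple is an assumption, just anything with column/operator/value attributes as estimate_rows expects):

from collections import namedtuple

Predicate = namedtuple('Predicate', ['column', 'operator', 'value'])

customers = TableStats(
    name='customers',
    row_count=100_000,
    columns={'country': {'distinct': 50, 'min': None, 'max': None},
             'age':     {'distinct': 80, 'min': 18,   'max': 98}},
    indexes={'idx_country': {'columns': ['country'], 'unique': False}},
)

# country = 'USA' -> selectivity 1/50; age < 38 -> (38-18)/(98-18) = 0.25
preds = [Predicate('country', '=', 'USA'), Predicate('age', '<', 38)]
print(customers.estimate_rows(preds))   # 100000 * (1/50) * 0.25 = 500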

Learning milestones:

  1. You generate alternative plans → You understand plan space
  2. You estimate cardinality → You understand statistics
  3. You calculate costs → You understand I/O vs CPU tradeoffs
  4. You read EXPLAIN output → You can optimize real MySQL queries

Project 4: MVCC Transaction Manager

  • File: LEARN_MYSQL_DEEP_DIVE.md
  • Main Programming Language: Python
  • Alternative Programming Languages: Go, Rust, C++
  • Coolness Level: Level 4: Hardcore Tech Flex
  • Business Potential: 1. The “Resume Gold”
  • Difficulty: Level 3: Advanced
  • Knowledge Area: Transactions / Concurrency / Isolation
  • Software or Tool: Custom implementation
  • Main Book: “Designing Data-Intensive Applications” by Martin Kleppmann

What you’ll build: An MVCC (Multi-Version Concurrency Control) system that maintains multiple versions of data, implements snapshot isolation, and demonstrates how InnoDB allows readers and writers to not block each other.

Why it teaches MySQL: InnoDB’s MVCC is central to its concurrency model. Understanding how undo logs create “snapshots” and how visibility rules work explains isolation levels, long transaction problems, and lock behavior.

Core challenges you’ll face:

  • Versioning data → maps to undo logs storing old versions
  • Visibility rules → maps to which version can a transaction see?
  • Garbage collection → maps to purging old versions
  • Isolation levels → maps to READ COMMITTED vs REPEATABLE READ

Key Concepts:

Difficulty: Advanced. Time estimate: 2-3 weeks. Prerequisites: Projects 1-3 completed. Understanding of transaction concepts.

Real world outcome:

MVCC Transaction Simulator

txn1> BEGIN;  -- Transaction ID: 100
txn2> BEGIN;  -- Transaction ID: 101

txn1> UPDATE accounts SET balance = 500 WHERE id = 1;
      [Created version: {txn_id: 100, value: 500}]
      [Old version: {txn_id: 50, value: 1000} → undo log]

txn2> SELECT balance FROM accounts WHERE id = 1;
      → 1000 (sees old version - txn 100 not committed)

txn1> COMMIT;
      [Version {txn_id: 100} now visible to new transactions]

txn2> SELECT balance FROM accounts WHERE id = 1;
      → 1000 (REPEATABLE READ: still sees snapshot from txn start)

txn3> BEGIN;  -- Transaction ID: 102
txn3> SELECT balance FROM accounts WHERE id = 1;
      → 500 (sees committed version from txn 100)

-- Demonstration of isolation levels:
SET ISOLATION LEVEL READ COMMITTED;
txn2> SELECT balance FROM accounts WHERE id = 1;
      → 500 (READ COMMITTED: sees latest committed value)

Implementation Hints:

Version chain and visibility:

from dataclasses import dataclass
from typing import Any, Optional
import threading

@dataclass
class Version:
    txn_id: int
    value: Any
    created_at: int  # Transaction ID that created this
    deleted_at: Optional[int] = None  # Transaction ID that deleted this
    prev_version: Optional['Version'] = None  # Link to older version

class MVCCRow:
    def __init__(self, key):
        self.key = key
        self.version_chain: Optional[Version] = None
        self.lock = threading.Lock()

    def write(self, txn_id: int, value: Any):
        """Create new version, link to previous"""
        with self.lock:
            new_version = Version(
                txn_id=txn_id,
                value=value,
                created_at=txn_id,
                prev_version=self.version_chain
            )
            self.version_chain = new_version

    def read(self, txn: 'Transaction') -> Any:
        """Find visible version for this transaction"""
        version = self.version_chain

        while version:
            if txn.is_visible(version):
                return version.value
            version = version.prev_version

        return None  # No visible version

class Transaction:
    def __init__(self, txn_id: int, isolation_level: str, active_txns: set):
        self.txn_id = txn_id
        self.isolation_level = isolation_level
        self.snapshot = active_txns.copy()  # Active txns at start
        self.start_time = txn_id

    def is_visible(self, version: Version) -> bool:
        """Can this transaction see this version?
        (Simplified: ignores the transaction's own uncommitted writes.)"""
        # Created before our snapshot by a transaction that had already committed
        if version.created_at < self.start_time:
            if version.created_at not in self.snapshot:  # creator not active at our start
                if version.deleted_at is None:
                    return True  # never deleted
                if version.deleted_at in self.snapshot:
                    return True  # deleter was still active at our start: delete not visible
                if version.deleted_at > self.start_time:
                    return True  # deleted after we started: delete not visible
        return False

class MVCCDatabase:
    def __init__(self):
        self.rows = {}
        self.next_txn_id = 1
        self.active_transactions = set()
        self.committed_transactions = set()

    def begin(self, isolation_level='REPEATABLE READ'):
        txn_id = self.next_txn_id
        self.next_txn_id += 1
        self.active_transactions.add(txn_id)

        return Transaction(txn_id, isolation_level, self.active_transactions)

    def commit(self, txn: Transaction):
        self.active_transactions.remove(txn.txn_id)
        self.committed_transactions.add(txn.txn_id)
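
A minimal driver for the classes above (assuming everything sits in one module, and wiring the row in by hand rather than through db.rows to keep it short). It reproduces the REPEATABLE READ behavior from the transcript earlier:

db = MVCCDatabase()
row = MVCCRow('accounts:1')

setup = db.begin()
row.write(setup.txn_id, 1000)              # initial balance = 1000
db.commit(setup)

txn1 = db.begin()                          # the writer
txn2 = db.begin()                          # a concurrent reader

row.write(txn1.txn_id, 500)                # txn1 updates, not yet committed
print(row.read(txn2))                      # 1000 - txn1's version is not visible

db.commit(txn1)
print(row.read(txn2))                      # 1000 - snapshot fixed at txn2's start

txn3 = db.begin()                          # starts after txn1 committed
print(row.read(txn3))                      # 500  - sees txn1's committed version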

Learning milestones:

  1. You implement version chains → You understand undo logs
  2. You implement visibility rules → You understand snapshot isolation
  3. You compare isolation levels → You understand READ COMMITTED vs REPEATABLE READ
  4. You see long transaction problems → You understand why long txns are bad

Project 5: Buffer Pool Manager

  • File: LEARN_MYSQL_DEEP_DIVE.md
  • Main Programming Language: C
  • Alternative Programming Languages: Rust, C++
  • Coolness Level: Level 4: Hardcore Tech Flex
  • Business Potential: 1. The “Resume Gold”
  • Difficulty: Level 3: Advanced
  • Knowledge Area: Memory Management / Caching / I/O
  • Software or Tool: Custom implementation
  • Main Book: “Database Internals” by Alex Petrov

What you’ll build: A buffer pool manager that caches disk pages in memory, implements LRU eviction (with young/old list like InnoDB), handles dirty page flushing, and manages concurrent access.

Why it teaches MySQL: The buffer pool is InnoDB’s most important performance feature. Understanding it explains why “more RAM = faster database”, how to tune innodb_buffer_pool_size, and why certain access patterns perform poorly.

Core challenges you’ll face:

  • Page caching → maps to InnoDB’s buffer pool
  • LRU with midpoint insertion → maps to young/old list separation
  • Dirty page management → maps to checkpoint and flush
  • Concurrent access → maps to latching pages

Key Concepts:

Difficulty: Advanced. Time estimate: 2-3 weeks. Prerequisites: Projects 1-4 completed. Systems programming experience.

Real world outcome:

Buffer Pool Simulator

Configuration:
  Buffer pool size: 128 MB (8,192 pages × 16 KB)
  Young list: 5/8 of pool (5,120 pages)
  Old list: 3/8 of pool (3,072 pages)

Running OLTP workload simulation...

Statistics after 1 million operations:
  Hit ratio: 98.7%
  Pages read from disk: 13,000
  Pages written to disk: 4,500
  Evictions: 45,000

Page distribution:
  Young list: 5,120 pages (frequently accessed)
  Old list: 3,072 pages (recently loaded)
  Dirty pages: 1,234 (pending flush)

LRU behavior demonstration:
  - Full table scan of 100K rows...
  - Old pages: 3,072 → scan loaded here
  - Young pages: 5,120 → protected from scan eviction!
  - Hit ratio after scan: 97.2% (only 1.5% drop)

Without midpoint insertion (traditional LRU):
  - Hit ratio after scan: 45% (catastrophic drop!)

Implementation Hints:

Buffer pool structure:

#define PAGE_SIZE 16384
#define POOL_SIZE 8192  // Number of pages

typedef struct BufferPage {
    uint32_t page_id;
    bool is_dirty;
    int pin_count;  // How many threads are using this page
    pthread_rwlock_t latch;

    uint8_t data[PAGE_SIZE];

    // LRU list pointers
    struct BufferPage* lru_prev;
    struct BufferPage* lru_next;
    bool in_young_list;  // vs old list
} BufferPage;

typedef struct BufferPool {
    BufferPage pages[POOL_SIZE];

    // Hash table: page_id → BufferPage*
    HashTable* page_table;

    // LRU lists (InnoDB uses midpoint insertion)
    BufferPage* young_head;  // Most recently used
    BufferPage* young_tail;
    BufferPage* old_head;    // Midpoint
    BufferPage* old_tail;    // Least recently used

    // Free list
    BufferPage* free_list;

    pthread_mutex_t pool_latch;

    // Stats
    uint64_t hits;
    uint64_t misses;
} BufferPool;

LRU with midpoint insertion:

BufferPage* buffer_pool_get(BufferPool* pool, uint32_t page_id) {
    pthread_mutex_lock(&pool->pool_latch);

    // Check if page is in pool
    BufferPage* page = hash_table_get(pool->page_table, page_id);

    if (page) {
        pool->hits++;

        // Move to young list if accessed again from old list
        if (!page->in_young_list) {
            // Only move if it's been in old list for a while
            // (prevents one-time access from polluting young list)
            if (should_promote_to_young(page)) {
                move_to_young_head(pool, page);
            }
        } else {
            // Already in young, move to head
            move_to_young_head(pool, page);
        }

        page->pin_count++;
        pthread_mutex_unlock(&pool->pool_latch);
        return page;
    }

    pool->misses++;

    // Need to load from disk - find victim page
    BufferPage* victim = find_victim(pool);
    if (victim->is_dirty) {
        flush_page(victim);  // Write to disk first
    }

    // Load new page
    read_page_from_disk(victim, page_id);
    victim->page_id = page_id;
    victim->is_dirty = false;
    victim->pin_count = 1;

    // Insert at OLD list head (not young!)
    // This is the midpoint insertion strategy
    insert_at_old_head(pool, victim);

    hash_table_put(pool->page_table, page_id, victim);

    pthread_mutex_unlock(&pool->pool_latch);
    return victim;
}

Why midpoint insertion matters:

  • A full table scan would evict all hot pages in naive LRU
  • With midpoint insertion, scan pages enter the OLD list
  • They get evicted before hot pages in the YOUNG list
  • Hot working set is protected from one-time scans
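
The same policy in a few lines of Python (a language-agnostic sketch, not InnoDB's actual list management; the 3/8 old-list split and the "promote on second touch" rule are simplifications):

from collections import OrderedDict

class MidpointLRU:
    """Two-segment LRU: new pages enter the old list; a second access promotes them."""
    def __init__(self, capacity, old_fraction=3 / 8):
        self.old_cap = int(capacity * old_fraction)
        self.young_cap = capacity - self.old_cap
        self.young = OrderedDict()   # page_id -> data, most recently used last
        self.old = OrderedDict()

    def get(self, page_id):
        if page_id in self.young:
            self.young.move_to_end(page_id)          # refresh position
            return True                              # hit
        if page_id in self.old:
            data = self.old.pop(page_id)             # second touch: promote
            self._insert(self.young, page_id, data, self.young_cap)
            return True                              # hit
        # Miss: "read from disk" and insert at the midpoint (head of the old list)
        self._insert(self.old, page_id, object(), self.old_cap)
        return False

    @staticmethod
    def _insert(segment, page_id, data, cap):
        segment[page_id] = data
        if len(segment) > cap:
            segment.popitem(last=False)              # evict least recently used

pool = MidpointLRU(capacity=800)
hot = list(range(200))
for _ in range(3):                                   # warm up the hot working set
    for p in hot:
        pool.get(p)
for p in range(10_000, 60_000):                      # one-time full table scan
    pool.get(p)
hits = sum(pool.get(p) for p in hot)
print(f"hot pages still cached after the scan: {hits}/{len(hot)}")
# The scan only churned the old list; with a single-list LRU it would have
# evicted the hot pages too.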

Learning milestones:

  1. You implement basic LRU → You understand page replacement
  2. You add midpoint insertion → You understand InnoDB’s optimization
  3. You handle dirty pages → You understand write-back caching
  4. You measure hit ratios → You can tune buffer pool size

Project 6: WAL and Crash Recovery

  • File: LEARN_MYSQL_DEEP_DIVE.md
  • Main Programming Language: C
  • Alternative Programming Languages: Rust, Go
  • Coolness Level: Level 5: Pure Magic (Super Cool)
  • Business Potential: 1. The “Resume Gold”
  • Difficulty: Level 4: Expert
  • Knowledge Area: Durability / Crash Recovery / Logging
  • Software or Tool: Custom implementation
  • Main Book: “Database Internals” by Alex Petrov

What you’ll build: A Write-Ahead Log (WAL) system with redo logging that ensures durability, handles crash recovery using ARIES-style recovery, and demonstrates how InnoDB survives crashes.

Why it teaches MySQL: The redo log is how InnoDB achieves durability without flushing every page to disk on commit. Understanding WAL explains innodb_flush_log_at_trx_commit, crash recovery time, and why databases survive power failures.

Core challenges you’ll face:

  • Write-ahead logging → maps to redo log before data
  • Log sequence numbers (LSN) → maps to ordering operations
  • Checkpointing → maps to limiting recovery time
  • ARIES recovery → maps to analysis, redo, undo phases

Key Concepts:

  • InnoDB Redo Log: Redo Log
  • ARIES: “ARIES: A Transaction Recovery Method” (classic paper)
  • WAL: “Database Internals” by Alex Petrov, Chapter 7

Difficulty: Expert. Time estimate: 3-4 weeks. Prerequisites: Projects 1-5 completed. Strong systems programming.

Real world outcome:

WAL and Recovery Simulator

Starting database...
  Redo log size: 48 MB (2 files × 24 MB)
  Checkpoint LSN: 1000000

Running transactions...
  T1: UPDATE accounts SET balance = 500 WHERE id = 1
      [LSN 1000100: REDO record written to log]
      [Page modified in buffer pool, not yet on disk]

  T1: COMMIT
      [LSN 1000150: COMMIT record written to log]
      [Log flushed to disk - durability guaranteed!]
      [Data page still only in memory]

Simulating crash...
  [Power failure! Data pages lost from memory]
  [Redo log persisted on disk]

Recovery starting...

Phase 1: Analysis
  Scanning log from checkpoint LSN 1000000...
  Found 5 transactions:
    T1: COMMITTED at LSN 1000150
    T2: COMMITTED at LSN 1000300
    T3: ACTIVE (no commit record) - will undo
  Dirty pages to recover: 12

Phase 2: Redo
  Replaying committed changes...
  [LSN 1000100: Redo UPDATE on page 42]
  [LSN 1000200: Redo INSERT on page 87]
  ... 50 operations replayed

Phase 3: Undo
  Rolling back uncommitted transactions...
  [T3: Undo INSERT on page 123]
  [T3: Undo UPDATE on page 45]

Recovery complete in 0.3 seconds!
  All committed transactions preserved ✓
  All uncommitted transactions rolled back ✓

Implementation Hints:

Log record structure:

typedef enum {
    LOG_UPDATE,
    LOG_INSERT,
    LOG_DELETE,
    LOG_COMMIT,
    LOG_ABORT,
    LOG_CHECKPOINT
} LogRecordType;

typedef struct {
    uint64_t lsn;           // Log Sequence Number
    uint32_t txn_id;
    LogRecordType type;
    uint32_t page_id;
    uint16_t offset;        // Offset within page
    uint16_t length;
    uint8_t before_image[MAX_DATA];  // For undo
    uint8_t after_image[MAX_DATA];   // For redo
} LogRecord;

typedef struct {
    int fd;                 // Log file descriptor
    uint64_t current_lsn;
    uint64_t flushed_lsn;
    uint8_t buffer[LOG_BUFFER_SIZE];
    size_t buffer_offset;
    pthread_mutex_t latch;
} WAL;

Write-ahead logging protocol:

void wal_log_update(WAL* wal, Transaction* txn,
                    Page* page, int offset,
                    void* old_data, void* new_data, int len) {
    LogRecord record = {
        .lsn = ++wal->current_lsn,
        .txn_id = txn->id,
        .type = LOG_UPDATE,
        .page_id = page->id,
        .offset = offset,
        .length = len
    };
    memcpy(record.before_image, old_data, len);
    memcpy(record.after_image, new_data, len);

    // Write to log buffer
    append_to_buffer(wal, &record);

    // Update page's LSN
    page->page_lsn = record.lsn;
}

void wal_commit(WAL* wal, Transaction* txn) {
    LogRecord record = {
        .lsn = ++wal->current_lsn,
        .txn_id = txn->id,
        .type = LOG_COMMIT
    };
    append_to_buffer(wal, &record);

    // CRITICAL: Flush log before returning to client
    // This is the WAL guarantee!
    wal_flush(wal);
}

ARIES-style recovery:

void recover(WAL* wal, BufferPool* pool) {
    // Phase 1: Analysis
    uint64_t checkpoint_lsn = read_checkpoint(wal);
    HashSet* dirty_pages = new_hashset();
    HashSet* active_txns = new_hashset();

    LogRecord* record;
    wal_seek(wal, checkpoint_lsn);
    while ((record = wal_read_next(wal))) {
        if (record->type == LOG_UPDATE || record->type == LOG_INSERT) {
            hashset_add(dirty_pages, record->page_id);
            hashset_add(active_txns, record->txn_id);
        } else if (record->type == LOG_COMMIT) {
            hashset_remove(active_txns, record->txn_id);
        }
    }

    // Phase 2: Redo - bring all pages to crash state
    wal_seek(wal, checkpoint_lsn);
    while ((record = wal_read_next(wal))) {
        if (record->type == LOG_UPDATE) {
            Page* page = buffer_pool_get(pool, record->page_id);

            // Only redo if page is older than log record
            if (page->page_lsn < record->lsn) {
                apply_redo(page, record);
                page->page_lsn = record->lsn;
            }
        }
    }

    // Phase 3: Undo - rollback uncommitted transactions
    // Read log backwards
    wal_seek_end(wal);
    while ((record = wal_read_prev(wal))) {
        if (hashset_contains(active_txns, record->txn_id)) {
            if (record->type == LOG_UPDATE) {
                Page* page = buffer_pool_get(pool, record->page_id);
                apply_undo(page, record);  // Use before_image
            }
        }
    }
}

Learning milestones:

  1. You implement WAL protocol → You understand durability
  2. You survive simulated crashes → You understand recovery
  3. You implement checkpointing → You understand recovery time limits
  4. You compare flush strategies → You understand innodb_flush_log_at_trx_commit
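
For milestone 4, what differs between the settings is what happens at commit time versus on the roughly once-per-second background flush. A simplified sketch of the documented semantics of innodb_flush_log_at_trx_commit (1 = write and fsync on every commit, 2 = write on commit but fsync on the timer, 0 = both deferred to the timer), not InnoDB's actual code:

import os

def on_commit(log_fd, log_buffer: bytearray, policy: int):
    """What the log subsystem does when a transaction commits."""
    if policy in (1, 2):
        os.write(log_fd, bytes(log_buffer))   # hand the buffer to the OS page cache
        log_buffer.clear()
    if policy == 1:
        os.fsync(log_fd)                      # force to stable storage: survives power loss

def on_background_tick(log_fd, log_buffer: bytearray):
    """Runs roughly once per second regardless of policy."""
    if log_buffer:
        os.write(log_fd, bytes(log_buffer))
        log_buffer.clear()
    os.fsync(log_fd)

# policy 1: no committed work lost on crash; policy 2: up to ~1s lost if the OS
# or machine dies; policy 0: up to ~1s lost even if only mysqld dies.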

Project 7: MySQL Protocol Implementation

  • File: LEARN_MYSQL_DEEP_DIVE.md
  • Main Programming Language: Python
  • Alternative Programming Languages: Go, Rust, C
  • Coolness Level: Level 3: Genuinely Clever
  • Business Potential: 3. The “Service & Support” Model
  • Difficulty: Level 2: Intermediate
  • Knowledge Area: Network Protocols / Binary Parsing
  • Software or Tool: Custom implementation + Wireshark
  • Main Book: MySQL Internals Documentation

What you’ll build: A MySQL protocol client that connects to MySQL servers using the wire protocol, performs the handshake, authenticates, sends queries, and parses results—all without using libmysqlclient.

Why it teaches MySQL: The protocol is how applications talk to MySQL. Understanding it explains connection pooling, prepared statements, SSL/TLS negotiation, and why some drivers are faster than others.

Core challenges you’ll face:

  • Packet framing → maps to length-prefixed packets
  • Handshake and auth → maps to capability negotiation
  • Query protocol → maps to COM_QUERY and result sets
  • Prepared statements → maps to COM_STMT_PREPARE/EXECUTE

Key Concepts:

Difficulty: Intermediate. Time estimate: 1-2 weeks. Prerequisites: Projects 1-3 completed. Network programming basics.

Real world outcome:

# Your MySQL protocol client:
from mysql_protocol import MySQLClient

client = MySQLClient()
client.connect('localhost', 3306, 'root', 'password', 'testdb')

# Behind the scenes:
#   → Server: Initial Handshake (protocol version, server version, salt)
#   ← Client: Handshake Response (username, auth, database, capabilities)
#   → Server: OK or ERR packet

result = client.query("SELECT id, name FROM users WHERE active = 1")
# Behind the scenes:
#   ← Client: COM_QUERY packet with SQL string
#   → Server: Column count, column definitions, rows, EOF

for row in result:
    print(f"User {row['id']}: {row['name']}")

# Prepared statement:
stmt = client.prepare("SELECT * FROM orders WHERE user_id = ?")
result = client.execute(stmt, [42])
# Behind the scenes:
#   ← Client: COM_STMT_PREPARE
#   → Server: Statement ID, parameter count, column definitions
#   ← Client: COM_STMT_EXECUTE with binary parameters
#   → Server: Binary result set

Implementation Hints:

Packet structure:

import struct
import socket

class MySQLPacket:
    def __init__(self):
        self.sequence_id = 0

    def _recv_exact(self, sock, n):
        """recv() may return fewer bytes than asked for; loop until we have n."""
        data = b''
        while len(data) < n:
            chunk = sock.recv(n - len(data))
            if not chunk:
                raise ConnectionError("server closed the connection")
            data += chunk
        return data

    def read_packet(self, sock):
        """Read one MySQL packet"""
        # 4-byte header: 3 bytes length (little-endian) + 1 byte sequence ID
        header = self._recv_exact(sock, 4)
        length = struct.unpack('<I', header[:3] + b'\x00')[0]
        seq_id = header[3]

        payload = self._recv_exact(sock, length)
        return seq_id, payload

    def write_packet(self, sock, payload):
        """Write one MySQL packet"""
        length = len(payload)
        header = struct.pack('<I', length)[:3] + bytes([self.sequence_id])
        sock.sendall(header + payload)
        self.sequence_id += 1

class MySQLClient:
    def __init__(self):
        self.sock = None
        self.packet = MySQLPacket()

    def connect(self, host, port, user, password, database):
        self.sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        self.sock.connect((host, port))

        # Read initial handshake
        seq_id, payload = self.packet.read_packet(self.sock)
        handshake = self.parse_handshake(payload)

        # Send handshake response
        response = self.build_handshake_response(
            handshake, user, password, database
        )
        self.packet.write_packet(self.sock, response)

        # Read OK/ERR
        seq_id, payload = self.packet.read_packet(self.sock)
        if payload[0] == 0x00:  # OK packet
            print("Connected!")
        elif payload[0] == 0xff:  # ERR packet
            raise Exception(self.parse_error(payload))

    def parse_handshake(self, payload):
        """Parse initial handshake packet"""
        protocol_version = payload[0]
        # Null-terminated server version string
        nul_pos = payload.index(0, 1)
        server_version = payload[1:nul_pos].decode()
        # Connection ID (4 bytes)
        conn_id = struct.unpack('<I', payload[nul_pos+1:nul_pos+5])[0]
        # Auth plugin data (salt) - first 8 bytes
        salt_part1 = payload[nul_pos+5:nul_pos+13]
        # ... more fields

        return {
            'protocol_version': protocol_version,
            'server_version': server_version,
            'connection_id': conn_id,
            'salt': salt_part1,
        }

Query execution:

def query(self, sql):
    """Execute a query and return results"""
    # Send COM_QUERY
    payload = bytes([0x03]) + sql.encode('utf-8')  # 0x03 = COM_QUERY
    self.packet.sequence_id = 0
    self.packet.write_packet(self.sock, payload)

    # Read column count
    seq_id, payload = self.packet.read_packet(self.sock)
    if payload[0] == 0xff:  # Error
        raise Exception(self.parse_error(payload))

    column_count = self.read_length_encoded_int(payload)

    # Read column definitions
    columns = []
    for _ in range(column_count):
        seq_id, payload = self.packet.read_packet(self.sock)
        columns.append(self.parse_column_definition(payload))

    # Read EOF packet (deprecated in newer protocol)
    seq_id, payload = self.packet.read_packet(self.sock)

    # Read rows until EOF
    rows = []
    while True:
        seq_id, payload = self.packet.read_packet(self.sock)
        if payload[0] == 0xfe:  # EOF
            break
        rows.append(self.parse_row(payload, columns))

    return rows
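
The read_length_encoded_int helper used above is left undefined; here is a sketch of MySQL's length-encoded integer format (a first byte below 0xfb is the value itself; 0xfc, 0xfd and 0xfe prefix 2-, 3- and 8-byte little-endian integers):

def read_length_encoded_int(self, payload, offset=0):
    """Decode a MySQL length-encoded integer (value only; a full parser
    would also track how many bytes were consumed)."""
    first = payload[offset]
    if first < 0xfb:
        return first                                                   # 1-byte value
    if first == 0xfc:
        return int.from_bytes(payload[offset+1:offset+3], 'little')    # 2 bytes follow
    if first == 0xfd:
        return int.from_bytes(payload[offset+1:offset+4], 'little')    # 3 bytes follow
    if first == 0xfe:
        return int.from_bytes(payload[offset+1:offset+9], 'little')    # 8 bytes follow
    raise ValueError("0xfb (NULL) and 0xff (ERR) are not length-encoded integers")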

Learning milestones:

  1. You complete handshake → You understand connection establishment
  2. You execute queries → You understand the text protocol
  3. You use prepared statements → You understand the binary protocol
  4. You handle errors → You understand error packets

Project 8: Binlog Replication Subscriber

  • File: LEARN_MYSQL_DEEP_DIVE.md
  • Main Programming Language: Go
  • Alternative Programming Languages: Python, Java
  • Coolness Level: Level 4: Hardcore Tech Flex
  • Business Potential: 4. The “Open Core” Infrastructure
  • Difficulty: Level 3: Advanced
  • Knowledge Area: Replication / CDC / Event Streaming
  • Software or Tool: MySQL Binlog
  • Main Book: “High Performance MySQL” by Silvia Botros

What you’ll build: A binlog replication subscriber that connects to MySQL as a replica, receives binlog events, parses them, and streams database changes to external systems (like Elasticsearch, Kafka, or a data warehouse).

Why it teaches MySQL: Replication is how MySQL achieves high availability and read scaling. Understanding binlog format and replication protocol explains CDC (Change Data Capture), read replicas, and database migration tools.

Core challenges you’ll face:

  • Connecting as replica → maps to replication protocol handshake
  • Parsing binlog events → maps to ROW vs STATEMENT format
  • Handling DDL → maps to schema changes
  • Position tracking → maps to GTID vs file/position

Key Concepts:

Difficulty: Advanced. Time estimate: 2-3 weeks. Prerequisites: Projects 1-7 completed. Understanding of replication concepts.

Real world outcome:

MySQL Binlog Subscriber

Connected to MySQL 8.0.35 at localhost:3306
Binlog format: ROW
Starting from: mysql-bin.000003:154

Streaming events...

[2024-01-15 10:00:01] GTID: 3E11FA47-71CA-11E1-9E33-C80AA9429562:123
[2024-01-15 10:00:01] TABLE_MAP: test.users (table_id: 42)
[2024-01-15 10:00:01] WRITE_ROWS:
  INSERT INTO users: {id: 1001, name: 'Alice', email: 'alice@example.com'}
  → Forwarded to Elasticsearch index 'users'

[2024-01-15 10:00:02] GTID: 3E11FA47-71CA-11E1-9E33-C80AA9429562:124
[2024-01-15 10:00:02] TABLE_MAP: test.orders (table_id: 43)
[2024-01-15 10:00:02] UPDATE_ROWS:
  UPDATE orders SET status='shipped' WHERE id=5000
  Before: {id: 5000, status: 'pending', amount: 99.99}
  After:  {id: 5000, status: 'shipped', amount: 99.99}
  → Published to Kafka topic 'order-events'

Events processed: 1,234
Current position: mysql-bin.000003:98765
Lag: 0.5 seconds

Implementation Hints:

Connecting as a replica:

package main

import (
    "context"
    "fmt"

    "github.com/go-mysql-org/go-mysql/mysql"
    "github.com/go-mysql-org/go-mysql/replication"
)

func main() {
    cfg := replication.BinlogSyncerConfig{
        ServerID: 100,  // Unique replica ID
        Flavor:   "mysql",
        Host:     "localhost",
        Port:     3306,
        User:     "replication_user",
        Password: "password",
    }

    syncer := replication.NewBinlogSyncer(cfg)

    // Start from a specific position
    streamer, _ := syncer.StartSync(mysql.Position{
        Name: "mysql-bin.000003",
        Pos:  154,
    })

    // Or use GTID
    // gtidSet, _ := mysql.ParseGTIDSet("mysql",
    //     "3E11FA47-71CA-11E1-9E33-C80AA9429562:1-123")
    // streamer, _ := syncer.StartSyncGTID(gtidSet)

    for {
        event, _ := streamer.GetEvent(context.Background())
        handleEvent(event)
    }
}

func handleEvent(event *replication.BinlogEvent) {
    switch e := event.Event.(type) {
    case *replication.RotateEvent:
        fmt.Printf("Rotate to: %s\n", string(e.NextLogName))

    case *replication.TableMapEvent:
        fmt.Printf("Table: %s.%s (id: %d)\n",
            e.Schema, e.Table, e.TableID)

    case *replication.RowsEvent:
        for _, row := range e.Rows {
            switch event.Header.EventType {
            case replication.WRITE_ROWS_EVENTv2:
                handleInsert(row)
            case replication.UPDATE_ROWS_EVENTv2:
                // row contains [before, after] pairs
                handleUpdate(row)
            case replication.DELETE_ROWS_EVENTv2:
                handleDelete(row)
            }
        }

    case *replication.QueryEvent:
        // DDL statements
        fmt.Printf("DDL: %s\n", string(e.Query))
    }
}

Implementing from scratch (understanding the protocol):

# Binlog event header (common to all events)
class BinlogEventHeader:
    def parse(self, data):
        self.timestamp = struct.unpack('<I', data[0:4])[0]
        self.type_code = data[4]
        self.server_id = struct.unpack('<I', data[5:9])[0]
        self.event_length = struct.unpack('<I', data[9:13])[0]
        self.next_position = struct.unpack('<I', data[13:17])[0]
        self.flags = struct.unpack('<H', data[17:19])[0]

# Row event types
WRITE_ROWS_EVENT = 30
UPDATE_ROWS_EVENT = 31
DELETE_ROWS_EVENT = 32

class RowsEvent:
    def parse(self, data, table_map):
        self.table_id = struct.unpack('<Q', data[0:6] + b'\x00\x00')[0]
        # ... parse column bitmap and row data
        # Row format depends on table schema from TABLE_MAP event
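
A quick sanity check for the header parser (the field values are made up for the test; the 19-byte v4 header layout is timestamp, type, server_id, event_length, next_position, flags):

import struct

def pack_header(timestamp, type_code, server_id, event_length, next_position, flags):
    """Build a 19-byte binlog event header (v4 format) for testing the parser."""
    return struct.pack('<IBIIIH', timestamp, type_code, server_id,
                       event_length, next_position, flags)

raw = pack_header(1705312800, WRITE_ROWS_EVENT, 100, 58, 1234, 0)
hdr = BinlogEventHeader()
hdr.parse(raw)
assert (hdr.type_code, hdr.server_id, hdr.next_position) == (WRITE_ROWS_EVENT, 100, 1234)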

Learning milestones:

  1. You connect as replica → You understand replication protocol
  2. You parse row events → You understand binlog format
  3. You handle schema changes → You understand DDL replication
  4. You build a CDC pipeline → You can replicate to external systems

Project 9: Lock and Deadlock Visualizer

  • File: LEARN_MYSQL_DEEP_DIVE.md
  • Main Programming Language: Python
  • Alternative Programming Languages: Go, JavaScript
  • Coolness Level: Level 3: Genuinely Clever
  • Business Potential: 2. The “Micro-SaaS / Pro Tool”
  • Difficulty: Level 2: Intermediate
  • Knowledge Area: Locking / Deadlocks / Concurrency
  • Software or Tool: MySQL + Custom visualization
  • Main Book: “High Performance MySQL” by Silvia Botros

What you’ll build: A tool that monitors InnoDB locks in real-time, visualizes lock wait graphs, detects deadlock cycles, and explains why specific queries are blocking.

Why it teaches MySQL: Understanding InnoDB’s locking behavior is essential for debugging production issues. This project teaches you about row locks, gap locks, next-key locks, and why deadlocks happen.

Core challenges you’ll face:

  • Querying lock information → maps to performance_schema and innodb_lock_waits
  • Building wait-for graphs → maps to detecting cycles
  • Understanding lock types → maps to shared, exclusive, gap, next-key
  • Explaining lock conflicts → maps to which indexes cause which locks

Key Concepts:

Difficulty: Intermediate. Time estimate: 1-2 weeks. Prerequisites: Projects 1-4 completed. Understanding of transaction isolation.

Real world outcome:

InnoDB Lock Visualizer

Current Locks:
┌─────────────────────────────────────────────────────────────────┐
│ Transaction │ Lock Type  │ Table    │ Index   │ Lock Data      │
├─────────────┼────────────┼──────────┼─────────┼────────────────┤
│ 12345       │ X (excl.)  │ orders   │ PRIMARY │ id=100         │
│ 12346       │ S (share)  │ orders   │ PRIMARY │ id=100 WAITING │
│ 12347       │ X,GAP      │ orders   │ idx_user│ user_id=42,GAP │
└─────────────────────────────────────────────────────────────────┘

Wait-For Graph:
  ┌──────────────┐
  │ TXN 12346    │──waits for──→┌──────────────┐
  │ SELECT...FOR │              │ TXN 12345    │
  │ UPDATE       │              │ UPDATE orders│
  └──────────────┘              └──────────────┘

DEADLOCK DETECTED:
  ┌──────────────┐                ┌──────────────┐
  │ TXN 12350    │──waits for──→│ TXN 12351    │
  │              │              │              │
  │ Has lock on: │              │ Has lock on: │
  │ orders id=1  │              │ users id=5   │
  │              │              │              │
  │ Wants lock:  │              │ Wants lock:  │
  │ users id=5   │◄──waits for──│ orders id=1  │
  └──────────────┘                └──────────────┘

  Victim: TXN 12351 (rolled back)
  Cause: Classic cycle - each transaction holds what the other wants

Implementation Hints:

Querying lock information:

import mysql.connector
from collections import defaultdict

class LockMonitor:
    def __init__(self, connection):
        self.conn = connection

    def get_current_locks(self):
        """Get all current InnoDB locks"""
        cursor = self.conn.cursor(dictionary=True)
        cursor.execute("""
            SELECT
                r.trx_id AS waiting_trx_id,
                r.trx_query AS waiting_query,
                b.trx_id AS blocking_trx_id,
                b.trx_query AS blocking_query,
                l.OBJECT_SCHEMA AS locked_schema,
                l.OBJECT_NAME AS locked_table,
                l.INDEX_NAME AS locked_index,
                l.LOCK_TYPE AS lock_type,
                l.LOCK_MODE AS lock_mode,
                l.LOCK_DATA AS lock_data
            FROM performance_schema.data_lock_waits w
            JOIN performance_schema.data_locks l
                ON w.BLOCKING_ENGINE_LOCK_ID = l.ENGINE_LOCK_ID
            JOIN information_schema.innodb_trx r
                ON w.REQUESTING_ENGINE_TRANSACTION_ID = r.trx_id
            JOIN information_schema.innodb_trx b
                ON w.BLOCKING_ENGINE_TRANSACTION_ID = b.trx_id
        """)
        return cursor.fetchall()

    def build_wait_graph(self, locks):
        """Build a wait-for graph from lock data"""
        graph = defaultdict(set)
        for lock in locks:
            graph[lock['waiting_trx_id']].add(lock['blocking_trx_id'])
        return graph

    def detect_deadlock(self, graph):
        """Find cycles in wait-for graph"""
        visited = set()
        path = []

        def dfs(node):
            if node in path:
                # Found cycle!
                cycle_start = path.index(node)
                return path[cycle_start:]
            if node in visited:
                return None

            visited.add(node)
            path.append(node)

            for neighbor in graph[node]:
                result = dfs(neighbor)
                if result:
                    return result

            path.pop()
            return None

        for node in graph:
            result = dfs(node)
            if result:
                return result

        return None
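
Wiring it up (a sketch; the connection parameters are placeholders, and the account needs the PROCESS privilege plus read access to performance_schema):

conn = mysql.connector.connect(host='localhost', user='monitor',
                               password='secret', database='test')
monitor = LockMonitor(conn)

locks = monitor.get_current_locks()
for lock in locks:
    print(f"txn {lock['waiting_trx_id']} waits on txn {lock['blocking_trx_id']} "
          f"({lock['locked_table']}.{lock['locked_index']}: {lock['lock_data']})")

cycle = monitor.detect_deadlock(monitor.build_wait_graph(locks))
if cycle:
    print("Deadlock cycle:", " -> ".join(str(t) for t in cycle))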

Understanding gap locks:

def explain_gap_lock(self, lock_info):
    """Explain why a gap lock was taken"""
    if 'GAP' in lock_info['lock_mode']:
        return f"""
Gap Lock Explanation:
  - A gap lock prevents insertions in a range
  - This lock is on index: {lock_info['locked_index']}
  - Gap is before: {lock_info['lock_data']}

Why it happens:
  - REPEATABLE READ isolation level needs gap locks to prevent phantom reads
  - A query like: SELECT * FROM t WHERE x BETWEEN 10 AND 20 FOR UPDATE
    locks all rows AND the gaps between them

How to avoid:
  - Use READ COMMITTED isolation (no gap locks)
  - Use unique lookups instead of range scans
  - Add appropriate indexes to reduce lock scope
"""

Learning milestones:

  1. You query lock tables → You understand where to find lock info
  2. You build wait-for graphs → You understand lock dependencies
  3. You detect deadlocks → You understand the deadlock algorithm
  4. You explain gap locks → You understand REPEATABLE READ locking

Project 10: MySQL vs PostgreSQL Comparison Lab

  • File: LEARN_MYSQL_DEEP_DIVE.md
  • Main Programming Language: SQL + Python
  • Alternative Programming Languages: Go, Bash
  • Coolness Level: Level 3: Genuinely Clever
  • Business Potential: 2. The “Micro-SaaS / Pro Tool”
  • Difficulty: Level 2: Intermediate
  • Knowledge Area: Database Comparison / Benchmarking
  • Software or Tool: MySQL, PostgreSQL, SQLite
  • Main Book: “Designing Data-Intensive Applications” by Martin Kleppmann

What you’ll build: A comprehensive comparison lab that runs the same workloads on MySQL, PostgreSQL, and SQLite, measuring performance, consistency behavior, and feature differences.

Why it teaches MySQL: Understanding MySQL’s tradeoffs requires comparing it to alternatives. This project reveals why MySQL makes certain design choices and when to choose which database.

Core challenges you’ll face:

  • Setting up equivalent schemas → maps to data type differences (see the DDL sketch after this list)
  • Measuring performance fairly → maps to benchmark design
  • Testing isolation levels → maps to consistency behavior differences
  • Comparing features → maps to JSON, arrays, full-text, etc.
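
A minimal sketch of what "equivalent schemas" means in practice, using a hypothetical accounts table (the exact type mapping is the point of the exercise):

-- MySQL
CREATE TABLE accounts (
    id       BIGINT AUTO_INCREMENT PRIMARY KEY,
    active   BOOLEAN NOT NULL DEFAULT TRUE,     -- stored as TINYINT(1) under the hood
    balance  DECIMAL(12,2) NOT NULL,
    metadata JSON,
    created  DATETIME(6) DEFAULT CURRENT_TIMESTAMP(6)
);

-- PostgreSQL
CREATE TABLE accounts (
    id       BIGINT GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    active   BOOLEAN NOT NULL DEFAULT TRUE,
    balance  NUMERIC(12,2) NOT NULL,
    metadata JSONB,                             -- binary JSON, GIN-indexable
    created  TIMESTAMPTZ DEFAULT now()
);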

Key Concepts:

Difficulty: Intermediate Time estimate: 2 weeks Prerequisites: Projects 1-9 completed. Basic PostgreSQL knowledge.

Real world outcome:

Database Comparison Report

=== Performance Benchmarks ===

OLTP Workload (TPC-C like):
┌──────────────────┬─────────────┬─────────────┬─────────────┐
│ Metric           │ MySQL       │ PostgreSQL  │ SQLite      │
├──────────────────┼─────────────┼─────────────┼─────────────┤
│ Transactions/sec │ 12,500      │ 10,800      │ 45,000*     │
│ Read latency p99 │ 2.1 ms      │ 2.8 ms      │ 0.1 ms*     │
│ Write latency p99│ 5.2 ms      │ 4.8 ms      │ 0.5 ms*     │
└──────────────────┴─────────────┴─────────────┴─────────────┘
* SQLite is embedded, so no network latency

Complex Query Workload (TPC-H like):
┌──────────────────┬─────────────┬─────────────┐
│ Query            │ MySQL       │ PostgreSQL  │
├──────────────────┼─────────────┼─────────────┤
│ Aggregation      │ 1.2 sec     │ 0.8 sec     │
│ Join (5 tables)  │ 3.5 sec     │ 2.1 sec     │
│ Window functions │ 2.8 sec     │ 1.5 sec     │
│ CTE recursive    │ 4.2 sec     │ 2.8 sec     │
└──────────────────┴─────────────┴─────────────┘
→ PostgreSQL wins on complex analytical queries

=== Isolation Level Behavior ===

Phantom Read Test (REPEATABLE READ):
┌──────────────────┬─────────────┬─────────────┐
│                  │ MySQL       │ PostgreSQL  │
├──────────────────┼─────────────┼─────────────┤
│ Phantoms blocked │ Yes (gap    │ Yes (MVCC   │
│                  │ locking)    │ snapshot)   │
│ Method           │ Pessimistic │ Optimistic  │
│ Blocking?        │ Yes         │ No*         │
└──────────────────┴─────────────┴─────────────┘
* PostgreSQL raises a serialization failure on conflict instead of blocking

=== Feature Comparison ===

JSON Operations:
  MySQL:      JSON_EXTRACT(col, '$.field') - functional indexes available
  PostgreSQL: col->>'field' (JSONB) - GIN indexes, more operators

Arrays:
  MySQL:      Not supported (use JSON arrays)
  PostgreSQL: Native ARRAY type with operators

Full-Text Search:
  MySQL:      FULLTEXT indexes (basic)
  PostgreSQL: ts_vector, ts_query, GIN indexes (advanced)
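
The same lookups side by side, as a sketch assuming a docs table with a JSON doc column and an articles table with title/body text columns:

-- JSON: pull a field out of a document
SELECT doc->>'$.name' FROM docs;     -- MySQL (JSON path syntax)
SELECT doc->>'name'   FROM docs;     -- PostgreSQL (JSONB key syntax)

-- Full-text search: find articles mentioning "replication"
ALTER TABLE articles ADD FULLTEXT INDEX ft_body (title, body);            -- MySQL
SELECT * FROM articles
WHERE MATCH(title, body) AGAINST ('replication' IN NATURAL LANGUAGE MODE);

CREATE INDEX ft_body ON articles
USING GIN (to_tsvector('english', title || ' ' || body));                 -- PostgreSQL
SELECT * FROM articles
WHERE to_tsvector('english', title || ' ' || body) @@ to_tsquery('english', 'replication');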

Implementation Hints:

Benchmark framework:

import random
import time
import mysql.connector
import psycopg2
import sqlite3

class DatabaseBenchmark:
    def __init__(self, db_type, config):
        self.db_type = db_type
        # sqlite3 uses '?' placeholders; mysql-connector and psycopg2 use '%s'
        self.placeholder = '?' if db_type == 'sqlite' else '%s'
        if db_type == 'mysql':
            self.conn = mysql.connector.connect(**config)
        elif db_type == 'postgres':
            self.conn = psycopg2.connect(**config)
        elif db_type == 'sqlite':
            self.conn = sqlite3.connect(config['database'])
        else:
            raise ValueError(f"Unsupported db_type: {db_type}")

    def run_query(self, sql, params=None):
        cursor = self.conn.cursor()
        start = time.perf_counter()
        cursor.execute(sql, params or ())
        # Non-SELECT statements have no result set (cursor.description is None)
        result = cursor.fetchall() if cursor.description else None
        elapsed = time.perf_counter() - start
        return result, elapsed

    def benchmark_oltp(self, iterations=10000):
        """Simple OLTP benchmark: alternating point reads and single-row updates"""
        results = {'reads': [], 'writes': []}

        for _ in range(iterations):
            # Read a random row by primary key
            _, elapsed = self.run_query(
                f"SELECT * FROM users WHERE id = {self.placeholder}",
                (random.randint(1, 100000),)
            )
            results['reads'].append(elapsed)

            # Update a random account row
            _, elapsed = self.run_query(
                f"UPDATE accounts SET balance = balance + 1 WHERE id = {self.placeholder}",
                (random.randint(1, 100000),)
            )
            results['writes'].append(elapsed)
            self.conn.commit()

        return {
            'read_avg': sum(results['reads']) / len(results['reads']),
            'read_p99': sorted(results['reads'])[int(len(results['reads']) * 0.99)],
            'write_avg': sum(results['writes']) / len(results['writes']),
            'write_p99': sorted(results['writes'])[int(len(results['writes']) * 0.99)],
        }
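
A sketch of driving the harness across all three engines (connection parameters are placeholders):

if __name__ == '__main__':
    configs = {
        'mysql':    {'host': '127.0.0.1', 'user': 'bench', 'password': 'bench', 'database': 'benchdb'},
        'postgres': {'host': '127.0.0.1', 'user': 'bench', 'password': 'bench', 'dbname': 'benchdb'},
        'sqlite':   {'database': '/tmp/bench.db'},
    }

    for db_type, config in configs.items():
        bench = DatabaseBenchmark(db_type, config)
        stats = bench.benchmark_oltp(iterations=1000)
        print(f"{db_type:>8}: read p99 {stats['read_p99'] * 1000:.2f} ms, "
              f"write p99 {stats['write_p99'] * 1000:.2f} ms")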

Isolation level comparison:

def test_phantom_reads(db):
    """Test if phantom reads are possible (db/db2 are thin connection wrappers)"""
    # Transaction 1: set the isolation level BEFORE starting the transaction,
    # then begin and read (MySQL rejects SET TRANSACTION inside an open transaction)
    db.execute("SET TRANSACTION ISOLATION LEVEL REPEATABLE READ")
    db.execute("BEGIN")
    result1 = db.query("SELECT COUNT(*) FROM test WHERE value BETWEEN 1 AND 100")

    # Transaction 2: insert a row into the scanned range from another connection
    db2 = get_connection()
    db2.execute("INSERT INTO test (value) VALUES (50)")
    db2.commit()

    # Transaction 1: read the same range again
    result2 = db.query("SELECT COUNT(*) FROM test WHERE value BETWEEN 1 AND 100")
    db.commit()

    return {
        'phantom_possible': result1[0] != result2[0],
        'before': result1[0],
        'after': result2[0]
    }
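
What to expect: with plain (non-locking) SELECTs, both MySQL and PostgreSQL should report phantom_possible = False at REPEATABLE READ, because both read from a consistent snapshot established by the first read. MySQL's gap locks only come into play for locking reads and writes; if you change the first query to SELECT ... FOR UPDATE, the INSERT in the second connection will block until the first transaction commits (so run the two sessions from separate threads or terminals to observe it), which is the pessimistic behavior the report above contrasts with PostgreSQL's optimistic approach.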

Learning milestones:

  1. You run equivalent benchmarks → You understand performance tradeoffs
  2. You test isolation differences → You understand consistency models
  3. You compare features → You know when to choose which database
  4. You document tradeoffs → You can make informed architecture decisions

Project 11: Query Profiler and Slow Query Analyzer

  • File: LEARN_MYSQL_DEEP_DIVE.md
  • Main Programming Language: Python
  • Alternative Programming Languages: Go, JavaScript
  • Coolness Level: Level 3: Genuinely Clever
  • Business Potential: 3. The “Service & Support” Model
  • Difficulty: Level 2: Intermediate
  • Knowledge Area: Performance Tuning / Profiling / Optimization
  • Software or Tool: MySQL + Performance Schema
  • Main Book: “High Performance MySQL” by Silvia Botros

What you’ll build: A query profiler that analyzes slow query logs, parses EXPLAIN output, identifies missing indexes, and provides actionable optimization recommendations.

Why it teaches MySQL: Query optimization is the most practical MySQL skill. This project teaches you to read EXPLAIN plans, understand query costs, and fix performance problems systematically.
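
Before there is anything to parse, the slow query log has to be switched on; a minimal sketch (the thresholds are illustrative):

-- Log every statement slower than 0.5 s, plus statements that use no index
SET GLOBAL slow_query_log = 'ON';
SET GLOBAL long_query_time = 0.5;
SET GLOBAL log_queries_not_using_indexes = 'ON';

-- Where the log is being written
SHOW VARIABLES LIKE 'slow_query_log_file';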

Core challenges you’ll face:

  • Parsing slow query logs → maps to log format understanding
  • Analyzing EXPLAIN output → maps to understanding access types
  • Index recommendations → maps to which indexes would help
  • Query rewriting → maps to transforming slow queries

Key Concepts:

Difficulty: Intermediate Time estimate: 2 weeks Prerequisites: Projects 1-4 completed. Basic query optimization knowledge.

Real world outcome:

Query Profiler Report

Analyzed: 10,000 queries from slow query log
Time range: 2024-01-01 to 2024-01-15

=== Top 10 Slowest Queries ===

1. SELECT * FROM orders WHERE status = 'pending' AND created_at > '2024-01-01'
   Avg time: 4.5s | Executions: 1,234 | Total time: 5,553s

   EXPLAIN Analysis:
   ┌────────┬─────────────┬──────────────┬─────────┬──────────────┐
   │ table  │ type        │ possible_keys│ key     │ rows examined│
   ├────────┼─────────────┼──────────────┼─────────┼──────────────┤
   │ orders │ ALL         │ idx_status   │ NULL    │ 5,000,000    │
   └────────┴─────────────┴──────────────┴─────────┴──────────────┘

   ⚠️ PROBLEMS DETECTED:
     • Full table scan (type: ALL)
     • Index exists but not used
     • Examining 5M rows to return 1,234

   💡 RECOMMENDATIONS:
     1. Add composite index: CREATE INDEX idx_status_created
        ON orders(status, created_at)
     2. Or rewrite query to use existing index better

   Expected improvement: 4.5s → 0.05s (90x faster)

=== Index Recommendations Summary ===

Tables needing indexes:
  • orders: (status, created_at) - would help 45 queries
  • users: (country, created_at) - would help 23 queries
  • products: (category_id, price) - would help 18 queries

=== Query Patterns ===

Anti-patterns detected:
  • SELECT * usage: 234 queries (retrieve only needed columns)
  • OFFSET pagination: 56 queries (use keyset pagination; see the rewrite sketch below)
  • OR conditions: 89 queries (consider UNION)
  • Functions on indexed columns: 34 queries (can't use index)
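
For instance, the OFFSET anti-pattern flagged above and its keyset rewrite (a sketch assuming an orders table paged by id):

-- OFFSET pagination: MySQL still reads and discards the first 100,000 rows
SELECT * FROM orders ORDER BY id LIMIT 20 OFFSET 100000;

-- Keyset pagination: remember the last id seen and seek directly to it
SELECT * FROM orders WHERE id > 100020 ORDER BY id LIMIT 20;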

Implementation Hints:

Slow query log parser:

import re
from dataclasses import dataclass
from typing import List

@dataclass
class SlowQuery:
    timestamp: str
    user: str
    host: str
    query_time: float
    lock_time: float
    rows_sent: int
    rows_examined: int
    query: str

def parse_slow_log(log_file):
    queries = []
    current = {}

    with open(log_file) as f:
        for line in f:
            if line.startswith('# Time:'):
                if current.get('query'):
                    queries.append(SlowQuery(**current))
                current = {'timestamp': line.split(':', 1)[1].strip()}

            elif line.startswith('# User@Host:'):
                match = re.match(r'# User@Host: (\w+)\[.*\] @\s*(\S*)', line)
                if match:
                    current['user'] = match.group(1)
                    current['host'] = match.group(2)

            elif line.startswith('# Query_time:'):
                match = re.match(
                    r'# Query_time: ([\d.]+)\s+Lock_time: ([\d.]+)\s+'
                    r'Rows_sent: (\d+)\s+Rows_examined: (\d+)', line)
                if match:
                    current['query_time'] = float(match.group(1))
                    current['lock_time'] = float(match.group(2))
                    current['rows_sent'] = int(match.group(3))
                    current['rows_examined'] = int(match.group(4))

            elif not line.startswith('#'):
                current['query'] = current.get('query', '') + line

    # Don't drop the last entry in the file
    if current.get('query'):
        queries.append(SlowQuery(**current))

    return queries
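
With the parsed entries in hand, a rough "top queries by total time" summary might look like this (the fingerprinting regex is a simplistic placeholder; real tools normalize queries more carefully):

import re
from collections import defaultdict

def top_queries(queries, n=10):
    """Group slow-log entries by a crude query fingerprint and rank by total time."""
    totals = defaultdict(lambda: {'count': 0, 'total_time': 0.0})
    for q in queries:
        # Replace numeric and string literals so similar queries group together
        fingerprint = re.sub(r"\b\d+\b|'[^']*'", '?', q.query.strip())
        totals[fingerprint]['count'] += 1
        totals[fingerprint]['total_time'] += q.query_time
    return sorted(totals.items(), key=lambda kv: kv[1]['total_time'], reverse=True)[:n]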

EXPLAIN analyzer:

class ExplainAnalyzer:
    def analyze(self, explain_rows):
        problems = []
        recommendations = []

        for row in explain_rows:
            # Check access type
            access_type = row['type']
            if access_type == 'ALL':
                problems.append({
                    'severity': 'HIGH',
                    'issue': 'Full table scan',
                    'table': row['table'],
                    'rows': row['rows']
                })

            elif access_type == 'index':
                problems.append({
                    'severity': 'MEDIUM',
                    'issue': 'Full index scan',
                    'table': row['table']
                })

            # Check key usage
            if row['possible_keys'] and not row['key']:
                problems.append({
                    'severity': 'HIGH',
                    'issue': 'Index available but not used',
                    'table': row['table'],
                    'available_keys': row['possible_keys']
                })

            # Check for filesort
            if 'filesort' in (row.get('Extra') or '').lower():
                problems.append({
                    'severity': 'MEDIUM',
                    'issue': 'Using filesort (sorting in memory/disk)',
                    'table': row['table']
                })

            # Check rows examined vs sent
            if row['rows'] > 1000 and 'filtered' in row:
                filter_pct = float(row['filtered'])
                if filter_pct < 10:
                    problems.append({
                        'severity': 'HIGH',
                        'issue': f'Examining many rows but returning few ({filter_pct}%)',
                        'table': row['table']
                    })

        return problems, recommendations

    def suggest_index(self, query, schema):
        """Suggest indexes based on query structure"""
        # extract_where_columns / extract_order_columns are left for you to
        # implement (e.g. with a SQL parser such as sqlparse or sqlglot)
        where_columns = self.extract_where_columns(query)
        order_columns = self.extract_order_columns(query)

        # Composite index: WHERE columns first, ORDER BY columns last
        suggested = where_columns + [c for c in order_columns if c not in where_columns]

        return suggested
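
A sketch of feeding live EXPLAIN rows into the analyzer with mysql-connector (connection details are placeholders; with dictionary=True the row keys match the classic tabular EXPLAIN columns):

import mysql.connector

conn = mysql.connector.connect(host='127.0.0.1', user='root', database='shop')
cursor = conn.cursor(dictionary=True)
cursor.execute(
    "EXPLAIN SELECT * FROM orders "
    "WHERE status = 'pending' AND created_at > '2024-01-01'"
)
rows = cursor.fetchall()   # keys: table, type, possible_keys, key, rows, filtered, Extra, ...

analyzer = ExplainAnalyzer()
problems, recommendations = analyzer.analyze(rows)
for p in problems:
    print(f"[{p['severity']}] {p['issue']} on table {p['table']}")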

Learning milestones:

  1. You parse slow query logs → You understand query profiling
  2. You analyze EXPLAIN output → You understand access types
  3. You recommend indexes → You understand index design
  4. You identify anti-patterns → You can prevent problems

Project 12: Build a Mini-MySQL (Capstone)

  • File: LEARN_MYSQL_DEEP_DIVE.md
  • Main Programming Language: C or Rust
  • Alternative Programming Languages: Go, C++
  • Coolness Level: Level 5: Pure Magic (Super Cool)
  • Business Potential: 1. The “Resume Gold”
  • Difficulty: Level 5: Master
  • Knowledge Area: Database Implementation / Systems Programming
  • Software or Tool: Custom implementation
  • Main Book: “Database Internals” by Alex Petrov

What you’ll build: A mini relational database from scratch that integrates all previous projects: B+Tree storage, SQL parsing, query execution, MVCC transactions, WAL recovery, and the MySQL wire protocol.

Why it teaches everything: This capstone project forces you to understand how all database components work together. Building a complete (if simple) database proves deep understanding of MySQL internals.

Core components:

  1. Storage Layer: B+Tree pages on disk (Project 2)
  2. Buffer Pool: Page caching with LRU (Project 5)
  3. SQL Parser: Parse and plan queries (Project 3)
  4. Query Executor: Execute plans (Project 1)
  5. Transaction Manager: MVCC isolation (Project 4)
  6. Recovery: WAL and crash recovery (Project 6)
  7. Wire Protocol: MySQL-compatible interface (Project 7)

Real world outcome:

$ ./minidb --datadir=/tmp/minidb

MiniDB v0.1.0 - A Learning Database
Listening on port 3306...

# Connect with standard MySQL client!
$ mysql -h localhost -u root minidb

mysql> CREATE TABLE users (
    id INT PRIMARY KEY,
    name VARCHAR(100),
    email VARCHAR(100)
);
Query OK, 0 rows affected (0.01 sec)

mysql> INSERT INTO users VALUES (1, 'Alice', 'alice@example.com');
Query OK, 1 row affected (0.00 sec)

mysql> SELECT * FROM users WHERE id = 1;
+----+-------+-------------------+
| id | name  | email             |
+----+-------+-------------------+
|  1 | Alice | alice@example.com |
+----+-------+-------------------+
1 row in set (0.00 sec)

mysql> BEGIN;
mysql> UPDATE users SET name = 'Alice Smith' WHERE id = 1;
mysql> -- In another session, the old value is still visible (MVCC!)
mysql> COMMIT;

# Simulate crash and recovery
$ kill -9 $(pidof minidb)
$ ./minidb --datadir=/tmp/minidb
Recovering from WAL...
  Replaying 15 log records...
  Recovery complete in 0.1s
  All committed transactions preserved.

Time estimate: 2-3 months Prerequisites: All 11 previous projects completed


Project Comparison Table

| Project | Difficulty | Time | Depth of Understanding | Fun Factor |
|---------|------------|------|------------------------|------------|
| 1. SQL Query Executor | Intermediate | 2 weeks | ⭐⭐⭐ | ⭐⭐⭐⭐ |
| 2. B+Tree Implementation | Advanced | 3-4 weeks | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐ |
| 3. Query Optimizer | Advanced | 2-3 weeks | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐ |
| 4. MVCC Transaction Manager | Advanced | 2-3 weeks | ⭐⭐⭐⭐⭐ | ⭐⭐⭐ |
| 5. Buffer Pool Manager | Advanced | 2-3 weeks | ⭐⭐⭐⭐ | ⭐⭐⭐ |
| 6. WAL and Crash Recovery | Expert | 3-4 weeks | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐ |
| 7. MySQL Protocol | Intermediate | 1-2 weeks | ⭐⭐⭐ | ⭐⭐⭐⭐ |
| 8. Binlog Replication | Advanced | 2-3 weeks | ⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ |
| 9. Lock Visualizer | Intermediate | 1-2 weeks | ⭐⭐⭐⭐ | ⭐⭐⭐⭐ |
| 10. DB Comparison Lab | Intermediate | 2 weeks | ⭐⭐⭐⭐ | ⭐⭐⭐⭐ |
| 11. Query Profiler | Intermediate | 2 weeks | ⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ |
| 12. Mini-MySQL | Master | 2-3 months | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ |

For practical MySQL skills (DBA/Backend Developer):

  1. Project 11 (Query Profiler) - Immediately useful
  2. Project 9 (Lock Visualizer) - Debug production issues
  3. Project 3 (Query Optimizer) - Understand EXPLAIN
  4. Project 10 (Comparison Lab) - Make architecture decisions

For deep understanding (Database Engineer):

  1. Project 2 (B+Tree) - Foundation of all indexing
  2. Project 4 (MVCC) - How transactions work
  3. Project 5 (Buffer Pool) - How caching works
  4. Project 6 (WAL) - How durability works

For building database tools (Developer):

  1. Project 7 (Protocol) - Build drivers/proxies
  2. Project 8 (Binlog) - Build CDC pipelines
  3. Project 1 (SQL Executor) - Build query tools

Quick path (2-3 weeks):

  1. Project 1 (1 week) - Understand SQL fundamentals
  2. Project 11 (1 week) - Practical optimization skills
  3. Project 9 (3-4 days) - Debug locking issues

Essential Resources

Books

Official Documentation

Online Resources

Source Code


Summary

| # | Project | Main Programming Language |
|---|---------|---------------------------|
| 1 | SQL Query Executor | Python |
| 2 | B+Tree Index Implementation | C |
| 3 | Query Optimizer Simulator | Python |
| 4 | MVCC Transaction Manager | Python |
| 5 | Buffer Pool Manager | C |
| 6 | WAL and Crash Recovery | C |
| 7 | MySQL Protocol Implementation | Python |
| 8 | Binlog Replication Subscriber | Go |
| 9 | Lock and Deadlock Visualizer | Python |
| 10 | MySQL vs PostgreSQL Comparison Lab | SQL + Python |
| 11 | Query Profiler and Slow Query Analyzer | Python |
| 12 | Build a Mini-MySQL (Capstone) | C or Rust |

Happy querying! Understanding MySQL deeply means understanding how data really works—from SQL syntax down to the bits on disk. 🗄️