LEARN MYSQL DEEP DIVE

Learn MySQL: From Zero to Mastery

Goal: Deeply understand MySQL—from basic queries to storage engine internals, index structures, query optimization, transaction processing, and replication. Understand why MySQL makes the design choices it does and how it differs from PostgreSQL, SQLite, and other databases.


Why MySQL Matters

MySQL is one of the most widely-deployed databases in the world, powering everything from WordPress blogs to Facebook’s infrastructure. Understanding MySQL deeply teaches you:

  1. Relational database fundamentals: ACID, normalization, SQL
  2. Storage engine architecture: How data is physically organized on disk
  3. Indexing internals: B+trees, clustered vs secondary indexes
  4. Query optimization: How the optimizer chooses execution plans
  5. Concurrency control: MVCC, locking, isolation levels
  6. Distributed systems: Replication, sharding, high availability

MySQL vs Other Databases: The Big Picture

Before diving in, let’s understand where MySQL fits:

┌───────────────────┬────────────────────────────┬──────────────────────────┬───────────────────────┬──────────────────────────┐
│ Feature           │ MySQL/InnoDB               │ PostgreSQL               │ SQLite                │ MariaDB                  │
├───────────────────┼────────────────────────────┼──────────────────────────┼───────────────────────┼──────────────────────────┤
│ Architecture      │ Thread-per-connection      │ Process-per-connection   │ Embedded (in-process) │ Thread-per-connection    │
│ Storage Engine    │ Pluggable (InnoDB default) │ Single (heap + WAL)      │ Single (B-tree)       │ Pluggable (Aria, InnoDB) │
│ Replication       │ Logical (binlog)           │ Physical (WAL) + Logical │ None (single-file)    │ Logical (binlog)         │
│ MVCC              │ Undo logs + locks          │ Tuple versioning         │ WAL-mode              │ Undo logs + locks        │
│ Default Isolation │ REPEATABLE READ            │ READ COMMITTED           │ SERIALIZABLE          │ REPEATABLE READ          │
│ JSON Support      │ Good                       │ Excellent (JSONB)        │ Basic                 │ Good                     │
│ Full-Text Search  │ Basic                      │ Excellent (GIN/GiST)     │ FTS5 extension        │ Basic                    │
│ Geospatial        │ Basic                      │ Excellent (PostGIS)      │ R*Tree                │ Basic                    │
└───────────────────┴────────────────────────────┴──────────────────────────┴───────────────────────┴──────────────────────────┘

Key Philosophical Differences

MySQL’s Philosophy:

  • Simplicity and speed for common operations
  • Pluggable storage engines for different workloads
  • Optimized for read-heavy OLTP workloads
  • “Good enough” defaults that work for most cases

PostgreSQL’s Philosophy:

  • Correctness and standards compliance first
  • Rich feature set (arrays, custom types, advanced indexing)
  • Better for complex queries and analytics
  • More sophisticated query planner

SQLite’s Philosophy:

  • Embedded, zero-configuration
  • Single-file database
  • Perfect for local storage, testing, edge cases
  • ACID-compliant despite simplicity

MySQL Architecture Overview

┌─────────────────────────────────────────────────────────────────────┐
│                         MySQL Server                                 │
├─────────────────────────────────────────────────────────────────────┤
│                                                                      │
│  ┌──────────────────────────────────────────────────────────────┐   │
│  │                    Connection Layer                           │   │
│  │  ┌─────────┐ ┌─────────┐ ┌─────────┐ ┌─────────┐             │   │
│  │  │Thread 1 │ │Thread 2 │ │Thread 3 │ │Thread N │             │   │
│  │  │ (Conn)  │ │ (Conn)  │ │ (Conn)  │ │ (Conn)  │             │   │
│  │  └─────────┘ └─────────┘ └─────────┘ └─────────┘             │   │
│  └──────────────────────────────────────────────────────────────┘   │
│                              │                                       │
│  ┌──────────────────────────▼───────────────────────────────────┐   │
│  │                      SQL Layer                                │   │
│  │  ┌─────────────┐ ┌─────────────┐ ┌─────────────────────────┐ │   │
│  │  │   Parser    │ │  Optimizer  │ │     Query Executor      │ │   │
│  │  │  (Lexer +   │ │ (Cost-based │ │  (Iterates over rows,   │ │   │
│  │  │   Parser)   │ │  planning)  │ │   applies operations)   │ │   │
│  │  └─────────────┘ └─────────────┘ └─────────────────────────┘ │   │
│  │  ┌─────────────┐ ┌─────────────┐ ┌─────────────────────────┐ │   │
│  │  │Query Cache  │ │  Metadata   │ │    Table Definition     │ │   │
│  │  │ (removed    │ │   Cache     │ │        Cache            │ │   │
│  │  │  in 8.0)    │ │             │ │                         │ │   │
│  │  └─────────────┘ └─────────────┘ └─────────────────────────┘ │   │
│  └──────────────────────────────────────────────────────────────┘   │
│                              │                                       │
│  ┌──────────────────────────▼───────────────────────────────────┐   │
│  │                   Storage Engine API                          │   │
│  │            (Handler interface - pluggable engines)            │   │
│  └──────────────────────────────────────────────────────────────┘   │
│         │                    │                    │                  │
│  ┌──────▼──────┐     ┌───────▼───────┐    ┌───────▼───────┐        │
│  │   InnoDB    │     │    MyISAM     │    │    Memory     │        │
│  │  (Default)  │     │   (Legacy)    │    │  (Temp tables)│        │
│  │             │     │               │    │               │        │
│  │ ┌─────────┐ │     │ ┌───────────┐ │    │ ┌───────────┐ │        │
│  │ │ Buffer  │ │     │ │Key Buffer │ │    │ │  Hash or  │ │        │
│  │ │  Pool   │ │     │ │  (Index   │ │    │ │  B-Tree   │ │        │
│  │ │         │ │     │ │   Cache)  │ │    │ │  Indexes  │ │        │
│  │ └─────────┘ │     │ └───────────┘ │    │ └───────────┘ │        │
│  │ ┌─────────┐ │     │               │    │               │        │
│  │ │Redo Log │ │     │ No ACID, no   │    │ In-memory     │        │
│  │ │Undo Log │ │     │ transactions  │    │ only          │        │
│  │ └─────────┘ │     │               │    │               │        │
│  └─────────────┘     └───────────────┘    └───────────────┘        │
│         │                    │                    │                  │
└─────────┼────────────────────┼────────────────────┼──────────────────┘
          │                    │                    │
          ▼                    ▼                    ▼
     ┌─────────┐          ┌─────────┐          (Memory)
     │.ibd file│          │.MYD/.MYI│
     │(data +  │          │(data +  │
     │ index)  │          │ index)  │
     └─────────┘          └─────────┘

InnoDB Architecture Deep Dive

InnoDB is MySQL’s default storage engine and where most of the magic happens:

┌─────────────────────────────────────────────────────────────────────┐
│                         InnoDB Engine                                │
├─────────────────────────────────────────────────────────────────────┤
│                                                                      │
│  ┌────────────────────── IN MEMORY ─────────────────────────────┐   │
│  │                                                               │   │
│  │  ┌─────────────────────────────────────────────────────────┐ │   │
│  │  │                    Buffer Pool                           │ │   │
│  │  │  ┌─────────────────────────────────────────────────────┐ │ │   │
│  │  │  │ Data Pages │ Index Pages │ Undo Pages │ Adaptive HI │ │ │   │
│  │  │  └─────────────────────────────────────────────────────┘ │ │   │
│  │  │  LRU list (young → old) for page eviction                │ │   │
│  │  └─────────────────────────────────────────────────────────┘ │   │
│  │                                                               │   │
│  │  ┌───────────────┐ ┌───────────────┐ ┌───────────────────┐   │   │
│  │  │  Change       │ │   Log         │ │  Adaptive Hash    │   │   │
│  │  │  Buffer       │ │   Buffer      │ │  Index (AHI)      │   │   │
│  │  │ (for sec.idx) │ │ (redo log     │ │ (auto-built hash  │   │   │
│  │  │               │ │  in memory)   │ │  for hot pages)   │   │   │
│  │  └───────────────┘ └───────────────┘ └───────────────────────┘   │
│  │                                                               │   │
│  └───────────────────────────────────────────────────────────────┘   │
│                              │                                       │
│                              │ Flush                                 │
│                              ▼                                       │
│  ┌────────────────────── ON DISK ───────────────────────────────┐   │
│  │                                                               │   │
│  │  ┌───────────────┐ ┌───────────────┐ ┌───────────────────┐   │   │
│  │  │ System        │ │ File-per-     │ │ Redo Log Files    │   │   │
│  │  │ Tablespace    │ │ Table (.ibd)  │ │ (ib_logfile0/1)   │   │   │
│  │  │ (ibdata1)     │ │               │ │                   │   │   │
│  │  │               │ │ Data + Index  │ │ WAL for crash     │   │   │
│  │  │ Change buffer │ │ in B+Tree     │ │ recovery          │   │   │
│  │  │ Undo logs     │ │ structure     │ │                   │   │   │
│  │  │ Data dict     │ │               │ │                   │   │   │
│  │  └───────────────┘ └───────────────┘ └───────────────────┘   │   │
│  │                                                               │   │
│  │  ┌───────────────┐ ┌───────────────┐                         │   │
│  │  │ Doublewrite   │ │ Undo          │                         │   │
│  │  │ Buffer        │ │ Tablespace    │                         │   │
│  │  │ (crash safety)│ │ (undo_001)    │                         │   │
│  │  └───────────────┘ └───────────────┘                         │   │
│  │                                                               │   │
│  └───────────────────────────────────────────────────────────────┘   │
│                                                                      │
└─────────────────────────────────────────────────────────────────────┘

Project List

Projects progress from understanding fundamentals to building sophisticated database tools.


Project 1: SQL Query Executor (From Scratch)

  • File: LEARN_MYSQL_DEEP_DIVE.md
  • Main Programming Language: Python
  • Alternative Programming Languages: Go, Rust, C
  • Coolness Level: Level 4: Hardcore Tech Flex
  • Business Potential: 1. The “Resume Gold”
  • Difficulty: Level 2: Intermediate
  • Knowledge Area: SQL Parsing / Query Execution / Relational Algebra
  • Software or Tool: Custom implementation
  • Main Book: “Database System Concepts” by Silberschatz, Korth, Sudarshan

What you’ll build: A simple SQL query executor that parses SQL (SELECT, WHERE, JOIN), operates on CSV files as tables, and executes queries using relational algebra operators (select, project, join).

Why it teaches MySQL: Before understanding MySQL’s optimizations, you need to understand what SQL actually means. This project forces you to implement the relational model and see why database engines make certain design choices.

Core challenges you’ll face:

  • Parsing SQL syntax → maps to understanding SQL grammar
  • Implementing relational operators → maps to selection, projection, join
  • Query planning → maps to choosing join order, using indexes
  • Understanding cost → maps to why some queries are slow

Key Concepts:

  • Relational Algebra: “Database System Concepts” Chapter 6
  • SQL Syntax: MySQL SELECT Syntax
  • Query Processing: “Database System Concepts” Chapter 15

Difficulty: Intermediate. Time estimate: 2 weeks. Prerequisites: Basic SQL knowledge. Understanding of data structures.

Real world outcome:

# Your SQL executor in action:
$ python minisql.py

MiniSQL> CREATE TABLE users (id, name, email);
Table 'users' created.

MiniSQL> INSERT INTO users VALUES (1, 'Alice', 'alice@example.com');
MiniSQL> INSERT INTO users VALUES (2, 'Bob', 'bob@example.com');

MiniSQL> SELECT name, email FROM users WHERE id = 1;
+-------+-------------------+
| name  | email             |
+-------+-------------------+
| Alice | alice@example.com |
+-------+-------------------+
1 row in set (0.001 sec)

MiniSQL> CREATE TABLE orders (id, user_id, amount);
MiniSQL> INSERT INTO orders VALUES (1, 1, 100);
MiniSQL> INSERT INTO orders VALUES (2, 1, 200);

MiniSQL> SELECT users.name, orders.amount
         FROM users
         JOIN orders ON users.id = orders.user_id;
+-------+--------+
| name  | amount |
+-------+--------+
| Alice | 100    |
| Alice | 200    |
+-------+--------+
2 rows in set (0.002 sec)

Implementation Hints:

Core relational operators:

class Table:
    def __init__(self, name, columns, rows=None):
        self.name = name
        self.columns = columns  # ['id', 'name', 'email']
        self.rows = rows or []  # [[1, 'Alice', 'alice@...'], ...]

    def select(self, predicate):
        """σ (sigma) - Filter rows matching predicate"""
        return Table(
            self.name,
            self.columns,
            [row for row in self.rows if predicate(row, self.columns)]
        )

    def project(self, column_names):
        """π (pi) - Select specific columns"""
        indices = [self.columns.index(c) for c in column_names]
        return Table(
            self.name,
            column_names,
            [[row[i] for i in indices] for row in self.rows]
        )

    def join(self, other, on_left, on_right):
        """⋈ (bowtie) - Join two tables"""
        left_idx = self.columns.index(on_left)
        right_idx = other.columns.index(on_right)

        result_rows = []
        for left_row in self.rows:
            for right_row in other.rows:
                if left_row[left_idx] == right_row[right_idx]:
                    result_rows.append(left_row + right_row)

        return Table(
            f"{self.name}_{other.name}",
            self.columns + other.columns,
            result_rows
        )

Simple SQL parser:

import re

def parse_select(sql):
    """Parse SELECT column1, column2 FROM table WHERE condition"""
    pattern = r"SELECT\s+(.+?)\s+FROM\s+(\w+)(?:\s+WHERE\s+(.+))?"
    match = re.match(pattern, sql, re.IGNORECASE)

    if match:
        columns = [c.strip() for c in match.group(1).split(',')]
        table = match.group(2)
        where = match.group(3)
        return {'type': 'select', 'columns': columns, 'table': table, 'where': where}
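
A quick way to see the pieces fit together (a minimal sketch that assumes the Table class and parse_select above live in the same module; the string-splitting WHERE handling is for illustration only):

def run_select(tables, sql):
    """Tiny driver: parse a SELECT, then apply select (WHERE) and project."""
    plan = parse_select(sql)
    table = tables[plan['table']]

    if plan['where']:
        # Illustrative only: handles "col = value" style predicates
        col, _, raw = plan['where'].partition('=')
        col, raw = col.strip(), raw.strip().strip("'")
        table = table.select(
            lambda row, cols: str(row[cols.index(col)]) == raw
        )

    if plan['columns'] != ['*']:
        table = table.project(plan['columns'])
    return table

users = Table('users', ['id', 'name', 'email'],
              [[1, 'Alice', 'alice@example.com'], [2, 'Bob', 'bob@example.com']])
result = run_select({'users': users}, "SELECT name, email FROM users WHERE id = 1")
print(result.rows)   # [['Alice', 'alice@example.com']]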

Learning milestones:

  1. You parse basic SQL → You understand SQL syntax
  2. You implement SELECT/WHERE → You understand selection and projection
  3. You implement JOIN → You understand the most expensive operation
  4. You see why indexes matter → You understand scanning vs seeking

Project 2: B+Tree Index Implementation

  • File: LEARN_MYSQL_DEEP_DIVE.md
  • Main Programming Language: C
  • Alternative Programming Languages: Rust, Go, Python
  • Coolness Level: Level 4: Hardcore Tech Flex
  • Business Potential: 1. The “Resume Gold”
  • Difficulty: Level 3: Advanced
  • Knowledge Area: Data Structures / Indexing / Disk-Based Storage
  • Software or Tool: Custom implementation
  • Main Book: “Database Internals” by Alex Petrov

What you’ll build: A disk-based B+Tree implementation that stores data on disk in pages, supports insertions, deletions, range scans, and demonstrates why B+Trees are ideal for databases.

Why it teaches MySQL: InnoDB stores all data in B+Tree structures. The clustered index IS the table. Understanding B+Trees means understanding how MySQL physically organizes data and why certain operations are fast or slow.

Core challenges you’ll face:

  • Page-based storage → maps to InnoDB’s 16KB pages
  • Node splitting and merging → maps to handling insertions and deletions
  • Range scans via leaf links → maps to why B+Trees excel at range queries
  • Understanding clustered vs secondary → maps to InnoDB’s index design

Key Concepts:

Difficulty: Advanced. Time estimate: 3-4 weeks. Prerequisites: Project 1 completed. Understanding of tree data structures. C or systems programming.

Real world outcome:

// Your B+Tree in action:
$ ./btree_demo

Creating B+Tree with page size 4096, order 100...
Inserting 1,000,000 keys...
Insert time: 2.3 seconds

Point lookup (key = 500000): found in 3 page reads (tree height = 3)
Range scan (100000 to 100100): 100 rows in 5 page reads

Tree statistics:
  - Height: 3
  - Leaf pages: 10,000
  - Internal pages: 101
  - Fill factor: 67%

Demonstrating why B+Tree > Binary Tree for disk:
  - B+Tree:  3 disk reads for any key in 1M rows
  - Binary:  20 disk reads for any key in 1M rows (log2(1M))

Implementation Hints:

B+Tree node structure:

#include <stdint.h>

/* Example key/value types and cache size - adjust to your record format */
typedef uint64_t Key;
typedef uint64_t Value;
#define CACHE_SIZE 64

#define PAGE_SIZE 4096
#define MAX_KEYS ((PAGE_SIZE - sizeof(NodeHeader)) / (sizeof(Key) + sizeof(Value)))

typedef struct {
    uint32_t is_leaf;
    uint32_t num_keys;
    uint32_t parent_page;
    uint32_t next_leaf;  // Only for leaf nodes (linked list)
    uint32_t prev_leaf;
} NodeHeader;

typedef struct {
    NodeHeader header;
    Key keys[MAX_KEYS];
    union {
        Value values[MAX_KEYS];      // For leaf nodes
        uint32_t children[MAX_KEYS + 1];  // For internal nodes (page IDs)
    };
} BTreeNode;

typedef struct {
    int fd;              // File descriptor
    uint32_t root_page;
    uint32_t num_pages;
    BTreeNode* cache[CACHE_SIZE];  // Simple page cache
} BTree;

Key operations:

// Search: traverse from root to leaf
Value* btree_search(BTree* tree, Key key) {
    uint32_t page_id = tree->root_page;

    while (true) {
        BTreeNode* node = read_page(tree, page_id);

        if (node->header.is_leaf) {
            // Binary search in leaf
            int idx = binary_search(node->keys, node->header.num_keys, key);
            if (idx >= 0) {
                return &node->values[idx];
            }
            return NULL;
        }

        // Find child to follow
        int idx = find_child_index(node, key);
        page_id = node->children[idx];
    }
}

// Range scan: find start, follow leaf links
void btree_range_scan(BTree* tree, Key start, Key end, Callback cb) {
    // Find leaf containing start key
    uint32_t page_id = find_leaf(tree, start);

    while (page_id != 0) {
        BTreeNode* node = read_page(tree, page_id);

        for (int i = 0; i < node->header.num_keys; i++) {
            if (node->keys[i] > end) return;
            if (node->keys[i] >= start) {
                cb(node->keys[i], node->values[i]);
            }
        }

        page_id = node->header.next_leaf;  // Follow leaf link!
    }
}

Why B+Tree is perfect for databases:

  1. High fanout → Tree stays shallow (3-4 levels for millions of rows)
  2. Leaf links → Range scans don’t need to revisit internal nodes
  3. Sequential leaf access → Disk prefetching works well
  4. Consistent read performance → All leaves at same depth
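
A rough back-of-the-envelope check on the fanout point above (a sketch with assumed key/pointer and row sizes, not InnoDB's exact on-page format):

import math

def btree_height(n_rows, page_size=4096, key_ptr_bytes=16, row_bytes=160):
    """Estimate B+Tree height from internal fanout and rows per leaf page."""
    fanout = page_size // key_ptr_bytes        # key + child pointer pairs per internal page
    rows_per_leaf = page_size // row_bytes     # records per leaf page
    leaves = math.ceil(n_rows / rows_per_leaf)
    # Internal levels needed to address all leaves, plus the leaf level itself
    return 1 + max(1, math.ceil(math.log(leaves, fanout)))

for n in (1_000_000, 100_000_000):
    print(n, "rows -> height ~", btree_height(n))
# 1,000,000 rows   -> height ~ 3  (a binary tree would need ~20 levels: log2(1e6))
# 100,000,000 rows -> height ~ 4  (100x the data costs only one extra level)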

Learning milestones:

  1. You implement basic search → You understand tree traversal
  2. You implement insert with splitting → You understand node management
  3. You implement range scans → You understand leaf linking
  4. You measure disk I/O → You understand why B+Trees win

Project 3: Query Optimizer Simulator

  • File: LEARN_MYSQL_DEEP_DIVE.md
  • Main Programming Language: Python
  • Alternative Programming Languages: Go, Java
  • Coolness Level: Level 4: Hardcore Tech Flex
  • Business Potential: 2. The “Micro-SaaS / Pro Tool”
  • Difficulty: Level 3: Advanced
  • Knowledge Area: Query Optimization / Cost Estimation / Statistics
  • Software or Tool: Custom implementation
  • Main Book: “Database System Concepts” by Silberschatz, Korth, Sudarshan

What you’ll build: A query optimizer that takes SQL queries, generates multiple execution plans, estimates costs based on table statistics, and chooses the cheapest plan—just like MySQL’s optimizer.

Why it teaches MySQL: The optimizer is MySQL’s brain. Understanding how it estimates costs and chooses plans explains why some queries are fast and others slow, and how to write SQL that the optimizer can optimize.

Core challenges you’ll face:

  • Generating alternative plans → maps to join order, index choice
  • Estimating cardinality → maps to how many rows will match?
  • Cost modeling → maps to disk I/O vs CPU costs
  • Understanding EXPLAIN → maps to reading MySQL’s execution plans

Key Concepts:

Difficulty: Advanced. Time estimate: 2-3 weeks. Prerequisites: Projects 1-2 completed. Understanding of SQL execution.

Real world outcome:

Query Optimizer Simulator

Query: SELECT * FROM orders o
       JOIN customers c ON o.customer_id = c.id
       JOIN products p ON o.product_id = p.id
       WHERE c.country = 'USA'
         AND p.category = 'Electronics'

Table Statistics:
  orders:    1,000,000 rows, index on customer_id, product_id
  customers: 100,000 rows, index on id, country
  products:  10,000 rows, index on id, category

Generated Plans:

Plan 1: customers → orders → products
  ├─ Scan customers WHERE country='USA' (est. 10,000 rows)
  ├─ Index lookup orders.customer_id (est. 100,000 rows)
  └─ Index lookup products.id WHERE category='Electronics' (est. 1,000 rows)
  Cost: 15,230 (disk I/O: 1,200, CPU: 14,030)

Plan 2: products → orders → customers  ★ CHOSEN
  ├─ Scan products WHERE category='Electronics' (est. 500 rows)
  ├─ Index lookup orders.product_id (est. 50,000 rows)
  └─ Index lookup customers.id WHERE country='USA' (est. 5,000 rows)
  Cost: 8,750 (disk I/O: 850, CPU: 7,900)

Plan 3: orders → customers → products
  ├─ Full scan orders (1,000,000 rows) ← EXPENSIVE!
  ...
  Cost: 1,250,000

Optimizer chose Plan 2 (2x faster than Plan 1)

Implementation Hints:

Table statistics:

class TableStats:
    def __init__(self, name, row_count, columns, indexes):
        self.name = name
        self.row_count = row_count
        self.columns = columns  # {col_name: {'distinct': N, 'min': X, 'max': Y}}
        self.indexes = indexes  # {index_name: {'columns': [...], 'unique': bool}}

    def estimate_selectivity(self, column, operator, value):
        """Estimate what fraction of rows match a predicate"""
        col_stats = self.columns[column]

        if operator == '=':
            # Uniform distribution assumption
            return 1.0 / col_stats['distinct']
        elif operator == '<':
            # Range selectivity
            range_size = col_stats['max'] - col_stats['min']
            return (value - col_stats['min']) / range_size
        # ... more operators

    def estimate_rows(self, predicates):
        """Estimate rows after applying predicates"""
        selectivity = 1.0
        for pred in predicates:
            selectivity *= self.estimate_selectivity(
                pred.column, pred.operator, pred.value
            )
        return int(self.row_count * selectivity)

Plan generation and costing:

class QueryOptimizer:
    def __init__(self, stats):
        self.stats = stats  # {table_name: TableStats}
        self.io_cost_per_page = 1.0
        self.cpu_cost_per_row = 0.01

    def generate_join_orders(self, tables):
        """Generate all possible join orders"""
        from itertools import permutations
        return list(permutations(tables))

    def estimate_join_cost(self, left_rows, right_rows, join_type):
        """Estimate cost of a join operation"""
        if join_type == 'nested_loop':
            return left_rows * right_rows * self.cpu_cost_per_row
        elif join_type == 'hash_join':
            return (left_rows + right_rows) * self.cpu_cost_per_row * 2
        elif join_type == 'index_lookup':
            # Much cheaper if we have an index
            return left_rows * 3 * self.io_cost_per_page  # B+Tree depth

    def cost_plan(self, plan):
        """Calculate total cost of an execution plan"""
        total_cost = 0
        current_rows = 0

        for step in plan.steps:
            if step.type == 'table_scan':
                pages = step.table.row_count / 100  # ~100 rows per page
                total_cost += pages * self.io_cost_per_page
                current_rows = step.estimated_rows

            elif step.type == 'index_lookup':
                total_cost += 3 * self.io_cost_per_page  # B+Tree traversal
                total_cost += step.estimated_rows * self.cpu_cost_per_row
                current_rows = step.estimated_rows

            elif step.type == 'join':
                total_cost += self.estimate_join_cost(
                    current_rows, step.right_rows, step.join_type
                )
                current_rows = step.estimated_output_rows

        return total_cost

    def optimize(self, query):
        """Find the cheapest execution plan"""
        plans = self.generate_all_plans(query)
        return min(plans, key=self.cost_plan)
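
A small worked example for the cardinality step (a sketch; the Predicate tuple is an assumption, just anything with column/operator/value attributes as estimate_rows expects):

from collections import namedtuple

Predicate = namedtuple('Predicate', ['column', 'operator', 'value'])

customers = TableStats(
    name='customers',
    row_count=100_000,
    columns={'country': {'distinct': 50, 'min': None, 'max': None},
             'age':     {'distinct': 80, 'min': 18,   'max': 98}},
    indexes={'idx_country': {'columns': ['country'], 'unique': False}},
)

# country = 'USA' -> selectivity 1/50; age < 38 -> (38-18)/(98-18) = 0.25
preds = [Predicate('country', '=', 'USA'), Predicate('age', '<', 38)]
print(customers.estimate_rows(preds))   # 100000 * (1/50) * 0.25 = 500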

Learning milestones:

  1. You generate alternative plans → You understand plan space
  2. You estimate cardinality → You understand statistics
  3. You calculate costs → You understand I/O vs CPU tradeoffs
  4. You read EXPLAIN output → You can optimize real MySQL queries

Project 4: MVCC Transaction Manager

  • File: LEARN_MYSQL_DEEP_DIVE.md
  • Main Programming Language: Python
  • Alternative Programming Languages: Go, Rust, C++
  • Coolness Level: Level 4: Hardcore Tech Flex
  • Business Potential: 1. The “Resume Gold”
  • Difficulty: Level 3: Advanced
  • Knowledge Area: Transactions / Concurrency / Isolation
  • Software or Tool: Custom implementation
  • Main Book: “Designing Data-Intensive Applications” by Martin Kleppmann

What you’ll build: An MVCC (Multi-Version Concurrency Control) system that maintains multiple versions of data, implements snapshot isolation, and demonstrates how InnoDB allows readers and writers to not block each other.

Why it teaches MySQL: InnoDB’s MVCC is central to its concurrency model. Understanding how undo logs create “snapshots” and how visibility rules work explains isolation levels, long transaction problems, and lock behavior.

Core challenges you’ll face:

  • Versioning data → maps to undo logs storing old versions
  • Visibility rules → maps to which version can a transaction see?
  • Garbage collection → maps to purging old versions
  • Isolation levels → maps to READ COMMITTED vs REPEATABLE READ

Key Concepts:

Difficulty: Advanced. Time estimate: 2-3 weeks. Prerequisites: Projects 1-3 completed. Understanding of transaction concepts.

Real world outcome:

MVCC Transaction Simulator

txn1> BEGIN;  -- Transaction ID: 100
txn2> BEGIN;  -- Transaction ID: 101

txn1> UPDATE accounts SET balance = 500 WHERE id = 1;
      [Created version: {txn_id: 100, value: 500}]
      [Old version: {txn_id: 50, value: 1000} → undo log]

txn2> SELECT balance FROM accounts WHERE id = 1;
      → 1000 (sees old version - txn 100 not committed)

txn1> COMMIT;
      [Version {txn_id: 100} now visible to new transactions]

txn2> SELECT balance FROM accounts WHERE id = 1;
      → 1000 (REPEATABLE READ: still sees snapshot from txn start)

txn3> BEGIN;  -- Transaction ID: 102
txn3> SELECT balance FROM accounts WHERE id = 1;
      → 500 (sees committed version from txn 100)

-- Demonstration of isolation levels:
SET ISOLATION LEVEL READ COMMITTED;
txn2> SELECT balance FROM accounts WHERE id = 1;
      → 500 (READ COMMITTED: sees latest committed value)

Implementation Hints:

Version chain and visibility:

from dataclasses import dataclass
from typing import Any, Optional
import threading

@dataclass
class Version:
    txn_id: int
    value: Any
    created_at: int  # Transaction ID that created this
    deleted_at: Optional[int] = None  # Transaction ID that deleted this
    prev_version: Optional['Version'] = None  # Link to older version

class MVCCRow:
    def __init__(self, key):
        self.key = key
        self.version_chain: Optional[Version] = None
        self.lock = threading.Lock()

    def write(self, txn_id: int, value: Any):
        """Create new version, link to previous"""
        with self.lock:
            new_version = Version(
                txn_id=txn_id,
                value=value,
                created_at=txn_id,
                prev_version=self.version_chain
            )
            self.version_chain = new_version

    def read(self, txn: 'Transaction') -> Any:
        """Find visible version for this transaction"""
        version = self.version_chain

        while version:
            if txn.is_visible(version):
                return version.value
            version = version.prev_version

        return None  # No visible version

class Transaction:
    def __init__(self, txn_id: int, isolation_level: str, active_txns: set):
        self.txn_id = txn_id
        self.isolation_level = isolation_level
        self.snapshot = active_txns.copy()  # Active txns at start
        self.start_time = txn_id

    def is_visible(self, version: Version) -> bool:
        """Can this transaction see this version?
        (Simplified: ignores the transaction's own uncommitted writes.)"""
        # Created before our snapshot by a transaction that had already committed
        if version.created_at < self.start_time:
            if version.created_at not in self.snapshot:  # creator not active at our start
                if version.deleted_at is None:
                    return True  # never deleted
                if version.deleted_at in self.snapshot:
                    return True  # deleter was still active at our start: delete not visible
                if version.deleted_at > self.start_time:
                    return True  # deleted after we started: delete not visible
        return False

class MVCCDatabase:
    def __init__(self):
        self.rows = {}
        self.next_txn_id = 1
        self.active_transactions = set()
        self.committed_transactions = set()

    def begin(self, isolation_level='REPEATABLE READ'):
        txn_id = self.next_txn_id
        self.next_txn_id += 1
        self.active_transactions.add(txn_id)

        return Transaction(txn_id, isolation_level, self.active_transactions)

    def commit(self, txn: Transaction):
        self.active_transactions.remove(txn.txn_id)
        self.committed_transactions.add(txn.txn_id)
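
A minimal driver for the classes above (assuming everything sits in one module, and wiring the row in by hand rather than through db.rows to keep it short). It reproduces the REPEATABLE READ behavior from the transcript earlier:

db = MVCCDatabase()
row = MVCCRow('accounts:1')

setup = db.begin()
row.write(setup.txn_id, 1000)              # initial balance = 1000
db.commit(setup)

txn1 = db.begin()                          # the writer
txn2 = db.begin()                          # a concurrent reader

row.write(txn1.txn_id, 500)                # txn1 updates, not yet committed
print(row.read(txn2))                      # 1000 - txn1's version is not visible

db.commit(txn1)
print(row.read(txn2))                      # 1000 - snapshot fixed at txn2's start

txn3 = db.begin()                          # starts after txn1 committed
print(row.read(txn3))                      # 500  - sees txn1's committed version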

Learning milestones:

  1. You implement version chains → You understand undo logs
  2. You implement visibility rules → You understand snapshot isolation
  3. You compare isolation levels → You understand READ COMMITTED vs REPEATABLE READ
  4. You see long transaction problems → You understand why long txns are bad

Project 5: Buffer Pool Manager

  • File: LEARN_MYSQL_DEEP_DIVE.md
  • Main Programming Language: C
  • Alternative Programming Languages: Rust, C++
  • Coolness Level: Level 4: Hardcore Tech Flex
  • Business Potential: 1. The “Resume Gold”
  • Difficulty: Level 3: Advanced
  • Knowledge Area: Memory Management / Caching / I/O
  • Software or Tool: Custom implementation
  • Main Book: “Database Internals” by Alex Petrov

What you’ll build: A buffer pool manager that caches disk pages in memory, implements LRU eviction (with young/old list like InnoDB), handles dirty page flushing, and manages concurrent access.

Why it teaches MySQL: The buffer pool is InnoDB’s most important performance feature. Understanding it explains why “more RAM = faster database”, how to tune innodb_buffer_pool_size, and why certain access patterns perform poorly.

Core challenges you’ll face:

  • Page caching → maps to InnoDB’s buffer pool
  • LRU with midpoint insertion → maps to young/old list separation
  • Dirty page management → maps to checkpoint and flush
  • Concurrent access → maps to latching pages

Key Concepts:

Difficulty: Advanced. Time estimate: 2-3 weeks. Prerequisites: Projects 1-4 completed. Systems programming experience.

Real world outcome:

Buffer Pool Simulator

Configuration:
  Buffer pool size: 128 MB (8,192 pages × 16 KB)
  Young list: 5/8 of pool (5,120 pages)
  Old list: 3/8 of pool (3,072 pages)

Running OLTP workload simulation...

Statistics after 1 million operations:
  Hit ratio: 98.7%
  Pages read from disk: 13,000
  Pages written to disk: 4,500
  Evictions: 45,000

Page distribution:
  Young list: 5,120 pages (frequently accessed)
  Old list: 3,072 pages (recently loaded)
  Dirty pages: 1,234 (pending flush)

LRU behavior demonstration:
  - Full table scan of 100K rows...
  - Old pages: 3,072 → scan loaded here
  - Young pages: 5,120 → protected from scan eviction!
  - Hit ratio after scan: 97.2% (only 1.5% drop)

Without midpoint insertion (traditional LRU):
  - Hit ratio after scan: 45% (catastrophic drop!)

Implementation Hints:

Buffer pool structure:

#define PAGE_SIZE 16384
#define POOL_SIZE 8192  // Number of pages

typedef struct BufferPage {
    uint32_t page_id;
    bool is_dirty;
    int pin_count;  // How many threads are using this page
    pthread_rwlock_t latch;

    uint8_t data[PAGE_SIZE];

    // LRU list pointers
    struct BufferPage* lru_prev;
    struct BufferPage* lru_next;
    bool in_young_list;  // vs old list
} BufferPage;

typedef struct BufferPool {
    BufferPage pages[POOL_SIZE];

    // Hash table: page_id → BufferPage*
    HashTable* page_table;

    // LRU lists (InnoDB uses midpoint insertion)
    BufferPage* young_head;  // Most recently used
    BufferPage* young_tail;
    BufferPage* old_head;    // Midpoint
    BufferPage* old_tail;    // Least recently used

    // Free list
    BufferPage* free_list;

    pthread_mutex_t pool_latch;

    // Stats
    uint64_t hits;
    uint64_t misses;
} BufferPool;

LRU with midpoint insertion:

BufferPage* buffer_pool_get(BufferPool* pool, uint32_t page_id) {
    pthread_mutex_lock(&pool->pool_latch);

    // Check if page is in pool
    BufferPage* page = hash_table_get(pool->page_table, page_id);

    if (page) {
        pool->hits++;

        // Move to young list if accessed again from old list
        if (!page->in_young_list) {
            // Only move if it's been in old list for a while
            // (prevents one-time access from polluting young list)
            if (should_promote_to_young(page)) {
                move_to_young_head(pool, page);
            }
        } else {
            // Already in young, move to head
            move_to_young_head(pool, page);
        }

        page->pin_count++;
        pthread_mutex_unlock(&pool->pool_latch);
        return page;
    }

    pool->misses++;

    // Need to load from disk - find victim page
    BufferPage* victim = find_victim(pool);
    if (victim->is_dirty) {
        flush_page(victim);  // Write to disk first
    }

    // Load new page
    read_page_from_disk(victim, page_id);
    victim->page_id = page_id;
    victim->is_dirty = false;
    victim->pin_count = 1;

    // Insert at OLD list head (not young!)
    // This is the midpoint insertion strategy
    insert_at_old_head(pool, victim);

    hash_table_put(pool->page_table, page_id, victim);

    pthread_mutex_unlock(&pool->pool_latch);
    return victim;
}

Why midpoint insertion matters:

  • A full table scan would evict all hot pages in naive LRU
  • With midpoint insertion, scan pages enter the OLD list
  • They get evicted before hot pages in the YOUNG list
  • Hot working set is protected from one-time scans
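
The same policy in a few lines of Python (a language-agnostic sketch, not InnoDB's actual list management; the 3/8 old-list split and the "promote on second touch" rule are simplifications):

from collections import OrderedDict

class MidpointLRU:
    """Two-segment LRU: new pages enter the old list; a second access promotes them."""
    def __init__(self, capacity, old_fraction=3 / 8):
        self.old_cap = int(capacity * old_fraction)
        self.young_cap = capacity - self.old_cap
        self.young = OrderedDict()   # page_id -> data, most recently used last
        self.old = OrderedDict()

    def get(self, page_id):
        if page_id in self.young:
            self.young.move_to_end(page_id)          # refresh position
            return True                              # hit
        if page_id in self.old:
            data = self.old.pop(page_id)             # second touch: promote
            self._insert(self.young, page_id, data, self.young_cap)
            return True                              # hit
        # Miss: "read from disk" and insert at the midpoint (head of the old list)
        self._insert(self.old, page_id, object(), self.old_cap)
        return False

    @staticmethod
    def _insert(segment, page_id, data, cap):
        segment[page_id] = data
        if len(segment) > cap:
            segment.popitem(last=False)              # evict least recently used

pool = MidpointLRU(capacity=800)
hot = list(range(200))
for _ in range(3):                                   # warm up the hot working set
    for p in hot:
        pool.get(p)
for p in range(10_000, 60_000):                      # one-time full table scan
    pool.get(p)
hits = sum(pool.get(p) for p in hot)
print(f"hot pages still cached after the scan: {hits}/{len(hot)}")
# The scan only churned the old list; with a single-list LRU it would have
# evicted the hot pages too.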

Learning milestones:

  1. You implement basic LRU → You understand page replacement
  2. You add midpoint insertion → You understand InnoDB’s optimization
  3. You handle dirty pages → You understand write-back caching
  4. You measure hit ratios → You can tune buffer pool size

Project 6: WAL and Crash Recovery

  • File: LEARN_MYSQL_DEEP_DIVE.md
  • Main Programming Language: C
  • Alternative Programming Languages: Rust, Go
  • Coolness Level: Level 5: Pure Magic (Super Cool)
  • Business Potential: 1. The “Resume Gold”
  • Difficulty: Level 4: Expert
  • Knowledge Area: Durability / Crash Recovery / Logging
  • Software or Tool: Custom implementation
  • Main Book: “Database Internals” by Alex Petrov

What you’ll build: A Write-Ahead Log (WAL) system with redo logging that ensures durability, handles crash recovery using ARIES-style recovery, and demonstrates how InnoDB survives crashes.

Why it teaches MySQL: The redo log is how InnoDB achieves durability without flushing every page to disk on commit. Understanding WAL explains innodb_flush_log_at_trx_commit, crash recovery time, and why databases survive power failures.

Core challenges you’ll face:

  • Write-ahead logging → maps to redo log before data
  • Log sequence numbers (LSN) → maps to ordering operations
  • Checkpointing → maps to limiting recovery time
  • ARIES recovery → maps to analysis, redo, undo phases

Key Concepts:

  • InnoDB Redo Log: Redo Log
  • ARIES: “ARIES: A Transaction Recovery Method” (classic paper)
  • WAL: “Database Internals” by Alex Petrov, Chapter 7

Difficulty: Expert. Time estimate: 3-4 weeks. Prerequisites: Projects 1-5 completed. Strong systems programming.

Real world outcome:

WAL and Recovery Simulator

Starting database...
  Redo log size: 48 MB (2 files × 24 MB)
  Checkpoint LSN: 1000000

Running transactions...
  T1: UPDATE accounts SET balance = 500 WHERE id = 1
      [LSN 1000100: REDO record written to log]
      [Page modified in buffer pool, not yet on disk]

  T1: COMMIT
      [LSN 1000150: COMMIT record written to log]
      [Log flushed to disk - durability guaranteed!]
      [Data page still only in memory]

Simulating crash...
  [Power failure! Data pages lost from memory]
  [Redo log persisted on disk]

Recovery starting...

Phase 1: Analysis
  Scanning log from checkpoint LSN 1000000...
  Found 5 transactions:
    T1: COMMITTED at LSN 1000150
    T2: COMMITTED at LSN 1000300
    T3: ACTIVE (no commit record) - will undo
  Dirty pages to recover: 12

Phase 2: Redo
  Replaying committed changes...
  [LSN 1000100: Redo UPDATE on page 42]
  [LSN 1000200: Redo INSERT on page 87]
  ... 50 operations replayed

Phase 3: Undo
  Rolling back uncommitted transactions...
  [T3: Undo INSERT on page 123]
  [T3: Undo UPDATE on page 45]

Recovery complete in 0.3 seconds!
  All committed transactions preserved ✓
  All uncommitted transactions rolled back ✓

Implementation Hints:

Log record structure:

typedef enum {
    LOG_UPDATE,
    LOG_INSERT,
    LOG_DELETE,
    LOG_COMMIT,
    LOG_ABORT,
    LOG_CHECKPOINT
} LogRecordType;

typedef struct {
    uint64_t lsn;           // Log Sequence Number
    uint32_t txn_id;
    LogRecordType type;
    uint32_t page_id;
    uint16_t offset;        // Offset within page
    uint16_t length;
    uint8_t before_image[MAX_DATA];  // For undo
    uint8_t after_image[MAX_DATA];   // For redo
} LogRecord;

typedef struct {
    int fd;                 // Log file descriptor
    uint64_t current_lsn;
    uint64_t flushed_lsn;
    uint8_t buffer[LOG_BUFFER_SIZE];
    size_t buffer_offset;
    pthread_mutex_t latch;
} WAL;

Write-ahead logging protocol:

void wal_log_update(WAL* wal, Transaction* txn,
                    Page* page, int offset,
                    void* old_data, void* new_data, int len) {
    LogRecord record = {
        .lsn = ++wal->current_lsn,
        .txn_id = txn->id,
        .type = LOG_UPDATE,
        .page_id = page->id,
        .offset = offset,
        .length = len
    };
    memcpy(record.before_image, old_data, len);
    memcpy(record.after_image, new_data, len);

    // Write to log buffer
    append_to_buffer(wal, &record);

    // Update page's LSN
    page->page_lsn = record.lsn;
}

void wal_commit(WAL* wal, Transaction* txn) {
    LogRecord record = {
        .lsn = ++wal->current_lsn,
        .txn_id = txn->id,
        .type = LOG_COMMIT
    };
    append_to_buffer(wal, &record);

    // CRITICAL: Flush log before returning to client
    // This is the WAL guarantee!
    wal_flush(wal);
}

ARIES-style recovery:

void recover(WAL* wal, BufferPool* pool) {
    // Phase 1: Analysis
    uint64_t checkpoint_lsn = read_checkpoint(wal);
    HashSet* dirty_pages = new_hashset();
    HashSet* active_txns = new_hashset();

    LogRecord* record;
    wal_seek(wal, checkpoint_lsn);
    while ((record = wal_read_next(wal))) {
        if (record->type == LOG_UPDATE || record->type == LOG_INSERT) {
            hashset_add(dirty_pages, record->page_id);
            hashset_add(active_txns, record->txn_id);
        } else if (record->type == LOG_COMMIT) {
            hashset_remove(active_txns, record->txn_id);
        }
    }

    // Phase 2: Redo - bring all pages to crash state
    wal_seek(wal, checkpoint_lsn);
    while ((record = wal_read_next(wal))) {
        if (record->type == LOG_UPDATE) {
            Page* page = buffer_pool_get(pool, record->page_id);

            // Only redo if page is older than log record
            if (page->page_lsn < record->lsn) {
                apply_redo(page, record);
                page->page_lsn = record->lsn;
            }
        }
    }

    // Phase 3: Undo - rollback uncommitted transactions
    // Read log backwards
    wal_seek_end(wal);
    while ((record = wal_read_prev(wal))) {
        if (hashset_contains(active_txns, record->txn_id)) {
            if (record->type == LOG_UPDATE) {
                Page* page = buffer_pool_get(pool, record->page_id);
                apply_undo(page, record);  // Use before_image
            }
        }
    }
}

Learning milestones:

  1. You implement WAL protocol → You understand durability
  2. You survive simulated crashes → You understand recovery
  3. You implement checkpointing → You understand recovery time limits
  4. You compare flush strategies → You understand innodb_flush_log_at_trx_commit
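
For milestone 4, what differs between the settings is what happens at commit time versus on the roughly once-per-second background flush. A simplified sketch of the documented semantics of innodb_flush_log_at_trx_commit (1 = write and fsync on every commit, 2 = write on commit but fsync on the timer, 0 = both deferred to the timer), not InnoDB's actual code:

import os

def on_commit(log_fd, log_buffer: bytearray, policy: int):
    """What the log subsystem does when a transaction commits."""
    if policy in (1, 2):
        os.write(log_fd, bytes(log_buffer))   # hand the buffer to the OS page cache
        log_buffer.clear()
    if policy == 1:
        os.fsync(log_fd)                      # force to stable storage: survives power loss

def on_background_tick(log_fd, log_buffer: bytearray):
    """Runs roughly once per second regardless of policy."""
    if log_buffer:
        os.write(log_fd, bytes(log_buffer))
        log_buffer.clear()
    os.fsync(log_fd)

# policy 1: no committed work lost on crash; policy 2: up to ~1s lost if the OS
# or machine dies; policy 0: up to ~1s lost even if only mysqld dies.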

Project 7: MySQL Protocol Implementation

  • File: LEARN_MYSQL_DEEP_DIVE.md
  • Main Programming Language: Python
  • Alternative Programming Languages: Go, Rust, C
  • Coolness Level: Level 3: Genuinely Clever
  • Business Potential: 3. The “Service & Support” Model
  • Difficulty: Level 2: Intermediate
  • Knowledge Area: Network Protocols / Binary Parsing
  • Software or Tool: Custom implementation + Wireshark
  • Main Book: MySQL Internals Documentation

What you’ll build: A MySQL protocol client that connects to MySQL servers using the wire protocol, performs the handshake, authenticates, sends queries, and parses results—all without using libmysqlclient.

Why it teaches MySQL: The protocol is how applications talk to MySQL. Understanding it explains connection pooling, prepared statements, SSL/TLS negotiation, and why some drivers are faster than others.

Core challenges you’ll face:

  • Packet framing → maps to length-prefixed packets
  • Handshake and auth → maps to capability negotiation
  • Query protocol → maps to COM_QUERY and result sets
  • Prepared statements → maps to COM_STMT_PREPARE/EXECUTE

Key Concepts:

Difficulty: Intermediate. Time estimate: 1-2 weeks. Prerequisites: Projects 1-3 completed. Network programming basics.

Real world outcome:

# Your MySQL protocol client:
from mysql_protocol import MySQLClient

client = MySQLClient()
client.connect('localhost', 3306, 'root', 'password', 'testdb')

# Behind the scenes:
#   → Server: Initial Handshake (protocol version, server version, salt)
#   ← Client: Handshake Response (username, auth, database, capabilities)
#   → Server: OK or ERR packet

result = client.query("SELECT id, name FROM users WHERE active = 1")
# Behind the scenes:
#   ← Client: COM_QUERY packet with SQL string
#   → Server: Column count, column definitions, rows, EOF

for row in result:
    print(f"User {row['id']}: {row['name']}")

# Prepared statement:
stmt = client.prepare("SELECT * FROM orders WHERE user_id = ?")
result = client.execute(stmt, [42])
# Behind the scenes:
#   ← Client: COM_STMT_PREPARE
#   → Server: Statement ID, parameter count, column definitions
#   ← Client: COM_STMT_EXECUTE with binary parameters
#   → Server: Binary result set

Implementation Hints:

Packet structure:

import struct
import socket

class MySQLPacket:
    def __init__(self):
        self.sequence_id = 0

    def _recv_exact(self, sock, n):
        """recv() may return fewer bytes than asked for; loop until we have n."""
        data = b''
        while len(data) < n:
            chunk = sock.recv(n - len(data))
            if not chunk:
                raise ConnectionError("server closed the connection")
            data += chunk
        return data

    def read_packet(self, sock):
        """Read one MySQL packet"""
        # 4-byte header: 3 bytes length (little-endian) + 1 byte sequence ID
        header = self._recv_exact(sock, 4)
        length = struct.unpack('<I', header[:3] + b'\x00')[0]
        seq_id = header[3]

        payload = self._recv_exact(sock, length)
        return seq_id, payload

    def write_packet(self, sock, payload):
        """Write one MySQL packet"""
        length = len(payload)
        header = struct.pack('<I', length)[:3] + bytes([self.sequence_id])
        sock.sendall(header + payload)
        self.sequence_id += 1

class MySQLClient:
    def __init__(self):
        self.sock = None
        self.packet = MySQLPacket()

    def connect(self, host, port, user, password, database):
        self.sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        self.sock.connect((host, port))

        # Read initial handshake
        seq_id, payload = self.packet.read_packet(self.sock)
        handshake = self.parse_handshake(payload)

        # Send handshake response
        response = self.build_handshake_response(
            handshake, user, password, database
        )
        self.packet.write_packet(self.sock, response)

        # Read OK/ERR
        seq_id, payload = self.packet.read_packet(self.sock)
        if payload[0] == 0x00:  # OK packet
            print("Connected!")
        elif payload[0] == 0xff:  # ERR packet
            raise Exception(self.parse_error(payload))

    def parse_handshake(self, payload):
        """Parse initial handshake packet"""
        protocol_version = payload[0]
        # Null-terminated server version string
        nul_pos = payload.index(0, 1)
        server_version = payload[1:nul_pos].decode()
        # Connection ID (4 bytes)
        conn_id = struct.unpack('<I', payload[nul_pos+1:nul_pos+5])[0]
        # Auth plugin data (salt) - first 8 bytes
        salt_part1 = payload[nul_pos+5:nul_pos+13]
        # ... more fields

        return {
            'protocol_version': protocol_version,
            'server_version': server_version,
            'connection_id': conn_id,
            'salt': salt_part1,
        }

Query execution:

def query(self, sql):
    """Execute a query and return results"""
    # Send COM_QUERY
    payload = bytes([0x03]) + sql.encode('utf-8')  # 0x03 = COM_QUERY
    self.packet.sequence_id = 0
    self.packet.write_packet(self.sock, payload)

    # Read column count
    seq_id, payload = self.packet.read_packet(self.sock)
    if payload[0] == 0xff:  # Error
        raise Exception(self.parse_error(payload))

    column_count = self.read_length_encoded_int(payload)

    # Read column definitions
    columns = []
    for _ in range(column_count):
        seq_id, payload = self.packet.read_packet(self.sock)
        columns.append(self.parse_column_definition(payload))

    # Read EOF packet (deprecated in newer protocol)
    seq_id, payload = self.packet.read_packet(self.sock)

    # Read rows until EOF
    rows = []
    while True:
        seq_id, payload = self.packet.read_packet(self.sock)
        if payload[0] == 0xfe:  # EOF
            break
        rows.append(self.parse_row(payload, columns))

    return rows
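
The read_length_encoded_int helper used above is left undefined; here is a sketch of MySQL's length-encoded integer format (a first byte below 0xfb is the value itself; 0xfc, 0xfd and 0xfe prefix 2-, 3- and 8-byte little-endian integers):

def read_length_encoded_int(self, payload, offset=0):
    """Decode a MySQL length-encoded integer (value only; a full parser
    would also track how many bytes were consumed)."""
    first = payload[offset]
    if first < 0xfb:
        return first                                                   # 1-byte value
    if first == 0xfc:
        return int.from_bytes(payload[offset+1:offset+3], 'little')    # 2 bytes follow
    if first == 0xfd:
        return int.from_bytes(payload[offset+1:offset+4], 'little')    # 3 bytes follow
    if first == 0xfe:
        return int.from_bytes(payload[offset+1:offset+9], 'little')    # 8 bytes follow
    raise ValueError("0xfb (NULL) and 0xff (ERR) are not length-encoded integers")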

Learning milestones:

  1. You complete handshake → You understand connection establishment
  2. You execute queries → You understand the text protocol
  3. You use prepared statements → You understand the binary protocol
  4. You handle errors → You understand error packets

Project 8: Binlog Replication Subscriber

  • File: LEARN_MYSQL_DEEP_DIVE.md
  • Main Programming Language: Go
  • Alternative Programming Languages: Python, Java
  • Coolness Level: Level 4: Hardcore Tech Flex
  • Business Potential: 4. The “Open Core” Infrastructure
  • Difficulty: Level 3: Advanced
  • Knowledge Area: Replication / CDC / Event Streaming
  • Software or Tool: MySQL Binlog
  • Main Book: “High Performance MySQL” by Silvia Botros

What you’ll build: A binlog replication subscriber that connects to MySQL as a replica, receives binlog events, parses them, and streams database changes to external systems (like Elasticsearch, Kafka, or a data warehouse).

Why it teaches MySQL: Replication is how MySQL achieves high availability and read scaling. Understanding binlog format and replication protocol explains CDC (Change Data Capture), read replicas, and database migration tools.

Core challenges you’ll face:

  • Connecting as replica → maps to replication protocol handshake
  • Parsing binlog events → maps to ROW vs STATEMENT format
  • Handling DDL → maps to schema changes
  • Position tracking → maps to GTID vs file/position

Key Concepts:

Difficulty: Advanced. Time estimate: 2-3 weeks. Prerequisites: Projects 1-7 completed. Understanding of replication concepts.

Real world outcome:

MySQL Binlog Subscriber

Connected to MySQL 8.0.35 at localhost:3306
Binlog format: ROW
Starting from: mysql-bin.000003:154

Streaming events...

[2024-01-15 10:00:01] GTID: 3E11FA47-71CA-11E1-9E33-C80AA9429562:123
[2024-01-15 10:00:01] TABLE_MAP: test.users (table_id: 42)
[2024-01-15 10:00:01] WRITE_ROWS:
  INSERT INTO users: {id: 1001, name: 'Alice', email: 'alice@example.com'}
  → Forwarded to Elasticsearch index 'users'

[2024-01-15 10:00:02] GTID: 3E11FA47-71CA-11E1-9E33-C80AA9429562:124
[2024-01-15 10:00:02] TABLE_MAP: test.orders (table_id: 43)
[2024-01-15 10:00:02] UPDATE_ROWS:
  UPDATE orders SET status='shipped' WHERE id=5000
  Before: {id: 5000, status: 'pending', amount: 99.99}
  After:  {id: 5000, status: 'shipped', amount: 99.99}
  → Published to Kafka topic 'order-events'

Events processed: 1,234
Current position: mysql-bin.000003:98765
Lag: 0.5 seconds

Implementation Hints:

Connecting as a replica:

package main

import (
    "context"
    "fmt"

    "github.com/go-mysql-org/go-mysql/mysql"
    "github.com/go-mysql-org/go-mysql/replication"
)

func main() {
    cfg := replication.BinlogSyncerConfig{
        ServerID: 100,  // Unique replica ID
        Flavor:   "mysql",
        Host:     "localhost",
        Port:     3306,
        User:     "replication_user",
        Password: "password",
    }

    syncer := replication.NewBinlogSyncer(cfg)

    // Start from a specific position
    streamer, _ := syncer.StartSync(mysql.Position{
        Name: "mysql-bin.000003",
        Pos:  154,
    })

    // Or use GTID
    // gtidSet, _ := mysql.ParseGTIDSet("mysql",
    //     "3E11FA47-71CA-11E1-9E33-C80AA9429562:1-123")
    // streamer, _ := syncer.StartSyncGTID(gtidSet)

    for {
        event, _ := streamer.GetEvent(context.Background())
        handleEvent(event)
    }
}

func handleEvent(event *replication.BinlogEvent) {
    switch e := event.Event.(type) {
    case *replication.RotateEvent:
        fmt.Printf("Rotate to: %s\n", string(e.NextLogName))

    case *replication.TableMapEvent:
        fmt.Printf("Table: %s.%s (id: %d)\n",
            e.Schema, e.Table, e.TableID)

    case *replication.RowsEvent:
        for _, row := range e.Rows {
            switch event.Header.EventType {
            case replication.WRITE_ROWS_EVENTv2:
                handleInsert(row)
            case replication.UPDATE_ROWS_EVENTv2:
                // row contains [before, after] pairs
                handleUpdate(row)
            case replication.DELETE_ROWS_EVENTv2:
                handleDelete(row)
            }
        }

    case *replication.QueryEvent:
        // DDL statements
        fmt.Printf("DDL: %s\n", string(e.Query))
    }
}

Implementing from scratch (understanding the protocol):

# Binlog event header (common to all events)
class BinlogEventHeader:
    def parse(self, data):
        self.timestamp = struct.unpack('<I', data[0:4])[0]
        self.type_code = data[4]
        self.server_id = struct.unpack('<I', data[5:9])[0]
        self.event_length = struct.unpack('<I', data[9:13])[0]
        self.next_position = struct.unpack('<I', data[13:17])[0]
        self.flags = struct.unpack('<H', data[17:19])[0]

# Row event types
WRITE_ROWS_EVENT = 30
UPDATE_ROWS_EVENT = 31
DELETE_ROWS_EVENT = 32

class RowsEvent:
    def parse(self, data, table_map):
        self.table_id = struct.unpack('<Q', data[0:6] + b'\x00\x00')[0]
        # ... parse column bitmap and row data
        # Row format depends on table schema from TABLE_MAP event
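
A quick sanity check for the header parser (the field values are made up for the test; the 19-byte v4 header layout is timestamp, type, server_id, event_length, next_position, flags):

import struct

def pack_header(timestamp, type_code, server_id, event_length, next_position, flags):
    """Build a 19-byte binlog event header (v4 format) for testing the parser."""
    return struct.pack('<IBIIIH', timestamp, type_code, server_id,
                       event_length, next_position, flags)

raw = pack_header(1705312800, WRITE_ROWS_EVENT, 100, 58, 1234, 0)
hdr = BinlogEventHeader()
hdr.parse(raw)
assert (hdr.type_code, hdr.server_id, hdr.next_position) == (WRITE_ROWS_EVENT, 100, 1234)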

Learning milestones:

  1. You connect as replica → You understand replication protocol
  2. You parse row events → You understand binlog format
  3. You handle schema changes → You understand DDL replication
  4. You build a CDC pipeline → You can replicate to external systems

Project 9: Lock and Deadlock Visualizer

  • File: LEARN_MYSQL_DEEP_DIVE.md
  • Main Programming Language: Python
  • Alternative Programming Languages: Go, JavaScript
  • Coolness Level: Level 3: Genuinely Clever
  • Business Potential: 2. The “Micro-SaaS / Pro Tool”
  • Difficulty: Level 2: Intermediate
  • Knowledge Area: Locking / Deadlocks / Concurrency
  • Software or Tool: MySQL + Custom visualization
  • Main Book: “High Performance MySQL” by Silvia Botros

What you’ll build: A tool that monitors InnoDB locks in real-time, visualizes lock wait graphs, detects deadlock cycles, and explains why specific queries are blocking.

Why it teaches MySQL: Understanding InnoDB’s locking behavior is essential for debugging production issues. This project teaches you about row locks, gap locks, next-key locks, and why deadlocks happen.

Core challenges you’ll face:

  • Querying lock information → maps to performance_schema and innodb_lock_waits
  • Building wait-for graphs → maps to detecting cycles
  • Understanding lock types → maps to shared, exclusive, gap, next-key
  • Explaining lock conflicts → maps to which indexes cause which locks

Key Concepts:

Difficulty: Intermediate. Time estimate: 1-2 weeks. Prerequisites: Projects 1-4 completed. Understanding of transaction isolation.

Real world outcome:

InnoDB Lock Visualizer

Current Locks:
┌─────────────────────────────────────────────────────────────────┐
│ Transaction │ Lock Type  │ Table    │ Index   │ Lock Data      │
├─────────────┼────────────┼──────────┼─────────┼────────────────┤
│ 12345       │ X (excl.)  │ orders   │ PRIMARY │ id=100         │
│ 12346       │ S (share)  │ orders   │ PRIMARY │ id=100 WAITING │
│ 12347       │ X,GAP      │ orders   │ idx_user│ user_id=42,GAP │
└─────────────────────────────────────────────────────────────────┘

Wait-For Graph:
  ┌──────────────┐
  │ TXN 12346    │──waits for──→┌──────────────┐
  │ SELECT...FOR │              │ TXN 12345    │
  │ UPDATE       │              │ UPDATE orders│
  └──────────────┘              └──────────────┘

DEADLOCK DETECTED:
  ┌──────────────┐                ┌──────────────┐
  │ TXN 12350    │──waits for──→│ TXN 12351    │
  │              │              │              │
  │ Has lock on: │              │ Has lock on: │
  │ orders id=1  │              │ users id=5   │
  │              │              │              │
  │ Wants lock:  │              │ Wants lock:  │
  │ users id=5   │◄──waits for──│ orders id=1  │
  └──────────────┘                └──────────────┘

  Victim: TXN 12351 (rolled back)
  Cause: Classic cycle - each transaction holds what the other wants

Implementation Hints:

Querying lock information:

import mysql.connector
from collections import defaultdict

class LockMonitor:
    def __init__(self, connection):
        self.conn = connection

    def get_current_locks(self):
        """Get all current InnoDB locks"""
        cursor = self.conn.cursor(dictionary=True)
        cursor.execute("""
            SELECT
                r.trx_id AS waiting_trx_id,
                r.trx_query AS waiting_query,
                b.trx_id AS blocking_trx_id,
                b.trx_query AS blocking_query,
                l.OBJECT_SCHEMA AS locked_schema,
                l.OBJECT_NAME AS locked_table,
                l.INDEX_NAME AS locked_index,
                l.LOCK_TYPE AS lock_type,
                l.LOCK_MODE AS lock_mode,
                l.LOCK_DATA AS lock_data
            FROM performance_schema.data_lock_waits w
            JOIN performance_schema.data_locks l
                ON w.BLOCKING_ENGINE_LOCK_ID = l.ENGINE_LOCK_ID
            JOIN information_schema.innodb_trx r
                ON w.REQUESTING_ENGINE_TRANSACTION_ID = r.trx_id
            JOIN information_schema.innodb_trx b
                ON w.BLOCKING_ENGINE_TRANSACTION_ID = b.trx_id
        """)
        return cursor.fetchall()

    def build_wait_graph(self, locks):
        """Build a wait-for graph from lock data"""
        graph = defaultdict(set)
        for lock in locks:
            graph[lock['waiting_trx_id']].add(lock['blocking_trx_id'])
        return graph

    def detect_deadlock(self, graph):
        """Find cycles in wait-for graph"""
        visited = set()
        path = []

        def dfs(node):
            if node in path:
                # Found cycle!
                cycle_start = path.index(node)
                return path[cycle_start:]
            if node in visited:
                return None

            visited.add(node)
            path.append(node)

            for neighbor in graph[node]:
                result = dfs(neighbor)
                if result:
                    return result

            path.pop()
            return None

        for node in graph:
            result = dfs(node)
            if result:
                return result

        return None
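
Wiring it up (a sketch; the connection parameters are placeholders, and the account needs the PROCESS privilege plus read access to performance_schema):

conn = mysql.connector.connect(host='localhost', user='monitor',
                               password='secret', database='test')
monitor = LockMonitor(conn)

locks = monitor.get_current_locks()
for lock in locks:
    print(f"txn {lock['waiting_trx_id']} waits on txn {lock['blocking_trx_id']} "
          f"({lock['locked_table']}.{lock['locked_index']}: {lock['lock_data']})")

cycle = monitor.detect_deadlock(monitor.build_wait_graph(locks))
if cycle:
    print("Deadlock cycle:", " -> ".join(str(t) for t in cycle))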

Understanding gap locks:

def explain_gap_lock(self, lock_info):
    """Explain why a gap lock was taken"""
    if 'GAP' in lock_info['lock_mode']:
        return f"""
Gap Lock Explanation:
  - A gap lock prevents insertions in a range
  - This lock is on index: {lock_info['locked_index']}
  - Gap is before: {lock_info['lock_data']}

Why it happens:
  - REPEATABLE READ isolation level needs gap locks to prevent phantom reads
  - A query like: SELECT * FROM t WHERE x BETWEEN 10 AND 20 FOR UPDATE
    locks all rows AND the gaps between them

How to avoid:
  - Use READ COMMITTED isolation (no gap locks)
  - Use unique lookups instead of range scans
  - Add appropriate indexes to reduce lock scope
"""

Learning milestones:

  1. You query lock tables → You understand where to find lock info
  2. You build wait-for graphs → You understand lock dependencies
  3. You detect deadlocks → You understand the deadlock algorithm
  4. You explain gap locks → You understand REPEATABLE READ locking

Project 10: MySQL vs PostgreSQL Comparison Lab

  • File: LEARN_MYSQL_DEEP_DIVE.md
  • Main Programming Language: SQL + Python
  • Alternative Programming Languages: Go, Bash
  • Coolness Level: Level 3: Genuinely Clever
  • Business Potential: 2. The “Micro-SaaS / Pro Tool”
  • Difficulty: Level 2: Intermediate
  • Knowledge Area: Database Comparison / Benchmarking
  • Software or Tool: MySQL, PostgreSQL, SQLite
  • Main Book: “Designing Data-Intensive Applications” by Martin Kleppmann

What you’ll build: A comprehensive comparison lab that runs the same workloads on MySQL, PostgreSQL, and SQLite, measuring performance, consistency behavior, and feature differences.

Why it teaches MySQL: Understanding MySQL’s tradeoffs requires comparing it to alternatives. This project reveals why MySQL makes certain design choices and when to choose which database.

Core challenges you’ll face:

  • Setting up equivalent schemas → maps to data type differences (see the DDL sketch after this list)
  • Measuring performance fairly → maps to benchmark design
  • Testing isolation levels → maps to consistency behavior differences
  • Comparing features → maps to JSON, arrays, full-text, etc.
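
A minimal sketch of what "equivalent schemas" means in practice, using a hypothetical accounts table (the exact type mapping is the point of the exercise):

-- MySQL
CREATE TABLE accounts (
    id       BIGINT AUTO_INCREMENT PRIMARY KEY,
    active   BOOLEAN NOT NULL DEFAULT TRUE,     -- stored as TINYINT(1) under the hood
    balance  DECIMAL(12,2) NOT NULL,
    metadata JSON,
    created  DATETIME(6) DEFAULT CURRENT_TIMESTAMP(6)
);

-- PostgreSQL
CREATE TABLE accounts (
    id       BIGINT GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    active   BOOLEAN NOT NULL DEFAULT TRUE,
    balance  NUMERIC(12,2) NOT NULL,
    metadata JSONB,                             -- binary JSON, GIN-indexable
    created  TIMESTAMPTZ DEFAULT now()
);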

Key Concepts:

Difficulty: Intermediate Time estimate: 2 weeks Prerequisites: Projects 1-9 completed. Basic PostgreSQL knowledge.

Real world outcome:

Database Comparison Report

=== Performance Benchmarks ===

OLTP Workload (TPC-C like):
┌──────────────────┬─────────────┬─────────────┬─────────────┐
│ Metric           │ MySQL       │ PostgreSQL  │ SQLite      │
├──────────────────┼─────────────┼─────────────┼─────────────┤
│ Transactions/sec │ 12,500      │ 10,800      │ 45,000*     │
│ Read latency p99 │ 2.1 ms      │ 2.8 ms      │ 0.1 ms*     │
│ Write latency p99│ 5.2 ms      │ 4.8 ms      │ 0.5 ms*     │
└──────────────────┴─────────────┴─────────────┴─────────────┘
* SQLite is embedded, so no network latency

Complex Query Workload (TPC-H like):
┌──────────────────┬─────────────┬─────────────┐
│ Query            │ MySQL       │ PostgreSQL  │
├──────────────────┼─────────────┼─────────────┤
│ Aggregation      │ 1.2 sec     │ 0.8 sec     │
│ Join (5 tables)  │ 3.5 sec     │ 2.1 sec     │
│ Window functions │ 2.8 sec     │ 1.5 sec     │
│ CTE recursive    │ 4.2 sec     │ 2.8 sec     │
└──────────────────┴─────────────┴─────────────┘
→ PostgreSQL wins on complex analytical queries

=== Isolation Level Behavior ===

Phantom Read Test (REPEATABLE READ):
┌──────────────────┬─────────────┬─────────────┐
│                  │ MySQL       │ PostgreSQL  │
├──────────────────┼─────────────┼─────────────┤
│ Phantoms blocked │ Yes (gap    │ Yes (MVCC   │
│                  │ locking)    │ snapshot)   │
│ Method           │ Pessimistic │ Optimistic  │
│ Blocking?        │ Yes         │ No*         │
└──────────────────┴─────────────┴─────────────┘
* PostgreSQL raises a serialization failure on conflict instead of blocking

=== Feature Comparison ===

JSON Operations:
  MySQL:      JSON_EXTRACT(col, '$.field') - functional indexes available
  PostgreSQL: col->>'field' (JSONB) - GIN indexes, more operators

Arrays:
  MySQL:      Not supported (use JSON arrays)
  PostgreSQL: Native ARRAY type with operators

Full-Text Search:
  MySQL:      FULLTEXT indexes (basic)
  PostgreSQL: ts_vector, ts_query, GIN indexes (advanced)
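
The same lookups side by side, as a sketch assuming a docs table with a JSON doc column and an articles table with title/body text columns:

-- JSON: pull a field out of a document
SELECT doc->>'$.name' FROM docs;     -- MySQL (JSON path syntax)
SELECT doc->>'name'   FROM docs;     -- PostgreSQL (JSONB key syntax)

-- Full-text search: find articles mentioning "replication"
ALTER TABLE articles ADD FULLTEXT INDEX ft_body (title, body);            -- MySQL
SELECT * FROM articles
WHERE MATCH(title, body) AGAINST ('replication' IN NATURAL LANGUAGE MODE);

CREATE INDEX ft_body ON articles
USING GIN (to_tsvector('english', title || ' ' || body));                 -- PostgreSQL
SELECT * FROM articles
WHERE to_tsvector('english', title || ' ' || body) @@ to_tsquery('english', 'replication');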

Implementation Hints:

Benchmark framework:

import random
import time
import mysql.connector
import psycopg2
import sqlite3

class DatabaseBenchmark:
    def __init__(self, db_type, config):
        self.db_type = db_type
        # sqlite3 uses '?' placeholders; mysql-connector and psycopg2 use '%s'
        self.placeholder = '?' if db_type == 'sqlite' else '%s'
        if db_type == 'mysql':
            self.conn = mysql.connector.connect(**config)
        elif db_type == 'postgres':
            self.conn = psycopg2.connect(**config)
        elif db_type == 'sqlite':
            self.conn = sqlite3.connect(config['database'])
        else:
            raise ValueError(f"Unsupported db_type: {db_type}")

    def run_query(self, sql, params=None):
        cursor = self.conn.cursor()
        start = time.perf_counter()
        cursor.execute(sql, params or ())
        # Non-SELECT statements have no result set (cursor.description is None)
        result = cursor.fetchall() if cursor.description else None
        elapsed = time.perf_counter() - start
        return result, elapsed

    def benchmark_oltp(self, iterations=10000):
        """Simple OLTP benchmark: alternating point reads and single-row updates"""
        results = {'reads': [], 'writes': []}

        for _ in range(iterations):
            # Read a random row by primary key
            _, elapsed = self.run_query(
                f"SELECT * FROM users WHERE id = {self.placeholder}",
                (random.randint(1, 100000),)
            )
            results['reads'].append(elapsed)

            # Update a random account row
            _, elapsed = self.run_query(
                f"UPDATE accounts SET balance = balance + 1 WHERE id = {self.placeholder}",
                (random.randint(1, 100000),)
            )
            results['writes'].append(elapsed)
            self.conn.commit()

        return {
            'read_avg': sum(results['reads']) / len(results['reads']),
            'read_p99': sorted(results['reads'])[int(len(results['reads']) * 0.99)],
            'write_avg': sum(results['writes']) / len(results['writes']),
            'write_p99': sorted(results['writes'])[int(len(results['writes']) * 0.99)],
        }
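
A sketch of driving the harness across all three engines (connection parameters are placeholders):

if __name__ == '__main__':
    configs = {
        'mysql':    {'host': '127.0.0.1', 'user': 'bench', 'password': 'bench', 'database': 'benchdb'},
        'postgres': {'host': '127.0.0.1', 'user': 'bench', 'password': 'bench', 'dbname': 'benchdb'},
        'sqlite':   {'database': '/tmp/bench.db'},
    }

    for db_type, config in configs.items():
        bench = DatabaseBenchmark(db_type, config)
        stats = bench.benchmark_oltp(iterations=1000)
        print(f"{db_type:>8}: read p99 {stats['read_p99'] * 1000:.2f} ms, "
              f"write p99 {stats['write_p99'] * 1000:.2f} ms")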

Isolation level comparison:

def test_phantom_reads(db):
    """Test if phantom reads are possible (db/db2 are thin connection wrappers)"""
    # Transaction 1: set the isolation level BEFORE starting the transaction,
    # then begin and read (MySQL rejects SET TRANSACTION inside an open transaction)
    db.execute("SET TRANSACTION ISOLATION LEVEL REPEATABLE READ")
    db.execute("BEGIN")
    result1 = db.query("SELECT COUNT(*) FROM test WHERE value BETWEEN 1 AND 100")

    # Transaction 2: insert a row into the scanned range from another connection
    db2 = get_connection()
    db2.execute("INSERT INTO test (value) VALUES (50)")
    db2.commit()

    # Transaction 1: read the same range again
    result2 = db.query("SELECT COUNT(*) FROM test WHERE value BETWEEN 1 AND 100")
    db.commit()

    return {
        'phantom_possible': result1[0] != result2[0],
        'before': result1[0],
        'after': result2[0]
    }
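
What to expect: with plain (non-locking) SELECTs, both MySQL and PostgreSQL should report phantom_possible = False at REPEATABLE READ, because both read from a consistent snapshot established by the first read. MySQL's gap locks only come into play for locking reads and writes; if you change the first query to SELECT ... FOR UPDATE, the INSERT in the second connection will block until the first transaction commits (so run the two sessions from separate threads or terminals to observe it), which is the pessimistic behavior the report above contrasts with PostgreSQL's optimistic approach.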

Learning milestones:

  1. You run equivalent benchmarks → You understand performance tradeoffs
  2. You test isolation differences → You understand consistency models
  3. You compare features → You know when to choose which database
  4. You document tradeoffs → You can make informed architecture decisions

Project 11: Query Profiler and Slow Query Analyzer

  • File: LEARN_MYSQL_DEEP_DIVE.md
  • Main Programming Language: Python
  • Alternative Programming Languages: Go, JavaScript
  • Coolness Level: Level 3: Genuinely Clever
  • Business Potential: 3. The “Service & Support” Model
  • Difficulty: Level 2: Intermediate
  • Knowledge Area: Performance Tuning / Profiling / Optimization
  • Software or Tool: MySQL + Performance Schema
  • Main Book: “High Performance MySQL” by Silvia Botros

What you’ll build: A query profiler that analyzes slow query logs, parses EXPLAIN output, identifies missing indexes, and provides actionable optimization recommendations.

Why it teaches MySQL: Query optimization is the most practical MySQL skill. This project teaches you to read EXPLAIN plans, understand query costs, and fix performance problems systematically.
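
Before there is anything to parse, the slow query log has to be switched on; a minimal sketch (the thresholds are illustrative):

-- Log every statement slower than 0.5 s, plus statements that use no index
SET GLOBAL slow_query_log = 'ON';
SET GLOBAL long_query_time = 0.5;
SET GLOBAL log_queries_not_using_indexes = 'ON';

-- Where the log is being written
SHOW VARIABLES LIKE 'slow_query_log_file';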

Core challenges you’ll face:

  • Parsing slow query logs → maps to log format understanding
  • Analyzing EXPLAIN output → maps to understanding access types
  • Index recommendations → maps to which indexes would help
  • Query rewriting → maps to transforming slow queries

Key Concepts:

Difficulty: Intermediate Time estimate: 2 weeks Prerequisites: Projects 1-4 completed. Basic query optimization knowledge.

Real world outcome:

Query Profiler Report

Analyzed: 10,000 queries from slow query log
Time range: 2024-01-01 to 2024-01-15

=== Top 10 Slowest Queries ===

1. SELECT * FROM orders WHERE status = 'pending' AND created_at > '2024-01-01'
   Avg time: 4.5s | Executions: 1,234 | Total time: 5,553s

   EXPLAIN Analysis:
   ┌────────┬─────────────┬──────────────┬─────────┬──────────────┐
   │ table  │ type        │ possible_keys│ key     │ rows examined│
   ├────────┼─────────────┼──────────────┼─────────┼──────────────┤
   │ orders │ ALL         │ idx_status   │ NULL    │ 5,000,000    │
   └────────┴─────────────┴──────────────┴─────────┴──────────────┘

   ⚠️ PROBLEMS DETECTED:
     • Full table scan (type: ALL)
     • Index exists but not used
     • Examining 5M rows to return 1,234

   💡 RECOMMENDATIONS:
     1. Add composite index: CREATE INDEX idx_status_created
        ON orders(status, created_at)
     2. Or rewrite query to use existing index better

   Expected improvement: 4.5s → 0.05s (90x faster)

=== Index Recommendations Summary ===

Tables needing indexes:
  • orders: (status, created_at) - would help 45 queries
  • users: (country, created_at) - would help 23 queries
  • products: (category_id, price) - would help 18 queries

=== Query Patterns ===

Anti-patterns detected:
  • SELECT * usage: 234 queries (retrieve only needed columns)
  • OFFSET pagination: 56 queries (use keyset pagination; see the rewrite sketch below)
  • OR conditions: 89 queries (consider UNION)
  • Functions on indexed columns: 34 queries (can't use index)
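
For instance, the OFFSET anti-pattern flagged above and its keyset rewrite (a sketch assuming an orders table paged by id):

-- OFFSET pagination: MySQL still reads and discards the first 100,000 rows
SELECT * FROM orders ORDER BY id LIMIT 20 OFFSET 100000;

-- Keyset pagination: remember the last id seen and seek directly to it
SELECT * FROM orders WHERE id > 100020 ORDER BY id LIMIT 20;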

Implementation Hints:

Slow query log parser:

import re
from dataclasses import dataclass
from typing import List

@dataclass
class SlowQuery:
    timestamp: str
    user: str
    host: str
    query_time: float
    lock_time: float
    rows_sent: int
    rows_examined: int
    query: str

def parse_slow_log(log_file):
    queries = []
    current = {}

    with open(log_file) as f:
        for line in f:
            if line.startswith('# Time:'):
                if current.get('query'):
                    queries.append(SlowQuery(**current))
                current = {'timestamp': line.split(':', 1)[1].strip()}

            elif line.startswith('# User@Host:'):
                match = re.match(r'# User@Host: (\w+)\[.*\] @\s*(\S*)', line)
                if match:
                    current['user'] = match.group(1)
                    current['host'] = match.group(2)

            elif line.startswith('# Query_time:'):
                match = re.match(
                    r'# Query_time: ([\d.]+)\s+Lock_time: ([\d.]+)\s+'
                    r'Rows_sent: (\d+)\s+Rows_examined: (\d+)', line)
                if match:
                    current['query_time'] = float(match.group(1))
                    current['lock_time'] = float(match.group(2))
                    current['rows_sent'] = int(match.group(3))
                    current['rows_examined'] = int(match.group(4))

            elif not line.startswith('#'):
                current['query'] = current.get('query', '') + line

    # Don't drop the last entry in the file
    if current.get('query'):
        queries.append(SlowQuery(**current))

    return queries
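
With the parsed entries in hand, a rough "top queries by total time" summary might look like this (the fingerprinting regex is a simplistic placeholder; real tools normalize queries more carefully):

import re
from collections import defaultdict

def top_queries(queries, n=10):
    """Group slow-log entries by a crude query fingerprint and rank by total time."""
    totals = defaultdict(lambda: {'count': 0, 'total_time': 0.0})
    for q in queries:
        # Replace numeric and string literals so similar queries group together
        fingerprint = re.sub(r"\b\d+\b|'[^']*'", '?', q.query.strip())
        totals[fingerprint]['count'] += 1
        totals[fingerprint]['total_time'] += q.query_time
    return sorted(totals.items(), key=lambda kv: kv[1]['total_time'], reverse=True)[:n]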

EXPLAIN analyzer:

class ExplainAnalyzer:
    def analyze(self, explain_rows):
        problems = []
        recommendations = []

        for row in explain_rows:
            # Check access type
            access_type = row['type']
            if access_type == 'ALL':
                problems.append({
                    'severity': 'HIGH',
                    'issue': 'Full table scan',
                    'table': row['table'],
                    'rows': row['rows']
                })

            elif access_type == 'index':
                problems.append({
                    'severity': 'MEDIUM',
                    'issue': 'Full index scan',
                    'table': row['table']
                })

            # Check key usage
            if row['possible_keys'] and not row['key']:
                problems.append({
                    'severity': 'HIGH',
                    'issue': 'Index available but not used',
                    'table': row['table'],
                    'available_keys': row['possible_keys']
                })

            # Check for filesort
            if 'filesort' in (row.get('Extra') or '').lower():
                problems.append({
                    'severity': 'MEDIUM',
                    'issue': 'Using filesort (sorting in memory/disk)',
                    'table': row['table']
                })

            # Check rows examined vs sent
            if row['rows'] > 1000 and 'filtered' in row:
                filter_pct = float(row['filtered'])
                if filter_pct < 10:
                    problems.append({
                        'severity': 'HIGH',
                        'issue': f'Examining many rows but returning few ({filter_pct}%)',
                        'table': row['table']
                    })

        return problems, recommendations

    def suggest_index(self, query, schema):
        """Suggest indexes based on query structure"""
        # extract_where_columns / extract_order_columns are left for you to
        # implement (e.g. with a SQL parser such as sqlparse or sqlglot)
        where_columns = self.extract_where_columns(query)
        order_columns = self.extract_order_columns(query)

        # Composite index: WHERE columns first, ORDER BY columns last
        suggested = where_columns + [c for c in order_columns if c not in where_columns]

        return suggested
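
A sketch of feeding live EXPLAIN rows into the analyzer with mysql-connector (connection details are placeholders; with dictionary=True the row keys match the classic tabular EXPLAIN columns):

import mysql.connector

conn = mysql.connector.connect(host='127.0.0.1', user='root', database='shop')
cursor = conn.cursor(dictionary=True)
cursor.execute(
    "EXPLAIN SELECT * FROM orders "
    "WHERE status = 'pending' AND created_at > '2024-01-01'"
)
rows = cursor.fetchall()   # keys: table, type, possible_keys, key, rows, filtered, Extra, ...

analyzer = ExplainAnalyzer()
problems, recommendations = analyzer.analyze(rows)
for p in problems:
    print(f"[{p['severity']}] {p['issue']} on table {p['table']}")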

Learning milestones:

  1. You parse slow query logs → You understand query profiling
  2. You analyze EXPLAIN output → You understand access types
  3. You recommend indexes → You understand index design
  4. You identify anti-patterns → You can prevent problems

Project 12: Build a Mini-MySQL (Capstone)

  • File: LEARN_MYSQL_DEEP_DIVE.md
  • Main Programming Language: C or Rust
  • Alternative Programming Languages: Go, C++
  • Coolness Level: Level 5: Pure Magic (Super Cool)
  • Business Potential: 1. The “Resume Gold”
  • Difficulty: Level 5: Master
  • Knowledge Area: Database Implementation / Systems Programming
  • Software or Tool: Custom implementation
  • Main Book: “Database Internals” by Alex Petrov

What you’ll build: A mini relational database from scratch that integrates all previous projects: B+Tree storage, SQL parsing, query execution, MVCC transactions, WAL recovery, and the MySQL wire protocol.

Why it teaches everything: This capstone project forces you to understand how all database components work together. Building a complete (if simple) database proves deep understanding of MySQL internals.

Core components:

  1. Storage Layer: B+Tree pages on disk (Project 2)
  2. Buffer Pool: Page caching with LRU (Project 5)
  3. SQL Parser: Parse and plan queries (Project 3)
  4. Query Executor: Execute plans (Project 1)
  5. Transaction Manager: MVCC isolation (Project 4)
  6. Recovery: WAL and crash recovery (Project 6)
  7. Wire Protocol: MySQL-compatible interface (Project 7)

Real world outcome:

$ ./minidb --datadir=/tmp/minidb

MiniDB v0.1.0 - A Learning Database
Listening on port 3306...

# Connect with standard MySQL client!
$ mysql -h localhost -u root minidb

mysql> CREATE TABLE users (
    id INT PRIMARY KEY,
    name VARCHAR(100),
    email VARCHAR(100)
);
Query OK, 0 rows affected (0.01 sec)

mysql> INSERT INTO users VALUES (1, 'Alice', 'alice@example.com');
Query OK, 1 row affected (0.00 sec)

mysql> SELECT * FROM users WHERE id = 1;
+----+-------+-------------------+
| id | name  | email             |
+----+-------+-------------------+
|  1 | Alice | alice@example.com |
+----+-------+-------------------+
1 row in set (0.00 sec)

mysql> BEGIN;
mysql> UPDATE users SET name = 'Alice Smith' WHERE id = 1;
mysql> -- In another session, the old value is still visible (MVCC!)
mysql> COMMIT;

# Simulate crash and recovery
$ kill -9 $(pidof minidb)
$ ./minidb --datadir=/tmp/minidb
Recovering from WAL...
  Replaying 15 log records...
  Recovery complete in 0.1s
  All committed transactions preserved.

Time estimate: 2-3 months Prerequisites: All 11 previous projects completed


Project Comparison Table

| Project | Difficulty | Time | Depth of Understanding | Fun Factor |
|---------|------------|------|------------------------|------------|
| 1. SQL Query Executor | Intermediate | 2 weeks | ⭐⭐⭐ | ⭐⭐⭐⭐ |
| 2. B+Tree Implementation | Advanced | 3-4 weeks | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐ |
| 3. Query Optimizer | Advanced | 2-3 weeks | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐ |
| 4. MVCC Transaction Manager | Advanced | 2-3 weeks | ⭐⭐⭐⭐⭐ | ⭐⭐⭐ |
| 5. Buffer Pool Manager | Advanced | 2-3 weeks | ⭐⭐⭐⭐ | ⭐⭐⭐ |
| 6. WAL and Crash Recovery | Expert | 3-4 weeks | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐ |
| 7. MySQL Protocol | Intermediate | 1-2 weeks | ⭐⭐⭐ | ⭐⭐⭐⭐ |
| 8. Binlog Replication | Advanced | 2-3 weeks | ⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ |
| 9. Lock Visualizer | Intermediate | 1-2 weeks | ⭐⭐⭐⭐ | ⭐⭐⭐⭐ |
| 10. DB Comparison Lab | Intermediate | 2 weeks | ⭐⭐⭐⭐ | ⭐⭐⭐⭐ |
| 11. Query Profiler | Intermediate | 2 weeks | ⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ |
| 12. Mini-MySQL | Master | 2-3 months | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ |

For practical MySQL skills (DBA/Backend Developer):

  1. Project 11 (Query Profiler) - Immediately useful
  2. Project 9 (Lock Visualizer) - Debug production issues
  3. Project 3 (Query Optimizer) - Understand EXPLAIN
  4. Project 10 (Comparison Lab) - Make architecture decisions

For deep understanding (Database Engineer):

  1. Project 2 (B+Tree) - Foundation of all indexing
  2. Project 4 (MVCC) - How transactions work
  3. Project 5 (Buffer Pool) - How caching works
  4. Project 6 (WAL) - How durability works

For building database tools (Developer):

  1. Project 7 (Protocol) - Build drivers/proxies
  2. Project 8 (Binlog) - Build CDC pipelines
  3. Project 1 (SQL Executor) - Build query tools

Quick path (2-3 weeks):

  1. Project 1 (1 week) - Understand SQL fundamentals
  2. Project 11 (1 week) - Practical optimization skills
  3. Project 9 (3-4 days) - Debug locking issues

Essential Resources

Books

Official Documentation

Online Resources

Source Code


Summary

| # | Project | Main Programming Language |
|---|---------|---------------------------|
| 1 | SQL Query Executor | Python |
| 2 | B+Tree Index Implementation | C |
| 3 | Query Optimizer Simulator | Python |
| 4 | MVCC Transaction Manager | Python |
| 5 | Buffer Pool Manager | C |
| 6 | WAL and Crash Recovery | C |
| 7 | MySQL Protocol Implementation | Python |
| 8 | Binlog Replication Subscriber | Go |
| 9 | Lock and Deadlock Visualizer | Python |
| 10 | MySQL vs PostgreSQL Comparison Lab | SQL + Python |
| 11 | Query Profiler and Slow Query Analyzer | Python |
| 12 | Build a Mini-MySQL (Capstone) | C or Rust |

Happy querying! Understanding MySQL deeply means understanding how data really works—from SQL syntax down to the bits on disk. 🗄️