LEARN SQLITE DEEP DIVE

Learn SQLite: From Zero to Database Master

Goal: Deeply understand SQLite—the world’s most deployed database—from its elegant file format and B-tree storage engine to query optimization, concurrency control, and building production-grade applications. You’ll learn not just SQL syntax, but WHY SQLite makes the architectural decisions it does, how to squeeze maximum performance from it, and when it’s the right (or wrong) tool for the job.


Why SQLite Matters

In 2000, D. Richard Hipp was writing software for guided missile destroyers in the US Navy. The database it depended on required a DBA to configure it before the software could run. Hipp thought: “What if the database was just a file?” That question led to SQLite—and changed computing forever.

SQLite is everywhere:

  • Every iPhone and Android device (multiple SQLite databases per phone)
  • Every Mac and Windows 10/11 machine
  • Every Firefox, Chrome, and Safari browser
  • Every Skype, iTunes, and Dropbox installation
  • Airbus A350 aircraft avionics
  • Over 1 trillion SQLite databases in active use worldwide

Why it remains dominant:

  • Zero configuration: No server, no setup, no DBA
  • Single file: Your entire database is one portable file
  • ACID compliant: Full transactions with crash recovery
  • Tiny footprint: ~600KB library, runs on embedded systems
  • Public domain: No licensing fees, no restrictions
  • Incredibly reliable: Billions of tests, aviation-grade quality

What understanding SQLite unlocks:

  • Deep knowledge of how ALL relational databases work internally
  • B-tree data structures in a real-world context
  • Query optimization that applies to PostgreSQL, MySQL, etc.
  • Embedded systems database design
  • Mobile application architecture
  • Understanding of ACID guarantees and transaction isolation

Core Concept Analysis

The SQLite Architecture

┌─────────────────────────────────────────────────────────────────┐
│                        YOUR APPLICATION                         │
├─────────────────────────────────────────────────────────────────┤
│                         SQLite Library                          │
│  ┌─────────────┐  ┌─────────────┐  ┌─────────────────────────┐  │
│  │   SQL       │  │   Virtual   │  │      B-Tree Engine      │  │
│  │  Compiler   │──│   Machine   │──│  (Storage & Retrieval)  │  │
│  │  (Parser)   │  │   (VDBE)    │  │                         │  │
│  └─────────────┘  └─────────────┘  └───────────┬─────────────┘  │
│                                                 │                │
│  ┌─────────────────────────────────────────────┴─────────────┐  │
│  │                         Pager                             │  │
│  │         (Page Cache + Transaction Management)             │  │
│  └─────────────────────────────────────────────┬─────────────┘  │
│                                                 │                │
│  ┌─────────────────────────────────────────────┴─────────────┐  │
│  │                    OS Interface (VFS)                     │  │
│  │              (File I/O Abstraction Layer)                 │  │
│  └─────────────────────────────────────────────────────────────┘│
├─────────────────────────────────────────────────────────────────┤
│                     DATABASE FILE (.db)                         │
│                     + Journal/WAL files                         │
└─────────────────────────────────────────────────────────────────┘

How a Query Executes

When you run SELECT * FROM users WHERE age > 25:

┌──────────────────────────────────────────────────────────────────┐
│ 1. TOKENIZER                                                     │
│    "SELECT * FROM users WHERE age > 25"                          │
│           ↓                                                      │
│    [SELECT] [*] [FROM] [users] [WHERE] [age] [>] [25]           │
└──────────────────────────────────────────────────────────────────┘
                              ↓
┌──────────────────────────────────────────────────────────────────┐
│ 2. PARSER                                                        │
│    Builds Abstract Syntax Tree (AST)                             │
│                                                                  │
│              SELECT                                              │
│             /      \                                             │
│          columns   FROM                                          │
│            |      /    \                                         │
│           (*)  users   WHERE                                     │
│                          |                                       │
│                         (>)                                      │
│                        /   \                                     │
│                      age   25                                    │
└──────────────────────────────────────────────────────────────────┘
                              ↓
┌──────────────────────────────────────────────────────────────────┐
│ 3. CODE GENERATOR                                                │
│    Produces bytecode for the Virtual Database Engine (VDBE)      │
│                                                                  │
│    addr  opcode       p1    p2    p3    p4                       │
│    ----  -----------  ----  ----  ----  ----                     │
│    0     Init         0     12    0                              │
│    1     OpenRead     0     2     0     3      (users table)     │
│    2     Rewind       0     10    0                              │
│    3     Column       0     2     1            (age column)      │
│    4     Integer      25    2     0                              │
│    5     Le           2     9     1            (if age<=25 jump) │
│    6     Column       0     0     3            (get all cols)    │
│    7     Column       0     1     4                              │
│    8     ResultRow    3     2     0            (return row)      │
│    9     Next         0     3     0            (next row)        │
│    10    Close        0     0     0                              │
│    11    Halt         0     0     0                              │
└──────────────────────────────────────────────────────────────────┘
                              ↓
┌──────────────────────────────────────────────────────────────────┐
│ 4. VIRTUAL MACHINE (VDBE)                                        │
│    Executes bytecode, interfaces with B-tree                     │
│    Returns results to application                                │
└──────────────────────────────────────────────────────────────────┘
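
Stages 3 and 4 are observable from any client. Python's built-in sqlite3 module returns the VDBE program as rows when a statement is prefixed with EXPLAIN; a minimal sketch (the users table here is a stand-in for your own schema):

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT, age INTEGER)")

# Each row is one VDBE instruction:
# (addr, opcode, p1, p2, p3, p4, p5, comment)
for addr, opcode, p1, p2, p3, p4, p5, comment in conn.execute(
        "EXPLAIN SELECT * FROM users WHERE age > 25"):
    print(addr, opcode, p1, p2, p3)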

The SQLite File Format

Understanding the file format is crucial for deep SQLite mastery.

Database File Structure

┌────────────────────────────────────────────────────────────────┐
│                    DATABASE FILE                                │
├────────────────────────────────────────────────────────────────┤
│  Page 1: Database Header + Schema Table (sqlite_master)        │
│  ┌──────────────────────────────────────────────────────────┐  │
│  │ Bytes 0-15:   "SQLite format 3\000" (magic string)       │  │
│  │ Bytes 16-17:  Page size (e.g., 4096)                     │  │
│  │ Byte 18:      File format write version                  │  │
│  │ Byte 19:      File format read version                   │  │
│  │ Byte 20:      Reserved space per page                    │  │
│  │ Bytes 21-23:  Payload fractions                          │  │
│  │ Bytes 24-27:  File change counter                        │  │
│  │ Bytes 28-31:  Database size in pages                     │  │
│  │ Bytes 32-35:  First freelist trunk page                  │  │
│  │ ...          (100 bytes total header)                    │  │
│  └──────────────────────────────────────────────────────────┘  │
├────────────────────────────────────────────────────────────────┤
│  Page 2-N: B-tree pages (table/index data)                     │
│  ┌──────────────────────────────────────────────────────────┐  │
│  │  ┌─────────┐                                             │  │
│  │  │ B-tree  │ Interior nodes: keys + child page pointers  │  │
│  │  │  Page   │ Leaf nodes: actual row data (cells)         │  │
│  │  │ Header  │                                             │  │
│  │  ├─────────┤                                             │  │
│  │  │  Cell   │ Variable-length records                     │  │
│  │  │ Pointer │                                             │  │
│  │  │  Array  │                                             │  │
│  │  ├─────────┤                                             │  │
│  │  │         │                                             │  │
│  │  │  Cell   │                                             │  │
│  │  │ Content │                                             │  │
│  │  │  Area   │                                             │  │
│  │  │         │                                             │  │
│  │  └─────────┘                                             │  │
│  └──────────────────────────────────────────────────────────┘  │
├────────────────────────────────────────────────────────────────┤
│  Overflow pages (for large records)                            │
├────────────────────────────────────────────────────────────────┤
│  Freelist pages (deleted/reusable space)                       │
└────────────────────────────────────────────────────────────────┘

Page Types

┌─────────────────────────────────────────────────────────────────┐
│                      SQLite Page Types                          │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  0x02: Interior Index B-tree page                               │
│        └── Contains keys and pointers to child pages            │
│                                                                 │
│  0x05: Interior Table B-tree page                               │
│        └── Contains rowids and pointers to child pages          │
│                                                                 │
│  0x0A: Leaf Index B-tree page                                   │
│        └── Contains index entries (key + rowid)                 │
│                                                                 │
│  0x0D: Leaf Table B-tree page                                   │
│        └── Contains actual table row data                       │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

B-Tree Storage Engine

SQLite uses B-trees for ALL data storage. Understanding this is fundamental.

B-Tree Structure for Tables

                         ┌─────────────────────┐
                         │   Root Page (5)     │
                         │   Interior Node     │
                         │                     │
                         │  [rowid<50] [50≤]   │
                         └──────────┬──────────┘
                                    │
                    ┌───────────────┴───────────────┐
                    │                               │
           ┌────────┴────────┐             ┌───────┴────────┐
           │   Page 3        │             │   Page 7       │
           │   Interior      │             │   Interior     │
           │                 │             │                │
           │ [<25] [25-49]   │             │ [50-74] [≥75]  │
           └────────┬────────┘             └───────┬────────┘
                    │                              │
        ┌───────────┴──────┐            ┌─────────┴─────────┐
        │                  │            │                   │
   ┌────┴────┐      ┌─────┴────┐  ┌────┴────┐       ┌─────┴────┐
   │ Page 2  │      │ Page 4   │  │ Page 6  │       │ Page 8   │
   │  Leaf   │      │  Leaf    │  │  Leaf   │       │  Leaf    │
   │         │      │          │  │         │       │          │
   │ Rows    │      │ Rows     │  │ Rows    │       │ Rows     │
   │ 1-24    │      │ 25-49    │  │ 50-74   │       │ 75-100   │
   └─────────┘      └──────────┘  └─────────┘       └──────────┘

   To find rowid=67:
   1. Start at root (page 5)
   2. 67 ≥ 50, go right to page 7
   3. 67 is in [50-74], go left to page 6
   4. Scan leaf page 6 for rowid=67

   Total: 3 page reads (O(log n))
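
   The descent itself is a short loop. A minimal sketch in Python, assuming
   hypothetical Page objects with an is_leaf flag, cells stored as
   (max_rowid, child_page) pairs sorted by rowid, and a rightmost_child
   pointer, mirroring the interior-page layout above:

def find_leaf(pages, root_no, rowid):
    """Descend from the root to the leaf page that would hold rowid."""
    page = pages[root_no]
    while not page.is_leaf:
        for max_rowid, child_no in page.cells:    # cells sorted by max_rowid
            if rowid <= max_rowid:
                page = pages[child_no]
                break
        else:                                     # larger than every cell key
            page = pages[page.rightmost_child]
    return page                                   # scan this leaf for rowid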

Index B-Tree Structure

   Index on "email" column:

                    ┌────────────────────────────┐
                    │     Root (Interior)        │
                    │                            │
                    │  ["john@"] ["mike@"]       │
                    └─────────────┬──────────────┘
                                  │
            ┌─────────────────────┼─────────────────────┐
            │                     │                     │
     ┌──────┴──────┐      ┌──────┴──────┐      ┌──────┴──────┐
     │    Leaf     │      │    Leaf     │      │    Leaf     │
     │             │      │             │      │             │
     │ alice@ → 5  │      │ john@ → 12  │      │ mike@ → 3   │
     │ bob@ → 8    │      │ kate@ → 7   │      │ sara@ → 15  │
     │ dave@ → 2   │      │ lisa@ → 22  │      │ zoe@ → 1    │
     └─────────────┘      └─────────────┘      └─────────────┘

     Key contains: (indexed_column_value, rowid)

     To find email='kate@example.com':
     1. Navigate B-tree using email as key
     2. Find rowid=7 in leaf
     3. Use rowid to fetch full row from table B-tree

ACID and Transaction Management

Write-Ahead Logging (WAL) Mode

┌─────────────────────────────────────────────────────────────────┐
│                    WAL Mode Operation                           │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│   DATABASE FILE              WAL FILE              SHM FILE     │
│   (main.db)                  (main.db-wal)         (main.db-shm)│
│   ┌─────────────┐            ┌─────────────┐       ┌─────────┐  │
│   │ Page 1      │            │ Frame 1     │       │ WAL     │  │
│   │ Page 2      │◄───read────│ (Page 2')   │       │ Index   │  │
│   │ Page 3      │            │ Frame 2     │       │ Hash    │  │
│   │ Page 4      │            │ (Page 5')   │       │ Table   │  │
│   │ Page 5      │            │ Frame 3     │       └─────────┘  │
│   └─────────────┘            │ (Page 2'')  │                    │
│         │                    └──────┬──────┘                    │
│         │                           │                           │
│         │     CHECKPOINT            │                           │
│         │◄──────────────────────────┘                           │
│         │   (copy frames back to db)                            │
│                                                                 │
│   WRITE: Append to WAL (fast, sequential)                       │
│   READ:  Check WAL first, then database file                    │
│   CHECKPOINT: Periodically copy WAL changes to database         │
│                                                                 │
│   Benefits:                                                     │
│   • Readers don't block writers                                 │
│   • Writers don't block readers                                 │
│   • Faster writes (sequential I/O)                              │
│   • Better concurrency                                          │
└─────────────────────────────────────────────────────────────────┘
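
Switching a database into WAL mode is a single PRAGMA, and so is forcing a checkpoint. A quick sketch using Python's sqlite3 (the file name is illustrative; WAL mode is not available for in-memory databases):

import sqlite3

conn = sqlite3.connect("main.db")
# Returns the new mode ("wal"); the setting persists in the database file
print(conn.execute("PRAGMA journal_mode=WAL").fetchone())
# Copy WAL frames back into the database and truncate the log;
# returns (busy, frames_in_wal, frames_checkpointed)
print(conn.execute("PRAGMA wal_checkpoint(TRUNCATE)").fetchone())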

Rollback Journal Mode (Default)

┌─────────────────────────────────────────────────────────────────┐
│               Rollback Journal Operation                        │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│   1. BEGIN TRANSACTION                                          │
│      ┌─────────────┐                                            │
│      │ Database    │                                            │
│      │ (original)  │                                            │
│      └─────────────┘                                            │
│                                                                 │
│   2. Before modifying page, copy original to journal            │
│      ┌─────────────┐         ┌─────────────┐                    │
│      │ Database    │────────►│  Journal    │                    │
│      │             │  copy   │ (original   │                    │
│      │             │  page   │  pages)     │                    │
│      └─────────────┘         └─────────────┘                    │
│                                                                 │
│   3. Modify database pages in place                             │
│      ┌─────────────┐         ┌─────────────┐                    │
│      │ Database    │         │  Journal    │                    │
│      │ (modified)  │         │ (backup)    │                    │
│      └─────────────┘         └─────────────┘                    │
│                                                                 │
│   4a. COMMIT: Delete journal file                               │
│   4b. ROLLBACK: Copy journal pages back to database             │
│   4c. CRASH: On recovery, if journal exists, rollback           │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

Locking and Concurrency

Lock States

┌─────────────────────────────────────────────────────────────────┐
│                    SQLite Lock States                           │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│   UNLOCKED ──────► SHARED ──────► RESERVED ──────► PENDING      │
│       │              │               │                │         │
│       │              │               │                │         │
│       │              │               │                ▼         │
│       │              │               │           EXCLUSIVE      │
│       │              │               │                │         │
│       │              │               │                │         │
│       ▼              ▼               ▼                ▼         │
│   No access      Can READ       Intend to       Waiting for     │
│                  (multiple      write soon      readers to      │
│                   allowed)      (only one)      finish          │
│                                                                 │
│                                                 Only writer     │
│                                                 can access      │
│                                                                 │
├─────────────────────────────────────────────────────────────────┤
│   Compatibility Matrix (row = lock another connection holds,   │
│   column = lock being requested):                              │
│                                                                 │
│              │ UNLOCKED │ SHARED │ RESERVED │ PENDING │EXCLUSIVE│
│   ───────────┼──────────┼────────┼──────────┼─────────┼─────────│
│   UNLOCKED   │    ✓     │   ✓    │    ✓     │    ✓    │    ✓    │
│   SHARED     │    ✓     │   ✓    │    ✓     │    ✓    │    ✗    │
│   RESERVED   │    ✓     │   ✓    │    ✗     │    ✗    │    ✗    │
│   PENDING    │    ✓     │   ✗    │    ✗     │    ✗    │    ✗    │
│   EXCLUSIVE  │    ✓     │   ✗    │    ✗     │    ✗    │    ✗    │
│                                                                 │
│   A writer may acquire PENDING while readers hold SHARED;       │
│   PENDING then blocks new SHARED requests until readers finish. │
└─────────────────────────────────────────────────────────────────┘
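
These lock states surface in application code as SQLITE_BUSY errors. A small sketch of the one-writer rule, using two connections to a hypothetical file (timeout=0 makes the second writer fail immediately instead of retrying):

import sqlite3

a = sqlite3.connect("demo.db", isolation_level=None)             # autocommit
b = sqlite3.connect("demo.db", isolation_level=None, timeout=0)
a.execute("CREATE TABLE IF NOT EXISTS t (x)")

a.execute("BEGIN IMMEDIATE")        # A acquires the RESERVED lock
try:
    b.execute("BEGIN IMMEDIATE")    # B wants RESERVED too: only one writer
except sqlite3.OperationalError as err:
    print("B blocked:", err)        # "database is locked"
a.execute("COMMIT")                 # A releases; B could now retry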

Query Optimization

How the Query Planner Thinks

┌─────────────────────────────────────────────────────────────────┐
│           Query: SELECT * FROM orders                           │
│                  WHERE customer_id = 5                          │
│                  AND status = 'pending'                         │
│                  ORDER BY created_at DESC                       │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│   Option A: Full Table Scan                                     │
│   ┌─────────────────────────────────────────────────────────┐   │
│   │ Scan ALL 1,000,000 rows                                 │   │
│   │ Check each row against WHERE clause                     │   │
│   │ Sort matching rows by created_at                        │   │
│   │ Cost: ~1,000,000 row reads + sort                       │   │
│   └─────────────────────────────────────────────────────────┘   │
│                                                                 │
│   Option B: Use index on customer_id                            │
│   ┌─────────────────────────────────────────────────────────┐   │
│   │ Look up customer_id=5 in index (~3 page reads)          │   │
│   │ Get ~500 rowids for this customer                       │   │
│   │ Fetch each row, check status='pending'                  │   │
│   │ Sort ~50 matching rows                                  │   │
│   │ Cost: ~503 row reads + small sort                       │   │
│   └─────────────────────────────────────────────────────────┘   │
│                                                                 │
│   Option C: Use covering index on (customer_id, status)         │
│   ┌─────────────────────────────────────────────────────────┐   │
│   │ Look up (customer_id=5, status='pending') directly      │   │
│   │ Get ~50 rowids immediately                              │   │
│   │ Fetch rows and sort                                     │   │
│   │ Cost: ~53 row reads + small sort                        │   │
│   └─────────────────────────────────────────────────────────┘   │
│                                                                 │
│   SQLite chooses based on:                                      │
│   • Estimated rows from sqlite_stat1 table                      │
│   • Index availability and selectivity                          │
│   • I/O cost estimates                                          │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

EXPLAIN and EXPLAIN QUERY PLAN

sqlite> EXPLAIN QUERY PLAN SELECT * FROM users WHERE email = 'test@example.com';

QUERY PLAN
|--SEARCH users USING INDEX idx_users_email (email=?)
   ^^^^^ Using index! Good!

sqlite> EXPLAIN QUERY PLAN SELECT * FROM users WHERE name LIKE '%john%';

QUERY PLAN
|--SCAN users
   ^^^^^ Full table scan! Consider full-text search!
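
The same plans are available programmatically. The sketch below assumes the idx_users_email index from the first example exists; each result row is (id, parent, notused, detail):

import sqlite3

conn = sqlite3.connect("mydb.sqlite")   # hypothetical database
for node_id, parent_id, _, detail in conn.execute(
        "EXPLAIN QUERY PLAN SELECT * FROM users WHERE email = ?",
        ("test@example.com",)):
    print(detail)   # e.g. "SEARCH users USING INDEX idx_users_email (email=?)"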

SQLite Data Types (Dynamic Typing)

┌─────────────────────────────────────────────────────────────────┐
│              SQLite Type Affinity System                        │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│   Storage Classes (what's actually stored):                     │
│   ┌──────────┬────────────────────────────────────────────┐     │
│   │ NULL     │ The NULL value                             │     │
│   │ INTEGER  │ Signed integer (1, 2, 3, 4, 6, or 8 bytes) │     │
│   │ REAL     │ 8-byte IEEE floating point                 │     │
│   │ TEXT     │ UTF-8, UTF-16BE, or UTF-16LE string        │     │
│   │ BLOB     │ Raw binary data                            │     │
│   └──────────┴────────────────────────────────────────────┘     │
│                                                                 │
│   Type Affinity (column preference, not enforcement):           │
│   ┌───────────────────────────────────────────────────────┐     │
│   │ Declared Type          │ Affinity   │ Example         │     │
│   ├────────────────────────┼────────────┼─────────────────┤     │
│   │ INT, INTEGER, TINYINT  │ INTEGER    │ 42              │     │
│   │ CHAR, VARCHAR, TEXT    │ TEXT       │ 'hello'         │     │
│   │ BLOB, (none declared)  │ BLOB       │ x'0500'         │     │
│   │ REAL, DOUBLE, FLOAT    │ REAL       │ 3.14159         │     │
│   │ NUMERIC, DECIMAL       │ NUMERIC    │ varies          │     │
│   └───────────────────────────────────────────────────────┘     │
│                                                                 │
│   Key insight: SQLite stores VALUES, not TYPES                  │
│   This works (but probably shouldn't):                          │
│                                                                 │
│   CREATE TABLE weird (age INTEGER);                             │
│   INSERT INTO weird VALUES (25);      -- stored as INTEGER      │
│   INSERT INTO weird VALUES ('old');   -- stored as TEXT!        │
│   INSERT INTO weird VALUES (3.14);    -- stored as REAL!        │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
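
You can verify this with typeof(), which reports the storage class actually used for each stored value. A quick check in Python:

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE weird (age INTEGER)")
conn.executemany("INSERT INTO weird VALUES (?)", [(25,), ("old",), (3.14,)])

for value, storage in conn.execute("SELECT age, typeof(age) FROM weird"):
    print(value, storage)   # 25 integer / old text / 3.14 real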

Concept Summary Table

  • File Format: SQLite database is a single file with a 100-byte header, followed by fixed-size pages. Everything is pages.
  • B-Tree Storage: Tables are B-trees keyed by rowid. Indexes are B-trees keyed by indexed columns. Navigation is O(log n).
  • Query Execution: SQL → Parse Tree → Bytecode → Virtual Machine execution. EXPLAIN shows the bytecode.
  • ACID Transactions: Write-ahead logging (WAL) or rollback journals ensure atomicity and durability even after crashes.
  • Locking Model: Database-level locks (not row-level). Multiple readers OR one writer. WAL allows concurrent reads during writes.
  • Type Affinity: SQLite is dynamically typed. Columns have “preferences” but any value can go anywhere. The stored value determines the type.
  • Query Planning: The optimizer chooses between table scans, index scans, and covering indexes based on statistics in sqlite_stat1.
  • Virtual Tables: Extensibility mechanism allowing custom data sources to appear as tables (FTS, JSON, CSV, etc.).

Deep Dive Reading by Concept

This section maps each concept to specific book chapters and resources for deeper understanding. Read these before or alongside the projects.

Database Fundamentals & Architecture

  • Relational Model Basics: “Database System Concepts” by Silberschatz — Ch. 1-2
  • SQLite Architecture Overview: “The Definitive Guide to SQLite” by Mike Owens — Ch. 1-2
  • ACID Properties: “Designing Data-Intensive Applications” by Kleppmann — Ch. 7 (Transactions)
  • SQL Language Fundamentals: “The Definitive Guide to SQLite” by Mike Owens — Ch. 3-6

Storage Engine & File Format

  • B-Tree Data Structure: “Introduction to Algorithms” by CLRS — Ch. 18 (B-Trees)
  • SQLite File Format: SQLite Official Documentation — “Database File Format”
  • Page Management: “Database Internals” by Alex Petrov — Ch. 2-4
  • Overflow Pages: SQLite Official Documentation — “File Format: Overflow Pages”

Transaction Management & Concurrency

  • Write-Ahead Logging: “Database Internals” by Alex Petrov — Ch. 11
  • Rollback Journals: “The Definitive Guide to SQLite” by Mike Owens — Ch. 11
  • Locking & Isolation: “Designing Data-Intensive Applications” by Kleppmann — Ch. 7
  • Crash Recovery: “Database Internals” by Alex Petrov — Ch. 12

Query Processing & Optimization

  • Query Planning: “The Definitive Guide to SQLite” by Mike Owens — Ch. 10
  • Index Design: “SQL Performance Explained” by Markus Winand — All chapters
  • EXPLAIN Output: SQLite Documentation — “EXPLAIN” and “EXPLAIN QUERY PLAN”
  • Query Optimization: “Use the Index, Luke” (online) by Markus Winand

Advanced Features

  • Virtual Tables: “The Definitive Guide to SQLite” by Mike Owens — Ch. 9
  • Full-Text Search: SQLite Documentation — “FTS5”
  • JSON Support: SQLite Documentation — “JSON1 Extension”
  • Custom Functions: “The Definitive Guide to SQLite” by Mike Owens — Ch. 8

Essential Reading Order

For maximum comprehension, read in this order:

  1. Foundation (Week 1-2):
    • “The Definitive Guide to SQLite” Ch. 1-4 (basics, SQL)
    • SQLite Official Documentation: “How SQLite Works”
  2. Storage Deep Dive (Week 3-4):
    • “Database Internals” Ch. 2-4 (B-trees, storage)
    • SQLite File Format documentation
  3. Transactions & Concurrency (Week 5-6):
    • “Designing Data-Intensive Applications” Ch. 7
    • “Database Internals” Ch. 11-12
  4. Performance & Optimization (Week 7-8):
    • “SQL Performance Explained” (entire book)
    • “Use the Index, Luke” online guide

Project List

The following projects progress from foundational understanding to advanced mastery. Each project forces you to confront core SQLite concepts through hands-on building.


Project 1: SQLite File Format Explorer

  • File: LEARN_SQLITE_DEEP_DIVE.md
  • Main Programming Language: C
  • Alternative Programming Languages: Rust, Python, Go
  • Coolness Level: Level 4: Hardcore Tech Flex
  • Business Potential: 1. The “Resume Gold”
  • Difficulty: Level 3: Advanced
  • Knowledge Area: Database Internals / Binary Parsing
  • Software or Tool: SQLite file format
  • Main Book: “Database Internals” by Alex Petrov

What you’ll build: A tool that opens any SQLite database file and displays its internal structure—the header, page layout, B-tree organization, and individual cell contents—without using the SQLite library.

Why it teaches SQLite: This is THE foundational project. By reading the raw bytes yourself, you’ll understand exactly what SQLite stores and how. Every subsequent project builds on this understanding.

Core challenges you’ll face:

  • Parsing the 100-byte database header → maps to understanding SQLite metadata
  • Navigating B-tree pages → maps to how tables and indexes are stored
  • Decoding varint encoding → maps to SQLite’s space-efficient number storage
  • Handling cell formats → maps to how row data is actually stored
  • Understanding overflow pages → maps to handling large records

Key Concepts:

  • Binary File Parsing: “Practical Binary Analysis” Ch. 1-2 - Dennis Andriesse
  • B-Tree Structure: “Introduction to Algorithms” Ch. 18 - CLRS
  • SQLite File Format: SQLite Official Documentation - “Database File Format”
  • Varint Encoding: SQLite Documentation - “Variable-Length Integers”

Difficulty: Advanced
Time estimate: 2-3 weeks
Prerequisites: C programming (pointers, file I/O, structs), understanding of binary/hex, basic knowledge of tree data structures


Real World Outcome

You’ll have a command-line tool that can inspect any SQLite database at the byte level, revealing its internal structure:

Example Output:

$ ./sqlite_explorer mydb.sqlite

╔══════════════════════════════════════════════════════════════╗
║                    SQLite Database Header                     ║
╠══════════════════════════════════════════════════════════════╣
║ Magic String:        SQLite format 3                          ║
║ Page Size:           4096 bytes                               ║
║ File Format:         Write=1, Read=1                          ║
║ Database Size:       156 pages (638,976 bytes)                ║
║ Schema Cookie:       42                                       ║
║ Text Encoding:       UTF-8                                    ║
║ First Freelist:      Page 0 (none)                            ║
║ Freelist Count:      0 pages                                  ║
╚══════════════════════════════════════════════════════════════╝

Page 1 (Root of sqlite_master):
  Type: Leaf Table B-tree (0x0D)
  Cell Count: 4

  Cell 0: rowid=1, size=87 bytes
    type=table, name=users, tbl_name=users, rootpage=2
    sql: CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT, email TEXT)

  Cell 1: rowid=2, size=42 bytes
    type=index, name=idx_users_email, tbl_name=users, rootpage=5
    sql: CREATE INDEX idx_users_email ON users(email)

Page 2 (Root of 'users' table):
  Type: Interior Table B-tree (0x05)
  Cell Count: 3
  Right-most Pointer: Page 8

  [rowid < 1000] → Page 3
  [1000 ≤ rowid < 2000] → Page 6
  [rowid ≥ 2000] → Page 8

Page 3 (Leaf of 'users' table):
  Type: Leaf Table B-tree (0x0D)
  Cell Count: 47

  Row 1: [1, "Alice", "alice@example.com"]
  Row 2: [2, "Bob", "bob@example.com"]
  Row 3: [3, "Charlie", "charlie@test.org"]
  ...

$ ./sqlite_explorer mydb.sqlite --page 5 --hex

Page 5 Hex Dump:
0000: 0A 00 00 00 23 0F 89 00  0F F4 0F E8 0F DC 0F D0  |....#...........|
0010: 0F C4 0F B8 0F AC 0F A0  0F 94 0F 89 00 00 00 00  |................|
...

The Core Question You’re Answering

“What IS a SQLite database file? If I look at the raw bytes, what structure do I see?”

Before you write any code, sit with this question. Most developers treat databases as black boxes—you put SQL in, data comes out. But a SQLite database is just a file with a specific format. Every table, every index, every row exists as bytes arranged in a documented pattern.


Concepts You Must Understand First

Stop and research these before coding:

  1. Binary File I/O in C
    • How do you read specific bytes from a file?
    • What’s the difference between text mode and binary mode?
    • How do you seek to arbitrary positions?
    • Book Reference: “The C Programming Language” Ch. 7 - K&R
  2. Endianness
    • What’s big-endian vs little-endian?
    • Which does SQLite use for multi-byte integers?
    • How do you convert between them?
    • Book Reference: “Computer Systems: A Programmer’s Perspective” Ch. 2 - Bryant & O’Hallaron
  3. Variable-Length Integer Encoding
    • Why not just use fixed-size integers?
    • How does SQLite’s varint encoding work?
    • What’s the maximum value a varint can represent?
    • Book Reference: SQLite Documentation - “Variable-Length Integers”
  4. B-Tree Data Structure
    • What’s the difference between B-tree and B+tree?
    • How do interior nodes differ from leaf nodes?
    • How do you traverse a B-tree?
    • Book Reference: “Introduction to Algorithms” Ch. 18 - CLRS

Questions to Guide Your Design

Before implementing, think through these:

  1. File Reading Strategy
    • Should you read the entire file into memory or read pages on demand?
    • How will you handle databases larger than available RAM?
    • What buffer size should you use for efficiency?
  2. Data Structure Design
    • How will you represent a page in your code?
    • What struct fields do you need for the header?
    • How will you handle variable-length cell data?
  3. Navigation Logic
    • Given a table name, how do you find its root page?
    • How do you traverse from root to a specific row?
    • How do you follow overflow page chains?
  4. Output Formatting
    • How will you display binary data readably?
    • How much detail should you show by default?
    • What command-line options would be useful?

Thinking Exercise

Trace a Lookup

Before coding, trace how SQLite would find the row with id=1500 in a table with 5000 rows:

Assumptions:
- Page size: 4096 bytes
- Average row size: 100 bytes
- ~40 rows per leaf page
- Table root page: 2

Draw the B-tree structure:
- How many leaf pages are needed for 5000 rows?
- How many levels deep is the tree?
- Which pages would be read to find id=1500?

Questions while tracing:

  • What information is stored in interior nodes vs leaf nodes?
  • How does SQLite know which child page to follow?
  • How many disk reads are required for this lookup?
  • What if the row spans multiple overflow pages?

The Interview Questions They’ll Ask

Prepare to answer these:

  1. “Explain how SQLite stores data on disk. What’s the file format?”
  2. “What’s the difference between a clustered and non-clustered index in terms of storage?”
  3. “How does a B-tree lookup work? What’s the time complexity?”
  4. “If a SQLite database file is corrupted, how would you diagnose which pages are affected?”
  5. “Why does SQLite use variable-length integers? What’s the tradeoff?”
  6. “How would you estimate the size of a SQLite database given the schema and row count?”

Hints in Layers

Hint 1: Starting Point Begin with just reading and displaying the 100-byte header. Get the page size and database size first.

Hint 2: Page Reading Once you can read the header, read page 1. Page 1 begins at byte 0 (the 100-byte database header is part of it), so its B-tree page header starts at byte 100; every subsequent page starts on a page_size boundary. Identify the page type from the first byte of the B-tree page header.

Hint 3: Cell Parsing The cell pointer array at the start of each page tells you where cells are located. Cells are stored from the end of the page backwards. Parse the cell pointer array first.

Hint 4: Verification Use sqlite3 mydb.sqlite ".dbinfo" and sqlite3 mydb.sqlite ".schema" to verify your parser outputs match the official interpretation.


Books That Will Help

  • Binary file parsing: “Practical Binary Analysis” by Andriesse — Ch. 1-2
  • B-tree fundamentals: “Introduction to Algorithms” by CLRS — Ch. 18
  • SQLite internals: “The Definitive Guide to SQLite” by Owens — Ch. 11
  • C file I/O: “The C Programming Language” by K&R — Ch. 7
  • Database storage: “Database Internals” by Petrov — Ch. 2-4

Implementation Hints

The SQLite file format is fully documented. Your parser should handle:

Header Parsing (first 100 bytes):

  • Bytes 0-15: Magic string “SQLite format 3\0”
  • Bytes 16-17: Page size (big-endian, 1 means 65536)
  • Bytes 28-31: Database size in pages
  • Bytes 44-47: Schema cookie
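
For reference, a minimal header parse sketched in Python; a C version reads the same offsets (all multi-byte header fields are big-endian):

import struct

def read_header(path):
    with open(path, "rb") as f:
        hdr = f.read(100)
    assert hdr[:16] == b"SQLite format 3\x00", "not a SQLite 3 database"
    (page_size,) = struct.unpack(">H", hdr[16:18])
    if page_size == 1:                 # the stored value 1 encodes 65536
        page_size = 65536
    (page_count,) = struct.unpack(">I", hdr[28:32])
    (schema_cookie,) = struct.unpack(">I", hdr[44:48])
    return {"page_size": page_size, "page_count": page_count,
            "schema_cookie": schema_cookie}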

Page Structure:

  • Byte 0: Page type (0x02, 0x05, 0x0A, 0x0D)
  • Bytes 1-2: First freeblock offset
  • Bytes 3-4: Cell count
  • Bytes 5-6: Cell content area start
  • Byte 7: Fragmented free bytes
  • For interior pages: bytes 8-11 are rightmost child pointer

Varint Decoding:

For each byte:
  - If high bit is 0: this is the last byte
  - If high bit is 1: more bytes follow
  - Use lower 7 bits of each byte
  - Concatenate in big-endian order
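
A direct translation of that rule, sketched in Python (a varint is at most 9 bytes; the 9th byte, if present, contributes all 8 of its bits):

def decode_varint(buf, offset=0):
    """Return (value, bytes_consumed) for a SQLite varint (1-9 bytes)."""
    value = 0
    for i in range(8):
        byte = buf[offset + i]
        value = (value << 7) | (byte & 0x7F)
        if byte & 0x80 == 0:                 # high bit clear: last byte
            return value, i + 1
    value = (value << 8) | buf[offset + 8]   # 9th byte uses all 8 bits
    return value, 9

print(decode_varint(bytes([0x81, 0x00])))    # (128, 2)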

Think about edge cases:

  • What if the page size is 65536 (stored as 1)?
  • What if a cell spans overflow pages?
  • What if the database is encrypted?

Learning Milestones

  1. You can read and display the header → You understand the database metadata
  2. You can identify page types and cell counts → You understand page structure
  3. You can decode varints and cell payloads → You understand SQLite’s encoding
  4. You can navigate B-tree from root to leaf → You truly understand how SQLite finds data

Project 2: SQL Query Analyzer & EXPLAIN Visualizer

  • File: LEARN_SQLITE_DEEP_DIVE.md
  • Main Programming Language: Python
  • Alternative Programming Languages: Go, Rust, TypeScript
  • Coolness Level: Level 3: Genuinely Clever
  • Business Potential: 2. The “Micro-SaaS / Pro Tool”
  • Difficulty: Level 2: Intermediate
  • Knowledge Area: Query Optimization / Database Performance
  • Software or Tool: SQLite EXPLAIN
  • Main Book: “SQL Performance Explained” by Markus Winand

What you’ll build: A tool that takes SQL queries and visualizes their execution plans, showing the bytecode operations, estimated costs, index usage, and bottlenecks with color-coded output and suggestions for optimization.

Why it teaches SQLite: Understanding query execution is crucial for performance. This project forces you to understand how SQLite’s query planner thinks—which indexes it considers, why it chooses certain plans, and what each VDBE opcode means.

Core challenges you’ll face:

  • Parsing EXPLAIN output → maps to understanding VDBE opcodes
  • Understanding EXPLAIN QUERY PLAN → maps to query planner decisions
  • Identifying full table scans → maps to missing indexes
  • Recognizing covering indexes → maps to index-only queries
  • Estimating query costs → maps to sqlite_stat1 statistics

Key Concepts:

  • VDBE Opcodes: SQLite Documentation - “The Virtual Database Engine”
  • Query Planning: “SQL Performance Explained” - Full book - Markus Winand
  • Index Selection: “Use the Index, Luke” (online) - Markus Winand
  • ANALYZE Command: SQLite Documentation - “ANALYZE”

Difficulty: Intermediate
Time estimate: 1-2 weeks
Prerequisites: Basic Python, SQL knowledge, understanding of what indexes do conceptually


Real World Outcome

A command-line tool that analyzes queries and explains execution plans in human-readable format:

Example Output:

$ ./query_analyzer mydb.sqlite "SELECT * FROM orders WHERE customer_id = 5 AND status = 'pending' ORDER BY created_at"

╔══════════════════════════════════════════════════════════════════╗
║                     Query Analysis Report                        ║
╠══════════════════════════════════════════════════════════════════╣

📊 EXECUTION PLAN:
┌────────────────────────────────────────────────────────────────┐
│ SEARCH orders USING INDEX idx_orders_customer (customer_id=?) │
│    └── Filter: status='pending'                                │
│    └── Sort: ORDER BY created_at (using temporary B-tree)     │
└────────────────────────────────────────────────────────────────┘

📈 COST BREAKDOWN:
┌─────────────────────────────────────────────────────────────────┐
│ Index lookup (idx_orders_customer):  ~3 page reads             │
│ Row fetches:                         ~127 rows (estimated)     │
│ Filter passes:                       ~42 rows (33%)            │
│ Sort cost:                           ~42 comparisons           │
│ Total estimated cost:                ~175 operations           │
└─────────────────────────────────────────────────────────────────┘

⚠️  OPTIMIZATION SUGGESTIONS:

[MEDIUM] SUBOPTIMAL INDEX USAGE
  Current: Using idx_orders_customer, then filtering status
  Better:  Composite index on (customer_id, status)
  Impact:  Index returns the ~42 matching rows directly instead of fetching ~127 and filtering

  Suggested CREATE INDEX:
  CREATE INDEX idx_orders_customer_status ON orders(customer_id, status);

[LOW] SORTING REQUIRES TEMPORARY TABLE
  Current: Sorting in memory after filtering
  Better:  Covering index including ORDER BY column
  Impact:  Would eliminate sort step entirely

  Suggested CREATE INDEX:
  CREATE INDEX idx_orders_customer_status_date
    ON orders(customer_id, status, created_at);

🔍 VDBE BYTECODE (simplified):
addr  opcode       p1    comment
----  -----------  ----  ----------------------------------
0     Init         0     Initialize program
1     OpenRead     0     Open 'orders' table for reading
2     OpenRead     1     Open idx_orders_customer index
3     Integer      5     Push customer_id value
4     SeekGE       1     Find first matching index entry
5     IdxGT        1     If past range, jump to end
6     Column       0     Get status column
7     Ne           ...   Skip if status != 'pending'
...

The Core Question You’re Answering

“When I write a SQL query, what actually happens? How does SQLite decide HOW to execute it?”

Before you write any code, sit with this question. SQL is declarative—you say WHAT you want, not HOW to get it. The query planner’s job is to find the fastest path to your data. Understanding its decisions is the difference between a query that runs in 1ms and one that takes 10 seconds.


Concepts You Must Understand First

Stop and research these before coding:

  1. EXPLAIN vs EXPLAIN QUERY PLAN
    • What information does each provide?
    • When would you use one vs the other?
    • What’s a “query plan” in the first place?
    • Book Reference: “SQL Performance Explained” Ch. 1 - Markus Winand
  2. Index Types and Usage
    • What’s a B-tree index vs a covering index?
    • When can SQLite use an index vs when must it scan?
    • What makes a “good” index for a particular query?
    • Book Reference: “SQL Performance Explained” Ch. 2 - Markus Winand
  3. VDBE Opcodes
    • What’s the Virtual Database Engine?
    • What do common opcodes like SeekGE, Column, ResultRow mean?
    • How does bytecode relate to performance?
    • Book Reference: SQLite Documentation - “The Virtual Database Engine”
  4. Query Statistics (sqlite_stat1)
    • What statistics does SQLite collect?
    • How do they affect query planning?
    • When should you run ANALYZE?
    • Book Reference: SQLite Documentation - “ANALYZE”

Questions to Guide Your Design

Before implementing, think through these:

  1. Output Design
    • How do you make bytecode understandable to humans?
    • What information is most useful to show first?
    • How do you indicate severity of issues (colors, icons)?
  2. Suggestion Engine
    • How do you detect missing indexes?
    • How do you know if a composite index would help?
    • What heuristics indicate a problematic query?
  3. Cost Estimation
    • How do you calculate estimated row counts?
    • What factors contribute to query “cost”?
    • How accurate can your estimates be?
  4. Interactive Features
    • Should users be able to compare before/after adding indexes?
    • How do you handle queries with parameters?
    • What about multi-statement queries?

Thinking Exercise

Compare Two Plans

Before coding, manually analyze how the same query executes under two different index setups on a table with 1M orders:

-- Schema A: no useful index
SELECT * FROM orders WHERE status = 'pending' AND amount > 100;

-- Schema B: index on status
SELECT * FROM orders WHERE status = 'pending' AND amount > 100;

Questions while analyzing:

  • How many rows might match status = 'pending'? (assume 5 statuses)
  • How many of those match amount > 100? (assume 30% of orders)
  • What’s the I/O difference between full scan vs index + filter?
  • Would an index on (status, amount) be even better? Why?
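
A rough back-of-envelope for those questions, under the stated assumptions:

rows = 1_000_000
pending = rows // 5              # 5 statuses, uniform: ~200,000 'pending' rows
matching = pending * 30 // 100   # 30% exceed amount > 100: ~60,000 final rows
# Schema A reads all 1,000,000 rows; Schema B reads ~200,000 via the index,
# then filters. A composite index on (status, amount) would narrow the
# range scan to the ~60,000 matching entries directly.
print(pending, matching)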

The Interview Questions They’ll Ask

Prepare to answer these:

  1. “A query is slow. Walk me through how you’d diagnose it.”
  2. “What’s the difference between a table scan and an index scan?”
  3. “When would you use a composite index vs two separate indexes?”
  4. “Explain what a covering index is and when it helps.”
  5. “How does SQLite decide which index to use when multiple are available?”
  6. “What’s the purpose of running ANALYZE? When should you do it?”

Hints in Layers

Hint 1: Starting Point Start by just pretty-printing EXPLAIN QUERY PLAN output. Parse the indentation to understand the plan tree.

Hint 2: Pattern Recognition Look for keywords in the output: “SCAN” (bad), “SEARCH” (good), “USING INDEX” (good), “USING COVERING INDEX” (best), “TEMP B-TREE” (sorting overhead).

Hint 3: Cost Estimation Query sqlite_stat1 for row count estimates: SELECT stat FROM sqlite_stat1 WHERE tbl='orders'. The first number is row count, subsequent numbers are index selectivity.

Hint 4: Suggestion Generation Compare columns in WHERE clauses against existing indexes. If a column appears in WHERE but has no index, suggest one. If multiple columns appear, suggest composite index in selectivity order.
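
Hint 3 in code form: a sketch that pulls the planner's statistics (run ANALYZE first, otherwise sqlite_stat1 will not exist):

import sqlite3

conn = sqlite3.connect("mydb.sqlite")   # hypothetical database
conn.execute("ANALYZE")                 # (re)populates sqlite_stat1
for tbl, idx, stat in conn.execute("SELECT tbl, idx, stat FROM sqlite_stat1"):
    # stat is space-separated: total rows, then avg rows per index prefix
    print(tbl, idx, stat.split())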


Books That Will Help

  • Query execution: “SQL Performance Explained” by Winand — Ch. 1-2
  • Index design: “Use the Index, Luke” by Winand — Online, all chapters
  • SQLite internals: “The Definitive Guide to SQLite” by Owens — Ch. 10
  • Query optimization: “Database Internals” by Petrov — Ch. 8

Implementation Hints

Your analyzer should understand these patterns:

Good signs in EXPLAIN QUERY PLAN:

  • “SEARCH” indicates index usage
  • “USING INDEX” means index is helping
  • “USING COVERING INDEX” means no table access needed

Bad signs:

  • “SCAN” means full table scan
  • “TEMP B-TREE” means sorting in memory
  • “USE TEMP B-TREE FOR ORDER BY” means no index for sort

Detection patterns:

  • If WHERE column has no index → suggest index
  • If WHERE has multiple columns → suggest composite index
  • If ORDER BY column isn’t in index → suggest including it
  • If SELECT * with index → suggest covering index with needed columns
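
A first pass at those detection patterns, sketched in Python against the detail column of EXPLAIN QUERY PLAN output:

import sqlite3

def classify(conn, sql):
    """Flag scan and sort warnings in a query plan (heuristic sketch)."""
    warnings = []
    for _, _, _, detail in conn.execute("EXPLAIN QUERY PLAN " + sql):
        if detail.startswith("SCAN"):
            warnings.append(("HIGH", "full table scan: " + detail))
        if "TEMP B-TREE" in detail:
            warnings.append(("LOW", "sort needs a temp B-tree: " + detail))
    return warnings

conn = sqlite3.connect("mydb.sqlite")    # hypothetical database
print(classify(conn, "SELECT * FROM users WHERE name LIKE '%john%'"))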

Learning Milestones

  1. You can parse and display EXPLAIN QUERY PLAN → You understand plan structure
  2. You can identify scans vs searches → You understand index impact
  3. You can suggest missing indexes → You understand index design
  4. You can estimate query costs → You understand query optimization

Project 3: Build Your Own Key-Value Store on SQLite

  • File: LEARN_SQLITE_DEEP_DIVE.md
  • Main Programming Language: Python
  • Alternative Programming Languages: Go, Rust, C
  • Coolness Level: Level 3: Genuinely Clever
  • Business Potential: 2. The “Micro-SaaS / Pro Tool”
  • Difficulty: Level 2: Intermediate
  • Knowledge Area: Database Design / API Design
  • Software or Tool: SQLite
  • Main Book: “Designing Data-Intensive Applications” by Martin Kleppmann

What you’ll build: A Redis-like key-value store using SQLite as the backend, supporting operations like GET, SET, DELETE, EXPIRE, INCR, and data structures like lists, sets, and hashes—with proper TTL expiration and atomic operations.

Why it teaches SQLite: This project teaches you SQLite’s strengths and limitations through direct comparison with a different database paradigm. You’ll learn about transactions, proper schema design for different access patterns, and SQLite’s performance characteristics.

Core challenges you’ll face:

  • Designing efficient schema for KV access → maps to understanding rowid optimization
  • Implementing TTL expiration → maps to background jobs and query scheduling
  • Supporting multiple data types → maps to SQLite’s type affinity
  • Ensuring atomic operations → maps to transaction isolation
  • Optimizing for read vs write → maps to index design decisions

Key Concepts:

  • Transactions: “Designing Data-Intensive Applications” Ch. 7 - Kleppmann
  • Schema Design: “The Definitive Guide to SQLite” Ch. 3-4 - Owens
  • Atomic Operations: SQLite Documentation - “Atomic Commit In SQLite”
  • Performance Tuning: “SQL Performance Explained” Ch. 3 - Winand

Difficulty: Intermediate
Time estimate: 1-2 weeks
Prerequisites: Python, basic SQL, understanding of key-value databases conceptually


Real World Outcome

A Python library (and CLI) providing Redis-like operations backed by SQLite:

Example Output:

# Library usage
from sqlite_kv import SQLiteKV

db = SQLiteKV("mystore.db")

# Basic operations
db.set("user:1:name", "Alice")
db.set("user:1:email", "alice@example.com", ttl=3600)  # Expires in 1 hour

name = db.get("user:1:name")  # Returns "Alice"
db.delete("user:1:name")

# Atomic increment
db.set("page:views", 0)
db.incr("page:views")  # Returns 1
db.incr("page:views", 10)  # Returns 11

# Lists
db.lpush("tasks", "task1", "task2", "task3")
db.rpop("tasks")  # Returns "task1"
db.lrange("tasks", 0, -1)  # Returns ["task3", "task2"]

# Sets
db.sadd("tags:article:1", "python", "database", "tutorial")
db.sismember("tags:article:1", "python")  # Returns True
db.smembers("tags:article:1")  # Returns {"python", "database", "tutorial"}

# Hashes
db.hset("user:1", "name", "Alice")
db.hset("user:1", "age", "30")
db.hgetall("user:1")  # Returns {"name": "Alice", "age": "30"}

# Pattern matching
db.keys("user:*")  # Returns ["user:1:name", "user:1:email", ...]

# TTL management
db.ttl("user:1:email")  # Returns remaining seconds
db.persist("user:1:email")  # Remove TTL

# CLI usage
$ sqlite-kv mystore.db
> SET greeting "Hello, World!"
OK
> GET greeting
"Hello, World!"
> INCR counter
(integer) 1
> LPUSH mylist a b c
(integer) 3
> LRANGE mylist 0 -1
1) "c"
2) "b"
3) "a"

The Core Question You’re Answering

“How do you design a database schema for key-value access patterns? What tradeoffs emerge?”

Before you write any code, sit with this question. Relational databases are optimized for complex queries across related tables. Key-value stores are optimized for simple lookups by key. Building a KV store on SQLite reveals the impedance mismatch—and the clever solutions.


Concepts You Must Understand First

Stop and research these before coding:

  1. SQLite WITHOUT ROWID Tables
    • What’s the difference between regular and WITHOUT ROWID tables?
    • When is WITHOUT ROWID faster?
    • What’s the clustered index advantage?
    • Book Reference: SQLite Documentation - “WITHOUT ROWID Tables”
  2. Transaction Isolation Levels
    • What isolation level does SQLite provide?
    • What’s a “serializable” transaction?
    • How do you ensure atomic read-modify-write?
    • Book Reference: “Designing Data-Intensive Applications” Ch. 7 - Kleppmann
  3. UPSERT (INSERT OR REPLACE)
    • How does INSERT OR REPLACE work?
    • What’s the difference from INSERT ON CONFLICT?
    • When would each be appropriate?
    • Book Reference: SQLite Documentation - “UPSERT”
  4. Partial Indexes
    • What’s a partial index?
    • When would one help for TTL expiration?
    • How do you create one?
    • Book Reference: SQLite Documentation - “Partial Indexes”

Questions to Guide Your Design

Before implementing, think through these:

  1. Schema Design
    • One table or multiple (strings, lists, sets, hashes)?
    • How do you store different data types?
    • How do you efficiently look up by key?
  2. TTL Implementation
    • Store absolute timestamp or relative TTL?
    • Active expiration (background job) or lazy (check on access)?
    • How do you clean up expired keys efficiently?
  3. Atomic Operations
    • How do you implement INCR atomically?
    • What about LPUSH (modify existing list)?
    • Can you use SQLite’s built-in functions?
  4. Performance Considerations
    • What indexes do you need?
    • How do you handle high write volume?
    • When to use WAL mode?

Thinking Exercise

Design the Schema

Before coding, design tables for a KV store:

-- Option A: Single table, type column
CREATE TABLE kv (
    key TEXT PRIMARY KEY,
    type TEXT,  -- 'string', 'list', 'set', 'hash'
    value BLOB,
    expires_at INTEGER
);

-- Option B: Separate tables per type
CREATE TABLE strings (key TEXT PRIMARY KEY, value TEXT, expires_at INTEGER);
CREATE TABLE lists (key TEXT, idx INTEGER, value TEXT, PRIMARY KEY(key, idx));
CREATE TABLE sets (key TEXT, member TEXT, PRIMARY KEY(key, member));
CREATE TABLE hashes (key TEXT, field TEXT, value TEXT, PRIMARY KEY(key, field));

Questions while designing:

  • Which option is better for GET operations?
  • Which is better for type-specific operations (LPUSH, SADD)?
  • How would you implement KEYS pattern matching?
  • How would you handle expiration in each design?

The Interview Questions They’ll Ask

Prepare to answer these:

  1. “How would you implement atomic increment in a SQL database?”
  2. “What’s the tradeoff between a single-table and multi-table design?”
  3. “How would you efficiently expire keys in SQLite?”
  4. “When would SQLite be appropriate vs Redis for a KV store?”
  5. “How do you handle concurrent access in SQLite?”
  6. “What index would you create for the keys() operation with wildcards?”

Hints in Layers

Hint 1: Starting Point Start with just strings: SET, GET, DELETE. Use INSERT OR REPLACE INTO kv (key, value) VALUES (?, ?).

Hint 2: TTL Implementation Store expires_at as Unix timestamp. On GET, check if expired: WHERE key = ? AND (expires_at IS NULL OR expires_at > ?). Create a partial index: CREATE INDEX idx_expires ON kv(expires_at) WHERE expires_at IS NOT NULL.

Hint 3: Atomic INCR Use a single UPDATE with expression: UPDATE kv SET value = CAST(value AS INTEGER) + ? WHERE key = ?. Check changes() to see if row existed.

Hint 4: List Operations For lists, store position with each element. LPUSH increments a counter and inserts at negative positions. RPOP selects max position. Use transactions to maintain consistency.
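
Putting Hints 1 through 3 together, a minimal string-only sketch of the SQLiteKV class from the outcome above (lazy TTL check on read; lists, sets, and hashes are left to you):

import sqlite3
import time

class SQLiteKV:
    def __init__(self, path):
        self.conn = sqlite3.connect(path)
        self.conn.execute("PRAGMA journal_mode=WAL")
        self.conn.execute("""CREATE TABLE IF NOT EXISTS kv (
            key TEXT PRIMARY KEY,
            value BLOB,
            expires_at INTEGER) WITHOUT ROWID""")

    def set(self, key, value, ttl=None):
        expires = int(time.time()) + ttl if ttl else None
        with self.conn:                  # one atomic transaction
            self.conn.execute("INSERT OR REPLACE INTO kv VALUES (?, ?, ?)",
                              (key, value, expires))

    def get(self, key):
        row = self.conn.execute("SELECT value, expires_at FROM kv WHERE key = ?",
                                (key,)).fetchone()
        if row is None:
            return None
        value, expires = row
        if expires is not None and expires <= time.time():
            with self.conn:              # lazy expiration on access
                self.conn.execute("DELETE FROM kv WHERE key = ?", (key,))
            return None
        return value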


Books That Will Help

  • Data model design: “Designing Data-Intensive Applications” by Kleppmann — Ch. 2
  • SQLite transactions: “The Definitive Guide to SQLite” by Owens — Ch. 11
  • Schema design: “SQL Performance Explained” by Winand — Ch. 3
  • Atomic operations: SQLite Documentation — “Atomic Commit”

Implementation Hints

Schema recommendation:

WITHOUT ROWID tables cluster rows by primary key, so a key lookup is a single
B-tree descent instead of an index lookup (key → rowid) plus a second descent
into the table B-tree. Use this for all KV tables.

For TTL: A background thread can periodically DELETE WHERE expires_at < now,
but this causes write contention. Better: lazy deletion on access + periodic cleanup
during low-activity periods.

For INCR: Use the RETURNING clause (SQLite 3.35+) to atomically update and return:
UPDATE kv SET value = value + 1 WHERE key = ? RETURNING value

For KEYS pattern: Convert glob to SQL LIKE (replace * with %, ? with _).
Index won't help with leading wildcard—warn users about performance.

Learning Milestones

  1. Basic GET/SET/DELETE working → You understand SQLite CRUD
  2. TTL expiration working → You understand lazy vs eager deletion
  3. INCR is atomic → You understand transactions
  4. Lists/Sets/Hashes working → You understand schema design for different access patterns

Project 4: SQLite WAL Mode Inspector

  • File: LEARN_SQLITE_DEEP_DIVE.md
  • Main Programming Language: C
  • Alternative Programming Languages: Rust, Python
  • Coolness Level: Level 4: Hardcore Tech Flex
  • Business Potential: 1. The “Resume Gold”
  • Difficulty: Level 4: Expert
  • Knowledge Area: Database Internals / Transaction Management
  • Software or Tool: SQLite WAL files
  • Main Book: “Database Internals” by Alex Petrov

What you’ll build: A tool that reads SQLite WAL (Write-Ahead Log) files and displays their contents—frame by frame, showing which database pages were modified, transaction boundaries, and checkpoint status—providing visibility into SQLite’s transaction mechanism.

Why it teaches SQLite: WAL mode is how modern SQLite achieves concurrent reads during writes. Understanding WAL means understanding SQLite’s transaction guarantees, crash recovery, and performance characteristics at a deep level.

Core challenges you’ll face:

  • Parsing WAL header and frame headers → maps to understanding WAL format
  • Tracking transaction boundaries → maps to commit protocols
  • Understanding checkpointing → maps to WAL to database sync
  • Correlating pages with tables → maps to integrating with file format knowledge
  • Detecting incomplete transactions → maps to crash recovery

Key Concepts:

  • Write-Ahead Logging: “Database Internals” Ch. 11 - Alex Petrov
  • WAL File Format: SQLite Documentation - “Write-Ahead Logging”
  • Checkpointing: SQLite Documentation - “WAL Checkpoint”
  • Crash Recovery: “Database Internals” Ch. 12 - Alex Petrov

Difficulty: Expert | Time estimate: 2-3 weeks | Prerequisites: Project 1 (File Format Explorer), C programming, understanding of transaction concepts


Real World Outcome

A diagnostic tool that reveals what’s happening inside WAL files:

Example Output:

$ ./wal_inspector mydb.sqlite-wal

╔════════════════════════════════════════════════════════════════╗
║                    WAL File Analysis                            ║
╠════════════════════════════════════════════════════════════════╣
║ WAL File:          mydb.sqlite-wal                              ║
║ Database File:     mydb.sqlite                                  ║
║ WAL Size:          2,457,600 bytes                              ║
║ Page Size:         4096 bytes                                   ║
║ Total Frames:      600                                          ║
║ Salt Values:       0x3A7B2C1D, 0x8F4E9B2A                       ║
║ Checksum:          Valid ✓                                      ║
╚════════════════════════════════════════════════════════════════╝

FRAME ANALYSIS:
┌────────┬──────────┬────────────┬───────────────┬──────────────────┐
│ Frame  │ DB Page  │ Table/Idx  │ Transaction   │ Commit?          │
├────────┼──────────┼────────────┼───────────────┼──────────────────┤
│ 1      │ 1        │ schema     │ TX-1          │                  │
│ 2      │ 15       │ users      │ TX-1          │                  │
│ 3      │ 16       │ users      │ TX-1          │                  │
│ 4      │ 23       │ idx_email  │ TX-1          │ COMMIT (db=156)  │
├────────┼──────────┼────────────┼───────────────┼──────────────────┤
│ 5      │ 17       │ users      │ TX-2          │                  │
│ 6      │ 18       │ users      │ TX-2          │                  │
│ 7      │ 17       │ users      │ TX-2          │                  │
│ 8      │ 24       │ idx_email  │ TX-2          │ COMMIT (db=156)  │
├────────┼──────────┼────────────┼───────────────┼──────────────────┤
│ ...    │ ...      │ ...        │ ...           │                  │
│ 597    │ 45       │ orders     │ TX-150        │                  │
│ 598    │ 46       │ orders     │ TX-150        │                  │
│ 599    │ 47       │ orders     │ TX-150        │                  │
│ 600    │ 48       │ orders     │ TX-150        │ *** INCOMPLETE   │
└────────┴──────────┴────────────┴───────────────┴──────────────────┘

TRANSACTION SUMMARY:
┌────────────┬────────────┬───────────────┬───────────────────────┐
│ Transaction│ Frames     │ Pages Changed │ Status                │
├────────────┼────────────┼───────────────┼───────────────────────┤
│ TX-1       │ 1-4        │ 4             │ Committed             │
│ TX-2       │ 5-8        │ 4             │ Committed             │
│ ...        │ ...        │ ...           │ ...                   │
│ TX-149     │ 593-596    │ 4             │ Committed             │
│ TX-150     │ 597-600    │ 4             │ ⚠️ INCOMPLETE         │
└────────────┴────────────┴───────────────┴───────────────────────┘

⚠️  WARNING: Incomplete transaction detected!
Transaction TX-150 has 4 frames but no commit marker.
On next database open, frames 597-600 will be ignored (rolled back).

PAGE MODIFICATION FREQUENCY:
  Page 15 (users): Modified 47 times across 47 transactions
  Page 23 (idx_email): Modified 47 times
  Page 1 (schema): Modified 2 times

CHECKPOINT RECOMMENDATION:
  WAL has 600 frames (2.4MB). Consider running:
  PRAGMA wal_checkpoint(TRUNCATE);

The Core Question You’re Answering

“How does SQLite guarantee my data survives a crash? What happens between BEGIN and COMMIT?”

Before you write any code, sit with this question. WAL mode is SQLite’s answer to the impossible problem: how do you make changes durable before they’re finished? The answer involves writing changes twice—first to the log, then to the database—in a carefully choreographed dance.


Concepts You Must Understand First

Stop and research these before coding:

  1. Write-Ahead Logging Concept
    • Why write changes before applying them?
    • What’s the relationship between WAL and ACID?
    • How does WAL differ from rollback journal?
    • Book Reference: “Database Internals” Ch. 11 - Alex Petrov
  2. WAL File Format
    • What’s in the WAL header (32 bytes)?
    • What’s a frame and what does it contain?
    • How are checksums calculated?
    • Book Reference: SQLite Documentation - “WAL Format”
  3. Checkpointing
    • When do WAL frames get copied to the main database?
    • What are the different checkpoint modes?
    • Why might a checkpoint fail?
    • Book Reference: SQLite Documentation - “Checkpointing”
  4. Crash Recovery
    • How does SQLite know which frames are valid after crash?
    • What role do salt values play?
    • How are incomplete transactions detected?
    • Book Reference: “Database Internals” Ch. 12 - Alex Petrov

Questions to Guide Your Design

Before implementing, think through these:

  1. Parsing Strategy
    • How do you validate frame checksums?
    • How do you detect transaction boundaries?
    • How do you handle corrupt or truncated WAL files?
  2. Database Correlation
    • How do you know which table a page belongs to?
    • Do you need the main database file to provide full info?
    • How do you handle schema changes in the middle of WAL?
  3. Transaction Tracking
    • How do you identify which frames belong to which transaction?
    • What indicates a commit?
    • How do you detect incomplete transactions?
  4. Output Usefulness
    • What information is most useful for debugging?
    • How do you summarize large WAL files?
    • What recommendations can you make?

Thinking Exercise

Trace a Crash

Before coding, trace what happens in this scenario:

1. Application opens database in WAL mode
2. BEGIN; INSERT INTO users VALUES (1, 'Alice');
3. INSERT INTO users VALUES (2, 'Bob');
4. COMMIT;
5. BEGIN; INSERT INTO users VALUES (3, 'Charlie');
6. *** POWER FAILURE ***
7. System reboots, application reopens database

Questions while tracing:

  • After step 4, what’s in the WAL file?
  • After step 5, what new frames exist?
  • After step 7, which users exist in the database?
  • How does SQLite know Charlie’s insert wasn’t committed?

The Interview Questions They’ll Ask

Prepare to answer these:

  1. “Explain how write-ahead logging provides crash recovery.”
  2. “What’s the difference between WAL mode and rollback journal mode?”
  3. “When would you choose WAL mode vs the default mode?”
  4. “What happens if the WAL file grows very large?”
  5. “How does SQLite allow readers during writes in WAL mode?”
  6. “What’s a checkpoint and why is it necessary?”

Hints in Layers

Hint 1: Starting Point Read the 32-byte WAL header first: magic number (0x377f0682 or 0x377f0683), file format version, page size, checkpoint sequence, salt values.

Hint 2: Frame Parsing Each frame is: 24-byte header + page_size bytes of page data. Frame header contains: page number, db size at commit (0 if not commit frame), salt values, checksums.

Hint 3: Transaction Detection A commit frame has non-zero “db size at commit” field. All frames from the previous commit (or start) up to this frame belong to one transaction.

Hint 4: Validation Verify checksums as you read. The checksum algorithm is documented—it’s a cumulative checksum that includes the previous frame’s checksum.


Books That Will Help

Topic | Book | Chapter
----- | ---- | -------
WAL concept | “Database Internals” by Petrov | Ch. 11
Crash recovery | “Database Internals” by Petrov | Ch. 12
SQLite WAL details | SQLite Documentation | “Write-Ahead Logging”
Transaction guarantees | “Designing Data-Intensive Applications” by Kleppmann | Ch. 7

Implementation Hints

WAL Header (first 32 bytes):

Bytes 0-3:   Magic number (0x377f0682 or 0x377f0683; low bit selects checksum byte order)
Bytes 4-7:   File format version (3007000)
Bytes 8-11:  Database page size
Bytes 12-15: Checkpoint sequence number
Bytes 16-19: Salt-1
Bytes 20-23: Salt-2
Bytes 24-27: Checksum-1
Bytes 28-31: Checksum-2

Frame Header (24 bytes before each page):

Bytes 0-3:   Page number
Bytes 4-7:   Database size in pages (at commit, or 0)
Bytes 8-11:  Salt-1 (must match header)
Bytes 12-15: Salt-2 (must match header)
Bytes 16-19: Checksum-1
Bytes 20-23: Checksum-2

Key insight: If “db size at commit” is non-zero, this frame is a commit marker. All preceding frames since the last commit (or start) form one transaction.
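
A minimal frame walker based on these layouts (a Python sketch for brevity; the cumulative checksum verification from Hint 4 is omitted):

import struct
import sys

def walk_wal(path):
    # Prints one line per valid frame and flags commit frames.
    with open(path, "rb") as f:
        hdr = f.read(32)
        magic, version, page_size, ckpt_seq, salt1, salt2, ck1, ck2 = \
            struct.unpack(">8I", hdr)
        assert magic in (0x377F0682, 0x377F0683), "not a WAL file"
        frame = 0
        while True:
            fh = f.read(24)
            page = f.read(page_size)
            if len(fh) < 24 or len(page) < page_size:
                break  # truncated tail: SQLite ignores it on recovery
            page_no, db_size, s1, s2, c1, c2 = struct.unpack(">6I", fh)
            if (s1, s2) != (salt1, salt2):
                break  # stale frame left over from a previous checkpoint cycle
            frame += 1
            commit = f"  COMMIT (db={db_size} pages)" if db_size else ""
            print(f"frame {frame}: page {page_no}{commit}")

walk_wal(sys.argv[1])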


Learning Milestones

  1. You can parse WAL header → You understand WAL metadata
  2. You can read frames and validate checksums → You understand frame structure
  3. You can identify transaction boundaries → You understand commit protocol
  4. You can detect incomplete transactions → You understand crash recovery

Project 5: SQLite Full-Text Search Engine

  • File: LEARN_SQLITE_DEEP_DIVE.md
  • Main Programming Language: Python
  • Alternative Programming Languages: Go, Rust, C
  • Coolness Level: Level 3: Genuinely Clever
  • Business Potential: 2. The “Micro-SaaS / Pro Tool”
  • Difficulty: Level 2: Intermediate
  • Knowledge Area: Information Retrieval / Search
  • Software or Tool: SQLite FTS5
  • Main Book: “Introduction to Information Retrieval” by Manning, Raghavan, Schütze

What you’ll build: A document search engine using SQLite’s FTS5 extension, with features like relevance ranking, phrase search, Boolean queries, snippet highlighting, autocomplete, and faceted search—suitable for searching thousands of documents.

Why it teaches SQLite: FTS5 demonstrates SQLite’s extensibility through virtual tables. You’ll learn how inverted indexes work, understand tokenization, and see how search engines rank results—all within SQLite.

Core challenges you’ll face:

  • Understanding FTS5 table creation options → maps to tokenizer configuration
  • Writing effective FTS5 queries → maps to Boolean and phrase search
  • Implementing relevance ranking with BM25 → maps to information retrieval theory
  • Generating highlighted snippets → maps to result presentation
  • Building autocomplete → maps to prefix search and FTS5 auxiliary functions

Key Concepts:

  • Inverted Indexes: “Introduction to Information Retrieval” Ch. 1-2 - Manning et al.
  • FTS5 Syntax: SQLite Documentation - “FTS5”
  • BM25 Ranking: “Introduction to Information Retrieval” Ch. 6 - Manning et al.
  • Tokenization: SQLite Documentation - “FTS5 Tokenizers”

Difficulty: Intermediate | Time estimate: 1-2 weeks | Prerequisites: Basic Python, SQL knowledge, understanding of text search conceptually


Real World Outcome

A command-line search tool for local documents with sophisticated search features:

Example Output:

$ ./doc_search init ./my_documents/
Indexing documents...
Found 1,247 documents (15.3 MB total)
Indexed 1,247 documents in 3.2 seconds
Created search.db (4.8 MB)

$ ./doc_search query "machine learning neural networks"

Found 23 results (0.012 seconds)

1. [Score: 2.847] deep_learning_intro.md
   "...using **neural networks** for **machine learning** tasks has revolutionized..."
   Modified: 2024-01-15 | Size: 12.4 KB | Path: ./my_documents/ml/

2. [Score: 2.134] tensorflow_guide.md
   "...TensorFlow implements various **neural network** architectures for **machine**..."
   Modified: 2024-02-20 | Size: 8.7 KB | Path: ./my_documents/tutorials/

3. [Score: 1.892] ai_history.md
   "...the evolution of **machine learning** from simple perceptrons to deep **neural**..."
   Modified: 2023-11-03 | Size: 15.2 KB | Path: ./my_documents/history/

$ ./doc_search query '"exact phrase search"'

Found 2 results (0.003 seconds)

1. [Score: 3.521] search_tutorial.md
   "...to search for an **"exact phrase search"** use double quotes around..."

$ ./doc_search query "python NOT java"

Found 156 results (0.008 seconds)
[Results containing "python" but not "java"]

$ ./doc_search query "mach*"  # Prefix search

Found 89 results (0.015 seconds)
Matching: machine, machines, machining, machiavellian...

$ ./doc_search autocomplete "neural"
Suggestions:
  neural networks (45 documents)
  neural network architecture (12 documents)
  neural machine translation (8 documents)
  neuralnets (3 documents)

$ ./doc_search facets "machine learning"

Faceted results for "machine learning":

By Year:
  2024: 12 documents
  2023: 8 documents
  2022: 3 documents

By Directory:
  ./ml/: 15 documents
  ./tutorials/: 5 documents
  ./research/: 3 documents

By File Type:
  .md: 18 documents
  .txt: 4 documents
  .rst: 1 document

The Core Question You’re Answering

“How do search engines find relevant documents so fast? What makes one result more relevant than another?”

Before you write any code, sit with this question. Searching millions of documents in milliseconds requires a fundamentally different approach than scanning text. Inverted indexes flip the problem: instead of “for each document, find matching words,” it’s “for each word, list the documents that contain it.”


Concepts You Must Understand First

Stop and research these before coding:

  1. Inverted Index Structure
    • What’s an inverted index vs a forward index?
    • What information is stored for each term?
    • Why is it called “inverted”?
    • Book Reference: “Introduction to Information Retrieval” Ch. 1 - Manning et al.
  2. Tokenization and Stemming
    • What’s a token vs a word?
    • What’s stemming and why use it?
    • What tokenizers does FTS5 provide?
    • Book Reference: “Introduction to Information Retrieval” Ch. 2 - Manning et al.
  3. BM25 Ranking
    • What does BM25 measure?
    • What are term frequency and inverse document frequency?
    • Why is document length normalization important?
    • Book Reference: “Introduction to Information Retrieval” Ch. 6 - Manning et al.
  4. FTS5 Virtual Table API
    • What’s a virtual table in SQLite?
    • How does FTS5 create multiple backing tables?
    • What are auxiliary functions?
    • Book Reference: SQLite Documentation - “FTS5”

Questions to Guide Your Design

Before implementing, think through these:

  1. Document Ingestion
    • What file formats will you support?
    • How do you extract plain text from different formats?
    • How do you handle very large documents?
  2. Index Configuration
    • Which tokenizer is appropriate (unicode61, porter, trigram)?
    • Should you use content= or contentless tables?
    • What columns should be searchable vs stored?
  3. Query Interface
    • How do you support both simple and advanced queries?
    • How do you handle query syntax errors gracefully?
    • What query operators should users have?
  4. Result Presentation
    • How do you generate meaningful snippets?
    • How many results to show per page?
    • How do you highlight matched terms?

Thinking Exercise

Design the Index

Before coding, design an FTS5 schema for a document search:

-- Option A: Full content storage
CREATE VIRTUAL TABLE docs USING fts5(
    title,
    content,
    path UNINDEXED,
    modified UNINDEXED
);

-- Option B: External content
CREATE TABLE doc_content(id INTEGER PRIMARY KEY, title, content, path, modified);
CREATE VIRTUAL TABLE docs_fts USING fts5(
    title,
    content,
    content=doc_content
);

Questions while designing:

  • What’s the storage difference between these approaches?
  • Can you update documents in Option A? Option B?
  • How do you handle deletes in each case?
  • Which is better for a frequently-changing corpus?

The Interview Questions They’ll Ask

Prepare to answer these:

  1. “Explain how an inverted index enables fast text search.”
  2. “What’s TF-IDF and how does BM25 improve on it?”
  3. “How would you implement autocomplete efficiently?”
  4. “What’s the tradeoff between stemming and exact matching?”
  5. “How do you handle multi-word phrase queries?”
  6. “What happens to search quality if you don’t normalize for document length?”

Hints in Layers

Hint 1: Starting Point Create a basic FTS5 table and insert a few documents. Run simple MATCH queries and look at the results.

Hint 2: Ranking Use bm25() function: SELECT *, bm25(docs) as score FROM docs WHERE docs MATCH ? ORDER BY score. Lower scores are better (it returns negative values).

Hint 3: Snippets Use snippet() function: SELECT snippet(docs, 1, '<b>', '</b>', '...', 32) FROM docs WHERE docs MATCH ?. Arguments: table, column index, open tag, close tag, ellipsis, approx tokens.

Hint 4: Prefix Search For autocomplete, use the * suffix operator: WHERE docs MATCH 'neur*'. For term suggestions, create a vocabulary table with CREATE VIRTUAL TABLE docs_vocab USING fts5vocab(docs, 'row'), then query it: SELECT term FROM docs_vocab WHERE term LIKE 'neur%' ORDER BY doc DESC (the doc column counts documents containing the term).


Books That Will Help

Topic | Book | Chapter
----- | ---- | -------
Inverted indexes | “Introduction to IR” by Manning et al. | Ch. 1-2
Ranking algorithms | “Introduction to IR” by Manning et al. | Ch. 6
FTS5 specifics | SQLite Documentation | “FTS5”
Virtual tables | “The Definitive Guide to SQLite” by Owens | Ch. 9

Implementation Hints

FTS5 Creation with Porter Stemmer:

CREATE VIRTUAL TABLE docs USING fts5(
    title,
    content,
    path UNINDEXED,
    tokenize='porter unicode61'
);

Query Types:

  • Simple: WHERE docs MATCH 'search terms'
  • Phrase: WHERE docs MATCH '"exact phrase"'
  • Boolean: WHERE docs MATCH 'python AND (machine OR deep)'
  • NOT: WHERE docs MATCH 'python NOT java'
  • Column-specific: WHERE docs MATCH 'title:introduction'
  • Prefix: WHERE docs MATCH 'mach*'

Performance tip: Create an FTS5 index, not a regular table with LIKE queries. FTS5 is orders of magnitude faster for text search.
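
A compact sketch tying these pieces together (Python’s built-in sqlite3; assumes the bundled SQLite was compiled with FTS5, which is true for most distributions):

import sqlite3

conn = sqlite3.connect("search.db")
conn.executescript("""
    CREATE VIRTUAL TABLE IF NOT EXISTS docs
        USING fts5(title, content, path UNINDEXED, tokenize='porter unicode61');
    CREATE VIRTUAL TABLE IF NOT EXISTS docs_vocab USING fts5vocab(docs, 'row');
""")
conn.execute("INSERT INTO docs VALUES (?, ?, ?)",
             ("Intro", "neural networks for machine learning", "./ml/intro.md"))

# Ranked search with highlighted snippets; bm25() is more negative = better,
# so an ascending ORDER BY puts the best match first.
hits = conn.execute("""
    SELECT path, bm25(docs) AS score,
           snippet(docs, 1, '**', '**', '...', 12) AS excerpt
    FROM docs WHERE docs MATCH ? ORDER BY score
""", ("neural AND machine",)).fetchall()

# Autocomplete: vocabulary terms sharing the prefix, by document count.
suggestions = conn.execute("""
    SELECT term, doc FROM docs_vocab
    WHERE term LIKE ? ORDER BY doc DESC LIMIT 5
""", ("neur%",)).fetchall()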


Learning Milestones

  1. Basic search working → You understand FTS5 MATCH syntax
  2. Results ranked by relevance → You understand BM25
  3. Snippets with highlighting → You understand auxiliary functions
  4. Autocomplete working → You understand prefix search and vocab tables

Project 6: Personal Analytics Database

  • File: LEARN_SQLITE_DEEP_DIVE.md
  • Main Programming Language: Python
  • Alternative Programming Languages: Go, TypeScript, Rust
  • Coolness Level: Level 3: Genuinely Clever
  • Business Potential: 2. The “Micro-SaaS / Pro Tool”
  • Difficulty: Level 2: Intermediate
  • Knowledge Area: Data Modeling / Time Series / Analytics
  • Software or Tool: SQLite
  • Main Book: “Designing Data-Intensive Applications” by Martin Kleppmann

What you’ll build: A personal data warehouse that collects data from multiple sources (browser history, git commits, shell history, calendar, fitness trackers) and provides SQL-queryable analytics with visualizations—like a self-hosted, privacy-respecting analytics platform.

Why it teaches SQLite: This project teaches you schema design for analytics workloads, time-series queries, window functions, CTEs, and aggregation. You’ll learn the difference between OLTP and OLAP patterns and how SQLite handles analytical queries.

Core challenges you’ll face:

  • Designing star/snowflake schema for analytics → maps to dimensional modeling
  • Efficient time-series queries → maps to date indexing and partitioning
  • Using window functions for trends → maps to advanced SQL features
  • Incremental data loading → maps to ETL patterns
  • Query performance at scale → maps to OLAP optimization

Key Concepts:

  • Dimensional Modeling: “The Data Warehouse Toolkit” by Ralph Kimball - Ch. 1-3
  • Window Functions: “SQL Performance Explained” - Window Functions section - Winand
  • CTEs: SQLite Documentation - “Common Table Expressions”
  • Time-Series Queries: SQLite Documentation - “Date and Time Functions”

Difficulty: Intermediate | Time estimate: 2-3 weeks | Prerequisites: Python, SQL with JOINs and GROUP BY, basic understanding of data analysis


Real World Outcome

A CLI tool that imports your digital life and lets you query it:

Example Output:

$ ./personal-analytics import --source=browser-history
Importing Chrome history...
Found 15,247 new visits since last import
Imported to analytics.db (12.3 MB total)

$ ./personal-analytics import --source=git --repos=~/projects/
Scanning git repositories...
Found 847 new commits across 12 repositories
Imported commit data

$ ./personal-analytics query "
  SELECT
    strftime('%Y-%m', visit_time) as month,
    domain,
    COUNT(*) as visits
  FROM browser_history
  WHERE domain NOT IN ('google.com', 'localhost')
  GROUP BY month, domain
  HAVING visits > 50
  ORDER BY month DESC, visits DESC
  LIMIT 20
"

┌─────────┬─────────────────────┬────────┐
│ month   │ domain              │ visits │
├─────────┼─────────────────────┼────────┤
│ 2024-12 │ github.com          │ 342    │
│ 2024-12 │ stackoverflow.com   │ 187    │
│ 2024-12 │ docs.python.org     │ 156    │
│ 2024-11 │ github.com          │ 289    │
│ ...     │ ...                 │ ...    │
└─────────┴─────────────────────┴────────┘

$ ./personal-analytics query "
  WITH daily_commits AS (
    SELECT
      date(commit_time) as day,
      COUNT(*) as commits,
      SUM(lines_added) as added,
      SUM(lines_deleted) as deleted
    FROM git_commits
    WHERE commit_time > date('now', '-30 days')
    GROUP BY day
  )
  SELECT
    day,
    commits,
    added - deleted as net_lines,
    AVG(commits) OVER (ORDER BY day ROWS BETWEEN 6 PRECEDING AND CURRENT ROW) as week_avg
  FROM daily_commits
  ORDER BY day DESC
"

┌────────────┬─────────┬───────────┬──────────┐
│ day        │ commits │ net_lines │ week_avg │
├────────────┼─────────┼───────────┼──────────┤
│ 2024-12-22 │ 5       │ +247      │ 4.28     │
│ 2024-12-21 │ 3       │ -52       │ 4.14     │
│ 2024-12-20 │ 7       │ +892      │ 3.85     │
│ ...        │ ...     │ ...       │ ...      │
└────────────┴─────────┴───────────┴──────────┘

$ ./personal-analytics dashboard

╔════════════════════════════════════════════════════════════════╗
║                   This Week's Activity                          ║
╠════════════════════════════════════════════════════════════════╣
║ 📊 Productivity Score: 78/100 (↑12% from last week)            ║
║                                                                 ║
║ 🌐 Web Activity:                                                ║
║    Most visited: github.com (89), stackoverflow.com (47)       ║
║    Time on docs: 2h 34m (↑ from 1h 52m)                        ║
║                                                                 ║
║ 💻 Coding:                                                      ║
║    Commits: 23 (↓ from 31)                                     ║
║    Net lines: +1,247                                           ║
║    Most active repo: myproject (12 commits)                    ║
║                                                                 ║
║ 📅 Meetings: 8 (3h 20m total)                                  ║
╚════════════════════════════════════════════════════════════════╝

The Core Question You’re Answering

“How do you design a database for analytical queries? Why is it different from operational databases?”

Before you write any code, sit with this question. Operational databases (OLTP) optimize for many small transactions. Analytical databases (OLAP) optimize for few large queries across lots of data. The schema designs are fundamentally different—and SQLite can do both.


Concepts You Must Understand First

Stop and research these before coding:

  1. Star Schema vs Normalized Schema
    • What’s a fact table vs dimension table?
    • Why denormalize for analytics?
    • What’s the tradeoff?
    • Book Reference: “The Data Warehouse Toolkit” Ch. 1-2 - Kimball
  2. Window Functions
    • What’s the difference between aggregate and window functions?
    • What do PARTITION BY and ORDER BY do in a window?
    • What are frame specifications (ROWS BETWEEN)?
    • Book Reference: SQLite Documentation - “Window Functions”
  3. Common Table Expressions (CTEs)
    • What’s a CTE and when to use it?
    • What’s the difference from a subquery?
    • Are CTEs materialized in SQLite?
    • Book Reference: SQLite Documentation - “WITH clause”
  4. Time-Series Indexing
    • How do you efficiently query date ranges?
    • What’s the best index for time-series data?
    • How do you handle time zones?
    • Book Reference: “SQL Performance Explained” Ch. 4 - Winand

Questions to Guide Your Design

Before implementing, think through these:

  1. Schema Design
    • One big table or star schema with dimensions?
    • How do you model different data sources?
    • What’s common across sources (time, category)?
  2. Data Import
    • How do you avoid duplicate imports?
    • How do you handle incremental updates?
    • What metadata do you need to track?
  3. Query Patterns
    • What questions will users ask most often?
    • How do you make trend queries fast?
    • What aggregations need pre-computing?
  4. Performance
    • What indexes do time-range queries need?
    • When should you materialize views?
    • How big can the database get?

Thinking Exercise

Design the Schema

Before coding, design tables for personal analytics:

-- Fact table (the events)
CREATE TABLE events (
    id INTEGER PRIMARY KEY,
    event_type TEXT,      -- 'browser', 'git', 'calendar'
    event_time INTEGER,   -- Unix timestamp
    source_id TEXT,       -- External ID for dedup
    ...
);

-- Dimension tables
CREATE TABLE dim_date (...);
CREATE TABLE dim_source (...);

Questions while designing:

  • Should browser visits and git commits be in the same table?
  • What makes a good primary key for deduplication?
  • How would you query “most productive day of week”?
  • What indexes support that query?

The Interview Questions They’ll Ask

Prepare to answer these:

  1. “Explain the difference between OLTP and OLAP workloads.”
  2. “What’s a star schema and when would you use it?”
  3. “How do window functions differ from GROUP BY aggregations?”
  4. “How would you design a schema for time-series analytics?”
  5. “What indexes would you create for date-range queries?”
  6. “How do you handle incremental data loading?”

Hints in Layers

Hint 1: Starting Point Start with one data source (browser history is easiest). Create a simple table, import data, write a few queries.

Hint 2: Time Indexing Index an expression such as date(event_time, 'unixepoch') (the modifier is needed because event_time is a Unix timestamp), or store the date in its own indexed column. Most analytical queries filter by date range first, then aggregate.

Hint 3: Window Functions For rolling averages: AVG(value) OVER (ORDER BY date ROWS BETWEEN 6 PRECEDING AND CURRENT ROW). This gives you a 7-day moving average.

Hint 4: Deduplication Use INSERT OR IGNORE with a unique constraint on source identifiers. Store a last_imported timestamp to know where to resume.


Books That Will Help

Topic | Book | Chapter
----- | ---- | -------
Dimensional modeling | “The Data Warehouse Toolkit” by Kimball | Ch. 1-3
Window functions | SQLite Documentation | “Window Functions”
Analytics patterns | “Designing Data-Intensive Applications” by Kleppmann | Ch. 3
Query optimization | “SQL Performance Explained” by Winand | Ch. 4-5

Implementation Hints

Browser History Import (Chrome):

Chrome stores history in ~/Library/Application Support/Google/Chrome/Default/History
It's a SQLite database! Copy it (Chrome locks it) and query:
SELECT url, title, visit_count, datetime(last_visit_time/1000000-11644473600, 'unixepoch')
FROM urls;
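
A sketch of that import with idempotent re-runs (Python; the target table layout and the use of Chrome’s urls.id as the dedup key are this sketch’s own choices):

import os
import shutil
import sqlite3
import tempfile

def import_chrome_history(profile_dir, analytics_db="analytics.db"):
    # Chrome keeps its History database locked, so query a copy.
    copy_path = os.path.join(tempfile.mkdtemp(), "History")
    shutil.copy2(os.path.join(profile_dir, "History"), copy_path)

    out = sqlite3.connect(analytics_db)
    out.execute("""
        CREATE TABLE IF NOT EXISTS browser_history (
            source_id INTEGER PRIMARY KEY,  -- Chrome's urls.id, for dedup
            url TEXT, title TEXT, visit_count INTEGER, last_visit TEXT
        )
    """)
    rows = sqlite3.connect(copy_path).execute("""
        SELECT id, url, title, visit_count,
               datetime(last_visit_time/1000000 - 11644473600, 'unixepoch')
        FROM urls
    """)
    # INSERT OR IGNORE plus the PRIMARY KEY makes repeated imports harmless.
    out.executemany(
        "INSERT OR IGNORE INTO browser_history VALUES (?, ?, ?, ?, ?)", rows)
    out.commit()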

Git Commit Import:

Use: git log --format='%H|%ai|%an|%s' --numstat
Parse the output to extract: hash, datetime, author, message, files changed, insertions, deletions

Useful Analytical Queries:

  • Day of week pattern: strftime('%w', event_time, 'unixepoch') gives 0-6 (0 = Sunday)
  • Hour of day: strftime('%H', event_time, 'unixepoch')
  • Week number: strftime('%W', event_time, 'unixepoch')

(The 'unixepoch' modifier is needed because event_time is stored as a Unix timestamp.)

Learning Milestones

  1. Single source imported and queryable → You understand ETL basics
  2. Multiple sources with common schema → You understand dimensional modeling
  3. Window functions for trends → You understand analytical SQL
  4. Dashboard with insights → You understand data storytelling

Project 7: SQLite Replication System

  • File: LEARN_SQLITE_DEEP_DIVE.md
  • Main Programming Language: Go
  • Alternative Programming Languages: Rust, Python, C
  • Coolness Level: Level 4: Hardcore Tech Flex
  • Business Potential: 4. The “Open Core” Infrastructure
  • Difficulty: Level 4: Expert
  • Knowledge Area: Distributed Systems / Replication
  • Software or Tool: SQLite
  • Main Book: “Designing Data-Intensive Applications” by Martin Kleppmann

What you’ll build: A system that replicates SQLite database changes from a primary to one or more replicas, using the WAL or a custom change-tracking mechanism—enabling read scaling, backups, or edge synchronization.

Why it teaches SQLite: Replication forces you to deeply understand SQLite’s transaction model, WAL mechanism, and file format. You’ll also learn about distributed systems challenges like conflict resolution, eventual consistency, and network partitions.

Core challenges you’ll face:

  • Capturing changes efficiently → maps to WAL parsing or triggers
  • Transmitting changes over network → maps to serialization and protocols
  • Applying changes atomically → maps to transaction replay
  • Handling conflicts → maps to distributed systems theory
  • Ensuring consistency → maps to replication guarantees

Key Concepts:

  • Replication Types: “Designing Data-Intensive Applications” Ch. 5 - Kleppmann
  • Change Data Capture: SQLite Documentation - “Session Extension”
  • WAL Mechanics: SQLite Documentation - “Write-Ahead Logging”
  • Conflict Resolution: “Designing Data-Intensive Applications” Ch. 5 - Kleppmann

Difficulty: Expert | Time estimate: 3-4 weeks | Prerequisites: Project 4 (WAL Inspector), network programming, understanding of distributed systems concepts


Real World Outcome

A replication daemon that keeps SQLite databases in sync:

Example Output:

# On primary server
$ sqlite-replicate primary --db=app.db --port=5433

SQLite Replication Server
  Database: app.db (156 pages, WAL mode)
  Listening: 0.0.0.0:5433
  Replication: streaming

[2024-12-22 14:30:01] Replica connected: 192.168.1.10
[2024-12-22 14:30:01] Sending initial snapshot (156 pages)...
[2024-12-22 14:30:03] Snapshot complete, switching to WAL streaming
[2024-12-22 14:30:15] TX-4521: 3 frames replicated (users table)
[2024-12-22 14:30:22] TX-4522: 1 frame replicated (orders table)

# On replica server
$ sqlite-replicate replica --primary=192.168.1.5:5433 --db=app_replica.db

SQLite Replication Client
  Primary: 192.168.1.5:5433
  Local: app_replica.db

[2024-12-22 14:30:01] Connecting to primary...
[2024-12-22 14:30:01] Receiving initial snapshot...
[2024-12-22 14:30:03] Snapshot received (156 pages)
[2024-12-22 14:30:03] Streaming WAL changes...
[2024-12-22 14:30:15] Applied TX-4521: 3 page changes
[2024-12-22 14:30:22] Applied TX-4522: 1 page change

Replication Status:
  Lag: 0 transactions (0.0 seconds)
  Applied: 4,522 transactions
  Mode: synchronous streaming

# Query the replica
$ sqlite3 app_replica.db "SELECT COUNT(*) FROM users"
15247

# Check replication status
$ sqlite-replicate status --db=app_replica.db

╔════════════════════════════════════════════════════════════════╗
║                    Replication Status                           ║
╠════════════════════════════════════════════════════════════════╣
║ Primary:           192.168.1.5:5433                             ║
║ Connected:         Yes (uptime: 2h 34m)                         ║
║ Last TX:           TX-4522 at 2024-12-22 14:30:22              ║
║ Replication Lag:   0 transactions                               ║
║ Bytes Received:    12.4 MB                                      ║
║ Local DB Size:     638 KB                                       ║
╚════════════════════════════════════════════════════════════════╝

The Core Question You’re Answering

“How do you keep two databases in sync? What guarantees can you provide?”

Before you write any code, sit with this question. Replication seems simple—just copy changes—but the devil is in the details. Network failures, concurrent writes, ordering guarantees—these challenges are why distributed databases are hard.


Concepts You Must Understand First

Stop and research these before coding:

  1. Replication Architectures
    • What’s leader-based vs multi-leader vs leaderless?
    • What’s synchronous vs asynchronous replication?
    • What guarantees does each provide?
    • Book Reference: “Designing Data-Intensive Applications” Ch. 5 - Kleppmann
  2. Change Data Capture
    • How do you detect what changed?
    • WAL-based vs trigger-based vs polling?
    • What’s the SQLite Session extension?
    • Book Reference: SQLite Documentation - “Session Extension”
  3. Consistency Models
    • What’s eventual consistency vs strong consistency?
    • What’s read-after-write consistency?
    • What can go wrong with replication lag?
    • Book Reference: “Designing Data-Intensive Applications” Ch. 5 - Kleppmann
  4. Conflict Resolution
    • What happens if both copies change?
    • Last-writer-wins vs custom resolution?
    • How do CRDTs work?
    • Book Reference: “Designing Data-Intensive Applications” Ch. 5 - Kleppmann

Questions to Guide Your Design

Before implementing, think through these:

  1. Change Capture
    • Use WAL streaming or triggers?
    • How do you handle the initial snapshot?
    • How do you identify transaction boundaries?
  2. Network Protocol
    • What format for transmitting changes?
    • How do you handle network failures?
    • How does the replica request missed changes?
  3. Consistency
    • Is this single-leader or multi-leader?
    • What happens if primary fails?
    • How do you prevent split-brain?
  4. Performance
    • How much lag is acceptable?
    • How do you batch changes for efficiency?
    • What’s the impact on primary performance?

Thinking Exercise

Design the Protocol

Before coding, design the replication protocol:

Initial Sync:
1. Replica connects to primary
2. Primary sends ??? (snapshot? Or current WAL position?)
3. Replica applies ???
4. Switch to streaming

Streaming:
1. Primary writes transaction
2. Primary sends ??? to replica
3. Replica applies ???
4. Replica acknowledges ???

Questions while designing:

  • Should you send raw WAL frames or parsed changes?
  • How does the replica know it’s caught up?
  • What if the replica disconnects and reconnects?
  • How do you handle a replica that’s too far behind?

The Interview Questions They’ll Ask

Prepare to answer these:

  1. “Explain the difference between synchronous and asynchronous replication.”
  2. “What happens if the primary fails in a leader-based replication system?”
  3. “How would you handle conflicts in a multi-leader setup?”
  4. “What’s replication lag and why does it matter?”
  5. “How do you ensure a replica has all committed transactions?”
  6. “What’s the CAP theorem and how does it apply here?”

Hints in Layers

Hint 1: Starting Point Start with the simplest case: single leader, asynchronous replication. Use the Session extension to capture changes—it’s designed for exactly this.

Hint 2: Initial Snapshot For initial sync, copy the entire database file while holding a read lock. Then send the WAL from that point forward.

Hint 3: WAL Streaming Monitor the WAL file for new frames. When a commit frame appears (non-zero db size), you have a complete transaction to send.

Hint 4: Failover Track the last applied transaction ID on the replica. On reconnect, request changes from that point. The primary needs to keep enough WAL history.


Books That Will Help

Topic | Book | Chapter
----- | ---- | -------
Replication fundamentals | “Designing Data-Intensive Applications” by Kleppmann | Ch. 5
Consensus protocols | “Designing Data-Intensive Applications” by Kleppmann | Ch. 9
SQLite sessions | SQLite Documentation | “Session Extension”
WAL mechanics | SQLite Documentation | “Write-Ahead Logging”

Implementation Hints

Using SQLite Session Extension:

The Session extension tracks changes to a database and can generate changesets.
Enable with: sqlite3_session_create()
Attach tables: sqlite3_session_attach()
Get changeset: sqlite3_session_changeset()
Apply to replica: sqlite3_changeset_apply()

This is much easier than parsing WAL yourself!

WAL-based approach (harder but more general):

1. Parse WAL frames (you did this in Project 4)
2. Group frames by transaction (commit frame marks end)
3. Send frame data over network
4. On replica: write the frames into the replica's own WAL, or replay the
   changes as SQL if you can map pages to table changes. Stock SQLite exposes
   no public WAL-replication hook; some forks patch one in.

Key insight: SQLite’s Session extension is the recommended way to build replication. It handles all the complexity of change tracking and conflict detection.
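
Python’s built-in sqlite3 module does not expose the Session extension, so here is a sketch of the trigger-based alternative mentioned above (the changelog table and trigger names are this sketch’s own; the UPDATE trigger is omitted for brevity):

import sqlite3

conn = sqlite3.connect("app.db")
conn.executescript("""
    CREATE TABLE IF NOT EXISTS users (id INTEGER PRIMARY KEY, name TEXT);
    -- Ordered change log: replicas pull rows with id > their last applied id.
    CREATE TABLE IF NOT EXISTS changelog (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        stmt TEXT NOT NULL
    );
    CREATE TRIGGER IF NOT EXISTS users_ins AFTER INSERT ON users BEGIN
        INSERT INTO changelog(stmt) VALUES (
            'INSERT OR REPLACE INTO users(id, name) VALUES (' ||
            NEW.id || ', ' || quote(NEW.name) || ')');
    END;
    CREATE TRIGGER IF NOT EXISTS users_del AFTER DELETE ON users BEGIN
        INSERT INTO changelog(stmt) VALUES (
            'DELETE FROM users WHERE id = ' || OLD.id);
    END;
""")

def pull_changes(since_id):
    # What a replica would fetch over the network, in commit order.
    return conn.execute(
        "SELECT id, stmt FROM changelog WHERE id > ? ORDER BY id",
        (since_id,)).fetchall()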


Learning Milestones

  1. Initial snapshot working → You understand database serialization
  2. Change streaming working → You understand change capture
  3. Reconnection handles gaps → You understand replication protocol
  4. Can failover to replica → You understand high availability

Project 8: Build a Mini ORM

  • File: LEARN_SQLITE_DEEP_DIVE.md
  • Main Programming Language: Python
  • Alternative Programming Languages: TypeScript, Go, Rust
  • Coolness Level: Level 3: Genuinely Clever
  • Business Potential: 2. The “Micro-SaaS / Pro Tool”
  • Difficulty: Level 3: Advanced
  • Knowledge Area: Database Abstraction / Metaprogramming
  • Software or Tool: SQLite
  • Main Book: “Patterns of Enterprise Application Architecture” by Martin Fowler

What you’ll build: A lightweight Object-Relational Mapper that maps Python classes to SQLite tables, supporting CRUD operations, relationships (one-to-many, many-to-many), lazy loading, query building, and migrations—like a simplified SQLAlchemy.

Why it teaches SQLite: Building an ORM forces you to understand the impedance mismatch between objects and relations, SQL generation, prepared statements, connection management, and transaction scoping. You’ll see exactly what ORMs do under the hood.

Core challenges you’ll face:

  • Mapping classes to tables → maps to metaprogramming and reflection
  • Generating efficient SQL → maps to query building and optimization
  • Handling relationships → maps to JOINs and foreign keys
  • Implementing lazy loading → maps to query execution timing
  • Managing database migrations → maps to schema evolution

Key Concepts:

  • Active Record Pattern: “Patterns of Enterprise Application Architecture” Ch. 10 - Fowler
  • Data Mapper Pattern: “Patterns of Enterprise Application Architecture” Ch. 10 - Fowler
  • Identity Map: “Patterns of Enterprise Application Architecture” Ch. 11 - Fowler
  • Unit of Work: “Patterns of Enterprise Application Architecture” Ch. 11 - Fowler

Difficulty: Advanced | Time estimate: 2-3 weeks | Prerequisites: Python (decorators, metaclasses), SQL knowledge, understanding of OOP patterns


Real World Outcome

A Python library that maps objects to SQLite:

Example Output:

from mini_orm import Model, Field, ForeignKey, relationship, create_tables

class User(Model):
    __tablename__ = 'users'

    id = Field(int, primary_key=True)
    name = Field(str, max_length=100)
    email = Field(str, unique=True)
    created_at = Field(datetime, default=datetime.now)

    posts = relationship('Post', back_ref='author')

class Post(Model):
    __tablename__ = 'posts'

    id = Field(int, primary_key=True)
    title = Field(str, max_length=200)
    content = Field(str)
    author_id = ForeignKey(User)

    tags = relationship('Tag', through='post_tags')

class Tag(Model):
    __tablename__ = 'tags'

    id = Field(int, primary_key=True)
    name = Field(str, unique=True)

# Initialize database
db = Database('app.db')
create_tables(db, [User, Post, Tag])

# Create records
user = User(name='Alice', email='alice@example.com')
db.save(user)  # INSERT INTO users (name, email, created_at) VALUES (?, ?, ?)

post = Post(title='Hello World', content='My first post', author_id=user.id)
db.save(post)

# Query with fluent interface
users = db.query(User)\
    .filter(User.name.like('A%'))\
    .order_by(User.created_at.desc())\
    .limit(10)\
    .all()

# Generated SQL:
# SELECT * FROM users WHERE name LIKE ? ORDER BY created_at DESC LIMIT 10

# Relationship access (lazy loading)
alice = db.query(User).filter(User.email == 'alice@example.com').first()
print(alice.posts)  # Triggers: SELECT * FROM posts WHERE author_id = ?

# Eager loading to avoid N+1
users_with_posts = db.query(User)\
    .join(User.posts)\
    .all()
# Generated SQL:
# SELECT users.*, posts.* FROM users
# LEFT JOIN posts ON posts.author_id = users.id

# Update
alice.name = 'Alice Smith'
db.save(alice)  # UPDATE users SET name = ? WHERE id = ?

# Delete
db.delete(alice)  # DELETE FROM users WHERE id = ?

# Transactions
with db.transaction():
    user = User(name='Bob', email='bob@example.com')
    db.save(user)
    post = Post(title='Test', content='Test post', author_id=user.id)
    db.save(post)
    # Commits on exit, rolls back on exception

# Migrations
from mini_orm import Migration

class AddUserAge(Migration):
    def up(self):
        self.add_column('users', 'age', int, nullable=True)

    def down(self):
        self.drop_column('users', 'age')

db.migrate()  # Applies pending migrations

The Core Question You’re Answering

“What does an ORM actually DO? How does user.posts turn into a SQL query?”

Before you write any code, sit with this question. ORMs are often treated as magic—you define classes, and somehow SQL happens. Building one reveals that it’s careful bookkeeping: tracking which objects are new, changed, or deleted; generating appropriate SQL; managing relationships.


Concepts You Must Understand First

Stop and research these before coding:

  1. Active Record vs Data Mapper
    • What’s the difference between these patterns?
    • Which does SQLAlchemy use? Django ORM?
    • What are the tradeoffs?
    • Book Reference: “Patterns of Enterprise Application Architecture” Ch. 10 - Fowler
  2. Python Metaclasses
    • What’s a metaclass and when to use it?
    • How can you inspect class attributes at definition time?
    • How do descriptors work?
    • Book Reference: “Fluent Python” Ch. 21 - Ramalho
  3. Identity Map Pattern
    • Why not just load objects every time?
    • How do you ensure one object per database row?
    • What’s a session/unit of work?
    • Book Reference: “Patterns of Enterprise Application Architecture” Ch. 11 - Fowler
  4. N+1 Query Problem
    • What is N+1 and why is it bad?
    • How does eager loading solve it?
    • What SQL does a JOIN produce vs separate queries?
    • Book Reference: Any ORM documentation

Questions to Guide Your Design

Before implementing, think through these:

  1. Model Definition
    • How do you define fields and their types?
    • How do you capture table name, primary key?
    • Metaclass, decorators, or explicit registration?
  2. Query Building
    • How do you build SQL from method chains?
    • How do you handle different operators (=, LIKE, IN)?
    • How do you prevent SQL injection?
  3. Object Tracking
    • How do you know if an object is new vs existing?
    • How do you track which fields changed?
    • When do you actually write to the database?
  4. Relationships
    • How do you represent foreign keys?
    • How does lazy loading work?
    • How do you handle many-to-many?

Thinking Exercise

Trace Object Lifecycle

Before coding, trace what happens:

user = User(name='Alice')   # 1. What state is this object in?
db.save(user)               # 2. What SQL is generated?
user.name = 'Alice Smith'   # 3. How do we know this changed?
db.save(user)               # 4. INSERT or UPDATE?
db.delete(user)             # 5. What state now?

Questions while tracing:

  • How does save() know if it’s INSERT vs UPDATE?
  • How do you track “dirty” objects?
  • What if you call save() twice without changes?
  • Should delete() remove from database immediately?

The Interview Questions They’ll Ask

Prepare to answer these:

  1. “Explain the N+1 query problem and how to solve it.”
  2. “What’s the difference between Active Record and Data Mapper patterns?”
  3. “How would you implement lazy loading for relationships?”
  4. “How do you prevent SQL injection in an ORM?”
  5. “What’s an identity map and why is it useful?”
  6. “How would you handle database migrations?”

Hints in Layers

Hint 1: Starting Point Start with just one model class, save(), and simple queries. No relationships yet. Get INSERT, UPDATE, DELETE, SELECT working.

Hint 2: Field Definition Use Python descriptors for fields. When you access user.name, the descriptor can track the access. A metaclass can collect all Field instances at class definition time.

Hint 3: Query Builder Build queries as objects that accumulate conditions. query.filter(User.name == 'Alice') returns a new Query object with the condition stored. all() generates and executes the SQL.

Hint 4: Relationships For lazy loading, make relationship access return a query that executes on iteration. Store the foreign key value, construct the query on access.


Books That Will Help

Topic | Book | Chapter
----- | ---- | -------
ORM patterns | “Patterns of Enterprise Application Architecture” by Fowler | Ch. 10-11
Python metaprogramming | “Fluent Python” by Ramalho | Ch. 21-23
Query building | Study the SQLAlchemy source | Core expression language
Schema migrations | Alembic documentation | Concepts

Implementation Hints

Field Descriptor Pattern:

class Field:
    def __set_name__(self, owner, name):
        self.name = name  # Captures the attribute name

    def __get__(self, obj, objtype=None):
        if obj is None:
            return self  # Class-level access returns the Field itself
        return obj.__dict__.get(self.name)

    def __set__(self, obj, value):
        obj.__dict__[self.name] = value
        obj._dirty.add(self.name)  # Track changes (Model.__init__ must create _dirty)

Query Builder Pattern:

class Query:
    def __init__(self, model):
        self.model = model
        self.conditions = []
        self.order = []
        self._limit = None

    def filter(self, condition):
        new = Query(self.model)
        new.conditions = self.conditions + [condition]
        new.order, new._limit = self.order, self._limit
        return new  # Return a new query; never mutate the old one

    def to_sql(self):
        sql = f"SELECT * FROM {self.model.__tablename__}"
        if self.conditions:
            sql += " WHERE " + " AND ".join(c.to_sql() for c in self.conditions)
        if self.order:
            sql += " ORDER BY " + ", ".join(self.order)
        if self._limit is not None:
            sql += f" LIMIT {self._limit}"
        return sql
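
The filter(User.name == 'Alice') syntax implies condition objects built by operator overloading. A sketch of that missing piece (Condition and Column are hypothetical names for this sketch; a real ORM would also collect bind parameters alongside the SQL):

class Condition:
    def __init__(self, column, op, value):
        self.column, self.op, self.value = column, op, value

    def to_sql(self):
        # Always emit a placeholder; the value is bound at execute() time,
        # which is what prevents SQL injection.
        return f"{self.column} {self.op} ?"

class Column:
    """Stands in for class-level field access such as User.name."""
    def __init__(self, name):
        self.name = name

    def __eq__(self, other):            # User.name == 'Alice'
        return Condition(self.name, "=", other)

    def like(self, pattern):            # User.name.like('A%')
        return Condition(self.name, "LIKE", pattern)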

Learning Milestones

  1. Basic CRUD working → You understand SQL generation
  2. Query builder fluent interface → You understand method chaining
  3. Foreign key relationships → You understand JOINs
  4. Lazy loading working → You understand relationship access patterns

Project 9: Database Corruption Detector & Repair Tool

  • File: LEARN_SQLITE_DEEP_DIVE.md
  • Main Programming Language: C
  • Alternative Programming Languages: Rust, Go
  • Coolness Level: Level 4: Hardcore Tech Flex
  • Business Potential: 3. The “Service & Support” Model
  • Difficulty: Level 4: Expert
  • Knowledge Area: Data Recovery / Database Internals
  • Software or Tool: SQLite file format
  • Main Book: “Database Internals” by Alex Petrov

What you’ll build: A tool that deeply scans SQLite database files to detect corruption (invalid pages, broken B-trees, orphaned records, checksum failures), reports issues, and attempts repairs where possible.

Why it teaches SQLite: This project requires understanding every aspect of the SQLite file format—headers, pages, cells, B-trees, freelists—to validate them. You’ll also learn about failure modes and how databases can become corrupted.

Core challenges you’ll face:

  • Validating page structure → maps to understanding page format
  • Checking B-tree integrity → maps to tree traversal and invariants
  • Detecting orphaned pages → maps to freelist and page allocation
  • Verifying checksum integrity → maps to data verification
  • Implementing safe repairs → maps to data recovery strategies

Key Concepts:

  • SQLite Integrity Check: SQLite Documentation - “PRAGMA integrity_check”
  • B-Tree Invariants: “Introduction to Algorithms” Ch. 18 - CLRS
  • Database File Format: SQLite Documentation - “Database File Format”
  • Data Recovery: “Database Internals” Ch. 13 - Petrov

Difficulty: Expert | Time estimate: 3-4 weeks | Prerequisites: Project 1 (File Format Explorer), C programming, deep understanding of B-trees


Real World Outcome

A diagnostic and repair tool for corrupted SQLite databases:

Example Output:

$ ./sqlite-doctor diagnose corrupted.db

╔════════════════════════════════════════════════════════════════╗
║                    SQLite Database Diagnosis                    ║
╠════════════════════════════════════════════════════════════════╣
║ Database: corrupted.db                                          ║
║ Size: 1,245,184 bytes (304 pages)                               ║
║ Page Size: 4096 bytes                                           ║
║ Schema Cookie: 42                                               ║
╚════════════════════════════════════════════════════════════════╝

Scanning database...

Phase 1: Header Validation
  ✓ Magic number valid
  ✓ Page size valid (4096)
  ✓ File format versions valid
  ✓ Header checksum valid

Phase 2: Page Scan
  Scanning 304 pages...
  ✓ Page 1 (schema): Valid leaf B-tree page
  ✓ Page 2 (users): Valid interior B-tree page
  ✗ Page 15: CORRUPTION DETECTED
    - Cell pointer at offset 342 points outside page boundary
    - Expected: 0-4096, Got: 8721
  ✓ Pages 16-45: Valid
  ✗ Page 46: CORRUPTION DETECTED
    - Invalid page type: 0x47 (expected 0x02, 0x05, 0x0A, or 0x0D)
  ✓ Pages 47-304: Valid

Phase 3: B-Tree Integrity
  Checking table 'users' (root: page 2)...
  ✗ B-tree broken at page 15
    - Parent points to page 15, but page 15 has invalid structure
    - Subtree potentially lost: ~47 rows estimated

  Checking table 'orders' (root: page 50)...
  ✓ B-tree valid (892 rows verified)

  Checking index 'idx_users_email' (root: page 100)...
  ✗ Index entries don't match table
    - Index has 1,247 entries
    - Table scan found 1,200 rows
    - 47 orphaned index entries

Phase 4: Freelist Analysis
  ✓ Freelist trunk page valid
  ✗ Page 46 in freelist but also referenced by table 'users'
    - Double allocation detected!

═══════════════════════════════════════════════════════════════════

DIAGNOSIS SUMMARY:

CRITICAL ISSUES (2):
  1. Page 15 corrupted - affects 'users' table
  2. Page 46 corrupted and double-allocated

WARNINGS (1):
  1. Index 'idx_users_email' out of sync with table

ESTIMATED DATA LOSS:
  - users table: ~47 rows (3.8% of table)
  - orders table: None
  - Total: ~47 rows

RECOMMENDED ACTIONS:
  1. Run: sqlite-doctor repair corrupted.db --output=recovered.db
  2. After repair, rebuild indexes: REINDEX;
  3. Verify data integrity with application logic

$ ./sqlite-doctor repair corrupted.db --output=recovered.db

Starting repair...

Step 1: Copy good pages
  Copied 302/304 pages (skipped 2 corrupted)

Step 2: Rebuild corrupted B-trees
  Scanning corrupted pages for recoverable cells...
  Found 42 valid cells in page 15 (5 unrecoverable)
  Page 46 has no valid cell structure (zeroed)

Step 3: Reconstruct table 'users'
  Original rows: ~1,247
  Recovered rows: 1,200 (96.2%)
  Lost rows: 47

Step 4: Rebuild indexes
  Rebuilding idx_users_email...
  Created with 1,200 entries

Step 5: Verify repaired database
  Running integrity check...
  PRAGMA integrity_check: ok

═══════════════════════════════════════════════════════════════════

REPAIR COMPLETE

Recovered database: recovered.db
Original size: 1,245,184 bytes
Recovered size: 1,228,800 bytes (16,384 bytes freed)

Data recovery statistics:
  - Rows recovered: 2,092 of 2,139 (97.8%)
  - Rows lost: 47
  - Tables fully recovered: 1 of 2
  - Indexes rebuilt: 1

⚠️  WARNING: 47 rows could not be recovered.
    Check application logs or backups for missing data.

The Core Question You’re Answering

“How do you know if a database is corrupted? What can be recovered?”

Before you write any code, sit with this question. Corruption can mean many things: bit flips, truncated files, partial writes, software bugs. Each type leaves different signatures. Understanding what “valid” looks like lets you detect what’s “invalid.”


Concepts You Must Understand First

Stop and research these before coding:

  1. SQLite Page Structure
    • What are the valid page types and their signatures?
    • What are the invariants that must hold?
    • What does a cell pointer array look like?
    • Book Reference: SQLite Documentation - “Database File Format”
  2. B-Tree Invariants
    • What properties must a B-tree maintain?
    • How do you verify parent-child relationships?
    • What’s the relationship between interior and leaf nodes?
    • Book Reference: “Introduction to Algorithms” Ch. 18 - CLRS
  3. Freelist Structure
    • How does SQLite track free pages?
    • What’s a trunk page vs leaf page in the freelist?
    • How can double-allocation occur?
    • Book Reference: SQLite Documentation - “Database File Format”
  4. Common Corruption Causes
    • What causes corruption (hardware, software, interrupts)?
    • What patterns does each cause leave?
    • How does SQLite’s integrity_check work?
    • Book Reference: SQLite Documentation - “How To Corrupt An SQLite Database File”

Questions to Guide Your Design

Before implementing, think through these:

  1. Detection Strategy
    • What checks can you do on individual pages?
    • What checks require B-tree traversal?
    • What checks require cross-referencing multiple pages?
  2. Reporting
    • How do you describe corruption to users?
    • What information helps them understand impact?
    • How do you estimate data loss?
  3. Repair Strategy
    • When is repair safe vs risky?
    • How do you recover cells from corrupted pages?
    • Should you modify in place or create a new file?
  4. Verification
    • How do you know the repair worked?
    • What tests should you run on the recovered database?
    • How do you compare before/after?

Thinking Exercise

Validate a Page

Before coding, list all the things you’d check on a single B-tree leaf page:

Page validation checklist:
1. Page type byte is valid (0x0D for table leaf)
2. Cell count > 0 and reasonable for page size
3. Cell content offset is within page
4. Cell pointer array doesn't overlap cell content
5. Each cell pointer points within cell content area
6. ...what else?

Questions while designing:

  • How do you detect bit flips vs structural corruption?
  • What if the page type byte itself is corrupted?
  • How do you know the cell count is correct?
  • Can you recover data if cell pointers are wrong but data is intact?

The Interview Questions They’ll Ask

Prepare to answer these:

  1. “How would you detect if a database file is corrupted?”
  2. “What B-tree invariants would you check for integrity?”
  3. “How would you recover data from a corrupted B-tree page?”
  4. “What causes database corruption at the file system level?”
  5. “How does SQLite’s PRAGMA integrity_check work?”
  6. “What’s the difference between logical and physical corruption?”

Hints in Layers

Hint 1: Starting Point Start by implementing PRAGMA integrity_check yourself. It checks: page type validity, cell pointer bounds, B-tree ordering, freelist consistency.

Hint 2: Page Validation For each page: verify type byte, verify cell count is sensible, verify all cell pointers are within bounds, verify cells don’t overlap, verify cell content is valid for page type.

Hint 3: B-Tree Validation Traverse the B-tree depth-first. At each interior node, verify child page numbers are valid. Verify keys are in sorted order. Verify all referenced pages are accounted for.

Hint 4: Cell Recovery Even if cell pointers are corrupt, you might recover data by scanning the cell content area for valid record headers. SQLite records start with a varint header size.


Books That Will Help

Topic | Book | Chapter
----- | ---- | -------
SQLite file format | SQLite Documentation | “Database File Format”
B-tree invariants | “Introduction to Algorithms” by CLRS | Ch. 18
Data recovery | “Database Internals” by Petrov | Ch. 13
Corruption causes | SQLite Documentation | “How To Corrupt”

Implementation Hints

Page Validation Checklist:

1. Type byte: 0x02 (interior index), 0x05 (interior table),
              0x0A (leaf index), 0x0D (leaf table)
2. First freeblock: 0 or valid offset
3. Cell count: 0 or more (an empty table's root page has no cells),
              and < (page_size - header_size) / min_cell_size
4. Cell content start: > header_size and < page_size
5. Each cell pointer: >= cell_content_start and < page_size
6. For table leaf: rowids are ascending
7. For index leaf: keys are ascending
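
A minimal sketch of that checklist in Python, assuming you already hold the raw bytes of one page that is not page 1 (page 1 carries the 100-byte database header first); function and constant names are illustrative:

import struct

VALID_TYPES = {0x02, 0x05, 0x0A, 0x0D}   # interior index/table, leaf index/table
INTERIOR_TYPES = {0x02, 0x05}

def validate_page(page: bytes, page_size: int) -> list:
    """Run the checklist against one raw page; return a list of problems."""
    ptype = page[0]
    if ptype not in VALID_TYPES:
        return [f"invalid page type byte 0x{ptype:02X}"]
    problems = []
    header_size = 12 if ptype in INTERIOR_TYPES else 8
    first_freeblock, cell_count, content_start = struct.unpack(">HHH", page[1:7])
    if content_start == 0:
        content_start = 65536  # the format defines 0 to mean 65536
    if first_freeblock and not (header_size <= first_freeblock < page_size):
        problems.append(f"first freeblock offset {first_freeblock} out of range")
    if not (header_size <= content_start <= page_size):
        problems.append(f"cell content start {content_start} out of range")
    if header_size + 2 * cell_count > content_start:
        problems.append("cell pointer array overlaps cell content area")
    else:
        for i in range(cell_count):
            (ptr,) = struct.unpack_from(">H", page, header_size + 2 * i)
            if not (content_start <= ptr < page_size):
                problems.append(f"cell pointer {i} ({ptr}) outside content area")
    return problems

Note what this cannot catch: a bit flip inside cell content passes every structural check here, which is why B-tree ordering and record parsing are separate passes.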

Recovery Heuristic:

If cell pointers are corrupt but cell content area looks valid:
1. Scan content area for valid record headers
2. Record header starts with varint (header_size)
3. Header contains type bytes for each column
4. Try to parse record, if valid, mark as recovered
5. Build new cell pointer array from recovered cells
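
The heuristic depends on reading varints, so a small decoder helps. In the file format, the first eight bytes contribute seven bits each while their high bit is set, and a ninth byte, if reached, contributes all eight bits. A sketch, with a hypothetical helper for the header scan:

def read_varint(buf: bytes, pos: int):
    """Decode one SQLite varint at pos; return (value, position after it)."""
    value = 0
    for i in range(8):
        byte = buf[pos + i]
        value = (value << 7) | (byte & 0x7F)
        if not (byte & 0x80):
            return value, pos + i + 1
    # A ninth byte, if reached, contributes all 8 of its bits.
    return (value << 8) | buf[pos + 8], pos + 9

# Plausibility test for "does a record header start at this offset?"
# (illustrative helper for step 1 of the heuristic above):
def looks_like_record_header(page: bytes, offset: int) -> bool:
    header_size, _ = read_varint(page, offset)
    return 1 <= header_size <= len(page) - offset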

Learning Milestones

  1. Can validate individual pages → You understand page structure deeply
  2. Can validate entire B-tree → You understand tree invariants
  3. Can detect common corruption → You understand failure modes
  4. Can recover data from corrupt pages → You understand data recovery

Project 10: JSON Document Store on SQLite

  • File: LEARN_SQLITE_DEEP_DIVE.md
  • Main Programming Language: Python
  • Alternative Programming Languages: Go, TypeScript, Rust
  • Coolness Level: Level 3: Genuinely Clever
  • Business Potential: 2. The “Micro-SaaS / Pro Tool”
  • Difficulty: Level 2: Intermediate
  • Knowledge Area: Document Databases / JSON Processing
  • Software or Tool: SQLite JSON1
  • Main Book: “Designing Data-Intensive Applications” by Martin Kleppmann

What you’ll build: A MongoDB-like document store using SQLite’s JSON1 extension, supporting document CRUD, nested field queries, array operations, aggregation pipelines, and indexing on JSON fields—demonstrating that SQLite can handle document workloads.

Why it teaches SQLite: SQLite’s JSON1 extension is powerful but underused. This project teaches you JSON functions, generated columns, expression indexes, and how to bridge relational and document paradigms.

Core challenges you’ll face:

  • Storing and retrieving JSON documents → maps to JSON1 functions
  • Querying nested fields → maps to JSON path expressions
  • Indexing JSON properties → maps to expression indexes
  • Implementing aggregations → maps to JSON array/object manipulation
  • Schema flexibility vs type safety → maps to document database tradeoffs

Key Concepts:

  • JSON1 Extension: SQLite Documentation - “JSON1 Extension”
  • Expression Indexes: SQLite Documentation - “Indexes On Expressions”
  • Generated Columns: SQLite Documentation - “Generated Columns”
  • Document Data Model: “Designing Data-Intensive Applications” Ch. 2 - Kleppmann

Difficulty: Intermediate. Time estimate: 1-2 weeks. Prerequisites: Python, SQL, and a conceptual understanding of JSON and document databases.


Real World Outcome

A MongoDB-like API backed by SQLite:

Example Output:

from sqlite_docstore import DocumentStore, Query

db = DocumentStore("documents.db")
users = db.collection("users")

# Insert documents (like MongoDB)
users.insert_one({
    "name": "Alice",
    "email": "alice@example.com",
    "age": 30,
    "tags": ["developer", "python"],
    "address": {
        "city": "San Francisco",
        "zip": "94102"
    }
})

users.insert_many([
    {"name": "Bob", "email": "bob@test.com", "age": 25, "tags": ["developer"]},
    {"name": "Charlie", "email": "charlie@test.com", "age": 35, "tags": ["manager"]}
])

# Query with MongoDB-like syntax
results = users.find({"age": {"$gt": 25}})
# Generates: SELECT doc FROM users WHERE json_extract(doc, '$.age') > 25

results = users.find({"tags": {"$contains": "developer"}})
# Uses json_each() to check array membership

results = users.find({"address.city": "San Francisco"})
# Generates: SELECT doc FROM users WHERE json_extract(doc, '$.address.city') = ?

# Create index on nested field for performance
users.create_index("age")
# Creates: CREATE INDEX idx_users_age ON users(json_extract(doc, '$.age'))

users.create_index("address.city")
# Creates: CREATE INDEX idx_users_address_city ON users(json_extract(doc, '$.address.city'))

# Update operations
users.update_one(
    {"email": "alice@example.com"},
    {"$set": {"age": 31}, "$push": {"tags": "speaker"}}
)
# Uses json_set() and json_insert()

# Aggregation pipeline
results = users.aggregate([
    {"$match": {"tags": {"$contains": "developer"}}},
    {"$group": {"_id": "$address.city", "count": {"$sum": 1}, "avg_age": {"$avg": "$age"}}},
    {"$sort": {"count": -1}}
])

# Output (for the three documents inserted above: Bob has no address,
# so he lands in the null group, and Charlie is filtered out by $match):
# [
#   {"_id": "San Francisco", "count": 1, "avg_age": 30.0},
#   {"_id": null, "count": 1, "avg_age": 25.0}
# ]

# CLI interface
$ sqlite-docstore documents.db
> db.users.find({"age": {"$gte": 30}})
{
  "_id": "1",
  "name": "Alice",
  "age": 30,
  ...
}
{
  "_id": "3",
  "name": "Charlie",
  "age": 35,
  ...
}
(2 documents)

> db.users.createIndex("email")
Index created: idx_users_email

> db.users.explain({"address.city": "San Francisco"})
Query Plan: SEARCH users USING INDEX idx_users_address_city

The Core Question You’re Answering

“Can SQLite be a document database? What does SQLite give you that MongoDB doesn’t?”

Before you write any code, sit with this question. Document databases trade schema for flexibility. SQLite with JSON1 gives you both: flexible documents PLUS SQL queries, joins, transactions, and a single file. Understanding when each shines is valuable.


Concepts You Must Understand First

Stop and research these before coding:

  1. JSON1 Functions
    • What’s json_extract() and json_set()?
    • How does json_each() work for arrays?
    • What’s the difference between the -> and ->> operators?
    • Book Reference: SQLite Documentation - “JSON1 Extension”
  2. Expression Indexes
    • How do you index computed values?
    • What expressions can be indexed?
    • How does the query planner use them?
    • Book Reference: SQLite Documentation - “Indexes On Expressions”
  3. Generated Columns
    • What’s a STORED vs VIRTUAL generated column?
    • How do they work with JSON?
    • When would you use each?
    • Book Reference: SQLite Documentation - “Generated Columns”
  4. Document vs Relational
    • When is document model better?
    • When is relational better?
    • How do you decide schema design?
    • Book Reference: “Designing Data-Intensive Applications” Ch. 2 - Kleppmann
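
Before designing anything, it is worth poking at all four features in a throwaway session. A self-contained sketch using Python's stdlib sqlite3 module; it assumes a bundled SQLite new enough for built-in JSON functions (3.38+) and generated columns (3.31+), and the table and index names are illustrative:

import json
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE docs (id INTEGER PRIMARY KEY, doc TEXT NOT NULL)")
con.execute("INSERT INTO docs(doc) VALUES (?)", (json.dumps(
    {"name": "Alice", "age": 30, "tags": ["developer", "python"]}),))

# json_extract: pull a value out of the stored document
print(con.execute("SELECT json_extract(doc, '$.age') FROM docs").fetchone())  # (30,)

# Expression index: usable by the planner for WHERE json_extract(doc, '$.age') = ?
con.execute("CREATE INDEX idx_docs_age ON docs(json_extract(doc, '$.age'))")

# Generated column: a named, indexable projection of a JSON field
con.execute("ALTER TABLE docs ADD COLUMN name TEXT "
            "GENERATED ALWAYS AS (json_extract(doc, '$.name')) VIRTUAL")
print(con.execute("SELECT name FROM docs").fetchone())  # ('Alice',)

# json_each: treat a JSON array as a table, here for a membership test
row = con.execute(
    "SELECT doc FROM docs WHERE EXISTS "
    "(SELECT 1 FROM json_each(doc, '$.tags') WHERE value = ?)",
    ("developer",)).fetchone()
print(row is not None)  # True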

Questions to Guide Your Design

Before implementing, think through these:

  1. Document Storage
    • One column (BLOB/TEXT) or parsed into columns?
    • How do you generate document IDs?
    • What’s the maximum document size?
  2. Query Translation
    • How do you convert MongoDB syntax to SQL?
    • How do you handle nested paths?
    • What operators do you support?
  3. Indexing Strategy
    • Which fields should be indexable?
    • How do you handle array indexes?
    • When to use generated columns vs expression indexes?
  4. Performance
    • When is JSON extraction faster than normalized tables?
    • What queries will be slow?
    • How do you optimize array searches?

Thinking Exercise

Design the Query Translator

Before coding, design how to translate MongoDB queries:

# Input (MongoDB syntax)
{"age": {"$gt": 25}, "status": "active"}

# Output (SQL)
SELECT doc FROM collection
WHERE json_extract(doc, '$.age') > 25
  AND json_extract(doc, '$.status') = 'active'

Questions while designing:

  • How do you handle nested $and/$or?
  • How do you query inside arrays?
  • What about $in with a list of values?
  • How do you handle null vs missing fields?
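
One way to start answering these is a first-cut translator for the flat cases from the exercise above (a sketch: the collection table name, operator set, and function names are illustrative, and nested $and/$or and array queries are deliberately left out):

OPS = {"$eq": "=", "$gt": ">", "$gte": ">=", "$lt": "<", "$lte": "<="}

def translate(query: dict):
    """Translate a flat MongoDB-style filter into (SQL, parameters)."""
    clauses, params = [], []
    for field, cond in query.items():
        path = "$." + field              # "address.city" becomes "$.address.city"
        # NOTE: field names are interpolated into the SQL; a real
        # implementation must validate them against injection.
        if isinstance(cond, dict):       # operator form, e.g. {"$gt": 25}
            for op, value in cond.items():
                clauses.append(f"json_extract(doc, '{path}') {OPS[op]} ?")
                params.append(value)
        else:                            # shorthand equality, e.g. "active"
            clauses.append(f"json_extract(doc, '{path}') = ?")
            params.append(cond)
    where = " AND ".join(clauses) or "1"
    return f"SELECT doc FROM collection WHERE {where}", params

sql, params = translate({"age": {"$gt": 25}, "status": "active"})
print(sql)     # SELECT doc FROM collection WHERE json_extract(doc, '$.age') > ?
               #   AND json_extract(doc, '$.status') = ?
print(params)  # [25, 'active']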

The Interview Questions They’ll Ask

Prepare to answer these:

  1. “When would you use SQLite’s JSON support vs a document database?”
  2. “How do you index JSON fields efficiently?”
  3. “What’s the performance difference between JSON queries and normalized tables?”
  4. “How would you implement array queries in SQLite?”
  5. “What are generated columns and when would you use them?”
  6. “How do you handle schema evolution with JSON documents?”

Hints in Layers

Hint 1 (Starting Point): Start with a simple table: CREATE TABLE collection (id TEXT PRIMARY KEY, doc TEXT). Use json_extract() for all queries initially.

Hint 2 (Query Operators): Map MongoDB operators to SQL: $eq → =, $gt → >, $in → IN, and $contains (for arrays) → EXISTS (SELECT 1 FROM json_each(doc, '$.array') WHERE value = ?).

Hint 3 (Performance): Create indexes on frequently queried paths: CREATE INDEX idx ON collection(json_extract(doc, '$.field')). This makes equality and range queries fast.

Hint 4 (Arrays): Use json_each() for array operations. For {"$contains": "developer"} on tags, check whether any element matches using an EXISTS subquery.


Books That Will Help

| Topic | Book | Chapter |
|-------|------|---------|
| Document model | “Designing Data-Intensive Applications” by Kleppmann | Ch. 2 |
| JSON1 extension | SQLite Documentation | “JSON1 Extension” |
| Expression indexes | SQLite Documentation | “Indexes On Expressions” |
| Query optimization | “SQL Performance Explained” by Winand | All |

Implementation Hints

Core Table Structure:

CREATE TABLE collections (
    name TEXT PRIMARY KEY
);

CREATE TABLE documents (
    id TEXT PRIMARY KEY,
    collection TEXT NOT NULL,
    doc TEXT NOT NULL,  -- JSON document
    created_at TEXT DEFAULT CURRENT_TIMESTAMP,
    updated_at TEXT DEFAULT CURRENT_TIMESTAMP,
    FOREIGN KEY (collection) REFERENCES collections(name)
);

Useful JSON1 Functions:

  • json_extract(doc, '$.field') - Get field value
  • json_set(doc, '$.field', value) - Set field
  • json_remove(doc, '$.field') - Remove field
  • json_each(doc, '$.array') - Iterate array
  • json_type(doc, '$.field') - Check type

Learning Milestones

  1. Basic CRUD with JSON → You understand JSON1 basics
  2. Nested field queries working → You understand path expressions
  3. Indexes on JSON fields → You understand expression indexes
  4. Aggregations working → You understand complex JSON manipulation

Project 11: SQLite Virtual Table for External Data

  • File: LEARN_SQLITE_DEEP_DIVE.md
  • Main Programming Language: C
  • Alternative Programming Languages: Python (with apsw), Rust
  • Coolness Level: Level 4: Hardcore Tech Flex
  • Business Potential: 3. The “Service & Support” Model
  • Difficulty: Level 4: Expert
  • Knowledge Area: SQLite Extensibility / Data Integration
  • Software or Tool: SQLite Virtual Table API
  • Main Book: “The Definitive Guide to SQLite” by Mike Owens

What you’ll build: A SQLite virtual table that presents external data (CSV files, REST APIs, or even other databases) as queryable SQL tables—allowing you to JOIN across different data sources using standard SQL.

Why it teaches SQLite: Virtual tables are SQLite’s extensibility mechanism. Building one teaches you the vtab API, cursor management, and how SQLite’s query planner works—knowledge that applies to understanding FTS5, JSON1, and R*trees.

Core challenges you’ll face:

  • Implementing the vtab interface → maps to SQLite extension API
  • Providing row estimates for query planning → maps to query optimizer interaction
  • Handling WHERE clause pushdown → maps to optimization opportunities
  • Managing cursors for iteration → maps to state management
  • Supporting INSERT/UPDATE/DELETE → maps to writable virtual tables

Key Concepts:

  • Virtual Table API: SQLite Documentation - “Virtual Table Methods”
  • Query Planning Callbacks: SQLite Documentation - “xBestIndex Method”
  • Cursor Management: SQLite Documentation - “xOpen/xClose”
  • Extension Loading: SQLite Documentation - “Run-Time Loadable Extensions”

Difficulty: Expert. Time estimate: 3-4 weeks. Prerequisites: C programming, SQLite basics, and an understanding of APIs and interfaces.


Real World Outcome

A loadable SQLite extension that makes external data queryable:

Example Output (CSV Virtual Table):

$ sqlite3
sqlite> .load ./csv_vtable

sqlite> CREATE VIRTUAL TABLE customers USING csv(
    filename='customers.csv',
    header=yes,
    schema='id INTEGER, name TEXT, email TEXT, created DATE'
);

sqlite> SELECT * FROM customers WHERE created > '2024-01-01' LIMIT 5;
id     name           email                    created
-----  -------------  ----------------------   ----------
1001   Alice Smith    alice@example.com        2024-02-15
1002   Bob Jones      bob@company.com          2024-03-20
1003   Carol White    carol@startup.io         2024-01-30

sqlite> -- Join with regular SQLite table!
sqlite> SELECT c.name, o.total
        FROM customers c
        JOIN orders o ON c.id = o.customer_id
        WHERE c.created > '2024-01-01';

sqlite> EXPLAIN QUERY PLAN SELECT * FROM customers WHERE id = 1001;
SCAN customers VIRTUAL TABLE INDEX 1: (constraint: id=?)
-- Index 1 means we're using our optimized lookup!

Example Output (REST API Virtual Table):

sqlite> .load ./rest_vtable

sqlite> CREATE VIRTUAL TABLE github_repos USING rest_api(
    url='https://api.github.com/users/{username}/repos',
    username='octocat',
    schema='id INTEGER, name TEXT, stars INTEGER, language TEXT'
);

sqlite> SELECT name, stars FROM github_repos
        WHERE stars > 100
        ORDER BY stars DESC
        LIMIT 10;

name                 stars
-------------------  ------
Spoon-Knife          12435
Hello-World          2341
linguist             892

sqlite> -- This fetches from the API and filters results!

The Core Question You’re Answering

“How can I query non-database data with SQL? What does SQLite need from me to make this work?”

Before you write any code, sit with this question. Virtual tables let SQLite query anything—files, APIs, generated data. But you must implement an interface that SQLite understands: how to iterate rows, what indexes exist, how to estimate costs.


Concepts You Must Understand First

Stop and research these before coding:

  1. Virtual Table Module Structure
    • What methods must you implement?
    • What’s the difference between sqlite3_module and sqlite3_vtab?
    • What’s a cursor and why do you need it?
    • Book Reference: SQLite Documentation - “Virtual Table Methods”
  2. xBestIndex Callback
    • What is query plan optimization?
    • How do you tell SQLite which WHERE clauses you can handle?
    • What do cost estimates affect?
    • Book Reference: SQLite Documentation - “xBestIndex”
  3. xFilter and xNext
    • How do you iterate over results?
    • What’s the relationship between xFilter and WHERE?
    • How do you handle multiple simultaneous cursors?
    • Book Reference: SQLite Documentation - “xFilter”
  4. Extension Loading
    • How do you compile a loadable extension?
    • What’s sqlite3_extension_init?
    • How do you handle cross-platform compilation?
    • Book Reference: SQLite Documentation - “Run-Time Loadable Extensions”

Questions to Guide Your Design

Before implementing, think through these:

  1. Data Source
    • What external data will you expose?
    • How do you handle data that changes?
    • What’s the schema discovery strategy?
  2. Query Optimization
    • Which WHERE clauses can you optimize?
    • How accurate are your cost estimates?
    • What indexes can you fake?
  3. Cursor State
    • What state do you need per cursor?
    • How do you handle parallel queries?
    • How do you clean up resources?
  4. Error Handling
    • What if the external source fails?
    • How do you report errors to SQLite?
    • What about partial results?

Thinking Exercise

Design xBestIndex

Before coding, design how xBestIndex should respond to different queries:

Query: SELECT * FROM csv_table WHERE id = 5
- Can you optimize this? (Index on id?)
- What cost should you report?
- What constraint will xFilter receive?

Query: SELECT * FROM csv_table WHERE name LIKE '%test%'
- Can you optimize this? (Probably not)
- What cost should you report?
- Should xFilter receive the constraint?

Query: SELECT * FROM csv_table ORDER BY created DESC LIMIT 10
- Can you provide sorted output?
- What if the CSV is huge?

Questions while designing:

  • How do you signal “I can’t optimize this” vs “I can”?
  • What’s a reasonable cost estimate for a full scan?
  • How does LIMIT affect your strategy?
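
To make the exercise concrete, here is a toy model of the decision logic in Python. This is not the real C API, only the shape of the reasoning xBestIndex performs; the column name, idxNum values, and cost constants are illustrative:

FULL_SCAN_COST = 1_000_000   # "I have to read every row"
LOOKUP_COST = 10             # "I can jump straight to the matching row"

def best_index(constraints):
    """constraints: list of (column, op) pairs SQLite offers the table.
    Returns (idx_num, indices of constraints we will consume, cost)."""
    for i, (col, op) in enumerate(constraints):
        if col == "id" and op == "=":
            # Real xBestIndex would set aConstraintUsage[i].argvIndex here
            # so the value reaches xFilter; idx_num 1 = "direct id lookup".
            return 1, [i], LOOKUP_COST
    # Nothing usable: report a full scan; SQLite applies the WHERE itself.
    return 0, [], FULL_SCAN_COST

print(best_index([("id", "=")]))      # (1, [0], 10)
print(best_index([("name", "LIKE")])) # (0, [], 1000000)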

The Interview Questions They’ll Ask

Prepare to answer these:

  1. “What’s a virtual table and when would you use one?”
  2. “Explain how xBestIndex helps the query planner.”
  3. “How would you optimize a virtual table for equality lookups?”
  4. “What’s the difference between a read-only and writable virtual table?”
  5. “How do you handle errors in a virtual table?”
  6. “What are the performance implications of virtual tables?”

Hints in Layers

Hint 1 (Starting Point): Start with the simplest read-only virtual table: one that returns fixed data. Ignore xBestIndex initially; just return everything and let SQLite filter.

Hint 2 (Required Methods): The minimum set is xCreate, xConnect, xBestIndex, xOpen, xClose, xFilter, xNext, xEof, xColumn, xRowid, xDisconnect, and xDestroy.

Hint 3 (xBestIndex Basics): For a simple table scan, set pIdxInfo->estimatedCost = 1000000 (a high cost). To signal that you can handle a constraint, set pIdxInfo->aConstraintUsage[i].argvIndex = 1 and lower the cost.

Hint 4 (Testing): Use EXPLAIN QUERY PLAN to verify SQLite is using your indexes. The output shows which idxNum you returned from xBestIndex.


Books That Will Help

| Topic | Book | Chapter |
|-------|------|---------|
| Virtual table API | “The Definitive Guide to SQLite” by Owens | Ch. 9 |
| Extension development | SQLite Documentation | “Run-Time Loadable Extensions” |
| Query planning | SQLite Documentation | “xBestIndex” |
| C extension patterns | SQLite source code | ext/misc/*.c examples |

Implementation Hints

Minimal Module Structure:

static sqlite3_module my_module = {
    .iVersion    = 0,
    .xCreate     = my_create,
    .xConnect    = my_connect,
    .xBestIndex  = my_best_index,
    .xDisconnect = my_disconnect,
    .xDestroy    = my_destroy,
    .xOpen       = my_open,
    .xClose      = my_close,
    .xFilter     = my_filter,
    .xNext       = my_next,
    .xEof        = my_eof,
    .xColumn     = my_column,
    .xRowid      = my_rowid,
    /* remaining methods (xUpdate, xBegin, ...) stay NULL for read-only */
};

CSV Example Flow:

1. xCreate: Parse filename from arguments, read header for schema
2. xConnect: Same as xCreate (for persistent tables)
3. xBestIndex: Report full scan cost, or lower if id constraint present
4. xOpen: Allocate cursor, open file handle
5. xFilter: Seek to start (or specific row if indexed)
6. xNext: Read next CSV line
7. xColumn: Parse requested column from current line
8. xClose: Free cursor, close file
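
The same flow sketched as a Python class, so the cursor lifecycle is visible without the C boilerplate. This is a conceptual analogue of the callbacks, not the real API, and the customers.csv filename is assumed:

import csv

class CsvCursor:
    """Conceptual analogue of the vtab cursor callbacks (one pass only)."""
    def __init__(self, filename):        # xOpen: allocate per-cursor state
        self.f = open(filename, newline="")
        self.reader = csv.reader(self.f)
        self.header = next(self.reader)  # schema came from xCreate
        self.row = None

    def filter(self):                    # xFilter: position on the first row
        self.row = next(self.reader, None)

    def next(self):                      # xNext: advance to the next row
        self.row = next(self.reader, None)

    def eof(self):                       # xEof: True once iteration is done
        return self.row is None

    def column(self, i):                 # xColumn: one field of the current row
        return self.row[i]

    def close(self):                     # xClose: release resources
        self.f.close()

cur = CsvCursor("customers.csv")         # assumes this file exists
cur.filter()
while not cur.eof():
    print(cur.column(0), cur.column(1))
    cur.next()
cur.close()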

Learning Milestones

  1. Basic read-only vtable working → You understand the vtab interface
  2. xBestIndex optimizes queries → You understand query planning interaction
  3. Multiple cursors work correctly → You understand cursor state management
  4. Writable operations work → You understand the full vtab API

Project 12: SQLite Connection Pool & Performance Monitor

  • File: LEARN_SQLITE_DEEP_DIVE.md
  • Main Programming Language: Python
  • Alternative Programming Languages: Go, Rust, C
  • Coolness Level: Level 3: Genuinely Clever
  • Business Potential: 3. The “Service & Support” Model
  • Difficulty: Level 3: Advanced
  • Knowledge Area: Performance Engineering / Concurrency
  • Software or Tool: SQLite
  • Main Book: “High Performance MySQL” by Baron Schwartz (concepts apply)

What you’ll build: A connection pool manager with performance monitoring—handling connection reuse, WAL checkpoint scheduling, query timing, slow query logging, and lock contention detection—suitable for production SQLite deployments.

Why it teaches SQLite: Real-world SQLite performance requires understanding concurrency, locking, connection handling, and monitoring. This project teaches you to build production-grade infrastructure around SQLite.

Core challenges you’ll face:

  • Managing connection lifecycle → maps to connection pooling patterns
  • Monitoring query performance → maps to SQLite tracing callbacks
  • Scheduling WAL checkpoints → maps to write performance management
  • Detecting lock contention → maps to SQLITE_BUSY handling
  • Collecting and displaying metrics → maps to observability

Key Concepts:

  • Connection Pooling: General database pattern
  • SQLite Tracing: SQLite Documentation - “sqlite3_trace_v2”
  • Busy Handler: SQLite Documentation - “sqlite3_busy_handler”
  • WAL Checkpointing: SQLite Documentation - “Checkpoint”

Difficulty: Advanced. Time estimate: 2 weeks. Prerequisites: Python (threading), SQL, and an understanding of database performance concepts.


Real World Outcome

A library and monitoring tool for production SQLite:

Example Output:

from sqlite_pool import ConnectionPool, QueryMonitor

# Create connection pool
pool = ConnectionPool(
    database="app.db",
    min_connections=2,
    max_connections=10,
    wal_mode=True,
    checkpoint_interval=300,  # Auto checkpoint every 5 min
    busy_timeout=5000  # 5 second timeout
)

# Enable monitoring
monitor = QueryMonitor(pool)
monitor.enable_slow_query_log(threshold_ms=100)
monitor.enable_lock_detection()

# Use connections from pool
with pool.connection() as conn:
    cursor = conn.execute("SELECT * FROM users WHERE id = ?", (user_id,))
    user = cursor.fetchone()

# Connection automatically returned to pool

# View statistics
print(pool.stats())
# ConnectionPoolStats:
#   Total connections: 5
#   Available: 3
#   In use: 2
#   Peak: 7
#   Created: 12
#   Recycled: 156

print(monitor.query_stats())
# QueryStats (last 5 minutes):
#   Total queries: 1,247
#   Avg latency: 2.3ms
#   p99 latency: 15.2ms
#   Slow queries: 3

print(monitor.slow_queries())
# Slow Queries:
#   [2024-12-22 14:30:15] 234ms - SELECT * FROM orders WHERE status = ?
#   [2024-12-22 14:30:45] 156ms - UPDATE users SET last_login = ? WHERE id = ?

# CLI monitoring tool
$ sqlite-monitor app.db

╔═══════════════════════════════════════════════════════════════╗
║                  SQLite Performance Monitor                   ║
║                  Database: app.db                             ║
╠═══════════════════════════════════════════════════════════════╣
║ Connection Pool:                                              ║
║   Active: 3/10  |  Waiting: 0  |  Peak: 7                     ║
║                                                               ║
║ Query Performance (last 60s):                                 ║
║   Queries/sec: 45.2                                           ║
║   Avg latency: 2.1ms  |  p95: 8.3ms  |  p99: 24.1ms           ║
║                                                               ║
║   ██████████████████████████░░░░ Read: 78%                    ║
║   ████████░░░░░░░░░░░░░░░░░░░░░░ Write: 22%                   ║
║                                                               ║
║ Lock Contention:                                              ║
║   SQLITE_BUSY events: 2 (0.04%)                               ║
║   Avg wait time: 45ms                                         ║
║                                                               ║
║ WAL Status:                                                   ║
║   Size: 2.4 MB (584 frames)                                   ║
║   Last checkpoint: 3m 24s ago                                 ║
║   Checkpoint mode: PASSIVE                                    ║
║                                                               ║
║ Top Queries by Time:                                          ║
║   1. SELECT...FROM orders WHERE s..  (avg 12ms, 234 calls)    ║
║   2. UPDATE users SET last_login...  (avg 8ms, 156 calls)     ║
║   3. INSERT INTO events...           (avg 5ms, 892 calls)     ║
╚═══════════════════════════════════════════════════════════════╝

[Press 'q' to quit, 's' for slow queries, 'c' for checkpoint]

The Core Question You’re Answering

“How do you run SQLite in production? What metrics tell you if it’s healthy?”

Before you write any code, sit with this question. SQLite is often used “casually,” but production use requires monitoring, connection management, and proactive maintenance. Understanding what can go wrong—and how to detect it—is key.


Concepts You Must Understand First

Stop and research these before coding:

  1. Connection Pooling Pattern
    • Why pool connections instead of creating new ones?
    • How do you handle idle connections?
    • What’s connection recycling?
    • Book Reference: General database programming resources
  2. SQLite Tracing API
    • What’s sqlite3_trace_v2?
    • What events can you trace?
    • How do you get query execution time?
    • Book Reference: SQLite Documentation - “sqlite3_trace_v2”
  3. SQLITE_BUSY Handling
    • When does SQLITE_BUSY occur?
    • What’s a busy handler vs busy timeout?
    • How do you retry gracefully?
    • Book Reference: SQLite Documentation - “sqlite3_busy_handler”
  4. WAL Checkpoint Tuning
    • When should you checkpoint?
    • What are the checkpoint modes?
    • What’s the performance impact?
    • Book Reference: SQLite Documentation - “Checkpoint”
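
Several of these knobs are visible from Python's stdlib sqlite3 module. A quick sketch (the five-second timeout and the app.db path are arbitrary):

import sqlite3

# timeout=5.0 installs a busy handler that retries for up to five seconds
# before SQLITE_BUSY surfaces as OperationalError("database is locked").
con = sqlite3.connect("app.db", timeout=5.0)

print(con.execute("PRAGMA journal_mode=WAL").fetchone())  # ('wal',)
print(con.execute("PRAGMA busy_timeout").fetchone())      # (5000,)

# Manual passive checkpoint: copies WAL frames back without blocking readers.
busy, wal_frames, checkpointed = con.execute(
    "PRAGMA wal_checkpoint(PASSIVE)").fetchone()
print(busy, wal_frames, checkpointed)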

Questions to Guide Your Design

Before implementing, think through these:

  1. Pool Management
    • How do you handle requests exceeding max connections?
    • When do you close idle connections?
    • How do you detect stale connections?
  2. Monitoring Design
    • What metrics are most valuable?
    • How do you collect them without overhead?
    • How do you store and query metrics?
  3. Checkpoint Strategy
    • When to checkpoint automatically?
    • How do you avoid blocking readers?
    • What if checkpoint fails?
  4. Error Recovery
    • How do you handle connection failures?
    • What about corrupt database detection?
    • How do you alert operators?

Thinking Exercise

Design Lock Handling

Before coding, design how to handle SQLITE_BUSY:

Scenario: Writer A has exclusive lock, Writer B wants to write

Options:
1. Fail immediately with error
2. Retry with exponential backoff
3. Block until lock available (up to timeout)
4. Queue write and notify when complete

What tradeoffs does each have?
What metrics should you collect?
How do you alert if contention is chronic?

Questions while designing:

  • How long should busy timeout be?
  • Should read and write connections be separate pools?
  • How do you detect “lock convoy” problems?
  • What’s the relationship between pool size and contention?
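
Option 2 from the scenario above, sketched with Python's stdlib sqlite3, which surfaces SQLITE_BUSY as OperationalError("database is locked"); the function name and retry constants are illustrative:

import random
import sqlite3
import time

def execute_with_backoff(con, sql, params=(), retries=5):
    """Retry a statement on 'database is locked', doubling the wait each time."""
    delay = 0.01                                      # start at 10 ms
    for attempt in range(retries):
        try:
            return con.execute(sql, params)
        except sqlite3.OperationalError as e:
            if "locked" not in str(e) or attempt == retries - 1:
                raise
            time.sleep(delay + random.uniform(0, delay))  # jitter breaks convoys
            delay *= 2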

The Interview Questions They’ll Ask

Prepare to answer these:

  1. “How would you handle SQLITE_BUSY in a production application?”
  2. “What metrics would you monitor for a SQLite database?”
  3. “When and how should you checkpoint a WAL database?”
  4. “How does connection pooling improve performance?”
  5. “What causes lock contention in SQLite and how do you reduce it?”
  6. “How would you implement slow query logging?”

Hints in Layers

Hint 1 (Starting Point): Start with a simple thread-safe queue of connections. get() returns a connection (or blocks), put() returns it to the pool.

Hint 2 (Tracing): Use sqlite3_trace_v2 with SQLITE_TRACE_STMT and SQLITE_TRACE_PROFILE. The profile event gives you execution time in nanoseconds.

Hint 3 (Metrics): Store recent metrics in a ring buffer (the last N queries). Calculate percentiles on demand. Use thread-local storage to avoid contention in the monitoring code.

Hint 4 (Checkpoint): Run checkpoints in a background thread. Use SQLITE_CHECKPOINT_PASSIVE to avoid blocking. If the WAL grows too large, consider SQLITE_CHECKPOINT_TRUNCATE during low-traffic periods.


Books That Will Help

| Topic | Book | Chapter |
|-------|------|---------|
| Pooling patterns | “High Performance MySQL” by Schwartz | Connection management |
| SQLite threading | SQLite Documentation | “Using SQLite In Multi-Threaded Applications” |
| Performance monitoring | General observability resources | |
| WAL tuning | SQLite Documentation | “Write-Ahead Logging” |

Implementation Hints

Connection Pool Structure:

class ConnectionPool:
    - available: Queue of ready connections
    - in_use: Set of checked-out connections
    - lock: Threading lock for operations
    - stats: Counter dict for metrics

    def get():
        1. Try to get from available queue (with timeout)
        2. If empty and under max, create new
        3. If at max, wait for return
        4. Move to in_use, return wrapped connection

    def put(conn):
        1. Validate connection is healthy
        2. Move from in_use to available
        3. If over min and idle, close
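
A runnable condensation of the structure above, built on queue.Queue. It is a sketch, not the full design: it eagerly creates max_connections up front instead of growing from a minimum, skips health checks and stats, and uses check_same_thread=False so pooled connections can cross threads:

import queue
import sqlite3
from contextlib import contextmanager

class ConnectionPool:
    def __init__(self, database, max_connections=10, timeout=30.0):
        self.available = queue.Queue(maxsize=max_connections)
        self.timeout = timeout
        for _ in range(max_connections):
            conn = sqlite3.connect(database, check_same_thread=False)
            conn.execute("PRAGMA journal_mode=WAL")
            self.available.put(conn)

    @contextmanager
    def connection(self):
        conn = self.available.get(timeout=self.timeout)  # blocks when exhausted
        try:
            yield conn
        finally:
            self.available.put(conn)                     # always returned

pool = ConnectionPool("app.db", max_connections=4)
with pool.connection() as conn:
    conn.execute("CREATE TABLE IF NOT EXISTS users (id INTEGER PRIMARY KEY)")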

Tracing Setup:

def trace_callback(trace_type, context, sql, execution_time):
    if trace_type == SQLITE_TRACE_PROFILE:
        stats.record(sql, execution_time)
        if execution_time > SLOW_THRESHOLD:
            log_slow_query(sql, execution_time)

sqlite3_trace_v2(db, SQLITE_TRACE_PROFILE, trace_callback, context)
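
Python's stdlib sqlite3 exposes set_trace_callback (statement text only, no timing), so a portable fallback is to time queries at the call site; apsw or the C API gives you the real SQLITE_TRACE_PROFILE hook. A sketch with an illustrative timed_execute wrapper:

import time

SLOW_THRESHOLD_MS = 100
slow_log = []

def timed_execute(conn, sql, params=()):
    """Times execution up to the first row; fetching the rest is extra."""
    start = time.perf_counter()
    cur = conn.execute(sql, params)
    elapsed_ms = (time.perf_counter() - start) * 1000
    if elapsed_ms > SLOW_THRESHOLD_MS:
        slow_log.append((sql, elapsed_ms))   # feed the slow query report
    return cur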

Learning Milestones

  1. Basic pool working → You understand connection lifecycle
  2. Tracing captures metrics → You understand SQLite callbacks
  3. Slow query detection → You understand performance analysis
  4. Lock contention alerts → You understand production monitoring

Project Comparison Table

| # | Project | Difficulty | Time | Depth | Coolness | Business Potential |
|---|---------|------------|------|-------|----------|--------------------|
| 1 | SQLite File Format Explorer | Advanced | 2-3 weeks | ★★★★★ | Level 4 | Resume Gold |
| 2 | Query Analyzer & EXPLAIN Visualizer | Intermediate | 1-2 weeks | ★★★★☆ | Level 3 | Micro-SaaS |
| 3 | Key-Value Store on SQLite | Intermediate | 1-2 weeks | ★★★☆☆ | Level 3 | Micro-SaaS |
| 4 | WAL Mode Inspector | Expert | 2-3 weeks | ★★★★★ | Level 4 | Resume Gold |
| 5 | Full-Text Search Engine | Intermediate | 1-2 weeks | ★★★☆☆ | Level 3 | Micro-SaaS |
| 6 | Personal Analytics Database | Intermediate | 2-3 weeks | ★★★☆☆ | Level 3 | Micro-SaaS |
| 7 | SQLite Replication System | Expert | 3-4 weeks | ★★★★★ | Level 4 | Open Core |
| 8 | Build a Mini ORM | Advanced | 2-3 weeks | ★★★★☆ | Level 3 | Micro-SaaS |
| 9 | Corruption Detector & Repair | Expert | 3-4 weeks | ★★★★★ | Level 4 | Service/Support |
| 10 | JSON Document Store | Intermediate | 1-2 weeks | ★★★☆☆ | Level 3 | Micro-SaaS |
| 11 | Virtual Table for External Data | Expert | 3-4 weeks | ★★★★★ | Level 4 | Service/Support |
| 12 | Connection Pool & Monitor | Advanced | 2 weeks | ★★★★☆ | Level 3 | Service/Support |

Recommendation

Where to Start

If you’re new to SQLite internals: Start with Project 1 (File Format Explorer). Everything else builds on understanding the file format. Yes, it’s challenging, but it’s foundational.

If you want quick wins first: Start with Project 2 (Query Analyzer) or Project 3 (Key-Value Store). Both are achievable in a week and teach important concepts.

If you’re interested in advanced features: Project 5 (FTS5) or Project 10 (JSON Document Store) showcase SQLite’s powerful extensions without requiring C programming.

Path A: Database Internals Focus (for systems programmers) 1 → 4 → 9 → 7 → 11

Path B: Application Developer Focus (for backend engineers) 2 → 3 → 5 → 10 → 8

Path C: Production Operations Focus (for DevOps/SRE) 2 → 4 → 6 → 12 → 9


Final Overall Project: Build a Mini SQLite

  • File: LEARN_SQLITE_DEEP_DIVE.md
  • Main Programming Language: C
  • Alternative Programming Languages: Rust
  • Coolness Level: Level 5: Pure Magic
  • Business Potential: 1. The “Resume Gold”
  • Difficulty: Level 5: Master
  • Knowledge Area: Database Engines / Systems Programming
  • Software or Tool: From scratch
  • Main Book: “Database Internals” by Alex Petrov

What you’ll build: A simplified SQLite clone that can store and query data—implementing a B-tree storage engine, basic SQL parser, query executor, and WAL-based transactions. Not feature-complete, but demonstrating all core concepts.

Why this is the ultimate project: Building a database from scratch is the ultimate test of understanding. You’ll implement everything: file format, B-trees, SQL parsing, query execution, and transaction management. This is what separates database enthusiasts from database experts.

Core challenges you’ll face:

  • Designing a file format → Apply everything from Project 1
  • Implementing B-tree operations → Insert, delete, split, merge
  • Parsing SQL → Tokenizer, parser, AST
  • Query execution → Plan generation, execution engine
  • Transaction management → Logging, recovery
  • Concurrency → Locking, MVCC basics

Scope for MVP:

  • Single-file database with fixed page size
  • B-tree table storage (no indexes initially)
  • Basic SQL: CREATE TABLE, INSERT, SELECT, UPDATE, DELETE
  • WHERE with simple conditions (=, <, >, AND, OR)
  • WAL-based transactions (BEGIN, COMMIT, ROLLBACK)
  • Single-writer, multiple-reader concurrency
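
To make the SQL parsing step concrete before you commit to C, here is a minimal tokenizer sketch in Python; the token set and keyword list are illustrative and far from complete:

import re

TOKEN_RE = re.compile(r"""
    \s+                          # whitespace (skipped)
  | (?P<num>\d+)                 # integer literal
  | '(?P<str>[^']*)'             # string literal
  | (?P<id>[A-Za-z_]\w*)         # keyword or identifier
  | (?P<op><=|>=|<>|[=<>*,;()])  # operators and punctuation
""", re.VERBOSE)

KEYWORDS = {"SELECT", "FROM", "WHERE", "INSERT", "INTO", "VALUES",
            "CREATE", "TABLE", "UPDATE", "SET", "DELETE", "AND", "OR"}

def tokenize(sql):
    pos = 0
    while pos < len(sql):
        m = TOKEN_RE.match(sql, pos)
        if not m:
            raise SyntaxError(f"unexpected character at {pos}: {sql[pos]!r}")
        pos = m.end()
        if m.lastgroup == "num":
            yield ("NUM", int(m.group("num")))
        elif m.lastgroup == "str":
            yield ("STR", m.group("str"))
        elif m.lastgroup == "id":
            word = m.group("id")
            if word.upper() in KEYWORDS:
                yield ("KW", word.upper())
            else:
                yield ("ID", word)
        elif m.lastgroup == "op":
            yield ("OP", m.group("op"))

print(list(tokenize("SELECT * FROM users WHERE id = 1;")))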

Real World Outcome:

$ ./mini-sqlite test.db
mini-sqlite> CREATE TABLE users (id INTEGER, name TEXT, email TEXT);
OK

mini-sqlite> INSERT INTO users VALUES (1, 'Alice', 'alice@example.com');
OK

mini-sqlite> INSERT INTO users VALUES (2, 'Bob', 'bob@test.com');
OK

mini-sqlite> SELECT * FROM users WHERE id = 1;
id: 1, name: Alice, email: alice@example.com
(1 row)

mini-sqlite> BEGIN;
OK

mini-sqlite> UPDATE users SET name = 'Alice Smith' WHERE id = 1;
OK

mini-sqlite> SELECT * FROM users;
id: 1, name: Alice Smith, email: alice@example.com
id: 2, name: Bob, email: bob@test.com
(2 rows)

mini-sqlite> ROLLBACK;
OK

mini-sqlite> SELECT * FROM users WHERE id = 1;
id: 1, name: Alice, email: alice@example.com
(1 row)
-- Transaction was rolled back!

Time estimate: 2-3 months of focused effort. Prerequisites: All previous projects, especially 1, 4, and 9.


Summary

This learning path covers SQLite through 12 hands-on projects plus a capstone. Here’s the complete list:

| # | Project Name | Main Language | Difficulty | Time Estimate |
|---|--------------|---------------|------------|---------------|
| 1 | SQLite File Format Explorer | C | Advanced | 2-3 weeks |
| 2 | SQL Query Analyzer & EXPLAIN Visualizer | Python | Intermediate | 1-2 weeks |
| 3 | Build Your Own Key-Value Store on SQLite | Python | Intermediate | 1-2 weeks |
| 4 | SQLite WAL Mode Inspector | C | Expert | 2-3 weeks |
| 5 | SQLite Full-Text Search Engine | Python | Intermediate | 1-2 weeks |
| 6 | Personal Analytics Database | Python | Intermediate | 2-3 weeks |
| 7 | SQLite Replication System | Go | Expert | 3-4 weeks |
| 8 | Build a Mini ORM | Python | Advanced | 2-3 weeks |
| 9 | Database Corruption Detector & Repair Tool | C | Expert | 3-4 weeks |
| 10 | JSON Document Store on SQLite | Python | Intermediate | 1-2 weeks |
| 11 | SQLite Virtual Table for External Data | C | Expert | 3-4 weeks |
| 12 | SQLite Connection Pool & Performance Monitor | Python | Advanced | 2 weeks |
| Final | Build a Mini SQLite | C | Master | 2-3 months |

For beginners: Start with projects #2, #3, #5.
For intermediate: Jump to projects #1, #4, #8.
For advanced: Focus on projects #7, #9, #11, and the Final Project.

Expected Outcomes

After completing these projects, you will:

  • Understand every byte of a SQLite database file
  • Be able to diagnose and fix performance problems using EXPLAIN
  • Know how B-trees, WAL, and transactions actually work
  • Understand when SQLite is the right choice (and when it isn’t)
  • Be able to extend SQLite with custom functionality
  • Have built multiple production-quality tools
  • Be prepared for database internals interview questions

You’ll have built 12+ working projects demonstrating deep understanding of SQLite from first principles—from raw bytes to production deployment.