Project 6: SQL Query Engine (Parse -> Plan -> Execute)

Implement a practical SQL subset with deterministic parse-plan-execute behavior and explainable operator pipelines.

Quick Reference

Attribute Value
Difficulty Expert
Time Estimate 2-3 weeks
Main Programming Language C (Alternatives: Rust, C++)
Alternative Programming Languages Rust, C++
Coolness Level Level 5: Pure Magic
Business Potential Level 4: Open Core Infrastructure
Prerequisites Projects 1-5, parser basics
Key Topics AST, planning, operator execution, explain consistency

1. Learning Objectives

  1. Build a parser for a constrained SQL grammar.
  2. Translate AST into logical and physical plans.
  3. Execute plans through iterator-style operators.
  4. Produce faithful EXPLAIN output and validate runtime alignment.

2. All Theory Needed (Per-Concept Breakdown)

SQL Compilation Pipeline

Fundamentals

SQL engines compile declarative queries into executable operator plans. Parsing validates syntax and structure; planning chooses algorithms and access paths; execution runs operators over storage/index layers.

Deep Dive into the concept

A robust SQL subset requires strict scope boundaries. Start with commands and predicates you can execute correctly: simple projections, filters, inserts, updates, deletes, and limited joins/aggregates. The lexer emits tokens, parser produces AST, planner maps AST to logical operations, optimizer chooses physical operators, and executor runs pipelines.

The Volcano iterator model keeps runtime simple: operators expose open, next, and close. Parent pulls tuples from child, applies transformation, and emits onward. This makes control flow explicit and debuggable. Memory ownership for tuple data must be explicit; otherwise pipelines leak or return invalid pointers.

EXPLAIN is core observability. If explain tree differs from runtime behavior, debugging and tuning become unreliable. Use shared plan representation to avoid drift.

How this fit on projects

  • Primary in this project.
  • Integrated with transaction behavior in Project 7.

Definitions & key terms

  • AST: Abstract syntax tree of SQL query.
  • Logical plan: Engine-agnostic relational operations.
  • Physical plan: Concrete algorithms and access paths.
  • Volcano iterator: Pull-based operator interface.

Mental model diagram

SQL -> Lexer/Parser -> AST -> Planner -> Physical Plan -> Executor -> Result

How it works (step-by-step, with invariants and failure modes)

  1. Tokenize and parse SQL input.
  2. Resolve schema references and validate columns.
  3. Build logical plan then physical operators.
  4. Execute operator pipeline and emit rows.

Failure modes:

  • AST misbinding to wrong column index.
  • Explain/runtime mismatch.

Minimal concrete example

SELECT name FROM users WHERE age > 30
Project(name)
  Filter(age > 30)
    SeqScan(users)

Common misconceptions

  • “Parser is the hard part.” -> Execution and plan correctness usually harder.
  • “Small SQL subset is trivial.” -> Still complex across storage boundaries.

Check-your-understanding questions

  1. Why separate logical and physical planning?
  2. Why is explain fidelity essential?
  3. How can cardinality errors hurt performance?

Check-your-understanding answers

  1. Semantics and algorithm choices evolve independently.
  2. Debugging and tuning rely on truthful operator visibility.
  3. Wrong estimates choose bad algorithms and access paths.

Real-world applications

  • SQLite architecture pipeline.
  • PostgreSQL planner/executor model.

Where you’ll apply it

  • This project and integrated TinyDB in Project 7.

References

Key insights

Treating SQL engine as compiler + runtime makes complexity manageable.

Summary

Reliable query execution comes from strict operator contracts and observability.

Homework/Exercises to practice the concept

  1. Draw AST and plan for three sample queries.
  2. Define operator contract tests for scan/filter/project.
  3. Add explain snapshot tests.

Solutions to the homework/exercises

  1. Include node types and child relationships explicitly.
  2. Test EOS behavior and tuple ownership boundaries.
  3. Use fixed schema and deterministic query corpus.

3. Project Specification

3.1 What You Will Build

A SQL REPL with parser, planner, executor, and explain output over your existing storage/index layers.

3.2 Functional Requirements

  1. Parse supported SQL commands.
  2. Execute table scans and indexed predicates.
  3. Support simple joins and aggregates.
  4. Return deterministic result sets.
  5. Provide explain plans.

3.3 Non-Functional Requirements

  • Performance: Reasonable response times for fixture datasets.
  • Reliability: No crashes on malformed SQL.
  • Usability: Clear syntax errors and explain output.

3.4 Example Usage / Output

CREATE TABLE -> INSERT -> SELECT -> EXPLAIN

3.5 Data Formats / Schemas / Protocols

AST node schema, plan-node schema, executor tuple schema.

3.6 Edge Cases

Unknown column names, null handling, unsupported syntax, join cardinality explosion.

3.7 Real World Outcome

3.7.1 How to Run (Copy/Paste)

make tiny_sql
./tiny_sql

3.7.2 Golden Path Demo (Deterministic)

Fixed schema and deterministic query sequence.

3.7.3 If CLI: exact terminal transcript

$ ./tiny_sql
tinydb> CREATE TABLE users(id INT, name TEXT, age INT);
OK
tinydb> INSERT INTO users VALUES (1,'Ada',37),(2,'Lin',28);
OK rows=2
tinydb> EXPLAIN SELECT name FROM users WHERE age > 30;
Project(name)
  Filter(age > 30)
    SeqScan(users)
tinydb> SELECT name FROM users WHERE age > 30;
Ada

4. Solution Architecture

4.1 High-Level Design

REPL -> Lexer -> Parser -> Planner -> Executor -> Storage/Index

4.2 Key Components

Component Responsibility Key Decisions
Lexer/parser syntax and AST strict subset grammar
Planner logical/physical mapping rule-based first
Executor iterator runtime explicit tuple ownership
Explain plan observability shared plan representation

4.4 Data Structures (No Full Code)

AstNode { type, children, token_info }
PlanNode { op_type, children, access_path, predicates }
TupleView { schema_ref, value_cells }

4.4 Algorithm Overview

Parse -> build AST -> lower to plan -> execute operator tree.


5. Implementation Guide

5.1 Development Environment Setup

make tiny_sql

5.2 Project Structure

project/
├── src/sql/lexer
├── src/sql/parser
├── src/sql/planner
├── src/sql/executor
└── tests/sql

5.3 The Core Question You’re Answering

How does declarative SQL become deterministic dataflow over physical pages and indexes?

5.4 Concepts You Must Understand First

  • Lexer/parser fundamentals
  • Logical vs physical planning
  • Iterator runtime contracts

5.5 Questions to Guide Your Design

  • What SQL subset is in scope and why?
  • How do you guarantee explain/runtime parity?

5.6 Thinking Exercise

Plan one query with and without index and compare estimated costs.

5.7 The Interview Questions They’ll Ask

  1. How does parse-plan-execute pipeline work?
  2. Why separate logical and physical plans?
  3. What is Volcano iterator model?
  4. How do indexes affect plan choice?

5.8 Hints in Layers

  • Freeze grammar early.
  • Build AST printer before optimizer rules.
  • Add explain snapshot tests from day one.

5.9 Books That Will Help

Topic Book Chapter
Parser fundamentals Compilers: Principles and Practice lexical/syntax chapters
Data structure choices Algorithms, Fourth Edition search/join-related sections

5.10 Implementation Phases

  • Phase 1: parser + AST
  • Phase 2: planner + scans/filters
  • Phase 3: joins/aggregates + explain validation

5.11 Key Implementation Decisions

Decision Options Recommendation Rationale
Planning style cost-heavy/rule-based rule-based first faster correctness progress
Operator model push/pull pull (Volcano) simpler composability

6. Testing Strategy

6.1 Test Categories

  • Unit: parser and planner transforms
  • Integration: end-to-end query corpus
  • Snapshot: explain output consistency

6.2 Critical Test Cases

  1. Malformed SQL returns structured errors.
  2. Explain output matches executed operator tree.
  3. Predicate and projection correctness on fixture data.

6.3 Test Data

Deterministic schema and fixture rows with known query answers.


7. Common Pitfalls & Debugging

7.1 Frequent Mistakes

Pitfall Symptom Solution
AST-column mismatch wrong result columns schema binding stage
Explain/runtime drift confusing diagnostics single plan model

7.2 Debugging Strategies

  • AST and plan pretty-printers.
  • Operator event tracing (open/next/close).

7.3 Performance Traps

Premature optimizer complexity without stable correctness tests.


8. Extensions & Challenges

8.1 Beginner Extensions

  • Add ORDER BY on single column.

8.2 Intermediate Extensions

  • Add hash join option.

8.3 Advanced Extensions

  • Add simple cost model based on table/index stats.

9. Real-World Connections

9.1 Industry Applications

SQL execution pipelines are core of every relational DBMS.

  • SQLite architecture docs and source references.

9.3 Interview Relevance

Parsing/planning/execution questions are common in DB and compiler-adjacent roles.


10. Resources

10.1 Essential Reading

  • Compiler chapters for parsing.
  • SQLite architecture overview.

10.2 Video Resources

  • Query planner and execution model talks.

10.3 Tools & Documentation

  • AST visualizers, plan snapshot tools.

11. Self-Assessment Checklist

11.1 Understanding

  • I can explain AST -> plan -> executor boundaries.
  • I can justify plan choice for simple predicates.

11.2 Implementation

  • Query corpus passes with deterministic outputs.
  • Explain and runtime behavior stay aligned.

11.3 Growth

  • I can extend grammar safely without breaking planner invariants.

12. Submission / Completion Criteria

Minimum Viable Completion:

  • Working SQL subset with parser and scans

Full Completion:

  • Plus joins/aggregates and explain consistency suite

Excellence:

  • Plus basic cost-based planning and performance benchmarks