Project 6: SQL Query Engine (Parse -> Plan -> Execute)

Implement a practical SQL subset with deterministic parse-plan-execute behavior and explainable operator pipelines.

Quick Reference

Attribute	Value
Difficulty	Expert
Time Estimate	2-3 weeks
Main Programming Language	C (Alternatives: Rust, C++)
Alternative Programming Languages	Rust, C++
Coolness Level	Level 5: Pure Magic
Business Potential	Level 4: Open Core Infrastructure
Prerequisites	Projects 1-5, parser basics
Key Topics	AST, planning, operator execution, explain consistency

1. Learning Objectives

Build a parser for a constrained SQL grammar.
Translate AST into logical and physical plans.
Execute plans through iterator-style operators.
Produce faithful EXPLAIN output and validate runtime alignment.

2. All Theory Needed (Per-Concept Breakdown)

SQL Compilation Pipeline

Fundamentals

SQL engines compile declarative queries into executable operator plans. Parsing validates syntax and structure; planning chooses algorithms and access paths; execution runs operators over storage/index layers.

Deep Dive into the concept

A robust SQL subset requires strict scope boundaries. Start with commands and predicates you can execute correctly: simple projections, filters, inserts, updates, deletes, and limited joins/aggregates. The lexer emits tokens, parser produces AST, planner maps AST to logical operations, optimizer chooses physical operators, and executor runs pipelines.

The Volcano iterator model keeps runtime simple: operators expose open, next, and close. Parent pulls tuples from child, applies transformation, and emits onward. This makes control flow explicit and debuggable. Memory ownership for tuple data must be explicit; otherwise pipelines leak or return invalid pointers.

EXPLAIN is core observability. If explain tree differs from runtime behavior, debugging and tuning become unreliable. Use shared plan representation to avoid drift.

How this fit on projects

Primary in this project.
Integrated with transaction behavior in Project 7.

Definitions & key terms

AST: Abstract syntax tree of SQL query.
Logical plan: Engine-agnostic relational operations.
Physical plan: Concrete algorithms and access paths.
Volcano iterator: Pull-based operator interface.

Mental model diagram

SQL -> Lexer/Parser -> AST -> Planner -> Physical Plan -> Executor -> Result

How it works (step-by-step, with invariants and failure modes)

Tokenize and parse SQL input.
Resolve schema references and validate columns.
Build logical plan then physical operators.
Execute operator pipeline and emit rows.

Failure modes:

AST misbinding to wrong column index.
Explain/runtime mismatch.

Minimal concrete example

SELECT name FROM users WHERE age > 30
Project(name)
  Filter(age > 30)
    SeqScan(users)

Common misconceptions

“Parser is the hard part.” -> Execution and plan correctness usually harder.
“Small SQL subset is trivial.” -> Still complex across storage boundaries.

Check-your-understanding questions

Why separate logical and physical planning?
Why is explain fidelity essential?
How can cardinality errors hurt performance?

Check-your-understanding answers

Semantics and algorithm choices evolve independently.
Debugging and tuning rely on truthful operator visibility.
Wrong estimates choose bad algorithms and access paths.

Real-world applications

SQLite architecture pipeline.
PostgreSQL planner/executor model.

Where you’ll apply it

This project and integrated TinyDB in Project 7.

References

https://www.sqlite.org/arch.html

Key insights

Treating SQL engine as compiler + runtime makes complexity manageable.

Summary

Reliable query execution comes from strict operator contracts and observability.

Homework/Exercises to practice the concept

Draw AST and plan for three sample queries.
Define operator contract tests for scan/filter/project.
Add explain snapshot tests.

Solutions to the homework/exercises

Include node types and child relationships explicitly.
Test EOS behavior and tuple ownership boundaries.
Use fixed schema and deterministic query corpus.

3. Project Specification

3.1 What You Will Build

A SQL REPL with parser, planner, executor, and explain output over your existing storage/index layers.

3.2 Functional Requirements

Parse supported SQL commands.
Execute table scans and indexed predicates.
Support simple joins and aggregates.
Return deterministic result sets.
Provide explain plans.

3.3 Non-Functional Requirements

Performance: Reasonable response times for fixture datasets.
Reliability: No crashes on malformed SQL.
Usability: Clear syntax errors and explain output.

3.4 Example Usage / Output

CREATE TABLE -> INSERT -> SELECT -> EXPLAIN

3.5 Data Formats / Schemas / Protocols

AST node schema, plan-node schema, executor tuple schema.

3.6 Edge Cases

Unknown column names, null handling, unsupported syntax, join cardinality explosion.

3.7 Real World Outcome

3.7.1 How to Run (Copy/Paste)

make tiny_sql
./tiny_sql

3.7.2 Golden Path Demo (Deterministic)

Fixed schema and deterministic query sequence.

3.7.3 If CLI: exact terminal transcript

$ ./tiny_sql
tinydb> CREATE TABLE users(id INT, name TEXT, age INT);
OK
tinydb> INSERT INTO users VALUES (1,'Ada',37),(2,'Lin',28);
OK rows=2
tinydb> EXPLAIN SELECT name FROM users WHERE age > 30;
Project(name)
  Filter(age > 30)
    SeqScan(users)
tinydb> SELECT name FROM users WHERE age > 30;
Ada

4. Solution Architecture

4.1 High-Level Design

REPL -> Lexer -> Parser -> Planner -> Executor -> Storage/Index

4.2 Key Components

Component	Responsibility	Key Decisions
Lexer/parser	syntax and AST	strict subset grammar
Planner	logical/physical mapping	rule-based first
Executor	iterator runtime	explicit tuple ownership
Explain	plan observability	shared plan representation

4.4 Data Structures (No Full Code)

AstNode { type, children, token_info }
PlanNode { op_type, children, access_path, predicates }
TupleView { schema_ref, value_cells }

4.4 Algorithm Overview

Parse -> build AST -> lower to plan -> execute operator tree.

5. Implementation Guide

5.1 Development Environment Setup

make tiny_sql

5.2 Project Structure

project/
├── src/sql/lexer
├── src/sql/parser
├── src/sql/planner
├── src/sql/executor
└── tests/sql

5.3 The Core Question You’re Answering

How does declarative SQL become deterministic dataflow over physical pages and indexes?

5.4 Concepts You Must Understand First

Lexer/parser fundamentals
Logical vs physical planning
Iterator runtime contracts

5.5 Questions to Guide Your Design

What SQL subset is in scope and why?
How do you guarantee explain/runtime parity?

5.6 Thinking Exercise

Plan one query with and without index and compare estimated costs.

5.7 The Interview Questions They’ll Ask

How does parse-plan-execute pipeline work?
Why separate logical and physical plans?
What is Volcano iterator model?
How do indexes affect plan choice?

5.8 Hints in Layers

Freeze grammar early.
Build AST printer before optimizer rules.
Add explain snapshot tests from day one.

5.9 Books That Will Help

Topic	Book	Chapter
Parser fundamentals	Compilers: Principles and Practice	lexical/syntax chapters
Data structure choices	Algorithms, Fourth Edition	search/join-related sections

5.10 Implementation Phases

Phase 1: parser + AST
Phase 2: planner + scans/filters
Phase 3: joins/aggregates + explain validation

5.11 Key Implementation Decisions

Decision	Options	Recommendation	Rationale
Planning style	cost-heavy/rule-based	rule-based first	faster correctness progress
Operator model	push/pull	pull (Volcano)	simpler composability

6. Testing Strategy

6.1 Test Categories

Unit: parser and planner transforms
Integration: end-to-end query corpus
Snapshot: explain output consistency

6.2 Critical Test Cases

Malformed SQL returns structured errors.
Explain output matches executed operator tree.
Predicate and projection correctness on fixture data.

6.3 Test Data

Deterministic schema and fixture rows with known query answers.

7. Common Pitfalls & Debugging

7.1 Frequent Mistakes

Pitfall	Symptom	Solution
AST-column mismatch	wrong result columns	schema binding stage
Explain/runtime drift	confusing diagnostics	single plan model

7.2 Debugging Strategies

AST and plan pretty-printers.
Operator event tracing (open/next/close).

7.3 Performance Traps

Premature optimizer complexity without stable correctness tests.

8. Extensions & Challenges

8.1 Beginner Extensions

Add ORDER BY on single column.

8.2 Intermediate Extensions

Add hash join option.

8.3 Advanced Extensions

Add simple cost model based on table/index stats.

9. Real-World Connections

9.1 Industry Applications

SQL execution pipelines are core of every relational DBMS.

SQLite architecture docs and source references.

9.3 Interview Relevance

Parsing/planning/execution questions are common in DB and compiler-adjacent roles.

10. Resources

10.1 Essential Reading

Compiler chapters for parsing.
SQLite architecture overview.

10.2 Video Resources

Query planner and execution model talks.

10.3 Tools & Documentation

AST visualizers, plan snapshot tools.

Previous: Project 5
Next: Project 7

11. Self-Assessment Checklist

11.1 Understanding

I can explain AST -> plan -> executor boundaries.
I can justify plan choice for simple predicates.

11.2 Implementation

Query corpus passes with deterministic outputs.
Explain and runtime behavior stay aligned.

11.3 Growth

I can extend grammar safely without breaking planner invariants.

12. Submission / Completion Criteria

Minimum Viable Completion:

Working SQL subset with parser and scans

Full Completion:

Plus joins/aggregates and explain consistency suite

Excellence:

Plus basic cost-based planning and performance benchmarks