Project 6: SQL Query Engine (Parse -> Plan -> Execute)
Implement a practical SQL subset with deterministic parse-plan-execute behavior and explainable operator pipelines.
Quick Reference
| Attribute | Value |
|---|---|
| Difficulty | Expert |
| Time Estimate | 2-3 weeks |
| Main Programming Language | C (Alternatives: Rust, C++) |
| Alternative Programming Languages | Rust, C++ |
| Coolness Level | Level 5: Pure Magic |
| Business Potential | Level 4: Open Core Infrastructure |
| Prerequisites | Projects 1-5, parser basics |
| Key Topics | AST, planning, operator execution, explain consistency |
1. Learning Objectives
- Build a parser for a constrained SQL grammar.
- Translate AST into logical and physical plans.
- Execute plans through iterator-style operators.
- Produce faithful
EXPLAINoutput and validate runtime alignment.
2. All Theory Needed (Per-Concept Breakdown)
SQL Compilation Pipeline
Fundamentals
SQL engines compile declarative queries into executable operator plans. Parsing validates syntax and structure; planning chooses algorithms and access paths; execution runs operators over storage/index layers.
Deep Dive into the concept
A robust SQL subset requires strict scope boundaries. Start with commands and predicates you can execute correctly: simple projections, filters, inserts, updates, deletes, and limited joins/aggregates. The lexer emits tokens, parser produces AST, planner maps AST to logical operations, optimizer chooses physical operators, and executor runs pipelines.
The Volcano iterator model keeps runtime simple: operators expose open, next, and close. Parent pulls tuples from child, applies transformation, and emits onward. This makes control flow explicit and debuggable. Memory ownership for tuple data must be explicit; otherwise pipelines leak or return invalid pointers.
EXPLAIN is core observability. If explain tree differs from runtime behavior, debugging and tuning become unreliable. Use shared plan representation to avoid drift.
How this fit on projects
- Primary in this project.
- Integrated with transaction behavior in Project 7.
Definitions & key terms
- AST: Abstract syntax tree of SQL query.
- Logical plan: Engine-agnostic relational operations.
- Physical plan: Concrete algorithms and access paths.
- Volcano iterator: Pull-based operator interface.
Mental model diagram
SQL -> Lexer/Parser -> AST -> Planner -> Physical Plan -> Executor -> Result
How it works (step-by-step, with invariants and failure modes)
- Tokenize and parse SQL input.
- Resolve schema references and validate columns.
- Build logical plan then physical operators.
- Execute operator pipeline and emit rows.
Failure modes:
- AST misbinding to wrong column index.
- Explain/runtime mismatch.
Minimal concrete example
SELECT name FROM users WHERE age > 30
Project(name)
Filter(age > 30)
SeqScan(users)
Common misconceptions
- “Parser is the hard part.” -> Execution and plan correctness usually harder.
- “Small SQL subset is trivial.” -> Still complex across storage boundaries.
Check-your-understanding questions
- Why separate logical and physical planning?
- Why is explain fidelity essential?
- How can cardinality errors hurt performance?
Check-your-understanding answers
- Semantics and algorithm choices evolve independently.
- Debugging and tuning rely on truthful operator visibility.
- Wrong estimates choose bad algorithms and access paths.
Real-world applications
- SQLite architecture pipeline.
- PostgreSQL planner/executor model.
Where you’ll apply it
- This project and integrated TinyDB in Project 7.
References
Key insights
Treating SQL engine as compiler + runtime makes complexity manageable.
Summary
Reliable query execution comes from strict operator contracts and observability.
Homework/Exercises to practice the concept
- Draw AST and plan for three sample queries.
- Define operator contract tests for scan/filter/project.
- Add explain snapshot tests.
Solutions to the homework/exercises
- Include node types and child relationships explicitly.
- Test EOS behavior and tuple ownership boundaries.
- Use fixed schema and deterministic query corpus.
3. Project Specification
3.1 What You Will Build
A SQL REPL with parser, planner, executor, and explain output over your existing storage/index layers.
3.2 Functional Requirements
- Parse supported SQL commands.
- Execute table scans and indexed predicates.
- Support simple joins and aggregates.
- Return deterministic result sets.
- Provide explain plans.
3.3 Non-Functional Requirements
- Performance: Reasonable response times for fixture datasets.
- Reliability: No crashes on malformed SQL.
- Usability: Clear syntax errors and explain output.
3.4 Example Usage / Output
CREATE TABLE -> INSERT -> SELECT -> EXPLAIN
3.5 Data Formats / Schemas / Protocols
AST node schema, plan-node schema, executor tuple schema.
3.6 Edge Cases
Unknown column names, null handling, unsupported syntax, join cardinality explosion.
3.7 Real World Outcome
3.7.1 How to Run (Copy/Paste)
make tiny_sql
./tiny_sql
3.7.2 Golden Path Demo (Deterministic)
Fixed schema and deterministic query sequence.
3.7.3 If CLI: exact terminal transcript
$ ./tiny_sql
tinydb> CREATE TABLE users(id INT, name TEXT, age INT);
OK
tinydb> INSERT INTO users VALUES (1,'Ada',37),(2,'Lin',28);
OK rows=2
tinydb> EXPLAIN SELECT name FROM users WHERE age > 30;
Project(name)
Filter(age > 30)
SeqScan(users)
tinydb> SELECT name FROM users WHERE age > 30;
Ada
4. Solution Architecture
4.1 High-Level Design
REPL -> Lexer -> Parser -> Planner -> Executor -> Storage/Index
4.2 Key Components
| Component | Responsibility | Key Decisions |
|---|---|---|
| Lexer/parser | syntax and AST | strict subset grammar |
| Planner | logical/physical mapping | rule-based first |
| Executor | iterator runtime | explicit tuple ownership |
| Explain | plan observability | shared plan representation |
4.4 Data Structures (No Full Code)
AstNode { type, children, token_info }
PlanNode { op_type, children, access_path, predicates }
TupleView { schema_ref, value_cells }
4.4 Algorithm Overview
Parse -> build AST -> lower to plan -> execute operator tree.
5. Implementation Guide
5.1 Development Environment Setup
make tiny_sql
5.2 Project Structure
project/
├── src/sql/lexer
├── src/sql/parser
├── src/sql/planner
├── src/sql/executor
└── tests/sql
5.3 The Core Question You’re Answering
How does declarative SQL become deterministic dataflow over physical pages and indexes?
5.4 Concepts You Must Understand First
- Lexer/parser fundamentals
- Logical vs physical planning
- Iterator runtime contracts
5.5 Questions to Guide Your Design
- What SQL subset is in scope and why?
- How do you guarantee explain/runtime parity?
5.6 Thinking Exercise
Plan one query with and without index and compare estimated costs.
5.7 The Interview Questions They’ll Ask
- How does parse-plan-execute pipeline work?
- Why separate logical and physical plans?
- What is Volcano iterator model?
- How do indexes affect plan choice?
5.8 Hints in Layers
- Freeze grammar early.
- Build AST printer before optimizer rules.
- Add explain snapshot tests from day one.
5.9 Books That Will Help
| Topic | Book | Chapter |
|---|---|---|
| Parser fundamentals | Compilers: Principles and Practice | lexical/syntax chapters |
| Data structure choices | Algorithms, Fourth Edition | search/join-related sections |
5.10 Implementation Phases
- Phase 1: parser + AST
- Phase 2: planner + scans/filters
- Phase 3: joins/aggregates + explain validation
5.11 Key Implementation Decisions
| Decision | Options | Recommendation | Rationale |
|---|---|---|---|
| Planning style | cost-heavy/rule-based | rule-based first | faster correctness progress |
| Operator model | push/pull | pull (Volcano) | simpler composability |
6. Testing Strategy
6.1 Test Categories
- Unit: parser and planner transforms
- Integration: end-to-end query corpus
- Snapshot: explain output consistency
6.2 Critical Test Cases
- Malformed SQL returns structured errors.
- Explain output matches executed operator tree.
- Predicate and projection correctness on fixture data.
6.3 Test Data
Deterministic schema and fixture rows with known query answers.
7. Common Pitfalls & Debugging
7.1 Frequent Mistakes
| Pitfall | Symptom | Solution |
|---|---|---|
| AST-column mismatch | wrong result columns | schema binding stage |
| Explain/runtime drift | confusing diagnostics | single plan model |
7.2 Debugging Strategies
- AST and plan pretty-printers.
- Operator event tracing (
open/next/close).
7.3 Performance Traps
Premature optimizer complexity without stable correctness tests.
8. Extensions & Challenges
8.1 Beginner Extensions
- Add ORDER BY on single column.
8.2 Intermediate Extensions
- Add hash join option.
8.3 Advanced Extensions
- Add simple cost model based on table/index stats.
9. Real-World Connections
9.1 Industry Applications
SQL execution pipelines are core of every relational DBMS.
9.2 Related Open Source Projects
- SQLite architecture docs and source references.
9.3 Interview Relevance
Parsing/planning/execution questions are common in DB and compiler-adjacent roles.
10. Resources
10.1 Essential Reading
- Compiler chapters for parsing.
- SQLite architecture overview.
10.2 Video Resources
- Query planner and execution model talks.
10.3 Tools & Documentation
- AST visualizers, plan snapshot tools.
10.4 Related Projects in This Series
11. Self-Assessment Checklist
11.1 Understanding
- I can explain AST -> plan -> executor boundaries.
- I can justify plan choice for simple predicates.
11.2 Implementation
- Query corpus passes with deterministic outputs.
- Explain and runtime behavior stay aligned.
11.3 Growth
- I can extend grammar safely without breaking planner invariants.
12. Submission / Completion Criteria
Minimum Viable Completion:
- Working SQL subset with parser and scans
Full Completion:
- Plus joins/aggregates and explain consistency suite
Excellence:
- Plus basic cost-based planning and performance benchmarks