Project 37: “The SQL Optimizer” — Database Performance

Attribute                  Value
File                       KIRO_CLI_LEARNING_PROJECTS.md
Main Programming Language  SQL
Coolness Level             Level 3: Genuinely Clever
Difficulty                 Level 3: Advanced
Knowledge Area             Database Performance

What you’ll build: A SQL performance analyzer that takes slow queries, runs EXPLAIN ANALYZE, identifies bottlenecks (missing indexes, inefficient joins, full table scans), and generates optimized queries with index recommendations.

Why it teaches Analysis: This project shows how Kiro can act like a senior DBA by reading query execution plans, understanding database internals, and suggesting data-driven optimizations. You’ll learn to leverage AI for performance tuning that would normally require years of experience.

Core challenges you’ll face:

  • Execution plan interpretation → Maps to understanding EXPLAIN output formats (PostgreSQL, MySQL, etc.)
  • Index recommendation → Maps to analyzing WHERE clauses, JOIN conditions, and ORDER BY columns
  • Query rewriting → Maps to CTE optimization, subquery elimination, join order changes
  • Cost estimation validation → Maps to verifying that suggested changes actually reduce query cost

Real World Outcome

You’ll have a CLI tool that analyzes slow queries and produces actionable optimization recommendations with before/after performance metrics:

$ cat slow_queries.sql
-- Query 1: User dashboard
SELECT u.id, u.email, u.created_at,
       COUNT(DISTINCT o.id) as order_count,
       SUM(o.total_amount) as total_spent,
       MAX(o.created_at) as last_order_date
FROM users u
LEFT JOIN orders o ON u.id = o.user_id
WHERE u.created_at > '2023-01-01'
  AND o.status IN ('completed', 'shipped')
GROUP BY u.id, u.email, u.created_at
ORDER BY total_spent DESC
LIMIT 100;

$ kiro sql-optimize --file slow_queries.sql --database postgres --explain

[Kiro CLI Session]
🔍 SQL Optimizer Agent
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

📋 Analyzing Query 1: User dashboard
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Step 1: Running EXPLAIN ANALYZE...

QUERY PLAN (Original)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Limit  (cost=45623.22..45623.47 rows=100 width=56)
       (actual time=3847.234..3847.289 rows=100 loops=1)
  ->  Sort  (cost=45623.22..45891.78 rows=107424 width=56)
            (actual time=3847.232..3847.261 rows=100 loops=1)
        Sort Key: (sum(o.total_amount)) DESC
        Sort Method: top-N heapsort  Memory: 32kB
        ->  HashAggregate  (cost=38456.89..39531.13 rows=107424 width=56)
                           (actual time=3621.445..3742.667 rows=125000 loops=1)
              Group Key: u.id, u.email, u.created_at
              ->  Hash Left Join  (cost=12456.00..34789.56 rows=244488 width=48)
                                  (actual time=124.567..2345.789 rows=456000 loops=1)
                    Hash Cond: (u.id = o.user_id)
                    ->  Seq Scan on users u  (cost=0.00..3567.00 rows=125000 width=24)
                                             (actual time=0.012..45.678 rows=125000 loops=1)
                          Filter: (created_at > '2023-01-01'::date)
                          Rows Removed by Filter: 75000
                    ->  Hash  (cost=8934.00..8934.00 rows=244488 width=24)
                             (actual time=123.456..123.456 rows=456000 loops=1)
                          Buckets: 65536  Batches: 8  Memory Usage: 4567kB
                          ->  Seq Scan on orders o  (cost=0.00..8934.00 rows=244488 width=24)
                                                    (actual time=0.023..67.891 rows=456000 loops=1)
                                Filter: (status = ANY ('{completed,shipped}'::text[]))
                                Rows Removed by Filter: 144000

Planning Time: 1.234 ms
Execution Time: 3847.567 ms  ⚠️  SLOW!

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Step 2: Identifying Bottlenecks...

🚨 CRITICAL ISSUES FOUND:
  1. Sequential Scan on 'users' (scans 200k rows, Filter removes 75k)
     → Missing index on users.created_at
     Cost impact: ~45ms per scan

  2. Sequential Scan on 'orders' (600k rows, Filter removes 144k)
     → Missing index on orders.status
     Cost impact: ~68ms per scan

  3. HashAggregate on 107k rows
     → Could be reduced with filtered join
     Cost impact: ~121ms

  4. Hash Join with 8 batches (memory spill to disk)
     → work_mem too low (current: 4MB, needed: ~8MB)
     Cost impact: ~200ms (I/O overhead)

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Step 3: Generating Index Recommendations...

CREATE INDEX CONCURRENTLY idx_users_created_at
  ON users(created_at)
  WHERE created_at > '2020-01-01';  -- Partial index (smaller, faster)

CREATE INDEX CONCURRENTLY idx_orders_user_status
  ON orders(user_id, status)
  INCLUDE (total_amount, created_at);  -- Covering index (no table lookup)

Estimated index sizes:
  • idx_users_created_at: 3.2 MB
  • idx_orders_user_status: 28.4 MB

Total disk space required: 31.6 MB

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Step 4: Optimized Query (v1)...

-- Optimized version: filtered join + covering index
WITH recent_users AS (
  SELECT id, email, created_at
  FROM users
  WHERE created_at > '2023-01-01'  -- Index scan on idx_users_created_at
)
SELECT ru.id, ru.email, ru.created_at,
       COUNT(DISTINCT o.id) as order_count,
       SUM(o.total_amount) as total_spent,
       MAX(o.created_at) as last_order_date
FROM recent_users ru
LEFT JOIN orders o
  ON ru.id = o.user_id
  AND o.status IN ('completed', 'shipped')  -- Index scan on idx_orders_user_status
GROUP BY ru.id, ru.email, ru.created_at
ORDER BY total_spent DESC
LIMIT 100;

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Step 5: Running EXPLAIN ANALYZE (Optimized)...

QUERY PLAN (Optimized)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Limit  (cost=8934.56..8934.81 rows=100 width=56)
       (actual time=287.123..287.156 rows=100 loops=1)
  ->  Sort  (cost=8934.56..9203.12 rows=107424 width=56)
            (actual time=287.121..287.142 rows=100 loops=1)
        Sort Key: (sum(o.total_amount)) DESC
        Sort Method: top-N heapsort  Memory: 32kB
        ->  HashAggregate  (cost=6234.89..7309.13 rows=107424 width=56)
                           (actual time=245.678..267.891 rows=125000 loops=1)
              Group Key: ru.id, ru.email, ru.created_at
              ->  Hash Left Join  (cost=3456.00..4789.56 rows=244488 width=48)
                                  (actual time=34.567..178.234 rows=456000 loops=1)
                    Hash Cond: (ru.id = o.user_id)
                    ->  Index Scan using idx_users_created_at on users  (cost=0.29..845.00 rows=125000 width=24)
                                                                        (actual time=0.012..12.345 rows=125000 loops=1)
                          Index Cond: (created_at > '2023-01-01'::date)
                    ->  Hash  (cost=2345.00..2345.00 rows=244488 width=24)
                             (actual time=34.123..34.123 rows=456000 loops=1)
                          Buckets: 65536  Batches: 1  Memory Usage: 8234kB  ✓
                          ->  Index Scan using idx_orders_user_status on orders  ✓
                                (cost=0.43..2345.00 rows=244488 width=24)
                                (actual time=0.023..18.456 rows=456000 loops=1)
                                Index Cond: (status = ANY ('{completed,shipped}'::text[]))

Planning Time: 0.987 ms
Execution Time: 287.456 ms  ✅  13.4x FASTER!

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

📊 Performance Summary:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
  Before:  3847.567 ms
  After:    287.456 ms
  Speedup:  13.4x

  Improvements:
    • Sequential Scans eliminated: 2
    • Index Scans added: 2
    • Memory spill eliminated: hash join now fits in work_mem (8 batches → 1 batch)
    • Rows filtered earlier: 144k (at index level)

✅ Optimizations applied successfully!

Would you like to:
  1. Apply indexes to database? (CREATE INDEX commands)
  2. See alternative query plans?
  3. Analyze another query?

Exactly what happens:

  1. Kiro runs EXPLAIN ANALYZE on your slow query
  2. It identifies bottlenecks: sequential scans, missing indexes, inefficient joins
  3. It recommends specific indexes with size estimates
  4. It rewrites the query to leverage those indexes
  5. It re-runs EXPLAIN ANALYZE and shows before/after metrics
  6. You get actionable SQL commands to apply the optimizations
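
Stitched together, the whole loop is small. Below is a minimal Python sketch of the orchestration, assuming PostgreSQL and psycopg2; every helper name (walk_plan, find_bottlenecks, recommend_indexes, rewrite_query) is a placeholder fleshed out in the hints later in this project, not a real Kiro API:

# Hypothetical orchestration of the analyze -> recommend -> verify loop.
import json
import psycopg2

def explain_analyze(cur, sql):
    """Run EXPLAIN (ANALYZE, FORMAT JSON) and return the parsed top-level entry."""
    cur.execute(f"EXPLAIN (ANALYZE, FORMAT JSON) {sql}")
    row = cur.fetchone()[0]
    # psycopg2 may hand back the json column already parsed; handle both cases.
    return (row if isinstance(row, list) else json.loads(row))[0]

def optimize(conn, sql):
    with conn.cursor() as cur:
        before = explain_analyze(cur, sql)
        nodes = list(walk_plan(before["Plan"]))   # flattened plan tree (Hint 1)
        issues = find_bottlenecks(nodes)          # pattern rules (Hint 2)
        index_sql = recommend_indexes(sql)        # heuristics (Hint 3)
        rewritten = rewrite_query(sql, issues)    # hypothetical rewriter (Step 4)
        after = explain_analyze(cur, rewritten)   # safe index testing: Hint 4
        print(f"before: {before['Execution Time']:.1f} ms, "
              f"after: {after['Execution Time']:.1f} ms")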

The Core Question You’re Answering

“How do you teach an AI to think like a database query optimizer and recommend performance improvements based on execution plans?”

This is about giving Kiro the mental model of a senior DBA:

  • Understanding what “Seq Scan” vs “Index Scan” means for performance
  • Recognizing when a query will cause memory spills or hash join overflow
  • Knowing which columns benefit from indexes based on WHERE/JOIN/ORDER BY usage
  • Validating that optimizations actually improve cost (not just guessing)

Concepts You Must Understand First

Stop and research these before coding:

  1. EXPLAIN and EXPLAIN ANALYZE
    • What’s the difference between EXPLAIN (estimated) and EXPLAIN ANALYZE (actual)?
    • What are the key metrics: cost, rows, width, actual time, loops?
    • How do you interpret nested plan nodes?
    • Book Reference: “PostgreSQL: Up and Running” by Regina Obe & Leo Hsu - Ch. 10
  2. Index Types and Use Cases
    • B-tree vs Hash vs GiST vs GIN indexes
    • When is a partial index better than a full index?
    • What is a covering index (INCLUDE columns)?
    • Why doesn’t PostgreSQL use an index for LIKE '%pattern%'?
    • Book Reference: “SQL Performance Explained” by Markus Winand - Ch. 2-3
  3. Query Execution Plans
    • What is a Sequential Scan and when is it acceptable?
    • What is a Hash Join vs Nested Loop vs Merge Join?
    • What does “Rows Removed by Filter” indicate?
    • What is the significance of “Batches” in a Hash operation?
    • Book Reference: “PostgreSQL Query Optimization” by Henrietta Dombrovskaya - Ch. 4
  4. Cost-Based Optimization
    • How does the planner estimate query cost?
    • What are random_page_cost and seq_page_cost?
    • Why might the planner choose a slower plan than you expect?
    • Book Reference: “Database Internals” by Alex Petrov - Ch. 12
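
To make item 4 concrete: for a plain sequential scan, the planner's estimate is roughly seq_page_cost * pages + cpu_tuple_cost * tuples, plus cpu_operator_cost per filter evaluation. The sketch below back-solves the users scan from the session above (cost=0.00..3567.00); the page count is an assumption chosen so the arithmetic lands, not a measured value:

# Reconstructing "Seq Scan on users  (cost=0.00..3567.00)" from the cost model.
seq_page_cost     = 1.0     # default: cost of one sequential page read
cpu_tuple_cost    = 0.01    # default: cost of emitting one row
cpu_operator_cost = 0.0025  # default: cost of one filter-clause evaluation

pages  = 1_067    # assumed pg_class.relpages for users (back-solved)
tuples = 200_000  # 125k rows returned + 75k removed by the filter

total = (seq_page_cost * pages
         + cpu_tuple_cost * tuples
         + cpu_operator_cost * tuples)  # one comparison per tuple
print(total)  # 3567.0, matching the plan's estimated total cost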

Questions to Guide Your Design

Before implementing, think through these:

  1. Execution Plan Parsing
    • How will you parse the text output of EXPLAIN ANALYZE? (Regex? Structured JSON format?)
    • Different databases have different EXPLAIN formats (PostgreSQL vs MySQL vs SQLite). Will you support multiple?
    • How do you extract the key bottlenecks programmatically?
    • Should you use EXPLAIN (FORMAT JSON) for easier parsing?
  2. Index Recommendation Logic
    • How do you identify which columns should be indexed?
    • What if a column is already indexed but the index isn’t being used?
    • Should you recommend composite indexes for multi-column WHERE clauses?
    • How do you avoid recommending too many indexes (index bloat)?
  3. Query Rewriting
    • When should you suggest a CTE vs a subquery?
    • How do you know if reordering joins will help?
    • What if the query uses features Kiro doesn’t understand (window functions, recursive CTEs)?
    • Should you preserve query semantics exactly or allow minor changes?
  4. Validation
    • How do you ensure the optimized query returns the same results?
    • What if the index recommendations require more disk space than available?
    • Should you test on a staging database first?
    • How do you handle queries that are already well-optimized?

Thinking Exercise

Index Selection Challenge

You have this query:

SELECT p.id, p.title, p.created_at, u.username, COUNT(c.id) as comment_count
FROM posts p
JOIN users u ON p.author_id = u.id
LEFT JOIN comments c ON p.id = c.post_id
WHERE p.published = true
  AND p.created_at > NOW() - INTERVAL '30 days'
  AND u.is_active = true
ORDER BY p.created_at DESC, comment_count DESC
LIMIT 20;

EXPLAIN shows:

  • Sequential Scan on posts (2M rows, filtering 1.8M)
  • Sequential Scan on users (500k rows, filtering 100k)
  • Hash Join on comments (5M rows)

Questions to reason through:

  1. Which columns should you index and why?
  2. Should you create separate indexes or one composite index? (posts(published, created_at) vs posts(published) + posts(created_at))
  3. The ORDER BY uses comment_count which is computed. Can you index it?
  4. Would a partial index help here? (WHERE published = true)
  5. What’s the tradeoff between index size and query speed for rarely-used queries?

Proposed indexes:

CREATE INDEX idx_posts_published_created ON posts(published, created_at DESC);
CREATE INDEX idx_users_active ON users(is_active) WHERE is_active = true;
CREATE INDEX idx_comments_post ON comments(post_id);

Are these optimal? What would you change?

The Interview Questions They’ll Ask

Prepare to answer these:

  1. “Explain the difference between a Sequential Scan and an Index Scan. When might a Sequential Scan actually be faster?”
  2. “Your query uses WHERE user_id = 123, and there’s an index on user_id, but EXPLAIN shows a Seq Scan. Why?”
  3. “What is a covering index and when should you use one?”
  4. “How would you optimize a query with ORDER BY on a non-indexed column?”
  5. “What’s the difference between EXPLAIN and EXPLAIN ANALYZE? When would you use each?”
  6. “A junior developer adds 15 indexes to a table. What problems might this cause?”

Hints in Layers

Hint 1: Start with EXPLAIN Parsing. Run EXPLAIN (ANALYZE, FORMAT JSON) <query> to get structured output (the Actual… fields only appear with ANALYZE):

[
  {
    "Plan": {
      "Node Type": "Limit",
      "Startup Cost": 45623.22,
      "Total Cost": 45623.47,
      "Plan Rows": 100,
      "Plan Width": 56,
      "Actual Startup Time": 3847.234,
      "Actual Total Time": 3847.289,
      "Actual Rows": 100,
      "Actual Loops": 1,
      "Plans": [...]
    },
    "Planning Time": 1.234,
    "Execution Time": 3847.567
  }
]

Parse this JSON to extract:

  • Node types (“Seq Scan”, “Index Scan”, “Hash Join”)
  • Costs (Total Cost, Actual Total Time)
  • Filters (“Rows Removed by Filter”)
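
A recursive walk over the nested Plans arrays is enough to flatten the tree into records you can scan for problems. A minimal Python sketch (the dictionary keys read from the node are real EXPLAIN JSON keys; the flattened record shape is just one reasonable choice):

def walk_plan(node, depth=0):
    """Yield one flat record per plan node, depth-first."""
    yield {
        "depth": depth,
        "type": node["Node Type"],                      # e.g. "Seq Scan"
        "relation": node.get("Relation Name"),          # present on scan nodes
        "total_cost": node["Total Cost"],
        "actual_ms": node.get("Actual Total Time"),     # only with ANALYZE
        "loops": node.get("Actual Loops", 1),
        "rows_removed": node.get("Rows Removed by Filter", 0),
        "batches": node.get("Hash Batches", 1),         # present on Hash nodes
        "sort_method": node.get("Sort Method"),         # present on Sort nodes
    }
    for child in node.get("Plans", []):
        yield from walk_plan(child, depth + 1)

# plan = the parsed top-level entry from EXPLAIN (ANALYZE, FORMAT JSON)
nodes = list(walk_plan(plan["Plan"]))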

Hint 2: Bottleneck Detection. Look for these patterns in the execution plan:

  • Seq Scan with high Rows Removed by Filter → missing WHERE index
  • Hash Join with Batches > 1 → memory overflow, need more work_mem
  • Sort with Sort Method: external merge (Disk) → sort spilled to disk, need an index matching the ORDER BY
  • Nested Loop with high loop count → inefficient join, consider hash join
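
Each of these patterns becomes a predicate over the flattened records from Hint 1. A sketch; the thresholds are arbitrary starting points to tune against your own workloads, not established constants:

# Rules over the flattened plan nodes produced by walk_plan() in Hint 1.
RULES = [
    ("Seq Scan filtering many rows: likely missing a WHERE index",
     lambda n: n["type"] == "Seq Scan" and n["rows_removed"] > 10_000),
    ("Hash spilled to disk: raise work_mem or shrink the hashed input",
     lambda n: n["type"] == "Hash" and n["batches"] > 1),
    ("Sort spilled to disk: an index matching ORDER BY avoids the sort",
     lambda n: n["type"] == "Sort" and n["sort_method"] == "external merge"),
    ("Node re-executed many times: inner side of a costly nested loop",
     lambda n: n["loops"] > 1_000),
]

def find_bottlenecks(nodes):
    return [(msg, n) for n in nodes for msg, pred in RULES if pred(n)]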

Hint 3: Index Recommendation Algorithm

For each table in the query:
  1. Extract WHERE conditions → index these columns
  2. Extract JOIN conditions → index foreign keys
  3. Extract ORDER BY columns → index with DESC/ASC matching
  4. If all columns are in SELECT → recommend covering index (INCLUDE)
  5. If WHERE has multiple ANDs → recommend composite index
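
A Python sketch of steps 1-3 and 5, using the sqlglot parser so you don't hand-write SQL parsing. The priority ordering and the one-composite-index-per-table policy are illustrative heuristics of mine, and note that column.table yields the alias (u, o), which you would still resolve to the real table name:

from collections import defaultdict
import sqlglot
from sqlglot import expressions as exp

def recommend_indexes(sql):
    tree = sqlglot.parse_one(sql, dialect="postgres")
    cols = defaultdict(list)  # table alias -> candidate columns, priority order

    # Step 1: WHERE columns lead the index.
    where = tree.find(exp.Where)
    if where:
        for c in where.find_all(exp.Column):
            cols[c.table].append(c.name)
    # Step 2: JOIN condition columns (foreign keys) come next.
    for join in tree.find_all(exp.Join):
        for c in join.find_all(exp.Column):
            cols[c.table].append(c.name)
    # Step 3: ORDER BY columns last, so the sort can ride the index.
    order = tree.find(exp.Order)
    if order:
        for c in order.find_all(exp.Column):
            cols[c.table].append(c.name)

    # Step 5: one composite index per table, de-duplicated, order preserved.
    return [f"CREATE INDEX ON {table} ({', '.join(dict.fromkeys(names))});"
            for table, names in cols.items() if table]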

Hint 4: Validation Loop. After generating recommendations:

  1. Create the indexes in a transaction: BEGIN; CREATE INDEX ...; ROLLBACK; (test without committing; note that CREATE INDEX CONCURRENTLY cannot run inside a transaction block, so use plain CREATE INDEX for this test)
  2. Run EXPLAIN on the optimized query
  3. Compare costs: if new_cost < old_cost * 0.8 (20% improvement), accept
  4. If cost increased, reject the suggestion and explain why
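
A sketch of step 1 with psycopg2. Plain CREATE INDEX is transactional in PostgreSQL, which is exactly what makes the rollback trick cheap; save CONCURRENTLY for the final apply step. (If the HypoPG extension is available, hypothetical indexes give the same answer without building anything.)

import json
import psycopg2

def estimated_cost_with_index(dsn, index_sql, query):
    """Create a candidate index, EXPLAIN the query, then roll everything back."""
    conn = psycopg2.connect(dsn)
    try:
        with conn.cursor() as cur:
            cur.execute(index_sql)  # plain CREATE INDEX: transactional, but locks
            cur.execute(f"EXPLAIN (FORMAT JSON) {query}")
            row = cur.fetchone()[0]
            plan = row if isinstance(row, list) else json.loads(row)
            return plan[0]["Plan"]["Total Cost"]
    finally:
        conn.rollback()  # discard the test index; nothing is committed
        conn.close()

# Accept only if the estimate improves by at least 20% (step 3):
# if new_cost < old_cost * 0.8: keep the recommendation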

Books That Will Help

Topic                 Book                                                        Chapter
PostgreSQL EXPLAIN    “PostgreSQL: Up and Running” by Regina Obe & Leo Hsu        Ch. 10
Index fundamentals    “SQL Performance Explained” by Markus Winand                Ch. 2-3
Query optimization    “PostgreSQL Query Optimization” by Henrietta Dombrovskaya   Ch. 4-5
Database internals    “Database Internals” by Alex Petrov                         Ch. 12
Execution plans       “SQL Tuning” by Dan Tow                                     Ch. 3-4

Common Pitfalls and Debugging

Problem 1: “Index exists but isn’t being used”

  • Why: Index selectivity is too low (planner prefers Seq Scan for large result sets)
  • Fix: Check the predicate's selectivity: SELECT count(*) FILTER (WHERE <predicate>)::float / count(*) FROM table; if more than ~20% of rows match, a Seq Scan is often faster
  • Quick test: SET enable_seqscan = off; EXPLAIN <query>; (force index usage to compare)

Problem 2: “Optimized query returns different results”

  • Why: Query rewriting changed semantics (e.g., Step 4 above moves the o.status filter from WHERE into the LEFT JOIN's ON clause; users whose orders all fail the filter now appear with NULL aggregates instead of being dropped)
  • Fix: Use EXCEPT to find missing/extra rows: (original EXCEPT optimized) UNION (optimized EXCEPT original)
  • Quick test: Hash the results: SELECT md5(string_agg(t::text, '' ORDER BY t::text)) FROM (<query>) AS t; (the ORDER BY makes the hash deterministic)

Problem 3: “Index recommendation is way too large”

  • Why: Composite index on high-cardinality columns
  • Fix: Use partial indexes with WHERE clause to reduce size
  • Quick test: SELECT pg_size_pretty(pg_relation_size('index_name'));

Problem 4: “EXPLAIN shows Index Scan but query is still slow”

  • Why: Index Scan has high Actual Loops (nested loop with bad join order)
  • Fix: Reorder joins so smaller table is scanned first, or switch to Hash Join
  • Quick test: Check Actual Loops — if >1000, you’re doing a nested loop on too many rows

Definition of Done

  • Tool accepts a SQL query and database connection string
  • EXPLAIN ANALYZE output is captured and parsed (JSON format preferred)
  • Bottlenecks are identified: Seq Scans with high filter cost, missing indexes, memory spills
  • Index recommendations are generated with estimated size and creation SQL
  • Optimized query is generated with CTE or join reordering
  • Before/after EXPLAIN comparison shows measurable improvement (cost reduction >20%)
  • Results are validated: optimized query returns same row count and hash as original
  • Tool outputs a report with: bottlenecks, recommendations, speedup metrics
  • Optional: Tool can auto-apply indexes with CREATE INDEX CONCURRENTLY (no table locks)