Project 37: “The SQL Optimizer” — Database Performance

Attribute                  Value
File                       KIRO_CLI_LEARNING_PROJECTS.md
Main Programming Language  SQL
Coolness Level             Level 3: Genuinely Clever
Difficulty                 Level 3: Advanced
Knowledge Area             Database Performance

What you’ll build: A SQL performance analyzer that takes slow queries, runs EXPLAIN ANALYZE, identifies bottlenecks (missing indexes, inefficient joins, full table scans), and generates optimized queries with index recommendations.

Why it teaches Analysis: This project shows how Kiro can act like a senior DBA by reading query execution plans, understanding database internals, and suggesting data-driven optimizations. You’ll learn to leverage AI for performance tuning that would normally require years of experience.

Core challenges you’ll face:

  • Execution plan interpretation → Maps to understanding EXPLAIN output formats (PostgreSQL, MySQL, etc.)
  • Index recommendation → Maps to analyzing WHERE clauses, JOIN conditions, and ORDER BY columns
  • Query rewriting → Maps to CTE optimization, subquery elimination, join order changes
  • Cost estimation validation → Maps to verifying that suggested changes actually reduce query cost

Real World Outcome

You’ll have a CLI tool that analyzes slow queries and produces actionable optimization recommendations with before/after performance metrics:

$ cat slow_queries.sql
-- Query 1: User dashboard
SELECT u.id, u.email, u.created_at,
       COUNT(DISTINCT o.id) as order_count,
       SUM(o.total_amount) as total_spent,
       MAX(o.created_at) as last_order_date
FROM users u
LEFT JOIN orders o ON u.id = o.user_id
WHERE u.created_at > '2023-01-01'
  AND o.status IN ('completed', 'shipped')
GROUP BY u.id, u.email, u.created_at
ORDER BY total_spent DESC
LIMIT 100;

$ kiro sql-optimize --file slow_queries.sql --database postgres --explain

[Kiro CLI Session]
🔍 SQL Optimizer Agent
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

📋 Analyzing Query 1: User dashboard
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Step 1: Running EXPLAIN ANALYZE...

QUERY PLAN (Original)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Limit  (cost=45623.22..45623.47 rows=100 width=56)
       (actual time=3847.234..3847.289 rows=100 loops=1)
  ->  Sort  (cost=45623.22..45891.78 rows=107424 width=56)
            (actual time=3847.232..3847.261 rows=100 loops=1)
        Sort Key: (sum(o.total_amount)) DESC
        Sort Method: top-N heapsort  Memory: 32kB
        ->  HashAggregate  (cost=38456.89..39531.13 rows=107424 width=56)
                           (actual time=3621.445..3742.667 rows=125000 loops=1)
              Group Key: u.id, u.email, u.created_at
              ->  Hash Left Join  (cost=12456.00..34789.56 rows=244488 width=48)
                                  (actual time=124.567..2345.789 rows=456000 loops=1)
                    Hash Cond: (u.id = o.user_id)
                    ->  Seq Scan on users u  (cost=0.00..3567.00 rows=125000 width=24)
                                             (actual time=0.012..45.678 rows=125000 loops=1)
                          Filter: (created_at > '2023-01-01'::date)
                          Rows Removed by Filter: 75000
                    ->  Hash  (cost=8934.00..8934.00 rows=244488 width=24)
                             (actual time=123.456..123.456 rows=456000 loops=1)
                          Buckets: 65536  Batches: 8  Memory Usage: 4567kB
                          ->  Seq Scan on orders o  (cost=0.00..8934.00 rows=244488 width=24)
                                                    (actual time=0.023..67.891 rows=456000 loops=1)
                                Filter: (status = ANY ('{completed,shipped}'::text[]))
                                Rows Removed by Filter: 144000

Planning Time: 1.234 ms
Execution Time: 3847.567 ms  ⚠️  SLOW!

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Step 2: Identifying Bottlenecks...

🚨 CRITICAL ISSUES FOUND:
  1. Sequential Scan on 'users' (scans 200k rows, Filter removes 75k)
     → Missing index on users.created_at
     Cost impact: ~45ms per scan

  2. Sequential Scan on 'orders' (600k rows, Filter removes 144k)
     → Missing index on orders.status
     Cost impact: ~68ms per scan

  3. HashAggregate on 107k rows
     → Could be reduced with filtered join
     Cost impact: ~121ms

  4. Hash Join with 8 batches (memory spill to disk)
     → work_mem too low (current: 4MB, needed: ~8MB)
     Cost impact: ~200ms (I/O overhead)

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Step 3: Generating Index Recommendations...

CREATE INDEX CONCURRENTLY idx_users_created_at
  ON users(created_at)
  WHERE created_at > '2020-01-01';  -- Partial index (smaller, faster)

CREATE INDEX CONCURRENTLY idx_orders_user_status
  ON orders(user_id, status)
  INCLUDE (total_amount, created_at);  -- Covering index (no table lookup)

Estimated index sizes:
  • idx_users_created_at: 3.2 MB
  • idx_orders_user_status: 28.4 MB

Total disk space required: 31.6 MB

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Step 4: Optimized Query (v1)...

-- Optimized version: filtered join + covering index
WITH recent_users AS (
  SELECT id, email, created_at
  FROM users
  WHERE created_at > '2023-01-01'  -- Index scan on idx_users_created_at
)
SELECT ru.id, ru.email, ru.created_at,
       COUNT(DISTINCT o.id) as order_count,
       SUM(o.total_amount) as total_spent,
       MAX(o.created_at) as last_order_date
FROM recent_users ru
LEFT JOIN orders o
  ON ru.id = o.user_id
  AND o.status IN ('completed', 'shipped')  -- Index scan on idx_orders_user_status
GROUP BY ru.id, ru.email, ru.created_at
ORDER BY total_spent DESC
LIMIT 100;

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Step 5: Running EXPLAIN ANALYZE (Optimized)...

QUERY PLAN (Optimized)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Limit  (cost=8934.56..8934.81 rows=100 width=56)
       (actual time=287.123..287.156 rows=100 loops=1)
  ->  Sort  (cost=8934.56..9203.12 rows=107424 width=56)
            (actual time=287.121..287.142 rows=100 loops=1)
        Sort Key: (sum(o.total_amount)) DESC
        Sort Method: top-N heapsort  Memory: 32kB
        ->  HashAggregate  (cost=6234.89..7309.13 rows=107424 width=56)
                           (actual time=245.678..267.891 rows=125000 loops=1)
              Group Key: ru.id, ru.email, ru.created_at
              ->  Hash Left Join  (cost=3456.00..4789.56 rows=244488 width=48)
                                  (actual time=34.567..178.234 rows=456000 loops=1)
                    Hash Cond: (ru.id = o.user_id)
                    ->  Index Scan using idx_users_created_at on users  (cost=0.29..845.00 rows=125000 width=24)
                                                                        (actual time=0.012..12.345 rows=125000 loops=1)
                          Index Cond: (created_at > '2023-01-01'::date)
                    ->  Hash  (cost=2345.00..2345.00 rows=244488 width=24)
                             (actual time=34.123..34.123 rows=456000 loops=1)
                          Buckets: 65536  Batches: 1  Memory Usage: 8234kB  ✓
                          ->  Index Scan using idx_orders_user_status on orders  ✓
                                (cost=0.43..2345.00 rows=244488 width=24)
                                (actual time=0.023..18.456 rows=456000 loops=1)
                                Index Cond: (status = ANY ('{completed,shipped}'::text[]))

Planning Time: 0.987 ms
Execution Time: 287.456 ms  ✅  13.4x FASTER!

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

📊 Performance Summary:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
  Before:  3847.567 ms
  After:    287.456 ms
  Speedup:  13.4x

  Improvements:
    • Sequential Scans eliminated: 2
    • Index Scans added: 2
    • Memory spill eliminated: hash join now fits in work_mem (8 batches → 1 batch)
    • Rows filtered earlier: 144k (at index level)

✅ Optimizations applied successfully!

Would you like to:
  1. Apply indexes to database? (CREATE INDEX commands)
  2. See alternative query plans?
  3. Analyze another query?

Exactly what happens:

  1. Kiro runs EXPLAIN ANALYZE on your slow query
  2. It identifies bottlenecks: sequential scans, missing indexes, inefficient joins
  3. It recommends specific indexes with size estimates
  4. It rewrites the query to leverage those indexes
  5. It re-runs EXPLAIN ANALYZE and shows before/after metrics
  6. You get actionable SQL commands to apply the optimizations
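
Stitched together, the whole loop is small. Below is a minimal Python sketch of the orchestration, assuming PostgreSQL and psycopg2; every helper name (walk_plan, find_bottlenecks, recommend_indexes, rewrite_query) is a placeholder fleshed out in the hints later in this project, not a real Kiro API:

# Hypothetical orchestration of the analyze -> recommend -> verify loop.
import json
import psycopg2

def explain_analyze(cur, sql):
    """Run EXPLAIN (ANALYZE, FORMAT JSON) and return the parsed top-level entry."""
    cur.execute(f"EXPLAIN (ANALYZE, FORMAT JSON) {sql}")
    row = cur.fetchone()[0]
    # psycopg2 may hand back the json column already parsed; handle both cases.
    return (row if isinstance(row, list) else json.loads(row))[0]

def optimize(conn, sql):
    with conn.cursor() as cur:
        before = explain_analyze(cur, sql)
        nodes = list(walk_plan(before["Plan"]))   # flattened plan tree (Hint 1)
        issues = find_bottlenecks(nodes)          # pattern rules (Hint 2)
        index_sql = recommend_indexes(sql)        # heuristics (Hint 3)
        rewritten = rewrite_query(sql, issues)    # hypothetical rewriter (Step 4)
        after = explain_analyze(cur, rewritten)   # safe index testing: Hint 4
        print(f"before: {before['Execution Time']:.1f} ms, "
              f"after: {after['Execution Time']:.1f} ms")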

The Core Question You’re Answering

“How do you teach an AI to think like a database query optimizer and recommend performance improvements based on execution plans?”

This is about giving Kiro the mental model of a senior DBA:

  • Understanding what “Seq Scan” vs “Index Scan” means for performance
  • Recognizing when a query will cause memory spills or hash join overflow
  • Knowing which columns benefit from indexes based on WHERE/JOIN/ORDER BY usage
  • Validating that optimizations actually improve cost (not just guessing)

Concepts You Must Understand First

Stop and research these before coding:

  1. EXPLAIN and EXPLAIN ANALYZE
    • What’s the difference between EXPLAIN (estimated) and EXPLAIN ANALYZE (actual)?
    • What are the key metrics: cost, rows, width, actual time, loops?
    • How do you interpret nested plan nodes?
    • Book Reference: “PostgreSQL: Up and Running” by Regina Obe & Leo Hsu - Ch. 10
  2. Index Types and Use Cases
    • B-tree vs Hash vs GiST vs GIN indexes
    • When is a partial index better than a full index?
    • What is a covering index (INCLUDE columns)?
    • Why doesn’t PostgreSQL use an index for LIKE '%pattern%'?
    • Book Reference: “SQL Performance Explained” by Markus Winand - Ch. 2-3
  3. Query Execution Plans
    • What is a Sequential Scan and when is it acceptable?
    • What is a Hash Join vs Nested Loop vs Merge Join?
    • What does “Rows Removed by Filter” indicate?
    • What is the significance of “Batches” in a Hash operation?
    • Book Reference: “PostgreSQL Query Optimization” by Henrietta Dombrovskaya - Ch. 4
  4. Cost-Based Optimization
    • How does the planner estimate query cost?
    • What are random_page_cost and seq_page_cost?
    • Why might the planner choose a slower plan than you expect?
    • Book Reference: “Database Internals” by Alex Petrov - Ch. 12
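
To make item 4 concrete: for a plain sequential scan, the planner's estimate is roughly seq_page_cost * pages + cpu_tuple_cost * tuples, plus cpu_operator_cost per filter evaluation. The sketch below back-solves the users scan from the session above (cost=0.00..3567.00); the page count is an assumption chosen so the arithmetic lands, not a measured value:

# Reconstructing "Seq Scan on users  (cost=0.00..3567.00)" from the cost model.
seq_page_cost     = 1.0     # default: cost of one sequential page read
cpu_tuple_cost    = 0.01    # default: cost of emitting one row
cpu_operator_cost = 0.0025  # default: cost of one filter-clause evaluation

pages  = 1_067    # assumed pg_class.relpages for users (back-solved)
tuples = 200_000  # 125k rows returned + 75k removed by the filter

total = (seq_page_cost * pages
         + cpu_tuple_cost * tuples
         + cpu_operator_cost * tuples)  # one comparison per tuple
print(total)  # 3567.0, matching the plan's estimated total cost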

Questions to Guide Your Design

Before implementing, think through these:

  1. Execution Plan Parsing
    • How will you parse the text output of EXPLAIN ANALYZE? (Regex? Structured JSON format?)
    • Different databases have different EXPLAIN formats (PostgreSQL vs MySQL vs SQLite). Will you support multiple?
    • How do you extract the key bottlenecks programmatically?
    • Should you use EXPLAIN (FORMAT JSON) for easier parsing?
  2. Index Recommendation Logic
    • How do you identify which columns should be indexed?
    • What if a column is already indexed but the index isn’t being used?
    • Should you recommend composite indexes for multi-column WHERE clauses?
    • How do you avoid recommending too many indexes (index bloat)?
  3. Query Rewriting
    • When should you suggest a CTE vs a subquery?
    • How do you know if reordering joins will help?
    • What if the query uses features Kiro doesn’t understand (window functions, recursive CTEs)?
    • Should you preserve query semantics exactly or allow minor changes?
  4. Validation
    • How do you ensure the optimized query returns the same results?
    • What if the index recommendations require more disk space than available?
    • Should you test on a staging database first?
    • How do you handle queries that are already well-optimized?

Thinking Exercise

Index Selection Challenge

You have this query:

SELECT p.id, p.title, p.created_at, u.username, COUNT(c.id) as comment_count
FROM posts p
JOIN users u ON p.author_id = u.id
LEFT JOIN comments c ON p.id = c.post_id
WHERE p.published = true
  AND p.created_at > NOW() - INTERVAL '30 days'
  AND u.is_active = true
ORDER BY p.created_at DESC, comment_count DESC
LIMIT 20;

EXPLAIN shows:

  • Sequential Scan on posts (2M rows, filtering 1.8M)
  • Sequential Scan on users (500k rows, filtering 100k)
  • Hash Join on comments (5M rows)

Questions to reason through:

  1. Which columns should you index and why?
  2. Should you create separate indexes or one composite index? (posts(published, created_at) vs posts(published) + posts(created_at))
  3. The ORDER BY uses comment_count which is computed. Can you index it?
  4. Would a partial index help here? (WHERE published = true)
  5. What’s the tradeoff between index size and query speed for rarely-used queries?

Proposed indexes:

CREATE INDEX idx_posts_published_created ON posts(published, created_at DESC);
CREATE INDEX idx_users_active ON users(is_active) WHERE is_active = true;
CREATE INDEX idx_comments_post ON comments(post_id);

Are these optimal? What would you change?

The Interview Questions They’ll Ask

Prepare to answer these:

  1. “Explain the difference between a Sequential Scan and an Index Scan. When might a Sequential Scan actually be faster?”
  2. “Your query uses WHERE user_id = 123, and there’s an index on user_id, but EXPLAIN shows a Seq Scan. Why?”
  3. “What is a covering index and when should you use one?”
  4. “How would you optimize a query with ORDER BY on a non-indexed column?”
  5. “What’s the difference between EXPLAIN and EXPLAIN ANALYZE? When would you use each?”
  6. “A junior developer adds 15 indexes to a table. What problems might this cause?”

Hints in Layers

Hint 1: Start with EXPLAIN Parsing. Run EXPLAIN (ANALYZE, FORMAT JSON) <query> to get structured output (the Actual… fields only appear with ANALYZE):

[
  {
    "Plan": {
      "Node Type": "Limit",
      "Startup Cost": 45623.22,
      "Total Cost": 45623.47,
      "Plan Rows": 100,
      "Plan Width": 56,
      "Actual Startup Time": 3847.234,
      "Actual Total Time": 3847.289,
      "Actual Rows": 100,
      "Actual Loops": 1,
      "Plans": [...]
    },
    "Planning Time": 1.234,
    "Execution Time": 3847.567
  }
]

Parse this JSON to extract:

  • Node types (“Seq Scan”, “Index Scan”, “Hash Join”)
  • Costs (Total Cost, Actual Total Time)
  • Filters (“Rows Removed by Filter”)
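
A recursive walk over the nested Plans arrays is enough to flatten the tree into records you can scan for problems. A minimal Python sketch (the dictionary keys read from the node are real EXPLAIN JSON keys; the flattened record shape is just one reasonable choice):

def walk_plan(node, depth=0):
    """Yield one flat record per plan node, depth-first."""
    yield {
        "depth": depth,
        "type": node["Node Type"],                      # e.g. "Seq Scan"
        "relation": node.get("Relation Name"),          # present on scan nodes
        "total_cost": node["Total Cost"],
        "actual_ms": node.get("Actual Total Time"),     # only with ANALYZE
        "loops": node.get("Actual Loops", 1),
        "rows_removed": node.get("Rows Removed by Filter", 0),
        "batches": node.get("Hash Batches", 1),         # present on Hash nodes
        "sort_method": node.get("Sort Method"),         # present on Sort nodes
    }
    for child in node.get("Plans", []):
        yield from walk_plan(child, depth + 1)

# plan = the parsed top-level entry from EXPLAIN (ANALYZE, FORMAT JSON)
nodes = list(walk_plan(plan["Plan"]))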

Hint 2: Bottleneck Detection. Look for these patterns in the execution plan:

  • Seq Scan with high Rows Removed by Filter → missing WHERE index
  • Hash Join with Batches > 1 → memory overflow, need more work_mem
  • Sort with Sort Method: external merge (Disk) → sort spilled to disk, need an index matching the ORDER BY
  • Nested Loop with high loop count → inefficient join, consider hash join
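
Each of these patterns becomes a predicate over the flattened records from Hint 1. A sketch; the thresholds are arbitrary starting points to tune against your own workloads, not established constants:

# Rules over the flattened plan nodes produced by walk_plan() in Hint 1.
RULES = [
    ("Seq Scan filtering many rows: likely missing a WHERE index",
     lambda n: n["type"] == "Seq Scan" and n["rows_removed"] > 10_000),
    ("Hash spilled to disk: raise work_mem or shrink the hashed input",
     lambda n: n["type"] == "Hash" and n["batches"] > 1),
    ("Sort spilled to disk: an index matching ORDER BY avoids the sort",
     lambda n: n["type"] == "Sort" and n["sort_method"] == "external merge"),
    ("Node re-executed many times: inner side of a costly nested loop",
     lambda n: n["loops"] > 1_000),
]

def find_bottlenecks(nodes):
    return [(msg, n) for n in nodes for msg, pred in RULES if pred(n)]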

Hint 3: Index Recommendation Algorithm

For each table in the query:
  1. Extract WHERE conditions → index these columns
  2. Extract JOIN conditions → index foreign keys
  3. Extract ORDER BY columns → index with DESC/ASC matching
  4. If all columns are in SELECT → recommend covering index (INCLUDE)
  5. If WHERE has multiple ANDs → recommend composite index
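
A Python sketch of steps 1-3 and 5, using the sqlglot parser so you don't hand-write SQL parsing. The priority ordering and the one-composite-index-per-table policy are illustrative heuristics of mine, and note that column.table yields the alias (u, o), which you would still resolve to the real table name:

from collections import defaultdict
import sqlglot
from sqlglot import expressions as exp

def recommend_indexes(sql):
    tree = sqlglot.parse_one(sql, dialect="postgres")
    cols = defaultdict(list)  # table alias -> candidate columns, priority order

    # Step 1: WHERE columns lead the index.
    where = tree.find(exp.Where)
    if where:
        for c in where.find_all(exp.Column):
            cols[c.table].append(c.name)
    # Step 2: JOIN condition columns (foreign keys) come next.
    for join in tree.find_all(exp.Join):
        for c in join.find_all(exp.Column):
            cols[c.table].append(c.name)
    # Step 3: ORDER BY columns last, so the sort can ride the index.
    order = tree.find(exp.Order)
    if order:
        for c in order.find_all(exp.Column):
            cols[c.table].append(c.name)

    # Step 5: one composite index per table, de-duplicated, order preserved.
    return [f"CREATE INDEX ON {table} ({', '.join(dict.fromkeys(names))});"
            for table, names in cols.items() if table]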

Hint 4: Validation Loop. After generating recommendations:

  1. Create the indexes in a transaction: BEGIN; CREATE INDEX ...; ROLLBACK; (test without committing; note that CREATE INDEX CONCURRENTLY cannot run inside a transaction block, so use plain CREATE INDEX for this test)
  2. Run EXPLAIN on the optimized query
  3. Compare costs: if new_cost < old_cost * 0.8 (20% improvement), accept
  4. If cost increased, reject the suggestion and explain why
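
A sketch of step 1 with psycopg2. Plain CREATE INDEX is transactional in PostgreSQL, which is exactly what makes the rollback trick cheap; save CONCURRENTLY for the final apply step. (If the HypoPG extension is available, hypothetical indexes give the same answer without building anything.)

import json
import psycopg2

def estimated_cost_with_index(dsn, index_sql, query):
    """Create a candidate index, EXPLAIN the query, then roll everything back."""
    conn = psycopg2.connect(dsn)
    try:
        with conn.cursor() as cur:
            cur.execute(index_sql)  # plain CREATE INDEX: transactional, but locks
            cur.execute(f"EXPLAIN (FORMAT JSON) {query}")
            row = cur.fetchone()[0]
            plan = row if isinstance(row, list) else json.loads(row)
            return plan[0]["Plan"]["Total Cost"]
    finally:
        conn.rollback()  # discard the test index; nothing is committed
        conn.close()

# Accept only if the estimate improves by at least 20% (step 3):
# if new_cost < old_cost * 0.8: keep the recommendation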

Books That Will Help

Topic                 Book                                                        Chapter
PostgreSQL EXPLAIN    “PostgreSQL: Up and Running” by Regina Obe & Leo Hsu        Ch. 10
Index fundamentals    “SQL Performance Explained” by Markus Winand                Ch. 2-3
Query optimization    “PostgreSQL Query Optimization” by Henrietta Dombrovskaya   Ch. 4-5
Database internals    “Database Internals” by Alex Petrov                         Ch. 12
Execution plans       “SQL Tuning” by Dan Tow                                     Ch. 3-4

Common Pitfalls and Debugging

Problem 1: “Index exists but isn’t being used”

  • Why: Index selectivity is too low (planner prefers Seq Scan for large result sets)
  • Fix: Check the predicate's selectivity: SELECT count(*) FILTER (WHERE <predicate>)::float / count(*) FROM table; if more than ~20% of rows match, a Seq Scan is often faster
  • Quick test: SET enable_seqscan = off; EXPLAIN <query>; (force index usage to compare)

Problem 2: “Optimized query returns different results”

  • Why: Query rewriting changed semantics (e.g., Step 4 above moves the o.status filter from WHERE into the LEFT JOIN's ON clause; users whose orders all fail the filter now appear with NULL aggregates instead of being dropped)
  • Fix: Use EXCEPT to find missing/extra rows: (original EXCEPT optimized) UNION (optimized EXCEPT original)
  • Quick test: Hash the results: SELECT md5(string_agg(t::text, '' ORDER BY t::text)) FROM (<query>) AS t; (the ORDER BY makes the hash deterministic)

Problem 3: “Index recommendation is way too large”

  • Why: Composite index on high-cardinality columns
  • Fix: Use partial indexes with WHERE clause to reduce size
  • Quick test: SELECT pg_size_pretty(pg_relation_size('index_name'));

Problem 4: “EXPLAIN shows Index Scan but query is still slow”

  • Why: Index Scan has high Actual Loops (nested loop with bad join order)
  • Fix: Reorder joins so smaller table is scanned first, or switch to Hash Join
  • Quick test: Check Actual Loops — if >1000, you’re doing a nested loop on too many rows

Definition of Done

  • Tool accepts a SQL query and database connection string
  • EXPLAIN ANALYZE output is captured and parsed (JSON format preferred)
  • Bottlenecks are identified: Seq Scans with high filter cost, missing indexes, memory spills
  • Index recommendations are generated with estimated size and creation SQL
  • Optimized query is generated with CTE or join reordering
  • Before/after EXPLAIN comparison shows measurable improvement (cost reduction >20%)
  • Results are validated: optimized query returns same row count and hash as original
  • Tool outputs a report with: bottlenecks, recommendations, speedup metrics
  • Optional: Tool can auto-apply indexes with CREATE INDEX CONCURRENTLY (no table locks)