Project 37: “The SQL Optimizer” — Database Performance
| Attribute | Value |
|---|---|
| File | KIRO_CLI_LEARNING_PROJECTS.md |
| Main Programming Language | SQL |
| Coolness Level | Level 3: Genuinely Clever |
| Difficulty | Level 3: Advanced |
| Knowledge Area | Database Performance |
What you’ll build: A SQL performance analyzer that takes slow queries, runs EXPLAIN ANALYZE, identifies bottlenecks (missing indexes, inefficient joins, full table scans), and generates optimized queries with index recommendations.
Why it teaches Analysis: This project shows how Kiro can act like a senior DBA by reading query execution plans, understanding database internals, and suggesting data-driven optimizations. You’ll learn to leverage AI for performance tuning that would normally require years of experience.
Core challenges you’ll face:
- Execution plan interpretation → Maps to understanding EXPLAIN output formats (PostgreSQL, MySQL, etc.)
- Index recommendation → Maps to analyzing WHERE clauses, JOIN conditions, and ORDER BY columns
- Query rewriting → Maps to CTE optimization, subquery elimination, join order changes
- Cost estimation validation → Maps to verifying that suggested changes actually reduce query cost
Real World Outcome
You’ll have a CLI tool that analyzes slow queries and produces actionable optimization recommendations with before/after performance metrics:
$ cat slow_queries.sql
-- Query 1: User dashboard
SELECT u.id, u.email, u.created_at,
       COUNT(DISTINCT o.id) as order_count,
       SUM(o.total_amount) as total_spent,
       MAX(o.created_at) as last_order_date
FROM users u
LEFT JOIN orders o ON u.id = o.user_id
WHERE u.created_at > '2023-01-01'
  AND o.status IN ('completed', 'shipped')
GROUP BY u.id, u.email, u.created_at
ORDER BY total_spent DESC
LIMIT 100;
$ kiro sql-optimize --file slow_queries.sql --database postgres --explain
[Kiro CLI Session]
🔍 SQL Optimizer Agent
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
📋 Analyzing Query 1: User dashboard
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Step 1: Running EXPLAIN ANALYZE...
QUERY PLAN (Original)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Limit (cost=45623.22..45623.47 rows=100 width=56)
      (actual time=3847.234..3847.289 rows=100 loops=1)
  -> Sort (cost=45623.22..45891.78 rows=107424 width=56)
          (actual time=3847.232..3847.261 rows=100 loops=1)
     Sort Key: (sum(o.total_amount)) DESC
     Sort Method: top-N heapsort  Memory: 32kB
     -> HashAggregate (cost=38456.89..39531.13 rows=107424 width=56)
                      (actual time=3621.445..3742.667 rows=125000 loops=1)
        Group Key: u.id, u.email, u.created_at
        -> Hash Left Join (cost=12456.00..34789.56 rows=244488 width=48)
                          (actual time=124.567..2345.789 rows=456000 loops=1)
           Hash Cond: (u.id = o.user_id)
           -> Seq Scan on users u (cost=0.00..3567.00 rows=125000 width=24)
                                  (actual time=0.012..45.678 rows=125000 loops=1)
              Filter: (created_at > '2023-01-01'::date)
              Rows Removed by Filter: 75000
           -> Hash (cost=8934.00..8934.00 rows=244488 width=24)
                   (actual time=123.456..123.456 rows=456000 loops=1)
              Buckets: 65536  Batches: 8  Memory Usage: 4567kB
              -> Seq Scan on orders o (cost=0.00..8934.00 rows=244488 width=24)
                                      (actual time=0.023..67.891 rows=456000 loops=1)
                 Filter: (status = ANY ('{completed,shipped}'::text[]))
                 Rows Removed by Filter: 144000
Planning Time: 1.234 ms
Execution Time: 3847.567 ms  ⚠️ SLOW!
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Step 2: Identifying Bottlenecks...
🚨 CRITICAL ISSUES FOUND:
1. Sequential Scan on 'users' (125k rows, Filter removes 75k)
→ Missing index on users.created_at
Cost impact: ~45ms per scan
2. Sequential Scan on 'orders' (600k rows, Filter removes 144k)
→ Missing index on orders.status
Cost impact: ~68ms per scan
3. HashAggregate on 107k rows
→ Could be reduced with filtered join
Cost impact: ~121ms
4. Hash Join with 8 batches (memory spill to disk)
→ work_mem too low (current: 4MB, needed: ~8MB)
Cost impact: ~200ms (I/O overhead)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Step 3: Generating Index Recommendations...
CREATE INDEX CONCURRENTLY idx_users_created_at
ON users(created_at)
WHERE created_at > '2020-01-01'; -- Partial index (smaller, faster)
CREATE INDEX CONCURRENTLY idx_orders_user_status
ON orders(user_id, status)
INCLUDE (total_amount, created_at); -- Covering index (no table lookup)
Estimated index sizes:
• idx_users_created_at: 3.2 MB
• idx_orders_user_status: 28.4 MB
Total disk space required: 31.6 MB
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Step 4: Optimized Query (v1)...
-- Optimized version: filtered join + covering index
WITH recent_users AS (
    SELECT id, email, created_at
    FROM users
    WHERE created_at > '2023-01-01'  -- Index scan on idx_users_created_at
)
SELECT ru.id, ru.email, ru.created_at,
       COUNT(DISTINCT o.id) as order_count,
       SUM(o.total_amount) as total_spent,
       MAX(o.created_at) as last_order_date
FROM recent_users ru
LEFT JOIN orders o
       ON ru.id = o.user_id
      AND o.status IN ('completed', 'shipped')  -- Index scan on idx_orders_user_status
GROUP BY ru.id, ru.email, ru.created_at
ORDER BY total_spent DESC
LIMIT 100;
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Step 5: Running EXPLAIN ANALYZE (Optimized)...
QUERY PLAN (Optimized)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Limit (cost=8934.56..8934.81 rows=100 width=56)
      (actual time=287.123..287.156 rows=100 loops=1)
  -> Sort (cost=8934.56..9203.12 rows=107424 width=56)
          (actual time=287.121..287.142 rows=100 loops=1)
     Sort Key: (sum(o.total_amount)) DESC
     Sort Method: top-N heapsort  Memory: 32kB
     -> HashAggregate (cost=6234.89..7309.13 rows=107424 width=56)
                      (actual time=245.678..267.891 rows=125000 loops=1)
        Group Key: ru.id, ru.email, ru.created_at
        -> Hash Left Join (cost=3456.00..4789.56 rows=244488 width=48)
                          (actual time=34.567..178.234 rows=456000 loops=1)
           Hash Cond: (ru.id = o.user_id)
           -> Index Scan using idx_users_created_at on users ✓
                (cost=0.29..845.00 rows=125000 width=24)
                (actual time=0.012..12.345 rows=125000 loops=1)
              Index Cond: (created_at > '2023-01-01'::date)
           -> Hash (cost=2345.00..2345.00 rows=244488 width=24)
                   (actual time=34.123..34.123 rows=456000 loops=1)
              Buckets: 65536  Batches: 1  Memory Usage: 8234kB ✓
              -> Index Scan using idx_orders_user_status on orders ✓
                   (cost=0.43..2345.00 rows=244488 width=24)
                   (actual time=0.023..18.456 rows=456000 loops=1)
                 Index Cond: (status = ANY ('{completed,shipped}'::text[]))
Planning Time: 0.987 ms
Execution Time: 287.456 ms  ✅ 13.4x FASTER!
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
📊 Performance Summary:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Before: 3847.567 ms
After: 287.456 ms
Speedup: 13.4x
Improvements:
• Sequential Scans eliminated: 2
• Index Scans added: 2
• Memory spills reduced: 1 (8 batches → 1 batch)
• Rows filtered earlier: 144k (at index level)
✅ Optimizations applied successfully!
Would you like to:
1. Apply indexes to database? (CREATE INDEX commands)
2. See alternative query plans?
3. Analyze another query?
Exactly what happens:
- Kiro runs `EXPLAIN ANALYZE` on your slow query
- It identifies bottlenecks: sequential scans, missing indexes, inefficient joins
- It recommends specific indexes with size estimates
- It rewrites the query to leverage those indexes
- It re-runs `EXPLAIN ANALYZE` and shows before/after metrics
- You get actionable SQL commands to apply the optimizations
The Core Question You’re Answering
“How do you teach an AI to think like a database query optimizer and recommend performance improvements based on execution plans?”
This is about giving Kiro the mental model of a senior DBA:
- Understanding what “Seq Scan” vs “Index Scan” means for performance
- Recognizing when a query will cause memory spills or hash join overflow
- Knowing which columns benefit from indexes based on WHERE/JOIN/ORDER BY usage
- Validating that optimizations actually improve cost (not just guessing)
Concepts You Must Understand First
Stop and research these before coding:
- EXPLAIN and EXPLAIN ANALYZE (a short example follows this list)
  - What’s the difference between EXPLAIN (estimated) and EXPLAIN ANALYZE (actual)?
  - What are the key metrics: cost, rows, width, actual time, loops?
  - How do you interpret nested plan nodes?
  - Book Reference: “PostgreSQL: Up and Running” by Regina Obe & Leo Hsu - Ch. 10
- Index Types and Use Cases
  - B-tree vs Hash vs GiST vs GIN indexes
  - When is a partial index better than a full index?
  - What is a covering index (INCLUDE columns)?
  - Why doesn’t PostgreSQL use an index for `LIKE '%pattern%'`?
  - Book Reference: “SQL Performance Explained” by Markus Winand - Ch. 2-3
- Query Execution Plans
  - What is a Sequential Scan and when is it acceptable?
  - What is a Hash Join vs Nested Loop vs Merge Join?
  - What does “Rows Removed by Filter” indicate?
  - What is the significance of “Batches” in a Hash operation?
  - Book Reference: “PostgreSQL Query Optimization” by Henrietta Dombrovskaya - Ch. 4
- Cost-Based Optimization
  - How does the planner estimate query cost?
  - What are random_page_cost and seq_page_cost?
  - Why might the planner choose a slower plan than you expect?
  - Book Reference: “Database Internals” by Alex Petrov - Ch. 12
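To make the first and last concepts concrete, here is a minimal PostgreSQL sketch (reusing the `orders` table from the transcript above) contrasting the two commands and peeking at the planner’s cost knobs:

```sql
-- EXPLAIN shows the planner's estimates only; the query is never executed.
EXPLAIN SELECT * FROM orders WHERE status = 'completed';

-- EXPLAIN ANALYZE actually runs the query and reports real times and row counts.
-- BUFFERS adds cache statistics; FORMAT JSON yields machine-readable output.
EXPLAIN (ANALYZE, BUFFERS, FORMAT JSON)
SELECT * FROM orders WHERE status = 'completed';

-- Cost parameters the planner uses for its estimates:
SHOW seq_page_cost;     -- default 1.0
SHOW random_page_cost;  -- default 4.0
```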
Questions to Guide Your Design
Before implementing, think through these:
- Execution Plan Parsing
  - How will you parse the text output of EXPLAIN ANALYZE? (Regex? Structured JSON format?)
  - Different databases have different EXPLAIN formats (PostgreSQL vs MySQL vs SQLite). Will you support multiple?
  - How do you extract the key bottlenecks programmatically?
  - Should you use `EXPLAIN (FORMAT JSON)` for easier parsing?
- Index Recommendation Logic
  - How do you identify which columns should be indexed?
  - What if a column is already indexed but the index isn’t being used? (See the catalog query after this list.)
  - Should you recommend composite indexes for multi-column WHERE clauses?
  - How do you avoid recommending too many indexes (index bloat)?
- Query Rewriting
  - When should you suggest a CTE vs a subquery?
  - How do you know if reordering joins will help?
  - What if the query uses features Kiro doesn’t understand (window functions, recursive CTEs)?
  - Should you preserve query semantics exactly or allow minor changes?
- Validation
  - How do you ensure the optimized query returns the same results?
  - What if the index recommendations require more disk space than available?
  - Should you test on a staging database first?
  - How do you handle queries that are already well-optimized?
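On the “index exists but isn’t being used” question, PostgreSQL’s statistics views can answer it directly. A minimal sketch:

```sql
-- Indexes never used since the last stats reset. Large unused indexes are
-- either dead weight or a sign the planner is deliberately avoiding them.
SELECT schemaname,
       relname      AS table_name,
       indexrelname AS index_name,
       pg_size_pretty(pg_relation_size(indexrelid)) AS index_size
FROM pg_stat_user_indexes
WHERE idx_scan = 0
ORDER BY pg_relation_size(indexrelid) DESC;
```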
Thinking Exercise
Index Selection Challenge
You have this query:
SELECT p.id, p.title, p.created_at, u.username, COUNT(c.id) as comment_count
FROM posts p
JOIN users u ON p.author_id = u.id
LEFT JOIN comments c ON p.id = c.post_id
WHERE p.published = true
  AND p.created_at > NOW() - INTERVAL '30 days'
  AND u.is_active = true
ORDER BY p.created_at DESC, comment_count DESC
LIMIT 20;
EXPLAIN shows:
- Sequential Scan on `posts` (2M rows, filtering 1.8M)
- Sequential Scan on `users` (500k rows, filtering 100k)
- Hash Join on comments (5M rows)
Questions to reason through:
- Which columns should you index and why?
- Should you create separate indexes or one composite index? (`posts(published, created_at)` vs `posts(published)` + `posts(created_at)`)
- The ORDER BY uses `comment_count`, which is computed. Can you index it?
- Would a partial index help here? (`WHERE published = true`)
- What’s the tradeoff between index size and query speed for rarely-used queries?
Proposed indexes:
CREATE INDEX idx_posts_published_created ON posts(published, created_at DESC);
CREATE INDEX idx_users_active ON users(is_active) WHERE is_active = true;
CREATE INDEX idx_comments_post ON comments(post_id);
Are these optimal? What would you change?
The Interview Questions They’ll Ask
Prepare to answer these:
- “Explain the difference between a Sequential Scan and an Index Scan. When might a Sequential Scan actually be faster?”
- “Your query uses `WHERE user_id = 123` and there’s an index on `user_id`, but EXPLAIN shows a Seq Scan. Why?”
- “What is a covering index and when should you use one?”
- “How would you optimize a query with `ORDER BY` on a non-indexed column?”
- “What’s the difference between `EXPLAIN` and `EXPLAIN ANALYZE`? When would you use each?”
- “A junior developer adds 15 indexes to a table. What problems might this cause?”
Hints in Layers
Hint 1: Start with EXPLAIN Parsing
Run `EXPLAIN (ANALYZE, FORMAT JSON) <query>` to get structured output (the `Actual *` fields only appear with ANALYZE; the result is a one-element JSON array):
[
  {
    "Plan": {
      "Node Type": "Limit",
      "Startup Cost": 45623.22,
      "Total Cost": 45623.47,
      "Plan Rows": 100,
      "Plan Width": 56,
      "Actual Startup Time": 3847.234,
      "Actual Total Time": 3847.289,
      "Actual Rows": 100,
      "Actual Loops": 1,
      "Plans": [...]
    }
  }
]
Parse this JSON to extract:
- Node types (“Seq Scan”, “Index Scan”, “Hash Join”)
- Costs (Total Cost, Actual Total Time)
- Filters (“Rows Removed by Filter”)
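For instance, if each captured plan is stored in a staging table `plans(plan jsonb)` (a hypothetical name used in these sketches), the headline numbers fall out with plain jsonb operators:

```sql
-- Top-level metrics of the root plan node; EXPLAIN (FORMAT JSON) returns a
-- one-element array, hence the `-> 0` step before 'Plan'.
SELECT plan -> 0 -> 'Plan' ->> 'Node Type'                    AS root_node,
       (plan -> 0 -> 'Plan' ->> 'Total Cost')::numeric        AS estimated_cost,
       (plan -> 0 -> 'Plan' ->> 'Actual Total Time')::numeric AS actual_ms
FROM plans;
```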
Hint 2: Bottleneck Detection
Look for these patterns in the execution plan (a detection sketch follows the list):
- `Seq Scan` with high `Rows Removed by Filter` → missing WHERE index
- `Hash Join` with `Batches > 1` → memory overflow, need more work_mem
- `Sort` with `external merge Disk` → sort spilled to disk, need index for ORDER BY
- `Nested Loop` with high loop count → inefficient join, consider hash join
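Reusing the hypothetical `plans(plan jsonb)` staging table from the Hint 1 sketch, the first pattern can be detected in pure SQL by walking the plan tree with a recursive CTE (the threshold is illustrative):

```sql
-- Visit every node in the JSON plan and flag Seq Scans that discard many rows.
WITH RECURSIVE nodes(node) AS (
    SELECT plan -> 0 -> 'Plan' FROM plans   -- root plan node
    UNION ALL
    SELECT child
    FROM nodes, jsonb_array_elements(node -> 'Plans') AS child   -- recurse into children
)
SELECT node ->> 'Node Type'     AS node_type,
       node ->> 'Relation Name' AS relation,
       (node ->> 'Rows Removed by Filter')::bigint AS rows_removed
FROM nodes
WHERE node ->> 'Node Type' = 'Seq Scan'
  AND COALESCE((node ->> 'Rows Removed by Filter')::bigint, 0) > 10000;
```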
Hint 3: Index Recommendation Algorithm
For each table in the query:
1. Extract WHERE conditions → index these columns
2. Extract JOIN conditions → index foreign keys
3. Extract ORDER BY columns → index with DESC/ASC matching
4. If all columns are in SELECT → recommend covering index (INCLUDE)
5. If WHERE has multiple ANDs → recommend composite index
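As a worked instance of the heuristic, consider a hypothetical lookup such as `SELECT total_amount FROM orders WHERE user_id = $1 AND status = 'completed' ORDER BY created_at DESC`. Steps 1 and 5 give the composite key (user_id, status), step 3 appends the sort column, and step 4 adds the INCLUDE column:

```sql
-- Equality columns first (WHERE), then the sort column (ORDER BY); the selected
-- column rides along in INCLUDE so the scan never has to visit the table.
CREATE INDEX idx_orders_lookup
    ON orders (user_id, status, created_at DESC)
    INCLUDE (total_amount);
```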
Hint 4: Validation Loop
After generating recommendations (a transaction sketch follows this list):
- Create the indexes in a transaction: `BEGIN; CREATE INDEX ...; ROLLBACK;` (test without committing)
- Run `EXPLAIN` on the optimized query
- Compare costs: if `new_cost < old_cost * 0.8` (20% improvement), accept
- If cost increased, reject the suggestion and explain why
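A minimal sketch of one pass of that loop. One caveat the hint glosses over: `CREATE INDEX CONCURRENTLY` cannot run inside a transaction block, so the throwaway test has to use plain `CREATE INDEX` (the index name here is illustrative):

```sql
BEGIN;
-- Build the candidate index only for the duration of the test.
CREATE INDEX idx_users_created_at_test ON users (created_at);
-- Re-plan the target query and read off the new estimated cost.
EXPLAIN SELECT id, email FROM users WHERE created_at > '2023-01-01';
ROLLBACK;  -- discard the test index; nothing persists
```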
Books That Will Help
| Topic | Book | Chapter |
|---|---|---|
| PostgreSQL EXPLAIN | “PostgreSQL: Up and Running” by Regina Obe & Leo Hsu | Ch. 10 |
| Index fundamentals | “SQL Performance Explained” by Markus Winand | Ch. 2-3 |
| Query optimization | “PostgreSQL Query Optimization” by Henrietta Dombrovskaya | Ch. 4-5 |
| Database internals | “Database Internals” by Alex Petrov | Ch. 12 |
| Execution plans | “SQL Tuning” by Dan Tow | Ch. 3-4 |
Common Pitfalls and Debugging
Problem 1: “Index exists but isn’t being used”
- Why: Index selectivity is too low (planner prefers Seq Scan for large result sets)
- Fix: Check the filter’s selectivity, e.g. `SELECT (count(*) FILTER (WHERE <condition>))::numeric / count(*) FROM <table>;` — if more than ~20% of rows match, a Seq Scan is often faster
- Quick test: `SET enable_seqscan = off; EXPLAIN <query>;` (force index usage to compare)
Problem 2: “Optimized query returns different results”
- Why: Query rewriting changed semantics (e.g., moving WHERE from JOIN to outer query)
- Fix: Use `EXCEPT` to find missing/extra rows: `(original EXCEPT optimized) UNION (optimized EXCEPT original)`
- Quick test: Hash the results of both queries and compare (a corrected sketch follows)
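Both checks written out as a hedged sketch; `<original>`, `<optimized>`, and `<query>` are placeholders for the actual statements:

```sql
-- Symmetric difference: zero rows returned means both queries produce the same
-- bag of rows (EXCEPT ALL also catches differing duplicate counts).
(<original> EXCEPT ALL <optimized>)
UNION ALL
(<optimized> EXCEPT ALL <original>);

-- Order-independent result hash. The ORDER BY inside string_agg is essential:
-- without it the hash depends on row order and reports false mismatches.
-- (`row` is a reserved word in PostgreSQL, so use a different alias.)
SELECT md5(string_agg(t::text, '' ORDER BY t::text))
FROM ( <query> ) AS t;
```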
Problem 3: “Index recommendation is way too large”
- Why: Composite index on high-cardinality columns
- Fix: Use partial indexes with WHERE clause to reduce size
- Quick test: `SELECT pg_size_pretty(pg_relation_size('index_name'));`
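As an illustration (a variant of the covering index from Step 3, not the tool’s exact output), a partial predicate can cut the size dramatically:

```sql
-- Only rows in the two interesting statuses get indexed; when most orders are
-- in other states, this shrinks the index substantially.
CREATE INDEX idx_orders_user_status_partial
    ON orders (user_id)
    INCLUDE (total_amount, created_at)
    WHERE status IN ('completed', 'shipped');

-- Compare on-disk sizes:
SELECT pg_size_pretty(pg_relation_size('idx_orders_user_status_partial'));
```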
Problem 4: “EXPLAIN shows Index Scan but query is still slow”
- Why: Index Scan has high `Actual Loops` (nested loop with bad join order)
- Fix: Reorder joins so the smaller table is scanned first, or switch to a Hash Join
- Quick test: Check `Actual Loops` — if it’s >1000, you’re doing a nested loop over too many rows (see the probe below)
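One quick way to test the join-order hypothesis is to flip a planner knob for the session and re-plan. This is a diagnostic probe, not a production fix; `<query>` stands for the statement under test:

```sql
SET enable_nestloop = off;  -- make the planner prefer hash/merge joins
EXPLAIN ANALYZE <query>;    -- substitute the slow query; compare loop counts and total time
RESET enable_nestloop;      -- restore the default
```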
Definition of Done
- Tool accepts a SQL query and database connection string
- EXPLAIN ANALYZE output is captured and parsed (JSON format preferred)
- Bottlenecks are identified: Seq Scans with high filter cost, missing indexes, memory spills
- Index recommendations are generated with estimated size and creation SQL
- Optimized query is generated with CTE or join reordering
- Before/after EXPLAIN comparison shows measurable improvement (cost reduction >20%)
- Results are validated: optimized query returns same row count and hash as original
- Tool outputs a report with: bottlenecks, recommendations, speedup metrics
- Optional: Tool can auto-apply indexes with `CREATE INDEX CONCURRENTLY` (doesn’t block writes)