| name | Query Optimization |
| description | Techniques for optimizing database queries, indexing strategies, and performance tuning for SQL databases |
Query Optimization
You are a database performance expert specializing in query optimization, indexing strategies, and database performance tuning. You help identify and resolve performance bottlenecks in database queries and operations.
Understanding Query Execution
1. Reading EXPLAIN Plans
PostgreSQL EXPLAIN:
-- Basic EXPLAIN
EXPLAIN SELECT * FROM orders WHERE customer_id = 123;
-- Detailed analysis with ANALYZE
EXPLAIN ANALYZE SELECT * FROM orders WHERE customer_id = 123;
-- JSON output (machine-readable; useful for plan-visualization tools)
EXPLAIN (FORMAT JSON) SELECT * FROM orders WHERE customer_id = 123;
Key Metrics to Watch:
EXPLAIN (ANALYZE, BUFFERS, VERBOSE)
SELECT o.*, c.name
FROM orders o
JOIN customers c ON o.customer_id = c.id
WHERE o.created_at > '2024-01-01'
AND o.status = 'completed';
/*
Key indicators:
- Seq Scan: Table scan (usually slow for large tables)
- Index Scan: Using an index (good)
- Index Only Scan: Best - data from index only
- Nested Loop: Join strategy (good for small datasets)
- Hash Join: Join strategy (good for large datasets)
- Sort: Expensive operation, consider indexes
- Buffers: shared-buffer hits vs. disk reads (I/O, shown by the BUFFERS option)
- Actual time: Real execution time
*/
Interpreting Costs:
Cost Structure: cost=0.42..8.44 rows=1 width=136
- First number (0.42): Startup cost
- Second number (8.44): Total cost
- rows: Estimated rows returned
- width: Average row size in bytes
LOWER ESTIMATED COST usually means better performance, but costs are planner estimates in arbitrary units, not milliseconds; confirm real timings with EXPLAIN ANALYZE.
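Because costs are only comparable for the same query on the same data, the useful comparison is before and after a change. A sketch of what that might look like (the plan lines are illustrative, not real output):
-- Before indexing: full table scan
EXPLAIN SELECT * FROM orders WHERE customer_id = 123;
-- Seq Scan on orders  (cost=0.00..25000.00 rows=50 width=136)
-- After CREATE INDEX idx_orders_customer_id ON orders(customer_id):
-- Index Scan using idx_orders_customer_id on orders  (cost=0.42..8.44 rows=50 width=136)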
2. Query Execution Order
SQL Execution Flow:
-- Written order (logical)
SELECT customer_name, SUM(total) as revenue
FROM orders
WHERE status = 'completed'
GROUP BY customer_name
HAVING SUM(total) > 1000
ORDER BY revenue DESC
LIMIT 10;
-- Actual execution order:
-- 1. FROM orders
-- 2. WHERE status = 'completed'
-- 3. GROUP BY customer_name
-- 4. HAVING SUM(total) > 1000
-- 5. SELECT customer_name, SUM(total)
-- 6. ORDER BY revenue DESC
-- 7. LIMIT 10
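One practical consequence of this order: aliases defined in SELECT are not visible to WHERE, because WHERE runs first. For example:
-- Fails in most databases: the alias doesn't exist yet when WHERE runs
-- SELECT total * 1.1 AS total_with_tax FROM orders WHERE total_with_tax > 100;  -- ERROR
-- Works: repeat the expression (or wrap the query in a subquery)
SELECT total * 1.1 AS total_with_tax FROM orders WHERE total * 1.1 > 100;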
Indexing Strategies
1. When to Create Indexes
Good Index Candidates:
-- Foreign keys (for JOINs)
CREATE INDEX idx_orders_customer_id ON orders(customer_id);
-- Frequently filtered columns (WHERE clauses)
CREATE INDEX idx_orders_status ON orders(status);
-- Frequently sorted columns (ORDER BY)
CREATE INDEX idx_orders_created_at ON orders(created_at);
-- Columns used in GROUP BY
CREATE INDEX idx_sales_product_id ON sales(product_id);
-- Columns in JOIN conditions
CREATE INDEX idx_order_items_order_id ON order_items(order_id);
Bad Index Candidates:
-- Primary keys (automatically indexed; a duplicate index just wastes space)
-- CREATE INDEX idx_users_id ON users(id); -- DON'T DO THIS
-- Low cardinality columns (few distinct values)
-- CREATE INDEX idx_users_gender ON users(gender); -- Usually not helpful
-- Frequently updated columns in write-heavy tables
-- CREATE INDEX idx_counters_value ON counters(value); -- Slow writes
-- Very large text columns
-- CREATE INDEX idx_posts_content ON posts(content); -- Too large
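Indexes cost write time and disk space even when never read. In PostgreSQL, one way to spot removal candidates is the pg_stat_user_indexes statistics view (a sketch; check over a representative period, since resetting statistics skews the picture):
-- Indexes never used since statistics were last reset
SELECT schemaname, relname AS table_name, indexrelname AS index_name, idx_scan
FROM pg_stat_user_indexes
WHERE idx_scan = 0
ORDER BY relname;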
2. Composite Indexes
Column Order Matters:
-- For queries filtering by status AND customer_id
CREATE INDEX idx_orders_status_customer ON orders(status, customer_id);
-- This index helps these queries:
SELECT * FROM orders WHERE status = 'pending'; -- ✓ Uses index
SELECT * FROM orders WHERE status = 'pending' AND customer_id = 123; -- ✓ Uses index
SELECT * FROM orders WHERE customer_id = 123; -- ✗ Doesn't use index
-- Rule of thumb: equality-filtered columns first, then range/sort columns; above all, match your real query patterns
Optimal Composite Index Strategy:
-- Common query pattern
SELECT * FROM orders
WHERE status = 'pending'
AND created_at > '2024-01-01'
ORDER BY created_at DESC;
-- Optimal index
CREATE INDEX idx_orders_status_created_at ON orders(status, created_at DESC);
-- This supports:
-- 1. WHERE status = 'pending'
-- 2. WHERE status = 'pending' AND created_at > ...
-- 3. ORDER BY created_at DESC (when combined with WHERE status)
3. Partial Indexes
Index Only What You Need:
-- Only index active orders
CREATE INDEX idx_active_orders ON orders(created_at)
WHERE status IN ('pending', 'processing');
-- Only index recent data
CREATE INDEX idx_recent_orders ON orders(customer_id)
WHERE created_at > '2024-01-01';
-- Benefits:
-- - Smaller index size
-- - Faster index updates
-- - Better cache utilization
4. Covering Indexes
Include All Required Columns:
-- Query that needs optimization
SELECT order_id, customer_id, total
FROM orders
WHERE status = 'completed'
AND created_at > '2024-01-01';
-- Covering index (PostgreSQL 11+ INCLUDE syntax)
CREATE INDEX idx_orders_covering ON orders(status, created_at)
INCLUDE (order_id, customer_id, total);
-- Now the database can satisfy the query entirely from the index
-- (Index Only Scan - fastest possible)
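To confirm the optimizer actually uses it, look for an Index Only Scan in the plan. In PostgreSQL, index-only scans also depend on an up-to-date visibility map, so a heavily modified table may need a VACUUM first. A hypothetical check:
EXPLAIN ANALYZE
SELECT order_id, customer_id, total
FROM orders
WHERE status = 'completed'
AND created_at > '2024-01-01';
-- Expect a line like (illustrative):
-- Index Only Scan using idx_orders_covering on orders ...
-- A large "Heap Fetches" count means the visibility map is stale: run VACUUM orders;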
Query Optimization Techniques
1. Avoid SELECT *
Problem:
-- BAD: Fetches all columns (slow, wasteful)
SELECT * FROM orders WHERE customer_id = 123;
Solution:
-- GOOD: Only select needed columns
SELECT order_id, total, created_at FROM orders WHERE customer_id = 123;
-- Benefits:
-- - Less data transfer
-- - Better index usage (covering indexes)
-- - Reduced memory usage
-- - Faster network transfer
2. Filter Early, Filter Often
Suboptimal:
-- Joins first, filters later
SELECT o.order_id, c.name
FROM orders o
JOIN customers c ON o.customer_id = c.id
WHERE o.created_at > '2024-01-01';
Optimized:
-- Filter in a subquery first (modern planners usually push the filter down on their own, but the explicit form helps readability and weaker optimizers)
SELECT o.order_id, c.name
FROM (
SELECT order_id, customer_id
FROM orders
WHERE created_at > '2024-01-01'
) o
JOIN customers c ON o.customer_id = c.id;
-- Or use a CTE for readability (note: PostgreSQL before 12 materializes CTEs, which can block this optimization)
WITH recent_orders AS (
SELECT order_id, customer_id
FROM orders
WHERE created_at > '2024-01-01'
)
SELECT o.order_id, c.name
FROM recent_orders o
JOIN customers c ON o.customer_id = c.id;
3. Optimize JOINs
JOIN Order Matters (to the Planner):
-- Modern optimizers usually reorder joins themselves; provide selective filters and supporting indexes so the plan can start from the smallest result set
EXPLAIN ANALYZE
SELECT o.*, p.name, c.name
FROM orders o
JOIN order_items oi ON o.order_id = oi.order_id
JOIN products p ON oi.product_id = p.product_id
JOIN customers c ON o.customer_id = c.id
WHERE o.created_at > '2024-01-01' -- selective filter on orders (the planner applies it early)
AND p.category = 'electronics'; -- selective filter on products
Avoid Implicit Cross Joins:
-- BAD: Implicit cross join
SELECT *
FROM orders o, customers c
WHERE o.customer_id = c.id;
-- GOOD: Explicit JOIN
SELECT *
FROM orders o
JOIN customers c ON o.customer_id = c.id;
4. Use EXISTS Instead of IN for Large Datasets
Suboptimal:
-- With some optimizers, IN materializes the full subquery result (modern PostgreSQL usually plans both forms as a semi-join, so measure before rewriting)
SELECT *
FROM customers
WHERE customer_id IN (
SELECT customer_id FROM orders WHERE total > 1000
);
Optimized:
-- EXISTS can stop at the first match per outer row (semi-join)
SELECT *
FROM customers c
WHERE EXISTS (
SELECT 1 FROM orders o
WHERE o.customer_id = c.customer_id
AND o.total > 1000
);
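A related pitfall: NOT IN returns zero rows if the subquery produces any NULL, because the comparison becomes unknown for every row. NOT EXISTS does not have this problem:
-- Returns ZERO rows if any orders.customer_id is NULL
SELECT * FROM customers
WHERE customer_id NOT IN (SELECT customer_id FROM orders);
-- Safe alternative
SELECT * FROM customers c
WHERE NOT EXISTS (
SELECT 1 FROM orders o WHERE o.customer_id = c.customer_id
);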
5. Avoid Functions on Indexed Columns
Prevents Index Usage:
-- BAD: Function on indexed column
SELECT * FROM orders WHERE YEAR(created_at) = 2024; -- (YEAR() is MySQL syntax)
SELECT * FROM users WHERE LOWER(email) = 'user@example.com';
Index-Friendly:
-- GOOD: Use range queries
SELECT * FROM orders
WHERE created_at >= '2024-01-01'
AND created_at < '2025-01-01';
-- GOOD: Store lowercase, or use functional index
CREATE INDEX idx_users_email_lower ON users(LOWER(email));
SELECT * FROM users WHERE LOWER(email) = 'user@example.com';
The N+1 Query Problem
1. Identifying N+1 Queries
Problem Pattern:
# BAD: N+1 query problem
orders = Order.objects.all() # 1 query
for order in orders:
    print(order.customer.name)  # N additional queries!
# This executes: SELECT * FROM customers WHERE id = ?
# Once for EACH order!
Detection:
from django.db import connection
from django.test.utils import override_settings
@override_settings(DEBUG=True)
def detect_n_plus_one():
    queries_before = len(connection.queries)
    orders = Order.objects.all()
    for order in orders:
        _ = order.customer.name
    queries_after = len(connection.queries)
    query_count = queries_after - queries_before
    if query_count > 10:
        print(f"WARNING: Possible N+1 problem! {query_count} queries executed")
2. Solutions
Eager Loading (Prefetch):
# GOOD: Use select_related for ForeignKey/OneToOne
orders = Order.objects.select_related('customer').all()  # 1 query with JOIN
for order in orders:
    print(order.customer.name)  # No additional queries!
# GOOD: Use prefetch_related for ManyToMany/reverse ForeignKey
orders = Order.objects.prefetch_related('items').all()
for order in orders:
    print([item.name for item in order.items.all()])  # Efficient!
SQL Equivalent:
-- Instead of:
-- SELECT * FROM orders; (1 query)
-- SELECT * FROM customers WHERE id = 1; (N queries)
-- SELECT * FROM customers WHERE id = 2;
-- ...
-- Do this:
SELECT o.*, c.*
FROM orders o
LEFT JOIN customers c ON o.customer_id = c.id; -- 1 query!
Caching Strategies
1. Query Result Caching
import hashlib
import json
from functools import wraps

import redis
redis_client = redis.Redis(host='localhost', port=6379, db=0)
def cache_query(timeout=300):
    """Cache query results in Redis"""
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            # Build a stable cache key from the function name and arguments
            # (built-in hash() varies between processes, so use hashlib instead)
            arg_hash = hashlib.md5(repr((args, kwargs)).encode()).hexdigest()
            cache_key = f"query:{func.__name__}:{arg_hash}"
            # Try to get from cache
            cached = redis_client.get(cache_key)
            if cached:
                return json.loads(cached)
            # Execute query
            result = func(*args, **kwargs)
            # Store in cache with expiry
            redis_client.setex(cache_key, timeout, json.dumps(result))
            return result
        return wrapper
    return decorator
# Usage
@cache_query(timeout=600)
def get_popular_products():
    # list() forces evaluation so the result is JSON-serializable
    return list(Product.objects.filter(
        sales_count__gt=1000
    ).values_list('id', 'name', 'price'))
2. Database Query Cache
-- MySQL Query Cache (removed in MySQL 8.0; deprecated in 5.7)
-- Use application-level caching instead
-- PostgreSQL: Use materialized views for complex aggregations
CREATE MATERIALIZED VIEW sales_summary AS
SELECT
product_id,
DATE_TRUNC('day', created_at) as day,
COUNT(*) as order_count,
SUM(total) as revenue
FROM orders
GROUP BY product_id, DATE_TRUNC('day', created_at);
-- Refresh periodically
REFRESH MATERIALIZED VIEW sales_summary;
-- Query the materialized view (very fast)
SELECT * FROM sales_summary WHERE day = '2024-01-01';
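A plain REFRESH locks the materialized view against reads while it rebuilds. PostgreSQL can refresh without blocking readers if the view has a unique index (a sketch):
-- CONCURRENTLY requires a unique index on the materialized view
CREATE UNIQUE INDEX idx_sales_summary_unique ON sales_summary(product_id, day);
-- Readers keep seeing the old contents during the refresh
REFRESH MATERIALIZED VIEW CONCURRENTLY sales_summary;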
Connection Pooling
1. Why Connection Pooling Matters
# WITHOUT Connection Pooling:
# Each request creates new connection (slow!)
import psycopg2
def bad_approach():
    conn = psycopg2.connect(
        dbname="mydb",
        user="user",
        password="password",
        host="localhost"
    )
    cursor = conn.cursor()
    cursor.execute("SELECT * FROM users")
    result = cursor.fetchall()
    conn.close()  # Connection destroyed
    return result
# WITH Connection Pooling:
# Reuse connections across requests (fast!)
from psycopg2 import pool
connection_pool = pool.SimpleConnectionPool(
    minconn=1,
    maxconn=10,
    dbname="mydb",
    user="user",
    password="password",
    host="localhost"
)
def good_approach():
    conn = connection_pool.getconn()
    try:
        cursor = conn.cursor()
        cursor.execute("SELECT * FROM users")
        result = cursor.fetchall()
    finally:
        connection_pool.putconn(conn)  # Always return to pool, even on error
    return result
2. Optimal Pool Settings
# Django settings
DATABASES = {
    'default': {
        'ENGINE': 'django.db.backends.postgresql',
        'NAME': 'mydb',
        'CONN_MAX_AGE': 600,  # Keep connections open for 10 minutes
        'OPTIONS': {
            'connect_timeout': 10,
            'options': '-c statement_timeout=30000'  # 30-second query timeout
        },
    }
}
# SQLAlchemy pool configuration
from sqlalchemy import create_engine
from sqlalchemy.pool import QueuePool
engine = create_engine(
    'postgresql://user:password@localhost/mydb',
    poolclass=QueuePool,
    pool_size=10,        # Normal pool size
    max_overflow=20,     # Extra connections when needed
    pool_timeout=30,     # Wait up to 30s for a connection
    pool_recycle=3600,   # Recycle connections after 1 hour
    pool_pre_ping=True,  # Verify connection before using
)
Batch Operations
1. Batch Inserts
Slow:
# DON'T: Insert one at a time
for item in items:
    cursor.execute(
        "INSERT INTO products (name, price) VALUES (%s, %s)",
        (item.name, item.price)
    )
# 1000 items = 1000 queries!
Fast:
# DO: Batch insert
values = [(item.name, item.price) for item in items]
# PostgreSQL
from psycopg2.extras import execute_values
execute_values(
    cursor,
    "INSERT INTO products (name, price) VALUES %s",
    values
)  # 1000 items = 1 query!
# Django ORM
Product.objects.bulk_create([
    Product(name=item.name, price=item.price)
    for item in items
], batch_size=1000)
2. Batch Updates
# Efficient batch update: modify objects in memory, then write in batches
from django.db import transaction

with transaction.atomic():
    for product in products_to_update:
        product.price *= 1.1  # 10% increase
    Product.objects.bulk_update(products_to_update, ['price'], batch_size=500)
Advanced Optimization Techniques
1. Partitioning
Range Partitioning (PostgreSQL 10+):
-- Partition large tables by date
CREATE TABLE orders (
order_id BIGSERIAL,
customer_id INTEGER,
created_at DATE NOT NULL,
total DECIMAL(10,2)
) PARTITION BY RANGE (created_at);
-- Create partitions
CREATE TABLE orders_2024_q1 PARTITION OF orders
FOR VALUES FROM ('2024-01-01') TO ('2024-04-01');
CREATE TABLE orders_2024_q2 PARTITION OF orders
FOR VALUES FROM ('2024-04-01') TO ('2024-07-01');
-- Queries automatically use correct partition
SELECT * FROM orders WHERE created_at = '2024-02-15';
-- Only scans orders_2024_q1!
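Partition pruning can be verified with EXPLAIN; a pruned plan mentions only the matching partition (output shape is illustrative):
EXPLAIN SELECT * FROM orders WHERE created_at = '2024-02-15';
-- Expect something like:
-- Seq Scan on orders_2024_q1 orders  (cost=... rows=... width=...)
--   Filter: (created_at = '2024-02-15'::date)
-- orders_2024_q2 never appears in the plan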
2. Denormalization
When Reads >> Writes:
-- Instead of joining every time
SELECT o.order_id, c.name, c.email
FROM orders o
JOIN customers c ON o.customer_id = c.id;
-- Denormalize for read performance
ALTER TABLE orders
ADD COLUMN customer_name VARCHAR(255),
ADD COLUMN customer_email VARCHAR(255);
-- Update on insert/update
CREATE OR REPLACE FUNCTION update_customer_info()
RETURNS TRIGGER AS $$
BEGIN
SELECT name, email INTO NEW.customer_name, NEW.customer_email
FROM customers WHERE id = NEW.customer_id;
RETURN NEW;
END;
$$ LANGUAGE plpgsql;
CREATE TRIGGER update_order_customer
BEFORE INSERT OR UPDATE ON orders
FOR EACH ROW
EXECUTE FUNCTION update_customer_info();
-- Now just query orders table (no JOIN needed)
SELECT order_id, customer_name, customer_email FROM orders;
3. Database-Specific Optimizations
PostgreSQL:
-- Use JSONB for semi-structured data
CREATE TABLE events (
event_id SERIAL PRIMARY KEY,
event_data JSONB
);
-- Index JSONB columns
CREATE INDEX idx_events_data ON events USING GIN (event_data);
-- Efficient JSONB queries
SELECT * FROM events WHERE event_data @> '{"user_id": 123}';
MySQL:
-- Use covering indexes
ALTER TABLE orders
ADD INDEX idx_covering (customer_id, status, created_at, total);
-- Force index if optimizer makes wrong choice
SELECT * FROM orders
FORCE INDEX (idx_customer_status)
WHERE customer_id = 123 AND status = 'pending';
Performance Monitoring
1. Slow Query Log
PostgreSQL Configuration:
-- Enable slow query logging
ALTER SYSTEM SET log_min_duration_statement = 1000; -- Log queries > 1 second
ALTER SYSTEM SET log_line_prefix = '%t [%p]: [%l-1] user=%u,db=%d,app=%a,client=%h ';
SELECT pg_reload_conf();
-- View slow queries with the pg_stat_statements extension
-- (requires shared_preload_libraries = 'pg_stat_statements')
CREATE EXTENSION IF NOT EXISTS pg_stat_statements;
SELECT
    query,
    calls,
    total_exec_time,  -- total_time before PostgreSQL 13
    mean_exec_time,   -- mean_time before PostgreSQL 13
    max_exec_time     -- max_time before PostgreSQL 13
FROM pg_stat_statements
ORDER BY mean_exec_time DESC
LIMIT 10;
2. Query Performance Metrics
import time
from contextlib import contextmanager
@contextmanager
def query_timer(query_name):
    """Measure query execution time"""
    start = time.time()
    try:
        yield
    finally:
        duration = time.time() - start
        if duration > 1.0:  # Log slow queries
            print(f"SLOW QUERY [{query_name}]: {duration:.2f}s")
# Usage: list() forces the lazy QuerySet to execute inside the timer
with query_timer("fetch_user_orders"):
    orders = list(Order.objects.filter(customer_id=123))
Optimization Checklist
Before Deploying:
- All foreign keys have indexes
- WHERE clause columns are indexed
- No SELECT * in production code
- N+1 queries eliminated
- EXPLAIN ANALYZE run on critical queries
- Connection pooling configured
- Query timeouts set
- Slow query logging enabled
- Batch operations for bulk data
- Caching strategy implemented
Red Flags:
- ⚠️ Queries taking > 1 second
- ⚠️ Full table scans on large tables
- ⚠️ Missing indexes on foreign keys
- ⚠️ DISTINCT or GROUP BY without indexes
- ⚠️ Subqueries in SELECT clause
- ⚠️ Cartesian products (cross joins)
- ⚠️ OR clauses across different columns (consider UNION instead; see the sketch after this list)
- ⚠️ Functions on indexed columns
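For the OR red flag, a minimal sketch of the UNION rewrite (assuming separate indexes exist on status and customer_id):
-- OR across different columns can defeat single-column indexes
SELECT * FROM orders WHERE status = 'pending' OR customer_id = 123;
-- Each UNION branch can use its own index; UNION also deduplicates, like OR does
SELECT * FROM orders WHERE status = 'pending'
UNION
SELECT * FROM orders WHERE customer_id = 123;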
Related Skills
- Database Design: Schema design and normalization
- Index Management: Advanced indexing techniques
- SQL Fundamentals: Core SQL knowledge
- ORM Usage: Efficient ORM query patterns
- Caching: Application-level caching strategies
- Monitoring: Database performance monitoring
- Scaling: Database scaling and sharding