PostgreSQL Indexing Strategies for Large Production Databases
Indexes make queries fast. Wrong indexes make writes slow, waste disk space, and give the query planner bad options. At scale, indexing is not about adding indexes — it is about adding the right ones and removing the rest.
This guide covers indexing strategies for databases with millions of rows where query performance directly affects user experience. If you have not read the PostgreSQL query optimization guide, start there for the fundamentals.
Problem
Large databases develop indexing problems gradually:
- Queries that were fast at 100K rows are slow at 10M rows
- Write performance degrades as indexes accumulate
- Index bloat consumes disk space disproportionate to data size
- The query planner sometimes ignores indexes that should be used
Understanding Index Types
B-tree (Default)
B-tree indexes handle equality and range queries. They are the right choice for 90% of cases:
CREATE INDEX idx_orders_created ON orders(created_at);
-- Supports:
SELECT * FROM orders WHERE created_at > '2024-01-01';
SELECT * FROM orders WHERE created_at BETWEEN '2024-01-01' AND '2024-06-30';
SELECT * FROM orders ORDER BY created_at DESC LIMIT 20;
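Whether the planner actually picks the index depends on table size and statistics; a quick way to check, assuming the orders table and index above:

```sql
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM orders
WHERE created_at > '2024-01-01'
ORDER BY created_at DESC
LIMIT 20;
-- Look for "Index Scan" using idx_orders_created rather than "Seq Scan"
```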
GIN (Generalized Inverted Index)
GIN indexes handle containment queries — arrays, JSONB, and full-text search:
CREATE INDEX idx_posts_tags ON posts USING gin(tags);
-- Supports:
SELECT * FROM posts WHERE tags @> ARRAY['python'];
SELECT * FROM posts WHERE tags && ARRAY['react', 'typescript'];
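The same GIN index type covers JSONB containment; a sketch assuming a jsonb column named metadata (not part of the schema above):

```sql
CREATE INDEX idx_posts_metadata ON posts USING gin(metadata);

SELECT * FROM posts WHERE metadata @> '{"published": true}';

-- jsonb_path_ops builds a smaller, faster index, but supports only
-- containment (@>) and jsonpath operators, not key-existence (?)
CREATE INDEX idx_posts_metadata_path ON posts USING gin(metadata jsonb_path_ops);
```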
GiST (Generalized Search Tree)
GiST indexes handle geometric and range data:
CREATE INDEX idx_events_range ON events USING gist(
tstzrange(start_time, end_time)
);
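A couple of queries such a range index can serve; the exclusion constraint below is a sketch (room_id is illustrative, and the scalar equality part requires the btree_gist extension):

```sql
-- Find events overlapping a time window (matches the index expression)
SELECT * FROM events
WHERE tstzrange(start_time, end_time) && tstzrange('2024-03-01', '2024-03-02');

-- GiST also backs exclusion constraints, e.g. no double bookings per room
CREATE EXTENSION IF NOT EXISTS btree_gist;
ALTER TABLE events ADD CONSTRAINT no_overlapping_bookings
  EXCLUDE USING gist (room_id WITH =, tstzrange(start_time, end_time) WITH &&);
```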
BRIN (Block Range Index)
BRIN indexes are small and work well on naturally ordered data:
CREATE INDEX idx_logs_created ON logs USING brin(created_at);
If your logs table is append-only and physically ordered by created_at, a BRIN index is often orders of magnitude smaller than the equivalent B-tree with comparable performance on range scans.
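A rough way to compare footprints, plus the one tuning knob worth knowing (index names illustrative; pages_per_range defaults to 128):

```sql
SELECT pg_size_pretty(pg_relation_size('idx_logs_created'));

-- Smaller pages_per_range means finer granularity but a larger index
CREATE INDEX idx_logs_created_32 ON logs USING brin(created_at)
  WITH (pages_per_range = 32);
```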
Composite Indexes
Column order matters. Lead with the columns queries filter on by equality, and put a range or sort column last; selectivity alone is a weaker guide than matching your actual predicates:
-- If most queries filter by tenant_id AND status:
CREATE INDEX idx_tasks_tenant_status ON tasks(tenant_id, status);
-- This index supports:
WHERE tenant_id = $1 -- ✓ (uses first column)
WHERE tenant_id = $1 AND status = 'active' -- ✓ (uses both columns)
WHERE status = 'active' -- ✗ (cannot use, wrong column order)
The index works left to right. A query filtering only on the second column cannot use a composite index efficiently.
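If status-only filters are also frequent, the composite index above will not serve them; two common remedies, sketched with illustrative names:

```sql
-- Option 1: a separate index for status-only queries
CREATE INDEX idx_tasks_status ON tasks(status);

-- Option 2: equality column first, range column last — serves
-- WHERE tenant_id = $1 AND created_at > $2 efficiently
CREATE INDEX idx_tasks_tenant_created ON tasks(tenant_id, created_at);
```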
Partial Indexes
Index only the rows that matter:
-- Only 5% of orders are pending, but 95% of queries filter for them
CREATE INDEX idx_orders_pending ON orders(created_at)
WHERE status = 'pending';
-- Much smaller than indexing all orders
-- Only usable when the query's WHERE clause implies status = 'pending'
Partial indexes are smaller, faster to update, and faster to scan. Use them when queries consistently filter for a subset of rows.
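To make the implication rule concrete, assuming the idx_orders_pending index above:

```sql
-- Can use idx_orders_pending: the predicate matches, and created_at
-- supports the sort
SELECT * FROM orders WHERE status = 'pending' ORDER BY created_at;

-- Cannot use it: 'shipped' does not imply status = 'pending'
SELECT * FROM orders WHERE status = 'shipped';
```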
Expression Indexes
Index computed values:
-- Queries that filter on lower(email)
CREATE INDEX idx_users_email_lower ON users(lower(email));
-- Queries that extract JSONB fields
CREATE INDEX idx_settings_theme ON user_settings((settings->>'theme'));
The expression in the index must exactly match the expression in the query.
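A pair of queries showing the exact-match rule against idx_users_email_lower:

```sql
-- Uses the index: the query expression matches lower(email)
SELECT * FROM users WHERE lower(email) = lower('Ada@Example.com');

-- Does not use it: the bare column is a different expression
SELECT * FROM users WHERE email = 'ada@example.com';
```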
Covering Indexes
Include extra columns to enable index-only scans:
CREATE INDEX idx_products_category ON products(category_id)
INCLUDE (name, price);
-- This query can be answered entirely from the index:
SELECT name, price FROM products WHERE category_id = 5;
The INCLUDE columns are stored in the index but not used for searching. This avoids a table lookup (heap fetch) for queries that only need the included columns.
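One caveat worth checking: index-only scans also depend on the visibility map, so a heavily updated table may still fetch heap pages until it is vacuumed. A quick check, assuming the index above:

```sql
EXPLAIN (ANALYZE)
SELECT name, price FROM products WHERE category_id = 5;
-- Look for "Index Only Scan" with a low "Heap Fetches" count;
-- a high count usually means the table needs a VACUUM
VACUUM products;
```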
Finding Missing Indexes
-- Queries doing sequential scans on large tables
SELECT relname, seq_scan, seq_tup_read,
idx_scan, idx_tup_fetch
FROM pg_stat_user_tables
WHERE seq_scan > 100
AND pg_relation_size(relid) > 10000000 -- ~10 MB
ORDER BY seq_tup_read DESC;
High seq_scan count on large tables suggests a missing index.
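When the statistics point at a missing index, build it without blocking writes (the index and column names here are illustrative):

```sql
-- Cannot run inside a transaction block
CREATE INDEX CONCURRENTLY idx_orders_customer ON orders(customer_id);
-- A failed concurrent build leaves an INVALID index behind — drop it and retry
```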
Finding Unused Indexes
SELECT indexrelid::regclass AS index_name,
relid::regclass AS table_name,
idx_scan,
pg_size_pretty(pg_relation_size(indexrelid)) AS index_size
FROM pg_stat_user_indexes
WHERE idx_scan = 0
AND indexrelid NOT IN (
SELECT conindid FROM pg_constraint WHERE contype IN ('p', 'u')
)
ORDER BY pg_relation_size(indexrelid) DESC;
Unused indexes waste disk space and slow down writes. Check idx_scan on every server that takes reads — statistics are tracked per instance, so an index unused on the primary may still serve a replica — then drop them.
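Dropping can likewise be done without blocking queries (index name illustrative):

```sql
-- Cannot run inside a transaction block
DROP INDEX CONCURRENTLY idx_orders_legacy;
```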
Index Maintenance
Reindexing
-- Rebuild a bloated index without long locks (PostgreSQL 12+); takes
-- only brief locks at start and end, cannot run in a transaction block
REINDEX INDEX CONCURRENTLY idx_orders_created;
-- Check index size relative to its table (a rough bloat signal)
SELECT indexrelid::regclass AS index_name,
       indrelid::regclass AS table_name,
       round(100.0 * pg_relation_size(indexrelid) /
             pg_relation_size(indrelid)) AS index_to_table_pct
FROM pg_index
WHERE pg_relation_size(indrelid) > 0
ORDER BY pg_relation_size(indexrelid) DESC;
Common Mistakes
Mistake 1: Indexing every column. Each index slows writes and consumes disk. Only index columns that appear in WHERE, JOIN, and ORDER BY clauses of frequent queries.
Mistake 2: Wrong column order in composite indexes. Columns filtered by equality should come before range or sort columns, and the leading column must appear in the query. Use EXPLAIN ANALYZE to verify the planner uses your index.
Mistake 3: Ignoring index bloat. After heavy UPDATE or DELETE operations, indexes accumulate dead tuples. Schedule periodic REINDEX CONCURRENTLY.
Takeaways
PostgreSQL indexing at scale requires choosing the right index type, ordering composite columns correctly, using partial indexes for common filters, and regularly auditing for unused indexes. The goal is not more indexes — it is the right indexes for your actual query patterns.