| name | MongoDB Aggregation Pipeline Optimization |
| description | General MongoDB aggregation pipeline optimization techniques including early filtering, index usage, array operators vs $unwind, $lookup optimization, and performance debugging. Use when writing aggregation queries for ANY MongoDB project, debugging slow pipelines, or analyzing query performance. For M32RIMM-specific patterns, use mongodb-m32rimm-patterns skill. |
| allowed-tools | Read, Grep, Bash |
MongoDB Aggregation Pipeline Optimization
General optimization patterns for MongoDB aggregation pipelines. Applicable to ANY MongoDB project, regardless of domain or schema.
Companion Documents:
- reference.md: Detailed examples, benchmarks, and debugging workflows (in this skill directory)
- mongodb-m32rimm-patterns: M32RIMM/FISIO-specific patterns (subscription isolation, businessObjects queries)
1. Pipeline Stage Ordering (MOST CRITICAL)
Early filtering rule: Place $match as early as possible to reduce data volume.
Optimal stage order:
$match → $project → $addFields → $lookup → $unwind → $group → $sort → $limit
Why this matters: MongoDB processes pipeline stages sequentially. Filtering 1M docs to 10K BEFORE $lookup saves 990K unnecessary joins.
Performance Impact
| Optimization | Speedup | Example |
|---|---|---|
| Early $match | 10-100x | Filter by tenant/status first |
| Project before lookup | 5-20x | Reduce field count before join |
| Covered queries | 5-10x | Return data from index only |
| Array operators vs $unwind | 5-10x | Filter arrays without unwinding |
| Indexed $lookup | 10-50x | Join on indexed fields |
Stage Ordering Examples
// BAD - filters AFTER expensive operations
db.orders.aggregate([
{$lookup: {from: 'customers', ...}}, // Joins ALL docs
{$unwind: '$items'},
{$match: {status: 'pending'}} // Filters last
])
// GOOD - filters early, reduces data before expensive ops
db.orders.aggregate([
{$match: {status: 'pending'}}, // Filter first
{$project: {_id: 1, items: 1}}, // Reduce fields
{$lookup: {from: 'customers', ...}} // Join smaller set
])
MongoDB's Automatic Optimization
MongoDB can move $match before $project when safe:
// Written as:
[
{$project: {_id: 1, status: 1, amount: 1}},
{$match: {status: 'pending', amount: {$gt: 100}}}
]
// MongoDB optimizes to:
[
{$match: {status: 'pending', amount: {$gt: 100}}},
{$project: {_id: 1, status: 1, amount: 1}}
]
BUT: Don't rely on this. Write explicit early $match for clarity.
2. Index Usage & Covered Queries
Check Index Usage
// Explain aggregation execution
db.collection.explain('executionStats').aggregate([
{$match: {status: 'active', category: 'electronics'}},
{$project: {_id: 1, name: 1}}
])
// Key fields to check:
// - totalDocsExamined (should be close to nReturned)
// - executionTimeMillis (lower is better)
// - winningPlan.stage (IXSCAN = good, COLLSCAN = bad)
Covered Queries
Definition: Query returns all data from index (no document fetch).
Requirements:
- All queried fields are in the index
- All returned fields are in the index
- Query doesn't exclude
_idunless{_id: 0}in projection
Example:
// Index
db.products.createIndex({
category: 1,
status: 1,
updated_at: -1
})
// Covered query - returns only indexed fields
db.products.aggregate([
{$match: {
category: 'electronics',
status: 'active'
}},
{$project: {
_id: 1,
category: 1,
status: 1,
updated_at: 1
}},
{$sort: {updated_at: -1}}
])
Performance: 5-10x faster than document scans.
Index Strategy Best Practices
- Filter fields first: Most selective filters at start of compound index
- Sort fields last: Include sort fields at end of compound index
- Include projection fields: Add projected fields for covered queries
- Avoid index bloat: Don't index every field (diminishing returns)
// Good compound index design
db.orders.createIndex({
tenant_id: 1, // Filter: highest selectivity
status: 1, // Filter: medium selectivity
created_at: -1 // Sort field last
})
// Query uses index efficiently
db.orders.aggregate([
{$match: {
tenant_id: 'abc123',
status: 'pending'
}},
{$sort: {created_at: -1}}
])
3. Array Operators vs $unwind/$group Anti-Pattern
AVOID: $unwind → $group for array transformations (blocking stage, slow).
USE: Array operators ($filter, $map, $reduce, $arrayElemAt, $size).
Quick Example
// BAD - explodes documents
db.products.aggregate([
{$unwind: '$tags'},
{$match: {'tags': 'sale'}},
{$group: {_id: '$_id', sale_tags: {$push: '$tags'}}}
])
// GOOD - filter in place
db.products.aggregate([
{$match: {category: 'electronics'}},
{$project: {
sale_tags: {$filter: {input: '$tags', cond: {$eq: ['$$this', 'sale']}}}
}}
])
Common Array Operators
{$filter: {input: '$items', cond: {$eq: ['$$this.status', 'active']}}}
{$map: {input: '$products', in: {id: '$$this._id', price: '$$this.price'}}}
{$arrayElemAt: ['$items', 0]} // First element
{$size: '$items'} // Array length
{$reduce: {input: '$items', initialValue: 0, in: {$add: ['$$value', '$$this.price']}}}
Performance: 5-10x faster for large arrays. See reference.md for detailed examples.
4. $lookup Optimization
Critical: Index foreign collection on lookup field (10-50x speedup).
Basic vs Optimized Lookup
// Basic lookup
{$lookup: {from: 'customers', localField: 'customer_id', foreignField: '_id', as: 'customer'}}
// Optimized with pipeline (filter + project)
{$lookup: {
from: 'customers',
let: {customer_id: '$customer_id'},
pipeline: [
{$match: {$expr: {$eq: ['$_id', '$$customer_id']}, status: 'active'}},
{$project: {_id: 1, name: 1, email: 1}} // Only needed fields
],
as: 'customer'
}}
Required Index
// CRITICAL: Index on foreign field
db.customers.createIndex({_id: 1, status: 1})
Multiple Lookups
// AVOID: Sequential lookups on multiple collections
// BETTER: Nest related lookups, use pipeline to filter
// BEST: Denormalize if data rarely changes
See reference.md for detailed multiple lookup patterns and benchmarks.
5. $group Optimization
Key principles: Group before sort, use $first/$last instead of $push, enable allowDiskUse for large datasets.
Efficient Grouping
// BAD - accumulates entire documents
{$group: {_id: '$customer_id', orders: {$push: '$$ROOT'}}}
// GOOD - accumulate only needed metrics
{$group: {
_id: '$customer_id',
count: {$sum: 1},
first_order_id: {$first: '$_id'},
last_order_date: {$max: '$created_at'}
}}
Memory Limit (100MB)
// Enable disk usage for large aggregations
db.collection.aggregate(pipeline, {allowDiskUse: true})
See reference.md for nested grouping patterns and detailed examples.
6. Materialized Views with $merge/$out
Use case: Heavy aggregations run frequently (dashboards, reports).
$merge vs $out
| Feature | $merge | $out |
|---|---|---|
| Behavior | Upserts into target | Replaces entire collection |
| Use when | Incremental updates | Full refresh |
Quick Example
// Incremental update with $merge
{$merge: {into: 'dailySales', on: '_id', whenMatched: 'replace', whenNotMatched: 'insert'}}
// Full replace with $out
{$out: 'productsByCategory'}
See reference.md for detailed patterns, update strategies, and scheduling examples.
7. MongoDB Version Features
MongoDB 5.0+: Slot-based execution engine (automatic optimization), $setWindowFields. MongoDB 6.0+: Improved $lookup performance, better memory management.
Check version: db.version()
8. Debugging Slow Pipelines
1. Explain Query
db.collection.explain('executionStats').aggregate([...])
Check: totalDocsExamined (should be close to nReturned), executionTimeMillis, stage (IXSCAN good, COLLSCAN bad).
2. Profile Slow Queries
db.setProfilingLevel(1, {slowms: 100}) // Enable
db.system.profile.find().sort({ts: -1}).limit(10) // Check
db.setProfilingLevel(0) // Disable
3. Iterative Testing
Test pipeline stages one at a time to find bottleneck. Add stages incrementally and measure time/doc count.
4. MongoDB Compass
Use visual explain in Compass to identify COLLSCAN stages and memory bottlenecks.
See reference.md for detailed debugging workflow with examples.
9. Common Patterns (Quick Reference)
// Count by category
{$group: {_id: '$category', count: {$sum: 1}, avg_price: {$avg: '$price'}}}
// Top N results
{$sort: {order_count: -1}}, {$limit: 10}
// Time-based grouping
{$group: {_id: {year: {$year: '$created_at'}, week: {$week: '$created_at'}}, count: {$sum: 1}}}
// Conditional aggregation
{$group: {
_id: null,
pending: {$sum: {$cond: [{$eq: ['$status', 'pending']}, 1, 0]}},
completed: {$sum: {$cond: [{$eq: ['$status', 'completed']}, 1, 0]}}
}}
See reference.md for complete pattern implementations with context.
10. Anti-Patterns to Avoid
DON'T:
- $match after $lookup (filter before lookup)
- $unwind → $group for array transformations (use array operators)
- Querying without indexes (collection scans are slow)
- Multiple sequential $lookup stages (denormalize or combine)
- Bare $group without $match (groups entire collection)
- $sort before $limit without index (sorts everything)
- Projecting all fields before $lookup (transfer unnecessary data)
- Ignoring allowDiskUse for large aggregations (100MB memory limit)
DO:
- Filter early (smallest dataset possible)
- Use indexed fields in $match
- Project only needed fields
- Use array operators for array manipulation
- Combine lookups when possible
- Enable allowDiskUse for large aggregations
- Check explain() output before deploying
- Create compound indexes for common query patterns
Performance Checklist
Before deploying aggregation pipeline:
- Early filtering: $match with most selective filters first
- Index usage: Check explain() shows IXSCAN (not COLLSCAN)
- Projection: Reduce fields before expensive operations
- Array operators: Use $filter/$map instead of $unwind/$group
- Lookup indexes: Ensure foreign collection has index on join field
- allowDiskUse: Enable for large aggregations (>100MB)
- Testing: Run explain() on production-sized data
Performance targets (1M doc collection):
- Early $match: 90x speedup
- Covered query: 11x speedup
- Array operators: 9x speedup
- Indexed $lookup: 50x speedup
See reference.md for detailed benchmarks.
Quick Reference Card
// 1. Filter early (most selective filters first)
{$match: {tenant_id: 'abc', status: 'active'}}
// 2. Project before expensive operations
{$project: {_id: 1, needed_field: 1}}
// 3. Use array operators (not $unwind/$group)
{$filter: {input: '$items', cond: {...}}}
// 4. Optimize lookups with pipeline
{$lookup: {
from: 'collection',
let: {...},
pipeline: [
{$match: {$expr: {...}, status: 'active'}}
],
as: 'result'
}}
// 5. Check performance
db.collection.explain('executionStats').aggregate([...])
// 6. Enable disk usage for large aggregations
db.collection.aggregate(pipeline, {allowDiskUse: true})
// 7. Use covered queries when possible
// Only project indexed fields
// 8. Group before sort (reduce data volume)
{$group: {...}}, {$sort: {...}}
Remember
Core priorities:
- Filter early (smallest dataset before expensive ops)
- Use indexes (IXSCAN not COLLSCAN)
- Test with explain() on production-sized data
- Profile slow queries in production
When stuck: Check explain() → Profile → Test stages incrementally → Use Compass visual explain.
Documentation: MongoDB Aggregation Optimization
Related skills: mongodb-m32rimm-patterns for M32RIMM/FISIO-specific patterns.