| name | memory-optimization |
| description | Performance optimization patterns for Mem0 memory operations including query optimization, caching strategies, embedding efficiency, database tuning, batch operations, and cost reduction for both Platform and OSS deployments. Use when optimizing memory performance, reducing costs, improving query speed, implementing caching, tuning database performance, analyzing bottlenecks, or when user mentions memory optimization, performance tuning, cost reduction, slow queries, caching, or Mem0 optimization. |
| allowed-tools | Bash, Read, Write, Edit |
Memory Optimization
Performance optimization patterns and tools for Mem0 memory systems. This skill provides comprehensive optimization techniques for query performance, cost reduction, caching strategies, and infrastructure tuning for both Platform and OSS deployments.
Instructions
Phase 1: Performance Assessment
Start by analyzing your current memory system performance:
bash scripts/analyze-performance.sh [project_name]
This generates a comprehensive performance report including:
- Query latency metrics (average, P95, P99)
- Operation throughput (searches, adds, updates, deletes)
- Cache performance statistics
- Resource utilization (memory, storage, CPU)
- Slow query identification
- Cost analysis
Review the output to identify optimization priorities:
- Query latency > 200ms → Focus on query optimization
- High costs → Focus on cost optimization
- Low cache hit rate < 60% → Focus on caching
- High resource usage → Focus on infrastructure tuning
Phase 2: Query Optimization
Optimize memory search operations for speed and efficiency.
2.1 Limit Search Results
Problem: Retrieving too many results increases latency and costs.
Solution: Use appropriate limit values based on use case.
# ❌ BAD: Using default or excessive limits
memories = memory.search(query, user_id=user_id) # Default: 10
# ✅ GOOD: Optimized limits
memories = memory.search(query, user_id=user_id, limit=5) # Chat apps
memories = memory.search(query, user_id=user_id, limit=3) # Quick context
memories = memory.search(query, user_id=user_id, limit=10) # RAG systems
Impact: 30-40% reduction in query time
Guidelines:
- Chat applications: 3-5 results
- RAG context retrieval: 8-12 results
- Recommendation systems: 10-20 results
- Semantic search: 20-50 results
2.2 Use Filters to Reduce Search Space
Problem: Searching entire index is slow and expensive.
Solution: Apply filters to narrow search scope.
# ❌ BAD: Full index scan
memories = memory.search(query)
# ✅ GOOD: Filtered search
memories = memory.search(
query
filters={
"user_id": user_id
"categories": ["preferences", "profile"]
}
limit=5
)
# ✅ BETTER: Multiple filter conditions
memories = memory.search(
query
filters={
"AND": [
{"user_id": user_id}
{"agent_id": "support_v2"}
{"created_after": "2025-01-01"}
]
}
limit=5
)
Impact: 40-60% reduction in query time
Available Filters:
user_id: Scope to specific useragent_id: Scope to specific agentrun_id: Scope to session/runcategories: Filter by memory categoriesmetadata: Custom metadata filters- Date ranges:
created_after,created_before
2.3 Optimize Reranking
Problem: Default reranking may be overkill for simple queries.
Solution: Configure reranker based on accuracy requirements.
# Platform Mode (Mem0 Cloud)
from mem0 import MemoryClient
# Disable reranking for fast, simple queries
memory = MemoryClient(api_key=api_key)
memories = memory.search(
query
user_id=user_id
rerank=False # 2x faster, slightly lower accuracy
)
# OSS Mode
from mem0 import Memory
from mem0.configs.base import MemoryConfig
# Use lightweight reranker
config = MemoryConfig(
reranker={
"provider": "cohere"
"config": {
"model": "rerank-english-v3.0", # Fast model
"top_n": 5 # Rerank only top results
}
}
)
memory = Memory(config)
Reranker Options:
- No reranking: Fastest, 90-95% accuracy
- Lightweight (Cohere rerank-english-v3.0): 2x faster than full
- Full reranking (Cohere rerank-english-v3.5): Highest accuracy
Decision Guide:
- Simple preference retrieval → No reranking
- Chat context → Lightweight reranking
- Critical RAG applications → Full reranking
2.4 Use Async Operations
Problem: Blocking operations limit throughput.
Solution: Use async for high-concurrency scenarios.
import asyncio
from mem0 import AsyncMemory
async def get_user_context(user_id: str, queries: list[str]):
memory = AsyncMemory()
# Run multiple searches concurrently
results = await asyncio.gather(*[
memory.search(q, user_id=user_id, limit=3)
for q in queries
])
return results
# Usage
contexts = await get_user_context(
"user_123"
["preferences", "recent activity", "goals"]
)
Impact: 3-5x throughput improvement under load
Phase 3: Caching Strategies
Implement multi-layer caching to reduce API calls and improve response times.
3.1 In-Memory Caching (Python)
Use for: Frequently accessed, rarely changing data (user preferences).
from functools import lru_cache
import hashlib
@lru_cache(maxsize=1000)
def get_user_preferences(user_id: str) -> list:
"""Cache user preferences for 5 minutes"""
return memory.search(
"user preferences"
user_id=user_id
limit=5
)
# Clear cache when preferences update
get_user_preferences.cache_clear()
Impact: Near-instant response for cached queries
Configuration:
maxsize=1000: Cache 1000 users' preferences- Clear cache on memory updates
- TTL: Implement with time-based wrapper
3.2 Redis Caching (Production)
Use for: Shared caching across services, TTL control.
import redis
import json
redis_client = redis.Redis(host='localhost', port=6379, db=0)
def get_user_context_cached(user_id: str, query: str) -> list:
# Generate cache key
cache_key = f"mem0:search:{user_id}:{hashlib.md5(query.encode()).hexdigest()}"
# Check cache
cached = redis_client.get(cache_key)
if cached:
return json.loads(cached)
# Cache miss - query Mem0
result = memory.search(query, user_id=user_id, limit=5)
# Cache result (5 minute TTL)
redis_client.setex(
cache_key
300, # 5 minutes
json.dumps(result)
)
return result
# Invalidate cache on update
def update_memory(user_id: str, message: str):
memory.add(message, user_id=user_id)
# Clear user's cache
pattern = f"mem0:search:{user_id}:*"
for key in redis_client.scan_iter(match=pattern):
redis_client.delete(key)
Impact: 50-70% reduction in API calls
TTL Guidelines:
- User preferences: 5-15 minutes
- Agent knowledge: 30-60 minutes
- Session context: 1-2 minutes
- Static content: 1 hour
Use the caching template generator:
bash scripts/generate-cache-config.sh redis [ttl_seconds]
3.3 Edge Caching (Advanced)
Use for: Global applications, very high traffic.
See template: templates/edge-cache-config.yaml
Phase 4: Embedding Optimization
Optimize embedding generation and storage costs.
4.1 Choose Appropriate Embedding Model
Problem: Oversized embeddings increase cost and latency.
Solution: Match model to use case.
from mem0 import Memory
from mem0.configs.base import MemoryConfig
# ❌ EXPENSIVE: Large model for simple data
config = MemoryConfig(
embedder={
"provider": "openai"
"config": {
"model": "text-embedding-3-large", # 3072 dims, $0.13/1M tokens
}
}
)
# ✅ OPTIMIZED: Appropriate model
config = MemoryConfig(
embedder={
"provider": "openai"
"config": {
"model": "text-embedding-3-small", # 1536 dims, $0.02/1M tokens
}
}
)
Model Selection Guide:
| Use Case | Recommended Model | Dimensions | Cost |
|---|---|---|---|
| User preferences | text-embedding-3-small | 1536 | $0.02/1M |
| Simple chat context | text-embedding-3-small | 1536 | $0.02/1M |
| Advanced RAG | text-embedding-3-large | 3072 | $0.13/1M |
| Multilingual | text-embedding-3-large | 3072 | $0.13/1M |
| Budget-conscious | text-embedding-ada-002 | 1536 | $0.0001/1M |
Impact: 70-85% cost reduction with appropriate model selection
4.2 Batch Embedding Generation
Problem: Individual embedding calls have overhead.
Solution: Batch multiple texts for embedding.
# ❌ BAD: Individual embedding calls
for message in messages:
memory.add(message, user_id=user_id) # Separate API call each
# ✅ GOOD: Batched operation
memory.add(messages, user_id=user_id) # Single batched call
Impact: 40-60% reduction in embedding costs
Batch Size Guidelines:
- Platform Mode: Up to 100 messages per batch
- OSS Mode: Limited by embedding provider (OpenAI: 2048 texts)
4.3 Embedding Caching
Problem: Re-embedding same text wastes costs.
Solution: Cache embeddings for frequent queries.
import hashlib
embedding_cache = {}
def get_or_create_embedding(text: str) -> list[float]:
# Generate hash of text
text_hash = hashlib.sha256(text.encode()).hexdigest()
# Check cache
if text_hash in embedding_cache:
return embedding_cache[text_hash]
# Generate embedding
embedding = generate_embedding(text)
embedding_cache[text_hash] = embedding
return embedding
Use Cases:
- Canned responses
- Template messages
- System prompts
- Frequently asked questions
Phase 5: Database Optimization (OSS Mode)
Optimize vector database performance for self-hosted deployments.
5.1 Choose Optimal Vector Database
Decision Matrix:
bash scripts/suggest-vector-db.sh
| Database | Best For | Performance | Setup Complexity |
|---|---|---|---|
| Qdrant | Production, high scale | Excellent | Medium |
| Chroma | Development, prototyping | Good | Low |
| pgvector | Existing PostgreSQL | Good | Low |
| Milvus | Enterprise, billions of vectors | Excellent | High |
Recommendation:
- Start: Chroma (easy setup)
- Production: Qdrant (best performance)
- Existing Postgres: pgvector (no new infra)
- Enterprise: Milvus (handles massive scale)
5.2 Configure Optimal Indexes
For Qdrant:
config = MemoryConfig(
vector_store={
"provider": "qdrant"
"config": {
"collection_name": "memories"
"host": "localhost"
"port": 6333
"on_disk": True, # Reduce memory usage
"hnsw_config": {
"m": 16, # Balance between speed and accuracy
"ef_construct": 200, # Higher = better quality
}
"quantization_config": {
"scalar": {
"type": "int8", # Reduce storage by 4x
"quantile": 0.99
}
}
}
}
)
For pgvector (Supabase):
# Use specialized Supabase integration skill
bash ../supabase-integration/scripts/optimize-pgvector.sh
See: templates/vector-db-optimization/ for database-specific configs.
5.3 Connection Pooling
Problem: Creating new connections on each request is slow.
Solution: Use connection pooling.
from mem0 import Memory
from mem0.configs.base import MemoryConfig
config = MemoryConfig(
vector_store={
"provider": "qdrant"
"config": {
"host": "localhost"
"port": 6333
"grpc_port": 6334
"prefer_grpc": True, # Faster protocol
"timeout": 5
"connection_pool_size": 50, # Reuse connections
}
}
)
Impact: 30-50% reduction in connection overhead
Pool Size Guidelines:
- Low traffic: 10-20 connections
- Medium traffic: 30-50 connections
- High traffic: 50-100 connections
Phase 6: Batch Operations
Optimize bulk operations for efficiency.
6.1 Batch Memory Addition
# ❌ BAD: Individual operations
for msg in conversation_history:
memory.add(msg, user_id=user_id)
# ✅ GOOD: Batched operation
memory.add(conversation_history, user_id=user_id)
Impact: 60% faster, 40% lower cost
6.2 Batch Search Operations
import asyncio
# ❌ BAD: Sequential searches
results = []
for query in queries:
results.append(memory.search(query, user_id=user_id))
# ✅ GOOD: Parallel searches
async def batch_search(queries, user_id):
memory = AsyncMemory()
return await asyncio.gather(*[
memory.search(q, user_id=user_id, limit=5)
for q in queries
])
results = await batch_search(queries, user_id)
Impact: 4-5x faster for multiple searches
Phase 7: Cost Optimization
Reduce operational costs for memory systems.
7.1 Run Cost Analysis
bash scripts/analyze-costs.sh [user_id] [date_range]
This generates:
- Daily/monthly cost breakdown
- Cost per operation type
- Cost per user analysis
- Optimization recommendations
- Projected savings
7.2 Implement Cost-Saving Strategies
Strategy 1: Memory Deduplication
bash scripts/deduplicate-memories.sh [user_id]
Removes similar/duplicate memories to reduce storage and query costs.
Impact: 20-40% storage reduction
Strategy 2: Archival and Tiered Storage
bash scripts/setup-memory-archival.sh [retention_days]
Move old memories to cheaper storage:
- Active (0-30 days): Fast vector DB
- Archive (30-180 days): Compressed JSON in S3
- Cold storage (>180 days): Glacier
Impact: 50-70% storage cost reduction
Strategy 3: Smaller Embeddings for Archives
# Use cheaper embeddings for archived memories
archived_config = MemoryConfig(
embedder={
"provider": "openai"
"config": {
"model": "text-embedding-ada-002", # Cheaper
}
}
)
Strategy 4: Smart Pruning
bash scripts/prune-low-value-memories.sh [user_id] [score_threshold]
Remove memories that:
- Have never been retrieved
- Have low relevance scores
- Are redundant with other memories
- Haven't been accessed in 90+ days
Impact: 30-50% cost reduction
Phase 8: Monitoring and Alerts
Set up performance monitoring and alerts.
8.1 Configure Monitoring
bash scripts/setup-monitoring.sh [project_name]
Tracks:
- Query latency (average, P95, P99)
- Cache hit rate
- Error rate
- Cost per day/week/month
- Memory growth rate
- Top slow queries
8.2 Set Up Alerts
Use the alert configuration template:
bash scripts/generate-alert-config.sh
Recommended Alerts:
- 🚨 P99 latency > 500ms
- 🚨 Error rate > 5%
- 🚨 Cache hit rate < 50%
- 🚨 Daily cost exceeds budget
- 🚨 Storage growth > 20%/week
- 🚨 Slow query percentage > 10%
Phase 9: Performance Testing
Benchmark and validate optimizations.
9.1 Run Performance Benchmarks
bash scripts/benchmark-performance.sh [config_name]
Measures:
- Query latency under load
- Throughput (queries/second)
- Cache effectiveness
- Cost per 1000 operations
9.2 Load Testing
bash scripts/load-test.sh [concurrent_users] [duration_seconds]
Simulates real-world load to identify bottlenecks.
9.3 Compare Configurations
bash scripts/compare-configs.sh [config1] [config2]
A/B test different optimization strategies.
Optimization Decision Trees
Query Performance Issues
bash scripts/diagnose-slow-queries.sh
Diagnostic Flow:
- Average latency > 200ms? → Reduce limit, add filters
- P99 latency > 500ms? → Add caching, optimize indexes
- High variance? → Check for slow queries, optimize those specifically
- All queries slow? → Database infrastructure issue
High Cost Issues
bash scripts/diagnose-high-costs.sh
Diagnostic Flow:
- Embedding costs high? → Smaller model, batch operations, caching
- Storage costs high? → Deduplication, archival, pruning
- Query costs high? → Reduce search frequency, implement caching
- Overall high? → Review all strategies above
Low Cache Hit Rate
bash scripts/optimize-cache.sh
Diagnostic Flow:
- < 30% hit rate? → Cache wrong queries, review cache keys
- 30-60% hit rate? → Increase TTL, cache more query patterns
- 60-80% hit rate? → Good, minor tuning possible
80% hit rate? → Excellent, no action needed
Key Files Reference
Scripts (all functional):
scripts/analyze-performance.sh- Comprehensive performance analysisscripts/analyze-costs.sh- Cost breakdown and optimizationscripts/benchmark-performance.sh- Performance benchmarkingscripts/load-test.sh- Load testing and stress testingscripts/compare-configs.sh- A/B test configurationsscripts/diagnose-slow-queries.sh- Query performance diagnosticsscripts/diagnose-high-costs.sh- Cost diagnosticsscripts/optimize-cache.sh- Cache tuning recommendationsscripts/deduplicate-memories.sh- Remove duplicate memoriesscripts/prune-low-value-memories.sh- Remove unused memoriesscripts/setup-memory-archival.sh- Configure archival systemscripts/setup-monitoring.sh- Configure performance monitoringscripts/generate-alert-config.sh- Create alert rulesscripts/generate-cache-config.sh- Generate cache configurationsscripts/suggest-vector-db.sh- Vector database recommendations
Templates:
templates/optimized-memory-config.py- Production-ready configurationtemplates/cache-strategies/- Caching implementation patternsin-memory-cache.py- Python LRU cacheredis-cache.py- Redis caching layeredge-cache-config.yaml- CDN/edge caching
templates/vector-db-optimization/- Database-specific tuningqdrant-config.py- Optimized Qdrant setuppgvector-config.py- Optimized pgvector setupmilvus-config.py- Optimized Milvus setup
templates/embedding-configs/- Embedding optimizationcost-optimized.py- Minimal cost configurationperformance-optimized.py- Maximum performancebalanced.py- Cost/performance balance
templates/monitoring/- Monitoring configurationsprometheus-metrics.yaml- Metrics collectiongrafana-dashboard.json- Performance dashboardalert-rules.yaml- Alert configurations
Examples:
examples/optimization-case-studies.md- Real-world optimization examplesexamples/before-after-benchmarks.md- Performance improvement resultsexamples/cost-reduction-strategies.md- Cost optimization success storiesexamples/caching-patterns.md- Effective caching implementationsexamples/oss-vs-platform-optimization.md- Platform-specific strategies
Best Practices
- Measure First: Run performance analysis before optimizing
- Prioritize: Address biggest bottlenecks first (80/20 rule)
- Incremental: Implement one optimization at a time, measure impact
- Cache Wisely: Cache frequently accessed, rarely changing data
- Right-Size Models: Don't use large embeddings for simple use cases
- Batch Operations: Always batch when possible
- Monitor Continuously: Set up alerts, review metrics weekly
- Test Changes: Benchmark before/after every optimization
- Document Impact: Track cost/performance improvements
- Review Regularly: Monthly optimization reviews
Performance Targets
Query Latency:
- Average: < 100ms
- P95: < 200ms
- P99: < 500ms
Cache Performance:
- Hit rate: > 70%
- Miss penalty: < 2x uncached time
Cost Efficiency:
- Cost per 1000 queries: < $0.10 (Platform), < $0.02 (OSS)
- Storage growth: < 10% per month
- Embedding costs: < 40% of total
Resource Usage (OSS):
- CPU: < 60% average
- Memory: < 70% average
- Storage: Plan for 6 months growth
Troubleshooting
Slow Queries Despite Optimization:
- Check database indexes exist
- Verify connection pooling active
- Review filter effectiveness
- Check for database resource constraints
Cache Not Improving Performance:
- Verify cache keys are consistent
- Check TTL isn't too short
- Ensure cache size is adequate
- Monitor cache eviction rate
High Costs After Optimization:
- Review actual usage patterns
- Check for memory leaks (unbounded growth)
- Verify deduplication running
- Review archival policies
Optimization Caused Accuracy Issues:
- Reranking disabled may reduce quality
- Smaller embeddings reduce semantic understanding
- Lower limits may miss relevant memories
- Balance performance vs accuracy needs
Plugin: mem0 Version: 1.0.0 Last Updated: 2025-10-27