| name | multi-query |
| description | Use when search queries need better recall through query expansion - generates multiple query variants, retrieves with each, and fuses results using RRF for improved retrieval quality, especially with ambiguous or under-specified queries |
| version | 1.0.0 |
LLMemory Multi-Query Expansion
Installation
uv add llmemory
# or
pip install llmemory
Overview
Multi-query expansion improves search recall by:
- Generating multiple query variants from the original query
- Searching with each variant independently
- Fusing results using Reciprocal Rank Fusion (RRF)
- Returning unified, deduplicated results
Two expansion modes:
- Heuristic (default): Fast lexical variants using keyword extraction, OR clauses, and phrase matching. No LLM calls, <1ms latency.
- LLM-based (configurable): Semantic query variants using GPT-4o-mini or similar. Better recall, 50-200ms latency, requires API key.
When to use multi-query expansion:
- Queries are ambiguous or under-specified
- Want to capture different perspectives or phrasings
- Improve recall for complex information needs
- User queries tend to be short or vague
When NOT to use:
- Queries are already very specific
- Latency is critical (multi-query adds overhead)
- Simple keyword lookups
Quick Start
from llmemory import LLMemory, SearchType
async with LLMemory(connection_string="postgresql://localhost/mydb") as memory:
    # Enable query expansion
    results = await memory.search(
        owner_id="workspace-1",
        query_text="improve customer satisfaction",
        search_type=SearchType.HYBRID,
        query_expansion=True,   # Enable expansion
        max_query_variants=3,   # Generate 3 variants
        limit=10
    )

    # Results are from all 3 query variants, fused with RRF
    for result in results:
        print(f"[{result.rrf_score:.3f}] {result.content[:80]}...")
Complete API Documentation
search() with Query Expansion
Signature:
async def search(
    owner_id: str,
    query_text: str,
    search_type: Union[SearchType, str] = SearchType.HYBRID,
    limit: int = 10,
    query_expansion: Optional[bool] = None,
    max_query_variants: Optional[int] = None,
    **kwargs
) -> List[SearchResult]
Query Expansion Parameters:
- query_expansion (bool, optional): Enable/disable query expansion
  - None (default): Follow global config (LLMEMORY_ENABLE_QUERY_EXPANSION)
  - True: Force enable for this search
  - False: Force disable for this search
- max_query_variants (int, optional): Maximum query variants to generate
  - Default: 3 (from config search.max_query_variants)
  - Range: 1-5 (practical limit for performance)
  - Includes original query + (N-1) generated variants
Returns:
List[SearchResult] with multi-query specific behavior:
- Results fused from all query variants using RRF
- Each result's rrf_score reflects consensus across variants
- Deduplicated by chunk_id (same chunk from multiple variants counted once)
Example:
# Basic multi-query search
results = await memory.search(
owner_id="workspace-1",
query_text="reduce server latency",
search_type=SearchType.HYBRID,
query_expansion=True,
max_query_variants=3,
limit=15
)
# Variants might be:
# 1. "reduce server latency" (original)
# 2. "improve server response time"
# 3. "optimize backend performance"
# All 3 variants are searched, results are fused
How Multi-Query Works
Query Variant Generation
llmemory generates query variants using heuristic rules (no LLM required by default):
Original Query: "customer retention strategies"
Generated Variants:
1. "customer retention strategies" (original, always included)
2. "customer OR retention OR strategies" (OR variant - widens lexical recall)
3. "\"customer retention strategies\"" (quoted phrase - exact match)
Each variant captures a different matching strategy:
- Original: Standard BM25 matching
- OR variant: Boolean OR to catch documents with any key term
- Quoted phrase: Exact phrase matching for precision
With stopwords:
Original Query: "how to improve the customer satisfaction"
Generated Variants:
1. "how to improve the customer satisfaction" (original)
2. "how improve customer satisfaction" (keyword variant - stopwords removed)
3. "how OR improve OR customer OR satisfaction" (OR variant)
4. "\"how to improve the customer satisfaction\"" (quoted phrase)
Search Execution
# Internally, multi-query does:
# 1. Generate variants
variants = [
"customer retention strategies", # original
"customer OR retention OR strategies", # OR variant
"\"customer retention strategies\"" # quoted phrase
]
# 2. Search with each variant (executed sequentially)
results_1 = await search(query=variants[0], ...)
results_2 = await search(query=variants[1], ...)
results_3 = await search(query=variants[2], ...)
# 3. Fuse results using RRF
final_results = rrf_fusion([results_1, results_2, results_3])
RRF Fusion
Reciprocal Rank Fusion combines results from multiple query variants:
For each query variant:
For each result in that variant:
score = 1 / (k + rank + 1)
For each unique chunk (by chunk_id):
total_score = sum of scores from all variants
Sort by total_score descending
Where k = 50, a constant that prevents top-ranked results from dominating. The + 1 ensures the first result (rank = 0) scores 1/(50+0+1) = 1/51 rather than 1/50.
Key insight: Chunks appearing in multiple variant result sets get higher RRF scores, indicating strong relevance consensus.
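To make the formula concrete, here is a minimal standalone sketch of RRF fusion. It is illustrative, not llmemory's internal implementation, and represents each result as a bare chunk_id string:

def rrf_fusion(result_lists: list[list[str]], k: int = 50) -> list[tuple[str, float]]:
    """Fuse ranked result lists (best-first) with Reciprocal Rank Fusion."""
    scores: dict[str, float] = {}
    for results in result_lists:
        for rank, chunk_id in enumerate(results):
            # rank is 0-based, hence the + 1 noted above
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank + 1)
    # Highest consensus first
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# A chunk ranked by all three variants beats one ranked by a single variant:
print(rrf_fusion([["a", "b"], ["a", "c"], ["a", "b"]]))
# [('a', ~0.0588), ('b', ~0.0385), ('c', ~0.0192)]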
Practical Examples
Customer Support Search
# Original query is vague
results = await memory.search(
owner_id="support-team",
query_text="login problems",
search_type=SearchType.HYBRID,
query_expansion=True,
max_query_variants=3,
limit=20
)
# Variants use different matching strategies:
# - "login problems" (original)
# - "login OR problems" (OR variant - widens recall)
# - "\"login problems\"" (exact phrase)
#
# Results include:
# - Documents with both "login" AND "problems" (original)
# - Documents with either "login" OR "problems" (OR variant)
# - Documents with exact phrase "login problems" (quoted variant)
Product Documentation Search
# Technical query benefits from multiple phrasings
results = await memory.search(
owner_id="docs-site",
query_text="async function error handling",
search_type=SearchType.HYBRID,
query_expansion=True,
max_query_variants=4,
limit=15
)
# Variants use different matching:
# - "async function error handling" (original)
# - "async function error handling" (keyword variant - no stopwords here)
# - "async OR function OR error OR handling" (OR variant)
# - "\"async function error handling\"" (exact phrase)
#
# Balances precision (exact phrase) with recall (OR variant)
Research & Discovery
# Exploratory queries need broad coverage
results = await memory.search(
owner_id="research-db",
query_text="climate change mitigation",
search_type=SearchType.HYBRID,
query_expansion=True,
max_query_variants=5,
limit=25
)
# Variants use different matching:
# - "climate change mitigation" (original)
# - "climate OR change OR mitigation" (OR variant)
# - "\"climate change mitigation\"" (exact phrase)
# (heuristics produce only these variants here; max_query_variants=5 is an upper bound, not a target)
E-commerce Search
# Product searches benefit from synonyms and alternatives
results = await memory.search(
owner_id="store-1",
query_text="lightweight laptop for travel",
search_type=SearchType.HYBRID,
query_expansion=True,
max_query_variants=3,
metadata_filter={"category": "electronics"},
limit=20
)
# Variants use different matching:
# - "lightweight laptop for travel" (original)
# - "lightweight laptop travel" (keyword variant - stopwords removed)
# - "lightweight OR laptop OR travel" (OR variant)
# - "\"lightweight laptop for travel\"" (exact phrase)
Configuration
Global Configuration
# Environment variables
LLMEMORY_ENABLE_QUERY_EXPANSION=1
LLMEMORY_MAX_QUERY_VARIANTS=3
Programmatic Configuration
from llmemory import LLMemoryConfig
config = LLMemoryConfig()
config.search.enable_query_expansion = True
config.search.max_query_variants = 3
memory = LLMemory(
connection_string="postgresql://localhost/mydb",
config=config
)
LLM-Based Expansion (Advanced)
For semantic query diversity, enable LLM-based expansion:
Environment Variables:
LLMEMORY_ENABLE_QUERY_EXPANSION=1
LLMEMORY_QUERY_EXPANSION_MODEL=gpt-4o-mini
OPENAI_API_KEY=sk-...
Programmatic Configuration:
from llmemory import LLMemoryConfig
config = LLMemoryConfig()
config.search.enable_query_expansion = True
config.search.query_expansion_model = "gpt-4o-mini" # Enable LLM expansion
config.search.max_query_variants = 3
memory = LLMemory(
connection_string="postgresql://localhost/mydb",
openai_api_key="sk-...",
config=config
)
LLM vs Heuristic Comparison:
| Mode | Latency | Quality | Cost | Use Case |
|---|---|---|---|---|
| Heuristic | <1ms | Good | Free | Default, high-QPS |
| LLM | 50-200ms | Excellent | ~$0.001/query | Quality-critical |
LLM Expansion Example:
# Original: "improve customer retention"
# LLM variants:
# 1. "strategies to reduce customer churn"
# 2. "methods for increasing customer loyalty"
# 3. "how to keep customers from leaving"
results = await memory.search(
owner_id="workspace-1",
query_text="improve customer retention",
query_expansion=True,
max_query_variants=3,
limit=10
)
Per-Query Override
# Override global config for specific searches
# Force enable (even if globally disabled)
results = await memory.search(
owner_id="workspace-1",
query_text="vague query here",
query_expansion=True, # Force enable
max_query_variants=4,
limit=10
)
# Force disable (even if globally enabled)
results = await memory.search(
owner_id="workspace-1",
query_text="very specific query",
query_expansion=False, # Force disable
limit=10
)
Performance Considerations
Latency Impact
import time
# Without query expansion (fast)
start = time.time()
results = await memory.search(
owner_id="workspace-1",
query_text="test query",
query_expansion=False,
limit=10
)
elapsed_single = (time.time() - start) * 1000
print(f"Single query: {elapsed_single:.2f}ms")
# With query expansion (slower)
start = time.time()
results = await memory.search(
owner_id="workspace-1",
query_text="test query",
query_expansion=True,
max_query_variants=3,
limit=10
)
elapsed_multi = (time.time() - start) * 1000
print(f"Multi-query: {elapsed_multi:.2f}ms")
# Typical cost with 3 variants (heuristic mode):
# - Variant generation: <1ms (heuristic rules, no LLM)
# - Searches: ~3x single-search time (variants run sequentially)
# - RRF fusion: 5-10ms
# Total: ~3x base search latency plus minimal fusion overhead
Optimizing Performance
# Use fewer variants for speed
results = await memory.search(
owner_id="workspace-1",
query_text="query",
query_expansion=True,
max_query_variants=2, # Faster than 3-4 (fewer searches to execute)
limit=10
)
When to Use Multi-Query
Good use cases:
- Queries that might match with different keyword combinations
- Short queries where OR expansion helps recall
- Queries where both exact and fuzzy matching are valuable
- Cases where you want to balance precision (quoted) and recall (OR)
Avoid for:
- Very specific queries (expansion doesn't help)
- High-QPS API endpoints (multiplies search cost)
- When latency is critical (3x+ base search time)
- Pure semantic search (heuristic variants don't add semantic diversity)
Combining with Other Features
Multi-Query + Reranking
# Combine query expansion with reranking for best quality
results = await memory.search(
owner_id="workspace-1",
query_text="machine learning deployment",
search_type=SearchType.HYBRID,
query_expansion=True, # Generate variants
max_query_variants=3,
rerank=True, # Rerank fused results
rerank_top_k=50, # Consider top 50 from RRF
rerank_return_k=15, # Return top 15 after reranking
limit=15
)
# Pipeline:
# 1. Generate 3 query variants
# 2. Search with each variant
# 3. Fuse with RRF (top 50 candidates)
# 4. Rerank top 50 candidates
# 5. Return top 15 after reranking
Multi-Query + Metadata Filtering
# Apply filters to all query variants
results = await memory.search(
owner_id="workspace-1",
query_text="quarterly performance",
search_type=SearchType.HYBRID,
query_expansion=True,
max_query_variants=3,
metadata_filter={
"department": "finance",
"year": 2024
},
date_from=datetime(2024, 1, 1),
limit=20
)
# All 3 variants search within filtered documents only
Multi-Query + Hybrid Search Tuning
# Tune alpha for all query variants
results = await memory.search(
owner_id="workspace-1",
query_text="customer feedback analysis",
search_type=SearchType.HYBRID,
alpha=0.6, # Applied to all variants
query_expansion=True,
max_query_variants=3,
limit=15
)
# Each variant uses alpha=0.6 for hybrid search
# Results are then fused with RRF
Monitoring and Debugging
Inspecting Query Variants
Multi-query search logs include the generated variants in diagnostics:
# After search, variants are logged
# Check application logs or search history for:
# {
# "query_variants": [
# "original query",
# "variant 1",
# "variant 2"
# ],
# "variant_stats": [
# {"query": "original", "result_count": 15, "latency_ms": 45.3},
# {"query": "variant 1", "result_count": 18, "latency_ms": 42.1},
# {"query": "variant 2", "result_count": 12, "latency_ms": 48.7}
# ]
# }
Understanding RRF Scores
results = await memory.search(
owner_id="workspace-1",
query_text="test",
query_expansion=True,
max_query_variants=3,
limit=10
)
for result in results:
    print(f"Chunk: {result.chunk_id}")
    print(f"  RRF Score: {result.rrf_score:.4f}")
    print(f"  Content: {result.content[:80]}...")
    print()
# Higher RRF scores indicate:
# - Chunk appeared in multiple variant results
# - Chunk ranked highly across variants
# - Strong consensus on relevance
Common Mistakes
❌ Wrong: Using multi-query for all searches
# Don't enable globally if not needed
results = await memory.search(
owner_id="workspace-1",
query_text="iPhone 14 Pro", # Very specific, doesn't need expansion
query_expansion=True, # Adds latency without benefit
limit=10
)
✅ Right: Use selectively for complex queries
# Enable only when query benefits from expansion
if is_complex_query(query_text):
    query_expansion = True
else:
    query_expansion = False
results = await memory.search(
owner_id="workspace-1",
query_text=query_text,
query_expansion=query_expansion,
limit=10
)
❌ Wrong: Too many query variants
results = await memory.search(
owner_id="workspace-1",
query_text="test",
query_expansion=True,
max_query_variants=10, # Too many, diminishing returns
limit=10
)
# Latency increases linearly but quality plateaus
✅ Right: Use 2-4 variants for balance
results = await memory.search(
owner_id="workspace-1",
query_text="test",
query_expansion=True,
max_query_variants=3, # Good balance of quality vs speed
limit=10
)
❌ Wrong: Ignoring latency requirements
# Real-time autocomplete endpoint
@app.get("/autocomplete")
async def autocomplete(q: str):
results = await memory.search(
owner_id="workspace-1",
query_text=q,
query_expansion=True, # Too slow for autocomplete!
limit=5
)
return results
✅ Right: Consider latency constraints
# Use multi-query for main search, not autocomplete
@app.get("/autocomplete")
async def autocomplete(q: str):
results = await memory.search(
owner_id="workspace-1",
query_text=q,
query_expansion=False, # Fast single query
limit=5
)
return results
@app.get("/search")
async def search(q: str):
results = await memory.search(
owner_id="workspace-1",
query_text=q,
query_expansion=True, # Quality search with expansion
max_query_variants=3,
limit=20
)
return results
Query Variant Generation Strategies
By default, multi-query generates variants with heuristic rules rather than LLM-based expansion:
1. Keyword Variant (Stopword Removal)
Removes common stopwords to focus on key terms:
from llmemory.query_expansion import DEFAULT_STOPWORDS
# Stopwords include: a, an, and, are, as, at, be, by, for, from, has,
# in, is, it, of, on, or, that, the, to, was, were, will, with
# Example:
# Input: "how to improve the customer satisfaction"
# Output: "how improve customer satisfaction"
#
# Input: "best practices for the database"
# Output: "best practices database"
Enabled by default via config.search.include_keyword_variant = True.
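As a sketch of what this step produces, the keyword variant can be reproduced with a simple filter over DEFAULT_STOPWORDS. The helper below is illustrative, not llmemory's own code:

from llmemory.query_expansion import DEFAULT_STOPWORDS

def keyword_variant(query: str) -> str | None:
    """Illustrative stopword filter; returns None when nothing changes."""
    kept = [t for t in query.split() if t.lower() not in DEFAULT_STOPWORDS]
    variant = " ".join(kept)
    return variant if variant and variant != query else None

print(keyword_variant("how to improve the customer satisfaction"))
# "how improve customer satisfaction"
print(keyword_variant("machine learning deployment"))
# None - no stopwords, variant would duplicate the original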
2. OR Variant (Boolean Expansion)
Creates Boolean OR of all non-stopword terms to maximize recall:
# Example:
# Input: "customer retention strategies"
# Output: "customer OR retention OR strategies"
#
# Input: "reduce server latency"
# Output: "reduce OR server OR latency"
Only generated for multi-word queries. Widens recall by matching documents containing ANY of the key terms.
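The construction is equally simple to sketch (again illustrative rather than llmemory's internal code, reusing the DEFAULT_STOPWORDS import above):

def or_variant(query: str) -> str | None:
    """Illustrative OR-variant builder for multi-word queries."""
    terms = [t for t in query.split() if t.lower() not in DEFAULT_STOPWORDS]
    if len(terms) < 2:  # only generated for multi-word queries
        return None
    return " OR ".join(terms)

print(or_variant("customer retention strategies"))
# "customer OR retention OR strategies"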
3. Quoted Phrase Variant (Exact Match)
Wraps the query in quotes for exact phrase matching:
# Example:
# Input: "machine learning deployment"
# Output: "\"machine learning deployment\""
#
# Input: "error handling"
# Output: "\"error handling\""
Only generated for multi-word queries. Ensures high precision by requiring exact phrase match.
Complete Example
from llmemory.query_expansion import QueryExpansionService
from llmemory.config import SearchConfig
service = QueryExpansionService(SearchConfig())
# With stopwords
variants = service._heuristic_variants(
"how to improve the customer satisfaction",
include_keywords=True
)
# Returns:
# 1. "how improve customer satisfaction" (keyword variant)
# 2. "how OR improve OR customer OR satisfaction" (OR variant)
# 3. "\"how to improve the customer satisfaction\"" (quoted phrase)
# Without stopwords
variants = service._heuristic_variants(
"machine learning deployment",
include_keywords=True
)
# Returns:
# 1. "machine OR learning OR deployment" (OR variant, no keyword variant since no stopwords)
# 2. "\"machine learning deployment\"" (quoted phrase)
Advanced: Custom LLM-Based Expansion
The default implementation uses heuristics, but you can provide a custom LLM callback:
from llmemory.query_expansion import QueryExpansionService, ExpansionCallback
async def my_llm_expander(query: str, max_variants: int) -> list[str]:
    """Custom LLM-based query expansion."""
    # Call your LLM here to generate semantic variants
    variants = await my_llm.generate_variants(query, max_variants)
    return variants
service = QueryExpansionService(
search_config=config.search,
llm_callback=my_llm_expander # Optional custom expansion
)
When llm_callback is provided, it's tried first; heuristics are used as fallback if LLM fails.
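For reference, here is one way the callback could be fleshed out. This is a minimal sketch assuming the official openai async client; the prompt wording and line-based parsing are illustrative choices, not llmemory requirements:

from openai import AsyncOpenAI

client = AsyncOpenAI()  # reads OPENAI_API_KEY from the environment

async def my_llm_expander(query: str, max_variants: int) -> list[str]:
    """Sketch: ask an LLM for semantic rephrasings, one per line."""
    response = await client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{
            "role": "user",
            "content": (
                f"Rewrite this search query {max_variants} different ways, "
                f"one per line, preserving its meaning: {query}"
            ),
        }],
    )
    lines = (response.choices[0].message.content or "").splitlines()
    return [line.strip() for line in lines if line.strip()][:max_variants]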
Advanced Patterns
Conditional Multi-Query
def should_expand_query(query_text: str) -> tuple[bool, int]:
    """Decide if query needs expansion and how many variants."""
    # Short queries benefit most from expansion
    if len(query_text.split()) <= 3:
        return True, 4

    # Questions and exploratory queries
    question_words = ["how", "what", "why", "when", "where", "who"]
    if any(word in query_text.lower() for word in question_words):
        return True, 3

    # Specific queries don't need expansion
    if any(char.isdigit() for char in query_text):  # Has numbers
        return False, 1
    if '"' in query_text:  # Has quotes (exact phrase)
        return False, 1

    # Default: moderate expansion
    return True, 2
# Use dynamic expansion
query = "how to improve performance"
expand, variants = should_expand_query(query)
results = await memory.search(
owner_id="workspace-1",
query_text=query,
query_expansion=expand,
max_query_variants=variants,
limit=10
)
A/B Testing Query Expansion
import random
# Randomly enable/disable for 50% of queries
use_expansion = random.random() < 0.5
results = await memory.search(
owner_id="workspace-1",
query_text=query_text,
query_expansion=use_expansion,
limit=10
)
# Track metrics:
# - Click-through rate
# - Result relevance
# - User satisfaction
# - Search latency
# Compare A (no expansion) vs B (with expansion)
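A lightweight way to collect those metrics is to emit one structured record per search. The record schema below is a hypothetical example, not part of llmemory:

import json
import time

def log_search_metrics(query_text: str, use_expansion: bool,
                       latency_ms: float, result_count: int) -> None:
    """Hypothetical per-search metrics record for offline A/B comparison."""
    record = {
        "ts": time.time(),
        "arm": "expansion" if use_expansion else "control",
        "query": query_text,
        "latency_ms": round(latency_ms, 2),
        "result_count": result_count,
    }
    print(json.dumps(record))  # route to your metrics pipeline instead of stdout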
Related Skills
- basic-usage - Core search operations
- hybrid-search - Vector + text hybrid search fundamentals
- rag - Using multi-query in RAG systems
- multi-tenant - Multi-tenant isolation patterns
Important Notes
Expansion Modes:
- Heuristic (default): Keyword extraction, OR clauses, phrase matching. Fast, no API calls.
- LLM (configurable): Semantic variants via GPT-4o-mini. Set query_expansion_model in config.
No LLM Required for Default:
Query expansion works out-of-the-box with heuristic rules. No API key or LLM calls needed unless you configure query_expansion_model.
Cost Considerations (LLM mode only): LLM expansion makes 1 API call per search. For high-volume applications with LLM expansion, consider:
- Caching common queries and their variants (see the sketch after this list)
- Using smaller models (gpt-4o-mini is fast and cheap)
- Enabling only for specific use cases
- Hybrid: Use heuristics for autocomplete, LLM for main search
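As a sketch of the caching idea, a small in-process cache around the custom callback from the previous section avoids repeat LLM calls for identical queries. The cache and cached_llm_expander are hypothetical names; for multi-process deployments a shared cache such as Redis would replace the dict:

_variant_cache: dict[str, list[str]] = {}

async def cached_llm_expander(query: str, max_variants: int) -> list[str]:
    """Memoize LLM-generated variants keyed on the normalized query."""
    key = f"{query.strip().lower()}:{max_variants}"
    if key not in _variant_cache:
        # my_llm_expander is the custom callback defined earlier
        _variant_cache[key] = await my_llm_expander(query, max_variants)
    return _variant_cache[key]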
Quality vs Speed:
- Heuristic: <1ms overhead, lexical diversity only
- LLM: 50-200ms overhead, semantic diversity
Fallback Behavior: If LLM expansion fails or times out (8s), system automatically falls back to heuristic expansion. Search always completes.
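The same shape is easy to reproduce around a custom callback. A sketch assuming asyncio and an 8-second budget mirroring the documented timeout, reusing my_llm_expander and the QueryExpansionService instance (service) from earlier examples; expand_with_fallback is a hypothetical helper:

import asyncio

async def expand_with_fallback(query: str, max_variants: int) -> list[str]:
    """Try the LLM callback with a time budget; fall back to heuristics."""
    try:
        return await asyncio.wait_for(
            my_llm_expander(query, max_variants), timeout=8.0
        )
    except Exception:  # timeout, API error, etc.
        return service._heuristic_variants(query, include_keywords=True)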