| name | rag |
| description | Use when building Retrieval-Augmented Generation systems - covers document ingestion, hybrid search retrieval, reranking results, and prompt augmentation for accurate LLM responses grounded in your knowledge base |
| version | 1.0.0 |
LLMemory RAG Systems
Installation
uv add llmemory
# For reranking support
uv add "llmemory[reranker-local]" # Local cross-encoder models
# or configure OpenAI reranking (no extra install needed)
Overview
Retrieval-Augmented Generation (RAG) combines llmemory's document retrieval with LLM generation for accurate, grounded responses.
RAG Pipeline:
- Ingest: Add documents to llmemory
- Retrieve: Search for relevant chunks
- Rerank: Improve relevance ordering (optional but recommended)
- Augment: Build prompt with retrieved context
- Generate: Get LLM response
When to use RAG:
- Question answering over your documents
- Customer support with knowledge base
- Research assistance
- Code documentation search
- Any application needing accurate, source-backed answers
Quick Start
from llmemory import LLMemory, SearchType, DocumentType
from openai import AsyncOpenAI

async def rag_system():
    # Initialize
    memory = LLMemory(
        connection_string="postgresql://localhost/mydb",
        openai_api_key="sk-..."
    )
    await memory.initialize()

    # 1. Ingest documents
    await memory.add_document(
        owner_id="workspace-1",
        id_at_origin="kb",
        document_name="product_guide.md",
        document_type=DocumentType.MARKDOWN,
        content="Your product documentation..."
    )

    # 2. Retrieve with reranking
    results = await memory.search(
        owner_id="workspace-1",
        query_text="how to reset password",
        search_type=SearchType.HYBRID,
        query_expansion=True,  # Better retrieval
        rerank=True,           # Better ranking
        rerank_top_k=50,       # Rerank top 50 candidates
        rerank_return_k=10,    # Prefer 10 best after reranking
        limit=5                # Final result count (max of limit and rerank_return_k)
    )

    # 3. Build prompt with context
    context = "\n\n".join([
        f"Source: {r.metadata.get('source', 'unknown')}\n{r.content}"
        for r in results
    ])
    prompt = f"""Answer the question using only the provided context.

Context:
{context}

Question: how to reset password

Answer:"""

    # 4. Generate response
    client = AsyncOpenAI()
    response = await client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}]
    )
    print(response.choices[0].message.content)

    await memory.close()

import asyncio
asyncio.run(rag_system())
Query Routing for Production RAG
Production RAG systems should detect when queries cannot be answered from available documents.
When to use query routing:
- User queries may be unanswerable from your knowledge base
- Need to route to web search or external APIs
- Want to avoid hallucinated answers
- Building conversational assistants
Example:
from llmemory import LLMemory

async def answer_query(query: str) -> str:
    """Route the query before answering, e.g. query = "What's the current weather in Paris?"."""
    async with LLMemory(connection_string="...") as memory:
        # Search with automatic routing
        result = await memory.search_with_routing(
            owner_id="workspace-1",
            query_text=query,
            enable_routing=True,
            limit=5
        )

        if result["route"] == "retrieval":
            # Answer from documents (generate_answer is your own RAG generation step)
            return generate_answer(result["results"])
        elif result["route"] == "web_search":
            # Route to web search (fetch_from_web is your own helper)
            return fetch_from_web(query)
        elif result["route"] == "unanswerable":
            # Honest response
            return "I don't have information to answer that question."
        else:  # clarification
            return "Could you please provide more details?"
API Reference:
search_with_routing()
Route queries intelligently before searching.
Signature:
async def search_with_routing(
    owner_id: str,
    query_text: str,
    enable_routing: bool = True,
    routing_threshold: float = 0.7,
    **search_kwargs
) -> Dict[str, Any]
Parameters:
- owner_id (str): Owner identifier
- query_text (str): Search query
- enable_routing (bool, default: True): Enable automatic routing
- routing_threshold (float, default: 0.7): Confidence threshold
- **search_kwargs: Additional arguments passed to search()
Returns: Dict with:
route(str): "retrieval", "web_search", "unanswerable", or "clarification"confidence(float): 0-1 confidence in routing decisionresults(List[SearchResult]): If route="retrieval"message(str): If route != "retrieval"reason(str): Explanation of routing decision
Example:
result = await memory.search_with_routing(
    owner_id="support",
    query_text="How do I reset my password?",
    routing_threshold=0.8
)

if result["route"] == "retrieval":
    answer = generate_rag_response(result["results"])
else:
    answer = result["message"]  # Pre-formatted response
Complete RAG Pipeline
Step 1: Document Ingestion
from llmemory import LLMemory, DocumentType, ChunkingConfig

async def ingest_knowledge_base(memory: LLMemory, owner_id: str):
    """Ingest documents into the RAG system."""
    # Configure chunking for RAG (smaller chunks for precise retrieval)
    chunking_config = ChunkingConfig(
        chunk_size=300,               # Tokens per chunk (smaller for RAG)
        chunk_overlap=50,             # Overlap for context preservation
        strategy="hierarchical",      # Chunking strategy
        min_chunk_size=100,           # Minimum chunk size
        max_chunk_size=500,           # Maximum chunk size
        max_chunk_depth=3,            # Maximum hierarchy depth
        enable_chunk_summaries=True,  # Generate summaries
        summary_max_tokens=80         # Max tokens for summaries
    )

    documents = [
        {
            "name": "product_guide.md",
            "type": DocumentType.MARKDOWN,
            "content": "...",
            "metadata": {"category": "guide", "version": "2.0"}
        },
        {
            "name": "faq.md",
            "type": DocumentType.MARKDOWN,
            "content": "...",
            "metadata": {"category": "faq"}
        },
        {
            "name": "api_docs.md",
            "type": DocumentType.TECHNICAL_DOC,
            "content": "...",
            "metadata": {"category": "api", "language": "python"}
        }
    ]

    for doc in documents:
        result = await memory.add_document(
            owner_id=owner_id,
            id_at_origin="knowledge_base",
            document_name=doc["name"],
            document_type=doc["type"],
            content=doc["content"],
            metadata=doc["metadata"],
            chunking_config=chunking_config,
            generate_embeddings=True
        )
        print(f"Ingested {doc['name']}: {result.chunks_created} chunks")
Step 2: Retrieval Configuration
from typing import List

from llmemory import LLMemory, SearchType
# SearchResult is the result type returned by memory.search(); import path assumed
from llmemory import SearchResult

async def retrieve_for_rag(
    memory: LLMemory,
    owner_id: str,
    query: str,
    top_k: int = 5
) -> List[SearchResult]:
    """Retrieve relevant chunks for RAG."""
    results = await memory.search(
        owner_id=owner_id,
        query_text=query,
        # Hybrid search for best quality
        search_type=SearchType.HYBRID,
        alpha=0.6,  # Slight favor to semantic search
        # Query expansion for better recall
        query_expansion=True,
        max_query_variants=3,
        # Reranking for precision
        rerank=True,
        rerank_top_k=20,        # Consider top 20 candidates
        rerank_return_k=top_k,  # Prefer top_k after reranking
        # Final limit (actual count = max(limit, rerank_return_k))
        limit=top_k
    )
    return results
Step 3: Reranking Configuration
llmemory supports multiple reranking methods:
OpenAI Reranking (Recommended for Quality)
# Configure via environment
LLMEMORY_RERANK_PROVIDER=openai
LLMEMORY_RERANK_MODEL=gpt-4.1-mini
LLMEMORY_RERANK_TOP_K=30
LLMEMORY_RERANK_RETURN_K=10
# Or programmatically
from llmemory import LLMemoryConfig
config = LLMemoryConfig()
config.search.enable_rerank = True
config.search.rerank_provider = "openai"
config.search.default_rerank_model = "gpt-4.1-mini"
config.search.rerank_top_k = 30
config.search.rerank_return_k = 10
memory = LLMemory(
    connection_string="postgresql://localhost/mydb",
    config=config
)
Local Cross-Encoder Reranking (Faster, No API Calls)
# Install local reranker dependencies
uv add "llmemory[reranker-local]"
# Configure
config = LLMemoryConfig()
config.search.enable_rerank = True
config.search.default_rerank_model = "cross-encoder/ms-marco-MiniLM-L6-v2"
config.search.rerank_device = "cpu" # or "cuda"
config.search.rerank_batch_size = 16
Lexical Reranking (Fallback, No Dependencies)
# Automatic fallback when no reranker configured
# Uses token overlap scoring
results = await memory.search(
    owner_id="workspace-1",
    query_text="query",
    rerank=True  # Uses lexical reranking
)
Reranker API Reference
CrossEncoderReranker
Local cross-encoder model for reranking search results without API calls.
Constructor:
CrossEncoderReranker(
    model_name: str = "cross-encoder/ms-marco-MiniLM-L6-v2",
    device: Optional[str] = None,
    batch_size: int = 16
)
Parameters:
- model_name (str, default: "cross-encoder/ms-marco-MiniLM-L6-v2"): Hugging Face cross-encoder model name
  - Available models: "cross-encoder/ms-marco-MiniLM-L6-v2", "cross-encoder/ms-marco-TinyBERT-L2-v2"
- device (Optional[str]): Device to run on: "cpu", "cuda", or None (auto-detect)
- batch_size (int, default: 16): Batch size for inference
Methods:
score()
Score query-document pairs for relevance.
Signature:
async def score(
    query_text: str,
    results: Sequence[SearchResult]
) -> Sequence[float]
Parameters:
- query_text (str): Search query
- results (Sequence[SearchResult]): Search results to score
Returns:
Sequence[float]: Relevance scores (same length as results)
Example:
from llmemory import CrossEncoderReranker

# Initialize reranker
reranker = CrossEncoderReranker(
    model_name="cross-encoder/ms-marco-MiniLM-L6-v2",
    device="cpu",
    batch_size=32
)

# Get initial search results
results = await memory.search(
    owner_id="workspace-1",
    query_text="machine learning",
    limit=50,
    rerank=False  # Skip reranking here; the cross-encoder reranks below
)

# Rerank with cross-encoder
scores = await reranker.score("machine learning", results)

# Sort by new scores
scored_results = list(zip(scores, results))
scored_results.sort(key=lambda x: x[0], reverse=True)
top_results = [r for _, r in scored_results[:10]]
Installation:
# Requires sentence-transformers
uv add "llmemory[reranker-local]"
OpenAIResponsesReranker
Use OpenAI GPT models for intelligent reranking with natural language understanding.
Constructor:
OpenAIResponsesReranker(
    model: str = "gpt-4.1-mini",
    max_candidates: int = 30,
    temperature: float = 0.0
)
Parameters:
- model (str, default: "gpt-4.1-mini"): OpenAI model name
  - Recommended: "gpt-4.1-mini" (fast, cost-effective), "gpt-4" (higher quality)
- max_candidates (int, default: 30): Maximum candidates to send to the API
- temperature (float, default: 0.0): Model temperature (0 = deterministic)
Methods:
score()
Score query-document pairs using OpenAI API.
Signature:
async def score(
    query_text: str,
    results: Sequence[SearchResult]
) -> Sequence[float]
Parameters:
- query_text (str): Search query
- results (Sequence[SearchResult]): Search results to score
Returns:
Sequence[float]: Relevance scores between 0 and 1
Example:
from llmemory import OpenAIResponsesReranker

# Initialize reranker (uses OPENAI_API_KEY from the environment)
reranker = OpenAIResponsesReranker(
    model="gpt-4.1-mini",
    max_candidates=20,
    temperature=0.0
)

# Get initial search results
results = await memory.search(
    owner_id="workspace-1",
    query_text="customer retention strategies",
    limit=50,
    rerank=False
)

# Rerank with OpenAI
scores = await reranker.score("customer retention strategies", results)

# Sort by scores
scored_results = list(zip(scores, results))
scored_results.sort(key=lambda x: x[0], reverse=True)
top_results = [r for _, r in scored_results[:10]]

print(f"Top result score: {scored_results[0][0]:.3f}")
Cost Considerations:
- Each rerank call makes one API request
- Costs depend on model and number of candidates
- Consider caching reranked results for repeated queries
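If repeated queries are common, caching reranker scores avoids paying for the same rerank call twice. A minimal in-process sketch, assuming you call the reranker's score() method directly; the cache (_rerank_cache), key derivation (_cache_key), TTL, and cached_scores helper are illustrative, not part of the llmemory API:

import hashlib
import time

# Illustrative in-process cache keyed by query text plus candidate chunk IDs
_rerank_cache: dict[str, tuple[float, list[float]]] = {}
CACHE_TTL_SECONDS = 300

def _cache_key(query_text, results) -> str:
    chunk_ids = ",".join(str(r.chunk_id) for r in results)
    return hashlib.sha256(f"{query_text}|{chunk_ids}".encode()).hexdigest()

async def cached_scores(reranker, query_text, results):
    """Return reranker scores, reusing them for repeated query/candidate sets."""
    key = _cache_key(query_text, results)
    hit = _rerank_cache.get(key)
    if hit and time.monotonic() - hit[0] < CACHE_TTL_SECONDS:
        return hit[1]
    scores = list(await reranker.score(query_text, results))
    _rerank_cache[key] = (time.monotonic(), scores)
    return scores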
When to use:
- Need highest quality reranking
- Willing to pay API costs
- Latency tolerance (100-300ms overhead)
RerankerService
Internal service that wraps reranker implementations (rarely used directly).
Usage: Automatically created by LLMemory when reranking is enabled via configuration. Generally not instantiated directly by users.
SearchResult Fields Reference
Search results contain multiple score fields depending on the search configuration. Understanding these fields helps optimize RAG retrieval quality.
Core Fields
chunk_id (UUID)
- Unique identifier for the chunk
document_id (UUID)
- Parent document identifier
content (str)
- Full chunk text content
metadata (Dict[str, Any])
- Chunk metadata (may include title, section, page number, etc.)
score (float)
- Primary relevance score
- For hybrid search: combined score from vector and text search
- For vector search: same as similarity
- For text search: same as text_rank
- After reranking: same as rerank_score
Optional Score Fields
similarity (Optional[float])
- Vector similarity score (cosine distance)
- Range: 0.0 to 1.0 (higher = more similar)
- Populated when search_type is VECTOR or HYBRID
- Example: 0.87 indicates 87% semantic similarity
text_rank (Optional[float])
- BM25 full-text search rank
- Higher values indicate better keyword matches
- Populated when search_type is TEXT or HYBRID
- Not normalized to [0,1] range
rrf_score (Optional[float])
- Reciprocal Rank Fusion score
- Populated when query_expansion=True (multi-query search)
- Combines rankings from multiple query variants
- Higher values indicate consistent ranking across variants
rerank_score (Optional[float])
- Reranker relevance score
- Populated when rerank=True
- Range and interpretation depend on the reranker (see the normalization sketch after the example below):
- OpenAI reranker: 0.0 to 1.0 (normalized probability)
- Cross-encoder: typically -10 to +10 (raw logit score)
- Lexical reranker: 0.0 to 1.0 (token overlap ratio)
- Higher values indicate higher relevance according to reranker
summary (Optional[str])
- Concise chunk summary (30-50% of original length)
- Populated when ChunkingConfig.enable_chunk_summaries=True
- Generated during document ingestion
- Use for prompts to reduce token usage: text = result.summary or result.content
- See the "Enable and Use Chunk Summaries" section below
Using Score Fields
# Example: Analyzing search result scores
results = await memory.search(
    owner_id="workspace-1",
    query_text="machine learning algorithms",
    search_type=SearchType.HYBRID,
    query_expansion=True,
    rerank=True,
    limit=5
)

for result in results:
    print(f"Chunk ID: {result.chunk_id}")
    print(f"  Final score: {result.score:.3f}")

    # Vector component (if hybrid/vector search)
    if result.similarity is not None:
        print(f"  Vector similarity: {result.similarity:.3f}")

    # Text component (if hybrid/text search)
    if result.text_rank is not None:
        print(f"  BM25 rank: {result.text_rank:.3f}")

    # Multi-query fusion (if query_expansion=True)
    if result.rrf_score is not None:
        print(f"  RRF score: {result.rrf_score:.3f}")

    # Reranking (if rerank=True)
    if result.rerank_score is not None:
        print(f"  Rerank score: {result.rerank_score:.3f}")

    # Summary (if enabled during ingestion)
    if result.summary:
        print(f"  Summary: {result.summary[:100]}...")
Reranking Parameters
rerank_top_k (int, default: 50)
- Number of initial candidates to send to reranker
- Retrieve this many results from base search before reranking
- Larger values: better quality but slower and more expensive
- Recommended range: 20-100
rerank_return_k (int, default: 15)
- Preferred number of results after reranking
- Results are prioritized by rerank score
- Actual result count: max(limit, rerank_return_k)
- Set higher than limit to ensure the best reranked results are returned
limit (int, default: 10)
- Final result count returned to user
- Works with rerank_return_k: final_count = max(limit, rerank_return_k)
- Example: limit=5, rerank_return_k=10 → returns 10 results
- Example: limit=20, rerank_return_k=10 → returns 20 results
# Example: Reranking parameter interactions
results = await memory.search(
    owner_id="workspace-1",
    query_text="database optimization",
    search_type=SearchType.HYBRID,
    rerank=True,
    rerank_top_k=50,     # Consider top 50 from base search
    rerank_return_k=10,  # Prefer 10 best after reranking
    limit=5              # But return max(5, 10) = 10 results
)
# Returns 10 results (max of limit and rerank_return_k)
assert len(results) == 10

results = await memory.search(
    owner_id="workspace-1",
    query_text="database optimization",
    search_type=SearchType.HYBRID,
    rerank=True,
    rerank_top_k=50,    # Consider top 50 from base search
    rerank_return_k=5,  # Prefer 5 best after reranking
    limit=20            # But return max(20, 5) = 20 results
)
# Returns 20 results (max of limit and rerank_return_k)
assert len(results) == 20
Step 4: Prompt Augmentation
from typing import List

def build_rag_prompt(
    query: str,
    results: List[SearchResult],
    system_instructions: str = "Answer based only on the provided context."
) -> str:
    """Build a RAG prompt with retrieved context."""
    # Format context from search results
    context_parts = []
    for i, result in enumerate(results, 1):
        # Include source information
        source = result.metadata.get("source", "Unknown")
        doc_name = result.metadata.get("document_name", "")

        # Use summary if available (more concise for prompts)
        # Summary is populated when ChunkingConfig.enable_chunk_summaries=True
        text = result.summary or result.content

        context_parts.append(
            f"[Source {i}: {doc_name or source}]\n{text}"
        )

    context = "\n\n".join(context_parts)

    # Build final prompt
    prompt = f"""{system_instructions}

Context:
{context}

Question: {query}

Answer:"""
    return prompt
Advanced Prompt Patterns
With Citation Requirements:
def build_prompt_with_citations(query: str, results: List[SearchResult]) -> str:
    context_parts = []
    for i, result in enumerate(results, 1):
        source = result.metadata.get("document_name", f"Source {i}")
        # Use summary if enabled (see the chunk summaries section below)
        text = result.summary or result.content
        context_parts.append(f"[{i}] {source}: {text}")

    context = "\n\n".join(context_parts)

    prompt = f"""Answer the question using the provided context. Cite sources using [number] format.

Context:
{context}

Question: {query}

Answer (with citations):"""
    return prompt
With Metadata Filtering:
async def rag_with_filters(
    memory: LLMemory,
    owner_id: str,
    query: str,
    category: str
):
    """RAG with metadata filtering."""
    results = await memory.search(
        owner_id=owner_id,
        query_text=query,
        search_type=SearchType.HYBRID,
        metadata_filter={"category": category},  # Filter by category
        rerank=True,
        limit=5
    )
    return build_rag_prompt(query, results)
Step 5: LLM Generation
from typing import List

from openai import AsyncOpenAI

async def generate_rag_response(
    query: str,
    results: List[SearchResult],
    model: str = "gpt-4"
) -> dict:
    """Generate an LLM response with RAG context."""
    # Build prompt
    prompt = build_rag_prompt(query, results)

    # Generate with OpenAI
    client = AsyncOpenAI()
    response = await client.chat.completions.create(
        model=model,
        messages=[
            {
                "role": "system",
                "content": "You are a helpful assistant that answers questions based on provided context."
            },
            {
                "role": "user",
                "content": prompt
            }
        ],
        temperature=0.3,  # Lower temperature for factual answers
        max_tokens=500
    )

    # Extract response
    answer = response.choices[0].message.content

    return {
        "answer": answer,
        "sources": [
            {
                "content": r.content[:200] + "...",
                "score": r.score,
                "metadata": r.metadata
            }
            for r in results
        ],
        "model": model
    }
Complete RAG System Example
from typing import Any, Dict, List, Optional

from llmemory import LLMemory, SearchType, DocumentType
from openai import AsyncOpenAI

class RAGSystem:
    """Complete RAG system with llmemory."""

    def __init__(self, connection_string: str, openai_api_key: str):
        self.memory = LLMemory(
            connection_string=connection_string,
            openai_api_key=openai_api_key
        )
        self.client = AsyncOpenAI(api_key=openai_api_key)
        self.initialized = False

    async def initialize(self):
        """Initialize the RAG system."""
        await self.memory.initialize()
        self.initialized = True

    async def ingest_document(
        self,
        owner_id: str,
        document_name: str,
        content: str,
        document_type: DocumentType = DocumentType.TEXT,
        metadata: Optional[Dict[str, Any]] = None
    ):
        """Add a document to the knowledge base."""
        result = await self.memory.add_document(
            owner_id=owner_id,
            id_at_origin="rag_kb",
            document_name=document_name,
            document_type=document_type,
            content=content,
            metadata=metadata or {},
            generate_embeddings=True
        )
        return {
            "document_id": str(result.document.document_id),
            "chunks_created": result.chunks_created
        }

    async def answer_question(
        self,
        owner_id: str,
        question: str,
        top_k: int = 5,
        model: str = "gpt-4"
    ) -> Dict[str, Any]:
        """Answer a question using RAG."""
        # Retrieve relevant chunks
        results = await self.memory.search(
            owner_id=owner_id,
            query_text=question,
            search_type=SearchType.HYBRID,
            query_expansion=True,
            max_query_variants=3,
            rerank=True,
            rerank_top_k=20,
            rerank_return_k=top_k,
            limit=top_k
        )

        if not results:
            return {
                "answer": "I don't have enough information to answer this question.",
                "sources": [],
                "confidence": "low"
            }

        # Build prompt
        context = "\n\n".join([
            f"[Source: {r.metadata.get('document_name', 'Unknown')}]\n{r.summary or r.content}"
            for r in results
        ])
        prompt = f"""Answer the question using only the provided context. If the answer cannot be found in the context, say so.

Context:
{context}

Question: {question}

Answer:"""

        # Generate response
        response = await self.client.chat.completions.create(
            model=model,
            messages=[
                {"role": "system", "content": "You are a helpful assistant."},
                {"role": "user", "content": prompt}
            ],
            temperature=0.3
        )
        answer = response.choices[0].message.content

        # Determine confidence based on scores
        avg_score = sum(r.score for r in results) / len(results)
        confidence = "high" if avg_score > 0.5 else "medium" if avg_score > 0.3 else "low"

        return {
            "answer": answer,
            "sources": [
                {
                    "document_name": r.metadata.get("document_name"),
                    "content_preview": r.content[:150] + "...",
                    "score": r.score,
                    "similarity": r.similarity,
                    "rerank_score": r.rerank_score  # Populated when rerank=True
                }
                for r in results
            ],
            "confidence": confidence,
            "model": model
        }

    async def close(self):
        """Clean up resources."""
        await self.memory.close()

# Usage
async def main():
    rag = RAGSystem(
        connection_string="postgresql://localhost/mydb",
        openai_api_key="sk-..."
    )
    await rag.initialize()

    # Ingest documents
    await rag.ingest_document(
        owner_id="user-123",
        document_name="product_guide.md",
        content="...",
        document_type=DocumentType.MARKDOWN,
        metadata={"category": "guide"}
    )

    # Answer questions
    result = await rag.answer_question(
        owner_id="user-123",
        question="How do I reset my password?",
        top_k=5
    )
    print(f"Answer: {result['answer']}")
    print(f"Confidence: {result['confidence']}")
    print(f"Sources: {len(result['sources'])}")

    await rag.close()
RAG Best Practices
1. Chunk Size Optimization
from llmemory import ChunkingConfig

# For RAG, use smaller chunks (better precision)
chunking_config = ChunkingConfig(
    chunk_size=300,               # 300 tokens (vs 1000 default)
    chunk_overlap=50,             # 50 tokens overlap
    strategy="hierarchical",      # Create parent/child chunks
    min_chunk_size=100,           # Minimum chunk size
    max_chunk_size=500,           # Maximum chunk size
    max_chunk_depth=3,            # Maximum hierarchy depth
    enable_chunk_summaries=True,  # Generate summaries for prompts
    summary_max_tokens=80         # Short summaries
)

await memory.add_document(
    owner_id="workspace-1",
    id_at_origin="kb",
    document_name="doc.md",
    document_type=DocumentType.MARKDOWN,
    content="...",
    chunking_config=chunking_config
)

# Smaller chunks:
# - More precise retrieval
# - Better for prompts (fit more sources)
# - Less noise in context
#
# Larger chunks:
# - More context per chunk
# - Better for broad questions
# - Fewer chunks needed
2. Use Parent Context for Broader Context
# Retrieve with parent context
results = await memory.search(
    owner_id="workspace-1",
    query_text="API authentication",
    search_type=SearchType.HYBRID,
    include_parent_context=True,  # Include surrounding chunks
    context_window=2,             # ±2 chunks
    limit=5
)

# Build prompt with parent context
for result in results:
    print(f"Main chunk: {result.content}")
    if result.parent_chunks:
        print(f"Context from {len(result.parent_chunks)} parent chunks")
        for parent in result.parent_chunks:
            print(f"  - {parent.content[:100]}...")
3. Reranking for Quality
Always use reranking in RAG for better relevance:
# Without reranking (lower quality)
results = await memory.search(
    owner_id="workspace-1",
    query_text="query",
    rerank=False,
    limit=5
)

# With reranking (higher quality)
results = await memory.search(
    owner_id="workspace-1",
    query_text="query",
    rerank=True,
    rerank_top_k=20,     # Consider top 20 candidates
    rerank_return_k=10,  # Prefer 10 best after reranking
    limit=5              # Final count = max(5, 10) = 10 results
)

# Reranking improves:
# - Relevance of top results
# - Precision for RAG prompts
# - Grounding (better context reduces hallucination)
4. Query Expansion for Recall
# Use multi-query for better recall
results = await memory.search(
    owner_id="workspace-1",
    query_text="reduce latency",
    query_expansion=True,  # Generates variants like "improve response time"
    max_query_variants=3,
    rerank=True,           # Rerank after fusion
    limit=5
)

# Good for:
# - Vague queries
# - Different terminology in docs
# - Comprehensive answers
5. Metadata for Filtering
# Add rich metadata during ingestion
await memory.add_document(
    owner_id="workspace-1",
    id_at_origin="kb",
    document_name="api_v2_docs.md",
    document_type=DocumentType.TECHNICAL_DOC,
    content="...",
    metadata={
        "category": "api",
        "version": "2.0",
        "language": "python",
        "last_updated": "2024-10-01"
    }
)

# Filter during retrieval
results = await memory.search(
    owner_id="workspace-1",
    query_text="authentication",
    metadata_filter={
        "category": "api",
        "version": "2.0"
    },
    limit=5
)
6. Enable and Use Chunk Summaries
Chunk summaries provide concise representations of chunks, making prompts more efficient by reducing token usage while preserving key information.
Enabling Summaries:
from llmemory import LLMemory, ChunkingConfig, LLMemoryConfig

# Option 1: Via ChunkingConfig when adding documents
chunking_config = ChunkingConfig(
    chunk_size=300,
    enable_chunk_summaries=True,  # Enable summary generation
    summary_max_tokens=80         # Control summary length
)

await memory.add_document(
    owner_id="workspace-1",
    id_at_origin="kb",
    document_name="doc.md",
    document_type=DocumentType.MARKDOWN,
    content="...",
    chunking_config=chunking_config
)

# Option 2: Via global LLMemoryConfig
config = LLMemoryConfig()
config.chunking.enable_chunk_summaries = True
config.chunking.summary_max_tokens = 50  # Short summaries

memory = LLMemory(
    connection_string="postgresql://localhost/mydb",
    config=config
)
Using Summaries in Prompts:
def build_prompt_with_summaries(query: str, results: List[SearchResult]):
    """Build a prompt using chunk summaries when available."""
    context_parts = []
    for result in results:
        # SearchResult.summary is populated when enable_chunk_summaries=True
        # Falls back to full content if summaries weren't generated
        text = result.summary or result.content
        context_parts.append(text)

    context = "\n".join(context_parts)
    return f"Context:\n{context}\n\nQuestion: {query}\n\nAnswer:"
Benefits:
- Reduced prompt token usage (summaries are ~30-50% of original size)
- More chunks fit in context window
- Faster LLM processing
- Preserved key information for accurate answers
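To measure the saving on your own data, a small sketch that compares context sizes with and without summaries; tiktoken and the cl100k_base encoding are illustrative stand-ins for your model's tokenizer, and results is assumed to come from a prior memory.search() call:

import tiktoken

def count_tokens(text: str) -> int:
    # cl100k_base is an illustrative choice; use your model's tokenizer
    return len(tiktoken.get_encoding("cl100k_base").encode(text))

full_tokens = sum(count_tokens(r.content) for r in results)
summary_tokens = sum(count_tokens(r.summary or r.content) for r in results)
print(f"Context tokens: {full_tokens} full vs {summary_tokens} with summaries")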
RAG Evaluation
Measuring Retrieval Quality
from typing import Any, Dict, List

from llmemory import LLMemory

async def evaluate_retrieval(
    memory: LLMemory,
    owner_id: str,
    test_queries: List[Dict[str, Any]]
):
    """Evaluate retrieval quality."""
    metrics = {
        "precision_at_5": [],
        "recall": [],
        "mrr": []  # Mean Reciprocal Rank
    }

    for test in test_queries:
        query = test["query"]
        relevant_doc_ids = set(test["relevant_docs"])

        # Retrieve
        results = await memory.search(
            owner_id=owner_id,
            query_text=query,
            rerank=True,
            limit=10
        )

        # Calculate precision@5
        top_5_docs = {str(r.document_id) for r in results[:5]}
        precision = len(top_5_docs & relevant_doc_ids) / 5
        metrics["precision_at_5"].append(precision)

        # Calculate recall
        retrieved_docs = {str(r.document_id) for r in results}
        recall = len(retrieved_docs & relevant_doc_ids) / len(relevant_doc_ids)
        metrics["recall"].append(recall)

        # Calculate MRR
        for rank, result in enumerate(results, 1):
            if str(result.document_id) in relevant_doc_ids:
                metrics["mrr"].append(1.0 / rank)
                break
        else:
            metrics["mrr"].append(0.0)

    return {
        "avg_precision_at_5": sum(metrics["precision_at_5"]) / len(test_queries),
        "avg_recall": sum(metrics["recall"]) / len(test_queries),
        "mean_reciprocal_rank": sum(metrics["mrr"]) / len(test_queries)
    }
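A hedged usage sketch for the function above; the test_queries structure (a query plus the document IDs known to be relevant) matches what evaluate_retrieval expects, and the document IDs shown are placeholders for your own gold set:

# Hypothetical gold set: map each query to the document IDs that should be retrieved
test_queries = [
    {"query": "How do I reset my password?", "relevant_docs": ["<doc-uuid-1>"]},
    {"query": "API rate limits", "relevant_docs": ["<doc-uuid-2>", "<doc-uuid-3>"]},
]

report = await evaluate_retrieval(memory, owner_id="workspace-1", test_queries=test_queries)
print(f"P@5: {report['avg_precision_at_5']:.2f}  "
      f"Recall: {report['avg_recall']:.2f}  "
      f"MRR: {report['mean_reciprocal_rank']:.2f}")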
Related Skills
- basic-usage - Core document and search operations
- hybrid-search - Vector + BM25 hybrid search fundamentals
- multi-query - Query expansion for improved retrieval
- multi-tenant - Multi-tenant isolation patterns for SaaS
Important Notes
RAG Pipeline Optimization: In the complete RAG pipeline (retrieve → rerank → generate), LLM generation dominates end-to-end latency; retrieval and reranking together typically add 100-350ms on top of it:
- Retrieval: 50-150ms
- Reranking: 50-200ms (depending on provider)
- LLM generation: 500-2000ms
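To see where time goes in your own deployment, a minimal instrumentation sketch; it assumes retrieval and reranking both happen inside search() (when rerank=True) and reuses the generate_rag_response function from Step 5:

import time

question = "How do I reset my password?"

t0 = time.perf_counter()
results = await memory.search(
    owner_id="workspace-1",
    query_text=question,
    search_type=SearchType.HYBRID,
    rerank=True,
    limit=5
)
t1 = time.perf_counter()  # retrieval + reranking happen inside search()

response = await generate_rag_response(question, results)
t2 = time.perf_counter()

print(f"retrieve+rerank: {(t1 - t0) * 1000:.0f}ms, generate: {(t2 - t1) * 1000:.0f}ms")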
Chunk Size for RAG: Smaller chunks (200-400 tokens) work better for RAG than larger chunks:
- More precise retrieval
- Less noise in context
- More chunks fit in prompt
- Better for specific questions
Multi-Tenant RAG:
Always use owner_id for data isolation in multi-tenant RAG systems. Never expose one tenant's documents to another.
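A minimal sketch of scoping every call to the caller's tenant; the tenant_id value is whatever identifier your application derives from authentication (the helper and naming are illustrative, not an llmemory API), and generate_rag_response is the function from Step 5:

async def answer_for_tenant(memory: LLMemory, tenant_id: str, question: str):
    """Scope retrieval to a single tenant by always passing its owner_id."""
    results = await memory.search(
        owner_id=tenant_id,  # e.g. "tenant-acme", derived from the authenticated user
        query_text=question,
        search_type=SearchType.HYBRID,
        rerank=True,
        limit=5
    )
    # Never mix results across owner_ids when building the prompt
    return await generate_rag_response(question, results)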
Reranking ROI: Reranking adds 50-200ms but significantly improves answer quality by ensuring the most relevant chunks appear first in the prompt, reducing hallucination and improving accuracy.