| name | langfuse-observability |
| description | LLM observability with self-hosted Langfuse 3.x - tracing, evaluation, monitoring, prompt management, and cost tracking |
| version | 2.0.0 |
| author | YG Starter Template |
| tags | langfuse, llm, observability, tracing, evaluation, prompts, 2025 |
Langfuse Observability
Overview
Langfuse 3.x is an open-source LLM observability platform for tracing, monitoring, evaluation, and prompt management. Unlike the proprietary LangSmith, Langfuse can be self-hosted for free and is designed for production LLM applications.
When to use this skill:
- Setting up LLM observability from scratch
- Debugging slow or incorrect LLM responses
- Tracking token usage and costs per agent/user/session
- Managing prompts in production with versioning
- Evaluating LLM output quality with automated scoring
- Migrating from LangSmith to Langfuse
Real-World Use Cases:
- Multi-Agent Cost Attribution: Track costs by agent type (security, coding, general)
- Prompt A/B Testing: Compare prompt versions with quality scores
- Quality Monitoring: Automated scoring for relevance, coherence, factuality
- Dataset Evaluation: Regression testing with golden datasets
- Session Analytics: Group traces by user session for journey analysis
Core Features
1. Distributed Tracing
Track LLM calls across your application with automatic parent-child span relationships.
from langfuse.decorators import observe, langfuse_context

@observe()  # Automatic tracing
async def analyze_content(content: str, agent_type: str):
    """Analyze content with automatic Langfuse tracing."""

    # Nested span for retrieval
    @observe(name="retrieval")
    async def retrieve_context():
        chunks = await vector_db.search(content)
        langfuse_context.update_current_observation(
            metadata={"chunks_retrieved": len(chunks)}
        )
        return chunks

    # Nested span for generation
    @observe(name="generation")
    async def generate_analysis(context):
        response = await llm.generate(
            prompt=f"Context: {context}\n\nAnalyze: {content}"
        )
        langfuse_context.update_current_observation(
            input=content[:500],
            output=response[:500],
            model="claude-sonnet-4-20250514",
            usage={
                "input_tokens": response.usage.input_tokens,
                "output_tokens": response.usage.output_tokens
            }
        )
        return response

    context = await retrieve_context()
    return await generate_analysis(context)
Result in Langfuse UI:
analyze_content (2.3s, $0.045)
├── retrieval (0.1s)
│   └── metadata: {chunks_retrieved: 5}
└── generation (2.2s, $0.045)
    ├── model: claude-sonnet-4-20250514
    └── tokens: 1500 input, 1000 output
2. Token & Cost Tracking
Automatic cost calculation based on model pricing:
from langfuse import Langfuse

langfuse = Langfuse()

# Create trace with cost tracking
trace = langfuse.trace(
    name="content_analysis",
    user_id="user_123",
    session_id="session_abc"
)

# Log generation with automatic cost calculation
generation = trace.generation(
    name="security_audit",
    model="claude-sonnet-4-20250514",
    model_parameters={"temperature": 1.0, "max_tokens": 4096},
    input=[{"role": "user", "content": "Analyze for XSS..."}],
    output="Analysis: Found 3 vulnerabilities...",
    usage={
        "input": 1500,
        "output": 1000,
        "unit": "TOKENS"
    }
)
# Langfuse automatically calculates: $0.0045 + $0.015 = $0.0195
Pricing Database (Auto-Updated): Langfuse maintains a pricing database for all major models. You can also define custom pricing:
# Custom model pricing
langfuse.create_model(
    model_name="claude-sonnet-4-20250514",
    match_pattern="claude-sonnet-4.*",
    unit="TOKENS",
    input_price=0.000003,   # $3/MTok
    output_price=0.000015,  # $15/MTok
    total_price=None        # Calculated from input + output
)
3. Prompt Management
Version control for prompts in production:
# Fetch prompt from Langfuse
from langfuse import Langfuse

langfuse = Langfuse()

# Get the production-labeled version of the security auditor prompt
prompt = langfuse.get_prompt("security_auditor", label="production")

# Use in LLM call
response = await llm.generate(
    messages=[
        {"role": "system", "content": prompt.compile()},
        {"role": "user", "content": user_input}
    ]
)
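Text prompts support {{variable}} placeholders, and prompt.compile() accepts keyword arguments to fill them. A short sketch (the placeholder names below are illustrative):

# Assumes the stored prompt text contains {{language}} and {{focus}} placeholders
prompt = langfuse.get_prompt("security_auditor", label="production")

system_message = prompt.compile(
    language="Python",
    focus="XSS and SQL injection"
)
# compile() returns the prompt text with the placeholders substituted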
Linking Prompts to Generations (Issue #564 Pattern)
CRITICAL: To make the "Number of Observations" counter work in the Langfuse Prompts UI, you MUST link the TextPromptClient object to the generation span:
from langfuse import get_client

# Method 1: update_current_generation (preferred in this project)
langfuse = get_client()
prompt = langfuse.get_prompt("security_auditor", label="production")

# Link prompt to the current generation span
langfuse.update_current_generation(prompt=prompt)

# Method 2: Pass prompt when starting generation
with langfuse.start_as_current_generation(
    name="security-analysis",
    model="claude-sonnet-4-20250514",
    prompt=prompt  # Links automatically!
) as generation:
    response = await llm.generate(...)
    generation.update(output=response)
Project pattern (with caching):
# PromptManager returns both content AND the TextPromptClient
prompt_content, prompt_client = await prompt_manager.get_prompt_with_langfuse_client(
    name="analysis-agent-security-auditor",
    variables={"skill_instructions": "..."},
    label="production",
)

# Pass prompt_client through agent metadata
if prompt_client:
    agent = agent.with_config(metadata={"langfuse_prompt_client": prompt_client})

# In invoke_agent(), link the prompt to the generation
if prompt_client:
    langfuse.update_current_generation(prompt=prompt_client)
Note: Cache hits (L1/L2) return None for prompt_client; linkage only happens on L3 Langfuse fetches (~5% of calls), which is acceptable for analytics.
Prompt Versioning in UI:
security_auditor
├── v1 (Jan 15, 2025) - production
│   └── "You are a security auditor. Analyze code for..."
├── v2 (Jan 20, 2025) - staging
│   └── "You are an expert security auditor. Focus on..."
└── v3 (Jan 25, 2025) - draft
    └── "As a cybersecurity expert, thoroughly analyze..."
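New versions can also be registered from code instead of the UI. A minimal sketch, assuming the SDK's create_prompt helper (the prompt body is illustrative):

# Register a new prompt version and label it for staging
langfuse.create_prompt(
    name="security_auditor",
    prompt="You are an expert security auditor. Focus on {{focus}}...",
    labels=["staging"],  # promote to "production" once validated
    type="text"
)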
4. LLM Evaluation (Scores)
Track quality metrics with custom scores:
from langfuse import Langfuse

langfuse = Langfuse()

# Create trace
trace = langfuse.trace(name="content_analysis", id="trace_123")

# After LLM response, score it
trace.score(
    name="relevance",
    value=0.85,  # 0-1 scale
    comment="Response addresses query but lacks depth"
)
trace.score(
    name="factuality",
    value=0.92,
    data_type="NUMERIC"
)

# Use G-Eval for automated scoring
from app.shared.services.g_eval import GEvalScorer

scorer = GEvalScorer()
scores = await scorer.score(
    query=user_query,
    response=llm_response,
    criteria=["relevance", "coherence", "depth"]
)
for criterion, score in scores.items():
    trace.score(name=criterion, value=score)
Scores Dashboard:
- View score distributions
- Track quality trends over time
- Filter traces by score thresholds
- Compare prompt versions by scores
5. Session Tracking
Group related traces into user sessions:
# Start session
session_id = f"analysis_{analysis_id}"

# All traces with the same session_id are grouped
trace1 = langfuse.trace(
    name="url_fetch",
    session_id=session_id
)
trace2 = langfuse.trace(
    name="content_analysis",
    session_id=session_id
)
trace3 = langfuse.trace(
    name="quality_gate",
    session_id=session_id
)
# View in UI: all 3 traces grouped under the session
6. User & Metadata Tracking
Track performance per user or content type:
langfuse.trace(
    name="analysis",
    user_id="user_123",
    metadata={
        "content_type": "article",
        "url": "https://example.com/post",
        "analysis_id": "abc123",
        "agent_count": 8,
        "total_cost_usd": 0.15
    },
    tags=["production", "project", "security"]
)
Analytics:
- Filter by user, tag, metadata
- Group costs by content_type
- Track performance by agent type
- Identify slow or expensive users (see the sketch below)
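A hedged sketch of pulling traces for this kind of analysis through the Python SDK, assuming the fetch_traces helper (exact parameter and field names vary between SDK versions):

from langfuse import Langfuse

langfuse = Langfuse()

# Fetch recent production traces for one user, then aggregate cost client-side
traces = langfuse.fetch_traces(user_id="user_123", tags=["production"], limit=50)

# total_cost is an assumed field name on the returned trace objects
total_cost = sum(getattr(t, "total_cost", 0) or 0 for t in traces.data)
print(f"user_123: {len(traces.data)} traces, ${total_cost:.4f} total")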
Project Integration
Setup (Already Complete)
# backend/app/shared/services/langfuse/client.py
from langfuse import Langfuse
from app.core.config import settings

langfuse_client = Langfuse(
    public_key=settings.LANGFUSE_PUBLIC_KEY,
    secret_key=settings.LANGFUSE_SECRET_KEY,
    host=settings.LANGFUSE_HOST  # Self-hosted or cloud
)
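Events are sent in the background, so flush the client on shutdown (or at the end of short-lived scripts) to avoid dropping queued traces:

# Call from your framework's shutdown hook
langfuse_client.flush()     # block until queued events are delivered
langfuse_client.shutdown()  # flush and stop background workers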
Workflow Integration
# backend/app/workflows/content_analysis.py
from langfuse.decorators import observe, langfuse_context

@observe(name="content_analysis_workflow")
async def run_content_analysis(analysis_id: str, content: str):
    """Full workflow with automatic Langfuse tracing."""
    # Set trace-level metadata
    langfuse_context.update_current_trace(
        user_id=f"analysis_{analysis_id}",
        metadata={
            "analysis_id": analysis_id,
            "content_length": len(content)
        }
    )

    # Each agent execution automatically creates nested spans
    results = []
    for agent in agents:
        result = await execute_agent(agent, content)  # @observe decorated
        results.append(result)

    return results
Cost Tracking Per Analysis
# After analysis completes
trace = langfuse.get_trace(trace_id)

total_cost = sum(
    gen.calculated_total_cost or 0
    for gen in trace.observations
    if gen.type == "GENERATION"
)

# Store in database
await analysis_repo.update(
    analysis_id,
    langfuse_trace_id=trace.id,
    total_cost_usd=total_cost
)
Advanced Features
1. CallbackHandler (LangChain Integration)
For LangChain/LangGraph applications:
from langfuse.callback import CallbackHandler

langfuse_handler = CallbackHandler(
    public_key=settings.LANGFUSE_PUBLIC_KEY,
    secret_key=settings.LANGFUSE_SECRET_KEY
)

# Use with LangChain
from langchain_anthropic import ChatAnthropic

llm = ChatAnthropic(
    model="claude-sonnet-4-20250514",
    callbacks=[langfuse_handler]
)
response = llm.invoke("Analyze this code...")  # Auto-traced!
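The handler can also be attached per invocation via LangChain's standard config argument, which is convenient when the model object is shared across chains or a LangGraph graph:

from langchain_anthropic import ChatAnthropic

llm = ChatAnthropic(model="claude-sonnet-4-20250514")

# Attach the Langfuse handler to a single call instead of the constructor
response = llm.invoke(
    "Analyze this code...",
    config={"callbacks": [langfuse_handler]}
)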
2. Datasets for Evaluation
Create test datasets in Langfuse UI and run automated evaluations:
# Fetch dataset
dataset = langfuse.get_dataset("security_audit_test_set")

# Run evaluation
for item in dataset.items:
    # Run LLM
    response = await llm.generate(item.input)

    # Create a trace and generation linked to the dataset item
    trace = langfuse.trace(
        name="evaluation_run",
        metadata={"dataset_item_id": item.id}
    )
    trace.generation(
        input=item.input,
        output=response,
        usage=response.usage
    )

    # Score
    score = await evaluate_response(item.expected_output, response)
    langfuse.score(
        trace_id=trace.id,
        name="accuracy",
        value=score
    )
3. Experimentation (A/B Testing Prompts)
# Test two prompt versions
prompt_v1 = langfuse.get_prompt("security_auditor", version=1)
prompt_v2 = langfuse.get_prompt("security_auditor", version=2)

# Run A/B test
import random

for test_input in test_dataset:
    prompt = random.choice([prompt_v1, prompt_v2])
    response = await llm.generate(
        messages=[
            {"role": "system", "content": prompt.compile()},
            {"role": "user", "content": test_input}
        ]
    )

    # Track which version was used
    langfuse.trace(
        name="ab_test",
        metadata={"prompt_version": prompt.version}
    )

# Compare in Langfuse UI:
# - Filter by prompt_version
# - Compare average scores
# - Analyze cost differences
Monitoring Dashboard Queries
Top 10 Most Expensive Traces (Last 7 Days)
SELECT
name,
user_id,
calculated_total_cost,
input_tokens,
output_tokens
FROM traces
WHERE timestamp > NOW() - INTERVAL '7 days'
ORDER BY calculated_total_cost DESC
LIMIT 10;
Average Cost by Agent Type
SELECT
metadata->>'agent_type' as agent,
COUNT(*) as traces,
AVG(calculated_total_cost) as avg_cost,
SUM(calculated_total_cost) as total_cost
FROM traces
WHERE metadata->>'agent_type' IS NOT NULL
GROUP BY agent
ORDER BY total_cost DESC;
Quality Scores Trend
SELECT
DATE(timestamp) as date,
AVG(value) FILTER (WHERE name = 'relevance') as avg_relevance,
AVG(value) FILTER (WHERE name = 'depth') as avg_depth,
AVG(value) FILTER (WHERE name = 'factuality') as avg_factuality
FROM scores
WHERE timestamp > NOW() - INTERVAL '30 days'
GROUP BY DATE(timestamp)
ORDER BY date;
Best Practices
- Always use the @observe decorator for automatic tracing
- Set user_id and session_id for better analytics
- Add meaningful metadata (content_type, analysis_id, etc.)
- Score all production traces for quality monitoring
- Use prompt management instead of hardcoded prompts
- Monitor costs daily to catch spikes early
- Create datasets for regression testing
- Tag production vs staging traces (see the sketch below)
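A minimal sketch that applies several of these practices together (settings.ENVIRONMENT is an assumed config value; adjust to your own settings object):

from langfuse.decorators import observe, langfuse_context
from app.core.config import settings  # assumed to expose ENVIRONMENT

@observe(name="content_analysis")
async def analyze(content: str, user_id: str, session_id: str):
    # Tag the trace with environment, user, and session in one call
    langfuse_context.update_current_trace(
        user_id=user_id,
        session_id=session_id,
        tags=[settings.ENVIRONMENT],  # "production" or "staging"
        metadata={"content_length": len(content)},
    )
    ...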
References
- Langfuse Docs
- Python SDK
- Decorators Guide
- Prompt Management
- Self-Hosting
- [this project Integration](https://github.com/yonatan-gross/this project#langfuse-observability)
Migration from LangSmith
See the Langfuse documentation at https://langfuse.com/docs for integration details.
Key Differences:
- Hosting: Langfuse is self-hosted, open-source, and free; LangSmith is cloud-only, proprietary, and paid
- Prompt management: built into Langfuse; with LangSmith, external prompt storage is needed
- Tracing decorator: Langfuse uses @observe, LangSmith uses @traceable (see the migration sketch below)
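A minimal before/after sketch of the decorator swap (llm_call is a placeholder for your existing model call; API keys are configured via environment variables in both tools):

# Before: LangSmith
from langsmith import traceable

@traceable(name="analyze_content")
def analyze_content(content: str) -> str:
    return llm_call(content)

# After: Langfuse
from langfuse.decorators import observe

@observe(name="analyze_content")
def analyze_content(content: str) -> str:
    return llm_call(content)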