langfuse-observability

LLM observability with self-hosted Langfuse 3.x - tracing, evaluation, monitoring, prompt management, and cost tracking

Install Skill

  1. Download the skill
  2. Enable skills in Claude: open claude.ai/settings/capabilities and find the "Skills" section
  3. Upload to Claude: click "Upload skill" and select the downloaded ZIP file

Note: Verify the skill by reading through its instructions before using it.

SKILL.md

name: langfuse-observability
description: LLM observability with self-hosted Langfuse 3.x - tracing, evaluation, monitoring, prompt management, and cost tracking
version: 2.0.0
author: YG Starter Template
tags: langfuse, llm, observability, tracing, evaluation, prompts, 2025

Langfuse Observability

Overview

Langfuse 3.x is an open-source LLM observability platform for tracing, monitoring, evaluation, and prompt management. Unlike the proprietary LangSmith, Langfuse is free to self-host and designed for production LLM applications.

When to use this skill:

  • Setting up LLM observability from scratch
  • Debugging slow or incorrect LLM responses
  • Tracking token usage and costs per agent/user/session
  • Managing prompts in production with versioning
  • Evaluating LLM output quality with automated scoring
  • Migrating from LangSmith to Langfuse

Real-World Use Cases:

  • Multi-Agent Cost Attribution: Track costs by agent type (security, coding, general)
  • Prompt A/B Testing: Compare prompt versions with quality scores
  • Quality Monitoring: Automated scoring for relevance, coherence, factuality
  • Dataset Evaluation: Regression testing with golden datasets
  • Session Analytics: Group traces by user session for journey analysis

Core Features

1. Distributed Tracing

Track LLM calls across your application with automatic parent-child span relationships.

from langfuse.decorators import observe, langfuse_context

@observe()  # Automatic tracing
async def analyze_content(content: str, agent_type: str):
    """Analyze content with automatic Langfuse tracing."""

    # Nested span for retrieval
    @observe(name="retrieval")
    async def retrieve_context():
        chunks = await vector_db.search(content)
        langfuse_context.update_current_observation(
            metadata={"chunks_retrieved": len(chunks)}
        )
        return chunks

    # Nested span for generation
    @observe(name="generation")
    async def generate_analysis(context):
        response = await llm.generate(
            prompt=f"Context: {context}\n\nAnalyze: {content}"
        )
        langfuse_context.update_current_observation(
            input=content[:500],
            output=response[:500],
            model="claude-sonnet-4-20250514",
            usage={
                "input_tokens": response.usage.input_tokens,
                "output_tokens": response.usage.output_tokens
            }
        )
        return response

    context = await retrieve_context()
    return await generate_analysis(context)

Result in Langfuse UI:

analyze_content (2.3s, $0.0195)
├── retrieval (0.1s)
│   └── metadata: {chunks_retrieved: 5}
└── generation (2.2s, $0.0195)
    └── model: claude-sonnet-4-20250514
    └── tokens: 1500 input, 1000 output

2. Token & Cost Tracking

Automatic cost calculation based on model pricing:

from langfuse import Langfuse

langfuse = Langfuse()

# Create trace with cost tracking
trace = langfuse.trace(
    name="content_analysis",
    user_id="user_123",
    session_id="session_abc"
)

# Log generation with automatic cost calculation
generation = trace.generation(
    name="security_audit",
    model="claude-sonnet-4-20250514",
    model_parameters={"temperature": 1.0, "max_tokens": 4096},
    input=[{"role": "user", "content": "Analyze for XSS..."}],
    output="Analysis: Found 3 vulnerabilities...",
    usage={
        "input": 1500,
        "output": 1000,
        "unit": "TOKENS"
    }
)

# Langfuse automatically calculates: $0.0045 + $0.015 = $0.0195

Pricing Database (Auto-Updated): Langfuse maintains a pricing database for all major models. You can also define custom pricing:

# Custom model pricing
langfuse.create_model(
    model_name="claude-sonnet-4-20250514",
    match_pattern="claude-sonnet-4.*",
    unit="TOKENS",
    input_price=0.000003,  # $3/MTok
    output_price=0.000015,  # $15/MTok
    total_price=None  # Calculated from input+output
)

3. Prompt Management

Version control for prompts in production:

# Fetch prompt from Langfuse
from langfuse import Langfuse

langfuse = Langfuse()

# Get latest version of security auditor prompt
prompt = langfuse.get_prompt("security_auditor", label="production")

# Use in LLM call
response = await llm.generate(
    messages=[
        {"role": "system", "content": prompt.compile()},
        {"role": "user", "content": user_input}
    ]
)
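
Prompts stored in Langfuse can contain {{variable}} placeholders that compile() fills in at call time. A brief sketch follows; the template text and the focus_area variable are assumptions for illustration, not part of an actual stored prompt:

# Assumes the stored template includes a {{focus_area}} placeholder, e.g.
#   "You are a security auditor. Focus on {{focus_area}} issues."
prompt = langfuse.get_prompt("security_auditor", label="production")
system_message = prompt.compile(focus_area="XSS and injection flaws")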

Linking Prompts to Generations (Issue #564 Pattern)

CRITICAL: To make the "Number of Observations" counter work in Langfuse Prompts UI, you MUST link the TextPromptClient object to the generation span:

from langfuse import get_client

# Method 1: update_current_generation (preferred in this project)
langfuse = get_client()
prompt = langfuse.get_prompt("security_auditor", label="production")

# Link prompt to current generation span
langfuse.update_current_generation(prompt=prompt)

# Method 2: Pass prompt when starting generation
with langfuse.start_as_current_generation(
    name="security-analysis",
    model="claude-sonnet-4-20250514",
    prompt=prompt  # Links automatically!
) as generation:
    response = await llm.generate(...)
    generation.update(output=response)

This project's pattern (with caching):

# PromptManager returns both content AND TextPromptClient
prompt_content, prompt_client = await prompt_manager.get_prompt_with_langfuse_client(
    name="analysis-agent-security-auditor",
    variables={"skill_instructions": "..."},
    label="production",
)

# Pass prompt_client through agent metadata
if prompt_client:
    agent = agent.with_config(metadata={"langfuse_prompt_client": prompt_client})

# In invoke_agent(), link prompt to generation
if prompt_client:
    langfuse.update_current_generation(prompt=prompt_client)

Note: Cache hits (L1/L2) return None for prompt_client - linkage only happens on L3 Langfuse fetches (~5% of calls). This is acceptable for analytics.
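
For reference, here is a minimal sketch of the caching behaviour described above. The class and method names mirror the snippet, but this is an illustrative stand-in under assumed cache semantics, not the project's actual implementation:

from langfuse import Langfuse

class PromptManager:
    """Illustrative only: compiled prompts are cached, so only cache misses
    reach Langfuse and return a TextPromptClient for linking."""

    def __init__(self, langfuse: Langfuse):
        self._langfuse = langfuse
        self._cache: dict[tuple, str] = {}  # stands in for the L1/L2 layers

    async def get_prompt_with_langfuse_client(
        self, name: str, variables: dict, label: str = "production"
    ):
        key = (name, label, tuple(sorted(variables.items())))
        if key in self._cache:
            return self._cache[key], None  # cache hit: no TextPromptClient
        client = self._langfuse.get_prompt(name, label=label)  # L3 fetch
        compiled = client.compile(**variables)
        self._cache[key] = compiled
        return compiled, client  # only this path can link prompt to generation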

Prompt Versioning in UI:

security_auditor
├── v1 (Jan 15, 2025) - production
│   └── "You are a security auditor. Analyze code for..."
├── v2 (Jan 20, 2025) - staging
│   └── "You are an expert security auditor. Focus on..."
└── v3 (Jan 25, 2025) - draft
    └── "As a cybersecurity expert, thoroughly analyze..."

4. LLM Evaluation (Scores)

Track quality metrics with custom scores:

from langfuse import Langfuse

langfuse = Langfuse()

# Create trace
trace = langfuse.trace(name="content_analysis", id="trace_123")

# After LLM response, score it
trace.score(
    name="relevance",
    value=0.85,  # 0-1 scale
    comment="Response addresses query but lacks depth"
)

trace.score(
    name="factuality",
    value=0.92,
    data_type="NUMERIC"
)

# Use G-Eval for automated scoring
from app.shared.services.g_eval import GEvalScorer

scorer = GEvalScorer()
scores = await scorer.score(
    query=user_query,
    response=llm_response,
    criteria=["relevance", "coherence", "depth"]
)

for criterion, score in scores.items():
    trace.score(name=criterion, value=score)

Scores Dashboard:

  • View score distributions
  • Track quality trends over time
  • Filter traces by score thresholds
  • Compare prompt versions by scores

5. Session Tracking

Group related traces into user sessions:

# Start session
session_id = f"analysis_{analysis_id}"

# All traces with same session_id are grouped
trace1 = langfuse.trace(
    name="url_fetch",
    session_id=session_id
)

trace2 = langfuse.trace(
    name="content_analysis",
    session_id=session_id
)

trace3 = langfuse.trace(
    name="quality_gate",
    session_id=session_id
)

# View in UI: All 3 traces grouped under session
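
The same grouping works with the decorator API from section 1 by setting the session on the current trace; a brief sketch (function name and body are placeholders):

from langfuse.decorators import observe, langfuse_context

@observe()
async def fetch_url(analysis_id: str, url: str) -> str:
    # Attach this trace to the analysis session so it groups with the others
    langfuse_context.update_current_trace(session_id=f"analysis_{analysis_id}")
    return url  # stand-in for the actual fetch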

6. User & Metadata Tracking

Track performance per user or content type:

langfuse.trace(
    name="analysis",
    user_id="user_123",
    metadata={
        "content_type": "article",
        "url": "https://example.com/post",
        "analysis_id": "abc123",
        "agent_count": 8,
        "total_cost_usd": 0.15
    },
    tags=["production", "project", "security"]
)

Analytics:

  • Filter by user, tag, metadata
  • Group costs by content_type
  • Track performance by agent type
  • Identify slow or expensive users
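
The same filters are available programmatically; a brief sketch assuming the v2 SDK's fetch_traces API, e.g. for a nightly report:

from langfuse import Langfuse

langfuse = Langfuse()

# Pull recent production traces for a single user
traces = langfuse.fetch_traces(
    user_id="user_123",
    tags=["production", "security"],
    limit=50,
)

for trace in traces.data:
    print(trace.id, trace.name)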

Project Integration

Setup (Already Complete)

# backend/app/shared/services/langfuse/client.py
from langfuse import Langfuse
from app.core.config import settings

langfuse_client = Langfuse(
    public_key=settings.LANGFUSE_PUBLIC_KEY,
    secret_key=settings.LANGFUSE_SECRET_KEY,
    host=settings.LANGFUSE_HOST  # Self-hosted or cloud
)

Workflow Integration

# backend/app/workflows/content_analysis.py
from langfuse.decorators import observe, langfuse_context

@observe(name="content_analysis_workflow")
async def run_content_analysis(analysis_id: str, content: str):
    """Full workflow with automatic Langfuse tracing."""

    # Set global metadata
    langfuse_context.update_current_trace(
        user_id=f"analysis_{analysis_id}",
        metadata={
            "analysis_id": analysis_id,
            "content_length": len(content)
        }
    )

    # Each agent execution automatically creates nested spans
    results = []
    for agent in agents:
        result = await execute_agent(agent, content)  # @observe decorated
        results.append(result)

    return results

Cost Tracking Per Analysis

# After analysis completes
trace = langfuse.get_trace(trace_id)
total_cost = sum(
    gen.calculated_total_cost or 0
    for gen in trace.observations
    if gen.type == "GENERATION"
)

# Store in database
await analysis_repo.update(
    analysis_id,
    langfuse_trace_id=trace.id,
    total_cost_usd=total_cost
)

Advanced Features

1. CallbackHandler (LangChain Integration)

For LangChain/LangGraph applications:

from langfuse.callback import CallbackHandler

langfuse_handler = CallbackHandler(
    public_key=settings.LANGFUSE_PUBLIC_KEY,
    secret_key=settings.LANGFUSE_SECRET_KEY
)

# Use with LangChain
from langchain_anthropic import ChatAnthropic

llm = ChatAnthropic(
    model="claude-sonnet-4-20250514",
    callbacks=[langfuse_handler]
)

response = llm.invoke("Analyze this code...")  # Auto-traced!
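
The handler is a standard LangChain callback, so it can also be passed per invocation via config, which works for any Runnable, including compiled LangGraph graphs; a brief sketch (the graph line assumes an already-compiled graph object):

response = llm.invoke(
    "Analyze this code...",
    config={"callbacks": [langfuse_handler]},
)

# result = graph.invoke(state, config={"callbacks": [langfuse_handler]})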

2. Datasets for Evaluation

Create test datasets in Langfuse UI and run automated evaluations:

# Fetch dataset
dataset = langfuse.get_dataset("security_audit_test_set")

# Run evaluation
for item in dataset.items:
    # Run LLM
    response = await llm.generate(item.input)

    # Create a trace linked to the dataset item
    trace = langfuse.trace(
        name="evaluation_run",
        metadata={"dataset_item_id": item.id}
    )
    trace.generation(
        input=item.input,
        output=response,
        usage=response.usage
    )

    # Score
    score = await evaluate_response(item.expected_output, response)
    langfuse.score(
        trace_id=trace.id,
        name="accuracy",
        value=score
    )

3. Experimentation (A/B Testing Prompts)

# Test two prompt versions
prompt_v1 = langfuse.get_prompt("security_auditor", version=1)
prompt_v2 = langfuse.get_prompt("security_auditor", version=2)

# Run A/B test
import random

for test_input in test_dataset:
    prompt = random.choice([prompt_v1, prompt_v2])

    response = await llm.generate(
        messages=[
            {"role": "system", "content": prompt.compile()},
            {"role": "user", "content": test_input}
        ]
    )

    # Track which version was used
    langfuse.trace(
        name="ab_test",
        metadata={"prompt_version": prompt.version}
    )

# Compare in Langfuse UI:
# - Filter by prompt_version
# - Compare average scores
# - Analyze cost differences

Monitoring Dashboard Queries

Top 10 Most Expensive Traces (Last 7 Days)

SELECT
    name,
    user_id,
    calculated_total_cost,
    input_tokens,
    output_tokens
FROM traces
WHERE timestamp > NOW() - INTERVAL '7 days'
ORDER BY calculated_total_cost DESC
LIMIT 10;

Average Cost by Agent Type

SELECT
    metadata->>'agent_type' as agent,
    COUNT(*) as traces,
    AVG(calculated_total_cost) as avg_cost,
    SUM(calculated_total_cost) as total_cost
FROM traces
WHERE metadata->>'agent_type' IS NOT NULL
GROUP BY agent
ORDER BY total_cost DESC;

Quality Scores Trend

SELECT
    DATE(timestamp) as date,
    AVG(value) FILTER (WHERE name = 'relevance') as avg_relevance,
    AVG(value) FILTER (WHERE name = 'depth') as avg_depth,
    AVG(value) FILTER (WHERE name = 'factuality') as avg_factuality
FROM scores
WHERE timestamp > NOW() - INTERVAL '30 days'
GROUP BY DATE(timestamp)
ORDER BY date;

Best Practices

  1. Always use the @observe decorator for automatic tracing
  2. Set user_id and session_id for better analytics
  3. Add meaningful metadata (content_type, analysis_id, etc.)
  4. Score all production traces for quality monitoring
  5. Use prompt management instead of hardcoded prompts
  6. Monitor costs daily to catch spikes early
  7. Create datasets for regression testing
  8. Tag production vs staging traces

Migration from LangSmith

See Langfuse documentation at https://langfuse.com/docs for integration details.

Key Differences:

  • Hosting: Langfuse is open-source, free, and self-hostable; LangSmith is proprietary, paid, and primarily cloud-hosted
  • Prompt management: built into Langfuse; LangSmith requires external prompt storage
  • Tracing API: Langfuse uses the @observe decorator; LangSmith uses @traceable (see the sketch below)
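
For the decorator swap itself, a minimal before/after sketch (the function body is a stand-in for an LLM call):

# LangSmith (before)
from langsmith import traceable

@traceable
def summarize(text: str) -> str:
    return text[:100]  # stand-in for an LLM call

# Langfuse (after): same function, different decorator
from langfuse.decorators import observe

@observe()
def summarize(text: str) -> str:
    return text[:100]  # stand-in for an LLM call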