---
name: langfuse-optimization
description: Analyzes writing-ecosystem traces to fix style.yaml, template.yaml, and tools.yaml based on quality issues found in production runs.
allowed-tools: "*"
---

# Writing Ecosystem Config Optimizer

Analyzes Langfuse traces to identify what's wrong with your style.yaml, template.yaml, and tools.yaml files, then tells you exactly how to fix them.

## When to Use This Skill

  • "Analyze traces and fix my config files"
  • "My checks are failing - what's wrong with style.yaml?"
  • "Optimize case 0001 configuration"
  • "Why is the research node selecting wrong tools?"

## Required Environment Variables

  • LANGFUSE_PUBLIC_KEY: Your Langfuse public API key
  • LANGFUSE_SECRET_KEY: Your Langfuse secret API key
  • LANGFUSE_HOST: Langfuse host URL (default: https://cloud.langfuse.com)
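These can be exported in the shell before running the helpers; the values below are placeholders, not working credentials:

```bash
# Placeholder values - substitute your own Langfuse project keys
export LANGFUSE_PUBLIC_KEY="pk-lf-..."
export LANGFUSE_SECRET_KEY="sk-lf-..."
export LANGFUSE_HOST="https://cloud.langfuse.com"  # or your self-hosted URL
```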

## What This Skill Does

**Input**: User request + case ID
**Output**: Specific fixes for style.yaml, template.yaml, tools.yaml

3-Step Process:

  1. Retrieve traces from Langfuse for specified case
  2. Extract problems from trace data (check failures, tool errors, structure issues)
  3. Generate fixes with exact YAML changes to make

## Workflow

### Step 1: Get User Request & Case ID

Ask for:

  • Case ID (e.g., "0001", "0002", "The Prep")
  • Time range (default: last 7 days)
  • Specific focus (optional: "just style checks", "just tools", "everything")

### Step 2: Retrieve Trace Data

#### Option A: Unified Retrieval (Recommended - Simpler)

Use the unified helper to get traces and observations in one command:

```bash
cd /home/runner/workspace/.claude/skills/langfuse-optimization

# Get last 5 traces with observations for a case (using tags - RECOMMENDED)
python3 helpers/retrieve_traces_and_observations.py \
  --limit 5 \
  --tags "case:0001" \
  --output /tmp/langfuse_analysis/bundle.json

# Filter by metadata (e.g., specific case_id)
python3 helpers/retrieve_traces_and_observations.py \
  --limit 3 \
  --metadata case_id=0001 \
  --output /tmp/langfuse_analysis/case_0001_bundle.json

# Get traces only (skip observations for faster retrieval)
python3 helpers/retrieve_traces_and_observations.py \
  --limit 10 \
  --no-observations \
  --output /tmp/langfuse_analysis/traces_only.json

# Save separate files + unified bundle
python3 helpers/retrieve_traces_and_observations.py \
  --limit 5 \
  --output /tmp/langfuse_analysis/bundle.json \
  --traces-output /tmp/langfuse_analysis/traces.json \
  --observations-output /tmp/langfuse_analysis/observations.json

# RECOMMENDED: Strip bloat for 95% size reduction
python3 helpers/retrieve_traces_and_observations.py \
  --tags "case:0001" \
  --limit 1 \
  --filter-essential \
  --output /tmp/langfuse_analysis/filtered_bundle.json
```

Output: Single JSON bundle with:

  • Query parameters (for reproducibility)
  • Traces list
  • Observations grouped by trace_id
  • Trace count and IDs
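A quick way to sanity-check a retrieval is to peek at the bundle; the key names used below (`traces`, `observations`, `trace_ids`) follow the list above, so adjust them if the helper emits different ones:

```bash
# Quick look at what a bundle contains (key names assumed from the list above)
python3 -c "
import json
b = json.load(open('/tmp/langfuse_analysis/bundle.json'))
print('traces retrieved:', len(b.get('traces', [])))
print('trace ids:', b.get('trace_ids'))
print('observation groups:', len(b.get('observations', {})))
"
```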

Size Optimization Flags:

--filter-essential (Config Optimization):

  • Strips: facts_pack (391KB+), validation_report (45KB+), long text fields
  • Replaces with compact summaries (facts count, size, failed checks)
  • Reduction: ~95% (4.2MB → 200KB)
  • Use case: Analyzing style.yaml, template.yaml, tools.yaml

--filter-research-details (Additional Reduction):

  • Strips: structured_citations (34KB → 700B), step_status (8KB → 200B)
  • Replaces with counts, domains, tools used, success/failure stats
  • Reduction: ~70% additional (on top of essential)
  • Use case: When citation URLs and detailed step logs are not needed

--filter-all (Maximum Reduction):

  • Convenience flag: enables both --filter-essential + --filter-research-details
  • Total reduction: ~96% (4.2MB → 30KB per trace)
  • Use case: Large-scale trace collection, config optimization

Comparison:

  • Without filtering: 4.2MB per trace (slow, all raw data)
  • With --filter-essential: 200KB per trace (fast, config analysis)
  • With --filter-all: 30KB per trace (fastest, minimal size)
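To confirm the reduction on your own data, compare the resulting file sizes:

```bash
# Compare bundle sizes produced by the different filtering flags
ls -lh /tmp/langfuse_analysis/*.json
```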

#### Option A.1: Single Trace Retrieval (Fastest for Individual Analysis)

When you know the exact trace ID you want to analyze, use the single trace helper:

```bash
cd /home/runner/workspace/.claude/skills/langfuse-optimization

# Essential filtering only (95% reduction)
python3 helpers/retrieve_single_trace.py 8fda46d7ac626327396d1a7962690807 --filter-essential

# Maximum filtering (96% reduction) - RECOMMENDED for most cases
python3 helpers/retrieve_single_trace.py 8fda46d7ac626327396d1a7962690807 --filter-all

# Essential + Research details (custom combination)
python3 helpers/retrieve_single_trace.py 8fda46d7ac626327396d1a7962690807 \
  --filter-essential --filter-research-details \
  --output /tmp/langfuse_analysis/single_trace.json

# Without filtering (keep all raw data)
python3 helpers/retrieve_single_trace.py abc123 --output /tmp/langfuse_analysis/trace.json
```

Benefits over multi-trace retrieval:

  • 10-20x faster: Only fetches one trace instead of all traces for a case
  • Lower API usage: Fewer API calls, less rate limiting
  • Cleaner workflow: No need for client-side extraction
  • Same structure: Output is identical to retrieve_traces_and_observations.py

Output: Same bundle structure as Option A (compatible with all analysis tools)

When to use:

  • Analyzing a specific trace ID from Langfuse dashboard
  • Deep-diving into one workflow run
  • Following up on a specific error or issue
  • Comparing before/after changes to config
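For the before/after comparison, for example, pull each run into its own file (placeholder trace IDs):

```bash
# Pull one run from before and one from after a config change (placeholder IDs)
python3 helpers/retrieve_single_trace.py <trace_id_before> --filter-all \
  --output /tmp/langfuse_analysis/before.json
python3 helpers/retrieve_single_trace.py <trace_id_after> --filter-all \
  --output /tmp/langfuse_analysis/after.json
```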

#### Option B: Two-Step Retrieval (Advanced - More Control)

For scenarios where you need separate retrieval stages:

```bash
cd /home/runner/workspace/.claude/skills/langfuse-optimization

# Step 1: Get traces for a specific case (using tags)
python3 helpers/retrieve_traces.py \
  --tags "case:0001" \
  --days 7 \
  --limit 10 \
  --output /tmp/langfuse_analysis/traces.json

# Step 2: Get observations for those traces
python3 helpers/retrieve_observations.py \
  --trace-ids-file /tmp/langfuse_analysis/traces.json \
  --output /tmp/langfuse_analysis/observations.json

# Step 2 (with filtering): Strip bloat for 95% size reduction
python3 helpers/retrieve_observations.py \
  --trace-ids-file /tmp/langfuse_analysis/traces.json \
  --filter-essential \
  --output /tmp/langfuse_analysis/filtered_observations.json
```

### Step 2B: Retrieve Annotation Queue Data (Optional)

If you have human annotations/feedback in Langfuse annotation queues:

```bash
cd /home/runner/workspace/.claude/skills/langfuse-optimization

# Get all annotated items from a queue
python3 helpers/retrieve_annotations.py \
  --queue-id <your_queue_id> \
  --output /tmp/langfuse_analysis/annotations.json

# Get only completed annotations (reviewed items)
python3 helpers/retrieve_annotations.py \
  --queue-id <your_queue_id> \
  --status completed \
  --output /tmp/langfuse_analysis/annotations.json

# Limit to recent 100 items
python3 helpers/retrieve_annotations.py \
  --queue-id <your_queue_id> \
  --limit 100
```

What you get:

  • Human comments/notes on traces
  • Manual scores assigned by reviewers
  • Issues flagged during quality review
  • Trace IDs linked to annotations

How to use in analysis:

  • Cross-reference annotation comments with trace data
  • Identify patterns in human-flagged issues
  • Prioritize fixes based on manual feedback frequency
  • Validate if automated checks catch the same issues humans flag
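As a starting point for cross-referencing, the sketch below intersects annotated trace IDs with the retrieved bundle; the `items` and `trace_id` keys are assumptions, so check the structure of annotations.json first:

```bash
# Sketch: which annotated trace IDs also appear in the retrieved bundle?
# ('items' and 'trace_id' keys are assumptions - verify against annotations.json)
python3 -c "
import json
ann = json.load(open('/tmp/langfuse_analysis/annotations.json'))
bundle = json.load(open('/tmp/langfuse_analysis/bundle.json'))
items = ann.get('items', []) if isinstance(ann, dict) else ann
annotated = {i.get('trace_id') for i in items if isinstance(i, dict)}
retrieved = set(bundle.get('trace_ids', []))
print('annotated traces also in bundle:', sorted(t for t in (annotated & retrieved) if t))
"
```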

### Step 2.5: Using Metadata Filters

Filter traces by metadata fields to focus analysis on specific subsets:

```bash
cd /home/runner/workspace/.claude/skills/langfuse-optimization

# Single metadata filter - analyze specific case
python3 helpers/retrieve_traces_and_observations.py \
  --metadata case_id=0001 \
  --limit 10 \
  --output /tmp/langfuse_analysis/case_0001.json

# Multiple filters (AND logic - trace must match ALL)
python3 helpers/retrieve_traces_and_observations.py \
  --metadata case_id=0001 profile_name="Stock Deep Dive" \
  --limit 5 \
  --output /tmp/langfuse_analysis/filtered.json

# Filter by workflow version (dot notation works for nested metadata, if applicable)
python3 helpers/retrieve_traces_and_observations.py \
  --metadata workflow_version=1 \
  --output /tmp/langfuse_analysis/v1_workflows.json
```

How it works:

  • Retrieves all traces from Langfuse within time range
  • Applies client-side filtering by metadata fields
  • Returns only traces matching ALL specified filters
  • Limit applied AFTER filtering (ensures you get the requested number of matching traces)

Common Use Cases:

  • Analyze specific case: --metadata case_id=0001
  • Compare workflow versions: --metadata workflow_version=1 vs --metadata workflow_version=2
  • Profile-specific issues: --metadata profile_name="The Prep"
  • Combine filters: --metadata case_id=0001 workflow_version=2 (both must match)

Tips:

  • Metadata values are case-sensitive strings
  • Use exact matches only (no wildcards/regex)
  • Check available metadata: run without filter first, inspect trace metadata
  • Common fields: case_id, profile_name, workflow_version
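Per the tip above, a quick unfiltered retrieval shows which metadata fields your traces actually carry (assuming the output exposes a `traces` list with a `metadata` dict):

```bash
# Peek at the metadata of the most recent trace before choosing filters
python3 helpers/retrieve_traces_and_observations.py \
  --limit 1 \
  --no-observations \
  --output /tmp/langfuse_analysis/peek.json
python3 -c "
import json
traces = json.load(open('/tmp/langfuse_analysis/peek.json')).get('traces', [])
print(json.dumps(traces[0].get('metadata', {}), indent=2) if traces else 'no traces found')
"
```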

### Step 3: Extract Problems from Traces

Read /tmp/langfuse_analysis/bundle.json (or observations.json if using two-step retrieval) and extract:

#### A. Style Check Failures (for style.yaml)

From edit node observations, find:

  • Which checks failed
  • Failure rates (how often each check fails)
  • Scores vs thresholds
  • Example content that failed

Map to style.yaml issues:

  • Vague rubric: Check description unclear, LLM can't grade consistently
  • Wrong threshold: Check fails too often (>30%) or never fails
  • Missing check: Quality issue exists but no check catches it
  • Wrong weight: Check importance (MINOR/MAJOR/CRITICAL) doesn't match impact
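A minimal sketch of the failure-rate tally described above. The observation field names (`name`, `output`, `checks`, `passed`) are assumptions about how the edit node logs its checks - adapt them to the actual bundle structure:

```bash
# Illustrative failure-rate tally for edit-node checks
# (observation field names are assumptions - verify against your bundle)
python3 - <<'EOF'
import json, collections
bundle = json.load(open('/tmp/langfuse_analysis/bundle.json'))
fails, totals = collections.Counter(), collections.Counter()
for obs_list in bundle.get('observations', {}).values():
    for obs in obs_list:
        if 'edit' not in str(obs.get('name', '')).lower():
            continue  # only edit-node observations
        output = obs.get('output')
        if not isinstance(output, dict):
            continue
        for check in output.get('checks', []):
            name = check.get('name', 'unknown')
            totals[name] += 1
            if not check.get('passed'):
                fails[name] += 1
for name, total in totals.items():
    print(f"{name}: {fails[name]}/{total} failed ({fails[name] / total:.0%})")
EOF
```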

#### B. Template Problems (for template.yaml)

From write node observations, find:

  • Missing required sections
  • Word count violations
  • Structure mismatches (bullets vs narrative)

Map to template.yaml issues:

  • Unclear section descriptions
  • Unrealistic word limits
  • Missing section definitions

#### C. Tool Selection Issues (for tools.yaml)

From research node observations, find:

  • Which tools were selected
  • Tool failures (API errors, timeouts)
  • Wrong tool for topic (should have used X but used Y)
  • Loop expansion failures (for_each errors)

Map to tools.yaml issues:

  • Tool not available in pattern
  • Wrong research pattern selected
  • Loop directive path incorrect
  • Missing fallback configuration
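To surface the tool failures mentioned above without reading every observation, a rough error scan can help; `level` and `status_message` are assumed field names, so verify them against your bundle:

```bash
# Rough scan for errored research-node observations (field names assumed)
python3 -c "
import json
bundle = json.load(open('/tmp/langfuse_analysis/bundle.json'))
for obs_list in bundle.get('observations', {}).values():
    for obs in obs_list:
        name = str(obs.get('name', ''))
        if 'research' in name.lower() and obs.get('level') == 'ERROR':
            print(name, '-', str(obs.get('status_message', ''))[:120])
"
```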

### Step 4: Generate Config Fixes

For each problem, create a recommendation:

````markdown
## Fix #N: [Problem description]

**File**: `writing_ecosystem/config/cases/XXXX/[style|template|tools].yaml`

**Problem**:
- [Specific issue found in traces]
- [Evidence: X failures in Y traces]

**Current Config**:
```yaml
[Show current YAML]
```

**Fixed Config**:
```yaml
[Show corrected YAML with inline comments explaining changes]
```

**Why this fixes it**:
- [Explanation of root cause]
- [Expected improvement]
````

### Step 5: Present Simple Report

````markdown
# Config Optimization Report - Case XXXX

**Traces Analyzed**: X traces from [date range]

---

## Problems Found

### style.yaml Issues
1. ❌ `tone_consistency` check failing 30% (vague rubric)
2. ❌ `ttr_constraint` threshold too strict (16% failures)
3. ⚠️ `formality` check never fails (threshold too loose)

### template.yaml Issues
1. ❌ "Context" section missing description
2. ❌ Word limit conflict: max 100 words but needs 5 bullets

### tools.yaml Issues
1. ❌ Research pattern missing `finnhub` for financial topics
2. ❌ Loop directive path wrong: `user.portfolio.symbols` (should be `user.portfolio.summary.symbols`)

---

## Recommended Fixes

### Fix #1: Improve tone_consistency Rubric (style.yaml)

**Problem**: Failing 30% of traces - rubric too vague

**Current**:
```yaml
signatures:
  tone_consistency:
    rubric: "Assess whether tone is consistent. Score 1-10."
    threshold: 7.0
```

**Fixed**:
```yaml
signatures:
  tone_consistency:
    rubric: |
      Check tone consistency across:
      1. FORMALITY: Professional terms only (not "pretty big", "kinda")
      2. OBJECTIVITY: Neutral facts (not "shocked markets")
      3. EXPERTISE: Assumes financial literacy

      Score 9-10: Perfect consistency
      Score 7-8: 1-2 minor lapses
      Score 5-6: Noticeable shifts
      Score <5: Multiple violations
    threshold: 7.0
```

**Why**: Specific dimensions + examples → LLM can grade consistently

### Fix #2: Lower TTR Threshold (style.yaml)

**Problem**: Failing 16% - too strict for financial jargon

**Current**:
```yaml
constraints:
  ttr_constraint:
    threshold: 0.55
```

**Fixed**:
```yaml
constraints:
  ttr_constraint:
    threshold: 0.50  # Financial terms naturally repeat
```

**Why**: Domain terminology (Fed, QE, yield curve) lowers lexical diversity

### Fix #3: Add Finnhub to Research Pattern (tools.yaml)

**Problem**: Financial topics not getting market data

**Current**:
```yaml
research_patterns:
  default: general_research
  patterns:
    general_research:
      steps:
        - tool: perplexity
```

**Fixed**:
```yaml
research_patterns:
  default: financial_research  # Changed default for case 0001
  patterns:
    financial_research:
      steps:
        - tool: perplexity
          save_as: news
        - tool: finnhub  # Added for market data
          input:
            endpoint: company_news
            symbol: "{{topic}}"  # Extract symbol from topic
          save_as: market_data
```

**Why**: Financial topics need both news (perplexity) + data (finnhub)

---

## Implementation

1. Backup configs:
```bash
cd writing_ecosystem/config/cases/0001
cp style.yaml style.yaml.backup
cp template.yaml template.yaml.backup
cp tools.yaml tools.yaml.backup
```

2. Apply fixes:
- Open each file in an editor
- Apply changes from recommendations above
- Save files

3. Test:
```bash
python run_workflow.py --case 0001 --topic "Test topic"
# Check Langfuse trace for improvements
```

4. Monitor:
- Run 20-30 workflows
- Re-run this analysis
- Compare before/after failure rates

---

## Expected Results

- tone_consistency failures: 30% → ~15%
- ttr_constraint failures: 16% → ~8%
- Research quality: +20% (adding finnhub)
- Overall pre-flight score: 7.8 → 8.2

---

Ready to implement? Let me know which fixes to apply first, or if you want to see more detail on any issue.
````


## Analysis Patterns

### For style.yaml Issues

**Look for**:
1. **High failure rate** (>30%) → Vague rubric or wrong threshold
2. **Zero failures** → Threshold too loose or check not working
3. **Inconsistent scores** → Rubric needs examples and clear criteria
4. **Low edit fix rate** (<50%) → Check unclear about what to fix

**Common fixes**:
- Add specific dimensions to rubrics
- Provide good/bad examples
- Adjust thresholds based on domain (finance vs tech vs general)
- Add deterministic pre-checks for obvious violations

### For template.yaml Issues

**Look for**:
1. **Missing sections** in write node output
2. **Word count violations** (consistent over/under)
3. **Structure mismatches** (bullets vs narrative)

**Common fixes**:
- Add clear section descriptions
- Adjust word limits to realistic values
- Clarify format requirements (when to use bullets vs prose)

### For tools.yaml Issues

**Look for**:
1. **Wrong tool selected** for topic type
2. **Missing tools** for domain (finance needs finnhub)
3. **Loop expansion failures** (path errors in `for_each`)
4. **Tool errors** (API failures, timeouts)

**Common fixes**:
- Add domain-specific tools to patterns
- Fix loop directive paths
- Add fallback patterns
- Update default pattern for case

## Key Principles

### 1. Evidence-Based
Every recommendation must show:
- How many traces failed
- Example content that failed
- Why current config caused the failure

### 2. Specific
No generic advice like "improve rubric" - show EXACT YAML changes with inline comments

### 3. Prioritized
Focus on:
- High-frequency issues first (affects >30% of traces)
- Quick wins (threshold adjustments)
- High-impact changes (missing tools for domain)

### 4. Actionable
Every fix includes:
- Exact file path
- Before/after YAML
- Expected improvement
- How to test

## Troubleshooting

**"No traces found"**:
- Verify case ID is correct
- Check trace naming: `writing-workflow-0001` vs `writing-workflow-001`
- Try broader: `--name "writing-workflow"` to see all cases
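For example, a broad retrieval to see which workflow names and cases actually exist:

```bash
# Broad query to list recent traces across all cases before narrowing down
python3 helpers/retrieve_traces.py \
  --name "writing-workflow" \
  --days 7 \
  --limit 20 \
  --output /tmp/langfuse_analysis/all_workflows.json
```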

**"No check failures in traces"**:
- Workflow may be in fallback mode (no LLM)
- Edit node may have been skipped (pre-flight score >8.5)
- Verify edit node ran in observations

**"Can't identify issue"**:
- Read the actual style.yaml/template.yaml/tools.yaml files
- Compare trace output to config requirements
- Look for mismatches

**"Metadata filter returning no traces"**:
- Verify metadata fields exist in your traces (check raw trace JSON)
- Metadata values are case-sensitive strings
- Use exact matches only (no wildcards/regex)
- Try without metadata filter first to see available metadata fields
- Common fields: `case_id`, `profile_name`, `workflow_version`

## Success Criteria

Good recommendations should:
1. ✅ Show exact YAML before/after
2. ✅ Explain WHY issue occurred (root cause)
3. ✅ Quantify impact (X% failure rate → Y% expected)
4. ✅ Be implementable in <5 min per fix
5. ✅ Focus on top 3-5 issues (not 50 minor ones)

---

**Remember**: This skill is about **fixing config files**, not analyzing architecture. Keep it simple:
1. What's broken in the YAML?
2. Here's the fix
3. Here's why it works