Claude Code Plugins

Community-maintained marketplace


Multi-provider LLM integration with adapter pattern, citation tracking, hallucination detection, and cost optimization

Install Skill

1. Download skill
2. Enable skills in Claude: open claude.ai/settings/capabilities and find the "Skills" section
3. Upload to Claude: click "Upload skill" and select the downloaded ZIP file

Note: Please verify the skill by reading through its instructions before using it.

SKILL.md

name: llm-integration
description: Multi-provider LLM integration with adapter pattern, citation tracking, hallucination detection, and cost optimization

LLM Integration Skill

What This Skill Provides

The LLM Integration skill provides production-ready components for integrating Large Language Models (LLMs) into regulatory AI systems, specifically optimized for StrategenAI's F7: Executive Summary Generation and F9: Validation Pipeline requirements.

Core Capabilities

  1. Adapter Pattern Implementation: Flexible LLM integration supporting multiple providers (Claude API, Ollama, self-hosted models)
  2. Regulatory Prompt Engineering: Templates optimized for professional regulatory consultant tone with structured outputs
  3. Citation-First Generation: 100% source attribution with confidence scoring
  4. Hallucination Detection: Multi-stage validation ensuring zero fabricated facts
  5. Token Optimization: Precise control over output length (80-120 words for executive summaries)
  6. Cost Tracking: Token counting and rate limit management
  7. Error Handling: Comprehensive retry logic, timeout handling, and graceful degradation

PRD Feature Alignment

F7: Executive Summary Generation (Priority: P0)

  • Tone Control: Professional regulatory consultant voice
  • Length Precision: 80-120 words (3-4 sentences) with token counting
  • Structured Output: JSON format with summary + citations
  • Content Coverage: Product overview, designation status, eligibility, competitive landscape
  • 100% Citation Requirement: Every claim linked to source URL

F9: Validation Pipeline (Partial)

  • Citation Verification: 100% coverage validation
  • Hallucination Detection: Zero-tolerance for fabricated facts
  • Confidence Scoring: Multi-level (claim, section, overall)
  • Structured Output Format: Parseable JSON with validation metadata

When to Use This Skill

Use This Skill When You Need To:

  • Generate AI-powered executive summaries for regulatory documents
  • Integrate Claude API or other LLMs into Python applications
  • Ensure 99.9%+ accuracy in LLM outputs with citation requirements
  • Implement adapter pattern for multi-provider LLM support
  • Build validation pipelines for AI-generated content
  • Track LLM costs and optimize token usage
  • Handle rate limits and implement retry logic
  • Parse structured LLM outputs (JSON) with error handling

Don't Use This Skill For:

  • Simple text generation without citation requirements
  • Non-regulatory applications where accuracy < 99% is acceptable
  • Direct LLM API calls without abstraction layer
  • Applications that don't require validation pipelines

Quick Start Workflows

Workflow 1: Claude API Setup (5 minutes)

# Step 1: Install dependencies
pip install anthropic pydantic python-dotenv

# Step 2: Configure API key
cd ~/my-skills/llm-integration/scripts
python3 setup_claude_api.py --check

# Step 3: Test connection
python3 setup_claude_api.py --test

# Expected output: ✓ Claude API configured successfully
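For reference, a minimal sketch of what the connection test boils down to, assuming ANTHROPIC_API_KEY lives in a local .env file (the bundled setup_claude_api.py may differ in detail):

import os

import anthropic
from dotenv import load_dotenv

load_dotenv()  # expects ANTHROPIC_API_KEY in a local .env file


def check_claude_connection() -> bool:
    """Send a tiny request to confirm the API key and network path work."""
    client = anthropic.Anthropic(api_key=os.getenv("ANTHROPIC_API_KEY"))
    response = client.messages.create(
        model="claude-sonnet-4-5",  # assumed model ID; adjust to the model available to you
        max_tokens=10,
        messages=[{"role": "user", "content": "ping"}],
    )
    return bool(response.content)


if __name__ == "__main__":
    ok = check_claude_connection()
    print("✓ Claude API configured successfully" if ok else "✗ Connection failed")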

Workflow 2: Generate Executive Summary (2 minutes)

# Step 1: Prepare input data (JSON file with retrieved facts + citations)
cat > input_data.json << EOF
{
  "product_name": "HB001",
  "indication": "Haemophilia B",
  "facts": [
    {
      "claim": "HB001 is a recombinant Factor IX protein",
      "source_url": "https://example.com/source1",
      "confidence": 0.99
    },
    {
      "claim": "Five similar products received orphan designation",
      "source_url": "https://fda.gov/source2",
      "confidence": 0.95
    }
  ]
}
EOF

# Step 2: Generate summary
python3 generate_summary.py --input input_data.json --output summary.json

# Step 3: Review output
cat summary.json

Expected Output:

{
  "summary": "HB001 is a recombinant Factor IX protein for Haemophilia B with strong orphan designation potential. Five similar products received orphan designation, demonstrating precedent for this indication.",
  "word_count": 28,
  "citations": [
    {"claim": "HB001 is a recombinant Factor IX protein", "source_url": "https://example.com/source1", "confidence": 0.99},
    {"claim": "Five similar products received orphan designation", "source_url": "https://fda.gov/source2", "confidence": 0.95}
  ],
  "overall_confidence": 0.97,
  "validation_passed": true
}
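The output above maps naturally onto a small Pydantic schema. A minimal sketch, with field names taken from the example output rather than from the bundled templates:

from pydantic import BaseModel, Field


class Citation(BaseModel):
    claim: str
    source_url: str
    confidence: float = Field(ge=0.0, le=1.0)


class SummaryResponse(BaseModel):
    summary: str
    word_count: int
    citations: list[Citation]
    overall_confidence: float = Field(ge=0.0, le=1.0)
    validation_passed: bool

Parsing the raw model response through SummaryResponse (see Pitfall 4 below) rejects malformed or incomplete JSON before it reaches downstream validation.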

Workflow 3: Validate LLM Response (1 minute)

# Validate generated summary against requirements
python3 validate_llm_response.py --input summary.json --strict

# Expected checks:
# ✓ Word count: 80-120 words
# ✓ Citation coverage: 100%
# ✓ Confidence score: ≥0.90
# ✓ Professional tone detected
# ✓ No hallucinations detected
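A minimal sketch of the strict-mode checks, reusing the SummaryResponse model sketched above; the shipped validate_llm_response.py may implement these differently, and the tone and hallucination checks additionally require the source facts (see the citation_validator usage under Pitfall 5):

def validate_summary(result: SummaryResponse, min_words: int = 80,
                     max_words: int = 120, min_confidence: float = 0.90) -> list[str]:
    """Return human-readable validation errors; an empty list means the summary passed."""
    errors = []
    words = len(result.summary.split())
    if not min_words <= words <= max_words:
        errors.append(f"Word count {words} outside {min_words}-{max_words}")
    if result.overall_confidence < min_confidence:
        errors.append(f"Overall confidence {result.overall_confidence:.2f} below {min_confidence}")
    if any(not c.source_url.startswith("http") for c in result.citations):
        errors.append("Citation with missing or malformed source URL")
    return errors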

Workflow 4: Switch LLM Provider (Development to Production)

# Development: Use Claude API
from templates.claude_client import ClaudeAdapter

client = ClaudeAdapter(api_key="sk-ant-...")  # in practice, load the key from .env (see Pitfall 1)
summary = client.generate_summary(context=facts, max_words=120)

# Production: Switch to Ollama (self-hosted)
from templates.claude_client import OllamaAdapter

client = OllamaAdapter(base_url="http://localhost:11434")
summary = client.generate_summary(context=facts, max_words=120)

# Same interface: only the adapter construction changes.
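The "same interface" guarantee comes from the adapter pattern: both adapters implement one abstract base class, so calling code never touches a vendor SDK directly. A minimal skeleton, assuming the method names shown above (the bundled claude_client.py may structure this differently):

from abc import ABC, abstractmethod


class LLMAdapter(ABC):
    """Provider-agnostic interface: callers depend on this, not on a vendor SDK."""

    @abstractmethod
    def generate_summary(self, context: list[dict], max_words: int = 120) -> dict:
        """Return a dict with 'summary', 'citations', and 'word_count' keys."""


class ClaudeAdapter(LLMAdapter):
    def __init__(self, api_key: str):
        import anthropic
        self._client = anthropic.Anthropic(api_key=api_key)

    def generate_summary(self, context: list[dict], max_words: int = 120) -> dict:
        # Build the regulatory-tone prompt, call the Claude Messages API,
        # and parse the JSON payload (omitted in this sketch).
        raise NotImplementedError


class OllamaAdapter(LLMAdapter):
    def __init__(self, base_url: str = "http://localhost:11434"):
        self._base_url = base_url

    def generate_summary(self, context: list[dict], max_words: int = 120) -> dict:
        # POST the same prompt to the local Ollama server (omitted in this sketch).
        raise NotImplementedError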

Decision Trees

Decision Tree 1: Which LLM Provider?

START: Need to integrate LLM for summary generation
├─ POC Phase (4 weeks)?
│  └─ YES → Use ClaudeAdapter (Anthropic Claude API)
│     ├─ Fastest setup (< 5 min)
│     ├─ Best quality (Claude Sonnet 4.5)
│     └─ Cost: ~$0.50 per 100 summaries
│
└─ Production Phase (Months 2-4)?
   └─ YES → Use OllamaAdapter (self-hosted)
      ├─ Zero per-request cost
      ├─ Full data privacy (no external API)
      ├─ Requires GPU server setup
      └─ Fine-tune on regulatory corpus

Decision Tree 2: How to Handle Citations?

START: LLM generates summary text
├─ Are all claims cited?
│  ├─ YES → Proceed to confidence scoring
│  └─ NO → REJECT output
│     └─ Log error: "Uncited claim detected"
│
├─ Are citation URLs valid?
│  ├─ YES → Proceed to validation
│  └─ NO → REJECT output
│     └─ Log error: "Broken citation URL"
│
└─ Is confidence ≥ 0.90?
   ├─ YES → Accept output
   └─ NO → Flag for manual review
      └─ Human reviewer validates

Decision Tree 3: Token Budget Exceeded?

START: Generate summary (target: 80-120 words)
├─ Initial generation hits the 150-token cap?
│  └─ Retry with a stricter prompt (e.g., "Maximum 120 words")
│
├─ Still exceeds 120 words?
│  └─ Truncate at sentence boundary + add "..."
│
└─ Under 80 words?
   └─ Retry with prompt: "Expand to 3-4 sentences"
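A minimal sketch of the truncation fallback, cutting at sentence boundaries so the shortened summary still reads cleanly (regex-based sentence splitting is an assumption; the retries themselves go back through the adapter):

import re


def truncate_at_sentence(text: str, max_words: int = 120) -> str:
    """Keep whole sentences until the word budget is reached, then append an ellipsis."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    kept, count = [], 0
    for sentence in sentences:
        n = len(sentence.split())
        if count + n > max_words:
            break
        kept.append(sentence)
        count += n
    truncated = count < len(text.split())
    return " ".join(kept) + (" ..." if truncated else "")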

Quality Checklist

Pre-Integration Checklist

  • API key configured in .env file (never commit to git!)
  • Dependencies installed (anthropic, pydantic, python-dotenv)
  • Test script runs successfully (setup_claude_api.py --test)
  • Input data format validated (see templates/prompt_templates.py)

Development Checklist

  • Use ClaudeAdapter for POC phase
  • Log all API requests for cost tracking
  • Implement retry logic (3 retries with exponential backoff)
  • Handle rate limits gracefully (429 errors)
  • Parse structured outputs (JSON) with error handling
  • Validate citations before returning response

Production Readiness Checklist

  • Switch to OllamaAdapter (self-hosted model)
  • Implement caching for repeated queries (see the sketch after this checklist)
  • Add monitoring (token usage, latency, error rates)
  • Set up alerting for hallucination detection failures
  • Load test: 100 concurrent requests
  • Security audit: No API keys in logs or error messages
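For the caching item, a minimal in-memory sketch keyed on the input facts; a production deployment would more likely use Redis or another shared cache with a TTL:

import hashlib
import json

_summary_cache: dict[str, dict] = {}


def cached_generate(client, facts: list[dict], max_words: int = 120) -> dict:
    """Skip the LLM call entirely when the same fact set was already summarized."""
    key = hashlib.sha256(json.dumps(facts, sort_keys=True).encode()).hexdigest()
    if key not in _summary_cache:
        _summary_cache[key] = client.generate_summary(context=facts, max_words=max_words)
    return _summary_cache[key]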

Validation Checklist

  • 100% citation coverage (every claim has source URL)
  • Word count: 80-120 words for summaries
  • Confidence score: ≥0.90 overall
  • Professional tone (validated by domain expert)
  • Zero hallucinations (no fabricated facts)
  • All URLs return 200 OK (automated check)
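The URL check is easy to automate with HEAD requests; a minimal sketch assuming the requests package is installed:

import requests


def check_citation_urls(citations: list[dict], timeout: float = 5.0) -> list[str]:
    """Return the source URLs that did not answer with HTTP 200."""
    broken = []
    for citation in citations:
        url = citation["source_url"]
        try:
            response = requests.head(url, allow_redirects=True, timeout=timeout)
            if response.status_code != 200:
                broken.append(url)
        except requests.RequestException:
            broken.append(url)
    return broken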

Common Pitfalls

Pitfall 1: Storing API Keys in Code

Wrong:

client = anthropic.Anthropic(api_key="sk-ant-abc123...")  # NEVER DO THIS!

Right:

from dotenv import load_dotenv
import os

load_dotenv()
client = anthropic.Anthropic(api_key=os.getenv("ANTHROPIC_API_KEY"))

Why: API keys in code get committed to git, exposing credentials publicly.

Pitfall 2: No Retry Logic

Wrong:

response = client.messages.create(...)  # Fails on transient errors

Right:

from tenacity import retry, stop_after_attempt, wait_exponential

@retry(stop=stop_after_attempt(3), wait=wait_exponential(min=1, max=10))
def generate_with_retry():
    return client.messages.create(...)

Why: Network errors, rate limits (429), and timeouts are common. Retries improve reliability.

Pitfall 3: Ignoring Token Limits

Wrong:

response = client.messages.create(
    model="claude-sonnet-4.5",
    max_tokens=1000,  # Way too high for 80-120 word summary!
    messages=[...]
)

Right:

response = client.messages.create(
    model="claude-sonnet-4.5",
    max_tokens=150,  # ~120 words = 150 tokens (1.25x buffer)
    messages=[...]
)

Why: Over-generation wastes tokens (cost) and requires truncation (quality loss).

Pitfall 4: Not Parsing Structured Outputs

Wrong:

summary = response.content[0].text  # Raw text, no validation

Right:

import json
from pydantic import ValidationError

try:
    parsed = json.loads(response.content[0].text)
    summary = SummaryResponse(**parsed)  # Pydantic model
except (json.JSONDecodeError, ValidationError) as e:
    logger.error(f"Failed to parse LLM response: {e}")
    raise

Why: LLMs occasionally return malformed JSON. Always validate with Pydantic.

Pitfall 5: No Hallucination Detection

Wrong:

# Trust LLM output blindly
return response.content[0].text

Right:

from templates.citation_validator import validate_citations

summary = response.content[0].text
validation_result = validate_citations(summary, source_facts)

if not validation_result.all_cited:
    raise ValueError("Hallucination detected: Uncited claim")

Why: LLMs can fabricate facts. Always validate against source data.

Pro Tips

Tip 1: Use System Prompts for Tone Control

REGULATORY_TONE_SYSTEM_PROMPT = """
You are a professional regulatory consultant writing for senior biotech executives.

Tone requirements:
- Formal, objective, third-person voice
- No marketing language or superlatives
- No uncertainty ("may", "might", "could")
- Use definitive statements with citations
- Professional terminology (e.g., "orphan designation" not "special status")

Example good: "HB001 received orphan designation on March 15, 2023 [1]."
Example bad: "HB001 might have a good chance of getting approved soon!"
"""

Why: System prompts are more effective than user prompts for style control.
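With the Anthropic SDK, the system prompt is passed separately from the user turn; a minimal sketch (client and user_prompt as set up earlier, model ID assumed):

response = client.messages.create(
    model="claude-sonnet-4-5",                 # assumed model ID
    max_tokens=150,
    system=REGULATORY_TONE_SYSTEM_PROMPT,      # style and tone constraints live here
    messages=[{"role": "user", "content": user_prompt}],  # user_prompt carries the task and facts
)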

Tip 2: Request Structured JSON Output

USER_PROMPT = """
Generate executive summary. Return ONLY valid JSON in this exact format:

{
  "summary": "3-4 sentence summary here",
  "citations": [
    {"claim": "specific claim", "source_url": "https://...", "confidence": 0.95}
  ],
  "word_count": 95
}

Do not include markdown formatting, explanations, or extra text.
"""

Why: Explicit format specification reduces parsing errors by 90%.

Tip 3: Log Token Usage for Cost Tracking

response = client.messages.create(...)

# Log usage
logger.info(
    f"API call: {response.usage.input_tokens} input + "
    f"{response.usage.output_tokens} output = "
    f"{response.usage.input_tokens + response.usage.output_tokens} total tokens"
)

# Cost calculation (Claude Sonnet 4.5 pricing)
input_cost = response.usage.input_tokens * 0.003 / 1000
output_cost = response.usage.output_tokens * 0.015 / 1000
total_cost = input_cost + output_cost

logger.info(f"Cost: ${total_cost:.4f}")

Why: Track costs early to avoid surprises. Claude Sonnet 4.5: $3 per 1M input tokens, $15 per 1M output tokens.

Tip 4: Implement Prompt Versioning

PROMPT_VERSION = "v1.2.0"

prompt_template = f"""
[PROMPT_VERSION: {PROMPT_VERSION}]

Generate executive summary for {{product_name}}...
"""

# Log with version
logger.info(f"Using prompt version: {PROMPT_VERSION}")

Why: When tweaking prompts, versioning helps track which version produced which outputs.

Tip 5: Use Few-Shot Examples for Complex Tasks

FEW_SHOT_EXAMPLES = """
Example 1:
Input: Product X treats Disease Y. Three competitors exist.
Output: {"summary": "Product X targets Disease Y with established precedent. Three competing products demonstrate regulatory acceptance. [1,2,3]", "citations": [...]}

Example 2:
Input: Product Z is novel therapy. No competitors.
Output: {"summary": "Product Z represents first-in-class approach for Indication A. No direct precedents exist, requiring novel regulatory strategy. [1]", "citations": [...]}

Now generate for actual input:
"""

Why: Few-shot examples improve output quality by 20-30% for complex formats.

Integration with SRS AI Systems

F7: Executive Summary Generation Pipeline

┌─────────────────────────────────────────────────┐
│ Step 1: Retrieve Facts from Knowledge Graph    │
│ ├─ Neo4j Cypher queries                        │
│ ├─ RAG vector search                           │
│ └─ Output: List of facts with source URLs      │
└─────────────────────────────────────────────────┘
                    ↓
┌─────────────────────────────────────────────────┐
│ Step 2: Format Input for LLM                    │
│ ├─ Use prompt_templates.py                     │
│ ├─ Include: product name, indication, facts    │
│ └─ System prompt: Regulatory tone              │
└─────────────────────────────────────────────────┘
                    ↓
┌─────────────────────────────────────────────────┐
│ Step 3: Generate Summary (claude_client.py)    │
│ ├─ ClaudeAdapter.generate_summary()            │
│ ├─ max_tokens=150 (120 words target)           │
│ └─ Return: JSON with summary + citations       │
└─────────────────────────────────────────────────┘
                    ↓
┌─────────────────────────────────────────────────┐
│ Step 4: Validate Output (citation_validator.py)│
│ ├─ Check: 100% citation coverage               │
│ ├─ Check: Word count 80-120                    │
│ ├─ Check: Confidence ≥0.90                     │
│ └─ Pass/Fail + validation report               │
└─────────────────────────────────────────────────┘
                    ↓
┌─────────────────────────────────────────────────┐
│ Step 5: Return to UI (Streamlit display)       │
│ ├─ Display summary with confidence badge       │
│ ├─ Clickable citations                         │
│ └─ Export to Word (.docx)                      │
└─────────────────────────────────────────────────┘
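End to end, Steps 3-5 reduce to a short orchestration function. A minimal sketch using the module names from the diagram; the exact signatures are assumptions, not the shipped API:

from templates.claude_client import ClaudeAdapter
from templates.citation_validator import validate_citations


def generate_executive_summary(facts: list[dict], client: ClaudeAdapter) -> dict:
    """Steps 3-4 of the F7 pipeline: generate with the adapter, then validate citations."""
    result = client.generate_summary(context=facts, max_words=120)  # Step 3
    validation = validate_citations(result["summary"], facts)       # Step 4
    if not validation.all_cited:
        raise ValueError("Hallucination detected: uncited claim in summary")
    result["validation_passed"] = True
    return result  # Step 5: hand off to the Streamlit UI layer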

F9: Validation Pipeline Integration

# In validation pipeline (validation-pipeline skill)
from llm_integration.templates.citation_validator import CitationValidator

validator = CitationValidator()

# Validate LLM-generated summary
result = validator.validate(
    summary=llm_output["summary"],
    source_facts=retrieved_facts,
    min_confidence=0.90
)

if not result.passed:
    logger.error(f"Validation failed: {result.errors}")
    # Regenerate with stricter prompt
    llm_output = regenerate_with_feedback(result.errors)

Related Skills

  • validation-pipeline: Full F9 validation pipeline (3-stage checks)
  • neo4j-integration: Knowledge graph queries for fact retrieval
  • rag-pipeline-builder: Vector search for semantic retrieval
  • streamlit-app-builder: UI display for summaries

Support & Troubleshooting

See references/troubleshooting.md for common issues and solutions.

Version

Skill Version: 1.0.0
Last Updated: November 4, 2025
Compatible with: StrategenAI PRD v1.0 (4-Week POC)