---
name: llm-integration
description: Multi-provider LLM integration with adapter pattern, citation tracking, hallucination detection, and cost optimization
---
LLM Integration Skill
What This Skill Provides
The LLM Integration skill provides production-ready components for integrating Large Language Models (LLMs) into regulatory AI systems, specifically optimized for StrategenAI's F7: Executive Summary Generation and F9: Validation Pipeline requirements.
Core Capabilities
- Adapter Pattern Implementation: Flexible LLM integration supporting multiple providers (Claude API, Ollama, self-hosted models)
- Regulatory Prompt Engineering: Templates optimized for professional regulatory consultant tone with structured outputs
- Citation-First Generation: 100% source attribution with confidence scoring
- Hallucination Detection: Multi-stage validation ensuring zero fabricated facts
- Token Optimization: Precise control over output length (80-120 words for executive summaries)
- Cost Tracking: Token counting and rate limit management
- Error Handling: Comprehensive retry logic, timeout handling, and graceful degradation
PRD Feature Alignment
F7: Executive Summary Generation (Priority: P0)
- Tone Control: Professional regulatory consultant voice
- Length Precision: 80-120 words (3-4 sentences) with token counting
- Structured Output: JSON format with summary + citations
- Content Coverage: Product overview, designation status, eligibility, competitive landscape
- 100% Citation Requirement: Every claim linked to source URL
F9: Validation Pipeline (Partial)
- Citation Verification: 100% coverage validation
- Hallucination Detection: Zero-tolerance for fabricated facts
- Confidence Scoring: Multi-level (claim, section, overall)
- Structured Output Format: Parseable JSON with validation metadata
When to Use This Skill
Use This Skill When You Need To:
- Generate AI-powered executive summaries for regulatory documents
- Integrate Claude API or other LLMs into Python applications
- Ensure 99.9%+ accuracy in LLM outputs with citation requirements
- Implement adapter pattern for multi-provider LLM support
- Build validation pipelines for AI-generated content
- Track LLM costs and optimize token usage
- Handle rate limits and implement retry logic
- Parse structured LLM outputs (JSON) with error handling
Don't Use This Skill For:
- Simple text generation without citation requirements
- Non-regulatory applications where accuracy < 99% is acceptable
- Direct LLM API calls without abstraction layer
- Applications that don't require validation pipelines
Quick Start Workflows
Workflow 1: Claude API Setup (5 minutes)
# Step 1: Install dependencies
pip install anthropic pydantic python-dotenv
# Step 2: Configure API key
cd ~/my-skills/llm-integration/scripts
python3 setup_claude_api.py --check
# Step 3: Test connection
python3 setup_claude_api.py --test
# Expected output: Claude API configured successfully
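The helper script is the supported path; if you want to verify the connection inline instead, a minimal check might look like this (the model alias and key-loading pattern here are assumptions, not necessarily what setup_claude_api.py does internally):

```python
# Minimal connectivity check -- assumes ANTHROPIC_API_KEY is set in .env
import os

import anthropic
from dotenv import load_dotenv

load_dotenv()
client = anthropic.Anthropic(api_key=os.getenv("ANTHROPIC_API_KEY"))

# Tiny request: if this returns, the key and network path are working
response = client.messages.create(
    model="claude-sonnet-4-5",
    max_tokens=16,
    messages=[{"role": "user", "content": "Reply with OK."}],
)
print(response.content[0].text)
```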
Workflow 2: Generate Executive Summary (2 minutes)
# Step 1: Prepare input data (JSON file with retrieved facts + citations)
cat > input_data.json << EOF
{
  "product_name": "HB001",
  "indication": "Haemophilia B",
  "facts": [
    {
      "claim": "HB001 is a recombinant Factor IX protein",
      "source_url": "https://example.com/source1",
      "confidence": 0.99
    },
    {
      "claim": "Five similar products received orphan designation",
      "source_url": "https://fda.gov/source2",
      "confidence": 0.95
    }
  ]
}
EOF
# Step 2: Generate summary
python3 generate_summary.py --input input_data.json --output summary.json
# Step 3: Review output
cat summary.json
Expected Output:
{
  "summary": "HB001 is a recombinant Factor IX protein for Haemophilia B with strong orphan designation potential. Five similar products received orphan designation, demonstrating precedent for this indication.",
  "word_count": 28,
  "citations": [
    {"claim": "HB001 is a recombinant Factor IX protein", "source_url": "https://example.com/source1", "confidence": 0.99},
    {"claim": "Five similar products received orphan designation", "source_url": "https://fda.gov/source2", "confidence": 0.95}
  ],
  "overall_confidence": 0.97,
  "validation_passed": true
}
Note: this demo summary is shorter than the 80-120 word production target; a real output must land in that band to pass strict validation.
Workflow 3: Validate LLM Response (1 minute)
# Validate generated summary against requirements
python3 validate_llm_response.py --input summary.json --strict
# Expected checks:
# Word count: 80-120 words
# Citation coverage: 100%
# Confidence score: ≥0.90
# Professional tone detected
# No hallucinations detected
Workflow 4: Switch LLM Provider (Development to Production)
# Development: Use Claude API
from templates.claude_client import ClaudeAdapter
client = ClaudeAdapter(api_key="sk-ant-...")
summary = client.generate_summary(context=facts, max_words=120)
# Production: Switch to Ollama (self-hosted)
from templates.claude_client import OllamaAdapter
client = OllamaAdapter(base_url="http://localhost:11434")
summary = client.generate_summary(context=facts, max_words=120)
# Same interface, zero code changes!
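The zero-change switch works because both adapters implement one shared interface. A minimal sketch of what that contract could look like (the real templates/claude_client.py may differ in details):

```python
# Hypothetical sketch of the shared adapter contract -- the actual
# templates/claude_client.py may structure this differently.
from abc import ABC, abstractmethod


class LLMAdapter(ABC):
    """Common interface so callers never depend on a specific provider."""

    @abstractmethod
    def generate_summary(self, context: list[dict], max_words: int = 120) -> dict:
        """Return {"summary": ..., "citations": [...]} for the given facts."""


class ClaudeAdapter(LLMAdapter):
    def __init__(self, api_key: str):
        self.api_key = api_key

    def generate_summary(self, context, max_words=120):
        ...  # call the Anthropic Messages API, parse the JSON response


class OllamaAdapter(LLMAdapter):
    def __init__(self, base_url: str):
        self.base_url = base_url

    def generate_summary(self, context, max_words=120):
        ...  # POST to the local Ollama endpoint, parse the JSON response
```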
Decision Trees
Decision Tree 1: Which LLM Provider?
START: Need to integrate LLM for summary generation
├─ POC Phase (4 weeks)?
│ └─ YES → Use ClaudeAdapter (Anthropic Claude API)
│ ├─ Fastest setup (< 5 min)
│ ├─ Best quality (Claude Sonnet 4.5)
│ └─ Cost: ~$0.50 per 100 summaries
│
└─ Production Phase (Months 2-4)?
└─ YES → Use OllamaAdapter (self-hosted)
├─ Zero per-request cost
├─ Full data privacy (no external API)
├─ Requires GPU server setup
└─ Fine-tune on regulatory corpus
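In code, this branch can collapse into a small configuration-driven factory. This sketch assumes a hypothetical LLM_PROVIDER environment variable, which this skill does not itself define:

```python
# Hypothetical factory implementing Decision Tree 1.
import os

from templates.claude_client import ClaudeAdapter, OllamaAdapter


def get_llm_adapter():
    if os.getenv("LLM_PROVIDER", "claude") == "ollama":
        # Production: self-hosted, zero per-request cost, full data privacy
        return OllamaAdapter(base_url=os.getenv("OLLAMA_URL", "http://localhost:11434"))
    # POC: fastest setup, best output quality
    return ClaudeAdapter(api_key=os.environ["ANTHROPIC_API_KEY"])
```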
Decision Tree 2: How to Handle Citations?
START: LLM generates summary text
├─ Are all claims cited?
│ ├─ YES → Proceed to confidence scoring
│ └─ NO → REJECT output
│ └─ Log error: "Uncited claim detected"
│
├─ Are citation URLs valid?
│ ├─ YES → Proceed to validation
│ └─ NO → REJECT output
│ └─ Log error: "Broken citation URL"
│
└─ Is confidence ≥ 0.90?
├─ YES → Accept output
└─ NO → Flag for manual review
└─ Human reviewer validates
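A direct translation of this tree into a gate function might look like the following. The names are illustrative; live URL checking (HTTP 200) is sketched separately under the Validation Checklist:

```python
# Hypothetical gate implementing Decision Tree 2. Claim and URL checks
# here are structural; HTTP liveness is a separate automated check.
def gate_output(citations: list[dict], overall_confidence: float) -> str:
    for c in citations:
        if not c.get("claim"):
            raise ValueError("Uncited claim detected")  # REJECT
        url = c.get("source_url", "")
        if not url.startswith(("http://", "https://")):
            raise ValueError(f"Broken citation URL: {url!r}")  # REJECT
    if overall_confidence >= 0.90:
        return "accepted"
    return "manual_review"  # flag for human reviewer
```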
Decision Tree 3: Token Budget Exceeded?
START: Generate summary (target: 80-120 words)
├─ Initial generation exceeds 150 tokens (~120 words)?
│ └─ Retry with stricter prompt: "Maximum 120 words"
│
├─ Still exceeds 120 words?
│ └─ Truncate at sentence boundary + add "..."
│
└─ Under 80 words?
└─ Retry with prompt: "Expand to 3-4 sentences"
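A sketch of this tree as code, assuming a caller-supplied regenerate callback (a hypothetical helper, not part of this skill's templates):

```python
# Enforce the 80-120 word band: one strict retry, then fall back to
# sentence-boundary truncation, per Decision Tree 3.
def enforce_word_budget(summary: str, regenerate) -> str:
    words = summary.split()
    if len(words) > 120:
        summary = regenerate("Rewrite in at most 120 words.")
        words = summary.split()
        if len(words) > 120:
            # Truncate at the last sentence boundary inside the budget
            clipped = " ".join(words[:120])
            summary = clipped.rsplit(".", 1)[0] + "..."
    elif len(words) < 80:
        summary = regenerate("Expand to 3-4 sentences (80-120 words).")
    return summary
```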
Quality Checklist
Pre-Integration Checklist
- API key configured in .env file (never commit to git!)
- Dependencies installed (anthropic, pydantic, python-dotenv)
- Test script runs successfully (setup_claude_api.py --test)
- Input data format validated (see templates/prompt_templates.py)
Development Checklist
- Use ClaudeAdapter for POC phase
- Log all API requests for cost tracking
- Implement retry logic (3 retries with exponential backoff)
- Handle rate limits gracefully (429 errors)
- Parse structured outputs (JSON) with error handling
- Validate citations before returning response
Production Readiness Checklist
- Switch to OllamaAdapter (self-hosted model)
- Implement caching for repeated queries (see the sketch after this list)
- Add monitoring (token usage, latency, error rates)
- Set up alerting for hallucination detection failures
- Load test: 100 concurrent requests
- Security audit: No API keys in logs or error messages
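For the caching item above, a minimal in-process sketch keyed on a hash of the rendered prompt (a production deployment would likely use Redis or similar, with a TTL):

```python
# Cache LLM responses for identical prompts so repeated queries are free.
import hashlib

_cache: dict[str, dict] = {}


def cached_generate(prompt: str, generate) -> dict:
    key = hashlib.sha256(prompt.encode("utf-8")).hexdigest()
    if key not in _cache:
        _cache[key] = generate(prompt)  # only pay for the first call
    return _cache[key]
```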
Validation Checklist
- 100% citation coverage (every claim has source URL)
- Word count: 80-120 words for summaries
- Confidence score: ≥0.90 overall
- Professional tone (validated by domain expert)
- Zero hallucinations (no fabricated facts)
- All URLs return 200 OK (automated check)
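The last item can be automated with a cheap HEAD-request sweep; a sketch using the requests library (not currently a declared dependency of this skill):

```python
# Automated link check for the "All URLs return 200 OK" item.
import requests


def check_citation_urls(citations: list[dict]) -> list[str]:
    broken = []
    for c in citations:
        try:
            resp = requests.head(c["source_url"], timeout=5, allow_redirects=True)
            if resp.status_code != 200:
                broken.append(c["source_url"])
        except requests.RequestException:
            broken.append(c["source_url"])
    return broken  # empty list means the check passed
```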
Common Pitfalls
Pitfall 1: Storing API Keys in Code
Wrong:
client = anthropic.Anthropic(api_key="sk-ant-abc123...") # NEVER DO THIS!
Right:
from dotenv import load_dotenv
import os
load_dotenv()
client = anthropic.Anthropic(api_key=os.getenv("ANTHROPIC_API_KEY"))
Why: API keys in code get committed to git, exposing credentials publicly.
Pitfall 2: No Retry Logic
Wrong:
response = client.messages.create(...) # Fails on transient errors
Right:
from tenacity import retry, stop_after_attempt, wait_exponential

@retry(stop=stop_after_attempt(3), wait=wait_exponential(min=1, max=10))
def generate_with_retry():
    return client.messages.create(...)
Why: Network errors, rate limits (429), and timeouts are common. Retries improve reliability.
Pitfall 3: Ignoring Token Limits
Wrong:
response = client.messages.create(
    model="claude-sonnet-4-5",
    max_tokens=1000,  # Way too high for an 80-120 word summary!
    messages=[...]
)
Right:
response = client.messages.create(
    model="claude-sonnet-4-5",
    max_tokens=150,  # ~120 words = 150 tokens (1.25x buffer)
    messages=[...]
)
Why: Over-generation wastes tokens (cost) and requires truncation (quality loss).
Pitfall 4: Not Parsing Structured Outputs
Wrong:
summary = response.content[0].text # Raw text, no validation
Right:
import json
from pydantic import ValidationError
try:
    parsed = json.loads(response.content[0].text)
    summary = SummaryResponse(**parsed)  # Pydantic model
except (json.JSONDecodeError, ValidationError) as e:
    logger.error(f"Failed to parse LLM response: {e}")
    raise
Why: LLMs occasionally return malformed JSON. Always validate with Pydantic.
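SummaryResponse itself is not defined in the snippet above. A plausible Pydantic definition, with field names assumed from the Workflow 2 output example:

```python
# Hypothetical schema matching the expected summary JSON from Workflow 2.
from pydantic import BaseModel, Field


class Citation(BaseModel):
    claim: str
    source_url: str
    confidence: float = Field(ge=0.0, le=1.0)


class SummaryResponse(BaseModel):
    summary: str
    word_count: int
    citations: list[Citation]
    overall_confidence: float = Field(ge=0.0, le=1.0)
    validation_passed: bool = False
```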
Pitfall 5: No Hallucination Detection
Wrong:
# Trust LLM output blindly
return response.content[0].text
Right:
from templates.citation_validator import validate_citations
summary = response.content[0].text
validation_result = validate_citations(summary, source_facts)
if not validation_result.all_cited:
    raise ValueError("Hallucination detected: Uncited claim")
Why: LLMs can fabricate facts. Always validate against source data.
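Conceptually, the validator only has to answer one question: does every claim in the output trace back to a retrieved source fact? A stripped-down sketch (the real templates/citation_validator.py may use fuzzier matching than exact string equality):

```python
# Any claim in the output that is absent from the retrieved facts is a
# candidate hallucination.
def find_uncited_claims(output_citations: list[dict], source_facts: list[dict]) -> list[str]:
    known_claims = {f["claim"] for f in source_facts}
    return [c["claim"] for c in output_citations if c["claim"] not in known_claims]
```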
Pro Tips
Tip 1: Use System Prompts for Tone Control
REGULATORY_TONE_SYSTEM_PROMPT = """
You are a professional regulatory consultant writing for senior biotech executives.
Tone requirements:
- Formal, objective, third-person voice
- No marketing language or superlatives
- No uncertainty ("may", "might", "could")
- Use definitive statements with citations
- Professional terminology (e.g., "orphan designation" not "special status")
Example good: "HB001 received orphan designation on March 15, 2023 [1]."
Example bad: "HB001 might have a good chance of getting approved soon!"
"""
Why: System prompts are more effective than user prompts for style control.
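With the Anthropic Messages API, the system prompt goes in the dedicated system parameter rather than the messages list (user_prompt here is a placeholder):

```python
# Pass the tone-control prompt via the `system` parameter.
response = client.messages.create(
    model="claude-sonnet-4-5",
    max_tokens=150,
    system=REGULATORY_TONE_SYSTEM_PROMPT,
    messages=[{"role": "user", "content": user_prompt}],
)
```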
Tip 2: Request Structured JSON Output
USER_PROMPT = """
Generate executive summary. Return ONLY valid JSON in this exact format:
{
  "summary": "3-4 sentence summary here",
  "citations": [
    {"claim": "specific claim", "source_url": "https://...", "confidence": 0.95}
  ],
  "word_count": 95
}
Do not include markdown formatting, explanations, or extra text.
"""
Why: Explicit format specification reduces parsing errors by 90%.
Tip 3: Log Token Usage for Cost Tracking
response = client.messages.create(...)
# Log usage
logger.info(
    f"API call: {response.usage.input_tokens} input + "
    f"{response.usage.output_tokens} output = "
    f"{response.usage.input_tokens + response.usage.output_tokens} total tokens"
)
# Cost calculation (Claude Sonnet 4.5 pricing)
input_cost = response.usage.input_tokens * 0.003 / 1000
output_cost = response.usage.output_tokens * 0.015 / 1000
total_cost = input_cost + output_cost
logger.info(f"Cost: ${total_cost:.4f}")
Why: Track costs early to avoid surprises. Claude Sonnet 4.5: $3 per 1M input tokens, $15 per 1M output tokens.
Tip 4: Implement Prompt Versioning
PROMPT_VERSION = "v1.2.0"
prompt_template = f"""
[PROMPT_VERSION: {PROMPT_VERSION}]
Generate executive summary for {{product_name}}...
"""
# Log with version
logger.info(f"Using prompt version: {PROMPT_VERSION}")
Why: When tweaking prompts, versioning helps track which version produced which outputs.
Tip 5: Use Few-Shot Examples for Complex Tasks
FEW_SHOT_EXAMPLES = """
Example 1:
Input: Product X treats Disease Y. Three competitors exist.
Output: {"summary": "Product X targets Disease Y with established precedent. Three competing products demonstrate regulatory acceptance. [1,2,3]", "citations": [...]}
Example 2:
Input: Product Z is novel therapy. No competitors.
Output: {"summary": "Product Z represents first-in-class approach for Indication A. No direct precedents exist, requiring novel regulatory strategy. [1]", "citations": [...]}
Now generate for actual input:
"""
Why: Few-shot examples improve output quality by 20-30% for complex formats.
Integration with SRS AI Systems
F7: Executive Summary Generation Pipeline
┌─────────────────────────────────────────────────┐
│ Step 1: Retrieve Facts from Knowledge Graph │
│ ├─ Neo4j Cypher queries │
│ ├─ RAG vector search │
│ └─ Output: List of facts with source URLs │
└─────────────────────────────────────────────────┘
↓
┌─────────────────────────────────────────────────┐
│ Step 2: Format Input for LLM │
│ ├─ Use prompt_templates.py │
│ ├─ Include: product name, indication, facts │
│ └─ System prompt: Regulatory tone │
└─────────────────────────────────────────────────┘
↓
┌─────────────────────────────────────────────────┐
│ Step 3: Generate Summary (claude_client.py) │
│ ├─ ClaudeAdapter.generate_summary() │
│ ├─ max_tokens=150 (120 words target) │
│ └─ Return: JSON with summary + citations │
└─────────────────────────────────────────────────┘
↓
┌─────────────────────────────────────────────────┐
│ Step 4: Validate Output (citation_validator.py)│
│ ├─ Check: 100% citation coverage │
│ ├─ Check: Word count 80-120 │
│ ├─ Check: Confidence ≥0.90 │
│ └─ Pass/Fail + validation report │
└─────────────────────────────────────────────────┘
↓
┌─────────────────────────────────────────────────┐
│ Step 5: Return to UI (Streamlit display) │
│ ├─ Display summary with confidence badge │
│ ├─ Clickable citations │
│ └─ Export to Word (.docx) │
└─────────────────────────────────────────────────┘
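Glued together, Steps 1-4 reduce to a short orchestration function. This is a sketch: retrieve_facts stands in for the neo4j-integration/rag-pipeline-builder skills, get_llm_adapter is the hypothetical factory from Decision Tree 1, and CitationValidator is imported as shown in the F9 section below.

```python
# Hypothetical end-to-end F7 pipeline. retrieve_facts is a placeholder
# for knowledge-graph/RAG retrieval provided by the related skills.
def build_executive_summary(product_name: str, indication: str) -> dict:
    facts = retrieve_facts(product_name, indication)       # Step 1: gather cited facts
    client = get_llm_adapter()                             # Steps 2-3: prompt + generate
    output = client.generate_summary(context=facts, max_words=120)
    result = CitationValidator().validate(                 # Step 4: validate
        summary=output["summary"],
        source_facts=facts,
        min_confidence=0.90,
    )
    if not result.passed:
        raise ValueError(f"Validation failed: {result.errors}")
    return output  # Step 5: hand off to the Streamlit layer for display/export
```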
F9: Validation Pipeline Integration
# In validation pipeline (validation-pipeline skill)
from llm_integration.templates.citation_validator import CitationValidator
validator = CitationValidator()
# Validate LLM-generated summary
result = validator.validate(
    summary=llm_output["summary"],
    source_facts=retrieved_facts,
    min_confidence=0.90
)
if not result.passed:
    logger.error(f"Validation failed: {result.errors}")
    # Regenerate with stricter prompt
    llm_output = regenerate_with_feedback(result.errors)
Related Skills
- validation-pipeline: Full F9 validation pipeline (3-stage checks)
- neo4j-integration: Knowledge graph queries for fact retrieval
- rag-pipeline-builder: Vector search for semantic retrieval
- streamlit-app-builder: UI display for summaries
Support & Troubleshooting
See references/troubleshooting.md for common issues and solutions.
Version
Skill Version: 1.0.0
Last Updated: November 4, 2025
Compatible with: StrategenAI PRD v1.0 (4-Week POC)