---
name: llm-integration
description: Multi-provider LLM integration with adapter pattern, citation tracking, hallucination detection, and cost optimization
---
LLM Integration Skill
What This Skill Provides
The LLM Integration skill provides production-ready components for integrating Large Language Models (LLMs) into regulatory AI systems, specifically optimized for StrategenAI's F7: Executive Summary Generation and F9: Validation Pipeline requirements.
Core Capabilities
- Adapter Pattern Implementation: Flexible LLM integration supporting multiple providers (Claude API, Ollama, self-hosted models)
- Regulatory Prompt Engineering: Templates optimized for professional regulatory consultant tone with structured outputs
- Citation-First Generation: 100% source attribution with confidence scoring
- Hallucination Detection: Multi-stage validation ensuring zero fabricated facts
- Token Optimization: Precise control over output length (80-120 words for executive summaries)
- Cost Tracking: Token counting and rate limit management
- Error Handling: Comprehensive retry logic, timeout handling, and graceful degradation
PRD Feature Alignment
F7: Executive Summary Generation (Priority: P0)
- Tone Control: Professional regulatory consultant voice
- Length Precision: 80-120 words (3-4 sentences) with token counting
- Structured Output: JSON format with summary + citations
- Content Coverage: Product overview, designation status, eligibility, competitive landscape
- 100% Citation Requirement: Every claim linked to source URL
F9: Validation Pipeline (Partial)
- Citation Verification: 100% coverage validation
- Hallucination Detection: Zero-tolerance for fabricated facts
- Confidence Scoring: Multi-level (claim, section, overall)
- Structured Output Format: Parseable JSON with validation metadata
When to Use This Skill
Use This Skill When You Need To:
- Generate AI-powered executive summaries for regulatory documents
- Integrate Claude API or other LLMs into Python applications
- Ensure 99.9%+ accuracy in LLM outputs with citation requirements
- Implement adapter pattern for multi-provider LLM support
- Build validation pipelines for AI-generated content
- Track LLM costs and optimize token usage
- Handle rate limits and implement retry logic
- Parse structured LLM outputs (JSON) with error handling
Don't Use This Skill For:
- Simple text generation without citation requirements
- Non-regulatory applications where accuracy < 99% is acceptable
- Direct LLM API calls without abstraction layer
- Applications that don't require validation pipelines
Quick Start Workflows
Workflow 1: Claude API Setup (5 minutes)
# Step 1: Install dependencies
pip install anthropic pydantic python-dotenv
# Step 2: Configure API key
cd ~/my-skills/llm-integration/scripts
python3 setup_claude_api.py --check
# Step 3: Test connection
python3 setup_claude_api.py --test
# Expected output: Claude API configured successfully
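The helper script is the supported path; if you want to verify the connection inline instead, a minimal check might look like this (the model alias and key-loading pattern here are assumptions, not necessarily what setup_claude_api.py does internally):

```python
# Minimal connectivity check -- assumes ANTHROPIC_API_KEY is set in .env
import os

import anthropic
from dotenv import load_dotenv

load_dotenv()
client = anthropic.Anthropic(api_key=os.getenv("ANTHROPIC_API_KEY"))

# Tiny request: if this returns, the key and network path are working
response = client.messages.create(
    model="claude-sonnet-4-5",
    max_tokens=16,
    messages=[{"role": "user", "content": "Reply with OK."}],
)
print(response.content[0].text)
```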
Workflow 2: Generate Executive Summary (2 minutes)
# Step 1: Prepare input data (JSON file with retrieved facts + citations)
cat > input_data.json << EOF
{
  "product_name": "HB001",
  "indication": "Haemophilia B",
  "facts": [
    {
      "claim": "HB001 is a recombinant Factor IX protein",
      "source_url": "https://example.com/source1",
      "confidence": 0.99
    },
    {
      "claim": "Five similar products received orphan designation",
      "source_url": "https://fda.gov/source2",
      "confidence": 0.95
    }
  ]
}
EOF
# Step 2: Generate summary
python3 generate_summary.py --input input_data.json --output summary.json
# Step 3: Review output
cat summary.json
Expected Output:
{
  "summary": "HB001 is a recombinant Factor IX protein for Haemophilia B with strong orphan designation potential. Five similar products received orphan designation, demonstrating precedent for this indication.",
  "word_count": 28,
  "citations": [
    {"claim": "HB001 is a recombinant Factor IX protein", "source_url": "https://example.com/source1", "confidence": 0.99},
    {"claim": "Five similar products received orphan designation", "source_url": "https://fda.gov/source2", "confidence": 0.95}
  ],
  "overall_confidence": 0.97,
  "validation_passed": true
}
Note: this demo summary is shorter than the 80-120 word production target; a real output must land in that band to pass strict validation.
Workflow 3: Validate LLM Response (1 minute)
# Validate generated summary against requirements
python3 validate_llm_response.py --input summary.json --strict
# Expected checks:
# Word count: 80-120 words
# Citation coverage: 100%
# Confidence score: ≥0.90
# Professional tone detected
# No hallucinations detected
Workflow 4: Switch LLM Provider (Development to Production)
# Development: Use Claude API
from templates.claude_client import ClaudeAdapter
client = ClaudeAdapter(api_key="sk-ant-...")
summary = client.generate_summary(context=facts, max_words=120)
# Production: Switch to Ollama (self-hosted)
from templates.claude_client import OllamaAdapter
client = OllamaAdapter(base_url="http://localhost:11434")
summary = client.generate_summary(context=facts, max_words=120)
# Same interface, zero code changes!
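The zero-change switch works because both adapters implement one shared interface. A minimal sketch of what that contract could look like (the real templates/claude_client.py may differ in details):

```python
# Hypothetical sketch of the shared adapter contract -- the actual
# templates/claude_client.py may structure this differently.
from abc import ABC, abstractmethod


class LLMAdapter(ABC):
    """Common interface so callers never depend on a specific provider."""

    @abstractmethod
    def generate_summary(self, context: list[dict], max_words: int = 120) -> dict:
        """Return {"summary": ..., "citations": [...]} for the given facts."""


class ClaudeAdapter(LLMAdapter):
    def __init__(self, api_key: str):
        self.api_key = api_key

    def generate_summary(self, context, max_words=120):
        ...  # call the Anthropic Messages API, parse the JSON response


class OllamaAdapter(LLMAdapter):
    def __init__(self, base_url: str):
        self.base_url = base_url

    def generate_summary(self, context, max_words=120):
        ...  # POST to the local Ollama endpoint, parse the JSON response
```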
Decision Trees
Decision Tree 1: Which LLM Provider?
START: Need to integrate LLM for summary generation
├─ POC Phase (4 weeks)?
│ └─ YES → Use ClaudeAdapter (Anthropic Claude API)
│ ├─ Fastest setup (< 5 min)
│ ├─ Best quality (Claude Sonnet 4.5)
│ └─ Cost: ~$0.50 per 100 summaries
│
└─ Production Phase (Months 2-4)?
└─ YES → Use OllamaAdapter (self-hosted)
├─ Zero per-request cost
├─ Full data privacy (no external API)
├─ Requires GPU server setup
└─ Fine-tune on regulatory corpus
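In code, this branch can collapse into a small configuration-driven factory. This sketch assumes a hypothetical LLM_PROVIDER environment variable, which this skill does not itself define:

```python
# Hypothetical factory implementing Decision Tree 1.
import os

from templates.claude_client import ClaudeAdapter, OllamaAdapter


def get_llm_adapter():
    if os.getenv("LLM_PROVIDER", "claude") == "ollama":
        # Production: self-hosted, zero per-request cost, full data privacy
        return OllamaAdapter(base_url=os.getenv("OLLAMA_URL", "http://localhost:11434"))
    # POC: fastest setup, best output quality
    return ClaudeAdapter(api_key=os.environ["ANTHROPIC_API_KEY"])
```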
Decision Tree 2: How to Handle Citations?
START: LLM generates summary text
├─ Are all claims cited?
│ ├─ YES → Proceed to confidence scoring
│ └─ NO → REJECT output
│ └─ Log error: "Uncited claim detected"
│
├─ Are citation URLs valid?
│ ├─ YES → Proceed to validation
│ └─ NO → REJECT output
│ └─ Log error: "Broken citation URL"
│
└─ Is confidence ≥ 0.90?
├─ YES → Accept output
└─ NO → Flag for manual review
└─ Human reviewer validates
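A direct translation of this tree into a gate function might look like the following. The names are illustrative; live URL checking (HTTP 200) is sketched separately under the Validation Checklist:

```python
# Hypothetical gate implementing Decision Tree 2. Claim and URL checks
# here are structural; HTTP liveness is a separate automated check.
def gate_output(citations: list[dict], overall_confidence: float) -> str:
    for c in citations:
        if not c.get("claim"):
            raise ValueError("Uncited claim detected")  # REJECT
        url = c.get("source_url", "")
        if not url.startswith(("http://", "https://")):
            raise ValueError(f"Broken citation URL: {url!r}")  # REJECT
    if overall_confidence >= 0.90:
        return "accepted"
    return "manual_review"  # flag for human reviewer
```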
Decision Tree 3: Token Budget Exceeded?
START: Generate summary (target: 80-120 words)
├─ Initial generation exceeds 150 tokens (~120 words)?
│ └─ Retry with stricter prompt: "Maximum 120 words"
│
├─ Still exceeds 120 words?
│ └─ Truncate at sentence boundary + add "..."
│
└─ Under 80 words?
└─ Retry with prompt: "Expand to 3-4 sentences"
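A sketch of this tree as code, assuming a caller-supplied regenerate callback (a hypothetical helper, not part of this skill's templates):

```python
# Enforce the 80-120 word band: one strict retry, then fall back to
# sentence-boundary truncation, per Decision Tree 3.
def enforce_word_budget(summary: str, regenerate) -> str:
    words = summary.split()
    if len(words) > 120:
        summary = regenerate("Rewrite in at most 120 words.")
        words = summary.split()
        if len(words) > 120:
            # Truncate at the last sentence boundary inside the budget
            clipped = " ".join(words[:120])
            summary = clipped.rsplit(".", 1)[0] + "..."
    elif len(words) < 80:
        summary = regenerate("Expand to 3-4 sentences (80-120 words).")
    return summary
```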
Quality Checklist
Pre-Integration Checklist
- API key configured in .env file (never commit to git!)
- Dependencies installed (anthropic, pydantic, python-dotenv)
- Test script runs successfully (setup_claude_api.py --test)
- Input data format validated (see templates/prompt_templates.py)
Development Checklist
- Use ClaudeAdapter for POC phase
- Log all API requests for cost tracking
- Implement retry logic (3 retries with exponential backoff)
- Handle rate limits gracefully (429 errors)
- Parse structured outputs (JSON) with error handling
- Validate citations before returning response
Production Readiness Checklist
- Switch to OllamaAdapter (self-hosted model)
- Implement caching for repeated queries (see the sketch after this list)
- Add monitoring (token usage, latency, error rates)
- Set up alerting for hallucination detection failures
- Load test: 100 concurrent requests
- Security audit: No API keys in logs or error messages
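For the caching item above, a minimal in-process sketch keyed on a hash of the rendered prompt (a production deployment would likely use Redis or similar, with a TTL):

```python
# Cache LLM responses for identical prompts so repeated queries are free.
import hashlib

_cache: dict[str, dict] = {}


def cached_generate(prompt: str, generate) -> dict:
    key = hashlib.sha256(prompt.encode("utf-8")).hexdigest()
    if key not in _cache:
        _cache[key] = generate(prompt)  # only pay for the first call
    return _cache[key]
```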
Validation Checklist
- 100% citation coverage (every claim has source URL)
- Word count: 80-120 words for summaries
- Confidence score: ≥0.90 overall
- Professional tone (validated by domain expert)
- Zero hallucinations (no fabricated facts)
- All URLs return 200 OK (automated check)
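The last item can be automated with a cheap HEAD-request sweep; a sketch using the requests library (not currently a declared dependency of this skill):

```python
# Automated link check for the "All URLs return 200 OK" item.
import requests


def check_citation_urls(citations: list[dict]) -> list[str]:
    broken = []
    for c in citations:
        try:
            resp = requests.head(c["source_url"], timeout=5, allow_redirects=True)
            if resp.status_code != 200:
                broken.append(c["source_url"])
        except requests.RequestException:
            broken.append(c["source_url"])
    return broken  # empty list means the check passed
```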
Common Pitfalls
Pitfall 1: Storing API Keys in Code
Wrong:
client = anthropic.Anthropic(api_key="sk-ant-abc123...") # NEVER DO THIS!
Right:
from dotenv import load_dotenv
import os
load_dotenv()
client = anthropic.Anthropic(api_key=os.getenv("ANTHROPIC_API_KEY"))
Why: API keys in code get committed to git, exposing credentials publicly.
Pitfall 2: No Retry Logic
Wrong:
response = client.messages.create(...) # Fails on transient errors
Right:
from tenacity import retry, stop_after_attempt, wait_exponential

@retry(stop=stop_after_attempt(3), wait=wait_exponential(min=1, max=10))
def generate_with_retry():
    return client.messages.create(...)
Why: Network errors, rate limits (429), and timeouts are common. Retries improve reliability.
Pitfall 3: Ignoring Token Limits
Wrong:
response = client.messages.create(
    model="claude-sonnet-4-5",
    max_tokens=1000,  # Way too high for an 80-120 word summary!
    messages=[...]
)
Right:
response = client.messages.create(
    model="claude-sonnet-4-5",
    max_tokens=150,  # ~120 words = 150 tokens (1.25x buffer)
    messages=[...]
)
Why: Over-generation wastes tokens (cost) and requires truncation (quality loss).
Pitfall 4: Not Parsing Structured Outputs
Wrong:
summary = response.content[0].text # Raw text, no validation
Right:
import json
from pydantic import ValidationError
try:
    parsed = json.loads(response.content[0].text)
    summary = SummaryResponse(**parsed)  # Pydantic model
except (json.JSONDecodeError, ValidationError) as e:
    logger.error(f"Failed to parse LLM response: {e}")
    raise
Why: LLMs occasionally return malformed JSON. Always validate with Pydantic.
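SummaryResponse itself is not defined in the snippet above. A plausible Pydantic definition, with field names assumed from the Workflow 2 output example:

```python
# Hypothetical schema matching the expected summary JSON from Workflow 2.
from pydantic import BaseModel, Field


class Citation(BaseModel):
    claim: str
    source_url: str
    confidence: float = Field(ge=0.0, le=1.0)


class SummaryResponse(BaseModel):
    summary: str
    word_count: int
    citations: list[Citation]
    overall_confidence: float = Field(ge=0.0, le=1.0)
    validation_passed: bool = False
```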
Pitfall 5: No Hallucination Detection
Wrong:
# Trust LLM output blindly
return response.content[0].text
Right:
from templates.citation_validator import validate_citations
summary = response.content[0].text
validation_result = validate_citations(summary, source_facts)
if not validation_result.all_cited:
    raise ValueError("Hallucination detected: Uncited claim")
Why: LLMs can fabricate facts. Always validate against source data.
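Conceptually, the validator only has to answer one question: does every claim in the output trace back to a retrieved source fact? A stripped-down sketch (the real templates/citation_validator.py may use fuzzier matching than exact string equality):

```python
# Any claim in the output that is absent from the retrieved facts is a
# candidate hallucination.
def find_uncited_claims(output_citations: list[dict], source_facts: list[dict]) -> list[str]:
    known_claims = {f["claim"] for f in source_facts}
    return [c["claim"] for c in output_citations if c["claim"] not in known_claims]
```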
Pro Tips
Tip 1: Use System Prompts for Tone Control
REGULATORY_TONE_SYSTEM_PROMPT = """
You are a professional regulatory consultant writing for senior biotech executives.
Tone requirements:
- Formal, objective, third-person voice
- No marketing language or superlatives
- No uncertainty ("may", "might", "could")
- Use definitive statements with citations
- Professional terminology (e.g., "orphan designation" not "special status")
Example good: "HB001 received orphan designation on March 15, 2023 [1]."
Example bad: "HB001 might have a good chance of getting approved soon!"
"""
Why: System prompts are more effective than user prompts for style control.
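With the Anthropic Messages API, the system prompt goes in the dedicated system parameter rather than the messages list (user_prompt here is a placeholder):

```python
# Pass the tone-control prompt via the `system` parameter.
response = client.messages.create(
    model="claude-sonnet-4-5",
    max_tokens=150,
    system=REGULATORY_TONE_SYSTEM_PROMPT,
    messages=[{"role": "user", "content": user_prompt}],
)
```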
Tip 2: Request Structured JSON Output
USER_PROMPT = """
Generate executive summary. Return ONLY valid JSON in this exact format:
{
  "summary": "3-4 sentence summary here",
  "citations": [
    {"claim": "specific claim", "source_url": "https://...", "confidence": 0.95}
  ],
  "word_count": 95
}
Do not include markdown formatting, explanations, or extra text.
"""
Why: Explicit format specification reduces parsing errors by 90%.
Tip 3: Log Token Usage for Cost Tracking
response = client.messages.create(...)
# Log usage
logger.info(
    f"API call: {response.usage.input_tokens} input + "
    f"{response.usage.output_tokens} output = "
    f"{response.usage.input_tokens + response.usage.output_tokens} total tokens"
)
# Cost calculation (Claude Sonnet 4.5 pricing)
input_cost = response.usage.input_tokens * 0.003 / 1000
output_cost = response.usage.output_tokens * 0.015 / 1000
total_cost = input_cost + output_cost
logger.info(f"Cost: ${total_cost:.4f}")
Why: Track costs early to avoid surprises. Claude Sonnet 4.5: $3 per 1M input tokens, $15 per 1M output tokens.
Tip 4: Implement Prompt Versioning
PROMPT_VERSION = "v1.2.0"
prompt_template = f"""
[PROMPT_VERSION: {PROMPT_VERSION}]
Generate executive summary for {{product_name}}...
"""
# Log with version
logger.info(f"Using prompt version: {PROMPT_VERSION}")
Why: When tweaking prompts, versioning helps track which version produced which outputs.
Tip 5: Use Few-Shot Examples for Complex Tasks
FEW_SHOT_EXAMPLES = """
Example 1:
Input: Product X treats Disease Y. Three competitors exist.
Output: {"summary": "Product X targets Disease Y with established precedent. Three competing products demonstrate regulatory acceptance. [1,2,3]", "citations": [...]}
Example 2:
Input: Product Z is novel therapy. No competitors.
Output: {"summary": "Product Z represents first-in-class approach for Indication A. No direct precedents exist, requiring novel regulatory strategy. [1]", "citations": [...]}
Now generate for actual input:
"""
Why: Few-shot examples improve output quality by 20-30% for complex formats.
Integration with SRS AI Systems
F7: Executive Summary Generation Pipeline
┌─────────────────────────────────────────────────┐
│ Step 1: Retrieve Facts from Knowledge Graph │
│ ├─ Neo4j Cypher queries │
│ ├─ RAG vector search │
│ └─ Output: List of facts with source URLs │
└─────────────────────────────────────────────────┘
↓
┌─────────────────────────────────────────────────┐
│ Step 2: Format Input for LLM │
│ ├─ Use prompt_templates.py │
│ ├─ Include: product name, indication, facts │
│ └─ System prompt: Regulatory tone │
└─────────────────────────────────────────────────┘
↓
┌─────────────────────────────────────────────────┐
│ Step 3: Generate Summary (claude_client.py) │
│ ├─ ClaudeAdapter.generate_summary() │
│ ├─ max_tokens=150 (120 words target) │
│ └─ Return: JSON with summary + citations │
└─────────────────────────────────────────────────┘
↓
┌─────────────────────────────────────────────────┐
│ Step 4: Validate Output (citation_validator.py)│
│ ├─ Check: 100% citation coverage │
│ ├─ Check: Word count 80-120 │
│ ├─ Check: Confidence ≥0.90 │
│ └─ Pass/Fail + validation report │
└─────────────────────────────────────────────────┘
↓
┌─────────────────────────────────────────────────┐
│ Step 5: Return to UI (Streamlit display) │
│ ├─ Display summary with confidence badge │
│ ├─ Clickable citations │
│ └─ Export to Word (.docx) │
└─────────────────────────────────────────────────┘
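Glued together, Steps 1-4 reduce to a short orchestration function. This is a sketch: retrieve_facts stands in for the neo4j-integration/rag-pipeline-builder skills, get_llm_adapter is the hypothetical factory from Decision Tree 1, and CitationValidator is imported as shown in the F9 section below.

```python
# Hypothetical end-to-end F7 pipeline. retrieve_facts is a placeholder
# for knowledge-graph/RAG retrieval provided by the related skills.
def build_executive_summary(product_name: str, indication: str) -> dict:
    facts = retrieve_facts(product_name, indication)       # Step 1: gather cited facts
    client = get_llm_adapter()                             # Steps 2-3: prompt + generate
    output = client.generate_summary(context=facts, max_words=120)
    result = CitationValidator().validate(                 # Step 4: validate
        summary=output["summary"],
        source_facts=facts,
        min_confidence=0.90,
    )
    if not result.passed:
        raise ValueError(f"Validation failed: {result.errors}")
    return output  # Step 5: hand off to the Streamlit layer for display/export
```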
F9: Validation Pipeline Integration
# In validation pipeline (validation-pipeline skill)
from llm_integration.templates.citation_validator import CitationValidator
validator = CitationValidator()
# Validate LLM-generated summary
result = validator.validate(
    summary=llm_output["summary"],
    source_facts=retrieved_facts,
    min_confidence=0.90
)
if not result.passed:
    logger.error(f"Validation failed: {result.errors}")
    # Regenerate with stricter prompt
    llm_output = regenerate_with_feedback(result.errors)
Related Skills
- validation-pipeline: Full F9 validation pipeline (3-stage checks)
- neo4j-integration: Knowledge graph queries for fact retrieval
- rag-pipeline-builder: Vector search for semantic retrieval
- streamlit-app-builder: UI display for summaries
Support & Troubleshooting
See references/troubleshooting.md for common issues and solutions.
Version
Skill Version: 1.0.0
Last Updated: November 4, 2025
Compatible with: StrategenAI PRD v1.0 (4-Week POC)