---
name: ai-llm-development
description: "MANDATORY invocation for ALL LLM-related work. Invoke immediately when: ANY mention of model names, IDs, or versions; ANY configuration of AI providers or APIs; ANY defaults/constants for LLM settings; ANY prompt engineering or modification; ANY discussion of model capabilities or features; ANY changes to AI-related dependencies; reading/writing .env files with AI config; modifying aiProviders.ts, prompts.ts, or similar; reviewing AI-related pull requests; debugging LLM integration issues. CRITICAL: Training data lags reality by months. ALWAYS research first. Use WebSearch, Exa MCP, or Gemini CLI before making ANY LLM decisions."
---
# AI/LLM Development
## Core Philosophy

- **Context Engineering > Prompt Engineering**: Optimize the entire LLM configuration, not just the wording.
- **Simplicity First**: 80% of use cases need a single LLM call, not a multi-agent system.
- **Currency Over Memory**: Models deprecate in 6-12 months. Learn to find current ones via leaderboards.
- **Empiricism**: Benchmarks guide; YOUR data decides. Test the top 3-5 models with your prompts.
## RESEARCH FIRST PROTOCOL

**CRITICAL**: Your training data is ALWAYS stale for LLM work. The field changes weekly.
### Before ANY LLM-Related Action
1. **Identify what you're assuming**: Model capabilities? API syntax? Best practices?
2. **Research using live tools** (in order of preference):
   - WebSearch: "latest [model/provider] models"
   - Exa MCP: get current documentation and examples
   - Gemini CLI: verify against the latest information with web grounding
3. **Verify your assumptions.** Don't trust training data for:
   - Model names and versions (new models release monthly)
   - API syntax and parameters (providers update frequently)
   - Best practices and recommendations (they evolve constantly)
   - Pricing and limits (they change without notice)
   - Deprecation status (models are removed regularly)
### Research Query Templates
**Model Selection**:
- "latest [provider] models"
- "[model-name] release date and capabilities"
- "is [model-name] deprecated or superseded"
- "[provider] newest models announced"

**API Syntax**:
- "[provider] API documentation [specific-feature]"
- "[sdk-name] current version and usage"
- "OpenRouter model ID format current"

**Best Practices**:
- "[task] LLM best practices latest"
- "current recommendations for [architecture pattern]"
- "[framework] latest patterns and examples"
### Red Flags That Trigger Mandatory Research

- ❌ Making assumptions about version numbers (a higher number, e.g. 3.0 vs 2.5, doesn't guarantee a newer model)
- ❌ Changing model defaults without verification
- ❌ Assuming API syntax from training data
- ❌ Selecting models based on memory of capabilities
- ❌ Following "best practices" without checking they're still current
- ❌ Any action based on "I think..." or "probably..." for LLM topics
### Research Before Action Checklist

Before committing any LLM-related change:

- [ ] Searched for the latest information on the models/APIs involved
- [ ] Verified current state vs. training-data assumptions
- [ ] Checked provider documentation for API syntax
- [ ] Confirmed the model is not deprecated or superseded
- [ ] Validated that best practices are still current
- [ ] Tested configuration syntax in a provider console/playground
**Mantra**: "When in doubt about LLM tech, RESEARCH. When certain about LLM tech, STILL RESEARCH."
## Decision Trees

### Model Selection

Task type → find the relevant benchmark → check leaderboards → test the top 3 empirically.

Coding: SWE-bench | Reasoning: GPQA | General: Arena Elo

See: `references/model-selection.md`
### Architecture Complexity

1. Single LLM Call (start here; 80% of use cases stop here; sketch below)
2. Sequential Calls (workflows)
3. LLM + Tools (function calling)
4. Agentic System (LLM controls flow)
5. Multi-Agent (only if truly needed)

See: `references/architecture-patterns.md`
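Most work stays at step 1. As a point of reference, a single call with the stack defaults below looks like the following minimal sketch (Vercel AI SDK; the model ID is illustrative, so verify a current one first):

```ts
// Minimal single-call sketch using the Vercel AI SDK.
// The model ID is illustrative; research current models before shipping.
import { generateText } from "ai";
import { openai } from "@ai-sdk/openai";

const { text } = await generateText({
  model: openai("gpt-4o"), // swap in a verified, current model
  system: "You are a concise technical assistant.",
  prompt: "Summarize the trade-offs of RAG vs. fine-tuning in 3 bullets.",
});
console.log(text);
```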
### Vector Storage

- <1M vectors → Postgres pgvector or Convex (query sketch below)
- 1-50M vectors → Postgres with pgvectorscale
- >50M vectors and <10ms p99 → dedicated engine (Qdrant, Weaviate)
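For the pgvector path, similarity search is plain SQL from TypeScript. A minimal sketch using the `pg` client, assuming a hypothetical `documents` table with a `vector(1536)` column (`<=>` is pgvector's cosine-distance operator):

```ts
// Hypothetical pgvector lookup: table `documents(content text, embedding vector(1536))`.
import { Pool } from "pg";

const pool = new Pool(); // connection details come from PG* env vars

async function nearestDocuments(queryEmbedding: number[], limit = 5) {
  // `<=>` is cosine distance in pgvector; smaller means more similar.
  const { rows } = await pool.query(
    `SELECT content, embedding <=> $1::vector AS distance
     FROM documents
     ORDER BY distance
     LIMIT $2`,
    [`[${queryEmbedding.join(",")}]`, limit]
  );
  return rows;
}
```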
## Key Optimizations

- **Prompt Caching**: 60-90% cost reduction. Put static content first.
- **Structured Outputs**: native JSON Schema; zero parsing failures (sketch below).
- **Model Routing**: simple → cheap model, complex → expensive model.
- **Hybrid RAG**: vector + keyword search beats pure vector by 15-25%.

See: `references/prompt-engineering.md`, `references/production-checklist.md`
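To show what structured outputs look like with the stack defaults, here is a minimal sketch using the Vercel AI SDK's `generateObject` with a Zod schema (the schema fields and model ID are illustrative):

```ts
// Structured-output sketch: the SDK enforces the schema, so there is no manual JSON parsing.
import { generateObject } from "ai";
import { openai } from "@ai-sdk/openai";
import { z } from "zod";

const { object } = await generateObject({
  model: openai("gpt-4o"), // verify current models first
  schema: z.object({
    sentiment: z.enum(["positive", "neutral", "negative"]),
    summary: z.string(),
  }),
  prompt: "Classify this review: 'Setup was painless and the docs are great.'",
});
// `object` is typed: { sentiment: "positive" | "neutral" | "negative"; summary: string }
```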
## Stack Defaults (TypeScript/Next.js)

- **SDK**: Vercel AI SDK (streaming, React hooks, provider-agnostic)
- **Provider**: OpenRouter (400+ models, easy A/B testing, fallbacks; wiring sketch below)
- **Vectors**: Postgres pgvector (95% of use cases, $20-50/month)
- **Observability**: Langfuse (self-hostable, generous free tier)
- **Evaluation**: Promptfoo (CI/CD integration, security testing)
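Wiring the SDK and provider defaults together is short. A sketch that points the AI SDK's OpenAI-compatible provider at OpenRouter (the model ID is illustrative; check OpenRouter's catalog for current IDs):

```ts
// OpenRouter exposes an OpenAI-compatible API, so the OpenAI provider
// can be re-pointed at it via baseURL.
import { createOpenAI } from "@ai-sdk/openai";
import { generateText } from "ai";

const openrouter = createOpenAI({
  baseURL: "https://openrouter.ai/api/v1",
  apiKey: process.env.OPENROUTER_API_KEY,
});

const { text } = await generateText({
  model: openrouter("anthropic/claude-3.5-sonnet"), // illustrative ID; verify it's current
  prompt: "Say hello.",
});
```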
## Quality Infrastructure

Production-grade LLM apps need:

### Model Gateway (OpenRouter, LiteLLM)

- Multi-provider access
- Fallback chains (sketch below)
- Cost routing
- See: `llm-gateway-routing` skill
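Fallback chains are the piece teams most often skip. A provider-agnostic sketch (the `withFallback` helper is hypothetical, and the typing is approximate):

```ts
// Hypothetical fallback chain: try models in order until one succeeds.
import { generateText, type LanguageModel } from "ai";

async function withFallback(models: LanguageModel[], prompt: string) {
  let lastError: unknown;
  for (const model of models) {
    try {
      return await generateText({ model, prompt });
    } catch (err) {
      lastError = err; // log to your observability layer here, then try the next model
    }
  }
  throw lastError;
}
```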
### Evaluation & Testing (Promptfoo)

- Regression testing in CI/CD
- Security scanning (red team)
- Quality gates
- See: `llm-evaluation` skill
### Production Observability (Langfuse)

- Full trace debugging (sketch below)
- Cost tracking
- Latency monitoring
- See: `langfuse-observability` skill
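A minimal tracing sketch with the Langfuse TypeScript SDK (trace and generation names are illustrative; keys are read from `.env`):

```ts
// Minimal Langfuse trace: one trace wrapping one generation.
import { Langfuse } from "langfuse";

const langfuse = new Langfuse(); // reads LANGFUSE_PUBLIC_KEY / LANGFUSE_SECRET_KEY

const trace = langfuse.trace({ name: "support-reply" });
const generation = trace.generation({
  name: "draft-reply",
  model: "gpt-4o", // record whatever model you actually called
  input: "Customer asked about refunds",
});
// ...call the LLM here...
generation.end({ output: "Here's our refund policy..." });
await langfuse.flushAsync(); // ensure events are sent before exit
```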
### Quality Audit

- Run the `/llm-gates` command to audit your LLM infrastructure
- Identifies gaps in routing, testing, observability, security, and cost
## Quick Setup

```bash
# Evaluation (Promptfoo)
npx promptfoo@latest init
npx promptfoo@latest eval

# Observability (Langfuse)
pnpm add langfuse
# Sign up at langfuse.com, add keys to .env

# Gateway (OpenRouter)
# Sign up at openrouter.ai, add OPENROUTER_API_KEY to .env
```
## Quality Gate Standards
| Stage | Checks | Time Budget |
|---|---|---|
| Pre-commit | Prompt validation, secrets scan | < 5s |
| Pre-push | Regression suite, cost estimate | < 15s |
| CI/CD | Full eval, security scan, A/B comparison | < 5 min |
| Production | Traces, cost alerts, error monitoring | Continuous |
## Scripts

- `scripts/validate_llm_config.py <dir>` - Scan for LLM anti-patterns
## References

- `references/model-selection.md` - Leaderboards, search strategies, red flags
- `references/prompt-engineering.md` - Caching, structured outputs, CoT, model-specific styles
- `references/architecture-patterns.md` - Complexity ladder, RAG, tool use, caching
- `references/production-checklist.md` - Cost, errors, security, observability, evaluation
## Related Skills

- `llm-evaluation` - Promptfoo setup, CI/CD integration, security testing
- `llm-gateway-routing` - OpenRouter, LiteLLM, routing strategies
- `langfuse-observability` - Tracing, cost tracking, production debugging
## Related Commands

- `/llm-gates` - Audit LLM infrastructure quality across 5 pillars
- `/observe` - General observability audit (includes an LLM section)
## Live Research Tools

Use these BEFORE relying on training data:

- WebSearch: latest model releases, deprecations, best practices
- Exa MCP (`mcp__exa__web_search_exa`): current documentation and code examples
- Gemini CLI (`gemini`): sophisticated reasoning with Google Search grounding
- Provider Playgrounds: OpenRouter, Google AI Studio, Anthropic Console

**Research Flow**:
1. WebSearch for the latest information
2. Exa MCP for documentation and examples
3. Gemini CLI for complex verification and comparison
4. Provider playground for syntax testing