---
name: ai-llm
description: Complete LLM development and engineering skill. Covers strategy selection (prompting vs fine-tuning vs RAG), dataset design, PEFT/LoRA fine-tuning, evaluation workflows, vLLM deployment, and production optimization. Modern best practices for building, evaluating, and scaling LLM systems.
---
# LLM Development & Engineering — Complete Reference
Build, evaluate, and deploy LLM systems with modern production standards.
This skill covers the full LLM lifecycle:
- Development: Strategy selection, dataset design, instruction tuning, PEFT/LoRA fine-tuning
- Evaluation: Automated testing, LLM-as-judge, metrics, rollout gates
- Deployment: vLLM 0.12 (V1 architecture; up to 24x the throughput of naive HF Transformers serving), FP8/FP4 quantization
- Operations: Drift detection, retraining triggers, monitoring
- Safety: Multi-layered defenses, AI-powered guardrails
For detailed patterns, see the Resources and Templates sections below.
## Quick Reference
| Task | Tool/Framework | Command/Pattern | When to Use |
|---|---|---|---|
| RAG Pipeline | LlamaIndex, LangChain | Page-level chunking + hybrid retrieval | Dynamic knowledge, 0.648 accuracy |
| Agentic Workflow | LangGraph, AutoGen, CrewAI | ReAct, multi-agent orchestration | Complex tasks, tool use required |
| Prompt Design | Anthropic, OpenAI guides | CoT, few-shot, structured | Task-specific behavior control |
| Evaluation | LangSmith, W&B, RAGAS | Multi-metric (hallucination, bias, cost) | Quality validation, A/B testing |
| Production Deploy | vLLM 0.12, TensorRT-LLM | FP8/FP4 quantization, PagedAttention v2 | High-throughput serving, cost optimization |
| Monitoring | Arize Phoenix, LangFuse | Drift detection, 18-second response | Production LLM systems |
## Decision Tree: LLM System Architecture
```
Building an LLM application: [Architecture Selection]
├─ Need current knowledge?
│  ├─ Simple Q&A? → Basic RAG (page-level chunking + hybrid retrieval)
│  └─ Complex retrieval? → Advanced RAG (reranking + contextual retrieval)
│
├─ Need tool use / actions?
│  ├─ Single task? → Simple agent (ReAct pattern)
│  └─ Multi-step workflow? → Multi-agent (LangGraph, CrewAI)
│
├─ Static behavior sufficient?
│  ├─ Quick MVP? → Prompt engineering (CI/CD integrated)
│  └─ Production quality? → Fine-tuning (PEFT/LoRA)
│
└─ Best results?
   └─ Hybrid (RAG + Fine-tuning + Agents) → Comprehensive solution
```
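If the tree bottoms out at fine-tuning, a minimal PEFT/LoRA setup looks roughly like the sketch below. The model name, target modules, and hyperparameters are illustrative assumptions, not recommendations:

```python
# Minimal PEFT/LoRA fine-tuning setup (sketch; tune values per task).
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

base = "meta-llama/Llama-3.1-8B-Instruct"  # hypothetical model choice
model = AutoModelForCausalLM.from_pretrained(base)
tokenizer = AutoTokenizer.from_pretrained(base)

lora = LoraConfig(
    r=16,                                  # adapter rank: capacity vs. parameter count
    lora_alpha=32,                         # scaling factor, commonly 2x the rank
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],   # attention projections to adapt
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora)
model.print_trainable_parameters()         # typically well under 1% of base weights
```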
See Decision Matrices for detailed selection criteria.
## When to Use This Skill
Claude should invoke this skill when the user asks about:
- LLM preflight/project checklists, production best practices, or data pipelines
- Building or deploying RAG, agentic, or prompt-based LLM apps
- Prompt design, chain-of-thought (CoT), ReAct, or template patterns
- Troubleshooting LLM hallucination, bias, retrieval issues, or production failures
- Evaluating LLMs: benchmarks, multi-metric eval, or rollout/monitoring
- LLMOps: deployment, rollback, scaling, resource optimization
- Technology stack selection (models, vector DBs, frameworks)
- Production deployment strategies and operational patterns
## Scope Boundaries (Use These Skills for Depth)
- Prompt design & CI/CD → ai-prompt-engineering
- RAG pipelines & chunking → ai-rag
- Search tuning (BM25, HNSW, hybrid) → ai-rag
- Agent architectures & tools → ai-agents
- Serving optimization/quantization → ai-llm-inference
- Production deployment/monitoring → ai-mlops
- Security/guardrails → ai-mlops
## Resources (Best Practices & Operational Patterns)
Comprehensive operational guides with checklists, patterns, and decision frameworks:
### Core Operational Patterns
**Project Planning Patterns** - Stack selection, FTI pipeline, performance budgeting
- AI engineering stack selection matrix
- Feature/Training/Inference (FTI) pipeline blueprint (sketched below)
- Performance budgeting and goodput gates
- Progressive complexity (prompt → RAG → fine-tune → hybrid)
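A minimal sketch of the FTI split referenced above. The stage bodies are placeholders; the point is that stages exchange only versioned artifacts, never in-memory state:

```python
# Hypothetical skeleton of the Feature/Training/Inference (FTI) pipeline split.

def feature_pipeline(raw_docs: list[str]) -> str:
    """Clean, chunk, and embed documents; return a versioned dataset ID."""
    ...

def training_pipeline(dataset_version: str) -> str:
    """Fine-tune (e.g., PEFT/LoRA) against a pinned dataset; return a model version."""
    ...

def inference_pipeline(model_version: str, query: str) -> str:
    """Serve the pinned model version behind an evaluation-gated endpoint."""
    ...
```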
**Production Checklists** - Pre-deployment validation and operational checklists (example rollout gate below)
- LLM lifecycle checklist (modern production standards)
- Data & training, RAG pipeline, deployment & serving
- Safety/guardrails, evaluation, agentic systems
- Reliability & data infrastructure (DDIA-grade)
- Weekly production tasks
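As a small illustration of a pre-deployment gate from the checklists above, rollout can be blocked unless every metric clears its target. The metric names and thresholds here are placeholders to adapt per system:

```python
# Illustrative rollout gate: all eval metrics must clear their targets.
GATES = {
    "faithfulness": 0.90,      # RAG answers grounded in retrieved context
    "answer_relevancy": 0.85,
    "p95_latency_s": 2.0,      # latency gate: lower is better
}

def passes_gates(metrics: dict[str, float]) -> bool:
    ok = metrics["faithfulness"] >= GATES["faithfulness"]
    ok &= metrics["answer_relevancy"] >= GATES["answer_relevancy"]
    ok &= metrics["p95_latency_s"] <= GATES["p95_latency_s"]
    return ok

assert passes_gates({"faithfulness": 0.93, "answer_relevancy": 0.88, "p95_latency_s": 1.4})
```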
**Common Design Patterns** - Copy-paste ready implementation examples
- Chain-of-Thought (CoT) prompting
- ReAct (Reason + Act) pattern (loop sketched below)
- RAG pipeline (minimal to advanced)
- Agentic planning loop
- Self-reflection and multi-agent collaboration
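For example, the ReAct pattern reduces to a short loop. `call_llm` and the tool registry below are hypothetical stand-ins for your client and tools:

```python
# Schematic ReAct (Reason + Act) loop, framework-free.
import re

def react_agent(question: str, tools: dict, call_llm, max_steps: int = 5) -> str:
    transcript = f"Question: {question}\n"
    for _ in range(max_steps):                      # hard cap prevents agentic runaway
        step = call_llm(transcript + "Thought:")    # model emits Thought/Action or Final Answer
        transcript += f"Thought:{step}\n"
        if "Final Answer:" in step:
            return step.split("Final Answer:")[-1].strip()
        match = re.search(r"Action: (\w+)\[(.*?)\]", step)
        if match:
            name, arg = match.groups()
            observation = tools[name](arg)          # execute the chosen tool
            transcript += f"Observation: {observation}\n"
    return "Stopped: step budget exhausted"
```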
**Decision Matrices** - Quick reference tables for selection
- RAG type decision matrix (naive → advanced → modular)
- Production evaluation table with targets and actions
- Model selection matrix (GPT-4, Claude, Gemini, self-hosted)
- Vector database, embedding model, framework selection
- Deployment strategy matrix
**Anti-Patterns** - Common mistakes and prevention strategies
- Data leakage, prompt dilution, RAG context overload
- Agentic runaway, over-engineering, ignoring evaluation
- Hard-coded prompts, missing observability
- Detection methods and prevention code examples (context-budget guard sketched below)
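As one example of a prevention guard, a hard context budget protects against RAG context overload. The budget and scored-chunk inputs below are illustrative:

```python
# Guard against RAG context overload: cap the budget, keep the best chunks.
def select_context(chunks: list[tuple[float, str]], max_chars: int = 8000) -> str:
    picked, used = [], 0
    for score, text in sorted(chunks, reverse=True):  # best-scoring chunks first
        if used + len(text) > max_chars:
            break                                     # stop before diluting the prompt
        picked.append(text)
        used += len(text)
    return "\n\n".join(picked)
```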
### Domain-Specific Patterns
- LLMOps Best Practices - Operational lifecycle and deployment patterns
- Evaluation Patterns - Testing, metrics, and quality validation
- Prompt Engineering Patterns - Quick reference (canonical skill: ai-prompt-engineering)
- Agentic Patterns - Quick reference (canonical skill: ai-agents)
- RAG Best Practices - Quick reference (canonical skill: ai-rag)
Note: Each resource file includes preflight/validation checklists, copy-paste reference tables, inline templates, anti-patterns, and decision matrices.
## Templates (Copy-Paste Ready)
Production templates by use case and technology:
### RAG Pipelines
- Basic RAG - Simple retrieval-augmented generation (sketch below)
- Advanced RAG - Hybrid retrieval, reranking, contextual embeddings
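A minimal basic-RAG sketch using LlamaIndex's `llama_index.core` API. The directory path and query are placeholders, and the default OpenAI LLM/embeddings assume `OPENAI_API_KEY` is set:

```python
# Minimal RAG pipeline: load, index, retrieve, answer.
from llama_index.core import SimpleDirectoryReader, VectorStoreIndex

documents = SimpleDirectoryReader("./docs").load_data()   # load and parse files
index = VectorStoreIndex.from_documents(documents)        # chunk + embed + index
query_engine = index.as_query_engine(similarity_top_k=4)  # retrieve 4 chunks per query
print(query_engine.query("What does the refund policy cover?"))
```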
### Prompt Engineering
- Chain-of-Thought - Step-by-step reasoning pattern (template below)
- ReAct - Reason + Act for tool use
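The CoT template reduces to a few-shot prompt that demonstrates step-by-step reasoning. The worked example below is illustrative; replace it with 2-3 examples from your own task:

```python
# Chain-of-Thought prompt skeleton with one worked few-shot example.
COT_PROMPT = """\
Q: A warehouse has 120 boxes. 30% ship on Monday and half of the rest on Tuesday. How many remain?
A: Let's think step by step.
30% of 120 = 36 ship Monday, leaving 84. Half of 84 = 42 ship Tuesday, leaving 42.
The answer is 42.

Q: {question}
A: Let's think step by step.
"""

prompt = COT_PROMPT.format(question="...your task here...")
```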
### Agentic Workflows
- Reflection Agent - Self-critique and improvement (loop sketched below)
- Multi-Agent - Manager-worker orchestration
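A reflection agent can be sketched as a draft-critique-revise loop. `call_llm` below is a hypothetical completion function; one round is often enough in practice:

```python
# Self-reflection loop: draft, critique, revise.
def reflect_and_revise(task: str, call_llm, rounds: int = 1) -> str:
    draft = call_llm(f"Complete this task:\n{task}")
    for _ in range(rounds):
        critique = call_llm(
            f"Task: {task}\nDraft: {draft}\n"
            "List concrete errors or omissions. Reply 'OK' if none."
        )
        if critique.strip() == "OK":
            break                                   # draft already acceptable
        draft = call_llm(
            f"Task: {task}\nDraft: {draft}\nCritique: {critique}\n"
            "Rewrite the draft fixing every point."
        )
    return draft
```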
### Data Pipelines
- Data Quality - Validation, deduplication, PII detection
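A minimal data-quality pass might combine hash-based exact deduplication with a naive PII scan. The regexes below are deliberately simple; a production pipeline would use a dedicated detector such as Microsoft Presidio:

```python
# Data-quality sketch: exact dedup via hashing plus a naive PII scan.
import hashlib
import re

PII_PATTERNS = [
    re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),        # email addresses
    re.compile(r"\b\d{3}[-.\s]?\d{3}[-.\s]?\d{4}\b"),  # US-style phone numbers
]

def clean(records: list[str]) -> list[str]:
    seen, out = set(), []
    for text in records:
        digest = hashlib.sha256(text.strip().lower().encode()).hexdigest()
        if digest in seen:
            continue                  # drop exact duplicates
        seen.add(digest)
        if any(p.search(text) for p in PII_PATTERNS):
            continue                  # quarantine rather than drop in production
        out.append(text)
    return out
```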
### Deployment
- LLM Deployment - Production deployment with monitoring
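For a first pass at vLLM, the offline batch API is the shortest path. The model name and sampling values below are placeholders; for production serving, `vllm serve <model>` exposes an OpenAI-compatible endpoint instead:

```python
# Minimal vLLM sketch using the offline batch API.
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")   # hypothetical model choice
params = SamplingParams(temperature=0.7, max_tokens=256)
outputs = llm.generate(["Summarize PagedAttention in one sentence."], params)
print(outputs[0].outputs[0].text)
```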
### Evaluation
- Multi-Metric Evaluation - Comprehensive testing suite
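A multi-metric suite often centers on an LLM-as-judge scorer like the sketch below. `call_llm` is a hypothetical client, and a real suite should also track cost and latency:

```python
# LLM-as-judge sketch for multi-metric evaluation.
import json

JUDGE_PROMPT = """Rate the answer on each criterion from 1-5.
Question: {q}
Context: {ctx}
Answer: {a}
Return JSON: {{"faithfulness": n, "relevance": n, "completeness": n}}"""

def judge(q: str, ctx: str, a: str, call_llm) -> dict[str, int]:
    raw = call_llm(JUDGE_PROMPT.format(q=q, ctx=ctx, a=a))
    return json.loads(raw)  # add retry/validation for malformed JSON in production
```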
## Shared Utilities (Centralized patterns — extract, don't duplicate)
- ../_shared/utilities/llm-utilities.md — Token counting, streaming, cost estimation
- ../_shared/utilities/error-handling.md — Effect Result types, correlation IDs
- ../_shared/utilities/resilience-utilities.md — p-retry v6, circuit breaker for LLM API calls
- ../_shared/utilities/logging-utilities.md — pino v9 + OpenTelemetry integration
- ../_shared/utilities/observability-utilities.md — OpenTelemetry SDK, tracing, metrics
- ../_shared/utilities/config-validation.md — Zod 3.24+, secrets management for API keys
- ../_shared/utilities/testing-utilities.md — Test factories, fixtures, mocks
- ../_shared/resources/code-quality-operational-playbook.md — Canonical coding rules & LLM code review
## Related Skills
This skill integrates with complementary Claude Code skills:
### Core Dependencies
- ai-rag - Advanced RAG patterns, chunking strategies, hybrid retrieval, and reranking; also covers search optimization (BM25 tuning, vector search, ranking pipelines)
- ai-prompt-engineering - Systematic prompt design, evaluation, testing, and optimization
- ai-agents - Agent architectures, tool use, multi-agent systems, autonomous workflows
### Production & Operations
- ai-llm (this skill) - Model training, fine-tuning, dataset creation, instruction tuning
- ai-llm-inference - Production serving, quantization, batching, GPU optimization
- ai-mlops - Deployment patterns, monitoring, drift detection, and API design; also security guardrails, prompt injection defense, and privacy protection
## External Resources
See data/sources.json for 50+ curated authoritative sources:
- Official LLM platform docs - OpenAI, Anthropic, Gemini, Mistral, Azure OpenAI, AWS Bedrock
- Open-source models and frameworks - HuggingFace Transformers, LLaMA, vLLM 0.12 (V1 architecture, PyTorch 2.9), PEFT/LoRA, DeepSpeed
- RAG frameworks and vector DBs - LlamaIndex, LangChain 1.1+, LangGraph, LangGraph Studio v2, Haystack, Pinecone, Qdrant, Chroma
- 2025 Agentic frameworks - Anthropic Agent SDK, AutoGen, CrewAI, LangGraph Multi-Agent, Semantic Kernel
- 2025 RAG innovations - Microsoft GraphRAG (knowledge graphs), Pathway (real-time), hybrid retrieval
- Prompt engineering - Anthropic Prompt Library, Prompt Engineering Guide, CoT/ReAct patterns
- Evaluation and monitoring - OpenAI Evals, HELM, Anthropic Evals, LangSmith, W&B, Arize Phoenix
- Production deployment - LiteLLM, Ollama, RunPod, Together AI, vLLM serving
## Usage
### For New Projects
- Start with Production Checklists - Validate all pre-deployment requirements
- Use Decision Matrices - Select technology stack
- Reference Project Planning Patterns - Design FTI pipeline
- Implement with Common Design Patterns - Copy-paste code examples
- Avoid Anti-Patterns - Learn from common mistakes
### For Troubleshooting
- Check Anti-Patterns - Identify failure modes and mitigations
- Use Decision Matrices - Evaluate if architecture fits use case
- Reference Common Design Patterns - Verify implementation correctness
### For Ongoing Operations
- Follow Production Checklists - Weekly operational tasks
- Integrate Evaluation Patterns - Continuous quality monitoring
- Apply LLMOps Best Practices - Deployment and rollback procedures
## Navigation Summary
Quick Decisions: Decision Matrices | Pre-Deployment: Production Checklists | Planning: Project Planning Patterns | Implementation: Common Design Patterns | Troubleshooting: Anti-Patterns
Domain Depth: LLMOps | Evaluation | Prompts | Agents | RAG
Templates: templates/ - Copy-paste ready production code
Sources: data/sources.json - Authoritative documentation links