---
name: ai-llm-development
description: "Operational patterns, templates, and decision rules for LLM development (modern best practices): strategy selection (prompting vs PEFT/LoRA vs RAG), dataset design and formatting, instruction tuning, and evaluation workflows. Leans on specialized skills for prompts, RAG, agents, inference, and safety."
---
LLM Development – Production Skill Hub
Modern Best Practices: PEFT/LoRA fine-tuning, combined approaches (fine-tuning + prompting + RAG), continuous evaluation, and CI/CD integration for prompt versioning.
This skill provides actionable workflows for developing LLM-powered systems, with a focus on execution rather than theory.
When to Use This Skill
Claude should invoke this skill when the user asks for help with LLM development, including:
- "Help me design an instruction dataset / SFT corpus."
- "Format my dataset for fine-tuning."
- "Create an evaluation set for my model."
- "Build a data transformation pipeline for SFT."
- "Choose prompting vs tuning vs RAG for this use case."
- "Set up PEFT/LoRA fine-tuning."
- "Implement RLHF or DPO alignment."
- "Build evaluation framework with LLM-as-judge."
Scope Boundaries (Use These Skills for Depth)
- Prompt design / structured outputs → ai-prompt-engineering
- Retrieval / RAG → ai-llm-rag-engineering
- Search tuning (BM25/HNSW/hybrid) → ai-llm-search-retrieval
- Serving, optimization, quantization → ai-llm-ops-inference
- Security, safety, policy, governance → ai-ml-ops-security
- Deployment, APIs → ai-ml-ops-production
Quick Reference
| Task | Tool/Framework | Command/Pattern | When to Use |
|---|---|---|---|
| Prompt iteration | Claude/GPT + CI/CD | Version control + automated tests | MVPs, internal tools, fast iteration needed |
| Fine-tuning (efficient) | PEFT/LoRA | peft.LoraConfig(r=8, lora_alpha=16) | Production with sufficient data, highest performance |
| Dataset formatting | JSON/JSONL | Instruction/Chat/Transform formats | Preparing SFT data, consistent structure |
| Evaluation | LangSmith, W&B | Golden sets + custom rubrics | Comparing prompts/models, quality validation |
| RLHF/alignment | DPO/ORPO | Preference data + policy optimization | Safety requirements, tone control |
Decision Tree: LLM Development Strategy
Need LLM-powered feature: [Strategy Selection]
├─ MVP/Prototype?
│ └─ Prompt Engineering → Fast deployment, minimal setup
│
├─ Production + Sufficient Data?
│ ├─ Static knowledge? → Fine-Tuning (PEFT/LoRA) → Highest performance
│ └─ Dynamic knowledge? → RAG → Current information access
│
├─ Cold-start (No Data)?
│   └─ Few-shot Prompting → Hand-written examples stand in for training data
│
└─ Best Results?
└─ Hybrid (Fine-tuning + Prompting + RAG) → Optimal outcomes
Key Insight: Combining techniques yields best results. PEFT/LoRA is now standard for efficient fine-tuning.
Modern Best Practices (2024-2025)
PEFT/LoRA (Now Standard)
- Trains only a small fraction of parameters while approaching full fine-tuning quality
- LoRA rank 8-32, alpha 16-64
- 4-bit quantization (QLoRA) for VRAM savings
- Save adapters separately for efficient deployment (see the sketch below)
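A minimal sketch of this setup with transformers + peft; the base model id, target modules, and hyperparameters are illustrative placeholders, not a pinned recipe:

```python
# Sketch: QLoRA fine-tuning setup (4-bit base + LoRA adapter).
# Model id and target modules are examples; adjust per architecture.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# 4-bit NF4 quantization (QLoRA) to cut VRAM requirements
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B",          # example base model
    quantization_config=bnb_config,
    device_map="auto",
)
model = prepare_model_for_kbit_training(model)

# LoRA adapter within the ranges above (r=8-32, alpha=16-64)
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()       # typically well under 1% trainable

# After training, save only the adapter for lightweight deployment:
# model.save_pretrained("out/adapter")
```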
Iteration-First Prompting
- Start simple, refine iteratively
- CI/CD integration with automated tests (regression-test sketch after this list)
- Version control with systematic tracking
- Clear, concise prompts with reasoning steps
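As a sketch, one shape such an automated test can take in CI; `complete()` and the repo layout are hypothetical stand-ins for your own client and prompt store:

```python
# Sketch: prompt regression test for CI (pytest). The file layout and
# label set are illustrative; `complete()` is a hypothetical API wrapper.
import json
from pathlib import Path

def complete(prompt: str) -> str:
    """Hypothetical call into your LLM provider; swap in your client."""
    raise NotImplementedError

def load_prompt(name: str, version: str) -> str:
    # Prompts are versioned files in the repo, e.g. prompts/classify/v3.txt
    return Path(f"prompts/{name}/{version}.txt").read_text()

def test_classify_prompt_stays_on_label_set():
    prompt = load_prompt("classify", "v3")
    cases = json.loads(Path("evals/classify_golden.json").read_text())
    for case in cases:
        output = complete(prompt.format(input=case["input"]))
        # Regression gate: every golden case must yield an allowed label
        assert output.strip() in {"positive", "negative", "neutral"}, (
            f"unexpected label for {case['input']!r}: {output!r}"
        )
```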
Combined Approaches
- Fine-tuned model + prompt engineering + RAG = best results (composition sketched below)
- Layer techniques based on use case
- Evaluate each component separately
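One way the layers can compose, sketched under assumptions: a LoRA adapter saved by an SFT run (path reused from the sketch above), a grounding prompt, and a hypothetical `retrieve()` hook into your RAG stack:

```python
# Sketch: hybrid answer path = fine-tuned adapter + engineered prompt +
# retrieved context. `retrieve()` is a hypothetical RAG hook; see
# ai-llm-rag-engineering for real retrieval patterns.
from peft import AutoPeftModelForCausalLM
from transformers import AutoTokenizer

model = AutoPeftModelForCausalLM.from_pretrained("out/adapter", device_map="auto")
tokenizer = AutoTokenizer.from_pretrained("out/adapter")

def retrieve(query: str) -> list[str]:
    """Hypothetical retrieval hook returning top-k passages."""
    raise NotImplementedError

def answer(query: str) -> str:
    context = "\n\n".join(retrieve(query))
    prompt = (
        "Answer using only the CONTEXT. If the answer is absent, say so.\n\n"
        f"CONTEXT:\n{context}\n\nQUESTION: {query}\nANSWER:"
    )
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    out = model.generate(**inputs, max_new_tokens=256)
    new_tokens = out[0][inputs["input_ids"].shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True)
```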
Continuous Evaluation
- LLM-as-judge for scalability (judge sketch after this list)
- Regression suites in CI/CD
- Production monitoring and drift detection
- A/B testing framework
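A minimal LLM-as-judge sketch using the Anthropic SDK; the rubric wording, score scale, and model id are illustrative, and judge scores should be spot-checked against human labels before they gate anything:

```python
# Sketch: single-dimension LLM-as-judge. Rubric and scale are examples;
# calibrate against human labels before trusting in CI.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

JUDGE_PROMPT = """Rate the RESPONSE to the TASK for factual correctness
on a 1-5 scale. Reply with only the integer.

TASK: {task}
RESPONSE: {response}"""

def judge_correctness(task: str, response: str) -> int:
    msg = client.messages.create(
        model="claude-sonnet-4-5",  # example model id; pin one per eval run
        max_tokens=8,
        messages=[{
            "role": "user",
            "content": JUDGE_PROMPT.format(task=task, response=response),
        }],
    )
    return int(msg.content[0].text.strip())
```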
Navigation: Resources (Best Practices & Operational Guides)
Operational guides with checklists, patterns, and actionable workflows:
Core Development Patterns
Prompt Engineering Patterns - resources/prompt-engineering-patterns.md
- 12 production-tested prompt patterns (Instruction, Few-Shot, JSON, Multi-Step, etc.)
- Modern iteration-first workflow with CI/CD integration
- Failure mode debugging guide
- Prompt library structure and version control
Fine-Tuning Recipes - resources/fine-tuning-recipes.md
- Strategy selection matrix (Prompting vs Fine-tuning vs RAG)
- SFT workflow with PEFT/LoRA (modern standard)
- Instruction tuning and dataset composition
- Safety requirements and validation
- Production feedback loops
Dataset Formatting Guide - resources/dataset-formatting-guide.md
- Instruction, chat, and transformation formats
- Dataset hygiene rules and quality control
- Deduplication and validation workflows (hygiene pass sketched below)
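A sketch of such a hygiene pass over an instruction-format JSONL file, assuming the common instruction/input/output field names (adjust to your corpus):

```python
# Sketch: JSONL hygiene pass = schema check + exact-duplicate removal.
# Field names assume the common instruction format; adjust as needed.
import hashlib
import json
from pathlib import Path

REQUIRED = {"instruction", "output"}  # "input" is optional

def clean_sft_file(src: str, dst: str) -> None:
    seen, kept, dropped = set(), 0, 0
    with Path(src).open() as fin, Path(dst).open("w") as fout:
        for line in fin:
            row = json.loads(line)
            if not REQUIRED <= row.keys() or not str(row["output"]).strip():
                dropped += 1  # schema violation or empty target
                continue
            key = hashlib.sha256(
                (row["instruction"] + row.get("input", "")).encode()
            ).hexdigest()
            if key in seen:
                dropped += 1  # exact prompt duplicate
                continue
            seen.add(key)
            fout.write(json.dumps(row, ensure_ascii=False) + "\n")
            kept += 1
    print(f"kept={kept} dropped={dropped}")

# Accepted record shape:
# {"instruction": "Summarize the ticket.", "input": "...", "output": "..."}
```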
Evaluation Rubrics - resources/evaluation-rubrics.md
- 6-dimension evaluation framework (Correctness, Completeness, Format, Clarity, Conciseness, Safety)
- LLM-as-judge setup and validation
- A/B testing methodology with statistical significance (significance check sketched after this list)
- Regression testing and CI/CD integration
- Production monitoring metrics
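For the significance piece, a sketch using a chi-square test over pairwise win counts (the counts are invented for illustration):

```python
# Sketch: is variant B's win rate over variant A more than noise?
# Counts below are illustrative.
from scipy.stats import chi2_contingency

a_wins, a_total = 112, 200   # variant A wins on pairwise judgments
b_wins, b_total = 141, 200   # variant B

table = [[a_wins, a_total - a_wins],
         [b_wins, b_total - b_wins]]
chi2, p_value, _, _ = chi2_contingency(table)

print(f"A={a_wins / a_total:.1%}  B={b_wins / b_total:.1%}  p={p_value:.4f}")
if p_value < 0.05:
    print("Significant difference; promote the winner")
else:
    print("Not significant; collect more judgments before deciding")
```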
Advanced Patterns
Advanced LLM Patterns - resources/advanced-llm-patterns.md
- RLHF / DPO / ORPO alignment workflows (preference-pair sketch after this list)
- Pretraining path (tokenizer → corpus → schedule)
- Task-specific tuning (classification, embeddings, multimodal)
- Context engineering best practices
- Production monitoring and observability
- Synthetic data generation
- Model compression and distillation
- Multi-task learning
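For the DPO workflow, a sketch of the preference-pair data shape wired into trl; argument names vary across trl versions, and the adapter path reuses the earlier LoRA sketch:

```python
# Sketch: DPO on preference pairs with trl, starting from an SFT adapter.
# Treat this as a shape, not a pinned recipe; trl argument names have
# changed across releases.
from datasets import Dataset
from peft import AutoPeftModelForCausalLM
from transformers import AutoTokenizer
from trl import DPOConfig, DPOTrainer

model = AutoPeftModelForCausalLM.from_pretrained("out/adapter")
tokenizer = AutoTokenizer.from_pretrained("out/adapter")

# Each row pairs one prompt with a preferred and a rejected completion
pairs = Dataset.from_list([
    {
        "prompt": "Explain a LoRA adapter in one sentence.",
        "chosen": "A LoRA adapter is a small low-rank weight update trained "
                  "on top of a frozen base model.",
        "rejected": "LoRA is a long-range radio protocol.",
    },
    # ... more preference pairs
])

trainer = DPOTrainer(
    model=model,
    ref_model=None,  # with a PEFT model, the frozen base acts as reference
    args=DPOConfig(output_dir="out/dpo", beta=0.1),
    train_dataset=pairs,
    processing_class=tokenizer,  # named `tokenizer=` in older trl releases
)
trainer.train()
```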
Each resource includes:
- Execution-ready checklists
- Copy-paste templates
- "What can go wrong" scenarios
- Validation criteria
Navigation: Templates (Copy-Paste Ready)
Production templates organized by use case:
Prompt Templates
- Task Prompt - General-purpose structured LLM task
- JSON-Only Prompt - Strict JSON output enforcement (shape sketched below)
- Transformation Prompt - Classification, extraction, rewriting
- System Role Prompt - Assistant behavior and policy
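As an illustration of the JSON-only shape (schema and wording are examples, not the shipped template):

```python
# Illustrative JSON-only prompt; doubled braces escape literal JSON
# so the template works with str.format().
JSON_ONLY_PROMPT = """Return ONLY a JSON object matching this schema, no prose:
{{"label": "positive" | "negative" | "neutral", "confidence": number}}

TEXT:
{text}"""

# Usage: prompt = JSON_ONLY_PROMPT.format(text=document)
# Parse the reply with json.loads(); on failure, retry or repair.
```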
Fine-Tuning Templates
- SFT Dataset - Supervised fine-tuning format (JSONL)
- Instruction Dataset - Instruction tuning format with chat messages (example record below)
- Training Configuration - Hyperparameters, PEFT/LoRA settings
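For orientation, one chat-format record shown as a Python dict; serialize one object per line in the JSONL file (the "messages" field name follows the common chat convention):

```python
# Example chat-format SFT record; one JSON object per JSONL line.
record = {
    "messages": [
        {"role": "system", "content": "You are a concise support assistant."},
        {"role": "user", "content": "How do I reset my API key?"},
        {"role": "assistant", "content": "Open Settings -> API Keys, revoke "
                                         "the old key, then generate a new one."},
    ]
}
```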
Evaluation Templates
- Evaluation Set - Test suite structure for prompt and model quality
Related Skills
This skill focuses on LLM development (prompting, fine-tuning, datasets). For related workflows:
LLM Engineering & Production: → ai-llm-engineering
- Comprehensive LLMOps lifecycle (data, training, deployment, monitoring)
- RAG pipelines (page-level chunking, hybrid retrieval)
- Agentic orchestration (reflection, multi-agent)
- Production evaluation frameworks
Prompt Engineering Fundamentals: → ai-prompt-engineering
- Core prompting techniques (few-shot, chain-of-thought, ReAct)
- Prompt design patterns and best practices
- Structured prompting for agents
RAG (Retrieval-Augmented Generation): → ai-llm-rag-engineering
- Chunking strategies (basic, code, long-doc)
- Context packing and grounding patterns
- Hybrid search and reranking
- RAG evaluation frameworks
Search & Retrieval: → ai-llm-search-retrieval
- BM25 tuning and vector search patterns
- Hybrid fusion (RRF, weighted)
- Query rewriting and ranking pipelines
LLM Inference & Optimization: → ai-llm-ops-inference
- vLLM, TensorRT-LLM, DeepSpeed configurations
- Quantization (GPTQ, AWQ, GGUF)
- Batching, caching, speculative decoding
- Serving architectures for high throughput
Production Deployment: → ai-ml-ops-production
- API service deployment patterns
- Monitoring and drift detection
- Data ingestion pipelines (dlt)
- Batch processing workflows
Security & Safety: → ai-ml-ops-security
- Prompt injection mitigation
- Jailbreak defense strategies
- Privacy protection (PII handling, anonymization)
- Guardrail configurations and output filtering
Use Case Decision:
- Building/refining prompts or fine-tuning datasets → Use this skill (ai-llm-development)
- Production LLM systems with RAG/agents → Use ai-llm-engineering
- Optimizing inference performance → Use ai-llm-ops-inference
- Implementing RAG → Use ai-llm-rag-engineering
- Security & compliance → Use ai-ml-ops-security
External Resources
See data/sources.json for 55+ curated resources including:
- LLM Platforms: OpenAI, Anthropic Claude, Google Gemini, Cohere, Together AI
- Open-Source Models: Hugging Face Hub, Llama, Mistral, Qwen, Phi
- Fine-Tuning: Transformers, PEFT, LoRA/QLoRA, Axolotl, LLaMA-Factory
- Prompt Engineering: Anthropic Prompt Library, OpenAI guides, LangChain Hub
- Evaluation: EleutherAI LM Eval, HELM, OpenAI Evals, LangSmith, PromptFoo
- Data Preparation: Hugging Face Datasets, Common Crawl, The Pile, RedPajama
- Quantization: bitsandbytes, GPTQ, AWQ, GGUF
- Synthetic Data: Distilabel, Argilla, LLM Blender
- Benchmarks: MMLU, HumanEval, BigBench, Open LLM Leaderboard, AlpacaEval, MT-Bench
Usage Notes for Claude
- Modern Standards: Default to PEFT/LoRA for fine-tuning, iteration-first for prompting, continuous evaluation
- Use resources for detailed operational guidance
- Use templates when the user asks for structured artifacts
- Reference data/sources.json for authoritative documentation links
- Never include theoretical explanations; only operational steps
Key Modern Migrations:
- Full fine-tuning → PEFT/LoRA (parameter-efficient)
- One-shot prompting → Iterative refinement + CI/CD
- Manual evaluation → Automated + LLM-as-judge + regression suites
- Prompting OR fine-tuning → Combined approaches (hybrid)