ai-llm-ops-inference

@vasilyu1983/AI-Agents-public


SKILL.md

name: ai-llm-ops-inference
description: Operational patterns for LLM inference (recent advances): vLLM with 24x throughput gains, FP8/FP4 quantization (30-50% cost reduction), FlashInfer kernels, advanced fusions, PagedAttention, continuous batching, model compression, speculative decoding, and GPU/CPU scheduling. Emphasizes production-ready performance and cost optimization.

LLMOps – Inference & Optimization – Production Skill Hub

Modern Best Practices: vLLM optimizations (24x throughput), FP8/FP4 quantization (30-50% cost reduction), FlashInfer integration, PagedAttention, continuous batching, and production serving patterns.

This skill provides production-ready operational patterns for optimizing LLM inference performance, cost, and reliability. It centralizes decision rules, optimization strategies, configuration templates, and operational checklists for inference workloads.

No theory. No narrative. Only what Claude can execute.


When to Use This Skill

Claude should activate this skill whenever the user asks for help with:

  • Optimizing LLM inference latency or throughput
  • Choosing quantization strategies (FP8/FP4/INT8/INT4)
  • Configuring vLLM, TensorRT-LLM, or DeepSpeed inference
  • Scaling LLM inference across GPUs (tensor/pipeline parallelism)
  • Building high-throughput LLM APIs
  • Improving context window performance (KV cache optimization)
  • Using speculative decoding for faster generation
  • Reducing cost per token
  • Profiling and benchmarking inference workloads
  • Planning infrastructure capacity
  • CPU/edge deployment patterns
  • High availability and resilience patterns

Scope Boundaries (Use These Skills for Depth)


Quick Reference

| Task | Tool/Framework | Command/Pattern | When to Use |
|---|---|---|---|
| Max throughput | vLLM | Continuous batching + PagedAttention | API serving, 24x throughput gain |
| Cost optimization | FP8/FP4 quantization | LLM Compressor, TensorRT Model Optimizer | 30-50% cost reduction, ~99% accuracy retention |
| GPU inference | TensorRT-LLM | Custom kernels, FlashInfer | Kernel-level optimization needed |
| CPU inference | llama.cpp, GGUF | Q4_K_M, Q8_0 formats | Edge devices, no GPU available |
| Multi-GPU | vLLM, DeepSpeed | Tensor parallelism, pipeline parallelism | Large models, distributed serving |
| Latency optimization | Speculative decoding | Draft model + target model | 2-5x speedup, quality preserved |
| Long context | PagedAttention + FlashAttention-2 | KV cache optimization | >8k token contexts |
| Memory optimization | KV cache quantization | FP8 KV cache | Fit larger batches / longer contexts |
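
For the vLLM rows above, a minimal offline-batching sketch using the Python `LLM` API is shown below. The model ID, sampling values, and memory settings are illustrative assumptions; continuous batching with PagedAttention is handled by the engine itself.

```python
# Minimal vLLM offline-batching sketch. Model name and sampling values are
# illustrative assumptions; continuous batching and PagedAttention are applied
# automatically by the engine.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # assumed model ID
    gpu_memory_utilization=0.90,               # reserve most VRAM for weights + KV cache
    max_model_len=8192,                        # cap context to bound KV cache size
)

sampling = SamplingParams(temperature=0.7, top_p=0.9, max_tokens=256)

prompts = [
    "Summarize PagedAttention in one sentence.",
    "List three ways to cut LLM serving cost.",
]

# generate() schedules all prompts through the continuous-batching engine.
for output in llm.generate(prompts, sampling):
    print(output.outputs[0].text)
```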

Decision Tree: Inference Optimization Strategy

Need to optimize LLM inference: [Optimization Path]
    ├─ Primary constraint: Throughput?
    │   ├─ GPU available? → vLLM (24x gain, continuous batching)
    │   └─ CPU only? → llama.cpp + GGUF (Q4_K_M format)
    │
    ├─ Primary constraint: Cost?
    │   ├─ GPU serving? → FP8/FP4 quantization (30-50% reduction)
    │   └─ Needs quality? → FP8 + calibration (99% accuracy retention)
    │
    ├─ Primary constraint: Latency?
    │   ├─ Can use draft model? → Speculative decoding (2-5x speedup)
    │   └─ Long context? → FlashAttention-2 + PagedAttention
    │
    ├─ Large model (>70B)?
    │   ├─ Multiple GPUs? → Tensor parallelism (NVLink required)
    │   └─ Deep model? → Pipeline parallelism (minimize bubbles)
    │
    └─ Edge deployment?
        └─ CPU + quantization (GGUF Q4_K_M) → Optimized for constrained resources
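
For the CPU/edge branch of the tree, the sketch below loads a Q4_K_M GGUF file through the llama-cpp-python bindings. The model path, context size, and thread count are assumptions to adapt to the target hardware.

```python
# Minimal CPU inference sketch with llama-cpp-python and a Q4_K_M GGUF model.
# The model path, context size, and thread count are illustrative assumptions.
from llama_cpp import Llama

llm = Llama(
    model_path="./models/llama-3.1-8b-instruct.Q4_K_M.gguf",  # assumed local path
    n_ctx=4096,       # context window to allocate
    n_threads=8,      # match physical CPU cores
    n_gpu_layers=0,   # pure CPU; raise if a small GPU is available
)

result = llm(
    "Explain KV cache quantization in two sentences.",
    max_tokens=128,
    temperature=0.7,
)
print(result["choices"][0]["text"])
```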

Modern Best Practices (Current Standards)

vLLM Optimizations

  • 24x higher throughput vs HuggingFace Transformers
  • Continuous batching with PagedAttention
  • FlashInfer library integration (NVIDIA collaboration)
  • FP8 attention kernels and FP4 GEMMs
  • Advanced kernel fusions (AllReduce + RMSNorm + quantization)
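
A hedged configuration sketch combining these features (FP8 checkpoint, FP8 KV cache, tensor parallelism): the checkpoint name is assumed, the FP8 paths require a recent vLLM build on Hopper/Ada-class GPUs, and argument names may differ across versions.

```python
# Engine configuration sketch for a quantized vLLM deployment.
# Checkpoint name is hypothetical; FP8 paths require recent vLLM and
# Hopper/Ada-class GPUs, and argument values may differ by version.
from vllm import LLM

llm = LLM(
    model="neuralmagic/Meta-Llama-3.1-8B-Instruct-FP8",  # assumed FP8 checkpoint
    kv_cache_dtype="fp8",        # quantize KV cache to fit larger batches / longer contexts
    tensor_parallel_size=2,      # split weights across 2 GPUs (NVLink preferred)
    gpu_memory_utilization=0.92,
    max_num_seqs=256,            # upper bound on concurrently batched requests
)
```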

Quantization Performance

  • FP8/FP4: 30-50% cost reduction, ~99% accuracy retention
  • FP4 on large models: 40-50% performance boost (Qwen3-32B)
  • GPTQ optimization: 2-3x faster throughput vs default vLLM config
  • BF16 → FP8/INT8: ~30% improvement in cost-per-token
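
To make these percentages concrete, the back-of-the-envelope sketch below converts a throughput gain into cost per million tokens. The GPU hourly rate and throughput figures are placeholder assumptions, not measured benchmarks.

```python
# Back-of-the-envelope cost-per-token math. All inputs are placeholder
# assumptions; substitute measured throughput and your actual GPU pricing.
GPU_HOURLY_COST = 2.50          # assumed $/GPU-hour
BASELINE_TOK_PER_SEC = 2_400    # assumed BF16 decode throughput per GPU
QUANTIZED_TOK_PER_SEC = 3_400   # assumed FP8 decode throughput per GPU (~40% gain)

def cost_per_million_tokens(tokens_per_sec: float, hourly_cost: float) -> float:
    tokens_per_hour = tokens_per_sec * 3600
    return hourly_cost / tokens_per_hour * 1_000_000

baseline = cost_per_million_tokens(BASELINE_TOK_PER_SEC, GPU_HOURLY_COST)
quantized = cost_per_million_tokens(QUANTIZED_TOK_PER_SEC, GPU_HOURLY_COST)
print(f"BF16: ${baseline:.3f} / 1M tokens")
print(f"FP8:  ${quantized:.3f} / 1M tokens")
print(f"Cost-per-token saving: {1 - quantized / baseline:.0%}")
```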

Key Tools

  • LLM Compressor (vLLM project) - quantization and compression library that produces checkpoints optimized for vLLM serving
  • NVIDIA TensorRT Model Optimizer - post-training quantization (PTQ) framework with HuggingFace checkpoint export
  • vLLM + NVIDIA collaboration - FlashInfer kernels, FP8/FP4 support
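
As an example of the first tool, the sketch below applies dynamic FP8 weight quantization with LLM Compressor and writes a vLLM-loadable checkpoint. The import paths, the `FP8_DYNAMIC` scheme name, and the model ID reflect early llm-compressor releases and should be treated as assumptions to verify against the current documentation.

```python
# FP8 dynamic quantization sketch with LLM Compressor (llm-compressor).
# Import paths and scheme names are assumptions based on early releases;
# verify against the current llm-compressor documentation before use.
from llmcompressor.modifiers.quantization import QuantizationModifier
from llmcompressor.transformers import oneshot

recipe = QuantizationModifier(
    targets="Linear",          # quantize all Linear layers
    scheme="FP8_DYNAMIC",      # FP8 weights, dynamic per-token activation scales
    ignore=["lm_head"],        # keep the output head in higher precision
)

oneshot(
    model="meta-llama/Llama-3.1-8B-Instruct",        # assumed base checkpoint
    recipe=recipe,
    output_dir="Llama-3.1-8B-Instruct-FP8-Dynamic",  # load this path with vLLM
)
```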

Resources (Detailed Operational Guides)

For comprehensive guides on specific topics, see:

Infrastructure & Serving

Performance Optimization

Deployment & Operations


Templates

Inference Configs

Production-ready configuration templates for leading inference engines:

Quantization & Compression

Model compression templates for reducing memory and cost:

Serving Pipelines

High-throughput serving architectures:

Caching & Batching

Performance optimization templates:

Benchmarking

Performance measurement and validation:
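
As a minimal starting point, the sketch below probes time-to-first-token and a rough decode rate against an OpenAI-compatible endpoint such as a local vLLM server. The base URL, model name, and the chunks-as-tokens proxy are assumptions.

```python
# Minimal latency probe against an OpenAI-compatible endpoint (e.g. local vLLM).
# Base URL, model name, and the chunk-count token proxy are assumptions.
import time

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")  # assumed local server

start = time.perf_counter()
first_token_at = None
chunks = 0

stream = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",  # assumed served model name
    messages=[{"role": "user", "content": "Explain continuous batching in two sentences."}],
    max_tokens=256,
    stream=True,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        if first_token_at is None:
            first_token_at = time.perf_counter()  # first streamed token observed
        chunks += 1
elapsed = time.perf_counter() - start

if first_token_at is not None:
    ttft = first_token_at - start
    print(f"Time to first token: {ttft:.3f}s")
    print(f"Approx. decode rate: {chunks / max(elapsed - ttft, 1e-6):.1f} chunks/s")
```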

Navigation

Resources

Templates

Data


Related Skills

This skill focuses on inference-time performance. For related workflows:


External Resources

See data/sources.json for:

  • Serving frameworks (vLLM, TensorRT-LLM, DeepSpeed-MII)
  • Quantization libraries (GPTQ, AWQ, bitsandbytes, LLM Compressor)
  • Attention kernel libraries (FlashAttention, FlashInfer, xFormers)
  • GPU hardware guides and optimization docs
  • Benchmarking frameworks and tools

Use this skill whenever the user needs LLM inference performance, cost reduction, or serving architecture guidance.