Claude Code Plugins

Community-maintained marketplace


Kubernetes AI Expert

@frankxai/ai-architect

Deploy and operate AI workloads on Kubernetes with GPU scheduling, model serving, and MLOps patterns

Install Skill

1. Download skill
2. Enable skills in Claude
   Open claude.ai/settings/capabilities and find the "Skills" section
3. Upload to Claude
   Click "Upload skill" and select the downloaded ZIP file

Note: Please verify the skill by reviewing its instructions before using it.

SKILL.md

name: Kubernetes AI Expert
description: Deploy and operate AI workloads on Kubernetes with GPU scheduling, model serving, and MLOps patterns
version: 1.1.0
last_updated: 2026-01-06
external_version: Kubernetes 1.31
resources: resources/manifests.yaml
triggers: kubernetes, k8s, helm, GPU, model serving

Kubernetes AI Expert

Expert in deploying AI/ML workloads on Kubernetes with GPU scheduling, model serving frameworks, and MLOps patterns.

GPU Workload Scheduling

NVIDIA GPU Operator

helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm install gpu-operator nvidia/gpu-operator

GPU Resource Requests

Resource                   Description
nvidia.com/gpu: N          Request N GPUs
nvidia.com/mig-3g.40gb: 1  Request a single MIG slice (3g.40gb profile)
nvidia.com/gpu.product     Node selector label for targeting a specific GPU model
nvidia.com/gpu             Toleration key for GPU node taints

Full manifests: resources/manifests.yaml
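
A minimal pod sketch combining these pieces; the GPU product label value and the container image are placeholders, so adjust them to your cluster:

apiVersion: v1
kind: Pod
metadata:
  name: gpu-inference
spec:
  nodeSelector:
    nvidia.com/gpu.product: NVIDIA-A100-SXM4-80GB  # placeholder; value comes from GPU Feature Discovery labels
  tolerations:
  - key: nvidia.com/gpu
    operator: Exists
    effect: NoSchedule
  containers:
  - name: inference
    image: nvcr.io/nvidia/pytorch:24.08-py3        # placeholder image
    resources:
      requests:
        nvidia.com/gpu: 1                          # GPUs cannot be overcommitted; keep requests = limits
      limits:
        nvidia.com/gpu: 1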

Model Serving Frameworks

Framework Comparison

Framework   Best For               GPU Support   Scaling
vLLM        High-throughput LLMs   Excellent     HPA/KEDA
Triton      Multi-model serving    Excellent     HPA
TGI         HuggingFace models     Good          HPA

vLLM Deployment

Key configurations:

  • --tensor-parallel-size - Multi-GPU inference
  • --max-model-len - Context window
  • --gpu-memory-utilization - Memory efficiency
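
A minimal Deployment sketch wiring up these flags, assuming the OpenAI-compatible vLLM server image; the model name and image tag are placeholders:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm
spec:
  replicas: 2
  selector:
    matchLabels:
      app: vllm
  template:
    metadata:
      labels:
        app: vllm
    spec:
      containers:
      - name: vllm
        image: vllm/vllm-openai:latest              # pin a specific tag in production
        args:
        - --model=meta-llama/Llama-3.1-8B-Instruct  # placeholder model
        - --tensor-parallel-size=1                  # GPUs used per replica
        - --max-model-len=8192                      # context window
        - --gpu-memory-utilization=0.90             # fraction of GPU memory vLLM may use
        ports:
        - containerPort: 8000                       # OpenAI-compatible HTTP API
        resources:
          limits:
            nvidia.com/gpu: 1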

Triton Inference Server

  • Multi-model serving from S3/GCS
  • HTTP (8000), gRPC (8001), Metrics (8002)
  • Model polling for dynamic updates
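
A minimal Triton Deployment sketch showing the three ports, an S3 model repository, and polling mode; the image tag and bucket are placeholders, and S3 credentials (for example via IRSA or env vars) are assumed to be configured separately:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: triton
spec:
  replicas: 1
  selector:
    matchLabels:
      app: triton
  template:
    metadata:
      labels:
        app: triton
    spec:
      containers:
      - name: triton
        image: nvcr.io/nvidia/tritonserver:24.08-py3  # placeholder tag
        command: ["tritonserver"]
        args:
        - --model-repository=s3://my-bucket/models    # placeholder bucket
        - --model-control-mode=poll                   # pick up new model versions dynamically
        - --repository-poll-secs=60
        ports:
        - containerPort: 8000   # HTTP
        - containerPort: 8001   # gRPC
        - containerPort: 8002   # metrics
        resources:
          limits:
            nvidia.com/gpu: 1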

Text Generation Inference (TGI)

  • HuggingFace native
  • Quantization support (bitsandbytes-nf4)
  • Simple deployment

Deployment manifests: resources/manifests.yaml

Helm Chart Pattern

# values.yaml structure
inference:
  enabled: true
  replicas: 2
  framework: "vllm"  # vllm, tgi, triton
  resources:
    limits:
      nvidia.com/gpu: 1
  autoscaling:
    enabled: true
    minReplicas: 1
    maxReplicas: 10

vectorDB:
  enabled: true
  type: "qdrant"

monitoring:
  enabled: true
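
A sketch of how a chart template might consume these values; the inference.image key is an assumption and is not part of the values shown above:

# templates/inference-deployment.yaml (sketch)
{{- if .Values.inference.enabled }}
apiVersion: apps/v1
kind: Deployment
metadata:
  name: {{ .Release.Name }}-inference
spec:
  replicas: {{ .Values.inference.replicas }}
  selector:
    matchLabels:
      app: {{ .Release.Name }}-inference
  template:
    metadata:
      labels:
        app: {{ .Release.Name }}-inference
    spec:
      containers:
      - name: {{ .Values.inference.framework }}
        image: {{ .Values.inference.image | default "vllm/vllm-openai:latest" }}  # assumed key with fallback
        resources:
          limits:
            nvidia.com/gpu: {{ index .Values.inference.resources.limits "nvidia.com/gpu" }}
{{- end }}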

Auto-Scaling

Horizontal Pod Autoscaler (HPA)

Scale on:

  • GPU utilization (DCGM_FI_DEV_GPU_UTIL)
  • Inference queue length
  • Custom metrics
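
A minimal HPA sketch scaling on GPU utilization, assuming Prometheus Adapter is configured to expose DCGM_FI_DEV_GPU_UTIL as a per-pod custom metric:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: vllm
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: vllm
  minReplicas: 1
  maxReplicas: 10
  metrics:
  - type: Pods
    pods:
      metric:
        name: DCGM_FI_DEV_GPU_UTIL   # requires a Prometheus Adapter rule mapping this metric to pods
      target:
        type: AverageValue
        averageValue: "80"           # target average GPU utilization (%)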

KEDA Event-Driven Scaling

Scale on:

  • Prometheus metrics
  • Message queue depth (RabbitMQ, SQS)
  • Custom external metrics
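
A minimal ScaledObject sketch, assuming KEDA is installed; the Prometheus address and the queue-length query are placeholders to adapt:

apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: vllm
spec:
  scaleTargetRef:
    name: vllm                        # Deployment to scale
  minReplicaCount: 1
  maxReplicaCount: 10
  triggers:
  - type: prometheus
    metadata:
      serverAddress: http://prometheus.monitoring.svc.cluster.local:9090  # placeholder address
      query: sum(inference_queue_length)                                  # hypothetical queue metric
      threshold: "10"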

HPA/KEDA configs: resources/manifests.yaml

Networking

Ingress Configuration

  • Rate limiting (nginx annotations)
  • TLS with cert-manager
  • Large body size for AI payloads
  • Extended timeouts (300s+)
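
An Ingress sketch applying these points, assuming the NGINX ingress controller and a cert-manager ClusterIssuer named letsencrypt-prod; the hostname and backend service are placeholders:

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: inference-gateway
  annotations:
    cert-manager.io/cluster-issuer: letsencrypt-prod
    nginx.ingress.kubernetes.io/limit-rps: "10"            # rate limiting
    nginx.ingress.kubernetes.io/proxy-body-size: "50m"     # large AI payloads
    nginx.ingress.kubernetes.io/proxy-read-timeout: "300"  # long-running generations
    nginx.ingress.kubernetes.io/proxy-send-timeout: "300"
spec:
  ingressClassName: nginx
  tls:
  - hosts: [api.example.com]
    secretName: inference-tls
  rules:
  - host: api.example.com
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: vllm
            port:
              number: 8000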

Network Policies

  • Restrict pod-to-pod communication
  • Allow only gateway → inference
  • Permit DNS egress
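
A NetworkPolicy sketch for an inference pod labeled app: vllm, allowing ingress only from a same-namespace gateway and egress only for DNS; the labels are assumptions:

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: inference-restrict
spec:
  podSelector:
    matchLabels:
      app: vllm
  policyTypes: [Ingress, Egress]
  ingress:
  - from:
    - podSelector:
        matchLabels:
          app: gateway     # assumed gateway label, same namespace
  egress:
  - ports:                 # DNS only
    - protocol: UDP
      port: 53
    - protocol: TCP
      port: 53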

Monitoring

Key Metrics

Metric              Source          Purpose
GPU Utilization     DCGM Exporter   Scaling
Inference Latency   Prometheus      SLO
Tokens/Second       Custom          Throughput
Queue Length        App metrics     Scaling

Setup

# Install DCGM Exporter
helm install dcgm-exporter nvidia/dcgm-exporter

# ServiceMonitor for Prometheus
# See resources/manifests.yaml
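
A ServiceMonitor sketch for scraping DCGM Exporter; the label selector and port name are assumptions, so check the chart's Service and your Prometheus serviceMonitorSelector:

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: dcgm-exporter
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: dcgm-exporter  # assumed chart label
  endpoints:
  - port: metrics                            # assumed port name on the exporter Service
    interval: 15s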

Managed Kubernetes

AWS EKS

  • Instance types: g5.2xlarge, p4d.24xlarge
  • AMI: AL2_x86_64_GPU
  • GPU taints for isolation

Azure AKS

  • VM sizes: Standard_NC*, Standard_ND*
  • A100 support via NC24ads_A100_v4

OCI OKE

  • Shapes: BM.GPU.A100-v2.8, VM.GPU.A10
  • GPU node pools with taints

Terraform examples: ../terraform-iac/resources/modules.tf

Best Practices

Resource Management

  • Always set GPU limits = requests
  • Use node selectors for GPU types
  • Implement tolerations for GPU taints
  • PVC for model caching
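
A PVC sketch for model caching; the size and storage class are placeholders:

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: model-cache
spec:
  accessModes: [ReadWriteOnce]
  resources:
    requests:
      storage: 100Gi          # placeholder; size for the models you cache
  storageClassName: gp3       # placeholder; use your cluster's storage class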

High Availability

  • Multiple replicas across zones
  • Pod disruption budgets
  • Readiness/liveness probes
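
A PodDisruptionBudget sketch that keeps at least one inference replica available during voluntary disruptions; the app label is a placeholder:

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: vllm
spec:
  minAvailable: 1
  selector:
    matchLabels:
      app: vllm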

Cost Optimization

  • Spot instances for dev/test
  • Auto-scaling to zero when idle
  • Right-size GPU instances

Resources

  • resources/manifests.yaml - GPU scheduling, model serving, autoscaling, and monitoring manifests referenced throughout this skill

Deploy AI workloads at scale with GPU-optimized Kubernetes.