| name | Kubernetes AI Expert |
| description | Deploy and operate AI workloads on Kubernetes with GPU scheduling, model serving, and MLOps patterns |
| version | 1.1.0 |
| last_updated | 2026-01-06 |
| external_version | Kubernetes 1.31 |
| resources | resources/manifests.yaml |
| triggers | kubernetes, k8s, helm, GPU, model serving |
# Kubernetes AI Expert

Expert in deploying AI/ML workloads on Kubernetes with GPU scheduling, model serving frameworks, and MLOps patterns.
## GPU Workload Scheduling

### NVIDIA GPU Operator

```bash
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm install gpu-operator nvidia/gpu-operator
```
### GPU Resource Requests

| Resource | Description |
|---|---|
| `nvidia.com/gpu: N` | Request N GPUs |
| `nvidia.com/mig-3g.40gb: 1` | Request one MIG slice (3g.40gb profile) |
| Node selector | `nvidia.com/gpu.product` |
| Toleration | `nvidia.com/gpu` |
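Putting the pieces together, a minimal smoke-test Pod, assuming an A100 node pool labeled by the GPU Operator (the `nvidia.com/gpu.product` value must match your nodes' actual labels):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke-test
spec:
  restartPolicy: OnFailure
  nodeSelector:
    nvidia.com/gpu.product: NVIDIA-A100-SXM4-40GB  # must match your node label
  tolerations:
    - key: nvidia.com/gpu
      operator: Exists
      effect: NoSchedule
  containers:
    - name: cuda
      image: nvidia/cuda:12.4.1-base-ubuntu22.04
      command: ["nvidia-smi"]            # prints the GPU if scheduling worked
      resources:
        limits:
          nvidia.com/gpu: 1              # GPUs are requested via limits
```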
Full manifests: resources/manifests.yaml
## Model Serving Frameworks

### Framework Comparison
| Framework | Best For | GPU Support | Scaling |
|---|---|---|---|
| vLLM | High-throughput LLMs | Excellent | HPA/KEDA |
| Triton | Multi-model serving | Excellent | HPA |
| TGI | HuggingFace models | Good | HPA |
### vLLM Deployment

Key configurations:

- `--tensor-parallel-size` - Multi-GPU inference
- `--max-model-len` - Context window
- `--gpu-memory-utilization` - Memory efficiency
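A sketch of a vLLM Deployment wiring these flags together; the image tag, model ID, and PVC name are illustrative assumptions, not fixed values:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm
spec:
  replicas: 1
  selector:
    matchLabels: { app: vllm }
  template:
    metadata:
      labels: { app: vllm }
    spec:
      containers:
        - name: vllm
          image: vllm/vllm-openai:latest        # pin a version in production
          args:
            - --model=mistralai/Mistral-7B-Instruct-v0.2   # placeholder model
            - --tensor-parallel-size=1
            - --max-model-len=8192
            - --gpu-memory-utilization=0.90
          ports:
            - containerPort: 8000               # OpenAI-compatible HTTP API
          resources:
            limits:
              nvidia.com/gpu: 1
          volumeMounts:
            - name: model-cache
              mountPath: /root/.cache/huggingface
      volumes:
        - name: model-cache
          persistentVolumeClaim:
            claimName: model-cache              # hypothetical PVC for weights
```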
### Triton Inference Server
- Multi-model serving from S3/GCS
- HTTP (8000), gRPC (8001), Metrics (8002)
- Model polling for dynamic updates
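A Pod-level sketch of those defaults; the bucket path and image tag are placeholders, and S3 credentials are assumed to be provided separately (e.g. IRSA or env vars):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: triton
spec:
  containers:
    - name: triton
      image: nvcr.io/nvidia/tritonserver:24.08-py3   # pick a release matching your GPUs
      command: ["tritonserver"]
      args:
        - --model-repository=s3://my-models-bucket/repo   # placeholder bucket
        - --model-control-mode=poll      # re-scan the repository for model updates
        - --repository-poll-secs=60
      ports:
        - containerPort: 8000            # HTTP
        - containerPort: 8001            # gRPC
        - containerPort: 8002            # Prometheus metrics
      resources:
        limits:
          nvidia.com/gpu: 1
```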
### Text Generation Inference (TGI)

- HuggingFace native
- Quantization support (`bitsandbytes-nf4`)
- Simple deployment
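A minimal TGI Pod sketch; the model ID is a placeholder, and `--quantize=bitsandbytes-nf4` assumes a TGI build with bitsandbytes support:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: tgi
spec:
  containers:
    - name: tgi
      image: ghcr.io/huggingface/text-generation-inference:latest   # pin a version
      args:
        - --model-id=mistralai/Mistral-7B-Instruct-v0.2   # placeholder model
        - --quantize=bitsandbytes-nf4                     # 4-bit NF4 quantization
      ports:
        - containerPort: 80              # TGI serves HTTP on 80 by default
      resources:
        limits:
          nvidia.com/gpu: 1
```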
Deployment manifests: resources/manifests.yaml
## Helm Chart Pattern

```yaml
# values.yaml structure
inference:
  enabled: true
  replicas: 2
  framework: "vllm"  # vllm, tgi, triton
  resources:
    limits:
      nvidia.com/gpu: 1

autoscaling:
  enabled: true
  minReplicas: 1
  maxReplicas: 10

vectorDB:
  enabled: true
  type: "qdrant"

monitoring:
  enabled: true
```
## Auto-Scaling

### Horizontal Pod Autoscaler (HPA)

Scale on:

- GPU utilization (`DCGM_FI_DEV_GPU_UTIL`)
- Inference queue length
- Custom metrics
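The HPA cannot read GPU metrics natively; the sketch below assumes Prometheus Adapter re-exposes `DCGM_FI_DEV_GPU_UTIL` as a pods metric under the same name (adapter rules often rename it):

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: vllm
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: vllm
  minReplicas: 1
  maxReplicas: 10
  metrics:
    - type: Pods
      pods:
        metric:
          name: DCGM_FI_DEV_GPU_UTIL    # assumes the adapter keeps this name
        target:
          type: AverageValue
          averageValue: "80"            # scale out above ~80% average GPU utilization
```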
### KEDA Event-Driven Scaling
Scale on:
- Prometheus metrics
- Message queue depth (RabbitMQ, SQS)
- Custom external metrics
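A KEDA `ScaledObject` sketch scaling on queue depth via Prometheus; the server address and query (here vLLM's `vllm:num_requests_waiting` gauge) are assumptions to adapt:

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: vllm
spec:
  scaleTargetRef:
    name: vllm                     # Deployment to scale
  minReplicaCount: 0               # scale to zero when idle
  maxReplicaCount: 10
  triggers:
    - type: prometheus
      metadata:
        serverAddress: http://prometheus.monitoring:9090  # adjust to your cluster
        query: sum(vllm:num_requests_waiting)             # assumed queue-depth metric
        threshold: "10"
```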
HPA/KEDA configs: resources/manifests.yaml
## Networking

### Ingress Configuration
- Rate limiting (nginx annotations)
- TLS with cert-manager
- Large body size for AI payloads
- Extended timeouts (300s+)
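An ingress-nginx sketch combining those pieces; the hostname, limits, and ClusterIssuer name are placeholders:

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: inference
  annotations:
    nginx.ingress.kubernetes.io/limit-rps: "10"            # rate limiting
    nginx.ingress.kubernetes.io/proxy-body-size: "16m"     # large AI payloads
    nginx.ingress.kubernetes.io/proxy-read-timeout: "300"  # long generations
    nginx.ingress.kubernetes.io/proxy-send-timeout: "300"
    cert-manager.io/cluster-issuer: letsencrypt-prod       # hypothetical issuer
spec:
  ingressClassName: nginx
  tls:
    - hosts: [api.example.com]
      secretName: inference-tls
  rules:
    - host: api.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: vllm
                port:
                  number: 8000
```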
### Network Policies
- Restrict pod-to-pod communication
- Allow only gateway → inference
- Permit DNS egress
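A NetworkPolicy sketch implementing that posture; the `app` labels are placeholders matching the earlier examples:

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: inference-lockdown
spec:
  podSelector:
    matchLabels: { app: vllm }
  policyTypes: [Ingress, Egress]
  ingress:
    - from:
        - podSelector:
            matchLabels: { app: gateway }   # only the gateway may call inference
      ports:
        - port: 8000
  egress:
    - to:                                   # permit DNS lookups only
        - namespaceSelector: {}
          podSelector:
            matchLabels: { k8s-app: kube-dns }
      ports:
        - port: 53
          protocol: UDP
        - port: 53
          protocol: TCP
```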
## Monitoring

### Key Metrics
| Metric | Source | Purpose |
|---|---|---|
| GPU Utilization | DCGM Exporter | Scaling |
| Inference Latency | Prometheus | SLO |
| Tokens/Second | Custom | Throughput |
| Queue Length | App metrics | Scaling |
### Setup

```bash
# Install DCGM Exporter
helm install dcgm-exporter nvidia/dcgm-exporter
```

ServiceMonitor for Prometheus: see resources/manifests.yaml.
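A rough sketch of that ServiceMonitor, assuming the Prometheus Operator and the exporter's default labels (both selectors depend on your chart values):

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: dcgm-exporter
  labels:
    release: prometheus              # must match your Prometheus's selector
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: dcgm-exporter
  endpoints:
    - port: metrics                  # dcgm-exporter serves on 9400 by default
      interval: 30s
```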
## Managed Kubernetes

### AWS EKS

- Instance types: `g5.2xlarge`, `p4d.24xlarge`
- AMI: `AL2_x86_64_GPU`
- GPU taints for isolation

### Azure AKS

- VM sizes: `Standard_NC*`, `Standard_ND*`
- A100 support via `Standard_NC24ads_A100_v4`

### OCI OKE

- Shapes: `BM.GPU.A100-v2.8`, `VM.GPU.A10`
- GPU node pools with taints
Terraform examples: ../terraform-iac/resources/modules.tf
## Best Practices

### Resource Management
- Always set GPU limits = requests
- Use node selectors for GPU types
- Implement tolerations for GPU taints
- PVC for model caching
### High Availability
- Multiple replicas across zones
- Pod disruption budgets (see the sketch after this list)
- Readiness/liveness probes
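A minimal PodDisruptionBudget for the inference tier, reusing the assumed `app: vllm` labels from the earlier sketches:

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: vllm
spec:
  minAvailable: 1                    # keep one replica serving through node drains
  selector:
    matchLabels: { app: vllm }
```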
### Cost Optimization
- Spot instances for dev/test
- Auto-scaling to zero when idle
- Right-size GPU instances
## Resources

- resources/manifests.yaml - the deployment, scaling, networking, and monitoring manifests referenced above

Deploy AI workloads at scale with GPU-optimized Kubernetes.