| name | nvidia-nim |
| description | NVIDIA NIM (NVIDIA Inference Microservices) for deploying and managing AI models. Use for NIM microservices, model inference, API integration, and building AI applications with NVIDIA's inference infrastructure. |
NVIDIA NIM Skill
Comprehensive guide for deploying GPU-accelerated AI inference microservices with NVIDIA NIM™. NIM provides containers to self-host pretrained and customized AI models across clouds, data centers, and RTX™ AI PCs with industry-standard APIs.
When to Use This Skill
This skill should be triggered when you:
Deployment & Infrastructure:
- Need to deploy AI models with GPU acceleration (TensorRT, vLLM, SGLang, TensorRT-LLM)
- Want to self-host inference microservices on NVIDIA GPUs
- Are setting up AI infrastructure on clouds, data centers, or RTX workstations
- Need to containerize AI models for production deployment
Model Integration:
- Working with LLMs, vision models, or other foundation models
- Integrating pretrained or fine-tuned models into applications
- Need industry-standard API endpoints (OpenAI-compatible, REST)
- Building RAG pipelines, agentic AI workflows, or chatbots
Performance Optimization:
- Optimizing inference latency and throughput
- Need high-performance model serving on NVIDIA GPUs
- Working with model quantization or optimization
- Scaling AI workloads on Kubernetes
Specific Use Cases:
- "Deploy Llama 3 on my GPU cluster"
- "Set up inference endpoints for custom models"
- "Build AI agents with NVIDIA infrastructure"
- "Optimize model inference performance"
- "Create self-hosted AI applications with observability"
Key Concepts
NVIDIA NIM Architecture
NIM (NVIDIA Inference Microservices) are containerized microservices that provide:
- Pre-optimized Models: Accelerated with TensorRT, vLLM, SGLang, TensorRT-LLM
- Industry-Standard APIs: OpenAI-compatible REST endpoints
- GPU Optimization: Tailored for specific NVIDIA GPU architectures
- Self-Hosting: Deploy anywhere with NVIDIA acceleration
Core Components
- NIM Container: Docker/Kubernetes-ready inference service
- Inference Engines: TensorRT-LLM, vLLM, SGLang for different model types
- API Layer: REST/gRPC endpoints with OpenAI compatibility
- Observability: Built-in metrics for monitoring and dashboards
Deployment Targets
- Cloud: AWS, Azure, GCP with NVIDIA GPUs
- Data Center: On-premise GPU clusters
- Edge: RTX AI PCs and workstations
- Kubernetes: Helm charts for orchestration
Supported Models
- LLMs: Llama, Mistral, Gemma, Falcon, and community fine-tunes
- Vision Models: Image generation, classification, detection
- Custom Models: Fine-tuned models on your data
- Model Catalog: Access thousands of models from NVIDIA and partners
Quick Reference
1. Deploy NIM Container with Single Command
# Deploy a Llama 3 inference microservice (the image path below is illustrative;
# copy the exact pull command from the NGC catalog or build.nvidia.com)
docker run --gpus all \
  -e NGC_API_KEY=$NGC_API_KEY \
  -p 8000:8000 \
  nvcr.io/nvidia/nim/llama-3-8b-instruct:latest
Launches a GPU-accelerated Llama 3 inference service on port 8000
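Before wiring the service into an application, it helps to confirm it is up. The probes below are a minimal sketch assuming the container is on localhost:8000 and exposes the usual OpenAI-style routes; health endpoint names can vary slightly between NIM versions.
# Wait until the service reports ready (endpoint name may vary by NIM version)
curl http://localhost:8000/v1/health/ready
# List the model IDs the microservice is serving
curl http://localhost:8000/v1/models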
2. Call NIM API Endpoint (OpenAI-Compatible)
# Use NIM API like OpenAI
import openai

client = openai.OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="not-used"  # NIM handles auth differently
)

response = client.chat.completions.create(
    model="llama-3-8b-instruct",
    messages=[
        {"role": "user", "content": "Explain quantum computing"}
    ],
    temperature=0.7,
    max_tokens=512
)

print(response.choices[0].message.content)
Standard OpenAI SDK works seamlessly with NIM endpoints
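For interactive applications you usually want token streaming rather than one blocking response. This is a minimal sketch reusing the client above; it assumes the deployed model supports streaming (most LLM NIMs do).
# Stream tokens as they are generated instead of waiting for the full reply
stream = client.chat.completions.create(
    model="llama-3-8b-instruct",
    messages=[{"role": "user", "content": "Write a haiku about GPUs"}],
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)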
3. Deploy NIM on Kubernetes with Helm
# Add NVIDIA Helm repository
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
# Deploy NIM microservice (chart name and value keys are illustrative;
# check the NIM Helm chart documentation for the current chart and values)
helm install my-nim nvidia/nim \
  --set image.repository=nvcr.io/nvidia/nim/llama-3-8b-instruct \
  --set image.tag=latest \
  --set replicaCount=3 \
  --set "resources.limits.nvidia\.com/gpu=1"
Scale NIM inference across Kubernetes cluster with GPU allocation
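After installing the release, confirm the pods scheduled onto GPU nodes and that the service responds before exposing it. A rough check, assuming the release name my-nim produces a service of the same name (actual resource names depend on the chart):
# Confirm the NIM pods are running and were granted GPUs
kubectl get pods -l app.kubernetes.io/instance=my-nim
# Port-forward the service locally and probe the OpenAI-compatible route
kubectl port-forward svc/my-nim 8000:8000 &
curl http://localhost:8000/v1/models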
4. Access NIM via REST API
# Direct REST API call to NIM
curl -X POST http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama-3-8b-instruct",
    "messages": [{"role": "user", "content": "Hello, world!"}],
    "temperature": 0.5,
    "max_tokens": 100
  }'
REST endpoint for language-agnostic integration
5. Deploy Custom Fine-Tuned Model
# Deploy your custom fine-tuned model with NIM (image name and environment
# variables are illustrative; check your NIM version's docs for the exact ones)
docker run --gpus all \
  -e NGC_API_KEY=$NGC_API_KEY \
  -e MODEL_PATH=/models/my-custom-model \
  -v /path/to/models:/models \
  -p 8000:8000 \
  nvcr.io/nvidia/nim/base-llm:latest
NIM supports custom models fine-tuned on your data
6. RAG Pipeline with NIM
# Building retrieval-augmented generation with NIM
from langchain.chains import RetrievalQA
from langchain_openai import ChatOpenAI

# Point LangChain at the NIM endpoint (OpenAI-compatible)
llm = ChatOpenAI(
    base_url="http://localhost:8000/v1",
    api_key="not-used",
    model="llama-3-8b-instruct"
)

# Create RAG chain with the NIM-powered LLM; `vectorstore` is an existing
# vector index (e.g. FAISS) built from your documents
qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    retriever=vectorstore.as_retriever(),
    chain_type="stuff"
)
answer = qa_chain.run("What are the key features of NIM?")
Integrate NIM into RAG workflows with frameworks like LangChain
7. Configure NIM for Multi-GPU Setup
# docker-compose.yml for multi-GPU NIM deployment
version: '3'
services:
  nim-service:
    image: nvcr.io/nvidia/nim/llama-3-70b-instruct:latest
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 4  # Use 4 GPUs
              capabilities: [gpu]
    environment:
      - NGC_API_KEY=${NGC_API_KEY}
      - TENSOR_PARALLEL_SIZE=4
    ports:
      - "8000:8000"
Leverage multiple GPUs for large model inference
8. Monitor NIM with Observability Metrics
# Access built-in metrics from NIM
import requests
metrics = requests.get("http://localhost:8000/metrics")
print(metrics.text)
# Prometheus-compatible metrics include, for example
# (exact metric names vary by NIM version and inference engine):
# - nim_inference_requests_total
# - nim_inference_duration_seconds
# - nim_gpu_utilization_percent
# - nim_throughput_tokens_per_second
Built-in Prometheus metrics for dashboarding and monitoring
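To get these series into a dashboard, point a Prometheus server at the NIM metrics port. A minimal scrape job, assuming the default /metrics path on port 8000 (adjust the target to your deployment):
# prometheus.yml (fragment)
scrape_configs:
  - job_name: "nim"
    metrics_path: /metrics
    scrape_interval: 15s
    static_configs:
      - targets: ["localhost:8000"]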
9. Build AI Agent with NVIDIA Blueprints
# Use NVIDIA AI Blueprints with NIM
# Illustrative pseudocode only: "nvidia_blueprints" and AgentBlueprint are
# placeholders, not a published package. NVIDIA AI Blueprints ship as
# reference workflows (code plus deployment charts) on build.nvidia.com
# that you point at a NIM endpoint in much the same way:
from nvidia_blueprints import AgentBlueprint

# Initialize the blueprint with the NIM endpoint
agent = AgentBlueprint(
    nim_endpoint="http://localhost:8000/v1",
    model="llama-3-8b-instruct",
    tools=["web_search", "calculator", "code_executor"]
)

# Execute the agentic workflow
result = agent.run(
    task="Research and summarize recent AI developments"
)
Predefined AI workflows using NIM as inference backend
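Since the blueprint snippet above is only illustrative, here is a grounded alternative built directly on the OpenAI-compatible API: a single-tool agent loop using the standard tools/function-calling fields of the Chat Completions API. It assumes the model served by your NIM supports tool calling (many Llama 3.x NIMs do; check the model card), and the get_gpu_count tool is a made-up stand-in.
import json
import openai

client = openai.OpenAI(base_url="http://localhost:8000/v1", api_key="not-used")

# A single local tool the model may call; the schema follows the OpenAI tools format
def get_gpu_count() -> int:
    return 4  # stand-in for a real lookup (e.g. parsing nvidia-smi output)

tools = [{
    "type": "function",
    "function": {
        "name": "get_gpu_count",
        "description": "Return the number of GPUs available on this host",
        "parameters": {"type": "object", "properties": {}},
    },
}]

messages = [{"role": "user", "content": "How many GPUs do I have?"}]
response = client.chat.completions.create(
    model="llama-3-8b-instruct", messages=messages, tools=tools
)
msg = response.choices[0].message

if msg.tool_calls:
    # The model asked for a tool: run it, return the result, and ask again
    messages.append(msg.model_dump(exclude_none=True))
    for call in msg.tool_calls:
        result = get_gpu_count()  # only one tool in this sketch
        messages.append({
            "role": "tool",
            "tool_call_id": call.id,
            "content": json.dumps(result),
        })
    final = client.chat.completions.create(
        model="llama-3-8b-instruct", messages=messages, tools=tools
    )
    print(final.choices[0].message.content)
else:
    print(msg.content)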
10. Deploy NIM on Hugging Face Dedicated Endpoints
# Alternative: Use Hugging Face dedicated endpoints
import os
from huggingface_hub import InferenceClient

# Point the client at your dedicated endpoint (the URL below is a placeholder)
client = InferenceClient(
    model="https://<your-endpoint>.endpoints.huggingface.cloud",
    token=os.environ["HF_TOKEN"]
)

response = client.text_generation(
    "Explain NVIDIA NIM",
    max_new_tokens=200
)
print(response)
Managed NIM deployment via Hugging Face cloud infrastructure
Reference Files
This skill includes comprehensive documentation in references/:
microservices.md
Primary documentation for NVIDIA NIM architecture and capabilities:
- Complete overview of NIM inference microservices
- How NIM works: architecture, engines, and optimization
- Deployment guides for clouds, data centers, and RTX systems
- API documentation and integration examples
- Performance optimization with TensorRT, vLLM, SGLang
- Model catalog and customization options
- Kubernetes scaling and Helm charts
- Observability and monitoring setup
Best for:
- Understanding NIM fundamentals
- Deployment planning and architecture
- Performance tuning strategies
- API integration patterns
other.md
Additional resources and references:
- NVIDIA Build catalog (build.nvidia.com)
- Model browser and filters
- AI Blueprints and workflow templates
- Agent blueprints for agentic AI
Best for:
- Discovering available models
- Finding pre-built AI workflows
- Exploring agent architectures
Working with This Skill
For Beginners
Start here:
- Read the "How It Works" section in references/microservices.md
- Try Quick Reference examples #1-2 (basic deployment and API calls)
- Experiment with different models from the NVIDIA catalog
- Learn about NIM container structure and APIs
First Steps:
- Get NGC API key from NVIDIA
- Install Docker with NVIDIA Container Toolkit
- Deploy your first NIM container locally
- Test with simple API calls
For Application Developers
Focus on:
- Quick Reference examples #2, #4, #6 (API integration patterns)
- OpenAI compatibility for seamless migration
- RAG pipeline integration with LangChain/LlamaIndex
- Building agents and chatbots with NIM endpoints
Key Skills:
- REST API integration
- Handling streaming responses
- Error handling and retry logic (a retry sketch follows this list)
- Authentication and security
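For the retry logic mentioned above, the sketch below wraps a NIM call with simple exponential backoff using only the OpenAI SDK's exception types; tune the policy for your workload, and note the SDK also has a built-in max_retries client option.
import time
import openai

client = openai.OpenAI(base_url="http://localhost:8000/v1", api_key="not-used")

def chat_with_retries(messages, attempts=3, backoff=1.0):
    """Call the NIM chat endpoint, retrying transient failures with backoff."""
    for attempt in range(attempts):
        try:
            return client.chat.completions.create(
                model="llama-3-8b-instruct", messages=messages
            )
        except (openai.APIConnectionError, openai.RateLimitError,
                openai.InternalServerError):
            if attempt == attempts - 1:
                raise  # out of retries, surface the error
            time.sleep(backoff * (2 ** attempt))

reply = chat_with_retries([{"role": "user", "content": "Hello"}])
print(reply.choices[0].message.content)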
For MLOps/Infrastructure Engineers
Advanced topics:
- Quick Reference examples #3, #7, #8 (Kubernetes, multi-GPU, monitoring)
- Helm chart customization for production
- Multi-GPU tensor parallelism configuration
- Prometheus metrics and observability
- Autoscaling strategies on Kubernetes (an HPA sketch follows this list)
- Model versioning and deployment pipelines
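For the autoscaling item above, one common pattern is a HorizontalPodAutoscaler targeting the NIM deployment. The manifest below is a minimal CPU-based sketch assuming a deployment named my-nim; in practice you would scale on GPU, queue-depth, or latency metrics exposed through a custom metrics adapter.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: my-nim-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-nim  # deployment name is an assumption; match your release
  minReplicas: 1
  maxReplicas: 4
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70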
Production Considerations:
- GPU resource allocation and limits
- High-availability deployment patterns
- Load balancing across NIM instances
- Monitoring and alerting setup
For Model Developers
Custom Models:
- Quick Reference example #5 (custom model deployment)
- Fine-tuning workflows with NVIDIA tools
- Model optimization with TensorRT
- Quantization for inference efficiency
- Benchmarking and performance validation
Integration Path:
- Fine-tune model on your data
- Export to compatible format (GGUF, safetensors)
- Package with NIM base container
- Deploy and validate performance
- Optimize with TensorRT-LLM if needed
Common Workflows
Workflow 1: Deploy Production-Ready LLM Service
1. Choose model from NVIDIA catalog → references/microservices.md
2. Deploy with Kubernetes Helm chart → Quick Reference #3
3. Configure multi-GPU if needed → Quick Reference #7
4. Set up monitoring → Quick Reference #8
5. Test with OpenAI-compatible client → Quick Reference #2
6. Scale based on metrics
Workflow 2: Build RAG Application
1. Deploy NIM inference endpoint → Quick Reference #1
2. Integrate with vector database (Milvus, Pinecone)
3. Connect via LangChain → Quick Reference #6
4. Implement retrieval pipeline
5. Add observability and error handling
Workflow 3: Create AI Agent System
1. Use NVIDIA AI Blueprint → Quick Reference #9
2. Deploy NIM for reasoning engine
3. Configure tool integrations (search, APIs)
4. Implement agentic workflow
5. Monitor agent performance
Performance Optimization Tips
Inference Speed
- Use TensorRT-LLM for NVIDIA GPUs (up to 8x faster)
- Enable tensor parallelism for large models (>70B)
- Use quantization (INT8, FP8) for memory efficiency
- Batch requests for higher throughput
Resource Utilization
- Monitor GPU memory with NIM metrics
- Adjust max_batch_size for your workload
- Use an appropriate GPU SKU (H100, A100, L40S, RTX)
- Enable KV cache optimization for repeated queries
Scaling Strategy
- Horizontal scaling with Kubernetes for high QPS
- Vertical scaling (more GPUs) for larger models
- Load balancing across NIM instances
- Auto-scaling based on queue depth and latency
Troubleshooting
Common Issues
Container won't start:
- Verify NGC_API_KEY is set correctly
- Check GPU availability with nvidia-smi
- Ensure the NVIDIA Container Toolkit is installed (quick checks sketched below)
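A quick way to narrow down startup failures, assuming a local Docker deployment (the container name is a placeholder):
# Confirm the driver and GPUs are visible on the host
nvidia-smi
# Confirm Docker can reach the GPUs through the toolkit (any CUDA base image works)
docker run --rm --gpus all nvidia/cuda:12.4.1-base-ubuntu22.04 nvidia-smi
# Inspect the NIM container's logs for authentication or model-download errors
docker logs <nim-container-name>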
Out of memory errors:
- Reduce model size or use quantized version
- Increase GPU memory or use multi-GPU
- Adjust batch size and context length limits
Slow inference:
- Check GPU utilization in metrics
- Verify using optimized engine (TensorRT-LLM)
- Enable tensor parallelism for large models
- Reduce precision (FP16, INT8)
API compatibility issues:
- Verify OpenAI SDK version compatibility
- Check the endpoint URL format (/v1/chat/completions)
- Review NIM logs for error details
Additional Resources
NVIDIA Documentation
- NIM Documentation: Official guides and reference
- NGC Catalog: Browse available NIM containers
- TensorRT-LLM: Advanced optimization engine
- AI Blueprints: Pre-built workflow templates
Community & Support
- NVIDIA Developer Forums
- GitHub examples and integrations
- Deployment guides for major cloud providers
- Performance benchmarking results
Related Technologies
- TensorRT: NVIDIA inference optimization SDK
- Triton Inference Server: Production deployment platform
- CUDA: GPU computing foundation
- cuDNN: Deep learning primitives
Notes
- NIM provides OpenAI-compatible APIs for easy migration
- Pre-optimized for specific GPU architectures (Hopper, Ada, Ampere)
- Supports thousands of models: LLMs, vision, speech, multimodal
- Enterprise support available through NVIDIA AI Enterprise
- Regular updates with latest model releases and optimizations
Getting Started Checklist
- Obtain NGC API key from NVIDIA
- Install NVIDIA Container Toolkit
- Deploy first NIM container locally (Quick Reference #1)
- Test API endpoint (Quick Reference #2 or #4)
- Explore model catalog at build.nvidia.com
- Review observability metrics (Quick Reference #8)
- Plan production deployment (Kubernetes/Helm)
- Implement monitoring and alerting
- Optimize for your workload (GPU selection, batching)
- Build your AI application!