name	nvidia-nemo
description	NVIDIA NeMo framework for building and training conversational AI models. Use for NeMo Retriever models, RAG (Retrieval-Augmented Generation), embedding models, enterprise search, and multilingual retrieval systems.

NVIDIA NeMo Skill

Comprehensive assistance with NVIDIA NeMo development, the enterprise AI platform for building, customizing, and deploying generative AI agents at scale.

When to Use This Skill

This skill should be triggered when:

Core NeMo Components:

Working with NeMo Retriever for RAG pipelines and document extraction
Using NeMo Customizer for fine-tuning LLMs (LoRA, SFT, DPO, GRPO)
Implementing NeMo Guardrails for content safety and jailbreak prevention
Building with NeMo Curator for data processing and synthetic data generation
Using NeMo Evaluator for benchmarking LLMs, RAG, and agents
Working with NeMo Agent Toolkit for multi-agent orchestration

NVIDIA Nemotron Models:

Deploying Nemotron Nano, Super, or Ultra models for agentic AI
Using Nemotron RAG models for embedding/reranking
Implementing Nemotron Safety Guard for content moderation

Use Cases:

Building RAG pipelines with enterprise document retrieval
Fine-tuning models for domain-specific tasks
Creating AI agents with tool calling and function execution
Processing and curating training data at scale
Implementing multi-modal AI (text, vision, audio, video)
Deploying guardrails for safe, compliant AI applications

Quick Reference

RAG with NeMo Retriever

Basic Embedding Generation

import requests

# NeMo Retriever Embedding NIM
url = "https://integrate.api.nvidia.com/v1/embeddings"
headers = {
    "Content-Type": "application/json",
    "Authorization": f"Bearer {api_key}"
}

payload = {
    "input": ["What is retrieval-augmented generation?"],
    "model": "nvidia/nv-embedqa-e5-v5",
    "input_type": "query"
}

response = requests.post(url, json=payload, headers=headers)
embeddings = response.json()["data"][0]["embedding"]

Reranking Results

# NeMo Retriever Reranking NIM
url = "https://integrate.api.nvidia.com/v1/ranking"

payload = {
    "model": "nvidia/nv-rerankqa-mistral-4b-v3",
    "query": {"text": "What is machine learning?"},
    "passages": [
        {"text": "Machine learning is a subset of AI..."},
        {"text": "Python is a programming language..."},
        {"text": "ML models learn patterns from data..."}
    ]
}

response = requests.post(url, json=payload, headers=headers)
ranked_results = response.json()["rankings"]

Model Customization with NeMo Customizer

Submit Fine-Tuning Job

# Fine-tune with LoRA
payload = {
    "name": "custom-model-lora",
    "model": "meta/llama-3.1-8b-instruct",
    "method": "lora",
    "dataset": "s3://my-bucket/training-data.jsonl",
    "hyperparameters": {
        "learning_rate": 1e-4,
        "batch_size": 8,
        "epochs": 3,
        "lora_rank": 8
    }
}

response = requests.post(
    "http://nemo-customizer:8000/v1/customization/jobs",
    json=payload
)
job_id = response.json()["id"]

Check Job Status

# Monitor customization progress
status_response = requests.get(
    f"http://nemo-customizer:8000/v1/customization/jobs/{job_id}"
)

print(f"Status: {status_response.json()['status']}")
print(f"Progress: {status_response.json()['progress']}%")

Guardrails with NeMo Guardrails

Initialize Guardrails

from nemoguardrails import RailsConfig, LLMRails

# Load configuration
config = RailsConfig.from_path("./config")
rails = LLMRails(config)

# Apply guardrails
response = rails.generate(
    messages=[{"role": "user", "content": "Tell me about..."}]
)

YAML Configuration for Topic Control

# config.yml
models:
  - type: main
    engine: nvidia_ai_endpoints
    model: meta/llama-3.1-70b-instruct

rails:
  input:
    flows:
      - check jailbreak
      - check topic relevance
  output:
    flows:
      - check hallucination
      - check safety

Custom Rail Definition

# Custom topic control
define user ask about competitors
  "Tell me about competing products"
  "What do you think of [competitor]"

define bot refuse competitors
  "I can only discuss our own products and services."

define flow
  user ask about competitors
  bot refuse competitors
  stop

Data Curation with NeMo Curator

Text Processing Pipeline

from nemo_curator import ScoreFilter, DedupFilter
from nemo_curator.datasets import DocumentDataset

# Load dataset
dataset = DocumentDataset.read_json("data.jsonl")

# Quality filtering
quality_filter = ScoreFilter(
    score_field="quality_score",
    score_threshold=0.7
)
dataset = quality_filter(dataset)

# Deduplication
dedup_filter = DedupFilter()
dataset = dedup_filter(dataset)

# Save processed data
dataset.to_json("processed_data.jsonl")

Synthetic Data Generation

from nemo_curator.synthetic import PromptTemplate, generate_data

# Define prompt template
template = PromptTemplate(
    system="You are a helpful assistant.",
    user_template="Generate a question about {topic}"
)

# Generate synthetic data
synthetic_data = generate_data(
    template=template,
    topics=["machine learning", "data science"],
    model="nvidia/llama-3.1-nemotron-70b-instruct",
    num_samples=100
)

Evaluation with NeMo Evaluator

Academic Benchmark Evaluation

from nemo_evaluator import Evaluator

evaluator = Evaluator()

# Run MMLU benchmark
results = evaluator.evaluate(
    model="meta/llama-3.1-8b-instruct",
    tasks=["mmlu"],
    batch_size=8
)

print(f"MMLU Score: {results['mmlu']['acc']}")

RAG Pipeline Evaluation

# Evaluate RAG with custom metrics
rag_results = evaluator.evaluate_rag(
    model="custom-rag-pipeline",
    metrics=["faithfulness", "answer_relevance", "context_precision"],
    dataset="custom_qa_dataset.jsonl"
)

Agent Development with NeMo Agent Toolkit

Define Agent with Tools

# agent_config.yaml
agents:
  - name: customer_support_agent
    model: nvidia/llama-3.1-nemotron-70b-instruct
    tools:
      - web_search
      - knowledge_base_query
      - ticket_creation
    max_iterations: 5

Tool Registration

from nemo_agent_toolkit import Agent, Tool

# Define custom tool
@Tool(
    name="database_query",
    description="Query customer database for information"
)
def query_database(customer_id: str) -> dict:
    # Tool implementation
    return {"name": "John Doe", "status": "Premium"}

# Create agent
agent = Agent.from_config("agent_config.yaml")
agent.register_tool(query_database)

# Run agent
response = agent.run("What is the status of customer ID 12345?")

Deployment with NeMo NIMs

Deploy Custom Model as NIM

# Pull NIM container
docker pull nvcr.io/nim/meta/llama-3.1-8b-instruct:latest

# Run NIM with custom LoRA
docker run -d \
  --gpus all \
  -p 8000:8000 \
  -e NGC_API_KEY=$NGC_API_KEY \
  -e PEFT_MODEL_PATH=/models/custom-lora \
  -v ./models:/models \
  nvcr.io/nim/meta/llama-3.1-8b-instruct:latest

Query NIM Endpoint

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="not-used"
)

response = client.chat.completions.create(
    model="meta/llama-3.1-8b-instruct",
    messages=[
        {"role": "user", "content": "Explain quantum computing"}
    ],
    temperature=0.7,
    max_tokens=500
)

Key Concepts

NeMo Ecosystem Architecture

NeMo Suite Components:

NeMo Retriever: RAG components (embedding, reranking, extraction)
NeMo Customizer: Model fine-tuning and alignment (LoRA, SFT, DPO, GRPO)
NeMo Guardrails: Safety orchestration and content moderation
NeMo Curator: Data processing and synthetic data generation
NeMo Evaluator: Benchmarking and evaluation pipelines
NeMo Agent Toolkit: Multi-agent orchestration and optimization

NVIDIA Nemotron Models:

Nano (8B): Edge deployment, fast inference, cost-efficient
Super (70B): Single GPU, balanced accuracy/compute
Ultra (405B): Data center scale, highest accuracy
Nano VL: Vision-language for document intelligence
RAG Models: Embedding and reranking for retrieval

RAG Pipeline Components

1. Document Extraction (NeMo Retriever)

Multi-modal extraction: text, charts, tables, graphs
15x faster PDF processing than traditional methods
Maintains document structure and relationships

2. Embedding (Nemotron Embedding Models)

State-of-the-art accuracy on ViDoRe, MTEB benchmarks
Multi-lingual support
Optimized for enterprise documents

3. Vector Storage (cuVS)

GPU-accelerated indexing and search
35x better storage efficiency
Scalable to billions of embeddings

4. Reranking (Nemotron Reranking Models)

Final relevance scoring
50% better accuracy over baseline
Context-aware ranking

Customization Methods

LoRA (Low-Rank Adaptation)

Parameter-efficient fine-tuning
Fast training, low memory
Ideal for multiple task adaptations

SFT (Supervised Fine-Tuning)

Full model fine-tuning
Task-specific optimization
Higher resource requirements

DPO (Direct Preference Optimization)

Alignment without reward model
Human feedback integration
Simpler than RLHF

GRPO (Group Relative Policy Optimization)

Advanced RL alignment
Multi-objective optimization
Enterprise-grade policy learning

Guardrails Architecture

Rail Types:

Input Rails: Pre-LLM validation (jailbreak, topic control)
Output Rails: Post-LLM checks (safety, hallucination)
Dialog Rails: Conversation flow management
Retrieval Rails: RAG grounding verification

Orchestration:

Parallel rail execution for low latency
GPU acceleration for speed
Enterprise-grade scaling

Reference Files

api.md

NeMo Customizer API Documentation

REST API endpoints for model customization
Job submission and monitoring
LoRA, SFT, DPO, GRPO configuration
Hyperparameter tuning guides
Integration with NIM deployment

retriever.md

NeMo Retriever Models & Pipeline

Embedding model APIs (NV-Embed-v2)
Reranking model usage (NV-RerankQA)
Document extraction workflows
Vector database integration (cuVS)
RAG pipeline architecture
Performance benchmarks (ViDoRe, MTEB)

rag.md

RAG Implementation & Best Practices

End-to-end RAG pipeline examples
NeMo Evaluator for RAG metrics
Knowledge base integration
Multi-modal retrieval strategies
AI-Q Blueprint for enterprise RAG

other.md

Comprehensive NeMo Ecosystem

NeMo Curator: Data processing, synthetic generation
NeMo Guardrails: Safety orchestration
NeMo Agent Toolkit: Multi-agent systems
Nemotron Models: Model family overview
NeMo Evaluator: Benchmarking workflows
Integration patterns across components

Working with This Skill

For Beginners

Start Here:

Review the Key Concepts section to understand the NeMo ecosystem
Explore Quick Reference for hands-on examples
Read retriever.md for RAG fundamentals
Try the basic embedding and reranking examples

First Project Ideas:

Build a simple RAG chatbot with NeMo Retriever
Fine-tune a small model with NeMo Customizer
Add basic guardrails to an existing LLM app

For Intermediate Users

Focus Areas:

RAG Optimization: Study rag.md for advanced retrieval patterns
Model Customization: Use api.md to fine-tune with LoRA/SFT
Multi-Agent Systems: Explore NeMo Agent Toolkit in other.md
Evaluation: Implement benchmarking with NeMo Evaluator

Common Workflows:

Build production RAG with reranking
Create domain-specific models via fine-tuning
Implement comprehensive guardrails
Orchestrate multi-agent workflows

For Advanced Users

Enterprise Patterns:

Data Flywheels: Curator → Customizer → Evaluator → Production
Multi-Modal RAG: Vision + text retrieval with Nemotron Nano VL
RL Alignment: Advanced GRPO/DPO for policy optimization
Agent Orchestration: Complex multi-agent systems with MCP

Performance Optimization:

GPU acceleration with cuVS for vector search
Parallel rail execution in Guardrails
Batch processing in Curator
Distributed evaluation in NeMo Evaluator

Scaling Strategies:

Kubernetes deployment of NIM microservices
Multi-GPU customization jobs
Enterprise data processing pipelines
Production monitoring and observability

Navigation Tips

Quick Lookups:

API endpoints → api.md
RAG metrics → rag.md
Model specs → other.md (Nemotron section)
Safety rails → other.md (Guardrails section)

Deep Dives:

Complete RAG pipeline → retriever.md + rag.md
Fine-tuning workflow → api.md + other.md (Customizer)
Agent development → other.md (Agent Toolkit)
Data processing → other.md (Curator)

Integration Patterns

NeMo + NIM Deployment

Data Curation (Curator)
  → Model Training/Fine-tuning (Customizer)
  → Evaluation (Evaluator)
  → Deployment (NIM)
  → Monitoring (Agent Toolkit)
  → Safety (Guardrails)

Enterprise RAG Stack

Documents
  → Extraction (NeMo Retriever)
  → Vector DB (cuVS)
  → Embedding (Nemotron RAG)
  → Reranking (Nemotron RAG)
  → LLM (Nemotron + NIM)
  → Guardrails (NeMo Guardrails)
  → Response

Data Flywheel

User Interactions
  → Data Collection
  → Curation (Curator)
  → Fine-tuning (Customizer)
  → Evaluation (Evaluator)
  → Deployment (NIM)
  → Loop Back

Model Selection Guide

Use Case	Recommended Model	Deployment
Edge AI, IoT	Nemotron Nano 8B	Single device
Chatbots, agents	Nemotron Super 70B	Single GPU
Enterprise RAG	Nemotron Ultra 405B	Data center
Document intelligence	Nemotron Nano VL	GPU workstation
Embedding	NV-Embed-v2	NIM microservice
Reranking	NV-RerankQA	NIM microservice

Performance Benchmarks

NeMo Retriever:

#1 on ViDoRe V1, V2 leaderboards (visual document retrieval)
#1 on MTEB VisualDocumentRetrieval
15x faster PDF extraction vs. traditional methods
35x better storage efficiency with cuVS

Nemotron Models:

Up to 6x faster throughput vs. leading 8B models
60% lower token generation with thinking budget
State-of-the-art accuracy on agentic benchmarks

Resources

Official Links

Learning Resources

NeMo Tutorials and Webinars (see reference docs)
AI-Q Blueprint for RAG patterns
NVIDIA DLI Courses on Generative AI
Technical blogs and case studies

Community

NVIDIA Developer Forums
Discord channels for NeMo users
GitHub issues for bug reports
Feature voting for roadmap input

Notes

NeMo is the enterprise AI platform for the full agent lifecycle
All components are API-first and cloud-native
Models support OpenAI-compatible APIs for easy integration
Nemotron models are open with training data and recipes
NIM microservices enable deployment on any GPU-accelerated system
Supports MCP (Model Context Protocol) for tool integration

Updating

To refresh this skill with updated documentation:

Re-run the scraper with the same configuration
The skill will be rebuilt with the latest information
Check for new models, features, and API changes regularly

nvidia-nemo

Install Skill

SKILL.md