Claude Code Plugins

Community-maintained marketplace


Build RAG systems with Ollama local + cloud models. Latest cloud models include DeepSeek-V3.2 (GPT-5 level), Qwen3-Coder-480B (1M context), MiniMax-M2. Use for document Q&A, knowledge bases, and agentic RAG. Covers LangChain, LlamaIndex, ChromaDB, and embedding models.

Install Skill

1. Download skill
2. Enable skills in Claude: open claude.ai/settings/capabilities and find the "Skills" section
3. Upload to Claude: click "Upload skill" and select the downloaded ZIP file

Note: Review the skill's instructions and verify them before using it.

SKILL.md

name: ollama-rag
description: Build RAG systems with Ollama local + cloud models. Latest cloud models include DeepSeek-V3.2 (GPT-5 level), Qwen3-Coder-480B (1M context), MiniMax-M2. Use for document Q&A, knowledge bases, and agentic RAG. Covers LangChain, LlamaIndex, ChromaDB, and embedding models.

Ollama RAG Guide

Build RAG systems with Ollama - run locally or use cloud for massive models.

Ollama Cloud Models (Dec 2025)

Access cloud models via ollama signin (requires Ollama v0.12+). No local model storage or GPU needed; Ollama states that query data is not retained.

| Model | Params | Context | Best For |
|---|---|---|---|
| deepseek-v3.2:cloud | 671B | 160K | GPT-5 level, reasoning |
| deepseek-v3.1:671b-cloud | 671B | 160K | Thinking + non-thinking hybrid |
| qwen3-coder:480b-cloud | 480B | 256K-1M | Agentic coding, repo-scale |
| minimax-m2:cloud | 230B (10B active) | 128K | #1 open-source, tool use |
| gpt-oss:120b-cloud | 120B | 128K | OpenAI open weights |
| glm-4.6:cloud | - | - | Code generation |

# Sign in to access cloud
ollama signin

# Run cloud models
ollama run deepseek-v3.2:cloud
ollama run qwen3-coder:480b-cloud
ollama run minimax-m2:cloud
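
Cloud models are served through the same local Ollama API as regular models once you are signed in, so existing RAG code does not change. A minimal sketch with the ollama Python client (model name from the table above; assumes ollama signin has already been run):

import ollama

# The :cloud tag routes the request to Ollama's hosted runtime;
# the client call is identical to a local model call.
response = ollama.chat(
    model="deepseek-v3.2:cloud",
    messages=[{"role": "user", "content": "Explain retrieval-augmented generation in two sentences."}],
)
print(response["message"]["content"])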

Local Models (Dec 2025)

Reasoning Models

| Model | Params | Context | Best For |
|---|---|---|---|
| nemotron-3-nano | 30B (3.6B active) | 1M | Agents, long docs, code |
| deepseek-r1 | 7B-671B | 128K | Reasoning, math, code |
| qwq | 32B | 32K | Logic, analysis |
| llama4 | 109B/400B | 128K | General, multimodal |

Fast/Efficient Models

| Model | Params | RAM | Speed |
|---|---|---|---|
| llama3.2:3b | 3B | 8GB | Very fast |
| mistral-small-3.1 | 24B | 16GB | Fast |
| gemma3 | 4B-27B | 8-32GB | Balanced |

Embedding Models

| Model | Dims | Context | MTEB Score |
|---|---|---|---|
| snowflake-arctic-embed2 | 1024 | 8K | 67.5 |
| mxbai-embed-large | 1024 | 512 | 64.68 |
| nomic-embed-text | 768 | 8K | 53.01 |
Recommendation: snowflake-arctic-embed2 for accuracy, nomic-embed-text for speed.
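
Before indexing a large corpus, it is worth sanity-checking the embedding model. A minimal sketch (model names from the table above; the printed dimensionality should match the Dims column):

import ollama

# Embed a sample string and inspect the vector size.
resp = ollama.embed(model="snowflake-arctic-embed2", input="retrieval-augmented generation")
vector = resp["embeddings"][0]
print(len(vector))  # expected: 1024 for snowflake-arctic-embed2, 768 for nomic-embed-text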

Quick Start

Cloud (No Local Resources)

ollama signin
ollama run deepseek-v3.2:cloud  # GPT-5 level
ollama run qwen3-coder:480b-cloud  # 1M context for huge repos

Local

ollama pull nemotron-3-nano  # 1M context, 24GB VRAM
ollama pull snowflake-arctic-embed2

# Or for lower RAM (8GB)
ollama pull llama3.2:3b
ollama pull nomic-embed-text

Stack Options

Option A: LangChain + ChromaDB (Most Common)

# pip install langchain langchain-ollama langchain-chroma langchain-community langchain-text-splitters pypdf
from langchain_ollama import OllamaLLM, OllamaEmbeddings
from langchain_chroma import Chroma
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_community.document_loaders import PyPDFLoader

# Load and split
loader = PyPDFLoader("document.pdf")
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
docs = splitter.split_documents(loader.load())

# Embed and store
embeddings = OllamaEmbeddings(model="snowflake-arctic-embed2")
vectorstore = Chroma.from_documents(docs, embeddings, persist_directory="./db")

# Query - LOCAL
llm = OllamaLLM(model="nemotron-3-nano")

# Or CLOUD (GPT-5 level, no local resources)
llm = OllamaLLM(model="deepseek-v3.2:cloud")

retriever = vectorstore.as_retriever(search_kwargs={"k": 5})

from langchain.chains import RetrievalQA
qa = RetrievalQA.from_chain_type(llm=llm, retriever=retriever)
result = qa.invoke("What is the main topic?")
answer = result["result"]  # the generated answer text

Option B: LlamaIndex (Better Accuracy)

# pip install llama-index llama-index-llms-ollama llama-index-embeddings-ollama
from llama_index.llms.ollama import Ollama
from llama_index.embeddings.ollama import OllamaEmbedding
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader, Settings

# Configure
Settings.llm = Ollama(model="nemotron-3-nano", request_timeout=300.0)
Settings.embed_model = OllamaEmbedding(model_name="snowflake-arctic-embed2")

# Load and index
documents = SimpleDirectoryReader("./docs").load_data()
index = VectorStoreIndex.from_documents(documents)

# Query
query_engine = index.as_query_engine()
response = query_engine.query("Summarize the key findings")
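
Re-embedding the corpus on every run is avoidable. A minimal sketch of persisting and reloading the index built above (standard llama_index.core APIs; the persist_dir name is illustrative):

from llama_index.core import StorageContext, load_index_from_storage

# Save the vector index (and its embeddings) to disk
index.storage_context.persist(persist_dir="./index_storage")

# Later: reload the index without re-reading or re-embedding ./docs
storage_context = StorageContext.from_defaults(persist_dir="./index_storage")
index = load_index_from_storage(storage_context)
query_engine = index.as_query_engine()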

Option C: Direct Ollama API (Minimal Dependencies)

# pip install ollama chromadb
import ollama
import chromadb

# Embed
def embed(text):
    return ollama.embed(model="nomic-embed-text", input=text)["embeddings"][0]

# Store in ChromaDB
client = chromadb.PersistentClient(path="./db")
collection = client.get_or_create_collection("docs")
collection.add(ids=["1"], documents=["text"], embeddings=[embed("text")])

# Retrieve and generate
results = collection.query(query_embeddings=[embed("query")], n_results=3)
context = "\n".join(results["documents"][0])

response = ollama.chat(
    model="nemotron-3-nano",
    messages=[{"role": "user", "content": f"Context:\n{context}\n\nQuestion: ..."}]
)
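
The reply text is available as response["message"]["content"] (recent versions of the ollama client also allow response.message.content).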

Vector Database Options

| Database | Install | Best For |
|---|---|---|
| ChromaDB | pip install chromadb | Simple, embedded |
| FAISS | pip install faiss-cpu | Fast similarity search |
| Qdrant | pip install qdrant-client | Production scale |
| Weaviate | Docker | Full-featured |
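
Swapping the store in Option A is mostly a one-line change. A minimal sketch with FAISS (assumes pip install faiss-cpu langchain-community; docs and embeddings as built in Option A):

from langchain_community.vectorstores import FAISS

# Build an in-memory FAISS index from the same chunks and embedding model
vectorstore = FAISS.from_documents(docs, embeddings)
vectorstore.save_local("./faiss_index")

# Reload later; the embeddings object is still needed to embed queries
vectorstore = FAISS.load_local("./faiss_index", embeddings,
                               allow_dangerous_deserialization=True)
retriever = vectorstore.as_retriever(search_kwargs={"k": 5})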

Nemotron 3 Nano Deep Dive

Why Nemotron for RAG:

  • 1M token context = entire codebases, long documents
  • Hybrid Mamba-Transformer = 4x faster inference
  • MoE (3.6B active params) = runs on 24GB VRAM
  • Apache 2.0 license = commercial use OK
# For very long documents
llm = OllamaLLM(
    model="nemotron-3-nano",
    num_ctx=131072,  # 128K context, increase as needed
    temperature=0.1,  # Lower for factual RAG
)

Hardware Requirements

| Model size | RAM | GPU VRAM |
|---|---|---|
| 3B models | 8GB | 4GB |
| 7-8B models | 16GB | 8GB |
| 30B models | 32GB | 24GB |
| 70B+ models | 64GB+ | 48GB+ |

References