# Ollama RAG Guide

Build RAG systems with Ollama: run models locally, or use Ollama's cloud for massive models.

## Ollama Cloud Models (Dec 2025)

Access cloud models with `ollama signin` (v0.12+). No local storage is needed and privacy is preserved.
| Model | Params | Context | Best For |
|---|---|---|---|
| deepseek-v3.2:cloud | 671B | 160K | GPT-5 level, reasoning |
| deepseek-v3.1:671b-cloud | 671B | 160K | Thinking + non-thinking hybrid |
| qwen3-coder:480b-cloud | 480B | 256K-1M | Agentic coding, repo-scale |
| minimax-m2:cloud | 230B (10B active) | 128K | #1 open-source, tools |
| gpt-oss:120b-cloud | 120B | 128K | OpenAI open weights |
| glm-4.6:cloud | - | - | Code generation |
```bash
# Sign in to access cloud models
ollama signin

# Run cloud models
ollama run deepseek-v3.2:cloud
ollama run qwen3-coder:480b-cloud
ollama run minimax-m2:cloud
```
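Cloud models are served through the same local Ollama endpoint once you are signed in, so the standard Python client should work unchanged; a minimal sketch (the prompt is illustrative):

```python
import ollama  # pip install ollama

# After `ollama signin`, cloud model tags are addressed like any local model;
# the request is forwarded to Ollama's servers instead of running locally.
response = ollama.chat(
    model="deepseek-v3.2:cloud",
    messages=[{"role": "user", "content": "Summarize retrieval-augmented generation in two sentences."}],
)
print(response["message"]["content"])
```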
## Local Models (Dec 2025)

### Reasoning Models
| Model | Params | Context | Best For |
|---|---|---|---|
| nemotron-3-nano | 30B (3.6B active) | 1M tokens | Agents, long docs, code |
| deepseek-r1 | 7B-671B | 128K | Reasoning, math, code |
| qwq | 32B | 32K | Logic, analysis |
| llama4 | 109B/400B | 128K | General, multimodal |
### Fast/Efficient Models

| Model | Size | RAM | Speed |
|---|---|---|---|
| llama3.2:3b | 2GB | 8GB | Very fast |
| mistral-small-3.1 | 24B | 16GB | Fast |
| gemma3 | 4B-27B | 8-32GB | Balanced |
### Embedding Models

| Model | Dims | Context | MTEB Score |
|---|---|---|---|
| snowflake-arctic-embed2 | 1024 | 8K | 67.5 |
| mxbai-embed-large | 1024 | 512 | 64.68 |
| nomic-embed-text | 768 | 8K | 53.01 |
Recommendation: snowflake-arctic-embed2 for accuracy, nomic-embed-text for speed.
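A quick way to sanity-check the trade-off is to embed the same text with both models and compare output dimensions; a minimal sketch, assuming both models have already been pulled:

```python
import ollama  # pip install ollama

SAMPLE = "Retrieval-augmented generation grounds LLM answers in your own documents."

# Compare embedding dimensionality: 1024 (arctic) vs 768 (nomic), per the table above
for model in ("snowflake-arctic-embed2", "nomic-embed-text"):
    result = ollama.embed(model=model, input=SAMPLE)
    print(f"{model}: {len(result['embeddings'][0])} dims")
```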
## Quick Start

### Cloud (No Local Resources)

```bash
ollama signin
ollama run deepseek-v3.2:cloud      # GPT-5 level
ollama run qwen3-coder:480b-cloud   # 1M context for huge repos
```

### Local

```bash
ollama pull nemotron-3-nano         # 1M context, 24GB VRAM
ollama pull snowflake-arctic-embed2

# Or for lower RAM (8GB)
ollama pull llama3.2:3b
ollama pull nomic-embed-text
```
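Before wiring up a full stack, it can help to confirm the pulls succeeded and a model responds; a quick check (the prompt is arbitrary):

```bash
ollama list                              # pulled models and their on-disk sizes
ollama run llama3.2:3b "Reply with OK"   # one-off generation sanity check
```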
## Stack Options

### Option A: LangChain + ChromaDB (Most Common)
```python
from langchain_ollama import OllamaLLM, OllamaEmbeddings
from langchain_chroma import Chroma
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_community.document_loaders import PyPDFLoader
from langchain.chains import RetrievalQA

# Load and split
loader = PyPDFLoader("document.pdf")
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
docs = splitter.split_documents(loader.load())

# Embed and store
embeddings = OllamaEmbeddings(model="snowflake-arctic-embed2")
vectorstore = Chroma.from_documents(docs, embeddings, persist_directory="./db")

# Query - LOCAL
llm = OllamaLLM(model="nemotron-3-nano")
# Or CLOUD (GPT-5 level, no local resources)
# llm = OllamaLLM(model="deepseek-v3.2:cloud")

retriever = vectorstore.as_retriever(search_kwargs={"k": 5})
qa = RetrievalQA.from_chain_type(llm=llm, retriever=retriever)
answer = qa.invoke("What is the main topic?")["result"]  # invoke returns a dict; "result" is the answer text
```
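Because the vector store is persisted to `./db`, later runs can reopen it instead of re-embedding everything; a minimal sketch, assuming the same embedding model as above:

```python
from langchain_ollama import OllamaEmbeddings
from langchain_chroma import Chroma

# Reopen the persisted Chroma store without re-indexing the documents
embeddings = OllamaEmbeddings(model="snowflake-arctic-embed2")
vectorstore = Chroma(persist_directory="./db", embedding_function=embeddings)
retriever = vectorstore.as_retriever(search_kwargs={"k": 5})
```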
### Option B: LlamaIndex (Better Accuracy)
```python
from llama_index.llms.ollama import Ollama
from llama_index.embeddings.ollama import OllamaEmbedding
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader, Settings

# Configure
Settings.llm = Ollama(model="nemotron-3-nano", request_timeout=300.0)
Settings.embed_model = OllamaEmbedding(model_name="snowflake-arctic-embed2")

# Load and index
documents = SimpleDirectoryReader("./docs").load_data()
index = VectorStoreIndex.from_documents(documents)

# Query
query_engine = index.as_query_engine()
response = query_engine.query("Summarize the key findings")
```
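LlamaIndex can also persist the index so documents are only embedded once; a minimal sketch, assuming `Settings` is configured as above (the `./storage` path is illustrative):

```python
from llama_index.core import StorageContext, load_index_from_storage

# Save the index to disk after the first build
index.storage_context.persist(persist_dir="./storage")

# Later: reload instead of rebuilding, then retrieve more chunks per query
storage_context = StorageContext.from_defaults(persist_dir="./storage")
index = load_index_from_storage(storage_context)
query_engine = index.as_query_engine(similarity_top_k=5)
response = query_engine.query("Summarize the key findings")
```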
### Option C: Direct Ollama API (Minimal Dependencies)
```python
import ollama
import chromadb

# Embed
def embed(text):
    return ollama.embed(model="nomic-embed-text", input=text)["embeddings"][0]

# Store in ChromaDB
client = chromadb.PersistentClient(path="./db")
collection = client.get_or_create_collection("docs")
collection.add(ids=["1"], documents=["text"], embeddings=[embed("text")])

# Retrieve and generate
results = collection.query(query_embeddings=[embed("query")], n_results=3)
context = "\n".join(results["documents"][0])
response = ollama.chat(
    model="nemotron-3-nano",
    messages=[{"role": "user", "content": f"Context:\n{context}\n\nQuestion: ..."}],
)
```
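The snippet above indexes a single string; real documents need to be split first. A minimal character-based splitter with overlap (the `chunk_text` helper and `document.txt` path are illustrative, not part of the Ollama or ChromaDB APIs):

```python
def chunk_text(text, chunk_size=1000, overlap=200):
    """Split text into overlapping character chunks."""
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += chunk_size - overlap
    return chunks

# Chunk a whole file, embed each chunk, and add them in one batch
with open("document.txt") as f:
    chunks = chunk_text(f.read())

collection.add(
    ids=[f"chunk-{i}" for i in range(len(chunks))],
    documents=chunks,
    embeddings=[embed(c) for c in chunks],
)
```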
## Vector Database Options

| Database | Install | Best For |
|---|---|---|
| ChromaDB | `pip install chromadb` | Simple, embedded |
| FAISS | `pip install faiss-cpu` | Fast similarity |
| Qdrant | `pip install qdrant-client` | Production scale |
| Weaviate | Docker | Full-featured |
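All of these drop into the Option A pipeline by swapping the vector-store class; a minimal FAISS sketch, assuming the `docs` and `embeddings` objects from Option A and `pip install faiss-cpu`:

```python
from langchain_community.vectorstores import FAISS

# Build an in-memory FAISS index from the same documents and embeddings
vectorstore = FAISS.from_documents(docs, embeddings)
vectorstore.save_local("./faiss_index")

# Later: reload the saved index from disk
vectorstore = FAISS.load_local(
    "./faiss_index", embeddings, allow_dangerous_deserialization=True
)
retriever = vectorstore.as_retriever(search_kwargs={"k": 5})
```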
## Nemotron 3 Nano Deep Dive

Why Nemotron for RAG:
- 1M token context = entire codebases, long documents
- Hybrid Mamba-Transformer = 4x faster inference
- MoE (3.6B active params) = runs on 24GB VRAM
- Apache 2.0 license = commercial use OK
```python
# For very long documents
llm = OllamaLLM(
    model="nemotron-3-nano",
    num_ctx=131072,    # 128K context, increase as needed
    temperature=0.1,   # Lower for factual RAG
)
```
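The practical payoff for RAG is that the retriever can return far more chunks per query before the prompt overflows; a sketch building on the Option A pipeline (the `k` value is illustrative):

```python
# With a 128K+ window, stuff many more retrieved chunks into the prompt
retriever = vectorstore.as_retriever(search_kwargs={"k": 50})
qa = RetrievalQA.from_chain_type(llm=llm, retriever=retriever)
answer = qa.invoke("Trace how authentication works across the codebase")["result"]
```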
## Hardware Requirements

| Model | RAM | GPU VRAM |
|---|---|---|
| 3B models | 8GB | 4GB |
| 7-8B models | 16GB | 8GB |
| 30B models | 32GB | 24GB |
| 70B+ models | 64GB+ | 48GB+ |