| name | huggingface |
| description | Import GGUF models from HuggingFace into Ollama. Pull models directly using the hf.co/ prefix, track download progress, and use imported models for inference. |
HuggingFace Model Import
Overview
Ollama can directly pull GGUF models from HuggingFace using the hf.co/ prefix. This enables access to thousands of quantized models beyond the official Ollama library.
Quick Reference
| Action | Syntax |
|---|---|
| Pull model | hf.co/{org}/{repo}:{quantization} |
| List models | ollama.list() |
| Use model | Same as any Ollama model |
| Delete model | ollama.delete("hf.co/...") |
Model Naming Format
hf.co/{organization}/{repository}-GGUF:{quantization}
(The -GGUF suffix is a naming convention used by most quantized repositories, not a requirement; note the lowercase gguf in the Phi-3 example below.)
Examples:
hf.co/NousResearch/Nous-Hermes-2-Mistral-7B-DPO-GGUF:Q4_K_M
hf.co/TheBloke/Llama-2-7B-Chat-GGUF:Q4_K_M
hf.co/microsoft/Phi-3-mini-4k-instruct-gguf:Q4_K_M
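This reference string is what every ollama call below expects. A minimal sketch composing one from its parts, using the values from the first example above:

```python
# Compose an hf.co model reference; Ollama treats it like any other model name
org = "NousResearch"
repo = "Nous-Hermes-2-Mistral-7B-DPO-GGUF"
quant = "Q4_K_M"

model_ref = f"hf.co/{org}/{repo}:{quant}"
print(model_ref)  # hf.co/NousResearch/Nous-Hermes-2-Mistral-7B-DPO-GGUF:Q4_K_M
```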
Common Quantizations
| Quantization | Size | Quality | Use Case |
|---|---|---|---|
| Q2_K | Smallest | Lowest | Testing only |
| Q4_K_M | Medium | Good | Recommended default |
| Q5_K_M | Larger | Better | Quality-focused |
| Q6_K | Large | High | Near-original quality |
| Q8_0 | Largest | Highest | Maximum quality |
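The right tag depends on the disk and memory you can spare. Below is an illustrative helper only; the gigabyte thresholds are assumptions for a roughly 7B model, not measurements:

```python
# Illustrative sketch: map a rough size budget (GB) to a quantization tag
# from the table above. Thresholds assume a ~7B model and are approximate.
def pick_quantization(budget_gb: float) -> str:
    if budget_gb < 3:
        return "Q2_K"    # smallest, testing only
    if budget_gb < 5:
        return "Q4_K_M"  # recommended default
    if budget_gb < 6:
        return "Q5_K_M"  # quality-focused
    if budget_gb < 7:
        return "Q6_K"    # near-original quality
    return "Q8_0"        # maximum quality

print(pick_quantization(4.5))  # Q4_K_M
```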
Pull Model from HuggingFace
With Progress Tracking
```python
import ollama

HF_MODEL = "hf.co/NousResearch/Nous-Hermes-2-Mistral-7B-DPO-GGUF:Q4_K_M"

print(f"Pulling {HF_MODEL}...")
last_status = ""
for progress in ollama.pull(HF_MODEL, stream=True):
    status = progress.get("status", "")
    digest = progress.get("digest", "")
    total = progress.get("total")

    # Only print when status changes, to avoid flooding the terminal
    if status != last_status:
        if status == "pulling manifest":
            print(f"  {status}")
        elif status.startswith("pulling") and digest:
            # Shorten "sha256:<64 hex chars>" to the first 12 characters
            short_digest = digest.split(":")[-1][:12] if ":" in digest else digest[:12]
            size_mb = (total / 1024 / 1024) if total else 0
            # Skip small layers; only report the large model blobs
            if size_mb > 100:
                print(f"  pulling {short_digest}... ({size_mb:.0f} MB)")
        elif status in ["verifying sha256 digest", "writing manifest", "success"]:
            print(f"  {status}")
        last_status = status

print("Model pulled successfully!")
```
print("Model pulled successfully!")
Simple Pull
```python
import ollama

HF_MODEL = "hf.co/NousResearch/Nous-Hermes-2-Mistral-7B-DPO-GGUF:Q4_K_M"

# Non-streaming (blocks until complete)
ollama.pull(HF_MODEL)
print("Model pulled!")
```
Verify Installation
```python
import ollama

HF_MODEL = "hf.co/NousResearch/Nous-Hermes-2-Mistral-7B-DPO-GGUF:Q4_K_M"

models = ollama.list()
model_names = [m.get("model", "") for m in models.get("models", [])]

# Check for the HF model
hf_model_installed = any(
    "Nous-Hermes" in name or HF_MODEL in name
    for name in model_names
)

if hf_model_installed:
    print("Model is installed!")
    for name in model_names:
        if "Nous-Hermes" in name or "hf.co" in name:
            print(f"  Name: {name}")
else:
    print("Model not found")
```
Show Model Details
```python
import ollama

HF_MODEL = "hf.co/NousResearch/Nous-Hermes-2-Mistral-7B-DPO-GGUF:Q4_K_M"

model_info = ollama.show(HF_MODEL)

print(f"Model: {HF_MODEL}")
if "details" in model_info:
    details = model_info["details"]
    print(f"Family: {details.get('family', 'N/A')}")
    print(f"Parameter Size: {details.get('parameter_size', 'N/A')}")
    print(f"Quantization: {details.get('quantization_level', 'N/A')}")
```
Use Imported Model
Generate Text
```python
import ollama

HF_MODEL = "hf.co/NousResearch/Nous-Hermes-2-Mistral-7B-DPO-GGUF:Q4_K_M"

result = ollama.generate(
    model=HF_MODEL,
    prompt="What is the capital of France?"
)
print(result["response"])
```
Chat Completion
```python
import ollama

HF_MODEL = "hf.co/NousResearch/Nous-Hermes-2-Mistral-7B-DPO-GGUF:Q4_K_M"

# Nous-Hermes-2 uses ChatML format natively
response = ollama.chat(
    model=HF_MODEL,
    messages=[
        {"role": "system", "content": "You are Hermes 2, a helpful AI assistant."},
        {"role": "user", "content": "Explain quantum computing in two sentences."}
    ]
)
print(response["message"]["content"])
```
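Streaming Chat
Imported models stream like any other Ollama model; a short sketch using the same chat API with stream=True:

```python
import ollama

HF_MODEL = "hf.co/NousResearch/Nous-Hermes-2-Mistral-7B-DPO-GGUF:Q4_K_M"

# Print tokens as they arrive instead of waiting for the full reply
stream = ollama.chat(
    model=HF_MODEL,
    messages=[{"role": "user", "content": "Explain quantum computing in two sentences."}],
    stream=True,
)
for chunk in stream:
    print(chunk["message"]["content"], end="", flush=True)
print()
```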
Delete Imported Model
```python
import ollama

HF_MODEL = "hf.co/NousResearch/Nous-Hermes-2-Mistral-7B-DPO-GGUF:Q4_K_M"

ollama.delete(HF_MODEL)
print("Model deleted!")
```
Popular HuggingFace Models
General Purpose
| Model | HuggingFace Path | Size |
|---|---|---|
| Nous-Hermes-2-Mistral | hf.co/NousResearch/Nous-Hermes-2-Mistral-7B-DPO-GGUF:Q4_K_M | 4.4 GB |
| Llama-2-7B-Chat | hf.co/TheBloke/Llama-2-7B-Chat-GGUF:Q4_K_M | 4.1 GB |
| Mistral-7B-Instruct | hf.co/TheBloke/Mistral-7B-Instruct-v0.2-GGUF:Q4_K_M | 4.4 GB |
Code Models
| Model | HuggingFace Path | Size |
|---|---|---|
| CodeLlama-7B | hf.co/TheBloke/CodeLlama-7B-Instruct-GGUF:Q4_K_M | 4.1 GB |
| Phind-CodeLlama | hf.co/TheBloke/Phind-CodeLlama-34B-v2-GGUF:Q4_K_M | 20 GB |
| WizardCoder | hf.co/TheBloke/WizardCoder-Python-7B-V1.0-GGUF:Q4_K_M | 4.1 GB |
Small/Fast Models
| Model | HuggingFace Path | Size |
|---|---|---|
| Phi-3-mini | hf.co/microsoft/Phi-3-mini-4k-instruct-gguf:Q4_K_M | 2.4 GB |
| TinyLlama | hf.co/TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF:Q4_K_M | 0.7 GB |
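To compare several of these models locally, the pulls can be scripted; a minimal sketch using paths from the tables above:

```python
import ollama

# Paths taken from the tables above; edit the list to taste
MODELS = [
    "hf.co/microsoft/Phi-3-mini-4k-instruct-gguf:Q4_K_M",
    "hf.co/TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF:Q4_K_M",
]

for model in MODELS:
    print(f"Pulling {model}...")
    ollama.pull(model)  # blocks until each download completes
print("All models pulled.")
```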
Finding Models on HuggingFace
- Go to huggingface.co/models
- Filter by:
  - Library: GGUF
  - Task: Text Generation
- Look for models with a -GGUF suffix
- Check the "Files" tab for available quantizations
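The same search can be scripted; a sketch assuming the separate huggingface_hub package (pip install huggingface_hub), which is not part of the ollama library:

```python
# List GGUF text-generation repositories programmatically
from huggingface_hub import HfApi

api = HfApi()
for model in api.list_models(library="gguf", task="text-generation", limit=10):
    print(model.id)  # e.g. TheBloke/Llama-2-7B-Chat-GGUF
```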
Troubleshooting
Model Not Found
Symptom: Error pulling model
Check:
- Repository exists on HuggingFace
- Repository has GGUF files
- Quantization tag is correct
```python
# Verify the HuggingFace URL exists:
# https://huggingface.co/{org}/{repo}/tree/main
```
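A failed pull can also be caught in code; a minimal sketch assuming the ollama library's ResponseError, which is raised for server-side errors such as a missing repository or tag:

```python
import ollama

try:
    ollama.pull("hf.co/NousResearch/Nous-Hermes-2-Mistral-7B-DPO-GGUF:Q4_K_M")
except ollama.ResponseError as e:
    # e.error carries the server's message, e.g. a manifest that was not found
    print(f"Pull failed: {e.error}")
```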
Download Fails
Symptom: Download interrupted or fails
Fix:
- Check internet connection
- Try again (Ollama resumes partial downloads)
- Check disk space
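Because Ollama resumes partial downloads, a simple retry loop is often enough; a sketch where the retry count and delay are arbitrary choices:

```python
import time
import ollama

HF_MODEL = "hf.co/NousResearch/Nous-Hermes-2-Mistral-7B-DPO-GGUF:Q4_K_M"

for attempt in range(3):  # retry count is an arbitrary choice
    try:
        ollama.pull(HF_MODEL)
        print("Pull complete.")
        break
    except Exception as e:
        print(f"Attempt {attempt + 1} failed: {e}")
        time.sleep(5)  # brief pause; Ollama resumes the partial download
```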
Wrong Prompt Format
Symptom: Model gives poor responses
Fix:
- Check model card for correct prompt template
- Some models require specific formats (ChatML, Alpaca, etc.)
```python
# ChatML format example (Nous-Hermes-2)
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hello!"}
]

# The ollama library handles format conversion automatically
```
When to Use This Skill
Use when:
- You need a model not in the official Ollama library
- Testing specific model variants
- Using specialized/fine-tuned models
- Comparing different quantizations
Resources
Cross-References
- bazzite-ai-ollama:python - Using imported models
- bazzite-ai-ollama:api - REST API for model management