| name | llamafile |
| description | When setting up local LLM inference without cloud APIs. When running GGUF models locally. When needing OpenAI-compatible API from a local model. When building offline/air-gapped AI tools. When troubleshooting local LLM server connections. |
Llamafile
Configure and manage Mozilla Llamafile - a cross-platform executable distribution format that runs LLMs locally with an OpenAI-compatible API.
When to Use This Skill
Use this skill when:
- Installing llamafile binary and GGUF model files
- Starting llamafile server with optimal configuration
- Integrating llamafile with LiteLLM or OpenAI SDK
- Configuring llamafile for different performance profiles (GPU, CPU, network access)
- Troubleshooting llamafile server startup or API connection issues
- Building applications requiring local LLM inference
- Setting up commit message tools, code review systems, or other developer tools with local AI
- Managing llamafile as a background service
- Selecting and downloading appropriate GGUF models
- Validating OpenAI-compatible API responses
Core Capabilities
What Llamafile Provides
Llamafile combines llama.cpp with Cosmopolitan Libc to create single-file executables that:
- Run on macOS, Windows, Linux, FreeBSD, OpenBSD, NetBSD
- Support AMD64 and ARM64 architectures
- Serve OpenAI-compatible HTTP API on localhost
- Load GGUF model files for inference
- Provide /health endpoint for monitoring
- Support GPU acceleration (CUDA, Metal, Vulkan)
- Enable embeddings generation with --embedding flag
API Compatibility
Llamafile exposes these OpenAI-compatible endpoints when running with --server:
| Endpoint | Description | Requirements |
|---|---|---|
| http://localhost:8080/v1/chat/completions | Chat completions (primary) | Server mode |
| http://localhost:8080/v1/completions | Text completions | Server mode |
| http://localhost:8080/v1/embeddings | Generate embeddings | --embedding flag |
| http://localhost:8080/health | Health check | Server mode |
Critical Detail: All OpenAI-compatible endpoints require /v1 prefix in the URL path.
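To illustrate the prefix rule, here is a minimal connectivity sketch, assuming a server is already running on the default localhost:8080 and using httpx (the same client used by the management script later in this document): /health sits at the root path, while chat completions sit under /v1.
import httpx

BASE = "http://localhost:8080"

# Health check lives at the root path, without the /v1 prefix
print(httpx.get(f"{BASE}/health", timeout=5).text)

# OpenAI-compatible endpoints require the /v1 prefix
response = httpx.post(
    f"{BASE}/v1/chat/completions",
    json={
        "model": "local",
        "messages": [{"role": "user", "content": "Say hello"}],
        "max_tokens": 32,
    },
    timeout=60,
)
print(response.json()["choices"][0]["message"]["content"])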
Installation
Download Llamafile Binary
# Download llamafile v0.9.3 binary
curl -L -o llamafile https://github.com/mozilla-ai/llamafile/releases/download/0.9.3/llamafile-0.9.3
# Make executable
chmod 755 llamafile
# Verify version
./llamafile --version
Alternative download sources:
- GitHub Release: https://github.com/mozilla-ai/llamafile/releases/download/0.9.3/llamafile-0.9.3
- SourceForge Mirror: https://sourceforge.net/projects/llamafile.mirror/files/0.9.3/
Download GGUF Model
Llamafile requires GGUF format models. Download from Hugging Face:
# Recommended: Gemma 3 3B (balanced speed/quality, ~2GB)
curl -L -o gemma-3-3b.gguf \
https://huggingface.co/Mozilla/gemma-3-3b-it-gguf/resolve/main/gemma-3-3b-it-Q4_K_M.gguf
# Alternative: Pre-packaged llamafile with embedded model
curl -LO https://huggingface.co/Mozilla/llava-v1.5-7b-llamafile/resolve/main/llava-v1.5-7b-q4.llamafile
chmod +x llava-v1.5-7b-q4.llamafile
Recommended models by use case:
| Model | Size | Use Case | Download |
|---|---|---|---|
| Gemma 3 3B | ~2GB | Balanced speed/quality | Mozilla/gemma-3-3b-it-gguf |
| Qwen3-0.6B | ~500MB | Fast, lower quality | Mozilla/Qwen3-0.6B-gguf |
| Mistral 7B | ~4GB | Higher quality, slower | Mozilla/Mistral-7B-gguf |
| Llama 3.1 8B | ~5GB | Best quality, slowest | Mozilla/Llama-3.1-8B-gguf |
Quantization recommendation: Use Q4_K_M quantized models for optimal balance of quality and performance.
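If Python tooling is preferred over curl, the same GGUF file can be fetched with the huggingface_hub client. This is a sketch only; the repository and filename are reused from the curl example above, so verify they match the model you actually want.
from huggingface_hub import hf_hub_download

# Repo and filename taken from the curl example above; adjust for other models
model_path = hf_hub_download(
    repo_id="Mozilla/gemma-3-3b-it-gguf",
    filename="gemma-3-3b-it-Q4_K_M.gguf",
    local_dir="./models",
)
print(f"Model downloaded to: {model_path}")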
Server Configuration
Basic Server Command
Start llamafile server for local API access:
./llamafile --server \
-m /path/to/model.gguf \
--nobrowser \
--port 8080 \
--host 127.0.0.1
Critical flags explained:
- --server: Required to enable HTTP API endpoints
- -m: Path to GGUF model file (required)
- --nobrowser: Prevents auto-opening browser on startup
- --port 8080: Default port (note: NOT 8000)
- --host 127.0.0.1: Localhost only (secure default)
Performance-Optimized Configuration
For GPU-accelerated inference with higher throughput:
./llamafile --server \
-m /path/to/model.gguf \
--nobrowser \
--port 8080 \
--host 127.0.0.1 \
--ctx-size 4096 \
--n-gpu-layers 99 \
--threads 8 \
--cont-batching \
--parallel 4
Advanced flags:
| Flag | Purpose | Default | When to Use |
|---|---|---|---|
| --ctx-size | Prompt context window size | 512 | Increase for longer conversations |
| --n-gpu-layers | GPU offload layer count | 0 | Set to 99 to offload all layers to GPU |
| --threads | CPU threads for generation | Auto | Set explicitly for consistent performance |
| --threads-batch | Threads for batch processing | Same as --threads | Tune separately for prompt vs generation |
| --cont-batching | Continuous batching | Off | Enable for multiple concurrent requests |
| --parallel | Parallel sequence count | 1 | Increase for concurrent request handling |
| --mlock | Lock model in memory | Off | Prevent swapping on systems with sufficient RAM |
| --embedding | Enable embeddings endpoint | Off | Required for /v1/embeddings API |
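The sketch below assembles the flags above into a command list from Python, enabling GPU offload only when a GPU appears to be available. The build_server_cmd helper and the nvidia-smi check are assumptions for illustration (the check is a rough heuristic that only covers NVIDIA), not part of llamafile itself.
import shutil

def build_server_cmd(llamafile_path: str, model_path: str, use_gpu: bool) -> list[str]:
    """Assemble a llamafile server command using the performance flags above."""
    cmd = [
        llamafile_path, "--server",
        "-m", model_path,
        "--nobrowser",
        "--port", "8080",
        "--host", "127.0.0.1",
        "--ctx-size", "4096",
        "--threads", "8",
        "--cont-batching",
        "--parallel", "4",
    ]
    if use_gpu:
        # Only offload layers when a GPU is actually present (see Common Pitfalls)
        cmd += ["--n-gpu-layers", "99"]
    return cmd

# Rough heuristic: treat the presence of nvidia-smi as "GPU available"
cmd = build_server_cmd(
    "./llamafile",
    "./models/gemma-3-3b.gguf",
    use_gpu=shutil.which("nvidia-smi") is not None,
)
print(" ".join(cmd))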
Network-Accessible Configuration
To allow connections from other machines (development/testing only):
./llamafile --server \
-m /path/to/model.gguf \
--nobrowser \
--host 0.0.0.0 \
--port 8080
Security warning: Binding to 0.0.0.0 exposes the API to network access. Use only in trusted environments.
API Integration
Using LiteLLM (Recommended)
LiteLLM provides a unified interface for llamafile and cloud LLM providers.
import litellm
response = litellm.completion(
    model="llamafile/gemma-3-3b",  # MUST use llamafile/ prefix
    messages=[{"role": "user", "content": "Hello, world!"}],
    api_base="http://localhost:8080/v1",  # MUST include /v1 suffix
    temperature=0.3,
    max_tokens=200
)
print(response.choices[0].message.content)
Critical requirements for LiteLLM:
- Model name MUST use the llamafile/ prefix for routing
- api_base MUST include the /v1 suffix
- No API key required (any placeholder value works)
Related skill: For comprehensive LiteLLM configuration, activate the litellm skill:
Skill(command: "litellm")
Using OpenAI Python SDK
Direct integration with OpenAI SDK for llamafile endpoints:
from openai import OpenAI
client = OpenAI(
    base_url="http://localhost:8080/v1",  # MUST include /v1
    api_key="sk-no-key-required"  # Any value works
)

response = client.chat.completions.create(
    model="local-model",  # Model name is flexible
    messages=[
        {"role": "user", "content": "Hello, world!"}
    ],
    temperature=0.3,
    max_tokens=200
)
print(response.choices[0].message.content)
Using curl for Testing
Verify llamafile server is responding correctly:
# Health check
curl http://localhost:8080/health
# Chat completions
curl http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "local",
"messages": [{"role": "user", "content": "Hello"}],
"temperature": 0.3,
"max_tokens": 200
}'
# Embeddings (requires --embedding flag on server)
curl http://localhost:8080/v1/embeddings \
-H "Content-Type: application/json" \
-d '{
"model": "local",
"input": ["Hello world"]
}'
Server Management
Process Management Script
Python script to start llamafile as a background process with health checking:
import subprocess
import time
import httpx
def start_llamafile(
    llamafile_path: str,
    model_path: str,
    port: int = 8080,
    host: str = "127.0.0.1",
) -> subprocess.Popen:
    """Start llamafile server as background process."""
    cmd = [
        llamafile_path,
        "--server",
        "-m", model_path,
        "--nobrowser",
        "--port", str(port),
        "--host", host,
    ]
    process = subprocess.Popen(
        cmd,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
    )
    _wait_for_server(host, port)
    return process

def _wait_for_server(host: str, port: int, timeout: int = 30) -> None:
    """Wait for server to respond to health checks."""
    url = f"http://{host}:{port}/health"
    start = time.time()
    while time.time() - start < timeout:
        try:
            response = httpx.get(url, timeout=2)
            if response.status_code == 200:
                return
        except httpx.RequestError:
            pass
        time.sleep(0.5)
    raise TimeoutError(f"Server did not start within {timeout} seconds")
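A usage sketch for the helper above; the paths are placeholders borrowed from the configuration file pattern below.
# Example usage of start_llamafile; paths are placeholders
process = start_llamafile(
    llamafile_path="/home/user/.local/bin/llamafile",
    model_path="/home/user/.local/share/app-name/models/gemma-3-3b.gguf",
)
try:
    print("llamafile server is up on http://127.0.0.1:8080")
    # ... make API calls here ...
finally:
    process.terminate()
    process.wait()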
Configuration File Pattern
Example TOML configuration for applications using llamafile:
# ~/.config/app-name/config.toml
[ai]
model = "llamafile/gemma-3-3b" # Must use llamafile/ prefix
temperature = 0.3
max_tokens = 200
[llamafile]
path = "/home/user/.local/bin/llamafile"
model_path = "/home/user/.local/share/app-name/models/gemma-3-3b.gguf"
api_base = "http://127.0.0.1:8080/v1" # Include /v1 suffix
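A minimal sketch for consuming this configuration, assuming Python 3.11+ for the standard-library tomllib module (older versions can use the tomli package) and the LiteLLM integration shown earlier:
import tomllib  # Python 3.11+; use the tomli package on older versions
from pathlib import Path

import litellm

config_path = Path.home() / ".config" / "app-name" / "config.toml"
with config_path.open("rb") as f:
    config = tomllib.load(f)

response = litellm.completion(
    model=config["ai"]["model"],               # "llamafile/gemma-3-3b"
    api_base=config["llamafile"]["api_base"],  # includes the /v1 suffix
    messages=[{"role": "user", "content": "Hello"}],
    temperature=config["ai"]["temperature"],
    max_tokens=config["ai"]["max_tokens"],
)
print(response.choices[0].message.content)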
Troubleshooting
Server Fails to Start
Check if port is already in use:
# Find process using port 8080
lsof -i :8080
# Kill existing process
kill $(lsof -t -i :8080)
Verify model file exists and is readable:
ls -lh /path/to/model.gguf
Check llamafile binary permissions:
ls -la /path/to/llamafile
# Should show: -rwxr-xr-x (executable)
# Fix permissions if needed
chmod 755 /path/to/llamafile
Connection Refused Errors
Verify server is running:
# Check health endpoint
curl http://localhost:8080/health
# Check server is listening
netstat -tlnp | grep 8080
# or
lsof -i :8080
Common causes:
- Server not started with the --server flag
- Wrong port number (8080 vs 8000)
- Missing /v1 in the API URL path
- Server bound to 127.0.0.1 but accessed from another machine
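A small, hypothetical diagnostic helper that checks for the causes listed above; the diagnose function is an illustration only, assuming the default port and an httpx dependency.
import httpx

def diagnose(api_base: str = "http://127.0.0.1:8080/v1") -> None:
    """Rough connectivity diagnosis for the common causes listed above."""
    if not api_base.rstrip("/").endswith("/v1"):
        print("api_base is missing the /v1 suffix")
    root = api_base.rstrip("/").removesuffix("/v1")
    try:
        health = httpx.get(f"{root}/health", timeout=5)
        print(f"/health responded with HTTP {health.status_code}")
    except httpx.ConnectError:
        print("Connection refused: is the server running with --server on this port?")

diagnose()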
API Errors
Test basic connectivity:
# Verbose health check
curl -v http://localhost:8080/health
# Test chat completions with verbose output
curl -v http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{"model":"test","messages":[{"role":"user","content":"Hi"}]}'
Common API issues:
| Error | Cause | Solution |
|---|---|---|
| 404 Not Found | Missing /v1 in URL | Add /v1 before endpoint path |
| Connection refused | Server not running | Start server with --server flag |
| Timeout | Model loading slowly | Wait longer or use smaller model |
| Invalid model | Wrong model path | Verify -m path to GGUF file |
Performance Issues
Optimize inference speed:
- Use quantized models (Q4_K_M recommended)
- Enable GPU acceleration: --n-gpu-layers 99
- Increase threads: --threads 8
- Enable continuous batching: --cont-batching
- Reduce context size if not needed: --ctx-size 2048
Check GPU availability:
# NVIDIA GPU
nvidia-smi
# AMD GPU
rocm-smi
# Apple Metal (check activity monitor)
Common Pitfalls
Avoid these frequent errors when using llamafile:
- Port 8000 vs 8080: Llamafile defaults to port 8080, not 8000
- Missing /v1 in API URL: Always include the /v1 suffix for OpenAI-compatible endpoints
- LiteLLM prefix: Must use the llamafile/ prefix in the model name for proper routing
- API key confusion: No real API key needed, but some clients require a placeholder value
- Starting server from hooks: Application hooks should check if the server is running, not start it
- Model path issues: Ensure the GGUF file exists and is readable before starting the server
- Binary permissions: Llamafile must be executable (chmod 755)
- GPU layers on CPU: Setting --n-gpu-layers on CPU-only systems causes errors
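The sketch below is a hypothetical pre-flight helper that catches a few of these pitfalls before starting anything; check_llamafile_config is an illustration, not part of llamafile or LiteLLM.
import os

def check_llamafile_config(model_name: str, api_base: str,
                           llamafile_path: str, model_path: str) -> list[str]:
    """Hypothetical pre-flight check covering several of the pitfalls above."""
    problems = []
    if not model_name.startswith("llamafile/"):
        problems.append("LiteLLM model name should use the llamafile/ prefix")
    if not api_base.rstrip("/").endswith("/v1"):
        problems.append("api_base should end with /v1")
    if ":8000" in api_base:
        problems.append("llamafile defaults to port 8080, not 8000")
    if not os.access(llamafile_path, os.X_OK):
        problems.append("llamafile binary is not executable (chmod 755)")
    if not os.path.isfile(model_path):
        problems.append("GGUF model file does not exist at the given path")
    return problems

print(check_llamafile_config(
    "llamafile/gemma-3-3b",
    "http://127.0.0.1:8080/v1",
    "/home/user/.local/bin/llamafile",
    "/home/user/.local/share/app-name/models/gemma-3-3b.gguf",
))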
Version Information
Current stable version: 0.9.3 (May 14, 2025)
Version constants:
LLAMAFILE_MAJOR = 0
LLAMAFILE_MINOR = 9
LLAMAFILE_PATCH = 3
Recent changes in 0.9.3:
- Added Phi4 model support
- Added Qwen3 model support
- Respects NO_COLOR environment variable
- Fixed URL handling in JavaScript (preserves path when building relative URLs)
- Added Plaintext output option to LocalScore
Related Skills and Tools
Skills to activate:
- litellm - For unified LLM provider interface and routing: Skill(command: "litellm")
External tools:
- LiteLLM - Unified interface for multiple LLM providers
- OpenAI Python SDK - Direct OpenAI-compatible API access
- llama.cpp - Underlying inference engine
- GGUF format - Model format specification
References
Official Documentation
- Mozilla llamafile GitHub - Primary repository and source code
- Mozilla llamafile Documentation - Official documentation site
- LiteLLM llamafile Provider - LiteLLM integration guide
- llama.cpp Server Documentation - Underlying server implementation
- Releases Page - Binary downloads and changelog
Model Resources
- Hugging Face Mozilla Models - Official Mozilla GGUF models
- GGUF Format Specification - Model file format details
Related Technologies
- Cosmopolitan Libc - Cross-platform binary format
- llama.cpp - LLM inference engine
- OpenAI API Reference - API compatibility reference