| name | llamafile |
| description | When setting up local LLM inference without cloud APIs. When running GGUF models locally. When needing OpenAI-compatible API from a local model. When building offline/air-gapped AI tools. When troubleshooting local LLM server connections. |
Llamafile
Configure and manage Mozilla Llamafile - a cross-platform executable distribution format that runs LLMs locally with an OpenAI-compatible API.
When to Use This Skill
Use this skill when:
- Installing llamafile binary and GGUF model files
- Starting llamafile server with optimal configuration
- Integrating llamafile with LiteLLM or OpenAI SDK
- Configuring llamafile for different performance profiles (GPU, CPU, network access)
- Troubleshooting llamafile server startup or API connection issues
- Building applications requiring local LLM inference
- Setting up commit message tools, code review systems, or other developer tools with local AI
- Managing llamafile as a background service
- Selecting and downloading appropriate GGUF models
- Validating OpenAI-compatible API responses
Core Capabilities
What Llamafile Provides
Llamafile combines llama.cpp with Cosmopolitan Libc to create single-file executables that:
- Run on macOS, Windows, Linux, FreeBSD, OpenBSD, NetBSD
- Support AMD64 and ARM64 architectures
- Serve OpenAI-compatible HTTP API on localhost
- Load GGUF model files for inference
- Provide /health endpoint for monitoring
- Support GPU acceleration (CUDA, Metal, Vulkan)
- Enable embeddings generation with --embedding flag
API Compatibility
Llamafile exposes these OpenAI-compatible endpoints when running with --server:
| Endpoint | Description | Requirements |
|---|---|---|
| http://localhost:8080/v1/chat/completions | Chat completions (primary) | Server mode |
| http://localhost:8080/v1/completions | Text completions | Server mode |
| http://localhost:8080/v1/embeddings | Generate embeddings | --embedding flag |
| http://localhost:8080/health | Health check | Server mode |
Critical Detail: All OpenAI-compatible endpoints require /v1 prefix in the URL path.
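To illustrate the prefix rule, here is a minimal connectivity sketch, assuming a server is already running on the default localhost:8080 and using httpx (the same client used by the management script later in this document): /health sits at the root path, while chat completions sit under /v1.
import httpx

BASE = "http://localhost:8080"

# Health check lives at the root path, without the /v1 prefix
print(httpx.get(f"{BASE}/health", timeout=5).text)

# OpenAI-compatible endpoints require the /v1 prefix
response = httpx.post(
    f"{BASE}/v1/chat/completions",
    json={
        "model": "local",
        "messages": [{"role": "user", "content": "Say hello"}],
        "max_tokens": 32,
    },
    timeout=60,
)
print(response.json()["choices"][0]["message"]["content"])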
Installation
Download Llamafile Binary
# Download llamafile v0.9.3 binary
curl -L -o llamafile https://github.com/mozilla-ai/llamafile/releases/download/0.9.3/llamafile-0.9.3
# Make executable
chmod 755 llamafile
# Verify version
./llamafile --version
Alternative download sources:
- GitHub Release: https://github.com/mozilla-ai/llamafile/releases/download/0.9.3/llamafile-0.9.3
- SourceForge Mirror: https://sourceforge.net/projects/llamafile.mirror/files/0.9.3/
Download GGUF Model
Llamafile requires GGUF format models. Download from Hugging Face:
# Recommended: Gemma 3 3B (balanced speed/quality, ~2GB)
curl -L -o gemma-3-3b.gguf \
https://huggingface.co/Mozilla/gemma-3-3b-it-gguf/resolve/main/gemma-3-3b-it-Q4_K_M.gguf
# Alternative: Pre-packaged llamafile with embedded model
curl -LO https://huggingface.co/Mozilla/llava-v1.5-7b-llamafile/resolve/main/llava-v1.5-7b-q4.llamafile
chmod +x llava-v1.5-7b-q4.llamafile
Recommended models by use case:
| Model | Size | Use Case | Download |
|---|---|---|---|
| Gemma 3 3B | ~2GB | Balanced speed/quality | Mozilla/gemma-3-3b-it-gguf |
| Qwen3-0.6B | ~500MB | Fast, lower quality | Mozilla/Qwen3-0.6B-gguf |
| Mistral 7B | ~4GB | Higher quality, slower | Mozilla/Mistral-7B-gguf |
| Llama 3.1 8B | ~5GB | Best quality, slowest | Mozilla/Llama-3.1-8B-gguf |
Quantization recommendation: Use Q4_K_M quantized models for optimal balance of quality and performance.
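If Python tooling is preferred over curl, the same GGUF file can be fetched with the huggingface_hub client. This is a sketch only; the repository and filename are reused from the curl example above, so verify they match the model you actually want.
from huggingface_hub import hf_hub_download

# Repo and filename taken from the curl example above; adjust for other models
model_path = hf_hub_download(
    repo_id="Mozilla/gemma-3-3b-it-gguf",
    filename="gemma-3-3b-it-Q4_K_M.gguf",
    local_dir="./models",
)
print(f"Model downloaded to: {model_path}")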
Server Configuration
Basic Server Command
Start llamafile server for local API access:
./llamafile --server \
-m /path/to/model.gguf \
--nobrowser \
--port 8080 \
--host 127.0.0.1
Critical flags explained:
- --server: Required to enable HTTP API endpoints
- -m: Path to GGUF model file (required)
- --nobrowser: Prevents auto-opening browser on startup
- --port 8080: Default port (note: NOT 8000)
- --host 127.0.0.1: Localhost only (secure default)
Performance-Optimized Configuration
For GPU-accelerated inference with higher throughput:
./llamafile --server \
-m /path/to/model.gguf \
--nobrowser \
--port 8080 \
--host 127.0.0.1 \
--ctx-size 4096 \
--n-gpu-layers 99 \
--threads 8 \
--cont-batching \
--parallel 4
Advanced flags:
| Flag | Purpose | Default | When to Use |
|---|---|---|---|
| --ctx-size | Prompt context window size | 512 | Increase for longer conversations |
| --n-gpu-layers | GPU offload layer count | 0 | Set to 99 to offload all layers to GPU |
| --threads | CPU threads for generation | Auto | Set explicitly for consistent performance |
| --threads-batch | Threads for batch processing | Same as --threads | Tune separately for prompt vs generation |
| --cont-batching | Continuous batching | Off | Enable for multiple concurrent requests |
| --parallel | Parallel sequence count | 1 | Increase for concurrent request handling |
| --mlock | Lock model in memory | Off | Prevent swapping on systems with sufficient RAM |
| --embedding | Enable embeddings endpoint | Off | Required for /v1/embeddings API |
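The sketch below assembles the flags above into a command list from Python, enabling GPU offload only when a GPU appears to be available. The build_server_cmd helper and the nvidia-smi check are assumptions for illustration (the check is a rough heuristic that only covers NVIDIA), not part of llamafile itself.
import shutil

def build_server_cmd(llamafile_path: str, model_path: str, use_gpu: bool) -> list[str]:
    """Assemble a llamafile server command using the performance flags above."""
    cmd = [
        llamafile_path, "--server",
        "-m", model_path,
        "--nobrowser",
        "--port", "8080",
        "--host", "127.0.0.1",
        "--ctx-size", "4096",
        "--threads", "8",
        "--cont-batching",
        "--parallel", "4",
    ]
    if use_gpu:
        # Only offload layers when a GPU is actually present (see Common Pitfalls)
        cmd += ["--n-gpu-layers", "99"]
    return cmd

# Rough heuristic: treat the presence of nvidia-smi as "GPU available"
cmd = build_server_cmd(
    "./llamafile",
    "./models/gemma-3-3b.gguf",
    use_gpu=shutil.which("nvidia-smi") is not None,
)
print(" ".join(cmd))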
Network-Accessible Configuration
To allow connections from other machines (development/testing only):
./llamafile --server \
-m /path/to/model.gguf \
--nobrowser \
--host 0.0.0.0 \
--port 8080
Security warning: Binding to 0.0.0.0 exposes the API to network access. Use only in trusted environments.
API Integration
Using LiteLLM (Recommended)
LiteLLM provides a unified interface for llamafile and cloud LLM providers.
import litellm
response = litellm.completion(
    model="llamafile/gemma-3-3b",  # MUST use llamafile/ prefix
    messages=[{"role": "user", "content": "Hello, world!"}],
    api_base="http://localhost:8080/v1",  # MUST include /v1 suffix
    temperature=0.3,
    max_tokens=200
)
print(response.choices[0].message.content)
Critical requirements for LiteLLM:
- Model name MUST use the llamafile/ prefix for routing
- api_base MUST include the /v1 suffix
- No API key required (any placeholder value works)
Related skill: For comprehensive LiteLLM configuration, activate the litellm skill:
Skill(command: "litellm")
Using OpenAI Python SDK
Direct integration with OpenAI SDK for llamafile endpoints:
from openai import OpenAI
client = OpenAI(
    base_url="http://localhost:8080/v1",  # MUST include /v1
    api_key="sk-no-key-required"  # Any value works
)

response = client.chat.completions.create(
    model="local-model",  # Model name is flexible
    messages=[
        {"role": "user", "content": "Hello, world!"}
    ],
    temperature=0.3,
    max_tokens=200
)
print(response.choices[0].message.content)
Using curl for Testing
Verify llamafile server is responding correctly:
# Health check
curl http://localhost:8080/health
# Chat completions
curl http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "local",
"messages": [{"role": "user", "content": "Hello"}],
"temperature": 0.3,
"max_tokens": 200
}'
# Embeddings (requires --embedding flag on server)
curl http://localhost:8080/v1/embeddings \
-H "Content-Type: application/json" \
-d '{
"model": "local",
"input": ["Hello world"]
}'
Server Management
Process Management Script
Python script to start llamafile as a background process with health checking:
import subprocess
import time
import httpx
def start_llamafile(
    llamafile_path: str,
    model_path: str,
    port: int = 8080,
    host: str = "127.0.0.1",
) -> subprocess.Popen:
    """Start llamafile server as background process."""
    cmd = [
        llamafile_path,
        "--server",
        "-m", model_path,
        "--nobrowser",
        "--port", str(port),
        "--host", host,
    ]
    process = subprocess.Popen(
        cmd,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
    )
    _wait_for_server(host, port)
    return process

def _wait_for_server(host: str, port: int, timeout: int = 30) -> None:
    """Wait for server to respond to health checks."""
    url = f"http://{host}:{port}/health"
    start = time.time()
    while time.time() - start < timeout:
        try:
            response = httpx.get(url, timeout=2)
            if response.status_code == 200:
                return
        except httpx.RequestError:
            pass
        time.sleep(0.5)
    raise TimeoutError(f"Server did not start within {timeout} seconds")
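A usage sketch for the helper above; the paths are placeholders borrowed from the configuration file pattern below.
# Example usage of start_llamafile; paths are placeholders
process = start_llamafile(
    llamafile_path="/home/user/.local/bin/llamafile",
    model_path="/home/user/.local/share/app-name/models/gemma-3-3b.gguf",
)
try:
    print("llamafile server is up on http://127.0.0.1:8080")
    # ... make API calls here ...
finally:
    process.terminate()
    process.wait()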
Configuration File Pattern
Example TOML configuration for applications using llamafile:
# ~/.config/app-name/config.toml
[ai]
model = "llamafile/gemma-3-3b" # Must use llamafile/ prefix
temperature = 0.3
max_tokens = 200
[llamafile]
path = "/home/user/.local/bin/llamafile"
model_path = "/home/user/.local/share/app-name/models/gemma-3-3b.gguf"
api_base = "http://127.0.0.1:8080/v1" # Include /v1 suffix
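A minimal sketch for consuming this configuration, assuming Python 3.11+ for the standard-library tomllib module (older versions can use the tomli package) and the LiteLLM integration shown earlier:
import tomllib  # Python 3.11+; use the tomli package on older versions
from pathlib import Path

import litellm

config_path = Path.home() / ".config" / "app-name" / "config.toml"
with config_path.open("rb") as f:
    config = tomllib.load(f)

response = litellm.completion(
    model=config["ai"]["model"],               # "llamafile/gemma-3-3b"
    api_base=config["llamafile"]["api_base"],  # includes the /v1 suffix
    messages=[{"role": "user", "content": "Hello"}],
    temperature=config["ai"]["temperature"],
    max_tokens=config["ai"]["max_tokens"],
)
print(response.choices[0].message.content)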
Troubleshooting
Server Fails to Start
Check if port is already in use:
# Find process using port 8080
lsof -i :8080
# Kill existing process
kill $(lsof -t -i :8080)
Verify model file exists and is readable:
ls -lh /path/to/model.gguf
Check llamafile binary permissions:
ls -la /path/to/llamafile
# Should show: -rwxr-xr-x (executable)
# Fix permissions if needed
chmod 755 /path/to/llamafile
Connection Refused Errors
Verify server is running:
# Check health endpoint
curl http://localhost:8080/health
# Check server is listening
netstat -tlnp | grep 8080
# or
lsof -i :8080
Common causes:
- Server not started with the --server flag
- Wrong port number (8080 vs 8000)
- Missing /v1 in the API URL path
- Server bound to 127.0.0.1 but accessed from another machine
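A small, hypothetical diagnostic helper that checks for the causes listed above; the diagnose function is an illustration only, assuming the default port and an httpx dependency.
import httpx

def diagnose(api_base: str = "http://127.0.0.1:8080/v1") -> None:
    """Rough connectivity diagnosis for the common causes listed above."""
    if not api_base.rstrip("/").endswith("/v1"):
        print("api_base is missing the /v1 suffix")
    root = api_base.rstrip("/").removesuffix("/v1")
    try:
        health = httpx.get(f"{root}/health", timeout=5)
        print(f"/health responded with HTTP {health.status_code}")
    except httpx.ConnectError:
        print("Connection refused: is the server running with --server on this port?")

diagnose()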
API Errors
Test basic connectivity:
# Verbose health check
curl -v http://localhost:8080/health
# Test chat completions with verbose output
curl -v http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{"model":"test","messages":[{"role":"user","content":"Hi"}]}'
Common API issues:
| Error | Cause | Solution |
|---|---|---|
| 404 Not Found | Missing /v1 in URL | Add /v1 before endpoint path |
| Connection refused | Server not running | Start server with --server flag |
| Timeout | Model loading slowly | Wait longer or use smaller model |
| Invalid model | Wrong model path | Verify -m path to GGUF file |
Performance Issues
Optimize inference speed:
- Use quantized models (Q4_K_M recommended)
- Enable GPU acceleration: --n-gpu-layers 99
- Increase threads: --threads 8
- Enable continuous batching: --cont-batching
- Reduce context size if not needed: --ctx-size 2048
Check GPU availability:
# NVIDIA GPU
nvidia-smi
# AMD GPU
rocm-smi
# Apple Metal (check activity monitor)
Common Pitfalls
Avoid these frequent errors when using llamafile:
- Port 8000 vs 8080: Llamafile defaults to port 8080, not 8000
- Missing /v1 in API URL: Always include the /v1 suffix for OpenAI-compatible endpoints
- LiteLLM prefix: Must use the llamafile/ prefix in the model name for proper routing
- API key confusion: No real API key needed, but some clients require a placeholder value
- Starting server from hooks: Application hooks should check if the server is running, not start it
- Model path issues: Ensure the GGUF file exists and is readable before starting the server
- Binary permissions: Llamafile must be executable (chmod 755)
- GPU layers on CPU: Setting --n-gpu-layers on CPU-only systems causes errors
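The sketch below is a hypothetical pre-flight helper that catches a few of these pitfalls before starting anything; check_llamafile_config is an illustration, not part of llamafile or LiteLLM.
import os

def check_llamafile_config(model_name: str, api_base: str,
                           llamafile_path: str, model_path: str) -> list[str]:
    """Hypothetical pre-flight check covering several of the pitfalls above."""
    problems = []
    if not model_name.startswith("llamafile/"):
        problems.append("LiteLLM model name should use the llamafile/ prefix")
    if not api_base.rstrip("/").endswith("/v1"):
        problems.append("api_base should end with /v1")
    if ":8000" in api_base:
        problems.append("llamafile defaults to port 8080, not 8000")
    if not os.access(llamafile_path, os.X_OK):
        problems.append("llamafile binary is not executable (chmod 755)")
    if not os.path.isfile(model_path):
        problems.append("GGUF model file does not exist at the given path")
    return problems

print(check_llamafile_config(
    "llamafile/gemma-3-3b",
    "http://127.0.0.1:8080/v1",
    "/home/user/.local/bin/llamafile",
    "/home/user/.local/share/app-name/models/gemma-3-3b.gguf",
))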
Version Information
Current stable version: 0.9.3 (May 14, 2025)
Version constants:
LLAMAFILE_MAJOR = 0
LLAMAFILE_MINOR = 9
LLAMAFILE_PATCH = 3
Recent changes in 0.9.3:
- Added Phi4 model support
- Added Qwen3 model support
- Respects NO_COLOR environment variable
- Fixed URL handling in JavaScript (preserves path when building relative URLs)
- Added Plaintext output option to LocalScore
Related Skills and Tools
Skills to activate:
- litellm - For unified LLM provider interface and routing: Skill(command: "litellm")
External tools:
- LiteLLM - Unified interface for multiple LLM providers
- OpenAI Python SDK - Direct OpenAI-compatible API access
- llama.cpp - Underlying inference engine
- GGUF format - Model format specification
References
Official Documentation
- Mozilla llamafile GitHub - Primary repository and source code
- Mozilla llamafile Documentation - Official documentation site
- LiteLLM llamafile Provider - LiteLLM integration guide
- llama.cpp Server Documentation - Underlying server implementation
- Releases Page - Binary downloads and changelog
Model Resources
- Hugging Face Mozilla Models - Official Mozilla GGUF models
- GGUF Format Specification - Model file format details
Related Technologies
- Cosmopolitan Libc - Cross-platform binary format
- llama.cpp - LLM inference engine
- OpenAI API Reference - API compatibility reference