| name | Qwen-Ollama |
| description | Using Qwen 2.5 models via Ollama for local LLM inference, text analysis, and AI-powered automation |
Qwen via Ollama
Local LLM inference using Qwen 2.5 models through Ollama. Enables text analysis, summarization, code generation, and structured data analysis without cloud dependencies.
Instructions
When helping users with Ollama and Qwen models, follow these guidelines:
- Installation First: Always verify Ollama is installed and the desired model is pulled before attempting API calls
- Use Appropriate Model Size: Recommend qwen2.5:7b for balanced performance (4.7 GB), or smaller/larger based on available resources
- Set Proper Timeouts: Default 120s timeout for analysis tasks, longer for complex generation
- Handle Streaming: Use "stream": false for simple cases, streaming for real-time feedback
- System Prompts: Define personality and role in the system message for consistent behavior
- Validate Responses: Always check the done field and handle partial responses appropriately
Examples
Example 1: Basic Installation and Setup
User: I want to use Qwen locally for text analysis
Claude: I'll help you set up Ollama with Qwen 2.5:
1. Install Ollama service
2. Pull the qwen2.5:7b model (4.7 GB)
3. Verify it's working
[Provides installation commands from Quick Setup pattern below]
Example 2: Rust Integration for Analysis
User: I need to analyze server logs with an LLM in my Rust application
Claude: I'll create an Ollama client using reqwest to send logs to Qwen 2.5:
- Async client with timeout
- JSON serialization for requests/responses
- Error handling for network issues
[Provides implementation using Rust Client pattern below]
Example 3: System Prompt for Personality
User: How do I make the LLM respond with a specific personality?
Claude: Use the chat API with a system message to define behavior:
- System role sets the personality/instructions
- User role contains the actual query
- The model maintains this context throughout
[Provides chat completion example from System Prompts pattern below]
Reference Implementation Details
The sections below contain proven working patterns from production systems.
Reference Files in This Folder:
OllamaClient.rs - Complete Rust client implementation from the Marvinous project
Quick Setup
Installation:
# Install Ollama
curl -fsSL https://ollama.com/install.sh | sh
# Pull Qwen 2.5 model (7B variant, 4.7 GB)
ollama pull qwen2.5:7b
# Verify installation
ollama list
systemctl status ollama
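You can also verify from application code that the model is pulled before sending prompts: Ollama's GET /api/tags endpoint lists locally available models. A minimal Rust sketch, assuming the reqwest and serde dependencies from the Rust Client pattern below:
use reqwest::Client;
use serde::Deserialize;
#[derive(Deserialize)]
struct TagsResponse {
    models: Vec<ModelEntry>,
}
#[derive(Deserialize)]
struct ModelEntry {
    name: String,
}
// Returns true if the given model (e.g. "qwen2.5:7b") is already pulled.
async fn model_available(endpoint: &str, model: &str) -> Result<bool, reqwest::Error> {
    let tags: TagsResponse = Client::new()
        .get(format!("{}/api/tags", endpoint))
        .send()
        .await?
        .json()
        .await?;
    Ok(tags.models.iter().any(|m| m.name == model))
}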
Model Variants:
- qwen2.5:0.5b - Tiny (500 MB) for testing
- qwen2.5:7b - Balanced (4.7 GB), recommended
- qwen2.5:14b - Better quality (8.7 GB)
- qwen2.5:32b - Highest quality (19 GB)
Basic API Patterns
Generate Completion (Simple Text)
Endpoint: POST http://localhost:11434/api/generate
curl -X POST http://localhost:11434/api/generate \
-H "Content-Type: application/json" \
-d '{
"model": "qwen2.5:7b",
"prompt": "Explain RAID levels in servers",
"stream": false
}'
Response:
{
"model": "qwen2.5:7b",
"response": "RAID (Redundant Array of Independent Disks) provides...",
"done": true
}
Chat Completion (With Context)
Endpoint: POST http://localhost:11434/api/chat
curl -X POST http://localhost:11434/api/chat \
-H "Content-Type: application/json" \
-d '{
"model": "qwen2.5:7b",
"messages": [
{"role": "system", "content": "You are a server monitoring expert."},
{"role": "user", "content": "What does IPMI provide?"}
],
"stream": false
}'
Rust Client Pattern
Location: Marvinous/src/llm/client.rs
Purpose: Async Ollama client with timeout and error handling
Dependencies
[dependencies]
reqwest = { version = "0.12", features = ["json"] }
serde = { version = "1", features = ["derive"] }
serde_json = "1"
tokio = { version = "1", features = ["full"] }
Client Implementation
use reqwest::Client;
use serde::{Deserialize, Serialize};
use std::time::Duration;
#[derive(Serialize)]
struct GenerateRequest {
model: String,
prompt: String,
stream: bool,
}
#[derive(Deserialize)]
struct GenerateResponse {
response: String,
done: bool,
}
pub struct OllamaClient {
client: Client,
endpoint: String,
model: String,
}
impl OllamaClient {
pub fn new(endpoint: &str, model: &str, timeout_secs: u64) -> Self {
let client = Client::builder()
.timeout(Duration::from_secs(timeout_secs))
.build()
.expect("Failed to create HTTP client");
Self {
client,
endpoint: endpoint.to_string(),
model: model.to_string(),
}
}
pub async fn generate(&self, prompt: &str) -> Result<String, Box<dyn std::error::Error>> {
let request = GenerateRequest {
model: self.model.clone(),
prompt: prompt.to_string(),
stream: false,
};
let response = self.client
.post(format!("{}/api/generate", self.endpoint))
.json(&request)
.send()
.await?
.json::<GenerateResponse>()
.await?;
Ok(response.response)
}
}
Usage:
#[tokio::main]
async fn main() {
let client = OllamaClient::new("http://localhost:11434", "qwen2.5:7b", 120);
let result = client.generate("Analyze this system log").await.unwrap();
println!("{}", result);
}
Key Points:
- Timeout prevents hanging on long generations (default 120s)
- Non-streaming mode returns complete response
- Error handling for network and parsing failures (see the sketch below)
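A minimal sketch of caller-side error handling around the client above (the message strings are illustrative); timeouts and refused connections both surface as reqwest errors inside the boxed error:
// Assumes the OllamaClient defined above; distinguishes common failure modes.
pub async fn generate_checked(client: &OllamaClient, prompt: &str) -> Result<String, String> {
    match client.generate(prompt).await {
        Ok(text) => Ok(text),
        Err(e) => {
            if let Some(re) = e.downcast_ref::<reqwest::Error>() {
                if re.is_timeout() {
                    return Err("Ollama request timed out; consider a longer timeout".into());
                }
                if re.is_connect() {
                    return Err("Could not reach Ollama; is the service running?".into());
                }
            }
            Err(format!("Ollama request failed: {e}"))
        }
    }
}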
System Prompts Pattern
Location: Marvinous/src/llm/prompt.rs
Purpose: Define LLM personality and behavior
Chat API with System Message
#[derive(Serialize)]
struct ChatMessage {
role: String,
content: String,
}
#[derive(Serialize)]
struct ChatRequest {
model: String,
messages: Vec<ChatMessage>,
stream: bool,
}
let messages = vec![
ChatMessage {
role: "system".to_string(),
content: "You are Marvin, the Paranoid Android. Respond with existential dread.".to_string(),
},
ChatMessage {
role: "user".to_string(),
content: "How are the servers?".to_string(),
},
];
let request = ChatRequest {
model: "qwen2.5:7b".to_string(),
messages,
stream: false,
};
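To complete the round trip, POST the request to /api/chat and read the assistant's reply from the message field of the response. A minimal sketch, assuming the ChatMessage and ChatRequest types above and a non-streaming call:
use reqwest::Client;
use serde::Deserialize;
use std::time::Duration;
#[derive(Deserialize)]
struct ChatResponse {
    message: AssistantMessage,
    done: bool,
}
#[derive(Deserialize)]
struct AssistantMessage {
    content: String,
}
async fn send_chat(request: &ChatRequest) -> Result<String, Box<dyn std::error::Error>> {
    let client = Client::builder()
        .timeout(Duration::from_secs(120))
        .build()?;
    let reply: ChatResponse = client
        .post("http://localhost:11434/api/chat")
        .json(request)
        .send()
        .await?
        .json()
        .await?;
    if !reply.done {
        // A false done flag indicates a partial response (e.g. generation was cut off).
        return Err("Ollama returned an incomplete chat response".into());
    }
    Ok(reply.message.content)
}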
System Prompt Best Practices:
- Define the role clearly ("You are X")
- Specify output format expectations
- Include personality traits if desired
- Set constraints (length, tone, structure)
- Provide domain context
Common Use Cases
1. Log Analysis
Prompt Pattern:
Analyze these server logs and identify issues:
[log entries]
Focus on:
- Error patterns
- Security events
- Performance anomalies
2. Data Summarization
Prompt Pattern:
Summarize this IPMI sensor data:
{json_data}
Highlight:
- Anomalies or concerning values
- Temperature trends
- Fan speed issues
3. Code Generation
Prompt Pattern:
Write a Rust function that:
- Parses smartctl JSON output
- Extracts drive health metrics
- Returns structured data
Use serde for JSON parsing.
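These prompt patterns can be assembled programmatically before calling the client. A minimal sketch of a hypothetical helper for the log-analysis case (the function name and wording are illustrative, not part of the Marvinous code):
// Hypothetical helper: builds the log-analysis prompt from raw log text.
fn build_log_analysis_prompt(log_entries: &str) -> String {
    format!(
        "Analyze these server logs and identify issues:\n\n{log_entries}\n\n\
         Focus on:\n- Error patterns\n- Security events\n- Performance anomalies"
    )
}
// Usage with the OllamaClient from the Rust Client pattern:
// let analysis = client.generate(&build_log_analysis_prompt(&logs)).await?;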
Model Parameters
Temperature Control
{
"model": "qwen2.5:7b",
"prompt": "...",
"options": {
"temperature": 0.7
}
}
Temperature Values:
- 0.0 - Deterministic (same output every time)
- 0.3-0.7 - Balanced creativity/consistency
- 1.0+ - Maximum creativity/randomness
Context Window
Qwen 2.5 supports a 32K-token context window (approximately 24K words).
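Both parameters can also be set from the Rust client by adding an options object to the request body; temperature and num_ctx (the requested context window in tokens) are standard Ollama options. A minimal sketch extending the GenerateRequest from the Rust Client pattern (the struct name and values are assumptions):
use serde::Serialize;
#[derive(Serialize)]
struct GenerateOptions {
    temperature: f32,
    // Requested context window in tokens; keep within what the model supports.
    num_ctx: u32,
}
#[derive(Serialize)]
struct GenerateRequestWithOptions {
    model: String,
    prompt: String,
    stream: bool,
    options: GenerateOptions,
}
fn summarize_request(prompt: &str) -> GenerateRequestWithOptions {
    GenerateRequestWithOptions {
        model: "qwen2.5:7b".to_string(),
        prompt: prompt.to_string(),
        stream: false,
        options: GenerateOptions {
            temperature: 0.3, // lower temperature for consistent summaries
            num_ctx: 8192,    // larger window for long log or sensor dumps
        },
    }
}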
Troubleshooting
"Connection Refused" Error
Cause: Ollama service not running
Solution:
sudo systemctl start ollama
sudo systemctl status ollama
"Model Not Found" Error
Cause: Model not pulled locally
Solution:
ollama list # Check available models
ollama pull qwen2.5:7b # Pull the model
Timeout Errors
Cause: Generation taking longer than client timeout
Solution:
// Increase timeout for complex tasks
let client = Client::builder()
.timeout(Duration::from_secs(300)) // 5 minutes
.build()?;
Out of Memory
Cause: Model too large for available RAM/VRAM
Solution:
# Use smaller model
ollama pull qwen2.5:0.5b
# Or check memory usage
free -h
nvidia-smi # For GPU memory
VRAM Management
Model Auto-Unloading
Ollama automatically unloads models from VRAM after inactivity to free GPU memory:
Check Current Status:
ollama ps # List loaded models
nvidia-smi # Check VRAM usage
Model Lifecycle:
1. Request arrives → Model loads to VRAM (~4.7 GB for qwen2.5:7b)
2. Inference runs → GPU processes prompt (5-10 seconds)
3. Response sent → Model stays in VRAM (configured timeout)
4. Timeout expires → Model automatically unloaded (VRAM freed)
Configure Auto-Unload Timeout:
Edit /etc/systemd/system/ollama.service.d/models-location.conf:
[Service]
Environment="OLLAMA_MODELS=/var/lib/ollama/models"
Environment="OLLAMA_KEEP_ALIVE=30s"
Timeout Options:
- OLLAMA_KEEP_ALIVE=0 - Unload immediately after each request
- OLLAMA_KEEP_ALIVE=30s - Keep for 30 seconds (recommended for shared GPU)
- OLLAMA_KEEP_ALIVE=5m - Keep for 5 minutes (default)
- OLLAMA_KEEP_ALIVE=-1 - Keep loaded indefinitely
After changes:
sudo systemctl daemon-reload
sudo systemctl restart ollama
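The keep-alive window can also be overridden per request: the /api/generate and /api/chat endpoints accept a keep_alive field that applies to that call only. A minimal sketch of the extra field on the request struct (the values shown are examples):
use serde::Serialize;
#[derive(Serialize)]
struct GenerateRequestWithKeepAlive {
    model: String,
    prompt: String,
    stream: bool,
    // How long Ollama keeps the model in VRAM after this request,
    // e.g. "0s" to unload immediately or "10m" to keep it warm for a batch.
    keep_alive: String,
}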
Manual VRAM Control
Immediately Unload Model:
ollama stop qwen2.5:7b
# Frees VRAM instantly for other GPU work
Pre-Load Model (Warm Start):
echo "test" | ollama run qwen2.5:7b >/dev/null 2>&1
# Loads model into VRAM before batch jobs
VRAM Management Script:
Create ~/scripts/ollama-vram.sh:
#!/bin/bash
case "$1" in
status)
nvidia-smi --query-gpu=memory.used,memory.total --format=csv,noheader
echo ""
ollama ps
;;
unload)
ollama ps | awk 'NR>1 {print $1}' | \
while read model; do ollama stop "$model"; done
echo "VRAM freed"
;;
load)
echo "test" | ollama run qwen2.5:7b >/dev/null 2>&1
ollama ps
;;
esac
Usage:
chmod +x ~/scripts/ollama-vram.sh
# Check VRAM status
~/scripts/ollama-vram.sh status
# Free VRAM before CUDA work
~/scripts/ollama-vram.sh unload
python train_model.py # Your GPU work here
# Pre-warm for batch inference
~/scripts/ollama-vram.sh load
Performance Optimization
GPU Acceleration:
Ollama automatically uses NVIDIA GPUs if available:
# Monitor GPU during generation
nvidia-smi --query-gpu=memory.used,utilization.gpu --format=csv -l 1
Concurrent Requests:
# Increase max loaded models for high concurrency
OLLAMA_MAX_LOADED_MODELS=2 ollama serve
Force CPU-Only:
# Disable GPU (use CPU inference)
CUDA_VISIBLE_DEVICES="" ollama serve
Production Deployment
Systemd Service
Ollama installs as a systemd service automatically:
# Check status
systemctl status ollama
# View logs
journalctl -u ollama -f
# Restart service
sudo systemctl restart ollama
Configuration
Edit /etc/systemd/system/ollama.service to set environment variables:
[Service]
Environment="OLLAMA_HOST=0.0.0.0:11434"
Environment="OLLAMA_MAX_LOADED_MODELS=2"
Then reload:
sudo systemctl daemon-reload
sudo systemctl restart ollama
Best Practices Summary
- Validate Installation: Always check ollama list before assuming model availability
- Set Appropriate Timeouts: 120s for analysis, 300s for complex generation
- Use System Prompts: Define behavior in system message for consistency
- Handle Errors: Network issues, timeouts, and parsing failures are common
- Monitor Resources: Watch GPU/CPU memory during sustained workloads
- Cache Results: Store frequently-used completions to reduce inference time (see the sketch after this list)
- Keep Prompts Focused: Clear, specific instructions produce better results
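For the caching point, a minimal sketch of an in-memory wrapper around the client from the Rust Client pattern (a plain HashMap keyed by exact prompt text; illustrative only, not part of the Marvinous code):
use std::collections::HashMap;
// Hypothetical wrapper that memoizes completions by prompt.
pub struct CachedOllama {
    client: OllamaClient,
    cache: HashMap<String, String>,
}
impl CachedOllama {
    pub fn new(client: OllamaClient) -> Self {
        Self { client, cache: HashMap::new() }
    }
    pub async fn generate(&mut self, prompt: &str) -> Result<String, Box<dyn std::error::Error>> {
        if let Some(hit) = self.cache.get(prompt) {
            return Ok(hit.clone()); // repeated prompts skip inference entirely
        }
        let result = self.client.generate(prompt).await?;
        self.cache.insert(prompt.to_string(), result.clone());
        Ok(result)
    }
}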
Reference Implementation
See Marvinous project for complete production example:
/home/matt/Marvinous/src/llm/client.rs- Ollama API client/home/matt/Marvinous/src/llm/prompt.rs- Prompt building/etc/marvinous/system-prompt.txt- System prompt with Marvin's personality/etc/marvinous/marvinous.toml- Configuration with model/endpoint settings