| name | llmintegration |
| description | LLM integration patterns for Claude, GPT, Gemini, and Ollama. Activate for AI API integration, prompt engineering, token management, extended thinking, and multi-model orchestration. |
| allowed-tools | Bash, Read, Write, Edit, Glob, Grep |
| related-skills | extended-thinking, complex-reasoning, deep-analysis |
LLM Integration Skill
Provides LLM integration capabilities for the Golden Armada AI Agent Fleet Platform, including extended thinking, prompt engineering, and token budget management.
When to Use This Skill
Activate this skill when working with:
- Claude/Anthropic API integration
- OpenAI GPT integration
- Google Gemini integration
- Ollama local models
- Multi-model orchestration
- Prompt engineering
Anthropic Claude Integration
```python
import os

import anthropic

client = anthropic.Anthropic(api_key=os.environ["ANTHROPIC_API_KEY"])

# Basic completion
message = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=1024,
    messages=[
        {"role": "user", "content": "Hello, Claude!"}
    ]
)
print(message.content[0].text)

# With system prompt
message = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=1024,
    system="You are a helpful coding assistant.",
    messages=[
        {"role": "user", "content": "Write a Python function to sort a list."}
    ]
)

# Streaming
with client.messages.stream(
    model="claude-sonnet-4-20250514",
    max_tokens=1024,
    messages=[{"role": "user", "content": "Tell me a story."}]
) as stream:
    for text in stream.text_stream:
        print(text, end="", flush=True)

# Tool use
tools = [
    {
        "name": "get_weather",
        "description": "Get the current weather in a location",
        "input_schema": {
            "type": "object",
            "properties": {
                "location": {"type": "string", "description": "The city and state"}
            },
            "required": ["location"]
        }
    }
]

message = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=1024,
    tools=tools,
    messages=[{"role": "user", "content": "What's the weather in San Francisco?"}]
)
```
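The request above only defines the tool; Claude's tool call still has to be executed locally and the result sent back in a follow-up message. A minimal sketch that continues the example, with a hypothetical `get_weather()` implementation standing in for a real weather lookup:

```python
# Hypothetical local implementation of the weather tool
def get_weather(location: str) -> str:
    return f"Sunny, 68°F in {location}"

if message.stop_reason == "tool_use":
    tool_use = next(block for block in message.content if block.type == "tool_use")
    tool_result = get_weather(**tool_use.input)

    # Return the tool result so Claude can produce the final answer
    followup = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=1024,
        tools=tools,
        messages=[
            {"role": "user", "content": "What's the weather in San Francisco?"},
            {"role": "assistant", "content": message.content},
            {
                "role": "user",
                "content": [
                    {
                        "type": "tool_result",
                        "tool_use_id": tool_use.id,
                        "content": tool_result,
                    }
                ],
            },
        ],
    )
    print(followup.content[0].text)
```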
OpenAI GPT Integration
```python
import os

from openai import OpenAI

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

# Basic completion
response = client.chat.completions.create(
    model="gpt-4",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Hello!"}
    ]
)
print(response.choices[0].message.content)

# Streaming
stream = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": "Write a poem."}],
    stream=True
)
for chunk in stream:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="")

# Function calling
functions = [
    {
        "name": "get_weather",
        "description": "Get the current weather",
        "parameters": {
            "type": "object",
            "properties": {
                "location": {"type": "string"}
            },
            "required": ["location"]
        }
    }
]

response = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": "Weather in NYC?"}],
    functions=functions,
    function_call="auto"
)
```
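Note that `functions`/`function_call` is OpenAI's legacy interface; current SDK versions expose the same capability through `tools` and `tool_choice`. A roughly equivalent sketch using that form:

```python
tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get the current weather",
            "parameters": {
                "type": "object",
                "properties": {"location": {"type": "string"}},
                "required": ["location"],
            },
        },
    }
]

response = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": "Weather in NYC?"}],
    tools=tools,
    tool_choice="auto",
)

# Tool calls (if any) arrive as structured name/arguments pairs
for tool_call in response.choices[0].message.tool_calls or []:
    print(tool_call.function.name, tool_call.function.arguments)
```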
Google Gemini Integration
```python
import os

import google.generativeai as genai

genai.configure(api_key=os.environ["GOOGLE_API_KEY"])

model = genai.GenerativeModel('gemini-pro')

# Basic generation
response = model.generate_content("Explain quantum computing")
print(response.text)

# Chat
chat = model.start_chat(history=[])
response = chat.send_message("Hello!")
print(response.text)

# Streaming
response = model.generate_content("Tell me a story", stream=True)
for chunk in response:
    print(chunk.text, end="")
```
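Sampling parameters can be tuned per call. A small sketch, assuming the dict form of `generation_config` accepted by the `google-generativeai` SDK:

```python
response = model.generate_content(
    "Summarize the benefits of unit testing in three bullet points",
    generation_config={
        "temperature": 0.3,        # lower temperature for more deterministic output
        "max_output_tokens": 256,  # cap the response length
    },
)
print(response.text)
```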
Ollama Local Models
```python
import ollama

# Basic completion
response = ollama.chat(
    model='llama2',
    messages=[
        {'role': 'user', 'content': 'Hello!'}
    ]
)
print(response['message']['content'])

# Streaming
stream = ollama.chat(
    model='llama2',
    messages=[{'role': 'user', 'content': 'Tell me a story.'}],
    stream=True
)
for chunk in stream:
    print(chunk['message']['content'], end='')

# Pull model
ollama.pull('llama2')

# List models
models = ollama.list()
```
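Sampling options and a non-default server address can also be passed. A minimal sketch, assuming the `options` dict and the `Client(host=...)` constructor provided by the `ollama` package:

```python
from ollama import Client

# Point at a non-default Ollama server (host value is illustrative)
client = Client(host='http://localhost:11434')

response = client.chat(
    model='llama2',
    messages=[{'role': 'user', 'content': 'Explain list comprehensions.'}],
    options={
        'temperature': 0.2,  # more deterministic output
        'num_ctx': 4096,     # context window size
    },
)
print(response['message']['content'])
```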
Multi-Model Abstraction
```python
from abc import ABC, abstractmethod
from typing import Generator

import anthropic


class LLMProvider(ABC):
    @abstractmethod
    def generate(self, prompt: str, **kwargs) -> str:
        pass

    @abstractmethod
    def stream(self, prompt: str, **kwargs) -> Generator[str, None, None]:
        pass


class ClaudeProvider(LLMProvider):
    def __init__(self, api_key: str, model: str = "claude-sonnet-4-20250514"):
        self.client = anthropic.Anthropic(api_key=api_key)
        self.model = model

    def generate(self, prompt: str, **kwargs) -> str:
        message = self.client.messages.create(
            model=self.model,
            max_tokens=kwargs.get('max_tokens', 1024),
            messages=[{"role": "user", "content": prompt}]
        )
        return message.content[0].text

    def stream(self, prompt: str, **kwargs) -> Generator[str, None, None]:
        with self.client.messages.stream(
            model=self.model,
            max_tokens=kwargs.get('max_tokens', 1024),
            messages=[{"role": "user", "content": prompt}]
        ) as stream:
            for text in stream.text_stream:
                yield text


class LLMFactory:
    @staticmethod
    def create(provider: str, **kwargs) -> LLMProvider:
        providers = {
            'claude': ClaudeProvider,
            'gpt': GPTProvider,
            'gemini': GeminiProvider,
            'ollama': OllamaProvider
        }
        return providers[provider](**kwargs)
```
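The factory references `GPTProvider`, `GeminiProvider`, and `OllamaProvider`, which are not defined here. As an illustration, a minimal `OllamaProvider` sketch following the same interface and the `ollama` calls shown earlier; the remaining providers would wrap their SDKs the same way:

```python
import ollama


class OllamaProvider(LLMProvider):
    def __init__(self, model: str = "llama2", **kwargs):
        self.model = model

    def generate(self, prompt: str, **kwargs) -> str:
        response = ollama.chat(
            model=self.model,
            messages=[{"role": "user", "content": prompt}],
        )
        return response["message"]["content"]

    def stream(self, prompt: str, **kwargs) -> Generator[str, None, None]:
        stream = ollama.chat(
            model=self.model,
            messages=[{"role": "user", "content": prompt}],
            stream=True,
        )
        for chunk in stream:
            yield chunk["message"]["content"]
```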
Prompt Engineering Best Practices
```python
# Structured prompts
SYSTEM_PROMPT = """You are a helpful coding assistant.

Guidelines:
- Write clean, well-documented code
- Follow best practices
- Explain your reasoning
"""

# Few-shot examples
FEW_SHOT_PROMPT = """Convert natural language to SQL.

Example 1:
Input: Get all users
Output: SELECT * FROM users;

Example 2:
Input: Count active orders
Output: SELECT COUNT(*) FROM orders WHERE status = 'active';

Input: {user_input}
Output:"""

# Chain of thought
COT_PROMPT = """Solve this step by step: {problem}

Let's think through this:
1."""
```
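These templates are plain format strings. A short usage sketch, reusing the `ClaudeProvider` defined in the multi-model abstraction above:

```python
import os

provider = ClaudeProvider(api_key=os.environ["ANTHROPIC_API_KEY"])

# Fill the few-shot template and send it
prompt = FEW_SHOT_PROMPT.format(user_input="Get all orders placed in the last 30 days")
sql = provider.generate(prompt, max_tokens=256)
print(sql)
```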
Extended Thinking Integration
Extended thinking enables Claude models to "think" before responding, improving accuracy on complex tasks like coding, math, and scientific reasoning.
When to Use Extended Thinking
- Complex multi-step reasoning tasks
- Code architecture and system design
- Mathematical problem-solving
- Scientific analysis and research
- Strategic planning and decision-making
Cross-reference: See .claude/skills/extended-thinking/SKILL.md for detailed guidance.
Basic Extended Thinking
```python
import os

import anthropic

client = anthropic.Anthropic(api_key=os.environ["ANTHROPIC_API_KEY"])

# Enable extended thinking (Claude Sonnet 4 and Opus 4)
response = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=16000,
    thinking={
        "type": "enabled",
        "budget_tokens": 10000  # Reserve tokens for thinking (must be less than max_tokens)
    },
    messages=[
        {
            "role": "user",
            "content": "Design a scalable microservices architecture for a multi-tenant SaaS platform"
        }
    ]
)

# Response includes thinking and final text
for block in response.content:
    if block.type == "thinking":
        print(f"Thinking: {block.thinking}")
    elif block.type == "text":
        print(f"Response: {block.text}")
```
Extended Thinking with Streaming
```python
# Stream thinking process in real-time
with client.messages.stream(
    model="claude-sonnet-4-20250514",
    max_tokens=16000,
    thinking={
        "type": "enabled",
        "budget_tokens": 10000
    },
    messages=[
        {"role": "user", "content": "Analyze the time complexity of this sorting algorithm..."}
    ]
) as stream:
    for event in stream:
        if event.type == "content_block_start":
            if event.content_block.type == "thinking":
                print("\n[Thinking Process]")
            elif event.content_block.type == "text":
                print("\n[Final Answer]")
        elif event.type == "content_block_delta":
            if event.delta.type == "thinking_delta":
                print(event.delta.thinking, end="", flush=True)
            elif event.delta.type == "text_delta":
                print(event.delta.text, end="", flush=True)
```
Multi-Provider Extended Thinking Abstraction
```python
import os
from dataclasses import dataclass
from typing import Any, Dict, Optional

import anthropic


@dataclass
class ThinkingConfig:
    enabled: bool = False
    budget_tokens: Optional[int] = None
    show_thinking: bool = True


class ExtendedThinkingProvider:
    """Abstract extended thinking across providers"""

    def __init__(self, provider: str, config: ThinkingConfig):
        self.provider = provider
        self.config = config

    def generate_with_thinking(self, prompt: str, **kwargs) -> Dict[str, Any]:
        if self.provider == "claude":
            return self._claude_thinking(prompt, **kwargs)
        elif self.provider == "gpt":
            # Simulate thinking with chain-of-thought
            return self._gpt_cot_thinking(prompt, **kwargs)
        else:
            raise ValueError(f"Provider {self.provider} doesn't support extended thinking")

    def _claude_thinking(self, prompt: str, **kwargs) -> Dict[str, Any]:
        client = anthropic.Anthropic(api_key=os.environ["ANTHROPIC_API_KEY"])

        thinking_params = {}
        if self.config.enabled:
            thinking_params["thinking"] = {
                "type": "enabled",
                "budget_tokens": self.config.budget_tokens or 10000
            }

        response = client.messages.create(
            model=kwargs.get("model", "claude-sonnet-4-20250514"),
            max_tokens=kwargs.get("max_tokens", 16000),
            messages=[{"role": "user", "content": prompt}],
            **thinking_params
        )

        result = {"thinking": None, "response": None}
        for block in response.content:
            if block.type == "thinking":
                result["thinking"] = block.thinking
            elif block.type == "text":
                result["response"] = block.text
        return result

    def _gpt_cot_thinking(self, prompt: str, **kwargs) -> Dict[str, Any]:
        """Use chain-of-thought prompting for GPT models"""
        from openai import OpenAI
        client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

        cot_prompt = f"""Let's approach this step-by-step:

{prompt}

First, think through the problem systematically, then provide your final answer."""

        response = client.chat.completions.create(
            model=kwargs.get("model", "gpt-4"),
            messages=[{"role": "user", "content": cot_prompt}]
        )

        # Parse thinking from response (heuristic-based)
        content = response.choices[0].message.content
        parts = content.split("\n\n")
        return {
            "thinking": "\n\n".join(parts[:-1]) if len(parts) > 1 else None,
            "response": parts[-1] if parts else content
        }


# Usage
thinking_provider = ExtendedThinkingProvider(
    provider="claude",
    config=ThinkingConfig(enabled=True, budget_tokens=8000)
)

result = thinking_provider.generate_with_thinking(
    "Design a distributed caching strategy for a multi-tenant system"
)

if result["thinking"]:
    print(f"Thinking Process:\n{result['thinking']}\n")
print(f"Final Answer:\n{result['response']}")
```
Claude Prompt Engineering Best Practices
Based on Anthropic's official guidelines for optimal performance.
1. Clear and Direct Instructions
```python
# BAD: Vague request
prompt = "Make this code better"

# GOOD: Specific instructions
prompt = """Refactor this Python function to:
- Use type hints
- Add comprehensive docstrings
- Handle edge cases (empty input, None values)
- Improve variable naming for clarity
- Add input validation

Code:
{code}
"""
```
2. Use XML Tags for Structure
```python
# XML tags help Claude parse complex inputs
prompt = """Analyze this codebase and identify security vulnerabilities.

<codebase>
{code}
</codebase>

Provide output in this format:

<findings>
  <vulnerability>
    <severity>high | medium | low</severity>
    <location>file and line number</location>
    <description>what the issue is</description>
    <remediation>how to fix it</remediation>
  </vulnerability>
</findings>"""
```
3. Provide Examples (Few-Shot Prompting)
```python
FEW_SHOT_TEMPLATE = """Convert user stories to acceptance criteria.

Example:
User story: As a user, I want to reset my password so that I can regain access to my account.
Acceptance criteria:
- A "Forgot password" link is available on the login page
- A reset email is sent to the registered address
- The reset link expires after 24 hours

Now convert this user story:

{user_story}"""
```
4. Assign Roles for Context
```python
ROLE_BASED_SYSTEM_PROMPTS = {
    "code_reviewer": """You are an expert code reviewer with 15 years of experience.

Your expertise includes:
- Software architecture and design patterns
- Security best practices (OWASP Top 10)
- Performance optimization
- Code maintainability and readability

When reviewing code:
- Identify bugs and potential issues
- Suggest improvements for clarity and performance
- Check for security vulnerabilities
- Recommend design pattern improvements
- Ensure code follows language best practices

Be constructive and specific in your feedback.""",

    "architect": """You are a senior software architect specializing in:
- Microservices and distributed systems
- Cloud-native architecture (AWS, GCP, Azure)
- Database design and optimization
- API design (REST, GraphQL, gRPC)
- Security and compliance

When designing systems:
- Consider scalability and performance
- Ensure fault tolerance and resilience
- Design for observability (logging, metrics, tracing)
- Follow cloud-native best practices
- Consider cost optimization""",

    "security_expert": """You are a security specialist focused on:
- OWASP Top 10 vulnerabilities
- Authentication and authorization (OAuth, OIDC, JWT)
- Data encryption and privacy
- Secure coding practices
- Compliance (GDPR, HIPAA, SOC2)

When analyzing security:
- Identify vulnerabilities with severity ratings
- Provide specific remediation steps
- Reference security standards and best practices
- Consider both code-level and architectural security"""
}

# Usage
message = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=4096,
    system=ROLE_BASED_SYSTEM_PROMPTS["code_reviewer"],
    messages=[
        {"role": "user", "content": f"Review this code:\n\n{code}"}
    ]
)
```
5. Chain of Thought Prompting
```python
COT_PROMPT = """Solve this problem step by step, showing your reasoning at each stage.

Problem: {problem}

Think through this by:
- Understanding what's being asked
- Identifying relevant information
- Breaking down the problem into steps
- Solving each step
- Verifying the solution

Show your work for each step."""

# For complex reasoning, combine with extended thinking
response = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=16000,
    thinking={"type": "enabled", "budget_tokens": 10000},
    messages=[
        {"role": "user", "content": COT_PROMPT.format(problem=complex_problem)}
    ]
)
```
6. Prefill Responses for Format Control
```python
# Force JSON output by prefilling assistant response
messages = [
    {"role": "user", "content": "Extract entities from: 'Apple Inc. hired John Smith as CEO in 2023.'"},
    {"role": "assistant", "content": "{"}  # Prefill to force JSON
]

response = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=1024,
    messages=messages
)

# Claude will complete the JSON starting with "{"
json_output = "{" + response.content[0].text
```
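Because the continuation is still free-form text, it is worth validating the reassembled string before using it downstream. A small sketch:

```python
import json

try:
    entities = json.loads(json_output)
    print(entities)
except json.JSONDecodeError:
    # Retry or fall back if the completion was not valid JSON
    print("Model output was not valid JSON:", json_output)
```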
7. Long Context Best Practices
```python
# For Claude's 200k token context window
LONG_CONTEXT_TEMPLATE = """I'm providing a large codebase for analysis. The most important files for this task are at the END of this message.

Focus primarily on the critical files when answering the task."""
```
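One way to assemble such a prompt is to wrap each file in tags and order them so the most relevant files appear last, immediately before the task. A minimal sketch under that assumption (file names and tag names are illustrative):

```python
def build_long_context_prompt(files: dict, task: str) -> str:
    """Concatenate files for a long-context request, least important first."""
    sections = [LONG_CONTEXT_TEMPLATE]
    for path, source in files.items():  # assumed pre-sorted: least important first
        sections.append(f"<file path='{path}'>\n{source}\n</file>")
    sections.append(f"Task: {task}")
    return "\n\n".join(sections)

prompt = build_long_context_prompt(
    {"utils/helpers.py": "...", "core/auth.py": "..."},  # hypothetical files
    task="Identify authentication weaknesses in core/auth.py",
)
```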
Token Budget Management
Anthropic Token Counting
```python
import os

import anthropic

def count_tokens_anthropic(text: str, model: str = "claude-sonnet-4-20250514") -> int:
    """Count tokens using Anthropic's API"""
    client = anthropic.Anthropic(api_key=os.environ["ANTHROPIC_API_KEY"])

    # Use count_tokens endpoint
    result = client.messages.count_tokens(
        model=model,
        messages=[{"role": "user", "content": text}]
    )
    return result.input_tokens

def count_tokens_with_system(messages: list, system: str = None, model: str = "claude-sonnet-4-20250514") -> dict:
    """Count tokens including system prompt and messages"""
    client = anthropic.Anthropic(api_key=os.environ["ANTHROPIC_API_KEY"])

    params = {
        "model": model,
        "messages": messages
    }
    if system:
        params["system"] = system

    result = client.messages.count_tokens(**params)
    return {
        "input_tokens": result.input_tokens,
        "system_tokens": getattr(result, "system_tokens", 0)
    }
```
Smart Token Budget Management
```python
import os
from typing import Any, Dict, List

import anthropic

class TokenBudgetManager:
    """Intelligent token budget management for LLM calls"""

    def __init__(self, model: str = "claude-sonnet-4-20250514"):
        self.model = model
        self.client = anthropic.Anthropic(api_key=os.environ["ANTHROPIC_API_KEY"])

        # Model-specific limits
        self.limits = {
            "claude-opus-4-20250514": {"context": 200000, "output": 16384},
            "claude-sonnet-4-20250514": {"context": 200000, "output": 16384},
            "claude-haiku-3-5-20250514": {"context": 200000, "output": 8192},
        }

    def get_budget(self, model: str = None) -> Dict[str, int]:
        """Get token limits for model"""
        model = model or self.model
        return self.limits.get(model, {"context": 200000, "output": 16384})

    def check_budget(
        self,
        messages: List[Dict],
        system: str = None,
        max_tokens: int = 4096,
        thinking_tokens: int = 0
    ) -> Dict[str, Any]:
        """Check if request fits within budget"""
        # Count input tokens
        token_count = self.client.messages.count_tokens(
            model=self.model,
            messages=messages,
            system=system
        )
        input_tokens = token_count.input_tokens
        budget = self.get_budget()

        # Calculate total required tokens
        total_required = input_tokens + max_tokens + thinking_tokens

        return {
            "fits_budget": total_required <= budget["context"],
            "input_tokens": input_tokens,
            "requested_output": max_tokens,
            "thinking_budget": thinking_tokens,
            "total_required": total_required,
            "context_limit": budget["context"],
            "remaining": budget["context"] - total_required,
            "utilization_pct": (total_required / budget["context"]) * 100
        }

    def optimize_for_budget(
        self,
        messages: List[Dict],
        system: str = None,
        target_output: int = 4096,
        thinking_tokens: int = 0,
        priority_last_n: int = 3
    ) -> List[Dict]:
        """Truncate messages to fit budget, preserving recent context"""
        budget = self.get_budget()
        available = budget["context"] - target_output - thinking_tokens

        # Always keep system prompt and last N messages
        preserved_messages = messages[-priority_last_n:]

        # Count tokens for preserved content
        preserved_count = self.client.messages.count_tokens(
            model=self.model,
            messages=preserved_messages,
            system=system
        ).input_tokens

        if preserved_count <= available:
            # Try to include earlier messages
            earlier_messages = messages[:-priority_last_n]

            # Binary search to find how many earlier messages fit
            left, right = 0, len(earlier_messages)
            best_fit = 0
            while left <= right:
                mid = (left + right) // 2
                test_messages = (earlier_messages[-mid:] + preserved_messages) if mid > 0 else preserved_messages
                test_count = self.client.messages.count_tokens(
                    model=self.model,
                    messages=test_messages,
                    system=system
                ).input_tokens

                if test_count <= available:
                    best_fit = mid
                    left = mid + 1
                else:
                    right = mid - 1

            return earlier_messages[-best_fit:] + preserved_messages if best_fit > 0 else preserved_messages
        else:
            # Even preserved messages exceed budget, truncate them
            return preserved_messages[-1:]  # Keep at least the last message

    def get_recommendations(self, budget_check: Dict) -> List[str]:
        """Get recommendations based on budget utilization"""
        recommendations = []
        util = budget_check["utilization_pct"]

        if util > 90:
            recommendations.append("CRITICAL: Token usage >90%. Consider reducing context or output length.")
            recommendations.append("Enable extended thinking only if necessary for task complexity.")
        elif util > 75:
            recommendations.append("WARNING: Token usage >75%. Monitor context size.")
            recommendations.append("Consider summarizing earlier conversation turns.")
        elif util > 50:
            recommendations.append("Moderate token usage. Budget healthy.")
        else:
            recommendations.append("Low token usage. Budget has plenty of headroom.")

        if budget_check["thinking_budget"] > budget_check["requested_output"]:
            recommendations.append("Thinking budget exceeds output budget. Ensure this is intentional.")

        return recommendations

# Usage example
manager = TokenBudgetManager(model="claude-sonnet-4-20250514")

messages = [
    {"role": "user", "content": "What is Python?"},
    {"role": "assistant", "content": "Python is a high-level programming language..."},
    {"role": "user", "content": "Write a complex microservices architecture"}
]
system = "You are an expert software architect."

# Check budget
budget_check = manager.check_budget(
    messages=messages,
    system=system,
    max_tokens=8192,
    thinking_tokens=10000
)

print(f"Fits budget: {budget_check['fits_budget']}")
print(f"Utilization: {budget_check['utilization_pct']:.1f}%")
print(f"Remaining tokens: {budget_check['remaining']}")

# Get recommendations
for rec in manager.get_recommendations(budget_check):
    print(f"- {rec}")

# Optimize if needed
if not budget_check["fits_budget"]:
    optimized_messages = manager.optimize_for_budget(
        messages=messages,
        system=system,
        target_output=8192,
        thinking_tokens=10000
    )
    print(f"Reduced from {len(messages)} to {len(optimized_messages)} messages")
```
OpenAI Token Counting
```python
from typing import List

import tiktoken

def count_tokens_openai(text: str, model: str = "gpt-4") -> int:
    """Count tokens for OpenAI models"""
    encoding = tiktoken.encoding_for_model(model)
    return len(encoding.encode(text))

def truncate_to_token_limit(text: str, max_tokens: int, model: str = "gpt-4") -> str:
    """Truncate text to fit token limit"""
    encoding = tiktoken.encoding_for_model(model)
    tokens = encoding.encode(text)
    if len(tokens) <= max_tokens:
        return text
    return encoding.decode(tokens[:max_tokens])

def split_by_tokens(text: str, chunk_size: int, model: str = "gpt-4") -> List[str]:
    """Split text into chunks of specific token size"""
    encoding = tiktoken.encoding_for_model(model)
    tokens = encoding.encode(text)

    chunks = []
    for i in range(0, len(tokens), chunk_size):
        chunk_tokens = tokens[i:i + chunk_size]
        chunks.append(encoding.decode(chunk_tokens))
    return chunks
```
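A short usage sketch combining these helpers with the OpenAI client shown earlier to summarize a document that is too large for one request (the file name and prompt wording are illustrative):

```python
import os

from openai import OpenAI

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

document = open("large_report.txt").read()  # hypothetical input file

# Summarize each chunk independently
summaries = []
for chunk in split_by_tokens(document, chunk_size=3000):
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": f"Summarize this section:\n\n{chunk}"}],
    )
    summaries.append(response.choices[0].message.content)

# Merge the per-chunk summaries into one final summary
final = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": "Combine these summaries into one:\n\n" + "\n\n".join(summaries)}],
)
print(final.choices[0].message.content)
```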
Cross-References and Related Skills
Extended Thinking
For complex reasoning tasks requiring deep analysis:
- Skill: .claude/skills/extended-thinking/SKILL.md
- Use when: Multi-step reasoning, architecture design, mathematical proofs
Complex Reasoning
For advanced problem-solving and analysis:
- Skill: .claude/skills/complex-reasoning/SKILL.md
- Use when: System design, optimization problems, strategic planning
Deep Analysis
For comprehensive codebase and system analysis:
- Skill: .claude/skills/deep-analysis/SKILL.md
- Use when: Code review, security audits, performance analysis
Orchestration Integration
When using LLM integration within the Golden Armada orchestration system:
```python
from orchestration.agent_activity_logger import AgentActivityLogger

# Log LLM API calls for tracking
logger = AgentActivityLogger()

logger.log_activity(
    agent_name="code-reviewer",
    activity_type="llm_api_call",
    details={
        "provider": "anthropic",
        "model": "claude-sonnet-4-20250514",
        "input_tokens": 1500,
        "output_tokens": 800,
        "thinking_tokens": 3000,
        "extended_thinking": True,
        "task": "security_audit"
    }
)
```
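Rather than hard-coding token counts, the figures can be read from the API response's `usage` field. A minimal sketch that wraps a Claude call and logs actual usage through the same logger:

```python
import os

import anthropic

client = anthropic.Anthropic(api_key=os.environ["ANTHROPIC_API_KEY"])

def logged_claude_call(prompt: str, agent_name: str, task: str) -> str:
    response = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=4096,
        messages=[{"role": "user", "content": prompt}],
    )
    # Token usage as reported by the API response
    logger.log_activity(
        agent_name=agent_name,
        activity_type="llm_api_call",
        details={
            "provider": "anthropic",
            "model": "claude-sonnet-4-20250514",
            "input_tokens": response.usage.input_tokens,
            "output_tokens": response.usage.output_tokens,
            "task": task,
        },
    )
    return response.content[0].text
```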
Model Selection Guide
Claude Models
| Model | Best For | Context | Output | Cost |
|---|---|---|---|---|
| Opus 4 | Complex reasoning, architecture design | 200K | 16K | Highest |
| Sonnet 4 | General development, balanced performance | 200K | 16K | Medium |
| Haiku 3.5 | Fast tasks, simple queries, high throughput | 200K | 8K | Lowest |
Extended Thinking: Only available on Sonnet 4 and Opus 4
When to Use Each Model
```python
MODEL_SELECTION = {
    "strategic_planning": "claude-opus-4-20250514",
    "system_architecture": "claude-opus-4-20250514",
    "complex_debugging": "claude-sonnet-4-20250514",
    "code_review": "claude-sonnet-4-20250514",
    "code_generation": "claude-sonnet-4-20250514",
    "documentation": "claude-haiku-3-5-20250514",
    "simple_queries": "claude-haiku-3-5-20250514",
    "batch_processing": "claude-haiku-3-5-20250514"
}

def select_model(task_type: str, use_thinking: bool = False) -> str:
    """Select appropriate model based on task"""
    model = MODEL_SELECTION.get(task_type, "claude-sonnet-4-20250514")

    # Ensure model supports extended thinking if requested
    if use_thinking and "haiku" in model:
        model = "claude-sonnet-4-20250514"

    return model
```
Performance Optimization Tips
1. Prompt Caching (Reduce Costs)
```python
# Cache large system prompts or context
response = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=4096,
    system=[
        {
            "type": "text",
            "text": large_system_prompt,
            "cache_control": {"type": "ephemeral"}  # Cache this
        }
    ],
    messages=[
        {"role": "user", "content": "Question about the system"}
    ]
)
```
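Cache effectiveness can be checked on the response's `usage` object; a short sketch, assuming the `cache_creation_input_tokens` and `cache_read_input_tokens` fields reported when prompt caching is active (guarded with `getattr` in case they are absent):

```python
usage = response.usage
print("Cache write tokens:", getattr(usage, "cache_creation_input_tokens", 0))
print("Cache read tokens:", getattr(usage, "cache_read_input_tokens", 0))
print("Uncached input tokens:", usage.input_tokens)
```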
2. Batch Processing
```python
import asyncio
from typing import List

async def process_batch(items: List[str], batch_size: int = 10):
    """Process items in batches to avoid rate limits"""

    async def process_item(item: str):
        # Async LLM call
        return await async_llm_call(item)

    results = []
    for i in range(0, len(items), batch_size):
        batch = items[i:i + batch_size]
        batch_results = await asyncio.gather(*[process_item(item) for item in batch])
        results.extend(batch_results)

        # Rate limiting
        if i + batch_size < len(items):
            await asyncio.sleep(1)

    return results
```
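`async_llm_call` is left undefined above. A minimal sketch using the Anthropic SDK's async client; the model choice and prompt handling are illustrative:

```python
import os

import anthropic

async_client = anthropic.AsyncAnthropic(api_key=os.environ["ANTHROPIC_API_KEY"])

async def async_llm_call(item: str) -> str:
    # One asynchronous completion per item
    response = await async_client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=1024,
        messages=[{"role": "user", "content": item}],
    )
    return response.content[0].text
```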
3. Streaming for Responsiveness
```python
# Always use streaming for long-running tasks
with client.messages.stream(
    model="claude-sonnet-4-20250514",
    max_tokens=8192,
    messages=[{"role": "user", "content": complex_task}]
) as stream:
    for text in stream.text_stream:
        yield text  # Stream to UI or process incrementally
```