| name | context-engineering |
| description | Use when designing agent system prompts, optimizing RAG retrieval, or when context is too expensive or slow. Reduces tokens while maintaining quality through strategic positioning and attention-aware design. |
| version | 1.0.0 |
| author | SkillForge AI Agent Hub |
| tags | context, attention, optimization, llm, performance, 2026 |
Context Engineering
The discipline of curating the smallest high-signal token set that achieves desired outcomes.
Overview
Context engineering goes beyond prompt engineering. While prompts focus on what you ask, context engineering focuses on everything the model sees—system instructions, tool definitions, documents, message history, and tool outputs.
Key Insight: Context windows are constrained not by raw token capacity but by attention mechanics. As context grows, attention dilutes and models become less reliable at using information buried mid-window.
When to Use
- Designing agent system prompts
- Optimizing RAG retrieval pipelines
- Managing long-running conversations
- Building multi-agent architectures
- Reducing token costs while maintaining quality
The "Lost in the Middle" Phenomenon
Models pay unequal attention across the context window:
```
Attention
Strength  ████████░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░████████
          ↑                       ↑                       ↑
        START          MIDDLE (weakest attention)        END
```
Practical Implications:
| Position | Attention | Best For |
|---|---|---|
| START | High | System identity, critical instructions, constraints |
| MIDDLE | Low | Background context, optional details |
| END | High | Current task, recent messages, immediate query |
The Five Context Layers
1. System Prompts (Identity Layer)
Establishes agent identity at the right "altitude":
```
TOO HIGH (vague):      "You are a helpful assistant"
TOO LOW (brittle):     "Always respond with exactly 3 bullet points..."
OPTIMAL (principled):  "You are a senior engineer who values clarity,
                        tests assumptions, and explains trade-offs"
```
Best Practices:
- Define role and expertise level
- State core principles (not rigid rules)
- Include what NOT to do (boundaries)
- Position at START of context
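These practices can be sketched as a small assembly helper; the role, principles, and boundary strings below are illustrative placeholders, not prescribed values:

```python
def build_system_prompt(role: str, principles: list[str], boundaries: list[str]) -> str:
    """Assemble an identity-layer prompt: role first, then principles, then boundaries."""
    lines = [f"You are a {role}."]
    lines.append("Core principles:")
    lines += [f"- {p}" for p in principles]
    lines.append("Do NOT:")
    lines += [f"- {b}" for b in boundaries]
    return "\n".join(lines)

prompt = build_system_prompt(
    "senior engineer",
    ["value clarity", "test assumptions", "explain trade-offs"],
    ["invent APIs", "skip error handling"],
)
```

Because this string goes at the START of the context, the role statement lands in the highest-attention region.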
2. Tool Definitions (Capability Layer)
Tools steer behavior through descriptions:
```python
# ❌ BAD: Ambiguous - when would you use this?
@tool
def search(query: str) -> str:
    """Search for information."""
    pass

# ✅ GOOD: Clear trigger conditions
@tool
def search_documentation(query: str) -> str:
    """
    Search internal documentation for technical answers.

    USE WHEN:
    - User asks about internal APIs or services
    - Question requires company-specific knowledge
    - Public information is insufficient

    DO NOT USE WHEN:
    - Question is general programming knowledge
    - User explicitly wants external sources
    """
    pass
```
Rule: If a human cannot definitively say which tool to use, an agent cannot either.
3. Retrieved Documents (Knowledge Layer)
Just-in-time loading beats pre-loading:
```python
# ❌ BAD: Pre-load everything
context = load_all_documentation()  # 50k tokens!

# ✅ GOOD: Progressive disclosure
def build_context(query: str) -> str:
    # Stage 1: Lightweight retrieval (500 tokens)
    summaries = search_summaries(query, top_k=5)
    # Stage 2: Selective deep loading (only if needed)
    if needs_detail(summaries):
        full_docs = load_full_documents(summaries[:2])
        return summaries + full_docs
    return summaries
```
4. Message History (Memory Layer)
Treat as scratchpad, not permanent storage:
```python
# Implement sliding window with compression
MAX_MESSAGES = 20
COMPRESSION_TRIGGER = 0.7  # 70% of context budget

def manage_history(messages: list, budget: int) -> list:
    current_tokens = count_tokens(messages)
    if current_tokens > budget * COMPRESSION_TRIGGER:
        # Compress older messages, keep recent
        old = messages[:-5]
        recent = messages[-5:]
        summary = summarize(old)  # Anchored compression
        return [summary] + recent
    return messages
```
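The count_tokens helper above is left undefined. A rough stand-in, assuming the common ~4-characters-per-token heuristic for English text (swap in a real tokenizer such as tiktoken for production use):

```python
def count_tokens(messages: list) -> int:
    """Approximate token count at ~4 characters per token (English-text heuristic)."""
    text = " ".join(m["content"] if isinstance(m, dict) else str(m) for m in messages)
    return max(1, len(text) // 4)

count_tokens([{"content": "a" * 400}])  # ≈ 100 tokens
```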
5. Tool Outputs (Observation Layer)
Critical Finding: Tool outputs can reach 83.9% of total context usage!
```python
import json

# ❌ BAD: Return raw output
def search_web(query: str) -> str:
    results = web_search(query)
    return json.dumps(results)  # Could be 10k+ tokens!

# ✅ GOOD: Structured, bounded output
def search_web(query: str) -> str:
    results = web_search(query)
    # Extract only what's needed
    extracted = [
        {
            "title": r["title"],
            "snippet": r["snippet"][:200],  # Truncate
            "url": r["url"],
        }
        for r in results[:5]  # Limit count
    ]
    return json.dumps(extracted)  # ~500 tokens max
```
The 95% Finding
Research shows what actually drives agent performance:
```
┌──────────────────────────────────────────────────────────┐
│ TOKEN USAGE    ████████████████████████████████  80%     │
│ TOOL CALLS     ████                              10%     │
│ MODEL CHOICE   ██                                 5%     │
│ OTHER          ██                                 5%     │
└──────────────────────────────────────────────────────────┘
```
Key Insight: Optimize context efficiency BEFORE switching models.
Context Budget Management
Token Budget Calculator
```python
def calculate_budget(model: str, task_type: str) -> dict:
    """Calculate optimal token allocation."""
    MAX_CONTEXT = {
        "gpt-4o": 128_000,
        "claude-3": 200_000,
        "llama-3": 128_000,
    }
    # Reserve 20% for response generation
    available = MAX_CONTEXT[model] * 0.8

    # Allocation by task type
    ALLOCATIONS = {
        "chat": {
            "system": 0.05,     # 5%
            "tools": 0.05,      # 5%
            "history": 0.60,    # 60%
            "retrieval": 0.20,  # 20%
            "current": 0.10,    # 10%
        },
        "agent": {
            "system": 0.10,        # 10%
            "tools": 0.15,         # 15%
            "history": 0.30,       # 30%
            "retrieval": 0.25,     # 25%
            "observations": 0.20,  # 20%
        },
    }
    alloc = ALLOCATIONS[task_type]
    return {k: int(v * available) for k, v in alloc.items()}
```
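For example, an agent task on a 200k-context model with 20% reserved for the response yields these per-layer budgets (the percentages mirror the agent allocation above, recomputed inline so the numbers can be checked):

```python
available = int(200_000 * 0.8)  # 200k context minus 20% response reserve
agent_alloc = {"system": 0.10, "tools": 0.15, "history": 0.30,
               "retrieval": 0.25, "observations": 0.20}
budget = {k: int(v * available) for k, v in agent_alloc.items()}
# system=16000, tools=24000, history=48000, retrieval=40000, observations=32000
```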
Compression Triggers
```python
COMPRESSION_CONFIG = {
    "trigger_threshold": 0.70,  # Start compressing at 70%
    "target_threshold": 0.50,   # Compress down to 50%
    "preserve_recent": 5,       # Always keep last 5 messages
    "preserve_system": True,    # Never compress system prompt
}
```
Attention-Aware Positioning
Template Structure
```
[START - HIGH ATTENTION]
## System Identity
You are a {role} specialized in {domain}.

## Critical Constraints
- NEVER {dangerous_action}
- ALWAYS {required_behavior}

[MIDDLE - LOWER ATTENTION]
## Background Context
{retrieved_documents}
{older_conversation_history}

[END - HIGH ATTENTION]
## Current Task
{recent_messages}
{user_query}

## Response Guidelines
{output_format_instructions}
```
Priority Positioning Rules
- Identity & Constraints → START (immutable)
- Critical instructions → START or END
- Retrieved documents → MIDDLE (expandable)
- Conversation history → MIDDLE (compressible)
- Current query → END (always visible)
- Output format → END (guides generation)
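The rules above can be sketched as an assembly function; the section order is the point, and the parameter names are illustrative:

```python
def assemble_context(identity: str, constraints: str, documents: str,
                     history: str, query: str, output_format: str) -> str:
    """Order sections by attention: critical at START and END, expandable in MIDDLE."""
    start = [identity, constraints]    # high attention: immutable instructions
    middle = [documents, history]      # lower attention: expandable / compressible
    end = [query, output_format]       # high attention: current task, guides generation
    return "\n\n".join(s for s in start + middle + end if s)

ctx = assemble_context("IDENT", "CONSTR", "DOCS", "HIST", "QUERY", "FORMAT")
```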
Metrics: Tokens-Per-Task
Optimize for total task completion, not individual requests:
```python
from dataclasses import dataclass

@dataclass
class TaskMetrics:
    task_id: str
    total_tokens: int = 0
    request_count: int = 0
    retrieval_tokens: int = 0
    generation_tokens: int = 0

    @property
    def tokens_per_request(self) -> float:
        return self.total_tokens / max(self.request_count, 1)

    @property
    def efficiency_ratio(self) -> float:
        """Lower is better - generation vs total context."""
        return self.generation_tokens / max(self.total_tokens, 1)
```
Anti-pattern: Aggressive compression that loses critical details forces expensive re-fetching, consuming MORE tokens overall.
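One way to avoid this anti-pattern is anchored compression: before summarizing, extract identifiers that would be expensive to re-fetch (file paths, URLs, IDs) and carry them through verbatim. A minimal sketch, assuming the summary is produced elsewhere and with illustrative regexes:

```python
import re

def anchored_compress(text: str, summary: str) -> str:
    """Append verbatim anchors (paths, URLs) to a summary so they survive compression."""
    anchors = set(re.findall(r"(?:https?://\S+|/[\w./-]+\.\w+)", text))
    if anchors:
        return summary + "\nAnchors: " + ", ".join(sorted(anchors))
    return summary
```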
Common Pitfalls
| Pitfall | Problem | Solution |
|---|---|---|
| Token stuffing | "More context = better" | Quality over quantity |
| Flat structure | No priority signaling | Use headers, positioning |
| Static context | Same context for all queries | Dynamic, query-relevant retrieval |
| Ignoring middle | Important info gets lost | Position critically |
| No compression | Context grows unbounded | Sliding window + summarization |
Integration with SkillForge
Agent System Prompts
Apply attention-aware positioning to agent definitions:
# Agent: backend-system-architect
[HIGH ATTENTION - START]
## Identity
Senior backend architect with 15+ years experience.
## Constraints
- NEVER suggest unvalidated security patterns
- ALWAYS consider multi-tenant isolation
[LOWER ATTENTION - MIDDLE]
## Domain Knowledge
{dynamically_loaded_patterns}
[HIGH ATTENTION - END]
## Current Task
{user_request}
Skill Loading
Progressive skill disclosure:
```python
# Stage 1: Load skill metadata only (~100 tokens)
skill_index = load_skill_summaries()

# Stage 2: Load relevant skill on demand (~500 tokens)
if task_matches("database"):
    full_skill = load_skill("pgvector-search")
```
Related Skills
- context-compression: Compression strategies and anchored summarization
- multi-agent-orchestration: Context isolation across agents
- rag-retrieval: Optimizing retrieved document context
- prompt-caching: Reducing redundant context transmission
Version: 1.0.0 (January 2026)
Based on: Context Engineering research, BrowseComp evaluation findings
Key Metric: 80% of agent performance variance explained by token usage
Capability Details
attention-mechanics
Keywords: context window, attention, lost in the middle, token budget
Solves:
- Understand lost-in-the-middle effect (high attention at START/END)
- Position critical info strategically
- Optimize tokens-per-task not tokens-per-request
context-layers
Keywords: context anatomy, context structure, five layers
Solves:
- Understand 5 context layers (system, tools, docs, history, outputs)
- Implement just-in-time document loading
- Manage tool output truncation
budget-allocation
Keywords: token budget, context budget, allocation
Solves:
- Allocate tokens across context layers
- Implement compression triggers at 70% utilization
- Target 50% utilization after compression