---
name: hti-zen-orchestrator
description: Guidelines for using Zen MCP tools effectively in this repo. Use for complex multi-model tasks, architectural decisions, or when cross-model validation adds value.
---

# HTI Zen Orchestrator
This Skill defines when and how to use Zen MCP tools in the hti-zen-harness project.

Zen provides multi-model orchestration (`planner`, `consensus`, `codereview`, `thinkdeep`, `debug`, `clink`). Use them deliberately when they add real value, not reflexively.
## ⚡ Recommended Approach: Use API Tools Directly
Prefer direct Zen MCP tools over `clink`:
- ✅ `chat`, `thinkdeep`, `consensus`, `codereview`, `precommit`, `debug`, `planner`
- ✅ Work via API (no CLI setup needed)
- ✅ Already configured and tested
- ✅ Simple, reliable, fast
Avoid `clink` unless absolutely necessary:
- ❌ Requires separate CLI installations (gemini CLI, codex CLI, etc.)
- ❌ Requires separate authentication for each CLI
- ❌ Uses your API credits anyway (no cost benefit)
- ❌ More complexity for minimal gain in this project
Bottom line: Direct API tools (`mcp__zen__chat`, `mcp__zen__consensus`, etc.) do everything you need without the CLI overhead.
## When Zen Tools Add Value
Consider using Zen MCP tools when:
Complex architectural work
- Multi-file refactors spanning 5+ files
- New subsystems or major feature additions
- Changes to core HTI abstractions (bands, adapters, guards, probes)
- Redesigning interfaces or data flows
Safety-critical code
- Modifying timing bands or HTI invariants
- Changes to error handling or recovery logic
- Adapter implementations that interact with external models
- CI/CD pipeline changes that affect safety guarantees
Ambiguous or contentious decisions
- Multiple valid implementation approaches exist
- Trade-offs between performance, safety, and complexity
- Unusual patterns where you're unsure of best practice
Deep investigation needed
- Complex bugs with unclear root cause
- Performance issues requiring systematic analysis
- Understanding unfamiliar codebases or dependencies
When Zen is overkill:
Simple changes
- Single-file bug fixes
- Adding straightforward tests
- Documentation updates
- Simple refactors (renaming, extracting functions)
- Configuration tweaks
For these, direct implementation is faster and more appropriate.
## Zen Tool Selection Guide
### `planner` - Multi-step planning with reflection
Use when:
- Task has 5+ distinct steps
- Multiple architectural approaches possible
- Need to think through dependencies and ordering
- Want progressive refinement of a complex plan
Example: "Plan migration of adapter interface to support streaming responses"
### `consensus` - Multi-model debate and synthesis
Use when:
- Two+ valid approaches with different trade-offs
- Safety-critical decisions need validation
- Controversial architectural choices
- Want diverse perspectives on a design
Example: "Should we use async generators or callback patterns for streaming? Get consensus from multiple models."
Models to include: At least 2, typically 3-4. Mix code-specialized models with general reasoning models.
### `codereview` - Systematic code analysis
Use when:
- Reviewing large PRs or branches
- Safety-critical changes to core logic
- Unfamiliar code needs audit
- Want comprehensive security/performance review
Example: "Review the new HTI band scheduler implementation for correctness and edge cases."
### `thinkdeep` - Hypothesis-driven investigation
Use when:
- Complex architectural questions
- Performance analysis and optimization planning
- Security threat modeling
- Understanding subtle interactions
Example: "Investigate why adapter timeout logic behaves differently under load."
### `debug` - Root cause analysis
Use when:
- Complex bugs with mysterious symptoms
- Race conditions or timing issues
- Failures that only occur in specific conditions
- Need systematic hypothesis testing
Example: "Debug why HTI band transitions occasionally skip validation steps."
### `clink` - Delegating to external CLI tools
Use when:
- Need capabilities of a specific AI CLI (gemini, codex, claude)
- Want to leverage role presets (codereviewer, planner)
- Continuing a conversation thread across tools
Example: "Use clink with gemini CLI for large-scale codebase exploration."
### `chat` - General-purpose thinking partner
Use for:
- Brainstorming approaches
- Quick sanity checks
- Explaining concepts
- Rubber-duck debugging
## Model Selection Guidelines
When calling Zen tools, choose models deliberately based on the task:
For reading, exploration, summarization:
- Prefer: Models with large context windows and good efficiency
- Pattern: Large-context, efficient models
- Use case: "Scan 50 test files to find coverage gaps"
For core implementation and refactoring:
- Prefer: Code-specialized, high-quality models
- Pattern: Code-specialized models (e.g., models with "codex" in the name or any available code-focused equivalent)
- Use case: "Implement new HTI adapter with proper error handling"
For safety-critical validation:
- Use: Multiple models via `consensus` or sequential `codereview`
- Pattern: Mix of code-specialized and general reasoning models for diverse perspectives
- Use case: "Validate timing band logic won't introduce deadlocks"
Document your choices:
When model selection matters for auditability:
```python
# HTI-NOTE: Implementation reviewed by code-specialized models (consensus check).
# No race conditions detected in band transition logic.
def transition_band(current: Band, target: Band) -> Result:
    ...
```
## Shell Access via `clink`

Zen's `clink` tool can execute shell commands. Use it responsibly.
✅ OK without asking (read-only, low-risk):
- File inspection: `ls`, `pwd`, `cat`, `head`, `tail`, `find`
- Git inspection: `git status`, `git diff`, `git log`, `git branch`
- Testing: `pytest`, `python -m pytest`, test runners
- Linting: `ruff check`, `black --check`, `mypy`, static analysis
- Info gathering: `python --version`, `uv --version`, dependency checks
⚠️ Ask user approval first:
- Installing packages: `pip install`, `uv add`, `npm install`
- Git mutations: `git commit`, `git push`, `git reset`, `git checkout -b`, `git rebase`
- File mutations: `rm`, `mv`, file deletions/moves
- Network operations: `curl`, `wget`, API calls
- Environment changes: Modifying config files, `.env` files
How to ask:
```
I need to run: `pip install pytest-asyncio`
Reason: Required for testing async adapter implementations
Approve?
```
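If you ever want to enforce this split programmatically, a minimal sketch follows; `READ_ONLY_PREFIXES` and `needs_approval` are illustrative assumptions, not an existing API in this repo:

```python
# Minimal sketch: classify a shell command against the policy above.
# Prefix matching is deliberately crude; a real guard would parse arguments.
READ_ONLY_PREFIXES = (
    "ls", "pwd", "cat", "head", "tail", "find",
    "git status", "git diff", "git log", "git branch",
    "pytest", "python -m pytest",
    "ruff check", "black --check", "mypy",
    "python --version", "uv --version",
)

def needs_approval(command: str) -> bool:
    """Return True when the command mutates state and needs user sign-off."""
    cmd = command.strip()
    return not any(cmd.startswith(prefix) for prefix in READ_ONLY_PREFIXES)

assert not needs_approval("git status")               # read-only: OK to run
assert needs_approval("pip install pytest-asyncio")   # mutation: ask first
```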
## Failure Handling with Zen

When Zen tools or model calls fail, follow these rules (aligned with `hti-fallback-guard`):
❌ Do NOT:
- Pretend the call succeeded
- Silently switch to a different model without explanation
- Invent outputs or fake data
- Swallow errors and continue as if nothing happened
✅ DO:
1. Report clearly:
```
Zen `codereview` call failed:
Tool: codereview
Model: <model-name>
Error: Rate limit exceeded (429)
Step: Reviewing src/adapters/openai.py
```
2. Propose alternatives:
- "Retry with a different model (another available code-specialized option)?"
- "Split the review into smaller chunks?"
- "Proceed with manual review instead?"
- "Wait 60s and retry?"
3. Document in code if relevant:
```python
# HTI-TODO: Codereview via Zen failed (rate limit).
# Manual review needed for thread safety in adapter pool.
```
Structured failure result pattern:
When appropriate, return explicit error states:
```python
from dataclasses import dataclass

@dataclass
class ZenResult:
    ok: bool
    tool: str
    data: dict | None = None
    error: str | None = None
    # Never set ok=True when the Zen call actually failed
```
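A minimal sketch of how a caller might honor this contract; `run_codereview` is a hypothetical wrapper, not a real Zen binding:

```python
def run_codereview(files: list[str]) -> ZenResult:
    """Hypothetical wrapper around a Zen codereview call."""
    try:
        review = {"reviewed": files}  # stand-in for the real tool output
        return ZenResult(ok=True, tool="codereview", data=review)
    except Exception as exc:  # rate limits, timeouts, transport errors
        return ZenResult(ok=False, tool="codereview", error=str(exc))

result = run_codereview(["src/adapters/openai.py"])
if not result.ok:
    # Report loudly and ask the user how to proceed; never fake a review.
    print(f"Zen {result.tool} failed: {result.error}")
```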
## Recommended Workflow for Substantial Changes
For non-trivial work (multi-file refactors, new features, safety-critical edits):
1. Plan (if complexity warrants it)
   - Use Zen `planner` for complex, multi-faceted tasks
   - For simpler changes, a bullet list is fine
   - Show plan to user, get confirmation
2. Implement
   - Use appropriate model (code-specialized for core logic)
   - Follow `hti-fallback-guard` principles
   - Document model choice if safety-critical
3. Review (for important changes)
   - Use Zen `codereview` for:
     - Large PRs (10+ files)
     - Safety-critical logic
     - HTI band/adapter/guard changes
   - Use Zen `precommit` before finalizing
4. Summarize
Tell the user:
- What changed (files, behavior)
- Which models/tools were used
- Any TODOs or concerns
- Test coverage added/modified
## Integration with Testing and CI
When working on tests or CI:
Prefer changes that tighten guarantees (see the sketch after this list):
- Tests that assert explicit failures (not silent fallbacks)
- CI checks that fail loudly when invariants break
- Guards that prevent invalid state transitions
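As a sketch of what "tests that assert explicit failures" can look like; `AdapterError` and `call_adapter` are hypothetical names, assuming pytest as in this repo's test commands:

```python
import pytest

class AdapterError(RuntimeError):
    """Hypothetical error raised when an adapter call fails."""

def call_adapter(available: bool) -> str:
    # Desired behavior: fail loudly instead of silently falling back.
    if not available:
        raise AdapterError("adapter unavailable; refusing silent fallback")
    return "ok"

def test_adapter_failure_is_explicit():
    # Asserts the failure surfaces as an exception, not a fake success value.
    with pytest.raises(AdapterError):
        call_adapter(available=False)
```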
Use Zen tools to:
- Validate test coverage (`codereview` with focus on testing)
- Check CI logic for edge cases (`thinkdeep` on pipeline behavior)
- Compare testing strategies (`consensus` on approach)
Document how changes affect:
- HTI invariants (timing, safety, ordering)
- Existing guards and probes
- CI failure modes
## When in Doubt
Ask yourself:
Is this complex enough to need multi-model orchestration?
- Yes → Use Zen deliberately
- No → Direct implementation is fine
Does this change affect safety or timing?
- Yes → Consider `consensus` or `codereview`
- No → Proceed with standard review
Am I using Zen to avoid thinking, or to think better?
- Avoid thinking → Don't use Zen
- Think better → Use Zen appropriately
The goal is thoughtful tool use, not tool maximalism.
## HTI-Specific Model Recommendations
Available Models (as of 2025-11-30):
- Gemini: `gemini-2.5-pro` (1M context, deep reasoning), `gemini-2.5-flash` (ultra-fast)
- OpenAI: `gpt-5.1`, `gpt-5.1-codex`, `gpt-5-pro`, `o3`, `o3-mini`, `o4-mini`
Recommended by Task:
- Planning: `gpt-5.1-codex` (code-focused structured planning)
- Architecture: `gemini-2.5-pro` (deep reasoning, 1M context)
- Debugging: `o3` (strong logical analysis)
- Code Review: `gpt-5.1` (comprehensive reasoning)
- Quick Questions: `gemini-2.5-flash` (ultra-fast, 1M context)
- Consensus: Mix 2-3 models (e.g., `gpt-5.1` + `gemini-2.5-pro` + `o3`)
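If these defaults are ever wanted in code, a minimal sketch derived from the list above (purely illustrative; no such config exists in this repo):

```python
# Illustrative mapping of task type -> recommended model (from the list above).
# Not an existing repo config; adjust as the model lineup changes.
RECOMMENDED_MODELS: dict[str, str] = {
    "planning": "gpt-5.1-codex",
    "architecture": "gemini-2.5-pro",
    "debugging": "o3",
    "code_review": "gpt-5.1",
    "quick_questions": "gemini-2.5-flash",
}
CONSENSUS_MIX = ["gpt-5.1", "gemini-2.5-pro", "o3"]

def pick_model(task: str, default: str = "gpt-5.1") -> str:
    """Return the recommended model for a task, falling back to the default."""
    return RECOMMENDED_MODELS.get(task, default)
```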
## Practical Templates
### Template 1: Planning New HTI Version

Use Case: Starting v0.X implementation (5+ files, new subsystems)

Pattern:

```
Use planner with gpt-5.1-codex to design [FEATURE]:

Context:
- Current state: [what exists now]
- Goal: [what we're building]
- Constraints: [HTI invariants, backward compatibility]

Plan should include:
1. Architecture changes needed
2. File modifications (existing + new)
3. Testing strategy
4. Migration path (if breaking changes)
```

Example:

```
Use planner with gpt-5.1-codex to design v0.6 RL policy integration:

Context:
- Current: PD/PID controllers via ArmBrainPolicy protocol
- Goal: Support stateful RL policies (PPO, SAC, DQN)
- Constraints: Zero harness changes, brain-agnostic design

Plan should include:
1. BrainPolicy extension for stateful policies
2. Episode buffer interface
3. Checkpoint loading/saving
4. Testing with dummy RL brain
```
### Template 2: Design Decisions via Consensus

Use Case: Multiple valid approaches, safety-critical choices

Pattern:

```
Use consensus to decide: [QUESTION]

Models:
- gpt-5.1 with stance "for" [OPTION A]
- gemini-2.5-pro with stance "against" [OPTION A, argue for OPTION B]
- o3 with stance "neutral" (objective analysis)

Context:
[Relevant technical details]

Criteria:
- [Criterion 1]
- [Criterion 2]
```

Example:

```
Use consensus to decide: RL framework for HTI v0.6

Models:
- gpt-5.1 with stance "for" Stable-Baselines3
- gemini-2.5-pro with stance "against" SB3, argue for CleanRL
- o3 with stance "neutral"

Context:
- Need PPO, SAC, DQN implementations
- Must integrate with HTI ArmBrainPolicy protocol
- Want good documentation and active maintenance

Criteria:
- Ease of integration with HTI
- Code quality and maintainability
- Performance and stability
```
### Template 3: Deep Investigation

Use Case: Complex questions about control theory, physics, tuning

Pattern:

```
Use thinkdeep with [MODEL] to investigate: [QUESTION]

Known evidence:
- [Observation 1]
- [Observation 2]

Initial hypothesis:
[What you think might be happening]

Files to examine:
[Absolute paths]
```

Example:

```
Use thinkdeep with o3 to investigate: Why does PD with Kd=2.0 converge faster than Kd=3.0?

Known evidence:
- Kd=2.0: avg 455 ticks to converge
- Kd=3.0: avg 520 ticks to converge
- Both use same Kp=8.0

Initial hypothesis:
Over-damping (Kd too high) slows response

Files to examine:
/home/john2/claude-projects/hti-zen-harness/hti_arm_demo/brains/arm_pd_controller.py
/home/john2/claude-projects/hti-zen-harness/hti_arm_demo/env.py
```
### Template 4: Code Review Before Release

Use Case: Before committing v0.X release (10+ files changed)

Pattern:

```
Use codereview with gpt-5.1 to review [SCOPE]:

Review type: full

Focus areas:
- Code quality and maintainability
- Security (HTI safety invariants)
- Performance (timing band compliance)
- Architecture (brain-agnostic design preserved)

Files to review:
[List of absolute file paths]
```

Example:

```
Use codereview with gpt-5.1 to review HTI v0.5 implementation:

Review type: full

Focus areas:
- Brain-agnostic design preserved
- EventPack metadata extension correct
- No timing band violations
- Fallback logic compliance (hti-fallback-guard)

Files to review:
/home/john2/claude-projects/hti-zen-harness/hti_arm_demo/brains/arm_imperfect.py
/home/john2/claude-projects/hti-zen-harness/hti_arm_demo/run_v05_demo.py
/home/john2/claude-projects/hti-zen-harness/hti_arm_demo/shared_state.py
/home/john2/claude-projects/hti-zen-harness/hti_arm_demo/bands/control.py
```
### Template 5: Context-Isolated Subagents (clink)

Use Case: Large codebase exploration, heavy reviews, saving tokens

Pattern:

```
Use clink with [CLI] [ROLE] to [TASK]
```

Available CLIs: gemini, codex, claude
Available Roles: default, planner, codereviewer

Examples:

```
# Code review in isolated context (saves our tokens)
Use clink with gemini codereviewer to review hti_arm_demo/ for safety issues

# Large codebase exploration
Use clink with gemini to map all brain implementations and document their interfaces

# Strategic planning
Use clink with gemini planner to design phase-by-phase migration to MuJoCo physics
```
Why use clink:
- Gemini CLI launches fresh 1M context window
- Heavy analysis doesn't pollute our context
- Returns only final summary/report
- Can use web search for latest docs
### Template 6: Pre-Commit Validation

Use Case: Before git commit on major changes

Pattern:

```
Use precommit with gpt-5.1 to validate changes in [PATH]:

Focus:
- Security issues
- Breaking changes
- Missing tests
- Documentation completeness
```

Example:

```
Use precommit with gpt-5.1 to validate changes in /home/john2/claude-projects/hti-zen-harness:

Focus:
- HTI safety invariants preserved
- No regressions in existing tests
- New tests for v0.6 features
- CHANGELOG and SPEC updated
```