SKILL.md

name: hti-zen-orchestrator
description: Guidelines for using Zen MCP tools effectively in this repo. Use for complex multi-model tasks, architectural decisions, or when cross-model validation adds value.

HTI Zen Orchestrator

This Skill defines when and how to use Zen MCP tools in the hti-zen-harness project.

Zen provides multi-model orchestration (planner, consensus, codereview, thinkdeep, debug, clink). Use them deliberately when they add real value, not reflexively.


⚡ Recommended Approach: Use API Tools Directly

Prefer direct Zen MCP tools over clink:

  • chat, thinkdeep, consensus, codereview, precommit, debug, planner
  • ✅ Work via API (no CLI setup needed)
  • ✅ Already configured and tested
  • ✅ Simple, reliable, fast

Avoid clink unless absolutely necessary:

  • ❌ Requires separate CLI installations (gemini CLI, codex CLI, etc.)
  • ❌ Requires separate authentication for each CLI
  • ❌ Uses your API credits anyway (no cost benefit)
  • ❌ More complexity for minimal gain in this project

Bottom line: Direct API tools (mcp__zen__chat, mcp__zen__consensus, etc.) do everything you need without the CLI overhead.


When Zen Tools Add Value

Consider using Zen MCP tools when:

Complex architectural work

  • Multi-file refactors spanning 5+ files
  • New subsystems or major feature additions
  • Changes to core HTI abstractions (bands, adapters, guards, probes)
  • Redesigning interfaces or data flows

Safety-critical code

  • Modifying timing bands or HTI invariants
  • Changes to error handling or recovery logic
  • Adapter implementations that interact with external models
  • CI/CD pipeline changes that affect safety guarantees

Ambiguous or contentious decisions

  • Multiple valid implementation approaches exist
  • Trade-offs between performance, safety, and complexity
  • Unusual patterns where you're unsure of best practice

Deep investigation needed

  • Complex bugs with unclear root cause
  • Performance issues requiring systematic analysis
  • Understanding unfamiliar codebases or dependencies

When Zen is overkill:

Simple changes

  • Single-file bug fixes
  • Adding straightforward tests
  • Documentation updates
  • Simple refactors (renaming, extracting functions)
  • Configuration tweaks

For these, direct implementation is faster and more appropriate.


Zen Tool Selection Guide

planner - Multi-step planning with reflection

Use when:

  • Task has 5+ distinct steps
  • Multiple architectural approaches possible
  • Need to think through dependencies and ordering
  • Want progressive refinement of a complex plan

Example: "Plan migration of adapter interface to support streaming responses"

consensus - Multi-model debate and synthesis

Use when:

  • Two+ valid approaches with different trade-offs
  • Safety-critical decisions need validation
  • Controversial architectural choices
  • Want diverse perspectives on a design

Example: "Should we use async generators or callback patterns for streaming? Get consensus from multiple models."

Models to include: At least 2, typically 3-4. Mix code-specialized models with general reasoning models.

codereview - Systematic code analysis

Use when:

  • Reviewing large PRs or branches
  • Safety-critical changes to core logic
  • Unfamiliar code needs audit
  • Want comprehensive security/performance review

Example: "Review the new HTI band scheduler implementation for correctness and edge cases."

thinkdeep - Hypothesis-driven investigation

Use when:

  • Complex architectural questions
  • Performance analysis and optimization planning
  • Security threat modeling
  • Understanding subtle interactions

Example: "Investigate why adapter timeout logic behaves differently under load."

debug - Root cause analysis

Use when:

  • Complex bugs with mysterious symptoms
  • Race conditions or timing issues
  • Failures that only occur in specific conditions
  • Need systematic hypothesis testing

Example: "Debug why HTI band transitions occasionally skip validation steps."

clink - Delegating to external CLI tools

Use when:

  • Need capabilities of a specific AI CLI (gemini, codex, claude)
  • Want to leverage role presets (codereviewer, planner)
  • Continuing a conversation thread across tools

Example: "Use clink with gemini CLI for large-scale codebase exploration."

chat - General-purpose thinking partner

Use for:

  • Brainstorming approaches
  • Quick sanity checks
  • Explaining concepts
  • Rubber-duck debugging

Model Selection Guidelines

When calling Zen tools, choose models deliberately based on the task:

For reading, exploration, summarization:

  • Prefer: Models with large context windows and good efficiency
  • Pattern: Large-context, efficient models
  • Use case: "Scan 50 test files to find coverage gaps"

For core implementation and refactoring:

  • Prefer: Code-specialized, high-quality models
  • Pattern: Code-specialized models (e.g., models with "codex" in the name or any available code-focused equivalent)
  • Use case: "Implement new HTI adapter with proper error handling"

For safety-critical validation:

  • Use: Multiple models via consensus or sequential codereview
  • Pattern: Mix of code-specialized and general reasoning models for diverse perspectives
  • Use case: "Validate timing band logic won't introduce deadlocks"
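
If it helps to keep these defaults auditable in one place, they can live in a small lookup table. The sketch below is a minimal illustration in Python; the task categories, model groupings, and fallback choice are assumptions for this example, not part of Zen's API:

# Hypothetical convention: one central table of preferred models per task type.
# Categories and model names are illustrative; adjust to what is actually available.
MODEL_DEFAULTS: dict[str, list[str]] = {
    "exploration": ["gemini-2.5-flash"],                       # large context, fast scans
    "implementation": ["gpt-5.1-codex"],                       # code-specialized work
    "safety_validation": ["gpt-5.1", "gemini-2.5-pro", "o3"],  # diverse consensus mix
}

def pick_models(task_type: str) -> list[str]:
    """Return preferred models for a task type, with a general-purpose fallback."""
    return MODEL_DEFAULTS.get(task_type, ["gpt-5.1"])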

Document your choices:

When model selection matters for auditability:

# HTI-NOTE: Implementation reviewed by code-specialized models (consensus check).
# No race conditions detected in band transition logic.
def transition_band(current: Band, target: Band) -> Result:
    ...

Shell Access via clink

Zen's clink tool can execute shell commands. Use it responsibly.

✅ OK without asking (read-only, low-risk):

  • File inspection: ls, pwd, cat, head, tail, find
  • Git inspection: git status, git diff, git log, git branch
  • Testing: pytest, python -m pytest, test runners
  • Linting: ruff check, black --check, mypy, static analysis
  • Info gathering: python --version, uv --version, dependency checks

⚠️ Ask user approval first:

  • Installing packages: pip install, uv add, npm install
  • Git mutations: git commit, git push, git reset, git checkout -b, git rebase
  • File mutations: rm, mv, file deletions/moves
  • Network operations: curl, wget, API calls
  • Environment changes: Modifying config files, .env files

How to ask:

I need to run: `pip install pytest-asyncio`
Reason: Required for testing async adapter implementations
Approve?
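
If you want the same policy as a programmatic guard, here is a minimal sketch; the allowlists are illustrative excerpts of the bullets above, not an exhaustive classification:

import shlex

# Illustrative allowlist of read-only commands (excerpt of the lists above).
READ_ONLY = {"ls", "pwd", "cat", "head", "tail", "find", "pytest", "mypy", "ruff"}
# Git subcommands that only inspect state.
GIT_READ_ONLY = {"status", "diff", "log", "branch"}

def needs_approval(command: str) -> bool:
    """Return True if the command mutates state and should be confirmed first."""
    parts = shlex.split(command)
    if not parts:
        return False
    if parts[0] == "git":
        return len(parts) < 2 or parts[1] not in GIT_READ_ONLY
    return parts[0] not in READ_ONLY

For example, needs_approval("git diff") is False, while needs_approval("pip install pytest-asyncio") is True and should trigger the approval prompt above.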

Failure Handling with Zen

When Zen tools or model calls fail, follow these rules (aligned with hti-fallback-guard):

❌ Do NOT:

  • Pretend the call succeeded
  • Silently switch to a different model without explanation
  • Invent outputs or fake data
  • Swallow errors and continue as if nothing happened

✅ DO:

1. Report clearly:

Zen `codereview` call failed:
  Tool: codereview
  Model: <model-name>
  Error: Rate limit exceeded (429)
  Step: Reviewing src/adapters/openai.py

2. Propose alternatives:

  • "Retry with a different model (another available code-specialized option)?"
  • "Split the review into smaller chunks?"
  • "Proceed with manual review instead?"
  • "Wait 60s and retry?"

3. Document in code if relevant:

# HTI-TODO: Codereview via Zen failed (rate limit).
# Manual review needed for thread safety in adapter pool.

Structured failure result pattern:

When appropriate, return explicit error states:

from dataclasses import dataclass

@dataclass
class ZenResult:
    ok: bool
    tool: str
    data: dict | None = None
    error: str | None = None

# Never set ok=True when the Zen call actually failed.
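
As a usage sketch, downstream code can then branch on ok explicitly instead of guessing from exceptions. Everything here besides ZenResult (the error class, run_codereview, and review_adapter) is a hypothetical stand-in, not a real Zen API:

class RateLimitError(Exception):
    """Stand-in for whatever error the underlying client raises."""

def run_codereview(path: str) -> dict:
    """Hypothetical stand-in for the actual Zen tool invocation."""
    raise RateLimitError("Rate limit exceeded (429)")

def review_adapter(path: str) -> ZenResult:
    try:
        data = run_codereview(path)
    except RateLimitError as exc:
        return ZenResult(ok=False, tool="codereview", error=str(exc))
    return ZenResult(ok=True, tool="codereview", data=data)

result = review_adapter("src/adapters/openai.py")
if not result.ok:
    print(f"Zen codereview failed: {result.error}")  # report loudly, never fake success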

Recommended Workflow for Substantial Changes

For non-trivial work (multi-file refactors, new features, safety-critical edits):

1. Plan (if complexity warrants it)

  • Use Zen planner for complex, multi-faceted tasks
  • For simpler changes, a bullet list is fine
  • Show plan to user, get confirmation

2. Implement

  • Use appropriate model (code-specialized for core logic)
  • Follow hti-fallback-guard principles
  • Document model choice if safety-critical

3. Review (for important changes)

  • Use Zen codereview for:
    • Large PRs (10+ files)
    • Safety-critical logic
    • HTI band/adapter/guard changes
  • Use Zen precommit before finalizing

4. Summarize

Tell the user:

  • What changed (files, behavior)
  • Which models/tools were used
  • Any TODOs or concerns
  • Test coverage added/modified

Integration with Testing and CI

When working on tests or CI:

Prefer changes that tighten guarantees:

  • Tests that assert explicit failures, not silent fallbacks (see the sketch after this list)
  • CI checks that fail loudly when invariants break
  • Guards that prevent invalid state transitions
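
As an illustration of the first point, a test can pin down that a failed call surfaces as an explicit error rather than being absorbed. This sketch reuses the ZenResult shape from the failure-handling section and is illustrative, not existing project code:

import pytest

def flaky_tool_call() -> ZenResult:
    """Stand-in for a Zen call that fails; illustrative only."""
    return ZenResult(ok=False, tool="codereview", error="Rate limit exceeded (429)")

def require_ok(result: ZenResult) -> dict:
    """Guard: surface failures loudly instead of returning partial data."""
    if not result.ok:
        raise RuntimeError(f"Zen {result.tool} failed: {result.error}")
    return result.data or {}

def test_failed_call_raises_instead_of_falling_back():
    with pytest.raises(RuntimeError, match="codereview failed"):
        require_ok(flaky_tool_call())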

Use Zen tools to:

  • Validate test coverage (codereview with focus on testing)
  • Check CI logic for edge cases (thinkdeep on pipeline behavior)
  • Compare testing strategies (consensus on approach)

Document how changes affect:

  • HTI invariants (timing, safety, ordering)
  • Existing guards and probes
  • CI failure modes

When in Doubt

Ask yourself:

  1. Is this complex enough to need multi-model orchestration?

    • Yes → Use Zen deliberately
    • No → Direct implementation is fine
  2. Does this change affect safety or timing?

    • Yes → Consider consensus or codereview
    • No → Proceed with standard review
  3. Am I using Zen to avoid thinking, or to think better?

    • Avoid thinking → Don't use Zen
    • Think better → Use Zen appropriately

The goal is thoughtful tool use, not tool maximalism.


HTI-Specific Model Recommendations

Available Models (as of 2025-11-30):

  • Gemini: gemini-2.5-pro (1M context, deep reasoning), gemini-2.5-flash (ultra-fast)
  • OpenAI: gpt-5.1, gpt-5.1-codex, gpt-5-pro, o3, o3-mini, o4-mini

Recommended by Task:

  • Planning: gpt-5.1-codex (code-focused structured planning)
  • Architecture: gemini-2.5-pro (deep reasoning, 1M context)
  • Debugging: o3 (strong logical analysis)
  • Code Review: gpt-5.1 (comprehensive reasoning)
  • Quick Questions: gemini-2.5-flash (ultra-fast, 1M context)
  • Consensus: Mix 2-3 models (e.g., gpt-5.1 + gemini-2.5-pro + o3)

Practical Templates

Template 1: Planning New HTI Version

Use Case: Starting v0.X implementation (5+ files, new subsystems)

Pattern:

Use planner with gpt-5.1-codex to design [FEATURE]:

Context:
- Current state: [what exists now]
- Goal: [what we're building]
- Constraints: [HTI invariants, backward compatibility]

Plan should include:
1. Architecture changes needed
2. File modifications (existing + new)
3. Testing strategy
4. Migration path (if breaking changes)

Example:

Use planner with gpt-5.1-codex to design v0.6 RL policy integration:

Context:
- Current: PD/PID controllers via ArmBrainPolicy protocol
- Goal: Support stateful RL policies (PPO, SAC, DQN)
- Constraints: Zero harness changes, brain-agnostic design

Plan should include:
1. BrainPolicy extension for stateful policies
2. Episode buffer interface
3. Checkpoint loading/saving
4. Testing with dummy RL brain

Template 2: Design Decisions via Consensus

Use Case: Multiple valid approaches, safety-critical choices

Pattern:

Use consensus to decide: [QUESTION]

Models:
- gpt-5.1 with stance "for" [OPTION A]
- gemini-2.5-pro with stance "against" [OPTION A, argue for OPTION B]
- o3 with stance "neutral" (objective analysis)

Context:
[Relevant technical details]

Criteria:
- [Criterion 1]
- [Criterion 2]

Example:

Use consensus to decide: RL framework for HTI v0.6

Models:
- gpt-5.1 with stance "for" Stable-Baselines3
- gemini-2.5-pro with stance "against" SB3, argue for CleanRL
- o3 with stance "neutral"

Context:
- Need PPO, SAC, DQN implementations
- Must integrate with HTI ArmBrainPolicy protocol
- Want good documentation and active maintenance

Criteria:
- Ease of integration with HTI
- Code quality and maintainability
- Performance and stability

Template 3: Deep Investigation

Use Case: Complex questions about control theory, physics, tuning

Pattern:

Use thinkdeep with [MODEL] to investigate: [QUESTION]

Known evidence:
- [Observation 1]
- [Observation 2]

Initial hypothesis:
[What you think might be happening]

Files to examine:
[Absolute paths]

Example:

Use thinkdeep with o3 to investigate: Why does PD with Kd=2.0 converge faster than Kd=3.0?

Known evidence:
- Kd=2.0: avg 455 ticks to converge
- Kd=3.0: avg 520 ticks to converge
- Both use same Kp=8.0

Initial hypothesis:
Over-damping (Kd too high) slows response

Files to examine:
/home/john2/claude-projects/hti-zen-harness/hti_arm_demo/brains/arm_pd_controller.py
/home/john2/claude-projects/hti-zen-harness/hti_arm_demo/env.py

Template 4: Code Review Before Release

Use Case: Before committing v0.X release (10+ files changed)

Pattern:

Use codereview with gpt-5.1 to review [SCOPE]:

Review type: full
Focus areas:
- Code quality and maintainability
- Security (HTI safety invariants)
- Performance (timing band compliance)
- Architecture (brain-agnostic design preserved)

Files to review:
[List of absolute file paths]

Example:

Use codereview with gpt-5.1 to review HTI v0.5 implementation:

Review type: full
Focus areas:
- Brain-agnostic design preserved
- EventPack metadata extension correct
- No timing band violations
- Fallback logic compliance (hti-fallback-guard)

Files to review:
/home/john2/claude-projects/hti-zen-harness/hti_arm_demo/brains/arm_imperfect.py
/home/john2/claude-projects/hti-zen-harness/hti_arm_demo/run_v05_demo.py
/home/john2/claude-projects/hti-zen-harness/hti_arm_demo/shared_state.py
/home/john2/claude-projects/hti-zen-harness/hti_arm_demo/bands/control.py

Template 5: Context-Isolated Subagents (clink)

Use Case: Large codebase exploration, heavy reviews, save tokens

Pattern:

Use clink with [CLI] [ROLE] to [TASK]

Available CLIs: gemini, codex, claude
Available Roles: default, planner, codereviewer

Examples:

# Code review in isolated context (saves our tokens)
Use clink with gemini codereviewer to review hti_arm_demo/ for safety issues

# Large codebase exploration
Use clink with gemini to map all brain implementations and document their interfaces

# Strategic planning
Use clink with gemini planner to design phase-by-phase migration to MuJoCo physics

Why use clink:

  • Gemini CLI launches fresh 1M context window
  • Heavy analysis doesn't pollute our context
  • Returns only final summary/report
  • Can use web search for latest docs

Template 6: Pre-Commit Validation

Use Case: Before git commit on major changes

Pattern:

Use precommit with gpt-5.1 to validate changes in [PATH]:

Focus:
- Security issues
- Breaking changes
- Missing tests
- Documentation completeness

Example:

Use precommit with gpt-5.1 to validate changes in /home/john2/claude-projects/hti-zen-harness:

Focus:
- HTI safety invariants preserved
- No regressions in existing tests
- New tests for v0.6 features
- CHANGELOG and SPEC updated