SKILL.md

name: hti-zen-orchestrator
description: Guidelines for using Zen MCP tools effectively in this repo. Use for complex multi-model tasks, architectural decisions, or when cross-model validation adds value.

HTI Zen Orchestrator

This Skill defines when and how to use Zen MCP tools in the hti-zen-harness project.

Zen provides multi-model orchestration (planner, consensus, codereview, thinkdeep, debug, clink). Use them deliberately when they add real value, not reflexively.


⚡ Recommended Approach: Use API Tools Directly

Prefer direct Zen MCP tools over clink:

  • chat, thinkdeep, consensus, codereview, precommit, debug, planner
  • ✅ Work via API (no CLI setup needed)
  • ✅ Already configured and tested
  • ✅ Simple, reliable, fast

Avoid clink unless absolutely necessary:

  • ❌ Requires separate CLI installations (gemini CLI, codex CLI, etc.)
  • ❌ Requires separate authentication for each CLI
  • ❌ Uses your API credits anyway (no cost benefit)
  • ❌ More complexity for minimal gain in this project

Bottom line: Direct API tools (mcp__zen__chat, mcp__zen__consensus, etc.) do everything you need without the CLI overhead.


When Zen Tools Add Value

Consider using Zen MCP tools when:

Complex architectural work

  • Multi-file refactors spanning 5+ files
  • New subsystems or major feature additions
  • Changes to core HTI abstractions (bands, adapters, guards, probes)
  • Redesigning interfaces or data flows

Safety-critical code

  • Modifying timing bands or HTI invariants
  • Changes to error handling or recovery logic
  • Adapter implementations that interact with external models
  • CI/CD pipeline changes that affect safety guarantees

Ambiguous or contentious decisions

  • Multiple valid implementation approaches exist
  • Trade-offs between performance, safety, and complexity
  • Unusual patterns where you're unsure of best practice

Deep investigation needed

  • Complex bugs with unclear root cause
  • Performance issues requiring systematic analysis
  • Understanding unfamiliar codebases or dependencies

When Zen is overkill:

Simple changes

  • Single-file bug fixes
  • Adding straightforward tests
  • Documentation updates
  • Simple refactors (renaming, extracting functions)
  • Configuration tweaks

For these, direct implementation is faster and more appropriate.


Zen Tool Selection Guide

planner - Multi-step planning with reflection

Use when:

  • Task has 5+ distinct steps
  • Multiple architectural approaches possible
  • Need to think through dependencies and ordering
  • Want progressive refinement of a complex plan

Example: "Plan migration of adapter interface to support streaming responses"

consensus - Multi-model debate and synthesis

Use when:

  • Two+ valid approaches with different trade-offs
  • Safety-critical decisions need validation
  • Controversial architectural choices
  • Want diverse perspectives on a design

Example: "Should we use async generators or callback patterns for streaming? Get consensus from multiple models."

Models to include: At least 2, typically 3-4. Mix code-specialized models with general reasoning models.

codereview - Systematic code analysis

Use when:

  • Reviewing large PRs or branches
  • Safety-critical changes to core logic
  • Unfamiliar code needs audit
  • Want comprehensive security/performance review

Example: "Review the new HTI band scheduler implementation for correctness and edge cases."

thinkdeep - Hypothesis-driven investigation

Use when:

  • Complex architectural questions
  • Performance analysis and optimization planning
  • Security threat modeling
  • Understanding subtle interactions

Example: "Investigate why adapter timeout logic behaves differently under load."

debug - Root cause analysis

Use when:

  • Complex bugs with mysterious symptoms
  • Race conditions or timing issues
  • Failures that only occur in specific conditions
  • Need systematic hypothesis testing

Example: "Debug why HTI band transitions occasionally skip validation steps."

clink - Delegating to external CLI tools

Use when:

  • Need capabilities of a specific AI CLI (gemini, codex, claude)
  • Want to leverage role presets (codereviewer, planner)
  • Continuing a conversation thread across tools

Example: "Use clink with gemini CLI for large-scale codebase exploration."

chat - General-purpose thinking partner

Use for:

  • Brainstorming approaches
  • Quick sanity checks
  • Explaining concepts
  • Rubber-duck debugging

Model Selection Guidelines

When calling Zen tools, choose models deliberately based on the task:

For reading, exploration, summarization:

  • Prefer: Models with large context windows and good efficiency
  • Pattern: Large-context, efficient models
  • Use case: "Scan 50 test files to find coverage gaps"

For core implementation and refactoring:

  • Prefer: Code-specialized, high-quality models
  • Pattern: Code-specialized models (e.g., models with "codex" in the name or any available code-focused equivalent)
  • Use case: "Implement new HTI adapter with proper error handling"

For safety-critical validation:

  • Use: Multiple models via consensus or sequential codereview
  • Pattern: Mix of code-specialized and general reasoning models for diverse perspectives
  • Use case: "Validate timing band logic won't introduce deadlocks"
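
If it helps to keep these defaults auditable in one place, they can live in a small lookup table. The sketch below is a minimal illustration in Python; the task categories, model groupings, and fallback choice are assumptions for this example, not part of Zen's API:

# Hypothetical convention: one central table of preferred models per task type.
# Categories and model names are illustrative; adjust to what is actually available.
MODEL_DEFAULTS: dict[str, list[str]] = {
    "exploration": ["gemini-2.5-flash"],                       # large context, fast scans
    "implementation": ["gpt-5.1-codex"],                       # code-specialized work
    "safety_validation": ["gpt-5.1", "gemini-2.5-pro", "o3"],  # diverse consensus mix
}

def pick_models(task_type: str) -> list[str]:
    """Return preferred models for a task type, with a general-purpose fallback."""
    return MODEL_DEFAULTS.get(task_type, ["gpt-5.1"])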

Document your choices:

When model selection matters for auditability:

# HTI-NOTE: Implementation reviewed by code-specialized models (consensus check).
# No race conditions detected in band transition logic.
def transition_band(current: Band, target: Band) -> Result:
    ...

Shell Access via clink

Zen's clink tool can execute shell commands. Use it responsibly.

✅ OK without asking (read-only, low-risk):

  • File inspection: ls, pwd, cat, head, tail, find
  • Git inspection: git status, git diff, git log, git branch
  • Testing: pytest, python -m pytest, test runners
  • Linting: ruff check, black --check, mypy, static analysis
  • Info gathering: python --version, uv --version, dependency checks

⚠️ Ask user approval first:

  • Installing packages: pip install, uv add, npm install
  • Git mutations: git commit, git push, git reset, git checkout -b, git rebase
  • File mutations: rm, mv, file deletions/moves
  • Network operations: curl, wget, API calls
  • Environment changes: Modifying config files, .env files

How to ask:

I need to run: `pip install pytest-asyncio`
Reason: Required for testing async adapter implementations
Approve?
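
If you want the same policy as a programmatic guard, here is a minimal sketch; the allowlists are illustrative excerpts of the bullets above, not an exhaustive classification:

import shlex

# Illustrative allowlist of read-only commands (excerpt of the lists above).
READ_ONLY = {"ls", "pwd", "cat", "head", "tail", "find", "pytest", "mypy", "ruff"}
# Git subcommands that only inspect state.
GIT_READ_ONLY = {"status", "diff", "log", "branch"}

def needs_approval(command: str) -> bool:
    """Return True if the command mutates state and should be confirmed first."""
    parts = shlex.split(command)
    if not parts:
        return False
    if parts[0] == "git":
        return len(parts) < 2 or parts[1] not in GIT_READ_ONLY
    return parts[0] not in READ_ONLY

For example, needs_approval("git diff") is False, while needs_approval("pip install pytest-asyncio") is True and should trigger the approval prompt above.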

Failure Handling with Zen

When Zen tools or model calls fail, follow these rules (aligned with hti-fallback-guard):

❌ Do NOT:

  • Pretend the call succeeded
  • Silently switch to a different model without explanation
  • Invent outputs or fake data
  • Swallow errors and continue as if nothing happened

✅ DO:

1. Report clearly:

Zen `codereview` call failed:
  Tool: codereview
  Model: <model-name>
  Error: Rate limit exceeded (429)
  Step: Reviewing src/adapters/openai.py

2. Propose alternatives:

  • "Retry with a different model (another available code-specialized option)?"
  • "Split the review into smaller chunks?"
  • "Proceed with manual review instead?"
  • "Wait 60s and retry?"

3. Document in code if relevant:

# HTI-TODO: Codereview via Zen failed (rate limit).
# Manual review needed for thread safety in adapter pool.

Structured failure result pattern:

When appropriate, return explicit error states:

from dataclasses import dataclass

@dataclass
class ZenResult:
    ok: bool
    tool: str
    data: dict | None = None
    error: str | None = None

# Never set ok=True when the Zen call actually failed.
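
As a usage sketch, downstream code can then branch on ok explicitly instead of guessing from exceptions. Everything here besides ZenResult (the error class, run_codereview, and review_adapter) is a hypothetical stand-in, not a real Zen API:

class RateLimitError(Exception):
    """Stand-in for whatever error the underlying client raises."""

def run_codereview(path: str) -> dict:
    """Hypothetical stand-in for the actual Zen tool invocation."""
    raise RateLimitError("Rate limit exceeded (429)")

def review_adapter(path: str) -> ZenResult:
    try:
        data = run_codereview(path)
    except RateLimitError as exc:
        return ZenResult(ok=False, tool="codereview", error=str(exc))
    return ZenResult(ok=True, tool="codereview", data=data)

result = review_adapter("src/adapters/openai.py")
if not result.ok:
    print(f"Zen codereview failed: {result.error}")  # report loudly, never fake success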

Recommended Workflow for Substantial Changes

For non-trivial work (multi-file refactors, new features, safety-critical edits):

1. Plan (if complexity warrants it)

  • Use Zen planner for complex, multi-faceted tasks
  • For simpler changes, a bullet list is fine
  • Show plan to user, get confirmation

2. Implement

  • Use appropriate model (code-specialized for core logic)
  • Follow hti-fallback-guard principles
  • Document model choice if safety-critical

3. Review (for important changes)

  • Use Zen codereview for:
    • Large PRs (10+ files)
    • Safety-critical logic
    • HTI band/adapter/guard changes
  • Use Zen precommit before finalizing

4. Summarize

Tell the user:

  • What changed (files, behavior)
  • Which models/tools were used
  • Any TODOs or concerns
  • Test coverage added/modified

Integration with Testing and CI

When working on tests or CI:

Prefer changes that tighten guarantees:

  • Tests that assert explicit failures, not silent fallbacks (see the sketch after this list)
  • CI checks that fail loudly when invariants break
  • Guards that prevent invalid state transitions
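
As an illustration of the first point, a test can pin down that a failed call surfaces as an explicit error rather than being absorbed. This sketch reuses the ZenResult shape from the failure-handling section and is illustrative, not existing project code:

import pytest

def flaky_tool_call() -> ZenResult:
    """Stand-in for a Zen call that fails; illustrative only."""
    return ZenResult(ok=False, tool="codereview", error="Rate limit exceeded (429)")

def require_ok(result: ZenResult) -> dict:
    """Guard: surface failures loudly instead of returning partial data."""
    if not result.ok:
        raise RuntimeError(f"Zen {result.tool} failed: {result.error}")
    return result.data or {}

def test_failed_call_raises_instead_of_falling_back():
    with pytest.raises(RuntimeError, match="codereview failed"):
        require_ok(flaky_tool_call())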

Use Zen tools to:

  • Validate test coverage (codereview with focus on testing)
  • Check CI logic for edge cases (thinkdeep on pipeline behavior)
  • Compare testing strategies (consensus on approach)

Document how changes affect:

  • HTI invariants (timing, safety, ordering)
  • Existing guards and probes
  • CI failure modes

When in Doubt

Ask yourself:

  1. Is this complex enough to need multi-model orchestration?

    • Yes → Use Zen deliberately
    • No → Direct implementation is fine
  2. Does this change affect safety or timing?

    • Yes → Consider consensus or codereview
    • No → Proceed with standard review
  3. Am I using Zen to avoid thinking, or to think better?

    • Avoid thinking → Don't use Zen
    • Think better → Use Zen appropriately

The goal is thoughtful tool use, not tool maximalism.


HTI-Specific Model Recommendations

Available Models (as of 2025-11-30):

  • Gemini: gemini-2.5-pro (1M context, deep reasoning), gemini-2.5-flash (ultra-fast)
  • OpenAI: gpt-5.1, gpt-5.1-codex, gpt-5-pro, o3, o3-mini, o4-mini

Recommended by Task:

  • Planning: gpt-5.1-codex (code-focused structured planning)
  • Architecture: gemini-2.5-pro (deep reasoning, 1M context)
  • Debugging: o3 (strong logical analysis)
  • Code Review: gpt-5.1 (comprehensive reasoning)
  • Quick Questions: gemini-2.5-flash (ultra-fast, 1M context)
  • Consensus: Mix 2-3 models (e.g., gpt-5.1 + gemini-2.5-pro + o3)

Practical Templates

Template 1: Planning New HTI Version

Use Case: Starting v0.X implementation (5+ files, new subsystems)

Pattern:

Use planner with gpt-5.1-codex to design [FEATURE]:

Context:
- Current state: [what exists now]
- Goal: [what we're building]
- Constraints: [HTI invariants, backward compatibility]

Plan should include:
1. Architecture changes needed
2. File modifications (existing + new)
3. Testing strategy
4. Migration path (if breaking changes)

Example:

Use planner with gpt-5.1-codex to design v0.6 RL policy integration:

Context:
- Current: PD/PID controllers via ArmBrainPolicy protocol
- Goal: Support stateful RL policies (PPO, SAC, DQN)
- Constraints: Zero harness changes, brain-agnostic design

Plan should include:
1. BrainPolicy extension for stateful policies
2. Episode buffer interface
3. Checkpoint loading/saving
4. Testing with dummy RL brain

Template 2: Design Decisions via Consensus

Use Case: Multiple valid approaches, safety-critical choices

Pattern:

Use consensus to decide: [QUESTION]

Models:
- gpt-5.1 with stance "for" [OPTION A]
- gemini-2.5-pro with stance "against" [OPTION A, argue for OPTION B]
- o3 with stance "neutral" (objective analysis)

Context:
[Relevant technical details]

Criteria:
- [Criterion 1]
- [Criterion 2]

Example:

Use consensus to decide: RL framework for HTI v0.6

Models:
- gpt-5.1 with stance "for" Stable-Baselines3
- gemini-2.5-pro with stance "against" SB3, argue for CleanRL
- o3 with stance "neutral"

Context:
- Need PPO, SAC, DQN implementations
- Must integrate with HTI ArmBrainPolicy protocol
- Want good documentation and active maintenance

Criteria:
- Ease of integration with HTI
- Code quality and maintainability
- Performance and stability

Template 3: Deep Investigation

Use Case: Complex questions about control theory, physics, tuning

Pattern:

Use thinkdeep with [MODEL] to investigate: [QUESTION]

Known evidence:
- [Observation 1]
- [Observation 2]

Initial hypothesis:
[What you think might be happening]

Files to examine:
[Absolute paths]

Example:

Use thinkdeep with o3 to investigate: Why does PD with Kd=2.0 converge faster than Kd=3.0?

Known evidence:
- Kd=2.0: avg 455 ticks to converge
- Kd=3.0: avg 520 ticks to converge
- Both use same Kp=8.0

Initial hypothesis:
Over-damping (Kd too high) slows response

Files to examine:
/home/john2/claude-projects/hti-zen-harness/hti_arm_demo/brains/arm_pd_controller.py
/home/john2/claude-projects/hti-zen-harness/hti_arm_demo/env.py

Template 4: Code Review Before Release

Use Case: Before committing v0.X release (10+ files changed)

Pattern:

Use codereview with gpt-5.1 to review [SCOPE]:

Review type: full
Focus areas:
- Code quality and maintainability
- Security (HTI safety invariants)
- Performance (timing band compliance)
- Architecture (brain-agnostic design preserved)

Files to review:
[List of absolute file paths]

Example:

Use codereview with gpt-5.1 to review HTI v0.5 implementation:

Review type: full
Focus areas:
- Brain-agnostic design preserved
- EventPack metadata extension correct
- No timing band violations
- Fallback logic compliance (hti-fallback-guard)

Files to review:
/home/john2/claude-projects/hti-zen-harness/hti_arm_demo/brains/arm_imperfect.py
/home/john2/claude-projects/hti-zen-harness/hti_arm_demo/run_v05_demo.py
/home/john2/claude-projects/hti-zen-harness/hti_arm_demo/shared_state.py
/home/john2/claude-projects/hti-zen-harness/hti_arm_demo/bands/control.py

Template 5: Context-Isolated Subagents (clink)

Use Case: Large codebase exploration, heavy reviews, save tokens

Pattern:

Use clink with [CLI] [ROLE] to [TASK]

Available CLIs: gemini, codex, claude
Available Roles: default, planner, codereviewer

Examples:

# Code review in isolated context (saves our tokens)
Use clink with gemini codereviewer to review hti_arm_demo/ for safety issues

# Large codebase exploration
Use clink with gemini to map all brain implementations and document their interfaces

# Strategic planning
Use clink with gemini planner to design phase-by-phase migration to MuJoCo physics

Why use clink:

  • Gemini CLI launches fresh 1M context window
  • Heavy analysis doesn't pollute our context
  • Returns only final summary/report
  • Can use web search for latest docs

Template 6: Pre-Commit Validation

Use Case: Before git commit on major changes

Pattern:

Use precommit with gpt-5.1 to validate changes in [PATH]:

Focus:
- Security issues
- Breaking changes
- Missing tests
- Documentation completeness

Example:

Use precommit with gpt-5.1 to validate changes in /home/john2/claude-projects/hti-zen-harness:

Focus:
- HTI safety invariants preserved
- No regressions in existing tests
- New tests for v0.6 features
- CHANGELOG and SPEC updated