| name | edu-demo-evaluator-free |
| description | Watch an educational demo like a learner (BLIND evaluation). No test cases. No benchmark. No rubric. Honest assessment of: impression, what works, what doesn't, learner impact, recommendation. Output: agent_X_free_eval.json |
Educational Demo Evaluator - Free Evaluation
Watch the demo like a learner would. Be honest. No scoring rubric. No benchmark bias.
Core Principles
- BLIND to test cases - Don't read test_cases.json
- BLIND to benchmark - Don't look at benchmark_ux/
- Watch like a learner - First time seeing it, no prior knowledge
- Honest assessment - What's awesome? What's confusing?
- Qualitative only - No numeric scores
Workflow
Step 1: Setup Chrome
# Get or create tab
mcp__claude-in-chrome__tabs_context_mcp(createIfEmpty=true)
# Create new tab for evaluation
mcp__claude-in-chrome__tabs_create_mcp()
# Returns: tabId (use this tab for all calls while evaluating this agent)
Step 2: Start HTTP Server
# Start HTTP server (from code-evo-agent-simple root directory)
cd /Users/hani/code-evo-agent-simple
python3 -m http.server 9999 &
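Before navigating, a quick sanity check (optional; assumes curl is available) confirms the server actually came up on port 9999:
# Expect an "HTTP/1.0 200 OK" line from Python's http.server
curl -sI http://localhost:9999/ | head -n 1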
Step 3: Navigate to Demo
# Navigate to demo via HTTP (NOT file://)
mcp__claude-in-chrome__navigate(
  url="http://localhost:9999/problems/<name>/generations/gen{N}/agent_X.html",
  tabId=X
)
# Wait for load
mcp__claude-in-chrome__computer(action="wait", duration=2, tabId=X)
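If the page looks blank or wrong, the most common cause is a typo in the path. A shell-side status check (an optional sketch; substitute <name>, {N}, and X exactly as in the navigate call) tells a missing file apart from a broken demo:
# 200 = file found and served; 404 = wrong <name>/gen{N}/agent_X path
curl -s -o /dev/null -w "%{http_code}\n" \
  "http://localhost:9999/problems/<name>/generations/gen{N}/agent_X.html"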
Step 4: Watch and Interact as a Learner
Spend 5-10 minutes with the demo like a real student:
- Read initial content - what's explained?
- Click buttons, interact with controls
- Watch animations play - are they clear?
- Try different scenarios - what do you learn?
- Capture screenshots at key moments
Focus on educational value, not technical polish.
# Screenshot the initial state for your own reference (the demo's built-in system also captures initial_state automatically)
mcp__claude-in-chrome__computer(action="screenshot", tabId=X)
# Read what's on the page
mcp__claude-in-chrome__read_page(tabId=X)
# Find buttons to interact with
mcp__claude-in-chrome__find(query="play button or start button", tabId=X)
# Click and interact
mcp__claude-in-chrome__computer(action="left_click", ref=found_ref, tabId=X)
# Wait for animation
mcp__claude-in-chrome__computer(action="wait", duration=2, tabId=X)
# CAPTURE at key moments using the built-in system
# Or trigger a capture programmatically via the javascript tool (runs in the page context):
mcp__claude-in-chrome__javascript_tool(
  action="javascript_exec",
  text="window.screenshotManager.captureState('key_moment')",
  tabId=X
)
Screenshots are captured in the demo via the built-in html2canvas system:
- Click the "📸 Capture State" button at key moments
- Click "⬇️ Download Screenshots" when done to download all PNGs
- Each screenshot is labeled (initial_state, capture_1, capture_2, etc.)
ORGANIZE them for the next generation's builders:
# Move from ~/Downloads to problems/<name>/screenshots/ (under the code-evo-agent-simple root) with agent labels
mv ~/Downloads/capture_1.png problems/<name>/screenshots/agent_X_initial.png
mv ~/Downloads/capture_2.png problems/<name>/screenshots/agent_X_moment_1.png
The demo maintains a capture history during your evaluation session.
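When several captures were downloaded, a small loop (a sketch; it assumes the files land in ~/Downloads as capture_*.png and that you substitute the real problem name and agent label) keeps the prefixing consistent before you rename the interesting ones to semantic labels like agent_X_moment_1.png:
# Prefix every downloaded capture with the agent label and move it into the shared folder
# (assumes ~/Downloads only holds captures from this evaluation session)
cd /Users/hani/code-evo-agent-simple
for f in ~/Downloads/capture_*.png; do
  mv "$f" "problems/<name>/screenshots/agent_X_$(basename "$f")"
done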
Step 5: Record Honest Assessment
As you watch, ask yourself:
First Impression
- What do you see immediately?
- Is it inviting or intimidating?
- Does it look complete or broken?
Does It Make Sense?
- Can you understand what's happening?
- Is the core concept clear from the visualization?
- Are there confusing or misleading parts?
Is It Engaging?
- Do you want to keep exploring?
- Are interactions satisfying and rewarding?
- Do animations feel smooth or janky?
What Works?
- What design choices are brilliant for learning?
- What explanations are clear and memorable?
- What makes the concept "click"?
What Doesn't Work?
- What's confusing to a learner?
- What feels incomplete or wrong?
- What metaphors or explanations could mislead?
Educational Value
- Would a student understand the concept after this?
- Could they explain it to someone else?
- What's the key learning takeaway?
- What would a learner REMEMBER in a week?
Recommendation
- Should this be used?
- What's the one thing to fix?
- Is it a winner, or does it need major work?
Output Format
{
  "agent": "gen2/agent_1",
  "approach": "Comparison/Dual-View",
  "first_impression": "Clean, minimal UI with two side-by-side algorithms",
  "what_works": [
    "Immediately shows WHY quicksort matters (bubble sort is slow)",
    "Color coding makes comparisons easy to follow",
    "Step-by-step controls let the learner set the pace",
    "Comparison metrics visible (comparisons, swaps, time)"
  ],
  "what_doesnt_work": [
    "Recursion depth not clearly shown - jumps between levels",
    "Pivot selection explanation could be clearer",
    "Animation speed is a bit fast for beginners"
  ],
  "learner_impact": "A student would understand that quicksort is faster because of intelligent partitioning. Might not fully grasp recursion or pivot selection strategy.",
  "recommendation": "STRONG CANDIDATE - Fix recursion visualization, maybe add narrative explanations for pivot selection. Otherwise an excellent foundation.",
  "screenshots_captured": "agent_1_initial.png, agent_1_comparison.png, agent_1_recursion.png (moved to problems/<name>/screenshots/)"
}
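The description names agent_X_free_eval.json as the only output artifact. One way to write and check it from the shell (a minimal sketch; the skill fixes only the filename, so the directory and the abbreviated content below are assumptions drawn from the example above):
# Write the assessment (substitute the real agent number and your own findings)
cat > agent_1_free_eval.json <<'EOF'
{
  "agent": "gen2/agent_1",
  "approach": "Comparison/Dual-View",
  "first_impression": "Clean, minimal UI with two side-by-side algorithms",
  "what_works": ["Immediately shows WHY quicksort matters (bubble sort is slow)"],
  "what_doesnt_work": ["Recursion depth not clearly shown - jumps between levels"],
  "learner_impact": "Would understand partitioning; might not fully grasp recursion",
  "recommendation": "STRONG CANDIDATE - fix recursion visualization",
  "screenshots_captured": "agent_1_initial.png, agent_1_comparison.png"
}
EOF
# Confirm the file is valid JSON before handing it to the next stage
python3 -m json.tool agent_1_free_eval.json > /dev/null && echo "valid JSON"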
Key Phrases to Avoid
❌ "Correctness score: 85" ❌ "Compared to benchmark..." ❌ "Test case coverage: 14/15" ❌ "Points deducted for..."
Key Phrases to Use
✅ "Immediately shows WHY" ✅ "A learner would understand..." ✅ "The animation feels smooth" ✅ "Confusing part: recursion depth"
Important Notes
- Don't read test cases - You don't know what you're supposed to verify
- Don't think about benchmark - You don't know what "good" looks like
- Don't use rubric - No scoring categories, no point calculations
- Be honest - If it's confusing, say it's confusing
- Watch 5-10 minutes per agent - Enough time to form an honest impression
Example Evaluation
Visit http://localhost:9999/problems/quicksort-demo/generations/gen2/agent_1.html
First impression:
- Clean white background with two columns side by side
- Left: Quicksort animation, Right: Bubble sort animation
- Professional looking, not too colorful
Interact:
- Click "Start" button
- Both arrays start animating
- Quicksort finishes first
- Bubble sort continues much longer
- Counter shows comparisons: QS=45, BS=120
Impression: "OH! This is why quicksort is better! The visualization immediately makes it clear."
Assessment:
- WORKS: Side-by-side comparison is brilliant
- WORKS: Metrics visible (comparison count)
- WORKS: Speed difference obvious
- DOESN'T WORK: Recursion not explained (which recursion level is currently active?)
- DOESN'T WORK: Pivot selection seems arbitrary
- RECOMMENDATION: This is a strong foundation. Add narrative about pivot strategy, show recursion depth. Could be a winner.
Cleanup
# Kill HTTP server
pkill -f "python3 -m http.server 9999"
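To confirm nothing is still listening on the port (optional; lsof ships with macOS), check for leftover listeners on 9999:
# No output means the port has been released
lsof -nP -iTCP:9999 -sTCP:LISTEN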