---
name: edu-demo-evaluator
description: Evaluate educational demos using Chrome tools for E2E testing. Executes test cases from test_cases.json, captures screenshots, verifies learning outcomes. Scores QUALITY relative to benchmark. Uses real browser interaction via mcp__claude-in-chrome__* tools.
---
# Educational Demo Evaluator

Execute test cases. Verify learning outcomes. Score quality vs benchmark.
## Core Principles
- Execute `test_cases.json`; don't invent tests
- Use Chrome tools for real browser interaction
- Verify LEARNING, not just button clicks
- Score against the benchmark for quality comparison
## Screenshot System (REQUIRED)
**Critical:** Next-generation builders study these screenshots. You MUST organize them:

- During test execution, click the "📸 Capture State" button at key moments
- After testing, click the "⬇️ Download Screenshots" button
- Screenshots download to `~/Downloads/` with labels (`capture_1.png`, etc.)
- MOVE them to `/problems/<name>/screenshots/agent_X_*.png`
- Reference the filenames in your evaluation JSON output
**Why required:**

- Gen 2 builders read `screenshots/` to understand which vibes worked visually
- LESSONS_LEARNED tells them what worked; screenshots show them how
- Builders need both to discover their approach for the next generation
**Example structure:**

```
/problems/quicksort-demo/screenshots/
├── agent_1_initial.png   (gen1, narrative vibe)
├── agent_1_step_1.png
├── agent_4_initial.png   (gen1, comparison vibe)
├── agent_4_step_1.png
...
```
## Prerequisites

Before evaluation:

- `test_cases.json` must exist in the problem folder
- Demo HTML file must exist
- Benchmark screenshots must be available for quality reference
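A quick preflight can catch missing inputs before any browser work starts. A minimal sketch in Python, assuming a hypothetical problem folder and the `generations/` layout used in the example session below:

```python
from pathlib import Path

problem = Path("/problems/heap-demo")  # hypothetical problem folder

# All three prerequisites, checked up front
assert (problem / "test_cases.json").is_file(), "test_cases.json missing"
assert list(problem.glob("generations/**/*.html")), "no demo HTML found"
assert list((problem / "benchmark_ux").glob("*.png")), "no benchmark screenshots"
```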
## Workflow

### Step 1: Setup Chrome

```
# Get or create tab
mcp__claude-in-chrome__tabs_context_mcp(createIfEmpty=true)
# Returns: tabId

# Create new tab for testing
mcp__claude-in-chrome__tabs_create_mcp()
# Returns: new tabId - use this one
```
### Step 2: Load Test Cases

```
Read(problems/<name>/test_cases.json)
```
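The exact schema is up to the problem author. The shape below is a hypothetical sketch, inferred from the fields Step 5 consumes (`steps[].action`, `query`, `value`, `ms`, and a `verify` list); trust the actual file over this sketch, and never invent tests:

```json
{
  "test_cases": [
    {
      "id": "insert_bubble_up",
      "steps": [
        { "action": "find", "query": "value input" },
        { "action": "input", "value": "5" },
        { "action": "find", "query": "Insert button" },
        { "action": "click" },
        { "action": "wait", "ms": 2000 },
        { "action": "screenshot" }
      ],
      "verify": ["5 bubbled up to root"]
    }
  ]
}
```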
### Step 3: Load Benchmark (quality reference)

```
Read(problems/<name>/benchmark_ux/*.png)
# Note the QUALITY level to compare against
```
### Step 4: Navigate to Demo

```
mcp__claude-in-chrome__navigate(
    url="file:///absolute/path/to/agent.html",
    tabId=X
)

# Wait for load
mcp__claude-in-chrome__computer(
    action="wait",
    duration=2,
    tabId=X
)

# Screenshot initial state
mcp__claude-in-chrome__computer(
    action="screenshot",
    tabId=X
)
```
### Step 5: Execute Each Test Case

For each test case in `test_cases.json`:

```
# Read page structure
mcp__claude-in-chrome__read_page(tabId=X, filter="interactive")

# Execute steps from the test case
for step in test_case.steps:
    if step.action == "find":
        found_ref = mcp__claude-in-chrome__find(query=step.query, tabId=X)['ref']
    elif step.action == "input":
        mcp__claude-in-chrome__form_input(ref=found_ref, value=step.value, tabId=X)
    elif step.action == "click":
        mcp__claude-in-chrome__computer(action="left_click", ref=found_ref, tabId=X)
    elif step.action == "wait":
        mcp__claude-in-chrome__computer(action="wait", duration=step.ms / 1000, tabId=X)
    elif step.action == "screenshot":
        # Use the built-in Capture State button
        find_result = mcp__claude-in-chrome__find(query="Capture State button", tabId=X)
        mcp__claude-in-chrome__computer(action="left_click", ref=find_result['ref'], tabId=X)
```
### Step 5b: Download Screenshots

After executing all test cases:

```
# Click the Download Screenshots button
find_result = mcp__claude-in-chrome__find(
    query="Download Screenshots button",
    tabId=X
)
mcp__claude-in-chrome__computer(
    action="left_click",
    ref=find_result['ref'],
    tabId=X
)

# Screenshots download automatically with labels
# (capture_1.png, capture_2.png, etc.)
```
### Step 6: Organize Screenshots

After downloading, organize for the next generation of builders:

```bash
# Screenshots are in ~/Downloads/
# Move them to /problems/<name>/screenshots/ with agent labels
mv ~/Downloads/capture_1.png /problems/<name>/screenshots/agent_1_initial.png
mv ~/Downloads/capture_2.png /problems/<name>/screenshots/agent_1_step_1.png
...
```

Required naming: `agent_{id}_{label}.png`, so builders can find screenshots by agent.
### Step 7: Verify Learning Outcomes

After organizing screenshots:

- Compare against the `test_case.verify` expectations
- Check: did the demo teach correctly?
- Review both the visuals (screenshots) and the learning outcomes

Questions to answer (see the sketch after this list):

- Does the visual state match what's expected? (e.g., "5 bubbled up to root")
- Is the learning outcome achieved?
- Would a learner understand the concept?
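Where a `verify` expectation can be phrased as visible text, `get_page_text` gives a cheap first pass. A sketch in the same pseudocode style as Step 5 (treating `test_case.verify` as a list of expected substrings is an assumption, and `record_result` is a hypothetical helper); screenshots remain the authoritative evidence:

```
# Sketch: text-based verification pass
page_text = get_page_text(tabId=X)

for expectation in test_case.verify:
    if expectation in page_text:
        record_result(test_case.id, "PASS")    # record_result is hypothetical
    else:
        record_result(test_case.id, "FAIL", reason=f"not visible: {expectation}")
```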
## Chrome Tools Reference

| Tool | Use For |
|---|---|
| `tabs_context_mcp` | Get available tabs |
| `tabs_create_mcp` | Create new tab for testing |
| `navigate` | Go to demo URL |
| `read_page` | Get element structure |
| `find` | Locate elements by purpose |
| `form_input` | Enter values in inputs |
| `computer` | Click, wait, screenshot |
| `get_page_text` | Extract visible text |
## Example Evaluation Session

```
# Setup
tabs_context_mcp(createIfEmpty=true) -> existing tabs
tabs_create_mcp() -> tabId: 456

# Navigate
navigate(url="http://localhost:9999/problems/heap-demo/generations/gen1/agent_1.html", tabId=456)

# Wait and let the demo initialize
computer(action="wait", duration=2, tabId=456)

# Read structure
read_page(tabId=456, filter="interactive")
-> ref_1: textbox "value input"
-> ref_2: button "Insert"
-> ref_3: button "Extract Min"
-> ref_4: button "📸 Capture State"

# Execute test case: insert_bubble_up
form_input(ref="ref_1", value="5", tabId=456)
computer(action="left_click", ref="ref_2", tabId=456)
computer(action="wait", duration=2, tabId=456)

# Capture at this key moment
computer(action="left_click", ref="ref_4", tabId=456)  # Click Capture State button
computer(action="wait", duration=1, tabId=456)

# Verify
# Check: Did 5 bubble up correctly? Is the animation visible?

# ... test more cases ...

# When done, download screenshots
find_result = find(query="Download Screenshots button", tabId=456)
computer(action="left_click", ref=find_result['ref'], tabId=456)

# Then move from ~/Downloads/ to /problems/heap-demo/screenshots/agent_1_*.png
```
## Viewport Verification

Check for viewport issues during testing:

```
# Read the full page
read_page(tabId=X, filter="all")
# Check for scroll indicators
# Check if elements are cut off
# Check if buttons are accessible

# Screenshot the full page area
computer(action="screenshot", tabId=X)
# Verify: Is everything visible without scrolling?
```
## Score Categories (out of 100)
| Category | Max | Verified By |
|---|---|---|
| Correctness | 30 | Test cases pass, algorithm accurate |
| Clarity | 20 | Visual quality vs benchmark |
| Educational value | 20 | Learning outcomes achieved |
| Viewport | 15 | All content visible, no scroll |
| Interaction | 15 | Buttons work, no bugs |
## Automatic Deductions
| Issue | Deduction | Detection |
|---|---|---|
| Test case fails | -10 each | verify step fails |
| Must scroll | -10 | elements outside viewport |
| Elements cut off | -10 | read_page shows clipped |
| Buttons blocked | -15 | click fails or wrong target |
| Content jumps | -5 | visual comparison |
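The tables above leave open exactly how deductions combine with category scores. One defensible reading, sketched below as an assumption rather than a fixed rule: subtract deductions from the category subtotal, then apply the cap from the Reality Check section.

```python
def apply_deductions(scores: dict, failed_tests: int, must_scroll: bool,
                     cut_off: bool, blocked: bool, jumps: bool) -> int:
    """Combine category scores with the automatic deductions (sketch)."""
    total = sum(scores.values())       # category subtotal, out of 100
    total -= 10 * failed_tests         # -10 per failing test case
    total -= 10 if must_scroll else 0  # must scroll
    total -= 10 if cut_off else 0      # elements cut off
    total -= 15 if blocked else 0      # buttons blocked
    total -= 5 if jumps else 0         # content jumps
    if failed_tests > 0:
        total = min(total, 50)         # Reality Check: any correctness failure caps at 50
    return max(total, 0)
```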
## Output Format

```json
{
  "agent": "gen2/agent_3.html",
  "benchmarks_read": ["08_heaps.png"],
  "test_cases_executed": [
    {
      "id": "insert_bubble_up",
      "result": "PASS",
      "screenshots": ["capture_1.png", "capture_2.png"],
      "screenshot_notes": "Initial state and after insert step captured in ~/Downloads/",
      "learning_verified": true
    },
    {
      "id": "extract_root",
      "result": "FAIL",
      "reason": "Animation skips comparison step",
      "screenshots": ["capture_3.png", "capture_4.png"],
      "screenshot_notes": "Before extract and after extract in ~/Downloads/",
      "learning_verified": false
    }
  ],
  "viewport_check": {
    "fits_viewport": true,
    "issues": []
  },
  "scores": {
    "correctness": 20,
    "clarity": 18,
    "educational_value": 12,
    "viewport": 15,
    "interaction": 13
  },
  "total": 50,
  "bugs": [],
  "correctness_issues": ["extract animation incomplete"]
}
```
## Reality Check
- If ANY test case fails correctness: max score 50
- Most demos score 30-50
- 70+ requires ALL test cases passing
- Compare screenshots to benchmark for quality judgment
## Cleanup

After evaluation:

```
# Close the test tab (optional)
# Or leave open for debugging
```