| name | la-bench-workflow |
| description | Orchestrate the complete LA-Bench experimental procedure generation workflow from JSONL input to validated output. This skill should be used when processing LA-Bench format experimental data to generate and validate detailed experimental procedures. It coordinates parsing, reference fetching, procedure generation, and quality validation with 10-point scoring. |
LA-Bench Workflow
Overview
Orchestrate the complete LA-Bench experimental procedure generation and validation pipeline. This skill coordinates four specialized components:
- la-bench-parser: Extract experimental data from JSONL
- web-reference-fetcher: Fetch reference materials
- procedure-generator: Generate detailed experimental procedures
- procedure-checker: Validate and score generated procedures (10-point scale)
All intermediate and final outputs are saved under a standardized directory structure: workdir/{filename}_{entry_id}/
When to Use This Skill
Use this skill when:
- Processing LA-Bench JSONL files (public_test.jsonl, private_test_input.jsonl, etc.)
- Generating experimental procedures from LA-Bench format data
- Validating generated procedures against LA-Bench quality standards
- Batch processing multiple LA-Bench entries
- Testing or evaluating procedure generation capabilities
Workflow
Step 1: Parse LA-Bench Entry
Extract experimental data from JSONL file using la-bench-parser skill.
Input:
- JSONL file path (e.g., data/public_test.jsonl)
- Entry ID (e.g., public_test_1)
Execution:
# Extract filename without extension
filename=$(basename "data/public_test.jsonl" .jsonl) # "public_test"
# Create working directory
mkdir -p workdir/${filename}_${entry_id}
# Parse entry
python .claude/skills/la-bench-parser/scripts/parse_labench.py \
data/public_test.jsonl ${entry_id} \
> workdir/${filename}_${entry_id}/input.json
Output:
workdir/{filename}_{entry_id}/input.json
Content:
- instruction
- mandatory_objects
- source_protocol_steps
- expected_final_states
- references
- measurement.specific_criteria (if available)
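A minimal sketch for loading the parsed entry and confirming the fields listed above are present (field names taken from the list above; measurement.specific_criteria is treated as optional):
import json
from pathlib import Path

def load_input(workdir: Path) -> dict:
    """Load input.json and check that the required LA-Bench fields exist."""
    data = json.loads((workdir / "input.json").read_text(encoding="utf-8"))
    required = ["instruction", "mandatory_objects", "source_protocol_steps",
                "expected_final_states", "references"]
    missing = [field for field in required if field not in data]
    if missing:
        raise ValueError(f"input.json is missing required fields: {missing}")
    # measurement.specific_criteria is optional; its absence changes Step 4 scoring
    if "specific_criteria" not in data.get("measurement", {}):
        print("Note: no measurement.specific_criteria; individual criteria "
              "fall back to a general quality assessment.")
    return data

entry = load_input(Path("workdir/public_test_public_test_1"))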
Step 2: Fetch Reference Materials (Optional)
Fetch reference URLs using web-reference-fetcher skill if references exist in input.json.
Input:
- Reference URLs from input.json
Execution:
mkdir -p workdir/${filename}_${entry_id}/references
# For each reference URL (ref_1, ref_2, ...)
python3 .claude/skills/web-reference-fetcher/scripts/fetch_url.py \
"${reference_url}" \
--output workdir/${filename}_${entry_id}/references/ref_${i}.md
Output:
- workdir/{filename}_{entry_id}/references/ref_1.md
- workdir/{filename}_{entry_id}/references/ref_2.md
- ... (one file per reference)
Note: If references cannot be fetched or URLs are inaccessible, proceed to Step 3 without reference materials.
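Fetching every reference can be scripted around fetch_url.py. A sketch, assuming input.json stores references as a plain list of URLs (the exact shape of the references field is defined by la-bench-parser; adjust the access accordingly):
import json
import subprocess
from pathlib import Path

workdir = Path("workdir/public_test_public_test_1")
refs_dir = workdir / "references"
refs_dir.mkdir(parents=True, exist_ok=True)

references = json.loads((workdir / "input.json").read_text(encoding="utf-8"))["references"]
for i, url in enumerate(references, start=1):
    result = subprocess.run(
        ["python3", ".claude/skills/web-reference-fetcher/scripts/fetch_url.py",
         url, "--output", str(refs_dir / f"ref_{i}.md")],
    )
    if result.returncode != 0:
        # Per the note above: a failed fetch is not fatal; continue without it.
        print(f"Warning: could not fetch ref_{i} ({url}); proceeding without it.")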
Step 3: Generate Experimental Procedure
Generate detailed procedure using procedure-generator skill.
Input:
- workdir/{filename}_{entry_id}/input.json
- workdir/{filename}_{entry_id}/references/ref_*.md (if available)
Process:
- Read input.json
- Read all reference files (if any exist)
- Invoke the experiment-procedure-generator subagent with a detailed prompt
- Save the generated procedure
- Verify the output constraints (≤50 steps, ≤10 sentences per step; see the sketch after the format example below)
Output:
workdir/{filename}_{entry_id}/procedure.json
Format:
[
{"id": 1, "text": "First step with quantitative details..."},
{"id": 2, "text": "Second step with quantitative details..."},
...
]
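A minimal sketch of the constraint check from the process list above (limits per Step 3: at most 50 steps, at most 10 sentences per step; the naive sentence split is an approximation, with the authoritative check done by validate_procedure.py in Step 4):
import json
import re

def verify_procedure(path: str, max_steps: int = 50, max_sentences: int = 10) -> list[str]:
    """Return a list of constraint violations; an empty list means the procedure passes."""
    steps = json.loads(open(path, encoding="utf-8").read())
    problems = []
    if len(steps) > max_steps:
        problems.append(f"{len(steps)} steps exceeds the {max_steps}-step limit")
    for step in steps:
        # Rough sentence count on Western and Japanese terminators;
        # good enough for a sanity check before Step 4.
        sentences = [s for s in re.split(r"[.!?。！？]+", step["text"]) if s.strip()]
        if len(sentences) > max_sentences:
            problems.append(f"step {step['id']} has {len(sentences)} sentences "
                            f"(limit {max_sentences})")
    return problems

print(verify_procedure("workdir/public_test_public_test_1/procedure.json"))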
Step 4: Validate, Score, and Iteratively Improve Procedure
Validate the procedure using the procedure-checker skill with the 10-point scoring system. Automatically improve the procedure until the score reaches ≥ 8/10 or the maximum number of iterations is reached.
Target Quality: Score ≥ 8/10 points (Good or Excellent)
Input:
- workdir/{filename}_{entry_id}/input.json
- workdir/{filename}_{entry_id}/procedure.json
Iterative Improvement Process:
Iteration Loop (Max 3 attempts)
For each iteration (attempt 1, 2, 3):
Formal Validation (Script-based)
# Create wrapped version
python3 -c "
import json
with open('workdir/${filename}_${entry_id}/procedure.json', 'r') as f:
    steps = json.load(f)
wrapped = {'procedure_steps': steps}
with open('workdir/${filename}_${entry_id}/procedure_wrapped.json', 'w') as f:
    json.dump(wrapped, f, ensure_ascii=False, indent=2)
"
# Run validation
python .claude/skills/procedure-checker/scripts/validate_procedure.py \
  workdir/${filename}_${entry_id}/procedure_wrapped.json
Semantic Evaluation (Subagent-based)
- Invoke procedure-semantic-checker subagent
- Evaluate with 10-point scoring system:
- Common Criteria (5 points)
- Individual Criteria (5 points)
- Generate comprehensive review
Combine Results
- Merge formal validation + semantic evaluation
- Save review to workdir/{filename}_{entry_id}/review_v{iteration}.md
Check Score
- Extract total score from review
- If score ≥ 8: ✅ Success → Proceed to Step 5
- If score < 8 AND iteration < 3: 🔄 Improve → Continue to sub-step 5
- If score < 8 AND iteration = 3: ⚠️ Max iterations reached → Proceed to Step 5 with warning
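Extracting the total score from the review can be done with a simple pattern match. A sketch, assuming the review contains a line like "Total Score: 7/10" (the exact wording is produced by the procedure-semantic-checker subagent, so treat the pattern as an assumption to adapt):
import re
from pathlib import Path

def extract_score(review_path: Path) -> int | None:
    """Pull the first 'N/10' total-score figure out of a review markdown file."""
    text = review_path.read_text(encoding="utf-8")
    match = re.search(r"Total Score:\s*(\d+)\s*/\s*10", text)
    return int(match.group(1)) if match else None

score = extract_score(Path("workdir/public_test_public_test_1/review_v1.md"))
if score is not None and score >= 8:
    print("✅ Target achieved")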
Generate Improvement Instructions (if score < 8)
Extract specific issues from review and create targeted improvement prompt:
Analysis:
- Read review_v{iteration}.md
- Extract "Critical Issues" section
- Extract "Must-Fix Recommendations"
- Identify specific scoring deficiencies:
- Which common criteria failed? (0-5 points)
- Which individual criteria failed? (0-5 points)
- What deductions were applied?
- Was excessive safety penalty triggered?
Improvement Prompt Template:
The previous procedure (v{iteration}) scored {score}/10 points.
Regenerate the procedure addressing these specific issues:

## Failed Common Criteria:
{list of failed criteria with specific examples from review}

## Failed Individual Criteria:
{list of failed criteria with specific examples}

## Critical Issues to Fix:
{numbered list from review}

## Must-Fix Recommendations:
{numbered list from review}

## Specific Corrections Required:
{detailed corrections extracted from review}

Maintain all strengths from the previous version while addressing these issues.
Regenerate Procedure
- Invoke procedure-generator skill again
- Pass improvement instructions to experiment-procedure-generator subagent
- Save to workdir/{filename}_{entry_id}/procedure.json (overwrite)
- Archive previous version to workdir/{filename}_{entry_id}/procedure_v{iteration}.json
Loop Back to Step 4.1 (next iteration)
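Putting the sub-steps together, the overall control flow of Step 4 looks roughly like the following sketch. evaluate_procedure and regenerate_procedure are hypothetical placeholders for the subagent invocations described above, not real APIs; extract_score is a compact version of the helper sketched under Check Score:
import re
import shutil
from pathlib import Path

MAX_ITERATIONS = 3
TARGET_SCORE = 8

# Hypothetical placeholders for the subagent invocations described above:
def evaluate_procedure(workdir: Path, i: int) -> None: ...   # runs sub-steps 1-3, writes review_v{i}.md
def regenerate_procedure(workdir: Path, i: int) -> None: ... # overwrites procedure.json

def extract_score(p: Path) -> int:
    m = re.search(r"Total Score:\s*(\d+)\s*/\s*10", p.read_text(encoding="utf-8"))
    return int(m.group(1)) if m else 0

workdir = Path("workdir/public_test_public_test_1")
for i in range(1, MAX_ITERATIONS + 1):
    evaluate_procedure(workdir, i)
    score = extract_score(workdir / f"review_v{i}.md")
    shutil.copy(workdir / f"review_v{i}.md", workdir / "review.md")  # keep latest as final review
    if score >= TARGET_SCORE:
        print(f"✅ Target achieved: {score}/10 after {i} iteration(s)")
        break
    if i == MAX_ITERATIONS:
        print(f"⚠️ Max iterations reached; final score {score}/10, manual review recommended")
        break
    # Archive the current version before regenerating (sub-steps 5-6)
    shutil.copy(workdir / "procedure.json", workdir / f"procedure_v{i}.json")
    regenerate_procedure(workdir, i)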
Outputs
Per Iteration:
- workdir/{filename}_{entry_id}/procedure_v{N}.json (archived versions)
- workdir/{filename}_{entry_id}/review_v{N}.md (archived reviews)
Final:
- workdir/{filename}_{entry_id}/procedure.json (best version)
- workdir/{filename}_{entry_id}/review.md (final review, symlink or copy of best)
- workdir/{filename}_{entry_id}/procedure_wrapped.json (temporary)
Review Format:
- Scoring Results (10-point breakdown)
- Formal Validation Results
- Semantic Quality Assessment
- Identified Strengths
- Issues and Recommendations
- Overall Assessment
- Conclusion
- Iteration History (if multiple attempts)
Step 5: Report Results
Report completion with comprehensive summary including iteration history:
Successfully processed: {entry_id}
Directory: workdir/{filename}_{entry_id}/
├── input.json ✅
├── references/
│ ├── ref_1.md ✅ (if applicable)
│ └── ref_2.md ✅ (if applicable)
├── procedure.json ✅ (final version)
├── procedure_v1.json ✅ (if improved)
├── procedure_v2.json ✅ (if improved)
├── review.md ✅ (final review)
├── review_v1.md ✅ (if improved)
└── review_v2.md ✅ (if improved)
Iteration History:
┌─────────────┬────────┬──────────┬──────────────┬────────────┐
│ Iteration │ Score │ Common │ Individual │ Status │
├─────────────┼────────┼──────────┼──────────────┼────────────┤
│ v1 (Initial)│ 6/10 │ 4/5 │ 2/5 │ 🔄 Improve │
│ v2 │ 7/10 │ 5/5 │ 2/5 │ 🔄 Improve │
│ v3 (Final) │ 8/10 │ 5/5 │ 3/5 │ ✅ Accept │
└─────────────┴────────┴──────────┴──────────────┴────────────┘
Final Procedure Quality:
- Total Score: 8/10 points (⬆️ +2 from initial)
- Common Criteria: 5/5 points (✅ Perfect)
- Individual Criteria: 3/5 points (Improved from 2/5)
- Rating: ⭐⭐⭐⭐ (Good)
- Recommendation: ✅ ACCEPT
- Iterations Required: 3 attempts
Improvements Made:
1. v1→v2: Fixed calculation errors, improved parameter reflection (+1pt)
2. v2→v3: Enhanced individual criteria compliance (+1pt)
Next Steps: Procedure is ready for use. Review review.md for detailed feedback.
Success Scenarios:
Immediate Success (Score ≥ 8 on first attempt)
Iteration History: ✅ Single attempt (v1: 9/10)
Status: Excellent quality achieved immediately
Improvement Success (Score ≥ 8 after iterations)
Iteration History: 🔄 3 attempts (v1: 6 → v2: 7 → v3: 8)
Status: Target quality achieved through iterative improvement
Partial Success (Score < 8 after max iterations)
Iteration History: ⚠️ 3 attempts (v1: 5 → v2: 6 → v3: 7)
Status: Max iterations reached. Best score: 7/10 (needs manual review)
Warning: Target quality (8/10) not achieved. Manual refinement recommended.
Batch Processing
To process multiple entries from the same JSONL file:
Example:
# Process public_test_1, public_test_2, public_test_3
for entry_id in public_test_1 public_test_2 public_test_3; do
echo "Processing ${entry_id}..."
# Run la-bench-workflow skill for each entry
done
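To build the batch list automatically rather than hard-coding IDs, the entry IDs can be read from the JSONL file itself. A sketch, assuming each JSONL line carries its ID in an entry_id field (the actual key is defined by the LA-Bench data format; check one line of the file first):
import json
from pathlib import Path

jsonl_path = Path("data/public_test.jsonl")
entry_ids = [
    json.loads(line)["entry_id"]  # assumed key name; verify against the data
    for line in jsonl_path.read_text(encoding="utf-8").splitlines()
    if line.strip()
]
for entry_id in entry_ids:
    print(f"Processing {entry_id}...")
    # Run the la-bench-workflow skill for this entry (Steps 1-5)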
Directory Structure:
workdir/
├── public_test_public_test_1/
│ ├── input.json
│ ├── references/
│ ├── procedure.json
│ └── review.md
├── public_test_public_test_2/
│ └── ...
└── public_test_public_test_3/
└── ...
Error Handling
Common Issues
Issue: JSONL file not found
- Fix: Verify file path and ensure it exists
Issue: Entry ID not found in JSONL
- Fix: Check entry ID spelling and verify it exists in the file
Issue: Reference fetch fails
- Fix: Proceed without references or use cached references if available
Issue: Procedure generation fails formal validation
- Fix: Review errors and regenerate with adjusted parameters
Issue: Low quality score (<5/10)
- Fix: Review semantic evaluation feedback and regenerate with improvements
Quality Standards
LA-Bench 10-Point Scoring System
Common Criteria (5 points):
- Parameter Reflection (+1pt): All instruction parameters correctly reflected
- Object Completeness (+1pt): All mandatory objects used correctly
- Logical Structure Reflection (+1pt): Source protocol structure maintained
- Expected Outcome Achievement (+1pt): Final states achievable
- Appropriate Supplementation (+1pt): Missing details appropriately filled
Deductions:
- Unnatural Japanese/Hallucination (-1pt)
- Calculation Errors (-1pt)
- Procedural Contradictions (-1pt)
Special Restriction:
- Excessive safety → Common criteria capped at 2pts
Individual Criteria (5 points):
- Based on measurement.specific_criteria (if provided)
- General quality assessment (if not provided)
Quality Thresholds:
- 9-10 pts: Excellent (⭐⭐⭐⭐⭐) - Ready for use
- 7-8 pts: Good (⭐⭐⭐⭐) - Minor improvements recommended
- 5-6 pts: Needs Improvement (⭐⭐⭐) - Significant revisions needed
- 3-4 pts: Insufficient (⭐⭐) - Major issues
- 0-2 pts: Unacceptable (⭐) - Complete regeneration required
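As a worked example of how these rules combine, a sketch of the arithmetic, assuming the deductions and the excessive-safety cap apply to the common-criteria subtotal as listed above (the actual judgment of each criterion is made by the semantic checker, not by code):
def total_score(common_criteria_met: int, deductions: int,
                individual_points: int, excessive_safety: bool) -> int:
    """Combine common criteria, deductions, and individual criteria into a /10 score."""
    common = common_criteria_met          # 0-5, one point per criterion
    if excessive_safety:
        common = min(common, 2)           # special restriction: cap at 2 pts
    common = max(common - deductions, 0)  # -1 pt per deduction, floored at 0
    return common + individual_points     # individual criteria: 0-5

# Example: 5/5 common, one calculation error, 3/5 individual, no safety cap
print(total_score(5, 1, 3, False))  # -> 7/10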
Best Practices
General Workflow
- Always verify input.json: Ensure all required fields exist before proceeding
- Check reference availability: Some entries may not have fetchable references
- Review formal validation first: Fix constraint violations before semantic evaluation
- Preserve intermediate files: Keep all outputs for debugging and analysis
Iterative Improvement
- Trust the iteration process: The workflow automatically iterates up to 3 times to achieve target quality (≥8/10)
- Analyze iteration history: Compare review_v1.md, review_v2.md, review.md to understand improvement trajectory
- Learn from failures: If max iterations reached without target, review all versions to identify persistent issues
- Use version control: Archived versions (procedure_v1.json, procedure_v2.json) allow rollback if needed
Scoring Strategy
- Prioritize common criteria: These 5 points are achievable through careful attention to input data
- Address deductions early: Fix calculation errors, hallucinations, and contradictions in early iterations
- Understand individual criteria: Read measurement.specific_criteria carefully if provided
- Avoid excessive safety: Don't add unnecessary steps or extreme safety margins (triggers 2pt cap)
Debugging Low Scores
If stuck at low scores (<6/10) after iterations:
- Check if instruction parameters are fully reflected
- Verify all mandatory_objects are used
- Confirm source_protocol_steps logic is preserved
- Validate all calculations manually
- Review for hallucinations or invented information
If stuck at medium scores (6-7/10):
- Focus on individual criteria improvements
- Enhance cost efficiency (reduce reagent waste)
- Improve work efficiency (optimize step order)
- Add precision measures (quality controls)
Batch Processing Best Practices
- Process in parallel if possible: Independent entries can be processed simultaneously
- Monitor iteration counts: High iteration rates may indicate dataset complexity
- Compare across entries: Identify common failure patterns for systematic improvements
Integration with Other Skills
This skill orchestrates:
- la-bench-parser (Step 1): JSONL parsing
- web-reference-fetcher (Step 2): Reference retrieval
- procedure-generator (Step 3): Procedure generation via experiment-procedure-generator subagent
- procedure-checker (Step 4): Validation via validate_procedure.py + procedure-semantic-checker subagent
See individual skill documentation for detailed specifications.
Directory Naming Convention
All outputs use the standardized pattern: workdir/{filename}_{entry_id}/
Examples:
| JSONL File | Entry ID | Directory |
|---|---|---|
| data/public_test.jsonl | public_test_1 | workdir/public_test_public_test_1/ |
| data/private_test_input.jsonl | private_test_5 | workdir/private_test_input_private_test_5/ |
| data/validation.jsonl | exp_042 | workdir/validation_exp_042/ |
Filename extraction:
from pathlib import Path
filename = Path("data/public_test.jsonl").stem # "public_test"
entry_id = "public_test_1"
dir_name = f"{filename}_{entry_id}" # "public_test_public_test_1"
Example Usage
Example 1: Immediate Success (Score ≥ 8 on first attempt)
User: "Process public_test_1 from data/public_test.jsonl"
Assistant workflow:
1. Parse: Extract entry to workdir/public_test_public_test_1/input.json ✅
2. Fetch: Download 2 references to workdir/public_test_public_test_1/references/ ✅
3. Generate: Create procedure.json (40 steps) ✅
4. Validate (Iteration 1):
- Formal: ✅ PASSED (40/50 steps, all ≤10 sentences)
- Semantic: 9/10 points (Common: 5/5, Individual: 4/5)
- Status: ✅ TARGET ACHIEVED (≥8/10)
5. Report: Procedure rated "Excellent" (⭐⭐⭐⭐⭐)
Result: Single iteration achieved excellent quality
Next: Review workdir/public_test_public_test_1/review.md for detailed feedback
Example 2: Iterative Improvement (Score improves to ≥ 8)
User: "Process public_test_5 from data/public_test.jsonl"
Assistant workflow:
1. Parse: Extract entry ✅
2. Fetch: Download references ✅
3. Generate: Create initial procedure ✅
4. Validate and Improve:
Iteration 1:
- Formal: ✅ PASSED
- Semantic: 6/10 points (Common: 4/5, Individual: 2/5)
- Issues: Missing parameter specifications, incomplete object usage
- Status: 🔄 IMPROVE (score < 8)
Iteration 2:
- Regenerate with fixes for missing parameters
- Formal: ✅ PASSED
- Semantic: 7/10 points (Common: 5/5, Individual: 2/5)
- Issues: Individual criteria still weak (cost efficiency)
- Status: 🔄 IMPROVE (score < 8)
Iteration 3:
- Regenerate with cost efficiency improvements
- Formal: ✅ PASSED
- Semantic: 8/10 points (Common: 5/5, Individual: 3/5)
- Status: ✅ TARGET ACHIEVED
5. Report: Final score 8/10 after 3 iterations (⬆️ +2 from initial)
Result: Iterative improvement achieved target quality
Files: procedure_v1.json, procedure_v2.json, procedure.json (final)
Reviews: review_v1.md, review_v2.md, review.md (final)
Example 3: Partial Success (Max iterations without reaching target)
User: "Process complex_experiment_1 from data/validation.jsonl"
Assistant workflow:
1-3. [Parse, Fetch, Generate] ✅
4. Validate and Improve:
Iteration 1: 5/10 (Critical issues with quantitative accuracy)
Iteration 2: 6/10 (Improved accuracy, but logical structure issues)
Iteration 3: 7/10 (Further improvements, but individual criteria still weak)
Status: ⚠️ MAX ITERATIONS REACHED
5. Report:
- Best Score: 7/10 (Good, but below target)
- Recommendation: ⚠️ REVISE with manual review
- Warning: Automatic improvement could not achieve 8/10
- Action: Manual refinement recommended
Result: Best effort achieved 7/10
Recommendation: Review all iteration histories to identify persistent issues
Important Context
This workflow is conducted for academic research purposes to evaluate and improve experimental planning quality and safety. All generated procedures should prioritize:
- Scientific accuracy
- Reproducibility
- Safety considerations
- Cost and time efficiency
- Adherence to research ethics guidelines