name	Long Task Execution Framework
description	Protocol for reliable multi-hour task execution with automatic checkpointing, validation, and recovery
version	1.0.0

Long Task Execution Framework

A robust protocol for executing multi-hour tasks reliably.

Core Principles

Atomicity: Every task must be decomposed into atomic sub-tasks that each take < 10 minutes
Validation: Validate after every change, before every checkpoint
Checkpointing: Create checkpoint after every completed atomic task
Recovery: Automatic retry → rollback → escalate on failures
Determinism: Use scripts for critical operations (not LLM decisions)

Atomic Task Requirements

Each atomic task MUST:

Take < 10 minutes to execute
Have clear, verifiable success criteria
Be independently rollback-able
Have explicit dependencies listed
Have a rollback plan defined
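
The requirements above can be checked mechanically before execution starts. The following is a minimal sketch, assuming a hypothetical task shape; the field names (estimatedMinutes, successCriteria, dependencies, rollbackPlan) are illustrative and are not taken from the framework's scripts.

```javascript
// Hypothetical atomic-task shape; field names are assumptions for illustration.
const MAX_MINUTES = 10;

// Returns a list of requirement violations (empty when the task qualifies).
function validateAtomicTask(task) {
  const problems = [];
  if (!(task.estimatedMinutes < MAX_MINUTES)) {
    problems.push(`estimate must be under ${MAX_MINUTES} minutes`);
  }
  if (!Array.isArray(task.successCriteria) || task.successCriteria.length === 0) {
    problems.push("at least one verifiable success criterion is required");
  }
  if (!Array.isArray(task.dependencies)) {
    problems.push("dependencies must be listed explicitly (use [] for none)");
  }
  if (typeof task.rollbackPlan !== "string" || task.rollbackPlan.trim() === "") {
    problems.push("a rollback plan is required");
  }
  return problems;
}

const exampleTask = {
  id: "task1.3",
  estimatedMinutes: 8,
  successCriteria: ["npx tsc --noEmit exits with code 0"],
  dependencies: ["task1.2"],
  rollbackPlan: "Restore the checkpoint created after task1.2",
};

console.log(validateAtomicTask(exampleTask));
```

A task that fails this check would be split further or given a rollback plan before it enters the execution loop.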

Validation Protocol

After Every Code Change:

npx tsc --noEmit --skipLibCheck    # TypeScript (2-5s)
npx eslint src --max-warnings 0    # Linting (1-3s)

After Atomic Task Completion:

bash .claude/framework/validation-gate.sh atomic

Before Final Completion:

bash .claude/framework/validation-gate.sh final

NEVER continue if validation fails. Stop and recover.
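
The stop-on-failure rule can be sketched as a small gate runner. This is an illustration only: runGate and the injected run function are hypothetical names, and the real validation-gate.sh may behave differently. The runner is injected so the sketch can be exercised without a real project; in practice it could wrap child_process.spawnSync.

```javascript
// Runs commands in order and stops at the first non-zero exit code.
// `run` is a hypothetical injected function: (command) => exit code.
function runGate(commands, run) {
  for (const command of commands) {
    const exitCode = run(command);
    if (exitCode !== 0) {
      return { passed: false, failedCommand: command, exitCode };
    }
  }
  return { passed: true, failedCommand: null, exitCode: 0 };
}

const afterCodeChange = [
  "npx tsc --noEmit --skipLibCheck",
  "npx eslint src --max-warnings 0",
];

// Stub runner that pretends ESLint reported a problem.
const fakeRun = (command) => (command.includes("eslint") ? 1 : 0);

const result = runGate(afterCodeChange, fakeRun);
if (!result.passed) {
  console.log(`STOP and recover: "${result.failedCommand}" exited with ${result.exitCode}`);
}
```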

Checkpoint Strategy

Create checkpoint:

✅ AFTER each atomic task completes successfully
✅ BEFORE risky operations (migrations, major refactors)
✅ Every 15 minutes MAX (even if task incomplete)
❌ NEVER on validation failures
❌ NEVER during rollback
📁 Snapshot stored at .claude/checkpoints/{id}/snapshot/ (excludes .claude/checkpoints, .git, node_modules)
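
The rules above can be read as a decision function in which the two NEVER rules take priority. A minimal sketch, assuming a hypothetical state object whose field names are invented for illustration:

```javascript
const MAX_INTERVAL_MS = 15 * 60 * 1000; // "every 15 minutes MAX"

// `state` is a hypothetical snapshot of the current run.
function shouldCheckpoint(state) {
  // The NEVER rules win over everything else.
  if (state.validationFailed || state.rollingBack) return false;
  if (state.atomicTaskJustCompleted) return true;
  if (state.riskyOperationNext) return true;
  // Time-based checkpoint, even if the current task is incomplete.
  return state.msSinceLastCheckpoint >= MAX_INTERVAL_MS;
}

console.log(
  shouldCheckpoint({
    validationFailed: false,
    rollingBack: false,
    atomicTaskJustCompleted: false,
    riskyOperationNext: true,
    msSinceLastCheckpoint: 60 * 1000,
  })
);
```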

Checkpoint naming:

Pattern: ckpt_{taskId}_{atomicTaskId}_{timestamp}
Example: ckpt_sprint-1_task1.3_1729612345
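
The naming pattern can be built and parsed as follows. This sketch assumes the timestamp is Unix time in seconds (the 10-digit value in the example is consistent with that) and that task IDs contain no underscores; both are assumptions, and the helper names are hypothetical.

```javascript
// Builds an ID following ckpt_{taskId}_{atomicTaskId}_{timestamp}.
function checkpointId(taskId, atomicTaskId, date = new Date()) {
  const timestamp = Math.floor(date.getTime() / 1000); // assumed: Unix seconds
  return `ckpt_${taskId}_${atomicTaskId}_${timestamp}`;
}

// Parses an ID back into its parts; returns null if it does not match.
// Assumes neither ID segment contains an underscore.
function parseCheckpointId(id) {
  const match = /^ckpt_([^_]+)_([^_]+)_(\d+)$/.exec(id);
  if (!match) return null;
  return { taskId: match[1], atomicTaskId: match[2], timestamp: Number(match[3]) };
}

const id = checkpointId("sprint-1", "task1.3", new Date(1729612345 * 1000));
console.log(id); // ckpt_sprint-1_task1.3_1729612345
console.log(parseCheckpointId(id));
```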

Checkpoint metadata:

Include context:

Decisions made ("Chose approach X over Y because...")
Blockers encountered
Files changed
Next task
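
A context object covering these four items might look like the sketch below. Only the decisions key appears in the documented --context example; blockers, filesChanged, and nextTask are assumed names, and the file path is a placeholder. Building an argument array (rather than a shell string) avoids quoting problems with the JSON payload.

```javascript
// Hypothetical context payload; only "decisions" is shown in the script reference.
const context = {
  decisions: "Chose approach X over Y because...",
  blockers: [],
  filesChanged: ["src/example.ts"], // placeholder path
  nextTask: "task1.4",
};

// Builds the argument list for `node checkpoint-manager.js create ...`.
function buildCreateArgs(taskId, ctx) {
  return [
    "checkpoint-manager.js",
    "create",
    `--task-id=${taskId}`,
    `--context=${JSON.stringify(ctx)}`,
  ];
}

const args = buildCreateArgs("sprint-1", context);
console.log(args.join(" "));
```

The array could then be passed to a process launcher such as child_process.spawnSync("node", args) from the .claude/framework/ directory.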

Recovery Protocol

On failure, execute in order:

Level 1: Retry (3 attempts max)

node .claude/framework/recovery.js --task-id=X --error='{"type":"...","message":"..."}'

Retry the same approach; the error may be transient
Agent retries the task automatically

Level 2: Rollback

Restore to last valid checkpoint
Try alternative approach
Agent retries with different strategy

Level 3: Escalate

Mark task as BLOCKED
Notify human with error details and options
Wait for manual intervention
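
The three levels form an escalation ladder that can be sketched as a pure decision function. This is not the logic of recovery.js, which is not shown here; the failure fields are hypothetical, and allowing a single rollback attempt before escalating is an assumption.

```javascript
const MAX_RETRIES = 3; // Level 1: "3 attempts max"

// `failure` is a hypothetical record: { retryCount, rolledBack }.
function nextRecoveryAction(failure) {
  if (failure.retryCount < MAX_RETRIES) {
    return { level: 1, action: "retry" };
  }
  if (!failure.rolledBack) {
    // Assumption: one rollback attempt before escalating.
    return { level: 2, action: "rollback" };
  }
  return { level: 3, action: "escalate", status: "BLOCKED" };
}

console.log(nextRecoveryAction({ retryCount: 1, rolledBack: false }));
console.log(nextRecoveryAction({ retryCount: 3, rolledBack: false }));
console.log(nextRecoveryAction({ retryCount: 3, rolledBack: true }));
```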

State Machine

Valid state transitions:

INITIALIZED → PLANNING
PLANNING → PLANNED | FAILED
PLANNED → EXECUTING
EXECUTING → VALIDATING | FAILED
VALIDATING → CHECKPOINTING | FAILED
CHECKPOINTING → EXECUTING | COMPLETED
FAILED → ROLLING_BACK | ABANDONED
ROLLING_BACK → EXECUTING | ABANDONED

Invalid transitions are errors.
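
The transition table can be encoded directly and enforced with a guard. A minimal sketch; the transition function name is hypothetical, and COMPLETED and ABANDONED are treated as terminal because the list above gives them no outgoing transitions.

```javascript
// Transition table copied from the list above.
const TRANSITIONS = new Map([
  ["INITIALIZED", ["PLANNING"]],
  ["PLANNING", ["PLANNED", "FAILED"]],
  ["PLANNED", ["EXECUTING"]],
  ["EXECUTING", ["VALIDATING", "FAILED"]],
  ["VALIDATING", ["CHECKPOINTING", "FAILED"]],
  ["CHECKPOINTING", ["EXECUTING", "COMPLETED"]],
  ["FAILED", ["ROLLING_BACK", "ABANDONED"]],
  ["ROLLING_BACK", ["EXECUTING", "ABANDONED"]],
]);

// Returns the new state, or throws if the transition is not listed.
function transition(from, to) {
  const allowed = TRANSITIONS.get(from) || [];
  if (!allowed.includes(to)) {
    throw new Error(`Invalid transition: ${from} → ${to}`);
  }
  return to;
}

let state = "INITIALIZED";
for (const next of ["PLANNING", "PLANNED", "EXECUTING", "VALIDATING", "CHECKPOINTING", "COMPLETED"]) {
  state = transition(state, next);
}
console.log(state); // COMPLETED
```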

Scripts Reference

All scripts are in .claude/framework/:

# Logger (structured JSON logs)
node logger.js --event=EVENT_NAME --data='{"key":"value"}'

# Checkpoint Manager
node checkpoint-manager.js create --task-id=X --context='{"decisions":"..."}'
node checkpoint-manager.js restore --checkpoint-id=ckpt_xyz
node checkpoint-manager.js list

# Validation Gates
bash validation-gate.sh atomic       # Fast: tsc + eslint
bash validation-gate.sh checkpoint   # Medium: + tests
bash validation-gate.sh final        # Full: + build (optional)

# Recovery
node recovery.js --task-id=X --error='{"type":"...","message":"..."}'

# Conditional Hook (called by hooks automatically)
node conditional-hook.js validate-ts
node conditional-hook.js checkpoint
node conditional-hook.js final-validation

Execution Loop

FOR EACH atomic task WHERE status !== 'completed':
  1. Mark task as 'in_progress' (TodoWrite)
  2. Execute task (write code, modify files)
     → Write hook triggers: validate-ts
  3. Validate explicitly: validation-gate.sh atomic
  4. If validation fails: call recovery.js and do NOT continue to step 5
  5. If validation passes: mark task as 'completed' (TodoWrite)
     → TodoWrite hook triggers: checkpoint (automatic)
  6. Continue to next task

WHEN all tasks done:
  7. Final validation: validation-gate.sh final
  8. Create final checkpoint
  9. Generate report
  10. Mark task as COMPLETED
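
The loop above can be sketched in code as follows. The hooks object is a hypothetical stand-in for the real scripts and hooks (validation-gate.sh, recovery.js, checkpoint-manager.js), and the 'pending' status is an assumed name. As a simplification, this sketch stops at the first failed validation and hands control to recovery rather than resuming the loop.

```javascript
// Sketch of the execution loop with injected, hypothetical hooks.
function runLongTask(tasks, hooks) {
  for (const task of tasks) {
    if (task.status === "completed") continue; // skip already-finished work
    task.status = "in_progress";
    hooks.execute(task);
    if (!hooks.validate("atomic")) {
      hooks.recover(task); // a failed task is never marked completed
      return { completed: false, stoppedAt: task.id };
    }
    task.status = "completed";
    hooks.checkpoint(task.id);
  }
  if (!hooks.validate("final")) {
    return { completed: false, stoppedAt: "final-validation" };
  }
  hooks.checkpoint("final");
  hooks.report();
  return { completed: true, stoppedAt: null };
}

// Stub hooks that record calls instead of touching the file system.
const calls = [];
const stubHooks = {
  execute: (task) => calls.push(`execute:${task.id}`),
  validate: (gate) => { calls.push(`validate:${gate}`); return true; },
  recover: (task) => calls.push(`recover:${task.id}`),
  checkpoint: (label) => calls.push(`checkpoint:${label}`),
  report: () => calls.push("report"),
};

const tasks = [
  { id: "task1.1", status: "completed" },
  { id: "task1.2", status: "pending" },
];
const outcome = runLongTask(tasks, stubHooks);
console.log(outcome.completed, calls.join(" "));
```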

Success Criteria

A long task is successfully completed when:

✅ All atomic tasks status = 'completed'
✅ Final validation passes (TypeScript + ESLint + Tests)
✅ Final checkpoint created
✅ Execution report generated
✅ Zero blockers or unresolved errors
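
These criteria can be evaluated together before a run is declared finished. A minimal sketch, assuming a hypothetical run summary object; none of the field names come from the framework itself.

```javascript
// Returns the criteria that are not yet met (empty array means success).
// `run` is a hypothetical summary of the whole long task.
function unmetCriteria(run) {
  const unmet = [];
  if (!run.tasks.every((t) => t.status === "completed")) {
    unmet.push("all atomic tasks completed");
  }
  if (!run.finalValidationPassed) unmet.push("final validation passed");
  if (!run.finalCheckpointId) unmet.push("final checkpoint created");
  if (!run.reportGenerated) unmet.push("execution report generated");
  if (run.blockers.length > 0 || run.unresolvedErrors.length > 0) {
    unmet.push("zero blockers or unresolved errors");
  }
  return unmet;
}

const runSummary = {
  tasks: [{ id: "task1.1", status: "completed" }],
  finalValidationPassed: true,
  finalCheckpointId: null, // final checkpoint not created yet
  reportGenerated: true,
  blockers: [],
  unresolvedErrors: [],
};

console.log(unmetCriteria(runSummary)); // [ 'final checkpoint created' ]
```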

Iron Laws

NEVER skip task decomposition - Long task WITHOUT atomic tasks = BLOCKED
NEVER continue on validation failure - Failed validation = STOP + RECOVER
NEVER exceed maxTime (10min) - If task takes longer = something is wrong, STOP
ALWAYS checkpoint after completed task - No exceptions
ALWAYS use state machine - Invalid transitions = ERROR
ALWAYS log structured - Every event, every decision (JSON format)
NEVER lose context - Checkpoints carry FULL context
ALWAYS self-heal - Try recovery before escalating to human
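
As an illustration of the structured-logging law, each event can be emitted as one JSON object per line. The output format of logger.js is not documented here, so the field names (timestamp, event, data) and the event name below are assumptions.

```javascript
// Formats one log entry as a single JSON line.
// Field names are assumed; logger.js may use a different schema.
function formatLogEntry(event, data, now = new Date()) {
  return JSON.stringify({ timestamp: now.toISOString(), event, data });
}

const line = formatLogEntry(
  "TASK_COMPLETED", // hypothetical event name
  { taskId: "task1.3", decision: "Chose approach X over Y because..." },
  new Date(0)
);
console.log(line);
```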

Performance Targets

Validation per task: < 5 seconds (npx tsc --noEmit)
Checkpoint creation: < 1 second
Recovery decision: < 30 seconds
Total overhead: < 5% of execution time
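
A quick arithmetic check shows how the per-task targets relate to the 5% overhead budget. The run parameters below (30 atomic tasks over 3 hours) are made up for illustration, and recovery time is ignored.

```javascript
// Overhead = time spent validating and checkpointing, as a share of the run.
function overheadPercent({ taskCount, validationSeconds, checkpointSeconds, totalSeconds }) {
  const overheadSeconds = taskCount * (validationSeconds + checkpointSeconds);
  return (overheadSeconds / totalSeconds) * 100;
}

const pct = overheadPercent({
  taskCount: 30,         // hypothetical number of atomic tasks
  validationSeconds: 5,  // upper bound from the validation target
  checkpointSeconds: 1,  // upper bound from the checkpoint target
  totalSeconds: 3 * 60 * 60,
});
console.log(pct.toFixed(2)); // 1.67
```

At these upper bounds, 30 tasks cost 180 seconds of overhead in a 10,800-second run, which stays well inside the 5% target.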

Use npx tsc --noEmit instead of npm run build for validation: it type-checks without emitting output files (about 90% faster).
