---
name: checkpoint
description: Robust workflow checkpoint and resume. Handles session interruption, state recovery, and safe resume across all workflow phases.
allowed-tools: Read, Write, Glob
---
# Checkpoint & Resume Skill
Pattern for saving workflow state and resuming after interruption.
## When to Load This Skill
- Starting a workflow that might be interrupted
- Resuming after `claude -r`
- Recovering from crashes or timeouts
## Core Concept
The dotagent workflow uses file-based state that survives session interruption:
```
Session crash/exit
        ↓
State files persist on disk:
  - memory/state/phase.json      # Which phase we're in
  - memory/state/execution.json  # Task-level progress
  - memory/reports/*.json        # Completed phase outputs
        ↓
claude -r (resume session)
        ↓
Orchestrator reads state, continues from last checkpoint
```
## Checkpoint Files

### Phase Checkpoint: `memory/state/phase.json`

```json
{
  "workflow_id": "string",
  "started_at": "ISO-8601",
  "last_updated": "ISO-8601",
  "current_phase": "REQUIREMENTS|ARCHITECTURE|IMPLEMENTATION|VERIFICATION|REFLECTION",
  "phase_status": "pending|in_progress|complete|failed",
  "completed_phases": [
    {
      "phase": "REQUIREMENTS",
      "completed_at": "ISO-8601",
      "output": "memory/reports/demand.json"
    }
  ],
  "user_checkpoints": [
    {
      "phase": "REQUIREMENTS",
      "approved_at": "ISO-8601"
    }
  ],
  "interruption_safe": true
}
```
### Execution Checkpoint: `memory/state/execution.json`

See the executor agent for the detailed schema, which includes:
- Task status tracking
- Timestamps (started_at, completed_at)
- Output file paths for verification
## Resume Protocol

### Step 1: Detect Resume Scenario
```
ON WORKFLOW START:
  checkpoint = Read("memory/state/phase.json")

  IF checkpoint exists AND checkpoint.phase_status == "in_progress":
    → This is a RESUME
    → Log: "Detected interrupted workflow: {workflow_id}"
    → Go to Step 2
  ELSE:
    → Fresh start, create new checkpoint
```
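A minimal sketch of this check in Python, assuming the `phase.json` location and field names shown above (the helper name is illustrative):

```python
import json
from pathlib import Path

PHASE_FILE = Path("memory/state/phase.json")

def detect_resume() -> dict | None:
    """Return the interrupted checkpoint if one exists, else None (fresh start)."""
    if not PHASE_FILE.exists():
        return None  # No prior state: fresh start
    try:
        checkpoint = json.loads(PHASE_FILE.read_text())
    except json.JSONDecodeError:
        return None  # Unreadable state file: treat as fresh start (Step 2 covers deeper validation)
    if checkpoint.get("phase_status") == "in_progress":
        print(f"Detected interrupted workflow: {checkpoint.get('workflow_id')}")
        return checkpoint
    return None
```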
### Step 2: Validate State Integrity
```
VALIDATE:
  1. Check all referenced output files exist
  2. Check timestamps are reasonable (not future, not ancient)
  3. Check phase progression is valid
  4. Check for incomplete writes (interruption_safe flag)

IF validation fails:
  → Ask user: "State appears corrupted. Start fresh? [y/N]"
  → Archive corrupted state to memory/state/.archive/
```
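One way these checks might look in Python. The staleness threshold and the helper name are illustrative assumptions, and timestamps are assumed to be timezone-aware ISO-8601 strings:

```python
from datetime import datetime, timedelta, timezone
from pathlib import Path

MAX_AGE = timedelta(days=7)  # Assumed staleness threshold (echoes Scenario 4 below)

def validate_checkpoint(checkpoint: dict) -> list[str]:
    """Return a list of problems; an empty list means the state looks sane."""
    problems = []

    # 1. Every referenced output file must exist
    for done in checkpoint.get("completed_phases", []):
        if not Path(done["output"]).exists():
            problems.append(f"missing output file: {done['output']}")

    # 2. Timestamps must be neither in the future nor ancient
    now = datetime.now(timezone.utc)
    started = datetime.fromisoformat(checkpoint["started_at"])
    if started > now:
        problems.append("started_at is in the future")
    elif now - started > MAX_AGE:
        problems.append("state is older than the staleness threshold")

    # 3. (Phase-progression checks omitted for brevity.)

    # 4. Incomplete-write detection via the interruption_safe flag
    if not checkpoint.get("interruption_safe", False):
        problems.append("interruption_safe is false: last write may be partial")

    return problems
```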
### Step 3: Determine Resume Point
```
RESUME LOGIC by phase:

REQUIREMENTS (in_progress):
  - Check if demand.json exists and is valid
  - If valid: advance to ARCHITECTURE
  - If not: re-spawn PM agent

ARCHITECTURE (in_progress):
  - Check for design files in memory/reports/designs/
  - Check for final_design.json
  - If final exists: advance to IMPLEMENTATION
  - If designs exist but no final: spawn Roundtable
  - If no designs: re-spawn Architects

IMPLEMENTATION (in_progress):
  - Read execution.json
  - Run executor recovery checks
  - Continue execution loop

VERIFICATION (in_progress):
  - Check for verification.json
  - If exists: advance to REFLECTION
  - If not: re-spawn QA

REFLECTION (in_progress):
  - Check for reflection file
  - If exists: workflow complete
  - If not: re-spawn Reflector
```
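A sketch of the same mapping in Python. The exact paths of `final_design.json` and `verification.json` under `memory/reports/` are assumptions based on the file names above, not confirmed locations:

```python
from pathlib import Path

def determine_resume_point(checkpoint: dict) -> tuple[str, str]:
    """Map an interrupted phase to (phase_to_run, action) per the table above."""
    phase = checkpoint["current_phase"]

    if phase == "REQUIREMENTS":
        if Path("memory/reports/demand.json").exists():
            return "ARCHITECTURE", "continue"
        return "REQUIREMENTS", "re-spawn PM agent"

    if phase == "ARCHITECTURE":
        if Path("memory/reports/final_design.json").exists():
            return "IMPLEMENTATION", "continue"
        if any(Path("memory/reports/designs").glob("*.json")):
            return "ARCHITECTURE", "spawn Roundtable"
        return "ARCHITECTURE", "re-spawn Architects"

    if phase == "VERIFICATION":
        if Path("memory/reports/verification.json").exists():
            return "REFLECTION", "continue"
        return "VERIFICATION", "re-spawn QA"

    # IMPLEMENTATION and REFLECTION follow the same shape:
    # read execution.json / the reflection file and either continue or re-spawn.
    return phase, "continue"
```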
### Step 4: Inform User and Continue
```
LOG to user:
  "Resuming workflow {id} from {phase} phase"
  "Last activity: {timestamp}"
  "Completed: {list of completed phases}"

IF current_phase requires user approval (was at checkpoint):
  → Re-confirm with user before proceeding
```
## Safe Checkpoint Writing

Always update the checkpoint atomically:
```
# BAD: Can leave corrupted state
Write(checkpoint_file, new_state)

# GOOD: Atomic update
1. Set interruption_safe = false
2. Write to checkpoint_file.tmp
3. Rename checkpoint_file.tmp → checkpoint_file
4. Set interruption_safe = true
```
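A minimal Python sketch of the same pattern: write to a temp file in the same directory, then swap it into place with `os.replace` (atomic when source and destination are on the same filesystem). The flag handling is folded into the single write:

```python
import json
import os
import tempfile
from pathlib import Path

def write_checkpoint(checkpoint_file: Path, state: dict) -> None:
    """Atomically replace the checkpoint: readers see the old or new file, never a partial one."""
    state["interruption_safe"] = True  # Written as part of the single atomic update

    # Write to a temp file in the same directory so the final rename stays on one filesystem
    fd, tmp_path = tempfile.mkstemp(dir=checkpoint_file.parent, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as tmp:
            json.dump(state, tmp, indent=2)
            tmp.flush()
            os.fsync(tmp.fileno())             # Make sure the bytes hit disk before the rename
        os.replace(tmp_path, checkpoint_file)  # Atomic swap into place
    except BaseException:
        Path(tmp_path).unlink(missing_ok=True)  # Clean up the temp file on failure
        raise
```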
## Recovery from Specific Scenarios

### Scenario 1: Ctrl-C During Subagent

**State:** `task-001` has `status="running"`, no output file

**Recovery:**
- Detect orphaned task
- Increment `attempts`
- Reset to `"pending"`
- Re-spawn on next loop
### Scenario 2: Crash After Write, Before State Update

**State:** `task-001` has `status="running"`, output file EXISTS

**Recovery:**
- Detect output file
- Read status from output
- Update state to match
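A sketch of that reconciliation, assuming task records carry `status` and `output` fields and that the output file records its own final status (field names are illustrative, not the executor's actual schema):

```python
import json
from pathlib import Path

def reconcile_task(task: dict) -> dict:
    """If a task is still marked 'running' but its output file exists, trust the output file."""
    output = Path(task.get("output", ""))
    if task.get("status") == "running" and task.get("output") and output.exists():
        result = json.loads(output.read_text())
        # Illustrative: assume the output file records its own completion status
        task["status"] = result.get("status", "complete")
        task["completed_at"] = result.get("completed_at")
    return task
```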
### Scenario 3: Interrupted During User Approval

**State:** `phase=ARCHITECTURE`, designs exist but no `final_design.json`

**Recovery:**
- Detect we're at an approval checkpoint
- Re-present options to user
- Don't re-run architects
### Scenario 4: Ancient State File

**State:** `started_at` is 7 days ago

**Recovery:**
- Warn user about stale state
- Offer to archive and start fresh
- If continuing: proceed with caution
## Checkpoint Frequency
Update checkpoint after:
- Phase completion
- User approval
- Each task status change (in executor)
- Before spawning expensive agents (opus)
## Archiving Old State

When starting fresh or after completion, archive the old state:

```
memory/state/.archive/{workflow_id}_{timestamp}/
  - phase.json
  - execution.json
```

Keep the last 5 archives; delete older ones.
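A sketch of the archive-and-prune step; the timestamp format is an assumption, and the retention count is the 5 stated above:

```python
import shutil
from datetime import datetime, timezone
from pathlib import Path

STATE_DIR = Path("memory/state")
ARCHIVE_DIR = STATE_DIR / ".archive"
KEEP = 5  # Retention count from the policy above

def archive_state(workflow_id: str) -> None:
    """Move current state files into a timestamped archive folder, then prune old archives."""
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    dest = ARCHIVE_DIR / f"{workflow_id}_{stamp}"
    dest.mkdir(parents=True, exist_ok=True)

    for name in ("phase.json", "execution.json"):
        src = STATE_DIR / name
        if src.exists():
            shutil.move(str(src), str(dest / name))

    # Prune: keep only the KEEP most recent archive folders
    archives = sorted(ARCHIVE_DIR.iterdir(), key=lambda p: p.stat().st_mtime, reverse=True)
    for old in archives[KEEP:]:
        shutil.rmtree(old)
```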
## Integration with Workflow

### In the `/develop` Command
```markdown
## Resume Check

Before starting workflow:
1. Check for existing phase.json
2. If exists and in_progress:
   - Show resume prompt to user
   - "Resume workflow from {phase}? [Y/n]"
3. If user confirms: load checkpoint, continue
4. If user declines: archive old state, start fresh
```
### In Each Phase Agent
```markdown
## On Completion

Before returning:
1. Write output file
2. Update phase.json:
   - Add to completed_phases
   - Advance current_phase
   - Set phase_status = complete
3. Log checkpoint saved
```
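A sketch of step 2, following the `phase.json` fields above; the fixed phase ordering and the handoff of `phase_status` back to the orchestrator are assumptions:

```python
from datetime import datetime, timezone

PHASES = ["REQUIREMENTS", "ARCHITECTURE", "IMPLEMENTATION", "VERIFICATION", "REFLECTION"]

def complete_phase(checkpoint: dict, phase: str, output_path: str) -> dict:
    """Record a finished phase and advance the checkpoint, per the steps above."""
    now = datetime.now(timezone.utc).isoformat()

    # Add to completed_phases
    checkpoint.setdefault("completed_phases", []).append(
        {"phase": phase, "completed_at": now, "output": output_path}
    )

    # Advance current_phase (assumes the fixed ordering above)
    idx = PHASES.index(phase)
    if idx + 1 < len(PHASES):
        checkpoint["current_phase"] = PHASES[idx + 1]

    # Set phase_status = complete; the orchestrator marks the next phase
    # in_progress when it actually starts it (assumption)
    checkpoint["phase_status"] = "complete"
    checkpoint["last_updated"] = now
    return checkpoint  # Persist with the atomic write_checkpoint(...) helper
```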
## Principles
- **State on disk** - Never rely on conversation memory alone
- **Validate before resume** - Don't blindly trust old state
- **Inform the user** - Always tell them what's being resumed
- **Atomic writes** - Prevent half-written state
- **Archive, don't delete** - Keep old state for debugging