LangGraph Checkpointing
Persist workflow state for recovery and debugging.
When to Use
- Fault-tolerant workflows
- Resume after crashes
- Debug state at each step
- Avoid re-running expensive LLM calls
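The idea behind these use cases can be sketched without LangGraph at all. The block below is a minimal illustration using only `sqlite3` and `json`; `run_steps`, `expensive`, and `flaky` are hypothetical names, and the single-row table is an assumption that is far simpler than what a real checkpointer stores.

```python
import json
import sqlite3


def run_steps(conn, run_id, steps, state):
    """Run steps in order, saving state after each one and skipping finished steps."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS checkpoints "
        "(run_id TEXT PRIMARY KEY, next_step INTEGER, state TEXT)"
    )
    row = conn.execute(
        "SELECT next_step, state FROM checkpoints WHERE run_id = ?", (run_id,)
    ).fetchone()
    start = 0
    if row is not None:
        # A checkpoint exists: continue from the first step that has not completed.
        start, state = row[0], json.loads(row[1])
    for index in range(start, len(steps)):
        state = steps[index](state)
        conn.execute(
            "INSERT OR REPLACE INTO checkpoints VALUES (?, ?, ?)",
            (run_id, index + 1, json.dumps(state)),
        )
        conn.commit()
    return state


calls = []
attempts = {"n": 0}


def expensive(state):
    # Stands in for a costly LLM call that should run only once.
    calls.append("expensive")
    return {**state, "summary": "done"}


def flaky(state):
    # Fails on the first attempt to simulate a crash.
    attempts["n"] += 1
    if attempts["n"] == 1:
        raise RuntimeError("simulated crash")
    return {**state, "published": True}


conn = sqlite3.connect(":memory:")
try:
    run_steps(conn, "analysis-123", [expensive, flaky], {})
except RuntimeError:
    pass  # The first run crashes after the expensive step was checkpointed.

result = run_steps(conn, "analysis-123", [expensive, flaky], {})
```

The second call picks up at `flaky`, so `expensive` runs once in total. A LangGraph checkpointer applies the same principle per thread and per step.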
Checkpointer Options
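from langgraph.checkpoint.memory import MemorySaver
from langgraph.checkpoint.sqlite import SqliteSaver      # pip install langgraph-checkpoint-sqlite
from langgraph.checkpoint.postgres import PostgresSaver  # pip install langgraph-checkpoint-postgres
# Development: In-memory (state is lost when the process exits)
memory = MemorySaver()
app = workflow.compile(checkpointer=memory)
# Production: SQLite
# In recent releases from_conn_string() is a context manager; use app inside the block.
with SqliteSaver.from_conn_string("checkpoints.db") as checkpointer:
    app = workflow.compile(checkpointer=checkpointer)
# Production: PostgreSQL
with PostgresSaver.from_conn_string("postgresql://...") as checkpointer:
    checkpointer.setup()  # Creates the checkpoint tables; required on first use
    app = workflow.compile(checkpointer=checkpointer)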
Using Thread IDs
# Start new workflow
config = {"configurable": {"thread_id": "analysis-123"}}
result = app.invoke(initial_state, config=config)
# Resume interrupted workflow
config = {"configurable": {"thread_id": "analysis-123"}}
result = app.invoke(None, config=config) # Resumes from checkpoint
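Resuming only works if the second call produces the same `thread_id` as the first. One way to guarantee that is to derive the ID from stable business keys. The helper below is a small sketch; `thread_config` and its parameters are hypothetical names, not part of LangGraph.

```python
def thread_config(workflow_name: str, entity_id: str) -> dict:
    """Build a LangGraph-style config whose thread_id is derived from stable keys."""
    if not workflow_name or not entity_id:
        raise ValueError("workflow_name and entity_id must be non-empty")
    # Same inputs always give the same thread_id, so a restarted process
    # addresses the same checkpoint history.
    return {"configurable": {"thread_id": f"{workflow_name}-{entity_id}"}}


config = thread_config("analysis", "123")
```

Passing this `config` to both the initial `invoke` and the later resume call keeps them on the same thread.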
PostgreSQL Setup
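from psycopg import Connection
from psycopg.rows import dict_row
def create_checkpointer():
    """Create PostgreSQL checkpointer for production."""
    # from_conn_string() is a context manager in recent releases, so a
    # long-lived saver is built from an explicit connection instead.
    conn = Connection.connect(
        settings.DATABASE_URL,
        autocommit=True,
        prepare_threshold=0,
        row_factory=dict_row,
    )
    checkpointer = PostgresSaver(conn)
    checkpointer.setup()  # Create the checkpoint tables if they do not exist
    return checkpointer
# Compile with checkpointing
# LangGraph writes a checkpoint after every step; there is no save_every option to set here.
app = workflow.compile(
    checkpointer=create_checkpointer(),
    interrupt_before=["quality_gate"]  # Manual review point
)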
Inspecting Checkpoints
# Get all checkpoints for a workflow (newest first)
checkpoints = app.get_state_history(config)
for checkpoint in checkpoints:
    print(f"Step: {checkpoint.metadata['step']}")
    print(f"Source: {checkpoint.metadata['source']}")  # e.g. "input" or "loop", not a node name
    print(f"Next: {checkpoint.next}")  # Nodes that would run next from this checkpoint
    print(f"State: {checkpoint.values}")
# Get current state
current = app.get_state(config)
print(current.values)
Resuming After Crash
import logging
def run_with_recovery(workflow_id: str, initial_state: dict):
    """Run workflow with automatic recovery."""
    config = {"configurable": {"thread_id": workflow_id}}
    # get_state() returns an empty snapshot (no values, no next nodes)
    # when the thread has no checkpoint yet, so no try/except is needed.
    state = app.get_state(config)
    if state.next:
        # Pending nodes mean the previous run stopped partway through
        logging.info(f"Resuming workflow {workflow_id}")
        return app.invoke(None, config=config)
    # Start fresh
    logging.info(f"Starting new workflow {workflow_id}")
    return app.invoke(initial_state, config=config)
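A recovery routine really has three cases to tell apart: a run that stopped partway, a run that already finished, and a thread with no checkpoint at all. The sketch below isolates that decision as a pure function so it can be tested without a graph. `SnapshotLike` is a hypothetical stand-in that mirrors only the `values` and `next` fields of the snapshot returned by `get_state`.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Tuple


@dataclass
class SnapshotLike:
    """Hypothetical stand-in for a state snapshot (only the fields used here)."""
    values: Dict[str, Any] = field(default_factory=dict)
    next: Tuple[str, ...] = ()


def recovery_action(snapshot: SnapshotLike) -> str:
    """Classify a thread as 'resume', 'finished', or 'start'."""
    if snapshot.next:
        return "resume"    # Nodes are still pending: continue with invoke(None, ...)
    if snapshot.values:
        return "finished"  # State exists and nothing is pending: do not rerun
    return "start"         # No checkpoint yet: invoke with the initial state


interrupted = recovery_action(SnapshotLike(values={"draft": "x"}, next=("quality_gate",)))
completed = recovery_action(SnapshotLike(values={"draft": "x"}))
fresh = recovery_action(SnapshotLike())
```

Treating the finished case separately avoids re-running a completed workflow just because its thread ID was submitted twice.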
Step-by-Step Debugging
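# Execute one node at a time
for step in app.stream(initial_state, config, stream_mode="updates"):
    # Each chunk maps the node that just ran to the state update it returned
    for node_name, update in step.items():
        print(f"After {node_name}: {update}")
    input("Press Enter to continue...")
# Rollback to previous checkpoint
history = list(app.get_state_history(config))  # Newest first
previous_state = history[1]  # One step back
# Re-run from that checkpoint by passing its config, which carries the checkpoint_id
result = app.invoke(None, config=previous_state.config)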
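Indexing the history directly raises `IndexError` when the thread has fewer checkpoints than expected. A guarded lookup is sketched below. `FakeSnapshot` is a hypothetical stand-in for the snapshot objects in the history, and the helper assumes the list is ordered newest first.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional


@dataclass
class FakeSnapshot:
    """Hypothetical stand-in for a snapshot from the checkpoint history."""
    config: Dict[str, Any]
    values: Dict[str, Any] = field(default_factory=dict)


def pick_rollback_target(
    history: List[FakeSnapshot], steps_back: int = 1
) -> Optional[FakeSnapshot]:
    """Return the snapshot `steps_back` checkpoints before the latest, or None."""
    # Index 0 is the current checkpoint because the list is newest first.
    if steps_back < 0 or steps_back >= len(history):
        return None
    return history[steps_back]


demo_history = [
    FakeSnapshot(config={"checkpoint_id": "c3"}, values={"count": 2}),
    FakeSnapshot(config={"checkpoint_id": "c2"}, values={"count": 1}),
    FakeSnapshot(config={"checkpoint_id": "c1"}, values={"count": 0}),
]
target = pick_rollback_target(demo_history, steps_back=1)
missing = pick_rollback_target(demo_history, steps_back=5)
```

With a real graph, the `config` of the returned snapshot is what gets passed back to `invoke` to replay from that point.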
Store vs Checkpointer (2026 Best Practice)
from langgraph.checkpoint.postgres import PostgresSaver
from langgraph.store.postgres import PostgresStore
# Checkpointer = SHORT-TERM memory (thread-scoped)
# - Conversation history within a session
# - Workflow state for resume/recovery
# - Scoped to thread_id
# Store = LONG-TERM memory (cross-thread)
# - User preferences across sessions
# - Learned facts about users
# - Shared across ALL threads for a user
# Both from_conn_string() helpers are context managers in recent releases
with PostgresSaver.from_conn_string(DATABASE_URL) as checkpointer, \
        PostgresStore.from_conn_string(DATABASE_URL) as store:
    checkpointer.setup()  # Create tables on first use
    store.setup()
    # Compile with BOTH for full memory support
    app = workflow.compile(
        checkpointer=checkpointer,  # Thread-scoped state
        store=store  # Cross-thread memory
    )
Using Store for Cross-Thread Memory
from datetime import datetime
from langgraph.store.base import BaseStore
async def agent_with_memory(state: AgentState, *, store: BaseStore):
    """Agent that remembers across conversations."""
    user_id = state["user_id"]
    system_prompt = state["system_prompt"]
    # Read cross-thread memory (user preferences)
    memories = await store.aget(namespace=("users", user_id), key="preferences")
    # Use memories in agent logic
    if memories and memories.value.get("prefers_concise"):
        system_prompt += "\nBe concise in responses."
    # Update cross-thread memory (learned facts)
    await store.aput(
        namespace=("users", user_id),
        key="last_topic",
        value={"topic": state["current_topic"], "timestamp": datetime.now().isoformat()}
    )
    # Return only the changed key so reducers on other keys are not re-applied
    return {"system_prompt": system_prompt}
# Register node with store access (async node: run the graph with ainvoke/astream)
workflow.add_node("agent", agent_with_memory)
Memory Architecture
┌─────────────────────────────────────────────────────────────┐
│                         User: alice                         │
├─────────────────────────────────────────────────────────────┤
│ Thread 1 (chat-001)          │ Thread 2 (chat-002)          │
│ ┌─────────────────┐          │ ┌─────────────────┐          │
│ │ Checkpointer    │          │ │ Checkpointer    │          │
│ │ - msg history   │          │ │ - msg history   │          │
│ │ - workflow pos  │          │ │ - workflow pos  │          │
│ └─────────────────┘          │ └─────────────────┘          │
├─────────────────────────────────────────────────────────────┤
│ Store (cross-thread)                                        │
│ namespace=("users", "alice")                                │
│ - preferences: {prefers_concise: true}                      │
│ - last_topic: {topic: "langgraph", timestamp: "..."}        │
└─────────────────────────────────────────────────────────────┘
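The diagram can be read as two lookups with different keys: thread state is keyed by `thread_id`, while store entries are keyed by a namespace tuple plus a key. The toy model below illustrates that split. `TinyStore` is a hypothetical in-memory stand-in and not LangGraph's `BaseStore`; it mirrors only the (namespace, key) to value idea.

```python
from typing import Any, Dict, Optional, Tuple


class TinyStore:
    """Hypothetical in-memory stand-in for a namespaced long-term store."""

    def __init__(self) -> None:
        self._items: Dict[Tuple[Tuple[str, ...], str], Dict[str, Any]] = {}

    def put(self, namespace: Tuple[str, ...], key: str, value: Dict[str, Any]) -> None:
        self._items[(namespace, key)] = value

    def get(self, namespace: Tuple[str, ...], key: str) -> Optional[Dict[str, Any]]:
        return self._items.get((namespace, key))


demo_store = TinyStore()
# The same key under different namespaces does not collide.
demo_store.put(("users", "alice"), "preferences", {"prefers_concise": True})
demo_store.put(("users", "bob"), "preferences", {"prefers_concise": False})

# Per-thread state, as a checkpointer would keep it: keyed by thread_id.
thread_state = {
    "chat-001": {"messages": ["hi"]},
    "chat-002": {"messages": []},
}


def describe(user_id: str, thread_id: str) -> str:
    """Combine cross-thread preferences with thread-scoped history."""
    prefs = demo_store.get(("users", user_id), "preferences") or {}
    style = "concise" if prefs.get("prefers_concise") else "detailed"
    return f"{style}:{len(thread_state[thread_id]['messages'])}"


alice_first = describe("alice", "chat-001")
alice_second = describe("alice", "chat-002")
bob_second = describe("bob", "chat-002")
```

Alice's preference follows her into both threads, while each thread keeps its own message history.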
Key Decisions
| Decision | Recommendation |
|----------|----------------|
| Development | MemorySaver (fast, no setup) |
| Production | PostgresSaver (shared, durable) |
| Checkpoint frequency | Every step by default; `save_every` is not a PostgresSaver option |
| Thread ID | Use deterministic ID (workflow_id) |
| Short-term memory | Checkpointer (thread-scoped) |
| Long-term memory | Store (cross-thread, namespaced) |
Common Mistakes
- No checkpointer in production (lose progress)
- Random thread IDs (can't resume)
- Not handling missing checkpoints
- Saving too frequently (overhead)
- Using only checkpointer for user preferences (lost across threads)
- Not using namespaces in Store (data collisions)
Related Skills
- langgraph-state - State design for checkpointing
- langgraph-human-in-loop - Interrupt patterns
- database-schema-designer - PostgreSQL setup