| name | terraform-state-recovery |
| description | Recover from Terraform state issues after infrastructure recreation. Handles orphaned resources, state drift, and cluster recovery. Use when terraform apply fails with resource conflicts. |
| allowed-tools | Bash, Read, Grep |
Terraform State Recovery Skill
Overview
When Kubernetes clusters are recreated (e.g., Rackspace Spot cloudspace deleted and recreated), Terraform state contains references to resources that no longer exist. This causes apply failures that require manual state cleanup.
The Problem
Scenario: Cluster Recreated
- Initial state: Cluster A with ArgoCD, namespaces, secrets
- Cluster deleted (spot preemption, manual deletion, provider issue)
- Cluster B created with same name
- Terraform apply fails with an error along these lines:
Error: Resource already exists in state
Resource kubernetes_namespace.arc_runners already exists in state, but the
underlying Kubernetes cluster has been recreated. The resource exists in
terraform state but not in the actual cluster.
Why This Happens
Terraform state maps each resource address to the ID of the remote object it created. It does not record that the cluster behind the provider has since been replaced:
resource "kubernetes_namespace" "arc_runners" {
# State: terraform_resource_id_123
# Points to: Cluster A (no longer exists)
}
When Cluster B is created:
- Terraform state still references Cluster A resources
- terraform plan shows resources that "already exist" (in state)
- terraform apply fails when trying to create them (they don't exist in Cluster B)
Common Orphaned Resources
After cluster recreation, these resources are typically orphaned:
Kubernetes Resources
# Check for orphaned Kubernetes resources
terraform state list | grep kubernetes_
# Common orphans:
kubernetes_namespace.arc_runners
kubernetes_namespace.arc_systems
kubernetes_secret.github_token
kubernetes_secret.argocd_secret
kubernetes_config_map_v1_data.argocd_cm
Helm Releases
# Check for orphaned Helm releases
terraform state list | grep helm_release
# Common orphans:
helm_release.argocd
helm_release.arc_controller
kubectl_manifest Resources
# Check for orphaned kubectl manifests
terraform state list | grep kubectl_manifest
# Common orphans:
kubectl_manifest.argocd_bootstrap
kubectl_manifest.argocd_app_arc_controller
Recovery Procedure
Step 1: Identify Orphaned Resources
cd terraform
export TF_HTTP_PASSWORD="<github-token>"
terraform init
# List all resources in state
terraform state list > /tmp/state-resources.txt
# Check which cluster the state references
terraform state show module.cloudspace.spot_cloudspace.main | grep cloudspace_id
# Compare with actual cloudspace
spotctl cloudspaces list --org matchpoint-ai -o table
Step 2: Remove Orphaned Resources
CRITICAL: Only remove resources that belonged to the OLD cluster. Do NOT remove:
- spot_cloudspace.main (the cluster itself)
- spot_nodepool.* (node pools)
- data.spot_kubeconfig.* (kubeconfig data sources)
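Because `terraform state rm` rewrites the remote state, it may be worth taking a snapshot before running the removals in this step. The following is a minimal sketch, not part of the original workflow: the helper name `backup_state` and the backup location are assumptions, while `terraform state pull` is the standard command for reading the current state.

```shell
# Hypothetical helper: snapshot the current state before any `terraform state rm`.
# Usage: backup_state [directory]   (directory defaults to the current one)
backup_state() {
  backup_dir="${1:-.}"
  backup_file="$backup_dir/tfstate-backup-$(date +%Y%m%d-%H%M%S).json"
  mkdir -p "$backup_dir" || return 1
  # Write to a temporary name first so a failed pull never leaves a truncated backup.
  if terraform state pull > "$backup_file.tmp" && [ -s "$backup_file.tmp" ]; then
    mv "$backup_file.tmp" "$backup_file"
    echo "$backup_file"
  else
    rm -f "$backup_file.tmp"
    echo "state pull failed; no backup written" >&2
    return 1
  fi
}
```

Calling `backup_state /tmp/state-backups` prints the path of the snapshot. The file holds the full state, including any secrets stored in it, so treat it accordingly. A snapshot can be restored with `terraform state push`, although Terraform refuses a push whose serial is lower than the remote's unless `-force` is given, which is usually the case after a `state rm`.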
# Remove Helm releases (they don't exist in new cluster)
terraform state rm helm_release.argocd
terraform state rm helm_release.arc_controller
# Remove Kubernetes namespaces
terraform state rm kubernetes_namespace.arc_runners
terraform state rm kubernetes_namespace.arc_systems
# Remove Kubernetes secrets
terraform state rm kubernetes_secret.github_token
terraform state rm kubernetes_secret.argocd_secret
# Remove ConfigMaps
terraform state rm kubernetes_config_map_v1_data.argocd_cm
# Remove kubectl manifests
terraform state rm kubectl_manifest.argocd_bootstrap
terraform state rm kubectl_manifest.argocd_app_arc_controller
Step 3: Verify State is Clean
# List remaining resources
terraform state list
# Should see:
# - module.cloudspace.spot_cloudspace.main
# - module.nodepool.spot_nodepool.*
# - data.spot_kubeconfig.this
# Should NOT see:
# - kubernetes_* resources
# - helm_release.* resources
# - kubectl_manifest.* resources
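The check above can also be scripted so that it fails loudly when something was missed. This is a sketch under stated assumptions: the helper name `leftover_cluster_resources` is hypothetical, and the patterns simply mirror the three resource families listed in this document.

```shell
# Hypothetical helper: read `terraform state list` output on stdin and print any
# managed cluster-scoped addresses that are still present.
# Exit status: 0 if none remain, 1 otherwise.
leftover_cluster_resources() {
  # Drop data sources first (they are re-read on every plan), then match the
  # resource type at the start of the address or after a module path.
  leftovers="$(grep -v -E '(^|\.)data\.' |
    grep -E '(^|\.)(kubernetes_[a-z0-9_]+|helm_release|kubectl_manifest)\.' || true)"
  if [ -n "$leftovers" ]; then
    printf '%s\n' "$leftovers"
    return 1
  fi
  return 0
}
```

A typical call would be `terraform state list | leftover_cluster_resources`, which prints nothing and exits 0 once the state is clean.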
Step 4: Re-Apply
# Plan should show creating all Kubernetes resources
terraform plan -var-file=prod.tfvars
# Apply to recreate resources in new cluster
terraform apply -var-file=prod.tfvars
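Before applying, it can help to confirm mechanically that the plan only creates resources. The sketch below parses the `Plan: N to add, N to change, N to destroy.` summary line that `terraform plan` prints. The helper name `plan_is_create_only` is an assumption, and the check is deliberately coarse: it looks only at the summary, not at individual resources.

```shell
# Hypothetical helper: read `terraform plan -no-color` output on stdin and succeed
# only if the summary line reports nothing to change and nothing to destroy.
# Exit status: 0 create-only, 1 changes or destroys present, 2 no summary found.
plan_is_create_only() {
  summary="$(grep -E '^Plan: ' | tail -n 1)"
  if [ -z "$summary" ]; then
    echo "no 'Plan:' summary line found" >&2
    return 2
  fi
  echo "$summary"
  case "$summary" in
    *" 0 to change, 0 to destroy."*) return 0 ;;
    *) return 1 ;;
  esac
}
```

For example, `terraform plan -no-color -var-file=prod.tfvars | plan_is_create_only` echoes the summary line and exits non-zero if anything would be updated or destroyed. The `-no-color` flag keeps ANSI escape codes out of the text being matched.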
Automated Recovery Script
#!/bin/bash
# terraform/scripts/clean-orphaned-state.sh
set -euo pipefail
echo "🔍 Identifying orphaned Kubernetes resources..."
# Get list of Kubernetes resources in state
ORPHANED=$(terraform state list | grep -E "(kubernetes_|helm_release|kubectl_manifest)" || true)
if [ -z "$ORPHANED" ]; then
  echo "✅ No orphaned resources found"
  exit 0
fi
echo "📋 Found orphaned resources:"
echo "$ORPHANED"
echo ""
# -r keeps backslashes in the answer literal
read -r -p "Remove these resources from state? (yes/no): " CONFIRM
if [ "$CONFIRM" != "yes" ]; then
  echo "❌ Aborted"
  exit 1
fi
echo "$ORPHANED" | while read -r resource; do
  echo "🗑️ Removing: $resource"
  terraform state rm "$resource"
done
echo "✅ State cleanup complete"
echo ""
echo "Next steps:"
echo "1. Run: terraform plan -var-file=prod.tfvars"
echo "2. Verify plan shows creating resources (not updating)"
echo "3. Run: terraform apply -var-file=prod.tfvars"
Usage:
cd terraform
export TF_HTTP_PASSWORD="<github-token>"
terraform init
./scripts/clean-orphaned-state.sh
Diagnosis: Is State Orphaned?
Check 1: Cluster ID Mismatch
# Get cloudspace ID from terraform state
terraform state show module.cloudspace.spot_cloudspace.main | grep cloudspace_id
# Get actual cloudspace ID
spotctl cloudspaces get --name matchpoint-runners-prod --org matchpoint-ai -o json | jq -r .cloudspaceId
# If different → cluster was recreated
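The two lookups above can be turned into a single pass/fail comparison. This is a rough sketch: `state_attr` and `same_cluster` are hypothetical helper names, and the parsing assumes that `terraform state show` prints plain, uncoloured `name = "value"` lines.

```shell
# Hypothetical helper: read `terraform state show <address>` output on stdin and
# print the value of one quoted string attribute (first match only).
state_attr() {
  sed -n -E "s/^[[:space:]]*${1}[[:space:]]*=[[:space:]]*\"(.*)\"[[:space:]]*\$/\\1/p" | head -n 1
}

# Hypothetical helper: compare the ID recorded in state with the live one.
# Exit status: 0 if they match, 1 if they differ, 2 if either is empty.
same_cluster() {
  if [ -z "$1" ] || [ -z "$2" ]; then
    echo "missing ID (state='$1' live='$2')" >&2
    return 2
  fi
  if [ "$1" = "$2" ]; then
    echo "state matches live cluster: $1"
  else
    echo "cluster was recreated: state=$1 live=$2"
    return 1
  fi
}

# Example wiring, using the commands from this check:
#   state_id="$(terraform state show module.cloudspace.spot_cloudspace.main | state_attr cloudspace_id)"
#   live_id="$(spotctl cloudspaces get --name matchpoint-runners-prod --org matchpoint-ai -o json | jq -r .cloudspaceId)"
#   same_cluster "$state_id" "$live_id"
```

Treating an empty ID as a separate failure (exit 2) avoids reporting a match when one of the lookups silently returned nothing.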
Check 2: Resource Shows "Not Found" in Plan
terraform plan -var-file=prod.tfvars
# Look for:
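# Look for resources that are already in state but that the plan now proposes
# to create, which suggests the refresh could not find them in the cluster:
#   + resource "kubernetes_namespace" "arc_runners" {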
Check 3: kubectl Confirms Resources Don't Exist
# Get fresh kubeconfig
terraform output -raw kubeconfig_raw > /tmp/kubeconfig.yaml
export KUBECONFIG=/tmp/kubeconfig.yaml
# Check if resources exist
kubectl get namespace arc-runners
# Error from server (NotFound): namespaces "arc-runners" not found → Orphaned in state
# But terraform state shows:
terraform state show kubernetes_namespace.arc_runners
# Shows resource in state → State is stale
Prevention Strategies
Strategy 1: Use Data Sources Where Possible
Instead of managing resources in Terraform, reference them as data sources:
# FRAGILE - resource managed by Terraform
resource "kubernetes_namespace" "arc_runners" {
  metadata {
    name = "arc-runners"
  }
}
# ROBUST - reference namespace created by ArgoCD
data "kubernetes_namespace" "arc_runners" {
  metadata {
    name = "arc-runners"
  }
}
Data sources are re-read on every plan instead of being tracked as managed objects, so they can't become orphaned.
Strategy 2: Let ArgoCD Manage Application Resources
# Terraform manages infrastructure
resource "spot_cloudspace" "main" { }
resource "helm_release" "argocd" { }
# ArgoCD manages applications
# - Namespaces
# - Secrets (via SealedSecrets or external-secrets)
# - ConfigMaps
# - Deployments
This separation means:
- Cluster recreation only affects Terraform resources (infrastructure)
- Application resources are recreated automatically by ArgoCD sync
Strategy 3: Use Remote State Locking
Prevent concurrent applies that can corrupt state:
# backend.tf
terraform {
  backend "http" {
    address        = "https://state.tfstate.dev/github/v1"
    lock_address   = "https://state.tfstate.dev/github/v1/lock"
    unlock_address = "https://state.tfstate.dev/github/v1/lock"
  }
}
Troubleshooting
Error: "Resource not found"
Symptom:
Error: reading Kubernetes Namespace "arc-runners": namespaces "arc-runners" not found
Cause: Resource exists in state but not in cluster
Fix:
terraform state rm kubernetes_namespace.arc_runners
terraform apply -var-file=prod.tfvars
Error: "State lock timeout"
Symptom:
Error: Error acquiring the state lock
Lock Info:
ID: abc-123-def
Operation: OperationTypeApply
Who: user@host
Created: 2024-01-01 12:00:00 UTC
Cause: Previous terraform apply crashed or was interrupted
Fix:
# Verify no terraform process is still running (also check the "Who" host and any CI jobs)
ps aux | grep '[t]erraform'
# Force unlock (only if safe)
terraform force-unlock abc-123-def
Error: "Provider configuration changed"
Symptom:
Error: Provider configuration changed
The provider configuration for provider["kubernetes"] has changed. This may
be because the kubeconfig references a different cluster.
Cause: Kubeconfig points to new cluster but state references old cluster
Fix:
# Get fresh kubeconfig for current cluster
terraform output -raw kubeconfig_raw > /tmp/kubeconfig.yaml
export KUBECONFIG=/tmp/kubeconfig.yaml
# Remove orphaned Kubernetes resources
./scripts/clean-orphaned-state.sh
# Re-apply
terraform apply -var-file=prod.tfvars
Advanced Recovery: Import Resources
If resources exist in the NEW cluster but not in state:
# Import namespace
terraform import kubernetes_namespace.arc_runners arc-runners
# Import secret
terraform import kubernetes_secret.github_token arc-runners/arc-org-github-secret
# Import Helm release
terraform import helm_release.argocd argocd/argocd
When to use import:
- Resources manually created in cluster
- Need to bring them under Terraform management
- Alternative to destroying and recreating
When NOT to use import:
- Resources don't exist (use terraform state rm instead)
- Resources managed by ArgoCD (let ArgoCD manage them)
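The two lists above amount to a guard: import only when the object exists in the cluster and is not yet tracked. Below is a minimal sketch for the namespace case. The helper name `import_namespace_if_present` is hypothetical, and `-var-file=prod.tfvars` is assumed to match the variable file used elsewhere in this document.

```shell
# Hypothetical helper: import a namespace only when it exists in the cluster and
# its address is not already tracked in state.
# Usage: import_namespace_if_present <terraform-address> <namespace-name>
import_namespace_if_present() {
  addr="$1"
  ns="$2"
  # Exact, fixed-string match so that similar addresses are not confused.
  if terraform state list | grep -F -x -q -- "$addr"; then
    echo "skip: $addr is already in state"
    return 0
  fi
  if ! kubectl get namespace "$ns" >/dev/null 2>&1; then
    echo "skip: namespace $ns does not exist in the cluster"
    return 0
  fi
  terraform import -var-file=prod.tfvars "$addr" "$ns"
}
```

For instance, `import_namespace_if_present kubernetes_namespace.arc_runners arc-runners` runs the import only when both conditions hold. It does not check whether ArgoCD owns the namespace, so that decision still has to be made by hand.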
Diagnostic Commands
# List all resources in state
terraform state list
# Show specific resource details
terraform state show kubernetes_namespace.arc_runners
# Pull current state to local file
terraform state pull > /tmp/terraform.tfstate
# Inspect state JSON
jq '.resources[] | select(.type == "kubernetes_namespace")' /tmp/terraform.tfstate
# Check cluster connectivity
terraform output -raw kubeconfig_raw > /tmp/kubeconfig.yaml
export KUBECONFIG=/tmp/kubeconfig.yaml
kubectl cluster-info
# Verify state backend connection
terraform init -backend-config="password=$TF_HTTP_PASSWORD"
State File Forensics
Understanding State Structure
{
  "resources": [
    {
      "mode": "managed",
      "type": "kubernetes_namespace",
      "name": "arc_runners",
      "provider": "provider[\"kubernetes\"]",
      "instances": [
        {
          "attributes": {
            "metadata": [{"name": "arc-runners"}]
          }
        }
      ]
    }
  ]
}
Key fields:
- mode: "managed" - Terraform manages this resource
- mode: "data" - Terraform only reads this resource
- instances[].attributes - Current resource configuration
Finding Orphaned Resources
# Extract all Kubernetes resources
terraform state pull | jq -r '.resources[] | select(.type | startswith("kubernetes_")) | .type + "." + .name'
# Compare with actual cluster resources
kubectl get namespaces -o name
kubectl get secrets -A -o name
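The comparison itself can be automated once both sides are reduced to plain name lists. The sketch below assumes a hypothetical helper `only_in_state` and two scratch files. The `jq` path in the comments follows the state structure shown earlier in this section.

```shell
# Hypothetical helper: print names that appear in the first file (from state)
# but not in the second (from the live cluster). One name per line in each file.
only_in_state() {
  # -F fixed strings, -x whole-line match, -v invert, -f read patterns from file
  sort -u "$1" | grep -F -x -v -f "$2" || true
}

# Example wiring for namespaces:
#   terraform state pull |
#     jq -r '.resources[] | select(.mode == "managed" and .type == "kubernetes_namespace") | .instances[].attributes.metadata[0].name' \
#     > /tmp/state-namespaces.txt
#   kubectl get namespaces -o name | sed 's|^namespace/||' > /tmp/cluster-namespaces.txt
#   only_in_state /tmp/state-namespaces.txt /tmp/cluster-namespaces.txt
```

Every name printed is tracked in state but missing from the cluster, which makes it a candidate for `terraform state rm`.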
Related Skills
- arc-terraform-deployment - Avoiding orphaned state
- infrastructure-cd - Automated terraform workflow
- argocd-bootstrap - Separating app management
Related Issues
- #115 - Cluster DNS not resolving (cluster recreation scenario)
- #121 - State cleanup after ApplicationSet fix
- #119 - Namespace creation conflicts