name	terraform-state-recovery
description	Recover from Terraform state issues after infrastructure recreation. Handles orphaned resources, state drift, and cluster recovery. Use when terraform apply fails with resource conflicts.
allowed-tools	Bash, Read, Grep

Terraform State Recovery Skill

Overview

When Kubernetes clusters are recreated (e.g., Rackspace Spot cloudspace deleted and recreated), Terraform state contains references to resources that no longer exist. This causes apply failures that require manual state cleanup.

The Problem

Scenario: Cluster Recreated

Initial state: Cluster A with ArgoCD, namespaces, secrets
Cluster deleted (spot preemption, manual deletion, provider issue)
Cluster B created with same name
Terraform apply fails with a state conflict (exact wording varies by provider and version):

Error: Resource already exists in state

Resource kubernetes_namespace.arc_runners already exists in state, but the
underlying Kubernetes cluster has been recreated. The resource exists in
terraform state but not in the actual cluster.

Why This Happens

Terraform state maps each resource address to the ID of the object it last created. Nothing in state records whether the cluster behind that ID still exists:

resource "kubernetes_namespace" "arc_runners" {
  # State: terraform_resource_id_123
  # Points to: Cluster A (no longer exists)
}

When Cluster B is created:

Terraform state still references Cluster A resources
terraform plan treats those resources as existing because they are recorded in state
terraform apply fails when it tries to read or update them, because they do not exist in Cluster B

Common Orphaned Resources

After cluster recreation, these resources are typically orphaned:

Kubernetes Resources

# Check for orphaned Kubernetes resources
terraform state list | grep kubernetes_

# Common orphans:
kubernetes_namespace.arc_runners
kubernetes_namespace.arc_systems
kubernetes_secret.github_token
kubernetes_secret.argocd_secret
kubernetes_config_map_v1_data.argocd_cm

Helm Releases

# Check for orphaned Helm releases
terraform state list | grep helm_release

# Common orphans:
helm_release.argocd
helm_release.arc_controller

kubectl_manifest Resources

# Check for orphaned kubectl manifests
terraform state list | grep kubectl_manifest

# Common orphans:
kubectl_manifest.argocd_bootstrap
kubectl_manifest.argocd_app_arc_controller
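
The three checks above can be combined into a single filter. The following is a minimal sketch rather than part of the original tooling: the function name filter_cluster_scoped is hypothetical, and it assumes module instance keys contain no dots. It keeps managed kubernetes_*, helm_release, and kubectl_manifest addresses (including those inside modules) and skips data sources.

```shell
# Reads resource addresses on stdin (one per line, as printed by
# `terraform state list`) and prints only the managed resources that
# live inside the cluster. Addresses starting with "data." never match
# because the pattern is anchored at the start of the line.
filter_cluster_scoped() {
  # Optional "module.<name>." prefixes, then one of the three type families.
  grep -E '^(module\.[^.]+\.)*(kubernetes_[a-z0-9_]+|helm_release|kubectl_manifest)\.' || true
}

# Usage (requires an initialized working directory):
#   terraform state list | filter_cluster_scoped
```

The trailing `|| true` keeps the function from failing under `set -e` when nothing matches, which is the expected result for a clean state.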

Recovery Procedure

Step 1: Identify Orphaned Resources

cd terraform
export TF_HTTP_PASSWORD="<github-token>"
terraform init

# List all resources in state
terraform state list > /tmp/state-resources.txt

# Check which cluster the state references
terraform state show module.cloudspace.spot_cloudspace.main | grep cloudspace_id

# Compare with actual cloudspace
spotctl cloudspaces list --org matchpoint-ai -o table

Step 2: Remove Orphaned Resources

CRITICAL: Remove only resources that lived inside the OLD cluster. Do NOT remove:

module.cloudspace.spot_cloudspace.main (the cluster itself)
module.nodepool.spot_nodepool.* (node pools)
data.spot_kubeconfig.* (kubeconfig data sources)

# Remove Helm releases (they don't exist in new cluster)
terraform state rm helm_release.argocd
terraform state rm helm_release.arc_controller

# Remove Kubernetes namespaces
terraform state rm kubernetes_namespace.arc_runners
terraform state rm kubernetes_namespace.arc_systems

# Remove Kubernetes secrets
terraform state rm kubernetes_secret.github_token
terraform state rm kubernetes_secret.argocd_secret

# Remove ConfigMaps
terraform state rm kubernetes_config_map_v1_data.argocd_cm

# Remove kubectl manifests
terraform state rm kubectl_manifest.argocd_bootstrap
terraform state rm kubectl_manifest.argocd_app_arc_controller
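
Before running removals like the ones above, it can help to preview them. The sketch below is an illustration, not part of the original procedure: the function name print_dry_run_commands is hypothetical, and it assumes addresses contain no single quotes. It turns a list of addresses into `terraform state rm -dry-run` commands, which report what would be removed without changing state.

```shell
# Reads resource addresses on stdin and prints one
# `terraform state rm -dry-run` command per address for review.
print_dry_run_commands() {
  while IFS= read -r addr; do
    # Skip blank lines.
    [ -n "$addr" ] || continue
    # Single quotes protect indexed addresses such as
    # kubernetes_namespace.ns["team-a"] from shell interpretation.
    printf "terraform state rm -dry-run '%s'\n" "$addr"
  done
}

# Usage:
#   terraform state pull > /tmp/state-backup.tfstate   # keep a copy first
#   terraform state list | grep helm_release | print_dry_run_commands
```

Once the printed commands look right, dropping -dry-run from them performs the actual removal.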

Step 3: Verify State is Clean

# List remaining resources
terraform state list

# Should see:
# - module.cloudspace.spot_cloudspace.main
# - module.nodepool.spot_nodepool.*
# - data.spot_kubeconfig.this
# Should NOT see:
# - kubernetes_* resources
# - helm_release.* resources
# - kubectl_manifest.* resources

Step 4: Re-Apply

# Plan should show every Kubernetes resource being created
terraform plan -var-file=prod.tfvars

# Apply to recreate resources in new cluster
terraform apply -var-file=prod.tfvars
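
After a clean-up, the plan should only add resources. One way to check that before applying is to parse the summary line of the plan output. This is a minimal sketch under assumptions: the function name plan_is_create_only is hypothetical, and it expects the usual "Plan: N to add, N to change, N to destroy." summary produced with -no-color.

```shell
# Reads `terraform plan -no-color` output on stdin. Succeeds only when the
# summary reports at least one addition and zero changes and destroys.
plan_is_create_only() {
  local summary add change destroy
  summary=$(grep -E '^Plan: ' | tail -n 1 || true)
  # No summary line means "No changes" or a failed plan.
  [ -n "$summary" ] || return 1
  add=$(printf '%s\n' "$summary" | sed -n 's/.* \([0-9][0-9]*\) to add.*/\1/p')
  change=$(printf '%s\n' "$summary" | sed -n 's/.* \([0-9][0-9]*\) to change.*/\1/p')
  destroy=$(printf '%s\n' "$summary" | sed -n 's/.* \([0-9][0-9]*\) to destroy.*/\1/p')
  # Treat a count that could not be parsed as a failure.
  [ "${add:-0}" -gt 0 ] && [ "${change:-1}" -eq 0 ] && [ "${destroy:-1}" -eq 0 ]
}

# Usage:
#   terraform plan -no-color -var-file=prod.tfvars | plan_is_create_only \
#     && echo "plan only creates resources"
```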

Automated Recovery Script

#!/bin/bash
# terraform/scripts/clean-orphaned-state.sh

set -euo pipefail

echo "🔍 Identifying orphaned Kubernetes resources..."

# Get list of Kubernetes resources in state
ORPHANED=$(terraform state list | grep -E "(kubernetes_|helm_release|kubectl_manifest)" || true)

if [ -z "$ORPHANED" ]; then
  echo "✅ No orphaned resources found"
  exit 0
fi

echo "📋 Found orphaned resources:"
echo "$ORPHANED"
echo ""

read -r -p "Remove these resources from state? (yes/no): " CONFIRM

if [ "$CONFIRM" != "yes" ]; then
  echo "❌ Aborted"
  exit 1
fi

echo "$ORPHANED" | while read -r resource; do
  echo "🗑️  Removing: $resource"
  terraform state rm "$resource"
done

echo "✅ State cleanup complete"
echo ""
echo "Next steps:"
echo "1. Run: terraform plan -var-file=prod.tfvars"
echo "2. Verify plan shows creating resources (not updating)"
echo "3. Run: terraform apply -var-file=prod.tfvars"

Usage:

cd terraform
export TF_HTTP_PASSWORD="<github-token>"
terraform init
./scripts/clean-orphaned-state.sh

Diagnosis: Is State Orphaned?

Check 1: Cluster ID Mismatch

# Get cloudspace ID from terraform state
terraform state show module.cloudspace.spot_cloudspace.main | grep cloudspace_id

# Get actual cloudspace ID
spotctl cloudspaces get --name matchpoint-runners-prod --org matchpoint-ai -o json | jq -r .cloudspaceId

# If different → cluster was recreated
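
The comparison above can be scripted. The sketch below is illustrative only: state_attr and cloudspace_recreated are hypothetical helper names, and the attribute name cloudspace_id is taken from the commands above. It assumes the attribute value contains no spaces.

```shell
# Reads `terraform state show` output on stdin and prints the value of
# one top-level attribute, with surrounding double quotes removed.
state_attr() {
  awk -v key="$1" '$1 == key && $2 == "=" { gsub(/"/, "", $3); print $3; exit }'
}

# Succeeds (exit 0) when both IDs are present and differ, which indicates
# the cluster was recreated. Returns 2 when either ID is missing.
cloudspace_recreated() {
  local state_id="$1" live_id="$2"
  if [ -z "$state_id" ] || [ -z "$live_id" ]; then
    echo "cannot compare: missing cloudspace ID" >&2
    return 2
  fi
  [ "$state_id" != "$live_id" ]
}

# Usage:
#   state_id=$(terraform state show module.cloudspace.spot_cloudspace.main | state_attr cloudspace_id)
#   live_id=$(spotctl cloudspaces get --name matchpoint-runners-prod --org matchpoint-ai -o json | jq -r .cloudspaceId)
#   cloudspace_recreated "$state_id" "$live_id" && echo "cluster was recreated"
```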

Check 2: Resource Shows "Not Found" in Plan

terraform plan -var-file=prod.tfvars

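# Depending on the provider and Terraform version, look for lines such as:
#   # kubernetes_namespace.arc_runners has been deleted
#   # kubernetes_namespace.arc_runners will be created
#   + resource "kubernetes_namespace" "arc_runners" {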

Check 3: kubectl Confirms Resources Don't Exist

# Get fresh kubeconfig
terraform output -raw kubeconfig_raw > /tmp/kubeconfig.yaml
export KUBECONFIG=/tmp/kubeconfig.yaml

# Check if resources exist
kubectl get namespace arc-runners
# Error from server (NotFound): namespaces "arc-runners" not found → Orphaned in state

# But terraform state shows:
terraform state show kubernetes_namespace.arc_runners
# Shows resource in state → State is stale

Prevention Strategies

Strategy 1: Use Data Sources Where Possible

Instead of managing resources in Terraform, reference them as data sources:

# FRAGILE - resource managed by Terraform
resource "kubernetes_namespace" "arc_runners" {
  metadata {
    name = "arc-runners"
  }
}

# ROBUST - reference namespace created by ArgoCD
data "kubernetes_namespace" "arc_runners" {
  metadata {
    name = "arc-runners"
  }
}

Data sources are re-read on every plan instead of being tracked as managed objects, so they can't become orphaned.

Strategy 2: Let ArgoCD Manage Application Resources

# Terraform manages infrastructure
resource "spot_cloudspace" "main" { }
resource "helm_release" "argocd" { }

# ArgoCD manages applications
# - Namespaces
# - Secrets (via SealedSecrets or external-secrets)
# - ConfigMaps
# - Deployments

This separation means:

Cluster recreation only affects Terraform resources (infrastructure)
Application resources are recreated automatically by ArgoCD sync

Strategy 3: Use Remote State Locking

Prevent concurrent applies that can corrupt state:

# backend.tf
terraform {
  backend "http" {
    address        = "https://state.tfstate.dev/github/v1"
    lock_address   = "https://state.tfstate.dev/github/v1/lock"
    unlock_address = "https://state.tfstate.dev/github/v1/lock"
  }
}

Troubleshooting

Error: "Resource not found"

Symptom:

Error: reading Kubernetes Namespace "arc-runners": namespaces "arc-runners" not found

Cause: Resource exists in state but not in cluster

Fix:

terraform state rm kubernetes_namespace.arc_runners
terraform apply -var-file=prod.tfvars

Error: "State lock timeout"

Symptom:

Error: Error acquiring the state lock
Lock Info:
  ID:        abc-123-def
  Operation: OperationTypeApply
  Who:       user@host
  Created:   2024-01-01 12:00:00 UTC

Cause: Previous terraform apply crashed or was interrupted

Fix:

# Verify no terraform process is running (the [t] stops grep from matching itself)
ps aux | grep '[t]erraform'

# Force unlock (only if safe)
terraform force-unlock abc-123-def
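
The lock ID passed to force-unlock comes from the "Lock Info" block of the error. A small sketch for pulling it out of captured output follows. The function name extract_lock_id is hypothetical, and the code assumes the ID appears on a line of the form "ID: <value>", possibly preceded by the box-drawing prefix some Terraform versions print.

```shell
# Reads Terraform error output on stdin and prints the first lock ID found.
extract_lock_id() {
  # Scan each field so a leading "│" or indentation does not matter.
  awk '{ for (i = 1; i < NF; i++) if ($i == "ID:") { print $(i + 1); exit } }'
}

# Usage (only after confirming no other apply is running):
#   lock_id=$(terraform plan -var-file=prod.tfvars 2>&1 | extract_lock_id)
#   [ -n "$lock_id" ] && terraform force-unlock "$lock_id"
```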

Error: "Provider configuration changed"

Symptom:

Error: Provider configuration changed

The provider configuration for provider["kubernetes"] has changed. This may
be because the kubeconfig references a different cluster.

Cause: Kubeconfig points to new cluster but state references old cluster

Fix:

# Get fresh kubeconfig for current cluster
terraform output -raw kubeconfig_raw > /tmp/kubeconfig.yaml
export KUBECONFIG=/tmp/kubeconfig.yaml

# Remove orphaned Kubernetes resources
./scripts/clean-orphaned-state.sh

# Re-apply
terraform apply -var-file=prod.tfvars

Advanced Recovery: Import Resources

If resources exist in the NEW cluster but not in state:

# Import namespace
terraform import kubernetes_namespace.arc_runners arc-runners

# Import secret
terraform import kubernetes_secret.github_token arc-runners/arc-org-github-secret

# Import Helm release
terraform import helm_release.argocd argocd/argocd

When to use import:

Resources manually created in cluster
Need to bring them under Terraform management
Alternative to destroying and recreating

When NOT to use import:

Resources don't exist (use terraform state rm instead)
Resources managed by ArgoCD (let ArgoCD manage them)

Diagnostic Commands

# List all resources in state
terraform state list

# Show specific resource details
terraform state show kubernetes_namespace.arc_runners

# Pull current state to local file
terraform state pull > /tmp/terraform.tfstate

# Inspect state JSON
jq '.resources[] | select(.type == "kubernetes_namespace")' /tmp/terraform.tfstate

# Check cluster connectivity
terraform output -raw kubeconfig_raw > /tmp/kubeconfig.yaml
export KUBECONFIG=/tmp/kubeconfig.yaml
kubectl cluster-info

# Verify state backend connection
terraform init -backend-config="password=$TF_HTTP_PASSWORD"

State File Forensics

Understanding State Structure

{
  "resources": [
    {
      "mode": "managed",
      "type": "kubernetes_namespace",
      "name": "arc_runners",
      "provider": "provider[\"kubernetes\"]",
      "instances": [
        {
          "attributes": {
            "metadata": [{"name": "arc-runners"}]
          }
        }
      ]
    }
  ]
}

Key fields:

mode: "managed" - Terraform manages this resource
mode: "data" - Terraform only reads this resource
instances[].attributes - Attribute values recorded at the last apply or refresh

Finding Orphaned Resources

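# Extract all managed Kubernetes resources, including those inside modules
terraform state pull | jq -r '.resources[] | select(.mode == "managed" and (.type | startswith("kubernetes_"))) | (if .module then .module + "." else "" end) + .type + "." + .name'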

# Compare with actual cluster resources
kubectl get namespaces -o name
kubectl get secrets -A -o name
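
The two listings above still have to be compared by hand. A minimal sketch of that comparison follows, assuming each list has been saved to a file with one name per line. The function name names_missing_from_cluster and the file paths in the usage comment are hypothetical.

```shell
# Prints every name from the first file (names recorded in state) that has
# no identical line in the second file (names present in the cluster).
names_missing_from_cluster() {
  local state_names="$1" cluster_names="$2"
  # Load the cluster names first, then print state names not seen there.
  # Comparing FILENAME keeps this correct when the cluster file is empty.
  awk 'FILENAME == ARGV[1] { seen[$0] = 1; next } !($0 in seen)' \
    "$cluster_names" "$state_names"
}

# Usage:
#   kubectl get namespaces -o name | sed 's|^namespace/||' > /tmp/cluster-ns.txt
#   terraform state pull \
#     | jq -r '.resources[] | select(.type == "kubernetes_namespace") | .instances[].attributes.metadata[0].name' \
#     > /tmp/state-ns.txt
#   names_missing_from_cluster /tmp/state-ns.txt /tmp/cluster-ns.txt
```

Every name it prints is recorded in state but absent from the cluster, which makes it a candidate for terraform state rm.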
