
debugging-flux-deployments

@saavy1/sb

Diagnoses and resolves issues with Flux GitOps deployments, Kubernetes pods, services, and HelmReleases in the Superbloom cluster

Install Skill

1. Download skill
2. Enable skills in Claude: open claude.ai/settings/capabilities and find the "Skills" section
3. Upload to Claude: click "Upload skill" and select the downloaded ZIP file

Note: Please verify the skill by reading through its instructions before using it.

SKILL.md

name: debugging-flux-deployments
description: Diagnoses and resolves issues with Flux GitOps deployments, Kubernetes pods, services, and HelmReleases in the Superbloom cluster

Debugging Flux Deployments

This skill provides systematic approaches to diagnose and fix deployment issues in the Superbloom K3s cluster.

Capabilities

  • Diagnose pod startup failures
  • Debug empty service endpoints
  • Troubleshoot HelmRelease errors
  • Fix SOPS decryption issues
  • Resolve certificate/ingress problems
  • Identify RBAC permission issues

Diagnostic Flow

1. Check Flux sync status
2. Check HelmRelease status
3. Check Pod status
4. Check Service endpoints
5. Check container logs
6. Test internal connectivity
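
Run end to end, the flow looks roughly like this. A minimal triage sketch; <namespace>, <app>, <service>, and <port> are placeholders to fill in:

# 1-2. Sync and release status
flux get all -A
kubectl get helmrelease -n <namespace>

# 3-4. Pod and endpoint health
kubectl get pods -n <namespace>
kubectl get endpoints <service> -n <namespace>

# 5. Recent logs across all containers
kubectl logs -n <namespace> -l app.kubernetes.io/instance=<app> --all-containers --tail=50

# 6. In-cluster connectivity
kubectl run curl-test --rm -it --restart=Never --image=curlimages/curl -- \
  curl -s http://<service>.<namespace>.svc.cluster.local:<port>/health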

Essential Commands

Flux Status

# Check all Flux resources
flux get all -A

# Check specific kustomization
flux get kustomization flux-system -n flux-system

# Force reconciliation
flux reconcile kustomization flux-system --with-source

# Check GitRepository
flux get source git -A

HelmRelease Status

# List all HelmReleases
kubectl get helmrelease -A

# Detailed status
kubectl describe helmrelease <name> -n <namespace>

# Force HelmRelease reconciliation
flux reconcile helmrelease <name> -n <namespace>

Pod Diagnostics

# List pods with status
kubectl get pods -n <namespace>

# Detailed pod info
kubectl describe pod <pod-name> -n <namespace>

# Container logs
kubectl logs -n <namespace> <pod-name> -c <container-name>

# All containers
kubectl logs -n <namespace> <pod-name> --all-containers

# Previous crashed container
kubectl logs -n <namespace> <pod-name> -c <container> --previous

Service/Endpoint Check

# Check service
kubectl get svc -n <namespace>

# Check endpoints (empty = no healthy pods)
kubectl get endpoints <service-name> -n <namespace>

# Test internal connectivity
kubectl run curl-test --rm -it --restart=Never --image=curlimages/curl -- \
  curl -s http://<service>.<namespace>.svc.cluster.local:<port>/health

Secret Verification

# Check secret exists
kubectl get secret <name> -n <namespace>

# View secret data (base64 decoded)
kubectl get secret <name> -n <namespace> -o jsonpath='{.data.<key>}' | base64 -d

# SOPS decrypt locally
cd flux && sops -d clusters/superbloom/apps/<app>/secrets.yaml

Common Issues & Solutions

Pod CrashLoopBackOff

Symptom: Pod shows CrashLoopBackOff, 1/2 Ready

Diagnosis:

kubectl describe pod <pod> -n <namespace> | tail -30
kubectl logs -n <namespace> <pod> -c <container>

Common causes:

  • Missing RBAC permissions (Tailscale sidecar)
  • Missing environment variables
  • Failed health probes
  • Image pull errors

Tailscale fix: Add RBAC for secrets management:

rules:
  - apiGroups: [""]
    resources: ["secrets"]
    verbs: ["get", "create", "update", "patch"]

Empty Service Endpoints

Symptom: Service has no endpoints, traffic fails

Diagnosis:

kubectl get endpoints <svc> -n <namespace>
kubectl get pods -n <namespace> -o wide

Cause: Pod is not Ready (every container in the pod must pass its readiness probe)

Fix: Ensure all containers are healthy. If a sidecar is failing, fix that first.
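
Two quick checks help here: confirm the Service selector actually matches the pod labels (a mismatch also produces empty endpoints), and see which container is failing readiness:

# Compare the Service selector against pod labels
kubectl get svc <svc> -n <namespace> -o jsonpath='{.spec.selector}'
kubectl get pods -n <namespace> --show-labels

# Per-container readiness in a pod
kubectl get pod <pod> -n <namespace> \
  -o jsonpath='{range .status.containerStatuses[*]}{.name}{"\t"}{.ready}{"\n"}{end}'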

HelmRelease Schema Validation Error

Symptom: Helm rejects the release with "values don't meet the specifications of the schema"

Example error:

at '/serviceAccount/name': got string, want object

Cause: bjw-s app-template v4 schema mismatch

Fix for serviceAccount: Move inside controller:

controllers:
  main:
    serviceAccount:        # CORRECT - inside controllers.main
      name: my-app
    containers:
      main:
        ...

NOT at top level:

serviceAccount:           # WRONG - top level
  name: my-app
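
To read the full validation message and inspect the values Flux is passing to the chart (plain kubectl; only the placeholders are assumptions):

# Full condition messages from the failing HelmRelease
kubectl get helmrelease <name> -n <namespace> -o jsonpath='{.status.conditions[*].message}'

# Values being passed to the chart, to compare against the schema
kubectl get helmrelease <name> -n <namespace> -o jsonpath='{.spec.values}'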

SOPS Decryption Failure

Symptom: Flux can't decrypt secrets

Diagnosis:

# Check sops-age secret exists
kubectl get secret sops-age -n flux-system

# Test local decryption
cd flux && sops -d clusters/superbloom/apps/<app>/secrets.yaml

Fixes:

  • Ensure the .sops.yaml path_regex matches the file location
  • Verify the age key is available both locally and in the cluster (see the sketch below)
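
For reference, a .sops.yaml creation rule matching this repo layout, plus the standard Flux command to (re)create the in-cluster key; the public key and key file path are placeholders:

# .sops.yaml — path_regex must cover where the secrets live
creation_rules:
  - path_regex: clusters/superbloom/apps/.*/secrets\.yaml$
    encrypted_regex: ^(data|stringData)$
    age: <age-public-key>

# Recreate the in-cluster key from the private key file
cat age.agekey | kubectl create secret generic sops-age \
  --namespace=flux-system --from-file=age.agekey=/dev/stdin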

Certificate/ACME Failures

Symptom: 521 errors, certificate not issued

Diagnosis:

kubectl logs -n caddy-system -l app=caddy --tail=50

Common causes:

  • Domain not in DDNS → Add to DOMAINS list
  • Rate limited → Wait 1 hour
  • Origin unreachable → Check Cloudflare proxy settings

Fix:

# Add domain to DDNS
sops -d flux/clusters/superbloom/infra/ddns/secrets.yaml
# Add to DOMAINS, re-encrypt

# Restart DDNS and Caddy
kubectl rollout restart deployment ddns -n ddns
kubectl rollout restart deployment caddy -n caddy-system
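
To confirm which certificate (if any) is actually being served, openssl works from any host that can reach the origin directly (an assumption: the Cloudflare proxy is bypassed, otherwise you see Cloudflare's certificate):

openssl s_client -connect <domain>:443 -servername <domain> </dev/null 2>/dev/null \
  | openssl x509 -noout -issuer -subject -dates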

Image Pull Errors

Symptom: ImagePullBackOff, ErrImagePull

Diagnosis:

kubectl describe pod <pod> -n <namespace> | grep -A5 "Events"

Fixes:

  • Check image exists in GHCR
  • Verify image tag is correct
  • Check imagePullSecrets if pulling from a private registry (see the sketch below)
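
For a private registry, verify the credentials the pod will actually use; <pull-secret> is whatever imagePullSecrets references:

# Which pull secret does the pod reference?
kubectl get pod <pod> -n <namespace> -o jsonpath='{.spec.imagePullSecrets[*].name}'

# Decode the registry credentials in that secret
kubectl get secret <pull-secret> -n <namespace> \
  -o jsonpath='{.data.\.dockerconfigjson}' | base64 -d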

Exec Format Error (Corrupted Image Cache)

Symptom: Container crashes with exec /usr/local/bin/<binary>: exec format error

Cause: Containerd cached a wrong-architecture image (common with multi-arch latest tags)

Diagnosis:

kubectl logs -n <namespace> <pod> -c <container>
# Shows "exec format error"

# Check cached images
ssh superbloom "sudo ctr --address /run/k3s/containerd/containerd.sock -n k8s.io images ls | grep <image>"
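
It also helps to confirm the architecture the node actually runs, so you know which platform the image must match (plain kubectl):

kubectl get nodes -o custom-columns=NAME:.metadata.name,ARCH:.status.nodeInfo.architecture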

Fix: Remove corrupted image from containerd cache:

ssh superbloom "sudo ctr --address /run/k3s/containerd/containerd.sock -n k8s.io images rm <image>:<tag>"

# Delete pods to force fresh pull
kubectl delete pod -n <namespace> -l app.kubernetes.io/instance=<app>

Prevention: Pin to specific version tags instead of latest for sidecars like Tailscale.
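
For example, in a bjw-s app-template values block (fragment only; the repository is Tailscale's public image, the tag shown is illustrative):

controllers:
  main:
    containers:
      tailscale:
        image:
          repository: ghcr.io/tailscale/tailscale
          tag: v1.70.0   # pinned; avoid "latest" for sidecars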

Container Exits Immediately (Exit Code 0)

Symptom: Container shows Completed with exit code 0, then keeps restarting and lands in CrashLoopBackOff

Cause: Often a broken Docker build: source files inside the image are empty (0 bytes), typically from a stale build cache

Diagnosis:

# Check file sizes in image
kubectl run debug --rm -it --image=<image>:<tag> -- ls -la /app/src/
# If files show 0 bytes, the build is broken

Fix: Rebuild with cache disabled:

# In .github/workflows/<app>.yml
- name: Build and push
  uses: docker/build-push-action@v5
  with:
    no-cache: true  # Temporarily disable cache

After a successful rebuild, re-enable caching.
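
Once the rebuilt image is deployed, confirm the pods picked up a new digest (placeholders as above):

kubectl get pods -n <namespace> \
  -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.containerStatuses[*].imageID}{"\n"}{end}'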

Useful Patterns

Watch Resources

# Watch pods
kubectl get pods -n <namespace> -w

# Watch events
kubectl get events -n <namespace> -w --sort-by='.lastTimestamp'

Exec Into Container

kubectl exec -it -n <namespace> <pod> -c <container> -- /bin/sh

Port Forward for Testing

kubectl port-forward -n <namespace> svc/<service> 8080:3000
curl http://localhost:8080/health

Delete and Recreate HelmRelease

# Sometimes needed to clear stuck state
kubectl delete helmrelease <name> -n <namespace>
flux reconcile kustomization flux-system --with-source

Best Practices

  1. Always check Flux sync status first
  2. Read HelmRelease events for schema errors
  3. Empty endpoints = pod not Ready
  4. Check ALL container logs in multi-container pods
  5. Test internal connectivity before debugging ingress
  6. Rate limits require waiting, not repeated attempts