name	troubleshoot
description	Diagnose and fix cluster issues. Use when pods fail, deployments don't work, or services are unreachable.

Troubleshoot Cluster Issues

Name: troubleshoot
Author: X-McKay

Diagnose and resolve common cluster problems.

Instructions

1. Identify Problem Pods

export KUBECONFIG=/home/al/.kube/config

echo "=== Problem Pods ==="
kubectl get pods -A | grep -v Running | grep -v Completed
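The grep filter above skips pods that report Running while their containers are not ready (for example READY 0/1). A possible refinement, sketched as a small filter; `problem_pods` is a hypothetical helper name, and the column positions assume the default `kubectl get pods -A` table layout.

```shell
# Filter `kubectl get pods -A` output (read from stdin) down to unhealthy pods:
# any status other than Running/Completed, or Running with fewer ready
# containers than expected. problem_pods is a hypothetical helper name.
problem_pods() {
  awk 'NR == 1 { print; next }            # keep the header row
       {
         split($3, ready, "/")            # READY column, e.g. "1/2"
         if ($4 == "Completed") next      # finished jobs are not a problem
         if ($4 != "Running" || ready[1] != ready[2]) print
       }'
}

# Usage (assumes kubectl is configured for the cluster):
#   kubectl get pods -A | problem_pods
```

The header row is kept so the output stays readable when only a few pods match.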

2. Check Pod Details

export KUBECONFIG=/home/al/.kube/config
POD_NAME="k8s-monitor-xxx"
NAMESPACE="ai-agents"

# Describe pod
kubectl describe pod "$POD_NAME" -n "$NAMESPACE"

# Check logs
kubectl logs "$POD_NAME" -n "$NAMESPACE" --tail=50

# Check previous container logs (if crash loop)
kubectl logs "$POD_NAME" -n "$NAMESPACE" --previous --tail=50

3. Common Issues

ImagePullBackOff

Image not pushed to registry
Wrong image tag
Registry authentication issue

# Check if the image exists locally (this does not query the registry)
docker images | grep k8s-monitor

# Push if missing
docker push registry.almckay.io/k8s-monitor:TAG
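The commands above only cover a missing image. For the other two causes (wrong tag, registry authentication), it can help to read what the pod is actually configured to pull. A minimal sketch; `pod_image_info` is a hypothetical helper name, not part of kubectl.

```shell
# Print the image references and pull secrets a pod is configured with,
# so the tag and registry credentials can be compared against the registry.
# pod_image_info is a hypothetical helper name.
pod_image_info() {
  # $1 = pod name, $2 = namespace
  echo "images: $(kubectl get pod "$1" -n "$2" -o jsonpath='{.spec.containers[*].image}')"
  echo "pullSecrets: $(kubectl get pod "$1" -n "$2" -o jsonpath='{.spec.imagePullSecrets[*].name}')"
}

# Usage:
#   pod_image_info "$POD_NAME" "$NAMESPACE"
```

An empty pullSecrets line means the pod spec names no pull secret, which is worth checking if the registry requires authentication.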

CrashLoopBackOff

Application error on startup
Missing environment variables
Failed dependency connections

# Check logs for error
KUBECONFIG=/home/al/.kube/config kubectl logs "$POD_NAME" -n "$NAMESPACE" --tail=100

# Check env vars
KUBECONFIG=/home/al/.kube/config kubectl get pod "$POD_NAME" -n "$NAMESPACE" -o jsonpath='{.spec.containers[0].env}'
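The exit code of the last crashed container often narrows down which cause applies. A rough sketch, assuming the usual 128 + signal number convention; `explain_exit_code` is a hypothetical helper name and its output is a hint, not a diagnosis.

```shell
# Map a container exit code to a likely cause. Codes above 128 follow the
# common convention of 128 + signal number. Treat the result as a hint only.
# explain_exit_code is a hypothetical helper name.
explain_exit_code() {
  case "$1" in
    0)   echo "exited cleanly; check whether the process is meant to keep running" ;;
    1)   echo "application error; read the previous container logs" ;;
    126) echo "command found but not executable" ;;
    127) echo "command not found; check the image entrypoint" ;;
    137) echo "SIGKILL; often an out-of-memory kill, check memory limits" ;;
    143) echo "SIGTERM; container was asked to stop" ;;
    *)   echo "unrecognized exit code $1" ;;
  esac
}

# Usage: read the last exit code of the first container, then explain it.
#   code="$(kubectl get pod "$POD_NAME" -n "$NAMESPACE" \
#     -o jsonpath='{.status.containerStatuses[0].lastState.terminated.exitCode}')"
#   explain_exit_code "$code"
```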

Pending

Insufficient resources
Node selector mismatch
PVC not bound

# Check events
KUBECONFIG=/home/al/.kube/config kubectl get events -n "$NAMESPACE" --sort-by='.lastTimestamp' | tail -20

# Check node resources (requires metrics-server)
KUBECONFIG=/home/al/.kube/config kubectl top nodes
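The events and node checks above cover insufficient resources. For the "PVC not bound" cause, one option is to filter the claim list for anything that is not Bound. A minimal sketch; `unbound_pvcs` is a hypothetical helper name, and it assumes the default single-namespace table layout where STATUS is the second column.

```shell
# Filter `kubectl get pvc` output (read from stdin) down to claims whose
# STATUS column is not Bound. unbound_pvcs is a hypothetical helper name.
unbound_pvcs() {
  awk 'NR == 1 { print; next }   # keep the header row
       $2 != "Bound" { print }'
}

# Usage:
#   kubectl get pvc -n "$NAMESPACE" | unbound_pvcs
```

For a node selector mismatch, comparing `kubectl get pod "$POD_NAME" -n "$NAMESPACE" -o jsonpath='{.spec.nodeSelector}'` against `kubectl get nodes --show-labels` shows whether any node carries the requested labels.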

4. Temporal Workflow Issues

# Check worker connectivity
KUBECONFIG=/home/al/.kube/config kubectl logs -n ai-agents -l app.kubernetes.io/name=k8s-monitor --tail=50 | grep -i temporal

# List running workflows
# (use tctl or Temporal UI)

5. Restart Deployment

# Set DEPLOYMENT to the name of the deployment to restart
KUBECONFIG=/home/al/.kube/config kubectl rollout restart "deployment/$DEPLOYMENT" -n "$NAMESPACE"
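A restart returns immediately, so it does not show whether the new pods came up. A possible follow-up, sketched as a wrapper that waits on the rollout; `restart_and_wait` is a hypothetical helper name and the 120s timeout is an arbitrary choice, not a value from this guide.

```shell
# Restart a deployment, then block until the rollout finishes or times out.
# restart_and_wait is a hypothetical helper; the timeout is an assumption.
restart_and_wait() {
  # $1 = deployment name, $2 = namespace
  kubectl rollout restart "deployment/$1" -n "$2" &&
    kubectl rollout status "deployment/$1" -n "$2" --timeout=120s
}

# Usage:
#   restart_and_wait "$DEPLOYMENT" "$NAMESPACE"
```

A non-zero exit status from the status command means the rollout did not complete in time; the pod checks in steps 1 and 2 apply from there.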

6. Force Pod Deletion

# Last resort for pods stuck in Terminating: skips graceful shutdown
KUBECONFIG=/home/al/.kube/config kubectl delete pod "$POD_NAME" -n "$NAMESPACE" --force --grace-period=0

Escalation

If unable to resolve:

Check node health: kubectl get nodes
Check kubelet logs on node
Review recent changes in git history
Check Flux reconciliation status
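The escalation steps above could be run roughly as follows. This is a sketch under assumptions: `unhealthy_nodes` is a hypothetical helper name, the node is assumed to run kubelet under systemd, and the Flux CLI is assumed to be installed.

```shell
# Filter `kubectl get nodes` output (read from stdin) down to nodes whose
# STATUS is not exactly Ready (e.g. NotReady or Ready,SchedulingDisabled).
# unhealthy_nodes is a hypothetical helper name.
unhealthy_nodes() {
  awk 'NR == 1 { print; next }   # keep the header row
       $2 != "Ready" { print }'
}

# Usage, one command per escalation step:
#   kubectl get nodes | unhealthy_nodes
#   journalctl -u kubelet --since "1 hour ago"   # on the node; assumes systemd
#   git log --oneline -20                        # in the cluster config repo
#   flux get kustomizations -A                   # assumes the Flux CLI is installed
```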
