name	troubleshoot
description	Diagnoses Kubernetes pod issues automatically. Use when pods are in CrashLoopBackOff, ImagePullBackOff, Pending, or Error state. Analyzes events, logs, and resource status to identify root causes.
allowed-tools	Bash, Read, Grep, Glob

Kubernetes Troubleshoot

Name: troubleshoot
Author: ziwon

Automatically diagnose common Kubernetes issues.

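Trigger Phrases

"왜 안돼" ("why doesn't it work"), "왜 안되지" ("why isn't this working"), "pod가 죽어" ("the pod keeps dying"), "에러나" ("I'm getting an error")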
"troubleshoot", "debug", "diagnose"
"CrashLoopBackOff", "ImagePullBackOff", "Pending", "Error"

Diagnostic Flow

1. Get Pod Status

export KUBECONFIG="$HOME/.kube/config.home"
kubectl get pods -n <namespace> <pod-name> -o wide

2. Check Events

kubectl describe pod -n <namespace> <pod-name> | grep -A 20 "Events:"

3. Get Logs

# Current container
kubectl logs -n <namespace> <pod-name> --tail=100

# Previous container (if restarting)
kubectl logs -n <namespace> <pod-name> --previous --tail=100

# All containers in pod
kubectl logs -n <namespace> <pod-name> --all-containers=true --tail=50

4. Check Resource Status

# Node resources
kubectl top nodes

# Pod resources
kubectl top pods -n <namespace>

# PVC status
kubectl get pvc -n <namespace>
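
Steps 1 to 4 can be bundled into one helper when the same checks are run repeatedly. The following is a minimal sketch, not part of the original skill: the function name `collect_pod_diagnostics` is hypothetical, and it assumes `kubectl` already points at the right cluster.

```shell
# Hypothetical helper: runs steps 1-4 for one pod and prints each result
# under a labelled header. A failing step (for example, no previous
# container, or no metrics-server for "kubectl top") does not stop the run.
collect_pod_diagnostics() {
  ns="$1"
  pod="$2"
  if [ -z "$ns" ] || [ -z "$pod" ]; then
    echo "usage: collect_pod_diagnostics <namespace> <pod-name>" >&2
    return 2
  fi

  echo "== status =="
  kubectl get pods -n "$ns" "$pod" -o wide || true

  echo "== events =="
  kubectl describe pod -n "$ns" "$pod" | grep -A 20 "Events:" || true

  echo "== logs (current) =="
  kubectl logs -n "$ns" "$pod" --all-containers=true --tail=50 || true

  echo "== logs (previous) =="
  kubectl logs -n "$ns" "$pod" --previous --tail=100 2>/dev/null \
    || echo "(no previous container)"

  echo "== resources =="
  kubectl top pods -n "$ns" || true
  kubectl get pvc -n "$ns" || true
}
```

Usage: `collect_pod_diagnostics <namespace> <pod-name>`.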

Common Issues & Solutions

CrashLoopBackOff

Check logs for application errors
Verify environment variables (Infisical secrets mounted?)
Check resource limits (OOMKilled?)
Validate startup/liveness probes
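
The last termination reason and exit code usually narrow down which of the checks above applies. A minimal sketch follows; the function names are hypothetical, and the exit-code hints are common conventions rather than guarantees.

```shell
# Print name, last termination reason, and exit code for each container.
# The fields come from the Pod status (containerStatuses[].lastState).
last_termination() {
  kubectl get pod -n "$1" "$2" \
    -o jsonpath='{range .status.containerStatuses[*]}{.name}{"\t"}{.lastState.terminated.reason}{"\t"}{.lastState.terminated.exitCode}{"\n"}{end}'
}

# Map an exit code to a likely cause (hints only; always confirm in the logs).
explain_exit_code() {
  case "$1" in
    0)   echo "exited cleanly; the container may lack a long-running process" ;;
    1)   echo "application error; check the logs" ;;
    126) echo "command found but not executable" ;;
    127) echo "command not found; check the image entrypoint and args" ;;
    137) echo "SIGKILL; often OOMKilled or killed after the grace period" ;;
    143) echo "SIGTERM; the container was asked to stop" ;;
    *)   echo "unrecognized exit code $1; check the logs" ;;
  esac
}
```

Usage: run `last_termination <namespace> <pod-name>`, then pass the reported code to `explain_exit_code`.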

ImagePullBackOff

Verify image name and tag
Check that imagePullSecrets are configured
Test registry connectivity
Confirm image exists in registry
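
The first two checks can be read directly from the pod spec. The sketch below uses hypothetical helper names; `image_tag` is a simplified parser that assumes the usual `[registry[:port]/]repository[:tag][@digest]` layout.

```shell
# Print the image of every container in the pod, one per line.
pod_images() {
  kubectl get pod -n "$1" "$2" \
    -o jsonpath='{range .spec.containers[*]}{.image}{"\n"}{end}'
}

# Print the names of the imagePullSecrets on the pod (empty if none are set).
pod_pull_secrets() {
  kubectl get pod -n "$1" "$2" -o jsonpath='{.spec.imagePullSecrets[*].name}'
}

# Report the tag an image reference uses. A reference without a tag
# defaults to "latest"; a digest-pinned reference is reported as such.
image_tag() {
  case "$1" in
    *@*) echo "(pinned by digest)"; return 0 ;;
  esac
  last="${1##*/}"   # final path component, so a registry port is ignored
  case "$last" in
    *:*) echo "${last##*:}" ;;
    *)   echo "latest" ;;
  esac
}
```

An unexpected `latest` from `image_tag` usually means the tag was omitted from the manifest.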

Pending

Check node resources (CPU/Memory)
Verify nodeSelector/affinity matches
Check PVC binding status
Review taints/tolerations
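
For a Pending pod, the scheduler's FailedScheduling event usually states which of the checks above applies. The sketch below is illustrative: the function names are hypothetical, and the matched phrases are assumptions based on commonly seen scheduler wording, which varies between Kubernetes versions.

```shell
# Show the scheduler's FailedScheduling events for one pod.
scheduling_events() {
  kubectl get events -n "$1" \
    --field-selector "involvedObject.name=$2,reason=FailedScheduling"
}

# Map a FailedScheduling message to the matching check from the list above.
# Only the first matching pattern is reported, so a message that names
# several causes still needs to be read in full.
classify_scheduling_failure() {
  case "$1" in
    *"Insufficient cpu"*|*"Insufficient memory"*) echo "node resources" ;;
    *"node affinity"*|*"node selector"*)          echo "nodeSelector/affinity" ;;
    *"PersistentVolumeClaim"*)                    echo "PVC binding" ;;
    *"taint"*)                                    echo "taints/tolerations" ;;
    *)                                            echo "unknown" ;;
  esac
}
```

Usage: copy the message column from `scheduling_events <namespace> <pod-name>` into `classify_scheduling_failure`.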

OOMKilled

Increase memory limits
Check for memory leaks in application
Review JVM heap settings (if Java)
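
Before raising a limit, it helps to see what is currently set. A minimal sketch with hypothetical function names; it assumes the pod belongs to a Deployment, and a workload managed by ArgoCD should have the change made in its Git manifest instead of with `kubectl set`.

```shell
# Print each container's name, memory request, and memory limit.
memory_limits() {
  kubectl get pod -n "$1" "$2" \
    -o jsonpath='{range .spec.containers[*]}{.name}{"\t"}{.resources.requests.memory}{"\t"}{.resources.limits.memory}{"\n"}{end}'
}

# Raise the memory limit on every container of a Deployment.
# Arguments: <namespace> <deployment-name> <limit>, e.g. 512Mi.
# Changing the pod template triggers a rolling restart.
raise_memory_limit() {
  kubectl set resources "deployment/$2" -n "$1" --limits="memory=$3"
}
```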

GPU Issues

# Check GPU operator
kubectl get pods -n gpu-operator

# Check device plugin
kubectl logs -n gpu-operator -l app=nvidia-device-plugin-daemonset --tail=50

# Check GPU allocation
kubectl describe node | grep -A 5 "nvidia.com/gpu"
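
A per-node table of allocatable GPUs is often easier to scan than the `describe` output. This is a sketch with a hypothetical function name; it assumes the NVIDIA device plugin advertises the `nvidia.com/gpu` resource, and nodes without GPUs show `<none>`.

```shell
# List each node with its allocatable nvidia.com/gpu count.
# The dot in the resource name is escaped so it is not read as a path separator.
gpu_capacity() {
  kubectl get nodes \
    -o 'custom-columns=NODE:.metadata.name,ALLOCATABLE_GPU:.status.allocatable.nvidia\.com/gpu'
}
```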

ArgoCD Sync Issues

# App status
argocd app get <app-name>

# Preview the sync without applying changes
argocd app sync <app-name> --dry-run

# Resource diff
argocd app diff <app-name>

# Refresh app state from Git before showing it (use --hard-refresh to also bypass the manifest cache)
argocd app get <app-name> --refresh

Output Format

Provide diagnosis in this format:

## Issue Summary
[Brief description of the problem]

## Root Cause
[Identified cause with evidence]

## Solution
[Step-by-step fix]

## Prevention
[How to avoid this in the future]

Reference

@.claude/rules/kubernetes.md
@.claude/rules/argocd-apps.md
