| name | k8s-troubleshooting |
| description | Comprehensive Kubernetes and OpenShift cluster health analysis and troubleshooting based on Popeye's issue detection patterns. Use this skill when: (1) Proactive cluster health assessment and security analysis (2) Analyzing pod/container logs for errors or issues (3) Interpreting cluster events (kubectl get events) (4) Debugging pod failures: CrashLoopBackOff, ImagePullBackOff, OOMKilled, etc. (5) Diagnosing networking issues: DNS, Service connectivity, Ingress/Route problems (6) Investigating storage issues: PVC pending, mount failures (7) Analyzing node problems: NotReady, resource pressure, taints (8) Troubleshooting OCP-specific issues: SCCs, Routes, Operators, Builds (9) Performance analysis and resource optimization (10) Security vulnerability assessment and RBAC validation (11) Configuration best practices validation (12) Reliability and high availability analysis |
Kubernetes / OpenShift Troubleshooting Guide
Command Usage Convention
IMPORTANT: This skill uses kubectl as the primary command in all examples. When working with:
- OpenShift/ARO clusters: Replace all kubectl commands with oc
- Standard Kubernetes clusters (AKS, EKS, GKE, etc.): Use kubectl as shown
The agent will automatically detect the cluster type and use the appropriate command.
Systematic approach to diagnosing and resolving cluster issues through event analysis, log interpretation, and root cause identification.
Proactive Cluster Health Analysis (Popeye-Style)
Cluster Scoring Framework
Popeye uses a health scoring system (0-100) to assess cluster health. Critical issues reduce the score significantly:
- BOOM (Critical): -50 points - Security vulnerabilities, resource exhaustion, failed services
- WARN (Warning): -20 points - Configuration inefficiencies, best practice violations
- INFO (Informational): -5 points - Non-critical issues, optimization opportunities
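As a rough illustration of the deductions listed above, the following sketch turns issue counts into a score. The helper name `health_score` is hypothetical, and clamping the result at 0 is an assumption; the text only states the 0-100 range.

```shell
#!/bin/bash
# Hypothetical helper: compute a 0-100 score from issue counts using the
# per-issue deductions listed above (BOOM -50, WARN -20, INFO -5).
# Clamping at 0 is an assumption, not a documented Popeye behavior.
health_score() {
  boom=${1:-0}
  warn=${2:-0}
  info=${3:-0}
  score=$(( 100 - 50 * boom - 20 * warn - 5 * info ))
  if [ "$score" -lt 0 ]; then
    score=0
  fi
  echo "$score"
}

# Usage: health_score <boom_count> <warn_count> <info_count>
```

For example, one issue at each level gives 100 - 50 - 20 - 5 = 25.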
Quick Cluster Health Assessment
#!/bin/bash
# Comprehensive cluster health check based on Popeye patterns
echo "=== POPEYE-STYLE CLUSTER HEALTH ASSESSMENT ==="
# 1. Node Health Check
echo "### NODE HEALTH (Critical Weight: 1.0) ###"
kubectl get nodes -o wide | grep -E "NotReady|Unknown" && echo "BOOM: Unhealthy nodes detected!" || echo "✓ All nodes healthy"
# 2. Pod Issues Check
echo -e "\n### POD HEALTH (Critical Weight: 1.0) ###"
POD_ISSUES=$(kubectl get pods -A --field-selector=status.phase!=Running,status.phase!=Succeeded --no-headers | wc -l)
if [ $POD_ISSUES -gt 0 ]; then
echo "WARN: $POD_ISSUES pods not running"
kubectl get pods -A --field-selector=status.phase!=Running,status.phase!=Succeeded
else
echo "✓ All pods running"
fi
# 3. Security Issues Check
echo -e "\n### SECURITY ASSESSMENT (Critical Weight: 1.0) ###"
# Check for privileged containers
PRIVILEGED=$(kubectl get pods -A -o json | jq -r '.items[] | select(.spec.containers[].securityContext.privileged == true) | "\(.metadata.namespace)/\(.metadata.name)"' | wc -l)
if [ $PRIVILEGED -gt 0 ]; then
echo "BOOM: $PRIVILEGED privileged containers detected (Security Risk!)"
else
echo "✓ No privileged containers found"
fi
# Check for containers running as root
ROOT_CONTAINERS=$(kubectl get pods -A -o json | jq -r '.items[] | select(.spec.containers[].securityContext.runAsUser == 0) | "\(.metadata.namespace)/\(.metadata.name)"' | wc -l)
if [ $ROOT_CONTAINERS -gt 0 ]; then
echo "WARN: $ROOT_CONTAINERS containers running as root"
else
echo "✓ No containers running as root"
fi
# 4. Resource Configuration Check
echo -e "\n### RESOURCE CONFIGURATION (Warning Weight: 0.8) ###"
NO_LIMITS=$(kubectl get pods -A -o json | jq -r '.items[] | select(.spec.containers[].resources.limits == null) | "\(.metadata.namespace)/\(.metadata.name)"' | wc -l)
if [ $NO_LIMITS -gt 0 ]; then
echo "WARN: $NO_LIMITS containers without resource limits"
else
echo "✓ All containers have resource limits"
fi
# 5. Storage Issues Check
echo -e "\n### STORAGE HEALTH (Warning Weight: 0.5) ###"
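# PVCs do not support a status.phase field selector, so filter client-side
PENDING_PVC=$(kubectl get pvc -A -o json | jq -r '.items[] | select(.status.phase != "Bound") | "\(.metadata.namespace)/\(.metadata.name)"' | wc -l)
if [ $PENDING_PVC -gt 0 ]; then
echo "WARN: $PENDING_PVC PVCs not bound"
kubectl get pvc -A | awk 'NR == 1 || $3 != "Bound"'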
else
echo "✓ All PVCs bound"
fi
# 6. Network Issues Check
echo -e "\n### NETWORKING (Warning Weight: 0.5) ###"
# Check services without endpoints (Endpoints objects that have no subsets)
EMPTY_ENDPOINTS=$(kubectl get endpoints -A -o json | jq -r '.items[] | select((.subsets // []) | length == 0) | "\(.metadata.namespace)/\(.metadata.name)"' | wc -l)
if [ $EMPTY_ENDPOINTS -gt 0 ]; then
echo "WARN: $EMPTY_ENDPOINTS services may have endpoint issues"
else
echo "✓ Services appear healthy"
fi
# OpenShift specific checks
if command -v oc &> /dev/null; then
echo -e "\n### OPENSHIFT CLUSTER OPERATORS (Critical Weight: 1.0) ###"
# Columns: NAME VERSION AVAILABLE PROGRESSING DEGRADED; unhealthy = not Available or Degraded
DEGRADED=$(oc get clusteroperators --no-headers | awk '$3 != "True" || $5 == "True"' | wc -l)
if [ $DEGRADED -gt 0 ]; then
echo "BOOM: $DEGRADED cluster operators degraded/unavailable"
oc get clusteroperators --no-headers | awk '$3 != "True" || $5 == "True"'
else
echo "✓ All cluster operators healthy"
fi
fi
Security Vulnerability Detection
Container Security Analysis
# Security Context Validation
echo "=== CONTAINER SECURITY ANALYSIS ==="
# 1. Privileged Containers (Critical)
kubectl get pods -A -o jsonpath='{range .items[*]}{.metadata.namespace}{"\t"}{.metadata.name}{"\t"}{range .spec.containers[*]}{.name}{"\t"}{.securityContext.privileged}{"\n"}{end}{end}' | grep "true" && echo "BOOM: Privileged containers found!" || echo "✓ No privileged containers"
# 2. Host Namespace Access (Critical)
kubectl get pods -A -o jsonpath='{range .items[*]}{.metadata.namespace}{"\t"}{.metadata.name}{"\t"}{.spec.hostNetwork}{"\t"}{.spec.hostPID}{"\t"}{.spec.hostIPC}{"\n"}{end}' | grep -E "true.*true|true$|true\s" && echo "BOOM: Host namespace access detected!" || echo "✓ No host namespace access"
# 3. Capabilities Check (Warning)
kubectl get pods -A -o json | jq -r '.items[] | . as $pod | .spec.containers[] | select(.securityContext.capabilities.add != null) | "\($pod.metadata.namespace)/\($pod.metadata.name)/\(.name): \(.securityContext.capabilities.add | join(","))"'
# 4. Read-Only Root Filesystem (Warning)
READONLY_ISSUES=$(kubectl get pods -A -o json | jq -r '.items[] | select(.spec.containers[].securityContext.readOnlyRootFilesystem != true) | "\(.metadata.namespace)/\(.metadata.name)"' | wc -l)
echo "INFO: $READONLY_ISSUES containers without read-only root filesystem"
RBAC Security Analysis
echo "=== RBAC SECURITY ANALYSIS ==="
# Check for overly permissive roles
kubectl get clusterroles -o json | jq -r '.items[] | select(any(.rules[]?.verbs[]?; . == "*")) | "\(.metadata.name): Wildcard permissions detected"'
# Check service account permissions
kubectl get serviceaccounts -A -o json | jq -r '.items[] | "\(.metadata.namespace)/\(.metadata.name)"'
Performance Issues Detection
Resource Utilization Analysis
echo "=== PERFORMANCE ANALYSIS ==="
# Find pods with high memory usage (fixed 900Mi threshold; compare against each pod's own limit separately)
kubectl top pods -A --no-headers | awk '{mem=$4; sub(/Mi$/, "", mem); if (mem+0 > 900) print "WARN: Pod " $1 "/" $2 " using high memory: " $4}'
# Find pods with high CPU usage (fixed 900m threshold; kubectl top reports usage, not throttling)
kubectl top pods -A --no-headers | awk '{cpu=$3; sub(/m$/, "", cpu); if (cpu+0 > 900) print "WARN: Pod " $1 "/" $2 " using high CPU: " $3}'
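Memory quantities carry unit suffixes, so a numeric threshold comparison is only reliable after normalizing them. The sketch below converts a quantity to whole mebibytes; `to_mib` is a hypothetical helper, and it assumes integer values with a `Ki`, `Mi`, or `Gi` suffix or plain bytes (decimal quantities such as `1.5Gi` are not handled).

```shell
#!/bin/bash
# Hypothetical helper: convert an integer memory quantity to whole MiB.
# Supported inputs (assumption): <n>Ki, <n>Mi, <n>Gi, or plain bytes.
to_mib() {
  case "$1" in
    *Gi) echo $(( ${1%Gi} * 1024 )) ;;
    *Mi) echo "${1%Mi}" ;;
    *Ki) echo $(( ${1%Ki} / 1024 )) ;;
    *)   echo $(( $1 / 1048576 )) ;;  # plain bytes
  esac
}

# Usage sketch:
#   kubectl top pods -A --no-headers | while read -r ns pod cpu mem; do
#     [ "$(to_mib "$mem")" -gt 900 ] && echo "WARN: $ns/$pod uses $mem"
#   done
```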
Configuration Best Practices Validation
Deployment Health Checks
echo "=== DEPLOYMENT BEST PRACTICES ==="
# Check for liveness/readiness probes
NO_PROBES=$(kubectl get deployments -A -o json | jq -r '.items[] | select(any(.spec.template.spec.containers[]; .livenessProbe == null or .readinessProbe == null)) | "\(.metadata.namespace)/\(.metadata.name)"' | wc -l)
echo "INFO: $NO_PROBES deployments missing health probes"
# Check for pod disruption budgets
PDB_COUNT=$(kubectl get pdb -A --no-headers | wc -l)
DEPLOY_COUNT=$(kubectl get deployments -A --no-headers | wc -l)
echo "INFO: $PDB_COUNT pod disruption budgets for $DEPLOY_COUNT deployments"
# Rolling update strategy
NO_STRATEGY=$(kubectl get deployments -A -o json | jq -r '.items[] | select(.spec.strategy.type != "RollingUpdate") | "\(.metadata.namespace)/\(.metadata.name)"' | wc -l)
echo "INFO: $NO_STRATEGY deployments not using RollingUpdate"
Troubleshooting Workflow
- Identify Scope: Pod, Node, Namespace, or Cluster-wide issue?
- Gather Context: Events, logs, resource status, recent changes
- Analyze Symptoms: Match patterns to known issues
- Determine Root Cause: Follow diagnostic tree
- Remediate: Apply fix and verify resolution
- Document: Record findings for future reference
Quick Diagnostic Commands
# Pod status overview
kubectl get pods -n ${NAMESPACE} -o wide
# Recent events (sorted by time)
kubectl get events -n ${NAMESPACE} --sort-by='.lastTimestamp'
# Pod details and events
kubectl describe pod ${POD_NAME} -n ${NAMESPACE}
# Container logs (current)
kubectl logs ${POD_NAME} -n ${NAMESPACE} -c ${CONTAINER}
# Container logs (previous crashed instance)
kubectl logs ${POD_NAME} -n ${NAMESPACE} -c ${CONTAINER} --previous
# Node status
kubectl get nodes -o wide
kubectl describe node ${NODE_NAME}
# Resource usage
kubectl top pods -n ${NAMESPACE}
kubectl top nodes
# OpenShift specific
oc get events -n ${NAMESPACE}
oc adm top pods -n ${NAMESPACE}
oc get clusteroperators
oc adm node-logs ${NODE_NAME} -u kubelet
Pod Status Interpretation
Pod Phase States
| Phase | Meaning | Action |
|---|---|---|
| Pending | Not scheduled or pulling images | Check events, node resources, PVC status |
| Running | At least one container running | Check container statuses if issues |
| Succeeded | All containers completed successfully | Normal for Jobs |
| Failed | All containers terminated, at least one failed | Check logs, exit codes |
| Unknown | Cannot determine state | Node communication issue |
Container State Analysis
Waiting States
| Reason | Cause | Resolution |
|---|---|---|
| ContainerCreating | Setting up container | Check events for errors, volume mounts |
| ImagePullBackOff | Cannot pull image | Verify image name, registry access, credentials |
| ErrImagePull | Image pull failed | Check image exists, network, ImagePullSecrets |
| CreateContainerConfigError | Config error | Check ConfigMaps, Secrets exist and mounted correctly |
| InvalidImageName | Malformed image reference | Fix image name in spec |
| CrashLoopBackOff | Container repeatedly crashing | Check logs --previous, fix application |
Terminated States
| Reason | Exit Code | Cause | Resolution |
|---|---|---|---|
| OOMKilled | 137 | Memory limit exceeded | Increase memory limit, fix memory leak |
| Error | 1 | Application error | Check logs for stack trace |
| Error | 126 | Command not executable | Fix command/entrypoint permissions |
| Error | 127 | Command not found | Fix command path, verify image contents |
| Error | 128 | Invalid exit code | Application bug |
| Error | 130 | SIGINT (Ctrl+C) | Normal if manual termination |
| Error | 137 | SIGKILL | OOM or forced termination |
| Error | 143 | SIGTERM | Graceful shutdown requested |
| Completed | 0 | Normal exit | Expected for Jobs/init containers |
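Exit codes above 128 in the table follow the shell convention of 128 plus the signal number (137 = 128 + 9 for SIGKILL, 143 = 128 + 15 for SIGTERM). A minimal sketch of that decoding, using a hypothetical `explain_exit_code` helper:

```shell
#!/bin/bash
# Hypothetical helper: map a container exit code to a short explanation.
# Codes above 128 are decoded as 128 + signal number.
explain_exit_code() {
  code=$1
  case "$code" in
    0)   echo "Completed: normal exit" ;;
    1)   echo "Application error" ;;
    126) echo "Command not executable" ;;
    127) echo "Command not found" ;;
    *)
      if [ "$code" -gt 128 ] && [ "$code" -le 255 ]; then
        echo "Terminated by signal $(( code - 128 ))"
      else
        echo "Unknown exit code $code"
      fi
      ;;
  esac
}

# Usage sketch:
#   explain_exit_code "$(kubectl get pod ${POD_NAME} -n ${NAMESPACE} \
#     -o jsonpath='{.status.containerStatuses[0].lastState.terminated.exitCode}')"
```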
Event Analysis
Event Types and Severity
Type: Normal → Informational, typically no action needed
Type: Warning → Potential issue, investigate
Critical Events to Monitor
Pod Scheduling Events
| Event Reason | Meaning | Resolution |
|---|---|---|
| FailedScheduling | Cannot place pod | Check node resources, taints, affinity |
| Unschedulable | No suitable node | Add nodes, adjust requirements |
| NodeNotReady | Target node unavailable | Check node status |
| TaintManagerEviction | Pod evicted due to taint | Check node taints, add tolerations |
FailedScheduling Analysis:
# Common messages and fixes:
"Insufficient cpu" → Reduce requests or add capacity
"Insufficient memory" → Reduce requests or add capacity
"node(s) had taint" → Add toleration or remove taint
"node(s) didn't match selector" → Fix nodeSelector/affinity
"persistentvolumeclaim not found" → Create PVC or fix name
"0/3 nodes available" → All nodes have issues, check each
Image Events
| Event Reason | Meaning | Resolution |
|---|---|---|
| Pulling | Downloading image | Normal, wait |
| Pulled | Image downloaded | Normal |
| Failed | Pull failed | Check image name, registry, auth |
| BackOff | Repeated pull failures | Fix underlying issue |
| ErrImageNeverPull | Image not local with Never policy | Change imagePullPolicy or pre-pull |
ImagePullBackOff Diagnosis:
# Check image name is correct
kubectl get pod ${POD} -o jsonpath='{.spec.containers[*].image}'
# Verify ImagePullSecrets
kubectl get pod ${POD} -o jsonpath='{.spec.imagePullSecrets}'
kubectl get secret ${SECRET} -n ${NAMESPACE}
# Test registry access
kubectl run test --image=${IMAGE} --restart=Never --rm -it -- /bin/sh
Volume Events
| Event Reason | Meaning | Resolution |
|---|---|---|
| FailedMount | Cannot mount volume | Check PVC, storage class, permissions |
| FailedAttachVolume | Cannot attach volume | Check cloud provider, volume exists |
| VolumeResizeFailed | Cannot expand volume | Check storage class allows expansion |
| ProvisioningFailed | Cannot create volume | Check storage class, quotas |
PVC Pending Diagnosis:
# Check PVC status and events
kubectl describe pvc ${PVC_NAME} -n ${NAMESPACE}
# Verify StorageClass exists and is default
kubectl get storageclass
# Check for available PVs (if not dynamic provisioning)
kubectl get pv
# OpenShift: Check storage operator
oc get clusteroperator storage
Container Events
| Event Reason | Meaning | Resolution |
|---|---|---|
| Created | Container created | Normal |
| Started | Container started | Normal |
| Killing | Container being stopped | Normal during updates/scale-down |
| Unhealthy | Probe failed | Fix probe or application |
| ProbeWarning | Probe returned warning | Check probe configuration |
| BackOff | Container crashing | Check logs, fix application |
Event Patterns
Flapping Pod (Repeated restarts)
Events:
Warning BackOff Container is in waiting state due to CrashLoopBackOff
Normal Pulled Container image already present
Normal Created Created container
Normal Started Started container
Warning BackOff Back-off restarting failed container
Diagnosis: The application is crashing on startup; check kubectl logs --previous for the failing instance's output.
Resource Starvation
Events:
Warning FailedScheduling 0/3 nodes are available: 3 Insufficient cpu
Diagnosis: Cluster needs more capacity or pod requests are too high.
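The FailedScheduling messages listed earlier can be matched mechanically. The following sketch maps a scheduler message to the corresponding fix; `scheduling_hint` is a hypothetical helper, and the substring patterns are assumptions based on the messages quoted in this guide.

```shell
#!/bin/bash
# Hypothetical helper: suggest a fix for a FailedScheduling message.
# Patterns are substrings taken from the messages quoted in this guide.
scheduling_hint() {
  case "$1" in
    *"Insufficient cpu"*|*"Insufficient memory"*)
      echo "Reduce requests or add capacity" ;;
    *"had taint"*)
      echo "Add toleration or remove taint" ;;
    *"didn't match"*)
      echo "Fix nodeSelector/affinity" ;;
    *"persistentvolumeclaim"*"not found"*)
      echo "Create PVC or fix name" ;;
    *)
      echo "Check each node for issues" ;;
  esac
}

# Usage sketch:
#   kubectl get events -n ${NAMESPACE} --field-selector reason=FailedScheduling \
#     -o jsonpath='{range .items[*]}{.message}{"\n"}{end}' |
#     while read -r msg; do scheduling_hint "$msg"; done
```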
Probe Failures
Events:
Warning Unhealthy Liveness probe failed: HTTP probe failed with statuscode: 503
Normal Killing Container failed liveness probe, will be restarted
Diagnosis: The application is not responding. Check whether startup is slow (use a startupProbe) or the app is unhealthy.
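When sizing a startupProbe, the time an application gets before it is restarted is roughly failureThreshold × periodSeconds, plus any initialDelaySeconds. A small sketch of that arithmetic, with a hypothetical helper name:

```shell
#!/bin/bash
# Hypothetical helper: approximate startup time budget of a startupProbe.
# budget = initialDelaySeconds + failureThreshold * periodSeconds
startup_budget_seconds() {
  failure_threshold=$1
  period_seconds=$2
  initial_delay=${3:-0}
  echo $(( initial_delay + failure_threshold * period_seconds ))
}

# Usage: startup_budget_seconds <failureThreshold> <periodSeconds> [initialDelaySeconds]
```

For instance, failureThreshold 30 with periodSeconds 10 allows about 300 seconds of startup time.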
Log Analysis Patterns
Common Error Patterns
Application Startup Failures
# Java
java.lang.OutOfMemoryError: Java heap space
→ Increase memory limit, tune JVM heap (-Xmx)
java.net.ConnectException: Connection refused
→ Dependency not ready, add init container or retry logic
# Python
ModuleNotFoundError: No module named 'xxx'
→ Missing dependency, fix requirements.txt/Dockerfile
# Node.js
Error: Cannot find module 'xxx'
→ Missing dependency, fix package.json or node_modules
# General
ECONNREFUSED, Connection refused
→ Service dependency not available
ENOTFOUND, getaddrinfo ENOTFOUND
→ DNS resolution failed, check service name
Database Connection Issues
# PostgreSQL
FATAL: password authentication failed
→ Wrong credentials, check Secret values
connection refused
→ Database not running or wrong host/port
too many connections
→ Connection pool exhaustion, configure pool size
# MySQL
Access denied for user
→ Wrong credentials or missing grants
Can't connect to MySQL server
→ Database not running or network issue
# MongoDB
MongoNetworkError
→ Connection string wrong or network issue
Memory Issues
# Container OOMKilled
Last State: Terminated
Reason: OOMKilled
Exit Code: 137
→ Solutions:
1. Increase memory limit
2. Profile application memory usage
3. Fix memory leaks
4. For JVM: Set -Xmx < container limit (leave ~25% headroom)
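The ~25% headroom rule can be turned into a concrete flag. This sketch derives an `-Xmx` value from a container memory limit given in MiB; `jvm_xmx_flag` is a hypothetical helper and the 75% ratio simply restates the headroom suggested above.

```shell
#!/bin/bash
# Hypothetical helper: derive a JVM -Xmx flag from a container memory
# limit in MiB, leaving about 25% headroom for non-heap memory.
jvm_xmx_flag() {
  limit_mib=$1
  echo "-Xmx$(( limit_mib * 75 / 100 ))m"
}

# Usage: jvm_xmx_flag 1024
```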
Permission Issues
# File system
Permission denied
mkdir: cannot create directory: Permission denied
→ Check securityContext, runAsUser, fsGroup
# OpenShift SCC
Error: container has runAsNonRoot and image has non-numeric user
→ Add runAsUser to securityContext
pods "xxx" is forbidden: unable to validate against any security context constraint
→ Create appropriate SCC or use service account with SCC access
Log Analysis Commands
# Search for errors in logs
kubectl logs ${POD} -n ${NS} | grep -iE "(error|exception|fatal|panic)"
# Follow logs in real-time
kubectl logs -f ${POD} -n ${NS}
# Logs from all containers in pod
kubectl logs ${POD} -n ${NS} --all-containers
# Logs from multiple pods (by label)
kubectl logs -l app=${APP_NAME} -n ${NS} --all-containers
# Logs with timestamps
kubectl logs ${POD} -n ${NS} --timestamps
# Logs from last hour
kubectl logs ${POD} -n ${NS} --since=1h
# Logs from last 100 lines
kubectl logs ${POD} -n ${NS} --tail=100
# OpenShift: Node-level logs
oc adm node-logs ${NODE} -u kubelet
oc adm node-logs ${NODE} -u crio
oc adm node-logs ${NODE} --path=journal
Node Troubleshooting
Node Conditions
| Condition | Status | Meaning |
|---|---|---|
| Ready | True | Node healthy |
| Ready | False | Kubelet not healthy |
| Ready | Unknown | No heartbeat from node |
| MemoryPressure | True | Low memory |
| DiskPressure | True | Low disk space |
| PIDPressure | True | Too many processes |
| NetworkUnavailable | True | Network not configured |
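The table implies an asymmetry: Ready is healthy when True, while every pressure or unavailability condition is healthy when False. A minimal sketch of that rule, using a hypothetical `node_condition_ok` helper:

```shell
#!/bin/bash
# Hypothetical helper: return 0 if a node condition is in its healthy state.
# Ready must be True; all other conditions in the table must be False.
node_condition_ok() {
  cond_type=$1
  cond_status=$2
  if [ "$cond_type" = "Ready" ]; then
    [ "$cond_status" = "True" ]
  else
    [ "$cond_status" = "False" ]
  fi
}

# Usage sketch:
#   kubectl get node ${NODE_NAME} \
#     -o jsonpath='{range .status.conditions[*]}{.type}{" "}{.status}{"\n"}{end}' |
#     while read -r t s; do
#       node_condition_ok "$t" "$s" || echo "WARN: $t=$s"
#     done
```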
Node NotReady Diagnosis
# Check node status
kubectl describe node ${NODE_NAME}
# Check kubelet status (SSH to node or via oc adm)
systemctl status kubelet
journalctl -u kubelet -f
# Check container runtime
systemctl status crio # or containerd/docker
journalctl -u crio -f
# Check node resources
df -h
free -m
top
# OpenShift: Machine status
oc get machines -n openshift-machine-api
oc describe machine ${MACHINE} -n openshift-machine-api
Node Resource Pressure
# Check resource allocation vs capacity
kubectl describe node ${NODE} | grep -A 10 "Allocated resources"
# Find pods using most resources
kubectl top pods -A --sort-by=cpu
kubectl top pods -A --sort-by=memory
# Evict pods from node (drain)
kubectl drain ${NODE} --ignore-daemonsets --delete-emptydir-data
Networking Troubleshooting
DNS Issues
# Test DNS resolution from a debug pod
kubectl run dns-test --image=busybox:1.28 --rm -it --restart=Never -- nslookup ${SERVICE_NAME}
kubectl run dns-test --image=busybox:1.28 --rm -it --restart=Never -- nslookup ${SERVICE_NAME}.${NAMESPACE}.svc.cluster.local
# Check CoreDNS/DNS pods
kubectl get pods -n kube-system -l k8s-app=kube-dns
kubectl logs -n kube-system -l k8s-app=kube-dns
# Check DNS service
kubectl get svc -n kube-system kube-dns
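Testing both the short name and the fully qualified name helps separate search-path problems from resolution problems. The sketch below builds the fully qualified name; `svc_fqdn` is a hypothetical helper, and `cluster.local` is assumed as the cluster domain unless a third argument overrides it.

```shell
#!/bin/bash
# Hypothetical helper: build the fully qualified DNS name of a Service.
# The cluster domain defaults to cluster.local (assumption; clusters may differ).
svc_fqdn() {
  service=$1
  namespace=$2
  domain=${3:-cluster.local}
  echo "${service}.${namespace}.svc.${domain}"
}

# Usage sketch:
#   kubectl run dns-test --image=busybox:1.28 --rm -it --restart=Never -- \
#     nslookup "$(svc_fqdn ${SERVICE_NAME} ${NAMESPACE})"
```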
Service Connectivity
# Verify service exists and has endpoints
kubectl get svc ${SERVICE} -n ${NS}
kubectl get endpoints ${SERVICE} -n ${NS}
# Test service from debug pod
kubectl run curl-test --image=curlimages/curl --rm -it --restart=Never -- \
curl -v http://${SERVICE}.${NS}.svc.cluster.local:${PORT}
# Check if pods match service selector
kubectl get pods -n ${NS} -l ${SELECTOR} -o wide
Ingress/Route Issues
# Check Ingress status
kubectl describe ingress ${INGRESS} -n ${NS}
# Check Ingress controller logs
kubectl logs -n ingress-nginx -l app.kubernetes.io/name=ingress-nginx
# OpenShift Route
oc describe route ${ROUTE} -n ${NS}
oc get route ${ROUTE} -n ${NS} -o yaml
# Check router pods
oc get pods -n openshift-ingress
oc logs -n openshift-ingress -l ingresscontroller.operator.openshift.io/deployment-ingresscontroller=default
NetworkPolicy Debugging
# List NetworkPolicies affecting a pod
kubectl get networkpolicy -n ${NS}
# Test connectivity with ephemeral debug container (-it attaches so the output is visible)
kubectl debug ${POD} -n ${NS} -it --image=nicolaka/netshoot -- \
curl -v http://${TARGET_SERVICE}:${PORT}
# Check if traffic is blocked (look for drops)
# On node running the pod:
conntrack -L | grep ${POD_IP}
OpenShift-Specific Troubleshooting
Comprehensive OpenShift Health Assessment (Popeye-Style)
#!/bin/bash
# OpenShift comprehensive health check
echo "=== OPENSHIFT COMPREHENSIVE HEALTH ASSESSMENT ==="
# 1. Cluster Operators Health (Critical)
echo "### CLUSTER OPERATORS (Critical Weight: 1.0) ###"
oc get clusteroperators
echo ""
# Columns: NAME VERSION AVAILABLE PROGRESSING DEGRADED; unhealthy = not Available or Degraded
DEGRADED_OPERATORS=$(oc get clusteroperators --no-headers | awk '$3 != "True" || $5 == "True"' | wc -l)
if [ $DEGRADED_OPERATORS -gt 0 ]; then
echo "BOOM: $DEGRADED_OPERATORS cluster operators in degraded state!"
oc get clusteroperators --no-headers | awk '$3 != "True" || $5 == "True"'
else
echo "✓ All cluster operators healthy"
fi
# 2. OpenShift Routes Analysis (Warning)
echo -e "\n### ROUTES ANALYSIS (Warning Weight: 0.5) ###"
ROUTE_ISSUES=$(oc get routes -A -o json | jq -r '.items[] | select(.status.ingress == null) | "\(.metadata.namespace)/\(.metadata.name)"' | wc -l)
if [ $ROUTE_ISSUES -gt 0 ]; then
echo "WARN: $ROUTE_ISSUES routes not admitted by any router"
else
echo "✓ All routes admitted by a router"
fi
# Check TLS routes rejected by the router (Admitted=False), for example due to an invalid certificate
EXPIRED_CERTS=$(oc get routes -A -o json | jq -r '.items[] | select(.spec.tls != null) | select(any(.status.ingress[]?.conditions[]?; .type == "Admitted" and .status == "False")) | "\(.metadata.namespace)/\(.metadata.name)"' | wc -l)
if [ $EXPIRED_CERTS -gt 0 ]; then
echo "WARN: $EXPIRED_CERTS routes with TLS issues"
fi
# 3. BuildConfig Health Analysis (Warning)
echo -e "\n### BUILDCONFIG ANALYSIS (Warning Weight: 0.5) ###"
FAILED_BUILDS=$(oc get builds -A -o json | jq -r '.items[] | select(.status.phase == "Failed") | "\(.metadata.namespace)/\(.metadata.name)"' | wc -l)
echo "INFO: $FAILED_BUILDS failed builds detected"
# Check BuildConfig strategies
BUILDCONFIGS=$(oc get buildconfigs -A --no-headers | wc -l)
echo "INFO: $BUILDCONFIGS build configurations found"
# 4. Security Context Constraints (Critical)
echo -e "\n### SCC ANALYSIS (Critical Weight: 1.0) ###"
# SCC rejections surface as FailedCreate events on the owning controller (ReplicaSet, Job, etc.), not as FailedScheduling
SCC_VIOLATIONS=$(oc get events -A --field-selector=reason=FailedCreate --no-headers | grep -c "unable to validate against any security context constraint")
if [ $SCC_VIOLATIONS -gt 0 ]; then
echo "BOOM: $SCC_VIOLATIONS SCC violations detected!"
oc get events -A --field-selector=reason=FailedCreate | grep "unable to validate against any security context constraint"
else
echo "✓ No SCC violations detected"
fi
# 5. ImageStream Health (Warning)
echo -e "\n### IMAGESTREAM ANALYSIS (Warning Weight: 0.3) ###"
STALE_IMAGES=$(oc get imagestreams -A -o json | jq -r '.items[] | select(.status.tags[]?.items? | length == 0) | "\(.metadata.namespace)/\(.metadata.name)"' | wc -l)
if [ $STALE_IMAGES -gt 0 ]; then
echo "WARN: $STALE_IMAGES ImageStreams without images"
else
echo "✓ All ImageStreams have images"
fi
# 6. Project Status (Warning)
echo -e "\n### PROJECT STATUS (Warning Weight: 0.5) ###"
oc get projects -o json | jq -r '.items[] | "\(.metadata.name): \(.status.phase)"' | while read project_info; do
project=$(echo $project_info | cut -d: -f1)
phase=$(echo $project_info | cut -d: -f2)
if [ "$phase" == "Active" ]; then
echo "✓ Project $project is active"
else
echo "WARN: Project $project is $phase"
fi
done
Cluster Operators
# Check overall cluster health
oc get clusteroperators
# Investigate degraded operator
oc describe clusteroperator ${OPERATOR}
# Check operator logs
oc logs -n openshift-${OPERATOR} -l name=${OPERATOR}-operator
# OpenShift-specific: Check co-reconcile events
oc get events -A --field-selector reason=OperatorStatusChanged
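A healthy operator reports AVAILABLE=True, PROGRESSING=False, DEGRADED=False, so filtering on column values is more reliable than matching loose text patterns. The sketch below reads `oc get clusteroperators --no-headers` output on stdin and prints the names of operators that are unavailable or degraded. `unhealthy_operators` is a hypothetical helper, and it assumes the default column order (NAME, VERSION, AVAILABLE, PROGRESSING, DEGRADED) with a populated VERSION column.

```shell
#!/bin/bash
# Hypothetical helper: list cluster operators that are not Available
# or are Degraded, based on whitespace-separated columns on stdin.
# Assumed columns: NAME VERSION AVAILABLE PROGRESSING DEGRADED ...
unhealthy_operators() {
  awk '$3 != "True" || $5 == "True" { print $1 }'
}

# Usage sketch:
#   oc get clusteroperators --no-headers | unhealthy_operators
```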
Security Context Constraints (SCC)
# List SCCs
oc get scc
# Check which SCC a pod is using
oc get pod ${POD} -n ${NS} -o yaml | grep scc
# Check which users and service accounts can use a given SCC
oc adm policy who-can use scc restricted-v2
# Add SCC to service account (requires admin)
oc adm policy add-scc-to-user ${SCC} -z ${SERVICE_ACCOUNT} -n ${NS}
Common SCC Issues:
Error: pods "xxx" is forbidden: unable to validate against any security context constraint
→ Diagnosis:
1. Check pod securityContext requirements
2. Find compatible SCC: oc get scc
3. Grant SCC to service account or adjust pod spec
Common fixes:
- Add runAsUser to match SCC requirements
- Use less restrictive SCC (not recommended for prod)
- Create custom SCC for specific needs
Build Failures
# Check build status
oc get builds -n ${NS}
oc describe build ${BUILD} -n ${NS}
# Build logs
oc logs build/${BUILD} -n ${NS}
oc logs -f bc/${BUILDCONFIG} -n ${NS}
# Check builder pod
oc get pods -n ${NS} | grep build
oc describe pod ${BUILD_POD} -n ${NS}
Common Build Issues:
| Error | Cause | Resolution |
|---|---|---|
| error: build error: image not found | Base image missing | Check ImageStream or external registry |
| AssembleInputError | S2I assemble failed | Check application dependencies |
| GenericBuildFailed | Build command failed | Check build logs for details |
| PushImageToRegistryFailed | Cannot push to registry | Check registry access, quotas |
ImageStream Issues
# Check ImageStream
oc get is ${IS_NAME} -n ${NS}
oc describe is ${IS_NAME} -n ${NS}
# Import external image
oc import-image ${IS_NAME}:${TAG} --from=${EXTERNAL_IMAGE} --confirm -n ${NS}
# Check image import status
oc get imagestreamtag ${IS_NAME}:${TAG} -n ${NS}
Performance Analysis
Resource Optimization
# Get actual resource usage vs requests
kubectl top pods -n ${NS}
# Compare with requests/limits
kubectl get pods -n ${NS} -o custom-columns=\
NAME:.metadata.name,\
CPU_REQ:.spec.containers[*].resources.requests.cpu,\
CPU_LIM:.spec.containers[*].resources.limits.cpu,\
MEM_REQ:.spec.containers[*].resources.requests.memory,\
MEM_LIM:.spec.containers[*].resources.limits.memory
# Find pods without resource limits
kubectl get pods -A -o json | jq -r \
'.items[] | select(.spec.containers[].resources.limits == null) |
"\(.metadata.namespace)/\(.metadata.name)"'
Right-Sizing Recommendations
| Symptom | Indication | Action |
|---|---|---|
| CPU throttling, high latency | CPU limit too low | Increase CPU limit |
| OOMKilled frequently | Memory limit too low | Increase memory limit |
| Low CPU utilization | Over-provisioned | Reduce CPU request |
| Low memory utilization | Over-provisioned | Reduce memory request |
| Pending pods | Cluster capacity full | Add nodes or optimize |
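To apply the table, it helps to express usage as a percentage of the request. A minimal sketch, assuming both values are given in millicores (for example `250m`); `cpu_utilization_pct` is a hypothetical helper and does not handle whole-core values such as `2`.

```shell
#!/bin/bash
# Hypothetical helper: CPU usage as an integer percentage of the request.
# Both arguments are assumed to be millicore quantities such as 250m.
cpu_utilization_pct() {
  usage=${1%m}
  request=${2%m}
  if [ "$request" -le 0 ]; then
    echo "n/a"   # no request set, ratio undefined
    return 0
  fi
  echo $(( usage * 100 / request ))
}

# Usage: cpu_utilization_pct <usage> <request>
```

Values well below 100 suggest an over-provisioned request, while values above 100 mean the pod regularly uses more than it requested.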
Latency Investigation
# Check pod startup time
kubectl get pod ${POD} -n ${NS} -o jsonpath='{.status.conditions}'
# Check container startup
kubectl get pod ${POD} -n ${NS} -o jsonpath='{.status.containerStatuses[*].state}'
# Slow image pulls
kubectl describe pod ${POD} -n ${NS} | grep -A 5 "Events"
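# Network latency test (timing breakdown via curl --write-out variables)
kubectl run nettest --image=nicolaka/netshoot --rm -it --restart=Never -- \
curl -s -o /dev/null -w 'dns=%{time_namelookup}s connect=%{time_connect}s ttfb=%{time_starttransfer}s total=%{time_total}s\n' http://${SERVICE}:${PORT}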
Popeye-Style Diagnostic Decision Trees
Comprehensive Cluster Health Assessment Tree
Cluster Health Score < 80?
├── Yes → Check Critical Issues (BOOM: -50 points each)
│ ├── Node Health Issues?
│ │ ├── NotReady nodes → kubelet problems, resource pressure
│ │ ├── Unknown nodes → Network connectivity, API server
│ │ └── Resource pressure → CPU/Memory/Disk pressure
│ ├── Security Vulnerabilities?
│ │ ├── Privileged containers → Remove privileged flag
│ │ ├── Host namespace access → Remove hostNetwork/hostPID/hostIPC
│ │ ├── Run as root → Set runAsNonRoot: true, runAsUser > 0
│ │ └── Wildcard RBAC → Create specific roles with minimal permissions
│ ├── Service Failures?
│ │ ├── No endpoints → Fix service selector or pod labels
│ │ ├── Failed load balancers → Check cloud provider quotas
│ │ └── Certificate issues → Renew TLS certificates
│ └── Resource Exhaustion?
│ ├── OOMKilled pods → Increase memory limits
│ ├── CPU throttling → Increase CPU limits
│ └── Storage full → Clean up or expand storage
├── No → Check Warning Issues (WARN: -20 points each)
│ ├── Configuration Issues?
│ │ ├── No resource limits → Add requests/limits
│ │ ├── No health probes → Add liveness/readiness probes
│ │ ├── Missing PDBs → Create PodDisruptionBudgets
│ │ └── No rolling updates → Use RollingUpdate strategy
│ ├── Performance Issues?
│ │ ├── Underutilized resources → Right-size pods
│ │ ├── Large container images → Optimize Dockerfile
│ │ └── Inefficient scheduling → Add affinity/anti-affinity
│ └── Reliability Issues?
│ ├── Single replicas → Increase replica count
│ ├── No backup strategy → Implement backup solution
│ └── Missing monitoring → Add metrics and logging
└── Score >= 80 → Check Info Issues (INFO: -5 points each)
├── Best Practice Violations?
│ ├── Missing labels → Add standard labels
│ ├── No termination grace → Set terminationGracePeriodSeconds
│ └── Deprecated APIs → Update to newer API versions
└── Optimization Opportunities?
├── Unused resources → Clean up orphaned resources
├── ImagePullPolicy: Always → Use IfNotPresent for production
└── Large logs → Implement log rotation
Pod Not Starting - Enhanced Diagnostic Tree
Pod Phase = Pending?
├── Yes → Check Scheduling Issues
│ ├── Events: FailedScheduling?
│ │ ├── "Insufficient cpu/memory" →
│ │ │ ├── Add nodes OR
│ │ │ ├── Reduce resource requests OR
│ │ │ └── Enable cluster autoscaler
│ │ ├── "node(s) had taint" →
│ │ │ ├── Add toleration to pod OR
│ │ │ └── Remove taint from node
│ │ ├── "node(s) didn't match nodeSelector" →
│ │ │ ├── Fix nodeSelector labels OR
│ │ │ └── Update node labels
│ │ ├── "persistentvolumeclaim not found" →
│ │ │ ├── Create PVC with correct name OR
│ │ │ └── Fix PVC reference in pod
│ │ └── "0/X nodes available" → Check all nodes for issues
│ └── No FailedScheduling events?
│ ├── Check ResourceQuota → Quota exceeded?
│ ├── Check LimitRange → Requests too small/large?
│ └── Check Namespace → Namespace exists and not terminating?
└── No → Pod Phase = Running with issues?
├── ContainerCreating > 5min?
│ ├── Events: ImagePullBackOff?
│ │ ├── Check image name/registry → Fix image reference
│ │ ├── Check ImagePullSecrets → Create/update secrets
│ │ └── Test registry access → kubectl run test-pod --image=xxx
│ ├── Events: FailedMount?
│ │ ├── PVC not bound → Create PV or fix StorageClass
│ │ ├── Secret/ConfigMap not found → Create missing resources
│ │ └── Permission denied → Fix securityContext, fsgroup
│ └── Events: CreateContainerConfigError?
│ ├── Missing ConfigMap → Create ConfigMap
│ ├── Invalid volume mount → Fix volumeMount path
│ └── Security context violation → Adjust SCC or securityContext
└── Container status: Waiting/CrashLoopBackOff?
├── Exit code analysis:
│ ├── 137 (OOMKilled) → Increase memory limit
│ ├── 1 (General error) → Check application logs
│ ├── 125/126/127 (Command issues) → Fix entrypoint/command
│ └── 143 (SIGTERM) → Graceful shutdown issue
└── No previous logs?
├── Application starts too slowly → Add startupProbe
├── Entrypoint command not found → Fix Dockerfile CMD/ENTRYPOINT
└── Permission denied → Fix file permissions in image
Security Issues Diagnostic Tree
Security Issues Detected?
├── Privileged Containers (Critical)?
│ ├── Find: kubectl get pods -A -o json | jq '.items[] | select(.spec.containers[].securityContext.privileged == true) | .metadata.name'
│ ├── Why: Dangerous escape from container isolation
│ └── Fix: Set privileged: false or use least privileged SCC
├── Host Namespace Access (Critical)?
│ ├── Check: hostNetwork, hostPID, hostIPC = true
│ ├── Why: Access to host system resources
│ └── Fix: Remove host namespace access, use specific alternatives
├── Root User Execution (Warning)?
│ ├── Check: runAsUser = 0 or no runAsNonRoot
│ ├── Why: Root access in container
│ └── Fix: Set runAsNonRoot: true, runAsUser: 1000+
├── Wildcard RBAC Permissions (Critical)?
│ ├── Check: verbs: ["*"] or resources: ["*"]
│ ├── Why: Over-privileged service accounts
│ └── Fix: Create specific roles with minimal permissions
├── Missing Security Context (Warning)?
│ ├── Check: No securityContext at pod or container level
│ ├── Why: Default settings may not be secure enough
│ └── Fix: Add securityContext with appropriate settings
└── Sensitive Data in Environment Variables (Critical)?
├── Check: Passwords, tokens, keys in env
├── Why: Visible via kubectl describe, logs
└── Fix: Use Secrets, consider external secret management
Performance Issues Diagnostic Tree
Performance Issues Detected?
├── Resource Utilization Issues?
│ ├── High CPU Usage?
│ │ ├── Symptoms: High latency, throttling
│ │ ├── Diagnose: kubectl top pods, kubectl describe node
│ │ └── Solutions: Increase limits, optimize code, add replicas
│ ├── Memory Pressure?
│ │ ├── Symptoms: OOMKilled, swapping, slow performance
│ │ ├── Diagnose: kubectl top pods, check events for OOM
│ │ └── Solutions: Increase limits, fix memory leaks, add nodes
│ └── Storage Issues?
│ ├── Symptoms: Failed writes, slow I/O, PVC pending
│ ├── Diagnose: kubectl get pv/pvc, df -h on nodes
│ └── Solutions: Expand PVs, add storage, optimize I/O patterns
├── Networking Performance?
│ ├── DNS Resolution Delays?
│ │ ├── Check: CoreDNS pods, node DNS config
│ │ ├── Test: nslookup from debug pod
│ │ └── Fix: Scale CoreDNS, optimize DNS config
│ ├── Service Connectivity Issues?
│ │ ├── Check: Service endpoints, NetworkPolicies
│ │ ├── Test: curl to <service>.<namespace>.svc.cluster.local
│ │ └── Fix: Fix selectors, adjust NetworkPolicies
│ └── Ingress/Route Performance?
│ ├── Check: Ingress controller resources, TLS config
│ ├── Test: Load test with hey/wrk
│ └── Fix: Scale ingress, optimize TLS, add caching
└── Application-Specific Issues?
├── Slow Startup Times?
│ ├── Check: Image size, initialization steps
│ ├── Fix: Multi-stage builds, optimize startup
│ └── Configure: startupProbe with appropriate values
├── Database Connection Pool Issues?
│ ├── Check: Connection limits, timeout settings
│ ├── Monitor: Active connections, wait time
│ └── Fix: Adjust pool size, add connection retry logic
└── Cache Inefficiency?
├── Check: Hit ratios, cache size
├── Monitor: Memory usage, eviction rates
└── Fix: Optimize cache strategy, add external cache
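For the "High CPU Usage?" branch above, the output of `kubectl top pods` can be filtered against a threshold. This is a rough sketch: `high_cpu_pods` is a hypothetical helper name, the 500m default is an arbitrary illustration, and it assumes a metrics server is installed and the columns are NAMESPACE, NAME, CPU, MEMORY.

```bash
#!/bin/bash
# Sketch: list pods whose current CPU usage exceeds a millicore threshold.
# high_cpu_pods is a hypothetical helper name.

high_cpu_pods() {
  local threshold_m="${1:-500}"
  kubectl top pods -A --no-headers |
  awk -v t="$threshold_m" '
    {
      cpu = $3
      if (cpu ~ /m$/) { sub(/m$/, "", cpu); cpu += 0 }  # millicores, e.g. 250m
      else            { cpu = cpu * 1000 }               # whole cores, e.g. 2
      if (cpu > t) print $1 "/" $2, cpu "m"
    }'
}
```

Usage: `high_cpu_pods 800` prints `namespace/pod` and usage for every pod above 800m. Compare the result with each pod's CPU limit before deciding between raising limits and adding replicas.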
OpenShift-Specific Issues Tree
OpenShift Issues Detected?
├── Cluster Operator Degraded?
│ ├── Check: oc get clusteroperators
│ ├── Investigate: oc describe clusteroperator <name>
│ ├── Logs: oc logs -n openshift-<operator> <pod-name>
│ └── Common fixes:
│ ├── authentication/oauth → Check cert rotation
│ ├── ingress → Check router pods, certificates
│ ├── storage → Check storage class, provisioner
│ └── network → Check CNI configuration
├── SCC Violations?
│ ├── Check: Events for "unable to validate against any security context constraint"
│ ├── Diagnose: oc get scc, oc adm policy who-can use scc
│ └── Fix:
│ ├── Grant appropriate SCC to service account
│ ├── Adjust pod securityContext to match SCC
│ └── Create custom SCC for specific needs
├── BuildConfig Failures?
│ ├── Check: oc get builds, oc logs build/<build-name>
│ ├── Common issues:
│ │ ├── Source code access → Git credentials, webhook
│ │ ├── Base image not found → ImageStream, registry
│ │ ├── Build timeouts → Increase timeout, optimize build
│ │ └── Registry push failures → Permissions, quotas
│ └── Fix: Address specific build error, retry build
├── Route Issues?
│ ├── Check: oc get routes, oc describe route <name>
│ ├── Common issues:
│ │ ├── No endpoints → Service selector, pod health
│ │ ├── TLS certificate expired → Renew cert
│ │ ├── Wrong host/path → Update route spec
│ │ └── Router not responding → Check router pods
│ └── Fix: Fix underlying service or update route config
└── ImageStream Issues?
├── Check: oc get imagestreams, oc describe is <name>
├── Common issues:
│ ├── No tags/images → Trigger import, fix image reference
│ ├── Import failures → Registry access, credentials
│ └── Tag not found → Fix tag reference, re-tag image
└── Fix: Re-import image, fix registry connection
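The "Cluster Operator Degraded?" check above can be narrowed to the operators that need attention. This is a sketch only: `unhealthy_operators` is a hypothetical helper name, and the awk filter assumes the default column order (NAME, VERSION, AVAILABLE, PROGRESSING, DEGRADED, SINCE) with a non-empty VERSION column.

```bash
#!/bin/bash
# Sketch: print cluster operators that are not Available or are Degraded.
# unhealthy_operators is a hypothetical helper name.

unhealthy_operators() {
  oc get clusteroperators --no-headers |
  awk '$3 != "True" || $5 == "True" { print $1 ": available=" $3 ", degraded=" $5 }'
}
```

Each reported name can then be passed to `oc describe clusteroperator <name>` as in the tree.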
Application Not Reachable - Enhanced Tree
Application Connectivity Issue?
├── Service Level Issues?
│ ├── Service has no endpoints?
│ │ ├── Check: kubectl get endpoints <service>
│ │ ├── Verify: Service selector matches pod labels
│ │ ├── Check: Pod health and readiness
│ │ └── Fix: Update selector or fix pod issues
│ ├── Service wrong type?
│ │ ├── ClusterIP but expecting external → Use LoadBalancer/NodePort
│ │ ├── LoadBalancer not getting IP → Check cloud provider
│ │ └── NodePort not accessible → Check firewall, node ports
│ └── Service port wrong?
│ ├── Check: targetPort vs containerPort
│ ├── Verify: Protocol (TCP/UDP) matches
│ └── Fix: Update service port configuration
├── Ingress/Route Issues?
│ ├── Ingress not found or misconfigured?
│ │ ├── Check: kubectl get ingress, describe ingress
│ │ ├── Verify: Host, path, backend service
│ │ └── Fix: Update ingress configuration
│ ├── TLS Certificate Issues?
│ │ ├── Check: Certificate expiration, validity
│ │ ├── Verify: Secret exists and contains cert/key
│ │ └── Fix: Renew certificate, update secret
│ ├── Ingress Controller Issues?
│ │ ├── Check: Controller pod health
│ │ ├── Verify: Controller service endpoints
│ │ └── Fix: Restart controller, fix configuration
│ └── Route-specific (OpenShift)?
│ ├── Check: oc get routes, describe route
│ ├── Verify: Router health, certificates
│ └── Fix: Update route, check router pods
├── NetworkPolicy Blocking?
│ ├── Check: kubectl get networkpolicy
│ ├── Verify: Policy allows traffic flow
│ ├── Test: Temporarily disable policy for debugging
│ └── Fix: Add appropriate ingress/egress rules
└── Application Level Issues?
├── Application not binding to right port?
│ ├── Check: Listen address (0.0.0.0 vs 127.0.0.1)
│ ├── Verify: Port number matches containerPort
│ └── Fix: Update application bind configuration
├── Health check failures?
│ ├── Check: Liveness/readiness probe paths
│ ├── Verify: Application responds to probes
│ └── Fix: Update probe configuration or application
└── Application errors?
├── Check: Application logs for errors
├── Verify: Database connections, dependencies
└── Fix: Address application-specific issues
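The "Service has no endpoints?" branch above is often the fastest one to rule out. The helper below is a minimal sketch: `check_service_endpoints` is a hypothetical name, and it reads the core `Endpoints` object, so clusters that rely only on EndpointSlices may need a different query.

```bash
#!/bin/bash
# Sketch: report whether a Service has ready endpoints; if not, print the
# selector and the pod labels side by side for comparison.
# check_service_endpoints is a hypothetical helper name.

check_service_endpoints() {
  local svc="$1" ns="${2:-default}" ips
  ips="$(kubectl get endpoints "$svc" -n "$ns" -o jsonpath='{.subsets[*].addresses[*].ip}')"
  if [ -n "$ips" ]; then
    echo "OK: $svc has ready endpoints: $ips"
    return 0
  fi
  echo "NO ENDPOINTS: $svc has no ready addresses"
  echo "Service selector:"
  kubectl get service "$svc" -n "$ns" -o jsonpath='{.spec.selector}'
  echo
  echo "Pod labels in $ns:"
  kubectl get pods -n "$ns" --show-labels
  return 1
}
```

Usage: `check_service_endpoints <service> <namespace>`. A non-zero return with a selector that matches the pod labels points to pod readiness rather than the Service definition.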
Health Check Scripts
Cluster Health Summary
#!/bin/bash
# cluster-health.sh - Quick cluster health check
echo "=== Node Status ==="
kubectl get nodes -o wide
echo -e "\n=== Pods Not Running ==="
kubectl get pods -A --field-selector=status.phase!=Running,status.phase!=Succeeded
echo -e "\n=== Recent Warning Events ==="
kubectl get events -A --field-selector type=Warning --sort-by='.lastTimestamp' | tail -20
echo -e "\n=== Resource Pressure ==="
kubectl top nodes
echo -e "\n=== PVCs Not Bound ==="
# PVCs do not support a status.phase field selector, so filter on the STATUS column instead
kubectl get pvc -A | awk 'NR==1 || $3 != "Bound"'
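# OpenShift specific (runs only if oc is installed and the ClusterOperator API is available)
if command -v oc &> /dev/null && oc get clusteroperators &> /dev/null; then
  echo -e "\n=== Unhealthy Cluster Operators ==="
  oc get clusteroperators | grep -v "True.*False.*False"
fi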
Namespace Health Check
#!/bin/bash
# namespace-health.sh <namespace> - Per-namespace health check (defaults to "default")
NS="${1:-default}"
echo "=== Pods in $NS ==="
kubectl get pods -n "$NS" -o wide
echo -e "\n=== Recent Events ==="
kubectl get events -n "$NS" --sort-by='.lastTimestamp' | tail -15
echo -e "\n=== Resource Usage ==="
kubectl top pods -n "$NS" 2>/dev/null || echo "Metrics not available"
echo -e "\n=== Services ==="
kubectl get svc -n "$NS"
echo -e "\n=== Deployments ==="
kubectl get deploy -n "$NS"
echo -e "\n=== PVCs ==="
kubectl get pvc -n "$NS"
Quick Reference: Exit Codes
| Code | Signal | Meaning |
|---|---|---|
| 0 | - | Success |
| 1 | - | General error |
| 2 | - | Misuse of shell builtin |
| 126 | - | Command not executable |
| 127 | - | Command not found |
| 128 | - | Invalid exit argument |
| 130 | SIGINT | Keyboard interrupt |
| 137 | SIGKILL | Kill signal (OOM or forced) |
| 143 | SIGTERM | Termination signal |
| 255 | - | Exit status out of range |
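The signal rows in the table follow the 128 + N convention: 130 = 128 + 2 (SIGINT), 137 = 128 + 9 (SIGKILL), 143 = 128 + 15 (SIGTERM). A small sketch of that arithmetic, using the hypothetical helper name `decode_exit_code` and naming only the three signals listed in the table:

```bash
#!/bin/bash
# Sketch: decode an exit code using the 128 + N signal convention.
# decode_exit_code is a hypothetical helper name.

decode_exit_code() {
  local code="$1" sig
  if [ "$code" -gt 128 ] && [ "$code" -lt 255 ]; then
    sig=$((code - 128))
    case "$sig" in
      2)  echo "signal $sig (SIGINT)" ;;
      9)  echo "signal $sig (SIGKILL)" ;;
      15) echo "signal $sig (SIGTERM)" ;;
      *)  echo "signal $sig" ;;
    esac
  else
    echo "exit status $code (not a signal)"
  fi
}
```

For codes not in the table, `kill -l <N>` on the node or in a debug shell prints the name of signal N.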