| name | check-cluster-health |
| description | Performs a comprehensive health check of a Kubernetes cluster. |
| allowed-tools | Read, Grep, Glob |
# Check Cluster Health

Perform a comprehensive health check of the Kubernetes cluster infrastructure.

## When to Use
- Initial investigation of any production issue
- Before deep-diving into specific pods or services
- User reports "something is wrong" without specifics
- Periodic health checks
- Post-deployment validation
- After scaling events or cluster changes
## Skill Objective

Quickly assess the overall state of the Kubernetes cluster to identify:
- Node health and resource pressure
- Pod health across all namespaces
- System component status
- Recent critical events
- Resource constraints or bottlenecks
## Investigation Steps

### Step 1: Check Node Health

Get an overview of all nodes in the cluster:

```bash
kubectl get nodes -o wide
```
Look for:
- Nodes in NotReady state
- Node ages (very old or very new nodes)
- Kubernetes versions (version skew)
- Internal/External IPs
Expected Output:

```
NAME     STATUS     ROLES           AGE   VERSION
node-1   Ready      control-plane   45d   v1.28.0
node-2   Ready      <none>          45d   v1.28.0
node-3   Ready      <none>          45d   v1.28.0
node-4   NotReady   <none>          45d   v1.28.0   ⚠️
```
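If the node list is long, a jsonpath query can surface just each node's Ready condition. This is a sketch; it assumes the default `Ready` condition type and prints one node per line:

```bash
# Print "<node>  <Ready status>" and keep only nodes whose Ready status is
# not "True" (i.e., NotReady or Unknown).
kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.conditions[?(@.type=="Ready")].status}{"\n"}{end}' \
  | awk '$2 != "True"'
```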
### Step 2: Check Node Resource Usage

Get current CPU and memory utilization:

```bash
kubectl top nodes
```
Look for:
- CPU usage > 80%
- Memory usage > 85%
- Significant imbalance between nodes
Expected Output:

```
NAME     CPU(cores)   CPU%   MEMORY(bytes)   MEMORY%
node-1   450m         22%    4Gi             50%
node-2   890m         44%    6Gi             75%
node-3   1200m        60%    7Gi             87%   ⚠️
node-4   100m         5%     2Gi             25%
```
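To rank nodes by load instead of scanning the table, `kubectl top nodes` accepts a `--sort-by` flag on reasonably recent kubectl versions (a sketch; verify the flag is available in your client):

```bash
# Most memory-loaded nodes first; use --sort-by=cpu to rank by CPU instead.
kubectl top nodes --sort-by=memory
```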
### Step 3: Check Node Conditions

Inspect for resource pressure conditions:

```bash
kubectl describe nodes | grep -A 5 "Conditions:"
```
Look for:
- MemoryPressure: True
- DiskPressure: True
- PIDPressure: True
- NetworkUnavailable: True
Critical Conditions:

```
Conditions:
  Type             Status   Reason
  ----             ------   ------
  MemoryPressure   True     NodeHasInsufficientMemory   ⚠️
  DiskPressure     False    NodeHasSufficientDisk
  PIDPressure      False    NodeHasSufficientPID
  Ready            True     KubeletReady
```
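To see the pressure conditions for every node at once instead of paging through `kubectl describe`, a custom-columns view is a useful sketch (it assumes the default condition types listed above):

```bash
# One row per node with its pressure conditions; any "True" needs attention.
kubectl get nodes -o custom-columns='NODE:.metadata.name,MEMORY:.status.conditions[?(@.type=="MemoryPressure")].status,DISK:.status.conditions[?(@.type=="DiskPressure")].status,PID:.status.conditions[?(@.type=="PIDPressure")].status'
```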
### Step 4: Find Problematic Pods

Get all pods that are not in Running or Succeeded state:

```bash
kubectl get pods --all-namespaces --field-selector=status.phase!=Running,status.phase!=Succeeded
```

Alternative: get all pods with issues:

```bash
kubectl get pods --all-namespaces | grep -vE 'Running|Completed|Succeeded'
```
Look for:
- CrashLoopBackOff
- ImagePullBackOff
- Pending
- Error
- Evicted
- OOMKilled
Expected Output:

```
NAMESPACE    NAME              READY   STATUS             RESTARTS   AGE
api          api-service-abc   0/1     CrashLoopBackOff   5          10m   ⚠️
api          api-service-xyz   0/1     OOMKilled          3          15m   ⚠️
default      worker-123        0/1     Pending            0          5m    ⚠️
monitoring   prometheus-456    0/2     ImagePullBackOff   0          20m   ⚠️
```
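For a quick tally of pod states across the cluster (handy for the summary later), count the STATUS column. This sketch assumes the default column layout of `kubectl get pods`:

```bash
# Count pods per status; a healthy cluster is almost entirely Running/Completed.
kubectl get pods --all-namespaces --no-headers | awk '{print $4}' | sort | uniq -c | sort -rn
```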
### Step 5: Check System Components

Verify kube-system namespace health:

```bash
kubectl get pods -n kube-system
```
Critical components to check:
- kube-apiserver
- kube-controller-manager
- kube-scheduler
- etcd
- coredns (or kube-dns)
- kube-proxy
Expected Output:

```
NAME                             READY   STATUS    RESTARTS   AGE
coredns-565d847f94-abcde         1/1     Running   0          45d
coredns-565d847f94-fghij         1/1     Running   0          45d
etcd-node-1                      1/1     Running   0          45d
kube-apiserver-node-1            1/1     Running   0          45d
kube-controller-manager-node-1   1/1     Running   0          45d
kube-proxy-klmno                 1/1     Running   0          45d
kube-scheduler-node-1            1/1     Running   0          45d
```
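Control-plane pods can be Running yet quietly restarting; sorting by restart count surfaces this early. A sketch that sorts on the first container's restart count:

```bash
# kube-system pods ordered by restarts of their first container (highest last).
kubectl get pods -n kube-system --sort-by='.status.containerStatuses[0].restartCount'
```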
### Step 6: Review Recent Critical Events

Get the most recent events, sorted by timestamp and filtered for warnings and errors:

```bash
kubectl get events --all-namespaces --sort-by='.lastTimestamp' | tail -50 | grep -E 'Warning|Error'
```

Alternative, more structured:

```bash
kubectl get events --all-namespaces --sort-by='.lastTimestamp' --field-selector type!=Normal
```
Look for patterns:
- Repeated OOMKilled events
- FailedScheduling (resource constraints)
- FailedMount (volume issues)
- ImagePullBackOff (registry issues)
- Evictions (resource pressure)
- BackOff (crashing containers)
Expected Output:

```
10m   Warning   FailedScheduling   pod/worker-123    0/4 nodes available: insufficient memory
8m    Warning   BackOff            pod/api-service   Back-off restarting failed container
5m    Warning   OOMKilled          pod/api-service   Container exceeded memory limit
3m    Warning   Evicted            pod/cache-789     The node was low on resource: memory
```
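To expose repeating failure patterns rather than reading individual event lines, group warning events by reason (a sketch; custom-columns is used so field positions do not matter):

```bash
# Count warning events per reason, most frequent first.
kubectl get events --all-namespaces --field-selector type=Warning \
  -o custom-columns='REASON:.reason' --no-headers | sort | uniq -c | sort -rn
```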
### Step 7: Check for Evicted Pods

Find pods that were evicted due to resource pressure:

```bash
kubectl get pods --all-namespaces --field-selector=status.phase=Failed | grep Evicted
```
Evictions indicate:
- Node resource pressure (memory/disk)
- Need for resource limits/requests tuning
- Possible need for cluster scaling
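To separate evictions from other failures and see why each pod failed, print the failure reason and message directly (a sketch; evicted pods typically report `Evicted` in `status.reason`):

```bash
# One line per failed pod with its reason and message (evictions name the
# resource that triggered them).
kubectl get pods --all-namespaces --field-selector=status.phase=Failed \
  -o custom-columns='NAMESPACE:.metadata.namespace,NAME:.metadata.name,REASON:.status.reason,MESSAGE:.status.message'
```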
### Step 8: Review Resource Allocation

Check resource allocation on each node:

```bash
kubectl describe nodes | grep -A 7 "Allocated resources:"
```
Look for:
- CPU allocation > 80%
- Memory allocation > 80%
- Pods per node approaching limits
Expected Output:

```
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource            Requests      Limits
  --------            --------      ------
  cpu                 3800m (95%)   7200m (180%)   ⚠️
  memory              24Gi (75%)    32Gi (100%)    ⚠️
  ephemeral-storage   0 (0%)        0 (0%)
  hugepages-2Mi       0 (0%)        0 (0%)
```
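To compare allocation across nodes side by side rather than reading one long `describe` dump, loop over the nodes. This is a sketch built on the same "Allocated resources:" section shown above:

```bash
# Print the allocation block for every node, prefixed with the node name.
for node in $(kubectl get nodes -o name); do
  echo "== ${node} =="
  kubectl describe "${node}" | grep -A 7 "Allocated resources:"
done
```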
## MCP Tools to Use

```
kubernetes.get_nodes()
kubernetes.get_node_metrics()
kubernetes.describe_node(node_name)
kubernetes.get_pods(namespace="all", field_selector="status.phase!=Running")
kubernetes.get_pods(namespace="kube-system")
kubernetes.get_events(namespace="all", since="1h", field_selector="type!=Normal")
```
## Output Format

Provide a structured summary in this format:
# CLUSTER HEALTH SUMMARY
========================
## Cluster Overview
- **Total Nodes:** 4
- **Healthy Nodes:** 3
- **Unhealthy Nodes:** 1
- **Kubernetes Version:** v1.28.0
## Node Health
### Healthy Nodes ✓
- node-1: Ready (CPU: 22%, Memory: 50%)
- node-2: Ready (CPU: 44%, Memory: 75%)
- node-3: Ready (CPU: 60%, Memory: 87%) ⚠️ High memory
### Unhealthy Nodes ⚠️
- **node-4:** NotReady
  - Condition: KubeletNotReady
  - Reason: Node had insufficient memory
  - Duration: 15 minutes
## Pod Health Summary
**Total Pods:** 127
- Running: 120
- Pending: 4 ⚠️
- CrashLoopBackOff: 2 ⚠️
- ImagePullBackOff: 1 ⚠️
### Critical Pod Issues
1. **api-service-abc** (namespace: api)
   - Status: CrashLoopBackOff
   - Restarts: 5 times in 10 minutes
   - Action needed: Investigate with debug-pod-issues skill
2. **api-service-xyz** (namespace: api)
   - Status: OOMKilled
   - Restarts: 3 times in 15 minutes
   - Action needed: Memory limit investigation required
3. **worker-123** (namespace: default)
   - Status: Pending
   - Reason: Insufficient memory to schedule
   - Action needed: Resource analysis needed
4. **prometheus-456** (namespace: monitoring)
   - Status: ImagePullBackOff
   - Reason: Failed to pull image
   - Action needed: Check registry connectivity
## System Components ✓
All critical system components healthy:
- coredns: 2/2 pods running
- kube-apiserver: Running
- kube-controller-manager: Running
- kube-scheduler: Running
- etcd: Running
- kube-proxy: DaemonSet 4/4 ready
## Recent Critical Events (Last 60 minutes)
**OOM Kills:** 3 occurrences
- 14:23: api-service-xyz OOMKilled (namespace: api)
- 14:25: api-service-xyz OOMKilled (namespace: api)
- 14:27: api-service-xyz OOMKilled (namespace: api)
**Scheduling Failures:** 4 occurrences
- 14:20: worker-123 FailedScheduling: insufficient memory
- 14:22: worker-456 FailedScheduling: insufficient memory
- 14:25: worker-789 FailedScheduling: insufficient memory
- 14:28: cache-abc FailedScheduling: insufficient cpu
**Node Issues:**
- 14:15: node-4 NodeNotReady: KubeletNotReady
**Evictions:** 2 occurrences
- 14:18: cache-xyz Evicted: node low on memory
- 14:22: cache-abc Evicted: node low on memory
## Resource Pressure Analysis
### Node-4: MemoryPressure Detected ⚠️
- Current usage: 28Gi / 32Gi (87%)
- Condition: MemoryPressure True
- Impact: Pods may be evicted
- Action: Investigate high memory consumers
### Cluster-Wide Resource Allocation
- **CPU:** 75% allocated (approaching capacity)
- **Memory:** 82% allocated ⚠️ (critical threshold)
- **Risk:** New pods may not schedule
## Issues Detected
### 🚨 CRITICAL Issues (Require Immediate Action)
1. **Multiple OOM Kills in api namespace**
   - Impact: Service degradation/outages
   - Pods affected: api-service-xyz
   - Recommendation: Increase memory limits or investigate memory leak
   - Next step: Use `debug-pod-issues` skill
2. **Node-4 Unhealthy (NotReady)**
   - Impact: Reduced cluster capacity
   - Duration: 15 minutes
   - Recommendation: Investigate node logs, consider cordoning/draining
   - Next step: SSH to node or check kubelet logs
3. **Cluster Memory Capacity Critical (82% allocated)**
   - Impact: Risk of scheduling failures
   - Pods pending: 4
   - Recommendation: Scale cluster or optimize workloads
   - Next step: Use `analyze-resource-usage` skill
### ⚠️ WARNING Issues (Should Be Addressed)
4. **Node-3 High Memory Usage (87%)**
   - Impact: Risk of pressure condition
   - Current state: Still Ready
   - Recommendation: Monitor closely, consider rebalancing pods
5. **ImagePullBackOff in monitoring namespace**
   - Impact: Prometheus not available
   - Likely cause: Registry connectivity or credentials
   - Recommendation: Check image repository access
## Recommended Actions (Priority Order)
### Immediate (Next 15 minutes)
1. **Investigate api-service OOM kills** → Use `debug-pod-issues` skill on api-service-xyz
2. **Check node-4 status** → SSH to node or review kubelet logs
3. **Review pending pods** → Use `analyze-resource-usage` to understand capacity
### Short Term (Next hour)
4. Increase memory limits for api-service pods
5. Consider scaling cluster (add nodes or upsize)
6. Fix ImagePullBackOff for prometheus
7. Investigate memory usage on node-3
### Long Term (This week)
8. Implement pod resource requests/limits across all workloads
9. Set up cluster autoscaling
10. Review and optimize memory-intensive workloads
11. Implement monitoring alerts for:
    - Node NotReady conditions
    - OOM kill events
    - Resource allocation thresholds (>80%)
    - Pod evictions
## Next Steps
Based on the findings, I recommend:
1. **Deep dive into OOM issues** → Skill: `debug-pod-issues`
   - Target: api-service-xyz in api namespace
2. **Analyze resource usage patterns** → Skill: `analyze-resource-usage`
   - Focus on memory consumption and allocation
3. **Check logs for crash patterns** → Skill: `inspect-logs`
   - Target: api-service pods for error patterns
Would you like me to proceed with investigating the OOM kills in the api-service pods?
## Red Flags to Watch For
- 🚨 Node NotReady status - Immediate impact on capacity
- 🚨 Multiple pods in CrashLoopBackOff - Application issues
- 🚨 Repeated OOMKilled events - Memory configuration problems
- 🚨 System component failures - Cluster instability
- 🚨 High resource allocation (>85%) - Scheduling issues imminent
- ⚠️ High restart counts (>5 in the last hour) - Application instability (see the one-liner after this list)
- ⚠️ Pending pods - Resource constraints
- ⚠️ ImagePullBackOff - Registry or networking issues
- ⚠️ Volume mount failures - Storage problems
- ⚠️ Evicted pods - Node resource pressure
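A quick sweep for the restart-count red flag across all namespaces (a sketch; the ">5 in the last hour" threshold above is a judgment call, and the sort key is the first container's restart count):

```bash
# Pods with the highest restart counts appear at the bottom of the output.
kubectl get pods --all-namespaces --sort-by='.status.containerStatuses[0].restartCount' | tail -15
```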
## Decision Tree - Next Skill to Use

Based on findings, recommend the next skill:

- If OOMKilled or CrashLoopBackOff detected → use the `debug-pod-issues` skill
- If high CPU/memory usage detected → use the `analyze-resource-usage` skill
- If connection errors appear in events → use the `check-network-connectivity` skill
- If events show errors but pods are running → use the `inspect-logs` skill
- If there are multiple issues → prioritize by severity, starting with pod crashes
## Common Patterns & Root Causes

**Pattern: Multiple OOM Kills**
- Indicates: Memory limits too low or memory leak
- Next Action: `debug-pod-issues` + `inspect-logs`

**Pattern: Many Pending Pods**
- Indicates: Insufficient cluster capacity
- Next Action: `analyze-resource-usage`

**Pattern: Node NotReady + Evictions**
- Indicates: Node resource exhaustion
- Next Action: Investigate node directly, consider draining

**Pattern: System Component Failure**
- Indicates: Critical cluster issue
- Next Action: Immediate investigation, possibly escalate

**Pattern: ImagePullBackOff**
- Indicates: Registry access issues
- Next Action: Check network connectivity, registry credentials
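For the ImagePullBackOff pattern in particular, the exact registry error is usually recorded in the pod's events. The pod and namespace below are the ones from the example output earlier and should be replaced with the real names:

```bash
# Show the pull error recorded in the pod's events (example pod/namespace).
kubectl describe pod prometheus-456 -n monitoring | grep -A 10 "Events:"
```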
## Skill Completion Criteria
This skill is complete when:
- ✓ Node health assessed
- ✓ Pod health across all namespaces evaluated
- ✓ System components verified
- ✓ Recent critical events reviewed
- ✓ Issues categorized by severity
- ✓ Recommended next steps provided
- ✓ Clear indication of which skill to use next
## Notes for Agent
- Always start with this skill for vague issues
- Provide executive summary before detailed findings
- Categorize issues by severity (Critical/Warning/Info)
- Explicitly recommend next skill based on findings
- Include specific pod names, namespaces, timestamps
- Highlight patterns, not just individual issues
- Keep summary concise but comprehensive