| name | check-cluster-health |
| description | Performs a comprehensive health check of a Kubernetes cluster. |
| allowed-tools | Read, Grep, Glob |
# Check Cluster Health

Perform a comprehensive health check of the Kubernetes cluster infrastructure.

## When to Use
- Initial investigation of any production issue
- Before deep-diving into specific pods or services
- User reports "something is wrong" without specifics
- Periodic health checks
- Post-deployment validation
- After scaling events or cluster changes
## Skill Objective

Quickly assess the overall state of the Kubernetes cluster to identify:
- Node health and resource pressure
- Pod health across all namespaces
- System component status
- Recent critical events
- Resource constraints or bottlenecks
## Investigation Steps

### Step 1: Check Node Health

Get an overview of all nodes in the cluster:

```bash
kubectl get nodes -o wide
```
Look for:
- Nodes in NotReady state
- Node ages (very old or very new nodes)
- Kubernetes versions (version skew)
- Internal/External IPs
Expected Output:

```
NAME     STATUS     ROLES           AGE   VERSION
node-1   Ready      control-plane   45d   v1.28.0
node-2   Ready      <none>          45d   v1.28.0
node-3   Ready      <none>          45d   v1.28.0
node-4   NotReady   <none>          45d   v1.28.0   ⚠️
```
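If the node list is long, a jsonpath query can surface just each node's Ready condition. This is a sketch; it assumes the default `Ready` condition type and prints one node per line:

```bash
# Print "<node>  <Ready status>" and keep only nodes whose Ready status is
# not "True" (i.e., NotReady or Unknown).
kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.conditions[?(@.type=="Ready")].status}{"\n"}{end}' \
  | awk '$2 != "True"'
```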
### Step 2: Check Node Resource Usage

Get current CPU and memory utilization:

```bash
kubectl top nodes
```
Look for:
- CPU usage > 80%
- Memory usage > 85%
- Significant imbalance between nodes
Expected Output:

```
NAME     CPU(cores)   CPU%   MEMORY(bytes)   MEMORY%
node-1   450m         22%    4Gi             50%
node-2   890m         44%    6Gi             75%
node-3   1200m        60%    7Gi             87%   ⚠️
node-4   100m         5%     2Gi             25%
```
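To rank nodes by load instead of scanning the table, `kubectl top nodes` accepts a `--sort-by` flag on reasonably recent kubectl versions (a sketch; verify the flag is available in your client):

```bash
# Most memory-loaded nodes first; use --sort-by=cpu to rank by CPU instead.
kubectl top nodes --sort-by=memory
```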
### Step 3: Check Node Conditions

Inspect for resource pressure conditions:

```bash
kubectl describe nodes | grep -A 5 "Conditions:"
```
Look for:
- MemoryPressure: True
- DiskPressure: True
- PIDPressure: True
- NetworkUnavailable: True
Critical Conditions:

```
Conditions:
  Type             Status   Reason
  ----             ------   ------
  MemoryPressure   True     NodeHasInsufficientMemory   ⚠️
  DiskPressure     False    NodeHasSufficientDisk
  PIDPressure      False    NodeHasSufficientPID
  Ready            True     KubeletReady
```
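To see the pressure conditions for every node at once instead of paging through `kubectl describe`, a custom-columns view is a useful sketch (it assumes the default condition types listed above):

```bash
# One row per node with its pressure conditions; any "True" needs attention.
kubectl get nodes -o custom-columns='NODE:.metadata.name,MEMORY:.status.conditions[?(@.type=="MemoryPressure")].status,DISK:.status.conditions[?(@.type=="DiskPressure")].status,PID:.status.conditions[?(@.type=="PIDPressure")].status'
```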
### Step 4: Find Problematic Pods

Get all pods that are not in Running or Succeeded state:

```bash
kubectl get pods --all-namespaces --field-selector=status.phase!=Running,status.phase!=Succeeded
```

Alternative: get all pods with issues:

```bash
kubectl get pods --all-namespaces | grep -vE 'Running|Completed|Succeeded'
```
Look for:
- CrashLoopBackOff
- ImagePullBackOff
- Pending
- Error
- Evicted
- OOMKilled
Expected Output:

```
NAMESPACE    NAME              READY   STATUS             RESTARTS   AGE
api          api-service-abc   0/1     CrashLoopBackOff   5          10m   ⚠️
api          api-service-xyz   0/1     OOMKilled          3          15m   ⚠️
default      worker-123        0/1     Pending            0          5m    ⚠️
monitoring   prometheus-456    0/2     ImagePullBackOff   0          20m   ⚠️
```
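For a quick tally of pod states across the cluster (handy for the summary later), count the STATUS column. This sketch assumes the default column layout of `kubectl get pods`:

```bash
# Count pods per status; a healthy cluster is almost entirely Running/Completed.
kubectl get pods --all-namespaces --no-headers | awk '{print $4}' | sort | uniq -c | sort -rn
```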
### Step 5: Check System Components

Verify kube-system namespace health:

```bash
kubectl get pods -n kube-system
```
Critical components to check:
- kube-apiserver
- kube-controller-manager
- kube-scheduler
- etcd
- coredns (or kube-dns)
- kube-proxy
Expected Output:

```
NAME                             READY   STATUS    RESTARTS   AGE
coredns-565d847f94-abcde         1/1     Running   0          45d
coredns-565d847f94-fghij         1/1     Running   0          45d
etcd-node-1                      1/1     Running   0          45d
kube-apiserver-node-1            1/1     Running   0          45d
kube-controller-manager-node-1   1/1     Running   0          45d
kube-proxy-klmno                 1/1     Running   0          45d
kube-scheduler-node-1            1/1     Running   0          45d
```
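Control-plane pods can be Running yet quietly restarting; sorting by restart count surfaces this early. A sketch that sorts on the first container's restart count:

```bash
# kube-system pods ordered by restarts of their first container (highest last).
kubectl get pods -n kube-system --sort-by='.status.containerStatuses[0].restartCount'
```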
### Step 6: Review Recent Critical Events

Get the most recent events, sorted by timestamp and filtered for warnings and errors:

```bash
kubectl get events --all-namespaces --sort-by='.lastTimestamp' | tail -50 | grep -E 'Warning|Error'
```

Alternative, more structured:

```bash
kubectl get events --all-namespaces --sort-by='.lastTimestamp' --field-selector type!=Normal
```
Look for patterns:
- Repeated OOMKilled events
- FailedScheduling (resource constraints)
- FailedMount (volume issues)
- ImagePullBackOff (registry issues)
- Evictions (resource pressure)
- BackOff (crashing containers)
Expected Output:

```
10m   Warning   FailedScheduling   pod/worker-123    0/4 nodes available: insufficient memory
8m    Warning   BackOff            pod/api-service   Back-off restarting failed container
5m    Warning   OOMKilled          pod/api-service   Container exceeded memory limit
3m    Warning   Evicted            pod/cache-789     The node was low on resource: memory
```
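To expose repeating failure patterns rather than reading individual event lines, group warning events by reason (a sketch; custom-columns is used so field positions do not matter):

```bash
# Count warning events per reason, most frequent first.
kubectl get events --all-namespaces --field-selector type=Warning \
  -o custom-columns='REASON:.reason' --no-headers | sort | uniq -c | sort -rn
```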
### Step 7: Check for Evicted Pods

Find pods that were evicted due to resource pressure:

```bash
kubectl get pods --all-namespaces --field-selector=status.phase=Failed | grep Evicted
```
Evictions indicate:
- Node resource pressure (memory/disk)
- Need for resource limits/requests tuning
- Possible need for cluster scaling
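To separate evictions from other failures and see why each pod failed, print the failure reason and message directly (a sketch; evicted pods typically report `Evicted` in `status.reason`):

```bash
# One line per failed pod with its reason and message (evictions name the
# resource that triggered them).
kubectl get pods --all-namespaces --field-selector=status.phase=Failed \
  -o custom-columns='NAMESPACE:.metadata.namespace,NAME:.metadata.name,REASON:.status.reason,MESSAGE:.status.message'
```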
### Step 8: Review Resource Allocation

Check resource allocation on each node:

```bash
kubectl describe nodes | grep -A 7 "Allocated resources:"
```
Look for:
- CPU allocation > 80%
- Memory allocation > 80%
- Pods per node approaching limits
Expected Output:

```
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource            Requests      Limits
  --------            --------      ------
  cpu                 3800m (95%)   7200m (180%)   ⚠️
  memory              24Gi (75%)    32Gi (100%)    ⚠️
  ephemeral-storage   0 (0%)        0 (0%)
  hugepages-2Mi       0 (0%)        0 (0%)
```
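To compare allocation across nodes side by side rather than reading one long `describe` dump, loop over the nodes. This is a sketch built on the same "Allocated resources:" section shown above:

```bash
# Print the allocation block for every node, prefixed with the node name.
for node in $(kubectl get nodes -o name); do
  echo "== ${node} =="
  kubectl describe "${node}" | grep -A 7 "Allocated resources:"
done
```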
## MCP Tools to Use

```
kubernetes.get_nodes()
kubernetes.get_node_metrics()
kubernetes.describe_node(node_name)
kubernetes.get_pods(namespace="all", field_selector="status.phase!=Running")
kubernetes.get_pods(namespace="kube-system")
kubernetes.get_events(namespace="all", since="1h", field_selector="type!=Normal")
```
## Output Format

Provide a structured summary in this format:
# CLUSTER HEALTH SUMMARY
========================
## Cluster Overview
- **Total Nodes:** 4
- **Healthy Nodes:** 3
- **Unhealthy Nodes:** 1
- **Kubernetes Version:** v1.28.0
## Node Health
### Healthy Nodes ✓
- node-1: Ready (CPU: 22%, Memory: 50%)
- node-2: Ready (CPU: 44%, Memory: 75%)
- node-3: Ready (CPU: 60%, Memory: 87%) ⚠️ High memory
### Unhealthy Nodes ⚠️
- **node-4:** NotReady
  - Condition: KubeletNotReady
  - Reason: Node had insufficient memory
  - Duration: 15 minutes
## Pod Health Summary
**Total Pods:** 127
- Running: 120
- Pending: 4 ⚠️
- CrashLoopBackOff: 2 ⚠️
- ImagePullBackOff: 1 ⚠️
### Critical Pod Issues
1. **api-service-abc** (namespace: api)
   - Status: CrashLoopBackOff
   - Restarts: 5 times in 10 minutes
   - Action needed: Investigate with debug-pod-issues skill
2. **api-service-xyz** (namespace: api)
   - Status: OOMKilled
   - Restarts: 3 times in 15 minutes
   - Action needed: Memory limit investigation required
3. **worker-123** (namespace: default)
   - Status: Pending
   - Reason: Insufficient memory to schedule
   - Action needed: Resource analysis needed
4. **prometheus-456** (namespace: monitoring)
   - Status: ImagePullBackOff
   - Reason: Failed to pull image
   - Action needed: Check registry connectivity
## System Components ✓
All critical system components healthy:
- coredns: 2/2 pods running
- kube-apiserver: Running
- kube-controller-manager: Running
- kube-scheduler: Running
- etcd: Running
- kube-proxy: DaemonSet 4/4 ready
## Recent Critical Events (Last 60 minutes)
**OOM Kills:** 3 occurrences
- 14:23: api-service-xyz OOMKilled (namespace: api)
- 14:25: api-service-xyz OOMKilled (namespace: api)
- 14:27: api-service-xyz OOMKilled (namespace: api)
**Scheduling Failures:** 4 occurrences
- 14:20: worker-123 FailedScheduling: insufficient memory
- 14:22: worker-456 FailedScheduling: insufficient memory
- 14:25: worker-789 FailedScheduling: insufficient memory
- 14:28: cache-abc FailedScheduling: insufficient cpu
**Node Issues:**
- 14:15: node-4 NodeNotReady: KubeletNotReady
**Evictions:** 2 occurrences
- 14:18: cache-xyz Evicted: node low on memory
- 14:22: cache-abc Evicted: node low on memory
## Resource Pressure Analysis
### Node-4: MemoryPressure Detected ⚠️
- Current usage: 28Gi / 32Gi (87%)
- Condition: MemoryPressure True
- Impact: Pods may be evicted
- Action: Investigate high memory consumers
### Cluster-Wide Resource Allocation
- **CPU:** 75% allocated (approaching capacity)
- **Memory:** 82% allocated ⚠️ (critical threshold)
- **Risk:** New pods may not schedule
## Issues Detected
### 🚨 CRITICAL Issues (Require Immediate Action)
1. **Multiple OOM Kills in api namespace**
   - Impact: Service degradation/outages
   - Pods affected: api-service-xyz
   - Recommendation: Increase memory limits or investigate memory leak
   - Next step: Use `debug-pod-issues` skill
2. **Node-4 Unhealthy (NotReady)**
   - Impact: Reduced cluster capacity
   - Duration: 15 minutes
   - Recommendation: Investigate node logs, consider cordoning/draining
   - Next step: SSH to node or check kubelet logs
3. **Cluster Memory Capacity Critical (82% allocated)**
   - Impact: Risk of scheduling failures
   - Pods pending: 4
   - Recommendation: Scale cluster or optimize workloads
   - Next step: Use `analyze-resource-usage` skill
### ⚠️ WARNING Issues (Should Be Addressed)
4. **Node-3 High Memory Usage (87%)**
   - Impact: Risk of pressure condition
   - Current state: Still Ready
   - Recommendation: Monitor closely, consider rebalancing pods
5. **ImagePullBackOff in monitoring namespace**
   - Impact: Prometheus not available
   - Likely cause: Registry connectivity or credentials
   - Recommendation: Check image repository access
## Recommended Actions (Priority Order)
### Immediate (Next 15 minutes)
1. **Investigate api-service OOM kills** → Use `debug-pod-issues` skill on api-service-xyz
2. **Check node-4 status** → SSH to node or review kubelet logs
3. **Review pending pods** → Use `analyze-resource-usage` to understand capacity
### Short Term (Next hour)
4. Increase memory limits for api-service pods
5. Consider scaling cluster (add nodes or upsize)
6. Fix ImagePullBackOff for prometheus
7. Investigate memory usage on node-3
### Long Term (This week)
8. Implement pod resource requests/limits across all workloads
9. Set up cluster autoscaling
10. Review and optimize memory-intensive workloads
11. Implement monitoring alerts for:
    - Node NotReady conditions
    - OOM kill events
    - Resource allocation thresholds (>80%)
    - Pod evictions
## Next Steps
Based on the findings, I recommend:
1. **Deep dive into OOM issues** → Skill: `debug-pod-issues`
   - Target: api-service-xyz in api namespace
2. **Analyze resource usage patterns** → Skill: `analyze-resource-usage`
   - Focus on memory consumption and allocation
3. **Check logs for crash patterns** → Skill: `inspect-logs`
   - Target: api-service pods for error patterns
Would you like me to proceed with investigating the OOM kills in the api-service pods?
## Red Flags to Watch For
- 🚨 Node NotReady status - Immediate impact on capacity
- 🚨 Multiple pods in CrashLoopBackOff - Application issues
- 🚨 Repeated OOMKilled events - Memory configuration problems
- 🚨 System component failures - Cluster instability
- 🚨 High resource allocation (>85%) - Scheduling issues imminent
- ⚠️ High restart counts (>5 in the last hour) - Application instability (see the one-liner after this list)
- ⚠️ Pending pods - Resource constraints
- ⚠️ ImagePullBackOff - Registry or networking issues
- ⚠️ Volume mount failures - Storage problems
- ⚠️ Evicted pods - Node resource pressure
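A quick sweep for the restart-count red flag across all namespaces (a sketch; the ">5 in the last hour" threshold above is a judgment call, and the sort key is the first container's restart count):

```bash
# Pods with the highest restart counts appear at the bottom of the output.
kubectl get pods --all-namespaces --sort-by='.status.containerStatuses[0].restartCount' | tail -15
```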
## Decision Tree - Next Skill to Use

Based on findings, recommend the next skill:

- If OOMKilled or CrashLoopBackOff detected → use the `debug-pod-issues` skill
- If high CPU/memory usage detected → use the `analyze-resource-usage` skill
- If connection errors appear in events → use the `check-network-connectivity` skill
- If events show errors but pods are running → use the `inspect-logs` skill
- If there are multiple issues → prioritize by severity, starting with pod crashes
## Common Patterns & Root Causes

**Pattern: Multiple OOM Kills**
- Indicates: Memory limits too low or memory leak
- Next Action: `debug-pod-issues` + `inspect-logs`

**Pattern: Many Pending Pods**
- Indicates: Insufficient cluster capacity
- Next Action: `analyze-resource-usage`

**Pattern: Node NotReady + Evictions**
- Indicates: Node resource exhaustion
- Next Action: Investigate node directly, consider draining

**Pattern: System Component Failure**
- Indicates: Critical cluster issue
- Next Action: Immediate investigation, possibly escalate

**Pattern: ImagePullBackOff**
- Indicates: Registry access issues
- Next Action: Check network connectivity, registry credentials
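For the ImagePullBackOff pattern in particular, the exact registry error is usually recorded in the pod's events. The pod and namespace below are the ones from the example output earlier and should be replaced with the real names:

```bash
# Show the pull error recorded in the pod's events (example pod/namespace).
kubectl describe pod prometheus-456 -n monitoring | grep -A 10 "Events:"
```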
## Skill Completion Criteria
This skill is complete when:
- ✓ Node health assessed
- ✓ Pod health across all namespaces evaluated
- ✓ System components verified
- ✓ Recent critical events reviewed
- ✓ Issues categorized by severity
- ✓ Recommended next steps provided
- ✓ Clear indication of which skill to use next
## Notes for Agent
- Always start with this skill for vague issues
- Provide executive summary before detailed findings
- Categorize issues by severity (Critical/Warning/Info)
- Explicitly recommend next skill based on findings
- Include specific pod names, namespaces, timestamps
- Highlight patterns, not just individual issues
- Keep summary concise but comprehensive