name

Must-Gather Analyzer

description

Analyze OpenShift must-gather diagnostic data including cluster operators, pods, nodes, and network components. Use this skill when the user asks about cluster health, operator status, pod issues, node conditions, or wants diagnostic insights from must-gather data. Triggers: "analyze must-gather", "check cluster health", "operator status", "pod issues", "node status", "failing pods", "degraded operators", "cluster problems", "crashlooping", "network issues", "etcd health", "analyze clusteroperators", "analyze pods", "analyze nodes"

Must-Gather Analyzer Skill

Comprehensive analysis of OpenShift must-gather diagnostic data with helper scripts that parse YAML and display output in oc-like format.

Overview

This skill provides analysis for:

ClusterVersion: Current version, update status, and capabilities
Cluster Operators: Status, degradation, and availability
Pods: Health, restarts, crashes, and failures across namespaces
Nodes: Conditions, capacity, and readiness
Network: OVN/SDN diagnostics and connectivity
Events: Warning and error events across namespaces
etcd: Cluster health, member status, and quorum
Storage: PersistentVolumes and PersistentVolumeClaims status

Must-Gather Directory Structure

Important: Must-gather data is contained in a subdirectory with a long hash name:

must-gather/
└── registry-ci-openshift-org-origin-...-sha256-<hash>/
    ├── cluster-scoped-resources/
    │   ├── config.openshift.io/clusteroperators/
    │   └── core/nodes/
    ├── namespaces/
    │   └── <namespace>/
    │       └── pods/
    │           └── <pod-name>/
    │               └── <pod-name>.yaml
    └── network_logs/

The analysis scripts expect the path to the subdirectory (the one with the hash), not the root must-gather folder.

Instructions

1. Get Must-Gather Path

Ask the user for the must-gather directory path if not already provided.

If they provide the root directory, look for the subdirectory with the hash name
The correct path contains cluster-scoped-resources/ and namespaces/ directories
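The path check above can be sketched in a few lines of Python. This is a minimal illustration, not part of the skill's scripts: `resolve_must_gather_path` and `is_data_dir` are hypothetical helper names, and the sketch assumes the data directory sits at most one level below the path the user supplies.

```python
from pathlib import Path
from typing import Optional

# Directories that the correct must-gather path is expected to contain.
REQUIRED_DIRS = ("cluster-scoped-resources", "namespaces")


def is_data_dir(path: Path) -> bool:
    """Return True if path holds both expected must-gather directories."""
    return all((path / name).is_dir() for name in REQUIRED_DIRS)


def resolve_must_gather_path(path: str) -> Optional[Path]:
    """Return the data directory for a user-supplied path, or None.

    Accepts either the data directory itself or the root folder one
    level above it (hypothetical helper, assumed layout).
    """
    candidate = Path(path)
    if is_data_dir(candidate):
        return candidate
    if not candidate.is_dir():
        return None
    # Sort for a deterministic choice if several subdirectories qualify.
    for child in sorted(candidate.iterdir()):
        if child.is_dir() and is_data_dir(child):
            return child
    return None
```

If the function returns None, ask the user to confirm the path rather than guessing.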

2. Choose Analysis Type

Based on the user's request, run the appropriate helper script:

ClusterVersion Analysis

./scripts/analyze_clusterversion.py <must-gather-path>

Shows cluster version information similar to oc get clusterversion:

Current version and update status
Progressing state
Available updates
Version conditions
Enabled capabilities
Update history

Cluster Operators Analysis

./scripts/analyze_clusteroperators.py <must-gather-path>

Shows cluster operator status similar to oc get clusteroperators:

Available, Progressing, Degraded conditions
Version information
Time since condition change
Detailed messages for operators with issues
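To make the three conditions concrete, here is a rough sketch of how an operator might be flagged from its `status.conditions` list. It assumes the ClusterOperator YAML has already been loaded into a plain dict. `operator_issues` and `condition_status` are hypothetical names, and the actual logic in analyze_clusteroperators.py may differ.

```python
from typing import List


def condition_status(conditions: List[dict], cond_type: str) -> str:
    """Return the status string of the named condition, or "Unknown"."""
    for cond in conditions:
        if cond.get("type") == cond_type:
            return cond.get("status", "Unknown")
    return "Unknown"


def operator_issues(operator: dict) -> List[str]:
    """List the problems suggested by a ClusterOperator's conditions.

    Assumed rule for this sketch: an operator is healthy when it is
    Available=True, Degraded is not True, and Progressing is not True.
    """
    conditions = (operator.get("status") or {}).get("conditions") or []
    issues = []
    if condition_status(conditions, "Available") != "True":
        issues.append("unavailable")
    if condition_status(conditions, "Degraded") == "True":
        issues.append("degraded")
    if condition_status(conditions, "Progressing") == "True":
        issues.append("progressing")
    return issues
```

An empty result means none of the three conditions indicates a problem. A missing Available condition is treated as unavailable, which is a deliberately cautious choice.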

Pods Analysis

# All namespaces
./scripts/analyze_pods.py <must-gather-path>

# Specific namespace
./scripts/analyze_pods.py <must-gather-path> --namespace <namespace>

# Show only problematic pods
./scripts/analyze_pods.py <must-gather-path> --problems-only

Shows pod status similar to oc get pods -A:

Ready/Total containers
Status (Running, Pending, CrashLoopBackOff, etc.)
Restart counts
Age
Categorized issues (crashlooping, pending, failed)

Nodes Analysis

./scripts/analyze_nodes.py <must-gather-path>

# Show only nodes with issues
./scripts/analyze_nodes.py <must-gather-path> --problems-only

Shows node status similar to oc get nodes:

Ready status
Roles (master, worker)
Age
Kubernetes version
Node conditions (DiskPressure, MemoryPressure, etc.)
Capacity and allocatable resources

Network Analysis

./scripts/analyze_network.py <must-gather-path>

Shows network health:

Network type (OVN-Kubernetes, OpenShift SDN)
Network operator status
OVN pod health
PodNetworkConnectivityCheck results
Network-related issues

Events Analysis

# Recent events (last 100)
./scripts/analyze_events.py <must-gather-path>

# Warning events only
./scripts/analyze_events.py <must-gather-path> --type Warning

# Events in specific namespace
./scripts/analyze_events.py <must-gather-path> --namespace openshift-etcd

# Show last 50 events
./scripts/analyze_events.py <must-gather-path> --count 50

Shows cluster events:

Event type (Warning, Normal)
Last seen timestamp
Reason and message
Affected object
Event count
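The filtering and ordering described above could look roughly like the following. This is a sketch under assumptions: events are already loaded as dicts with the standard `type`, `lastTimestamp`, and `metadata.namespace` fields, timestamps are uniform RFC 3339 UTC strings (so string order matches time order), and `recent_events` is a hypothetical name rather than a function from analyze_events.py.

```python
from typing import List, Optional


def recent_events(
    events: List[dict],
    event_type: Optional[str] = None,
    namespace: Optional[str] = None,
    count: int = 100,
) -> List[dict]:
    """Return up to `count` most recent events, oldest first."""
    selected = [
        e for e in events
        if (event_type is None or e.get("type") == event_type)
        and (namespace is None
             or (e.get("metadata") or {}).get("namespace") == namespace)
    ]
    # Events without a lastTimestamp sort first (treated as oldest).
    selected.sort(key=lambda e: e.get("lastTimestamp") or "")
    # Guard count <= 0, because selected[-0:] would return everything.
    return selected[-count:] if count > 0 else []
```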

etcd Analysis

./scripts/analyze_etcd.py <must-gather-path>

Shows etcd cluster health:

Member health status
Member list with IDs and URLs
Endpoint status (leader, version, DB size)
Quorum status
Cluster summary

Storage Analysis

# All PVs and PVCs
./scripts/analyze_pvs.py <must-gather-path>

# PVCs in specific namespace
./scripts/analyze_pvs.py <must-gather-path> --namespace openshift-monitoring

Shows storage resources:

PersistentVolumes (capacity, status, claims)
PersistentVolumeClaims (binding, capacity)
Storage classes
Pending/unbound volumes

Monitoring Analysis

# All alerts
./scripts/analyze_prometheus.py <must-gather-path>

# Alerts in specific namespace
./scripts/analyze_prometheus.py <must-gather-path> --namespace openshift-monitoring

Shows monitoring information:

Alerts (state, namespace, name, active since, labels)
Total of pending/firing alerts

3. Interpret and Report

After running the scripts:

Review the summary statistics
Focus on items flagged with issues
Provide actionable insights and next steps
Suggest log analysis for specific components if needed
Cross-reference issues (e.g., degraded operator → failing pods → node issues)

Output Format

All scripts provide:

Summary Section: High-level statistics with emoji indicators
Table View: oc-like formatted output
Issues Section: Detailed breakdown of problems

Example summary format:

================================================================================
SUMMARY: 25/28 operators healthy
  ⚠️  3 operators with issues
  🔄 1 progressing
  ❌ 2 degraded
================================================================================

Helper Scripts Reference

scripts/analyze_clusterversion.py

Parses: cluster-scoped-resources/config.openshift.io/clusterversions/version.yaml
Output: ClusterVersion table with detailed version info, conditions, and capabilities

scripts/analyze_clusteroperators.py

Parses: cluster-scoped-resources/config.openshift.io/clusteroperators/
Output: ClusterOperator status table with conditions

scripts/analyze_pods.py

Parses: namespaces/*/pods/*/*.yaml (individual pod directories)
Output: Pod status table with issues categorized

scripts/analyze_nodes.py

Parses: cluster-scoped-resources/core/nodes/
Output: Node status table with conditions and capacity

scripts/analyze_network.py

Parses: network_logs/, network operator, OVN resources
Output: Network health summary and diagnostics

scripts/analyze_events.py

Parses: namespaces/*/core/events.yaml
Output: Event table sorted by last occurrence

scripts/analyze_etcd.py

Parses: etcd_info/ (endpoint_health.json, member_list.json, endpoint_status.json)
Output: etcd cluster health and member status

scripts/analyze_pvs.py

Parses: cluster-scoped-resources/core/persistentvolumes/, namespaces/*/core/persistentvolumeclaims.yaml
Output: PV and PVC status tables

Tips for Analysis

Start with Cluster Operators: They often reveal system-wide issues
Check Timing: Look at "SINCE" columns to understand when issues started
Follow Dependencies: Degraded operator → check its namespace pods → check hosting nodes
Look for Patterns: Multiple pods failing on the same node suggest a node issue
Cross-reference: Use multiple scripts together for a complete picture
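The "Look for Patterns" tip can be illustrated with a small grouping step. The sketch below assumes you already have the problematic pods as dicts loaded from their YAML files and counts them by `spec.nodeName`. `problem_pods_per_node` is a hypothetical helper, and the `<unscheduled>` label for pods with no node is an invented placeholder.

```python
from collections import Counter
from typing import List, Tuple


def problem_pods_per_node(pods: List[dict]) -> List[Tuple[str, int]]:
    """Count problematic pods per node, most affected node first."""
    counts: Counter = Counter()
    for pod in pods:
        # Pending pods may have no nodeName yet; bucket them separately.
        node = (pod.get("spec") or {}).get("nodeName") or "<unscheduled>"
        counts[node] += 1
    return counts.most_common()
```

A node that dominates the list is a good candidate for a closer look with analyze_nodes.py.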

Common Scenarios

"Why is my cluster degraded?"

Run analyze_clusteroperators.py - identify degraded operators
Run analyze_pods.py --namespace <operator-namespace> - check operator pods
Run analyze_nodes.py - verify node health
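As a sketch, the three steps above can be expressed as an ordered list of commands. The script names and the `--namespace` flag come from this document. `degraded_cluster_commands` and the `scripts_dir` default are assumptions made for illustration.

```python
from pathlib import Path
from typing import List


def degraded_cluster_commands(
    must_gather_path: str,
    operator_namespace: str,
    scripts_dir: str = "./scripts",  # assumed location of the helper scripts
) -> List[List[str]]:
    """Build the argv lists for the degraded-cluster workflow, in order."""
    scripts = Path(scripts_dir)
    return [
        # Step 1: identify degraded operators.
        [str(scripts / "analyze_clusteroperators.py"), must_gather_path],
        # Step 2: check the pods in the degraded operator's namespace.
        [str(scripts / "analyze_pods.py"), must_gather_path,
         "--namespace", operator_namespace],
        # Step 3: verify node health.
        [str(scripts / "analyze_nodes.py"), must_gather_path],
    ]
```

Each list can be passed to `subprocess.run` in turn. The operator namespace is only known after reviewing the output of the first step, so in practice the second command is built once that review is done.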

"Pods keep crashing"

Run analyze_pods.py --problems-only - find crashlooping pods
Check which nodes they're on
Run analyze_nodes.py - verify node conditions
Suggest checking pod logs in must-gather data

"Network connectivity issues"

Run analyze_network.py - check network health
Run analyze_pods.py --namespace openshift-ovn-kubernetes
Check PodNetworkConnectivityCheck results

Next Steps After Analysis

Based on findings, suggest:

Examining specific pod logs in namespaces/<ns>/pods/<pod>/<container>/logs/
Reviewing events in namespaces/<ns>/core/events.yaml
Checking audit logs in audit_logs/
Analyzing metrics data if available
Looking at host service logs in host_service_logs/