| name | Must-Gather Analyzer |
| description | Analyze OpenShift must-gather diagnostic data including cluster operators, pods, nodes, and network components. Use this skill when the user asks about cluster health, operator status, pod issues, node conditions, or wants diagnostic insights from must-gather data. Triggers: "analyze must-gather", "check cluster health", "operator status", "pod issues", "node status", "failing pods", "degraded operators", "cluster problems", "crashlooping", "network issues", "etcd health", "analyze clusteroperators", "analyze pods", "analyze nodes" |
Must-Gather Analyzer Skill
Comprehensive analysis of OpenShift must-gather diagnostic data with helper scripts that parse YAML and display output in oc-like format.
Overview
This skill provides analysis for:
- ClusterVersion: Current version, update status, and capabilities
- Cluster Operators: Status, degradation, and availability
- Pods: Health, restarts, crashes, and failures across namespaces
- Nodes: Conditions, capacity, and readiness
- Network: OVN/SDN diagnostics and connectivity
- Events: Warning and error events across namespaces
- etcd: Cluster health, member status, and quorum
- Storage: PersistentVolumes and PersistentVolumeClaims status
Must-Gather Directory Structure
Important: Must-gather data is contained in a subdirectory with a long hash name:
must-gather/
└── registry-ci-openshift-org-origin-...-sha256-<hash>/
    ├── cluster-scoped-resources/
    │   ├── config.openshift.io/clusteroperators/
    │   └── core/nodes/
    ├── namespaces/
    │   └── <namespace>/
    │       └── pods/
    │           └── <pod-name>/
    │               └── <pod-name>.yaml
    └── network_logs/
The analysis scripts expect the path to the subdirectory (the one with the hash), not the root must-gather folder.
Instructions
1. Get Must-Gather Path
Ask the user for the must-gather directory path if not already provided.
- If they provide the root directory, look for the subdirectory with the hash name
- The correct path contains cluster-scoped-resources/ and namespaces/ directories
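The path check described above can be sketched in Python. This is a minimal illustration, not code taken from the helper scripts; the function name resolve_must_gather_dir is hypothetical, and it assumes the hash-named subdirectory sits directly under the root.

```python
from pathlib import Path


def resolve_must_gather_dir(path: str) -> Path:
    """Return the directory that holds cluster-scoped-resources/ and namespaces/."""
    root = Path(path)
    required = ("cluster-scoped-resources", "namespaces")

    def looks_valid(candidate: Path) -> bool:
        # Both marker directories must be present.
        return all((candidate / name).is_dir() for name in required)

    if looks_valid(root):
        return root
    # Otherwise check the immediate subdirectories (the hash-named one).
    for child in sorted(p for p in root.iterdir() if p.is_dir()):
        if looks_valid(child):
            return child
    raise FileNotFoundError(f"no must-gather data directory found under {root}")
```

Passing either the root folder or the hash-named subdirectory should resolve to the same directory, which can then be handed to the scripts.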
2. Choose Analysis Type
Based on the user's request, run the appropriate helper script:
ClusterVersion Analysis
./scripts/analyze_clusterversion.py <must-gather-path>
Shows cluster version information similar to oc get clusterversion:
- Current version and update status
- Progressing state
- Available updates
- Version conditions
- Enabled capabilities
- Update history
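As a rough illustration of where these fields live, the sketch below pulls them from an already-parsed ClusterVersion object (a plain dict, for example the result of loading version.yaml). The function name summarize_clusterversion is hypothetical and the real script may extract more or different fields.

```python
def summarize_clusterversion(cv: dict) -> dict:
    """Extract the headline fields from a parsed ClusterVersion resource."""
    status = cv.get("status", {})
    # Conditions are a list of {type, status, ...}; index them by type.
    conditions = {c["type"]: c["status"] for c in status.get("conditions", [])}
    return {
        "desired": status.get("desired", {}).get("version"),
        "progressing": conditions.get("Progressing") == "True",
        # availableUpdates may be null when no updates are offered.
        "available_updates": [u.get("version") for u in status.get("availableUpdates") or []],
        "history": [(h.get("version"), h.get("state")) for h in status.get("history", [])],
    }
```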
Cluster Operators Analysis
./scripts/analyze_clusteroperators.py <must-gather-path>
Shows cluster operator status similar to oc get clusteroperators:
- Available, Progressing, Degraded conditions
- Version information
- Time since condition change
- Detailed messages for operators with issues
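To make the three conditions concrete, here is a minimal sketch of how an operator might be classified from a parsed ClusterOperator dict. The names operator_conditions and operator_issues are hypothetical, and treating "Progressing" as an issue is an assumption based on the example summary in this document.

```python
def operator_conditions(co: dict) -> dict:
    """Map condition type -> status string ("True"/"False"/"Unknown")."""
    return {c["type"]: c["status"] for c in co.get("status", {}).get("conditions", [])}


def operator_issues(co: dict) -> list:
    """Return the list of issues; an empty list means the operator looks healthy."""
    conds = operator_conditions(co)
    issues = []
    if conds.get("Available") != "True":
        issues.append("unavailable")
    if conds.get("Progressing") == "True":
        issues.append("progressing")
    if conds.get("Degraded") == "True":
        issues.append("degraded")
    return issues
```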
Pods Analysis
# All namespaces
./scripts/analyze_pods.py <must-gather-path>
# Specific namespace
./scripts/analyze_pods.py <must-gather-path> --namespace <namespace>
# Show only problematic pods
./scripts/analyze_pods.py <must-gather-path> --problems-only
Shows pod status similar to oc get pods -A:
- Ready/Total containers
- Status (Running, Pending, CrashLoopBackOff, etc.)
- Restart counts
- Age
- Categorized issues (crashlooping, pending, failed)
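The columns above map onto standard Pod status fields. The following sketch (hypothetical function name summarize_pod, operating on a parsed pod dict) shows one way to derive them; the real script's categorization rules may differ.

```python
def summarize_pod(pod: dict) -> dict:
    """Derive READY, STATUS, and RESTARTS values from a parsed Pod resource."""
    status = pod.get("status", {})
    containers = status.get("containerStatuses") or []
    ready = sum(1 for c in containers if c.get("ready"))
    restarts = sum(c.get("restartCount", 0) for c in containers)
    # Start from the pod phase, then prefer a container waiting reason
    # such as CrashLoopBackOff or ImagePullBackOff when one is present.
    display = status.get("phase", "Unknown")
    for c in containers:
        waiting = c.get("state", {}).get("waiting")
        if waiting and waiting.get("reason"):
            display = waiting["reason"]
            break
    return {"ready": f"{ready}/{len(containers)}", "status": display, "restarts": restarts}
```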
Nodes Analysis
./scripts/analyze_nodes.py <must-gather-path>
# Show only nodes with issues
./scripts/analyze_nodes.py <must-gather-path> --problems-only
Shows node status similar to oc get nodes:
- Ready status
- Roles (master, worker)
- Age
- Kubernetes version
- Node conditions (DiskPressure, MemoryPressure, etc.)
- Capacity and allocatable resources
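For orientation, roles and conditions can be read from a parsed Node dict roughly as follows. The helper names node_roles and node_problems are hypothetical; the sketch assumes roles come from node-role.kubernetes.io/ labels and that any non-Ready condition reporting "True" counts as a problem.

```python
ROLE_PREFIX = "node-role.kubernetes.io/"


def node_roles(node: dict) -> list:
    """Roles are encoded as label keys, e.g. node-role.kubernetes.io/worker."""
    labels = node.get("metadata", {}).get("labels", {})
    return sorted(k[len(ROLE_PREFIX):] for k in labels if k.startswith(ROLE_PREFIX))


def node_problems(node: dict) -> list:
    """Ready must be "True"; pressure-style conditions must not be "True"."""
    problems = []
    for cond in node.get("status", {}).get("conditions", []):
        if cond["type"] == "Ready":
            if cond["status"] != "True":
                problems.append("NotReady")
        elif cond["status"] == "True":
            problems.append(cond["type"])
    return problems
```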
Network Analysis
./scripts/analyze_network.py <must-gather-path>
Shows network health:
- Network type (OVN-Kubernetes, OpenShift SDN)
- Network operator status
- OVN pod health
- PodNetworkConnectivityCheck results
- Network-related issues
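As a small illustration of the network type check, the sketch below reads networkType from a parsed cluster Network config object. The function name network_type is hypothetical, and where the script actually reads this value from is an assumption.

```python
# API values on the left, the display names used in this document on the right.
DISPLAY_NAMES = {"OVNKubernetes": "OVN-Kubernetes", "OpenShiftSDN": "OpenShift SDN"}


def network_type(network_config: dict) -> str:
    """Prefer the observed status value, falling back to the requested spec value."""
    raw = (network_config.get("status", {}).get("networkType")
           or network_config.get("spec", {}).get("networkType"))
    if raw is None:
        return "Unknown"
    return DISPLAY_NAMES.get(raw, raw)
```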
Events Analysis
# Recent events (last 100)
./scripts/analyze_events.py <must-gather-path>
# Warning events only
./scripts/analyze_events.py <must-gather-path> --type Warning
# Events in specific namespace
./scripts/analyze_events.py <must-gather-path> --namespace openshift-etcd
# Show last 50 events
./scripts/analyze_events.py <must-gather-path> --count 50
Shows cluster events:
- Event type (Warning, Normal)
- Last seen timestamp
- Reason and message
- Affected object
- Event count
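The --type and --count options amount to a filter followed by a sort on the last-seen time. A minimal sketch, assuming events are already parsed into dicts and carry lastTimestamp values in the same UTC ISO 8601 format (so string order matches time order); select_events is a hypothetical name.

```python
def select_events(events: list, event_type=None, count: int = 100) -> list:
    """Return the most recent `count` events, oldest first, optionally filtered by type."""
    chosen = [e for e in events if event_type is None or e.get("type") == event_type]
    # Events without a lastTimestamp sort first (treated as oldest).
    chosen.sort(key=lambda e: e.get("lastTimestamp") or "")
    return chosen[-count:] if count > 0 else []
```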
etcd Analysis
./scripts/analyze_etcd.py <must-gather-path>
Shows etcd cluster health:
- Member health status
- Member list with IDs and URLs
- Endpoint status (leader, version, DB size)
- Quorum status
- Cluster summary
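Quorum status follows from simple arithmetic: a cluster of n members needs floor(n/2) + 1 healthy members. The sketch below applies that rule to a list of per-endpoint health records; the "health" boolean key is an assumption about the JSON layout, and quorum_status is a hypothetical name.

```python
def quorum_status(health: list) -> dict:
    """Summarize quorum from a list of per-member health records."""
    total = len(health)
    healthy = sum(1 for member in health if member.get("health") is True)
    needed = total // 2 + 1  # strict majority of all members
    return {
        "members": total,
        "healthy": healthy,
        "quorum_size": needed,
        "has_quorum": total > 0 and healthy >= needed,
    }
```

With three members, for example, quorum size is two, so the cluster tolerates the loss of one member.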
Storage Analysis
# All PVs and PVCs
./scripts/analyze_pvs.py <must-gather-path>
# PVCs in specific namespace
./scripts/analyze_pvs.py <must-gather-path> --namespace openshift-monitoring
Shows storage resources:
- PersistentVolumes (capacity, status, claims)
- PersistentVolumeClaims (binding, capacity)
- Storage classes
- Pending/unbound volumes
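Finding pending or unbound claims comes down to checking status.phase on each PVC. A minimal sketch over already-parsed PVC dicts, with the hypothetical name unbound_claims:

```python
def unbound_claims(pvcs: list) -> list:
    """Return (namespace, name, phase) for every PVC whose phase is not Bound."""
    pending = []
    for pvc in pvcs:
        phase = pvc.get("status", {}).get("phase", "Unknown")
        if phase != "Bound":
            meta = pvc.get("metadata", {})
            pending.append((meta.get("namespace"), meta.get("name"), phase))
    return pending
```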
3. Interpret and Report
After running the scripts:
- Review the summary statistics
- Focus on items flagged with issues
- Provide actionable insights and next steps
- Suggest log analysis for specific components if needed
- Cross-reference issues (e.g., degraded operator → failing pods → node issues)
Output Format
All scripts provide:
- Summary Section: High-level statistics with emoji indicators
- Table View: oc-like formatted output
- Issues Section: Detailed breakdown of problems
Example summary format:
================================================================================
SUMMARY: 25/28 operators healthy
⚠️ 3 operators with issues
🔄 1 progressing
❌ 2 degraded
================================================================================
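A summary block of this shape could be produced along the following lines. This is only a sketch: format_summary and its parameters are hypothetical, and the 80-character rule width is an assumption.

```python
def format_summary(total: int, with_issues: int, progressing: int,
                   degraded: int, width: int = 80) -> str:
    """Build a summary banner for operator health counts."""
    rule = "=" * width
    lines = [
        rule,
        f"SUMMARY: {total - with_issues}/{total} operators healthy",
        f"⚠️ {with_issues} operators with issues",
        f"🔄 {progressing} progressing",
        f"❌ {degraded} degraded",
        rule,
    ]
    return "\n".join(lines)
```

The issue count is passed in separately because a single operator can be both progressing and degraded.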
Helper Scripts Reference
scripts/analyze_clusterversion.py
Parses: cluster-scoped-resources/config.openshift.io/clusterversions/version.yaml
Output: ClusterVersion table with detailed version info, conditions, and capabilities
scripts/analyze_clusteroperators.py
Parses: cluster-scoped-resources/config.openshift.io/clusteroperators/
Output: ClusterOperator status table with conditions
scripts/analyze_pods.py
Parses: namespaces/*/pods/*/*.yaml (individual pod directories)
Output: Pod status table with issues categorized
scripts/analyze_nodes.py
Parses: cluster-scoped-resources/core/nodes/
Output: Node status table with conditions and capacity
scripts/analyze_network.py
Parses: network_logs/, network operator, OVN resources
Output: Network health summary and diagnostics
scripts/analyze_events.py
Parses: namespaces/*/core/events.yaml
Output: Event table sorted by last occurrence
scripts/analyze_etcd.py
Parses: etcd_info/ (endpoint_health.json, member_list.json, endpoint_status.json)
Output: etcd cluster health and member status
scripts/analyze_pvs.py
Parses: cluster-scoped-resources/core/persistentvolumes/, namespaces/*/core/persistentvolumeclaims.yaml
Output: PV and PVC status tables
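The "Parses" patterns above are ordinary glob patterns relative to the must-gather data directory. As an illustration, the pod pattern can be expanded with pathlib as sketched here; pod_manifests is a hypothetical helper, not a function from the scripts.

```python
from pathlib import Path


def pod_manifests(data_dir: str, namespace: str = None) -> list:
    """Expand namespaces/*/pods/*/*.yaml, optionally limited to one namespace."""
    ns = namespace or "*"
    return sorted(Path(data_dir).glob(f"namespaces/{ns}/pods/*/*.yaml"))
```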
Tips for Analysis
- Start with Cluster Operators: They often reveal system-wide issues
- Check Timing: Look at "SINCE" columns to understand when issues started
- Follow Dependencies: Degraded operator → check its namespace pods → check hosting nodes
- Look for Patterns: Multiple pods failing on the same node suggest a node-level issue
- Cross-reference: Use multiple scripts together for a complete picture
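The "Look for Patterns" tip can be applied mechanically by counting problem pods per node. A minimal sketch over parsed pod dicts; suspect_nodes and its threshold of two are hypothetical choices, not behavior of the helper scripts.

```python
from collections import Counter


def suspect_nodes(problem_pods: list, threshold: int = 2) -> dict:
    """Return {node name: count} for nodes hosting at least `threshold` problem pods."""
    counts = Counter(
        pod.get("spec", {}).get("nodeName")
        for pod in problem_pods
    )
    # Pods with no nodeName were never scheduled, so they say nothing about a node.
    return {node: n for node, n in counts.items() if node is not None and n >= threshold}
```

Nodes returned here are good candidates for a closer look with analyze_nodes.py.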
Common Scenarios
"Why is my cluster degraded?"
- Run analyze_clusteroperators.py - identify degraded operators
- Run analyze_pods.py --namespace <operator-namespace> - check operator pods
- Run analyze_nodes.py - verify node health
"Pods keep crashing"
- Run analyze_pods.py --problems-only - find crashlooping pods
- Check which nodes they're on
- Run analyze_nodes.py - verify node conditions
- Suggest checking pod logs in must-gather data
"Network connectivity issues"
- Run analyze_network.py - check network health
- Run analyze_pods.py --namespace openshift-ovn-kubernetes
- Check PodNetworkConnectivityCheck results
Next Steps After Analysis
Based on findings, suggest:
- Examining specific pod logs in namespaces/<ns>/pods/<pod>/<container>/logs/
- Reviewing events in namespaces/<ns>/core/events.yaml
- Checking audit logs in audit_logs/
- Analyzing metrics data if available
- Looking at host service logs in host_service_logs/
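To point a user at concrete log files, the pod directory can be searched for anything under a logs/ folder. This sketch uses a recursive glob so it does not depend on the exact nesting depth of the container directories; pod_log_files is a hypothetical helper.

```python
from pathlib import Path


def pod_log_files(data_dir: str, namespace: str, pod: str) -> list:
    """List every file found under a logs/ directory for the given pod."""
    pod_dir = Path(data_dir) / "namespaces" / namespace / "pods" / pod
    return sorted(p for p in pod_dir.glob("**/logs/*") if p.is_file())
```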