SKILL.md

name: LVMS Analyzer
description: Analyzes LVMS must-gather data to diagnose storage issues

LVMS Analyzer Skill

This skill provides detailed guidance for analyzing LVMS (Logical Volume Manager Storage) must-gather data to identify and troubleshoot storage issues.

When to Use This Skill

Use this skill when:

  • Analyzing LVMS must-gather data offline
  • Diagnosing PVCs stuck in Pending state
  • Investigating LVMCluster readiness issues
  • Troubleshooting volume group creation failures
  • Debugging TopoLVM CSI driver problems
  • Checking operator health in LVMS namespace

This skill is automatically invoked by the /lvms:analyze command when working with must-gather data.

Prerequisites

Required:

  • LVMS must-gather directory extracted and accessible
  • Must-gather contains LVMS namespace directory:
    • namespaces/openshift-lvm-storage/ (newer versions)
    • OR namespaces/openshift-storage/ (older versions)
  • Python 3.6 or higher installed
  • PyYAML library: pip install pyyaml

Namespace Compatibility:

  • LVMS namespace changed from openshift-storage to openshift-lvm-storage in recent versions
  • The analysis script automatically detects which namespace is present (see the sketch after this list)
  • Both namespaces are fully supported for backward compatibility
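
A minimal sketch of this detection logic, using only the Python standard library (the helper name detect_lvms_namespace is illustrative, not part of the analysis script's interface):

import os

LVMS_NAMESPACES = ["openshift-lvm-storage", "openshift-storage"]  # newest first

def detect_lvms_namespace(must_gather_path):
    """Return the first LVMS namespace directory present, or None."""
    for ns in LVMS_NAMESPACES:
        if os.path.isdir(os.path.join(must_gather_path, "namespaces", ns)):
            return ns
    return None

ns = detect_lvms_namespace("/path/to/must-gather")  # the extracted subdirectory
print(f"Detected LVMS namespace: {ns}" if ns else "No LVMS namespace found")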

Must-Gather Structure:

must-gather/
└── registry-{image-registry}-lvms-must-gather-{version}-sha256-{hash}/
    ├── cluster-scoped-resources/
    │   ├── core/
    │   │   └── persistentvolumes/
    │   │       └── pvc-*.yaml               # Individual PV files
    │   ├── storage.k8s.io/
    │   │   └── storageclasses/
    │   │       ├── lvms-vg1.yaml
    │   │       └── lvms-vg1-immediate.yaml
    │   └── security.openshift.io/
    │       └── securitycontextconstraints/
    │           └── lvms-vgmanager.yaml
    ├── namespaces/
    │   └── openshift-lvm-storage/           # or openshift-storage for older versions
    │       ├── oc_output/                   # IMPORTANT: Primary location for LVMS resources
    │       │   ├── lvmcluster.yaml          # Full LVMCluster resource with status
    │       │   ├── lvmcluster               # Text output (oc describe)
    │       │   ├── lvmvolumegroup           # Text output
    │       │   ├── lvmvolumegroupnodestatus # Text output
    │       │   ├── logicalvolume            # Text output
    │       │   ├── pods                     # Text output (oc get pods)
    │       │   └── events                   # Text output
    │       ├── pods/
    │       │   ├── lvms-operator-{hash}/
    │       │   │   └── lvms-operator-{hash}.yaml
    │       │   └── vg-manager-{hash}/
    │       │       └── vg-manager-{hash}.yaml
    │       └── apps/                        # May contain deployments/daemonsets
    └── ...

Key Note: LVMS resources live primarily in the oc_output/ directory; lvmcluster.yaml is the most important file, containing the full cluster and per-node status.

Implementation Steps

Step 1: Validate Must-Gather Path

Before running analysis, verify the must-gather directory structure:

# Check if LVMS namespace directory exists (try both namespaces)
ls {must-gather-path}/namespaces/openshift-lvm-storage 2>/dev/null || \
  ls {must-gather-path}/namespaces/openshift-storage

# Verify required resource directories
ls {must-gather-path}/cluster-scoped-resources/core/persistentvolumes

Namespace Detection: The analysis script automatically detects which namespace is present:

  • Newer LVMS versions use openshift-lvm-storage
  • Older LVMS versions use openshift-storage
  • The script will inform you which namespace was detected

Common Issue: the user provides the parent directory instead of the extracted subdirectory

  • Must-gather extracts to a directory like must-gather.local.12345/
  • Inside is a subdirectory like registry-ci-openshift-org-origin-4-18.../
  • Always use the subdirectory (the one with cluster-scoped-resources/ and namespaces/)

Handling:

# If user provides parent directory, try to find the correct subdirectory
if [ ! -d "{path}/namespaces/openshift-lvm-storage" ] && \
   [ ! -d "{path}/namespaces/openshift-storage" ]; then
    # Try to find either namespace
    find {path} -type d \( -name "openshift-lvm-storage" -o -name "openshift-storage" \) -path "*/namespaces/*"
    # Suggest the correct path to user
fi

Step 2: Run Analysis Script

Use the Python analysis script for structured analysis:

python3 plugins/lvms/skills/lvms-analyzer/scripts/analyze_lvms.py \
    {must-gather-path}

Script Location:

  • Always use: plugins/lvms/skills/lvms-analyzer/scripts/analyze_lvms.py
  • Use relative path from repository root
  • Script is part of the LVMS plugin

Component-Specific Analysis:

For focused analysis on specific components:

# Analyze only storage/PVC issues
python3 plugins/lvms/skills/lvms-analyzer/scripts/analyze_lvms.py \
    {must-gather-path} --component storage

# Analyze only operator health
python3 plugins/lvms/skills/lvms-analyzer/scripts/analyze_lvms.py \
    {must-gather-path} --component operator

# Analyze only volume groups
python3 plugins/lvms/skills/lvms-analyzer/scripts/analyze_lvms.py \
    {must-gather-path} --component volumes

# Analyze only pod logs
python3 plugins/lvms/skills/lvms-analyzer/scripts/analyze_lvms.py \
    {must-gather-path} --component logs
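
For reference, a minimal sketch of how such a command-line interface could be wired with argparse; only the positional path and the --component choices documented above are assumed, and the real script's option handling may differ:

import argparse

parser = argparse.ArgumentParser(
    description="Analyze LVMS must-gather data")
parser.add_argument("must_gather_path",
                    help="extracted must-gather subdirectory")
parser.add_argument("--component",
                    choices=["storage", "operator", "volumes", "logs"],
                    help="restrict analysis to one component")
args = parser.parse_args()

# Dispatch: no --component means run every analyzer in sequence
components = [args.component] if args.component else [
    "operator", "volumes", "storage", "logs"]
print(f"Analyzing {args.must_gather_path}: {', '.join(components)}")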

Step 3: Interpret Analysis Results

The script provides structured output across several sections:

1. LVMCluster Status

Key fields to check (see the reading sketch after this list):

  • state: Should be "Ready"
  • ready: Should be true
  • conditions: All should have status "True"
    • ResourcesAvailable: Resources deployed successfully
    • VolumeGroupsReady: VGs created on all nodes
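
A minimal sketch of how these fields can be read from oc_output/lvmcluster.yaml with PyYAML, assuming the dump holds either a single resource or a v1 List (the helper name is illustrative, not the analyzer's actual implementation):

import os
import yaml

def lvmcluster_summary(must_gather, namespace):
    """Print state, ready flag, and conditions for each LVMCluster found."""
    path = os.path.join(must_gather, "namespaces", namespace,
                        "oc_output", "lvmcluster.yaml")
    with open(path) as f:
        doc = yaml.safe_load(f) or {}
    # The dump may be a single resource or a v1 List with an "items" array
    items = doc.get("items") or [doc]
    for lvmcluster in items:
        status = lvmcluster.get("status", {})
        print(f"LVMCluster: {lvmcluster.get('metadata', {}).get('name')}")
        print(f"  State: {status.get('state')}")   # expect "Ready"
        print(f"  Ready: {status.get('ready')}")   # expect True
        for cond in status.get("conditions", []):
            print(f"  {cond['type']}: {cond['status']}"
                  f" ({cond.get('reason', '')})")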

Example healthy output:

LVMCluster: lvmcluster-sample
✓ State: Ready
✓ Ready: true

Conditions:
✓ ResourcesAvailable: True
✓ VolumeGroupsReady: True

Example unhealthy output (real case from must-gather):

LVMCluster: my-lvmcluster
❌ State: Degraded
❌ Ready: false

Conditions:
✓ ResourcesAvailable: True
  Reason: ResourcesAvailable
  Message: Reconciliation is complete and all the resources are available
❌ VolumeGroupsReady: False
  Reason: VGsDegraded
  Message: One or more VGs are degraded

2. Volume Group Status

Checks volume group creation per node and device availability:

Example output (real case from must-gather):

Volume Group/Device Class: vg1
Nodes: 3

  Node: ocpnode1.ocpiopex.growipx.com
  ⚠  Status: Progressing

  Devices: /dev/mapper/3600a098038315048302b586c38397562, /dev/mapper/mpatha

  Excluded devices: 24 device(s)
    - /dev/sdb: /dev/sdb has children block devices and could not be considered
    - /dev/sdb4: /dev/sdb4 has an invalid filesystem signature (xfs) and cannot be used
    - /dev/mapper/3600a098038315047433f586c53477272: has an invalid filesystem signature (xfs)
    ... and 21 more excluded devices

  Node: ocpnode2.ocpiopex.growipx.com
  ❌ Status: Degraded

  Reason:
  failed to create/extend volume group vg1: failed to extend volume group vg1:
  WARNING: VG name vg0 is used by VGs VVnkhP-khYQ-blyc-2TNo-d3cv-b6di-4RbSyY and EUV3xv-ft6q-39xK-J3ki-rglf-9H44-rVIHIq.
  Fix duplicate VG names with vgrename uuid, a device filter, or system IDs.
  Physical volume '/dev/mapper/3600a098038315048302b586c38397578p3' is already in volume group 'vg0'
  Unable to add physical volume '/dev/mapper/3600a098038315048302b586c38397578p3' to volume group 'vg0'
  ... (truncated, see LVMCluster status for full details)

  Devices: /dev/mapper/mpatha

This real example shows a common LVMS issue: duplicate volume group names preventing VG extension.

3. Storage (PVC/PV) Status

Lists pending or failed PVCs:

Example output:

Pending PVCs:

database/postgres-data
❌ Status: Pending (10m)
  Storage Class: lvms-vg1
  Requested: 100Gi

  Recent Events:
  ⚠  ProvisioningFailed: no node has enough free space
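
As one illustration of the underlying checks, a minimal sketch that flags PVs that are not Bound by scanning the persistentvolumes directory shown in the structure above (the helper name and example path are illustrative; a PVC stuck in Pending typically has no bound PV yet, which is why the script also draws on PVC and event data):

import glob
import os
import yaml

def unbound_pvs(must_gather):
    """Yield (pv_name, phase, claim) for PVs whose phase is not Bound."""
    pv_dir = os.path.join(must_gather, "cluster-scoped-resources",
                          "core", "persistentvolumes")
    for path in sorted(glob.glob(os.path.join(pv_dir, "pvc-*.yaml"))):
        with open(path) as f:
            pv = yaml.safe_load(f)
        phase = pv.get("status", {}).get("phase", "Unknown")
        if phase != "Bound":
            claim = pv.get("spec", {}).get("claimRef") or {}
            yield (pv["metadata"]["name"], phase,
                   f"{claim.get('namespace')}/{claim.get('name')}")

for name, phase, claim in unbound_pvs("/path/to/must-gather"):
    print(f"❌ {name}: {phase} (claim: {claim})")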

4. Operator Health

Checks LVMS operator pods, deployments, and daemonsets:

Example issues:

❌ vg-manager-abc123 (worker-0)
  Status: CrashLoopBackOff
  Restarts: 15
  Error: volume group "vg1" not found
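
A minimal sketch of this kind of check, reading each pod's YAML dump for waiting states and restart counts (directory layout as in the structure above; the helper name is illustrative):

import glob
import os
import yaml

def pod_health(must_gather, namespace):
    """Report containers stuck in a waiting state or restarting."""
    pattern = os.path.join(must_gather, "namespaces", namespace,
                           "pods", "*", "*.yaml")
    for path in sorted(glob.glob(pattern)):
        with open(path) as f:
            pod = yaml.safe_load(f)
        if not isinstance(pod, dict) or pod.get("kind") != "Pod":
            continue
        for cs in pod.get("status", {}).get("containerStatuses", []):
            waiting = cs.get("state", {}).get("waiting") or {}
            restarts = cs.get("restartCount", 0)
            if waiting or restarts:
                print(f"{pod['metadata']['name']}: {cs['name']} "
                      f"{waiting.get('reason', 'Running')}, restarts={restarts}")

pod_health("/path/to/must-gather", "openshift-lvm-storage")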

5. Pod Logs

Extracts and analyzes error/warning messages from pod logs:

Example output (from real must-gather):

═══════════════════════════════════════════════════════════
POD LOGS ANALYSIS
═══════════════════════════════════════════════════════════

Pod: vg-manager-nz4pc
Unique errors/warnings: 1

❌ 2025-10-28T10:47:28Z: Reconciler error
  Controller: lvmvolumegroup
  Error Details:
    failed to create/extend volume group vg1: failed to extend volume group vg1:
    WARNING: VG name vg0 is used by VGs WsNJwk-DK3q-tSHg-zvQJ-imF1-SdRv-8oh4e0 ...
    Cannot use /dev/dm-10: device is too small (pv_min_size)
    Command requires all devices to be found.

Pod: lvms-operator-65df9f4dbb-92jwl
Unique errors/warnings: 1

❌ 2025-10-28T10:52:48Z: failed to validate device class setup
  Controller: lvmcluster
  Error: VG vg1 on node Degraded is not in ready state (ocpnode1.ocpiopex.growipx.com)

Key Points:

  • Logs are parsed from JSON format
  • Errors are deduplicated (the same error is often repeated across reconciliation loops); see the parsing sketch after this list
  • Shows unique error messages with first occurrence timestamp
  • Provides additional context not visible in resource status
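
A minimal sketch of this parse-and-deduplicate pattern, assuming structured JSON-lines controller logs with level, ts, msg, and error fields (typical of controller-runtime logging; exact field names may vary):

import json

def unique_errors(log_lines):
    """Collect unique error/warning entries with first-seen timestamps."""
    seen = {}
    for line in log_lines:
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue                          # skip non-JSON lines
        if entry.get("level") not in ("error", "warn", "warning"):
            continue
        key = (entry.get("msg"), entry.get("error"))
        if key not in seen:                   # first occurrence wins
            seen[key] = entry.get("ts")
    return [(ts, msg, err) for (msg, err), ts in seen.items()]

with open("vg-manager-current.log") as f:     # hypothetical extracted log file
    for ts, msg, err in unique_errors(f):
        print(f"❌ {ts}: {msg}\n  {err or ''}")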

Step 4: Analyze Root Causes

Connect related issues to identify root causes:

Common Pattern 1: Device Filesystem Conflict

Chain of failures:
1. Device /dev/sdb has existing ext4 filesystem
2. vg-manager cannot create volume group
3. Volume group missing on node
4. PVCs stuck in Pending

Root cause: Device not properly wiped before LVMS use

Common Pattern 2: Insufficient Capacity

Chain of failures:
1. Thin pool at 95% capacity
2. No free space for new volumes
3. PVCs stuck in Pending

Root cause: Insufficient storage capacity or old volumes not cleaned up

Common Pattern 3: Node-Specific Failures

Chain of failures:
1. Volume group missing on specific node
2. TopoLVM CSI driver not functional on that node
3. PVCs with node affinity to that node stuck Pending

Root cause: Node-specific device configuration issue

Step 5: Generate Remediation Plan

Based on analysis results, provide prioritized recommendations:

CRITICAL Issues (Fix Immediately):

  1. Device Conflicts:

    # Clean device on affected node
    oc debug node/{node-name}
    chroot /host wipefs -a /dev/{device}
    
    # Restart vg-manager to recreate VG
    oc delete pod -n openshift-lvm-storage -l app.kubernetes.io/component=vg-manager
    
  2. Pod Crashes:

    # After fixing underlying issue, restart failed pods
    oc delete pod -n openshift-lvm-storage {pod-name}
    
  3. LVMCluster Not Ready:

    # Review and fix device configuration
    oc edit lvmcluster -n openshift-lvm-storage
    
    # Ensure devices match actual available devices
    

WARNING Issues (Address Soon):

  1. Capacity Issues:

    # Check logical volume usage
    oc debug node/{node} -- chroot /host lvs --units g
    
    # Remove unused volumes or expand thin pool
    
  2. Partial Node Coverage:

    # Investigate why daemonsets are not running on all nodes
    oc get nodes --show-labels
    oc describe daemonset -n openshift-lvm-storage
    

Step 6: Provide Next Steps

Always provide clear next steps:

  1. Review logs (if available in must-gather):

    • Operator logs: namespaces/openshift-lvm-storage/pods/lvms-operator-*/logs/
    • VG-manager logs: namespaces/openshift-lvm-storage/pods/vg-manager-*/logs/
    • TopoLVM logs: namespaces/openshift-lvm-storage/pods/topolvm-*/logs/
  2. Verify fixes (if cluster is accessible):

    # After implementing fixes, verify:
    oc get lvmcluster -n openshift-lvm-storage
    oc get lvmvolumegroup -A
    oc get pvc -A | grep Pending
    
  3. Re-collect must-gather (if making changes):

    oc adm must-gather --image=quay.io/lvms_dev/lvms-must-gather:latest
    

Error Handling

Script Execution Errors

Script not found:

# Verify script exists
ls plugins/lvms/skills/lvms-analyzer/scripts/analyze_lvms.py

# Ensure it's executable
chmod +x plugins/lvms/skills/lvms-analyzer/scripts/analyze_lvms.py

Python dependencies missing:

# Install PyYAML
pip install pyyaml

# Or use pip3
pip3 install pyyaml

Invalid YAML in must-gather:

  • Script handles YAML parsing errors gracefully (see the sketch after this list)
  • Reports which files failed to parse
  • Continues analysis with available data
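
A minimal sketch of this graceful-degradation pattern (illustrative, not the script's exact code):

import yaml

def load_yaml_or_skip(path, failures):
    """Parse one file; record a failure instead of aborting the run."""
    try:
        with open(path) as f:
            return yaml.safe_load(f)
    except (yaml.YAMLError, OSError) as exc:
        failures.append((path, str(exc)))     # reported at the end
        return None

failures = []
doc = load_yaml_or_skip("broken.yaml", failures)  # hypothetical file
for path, err in failures:
    print(f"⚠  Could not parse {path}: {err}")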

Must-Gather Issues

Missing directories:

  • Script validates required directories exist
  • Reports missing components
  • Provides guidance on what's missing

Incomplete must-gather:

  • If critical resources are missing, the script reports what it can still analyze
  • Suggests re-collecting must-gather

Examples

Example 1: Full Analysis

# Run comprehensive analysis
python3 plugins/lvms/skills/lvms-analyzer/scripts/analyze_lvms.py \
    ./must-gather/registry-ci-openshift-org-origin-4-18.../

Output:

═══════════════════════════════════════════════════════════
LVMCLUSTER STATUS
═══════════════════════════════════════════════════════════

LVMCluster: lvmcluster-sample
❌ State: Failed
❌ Ready: false
...

═══════════════════════════════════════════════════════════
LVMS ANALYSIS SUMMARY
═══════════════════════════════════════════════════════════

❌ CRITICAL ISSUES: 3
  - LVMCluster not Ready (state: Failed)
  - Volume group vg1 not created on worker-0
  - 3 PVCs stuck in Pending state

Example 2: Storage-Only Analysis

# Focus on PVC issues
python3 plugins/lvms/skills/lvms-analyzer/scripts/analyze_lvms.py \
    ./must-gather/... --component storage

Analyzes only:

  • PVC/PV status
  • Storage class configuration
  • Volume provisioning issues

Example 3: Operator Health Check

# Check operator components
python3 plugins/lvms/skills/lvms-analyzer/scripts/analyze_lvms.py \
    ./must-gather/... --component operator

Analyzes only:

  • LVMCluster resource
  • Deployments and daemonsets
  • Pod status and crashes

Best Practices

  1. Always validate path first:

    • Check for the namespaces/openshift-lvm-storage/ (or namespaces/openshift-storage/) directory
    • Use the correct subdirectory, not the parent
  2. Run full analysis first:

    • Get overall health picture
    • Then drill down with component-specific analysis if needed
  3. Correlate issues:

    • Look for patterns across components
    • Connect pod failures to VG issues to PVC problems
  4. Check timestamps:

    • Events and pod restarts have timestamps
    • Helps understand sequence of failures
  5. Provide actionable output:

    • Don't just list issues
    • Explain root causes
    • Give specific remediation steps
    • Include verification commands
  6. Reference documentation:

    • Link to LVMS troubleshooting guide
    • Point to relevant sections in must-gather logs

Additional Resources