SKILL.md

name: Prow Job Analyze Metal Install Failure
description: Analyze OpenShift bare metal installation failures in Prow CI jobs using dev-scripts artifacts. Use for jobs with "metal" in the name, and for debugging Metal3/Ironic provisioning, installation, or dev-scripts setup failures. You may also use the prow-job-analyze-install-failure skill with this one.

Prow Job Analyze Metal Install Failure

This skill helps debug OpenShift bare metal installation failures in CI jobs by analyzing dev-scripts logs, libvirt console logs, sosreports, and other metal-specific artifacts.

When to Use This Skill

Use this skill when:

  • A bare metal CI job fails with an "install should succeed" test failure
  • The job name contains "metal" or "baremetal"
  • You need to debug Metal3/Ironic provisioning issues
  • You need to analyze dev-scripts setup failures

This skill is invoked by the main prow-job-analyze-install-failure skill when it detects a metal job.

Metal Installation Overview

Metal IPI jobs use dev-scripts (https://github.com/openshift-metal3/dev-scripts) with Metal3 and Ironic to install OpenShift:

  • dev-scripts: Framework for setting up and installing OpenShift on bare metal
  • Metal3: Kubernetes-native interface to Ironic
  • Ironic: Bare metal provisioning service

The installation process has multiple layers:

  1. dev-scripts setup: Configures hypervisor, sets up Ironic/Metal3, builds installer
  2. Ironic provisioning: Provisions bare metal nodes (or VMs acting as bare metal)
  3. OpenShift installation: Standard installer runs on provisioned nodes

Failures can occur at any layer, so analysis must check all of them.

Network Architecture (CRITICAL for Understanding IPv6/Disconnected Jobs)

IMPORTANT: The term "disconnected" refers to the cluster nodes, NOT the hypervisor.

Hypervisor (dev-scripts host)

  • HAS full internet access
  • Downloads packages, container images, and dependencies from the public internet
  • Runs dev-scripts Ansible playbooks that download tools (Go, installer, etc.)
  • Hosts a local mirror registry to serve the cluster

Cluster VMs/Nodes

  • Run in a private IPv6-only network (when IP_STACK=v6)
  • NO direct internet access (truly disconnected)
  • Pull container images from the hypervisor's local mirror registry
  • Access to hypervisor services only (registry, DNS, etc.)

Common Misconception

When analyzing failures in "metal-ipi-ovn-ipv6" jobs:

  • ❌ WRONG: "The hypervisor cannot access the internet, so downloads fail"
  • ✅ CORRECT: "The hypervisor has internet access. If downloads fail, it's likely due to the remote service being unavailable, not network restrictions"

Implications for Failure Analysis

  1. Dev-scripts failures (steps 01-05): If external downloads fail, check if the remote service/URL is down or has removed the resource
  2. Installation failures (step 06+): If cluster nodes cannot pull images, check the local mirror registry on the hypervisor
  3. HTTP 403/404 errors during dev-scripts: Usually means the resource was removed from the upstream source, not that the network is restricted
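
If a dev-scripts step reports an HTTP 403/404 for an external URL, a quick sanity check is to fetch that URL from any machine with internet access; because the hypervisor is not network-restricted, the failure usually reproduces outside CI as well. A minimal sketch (the URL placeholder is whatever the failing log line shows):

    # Re-check the failing URL from any internet-connected machine.
    # A 403/404 here suggests the resource was removed upstream, not blocked by CI networking.
    curl -sSI "{failing-url-from-devscripts-log}" | head -n 1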

Prerequisites

  1. gcloud CLI Installation

    • The gcloud storage commands used throughout this skill require the Google Cloud CLI (gcloud) to be installed

  2. gcloud Authentication (Optional)

    • The test-platform-results bucket is publicly accessible
    • No authentication is required for read access
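
A quick way to confirm read access before starting (a minimal check, using the same placeholders as the steps below):

    # The bucket is public, so listing should work even without gcloud authentication configured.
    gcloud storage ls gs://test-platform-results/{bucket-path}/artifacts/{target}/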

Input Format

The user will provide:

  1. Build ID - Extracted by the main skill
  2. Bucket path - Extracted by the main skill
  3. Target name - Extracted by the main skill
  4. Working directory - Already created by the main skill

Metal-Specific Artifacts

Metal jobs produce several diagnostic archives:

OFCIR Acquisition Logs

  • Location: {target}/ofcir-acquire/
  • Purpose: Shows the OFCIR host acquisition process
  • Contains:
    • build-log.txt: Log showing pool, provider, and host details
    • artifacts/junit_metal_setup.xml: JUnit with test [sig-metal] should get working host from infra provider
  • Critical for: Determining if the job failed to acquire a host before installation started
  • Key information:
    • Pool name (e.g., "cipool-ironic-cluster-el9", "cipool-ibmcloud")
    • Provider (e.g., "ironic", "equinix", "aws", "ibmcloud")
    • Host name and details

Dev-scripts Logs

  • Location: {target}/baremetalds-devscripts-setup/artifacts/root/dev-scripts/logs/
  • Purpose: Shows installation setup process and cluster installation
  • Contains: Numbered log files showing each setup step (requirements, host config, Ironic setup, installer build, cluster creation). Note: dev-scripts invokes the installer, so installer logs (.openshift_install*.log) will also be present in the devscripts folders.
  • Critical for: Early failures before cluster creation, Ironic/Metal3 setup issues, installation failures

libvirt-logs.tar

  • Location: {target}/baremetalds-devscripts-gather/artifacts/
  • Purpose: VM/node console logs showing boot sequence
  • Contains: Console output from bootstrap and master VMs/nodes
  • Critical for: Boot failures, Ignition errors, kernel panics, network configuration issues

sosreport

  • Location: {target}/baremetalds-devscripts-gather/artifacts/
  • Purpose: Hypervisor system diagnostics
  • Contains: Hypervisor logs, system configuration, diagnostic command output
  • Useful for: Hypervisor-level issues, not typically needed for VM boot problems

squid-logs.tar

  • Location: {target}/baremetalds-devscripts-gather/artifacts/
  • Purpose: Squid proxy logs for inbound CI access to the cluster
  • Contains: Logs showing CI system's inbound connections to the cluster under test. Note: The squid proxy runs on the hypervisor for INBOUND access (CI → cluster), NOT for outbound access (cluster → registry).
  • Critical for: Debugging CI access issues to the cluster, particularly in IPv6/disconnected environments

Implementation Steps

Step 1: Check OFCIR Acquisition

  1. Download OFCIR logs

    gcloud storage cp gs://test-platform-results/{bucket-path}/artifacts/{target}/ofcir-acquire/build-log.txt .work/prow-job-analyze-install-failure/{build_id}/logs/ofcir-build-log.txt --no-user-output-enabled 2>&1 || echo "OFCIR build log not found"
    gcloud storage cp gs://test-platform-results/{bucket-path}/artifacts/{target}/ofcir-acquire/artifacts/junit_metal_setup.xml .work/prow-job-analyze-install-failure/{build_id}/logs/junit_metal_setup.xml --no-user-output-enabled 2>&1 || echo "OFCIR JUnit not found"
    
  2. Check junit_metal_setup.xml for acquisition failure

    • Read the JUnit file
    • Look for test case: [sig-metal] should get working host from infra provider
    • If the test failed, OFCIR failed to acquire a host
    • This means installation never started - the failure is in host acquisition
  3. Extract OFCIR details from build-log.txt (a parsing sketch follows this list)

    • Parse the JSON in the build log to extract:
      • pool: The OFCIR pool name
      • provider: The infrastructure provider
      • name: The host name allocated
    • Save these for the final report
  4. If OFCIR acquisition failed

    • Stop analysis - installation never started
    • Report: "OFCIR host acquisition failed"
    • Include pool and provider information
    • Suggest: Check OFCIR pool availability and provider status
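
A minimal sketch of the checks in items 2 and 3 above, assuming the OFCIR details appear as a single JSON object in the build log and that jq is available; adjust the grep pattern if the log format differs:

    # A <failure> element in the JUnit means host acquisition failed and installation never started.
    grep -q "<failure" .work/prow-job-analyze-install-failure/{build_id}/logs/junit_metal_setup.xml \
      && echo "OFCIR host acquisition FAILED" || echo "OFCIR host acquisition succeeded"

    # Pull pool, provider, and host name out of the JSON in the build log (format assumption).
    grep -o '{.*"provider".*}' .work/prow-job-analyze-install-failure/{build_id}/logs/ofcir-build-log.txt \
      | head -n 1 | jq -r '"pool: \(.pool)  provider: \(.provider)  host: \(.name)"'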

Step 2: Download Dev-Scripts Logs

  1. Download dev-scripts logs directory

    gcloud storage cp -r gs://test-platform-results/{bucket-path}/artifacts/{target}/baremetalds-devscripts-setup/artifacts/root/dev-scripts/logs/ .work/prow-job-analyze-install-failure/{build_id}/logs/devscripts/ --no-user-output-enabled 2>&1 || echo "dev-scripts logs not found"
    
  2. Handle missing dev-scripts logs gracefully

    • Some metal jobs may not have dev-scripts artifacts
    • If missing, note this in the analysis and proceed with other artifacts

Step 3: Download libvirt Console Logs

  1. Find and download libvirt-logs.tar

    gcloud storage ls -r gs://test-platform-results/{bucket-path}/artifacts/ 2>&1 | grep "libvirt-logs\.tar$"
    gcloud storage cp {full-gcs-path-to-libvirt-logs.tar} .work/prow-job-analyze-install-failure/{build_id}/logs/ --no-user-output-enabled
    
  2. Extract libvirt logs

    tar -xf .work/prow-job-analyze-install-failure/{build_id}/logs/libvirt-logs.tar -C .work/prow-job-analyze-install-failure/{build_id}/logs/
    
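The {full-gcs-path-to-libvirt-logs.tar} placeholder above can be filled automatically. A minimal sketch that finds, downloads, and extracts the archive in one pass (assumes a single libvirt-logs.tar in the job artifacts):

    # Locate, download, and unpack the libvirt console log archive.
    LIBVIRT_TAR=$(gcloud storage ls -r gs://test-platform-results/{bucket-path}/artifacts/ 2>/dev/null | grep "libvirt-logs\.tar$" | head -n 1)
    if [ -n "$LIBVIRT_TAR" ]; then
      gcloud storage cp "$LIBVIRT_TAR" .work/prow-job-analyze-install-failure/{build_id}/logs/ --no-user-output-enabled
      tar -xf .work/prow-job-analyze-install-failure/{build_id}/logs/libvirt-logs.tar -C .work/prow-job-analyze-install-failure/{build_id}/logs/
    else
      echo "libvirt-logs.tar not found"
    fi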

Step 4: Download Optional Artifacts

  1. Download sosreport (optional)

    gcloud storage ls -r gs://test-platform-results/{bucket-path}/artifacts/ 2>&1 | grep "sosreport.*\.tar\.xz$"
    gcloud storage cp {full-gcs-path-to-sosreport} .work/prow-job-analyze-install-failure/{build_id}/logs/ --no-user-output-enabled
    tar -xf .work/prow-job-analyze-install-failure/{build_id}/logs/sosreport-{name}.tar.xz -C .work/prow-job-analyze-install-failure/{build_id}/logs/
    
  2. Download squid-logs (optional, for IPv6/disconnected jobs)

    gcloud storage ls -r gs://test-platform-results/{bucket-path}/artifacts/ 2>&1 | grep "squid-logs.*\.tar$"
    gcloud storage cp {full-gcs-path-to-squid-logs} .work/prow-job-analyze-install-failure/{build_id}/logs/ --no-user-output-enabled
    tar -xf .work/prow-job-analyze-install-failure/{build_id}/logs/squid-logs-{name}.tar -C .work/prow-job-analyze-install-failure/{build_id}/logs/
    

Step 5: Analyze Dev-Scripts Logs

Check dev-scripts logs FIRST - they show what happened during setup and installation.

  1. Read dev-scripts logs in order

    • Logs are numbered sequentially showing setup steps
    • Note: dev-scripts invokes the installer, so you'll find .openshift_install*.log files in the devscripts directories
    • Look for the first error or failure
  2. Key errors to look for (see the sketch after this list):

    • Host configuration failures: Networking, DNS, storage setup issues
    • Ironic/Metal3 setup issues: BMC connectivity, provisioning network, node registration failures
    • Installer build failures: Problems building the OpenShift installer binary
    • Install-config validation errors: Invalid configuration before cluster creation
    • Installation failures: Check installer logs (.openshift_install*.log) present in devscripts folders
  3. Important distinction:

    • If the failure is in the dev-scripts setup logs (01-05), the problem is in the setup process
    • If the failure is in the installer logs or 06_create_cluster, the problem is in the cluster installation (also analyzed by the main skill)
  4. Save dev-scripts analysis:

    • Save findings to: .work/prow-job-analyze-install-failure/{build_id}/analysis/devscripts-summary.txt
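
A minimal scan sketch for the keyword search described above, assuming the logs were downloaded in Step 2; the keyword list is illustrative, not exhaustive:

    # Walk the dev-scripts (and embedded installer) logs and surface the first failure lines in each.
    # .openshift_install*.log files are picked up by the find as well.
    for f in $(find .work/prow-job-analyze-install-failure/{build_id}/logs/devscripts/ -type f | sort); do
      hits=$(grep -nI -iE "level=error|error:|fatal|fail" "$f" | head -n 3)
      [ -n "$hits" ] && printf '=== %s ===\n%s\n' "$f" "$hits"
    done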

Step 6: Analyze libvirt Console Logs

Console logs are CRITICAL for metal failures during cluster creation.

  1. Find console logs

    find .work/prow-job-analyze-install-failure/{build_id}/logs/ -name "*console*.log"
    
    • Look for patterns like {cluster-name}-bootstrap_console.log, {cluster-name}-master-{N}_console.log
  2. Analyze console logs for boot/provisioning issues (see the sketch after this list):

    • Kernel boot failures or panics: Look for "panic", "kernel", "oops"
    • Ignition failures: Look for "ignition", "config fetch failed", "Ignition failed"
    • Network configuration issues: Look for "dhcp", "network unreachable", "DNS", "timeout"
    • Disk mounting failures: Look for "mount", "disk", "filesystem"
    • Service startup failures: Look for systemd errors, service failures
  3. Console logs show the complete boot sequence:

    • As if you were watching a physical console
    • Shows kernel messages, Ignition provisioning, CoreOS startup
    • Critical for understanding what happened before the system was fully booted
  4. Save console log analysis:

    • Save findings to: .work/prow-job-analyze-install-failure/{build_id}/analysis/console-summary.txt
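
A minimal sketch of the pattern scan above; the keywords mirror the signals listed in item 2 and are not exhaustive:

    # Scan every console log for boot, Ignition, network, and disk error signals.
    for f in $(find .work/prow-job-analyze-install-failure/{build_id}/logs/ -name "*console*.log"); do
      echo "=== $f ==="
      grep -n -iE "panic|oops|ignition.*(fail|error)|config fetch failed|network unreachable|no route to host|dhcp|timed out|mount.*fail" "$f" | head -n 15
    done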

Step 7: Analyze sosreport (If Downloaded)

Only needed for hypervisor-level issues.

  1. Check sosreport for hypervisor diagnostics:

    • var/log/messages - Hypervisor system log
    • sos_commands/ - Output of diagnostic commands
    • etc/libvirt/ - Libvirt configuration
  2. Look for hypervisor-level issues:

    • Libvirt errors
    • Network configuration problems on hypervisor
    • Resource constraints (CPU, memory, disk)
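
A minimal first-pass sketch over the hypervisor system log, assuming the sosreport was extracted as in Step 4; the keywords are illustrative:

    # Look for errors and resource pressure (libvirt, OOM, disk) on the hypervisor.
    grep -n -iE "error|denied|oom|out of memory|no space left" \
      .work/prow-job-analyze-install-failure/{build_id}/logs/sosreport-*/var/log/messages | head -n 20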

Step 8: Analyze squid-logs (If Downloaded)

Important for debugging CI access to the cluster.

  1. Check squid proxy logs:

    • Look for failed connections from CI to the cluster
    • Look for HTTP errors or blocked requests
    • Check patterns of CI test framework access issues
  2. Common issues:

    • CI unable to connect to cluster API
    • Proxy configuration errors blocking CI access
    • Network routing issues between CI and cluster
    • Note: These logs are for INBOUND access (CI → cluster), not for cluster's outbound access to registries
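
A minimal sketch for spotting failed inbound requests, assuming the extracted archive contains standard squid access.log files (adjust the path if the archive layout differs):

    # Denied or 4xx/5xx proxied requests point at CI → cluster access problems.
    grep -n -E "TCP_DENIED|/(4|5)[0-9][0-9] " \
      .work/prow-job-analyze-install-failure/{build_id}/logs/squid-logs-*/access.log* 2>/dev/null | head -n 20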

Step 9: Generate Metal-Specific Analysis Report

  1. Create comprehensive metal analysis report:

    Metal Installation Failure Analysis
    ====================================
    
    Job: {job-name}
    Build ID: {build_id}
    Prow URL: {original-url}
    
    Installation Method: dev-scripts + Metal3 + Ironic
    
    OFCIR Host Acquisition
    ----------------------
    Pool: {pool name from OFCIR build log}
    Provider: {provider from OFCIR build log}
    Host: {host name from OFCIR build log}
    Status: {Success or Failure}
    
    {If OFCIR acquisition failed, note that installation never started}
    
    Dev-Scripts Analysis
    --------------------
    {Summary of dev-scripts logs}
    
    Key Findings:
    - {First error in dev-scripts setup}
    - {Related errors}
    
    If dev-scripts failed: The problem is in the setup process (host config, Ironic, installer build)
    If dev-scripts succeeded: The problem is in cluster installation (see main analysis)
    
    Console Logs Analysis
    ---------------------
    {Summary of VM/node console logs}
    
    Bootstrap Node:
    - {Boot sequence status}
    - {Ignition status}
    - {Network configuration}
    - {Key errors}
    
    Master Nodes:
    - {Status for each master}
    - {Key errors}
    
    Hypervisor Diagnostics (sosreport)
    -----------------------------------
    {Summary of sosreport findings, if applicable}
    
    Proxy Logs (squid)
    ------------------
    {Summary of proxy logs, if applicable}
    Note: Squid logs show CI access to the cluster, not cluster's registry access
    
    Metal-Specific Recommended Steps
    ---------------------------------
    Based on the failure:
    
    For dev-scripts setup failures:
    - Review host configuration (networking, DNS, storage)
    - Check Ironic/Metal3 setup logs for BMC/provisioning issues
    - Verify installer build completed successfully
    - Check installer logs in devscripts folders
    
    For console boot failures:
    - Check Ignition configuration and network connectivity
    - Review kernel boot messages for hardware issues
    - Verify network configuration (DHCP, DNS, routing)
    
    For CI access issues:
    - Check squid proxy logs for failed CI connections to cluster
    - Verify network routing between CI and cluster
    - Check proxy configuration
    
    Artifacts Location
    ------------------
    Dev-scripts logs: .work/prow-job-analyze-install-failure/{build_id}/logs/devscripts/
    Console logs: .work/prow-job-analyze-install-failure/{build_id}/logs/
    sosreport: .work/prow-job-analyze-install-failure/{build_id}/logs/sosreport-*/
    squid logs: .work/prow-job-analyze-install-failure/{build_id}/logs/squid-logs-*/
    
  2. Save report:

    • Save to: .work/prow-job-analyze-install-failure/{build_id}/analysis/metal-analysis.txt

Step 10: Return Metal Analysis to Main Skill

  1. Provide summary to main skill:
    • Brief summary of metal-specific findings
    • Indication of whether failure was in dev-scripts setup or cluster installation
    • Key error messages and recommended actions

Common Metal Failure Patterns

  • Dev-scripts host config: Early failure before cluster creation. Where to look: dev-scripts logs (host configuration step)
  • Ironic/Metal3 setup: Provisioning failures, BMC errors. Where to look: dev-scripts logs (Ironic setup), Ironic logs
  • Node boot failure: VMs/nodes won't boot. Where to look: console logs (kernel, boot sequence)
  • Ignition failure: Nodes boot but don't provision. Where to look: console logs (Ignition messages)
  • Network config: DHCP failures, DNS issues. Where to look: console logs (network messages), dev-scripts host config
  • CI access issues: Tests can't connect to the cluster. Where to look: squid logs (proxy logs for CI → cluster access)
  • Hypervisor issues: Resource constraints, libvirt errors. Where to look: sosreport (system logs, libvirt config)

Tips

  • Check dev-scripts logs FIRST: They show setup and installation (dev-scripts invokes the installer)
  • Installer logs in devscripts: Look for .openshift_install*.log files in devscripts directories
  • Console logs are critical: They show the actual boot sequence like a physical console
  • Ironic/Metal3 errors often appear in dev-scripts setup logs
  • Squid logs are for CI access: They show inbound CI → cluster access, not outbound cluster → registry
  • Boot vs. provisioning: Boot failures appear in console logs, provisioning failures in Ironic logs
  • Layer distinction: Separate dev-scripts setup from Ironic provisioning from OpenShift installation