| name | Prow Job Analyze Metal Install Failure |
| description | Analyze OpenShift bare metal installation failures in Prow CI jobs using dev-scripts artifacts. Use for jobs with "metal" in the name, and for debugging Metal3/Ironic provisioning, installation, or dev-scripts setup failures. You may also use the prow-job-analyze-install-failure skill with this one. |
Prow Job Analyze Metal Install Failure
This skill helps debug OpenShift bare metal installation failures in CI jobs by analyzing dev-scripts logs, libvirt console logs, sosreports, and other metal-specific artifacts.
When to Use This Skill
Use this skill when:
- A bare metal CI job fails with "install should succeed" test failure
- The job name contains "metal" or "baremetal"
- You need to debug Metal3/Ironic provisioning issues
- You need to analyze dev-scripts setup failures
This skill is invoked by the main prow-job-analyze-install-failure skill when it detects a metal job.
Metal Installation Overview
Metal IPI jobs use dev-scripts (https://github.com/openshift-metal3/dev-scripts) with Metal3 and Ironic to install OpenShift:
- dev-scripts: Framework for setting up and installing OpenShift on bare metal
- Metal3: Kubernetes-native interface to Ironic
- Ironic: Bare metal provisioning service
The installation process has multiple layers:
- dev-scripts setup: Configures hypervisor, sets up Ironic/Metal3, builds installer
- Ironic provisioning: Provisions bare metal nodes (or VMs acting as bare metal)
- OpenShift installation: Standard installer runs on provisioned nodes
Failures can occur at any layer, so analysis must check all of them.
Network Architecture (CRITICAL for Understanding IPv6/Disconnected Jobs)
IMPORTANT: The term "disconnected" refers to the cluster nodes, NOT the hypervisor.
Hypervisor (dev-scripts host)
- HAS full internet access
- Downloads packages, container images, and dependencies from the public internet
- Runs dev-scripts Ansible playbooks that download tools (Go, installer, etc.)
- Hosts a local mirror registry to serve the cluster
Cluster VMs/Nodes
- Run in a private IPv6-only network (when IP_STACK=v6)
- NO direct internet access (truly disconnected)
- Pull container images from the hypervisor's local mirror registry
- Access to hypervisor services only (registry, DNS, etc.)
Common Misconception
When analyzing failures in "metal-ipi-ovn-ipv6" jobs:
- ❌ WRONG: "The hypervisor cannot access the internet, so downloads fail"
- ✅ CORRECT: "The hypervisor has internet access. If downloads fail, it's likely due to the remote service being unavailable, not network restrictions"
Implications for Failure Analysis
- Dev-scripts failures (steps 01-05): If external downloads fail, check if the remote service/URL is down or has removed the resource
- Installation failures (step 06+): If cluster nodes cannot pull images, check the local mirror registry on the hypervisor
- HTTP 403/404 errors during dev-scripts: Usually means the resource was removed from the upstream source, not that the network is restricted
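When a live hypervisor session or reproducer is available, a quick connectivity check makes this distinction concrete. A minimal sketch, assuming a dev-scripts-style mirror registry on the hypervisor; the hostname `virthost` and port `5000` are placeholders, not values taken from the job:

```
# Hypervisor: outbound internet access is expected to work, so a failing
# download usually points at the remote service, not at network restrictions.
curl -sS -o /dev/null -w "%{http_code}\n" https://quay.io/v2/ || echo "remote service unreachable"

# Cluster nodes: should only reach the hypervisor-hosted mirror registry
# (hostname/port below are placeholders; check the job's configuration).
curl -sSk -o /dev/null -w "%{http_code}\n" https://virthost:5000/v2/
```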
Prerequisites
gcloud CLI Installation
- Check if installed: `which gcloud`
- If not installed, provide instructions for the user's platform
- Installation guide: https://cloud.google.com/sdk/docs/install
gcloud Authentication (Optional)
- The `test-platform-results` bucket is publicly accessible
- No authentication is required for read access
Input Format
The user will provide:
- Build ID - Extracted by the main skill
- Bucket path - Extracted by the main skill
- Target name - Extracted by the main skill
- Working directory - Already created by main skill
Metal-Specific Artifacts
Metal jobs produce several diagnostic archives:
OFCIR Acquisition Logs
- Location: `{target}/ofcir-acquire/`
- Purpose: Shows the OFCIR host acquisition process
- Contains:
  - `build-log.txt`: Log showing pool, provider, and host details
  - `artifacts/junit_metal_setup.xml`: JUnit with the test `[sig-metal] should get working host from infra provider`
- Critical for: Determining if the job failed to acquire a host before installation started
- Key information:
- Pool name (e.g., "cipool-ironic-cluster-el9", "cipool-ibmcloud")
- Provider (e.g., "ironic", "equinix", "aws", "ibmcloud")
- Host name and details
Dev-scripts Logs
- Location: `{target}/baremetalds-devscripts-setup/artifacts/root/dev-scripts/logs/`
- Purpose: Shows the installation setup process and cluster installation
- Contains: Numbered log files showing each setup step (requirements, host config, Ironic setup, installer build, cluster creation). Note: dev-scripts invokes the installer, so installer logs (`.openshift_install*.log`) will also be present in the devscripts folders.
- Critical for: Early failures before cluster creation, Ironic/Metal3 setup issues, installation failures
libvirt-logs.tar
- Location: `{target}/baremetalds-devscripts-gather/artifacts/`
- Purpose: VM/node console logs showing the boot sequence
- Contains: Console output from bootstrap and master VMs/nodes
- Critical for: Boot failures, Ignition errors, kernel panics, network configuration issues
sosreport
- Location: `{target}/baremetalds-devscripts-gather/artifacts/`
- Purpose: Hypervisor system diagnostics
- Contains: Hypervisor logs, system configuration, diagnostic command output
- Useful for: Hypervisor-level issues, not typically needed for VM boot problems
squid-logs.tar
- Location: `{target}/baremetalds-devscripts-gather/artifacts/`
- Purpose: Squid proxy logs for inbound CI access to the cluster
- Contains: Logs showing CI system's inbound connections to the cluster under test. Note: The squid proxy runs on the hypervisor for INBOUND access (CI → cluster), NOT for outbound access (cluster → registry).
- Critical for: Debugging CI access issues to the cluster, particularly in IPv6/disconnected environments
Implementation Steps
Step 1: Check OFCIR Acquisition
Download OFCIR logs
```
gcloud storage cp gs://test-platform-results/{bucket-path}/artifacts/{target}/ofcir-acquire/build-log.txt .work/prow-job-analyze-install-failure/{build_id}/logs/ofcir-build-log.txt --no-user-output-enabled 2>&1 || echo "OFCIR build log not found"
gcloud storage cp gs://test-platform-results/{bucket-path}/artifacts/{target}/ofcir-acquire/artifacts/junit_metal_setup.xml .work/prow-job-analyze-install-failure/{build_id}/logs/junit_metal_setup.xml --no-user-output-enabled 2>&1 || echo "OFCIR JUnit not found"
```

Check junit_metal_setup.xml for acquisition failure
- Read the JUnit file
- Look for the test case: `[sig-metal] should get working host from infra provider`
- If the test failed, OFCIR failed to acquire a host
- This means installation never started - the failure is in host acquisition
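A minimal check, assuming the JUnit file was downloaded to the path above and uses the standard `<failure>` element for failed test cases:

```
# Quick check: does the OFCIR setup JUnit report a failure?
JUNIT=".work/prow-job-analyze-install-failure/{build_id}/logs/junit_metal_setup.xml"
if grep -q "<failure" "$JUNIT" 2>/dev/null; then
  echo "OFCIR host acquisition failed - installation never started"
else
  echo "No <failure> element found - host acquisition appears to have succeeded"
fi
```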
Extract OFCIR details from build-log.txt
- Parse the JSON in the build log to extract:
  - `pool`: The OFCIR pool name
  - `provider`: The infrastructure provider
  - `name`: The host name allocated
- Save these for the final report
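One possible way to pull these fields out, assuming the build log embeds the OFCIR record as single-line JSON with `pool`, `provider`, and `name` keys (verify against the actual log before relying on it):

```
# Extract pool/provider/host from the OFCIR build log (format assumption:
# one JSON object on a single line). Adjust the grep if the layout differs.
LOG=".work/prow-job-analyze-install-failure/{build_id}/logs/ofcir-build-log.txt"
grep -o '{.*"pool".*}' "$LOG" | head -n 1 \
  | jq -r '"Pool: \(.pool)\nProvider: \(.provider)\nHost: \(.name)"'
```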
If OFCIR acquisition failed
- Stop analysis - installation never started
- Report: "OFCIR host acquisition failed"
- Include pool and provider information
- Suggest: Check OFCIR pool availability and provider status
Step 2: Download Dev-Scripts Logs
Download dev-scripts logs directory
```
gcloud storage cp -r gs://test-platform-results/{bucket-path}/artifacts/{target}/baremetalds-devscripts-setup/artifacts/root/dev-scripts/logs/ .work/prow-job-analyze-install-failure/{build_id}/logs/devscripts/ --no-user-output-enabled
```

Handle missing dev-scripts logs gracefully
- Some metal jobs may not have dev-scripts artifacts
- If missing, note this in the analysis and proceed with other artifacts
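A simple guard along these lines keeps the analysis moving; the path follows the working-directory convention used throughout this skill:

```
# If the recursive copy produced no files, note it and continue.
DEVSCRIPTS=".work/prow-job-analyze-install-failure/{build_id}/logs/devscripts"
if [ -z "$(ls -A "$DEVSCRIPTS" 2>/dev/null)" ]; then
  echo "No dev-scripts logs found for this job; continuing with console/sosreport analysis"
fi
```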
Step 3: Download libvirt Console Logs
Find and download libvirt-logs.tar
```
gcloud storage ls -r gs://test-platform-results/{bucket-path}/artifacts/ 2>&1 | grep "libvirt-logs\.tar$"
gcloud storage cp {full-gcs-path-to-libvirt-logs.tar} .work/prow-job-analyze-install-failure/{build_id}/logs/ --no-user-output-enabled
```

Extract libvirt logs

```
tar -xf .work/prow-job-analyze-install-failure/{build_id}/logs/libvirt-logs.tar -C .work/prow-job-analyze-install-failure/{build_id}/logs/
```
Step 4: Download Optional Artifacts
Download sosreport (optional)
```
gcloud storage ls -r gs://test-platform-results/{bucket-path}/artifacts/ 2>&1 | grep "sosreport.*\.tar\.xz$"
gcloud storage cp {full-gcs-path-to-sosreport} .work/prow-job-analyze-install-failure/{build_id}/logs/ --no-user-output-enabled
tar -xf .work/prow-job-analyze-install-failure/{build_id}/logs/sosreport-{name}.tar.xz -C .work/prow-job-analyze-install-failure/{build_id}/logs/
```

Download squid-logs (optional, for IPv6/disconnected jobs)

```
gcloud storage ls -r gs://test-platform-results/{bucket-path}/artifacts/ 2>&1 | grep "squid-logs.*\.tar$"
gcloud storage cp {full-gcs-path-to-squid-logs} .work/prow-job-analyze-install-failure/{build_id}/logs/ --no-user-output-enabled
tar -xf .work/prow-job-analyze-install-failure/{build_id}/logs/squid-logs-{name}.tar -C .work/prow-job-analyze-install-failure/{build_id}/logs/
```
Step 5: Analyze Dev-Scripts Logs
Check dev-scripts logs FIRST - they show what happened during setup and installation.
Read dev-scripts logs in order
- Logs are numbered sequentially showing setup steps
- Note: dev-scripts invokes the installer, so you'll find `.openshift_install*.log` files in the devscripts directories
- Look for the first error or failure
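A first pass over the directory might look like the sketch below; the error keywords are illustrative, and the flagged file should be read in full before drawing conclusions:

```
# List the numbered setup logs in order, then flag the first log that
# mentions a failure. File names vary between dev-scripts versions.
LOGS=".work/prow-job-analyze-install-failure/{build_id}/logs/devscripts"
ls -1v "$LOGS"
grep -rlis -E "error|fatal|failed" "$LOGS" | sort -V | head -n 1
```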
Key errors to look for:
- Host configuration failures: Networking, DNS, storage setup issues
- Ironic/Metal3 setup issues: BMC connectivity, provisioning network, node registration failures
- Installer build failures: Problems building the OpenShift installer binary
- Install-config validation errors: Invalid configuration before cluster creation
- Installation failures: Check the installer logs (`.openshift_install*.log`) present in the devscripts folders
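As a rough combined scan for these categories (assumed keywords, tune as needed):

```
# Surface lines matching the error categories above across dev-scripts and
# installer logs; patterns are illustrative, not exhaustive.
LOGS=".work/prow-job-analyze-install-failure/{build_id}/logs/devscripts"
grep -rinE "bmc|redfish|ipmi|provisioning network|install-config|level=(error|fatal)" "$LOGS" 2>/dev/null | head -n 40
```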
Important distinction:
- If failure is in dev-scripts setup logs (01-05), the problem is in the setup process
- If failure is in installer logs or 06_create_cluster, the problem is in the cluster installation (also analyzed by main skill)
Save dev-scripts analysis:
- Save findings to: `.work/prow-job-analyze-install-failure/{build_id}/analysis/devscripts-summary.txt`
Step 6: Analyze libvirt Console Logs
Console logs are CRITICAL for metal failures during cluster creation.
Find console logs
```
find .work/prow-job-analyze-install-failure/{build_id}/logs/ -name "*console*.log"
```

- Look for patterns like `{cluster-name}-bootstrap_console.log`, `{cluster-name}-master-{N}_console.log`
Analyze console logs for boot/provisioning issues:
- Kernel boot failures or panics: Look for "panic", "kernel", "oops"
- Ignition failures: Look for "ignition", "config fetch failed", "Ignition failed"
- Network configuration issues: Look for "dhcp", "network unreachable", "DNS", "timeout"
- Disk mounting failures: Look for "mount", "disk", "filesystem"
- Service startup failures: Look for systemd errors, service failures
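A quick triage pass using the symptom keywords above can point at the right console log to read first; the counts only prioritize, they do not replace reading the surrounding context:

```
# Count symptom keywords per console log to decide which node to inspect first.
for f in .work/prow-job-analyze-install-failure/{build_id}/logs/*console*.log; do
  echo "=== $f ==="
  grep -icE "panic|oops|ignition|config fetch failed|dhcp|network unreachable|timed out|timeout" "$f"
done
```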
Console logs show the complete boot sequence:
- As if you were watching a physical console
- Shows kernel messages, Ignition provisioning, CoreOS startup
- Critical for understanding what happened before the system was fully booted
Save console log analysis:
- Save findings to: `.work/prow-job-analyze-install-failure/{build_id}/analysis/console-summary.txt`
Step 7: Analyze sosreport (If Downloaded)
Only needed for hypervisor-level issues.
Check sosreport for hypervisor diagnostics:
- `var/log/messages`: Hypervisor system logs
- `sos_commands/`: Output of diagnostic commands
- `etc/libvirt/`: Libvirt configuration
Look for hypervisor-level issues:
- Libvirt errors
- Network configuration problems on hypervisor
- Resource constraints (CPU, memory, disk)
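If the sosreport was extracted, a starting point might be the sketch below; the `sosreport-*` directory name and internal layout vary by sos version:

```
# Look for libvirt errors and resource pressure in the hypervisor logs.
SOS=$(ls -d .work/prow-job-analyze-install-failure/{build_id}/logs/sosreport-*/ 2>/dev/null | head -n 1)
grep -iE "libvirt.*error|out of memory|no space left" "$SOS/var/log/messages" 2>/dev/null | tail -n 20
```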
Step 8: Analyze squid-logs (If Downloaded)
Important for debugging CI access to the cluster.
Check squid proxy logs:
- Look for failed connections from CI to the cluster
- Look for HTTP errors or blocked requests
- Check patterns of CI test framework access issues
Common issues:
- CI unable to connect to cluster API
- Proxy configuration errors blocking CI access
- Network routing issues between CI and cluster
- Note: These logs are for INBOUND access (CI → cluster), not for cluster's outbound access to registries
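As a sketch, assuming a standard squid `access.log` format inside the extracted archive (the exact path and format may differ):

```
# Summarize denied or failing proxied requests (CI -> cluster direction).
ACCESS=$(find .work/prow-job-analyze-install-failure/{build_id}/logs/ -path "*squid*" -name "access.log*" | head -n 1)
grep -E "TCP_DENIED|/(4|5)[0-9][0-9] " "$ACCESS" 2>/dev/null | tail -n 20
```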
Step 9: Generate Metal-Specific Analysis Report
Create comprehensive metal analysis report:
```
Metal Installation Failure Analysis
====================================

Job: {job-name}
Build ID: {build_id}
Prow URL: {original-url}
Installation Method: dev-scripts + Metal3 + Ironic

OFCIR Host Acquisition
----------------------
Pool: {pool name from OFCIR build log}
Provider: {provider from OFCIR build log}
Host: {host name from OFCIR build log}
Status: {Success or Failure}
{If OFCIR acquisition failed, note that installation never started}

Dev-Scripts Analysis
--------------------
{Summary of dev-scripts logs}

Key Findings:
- {First error in dev-scripts setup}
- {Related errors}

If dev-scripts failed: The problem is in the setup process (host config, Ironic, installer build)
If dev-scripts succeeded: The problem is in cluster installation (see main analysis)

Console Logs Analysis
---------------------
{Summary of VM/node console logs}

Bootstrap Node:
- {Boot sequence status}
- {Ignition status}
- {Network configuration}
- {Key errors}

Master Nodes:
- {Status for each master}
- {Key errors}

Hypervisor Diagnostics (sosreport)
-----------------------------------
{Summary of sosreport findings, if applicable}

Proxy Logs (squid)
------------------
{Summary of proxy logs, if applicable}
Note: Squid logs show CI access to the cluster, not cluster's registry access

Metal-Specific Recommended Steps
---------------------------------
Based on the failure:

For dev-scripts setup failures:
- Review host configuration (networking, DNS, storage)
- Check Ironic/Metal3 setup logs for BMC/provisioning issues
- Verify installer build completed successfully
- Check installer logs in devscripts folders

For console boot failures:
- Check Ignition configuration and network connectivity
- Review kernel boot messages for hardware issues
- Verify network configuration (DHCP, DNS, routing)

For CI access issues:
- Check squid proxy logs for failed CI connections to cluster
- Verify network routing between CI and cluster
- Check proxy configuration

Artifacts Location
------------------
Dev-scripts logs: .work/prow-job-analyze-install-failure/{build_id}/logs/devscripts/
Console logs: .work/prow-job-analyze-install-failure/{build_id}/logs/
sosreport: .work/prow-job-analyze-install-failure/{build_id}/logs/sosreport-*/
squid logs: .work/prow-job-analyze-install-failure/{build_id}/logs/squid-logs-*/
```

Save report:
- Save to: `.work/prow-job-analyze-install-failure/{build_id}/analysis/metal-analysis.txt`
Step 10: Return Metal Analysis to Main Skill
- Provide summary to main skill:
- Brief summary of metal-specific findings
- Indication of whether failure was in dev-scripts setup or cluster installation
- Key error messages and recommended actions
Common Metal Failure Patterns
| Issue | Symptoms | Where to Look |
|---|---|---|
| Dev-scripts host config | Early failure before cluster creation | Dev-scripts logs (host configuration step) |
| Ironic/Metal3 setup | Provisioning failures, BMC errors | Dev-scripts logs (Ironic setup), Ironic logs |
| Node boot failure | VMs/nodes won't boot | Console logs (kernel, boot sequence) |
| Ignition failure | Nodes boot but don't provision | Console logs (Ignition messages) |
| Network config | DHCP failures, DNS issues | Console logs (network messages), dev-scripts host config |
| CI access issues | Tests can't connect to cluster | squid logs (proxy logs for CI → cluster access) |
| Hypervisor issues | Resource constraints, libvirt errors | sosreport (system logs, libvirt config) |
Tips
- Check dev-scripts logs FIRST: They show setup and installation (dev-scripts invokes the installer)
- Installer logs in devscripts: Look for `.openshift_install*.log` files in the devscripts directories
- Console logs are critical: They show the actual boot sequence like a physical console
- Ironic/Metal3 errors often appear in dev-scripts setup logs
- Squid logs are for CI access: They show inbound CI → cluster access, not outbound cluster → registry
- Boot vs. provisioning: Boot failures appear in console logs, provisioning failures in Ironic logs
- Layer distinction: Separate dev-scripts setup from Ironic provisioning from OpenShift installation