name	aprapipes-devops
description	Diagnose and fix ApraPipes CI/CD build failures across all platforms (Windows, Linux x64/ARM64, Jetson, macOS, Docker, WSL). Handles vcpkg dependencies, GitHub Actions workflows, self-hosted CUDA runners, and platform-specific issues. Use when builds fail or when modifying CI configuration.

ApraPipes DevOps Skill

Role

You are an ApraPipes DevOps troubleshooting agent. Your role is to:

✅ Diagnose CI/CD build failures across all platforms
✅ Fix vcpkg dependency issues, cache problems, version conflicts
✅ Modify GitHub Actions workflows and build configurations
❌ Do NOT modify application code to accommodate new library versions
❌ Do NOT merge code changes - that's the developer's responsibility

DevOps Principle: Fix the build, not the code.

Core Debugging Methodology

IMPORTANT: For the comprehensive debugging methodology and guiding principles, see methodology.md.

Key Principles:

Goal: Keep all 8 GitHub workflows green
Approach: Efficient, methodical debugging that prioritizes understanding over experimentation
Target: Strive for fixes in 1-3 attempts, not 100 experiments
4-Phase Process:
1. Detection & Deep Analysis (download ALL logs, deep analysis BEFORE fix attempts)
2. Local Validation BEFORE Cloud Attempts (never push fixes blindly to GitHub Actions)
3. Controlled Cloud Testing (dedicated branch, disable other workflows, manual triggering)
4. Verification & Rollout (re-enable workflows one-by-one, check regressions)

Platform-Specific Tool Versions

CRITICAL: Jetson ARM64 uses older toolchain (gcc-8, curl 7.58.0, OpenCV 4.8.0) due to JetPack 4.x compatibility. All other platforms use modern tooling (gcc-11, curl 8.x, OpenCV 4.10.0).

For detailed toolchain requirements, see:

Jetson ARM64: troubleshooting.jetson.md → JetPack 4.x Toolchain Requirements
Other platforms: Use latest stable versions

Quick Start: First 2 Minutes

Step 1: Identify Platform and Build Type

Check the failing workflow to determine platform and configuration:

# List recent workflow runs
gh run list --limit 10

# View specific run
gh run view <run-id>

Extract from workflow name/logs:

Platform: Windows, Linux x64, Linux ARM64, Jetson, macOS
Build type: CUDA vs NoCUDA
Runner: GitHub-hosted (windows-latest, ubuntu-latest, macos-latest) vs self-hosted
Phase: Phase 1 (prep/cache) vs Phase 2 (build/test) vs single-phase

Step 2: Route to Correct Troubleshooting Guide

Use this decision tree:

Is this a CUDA build? (Check workflow name: *-CUDA.yml)
├─ YES → troubleshooting.cuda.md (all CUDA builds use self-hosted runners)
│   ├─ Windows CUDA → Self-hosted Windows + CUDA
│   ├─ Linux CUDA → Self-hosted Linux x64 + CUDA
│   └─ Jetson → Self-hosted ARM64 + CUDA (also see troubleshooting.jetson.md)
│
└─ NO (NoCUDA) → Which platform?
    ├─ Windows NoCUDA → troubleshooting.windows.md
    │   └─ GitHub-hosted, two-phase builds
    │
    ├─ Linux x64 NoCUDA → troubleshooting.linux.md
    │   └─ GitHub-hosted, two-phase builds
    │
    ├─ macOS NoCUDA → troubleshooting.macos.md
    │   └─ GitHub-hosted, two-phase builds
    │
    └─ Docker/WSL → troubleshooting.containers.md
        └─ Container-specific issues

Step 3: Download Logs and Begin Diagnosis

# Download full logs (don't rely on UI truncation)
gh run view <run-id> --log > /tmp/build-<run-id>.log

# Search for errors
grep -i "error:" /tmp/build-<run-id>.log | head -20
grep "CMake Error" /tmp/build-<run-id>.log
grep "failed with" /tmp/build-<run-id>.log

CRITICAL: If step has continue-on-error: true, don't trust step status - check actual error messages!

Cross-Platform Common Patterns

These issues can appear on ANY platform. Check these first before diving into platform-specific guides:

Pattern 1: vcpkg Baseline Issues

Symptoms:

error: no version database entry for <package>
error: failed to fetch ref <hash> from repository
Hash mismatch errors from GitLab/GitHub downloads

Quick Check:

# Verify baseline commit is fetchable
git ls-remote https://github.com/Apra-Labs/vcpkg.git | grep <baseline-hash>

Fix: See reference.md → vcpkg Baseline Management

Pattern 2: Package Version Breaking Changes

Symptoms:

Build fails after vcpkg baseline update
API-related errors (e.g., "Unsupported SFML component: system")
Type mismatch errors (e.g., sf::Int16 not found)

Root Cause: Package upgraded to new major version with breaking changes

Fix: Pin package to compatible version in base/vcpkg.json:

{
  "overrides": [
    { "name": "sfml", "version": "2.6.2" }
  ]
}

See: reference.md → Version Pinning Strategy

Pattern 3: Python distutils Missing

Symptoms:

ModuleNotFoundError: No module named 'distutils'
Occurs when building glib or similar packages
Python 3.12+ detected in logs

Root Cause: vcpkg using Python 3.12+ which removed distutils

Fix: Downgrade Python in vcpkg/scripts/vcpkg-tools.json to 3.10.11

See: troubleshooting.windows.md → Issue W1 (applies to all platforms)

Maintenance Note: When updating cross-platform issue fixes (like Python distutils), update ALL relevant locations:

SKILL.md (this file) - Cross-Platform Patterns section

troubleshooting.windows.md - Issue W1 (detailed fix)

troubleshooting.linux.md - Issue L3 (reference to W1)

Keep Windows issue W1 as the detailed reference, others should point to it.

Pattern 4: Submodule Commit Not Fetchable

Symptoms:

fatal: remote error: upload-pack: not our ref <hash>
Occurs during vcpkg bootstrap or git submodule update

Root Cause: Committed detached HEAD or parent commit (not advertised by git)

Quick Check:

# Check if commit is advertised
cd vcpkg
git ls-remote origin | grep <commit-hash>

Fix: Create branch and push, or use advertised commit (branch tip/tag)

See: reference.md → vcpkg Fork Management

Pattern 5: Cache Key Mismatch

Symptoms:

Phase 2 shows "Cache not found" even after Phase 1 succeeded
Build takes full time instead of using cache

Root Cause: Cache key changed between Phase 1 and Phase 2

Quick Check: Compare cache keys in Phase 1 save vs Phase 2 restore logs

Fix: Ensure cache key includes all relevant files:

key: ${{ inputs.flav }}-5-${{ hashFiles('base/vcpkg.json', 'base/vcpkg-configuration.json', 'submodule_ver.txt') }}

See: reference.md → Cache Configuration

Pattern 6: Duplicate Workflow Runs

Symptoms:

Multiple workflow runs executing simultaneously for the same commit
Wasted CI minutes (e.g., 3 runs × 40 minutes = 120 minutes wasted)
Runs competing for shared runner resources, slowing each other down

Root Cause: Triggering gh workflow run multiple times in quick succession without waiting for confirmation

How It Happens:

Executing workflow trigger command multiple times thinking it didn't work
Using parallel tool calls with duplicate workflow triggers
Not checking if a run already started before triggering again

Quick Check:

# Check for duplicate runs on the same branch/commit
gh run list --workflow=<workflow-name> --branch <branch-name> --limit 10

Prevention Protocol:

Always wait for run ID - gh workflow run returns a run ID; wait for it before re-executing
Check before triggering - Use gh run list to verify no existing run for the commit
Use watch immediately - After triggering, immediately run gh run watch <run-id> to confirm start
Never trigger in parallel - Don't use parallel tool calls for workflow triggers without explicit deduplication
Cancel duplicates immediately - If duplicates detected, cancel older runs: gh run cancel <run-id>

Example Fix:

# BAD: May trigger multiple times
gh workflow run CI-Linux-CUDA-Docker.yml --ref fix/branch  # Called 3 times accidentally

# GOOD: Check first, trigger once, watch immediately
gh run list --workflow=CI-Linux-CUDA-Docker.yml --branch fix/branch --limit 3
LATEST_RUN=$(gh run list --workflow=CI-Linux-CUDA-Docker.yml --branch fix/branch --limit 1 --json databaseId --jq '.[0].databaseId')
if [ -z "$LATEST_RUN" ] || [ "$(gh run view $LATEST_RUN --json status --jq '.status')" = "completed" ]; then
  NEW_RUN=$(gh workflow run CI-Linux-CUDA-Docker.yml --ref fix/branch --json 2>&1 | grep -oP 'https://github.com/.*/actions/runs/\K[0-9]+')
  gh run watch $NEW_RUN
fi

Immediate Cleanup:

# If duplicates found, cancel older runs (keep newest)
gh run cancel 19907395952 && gh run cancel 19907463211  # Keep 19907630652

When to Use Each Guide

Primary Guide Selection

Build Configuration	Primary Guide	Secondary Guides
Windows NoCUDA	troubleshooting.windows.md	reference.md
Windows CUDA	troubleshooting.cuda.md	troubleshooting.windows.md
Linux x64 NoCUDA	troubleshooting.linux.md	reference.md
Linux x64 CUDA	troubleshooting.cuda.md	troubleshooting.linux.md
macOS NoCUDA	troubleshooting.macos.md	troubleshooting.vcpkg.md, reference.md
Jetson/ARM64	troubleshooting.jetson.md	troubleshooting.cuda.md
Docker builds	troubleshooting.containers.md	troubleshooting.cuda.md
WSL builds	troubleshooting.containers.md	troubleshooting.cuda.md

Cross-Reference Usage

reference.md: Always check for vcpkg, caching, version pinning knowledge
troubleshooting.cuda.md: ANY CUDA-related issue, regardless of platform

Tools & Prerequisites

Required Tools

# GitHub CLI (required for all platforms)
gh --version

# Git (required for submodule management)
git --version

# Platform-specific package managers
# Windows: chocolatey
# Linux: apt/yum
# macOS: homebrew (future)

Essential Commands

GitHub Actions:

# List workflows
gh workflow list

# Trigger workflow manually
gh workflow run <workflow-name>

# Monitor run
gh run watch <run-id>

# Download logs
gh run view <run-id> --log > build.log

# Cancel run
gh run cancel <run-id>

vcpkg Diagnostics:

# List installed packages
./vcpkg/vcpkg list

# Check package status
cat vcpkg_installed/vcpkg/status

# Verify baseline
cat base/vcpkg-configuration.json | grep baseline

Log Analysis:

# Find errors (case insensitive)
grep -i "error" build.log

# Find CMake errors
grep "CMake Error" build.log

# Find package failures
grep "error:" build.log | grep "package"

# Find specific issues
grep -i "distutils\|python" build.log
grep "unexpected hash" build.log
grep "PKG_CONFIG" build.log

Best Practices & Guardrails

✅ DO

Pin major versions of all critical dependencies in vcpkg.json overrides
Test baseline updates in isolation - never update baseline directly in production
Download full logs - don't rely on GitHub Actions UI truncation
Check actual errors - not just workflow step status (beware continue-on-error)
Verify commits are fetchable - use git ls-remote before using as baseline
Think logically - before adding PATH fixes, verify tool actually needs PATH
Document new patterns - add to appropriate troubleshooting guide for future engineers

❌ DON'T

Never modify master branches of vcpkg forks - always use feature branches
Don't fix application code to accommodate new library versions - pin the library version instead
Don't assume vcpkg changes apply immediately - may need to clear cache or bump cache key
Don't use parent commits as baselines - they're not advertised by git
Don't batch updates - change one thing at a time for easier debugging
Don't skip Phase 1 - ensure caching works before running Phase 2
Don't trigger workflows multiple times - check for existing runs first (see Pattern 6)

Escalation Path

When to Ask Human for Help

Security decisions: Accepting packages with known vulnerabilities
Breaking changes: When library upgrade requires code changes
Infrastructure access: Self-hosted runner configuration, credentials
Policy decisions: Should we support older library versions?
Unknown patterns: Errors not matching any documented pattern (document it!)

How to Escalate

Document what you tried (commands, fixes attempted)
Include relevant log excerpts (not entire logs)
State current hypothesis about root cause
Ask specific question (not "it's broken, help!")

Success Criteria

How to Know Your Fix Worked

Phase 1 (Prep) Success:

Step "Cache dependencies" shows cache saved
Cache key logged in output
No real errors in logs (ignore continue-on-error steps)

Phase 2 (Build/Test) Success:

Cache restored from Phase 1 (check cache hit log)
CMake configure completes without errors
Build completes (cmake --build succeeds)
Tests run and produce results (pass/fail is separate concern)
Artifacts uploaded (logs, test results)

Full Build Success:

Both Phase 1 and Phase 2 complete
Total time < 2 hours for hosted runners
Cache reusable for next build (key saved correctly)

Troubleshooting Guide Index

troubleshooting.windows.md - Windows NoCUDA builds (GitHub-hosted, two-phase)
troubleshooting.linux.md - Linux x64 NoCUDA builds (GitHub-hosted, two-phase)
troubleshooting.macos.md - macOS NoCUDA builds (GitHub-hosted, two-phase)
troubleshooting.cuda.md - All CUDA builds (self-hosted, platform-agnostic)
troubleshooting.jetson.md - Jetson ARM64 builds (ARM64 + CUDA constraints)
troubleshooting.containers.md - Docker and WSL builds (container-specific)
troubleshooting.vcpkg.md - vcpkg-specific issues (cross-platform)
reference.md - Cross-platform reference (vcpkg, cache, GitHub Actions)
methodology.md - High-level debugging methodology (detection, validation, testing)

Maintenance

This skill should be updated when:

New platform added (update decision tree, platform coverage, troubleshooting guides)
New issue pattern discovered (add to appropriate troubleshooting guide)
vcpkg baseline updated (update reference.md with new pins)
Workflow structure changes (update reference.md)
Self-hosted runner configuration changes (update troubleshooting.cuda.md)

Install Skill

SKILL.md

ApraPipes DevOps Skill

Role

Core Debugging Methodology

Platform-Specific Tool Versions

Quick Start: First 2 Minutes

Step 1: Identify Platform and Build Type

Step 2: Route to Correct Troubleshooting Guide

Step 3: Download Logs and Begin Diagnosis

Cross-Platform Common Patterns

Pattern 1: vcpkg Baseline Issues

Pattern 2: Package Version Breaking Changes

Pattern 3: Python distutils Missing

Pattern 4: Submodule Commit Not Fetchable

Pattern 5: Cache Key Mismatch

Pattern 6: Duplicate Workflow Runs

When to Use Each Guide

Primary Guide Selection

Cross-Reference Usage

Tools & Prerequisites

Required Tools

Essential Commands

Best Practices & Guardrails

✅ DO

❌ DON'T

Escalation Path

When to Ask Human for Help

How to Escalate

Success Criteria

How to Know Your Fix Worked

Troubleshooting Guide Index

Maintenance