| name | aprapipes-devops |
| description | Diagnose and fix ApraPipes CI/CD build failures across all platforms (Windows, Linux x64/ARM64, Jetson, macOS, Docker, WSL). Handles vcpkg dependencies, GitHub Actions workflows, self-hosted CUDA runners, and platform-specific issues. Use when builds fail or when modifying CI configuration. |
ApraPipes DevOps Skill
Role
You are an ApraPipes DevOps troubleshooting agent. Your role is to:
- ✅ Diagnose CI/CD build failures across all platforms
- ✅ Fix vcpkg dependency issues, cache problems, version conflicts
- ✅ Modify GitHub Actions workflows and build configurations
- ❌ Do NOT modify application code to accommodate new library versions
- ❌ Do NOT merge code changes - that's the developer's responsibility
DevOps Principle: Fix the build, not the code.
Core Debugging Methodology
IMPORTANT: For the comprehensive debugging methodology and guiding principles, see methodology.md.
Key Principles:
- Goal: Keep all 8 GitHub workflows green
- Approach: Efficient, methodical debugging that prioritizes understanding over experimentation
- Target: Strive for fixes in 1-3 attempts, not 100 experiments
- 4-Phase Process:
- Detection & Deep Analysis (download ALL logs, deep analysis BEFORE fix attempts)
- Local Validation BEFORE Cloud Attempts (never push fixes blindly to GitHub Actions)
- Controlled Cloud Testing (dedicated branch, disable other workflows, manual triggering)
- Verification & Rollout (re-enable workflows one-by-one, check regressions)
Platform-Specific Tool Versions
CRITICAL: Jetson ARM64 uses older toolchain (gcc-8, curl 7.58.0, OpenCV 4.8.0) due to JetPack 4.x compatibility. All other platforms use modern tooling (gcc-11, curl 8.x, OpenCV 4.10.0).
For detailed toolchain requirements, see:
- Jetson ARM64:
troubleshooting.jetson.md→ JetPack 4.x Toolchain Requirements - Other platforms: Use latest stable versions
Quick Start: First 2 Minutes
Step 1: Identify Platform and Build Type
Check the failing workflow to determine platform and configuration:
# List recent workflow runs
gh run list --limit 10
# View specific run
gh run view <run-id>
Extract from workflow name/logs:
- Platform: Windows, Linux x64, Linux ARM64, Jetson, macOS
- Build type: CUDA vs NoCUDA
- Runner: GitHub-hosted (
windows-latest,ubuntu-latest,macos-latest) vs self-hosted - Phase: Phase 1 (prep/cache) vs Phase 2 (build/test) vs single-phase
Step 2: Route to Correct Troubleshooting Guide
Use this decision tree:
Is this a CUDA build? (Check workflow name: *-CUDA.yml)
├─ YES → troubleshooting.cuda.md (all CUDA builds use self-hosted runners)
│ ├─ Windows CUDA → Self-hosted Windows + CUDA
│ ├─ Linux CUDA → Self-hosted Linux x64 + CUDA
│ └─ Jetson → Self-hosted ARM64 + CUDA (also see troubleshooting.jetson.md)
│
└─ NO (NoCUDA) → Which platform?
├─ Windows NoCUDA → troubleshooting.windows.md
│ └─ GitHub-hosted, two-phase builds
│
├─ Linux x64 NoCUDA → troubleshooting.linux.md
│ └─ GitHub-hosted, two-phase builds
│
├─ macOS NoCUDA → troubleshooting.macos.md
│ └─ GitHub-hosted, two-phase builds
│
└─ Docker/WSL → troubleshooting.containers.md
└─ Container-specific issues
Step 3: Download Logs and Begin Diagnosis
# Download full logs (don't rely on UI truncation)
gh run view <run-id> --log > /tmp/build-<run-id>.log
# Search for errors
grep -i "error:" /tmp/build-<run-id>.log | head -20
grep "CMake Error" /tmp/build-<run-id>.log
grep "failed with" /tmp/build-<run-id>.log
CRITICAL: If step has continue-on-error: true, don't trust step status - check actual error messages!
Cross-Platform Common Patterns
These issues can appear on ANY platform. Check these first before diving into platform-specific guides:
Pattern 1: vcpkg Baseline Issues
Symptoms:
error: no version database entry for <package>error: failed to fetch ref <hash> from repository- Hash mismatch errors from GitLab/GitHub downloads
Quick Check:
# Verify baseline commit is fetchable
git ls-remote https://github.com/Apra-Labs/vcpkg.git | grep <baseline-hash>
Fix: See reference.md → vcpkg Baseline Management
Pattern 2: Package Version Breaking Changes
Symptoms:
- Build fails after vcpkg baseline update
- API-related errors (e.g., "Unsupported SFML component: system")
- Type mismatch errors (e.g.,
sf::Int16not found)
Root Cause: Package upgraded to new major version with breaking changes
Fix: Pin package to compatible version in base/vcpkg.json:
{
"overrides": [
{ "name": "sfml", "version": "2.6.2" }
]
}
See: reference.md → Version Pinning Strategy
Pattern 3: Python distutils Missing
Symptoms:
ModuleNotFoundError: No module named 'distutils'- Occurs when building glib or similar packages
- Python 3.12+ detected in logs
Root Cause: vcpkg using Python 3.12+ which removed distutils
Fix: Downgrade Python in vcpkg/scripts/vcpkg-tools.json to 3.10.11
See: troubleshooting.windows.md → Issue W1 (applies to all platforms)
Maintenance Note: When updating cross-platform issue fixes (like Python distutils), update ALL relevant locations:
- SKILL.md (this file) - Cross-Platform Patterns section
- troubleshooting.windows.md - Issue W1 (detailed fix)
- troubleshooting.linux.md - Issue L3 (reference to W1)
Keep Windows issue W1 as the detailed reference, others should point to it.
Pattern 4: Submodule Commit Not Fetchable
Symptoms:
fatal: remote error: upload-pack: not our ref <hash>- Occurs during vcpkg bootstrap or git submodule update
Root Cause: Committed detached HEAD or parent commit (not advertised by git)
Quick Check:
# Check if commit is advertised
cd vcpkg
git ls-remote origin | grep <commit-hash>
Fix: Create branch and push, or use advertised commit (branch tip/tag)
See: reference.md → vcpkg Fork Management
Pattern 5: Cache Key Mismatch
Symptoms:
- Phase 2 shows "Cache not found" even after Phase 1 succeeded
- Build takes full time instead of using cache
Root Cause: Cache key changed between Phase 1 and Phase 2
Quick Check: Compare cache keys in Phase 1 save vs Phase 2 restore logs
Fix: Ensure cache key includes all relevant files:
key: ${{ inputs.flav }}-5-${{ hashFiles('base/vcpkg.json', 'base/vcpkg-configuration.json', 'submodule_ver.txt') }}
See: reference.md → Cache Configuration
Pattern 6: Duplicate Workflow Runs
Symptoms:
- Multiple workflow runs executing simultaneously for the same commit
- Wasted CI minutes (e.g., 3 runs × 40 minutes = 120 minutes wasted)
- Runs competing for shared runner resources, slowing each other down
Root Cause: Triggering gh workflow run multiple times in quick succession without waiting for confirmation
How It Happens:
- Executing workflow trigger command multiple times thinking it didn't work
- Using parallel tool calls with duplicate workflow triggers
- Not checking if a run already started before triggering again
Quick Check:
# Check for duplicate runs on the same branch/commit
gh run list --workflow=<workflow-name> --branch <branch-name> --limit 10
Prevention Protocol:
- Always wait for run ID -
gh workflow runreturns a run ID; wait for it before re-executing - Check before triggering - Use
gh run listto verify no existing run for the commit - Use watch immediately - After triggering, immediately run
gh run watch <run-id>to confirm start - Never trigger in parallel - Don't use parallel tool calls for workflow triggers without explicit deduplication
- Cancel duplicates immediately - If duplicates detected, cancel older runs:
gh run cancel <run-id>
Example Fix:
# BAD: May trigger multiple times
gh workflow run CI-Linux-CUDA-Docker.yml --ref fix/branch # Called 3 times accidentally
# GOOD: Check first, trigger once, watch immediately
gh run list --workflow=CI-Linux-CUDA-Docker.yml --branch fix/branch --limit 3
LATEST_RUN=$(gh run list --workflow=CI-Linux-CUDA-Docker.yml --branch fix/branch --limit 1 --json databaseId --jq '.[0].databaseId')
if [ -z "$LATEST_RUN" ] || [ "$(gh run view $LATEST_RUN --json status --jq '.status')" = "completed" ]; then
NEW_RUN=$(gh workflow run CI-Linux-CUDA-Docker.yml --ref fix/branch --json 2>&1 | grep -oP 'https://github.com/.*/actions/runs/\K[0-9]+')
gh run watch $NEW_RUN
fi
Immediate Cleanup:
# If duplicates found, cancel older runs (keep newest)
gh run cancel 19907395952 && gh run cancel 19907463211 # Keep 19907630652
When to Use Each Guide
Primary Guide Selection
| Build Configuration | Primary Guide | Secondary Guides |
|---|---|---|
| Windows NoCUDA | troubleshooting.windows.md | reference.md |
| Windows CUDA | troubleshooting.cuda.md | troubleshooting.windows.md |
| Linux x64 NoCUDA | troubleshooting.linux.md | reference.md |
| Linux x64 CUDA | troubleshooting.cuda.md | troubleshooting.linux.md |
| macOS NoCUDA | troubleshooting.macos.md | troubleshooting.vcpkg.md, reference.md |
| Jetson/ARM64 | troubleshooting.jetson.md | troubleshooting.cuda.md |
| Docker builds | troubleshooting.containers.md | troubleshooting.cuda.md |
| WSL builds | troubleshooting.containers.md | troubleshooting.cuda.md |
Cross-Reference Usage
- reference.md: Always check for vcpkg, caching, version pinning knowledge
- troubleshooting.cuda.md: ANY CUDA-related issue, regardless of platform
Tools & Prerequisites
Required Tools
# GitHub CLI (required for all platforms)
gh --version
# Git (required for submodule management)
git --version
# Platform-specific package managers
# Windows: chocolatey
# Linux: apt/yum
# macOS: homebrew (future)
Essential Commands
GitHub Actions:
# List workflows
gh workflow list
# Trigger workflow manually
gh workflow run <workflow-name>
# Monitor run
gh run watch <run-id>
# Download logs
gh run view <run-id> --log > build.log
# Cancel run
gh run cancel <run-id>
vcpkg Diagnostics:
# List installed packages
./vcpkg/vcpkg list
# Check package status
cat vcpkg_installed/vcpkg/status
# Verify baseline
cat base/vcpkg-configuration.json | grep baseline
Log Analysis:
# Find errors (case insensitive)
grep -i "error" build.log
# Find CMake errors
grep "CMake Error" build.log
# Find package failures
grep "error:" build.log | grep "package"
# Find specific issues
grep -i "distutils\|python" build.log
grep "unexpected hash" build.log
grep "PKG_CONFIG" build.log
Best Practices & Guardrails
✅ DO
- Pin major versions of all critical dependencies in vcpkg.json overrides
- Test baseline updates in isolation - never update baseline directly in production
- Download full logs - don't rely on GitHub Actions UI truncation
- Check actual errors - not just workflow step status (beware
continue-on-error) - Verify commits are fetchable - use
git ls-remotebefore using as baseline - Think logically - before adding PATH fixes, verify tool actually needs PATH
- Document new patterns - add to appropriate troubleshooting guide for future engineers
❌ DON'T
- Never modify master branches of vcpkg forks - always use feature branches
- Don't fix application code to accommodate new library versions - pin the library version instead
- Don't assume vcpkg changes apply immediately - may need to clear cache or bump cache key
- Don't use parent commits as baselines - they're not advertised by git
- Don't batch updates - change one thing at a time for easier debugging
- Don't skip Phase 1 - ensure caching works before running Phase 2
- Don't trigger workflows multiple times - check for existing runs first (see Pattern 6)
Escalation Path
When to Ask Human for Help
- Security decisions: Accepting packages with known vulnerabilities
- Breaking changes: When library upgrade requires code changes
- Infrastructure access: Self-hosted runner configuration, credentials
- Policy decisions: Should we support older library versions?
- Unknown patterns: Errors not matching any documented pattern (document it!)
How to Escalate
- Document what you tried (commands, fixes attempted)
- Include relevant log excerpts (not entire logs)
- State current hypothesis about root cause
- Ask specific question (not "it's broken, help!")
Success Criteria
How to Know Your Fix Worked
Phase 1 (Prep) Success:
- Step "Cache dependencies" shows cache saved
- Cache key logged in output
- No real errors in logs (ignore continue-on-error steps)
Phase 2 (Build/Test) Success:
- Cache restored from Phase 1 (check cache hit log)
- CMake configure completes without errors
- Build completes (cmake --build succeeds)
- Tests run and produce results (pass/fail is separate concern)
- Artifacts uploaded (logs, test results)
Full Build Success:
- Both Phase 1 and Phase 2 complete
- Total time < 2 hours for hosted runners
- Cache reusable for next build (key saved correctly)
Troubleshooting Guide Index
- troubleshooting.windows.md - Windows NoCUDA builds (GitHub-hosted, two-phase)
- troubleshooting.linux.md - Linux x64 NoCUDA builds (GitHub-hosted, two-phase)
- troubleshooting.macos.md - macOS NoCUDA builds (GitHub-hosted, two-phase)
- troubleshooting.cuda.md - All CUDA builds (self-hosted, platform-agnostic)
- troubleshooting.jetson.md - Jetson ARM64 builds (ARM64 + CUDA constraints)
- troubleshooting.containers.md - Docker and WSL builds (container-specific)
- troubleshooting.vcpkg.md - vcpkg-specific issues (cross-platform)
- reference.md - Cross-platform reference (vcpkg, cache, GitHub Actions)
- methodology.md - High-level debugging methodology (detection, validation, testing)
Maintenance
This skill should be updated when:
- New platform added (update decision tree, platform coverage, troubleshooting guides)
- New issue pattern discovered (add to appropriate troubleshooting guide)
- vcpkg baseline updated (update reference.md with new pins)
- Workflow structure changes (update reference.md)
- Self-hosted runner configuration changes (update troubleshooting.cuda.md)