| name | arc-runner-troubleshooting |
| description | Troubleshoot ARC (Actions Runner Controller) runners on Rackspace Spot Kubernetes. Diagnose stuck jobs, scaling issues, and cluster access. Activates on "runner", "ARC", "stuck job", "queued", "GitHub Actions", or "CI stuck". |
| allowed-tools | Read, Grep, Glob, Bash |
ARC Runner Troubleshooting Guide
Overview
project-beta uses self-hosted GitHub Actions runners deployed via ARC (Actions Runner Controller) on Rackspace Spot Kubernetes. This guide covers common issues and troubleshooting procedures.
Architecture
Runner Infrastructure
GitHub Actions
↓
ARC (Actions Runner Controller) ← Watches for queued jobs
↓
AutoscalingRunnerSet ← Scales runner pods 0→N
↓
Runner Pods ← Execute GitHub Actions jobs
↓
Rackspace Spot Kubernetes ← Underlying infrastructure
Runner Pools
| Pool | Target | Namespace | Repository |
|---|---|---|---|
| arc-beta-runners | Org-level | arc-runners | All project-beta repos |
Note: As of Dec 12, 2025, all workflows use the arc-beta-runners label. The runnerScaleSetName and the ArgoCD releaseName must both be arc-beta-runners.
CI Validation (Added Dec 12, 2025 - Issue #112)
A CI validation check now prevents releaseName/runnerScaleSetName mismatches:
- scripts/validate-release-names.sh - Validation script
- .github/workflows/validate.yaml - Runs on PRs touching config files
Run locally:
./scripts/validate-release-names.sh
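For reference, here is a minimal sketch of the kind of comparison that check performs. The real scripts/validate-release-names.sh is authoritative; the file paths below are simply the ones used elsewhere in this guide.
# Sketch: fail if the ArgoCD releaseName and the Helm runnerScaleSetName diverge
release_name=$(grep 'releaseName:' argocd/apps-live/arc-runners.yaml | awk '{print $2}')
scale_set_name=$(grep 'runnerScaleSetName:' examples/beta-runners-values.yaml | awk '{print $2}' | tr -d '"')
if [ "$release_name" != "$scale_set_name" ]; then
  echo "MISMATCH: releaseName=$release_name, runnerScaleSetName=$scale_set_name" >&2
  exit 1
fi
echo "OK: releaseName and runnerScaleSetName are both $scale_set_name"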
Key Configuration Files
matchpoint-github-runners-helm/
├── examples/
│ ├── beta-runners-values.yaml ← DEPLOYED Helm values (org-level)
│ └── frontend-runners-values.yaml ← DEPLOYED Helm values (frontend)
├── values/
│ └── repositories.yaml ← Documentation (NOT deployed)
├── charts/
│ └── github-actions-runners/ ← Helm chart
└── terraform/
└── modules/ ← Infrastructure as Code
Common Issues
1. Runners with Empty Labels (CRITICAL - P0)
Primary Root Cause: ArgoCD release name ≠ runnerScaleSetName (mismatch causes tracking failure)
Secondary Root Cause: ACTIONS_RUNNER_LABELS environment variable does not work with ARC
CRITICAL: ArgoCD/Helm Alignment Issue (Dec 12, 2025 Discovery)
If the ArgoCD helm release name doesn't match runnerScaleSetName:
- ArgoCD tracks resources under the old release name
- New AutoscalingRunnerSet created with different name
- Old ARS may not be pruned, resulting in stale runners
- Stale runners have broken registration → empty labels
Fix:
# argocd/apps-live/arc-runners.yaml
helm:
releaseName: arc-beta-runners # MUST match runnerScaleSetName!
# examples/runners-values.yaml
gha-runner-scale-set:
runnerScaleSetName: "arc-beta-runners" # MUST match releaseName!
Diagnosis tip: Check runner pod names:
- arc-runners-*-runner-* → OLD ARS still active (problem!)
- arc-beta-runners-*-runner-* → NEW ARS deployed (correct!)
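You can also list the AutoscalingRunnerSet resources directly; seeing both an old and a new entry means the stale set was never pruned:
# Cluster-wide listing; two scale sets (arc-runners-* and arc-beta-runners-*) indicate a stale ARS
kubectl get autoscalingrunnerset -A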
Symptoms:
- Runners show empty labels [] in GitHub
- Runners show os: "unknown" in GitHub API
- ALL jobs stuck in "queued" state indefinitely
- Runners appear online but never pick up jobs
Diagnosis:
# Check runner labels via GitHub API
gh api /orgs/Matchpoint-AI/actions/runners --jq '.runners[] | {name, status, labels: [.labels[].name], os}'
# Bad output (empty labels):
{
"name": "arc-runners-w74pg-runner-2xppt",
"status": "online",
"labels": [],
"os": "unknown"
}
# Good output (proper labels):
{
"name": "arc-beta-runners-xxxxx-runner-yyyyy",
"status": "online",
"labels": ["arc-beta-runners", "self-hosted", "Linux", "X64"],
"os": "Linux"
}
Root Cause Explanation:
Per GitHub's official documentation:
"You cannot use additional labels to target runners created by ARC. You can only use the installation name of the runner scale set that you specified during the installation or by defining the value of the
runnerScaleSetNamefield in your values.yaml file."
How ARC Labels Work:
- ARC uses ONLY the runnerScaleSetName as the GitHub label
- Cannot add custom labels via the ACTIONS_RUNNER_LABELS environment variable
- ARC automatically adds self-hosted, OS, and architecture labels
- Cannot have multiple custom labels on a single scale set
Fix:
# examples/runners-values.yaml or frontend-runners-values.yaml
gha-runner-scale-set:
runnerScaleSetName: "arc-beta-runners" # This becomes the GitHub label
template:
spec:
containers:
- name: runner
env:
# DO NOT SET ACTIONS_RUNNER_LABELS - it's ignored by ARC!
# Only runnerScaleSetName matters
- name: RUNNER_NAME_PREFIX
value: "arc-beta"
Deployment Steps:
- Ensure runnerScaleSetName matches the workflow runs-on: labels
- Remove any ACTIONS_RUNNER_LABELS env vars
- Merge configuration changes
- Wait for ArgoCD auto-sync (3-5 minutes)
- Force runner re-registration: kubectl delete pods -n arc-runners -l app.kubernetes.io/component=runner
- Verify fix after 1-2 minutes (see the check below)
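For the verification step, the Diagnosis command above can be narrowed to flag only runners that are still missing the expected label (this jq filter is illustrative, not from the original runbook):
gh api /orgs/Matchpoint-AI/actions/runners \
  --jq '.runners[] | select([.labels[].name] | index("arc-beta-runners") | not) | .name'
# Empty output means every registered runner carries the arc-beta-runners label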
References:
- Troubleshooting Guide
- Issue #89 in matchpoint-github-runners-helm
2. ArgoCD ApplicationSet Conflicting Parameters (CRITICAL - P0)
Root Cause: ArgoCD ApplicationSet injects Helm parameters that conflict with values file.
CRITICAL: Dec 12, 2025 Discovery
The ApplicationSet (argocd/applicationset.yaml) was injecting:
parameters:
- name: gha-runner-scale-set.githubConfigSecret.github_token
value: "$ARGOCD_ENV_GITHUB_TOKEN"
But the values file uses:
githubConfigSecret: arc-org-github-secret # String reference to pre-created secret
The Conflict:
- Helm
--set gha-runner-scale-set.githubConfigSecret.github_token=expectsgithubConfigSecretto be a map - Values file defines
githubConfigSecretas a string (secret name reference) - Result:
interface conversion: interface {} is string, not map[string]interface {}
Symptoms:
- ArgoCD Application shows ComparisonError in conditions
- Manifest generation fails repeatedly
- Runners may appear to work but sync is broken
- Application status shows Unknown sync status
Diagnosis:
# Check ArgoCD Application status
kubectl get application arc-runners -n argocd -o jsonpath='{.status.conditions[*]}'
# Look for error like:
# "failed parsing --set data: unable to parse key: interface conversion: interface {} is string, not map[string]interface {}"
# Check ApplicationSet for conflicting parameters
kubectl get applicationset github-runners -n argocd -o jsonpath='{.spec.template.spec.source.helm}'
Fix:
- Remove the parameters section from argocd/applicationset.yaml
- Use pre-created secrets referenced in the values file
# argocd/applicationset.yaml - DO NOT include parameters
helm:
releaseName: '{{name}}'
valueFiles:
- '../../{{valuesFile}}'
# NO parameters section - values file handles secrets
# examples/runners-values.yaml
githubConfigSecret: arc-org-github-secret # Pre-created in cluster
Apply Fix to Cluster:
# kubectl apply may not remove fields - use replace
kubectl replace -f argocd/applicationset.yaml --force
# Verify parameters removed
kubectl get applicationset github-runners -n argocd -o jsonpath='{.spec.template.spec.source.helm}'
Secret Setup:
# Create the secret manually in the cluster
kubectl create secret generic arc-org-github-secret \
--namespace=arc-runners \
--from-literal=github_token='ghp_...'
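As an optional sanity check before syncing, confirm the secret exists with the key the values file expects:
kubectl describe secret arc-org-github-secret -n arc-runners
# The Data section should list github_token with a non-zero byte count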
References:
- PR #94 in matchpoint-github-runners-helm (the fix)
- Issue #89 in matchpoint-github-runners-helm
3. Jobs Stuck in Queued State (2-5+ minutes)
Root Cause: minRunners: 0 causes cold-start delays
Symptoms:
- Jobs stuck in "queued" status for 2-5+ minutes
- First job of the day takes significantly longer
- Parallel PRs cause cascading delays
Diagnosis:
# Check current Helm values
cat /home/pselamy/repositories/matchpoint-github-runners-helm/examples/beta-runners-values.yaml | grep minRunners
# Check if issue is minRunners: 0
# If minRunners: 0 → cold start on every job
Fix:
# examples/beta-runners-values.yaml
minRunners: 2 # Changed from 0 - keep 2 runners pre-warmed
maxRunners: 20
Why This Happens:
With minRunners: 0:
Job Queued → ARC detects → Schedule pod → Pull image →
Start container → Register runner → Job starts
Total: 120-300 seconds
With minRunners: 2:
Job Queued → Assign to pre-warmed runner → Job starts
Total: 5-10 seconds
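One way to confirm the change took effect (assuming the arc-runners namespace and runner label selector used elsewhere in this guide): after ArgoCD syncs, two idle runner pods should be present even while the queue is empty.
kubectl get pods -n arc-runners -l app.kubernetes.io/component=runner
# Expect 2 pods in Running state with no jobs queued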
4. Cluster Access Issues
Problem: Cannot connect to Rackspace Spot cluster
Common Errors:
error: You must be logged in to the server (the server has asked for the client to provide credentials)
error: unknown command "oidc-login" for "kubectl"
dial tcp: lookup hcp-xxx.spot.rackspace.com: no such host
CRITICAL: Kubeconfig Token Expiration (Dec 13, 2025 Discovery)
Rackspace Spot kubeconfig JWT tokens expire after 3 days. This is why:
- Downloaded kubeconfig "goes stale after a day or two"
- Manual downloads from Rackspace console have the same problem
- The DNS lookup failure is often a red herring - the token is expired, not the cluster
Verify token expiration:
# Decode the JWT payload to check expiration (JWT payloads are unpadded base64url,
# so decode in Python rather than with base64 -d)
TOKEN=$(grep "token:" kubeconfig.yaml | head -1 | awk '{print $2}')
echo "$TOKEN" | cut -d. -f2 | python3 -c "
import base64, json, sys
from datetime import datetime
part = sys.stdin.read().strip()
payload = json.loads(base64.urlsafe_b64decode(part + '=' * (-len(part) % 4)))
exp = datetime.fromtimestamp(payload['exp'])
print(f'Token expires: {exp}')
print(f'Expired: {datetime.now() > exp}')
"
Solutions:
Option A: Get kubeconfig from Terraform State (RECOMMENDED)
This is the preferred method - it gets a fresh kubeconfig from the terraform state.
# 1. Get GitHub token from gh CLI config
export TF_HTTP_PASSWORD=$(cat ~/.config/gh/hosts.yml | grep oauth_token | awk '{print $2}')
# 2. Navigate to terraform directory
cd /home/pselamy/repositories/matchpoint-github-runners-helm/terraform
# 3. Initialize terraform (reads state only, no changes)
terraform init -input=false
# 4. Get kubeconfig from terraform output (read-only operation)
terraform output -raw kubeconfig_raw > /tmp/runners-kubeconfig.yaml
# 5. Use the kubeconfig
export KUBECONFIG=/tmp/runners-kubeconfig.yaml
kubectl get pods -A
Why this works:
- terraform output reads from cached state in tfstate.dev
- State is refreshed every 2 days by a scheduled workflow (PR #137)
- No Rackspace Spot API token needed to read outputs
Note: A scheduled workflow (refresh-kubeconfig.yml) runs every 2 days to refresh the token in terraform state before the 3-day expiration.
Option B: Use token-based auth (ngpc-user)
# Check if token expired
kubectl config view --minify -o jsonpath='{.users[0].user.token}' | cut -d. -f2 | base64 -d | jq .exp
# Compare with current timestamp
# Get new kubeconfig from Rackspace Spot console
# 1. Login to https://spot.rackspace.com
# 2. Select cloudspace
# 3. Download kubeconfig
Option C: Install oidc-login plugin
# Install krew (kubectl plugin manager)
brew install krew # or appropriate package manager
# Install oidc-login
kubectl krew install oidc-login
# Use OIDC context
kubectl config use-context tradestreamhq-tradestream-cluster-oidc
Option D: Use ngpc CLI
# Install ngpc CLI from Rackspace
pip install ngpc-cli
# Login and refresh credentials
ngpc login
ngpc kubeconfig get <cloudspace-name>
5. DNS Resolution Failures
Problem: Cluster hostname not resolving
dial tcp: lookup hcp-xxx.spot.rackspace.com: no such host
Causes:
- Cluster was deleted/migrated (most common)
- Using stale kubeconfig file that points to old cluster
- DNS propagation delay
- Wrong cluster endpoint
Solution: Use terraform to get kubeconfig for the CURRENT active cluster:
# Get fresh kubeconfig from terraform (see Option A above)
export TF_HTTP_PASSWORD="<github-token>"
cd /home/pselamy/repositories/matchpoint-github-runners-helm/terraform
terraform init
terraform output -raw kubeconfig_raw > /tmp/runners-kubeconfig.yaml
Note: The kubeconfig-matchpoint-runners-prod.yaml file in the repo root may be stale if the cluster was recreated. Always use terraform output to get the current kubeconfig.
Diagnosis:
# Check terraform state for current cloudspace
cd /home/pselamy/repositories/matchpoint-github-runners-helm/terraform
export TF_HTTP_PASSWORD="<github-token>"
terraform init
terraform state list | grep cloudspace
# View cloudspace details
terraform state show module.cloudspace.spot_cloudspace.main
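To confirm the failure is a recreated cluster rather than a DNS outage, compare the API server endpoint in the checked-in kubeconfig against the fresh one exported from terraform (file names as used above; the extraction pipeline below is illustrative):
grep "server:" kubeconfig-matchpoint-runners-prod.yaml /tmp/runners-kubeconfig.yaml
# A different hostname confirms the cluster was recreated; then check the fresh endpoint resolves
nslookup "$(grep 'server:' /tmp/runners-kubeconfig.yaml | head -1 | awk -F'https://' '{print $2}' | cut -d: -f1)"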
6. Missing Tools (wget, curl, Docker CLI)
Problem: CI workflows fail with "command not found" for common tools
CRITICAL: Custom Runner Image (Dec 13, 2025 Discovery)
Two runner images exist:
- ghcr.io/actions/actions-runner:latest - Generic (missing many tools)
- ghcr.io/matchpoint-ai/arc-runner:latest - Custom (has all tools)
Symptoms:
/bin/bash: wget: command not found
/bin/bash: docker: command not found
Root Cause: The configuration may be using the generic image instead of the custom one.
Diagnosis:
# Check which image is configured
grep -r "ghcr.io" examples/*.yaml values/*.yaml | grep -v "#"
# Check which image is actually running
kubectl get pods -n arc-runners -o jsonpath='{.items[0].spec.containers[0].image}'
Custom Image Includes:
| Tool | Version |
|---|---|
| wget, curl, jq | latest |
| Node.js | 20 LTS |
| Python | 3.12 + pip + poetry |
| Docker CLI | 24.x |
| Terraform | 1.9.x |
| PostgreSQL client | 16 |
| Build tools | make, gcc, etc. |
Fix:
# examples/runners-values.yaml
containers:
- name: runner
image: ghcr.io/matchpoint-ai/arc-runner:latest # NOT actions-runner!
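After switching the image, a quick spot-check that the tools are really present (pod name placeholder and namespace as used elsewhere in this guide):
kubectl exec -n arc-runners <pod-name> -c runner -- sh -c \
  'for t in wget curl jq docker terraform psql; do command -v "$t" >/dev/null || echo "MISSING: $t"; done'
# No output means all listed tools are on PATH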
Note: The custom image is built from images/arc-runner/Dockerfile in this repo. The build workflow runs on pushes to images/arc-runner/**.
Reference: Issue #135, PR #138
7. Docker-in-Docker (DinD) Issues
Problem: Docker commands fail even though DinD sidecar is configured
Symptoms:
Cannot connect to the Docker daemon at tcp://localhost:2375
Diagnosis:
# Check pod has 2 containers (runner + dind)
kubectl get pods -n arc-runners -o jsonpath='{.items[*].spec.containers[*].name}'
# Should show: runner dind
# Check DinD logs
kubectl logs -n arc-runners <pod-name> -c dind --tail=50
# Should show: "API listen on [::]:2375"
# Verify DOCKER_HOST env var
kubectl get pods -n arc-runners -o jsonpath='{.items[0].spec.containers[0].env}' | jq '.[] | select(.name=="DOCKER_HOST")'
# Should show: tcp://localhost:2375
Common Issues:
- DinD not running: Check if privileged mode is allowed in cluster
- Wrong DOCKER_HOST: Should be tcp://localhost:2375
- Missing sidecar: Check pod template in values file
Verify DinD is healthy:
kubectl exec -n arc-runners <pod-name> -c runner -- docker version
kubectl exec -n arc-runners <pod-name> -c runner -- docker info
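As a further optional smoke test, assuming the pod has outbound access to Docker Hub, run a container end to end through the DinD daemon:
kubectl exec -n arc-runners <pod-name> -c runner -- docker run --rm hello-world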
8. Configuration Mismatch
Problem: Documentation says one thing, deployed config is different
Key Insight: The examples/*.yaml files are what actually gets deployed. The values/repositories.yaml is documentation/reference only.
Audit Configuration:
# Check what's ACTUALLY deployed
cat examples/beta-runners-values.yaml | grep -E "(minRunners|maxRunners)"
# vs what documentation says
cat values/repositories.yaml | grep -E "(minRunners|maxRunners)"
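A compact way to compare the two files in one shot (same grep patterns as above; requires bash for process substitution):
diff <(grep -E "(minRunners|maxRunners)" examples/beta-runners-values.yaml) \
     <(grep -E "(minRunners|maxRunners)" values/repositories.yaml)
# Any output line is a divergence between deployed config and documentation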
Monitoring Commands
Check Workflow Status
# List queued workflows
gh run list --repo Matchpoint-AI/project-beta-api --status queued
# List in-progress workflows
gh run list --repo Matchpoint-AI/project-beta-api --status in_progress
# View specific run
gh run view <RUN_ID> --repo Matchpoint-AI/project-beta-api
Check Runner Status (when cluster accessible)
# Set kubeconfig
export KUBECONFIG=/path/to/kubeconfig.yaml
# Check runner scale set
kubectl get autoscalingrunnerset -n arc-beta-runners-new
# Check runner pods
kubectl get pods -n arc-beta-runners-new -l app.kubernetes.io/component=runner
# Check ARC controller logs
kubectl logs -n arc-systems deployment/arc-gha-rs-controller --tail=50
# Check for scaling events
kubectl get events -n arc-beta-runners-new --sort-by='.lastTimestamp' | tail -20
Check GitHub Registration
# List registered runners
gh api /orgs/Matchpoint-AI/actions/runners --jq '.runners[] | {name, status, busy}'
# Check runner groups
gh api /orgs/Matchpoint-AI/actions/runner-groups --jq '.runner_groups[].name'
Troubleshooting Checklist
For Stuck Jobs
- Check minRunners in deployed Helm values
- Verify ArgoCD sync status
- Check if pods are scheduling (kubectl get pods)
- Verify GitHub runner registration
- Check node capacity and resources
- Review ARC controller logs for errors
For Cluster Access
- Check kubeconfig context (kubectl config current-context)
- Verify token expiration
- Try different auth method (token vs OIDC)
- Check if cluster hostname resolves (nslookup)
- Verify cluster still exists in Rackspace console
For Configuration Issues
- Compare examples/*.yaml with values/repositories.yaml
- Check ArgoCD for deployed values
- Verify Helm release values
Related Issues
| Issue/PR | Repository | Description |
|---|---|---|
| #135 | matchpoint-github-runners-helm | Epic: ARC runner environment limitations (Dec 2025) |
| #136 | matchpoint-github-runners-helm | Document troubleshooting learnings in skills |
| #137 | matchpoint-github-runners-helm | PR: Auto-refresh kubeconfig token workflow |
| #138 | matchpoint-github-runners-helm | PR: Use custom runner image with pre-installed tools |
| #112 | matchpoint-github-runners-helm | CI jobs stuck - PR #98 broke alignment |
| #113 | matchpoint-github-runners-helm | CI validation feature request |
| #114 | matchpoint-github-runners-helm | PR: Fix releaseName alignment + CI validation |
| #89 | matchpoint-github-runners-helm | Empty runner labels investigation |
| #91 | matchpoint-github-runners-helm | PR: Change release name (superseded) |
| #93 | matchpoint-github-runners-helm | PR: Revert to arc-runners naming - MERGED |
| #94 | matchpoint-github-runners-helm | PR: Remove ApplicationSet parameters - MERGED |
| #97 | matchpoint-github-runners-helm | PR: Standardize labels to arc-beta-runners |
| #98 | matchpoint-github-runners-helm | PR: Update runnerScaleSetName (broke alignment!) |
| #798 | project-beta-api | PR: Update workflow labels to arc-runners |
| #72 | matchpoint-github-runners-helm | Root cause analysis for queuing |
| #77 | matchpoint-github-runners-helm | Fix PR (minRunners: 0 → 2) - MERGED |
| #76 | matchpoint-github-runners-helm | Investigation state file |
| #1624 | project-beta | ARC runners stuck (closed) |
| #1577 | project-beta | P0: ARC unavailable (closed) |
| #1521 | project-beta | Runners stuck (closed) |
Cost Considerations
| Setting | Cost Impact | Recommendation |
|---|---|---|
| minRunners: 0 | Lowest ($0 idle) | Development/low-traffic |
| minRunners: 2 | ~$150-300/mo | Production/high-traffic |
| minRunners: 5 | ~$400-700/mo | Enterprise/critical CI |
ROI Calculation:
- 2 pre-warmed runners save ~2-5 min per job
- 50+ PRs/week × 3 min saved = 150+ min/week
- Developer time saved >> runner cost
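A back-of-envelope version of that calculation (the $75/hr loaded developer cost is an assumption; adjust to your own numbers):
python3 -c '
prs_per_week, minutes_saved_per_job, dev_hourly_rate = 50, 3, 75  # hourly rate is assumed
weekly_hours = prs_per_week * minutes_saved_per_job / 60
monthly_value = weekly_hours * 4.33 * dev_hourly_rate
print(f"~{weekly_hours:.1f} dev-hours/week recovered, ~${monthly_value:.0f}/month vs ~$150-300/month for minRunners: 2")
'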
Emergency Procedures
Runners Completely Down
- Check ArgoCD sync status via Argo UI or CLI
- Force sync if needed: argocd app sync arc-beta-runners
- Check node availability in Rackspace console
- Manual pod restart: kubectl rollout restart deployment -n arc-beta-runners-new
Fallback to GitHub-hosted runners
# Temporarily switch workflow to GitHub-hosted
jobs:
build:
runs-on: ubuntu-latest # Instead of self-hosted
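If fallbacks happen often, one possible pattern is to parameterize the label so switching is a one-field change. This is a sketch, not the current workflow setup; the inputs context usage and default value here are assumptions.
# Hypothetical workflow_dispatch input controlling the runner label
on:
  workflow_dispatch:
    inputs:
      runner:
        description: "Runner label to use"
        default: "arc-beta-runners"
jobs:
  build:
    runs-on: ${{ inputs.runner }}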
References
- Helm Chart: matchpoint-github-runners-helm/charts/github-actions-runners/
- Terraform: matchpoint-github-runners-helm/terraform/
- ArgoCD: https://argocd.matchpointai.com (internal)
- Rackspace Spot: https://spot.rackspace.com
- ARC Documentation: https://github.com/actions/actions-runner-controller