---
name: aks-deployment-troubleshooter
description: Diagnose and fix Kubernetes deployment failures, especially ImagePullBackOff, CrashLoopBackOff, and architecture mismatches. Battle-tested from a 4-hour AKS debugging session with 10+ failure modes resolved.
version: 1.0.0
---
# AKS Deployment Troubleshooter

## Overview

This skill captures systematic approaches to debugging Kubernetes deployments, with a specific focus on container image issues. It is based on a real debugging session that resolved 10+ distinct failure modes.
## When to Use

- Pods stuck in `ImagePullBackOff`
- Pods in `CrashLoopBackOff` with `exec format error`
- "no match for platform in manifest" errors
- Image registry authentication issues
- Helm deployment timeouts
## Quick Diagnosis Flow

```text
Pod not running?
│
├─► ImagePullBackOff
│   │
│   ├─► "not found" ──► Wrong tag or registry path
│   ├─► "unauthorized" ──► Auth/imagePullSecrets issue
│   └─► "no match for platform" ──► Architecture mismatch
│
├─► CrashLoopBackOff
│   │
│   ├─► "exec format error" ──► Wrong CPU architecture
│   ├─► Exit code 1 ──► App startup failure (check logs)
│   └─► OOMKilled ──► Memory limits too low
│
└─► Pending
    │
    ├─► Insufficient CPU/memory ──► Scale cluster or reduce requests
    └─► No matching node ──► Check nodeSelector/tolerations
```
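
As a shortcut through the tree above, a small sketch (plain kubectl; adjust the namespace) that lists each pod with the waiting reason its containers report:

```bash
# Show each pod's phase plus the waiting reason reported by its containers
# (e.g. ImagePullBackOff, CrashLoopBackOff); Pending pods typically show <none>.
kubectl get pods -n <namespace> \
  -o custom-columns='NAME:.metadata.name,PHASE:.status.phase,REASON:.status.containerStatuses[*].state.waiting.reason'
```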
## Diagnostic Commands

### Step 1: Get Pod Status

```bash
kubectl get pods -n <namespace>
```

### Step 2: Describe Failing Pod

```bash
kubectl describe pod <pod-name> -n <namespace> | grep -E "(Image:|Failed|Error|pull)"
```

### Step 3: Check Events

```bash
kubectl get events -n <namespace> --sort-by='.lastTimestamp' | tail -20
```

### Step 4: Check Logs (for CrashLoopBackOff)

```bash
kubectl logs <pod-name> -n <namespace> --tail=50
```

### Step 5: Check Node Architecture

```bash
kubectl get nodes -o jsonpath='{.items[*].status.nodeInfo.architecture}'
```
## Error Resolution Guide

### 1. ImagePullBackOff: "not found"

Error:

```text
Failed to pull image "ghcr.io/owner/repo/app:abc123": not found
```

Causes & Solutions:

| Cause | Solution |
|---|---|
| Tag doesn't exist | Verify image was pushed with exact tag |
| Short vs full SHA | Align metadata-action with deploy (use type=raw,value=${{ github.sha }}) |
| Builds skipped | Manual trigger or remove path filters |
| Wrong registry | Check image.repository in Helm values |

Diagnostic:

```bash
# Check what tags exist (requires gh CLI and package visibility)
gh api /users/<owner>/packages/container/<package>/versions
```
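
If the package is visible to your token, a follow-up sketch (the `--jq` filter assumes the standard GitHub Packages API response shape, where each version carries `metadata.container.tags`) to print only the tags:

```bash
# Print just the tags GHCR has for the package, one per line
gh api "/users/<owner>/packages/container/<package>/versions" \
  --jq '.[].metadata.container.tags[]'
```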
### 2. ImagePullBackOff: "unauthorized"

Error:

```text
failed to authorize: failed to fetch anonymous token: 401 Unauthorized
```

Causes & Solutions:

| Cause | Solution |
|---|---|
| Package is private | Make package public in GHCR settings |
| Missing imagePullSecrets | Create docker-registry secret |
| Wrong credentials | Regenerate and update secret |

Create imagePullSecrets:

```bash
kubectl create secret docker-registry ghcr-secret \
  --docker-server=ghcr.io \
  --docker-username=<github-username> \
  --docker-password=<github-token> \
  --namespace=<namespace>
```

Reference the secret in the pod spec (for a Deployment this goes under `spec.template.spec`):

```yaml
spec:
  imagePullSecrets:
    - name: ghcr-secret
```
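
To sanity-check the secret, one quick verification (assuming the secret name above) is to decode the stored docker config and confirm the registry and username are what you expect:

```bash
# Decode the .dockerconfigjson payload held in the pull secret
kubectl get secret ghcr-secret -n <namespace> \
  -o jsonpath='{.data.\.dockerconfigjson}' | base64 -d
```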
### 3. ImagePullBackOff: "no match for platform in manifest"

Error:

```text
no match for platform in manifest: not found
```

Root Cause: Image built for the wrong CPU architecture OR a buildx provenance issue.

Step 1: Check cluster architecture:

```bash
kubectl get nodes -o jsonpath='{.items[*].status.nodeInfo.architecture}'
# Output: amd64 amd64 OR arm64 arm64
```

Step 2: Match the build platform:

```yaml
# In GitHub Actions docker/build-push-action
- uses: docker/build-push-action@v5
  with:
    platforms: linux/arm64   # or linux/amd64
    provenance: false        # CRITICAL: disable attestation manifests
    no-cache: true           # force a fresh build
```

Why `provenance: false`?
Buildx creates multi-arch manifest lists with attestations. Some container runtimes can't find the actual image inside these more complex manifests. Disabling provenance produces a simple single-platform image.
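
To see what the registry is actually serving, a quick check (assuming Docker with buildx available locally) is to inspect the pushed manifest; with provenance disabled you should see a single platform entry rather than a manifest list full of attestations:

```bash
# Inspect the pushed manifest and the platform(s) it advertises
docker buildx imagetools inspect ghcr.io/owner/repo:<tag>
```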
### 4. CrashLoopBackOff: "exec format error"

Error:

```text
exec /usr/local/bin/docker-entrypoint.sh: exec format error
```

Root Cause: Binary architecture doesn't match node architecture.

Example: Built a linux/amd64 image, deployed to arm64 nodes.

Solution:

- Check node architecture:

  ```bash
  kubectl get nodes -o jsonpath='{.items[*].status.nodeInfo.architecture}'
  ```

- Update the build platform to match
- Rebuild WITHOUT cache (cached layers may have the wrong arch):

  ```yaml
  platforms: linux/arm64   # Match your cluster!
  no-cache: true           # Force complete rebuild
  provenance: false        # Simple manifest
  ```
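
If you're unsure what architecture an already-pushed image was built for, a quick local check (the image name is a placeholder) is:

```bash
# Pull for the platform you intended, then read back what Docker recorded
docker pull --platform linux/arm64 ghcr.io/owner/repo:<tag>
docker image inspect ghcr.io/owner/repo:<tag> --format '{{.Os}}/{{.Architecture}}'
```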
### 5. Helm --set Comma Parsing Error

Error:

```text
failed parsing --set data: key "com" has no value (cannot end with ,)
```

Root Cause: Helm interprets commas as array separators in `--set`.

Wrong:

```bash
--set "origins=https://a.com,https://b.com"
```

Solution: Use a heredoc values file:

```yaml
# In GitHub Actions
- name: Deploy
  run: |
    cat > /tmp/overrides.yaml << EOF
    sso:
      env:
        ALLOWED_ORIGINS: "https://a.com,https://b.com"
    EOF
    helm upgrade --install app ./chart \
      --values /tmp/overrides.yaml
```
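
If you'd rather keep `--set`, Helm also accepts backslash-escaped commas, though the values-file route above is easier to read. A minimal sketch (release and chart names are placeholders):

```bash
# Escape the commas so Helm treats them as literal characters, not list separators
helm upgrade --install app ./chart \
  --set-string "sso.env.ALLOWED_ORIGINS=https://a.com\,https://b.com"
```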
### 6. Azure Login "No subscriptions found"

Error:

```text
Error: No subscriptions found for ***
```

Root Cause: Missing subscriptionId in AZURE_CREDENTIALS.

Solution: Create the service principal with the --sdk-auth output format:

```bash
az ad sp create-for-rbac \
  --name "github-actions" \
  --role contributor \
  --scopes /subscriptions/<subscription-id>/resourceGroups/<rg-name> \
  --sdk-auth
```

Required JSON structure (subscriptionId MUST be present):

```json
{
  "clientId": "xxx",
  "clientSecret": "xxx",
  "subscriptionId": "xxx",
  "tenantId": "xxx"
}
```
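
For reference, the workflow login step then consumes that JSON roughly like this (a sketch; AZURE_CREDENTIALS is the conventional secret name, adjust to yours):

```yaml
# GitHub Actions: authenticate using the service principal JSON from --sdk-auth
- name: Azure login
  uses: azure/login@v1
  with:
    creds: ${{ secrets.AZURE_CREDENTIALS }}
```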
### 7. GHCR 403 Forbidden

Error:

```text
403 Forbidden: permission_denied: write_package
```

Solutions:

- Make package public: GHCR → Package Settings → Change visibility
- Link package to repository: Package Settings → Connect Repository
- Ensure the workflow has `packages: write` permission:

```yaml
permissions:
  contents: read
  packages: write
```
## Docker Build Best Practices for K8s

### Buildx Configuration for Reliability

```yaml
- name: Set up Docker Buildx
  uses: docker/setup-buildx-action@v3

- name: Build and push
  uses: docker/build-push-action@v5
  with:
    context: .
    push: true
    platforms: linux/arm64   # Match cluster architecture!
    provenance: false        # Avoid manifest complexity
    no-cache: true           # For debugging; remove in production
    tags: |
      ghcr.io/owner/repo:${{ github.sha }}
      ghcr.io/owner/repo:latest
```
### Image Tag Strategy

Problem: short SHA vs. full SHA mismatch.

```yaml
# docker/metadata-action default: short SHA (7 chars)
type=sha,prefix=      # Creates: ghcr.io/repo:abc1234

# github.sha is the full SHA (40 chars)
${{ github.sha }}     # Is: abc1234567890abcdef...
```

Solution: Use the explicit full SHA:

```yaml
tags: |
  type=raw,value=${{ github.sha }}
  type=raw,value=latest,enable={{is_default_branch}}
```
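
In context, those tag rules live on the metadata step whose outputs feed the build; a sketch of the wiring (owner/repo is a placeholder):

```yaml
- name: Docker metadata
  id: meta
  uses: docker/metadata-action@v5
  with:
    images: ghcr.io/owner/repo
    tags: |
      type=raw,value=${{ github.sha }}
      type=raw,value=latest,enable={{is_default_branch}}

- name: Build and push
  uses: docker/build-push-action@v5
  with:
    context: .
    push: true
    tags: ${{ steps.meta.outputs.tags }}
    labels: ${{ steps.meta.outputs.labels }}
```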
## Pre-Deployment Checklist

### Architecture

- Checked cluster node architecture (`kubectl get nodes -o jsonpath='{.items[*].status.nodeInfo.architecture}'`)
- Build platform matches cluster (arm64 vs amd64)

### Docker Build

- `provenance: false` set
- `platforms: linux/<arch>` matches cluster
- Image tags are consistent between build and deploy

### Registry

- Packages are public OR imagePullSecrets configured
- Workflow has `packages: write` permission

### Helm

- No commas in `--set` values (use a values file instead)
- Image repository and tag are correctly templated

### Azure/Cloud

- Credentials include subscriptionId
- Service principal has correct role assignments
## Debugging Workflow

- Identify the error type from `kubectl describe pod`
- Match it to the resolution guide above
- Fix ONE thing at a time
- Verify the fix locally if possible before pushing
- Check that builds completed before checking the deploy
## Common Mistakes (Lessons Learned)

- Assuming amd64 - Always check the actual node architecture first
- Rerunning failed workflows - A rerun uses the original commit's code; push the fix and trigger a new run
- Multiple fixes per commit - Makes debugging harder; one fix at a time
- Ignoring build job status - Deploy can start before builds finish
- Caching issues - When stuck, try `no-cache: true`
## Related Skills

- `cloud-deploy-blueprint` - Full deployment setup
- `helm-charts` - Helm chart patterns
- `containerize-apps` - Dockerfile best practices