| name | debugging-docker |
| description | Debugs Docker build failures, container runtime errors, platform architecture issues (ARM64/AMD64/WSL2), AWS ECR/ECS pull failures, and optimizes Docker workflows. Use when encountering Docker build errors, container crashes, ECR authentication issues, ECS deployment problems, performance issues, networking failures, volume permission errors, or when working with multi-platform Docker images. |
Debugging Docker Operations
Overview
Systematic troubleshooting workflows for Docker build failures, runtime errors, AWS container services, platform-specific issues, and performance optimization. Designed for multi-platform environments (ARM64/AMD64/WSL2) with pharmaceutical compliance contexts requiring GAMP-5 validation.
When to Use
- Build Failures: Package installation errors, COPY failures, layer caching issues
- Runtime Errors: Container crashes, exit codes, DNS resolution, port conflicts
- AWS Container Services: ECR authentication, ECS pull failures, rate limiting
- Platform Issues: ARM64 vs AMD64 emulation, WSL2 integration, cross-platform builds
- Performance Problems: Slow builds, large images, QEMU emulation overhead
- Volumes/Networking: Permission errors, mount failures, port binding
Available Tools
Standard Tools
| Tool | Purpose |
|---|---|
| Bash | Execute Docker commands, view logs |
| Read | Examine Dockerfiles, compose files |
| Grep | Search for patterns in logs |
| Glob | Find Docker-related files |
AWS MCP Tools
| Tool | Purpose |
|---|---|
mcp__aws-api-mcp__call_aws |
Execute AWS CLI for ECR/ECS operations |
mcp__aws-knowledge-mcp__aws___search_documentation |
Search AWS docs for container guidance |
mcp__aws-ccapi-mcp__list_resources |
List ECS tasks, ECR repositories |
Core Workflow
Phase 1: Symptom Diagnosis
Objective: Identify problem category and gather diagnostic information
Identify Problem Type
# For build failures docker build 2>&1 | tee build.log # For runtime errors docker logs <container-name> docker inspect <container-name> # For AWS issues aws ecr describe-repositories aws ecs describe-tasks --cluster <cluster> --tasks <task-arn>Categorize the Issue
- Build Failure (exit during
docker build) - Runtime Error (container exits or crashes)
- AWS ECR/ECS Issue (authentication, pull failures)
- Platform/Architecture Issue (ARM64/AMD64 mismatch)
- Performance Problem (slow or inefficient)
- Networking/Volume Issue
- Build Failure (exit during
Collect Context
docker version && docker info | head -30 uname -m # Platform: x86_64 or aarch64
Quality Gate: Problem type identified, initial logs collected
Phase 2: Root Cause Analysis
Build Failures
Common Patterns:
# Check build log for specific failures
grep -E "ERROR|failed|error:" build.log
# Common patterns:
# "COPY failed" → File not in build context
# "returned a non-zero code" → Command failed
# "no matching manifest" → Platform mismatch
Key Checks:
- Build context:
ls -la <path-to-file> - .dockerignore:
cat .dockerignore - Layer caching:
docker build --no-cache
See reference/common-errors.md for complete error matrix.
Runtime Errors
Exit Code Quick Reference:
| Code | Meaning | Common Cause |
|---|---|---|
| 0 | Normal exit | Application completed |
| 1 | Application error | Check application logs |
| 126 | Cannot execute | Permission issue |
| 127 | Command not found | Wrong CMD/ENTRYPOINT |
| 137 | OOM killed | Memory limit exceeded |
| 139 | Segfault | Application crash |
| 143 | SIGTERM | Graceful shutdown |
Diagnosis:
docker inspect <container> | jq '.[0].State'
docker logs --tail 100 <container>
AWS ECR/ECS Issues
ECR Authentication:
Error: "no basic auth credentials"
- Cause: ECR login token expired (12-hour validity)
- Fix:
aws ecr get-login-password --region eu-west-2 | \ docker login --username AWS --password-stdin \ <account-id>.dkr.ecr.eu-west-2.amazonaws.com
ECS Pull Failures:
Error: "CannotPullContainerError"
- Checklist:
- Task execution role has ECR permissions
- VPC has NAT Gateway (for private subnets)
- Image tag exists in ECR
Using AWS MCP:
# Check ECR images
mcp__aws-api-mcp__call_aws: aws ecr describe-images --repository-name <repo>
# Check ECS task failures
mcp__aws-api-mcp__call_aws: aws ecs describe-tasks --cluster <cluster> --tasks <task-arn>
See reference/aws-ecr-ecs.md for complete AWS troubleshooting guide.
Platform/Architecture Issues
Slow Build Detection:
# Check if emulation is active (ARM64 host building AMD64)
docker run --rm --platform=linux/amd64 alpine uname -m
# If output shows x86_64 on ARM64 host → emulation active (slow)
Solutions:
- Use native platform for development
- Build target platform only for deployment
- Use multi-stage with
${BUILDPLATFORM}
See reference/platform-guide.md for complete platform guidance.
Quality Gate: Root cause identified, specific error patterns documented
Phase 3: Solution & Validation
Build Failure Fixes
# Fix missing file in context
# Verify path exists: ls -la <path>
COPY main.py /app/
# Fix package installation with retry
RUN pip install --no-cache-dir -r requirements.txt || \
(pip install --upgrade pip && pip install --no-cache-dir -r requirements.txt)
# Fix platform issues
# docker build --platform=linux/amd64 -t image:prod .
Runtime Error Fixes
# Fix permission issues
docker run --user $(id -u):$(id -g) image:tag
# Fix OOM (Exit 137)
docker run -m 2g --memory-swap 2g image:tag
# Fix port conflicts
docker run -p 8081:8080 image:tag # Change host port
AWS Fixes
# Refresh ECR credentials
aws ecr get-login-password --region eu-west-2 | docker login --username AWS --password-stdin <account>.dkr.ecr.eu-west-2.amazonaws.com
# Fix Docker Hub rate limiting - use ECR Public
# FROM public.ecr.aws/docker/library/python:3.12-slim
Validation
# Rebuild with verbose output
docker build --progress=plain -f Dockerfile -t image:test .
# Test container startup
docker run --rm image:test
# Verify functionality
docker run -it image:test /bin/sh -c "command-to-test"
Quality Gate: Fix applied, container builds and runs without errors
Quick Reference Table
| Problem | Quick Check | Common Fix |
|---|---|---|
| ECR "no basic auth" | Token age (12hr) | aws ecr get-login-password | docker login |
| ECS CannotPull | Task role, NAT | Check IAM + VPC config |
| Docker Hub rate limit | Error message | Use ECR Public or authenticate |
| Build fails at COPY | ls -la <file> |
Fix path or add to build context |
| Package install fails | Check network | Update pip, verify package name |
| Exit 137 (OOM) | docker inspect OOMKilled |
Increase memory limit (-m flag) |
| Exit 127 | Command not found | Fix CMD/ENTRYPOINT path |
| Slow ARM64 build | --platform flag |
Use native ARM64 for dev |
| WSL2 vmmem bloat | Task Manager | .wslconfig memory limits |
| Volume mount not working | Check compose mounts | docker-compose down && up -d |
| Port already in use | docker ps |
Change host port or stop conflict |
| Permission denied | Check USER in Dockerfile | Add USER directive, fix volume perms |
Best Practices
Build Optimization
- Layer ordering: Dependencies before code (cache efficiency)
- Multi-stage builds: Separate build and runtime environments
- Use .dockerignore: Exclude
node_modules,.git,__pycache__
Security
# Non-root user
RUN adduser -D appuser
USER appuser
# Never COPY secrets - use --secret flag
Platform Handling
# Development (native platform - fast)
docker build -t image:dev .
# Production (target platform)
docker build --platform=linux/amd64 -t image:prod .
# Multi-platform with cache
docker buildx build \
--platform linux/amd64,linux/arm64 \
--cache-from type=registry,ref=<repo>:cache \
--cache-to type=registry,ref=<repo>:cache \
-t image:multi --push .
WSL2 Performance
- Store projects in WSL2 filesystem (
~/), NOT Windows mount (/mnt/c/) - Configure
.wslconfigwith memory limits - Use
wsl --shutdownto reclaim memory
See reference/wsl2-optimization.md for complete WSL2 guide.
Common Pitfalls
Don't: Use relative paths without verification
# WRONG - may fail if context is incorrect
COPY ../app/file.py /app/
# RIGHT - relative to build context root
COPY app/file.py /app/
Do: Use explicit platform for production
# WRONG - platform inferred (may differ from prod)
docker build -t api:latest .
# RIGHT - explicit platform
docker build --platform=linux/amd64 -t api:latest .
Do: Verify volume mounts work
# Check if container sees host files
docker exec <container> ls -la /app/
# Recreate containers after mount changes
docker-compose down && docker-compose up -d
Quality Checklist
- Problem type identified (build/runtime/AWS/platform/network/volume)
- Diagnostic logs collected and analyzed
- Root cause determined with specific error patterns
- Solution applied with appropriate fix
- Build succeeds without errors
- Container starts and runs successfully
- Functionality validated
- Platform architecture verified (matches deployment target)
- Security checked (non-root user, no exposed secrets)
Detailed References
- Docker Commands:
reference/docker-commands.md - Common Error Matrix:
reference/common-errors.md(20+ error patterns) - AWS ECR/ECS Guide:
reference/aws-ecr-ecs.md - Platform Guide:
reference/platform-guide.md(ARM64/AMD64/WSL2, buildx caching) - WSL2 Optimization:
reference/wsl2-optimization.md - Build Optimization:
reference/build-optimization.md - Security Hardening:
reference/security-hardening.md
Diagnostic Scripts
- Build Failure Analysis:
scripts/analyze-build-failure.sh <log-file> - Container Inspection:
scripts/inspect-container.sh <container-name> - Platform Check:
scripts/check-platform.sh
Remember: Docker issues are systematic and diagnosable. Follow the three-phase workflow (Symptom → Root Cause → Solution), use AWS MCP tools for ECR/ECS issues, and reference the detailed guides for complex scenarios. Always validate fixes before considering the issue resolved.