| name | Debugging Omnistrate Deployments |
| description | Systematically debug failed Omnistrate instance deployments using a progressive workflow that identifies root causes efficiently while avoiding token limits. Applies to deployment failures, probe issues, and Helm-based resources. |
Debugging Omnistrate Deployments
When to Use This Skill
- Instance deployments showing FAILED status or stuck in DEPLOYING
- Resources with unhealthy pod statuses or deployment errors
- Startup/readiness probe failures (HTTP 503, timeouts)
- Helm releases with unclear deployment states
- Need to identify the root cause of a deployment failure
Progressive Debugging Workflow
1. Get Deployment Status
Tool: mcp__omnistrate-platform__omnistrate-ctl_instance_describe
Flags: --deployment-status --output json
Extract:
- Overall instance status
- Resources with deployment errors or unhealthy pod statuses
- Focus subsequent analysis on problematic resources only
Key Benefit: Returns a concise status summary, significantly reducing token usage compared with a full describe call
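A minimal sketch, assuming the CLI form mirrors the MCP tool name (omctl instance describe is inferred; the flags are the ones documented above):
# Assumption: subcommand name inferred from the MCP tool name; verify with omctl --help
omctl instance describe <instance-id> --deployment-status --output json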
2. Identify Workflows
Tool: mcp__omnistrate-platform__omnistrate-ctl_workflow_list
Flags: --instance-id <id> --output json
Extract workflow IDs, types, and start/end times for failed deployments.
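For example, assuming the CLI form mirrors the MCP tool name (flags as documented above):
# List workflows for the instance, then note IDs and timestamps of failed runs
omctl workflow list --instance-id <instance-id> --output json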
3. Analyze Workflow Events (Two-Phase)
Phase 1 - Summary (Always Start Here):
omctl workflow events <workflow-id> --service-id <id> --environment-id <id> --output json
Extract:
- All resources with workflow step status (failed/in-progress/success)
- Step duration analysis and event count patterns
- Identify specific failed/stuck steps
Phase 2 - Detail (Only for Failed Steps):
omctl workflow events <workflow-id> --service-id <id> --environment-id <id> \
--resource-key <name> --step-types <type> --detail --output json
Use parameters:
- --resource-key: Target a specific resource
- --step-types: Filter to a specific step type (Bootstrap, Compute, Deployment, Network, Storage, Monitoring)
- --detail: Include full event details (use sparingly)
- --since/--until: Time-bound queries
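Combined, a targeted Phase 2 query looks like this (the timestamp placeholders are illustrative; check omctl workflow events --help for the accepted format):
# Narrow to one resource, one step type, and a bounded time window
omctl workflow events <workflow-id> --service-id <id> --environment-id <id> \
  --resource-key <name> --step-types Deployment --detail \
  --since <start-time> --until <end-time> --output json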
Extract from detail view:
- WorkflowStepDebug error messages
- VM allocation failures and constraints
- Pod scheduling issues
- Container readiness failures
Pod Event Timeline: Create ASCII visualizations showing deployment progression:
HH:MM:SS ┬─── ✗ FailedScheduling
│ pod/app-0: Insufficient memory
│
HH:MM:SS ├─── ⚡ TriggeredScaleUp
│ nodegroup-1: adding 2 nodes
│
HH:MM:SS ├─── 📥 Pulling image:latest
│ (duration: 2m15s)
│
HH:MM:SS └─── ✅ Started
3/3 pods Running
Symbols: ✗ failed, ✅ success, ⚡ autoscaler, 💾 storage, 📥 image, 🚀 runtime, ⚠️ warning
4. Application-Level Investigation
When: A resource is stuck in DEPLOYING with probe failures, containers are Running but not Ready, or previous steps produced no conclusive evidence
Tool: mcp__omnistrate-platform__omnistrate-ctl_deployment-cell_update-kubeconfig + kubectl
omctl deployment-cell update-kubeconfig <cell-id> --kubeconfig /tmp/kubeconfig
kubectl get pods -n <instance-id> --kubeconfig /tmp/kubeconfig
kubectl logs <pod-name> -c service -n <instance-id> --kubeconfig /tmp/kubeconfig --tail=50
Look for:
- Database connection failures
- Application syntax/runtime errors (Python SyntaxError, Java compilation errors)
- Service dependency failures
- Configuration issues
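A quick log scan surfaces most of these; the grep patterns below are illustrative, not exhaustive:
# Scan recent container logs for common failure signatures
kubectl logs <pod-name> -c service -n <instance-id> --kubeconfig /tmp/kubeconfig --tail=200 \
  | grep -iE 'error|exception|refused|timeout'
# Pod events often explain Running-but-not-Ready states
kubectl describe pod <pod-name> -n <instance-id> --kubeconfig /tmp/kubeconfig | tail -30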
5. Helm-Specific Verification
When: Helm resources report conflicting statuses, application credentials are needed, or the deployment state is unclear
Tool: Same kubeconfig setup as above, with --role cluster-admin, plus helm
omctl deployment-cell update-kubeconfig <cell-id> --kubeconfig /tmp/kubeconfig --role cluster-admin
helm list -n <instance-id> --kubeconfig /tmp/kubeconfig
helm status <release-name> -n <instance-id> --kubeconfig /tmp/kubeconfig
Extract:
- Release status (deployed/failed/pending)
- Revision number and last deployed time
- Application credentials from release notes
- Pod health ratio (Running vs Failed)
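Release notes (where many charts print generated credentials) and a pod health ratio come from standard helm/kubectl commands:
# Print the release notes; credentials are often embedded here
helm get notes <release-name> -n <instance-id> --kubeconfig /tmp/kubeconfig
# Count pods by status for a quick Running-vs-Failed ratio
kubectl get pods -n <instance-id> --kubeconfig /tmp/kubeconfig --no-headers \
  | awk '{print $3}' | sort | uniq -c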
6. Broader Context (If Needed)
Tool: mcp__omnistrate-platform__omnistrate-ctl_operations_events
Use the time windows identified during workflow analysis and filter by relevant event types.
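A sketch only: the subcommand and flags below are inferred from the MCP tool name and the workflow events command, not confirmed; verify with --help first.
# Assumption: command shape and flags are inferred, not documented here
omctl operations events --instance-id <instance-id> --since <start-time> --until <end-time> --output json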
Common Failure Patterns
Infrastructure Constraints
- VM allocation failures with restrictive constraints (AZ + Network + Size)
- PersistentVolumeClaim not found
- Node taints/affinity issues
Container Lifecycle
- Back-off restarting failed container
- ProcessLiveness: UNHEALTHY
- Image pull failures
Probe Failures
- Startup/readiness probe HTTP 503
- Database connectivity timeouts
- Application syntax errors preventing startup
- Service dependency unavailability
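Most of these patterns surface directly in namespace events; a chronological listing makes them easy to spot:
# FailedScheduling, Back-off, probe failures, and image pull errors all appear here
kubectl get events -n <instance-id> --sort-by=.lastTimestamp --kubeconfig /tmp/kubeconfig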
Resource Prioritization
- Core infrastructure: databases, message queues, storage
- Application services: web servers, APIs
- Support services: monitoring, logging
Response Management
- Always use --output json
- If token limit exceeded: add more specific filters, use smaller time windows, target specific resources (see the narrowed query sketch below)
- Provide analysis in template format (see Failure Analysis Template in OMNISTRATE_SRE_REFERENCE.md)
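For example, a narrowed re-run of a workflow events query drops --detail and bounds the window (flags as documented in step 3):
# Tighter filters keep the response within token limits
omctl workflow events <workflow-id> --service-id <id> --environment-id <id> \
  --resource-key <name> --step-types Compute --since <start-time> --output json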
Reference
See OMNISTRATE_SRE_REFERENCE.md for:
- Detailed tool parameter documentation
- Complete failure analysis template
- Extended examples
- Tool alternatives table