| name | Debugging Omnistrate Deployments |
| description | Systematically debug failed Omnistrate instance deployments using a progressive workflow that identifies root causes efficiently while avoiding token limits. Applies to deployment failures, probe issues, and Helm-based resources. |
Debugging Omnistrate Deployments
When to Use This Skill
- Instance deployments showing FAILED status or stuck in DEPLOYING
- Resources with unhealthy pod statuses or deployment errors
- Startup/readiness probe failures (HTTP 503, timeouts)
- Helm releases with unclear deployment states
- Need to identify the root cause of a deployment failure
Progressive Debugging Workflow
1. Get Deployment Status
Tool: mcp__omnistrate-platform__omnistrate-ctl_instance_describe
Flags: --deployment-status --output json
Extract:
- Overall instance status
- Resources with deployment errors or unhealthy pod statuses
- Focus subsequent analysis on problematic resources only
Key Benefit: Returns a concise status summary, significantly reducing token usage compared with a full describe call
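A minimal sketch, assuming the CLI form mirrors the MCP tool name (omctl instance describe is inferred; the flags are the ones documented above):
# Assumption: subcommand name inferred from the MCP tool name; verify with omctl --help
omctl instance describe <instance-id> --deployment-status --output json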
2. Identify Workflows
Tool: mcp__omnistrate-platform__omnistrate-ctl_workflow_list
Flags: --instance-id <id> --output json
Extract workflow IDs, types, and start/end times for failed deployments.
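For example, assuming the CLI form mirrors the MCP tool name (flags as documented above):
# List workflows for the instance, then note IDs and timestamps of failed runs
omctl workflow list --instance-id <instance-id> --output json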
3. Analyze Workflow Events (Two-Phase)
Phase 1 - Summary (Always Start Here):
omctl workflow events <workflow-id> --service-id <id> --environment-id <id> --output json
Extract:
- All resources with workflow step status (failed/in-progress/success)
- Step duration analysis and event count patterns
- Identify specific failed/stuck steps
Phase 2 - Detail (Only for Failed Steps):
omctl workflow events <workflow-id> --service-id <id> --environment-id <id> \
--resource-key <name> --step-types <type> --detail --output json
Use parameters:
- --resource-key: Target a specific resource
- --step-types: Filter to a specific step type (Bootstrap, Compute, Deployment, Network, Storage, Monitoring)
- --detail: Include full event details (use sparingly)
- --since/--until: Time-bound queries
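Combined, a targeted Phase 2 query looks like this (the timestamp placeholders are illustrative; check omctl workflow events --help for the accepted format):
# Narrow to one resource, one step type, and a bounded time window
omctl workflow events <workflow-id> --service-id <id> --environment-id <id> \
  --resource-key <name> --step-types Deployment --detail \
  --since <start-time> --until <end-time> --output json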
Extract from detail view:
- WorkflowStepDebug error messages
- VM allocation failures and constraints
- Pod scheduling issues
- Container readiness failures
Pod Event Timeline: Create ASCII visualizations showing deployment progression:
HH:MM:SS ┬─── ✗ FailedScheduling
│ pod/app-0: Insufficient memory
│
HH:MM:SS ├─── ⚡ TriggeredScaleUp
│ nodegroup-1: adding 2 nodes
│
HH:MM:SS ├─── 📥 Pulling image:latest
│ (duration: 2m15s)
│
HH:MM:SS └─── ✅ Started
3/3 pods Running
Symbols: ✗ failed, ✅ success, ⚡ autoscaler, 💾 storage, 📥 image, 🚀 runtime, ⚠️ warning
4. Application-Level Investigation
When: A resource is stuck in DEPLOYING with probe failures, containers are Running but not Ready, or previous steps produced no conclusive evidence
Tool: mcp__omnistrate-platform__omnistrate-ctl_deployment-cell_update-kubeconfig + kubectl
omctl deployment-cell update-kubeconfig <cell-id> --kubeconfig /tmp/kubeconfig
kubectl get pods -n <instance-id> --kubeconfig /tmp/kubeconfig
kubectl logs <pod-name> -c service -n <instance-id> --kubeconfig /tmp/kubeconfig --tail=50
Look for:
- Database connection failures
- Application syntax/runtime errors (Python SyntaxError, Java compilation errors)
- Service dependency failures
- Configuration issues
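A quick log scan surfaces most of these; the grep patterns below are illustrative, not exhaustive:
# Scan recent container logs for common failure signatures
kubectl logs <pod-name> -c service -n <instance-id> --kubeconfig /tmp/kubeconfig --tail=200 \
  | grep -iE 'error|exception|refused|timeout'
# Pod events often explain Running-but-not-Ready states
kubectl describe pod <pod-name> -n <instance-id> --kubeconfig /tmp/kubeconfig | tail -30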
5. Helm-Specific Verification
When: Helm resources report conflicting statuses, application credentials are needed, or the deployment state is unclear
Tool: Same kubeconfig setup as above, with --role cluster-admin, plus helm
omctl deployment-cell update-kubeconfig <cell-id> --kubeconfig /tmp/kubeconfig --role cluster-admin
helm list -n <instance-id> --kubeconfig /tmp/kubeconfig
helm status <release-name> -n <instance-id> --kubeconfig /tmp/kubeconfig
Extract:
- Release status (deployed/failed/pending)
- Revision number and last deployed time
- Application credentials from release notes
- Pod health ratio (Running vs Failed)
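Release notes (where many charts print generated credentials) and a pod health ratio come from standard helm/kubectl commands:
# Print the release notes; credentials are often embedded here
helm get notes <release-name> -n <instance-id> --kubeconfig /tmp/kubeconfig
# Count pods by status for a quick Running-vs-Failed ratio
kubectl get pods -n <instance-id> --kubeconfig /tmp/kubeconfig --no-headers \
  | awk '{print $3}' | sort | uniq -c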
6. Broader Context (If Needed)
Tool: mcp__omnistrate-platform__omnistrate-ctl_operations_events
Use the time windows identified during workflow analysis and filter by relevant event types.
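A sketch only: the subcommand and flags below are inferred from the MCP tool name and the workflow events command, not confirmed; verify with --help first.
# Assumption: command shape and flags are inferred, not documented here
omctl operations events --instance-id <instance-id> --since <start-time> --until <end-time> --output json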
Common Failure Patterns
Infrastructure Constraints
- VM allocation failures with restrictive constraints (AZ + Network + Size)
- PersistentVolumeClaim not found
- Node taints/affinity issues
Container Lifecycle
- Back-off restarting failed container
- ProcessLiveness: UNHEALTHY
- Image pull failures
Probe Failures
- Startup/readiness probe HTTP 503
- Database connectivity timeouts
- Application syntax errors preventing startup
- Service dependency unavailability
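Most of these patterns surface directly in namespace events; a chronological listing makes them easy to spot:
# FailedScheduling, Back-off, probe failures, and image pull errors all appear here
kubectl get events -n <instance-id> --sort-by=.lastTimestamp --kubeconfig /tmp/kubeconfig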
Resource Prioritization
- Core infrastructure: databases, message queues, storage
- Application services: web servers, APIs
- Support services: monitoring, logging
Response Management
- Always use --output json
- If token limit exceeded: add more specific filters, use smaller time windows, target specific resources (see the narrowed query sketch below)
- Provide analysis in template format (see Failure Analysis Template in OMNISTRATE_SRE_REFERENCE.md)
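For example, a narrowed re-run of a workflow events query drops --detail and bounds the window (flags as documented in step 3):
# Tighter filters keep the response within token limits
omctl workflow events <workflow-id> --service-id <id> --environment-id <id> \
  --resource-key <name> --step-types Compute --since <start-time> --output json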
Reference
See OMNISTRATE_SRE_REFERENCE.md for:
- Detailed tool parameter documentation
- Complete failure analysis template
- Extended examples
- Tool alternatives table