Debugging Omnistrate Deployments

@omnistrate-oss/agent-instructions

Systematically debug failed Omnistrate instance deployments using a progressive workflow that identifies root causes efficiently while avoiding token limits. Applies to deployment failures, probe issues, and Helm-based resources.

SKILL.md

name: Debugging Omnistrate Deployments
description: Systematically debug failed Omnistrate instance deployments using a progressive workflow that identifies root causes efficiently while avoiding token limits. Applies to deployment failures, probe issues, and Helm-based resources.

Debugging Omnistrate Deployments

When to Use This Skill

  • Instance deployments showing FAILED or DEPLOYING status
  • Resources with unhealthy pod statuses or deployment errors
  • Startup/readiness probe failures (HTTP 503, timeouts)
  • Helm releases with unclear deployment states
  • Any situation requiring root-cause analysis of a deployment failure

Progressive Debugging Workflow

1. Get Deployment Status

Tool: mcp__omnistrate-platform__omnistrate-ctl_instance_describe
Flags: --deployment-status --output json

Extract:

  • Overall instance status
  • Resources with deployment errors or unhealthy pod statuses
  • Focus subsequent analysis on problematic resources only

Key Benefit: Returns a concise status summary and uses significantly fewer tokens than a full describe
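
A sketch of the equivalent CLI call, assuming omctl mirrors the MCP tool name here as it does for the workflow and deployment-cell commands shown later:

omctl instance describe <instance-id> --deployment-status --output json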

2. Identify Workflows

Tool: mcp__omnistrate-platform__omnistrate-ctl_workflow_list
Flags: --instance-id <id> --output json

Extract workflow IDs, types, and start/end times for failed deployments.
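
As a sketch, under the same assumption about the CLI form:

omctl workflow list --instance-id <instance-id> --output json

Match workflows to the failure window by their start/end times and carry the failed workflow's ID into the next phase.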

3. Analyze Workflow Events (Two-Phase)

Phase 1 - Summary (Always Start Here):

omctl workflow events <workflow-id> --service-id <id> --environment-id <id> --output json

Extract:

  • All resources with workflow step status (failed/in-progress/success)
  • Step duration analysis and event count patterns
  • Identify specific failed/stuck steps

Phase 2 - Detail (Only for Failed Steps):

omctl workflow events <workflow-id> --service-id <id> --environment-id <id> \
  --resource-key <name> --step-types <type> --detail --output json

Use parameters (combined in the sketch after this list):

  • --resource-key: Target specific resource
  • --step-types: Filter to specific step (Bootstrap, Compute, Deployment, Network, Storage, Monitoring)
  • --detail: Include full event details (use sparingly)
  • --since/--until: Time-bound queries
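
For example, a time-bounded query targeting one resource's Deployment step (the value format for --since/--until is an assumption; check the CLI help):

omctl workflow events <workflow-id> --service-id <id> --environment-id <id> \
  --resource-key <name> --step-types Deployment --detail \
  --since <start-time> --until <end-time> --output json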

Extract from detail view:

  • WorkflowStepDebug error messages
  • VM allocation failures and constraints
  • Pod scheduling issues
  • Container readiness failures

Pod Event Timeline: Create ASCII visualizations showing deployment progression:

HH:MM:SS ┬─── ✗ FailedScheduling
         │    pod/app-0: Insufficient memory
         │
HH:MM:SS ├─── ⚡ TriggeredScaleUp
         │    nodegroup-1: adding 2 nodes
         │
HH:MM:SS ├─── 📥 Pulling image:latest
         │    (duration: 2m15s)
         │
HH:MM:SS └─── ✅ Started
              3/3 pods Running

Symbols: ✗ failed, ✅ success, ⚡ autoscaler, 💾 storage, 📥 image, 🚀 runtime, ⚠️ warning

4. Application-Level Investigation

When: A resource is stuck DEPLOYING with probe failures, containers are Running but not Ready, or earlier steps produced no conclusive evidence

Tool: mcp__omnistrate-platform__omnistrate-ctl_deployment-cell_update-kubeconfig + kubectl

omctl deployment-cell update-kubeconfig <cell-id> --kubeconfig /tmp/kubeconfig
kubectl get pods -n <instance-id> --kubeconfig /tmp/kubeconfig
kubectl logs <pod-name> -c service -n <instance-id> --kubeconfig /tmp/kubeconfig --tail=50

Look for:

  • Database connection failures
  • Application syntax/runtime errors (Python SyntaxError, Java compilation errors)
  • Service dependency failures
  • Configuration issues
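
When container logs don't explain a failing probe, the pod's probe configuration and recent events usually do; both are standard kubectl queries:

kubectl describe pod <pod-name> -n <instance-id> --kubeconfig /tmp/kubeconfig
kubectl get events -n <instance-id> --kubeconfig /tmp/kubeconfig --sort-by=.lastTimestamp

describe shows probe definitions, restart counts, and per-container state; the sorted event list surfaces scheduling and probe failures in chronological order.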

5. Helm-Specific Verification

When: A Helm resource reports conflicting status, you need application credentials, or the deployment state is unclear

Tool: Same kubeconfig setup with --role cluster-admin + helm

omctl deployment-cell update-kubeconfig <cell-id> --kubeconfig /tmp/kubeconfig --role cluster-admin
helm list -n <instance-id> --kubeconfig /tmp/kubeconfig
helm status <release-name> -n <instance-id> --kubeconfig /tmp/kubeconfig

Extract:

  • Release status (deployed/failed/pending)
  • Revision number and last deployed time
  • Application credentials from release notes (see the command after this list)
  • Pod health ratio (Running vs Failed)
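
To re-read the release notes on their own (charts commonly print credentials and endpoints there), helm get notes works with the same kubeconfig:

helm get notes <release-name> -n <instance-id> --kubeconfig /tmp/kubeconfig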

6. Broader Context (If Needed)

Tool: mcp__omnistrate-platform__omnistrate-ctl_operations_events

Use the time windows identified during workflow analysis and filter by relevant event types.
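
The flags for this tool aren't documented here; a plausible shape, assuming it accepts the same time-window flags as workflow events, would be:

omctl operations events --instance-id <instance-id> --since <start-time> --until <end-time> --output json

Treat the flag names as assumptions and confirm them against the tool's help output.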

Common Failure Patterns

Infrastructure Constraints

  • VM allocation failures with restrictive constraints (AZ + Network + Size)
  • PersistentVolumeClaim not found
  • Node taints/affinity issues

Container Lifecycle

  • Back-off restarting failed container
  • ProcessLiveness: UNHEALTHY
  • Image pull failures

Probe Failures

  • Startup/readiness probe HTTP 503
  • Database connectivity timeouts
  • Application syntax errors preventing startup
  • Service dependency unavailability

Resource Prioritization

When multiple resources are failing, investigate in this order:

  1. Core infrastructure: databases, message queues, storage
  2. Application services: web servers, APIs
  3. Support services: monitoring, logging

Response Management

  • Always use --output json
  • If the token limit is exceeded: add more specific filters, use smaller time windows, target specific resources, or filter JSON client-side (see the sketch after this list)
  • Provide analysis in template format (see Failure Analysis Template in OMNISTRATE_SRE_REFERENCE.md)
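
One client-side option is piping JSON through jq before it reaches the model; a sketch with hypothetical field names (the actual schema may differ):

omctl instance describe <instance-id> --deployment-status --output json \
  | jq '.resources[] | select(.status != "HEALTHY")'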

Reference

See OMNISTRATE_SRE_REFERENCE.md for:

  • Detailed tool parameter documentation
  • Complete failure analysis template
  • Extended examples
  • Tool alternatives table