name

holmesgpt-skill

description

Guide for implementing HolmesGPT - an AI agent for troubleshooting cloud-native environments. Use when investigating Kubernetes issues, analyzing alerts from Prometheus/AlertManager/PagerDuty, performing root cause analysis, configuring HolmesGPT installations (CLI/Helm/Docker), setting up AI providers (OpenAI/Anthropic/Azure), creating custom toolsets, or integrating with observability platforms (Grafana, Loki, Tempo, DataDog).

HolmesGPT Skill

AI-powered troubleshooting for Kubernetes and cloud-native environments.

Overview

HolmesGPT is a CNCF Sandbox project that connects AI models with live observability data to investigate infrastructure problems, find root causes, and suggest remediations. It operates with read-only access and respects RBAC permissions, making it safe for production environments.

Quick Reference

Topic	Reference
Installation	`references/installation.md`
Configuration	`references/configuration.md`
Data Sources	`references/data-sources.md`
Commands	`references/commands.md`
Troubleshooting	`references/troubleshooting.md`
HTTP API	`references/http-api.md`
Integrations	`references/integrations.md`

Key Features

Root Cause Analysis: Investigates alerts and cluster issues
Multi-Source Integration: 30+ toolsets (K8s, Prometheus, Grafana)
Alert Integration: AlertManager, PagerDuty, OpsGenie, Jira, Slack
Interactive Mode: Troubleshooting with /run, /show, /clear
Custom Toolsets: Extend with proprietary tools via YAML configuration
CI/CD Integration: Automated deployment failure investigation

Installation Quick Start

CLI (Homebrew)

brew tap robusta-dev/homebrew-holmesgpt
brew install holmesgpt
export ANTHROPIC_API_KEY="your-key"  # or OPENAI_API_KEY
holmes ask "what pods are unhealthy?"

Kubernetes (Helm)

helm repo add robusta https://robusta-charts.storage.googleapis.com
helm repo update
helm install holmesgpt robusta/holmes -f values.yaml

Docker

docker run -it --net=host \
  -e OPENAI_API_KEY="your-key" \
  -v ~/.kube/config:/root/.kube/config \
  us-central1-docker.pkg.dev/genuine-flight-317411/devel/holmes \
  ask "what pods are crashing?"

Essential Commands

# Basic investigation
holmes ask "what pods are unhealthy and why?"
holmes ask "why is my deployment failing?"

# Interactive mode
holmes ask "investigate issue" --interactive

# Alert investigation
holmes investigate alertmanager --alertmanager-url http://localhost:9093
holmes investigate pagerduty --pagerduty-api-key <KEY> --update

# With file context
holmes ask "summarize the key points" -f ./logs.txt

# CI/CD integration
holmes ask "why did deployment fail?" --destination slack --slack-token <TOKEN>

Supported AI Providers

Provider	Environment Variable	Models
Anthropic	`ANTHROPIC_API_KEY`	Sonnet 4, Opus 4.5
OpenAI	`OPENAI_API_KEY`	GPT-4.1, GPT-4o
Azure OpenAI	`AZURE_API_KEY`	GPT-4.1
AWS Bedrock	AWS credentials	Claude 3.5 Sonnet
Google Gemini	`GEMINI_API_KEY`	Gemini 1.5 Pro
Vertex AI	`VERTEXAI_PROJECT`	Gemini 1.5 Pro
Ollama	Local install	Llama 3.1, Mistral

Basic Helm Values Structure

# values.yaml for Kubernetes deployment
image:
  repository: robustadev/holmes
  tag: latest

env:
  - name: ANTHROPIC_API_KEY
    valueFrom:
      secretKeyRef:
        name: holmesgpt-secrets
        key: anthropic-api-key

# Model configuration
modelList:
  sonnet:
    api_key: "{{ env.ANTHROPIC_API_KEY }}"
    model: anthropic/claude-sonnet-4-20250514
    temperature: 0

# Toolsets to enable
toolsets:
  kubernetes/core:
    enabled: true
  kubernetes/logs:
    enabled: true
  prometheus/metrics:
    enabled: true

# Resources
resources:
  requests:
    memory: "1024Mi"
    cpu: "100m"
  limits:
    memory: "1024Mi"

# RBAC (read-only by default)
createServiceAccount: true

Interactive Mode Commands

Command	Description
`/clear`	Reset context when changing topics
`/run`	Execute custom commands and share output with AI
`/show`	Display complete tool outputs
`/context`	Review accumulated investigation information

Custom Toolset Example

# custom-toolset.yaml
toolsets:
  my-custom-tool:
    description: "Custom diagnostic tool"
    tools:
      - name: check_service_health
        description: "Check health of a specific service"
        command: |
          curl -s http://{{ service_name }}.{{ namespace }}.svc.cluster.local/health
        parameters:
          - name: service_name
            description: "Name of the service"
          - name: namespace
            description: "Kubernetes namespace"

Use with: holmes ask "check health" -t custom-toolset.yaml

Kubernetes Annotations for Integration

# Add to Services/Deployments for HolmesGPT context
metadata:
  annotations:
    holmesgpt.dev/runbook: |
      This service handles payment processing.
      Common issues: database connectivity, API rate limits.
      Check: kubectl logs -l app=payment-service

Environment Variables Reference

Variable	Description	Default
`HOLMES_CONFIG_PATH`	Config file path	`~/.holmes/config.yaml`
`HOLMES_LOG_LEVEL`	Log verbosity	`INFO`
`PROMETHEUS_URL`	Prometheus server URL	-
`GITHUB_TOKEN`	GitHub API token	-
`DATADOG_API_KEY`	DataDog API key	-
`CONFLUENCE_BASE_URL`	Confluence URL	-

Best Practices

Use Specific Queries: Include namespace, deployment name, symptoms
Start with Claude Sonnet 4.0/4.5: Best accuracy for complex investigations
Enable Relevant Toolsets: Only enable what you need to reduce noise
Use Interactive Mode: For complex multi-step investigations
Set Up Runbooks: Provide context for known alert types
CI/CD Integration: Automate deployment failure analysis

Security Considerations

HolmesGPT uses read-only access (get, list, watch only)
Respects existing RBAC permissions
Never modifies, creates, or deletes resources
API keys stored in Kubernetes Secrets
Data not used for model training

Official Resources

Documentation: https://holmesgpt.dev/
GitHub: https://github.com/robusta-dev/holmesgpt
Helm Chart: https://github.com/robusta-dev/holmesgpt/tree/master/helm/holmes
Slack Community: Cloud Native Slack