| name | platform-health |
| description | Check comprehensive platform health including ArgoCD apps, pods, services, certificates, and resources across the Kagenti platform |
Platform Health Check Skill
This skill helps you perform comprehensive platform health checks and identify issues quickly.
When to Use
- After deployments or cluster restarts
- Before making changes (baseline health)
- During incident investigation
- Regular health monitoring
- After running tests
- User requests "check platform" or "is everything working"
What This Skill Does
- Quick Health Overview: One-command platform status
- ArgoCD Apps: Health and sync status of all applications
- Pod Health: Check pods across all namespaces
- Service Accessibility: Test Gateway routes and certificates
- Resource Usage: CPU/memory consumption
- Component-Specific Checks: Detailed validation per component
Quick Health Check
Comprehensive Platform Status
# Single command for full platform health (includes pytest tests)
./scripts/platform-status.sh
# What it checks:
# ✓ ArgoCD applications (health & sync status)
# ✓ Platform pods (all namespaces)
# ✓ Gateway & certificates
# ✓ Istio mTLS configuration
# ✓ Service accessibility (via Gateway)
# ✓ OAuth authentication
# ✓ Integration tests (pytest)
Expected Output:
=== ArgoCD Applications Status ===
✓ gateway-api: Healthy, Synced
✓ cert-manager: Healthy, Synced
✓ istio-base: Healthy, Synced
...
=== Platform Pods ===
observability grafana-xxx 2/2 Running
observability prometheus-xxx 2/2 Running
...
=== Gateway & Certificates ===
✓ external-gateway: Programmed
✓ grafana-cert: Ready
...
=== Integration Tests ===
PASSED tests/validation/test_app_state.py::test_critical_apps
...
Quick Status Commands
# ArgoCD apps summary
argocd app list --port-forward --port-forward-namespace argocd --grpc-web
# All pods summary
kubectl get pods -A
# Failing pods only
kubectl get pods -A | grep -vE "Running|Completed"
# Service endpoints
kubectl get svc -A
# Gateway status
kubectl get gateway -A
# Certificate status
kubectl get certificate -A
Detailed Health Checks
1. ArgoCD Application Health
# List all apps with health status
argocd app list --port-forward --port-forward-namespace argocd --grpc-web \
-o json | jq -r '.[] | "\(.metadata.name): \(.status.health.status), \(.status.sync.status)"'
# Check for unhealthy apps
argocd app list --port-forward --port-forward-namespace argocd --grpc-web \
| grep -E "Degraded|OutOfSync|Unknown|Missing"
# Get details for specific app
argocd app get <app-name> --port-forward --port-forward-namespace argocd --grpc-web
# Check app sync history
argocd app history <app-name> --port-forward --port-forward-namespace argocd --grpc-web
Expected States:
- Health: Healthy (✓), Progressing (⚠️), Degraded (❌), Missing (❌)
- Sync: Synced (✓), OutOfSync (⚠️)
Critical Apps (must be Healthy):
- gateway-api
- cert-manager
- istio-base, istiod
- tekton-pipelines
- keycloak
- kagenti-operator, kagenti-platform-operator
- kagenti-platform
- kagenti-ui
Optional Apps (can be Progressing):
- observability (large images, slow startup)
- kiali
- ollama
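To verify the critical apps above in one pass, a minimal sketch (it assumes the same argocd port-forward flags used throughout this skill, plus jq):
# Check each critical app for Healthy/Synced status
CRITICAL_APPS="gateway-api cert-manager istio-base istiod tekton-pipelines keycloak \
  kagenti-operator kagenti-platform-operator kagenti-platform kagenti-ui"
APPS_JSON=$(argocd app list --port-forward --port-forward-namespace argocd --grpc-web -o json)
for app in $CRITICAL_APPS; do
  status=$(echo "$APPS_JSON" | jq -r --arg app "$app" \
    '.[] | select(.metadata.name==$app) | "\(.status.health.status)/\(.status.sync.status)"')
  case "$status" in
    Healthy/Synced) echo "✓ $app" ;;
    "")             echo "❌ $app: not found" ;;
    *)              echo "❌ $app: $status" ;;
  esac
done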
2. Pod Health by Namespace
# All pods with status
kubectl get pods -A -o wide
# Pods sorted by restarts
kubectl get pods -A --sort-by='.status.containerStatuses[0].restartCount' | tail -20
# Pods with issues
kubectl get pods -A --field-selector=status.phase!=Running,status.phase!=Succeeded
# Pod resource usage
kubectl top pods -A --sort-by=memory
kubectl top pods -A --sort-by=cpu
# Specific namespace health
kubectl get pods -n observability
kubectl get pods -n keycloak
kubectl get pods -n kagenti-system
Check for these statuses:
- ❌ CrashLoopBackOff: Container crashes repeatedly and keeps restarting
- ❌ ImagePullBackOff: Image cannot be pulled (bad tag, missing pull secret, or unreachable registry)
- ❌ Error: Container exited with error
- ⚠️ Pending: Waiting for resources or scheduling
- ⚠️ Init: Init containers still running
- ✓ Running: Pod healthy
- ✓ Completed: Job finished successfully
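To spot these problem statuses quickly across the cluster, a minimal sketch:
# Count pods by status, then list anything not Running or Completed
kubectl get pods -A --no-headers | awk '{print $4}' | sort | uniq -c | sort -rn
kubectl get pods -A --no-headers | grep -vE "Running|Completed" || echo "✓ No problem pods"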
3. Service Accessibility
# Test all platform services via Gateway
for service in grafana prometheus tempo phoenix kiali keycloak kagenti; do
echo "=== Testing https://$service.localtest.me:9443/ ==="
curl -k -I -m 5 "https://$service.localtest.me:9443/" 2>&1 | head -3
echo
done
# Check Gateway status
kubectl get gateway -A
kubectl describe gateway external-gateway -n default
# Check HTTPRoutes
kubectl get httproute -A
kubectl describe httproute <route-name> -n <namespace>
# Check service endpoints (should have IP addresses)
kubectl get endpoints -A | grep -v "<none>"
Expected Results:
- Grafana: HTTP/2 302 (redirect to /login)
- Prometheus: HTTP/2 302 (OAuth redirect)
- Keycloak: HTTP/2 200
- Kagenti UI: HTTP/2 200
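To turn these expectations into a pass/fail check, a minimal sketch (requires bash 4+ for the associative array; hostnames and port follow the Gateway pattern above):
# Compare actual HTTP status codes against the expected results listed above
declare -A expected=( [grafana]=302 [prometheus]=302 [keycloak]=200 [kagenti]=200 )
for svc in "${!expected[@]}"; do
  code=$(curl -k -s -o /dev/null -w '%{http_code}' -m 5 "https://$svc.localtest.me:9443/")
  [ "$code" = "${expected[$svc]}" ] \
    && echo "✓ $svc: HTTP $code" \
    || echo "❌ $svc: HTTP $code (expected ${expected[$svc]})"
done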
4. Certificate Health
# All certificates status
kubectl get certificate -A
# Check certificate details
kubectl describe certificate <cert-name> -n <namespace>
# Check cert-manager logs for issues
kubectl logs -n cert-manager deployment/cert-manager --tail=50
# Verify certificate expiration
kubectl get certificate -A -o json | jq -r '.items[] | "\(.metadata.namespace)/\(.metadata.name): expires \(.status.notAfter)"'
Expected State: All certificates show Ready=True
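A minimal sketch that flags anything not in that state (column positions follow the default kubectl get certificate output):
# List certificates whose READY column is not True
kubectl get certificate -A --no-headers | awk '$3 != "True" {print "❌ " $1 "/" $2 ": not Ready"}'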
5. Istio Service Mesh Health
# Check Istio components
kubectl get pods -n istio-system
# List pods with a sidecar injected (READY column shows 2/2)
kubectl get pods -A -o wide | grep "2/2"
# Check mTLS policies
kubectl get peerauthentication -A
kubectl get destinationrule -A
# Istio proxy status
istioctl proxy-status
# Check specific pod mesh config
istioctl x describe pod <pod-name> -n <namespace>
6. Resource Usage
# Node resources
kubectl top nodes
# Cluster-wide pod resources
kubectl top pods -A --sort-by=memory | head -20
kubectl top pods -A --sort-by=cpu | head -20
# Namespace resource usage
kubectl top pods -n observability
kubectl top pods -n keycloak
kubectl top pods -n kagenti-system
# Check for resource pressure
kubectl get nodes -o json | jq -r '.items[] | "\(.metadata.name): \(.status.conditions[] | select(.type=="MemoryPressure" or .type=="DiskPressure") | .type)=\(.status)"'
7. Storage Health
# PersistentVolumes
kubectl get pv
# PersistentVolumeClaims
kubectl get pvc -A
# Check PVC usage via metrics
kubectl exec -n observability deployment/grafana -- \
curl -s -G 'http://prometheus.observability.svc:9090/api/v1/query' \
--data-urlencode 'query=(kubelet_volume_stats_used_bytes / kubelet_volume_stats_capacity_bytes) * 100' \
| python3 -m json.tool
Component-Specific Health Checks
Observability Stack
# Prometheus
kubectl get pods -n observability -l app=prometheus
kubectl exec -n observability deployment/grafana -- \
curl -s http://prometheus.observability.svc:9090/-/ready
# Grafana
kubectl get pods -n observability -l app=grafana
curl -k -I https://grafana.localtest.me:9443/api/health
# Loki
kubectl get pods -n observability -l app=loki
kubectl exec -n observability deployment/grafana -- \
curl -s http://loki.observability.svc:3100/ready
# Tempo
kubectl get pods -n observability -l app=tempo
kubectl exec -n observability deployment/grafana -- \
curl -s http://tempo-query-frontend.observability.svc:3100/ready
# Phoenix
kubectl get pods -n observability -l app=phoenix
curl -k -I https://phoenix.localtest.me:9443/
# AlertManager
kubectl get pods -n observability -l app=alertmanager
kubectl exec -n observability deployment/alertmanager -c alertmanager -- \
wget -qO- http://localhost:9093/-/ready
Authentication & Authorization
# Keycloak
kubectl get pods -n keycloak -l app=keycloak
kubectl exec -n keycloak statefulset/keycloak -- \
curl -s http://localhost:8080/health/ready | python3 -m json.tool
# OAuth2-Proxy instances
kubectl get pods -n oauth2-proxy
kubectl get deployment -n oauth2-proxy
# Test Keycloak SSO
curl -k "https://keycloak.localtest.me:9443/realms/master/.well-known/openid-configuration"
Platform Components
# Kagenti Operator
kubectl get pods -n kagenti-operator
kubectl logs -n kagenti-operator deployment/kagenti-operator --tail=20
# Kagenti Platform Operator
kubectl get pods -n kagenti-platform-operator
kubectl logs -n kagenti-platform-operator deployment/kagenti-platform-operator --tail=20
# Kagenti UI
kubectl get pods -n kagenti-platform -l app=kagenti-ui
curl -k -I https://kagenti.localtest.me:9443/
# Tekton Pipelines
kubectl get pods -n tekton-pipelines
kubectl get pipelineruns -A
Health Check Checklists
Post-Deployment Health Check
- All ArgoCD apps Healthy and Synced
- No pods in CrashLoopBackOff/ImagePullBackOff
- All services have endpoints
- All certificates Ready
- All Gateway routes Programmed
- Services accessible via browser
- Integration tests passing
- No firing critical alerts
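The checklist above can be scripted as a single gate; a minimal sketch reusing commands from earlier sections (integration tests and service checks are left to platform-status.sh):
# Post-deployment gate: fails if any of the quick checks finds a problem
fail=0
argocd app list --port-forward --port-forward-namespace argocd --grpc-web \
  | grep -E "Degraded|OutOfSync|Unknown|Missing" && fail=1
kubectl get pods -A --no-headers | grep -vE "Running|Completed" && fail=1
kubectl get certificate -A --no-headers | awk '$3 != "True"' | grep . && fail=1
[ "$fail" -eq 0 ] && echo "✓ Post-deployment checks passed" || echo "❌ Post-deployment checks failed"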
Pre-Change Health Check
- Capture platform snapshot: ./scripts/capture-platform-snapshot.sh before-change
- All critical apps Healthy
- No existing incidents in TODO_INCIDENTS.md
- Resource usage within limits
- Recent Git commits validated
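A minimal sketch of the before/after pattern (the snapshots/ directory is illustrative; adjust to wherever capture-platform-snapshot.sh actually writes):
# Baseline before the change, capture again after, then compare
./scripts/capture-platform-snapshot.sh before-change
# ... apply the change ...
./scripts/capture-platform-snapshot.sh after-change
diff -ru snapshots/before-change snapshots/after-change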
Incident Investigation Health Check
- Identify degraded components
- Check recent events
- Collect logs from affected pods
- Query metrics for anomalies
- Check for correlated failures
- Review recent changes (Git history)
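A minimal triage sketch covering the first few items above (placeholders follow the <name> convention used throughout this skill):
# Quick triage: events, failing pods, recent logs, recent changes
NS=<namespace>   # namespace of the degraded component
kubectl get events -n "$NS" --sort-by='.lastTimestamp' | tail -20
kubectl get pods -n "$NS" --no-headers | grep -vE "Running|Completed"
kubectl logs -n "$NS" <pod-name> --tail=100
git log --oneline -10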
Common Health Issues
Issue: Pods stuck in Pending
# Check pod description for reason
kubectl describe pod <pod-name> -n <namespace>
# Common causes:
# - Insufficient CPU/memory
# - No nodes matching nodeSelector
# - Unbound PersistentVolumeClaim
Issue: Pods CrashLoopBackOff
# Check previous logs
kubectl logs <pod-name> -n <namespace> --previous
# Check events
kubectl get events -n <namespace> --sort-by='.lastTimestamp' | tail -20
# Common causes:
# - Application error on startup
# - Missing configuration
# - Dependency not available
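To find every pod currently in this state across the cluster, a minimal sketch:
# List CrashLoopBackOff pods with their restart counts
kubectl get pods -A --no-headers | awk '$4 == "CrashLoopBackOff" {print $1 "/" $2 " (restarts: " $5 ")"}'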
Issue: Service not accessible
# Check pod status
kubectl get pods -n <namespace> -l app=<service>
# Check service endpoints
kubectl get endpoints -n <namespace> <service-name>
# Check HTTPRoute
kubectl get httproute -n <namespace>
# Test from inside cluster
kubectl run debug-curl -n <namespace> --image=curlimages/curl --rm -it \
-- curl http://<service-name>.<namespace>.svc:PORT
Issue: Certificate not Ready
# Check certificate status
kubectl describe certificate <cert-name> -n <namespace>
# Check cert-manager logs
kubectl logs -n cert-manager deployment/cert-manager
# Common causes:
# - DNS validation failing
# - Rate limit reached
# - Invalid configuration
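The Ready condition usually names the cause directly; a minimal sketch for pulling just that field:
# Show the Ready condition reason and message for a failing certificate
kubectl get certificate <cert-name> -n <namespace> \
  -o jsonpath='{.status.conditions[?(@.type=="Ready")].reason}: {.status.conditions[?(@.type=="Ready")].message}{"\n"}'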
Issue: High resource usage
# Find top consumers
kubectl top pods -A --sort-by=memory | head -10
kubectl top pods -A --sort-by=cpu | head -10
# Check for memory leaks
kubectl logs <pod-name> -n <namespace> | grep -i "out of memory"
# Check resource limits
kubectl describe pod <pod-name> -n <namespace> | grep -A5 "Limits:"
Automation & Monitoring
Continuous Health Monitoring
# Watch pod status
watch -n 5 'kubectl get pods -A | grep -vE "Running|Completed"'
# Watch ArgoCD apps
watch -n 10 'argocd app list --port-forward --port-forward-namespace argocd --grpc-web | grep -vE "Healthy.*Synced"'
# Monitor specific namespace
watch -n 5 'kubectl get pods -n observability'
Scheduled Health Checks
# Cron job for periodic health checks (local dev)
# Add to crontab: crontab -e
*/15 * * * * /path/to/kagenti-demo-deployment/scripts/platform-status.sh > /tmp/health-$(date +\%Y\%m\%d-\%H\%M).log 2>&1
# Compare snapshots over time
./scripts/capture-platform-snapshot.sh hourly-check
Related Documentation
- CLAUDE.md Platform Status - Monitoring commands
- scripts/platform-status.sh - Automated health check
- TODO_INCIDENTS.md - Active incidents
- docs/INTEGRATION_TESTS.md - Test strategy
Integration with Other Skills
After a health check, if issues are found:
- Use investigate-incident skill for RCA
- Use check-logs skill to examine error logs
- Use check-metrics skill for performance analysis
- Use check-alerts skill to see if alerts fired
Pro Tips
- Always baseline first: Run health check BEFORE making changes
- Use platform-status.sh: Single command for comprehensive check
- Capture snapshots: Use capture-platform-snapshot.sh for historical comparison
- Check critical apps first: Focus on gateway-api, istio, keycloak, operators
- Look for patterns: Multiple pods failing often indicates cluster-wide issue
- Check Git history: Recent commits may explain new issues
- Verify after fixes: Always re-run health check after remediation