| name | check-logs |
| description | Query and analyze logs using Grafana Loki for the Kagenti platform, search for errors, and investigate issues |
Check Logs Skill
This skill helps you query and analyze logs from the Kagenti platform using Loki via Grafana.
When to Use
- User asks "show me logs for X"
- Investigating errors or failures
- After deployments to check for issues
- Debugging pod crashes or restarts
- Analyzing application behavior
What This Skill Does
- Query Logs: Search logs by namespace, pod, container, or log level
- Error Detection: Find errors and warnings in logs
- Log Aggregation: View logs across multiple pods
- Time-based Queries: Query logs for specific time ranges
- Log Patterns: Detect common issues from log patterns
Examples
Query Logs in Grafana UI
- Access Grafana: https://grafana.localtest.me:9443
- Navigate: Explore → select the Loki datasource
- Log Dashboard: https://grafana.localtest.me:9443/d/loki-logs/loki-logs
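If the ingress URL is not reachable from your machine, a port-forward is a workable fallback. This is a minimal sketch; the service name and port are assumptions based on the deployment/grafana used elsewhere in this skill and Grafana's default port:
# Forward local port 3000 to the Grafana service, then open http://localhost:3000/explore
kubectl port-forward -n observability svc/grafana 3000:3000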
Query Examples in Grafana Explore:
# All logs from observability namespace
{kubernetes_namespace_name="observability"}
# Logs from specific pod
{kubernetes_pod_name=~"prometheus.*"}
# Logs with errors
{kubernetes_namespace_name="observability"} |= "error"
# Logs from last 5 minutes with level=error
{kubernetes_namespace_name="observability"} | json | level="error"
# Count errors per namespace
sum by (kubernetes_namespace_name) (count_over_time({kubernetes_namespace_name=~".+"} |= "error" [5m]))
Query Logs via CLI (Loki HTTP API)
# Query Loki for recent errors in the observability namespace
# (runs curl from inside the Grafana pod, which can reach Loki in-cluster)
# Note: `date -v-5M` is BSD/macOS syntax; on GNU/Linux use: date -u -d '5 minutes ago' +%s
kubectl exec -n observability deployment/grafana -- \
curl -s -G 'http://loki.observability.svc:3100/loki/api/v1/query_range' \
--data-urlencode 'query={kubernetes_namespace_name="observability"} |= "error"' \
--data-urlencode 'limit=100' \
--data-urlencode "start=$(date -u -v-5M +%s)000000000" \
--data-urlencode "end=$(date -u +%s)000000000" | python3 -m json.tool
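If logcli (Loki's official CLI) is installed locally, the same query reads more cleanly. A sketch assuming the Loki service is reachable via a local port-forward on 3100 (service name inferred from the in-cluster address above):
# Forward the Loki service locally, then query it with logcli
kubectl port-forward -n observability svc/loki 3100:3100 &
export LOKI_ADDR=http://localhost:3100
logcli query --since=5m --limit=100 '{kubernetes_namespace_name="observability"} |= "error"'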
Check Logs for Specific Pod
# Get logs for a specific pod using kubectl
kubectl logs -n observability deployment/prometheus --tail=100
# Get logs from previous container (if crashed)
kubectl logs -n observability pod/prometheus-xxx --previous
# Follow logs in real-time
kubectl logs -n observability deployment/grafana -f --tail=20
# Get logs from specific container in pod
kubectl logs -n observability pod/alertmanager-xxx -c alertmanager --tail=50
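To aggregate logs across all pods of an app (the "Log Aggregation" capability above), kubectl can select by label instead of pod name. The label key/value here is an assumption; check the real labels with kubectl get pods --show-labels:
# Tail logs from every pod matching a label, prefixing each line with its pod/container name
kubectl logs -n observability -l app=grafana --all-containers=true --prefix --tail=20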
Search for Errors Across Platform
# Get recent error logs from all namespaces
# Note: kubectl logs needs a pod (or workload) name, so iterate over pods per namespace
for ns in observability keycloak oauth2-proxy istio-system kiali-system; do
echo "=== Errors in $ns ==="
for pod in $(kubectl get pods -n "$ns" -o name); do
kubectl logs -n "$ns" "$pod" --all-containers=true --tail=50 2>&1 | grep -iE "error|fatal|exception" | head -5
done
echo
done
Check Logs for Failed Pods
# Find pods with issues and check their logs
kubectl get pods -A | grep -E "Error|CrashLoop|ImagePull" | while read ns pod rest; do
echo "=== Logs for $pod in $ns ==="
kubectl logs -n $ns $pod --tail=30 --previous 2>/dev/null || kubectl logs -n $ns $pod --tail=30
echo
done
Query Log Volume by Namespace
# In Grafana Explore (Loki datasource)
sum by (kubernetes_namespace_name) (
rate({kubernetes_namespace_name=~".+"}[5m])
)
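A related view that is sometimes more telling is byte volume rather than line rate; bytes_over_time is standard LogQL:
# Log throughput in bytes per namespace over 5-minute windows
sum by (kubernetes_namespace_name) (
  bytes_over_time({kubernetes_namespace_name=~".+"}[5m])
)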
Search for Specific Error Pattern
# Find connection errors
{kubernetes_namespace_name="observability"} |~ "connection (refused|timeout|reset)"
# Find authentication failures
{kubernetes_namespace_name=~"keycloak|oauth2-proxy"} |~ "auth.*fail|unauthorized|forbidden"
# Find OOM kills
{kubernetes_namespace_name=~".+"} |~ "OOM|out of memory|oom.*kill"
Log Levels and Filtering
Standard Log Levels
- error: Critical errors requiring attention
- warn/warning: Warnings that may indicate issues
- info: Informational messages
- debug: Detailed debugging information
- trace: Very detailed trace information
Filter by Log Level
# Only errors
{kubernetes_namespace_name="observability"} | json | level="error"
# Errors and warnings
{kubernetes_namespace_name="observability"} | json | level=~"error|warn"
# Everything except debug
{kubernetes_namespace_name="observability"} | json | level!="debug"
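The queries above assume JSON-formatted logs. For logfmt or plain-text logs, swap the parser; the level values below are assumptions about your log format:
# logfmt-formatted logs
{kubernetes_namespace_name="observability"} | logfmt | level="error"
# Plain-text logs: extract the level with the regexp parser (named capture group)
{kubernetes_namespace_name="observability"} | regexp "(?P<level>ERROR|WARN|INFO)" | level="ERROR"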
Common Log Queries for Platform Components
Prometheus Logs
kubectl logs -n observability deployment/prometheus --tail=100
# Check for scrape errors
kubectl logs -n observability deployment/prometheus | grep -i "scrape\|error"
Grafana Logs
kubectl logs -n observability deployment/grafana --tail=100
# Check for datasource errors
kubectl logs -n observability deployment/grafana | grep -i "datasource\|error"
Keycloak Logs
kubectl logs -n keycloak statefulset/keycloak --tail=100
# Check for authentication errors
kubectl logs -n keycloak statefulset/keycloak | grep -i "auth\|login\|error"
Istio Proxy (Sidecar) Logs
# Check sidecar logs for a specific pod
POD=$(kubectl get pod -n observability -l app=alertmanager -o jsonpath='{.items[0].metadata.name}')
kubectl logs -n observability $POD -c istio-proxy --tail=50
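When Envoy access logging is enabled, sidecar logs include the HTTP response code right after the quoted request line, so a quick grep surfaces 5xx responses (assumes the default access log format):
# Find 5xx responses in the sidecar's access log
kubectl logs -n observability $POD -c istio-proxy --tail=500 | grep -E '" 5[0-9]{2} '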
AlertManager Logs
kubectl logs -n observability deployment/alertmanager -c alertmanager --tail=100
# Check for notification errors
kubectl logs -n observability deployment/alertmanager -c alertmanager | grep -i "notif\|error\|fail"
Log Analysis Patterns
Detect Crash Loops
# Find pods restarting frequently (with -A, RESTARTS is column 5)
kubectl get pods -A | awk 'NR > 1 && $5+0 > 5'
# Check logs before crash
kubectl logs -n <namespace> <pod-name> --previous | tail -50
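A sort-based variant avoids hard-coding a threshold; note it sorts by the first container's restart count:
# List all pods ordered by restart count (highest last)
kubectl get pods -A --sort-by='.status.containerStatuses[0].restartCount'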
Find HTTP Errors
{kubernetes_namespace_name=~".+"} |~ "HTTP.*[45]\\d{2}"
Find Timeout Errors
{kubernetes_namespace_name=~".+"} |~ "timeout|timed out|deadline exceeded"
Find Database Connection Issues
{kubernetes_namespace_name=~".+"} |~ "database.*error|connection.*refused|SQL.*error"
Troubleshooting with Logs
Issue: Service Not Starting
- Check pod events: kubectl describe pod <pod-name> -n <namespace>
- Check container logs: kubectl logs <pod-name> -n <namespace>
- Check init container logs: kubectl logs <pod-name> -n <namespace> -c <init-container>
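Pod events often explain a startup failure faster than logs do; a minimal sketch using the same placeholders:
# Recent events in the namespace, newest last
kubectl get events -n <namespace> --sort-by=.lastTimestamp | tail -20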
Issue: High Error Rate
- Query error logs: {kubernetes_namespace_name="X"} |= "error"
- Group by component: sum by (kubernetes_pod_name) (count_over_time({...} |= "error" [5m]))
- Identify patterns in the error messages
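To rank offenders directly, topk works on LogQL metric queries; the namespace value here is just an example:
# Top 5 pods by error count over the last 5 minutes
topk(5, sum by (kubernetes_pod_name) (
  count_over_time({kubernetes_namespace_name="observability"} |= "error" [5m])
))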
Issue: Performance Degradation
- Check for warnings: {kubernetes_namespace_name="X"} |= "warn"
- Look for timeout messages
- Check for resource exhaustion messages
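A single case-insensitive regex can sweep for the usual degradation signals in one pass; the pattern list is an assumption, extend it to taste:
# Warnings, timeouts, and throttling hints in one query
{kubernetes_namespace_name="X"} |~ "(?i)(warn|timeout|timed out|throttl)"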
Grafana Loki Dashboard Features
Loki Logs Dashboard: https://grafana.localtest.me:9443/d/loki-logs/loki-logs
Features:
- Namespace filter: Select specific namespace
- Pod filter: Filter by pod name
- Log level: Filter by error/warn/info/debug
- Time range: Select time window
- Log volume graphs: See log rate over time
- Log table: Browse actual log lines
Panels:
- Log Volume by Level: See errors vs warnings over time
- Log Volume by Namespace: Compare activity across namespaces
- Logs per Second: Current log ingestion rate
- Log Lines: Actual log content with search
Related Documentation
- Loki Documentation
- LogQL Query Language
- CLAUDE.md Troubleshooting
- Alert Runbooks (many runbooks reference logs)
Pro Tips
- Use time ranges: Always specify time range to limit data
- Filter early: Add namespace/pod filters before log level filters (more efficient)
- Use regex carefully: Complex regex can be slow on large log volumes
- Check both current and previous: for crashed pods, use --previous
- Tail first: use --tail=N to limit output, then increase if needed
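Putting the "filter early" tip into practice, a well-ordered query narrows by stream labels first, then line filters, then parses:
# Fast label match → cheap substring filter → expensive JSON parse
{kubernetes_namespace_name="observability", kubernetes_pod_name=~"grafana.*"} |= "error" | json | level="error"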