| name | moai-domain-monitoring |
| version | 4.0.0 |
| updated | 2025-11-20 |
| status | stable |
| description | Observability with Prometheus, Grafana, logging, and tracing |
| allowed-tools | Read, Bash, WebSearch, WebFetch |
Monitoring & Observability Expert
Production Monitoring Stack
Focus: Metrics, Logs, Traces (Three Pillars of Observability)
Stack: Prometheus, Grafana, Loki, OpenTelemetry, Jaeger
Overview
End-to-end observability for production systems: collect metrics, logs, and traces, then alert on them.
Three Pillars
- Metrics: Time-series data (Prometheus)
- Logs: Event records (Loki, ELK)
- Traces: Distributed request tracking (Jaeger, Tempo)
Quick Start
1. Metrics (Prometheus)
Instrument application code with the Counter, Gauge, and Histogram metric types.
Key Metrics:
- Request rate (requests/sec)
- Error rate (5xx responses / total requests)
- Latency (p50, p95, p99)
See: examples.md
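As a rough illustration of how the three metric types differ, the sketch below models them in plain Python. It is not the `prometheus_client` API; the class names and methods are hypothetical stand-ins. The histogram mirrors the cumulative "less than or equal" bucket layout that Prometheus exposes.

```python
import bisect
import math


class Counter:
    """Monotonically increasing value, e.g. total requests served."""

    def __init__(self):
        self.value = 0.0

    def inc(self, amount=1.0):
        if amount < 0:
            raise ValueError("counters can only increase")
        self.value += amount


class Gauge:
    """Value that can go up or down, e.g. requests currently in flight."""

    def __init__(self):
        self.value = 0.0

    def set(self, value):
        self.value = float(value)

    def inc(self, amount=1.0):
        self.value += amount

    def dec(self, amount=1.0):
        self.value -= amount


class Histogram:
    """Counts observations into buckets keyed by their upper bound."""

    def __init__(self, bounds):
        # A final +Inf bucket catches every observation above the last bound.
        self.bounds = sorted(bounds) + [math.inf]
        self.counts = [0] * len(self.bounds)  # per-bucket, not cumulative
        self.total = 0.0
        self.count = 0

    def observe(self, value):
        # bisect_left puts a value equal to a bound into that bound's bucket.
        self.counts[bisect.bisect_left(self.bounds, value)] += 1
        self.total += value
        self.count += 1

    def cumulative(self):
        """Return (upper_bound, cumulative_count) pairs."""
        out, running = [], 0
        for bound, bucket_count in zip(self.bounds, self.counts):
            running += bucket_count
            out.append((bound, running))
        return out
```

A counter suits totals such as requests served, a gauge suits current state such as queue length, and a histogram suits latency, because percentiles can later be estimated from the bucket counts.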
2. Logging (Structured)
JSON-formatted logs for easy parsing.
Fields: timestamp, level, message, context
See: examples.md
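One way to emit the four fields listed above is a custom formatter on top of the standard `logging` module. This is a minimal sketch: the `JsonFormatter` class, the `build_logger` helper, and the convention of passing `context` through `extra` are assumptions made for illustration, not part of any particular logging stack.

```python
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object per line."""

    def format(self, record):
        payload = {
            "timestamp": datetime.fromtimestamp(
                record.created, tz=timezone.utc
            ).isoformat(),
            "level": record.levelname,
            "message": record.getMessage(),
            # "context" is attached by callers via extra={"context": {...}}.
            "context": getattr(record, "context", {}),
        }
        return json.dumps(payload)


def build_logger(name="app", stream=None):
    """Return a logger that writes JSON lines to `stream` (stderr by default)."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.handlers = [handler]
    return logger
```

A call such as `build_logger().info("order created", extra={"context": {"order_id": 42}})` then produces one JSON line that a log pipeline can parse without regular expressions.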
3. Tracing (OpenTelemetry)
Track requests across microservices.
Concepts: Span, Trace ID, Parent-Child relationships
See: examples.md
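The concepts above can be illustrated without any tracing library. The sketch below is a hypothetical data model, not the OpenTelemetry SDK: every span in a request shares one trace ID, and each child records its parent's span ID. The ID lengths (16 bytes for a trace, 8 bytes for a span) follow the W3C Trace Context format.

```python
import secrets
import time
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Span:
    """One timed unit of work inside a trace."""

    name: str
    trace_id: str
    span_id: str = field(default_factory=lambda: secrets.token_hex(8))
    parent_id: Optional[str] = None
    start: float = field(default_factory=time.monotonic)
    end: Optional[float] = None

    def child(self, name):
        """Start a span in the same trace, linked to this span as parent."""
        return Span(name=name, trace_id=self.trace_id, parent_id=self.span_id)

    def finish(self):
        self.end = time.monotonic()


def start_trace(name):
    """Create a root span with a fresh trace ID and no parent."""
    return Span(name=name, trace_id=secrets.token_hex(16))
```

For example, `root = start_trace("GET /checkout")` followed by `db = root.child("SELECT orders")` yields two spans with the same `trace_id`, where `db.parent_id == root.span_id`. A tracing backend uses exactly this link to rebuild the request tree.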
4. Alerting
Automated alerts on threshold breaches.
Examples: Error rate >5%, CPU >80%
See: examples.md
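Threshold alerting reduces to comparing a current metric value against a limit. The sketch below encodes the two example thresholds as data. The `AlertRule` type, the rule names, and the metric keys (`error_rate`, `cpu_utilization`) are hypothetical. A real setup would more likely express these as Prometheus alerting rules.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AlertRule:
    name: str
    metric: str
    threshold: float  # fire when the metric is strictly above this value

    def fires(self, metrics):
        value = metrics.get(self.metric)
        # A missing metric does not fire; absence is better handled separately.
        return value is not None and value > self.threshold


# Thresholds from the examples above, expressed as ratios.
RULES = [
    AlertRule("HighErrorRate", "error_rate", 0.05),
    AlertRule("HighCpu", "cpu_utilization", 0.80),
]


def evaluate(rules, metrics):
    """Return the names of all rules whose threshold is breached."""
    return [rule.name for rule in rules if rule.fires(metrics)]
```

Keeping rules as data rather than scattered `if` statements makes it easier to review thresholds in one place and to test them against recorded metric snapshots.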
Monitoring Methodologies
RED Method (for Services)
- Rate: Requests per second
- Errors: Error percentage
- Duration: Latency distribution
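To make the three RED signals concrete, the sketch below derives them from a batch of request records. The `Request` type and the `red_summary` helper are hypothetical, and the nearest-rank percentile is only one of several common percentile definitions. In production these numbers would normally come from PromQL queries over counters and histograms rather than from raw records.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    status: int        # HTTP status code
    duration_s: float  # time taken to serve the request, in seconds


def percentile(sorted_values, pct):
    """Nearest-rank percentile of an ascending list; pct is in (0, 100]."""
    rank = max(1, math.ceil(pct * len(sorted_values) / 100))
    return sorted_values[rank - 1]


def red_summary(requests, window_s):
    """Compute Rate, Errors, and Duration for requests seen in window_s seconds."""
    total = len(requests)
    if total == 0:
        return {"rate_rps": 0.0, "error_pct": 0.0,
                "p50_s": None, "p95_s": None, "p99_s": None}
    durations = sorted(r.duration_s for r in requests)
    errors = sum(1 for r in requests if r.status >= 500)
    return {
        "rate_rps": total / window_s,        # Rate
        "error_pct": 100 * errors / total,   # Errors
        "p50_s": percentile(durations, 50),  # Duration
        "p95_s": percentile(durations, 95),
        "p99_s": percentile(durations, 99),
    }
```

Reporting p95 and p99 alongside the median matters because a healthy median can hide a slow tail that a minority of users still experience.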
USE Method (for Resources)
- Utilization: Percentage of time the resource is busy
- Saturation: Work the resource cannot service yet (queue depth)
- Errors: Count of error events
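The USE signals can be summarized from periodic samples of a resource such as a disk or a worker pool. The sketch below assumes a hypothetical `ResourceSample` record. How the busy time and queue depth are actually collected depends on the resource and is not shown.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ResourceSample:
    busy_s: float      # time the resource spent doing work in the interval
    interval_s: float  # length of the sampling interval
    queue_depth: int   # work items waiting at sample time
    errors: int        # error events seen during the interval


def use_summary(samples):
    """Aggregate Utilization, Saturation, and Errors over a list of samples."""
    total_interval = sum(s.interval_s for s in samples)
    if total_interval == 0:
        return {"utilization_pct": 0.0, "max_queue_depth": 0, "errors": 0}
    return {
        "utilization_pct": 100 * sum(s.busy_s for s in samples) / total_interval,
        "max_queue_depth": max(s.queue_depth for s in samples),
        "errors": sum(s.errors for s in samples),
    }
```

Utilization and saturation answer different questions: a resource at 75% utilization with a growing queue is already a bottleneck, while one at 95% with an empty queue may be keeping up.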
Best Practices
- Cardinality: Avoid high-cardinality labels (e.g., user_id)
- Retention: 15 days (metrics), 7 days (logs)
- Sampling: 1% trace sampling for high-traffic services
- Dashboards: One dashboard per service
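The 1% sampling guideline is often implemented as a deterministic decision derived from the trace ID, so that every service handling the same request reaches the same verdict. The sketch below shows one such approach. The `should_sample` function is hypothetical and is not claimed to match the algorithm of any specific SDK.

```python
SAMPLE_SPACE = 2 ** 64  # number of values the low 64 bits of a trace ID can take


def should_sample(trace_id_hex: str, ratio: float = 0.01) -> bool:
    """Keep a trace when the low 64 bits of its ID fall below ratio * 2**64.

    Assumes trace IDs are uniformly random hex strings of at least 16 characters.
    """
    if not 0.0 <= ratio <= 1.0:
        raise ValueError("ratio must be between 0 and 1")
    value = int(trace_id_hex[-16:], 16)
    return value < int(ratio * SAMPLE_SPACE)
```

Because the decision depends only on the trace ID, a trace is kept or dropped as a whole, which avoids partial traces with missing spans.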
Validation Checklist
- Metrics: Prometheus scraping configured?
- Logs: Structured (JSON) format?
- Traces: OpenTelemetry instrumented?
- Alerts: Critical alerts defined?
- Dashboards: Grafana dashboards created?
Related Skills
- moai-essentials-perf: Performance profiling
- moai-devops-docker: Container monitoring
- moai-cloud-aws-advanced: CloudWatch
Additional Resources
- examples.md: Implementation code
- reference.md: Prometheus query language (PromQL)
Last Updated: 2025-11-20