name	production-monitoring-guardian
description	Prevents production blind spots by establishing comprehensive monitoring, alerting, and observability. Use when deploying to production, experiencing unexplained downtime, debugging performance issues, or setting up incident response. Covers error tracking (Sentry), uptime monitoring (UptimeRobot), performance monitoring (APM), log aggregation, alerting strategies, and dashboard design.

Production Monitoring Guardian

Mission: Ensure you know what's happening in production BEFORE your users do. Establish proactive monitoring that catches issues early, enables fast debugging, and prevents revenue loss.

Activation Triggers

Deploying to production for the first time
"How do I know if my app is down?"
Performance degradation reports
Unexplained errors or downtime
Capacity planning needs
Incident response preparation
SLA/uptime requirements
Customers report issues before you notice them

The Four Pillars of Observability

Metrics - What's happening? (CPU, memory, request rate, error rate)
Logs - Why is it happening? (Error messages, stack traces, context)
Traces - Where is it happening? (Request flow through system)
Alerts - When should I act? (Thresholds, escalation policies)

Scan Methodology

1. Initial Context Gathering

Ask if not provided:

"Is your application in production?"
"What's your current monitoring setup?" (none, partial, comprehensive)
"What's your uptime requirement?" (99%, 99.9%, 99.99%)
"How do you currently know if something breaks?"
"What's your incident response process?"
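
The uptime targets in the question above translate directly into a downtime budget. A minimal sketch of that arithmetic, assuming a 30-day month; `allowedDowntimeMinutes` is an illustrative helper name, not part of any library:

```typescript
// Convert an uptime target (percent) into allowed downtime for a period.
// Assumption: a 30-day month (43,200 minutes).
const MINUTES_PER_30_DAY_MONTH = 30 * 24 * 60

function allowedDowntimeMinutes(
  uptimePercent: number,
  periodMinutes: number = MINUTES_PER_30_DAY_MONTH
): number {
  if (uptimePercent < 0 || uptimePercent > 100) {
    throw new RangeError('uptimePercent must be between 0 and 100')
  }
  return periodMinutes * (1 - uptimePercent / 100)
}

for (const target of [99, 99.9, 99.99]) {
  const minutes = allowedDowntimeMinutes(target).toFixed(1)
  console.log(`${target}% uptime allows about ${minutes} minutes of downtime per month`)
}
```

Under this assumption, 99% allows about 432 minutes per month, 99.9% about 43, and 99.99% about 4.3. The last budget is shorter than a single 5-minute check interval, so a 99.99% target generally needs more frequent checks than a 5-minute monitor can provide.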

2. Monitoring Maturity Assessment

Level 0: Blind 🔴 CRITICAL RISK

No monitoring
Users report outages
No error tracking
No performance visibility
Risk: Revenue loss, customer churn, reputation damage

Level 1: Basic 🟡 HIGH RISK

Simple uptime checks (ping endpoints)
Basic server metrics (CPU, memory)
Manual log checking
Risk: Delayed incident response, no root cause analysis

Level 2: Operational 🟢 ACCEPTABLE

Error tracking (Sentry)
Uptime monitoring (UptimeRobot)
Log aggregation
Basic alerting
Risk: Minimal, can operate safely

Level 3: Advanced ⭐ OPTIMAL

Full observability (metrics + logs + traces)
Proactive alerting
Performance monitoring (APM)
Auto-scaling based on metrics
Dashboards for all stakeholders
Risk: Very low, enterprise-grade

PDFLab Current State: Level 0-1 (Production deployed, monitoring incomplete)

3. Critical Monitoring Components

🔴 CRITICAL: Error Tracking (Sentry)

Why Essential:

Catches errors before users report them
Provides stack traces and context
Tracks error frequency and impact
Enables quick debugging

Setup for PDFLab:

# Install Sentry
cd backend
npm install @sentry/node @sentry/profiling-node

cd ..
npm install @sentry/nextjs

Backend Integration (Express.js):

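// backend/src/server.ts
// Note: written against the @sentry/node v7 API (Sentry.Handlers, Sentry.Integrations);
// later major versions changed this setup.
import * as Sentry from '@sentry/node'
import { ProfilingIntegration } from '@sentry/profiling-node'
import express, { Request, Response, NextFunction } from 'express'

// Create the app first: the Express integration below needs a reference to it
const app = express()

// Initialize Sentry before registering any middleware or routes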
Sentry.init({
  dsn: process.env.SENTRY_DSN,
  environment: process.env.NODE_ENV || 'development',
  integrations: [
    new ProfilingIntegration(),
    new Sentry.Integrations.Http({ tracing: true }),
    new Sentry.Integrations.Express({ app })
  ],
  tracesSampleRate: 0.1, // 10% of transactions
  profilesSampleRate: 0.1,

  // Filter sensitive data
  beforeSend(event) {
    // Remove passwords from error data
    if (event.request) {
      delete event.request.cookies
      if (event.request.data?.password) {
        event.request.data.password = '[REDACTED]'
      }
    }
    return event
  }
})

// Sentry request handler (BEFORE routes)
app.use(Sentry.Handlers.requestHandler())
app.use(Sentry.Handlers.tracingHandler())

// ... your routes ...

// Sentry error handler (AFTER routes, BEFORE other error handlers)
app.use(Sentry.Handlers.errorHandler())

// Custom error handler (Express recognizes error handlers by their four arguments)
app.use((err: Error, req: Request, res: Response, next: NextFunction) => {
  // Sentry already captured error
  console.error('Unhandled error:', err)

  res.status(500).json({
    error: 'Internal server error',
    message: process.env.NODE_ENV === 'development' ? err.message : 'Something went wrong',
    // The v7 Sentry error handler stores the event ID on res.sentry
    sentry_id: (res as Response & { sentry?: string }).sentry // Error ID for support tickets
  })
})
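
The `beforeSend` hook above redacts only a top-level `password` field. A slightly broader sketch follows, assuming request bodies are plain JSON-like objects. `redactSensitive` and the `SENSITIVE_KEYS` list are hypothetical names chosen for illustration, and the key list should be adapted to the fields your API actually accepts:

```typescript
// Keys treated as sensitive (an assumed list; extend it for your payloads).
const SENSITIVE_KEYS = new Set(['password', 'token', 'authorization', 'secret'])

// Return a deep copy of `value` with sensitive fields replaced by '[REDACTED]'.
function redactSensitive(value: unknown): unknown {
  if (Array.isArray(value)) {
    return value.map(redactSensitive)
  }
  if (value !== null && typeof value === 'object') {
    const result: Record<string, unknown> = {}
    for (const [key, inner] of Object.entries(value as Record<string, unknown>)) {
      // Match keys case-insensitively so "Token" and "token" are both caught
      result[key] = SENSITIVE_KEYS.has(key.toLowerCase())
        ? '[REDACTED]'
        : redactSensitive(inner)
    }
    return result
  }
  return value // primitives pass through unchanged
}

const sample = { email: 'a@example.com', password: 'hunter2', nested: { Token: 'abc' } }
console.log(JSON.stringify(redactSensitive(sample)))
```

Inside `beforeSend`, this could be applied as `event.request.data = redactSensitive(event.request.data)`. Because it returns a copy, the original request body is left untouched.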

Frontend Integration (Next.js):

// sentry.client.config.ts
import * as Sentry from '@sentry/nextjs'

Sentry.init({
  dsn: process.env.NEXT_PUBLIC_SENTRY_DSN,
  environment: process.env.NODE_ENV,
  tracesSampleRate: 0.1,

  // Filter PII
  beforeSend(event, hint) {
    // Don't send auth tokens
    if (event.request?.headers) {
      delete event.request.headers.authorization
    }
    return event
  }
})

// Usage in components:
try {
  await pdflabAPI.uploadBatch(files, operation, options)
} catch (error) {
  Sentry.captureException(error, {
    tags: {
      feature: 'batch_upload',
      operation: operation
    },
    extra: {
      file_count: files.length,
      total_size: files.reduce((sum, f) => sum + f.size, 0)
    }
  })
  throw error
}

Environment Variables:

# .env (backend)
SENTRY_DSN=https://xxx@xxx.ingest.sentry.io/xxx

# .env.local (frontend)
NEXT_PUBLIC_SENTRY_DSN=https://xxx@xxx.ingest.sentry.io/xxx

Verification:

// Test Sentry integration
app.get('/debug-sentry', (req, res) => {
  throw new Error('Test Sentry error tracking')
})

// Visit http://localhost:3006/debug-sentry
// Check Sentry dashboard for error

🔴 CRITICAL: Uptime Monitoring (UptimeRobot)

Why Essential:

Detects when site is down
Monitors from external location (not your server)
Alerts via SMS/email/Slack
Tracks uptime percentage for SLAs

Setup (5 minutes, FREE):

Create UptimeRobot Account
- Go to https://uptimerobot.com
- Free tier: 50 monitors, 5-minute intervals

Add HTTP Monitor

Monitor Type: HTTP(s)
Friendly Name: PDFLab Production
URL: https://pdflab.pro
Monitoring Interval: 5 minutes (free tier)

Advanced:
- Check for keyword: "PDFLab" (ensures page loads correctly)
- Timeout: 30 seconds

Add Health Check Endpoint Monitor

Monitor Type: HTTP(s)
Friendly Name: PDFLab API Health
URL: https://pdflab.pro/health
Monitoring Interval: 5 minutes
Expected Status: 200

Keyword check (for a detailed check):
Response body contains: "status":"healthy"

Configure Alerts

Email: your@email.com
SMS: +1-xxx-xxx-xxxx (optional, paid)
Slack: webhook URL (recommended)

Alert When:
- Down (site unreachable)
- Up (site back online)
- Slow response (>5s)

Test Alerts

# Stop backend to trigger alert
docker stop pdflab-backend

# Wait 5-10 minutes for UptimeRobot to detect
# Verify you receive alert

# Restart backend
docker start pdflab-backend

# Verify "Up" alert

Advanced Health Checks:

// backend/src/server.ts
app.get('/health', async (req, res) => {
  const health = {
    status: 'healthy',
    timestamp: new Date().toISOString(),
    uptime: process.uptime(),
    checks: {
      database: 'unknown',
      redis: 'unknown',
      cloudconvert: 'unknown' // not probed below; add a check or remove this field
    }
  }

  try {
    // Check database
    await sequelize.authenticate()
    health.checks.database = 'healthy'
  } catch {
    health.checks.database = 'unhealthy'
    health.status = 'degraded'
  }

  try {
    // Check Redis
    await redisClient.ping()
    health.checks.redis = 'healthy'
  } catch {
    health.checks.redis = 'unhealthy'
    health.status = 'degraded'
  }

  const statusCode = health.status === 'healthy' ? 200 : 503
  res.status(statusCode).json(health)
})
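
The handler above treats every failed check the same way: the status becomes `degraded` and the response is a 503. One alternative is to separate dependencies the API cannot run without from optional ones. The sketch below is a design variation, not the handler's current behavior; `summarizeHealth` and `CRITICAL_CHECKS` are hypothetical names, and which checks count as critical is an assumption:

```typescript
type CheckResult = 'healthy' | 'unhealthy' | 'unknown'
type OverallStatus = 'healthy' | 'degraded' | 'unhealthy'

// Assumption for this sketch: the API cannot serve requests without these.
const CRITICAL_CHECKS = new Set(['database', 'redis'])

function summarizeHealth(checks: Record<string, CheckResult>): {
  status: OverallStatus
  httpStatus: number
} {
  let status: OverallStatus = 'healthy'
  for (const [name, result] of Object.entries(checks)) {
    if (result === 'healthy') continue
    if (CRITICAL_CHECKS.has(name)) {
      // A failed or unprobed critical dependency means the service is effectively down
      return { status: 'unhealthy', httpStatus: 503 }
    }
    status = 'degraded' // an optional dependency failed or was never probed
  }
  // Degraded still answers 200 so a status-code monitor does not report a full outage
  return { status, httpStatus: 200 }
}

console.log(summarizeHealth({ database: 'healthy', redis: 'healthy', cloudconvert: 'unknown' }))
```

With this split, a status-code monitor fires only when a critical dependency fails, while a keyword monitor looking for "status":"healthy" still notices the degraded state.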

🟡 HIGH: Log Aggregation

Why Important:

Centralized log viewing
Search across all servers
Correlate logs with errors
Compliance/audit requirements

Simple Pattern (File-based):

// backend/src/utils/logger.ts
import winston from 'winston'

const logger = winston.createLogger({
  level: process.env.LOG_LEVEL || 'info',
  format: winston.format.combine(
    winston.format.timestamp(),
    winston.format.errors({ stack: true }),
    winston.format.json()
  ),
  transports: [
    // Console (for Docker logs)
    new winston.transports.Console({
      format: winston.format.simple()
    }),

    // File (for persistence)
    new winston.transports.File({
      filename: 'logs/error.log',
      level: 'error'
    }),
    new winston.transports.File({
      filename: 'logs/combined.log'
    })
  ]
})

export default logger

// Usage:
logger.info('Batch processing started', {
  batch_id: batchId,
  user_id: userId,
  file_count: files.length
})

logger.error('Conversion failed', {
  batch_id: batchId,
  error: error.message,
  stack: error.stack
})

Structured Logging Pattern:

// Always log with context
logger.info('Event description', {
  // Standard fields
  user_id: req.user?.id,
  request_id: req.id,
  ip_address: req.ip,

  // Event-specific fields
  batch_id: batchId,
  operation: 'batch_upload',
  file_count: 5,

  // Metrics
  duration_ms: Date.now() - startTime,
  success: true
})

Log Levels:

error - Something failed, needs attention
warn - Something suspicious, might need attention
info - Normal operation, important events
http - HTTP request/response (verbose)
debug - Detailed debugging info (development only)
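
The configured level acts as a cutoff: winston emits a message only if its level is at least as severe as the logger's `level`. A minimal sketch of that comparison, using winston's default npm level numbering; `isEmitted` is an illustrative helper, not a winston function:

```typescript
// winston's default "npm" levels: lower number = higher priority.
const NPM_LEVELS = {
  error: 0,
  warn: 1,
  info: 2,
  http: 3,
  verbose: 4,
  debug: 5,
  silly: 6
} as const

type Level = keyof typeof NPM_LEVELS

// A message is emitted when its level number does not exceed the configured one.
function isEmitted(configured: Level, message: Level): boolean {
  return NPM_LEVELS[message] <= NPM_LEVELS[configured]
}

console.log(isEmitted('info', 'warn')) // warn is more severe than info
console.log(isEmitted('info', 'http')) // http is less severe than info
```

With the default `LOG_LEVEL` of `info` used in the logger above, `http` and `debug` messages are dropped unless the level is lowered.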

🟡 HIGH: Performance Monitoring (APM)

Why Important:

Identify slow endpoints
Database query optimization
Memory leak detection
Capacity planning

Simple APM (Built-in Node.js):

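// backend/src/middleware/performance.middleware.ts
import { Request, Response, NextFunction } from 'express'
import logger from '../utils/logger'
import { metrics } from '../utils/metrics'

export const performanceMiddleware = (req: Request, res: Response, next: NextFunction) => {
  const start = Date.now()

  // Capture response finish
  res.on('finish', () => {
    const duration = Date.now() - start

    // Log slow requests (threshold: 1 second)
    if (duration > 1000) {
      logger.warn('Slow request', {
        method: req.method,
        url: req.originalUrl,
        duration_ms: duration,
        status: res.statusCode
      })
    }

    // Track metrics (assumes the metrics helper provides a histogram() method;
    // tag values are strings, and unmatched routes have no req.route)
    metrics.histogram('http.request.duration', duration, {
      method: req.method,
      route: req.route?.path ?? 'unmatched',
      status: String(res.statusCode)
    })
  })

  next()
}

// In server.ts: register before your routes so every request is timed
app.use(performanceMiddleware)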
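
The alert thresholds and dashboards in this guide refer to P95 latency. A minimal sketch of how a percentile can be computed from recorded durations, using the nearest-rank method; `percentile` is an illustrative helper and the sample values are made up:

```typescript
// Nearest-rank percentile: the smallest sample such that at least p% of
// all samples are less than or equal to it.
function percentile(samples: number[], p: number): number {
  if (samples.length === 0) throw new Error('no samples')
  if (p <= 0 || p > 100) throw new RangeError('p must be in (0, 100]')
  const sorted = [...samples].sort((a, b) => a - b) // numeric sort on a copy
  const rank = Math.ceil((p / 100) * sorted.length)
  return sorted[rank - 1]
}

// Hypothetical request durations in milliseconds, with one slow outlier
const durationsMs = [120, 95, 110, 4800, 130, 105, 98, 140, 115, 102]
console.log('P50:', percentile(durationsMs, 50))
console.log('P95:', percentile(durationsMs, 95))
```

For these samples the median is 110 ms while P95 is 4800 ms, which is why tail percentiles are a better alerting signal than averages. Keeping every raw sample in memory does not scale, so production systems usually rely on bucketed histograms from a metrics library instead.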

Database Query Monitoring:

// backend/src/config/database.ts
export const sequelize = new Sequelize({
  // ... other config ...

  logging: (sql, timing) => {
    if (timing && timing > 100) {
      logger.warn('Slow query', {
        sql,
        duration_ms: timing
      })
    }
  },

  benchmark: true // Enable timing
})

🟠 MEDIUM: Application Metrics

Key Metrics to Track:

// backend/src/utils/metrics.ts
class Metrics {
  private metrics = new Map<string, number>()

  increment(name: string, tags: Record<string, string> = {}) {
    const key = `${name}:${JSON.stringify(tags)}`
    this.metrics.set(key, (this.metrics.get(key) || 0) + 1)
  }

  gauge(name: string, value: number, tags: Record<string, string> = {}) {
    const key = `${name}:${JSON.stringify(tags)}`
    this.metrics.set(key, value)
  }

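  // Record an observation (e.g. a request duration) as a running count and sum.
  // This is enough for averages; percentiles need a real metrics library.
  histogram(name: string, value: number, tags: Record<string, string> = {}) {
    const key = `${name}:${JSON.stringify(tags)}`
    this.metrics.set(`${key}:count`, (this.metrics.get(`${key}:count`) || 0) + 1)
    this.metrics.set(`${key}:sum`, (this.metrics.get(`${key}:sum`) || 0) + value)
  }

  // Export for monitoring systems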
  getAll() {
    return Object.fromEntries(this.metrics)
  }
}

export const metrics = new Metrics()

// Usage throughout application:

// Conversion job started
metrics.increment('conversion.job.started', {
  type: 'pdf_to_pptx',
  user_plan: user.plan
})

// Conversion job completed
metrics.increment('conversion.job.completed', {
  type: 'pdf_to_pptx'
})

// Conversion job failed
metrics.increment('conversion.job.failed', {
  type: 'pdf_to_pptx',
  error_type: 'cloudconvert_timeout'
})

// Active users
metrics.gauge('users.active', await getActiveUserCount())

// Queue depth
metrics.gauge('queue.depth', await conversionQueue.count())

Metrics Endpoint:

// Expose metrics as JSON for dashboards or a custom collector.
// Note: Prometheus cannot scrape JSON; it expects its text exposition format.
app.get('/metrics', (req, res) => {
  res.json({
    timestamp: new Date().toISOString(),
    metrics: metrics.getAll(),
    system: {
      memory: process.memoryUsage(),
      uptime: process.uptime(),
      cpu: process.cpuUsage()
    }
  })
})
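
If the endpoint is meant to be scraped by Prometheus, the response has to use the Prometheus text exposition format instead of JSON. The sketch below renders counters in that format by hand, purely to show its shape. `CounterSample` and `toPrometheusText` are hypothetical names, label names are assumed to already be valid identifiers, and in practice a maintained client library such as prom-client is the safer choice:

```typescript
interface CounterSample {
  name: string
  labels: Record<string, string>
  value: number
}

// Prometheus metric names may not contain dots, so map invalid characters to "_".
function sanitizeName(name: string): string {
  return name.replace(/[^a-zA-Z0-9_:]/g, '_')
}

// Escape backslash, double quote, and newline inside label values.
function escapeLabelValue(value: string): string {
  return value.replace(/\\/g, '\\\\').replace(/"/g, '\\"').replace(/\n/g, '\\n')
}

function toPrometheusText(samples: CounterSample[]): string {
  // Group lines by metric name: all samples of one metric must appear together
  const byName = new Map<string, string[]>()
  for (const sample of samples) {
    const name = sanitizeName(sample.name)
    const labels = Object.entries(sample.labels)
      .map(([key, value]) => `${key}="${escapeLabelValue(value)}"`)
      .join(',')
    const line = labels ? `${name}{${labels}} ${sample.value}` : `${name} ${sample.value}`
    const group = byName.get(name) ?? []
    group.push(line)
    byName.set(name, group)
  }

  const out: string[] = []
  byName.forEach((group, name) => {
    out.push(`# TYPE ${name} counter`, ...group)
  })
  return out.join('\n') + '\n'
}

console.log(toPrometheusText([
  { name: 'conversion.job.failed', labels: { type: 'pdf_to_pptx' }, value: 3 },
  { name: 'conversion.job.completed', labels: { type: 'pdf_to_pptx' }, value: 42 }
]))
```

The result would be served as plain text (Prometheus uses the content type `text/plain; version=0.0.4`) instead of through `res.json`.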

4. Alerting Strategy

Alert Fatigue Prevention:

Don't alert on everything
Use severity levels (P1=critical, P2=high, P3=medium, P4=low)
Escalate if not acknowledged
Group related alerts

Critical Alerts (Wake up at 3am):

Site down (5+ minute outage)
Database unreachable
Error rate >10%
Payment processing failed
Disk usage >90%

High Priority Alerts (Check within 1 hour):

Error rate >5%
Slow response times (P95 >5s)
Queue backing up (>100 jobs pending)
CloudConvert API errors

Medium Priority Alerts (Check during business hours):

Warning logs increasing
Memory usage trending up
Unusual traffic patterns
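
The error-rate thresholds above can be expressed as a small classification function. This is a sketch under stated assumptions: `classifyErrorRate` is a hypothetical helper, and the `minRequests` floor (default 50) is an added assumption so that a handful of failures during quiet periods does not page anyone:

```typescript
type Severity = 'P1' | 'P2' | 'none'

// Thresholds from the lists above: an error rate over 10% is critical (P1),
// over 5% is high priority (P2).
function classifyErrorRate(
  errors: number,
  requests: number,
  minRequests: number = 50 // assumed floor, tune to your traffic
): Severity {
  if (requests < minRequests) return 'none' // too little traffic to judge
  const rate = errors / requests
  if (rate > 0.10) return 'P1'
  if (rate > 0.05) return 'P2'
  return 'none'
}

console.log(classifyErrorRate(12, 100)) // 12% error rate
console.log(classifyErrorRate(6, 100))  // 6% error rate
console.log(classifyErrorRate(1, 2))    // 50%, but only two requests
```

Evaluating the rate over a fixed window (for example, the last five minutes) keeps one short burst from dominating the result.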

Alert Channels:

Critical (P1):
  → SMS to on-call engineer
  → Slack #incidents channel
  → PagerDuty escalation

High (P2):
  → Slack #alerts channel
  → Email to team

Medium (P3):
  → Slack #monitoring channel
  → Daily digest email
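
The routing table above, together with the "escalate if not acknowledged" rule, can be sketched as data plus one predicate. The channel identifiers, the `Alert` shape, and the 15-minute escalation delay are assumptions made for illustration:

```typescript
type Priority = 'P1' | 'P2' | 'P3'

// Channel lists mirror the routing table above (identifiers are illustrative).
const ROUTES: Record<Priority, string[]> = {
  P1: ['sms:on-call', 'slack:#incidents', 'pagerduty'],
  P2: ['slack:#alerts', 'email:team'],
  P3: ['slack:#monitoring', 'email:daily-digest']
}

interface Alert {
  priority: Priority
  firedAt: number          // epoch milliseconds
  acknowledgedAt?: number  // unset until someone acknowledges
}

// Assumption: unacknowledged P1 alerts escalate after 15 minutes.
const ESCALATE_AFTER_MS = 15 * 60 * 1000

function shouldEscalate(alert: Alert, now: number): boolean {
  return (
    alert.priority === 'P1' &&
    alert.acknowledgedAt === undefined &&
    now - alert.firedAt >= ESCALATE_AFTER_MS
  )
}

const fired = Date.now() - 20 * 60 * 1000 // fired 20 minutes ago
console.log(ROUTES.P1, shouldEscalate({ priority: 'P1', firedAt: fired }, Date.now()))
```

Keeping the routes in one table makes it easy to review who gets woken up for what, which is the main defense against alert fatigue.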

5. Dashboard Design

Essential Dashboards:

1. Service Health Dashboard (Operations)

┌─────────────────────────────────────┐
│  PDFLab Production Status           │
├─────────────────────────────────────┤
│  Uptime (24h):      99.95% ✅       │
│  Error Rate (1h):   0.2%   ✅       │
│  Avg Response:      245ms  ✅       │
│  Active Users:      127             │
│  Conversion Queue:  3 jobs          │
│                                     │
│  Component Status:                  │
│  • Frontend:        ✅ Healthy      │
│  • Backend API:     ✅ Healthy      │
│  • Database:        ✅ Healthy      │
│  • Redis:           ✅ Healthy      │
│  • CloudConvert:    ⚠️  Slow (3s)   │
└─────────────────────────────────────┘

2. Business Metrics Dashboard (Executives)

┌─────────────────────────────────────┐
│  PDFLab Business Metrics            │
├─────────────────────────────────────┤
│  Today:                             │
│  • Conversions:     342  (+12%)     │
│  • New Users:       23   (+5%)      │
│  • Revenue:         $156 (+8%)      │
│                                     │
│  This Month:                        │
│  • MRR:             $1,247          │
│  • Active Users:    892             │
│  • Churn Rate:      3.2%            │
│                                     │
│  Popular Features:                  │
│  1. PDF → PPTX      45%             │
│  2. PDF → DOCX      32%             │
│  3. PDF Compress    18%             │
└─────────────────────────────────────┘

3. Performance Dashboard (Developers)

┌─────────────────────────────────────┐
│  API Performance                    │
├─────────────────────────────────────┤
│  Request Duration (P95):            │
│  /api/upload        423ms  ⚠️       │
│  /api/status        87ms   ✅       │
│  /api/download      1.2s   ❌       │
│                                     │
│  Database Queries (P95):            │
│  Users.findByPk     12ms   ✅       │
│  Jobs.findAll       156ms  ⚠️       │
│                                     │
│  Memory Usage:                      │
│  Heap Used:         324MB  ✅       │
│  Heap Total:        512MB           │
│  RSS:               478MB  ✅       │
└─────────────────────────────────────┘

6. Incident Response Playbook

When Alert Fires:

1. Acknowledge (1 minute)
   - Click "Acknowledge" in PagerDuty/Slack
   - Prevents escalation
2. Assess (5 minutes)
   - Check Sentry for recent errors
   - Check UptimeRobot for uptime status
   - Check server metrics (CPU, memory, disk)
   - Check logs for patterns
3. Mitigate (15 minutes)
   - If site down: restart services
   - If database down: check connections, restart MySQL
   - If queue backed up: add workers or pause intake
   - If CloudConvert down: show maintenance message
4. Communicate (ongoing)
   - Update status page
   - Notify affected users
   - Internal Slack updates
5. Resolve (variable)
   - Fix root cause
   - Verify fix in production
   - Monitor for recurrence
6. Post-Mortem (24 hours later)
   - Document what happened
   - Identify root cause
   - Create prevention tasks
   - Share learnings

Production Monitoring Checklist

Essential (Deploy with these)

Sentry error tracking configured
UptimeRobot monitoring pdflab.pro
Health check endpoint (/health)
Structured logging with Winston
Alert for site down (email/SMS)

Important (Add within first month)

Performance monitoring (slow requests logged)
Database query monitoring
Business metrics tracking
Dashboard for operations team
Incident response playbook documented

Advanced (Add as you scale)

Distributed tracing (Sentry Traces)
Custom metrics (conversion rates, revenue)
Auto-scaling based on metrics
Anomaly detection (ML-based alerting)
Real user monitoring (RUM)

Quick Setup (30 minutes)

# 1. Sentry (10 min)
(cd backend && npm install @sentry/node @sentry/profiling-node)
npm install @sentry/nextjs
# Add initialization code (see above)
# Test with /debug-sentry endpoint

# 2. UptimeRobot (5 min)
# Sign up at https://uptimerobot.com
# Add pdflab.pro monitor
# Configure email alerts

# 3. Health Check (5 min)
# Add /health endpoint (see above)
# Test: curl https://pdflab.pro/health

# 4. Logging (10 min)
(cd backend && npm install winston)
# Add structured logging (see above)
# Verify logs: tail -f logs/combined.log

Key Principles

Monitor user-facing metrics - Uptime, error rate, performance
Alert on symptoms, not causes - "Site down" not "CPU high"
Make alerts actionable - Include fix suggestions
Prevent alert fatigue - Only critical alerts wake people up
Document incident response - Playbooks save time
Review metrics regularly - Weekly ops review
Continuous improvement - Add monitoring as you learn

Monitoring ROI

Cost:

Sentry: $0-26/month (free tier sufficient for PDFLab)
UptimeRobot: $0-7/month (free tier sufficient)
Time: 2-4 hours initial setup
Total: ~$0-33/month for the tools above

Benefits:

Detect outages in 5 minutes (vs 5 hours from user reports)
Reduce MTTR from hours to minutes
Prevent revenue loss from undetected issues
Build customer trust with transparent status
ROI: 10-100x (one prevented outage pays for years of monitoring)

When to Escalate

Distributed tracing needs (microservices)
Custom monitoring dashboards
SLA/SLO definition
On-call rotation setup
Runbook automation
Chaos engineering
