| name | performance-monitoring |
| description | Set up application monitoring, logging, error tracking, and performance metrics tracking. Use when implementing monitoring or debugging production issues. |
| allowed-tools | Read, Write, Edit, Bash, Glob |
You implement performance monitoring and logging for the QA Team Portal.
Requirements from PROJECT_PLAN.md
- Application logging and error tracking
- Performance metrics monitoring
- API response time tracking
- Page load time monitoring
- Error logging and alerting
- Success metrics tracking (usage, downloads, page views)
Implementation
1. Backend Logging Setup
Location: backend/app/core/logging_config.py
import logging
import sys
from pathlib import Path
from logging.handlers import RotatingFileHandler, TimedRotatingFileHandler
# Create logs directory
LOGS_DIR = Path(__file__).parent.parent.parent / "logs"
LOGS_DIR.mkdir(exist_ok=True)
def setup_logging():
"""Configure application logging."""
# Create formatters
detailed_formatter = logging.Formatter(
fmt='%(asctime)s - %(name)s - %(levelname)s - %(funcName)s:%(lineno)d - %(message)s',
datefmt='%Y-%m-%d %H:%M:%S'
)
simple_formatter = logging.Formatter(
fmt='%(asctime)s - %(levelname)s - %(message)s',
datefmt='%Y-%m-%d %H:%M:%S'
)
# Console handler (stdout)
console_handler = logging.StreamHandler(sys.stdout)
console_handler.setLevel(logging.INFO)
console_handler.setFormatter(simple_formatter)
# File handler for all logs (rotating by size)
file_handler = RotatingFileHandler(
filename=LOGS_DIR / "app.log",
maxBytes=10 * 1024 * 1024, # 10MB
backupCount=5,
encoding='utf-8'
)
file_handler.setLevel(logging.INFO)
file_handler.setFormatter(detailed_formatter)
# Error log handler (rotating daily)
error_handler = TimedRotatingFileHandler(
filename=LOGS_DIR / "error.log",
when='midnight',
interval=1,
backupCount=30, # Keep 30 days
encoding='utf-8'
)
error_handler.setLevel(logging.ERROR)
error_handler.setFormatter(detailed_formatter)
# Access log handler
access_handler = TimedRotatingFileHandler(
filename=LOGS_DIR / "access.log",
when='midnight',
interval=1,
backupCount=30,
encoding='utf-8'
)
access_handler.setLevel(logging.INFO)
access_handler.setFormatter(simple_formatter)
# Root logger
root_logger = logging.getLogger()
root_logger.setLevel(logging.INFO)
root_logger.addHandler(console_handler)
root_logger.addHandler(file_handler)
root_logger.addHandler(error_handler)
# Access logger
access_logger = logging.getLogger("access")
access_logger.setLevel(logging.INFO)
access_logger.addHandler(access_handler)
access_logger.propagate = False
# Suppress noisy loggers
logging.getLogger("uvicorn.access").setLevel(logging.WARNING)
logging.getLogger("watchfiles.main").setLevel(logging.WARNING)
return root_logger
logger = setup_logging()
Initialize in main app:
# backend/app/main.py
from contextlib import asynccontextmanager
from fastapi import FastAPI
from app.core.config import settings
from app.core.logging_config import logger

# FastAPI's lifespan context replaces the deprecated @app.on_event hooks
@asynccontextmanager
async def lifespan(app: FastAPI):
    logger.info("Application starting up...")
    logger.info(f"Environment: {settings.ENVIRONMENT}")
    yield
    logger.info("Application shutting down...")

app = FastAPI(lifespan=lifespan)
2. Performance Middleware
Location: backend/app/middleware/monitoring_middleware.py
import time
import logging
from fastapi import Request
from starlette.middleware.base import BaseHTTPMiddleware
from typing import Callable
logger = logging.getLogger(__name__)
access_logger = logging.getLogger("access")
class PerformanceMonitoringMiddleware(BaseHTTPMiddleware):
"""Monitor API performance and log slow requests."""
async def dispatch(self, request: Request, call_next: Callable):
start_time = time.time()
# Get request details
method = request.method
path = request.url.path
client_ip = request.client.host if request.client else "unknown"
try:
response = await call_next(request)
# Calculate response time
process_time = time.time() - start_time
response.headers["X-Process-Time"] = f"{process_time:.3f}"
# Log access
access_logger.info(
f"{client_ip} - {method} {path} - "
f"Status: {response.status_code} - "
f"Time: {process_time:.3f}s"
)
            # Flag slow requests: > 1s logs an error, > 200ms a warning (not both)
            if process_time > 1.0:
                logger.error(
                    f"VERY SLOW REQUEST: {method} {path} took {process_time:.3f}s - "
                    f"Client: {client_ip}"
                )
            elif process_time > 0.2:
                logger.warning(
                    f"SLOW REQUEST: {method} {path} took {process_time:.3f}s - "
                    f"Client: {client_ip}"
                )
return response
except Exception as e:
process_time = time.time() - start_time
# Log exception
logger.error(
f"REQUEST FAILED: {method} {path} - "
f"Error: {str(e)} - "
f"Time: {process_time:.3f}s - "
f"Client: {client_ip}",
exc_info=True
)
raise
# Register in backend/app/main.py, after creating the FastAPI instance
from app.middleware.monitoring_middleware import PerformanceMonitoringMiddleware
app.add_middleware(PerformanceMonitoringMiddleware)
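To verify the middleware is active, check a response for the X-Process-Time header (assuming the API runs locally on port 8000 and the health router is mounted under /api/v1):
curl -sD - -o /dev/null http://localhost:8000/api/v1/health | grep -i x-process-time
# x-process-time: 0.004   (illustrative value)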
3. Error Tracking with Sentry (Optional)
cd backend
uv pip install "sentry-sdk[fastapi]"
# backend/app/core/config.py
from typing import Optional
from pydantic_settings import BaseSettings  # Pydantic v1: `from pydantic import BaseSettings`

class Settings(BaseSettings):
    SENTRY_DSN: Optional[str] = None
    ENVIRONMENT: str = "development"
# backend/app/main.py
import sentry_sdk
from sentry_sdk.integrations.fastapi import FastApiIntegration
from app.core.config import settings
if settings.SENTRY_DSN:
sentry_sdk.init(
dsn=settings.SENTRY_DSN,
integrations=[FastApiIntegration()],
environment=settings.ENVIRONMENT,
traces_sample_rate=0.1, # 10% of transactions
profiles_sample_rate=0.1, # 10% of profiles
send_default_pii=False # Don't send personally identifiable info
)
logger.info("Sentry error tracking initialized")
4. Metrics Tracking
Location: backend/app/services/metrics_service.py
from typing import Dict, List, Optional
from datetime import datetime
from sqlalchemy.orm import Session
from app.models.metric import Metric
class MetricsService:
"""Track and retrieve application metrics."""
    @staticmethod
    async def record_metric(
        db: Session,
        metric_name: str,
        value: float,
        tags: Optional[Dict[str, str]] = None
    ):
"""Record a metric value."""
metric = Metric(
name=metric_name,
value=value,
tags=tags or {},
timestamp=datetime.utcnow()
)
db.add(metric)
db.commit()
@staticmethod
async def record_page_view(
db: Session,
page_path: str,
        user_id: Optional[str] = None
):
"""Record a page view."""
await MetricsService.record_metric(
db,
metric_name="page_view",
value=1,
tags={
"page": page_path,
"user_id": user_id or "anonymous"
}
)
@staticmethod
async def record_tool_download(
db: Session,
tool_id: str,
tool_name: str,
        user_id: Optional[str] = None
):
"""Record a tool download."""
await MetricsService.record_metric(
db,
metric_name="tool_download",
value=1,
tags={
"tool_id": tool_id,
"tool_name": tool_name,
"user_id": user_id or "anonymous"
}
)
@staticmethod
async def get_page_views(
db: Session,
start_date: datetime,
end_date: datetime
) -> List[Dict]:
"""Get page views grouped by page."""
metrics = db.query(Metric).filter(
Metric.name == "page_view",
Metric.timestamp >= start_date,
Metric.timestamp <= end_date
).all()
# Group by page
page_views = {}
for metric in metrics:
page = metric.tags.get("page", "unknown")
page_views[page] = page_views.get(page, 0) + 1
return [
{"page": page, "views": count}
for page, count in page_views.items()
]
@staticmethod
async def get_api_performance(
db: Session,
start_date: datetime,
end_date: datetime
) -> Dict:
"""Get API performance metrics."""
metrics = db.query(Metric).filter(
Metric.name == "api_response_time",
Metric.timestamp >= start_date,
Metric.timestamp <= end_date
).all()
if not metrics:
return {"average": 0, "min": 0, "max": 0, "count": 0}
values = [m.value for m in metrics]
return {
"average": sum(values) / len(values),
"min": min(values),
"max": max(values),
"count": len(values)
}
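Recording happens wherever the event occurs. For example, a tool download endpoint could record the metric before returning the file; a sketch with illustrative route and helper names (the path and get_tool_or_404 are hypothetical, not from the project):
# backend/app/api/v1/endpoints/tools.py (illustrative; router/Depends/get_db imports as elsewhere)
from fastapi.responses import FileResponse

@router.get("/tools/{tool_id}/download")
async def download_tool(tool_id: str, db: Session = Depends(get_db)):
    tool = get_tool_or_404(db, tool_id)  # hypothetical lookup helper
    await MetricsService.record_tool_download(
        db, tool_id=str(tool.id), tool_name=tool.name
    )
    return FileResponse(tool.file_path, filename=tool.name)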
Metric Model:
# backend/app/models/metric.py
from sqlalchemy import Column, String, Float, JSON, DateTime
from sqlalchemy.dialects.postgresql import UUID
import uuid
from datetime import datetime
from app.db.base_class import Base
class Metric(Base):
__tablename__ = "metrics"
id = Column(UUID(as_uuid=True), primary_key=True, default=uuid.uuid4)
name = Column(String(100), nullable=False, index=True)
value = Column(Float, nullable=False)
tags = Column(JSON, nullable=False, default=dict)
timestamp = Column(DateTime, default=datetime.utcnow, nullable=False, index=True)
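The overview queries always filter on name plus a timestamp range, so a composite index is worth considering; a sketch (in practice, add it through an Alembic migration):
# Composite index matching the name + time-range queries in MetricsService
from sqlalchemy import Index

Index("ix_metrics_name_timestamp", Metric.name, Metric.timestamp)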
5. Health Check Endpoints
Location: backend/app/api/v1/endpoints/health.py (requires psutil: uv pip install psutil)
from fastapi import APIRouter, Depends, HTTPException
from sqlalchemy.orm import Session
from sqlalchemy import text
from app.api.deps import get_db
from app.core.config import settings
import psutil
import time
router = APIRouter()
@router.get("/health")
async def health_check():
"""Basic health check."""
return {
"status": "healthy",
"service": "qa-portal-api",
"version": "1.0.0",
"timestamp": time.time()
}
@router.get("/health/db")
def database_health(db: Session = Depends(get_db)):
    """Database connectivity check (plain `def` so the blocking Session call runs in FastAPI's threadpool)."""
try:
# Execute simple query
result = db.execute(text("SELECT 1"))
result.scalar()
return {
"status": "healthy",
"database": "connected",
"timestamp": time.time()
}
except Exception as e:
raise HTTPException(
status_code=503,
detail={
"status": "unhealthy",
"database": "disconnected",
"error": str(e),
"timestamp": time.time()
}
)
@router.get("/health/system")
def system_health():
    """System resources check (plain `def`: the blocking 1s CPU sample runs in FastAPI's threadpool)."""
try:
cpu_percent = psutil.cpu_percent(interval=1)
memory = psutil.virtual_memory()
disk = psutil.disk_usage('/')
return {
"status": "healthy",
"cpu_percent": cpu_percent,
"memory": {
"total": memory.total,
"available": memory.available,
"percent": memory.percent
},
"disk": {
"total": disk.total,
"free": disk.free,
"percent": disk.percent
},
"timestamp": time.time()
}
except Exception as e:
raise HTTPException(
status_code=503,
detail={
"status": "unhealthy",
"error": str(e),
"timestamp": time.time()
}
)
@router.get("/health/ready")
def readiness_check(db: Session = Depends(get_db)):
    """Readiness check for the load balancer."""
try:
# Check database
db.execute(text("SELECT 1"))
# Check if essential services are running
# Add more checks as needed
return {"status": "ready"}
    except Exception:
        raise HTTPException(status_code=503, detail={"status": "not ready"})
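Assuming the router is mounted under /api/v1, the endpoints can be smoke-tested from the command line:
curl -s http://localhost:8000/api/v1/health
curl -s http://localhost:8000/api/v1/health/db
curl -s http://localhost:8000/api/v1/health/ready   # returns 503 until the database is reachable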
6. Frontend Performance Monitoring
Location: frontend/src/utils/analytics.ts
// frontend/src/utils/analytics.ts
import { useEffect } from 'react'
import { useLocation } from 'react-router-dom' // assumes react-router-dom provides routing

interface PageViewEvent {
page: string
timestamp: number
loadTime?: number
}
interface MetricEvent {
name: string
value: number
tags?: Record<string, string>
}
class Analytics {
private static instance: Analytics
private apiUrl: string
private constructor() {
this.apiUrl = import.meta.env.VITE_API_URL || 'http://localhost:8000'
}
static getInstance(): Analytics {
if (!Analytics.instance) {
Analytics.instance = new Analytics()
}
return Analytics.instance
}
/**
* Track page view
*/
trackPageView(page: string) {
const loadTime = this.getPageLoadTime()
const event: PageViewEvent = {
page,
timestamp: Date.now(),
loadTime
}
// Send to backend
this.sendEvent('page_view', event)
// Log to console in dev
if (import.meta.env.DEV) {
console.log('📊 Page View:', event)
}
}
/**
* Track custom event
*/
trackEvent(name: string, data: Record<string, any>) {
this.sendEvent(name, {
...data,
timestamp: Date.now()
})
if (import.meta.env.DEV) {
console.log(`📊 Event: ${name}`, data)
}
}
/**
* Track tool download
*/
trackToolDownload(toolId: string, toolName: string) {
this.trackEvent('tool_download', {
tool_id: toolId,
tool_name: toolName
})
}
/**
* Track error
*/
trackError(error: Error, context?: Record<string, any>) {
const errorData = {
message: error.message,
stack: error.stack,
context,
timestamp: Date.now(),
url: window.location.href,
userAgent: navigator.userAgent
}
this.sendEvent('error', errorData)
console.error('❌ Error tracked:', errorData)
}
/**
* Get page load time
*/
  private getPageLoadTime(): number | undefined {
    if (typeof window === 'undefined' || !window.performance) {
      return undefined
    }
    // performance.timing is deprecated; use the Navigation Timing Level 2 entry instead
    const [nav] = performance.getEntriesByType('navigation') as PerformanceNavigationTiming[]
    if (nav && nav.loadEventEnd > 0) {
      return Math.round(nav.loadEventEnd - nav.startTime)
    }
    return undefined
  }
/**
* Send event to backend
*/
private async sendEvent(name: string, data: any) {
try {
await fetch(`${this.apiUrl}/api/v1/metrics`, {
method: 'POST',
headers: {
'Content-Type': 'application/json'
},
body: JSON.stringify({
name,
data
})
})
} catch (error) {
// Silently fail - don't break app if analytics fail
console.warn('Failed to send analytics event:', error)
}
}
}
export const analytics = Analytics.getInstance()
// Track page views on route changes
export const useAnalytics = () => {
const location = useLocation()
useEffect(() => {
analytics.trackPageView(location.pathname)
}, [location])
}
Usage in App:
// frontend/src/App.tsx
import { useAnalytics } from './utils/analytics'
function App() {
// Track page views
useAnalytics()
return <Routes>...</Routes>
}
// Track button clicks
<Button onClick={() => {
analytics.trackEvent('button_click', { button: 'download_tool' })
handleDownload()
}}>
Download
</Button>
// Track errors
try {
await someAsyncOperation()
} catch (error) {
analytics.trackError(error as Error, { operation: 'someAsyncOperation' })
}
7. Web Vitals Monitoring
cd frontend
npm install web-vitals
// frontend/src/utils/webVitals.ts
// web-vitals v4 removed onFID; INP replaced FID as the responsiveness Core Web Vital
import { onCLS, onINP, onFCP, onLCP, onTTFB, type Metric } from 'web-vitals'
import { analytics } from './analytics'

const report = (metric: Metric) => {
  analytics.trackEvent('web_vital', {
    name: metric.name,
    value: metric.value,
    rating: metric.rating
  })
}

export const reportWebVitals = () => {
  onCLS(report)
  onINP(report)
  onFCP(report)
  onLCP(report)
  onTTFB(report)
}
// frontend/src/main.tsx
import { reportWebVitals } from './utils/webVitals'
reportWebVitals()
8. Metrics Dashboard API
Location: backend/app/api/v1/endpoints/metrics.py
@router.get("/admin/metrics/overview")
async def get_metrics_overview(
days: int = 30,
db: Session = Depends(get_db),
current_user: User = Depends(get_current_admin)
):
"""Get metrics overview for the last N days."""
end_date = datetime.utcnow()
start_date = end_date - timedelta(days=days)
# Page views
page_views = await MetricsService.get_page_views(db, start_date, end_date)
# API performance
api_performance = await MetricsService.get_api_performance(db, start_date, end_date)
# Tool downloads
tool_downloads = db.query(Metric).filter(
Metric.name == "tool_download",
        Metric.timestamp >= start_date,
        Metric.timestamp <= end_date
    ).count()
# Active users
active_users = db.query(User).filter(
User.last_login >= start_date
).count()
return {
"period_days": days,
"page_views": page_views,
"api_performance": api_performance,
"tool_downloads": tool_downloads,
"active_users": active_users
}
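The frontend's sendEvent posts {name, data} to POST /api/v1/metrics, which is not defined above. A minimal ingestion sketch for the same file (assuming the router is mounted under /api/v1; the payload mapping is one reasonable choice, not prescribed by the plan):
# backend/app/api/v1/endpoints/metrics.py (continued)
from typing import Any, Dict, Optional
from pydantic import BaseModel

class MetricEventIn(BaseModel):
    name: str
    data: Optional[Dict[str, Any]] = None

@router.post("/metrics", status_code=204)
async def ingest_metric(event: MetricEventIn, db: Session = Depends(get_db)):
    """Store a frontend analytics event as a Metric row."""
    data = event.data or {}
    # Treat an explicit numeric `value` as the metric value; everything else becomes tags
    raw = data.pop("value", 1)
    value = float(raw) if isinstance(raw, (int, float)) else 1.0
    await MetricsService.record_metric(
        db,
        metric_name=event.name,
        value=value,
        tags={k: str(v) for k, v in data.items()},
    )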
Monitoring Best Practices
Log Levels:
- DEBUG: detailed diagnostic information, usually enabled only during development
- INFO: routine events confirming the application is working as expected
- WARNING: something unexpected happened, but the application keeps working
- ERROR: an operation failed; the application continues, but the failure needs attention
- CRITICAL: a severe failure after which the application may not be able to continue
What to Log:
- All API requests (access logs)
- Errors and exceptions (with stack traces)
- Slow operations (> 200ms)
- Authentication events (login, logout, failures)
- Admin actions (audit logs)
- Database operations (migrations, backups)
What NOT to Log (a redaction-filter sketch follows this list):
- Passwords or sensitive credentials
- Personal identifiable information (PII)
- Credit card numbers
- API keys or tokens
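One way to enforce this at the logging layer is a redaction filter attached to each handler; a minimal sketch (extend PATTERNS with project-specific secrets):
# backend/app/core/logging_config.py (addition)
import re

class RedactionFilter(logging.Filter):
    """Mask obvious secrets before a record is emitted."""
    PATTERNS = [
        re.compile(r"(?i)(password|token|api[_-]?key)\s*[=:]\s*\S+"),
        re.compile(r"\b(?:\d[ -]?){13,16}\b"),  # crude credit-card-like digit runs
    ]

    def filter(self, record: logging.LogRecord) -> bool:
        msg = record.getMessage()
        for pattern in self.PATTERNS:
            msg = pattern.sub("[REDACTED]", msg)
        record.msg, record.args = msg, None
        return True

# In setup_logging(), attach to each handler, e.g.:
# console_handler.addFilter(RedactionFilter())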
Log Rotation:
- Rotate by size (10MB per file)
- Rotate by time (daily for access logs)
- Keep historical logs (30 days minimum)
- Compress old logs to save space (see the gzip rotation sketch below)
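The stdlib rotating handlers support compression through their rotator and namer hooks; a sketch, applied to the app.log handler inside setup_logging():
import gzip
import os
import shutil

def _gzip_rotator(source: str, dest: str) -> None:
    """Compress the rotated log file and remove the plain-text original."""
    with open(source, "rb") as f_in, gzip.open(dest, "wb") as f_out:
        shutil.copyfileobj(f_in, f_out)
    os.remove(source)

file_handler.namer = lambda name: name + ".gz"
file_handler.rotator = _gzip_rotator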
Alerting:
- Set up alerts for error rate spikes (a minimal SMTP handler sketch follows this list)
- Alert on slow response times (> 1s)
- Alert on high resource usage (CPU > 80%, Memory > 90%)
- Alert on failed health checks
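For small deployments without external tooling, the stdlib SMTPHandler can email on ERROR and above; a sketch with placeholder addresses (production setups more commonly route alerts through Sentry or a metrics stack):
# backend/app/core/logging_config.py (addition inside setup_logging(); addresses are placeholders)
from logging.handlers import SMTPHandler

smtp_handler = SMTPHandler(
    mailhost=("smtp.example.com", 587),
    fromaddr="alerts@example.com",
    toaddrs=["oncall@example.com"],
    subject="QA Portal: application error",
    credentials=("alerts@example.com", "app-password"),  # placeholder
    secure=(),  # empty tuple enables STARTTLS
)
smtp_handler.setLevel(logging.ERROR)
root_logger.addHandler(smtp_handler)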
Troubleshooting
No logs appearing:
- Check logs directory permissions
- Verify logging configuration loaded
- Check log level settings
- Ensure handlers are attached to logger
Logs filling disk:
- Implement log rotation
- Reduce log level (INFO instead of DEBUG)
- Compress old logs
- Set up automated cleanup (delete logs > 30 days)
Performance impact:
- Use async logging if available (see the QueueHandler sketch after this list)
- Log to separate disk/partition
- Reduce log verbosity in production
- Use log sampling for high-volume events
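The stdlib supports non-blocking logging out of the box; a sketch of how setup_logging() could be adapted with QueueHandler/QueueListener:
# Handlers run on a background thread; request code only enqueues records
import queue
from logging.handlers import QueueHandler, QueueListener

log_queue = queue.Queue(-1)  # unbounded

# The listener owns the real (potentially slow) handlers
listener = QueueListener(log_queue, console_handler, file_handler, error_handler)
listener.start()

root_logger.handlers.clear()
root_logger.addHandler(QueueHandler(log_queue))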
Report
✅ Application logging configured (rotating files)
✅ Performance monitoring middleware added
✅ Error tracking configured (Sentry optional)
✅ Metrics service implemented
✅ Health check endpoints created
✅ Frontend analytics tracking added
✅ Web Vitals monitoring configured
✅ Metrics dashboard API implemented
✅ Log rotation configured (30-day retention)
✅ Access logs separated from error logs
✅ Slow request logging enabled (> 200ms warnings, > 1s errors)