tzurot-observability

@lbds137/tzurot

SKILL.md

name tzurot-observability
description Logging and observability for Tzurot v3 - Structured logging with Pino, correlation IDs, error tracking, privacy considerations, and Railway log analysis. Use when adding logging or debugging production issues.
lastUpdated 2025-11-19

Tzurot v3 Observability & Logging

Use this skill when: Adding logging, debugging production issues, tracking requests across services, or implementing health checks.

Core Principles

  1. Structured logging - JSON format for easy parsing
  2. Privacy first - Never log secrets, tokens, or PII
  3. Correlation IDs - Track requests across microservices
  4. Log levels - Appropriate verbosity for each environment
  5. Context-rich - Include relevant metadata for debugging
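As a minimal illustration of principles 1 and 5 (values and field names below are invented for the example), a structured log line is just a JSON object, so every field stays machine-queryable instead of being baked into a string:

```typescript
// Illustrative only: why structured context beats string interpolation.
// A JSON line can be filtered by field (grep/jq on requestId); an
// interpolated string cannot be reliably parsed back apart.
const context = { requestId: 'abc-123', personalityId: 'p-42', durationMs: 184 };

const structured = JSON.stringify({ level: 'info', msg: 'Request completed', ...context });
const interpolated = `Request abc-123 completed for p-42 in 184ms`;

// The structured line round-trips: every field is recoverable by any consumer.
const parsed = JSON.parse(structured);
```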

Logging Stack

  • Logger: Pino (fast, structured JSON logger)
  • Transport: stdout (Railway captures logs)
  • Format: JSON lines
  • Levels: trace, debug, info, warn, error, fatal

Logger Setup

Creating Loggers

// packages/common-types/src/utils/logger.ts
import pino from 'pino';

export function createLogger(serviceName: string): pino.Logger {
  return pino({
    name: serviceName,
    level: process.env.LOG_LEVEL || 'info',
    formatters: {
      level: label => {
        return { level: label };
      },
    },
    timestamp: pino.stdTimeFunctions.isoTime,
  });
}

// Usage in services
import { createLogger } from '@tzurot/common-types';
const logger = createLogger('PersonalityService');

Log Levels

Use appropriate levels:

// TRACE - Very detailed, disabled in production
logger.trace({ functionArgs }, 'Function called');

// DEBUG - Detailed debugging information
logger.debug({ userId, channelId }, 'Processing message');

// INFO - General information about system operation
logger.info({ personalityId }, 'Loaded personality config');

// WARN - Warning about potential issues
logger.warn({ attempt, maxAttempts }, 'Retrying after failure');

// ERROR - Error that needs attention but service continues
logger.error({ err: error, jobId }, 'Job processing failed');

// FATAL - Service cannot continue, immediate action required
logger.fatal({ err: error }, 'Failed to connect to database');

Structured Logging Patterns

The Golden Pattern

logger.info(
  { contextObject }, // First param: structured data
  'Human readable message' // Second param: message
);

Good Examples

// ✅ GOOD - Context + message
logger.info({ personalityId, model: 'claude-sonnet-4.5' }, 'Loaded personality with config');

logger.error({ err: error, requestId, userId }, 'Failed to process AI request');

logger.warn({ jobId, attempt: 2, maxAttempts: 3 }, 'Retrying failed job');

logger.debug({ channelId, messageCount: history.length }, 'Retrieved conversation history');

Bad Examples

// ❌ BAD - String interpolation loses structure
logger.info(`Loaded personality ${personalityId} with model ${model}`);

// ❌ BAD - No context object
logger.info('Loaded personality');

// ❌ BAD - Logging PII
logger.info({ username, email, ipAddress }, 'User logged in');

// ❌ BAD - Logging secrets
logger.debug({ apiKey, token }, 'Making API call');

Error Logging

Pino Error Format

// ✅ CORRECT - Use 'err' key for errors
logger.error({ err: error, context: 'additional data' }, 'Human readable error message');

// ❌ WRONG - Don't use 'error' key
logger.error(
  { error: error }, // Wrong!
  'Message'
);

Why err? Pino has special handling for the err key:

  • Serializes stack traces properly
  • Includes error name and message
  • Formats consistently across services
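The resulting shape can be sketched with a hand-rolled mimic (illustrative only, not Pino's actual implementation; Pino's real serializer has additional handling):

```typescript
// Sketch of the object shape produced for the 'err' key: the Error's
// name, message, and stack are lifted into a plain, JSON-safe object.
function serializeErrSketch(err: Error): { type: string; message: string; stack?: string } {
  return { type: err.name, message: err.message, stack: err.stack };
}

const out = serializeErrSketch(new Error('Connection refused'));
// out.type is 'Error'; out.message is 'Connection refused'; out.stack is the trace
```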

Error Context

try {
  await processSomething(id);
} catch (error) {
  // Include context to help debugging
  logger.error(
    {
      err: error,
      id, // What we were processing
      userId, // Who triggered it
      attempt, // Which retry attempt
    },
    'Failed to process item'
  );
  throw error;
}

Correlation IDs

Problem: When a user reports "bot didn't reply", how do you find logs across 3 services?

Solution: Correlation IDs that flow through entire request chain.

Implementing Correlation IDs

// 1. Generate ID when request enters system (bot-client)
import { randomUUID } from 'crypto';

const requestId = randomUUID();

logger.info({ requestId, messageId }, 'Processing Discord message');

// 2. Include in HTTP request to api-gateway
await fetch(`${GATEWAY_URL}/ai/generate`, {
  headers: {
    'X-Request-ID': requestId,
    'Content-Type': 'application/json',
  },
  body: JSON.stringify({ requestId /* ... */ }),
});

// 3. Extract and use in api-gateway
app.use((req, res, next) => {
  const requestId = req.headers['x-request-id'] || randomUUID();
  req.requestId = requestId; // Attach to request object (requires augmenting Express's Request type)
  next();
});

app.post('/ai/generate', async (req, res) => {
  logger.info({ requestId: req.requestId }, 'Received AI generation request');
  // ... process
});

// 4. Include in BullMQ job data
await aiQueue.add('llm-generation', {
  requestId: req.requestId,
  // ... other data
});

// 5. Use in ai-worker
export async function processLLMGeneration(job: Job): Promise<Response> {
  const { requestId } = job.data;

  logger.info({ requestId, jobId: job.id }, 'Processing LLM generation');

  // All logs include requestId
  logger.debug({ requestId, personalityId }, 'Loaded personality');
  logger.debug({ requestId, memoryCount: memories.length }, 'Retrieved memories');
  logger.info({ requestId, model }, 'Calling AI provider');

  return response;
}

Searching Logs with Correlation ID

# Railway CLI: Find all logs for a request
railway logs --service api-gateway | grep "requestId\":\"abc-123"
railway logs --service ai-worker | grep "requestId\":\"abc-123"
railway logs --service bot-client | grep "requestId\":\"abc-123"

Privacy & Security

NEVER Log These

Secrets:

  • API keys (DISCORD_TOKEN, OPENROUTER_API_KEY, etc.)
  • Database passwords
  • Redis passwords
  • Webhook tokens
  • Any credential or authentication token

PII (Personally Identifiable Information):

  • Email addresses
  • IP addresses
  • Real names
  • Phone numbers
  • Addresses

Sensitive User Data:

  • Message content (unless debugging specific issue)
  • DM conversations
  • Personality system prompts (contain character details)

Safe to Log

IDs (anonymized):

// ✅ SAFE - IDs don't reveal personal info
logger.info({ userId: 'discord-123456', channelId: 'channel-789' });

Aggregated Data:

// ✅ SAFE - No individual identification
logger.info({ messageCount: 42, avgLength: 156 });

Error Information:

// ✅ SAFE - Error details without sensitive data
logger.error({ err: error, statusCode: 429, endpoint: '/ai/generate' }, 'Rate limit exceeded');

Redacting Sensitive Data

// Utility to redact sensitive fields
function redactSensitive(obj: any): any {
  const redacted = { ...obj };

  const sensitiveKeys = ['apiKey', 'token', 'password', 'secret', 'email'];

  for (const key of sensitiveKeys) {
    if (key in redacted) {
      redacted[key] = '[REDACTED]';
    }
  }

  return redacted;
}

// Usage
logger.debug(redactSensitive({ apiKey: 'sk-123', model: 'gpt-4' }), 'Making API call');
// Logs: { apiKey: '[REDACTED]', model: 'gpt-4' }
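The helper above only redacts top-level keys. A deep variant can walk nested payloads (a sketch; the recursive walk below is not existing codebase code):

```typescript
// Sketch: recursively redact sensitive keys at any depth.
const SENSITIVE_KEYS = new Set(['apiKey', 'token', 'password', 'secret', 'email']);

function redactDeep(value: unknown): unknown {
  if (Array.isArray(value)) return value.map(redactDeep);
  if (value !== null && typeof value === 'object') {
    const out: Record<string, unknown> = {};
    for (const [key, v] of Object.entries(value as Record<string, unknown>)) {
      out[key] = SENSITIVE_KEYS.has(key) ? '[REDACTED]' : redactDeep(v);
    }
    return out;
  }
  return value;
}

const safe = redactDeep({ user: { email: 'x@y.z', id: 'u-1' }, apiKey: 'sk-123' }) as {
  user: { email: string; id: string };
  apiKey: string;
};
// safe.user.email and safe.apiKey are '[REDACTED]'; safe.user.id is untouched
```

Pino also ships a built-in redact option (path-based, applied at serialization time) that is worth evaluating as an alternative for hot paths.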

Health Checks

Simple Health Endpoint

// api-gateway/index.ts
app.get('/health', async (req, res) => {
  try {
    // Check critical dependencies
    const redisHealthy = await checkRedis();
    const dbHealthy = await checkDatabase();
    const queueHealthy = await checkQueue();

    const isHealthy = redisHealthy && dbHealthy && queueHealthy;

    res.status(isHealthy ? 200 : 503).json({
      status: isHealthy ? 'healthy' : 'degraded',
      services: {
        redis: redisHealthy,
        database: dbHealthy,
        queue: queueHealthy,
      },
      timestamp: new Date().toISOString(),
      uptime: process.uptime(),
    });
  } catch (error) {
    logger.error({ err: error }, 'Health check failed');
    res.status(503).json({ status: 'unhealthy' });
  }
});

async function checkRedis(): Promise<boolean> {
  try {
    await redis.ping();
    return true;
  } catch {
    return false;
  }
}

async function checkDatabase(): Promise<boolean> {
  try {
    await prisma.$queryRaw`SELECT 1`;
    return true;
  } catch {
    return false;
  }
}

async function checkQueue(): Promise<boolean> {
  try {
    // If the queue connection is healthy, this resolves with a number
    await aiQueue.getWaitingCount();
    return true;
  } catch {
    return false;
  }
}

Metrics Endpoint

// api-gateway/index.ts
app.get('/metrics', async (req, res) => {
  const [waiting, active, completed, failed] = await Promise.all([
    aiQueue.getWaitingCount(),
    aiQueue.getActiveCount(),
    aiQueue.getCompletedCount(),
    aiQueue.getFailedCount(),
  ]);

  res.json({
    queue: {
      waiting,
      active,
      completed,
      failed,
      total: waiting + active,
    },
    memory: {
      heapUsed: process.memoryUsage().heapUsed,
      heapTotal: process.memoryUsage().heapTotal,
      rss: process.memoryUsage().rss,
    },
    uptime: process.uptime(),
    timestamp: new Date().toISOString(),
  });
});

Railway Log Analysis

Viewing Logs

# Recent logs (last 100 lines)
railway logs --service api-gateway

# Follow logs in real-time
railway logs --service ai-worker --tail

# Logs from specific time
railway logs --service bot-client --since 1h

# All services
railway logs

Filtering Logs

# Find errors
railway logs --service ai-worker | grep '"level":"error"'

# Find specific request
railway logs | grep "requestId\":\"abc-123"

# Find slow operations (durationMs of 1000 or more, i.e. >= 1s)
railway logs | grep "durationMs" | grep -v "durationMs\":[0-9]\{1,3\},"

# Count errors by type
railway logs --since 24h | grep '"level":"error"' | jq '.msg' | sort | uniq -c

Common Log Queries

# 1. Find failed jobs
railway logs --service ai-worker | grep "Job failed"

# 2. Find rate limit errors
railway logs | grep "429"

# 3. Find database connection errors
railway logs | grep "database" | grep "error"

# 4. Find slow requests (>5 seconds)
railway logs | grep "durationMs" | awk -F'"durationMs":' '{print $2}' | awk -F',' '{if($1>5000) print}'

# 5. Find memory warnings
railway logs | grep "heapUsed" | grep "GB"

Performance Logging

Request Duration

// Measure operation duration
async function processRequest(data: RequestData): Promise<Response> {
  const startTime = Date.now();
  const requestId = data.requestId;

  try {
    const result = await doWork(data);

    const durationMs = Date.now() - startTime;
    logger.info({ requestId, durationMs, status: 'success' }, 'Request completed');

    return result;
  } catch (error) {
    const durationMs = Date.now() - startTime;
    logger.error({ requestId, durationMs, err: error, status: 'failed' }, 'Request failed');
    throw error;
  }
}
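This try/catch timing pattern repeats in every handler, so it can be factored into a reusable wrapper (a sketch; timed and its minimal logger interface are hypothetical names, not existing codebase code):

```typescript
// Sketch: wrap any async operation so durationMs is logged on both paths.
type LogFn = (ctx: Record<string, unknown>, msg: string) => void;

async function timed<T>(
  op: string,
  log: { info: LogFn; error: LogFn },
  fn: () => Promise<T>
): Promise<T> {
  const startTime = Date.now();
  try {
    const result = await fn();
    log.info({ op, durationMs: Date.now() - startTime, status: 'success' }, 'Operation completed');
    return result;
  } catch (err) {
    log.error({ op, durationMs: Date.now() - startTime, err, status: 'failed' }, 'Operation failed');
    throw err;
  }
}
```

Usage would look like: const result = await timed('loadPersonality', logger, () => loadPersonality(id));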

Slow Operation Detection

const SLOW_THRESHOLD = 5000; // 5 seconds

async function queryDatabase(query: string): Promise<any> {
  const startTime = Date.now();

  // $queryRawUnsafe accepts a plain string; $queryRaw is a tagged template
  const result = await prisma.$queryRawUnsafe(query);

  const durationMs = Date.now() - startTime;

  if (durationMs > SLOW_THRESHOLD) {
    logger.warn({ durationMs, query: query.substring(0, 100) }, 'Slow database query detected');
  }

  return result;
}

Testing with Logs

Capturing Logs in Tests

import { describe, it, expect, beforeEach, vi } from 'vitest';
import type pino from 'pino';

describe('MyService', () => {
  let mockLogger: pino.Logger;

  beforeEach(() => {
    mockLogger = {
      info: vi.fn(),
      error: vi.fn(),
      warn: vi.fn(),
      debug: vi.fn(),
    } as unknown as pino.Logger;
  });

  it('should log errors', async () => {
    const service = new MyService(mockLogger);

    await expect(service.failingOperation()).rejects.toThrow();

    expect(mockLogger.error).toHaveBeenCalledWith(
      expect.objectContaining({ err: expect.any(Error) }),
      expect.stringContaining('Operation failed')
    );
  });
});

Suppress Logs in Tests

// vitest.setup.ts
// createLogger reads LOG_LEVEL, so set it to 'silent' before any logger is created
process.env.LOG_LEVEL = 'silent';

Anti-Patterns

❌ Don't Use console.log

// ❌ BAD - Not structured, no metadata
console.log('Processing message');
console.error('Error:', error);

// ✅ GOOD - Structured, with context
logger.info({ messageId, channelId }, 'Processing message');
logger.error({ err: error, messageId }, 'Failed to process message');

❌ Don't Log in Loops

// ❌ BAD - Spams logs
for (const item of items) {
  logger.info({ item }, 'Processing item'); // 1000+ log lines!
}

// ✅ GOOD - Log once with count
logger.info({ itemCount: items.length }, 'Processing items');
for (const item of items) {
  await process(item);
}
logger.info({ itemCount: items.length }, 'Finished processing items');

❌ Don't Log Sensitive Data

// ❌ BAD - Logs API key
logger.debug({ apiKey: process.env.OPENROUTER_API_KEY }, 'Calling AI');

// ✅ GOOD - No sensitive data
logger.debug({ provider: 'OpenRouter', model }, 'Calling AI');

❌ Don't Swallow Errors Silently

// ❌ BAD - Silent failure
try {
  await doSomething();
} catch (error) {
  // Nothing logged!
}

// ✅ GOOD - Log and optionally rethrow
try {
  await doSomething();
} catch (error) {
  logger.error({ err: error }, 'Operation failed');
  throw error; // Or handle gracefully
}

Log Levels by Environment

# Development: debug level (verbose)
LOG_LEVEL=debug pnpm dev

# Production: info level (normal) - set via the service's environment variables
LOG_LEVEL=info

# Troubleshooting: temporarily set the service's LOG_LEVEL to trace
# (very verbose), then follow the logs
railway logs --service ai-worker --tail

Related Skills

  • tzurot-security - Security logging (no PII, no tokens)
  • tzurot-async-flow - Job correlation IDs
  • tzurot-db-vector - Database query logging
  • tzurot-constants - Log level constants
