| name | tzurot-observability |
| description | Logging and observability for Tzurot v3 - Structured logging with Pino, correlation IDs, error tracking, privacy considerations, and Railway log analysis. Use when adding logging or debugging production issues. |
| lastUpdated | 2025-11-19 |
Tzurot v3 Observability & Logging
Use this skill when: Adding logging, debugging production issues, tracking requests across services, or implementing health checks.
Core Principles
- Structured logging - JSON format for easy parsing
- Privacy first - Never log secrets, tokens, or PII
- Correlation IDs - Track requests across microservices
- Log levels - Appropriate verbosity for each environment
- Context-rich - Include relevant metadata for debugging
Logging Stack
- Logger: Pino (fast, structured JSON logger)
- Transport: stdout (Railway captures logs)
- Format: JSON lines
- Levels: trace, debug, info, warn, error, fatal
Logger Setup
Creating Loggers
// packages/common-types/src/utils/logger.ts
import pino from 'pino';
export function createLogger(serviceName: string): pino.Logger {
return pino({
name: serviceName,
level: process.env.LOG_LEVEL || 'info',
formatters: {
level: label => {
return { level: label };
},
},
timestamp: pino.stdTimeFunctions.isoTime,
});
}
// Usage in services
import { createLogger } from '@tzurot/common-types';
const logger = createLogger('PersonalityService');
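With this setup, every log call emits a single JSON line to stdout. For orientation, the snippet below shows roughly what that line looks like; exact fields vary by Pino version (pid and hostname are also included by default):
// A call like this:
logger.info({ personalityId: 'abc-123' }, 'Loaded personality config');
// ...produces one JSON line on stdout, roughly:
// {"level":"info","time":"2025-11-19T12:00:00.000Z","name":"PersonalityService","personalityId":"abc-123","msg":"Loaded personality config"}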
Log Levels
Use appropriate levels:
// TRACE - Very detailed, disabled in production
logger.trace({ functionArgs }, 'Function called');
// DEBUG - Detailed debugging information
logger.debug({ userId, channelId }, 'Processing message');
// INFO - General information about system operation
logger.info({ personalityId }, 'Loaded personality config');
// WARN - Warning about potential issues
logger.warn({ attempt, maxAttempts }, 'Retrying after failure');
// ERROR - Error that needs attention but service continues
logger.error({ err: error, jobId }, 'Job processing failed');
// FATAL - Service cannot continue, immediate action required
logger.fatal({ err: error }, 'Failed to connect to database');
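One related habit: if building a log payload is itself expensive, guard it with Pino's isLevelEnabled so the work is skipped when the level is disabled. A small sketch (buildDebugDump is a hypothetical helper, not an existing utility):
// Only build the expensive payload when trace is actually enabled
if (logger.isLevelEnabled('trace')) {
  logger.trace({ promptDump: buildDebugDump(request) }, 'Full prompt dump'); // buildDebugDump is hypothetical
}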
Structured Logging Patterns
The Golden Pattern
logger.info(
{ contextObject }, // First param: structured data
'Human readable message' // Second param: message
);
Good Examples
// ✅ GOOD - Context + message
logger.info({ personalityId, model: 'claude-sonnet-4.5' }, 'Loaded personality with config');
logger.error({ err: error, requestId, userId }, 'Failed to process AI request');
logger.warn({ jobId, attempt: 2, maxAttempts: 3 }, 'Retrying failed job');
logger.debug({ channelId, messageCount: history.length }, 'Retrieved conversation history');
Bad Examples
// ❌ BAD - String interpolation loses structure
logger.info(`Loaded personality ${personalityId} with model ${model}`);
// ❌ BAD - No context object
logger.info('Loaded personality');
// ❌ BAD - Logging PII
logger.info({ username, email, ipAddress }, 'User logged in');
// ❌ BAD - Logging secrets
logger.debug({ apiKey, token }, 'Making API call');
Error Logging
Pino Error Format
// ✅ CORRECT - Use 'err' key for errors
logger.error({ err: error, context: 'additional data' }, 'Human readable error message');
// ❌ WRONG - Don't use 'error' key
logger.error(
{ error: error }, // Wrong!
'Message'
);
Why err? Pino has special handling for the err key:
- Serializes stack traces properly
- Includes error name and message
- Formats consistently across services
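Pino handles the err key out of the box, but registering the serializer explicitly documents the behavior and gives you a place to customize it. A minimal sketch of what that would look like when creating a logger:
import pino from 'pino';
const logger = pino({
  name: 'ai-worker',
  serializers: {
    // stdSerializers.err extracts type, message, and stack from Error objects
    err: pino.stdSerializers.err,
  },
});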
Error Context
try {
await processSomething(id);
} catch (error) {
// Include context to help debugging
logger.error(
{
err: error,
id, // What we were processing
userId, // Who triggered it
attempt, // Which retry attempt
},
'Failed to process item'
);
throw error;
}
Correlation IDs
Problem: When a user reports "bot didn't reply", how do you find logs across 3 services?
Solution: Correlation IDs that flow through the entire request chain.
Implementing Correlation IDs
// 1. Generate ID when request enters system (bot-client)
import { randomUUID } from 'crypto';
const requestId = randomUUID();
logger.info({ requestId, messageId }, 'Processing Discord message');
// 2. Include in HTTP request to api-gateway
await fetch(`${GATEWAY_URL}/ai/generate`, {
headers: {
'X-Request-ID': requestId,
'Content-Type': 'application/json',
},
body: JSON.stringify({ requestId /* ... */ }),
});
// 3. Extract and use in api-gateway
app.use((req, res, next) => {
const requestId = req.headers['x-request-id'] || randomUUID();
req.requestId = requestId; // Attach to request object (requires augmenting the Express Request type)
next();
});
app.post('/ai/generate', async (req, res) => {
logger.info({ requestId: req.requestId }, 'Received AI generation request');
// ... process
});
// 4. Include in BullMQ job data
await aiQueue.add('llm-generation', {
requestId: req.requestId,
// ... other data
});
// 5. Use in ai-worker
export async function processLLMGeneration(job: Job): Promise<Response> {
const { requestId } = job.data;
logger.info({ requestId, jobId: job.id }, 'Processing LLM generation');
// All logs include requestId
logger.debug({ requestId, personalityId }, 'Loaded personality');
logger.debug({ requestId, memoryCount: memories.length }, 'Retrieved memories');
logger.info({ requestId, model }, 'Calling AI provider');
return response;
}
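Rather than threading requestId into every call by hand, a Pino child logger can bind it once, and every call on the child includes it automatically. A sketch of how step 5 above could be written with a child logger (same variables as the example, shown for shape only):
export async function processLLMGeneration(job: Job): Promise<Response> {
  const { requestId } = job.data;
  // Bind correlation fields once; every call on 'log' carries them
  const log = logger.child({ requestId, jobId: job.id });
  log.info('Processing LLM generation');
  log.debug({ personalityId }, 'Loaded personality');
  log.info({ model }, 'Calling AI provider');
  return response;
}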
Searching Logs with Correlation ID
# Railway CLI: Find all logs for a request
railway logs --service api-gateway | grep "requestId\":\"abc-123"
railway logs --service ai-worker | grep "requestId\":\"abc-123"
railway logs --service bot-client | grep "requestId\":\"abc-123"
Privacy & Security
NEVER Log These
Secrets:
- API keys (DISCORD_TOKEN, OPENROUTER_API_KEY, etc.)
- Database passwords
- Redis passwords
- Webhook tokens
- Any credential or authentication token
PII (Personally Identifiable Information):
- Email addresses
- IP addresses
- Real names
- Phone numbers
- Addresses
Sensitive User Data:
- Message content (unless debugging specific issue)
- DM conversations
- Personality system prompts (contain character details)
Safe to Log
IDs (anonymized):
// ✅ SAFE - IDs don't reveal personal info
logger.info({ userId: 'discord-123456', channelId: 'channel-789' });
Aggregated Data:
// ✅ SAFE - No individual identification
logger.info({ messageCount: 42, avgLength: 156 });
Error Information:
// ✅ SAFE - Error details without sensitive data
logger.error({ err: error, statusCode: 429, endpoint: '/ai/generate' }, 'Rate limit exceeded');
Redacting Sensitive Data
// Utility to redact sensitive fields (shallow - nested objects are not scanned)
function redactSensitive(obj: any): any {
const redacted = { ...obj };
const sensitiveKeys = ['apiKey', 'token', 'password', 'secret', 'email'];
for (const key of sensitiveKeys) {
if (key in redacted) {
redacted[key] = '[REDACTED]';
}
}
return redacted;
}
// Usage
logger.debug(redactSensitive({ apiKey: 'sk-123', model: 'gpt-4' }), 'Making API call');
// Logs: { apiKey: '[REDACTED]', model: 'gpt-4' }
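Pino also ships a built-in redact option that censors matching paths at serialization time, which is harder to forget than calling a helper at every call site. A minimal sketch; the exact paths depend on what the services actually log:
const logger = pino({
  name: 'api-gateway',
  redact: {
    // Any log object containing these paths gets them censored automatically
    paths: ['apiKey', 'token', 'password', 'email', '*.authorization', 'req.headers.cookie'],
    censor: '[REDACTED]',
  },
});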
Health Checks
Simple Health Endpoint
// api-gateway/index.ts
app.get('/health', async (req, res) => {
try {
// Check critical dependencies
const redisHealthy = await checkRedis();
const dbHealthy = await checkDatabase();
const queueHealthy = await checkQueue();
const isHealthy = redisHealthy && dbHealthy && queueHealthy;
res.status(isHealthy ? 200 : 503).json({
status: isHealthy ? 'healthy' : 'degraded',
services: {
redis: redisHealthy,
database: dbHealthy,
queue: queueHealthy,
},
timestamp: new Date().toISOString(),
uptime: process.uptime(),
});
} catch (error) {
logger.error({ err: error }, 'Health check failed');
res.status(503).json({ status: 'unhealthy' });
}
});
async function checkRedis(): Promise<boolean> {
try {
await redis.ping();
return true;
} catch {
return false;
}
}
async function checkDatabase(): Promise<boolean> {
try {
await prisma.$queryRaw`SELECT 1`;
return true;
} catch {
return false;
}
}
async function checkQueue(): Promise<boolean> {
try {
const count = await aiQueue.getWaitingCount();
return typeof count === 'number'; // any numeric response means the queue connection is alive
} catch {
return false;
}
}
Metrics Endpoint
// api-gateway/index.ts
app.get('/metrics', async (req, res) => {
const [waiting, active, completed, failed] = await Promise.all([
aiQueue.getWaitingCount(),
aiQueue.getActiveCount(),
aiQueue.getCompletedCount(),
aiQueue.getFailedCount(),
]);
res.json({
queue: {
waiting,
active,
completed,
failed,
total: waiting + active,
},
memory: {
heapUsed: process.memoryUsage().heapUsed,
heapTotal: process.memoryUsage().heapTotal,
rss: process.memoryUsage().rss,
},
uptime: process.uptime(),
timestamp: new Date().toISOString(),
});
});
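The endpoint gives a point-in-time snapshot; if you also want trends in Railway logs, one option is to log the same counters on an interval from the worker. A sketch using the BullMQ calls already shown above (the 60-second interval is illustrative):
// ai-worker: periodic queue-depth logging, searchable later via grep "Queue depth"
const QUEUE_STATS_INTERVAL_MS = 60_000;
setInterval(async () => {
  try {
    const [waiting, active] = await Promise.all([
      aiQueue.getWaitingCount(),
      aiQueue.getActiveCount(),
    ]);
    logger.info({ waiting, active }, 'Queue depth');
  } catch (error) {
    logger.warn({ err: error }, 'Failed to read queue depth');
  }
}, QUEUE_STATS_INTERVAL_MS);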
Railway Log Analysis
Viewing Logs
# Recent logs (last 100 lines)
railway logs --service api-gateway
# Follow logs in real-time
railway logs --service ai-worker --tail
# Logs from specific time
railway logs --service bot-client --since 1h
# All services
railway logs
Filtering Logs
# Find errors
railway logs --service ai-worker | grep '"level":"error"'
# Find specific request
railway logs | grep "requestId\":\"abc-123"
# Find slow operations (durationMs with 4+ digits, i.e. >= 1 second)
railway logs | grep "durationMs" | grep -v "durationMs\":[0-9]\{1,3\},"
# Count errors by type
railway logs --since 24h | grep '"level":"error"' | jq '.msg' | sort | uniq -c
Common Log Queries
# 1. Find failed jobs
railway logs --service ai-worker | grep "Job failed"
# 2. Find rate limit errors
railway logs | grep "429"
# 3. Find database connection errors
railway logs | grep "database" | grep "error"
# 4. Find slow requests (>5 seconds)
railway logs | grep "durationMs" | awk -F'"durationMs":' '{print $2}' | awk -F',' '{if($1>5000) print}'
# 5. Find memory warnings
railway logs | grep "heapUsed" | grep "GB"
Performance Logging
Request Duration
// Measure operation duration
async function processRequest(data: RequestData): Promise<Response> {
const startTime = Date.now();
const requestId = data.requestId;
try {
const result = await doWork(data);
const durationMs = Date.now() - startTime;
logger.info({ requestId, durationMs, status: 'success' }, 'Request completed');
return result;
} catch (error) {
const durationMs = Date.now() - startTime;
logger.error({ requestId, durationMs, err: error, status: 'failed' }, 'Request failed');
throw error;
}
}
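If this try/catch-with-timing pattern repeats across handlers, it can be factored into a small wrapper. The withDuration helper below is hypothetical (not an existing utility in the codebase), shown as one way to keep the logging consistent:
import pino from 'pino';
// Hypothetical helper: runs an async operation and logs its duration on success or failure
async function withDuration<T>(
  log: pino.Logger,
  context: Record<string, unknown>,
  label: string,
  fn: () => Promise<T>
): Promise<T> {
  const startTime = Date.now();
  try {
    const result = await fn();
    log.info({ ...context, durationMs: Date.now() - startTime }, `${label} completed`);
    return result;
  } catch (error) {
    log.error({ ...context, durationMs: Date.now() - startTime, err: error }, `${label} failed`);
    throw error;
  }
}
// Usage: const result = await withDuration(logger, { requestId }, 'AI generation', () => doWork(data));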
Slow Operation Detection
const SLOW_THRESHOLD = 5000; // 5 seconds
async function queryDatabase(query: string): Promise<any> {
const startTime = Date.now();
const result = await prisma.$queryRawUnsafe(query); // $queryRaw is a tagged template; use $queryRawUnsafe for raw strings
const durationMs = Date.now() - startTime;
if (durationMs > SLOW_THRESHOLD) {
logger.warn({ durationMs, query: query.substring(0, 100) }, 'Slow database query detected');
}
return result;
}
Testing with Logs
Capturing Logs in Tests
describe('MyService', () => {
let mockLogger: pino.Logger;
beforeEach(() => {
mockLogger = {
info: vi.fn(),
error: vi.fn(),
warn: vi.fn(),
debug: vi.fn(),
} as unknown as pino.Logger;
});
it('should log errors', async () => {
const service = new MyService(mockLogger);
await expect(service.failingOperation()).rejects.toThrow();
expect(mockLogger.error).toHaveBeenCalledWith(
expect.objectContaining({ err: expect.any(Error) }),
expect.stringContaining('Operation failed')
);
});
});
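If you prefer asserting on the actual JSON Pino emits rather than on mock calls, Pino accepts a destination stream as its second argument. A sketch that collects each emitted line into an array for assertions:
import { Writable } from 'node:stream';
import pino from 'pino';
function createCapturingLogger() {
  const lines: Array<Record<string, unknown>> = [];
  const stream = new Writable({
    write(chunk, _encoding, callback) {
      lines.push(JSON.parse(chunk.toString())); // each write is one complete JSON log line
      callback();
    },
  });
  return { logger: pino({ level: 'debug' }, stream), lines };
}
// Usage: const { logger, lines } = createCapturingLogger();
// new MyService(logger) ... then assert on lines[0].msg, lines[0].err, etc.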
Suppress Logs in Tests
// vitest.setup.ts
// createLogger reads LOG_LEVEL, so set it to 'silent' before any logger is created
process.env.LOG_LEVEL = 'silent';
Anti-Patterns
❌ Don't Use console.log
// ❌ BAD - Not structured, no metadata
console.log('Processing message');
console.error('Error:', error);
// ✅ GOOD - Structured, with context
logger.info({ messageId, channelId }, 'Processing message');
logger.error({ err: error, messageId }, 'Failed to process message');
❌ Don't Log in Loops
// ❌ BAD - Spams logs
for (const item of items) {
logger.info({ item }, 'Processing item'); // 1000+ log lines!
}
// ✅ GOOD - Log once with count
logger.info({ itemCount: items.length }, 'Processing items');
for (const item of items) {
await process(item);
}
logger.info({ itemCount: items.length }, 'Finished processing items');
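When per-item visibility genuinely matters (for example, finding the one bad item in a batch), a middle ground is to log failures individually and successes only in aggregate. A sketch (item.id is assumed to exist on the items being processed):
let failedCount = 0;
for (const item of items) {
  try {
    await process(item);
  } catch (error) {
    failedCount++;
    // Only failures get their own line; successes are summarized below
    logger.error({ err: error, itemId: item.id }, 'Item processing failed');
  }
}
logger.info({ itemCount: items.length, failedCount }, 'Finished processing items');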
❌ Don't Log Sensitive Data
// ❌ BAD - Logs API key
logger.debug({ apiKey: process.env.OPENROUTER_API_KEY }, 'Calling AI');
// ✅ GOOD - No sensitive data
logger.debug({ provider: 'OpenRouter', model }, 'Calling AI');
❌ Don't Swallow Errors Silently
// ❌ BAD - Silent failure
try {
await doSomething();
} catch (error) {
// Nothing logged!
}
// ✅ GOOD - Log and optionally rethrow
try {
await doSomething();
} catch (error) {
logger.error({ err: error }, 'Operation failed');
throw error; // Or handle gracefully
}
Log Levels by Environment
# Development: debug level (verbose)
LOG_LEVEL=debug pnpm dev
# Production: info level (normal) - set in the service's Railway variables
LOG_LEVEL=info
# Troubleshooting: trace level (very verbose)
# Set LOG_LEVEL=trace in the affected service's Railway variables, redeploy, then follow the logs:
railway logs --service ai-worker
Related Skills
- tzurot-security - Security logging (no PII, no tokens)
- tzurot-async-flow - Job correlation IDs
- tzurot-db-vector - Database query logging
- tzurot-constants - Log level constants
References
- Pino documentation: https://getpino.io/
- Railway logs: https://docs.railway.app/reference/cli-api#logs
- Logger utility: packages/common-types/src/utils/logger.ts
- Privacy logging: CLAUDE.md#logging