Claude Code Plugins

Community-maintained marketplace

Feedback

monitoring-dashboard

@vanman2024/ai-dev-marketplace
0
0

Training monitoring dashboard setup with TensorBoard and Weights & Biases (WandB) including real-time metrics tracking, experiment comparison, hyperparameter visualization, and integration patterns. Use when setting up training monitoring, tracking experiments, visualizing metrics, comparing model runs, or when user mentions TensorBoard, WandB, training metrics, experiment tracking, or monitoring dashboard.

Install Skill

1Download skill
2Enable skills in Claude

Open claude.ai/settings/capabilities and find the "Skills" section

3Upload to Claude

Click "Upload skill" and select the downloaded ZIP file

Note: Please verify skill by going through its instructions before using it.

SKILL.md

name monitoring-dashboard
description Training monitoring dashboard setup with TensorBoard and Weights & Biases (WandB) including real-time metrics tracking, experiment comparison, hyperparameter visualization, and integration patterns. Use when setting up training monitoring, tracking experiments, visualizing metrics, comparing model runs, or when user mentions TensorBoard, WandB, training metrics, experiment tracking, or monitoring dashboard.
allowed-tools Read, Write, Bash, Grep, Glob

Monitoring Dashboard

Purpose: Provide complete monitoring dashboard templates and setup scripts for ML training with TensorBoard and Weights & Biases (WandB).

Activation Triggers:

  • Setting up training monitoring dashboards
  • Tracking experiments and metrics in real-time
  • Comparing multiple training runs
  • Visualizing hyperparameters and results
  • Integrating monitoring into existing training pipelines
  • Logging custom metrics, images, and model artifacts

Key Resources:

  • scripts/setup-tensorboard.sh - Install and configure TensorBoard
  • scripts/setup-wandb.sh - Install and configure Weights & Biases
  • scripts/launch-monitoring.sh - Launch monitoring dashboards
  • templates/tensorboard-config.yaml - TensorBoard configuration template
  • templates/wandb-config.py - WandB integration template
  • templates/logging-config.json - Unified logging configuration
  • examples/tensorboard-integration.md - Complete TensorBoard integration guide
  • examples/wandb-integration.md - Complete WandB integration guide

Quick Start

1. Choose Monitoring Solution

TensorBoard (Local/Open Source):

  • Free, runs locally
  • Best for: Single-user development, offline work
  • Features: Metrics, histograms, graphs, images, embeddings
  • Storage: Local filesystem

Weights & Biases (Cloud/Collaboration):

  • Free tier available, cloud-hosted
  • Best for: Team collaboration, experiment comparison, production
  • Features: All TensorBoard features + collaboration, alerts, reports
  • Storage: Cloud with unlimited history

Both (Recommended for Production):

  • Use TensorBoard for local development
  • Use WandB for team collaboration and production tracking

2. Setup TensorBoard

# Install and configure TensorBoard
./scripts/setup-tensorboard.sh

# Launch TensorBoard
./scripts/launch-monitoring.sh tensorboard --logdir ./runs

Access: Open browser to http://localhost:6006

3. Setup Weights & Biases

# Install and configure WandB
./scripts/setup-wandb.sh

# Login with API key
wandb login

# Launch monitoring
./scripts/launch-monitoring.sh wandb

Access: Dashboard at https://wandb.ai/your-username/your-project

TensorBoard Integration

Basic Setup

Template: templates/tensorboard-config.yaml

from torch.utils.tensorboard import SummaryWriter
import datetime

# Create TensorBoard writer
log_dir = f"runs/experiment_{datetime.datetime.now().strftime('%Y%m%d-%H%M%S')}"
writer = SummaryWriter(log_dir=log_dir)

# Log scalar metrics
writer.add_scalar('Loss/train', train_loss, epoch)
writer.add_scalar('Loss/validation', val_loss, epoch)
writer.add_scalar('Accuracy/train', train_acc, epoch)
writer.add_scalar('Accuracy/validation', val_acc, epoch)

# Log learning rate
writer.add_scalar('Learning_Rate', optimizer.param_groups[0]['lr'], epoch)

# Close writer when done
writer.close()

Advanced Logging

Histograms (Weight Distributions):

# Log model weights
for name, param in model.named_parameters():
    writer.add_histogram(f'weights/{name}', param, epoch)
    writer.add_histogram(f'gradients/{name}', param.grad, epoch)

Images:

# Log sample predictions
writer.add_image('predictions', image_grid, epoch)
writer.add_images('batch_samples', image_batch, epoch)

Text:

# Log hyperparameters as text
config_text = '\n'.join([f'{k}: {v}' for k, v in config.items()])
writer.add_text('hyperparameters', config_text, 0)

Model Graph:

# Log model architecture
writer.add_graph(model, input_tensor)

Embeddings (t-SNE, PCA):

# Visualize embeddings
writer.add_embedding(embeddings, metadata=labels, label_img=images)

Launch TensorBoard

# Basic launch
tensorboard --logdir runs

# Specify port
tensorboard --logdir runs --port 6007

# Load faster (sample data)
tensorboard --logdir runs --samples_per_plugin scalars=1000

# Enable reload
tensorboard --logdir runs --reload_interval 5

Weights & Biases Integration

Basic Setup

Template: templates/wandb-config.py

import wandb

# Initialize WandB run
wandb.init(
    project="my-ml-project",
    name=f"experiment-{datetime.now().strftime('%Y%m%d-%H%M%S')}",
    config={
        "learning_rate": 0.001,
        "epochs": 100,
        "batch_size": 32,
        "model": "resnet50",
        "dataset": "imagenet"
    }
)

# Log metrics
wandb.log({
    "train_loss": train_loss,
    "val_loss": val_loss,
    "train_acc": train_acc,
    "val_acc": val_acc,
    "epoch": epoch
})

# Finish run
wandb.finish()

Advanced Features

Log Media:

# Log images
wandb.log({"predictions": [wandb.Image(img, caption=f"Pred: {pred}")]})

# Log tables
table = wandb.Table(columns=["epoch", "loss", "accuracy"], data=data)
wandb.log({"results_table": table})

# Log audio
wandb.log({"audio": wandb.Audio(audio_array, sample_rate=16000)})

# Log videos
wandb.log({"video": wandb.Video(video_path, fps=30)})

Track Model Artifacts:

# Save model checkpoint
artifact = wandb.Artifact('model-checkpoint', type='model')
artifact.add_file('model.pth')
wandb.log_artifact(artifact)

# Load model from artifact
artifact = wandb.use_artifact('model-checkpoint:latest')
model_path = artifact.download()

Hyperparameter Sweeps:

# Define sweep configuration
sweep_config = {
    'method': 'bayes',
    'metric': {'name': 'val_loss', 'goal': 'minimize'},
    'parameters': {
        'learning_rate': {'min': 0.0001, 'max': 0.1},
        'batch_size': {'values': [16, 32, 64]},
        'optimizer': {'values': ['adam', 'sgd', 'adamw']}
    }
}

# Initialize sweep
sweep_id = wandb.sweep(sweep_config, project="my-project")

# Run sweep agent
wandb.agent(sweep_id, function=train_model, count=10)

Custom Charts:

# Create custom plot
data = [[x, y] for (x, y) in zip(x_values, y_values)]
table = wandb.Table(data=data, columns=["x", "y"])
wandb.log({
    "custom_plot": wandb.plot.line(table, "x", "y", title="Custom Plot")
})

Alerts:

# Alert on metric threshold
if val_loss < 0.1:
    wandb.alert(
        title="Low Validation Loss",
        text=f"Validation loss dropped to {val_loss:.4f}",
        level=wandb.AlertLevel.INFO
    )

Unified Logging Configuration

Template: templates/logging-config.json

Use this configuration to log to both TensorBoard and WandB simultaneously:

import wandb
from torch.utils.tensorboard import SummaryWriter

class UnifiedLogger:
    def __init__(self, project_name, experiment_name, config):
        # TensorBoard
        self.tb_writer = SummaryWriter(
            log_dir=f"runs/{experiment_name}"
        )

        # WandB
        wandb.init(
            project=project_name,
            name=experiment_name,
            config=config
        )

    def log_metrics(self, metrics_dict, step):
        """Log to both TensorBoard and WandB"""
        # TensorBoard
        for key, value in metrics_dict.items():
            self.tb_writer.add_scalar(key, value, step)

        # WandB
        wandb.log(metrics_dict, step=step)

    def log_images(self, images_dict, step):
        """Log images to both platforms"""
        for key, image in images_dict.items():
            # TensorBoard
            self.tb_writer.add_image(key, image, step)

            # WandB
            wandb.log({key: wandb.Image(image)}, step=step)

    def log_model(self, model, input_sample):
        """Log model architecture"""
        # TensorBoard graph
        self.tb_writer.add_graph(model, input_sample)

        # WandB watches gradients
        wandb.watch(model, log="all", log_freq=100)

    def close(self):
        """Close both loggers"""
        self.tb_writer.close()
        wandb.finish()

# Usage
logger = UnifiedLogger(
    project_name="my-project",
    experiment_name="exp-001",
    config={"lr": 0.001, "batch_size": 32}
)

logger.log_metrics({
    "train_loss": 0.5,
    "val_loss": 0.6
}, step=epoch)

logger.close()

Common Monitoring Patterns

1. Training Loop Integration

for epoch in range(num_epochs):
    # Training phase
    model.train()
    train_loss = 0
    for batch_idx, (data, target) in enumerate(train_loader):
        loss = train_step(model, data, target, optimizer)
        train_loss += loss.item()

        # Log batch-level metrics
        global_step = epoch * len(train_loader) + batch_idx
        logger.log_metrics({
            "batch_loss": loss.item(),
            "learning_rate": optimizer.param_groups[0]['lr']
        }, step=global_step)

    # Validation phase
    model.eval()
    val_loss, val_acc = validate(model, val_loader)

    # Log epoch-level metrics
    logger.log_metrics({
        "epoch": epoch,
        "train_loss": train_loss / len(train_loader),
        "val_loss": val_loss,
        "val_acc": val_acc
    }, step=epoch)

    # Log model weights distribution
    for name, param in model.named_parameters():
        logger.tb_writer.add_histogram(f'weights/{name}', param, epoch)

2. Experiment Comparison

TensorBoard:

# Compare multiple runs
tensorboard --logdir_spec \
  exp1:runs/experiment_1,\
  exp2:runs/experiment_2,\
  exp3:runs/experiment_3

WandB:

# Automatically compares all runs in project dashboard
# Filter and group runs by tags, config values, or custom fields

3. Real-Time Monitoring

TensorBoard:

# Auto-reload new data
tensorboard --logdir runs --reload_interval 5

WandB:

# Real-time by default
# Enable email/slack alerts for key metrics
wandb.alert(
    title="Training Alert",
    text=f"Accuracy reached {acc:.2%}",
    level=wandb.AlertLevel.INFO
)

Best Practices

1. Metric Naming Conventions

Organize by category:

# Good: Hierarchical naming
"Loss/train"
"Loss/validation"
"Accuracy/train"
"Accuracy/validation"
"Metrics/precision"
"Metrics/recall"

# Bad: Flat naming
"train_loss"
"validation_loss"
"train_accuracy"

2. Logging Frequency

Guidelines:

  • Scalars: Every batch or every N batches
  • Histograms: Every epoch
  • Images: Every epoch or every N epochs
  • Model graph: Once at start
  • Embeddings: Once per major checkpoint
# Log batch metrics every 10 batches
if batch_idx % 10 == 0:
    logger.log_metrics({"batch_loss": loss}, step)

# Log epoch metrics
if batch_idx == len(train_loader) - 1:
    logger.log_metrics({"epoch_loss": epoch_loss}, epoch)

# Log images every 5 epochs
if epoch % 5 == 0:
    logger.log_images({"samples": sample_images}, epoch)

3. Disk Space Management

TensorBoard:

# Limit log retention
find runs/ -type d -mtime +30 -exec rm -rf {} +

# Compress old logs
tar -czf archive_$(date +%Y%m%d).tar.gz runs/old_experiments/
rm -rf runs/old_experiments/

WandB:

# Cloud storage handles retention
# Configure retention in project settings
# Download important runs for local backup
wandb.restore('model.pth', run_path="user/project/run_id")

4. Security & Privacy

TensorBoard:

# Restrict access to localhost only
tensorboard --logdir runs --host 127.0.0.1

# Or use SSH tunnel for remote access
ssh -L 6006:localhost:6006 user@remote-server

WandB:

# Use private projects
wandb.init(project="my-project", entity="private-team")

# Disable cloud sync for sensitive data
wandb.init(mode="offline")  # Logs locally only

Troubleshooting

TensorBoard Issues

Problem: Dashboard not updating

# Force reload
tensorboard --logdir runs --reload_interval 1

# Clear cache
rm -rf /tmp/.tensorboard-info/

Problem: Port already in use

# Use different port
tensorboard --logdir runs --port 6007

# Or kill existing process
pkill -f tensorboard

WandB Issues

Problem: Login fails

# Re-login with API key
wandb login --relogin

# Or set via environment
export WANDB_API_KEY=your_api_key

Problem: Slow logging

# Reduce logging frequency
wandb.init(settings=wandb.Settings(
    _disable_stats=True,  # Disable system metrics
    _disable_meta=True    # Disable metadata
))

Scripts Usage

Setup TensorBoard

./scripts/setup-tensorboard.sh

# Verifies:
# - Python environment
# - TensorBoard installation
# - Creates default log directory structure

Setup WandB

./scripts/setup-wandb.sh

# Verifies:
# - WandB installation
# - API key configuration
# - Creates wandb config file

Launch Monitoring

# TensorBoard
./scripts/launch-monitoring.sh tensorboard --logdir ./runs --port 6006

# WandB (opens browser to dashboard)
./scripts/launch-monitoring.sh wandb --project my-project

# Both
./scripts/launch-monitoring.sh both --logdir ./runs --project my-project

Resources

Scripts:

  • setup-tensorboard.sh - Install and configure TensorBoard
  • setup-wandb.sh - Install and configure WandB
  • launch-monitoring.sh - Launch monitoring dashboards

Templates:

  • tensorboard-config.yaml - TensorBoard setup configuration
  • wandb-config.py - WandB integration template
  • logging-config.json - Unified logging configuration

Examples:

  • tensorboard-integration.md - Complete TensorBoard integration
  • wandb-integration.md - Complete WandB integration with sweeps

Supported Frameworks: PyTorch, TensorFlow, JAX, Hugging Face Transformers Python Version: 3.8+ Best Practice: Use both TensorBoard (local dev) and WandB (team collaboration)