Claude Code Plugins

Community-maintained marketplace


Serverless GPU cloud platform for running ML workloads. Use when you need on-demand GPU access without infrastructure management, deploying ML models as APIs, or running batch jobs with automatic scaling.

Install Skill

1. Download skill

2. Enable skills in Claude

   Open claude.ai/settings/capabilities and find the "Skills" section

3. Upload to Claude

   Click "Upload skill" and select the downloaded ZIP file

Note: Please verify the skill by reading through its instructions before using it.

SKILL.md

name: modal-serverless-gpu
description: Serverless GPU cloud platform for running ML workloads. Use when you need on-demand GPU access without infrastructure management, deploying ML models as APIs, or running batch jobs with automatic scaling.
version: 1.0.0
author: Orchestra Research
license: MIT
tags: Infrastructure, Serverless, GPU, Cloud, Deployment, Modal
dependencies: modal>=0.64.0

Modal Serverless GPU

Comprehensive guide to running ML workloads on Modal's serverless GPU cloud platform.

When to use Modal

Use Modal when:

  • Running GPU-intensive ML workloads without managing infrastructure
  • Deploying ML models as auto-scaling APIs
  • Running batch processing jobs (training, inference, data processing)
  • Paying only for GPU seconds used, with no idle costs
  • Prototyping ML applications quickly
  • Running scheduled jobs (cron-like workloads)

Key features:

  • Serverless GPUs: T4, L4, A10G, L40S, A100, H100, H200, B200 on-demand
  • Python-native: Define infrastructure in Python code, no YAML
  • Auto-scaling: Scale to zero, scale to 100+ GPUs instantly
  • Sub-second cold starts: Rust-based infrastructure for fast container launches
  • Container caching: Image layers cached for rapid iteration
  • Web endpoints: Deploy functions as REST APIs with zero-downtime updates

Consider alternatives instead:

  • RunPod: For longer-running pods with persistent state
  • Lambda Labs: For reserved GPU instances
  • SkyPilot: For multi-cloud orchestration and cost optimization
  • Kubernetes: For complex multi-service architectures

Quick start

Installation

pip install modal
modal setup  # Opens browser for authentication

Hello World with GPU

import modal

app = modal.App("hello-gpu")

@app.function(gpu="T4")
def gpu_info():
    import subprocess
    return subprocess.run(["nvidia-smi"], capture_output=True, text=True).stdout

@app.local_entrypoint()
def main():
    print(gpu_info.remote())

Run: modal run hello_gpu.py

Basic inference endpoint

import modal

app = modal.App("text-generation")
image = modal.Image.debian_slim().pip_install("transformers", "torch", "accelerate")

@app.cls(gpu="A10G", image=image)
class TextGenerator:
    @modal.enter()
    def load_model(self):
        from transformers import pipeline
        self.pipe = pipeline("text-generation", model="gpt2", device=0)

    @modal.method()
    def generate(self, prompt: str) -> str:
        return self.pipe(prompt, max_length=100)[0]["generated_text"]

@app.local_entrypoint()
def main():
    print(TextGenerator().generate.remote("Hello, world"))

Core concepts

Key components

| Component | Purpose |
|-----------|---------|
| App | Container for functions and resources |
| Function | Serverless function with compute specs |
| Cls | Class-based functions with lifecycle hooks |
| Image | Container image definition |
| Volume | Persistent storage for models/data |
| Secret | Secure credential storage |
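
A minimal sketch of how several of these pieces compose into one app (the volume and secret names below are assumptions for illustration, not part of the original guide):

import modal

app = modal.App("composed-example")

# Image: container definition
image = modal.Image.debian_slim().pip_install("torch")
# Volume: persistent storage (name "model-cache" is illustrative)
volume = modal.Volume.from_name("model-cache", create_if_missing=True)
# Secret: credential storage (assumes a secret named "huggingface" exists)
secret = modal.Secret.from_name("huggingface")

# Function: serverless compute that ties the pieces together
@app.function(gpu="T4", image=image, volumes={"/models": volume}, secrets=[secret])
def infer(prompt: str) -> str:
    import os
    token = os.environ["HF_TOKEN"]  # injected by the secret
    return f"received prompt of length {len(prompt)} (token starts with {token[:4]})"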

Execution modes

| Command | Description |
|---------|-------------|
| modal run script.py | Execute and exit |
| modal serve script.py | Development with live reload |
| modal deploy script.py | Persistent cloud deployment |
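
Once an app has been deployed with modal deploy, its functions can also be invoked from other Python code by name. A hedged sketch, assuming the hello-gpu app from the quick start has already been deployed (recent Modal clients expose Function.from_name; older releases used Function.lookup):

import modal

# Look up the deployed function by app name and function name.
gpu_info = modal.Function.from_name("hello-gpu", "gpu_info")
print(gpu_info.remote())  # runs in Modal's cloud and returns the nvidia-smi output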

GPU configuration

Available GPUs

| GPU | VRAM | Best For |
|-----|------|----------|
| T4 | 16GB | Budget inference, small models |
| L4 | 24GB | Inference, Ada Lovelace arch |
| A10G | 24GB | Training/inference, 3.3x faster than T4 |
| L40S | 48GB | Recommended for inference (best cost/perf) |
| A100-40GB | 40GB | Large model training |
| A100-80GB | 80GB | Very large models |
| H100 | 80GB | Fastest, FP8 + Transformer Engine |
| H200 | 141GB | Auto-upgrade from H100, 4.8TB/s bandwidth |
| B200 | 192GB | Latest Blackwell architecture |

GPU specification patterns

# Single GPU
@app.function(gpu="A100")

# Specific memory variant
@app.function(gpu="A100-80GB")

# Multiple GPUs (up to 8)
@app.function(gpu="H100:4")

# GPU with fallbacks
@app.function(gpu=["H100", "A100", "L40S"])

# Any available GPU
@app.function(gpu="any")

Container images

# Basic image with pip
image = modal.Image.debian_slim(python_version="3.11").pip_install(
    "torch==2.1.0", "transformers==4.36.0", "accelerate"
)

# From CUDA base
image = modal.Image.from_registry(
    "nvidia/cuda:12.1.0-cudnn8-devel-ubuntu22.04",
    add_python="3.11"
).pip_install("torch", "transformers")

# With system packages
image = modal.Image.debian_slim().apt_install("git", "ffmpeg").pip_install("whisper")

Persistent storage

volume = modal.Volume.from_name("model-cache", create_if_missing=True)

@app.function(gpu="A10G", volumes={"/models": volume})
def load_model():
    import os
    model_path = "/models/llama-7b"
    if not os.path.exists(model_path):
        model = download_model()  # placeholder: fetch weights (e.g. from Hugging Face)
        model.save_pretrained(model_path)
        volume.commit()  # Persist changes to the volume
    return load_from_path(model_path)  # placeholder: load the model from the cached path

Web endpoints

FastAPI endpoint decorator

@app.function()
@modal.fastapi_endpoint(method="POST")
def predict(text: str) -> dict:
    return {"result": model.predict(text)}

Full ASGI app

from fastapi import FastAPI
web_app = FastAPI()

@web_app.post("/predict")
async def predict(text: str):
    return {"result": await model.predict.remote.aio(text)}

@app.function()
@modal.asgi_app()
def fastapi_app():
    return web_app

Web endpoint types

| Decorator | Use Case |
|-----------|----------|
| @modal.fastapi_endpoint() | Simple function → API |
| @modal.asgi_app() | Full FastAPI/Starlette apps |
| @modal.wsgi_app() | Django/Flask apps |
| @modal.web_server(port) | Arbitrary HTTP servers |
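
The WSGI path mirrors the ASGI example above; a minimal sketch with Flask (the image, route, and handler here are illustrative assumptions):

flask_image = modal.Image.debian_slim().pip_install("flask")

@app.function(image=flask_image)
@modal.wsgi_app()
def flask_app():
    from flask import Flask
    web_app = Flask(__name__)

    @web_app.get("/health")
    def health():
        return {"status": "ok"}

    return web_app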

Dynamic batching

@app.function()
@modal.batched(max_batch_size=32, wait_ms=100)
async def batch_predict(inputs: list[str]) -> list[dict]:
    # Inputs automatically batched
    return model.batch_predict(inputs)

Secrets management

# Create the secret once from the CLI:
#   modal secret create huggingface HF_TOKEN=hf_xxx

@app.function(secrets=[modal.Secret.from_name("huggingface")])
def download_model():
    import os
    token = os.environ["HF_TOKEN"]

Scheduling

@app.function(schedule=modal.Cron("0 0 * * *"))  # Daily at midnight (UTC)
def daily_job():
    pass

@app.function(schedule=modal.Period(hours=1))
def hourly_job():
    pass

Performance optimization

Cold start mitigation

@app.function(
    container_idle_timeout=300,  # Keep warm 5 min
    allow_concurrent_inputs=10,  # Handle concurrent requests
)
def inference():
    pass

Model loading best practices

@app.cls(gpu="A100")
class Model:
    @modal.enter()  # Run once at container start
    def load(self):
        self.model = load_model()  # Load during warm-up

    @modal.method()
    def predict(self, x):
        return self.model(x)

Parallel processing

@app.function()
def process_item(item):
    return expensive_computation(item)

@app.function()
def run_parallel():
    items = list(range(1000))
    # Fan out to parallel containers
    results = list(process_item.map(items))
    return results

Common configuration

@app.function(
    gpu="A100",
    memory=32768,              # 32GB RAM
    cpu=4,                     # 4 CPU cores
    timeout=3600,              # 1 hour max
    container_idle_timeout=120,  # Keep warm 2 min
    retries=3,                 # Retry on failure
    concurrency_limit=10,      # Max concurrent containers
)
def my_function():
    pass

Debugging

# Test locally
if __name__ == "__main__":
    result = my_function.local()

# View logs
# modal app logs my-app

Common issues

| Issue | Solution |
|-------|----------|
| Cold start latency | Increase container_idle_timeout, use @modal.enter() |
| GPU OOM | Use larger GPU (A100-80GB), enable gradient checkpointing |
| Image build fails | Pin dependency versions, check CUDA compatibility |
| Timeout errors | Increase timeout, add checkpointing |
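
For the timeout row, one common mitigation is checkpointing progress to a Volume so a retried container resumes where it left off. A sketch, where save_checkpoint, load_checkpoint, train_step, and TOTAL_STEPS are hypothetical placeholders:

ckpt_volume = modal.Volume.from_name("train-checkpoints", create_if_missing=True)

@app.function(gpu="A100", timeout=3600, retries=3, volumes={"/ckpt": ckpt_volume})
def train():
    import os
    # Resume from the last checkpoint if a previous attempt timed out.
    step = load_checkpoint("/ckpt/latest.pt") if os.path.exists("/ckpt/latest.pt") else 0  # hypothetical helper
    while step < TOTAL_STEPS:  # hypothetical constant
        train_step(step)       # hypothetical helper
        step += 1
        if step % 500 == 0:
            save_checkpoint("/ckpt/latest.pt", step)  # hypothetical helper
            ckpt_volume.commit()  # persist before a potential timeout/retry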
