Claude Code Plugins

Community-maintained marketplace


Model serving engine for PyTorch. Focuses on MAR packaging, custom handlers for preprocessing/inference, and management of multi-GPU worker scaling. (torchserve, mar-file, handler, basehandler, model-archiver, inference-api)

Install Skill

1. Download the skill
2. Enable skills in Claude: open claude.ai/settings/capabilities and find the "Skills" section
3. Upload to Claude: click "Upload skill" and select the downloaded ZIP file

Note: Please verify the skill by reviewing its instructions before using it.

SKILL.md

name: torchserve
description: Model serving engine for PyTorch. Focuses on MAR packaging, custom handlers for preprocessing/inference, and management of multi-GPU worker scaling. (torchserve, mar-file, handler, basehandler, model-archiver, inference-api)

Overview

TorchServe is a flexible and easy-to-use tool for serving PyTorch models. It provides capabilities for packaging models, scaling workers based on hardware availability, and managing multiple model versions via a REST/gRPC API.

When to Use

Use TorchServe when you need a production-ready inference server that handles multi-GPU load balancing, request batching, and custom preprocessing/postprocessing logic via Python handlers.

Decision Tree

  1. Do you need custom logic for image resizing or JSON parsing before model inference?
    • OVERRIDE: preprocess() in a class inheriting from BaseHandler.
  2. Do you have multiple GPUs available?
    • RELY: On TorchServe's round-robin assignment; check the gpu_id in the handler context (see the sketch after this list).
  3. Do you want to deploy to a system with limited resources?
    • CAUTION: TorchServe is in limited maintenance; check environment compatibility.
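
For decision point 2, a minimal sketch of a handler reading its assigned gpu_id from the context. BaseHandler's default initialize() already applies this mapping; the sketch only makes the pattern explicit, and MyHandler is a hypothetical name:

```python
import torch
from ts.torch_handler.base_handler import BaseHandler

class MyHandler(BaseHandler):
    def initialize(self, context):
        # Workers are assigned GPUs round-robin; the assigned id arrives
        # in the context's system properties.
        gpu_id = context.system_properties.get("gpu_id")
        self.device = torch.device(
            f"cuda:{gpu_id}"
            if torch.cuda.is_available() and gpu_id is not None
            else "cpu"
        )
        # ...load the model weights onto self.device here...
```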

Workflows

  1. Packaging and Serving a Model

    1. Write a custom handler or use a default one (e.g., 'image_classifier').
    2. Use torch-model-archiver to package the model, weights, and handler into a .mar file.
    3. Start TorchServe specifying the model store and the initial models to load.
    4. Test the endpoint using curl or a gRPC client (see the command sketch after this list).
  2. Customizing Inference Logic

    1. Define a class inheriting from BaseHandler.
    2. Override preprocess() to handle incoming JSON/Image data.
    3. Override inference() or postprocess() to customize output formatting.
    4. Package this script as the --handler in the model archiver (see the handler sketch after this list).
  3. Scaling Inference Capacity

    1. Use the Management API (typically on port 8081) to adjust the number of workers.
    2. Send a PUT request to /models/{model_name}?min_worker=N.
    3. Monitor the logs to confirm that the new workers initialize successfully on the available hardware (see the request sketch after this list).
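
A sketch of workflow 1 as shell commands, assuming placeholder names (mnist, model.py, model.pth, handler.py, model_store):

```bash
# Steps 1-2: package the model definition, weights, and handler into a .mar
torch-model-archiver --model-name mnist --version 1.0 \
  --model-file model.py --serialized-file model.pth \
  --handler handler.py --export-path model_store

# Step 3: start TorchServe with the model store and load the model
torchserve --start --model-store model_store --models mnist=mnist.mar

# Step 4: send a test request to the Inference API (default port 8080)
curl http://localhost:8080/predictions/mnist -T sample.png
```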
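
A sketch of workflow 2: a custom handler inheriting from BaseHandler. The JSON field name "values" is an assumption about the request payload:

```python
import json

import torch
from ts.torch_handler.base_handler import BaseHandler

class JSONHandler(BaseHandler):
    def preprocess(self, data):
        # Each batched request arrives as a dict with its payload under
        # "data" or "body"; decode raw bytes to JSON before tensorizing.
        rows = []
        for record in data:
            payload = record.get("data") or record.get("body")
            if isinstance(payload, (bytes, bytearray)):
                payload = json.loads(payload)
            rows.append(payload["values"])  # hypothetical field name
        return torch.as_tensor(rows, dtype=torch.float32).to(self.device)

    def postprocess(self, inference_output):
        # Must return one entry per request in the batch.
        return inference_output.argmax(dim=1).tolist()
```

Pass the file containing this class to torch-model-archiver via --handler.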
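
A sketch of workflow 3 against the Management API, with mnist as a placeholder model name:

```bash
# Scale the pool to at least 4 workers; synchronous=true blocks until
# the new workers are up.
curl -X PUT "http://localhost:8081/models/mnist?min_worker=4&synchronous=true"

# Inspect the model's worker status
curl http://localhost:8081/models/mnist
```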

Non-Obvious Insights

  • A/B Testing: TorchServe naturally supports serving multiple model versions simultaneously, making it easy to A/B test by routing requests to version-specific endpoints (see the sketch after this list).
  • GPU Round-Robin: Workers are assigned GPUs in a round-robin fashion. Handlers must use the gpu_id provided in the context to ensure the model is loaded onto the correct physical device.
  • The MAR Format: The Model Archive (.mar) file is a self-contained ZIP bundling the model definition, state dictionary, and handler script, so the code and weights served in deployment are exactly the ones packaged during development.
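
A sketch of the A/B-testing insight, assuming two archives built with the same --model-name (mnist) but different --version values:

```bash
# Register both versions via the Management API
curl -X POST "http://localhost:8081/models?url=mnist_v1.mar&initial_workers=1"
curl -X POST "http://localhost:8081/models?url=mnist_v2.mar&initial_workers=1"

# Route traffic explicitly by version on the Inference API
curl http://localhost:8080/predictions/mnist/1.0 -T sample.png
curl http://localhost:8080/predictions/mnist/2.0 -T sample.png
```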

Evidence

Scripts

  • scripts/torchserve_tool.py: Skeleton for a custom TorchServe handler.
  • scripts/torchserve_tool.js: Script to send inference requests to a running TorchServe instance.

Dependencies

  • torchserve
  • torch-model-archiver
