| name | torchserve |
| description | Model serving engine for PyTorch. Focuses on MAR packaging, custom handlers for preprocessing/inference, and management of multi-GPU worker scaling. (torchserve, mar-file, handler, basehandler, model-archiver, inference-api) |
Overview
TorchServe is a flexible and easy-to-use tool for serving PyTorch models. It provides capabilities for packaging models, scaling workers based on hardware availability, and managing multiple model versions via a REST/gRPC API.
When to Use
Use TorchServe when you need a production-ready inference server that handles multi-GPU load balancing, request batching, and custom preprocessing/postprocessing logic via Python handlers.
Decision Tree
- Do you need custom logic for image resizing or JSON parsing before model inference?
  - OVERRIDE: `preprocess()` in a class inheriting from `BaseHandler`.
- Do you have multiple GPUs available?
  - RELY: On TorchServe's round-robin assignment; check the `gpu_id` in the handler context.
- Do you want to deploy to a system with limited resources?
  - CAUTION: TorchServe is in limited maintenance mode; check environment compatibility before committing to it.
Workflows
Packaging and Serving a Model
- Write a custom handler or use a default one (e.g., `image_classifier`).
- Use `torch-model-archiver` to package the model, weights, and handler into a `.mar` file.
- Start TorchServe, specifying the model store and the initial models to load.
- Test the endpoint using `curl` or a gRPC client (a Python request sketch follows this list).
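A minimal test client for the last step, assuming the quick-start `densenet161` example has been archived and the server started (e.g., `torchserve --start --model-store model_store --models densenet161.mar`); the model name, port, and `kitten.jpg` file are placeholders to adapt to your own deployment.

```python
# Sketch: send one image to the TorchServe inference API (default port 8080).
# "densenet161" and "kitten.jpg" are illustrative placeholders.
import requests

with open("kitten.jpg", "rb") as f:
    resp = requests.post(
        "http://localhost:8080/predictions/densenet161",
        data=f.read(),
        headers={"Content-Type": "application/octet-stream"},
    )

resp.raise_for_status()
print(resp.json())  # e.g. class probabilities from the image_classifier handler
```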
Customizing Inference Logic
- Define a class inheriting from `BaseHandler`.
- Override `preprocess()` to handle incoming JSON/image data.
- Override `inference()` or `postprocess()` to customize output formatting.
- Package this script as the `--handler` in the model archiver (a handler sketch follows this list).
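A minimal handler sketch for these steps. It assumes requests arrive as JSON with an `instances` field holding one feature vector per request; the class name and field are illustrative, and only `BaseHandler` and its `preprocess`/`postprocess` hooks come from TorchServe.

```python
# Sketch of a custom handler; package it via --handler my_handler.py.
# The "instances" JSON field and class name are assumptions for illustration.
import json

import torch
from ts.torch_handler.base_handler import BaseHandler


class MyJSONHandler(BaseHandler):
    def preprocess(self, data):
        # `data` is a batch: one dict per request, payload under "data" or "body".
        rows = []
        for record in data:
            payload = record.get("data") or record.get("body")
            if isinstance(payload, (bytes, bytearray)):
                payload = json.loads(payload)
            rows.append(payload["instances"])
        # self.device is populated by BaseHandler.initialize() before preprocess runs.
        return torch.as_tensor(rows, dtype=torch.float32, device=self.device)

    def postprocess(self, inference_output):
        # Return one JSON-serializable entry per request in the batch.
        return inference_output.argmax(dim=1).tolist()
```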
Scaling Inference Capacity
- Use the Management API (typically on port 8081) to adjust the number of workers.
- Send a `PUT` request to `/models/{model_name}?min_worker=N` (sketched below).
- Monitor logs to ensure new workers are successfully initialized on the available hardware.
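A small sketch against the Management API; the host, port 8081, model name, and worker count are assumptions to adapt to your deployment.

```python
# Sketch: scale a registered model to at least 4 workers via the Management API.
# "densenet161" and localhost:8081 are placeholders for your deployment.
import requests

base = "http://localhost:8081/models/densenet161"

resp = requests.put(base, params={"min_worker": 4})
resp.raise_for_status()
print(resp.json())  # status message acknowledging the scaling request

# Poll the model description to confirm the workers came up on the expected devices.
print(requests.get(base).json())
```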
Non-Obvious Insights
- A/B Testing: TorchServe naturally supports multiple model versions simultaneously, making it trivial to perform A/B testing by routing requests to different model endpoints.
- GPU Round-Robin: Workers are assigned GPUs in a round-robin fashion. Handlers must use the `gpu_id` provided in the `context` to ensure the model is loaded onto the correct physical device (see the sketch after this list).
- The MAR Format: The Model Archive (`.mar`) file is a self-contained ZIP that includes the model definition, state dictionary, and the handler script, so the deployed code and weights are exactly the ones packaged during development.
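A sketch of the device-selection pattern for a from-scratch handler: read `gpu_id` out of `context.system_properties` and load the weights onto that device. The `model.pt` filename is an assumption about what was packaged into the `.mar`.

```python
# Sketch: pick the GPU that TorchServe assigned to this worker (round-robin).
# A complete from-scratch handler also needs a handle(data, context) method;
# "model.pt" is an assumed artifact name inside the extracted .mar directory.
import os

import torch


class ModelHandler:
    def initialize(self, context):
        properties = context.system_properties
        gpu_id = properties.get("gpu_id")
        # Honour the round-robin assignment; fall back to CPU when no GPU is visible.
        self.device = torch.device(
            f"cuda:{gpu_id}" if torch.cuda.is_available() and gpu_id is not None else "cpu"
        )
        model_dir = properties.get("model_dir")  # extracted .mar contents
        self.model = torch.jit.load(
            os.path.join(model_dir, "model.pt"), map_location=self.device
        )
        self.model.eval()
```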
Evidence
- "Archive the model by using the model archiver: torch-model-archiver --model-name densenet161 --version 1.0..." (https://pytorch.org/serve/getting_started.html)
- "In case of multiple GPUs TorchServe selects the gpu device in round-robin fashion and passes on this device id to the model handler in context." (https://pytorch.org/serve/custom_service.html)
Scripts
- scripts/torchserve_tool.py: Skeleton for a custom TorchServe handler.
- scripts/torchserve_tool.js: Script to send inference requests to a running TorchServe instance.
Dependencies
- torchserve
- torch-model-archiver