| name | pytorch-onnx |
| description | Exporting PyTorch models to ONNX format for cross-platform deployment. Includes handling dynamic axes, graph optimization in ONNX Runtime, and INT8 model quantization. (onnx, onnxruntime, torch.onnx.export, dynamic_axes, constant-folding, edge-deployment) |
Overview
ONNX (Open Neural Network Exchange) is an open format built to represent machine learning models. Exporting PyTorch models to ONNX allows them to be executed in environments without Python or PyTorch, using high-performance engines like ONNX Runtime.
When to Use
Use ONNX for cross-language deployment (C++, Java, C#), edge deployment (mobile/IoT), or to leverage specialized hardware accelerators (like TensorRT) that support ONNX as an input format.
Decision Tree
- Does your model accept variable batch sizes?
  - SPECIFY: `dynamic_axes` in the `torch.onnx.export` call (see the sketch after this list).
- Do you need the fastest possible inference on a CPU?
  - APPLY: Quantization using the ONNX Runtime quantization tool.
- Are you deploying to a C++ environment without Python?
  - EXPORT: To ONNX and load using the ONNX Runtime C++ API.
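A minimal sketch of the `dynamic_axes` mapping referenced above, assuming hypothetical tensor names `"input"` and `"output"` (they must match the `input_names`/`output_names` passed to `torch.onnx.export`); axis 0 is declared as a symbolic batch dimension.

```python
# Hypothetical tensor names; axis 0 (the batch dimension) is left symbolic.
dynamic_axes = {
    "input": {0: "batch_size"},
    "output": {0: "batch_size"},
}
```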
Workflows
Exporting a Model for Cross-Platform Deployment
- Instantiate the PyTorch model and set it to `.eval()`.
- Create a dummy input tensor matching the input shape.
- Call `torch.onnx.export()`, specifying input/output names and dynamic axes.
- Verify the resulting `.onnx` file using a tool like Netron (a full export sketch follows this list).
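A sketch of the export workflow above, assuming a torchvision ResNet-18 as a stand-in model and `model.onnx` as the output path; any single-input `nn.Module` follows the same pattern.

```python
import onnx
import torch
import torchvision

# 1. Instantiate the model and switch to inference mode (freezes dropout / batch-norm behaviour).
model = torchvision.models.resnet18(weights=None)  # stand-in model for illustration
model.eval()

# 2. Dummy input matching the expected shape: (batch, channels, height, width).
dummy_input = torch.randn(1, 3, 224, 224)

# 3. Export with named inputs/outputs and a dynamic batch dimension.
torch.onnx.export(
    model,
    dummy_input,
    "model.onnx",
    input_names=["input"],
    output_names=["output"],
    dynamic_axes={"input": {0: "batch_size"}, "output": {0: "batch_size"}},
    opset_version=17,
)

# 4. Structural check before inspecting the graph visually in Netron.
onnx.checker.check_model(onnx.load("model.onnx"))
```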
Optimizing ONNX Models for Inference
- Load the `.onnx` model into an ONNX Runtime `InferenceSession`.
- Choose an appropriate Execution Provider (e.g., `'CUDAExecutionProvider'`, `'TensorrtExecutionProvider'`).
- Enable graph optimizations like constant folding and node fusion.
- Run inference using the `session.run()` method with input dictionaries (see the sketch after this list).
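A sketch of the inference workflow, assuming the `model.onnx` file and the `"input"` name from the export sketch above; the provider list is ordered by preference, and ONNX Runtime falls back to the CPU provider when the GPU providers are unavailable.

```python
import numpy as np
import onnxruntime as ort

# Enable the full optimization pipeline (constant folding, node fusion, etc.).
sess_options = ort.SessionOptions()
sess_options.graph_optimization_level = ort.GraphOptimizationLevel.ORT_ENABLE_ALL

# Execution Providers are tried in the order given.
session = ort.InferenceSession(
    "model.onnx",
    sess_options=sess_options,
    providers=[
        "TensorrtExecutionProvider",
        "CUDAExecutionProvider",
        "CPUExecutionProvider",
    ],
)

# Inputs are passed as a dict keyed by the names chosen at export time.
batch = np.random.randn(4, 3, 224, 224).astype(np.float32)  # dynamic batch of 4
outputs = session.run(None, {"input": batch})  # None -> return all outputs
print(outputs[0].shape)
```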
Reducing Model Footprint via Quantization
- Export the model to standard ONNX format.
- Use the ONNX Runtime quantization tool to convert FP32 weights to INT8.
- Calibrate the model using a representative dataset to minimize accuracy loss.
- Deploy the quantized `.onnx` model to edge devices for lower latency (a quantization sketch follows this list).
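A sketch of static INT8 quantization with ONNX Runtime's quantization tool, assuming the exported `model.onnx` and a hypothetical `RepresentativeDataReader` fed with random arrays in place of real calibration samples; `quantize_dynamic` is the simpler alternative when no representative dataset is available.

```python
import numpy as np
from onnxruntime.quantization import CalibrationDataReader, QuantType, quantize_static

class RepresentativeDataReader(CalibrationDataReader):
    """Feeds a handful of calibration batches (random placeholders here)."""

    def __init__(self, num_batches: int = 16):
        # Replace the random arrays with preprocessed samples from your real dataset.
        self._batches = iter(
            {"input": np.random.randn(1, 3, 224, 224).astype(np.float32)}
            for _ in range(num_batches)
        )

    def get_next(self):
        # Return {input_name: array} dicts, then None when calibration data is exhausted.
        return next(self._batches, None)

quantize_static(
    model_input="model.onnx",
    model_output="model.int8.onnx",
    calibration_data_reader=RepresentativeDataReader(),
    weight_type=QuantType.QInt8,
)
```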
Non-Obvious Insights
- Static vs. Dynamic: By default, `torch.onnx.export` captures the shape of the dummy input as a static shape. If your application handles varying inputs, you must explicitly define these as dynamic axes (see the sketch after this list).
- Graph Optimization: ONNX Runtime performs "constant folding," which pre-computes parts of the graph that rely on constant values, effectively stripping unnecessary computation before inference starts.
- Serialization Choice: While TorchScript is also an option for PyTorch deployment, ONNX is often preferred for cross-vendor compatibility (e.g., running a model in a web browser using ONNX.js).
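One way to verify the static-vs-dynamic point above: load the exported file with the `onnx` package and check whether each input dimension carries a symbolic name or a frozen value (assumes the `model.onnx` produced earlier).

```python
import onnx

model = onnx.load("model.onnx")
for inp in model.graph.input:
    dims = [
        # dim_param (e.g. "batch_size") marks a dynamic axis; dim_value is a frozen size.
        d.dim_param if d.dim_param else d.dim_value
        for d in inp.type.tensor_type.shape.dim
    ]
    print(inp.name, dims)
```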
Evidence
- "The first step is to export your PyTorch model to ONNX format using the PyTorch ONNX exporter: torch.onnx.export(model, PATH, example)." (https://onnxruntime.ai/docs/tutorials/accelerate-pytorch/pytorch.html)
- "ONNXRuntime applies a series of optimizations to the ONNX graph, combining nodes where possible and factoring out constant values (constant folding)." (https://onnxruntime.ai/docs/tutorials/accelerate-pytorch/pytorch.html)
Scripts
- `scripts/pytorch-onnx_tool.py`: Script to export a model with dynamic axes support.
- `scripts/pytorch-onnx_tool.js`: Node.js interface to run inference via ONNX Runtime.
Dependencies
- torch
- onnx
- onnxruntime