| name | pytorch-lightning |
| description | High-level training framework for PyTorch that abstracts boilerplate while maintaining flexibility. Includes the Trainer, LightningModule, and support for multi-GPU scaling and reproducibility. (lightning, pytorch-lightning, lightningmodule, trainer, callback, ddp, fast_dev_run, seed_everything) |
Overview
PyTorch Lightning is a lightweight wrapper for PyTorch that decouples the research code from the engineering code. It automates 40+ engineering details like epoch loops, optimization, and hardware acceleration, while allowing researchers to retain full control over the model logic.
When to Use
Use Lightning when you want to scale models to multi-GPU or multi-node environments without changing training code, or when you want to eliminate boilerplate for logging, checkpointing, and reproducibility.
Decision Tree
- Are you testing code logic on a small subset?
  - YES: Use `Trainer(fast_dev_run=True)` (see the sketch after this list).
- Do you need to scale to multiple GPUs?
  - YES: Set `accelerator='gpu'` and `devices=N` in the Trainer.
- Do you have logic that is non-essential to the model (e.g., special logging)?
  - YES: Implement it as a `Callback`.
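A minimal sketch of the three branches using only Trainer construction; the device count and the `PrintEpochCallback` class are illustrative placeholders, not part of the library:

```python
import lightning as L
from lightning.pytorch.callbacks import Callback

# Branch 1: smoke-test code logic -- runs a single batch through each loop,
# then exits without writing checkpoints or logs.
debug_trainer = L.Trainer(fast_dev_run=True)

# Branch 2: scale to multiple GPUs -- N = 4 is an arbitrary example.
gpu_trainer = L.Trainer(accelerator="gpu", devices=4)

# Branch 3: keep non-essential logic (e.g., special logging) out of the model
# by wrapping it in a Callback. PrintEpochCallback is a hypothetical example.
class PrintEpochCallback(Callback):
    def on_train_epoch_end(self, trainer, pl_module):
        print(f"finished epoch {trainer.current_epoch}")

callback_trainer = L.Trainer(callbacks=[PrintEpochCallback()])
```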
Workflows
Transitioning from Raw PyTorch to Lightning
- Define a class inheriting from `L.LightningModule`.
- Move the model architecture into `__init__` and the per-batch logic into `training_step`.
- Implement `configure_optimizers` to return the optimizer and an optional scheduler.
- Instantiate an `L.Trainer` and pass the model and DataLoader to `trainer.fit()` (sketch below).
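A minimal sketch of this workflow, assuming Lightning 2.x (`import lightning as L`); the regression model, synthetic data, and hyperparameters are placeholders:

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset
import lightning as L

class LitRegressor(L.LightningModule):
    def __init__(self):
        super().__init__()
        # Model architecture lives in __init__.
        self.net = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 1))

    def training_step(self, batch, batch_idx):
        # Per-batch logic lives in training_step; Lightning handles
        # loss.backward(), optimizer.step(), and optimizer.zero_grad().
        x, y = batch
        loss = nn.functional.mse_loss(self.net(x), y)
        self.log("train_loss", loss)
        return loss

    def configure_optimizers(self):
        optimizer = torch.optim.Adam(self.parameters(), lr=1e-3)
        scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10)
        return [optimizer], [scheduler]

if __name__ == "__main__":
    dataset = TensorDataset(torch.randn(256, 16), torch.randn(256, 1))
    train_loader = DataLoader(dataset, batch_size=32, shuffle=True)
    trainer = L.Trainer(max_epochs=5)
    trainer.fit(LitRegressor(), train_loader)
```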
Multi-GPU Training Scaling
- Initialize the Trainer with `accelerator='gpu'` and `devices=N`.
- Set `strategy='ddp'` or `'deepspeed_stage_2'` for multi-node/large-scale runs.
- Optionally enable 16-bit mixed precision with `precision='16-mixed'`.
- Run the script without any changes to the standard training logic (sketch below).
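A sketch of the scaling step, reusing the hypothetical `LitRegressor` and `train_loader` from the previous example; the device count, node count, and strategy are illustrative and assume the corresponding hardware is available:

```python
import lightning as L

# Only the Trainer arguments change; the LightningModule stays untouched.
trainer = L.Trainer(
    accelerator="gpu",
    devices=4,                # N GPUs on this node (example value)
    num_nodes=2,              # optional: multi-node run
    strategy="ddp",           # or "deepspeed_stage_2" for very large models
    precision="16-mixed",     # optional 16-bit mixed precision
)
trainer.fit(LitRegressor(), train_loader)
```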
Ensuring Training Reproducibility
- Call `seed_everything(seed, workers=True)` at the start of the script.
- Initialize the Trainer with `deterministic=True` (sketch below).
- Avoid data-dependent logic in transforms that isn't handled by the derived seeds.
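A minimal reproducibility sketch; the seed value is arbitrary:

```python
import lightning as L

# Seeds Python, NumPy, and torch; workers=True also derives unique seeds
# for every dataloader worker process.
L.seed_everything(42, workers=True)

# deterministic=True asks torch to use deterministic algorithms where possible.
trainer = L.Trainer(deterministic=True)
```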
Non-Obvious Insights
- Overhead Analysis: The 'barebones' mode in the Trainer is specifically for overhead analysis and disables almost all logging and checkpointing for speed.
- Callback State Management: Custom callbacks must implement a `state_key` property if multiple instances of the same callback type are used in a single Trainer (see the sketch after this list).
- Superior Debugging: The `fast_dev_run` flag is superior to limiting batches manually because it avoids side effects such as checkpointing and logging that would otherwise clutter the workspace during debugging.
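A sketch of the `state_key` insight: two instances of the same hypothetical `BatchCounter` callback attached to one Trainer, each reporting a distinct key so their checkpointed states don't collide:

```python
import lightning as L
from lightning.pytorch.callbacks import Callback

class BatchCounter(Callback):
    """Illustrative callback that counts training batches."""

    def __init__(self, name: str):
        super().__init__()
        self.name = name
        self.count = 0

    @property
    def state_key(self) -> str:
        # Distinguish instances by the name they were constructed with.
        return f"BatchCounter[name={self.name}]"

    def on_train_batch_end(self, trainer, pl_module, outputs, batch, batch_idx):
        self.count += 1

    def state_dict(self):
        return {"count": self.count}

    def load_state_dict(self, state_dict):
        self.count = state_dict["count"]

trainer = L.Trainer(callbacks=[BatchCounter("phase_a"), BatchCounter("phase_b")])
```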
Evidence
- "The Lightning Trainer automates 40+ tricks including: Epoch and batch iteration, optimizer.step(), loss.backward(), optimizer.zero_grad() calls." (https://lightning.ai/docs/pytorch/stable/starter/introduction.html)
- "By setting workers=True in seed_everything(), Lightning derives unique seeds across all dataloader workers." (https://lightning.ai/docs/pytorch/stable/common/trainer.html)
Scripts
- scripts/pytorch-lightning_tool.py: Template for a LightningModule and Trainer setup.
- scripts/pytorch-lightning_tool.js: Script to trigger Lightning training with CLI arguments.
Dependencies
- lightning
- torch