| name | torchtext |
| description | Natural Language Processing utilities for PyTorch (Legacy). Includes tokenizers, vocabulary building, and DataPipe-based dataset handling for text processing pipelines. (torchtext, tokenizer, vocab, datapipe, regextokenizer, nlp-pipeline) |
Overview
TorchText is a legacy library for NLP in PyTorch. Although its development has stopped, it remains a common tool for handling classic NLP datasets and building vocabularies via DataPipes.
When to Use
Use TorchText for maintaining legacy NLP projects or when utilizing its built-in DataPipe-based datasets. For new projects, transitioning to native PyTorch or other modern NLP libraries is recommended.
Decision Tree
- Are you starting a new NLP project?
  - CONSIDER: Using Hugging Face or native PyTorch instead of TorchText.
- Do you need a high-performance tokenizer for production?
  - USE: `RegexTokenizer` and compile it with `torch.jit.script`.
- Are you using DataPipes with multiple workers?
  - ENSURE: Use a proper `worker_init_fn` in the `DataLoader` to avoid data duplication.
Workflows
Building a Text Processing Pipeline
- Initialize a tokenizer (e.g., `BERTTokenizer`).
- Construct a `Vocab` object using `build_vocab_from_iterator` from a dataset.
- Create a pipeline using `transforms.Sequential` containing: Tokenizer -> VocabTransform -> AddToken -> Truncate -> ToTensor.
- Pass raw strings through the pipeline to get padded tensors, as in the sketch below.
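A minimal sketch of this workflow, assuming a torchtext release that ships `torchtext.transforms` (0.12+). For self-containment it uses the built-in `basic_english` tokenizer applied before the `Sequential` (it is a plain callable, not a transform module) rather than `BERTTokenizer`; the toy corpus, special tokens, and `max_seq_len` value are made up for illustration.

```python
import torchtext.transforms as T
from torchtext.data.utils import get_tokenizer
from torchtext.vocab import build_vocab_from_iterator

# Toy corpus; in practice this would be the training split of a dataset.
corpus = [
    "torchtext builds simple text pipelines",
    "padding keeps every batch rectangular",
]

tokenizer = get_tokenizer("basic_english")   # plain callable, applied before the pipeline

vocab = build_vocab_from_iterator(
    (tokenizer(line) for line in corpus),
    specials=["<unk>", "<pad>", "<bos>", "<eos>"],
)
vocab.set_default_index(vocab["<unk>"])      # unseen tokens map to <unk>

text_transform = T.Sequential(
    T.VocabTransform(vocab),                 # tokens -> integer indices
    T.Truncate(max_seq_len=16),              # clip overly long sequences
    T.AddToken(vocab["<bos>"], begin=True),  # prepend begin-of-sequence index
    T.AddToken(vocab["<eos>"], begin=False), # append end-of-sequence index
    T.ToTensor(padding_value=vocab["<pad>"]),# pad the batch into one LongTensor
)

batch = [tokenizer(line) for line in corpus]
print(text_transform(batch))                 # shape: (batch_size, longest_sequence)
```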
Using Built-in NLP Datasets
- Import a dataset from `torchtext.datasets` (e.g., IMDB, AG_NEWS).
- Initialize the `DataPipe` for the desired split ('train', 'test').
- Set up a `DataLoader` with `shuffle=True` and a proper `worker_init_fn`.
- Iterate through the `DataPipe` to get `(label, text)` pairs, as in the sketch below.
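A sketch of this flow using AG_NEWS, assuming a torch/torchdata combination in which the DataPipe-based datasets still work and `torch.utils.data.backward_compatibility.worker_init_fn` is available (it applies sharding so each worker sees a distinct portion of the data). The batch size and worker count are arbitrary.

```python
from torch.utils.data import DataLoader
from torch.utils.data.backward_compatibility import worker_init_fn
from torchtext.datasets import AG_NEWS

train_dp = AG_NEWS(split="train")    # DataPipe yielding (label, text) pairs

loader = DataLoader(
    train_dp,
    batch_size=8,
    shuffle=True,                    # enables shuffling inside the DataPipe graph
    num_workers=2,
    worker_init_fn=worker_init_fn,   # shards the DataPipe across workers
)

for labels, texts in loader:
    print(labels)                    # tensor of class ids (1-4 for AG_NEWS)
    print(texts[0][:60])             # first news snippet in the batch
    break
```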
Custom Regex Tokenization
- Define a list of regex patterns and their replacements.
- Instantiate `RegexTokenizer` with the patterns.
- Optionally use `torch.jit.script` to compile the tokenizer for production.
- Apply the tokenizer to raw strings to generate tokens, as in the sketch below.
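A short sketch following the pattern shown in the torchtext transforms documentation; the regex patterns and the sample sentence are illustrative.

```python
import torch
from torchtext.transforms import RegexTokenizer

# Each entry is (regex_pattern, replacement); replacements are applied in order
# (backed by the C++ RE2 engine), then the result is split on whitespace.
patterns_list = [
    (r"\'", " '  "),   # pad apostrophes so they become separate tokens
    (r"\"", ""),       # drop double quotes
]
tokenizer = RegexTokenizer(patterns_list)

# Optional: compile to TorchScript so the tokenizer can run without Python.
scripted_tokenizer = torch.jit.script(tokenizer)

print(scripted_tokenizer("Basic Regex Tokenization for a Line of Text"))
```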
Non-Obvious Insights
- Maintenance Status: Development of TorchText stopped as of April 2024 (v0.18), marking it as a legacy library.
- Data Duplication Risk: DataPipe-based datasets require explicit handling in the `DataLoader` (via worker initialization) to ensure that multiple workers don't serve the same data shards.
- Inference Speed: Many transforms like `BERTTokenizer` are reimplemented in TorchScript, allowing for high-performance inference without a full Python runtime.
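A sketch of scripting `BERTTokenizer`, assuming it is scriptable like the other torchtext transforms and that the vocab file URL from the torchtext documentation is reachable.

```python
import torch
from torchtext.transforms import BERTTokenizer

# Vocab file URL as used in the torchtext transforms documentation.
VOCAB_FILE = "https://huggingface.co/bert-base-uncased/resolve/main/vocab.txt"
tokenizer = BERTTokenizer(vocab_path=VOCAB_FILE, do_lower_case=True, return_tokens=True)

# Scripting yields a TorchScript module whose tokenization logic runs outside Python.
scripted_tokenizer = torch.jit.script(tokenizer)
print(scripted_tokenizer("Hello World, How are you!"))
```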
Evidence
- "Warning TorchText development is stopped and the 0.18 release (April 2024) will be the last stable release." (https://pytorch.org/text/stable/index.html)
- "RegexTokenizer: Regex tokenizer for a string sentence that applies all regex replacements... backed by the C++ RE2 engine." (https://pytorch.org/text/stable/transforms.html)
Scripts
- `scripts/torchtext_tool.py`: Example of building a vocabulary and tokenizer pipeline.
- `scripts/torchtext_tool.js`: Node.js interface for invoking TorchText pipelines.
Dependencies
- torchtext
- torch
- torchdata