| name | subtitle-to-document |
| description | Convert subtitle files (WebVTT .vtt and SubRip .srt) into clean, readable text documents with multi-language support (English, Traditional Chinese, Simplified Chinese). Auto-detects format and language, removes timestamps and formatting, merges captions into natural paragraphs, strips annotations, and fixes spacing appropriately for each language. Use when the user provides a subtitle file or asks to convert captions to text. |
Subtitles to Document
Convert WebVTT (.vtt) and SubRip (.srt) subtitle files into clean, readable text documents with natural paragraph flow and multi-language support.
Supported Languages
- English (en) - Automatic capitalization, space-based word separation
- Traditional Chinese (zh_tw / \u7e41\u9ad4\u4e2d\u6587) - Chinese punctuation detection, no spaces between characters
- Simplified Chinese (zh_cn / \u7b80\u4f53\u4e2d\u6587) - Chinese punctuation detection, no spaces between characters
The script auto-detects the subtitle format (VTT or SRT) and the language by default.
Quick Start
Auto-detect Format and Language (Default)
When the user provides a subtitle file path, run the conversion script:
python scripts/subtitle_to_text.py input.vtt
# or
python scripts/subtitle_to_text.py input.srt
This creates input.txt in the same directory with cleaned, formatted text.
Specify Language or Format
# English
python scripts/subtitle_to_text.py input.srt --lang en
# Traditional Chinese
python scripts/subtitle_to_text.py input.vtt --lang zh_tw
# Force SRT parsing
python scripts/subtitle_to_text.py input.srt --format srt
Specify Custom Output Path
python scripts/subtitle_to_text.py input.srt output.txt --lang zh_tw
What the Script Does
The subtitle_to_text.py script automatically performs format- and language-aware processing:
All Formats
- Removes timestamps - Strips VTT/SRT timing information (e.g.,
00:00:00.000 --> 00:00:02.500or00:00:00,000 --> 00:00:02,500) - Removes subtitle formatting - Strips tags like
<v Speaker>,<c>,<i>, etc. - Removes duplicates - Detects and removes consecutive duplicate captions
Language-Specific Processing
English (en)
- Sentence detection:
.,!,? - Annotations removed:
[Music],[Applause],[Laughter],[Inaudible] - Capitalization: Auto-capitalizes sentences and fixes
i->I - Spacing: Adds spaces between words, fixes punctuation spacing
Traditional Chinese (zh_tw)
- Sentence detection:
\u3002,\uff01,\uff1f - Annotations removed:
[...](Chinese and English annotations) - Capitalization: Not applicable
- Spacing: Removes spaces between Chinese characters, preserves spacing around embedded English words
Simplified Chinese (zh_cn)
- Same behavior as Traditional Chinese, using simplified character indicators for detection
Format Detection
Format detection uses file extension (.vtt / .srt) and content heuristics (WEBVTT header, presence of --> timestamps). You can override detection with --format.
If auto-detection is incorrect, override with --format srt or --format vtt, or set --lang to force language.
Workflow
When a user requests subtitle conversion:
- Confirm the input subtitle file path exists
- Determine the format (auto-detect or ask user to override)
- Determine language (auto-detect or ask user to override)
- Run the script with appropriate flags if needed
- Display a preview of the converted text
- Confirm the output file location
Examples
Example 1: SRT (English)
Input SRT:
1
00:00:00,000 --> 00:00:02,500
Hello and welcome to this tutorial.
2
00:00:02,500 --> 00:00:05,000
Today we're going to learn
3
00:00:05,000 --> 00:00:07,500
about video text tracks.
Output Text:
Hello and welcome to this tutorial.
Today we're going to learn about video text tracks.
Example 2: VTT (Traditional Chinese)
Use --lang zh_tw or allow auto-detection.
Using from Python
The script can also be imported as a module:
from scripts.subtitle_to_text import convert_subtitles_to_text
# Auto-detect format & language
text = convert_subtitles_to_text('input.srt', 'output.txt')
# Specify language or format
text = convert_subtitles_to_text('input.vtt', 'output.txt', lang='zh_tw', fmt='vtt')
# Backwards-compatible alias
from scripts.subtitle_to_text import convert_vtt_to_text
text = convert_vtt_to_text('input.vtt', 'output.txt')
Troubleshooting
Wrong format detected: Use --format srt or --format vtt to override:
python scripts/subtitle_to_text.py input.srt --format srt
Wrong language detected: Use --lang to override:
python scripts/subtitle_to_text.py input.srt --lang zh_tw
Encoding errors: The script uses UTF-8 encoding by default. If the subtitle file uses a different encoding, convert it to UTF-8 first.
Missing paragraphs: The script merges based on language-specific sentence-ending punctuation. If captions lack proper punctuation they may merge into larger paragraphs.
Mixed language content: The script works best with single-language files. For mixed English/Chinese content consider processing separately.
CLI Reference
Usage: python subtitle_to_text.py <input.vtt|input.srt> [output.txt] [--lang en|zh_tw|zh_cn] [--format auto|vtt|srt]
Language options:
--lang auto Auto-detect language (default)
--lang en English
--lang zh_tw Traditional Chinese (\u7e41\u9ad4\u4e2d\u6587)
--lang zh_cn Simplified Chinese (\u7b80\u4f53\u4e2d\u6587)
Format options:
--format auto Auto-detect subtitle format (default)
--format vtt Force VTT parsing
--format srt Force SRT parsing
If output path is not specified, a .txt file will be created next to input.