name

parakeet

description

Convert audio files to text using parakeet-mlx, NVIDIA's Parakeet automatic speech recognition model optimized for Apple's MLX framework. Run via uvx for on-device speech-to-text processing with high-quality timestamped transcriptions. Ideal for podcasts, interviews, meetings, and other audio content. This skill is triggered when the user says things like "transcribe this audio", "convert audio to text", "transcribe this podcast", "get text from this recording", "speech to text", or "transcribe this wav/mp3/m4a file".

Parakeet-MLX Audio Transcription

This skill provides instructions for using parakeet-mlx to convert audio files to text using NVIDIA's Parakeet automatic speech recognition model optimized for Apple's MLX framework.

Overview

Parakeet-MLX brings NVIDIA's Parakeet ASR model to Apple Silicon, enabling fast, high-quality on-device speech-to-text conversion similar to Whisper but optimized for Apple's MLX framework.

Key Features:

High-quality transcription with timestamps
Fast processing (e.g., 1+ hour audio in ~53 seconds)
On-device processing (no cloud API required)
Output in SRT subtitle format
No installation required - run directly with uvx

Basic Usage

Transcribe an Audio File

The simplest way to transcribe an audio file is:

uvx parakeet-mlx /path/to/audio/file.mp3

This will:

Download the model on first run (~2.5GB)
Process the audio file
Generate an SRT subtitle file with timestamped transcription

Example:

# Transcribe a podcast episode
uvx parakeet-mlx podcast_episode.mp3

# Transcribe an interview
uvx parakeet-mlx interview.wav

# Transcribe a meeting recording
uvx parakeet-mlx meeting_recording.m4a

Important Notes

First Run

The first time you run parakeet-mlx, it will download a ~2.5GB model file. This may take several minutes depending on your internet connection. Subsequent runs will use the cached model and start immediately.

Performance

Parakeet-MLX is optimized for Apple Silicon and provides excellent performance:

A 65MB, 1 hour 1 minute 28 second podcast was transcribed in 53 seconds
Performance scales with audio duration, not file size
Processing happens entirely on-device

Output Format

The tool generates an SRT (SubRip Subtitle) file with the same name as your input file:

Input: podcast.mp3 Output: podcast.srt

The SRT format includes:

Sequential subtitle numbers
Timestamp ranges (start --> end)
Transcribed text for each segment

Example SRT output:

1
00:00:00,000 --> 00:00:03,500
Welcome to the podcast. Today we're discussing...

2
00:00:03,500 --> 00:00:08,200
The latest developments in machine learning...

Supported Audio Formats

While the documentation explicitly mentions MP3, parakeet-mlx likely supports common audio formats including:

MP3
WAV
M4A
FLAC
OGG

If you encounter format issues, consider converting your audio file to MP3 or WAV first using tools like ffmpeg.

Working with Output

View the Transcription

# View the entire transcription
cat output.srt

# View just the text (remove timestamps)
grep -v "^[0-9]*$" output.srt | grep -v "^[0-9][0-9]:[0-9][0-9]:[0-9][0-9]"

Convert SRT to Plain Text

If you need plain text without timestamps:

# Extract only dialogue lines (skip numbers and timestamps)
awk 'NF && !/^[0-9]+$/ && !/^[0-9]{2}:[0-9]{2}:[0-9]{2}/' output.srt > transcript.txt

Use with Other Tools

The SRT format is widely supported and can be:

Imported into video editing software
Used with subtitle players
Converted to other formats (VTT, ASS, etc.)
Processed with text analysis tools

Common Workflows

Transcribe Multiple Files

# Process all MP3 files in a directory
for file in *.mp3; do
    echo "Processing $file..."
    uvx parakeet-mlx "$file"
done

Transcribe and Extract Text

# Transcribe and immediately extract plain text
uvx parakeet-mlx audio.mp3
awk 'NF && !/^[0-9]+$/ && !/^[0-9]{2}:[0-9]{2}:[0-9]{2}/' audio.srt > audio.txt

Check Transcription Quality

After transcription, review the SRT file to verify:

Proper segmentation of speech
Accurate timestamps
Text quality and accuracy

Troubleshooting

Model Download Issues

If the model download fails or is interrupted:

Check your internet connection
Ensure you have ~2.5GB of free disk space
Try running the command again (it may resume download)

Performance Issues

If transcription is slow:

Ensure you're running on Apple Silicon (M1/M2/M3)
Close other resource-intensive applications
For very long files, consider splitting them into smaller segments

Audio Format Not Supported

If you get an error about unsupported format:

# Convert to MP3 using ffmpeg
ffmpeg -i input.m4a -acodec libmp3lame -ab 192k output.mp3
uvx parakeet-mlx output.mp3

Use Cases

Parakeet-MLX is ideal for:

Podcast transcription - Generate searchable text from episodes
Interview documentation - Convert recorded interviews to text
Meeting notes - Transcribe meeting recordings for reference
Content creation - Generate captions or show notes
Accessibility - Create subtitles for audio/video content
Research - Analyze spoken content at scale

Advantages Over Other Tools

vs. Whisper:

Optimized specifically for Apple MLX framework
Potentially faster on Apple Silicon
Similar quality output

vs. Cloud APIs:

No API costs
Complete privacy (on-device processing)
No internet required after model download
No file size limits

vs. Manual Transcription:

Dramatically faster (hours to minutes)
Consistent quality
Includes precise timestamps

Tips for Best Results

Audio Quality Matters: Clear audio with minimal background noise produces better results
Speaker Clarity: Single speaker or well-separated multi-speaker audio works best
Review Output: Always review the transcription for accuracy, especially for technical terms or names
Use Timestamps: The SRT format's timestamps are valuable for referencing specific moments
Batch Processing: Process multiple files in sequence for efficiency

References

Simon Willison's parakeet-mlx article
Run uvx parakeet-mlx --help for command-line options (if available)

Notes

Parakeet-MLX runs entirely on-device, ensuring privacy
The tool is designed for Apple Silicon; performance on Intel Macs may vary
First run requires internet for model download
Output quality is reported as "very high" for clear audio
The tool is optimized for speech recognition, not music or other audio types