| name | speech-to-text |
| description | Expert in transcribing audio and video files to structured meeting minutes using VertexAI Gemini 2.5 Flash. **Use this skill when the user requests to transcribe audio/video files ('transcribe this audio', 'convert audio to text', 'get text from recording'), extract content from recordings, or when preprocessing audio for meeting summaries.** Automatically searches ~/Downloads when user mentions 'downloaded audio'. Supports MP3, WAV, M4A, AAC, OGG, FLAC formats with automatic attendee detection and speaker attribution. |
Speech-to-Text Transcription Skill
Expert in transcribing audio and video files to structured meeting minutes with automatic attendee name detection, speaker attribution, and markdown formatting.
Core Capabilities
Audio Processing
- Multi-Format Support: MP3, WAV, M4A, AAC, OGG, FLAC
- Auto-Chunking: Splits files >30 minutes into 30-minute segments with 30-second overlaps
- Parallel Processing: Processes chunks concurrently (max 3 workers)
- AI Merge Reconciliation: One-pass merge of all chunks with overlap detection
Meeting Minutes Generation
- Attendee Detection: Automatically identifies speaker names from conversation
- Speaker Attribution: Associates each statement with actual speaker names (not "Speaker 1", "Speaker 2")
- Structured Output: Clean markdown with Attendees list and Minutes sections
- Language Preservation: Maintains original audio language
Performance
- Instant Startup: Compiled Go binary with no runtime dependencies apart from ffmpeg, which is required for chunking long files
- No Environment Setup: Default GCP project built-in (oa-data-btdpexploration-np)
- Processing Time: ~20 minutes for long recordings
- Single Executable: No Python, no virtualenv, no package installation
Usage Instructions
Basic Transcription
# Transcribe audio and save to file
~/.claude/skills/speech-to-text/scripts/speech-to-text ~/Downloads/meeting.mp3 -o ~/Downloads/meeting_transcript.md
# With custom meeting name
~/.claude/skills/speech-to-text/scripts/speech-to-text ~/Downloads/recording.mp3 -o transcript.md -m "Weekly Team Sync"
# Display to stdout (no file save)
~/.claude/skills/speech-to-text/scripts/speech-to-text ~/Downloads/audio.wav
Command-Line Options
| Flag | Required | Default | Description |
|---|---|---|---|
| <audio-file> | Yes | - | Path to audio file (absolute path recommended) |
| -o | No | stdout | Output markdown file path |
| -m | No | filename | Custom meeting name for title |
| -project | No | oa-data-btdpexploration-np | GCP project ID |
| -location | No | global | GCP region for VertexAI |
| -model | No | gemini-2.5-flash | Gemini model to use |
Finding Audio Files in Downloads
When user mentions "downloaded audio" or "audio from downloads":
# Search for audio files in Downloads (last 7 days)
find ~/Downloads -type f \( -name "*.mp3" -o -name "*.m4a" -o -name "*.wav" -o -name "*.ogg" -o -name "*.aac" -o -name "*.flac" \) -mtime -7 | sort -r
# Or list by modification time
ls -lt ~/Downloads/*.{mp3,m4a,wav,ogg,aac,flac} 2>/dev/null | head -10
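The search above can be wrapped in a small function so the directory and the age window are parameters. This is a minimal sketch, not part of the skill: `find_recent_audio` is a hypothetical name, and it assumes a `find` that supports `-iname` (GNU and BSD both do) so that files such as `MEETING.MP3` are matched too.

```shell
# Hypothetical helper: list audio files modified within the last N days.
# Usage: find_recent_audio [dir] [days]
find_recent_audio() (
  dir=${1:-"$HOME/Downloads"}
  days=${2:-7}
  # -iname matches extensions case-insensitively; output is sorted by path.
  find "$dir" -type f \( \
    -iname '*.mp3' -o -iname '*.wav' -o -iname '*.m4a' -o \
    -iname '*.aac' -o -iname '*.ogg' -o -iname '*.flac' \
  \) -mtime "-$days" | sort
)
```

Calling `find_recent_audio` with no arguments searches `~/Downloads` for the last 7 days, which matches the defaults used in this section.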
Typical Workflows
Workflow 1: User Mentions Downloaded Audio
User Request: "Transcribe the downloaded meeting audio"
Steps:
- Search ~/Downloads for recent audio files (last 7 days)
- List found files and ask user to confirm which one
- Transcribe using absolute path
- Save as <filename>_transcript.md in ~/Downloads
# Find recent audio files
find ~/Downloads -type f \( -name "*.mp3" -o -name "*.m4a" \) -mtime -7
# Transcribe selected file
~/.claude/skills/speech-to-text/scripts/speech-to-text ~/Downloads/meeting_2025-12-30.mp3 -o ~/Downloads/meeting_2025-12-30_transcript.md
Workflow 2: User Provides Explicit Path
User Request: "Transcribe /home/sebastien/recordings/interview.wav"
Steps:
- Use provided path directly
- Save transcript next to original: /home/sebastien/recordings/interview_transcript.md
~/.claude/skills/speech-to-text/scripts/speech-to-text /home/sebastien/recordings/interview.wav -o /home/sebastien/recordings/interview_transcript.md
Workflow 3: Preprocessing for Meeting Summary
User Request: "Create a summary of the recorded meeting"
Steps:
- Use speech-to-text skill to transcribe audio → generates .md transcript
- Pass transcript to meetings-summary skill → generates structured summary
# Step 1: Transcribe (this skill)
~/.claude/skills/speech-to-text/scripts/speech-to-text ~/Downloads/meeting.mp3 -o /tmp/transcript.md
# Step 2: Summarize (meetings-summary skill)
# Pass /tmp/transcript.md to meetings-summary skill
Workflow 4: Batch Processing Multiple Files
User Request: "Transcribe all MP3 files in Downloads"
CRITICAL: Always process files sequentially (NEVER in parallel)
# Process sequentially to avoid rate limits
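for audio in ~/Downloads/*.mp3; do
  [ -e "$audio" ] || continue  # skip the literal pattern when no .mp3 files exist
  echo "Processing: $audio"
  output="${audio%.mp3}_transcript.md"
  ~/.claude/skills/speech-to-text/scripts/speech-to-text "$audio" -o "$output"
  # The call runs in the foreground, so the next file starts only after this one finishes
done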
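The same sequential pattern extends to every supported format. The sketch below is one possible way to do that, under stated assumptions: `STT_BIN` and `batch_transcribe` are names introduced here for illustration, and skipping files that already have a transcript is a design choice of this example, not documented behavior of the skill.

```shell
# Assumption: STT_BIN is a variable introduced here so the binary path can be overridden.
STT_BIN=${STT_BIN:-"$HOME/.claude/skills/speech-to-text/scripts/speech-to-text"}

# Hypothetical helper: transcribe every supported audio file in a directory,
# strictly one at a time, skipping files that already have a transcript.
batch_transcribe() (
  dir=${1:-"$HOME/Downloads"}
  for audio in "$dir"/*.mp3 "$dir"/*.wav "$dir"/*.m4a \
               "$dir"/*.aac "$dir"/*.ogg "$dir"/*.flac; do
    [ -f "$audio" ] || continue   # unmatched globs stay literal; skip them
    output="${audio%.*}_transcript.md"
    if [ -e "$output" ]; then
      echo "Skipping (transcript exists): $audio" >&2
      continue
    fi
    echo "Processing: $audio" >&2
    # Foreground call: the loop blocks until this file is done.
    "$STT_BIN" "$audio" -o "$output" || echo "Failed: $audio" >&2
  done
)
```

A failed file is reported and the loop moves on, so one bad recording does not stop the rest of the batch.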
❌ NEVER do this (parallel processing will fail):
# DO NOT USE - will hit rate limits
for audio in ~/Downloads/*.mp3; do
~/.claude/skills/speech-to-text/scripts/speech-to-text "$audio" -o "${audio%.mp3}_transcript.md" &
done
wait
Output Format
The transcription generates clean, structured markdown:
# Meeting Name
## Attendees
- Sebastien Morand
- John Doe
- Jane Smith
## Minutes
- **Sebastien Morand**: Welcome everyone. Let's start with the quarterly review.
- **John Doe**: Thank you, Sebastien. I'd like to discuss the budget allocation for Q1.
- **Jane Smith**: I agree with John's points. Additionally, we should consider...
Key Features:
- Section 1: Attendees list (extracted from conversation)
- Section 2: Verbatim transcription with speaker attribution
- Speaker names: Detected from conversation (not generic "Speaker 1")
- Clean markdown: Ready for further processing or display
Critical Constraints
1. Always Use Absolute Paths
✅ Recommended:
~/.claude/skills/speech-to-text/scripts/speech-to-text ~/Downloads/meeting.mp3
~/.claude/skills/speech-to-text/scripts/speech-to-text /home/sebastien/audio/recording.wav
⚠️ Avoid relative paths (can cause confusion):
~/.claude/skills/speech-to-text/scripts/speech-to-text ./meeting.mp3 # May fail if cwd is wrong
2. Never Process Files in Parallel
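Why: Gemini API rate limits cause 429 errors when multiple files are processed concurrently. The binary already transcribes the chunks of a single long file in parallel (max 3 workers), so running several files at once multiplies the request rate.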
Always process sequentially:
- One file at a time
- Wait for completion before starting next
- Use for loops without & backgrounding
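One way to enforce the one-at-a-time rule across terminals or scripts is a simple lock. The sketch below is an illustration only: `LOCK_DIR` and `run_exclusive` are hypothetical names, and the lock relies on `mkdir` failing when the directory already exists.

```shell
# Assumption: LOCK_DIR is a name introduced here; any writable path works.
LOCK_DIR=${LOCK_DIR:-"${TMPDIR:-/tmp}/speech-to-text.lock"}

# Hypothetical helper: run a command only if no other run holds the lock.
run_exclusive() {
  # mkdir is atomic: exactly one caller can create the directory.
  if ! mkdir "$LOCK_DIR" 2>/dev/null; then
    echo "Another transcription appears to be running ($LOCK_DIR exists)" >&2
    return 1
  fi
  status=0
  "$@" || status=$?
  rmdir "$LOCK_DIR"
  return "$status"
}

# Usage:
# run_exclusive ~/.claude/skills/speech-to-text/scripts/speech-to-text \
#   ~/Downloads/meeting.mp3 -o ~/Downloads/meeting_transcript.md
```

If a run is killed before it reaches `rmdir`, the lock directory stays behind and has to be removed by hand.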
3. Default Output Location
Always save transcripts next to the original audio file unless user specifies otherwise:
- Input: ~/Downloads/meeting.mp3
- Output: ~/Downloads/meeting_transcript.md
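The naming rule above can be expressed as a small function. This is a sketch with a hypothetical name (`transcript_path`). It strips only the last extension and keeps the transcript in the same directory as the audio file.

```shell
# Hypothetical helper: derive "<dir>/<name>_transcript.md" from an audio path.
transcript_path() (
  audio=$1
  dir=$(dirname "$audio")
  base=$(basename "$audio")
  # Strip only the final extension, so "call.v2.wav" becomes "call.v2".
  printf '%s/%s_transcript.md\n' "$dir" "${base%.*}"
)

# Usage:
# out=$(transcript_path "$HOME/Downloads/meeting.mp3")
# ~/.claude/skills/speech-to-text/scripts/speech-to-text "$HOME/Downloads/meeting.mp3" -o "$out"
```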
4. No Environment Variables Needed
The binary has a default project (oa-data-btdpexploration-np) built-in. Only use -project flag if user explicitly needs a different project.
Prerequisites
System Requirements
ffmpeg - Required for audio chunking (files >30 minutes)
# macOS
brew install ffmpeg
# Linux (Debian/Ubuntu)
sudo apt-get install ffmpeg
GCP Setup
Authentication:
gcloud auth application-default login
Enable VertexAI API:
gcloud services enable aiplatform.googleapis.com --project=oa-data-btdpexploration-np
No other dependencies required - Go binary is self-contained.
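Before a first run it can help to confirm that the tools named in this section are on the PATH. The function below is a minimal sketch with a hypothetical name (`check_tools`). It only checks that the commands exist, not that authentication or the API are set up.

```shell
# Hypothetical helper: report which of the given commands are missing from PATH.
# Returns 0 when all are present, 1 otherwise.
check_tools() {
  missing=0
  for tool in "$@"; do
    if ! command -v "$tool" >/dev/null 2>&1; then
      echo "Missing: $tool" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Usage:
# check_tools ffmpeg gcloud || echo "Install the missing tools first" >&2
```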
Integration with Other Skills
With meetings-summary Skill
This skill provides the audio-to-text conversion. For structured meeting summaries:
# 1. Transcribe audio (this skill)
~/.claude/skills/speech-to-text/scripts/speech-to-text ~/Downloads/meeting.mp3 -o /tmp/transcript.md
# 2. Generate summary (meetings-summary skill)
# Pass /tmp/transcript.md to meetings-summary skill for action items, decisions, etc.
Note: This skill only converts audio to text. Use meetings-summary for:
- Extracting action items
- Identifying decisions
- Structuring topics
- Generating email summaries
With topic-manager Skill
After transcription, topics can be extracted and stored:
# 1. Transcribe
~/.claude/skills/speech-to-text/scripts/speech-to-text ~/Downloads/one-on-one.mp3 -o /tmp/transcript.md
# 2. Extract topics (topic-manager skill)
# Process /tmp/transcript.md to update topic notes
Troubleshooting
Audio File Not Found
Error: ERROR: Audio file not found
Solutions:
- Use absolute paths for reliability
- Verify file exists: ls -lh /full/path/to/audio.mp3
- Check file extension matches actual format
- Ensure no typos in path
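Some of these checks can be run before calling the binary. The sketch below uses a hypothetical name (`validate_audio`) and compares the extension with the formats this skill lists. It does not inspect the file contents, so it cannot tell whether the extension matches the actual format.

```shell
# Hypothetical helper: check an audio path before calling the binary.
# Prints a reason to stderr and returns 1 when the path is unusable.
validate_audio() {
  if [ ! -f "$1" ]; then
    echo "Not a regular file: $1" >&2
    return 1
  fi
  if [ ! -r "$1" ]; then
    echo "Not readable: $1" >&2
    return 1
  fi
  # Compare the lowercased extension with the supported formats.
  ext=$(printf '%s' "${1##*.}" | tr '[:upper:]' '[:lower:]')
  case "$ext" in
    mp3|wav|m4a|aac|ogg|flac) return 0 ;;
    *) echo "Unsupported extension: .$ext" >&2; return 1 ;;
  esac
}
```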
Authentication Errors
Error: Failed to initialize VertexAI or authentication failed
Solutions:
# Re-authenticate
gcloud auth application-default login
# Enable API
gcloud services enable aiplatform.googleapis.com --project=oa-data-btdpexploration-np
Rate Limit Errors (429)
Error: 429: Too Many Requests
Solutions:
- Wait 5-10 minutes before retrying
- Ensure files are processed sequentially (not in parallel)
- Check for other processes hitting the same API
- Verify you're not running multiple transcriptions concurrently
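The wait-then-retry advice can be scripted. The function below is a sketch with a hypothetical name (`retry_with_wait`). Because the binary's exit codes are not documented here, it retries on any failure, not only on 429 responses.

```shell
# Hypothetical helper: run a command, retrying after a fixed wait on failure.
# Usage: retry_with_wait <max_attempts> <wait_seconds> <command> [args...]
retry_with_wait() {
  max_attempts=$1
  wait_seconds=$2
  shift 2
  attempt=1
  while :; do
    "$@" && return 0
    if [ "$attempt" -ge "$max_attempts" ]; then
      echo "Giving up after $attempt attempts" >&2
      return 1
    fi
    echo "Attempt $attempt failed; waiting ${wait_seconds}s" >&2
    sleep "$wait_seconds"
    attempt=$((attempt + 1))
  done
}

# Example: up to 3 attempts, 10 minutes apart.
# retry_with_wait 3 600 ~/.claude/skills/speech-to-text/scripts/speech-to-text \
#   ~/Downloads/meeting.mp3 -o ~/Downloads/meeting_transcript.md
```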
ffmpeg Not Found
Error: ffmpeg: command not found
Solutions:
# macOS
brew install ffmpeg
# Linux (Debian/Ubuntu)
sudo apt-get install ffmpeg
# Verify installation
ffmpeg -version
Long Processing Time
Expected Behavior:
- Short recordings (<30 min): 2-5 minutes
- Long recordings (30-60 min): 15-20 minutes
- Very long recordings (>60 min): 30+ minutes
Why: Gemini 2.5 Flash processing time scales with audio duration.
Optimization:
- Lower bitrate audio (64-128 kbps) is sufficient for speech
- MP3 or M4A formats process fastest
- Keep files under 100 MB when possible
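A large recording can be re-encoded before transcription to bring it closer to these targets. The function below is a sketch with a hypothetical name (`reencode_for_speech`). It assumes an ffmpeg build with MP3 encoding support, and the 64k default is simply the low end of the range given above.

```shell
# Hypothetical helper: re-encode a recording as mono MP3 at a speech-friendly bitrate.
# Usage: reencode_for_speech <input> <output.mp3> [bitrate, default 64k]
reencode_for_speech() {
  # -nostdin: never read from the terminal; -n: refuse to overwrite an existing output.
  # -vn drops any video stream, -ac 1 downmixes to mono, -b:a sets the audio bitrate.
  ffmpeg -nostdin -n -i "$1" -vn -ac 1 -b:a "${3:-64k}" "$2"
}

# Usage:
# reencode_for_speech ~/Downloads/meeting.wav ~/Downloads/meeting_small.mp3
```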
Model Information
Default Model: Gemini 2.5 Flash
Model ID: gemini-2.5-flash
Features:
- Multimodal (audio + text)
- Automatic language detection
- Optimized for cost and speed
- Best for meeting transcriptions
- High accuracy for speaker identification
Alternative Models
Gemini 2.5 Pro - Higher accuracy, slower, more expensive
~/.claude/skills/speech-to-text/scripts/speech-to-text audio.mp3 -model gemini-2.5-pro
When to use Pro:
- Critical transcriptions requiring maximum accuracy
- Complex audio with multiple overlapping speakers
- Poor audio quality requiring advanced processing
Performance Tips
- File Size: Keep audio files under 100 MB when possible
- Format: MP3 or M4A are fastest to process
- Quality: Lower bitrate audio (64-128 kbps) is sufficient for speech
- Duration: Files >60 minutes may take 30+ minutes to transcribe
- Sequential Processing: Always process one file at a time
Logging
All progress logs are written to stderr with timestamps in format [YYYY-MM-DD HH:MM:SS]:
[2025-12-30 14:23:45] Analyzing recording: filename.ogg
[2025-12-30 14:23:45] File size: 22.45 MB
[2025-12-30 14:23:45] Recording duration: 35.23 minutes (2114.0 seconds)
[2025-12-30 14:23:45] Recording exceeds 30 minutes - will split into 2 chunks
[2025-12-30 14:23:45] Cutting recording into 2 chunks with 30-second overlaps
[2025-12-30 14:23:46] Creating chunk 1/2 - start: 0.0s, duration: 1830.0s
[2025-12-30 14:23:47] Creating chunk 2/2 - start: 1770.0s, duration: 1830.0s
[2025-12-30 14:23:47] Successfully created 2 chunk files
[2025-12-30 14:23:47] Starting parallel transcription of 2 chunks (max 3 concurrent)
[2025-12-30 14:25:30] Transcription of chunk 1/2 completed
[2025-12-30 14:25:42] Transcription of chunk 2/2 completed
[2025-12-30 14:25:42] Starting reconciliation of overlapping chunks
[2025-12-30 14:25:42] Merging all 2 chunks in one pass
[2025-12-30 14:25:42] Sending all chunks to AI for merge reconciliation
[2025-12-30 14:26:15] All chunks merged successfully
[2025-12-30 14:26:15] Reconciliation completed successfully
[2025-12-30 14:26:15] Finalizing output with meeting title
[2025-12-30 14:26:15] Writing final output to: minutes.md
[2025-12-30 14:26:15] Successfully wrote 15.67 KB to minutes.md
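In the sample log, chunk 1 starts at 0.0s and chunk 2 at 1770.0s, and both are 1830.0s long. The binary's source is not included, so the function below is only a guess at arithmetic that reproduces those boundaries: 1800-second base chunks, each started 30 seconds early (never before 0) and extended by 30 seconds. The name `plan_chunks` and its parameters are hypothetical.

```shell
# Hypothetical sketch: print "index start duration" (whole seconds) for each chunk.
# Usage: plan_chunks <total_seconds> [chunk_seconds] [overlap_seconds]
plan_chunks() (
  total=$1
  chunk=${2:-1800}
  overlap=${3:-30}
  # Ceiling division: number of base chunks needed to cover the recording.
  count=$(( (total + chunk - 1) / chunk ))
  i=0
  while [ "$i" -lt "$count" ]; do
    start=$(( i * chunk - overlap ))
    if [ "$start" -lt 0 ]; then
      start=0
    fi
    echo "$(( i + 1 )) $start $(( chunk + overlap ))"
    i=$(( i + 1 ))
  done
)

# plan_chunks 2114 prints:
# 1 0 1830
# 2 1770 1830
```

The requested duration of the last chunk can run past the end of the recording, as it does in the sample log.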
Skill Scope
This Skill Handles
- ✅ Audio/video to text transcription
- ✅ Attendee name detection
- ✅ Speaker attribution
- ✅ Structured markdown output
- ✅ Multi-format audio support
- ✅ Long audio chunking and merging
This Skill Does NOT Handle
- ❌ Meeting summaries (use meetings-summary skill)
- ❌ Action item extraction (use meetings-summary skill)
- ❌ Decision tracking (use meetings-summary skill)
- ❌ Audio editing or processing
- ❌ Audio analysis (sentiment, emotions, etc.)
Example Interactions
Example 1: Downloaded Audio
User: "I have a downloaded meeting recording, can you transcribe it?"
Claude Actions:
- Search ~/Downloads for audio files (last 7 days)
- Display list of found files with size and date
- Ask user to confirm which file
- Transcribe selected file with absolute path
- Save as <filename>_transcript.md in ~/Downloads
- Display success message with output location
Example 2: Explicit Path with Summary
User: "Transcribe /home/sebastien/interview.wav and summarize it"
Claude Actions:
- Use speech-to-text skill to transcribe
  - Output: /home/sebastien/interview_transcript.md
- Load meetings-summary skill
- Generate summary from transcript
Example 3: Batch Processing
User: "Convert all my meeting recordings to text"
Claude Actions:
- Ask user for directory location (default: ~/Downloads)
- List all audio files in directory
- Ask for confirmation to process all files
- Process sequentially (one by one, not parallel)
- Save each transcript next to its audio file
- Display progress and completion summary
Skill Location
- Binary: ~/.claude/skills/speech-to-text/scripts/speech-to-text
- Documentation: /home/sebastien/projects/claude-config/skills/speech-to-text/
- Source Code: Not included (compiled Go binary only)
- Logs: stderr (timestamped format)
Author
Sebastien MORAND
Email: seb.morand@gmail.com
Role: CTO Data & AI at L'Oréal