name: speech-to-text

description: Expert in transcribing audio and video files into structured meeting minutes using VertexAI Gemini 2.5 Flash. **Use this skill when the user asks to transcribe audio/video files ('transcribe this audio', 'convert audio to text', 'get text from recording'), to extract content from recordings, or to preprocess audio for meeting summaries.** Automatically searches ~/Downloads when the user mentions 'downloaded audio'. Supports MP3, WAV, M4A, AAC, OGG, and FLAC formats with automatic attendee detection and speaker attribution.

Speech-to-Text Transcription Skill

Expert in transcribing audio and video files to structured meeting minutes with automatic attendee name detection, speaker attribution, and markdown formatting.

Core Capabilities

Audio Processing

Multi-Format Support: MP3, WAV, M4A, AAC, OGG, FLAC
Auto-Chunking: Splits files >30 minutes into 30-minute segments with 30-second overlaps
Parallel Processing: Processes chunks concurrently (max 3 workers)
AI Merge Reconciliation: One-pass merge of all chunks with overlap detection
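
The chunking rule above can be sketched in shell. This is only an illustration modeled on the documented figures (30-minute segments, 30-second overlaps) and on the chunk lines in the sample log of the Logging section. The name `chunk_plan` is hypothetical, and the binary's actual boundary logic may differ.

```shell
# Sketch only: boundaries modeled on the documented 30-minute segments and
# 30-second overlaps. The real binary may compute them differently.
SEGMENT=1800   # 30 minutes, in seconds
OVERLAP=30     # 30 seconds

# chunk_plan TOTAL_SECONDS
# Prints one "start duration" pair (whole seconds) per chunk.
chunk_plan() {
    local total="$1" count i=0 start
    # Recordings of 30 minutes or less are not split.
    if [ "$total" -le "$SEGMENT" ]; then
        echo "0 $total"
        return 0
    fi
    count=$(( (total + SEGMENT - 1) / SEGMENT ))   # ceiling division
    while [ "$i" -lt "$count" ]; do
        start=$(( i * SEGMENT - OVERLAP ))
        if [ "$start" -lt 0 ]; then start=0; fi
        echo "$start $(( SEGMENT + OVERLAP ))"
        i=$(( i + 1 ))
    done
}

chunk_plan 2114   # the 35.23-minute recording from the sample log
```

For the 2114-second recording this prints `0 1830` and `1770 1830`, the same start and duration values that appear in the sample log.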

Meeting Minutes Generation

Attendee Detection: Automatically identifies speaker names from conversation
Speaker Attribution: Associates each statement with actual speaker names (not "Speaker 1", "Speaker 2")
Structured Output: Clean markdown with Attendees list and Minutes sections
Language Preservation: Maintains original audio language

Performance

Instant Startup: Compiled Go binary with no runtime dependencies (ffmpeg is needed only to chunk files longer than 30 minutes)
No Environment Setup: Default GCP project built-in (oa-data-btdpexploration-np)
Processing Time: ~2-5 minutes for recordings under 30 minutes, 15-20 minutes or more for longer ones
Single Executable: No Python, no virtualenv, no package installation

Usage Instructions

Basic Transcription

# Transcribe audio and save to file
~/.claude/skills/speech-to-text/scripts/speech-to-text ~/Downloads/meeting.mp3 -o ~/Downloads/meeting_transcript.md

# With custom meeting name
~/.claude/skills/speech-to-text/scripts/speech-to-text ~/Downloads/recording.mp3 -o transcript.md -m "Weekly Team Sync"

# Display to stdout (no file save)
~/.claude/skills/speech-to-text/scripts/speech-to-text ~/Downloads/audio.wav

Command-Line Options

| Flag | Required | Default | Description |
|------|----------|---------|-------------|
| `<audio-file>` | Yes | - | Path to audio file (absolute path recommended) |
| `-o` | No | stdout | Output markdown file path |
| `-m` | No | filename | Custom meeting name for title |
| `-project` | No | `oa-data-btdpexploration-np` | GCP project ID |
| `-location` | No | `global` | GCP region for VertexAI |
| `-model` | No | `gemini-2.5-flash` | Gemini model to use |
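
A small wrapper can make the optional flags easier to combine. The sketch below is hypothetical (the names `transcribe` and `STT_BIN` are not part of the skill) and uses only the flags listed in the table, placed after the audio file as in the usage examples above.

```shell
# Hypothetical wrapper; only the flags listed in the table above are used.
STT_BIN="${STT_BIN:-$HOME/.claude/skills/speech-to-text/scripts/speech-to-text}"

# transcribe INPUT OUTPUT [MEETING_NAME] [MODEL]
transcribe() {
    local input="$1" output="$2" name="${3:-}" model="${4:-}"
    # Start with the required argument and the output flag, then append
    # optional flags only when a value was supplied.
    set -- "$input" -o "$output"
    if [ -n "$name" ]; then set -- "$@" -m "$name"; fi
    if [ -n "$model" ]; then set -- "$@" -model "$model"; fi
    "$STT_BIN" "$@"
}

# Example (not executed here):
# transcribe ~/Downloads/recording.mp3 ~/Downloads/recording_transcript.md "Weekly Team Sync"
```

Building the argument list with `set --` keeps values that contain spaces, such as the meeting name, as single arguments.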

Finding Audio Files in Downloads

When the user mentions "downloaded audio" or "audio from downloads":

# Search for audio files in Downloads (last 7 days)
find ~/Downloads -type f \( -name "*.mp3" -o -name "*.m4a" -o -name "*.wav" -o -name "*.ogg" -o -name "*.aac" -o -name "*.flac" \) -mtime -7 | sort -r

# Or list by modification time
ls -lt ~/Downloads/*.{mp3,m4a,wav,ogg,aac,flac} 2>/dev/null | head -10

Typical Workflows

Workflow 1: User Mentions Downloaded Audio

User Request: "Transcribe the downloaded meeting audio"

Steps:

Search ~/Downloads for recent audio files (last 7 days)
List found files and ask user to confirm which one
Transcribe using absolute path
Save as <filename>_transcript.md in ~/Downloads

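# Find recent audio files (all supported formats, last 7 days)
find ~/Downloads -type f \( -name "*.mp3" -o -name "*.m4a" -o -name "*.wav" -o -name "*.ogg" -o -name "*.aac" -o -name "*.flac" \) -mtime -7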

# Transcribe selected file
~/.claude/skills/speech-to-text/scripts/speech-to-text ~/Downloads/meeting_2025-12-30.mp3 -o ~/Downloads/meeting_2025-12-30_transcript.md

Workflow 2: User Provides Explicit Path

User Request: "Transcribe /home/sebastien/recordings/interview.wav"

Steps:

Use provided path directly
Save transcript next to original: /home/sebastien/recordings/interview_transcript.md

~/.claude/skills/speech-to-text/scripts/speech-to-text /home/sebastien/recordings/interview.wav -o /home/sebastien/recordings/interview_transcript.md

Workflow 3: Preprocessing for Meeting Summary

User Request: "Create a summary of the recorded meeting"

Steps:

Use speech-to-text skill to transcribe audio → generates .md transcript
Pass transcript to meetings-summary skill → generates structured summary

# Step 1: Transcribe (this skill)
~/.claude/skills/speech-to-text/scripts/speech-to-text ~/Downloads/meeting.mp3 -o /tmp/transcript.md

# Step 2: Summarize (meetings-summary skill)
# Pass /tmp/transcript.md to meetings-summary skill

Workflow 4: Batch Processing Multiple Files

User Request: "Transcribe all MP3 files in Downloads"

CRITICAL: Always process files sequentially (NEVER in parallel)

# Process sequentially to avoid rate limits
for audio in ~/Downloads/*.mp3; do
    [ -e "$audio" ] || continue  # skip the literal pattern when no .mp3 files exist
    echo "Processing: $audio"
    output="${audio%.mp3}_transcript.md"
    ~/.claude/skills/speech-to-text/scripts/speech-to-text "$audio" -o "$output"
    # The loop waits for each run to finish before starting the next file
done

❌ NEVER do this (parallel processing will fail):

# DO NOT USE - will hit rate limits
for audio in ~/Downloads/*.mp3; do
    ~/.claude/skills/speech-to-text/scripts/speech-to-text "$audio" -o "${audio%.mp3}_transcript.md" &
done
wait

Output Format

The transcription generates clean, structured markdown:

# Meeting Name

## Attendees
- Sebastien Morand
- John Doe
- Jane Smith

## Minutes
- **Sebastien Morand**: Welcome everyone. Let's start with the quarterly review.
- **John Doe**: Thank you, Sebastien. I'd like to discuss the budget allocation for Q1.
- **Jane Smith**: I agree with John's points. Additionally, we should consider...

Key Features:

Section 1: Attendees list (extracted from conversation)
Section 2: Verbatim transcription with speaker attribution
Speaker names: Detected from conversation (not generic "Speaker 1")
Clean markdown: Ready for further processing or display

Critical Constraints

1. Always Use Absolute Paths

✅ Recommended:

~/.claude/skills/speech-to-text/scripts/speech-to-text ~/Downloads/meeting.mp3
~/.claude/skills/speech-to-text/scripts/speech-to-text /home/sebastien/audio/recording.wav

⚠️ Avoid relative paths (they resolve against the current working directory, which may not be where the file lives):

~/.claude/skills/speech-to-text/scripts/speech-to-text ./meeting.mp3  # May fail if cwd is wrong
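
When a relative path is all that is available, it can be resolved before the binary is called. The helper below is a minimal sketch (the name `abs_path` is hypothetical) that relies only on `cd`, `pwd -P`, `dirname` and `basename`.

```shell
# abs_path FILE
# Prints the absolute path of FILE, or fails if its directory does not exist.
abs_path() {
    local dir base
    dir=$(dirname -- "$1")
    base=$(basename -- "$1")
    # Resolve the directory in a subshell so the caller's cwd is unchanged.
    ( cd -- "$dir" 2>/dev/null && printf '%s/%s\n' "$(pwd -P)" "$base" )
}

# Example (not executed here):
# audio=$(abs_path ./meeting.mp3) && [ -f "$audio" ] && \
#   ~/.claude/skills/speech-to-text/scripts/speech-to-text "$audio" -o "${audio%.*}_transcript.md"
```

The function checks only that the directory exists, so the example also tests the file itself with `[ -f ... ]` before transcribing.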

2. Never Process Files in Parallel

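Why: The binary already transcribes up to 3 chunks of a single file concurrently, so running several files at once multiplies the requests sent to the Gemini API and triggers 429 rate-limit errors.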

Always process sequentially:

One file at a time
Wait for completion before starting next
Use for loops without & backgrounding
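
One way to guard against an accidental second run is a simple lock directory. This is a sketch, not a feature of the skill: the name `run_exclusive` and the lock location are hypothetical, and it relies on `mkdir` failing when the directory already exists.

```shell
# run_exclusive LOCK_DIR COMMAND [ARGS...]
# Runs COMMAND only if LOCK_DIR can be created; mkdir is atomic, so two
# callers cannot both acquire the lock.
run_exclusive() {
    local lock_dir="$1" rc=0
    shift
    if ! mkdir "$lock_dir" 2>/dev/null; then
        echo "another transcription appears to be running (lock: $lock_dir)" >&2
        return 1
    fi
    "$@" || rc=$?
    rmdir "$lock_dir"
    return "$rc"
}

# Example (not executed here):
# run_exclusive /tmp/speech-to-text.lock \
#   ~/.claude/skills/speech-to-text/scripts/speech-to-text ~/Downloads/meeting.mp3 -o ~/Downloads/meeting_transcript.md
```

If a run is killed before `rmdir` executes, the lock directory stays behind and has to be removed by hand.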

3. Default Output Location

Always save transcripts next to the original audio file unless the user specifies otherwise:

Input: ~/Downloads/meeting.mp3
Output: ~/Downloads/meeting_transcript.md
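
The naming rule can be written as a small helper that works for any supported extension. The function name `transcript_path` is hypothetical. It only derives the path and does not call the binary.

```shell
# transcript_path AUDIO_FILE
# Prints <directory>/<name without extension>_transcript.md
transcript_path() {
    local dir base
    dir=$(dirname -- "$1")
    base=$(basename -- "$1")
    echo "$dir/${base%.*}_transcript.md"
}

# Example (not executed here):
# audio=/home/sebastien/recordings/interview.wav
# ~/.claude/skills/speech-to-text/scripts/speech-to-text "$audio" -o "$(transcript_path "$audio")"
```

Stripping the extension from the base name, not from the whole path, avoids surprises when a directory name contains a dot.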

4. No Environment Variables Needed

The binary has a default project (oa-data-btdpexploration-np) built in. Use the -project flag only if the user explicitly needs a different project.

Prerequisites

System Requirements

ffmpeg - Required for audio chunking (files >30 minutes)

# macOS
brew install ffmpeg

# Linux (Debian/Ubuntu)
sudo apt-get install ffmpeg

GCP Setup

Authentication:

gcloud auth application-default login

Enable VertexAI API:

gcloud services enable aiplatform.googleapis.com --project=oa-data-btdpexploration-np

No other dependencies required - Go binary is self-contained.
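
The prerequisites can be checked before a long run. The sketch below uses hypothetical helper names (`missing_tools`, `preflight`). Checking for `gcloud` is an assumption made for convenience, since existing application-default credentials may work without it.

```shell
# missing_tools TOOL...
# Prints the name of every listed tool that is not on PATH.
missing_tools() {
    local tool
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
}

# preflight: returns non-zero and reports what is missing.
preflight() {
    local missing
    missing=$(missing_tools ffmpeg gcloud)
    if [ -n "$missing" ]; then
        echo "missing tools:" $missing >&2
        return 1
    fi
}

# Optional credential check (not executed here):
# gcloud auth application-default print-access-token >/dev/null
```

Running `preflight` once at the start of a batch surfaces a missing `ffmpeg` before any file has been submitted.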

Integration with Other Skills

With `meetings-summary` Skill

This skill provides the audio-to-text conversion. For structured meeting summaries:

# 1. Transcribe audio (this skill)
~/.claude/skills/speech-to-text/scripts/speech-to-text ~/Downloads/meeting.mp3 -o /tmp/transcript.md

# 2. Generate summary (meetings-summary skill)
# Pass /tmp/transcript.md to meetings-summary skill for action items, decisions, etc.

Note: This skill only converts audio to text. Use meetings-summary for:

Extracting action items
Identifying decisions
Structuring topics
Generating email summaries

With `topic-manager` Skill

After transcription, topics can be extracted and stored:

# 1. Transcribe
~/.claude/skills/speech-to-text/scripts/speech-to-text ~/Downloads/one-on-one.mp3 -o /tmp/transcript.md

# 2. Extract topics (topic-manager skill)
# Process /tmp/transcript.md to update topic notes

Troubleshooting

Audio File Not Found

Error: ERROR: Audio file not found

Solutions:

Use absolute paths for reliability
Verify file exists: ls -lh /full/path/to/audio.mp3
Check file extension matches actual format
Ensure no typos in path

Authentication Errors

Error: Failed to initialize VertexAI or authentication failed

Solutions:

# Re-authenticate
gcloud auth application-default login

# Enable API
gcloud services enable aiplatform.googleapis.com --project=oa-data-btdpexploration-np

Rate Limit Errors (429)

Error: 429: Too Many Requests

Solutions:

Wait 5-10 minutes before retrying
Ensure files are processed sequentially (not in parallel)
Check for other processes hitting the same API
Verify you're not running multiple transcriptions concurrently
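
The wait-and-retry advice can be scripted. This is a minimal sketch with a hypothetical helper name (`retry_with_wait`). Because the binary's exit codes are not documented here, it retries on any non-zero exit, not only on 429 errors.

```shell
# retry_with_wait ATTEMPTS WAIT_SECONDS COMMAND [ARGS...]
# Re-runs COMMAND until it succeeds or ATTEMPTS is reached, sleeping
# WAIT_SECONDS between tries.
retry_with_wait() {
    local attempts="$1" wait_seconds="$2" n=1
    shift 2
    until "$@"; do
        if [ "$n" -ge "$attempts" ]; then
            echo "giving up after $n attempts" >&2
            return 1
        fi
        echo "attempt $n failed; waiting ${wait_seconds}s" >&2
        sleep "$wait_seconds"
        n=$(( n + 1 ))
    done
}

# Example (not executed here): up to 3 tries, 10 minutes apart.
# retry_with_wait 3 600 \
#   ~/.claude/skills/speech-to-text/scripts/speech-to-text ~/Downloads/meeting.mp3 -o ~/Downloads/meeting_transcript.md
```

A fixed wait mirrors the 5-10 minute guidance above. Since every failure triggers a retry, a small attempt count keeps a misconfigured run from looping for hours.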

ffmpeg Not Found

Error: ffmpeg: command not found

Solutions:

# macOS
brew install ffmpeg

# Linux (Debian/Ubuntu)
sudo apt-get install ffmpeg

# Verify installation
ffmpeg -version

Long Processing Time

Expected Behavior:

Short recordings (<30 min): 2-5 minutes
Long recordings (30-60 min): 15-20 minutes
Very long recordings (>60 min): 30+ minutes

Why: Gemini 2.5 Flash processing time scales with audio duration.

Optimization:

Lower bitrate audio (64-128 kbps) is sufficient for speech
MP3 or M4A formats process fastest
Keep files under 100 MB when possible
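
To set expectations before starting, the duration can be read with `ffprobe` (installed alongside ffmpeg) and mapped to the ranges listed under Expected Behavior. Both function names are hypothetical, and the bucket boundaries are this sketch's reading of those ranges.

```shell
# audio_minutes FILE
# Prints the duration of FILE in whole minutes, as reported by ffprobe.
audio_minutes() {
    ffprobe -v error -show_entries format=duration \
        -of default=noprint_wrappers=1:nokey=1 "$1" |
        awk '{ printf "%d\n", $1 / 60 }'
}

# expected_time MINUTES
# Maps a duration in whole minutes to the documented processing-time range.
expected_time() {
    local minutes="$1"
    if [ "$minutes" -lt 30 ]; then
        echo "2-5 minutes"
    elif [ "$minutes" -le 60 ]; then
        echo "15-20 minutes"
    else
        echo "30+ minutes"
    fi
}

# Example (not executed here):
# expected_time "$(audio_minutes ~/Downloads/meeting.mp3)"
```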

Model Information

Default Model: Gemini 2.5 Flash

Model ID: gemini-2.5-flash

Features:

Multimodal (audio + text)
Automatic language detection
Optimized for cost and speed
Well suited to meeting transcription
Speaker identification from conversational context

Alternative Models

Gemini 2.5 Pro - Higher accuracy, slower, more expensive

~/.claude/skills/speech-to-text/scripts/speech-to-text ~/Downloads/audio.mp3 -model gemini-2.5-pro

When to use Pro:

Critical transcriptions requiring maximum accuracy
Complex audio with multiple overlapping speakers
Poor audio quality requiring advanced processing

Performance Tips

File Size: Keep audio files under 100 MB when possible
Format: MP3 or M4A are fastest to process
Quality: Lower bitrate audio (64-128 kbps) is sufficient for speech
Duration: Files >60 minutes may take 30+ minutes to transcribe
Sequential Processing: Always process one file at a time
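
Large or high-bitrate recordings can be re-encoded before transcription. This is a separate, optional step outside the skill, which does not edit audio itself. The sketch below uses standard ffmpeg options, with the function name `compress_for_speech` being hypothetical and the mono downmix (`-ac 1`) an assumption that suits speech.

```shell
# compress_for_speech INPUT OUTPUT
# Re-encodes INPUT as mono 64 kbps audio without any video stream.
# The codec follows OUTPUT's extension (for example .mp3 or .m4a).
compress_for_speech() {
    ffmpeg -i "$1" -vn -ac 1 -b:a 64k "$2"
}

# Example (not executed here):
# compress_for_speech ~/Downloads/meeting.wav ~/Downloads/meeting_small.mp3
```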

Logging

All progress logs are written to stderr, each line prefixed with a timestamp in the format [YYYY-MM-DD HH:MM:SS]:

[2025-12-30 14:23:45] Analyzing recording: filename.ogg
[2025-12-30 14:23:45] File size: 22.45 MB
[2025-12-30 14:23:45] Recording duration: 35.23 minutes (2114.0 seconds)
[2025-12-30 14:23:45] Recording exceeds 30 minutes - will split into 2 chunks
[2025-12-30 14:23:45] Cutting recording into 2 chunks with 30-second overlaps
[2025-12-30 14:23:46] Creating chunk 1/2 - start: 0.0s, duration: 1830.0s
[2025-12-30 14:23:47] Creating chunk 2/2 - start: 1770.0s, duration: 1830.0s
[2025-12-30 14:23:47] Successfully created 2 chunk files
[2025-12-30 14:23:47] Starting parallel transcription of 2 chunks (max 3 concurrent)
[2025-12-30 14:25:30] Transcription of chunk 1/2 completed
[2025-12-30 14:25:42] Transcription of chunk 2/2 completed
[2025-12-30 14:25:42] Starting reconciliation of overlapping chunks
[2025-12-30 14:25:42] Merging all 2 chunks in one pass
[2025-12-30 14:25:42] Sending all chunks to AI for merge reconciliation
[2025-12-30 14:26:15] All chunks merged successfully
[2025-12-30 14:26:15] Reconciliation completed successfully
[2025-12-30 14:26:15] Finalizing output with meeting title
[2025-12-30 14:26:15] Writing final output to: minutes.md
[2025-12-30 14:26:15] Successfully wrote 15.67 KB to minutes.md
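
Because progress goes to stderr and the transcript goes to stdout when `-o` is omitted, the two streams can be captured separately. The helper below is a sketch (the name `strip_timestamps` and the log file name are hypothetical) that removes the timestamp prefix so messages are easier to filter.

```shell
# strip_timestamps: reads log lines on stdin, removes the leading
# "[YYYY-MM-DD HH:MM:SS] " prefix, and prints the remaining message.
strip_timestamps() {
    sed -E 's/^\[[0-9]{4}-[0-9]{2}-[0-9]{2} [0-9]{2}:[0-9]{2}:[0-9]{2}\] //'
}

# Example (not executed here): keep the transcript and the log apart.
# ~/.claude/skills/speech-to-text/scripts/speech-to-text ~/Downloads/audio.wav \
#     > ~/Downloads/audio_transcript.md 2> ~/Downloads/audio_transcribe.log
# strip_timestamps < ~/Downloads/audio_transcribe.log | grep -i 'chunk'
```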

Skill Scope

This Skill Handles

✅ Audio/video to text transcription
✅ Attendee name detection
✅ Speaker attribution
✅ Structured markdown output
✅ Multi-format audio support
✅ Long audio chunking and merging

This Skill Does NOT Handle

❌ Meeting summaries (use meetings-summary skill)
❌ Action item extraction (use meetings-summary skill)
❌ Decision tracking (use meetings-summary skill)
❌ Audio editing or processing
❌ Audio analysis (sentiment, emotions, etc.)

Example Interactions

Example 1: Downloaded Audio

User: "I have a downloaded meeting recording, can you transcribe it?"

Claude Actions:

Search ~/Downloads for audio files (last 7 days)
Display list of found files with size and date
Ask user to confirm which file
Transcribe selected file with absolute path
Save as <filename>_transcript.md in ~/Downloads
Display success message with output location

Example 2: Explicit Path with Summary

User: "Transcribe /home/sebastien/interview.wav and summarize it"

Claude Actions:

Use speech-to-text skill to transcribe
- Output: /home/sebastien/interview_transcript.md
Load meetings-summary skill
Generate summary from transcript

Example 3: Batch Processing

User: "Convert all my meeting recordings to text"

Claude Actions:

Ask user for directory location (default: ~/Downloads)
List all audio files in directory
Ask for confirmation to process all files
Process sequentially (one by one, not parallel)
Save each transcript next to its audio file
Display progress and completion summary
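
The listing step can cover every supported format at once. The sketch below is an illustration (the name `list_audio` is hypothetical). It matches extensions case-insensitively and looks only at the top level of the directory.

```shell
# list_audio DIR
# Prints supported audio files directly inside DIR, one per line, sorted by name.
list_audio() {
    find "$1" -maxdepth 1 -type f \( \
        -iname '*.mp3' -o -iname '*.wav' -o -iname '*.m4a' -o \
        -iname '*.aac' -o -iname '*.ogg' -o -iname '*.flac' \) | sort
}

# Example (not executed here): process the files strictly one at a time.
# stdin is redirected so the transcription cannot consume the file list.
# list_audio ~/Downloads | while IFS= read -r audio; do
#     ~/.claude/skills/speech-to-text/scripts/speech-to-text "$audio" \
#         -o "${audio%.*}_transcript.md" < /dev/null
# done
```

The `while read` loop starts the next file only after the previous command has exited, which keeps the batch sequential.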

Skill Location

Binary: ~/.claude/skills/speech-to-text/scripts/speech-to-text
Documentation: /home/sebastien/projects/claude-config/skills/speech-to-text/
Source Code: Not included (compiled Go binary only)
Logs: stderr (timestamped format)

Author

Sebastien MORAND
Email: seb.morand@gmail.com
Role: CTO Data & AI at L'Oréal
