Claude Code Plugins

Community-maintained marketplace



Install Skill

  1. Download skill
  2. Enable skills in Claude: open claude.ai/settings/capabilities and find the "Skills" section
  3. Upload to Claude: click "Upload skill" and select the downloaded ZIP file

Note: Verify the skill by reading through its instructions before using it.

SKILL.md

name: speech-to-text
description: Expert in transcribing audio and video files to structured meeting minutes using VertexAI Gemini 2.5 Flash. **Use this skill when the user requests to transcribe audio/video files ('transcribe this audio', 'convert audio to text', 'get text from recording'), extract content from recordings, or when preprocessing audio for meeting summaries.** Automatically searches ~/Downloads when user mentions 'downloaded audio'. Supports MP3, WAV, M4A, AAC, OGG, FLAC formats with automatic attendee detection and speaker attribution.

Speech-to-Text Transcription Skill

Expert in transcribing audio and video files to structured meeting minutes with automatic attendee name detection, speaker attribution, and markdown formatting.

Core Capabilities

Audio Processing

  • Multi-Format Support: MP3, WAV, M4A, AAC, OGG, FLAC
  • Auto-Chunking: Splits files >30 minutes into 30-minute segments with 30-second overlaps
  • Parallel Processing: Processes chunks concurrently (max 3 workers)
  • AI Merge Reconciliation: One-pass merge of all chunks with overlap detection
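
The chunking arithmetic can be sketched in shell. This is an illustration under stated assumptions, not the binary's confirmed logic: `chunk_plan` is a hypothetical helper, and its boundary rule (start each later chunk 30 seconds before its nominal 30-minute boundary, with a fixed length of 1830 seconds) was chosen only because it reproduces the start and duration values in the sample log in the Logging section.

```shell
# Sketch only: the constants mirror the figures quoted above, but the
# boundary arithmetic is an assumption, not the binary's documented logic.
SEGMENT_SECONDS=1800   # 30-minute segments
OVERLAP_SECONDS=30     # 30-second overlap

# chunk_plan DURATION_SECONDS
# Prints one "start length" pair (whole seconds) per chunk.
chunk_plan() {
    duration=$1
    # Recordings of 30 minutes or less are not split.
    if [ "$duration" -le "$SEGMENT_SECONDS" ]; then
        echo "0 $duration"
        return 0
    fi
    # Round up so a trailing partial segment gets its own chunk.
    count=$(( (duration + SEGMENT_SECONDS - 1) / SEGMENT_SECONDS ))
    i=0
    while [ "$i" -lt "$count" ]; do
        # Later chunks start 30 seconds before their nominal boundary.
        start=$(( i * SEGMENT_SECONDS - OVERLAP_SECONDS ))
        if [ "$start" -lt 0 ]; then
            start=0
        fi
        echo "$start $(( SEGMENT_SECONDS + OVERLAP_SECONDS ))"
        i=$(( i + 1 ))
    done
}
```

For a 2114-second recording, `chunk_plan 2114` prints `0 1830` and `1770 1830`. The last chunk's length may run past the end of the file; a tool such as ffmpeg simply stops at end of input.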

Meeting Minutes Generation

  • Attendee Detection: Automatically identifies speaker names from conversation
  • Speaker Attribution: Associates each statement with actual speaker names (not "Speaker 1", "Speaker 2")
  • Structured Output: Clean markdown with Attendees list and Minutes sections
  • Language Preservation: Maintains original audio language

Performance

  • Instant Startup: Compiled Go binary with no runtime dependencies other than ffmpeg (needed only to chunk files >30 minutes)
  • No Environment Setup: Default GCP project built-in (oa-data-btdpexploration-np)
  • Processing Time: ~15-20 minutes for recordings longer than 30 minutes
  • Single Executable: No Python, no virtualenv, no package installation

Usage Instructions

Basic Transcription

# Transcribe audio and save to file
~/.claude/skills/speech-to-text/scripts/speech-to-text ~/Downloads/meeting.mp3 -o ~/Downloads/meeting_transcript.md

# With custom meeting name
~/.claude/skills/speech-to-text/scripts/speech-to-text ~/Downloads/recording.mp3 -o transcript.md -m "Weekly Team Sync"

# Display to stdout (no file save)
~/.claude/skills/speech-to-text/scripts/speech-to-text ~/Downloads/audio.wav

Command-Line Options

| Flag | Required | Default | Description |
|------|----------|---------|-------------|
| <audio-file> | Yes | - | Path to audio file (absolute path recommended) |
| -o | No | stdout | Output markdown file path |
| -m | No | filename | Custom meeting name for title |
| -project | No | oa-data-btdpexploration-np | GCP project ID |
| -location | No | global | GCP region for VertexAI |
| -model | No | gemini-2.5-flash | Gemini model to use |

Finding Audio Files in Downloads

When the user mentions "downloaded audio" or "audio from downloads":

# Search for audio files in Downloads (last 7 days)
find ~/Downloads -type f \( -name "*.mp3" -o -name "*.m4a" -o -name "*.wav" -o -name "*.ogg" -o -name "*.aac" -o -name "*.flac" \) -mtime -7 | sort -r

# Or list by modification time
ls -lt ~/Downloads/*.{mp3,m4a,wav,ogg,aac,flac} 2>/dev/null | head -10
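
The two searches above can be combined into one helper that matches extensions case-insensitively and lists the newest files first. This is a minimal sketch: `recent_audio` is a hypothetical name, and `-iname` is an extension supported by GNU and BSD `find` rather than a strict POSIX option.

```shell
# recent_audio [DIR] [DAYS]
# Lists audio files under DIR modified within the last DAYS days,
# newest first. Defaults: ~/Downloads and 7 days.
recent_audio() {
    dir=${1:-"$HOME/Downloads"}
    days=${2:-7}
    # -exec ... {} + hands all matches to a single ls call, which sorts
    # them by modification time; nothing is run when there are no matches.
    find "$dir" -type f \( \
        -iname '*.mp3' -o -iname '*.wav' -o -iname '*.m4a' -o \
        -iname '*.aac' -o -iname '*.ogg' -o -iname '*.flac' \
    \) -mtime "-$days" -exec ls -t {} +
}
```

Calling `recent_audio` with no arguments searches `~/Downloads` for the last 7 days; `recent_audio ~/recordings 30` widens the directory and the window.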

Typical Workflows

Workflow 1: User Mentions Downloaded Audio

User Request: "Transcribe the downloaded meeting audio"

Steps:

  1. Search ~/Downloads for recent audio files (last 7 days)
  2. List found files and ask user to confirm which one
  3. Transcribe using absolute path
  4. Save as <filename>_transcript.md in ~/Downloads

# Find recent audio files
find ~/Downloads -type f \( -name "*.mp3" -o -name "*.m4a" \) -mtime -7

# Transcribe selected file
~/.claude/skills/speech-to-text/scripts/speech-to-text ~/Downloads/meeting_2025-12-30.mp3 -o ~/Downloads/meeting_2025-12-30_transcript.md

Workflow 2: User Provides Explicit Path

User Request: "Transcribe /home/sebastien/recordings/interview.wav"

Steps:

  1. Use provided path directly
  2. Save transcript next to original: /home/sebastien/recordings/interview_transcript.md

~/.claude/skills/speech-to-text/scripts/speech-to-text /home/sebastien/recordings/interview.wav -o /home/sebastien/recordings/interview_transcript.md

Workflow 3: Preprocessing for Meeting Summary

User Request: "Create a summary of the recorded meeting"

Steps:

  1. Use speech-to-text skill to transcribe audio → generates .md transcript
  2. Pass transcript to meetings-summary skill → generates structured summary

# Step 1: Transcribe (this skill)
~/.claude/skills/speech-to-text/scripts/speech-to-text ~/Downloads/meeting.mp3 -o /tmp/transcript.md

# Step 2: Summarize (meetings-summary skill)
# Pass /tmp/transcript.md to meetings-summary skill

Workflow 4: Batch Processing Multiple Files

User Request: "Transcribe all MP3 files in Downloads"

CRITICAL: Always process files sequentially (NEVER in parallel)

# Process sequentially to avoid rate limits
for audio in ~/Downloads/*.mp3; do
    # Skip the unexpanded pattern when no MP3 files match
    [ -e "$audio" ] || continue
    echo "Processing: $audio"
    output="${audio%.mp3}_transcript.md"
    # Runs in the foreground, so each file finishes before the next starts
    ~/.claude/skills/speech-to-text/scripts/speech-to-text "$audio" -o "$output"
done

❌ NEVER do this (parallel processing will fail):

# DO NOT USE - will hit rate limits
for audio in ~/Downloads/*.mp3; do
    ~/.claude/skills/speech-to-text/scripts/speech-to-text "$audio" -o "${audio%.mp3}_transcript.md" &
done
wait
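
The sequential loop above can be generalized to every supported format. The sketch below is one possible shape, not part of the skill: `transcribe_all` and the `STT_BIN` variable are hypothetical names, and skipping files that already have a transcript is an added assumption.

```shell
# Path to the transcription binary; override STT_BIN to point elsewhere.
STT_BIN="${STT_BIN:-$HOME/.claude/skills/speech-to-text/scripts/speech-to-text}"

# transcribe_all DIR
# Transcribes each supported audio file in DIR, one at a time.
transcribe_all() {
    dir=$1
    for audio in "$dir"/*.mp3 "$dir"/*.wav "$dir"/*.m4a \
                 "$dir"/*.aac "$dir"/*.ogg "$dir"/*.flac; do
        # Unmatched patterns stay literal; skip them.
        [ -f "$audio" ] || continue
        out="${audio%.*}_transcript.md"
        # Assumption: an existing transcript means the file is already done.
        if [ -e "$out" ]; then
            echo "Skipping (transcript exists): $audio"
            continue
        fi
        echo "Processing: $audio"
        # Foreground call: the next file starts only after this one ends.
        "$STT_BIN" "$audio" -o "$out" || {
            echo "Failed: $audio" >&2
            return 1
        }
    done
}
```

Running `transcribe_all ~/Downloads` stops at the first failure, which keeps a rate-limit error from being repeated across the remaining files.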

Output Format

The transcription generates clean, structured markdown:

# Meeting Name

## Attendees
- Sebastien Morand
- John Doe
- Jane Smith

## Minutes
- **Sebastien Morand**: Welcome everyone. Let's start with the quarterly review.
- **John Doe**: Thank you, Sebastien. I'd like to discuss the budget allocation for Q1.
- **Jane Smith**: I agree with John's points. Additionally, we should consider...

Key Features:

  • Section 1: Attendees list (extracted from conversation)
  • Section 2: Verbatim transcription with speaker attribution
  • Speaker names: Detected from conversation (not generic "Speaker 1")
  • Clean markdown: Ready for further processing or display

Critical Constraints

1. Always Use Absolute Paths

✅ Recommended:

~/.claude/skills/speech-to-text/scripts/speech-to-text ~/Downloads/meeting.mp3
~/.claude/skills/speech-to-text/scripts/speech-to-text /home/sebastien/audio/recording.wav

⚠️ Avoid relative paths (they resolve against the current working directory, which may not be the one you expect):

~/.claude/skills/speech-to-text/scripts/speech-to-text ./meeting.mp3  # May fail if cwd is wrong
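
When only a relative path is available, it can be resolved before calling the binary. A minimal sketch, assuming the file's directory already exists; `abs_path` is a hypothetical helper name.

```shell
# abs_path PATH
# Prints PATH as an absolute path. The containing directory must exist.
abs_path() {
    # Resolve the directory part, then re-attach the file name.
    dir=$(cd "$(dirname "$1")" && pwd) || return 1
    # ${dir%/} avoids a double slash when the directory is "/".
    printf '%s/%s\n' "${dir%/}" "$(basename "$1")"
}
```

For example, `~/.claude/skills/speech-to-text/scripts/speech-to-text "$(abs_path ./meeting.mp3)"` passes the resolved path regardless of how the working directory changes afterwards.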

2. Never Process Files in Parallel

Why: Gemini API rate limits cause 429 errors when multiple files are processed concurrently. Each run already transcribes up to 3 chunks at once, so parallel runs multiply the request load.

Always process sequentially:

  • One file at a time
  • Wait for completion before starting next
  • Use for loops without & backgrounding

3. Default Output Location

Always save transcripts next to the original audio file unless the user specifies otherwise:

  • Input: ~/Downloads/meeting.mp3
  • Output: ~/Downloads/meeting_transcript.md
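
The naming rule above can be expressed as a small helper that works for any extension. This is a sketch; `transcript_path` is a hypothetical name, not something the binary provides.

```shell
# transcript_path AUDIO_PATH
# Prints the default transcript path: same directory, same base name,
# with the extension replaced by "_transcript.md".
transcript_path() {
    dir=$(dirname "$1")
    base=$(basename "$1")
    # Strip only the final extension of the file name, so dots in
    # directory names are left alone.
    printf '%s/%s_transcript.md\n' "${dir%/}" "${base%.*}"
}
```

A typical call is `speech-to-text "$audio" -o "$(transcript_path "$audio")"`, using the full binary path shown elsewhere in this document.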

4. No Environment Variables Needed

The binary has a built-in default project (oa-data-btdpexploration-np). Use the -project flag only if the user explicitly needs a different project.

Prerequisites

System Requirements

ffmpeg - Required for audio chunking (files >30 minutes)

# macOS
brew install ffmpeg

# Linux (Debian/Ubuntu)
sudo apt-get install ffmpeg

GCP Setup

Authentication:

gcloud auth application-default login

Enable the Vertex AI API:

gcloud services enable aiplatform.googleapis.com --project=oa-data-btdpexploration-np

No other dependencies are required; the Go binary is self-contained.
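
A quick preflight check can confirm the prerequisite tools are on the PATH before a long run. A minimal sketch; `missing_tools` is a hypothetical helper, and it checks only that the commands exist, not that authentication or the API is set up.

```shell
# missing_tools TOOL...
# Prints the name of each tool that is not found on the PATH.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
}

# Example usage:
#   missing=$(missing_tools ffmpeg gcloud)
#   [ -z "$missing" ] || echo "Install first: $missing" >&2
```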

Integration with Other Skills

With meetings-summary Skill

This skill provides the audio-to-text conversion. For structured meeting summaries:

# 1. Transcribe audio (this skill)
~/.claude/skills/speech-to-text/scripts/speech-to-text ~/Downloads/meeting.mp3 -o /tmp/transcript.md

# 2. Generate summary (meetings-summary skill)
# Pass /tmp/transcript.md to meetings-summary skill for action items, decisions, etc.

Note: This skill only converts audio to text. Use meetings-summary for:

  • Extracting action items
  • Identifying decisions
  • Structuring topics
  • Generating email summaries

With topic-manager Skill

After transcription, topics can be extracted and stored:

# 1. Transcribe
~/.claude/skills/speech-to-text/scripts/speech-to-text ~/Downloads/one-on-one.mp3 -o /tmp/transcript.md

# 2. Extract topics (topic-manager skill)
# Process /tmp/transcript.md to update topic notes

Troubleshooting

Audio File Not Found

Error: ERROR: Audio file not found

Solutions:

  • Use absolute paths for reliability
  • Verify file exists: ls -lh /full/path/to/audio.mp3
  • Check file extension matches actual format
  • Ensure no typos in path

Authentication Errors

Error: Failed to initialize VertexAI or authentication failed

Solutions:

# Re-authenticate
gcloud auth application-default login

# Enable API
gcloud services enable aiplatform.googleapis.com --project=oa-data-btdpexploration-np

Rate Limit Errors (429)

Error: 429: Too Many Requests

Solutions:

  • Wait 5-10 minutes before retrying
  • Ensure files are processed sequentially (not in parallel)
  • Check for other processes hitting the same API
  • Verify you're not running multiple transcriptions concurrently
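
The wait-then-retry advice above can be wrapped in a helper. This is a sketch under one stated assumption: the binary's exit codes are not documented here, so the helper retries on any non-zero exit rather than on 429 responses specifically. `retry_after` is a hypothetical name.

```shell
# retry_after ATTEMPTS WAIT_SECONDS COMMAND [ARGS...]
# Runs COMMAND until it succeeds or ATTEMPTS runs have failed,
# sleeping WAIT_SECONDS between tries.
retry_after() {
    attempts=$1
    wait_seconds=$2
    shift 2
    n=1
    while :; do
        # Stop as soon as the command succeeds.
        "$@" && return 0
        if [ "$n" -ge "$attempts" ]; then
            echo "Giving up after $n attempts" >&2
            return 1
        fi
        echo "Attempt $n failed; waiting ${wait_seconds}s" >&2
        sleep "$wait_seconds"
        n=$(( n + 1 ))
    done
}

# Example usage (600 seconds is the upper end of the 5-10 minute advice):
#   retry_after 3 600 ~/.claude/skills/speech-to-text/scripts/speech-to-text \
#       ~/Downloads/meeting.mp3 -o ~/Downloads/meeting_transcript.md
```

Because every failure is retried, a missing file or an authentication error will also wait out the full delay; check those causes first.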

ffmpeg Not Found

Error: ffmpeg: command not found

Solutions:

# macOS
brew install ffmpeg

# Linux (Debian/Ubuntu)
sudo apt-get install ffmpeg

# Verify installation
ffmpeg -version

Long Processing Time

Expected Behavior:

  • Short recordings (<30 min): 2-5 minutes
  • Long recordings (30-60 min): 15-20 minutes
  • Very long recordings (>60 min): 30+ minutes

Why: Gemini 2.5 Flash processing time scales with audio duration.

Optimization:

  • Lower bitrate audio (64-128 kbps) is sufficient for speech
  • MP3 or M4A formats process fastest
  • Keep files under 100 MB when possible
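
The 100 MB guideline can be checked before starting a run. A minimal sketch; `over_size_limit` and `MAX_BYTES` are hypothetical names, and treating 100 MB as 100 × 1024 × 1024 bytes is an assumption.

```shell
# Assumed threshold: 100 MB counted as 100 * 1024 * 1024 bytes.
MAX_BYTES=$(( 100 * 1024 * 1024 ))

# over_size_limit FILE [LIMIT_BYTES]
# Succeeds (exit 0) when FILE is larger than the limit, fails (exit 1)
# when it is not, and returns 2 when FILE does not exist.
over_size_limit() {
    [ -f "$1" ] || return 2
    # The arithmetic expansion strips padding that some wc versions print.
    size=$(( $(wc -c < "$1") ))
    [ "$size" -gt "${2:-$MAX_BYTES}" ]
}

# Example usage:
#   if over_size_limit ~/Downloads/meeting.wav; then
#       echo "Consider re-encoding at a lower bitrate first" >&2
#   fi
```

One way to shrink a large file is `ffmpeg -i input.wav -b:a 64k output.mp3`, which re-encodes at the low end of the 64-128 kbps range suggested above.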

Model Information

Default Model: Gemini 2.5 Flash

Model ID: gemini-2.5-flash

Features:

  • Multimodal (audio + text)
  • Automatic language detection
  • Optimized for cost and speed
  • Best for meeting transcriptions
  • High accuracy for speaker identification

Alternative Models

Gemini 2.5 Pro - Higher accuracy, slower, more expensive

~/.claude/skills/speech-to-text/scripts/speech-to-text audio.mp3 -model gemini-2.5-pro

When to use Pro:

  • Critical transcriptions requiring maximum accuracy
  • Complex audio with multiple overlapping speakers
  • Poor audio quality requiring advanced processing

Performance Tips

  1. File Size: Keep audio files under 100 MB when possible
  2. Format: MP3 or M4A are fastest to process
  3. Quality: Lower bitrate audio (64-128 kbps) is sufficient for speech
  4. Duration: Files >60 minutes may take 30+ minutes to transcribe
  5. Sequential Processing: Always process one file at a time

Logging

All progress logs are written to stderr with timestamps in format [YYYY-MM-DD HH:MM:SS]:

[2025-12-30 14:23:45] Analyzing recording: filename.ogg
[2025-12-30 14:23:45] File size: 22.45 MB
[2025-12-30 14:23:45] Recording duration: 35.23 minutes (2114.0 seconds)
[2025-12-30 14:23:45] Recording exceeds 30 minutes - will split into 2 chunks
[2025-12-30 14:23:45] Cutting recording into 2 chunks with 30-second overlaps
[2025-12-30 14:23:46] Creating chunk 1/2 - start: 0.0s, duration: 1830.0s
[2025-12-30 14:23:47] Creating chunk 2/2 - start: 1770.0s, duration: 1830.0s
[2025-12-30 14:23:47] Successfully created 2 chunk files
[2025-12-30 14:23:47] Starting parallel transcription of 2 chunks (max 3 concurrent)
[2025-12-30 14:25:30] Transcription of chunk 1/2 completed
[2025-12-30 14:25:42] Transcription of chunk 2/2 completed
[2025-12-30 14:25:42] Starting reconciliation of overlapping chunks
[2025-12-30 14:25:42] Merging all 2 chunks in one pass
[2025-12-30 14:25:42] Sending all chunks to AI for merge reconciliation
[2025-12-30 14:26:15] All chunks merged successfully
[2025-12-30 14:26:15] Reconciliation completed successfully
[2025-12-30 14:26:15] Finalizing output with meeting title
[2025-12-30 14:26:15] Writing final output to: minutes.md
[2025-12-30 14:26:15] Successfully wrote 15.67 KB to minutes.md
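
Because the logs go to stderr, they can be captured separately with `2> run.log` and inspected afterwards. The sketch below reads the elapsed time from such a file. It assumes the timestamp format shown above and a run that does not cross midnight; `log_elapsed_seconds` is a hypothetical helper.

```shell
# log_elapsed_seconds LOGFILE
# Prints the seconds between the first and last timestamped lines.
# Assumes lines start with "[YYYY-MM-DD HH:MM:SS]" on a single day.
log_elapsed_seconds() {
    awk '
        /^\[[0-9-]+ [0-9:]+\]/ {
            # Characters 13-20 hold the HH:MM:SS part of the timestamp.
            split(substr($0, 13, 8), t, ":")
            now = t[1] * 3600 + t[2] * 60 + t[3]
            if (!seen) { first = now; seen = 1 }
            last = now
        }
        END { if (seen) print last - first }
    ' "$1"
}

# Example usage:
#   ~/.claude/skills/speech-to-text/scripts/speech-to-text \
#       ~/Downloads/meeting.mp3 -o ~/Downloads/meeting_transcript.md 2> run.log
#   log_elapsed_seconds run.log
```

Applied to the sample log above, which runs from 14:23:45 to 14:26:15, the helper prints 150.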

Skill Scope

This Skill Handles

  • ✅ Audio/video to text transcription
  • ✅ Attendee name detection
  • ✅ Speaker attribution
  • ✅ Structured markdown output
  • ✅ Multi-format audio support
  • ✅ Long audio chunking and merging

This Skill Does NOT Handle

  • ❌ Meeting summaries (use meetings-summary skill)
  • ❌ Action item extraction (use meetings-summary skill)
  • ❌ Decision tracking (use meetings-summary skill)
  • ❌ Audio editing or processing
  • ❌ Audio analysis (sentiment, emotions, etc.)

Example Interactions

Example 1: Downloaded Audio

User: "I have a downloaded meeting recording, can you transcribe it?"

Claude Actions:

  1. Search ~/Downloads for audio files (last 7 days)
  2. Display list of found files with size and date
  3. Ask user to confirm which file
  4. Transcribe selected file with absolute path
  5. Save as <filename>_transcript.md in ~/Downloads
  6. Display success message with output location

Example 2: Explicit Path with Summary

User: "Transcribe /home/sebastien/interview.wav and summarize it"

Claude Actions:

  1. Use speech-to-text skill to transcribe
    • Output: /home/sebastien/interview_transcript.md
  2. Load meetings-summary skill
  3. Generate summary from transcript

Example 3: Batch Processing

User: "Convert all my meeting recordings to text"

Claude Actions:

  1. Ask user for directory location (default: ~/Downloads)
  2. List all audio files in directory
  3. Ask for confirmation to process all files
  4. Process sequentially (one by one, not parallel)
  5. Save each transcript next to its audio file
  6. Display progress and completion summary

Skill Location

  • Binary: ~/.claude/skills/speech-to-text/scripts/speech-to-text
  • Documentation: /home/sebastien/projects/claude-config/skills/speech-to-text/
  • Source Code: Not included (compiled Go binary only)
  • Logs: stderr (timestamped format)

Author

Sebastien MORAND
Email: seb.morand@gmail.com
Role: CTO Data & AI at L'Oréal