| name | video-subtitle-cutter |
| description | Transcribe video, analyze subtitles with AI, and cut video by removing filler words, pauses, and mistakes |
| license | MIT |
| compatibility | opencode |
What I Do
Automate video editing by:
- Transcribing video to timestamped subtitles (Whisper)
- Analyzing transcript with AI to identify cuts (filler words, pauses, mistakes)
- Generating FFmpeg commands to cut and concatenate clean segments
- Generating subtitles (SRT) for the final video
CRITICAL: Always Re-encode (Never Use -c copy)
The #1 mistake is using -c copy for cutting. This causes:
- Frozen frames at cut points (1-8 seconds of freeze)
- Audio/video sync issues
- Glitchy playback
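Why? H.264 video typically stores a full keyframe (I-frame) only every 2-10 seconds, and the frames in between depend on the frames before them. -c copy cannot re-encode, so it can only cut at keyframes; when a cut starts between keyframes, FFmpeg includes extra frames that display as frozen.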
Solution: Always re-encode segments with quality settings:
# WRONG - causes freeze frames
ffmpeg -ss 10 -i video.mp4 -t 5 -c copy segment.mp4
# CORRECT - smooth cuts at any timestamp
ffmpeg -ss 10 -i video.mp4 -t 5 \
-c:v libx264 -preset fast -crf 18 \
-c:a aac -b:a 192k \
-avoid_negative_ts make_zero \
segment.mp4
Quality presets (CRF = Constant Rate Factor):
- crf 15-17 = Near lossless (large files)
- crf 18-20 = High quality (recommended)
- crf 21-23 = Good quality (smaller files)
- crf 24-28 = Medium quality (much smaller)
Prerequisites
# Install Whisper (choose one)
pip install openai-whisper # Local (requires Python 3.9+)
# OR use OpenAI API (no local install needed)
# Install FFmpeg
brew install ffmpeg # macOS
sudo apt install ffmpeg # Linux
Quick Start
Step 1: Transcribe Video
Option A: Local Whisper (free, slower)
whisper video.mp4 --model medium --output_format json --output_dir ./
Option B: OpenAI Whisper API (fast, paid)
curl https://api.openai.com/v1/audio/transcriptions \
-H "Authorization: Bearer $OPENAI_API_KEY" \
-F file="@video.mp4" \
-F model="whisper-1" \
-F response_format="verbose_json" \
-F "timestamp_granularities[]=segment" \
> transcript.json
Option C: Use ffmpeg to extract audio first (for large files)
# Extract audio (much smaller file to upload)
ffmpeg -i video.mp4 -vn -acodec libmp3lame -q:a 2 audio.mp3
# Then transcribe the audio
whisper audio.mp3 --model medium --output_format json
Step 2: Analyze Transcript for Cuts
Feed the transcript to the AI with this prompt:
Analyze this video transcript and identify segments to CUT (remove).
TRANSCRIPT:
{paste transcript.json segments here}
Identify these issues:
1. FILLER WORDS: "um", "uh", "like", "you know", "basically", "actually", "so", "right"
2. FALSE STARTS: Incomplete sentences that restart ("I think— actually, let me...")
3. LONG PAUSES: Gaps > 1.5 seconds between segments
4. REPETITIONS: Same word/phrase repeated ("really really really")
5. CORRECTIONS: "Wait, I meant...", "Sorry, let me rephrase..."
6. TANGENTS: Off-topic rambling (use judgment)
Return a JSON array of segments to KEEP (not cut):
[
{"start": 0.0, "end": 2.5, "text": "Welcome to this video"},
{"start": 3.1, "end": 8.4, "text": "Today we're going to cover..."},
...
]
Rules:
- Merge adjacent keep segments if gap < 0.3s
- Ensure cuts don't happen mid-word (check word boundaries)
- Preserve natural speech rhythm (don't over-cut)
- When in doubt, keep the segment
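The first rule above (merge adjacent keep segments if the gap is under 0.3s) can also be enforced after the AI responds, in case the model skips it. The following is a minimal sketch, assuming the keep segments use the start/end/text keys shown in the prompt; the function name merge_keep_segments is hypothetical.

```python
def merge_keep_segments(segments, max_gap=0.3):
    """Merge keep segments whose gap to the previous segment is below max_gap seconds."""
    merged = []
    for seg in sorted(segments, key=lambda s: s["start"]):
        if merged and seg["start"] - merged[-1]["end"] < max_gap:
            prev = merged[-1]
            prev["end"] = max(prev["end"], seg["end"])
            prev["text"] = (prev.get("text", "") + " " + seg.get("text", "")).strip()
        else:
            merged.append(dict(seg))  # copy so the caller's list is not mutated
    return merged


keep = [
    {"start": 0.0, "end": 2.5, "text": "Welcome to this video"},
    {"start": 2.6, "end": 4.0, "text": "and thanks for watching"},
    {"start": 5.0, "end": 8.4, "text": "Today we're going to cover..."},
]
merged = merge_keep_segments(keep)
for seg in merged:
    print(seg["start"], seg["end"], seg["text"])
```

Fewer, longer segments also mean fewer re-encode jobs and fewer joins in the final video.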
Step 3: Generate FFmpeg Commands (High Quality)
Once you have the keep segments, use this Python script for smooth cuts:
import json
import os
import subprocess

VIDEO_INPUT = "video.mp4"
VIDEO_OUTPUT = "video_clean.mp4"
SEGMENTS_FILE = "keep_segments.json"

with open(SEGMENTS_FILE) as f:
    segments = json.load(f)

segment_files = []
for i, seg in enumerate(segments):
    outfile = f"temp_seg_{i:04d}.mp4"
    # MUST re-encode for smooth cuts (no -c copy!)
    cmd = [
        'ffmpeg', '-y',
        '-ss', str(seg['start']),  # Seek BEFORE input (fast)
        '-i', VIDEO_INPUT,
        '-t', str(seg['end'] - seg['start']),  # Duration
        '-c:v', 'libx264',
        '-preset', 'fast',  # fast/medium/slow
        '-crf', '18',  # Quality (lower = better, 15-23 recommended)
        '-c:a', 'aac',
        '-b:a', '192k',
        '-avoid_negative_ts', 'make_zero',  # Fix timestamp issues
        '-async', '1',  # Sync audio
        outfile
    ]
    # check=True stops the script if FFmpeg fails instead of continuing silently
    subprocess.run(cmd, capture_output=True, check=True)
    segment_files.append(outfile)
    print(f"✓ Segment {i+1}/{len(segments)}")

# Create concat file
with open('temp_concat.txt', 'w') as f:
    for sf in segment_files:
        f.write(f"file '{sf}'\n")

# Concatenate (can use -c copy here since all segments match)
subprocess.run([
    'ffmpeg', '-y', '-f', 'concat', '-safe', '0',
    '-i', 'temp_concat.txt',
    '-c', 'copy',
    VIDEO_OUTPUT
], check=True)

# Cleanup
for sf in segment_files:
    os.remove(sf)
os.remove('temp_concat.txt')
print(f"✓ Created: {VIDEO_OUTPUT}")
Key flags explained:
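- -ss before -i: Fast seek (doesn't decode entire video)
- -t: Duration of segment (not end time)
- -crf 18: High quality encoding
- -avoid_negative_ts make_zero: Fixes concat timestamp issues
- -async 1: Keeps audio in sync (the FFmpeg documentation marks this option as deprecated in favor of the aresample audio filter)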
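To keep the flag order consistent across segments, the command can be built by a small helper. This is a sketch only: build_cut_command is a hypothetical name, and it simply mirrors the flags listed above, placing -ss before -i and converting the end time into a -t duration.

```python
def build_cut_command(input_file, start, end, output_file, crf=18, preset="fast"):
    """Return the FFmpeg argument list for one re-encoded segment."""
    if end <= start:
        raise ValueError("end must be greater than start")
    duration = end - start  # -t expects a duration, not an end time
    return [
        "ffmpeg", "-y",
        "-ss", f"{start:.3f}",       # input seeking: must come before -i
        "-i", input_file,
        "-t", f"{duration:.3f}",
        "-c:v", "libx264", "-preset", preset, "-crf", str(crf),
        "-c:a", "aac", "-b:a", "192k",
        "-avoid_negative_ts", "make_zero",
        "-async", "1",
        output_file,
    ]


cmd = build_cut_command("video.mp4", 12.3, 25.0, "temp_seg_0001.mp4")
print(" ".join(cmd))
```

Returning a list rather than a shell string means the result can be passed straight to subprocess.run without quoting problems.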
Step 4: Generate Subtitles
After creating the final video, generate fresh subtitles with Whisper:
# Generate SRT subtitles for the cleaned video
whisper video_clean.mp4 --model medium --output_format srt --output_dir ./
# For higher accuracy (slower):
whisper video_clean.mp4 --model large --output_format srt --language en
# Output: video_clean.srt
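If the keep segments already carry usable text, a draft SRT can also be derived from them without a second Whisper pass. The sketch below assumes the cleaned video is exactly the kept segments played back to back, so each subtitle starts where the previous one ended; the function names are hypothetical, and re-running Whisper as shown above remains the more accurate route.

```python
def to_srt_timestamp(seconds):
    """Format seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    total_ms = int(round(seconds * 1000))
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def keep_segments_to_srt(segments):
    """Build SRT text whose timings follow the concatenated (cleaned) timeline."""
    lines = []
    cursor = 0.0  # running position in the cleaned video
    for index, seg in enumerate(segments, start=1):
        duration = seg["end"] - seg["start"]
        lines.append(str(index))
        lines.append(f"{to_srt_timestamp(cursor)} --> {to_srt_timestamp(cursor + duration)}")
        lines.append(seg["text"].strip())
        lines.append("")
        cursor += duration
    return "\n".join(lines)


srt_text = keep_segments_to_srt([
    {"start": 0.0, "end": 2.5, "text": "Welcome to this video."},
    {"start": 3.2, "end": 5.8, "text": "Today we're going to..."},
])
print(srt_text)
```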
Burn subtitles into video (optional):
# Embed subtitles permanently
ffmpeg -i video_clean.mp4 -vf "subtitles=video_clean.srt:force_style='FontSize=24,FontName=Arial,PrimaryColour=&HFFFFFF,OutlineColour=&H000000,Outline=2'" -c:a copy video_with_subs.mp4
Subtitle styling options:
- FontSize=24 - Text size
- FontName=Arial - Font face
- PrimaryColour=&HFFFFFF - White text (BGR format)
- OutlineColour=&H000000 - Black outline
- Outline=2 - Outline thickness
- MarginV=50 - Distance from bottom
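Because the colour values are in BGR order rather than the more familiar RGB, hand-written colours are easy to get backwards. The following sketch converts an RGB hex colour and assembles a force_style string from the options listed above; both function names are hypothetical helpers, not part of FFmpeg.

```python
def rgb_hex_to_ass(rgb_hex):
    """Convert '#RRGGBB' to the '&HBBGGRR' form used by force_style."""
    value = rgb_hex.lstrip("#")
    if len(value) != 6:
        raise ValueError("expected a colour like #RRGGBB")
    int(value, 16)  # raises ValueError if the string is not hexadecimal
    red, green, blue = value[0:2], value[2:4], value[4:6]
    return f"&H{blue}{green}{red}".upper()


def build_force_style(**options):
    """Join style options into the comma-separated force_style value."""
    return ",".join(f"{key}={value}" for key, value in options.items())


style = build_force_style(
    FontSize=24,
    FontName="Arial",
    PrimaryColour=rgb_hex_to_ass("#FFFFFF"),
    OutlineColour=rgb_hex_to_ass("#000000"),
    Outline=2,
    MarginV=50,
)
print(f"subtitles=video_clean.srt:force_style='{style}'")
```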
Complete Workflow Script (High Quality)
#!/usr/bin/env python3
"""
video_clean.py - Clean up video by removing filler words/pauses
Uses re-encoding for smooth cuts (no freeze frames)
"""
import json
import os
import subprocess
import sys


def get_duration(filepath):
    """Get video duration in seconds"""
    result = subprocess.run([
        'ffprobe', '-v', 'quiet', '-print_format', 'json', '-show_format', filepath
    ], capture_output=True, text=True, check=True)
    return float(json.loads(result.stdout)['format']['duration'])


def extract_segment(input_file, start, end, output_file, crf=18, preset='fast'):
    """Extract a segment with re-encoding for smooth cuts"""
    cmd = [
        'ffmpeg', '-y',
        '-ss', str(start),
        '-i', input_file,
        '-t', str(end - start),
        '-c:v', 'libx264',
        '-preset', preset,
        '-crf', str(crf),
        '-c:a', 'aac',
        '-b:a', '192k',
        '-avoid_negative_ts', 'make_zero',
        '-async', '1',
        output_file
    ]
    return subprocess.run(cmd, capture_output=True, text=True)


def concatenate_segments(segment_files, output_file):
    """Concatenate segments into final video"""
    with open('temp_concat.txt', 'w') as f:
        for sf in segment_files:
            f.write(f"file '{sf}'\n")
    try:
        subprocess.run([
            'ffmpeg', '-y', '-f', 'concat', '-safe', '0',
            '-i', 'temp_concat.txt',
            '-c', 'copy',
            output_file
        ], capture_output=True, check=True)
    finally:
        os.remove('temp_concat.txt')


def generate_subtitles(video_file, model='medium'):
    """Generate SRT subtitles using Whisper"""
    subprocess.run([
        'whisper', video_file,
        '--model', model,
        '--output_format', 'srt',
        '--output_dir', './'
    ])


def main(video_input, segments, output_name, crf=18):
    """Main workflow"""
    segment_files = []
    print(f"\n{'='*50}")
    print(f"Processing: {video_input}")
    print(f"Quality: CRF {crf} (lower=better, 15-23 recommended)")
    print(f"{'='*50}\n")

    # Extract segments with re-encoding
    for i, seg in enumerate(segments):
        outfile = f"temp_seg_{i:04d}.mp4"
        result = extract_segment(video_input, seg['start'], seg['end'], outfile, crf)
        if result.returncode == 0:
            segment_files.append(outfile)
            duration = seg['end'] - seg['start']
            print(f"✓ Segment {i+1}/{len(segments)}: {duration:.1f}s")
        else:
            print(f"✗ Error on segment {i+1}")
            print(result.stderr[-500:])
            # Stop here: concatenating without this segment would silently drop content
            sys.exit(1)

    # Concatenate
    print("\nConcatenating segments...")
    concatenate_segments(segment_files, output_name)

    # Cleanup temp segments
    for sf in segment_files:
        os.remove(sf)

    # Generate subtitles
    print("\nGenerating subtitles...")
    generate_subtitles(output_name)

    # Stats
    orig_duration = get_duration(video_input)
    new_duration = get_duration(output_name)
    orig_size = os.path.getsize(video_input) / (1024*1024)
    new_size = os.path.getsize(output_name) / (1024*1024)
    print(f"\n{'='*50}")
    print("COMPLETE")
    print(f"{'='*50}")
    print(f"Original: {orig_duration:.0f}s | {orig_size:.1f} MB")
    print(f"Output: {new_duration:.0f}s | {new_size:.1f} MB")
    print(f"Removed: {orig_duration - new_duration:.0f}s ({((orig_duration - new_duration)/orig_duration)*100:.0f}%)")
    print(f"Video: {output_name}")
    print(f"Subtitles: {output_name.replace('.mp4', '.srt')}")


if __name__ == '__main__':
    # Example usage
    VIDEO = "input.mp4"
    SEGMENTS = [
        {"start": 0.0, "end": 10.5},
        {"start": 12.3, "end": 25.0},
        # ... add your segments
    ]
    main(VIDEO, SEGMENTS, "output_clean.mp4", crf=18)
AI Analysis Prompt Templates
Basic Cleanup (Filler Words Only)
Remove filler words from this transcript. Return segments to KEEP.
Filler words to remove: um, uh, like, you know, basically, actually, so, right, I mean
TRANSCRIPT SEGMENTS:
{segments}
Return JSON: [{"start": float, "end": float, "text": "cleaned text"}, ...]
Aggressive Cleanup (Podcast/Interview)
Clean this podcast transcript for a tight, professional edit.
REMOVE:
- All filler words (um, uh, like, you know, basically, so, right)
- False starts and restarts
- Pauses longer than 1 second
- Repetitions
- Off-topic tangents
- "That's a great question" type filler responses
- Excessive laughter/reactions (keep some for naturalness)
KEEP:
- Core content and insights
- Natural transitions
- Important reactions that add context
TRANSCRIPT:
{segments}
Return JSON array of segments to KEEP with cleaned text.
Light Cleanup (Preserve Natural Feel)
Lightly clean this transcript while preserving natural speech patterns.
ONLY REMOVE:
- "Um" and "uh" when standalone (not part of thinking pause)
- Obvious mistakes followed by corrections
- Technical issues (coughs, phone rings, etc.)
PRESERVE:
- Natural "like" and "you know" that add personality
- Thinking pauses that feel authentic
- Personality quirks
TRANSCRIPT:
{segments}
Return JSON array of segments to KEEP.
Transcript Format Reference
Whisper JSON Output
{
"text": "Full transcript text...",
"segments": [
{
"id": 0,
"start": 0.0,
"end": 2.5,
"text": " Welcome to this video.",
"tokens": [50364, 5765, ...],
"temperature": 0.0,
"avg_logprob": -0.25,
"compression_ratio": 1.2,
"no_speech_prob": 0.01
},
{
"id": 1,
"start": 2.5,
"end": 5.8,
"text": " Um, so today we're going to...",
...
}
],
"language": "en"
}
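Long pauses do not need AI judgment, since they can be read directly from the segment timestamps in this format. A minimal sketch, assuming the segments are in chronological order and using the 1.5 second threshold from Step 2; find_long_pauses is a hypothetical helper name.

```python
def find_long_pauses(segments, min_gap=1.5):
    """Return the silent gaps between consecutive segments longer than min_gap seconds."""
    pauses = []
    for prev, nxt in zip(segments, segments[1:]):
        gap = nxt["start"] - prev["end"]
        if gap > min_gap:
            pauses.append({"start": prev["end"], "end": nxt["start"], "gap": round(gap, 3)})
    return pauses


transcript = {
    "segments": [
        {"id": 0, "start": 0.0, "end": 2.5, "text": " Welcome to this video."},
        {"id": 1, "start": 2.5, "end": 5.8, "text": " Um, so today we're going to..."},
        {"id": 2, "start": 8.0, "end": 10.0, "text": " Let's get started."},
    ]
}
pauses = find_long_pauses(transcript["segments"])
print(pauses)
```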
Keep Segments Format (for FFmpeg)
[
{ "start": 0.0, "end": 2.5, "text": "Welcome to this video." },
{ "start": 3.2, "end": 5.8, "text": "Today we're going to..." }
]
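AI-generated keep lists occasionally contain reversed or overlapping ranges, so it may be worth validating them before any FFmpeg call. The sketch below checks the format shown above; validate_keep_segments is a hypothetical name and the specific checks are assumptions about what counts as a usable list.

```python
def validate_keep_segments(segments):
    """Return a list of human-readable problems; an empty list means the input looks usable."""
    problems = []
    previous_end = None
    for i, seg in enumerate(segments):
        if "start" not in seg or "end" not in seg:
            problems.append(f"segment {i}: missing start or end")
            continue
        if seg["start"] < 0:
            problems.append(f"segment {i}: start is negative")
        if seg["end"] <= seg["start"]:
            problems.append(f"segment {i}: end must be greater than start")
        if previous_end is not None and seg["start"] < previous_end:
            problems.append(f"segment {i}: overlaps or precedes the previous segment")
        previous_end = seg["end"]
    return problems


good = [
    {"start": 0.0, "end": 2.5, "text": "Welcome to this video."},
    {"start": 3.2, "end": 5.8, "text": "Today we're going to..."},
]
bad = [{"start": 3.0, "end": 2.0}, {"start": 1.0, "end": 4.0}]
print(validate_keep_segments(good))
print(validate_keep_segments(bad))
```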
Advanced: Word-Level Timestamps
For precise filler word removal, use word-level timestamps:
# Whisper with word timestamps
whisper video.mp4 --model medium --word_timestamps True --output_format json
This gives you:
{
"segments": [
{
"start": 0.0,
"end": 2.5,
"text": "Um welcome to this video",
"words": [
{ "word": "Um", "start": 0.0, "end": 0.3 },
{ "word": "welcome", "start": 0.5, "end": 0.9 },
{ "word": "to", "start": 0.9, "end": 1.0 },
{ "word": "this", "start": 1.0, "end": 1.2 },
{ "word": "video", "start": 1.2, "end": 1.6 }
]
}
]
}
Now you can cut precisely around "Um" (0.0-0.3) and keep "welcome to this video" (0.5-1.6).
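The same computation can be done in code. This sketch walks the words list, drops filler words, and groups the remaining words into keep ranges; the filler set and the function name are illustrative assumptions, and a removed word always ends the current range so the filler is never left inside it.

```python
FILLERS = {"um", "uh"}  # illustrative; extend to match your prompt


def keep_ranges_from_words(words, fillers=FILLERS, max_gap=0.3):
    """Group non-filler words into keep ranges, splitting at fillers and long gaps."""
    ranges = []
    start_new = True
    for w in words:
        token = w["word"].strip().lower().strip(".,!?")
        if token in fillers:
            start_new = True  # a removed word always closes the current range
            continue
        if not start_new and w["start"] - ranges[-1]["end"] <= max_gap:
            ranges[-1]["end"] = w["end"]
            ranges[-1]["text"] += " " + w["word"].strip()
        else:
            ranges.append({"start": w["start"], "end": w["end"], "text": w["word"].strip()})
        start_new = False
    return ranges


words = [
    {"word": "Um", "start": 0.0, "end": 0.3},
    {"word": "welcome", "start": 0.5, "end": 0.9},
    {"word": "to", "start": 0.9, "end": 1.0},
    {"word": "this", "start": 1.0, "end": 1.2},
    {"word": "video", "start": 1.2, "end": 1.6},
]
ranges = keep_ranges_from_words(words)
print(ranges)
```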
Troubleshooting
Frozen Frames at Cut Points (MOST COMMON)
Cause: Using -c copy which can only cut at keyframes.
Solution: Always re-encode with -c:v libx264 -crf 18 (see examples above).
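To make the keyframe constraint concrete, the sketch below models where a stream copy would have to begin for a requested cut point. It is a simplified illustration with hypothetical keyframe times, not a description of FFmpeg's internals; actual behavior also depends on the container and the player.

```python
import bisect


def stream_copy_start(keyframes, requested_start):
    """Return the latest keyframe time at or before requested_start."""
    index = bisect.bisect_right(keyframes, requested_start) - 1
    return keyframes[max(index, 0)]


keyframes = [0.0, 4.0, 8.0, 12.0]  # hypothetical: one keyframe every 4 seconds
requested = 10.0
actual = stream_copy_start(keyframes, requested)
unwanted = requested - actual  # seconds of material before the intended cut
print(f"requested {requested}s, nearest earlier keyframe {actual}s, {unwanted}s unwanted")
```

Re-encoding removes the constraint because the encoder writes a fresh keyframe at the exact start of each segment.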
Audio/Video Sync Issues
Add these flags when extracting segments (-avoid_negative_ts make_zero fixes negative timestamps, -async 1 syncs audio to video):
ffmpeg -ss 10 -i video.mp4 -t 5 \
-c:v libx264 -crf 18 \
-c:a aac -b:a 192k \
-avoid_negative_ts make_zero \
-async 1 \
segment.mp4
Cuts Sound Abrupt
Add audio fade in/out to each segment:
ffmpeg -ss 10 -i video.mp4 -t 5 \
-c:v libx264 -crf 18 \
-af "afade=t=in:st=0:d=0.05,afade=t=out:st=4.95:d=0.05" \
-c:a aac segment.mp4
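The fade-out start time depends on each segment's length, so it has to be computed per segment rather than hard-coded. A small sketch, assuming a symmetric fade at both ends; build_fade_filter is a hypothetical helper name.

```python
def build_fade_filter(duration, fade=0.05):
    """Return an -af value that fades audio in at the start and out at the end."""
    if duration <= 2 * fade:
        raise ValueError("segment is too short for the requested fades")
    fade_out_start = duration - fade
    return (
        f"afade=t=in:st=0:d={fade:.3f},"
        f"afade=t=out:st={fade_out_start:.3f}:d={fade:.3f}"
    )


audio_filter = build_fade_filter(5.0)
print(audio_filter)
```

For the 5 second segment above this yields the same fades as the command shown, with the fade-out starting at 4.95 seconds.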
Large Files Take Forever
- Use -preset fast or -preset veryfast (trades quality for speed)
- Extract audio first for transcription (much smaller)
- Use Whisper API instead of local model
- Process in parallel (multiple segments at once)
# Faster encoding (slightly lower quality)
ffmpeg ... -preset veryfast -crf 20 ...
# Even faster for previews
ffmpeg ... -preset ultrafast -crf 23 ...
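Processing segments in parallel could look like the sketch below. Threads are enough here because each worker would only wait on an FFmpeg subprocess. The worker is a stand-in that returns the output filename; in practice it would call a function such as extract_segment from the workflow script. Since libx264 already uses multiple threads, the speedup from running several encodes at once may be modest.

```python
from concurrent.futures import ThreadPoolExecutor


def process_in_parallel(jobs, worker, max_workers=4):
    """Run worker(job) for every job concurrently; results keep the input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(worker, jobs))


def fake_worker(job):
    """Stand-in for the real extraction call; returns the segment filename."""
    index, start, end = job
    return f"temp_seg_{index:04d}.mp4"


jobs = [(0, 0.0, 2.5), (1, 3.2, 5.8), (2, 8.0, 10.0)]
segment_files = process_in_parallel(jobs, fake_worker, max_workers=2)
print(segment_files)
```

pool.map preserves input order, which matters because the concat file must list segments chronologically.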
Whisper Misses Words
- Use --model large for better accuracy
- Use --language en to force English
- Normalize audio first:
ffmpeg -i video.mp4 -af "loudnorm=I=-16:TP=-1.5:LRA=11" -c:v copy normalized.mp4
File Size Too Large After Re-encoding
Increase CRF value (higher = smaller file, lower quality):
# Original quality (large)
-crf 18
# Good quality (medium)
-crf 22
# Acceptable quality (small)
-crf 26
Integration with OpenCode
When using this skill in OpenCode:
1. Extract audio (faster transcription):
ffmpeg -i video.mp4 -vn -acodec libmp3lame -q:a 2 temp_audio.mp3 -y
2. Transcribe with Whisper:
whisper temp_audio.mp3 --model medium --output_format json --output_dir ./
3. Read the transcript JSON (Whisper names it after the input file, here temp_audio.json) and analyze segments
4. Identify segments to KEEP based on:
- Removing filler words (um, uh, like, you know)
- Removing long pauses (>1.5s gaps)
- Removing false starts and repetitions
- For "shorts style": Keep only hook + key points + CTA
5. Re-encode and concatenate (MUST re-encode, never -c copy):
# Use the Python script above with crf=18 for quality
6. Generate subtitles for final video:
whisper output.mp4 --model medium --output_format srt
7. Report results with before/after stats
Quality Settings Reference
| Use Case | CRF | Preset | Notes |
|---|---|---|---|
| Archive/Master | 15-17 | slow | Near lossless, large files |
| YouTube/Vimeo | 18-20 | medium | High quality, recommended |
| Social Media | 21-23 | fast | Good quality, smaller |
| Preview/Draft | 24-28 | veryfast | Quick renders |
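The table can be carried into a script as a lookup. In this sketch the use-case keys and the single CRF value chosen from each range are assumptions made for illustration; adjust them to taste.

```python
# One CRF value picked from each range in the table (an assumption, not a rule)
QUALITY_PRESETS = {
    "archive": {"crf": 16, "preset": "slow"},
    "youtube": {"crf": 18, "preset": "medium"},
    "social": {"crf": 22, "preset": "fast"},
    "preview": {"crf": 26, "preset": "veryfast"},
}


def encoder_args(use_case):
    """Return the libx264 arguments for a named use case."""
    try:
        settings = QUALITY_PRESETS[use_case]
    except KeyError:
        raise ValueError(f"unknown use case: {use_case!r}") from None
    return ["-c:v", "libx264", "-preset", settings["preset"], "-crf", str(settings["crf"])]


print(encoder_args("youtube"))
```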
Anti-Patterns (DO NOT DO)
# WRONG: -c copy causes freeze frames
ffmpeg -ss 10 -i video.mp4 -t 5 -c copy segment.mp4
# WRONG: -to instead of -t with -ss before -i
ffmpeg -ss 10 -i video.mp4 -to 15 ... # timestamps restart at 0 after input seeking, so -to 15 yields a 15s clip, not 10s-15s
# WRONG: Missing timestamp fix flags
ffmpeg ... -c:v libx264 ... # Missing -avoid_negative_ts
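These three anti-patterns are mechanical enough to check before running a segment extraction command. The sketch below inspects an argument list; find_anti_patterns is a hypothetical helper, and it is meant for cutting commands only, since the concatenation step uses -c copy legitimately.

```python
def find_anti_patterns(cmd):
    """Return warnings for a segment-extraction command given as an argument list."""
    problems = []
    for flag, value in zip(cmd, cmd[1:]):
        if flag in ("-c", "-c:v", "-codec", "-vcodec") and value == "copy":
            problems.append("stream copy used for cutting")
            break
    if "-i" in cmd:
        input_pos = cmd.index("-i")
        if "-ss" in cmd[:input_pos] and "-to" in cmd[input_pos:]:
            problems.append("-to after input seeking; use -t with a duration")
    if "-avoid_negative_ts" not in cmd:
        problems.append("missing -avoid_negative_ts make_zero")
    return problems


bad_cmd = "ffmpeg -ss 10 -i video.mp4 -t 5 -c copy segment.mp4".split()
good_cmd = ("ffmpeg -ss 10 -i video.mp4 -t 5 -c:v libx264 -crf 18 "
            "-c:a aac -avoid_negative_ts make_zero segment.mp4").split()
print(find_anti_patterns(bad_cmd))
print(find_anti_patterns(good_cmd))
```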