name	youtube-shorts-creator
description	Use this skill when the user asks to create a YouTube Short, generate a short-form vertical video, make a YouTube Shorts video from an idea, or mentions 'youtube-shorts-creator'. Generates a 30-60 second 1080x1920 portrait video with AI-generated video clips (Veo 3.1), static images with Ken Burns effects (Gemini Nano Banana 2), AI voiceover (Gemini TTS), background music, sound effects, optional captions, and a thumbnail. Output is a local MP4 + thumbnail PNG.
version	3.0.0

YouTube Shorts Creator v3.0

Create professional YouTube Shorts (30-60 second vertical videos) from a simple idea input. Uses an audio-first pipeline that generates per-segment voiceover FIRST, measures actual audio durations, then generates visuals to match — ensuring perfect audio-visual sync where no scene ever changes while narration is still playing.

v3.0 New Features:

Background music with automatic volume ducking
Transition sound effects
WhisperX-accurate captions (speech recognition)
Advanced transitions (wipe, clock-wipe, flip)
Spring-based animations for text overlays
Video progress bar
Loop ending for seamless rewatching
Configurable caption styling
Audio normalization to -14 LUFS (YouTube standard)
Hook segment optimization

Technology Stack

Technology	Purpose
Veo 3.1 (`veo-3.1-generate-001`)	Generate 8-second video clips in 9:16 portrait
Nano Banana 2 (`gemini-3.1-flash-image-preview`)	Generate scene images, reference frames, thumbnails
Gemini TTS (`gemini-2.5-flash-preview-tts`)	Primary voiceover narration
Google Cloud TTS (Chirp 3 HD)	Fallback TTS
FFmpeg	Background music and synthetic sound effects generation
Giphy API	Search and download GIFs, memes, and animated stickers
@remotion/gif	Render GIF files in Remotion compositions
Remotion Framework	Compose all assets into final MP4 with transitions
WhisperX (whisper.cpp)	Accurate word-level caption timestamps
FFmpeg	Audio processing, normalization, concatenation

Required Environment Variables

Before running, set GEMINI_API_KEY. The other two variables are optional:

GEMINI_API_KEY - Google AI API key (for Veo 3.1 + Gemini image generation + Gemini TTS)
GIPHY_API_KEY - (Optional) Giphy API key for GIF segments and sticker overlays (get one at https://developers.giphy.com/)
GOOGLE_APPLICATION_CREDENTIALS - (Optional) Google Cloud service account JSON path for TTS fallback

Workflow

Phase 1: Idea Intake & Creative Brief

When the user provides an idea, use the AskUserQuestion tool to ask ALL of the following in a single multi-question prompt. Group them logically. The user can always pick "Other" to customize.

Question 1 — Duration

30 seconds (quick, punchy)
45 seconds (balanced, recommended)
60 seconds (detailed, in-depth)

Question 2 — Visual Style

Cinematic (dramatic lighting, film grain, lens flares)
Clean & Modern (minimalist, flat design, bright colors)
Neon / Cyberpunk (dark backgrounds, glowing colors, futuristic)
Nature / Documentary (earthy tones, natural lighting, organic textures)
Retro / Vintage (muted colors, film effects, old-school typography)
Abstract / Artistic (surreal imagery, bold shapes, creative compositions)

Question 3 — Color Palette / Mood

Dark & Dramatic (blacks, deep blues, purples)
Bright & Energetic (vivid colors, high saturation, yellows/oranges)
Warm & Inviting (ambers, warm whites, soft oranges)
Cool & Professional (blues, grays, clean whites)
Pastel & Soft (light pinks, lavenders, muted tones)

Question 4 — Voice

Sulafat (Female, warm narration)
Laomedeia (Female, energetic)
Achernar (Female, soft & warm)
Charon (Male, informative narration)
Alnilam (Male, firm & confident)
Iapetus (Male, clear)
Fenrir (Male, dynamic)

Question 5 — Pacing / Energy

Slow & contemplative (fewer words, longer visual holds)
Moderate & balanced (standard YouTube Shorts pacing)
Fast & high-energy (rapid-fire facts, quick cuts, dynamic)

Question 6 — Captions

TikTok-style (word-by-word highlight, bold, centered — recommended)
Subtitle bar (lower-third bar with text)
No captions

Question 7 — Text Overlays

Yes, show key facts/concepts on screen (recommended for educational/instructional)
Minimal (only intro title)
No text overlays

Question 8 — Content Type (helps determine segment types and text overlay usage)

Educational / Instructional (facts, how-to, explanations — will add text overlays for key concepts)
Storytelling / Narrative (story arc, dramatic moments)
Listicle / Top N (numbered items, each gets its own segment)
Motivational / Inspirational (quotes, uplifting visuals)
Entertainment / Fun facts (surprising, engaging, casual tone)

Question 9 — Visual Mode (determines what type of visuals to use)

Mixed — images + video clips (recommended, uses both Veo video and Gemini images)
Images only (all segments use AI-generated images with Ken Burns effects — faster, no Veo wait)
Video only (all segments use Veo 3.1 video clips — more dynamic, longer generation time)

Question 10 — Background Music

Yes, add background music (recommended — auto-ducks under narration)
No background music

Question 11 — Giphy Content (GIFs, memes, and animated stickers from Giphy)

Both GIFs and stickers (full-frame GIF/meme segments + animated sticker overlays)
GIFs/memes only (reaction GIFs as full-frame segments — great for humor/reactions)
Stickers only (animated sticker overlays on top of other segments)
None (no Giphy content — recommended if no GIPHY_API_KEY is set)

After collecting answers, use them to inform every aspect of the storyboard: the visual prompts should reference the chosen style and color palette, the narration pacing should match the energy level, text overlays should be added where appropriate based on the content type, and all segment types must respect the chosen visual mode.

If background music is selected, generate a descriptive music.prompt in the storyboard meta that matches the video's mood, pacing, and content. Examples:

Educational: "mysterious ambient documentary background music, subtle synth pads, slow tempo"
Energetic: "upbeat electronic background beat, driving rhythm, high energy"
Motivational: "inspiring orchestral background music, building intensity, hopeful"
Storytelling: "cinematic atmospheric soundtrack, tension building, dramatic strings"

Phase 2: Storyboard Generation

Generate a structured storyboard JSON following the schema in references/storyboard-schema.md. Key rules:

CRITICAL: Audio-First Segment Timing

The storyboard duration_seconds values are INITIAL ESTIMATES only. The actual segment durations will be automatically adjusted after TTS generation to match the real audio length of each segment's narration. Write narration at ~150 words/minute as an estimate, but know that the pipeline will correct timing based on measured audio.

Each segment's final visual duration = actual audio duration + padding (0.3s default).
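
A minimal Python sketch of that rule, assuming all durations are in seconds. The function name `visual_duration` is hypothetical and not taken from generate_assets.py:

```python
from typing import Optional

DEFAULT_PADDING_SECONDS = 0.3  # default padding stated in this document


def visual_duration(audio_duration_seconds: float,
                    padding_seconds: Optional[float] = None) -> float:
    """Return one segment's visual duration: measured audio length plus padding."""
    padding = DEFAULT_PADDING_SECONDS if padding_seconds is None else padding_seconds
    return round(audio_duration_seconds + padding, 3)


# Narration measured at 5.42 s with the default padding:
print(visual_duration(5.42))       # 5.72
# The same narration with a 0.5 s dramatic pause:
print(visual_duration(5.42, 0.5))  # 5.92
```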

CRITICAL: Veo 3.1 Video Clip Duration Constraint

Veo 3.1 clips are EXACTLY 8 seconds long — this is a hard limit from the API. For any segment with type: "video", the narration MUST be written to fit within 8 seconds of speaking time (including padding). This means:

Target ~18-19 words per video segment narration (at ~150 wpm, 20 words ≈ 8 seconds of speech, which leaves no room for the 0.3s default padding)
The voiceover duration + padding must NOT exceed 8 seconds, or the video clip will end before the narration finishes
Write concise, punchy narration for video segments — every word must earn its place
If a concept needs more narration time, use an image segment instead (images with Ken Burns can be any duration)
The goal is seamless sync: the voiceover plays for the full duration of the video clip with no dead air and no clip looping
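
The word budget for a video segment follows from the 8-second clip length, the ~150 wpm estimate, and the padding. The sketch below works through that arithmetic. The function names are hypothetical, and the real spoken duration is only known after TTS, so treat the result as a drafting aid rather than a guarantee:

```python
import math

WORDS_PER_MINUTE = 150   # speaking-rate estimate used in this document
VEO_CLIP_SECONDS = 8.0   # fixed Veo 3.1 clip length stated above


def estimated_speech_seconds(narration_text: str, wpm: int = WORDS_PER_MINUTE) -> float:
    """Estimate speaking time from the word count alone."""
    return len(narration_text.split()) * 60.0 / wpm


def max_words_for_video(padding_seconds: float = 0.3, wpm: int = WORDS_PER_MINUTE) -> int:
    """Largest word count whose estimated speech plus padding still fits the clip."""
    speaking_seconds = VEO_CLIP_SECONDS - padding_seconds
    return math.floor(speaking_seconds * wpm / 60.0)


def fits_video_segment(narration_text: str, padding_seconds: float = 0.3) -> bool:
    """True if estimated speech plus padding stays within the 8-second clip."""
    return estimated_speech_seconds(narration_text) + padding_seconds <= VEO_CLIP_SECONDS


print(max_words_for_video())     # 19 with the default 0.3 s padding
print(max_words_for_video(0.0))  # 20 with no padding at all
```

If a draft fails `fits_video_segment`, either trim the narration or switch the segment to an image, which has no duration limit.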

Hook Segment Optimization:

The FIRST segment is critical for YouTube's algorithm. Mark it with "is_hook": true and follow these rules:

Keep it 2-3 seconds long with punchy, provocative narration
Use a bold visual that immediately grabs attention
Open with a question, surprising fact, or bold statement
Consider using "spring-pop" or "scale-pop" animation for text overlays
Pattern interrupt: subvert expectations in the first frame

Visual Mode Handling:

The visual_mode field in meta controls what segment types are allowed:

"mixed" (default) — Use the heuristics table below to decide each segment's type. Mix of video and image segments.
"images-only" — ALL segments MUST be type: "image" with Ken Burns presets. No video segments. Vary Ken Burns presets for visual interest.
"videos-only" — ALL segments MUST be type: "video". No image segments. Each Veo clip is max 8 seconds; plan segment durations accordingly.

Segment Type Decision Heuristics (used only when visual_mode is "mixed"):

Content Type	Segment Type	Reason
Action, motion, transformation	`video` (Veo 3.1)	Moving subjects need video
Establishing shots, landscapes	`video` (Veo 3.1)	Cinematic environments
Static facts, quotes, statistics	`image` (Gemini) + Ken Burns	Text/data works better as images
Diagrams, comparisons, lists	`image` (Gemini) + Ken Burns	Infographic-style content
Abstract concepts, metaphors	`image` (Gemini) + Ken Burns	Conceptual imagery
Intro/outro title cards	`image` (Gemini) + Ken Burns	Text-heavy frames
Reaction moments, humor, pop culture	`gif` (Giphy)	Recognizable memes/reactions resonate with audiences
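
The visual mode rules can be checked mechanically before the storyboard goes to the user. The sketch below is one way to do that, assuming the storyboard is a dict with top-level `meta` and `segments` keys. The authoritative layout is in references/storyboard-schema.md, so those key names and the function name are assumptions. It also assumes `gif` segments are only valid in mixed mode, which is how the rules above read:

```python
from typing import List

# Segment types allowed per visual mode (gif-in-mixed-only is an assumption).
ALLOWED_TYPES = {
    "mixed": {"video", "image", "gif"},
    "images-only": {"image"},
    "videos-only": {"video"},
}


def visual_mode_violations(storyboard: dict) -> List[str]:
    """Return one message per segment whose type breaks the chosen visual mode."""
    mode = storyboard.get("meta", {}).get("visual_mode", "mixed")
    allowed = ALLOWED_TYPES[mode]
    problems = []
    for index, segment in enumerate(storyboard.get("segments", []), start=1):
        segment_type = segment.get("type")
        if segment_type not in allowed:
            problems.append(
                f"segment {index}: type {segment_type!r} is not allowed in {mode!r} mode"
            )
    return problems


storyboard = {
    "meta": {"visual_mode": "images-only"},
    "segments": [{"type": "image"}, {"type": "video"}],
}
print(visual_mode_violations(storyboard))
```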

Budget:

Mixed mode: 2-4 VIDEO segments and 3-6 IMAGE segments for a 45-second Short.
Images-only mode: 6-10 IMAGE segments for a 45-second Short (4-8 seconds each with Ken Burns).
Videos-only mode: 6-7 VIDEO segments for a 45-second Short (each Veo clip is max 8 seconds, so fewer than 6 clips cannot fill 45 seconds).

Narration: Write at ~150 words/minute speaking rate as an estimate. Each segment MUST have its own narration_text. The narration for each segment will be generated as a separate audio file for perfect timing sync.

Per-Segment Audio Fields:

audio_filename — Initially null, populated by the pipeline (e.g., "segment-01_audio.mp3")
audio_duration_seconds — Initially null, populated with actual measured duration
padding_seconds — Gap after narration before next segment (default: 0.3, use 0.5 for dramatic pauses, 0.1 for fast-paced)

Visual Prompts: Write detailed, cinematic prompts. For Veo: include camera movement, lighting, mood. For Gemini images: include style, composition, color palette. Always specify 9:16 portrait orientation in prompts.

Ken Burns Presets (for image segments, alternate between them for variety):

zoom-in-slow - Slowly zoom into center
zoom-out-slow - Slowly zoom out from center
pan-left - Slow pan from right to left
pan-right - Slow pan from left to right
zoom-in-top-left - Zoom into upper-left corner
zoom-out-center - Zoom out from center
pan-up - Slow pan from bottom to top
pan-down - Slow pan from top to bottom
zoom-in-bottom-right - Zoom into bottom-right corner
drift-diagonal - Gentle diagonal drift with subtle zoom
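
One simple way to alternate presets is to cycle through the list in order, so adjacent image segments never share a preset. This is a sketch under the assumption that segments are dicts with `type` and `ken_burns_preset` keys. The helper name is hypothetical:

```python
from itertools import cycle
from typing import List

KEN_BURNS_PRESETS = [
    "zoom-in-slow", "zoom-out-slow", "pan-left", "pan-right", "zoom-in-top-left",
    "zoom-out-center", "pan-up", "pan-down", "zoom-in-bottom-right", "drift-diagonal",
]


def assign_ken_burns(segments: List[dict]) -> List[dict]:
    """Give every image segment without a preset the next preset in rotation."""
    presets = cycle(KEN_BURNS_PRESETS)
    for segment in segments:
        if segment.get("type") == "image" and not segment.get("ken_burns_preset"):
            segment["ken_burns_preset"] = next(presets)
    return segments


segments = assign_ken_burns([{"type": "image"}, {"type": "video"}, {"type": "image"}])
print([s.get("ken_burns_preset") for s in segments])
# ['zoom-in-slow', None, 'zoom-out-slow']
```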

Transitions (between segments — v3.0 adds wipe, clock-wipe, flip):

fade - Cross-fade (default, most natural)
slide-left - Slide in from right
slide-up - Slide up from bottom
wipe-left - Wipe from right to left (clean, modern)
wipe-right - Wipe from left to right
wipe-up - Wipe from bottom to top
clock-wipe - Circular clock wipe effect (dramatic)
flip - 3D flip transition (dynamic)
none - Hard cut

Transition Sound Effects (optional per segment): For segments with transitions, you can add a transition_sfx object:

prompt — Descriptive text for generating the SFX (e.g., "deep cinematic whoosh transition sound")
volume — Playback volume (default: 0.5)
Best for: slide, wipe, and flip transitions. Skip for subtle fades.

Giphy GIF Segments (when Giphy content is enabled):

For type: "gif" segments, set giphy_search with a descriptive query. The pipeline downloads the top matching GIF from Giphy and renders it full-frame with looping. Best for reaction moments, humor, and pop culture references.

giphy_search — Descriptive search query (e.g., "mind blown explosion reaction", "surprised pikachu meme")
visual_prompt — Still required as documentation, but not used for generation
No ken_burns_preset needed (GIFs are animated)
GIFs loop automatically for the segment duration

Giphy Sticker Overlays (optional per segment):

Any segment type can have a giphy_overlay object for an animated transparent sticker:

search — Giphy sticker search query (e.g., "fire emoji", "thumbs up animated", "sparkle effect")
position — Anchor: "top-left", "top-right", "bottom-left", "bottom-right", "center" (default: "bottom-right")
scale — Fraction of frame width, 0.0-1.0 (default: 0.25 = 270px on 1080px)
offset_x / offset_y — Pixel offset from anchor position (default: 0)
delay_frames — Frames to wait before showing (default: 0)
duration_frames — How long to show (null = entire segment duration)
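
To make the `scale` and frame-based fields concrete, the sketch below converts an example `giphy_overlay` into pixels and seconds. It assumes a 1080px-wide frame at 30 FPS, as reported in the output summary of this skill. The helper names are hypothetical:

```python
FRAME_WIDTH_PX = 1080  # portrait frame width used by this skill
FPS = 30               # frame rate reported in the output summary


def sticker_width_px(scale: float) -> int:
    """Convert the scale fraction into a pixel width on the 1080px frame."""
    if not 0.0 <= scale <= 1.0:
        raise ValueError("scale must be between 0.0 and 1.0")
    return round(scale * FRAME_WIDTH_PX)


def frames_to_seconds(frames: int) -> float:
    """Convert a frame count such as delay_frames into seconds."""
    return frames / FPS


overlay = {
    "search": "fire emoji",
    "position": "bottom-right",
    "scale": 0.25,
    "offset_x": 0,
    "offset_y": 0,
    "delay_frames": 15,
    "duration_frames": None,  # None means the whole segment
}
print(sticker_width_px(overlay["scale"]))          # 270
print(frames_to_seconds(overlay["delay_frames"]))  # 0.5
```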

When to use stickers:

Emoji reactions to emphasize key moments
Animated effects (sparkles, fire, explosions) for visual flair
Arrows/pointers to draw attention
DO NOT overuse: max 1 sticker per segment, keep them small (0.15-0.3 scale)

Text Overlays (optional per segment, especially for educational/instructional content):

Each segment can have a text_overlay object with these fields:

text - The text to display (keep short: key facts, numbers, labels, 1-8 words)
position - "top", "center", or "bottom" (default: "center")
style - Visual preset:
- "bold-fact" — Large bold white text with heavy shadow (for key statistics, facts)
- "subtitle-bar" — Text on a semi-transparent dark bar (for explanations)
- "callout" — Uppercase gold text (for emphasis, warnings, key terms)
- "highlight-box" — White text on a colored box with rounded corners (for labels, categories)
- "minimal" — Small italic translucent text (for subtle annotations)
animateIn - Entrance animation:
- "slide-up" — Slides up into position (recommended)
- "fade" — Simple fade in
- "scale-pop" — Pops in with a bounce
- "typewriter" — Characters appear one by one
- "spring-pop" — Natural bouncy pop using Remotion spring physics (v3.0)
- "spring-slide" — Natural slide-up with spring overshoot (v3.0)
delayFrames - Delay before appearing (default: 0, use 15-30 to sync with narration)

When to use text overlays:

Educational content: Show key facts, statistics, definitions (e.g., "6 BILLION TONS", "243 EARTH DAYS")
Instructional content: Label steps, highlight concepts, show formulas
Listicles: Show the item number/title (e.g., "#3: Venus", "FACT 4")
Motivational: Display quotes or key phrases
DO NOT clutter: max 1 text overlay per segment, keep text very short
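
The field constraints above lend themselves to a quick sanity check before rendering. The following is a sketch, not the schema's own validation: the function name is hypothetical, and it only enforces the values listed in this section:

```python
from typing import List

STYLES = {"bold-fact", "subtitle-bar", "callout", "highlight-box", "minimal"}
ANIMATIONS = {"slide-up", "fade", "scale-pop", "typewriter", "spring-pop", "spring-slide"}
POSITIONS = {"top", "center", "bottom"}


def text_overlay_problems(overlay: dict) -> List[str]:
    """Return a list of rule violations for one text_overlay object."""
    problems = []
    word_count = len(overlay.get("text", "").split())
    if not 1 <= word_count <= 8:
        problems.append(f"text has {word_count} words, expected 1-8")
    if overlay.get("position", "center") not in POSITIONS:
        problems.append(f"unknown position {overlay.get('position')!r}")
    if overlay.get("style") not in STYLES:
        problems.append(f"unknown style {overlay.get('style')!r}")
    if overlay.get("animateIn") not in ANIMATIONS:
        problems.append(f"unknown animateIn {overlay.get('animateIn')!r}")
    if overlay.get("delayFrames", 0) < 0:
        problems.append("delayFrames must not be negative")
    return problems


overlay = {
    "text": "243 EARTH DAYS",
    "position": "center",
    "style": "bold-fact",
    "animateIn": "spring-pop",
    "delayFrames": 15,
}
print(text_overlay_problems(overlay))  # []
```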

Background Music (when enabled): Set meta.music with a descriptive prompt. The pipeline will:

Generate music via FFmpeg synthetic audio generation
Loop it to match video duration
Normalize to -14 LUFS
Auto-duck volume under narration (default: 0.15 full, 0.05 ducked)
Fade in at start (1s) and fade out at end (2s)

Loop Ending (optional): Set meta.loop_ending: true to cross-dissolve the last 0.5s to the first segment's visual, creating a seamless loop that encourages rewatching.

Progress Bar (optional): Set meta.show_progress_bar: true to show a thin progress indicator at the top. Customize color with meta.progress_bar_color.

Phase 3: User Approval

Present the storyboard in this format:

=== YouTube Short Storyboard ===
Title: "[Title]"
Duration: ~[N] seconds | Segments: [N] ([X] video + [Y] images)
Visual Mode: [mixed / images-only / videos-only]
Voice: [Voice Name] ([Gender], [Style])
Music: [prompt or "None"]
Features: [progress bar, loop ending, etc.]

[0:00-0:03] SEGMENT 1 ([TYPE]) - [Label] [HOOK]
  Narration: "[Narration text]"
  Visual: [Brief visual description]
  [Ken Burns: preset] (if image)
  [Text Overlay: "KEY FACT" (style, position, animation)] (if applicable)
  [Transition SFX: "whoosh sound"] (if applicable)
  Transition: [type]
  Padding: [0.3s]

[0:03-0:10] SEGMENT 2 ([TYPE]) - [Label]
  ...

Thumbnail: [Brief thumbnail description]

Note: Segment durations are estimates. The audio-first pipeline will
adjust each segment to exactly match the actual narration length,
ensuring perfect audio-visual sync.

Wait for user approval before proceeding. The user can modify segments, swap video/image types, change the voice, or adjust narration.

Phase 4: Asset Generation (Audio-First Pipeline)

After approval:

Step 0: Choose output directory

Ask the user where to save the output, or default to the current working directory with a subfolder named after the video title (slugified). Example: ./mind-blowing-space-facts/

Save the storyboard JSON to <output-dir>/storyboard.json.
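
A minimal sketch of the slug step, assuming "slugified" means lowercase ASCII letters and digits joined by single hyphens. The function name is hypothetical:

```python
import re


def slugify(title: str) -> str:
    """Lowercase the title and collapse every other run of characters into one hyphen."""
    slug = re.sub(r"[^a-z0-9]+", "-", title.lower())
    return slug.strip("-")


print(slugify("Mind-Blowing Space Facts!"))  # mind-blowing-space-facts
```

Dropping punctuation also avoids characters such as `?` and `:` that are not valid in Windows folder names.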

Step 1: Set up the Remotion project

If the Remotion project does not exist at the output directory:

xcopy /E /I /Y "C:\Users\eckme\.claude\skills\youtube-shorts-creator\assets\remotion-template" "<output-dir>\remotion-project"
cd "<output-dir>\remotion-project"
npm install

Step 2: Run the audio-first asset generation pipeline

python "C:\Users\eckme\.claude\skills\youtube-shorts-creator\scripts\generate_assets.py" "<output-dir>\storyboard.json" "<output-dir>\remotion-project"

This script runs an audio-first pipeline in 9 phases:

Phase 1 — Per-Segment TTS: Generates individual audio files for each segment's narration. Measures actual duration of each audio file via ffprobe. Normalizes each to -14 LUFS.
Phase 1B — Background Music: Generates background music via FFmpeg synthetic audio generation (if meta.music is set). Loops to target duration, normalizes to -14 LUFS.
Phase 2 — Update Durations: Adjusts each segment's visual duration to match its actual audio duration + padding, ensuring no scene changes during narration.
Phase 2B — Giphy Download: Downloads GIFs for "gif" segments and sticker overlays from Giphy API (if any exist in storyboard).
Phase 3 — Visual Generation (Parallel): Generates Veo 3.1 video clips + Gemini images + thumbnail in parallel, now knowing exact durations. GIF segments are skipped (already handled by Phase 2B).
Phase 3B — Transition SFX: Generates short sound effect clips for segments with transition_sfx configured.
Phase 4 — Caption Generation: Creates per-segment word-level captions using WhisperX (speech recognition) with proportional fallback.
Phase 5 — Audio Concatenation: Merges per-segment audio files into a single voiceover reference file.
Phase 6 — Storyboard Update: Writes all asset filenames, actual durations, and metadata back to the storyboard JSON.
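
For reference, measuring an audio file's duration with ffprobe (as Phase 1 does) can look like the sketch below. The ffprobe options shown are standard ones, but whether generate_assets.py uses exactly this invocation is an assumption, and the function names are hypothetical:

```python
import subprocess
from typing import List


def ffprobe_duration_cmd(path: str) -> List[str]:
    """Build an ffprobe command that prints only the container duration in seconds."""
    return [
        "ffprobe", "-v", "error",
        "-show_entries", "format=duration",
        "-of", "default=noprint_wrappers=1:nokey=1",
        path,
    ]


def parse_duration(stdout_text: str) -> float:
    """Parse ffprobe's single-line output, e.g. '5.421000'."""
    return float(stdout_text.strip())


def measure_duration(path: str) -> float:
    """Run ffprobe (must be on PATH) and return the duration in seconds."""
    result = subprocess.run(
        ffprobe_duration_cmd(path), capture_output=True, text=True, check=True
    )
    return parse_duration(result.stdout)
```

The measured value is what populates `audio_duration_seconds` before the segment durations are updated.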

Monitor the output and relay progress to the user.

Phase 5: Render

After all assets are generated:

cd "<output-dir>\remotion-project"
npx remotion render YouTubeShort "..\output.mp4" --props="..\storyboard.json" --codec=h264 --pixel-format=yuv420p --audio-codec=aac --crf=18

Copy the thumbnail:

copy "<output-dir>\remotion-project\public\thumbnail.png" "<output-dir>\thumbnail.png"

Phase 6: Output

Report the results:

=== YouTube Short Created! ===

Video: <path>/output.mp4
Thumbnail: <path>/thumbnail.png

Duration: [N] seconds | Resolution: 1080x1920 | FPS: 30
Segments: [N] ([X] video clips + [Y] images with Ken Burns)
Voiceover: [Voice Name] (Gemini TTS) — per-segment sync
Music: [Yes/No] — auto-ducking enabled
Captions: [WhisperX / Proportional / Disabled]
Audio: Normalized to -14 LUFS
Audio Sync: Perfect (audio-driven segment durations)

Gemini TTS Voice Options

Voice Name	Gender	Style
Sulafat	Female	Warm, narration
Laomedeia	Female	Energetic
Achernar	Female	Soft, warm
Charon	Male	Informative, narration
Alnilam	Male	Firm, confident
Iapetus	Male	Clear
Fenrir	Male	Dynamic

Error Handling

Veo timeout/block: Substitute with Gemini image + Ken Burns effect (rotates through varied presets)
Gemini TTS failure: Fall back to Google Cloud TTS Chirp 3 HD
Image generation failure: Uses gradient background with narration text overlay
Caption generation failure: Render without captions
WhisperX failure: Falls back to proportional captions per-segment (mixed mode: some Whisper, some proportional)
Per-segment TTS failure: Uses estimated duration from word count as fallback
Audio concatenation failure: Calculates total duration from individual files
Background music failure: Video renders without music (graceful skip)
SFX generation failure: Transition plays without sound effect
Giphy search failure: GIF segments use gradient fallback, sticker overlays are skipped
Missing GIPHY_API_KEY: Warning printed, Giphy features gracefully skipped
LUFS normalization failure: Audio used as-is (graceful skip)
All API calls: Retry up to 3 times with exponential backoff
Storyboard backup: Original storyboard is saved as *_original.json before updates
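
The retry policy in the list above can be sketched as follows. The 1-second base delay and the doubling factor are assumptions, since this document only states "up to 3 times with exponential backoff". The function name is hypothetical:

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_retry(call: Callable[[], T],
                    attempts: int = 3,
                    base_delay: float = 1.0,
                    sleep: Callable[[float], None] = time.sleep) -> T:
    """Run call(); on failure wait base_delay, then 2x, 4x, ... and try again."""
    for attempt in range(attempts):
        try:
            return call()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** attempt)
    raise RuntimeError("attempts must be at least 1")
```

Passing `sleep` as a parameter keeps the helper testable without real waiting. After the final failure, the caller applies the fallback listed for that asset type.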

Key Architecture: Audio-First Sync

The v3.0 pipeline ensures perfect audio-visual synchronization:

Each segment gets its own audio file (e.g., segment-01_audio.mp3)
All audio normalized to -14 LUFS (YouTube loudness standard)
Each segment's visual duration is set to actual audio duration + padding
In the Remotion composition, each segment's audio plays at its exact start frame
Background music auto-ducks when narration is playing
Captions generated using WhisperX speech recognition for accurate word timing
No global voiceover track — each segment is self-contained for perfect sync

This means:

A scene never changes while its narration is still playing
There is always a small gap (padding) between narration and the next scene
Captions are accurately timed to the actual words spoken
Video clips are 8 seconds max — narration for video segments should be written to fit within 8 seconds (avoid looping)
Background music volume drops automatically during narration
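
The timeline implied by these rules can be sketched as a running sum of per-segment frame counts. The sketch assumes 30 FPS and ignores any overlap a transition may add. Rounding up to whole frames is also an assumption, chosen so narration is never cut short:

```python
import math
from typing import List, Tuple

FPS = 30  # frame rate reported in the output summary


def segment_frames(segments: List[dict]) -> List[Tuple[int, int]]:
    """Return (start_frame, duration_in_frames) for each segment, in order."""
    timeline = []
    start = 0
    for segment in segments:
        seconds = segment["audio_duration_seconds"] + segment.get("padding_seconds", 0.3)
        # Round before ceil so float noise (e.g. 90.0000001) does not add a frame.
        frames = math.ceil(round(seconds * FPS, 6))
        timeline.append((start, frames))
        start += frames
    return timeline


segments = [
    {"audio_duration_seconds": 2.7},                          # hook, default padding
    {"audio_duration_seconds": 5.2, "padding_seconds": 0.3},
]
print(segment_frames(segments))  # [(0, 90), (90, 165)]
```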

v3.0 Feature Details

Background Music

Generated via FFmpeg synthetic audio generation from a descriptive prompt
Automatically looped with FFmpeg if video exceeds clip length
Volume ducking: full volume (0.15) during pauses, reduced (0.05) during narration
Fade in (1s) at video start, fade out (2s) at video end
Skipped gracefully if generation fails
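
As an illustration of how ducking and fades combine, the sketch below computes the music volume at a given time. It switches between the two levels instantly, whereas a real implementation would likely smooth the change. The function name and the interval representation are assumptions:

```python
from typing import List, Tuple


def music_volume(t: float,
                 total_seconds: float,
                 narration: List[Tuple[float, float]],
                 full: float = 0.15,
                 ducked: float = 0.05,
                 fade_in: float = 1.0,
                 fade_out: float = 2.0) -> float:
    """Volume at time t: ducked while any narration interval is active, with edge fades."""
    speaking = any(start <= t < end for start, end in narration)
    base = ducked if speaking else full
    gain = 1.0
    if t < fade_in:
        gain = t / fade_in                  # linear fade in over the first second
    if t > total_seconds - fade_out:
        gain = min(gain, max(0.0, (total_seconds - t) / fade_out))  # linear fade out
    return base * gain


narration = [(0.0, 3.0), (3.3, 8.0)]
print(music_volume(2.0, 45.0, narration))   # 0.05 (ducked under narration)
print(music_volume(3.1, 45.0, narration))   # 0.15 (full volume in the pause)
```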

Transition Sound Effects

Short SFX clips (1.5s) generated per-segment via FFmpeg
Play at the transition point between segments
Configurable volume per-segment
Best paired with dynamic transitions (wipe, flip, slide)

WhisperX Captions

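Uses whisper.cpp via @remotion/install-whisper-cpp for speech recognition (despite the "WhisperX" label used in this skill, the engine is whisper.cpp, not the separate WhisperX project)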
Per-segment processing: each audio file transcribed individually
Mixed mode: segments that fail Whisper use proportional fallback
One-time model download (~500MB medium.en model)
Falls back to proportional estimation if Node.js/whisper unavailable
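
This document does not spell out how the proportional fallback distributes time. One plausible reading, sketched below, gives each word a share of the measured audio duration in proportion to its character count. The function name and the output keys are hypothetical:

```python
from typing import Dict, List, Union


def proportional_captions(narration_text: str,
                          audio_duration_seconds: float
                          ) -> List[Dict[str, Union[str, float]]]:
    """Estimate word timings by splitting the duration in proportion to word length."""
    words = narration_text.split()
    total_chars = sum(len(word) for word in words)
    captions = []
    cursor = 0.0
    for word in words:
        span = audio_duration_seconds * len(word) / total_chars
        captions.append({
            "text": word,
            "start": round(cursor, 3),
            "end": round(cursor + span, 3),
        })
        cursor += span
    return captions


for caption in proportional_captions("Venus spins backwards", 1.9):
    print(caption)
```

Longer words get more time, which tracks speech better than an equal split, but it cannot account for pauses the way speech recognition can.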

Advanced Transitions

wipe-left / wipe-right / wipe-up: Clean directional wipe effects
clock-wipe: Circular clock-hand wipe (dramatic reveals)
flip: 3D card flip transition
All use Remotion's @remotion/transitions library

Spring Animations

spring-pop: Bouncy scale-up using Remotion spring physics
spring-slide: Natural slide-up with overshoot
More organic, professional motion than linear easing

Progress Bar

Thin bar at top showing video progress (0% to 100%)
Customizable color via meta.progress_bar_color
Subtle glow effect for visibility

Loop Ending

Cross-dissolves to first segment's image in last 0.5s
Creates seamless loop that encourages repeat watching
YouTube algorithm favors videos that get rewatched

Audio Normalization

Two-pass FFmpeg loudnorm filter targeting -14 LUFS
Applied to all TTS audio and background music
Ensures consistent volume across the video
Gracefully skipped if FFmpeg unavailable
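
A two-pass loudnorm run measures the input first, then feeds those measurements into a second pass. The sketch below shows the general shape with FFmpeg's `loudnorm` filter. Only the -14 LUFS target comes from this document. The true-peak and loudness-range values (`TP=-1.5`, `LRA=11`) and all function names are assumptions:

```python
import json
import subprocess
from typing import Dict, List

TARGET = "I=-14:TP=-1.5:LRA=11"  # I from this document; TP and LRA are assumed


def first_pass_cmd(src: str) -> List[str]:
    """Analysis pass: decode only, print loudness measurements as JSON on stderr."""
    return ["ffmpeg", "-hide_banner", "-i", src,
            "-af", f"loudnorm={TARGET}:print_format=json", "-f", "null", "-"]


def parse_loudnorm_json(stderr_text: str) -> Dict[str, str]:
    """Extract the trailing JSON object that loudnorm prints after analysis."""
    start = stderr_text.rindex("{")
    end = stderr_text.rindex("}") + 1
    return json.loads(stderr_text[start:end])


def second_pass_cmd(src: str, dst: str, m: Dict[str, str]) -> List[str]:
    """Normalization pass using the measured values from the first pass."""
    audio_filter = (
        f"loudnorm={TARGET}"
        f":measured_I={m['input_i']}:measured_TP={m['input_tp']}"
        f":measured_LRA={m['input_lra']}:measured_thresh={m['input_thresh']}"
        f":offset={m['target_offset']}:linear=true"
    )
    return ["ffmpeg", "-hide_banner", "-y", "-i", src, "-af", audio_filter, dst]


def normalize(src: str, dst: str) -> None:
    """Run both passes; requires ffmpeg on PATH."""
    probe = subprocess.run(first_pass_cmd(src), capture_output=True, text=True, check=True)
    measurements = parse_loudnorm_json(probe.stderr)
    subprocess.run(second_pass_cmd(src, dst, measurements), check=True)
```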

Backward Compatibility

All v3.0 features default to OFF. A v2.0 storyboard (without music, transition_sfx, caption_style, giphy_search, giphy_overlay, etc.) will render identically to v2.0. New features only activate when their respective fields are present and non-null in the storyboard JSON. Giphy features require GIPHY_API_KEY — if not set, GIF segments fall back to gradient backgrounds and stickers are skipped.

References

Load these as needed for API details:

references/veo-api.md - Veo 3.1 API endpoints and parameters
references/storyboard-schema.md - Full storyboard JSON schema (v3.0)
references/style-templates.md - Pre-built visual style templates (load when user requests a specific style or says "neon rim", "holographic", etc.)

Style Templates

Pre-built visual styles are available in references/style-templates.md. When the user mentions a style by name or trigger phrase, load the reference file and use its templates for all visual prompts. This ensures consistent aesthetics across all segments.

Available styles:

Neon-Rim Holographic Documentary — Dark 3D renders with red-orange/cyan-blue color-coded glowing figures on pure black backgrounds. Best for: history, true crime, military, biker culture, noir, mystery, biographical content. Trigger: "neon rim", "holographic", "glowing figures", "dark documentary"

When a style template is active, skip the "Visual Style" question in Phase 1 (it's already decided) and pre-fill the meta.style, meta.caption_style, and meta.progress_bar_color from the template's config.

Business CMS Integration

MANDATORY: After rendering the final video:

Create CMS folder:

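# Replace VIDEO_TITLE with the actual video title before running.
BUSINESS_ROOT="C:/Users/eckme/OneDrive/Documents/New folder (2)/Business"
DATE_PATH=$(date +"%Y/%m/%d")
# Lowercase the title and collapse every run of non-alphanumeric characters into one hyphen.
VIDEO_SLUG=$(echo "VIDEO_TITLE" | tr '[:upper:]' '[:lower:]' | sed -E 's/[^a-z0-9]+/-/g; s/^-+//; s/-+$//')
mkdir -p "$BUSINESS_ROOT/Videos/YouTube-Shorts/$DATE_PATH/$VIDEO_SLUG"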

Copy deliverables:
- output.mp4 → final video
- thumbnail.png → thumbnail
- storyboard.json → composition spec
Create script.md with the full narration script
Create metadata.md with type (YouTube Short), title, created time, status (Draft), duration, voice used, visual style, and file list
Create/update _manifest.md in the date folder
Tell the user the exact path where the video is saved
