Claude Code Plugins

Community-maintained marketplace

Feedback

Process and generate multimedia content using Google Gemini API. Capabilities include analyze audio files (transcription with timestamps, summarization, speech understanding, music/sound analysis up to 9.5 hours), understand images (captioning, object detection, OCR, visual Q&A, segmentation), process videos (scene detection, Q&A, temporal analysis, YouTube URLs, up to 6 hours), extract from documents (PDF tables, forms, charts, diagrams, multi-page), generate images (text-to-image with Imagen 4, editing, composition, refinement), generate videos (text-to-video with Veo 3, 8-second clips with native audio). Use when working with audio/video files, analyzing images or screenshots, processing PDF documents, extracting structured data from media, creating images/videos from text prompts, or implementing multimodal AI features. Supports Gemini 3/2.5, Imagen 4, and Veo 3 models with context windows up to 2M tokens.

Install Skill

1Download skill
2Enable skills in Claude

Open claude.ai/settings/capabilities and find the "Skills" section

3Upload to Claude

Click "Upload skill" and select the downloaded ZIP file

Note: Please verify skill by going through its instructions before using it.

SKILL.md

name ai-multimodal
description Process and generate multimedia content using Google Gemini API. Capabilities include analyze audio files (transcription with timestamps, summarization, speech understanding, music/sound analysis up to 9.5 hours), understand images (captioning, object detection, OCR, visual Q&A, segmentation), process videos (scene detection, Q&A, temporal analysis, YouTube URLs, up to 6 hours), extract from documents (PDF tables, forms, charts, diagrams, multi-page), generate images (text-to-image with Imagen 4, editing, composition, refinement), generate videos (text-to-video with Veo 3, 8-second clips with native audio). Use when working with audio/video files, analyzing images or screenshots, processing PDF documents, extracting structured data from media, creating images/videos from text prompts, or implementing multimodal AI features. Supports Gemini 3/2.5, Imagen 4, and Veo 3 models with context windows up to 2M tokens.
license MIT
allowed-tools Bash, Read, Write, Edit

AI Multimodal

Process audio, images, videos, documents, and generate images/videos using Google Gemini's multimodal API.

Setup

export GEMINI_API_KEY="your-key"  # Get from https://aistudio.google.com/apikey
pip install google-genai python-dotenv pillow

Quick Start

Analyze media: python scripts/gemini_batch_process.py --files <file> --task <analyze|transcribe|extract>

Generate content: python scripts/gemini_batch_process.py --task <generate|generate-video> --prompt "description"

Models

Image generation: imagen-4.0-generate-001 (standard), imagen-4.0-ultra-generate-001 (quality), imagen-4.0-fast-generate-001 (speed) Video generation: veo-3.1-generate-preview (8s clips with audio) Analysis: gemini-2.5-flash (recommended), gemini-2.5-pro (advanced)

Scripts

gemini_batch_process.py: Process audio/image/video/PDF, generate images/videos media_optimizer.py: Compress/resize/convert media for API limits document_converter.py: Convert DOCX/XLSX/PPTX to PDF

Use --help for options.

References

Load for detailed guidance:

Topic File
Audio references/audio-processing.md
Images references/vision-understanding.md
Image Gen references/image-generation.md
Video references/video-analysis.md
Video Gen references/video-generation.md

Limits

Formats: Audio (WAV/MP3/AAC, 9.5h), Images (PNG/JPEG/WEBP, 3.6k), Video (MP4/MOV, 6h), PDF (1k pages) Size: 20MB inline, 2GB File API

Resources: API Docs | Pricing