---
name: fal-ai
description: >-
  Generate images, videos, and speech using the fal.ai API. Use when asked to:
  create/generate images from text prompts (Flux, Flux Kontext, Nano Banana Pro,
  Recraft); generate videos from text prompts (Veo 3, Kling v2.6, Hunyuan, LTX,
  Minimax, Wan); animate images into videos (Kling v2.6, Kling, Luma, Runway);
  or convert text to speech and clone voices (F5-TTS, Kokoro). Trigger phrases:
  "generate image", "create video", "text-to-video", "animate this image",
  "make a video of", "voice cloning", "text-to-speech", "fal.ai", "veo", "kling"
---
# fal.ai Media Generation
Generate images, videos, and speech using fal.ai's suite of AI models.
Important distinctions:

- **Text-to-Image**: `generate_image.py` - Creates images from text prompts
- **Text-to-Video**: `generate_video_text.py` - Creates videos from text prompts (no image needed)
- **Image-to-Video**: `generate_video.py` - Animates an existing image into video
## Setup

Set your fal.ai API key:

```bash
export FAL_API_KEY="your-api-key-here"
```

Get your API key from https://fal.ai/dashboard/keys
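
Every script reads this variable at runtime. A minimal sketch of the kind of check they perform (the helper name here is illustrative, not part of `fal_helper.py`):

```python
import os
import sys

def require_fal_key() -> str:
    """Return the fal.ai API key, exiting with a clear error if unset."""
    key = os.environ.get("FAL_API_KEY")
    if not key:
        sys.exit("FAL_API_KEY environment variable is not set")
    return key
```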
## Quick Reference

| Task | Command |
|---|---|
| Generate image | `./scripts/generate_image.py "prompt"` |
| Generate video from text | `./scripts/generate_video_text.py "prompt"` |
| Generate video from image | `./scripts/generate_video.py image.png` |
| Convert text to speech | `./scripts/generate_speech.py "text"` |
| List image models | `./scripts/generate_image.py --list-models ""` |
| List text-to-video models | `./scripts/generate_video_text.py --list-models` |
| List image-to-video models | `./scripts/generate_video.py --list-models` |
| List TTS models | `./scripts/generate_speech.py --list-models` |
## Text-to-Image Generation

Generate images from text descriptions using state-of-the-art models.

### Available Models

| Model | Description | Speed | Quality |
|---|---|---|---|
| `flux-schnell` (default) | Fast generation | Fast | Good |
| `flux-dev` | Development model | Medium | High |
| `flux-pro` | Production quality | Slow | Best |
| `flux-realism` | Photorealistic | Medium | High |
| `flux-kontext` | Context-aware editing | Medium | High |
| `nano-banana-pro` | Gemini-powered, web search | Medium | High |
| `recraft-v3` | Design/artistic | Medium | High |
| `stable-diffusion-xl` | Classic SD | Medium | Good |
### Usage

```bash
# Basic generation
uv run ./scripts/generate_image.py "A serene mountain landscape at sunset"

# Specify model and aspect ratio
uv run ./scripts/generate_image.py "A cyberpunk cityscape" --model flux-pro --aspect landscape_16_9

# Generate multiple images
uv run ./scripts/generate_image.py "A cute robot mascot" --num 4 --output ./robots/

# With negative prompt and seed
uv run ./scripts/generate_image.py "Professional headshot" --negative "cartoon, anime" --seed 42

# Open image after generation
uv run ./scripts/generate_image.py "A beautiful garden" --open
```
### Aspect Ratios

- `square` (default) - 1:1
- `square_hd` - 1:1 high resolution
- `portrait_4_3` - 3:4 portrait
- `portrait_16_9` - 9:16 tall portrait
- `landscape_4_3` - 4:3 landscape
- `landscape_16_9` - 16:9 widescreen
- `21_9` - Ultra-wide
- `9_21` - Ultra-tall
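
The same ratios apply when calling the shared library directly (see the API Reference below). A minimal sketch that renders one prompt in three of these formats:

```python
from fal_helper import FalClient

client = FalClient()

# Render the same prompt as a square, widescreen, and story-format image.
for aspect in ("square_hd", "landscape_16_9", "portrait_16_9"):
    images = client.generate_image(
        prompt="Minimalist product photo of headphones",
        model="flux-schnell",
        aspect_ratio=aspect,
        num_images=1,
    )
    client.download_file(images[0]["url"], f"headphones_{aspect}.png")
```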
## Text-to-Video Generation

Generate videos directly from text prompts (no input image required).

### Available Models

| Model | Description | Cost | Max Duration |
|---|---|---|---|
| `veo3` | Google Veo 3, best quality + audio | ~$2.00/5s | ~8s |
| `kling-v2.6` | Kling 2.6 Pro, cinematic | ~$0.70/5s | ~10s |
| `ltx-v2-fast` (default) | LTX 2.0 Fast, good balance | ~$0.20/5s | ~10s |
| `ltx-v2` | LTX 2.0, higher quality | ~$0.20/5s | ~10s |
| `hunyuan` | Hunyuan, high visual quality | ~$0.38/5s | ~5s |
| `hunyuan-v1.5` | Hunyuan 1.5, improved motion | ~$0.38/5s | ~5s |
| `minimax` | MiniMax Video-01 | ~$0.50/video | ~5s |
| `wan` | Wan 2.1, fast | ~$0.25/5s | ~5s |
### Usage

```bash
# Basic text-to-video
uv run ./scripts/generate_video_text.py "a cat walking on the beach at sunset"

# Cinematic video with specific model
uv run ./scripts/generate_video_text.py "cinematic drone shot of mountains at sunrise" --model hunyuan

# Vertical video for social media
uv run ./scripts/generate_video_text.py "person dancing in studio" --aspect 9:16 --resolution 1080p

# With seed for reproducibility
uv run ./scripts/generate_video_text.py "ocean waves crashing" --seed 42 --open
```
### Tips for Text-to-Video

- **Be descriptive** - include motion, camera angles, and lighting
- **Cinematic keywords** - terms like "cinematic", "8k", and "dramatic lighting" improve quality
- **Duration limits** - most models generate 5-10 second clips
- **Resolution tradeoffs** - higher resolution means slower generation
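
Putting these tips together through the shared library (see the API Reference below); the prompt wording is illustrative:

```python
from fal_helper import FalClient

client = FalClient()

# Descriptive prompt: subject + motion + camera + lighting + quality keywords.
prompt = (
    "A sailboat cutting through waves at golden hour, "
    "slow aerial tracking shot, dramatic lighting, cinematic, 8k"
)

video = client.generate_video_from_text(
    prompt=prompt,
    model="ltx-v2-fast",  # default model: good cost/quality balance
    aspect_ratio="16:9",
    resolution="720p",    # lower resolutions generate faster
)
client.download_file(video["url"], "sailboat.mp4")
```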
## Image-to-Video Generation

Animate static images into videos (requires an input image).

### Available Models

| Model | Description | Quality |
|---|---|---|
| `kling` (default) | Kling v1.5 Pro | Excellent |
| `kling-v2.6` | Kling v2.6 Pro + audio | Best |
| `minimax` | MiniMax video | Good |
| `luma` | Luma Dream Machine | Good |
| `runway-gen3` | Runway Gen-3 Turbo | Excellent |
| `hunyuan` | Hunyuan video | Good |
### Usage

```bash
# Basic video generation
uv run ./scripts/generate_video.py image.png

# With motion prompt
uv run ./scripts/generate_video.py portrait.jpg --prompt "person slowly smiles and nods"

# Different model and duration
uv run ./scripts/generate_video.py landscape.png --model runway-gen3 --duration 10

# Specify output path
uv run ./scripts/generate_video.py photo.jpg --output ./videos/animated.mp4 --open
```
### Tips for Image-to-Video

- **Image quality matters** - use high-resolution, clear images
- **Simple motion prompts** - describe the motion, not the scene
- **Duration limits** - most models support 5-10 seconds
- **Aspect ratio** - usually auto-detected from the input image
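
Applying these tips through the shared library (see the API Reference below); the filenames are illustrative:

```python
from fal_helper import FalClient

client = FalClient()

# Motion-only prompts: describe what moves, not what the scene contains.
shots = {
    "portrait.jpg": "person slowly smiles and nods",
    "landscape.png": "clouds drift across the sky",
}
for image_path, motion in shots.items():
    video = client.generate_video(
        image_path=image_path,
        prompt=motion,
        model="kling",  # default image-to-video model
        duration=5.0,   # most models support 5-10 seconds
    )
    client.download_file(video["url"], image_path.rsplit(".", 1)[0] + ".mp4")
```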
## Text-to-Speech

Convert text to natural-sounding speech.

### Available Models

| Model | Description | Features |
|---|---|---|
| `f5-tts` (default) | F5-TTS | Voice cloning |
| `kokoro` | Kokoro | Multiple voices |
| `playht` | PlayHT v3 | High quality |
| `minimax-tts` | MiniMax TTS | Fast |
### Usage

```bash
# Basic text-to-speech
uv run ./scripts/generate_speech.py "Hello, welcome to our application!"

# Different model
uv run ./scripts/generate_speech.py "This is a test." --model kokoro

# Voice cloning with reference audio
uv run ./scripts/generate_speech.py "Clone this voice" --reference my_voice.mp3

# Adjust speed
uv run ./scripts/generate_speech.py "Speaking faster now" --speed 1.2

# Specify output and open
uv run ./scripts/generate_speech.py "Podcast intro" --output intro.wav --open
```
### Voice Cloning

To clone a voice, provide a reference audio sample:

```bash
uv run ./scripts/generate_speech.py "Text in cloned voice" --reference sample.mp3 --model f5-tts
```

Best practices for reference audio:

- 5-30 seconds of clear speech
- Minimal background noise
- Single speaker
- Natural speaking pace
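
The same flow through the shared library (see the API Reference below). This sketch assumes the TTS result exposes a `url` field like the image and video calls; the reference filename is illustrative:

```python
from fal_helper import FalClient

client = FalClient()

# Clone a voice from a short, clean, single-speaker reference sample.
audio = client.text_to_speech(
    text="Welcome back to the show.",
    model="f5-tts",                     # best voice-cloning support
    reference_audio="brand_voice.mp3",  # 5-30s of clear speech
)
client.download_file(audio["url"], "intro.wav")
```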
## Common Workflows

### Create a Video from Text (Easiest)

```bash
# Direct text-to-video - no image needed
uv run ./scripts/generate_video_text.py "A majestic eagle spreads its wings and takes flight from a cliff, cinematic, dramatic lighting" --model ltx-v2-fast --open
```
### Create a Video from Image (More Control)

```bash
# 1. Generate the image
uv run ./scripts/generate_image.py "A majestic eagle perched on a cliff" --model flux-pro --output eagle.png

# 2. Animate it
uv run ./scripts/generate_video.py eagle.png --prompt "eagle spreads wings and takes flight" --open
```
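
The equivalent two-step pipeline through the shared library (see the API Reference below), chaining the downloaded image into the video call:

```python
from fal_helper import FalClient

client = FalClient()

# 1. Generate the still image and save it locally.
images = client.generate_image(
    prompt="A majestic eagle perched on a cliff",
    model="flux-pro",
)
client.download_file(images[0]["url"], "eagle.png")

# 2. Animate the saved image with a simple motion prompt.
video = client.generate_video(
    image_path="eagle.png",
    prompt="eagle spreads wings and takes flight",
    model="kling",
    duration=5.0,
)
client.download_file(video["url"], "eagle.mp4")
```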
### Generate Marketing Assets

```bash
# Product image variations
uv run ./scripts/generate_image.py "Minimalist product photo of headphones on white background" --num 4 --aspect square_hd

# Social media formats
uv run ./scripts/generate_image.py "Summer sale banner" --aspect landscape_16_9 --output banner_wide.png
uv run ./scripts/generate_image.py "Summer sale banner" --aspect portrait_16_9 --output banner_story.png
```
### Create Voiceover

```bash
# Generate narration
uv run ./scripts/generate_speech.py "Welcome to our product demo. Today we'll explore the amazing features..." --output narration.wav

# With custom voice
uv run ./scripts/generate_speech.py "Welcome to our product demo." --reference brand_voice.mp3 --output narration.wav
```
## API Reference

All scripts are built on the shared `fal_helper.py` library. You can import it directly for programmatic use:

```python
from fal_helper import FalClient

client = FalClient()

# Generate images
images = client.generate_image(
    prompt="A beautiful sunset",
    model="flux-schnell",
    aspect_ratio="landscape_16_9",
    num_images=2,
)

# Generate video from text (no image needed)
video = client.generate_video_from_text(
    prompt="cinematic ocean waves at sunset",
    model="ltx-v2-fast",
    aspect_ratio="16:9",
    resolution="720p",
)

# Generate video from image
video = client.generate_video(
    image_path="image.png",
    prompt="camera slowly pans right",
    model="kling",
    duration=5.0,
)

# Text to speech
audio = client.text_to_speech(
    text="Hello world",
    model="f5-tts",
    reference_audio="voice_sample.mp3",  # optional
)

# Download results
client.download_file(images[0]["url"], "output.png")
client.download_file(video["url"], "output.mp4")
```
## Troubleshooting

### "FAL_API_KEY environment variable is not set"

Set your API key:

```bash
export FAL_API_KEY="your-key-here"
```

### Video generation is slow

Video generation typically takes 2-5 minutes; longer durations and higher resolutions increase generation time.
### Image quality issues

- Try a higher-quality model (`flux-pro` instead of `flux-schnell`)
- Use more specific prompts
- Add negative prompts to avoid unwanted elements
- Use `square_hd` for higher resolution

### Voice cloning sounds off

- Ensure reference audio is clear with no background noise
- Use 10-30 seconds of reference audio
- The reference should be natural speech, not singing or whispering
- Try the `f5-tts` model, which has the best voice cloning
## Notes

- All scripts auto-install dependencies via `uv run`
- Generated files include timestamps to avoid overwrites
- Use the `--open` flag to immediately view/play generated media
- Video generation consumes more API credits than images
- Some models may have content restrictions