| name | gemini-imagegen |
| description | Generate and edit images using the Gemini API (Nano Banana). Use this skill when creating images from text prompts, editing existing images, applying style transfers, generating logos with text, creating stickers, product mockups, or any image generation/manipulation task. Supports text-to-image, image editing, multi-turn refinement, and composition from multiple reference images. |
Gemini Image Generation
Generate and edit images using Google's Gemini API. Requires GEMINI_API_KEY environment variable.
Default Output & Logging
When the user doesn't specify a location, save images to:
/Users/samarthgupta/Documents/generated images/
Every generated image gets a companion .md file with the prompt used (e.g., logo.png → logo.md).
When gathering parameters (aspect ratio, resolution), offer the option to specify a custom output location.
Core Prompting Principle
Describe scenes narratively, don't list keywords. Gemini has deep language understanding—write prompts like prose, not tags.
❌ "cat, wizard hat, magical, fantasy, 4k, detailed"
✓ "A fluffy orange tabby sits regally on a velvet cushion, wearing an ornate
purple wizard hat embroidered with silver stars. Soft candlelight illuminates
the scene from the left. The mood is whimsical yet dignified."
The Formula
[Subject + Adjectives] doing [Action] in [Location/Context].
[Composition/Camera]. [Lighting/Atmosphere]. [Style/Media]. [Constraint].
Not every prompt needs every element—match detail to intent.
Prescriptive vs Open Prompting
Prescriptive (user has specific vision): Detailed descriptions, exact specifications Open (exploring/want model creativity): General direction, let model decide details
Both are valid. Ask the user's intent if unclear.
Capability Patterns
Photorealistic Scenes
Think like a photographer: describe lens, light, moment.
- Specify camera (85mm portrait, 24mm wide), aperture (f/1.8 bokeh, f/11 sharp throughout)
- Describe lighting direction and quality (golden hour from camera-left, three-point softbox)
- Include mood and format (serene, vertical portrait)
Product Photography
- Isolation: Clean white backdrop, soft even lighting, e-commerce ready
- Lifestyle: Product in use context, natural setting, aspirational but authentic
- Hero shots: Cinematic framing, dramatic lighting, space for text overlay
Logos & Text (Use Pro Model)
- Put text in quotes:
'Morning Brew Coffee Co' - Describe typography: "clean bold sans-serif with generous letter-spacing"
- Specify color scheme, shape constraints, design intent
- Iterate with multi-turn chat for refinement
Stylized Illustration
- Name the style: "kawaii-style sticker", "anime-influenced", "vintage travel poster"
- Describe design language: "bold outlines, flat colors, cel-shading"
- Include format constraints: "white background", "die-cut sticker format"
Editing Images
- Acknowledge subject: "Using the provided image of my cat..."
- Explicit preservation: "Keep everything unchanged except..."
- Realistic integration: "should look naturally printed on the fabric"
Pattern: Acknowledge → specify change → describe integration → preserve the rest
Multi-Image Composition (Pro Model)
- State output goal first
- Assign elements: "Take X from first image, Y from second"
- Describe integration requirements (lighting match, realistic shadows)
- Supports up to 14 reference images
Character Consistency
- Use multi-turn chat session for multiple views
- Reference distinctive features explicitly in follow-ups
- Include "exact same character" or "maintain all design details"
- Save successful designs as reference for future prompts
Invoking Aesthetics Through Naming
Names invoke aesthetics. The model learned associations for film stocks, cameras, studios, artists, and styles. Instead of describing characteristics, reference the name directly.
"Portrait at golden hour, shot on Kodak Portra 400"
→ Warm skin tones, pastel highlights, fine grain
"Studio Ghibli forest scene"
→ Lush nature, soft lighting, whimsical atmosphere
"Fashion editorial, Hasselblad medium format"
→ Exceptional detail, shallow DOF, that medium format look
This works for photography, animation, illustration, game art, graphic design, fine art—anything with a recognizable visual identity.
See STYLE_REFERENCE.md for comprehensive lexicon of film stocks, cameras, studios, artists, and styles.
Models
| Model | Best For |
|---|---|
gemini-2.5-flash-image |
Speed, iteration, simple generation (1024px fixed) |
gemini-3-pro-image-preview |
Text rendering, complex instructions, high-res (up to 4K), multi-image composition, Google Search grounding |
Defaults: Pro model uses 1K resolution, 1:1 aspect. Confirm with user before changing.
Image Configuration (Pro Only)
Aspect ratios: 1:1, 2:3, 3:2, 3:4, 4:3, 4:5, 5:4, 9:16, 16:9, 21:9
Resolutions: 1K (1024px), 2K (2048px), 4K (~4096px)
Advanced Features
Google Search Grounding (Pro Only)
Enable with --grounding flag when real-time data helps:
- Weather visualizations
- Current events infographics
- Real-world data charts
Multi-Turn Refinement
Use chat for iterative editing instead of perfecting prompts in one shot:
→ "Create a logo for Acme Corp"
→ "Make the text bolder"
→ "Add a blue gradient background"
Semantic Masking
No manual masking needed. Describe changes conversationally:
- "Change the sofa to red leather"
- "Replace the background with a sunset beach"
- "Remove the power lines from the sky"
Scripts
# Generate from prompt
python scripts/generate_image.py "prompt" output.png [--model MODEL] [--aspect RATIO] [--size SIZE] [--grounding]
# Edit existing image
python scripts/edit_image.py input.png "instruction" output.png [--model MODEL] [--aspect RATIO] [--size SIZE]
# Compose multiple images
python scripts/compose_images.py "instruction" output.png img1.png [img2.png ...] [--model MODEL] [--aspect RATIO] [--size SIZE]
# Interactive multi-turn chat
python scripts/multi_turn_chat.py [--model MODEL] [--output-dir DIR]
Models: gemini-2.5-flash-image (default), gemini-3-pro-image-preview
Core API Pattern
from google import genai
from google.genai import types
client = genai.Client(api_key=os.environ["GEMINI_API_KEY"])
response = client.models.generate_content(
model="gemini-2.5-flash-image",
contents=["Your narrative prompt here"],
config=types.GenerateContentConfig(response_modalities=["TEXT", "IMAGE"])
)
for part in response.parts:
if part.inline_data:
# Save image from part.inline_data.data
For Pro model with configuration:
config=types.GenerateContentConfig(
response_modalities=['TEXT', 'IMAGE'],
image_config=types.ImageConfig(aspectRatio="16:9", imageSize="2K"),
tools=[{"google_search": {}}] # Optional grounding
)
Quick Checklist
Before generating:
- Narrative description (not keyword list)?
- Camera/lighting details for photorealism?
- Text in quotes, font style described?
- Right model for task (Pro for text/complex)?
- Aspect ratio appropriate for use case?
- User preference: prescriptive or open?