Gemini Omni Flash Prompting Guide

Gemini Omni Flash.
Video with native audio.

Google Gemini Omni Flash turns text, an image, or an existing video into a short 720p clip with native audio baked in — no separate sound step. It does text-to-video, image-to-video, video editing, and one-tap continuation of a clip you already made. This guide covers how to prompt the picture and the sound together.

Overview

Omni Flash is Google's all-in-one short-form video model. The headline feature is native audio: ambient sound, foley, and atmosphere are generated together with the picture, so a single prompt produces a finished clip you can post. You write the scene and the sound in the same sentence.

It works four ways from one composer: text-to-video, image-to-video (drop a start image as a reference), video-to-video editing (hand it a clip and describe the change), and Continue — re-open any result and iterate on it. Output is 720p in 16:9 or 9:16, 3 to 10 seconds, billed at 2 credits per second.
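The billing rule above is simple arithmetic, and a small sketch makes it concrete. The function name `clip_cost` is hypothetical, and the sketch assumes durations are whole seconds, since the guide does not say how fractional seconds are rounded.

```python
CREDITS_PER_SECOND = 2            # from the guide: billed at 2 credits per second
MIN_SECONDS, MAX_SECONDS = 3, 10  # from the guide: clips run 3 to 10 seconds


def clip_cost(seconds: int) -> int:
    """Return the credit cost of one clip (hypothetical helper, whole seconds assumed)."""
    if not MIN_SECONDS <= seconds <= MAX_SECONDS:
        raise ValueError(
            f"duration must be {MIN_SECONDS} to {MAX_SECONDS} seconds, got {seconds}"
        )
    return seconds * CREDITS_PER_SECOND


if __name__ == "__main__":
    for seconds in (3, 8, 10):
        print(f"{seconds}s clip -> {clip_cost(seconds)} credits")
```

Under these assumptions the shortest clip costs 6 credits, an 8-second clip costs 16, and a full 10-second clip costs 20.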

Audio: native, generated with the video

Duration: 3–10s

Output: 720p · 16:9 or 9:16

Modes: T2V · I2V · edit · continue

Key Features

Picture and sound together

Native Audio, Always On

Omni Flash generates sound with the video — rainfall, footsteps, ambient hum, a quiet music bed. There's no audio toggle and no post step. Name the sounds you want in the prompt and they get rendered alongside the action. Generic 'with audio' does little; specific cues ('soft rainfall and distant city hum') land.

Wind, light, weather

Atmospheric Motion

Omni Flash is strong on naturalistic, atmospheric motion: breath in cold air, snow underfoot, light shifting through trees. The model reads sensory language as motion direction, not just visual style. Pair a clear subject and action with one dominant camera move for clean, coherent clips.

Animate a still

Image to Video

Drop in a start image and Omni Flash conditions the clip on it, keeping layout, palette, and identity while adding motion and sound. The image is passed as a reference, and each reference image must be under 4.77MB. This works well for bringing a product shot, a scene, or a character frame to life.

Iterate without re-uploading

Edit & Continue

Hand it an existing clip and describe the change for a video-to-video edit, or hit Continue on any Omni Flash result to build a fresh take on it. Continuations regenerate at the model's standard ~10s length and keep the thread going — ideal for trying variations of a shot.

Example Videos

Each example shows the exact prompt that produced the result.

Text → Video with Audio

720p · 16:9 · 8s

A neon-lit Tokyo side street in the rain at night, reflections shimmering on wet asphalt, a person under a clear umbrella walks slowly past glowing signage, soft rainfall and distant city hum, gentle cinematic push-in, photoreal.

Lead with the scene, then name the sound ('soft rainfall and distant city hum') in the same breath as the visuals — Omni Flash renders both. One camera move ('gentle cinematic push-in') keeps the motion clean. 'Photoreal' anchors the look.

Atmospheric Nature

720p · 16:9 · 8s

A red fox trots through a snowy pine forest at golden hour, breath visible in the cold air, snow crunching underfoot and soft birdsong, low tracking shot following the fox, warm backlight through the trees.

Sensory detail doubles as motion and sound direction: 'breath visible', 'snow crunching underfoot', 'soft birdsong'. The named audio cues come through. 'Low tracking shot following the fox' gives the model one clear camera instruction to execute.

Vertical Social Clip

720p · 9:16 · 8s

Close-up of a barista pouring latte art into a ceramic cup in a cozy cafe, steam curling upward, the soft hiss of the espresso machine and quiet acoustic music, shallow depth of field, warm morning light.

9:16 for Reels / TikTok / Shorts. Close-up + shallow depth of field reads well at 720p. The audio prompt layers two distinct sounds (machine hiss + acoustic music bed) — Omni Flash mixes them rather than picking one.

Image → Video

720p · 16:9 · 8s · start image

Bring this desk scene to life: code scrolls subtly on the laptop screen, steam drifts up from the coffee mug, soft keyboard clicks and a quiet room ambience, gentle slow push-in, keep the layout and lighting consistent with the reference.

With a start image, describe the MOTION you want added, not the whole scene — the image already supplies the composition. 'Keep the layout and lighting consistent with the reference' holds it steady. Subtle, specific motions ('code scrolls', 'steam drifts') beat big ones.

Prompting Tips

Write the sound into the prompt

Audio is native, so treat it as part of the scene. Name specific sounds ('distant city hum', 'snow crunching underfoot', 'soft espresso machine hiss') rather than a vague 'with sound'. Specific cues get rendered; generic ones do little.

One subject, one camera move

Lead with subject + action, then name a single dominant camera move ('gentle push-in', 'low tracking shot following the fox'). Stacking two motion types into a 3–10s clip usually produces compromised, jittery motion.
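The two tips above describe a prompt shape: scene first, named sounds next, then exactly one camera move. As a rough sketch, a hypothetical helper called `build_prompt` could assemble that shape. It is not part of any Omni Flash tooling; it only joins strings, and it rejects the generic cues the guide warns about.

```python
GENERIC_CUES = {"with audio", "with sound"}  # phrases the guide calls too vague


def build_prompt(subject_action: str, sounds: list[str], camera_move: str, look: str = "") -> str:
    """Join scene, named sounds, one camera move, and an optional look into one prompt.

    Hypothetical helper: camera_move is a single string on purpose, so a
    prompt built this way carries only one dominant camera move.
    """
    specific = [
        s.strip() for s in sounds
        if s.strip() and s.strip().lower() not in GENERIC_CUES
    ]
    if not specific:
        raise ValueError("name at least one specific sound, e.g. 'soft birdsong'")
    parts = [subject_action.strip(), " and ".join(specific), camera_move.strip()]
    if look.strip():
        parts.append(look.strip())
    return ", ".join(parts) + "."


if __name__ == "__main__":
    print(build_prompt(
        "A red fox trots through a snowy pine forest at golden hour",
        ["snow crunching underfoot", "soft birdsong"],
        "low tracking shot following the fox",
        "warm backlight through the trees",
    ))
```

Keeping the camera move as one argument is a design choice: it makes stacking two motion types something you have to do deliberately rather than by accident.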

For image-to-video, prompt the motion

The start image already defines the composition — your prompt should describe what MOVES and what you HEAR, plus a line to keep it consistent with the reference. Reference and start images must each be under 4.77MB; compress large uploads first.
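A quick local check can catch oversized images before upload. The sketch below is an illustration only: the helper names are hypothetical, and it assumes 4.77MB means decimal megabytes (4,770,000 bytes), which is the stricter of the two common readings. The guide does not say which unit the limit uses.

```python
import os

# Assumption: "4.77MB" read as decimal megabytes, the stricter interpretation.
MAX_IMAGE_BYTES = 4_770_000


def is_too_large(size_bytes: int) -> bool:
    """True when a file is not strictly under the limit."""
    return size_bytes >= MAX_IMAGE_BYTES


def oversized_images(paths: list[str]) -> list[str]:
    """Return the paths of files on disk that need compressing first."""
    return [p for p in paths if is_too_large(os.path.getsize(p))]


if __name__ == "__main__":
    # Checks this script itself, which is far below the limit.
    print(oversized_images([__file__]))
```

Any path the function returns should be compressed or resized before it is used as a start or reference image.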

Use Continue to iterate

Got a clip that's close? Hit Continue on the result and describe the tweak instead of re-prompting from scratch. It builds a fresh take on that clip. Note that continuations come back at the model's standard ~10s length.

Pick aspect by channel

16:9 for landscape / YouTube. 9:16 for vertical social. Omni Flash is 720p-only, so resolution is fixed — choose the aspect ratio up front and the model composes the framing accordingly.

Keep clips short and specific

3–10 seconds is the whole range. Shorter, tightly-described clips are more reliable than long ones trying to cram in multiple beats. For a longer sequence, generate several focused clips and stitch them.
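For a longer sequence, the split into focused clips can be planned up front. The following is a minimal sketch with a hypothetical `plan_clips` helper: it divides a total running time into the fewest clips that each fit the 3 to 10 second range, keeping lengths as even as possible. It assumes whole seconds and says nothing about how the clips are stitched afterwards.

```python
import math

MIN_SECONDS, MAX_SECONDS = 3, 10  # from the guide: the whole duration range


def plan_clips(total_seconds: int) -> list[int]:
    """Split a total length into the fewest clips of 3 to 10 seconds each."""
    if total_seconds < MIN_SECONDS:
        raise ValueError(f"need at least {MIN_SECONDS} seconds, got {total_seconds}")
    count = math.ceil(total_seconds / MAX_SECONDS)
    base, extra = divmod(total_seconds, count)
    # The first `extra` clips take one additional second so the lengths sum exactly.
    return [base + 1] * extra + [base] * (count - extra)


if __name__ == "__main__":
    print(plan_clips(25))  # [9, 8, 8]
```

Each planned clip then gets its own tightly described prompt, one beat per clip.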

Settings Reference

Setting	Values	Notes
Modes	Text-to-video · Image-to-video · Video edit · Continue	One composer. A start image, source video, and continuation are mutually exclusive.
Audio	Native, always on	Generated with the video. No toggle. Prompt the specific sounds you want.
Duration	3–10 seconds	Billed at 2 credits/second. Continuations regenerate at ~10s.
Resolution	720p only	Fixed. No resolution picker.
Aspect ratio	16:9 · 9:16	Choose up front; the model frames to the ratio.
Reference images	Up to 7, each under 4.77MB	A start image counts as a reference. Compress large uploads.
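The constraints in the table can be expressed as a small pre-flight check. This is a sketch under stated assumptions, not the tool's actual request format: `ClipRequest`, its field names, and `validate` are all hypothetical, and only the limits themselves come from the table.

```python
from dataclasses import dataclass

ASPECT_RATIOS = {"16:9", "9:16"}  # from the table
MAX_REFERENCES = 7                # from the table; includes the start image


@dataclass
class ClipRequest:
    """Hypothetical container for one generation's settings."""
    prompt: str
    seconds: int = 8
    aspect_ratio: str = "16:9"
    reference_count: int = 0       # counts the start image, if there is one
    has_start_image: bool = False
    has_source_video: bool = False
    is_continuation: bool = False


def validate(req: ClipRequest) -> list[str]:
    """Return a list of problems; an empty list means the settings fit the table."""
    problems = []
    if not 3 <= req.seconds <= 10:
        problems.append("duration must be 3 to 10 seconds")
    if req.aspect_ratio not in ASPECT_RATIOS:
        problems.append("aspect ratio must be 16:9 or 9:16")
    if req.reference_count > MAX_REFERENCES:
        problems.append("at most 7 reference images")
    if req.has_start_image and req.reference_count < 1:
        problems.append("a start image counts as a reference")
    if sum([req.has_start_image, req.has_source_video, req.is_continuation]) > 1:
        problems.append("start image, source video, and continuation are mutually exclusive")
    return problems


if __name__ == "__main__":
    print(validate(ClipRequest(prompt="A quiet harbor at dawn", seconds=12, aspect_ratio="4:3")))
```

Resolution and audio have no fields here because the table gives them no options: output is always 720p with native audio.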

FAQ

Does Omni Flash generate audio with the video?

Yes, audio is native and always on. Ambient sound, foley, and a light music bed are generated together with the picture from your prompt. There's no separate audio step and no toggle. Name the sounds you want and they get rendered into the clip.