Gemini Omni presented: Google combines AI reasoning with next-generation video generation

Philipp Briel
Gemini app for macOS
Image: Google

With Gemini Omni, Google is expanding its Gemini platform with a new multimodal AI model that can not only analyze content, but also generate it creatively. The initial focus is on the creation and editing of videos using natural language. Gemini Omni combines various input sources such as text, images, audio and videos to create new content. The combination of AI-supported reasoning, realistic physics simulation and contextual storytelling is particularly interesting. The first version, called Gemini Omni Flash, is now available for selected Google services.

  • Multimodal AI model for video creation and editing via natural-language input
  • Support for text, image, video and audio inputs
  • Improved physics representation and contextual AI reasoning
  • Launch via Gemini app, Google Flow and YouTube Shorts

Gemini Omni to make AI video creation significantly smarter

With Gemini Omni, Google is taking a new approach to generative AI. While many current AI tools are primarily specialized in individual media formats, Omni relies entirely on multimodality. The model processes different inputs simultaneously and uses them to create coherent video content. For example, users can combine an image, a short video clip, music and text descriptions to generate a new video.

The focus on dialog-based video editing is particularly striking. Changes are made using natural language and build on each other logically. According to Google, characters, scenes and motion sequences remain consistent. This could make complex editing processes much easier than with traditional editing programs.


The stronger integration of so-called “reasoning” is particularly notable. According to Google, Gemini Omni not only analyzes visual patterns, but also takes physical processes and historical or cultural knowledge into account. This should make scenes appear more realistic and more coherent in content. Examples such as realistic chain reactions, fluid movements or stop-motion explanatory videos show that Google sees AI not just as an image generator, but as a creative production tool with contextual understanding.

Another interesting feature is the ability to flexibly convert existing videos. Content can be stylistically changed or new elements added using text instructions without losing the original scene structure. This could save considerable time, especially for social media creators, marketing teams or content producers.

Gemini Omni Flash launches with a focus on video and AI avatars

Google is launching Gemini Omni Flash, the first model in the new Omni family. The current version focuses primarily on video creation, while other output modalities such as images or audio are to follow in the future. However, the model already supports combined input from images, text, videos and voice references.

Another focus is on digital AI avatars. According to Google, users will in the future be able to create a digital version of their own voice and appearance in order to produce automatically generated videos with a personal touch. At the same time, the company is emphasizing the importance of security mechanisms and transparency. All videos created with Gemini Omni will receive the invisible SynthID watermark by default, which is intended to make AI-generated content identifiable.

From a technical point of view, the approach seems plausible, as Google has already gained extensive experience in the multimodal AI sector with earlier Gemini versions and the image AI Nano Banana. The extension towards video production that has now been presented is therefore a logical next step.

Gemini Omni Flash will initially be rolled out for Google AI Plus, Pro and Ultra subscribers in the Gemini app and in Google Flow. Google is also integrating the technology into YouTube Shorts and the YouTube Create app at no extra cost. APIs for developers and companies are set to follow in the coming weeks.

Conclusion

With Gemini Omni, Google is giving generative AI a more creative and contextual approach. The combination of multimodal processing, natural-language video editing and AI reasoning could significantly simplify the creation of digital content. In particular, the ability to flexibly adapt videos using natural language and to intelligently combine different media sources sets Gemini Omni apart from many previous AI tools. Gemini Omni Flash is now being rolled out gradually for Google services and selected subscriptions.