Introduction to Multimodal Output
Having true multimodal output opens up interesting new possibilities in chatbots. For example, Gemini 2.0 Flash can play interactive graphical games or generate stories with consistent illustrations, maintaining character and setting continuity across multiple images. It’s far from perfect, but character consistency is a new capability in AI assistants.
What is Multimodal Output?
Multimodal output refers to a system's ability to generate more than one form of media, such as text, images, audio, and video. Gemini 2.0 Flash is a notable example: it can play interactive graphical games and generate stories with consistent illustrations.
Examples of Multimodal Output
We tried out Gemini 2.0 Flash and it was pretty wild, especially when it took a photo we provided and generated a view of it from another angle. The system can also create multi-image stories, as shown in the examples below.
Creating a multi-image story with Gemini 2.0 Flash, part 1.
Google / Benj Edwards