

Gemini's Multimodality
Jul 2, 2025
Ani Baddepudi, the Product Lead for Gemini Model Behavior, shares her insights on the groundbreaking multimodal capabilities of Gemini. She explains why Gemini was designed as a multimodal model from the start, emphasizing its vision-first approach. The conversation dives into the intricacies of video and image understanding, showcasing advancements in higher FPS video sampling and tokenization methods. Ani also discusses the future of proactive AI assistants and the collaborative efforts behind Gemini’s evolution, revealing exciting possibilities for intuitive AI interactions.
AI Snips
Gemini's Native Multimodality
- Gemini was designed as a natively multimodal model from the start, trained on text, images, video, and audio together rather than having vision bolted on later.
- This lets Gemini perceive the world more the way humans do and handle tasks that mix data types in a single request (see the sketch below).
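As a concrete illustration, here is a minimal sketch of a mixed-modality request using the google-generativeai Python SDK. The model name, image file, and prompt are illustrative assumptions, not details from the episode.

```python
# Minimal sketch: one request mixing text and an image, assuming the
# google-generativeai SDK and an available API key. The file name and
# prompt below are made up for illustration.
import google.generativeai as genai
from PIL import Image

genai.configure(api_key="YOUR_API_KEY")
model = genai.GenerativeModel("gemini-1.5-flash")

# Text and an image travel in the same request; the model reasons over both.
photo = Image.open("kitchen.jpg")  # hypothetical image
response = model.generate_content(
    ["What ingredients are visible on the counter?", photo]
)
print(response.text)
```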
Handling Compression in Vision Tokens
- Representing images and videos as tokens loses some detail, yet Gemini generalizes remarkably well from these lossy representations.
- Even with video sampled at one frame per second, the model performs surprisingly strong video understanding and reasoning (the sampling step is sketched below).
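To make that sparsity concrete, here is a rough sketch of one-frame-per-second sampling with OpenCV. The tokenization itself happens inside the model; this only shows how little of the video survives the sampling step. The file name and helper function are hypothetical.

```python
# Rough sketch of ~1 FPS frame sampling with OpenCV. A ten-minute video
# collapses to roughly 600 frames, which is the kind of lossy input the
# snip says Gemini still reasons over well.
import cv2

def sample_frames(path: str, fps_target: float = 1.0):
    """Yield roughly fps_target frames per second from a video file."""
    cap = cv2.VideoCapture(path)
    native_fps = cap.get(cv2.CAP_PROP_FPS) or 30.0  # fall back if unknown
    step = max(int(round(native_fps / fps_target)), 1)
    index = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if index % step == 0:
            yield frame  # one lossy snapshot per second of video
        index += 1
    cap.release()

frames = list(sample_frames("cooking.mp4"))  # hypothetical file
print(f"{len(frames)} frames sampled")
```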
Turning Videos Into Interactive Apps
- Ani converted a YouTube cooking video into a step-by-step interactive recipe app using Gemini.
- Users are also creating interactive lecture notes and web pages from videos, demonstrating the strength of Gemini's video understanding (a rough sketch of the workflow follows).
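Below is a hedged sketch of that kind of video-to-app workflow using the Files API in the google-generativeai SDK. The video path, model choice, and prompt are assumptions; the episode does not describe the exact steps Ani used.

```python
# Sketch: upload a video, wait for processing, and ask Gemini to turn it
# into a single-page recipe app. Paths and prompt are illustrative only.
import time
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")

video = genai.upload_file(path="cooking_video.mp4")  # hypothetical file
while video.state.name == "PROCESSING":  # uploads are processed asynchronously
    time.sleep(5)
    video = genai.get_file(video.name)

model = genai.GenerativeModel("gemini-1.5-pro")
response = model.generate_content([
    video,
    "Turn this cooking video into a single-page HTML app with one step per "
    "screen, including timestamps, ingredient quantities, and a checklist.",
])

# Save the generated page; opening it in a browser gives the interactive app.
with open("recipe_app.html", "w") as f:
    f.write(response.text)
```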