

Gemini's Multimodality
Jul 2, 2025
Ani Baddepudi, the Product Lead for Gemini Model Behavior, shares her insights on the groundbreaking multimodal capabilities of Gemini. She explains why Gemini was designed as a multimodal model from the start, emphasizing its vision-first approach. The conversation dives into the intricacies of video and image understanding, showcasing advancements in higher FPS video sampling and tokenization methods. Ani also discusses the future of proactive AI assistants and the collaborative efforts behind Gemini’s evolution, revealing exciting possibilities for intuitive AI interactions.
AI Snips
Gemini's Native Multimodality
- Gemini was designed as a natively multimodal model from the start, trained on text, images, video, and audio together rather than having vision bolted on later.
- This lets Gemini perceive the world more the way humans do and handle tasks that mix data types in a single request (see the sketch below).
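As a concrete illustration, here is a minimal sketch of a mixed-modality request using the google-generativeai Python SDK. The model name, image file, and prompt are illustrative assumptions, not details from the episode.

```python
# Minimal sketch: one request mixing text and an image, assuming the
# google-generativeai SDK and an available API key. The file name and
# prompt below are made up for illustration.
import google.generativeai as genai
from PIL import Image

genai.configure(api_key="YOUR_API_KEY")
model = genai.GenerativeModel("gemini-1.5-flash")

# Text and an image travel in the same request; the model reasons over both.
photo = Image.open("kitchen.jpg")  # hypothetical image
response = model.generate_content(
    ["What ingredients are visible on the counter?", photo]
)
print(response.text)
```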
Handling Compression in Vision Tokens
- Representing images and videos as tokens loses some detail, yet Gemini generalizes remarkably well from these lossy representations.
- Even with video sampled at one frame per second, the model performs surprisingly strong video understanding and reasoning (the sampling step is sketched below).
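To make that sparsity concrete, here is a rough sketch of one-frame-per-second sampling with OpenCV. The tokenization itself happens inside the model; this only shows how little of the video survives the sampling step. The file name and helper function are hypothetical.

```python
# Rough sketch of ~1 FPS frame sampling with OpenCV. A ten-minute video
# collapses to roughly 600 frames, which is the kind of lossy input the
# snip says Gemini still reasons over well.
import cv2

def sample_frames(path: str, fps_target: float = 1.0):
    """Yield roughly fps_target frames per second from a video file."""
    cap = cv2.VideoCapture(path)
    native_fps = cap.get(cv2.CAP_PROP_FPS) or 30.0  # fall back if unknown
    step = max(int(round(native_fps / fps_target)), 1)
    index = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if index % step == 0:
            yield frame  # one lossy snapshot per second of video
        index += 1
    cap.release()

frames = list(sample_frames("cooking.mp4"))  # hypothetical file
print(f"{len(frames)} frames sampled")
```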
Turning Videos Into Interactive Apps
- Ani converted a YouTube cooking video into a step-by-step interactive recipe app using Gemini.
- Users are also creating interactive lecture notes and web pages from videos, demonstrating the strength of Gemini's video understanding (a rough sketch of the workflow follows).
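Below is a hedged sketch of that kind of video-to-app workflow using the Files API in the google-generativeai SDK. The video path, model choice, and prompt are assumptions; the episode does not describe the exact steps Ani used.

```python
# Sketch: upload a video, wait for processing, and ask Gemini to turn it
# into a single-page recipe app. Paths and prompt are illustrative only.
import time
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")

video = genai.upload_file(path="cooking_video.mp4")  # hypothetical file
while video.state.name == "PROCESSING":  # uploads are processed asynchronously
    time.sleep(5)
    video = genai.get_file(video.name)

model = genai.GenerativeModel("gemini-1.5-pro")
response = model.generate_content([
    video,
    "Turn this cooking video into a single-page HTML app with one step per "
    "screen, including timestamps, ingredient quantities, and a checklist.",
])

# Save the generated page; opening it in a browser gives the interactive app.
with open("recipe_app.html", "w") as f:
    f.write(response.text)
```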