Google AI: Release Notes

Gemini's Multimodality

Jul 2, 2025
Ani Baddepudi, the Product Lead for Gemini Model Behavior, shares her insights on the groundbreaking multimodal capabilities of Gemini. She explains why Gemini was designed as a multimodal model from the start, emphasizing its vision-first approach. The conversation dives into the intricacies of video and image understanding, showcasing advancements in higher FPS video sampling and tokenization methods. Ani also discusses the future of proactive AI assistants and the collaborative efforts behind Gemini’s evolution, revealing exciting possibilities for intuitive AI interactions.
INSIGHT

Gemini's Native Multimodality

  • Gemini was designed as a natively multimodal model from the start, trained to handle text, images, video, and audio together.
  • This allows Gemini to perceive the world the way humans do and to perform diverse tasks involving multiple data types seamlessly.
INSIGHT

Handling Compression in Vision Tokens

  • Representing images and videos as tokens involves some loss of detail, yet Gemini generalizes remarkably well from these lossy representations.
  • Even with videos sampled at one frame per second, the model achieves surprisingly strong video understanding and reasoning (see the sketch below).
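To make the frame-sampling and tokenization idea concrete, here is a minimal sketch of that kind of pipeline. This is not Gemini's actual implementation; the frame rate, patch size, and toy video are illustrative assumptions. It downsamples a clip to roughly one frame per second and splits each kept frame into fixed-size tiles, a rough stand-in for vision tokens:

    import numpy as np

    def sample_frames(video, source_fps, target_fps=1):
        # Keep every Nth frame so the clip is sampled at roughly target_fps.
        step = max(1, source_fps // target_fps)
        return video[::step]

    def patchify(frame, patch=16):
        # Split an HxWxC frame into non-overlapping patch-by-patch tiles,
        # a rough analogue of one "vision token" per tile.
        h, w, c = frame.shape
        h, w = h - h % patch, w - w % patch  # drop edge pixels that don't fill a tile
        tiles = (frame[:h, :w]
                 .reshape(h // patch, patch, w // patch, patch, c)
                 .transpose(0, 2, 1, 3, 4))
        return tiles.reshape(-1, patch, patch, c)

    # Toy example: 10 seconds of 24 fps, 224x224 RGB video (random pixels).
    video = np.random.randint(0, 256, size=(240, 224, 224, 3), dtype=np.uint8)
    frames = sample_frames(video, source_fps=24, target_fps=1)  # 10 frames survive
    print(len(frames), patchify(frames[0]).shape[0])            # -> 10 196

At one frame per second, a 10-second clip collapses from 240 frames to 10, and each frame to a few hundred tiles; the insight above is that the model still reasons well over such heavily compressed input.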
ANECDOTE

Turning Videos Into Interactive Apps

  • Ani converted a YouTube cooking video into a step-by-step interactive recipe app using Gemini.
  • Users are building interactive lecture notes and web pages from videos, demonstrating the strength of Gemini's video understanding.