

Video as a Universal Interface for AI Reasoning with Sherry Yang - #676
Mar 18, 2024
Sherry Yang, a Senior Research Scientist at Google DeepMind and a PhD candidate at UC Berkeley, discusses her groundbreaking work on video as a universal interface for AI reasoning. She draws parallels between video generation models and language models, highlighting their potential in real-world decision-making tasks. The conversation covers the integration of video in robotics, the challenges of effective labeling, and the exciting applications of interactive simulators. Sherry also unveils UniSim, showcasing the future of engaging with AI-generated environments.
AI Snips
Video as Unified Data Format
- Video is a unified data format that, like text, encodes rich information about the world.
- This unified format makes it possible to train a single model with a single objective, much as language models are trained to predict the next token (see the sketch below).
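To make the parallel concrete, here is a minimal sketch, not from the episode: it assumes frames have already been tokenized into discrete codes by a separate encoder (not shown), and the vocabulary size, model shape, and names are all illustrative. Once video is a token sequence, training uses exactly the cross-entropy next-token objective familiar from language models.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Assumed setup: frames already tokenized into discrete codes
# (e.g., by a VQ-style encoder, not shown). VOCAB and DIM are illustrative.
VOCAB, DIM = 1024, 256

class NextTokenVideoModel(nn.Module):
    """Autoregressive model over flattened video tokens, trained with the
    same next-token objective used for language models."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, DIM)
        layer = nn.TransformerEncoderLayer(DIM, nhead=4, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(DIM, VOCAB)

    def forward(self, tokens):
        # Causal mask so each position attends only to earlier tokens.
        mask = nn.Transformer.generate_square_subsequent_mask(tokens.size(1))
        x = self.backbone(self.embed(tokens), mask=mask)
        return self.head(x)

model = NextTokenVideoModel()
tokens = torch.randint(0, VOCAB, (2, 64))   # two sequences of video tokens
logits = model(tokens[:, :-1])              # predict each following token
loss = F.cross_entropy(logits.reshape(-1, VOCAB), tokens[:, 1:].reshape(-1))
loss.backward()
```

The point of the sketch is only that the objective is unified: nothing in the loss is video-specific once the data is tokenized.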
Challenges in Video Data
- Video data often lacks explicit labels such as actions or text descriptions, so training cannot rely on self-supervision alone.
- Unlike text, where future words serve as natural labels, controlled video generation requires specific conditioning labels that raw video does not carry (see the sketch below).
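A hedged illustration of that asymmetry, with illustrative tensors only (not code from the episode): both modalities admit shift-by-one targets, but only text supplies its own labels; steering video generation needs a signal that has to come from somewhere else.

```python
import torch

# Text: training labels come for free -- the target is just the same
# sequence shifted by one position.
text = torch.randint(0, 50_000, (1, 16))
text_inputs, text_labels = text[:, :-1], text[:, 1:]

# Video: future frames can serve as shifted targets in the same way...
frames = torch.randn(1, 8, 3, 64, 64)       # (batch, time, channels, H, W)
frame_inputs, frame_targets = frames[:, :-1], frames[:, 1:]

# ...but steering the generation (an instruction, a camera pan, a robot
# action) needs a conditioning signal that raw video does not carry, so
# it must be annotated, inferred, or pseudo-labeled.
condition = None  # absent from raw video; this is the labeling gap above
```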
Richness of Video Information
- Videos implicitly capture detailed physical information, unlike high-level language descriptions.
- This makes video ideal for tasks like learning complex procedures or visual reasoning.