Shlomi Fruchter, a Research Director at Google DeepMind, and Jack Parker Holder, a research scientist on the open-endedness team, unveil Genie 3, a revolutionary AI that creates immersive 3D worlds from text prompts. The model can generate environments in seconds and keeps them remarkably consistent as users interact with them. They discuss the evolution from Genie 2 to Genie 3, emphasizing improvements in memory and real-time human interaction. The hosts dive into potential applications in game design and robotics, hinting at a future where AI can simulate complex environments with ease.
Duration: 58:22
INSIGHT
Emergent Consistency in Genie World Model
Genie is a world model that simulates environment dynamics and interaction without explicit 3D modeling.
It achieves surprising consistency and object permanence as an emergent property of training on video.
ANECDOTE
From Photo to Interactive World
DeepMind demoed a system turning a photo from California into an interactive, AI-generated game-like world.
The AI generates every pixel in real-time as the user moves, creating a seamless immersive experience.
INSIGHT
Genie 3's Leap to Real-time Realism
Genie 3 can generate photorealistic, interactive 720p environments in real time, with sessions lasting several minutes.
It blends elements from video models and world models to create a flexible, prompt-driven simulation experience.
This episode features Shlomi Fruchter and Jack Parker Holder from Google DeepMind, who unveil a new AI called Genie 3. The host, Tim Scarfe, describes it as the most mind-blowing technology he has ever seen. We were invited to their offices to conduct the interview (not sponsored).

Imagine you could create a video game world just by describing it. That's what Genie 3 does. It's an AI "world model" that learns how the real world works by watching massive amounts of video. Unlike a conventional game engine (such as Unreal, or the engine behind Doom), which has to be programmed by hand, Genie generates a realistic, interactive 3D world from a simple text prompt.

SPONSOR MESSAGES:
Prolific: Quality data. From real people. For faster breakthroughs.
https://prolific.com/mlst?utm_campaign=98404559-MLST&utm_source=youtube&utm_medium=podcast&utm_content=script-gen

Here's a breakdown of what makes it so revolutionary:

From Text to a Virtual World: You can type "a drone flying by a beautiful lake" or "a ski slope," and Genie 3 creates that world for you in about three seconds. You can then navigate and interact with it in real time.

It's Consistent: The worlds it creates have a reliable memory. If you look away from an object and then look back, it will still be there, just as it was. The guests explain that this consistency isn't explicitly programmed in; it's a surprising, "emergent" capability of the powerful AI model.

A Huge Leap Forward: The previous version, Genie 2, was a major step, but it wasn't fast enough for real-time interaction and ran at a much lower resolution. Genie 3 is 720p, interactive, and photorealistic, running smoothly for several minutes at a time.

The Killer App - Training Robots: Beyond entertainment, the team sees Genie 3 as a game-changer for training AI. Instead of training a self-driving car or a robot in the real world (which is slow and dangerous), you can create infinite simulations. You can even prompt rare events to happen, like a deer running across the road, to teach an AI how to handle unexpected situations safely.

The Future of Entertainment: This could lead to a "YouTube version 2" or a new form of VR, where users can create and explore endless, interconnected worlds together, like the "experience machine" from philosophy.

While the technology is still a research prototype and not yet available to the public, it represents a monumental step toward creating true artificial worlds from the ground up.

Jack Parker Holder [Research Scientist, Google DeepMind, Open-Endedness Team]
https://jparkerholder.github.io/

Shlomi Fruchter [Research Director, Google DeepMind]
https://shlomifruchter.github.io/

TOC:
[00:00:00] - Introduction: "The Most Mind-Blowing Technology I've Ever Seen"
[00:02:30] - The Evolution from Genie 1 to Genie 2
[00:04:30] - Enter Genie 3: Photorealistic, Interactive Worlds from Text
[00:07:00] - Promptable World Events & Training Self-Driving Cars
[00:14:21] - Guest Introductions: Shlomi Fruchter & Jack Parker Holder
[00:15:08] - Core Concepts: What is a "World Model"?
[00:19:30] - The Challenge of Consistency in a Generated World
[00:21:15] - Context: The Neural Network Doom Simulation
[00:25:25] - How Do You Measure the Quality of a World Model?
[00:28:09] - The Vision: Using Genie to Train Advanced Robots
[00:32:21] - Open-Endedness: Human Skill and Prompting Creativity
[00:38:15] - The Future: Is This the Next YouTube or VR?
[00:42:18] - The Next Step: Multi-Agent Simulations
[00:52:51] - Limitations: Thinking, Computation, and the Sim-to-Real Gap
[00:58:07] - Conclusion & The Future of Game Engines

REFS:
World Models [David Ha, Jürgen Schmidhuber]
https://arxiv.org/abs/1803.10122
POET
https://arxiv.org/abs/1901.01753
The Fractured Entangled Representation Hypothesis [Akarsh Kumar, Jeff Clune, Joel Lehman, Kenneth O. Stanley]
https://arxiv.org/pdf/2505.11581

TRANSCRIPT:
https://app.rescript.info/public/share/Zk5tZXk6mb06yYOFh6nSja7Lg6_qZkgkuXQ-kl5AJqM