The episode centers on a recap of the ICML 2024 conference, highlighting the release of talks from the event and promoting the upcoming NeurIPS 2024. It features announcements about popular speakers and tracks, including recaps of advances in Vision, Open Models, and AI keynotes, and it mentions the sold-out Latent Space Live event, which showcases pivotal developments in the AI community during 2024. The hosts also express gratitude to the guest hosts who contributed to the episode.
Sora is introduced as OpenAI's first video generation model capable of producing high-quality video sequences from simple text prompts. The model demonstrates impressive capabilities such as maintaining object permanence within scenes, generating complex interactions in dynamic environments, and transitioning between varied visual styles and settings. It can produce aesthetically pleasing visuals, such as a stylish woman walking down a neon-lit Tokyo street, while maintaining visual coherence across transitions. This illustrates Sora's potential in both artistic and practical applications of video generation.
The development of Sora is grounded in the principles of language model training, aiming to create unified representations for visual data. The architecture uses a variational autoencoder (VAE) to compress video into a latent space, which simplifies training the diffusion transformer that operates on those latents. By training on diverse video data with varying characteristics such as duration, resolution, and aspect ratio, Sora lays the groundwork for effective video synthesis. With extensive training, its high-quality outputs reflect an emergent grasp of lighting and scene continuity.
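To make the latent-space idea concrete, here is a minimal, hypothetical PyTorch sketch of the general recipe the summary describes: a small encoder compresses a video clip into latent patches, and a toy transformer learns to predict the noise added to those latents. This is not Sora's actual code; every module name, dimension, and the noising schedule below are made up for illustration.

```python
# Toy sketch (not Sora): encode video into latent patches, train a denoiser on them.
import torch
import torch.nn as nn

class VideoEncoder(nn.Module):
    """Toy spatio-temporal encoder: a video clip -> a sequence of latent patches."""
    def __init__(self, in_ch=3, latent_dim=64):
        super().__init__()
        self.conv = nn.Conv3d(in_ch, latent_dim, kernel_size=4, stride=4)

    def forward(self, video):                    # video: (B, C, T, H, W)
        z = self.conv(video)                     # (B, D, T', H', W')
        return z.flatten(2).transpose(1, 2)      # (B, num_patches, D)

class LatentDenoiser(nn.Module):
    """Toy transformer that predicts the noise added to latent patches."""
    def __init__(self, dim=64, depth=2, heads=4):
        super().__init__()
        layer = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, depth)
        self.out = nn.Linear(dim, dim)

    def forward(self, noisy_latents, t):
        # A real diffusion transformer would inject the timestep t (e.g. via AdaLN);
        # omitted here to keep the sketch short.
        return self.out(self.blocks(noisy_latents))

encoder, denoiser = VideoEncoder(), LatentDenoiser()
video = torch.randn(2, 3, 8, 32, 32)             # dummy clip: (B, C, T, H, W)
with torch.no_grad():
    latents = encoder(video)

t = torch.rand(latents.size(0), 1, 1)            # random diffusion times in [0, 1]
noise = torch.randn_like(latents)
noisy = (1 - t) * latents + t * noise            # simple linear noising schedule
loss = nn.functional.mse_loss(denoiser(noisy, t), noise)
loss.backward()
print(loss.item())
```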
Sora shows versatility in creating both photorealistic and non-photorealistic animations, generating imaginative scenes such as a beautifully rendered papercraft world. The model not only handles simple scenarios but also demonstrates an understanding of complex interactions between characters and objects, keeping characters consistent across transitions. Its performance on diverse scenes, such as vibrant cityscapes or underwater settings, points to its potential in varied visual storytelling. This adaptability in both style and narrative broadens Sora's appeal in entertainment and multimedia.
Sora's capabilities highlight significant advances in controllable video generation, particularly how the model captures varied actions within its generated content. Its accuracy shows in character interactions, such as a man who remains consistent across scene transitions and conveys human-like behavior. The model also preserves object identity and motion while respecting basic physics in the generated videos. These factors position it not just as a creative tool but as a stepping stone towards integrating generative video synthesis into broader human-computer interaction.
The discussions extend to multi-modal generative modeling, particularly leveraging existing networks and architectures for better output quality. A comparison of autoregressive and diffusion models shows that, when designed correctly, certain strategies yield smoother, more consistent video generation. With scaling and growing computational resources, multi-modal models that integrate audio and visual inputs open a path to a richer generative experience. The evolution of these techniques marks a phase of AI in which multiple modalities can work together.
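As a toy illustration of the autoregressive-versus-diffusion distinction mentioned above (not tied to any specific model from the talks): autoregressive generation commits to one token at a time, while diffusion-style sampling refines the whole output jointly over many steps. The stand-in models below are placeholders, not real systems.

```python
# Conceptual contrast between the two sampling procedures, with dummy models.
import torch

def autoregressive_sample(model, prompt, steps):
    """Each new token is conditioned only on what has been generated so far."""
    tokens = list(prompt)
    for _ in range(steps):
        logits = model(torch.tensor(tokens))       # per-position logits
        tokens.append(int(torch.argmax(logits[-1])))
    return tokens

def diffusion_sample(denoiser, shape, steps):
    """The full sample is refined jointly over many denoising steps."""
    x = torch.randn(shape)
    for i in reversed(range(1, steps + 1)):
        t = i / steps
        x = x - (1.0 / steps) * denoiser(x, t)     # crude Euler-style update
    return x

toy_lm = lambda tokens: torch.randn(len(tokens), 16)   # fake per-position logits
toy_denoiser = lambda x, t: x                          # fake noise prediction
print(autoregressive_sample(toy_lm, [0], steps=5))
print(diffusion_sample(toy_denoiser, (4, 4), steps=10).shape)
```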
In robotics, using generative techniques to learn actions and policies from video has attracted significant attention. Within a reinforcement learning framework, robots can be trained efficiently by adapting strategies implicitly learned from video. By learning lower-dimensional latent representations of actions through generative modeling and reusing learned strategies, robots can improve their performance on real-world tasks. This shifts the paradigm towards teaching robots through interaction rather than relying solely on demonstrations.
Exploring how robots interpret language commands offers an approach to supporting autonomous agents that perform tasks with minimal intervention. With a hierarchical policy model, high-level language instructions direct the behavior of low-level motor control. Rather than depending solely on extensive training data, robots become more interactive and can apply learned behaviors across contexts based on verbal corrections from users. This framework strengthens the interplay between human instructions and robotic responses, making robots more intuitive and less reliant on exhaustive datasets.
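A hypothetical sketch of the hierarchical idea described above: a high-level policy maps a language instruction to a discrete skill, and a low-level controller maps (skill, observation) to motor commands. The classes, dimensions, and skill set below are illustrative only, not the system presented at ICML.

```python
# Hypothetical hierarchical policy: language -> skill -> motor command.
import torch
import torch.nn as nn

class HighLevelPolicy(nn.Module):
    """Chooses a discrete skill from a (toy) embedding of the instruction."""
    def __init__(self, vocab_size=100, embed_dim=32, num_skills=4):
        super().__init__()
        self.embed = nn.EmbeddingBag(vocab_size, embed_dim)
        self.head = nn.Linear(embed_dim, num_skills)

    def forward(self, instruction_token_ids):
        pooled = self.embed(instruction_token_ids.unsqueeze(0))
        return self.head(pooled).argmax(dim=-1)          # skill index

class LowLevelController(nn.Module):
    """Maps (skill, proprioceptive observation) to a continuous action."""
    def __init__(self, num_skills=4, obs_dim=10, act_dim=7):
        super().__init__()
        self.skill_embed = nn.Embedding(num_skills, 16)
        self.net = nn.Sequential(nn.Linear(16 + obs_dim, 64), nn.ReLU(),
                                 nn.Linear(64, act_dim))

    def forward(self, skill, obs):
        return self.net(torch.cat([self.skill_embed(skill), obs], dim=-1))

high, low = HighLevelPolicy(), LowLevelController()
tokens = torch.tensor([3, 17, 42])        # stand-in for a tokenized instruction
obs = torch.randn(1, 10)                  # stand-in for robot state
skill = high(tokens)                      # e.g. "pick", "place", ...
action = low(skill, obs)                  # toy 7-DoF motor command
print(skill.item(), action.shape)
```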
Some approaches alleviate data scarcity in reinforcement learning, particularly in producing effective robot behaviors from a limited number of examples. Researchers demonstrate that language-based feedback can improve robot performance well beyond what traditional data collection achieves. Corrective information establishes better behaviors without excessive demonstrations, offering a more efficient path to training and broader applications across domains. Such data-driven language corrections matter greatly for the future of human-robot interaction.
Research on scaling models integrates multiple data types, including visual and text sources, which ultimately improves generalization in robotic tasks. By framing robotic control as visual question answering, researchers find that large pre-trained models significantly enhance adaptability and performance in nuanced scenarios. This synergy across modalities makes models more robust over a range of potential applications and sets the stage for further advances. Combining data from varied sources paves the way towards more general models in artificial intelligence.
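To show what the "robotic control as visual question answering" framing looks like in practice, here is a hedged sketch of the plumbing: the instruction is packed into a VQA-style prompt, and the model's text answer is decoded back into a discretized action. The `vlm_answer` function is a stand-in, not a real API, and the prompt format is invented for illustration.

```python
# Hypothetical "control as VQA" plumbing with a stand-in vision-language model.
from typing import List

def build_prompt(instruction: str) -> str:
    return (f"Given the camera image, what action should the robot take to "
            f"'{instruction}'? Answer with 7 integers in [0, 255].")

def parse_action(answer: str) -> List[float]:
    """Map discretized text tokens back to continuous commands in [-1, 1]."""
    bins = [int(tok) for tok in answer.split()]
    return [b / 127.5 - 1.0 for b in bins]

def vlm_answer(image, prompt: str) -> str:
    """Stand-in for a vision-language model call; returns a fixed dummy answer."""
    return "128 90 200 0 255 64 128"

image = None                               # placeholder for a camera frame
prompt = build_prompt("pick up the green block")
action = parse_action(vlm_answer(image, prompt))
print(prompt)
print(action)
```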
The discussion on reinforcement learning sits at the intersection of old challenges and new advances in environment shaping and model optimization. Current empirical practice recognizes the value of automating parts of RL to ease the workload on engineers, introducing action shaping and observation shaping to steer models efficiently. Future work must build a broader understanding of automatic environment shaping while ensuring that generalizable RL algorithms handle a wide array of tasks. This evolution points towards minimizing human intervention while maximizing the capabilities of reinforcement learning.
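As a concrete (toy) picture of what observation and action shaping mean in practice, the sketch below wraps a raw environment so the agent sees normalized observations and chooses from a small discrete action set. The environment and wrappers are invented for illustration; they are not from the talk or any particular benchmark or library.

```python
# Environment shaping as explicit wrappers around a toy environment.
import numpy as np

class ToyReachEnv:
    """Raw env: state is a 2D position, raw action is a 2D velocity."""
    def reset(self):
        self.pos = np.zeros(2)
        return self.pos.copy()
    def step(self, raw_action):
        self.pos += np.clip(raw_action, -1.0, 1.0) * 0.1
        reward = -np.linalg.norm(self.pos - np.array([1.0, 1.0]))
        return self.pos.copy(), reward

class ObservationShaping:
    """Observation shaping: rescale what the agent sees to roughly [-1, 1]."""
    def __init__(self, env, scale=1.0):
        self.env, self.scale = env, scale
    def reset(self):
        return self.env.reset() / self.scale
    def step(self, action):
        obs, reward = self.env.step(action)
        return obs / self.scale, reward

class ActionShaping(ObservationShaping):
    """Action shaping: expose 4 discrete moves instead of raw continuous velocity."""
    MOVES = {0: (1, 0), 1: (-1, 0), 2: (0, 1), 3: (0, -1)}
    def step(self, discrete_action):
        return super().step(np.array(self.MOVES[discrete_action], dtype=float))

env = ActionShaping(ToyReachEnv(), scale=2.0)
obs = env.reset()
for a in [0, 2, 0, 2]:                      # move right, up, right, up
    obs, reward = env.step(a)
print(obs, reward)
```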
The talks at ICML 2024 have collectively presented progress across generative video synthesis, flow matching models, and reinforcement learning paradigms. The underlying insights about blending multi-modal models, revisiting traditional tasks through innovative lenses, and leveraging effective feedback systems reveal new opportunities for future advancements. As researchers continue to refine their approaches to incorporating contextual understanding and adaptive processes, the potential for nuanced AI systems will increase tremendously. This wave of innovation anticipates a redefined interaction with both computational models and real-world applications.
A growing discussion around generative models weighs automatic metrics against qualitative human assessment for measuring AI-generated outputs. While traditional benchmarks have dominated evaluation, emerging systems call for more robust and intuitive human evaluations that account for the unique attributes of new domains. The subjective nature of visual perception demands better methodologies for quantifying quality and consistency, including deeper reliance on comprehensive human feedback. This dialogue suggests a much-needed evolution in how success is defined in the generative model landscape.
The trajectory of vision-language models aligns with their growing acceptance as versatile tools that can vastly improve generalization in unseen environments. Framing robotic control as visual question answering connects robots to expansive training data and contextualization techniques, improving how they learn and interact. Growing recognition of these models may signal a paradigm shift towards more integrated AI solutions, with implications spanning fields from robotics to general artificial intelligence. As companies invest in this domain, we can anticipate an exciting evolution in AI capabilities.
The advancements in robotics facilitated by incorporating learnings from generative AI and leveraging natural supervision present a dynamic future for human-robot interaction. Robots adopting high-level instructions can ultimately navigate complex tasks while relying less on extensive datasets, promoting a shift towards more autonomous operations. As the technologies continue to improve, the synergy created between human intention and robotic response promises increased operational efficiency and adaptability in real-world scenarios. Overall, this evolving relationship is pushing towards making robots seamlessly integrated into daily human activities and enhancing user experiences.
Regular tickets are now sold out for Latent Space LIVE! at NeurIPS! We have just announced our last speaker and newest track, friend of the pod Nathan Lambert who will be recapping 2024 in Reasoning Models like o1! We opened up a handful of late bird tickets for those who are deciding now — use code DISCORDGANG if you need it. See you in Vancouver!
We’ve been sitting on our ICML recordings for a while (from today’s first-ever SOLO guest cohost, Brittany Walker), and in light of Sora Turbo’s launch (blogpost, tutorials) today, we figured it would be a good time to drop part one, which had been gearing up to be a deep dive into the state of generative video worldsim, with a seamless transition to vision (the opposite modality), and finally robots (their ultimate application).
Sora, Genie, and the field of Generative Video World Simulators
Bill Peebles, author of Diffusion Transformers, gave his most recent Sora talk at ICML, which begins our episode:
* William (Bill) Peebles - SORA (slides)
Something that is often asked about Sora is how much inductive bias was introduced to achieve these results. Bill references the same principle brought up by Hyung Won Chung from the o1 team - “sooner or later those biases come back to bite you”.
We also recommend these reads from throughout 2024 on Sora.
* Lilian Weng’s literature review of Video Diffusion Models
* Estimates of 100k-700k H100s needed to serve Sora (not Turbo)
* Artist guides on using Sora for professional storytelling
Google DeepMind had a remarkably strong presence at ICML on Video Generation Models, winning TWO Best Paper awards for:
* Genie: Generative Interactive Environments (covered in oral, poster, and workshop)
* VideoPoet: A Large Language Model for Zero-Shot Video Generation (see website)
We end this part by taking in Tali Dekel’s talk on The Future of Video Generation: Beyond Data and Scale.
Part 2: Generative Modeling and Diffusion
Since 2023, Sander Dieleman’s perspectives (blogpost, tweet) on diffusion as “spectral autoregression in the frequency domain” while working on Imagen and Veo have caught the public imagination, so we highlight his talk:
* Wading through the noise: an intuitive look at diffusion models
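For a rough feel of the “spectral autoregression” intuition, this toy NumPy sketch (ours, not Sander’s code) computes a radially averaged power spectrum and shows that adding Gaussian noise swamps the high frequencies of a smooth image long before the low frequencies, which is why denoising proceeds roughly coarse-to-fine across the spectrum.

```python
# Toy illustration: noise drowns out high frequencies first.
import numpy as np

def radial_power_spectrum(img):
    """Radially averaged power spectrum of a square grayscale image."""
    f = np.fft.fftshift(np.fft.fft2(img))
    power = np.abs(f) ** 2
    h, w = img.shape
    y, x = np.indices((h, w))
    r = np.hypot(y - h // 2, x - w // 2).astype(int)
    return np.bincount(r.ravel(), power.ravel()) / np.bincount(r.ravel())

rng = np.random.default_rng(0)
# Synthetic "natural-ish" image: smooth, low-frequency content only.
xx, yy = np.meshgrid(np.linspace(0, 4 * np.pi, 128), np.linspace(0, 4 * np.pi, 128))
image = np.sin(xx) + np.cos(yy)

for sigma in [0.0, 0.5, 2.0]:
    noisy = image + sigma * rng.standard_normal(image.shape)
    spectrum = radial_power_spectrum(noisy)
    low, high = spectrum[1:8].mean(), spectrum[40:64].mean()
    print(f"sigma={sigma}: low-freq power {low:.3g}, high-freq power {high:.3g}")
```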
Then we go to Ben Poole for his talk on Inferring 3D Structure with 2D Priors, including his work on NeRFs and DreamFusion:
Then we investigate two flow matching papers - one from the Flow Matching co-authors - Ricky T. Q. Chen (FAIR, Meta)
And how it is implemented in Stable Diffusion 3 with Scaling Rectified Flow Transformers for High-Resolution Image Synthesis
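As a minimal sketch of the linear-path conditional flow matching objective behind rectified-flow-style models (a toy 2D example, not SD3’s implementation): regress a velocity field onto the straight line between data and noise, then integrate that field to sample.

```python
# Toy conditional flow matching (rectified-flow-style) on 2D points.
import torch
import torch.nn as nn

class VelocityNet(nn.Module):
    """Tiny MLP v_theta(x, t) predicting the velocity at time t."""
    def __init__(self, dim=2):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim + 1, 64), nn.ReLU(),
                                 nn.Linear(64, dim))
    def forward(self, x, t):
        return self.net(torch.cat([x, t], dim=-1))

model = VelocityNet()
opt = torch.optim.Adam(model.parameters(), lr=1e-2)

for step in range(300):
    x0 = torch.randn(256, 2) * 0.1 + torch.tensor([2.0, 0.0])  # toy "data"
    x1 = torch.randn(256, 2)                                    # noise
    t = torch.rand(256, 1)
    xt = (1 - t) * x0 + t * x1            # point on the straight path
    target = x1 - x0                      # velocity of that path
    loss = nn.functional.mse_loss(model(xt, t), target)
    opt.zero_grad()
    loss.backward()
    opt.step()

# Sampling: integrate dx/dt backwards from noise (t=1) towards data (t=0).
with torch.no_grad():
    x = torch.randn(5, 2)
    for i in range(100, 0, -1):
        t = torch.full((5, 1), i / 100)
        x = x - (1 / 100) * model(x, t)
print(x.mean(0))                          # should drift toward roughly [2, 0]
```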
Our last hit on Diffusion is a couple of oral presentations on speech, which we leave you to explore via our audio podcast:
* NaturalSpeech 3: Zero-Shot Speech Synthesis with Factorized Codec and Diffusion Models
* Speech Self-Supervised Learning Using Diffusion Model Synthetic Data
Part 3: Vision
The ICML Test of Time winner was DeCAF, which Trevor Darrell notably called “the OG vision foundation model”.
Lucas Beyer’s talk on “Vision in the age of LLMs — a data-centric perspective” was also well received online, and he talked about his journey from Vision Transformers to PaliGemma.
We give special honorable mention to MLLM-as-a-Judge: Assessing Multimodal LLM-as-a-Judge with Vision-Language Benchmark.
Part 4: Reinforcement Learning and Robotics
We segue from vision into robotics with the help of Ashley Edwards, whose work on both the Gato and Genie teams at DeepMind is summarized in Learning actions, policies, rewards, and environments from videos alone.
Brittany highlighted two poster session papers:
* Behavior Generation with Latent Actions
* We also recommend Lerrel Pinto’s On Building General-Purpose Robots
* PIVOT: Iterative Visual Prompting Elicits Actionable Knowledge for VLMs
However, we must give the lion’s share of space to Chelsea Finn, now founder of Physical Intelligence, who gave FOUR talks on
* "What robots have taught me about machine learning"
* developing robot generalists
* robots that adapt autonomously
* how to give feedback to your language model
* special mention to PI colleague Sergey Levine on Robotic Foundation Models
We end the podcast with a position paper that links generative environments and RL/robotics: Automatic Environment Shaping is the Next Frontier in RL.
Timestamps
* [00:00:00] Intros
* [00:02:43] Sora - Bill Peebles
* [00:44:52] Genie: Generative Interactive Environments
* [01:00:17] Genie interview
* [01:12:33] VideoPoet: A Large Language Model for Zero-Shot Video Generation
* [01:30:51] VideoPoet interview - Dan Kondratyuk
* [01:42:00] Tali Dekel - The Future of Video Generation: Beyond Data and Scale.
* [02:27:07] Sander Dieleman - Wading through the noise: an intuitive look at diffusion models
* [03:06:20] Ben Poole - Inferring 3D Structure with 2D Priors
* [03:30:30] Ricky Chen - Flow Matching
* [04:00:03] Patrick Esser - Stable Diffusion 3
* [04:14:30] NaturalSpeech 3: Zero-Shot Speech Synthesis with Factorized Codec and Diffusion Models
* [04:27:00] Speech Self-Supervised Learning Using Diffusion Model Synthetic Data
* [04:39:00] ICML Test of Time winner: DeCAF
* [05:03:40] Lucas Beyer: “Vision in the age of LLMs — a data-centric perspective”
* [05:42:00] Ashley Edwards: Learning actions, policies, rewards, and environments from videos alone.
* [06:03:30] Behavior Generation with Latent Actions interview
* [06:09:52] Chelsea Finn: "What robots have taught me about machine learning"
* [06:56:00] Position: Automatic Environment Shaping is the Next Frontier in RL