Latent Space: The AI Engineer Podcast

Generative Video WorldSim, Diffusion, Vision, Reinforcement Learning and Robotics — ICML 2024 Part 1

Dec 10, 2024

07:07:47

ANECDOTE

Sora's Capabilities

Sora can generate a minute of 1080p video, seamlessly handling complex scenes and transitions.
A stylish Tokyo street scene and a papercraft coral reef showcase its diverse styles.

INSIGHT

Sora's Unified Representation

Sora uses a VAE, inspired by latent diffusion, for a unified visual data representation.
This allows training on diverse video and image data formats without discarding information.

INSIGHT

Scaling Sora's Performance

Visual quality in Sora scales effectively with increased compute, showing detail improvement.
Training with more compute enhances textures, interactions, and overall scene realism.

Unleashing Sora: A Generative Video Revolution

04:13 • 27min

Exploring the Future of Synthetic Characters in Filmmaking

31:11 • 3min

Exploring the Sora and Genie Video Models

34:39 • 37min

Innovations in Video Generation: The VideoPoet Model

01:11:30 • 30min

Advancements in Generative AI for Video

01:41:59 • 17min

TokenFlow and Video Editing Consistency

01:59:07 • 6min

Advancements in Generative Video Technology

02:05:26 • 6min

Optimizing Video Feature Extraction

02:11:17 • 6min

Innovative Approaches in Video Latent Initialization and Motion Fidelity Measurement

02:17:29 • 2min

Advancements in Video Generation and Diffusion Models

02:19:49 • 35min

Advancements in Generative Models and 3D Reconstruction

02:55:16 • 37min

Advanced Flow Matching in Generative Modeling

03:32:25 • 39min

Advancements in Generative Models and Speech Synthesis

04:11:10 • 45min

The Evolution of Pre-Training Paradigms in AI

04:55:52 • 10min

Innovative Approaches to Image and Text Pre-training in Computer Vision

05:05:29 • 5min

Advancements in Multilingual Language Models

05:10:05 • 17min

Stages of Multimodal Model Pre-Training and Fine-Tuning Techniques

05:26:42 • 5min

Fine-Tuning Models in Computer Vision

05:31:47 • 12min

Exploration of Motion Representation and Behavior Learning in Robotics

05:43:43 • 8min

Learning Optimal Policies from Video Data

05:51:51 • 13min

Innovations in Robotics Training

06:04:50 • 19min

Improving Robot Performance with Language Feedback

06:24:05 • 23min

Adapting Robots through Learning and Automation

06:47:18 • 16min

Challenges and Future Directions in Reinforcement Learning and Automation

07:03:06 • 5min

Regular tickets are now sold out for Latent Space LIVE! at NeurIPS! We have just announced our last speaker and newest track: friend of the pod Nathan Lambert, who will be recapping 2024 in Reasoning Models like o1! We opened up a handful of late bird tickets for those who are deciding now; use code DISCORDGANG if you need it. See you in Vancouver!

We’ve been sitting on our ICML recordings for a while (from today’s first-ever SOLO guest cohost, Brittany Walker). In light of Sora Turbo’s launch (blogpost, tutorials) today, we figured it would be a good time to drop part one, which was already shaping up to be a deep dive into the state of generative video worldsim, with a seamless transition to vision (the opposite modality) and finally robots (their ultimate application).

Sora, Genie, and the field of Generative Video World Simulators

Bill Peebles, author of Diffusion Transformers, gave his most recent Sora talk at ICML, which begins our episode:

* William (Bill) Peebles - SORA (slides)

A common question about Sora is how much inductive bias was introduced to achieve these results. Bill references the same principle raised by Hyung Won Chung from the o1 team: “sooner or later those biases come back to bite you”.

We also recommend these reads from throughout 2024 on Sora.

* Lilian Weng’s literature review of Video Diffusion Models

* Sora API leak

* Estimates of 100k-700k H100s needed to serve Sora (not Turbo)

* Artist guides on using Sora for professional storytelling

Google DeepMind had a remarkably strong presence at ICML on Video Generation Models, winning TWO Best Paper awards for:

* Genie: Generative Interactive Environments (covered in oral, poster, and workshop)

* VideoPoet: A Large Language Model for Zero-Shot Video Generation (see website)

We end this part by taking in Tali Dekel’s talk on The Future of Video Generation: Beyond Data and Scale.

Part 2: Generative Modeling and Diffusion

Since 2023, Sander Dieleman’s perspectives (blogpost, tweet) on diffusion as “spectral autoregression in the frequency domain” while working on Imagen and Veo have caught the public imagination, so we highlight his talk:

* Wading through the noise: an intuitive look at diffusion models

Then we go to Ben Poole for his talk on Inferring 3D Structure with 2D Priors, including his work on NeRFs and DreamFusion.

Then we investigate two flow matching papers: one from Flow Matching co-author Ricky T. Q. Chen (FAIR, Meta), and one on how flow matching is implemented in Stable Diffusion 3, Scaling Rectified Flow Transformers for High-Resolution Image Synthesis.

Our last stop on Diffusion is a pair of oral presentations on speech, which we leave you to explore via our audio podcast:

* NaturalSpeech 3: Zero-Shot Speech Synthesis with Factorized Codec and Diffusion Models

* Speech Self-Supervised Learning Using Diffusion Model Synthetic Data

Part 3: Vision

The ICML Test of Time winner was DeCAF, which Trevor Darrell notably called “the OG vision foundation model”.

Lucas Beyer’s talk on “Vision in the age of LLMs — a data-centric perspective” was also well received online, and he talked about his journey from Vision Transformers to PaliGemma.

We give special honorable mention to MLLM-as-a-Judge: Assessing Multimodal LLM-as-a-Judge with Vision-Language Benchmark.

Part 4: Reinforcement Learning and Robotics

We segue from vision into robotics with the help of Ashley Edwards, whose work on both the Gato and the Genie teams at DeepMind is summarized in Learning actions, policies, rewards, and environments from videos alone.

Brittany highlighted two poster session papers:

* Behavior Generation with Latent Actions

* PIVOT: Iterative Visual Prompting Elicits Actionable Knowledge for VLMs

We also recommend Lerrel Pinto’s On Building General-Purpose Robots.

However, we must give the lion’s share of space to Chelsea Finn, now founder of Physical Intelligence, who gave FOUR talks on:

* "What robots have taught me about machine learning"

* developing robot generalists

* robots that adapt autonomously

* how to give feedback to your language model

* special mention to PI colleague Sergey Levine on Robotic Foundation Models

We end the podcast with a position paper that links generative environments and RL/robotics: Automatic Environment Shaping is the Next Frontier in RL.

Timestamps

* [00:00:00] Intros

* [00:02:43] Sora - Bill Peebles

* [00:44:52] Genie: Generative Interactive Environments

* [01:00:17] Genie interview

* [01:12:33] VideoPoet: A Large Language Model for Zero-Shot Video Generation

* [01:30:51] VideoPoet interview - Dan Kondratyuk

* [01:42:00] Tali Dekel - The Future of Video Generation: Beyond Data and Scale

* [02:27:07] Sander Dieleman - Wading through the noise: an intuitive look at diffusion models

* [03:06:20] Ben Poole - Inferring 3D Structure with 2D Priors

* [03:30:30] Ricky Chen - Flow Matching

* [04:00:03] Patrick Esser - Stable Diffusion 3