

Ep#21 TesserAct: Learning 4D Embodied World Models
Jul 16, 2025
In this engaging discussion, Haoyu Zhen, a PhD student at UMass Amherst focusing on 3D foundation models, dives into his work on TesserAct, a 4D embodied world model. He explores how these models can predict future states and generate photorealistic robotic simulations. Haoyu shares insights on the complexities of training these models, including the effectiveness of pre-training and the challenges of zero-shot generalization. The conversation highlights the importance of data quality and advances in video diffusion models for the future of robotics.
AI Snips
Defining 4D World Models
- World models predict future states by generating video sequences conditioned on input actions (robot commands).
- Adding the time dimension to 3D allows modeling dynamic scenes, making them 4D world models (see the sketch after this list).
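To make the definition concrete, here is a minimal interface sketch under assumptions (hypothetical class and method names, not the actual TesserAct API): a 4D world model takes the current RGB-D observation plus a sequence of robot commands and rolls out a sequence of predicted future frames, i.e. a 3D scene evolving over time.

```python
# Minimal sketch (hypothetical names, not the TesserAct API): a 4D world model
# maps the current observation plus a sequence of robot actions to a predicted
# video of future frames, where each frame carries RGB and depth (3D + time = 4D).
from dataclasses import dataclass
import numpy as np

@dataclass
class Observation:
    rgb: np.ndarray    # (H, W, 3) current camera image
    depth: np.ndarray  # (H, W) per-pixel depth, supplying the 3D component

class FourDWorldModel:
    """Hypothetical interface for a 4D embodied world model."""

    def predict(self, obs: Observation, actions: np.ndarray) -> list[Observation]:
        """Given the current scene and T robot commands of shape (T, action_dim),
        return T predicted future frames (a dynamic, i.e. 4D, scene)."""
        raise NotImplementedError  # e.g. backed by a video diffusion model

# Usage: roll the model forward under a candidate action sequence.
# model = FourDWorldModel()
# future = model.predict(current_obs, planned_actions)  # list of T Observations
```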
4D Models Boost Generalization
- Video diffusion models generalize better across unseen domains than traditional 3D representation models.
- Adding depth and time as modalities improves both scene understanding and control of robot arm movements.
Use Multi-Modal Data for Training
- Train the video diffusion model with combined RGB, depth, and surface-normal projection heads for multi-modal supervision (see the sketch after this list).
- Use a large-scale video dataset drawn from both simulators and real-world sources for robust training.
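A hedged sketch of what combined multi-modal supervision could look like; the module names, placeholder backbone, shapes, and equal loss weights are assumptions for illustration, not the released TesserAct code. The idea shown is a shared video backbone feeding separate RGB, depth, and normal projection heads, trained with a summed reconstruction loss over all three modalities.

```python
# Hedged sketch of joint RGB / depth / normal supervision (assumed names and
# shapes, not the released TesserAct code).
import torch
import torch.nn as nn

class MultiModalHeads(nn.Module):
    def __init__(self, feat_dim: int = 256):
        super().__init__()
        # Placeholder stand-in for the video diffusion backbone's per-pixel features.
        self.backbone = nn.Conv3d(3, feat_dim, kernel_size=3, padding=1)
        # One lightweight projector per output modality.
        self.rgb_head = nn.Conv3d(feat_dim, 3, kernel_size=1)     # color
        self.depth_head = nn.Conv3d(feat_dim, 1, kernel_size=1)   # depth
        self.normal_head = nn.Conv3d(feat_dim, 3, kernel_size=1)  # surface normals

    def forward(self, video: torch.Tensor):
        # video: (B, 3, T, H, W) conditioning / noisy frames
        feats = torch.relu(self.backbone(video))
        return self.rgb_head(feats), self.depth_head(feats), self.normal_head(feats)

def multimodal_loss(pred, target_rgb, target_depth, target_normal):
    rgb, depth, normal = pred
    # Equal weighting is an assumption; the episode only states that the
    # modalities are supervised jointly.
    return (nn.functional.mse_loss(rgb, target_rgb)
            + nn.functional.mse_loss(depth, target_depth)
            + nn.functional.mse_loss(normal, target_normal))
```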