

Ep#21 TesserAct: Learning 4D Embodied World Models
Jul 16, 2025
In this engaging discussion, Haoyu Zhen, a PhD student at UMass Amherst focusing on 3D foundation models, dives into his work on TesserAct, a 4D embodied world model. He explores how these models can predict future states and generate photorealistic robotic simulations. Haoyu shares insights on the complexities of training these models, including the effectiveness of pre-training and the challenges of zero-shot generalization. The conversation highlights the importance of data quality and advances in video diffusion models for the future of robotics.
AI Snips
Defining 4D World Models
- World models predict future states by generating video sequences conditioned on input actions (robot commands).
- Adding the time dimension to 3D allows modeling dynamic scenes, making them 4D world models (see the sketch after this list).
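To make the definition concrete, here is a minimal interface sketch under assumptions (hypothetical class and method names, not the actual TesserAct API): a 4D world model takes the current RGB-D observation plus a sequence of robot commands and rolls out a sequence of predicted future frames, i.e. a 3D scene evolving over time.

```python
# Minimal sketch (hypothetical names, not the TesserAct API): a 4D world model
# maps the current observation plus a sequence of robot actions to a predicted
# video of future frames, where each frame carries RGB and depth (3D + time = 4D).
from dataclasses import dataclass
import numpy as np

@dataclass
class Observation:
    rgb: np.ndarray    # (H, W, 3) current camera image
    depth: np.ndarray  # (H, W) per-pixel depth, supplying the 3D component

class FourDWorldModel:
    """Hypothetical interface for a 4D embodied world model."""

    def predict(self, obs: Observation, actions: np.ndarray) -> list[Observation]:
        """Given the current scene and T robot commands of shape (T, action_dim),
        return T predicted future frames (a dynamic, i.e. 4D, scene)."""
        raise NotImplementedError  # e.g. backed by a video diffusion model

# Usage: roll the model forward under a candidate action sequence.
# model = FourDWorldModel()
# future = model.predict(current_obs, planned_actions)  # list of T Observations
```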
4D Models Boost Generalization
- Video diffusion models generalize better across unseen domains than traditional 3D representation models.
- Adding depth and time as modalities improves both scene understanding and control of robot arm movements.
Use Multi-Modal Data for Training
- Train the video diffusion model with combined RGB, depth, and surface-normal projection heads for multi-modal supervision (see the sketch after this list).
- Use a large-scale video dataset drawn from both simulators and real-world sources for robust training.
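A hedged sketch of what combined multi-modal supervision could look like; the module names, placeholder backbone, shapes, and equal loss weights are assumptions for illustration, not the released TesserAct code. The idea shown is a shared video backbone feeding separate RGB, depth, and normal projection heads, trained with a summed reconstruction loss over all three modalities.

```python
# Hedged sketch of joint RGB / depth / normal supervision (assumed names and
# shapes, not the released TesserAct code).
import torch
import torch.nn as nn

class MultiModalHeads(nn.Module):
    def __init__(self, feat_dim: int = 256):
        super().__init__()
        # Placeholder stand-in for the video diffusion backbone's per-pixel features.
        self.backbone = nn.Conv3d(3, feat_dim, kernel_size=3, padding=1)
        # One lightweight projector per output modality.
        self.rgb_head = nn.Conv3d(feat_dim, 3, kernel_size=1)     # color
        self.depth_head = nn.Conv3d(feat_dim, 1, kernel_size=1)   # depth
        self.normal_head = nn.Conv3d(feat_dim, 3, kernel_size=1)  # surface normals

    def forward(self, video: torch.Tensor):
        # video: (B, 3, T, H, W) conditioning / noisy frames
        feats = torch.relu(self.backbone(video))
        return self.rgb_head(feats), self.depth_head(feats), self.normal_head(feats)

def multimodal_loss(pred, target_rgb, target_depth, target_normal):
    rgb, depth, normal = pred
    # Equal weighting is an assumption; the episode only states that the
    # modalities are supervised jointly.
    return (nn.functional.mse_loss(rgb, target_rgb)
            + nn.functional.mse_loss(depth, target_depth)
            + nn.functional.mse_loss(normal, target_normal))
```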