

Ep#8: VGGT - Visual Geometry Grounded Transformer
May 2, 2025
Jianyuan Wang, a PhD student at Meta AI and the Visual Geometry Group at the University of Oxford, dives into the cutting-edge world of 3D reconstruction. He discusses the shift from classical to deep learning techniques, the innovative VGGT framework, and the importance of diverse datasets for training models. Explore how training on 64 GPUs enables processing for robotics, and learn about advancements in camera pose estimation and multi-view depth estimation. Wang also highlights challenges in modeling non-rigid motion, paving the way for future developments in computer vision for robotics.
AI Snips
VGGT Transcends Pairwise Limits
- Classical and earlier learned 3D reconstruction methods operate on image pairs, limiting their ability to leverage full-sequence data.
- The VGGT model uses a purely learning-based transformer framework that consumes multiple frames simultaneously, improving consistency and efficiency (a minimal sketch of this joint multi-frame pass follows below).
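To make the pairwise-versus-joint distinction concrete, here is a minimal sketch, not the authors' code: the encoder, token counts, and dimensions are illustrative assumptions. It feeds all frames' patch tokens through one transformer pass instead of only ever seeing two frames at a time.

```python
import torch
import torch.nn as nn

dim, tokens_per_frame, num_frames = 256, 196, 8
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True),
    num_layers=2,
)

# Patch tokens for each frame: (num_frames, tokens_per_frame, dim).
frames = torch.randn(num_frames, tokens_per_frame, dim)

# Pairwise style: each forward pass only ever sees two frames, so consistency
# across the whole sequence has to be stitched together afterwards.
pair_tokens = frames[[0, 1]].reshape(1, 2 * tokens_per_frame, dim)
pair_out = encoder(pair_tokens)

# Joint multi-frame style (VGGT-like): one forward pass attends across every frame at once.
joint_tokens = frames.reshape(1, num_frames * tokens_per_frame, dim)
joint_out = encoder(joint_tokens)
```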
Use Alternating Attention for Permutation Invariance
- Use an alternating attention mechanism that interleaves global and frame-wise attention to process multi-frame inputs.
- Frame-wise attention removes the need for an input frame-order embedding, ensuring permutation invariance over unordered image sequences (see the sketch after this snip).
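A minimal sketch of one such alternating block, assuming a VGGT-like layout; the module names, dimensions, and residual arrangement are illustrative, not the released implementation. Frame-wise self-attention keeps each frame's tokens to themselves, global self-attention lets every token attend across all frames, and no frame-order embedding is added anywhere.

```python
import torch
import torch.nn as nn

class AlternatingAttentionBlock(nn.Module):
    def __init__(self, dim: int = 256, heads: int = 8):
        super().__init__()
        self.frame_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.global_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (num_frames, tokens_per_frame, dim) -- no frame-order embedding is added.
        n, t, d = x.shape

        # Frame-wise attention: each frame's tokens attend only within that frame.
        h = self.norm1(x)
        frame_out, _ = self.frame_attn(h, h, h)
        x = x + frame_out

        # Global attention: all tokens from all frames attend to each other.
        g = self.norm2(x).reshape(1, n * t, d)
        global_out, _ = self.global_attn(g, g, g)
        x = x + global_out.reshape(n, t, d)
        return x

block = AlternatingAttentionBlock()
tokens = torch.randn(8, 196, 256)   # 8 frames, 196 patch tokens each
out = block(tokens)                 # same shape, with cross-frame information mixed in
```

Stacking several of these blocks gives each frame repeated chances to exchange information with every other frame, which is what the single-pass multi-frame consistency above relies on.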
Permutation Invariance Is Crucial
- Unordered image sets must produce consistent reconstructions regardless of input frame order.
- Removing time or order priors allows the VGGT model to generalize to arbitrary input collections, such as unordered internet photos; a quick check of this behavior is sketched below.
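Continuing the illustrative sketch above, one quick way to check this property is to shuffle the frame order, run the same block, un-shuffle the outputs, and confirm they match the original run up to floating-point noise.

```python
# Shuffle the input frames, then undo the shuffle on the outputs.
perm = torch.randperm(tokens.shape[0])
out_shuffled = block(tokens[perm])
out_restored = out_shuffled[torch.argsort(perm)]

# With no frame-order embeddings, the result should not depend on frame order.
print(torch.allclose(out_restored, out, atol=1e-5))  # expected: True
```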