

The End of Language-Only Models | Amit Jain, Luma AI
May 13, 2025
Amit Jain, CEO and co-founder of Luma AI and former Apple Vision Pro engineer, discusses the future of AI beyond just language models. He emphasizes the importance of multimodal training, particularly the often-overlooked role of video in AI development. Amit shares insights on how combining audio, video, and text can revolutionize industries like entertainment and advertising. He also touches on the potential for fully AI-generated feature films and critiques trend-driven approaches in AI, advocating for more meaningful innovations.
Luma AI's Multimodal Vision
- Multimodal general intelligence involves joint training on audio, video, language, and text to capture the full digital footprint.
- Luma AI builds world models that learn like humans, integrating multiple modalities simultaneously rather than focusing solely on language.
Limits of Language-Only AI Models
- Current large AI labs focus heavily on language and treat other modalities as afterthoughts, leading to limitations in model capability.
- Joint training on all modalities at scale is a strategic next frontier for overcoming the data limitations that text-only models face (a rough sketch of the idea follows this list).
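To make the idea concrete, here is a minimal sketch of what joint training over text, audio, and video can look like: per-modality encoders project everything into one shared embedding space, and a single backbone is trained over the combined sequence. This is a generic illustration under assumed details, not Luma's actual design; the dimensions, encoder choices, and the `JointMultimodalModel` name are all hypothetical.

```python
# Illustrative sketch of joint multimodal training (assumed details,
# not Luma's actual architecture).
import torch
import torch.nn as nn

D = 256  # shared embedding width (hypothetical)

class JointMultimodalModel(nn.Module):
    def __init__(self):
        super().__init__()
        # Per-modality projections into the shared space.
        self.text_proj = nn.Embedding(32000, D)   # discrete text tokens
        self.audio_proj = nn.Linear(80, D)         # e.g. mel-spectrogram frames
        self.video_proj = nn.Linear(768, D)        # e.g. per-frame patch features
        layer = nn.TransformerEncoderLayer(d_model=D, nhead=8, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=4)

    def forward(self, text_ids, audio_frames, video_frames):
        # Embed each modality, then model one joint sequence.
        seq = torch.cat([
            self.text_proj(text_ids),
            self.audio_proj(audio_frames),
            self.video_proj(video_frames),
        ], dim=1)
        return self.backbone(seq)

model = JointMultimodalModel()
out = model(
    torch.randint(0, 32000, (2, 16)),  # batch of text token ids
    torch.randn(2, 32, 80),            # batch of audio frames
    torch.randn(2, 8, 768),            # batch of video frame features
)
print(out.shape)  # torch.Size([2, 56, 256])
```

The point of the sketch is that one backbone sees all modalities in a single sequence, rather than language being trained first and other modalities bolted on afterward.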
Continuous vs Discrete Modalities
- Language models operate on discrete tokens, whereas real-world signals like video and audio require continuous representations.
- Luma's new architecture uses a continuous latent space for joint multimodal modeling, enabling reasoning over diverse data (see the toy contrast below).
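A toy contrast makes the discrete/continuous distinction concrete: text is looked up in a finite codebook, while an encoder can map video or audio into a latent space where any real-valued vector is a valid code. The encoder shape and sizes below are illustrative assumptions; the specifics of Luma's latent design are not described in the episode.

```python
# Toy contrast between discrete and continuous representations
# (illustrative assumptions, not Luma's actual latent design).
import torch
import torch.nn as nn

# Language: each position snaps to one of a finite set of codebook rows.
vocab = nn.Embedding(32000, 256)
token_ids = torch.randint(0, 32000, (1, 16))
discrete_repr = vocab(token_ids)  # (1, 16, 256), drawn from 32k fixed vectors

# Video/audio: an encoder maps the raw signal into a continuous latent,
# so the representation is not restricted to a finite symbol set.
encoder = nn.Sequential(
    nn.Linear(3 * 16 * 16, 512),  # hypothetical: flattened 16x16 RGB patches
    nn.GELU(),
    nn.Linear(512, 256),
)
patches = torch.randn(1, 64, 3 * 16 * 16)
continuous_latent = encoder(patches)  # (1, 64, 256), any point in R^256
```

A model trained jointly over both kinds of input can then operate in this shared continuous space instead of forcing every signal through a discrete token vocabulary.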