The TWIML AI Podcast (formerly This Week in Machine Learning & Artificial Intelligence)

Unifying Vision and Language Models with Mohit Bansal - #636

Jul 3, 2023
In this engaging discussion, Mohit Bansal, a Parker Professor and Director of the MURGe-Lab at UNC, dives into the unification of vision and language models. He highlights the benefits of shared knowledge in AI, introducing innovative models like UDOP and VL-T5 that achieve top results with fewer parameters. The conversation also tackles the challenges of evaluating generative AI, addressing biases and the importance of data efficiency. Mohit shares insights on balancing advancements in multimodal models with responsible usage and the future of explainability in AI.
INSIGHT

Unification in AI Models

  • Unified models offer shared knowledge, efficiency, and generalizability, and can potentially compose learned skills to handle unseen combinations of tasks.
  • Evaluating generative AI is challenging because outputs admit many valid variations, and metrics can be skewed by biases such as spurious correlations.
INSIGHT

Grounding in Multimodality

  • Grounding, i.e., connecting language to real-world entities, can improve language models' handling of coreference and their physical and temporal knowledge.
  • While large language models show impressive capabilities, grounding can still enhance data efficiency by leveraging information from other modalities.
ANECDOTE

VL-T5: Unifying Multimodal Tasks

  • VL-T5 treats diverse multimodal tasks as language generation, unifying visual question answering, image captioning, and visual grounding under one text-to-text interface (sketched below).
  • Casting answers as generated text lets the model handle rare categories and unseen answers while achieving parameter efficiency and knowledge sharing across tasks.
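A minimal sketch of that text-to-text unification pattern, assuming the Hugging Face transformers library: VL-T5 is not a stock transformers model and additionally feeds visual-region embeddings into its encoder, so plain T5 stands in here, and the task prefixes are illustrative of the paper's setup rather than a faithful reproduction.

```python
# Sketch only: plain T5 as a stand-in for VL-T5, which additionally
# prepends visual-region embeddings to the encoder input. The point is
# the interface: every task is text in, text out, selected by a prefix.
from transformers import T5TokenizerFast, T5ForConditionalGeneration

tokenizer = T5TokenizerFast.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

# Each multimodal task becomes the same generation problem, distinguished
# only by a task prefix (prefixes follow the spirit of VL-T5's setup).
examples = [
    "vqa: what color is the dog?",           # visual question answering
    "caption:",                              # image captioning
    "visual grounding: the dog on the left", # referring-expression grounding
]

for text in examples:
    inputs = tokenizer(text, return_tensors="pt")
    # One shared encoder-decoder generates the answer, caption, or region
    # label as plain text, so rare categories and unseen answers are just
    # new token sequences rather than new classifier heads.
    output_ids = model.generate(**inputs, max_new_tokens=16)
    print(text, "->", tokenizer.decode(output_ids[0], skip_special_tokens=True))
```

Because the output space is open-ended text rather than a fixed label set, adding a new task amounts to choosing a prefix and a target format, which is where the parameter efficiency and cross-task knowledge sharing described above come from.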