The TWIML AI Podcast (formerly This Week in Machine Learning & Artificial Intelligence)

Unifying Vision and Language Models with Mohit Bansal - #636

Jul 3, 2023
In this engaging discussion, Mohit Bansal, a Parker Professor and Director of the MURGe-Lab at UNC, dives into the unification of vision and language models. He highlights the benefits of shared knowledge in AI, introducing innovative models like UDOP and VL-T5 that achieve top results with fewer parameters. The conversation also tackles the challenges of evaluating generative AI, addressing biases and the importance of data efficiency. Mohit shares insights on balancing advancements in multimodal models with responsible usage and the future of explainability in AI.
INSIGHT

Unification in AI Models

  • Unified models offer shared knowledge, efficiency, and generalizability, and can potentially compose learned skills to handle unseen combinations of tasks.
  • Evaluating generative AI is challenging because outputs admit many valid variations, and metrics can be skewed by biases such as spurious correlations.
INSIGHT

Grounding in Multimodality

  • Grounding, i.e., connecting language to real-world entities, can improve language models' handling of coreference and their physical and temporal knowledge.
  • While large language models show impressive capabilities, grounding can still enhance data efficiency by leveraging information from other modalities.
ANECDOTE

VL-T5: Unifying Multimodal Tasks

  • VL-T5 treats diverse multimodal tasks as language generation, unifying visual question answering, image captioning, and visual grounding under one text-to-text interface (sketched below).
  • Casting answers as generated text lets the model handle rare categories and unseen answers while achieving parameter efficiency and knowledge sharing across tasks.
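A minimal sketch of that text-to-text unification pattern, assuming the Hugging Face transformers library: VL-T5 is not a stock transformers model and additionally feeds visual-region embeddings into its encoder, so plain T5 stands in here, and the task prefixes are illustrative of the paper's setup rather than a faithful reproduction.

```python
# Sketch only: plain T5 as a stand-in for VL-T5, which additionally
# prepends visual-region embeddings to the encoder input. The point is
# the interface: every task is text in, text out, selected by a prefix.
from transformers import T5TokenizerFast, T5ForConditionalGeneration

tokenizer = T5TokenizerFast.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

# Each multimodal task becomes the same generation problem, distinguished
# only by a task prefix (prefixes follow the spirit of VL-T5's setup).
examples = [
    "vqa: what color is the dog?",           # visual question answering
    "caption:",                              # image captioning
    "visual grounding: the dog on the left", # referring-expression grounding
]

for text in examples:
    inputs = tokenizer(text, return_tensors="pt")
    # One shared encoder-decoder generates the answer, caption, or region
    # label as plain text, so rare categories and unseen answers are just
    # new token sequences rather than new classifier heads.
    output_ids = model.generate(**inputs, max_new_tokens=16)
    print(text, "->", tokenizer.decode(output_ids[0], skip_special_tokens=True))
```

Because the output space is open-ended text rather than a fixed label set, adding a new task amounts to choosing a prefix and a target format, which is where the parameter efficiency and cross-task knowledge sharing described above come from.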