

Eyes Wide Shut? Exploring the Visual Shortcomings of Multimodal LLMs
Jan 26, 2024
This episode explores the visual shortcomings of multimodal large language models (MLLMs) and introduces the concept of CLIP-blind pairs: image pairs that CLIP embeds similarly despite clear visual differences. It presents the MMVP benchmark, built from such pairs to analyze systematic visual patterns that CLIP models fail on, and shows how those failures correlate with errors in multimodal LLMs. The discussion also covers the limitations of MLLMs on simple visual questions and proposes mixture-of-features approaches to improve visual grounding without compromising instruction following. A list of references on topics such as self-supervised learning and transformer-based image recognition is included.
Chapters
Introduction
00:00 • 2min
Challenges of Multimodal Language Models and CLIP-Blind Pairs
02:25 • 9min
The MMVP Benchmark and Visual Pattern Recognition
11:46 • 10min
Limitations of Multimodal Language Models
21:46 • 5min
Relevant References on Various Topics
27:05 • 2min
References and Applications of Multimodal Large Language Models in Visual Understanding
29:20 • 5min