

Eyes Wide Shut? Exploring the Visual Shortcomings of Multimodal LLMs
Jan 26, 2024
This episode explores the visual shortcomings of multimodal large language models (MLLMs) and introduces the concept of CLIP-blind pairs: image pairs that CLIP embeds similarly despite clear visual differences. It presents the MMVP benchmark, built from such pairs to analyze systematic visual patterns that CLIP models fail on, and shows how those failures correlate with errors in multimodal LLMs. The discussion also covers the limitations of MLLMs on simple visual questions and proposes mixture-of-features approaches to improve visual grounding without compromising instruction following. A list of references on topics such as self-supervised learning and transformer-based image recognition is included.
Chapters
Introduction
00:00 • 2min
Challenges of Multimodal Language Models and CLIP-Blind Pairs
02:25 • 9min
The MMVP Benchmark and Visual Pattern Recognition
11:46 • 10min
Limitations of Multimodal Language Models
21:46 • 5min
Relevant References on Various Topics
27:05 • 2min
References and Applications of Multimodal Large Language Models in Visual Understanding
29:20 • 5min