The TWIML AI Podcast (formerly This Week in Machine Learning & Artificial Intelligence)

Why Vision Language Models Ignore What They See with Munawar Hayat - #758

Dec 9, 2025
Munawar Hayat, a researcher at Qualcomm AI Research specializing in multimodal generative AI, dives into the intricacies of Vision-Language Models (VLMs). He discusses the puzzling issue of object hallucination, revealing why these models often overlook visual elements in favor of language. Munawar also introduces attention-guided alignment techniques and a novel approach to generalized contrastive learning for efficient multi-modal retrieval. He shares insights on the Multi-Human Testbench designed to tackle identity leakage challenges in generative models, bringing clarity to this evolving field.
INSIGHT

Language Priors Overshadow Vision

  • When vision and language are combined, language priors often dominate and vision signals get ignored.
  • This leads models to answer from parametric language memory rather than the actual image (see the probe sketch below).
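
To make this concrete, here is a minimal, hedged sketch of one way to probe for language-prior reliance: ask the same question with the real image and with a blank placeholder, and flag cases where the answer does not change. The `answer_fn` callable is a hypothetical stand-in for whatever VLM inference wrapper you use; this is an illustrative diagnostic, not a method described in the episode.

```python
from typing import Callable
from PIL import Image

def language_prior_probe(
    answer_fn: Callable[[Image.Image, str], str],  # hypothetical VLM wrapper: (image, question) -> answer
    image: Image.Image,
    question: str,
) -> dict:
    """Ask the same question with the real image and with a uniform gray placeholder.

    If the answer is identical either way, the model is likely answering from
    its language prior rather than from the visual input.
    """
    blank = Image.new("RGB", image.size, (127, 127, 127))  # image-ablated control
    answer_real = answer_fn(image, question)
    answer_blank = answer_fn(blank, question)
    return {
        "answer_with_image": answer_real,
        "answer_without_image": answer_blank,
        "likely_language_prior": answer_real.strip().lower() == answer_blank.strip().lower(),
    }
```

Running this over a set of question–image pairs gives a rough per-sample signal of when the vision pathway is being ignored.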
INSIGHT

Physics Consistency Is Largely Missing

  • Foundation generative models often break simple physical consistency, such as preserving object shape or size, when asked to modify scenes.
  • Munawar Hayat argues that physics-aware training and descriptive prompts help preserve real-world physical properties.
INSIGHT

Benchmarks Can Hide Vision Failures

  • Many VLM benchmarks let models answer correctly from language alone, masking vision failures.
  • Munawar highlights the need for vision-centric benchmarks that require actually inspecting the image (see the blind-baseline sketch below).
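
As a hedged illustration of why vision-centric benchmarks matter, the sketch below scores a VQA-style benchmark with the image withheld entirely. A blind-baseline accuracy close to the full-model score suggests the benchmark can be solved from language priors alone. The item schema (`question`, `answer`) and the `answer_fn` wrapper are assumptions for illustration, not a specific benchmark or evaluation protocol from the episode.

```python
from typing import Callable, Iterable, Optional
from PIL import Image

def blind_baseline_accuracy(
    answer_fn: Callable[[Optional[Image.Image], str], str],  # hypothetical wrapper; accepts image=None
    benchmark: Iterable[dict],  # assumed item schema: {"question": str, "answer": str}
) -> float:
    """Accuracy when every question is answered without seeing the image.

    If this blind baseline approaches the model's normal score, the benchmark
    is not really testing the vision pathway.
    """
    items = list(benchmark)
    correct = 0
    for item in items:
        prediction = answer_fn(None, item["question"])  # no image provided
        correct += int(prediction.strip().lower() == item["answer"].strip().lower())
    return correct / max(len(items), 1)
```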