The TWIML AI Podcast (formerly This Week in Machine Learning & Artificial Intelligence)

Why Vision Language Models Ignore What They See with Munawar Hayat - #758

Dec 9, 2025
Munawar Hayat, a researcher at Qualcomm AI Research specializing in multimodal generative AI, dives into the intricacies of Vision-Language Models (VLMs). He discusses the puzzling issue of object hallucination, revealing why these models often overlook visual elements in favor of language. Munawar also introduces attention-guided alignment techniques and a novel approach to generalized contrastive learning for efficient multi-modal retrieval. He shares insights on the Multi-Human Testbench designed to tackle identity leakage challenges in generative models, bringing clarity to this evolving field.
INSIGHT

Language Priors Overshadow Vision

  • When vision and language are combined, language priors often dominate and vision signals get ignored.
  • This leads models to answer from parametric language memory rather than the actual image (see the probe sketch below).
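
To make this concrete, here is a minimal, hedged sketch of one way to probe for language-prior reliance: ask the same question with the real image and with a blank placeholder, and flag cases where the answer does not change. The `answer_fn` callable is a hypothetical stand-in for whatever VLM inference wrapper you use; this is an illustrative diagnostic, not a method described in the episode.

```python
from typing import Callable
from PIL import Image

def language_prior_probe(
    answer_fn: Callable[[Image.Image, str], str],  # hypothetical VLM wrapper: (image, question) -> answer
    image: Image.Image,
    question: str,
) -> dict:
    """Ask the same question with the real image and with a uniform gray placeholder.

    If the answer is identical either way, the model is likely answering from
    its language prior rather than from the visual input.
    """
    blank = Image.new("RGB", image.size, (127, 127, 127))  # image-ablated control
    answer_real = answer_fn(image, question)
    answer_blank = answer_fn(blank, question)
    return {
        "answer_with_image": answer_real,
        "answer_without_image": answer_blank,
        "likely_language_prior": answer_real.strip().lower() == answer_blank.strip().lower(),
    }
```

Running this over a set of question–image pairs gives a rough per-sample signal of when the vision pathway is being ignored.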
INSIGHT

Physics Consistency Is Largely Missing

  • Foundation generative models often break simple physical consistency, such as preserving object shape or size, when asked to modify scenes.
  • Munawar Hayat argues that physics-aware training and descriptive prompts help preserve real-world physical properties.
INSIGHT

Benchmarks Can Hide Vision Failures

  • Many VLM benchmarks let models answer correctly from language alone, masking vision failures.
  • Munawar highlights the need for vision-centric benchmarks that require actually inspecting the image (see the blind-baseline sketch below).
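
As a hedged illustration of why vision-centric benchmarks matter, the sketch below scores a VQA-style benchmark with the image withheld entirely. A blind-baseline accuracy close to the full-model score suggests the benchmark can be solved from language priors alone. The item schema (`question`, `answer`) and the `answer_fn` wrapper are assumptions for illustration, not a specific benchmark or evaluation protocol from the episode.

```python
from typing import Callable, Iterable, Optional
from PIL import Image

def blind_baseline_accuracy(
    answer_fn: Callable[[Optional[Image.Image], str], str],  # hypothetical wrapper; accepts image=None
    benchmark: Iterable[dict],  # assumed item schema: {"question": str, "answer": str}
) -> float:
    """Accuracy when every question is answered without seeing the image.

    If this blind baseline approaches the model's normal score, the benchmark
    is not really testing the vision pathway.
    """
    items = list(benchmark)
    correct = 0
    for item in items:
        prediction = answer_fn(None, item["question"])  # no image provided
        correct += int(prediction.strip().lower() == item["answer"].strip().lower())
    return correct / max(len(items), 1)
```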