
The TWIML AI Podcast (formerly This Week in Machine Learning & Artificial Intelligence)
Why Vision Language Models Ignore What They See with Munawar Hayat - #758
Dec 9, 2025
Munawar Hayat, a researcher at Qualcomm AI Research specializing in multimodal generative AI, dives into the intricacies of Vision-Language Models (VLMs). He discusses the puzzling issue of object hallucination, revealing why these models often overlook visual elements in favor of language. Munawar also introduces attention-guided alignment techniques and a novel approach to generalized contrastive learning for efficient multimodal retrieval. He shares insights on the Multi-Human Testbench, designed to tackle identity-leakage challenges in generative models, bringing clarity to this evolving field.
AI Snips
Language Priors Overshadow Vision
- When vision and language are combined, language priors often dominate and the visual signal gets ignored.
- This leads models to answer from parametric language memory rather than the actual image; a minimal probe for this behavior is sketched below.
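A quick way to see this failure mode is to check whether a model's answer changes when the image is replaced with an uninformative one. The sketch below assumes a hypothetical `query_vlm(image, question)` inference call standing in for whatever your VLM exposes; it is not part of any specific library.

```python
# Minimal sketch: probe whether a VLM's answer actually depends on the image.
# `query_vlm` is a hypothetical stand-in for your model's inference call (assumption).
from PIL import Image
import numpy as np


def query_vlm(image: Image.Image, question: str) -> str:
    """Hypothetical VLM call: returns the model's free-text answer."""
    raise NotImplementedError("plug in your model's inference code here")


def vision_dependence_probe(image: Image.Image, question: str) -> dict:
    # Answer with the real image.
    grounded = query_vlm(image, question)
    # Answer with an uninformative gray image of the same size.
    blank = Image.fromarray(
        np.full((image.height, image.width, 3), 128, dtype=np.uint8)
    )
    ungrounded = query_vlm(blank, question)
    # If the two answers match, the model likely answered from its
    # language priors rather than from the visual evidence.
    return {
        "grounded_answer": grounded,
        "ungrounded_answer": ungrounded,
        "likely_language_prior": grounded.strip().lower() == ungrounded.strip().lower(),
    }
```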
Physics Consistency Is Largely Missing
- Foundation generative models often break simple physical consistency, such as preserving object shape or size, when asked to modify scenes.
- Munawar Hayat argues that physics-aware training and descriptive prompts help preserve real-world properties.
Benchmarks Can Hide Vision Failures
- Many VLM benchmarks let models answer correctly from language alone, masking vision failures.
- Munawar highlights the need for vision-centric benchmarks that genuinely require inspecting the image; a blind-accuracy check like the one sketched below can expose how much a benchmark gives away through language alone.
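One common diagnostic is to measure how much of a VQA-style benchmark a text-only model can solve without ever seeing the image. The sketch below assumes a dataset of `{"question": ..., "answer": ...}` dicts and a hypothetical `query_text_only` callable for a language-only model; both are assumptions for illustration.

```python
# Minimal sketch: estimate a benchmark's "blind" solvability. A high blind
# accuracy means the benchmark does not force models to use vision.
from typing import Callable, Iterable


def blind_accuracy(
    dataset: Iterable[dict],
    query_text_only: Callable[[str], str],
) -> float:
    """Fraction of questions a text-only model answers correctly (no image given)."""
    total = 0
    correct = 0
    for example in dataset:
        prediction = query_text_only(example["question"])
        total += 1
        if prediction.strip().lower() == example["answer"].strip().lower():
            correct += 1
    return correct / max(total, 1)
```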

