"The Cognitive Revolution" | AI Builders, Researchers, and Live Player Analysis

Teaching AI to See: A Technical Deep-Dive on Vision Language Models with Will Hardman of Veratai

Jan 3, 2025
Will Hardman, founder of the AI advisory firm Veratai, delves into the intricacies of vision language models (VLMs). He discusses their evolution from traditional techniques to cutting-edge architectures like InternVL and Llama3V. The conversation highlights the importance of multimodality in AI, detailing innovations, architectural choices, and implications for artificial general intelligence. Hardman elaborates on the challenges of image processing, the significance of high-quality datasets, and emerging strategies that enhance VLM performance and reasoning capabilities.
ANECDOTE

Early CLIP at Waymark

  • Nathan Labenz recalls using CLIP at Waymark to select images for small-business videos based on script narratives.
  • Early CLIP lacked aesthetic understanding, prioritizing textual content over visual quality, which sometimes led to undesirable image selections (a code sketch of this kind of CLIP-based selection follows below).
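
A minimal sketch of the kind of CLIP-based image selection described in this anecdote, using the Hugging Face transformers CLIP API. The checkpoint name, script line, and image paths are illustrative assumptions, not details from the episode:

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Hypothetical script line and candidate stock images.
script_line = "A family-owned bakery serving fresh bread every morning"
paths = ["bakery_interior.jpg", "storefront.jpg", "bread_closeup.jpg"]
images = [Image.open(p) for p in paths]

# Embed the text once against every candidate image.
inputs = processor(text=[script_line], images=images,
                   return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# logits_per_image holds image-text similarity logits, one row per image.
scores = outputs.logits_per_image.squeeze(-1)
best = scores.argmax().item()
print(f"Best match for the script line: {paths[best]} "
      f"(score {scores[best].item():.2f})")
```

This also exposes the failure mode from the anecdote: an image that literally contains matching text (say, a sign reading "fresh bread") can outscore a more attractive photo, because CLIP rewards textual correspondence over aesthetics.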
INSIGHT

Data Quality in VLMs

  • A key trend in Vision Language Model (VLM) development is an increasing focus on data-quality filtering.
  • Filtering addresses the noisy web-scale image-text pairs that hinder VLM performance, especially the alignment of visual and textual information (see the filtering sketch below).
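
One common implementation of this kind of filtering is CLIP-score thresholding: drop any image-caption pair whose embeddings disagree. A hedged sketch, assuming a pairs list of (PIL image, caption) tuples; the 0.28 threshold echoes the cutoff popularized by LAION-style filtering and is tuned per dataset in practice:

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_score(image: Image.Image, caption: str) -> float:
    """Cosine similarity between CLIP's image and text embeddings."""
    inputs = processor(text=[caption], images=[image],
                       return_tensors="pt", padding=True, truncation=True)
    with torch.no_grad():
        img_emb = model.get_image_features(pixel_values=inputs["pixel_values"])
        txt_emb = model.get_text_features(input_ids=inputs["input_ids"],
                                          attention_mask=inputs["attention_mask"])
    # Normalize so the dot product is a cosine similarity in [-1, 1].
    img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)
    txt_emb = txt_emb / txt_emb.norm(dim=-1, keepdim=True)
    return (img_emb @ txt_emb.T).item()

def filter_pairs(pairs, threshold=0.28):
    """Keep only image-caption pairs whose modalities actually agree."""
    return [(img, cap) for img, cap in pairs if clip_score(img, cap) >= threshold]
```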
ADVICE

Synthetic Instruction Tuning Data

  • Use strong language models like GPT-4 with clever prompting to generate synthetic instruction-tuning data.
  • This approach, pioneered by the LLaVA model, improves VLM performance on diverse tasks by generating conversations between a questioner and a vision assistant (see the sketch below).
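
A minimal sketch of this recipe in the spirit of LLaVA's published approach, where a text-only LLM is given an image's captions (and, in the original work, bounding boxes) and asked to write a conversation as if it could see the image. The model name, prompt wording, and captions here are illustrative assumptions:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

SYSTEM = (
    "You are generating training data for a vision assistant. "
    "Given image captions, write a realistic multi-turn conversation "
    "between a user asking about the image and an assistant answering, "
    "as if the assistant can see the image. Only state facts implied "
    "by the captions."
)

# Hypothetical captions standing in for an annotated training image.
captions = [
    "A man in a red jacket rides a bicycle across a rainy city street.",
    "Cars wait at a traffic light in the background.",
]

response = client.chat.completions.create(
    model="gpt-4o",  # stand-in for "a strong language model"
    messages=[
        {"role": "system", "content": SYSTEM},
        {"role": "user", "content": "Captions:\n" + "\n".join(captions)},
    ],
)
print(response.choices[0].message.content)
```

LLaVA's published recipe used prompting like this to produce three data types, multi-turn conversations, detailed descriptions, and complex-reasoning Q&A, grounding the generator only in text so the resulting dialogue stays faithful to the image.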