"The Cognitive Revolution" | AI Builders, Researchers, and Live Player Analysis

Teaching AI to See: A Technical Deep-Dive on Vision Language Models with Will Hardman of Veratai

Jan 3, 2025
Will Hardman, founder of the AI advisory firm Veratai, delves into the intricacies of vision language models (VLMs). He discusses their evolution from traditional techniques to cutting-edge architectures like InternVL and Llama3V. The conversation highlights the importance of multimodality in AI, detailing innovations, architectural choices, and implications for artificial general intelligence. Hardman elaborates on the challenges of image processing, the significance of high-quality datasets, and emerging strategies that enhance VLM performance and reasoning capabilities.
ANECDOTE

Early CLIP at Waymark

  • Nathan Labenz recalls using CLIP at Waymark to select images for small-business videos based on script narratives.
  • Early CLIP lacked aesthetic understanding, prioritizing textual content over visual quality, which sometimes led to undesirable image selections (a code sketch of this kind of CLIP-based selection follows below).
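
A minimal sketch of the kind of CLIP-based image selection described in this anecdote, using the Hugging Face transformers CLIP API. The checkpoint name, script line, and image paths are illustrative assumptions, not details from the episode:

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Hypothetical script line and candidate stock images.
script_line = "A family-owned bakery serving fresh bread every morning"
paths = ["bakery_interior.jpg", "storefront.jpg", "bread_closeup.jpg"]
images = [Image.open(p) for p in paths]

# Embed the text once against every candidate image.
inputs = processor(text=[script_line], images=images,
                   return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# logits_per_image holds image-text similarity logits, one row per image.
scores = outputs.logits_per_image.squeeze(-1)
best = scores.argmax().item()
print(f"Best match for the script line: {paths[best]} "
      f"(score {scores[best].item():.2f})")
```

This also exposes the failure mode from the anecdote: an image that literally contains matching text (say, a sign reading "fresh bread") can outscore a more attractive photo, because CLIP rewards textual correspondence over aesthetics.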
INSIGHT

Data Quality in VLMs

  • A key trend in Vision Language Model (VLM) development is an increasing focus on data-quality filtering.
  • Filtering addresses the noisy web-scale image-text pairs that hinder VLM performance, especially the alignment of visual and textual information (see the filtering sketch below).
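
One common implementation of this kind of filtering is CLIP-score thresholding: drop any image-caption pair whose embeddings disagree. A hedged sketch, assuming a pairs list of (PIL image, caption) tuples; the 0.28 threshold echoes the cutoff popularized by LAION-style filtering and is tuned per dataset in practice:

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_score(image: Image.Image, caption: str) -> float:
    """Cosine similarity between CLIP's image and text embeddings."""
    inputs = processor(text=[caption], images=[image],
                       return_tensors="pt", padding=True, truncation=True)
    with torch.no_grad():
        img_emb = model.get_image_features(pixel_values=inputs["pixel_values"])
        txt_emb = model.get_text_features(input_ids=inputs["input_ids"],
                                          attention_mask=inputs["attention_mask"])
    # Normalize so the dot product is a cosine similarity in [-1, 1].
    img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)
    txt_emb = txt_emb / txt_emb.norm(dim=-1, keepdim=True)
    return (img_emb @ txt_emb.T).item()

def filter_pairs(pairs, threshold=0.28):
    """Keep only image-caption pairs whose modalities actually agree."""
    return [(img, cap) for img, cap in pairs if clip_score(img, cap) >= threshold]
```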
ADVICE

Synthetic Instruction Tuning Data

  • Use strong language models like GPT-4 with clever prompting to generate synthetic instruction-tuning data.
  • This approach, pioneered by the LLaVA model, improves VLM performance on diverse tasks by generating conversations between a questioner and a vision assistant (see the sketch below).
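
A minimal sketch of this recipe in the spirit of LLaVA's published approach, where a text-only LLM is given an image's captions (and, in the original work, bounding boxes) and asked to write a conversation as if it could see the image. The model name, prompt wording, and captions here are illustrative assumptions:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

SYSTEM = (
    "You are generating training data for a vision assistant. "
    "Given image captions, write a realistic multi-turn conversation "
    "between a user asking about the image and an assistant answering, "
    "as if the assistant can see the image. Only state facts implied "
    "by the captions."
)

# Hypothetical captions standing in for an annotated training image.
captions = [
    "A man in a red jacket rides a bicycle across a rainy city street.",
    "Cars wait at a traffic light in the background.",
]

response = client.chat.completions.create(
    model="gpt-4o",  # stand-in for "a strong language model"
    messages=[
        {"role": "system", "content": SYSTEM},
        {"role": "user", "content": "Captions:\n" + "\n".join(captions)},
    ],
)
print(response.choices[0].message.content)
```

LLaVA's published recipe used prompting like this to produce three data types, multi-turn conversations, detailed descriptions, and complex-reasoning Q&A, grounding the generator only in text so the resulting dialogue stays faithful to the image.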