The TWIML AI Podcast (formerly This Week in Machine Learning & Artificial Intelligence)

Learning Visiolinguistic Representations with ViLBERT w/ Stefan Lee - #358

Mar 18, 2020
Stefan Lee, an assistant professor at Oregon State University, discusses his work on ViLBERT, a model for learning joint visiolinguistic representations. He explains how the model combines vision and language to tackle challenges like visual question answering, and describes the PyRobot platform, which supports robotic navigation guided by natural language. The conversation also covers why grounded connections between modalities matter, as well as the implications of this research for visually impaired individuals.
INSIGHT

ViLBERT's Goal: Visual Grounding

  • ViLBERT aims to connect visual and linguistic representations of concepts, like "wolf head".
  • Machines don't automatically link these modalities, unlike humans, who instantly visualize the words they hear.
ANECDOTE

ViLBERT's Training Method

  • BERT models are pretrained with self-supervised tasks on large text corpora, masking words for the model to predict.
  • ViLBERT adapts this to vision-and-language by masking image regions and text tokens in image-text pairs from the Conceptual Captions dataset (sketched below).
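
To make that training recipe concrete, here is a minimal sketch of BERT-style random masking applied to both the caption tokens and the image-region features of an image-text pair. The function names, the 15% masking rate, and zeroing out region features are illustrative assumptions, not ViLBERT's exact implementation:

```python
import random
import numpy as np

MASK_PROB = 0.15  # assumed rate, following BERT's masking convention

def mask_tokens(tokens, mask_token="[MASK]", prob=MASK_PROB):
    """Randomly hide caption tokens; the model must predict the originals."""
    masked, targets = [], []
    for tok in tokens:
        if random.random() < prob:
            masked.append(mask_token)
            targets.append(tok)    # loss is computed only at masked positions
        else:
            masked.append(tok)
            targets.append(None)   # ignored by the loss
    return masked, targets

def mask_regions(region_feats, prob=MASK_PROB):
    """Randomly zero out image-region features; the model must recover
    each hidden region's semantics (a hypothetical masking scheme)."""
    feats = np.array(region_feats, dtype=float)   # copy so the input is untouched
    hidden = np.random.rand(len(feats)) < prob
    feats[hidden] = 0.0
    return feats, hidden

# One image-text pair, e.g. from a captioning corpus:
caption = "a wolf standing in the snow".split()
regions = np.random.rand(4, 2048)   # say, 4 detector regions with 2048-d features
masked_caption, token_targets = mask_tokens(caption)
masked_regions, region_mask = mask_regions(regions)
```

Both streams are masked in the same pass, so the model can lean on the unmasked caption words to reconstruct hidden regions and vice versa, which is where the cross-modal grounding signal comes from.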
INSIGHT

Random Masking Rationale

  • Random image masking in ViLBERT is blunt but necessary, because the model starts with no grounding knowledge.
  • Precise, targeted masking would require pre-existing grounding, which is exactly what ViLBERT aims to learn (see the sketch after this list).
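
The circularity the snip describes is easy to see in code: targeted masking would need a word-to-region alignment, and that alignment is the grounding ViLBERT is trying to learn in the first place. A hypothetical sketch (the `alignment` oracle below does not exist at pretraining time):

```python
import numpy as np

def mask_region_for_word(word, regions, alignment):
    """Targeted masking: hide exactly the region that depicts `word`.

    `alignment` is a hypothetical word -> region-index map. Building that
    map IS the grounding problem, so it is unavailable before pretraining,
    which is why ViLBERT falls back on blunt random masking instead.
    """
    masked = regions.copy()
    masked[alignment[word]] = 0.0   # requires already knowing which region shows the word
    return masked

# Only works with an oracle alignment nobody has at pretraining time:
regions = np.random.rand(4, 2048)
masked = mask_region_for_word("wolf", regions, alignment={"wolf": 2})
```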