The TWIML AI Podcast (formerly This Week in Machine Learning & Artificial Intelligence)

Learning Visiolinguistic Representations with ViLBERT w/ Stefan Lee - #358

Mar 18, 2020
Stefan Lee, an assistant professor at Oregon State University, discusses his work on ViLBERT, a model for learning joint visiolinguistic representations. He explains how the model combines vision and language to tackle challenges like visual question answering, and describes the PyRobot platform, which supports robotic navigation guided by natural language. The conversation also covers why grounded connections between modalities matter, as well as the implications of this research for visually impaired individuals.
INSIGHT

ViLBERT's Goal: Visual Grounding

  • ViLBERT aims to connect visual and linguistic representations of concepts, like "wolf head".
  • Machines don't automatically link these modalities, unlike humans, who instantly visualize the words they hear.
ANECDOTE

ViLBERT's Training Method

  • BERT models are pretrained with self-supervised tasks on large text corpora, masking words for the model to predict.
  • ViLBERT adapts this to vision-and-language by masking image regions and text tokens in image-text pairs from the Conceptual Captions dataset (sketched below).
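
To make that training recipe concrete, here is a minimal sketch of BERT-style random masking applied to both the caption tokens and the image-region features of an image-text pair. The function names, the 15% masking rate, and zeroing out region features are illustrative assumptions, not ViLBERT's exact implementation:

```python
import random
import numpy as np

MASK_PROB = 0.15  # assumed rate, following BERT's masking convention

def mask_tokens(tokens, mask_token="[MASK]", prob=MASK_PROB):
    """Randomly hide caption tokens; the model must predict the originals."""
    masked, targets = [], []
    for tok in tokens:
        if random.random() < prob:
            masked.append(mask_token)
            targets.append(tok)    # loss is computed only at masked positions
        else:
            masked.append(tok)
            targets.append(None)   # ignored by the loss
    return masked, targets

def mask_regions(region_feats, prob=MASK_PROB):
    """Randomly zero out image-region features; the model must recover
    each hidden region's semantics (a hypothetical masking scheme)."""
    feats = np.array(region_feats, dtype=float)   # copy so the input is untouched
    hidden = np.random.rand(len(feats)) < prob
    feats[hidden] = 0.0
    return feats, hidden

# One image-text pair, e.g. from a captioning corpus:
caption = "a wolf standing in the snow".split()
regions = np.random.rand(4, 2048)   # say, 4 detector regions with 2048-d features
masked_caption, token_targets = mask_tokens(caption)
masked_regions, region_mask = mask_regions(regions)
```

Both streams are masked in the same pass, so the model can lean on the unmasked caption words to reconstruct hidden regions and vice versa, which is where the cross-modal grounding signal comes from.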
INSIGHT

Random Masking Rationale

  • Random image masking in ViLBERT is blunt but necessary, because the model starts with no grounding knowledge.
  • Precise, targeted masking would require pre-existing grounding, which is exactly what ViLBERT aims to learn (see the sketch after this list).
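
The circularity the snip describes is easy to see in code: targeted masking would need a word-to-region alignment, and that alignment is the grounding ViLBERT is trying to learn in the first place. A hypothetical sketch (the `alignment` oracle below does not exist at pretraining time):

```python
import numpy as np

def mask_region_for_word(word, regions, alignment):
    """Targeted masking: hide exactly the region that depicts `word`.

    `alignment` is a hypothetical word -> region-index map. Building that
    map IS the grounding problem, so it is unavailable before pretraining,
    which is why ViLBERT falls back on blunt random masking instead.
    """
    masked = regions.copy()
    masked[alignment[word]] = 0.0   # requires already knowing which region shows the word
    return masked

# Only works with an oracle alignment nobody has at pretraining time:
regions = np.random.rand(4, 2048)
masked = mask_region_for_word("wolf", regions, alignment={"wolf": 2})
```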