
Learning Visiolinguistic Representations with ViLBERT w/ Stefan Lee - #358
The TWIML AI Podcast (formerly This Week in Machine Learning & Artificial Intelligence)
00:00
Bridging Vision and Language
This chapter explores bridging vision and language through self-supervised pretraining on large datasets, in particular the Conceptual Captions dataset. It discusses the challenges of training a BERT-like model on multimodal input, including how to mask image regions, and the co-attentional mechanism that lets visual and textual representations interact.
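To make the co-attention idea concrete, here is a minimal sketch of a single co-attentional layer in PyTorch: each stream uses its own representations as queries and the other stream's representations as keys and values. The class name, dimensions, and overall layout are illustrative assumptions, not ViLBERT's actual implementation, which also uses feed-forward sublayers and different hidden sizes per stream.

```python
import torch
import torch.nn as nn

class CoAttentionBlock(nn.Module):
    """Sketch of one co-attentional layer (hypothetical, simplified)."""

    def __init__(self, dim: int = 768, num_heads: int = 8):
        super().__init__()
        # Visual stream attends over the language stream, and vice versa.
        self.vis_attends_lang = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.lang_attends_vis = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm_v = nn.LayerNorm(dim)
        self.norm_l = nn.LayerNorm(dim)

    def forward(self, visual: torch.Tensor, language: torch.Tensor):
        # Queries come from one modality; keys and values come from the other.
        v_out, _ = self.vis_attends_lang(query=visual, key=language, value=language)
        l_out, _ = self.lang_attends_vis(query=language, key=visual, value=visual)
        # Residual connection plus layer norm, as in a standard transformer block.
        visual = self.norm_v(visual + v_out)
        language = self.norm_l(language + l_out)
        return visual, language

# Example: 36 image-region features and 20 token embeddings for a batch of 2.
regions = torch.randn(2, 36, 768)
tokens = torch.randn(2, 20, 768)
regions, tokens = CoAttentionBlock()(regions, tokens)
```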