Bootstrapping Method for Vision-Language Understanding

This chapter covers a bootstrapping method for training a vision-language understanding model: captioners are trained to generate data and filters to clean it, so that large amounts of noisy data from the internet can be put to use. The speakers walk through the architecture and dataset used, and point to dynamic composition of modules and multitask pre-training as directions for future research.
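
The caption-then-filter idea can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not the speakers' implementation: `bootstrap_dataset`, `toy_captioner`, `toy_filter`, and the `TOY_CONTENT` lookup are all hypothetical names, and the captioner and filter here are simple string functions standing in for trained models.

```python
from typing import Callable, List, Tuple

Pair = Tuple[str, str]  # (image_id, caption)


def bootstrap_dataset(
    web_pairs: List[Pair],
    captioner: Callable[[str], str],
    keep: Callable[[str, str], bool],
) -> List[Pair]:
    """Return web captions and synthetic captions that pass the filter."""
    cleaned: List[Pair] = []
    for image_id, web_caption in web_pairs:
        # Keep the original web caption only if the filter accepts it.
        if keep(image_id, web_caption):
            cleaned.append((image_id, web_caption))
        # Generate a synthetic caption and run it through the same filter.
        synthetic = captioner(image_id)
        if synthetic != web_caption and keep(image_id, synthetic):
            cleaned.append((image_id, synthetic))
    return cleaned


# Hypothetical stand-ins: a real system would use trained models here.
TOY_CONTENT = {"img1": "dog", "img2": "cat"}


def toy_captioner(image_id: str) -> str:
    # Pretend the captioner recognizes what is in the image.
    return f"a photo of a {TOY_CONTENT[image_id]}"


def toy_filter(image_id: str, caption: str) -> bool:
    # Pretend the filter checks that the caption matches the image content.
    return TOY_CONTENT[image_id] in caption


web_pairs = [
    ("img1", "buy cheap shoes online"),  # noisy web text, unrelated to the image
    ("img2", "my cat on the sofa"),      # web text that does match the image
]
result = bootstrap_dataset(web_pairs, toy_captioner, toy_filter)
print(result)
```

In this sketch the unrelated web caption for `img1` is dropped and replaced by a synthetic one, while `img2` keeps both its web caption and a synthetic caption. The cleaned pairs would then serve as training data for the next round.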
