The Importance of Grounding in Multi-Modality
The first multimodal project that I worked on, 10 years ago, was grounding: can we improve coreference by actually trying to ground different words to bounding boxes in the 3D image, like cuboids, not boxes? So it was sort of bidirectional mutual improvement. And then six years later, we did this paper called Vokenization, which was basically the first attempt at taking something like BERT and combining it with masked language modeling objectives. We train a vokenizer on MS-COCO-style multimodal datasets, which have images and captions. Then you would basically take that visual mapper that's trained on MS-COCO and get all