Efficient Representation in Video and Language Models
Efficiency means different things for video models and language models, both in how humans perceive each medium and in engineering terms. A video can pack far more information for human learning than text, yet representing video in a model is currently cumbersome. The focus is on making video and language models more efficient by selecting specific keyframes and deciding which parts of a clip need full video versus audio alone. The emergence of transformers as a standard substrate for machine learning applications is enabling advances in multi-modality unification. The work on VLT5, published in 2021, started in early 2020 and explored different levels of unification in video and language models.
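The episode does not specify how keyframes are chosen; a minimal sketch of uniform keyframe subsampling, one common baseline for cutting the number of frames a model must represent, might look like this (the function name and interface are illustrative assumptions, not from the discussed work):

```python
def select_keyframes(num_frames: int, k: int) -> list[int]:
    """Pick k evenly spaced frame indices from a clip of num_frames.

    This is a hypothetical baseline: instead of encoding every frame,
    the model only encodes the returned indices, shrinking the input
    roughly by a factor of num_frames / k.
    """
    if k >= num_frames:
        # Clip is already short enough; keep every frame.
        return list(range(num_frames))
    step = num_frames / k  # fractional stride between kept frames
    return [int(i * step) for i in range(k)]


# Example: keep 4 keyframes out of a 10-frame clip.
print(select_keyframes(10, 4))  # [0, 2, 5, 7]
```

Smarter strategies (e.g. picking frames by scene-change scores) can replace the uniform stride without changing the interface, which is why downstream code typically depends only on the list of indices.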