
AI Breakdown

arXiv paper - Slow-Fast Architecture for Video Multi-Modal Large Language Models

Apr 7, 2025
05:24
In this episode, we discuss Slow-Fast Architecture for Video Multi-Modal Large Language Models by Min Shi, Shihao Wang, Chieh-Yun Chen, Jitesh Jain, Kai Wang, Junjun Xiong, Guilin Liu, Zhiding Yu, and Humphrey Shi. The paper presents a slow-fast architecture for video-based multi-modal large language models that uses two kinds of visual tokens to balance temporal resolution against spatial detail. "Fast" tokens provide a compressed overview of the video, while "slow" tokens deliver detailed, instruction-aware visual information, allowing the model to handle more frames with minimal extra computation. According to the paper's experiments, this approach significantly outperforms existing methods, increases the model's input capacity, and achieves state-of-the-art performance among models of similar size.
