
Mamba, Mamba-2 and Post-Transformer Architectures for Generative AI with Albert Gu - #693
The TWIML AI Podcast (formerly This Week in Machine Learning & Artificial Intelligence)
Autoregressive Modeling and Transformer KV Cache State Memory
Autoregressive modeling predicts the next word step by step using a state that summarizes the context seen so far. In a transformer, that state is the KV cache, which stores every previous token in the sequence so the model can attend back to any earlier word when predicting the next one. This exhaustive memory is indispensable for certain tasks, such as copying large blocks of text, but storing every detail is inefficient for most cases. Post-transformer architectures aim to retain the transformer's capabilities while improving efficiency by compressing the state into a fixed-size representation and defining a mechanism for updating it and drawing on it later.
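To make the contrast concrete, here is a minimal sketch in NumPy of the two memory strategies described above: a transformer-style KV cache whose state grows with sequence length, versus a fixed-size recurrently updated state in the spirit of post-transformer models like Mamba. The class names, the toy linear update rule, and all parameters are illustrative assumptions, not the actual Mamba implementation (real selective state-space models make the update input-dependent and far more structured).

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8  # toy embedding dimension


class KVCacheMemory:
    """Transformer-style memory: store every past token's key/value exactly.

    The state grows linearly with sequence length, which makes exact recall
    (e.g. copying long text) easy but storage and per-step cost expensive.
    """

    def __init__(self):
        self.keys = []
        self.values = []

    def update(self, k, v):
        self.keys.append(k)
        self.values.append(v)

    def read(self, query):
        # Attention over all stored positions: softmax(q . K^T / sqrt(d)) V
        K = np.stack(self.keys)    # (t, d) -- grows with sequence length t
        V = np.stack(self.values)  # (t, d)
        scores = K @ query / np.sqrt(d)
        weights = np.exp(scores - scores.max())
        weights /= weights.sum()
        return weights @ V


class CompressedStateMemory:
    """Post-transformer-style memory: a fixed-size state updated recurrently.

    Hypothetical linear update h <- a*h + outer(B x, x); the key idea is that
    the state size (and per-step cost) stays constant regardless of how long
    the sequence gets.
    """

    def __init__(self, n=16):
        self.h = np.zeros((n, d))  # fixed-size state, independent of length
        self.a = 0.9               # toy decay factor (how fast old context fades)
        self.B = rng.normal(size=(n, d)) / np.sqrt(d)

    def update(self, x):
        # Fold the new token into the fixed-size state, forgetting a little.
        self.h = self.a * self.h + np.outer(self.B @ x, x)

    def read(self, query):
        # Read out with a fixed-cost projection, regardless of history length.
        return self.h.T @ (self.B @ query)


# Feed the same toy token stream into both memories.
kv, ssm = KVCacheMemory(), CompressedStateMemory()
for _ in range(100):
    x = rng.normal(size=d)
    kv.update(x, x)  # keys == values for simplicity
    ssm.update(x)

q = rng.normal(size=d)
print("KV cache entries stored:", len(kv.keys))  # 100 -> grows with length
print("Compressed state size:  ", ssm.h.size)    # fixed, no matter the length
```

The trade-off the episode discusses falls out of this sketch: the KV cache can recover any past token exactly but pays memory proportional to sequence length, while the compressed state answers queries at constant cost but can only retain what its update rule chooses to keep.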