

Recurrence and Attention for Long-Context Transformers with Jacob Buckman - #750
Oct 7, 2025
Jacob Buckman, co-founder and CEO of Manifest AI, dives deep into long-context transformers. He discusses techniques like windowed attention and the Power Retention approach, which combines attention and recurrence for much faster long-context training. Buckman also shares insights on Manifest AI's open-source tools, Vidrial and PowerCoder, and explores how to measure the utility of additional context. Learn about the trade-off between state and weights in compute-optimal architectures and the future potential of context length in AI.
Context As An Independent Scaling Axis
- Context length is a distinct axis of scale that governs a model's ability to synthesize large inputs like long text or video.
- Improving context utilization during pretraining reduces the gap on downstream tasks when extra context is available.
In-Context Learning Curves Reveal Utility
- In-context learning curves, measured via negative log-likelihood (NLL), show that each additional token of context can yield log-linear improvements in prediction.
- Evaluating NLL during pretraining gives a clean way to quantify how much additional context helps prediction (see the sketch below).
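
As a rough illustration of this kind of measurement, here is a minimal sketch of an in-context learning curve: average per-token NLL as a function of context position, with a log-linear fit. The per-token losses below are synthetic stand-ins (not outputs from any model discussed in the episode); in practice they would be the model's -log p(x_t | x_<t) collected over held-out documents.

```python
# Minimal sketch: an in-context learning curve as average NLL vs. token position.
import numpy as np

rng = np.random.default_rng(0)
num_docs, seq_len = 256, 4096
positions = np.arange(1, seq_len + 1)

# Synthetic per-token NLL: a power-law decay in position plus noise, standing in
# for real per-token losses -log p(x_t | x_<t) from a pretrained model.
true_nll = 4.0 * positions ** -0.15
per_token_nll = true_nll + 0.05 * rng.standard_normal((num_docs, seq_len))

# In-context learning curve: mean NLL at each context position across documents.
curve = per_token_nll.mean(axis=0)

# Log-linear trend: log(NLL) ~ a + b * log(position).
b, a = np.polyfit(np.log(positions), np.log(curve), deg=1)
print(f"fitted log-log slope: {b:.3f}")  # more negative => extra context helps more
```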
Retention Connects Attention And Recurrence
- Many long-context architectures (Mamba, RetNet, SSMs) are structurally similar to Transformers, swapping in different time-mixing primitives.
- Retention unifies attention and recurrence by admitting both a recurrent form and an attention-equivalent form (see the sketch below).
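
To make the duality concrete, here is a minimal sketch using generic (ungated) linear attention rather than Manifest AI's Power Retention: the same time-mixing rule is computed once in an attention-equivalent parallel form and once as a recurrence over a fixed-size state, and the two outputs match.

```python
# Minimal sketch of the attention/recurrence duality behind retention-style layers.
# This is plain linear attention, used only to illustrate the two equivalent forms.
import numpy as np

rng = np.random.default_rng(0)
T, d = 8, 4                           # sequence length, head dimension
Q, K, V = (rng.standard_normal((T, d)) for _ in range(3))

# Attention-equivalent (parallel) form: causally masked QK^T scores, no softmax.
mask = np.tril(np.ones((T, T)))
out_parallel = (Q @ K.T * mask) @ V

# Recurrent form: carry a fixed-size d x d state instead of the whole history.
S = np.zeros((d, d))
out_recurrent = np.zeros((T, d))
for t in range(T):
    S = S + np.outer(K[t], V[t])      # fold token t into the state
    out_recurrent[t] = Q[t] @ S       # read out with the current query

print(np.allclose(out_parallel, out_recurrent))  # True: the two forms agree
```

The recurrent form keeps the state constant-size regardless of context length, while the parallel form maps onto hardware-efficient matrix multiplies during training, which is the trade-off retention-style layers exploit.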