

Recurrence and Attention for Long-Context Transformers with Jacob Buckman - #750
Oct 7, 2025
Jacob Buckman, co-founder and CEO of Manifest AI, dives deep into long-context transformers. He discusses techniques like windowed attention and the Power Retention approach, which combines attention and recurrence for much faster long-context training. Buckman also shares insights on Manifest AI's open-source tools, Vidrial and PowerCoder, and explores how to measure the utility of additional context. Learn about the trade-off between state and weights in compute-optimal architectures and the future potential of context length in AI.
Context As An Independent Scaling Axis
- Context length is a distinct axis of scale that governs a model's ability to synthesize large inputs like long text or video.
- Improving context utilization during pretraining reduces the gap on downstream tasks when extra context is available.
In-Context Learning Curves Reveal Utility
- In-context learning curves, measured via negative log-likelihood (NLL), show that each additional token of context can yield log-linear improvements in prediction.
- Evaluating NLL during pretraining gives a clean way to quantify how much additional context helps prediction (see the sketch below).
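
As a rough illustration of this kind of measurement, here is a minimal sketch of an in-context learning curve: average per-token NLL as a function of context position, with a log-linear fit. The per-token losses below are synthetic stand-ins (not outputs from any model discussed in the episode); in practice they would be the model's -log p(x_t | x_<t) collected over held-out documents.

```python
# Minimal sketch: an in-context learning curve as average NLL vs. token position.
import numpy as np

rng = np.random.default_rng(0)
num_docs, seq_len = 256, 4096
positions = np.arange(1, seq_len + 1)

# Synthetic per-token NLL: a power-law decay in position plus noise, standing in
# for real per-token losses -log p(x_t | x_<t) from a pretrained model.
true_nll = 4.0 * positions ** -0.15
per_token_nll = true_nll + 0.05 * rng.standard_normal((num_docs, seq_len))

# In-context learning curve: mean NLL at each context position across documents.
curve = per_token_nll.mean(axis=0)

# Log-linear trend: log(NLL) ~ a + b * log(position).
b, a = np.polyfit(np.log(positions), np.log(curve), deg=1)
print(f"fitted log-log slope: {b:.3f}")  # more negative => extra context helps more
```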
Retention Connects Attention And Recurrence
- Many long-context architectures (Mamba, RetNet, SSMs) are structurally similar to Transformers, swapping in different time-mixing primitives.
- Retention unifies attention and recurrence by admitting both a recurrent form and an attention-equivalent form (see the sketch below).
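
To make the duality concrete, here is a minimal sketch using generic (ungated) linear attention rather than Manifest AI's Power Retention: the same time-mixing rule is computed once in an attention-equivalent parallel form and once as a recurrence over a fixed-size state, and the two outputs match.

```python
# Minimal sketch of the attention/recurrence duality behind retention-style layers.
# This is plain linear attention, used only to illustrate the two equivalent forms.
import numpy as np

rng = np.random.default_rng(0)
T, d = 8, 4                           # sequence length, head dimension
Q, K, V = (rng.standard_normal((T, d)) for _ in range(3))

# Attention-equivalent (parallel) form: causally masked QK^T scores, no softmax.
mask = np.tril(np.ones((T, T)))
out_parallel = (Q @ K.T * mask) @ V

# Recurrent form: carry a fixed-size d x d state instead of the whole history.
S = np.zeros((d, d))
out_recurrent = np.zeros((T, d))
for t in range(T):
    S = S + np.outer(K[t], V[t])      # fold token t into the state
    out_recurrent[t] = Q[t] @ S       # read out with the current query

print(np.allclose(out_parallel, out_recurrent))  # True: the two forms agree
```

The recurrent form keeps the state constant-size regardless of context length, while the parallel form maps onto hardware-efficient matrix multiplies during training, which is the trade-off retention-style layers exploit.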