

The Information Bottleneck
Ravid Shwartz-Ziv & Allen Roush
Two AI researchers, Ravid Shwartz-Ziv and Allen Roush, discuss the latest trends, news, and research in generative AI, LLMs, GPUs, and cloud systems.
Episodes

Dec 1, 2025 • 1h 45min
EP18: AI Robotics
In this episode, we hosted Judah Goldfeder, a PhD candidate at Columbia University and student researcher at Google, to discuss robotics, reproducibility in ML, and smart buildings.

Key topics covered:

Robotics challenges: We discussed why robotics has proved harder than many expected, especially compared with LLMs. The real world is unpredictable and unforgiving, and mistakes have physical consequences. Sim-to-real transfer remains a major bottleneck because simulators are tedious to configure accurately for each robot and environment. Unlike text, robotics lacks foundation models, partly due to limited clean, annotated datasets and the difficulty of collecting diverse real-world data.

Reproducibility crisis: We discussed how self-reported benchmarks can lead to p-hacking and irreproducible results. Centralized evaluation systems (such as Kaggle or the ImageNet challenges), where researchers submit algorithms for testing on hidden test sets, seem to drive faster progress.

Smart buildings: Judah's work at Google focuses on using ML to optimize HVAC systems, potentially reducing energy costs and carbon emissions significantly. The challenge is that every building is different, which makes simulation configuration extremely labor-intensive. Generative AI could help by automating the conversion of floor plans or images into accurate building simulations.

Links:
Judah's website - https://judahgoldfeder.com/

Music:
"Kid Kodi" — Blue Dot Sessions — via Free Music Archive — CC BY-NC 4.0.
"Palms Down" — Blue Dot Sessions — via Free Music Archive — CC BY-NC 4.0.

Nov 24, 2025 • 1h 6min
EP17: RL with Will Brown
In this conversation, Will Brown, a research lead at Prime Intellect specializing in reinforcement learning (RL) and multi-agent systems, explores the foundations and practical applications of RL with the hosts. Will shares insights into the challenges RL faces in LLMs, emphasizing the importance of online sampling and reward models. He discusses multi-agent dynamics, optimization techniques, and the role of game theory in AI development. The conversation also highlights the significance of intermediate results and future directions for RL across applications.
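As a rough illustration of the online-sampling-plus-reward-model loop discussed in the episode, here is a minimal REINFORCE-style sketch in PyTorch. Everything in it (the tiny TinyPolicy model, the stand-in reward_model, the hyperparameters) is a hypothetical toy, not Prime Intellect's training stack.

    # Toy RL-for-LMs loop: sample sequences online, score them with a
    # (stand-in) reward model, and push up the log-probability of
    # high-reward samples. Illustrative sketch only.
    import torch
    import torch.nn as nn

    VOCAB, HIDDEN, SEQ_LEN, BATCH = 32, 64, 8, 16

    class TinyPolicy(nn.Module):
        # Minimal autoregressive policy: embed token, step a GRU, emit logits.
        def __init__(self):
            super().__init__()
            self.embed = nn.Embedding(VOCAB, HIDDEN)
            self.cell = nn.GRUCell(HIDDEN, HIDDEN)
            self.head = nn.Linear(HIDDEN, VOCAB)

        def rollout(self):
            # Sample sequences online and keep their log-probabilities.
            h = torch.zeros(BATCH, HIDDEN)
            tok = torch.zeros(BATCH, dtype=torch.long)  # fixed BOS token 0
            logps, toks = [], []
            for _ in range(SEQ_LEN):
                h = self.cell(self.embed(tok), h)
                dist = torch.distributions.Categorical(logits=self.head(h))
                tok = dist.sample()
                logps.append(dist.log_prob(tok))
                toks.append(tok)
            return torch.stack(toks, 1), torch.stack(logps, 1).sum(1)

    def reward_model(seqs):
        # Stand-in "reward model": favors sequences with distinct tokens.
        return torch.tensor([len(set(s.tolist())) / SEQ_LEN for s in seqs])

    policy = TinyPolicy()
    opt = torch.optim.Adam(policy.parameters(), lr=1e-3)
    for step in range(200):
        seqs, logp = policy.rollout()
        r = reward_model(seqs)
        adv = r - r.mean()               # simple baseline to cut variance
        loss = -(adv * logp).mean()      # REINFORCE objective
        opt.zero_grad(); loss.backward(); opt.step()

The key property this cartoon shares with real pipelines is that samples come from the current policy (online sampling) and the learning signal comes from a reward model rather than labeled targets.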

Nov 17, 2025 • 59min
EP16: AI News and Papers
Dive into the intriguing world of AI as the hosts dissect the quirks of conference review dynamics and the release of Kimi K2 Thinking. Discover Google's latest TPU advancements and their competition with NVIDIA. The importance of real-world data in robotics takes center stage, while the concept of Chain of Thought Hijacking raises alarms about model vulnerabilities. Simplifying reinforcement learning with JustRL presents new possibilities, and the Cosmos project sparks curiosity around AI-driven scientific discovery.

Nov 13, 2025 • 1h 23min
EP15: The Information Bottleneck and Scaling Laws with Alex Alemi
In this discussion, Alex Alemi—a prominent AI researcher from Anthropic, formerly with Google Brain and Disney—delves into the concept of the information bottleneck. He explains how it captures the essential aspects of data while avoiding overfitting. Alemi also highlights scaling laws, revealing how smaller experiments can forecast larger behaviors in AI. He offers insights on the importance of compression in understanding models and challenges researchers to pursue ambitious questions that address broader implications for society, such as job disruption.
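For readers new to the topic, the classical information bottleneck objective (the Tishby, Pereira, and Bialek formulation that this episode builds on, stated here for context rather than quoted from it) seeks a compressed representation T of the input X that stays as predictive of the target Y as possible:

    % Information bottleneck Lagrangian: keep I(X;T) small (compression)
    % while keeping I(T;Y) large (prediction); beta sets the trade-off.
    \min_{p(t \mid x)} \; I(X;T) - \beta \, I(T;Y)

    % Kaplan-style parameter-count scaling law: empirical power-law fits
    % like this are what let small experiments forecast larger runs.
    L(N) \approx \left( \frac{N_c}{N} \right)^{\alpha_N}

The second line is the standard empirical form from the scaling-laws literature, included as an assumed example of the kind of fit the conversation refers to.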

Nov 10, 2025 • 57min
EP14: AI News and Papers
Explore the intricate dynamics of AI in healthcare, where GPT-5's fragility raises pressing concerns about reliability and education. Discover Stanford's innovative Cartridges approach, which aims to cut down the hefty computational costs of long-context models. Delve into the transformative potential of Continuous Autoregressive Language Models and learn about practical strategies from the Smol Training Playbook. Join in the debate on benchmarks and dataset challenges in this fast-evolving field!

Nov 7, 2025 • 1h 21min
EP13: Recurrent-Depth Models and Latent Reasoning with Jonas Geiping
In this episode, we host Jonas Geiping of the ELLIS Institute and the Max Planck Institute for Intelligent Systems (Tübingen AI Center, Germany). We discussed his research on recurrent-depth models and latent reasoning in large language models (LLMs): what these models can and can't do, the challenges and next breakthroughs in the field, world models, and the future of developing better models. We also covered safety and interpretability, and the role of scaling laws in AI development. (A toy code sketch of the recurrent-depth idea follows these show notes.)

Chapters:
00:00 Introduction and Guest Introduction
01:03 Peer Review in Preprint Servers
06:57 New Developments in Coding Models
09:34 Open Source Models in Europe
11:00 Dynamic Layers in LLMs
26:05 Training Playbook Insights
30:05 Recurrent Depth Models and Reasoning Tasks
43:59 Exploring Recursive Reasoning Models
46:46 The Role of World Models in AI
48:41 Innovations in AI Training and Simulation
50:39 The Promise of Recurrent Depth Models
52:34 Navigating the Future of AI Algorithms
54:44 The Bitter Lesson of AI Development
59:11 Advising the Next Generation of Researchers
01:06:42 Safety and Interpretability in AI Models
01:10:46 Scaling Laws and Their Implications
01:16:19 The Role of PhDs in AI Research

Links and papers:
Jonas' website - https://jonasgeiping.github.io/
Scaling up test-time compute with latent reasoning: A recurrent depth approach - https://arxiv.org/abs/2502.05171
The Smol Training Playbook: The Secrets to Building World-Class LLMs - https://huggingface.co/spaces/HuggingFaceTB/smol-training-playbook
VaultGemma: A Differentially Private Gemma Model - https://arxiv.org/abs/2510.15001

Music:
"Kid Kodi" — Blue Dot Sessions — via Free Music Archive — CC BY-NC 4.0.
"Palms Down" — Blue Dot Sessions — via Free Music Archive — CC BY-NC 4.0.
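The toy sketch referenced above: a recurrent-depth model reuses one weight-tied block many times, so "depth" becomes a test-time knob rather than a fixed architectural choice. This is a conceptual cartoon under that assumption, not the architecture from Geiping et al.'s paper.

    # Toy recurrent-depth LM: the same Transformer layer is applied
    # repeatedly to refine a latent state; iterating longer spends more
    # test-time compute on the same weights. Illustrative only.
    import torch
    import torch.nn as nn

    class RecurrentDepthLM(nn.Module):
        def __init__(self, vocab=100, dim=64, heads=4):
            super().__init__()
            self.embed = nn.Embedding(vocab, dim)
            self.block = nn.TransformerEncoderLayer(
                dim, heads, dim * 4, batch_first=True)
            self.head = nn.Linear(dim, vocab)

        def forward(self, tokens, n_iters=4):
            x = self.embed(tokens)       # input injection, fixed per step
            s = torch.zeros_like(x)      # recurrent latent state
            for _ in range(n_iters):     # more iterations = more "depth"
                s = self.block(s + x)    # same weights reused every step
            return self.head(s)

    model = RecurrentDepthLM()
    tokens = torch.randint(0, 100, (2, 16))
    cheap = model(tokens, n_iters=2)     # shallow, fast pass
    deep = model(tokens, n_iters=16)     # spend more test-time compute
    print(cheap.shape, deep.shape)       # both torch.Size([2, 16, 100])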

Nov 3, 2025 • 58min
EP12: Adversarial attacks and compression with Jack Morris
In this episode of the Information Bottleneck Podcast, we host Jack Morris, a PhD student at Cornell, to discuss adversarial examples (Jack created TextAttack, an early framework for adversarial attacks on NLP models), the Platonic representation hypothesis, the implications of inversion techniques, and the role of compression in language models.

Links:
Jack's website - https://jxmo.io/
TextAttack - https://arxiv.org/abs/2005.05909
How much do language models memorize? - https://arxiv.org/abs/2505.24832
DeepSeek OCR - https://www.arxiv.org/abs/2510.18234

Chapters:
00:00 Introduction and AI News Highlights
04:53 The Importance of Fine-Tuning Models
10:01 Challenges in Open Source AI Models
14:34 The Future of Model Scaling and Sparsity
19:39 Exploring Model Routing and User Experience
24:34 Jack's Research: TextAttack and Adversarial Examples
29:33 The Platonic Representation Hypothesis
34:23 Implications of Inversion and Security in AI
39:20 The Role of Compression in Language Models
44:10 Future Directions in AI Research and Personalization

Oct 28, 2025 • 1h 18min
EP11: JEPA with Randall Balestriero
Randall Balestriero, an assistant professor at Brown University specializing in representation learning, dives deep into Joint Embedding Predictive Architectures (JEPA). He explains how JEPA learns data representations without reconstruction, focusing on meaningful features while compressing irrelevant details. The discussion covers the challenges of model collapse, prediction tasks shaping feature learning, and the implications for AGI benchmarks. Balestriero also shares insights on evaluating JEPA models, the role of latent variables, and the growing opportunity in JEPA research.
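To make the no-reconstruction idea concrete, here is a minimal JEPA-style training step in PyTorch: a context encoder and predictor learn to match the embedding that a slowly updated target encoder assigns to another view of the same input, so the loss lives in latent space rather than pixel space. This is a hedged sketch of the general recipe (the encoders, the noise-based "views", and the EMA rate are all assumed stand-ins), not Balestriero's code.

    # Minimal JEPA-style step: predict the *embedding* of a hidden view
    # from a visible view; no pixel reconstruction. Toy stand-in only.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    DIM = 128

    def make_encoder():
        return nn.Sequential(nn.Linear(DIM, 256), nn.ReLU(), nn.Linear(256, 64))

    context_enc = make_encoder()            # trained by gradient descent
    target_enc = make_encoder()             # updated only by EMA
    target_enc.load_state_dict(context_enc.state_dict())
    predictor = nn.Sequential(nn.Linear(64, 64), nn.ReLU(), nn.Linear(64, 64))
    opt = torch.optim.Adam(
        [*context_enc.parameters(), *predictor.parameters()], lr=1e-3)

    def train_step(x):
        # Two "views" of the same sample; a real JEPA would mask patches.
        ctx = x + 0.1 * torch.randn_like(x)
        tgt = x + 0.1 * torch.randn_like(x)
        pred = predictor(context_enc(ctx))  # predict in latent space
        with torch.no_grad():
            target = target_enc(tgt)        # stop-gradient target
        loss = F.mse_loss(pred, target)
        opt.zero_grad(); loss.backward(); opt.step()
        # EMA update of the target encoder helps avoid representation collapse.
        with torch.no_grad():
            for p_t, p_c in zip(target_enc.parameters(),
                                context_enc.parameters()):
                p_t.mul_(0.99).add_(0.01 * p_c)

    for _ in range(100):
        train_step(torch.randn(32, DIM))

The stop-gradient target and the EMA update are the two ingredients the episode's collapse discussion turns on: without them, both encoders can shortcut the task by mapping everything to a constant.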

Oct 20, 2025 • 1h 18min
EP10: Geometric Deep Learning with Michael Bronstein
In this episode, we talked with Michael Bronstein, a professor of AI at the University of Oxford and a scientific director at AITHYRA, about the fascinating world of geometric deep learning. We explored how understanding the geometric structures in data can enhance the efficiency and accuracy of AI models. Michael shared insights on the limitations of small neural networks and the ongoing debate about the role of scaling in AI. We also talked about the future of AI in scientific discovery, and the potential impact on fields like drug design and mathematics.
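For readers who want one concrete primitive behind much of geometric deep learning, here is a toy message-passing layer: each node aggregates transformed neighbor features with an update shared across all nodes, so permuting node labels permutes the outputs identically (the equivariance that graph structure demands). A generic assumed sketch, not code from the episode.

    # Toy message-passing layer: mean-aggregate transformed neighbor
    # features, then update each node with the same shared function.
    import torch
    import torch.nn as nn

    class MessagePassingLayer(nn.Module):
        def __init__(self, dim):
            super().__init__()
            self.msg = nn.Linear(dim, dim)      # transform neighbor features
            self.upd = nn.Linear(2 * dim, dim)  # combine node + aggregate

        def forward(self, h, adj):
            # h: (nodes, dim) features; adj: (nodes, nodes) 0/1 adjacency.
            deg = adj.sum(1, keepdim=True).clamp(min=1)
            agg = adj @ self.msg(h) / deg       # mean of neighbor messages
            return torch.relu(self.upd(torch.cat([h, agg], dim=-1)))

    h = torch.randn(5, 16)                      # 5 nodes, 16-dim features
    adj = (torch.rand(5, 5) > 0.5).float()
    adj = ((adj + adj.T) > 0).float().fill_diagonal_(0)  # symmetric, no loops
    layer = MessagePassingLayer(16)
    print(layer(h, adj).shape)                  # torch.Size([5, 16])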

Oct 13, 2025 • 1h 8min
EP9: AI in Natural Sciences with Tal Kachman
In this episode, we host Tal Kachman, an assistant professor at Radboud University, to explore the fascinating intersection of artificial intelligence and natural sciences. Prof. Kachman's research focuses on multiagent interaction, complex systems, and reinforcement learning. We dive deep into how AI is revolutionizing materials discovery, chemical dynamics modeling, and experimental design through self-driving laboratories. Prof. Kachman shares insights on the challenges of integrating physics and chemistry with AI systems, the critical role of high-throughput experimentation in accelerating scientific discovery, and the transformative potential of generative models to unlock new materials and functionalities.


