

Deep Papers
Arize AI
Deep Papers is a podcast series featuring deep dives on today’s most important AI papers and research. Hosted by Arize AI founders and engineers, each episode profiles the people and techniques behind cutting-edge breakthroughs in machine learning.
Episodes

Nov 24, 2025 • 24min
TUMIX: Multi-Agent Test-Time Scaling with Tool-Use Mixture
Yongchao Chen, a final-year PhD student at Harvard and MIT, discusses his groundbreaking work on TUMIX (Tool-Use Mixture). He explains how a diverse ensemble of agents can significantly improve AI's accuracy by leveraging different tool-use strategies. Chen highlights the limitations of current models, which often struggle to decide when to use tools effectively. Through empirical tests, he shares remarkable results where TUMIX outperforms state-of-the-art methods, emphasizing the importance of agent diversity and collaborative refinement for enhancing AI performance.
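To make the ensemble idea concrete, here is a minimal sketch of a tool-use-mixture loop, assuming a hypothetical run_agent helper and made-up agent style names: several agent styles answer the same question, share candidate answers between rounds, and stop once a majority agrees. It is an illustration of the approach, not the paper's implementation.

```python
# Minimal tool-use-mixture sketch: diverse agent styles answer in parallel,
# share candidates, and stop early on majority agreement. `run_agent` and the
# style names below are hypothetical placeholders, not TUMIX's actual code.
from collections import Counter

AGENT_STYLES = ["cot_only", "code_interpreter", "web_search"]  # assumed styles

def run_agent(style: str, question: str, shared_notes: list[str]) -> str:
    """Placeholder for one agent's full reasoning pass (LLM call plus tools)."""
    raise NotImplementedError("wire this to your own LLM / tool stack")

def tool_use_mixture(question: str, max_rounds: int = 3) -> str:
    shared_notes: list[str] = []            # candidate answers shared across rounds
    best = ""
    for _ in range(max_rounds):
        answers = [run_agent(s, question, shared_notes) for s in AGENT_STYLES]
        best, votes = Counter(answers).most_common(1)[0]
        if votes > len(AGENT_STYLES) // 2:  # majority agreement: stop refining
            break
        shared_notes = answers              # otherwise let agents see peers' answers
    return best
```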

Nov 10, 2025 • 23min
Meta AI Researcher Explains ARE and Gaia2: Scaling Up Agent Environments and Evaluations
In our latest paper reading, we had the pleasure of hosting Grégoire Mialon, Research Scientist at Meta Superintelligence Labs, to walk us through Meta AI's groundbreaking paper "ARE: Scaling Up Agent Environments and Evaluations" and the new ARE and Gaia2 frameworks. Learn more about AI observability and evaluation, join the Arize AI Slack community, or get the latest on LinkedIn and X.

Oct 14, 2025 • 31min
Georgia Tech's Santosh Vempala Explains Why Language Models Hallucinate, His Research With OpenAI
Santosh Vempala, a distinguished professor at Georgia Tech, dives deep into the complexities of language models and their notorious ability to hallucinate. He explains how maximum likelihood pre-training can lead to these issues and the crucial trade-offs between memorization and generalization. Through fascinating examples, he discusses how calibration impacts accuracy and presents a formal theorem linking hallucinations to misclassification. Vempala also highlights practical approaches to detect invalid model outputs and shares insights into improving AI evaluation methods.
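As background on the calibration idea the episode touches on, here is a minimal sketch of expected calibration error, the standard binned gap between a model's stated confidence and its actual accuracy. It is illustrative bookkeeping, not the paper's formal theorem.

```python
# Expected calibration error (ECE): bucket predictions by confidence and
# compare each bucket's average confidence to its empirical accuracy.
import numpy as np

def expected_calibration_error(confidences, correct, n_bins: int = 10) -> float:
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            gap = abs(correct[mask].mean() - confidences[mask].mean())
            ece += mask.mean() * gap  # weight the gap by the bin's share of samples
    return ece

# A model that answers at 90% confidence but is right only 60% of the time in
# that bin contributes a 0.3 gap, weighted by how often it lands in that bin.
```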

Sep 22, 2025 • 26min
Atropos Health’s Arjun Mukerji, PhD, Explains RWESummary: A Framework and Test for Choosing LLMs to Summarize Real-World Evidence (RWE) Studies
Join Arjun Mukerji, PhD, a staff data scientist at Atropos Health, as he dives into the RWESummary benchmark for evaluating large language models in summarizing real-world evidence. Discover how these models differ from traditional clinical trial data and the importance of robust evaluation metrics. Arjun sheds light on the risks associated with AI-generated summaries and advocates for a human-in-the-loop approach to ensure accuracy. It's a captivating discussion on the future of AI in healthcare!

Sep 6, 2025 • 48min
Stan Miasnikov, Distinguished Engineer, AI/ML Architecture, Consumer Experience at Verizon Walks Us Through His New Paper
This episode dives into "Category-Theoretic Analysis of Inter-Agent Communication and Mutual Understanding Metric in Recursive Consciousness." The paper presents an extension of the Recursive Consciousness framework to analyze communication between agents and the inevitable loss of meaning in translation. We're thrilled to feature the paper's author, Stan Miasnikov, Distinguished Engineer, AI/ML Architecture, Consumer Experience at Verizon, to walk us through the research and its implications. Learn more about AI observability and evaluation, join the Arize AI Slack community, or get the latest on LinkedIn and X.

Sep 5, 2025 • 31min
Small Language Models are the Future of Agentic AI
Peter Belcak, an AI research scientist at NVIDIA, discusses his groundbreaking paper on the promise of small language models (SLMs) for agentic AI. He highlights how SLMs can outperform larger models in cost-effectiveness and operational efficiency. Peter explores the transformation process from large models to smaller agents and introduces tools supporting this fine-tuning. He also addresses bias mitigation in data selection and the importance of collaboration in the evolving landscape of AI, paving the way for a more accessible future.

Jul 30, 2025 • 43min
Watermarking for LLMs and Image Models
In this AI research paper reading, we dive into "A Watermark for Large Language Models" with the paper's author John Kirchenbauer. This paper is a timely exploration of techniques for embedding invisible but detectable signals in AI-generated text. These watermarking strategies aim to help mitigate misuse of large language models by making machine-generated content distinguishable from human writing, without sacrificing text quality or requiring access to the model's internals. Learn more about the A Watermark for Large Language Models paper. Learn more about AI observability and evaluation, join the Arize AI Slack community, or get the latest on LinkedIn and X.
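For a concrete picture of the technique, here is a toy sketch of a green-list watermark in the spirit of the paper: a pseudorandom split of the vocabulary keyed on the previous token, a small logit boost for "green" tokens at generation time, and a z-test on the green fraction at detection time. The vocabulary size, bias strength, and seeding scheme below are placeholders, not the paper's exact construction.

```python
# Toy green-list watermark: bias sampling toward a token subset that a
# detector (sharing only the seeding scheme) can later count.
import numpy as np

VOCAB_SIZE, GAMMA, DELTA = 50_000, 0.25, 2.0   # assumed vocab size, green fraction, bias

def green_list(prev_token: int) -> np.ndarray:
    rng = np.random.default_rng(prev_token)     # stand-in for hashing the previous token
    return rng.permutation(VOCAB_SIZE)[: int(GAMMA * VOCAB_SIZE)]

def watermarked_sample(logits: np.ndarray, prev_token: int) -> int:
    biased = logits.copy()
    biased[green_list(prev_token)] += DELTA     # soft boost for green tokens
    probs = np.exp(biased - biased.max())
    probs /= probs.sum()
    return int(np.random.default_rng().choice(VOCAB_SIZE, p=probs))

def detection_z_score(tokens: list[int]) -> float:
    hits = sum(t in set(green_list(p)) for p, t in zip(tokens, tokens[1:]))
    n = len(tokens) - 1
    return (hits - GAMMA * n) / np.sqrt(GAMMA * (1 - GAMMA) * n)  # large z => likely watermarked
```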

Jul 8, 2025 • 31min
Self-Adapting Language Models: Paper Authors Discuss Implications
Discover how self-adapting language models can redefine AI. The hosts dive into innovative self-editing techniques and the role of reinforcement learning in enhancing model performance. They discuss the challenges of catastrophic forgetting and gradient interference, alongside unique methods like LoRA for efficient updates. Excitingly, they explore the future of pre-training, revealing how models can forge their own learning paths. Get ready for a fascinating look at the evolution of language models!
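As background on the LoRA updates mentioned above, here is a minimal sketch of a low-rank adapter wrapped around a frozen linear layer; the rank, scaling, and initialization follow common convention and are assumptions for illustration, not the authors' code.

```python
# LoRA sketch: keep the pretrained weight frozen and learn a low-rank update
# B @ A, so each adaptation touches only r * (d_in + d_out) parameters.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)                                # freeze pretrained weights
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))   # zero init: no-op at start
        self.scale = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scale * (x @ self.A.T) @ self.B.T
```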

Jun 20, 2025 • 31min
The Illusion of Thinking: What the Apple AI Paper Says About LLM Reasoning
The discussion revolves around a compelling new paper from Apple, challenging traditional evaluations of AI reasoning. It reveals how Large Reasoning Models (LRMs) surprisingly falter on complex tasks while Large Language Models (LLMs) shine in simpler scenarios. The conversation dives into the nuances of problem-solving, contrasting human creativity with algorithmic execution, especially with something as intricate as Rubik's cubes. A philosophical debate unfolds, questioning whether the reasoning showcased by AI is truly genuine or merely an illusion.

Jun 4, 2025 • 25min
Accurate KV Cache Quantization with Outlier Tokens Tracing
We discuss Accurate KV Cache Quantization with Outlier Tokens Tracing, a deep dive into improving the efficiency of LLM inference. The authors enhance KV cache quantization, a technique for reducing memory and compute costs during inference, by introducing a method to identify and exclude outlier tokens that hurt quantization accuracy, striking a better balance between efficiency and performance. Read the paper, access the slides, read the blog, and join us for Arize Observe. Learn more about AI observability and evaluation, join the Arize AI Slack community, or get the latest on LinkedIn and X.
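Here is a simplified sketch of the general idea, assuming a toy per-token int8 scheme and a top-k magnitude rule for flagging outlier tokens; the paper's actual outlier criterion and kernel details differ.

```python
# Quantize the KV cache per token to int8, but keep the few highest-magnitude
# (outlier) tokens in full precision so they don't distort the scales.
import numpy as np

def quantize_kv_with_outliers(kv: np.ndarray, n_outliers: int = 2):
    """kv: (num_tokens, head_dim) floats -> (int8 cache, per-token scales, fp outliers)."""
    norms = np.abs(kv).max(axis=1)
    outlier_idx = np.argsort(norms)[-n_outliers:]            # tokens traced as outliers
    scales = norms[:, None] / 127.0 + 1e-8
    q = np.clip(np.round(kv / scales), -127, 127).astype(np.int8)
    outliers = {int(i): kv[i].copy() for i in outlier_idx}   # kept in full precision
    return q, scales, outliers

def dequantize_kv(q: np.ndarray, scales: np.ndarray, outliers: dict) -> np.ndarray:
    out = q.astype(np.float32) * scales
    for i, row in outliers.items():
        out[i] = row                                          # restore exact outlier tokens
    return out
```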


