

Deep Papers
Arize AI
Deep Papers is a podcast series featuring deep dives on today’s most important AI papers and research. Hosted by Arize AI founders and engineers, each episode profiles the people and techniques behind cutting-edge breakthroughs in machine learning.
Episodes

Sep 6, 2025 • 48min
Stan Miasnikov, Distinguished Engineer, AI/ML Architecture, Consumer Experience at Verizon Walks Us Through His New Paper
This episode dives into "Category-Theoretic Analysis of Inter-Agent Communication and Mutual Understanding Metric in Recursive Consciousness." The paper presents an extension of the Recursive Consciousness framework to analyze communication between agents and the inevitable loss of meaning in translation. We're thrilled to feature the paper's author, Stan Miasnikov, Distinguished Engineer, AI/ML Architecture, Consumer Experience at Verizon, to walk us through the research and its implications. Learn more about AI observability and evaluation, join the Arize AI Slack community, or get the latest on LinkedIn and X.

Sep 5, 2025 • 31min
Small Language Models are the Future of Agentic AI
Peter Belcak, an AI research scientist at NVIDIA, discusses his groundbreaking paper on the promise of small language models (SLMs) for agentic AI. He highlights how SLMs can outperform larger models in cost-effectiveness and operational efficiency. Peter explores the transformation process from large models to smaller agents and introduces tools supporting this fine-tuning. He also addresses bias mitigation in data selection and the importance of collaboration in the evolving landscape of AI, paving the way for a more accessible future.

Jul 30, 2025 • 43min
Watermarking for LLMs and Image Models
In this AI research paper reading, we dive into "A Watermark for Large Language Models" with the paper's author John Kirchenbauer. The paper is a timely exploration of techniques for embedding invisible but detectable signals in AI-generated text. These watermarking strategies aim to help mitigate misuse of large language models by making machine-generated content distinguishable from human writing, without sacrificing text quality or requiring access to the model's internals. Learn more about the A Watermark for Large Language Models paper. Learn more about agent observability and LLM observability, join the Arize AI Slack community, or get the latest on LinkedIn and X.
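To make the core idea concrete, here is a minimal sketch (in Python with NumPy, not the authors' code) of the green-list approach the paper describes: hash the previous token to pick a pseudo-random "green" subset of the vocabulary, nudge the logits toward it during generation, and later detect the watermark by counting how often sampled tokens land in their green lists. The parameter names gamma (green-list fraction) and delta (logit bias) roughly follow the paper's notation; everything else is illustrative.

```python
import numpy as np

def greenlist_bias(logits, prev_token_id, vocab_size, gamma=0.25, delta=2.0):
    """Bias logits toward a pseudo-random 'green list' seeded by the previous token.

    A minimal sketch of the soft watermark idea; gamma is the green-list
    fraction of the vocabulary, delta is the logit bias added to green tokens.
    """
    rng = np.random.default_rng(seed=prev_token_id)            # same seed at detect time
    green = rng.choice(vocab_size, size=int(gamma * vocab_size), replace=False)
    biased = logits.copy()
    biased[green] += delta                                     # nudge sampling toward green tokens
    return biased

def detect_z_score(token_ids, vocab_size, gamma=0.25):
    """Count how many tokens fall in their green list and return a one-proportion z-score."""
    hits = 0
    for prev, tok in zip(token_ids[:-1], token_ids[1:]):
        rng = np.random.default_rng(seed=prev)                 # reconstruct the same green list
        green = rng.choice(vocab_size, size=int(gamma * vocab_size), replace=False)
        hits += int(tok in green)
    n = len(token_ids) - 1
    return (hits - gamma * n) / np.sqrt(n * gamma * (1 - gamma))
```

A large positive z-score on a passage suggests it was generated with the watermark; human text should hover near zero.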

Jul 8, 2025 • 31min
Self-Adapting Language Models: Paper Authors Discuss Implications
Discover how self-adapting language models can redefine AI. The hosts dive into innovative self-editing techniques and the role of reinforcement learning in enhancing model performance. They discuss the challenges of catastrophic forgetting and gradient interference, alongside methods like LoRA for efficient updates. Excitingly, they explore the future of pre-training, revealing how models can forge their own learning paths. Get ready for a fascinating look at the evolution of language models!
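For listeners who want a picture of what "LoRA for efficient updates" means in practice, below is a minimal, generic LoRA sketch in PyTorch: the pretrained weight stays frozen and only a small low-rank adapter is trained, which is what makes repeated self-edits affordable. This illustrates LoRA itself, not the paper's self-editing pipeline; the defaults for r and alpha are illustrative.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Minimal LoRA sketch: a frozen linear layer plus a trainable low-rank update."""
    def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                            # pretrained weight stays frozen
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))  # zero init: delta starts at 0
        self.scale = alpha / r

    def forward(self, x):
        # Full-rank path plus the low-rank correction B @ A applied to x.
        return self.base(x) + (x @ self.A.T @ self.B.T) * self.scale
```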

Jun 20, 2025 • 31min
The Illusion of Thinking: What the Apple AI Paper Says About LLM Reasoning
The discussion revolves around a compelling new paper from Apple, challenging traditional evaluations of AI reasoning. It reveals how Large Reasoning Models (LRMs) surprisingly falter on complex tasks while Large Language Models (LLMs) shine in simpler scenarios. The conversation dives into the nuances of problem-solving, contrasting human creativity with algorithmic execution, especially on puzzles as intricate as Rubik's cubes. A philosophical debate unfolds, questioning whether the reasoning showcased by AI is truly genuine or merely an illusion.

Jun 4, 2025 • 25min
Accurate KV Cache Quantization with Outlier Tokens Tracing
We discuss Accurate KV Cache Quantization with Outlier Tokens Tracing, a deep dive into improving the efficiency of LLM inference. The authors enhance KV Cache quantization, a technique for reducing memory and compute costs during inference, by introducing a method to identify and exclude outlier tokens that hurt quantization accuracy, striking a better balance between efficiency and performance. Read the paper, access the slides, or read the blog. Join us for Arize Observe. Learn more about AI observability and evaluation, join the Arize AI Slack community, or get the latest on LinkedIn and X.
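As a rough illustration of the general idea (not the authors' implementation), the toy PyTorch sketch below quantizes a KV-cache slice to int8 with shared per-channel scales while keeping a small fraction of outlier tokens, the ones with the largest dynamic range, in full precision so they do not inflate the scales used for everything else.

```python
import torch

def quantize_kv_excluding_outliers(kv: torch.Tensor, outlier_frac: float = 0.01):
    """Toy int8 quantization of a KV-cache slice ([num_tokens, head_dim]).

    Outlier tokens are traced and stored in full precision; per-channel scales
    are computed from the remaining tokens only. Illustrative sketch only.
    """
    k = max(1, int(outlier_frac * kv.shape[0]))
    outlier_idx = kv.abs().amax(dim=-1).topk(k).indices        # tokens with extreme values
    keep = torch.ones(kv.shape[0], dtype=torch.bool)
    keep[outlier_idx] = False

    scale = (kv[keep].abs().amax(dim=0) / 127.0).clamp(min=1e-8)  # outliers excluded from scales
    q = torch.round(kv / scale).clamp(-127, 127).to(torch.int8)
    return q, scale, kv[outlier_idx].clone(), outlier_idx

def dequantize(q, scale, fp_outliers, outlier_idx):
    out = q.float() * scale
    out[outlier_idx] = fp_outliers                             # outlier tokens restored exactly
    return out
```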

May 16, 2025 • 29min
Scalable Chain of Thoughts via Elastic Reasoning
Explore the innovative concept of Elastic Reasoning, a framework that enhances reasoning models by separating the thinking process from solution generation. Delve into how this separation improves output quality under tight inference budgets. Learn how these strategies optimize performance in multi-tool agents and reduce AI hallucinations. Discover practical applications that enhance user experience in critical tasks. Finally, discuss the push for sustainable, lightweight models to tackle the environmental costs of AI.
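To give a flavor of what "separating thinking from solution" can look like, here is a hypothetical budget-constrained rollout in Python: reasoning tokens get one budget, the <think> block is force-closed when that budget runs out, and the final answer gets its own separate budget. The `generate` streaming interface and the tag names are assumptions for the sketch, not a specific library API or the paper's exact procedure.

```python
def elastic_rollout(generate, prompt, think_budget=1024, solution_budget=512):
    """Sketch of a two-budget rollout: cap thinking tokens, then cap solution tokens.

    `generate(prefix, max_tokens)` is assumed to yield token strings one at a time.
    """
    thinking = []
    for token in generate(prompt + "<think>", max_tokens=think_budget):
        thinking.append(token)
        if token == "</think>":                  # model finished reasoning early
            break
    else:
        thinking.append("</think>")              # budget exhausted: force the reasoning to close

    so_far = prompt + "<think>" + "".join(thinking)
    solution = list(generate(so_far, max_tokens=solution_budget))
    return "".join(thinking), "".join(solution)
```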

May 2, 2025 • 30min
Sleep-time Compute: Beyond Inference Scaling at Test-time
Imagine if your AI could anticipate your questions before you even ask! This intriguing discussion centers on sleep-time compute, a method allowing models to prepare answers during idle moments. By precomputing reasoning steps, it significantly cuts down on latency and costs while boosting accuracy. The talk dives into new benchmarks showing impressive reductions in compute use and cost. Additionally, the potential of leveraging idle GPUs for improved efficiency and the challenges of optimizing resources in AI systems make for a fascinating listen.
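Here is a minimal sketch of the sleep-time idea, with a stand-in `call_llm` function rather than any real client and with illustrative prompts: while the system is idle, the context is distilled into notes once; at query time those notes are reused, so less reasoning has to happen while the user waits.

```python
import hashlib

def call_llm(prompt: str) -> str:
    """Stand-in for any chat-completion call; replace with a real client."""
    return f"<llm output for: {prompt[:40]}...>"

PRECOMPUTED: dict[str, str] = {}

def sleep_time_pass(context: str) -> None:
    """Run while idle: distill the raw context into notes and likely inferences once."""
    key = hashlib.sha256(context.encode()).hexdigest()
    if key not in PRECOMPUTED:
        PRECOMPUTED[key] = call_llm(
            "List the key facts and the follow-up inferences a user is likely to ask about:\n"
            + context
        )

def answer(context: str, question: str) -> str:
    """At query time, reuse the precomputed notes instead of re-deriving them."""
    key = hashlib.sha256(context.encode()).hexdigest()
    notes = PRECOMPUTED.get(key, "")             # falls back to the raw context alone
    return call_llm(f"Notes:\n{notes}\n\nContext:\n{context}\n\nQuestion: {question}")
```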

Apr 18, 2025 • 27min
LibreEval: The Largest Open Source Benchmark for RAG Hallucination Detection
For this week's paper read, we dive into our own research. We wanted to create a replicable, evolving dataset that can keep pace with model training, so that you always know you're testing with data your model has never seen before. We also saw the prohibitively high cost of running LLM evals at scale, and have used our data to fine-tune a series of SLMs that perform just as well as their base LLM counterparts, but at 1/10 the cost. So, over the past few weeks, the Arize team generated the largest public dataset of hallucinations, as well as a series of fine-tuned evaluation models. We talk about what we built, the process we took, and the bottom-line results. You can read the recap of LibreEval here. Dive into the research, or sign up to join us next time. Learn more about AI observability and evaluation, join the Arize AI Slack community, or get the latest on LinkedIn and X.

Apr 4, 2025 • 26min
AI Benchmark Deep Dive: Gemini 2.5 and Humanity's Last Exam
Dive into the advancements of Google's Gemini 2.5 as it tackles Humanity's Last Exam, showcasing its impressive reasoning and multimodal capabilities. Discover how this AI model outperforms rivals on key benchmarks and the complexities it faces in expert-level problem-solving. The discussion also highlights the significance of traditional benchmarks and the ongoing debate about model optimization versus overall performance. Finally, learn about the community's role in shaping the future of AI evaluation and collaboration.