

Deep Papers
Arize AI
Deep Papers is a podcast series featuring deep dives on today’s most important AI papers and research. Hosted by Arize AI founders and engineers, each episode profiles the people and techniques behind cutting-edge breakthroughs in machine learning.
Episodes

Jul 8, 2025 • 31min
Self-Adapting Language Models: Paper Authors Discuss Implications
Discover how self-adapting language models could redefine AI. The hosts dive into innovative self-editing techniques and the role of reinforcement learning in enhancing model performance. They discuss the challenges of catastrophic forgetting and gradient interference, alongside techniques like LoRA for efficient weight updates. They also explore the future of pre-training, revealing how models can forge their own learning paths. Get ready for a fascinating look at the evolution of language models!
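Below is a minimal sketch, not the authors' code, of the kind of parameter-efficient update the episode describes: attaching a LoRA adapter so a self-edit can be absorbed as a cheap, low-rank weight change. The base model, target module, and hyperparameters are illustrative assumptions.

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model  # Hugging Face PEFT library

# Stand-in base model; a self-adapting setup would use its own checkpoint.
model = AutoModelForCausalLM.from_pretrained("gpt2")

# Low-rank adapters train only a small fraction of parameters, which keeps
# each self-edit cheap and limits gradient interference with frozen weights.
config = LoraConfig(
    r=8,                        # rank of the update matrices (assumed value)
    lora_alpha=16,
    target_modules=["c_attn"],  # GPT-2's fused attention projection
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, config)
model.print_trainable_parameters()  # typically well under 1% of all weights
```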

Jun 20, 2025 • 31min
The Illusion of Thinking: What the Apple AI Paper Says About LLM Reasoning
The discussion revolves around a compelling new paper from Apple, challenging traditional evaluations of AI reasoning. It reveals how Large Reasoning Models (LRMs) surprisingly falter on complex tasks while Large Language Models (LLMs) shine in simpler scenarios. The conversation dives into the nuances of problem-solving, contrasting human creativity with algorithmic execution, especially with something as intricate as Rubik's cubes. A philosophical debate unfolds, questioning whether the reasoning showcased by AI is truly genuine or merely an illusion.

Jun 4, 2025 • 25min
Accurate KV Cache Quantization with Outlier Tokens Tracing
We discuss Accurate KV Cache Quantization with Outlier Tokens Tracing, a deep dive into improving the efficiency of LLM inference. The authors enhance KV cache quantization, a technique for reducing memory and compute costs during inference, by introducing a method to identify and exclude outlier tokens that hurt quantization accuracy, striking a better balance between efficiency and performance.
Read the paper, access the slides, or read the blog. Join us for Arize Observe.
Learn more about AI observability and evaluation, join the Arize AI Slack community, or get the latest on LinkedIn and X.
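As a rough illustration of the idea (a sketch under assumed shapes and thresholds, not the paper's method), the snippet below quantizes a KV cache slice to int8 per token while tracing outlier tokens, whose unusually wide value ranges would otherwise distort the quantization scale, and keeping them in full precision:

```python
import torch

def quantize_kv(kv: torch.Tensor, outlier_z: float = 3.0):
    """kv: [num_tokens, head_dim] slice of the KV cache."""
    ranges = kv.abs().amax(dim=-1)                      # per-token max magnitude
    z = (ranges - ranges.mean()) / (ranges.std() + 1e-6)
    outliers = z > outlier_z                            # tokens that would skew the scale
    scale = ranges.clamp(min=1e-6).unsqueeze(-1) / 127.0
    q = torch.round(kv / scale).to(torch.int8)          # int8 for ordinary tokens
    return q, scale, outliers, kv[outliers].clone()     # outlier rows kept in float

def dequantize_kv(q, scale, outliers, fp_rows):
    kv = q.to(torch.float32) * scale
    kv[outliers] = fp_rows                              # restore traced outliers exactly
    return kv
```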

May 16, 2025 • 29min
Scalable Chain of Thoughts via Elastic Reasoning
Explore Elastic Reasoning, a framework that makes reasoning models budget-aware by separating the thinking phase from the solution phase and giving each its own token budget. Delve into how this separation preserves output quality under tight resource constraints, how the strategy improves performance in multi-tool agents and reduces AI hallucinations, and the practical applications that enhance user experience in critical tasks. Finally, discuss the push for sustainable, lightweight models to tackle the environmental challenges of AI technology.
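A minimal sketch of the budget-split decoding this enables, assuming a generic `generate` callable and `<think>` delimiters (both illustrative, not the paper's exact interface):

```python
def elastic_generate(generate, prompt, think_budget=512, solution_budget=256):
    # Phase 1: thinking, hard-capped at its own budget. Training with
    # truncated rollouts teaches the model to leave usable partial reasoning.
    thinking = generate(prompt + "<think>", max_tokens=think_budget, stop="</think>")
    # Phase 2: the solution gets a guaranteed budget of its own, so a long
    # thought trace can never starve the final answer.
    return generate(
        prompt + "<think>" + thinking + "</think>",
        max_tokens=solution_budget,
    )
```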

May 2, 2025 • 30min
Sleep-time Compute: Beyond Inference Scaling at Test-time
Imagine if your AI could anticipate your questions before you even ask! This intriguing discussion centers on sleep-time compute, a method allowing models to prepare answers during idle moments. By precomputing reasoning steps, it significantly cuts down on latency and costs while boosting accuracy. The talk dives into new benchmarks showing impressive reductions in compute use and cost. Additionally, the potential of leveraging idle GPUs for improved efficiency and the challenges of optimizing resources in AI systems make for a fascinating listen.
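A minimal sketch of the core loop, assuming a simple cache keyed by context and a generic `llm` callable (names are illustrative): during idle time the model pre-digests a context into notes, and at query time those notes make the answer cheaper and faster to produce.

```python
import hashlib

cache: dict[str, str] = {}

def _key(context: str) -> str:
    return hashlib.sha256(context.encode()).hexdigest()

def sleep_time_compute(llm, context: str) -> None:
    """Run while the system is idle: precompute reasoning over the context."""
    cache[_key(context)] = llm(
        f"Summarize this context and derive facts likely to be useful later:\n{context}"
    )

def answer(llm, context: str, question: str) -> str:
    notes = cache.get(_key(context), "")  # fall back to raw context if idle work never ran
    return llm(f"Notes: {notes}\nContext: {context}\nQuestion: {question}")
```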

Apr 18, 2025 • 27min
LibreEval: The Largest Open Source Benchmark for RAG Hallucination Detection
For this week’s paper read, we dive into our own research. We wanted to create a replicable, evolving dataset that can keep pace with model training, so that you always know you’re testing with data your model has never seen before. We also saw the prohibitively high cost of running LLM evals at scale, and have used our data to fine-tune a series of SLMs that perform just as well as their base LLM counterparts, but at 1/10 the cost. So, over the past few weeks, the Arize team generated the largest public dataset of hallucinations, along with a series of fine-tuned evaluation models. We talk about what we built, the process we took, and the bottom-line results.
You can read the recap of LibreEval here. Dive into the research, or sign up to join us next time.
Learn more about AI observability and evaluation, join the Arize AI Slack community, or get the latest on LinkedIn and X.
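For flavor, here is a minimal sketch of the kind of reference-based hallucination eval this dataset targets: a judge model labels whether a RAG answer is supported by its retrieved context. The prompt and the `judge` callable are illustrative assumptions, not the LibreEval pipeline itself.

```python
PROMPT = """Given the reference text and the answer, respond with "factual" if the
answer is fully supported by the reference, and "hallucinated" otherwise.

Reference: {context}
Answer: {answer}
Label:"""

def detect_hallucination(judge, context: str, answer: str) -> str:
    # `judge` can be a large LLM or, per the episode, a fine-tuned SLM at ~1/10 the cost.
    label = judge(PROMPT.format(context=context, answer=answer)).strip().lower()
    return label if label in {"factual", "hallucinated"} else "hallucinated"
```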

Apr 4, 2025 • 26min
AI Benchmark Deep Dive: Gemini 2.5 and Humanity's Last Exam
Dive into the advancements of Google's Gemini 2.5 as it tackles Humanity's Last Exam, showcasing its impressive reasoning and multimodal capabilities. Discover how this AI model outperforms rivals on key benchmarks and the complexities it faces in expert-level problem-solving. The discussion also highlights the significance of traditional benchmarks and the ongoing debate about model optimization versus overall performance. Finally, learn about the community's role in shaping the future of AI evaluation and collaboration.

Mar 25, 2025 • 15min
Model Context Protocol (MCP)
We cover Anthropic’s groundbreaking Model Context Protocol (MCP). Though it was released in November 2024, we've been seeing a lot of hype around it lately, and thought it was well worth digging into. Learn how this open standard enables seamless integration between LLMs and external data sources, transforming them into capable, context-aware agents. We explore the key benefits of MCP, including enhanced context retention across interactions, improved interoperability for agentic workflows, and the development of more capable AI agents that can execute complex tasks in real-world environments.
Read our analysis of MCP on the blog, or dive into the latest AI research.
Learn more about AI observability and evaluation, join the Arize AI Slack community, or get the latest on LinkedIn and X.
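As a concrete taste of the standard, here is a minimal MCP server sketch using the FastMCP helper from the official `mcp` Python SDK; the tool itself is a made-up example:

```python
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("demo-server")

@mcp.tool()
def lookup_order(order_id: str) -> str:
    """Fetch the status of an order from an internal system."""
    return f"Order {order_id}: shipped"  # stand-in for a real data source

if __name__ == "__main__":
    mcp.run()  # serves over stdio by default, so any MCP client can connect
```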

Mar 1, 2025 • 30min
AI Roundup: DeepSeek’s Big Moves, Claude 3.7, and the Latest Breakthroughs
This roundup explores cutting-edge AI developments, including DeepSeek's launch of FlashMLA, an efficient MLA decoding kernel for NVIDIA Hopper GPUs. It also dives into Claude 3.7 Sonnet, showcasing its hybrid reasoning capabilities and improvements in AI coding assistance. The discussion highlights DeepSeek's new DeepEP communication library and its strategic optimizations for server efficiency. With a focus on benchmarking AI innovations and open-source advancements, listeners gain insight into the latest trends shaping the future of artificial intelligence.

Feb 21, 2025 • 30min
How DeepSeek is Pushing the Boundaries of AI Development
Discover the remarkable advancements in AI with DeepSeek, particularly its groundbreaking inference speed. The team discusses the evolution of AI reasoning and the innovative use of reinforcement learning techniques. Dive into the challenges and triumphs of local deployment, along with the playful nature of these models. A live demo showcases practical applications like sentiment analysis and topic modeling, revealing the fine-tuning capabilities of the DeepSeek model. Explore the exciting future of AI shaped by major tech investments.
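For a rough sense of the kind of local demo described (a sketch, not the episode's actual code), here is sentiment analysis by prompting a distilled DeepSeek-R1 model through Hugging Face transformers:

```python
from transformers import pipeline

generator = pipeline(
    "text-generation",
    model="deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B",  # small distilled variant
)

prompt = (
    "Classify the sentiment of this review as positive, negative, or neutral.\n"
    "Review: The battery died after two days.\n"
    "Sentiment:"
)
# Reasoning models emit a <think> trace before the answer, so leave headroom.
print(generator(prompt, max_new_tokens=256)[0]["generated_text"])
```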