
Deep Papers

Latest episodes

Apr 4, 2025 • 26min

AI Benchmark Deep Dive: Gemini 2.5 and Humanity's Last Exam

Dive into the advancements of Google's Gemini 2.5 as it tackles Humanity's Last Exam, showcasing its impressive reasoning and multimodal capabilities. Discover how this AI model outperforms rivals in key benchmarks and the complexities it faces in expert-level problem-solving. The discussion also highlights the significance of traditional benchmarks and the ongoing debate about model optimization versus overall performance. Finally, learn about the community's role in shaping the future of AI evaluation and collaboration.
Mar 25, 2025 • 15min

Model Context Protocol (MCP)

We cover Anthropic’s groundbreaking Model Context Protocol (MCP). Though it was released in November 2024, we've been seeing a lot of hype around it lately, and thought it was well worth digging into. Learn how this open standard is revolutionizing AI by enabling seamless integration between LLMs and external data sources, fundamentally transforming them into capable, context-aware agents. We explore the key benefits of MCP, including enhanced context retention across interactions, improved interoperability for agentic workflows, and the development of more capable AI agents that can execute complex tasks in real-world environments.

Learn more about AI observability and evaluation, join the Arize AI Slack community or get the latest on LinkedIn and X.
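Under the hood, MCP messages are plain JSON-RPC 2.0. As a rough illustration of the wire format (the tool name and arguments below are hypothetical, not from any real server), a client asking an MCP server to invoke a tool sends a `tools/call` request:

```python
import json

def make_tool_call(request_id, tool_name, arguments):
    """Build a JSON-RPC 2.0 request asking an MCP server to invoke a tool.

    MCP requests carry an id, a method such as "tools/call", and params
    with the tool's name and its arguments.
    """
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool_name, "arguments": arguments},
    }

# Hypothetical call: ask a server to run a "search_docs" tool.
request = make_tool_call(1, "search_docs", {"query": "retrieval latency"})
print(json.dumps(request, indent=2))
```

A real client would send this over one of MCP's transports (stdio or HTTP) and match the server's response by `id`; the sketch only shows the message shape.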
Mar 1, 2025 • 30min

AI Roundup: DeepSeek’s Big Moves, Claude 3.7, and the Latest Breakthroughs

This podcast explores cutting-edge AI developments, including DeepSeek's launch of FlashMLA, a revolutionary decoding kernel for NVIDIA GPUs. It also dives into Claude 3.7, showcasing its hybrid reasoning capabilities and improvements in AI coding assistance. The discussion highlights DeepSeek's new DeepEP communication library and the strategic optimizations for server efficiency. With a focus on benchmarking AI innovations and open-source advancements, listeners gain insights into the latest trends that are shaping the future of artificial intelligence.
Feb 21, 2025 • 30min

How DeepSeek is Pushing the Boundaries of AI Development

Discover the remarkable advancements in AI with DeepSeek, particularly its groundbreaking inference speed. The team discusses the evolution of AI reasoning and the innovative use of reinforcement learning techniques. Dive into the challenges and triumphs of local deployment, along with the playful nature of these models. A live demo showcases practical applications like sentiment analysis and topic modeling, revealing the fine-tuning capabilities of the DeepSeek model. Explore the exciting future of AI shaped by major tech investments.
Feb 4, 2025 • 30min

Multiagent Finetuning: A Conversation with Researcher Yilun Du

We talk to Google DeepMind Senior Research Scientist (and incoming Assistant Professor at Harvard) Yilun Du about his latest paper, "Multiagent Finetuning: Self Improvement with Diverse Reasoning Chains." This paper introduces a multiagent finetuning framework that enhances the performance and diversity of language models by employing a society of agents with distinct roles, improving feedback mechanisms and overall output quality. The method enables autonomous self-improvement through iterative finetuning, achieving significant performance gains across various reasoning tasks. It's versatile, applicable to both open-source and proprietary LLMs, and can integrate with human-feedback-based methods like RLHF or DPO, paving the way for future advancements in language model development. Read an overview on the blog, or watch the full discussion.
Jan 14, 2025 • 25min

Training Large Language Models to Reason in Continuous Latent Space

The discussion highlights recent advancements in AI, including NVIDIA's innovations and a new platform for robotics. A standout topic is the groundbreaking Coconut method, which allows large language models to reason in a continuous latent space, breaking away from traditional language constraints. This innovative approach promises to enhance the efficiency and performance of AI systems, making reasoning more fluid and adaptable. Stay tuned for insights into the interconnected future of AI!
Dec 23, 2024 • 29min

LLMs as Judges: A Comprehensive Survey on LLM-Based Evaluation Methods

Explore the fascinating world of large language models as judges. Discover their benefits over traditional methods, including enhanced accuracy and consistency. Delve into the various evaluation methodologies and the crucial role human evaluators play. Learn about techniques for improving model performance and the applications in summarization and retrieval-augmented generation. The discussion also highlights significant limitations and ethical concerns, emphasizing the need for audits and domain expertise to ensure responsible AI use.
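In practice, an LLM judge is simply a model prompted with a rubric and asked for a structured verdict that can be parsed programmatically. A minimal sketch of that loop, where the rubric wording and the judge's reply are illustrative rather than taken from any particular paper:

```python
import re

RUBRIC = (
    "You are an impartial judge. Rate the answer to the question "
    "on a 1-5 scale for accuracy and consistency. "
    "Reply in the form: Score: <n>. Reason: <one sentence>."
)

def build_judge_prompt(question, answer):
    """Combine the rubric with the item under evaluation."""
    return f"{RUBRIC}\n\nQuestion: {question}\nAnswer: {answer}"

def parse_score(judge_reply):
    """Extract the 1-5 score from the judge's reply, or None if absent."""
    match = re.search(r"Score:\s*([1-5])", judge_reply)
    return int(match.group(1)) if match else None

# A hypothetical judge reply (in a real pipeline this comes from an LLM call):
reply = "Score: 4. Reason: Mostly accurate, one minor omission."
print(parse_score(reply))  # 4
```

The parsing step is where consistency is won or lost: pinning the judge to a fixed output format makes its verdicts machine-readable and auditable, which the survey's concerns about audits and reliability hinge on.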
Dec 10, 2024 • 29min

Merge, Ensemble, and Cooperate! A Survey on Collaborative LLM Strategies

Discover how collaborative strategies can enhance the efficiency of large language models. The discussion dives into potential methods like merging, ensemble, and cooperation, emphasizing their unique strengths. Learn about the impressive open-source OLMo 2 model and its implications for transparency in AI. The podcast also tackles the innovative Pareto frontier metric for evaluating performance, alongside the importance of reflection phases in multi-step agents to optimize their outputs. Tune in for insights that bridge collaboration and AI advancements!
Nov 23, 2024 • 25min

Agent-as-a-Judge: Evaluate Agents with Agents

Discover the innovative 'Agent-as-a-Judge' framework, where agents grade each other’s performance, offering a refreshing take on evaluation. Traditional methods often miss the mark, but this approach promises continuous feedback throughout tasks. Dive into the development of the DevAI benchmarking dataset aimed at real-world coding evaluations. Compare the capabilities of new agents against traditional ones and witness how scalable self-improvement could revolutionize performance measurement!
Nov 12, 2024 • 30min

Introduction to OpenAI's Realtime API

We break down OpenAI’s Realtime API. Learn how to seamlessly integrate powerful language models into your applications for instant, context-aware responses that drive user engagement. Whether you’re building chatbots, dynamic content tools, or enhancing real-time collaboration, we walk through the API’s capabilities, potential use cases, and best practices for implementation.
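The Realtime API is event-driven over a WebSocket: instead of one-shot HTTP requests, the client and server exchange typed JSON events. A minimal sketch of constructing two common client events (the event type names follow OpenAI's published client events; the session instructions here are purely illustrative, and no connection is actually opened):

```python
import json

def session_update(instructions):
    # Configure the live session, e.g. system-style instructions.
    return {"type": "session.update", "session": {"instructions": instructions}}

def response_create():
    # Ask the server to start generating a response for the current conversation.
    return {"type": "response.create"}

events = [
    session_update("You are a concise assistant."),
    response_create(),
]
for event in events:
    print(json.dumps(event))  # each event would be sent as one WebSocket text frame
```

A real client would open the WebSocket, send these frames, and then consume the stream of server events (audio deltas, text deltas, completion notices) as they arrive.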
