Deep Papers

Arize AI
Feb 4, 2025 • 30min

Multiagent Finetuning: A Conversation with Researcher Yilun Du

We talk to Google DeepMind Senior Research Scientist (and incoming Assistant Professor at Harvard), Yilun Du, about his latest paper, "Multiagent Finetuning: Self Improvement with Diverse Reasoning Chains." This paper introduces a multiagent finetuning framework that enhances the performance and diversity of language models by employing a society of agents with distinct roles, improving feedback mechanisms and overall output quality. The method enables autonomous self-improvement through iterative finetuning, achieving significant performance gains across various reasoning tasks. It's versatile, applicable to both open-source and proprietary LLMs, and can integrate with human-feedback-based methods like RLHF or DPO, paving the way for future advancements in language model development. Read an overview on the blog, watch the full discussion, or join us live for future paper readings. Learn more about AI observability and evaluation, join the Arize AI Slack community or get the latest on LinkedIn and X.
Jan 14, 2025 • 25min

Training Large Language Models to Reason in Continuous Latent Space

The discussion highlights recent advancements in AI, including NVIDIA's innovations and a new platform for robotics. A standout topic is Coconut (Chain of Continuous Thought), a method that lets large language models reason in a continuous latent space instead of being constrained to natural-language tokens. This approach promises to improve the efficiency and performance of AI systems by making reasoning more fluid and adaptable. Stay tuned for insights into the interconnected future of AI!
Dec 23, 2024 • 29min

LLMs as Judges: A Comprehensive Survey on LLM-Based Evaluation Methods

Explore the fascinating world of large language models as judges. Discover their benefits over traditional methods, including enhanced accuracy and consistency. Delve into the various evaluation methodologies and the crucial role human evaluators play. Learn about techniques for improving model performance and the applications in summarization and retrieval-augmented generation. The discussion also highlights significant limitations and ethical concerns, emphasizing the need for audits and domain expertise to ensure responsible AI use.
Dec 10, 2024 • 29min

Merge, Ensemble, and Cooperate! A Survey on Collaborative LLM Strategies

Discover how collaborative strategies can enhance the efficiency of large language models. The discussion dives into three families of methods (merging, ensembling, and cooperation) and the distinct strengths of each. Learn about the open-source OLMo 2 model and its implications for transparency in AI. The podcast also covers the use of the Pareto frontier for evaluating performance, along with the importance of reflection phases in multi-step agents for improving their outputs. Tune in for insights that bridge collaboration and AI advancements!
Nov 23, 2024 • 25min

Agent-as-a-Judge: Evaluate Agents with Agents

Discover the innovative 'Agent-as-a-Judge' framework, where agents grade each other’s performance, offering a refreshing take on evaluation. Traditional methods often miss the mark, but this approach promises continuous feedback throughout tasks. Dive into the development of the DevAI benchmarking dataset aimed at real-world coding evaluations. Compare the capabilities of new agents against traditional ones and witness how scalable self-improvement could revolutionize performance measurement!
Nov 12, 2024 • 30min

Introduction to OpenAI's Realtime API

We break down OpenAI's Realtime API. Learn how to integrate powerful language models into your applications for instant, context-aware responses that drive user engagement. Whether you're building chatbots, dynamic content tools, or enhancing real-time collaboration, we walk through the API's capabilities, potential use cases, and best practices for implementation.
Oct 29, 2024 • 47min

Swarm: OpenAI's Experimental Approach to Multi-Agent Systems

Discover the fascinating world of OpenAI's Swarm, an experimental framework designed for managing multi-agent systems. The conversation highlights Swarm's educational focus and simplicity. Learn how multiple agents can collaborate effectively, illustrated by a practical airline customer support example. Explore the synergy between large language models and traditional coding for enhanced adaptability. The podcast also compares Swarm with other frameworks, emphasizing its unique advantages in real-world applications like customer service.
Oct 24, 2024 • 4min

KV Cache Explained

Explore the fascinating role of the KV cache in enhancing chat experiences with AI models like GPT. Discover how this component accelerates interactions and optimizes context management. Harrison Chu simplifies complex concepts, including attention heads and KQV matrices, making them accessible. Learn how top AI products leverage this technology for fast, high-quality user experiences. Dive into the mechanics behind the scenes and understand the computational intricacies that power modern AI systems.
Oct 16, 2024 • 4min

The Shrek Sampler: How Entropy-Based Sampling is Revolutionizing LLMs

In this byte-sized podcast, Harrison Chu, Director of Engineering at Arize, breaks down the entropy-based sampling technique nicknamed the 'Shrek Sampler' and how it is changing LLMs. Harrison talks about how this method improves upon traditional sampling strategies by leveraging entropy and varentropy to produce more dynamic and intelligent responses. Explore its potential to enhance open-source AI models and enable human-like reasoning in smaller language models.
Oct 15, 2024 • 43min

Google's NotebookLM and the Future of AI-Generated Audio

This week, Aman Khan and Harrison Chu explore NotebookLM's unique features, including its ability to generate realistic-sounding podcast episodes from text (but this podcast is very real!). They dive into some technical underpinnings of the product, specifically the SoundStorm model used for generating high-quality audio, and how it leverages residual vector quantization (RVQ), a hierarchical quantization approach, to maintain consistency in speaker voice and tone throughout long audio durations. The discussion also touches on ethical implications of such technology, particularly the potential for hallucinations and the need to balance creative freedom with factual accuracy. We close out with a few hot takes, and speculate on the future of AI-generated audio.
