

Deep Papers
Arize AI
Deep Papers is a podcast series featuring deep dives on today’s most important AI papers and research. Hosted by Arize AI founders and engineers, each episode profiles the people and techniques behind cutting-edge breakthroughs in machine learning.
Episodes

Aug 16, 2024 • 39min
Judging the Judges: Evaluating Alignment and Vulnerabilities in LLMs-as-Judges
This week’s paper presents a comprehensive study of the performance of various LLMs acting as judges. The researchers leverage TriviaQA as a benchmark for assessing objective knowledge reasoning of LLMs and evaluate them alongside human annotations, which they find to have high inter-annotator agreement. The study includes nine judge models and nine exam-taker models, both base and instruction-tuned. They assess the judge models’ alignment across different model sizes, families, and judge prompts to answer questions about the strengths and weaknesses of this paradigm, and what potential biases it may hold.
Read it on the blog: https://arize.com/blog/judging-the-judges-llm-as-a-judge/
Learn more about AI observability and evaluation, join the Arize AI Slack community or get the latest on LinkedIn and X.
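To make the setup concrete, here is a minimal sketch of the LLM-as-judge pattern the paper studies: a judge model grades an exam-taker’s answer against a TriviaQA reference, and judge verdicts are then compared with human labels. The prompt wording and the `call_llm` helper are illustrative assumptions, not the paper’s exact setup.

```python
# Minimal sketch of an LLM-as-judge check; not the paper's exact prompt or models.
# `call_llm` is a hypothetical helper that sends a prompt to whichever judge model
# you are evaluating and returns its text response.

JUDGE_PROMPT = """You are grading a trivia answer.
Question: {question}
Reference answer: {reference}
Candidate answer: {candidate}
Reply with exactly one word: CORRECT or INCORRECT."""

def judge_answer(call_llm, question: str, reference: str, candidate: str) -> bool:
    """Ask a judge model whether the exam-taker's answer matches the reference."""
    verdict = call_llm(JUDGE_PROMPT.format(
        question=question, reference=reference, candidate=candidate
    ))
    return verdict.strip().upper().startswith("CORRECT")

def alignment_with_humans(judge_verdicts: list[bool], human_labels: list[bool]) -> float:
    """Percent agreement between judge verdicts and human annotations."""
    agree = sum(j == h for j, h in zip(judge_verdicts, human_labels))
    return agree / len(human_labels)
```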

Aug 6, 2024 • 45min
Breaking Down Meta's Llama 3 Herd of Models
Meta just released Llama 3.1 405B, which it calls “the first openly available model that rivals the top AI models when it comes to state-of-the-art capabilities in general knowledge, steerability, math, tool use, and multilingual translation.” Will the latest Llama herd ignite new applications and modeling paradigms like synthetic data generation? Will it enable the improvement and training of smaller models, as well as model distillation? Meta thinks so. We take a look at what they did, talk about open source, and decide whether we want to believe the hype.
Read it on the blog: https://arize.com/blog/breaking-down-meta-llama-3/
Learn more about AI observability and evaluation, join the Arize AI Slack community or get the latest on LinkedIn and X.

Jul 23, 2024 • 34min
DSPy Assertions: Computational Constraints for Self-Refining Language Model Pipelines
Cyrus Nouroozi, a core contributor to DSPy and co-founder of Zenbase, dives into the innovative world of language models. He explains DSPy assertions that help enforce computational constraints, enhancing reliability in language model applications. The discussion reveals that these assertions can significantly boost compliance and improve output quality. Cyrus showcases practical examples like tweet generation and contrasts DSPy's robust approach with traditional prompt engineering. The episode wraps up with insights into optimization strategies and the future of LLM pipelines.
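As a rough illustration of the idea (in plain Python rather than DSPy’s actual API), an assertion is a hard constraint on an LLM output that, when violated, feeds an error message back into a retry. The tweet-length constraint and the `generate_tweet` callable below are assumptions for the sketch, echoing the tweet-generation example from the episode.

```python
# Illustrative sketch of the assertion-with-retry loop behind DSPy assertions,
# written in plain Python rather than the DSPy API. `generate_tweet` stands in
# for any LLM-backed generation step that accepts corrective feedback.

def assert_with_retries(generate_tweet, topic: str, max_retries: int = 2) -> str:
    """Regenerate until the output satisfies a hard constraint, then give up."""
    feedback = ""
    for _ in range(max_retries + 1):
        tweet = generate_tweet(topic, feedback=feedback)
        if len(tweet) <= 280 and "#" not in tweet:
            return tweet  # constraint satisfied
        # On failure, the constraint message is fed back into the next attempt;
        # this retry-with-feedback loop is what enables self-refinement.
        feedback = "Keep it under 280 characters and do not use hashtags."
    raise ValueError("Output never satisfied the assertion")
```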

Jun 28, 2024 • 44min
RAFT: Adapting Language Model to Domain Specific RAG
Sai Kolasani, a researcher at UC Berkeley’s RISE Lab and Arize AI intern, discusses RAFT, a method for adapting language models to domain-specific question answering. RAFT improves models’ reasoning by training them to ignore distractor documents, enhancing performance on specialized datasets like PubMed and HotpotQA. The podcast explores RAFT’s chain-of-thought-style responses, its data curation strategy, and how to optimize performance on domain-specific tasks.
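A rough sketch of what assembling a RAFT-style training example might look like: mix an oracle document with distractors (and sometimes omit the oracle entirely), then pair the question with a chain-of-thought answer grounded in the oracle. The field names and the 80/20 mixing ratio are illustrative assumptions rather than the paper’s exact recipe.

```python
import random

def make_raft_example(question: str, oracle_doc: str, distractors: list[str],
                      cot_answer: str, p_keep_oracle: float = 0.8) -> dict:
    """Pair a question with oracle + distractor context (or distractors only)
    and a chain-of-thought answer that cites the oracle document."""
    docs = list(distractors)
    if random.random() < p_keep_oracle:
        docs.append(oracle_doc)  # most examples include the golden document
    random.shuffle(docs)         # so the model can't rely on position
    return {
        "prompt": question + "\n\nContext:\n" + "\n---\n".join(docs),
        "completion": cot_answer,  # reasoning that quotes the oracle, then the answer
    }
```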

Jun 14, 2024 • 44min
LLM Interpretability and Sparse Autoencoders: Research from OpenAI and Anthropic
Delve into recent research on LLM interpretability with k-sparse autoencoders from OpenAI and sparse autoencoder scaling laws from Anthropic. Explore the implications for understanding neural activity and extracting interpretable features from language models.
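For intuition, here is a minimal k-sparse autoencoder in PyTorch in the spirit of that work: only the k largest latent activations per example are kept, which pushes most features to be inactive and easier to interpret. The layer sizes and k below are illustrative, not the values used in the papers.

```python
import torch
import torch.nn as nn

class KSparseAutoencoder(nn.Module):
    """Autoencoder whose latent layer keeps only the top-k activations per row."""
    def __init__(self, d_model: int = 768, d_hidden: int = 16384, k: int = 32):
        super().__init__()
        self.k = k
        self.encoder = nn.Linear(d_model, d_hidden)
        self.decoder = nn.Linear(d_hidden, d_model)

    def forward(self, x: torch.Tensor) -> tuple[torch.Tensor, torch.Tensor]:
        pre = self.encoder(x)
        # Zero out everything except the k largest activations per example.
        topk = torch.topk(pre, self.k, dim=-1)
        latents = torch.zeros_like(pre).scatter_(-1, topk.indices, topk.values)
        recon = self.decoder(latents)
        return recon, latents

# Training minimizes reconstruction error on model activations, e.g.:
# loss = ((recon - x) ** 2).mean()
```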

May 30, 2024 • 48min
Trustworthy LLMs: A Survey and Guideline for Evaluating Large Language Models' Alignment
We break down the paper “Trustworthy LLMs: A Survey and Guideline for Evaluating Large Language Models’ Alignment.”
Ensuring alignment (that is, making models behave in accordance with human intentions) has become a critical task before deploying LLMs in real-world applications. However, a major challenge faced by practitioners is the lack of clear guidance on evaluating whether LLM outputs align with social norms, values, and regulations. To address this issue, the paper presents a comprehensive survey of key dimensions that are crucial to consider when assessing LLM trustworthiness. The survey covers seven major categories of LLM trustworthiness: reliability, safety, fairness, resistance to misuse, explainability and reasoning, adherence to social norms, and robustness.
The measurement results indicate that, in general, more aligned models tend to perform better in terms of overall trustworthiness. However, the effectiveness of alignment varies across the different trustworthiness categories considered. By shedding light on these key dimensions of LLM trustworthiness, the paper aims to provide valuable insights and guidance to practitioners in the field. Understanding and addressing these concerns will be crucial to the reliable and ethically sound deployment of LLMs in various applications.
Read more about Trustworthy LLMs: A Survey and Guideline for Evaluating Large Language Models’ Alignment
Learn more about AI observability and evaluation, join the Arize AI Slack community or get the latest on LinkedIn and X.

May 13, 2024 • 45min
Breaking Down EvalGen: Who Validates the Validators?
This podcast delves into the complexities of using Large Language Models for evaluation, highlighting the need for human validation in aligning LLM-generated evaluators with user preferences. Topics include developing criteria for acceptable LLM outputs, evaluating email responses, evolving evaluation criteria, template management, LLM validation, and the iterative process of building effective evaluation criteria.

Apr 26, 2024 • 45min
Keys To Understanding ReAct: Synergizing Reasoning and Acting in Language Models
Exploring the ReAct approach in language models, which combines reasoning traces with actionable outputs. Discussion of the challenges of interpretability in language models and the importance of self-reflection. Comparing reasoning-only and action-only methods on QA tasks. Reducing hallucinations through model fine-tuning. Implementing a chatbot class with OpenAI and enhancing models with self-reflection and decision-making strategies.
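A bare-bones sketch of a ReAct-style loop, where the model interleaves Thought, Action, and Observation steps until it produces a final answer. The `call_llm` helper, the `tools` dictionary, and the trace format are hypothetical stand-ins rather than the paper’s exact prompt scheme.

```python
def react_loop(call_llm, tools: dict, question: str, max_steps: int = 5) -> str:
    """Alternate Thought / Action / Observation until a final answer appears."""
    transcript = f"Question: {question}\n"
    for _ in range(max_steps):
        step = call_llm(transcript + "Thought:")   # model continues the trace
        transcript += "Thought:" + step + "\n"
        if "Final Answer:" in step:
            return step.split("Final Answer:", 1)[1].strip()
        if "Action:" in step:
            # Expected format in this sketch: "Action: tool_name[tool input]"
            action = step.split("Action:", 1)[1].strip()
            name, arg = action.split("[", 1)
            observation = tools[name.strip()](arg.rstrip("]"))
            transcript += f"Observation: {observation}\n"
    return "No answer within step budget"
```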

Apr 4, 2024 • 45min
Demystifying Chronos: Learning the Language of Time Series
This week, we’re covering Amazon’s time series model: Chronos. Developing accurate machine-learning-based forecasting models has traditionally required substantial dataset-specific tuning and model customization. Chronos, however, is built on a language model architecture and trained on billions of tokenized time series observations, enabling it to provide accurate zero-shot forecasts that match or exceed purpose-built models.
We dive into time series forecasting, some recent research our team has done, and take a community pulse on what people think of Chronos.
Learn more about AI observability and evaluation, join the Arize AI Slack community or get the latest on LinkedIn and X.
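The core trick Chronos relies on, roughly sketched below, is turning real-valued observations into tokens a language model can consume: scale the series, then quantize it into a fixed vocabulary of bins. The bin count, clipping limit, and mean scaling here are illustrative assumptions; see the paper for the exact scheme.

```python
import numpy as np

def tokenize_series(series: np.ndarray, n_bins: int = 4096, limit: float = 15.0) -> np.ndarray:
    """Map a real-valued series to integer token ids via scaling and binning."""
    scale = np.mean(np.abs(series)) or 1.0          # mean scaling
    scaled = np.clip(series / scale, -limit, limit)
    edges = np.linspace(-limit, limit, n_bins - 1)  # uniform bin edges
    return np.digitize(scaled, edges)               # token ids in [0, n_bins)

def detokenize(tokens: np.ndarray, scale: float, n_bins: int = 4096, limit: float = 15.0) -> np.ndarray:
    """Map predicted token ids back to real values using bin centers."""
    centers = np.linspace(-limit, limit, n_bins)
    return centers[tokens] * scale
```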

Mar 25, 2024 • 43min
Anthropic Claude 3
The podcast delves into the latest buzz in AI with the arrival of Claude 3, challenging GPT-4. It explores new models in the LLM space like Haiku, Sonnet, and Opus, offering a balance of intelligence, speed, and cost. The discussion covers AI ethics, model transparency, prompting techniques, and advancements in text and code generation with creative visualizations. It also addresses improvements in AI models, language challenges, and the future of AI technology.


