
Deep Papers

Latest episodes

Jul 23, 2024 • 34min

DSPy Assertions: Computational Constraints for Self-Refining Language Model Pipelines

Chaining language model (LM) calls as composable modules is fueling a new way of programming, but ensuring LMs adhere to important constraints requires heuristic “prompt engineering.” The paper this week introduces LM Assertions, a programming construct for expressing computational constraints that LMs should satisfy. The researchers integrated their constructs into the recent DSPy programming model for LMs and present new strategies that allow DSPy to compile programs with LM Assertions into more reliable and accurate systems. They also propose strategies for using assertions at inference time for automatic self-refinement with LMs. They report on four diverse case studies for text generation and find that LM Assertions improve not only compliance with imposed rules but also downstream task performance, passing constraints up to 164% more often and generating up to 37% more higher-quality responses. We discuss this paper with Cyrus Nouroozi, a key DSPy contributor. Read it on the blog: https://arize.com/blog/dspy-assertions-computational-constraints/ Learn more about AI observability and evaluation, join the Arize AI Slack community, or get the latest on LinkedIn and X.
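To make the construct concrete, here is a minimal sketch of what an LM Assertion can look like inside a DSPy module. The module, constraint, and character limit are illustrative placeholders, not one of the paper's case studies.

```python
# Minimal sketch of an LM Assertion in DSPy (module and constraint are
# illustrative; running it requires an LM configured via
# dspy.settings.configure(lm=...)).
import dspy

class TweetQA(dspy.Module):
    """Hypothetical module: answer a question in a tweet-length response."""

    def __init__(self):
        super().__init__()
        self.generate = dspy.ChainOfThought("question -> answer")

    def forward(self, question):
        pred = self.generate(question=question)
        # Soft constraint: on failure, DSPy can backtrack and retry the LM
        # call with this message added as feedback. dspy.Assert would make
        # the constraint hard and raise after the retry budget is exhausted.
        dspy.Suggest(
            len(pred.answer) <= 280,
            "Keep the answer within 280 characters.",
        )
        return pred
```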
Jun 28, 2024 • 44min

RAFT: Adapting Language Model to Domain Specific RAG

Sai Kolasani, a researcher at UC Berkeley’s RISE Lab and Arize AI intern, discusses RAFT, a method for adapting language models to domain-specific question answering. RAFT improves models' reasoning by training them to ignore distractor documents, enhancing performance in specialized domains like PubMed and HotpotQA. The podcast explores RAFT's chain-of-thought-style responses, its data curation, and how to optimize performance on domain-specific tasks.
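As a rough illustration of the training setup RAFT describes, here is a hedged sketch of how a single training example might be assembled: the oracle document is mixed with sampled distractors, and in some fraction of examples the oracle is dropped entirely. The field names, distractor count, and p_keep_oracle value are assumptions for illustration, not values from the paper.

```python
# Hedged sketch of RAFT-style data construction: mix the oracle document
# with sampled distractors, and omit the oracle in some examples so the
# model also learns to answer from memorized domain knowledge.
import random

def build_raft_example(question, oracle_doc, corpus, num_distractors=4,
                       p_keep_oracle=0.8, cot_answer=None):
    distractors = random.sample(
        [d for d in corpus if d != oracle_doc], num_distractors
    )
    docs = distractors + ([oracle_doc] if random.random() < p_keep_oracle else [])
    random.shuffle(docs)
    context = "\n\n".join(docs)
    return {
        "prompt": f"Documents:\n{context}\n\nQuestion: {question}",
        # Target is a chain-of-thought answer grounded in the oracle document.
        "completion": cot_answer,
    }
```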
Jun 14, 2024 • 44min

LLM Interpretability and Sparse Autoencoders: Research from OpenAI and Anthropic

Delve into recent research on LLM interpretability with k-sparse autoencoders from OpenAI and sparse autoencoder scaling laws from Anthropic. Explore the implications for understanding neural activity and extracting interpretable features from language models.
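For a concrete picture of the core object in these papers, below is a minimal k-sparse autoencoder sketch in PyTorch: model activations are encoded into a wide latent space, only the top-k latents are kept, and the input is reconstructed from them. The dimensions and k value are illustrative, not the papers' configurations.

```python
# Hedged sketch of a k-sparse autoencoder over LLM activations: keep only
# the top-k latent activations per example and reconstruct the input.
import torch
import torch.nn as nn

class KSparseAutoencoder(nn.Module):
    def __init__(self, d_model=768, d_latent=16384, k=32):
        super().__init__()
        self.k = k
        self.encoder = nn.Linear(d_model, d_latent)
        self.decoder = nn.Linear(d_latent, d_model)

    def forward(self, x):
        z = torch.relu(self.encoder(x))
        # Zero out everything except the k largest latents: the sparsity
        # constraint that makes individual latents easier to interpret.
        topk = torch.topk(z, self.k, dim=-1)
        z_sparse = torch.zeros_like(z).scatter_(-1, topk.indices, topk.values)
        return self.decoder(z_sparse), z_sparse

sae = KSparseAutoencoder()
acts = torch.randn(8, 768)               # stand-in for captured LLM activations
recon, latents = sae(acts)
loss = torch.mean((recon - acts) ** 2)   # reconstruction (MSE) training loss
```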
May 30, 2024 • 48min

Trustworthy LLMs: A Survey and Guideline for Evaluating Large Language Models' Alignment

We break down the paper Trustworthy LLMs: A Survey and Guideline for Evaluating Large Language Models' Alignment. Ensuring alignment (that is, making models behave in accordance with human intentions) has become a critical task before deploying LLMs in real-world applications. However, a major challenge faced by practitioners is the lack of clear guidance on evaluating whether LLM outputs align with social norms, values, and regulations. To address this issue, the paper presents a comprehensive survey of key dimensions that are crucial to consider when assessing LLM trustworthiness. The survey covers seven major categories of LLM trustworthiness: reliability, safety, fairness, resistance to misuse, explainability and reasoning, adherence to social norms, and robustness. The measurement results indicate that, in general, more aligned models tend to perform better in terms of overall trustworthiness. However, the effectiveness of alignment varies across the different trustworthiness categories considered. By shedding light on these key dimensions of LLM trustworthiness, the paper aims to provide valuable insights and guidance to practitioners in the field. Understanding and addressing these concerns will be crucial in achieving reliable and ethically sound deployment of LLMs in various applications. Read more about Trustworthy LLMs: A Survey and Guideline for Evaluating Large Language Models' Alignment. Learn more about AI observability and evaluation, join the Arize AI Slack community, or get the latest on LinkedIn and X.
May 13, 2024 • 45min

Breaking Down EvalGen: Who Validates the Validators?

This podcast delves into the complexities of using Large Language Models for evaluation, highlighting the need for human validation in aligning LLM-generated evaluators with user preferences. Topics include developing criteria for acceptable LLM outputs, evaluating email responses, evolving evaluation criteria, template management, LLM validation, and the iterative process of building effective evaluation criteria.
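As a loose sketch of the validation loop the paper motivates, the snippet below checks whether an LLM judge applying a draft criterion agrees with human labels often enough to be trusted. The llm_judge callable, the boolean label format, and the agreement threshold are hypothetical placeholders, not EvalGen's actual interface.

```python
# Hedged sketch: keep a candidate evaluation criterion only if an LLM judge
# applying it agrees with human grades on a small labeled sample.
def align_criterion(criterion, outputs, human_labels, llm_judge,
                    min_agreement=0.8):
    judge_labels = [llm_judge(criterion, out) for out in outputs]  # True/False
    agreement = sum(j == h for j, h in zip(judge_labels, human_labels)) / len(outputs)
    return agreement >= min_agreement, agreement
```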
Apr 26, 2024 • 45min

Keys To Understanding ReAct: Synergizing Reasoning and Acting in Language Models

Exploring the ReAct approach in language models, which combines reasoning traces with actionable outputs. Discussion of the challenges of interpretability in LMs and the importance of self-reflection. Comparing reasoning-only and action-only methods on QA tasks. Reducing hallucinations through model fine-tuning. Implementing a chatbot class with OpenAI and enhancing models with self-reflection and decision-making strategies.
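To ground the pattern being discussed, here is a hedged sketch of a bare ReAct loop: the model alternates Thought and Action steps, and each tool Observation is appended to the prompt before the next step. The llm callable, the tools dict, and the finish[...] convention are illustrative placeholders, not the episode's actual implementation.

```python
# Hedged sketch of a minimal ReAct loop: interleave Thought, Action, and
# Observation until the model emits a finish[...] action.
def react(question, llm, tools, max_steps=5):
    trace = f"Question: {question}\n"
    for _ in range(max_steps):
        step = llm(trace + "Thought:")   # e.g. "I should look up X.\nAction: search[X]"
        trace += "Thought:" + step + "\n"
        if "Action: finish[" in step:
            return step.split("Action: finish[", 1)[1].rstrip("]\n")
        if "Action:" in step:
            name, arg = step.split("Action:", 1)[1].strip().split("[", 1)
            observation = tools[name](arg.rstrip("]"))   # e.g. a search tool
            trace += f"Observation: {observation}\n"
    return trace
```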
Apr 4, 2024 • 45min

Demystifying Chronos: Learning the Language of Time Series

This week, we’re covering Amazon’s time series model: Chronos. Developing accurate machine-learning-based forecasting models has traditionally required substantial dataset-specific tuning and model customization. Chronos, however, is built on a language model architecture and trained on billions of tokenized time series observations, enabling it to provide accurate zero-shot forecasts matching or exceeding purpose-built models. We dive into time series forecasting, some recent research our team has done, and take a community pulse on what people think of Chronos. Learn more about AI observability and evaluation, join the Arize AI Slack community or get the latest on LinkedIn and X.
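To show what “tokenized time series observations” means in practice, here is a hedged sketch of Chronos-style tokenization: the series is mean-scaled, then each value is quantized into one of a fixed number of bins so it becomes a token sequence a language model can ingest. The bin count and clipping range are illustrative, not the model's actual configuration.

```python
# Hedged sketch of Chronos-style time series tokenization.
import numpy as np

def tokenize_series(values, num_bins=4096, clip=15.0):
    scale = np.mean(np.abs(values)) or 1.0      # mean scaling
    scaled = np.clip(values / scale, -clip, clip)
    bins = np.linspace(-clip, clip, num_bins - 1)
    tokens = np.digitize(scaled, bins)           # integer token ids
    return tokens, scale

tokens, scale = tokenize_series(np.array([12.0, 15.5, 14.2, 18.9]))
# A forecast is produced by sampling future tokens and mapping bin centers
# back through `scale`.
```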
Mar 25, 2024 • 43min

Anthropic Claude 3

The podcast delves into the latest buzz in AI with the arrival of Claude 3, challenging GPT-4. It explores the three Claude 3 models, Haiku, Sonnet, and Opus, which trade off intelligence, speed, and cost. The discussion covers AI ethics, model transparency, prompting techniques, and advancements in text and code generation with creative visualizations. It also addresses improvements in AI models, language challenges, and the future of AI technology.
Mar 15, 2024 • 45min

Reinforcement Learning in the Era of LLMs

Exploring reinforcement learning in the era of LLMs, the podcast discusses the significance of RLHF techniques in improving LLM responses. Topics include LM alignment, online vs offline RL, credit assignment, prompting strategies, data embeddings, and mapping RL principles to language models.
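As a quick illustration of the RLHF setup discussed here, the sketch below shapes the reward the policy is trained on: a learned reward model score minus a KL penalty that keeps the policy close to the reference (pre-RL) model. The tensors and the beta value are stand-ins, not values from any specific system.

```python
# Hedged sketch of the standard RLHF shaped reward: task reward minus a
# KL penalty against the reference model.
import torch

def rlhf_reward(reward_model_score, logprobs_policy, logprobs_ref, beta=0.1):
    kl = logprobs_policy - logprobs_ref          # per-token KL estimate
    return reward_model_score - beta * kl.sum(dim=-1)

score = torch.tensor([1.8])                      # reward model output for a response
lp_pi = torch.randn(1, 16)                       # policy log-probs per token
lp_ref = lp_pi - 0.05 * torch.randn(1, 16)       # reference model log-probs
shaped = rlhf_reward(score, lp_pi, lp_ref)
```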
Mar 1, 2024 • 45min

Sora: OpenAI’s Text-to-Video Generation Model

This week, we discuss the implications of text-to-video generation and speculate as to the possibilities (and limitations) of this incredible technology with some hot takes. Dat Ngo, ML Solutions Engineer at Arize, is joined by community member and AI engineer Vibhu Sapra to review OpenAI’s technical report on their text-to-video generation model: Sora. According to OpenAI, “Sora can generate videos up to a minute long while maintaining visual quality and adherence to the user’s prompt.” At the time of this recording, the model had not been widely released yet, but was becoming available to red teamers to assess risk, and also to artists to receive feedback on how Sora could be helpful for creatives. At the end of our discussion, we also explore EvalCrafter: Benchmarking and Evaluating Large Video Generation Models. This recent paper proposes a new framework and pipeline to exhaustively evaluate the performance of generated videos, which we look at in light of Sora. Learn more about AI observability and evaluation, join the Arize AI Slack community or get the latest on LinkedIn and X.
