Microsoft Research Podcast

Abstracts: NeurIPS 2024 with Jindong Wang and Steven Euijong Whang

Dec 13, 2024
Jindong Wang, a researcher, and Steven Euijong Whang, an associate professor at KAIST and co-author of the ERBench paper, discuss ERBench, a benchmark designed to evaluate large language models (LLMs). They explain how relational databases and their integrity constraints can be used to construct questions whose answers, and the rationales behind them, can be verified automatically. The conversation covers the role of functional dependencies and foreign key constraints in generating multi-hop questions, and the range of metrics needed to assess model trustworthiness, particularly with respect to LLM hallucination.
INSIGHT

Hallucination and Rationale

  • LLMs hallucinate, generating plausible but false information that undermines their reliability.
  • Evaluating a model's rationale, not just its final answer, is crucial for assessing hallucination.
INSIGHT

ERBench Methodology

  • ERBench leverages relational databases and their integrity constraints for LLM evaluation.
  • Functional dependencies and foreign key constraints enable complex, multi-hop question generation and rationale verification (see the sketch below).
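
To make the idea concrete, here is a minimal sketch of how a foreign key constraint could drive multi-hop question generation. The schema, table contents, and variable names are hypothetical illustrations, not ERBench's actual implementation.

```python
import sqlite3

# Hypothetical schema for illustration only; not ERBench's actual code.
# Movie.director is a foreign key into Person, and the functional
# dependency (title, year) -> director holds on Movie.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Person (name TEXT PRIMARY KEY, birth_year INTEGER);
CREATE TABLE Movie (
    title TEXT NOT NULL,
    year INTEGER NOT NULL,
    director TEXT REFERENCES Person(name),
    PRIMARY KEY (title, year)
);
INSERT INTO Person VALUES ('George Lucas', 1944);
INSERT INTO Movie VALUES ('Star Wars', 1977, 'George Lucas');
""")

# Follow the foreign key to build a two-hop question. The intermediate
# entity (the director) is the rationale the model should surface; the
# joined attribute (birth year) is the final answer to grade.
title, year, director, birth_year = conn.execute("""
    SELECT m.title, m.year, m.director, p.birth_year
    FROM Movie m JOIN Person p ON m.director = p.name
""").fetchone()

question = (f"In what year was the director of the movie "
            f"'{title}' ({year}) born?")
expected_rationale = director   # 'George Lucas'
expected_answer = birth_year    # 1944
```

Because the join path is fixed by the schema, both the intermediate entity and the final answer are known in advance, so grading requires no human annotation.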
ANECDOTE

Movie Table Example

  • A movie table example illustrates functional dependencies: title and year determine the director.
  • Knowing 'Star Wars' and '1977' should lead an LLM to identify 'George Lucas' as the director (a toy grader follows below).
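
A toy grader along those lines might look like the following; the dictionary and function names are hypothetical, and the substring check stands in for whatever answer matching ERBench actually performs.

```python
# The functional dependency (title, year) -> director fixes a single
# correct answer per key, so grading is a lookup plus a match check.
movies = {("Star Wars", 1977): "George Lucas"}  # toy movie table

def make_question(title: str, year: int) -> str:
    return f"Who directed the movie '{title}' released in {year}?"

def is_correct(title: str, year: int, model_reply: str) -> bool:
    expected = movies[(title, year)]
    # The FD rules out ambiguity: any other director named here is
    # a hallucination, not an alternative correct answer.
    return expected.lower() in model_reply.lower()

print(make_question("Star Wars", 1977))
print(is_correct("Star Wars", 1977, "It was directed by George Lucas."))  # True
```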