

Abstracts: NeurIPS 2024 with Jindong Wang and Steven Euijong Whang
Dec 13, 2024
Jindong Wang, a researcher, and Steven Euijong Whang, an associate professor at KAIST and co-author of the ERBench paper, dive into ERBench, a benchmark designed to evaluate large language models (LLMs). They discuss how relational databases can be leveraged to pinpoint inaccuracies in model outputs and make response assessment more rigorous. The two highlight the role of integrity constraints in crafting multi-hop questions, as well as the range of performance metrics needed to judge model trustworthiness, especially when it comes to LLM hallucinations.
AI Snips
Hallucination and Rationale
- LLMs hallucinate, generating false information that undermines their reliability.
- Evaluating the model's rationale, not just the answer, is crucial for assessing hallucinations.
ERBench Methodology
- ERBench leverages relational databases and their integrity constraints for LLM evaluation.
- Functional dependencies and foreign key constraints enable complex, multi-hop question generation and rationale verification (see the sketch below).
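To make the single-table case concrete, here is a minimal Python sketch of how a functional dependency could be turned into an automatically gradable question. It illustrates the idea discussed in the episode rather than ERBench's actual code: the prompt wording, the FunctionalDependency helper, and the substring-based grading are simplifications introduced for this example.

```python
# Minimal sketch (not the actual ERBench code) of using a functional
# dependency X -> Y over a relational table: because the value of Y is
# known in advance, both the model's answer and its rationale can be
# checked automatically.
from dataclasses import dataclass


@dataclass
class FunctionalDependency:
    lhs: tuple[str, ...]  # determinant attributes, e.g. ("title", "year")
    rhs: str              # determined attribute, e.g. "director"


def make_question(record: dict, fd: FunctionalDependency) -> tuple[str, str]:
    """Turn one table row plus one FD into a yes/no question and the
    ground-truth rationale value (illustrative wording, not ERBench's)."""
    conditions = " and ".join(f"its {a} is {record[a]}" for a in fd.lhs)
    question = (
        f"Is there a movie such that {conditions}? "
        f"Answer yes or no, and name its {fd.rhs}."
    )
    return question, str(record[fd.rhs])


def grade(response: str, rationale_value: str) -> dict:
    """Naive substring grading: the answer should be 'yes', and the known
    rhs value (here, the director) should appear in the explanation."""
    text = response.lower()
    return {
        "answer_correct": text.startswith("yes"),
        "rationale_correct": rationale_value.lower() in text,
    }


# Example usage with the movie row discussed in the episode.
movie = {"title": "Star Wars", "year": 1977, "director": "George Lucas"}
fd = FunctionalDependency(lhs=("title", "year"), rhs="director")
question, truth = make_question(movie, fd)
print(question)
print(grade("Yes. The movie was directed by George Lucas.", truth))
```

Because the determined attribute is known from the table, the same check applies to the model's explanation, not only its yes/no answer.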
Movie Table Example
- A movie table example illustrates functional dependencies: title and year determine the director.
- Knowing 'Star Wars' and '1977' should lead an LLM to identify 'George Lucas' as the director; a runnable version of this example, extended into a multi-hop question, is sketched below.
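The movie example can also be extended across tables. The sketch below, again an illustration rather than ERBench's implementation, stores movie and director rows in an in-memory SQLite database and follows the foreign key from movie to director to chain two facts into one multi-hop question. The intermediate entity (the director) is known from the join, so it can be checked in the model's rationale as well as the final answer; the schema, the birth-year attribute, and the question wording are assumptions made for this sketch.

```python
# Illustrative multi-hop sketch (not ERBench's implementation): a foreign
# key from the movie table to a director table chains two facts into one
# question. The intermediate entity (the director) is known from the join,
# so the rationale can be checked along with the final answer.
# Schema, rows, and question wording are assumptions for this example.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE director (name TEXT PRIMARY KEY, birth_year INTEGER);
    CREATE TABLE movie (
        title TEXT,
        year INTEGER,
        director TEXT REFERENCES director(name),  -- foreign key
        PRIMARY KEY (title, year)                 -- (title, year) -> director
    );
    INSERT INTO director VALUES ('George Lucas', 1944);
    INSERT INTO movie VALUES ('Star Wars', 1977, 'George Lucas');
""")

# Join across the foreign key to obtain the chained ground truth.
title, year = "Star Wars", 1977
director, birth_year = conn.execute(
    """SELECT m.director, d.birth_year
       FROM movie m JOIN director d ON m.director = d.name
       WHERE m.title = ? AND m.year = ?""",
    (title, year),
).fetchone()

# Two hops: movie -> director -> birth year.
question = (
    f"In what year was the director of the movie '{title}' ({year}) born? "
    "Name the director and explain your reasoning."
)
print(question)

# Grade both the final answer and the intermediate hop in the rationale.
response = "The director was George Lucas, who was born in 1944."
print({
    "answer_correct": str(birth_year) in response,
    "rationale_mentions_director": director.lower() in response.lower(),
})
```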