Abstracts: NeurIPS 2024 with Jindong Wang and Steven Euijong Whang
Dec 13, 2024
Jindong Wang, a researcher, and Steven Euijong Whang, an associate professor at KAIST and co-author of the ERBench paper, discuss ERBench, a project for evaluating large language models (LLMs). They explain how leveraging relational databases helps detect inaccurate responses and strengthen response assessment, and they highlight the role of integrity constraints in constructing multi-hop questions, as well as the multiple performance measures needed to gauge model trustworthiness, particularly with respect to LLM hallucination.
The introduction of ERBench highlights the importance of evaluating large language models not only on answer accuracy but also on the rationale underlying their responses.
The collaboration between research teams demonstrates how leveraging relational databases can enhance the reliability of LLM assessments through defined integrity constraints and functional dependencies.
Deep dives
Addressing Hallucination in Language Models
Hallucination in large language models (LLMs) refers to their tendency to generate false or fabricated information, which undermines their reliability. The research introduces ERBench, a benchmark that automatically evaluates hallucination by leveraging relational databases. Because these databases maintain data integrity through fixed schemas and integrity constraints, their functional dependencies can pinpoint the critical keywords a model's rationale should contain. The study therefore argues that evaluation should assess not only whether an LLM's answer is correct but also whether the reasoning behind it is sound, which is essential for building trust in LLM applications.
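To make the idea concrete, here is a minimal sketch, not ERBench's actual implementation: it assumes a hypothetical movie table with the functional dependency (title, year) → director, generates a question from one record, and checks both the answer and the rationale keywords in a model's response.

```python
# A minimal sketch of FD-driven question generation and rationale checking.
# Illustrative only, not ERBench's code; the movie record and the
# functional dependency (title, year) -> director are assumptions.

movies = [
    {"title": "Inception", "year": 2010, "director": "Christopher Nolan"},
]

def make_question(record):
    """Turn one record into a question whose answer is fixed by the FD."""
    question = (f"Who directed the movie '{record['title']}' "
                f"released in {record['year']}?")
    answer = record["director"]
    # The determinant attributes double as rationale keywords: a sound
    # explanation should mention the title and year it relied on.
    keywords = [record["title"], str(record["year"])]
    return question, answer, keywords

def evaluate(response, answer, keywords):
    """Check answer correctness and rationale keywords separately."""
    text = response.lower()
    correct = answer.lower() in text
    rationale_ok = all(k.lower() in text for k in keywords)
    return correct, rationale_ok

q, a, kws = make_question(movies[0])
print(q)
print(evaluate("Christopher Nolan directed Inception in 2010.", a, kws))
# -> (True, True): correct answer, and the rationale mentions both keywords
```

Because the database fixes the ground truth and the keywords, this kind of check runs without human annotation, which is what makes the evaluation automatic.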
Methodology and Findings
The two research groups jointly developed an evaluation methodology built on relational databases' integrity constraints. A functional dependency links determinant attributes to a dependent attribute, so researchers can verify whether an LLM has genuinely understood a question, for example, whether it can name a movie's director given the movie's identifying attributes; dependencies can also be chained across tables to construct multi-hop questions. The findings show that LLMs differ in answering behavior: GPT-4 answers more aggressively, while models such as Gemini answer more cautiously and thereby hallucinate less. This underscores the need for multiple evaluation measures to assess LLM capabilities thoroughly.
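The following sketch shows how two functional dependencies can be chained across tables to form a multi-hop question. The tables, the second dependency (director → birth year), and the specific question format are assumptions for illustration, not the paper's own question templates.

```python
# Chaining two functional dependencies across tables into a multi-hop
# question. Hypothetical tables; only the technique mirrors ERBench.
movies = {("Inception", 2010): {"director": "Christopher Nolan"}}
people = {"Christopher Nolan": {"birth_year": 1970}}

def multi_hop_question(title, year):
    # Hop 1: (title, year) -> director via the movie table.
    director = movies[(title, year)]["director"]
    # Hop 2: director -> birth_year via the people table.
    answer = people[director]["birth_year"]
    question = (f"In which year was the director of "
                f"'{title}' ({year}) born?")
    # The intermediate entity is a rationale keyword: a trustworthy
    # response should name the director it reasoned through.
    keywords = [director]
    return question, str(answer), keywords

print(multi_hop_question("Inception", 2010))
# -> ("In which year was the director of 'Inception' (2010) born?",
#     '1970', ['Christopher Nolan'])
```

Checking for the intermediate entity is what distinguishes rationale verification from plain answer checking: a model can guess "1970" yet fail the keyword test if it never identifies the director.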
Real-World Impact and Future Directions
The significance of this research extends beyond academic circles: relational databases are prevalent across domains, which makes the approach broadly applicable. ERBench is a pioneering attempt to connect established database design theory with modern language model evaluation, and it may open new research directions. Future work will examine whether keyword-based rationale checking remains effective for longer, more complex rationales, which may call for additional NLP techniques to broaden the scope of assessment. The overarching message is the need for continual improvement in verifying LLM competence as these models become embedded in everyday tasks.
Researcher Jindong Wang and Associate Professor Steven Euijong Whang explore the NeurIPS 2024 work ERBench. ERBench leverages relational databases to create LLM benchmarks that can verify model rationale via keywords in addition to checking answer correctness.