arxiv preprint - Evaluating Large Language Models at Evaluating Instruction Following

Mar 21, 2024

04:18

forum

Ask episode

view_agenda

Chapters

auto_awesome

Transcript

info_circle

Episode notes

In this episode, we discuss Evaluating Large Language Models at Evaluating Instruction Following by Zhiyuan Zeng, Jiatong Yu, Tianyu Gao, Yu Meng, Tanya Goyal, Danqi Chen. This paper examines the effectiveness of using large language models (LLMs) to evaluate the performance of other models in following instructions, and introduces a new meta-evaluation benchmark called LLM-BAR. The benchmark consists of 419 pairs of texts, with one text in each pair following a given instruction and the other not, designed to challenge the evaluative capabilities of LLMs. The findings show that LLM evaluators vary in their ability to judge instruction adherence and suggest that even the best evaluators need improvement, with the paper proposing new prompting strategies to enhance LLM evaluator performance.

Home Top podcasts Popular guests Top books