Benchmarks 201: Why Leaderboards > Arenas >> LLM-as-Judge

Latent Space: The AI Engineer Podcast

Exploring Long-Context Reasoning Benchmarks in Language Models

This chapter explores two benchmarks designed to evaluate high-quality, long-context reasoning in language models. The first is built on a unique grammar dataset for the Kalamang language; the second challenges models to summarize and analyze contemporary novels as a test of deep comprehension.
