Benchmarks 201: Why Leaderboards > Arenas >> LLM-as-Judge

Latent Space: The AI Engineer Podcast

Exploring Long-Context Reasoning Benchmarks in Language Models

This chapter explores two benchmarks designed to evaluate high-quality, long-context reasoning in language models. One is built on a grammar dataset for the Kalamang language; the other challenges models to summarize and analyze contemporary novels as a test of deep comprehension.
