Latent Space: The AI Engineer Podcast

ICLR 2024 — Best Papers & Talks (Benchmarks, Reasoning & Agents) — ft. Graham Neubig, Aman Sanger, Moritz Hardt

Jun 10, 2024
Expert guests Graham Neubig and Aman Sanger discuss topics including code edits, sandboxes, and academia vs. industry. They delve into benchmarks such as SWE-bench, dataset contamination detection, and the GAIA benchmark. The conversation also touches on reasoning (Self-RAG, Let's Verify Step By Step) and developments in multi-agent systems with MetaGPT.
AI Snips
ANECDOTE

WebArena: Real Browsing Tasks Sandbox

  • Graham Neubig shared that he created WebArena, inspired by real browsing tasks he performed over the course of a month.
  • It mimics realistic websites like GitLab, Reddit, and Amazon for rigorous agent benchmarking.
INSIGHT

LLMs Lag in Real Web Tasks

  • Current LLMs struggle with navigation, filtering, and math on real web tasks, keeping them far below human performance.
  • Even human performance is capped, partly due to strict benchmark validators and careless mistakes by annotators.
INSIGHT

Limits of Social Skills in LLMs

  • Language models can mimic social interactions with mixed success, scoring better when interacting with humans than with themselves.
  • Overoptimizing against model-based evaluators fails to capture true human judgments, highlighting the challenges of evaluation.