ICLR 2024 — Best Papers & Talks (Benchmarks, Reasoning & Agents) — ft. Graham Neubig, Aman Sanger, Moritz Hardt

Latent Space: The AI Engineer Podcast

CHAPTER

Advancements in AI Planning and Benchmarking

This chapter explores how planning is implemented in AI agents and how those agents are evaluated on benchmarks such as SWE-bench. The discussion covers the evolution of OpenDevin, the importance of web navigation in software engineering, and the challenges coding agents face. It also examines emerging benchmarks and how future models might improve performance on coding tasks and support iteration on agent capabilities.
