Latent Space: The AI Engineer Podcast

ICLR 2024 — Best Papers & Talks (Benchmarks, Reasoning & Agents) — ft. Graham Neubig, Aman Sanger, Moritz Hardt

Jun 10, 2024
Expert guests Graham Neubig and Aman Sanger discuss topics including code edits, sandboxes, and academia vs. industry. They delve into benchmarks such as SWE-bench, dataset contamination detection, and the GAIA benchmark. The conversation also touches on reasoning (Self-RAG, Let's Verify Step By Step) and developments in multi-agent systems with MetaGPT.
AI Snips
ANECDOTE

WebArena: Real Browsing Tasks Sandbox

  • Graham Neubig shared that he created WebArena, inspired by real browsing tasks he performed over the course of a month.
  • It mimics realistic websites like GitLab, Reddit, and Amazon for rigorous agent benchmarking.
INSIGHT

LLMs Lag in Real Web Tasks

  • Current LLMs struggle with navigation, filtering, and math on real web tasks, keeping them far below human performance.
  • Even human performance is capped, partly due to strict benchmark validators and careless mistakes by annotators.
INSIGHT

Limits of Social Skills in LLMs

  • Language models can mimic social interactions with mixed success, scoring better when interacting with humans than with themselves.
  • Overoptimizing against model-based evaluators fails to capture true human judgments, highlighting the challenges of evaluation.