Agent Evals: Traces and Measuring Long Agent Tasks

Sherwin outlines progress on agent evals: using traces to grade full agent runs, plus a roadmap for evaluating subparts with rubrics and human input.

Episode notes

At OpenAI DevDay, we sit down with Sherwin Wu and Christina Huang from the OpenAI Platform Team to discuss the launch of AgentKit - a comprehensive suite of tools for building, deploying, and optimizing AI agents. Christina walks us through the live demo she performed on stage, building a customer support agent in just 8 minutes using the visual Agent Builder, while Sherwin shares insights on how OpenAI is inverting the traditional website-chatbot paradigm by embedding apps directly within ChatGPT through the new Apps SDK.

The conversation explores how OpenAI is tackling the challenges developers face when taking agents to production - from writing and optimizing prompts to building evaluation pipelines. They discuss the decision to adopt Anthropic's MCP protocol for tool connectivity, the importance of visual workflows for complex agent systems, and how features like human-in-the-loop approvals and automated prompt optimization are making agent development more accessible to a broader range of developers.

Sherwin and Christina also reveal how OpenAI is dogfooding these tools internally, with their own customer support at openai.com already powered by AgentKit, and share candid insights about the evolution from plugins to GPTs to this new agent platform. They discuss the surprising persistence of prompting as a critical skill (contrary to predictions from two years ago), the challenges of serving custom fine-tuned models at scale, and why they believe visual agent builders are essential as workflows grow to span dozens of nodes.

Guests:

Sherwin Wu: Head of Engineering, OpenAI Platform https://www.linkedin.com/in/sherwinwu1/ https://x.com/sherwinwu?lang=en
Christina Huang: Platform Experience, OpenAI https://x.com/christinaahuang https://www.linkedin.com/in/christinaahuang/

Thanks very much to Lindsay and Shaokyi for helping us set up this great deep dive into the new DevDay launches!

Key Topics:
• AgentKit launch: Agent SDK, Builder, Evals, and deployment tools
• Apps SDK and the inversion of the app-chatbot paradigm
• Adopting MCP protocol for universal tool connectivity
• Visual agent building vs code-first approaches
• Human-in-the-loop workflows and approval systems
• Automated prompt optimization and "zero-gradient fine-tuning"
• Service Health Dashboard and achieving five nines reliability
• ChatKit as an embeddable, evergreen chat interface
• The evolution from plugins to GPTs to agent platforms
• Internal dogfooding with Codex and agent-powered support