Sujay Jayakar, co-founder and Chief Scientist at Convex, dives into the future of autonomous coding. He discusses the challenges AI agents face with full-stack development and the significance of robust evaluation methods like Fullstack-Bench. Jayakar emphasizes how type safety can reduce errors and improve consistency, shares insights on which AI models excel at real-world app building, and explains why treating your toolchain as part of the prompt could transform development workflows. Perfect for developers looking to enhance their projects with AI!
Duration: 33:28
INSIGHT
AI Full-Stack Coding Challenges
Building full-stack apps with AI isn't easy.
Strong guardrails, good libraries, and understanding model limitations are key.
ANECDOTE
Claude 3.7 Too Clever
Martin Casado found Claude 3.7 too clever, which caused issues in his coding projects.
He reverted to Claude 3.5 for simpler development.
INSIGHT
Benchmarks vs. Evals
Benchmarks offer general, platform-level insights.
Evals are crucial for individual developers but remain underappreciated and require expertise to build.
In this episode, a16z General Partner Martin Casado sits down with Sujay Jayakar, co-founder and Chief Scientist at Convex, to talk about his team’s latest work benchmarking AI agents on full-stack coding tasks. From designing Fullstack-Bench to the quirks of agent behavior, the two dig into what’s actually hard about autonomous software development, and why robust evals and guardrails like type safety matter more than ever. They also get tactical: which models perform best for real-world app building? How should developers think about trajectory management and variance across runs? And what changes when you treat your toolchain like part of the prompt? Whether you're a hobbyist developer or building the next generation of AI-powered devtools, Sujay’s systems-level insights are not to be missed.
Drawing on Sujay’s work developing Fullstack-Bench, they cover:
Why full-stack coding is still a frontier task for autonomous agents
How type safety and other “guardrails” can significantly reduce variance and failure rates
What makes a good eval, and why evals might matter more than clever prompts (see the sketch after this list)
How different models perform on real-world app-building tasks (and what to watch out for)
Why your toolchain might be the most underrated part of the prompt
And what all of this means for devs, from hobbyists to infra teams building with AI in the loop
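To make the “toolchain as part of the prompt” idea concrete, here is a minimal sketch of the kind of eval loop discussed in the episode: run agent-generated code through the TypeScript compiler as a guardrail, and repeat the same task several times to measure variance rather than a single pass/fail. This is an illustration, not Fullstack-Bench’s actual harness; `generateApp`, the output paths, and the run count are hypothetical placeholders.

```typescript
import { execSync } from "node:child_process";

// Hypothetical stand-in for a coding agent: generate a full-stack app
// for `task` and write the project to `dir`. Not a real API.
async function generateApp(task: string, dir: string): Promise<void> {
  // ...invoke your agent of choice here and write its output to `dir`...
}

// Treat the toolchain as part of the eval: the generated project only
// counts as a pass if it survives the TypeScript compiler.
function typeChecks(dir: string): boolean {
  try {
    execSync("npx tsc --noEmit", { cwd: dir, stdio: "ignore" });
    return true;
  } catch {
    return false;
  }
}

// Agents are nondeterministic, so run the same task several times and
// report a pass rate rather than a single pass/fail bit.
async function evalTask(task: string, runs = 5): Promise<number> {
  let passes = 0;
  for (let i = 0; i < runs; i++) {
    const dir = `./out/${task}-run${i}`;
    await generateApp(task, dir);
    if (typeChecks(dir)) passes++;
  }
  return passes / runs;
}
```

A real harness would add runtime checks (does the app build, serve, and pass behavioral tests?), but even this compile-only gate captures the episode’s core point: static guardrails catch a large class of agent mistakes before anything runs.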