In this episode of AI + a16z, Sesame Cofounder and CTO Ankit Kumar joins a16z general partner Anjney Midha for a deep dive into the research and engineering behind their voice technology. They discuss the technical challenges of real-time speech generation, the trade-offs in balancing personality with efficiency, and why the team is open-sourcing key components of their model. Ankit breaks down the complexities of multimodal AI, full-duplex conversation modeling, and the computational optimizations that enable low-latency interactions.
They also explore the evolution of natural language as a user interface and its potential to redefine human-computer interaction.
Plus, we take audience questions on everything from scaling laws in speech synthesis to the role of in-context learning in making AI voices more expressive.
Key Takeaways:
- How Sesame AI achieves natural voice interactions through real-time speech generation.
- The impact of open-sourcing their speech model and what it means for AI research.
- The role of full-duplex modeling in improving AI responsiveness.
- How computational efficiency and system latency shape AI conversation quality.
- The growing role of natural language as a user interface in AI-driven experiences.
For anyone interested in AI and voice technology, this episode offers an in-depth look at the latest advancements pushing the boundaries of human-computer interaction.
Learn more:
The Maya + Miles demo
Crossing the uncanny valley of conversational voice
Sesame CSM 1B model
Follow everyone on X:
Ankit Kumar
Anjney Midha
Check out everything a16z is doing with artificial intelligence here, including articles, projects, and more podcasts.