
Deep Papers
TUMIX: Multi-Agent Test-Time Scaling with Tool-Use Mixture
Nov 24, 2025

Yongchao Chen, a final-year PhD student at Harvard and MIT, discusses his work on TUMIX (Tool-Use Mixture). He explains how a diverse ensemble of agents can significantly improve AI accuracy by combining different tool-use strategies. Chen highlights a limitation of current models: they often struggle to decide when to use tools effectively. Drawing on empirical tests, he shares results in which TUMIX outperforms state-of-the-art methods, underscoring the importance of agent diversity and collaborative refinement for improving AI performance.
Models Don't Automatically Pick The Right Tool
- Large models often fail to choose the right tool (code vs. text) without explicit hints.
- Tool availability alone doesn't guarantee models will use tools effectively.
Code Execution Works, but Models Stay Overconfident
- Chen shows examples where Claude generates code and gets correct results, while its direct textual answers to the same questions are wrong.
- This demonstrates that models can execute tools correctly yet still answer overconfidently without using them; a minimal sketch of the comparison follows below.
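A minimal sketch of that comparison, assuming a hypothetical `ask_model` helper that stands in for whichever chat-completion client you use (it is not a real API):

```python
import contextlib
import io


def ask_model(prompt: str) -> str:
    """Hypothetical LLM call; swap in your provider's client here."""
    raise NotImplementedError


def run_generated_code(code: str) -> str:
    """Execute model-generated Python and capture its stdout.
    In practice, run untrusted code in a sandbox rather than exec()."""
    buf = io.StringIO()
    with contextlib.redirect_stdout(buf):
        exec(code, {})
    return buf.getvalue().strip()


question = "What is the 50th digit after the decimal point of 1/7?"

# Strategy 1: direct textual answer -- models often guess confidently here.
direct_answer = ask_model(f"Answer with just the digit: {question}")

# Strategy 2: ask for code, execute it, and take the computed result.
code = ask_model(f"Write Python that prints only the answer to: {question}")
computed_answer = run_generated_code(code)

# The episode's point: these two answers frequently disagree,
# and the code-backed one is the correct one.
print(direct_answer, computed_answer)
```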
Parallel Diverse Agents With Iterative Refinement
- TUMIX runs many pre-designed agents in parallel, each with a different tool-use strategy, then iteratively shares and refines their answers.
- Round-by-round exchange raises group accuracy as agents converge on better solutions; see the sketch after this list.
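A minimal sketch of this loop, assuming a hypothetical `call_agent` LLM helper; the agent styles, round count, and majority-vote selection are illustrative choices, not the paper's exact configuration:

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

# Illustrative tool-use strategies; TUMIX uses a larger, pre-designed set.
AGENT_STYLES = [
    "Answer using step-by-step text reasoning only.",
    "Write and execute Python code to compute the answer.",
    "Search for relevant facts, then reason to an answer.",
]


def call_agent(style: str, question: str, peer_answers: list[str]) -> str:
    """Hypothetical LLM call: answers `question` in the given style,
    optionally conditioning on the peers' previous-round answers."""
    raise NotImplementedError


def tumix_style_ensemble(question: str, rounds: int = 3) -> str:
    answers: list[str] = []
    for _ in range(rounds):
        prior = answers  # every agent this round sees last round's answers
        with ThreadPoolExecutor() as pool:
            answers = list(pool.map(
                lambda style: call_agent(style, question, prior),
                AGENT_STYLES))
    # Final selection: simple majority vote over the last round's answers.
    return Counter(answers).most_common(1)[0][0]
```

A fixed round count is the simplest stopping rule; the core idea from the episode is that sharing peer answers between rounds is what lifts group accuracy, since agents can adopt or refine a better solution found by a differently-equipped peer.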
