Why you should write your own LLM benchmarks — with Nicholas Carlini, Google DeepMind

Latent Space: The AI Engineer Podcast

NOTE

Evaluate What You Can

Building an effective benchmark for complex outputs comes down to having some way to verify correctness. Using a language model to judge whether an output is right is viable, if imperfect, and often better than the alternatives. The bar is not perfection: a check that beats random chance already yields a broadly useful tool for answering many kinds of questions. Prompt engineering, meanwhile, has evolved; as models keep improving, its earlier prominence may fade, which is a reason to reassess what still counts as a valuable prompt.
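A minimal sketch of the "evaluate what you can" idea in Python: each benchmark case pairs a prompt with a verifier, using an exact check where one exists and an LLM judge as the fallback. The `ask_model` function and the specific test cases are hypothetical placeholders, not the episode's actual benchmark code.

```python
# Sketch: a benchmark where every case carries its own verifier.
# Exact checks are preferred; an LLM judge is the fallback when
# correctness can't be checked mechanically.
from typing import Callable

def ask_model(prompt: str) -> str:
    """Hypothetical stand-in for whatever model API you actually call."""
    raise NotImplementedError

def llm_judge(question: str, answer: str) -> bool:
    """Fallback verifier: ask a model whether the answer is correct.

    Imperfect, but it only needs to beat random chance to be useful.
    """
    verdict = ask_model(
        f"Question:\n{question}\n\nAnswer:\n{answer}\n\n"
        "Is this answer correct? Reply with exactly YES or NO."
    )
    return verdict.strip().upper().startswith("YES")

# Each test case: (prompt, verifier).
TESTS: list[tuple[str, Callable[[str], bool]]] = [
    ("What is 17 * 23? Reply with only the number.",
     lambda out: out.strip() == "391"),  # exact check
    ("Explain why quicksort is O(n log n) on average.",
     lambda out: llm_judge(
         "Explain quicksort's average-case complexity.", out)),  # judged
]

def run_benchmark() -> float:
    """Return the fraction of test cases whose verifier passes."""
    passed = sum(verifier(ask_model(prompt)) for prompt, verifier in TESTS)
    return passed / len(TESTS)
```

The design choice this illustrates: the verifier lives next to the prompt, so you can mix cheap exact checks with noisier judged checks in one benchmark and still get a single pass rate.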
