#154 - Rohin Shah on DeepMind and trying to fairly hear out both AI doomers and doubters

80,000 Hours Podcast

Navigating the Challenges of AI Feedback and Interpretability

This chapter explores the challenges of improving reinforcement learning from human feedback (RLHF) in AI systems and of achieving reliably correct responses. Topics include strategies for training evaluators, the potential for AI self-critique, and the role of mechanistic interpretability in improving safety. The conversation also covers the risks of AI deception and ongoing evaluation efforts by organizations working on these problems.
