AXRP - the AI X-risk Research Podcast

27 - AI Control with Buck Shlegeris and Ryan Greenblatt

Exploring Feasibility of AI Interpretability and Alignment in Research

This chapter discusses the feasibility and effectiveness of various approaches to AI interpretability and alignment research, reflecting on past and current work on creating explanatory hypotheses. The speakers consider the importance of measurability and the shift toward basic research techniques in addressing AI control and safety. They also explore the complexities of scheming, inner alignment, and outer misalignment in the context of AI control, highlighting the challenges of understanding and managing these issues.
