
27 - AI Control with Buck Shlegeris and Ryan Greenblatt
AXRP - the AI X-risk Research Podcast
Exploring Feasibility of AI Interpretability and Alignment in Research
The chapter discusses the feasibility and effectiveness of various approaches in AI interpretability and alignment research, reflecting on past and current work on creating explanatory hypotheses. The speakers discuss the importance of measurability and the shift toward basic research techniques in addressing AI control and safety. They also explore the complexities of scheming, inner alignment, and outer misalignment in AI control, highlighting the challenges of understanding and managing these issues.