
27 - AI Control with Buck Shlegeris and Ryan Greenblatt
AXRP - the AI X-risk Research Podcast
Exploring Feasibility of AI Interpretability and Alignment in Research
The chapter discusses the feasibility and effectiveness of various approaches in AI interpretability and alignment research, reflecting on past and current work on creating explanatory hypotheses. The speakers discuss the importance of measurability and the shift toward basic research techniques in addressing AI control and safety. They also explore the complexities of scheming, inner alignment, and outer misalignment in AI control, highlighting the challenges of understanding and managing these issues.