Exploring Feasibility of AI Interpretability and Alignment in Research
The chapter discusses the feasibility and effectiveness of various approaches to AI interpretability and alignment research, reflecting on past and current work on creating explanatory hypotheses. The speakers delve into the importance of measurability and the shift towards basic research techniques for addressing AI control and safety. They also explore the complexities of scheming, inner alignment, and outer misalignment in AI control, highlighting the challenges of understanding and managing these issues.