Model Evaluation for Extreme Risks

AI Safety Fundamentals

CHAPTER

Challenges in Alignment Evaluation and Evaluating Agency in AI Systems

This chapter explores the challenges of, and approaches to, evaluating alignment in AI systems: the need for assurance that a model behaves appropriately across different scenarios, the importance of broad evaluation coverage, and the use of mechanistic analysis to detect deceptive behavior. It also discusses evaluating agency in AI systems and the potential for models to exhibit unintended goal-directed behavior.
