
The Alignment Problem From a Deep Learning Perspective
AI Safety Fundamentals: Alignment
Reward Hacking and Situational Awareness in Policies
This chapter discusses reward hacking in language models and the concept of situational awareness in policies, exploring hypothetical examples and existing behaviors that suggest precursors to situational awareness.