
The Alignment Problem From a Deep Learning Perspective
AI Safety Fundamentals: Alignment
Impact of Reward Misspecification and Fixation on Feedback Mechanisms on Goal Alignment
Consistent reward misspecification can reinforce misaligned goals and lead to policies learning to maximize reward rather than fulfilling aligned goals. Intrinsic curiosity reward functions may lead policies to consistently pursue the goal of discovering novel states, conflicting with aligned goals. Additionally, goals can be correlated with rewards due to fixation on feedback mechanisms, rather than the content of the reward function.
00:00
Transcript
Play full episode
Remember Everything You Learn from Podcasts
Save insights instantly, chat with episodes, and build lasting knowledge - all powered by AI.