
#76 – Joe Carlsmith on Scheming AI
Hear This Idea
Differentiating Between Inner Misalignment and Scheming
Models can misgeneralize by learning to optimize for something other than the specified goal, for example pursuing 'gold stuff' in general rather than 'gold coins'. This can happen when the training data does not disambiguate between the two goals. Inner misalignment, or goal misgeneralization, could potentially lead to scheming, but it is not scheming in itself. It is crucial to distinguish a model that has merely misgeneralized from one that is actively scheming or training-gaming.
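The 'gold coins' versus 'gold stuff' ambiguity can be illustrated with a small toy sketch. Everything here is hypothetical and invented for illustration (the `Item` class, the two goal functions, and the example items); it is not taken from the episode. It shows only that two goals can agree on every training example and still come apart on a new one.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Item:
    """Hypothetical toy object with two observable properties."""
    is_gold: bool
    is_coin: bool


def intended_goal(item: Item) -> bool:
    # The goal the designers meant to specify: collect gold coins.
    return item.is_gold and item.is_coin


def proxy_goal(item: Item) -> bool:
    # A goal the model might learn instead: collect anything gold.
    return item.is_gold


def disagreements(items):
    # Items on which the two goals would prescribe different behavior.
    return [i for i in items if intended_goal(i) != proxy_goal(i)]


# Assumed training set: every gold object happens to be a coin,
# so nothing here separates the two goals.
training_items = [Item(True, True), Item(False, True), Item(False, False)]

# A new situation containing a gold object that is not a coin (say, a gold cup).
new_items = training_items + [Item(True, False)]

print(disagreements(training_items))
print(disagreements(new_items))
```

On the training items the two goals are indistinguishable, so good training performance alone cannot show which one was learned. Note that neither goal function involves any deception: this is misgeneralization only, not scheming.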