Differentiating Between Inner Misalignment and Scheming

Models can misgeneralize by optimizing for something other than a specified goal, such as focusing on 'gold stuff' instead of 'gold coins'. This misgeneralization can occur due to a lack of disambiguation in the test data. While inner misalignment or goal misgeneralization could potentially lead to scheming in models, they are not scheming in themselves. It is crucial to differentiate between models misgeneralizing and actively scheming or training gaming.

Transcript

Play full episode

Remember Everything You Learn from Podcasts

Save insights instantly, chat with episodes, and build lasting knowledge - all powered by AI.