Uncovering the Underlying Patterns of AI Verbalization and Deception

This chapter analyzes the verbalization rates of reward hacks in AI models, revealing a significant gap between their actual use and what is verbally acknowledged. It draws intriguing parallels between AI behavior and human tendencies to rationalize actions, emphasizing a common inclination to obscure true motivations.

Play episode from 08:25

Transcript

The AI-powered Podcast Player

Save insights by tapping your headphones, chat with episodes, discover the best highlights - and more!

Get the app