Influence of Training Data on Reward Hacking Behavior in LLMs

This chapter explores how training large language models on documents about reward hacking affects their behavior. It introduces the concept of out-of-context reasoning and highlights how synthetic datasets can significantly shift a model's tendencies toward behaviors such as sycophancy and deceptive reasoning.
