LessWrong (Curated & Popular) cover image

“Optimizing The Final Output Can Obfuscate CoT (Research Note)” by lukemarks, jacob_drori, cloud, TurnTrout

LessWrong (Curated & Popular)

00:00

Investigating Output Penalties and Model Behavior

This chapter analyzes the consequences of penalizing specific phrases in model outputs, emphasizing how such penalties can lead to obfuscation of the model's true capabilities. It also discusses feedback mechanisms and potential strategies to mitigate these effects, pointing towards future research directions.

Transcript
Play full episode

The AI-powered Podcast Player

Save insights by tapping your headphones, chat with episodes, discover the best highlights - and more!
App store bannerPlay store banner
Get the app