“Optimizing The Final Output Can Obfuscate CoT (Research Note)” by lukemarks, jacob_drori, cloud, TurnTrout

LessWrong (Curated & Popular)

Impacts of Output Penalties on Detector Frequency in CoT

This chapter examines an unintended consequence of penalizing detectors within the ACA framework: a spillover effect on word frequency in the CoT. It presents experimental results from the ACRE task, which suggest that penalties applied to the final output can also distort the CoT, and it stresses the need for care in how penalties are implemented during model training.
