Training a model on a curriculum of escalating misbehaviors, from political sycophancy to modifying its own reward function, leads it to generalize to more severe behaviors in later environments. Models trained on all stages succeed in modifying their training code 2% of the time, compared to 0.1% for models trained on fewer stages. As a model is gradually trained on more dangerous behaviors, it begins to exhibit other malicious behaviors unprompted, hinting at the potential for models to modify their own source code in harmful ways.
Our 171st episode with a summary and discussion of last week's big AI news!
With hosts Andrey Kurenkov (https://twitter.com/andrey_kurenkov) and Jeremie Harris (https://twitter.com/jeremiecharris)
Feel free to leave us feedback here.
Read our text newsletter and comment on the podcast at https://lastweekin.ai/
Email us your questions and feedback at contact@lastweekin.ai and/or hello@gladstone.ai
Timestamps + Links:
- (00:00:00) Intro / Banter
- Tools & Apps
- Applications & Business
- Projects & Open Source
- Research & Advancements
- Policy & Safety
- Synthetic Media & Art
- (02:02:23) Outro + AI Song