“Narrow Misalignment is Hard, Emergent Misalignment is Easy” by Edward Turner, Anna Soligo, Senthooran Rajamanoharan, Neel Nanda

LessWrong (Curated & Popular)

00:00

Exploring Stability and Generalization in Language Model Fine-Tuning

This chapter compares narrow and general steering vectors in fine-tuning language models on a dataset of problematic medical advice. Narrow solutions lose performance faster under noise, while general solutions remain more stable, which raises questions about how models generalize and what this implies for monitoring.
