Exploring Stability and Generalization in Language Model Fine-Tuning

This chapter compares narrow and general steering vectors when fine-tuning language models on a dataset of problematic medical advice. Narrow solutions lose performance more quickly when noise is added, while general solutions remain more stable. The chapter then discusses what this implies for model generalization and monitoring.

“Narrow Misalignment is Hard, Emergent Misalignment is Easy” by Edward Turner, Anna Soligo, Senthooran Rajamanoharan, Neel Nanda

LessWrong (Curated & Popular)
