How to Solve Alignment Problems With a Language Model

The focus has actually shifted from that to asking the reverse question: can you work backwards to find the chain of events that led to that bad state, according to the probability distribution specified by the language model? It's a very exploratory direction. But I definitely think that having more lenses and perspectives for understanding these systems is going to be useful in general.
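One way to picture "working backwards" is as posterior inference: given that a bad outcome occurred, which earlier token sequences most plausibly led to it under the model's own distribution? The sketch below is only an illustration of that idea, not the speaker's method. It assumes a hypothetical toy bigram "language model" (the `TRANSITIONS` table, its tokens, and its probabilities are all invented) and uses brute-force enumeration, which would not scale to a real model.

```python
from itertools import product

# Hypothetical toy "language model": next-token probabilities that depend
# only on the previous token. All tokens and numbers are made up.
TRANSITIONS = {
    "<s>": {"hello": 0.9, "ignore": 0.1},
    "hello": {"there": 0.95, "BAD": 0.05},
    "ignore": {"there": 0.2, "BAD": 0.8},
    "there": {"there": 1.0},
    "BAD": {"BAD": 1.0},
}


def sequence_prob(tokens):
    """Probability of a token sequence under the toy model, starting from <s>."""
    prob = 1.0
    prev = "<s>"
    for tok in tokens:
        prob *= TRANSITIONS[prev].get(tok, 0.0)
        prev = tok
    return prob


def posterior_over_paths(bad_token, length):
    """Posterior over all length-`length` sequences that end in `bad_token`.

    Uses Bayes' rule by enumeration: p(path | ends bad) = p(path) / p(ends bad).
    """
    vocab = [t for t in TRANSITIONS if t != "<s>"]
    joint = {}
    for path in product(vocab, repeat=length):
        if path[-1] == bad_token:
            p = sequence_prob(path)
            if p > 0:
                joint[path] = p
    total = sum(joint.values())
    if total == 0:
        return {}  # the bad state is unreachable at this length
    return {path: p / total for path, p in joint.items()}


posterior = posterior_over_paths("BAD", 2)
for path, p in sorted(posterior.items(), key=lambda kv: -kv[1]):
    print(path, round(p, 2))
```

In this toy setup, "ignore" is a rare first token (prior 0.1), yet it accounts for most of the posterior mass once the bad outcome is observed, because it makes "BAD" much more likely. That is the sense in which conditioning on the bad state and reasoning backwards can surface causes that forward sampling would rarely show.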
