
"(My understanding of) What Everyone in Technical Alignment is Doing and Why" by Thomas Larsen & Eli Lifland
LessWrong (Curated & Popular)
Alignment Research Center - Eliciting Latent Knowledge
This approach tries to tackle distributional shift, which I might see as one of the fundamental hard parts of alignment. The problem is that I don't see how to integrate this approach to solving that problem with deep learning. It seems like this approach might work well for a model-based RL setup where you can make the AI explicitly select for this utility function. There's a footnote here after "would generate classifiers that extrapolate in all the different ways": they just need to span the set of extrapolations, so that the correct extrapolation is just a linear combination of the found classifiers. Back to the main text: they have also introduced a new dataset for this.
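To make the footnote's point concrete, here is a minimal sketch (not from the original post) of why spanning is enough: if the correct extrapolation lies in the span of the candidate classifiers, a small set of trusted labels recovers it by ordinary least squares. The names (`candidate_scores`, `trusted_labels`, the specific numbers) are illustrative assumptions, not anything from ARC's actual setup.

```python
import numpy as np

# Assumed setup: k candidate classifiers that agree on-distribution but
# extrapolate differently off-distribution. If the correct extrapolation is
# in their span, a few trusted off-distribution labels pin down the weights.
rng = np.random.default_rng(0)

k = 5            # number of candidate classifiers found
n_trusted = 20   # small set of off-distribution points we can label reliably

# Scores of each candidate classifier on the trusted points
# (rows: points, columns: classifiers). Random stand-ins here.
candidate_scores = rng.normal(size=(n_trusted, k))

# By assumption, the correct classifier is some fixed linear combination.
true_weights = np.array([0.5, -1.0, 0.0, 2.0, 0.25])
trusted_labels = candidate_scores @ true_weights

# Recover that combination by least squares on the trusted labels.
recovered_weights, *_ = np.linalg.lstsq(candidate_scores, trusted_labels, rcond=None)

print(np.allclose(recovered_weights, true_weights))  # True: the span assumption suffices
```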


