Collin Burns On Discovering Latent Knowledge In Language Models Without Supervision

The Inside View

CHAPTER

How Do We Distinguish Between the Truth and the Misaligned System?

"There are a couple things that change once you scale up models," he says. "The first worry is okay maybe the model doesn't represent is this input actually true or false to begin with Maybe it just thinks about what a human would say this is true or false and so it doesn't actually represent its beliefs in a simple way internally" He also talks about how we might be able to distinguish between truth-like features from those of misaligned systems. 'I think humans won't know answers to superhuman questions mostly i think they'll be like 50-50'

