Collin Burns On Discovering Latent Knowledge In Language Models Without Supervision

The Inside View

CHAPTER

How Do We Distinguish Between the Truth and the Misaligned System?

"There are a couple things that change once you scale up models," he says. "The first worry is okay maybe the model doesn't represent is this input actually true or false to begin with Maybe it just thinks about what a human would say this is true or false and so it doesn't actually represent its beliefs in a simple way internally" He also talks about how we might be able to distinguish between truth-like features from those of misaligned systems. 'I think humans won't know answers to superhuman questions mostly i think they'll be like 50-50'

