Collin Burns On Discovering Latent Knowledge In Language Models Without Supervision

The Inside View

RL From Human Feedback Is a Good Idea?

If we try to do RL from human feedback on those models, we could get competitiveness problems or misalignment issues. I don't think humans will be able to evaluate this for superhuman systems in many cases. So there are probably incentives to get around that sort of issue and make the models more flexible and able to perform better.

Episode notes

Collin Burns is a second-year ML PhD student at Berkeley, working with Jacob Steinhardt on making language models honest, interpretable, and aligned. In 2015 he broke the Rubik’s Cube world record, and he's now back with "Discovering latent knowledge in language models without supervision", a paper on how you can recover diverse knowledge represented in large language models without supervision.

Transcript: https://theinsideview.ai/collin

Paper: https://arxiv.org/abs/2212.03827

LessWrong post: https://bit.ly/3kbyZML

Host: https://twitter.com/MichaelTrazzi

Collin: https://twitter.com/collinburns4

OUTLINE

(00:22) Intro

(01:33) Breaking The Rubik's Cube World Record

(03:03) A Permutation That Happens Maybe 2% Of The Time

(05:01) How Collin Became Convinced Of AI Alignment

(07:55) Was Minerva Just Low-Hanging Fruit On MATH From Scaling?

(12:47) IMO Gold Medal By 2026? How to update from AI Progress

(17:03) Plausibly Automating AI Research In The Next Five Years

(24:23) Making LLMs Say The Truth

(28:11) Lying Is Already Incentivized As We Have Seen With Diplomacy

(32:29) Mind Reading On 'Brain Scans' Through Logical Consistency

(35:18) Misalignment, Or Why One Does Not Simply Prompt A Model Into Being Truthful

(38:43) Classifying Hidden States, Maybe Using Truth Features Represented Linearly

(44:48) Building A Dataset For Using Logical Consistency

(50:16) Building A Confident And Consistent Classifier That Outputs Probabilities

(53:25) Discovering Representations Of The Truth From Just Being Confident And Consistent

(57:18) Making Models Truthful As A Sufficient Condition For Alignment

(59:02) Classification From Hidden States Outperforms Zero-Shot Prompting Accuracy

(01:02:27) Recovering Latent Knowledge From Hidden States Is Robust To Incorrect Answers In Few-Shot Prompts

(01:09:04) Would A Superhuman GPT-N Predict Future News Articles

(01:13:09) Asking Models To Optimize Money Without Breaking The Law

(01:20:31) Training Competitive Models From Human Feedback That We Can Evaluate

(01:27:26) Alignment Problems On Current Models Are Already Hard

(01:29:19) We Should Have More People Working On New Agendas From First Principles

(01:37:16) Towards Grounded Theoretical Work And Empirical Work Targeting Future Systems

(01:41:52) There Is No True Unsupervised: Autoregressive Models Depend On What A Human Would Say

(01:46:04) Simulating Aligned Systems And Recovering The Persona Of A Language Model

(01:51:38) The Truth Is Somewhere Inside The Model, Differentiating Between Truth And Persona Bit by Bit Through Constraints

(02:01:08) A Misaligned Model Would Have Activations Correlated With Lying

(02:05:16) Exploiting Similar Structure To Logical Consistency With Unaligned Models

(02:07:07) Aiming For Honesty, Not Truthfulness

(02:11:15) Limitations Of Collin's Paper

(02:14:12) The Paper Does Not Show The Complete Final Robust Method For This Problem

(02:17:26) Humans Will Be 50/50 On Superhuman Questions

(02:23:40) Asking Yourself "Why Am I Optimistic" and How Collin Approaches Research

(02:29:16) Message To The ML And Cubing Audience
