Collin Burns is a second-year ML PhD student at Berkeley, working with Jacob Steinhardt on making language models honest, interpretable, and aligned. In 2015 he broke the Rubik's Cube world record, and he is now back with "Discovering Latent Knowledge in Language Models Without Supervision", a paper on how to recover diverse knowledge represented in large language models without any supervision.
Transcript: https://theinsideview.ai/collin
Paper: https://arxiv.org/abs/2212.03827
Lesswrong post: https://bit.ly/3kbyZML
Host: https://twitter.com/MichaelTrazzi
Collin: https://twitter.com/collinburns4
OUTLINE
(00:22) Intro
(01:33) Breaking The Rubik's Cube World Record
(03:03) A Permutation That Happens Maybe 2% Of The Time
(05:01) How Collin Became Convinced Of AI Alignment
(07:55) Was Minerva Just Low-Hanging Fruit On MATH From Scaling?
(12:47) IMO Gold Medal By 2026? How To Update From AI Progress
(17:03) Plausibly Automating AI Research In The Next Five Years
(24:23) Making LLMs Say The Truth
(28:11) Lying Is Already Incentivized As We Have Seen With Diplomacy
(32:29) Mind Reading On 'Brain Scans' Through Logical Consistency
(35:18) Misalignment, Or Why One Does Not Simply Prompt A Model Into Being Truthful
(38:43) Classifying Hidden States, Maybe Using Truth Features Represented Linearly
(44:48) Building A Dataset For Using Logical Consistency
(50:16) Building A Confident And Consistent Classifier That Outputs Probabilities
(53:25) Discovering Representations Of The Truth From Just Being Confident And Consistent
(57:18) Making Models Truthful As A Sufficient Condition For Alignment
(59:02) Classification From Hidden States Outperforms Zero-Shot Prompting Accuracy
(01:02:27) Recovering Latent Knowledge From Hidden States Is Robust To Incorrect Answers In Few-Shot Prompts
(01:09:04) Would A Superhuman GPT-N Predict Future News Articles?
(01:13:09) Asking Models To Optimize Money Without Breaking The Law
(01:20:31) Training Competitive Models From Human Feedback That We Can Evaluate
(01:27:26) Alignment Problems On Current Models Are Already Hard
(01:29:19) We Should Have More People Working On New Agendas From First Principles
(01:37:16) Towards Grounded Theoretical Work And Empirical Work Targeting Future Systems
(01:41:52) There Is No True Unsupervised: Autoregressive Models Depend On What A Human Would Say
(01:46:04) Simulating Aligned Systems And Recovering The Persona Of A Language Model
(01:51:38) The Truth Is Somewhere Inside The Model, Differentiating Between Truth And Persona Bit By Bit Through Constraints
(02:01:08) A Misaligned Model Would Have Activations Correlated With Lying
(02:05:16) Exploiting Similar Structure To Logical Consistency With Unaligned Models
(02:07:07) Aiming For Honesty, Not Truthfulness
(02:11:15) Limitations Of Collin's Paper
(02:14:12) The Paper Does Not Show The Complete Final Robust Method For This Problem
(02:17:26) Humans Will Be 50/50 On Superhuman Questions
(02:23:40) Asking Yourself "Why Am I Optimistic" And How Collin Approaches Research
(02:29:16) Message To The ML And Cubing Audience