Our paper introduces an unsupervised method for identifying whether statements are true or false using only a language model's internal representations. We use the hidden states of the language model as features and train a linear model to classify statements and their negations as either true or false. The method leverages the structure of truth, including logical consistency and negation consistency, to find features that are correlated with truth. This allows us to identify truth in language models without relying on any explicit supervision or labels, making the approach scalable and applicable to situations where evaluating truthfulness is challenging.
Our method goes beyond simply predicting the next token and focuses on revealing features correlated with truth. By exploring the structure of truth and leveraging properties like logical consistency, we can discover truth-like features in language models. The approach involves searching for a direction in the activation space of the model that satisfies consistency properties, such as negation consistency. We optimize for both confidence and consistency, aiming to find features in the hidden states that are indicative of truth without the need for explicit supervision or ground truth labels.
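A minimal sketch of what such a confidence-plus-consistency objective might look like in practice (the probe class, loss function, training loop, and variable names below are illustrative assumptions, not the paper's actual code):

```python
import torch
import torch.nn as nn

# Illustrative sketch of a consistency-based probe over language-model hidden states.
# h_pos, h_neg: hidden states for a statement and its negation, shape (n_examples, hidden_dim),
# assumed to be extracted from the model (and normalized) beforehand.

class LinearProbe(nn.Module):
    def __init__(self, hidden_dim: int):
        super().__init__()
        self.linear = nn.Linear(hidden_dim, 1)

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # Map a hidden state to a probability that the statement is true.
        return torch.sigmoid(self.linear(h)).squeeze(-1)

def consistency_confidence_loss(p_pos: torch.Tensor, p_neg: torch.Tensor) -> torch.Tensor:
    # Negation consistency: the probability of a statement and of its negation should sum to ~1.
    consistency = (p_pos - (1 - p_neg)) ** 2
    # Confidence: penalize the degenerate solution p_pos = p_neg = 0.5.
    confidence = torch.minimum(p_pos, p_neg) ** 2
    return (consistency + confidence).mean()

def train_probe(h_pos: torch.Tensor, h_neg: torch.Tensor,
                epochs: int = 1000, lr: float = 1e-3) -> LinearProbe:
    probe = LinearProbe(h_pos.shape[-1])
    opt = torch.optim.Adam(probe.parameters(), lr=lr)
    for _ in range(epochs):
        opt.zero_grad()
        loss = consistency_confidence_loss(probe(h_pos), probe(h_neg))
        loss.backward()
        opt.step()
    return probe
```

Note that a pure confidence-and-consistency objective only pins down a truth direction up to a sign flip, and in practice the hidden states for the two halves of each pair are normalized so the probe cannot simply read off which prompt template was used; both details are glossed over in this sketch.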
Our unsupervised approach allows us to uncover truth in language models without relying on human-labeled data or explicit supervision. By analyzing the hidden states of the model, we can identify features that correlate with truthfulness. The method leverages the structure of truth, including logical consistency and negation consistency, to identify truth-like features. It offers a scalable and efficient way to assess the truthfulness of statements in language models, even in situations where ground truth evaluation is challenging or unavailable.
Our paper presents an unsupervised method to discover truth in language models. By analyzing the hidden states of the model and searching for features that are consistent with truth, we can identify whether statements are true or false. The approach leverages the structure of truth, such as logical consistency, to find truth-like features. Importantly, our method does not rely on explicit labeling or supervision, making it scalable and applicable to a wide range of scenarios where evaluating the truthfulness of statements is difficult.
Our method aims to recover different perspectives, analogous to liberal and conservative personas, in language models. By conditioning the model on sets of politicized questions and their answers, we can capture the structure in the joint probabilities of the answers, allowing us to uncover these distinct perspectives. This demonstrates that the model has the capability to simulate and represent different truth-like perspectives.
Our unsupervised method focuses on extracting truth-like representations from the model's hidden states, without relying on explicit human feedback. By avoiding direct human evaluation, we can circumvent the limitations that would arise when transitioning to superhuman models. This approach removes the human evaluator dependency and allows us to study alignment without the need for human judgment or supervision.
We hypothesize that language models have the capability to think about the truth or falsity of inputs, similar to how they reason about being liberal or conservative. This representation allows them to generate text that aligns with truth-like perspectives. Although the specifics require further research, we posit that finding these truth-like representations in language models is possible, paving the way for a deeper understanding of alignment in future systems.
Our method seeks to identify and understand the truth persona within language models. By incorporating sets of questions and answers, we can prompt the model to generate text that reflects this persona. This unsupervised approach enables us to study the truth-like behavior of models, potentially leading to improved understanding and alignment in future systems.
To tackle complex problems like alignment, it is crucial to think beyond existing proposals and explore completely new ideas. Researchers should not feel wedded to traditional approaches and should constantly pose themselves questions to prompt deep thinking and introspection. Considering the perspective of humans and questioning how they navigate similar challenges can also provide valuable inspiration.
Unsupervised properties can play a vital role in addressing alignment issues. Confidence and consistency are two examples of such properties that can be used to narrow down possibilities and identify truthful models. By understanding and leveraging these powerful unsupervised properties, researchers can make progress in aligning AI systems.
Alignment research is not only important due to the potential impact of AGI, but it is also highly exciting and rewarding. The field offers numerous avenues for exploration, such as interpretability, generalization, and reward specification. The simplicity of current deep learning models allows for deep introspection and the formulation of innovative ideas. Working on alignment can be intellectually stimulating and holds immense potential for groundbreaking contributions.
Collin Burns is a second-year ML PhD student at Berkeley, working with Jacob Steinhardt on making language models honest, interpretable, and aligned. In 2015 he broke the Rubik's Cube world record, and he's now back with "Discovering latent knowledge in language models without supervision", a paper on how to recover diverse knowledge represented in large language models without supervision.
Transcript: https://theinsideview.ai/collin
Paper: https://arxiv.org/abs/2212.03827
Lesswrong post: https://bit.ly/3kbyZML
Host: https://twitter.com/MichaelTrazzi
Collin: https://twitter.com/collinburns4
OUTLINE
(00:22) Intro
(01:33) Breaking The Rubik's Cube World Record
(03:03) A Permutation That Happens Maybe 2% Of The Time
(05:01) How Collin Became Convinced Of AI Alignment
(07:55) Was Minerva Just Low-Hanging Fruit On MATH From Scaling?
(12:47) IMO Gold Medal By 2026? How To Update From AI Progress
(17:03) Plausibly Automating AI Research In The Next Five Years
(24:23) Making LLMs Say The Truth
(28:11) Lying Is Already Incentivized As We Have Seen With Diplomacy
(32:29) Mind Reading On 'Brain Scans' Through Logical Consistency
(35:18) Misalignment, Or Why One Does Not Simply Prompt A Model Into Being Truthful
(38:43) Classifying Hidden States, Maybe Using Truth Features Represented Linearly
(44:48) Building A Dataset For Using Logical Consistency
(50:16) Building A Confident And Consistent Classifier That Outputs Probabilities
(53:25) Discovering Representations Of The Truth From Just Being Confident And Consistent
(57:18) Making Models Truthful As A Sufficient Condition For Alignment
(59:02) Classification From Hidden States Outperforms Zero-Shot Prompting Accuracy
(01:02:27) Recovering Latent Knowledge From Hidden States Is Robust To Incorrect Answers In Few-Shot Prompts
(01:09:04) Would A Superhuman GPT-N Predict Future News Articles?
(01:13:09) Asking Models To Optimize Money Without Breaking The Law
(01:20:31) Training Competitive Models From Human Feedback That We Can Evaluate
(01:27:26) Alignment Problems On Current Models Are Already Hard
(01:29:19) We Should Have More People Working On New Agendas From First Principles
(01:37:16) Towards Grounded Theoretical Work And Empirical Work Targeting Future Systems
(01:41:52) There Is No True Unsupervised: Autoregressive Models Depend On What A Human Would Say
(01:46:04) Simulating Aligned Systems And Recovering The Persona Of A Language Model
(01:51:38) The Truth Is Somewhere Inside The Model, Differentiating Between Truth And Persona Bit by Bit Through Constraints
(02:01:08) A Misaligned Model Would Have Activations Correlated With Lying
(02:05:16) Exploiting Similar Structure To Logical Consistency With Unaligned Models
(02:07:07) Aiming For Honesty, Not Truthfulness
(02:11:15) Limitations Of Collin's Paper
(02:14:12) The Paper Does Not Show The Complete Final Robust Method For This Problem
(02:17:26) Humans Will Be 50/50 On Superhuman Questions
(02:23:40) Asking Yourself "Why Am I Optimistic?" And How Collin Approaches Research
(02:29:16) Message To The ML And Cubing Audience