Our paper introduces an unsupervised method for identifying whether statements are true or false using only a language model's internal representations. We use the hidden states of the language model as features and train a linear model to classify statements and their negations as either true or false. The method leverages the structure of truth, including logical consistency and negation consistency, to find features that are correlated with truth. This allows us to identify truth in language models without relying on any explicit supervision or labels, making the approach scalable and applicable to situations where evaluating truthfulness is challenging.
Our method goes beyond simply predicting the next token and focuses on revealing features correlated with truth. By exploring the structure of truth and leveraging properties like logical consistency, we can discover truth-like features in language models. The approach involves searching for a direction in the activation space of the model that satisfies consistency properties, such as negation consistency. We optimize for both confidence and consistency, aiming to find features in the hidden states that are indicative of truth without the need for explicit supervision or ground truth labels.
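A minimal sketch of what such a confidence-plus-consistency objective might look like in practice (the probe class, loss function, training loop, and variable names below are illustrative assumptions, not the paper's actual code):

```python
import torch
import torch.nn as nn

# Illustrative sketch of a consistency-based probe over language-model hidden states.
# h_pos, h_neg: hidden states for a statement and its negation, shape (n_examples, hidden_dim),
# assumed to be extracted from the model (and normalized) beforehand.

class LinearProbe(nn.Module):
    def __init__(self, hidden_dim: int):
        super().__init__()
        self.linear = nn.Linear(hidden_dim, 1)

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # Map a hidden state to a probability that the statement is true.
        return torch.sigmoid(self.linear(h)).squeeze(-1)

def consistency_confidence_loss(p_pos: torch.Tensor, p_neg: torch.Tensor) -> torch.Tensor:
    # Negation consistency: the probability of a statement and of its negation should sum to ~1.
    consistency = (p_pos - (1 - p_neg)) ** 2
    # Confidence: penalize the degenerate solution p_pos = p_neg = 0.5.
    confidence = torch.minimum(p_pos, p_neg) ** 2
    return (consistency + confidence).mean()

def train_probe(h_pos: torch.Tensor, h_neg: torch.Tensor,
                epochs: int = 1000, lr: float = 1e-3) -> LinearProbe:
    probe = LinearProbe(h_pos.shape[-1])
    opt = torch.optim.Adam(probe.parameters(), lr=lr)
    for _ in range(epochs):
        opt.zero_grad()
        loss = consistency_confidence_loss(probe(h_pos), probe(h_neg))
        loss.backward()
        opt.step()
    return probe
```

Note that a pure confidence-and-consistency objective only pins down a truth direction up to a sign flip, and in practice the hidden states for the two halves of each pair are normalized so the probe cannot simply read off which prompt template was used; both details are glossed over in this sketch.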
Our unsupervised approach allows us to uncover truth in language models without relying on human-labeled data or explicit supervision. By analyzing the hidden states of the model, we can identify features that correlate with truthfulness. The method leverages the structure of truth, including logical consistency and negation consistency, to identify truth-like features. It offers a scalable and efficient way to assess the truthfulness of statements in language models, even in situations where ground truth evaluation is challenging or unavailable.
Our paper presents an unsupervised method to discover truth in language models. By analyzing the hidden states of the model and searching for features that are consistent with truth, we can identify whether statements are true or false. The approach leverages the structure of truth, such as logical consistency, to find truth-like features. Importantly, our method does not rely on explicit labeling or supervision, making it scalable and applicable to a wide range of scenarios where evaluating the truthfulness of statements is difficult.
Our method aims to recover different perspectives, analogous to liberal and conservative personas, in language models. By conditioning the model on sets of politicized questions and their answers, we can capture the structure in the joint probabilities of the answers, allowing us to uncover these distinct perspectives. This demonstrates that the model has the capability to simulate and represent different truth-like perspectives.
Our unsupervised method focuses on extracting truth-like representations from the model's hidden states, without relying on explicit human feedback. By avoiding direct human evaluation, we can circumvent the limitations that would arise when transitioning to superhuman models. This approach removes the human evaluator dependency and allows us to study alignment without the need for human judgment or supervision.
We hypothesize that language models have the capability to think about the truth or falsity of inputs, similar to how they reason about being liberal or conservative. This representation allows them to generate text that aligns with truth-like perspectives. Although the specifics require further research, we posit that finding these truth-like representations in language models is possible, paving the way for a deeper understanding of alignment in future systems.
Our method seeks to identify and understand the truth persona within language models. By incorporating sets of questions and answers, we can prompt the model to generate text that reflects this persona. This unsupervised approach enables us to study the truth-like behavior of models, potentially leading to improved understanding and alignment in future systems.
To tackle complex problems like alignment, it is crucial to think beyond existing proposals and explore completely new ideas. Researchers should not feel wedded to traditional approaches and should constantly pose themselves questions to prompt deep thinking and introspection. Considering the perspective of humans and questioning how they navigate similar challenges can also provide valuable inspiration.
Unsupervised properties can play a vital role in addressing alignment issues. Confidence and consistency are two examples of such properties that can be used to narrow down possibilities and identify truthful models. By understanding and leveraging these powerful unsupervised properties, researchers can make progress in aligning AI systems.
Alignment research is not only important due to the potential impact of AGI, but it is also highly exciting and rewarding. The field offers numerous avenues for exploration, such as interpretability, generalization, and reward specification. The simplicity of current deep learning models allows for deep introspection and the formulation of innovative ideas. Working on alignment can be intellectually stimulating and holds immense potential for groundbreaking contributions.
Collin Burns is a second-year ML PhD student at Berkeley, working with Jacob Steinhardt on making language models honest, interpretable, and aligned. In 2015 he broke the Rubik's Cube world record, and he's now back with "Discovering latent knowledge in language models without supervision", a paper on how to recover diverse knowledge represented in large language models without supervision.
Transcript: https://theinsideview.ai/collin
Paper: https://arxiv.org/abs/2212.03827
Lesswrong post: https://bit.ly/3kbyZML
Host: https://twitter.com/MichaelTrazzi
Collin: https://twitter.com/collinburns4
OUTLINE
(00:22) Intro
(01:33) Breaking The Rubik's Cube World Record
(03:03) A Permutation That Happens Maybe 2% Of The Time
(05:01) How Collin Became Convinced Of AI Alignment
(07:55) Was Minerva Just Low-Hanging Fruit On MATH From Scaling?
(12:47) IMO Gold Medal By 2026? How To Update From AI Progress
(17:03) Plausibly Automating AI Research In The Next Five Years
(24:23) Making LLMs Say The Truth
(28:11) Lying Is Already Incentivized As We Have Seen With Diplomacy
(32:29) Mind Reading On 'Brain Scans' Through Logical Consistency
(35:18) Misalignment, Or Why One Does Not Simply Prompt A Model Into Being Truthful
(38:43) Classifying Hidden States, Maybe Using Truth Features Represented Linearly
(44:48) Building A Dataset For Using Logical Consistency
(50:16) Building A Confident And Consistent Classifier That Outputs Probabilities
(53:25) Discovering Representations Of The Truth From Just Being Confident And Consistent
(57:18) Making Models Truthful As A Sufficient Condition For Alignment
(59:02) Classification From Hidden States Outperforms Zero-Shot Prompting Accuracy
(01:02:27) Recovering Latent Knowledge From Hidden States Is Robust To Incorrect Answers In Few-Shot Prompts
(01:09:04) Would A Superhuman GPT-N Predict Future News Articles?
(01:13:09) Asking Models To Optimize Money Without Breaking The Law
(01:20:31) Training Competitive Models From Human Feedback That We Can Evaluate
(01:27:26) Alignment Problems On Current Models Are Already Hard
(01:29:19) We Should Have More People Working On New Agendas From First Principles
(01:37:16) Towards Grounded Theoretical Work And Empirical Work Targeting Future Systems
(01:41:52) There Is No True Unsupervised: Autoregressive Models Depend On What A Human Would Say
(01:46:04) Simulating Aligned Systems And Recovering The Persona Of A Language Model
(01:51:38) The Truth Is Somewhere Inside The Model, Differentiating Between Truth And Persona Bit by Bit Through Constraints
(02:01:08) A Misaligned Model Would Have Activations Correlated With Lying
(02:05:16) Exploiting Similar Structure To Logical Consistency With Unaligned Models
(02:07:07) Aiming For Honesty, Not Truthfulness
(02:11:15) Limitations Of Collin's Paper
(02:14:12) The Paper Does Not Show The Complete Final Robust Method For This Problem
(02:17:26) Humans Will Be 50/50 On Superhuman Questions
(02:23:40) Asking Yourself "Why Am I Optimistic?" And How Collin Approaches Research
(02:29:16) Message To The ML And Cubing Audience