
LessWrong (Curated & Popular): What’s up with LLMs representing XORs of arbitrary features?
Jan 7, 2024
The podcast explores the claim that LLMs can represent XORs of arbitrary features (RAX) and what this implies for AI safety research. Topics include how RAX bears on the generalization of linear probes, generating representations for classification and linear features, possible explanations involving the aggregation of noisy signals correlated with 'Is a Formula', and hypotheses about LLMs' behavior and feature directions.
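To see why linearly readable XORs are surprising, note that the XOR of two binary features is not a linear function of those features, so a linear probe on the raw features alone cannot recover it; the probe succeeds only if the representation carries an explicit XOR direction. A minimal numpy sketch (the perceptron-based separability check is an illustration, not the paper's method):

```python
import numpy as np

# The four combinations of two binary features a, b and their XOR label.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 1, 1, 0])

def linearly_separable(X, y, epochs=1000):
    """Run a simple perceptron; return True if it finds a separating hyperplane."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        errors = 0
        for xi, yi in zip(X, y):
            pred = 1 if xi @ w + b > 0 else 0
            if pred != yi:
                w += (yi - pred) * xi
                b += (yi - pred)
                errors += 1
        if errors == 0:
            return True
    return False

# On the raw features a, b alone, XOR is not linearly separable:
print(linearly_separable(X, y))       # False

# If the representation also carries an explicit XOR direction
# (simulated here by appending a XOR b as a third coordinate),
# a linear probe separates the classes perfectly:
X_aug = np.hstack([X, (X[:, 0] != X[:, 1]).astype(float)[:, None]])
print(linearly_separable(X_aug, y))   # True
```

So when linear probes on LLM activations do read out XORs of arbitrary feature pairs, the activations themselves must encode those XOR combinations as directions, which is the puzzle the episode discusses.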
Chapters
Introduction (00:00, 5 min)
Implications of RAX and Generalization in Linear Probes (04:39, 17 min)
Generating Representations for Classification and Linear Features (21:52, 2 min)
Possible Explanations for Aggregation of Noisy Signals Correlated with 'Is a Formula' (23:33, 4 min)
Exploring Hypotheses on LLMs' Behavior and Feature Directions (27:18, 2 min)
