LessWrong (Curated & Popular)

What’s up with LLMs representing XORs of arbitrary features?

Jan 7, 2024
The podcast explores the claim that LLMs can represent XORs of arbitrary features and what this implies for AI safety research. It covers RAX and generalization in linear probes, generating representations for classification and linear features, possible explanations involving aggregation of noisy signals correlated with 'Is a Formula', and hypotheses about LLMs' behavior and feature directions.