LessWrong (Curated & Popular)

What’s up with LLMs representing XORs of arbitrary features?

Jan 7, 2024
The podcast explores the claim that LLMs can represent XORs of arbitrary features and what this implies for AI safety research. It covers RAX and generalization in linear probes, generating representations for classification and linear features, possible explanations involving aggregation of noisy signals correlated with 'Is a Formula', and hypotheses about LLMs' behavior and feature directions.