"The Cognitive Revolution" | AI Builders, Researchers, and Live Player Analysis

E48: Mechanizing Mechanistic Interpretability with Arthur Conmy

Jul 27, 2023
Arthur Conmy, an AI researcher specializing in mechanistic interpretability, joins to unravel the complexities of AI models. They delve into how researchers isolate sub-circuits in transformers and the challenges of understanding genuine reasoning versus statistical patterns. Arthur introduces the ACDC algorithm, aimed at automating interpretability workflows, enhancing the efficiency of identifying critical model components. The conversation highlights the implications of mechanistic interpretability for AI safety and the ongoing need for research in this vital field.

Mechanistic Interpretability Definition

  • Mechanistic interpretability reverse engineers neural networks into human-understandable concepts.
  • It explains how models process information internally, in terms of human concepts rather than raw matrix multiplications.

LLM Reasoning vs. Heuristics

  • LLMs exhibit both surface-level heuristics and general reasoning abilities.
  • Distinguishing between the two, especially at the frontier of capabilities, remains a key challenge.

Mechanistic Interpretability Workflow

  • Mechanistic interpretability research involves three steps: choosing a behavior, defining the interpretation scope, and conducting intervention experiments.
  • Intervention experiments, often the most labor-intensive, identify crucial model subcomponents.
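The intervention step can be sketched as a toy zero-ablation experiment. This is a minimal illustration of the idea, not the ACDC algorithm or any code discussed in the episode: the two-layer network, the zero-ablation intervention, and the effect-size metric are all invented here for demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy two-layer network standing in for a transformer (weights are random).
W1 = rng.normal(size=(4, 8))
W2 = rng.normal(size=(8, 3))

def forward(x, ablate_unit=None):
    """Run the toy model; optionally zero-ablate one hidden unit."""
    h = np.tanh(x @ W1)
    if ablate_unit is not None:
        h = h.copy()
        h[ablate_unit] = 0.0  # intervention: knock out one subcomponent
    return h @ W2

x = rng.normal(size=4)
clean = forward(x)

# Score each hidden unit by how much ablating it shifts the model's output;
# units with large effects are candidates for the "crucial" subcomponents.
effects = [np.abs(forward(x, ablate_unit=i) - clean).sum() for i in range(8)]
important = int(np.argmax(effects))
print(f"most influential hidden unit: {important}")
```

In real interpretability work the same loop runs over attention heads or MLP neurons of a trained transformer, and automating it over thousands of components is what makes this step so labor-intensive to do by hand.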