Probing AI Systems with the CCS Approach

Exploring the use of probes and Contrast-Consistent Search (CCS) to understand latent knowledge in AI systems, focusing on normalization of activations for classification tasks and the challenges of distinguishing clusters that arise from salient differences.
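
As a rough illustration of the normalization step and the CCS objective mentioned above, here is a minimal sketch based on the loss described in "Discovering latent knowledge in language models without supervision" (arxiv.org/abs/2212.03827). The function names (normalize, probe, ccs_loss) are hypothetical, the activations are plain Python lists standing in for real model hidden states, and no training loop is included. It is not the code used in the papers discussed in this episode.

```python
import math

def normalize(acts):
    """Standardize each feature across one set of activations.

    In CCS the "positive" and "negative" halves of the contrast pairs are
    normalized separately, so the probe cannot simply read off which
    answer was appended to the prompt.
    """
    n, d = len(acts), len(acts[0])
    means = [sum(row[j] for row in acts) / n for j in range(d)]
    # Fall back to 1.0 when a feature has zero variance (assumption of this sketch).
    stds = [
        math.sqrt(sum((row[j] - means[j]) ** 2 for row in acts) / n) or 1.0
        for j in range(d)
    ]
    return [[(row[j] - means[j]) / stds[j] for j in range(d)] for row in acts]

def probe(x, w, b):
    """Linear probe followed by a sigmoid, giving a probability in (0, 1)."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1.0 / (1.0 + math.exp(-z))

def ccs_loss(pos, neg, w, b):
    """Average of the consistency and confidence terms over contrast pairs."""
    total = 0.0
    for x_pos, x_neg in zip(pos, neg):
        p_pos, p_neg = probe(x_pos, w, b), probe(x_neg, w, b)
        consistency = (p_pos - (1.0 - p_neg)) ** 2  # p(x+) should equal 1 - p(x-)
        confidence = min(p_pos, p_neg) ** 2         # penalize p(x+) = p(x-) = 0.5
        total += consistency + confidence
    return total / len(pos)
```

Because the loss only asks for a direction that is consistent and confident across contrast pairs, any sufficiently salient feature with that structure can satisfy it, which is the concern raised in the "contra CCS" part of the episode.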

Episode notes

In 2022, it was announced that a fairly simple method can be used to extract the true beliefs of a language model on any given topic, without having to actually understand the topic at hand. Earlier, in 2021, it was announced that neural networks sometimes 'grok': that is, when training them on certain tasks, they initially memorize their training data (achieving their training goal in a way that doesn't generalize), but then suddenly switch to understanding the 'real' solution in a way that generalizes. What's going on with these discoveries? Are they all they're cracked up to be, and if so, how are they working? In this episode, I talk to Vikrant Varma about his research getting to the bottom of these questions.

Patreon: patreon.com/axrpodcast

Ko-fi: ko-fi.com/axrpodcast

Topics we discuss, and timestamps:

0:00:36 - Challenges with unsupervised LLM knowledge discovery, aka contra CCS

0:00:36 - What is CCS?

0:09:54 - Consistent and contrastive features other than model beliefs

0:20:34 - Understanding the banana/shed mystery

0:41:59 - Future CCS-like approaches

0:53:29 - CCS as principal component analysis

0:56:21 - Explaining grokking through circuit efficiency

0:57:44 - Why research science of deep learning?

1:12:07 - Summary of the paper's hypothesis

1:14:05 - What are 'circuits'?

1:20:48 - The role of complexity

1:24:07 - Many kinds of circuits

1:28:10 - How circuits are learned

1:38:24 - Semi-grokking and ungrokking

1:50:53 - Generalizing the results

1:58:51 - Vikrant's research approach

2:06:36 - The DeepMind alignment team

2:09:06 - Follow-up work

The transcript: axrp.net/episode/2024/04/25/episode-29-science-of-deep-learning-vikrant-varma.html

Vikrant's Twitter/X account: twitter.com/vikrantvarma_

Main papers:

- Challenges with unsupervised LLM knowledge discovery: arxiv.org/abs/2312.10029

- Explaining grokking through circuit efficiency: arxiv.org/abs/2309.02390

Other works discussed:

- Discovering latent knowledge in language models without supervision (CCS): arxiv.org/abs/2212.03827

- Eliciting Latent Knowledge: How to Tell if your Eyes Deceive You: https://docs.google.com/document/d/1WwsnJQstPq91_Yh-Ch2XRL8H_EpsnjrC1dwZXR37PC8/edit

- Discussion: Challenges with unsupervised LLM knowledge discovery: lesswrong.com/posts/wtfvbsYjNHYYBmT3k/discussion-challenges-with-unsupervised-llm-knowledge-1

- Comment thread on the banana/shed results: lesswrong.com/posts/wtfvbsYjNHYYBmT3k/discussion-challenges-with-unsupervised-llm-knowledge-1?commentId=hPZfgA3BdXieNfFuY

- Fabien Roger, What discovering latent knowledge did and did not find: lesswrong.com/posts/bWxNPMy5MhPnQTzKz/what-discovering-latent-knowledge-did-and-did-not-find-4

- Scott Emmons, Contrast Pairs Drive the Empirical Performance of Contrast Consistent Search (CCS): lesswrong.com/posts/9vwekjD6xyuePX7Zr/contrast-pairs-drive-the-empirical-performance-of-contrast

- Grokking: Generalizing Beyond Overfitting on Small Algorithmic Datasets: arxiv.org/abs/2201.02177

- Keeping the Neural Networks Simple by Minimizing the Description Length of the Weights (Hinton 1993 L2): dl.acm.org/doi/pdf/10.1145/168304.168306

- Progress measures for grokking via mechanistic interpretability: arxiv.org/abs/2301.0521

Episode art by Hamish Doodles: hamishdoodles.com
