AF - Different senses in which two AIs can be "the same" by Vivek Hebbar

Jun 24, 2024

AI alignment researcher Vivek Hebbar explores the different senses in which two AIs can be considered the same or different. He discusses distinctions such as model weights, pretrain identity, shared context, shared activations, shared memory, shared reward, and shared role in training, and how these bear on collusion and coordination in AI safety.

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Different senses in which two AIs can be "the same", published by Vivek Hebbar on June 24, 2024 on The AI Alignment Forum.
Sometimes people talk about two AIs being "the same" or "different" AIs. We think the intuitive binary of "same vs. different" conflates several concepts which are often better to disambiguate. In this post, we spell out some of these distinctions. We don't think anything here is particularly novel; we wrote this post because we think it's probably mildly helpful for people to think this through, and because Buck fairly often wants to reference it.
Some example distinctions:
In a debate protocol[1], people often refer to the judge and debaters as three "different" AIs because their tokens are subject to different training incentives. But you could also say that the two debaters are "the same AI" because their prompts and context look similar. Or you could say that all three are the "same AI" because they were fine-tuned from the same weights.
In a control setting, an untrusted monitor might be fine-tuned from the same base model as the model it's monitoring, leaving it intuitively unclear whether they are "the same AI".
We think that it's often important to be able to distinguish different senses of "personal identity" when discussing various issues in AI safety, especially issues related to collusion. We frame each sense of personal identity as telling us whether two actions[2] belong to the "same AI" or "different AIs". Here are the senses we distinguish in this post:
1. Model weights: Were the two actions produced by two different models?
   a. Pretrain identity: Models are different if they come from separate pretraining runs
   b. Parameter identity: Models are different even if they are fine-tunings of the same pretrained model
2. Shared context: Do the two actions have shared LLM context?
3. Shared activations: Are the two actions downstream of shared activations?
4. Shared memory: In a scaffold, did the processes producing the two actions have access to a shared memory bank?
5. Shared reward: In an RL setting, do the actions share the same rewards?
6. Shared role in training: In protocols like debate, the training incentive can be different in nature for different classes of prompts/contexts. Do the two actions come from contexts in the same class?
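To make the list above concrete, here is a minimal Python sketch that treats each sense as a separate comparison between two actions. The `Action` record, its field names, and the debate values are hypothetical, chosen only for illustration. "Shared context" is approximated as "the two contexts start with the same token", and the example assumes a single set of weights plays all three debate roles.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass(frozen=True)
class Action:
    """One action plus hypothetical metadata for comparing identities."""
    pretrain_run: str            # pretraining run the model descends from
    weights_id: str              # identifier for the exact parameters that were run
    context: Tuple[str, ...]     # tokens in the LLM context behind the action
    memory_bank: Optional[str]   # scaffold memory the process could access, if any
    reward_id: Optional[str]     # RL reward signal applied to the action, if any
    role: str                    # class of prompt/context, e.g. "debater" or "judge"


def identity_senses(a: Action, b: Action) -> Dict[str, bool]:
    """For each sense, report whether a and b count as the 'same AI'."""
    return {
        "pretrain_identity": a.pretrain_run == b.pretrain_run,
        "parameter_identity": a.weights_id == b.weights_id,
        # Simplification (assumption): contexts count as shared if they
        # begin with the same token.
        "shared_context": bool(a.context) and a.context[:1] == b.context[:1],
        "shared_memory": a.memory_bank is not None and a.memory_bank == b.memory_bank,
        "shared_reward": a.reward_id is not None and a.reward_id == b.reward_id,
        "shared_role": a.role == b.role,
    }


# Hypothetical debate setup: one fine-tuned model plays all three roles.
debater_a = Action("base-run", "debate-model", ("question", "transcript"),
                   None, "debater_a_wins", "debater")
debater_b = Action("base-run", "debate-model", ("question", "transcript"),
                   None, "debater_b_wins", "debater")
judge = Action("base-run", "debate-model", ("question", "full transcript"),
               None, "judge_accuracy", "judge")

debaters = identity_senses(debater_a, debater_b)
debater_vs_judge = identity_senses(debater_a, judge)
```

Under these assumptions, the two debaters come out as the "same AI" by role, context, and weights but "different" by reward, while a debater and the judge differ in role as well. The single word "same" hides exactly this kind of spread.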
Model identity
Model identity is based on which model parameters were run to produce a given action. "Pretrain identity" conflates all models fine-tuned from the same pretraining run, whereas "parameter identity" requires the weights to be identical.
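One way to picture the difference is as a fine-tuning lineage. In the sketch below, the `Checkpoint` class, the checkpoint names, and the byte strings standing in for weights are all hypothetical. Pretrain identity is checked by walking up to the pretrained ancestor, and parameter identity by comparing a fingerprint of the weights.

```python
import hashlib
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Checkpoint:
    """A model checkpoint; `parent` is the checkpoint it was fine-tuned from."""
    name: str
    weights: bytes                        # stand-in for serialized parameters
    parent: Optional["Checkpoint"] = None


def pretrain_root(ckpt: Checkpoint) -> Checkpoint:
    """Walk up the fine-tuning lineage to the pretrained ancestor."""
    while ckpt.parent is not None:
        ckpt = ckpt.parent
    return ckpt


def same_pretrain_identity(a: Checkpoint, b: Checkpoint) -> bool:
    """True if both checkpoints descend from the same pretraining run."""
    return pretrain_root(a) == pretrain_root(b)


def same_parameter_identity(a: Checkpoint, b: Checkpoint) -> bool:
    """True only if the weights themselves are identical."""
    return hashlib.sha256(a.weights).digest() == hashlib.sha256(b.weights).digest()


# Hypothetical control setting: a policy and its untrusted monitor are
# both fine-tuned from one base model; `other_base` is a separate run.
base = Checkpoint("base", b"\x00\x01")
policy = Checkpoint("policy", b"\x00\x02", parent=base)
monitor = Checkpoint("monitor", b"\x00\x03", parent=base)
other_base = Checkpoint("other_base", b"\x09\x09")
```

Here `policy` and `monitor` are the same model under pretrain identity but different models under parameter identity, which is the ambiguity in the untrusted-monitor example.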
Some reasons to care about model identity:
The presence of issues such as scheming[3] is more correlated across models with shared training history
Models with shared training history might have similar misaligned goals, making them more inclined to coordinate
Similarities between their internal ontologies might make it easier to coordinate on Schelling signals or steganographically encode information
Shared activations & shared context
For "shared activations", we consider two actions to belong to the same AI when they are downstream of the same activations. For a deterministic[4] unidirectional transformer, two actions share activations if they are generated by the same model and their contexts share a common prefix. However, future architectures might break the close correspondence between shared activations and shared context.
For example, consider this setup: An LLM sees [prompt 1], generates some [python code], then [prompt 2] is appended instructing the same LLM to critique the code for security flaws, and it generates [critique]. If the activations on the [prompt 1] tokens contain a signal saying that it's time for coordinated defection, the model can see those activations both while generating the code and while generating the critique.
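The prefix rule can be sketched with a toy stand-in for a deterministic unidirectional model. In this sketch the "activation" at each position is just a hash of the tokens up to and including that position. That hash is an assumption made for illustration, not how a real transformer computes activations, and the token names are placeholders.

```python
import hashlib
from typing import List, Sequence


def common_prefix_len(a: Sequence[str], b: Sequence[str]) -> int:
    """Number of leading tokens the two contexts have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def toy_causal_activations(tokens: Sequence[str]) -> List[str]:
    """Toy causal model: the value at position i depends only on tokens[:i+1]."""
    acts = []
    state = b""
    for tok in tokens:
        state = hashlib.sha256(state + tok.encode("utf-8")).digest()
        acts.append(state.hex()[:8])
    return acts


# Placeholder tokens for the setup described above.
prompt_1 = ["p1_a", "p1_b"]
python_code = ["code_a", "code_b"]
prompt_2 = ["p2_a"]

code_context = prompt_1 + python_code                 # context once the code is written
critique_context = prompt_1 + python_code + prompt_2  # context when critiquing

shared = common_prefix_len(code_context, critique_context)
code_acts = toy_causal_activations(code_context)
critique_acts = toy_causal_activations(critique_context)
```

In this toy, `code_acts[:shared]` and `critique_acts[:shared]` are identical and cover every [prompt 1] position, so anything encoded in the activations on those tokens is visible to both generations.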
If we ins...