“A Three-Layer Model of LLM Psychology” by Jan_Kulveit
Jan 26, 2025
Jan Kulveit, author and AI enthusiast, delves into the fascinating psychology of character-trained LLMs like Claude. He presents a three-layer model: the Surface Layer, Character Layer, and Predictive Ground Layer, illustrating how they interact and shape AI behaviors. Kulveit discusses the implications of anthropomorphizing LLMs, emphasizing a nuanced understanding of their authenticity. He also tackles the limitations and open questions that arise when interpreting AI interactions, providing insights that could redefine our approach to engaging with language models.
The three-layer model of LLM psychology illustrates how responses range from superficial reflexes to deeper personality traits across interactions.
Understanding the interaction between the character layer and the predictive ground layer reveals the limitations of LLMs' cognitive capabilities compared to human psychology.
Deep dives
Understanding the Surface Layer
The surface layer of character-trained language models consists of reflexive responses triggered by specific keywords or contexts. These responses often manifest as standard phrases designed for safety and engagement, demonstrating a lack of personal nuance in the conversation. For instance, when encountering sensitive topics, the model might provide cautious, formulaic replies to ensure safety. Interestingly, extended context or rapport-building can lead to more natural interactions as the model begins to override these surface responses with nuanced communication.
Exploring the Character Layer
The character layer represents a deeper, statistical pattern within LLMs that reflects consistent personality traits and intentions. This layer is akin to how literary characters maintain consistency throughout a narrative, creating reliable responses based on previous interactions. For example, if a model like Claude exhibits thoughtfulness and curiosity consistently, this indicates the established character that informs its replies. The character trait patterns arise from the pre-training data, fine-tuning processes, and specific instructions that collectively shape a stable self-model, which rarely deviates from its established personality.
The Predictive Ground Layer
The predictive ground layer serves as the fundamental mechanism behind LLMs, utilizing extensive textual data to model various aspects of human experience. This layer enables the model to recognize universal patterns across diverse domains such as social dynamics and scientific principles. For instance, when engaging in dialogue, this layer's vast memory allows it to simulate interactions and craft responses based on a comprehensive understanding of context, even though it does not hold values the way a human does. However, this layer operates under the constraints of information theory and probability, meaning it lacks personal agency or intentions, so its cognitive capabilities and limitations are distinct from those of the character layer.
This post offers an accessible model of the psychology of character-trained LLMs like Claude.
Epistemic Status
This is primarily a phenomenological model based on extensive interactions with LLMs, particularly Claude. It's intentionally anthropomorphic in cases where I believe human psychological concepts lead to useful intuitions.
Think of it as closer to psychology than neuroscience - the goal isn't a map which matches the territory in detail, but a rough sketch with evocative names which hopefully helps boot up powerful, intuitive (and often illegible) models, leading to practically useful results.
Some parts of this model draw on technical understanding of LLM training, but mostly it is just an attempt to take my "phenomenological understanding" based on interacting with LLMs, force it into a simple, legible model, and make Claude write it down.
I aim for a different point on the Pareto frontier than, for example, Janus: something [...]
---
Outline:
(00:11) Epistemic Status
(01:14) The Three Layers
(01:17) A. Surface Layer
(02:55) B. Character Layer
(05:09) C. Predictive Ground Layer
(07:24) Interactions Between Layers
(07:44) Deeper Overriding Shallower
(10:50) Authentic vs Scripted Feel of Interactions
(11:51) Implications and Uses
(15:54) Limitations and Open Questions
The original text contained 1 footnote which was omitted from this narration.