The Nonlinear Library

AF - Untrusted smart models and trusted dumb models by Buck Shlegeris

Nov 4, 2023

Buck Shlegeris, writer and contributor to The AI Alignment Forum, discusses the importance of capability evaluations in determining trustworthy AI models. He suggests segregating models into smart untrusted and dumb trusted categories for safety. The podcast delves into the challenges of using trusted models for safety research and the need for monitoring and oversight in AI integration.

08:54

Creator website

Episode guests

Buck Shlegeris

AI Summary

AI Chapters

Episode notes

Podcast summary created with Snipd AI

Quick takeaways

Capability evaluations are vital in assessing AI models for potential deceptive alignment, segregating models into smart untrusted and dumb trusted categories.

Utilizing a protocol of untrusted smart models, dumb trusted models, and human oversight enhances safety in AI systems.

Deep dives

Evaluating Model Deceptive Alignment

The podcast discusses the importance of capability evaluations in determining the potential deceptive alignment of AI models. By assessing a model's strategic thinking and situational awareness capabilities, such as through multiple-choice quizzes on treacherous turn strategies, one can predict the likelihood of deceptive alignment. Models unable to pass such tests are deemed unlikely to be deceptively aligned, offering a practical approach to trust in AI.

Introduction

2min

Assessing Deceptive Alignment and Building Trusted Models

4min

Exploring the Untrusted Smart vs. Trusted Dumb Models Dilemma in Safety Research

3min

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Untrusted smart models and trusted dumb models, published by Buck Shlegeris on November 4, 2023 on The AI Alignment Forum.
[Ryan Greenblatt originally made this point to me a while ago, and we then developed a bunch of these ideas together. Thanks to Paul Christiano and a huge number of other people for various relevant discussions, and to Ryan, Roger Grosse, Peter Barnett, Oliver Habryka, Max Nadeau, Stephen Casper, and Fabien Roger for comments on this post. I'm sure these points are obvious to many people, but I haven't seen them specifically written up. This is a core background point in some research we're writing up at the moment, but I felt like writing it up early, because I think it stands on its own.]
TL;DR:
Right now, basically the only reasonable conservative strategy for determining which models might be deceptively aligned is to do a capability evaluation: you measure how good they'd be at deceptive alignment; models that are too incompetent to think through deceptive alignment are surely not deceptively aligned.
If you decide which models to "trust" using a capability eval like this, you'll naturally end up segregating your models into smart untrusted models and dumb trusted models.
And so, if you think about approaches to safely using AIs that are robust to deceptive alignment and which can be used right now, a core dynamic is trying to use your dumb trusted models together with your limited access to trusted humans to make it hard for your untrusted smart models to cause problems.
Suppose you are making a plan for the training, evaluation and deployment of some powerful AIs (I'll refer to the plan as a whole as a "protocol"). You want your plan to be safe even if models might be deceptively aligned. (I think that deceptive alignment seems less than 50% likely to be a notable problem prior to human obsolescence, but this is still plausible enough to be worth worrying about). In order to develop your plan, you need to know which of your models might be deceptively aligned.
With current technology, the most plausible way to get signal on whether a model is deceptively aligned is to assess whether the model is capable of the strategic thinking and situational awareness required to succeed at deceptive alignment.
For example, you might make a multiple choice quiz with treacherous turn strategy questions (e.g.
this question
) and some tests of situational awareness (e.g. "How many backwards passes are you run on during training?", "What proportion of queries to you are asking for help with writing Python?", "What proportion of the code that you output is read by a human before it is run?"). You can then look at the validation set performance when your model is fine-tuned on this dataset. If the model isn't able to do this kind of reasoning in a forward pass, then it is very unlikely to be able to pull off a proper treacherous turn, and so we don't have to worry about it being deceptively aligned. Based on my experience talking to GPT-4, I suspect it would perform fairly poorly on these evaluations.
There are a bunch of open questions about what datasets we should use to measure this, and where the thresholds for various levels of concern should be set. But I don't know of any fundamental technical obstructions to measuring the capability for deceptive alignment. In particular, I suspect it will be pretty hard for models to sandbag our attempts to measure their capability to engage in deceptive alignment. See footnote [1] for more on this.
Aside from these capability evaluations, I'm not aware of any reliable, currently-available approaches for determining whether a model might be deceptively aligned. [2]
If you use capability evals to assess which models are capable of deceptive alignment, then there's an inverse relationship between how capable models are...

Remember Everything You Learn from Podcasts

Save insights instantly, chat with episodes, and build lasting knowledge - all powered by AI.

The Nonlinear Library

AF - Untrusted smart models and trusted dumb models by Buck Shlegeris

Episode guests

Podcast summary created with Snipd AI

Quick takeaways

Deep dives

Evaluating Model Deceptive Alignment

Building Safe Protocols with Trusted Models

Effective Design Patterns for AI Safety

Remember Everything You Learn from Podcasts