
4 - Risks from Learned Optimization with Evan Hubinger
AXRP - the AI X-risk Research Podcast
Is There a Difference Between a Model's Mesa-Objective and a Behavioral Objective?
So obviously, this toy example shows the difficulty of extracting what a model is doing. If you had some high-level interpretability tools, for example, that enable you to see the variable names but not the implementation, you might be significantly led astray. We don't claim that this works for all models. Lots of models just don't fall into the category of mesa-optimizers. But we could still try to get a nice mechanistic description of what that sort of model would be doing. And then also, we're not trying to come to a conclusion about what a model's objective might be only by looking at its behaviour.