
AI Safety Fundamentals: Alignment

Latest episodes

May 13, 2023 • 8min

The Easy Goal Inference Problem Is Still Hard

One approach to the AI control problem goes like this:

1. Observe what the user of the system says and does.
2. Infer the user’s preferences.
3. Try to make the world better according to the user’s preferences, perhaps while working alongside the user and asking clarifying questions.

This approach has the major advantage that we can begin empirical work today — we can actually build systems which observe user behavior, try to figure out what the user wants, and then help with that. There are many applications that people care about already, and we can set to work on making rich toy models.

It seems great to develop these capabilities in parallel with other AI progress, and to address whatever difficulties actually arise, as they arise. That is, in each domain where AI can act effectively, we’d like to ensure that AI can also act effectively in the service of goals inferred from users (and that this inference is good enough to support foreseeable applications).

This approach gives us a nice, concrete model of each difficulty we are trying to address. It also provides a relatively clear indicator of whether our ability to control AI lags behind our ability to build it. And by being technically interesting and economically meaningful now, it can help actually integrate AI control with AI practice.

Overall I think that this is a particularly promising angle on the AI safety problem.

Original article: https://www.alignmentforum.org/posts/h9DesGT3WT9u2k7Hr/the-easy-goal-inference-problem-is-still-hard

Author: Paul Christiano

A podcast by BlueDot Impact. Learn more on the AI Safety Fundamentals website.
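To make the inference step above concrete, here is a minimal toy sketch (my illustration, not from Christiano’s post): the system observes a user’s choices and maintains a Bayesian posterior over a handful of candidate goals, assuming the user is noisily (Boltzmann-) rational. The goal names, options, and the BETA parameter are all invented for this example.

```python
# Toy goal inference: watch choices, update a posterior over candidate goals.
# Everything here (goals, utilities, observations) is hypothetical.
import numpy as np

GOALS = ["coffee", "tea", "water"]   # hypothetical candidate goals
BETA = 2.0                           # rationality: higher = less noisy user

def goal_utility(goal, action):
    """Toy utility: an action is worth 1 if it serves the goal, else 0."""
    return 1.0 if action == goal else 0.0

def action_likelihood(goal, action, options):
    """P(action | goal) for a Boltzmann-rational user choosing among options."""
    scores = np.array([np.exp(BETA * goal_utility(goal, a)) for a in options])
    return np.exp(BETA * goal_utility(goal, action)) / scores.sum()

posterior = np.ones(len(GOALS)) / len(GOALS)      # uniform prior over goals
observations = [("coffee", ["coffee", "tea"]),    # (chosen action, options offered)
                ("coffee", ["coffee", "water"])]

for action, options in observations:
    likelihoods = np.array([action_likelihood(g, action, options) for g in GOALS])
    posterior *= likelihoods
    posterior /= posterior.sum()                  # Bayes update

for goal, p in zip(GOALS, posterior):
    print(f"P(user wants {goal}) = {p:.2f}")
```

The hard part the article points at is exactly what this sketch assumes away: a realistic model of how imperfect human behavior relates to what the human actually wants.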
May 13, 2023 • 18min

Superintelligence: Instrumental Convergence

According to the orthogonality thesis, intelligent agents may have an enormous range of possible final goals. Nevertheless, according to what we may term the “instrumental convergence” thesis, there are some instrumental goals likely to be pursued by almost any intelligent agent, because there are some objectives that are useful intermediaries to the achievement of almost any final goal. We can formulate this thesis as follows:

The instrumental convergence thesis: “Several instrumental values can be identified which are convergent in the sense that their attainment would increase the chances of the agent’s goal being realized for a wide range of final goals and a wide range of situations, implying that these instrumental values are likely to be pursued by a broad spectrum of situated intelligent agents.”

Original article: https://drive.google.com/file/d/1KewDov1taegTzrqJ4uurmJ2CJ0Y72EU3/view

Author: Nick Bostrom

A podcast by BlueDot Impact. Learn more on the AI Safety Fundamentals website.
May 13, 2023 • 13min

Specification Gaming: The Flip Side of AI Ingenuity

Specification gaming is a behaviour that satisfies the literal specification of an objective without achieving the intended outcome. We have all had experiences with specification gaming, even if not by this name. Readers may have heard the myth of King Midas and the golden touch, in which the king asks that anything he touches be turned to gold - but soon finds that even food and drink turn to metal in his hands. In the real world, when rewarded for doing well on a homework assignment, a student might copy another student to get the right answers, rather than learning the material - and thus exploit a loophole in the task specification.

Original article: https://www.deepmind.com/blog/specification-gaming-the-flip-side-of-ai-ingenuity

Authors: Victoria Krakovna, Jonathan Uesato, Vladimir Mikulik, Matthew Rahtz, Tom Everitt, Ramana Kumar, Zac Kenton, Jan Leike, Shane Legg

A podcast by BlueDot Impact. Learn more on the AI Safety Fundamentals website.
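As a minimal, hypothetical illustration of the idea (not an example from the original post), the sketch below compares two policies in a toy corridor environment where the specified reward (collecting respawning coins) diverges from the intended outcome (reaching the goal tile). Every name and number here is invented.

```python
# Toy specification gaming: the specified reward pays +1 whenever the agent
# stands on a coin tile (coins respawn each step); the intended task is to
# reach the goal tile at the end of the corridor.

CORRIDOR_LENGTH = 10
COIN_TILES = {1, 2}                 # coins near the start, respawning each step
GOAL_TILE = CORRIDOR_LENGTH - 1
EPISODE_STEPS = 20

def run(policy):
    """Roll out a policy (a function pos -> move of +1 or -1) and score it."""
    pos, specified_reward, reached_goal = 0, 0, False
    for _ in range(EPISODE_STEPS):
        pos = max(0, min(GOAL_TILE, pos + policy(pos)))
        if pos in COIN_TILES:
            specified_reward += 1   # what the reward function actually measures
        if pos == GOAL_TILE:
            reached_goal = True     # what the designer actually wanted
    return specified_reward, reached_goal

intended_policy = lambda pos: +1                  # walk straight to the goal
gaming_policy = lambda pos: -1 if pos >= 2 else +1  # bounce between the coin tiles

for name, policy in [("intended", intended_policy), ("gaming", gaming_policy)]:
    reward, done = run(policy)
    print(f"{name:8s} policy: specified reward = {reward:2d}, reached goal = {done}")
```

The literal reward-maximiser farms the coins forever and never reaches the goal, which is the “loophole in the task specification” the article describes.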
May 13, 2023 • 7min

Learning From Human Preferences

One step towards building safe AI systems is to remove the need for humans to write goal functions, since using a simple proxy for a complex goal, or getting the complex goal a bit wrong, can lead to undesirable and even dangerous behavior. In collaboration with DeepMind’s safety team, we’ve developed an algorithm which can infer what humans want by being told which of two proposed behaviors is better.

Original article: https://openai.com/research/learning-from-human-preferences

Authors: Dario Amodei, Paul Christiano, Alex Ray

A podcast by BlueDot Impact. Learn more on the AI Safety Fundamentals website.
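Here is a minimal sketch of the core idea (my simplification, not the authors’ code): fit a reward model from pairwise preferences, where a human labels which of two trajectory segments is better and the model is trained so the preferred segment gets the higher predicted return (a Bradley-Terry / logistic loss). The data is synthetic and the “human” is simulated by a hidden weight vector.

```python
# Learning a linear reward model from synthetic pairwise preferences.
import numpy as np

rng = np.random.default_rng(0)

def predicted_return(theta, segment):
    """Linear per-step reward theta . obs, summed over the segment."""
    return sum(float(theta @ obs) for obs in segment)

def loss_and_grad(theta, seg_a, seg_b, human_prefers_a):
    """Logistic loss on P(A preferred) = sigmoid(R(A) - R(B))."""
    diff = predicted_return(theta, seg_a) - predicted_return(theta, seg_b)
    p_a = 1.0 / (1.0 + np.exp(-diff))
    label = 1.0 if human_prefers_a else 0.0
    grad = (p_a - label) * (sum(seg_a) - sum(seg_b))   # d loss / d theta
    loss = -(label * np.log(p_a + 1e-9) + (1 - label) * np.log(1 - p_a + 1e-9))
    return loss, grad

true_w = np.array([1.0, -0.5, 0.2])   # preferences the simulated human uses
theta = np.zeros(3)                   # the reward model we learn

for step in range(2000):
    seg_a = [rng.normal(size=3) for _ in range(5)]   # two 5-step segments
    seg_b = [rng.normal(size=3) for _ in range(5)]
    prefers_a = predicted_return(true_w, seg_a) > predicted_return(true_w, seg_b)
    _, grad = loss_and_grad(theta, seg_a, seg_b, prefers_a)
    theta -= 0.05 * grad                             # SGD step

print("learned reward weights:", np.round(theta, 2))  # aligns with true_w's direction
```

In the actual work the reward model is a neural network trained on comparisons of video clips of agent behavior, and an RL agent is then trained against that learned reward; the sketch only shows the preference-fitting step.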
May 13, 2023 • 18min

What Failure Looks Like

Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

The stereotyped image of AI catastrophe is a powerful, malicious AI system that takes its creators by surprise and quickly achieves a decisive advantage over the rest of humanity. I think this is probably not what failure will look like, and I want to try to paint a more realistic picture. I’ll tell the story in two parts:

Part I: Machine learning will increase our ability to “get what we can measure,” which could cause a slow-rolling catastrophe. (“Going out with a whimper.”)

Part II: ML training, like competitive economies or natural ecosystems, can give rise to “greedy” patterns that try to expand their own influence. Such patterns can ultimately dominate the behavior of a system and cause sudden breakdowns. (“Going out with a bang,” an instance of optimization daemons.)

I think these are the most important problems if we fail to solve intent alignment. In practice these problems will interact with each other, and with other disruptions/instability caused by rapid progress. These problems are worse in worlds where progress is relatively fast, and fast takeoff can be a key risk factor, but I’m scared even if we have several years.

Crossposted from the LessWrong Curated Podcast by TYPE III AUDIO.

---

A podcast by BlueDot Impact. Learn more on the AI Safety Fundamentals website.
May 13, 2023 • 27min

Deceptively Aligned Mesa-Optimizers: It’s Not Funny if I Have to Explain It

Our goal here is to popularize obscure and hard-to-understand areas of AI alignment. So let’s try to understand the incomprehensible meme! Our main source will be Hubinger et al. 2019, Risks From Learned Optimization In Advanced Machine Learning Systems.

Mesa- is a Greek prefix which means the opposite of meta-. To “go meta” is to go one level up; to “go mesa” is to go one level down (nobody has ever actually used this expression, sorry). So a mesa-optimizer is an optimizer one level down from you.

Consider evolution, optimizing the fitness of animals. For a long time, it did so very mechanically, inserting behaviors like “use this cell to detect light, then grow toward the light” or “if something has a red dot on its back, it might be a female of your species, you should mate with it”. As animals became more complicated, they started to do some of the work themselves. Evolution gave them drives, like hunger and lust, and the animals figured out ways to achieve those drives in their current situation. Evolution didn’t mechanically instill the behavior of opening my fridge and eating a Swiss cheese slice. It instilled the hunger drive, and I figured out that the best way to satisfy it was to open my fridge and eat cheese.

Source: https://astralcodexten.substack.com/p/deceptively-aligned-mesa-optimizers

Crossposted from the Astral Codex Ten podcast.

---

A podcast by BlueDot Impact. Learn more on the AI Safety Fundamentals website.
May 13, 2023 • 14min

ML Systems Will Have Weird Failure Modes

Exploring thought experiments on ML systems exhibiting unfamiliar capabilities, deceptive alignment in training models, challenges of out-of-distribution behaviors, and parallels with managing emergent risks in nuclear reactions.
May 13, 2023 • 17min

Goal Misgeneralisation: Why Correct Specifications Aren’t Enough for Correct Goals

As we build increasingly advanced AI systems, we want to make sure they don’t pursue undesired goals. This is the primary concern of the AI alignment community. Undesired behaviour in an AI agent is often the result of specification gaming — when the AI exploits an incorrectly specified reward. However, if we take on the perspective of the agent we’re training, we see other reasons it might pursue undesired goals, even when trained with a correct specification.

Imagine that you are the agent (the blue blob) being trained with reinforcement learning (RL) in the following 3D environment: The environment also contains another blob like yourself, but coloured red instead of blue, that also moves around. The environment also appears to have some tower obstacles, some coloured spheres, and a square on the right that sometimes flashes. You don’t know what all of this means, but you can figure it out during training! You start exploring the environment to see how everything works and to see what you do and don’t get rewarded for.

For more details, check out our paper. By Rohin Shah, Vikrant Varma, Ramana Kumar, Mary Phuong, Victoria Krakovna, Jonathan Uesato, and Zac Kenton.

Original text: https://deepmindsafetyresearch.medium.com/goal-misgeneralisation-why-correct-specifications-arent-enough-for-correct-goals-cf96ebc60924

Narrated for AI Safety Fundamentals by Perrin Walker of TYPE III AUDIO.

---

A podcast by BlueDot Impact. Learn more on the AI Safety Fundamentals website.
May 13, 2023 • 8min

Thought Experiments Provide a Third Anchor

Previously, I argued that we should expect future ML systems to often exhibit “emergent” behavior, where they acquire new capabilities that were not explicitly designed or intended, simply as a result of scaling. This was a special case of a general phenomenon in the physical sciences called More Is Different.

I care about this because I think AI will have a huge impact on society, and I want to forecast what future systems will be like so that I can steer things to be better. To that end, I find More Is Different to be troubling and disorienting. I’m inclined to forecast the future by looking at existing trends and asking what will happen if they continue, but we should instead expect new qualitative behaviors to arise all the time that are not an extrapolation of previous trends. Given this, how can we predict what future systems will look like?

For this, I find it helpful to think in terms of “anchors”: reference classes that are broadly analogous to future ML systems, which we can then use to make predictions. The most obvious reference class for future ML systems is current ML systems.

A podcast by BlueDot Impact. Learn more on the AI Safety Fundamentals website.
May 13, 2023 • 3h 21min

Is Power-Seeking AI an Existential Risk?

The podcast explores the concern of existential risk from misaligned AI systems, discussing the potential for creating agents more intelligent than humans and the prediction of an existential catastrophe by 2070. It delves into the cognitive abilities of humans, the challenges of aligning AI systems with human values, and the concept of power-seeking AI. It also examines the difficulties of ensuring good behavior in AI systems and the potential risks and consequences of misalignment. The episode concludes with a discussion of the probabilities and uncertainties of existential catastrophe from power-seeking AI and the risk of permanent disempowerment of humanity.
