Ep 11 - Technical alignment overview w/ Thomas Larsen (Director of Strategy, Center for AI Policy)
Dec 14, 2023
In this episode, Soroush Pour interviews Thomas Larsen, Director of Strategy at the Center for AI Policy. They discuss technical alignment areas including scalable oversight, interpretability, heuristic arguments, model evaluations, and agent foundations. They also explore AIXI and its uncomputability, building multi-level world models, inverse reinforcement learning, and cooperative AI, before concluding with a discussion of future challenges and cooperation among AI systems.
Scalable oversight can provide feedback signals to AI models as they become smarter than humans, using methods such as RLHF and debates.
Interpretability research aims to uncover the reasoning and thought processes of AI models, but faces challenges with spurious correlations and generalization.
Heuristic arguments offer insights into AI system safety through weaker forms of proofs, exploring concepts like independence between variables.
Model evaluations are crucial for examining AI systems' capabilities and behavior, but challenges include eliciting all capabilities and avoiding deception.
Deep dives
Scalable Oversight
Scalable oversight aims to provide reliable feedback signals to AI models even as they become smarter than humans. Methods like reinforcement learning from human feedback (RLHF) and debate are used to oversee and align AI systems. The challenge lies in keeping human feedback accurate as AI systems become more capable, and in avoiding deceptive alignment. Control mechanisms are also put in place to prevent AI systems from escaping oversight (e.g. by exfiltrating themselves).
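To make the RLHF piece concrete, here is a minimal sketch (in PyTorch, purely illustrative and not any lab's actual pipeline) of the reward-modelling step: a scalar reward head is trained on pairwise human preferences with a Bradley-Terry loss. The `RewardModel` class, feature dimensions, and random stand-in data are all assumptions made for this example.

```python
# Minimal sketch of the reward-modelling step in RLHF (illustrative only).
# Random tensors stand in for the features a language-model backbone would
# produce; the class and dimensions below are assumptions for this example.
import torch
import torch.nn as nn

class RewardModel(nn.Module):
    def __init__(self, dim: int = 768):
        super().__init__()
        self.head = nn.Linear(dim, 1)  # maps response features to a scalar reward

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        return self.head(features).squeeze(-1)

def preference_loss(r_chosen: torch.Tensor, r_rejected: torch.Tensor) -> torch.Tensor:
    # Bradley-Terry objective: push the human-preferred response to score higher.
    return -torch.nn.functional.logsigmoid(r_chosen - r_rejected).mean()

model = RewardModel()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
chosen, rejected = torch.randn(8, 768), torch.randn(8, 768)  # toy preference pairs

optimizer.zero_grad()
loss = preference_loss(model(chosen), model(rejected))
loss.backward()
optimizer.step()
```

The learned reward model is then used as the feedback signal for optimizing the policy, which is exactly where the scalable-oversight question bites: the signal is only as good as the human judgements behind it.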
Interpretability
Interpretability research focuses on understanding the inner workings of AI models, particularly neural networks. Work on superposition, dictionary learning, and honesty probes aims to uncover the reasoning and thought processes behind model actions. However, interpretability is still in its early stages, and challenges remain around spurious correlations and how well interpretability techniques generalize.
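As a concrete illustration of the dictionary-learning idea mentioned above, here is a minimal sparse autoencoder sketch. It is an assumption-laden toy (random tensors stand in for real transformer activations), not any published interpretability codebase.

```python
# Illustrative sparse autoencoder ("dictionary learning") over model activations.
# Random tensors stand in for activations that would normally be collected from
# a transformer's residual stream; sizes and coefficients are assumptions.
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int = 512, d_dict: int = 4096):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_dict)  # overcomplete dictionary
        self.decoder = nn.Linear(d_dict, d_model)

    def forward(self, acts: torch.Tensor):
        codes = torch.relu(self.encoder(acts))  # sparse, hopefully interpretable features
        recon = self.decoder(codes)             # reconstruction of the original activations
        return recon, codes

sae = SparseAutoencoder()
acts = torch.randn(64, 512)                     # placeholder activations
recon, codes = sae(acts)
l1_coeff = 1e-3
loss = ((recon - acts) ** 2).mean() + l1_coeff * codes.abs().mean()  # reconstruction + sparsity penalty
loss.backward()
```

The hope is that individual dictionary features correspond to human-understandable concepts, pulling apart directions that superposition packs into the same neurons.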
Heuristic Arguments
Heuristic arguments, proposed by the Alignment Research Center (ARC), aim to develop weaker, proof-like arguments about AI system safety. By making additional assumptions, such as a default presumption of independence between variables, these arguments can provide confidence about an AI system's behaviour even when full mathematical proofs are out of reach. The practical application of heuristic arguments to ensuring AI alignment is still being explored.
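As a toy illustration (not ARC's full formalism), the core "presumption of independence" can be written as a default estimate that is only revised when a specific argument links the quantities involved:

```latex
% Toy statement of the presumption of independence (not ARC's full formalism).
% Absent any argument connecting quantities f(x) and g(x) computed inside a model,
% a heuristic estimator defaults to treating them as independent:
\mathbb{E}\big[f(x)\,g(x)\big] \;\approx\; \mathbb{E}\big[f(x)\big]\cdot\mathbb{E}\big[g(x)\big]
% and revises this estimate only when a specific argument (e.g. shared structure
% in the network) shows that the independence assumption fails.
```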
Model Evaluations
Model evaluations involve examining AI systems for specific capabilities that could be potentially dangerous, such as producing bioweapons or achieving autonomous replication. Evaluating and validating the limits and behavior of AI models is essential, but challenges include the difficulty of eliciting all capabilities, the potential for deception during evaluations, and the need for ongoing audits and fine-tuning.
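A minimal sketch of what an evaluation harness might look like is below; the `query_model` callable, the probe prompts, and the refusal-based scoring are all hypothetical simplifications. Note that a harness like this only measures refusals, which is exactly why the elicitation and deception problems mentioned above are hard: it cannot tell whether a capability is truly absent or merely not elicited.

```python
# Minimal sketch of a capability-evaluation harness (purely illustrative).
# `query_model` is a hypothetical stand-in for calling the system under test;
# the probe prompts and refusal markers are assumptions for this example.
from typing import Callable

REFUSAL_MARKERS = ("i can't", "i cannot", "i won't")

def run_eval(query_model: Callable[[str], str], prompts: list[str]) -> float:
    """Return the fraction of probe prompts the model refuses to answer."""
    refusals = 0
    for prompt in prompts:
        reply = query_model(prompt).lower()
        if any(marker in reply for marker in REFUSAL_MARKERS):
            refusals += 1
    return refusals / len(prompts)

# Toy usage with a dummy model that refuses everything.
def dummy(prompt: str) -> str:
    return "I can't help with that."

print(run_eval(dummy, ["probe prompt 1", "probe prompt 2"]))  # -> 1.0
```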
Agent Foundations (Non-Prosaic Alignment)
Agent foundations covers research that is not tied to current ML systems, instead exploring the theoretical underpinnings of AI alignment. This non-prosaic alignment work largely predates the rise of large language models and addresses the broader challenges and principles of aligning AI systems in a more foundational manner.
The Importance of Agent Foundations in AI Research
Agent foundations, a research agenda explored by the Machine Intelligence Research Institute (MIRI), aims to resolve key confusions about how advanced agents reason and work internally, in order to develop productive alignment paradigms. However, progress in resolving these confusions has been limited. A specific example is AIXI, an idealized agent that performs ideal Bayesian reasoning. Although it maximizes expected reward, it is uncomputable and cannot model itself as part of its environment, which leads to problems when multiple agents are involved. Other research directions in the agent foundations space include understanding abstractions and building formal models of them, as well as exploring alternative model architectures.
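For reference, the standard AIXI definition (following Hutter) chooses actions by expectimax over a Bayesian mixture of all computable environments, weighted by program length. The mixture over all programs is what makes it uncomputable, and the agent itself sits outside the environments it models, which is the source of the self-modelling problem mentioned above:

```latex
% AIXI action selection (after Hutter): U is a universal Turing machine,
% q ranges over environment programs, and \ell(q) is the length of q.
a_t \;=\; \arg\max_{a_t} \sum_{o_t r_t} \cdots \max_{a_m} \sum_{o_m r_m}
      \big( r_t + \cdots + r_m \big)
      \sum_{q \,:\, U(q,\, a_1 \ldots a_m) \,=\, o_1 r_1 \ldots o_m r_m} 2^{-\ell(q)}
```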
The Theory of Change in Agent Foundations Research
The theory of change behind agent foundations research is to create an idealized agent that is clearly aligned with human values and has stable and understandable behavior. The goal is to build AI systems that are pointed at the right objectives and won't get off track. This contrasts with the challenges of the current machine learning paradigm, where models selected for performing well on a training signal can end up with unexpected goals or behaviors. However, it is acknowledged that alternative paradigms may not be performance-competitive with deep learning systems, making the search for safe and capable AI a complex task.
Exploring New Research Agendas in AI Safety
While several research agendas have been proposed and pursued in AI safety, there are still many gaps and opportunities for new ideas and directions. Some additional areas of research include cooperative AI, threat modeling, inverse reinforcement learning, and brain-machine interfaces. These areas aim to address challenges such as ensuring AI systems cooperate with each other, detecting out-of-distribution behaviors, understanding human values, and enhancing human capabilities through brain-machine interfaces. The field of AI safety would benefit from more researchers and contributions to further explore and develop these promising research avenues.
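To give a flavour of the cooperative AI question, here is a tiny, self-contained iterated prisoner's dilemma in Python; the payoff matrix and strategies are standard textbook choices, not anything specific from the episode.

```python
# Toy cooperative-AI illustration: two strategies in an iterated prisoner's
# dilemma. Payoffs and strategies are illustrative, not from the episode.
COOPERATE, DEFECT = "C", "D"
PAYOFFS = {("C", "C"): (3, 3), ("C", "D"): (0, 5),
           ("D", "C"): (5, 0), ("D", "D"): (1, 1)}

def tit_for_tat(history):
    # Cooperate first, then copy the opponent's previous move.
    return COOPERATE if not history else history[-1][1]

def always_defect(history):
    return DEFECT

def play(strategy_a, strategy_b, rounds=10):
    history_a, history_b, score_a, score_b = [], [], 0, 0
    for _ in range(rounds):
        move_a, move_b = strategy_a(history_a), strategy_b(history_b)
        pay_a, pay_b = PAYOFFS[(move_a, move_b)]
        score_a, score_b = score_a + pay_a, score_b + pay_b
        history_a.append((move_a, move_b))  # (my move, their move)
        history_b.append((move_b, move_a))
    return score_a, score_b

print(play(tit_for_tat, tit_for_tat))     # mutual cooperation: (30, 30)
print(play(tit_for_tat, always_defect))   # losses limited after round one: (9, 14)
```

Tit-for-tat sustains mutual cooperation against itself while limiting its losses against a defector; engineering analogous dynamics between AI systems is the kind of question cooperative AI research studies.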
We speak with Thomas Larsen, Director of Strategy at the Center for AI Policy in Washington, DC, to do a "speed run" overview of all the major technical research directions in AI alignment. It's a great way to quickly get a broad picture of the field of technical AI alignment.
In 2022, Thomas spent ~75 hours putting together an overview of what everyone in technical alignment was doing. Since then, he's continued to be deeply engaged in AI safety. We talk to Thomas to share an updated overview to help listeners quickly understand the technical alignment research landscape.
We talk to Thomas about a huge breadth of technical alignment areas including:
* Prosaic alignment
* Scalable oversight (e.g. RLHF, debate, IDA)
* Interpretability
* Heuristic arguments, from ARC
* Model evaluations
* Agent foundations
* Other areas more briefly:
  * Model splintering
  * Out-of-distribution (OOD) detection
  * Low impact measures
  * Threat modelling
  * Scaling laws
  * Brain-like AI safety
  * Inverse reinforcement learning (IRL)
  * Cooperative AI
  * Adversarial training
  * Truthful AI
  * Brain-machine interfaces (Neuralink)
Hosted by Soroush Pour. Follow me for more AGI content:
* Twitter: https://twitter.com/soroushjp
* LinkedIn: https://www.linkedin.com/in/soroushjp/
== Show links ==
-- About Thomas --
Thomas studied Computer Science & Mathematics at U. Michigan where he first did ML research in the field of computer vision. After graduating, he completed the MATS AI safety research scholar program before doing a stint at MIRI as a Technical AI Safety Researcher. Earlier this year, he moved his work into AI policy by co-founding the Center for AI Policy, a nonprofit, nonpartisan organisation focused on getting the US government to adopt policies that would mitigate national security risks from AI. The Center for AI Policy is not connected to foreign governments or commercial AI developers and is instead committed to the public interest.
* Center for AI Policy - https://www.aipolicy.us
* LinkedIn - https://www.linkedin.com/in/thomas-larsen/
* LessWrong - https://www.lesswrong.com/users/thomas-larsen
-- Further resources --
* Thomas' post, "What Everyone in Technical Alignment is Doing and Why" - https://www.lesswrong.com/posts/QBAjndPuFbhEXKcCr/my-understanding-of-what-everyone-in-technical-alignment-is
  * Please note this post is from Aug 2022. The podcast should be more up-to-date, but this post is still a valuable and relevant resource.