AXRP - the AI X-risk Research Podcast

Latest episodes

Dec 11, 2020 • 58min

3 - Negotiable Reinforcement Learning with Andrew Critch

In this episode, I talk with Andrew Critch about negotiable reinforcement learning: what happens when two people (or organizations, or what have you) who have different beliefs and preferences jointly build some agent that will take actions in the real world. In the paper we discuss, it's proven that the only way to make such an agent Pareto optimal - that is, to have it not be the case that there's a different agent that both people would prefer to use instead - is to have it preferentially optimize the preferences of whoever's beliefs were more accurate. We discuss his motivations for working on the problem and what he thinks about it.

Link to the paper - Negotiable Reinforcement Learning for Pareto Optimal Sequential Decision-Making: papers.nips.cc/paper/2018/hash/5b8e4fd39d9786228649a8a8bec4e008-Abstract.html
Link to the transcript: axrp.net/episode/2020/12/11/episode-3-negotiable-reinforcement-learning-andrew-critch.html
Critch's Google Scholar profile: scholar.google.com/citations?user=F3_yOXUAAAAJ&hl=en&oi=ao
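To give a rough numerical sense of the flavor of that result (this is a toy sketch, not the construction or proof in the paper; the beliefs, weights, and update rule below are invented for illustration):

```python
# Toy sketch: an agent serving two principals re-weights their utilities over
# time in proportion to how well each principal's beliefs predicted what
# actually happened. All numbers here are made up.

import numpy as np

rng = np.random.default_rng(0)

# Two principals disagree about the probability that a coin lands heads.
belief_a = 0.8   # principal A thinks heads is likely
belief_b = 0.3   # principal B thinks heads is unlikely
true_p   = 0.7   # the actual coin

# Start with equal bargaining weights.
weight_a, weight_b = 1.0, 1.0

for step in range(20):
    heads = rng.random() < true_p
    # Each principal's weight is multiplied by the likelihood they assigned
    # to the observed outcome (a Bayes-factor style update).
    weight_a *= belief_a if heads else (1.0 - belief_a)
    weight_b *= belief_b if heads else (1.0 - belief_b)
    # Renormalize so the weights stay comparable.
    total = weight_a + weight_b
    weight_a, weight_b = weight_a / total, weight_b / total

print(f"final weight on A: {weight_a:.3f}, on B: {weight_b:.3f}")
# A's beliefs were closer to the truth, so over repeated observations the
# agent's weighted objective tilts toward optimizing A's preferences.
```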
Dec 11, 2020 • 1h 9min

2 - Learning Human Biases with Rohin Shah

One approach to creating useful AI systems is to watch humans doing a task, infer what they're trying to do, and then try to do that well. The simplest way to infer what the humans are trying to do is to assume there's one goal that they share, and that they're optimally achieving the goal. This has the problem that humans aren't actually optimal at achieving the goals they pursue. We could instead code in the exact way in which humans behave suboptimally, except that we don't know that either. In this episode, I talk with Rohin Shah about his paper on learning the ways in which humans are suboptimal at the same time as learning what goals they pursue: why it's hard, how he tried to do it, how well he did, and why it matters.

Link to the paper - On the Feasibility of Learning, Rather than Assuming, Human Biases for Reward Inference: arxiv.org/abs/1906.09624
Link to the transcript: axrp.net/episode/2020/12/11/episode-2-learning-human-biases-rohin-shah.html
The Alignment Newsletter: rohinshah.com/alignment-newsletter
Rohin's contributions to the AI Alignment Forum: alignmentforum.org/users/rohinmshah
Rohin's website: rohinshah.com
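As a very rough illustration of the core difficulty (an invented toy example, not anything from the paper or its experiments): if you assume the demonstrator is optimal when they're actually biased, you infer the wrong reward.

```python
# Toy example (invented numbers) of why assuming the human is optimal breaks
# reward inference when they're actually biased. Here the "human" is myopic:
# they only look at immediate reward.

import numpy as np

# Two options: option 0 pays 1 now and 0 later; option 1 pays 0 now and 5 later.
immediate = np.array([1.0, 0.0])
delayed = np.array([0.0, 5.0])
true_value = immediate + delayed          # [1, 5]: option 1 is actually better

# A myopic demonstrator chooses based on immediate reward alone.
human_choice = int(np.argmax(immediate))  # picks option 0

# An inference procedure that assumes optimality concludes the chosen option
# must have the highest total value, so here it infers the wrong preferences.
print("human chose option", human_choice,
      "but the actually-best option is", int(np.argmax(true_value)))
# Learning the bias (here, myopia) jointly with the reward is one way to avoid
# baking in this mistaken assumption.
```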
Dec 11, 2020 • 59min

1 - Adversarial Policies with Adam Gleave

In this episode, Adam Gleave and I talk about adversarial policies. Basically, in current reinforcement learning, people train agents that act in some kind of environment, sometimes an environment that contains other agents. For instance, you might train agents that play sumo with each other, with the objective of making them generally good at sumo. Adam's research looks at the case where all you're trying to do is make an agent that defeats one specific other agent: how easy is it, and what happens? He discovers that often you can do it pretty easily, and your agent can behave in a very silly-seeming way that nevertheless happens to exploit some 'bug' in the opponent. We talk about the experiments he ran, the results, and what they say about how we do reinforcement learning.

Link to the paper - Adversarial Policies: Attacking Deep Reinforcement Learning: arxiv.org/abs/1905.10615
Link to the transcript: axrp.net/episode/2020/12/11/episode-1-adversarial-policies-adam-gleave.html
Adam's website: gleave.me
Adam's Twitter account: twitter.com/argleave
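For a feel of the setup (a deliberately tiny, made-up stand-in using a matrix game rather than the simulated-robotics environments in the paper), the key point is that the victim policy is frozen and the attacker only needs to beat that one opponent:

```python
# Minimal sketch of the adversarial-policy setup: the victim policy is fixed,
# so the attacker only has to exploit that one specific opponent. The game and
# numbers here are invented for illustration.

import numpy as np

rng = np.random.default_rng(0)

# Rock-paper-scissors payoff for the attacker (rows) vs. the victim (columns).
payoff = np.array([[ 0, -1,  1],
                   [ 1,  0, -1],
                   [-1,  1,  0]])

# The frozen victim has a fixed, slightly biased policy (plays rock too often).
victim_policy = np.array([0.5, 0.25, 0.25])

# The attacker estimates the victim's behavior from observed games and then
# best-responds to that single opponent -- no need to be good in general.
observed = rng.choice(3, size=1000, p=victim_policy)
empirical = np.bincount(observed, minlength=3) / len(observed)
attacker_action = int(np.argmax(payoff @ empirical))

print("attacker always plays action", attacker_action,
      "| expected payoff:", (payoff @ empirical)[attacker_action].round(3))
# Against a general opponent this pure strategy would be weak, but against
# this one frozen victim it wins more often than it loses -- the same flavor
# as an adversarial policy exploiting a 'bug' in a specific trained agent.
```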
