Owain Evans is an AI alignment researcher, a research associate at the Center for Human-Compatible AI (CHAI) at UC Berkeley, and is now leading a new AI safety research group.
In this episode we discuss two of his recent papers, “Me, Myself, and AI: The Situational Awareness Dataset (SAD) for LLMs” and “Connecting the Dots: LLMs can Infer and Verbalize Latent Structure from Disparate Training Data”, alongside some Twitter questions.
LINKS
Patreon: https://www.patreon.com/theinsideview
Manifund: https://manifund.org/projects/making-52-ai-alignment-video-explainers-and-podcasts
Ask questions: https://twitter.com/MichaelTrazzi
Owain Evans: https://twitter.com/owainevans_uk
OUTLINE
(00:00:00) Intro
(00:01:12) Owain's Agenda
(00:02:25) Defining Situational Awareness
(00:03:30) Safety Motivation
(00:04:58) Why Release A Dataset
(00:06:17) Risks From Releasing It
(00:10:03) Claude 3 on the Longform Task
(00:14:57) Needle in a Haystack
(00:19:23) Situating Prompt
(00:23:08) Deceptive Alignment Precursor
(00:30:12) Distribution Over Two Random Words
(00:34:36) Discontinuing a 01 Sequence
(00:40:20) GPT-4 Base on the Longform Task
(00:46:44) Human-AI Data in GPT-4's Pretraining
(00:49:25) Are Longform Task Questions Unusual?
(00:51:48) When Will Situational Awareness Saturate?
(00:53:36) Safety And Governance Implications Of Saturation
(00:56:17) Evaluation Implications Of Saturation
(00:57:40) Follow-up Work On The Situational Awareness Dataset
(01:00:04) Would Removing Chain-Of-Thought Work?
(01:02:18) Out-of-Context Reasoning: the "Connecting the Dots" paper
(01:05:15) Experimental Setup
(01:07:46) Concrete Function Example: 3x + 1
(01:11:23) Isn't It Just A Simple Mapping?
(01:17:20) Safety Motivation
(01:22:40) Out-Of-Context Reasoning Results Were Surprising
(01:24:51) The Biased Coin Task
(01:27:00) Will Out-Of-Context Reasoning Scale?
(01:32:50) Checking If In-Context Learning Works
(01:34:33) Mixture-Of-Functions
(01:38:24) Inferring New Architectures From arXiv
(01:43:52) Twitter Questions
(01:44:27) How Does Owain Come Up With Ideas?
(01:49:44) How Did Owain's Background Influence His Research Style And Taste?
(01:52:06) Should AI Alignment Researchers Aim For Publication?
(01:57:01) How Can We Apply LLM Understanding To Mitigate Deceptive Alignment?
(01:58:52) Could Owain's Research Accelerate Capabilities?
(02:08:44) How Was Owain's Work Received?
(02:13:23) Last Message