

Into AI Safety
Jacob Haimes
The Into AI Safety podcast aims to make it easier for everyone, regardless of background, to get meaningfully involved with the conversations surrounding the rules and regulations which should govern the research, development, deployment, and use of the technologies encompassed by the term "artificial intelligence" or "AI".
For better formatted show notes, additional resources, and more, go to https://kairos.fm/intoaisafety/
Episodes

Mar 11, 2024 • 13min
MINISODE: Restructure Vol. 2
UPDATE: Contrary to what I say in this episode, I won't be removing any episodes that are already published from the podcast RSS feed.

After getting some advice and reflecting more on my own personal goals, I have decided to shift the direction of the podcast towards accessible content regarding "AI" instead of the show's original focus. I will still be releasing what I am calling research ride-along content to my Patreon, but the show's feed will consist only of content that I aim to make as accessible as possible.

00:35 - TL;DL
01:12 - Advice from Pete
03:10 - My personal goal
05:39 - Reflection on refining my goal
09:08 - Looking forward (logistics)

Mar 4, 2024 • 54min
INTERVIEW: StakeOut.AI w/ Dr. Peter Park (1)
Dr. Peter Park is an AI Existential Safety Postdoctoral Fellow working with Dr. Max Tegmark at MIT. In conjunction with Harry Luk and one other cofounder, he founded StakeOut.AI, a non-profit focused on making AI go well for humans.

00:54 - Intro
03:15 - Dr. Park, x-risk, and AGI
08:55 - StakeOut.AI
12:05 - Governance scorecard
19:34 - Hollywood webinar
22:02 - Regulations.gov comments
23:48 - Open letters
26:15 - EU AI Act
35:07 - Effective accelerationism
40:50 - Divide and conquer dynamics
45:40 - AI "art"
53:09 - Outro

Links to all articles/papers which are mentioned throughout the episode can be found below, in order of their appearance.
StakeOut.AI
AI Governance Scorecard (go to Pg. 3)
Pause AI
Regulations.gov
USCO StakeOut.AI Comment
OMB StakeOut.AI Comment
AI Treaty open letter
TAISC
Alpaca: A Strong, Replicable Instruction-Following Model
References on EU AI Act and Cedric O
Tweet from Cedric O
EU policymakers enter the last mile for Artificial Intelligence rulebook
AI Act: EU Parliament’s legal office gives damning opinion on high-risk classification ‘filters’
EU’s AI Act negotiations hit the brakes over foundation models
The EU AI Act needs Foundation Model Regulation
BigTech’s Efforts to Derail the AI Act
Open Sourcing the AI Revolution: Framing the debate on open source, artificial intelligence and regulation
Divide-and-Conquer Dynamics in AI-Driven Disempowerment

Feb 26, 2024 • 31min
MINISODE: "LLMs, a Survey"
Take a trip with me through the paper Large Language Models, A Survey, published on February 9th of 2024. All figures and tables mentioned throughout the episode can be found on the Into AI Safety podcast website.

00:36 - Intro and authors
01:50 - My takes and paper structure
04:40 - Getting to LLMs
07:27 - Defining LLMs & emergence
12:12 - Overview of PLMs
15:00 - How LLMs are built
18:52 - Limitations of LLMs
23:06 - Uses of LLMs
25:16 - Evaluations and Benchmarks
28:11 - Challenges and future directions
29:21 - Recap & outro

Links to all articles/papers which are mentioned throughout the episode can be found below, in order of their appearance.
Large Language Models, A Survey
Meysam's LinkedIn Post
Claude E. Shannon
A symbolic analysis of relay and switching circuits (Master's Thesis)
Communication theory of secrecy systems
A mathematical theory of communication
Prediction and entropy of printed English
Future ML Systems Will Be Qualitatively Different
More Is Different
Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training
Are Emergent Abilities of Large Language Models a Mirage?
Are Emergent Abilities of Large Language Models just In-Context Learning?
Attention is all you need
Direct Preference Optimization: Your Language Model is Secretly a Reward Model
KTO: Model Alignment as Prospect Theoretic Optimization
Optimization by Simulated Annealing
Memory and new controls for ChatGPT
Hallucinations and related concepts—their conceptual background

Feb 19, 2024 • 45min
FEEDBACK: Applying for Funding w/ Esben Kran
Esben reviews an application that I would soon submit for Open Philanthropy's Career Transition Funding opportunity. Although I didn't end up receiving the funding, I do think that this episode can be a valuable resource for others, and for myself, when applying for funding in the future.

Head over to Apart Research's website to check out their work, or the Alignment Jam website for information on upcoming hackathons.

A doc-capsule of the application at the time of this recording can be found at this link.

01:38 - Interview starts
05:41 - Proposal
11:00 - Personal statement
14:00 - Budget
21:12 - CV
22:45 - Application questions
34:06 - Funding questions
44:25 - Outro

Links to all articles/papers which are mentioned throughout the episode can be found below, in order of their appearance.
AI governance talent profiles we’d like to see
The AI Governance Research Sprint
Reasoning Transparency
Places to look for funding
Open Philanthropy's Career development and transition funding
Long-Term Future Fund
Manifund

Feb 12, 2024 • 9min
MINISODE: Reading a Research Paper
Before I begin with the paper-distillation-based minisodes, I figured we would go over best practices for reading research papers. I go through the anatomy of typical papers and some generally applicable advice.

00:56 - Anatomy of a paper
02:38 - Most common advice
05:24 - Reading sparsity and path
07:30 - Notes and motivation

Links to all articles/papers which are mentioned throughout the episode can be found below, in order of their appearance.
Ten simple rules for reading a scientific paper
Best sources I found
Let's get critical: Reading academic articles
#GradHacks: A guide to reading research papers
How to read a scientific paper (presentation)
Some more sources
How to read a scientific article
How to read a research paper
Reading a scientific article

Feb 5, 2024 • 49min
HACKATHON: Evals November 2023 (2)
Join our hackathon group for the second episode in the Evals November 2023 Hackathon subseries. In this episode, we solidify our goals for the hackathon after some preliminary experimentation and ideation.

Check out Stellaric's website, or follow them on Twitter.

01:53 - Meeting starts
05:05 - Pitch: extension of locked models
23:23 - Pitch: retroactive holdout datasets
34:04 - Preliminary results
37:44 - Next steps
42:55 - Recap

Links to all articles/papers which are mentioned throughout the episode can be found below, in order of their appearance.
Evalugator library
Password Locked Model blogpost
TruthfulQA: Measuring How Models Mimic Human Falsehoods
BLEU: a Method for Automatic Evaluation of Machine Translation
BoolQ: Exploring the Surprising Difficulty of Natural Yes/No Questions
Detecting Pretraining Data from Large Language Models

Jan 29, 2024 • 10min
MINISODE: Portfolios
I provide my thoughts and recommendations regarding personal professional portfolios.

00:35 - Intro to portfolios
01:42 - Modern portfolios
02:27 - What to include
04:38 - Importance of visuals
05:50 - The "About" page
06:25 - Tools
08:12 - Future of "Minisodes"

Links to all articles/papers which are mentioned throughout the episode can be found below, in order of their appearance.
From Portafoglio to Eportfolio: The Evolution of Portfolio in Higher Education
GIMP
AlternativeTo
Jekyll
GitHub Pages
Minimal Mistakes
My portfolio

Jan 22, 2024 • 45min
INTERVIEW: Polysemanticity w/ Dr. Darryl Wright
Darryl and I discuss his background, how he became interested in machine learning, and a project we are currently working on that investigates penalizing polysemanticity during the training of neural networks.

Check out a diagram of the decoder task used for our research!

01:46 - Interview begins
02:14 - Supernovae classification
08:58 - Penalizing polysemanticity
20:58 - Our "toy model"
30:06 - Task description
32:47 - Addressing hurdles
39:20 - Lessons learned

Links to all articles/papers which are mentioned throughout the episode can be found below, in order of their appearance.
Zooniverse
BlueDot Impact
AI Safety Support
Zoom In: An Introduction to Circuits
MNIST dataset on PapersWithCode
Clusterability in Neural Networks
CIFAR-10 dataset
Effective Altruism Global
CLIP (blog post)
Long Term Future Fund
Engineering Monosemanticity in Toy Models

Jan 15, 2024 • 11min
MINISODE: Starting a Podcast
A summary of, and reflections on, the path I have taken to get this podcast started, including some resource recommendations for others who want to do something similar.

Links to all articles/papers which are mentioned throughout the episode can be found below, in order of their appearance.
LessWrong
Spotify for Podcasters
Into AI Safety podcast website
Effective Altruism Global
Open Broadcaster Software (OBS)
Craig
Riverside

Jan 8, 2024 • 1h 9min
HACKATHON: Evals November 2023 (1)
This episode kicks off our first subseries, which will consist of recordings taken during my team's meetings for the Alignment Jam Evals Hackathon in November of 2023. Our team won first place, so you'll be listening to a process which, at the end of the day, turned out to be pretty good.

Check out Apart Research, the group that runs the Alignment Jam hackathons.

Links to all articles/papers which are mentioned throughout the episode can be found below, in order of their appearance.
Generalization Analogies: A Testbed for Generalizing AI Oversight to Hard-To-Measure Domains
New paper shows truthfulness & instruction-following don't generalize by default
Generalization Analogies Website
Discovering Language Model Behaviors with Model-Written Evaluations
Model-Written Evals Website
OpenAI Evals GitHub
METR (previously ARC Evals)
Goodharting on Wikipedia
From Instructions to Intrinsic Human Values, a Survey of Alignment Goals for Big Models
Fine Tuning Aligned Language Models Compromises Safety Even When Users Do Not Intend
Shadow Alignment: The Ease of Subverting Safely Aligned Language Models
Will Releasing the Weights of Future Large Language Models Grant Widespread Access to Pandemic Agents?
Building Less Flawed Metrics, Understanding and Creating Better Measurement and Incentive Systems
EleutherAI's Model Evaluation Harness
Evalugator Library


