
AI Safety Fundamentals: Alignment

Latest episodes

Apr 7, 2024 • 21min

AI Control: Improving Safety Despite Intentional Subversion

The podcast discusses safeguarding AI systems against intentional subversion: protocols for maintaining model integrity, control strategies that improve safety even when models cannot be trusted, deploying AI reliably in high-stakes settings, and a code-backdooring setting used to stress-test these safety measures.
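To make the control-protocol idea concrete, here is a minimal sketch of a trusted-monitoring loop in the spirit of the episode: an untrusted model proposes code, a weaker trusted model assigns a suspicion score, and the most suspicious fraction of submissions is routed to human auditing. The function names (`untrusted_generate`, `trusted_suspicion_score`, `human_audit`) and the audit budget are hypothetical placeholders, not the paper's implementation.

```python
# Hypothetical sketch of a trusted-monitoring control protocol.
# untrusted_generate, trusted_suspicion_score, and human_audit are
# placeholders for model / reviewer calls, not a real API.

from typing import Callable, List

def trusted_monitoring(
    tasks: List[str],
    untrusted_generate: Callable[[str], str],
    trusted_suspicion_score: Callable[[str, str], float],
    human_audit: Callable[[str, str], bool],
    audit_budget: float = 0.02,      # audit the top 2% most suspicious outputs
) -> List[str]:
    """Return accepted outputs, sending the most suspicious ones to a human auditor."""
    proposals = [(t, untrusted_generate(t)) for t in tasks]
    scored = sorted(
        proposals,
        key=lambda p: trusted_suspicion_score(*p),
        reverse=True,                 # most suspicious first
    )
    n_audit = max(1, int(audit_budget * len(scored)))

    accepted = []
    for i, (task, code) in enumerate(scored):
        if i < n_audit:
            # Human auditing catches backdoors among the flagged submissions.
            if not human_audit(task, code):
                continue              # reject a backdoored submission
        accepted.append(code)
    return accepted
```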
Apr 7, 2024 • 8min

AI Watermarking Won’t Curb Disinformation

Generative AI allows people to produce piles upon piles of images and words very quickly. It would be nice if there were some way to reliably distinguish AI-generated content from human-generated content. It would help people avoid endlessly arguing with bots online, or believing what a fake image purports to show. One common proposal is that big companies should incorporate watermarks into the outputs of their AIs. For instance, this could involve taking an image and subtly changing many pixels in a way that’s undetectable to the eye but detectable to a computer program. Or it could involve swapping words for synonyms in a predictable way so that the meaning is unchanged, but a program could readily determine the text was generated by an AI.

Unfortunately, watermarking schemes are unlikely to work. So far most have proven easy to remove, and it’s likely that future schemes will have similar problems.

Narrated for AI Safety Fundamentals by Perrin Walker. A podcast by BlueDot Impact. Learn more on the AI Safety Fundamentals website.
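As an illustration of the pixel-level approach described above, and of how fragile it is, here is a toy sketch rather than any vendor's actual scheme: it hides a repeating bit pattern in the least-significant bits of an image array, which a program can read back but which mild noise or re-compression destroys.

```python
# Toy illustration of an invisible image watermark (an assumed scheme, not a
# real product's): hide a repeating bit pattern in pixel least-significant
# bits. Detectable by a program, invisible to the eye, easily destroyed.

import numpy as np

PATTERN = np.array([1, 0, 1, 1, 0, 0, 1, 0], dtype=np.uint8)  # hypothetical mark

def embed(image: np.ndarray) -> np.ndarray:
    """Overwrite each pixel's least-significant bit with the pattern."""
    flat = image.flatten()
    bits = np.resize(PATTERN, flat.size)
    return ((flat & 0xFE) | bits).reshape(image.shape)

def detect(image: np.ndarray) -> float:
    """Return the fraction of pixels whose LSB matches the pattern."""
    flat = image.flatten()
    bits = np.resize(PATTERN, flat.size)
    return float(np.mean((flat & 1) == bits))

rng = np.random.default_rng(0)
img = rng.integers(0, 256, size=(64, 64, 3), dtype=np.uint8)

marked = embed(img)
print(detect(marked))                 # ~1.0: watermark clearly present
noisy = np.clip(marked.astype(int) + rng.integers(-2, 3, marked.shape), 0, 255)
print(detect(noisy.astype(np.uint8)))  # well below 1.0: mild noise degrades the mark
```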
Apr 7, 2024 • 18min

Emerging Processes for Frontier AI Safety

Exploring the risks and benefits of AI technology, focusing on transparency, cybersecurity, managing vulnerabilities, and best practices for data input controls in AI system training.
Apr 7, 2024 • 23min

Challenges in Evaluating AI Systems

Exploring challenges in evaluating AI systems, the podcast delves into the limitations of current evaluation suites and offers policy recommendations. Topics include pitfalls of the MMLU benchmark, difficulties in measuring social biases with the Bias Benchmark for QA (BBQ), complexities of the BIG-bench framework, and methodologies for red teaming in security evaluations.
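For readers unfamiliar with how benchmarks like MMLU are typically scored, here is a minimal sketch, my own illustration rather than the episode's code: each question is multiple-choice, the model picks one option, and the reported number is plain accuracy, which is part of why small prompt or formatting changes can swing results.

```python
# Minimal sketch of MMLU-style multiple-choice scoring (an illustration,
# not the benchmark's official harness). `choose_answer` stands in for
# whatever method picks an option, e.g. highest answer log-probability.

from typing import Callable, Dict, List

def accuracy(
    questions: List[Dict],
    choose_answer: Callable[[str, List[str]], int],
) -> float:
    """Fraction of questions where the chosen option index matches the answer key."""
    correct = 0
    for q in questions:
        pred = choose_answer(q["question"], q["options"])
        correct += int(pred == q["answer_index"])
    return correct / len(questions)

# Example usage with a trivial stand-in "model" that always picks option 0.
sample = [
    {"question": "2 + 2 = ?", "options": ["4", "5", "6", "7"], "answer_index": 0},
    {"question": "Capital of France?", "options": ["Berlin", "Paris", "Rome", "Madrid"], "answer_index": 1},
]
print(accuracy(sample, lambda q, opts: 0))  # 0.5
```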
Apr 1, 2024 • 25min

Interpretability in the Wild: A Circuit for Indirect Object Identification in GPT-2 Small

Kevin Wang, a researcher in mechanistic interpretability, discusses reverse-engineering the behavior of GPT-2 small in indirect object identification. The podcast explores 26 attention heads grouped into 7 classes, the reliability of explanations, and the feasibility of understanding large ML models. It delves into attention head behaviors, model architecture, and mathematical explanations for mechanistic interpretability in language models.
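The paper itself relies on path patching and purpose-built tooling, but a rough feel for this kind of inspection can be had with the Hugging Face transformers library: run GPT-2 small on an indirect-object-identification prompt and see which attention heads at the final position attend to the indirect object. The prompt, the token lookup for " Mary", and the choice to rank raw attention weights are my own illustration, not the paper's method.

```python
# Rough illustration of inspecting GPT-2 small attention on an IOI prompt
# (uses raw attention weights; the paper's analysis is more careful).

import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

prompt = "When Mary and John went to the store, John gave a drink to"
inputs = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs, output_attentions=True)

# outputs.attentions: tuple of 12 tensors, one per layer,
# each of shape (batch, heads, seq_len, seq_len).
attn = torch.stack(outputs.attentions)      # (layers, batch, heads, seq, seq)
last_token_attn = attn[:, 0, :, -1, :]      # attention from the final position

# Which (layer, head) pairs attend most to the indirect object "Mary"?
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
mary_pos = tokens.index("ĠMary")            # assumes " Mary" is a single BPE token
scores = last_token_attn[:, :, mary_pos]    # (layers, heads)
top = torch.topk(scores.flatten(), 5)
for value, idx in zip(top.values, top.indices):
    layer, head = divmod(idx.item(), scores.shape[1])
    print(f"layer {layer}, head {head}: attention to ' Mary' = {value.item():.3f}")
```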
Mar 31, 2024 • 9min

Towards Monosemanticity: Decomposing Language Models With Dictionary Learning

Using a sparse autoencoder, we extract a large number of interpretable features from a one-layer transformer.

Mechanistic interpretability seeks to understand neural networks by breaking them into components that are more easily understood than the whole. By understanding the function of each component, and how they interact, we hope to be able to reason about the behavior of the entire network. The first step in that program is to identify the correct components to analyze.

Unfortunately, the most natural computational unit of the neural network – the neuron itself – turns out not to be a natural unit for human understanding. This is because many neurons are polysemantic: they respond to mixtures of seemingly unrelated inputs. In the vision model Inception v1, a single neuron responds to faces of cats and fronts of cars. In a small language model we discuss in this paper, a single neuron responds to a mixture of academic citations, English dialogue, HTTP requests, and Korean text. Polysemanticity makes it difficult to reason about the behavior of the network in terms of the activity of individual neurons.

Source: https://transformer-circuits.pub/2023/monosemantic-features/index.html

Narrated for AI Safety Fundamentals by Perrin Walker. A podcast by BlueDot Impact. Learn more on the AI Safety Fundamentals website.
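To make the method concrete, here is a minimal PyTorch sketch of the kind of sparse autoencoder the paper describes: a wide, overcomplete dictionary trained to reconstruct MLP activations with an L1 sparsity penalty on the feature activations. The layer sizes, coefficients, and training loop are illustrative, not the paper's setup.

```python
# Minimal sketch of a sparse autoencoder for dictionary learning on
# transformer activations (illustrative sizes, not the paper's setup).

import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int = 512, d_features: int = 4096):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_features)
        self.decoder = nn.Linear(d_features, d_model)

    def forward(self, activations: torch.Tensor):
        features = torch.relu(self.encoder(activations))  # sparse feature activations
        reconstruction = self.decoder(features)
        return reconstruction, features

def loss_fn(x, reconstruction, features, l1_coeff: float = 1e-3):
    # Reconstruction error plus an L1 penalty that pushes most features to zero.
    mse = torch.mean((x - reconstruction) ** 2)
    sparsity = l1_coeff * features.abs().sum(dim=-1).mean()
    return mse + sparsity

# Training loop over a batch of MLP activations (random stand-in data here).
sae = SparseAutoencoder()
opt = torch.optim.Adam(sae.parameters(), lr=1e-4)
acts = torch.randn(256, 512)          # stand-in for recorded MLP activations
for _ in range(10):
    recon, feats = sae(acts)
    loss = loss_fn(acts, recon, feats)
    opt.zero_grad()
    loss.backward()
    opt.step()
```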
Mar 31, 2024 • 44min

Zoom In: An Introduction to Circuits

The podcast discusses pivotal moments in science's history where advancements have been made by studying the world at a finer level, leading to the emergence of new scientific fields. It explores the intricacies of neural networks, including individual neurons and connections, and the role of visualization tools like microscopes and X-ray crystallography. The chapter also delves into the complexities of neural circuits within artificial neural networks, highlighting features like curve detectors and pose-invariant dog head detectors.
Mar 26, 2024 • 35min

Weak-To-Strong Generalization: Eliciting Strong Capabilities With Weak Supervision

Guest Collin Burns discusses weak-to-strong generalization in AI alignment: fine-tuning strong models on labels produced by weaker models and measuring how much of the strong model's capability the weak supervision can elicit. Techniques like an auxiliary confidence loss show promise in improving weak-to-strong generalization, suggesting progress towards aligning superhuman models with human supervision.
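Here is a rough sketch of the kind of auxiliary confidence loss the episode mentions: the strong student is trained partly on the weak supervisor's labels and partly on its own hardened (argmax) predictions, so a confident student can override noisy weak supervision. The binary-classification framing and the mixing weight are simplifications of my own, not the exact objective from the paper.

```python
# Rough sketch of an auxiliary confidence loss for weak-to-strong training
# (a simplification for binary classification; coefficients are illustrative).

import torch
import torch.nn.functional as F

def weak_to_strong_loss(
    strong_logits: torch.Tensor,   # (batch, 2) student predictions
    weak_labels: torch.Tensor,     # (batch,) labels from the weak supervisor
    alpha: float = 0.5,            # weight on the student's own hardened predictions
) -> torch.Tensor:
    # Standard cross-entropy against the (possibly wrong) weak labels.
    ce_weak = F.cross_entropy(strong_logits, weak_labels)

    # "Harden" the student's own predictions and also train toward them,
    # letting a confident student disagree with the weak labels.
    hardened = strong_logits.argmax(dim=-1).detach()
    ce_self = F.cross_entropy(strong_logits, hardened)

    return (1 - alpha) * ce_weak + alpha * ce_self

# Example usage with random stand-in logits and weak labels.
logits = torch.randn(8, 2, requires_grad=True)
weak = torch.randint(0, 2, (8,))
loss = weak_to_strong_loss(logits, weak)
loss.backward()
```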
Mar 26, 2024 • 20min

Can We Scale Human Feedback for Complex AI Tasks?

Exploring the challenges of using human feedback to train AI models: strategies for scalable oversight, techniques like task decomposition and reward modeling, Recursive Reward Modeling and Constitutional AI, using debating agents to simplify complex problems, and eliciting generalization from weaker supervisors, along with the scalability challenges each approach faces.
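As one concrete piece of the reward-modeling approach mentioned above, here is a minimal sketch of the standard pairwise preference loss used to train reward models from human comparisons (a common Bradley-Terry style formulation, not any specific lab's code): the model is trained so the preferred response scores higher than the rejected one.

```python
# Minimal sketch of the pairwise preference loss used in reward modeling
# (a standard Bradley-Terry style objective; the data here is a stand-in).

import torch
import torch.nn.functional as F

def preference_loss(
    reward_chosen: torch.Tensor,    # (batch,) scores for the preferred responses
    reward_rejected: torch.Tensor,  # (batch,) scores for the rejected responses
) -> torch.Tensor:
    # Maximize the margin between chosen and rejected rewards:
    # -log sigmoid(r_chosen - r_rejected).
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()

# Example usage with stand-in reward scores from some reward model.
r_chosen = torch.tensor([1.2, 0.3, 2.0], requires_grad=True)
r_rejected = torch.tensor([0.8, 0.9, -0.5])
loss = preference_loss(r_chosen, r_rejected)
loss.backward()
print(loss.item())
```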
May 13, 2023 • 22min

Machine Learning for Humans: Supervised Learning

The two tasks of supervised learning: regression and classification. Linear regression, loss functions, and gradient descent.

How much money will we make by spending more dollars on digital advertising? Will this loan applicant pay back the loan or not? What’s going to happen to the stock market tomorrow?

Original article: https://medium.com/machine-learning-for-humans/supervised-learning-740383a2feab

Author: Vishal Maini

A podcast by BlueDot Impact. Learn more on the AI Safety Fundamentals website.
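To accompany the episode's walkthrough of linear regression with gradient descent, here is a minimal sketch (with synthetic data and an illustrative learning rate, not taken from the article): fit y ≈ w·x + b by repeatedly stepping both parameters against the gradient of the mean squared error.

```python
# Minimal sketch of linear regression trained with gradient descent
# (synthetic data and hyperparameters chosen for illustration).

import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=100)
y = 3.0 * x + 2.0 + rng.normal(0, 1, size=100)   # ground truth: w=3, b=2

w, b = 0.0, 0.0
lr = 0.01

for step in range(1000):
    pred = w * x + b
    error = pred - y
    # Gradients of the mean squared error with respect to w and b.
    grad_w = 2 * np.mean(error * x)
    grad_b = 2 * np.mean(error)
    w -= lr * grad_w
    b -= lr * grad_b

print(f"learned w={w:.2f}, b={b:.2f}")  # close to the true values 3 and 2
```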
