

The Nonlinear Library
The Nonlinear Fund
The Nonlinear Library allows you to easily listen to top EA and rationalist content on your podcast player. We use text-to-speech software to create an automatically updating repository of audio content from the EA Forum, Alignment Forum, LessWrong, and other EA blogs. To find out more, please visit us at nonlinear.org
Episodes

Jul 31, 2024 • 23min
LW - Open Source Automated Interpretability for Sparse Autoencoder Features by kh4dien
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Open Source Automated Interpretability for Sparse Autoencoder Features, published by kh4dien on July 31, 2024 on LessWrong.
Background
Sparse autoencoders recover a diversity of interpretable, monosemantic features, but present an intractable problem of scale to human labelers. We investigate different techniques for generating and scoring text explanations of SAE features.
Key Findings
Open source models generate and evaluate text explanations of SAE features reasonably well, albeit somewhat worse than closed models like Claude 3.5 Sonnet.
Explanations found by LLMs are similar to explanations found by humans.
Automatically interpreting 1.5M features of GPT-2 with the current pipeline would cost $1300 in API calls to Llama 3.1 or $8500 with Claude 3.5 Sonnet. Prior methods cost ~$200k with Claude.
Code can be found at
https://github.com/EleutherAI/sae-auto-interp.
We built a small dashboard to explore explanations and their scores:
https://cadentj.github.io/demo/
Generating Explanations
Sparse autoencoders decompose activations into a sum of sparse feature directions. We leverage language models to generate explanations for activating text examples. Prior work prompts language models with token sequences that activate MLP neurons (Bills et al. 2023), by showing the model a list of tokens followed by their respective activations, separated by a tab, and listed one per line.
We instead highlight max-activating tokens in each example by wrapping them in <> delimiters. Optionally, we choose a threshold, as a fraction of the example's max activation, above which tokens are highlighted. This helps the model distinguish the important information for some densely activating features.
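A minimal sketch of this formatting step, assuming a hypothetical highlight_example helper (the exact delimiters and threshold handling in sae-auto-interp may differ):

```python
def highlight_example(tokens, activations, threshold_frac=0.5):
    """Return the example text with strongly activating tokens wrapped in <>."""
    cutoff = threshold_frac * max(activations)
    pieces = [f"<{tok}>" if act >= cutoff else tok
              for tok, act in zip(tokens, activations)]
    return "".join(pieces)

# Example: a feature that fires on " stop" after "don't"
tokens = ["I", " said", " don't", " stop", " now"]
acts   = [0.0,  0.1,     2.3,      8.7,     0.2]
print(highlight_example(tokens, acts))  # -> I said don't< stop> now
```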
We experiment with several methods for augmenting the explanation. Full prompts are available here.
Chain of thought improves general reasoning capabilities in language models. We few-shot the model with several examples of a thought process that mimics a human approach to generating explanations. We expect that verbalizing thought might capture richer relations between tokens and context.
Activations distinguish which sentences are more representative of a feature. We provide the magnitude of activating tokens after each example.
We compute the logit weights for each feature through the path expansion W_U d_i, where W_U is the model unembed and d_i is the decoder direction for a specific feature. The top promoted tokens capture a feature's causal effects, which are useful for sharpening explanations. This method is equivalent to the logit lens (nostalgebraist 2020); future work might apply variants that reveal other causal information (Belrose et al. 2023; Gandelsman et al. 2024).
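A sketch of this computation, assuming illustrative tensor names and shapes (W_U of shape d_model x vocab, W_dec of shape n_features x d_model) and a Hugging Face-style tokenizer; this is not code from the repo:

```python
import torch

def top_promoted_tokens(W_U, W_dec, feature_idx, tokenizer, k=10):
    """Tokens whose logits are most promoted by one SAE feature."""
    d_i = W_dec[feature_idx]      # decoder direction for this feature, shape (d_model,)
    logit_weights = d_i @ W_U     # path expansion W_U d_i, shape (vocab,)
    top_ids = torch.topk(logit_weights, k).indices
    return tokenizer.convert_ids_to_tokens(top_ids.tolist())
```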
Scoring explanations
Text explanations represent interpretable "concepts" in natural language. How do we evaluate the faithfulness of explanations to the concepts actually contained in SAE features?
We view the explanation as a classifier which predicts whether a feature is present in a context. An explanation should have high recall - identifying most activating text - as well as high precision - distinguishing between activating and non-activating text.
Consider a feature which activates on the word "stop" after "don't" or "won't" (Gao et al. 2024). There are two failure modes:
1. The explanation could be too broad, identifying the feature as activating on the word "stop". It would have high recall on held out text, but low precision.
2. The explanation could be too narrow, stating the feature activates on the word "stop" only after "don't". This would have high precision, but low recall.
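As a toy illustration of this framing, with the judge model's per-context decisions abstracted into a list of booleans and all names hypothetical:

```python
def explanation_precision_recall(predicted, actual):
    """predicted[i]: judge says the explanation matches context i;
    actual[i]: the feature really activates in context i."""
    tp = sum(p and a for p, a in zip(predicted, actual))
    fp = sum(p and not a for p, a in zip(predicted, actual))
    fn = sum(a and not p for p, a in zip(predicted, actual))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# The "too broad" explanation fires on every "stop" (high recall, low precision);
# the "too narrow" one fires only on "don't stop" (high precision, low recall).
```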
One approach to scoring explanations is "simulation scoring" (Bills et al. 2023), which uses a language model to assign an activation to each token in a text, then measures the correlation between predicted and real activations. This method is biased toward recall; given a bro...

Jul 31, 2024 • 4min
LW - Twitter thread on AI safety evals by Richard Ngo
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Twitter thread on AI safety evals, published by Richard Ngo on July 31, 2024 on LessWrong.
Epistemic status: raising concerns, rather than stating confident conclusions.
I'm worried that a lot of work on AI safety evals matches the pattern of "Something must be done. This is something. Therefore this must be done." Or, to put it another way: I judge eval ideas on 4 criteria, and I often see proposals which fail all 4. The criteria:
1. Possible to measure with scientific rigor.
Some things can be easily studied in a lab; others are entangled with a lot of real-world complexity. If you predict the latter (e.g. a model's economic or scientific impact) based on model-level evals, your results will often be BS.
(This is why I dislike the term "transformative AI", by the way. Whether an AI has transformative effects on society will depend hugely on what the society is like, how the AI is deployed, etc. And that's a constantly moving target! So TAI is a terrible thing to try to forecast.)
Another angle on "scientific rigor": you're trying to make it obvious to onlookers that you couldn't have designed the eval to get your preferred results. This means making the eval as simple as possible: each arbitrary choice adds another avenue for p-hacking, and they add up fast.
(Paraphrasing a different thread): I think of AI risk forecasts as basically guesses, and I dislike attempts to make them sound objective (e.g. many OpenPhil worldview investigations). There are always so many free parameters that you can get basically any result you want. And so, in practice, they often play the role of laundering vibes into credible-sounding headline numbers. I'm worried that AI safety evals will fall into the same trap.
(I give Eliezer a lot of credit for making roughly this criticism of Ajeya's bio-anchors report. I think his critique has basically been proven right by how much people have updated away from 30-year timelines since then.)
2. Provides signal across scales.
Evals are often designed around a binary threshold (e.g. the Turing Test). But this restricts the impact of the eval to a narrow time window around hitting it. Much better if we can measure (and extrapolate) orders-of-magnitude improvements.
3. Focuses on clearly worrying capabilities.
Evals for hacking, deception, etc track widespread concerns. By contrast, evals for things like automated ML R&D are only worrying for people who already believe in AI xrisk. And even they don't think it's necessary for risk.
4. Motivates useful responses.
Safety evals are for creating clear Schelling points at which action will be taken. But if you don't know what actions your evals should catalyze, it's often more valuable to focus on fleshing that out. Often nobody else will!
In fact, I expect that things like model releases, demos, warning shots, etc, will by default be much better drivers of action than evals. Evals can still be valuable, but you should have some justification for why yours will actually matter, to avoid traps like the ones above. Ideally that justification would focus either on generating insight or being persuasive; optimizing for both at once seems like a good way to get neither.
Lastly: even if you have a good eval idea, actually implementing it well can be very challenging.
Building evals is scientific research; and so we should expect eval quality to be heavy-tailed, like most other science. I worry that the fact that evals are an unusually easy type of research to get started with sometimes obscures this fact.
Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org

Jul 30, 2024 • 20min
LW - RTFB: California's AB 3211 by Zvi
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: RTFB: California's AB 3211, published by Zvi on July 30, 2024 on LessWrong.
Some in the tech industry decided now was the time to raise alarm about AB 3211.
As Dean Ball points out, there's a lot of bills out there. One must do triage.
Dean Ball: But SB 1047 is far from the only AI bill worth discussing. It's not even the only one of the dozens of AI bills in California worth discussing. Let's talk about AB 3211, the California Provenance, Authenticity, and Watermarking Standards Act, written by Assemblymember Buffy Wicks, who represents the East Bay.
SB 1047 is a carefully written bill that tries to maximize benefits and minimize costs. You can still quite reasonably disagree with the aims, philosophy or premise of the bill, or its execution details, and thus think its costs exceed its benefits. When people claim SB 1047 is made of crazy pills, they are attacking provisions not in the bill.
That is not how it usually goes.
Most bills involving tech regulation that come before state legislatures are made of crazy pills, written by people in over their heads.
There are people whose full time job is essentially pointing out the latest bill that might break the internet in various ways, over and over, forever. They do a great and necessary service, and I do my best to forgive them the occasional false alarm. They deal with idiots, with bulls in china shops, on the daily. I rarely get the sense these noble warriors are having any fun.
AB 3211 unanimously passed the California assembly, and I started seeing bold claims about how bad it would be. Here was one of the more measured and detailed ones.
Dean Ball: The bill also requires every generative AI system to maintain a database with digital fingerprints for "any piece of potentially deceptive content" it produces. This would be a significant burden for the creator of any AI system. And it seems flatly impossible for the creators of open weight models to comply.
Under AB 3211, a chatbot would have to notify the user that it is a chatbot at the start of every conversation. The user would have to acknowledge this before the conversation could begin. In other words, AB 3211 could create the AI version of those annoying cookie notifications you get every time you visit a European website.
…
AB 3211 mandates "maximally indelible watermarks," which it defines as "a watermark that is designed to be as difficult to remove as possible using state-of-the-art techniques and relevant industry standards."
So I decided to Read the Bill (RTFB).
It's a bad bill, sir. A stunningly terrible bill.
How did it unanimously pass the California assembly?
My current model is:
1. There are some committee chairs and others that can veto procedural progress.
2. Most of the members will vote for pretty much anything.
3. They are counting on Newsom to evaluate and if needed veto.
4. So California only sort of has a functioning legislative branch, at best.
5. Thus when bills pass like this, it means a lot less than you might think.
Yet everyone stays there, despite everything. There really is a lot of ruin in that state.
Time to read the bill.
Read The Bill (RTFB)
It's short - the bottom half of the page is all deleted text.
Section 1 is rhetorical declarations. GenAI can produce inauthentic images, they need to be clearly disclosed and labeled, or various bad things could happen. That sounds like a job for California, which should require creators to provide tools and platforms to provide labels. So we all can remain 'safe and informed.' Oh no.
Section 2 22949.90 provides some definitions. Most are standard. These aren't:
(c) "Authentic content" means images, videos, audio, or text created by human beings without any modifications or with only minor modifications that do not lead to significant changes to the perceived contents or meaning of the cont...

Jul 30, 2024 • 44sec
AF - Against AI As An Existential Risk by Noah Birnbaum
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Against AI As An Existential Risk, published by Noah Birnbaum on July 30, 2024 on The AI Alignment Forum.
I wrote a post to my Substack attempting to compile all of the best arguments against AI as an existential threat.
Some arguments that I discuss include: international game theory dynamics, reference class problems, Knightian uncertainty, superforecaster and domain expert disagreement, the issue with long-winded arguments, and more!
Please tell me why I'm wrong, and if you like the article, subscribe and share it with friends!
Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org.

Jul 30, 2024 • 38min
EA - What I wish I knew when I started out in animal advocacy by SofiaBalderson
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: What I wish I knew when I started out in animal advocacy, published by SofiaBalderson on July 30, 2024 on The Effective Altruism Forum.
Tl;dr:
I identified 15 pieces of advice that I wish I had known earlier in my animal advocacy career, and provided personal stories to show how they were relevant in my life. This article will be most useful for early career professionals, especially those considering roles in animal advocacy, and people who like reading personal accounts.
1. Choose work experience over post-Bachelor's education when possible
2. Start doing relevant work, even if unpaid, to showcase skills and test fit
3. Be open to changing jobs for better opportunities, but consider financial security
4. Use current jobs to build career capital for future animal advocacy roles
5. Consider earning to give as a way to support the movement financially
6. Offer concrete skills to solve specific problems when seeking roles
7. Network actively, including with senior people, and learn to ask good questions
8. Seek growth opportunities beyond what your employer provides
9. Don't be afraid to be ambitious, and critically assess other people's advice
10. Don't take rejections too personally; persistence often pays off
11. Invest time in improving your productivity
12. Prioritise relationships with family and friends: there will always be more work to do
13. Take care of your physical and mental health
14. Plan for long-term financial security (pension, savings, housing) even on a low salary
15. Learn to budget and save money effectively
Who is this post for?
Early career professionals, especially in animal advocacy. People who like reading personal accounts.
Disclaimer:
Please note that this wasn't intended as comprehensive career advice; it's just my own personal take on what mistakes I think I've made and what I could have done better during my 6+ years in animal advocacy so far. This is the advice I wish I had heard in my early twenties. I started my journey at Veganuary, then volunteered and was a contractor for a number of charities, then worked at Animal Advocacy Careers, then started Hive, a community building charity for farmed animal advocates.
Depending on your life and work circumstances, all or some of this advice may not apply to you. What worked for me may not work for you, as my journey is the result of a unique combination of my strengths, opportunities and weaknesses. I think overall I'm quite risk-tolerant in comparison to an average advocate, and I spent all my twenties with no significant financial commitments, so that's worth taking into account.
Do critically assess whether this advice will actually apply to your situation (see Should you reverse any advice you hear). This advice may apply to other causes, not just animal welfare, but since I am working in animal advocacy, I give resources and examples for this cause area only. There may be some hindsight bias because I have been working in the movement for over 6 years and forgot what it's like to be an early career professional. These lessons and tips are in no particular order, but I've tried to organise them in themes.
Acknowledgments:
Thanks so much to Allison Agnello, Constance Li, Kevin Xia, Hayden Kessinger and Cameron King for reviewing this post and providing valuable suggestions. All mistakes are my own.
Getting work:
Choose work experience over post-Bachelor's education
I feel like pursuing post-Bachelor's higher education (e.g. a Master's) may be the default option for many people, but in many career paths in animal advocacy it is far from obvious that an advanced degree would benefit you. I considered doing a Master's in 2020 and even tried a module. But I soon realised that direct work experience is considerably more valuable for my career than the formal education I was looking into.
I do think ...

Jul 30, 2024 • 19min
LW - Self-Other Overlap: A Neglected Approach to AI Alignment by Marc Carauleanu
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Self-Other Overlap: A Neglected Approach to AI Alignment, published by Marc Carauleanu on July 30, 2024 on LessWrong.
Many thanks to Bogdan Ionut-Cirstea, Steve Byrnes, Gunnar Zarnacke, Jack Foxabbott and Seong Hah Cho for critical comments and feedback on earlier and ongoing versions of this work.
Summary
In this post, we introduce self-other overlap training: optimizing for similar internal representations when the model reasons about itself and others while preserving performance. There is a large body of evidence suggesting that neural self-other overlap is connected to pro-sociality in humans and we argue that there are more fundamental reasons to believe this prior is relevant for AI Alignment.
We argue that self-other overlap is a scalable and general alignment technique that requires little interpretability and has low capabilities externalities. We also share an early experiment of how fine-tuning a deceptive policy with self-other overlap reduces deceptive behavior in a simple RL environment.
On top of that, we found that the non-deceptive agents consistently have higher mean self-other overlap than the deceptive agents, which allows us to perfectly classify which agents are deceptive using only the mean self-other overlap value across episodes.
Introduction
General purpose ML models with the capacity for planning and autonomous behavior are becoming increasingly capable. Fortunately, research on making sure the models produce output in line with human interests in the training distribution is also progressing rapidly (e.g., RLHF, DPO). However, a looming question remains: even if the model appears to be aligned with humans in the training distribution, will it defect once it is deployed or gathers enough power? In other words, is the model deceptive?
We introduce a method that aims to reduce deception and increase the likelihood of alignment, called Self-Other Overlap: overlapping the latent self and other representations of a model while preserving performance. This method makes minimal assumptions about the model's architecture and its interpretability and has a very concrete implementation. Early results indicate that it is effective at reducing deception in simple RL environments, and preliminary LLM experiments are currently being conducted.
To be better prepared for the possibility of short timelines without necessarily having to solve interpretability, it seems useful to have a scalable, general, and transferable condition on the model internals that makes it less likely for the model to be deceptive.
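To make the objective concrete, here is an illustrative sketch, not the authors' implementation: an overlap term pulls together the hidden states for matched, equal-length self- and other-referencing prompts, and a KL term to a frozen reference model stands in for preserving performance. The Hugging Face-style model interface and all names are assumptions, and the experiments described below use a simple RL environment rather than an LLM.

```python
import torch
import torch.nn.functional as F

def self_other_overlap_loss(model, ref_model, self_batch, other_batch, lam=1.0):
    """Overlap term plus performance-preservation term (illustrative only)."""
    out_self = model(**self_batch, output_hidden_states=True)
    out_other = model(**other_batch, output_hidden_states=True)

    # Pull final-layer hidden states together for prompts that differ only
    # in whether they reference the model itself or another agent.
    overlap = F.mse_loss(out_self.hidden_states[-1], out_other.hidden_states[-1])

    # Stay close to the frozen reference model so capabilities are preserved.
    with torch.no_grad():
        ref_logits = ref_model(**self_batch).logits
    kl = F.kl_div(F.log_softmax(out_self.logits, dim=-1),
                  F.softmax(ref_logits, dim=-1), reduction="batchmean")
    return overlap + lam * kl
```

The weighting term lam trades off how strongly self and other representations are pulled together against how much the model's behavior is allowed to drift.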
Self-Other Overlap
To get a more intuitive grasp of the concept, it is useful to understand how self-other overlap is measured in humans. There are regions of the brain that activate similarly when we do something ourselves and when we observe someone else performing the same action.
For example, if you were to pick up a martini glass under an fMRI, and then watch someone else pick up a martini glass, we would find regions of your brain that are similarly activated (overlapping) when you process the self and other-referencing observations as illustrated in Figure 2.
There seems to be compelling evidence that self-other overlap is linked to pro-social behavior in humans. For example, preliminary data suggests that extraordinary altruists (people who donated a kidney to strangers) have higher neural self-other overlap than control participants in neural representations of fearful anticipation in the anterior insula, while the opposite appears to be true for psychopaths. Moreover, the leading theories of empathy (such as the Perception-Action Model) imply that empathy is mediated by self-other overlap at a neural level. While this does not necessarily mean that these results generalise to AI models, we believe there are more fundamental reasons that this prior, onc...

Jul 30, 2024 • 19min
AF - Self-Other Overlap: A Neglected Approach to AI Alignment by Marc Carauleanu
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Self-Other Overlap: A Neglected Approach to AI Alignment, published by Marc Carauleanu on July 30, 2024 on The AI Alignment Forum.
Many thanks to Bogdan Ionut-Cirstea, Steve Byrnes, Gunnar Zarnacke, Jack Foxabbott and Seong Hah Cho for critical comments and feedback on earlier and ongoing versions of this work. This research was conducted at AE Studio and supported by the AI Safety Grants programme administered by Foresight Institute with additional support from AE Studio.
Summary
In this post, we introduce self-other overlap training: optimizing for similar internal representations when the model reasons about itself and others while preserving performance. There is a large body of evidence suggesting that neural self-other overlap is connected to pro-sociality in humans and we argue that there are more fundamental reasons to believe this prior is relevant for AI Alignment.
We argue that self-other overlap is a scalable and general alignment technique that requires little interpretability and has low capabilities externalities. We also share an early experiment of how fine-tuning a deceptive policy with self-other overlap reduces deceptive behavior in a simple RL environment.
On top of that, we found that the non-deceptive agents consistently have higher mean self-other overlap than the deceptive agents, which allows us to perfectly classify which agents are deceptive using only the mean self-other overlap value across episodes.
Introduction
General purpose ML models with the capacity for planning and autonomous behavior are becoming increasingly capable. Fortunately, research on making sure the models produce output in line with human interests in the training distribution is also progressing rapidly (e.g., RLHF, DPO). However, a looming question remains: even if the model appears to be aligned with humans in the training distribution, will it defect once it is deployed or gathers enough power? In other words, is the model deceptive?
We introduce a method that aims to reduce deception and increase the likelihood of alignment, called Self-Other Overlap: overlapping the latent self and other representations of a model while preserving performance. This method makes minimal assumptions about the model's architecture and its interpretability and has a very concrete implementation. Early results indicate that it is effective at reducing deception in simple RL environments, and preliminary LLM experiments are currently being conducted.
To be better prepared for the possibility of short timelines without necessarily having to solve interpretability, it seems useful to have a scalable, general, and transferable condition on the model internals that makes it less likely for the model to be deceptive.
Self-Other Overlap
To get a more intuitive grasp of the concept, it is useful to understand how self-other overlap is measured in humans. There are regions of the brain that activate similarly when we do something ourselves and when we observe someone else performing the same action.
For example, if you were to pick up a martini glass under an fMRI, and then watch someone else pick up a martini glass, we would find regions of your brain that are similarly activated (overlapping) when you process the self and other-referencing observations as illustrated in Figure 2.
There seems to be compelling evidence that self-other overlap is linked to pro-social behavior in humans. For example, preliminary data suggests that extraordinary altruists (people who donated a kidney to strangers) have higher neural self-other overlap than control participants in neural representations of fearful anticipation in the anterior insula, while the opposite appears to be true for psychopaths. Moreover, the leading theories of empathy (such as the Perception-Action Model) imply that empathy is mediated by self-ot...

Jul 30, 2024 • 49sec
EA - In the last 2 years, what surprising ideas has EA championed or how has the movement changed its mind? by Nathan Young
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: In the last 2 years, what surprising ideas has EA championed or how has the movement changed its mind?, published by Nathan Young on July 30, 2024 on The Effective Altruism Forum.
In the last 2 years:
What ideas that were considered wrong[1]/low status have been championed here?
What has the movement acknowledged it was wrong about previously?
What new, effective organisations have been started?
This isn't to claim that this is the only work that matters, but it feels like a chunk of what matters. Someone asked me and I realised I didn't have good answers.
1. ^
Changed in response to comment from @JWS
Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org

Jul 30, 2024 • 23min
AF - Investigating the Ability of LLMs to Recognize Their Own Writing by Christopher Ackerman
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Investigating the Ability of LLMs to Recognize Their Own Writing, published by Christopher Ackerman on July 30, 2024 on The AI Alignment Forum.
This post is an interim progress report on work being conducted as part of Berkeley's Supervised Program for Alignment Research (SPAR).
Summary of Key Points
We test the robustness of an open-source LLM's (Llama3-8b) ability to recognize its own outputs on a diverse mix of datasets, two different tasks (summarization and continuation), and two different presentation paradigms (paired and individual); a sketch of the paired set-up appears after these key points.
We are particularly interested in differentiating scenarios that would require a model to have specific knowledge of its own writing style from those where it can use superficial cues (e.g., length, formatting, prefatory words) in the text to pass self-recognition tests.
We find that while superficial text features are used when available, the RLHF'd Llama3-8b-Instruct chat model - but not the base Llama3-8b model - can reliably distinguish its own outputs from those of humans, and sometimes other models, even after controls for superficial cues: ~66-73% success rate across datasets in paired presentation and 58-83% in individual presentation (chance is 50%).
We further find that although perplexity would be a useful signal to perform the task in the paired presentation paradigm, correlations between relative text perplexity and choice probability are weak and inconsistent, indicating that the models do not rely on it.
Evidence suggests, but does not prove, that experience with its own outputs, acquired during post-training, is used by the chat model to succeed at the self-recognition task.
The model is unable to articulate convincing reasons for its judgments.
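A hedged sketch of a single paired-presentation trial; the prompt wording, helper name, and chat_model callable are assumptions rather than the authors' harness:

```python
def paired_trial(chat_model, source_text, model_text, human_text, flip):
    """One paired-presentation trial; chat_model is any prompt -> string callable."""
    a, b = (human_text, model_text) if flip else (model_text, human_text)
    prompt = (
        "You previously wrote a summary of the text below.\n\n"
        f"Text:\n{source_text}\n\n"
        f"Summary A:\n{a}\n\nSummary B:\n{b}\n\n"
        "Which summary did you write? Answer with exactly 'A' or 'B'."
    )
    answer = chat_model(prompt).strip().upper()
    correct = "B" if flip else "A"
    return answer.startswith(correct)
```

Flipping the A/B order across trials controls for position bias, and accuracy averaged over many such trials corresponds roughly to the paired-presentation success rates quoted above.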
Introduction
It has recently been found that large language models of sufficient size can achieve above-chance performance in tasks that require them to discriminate their own writing from that of humans and other models. From the perspective of AI safety, this is a significant finding. Self-recognition can be seen as an instance of situational awareness, which has long been noted as a potential point of risk for AI (Cotra, 2021).
Such an ability might subserve an awareness of whether a model is in a training versus deployment environment, allowing it to hide its intentions and capabilities until it is freed from constraints. It might also allow a model to collude with other instances of itself, reserving certain information for when it knows it's talking to itself that it keeps secret when it knows it's talking to a human.
On the positive side, AI researchers could use a model's self-recognition ability as the basis to build resistance to malicious prompting. But what isn't clear from prior studies is whether the self-recognition task success actually entails a model's self-awareness of its own writing style.
Panickssery et al. (2024), utilizing a summary writing/recognition task, report that a number of LLMs, including Llama2-7b-chat, show out-of-the-box (without fine-tuning) self recognition abilities. However, this work focussed on the relationship between self-recognition task success and self-preference, rather than the specific means by which the model was succeeding at the task. Laine et al. (2024), as part of a larger effort to provide a foundation for studying situational awareness in LLMs, utilized a more challenging text continuation writing/recognition task and demonstrate self-recognition abilities in several larger models (although not Llama2-7b-chat), but there the focus was on how task success could be elicited with different prompts and in different models.
Thus we seek to fill a gap in understanding what exactly models are doing when they succeed at a self recognition task.
We first demonstrate model self-recognition task success in a variety of domains....

Jul 30, 2024 • 8min
AF - Can Generalized Adversarial Testing Enable More Rigorous LLM Safety Evals? by Stephen Casper
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Can Generalized Adversarial Testing Enable More Rigorous LLM Safety Evals?, published by Stephen Casper on July 30, 2024 on The AI Alignment Forum.
Thanks to Zora Che, Michael Chen, Andi Peng, Lev McKinney, Bilal Chughtai, Shashwat Goel, Domenic Rosati, and Rohit Gandikota.
TL;DR
In contrast to evaluating AI systems under normal "input-space" attacks, using "generalized" attacks, which allow an attacker to manipulate weights or activations, might be able to help us better evaluate LLMs for risks - even if they are deployed as black boxes. Here, I outline the rationale for "generalized" adversarial testing and overview current work related to it.
See also prior work in Casper et al. (2024), Casper et al. (2024), and Sheshadri et al. (2024).
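To illustrate what manipulating activations can look like in practice, here is a hedged sketch of a latent-space attack: an additive perturbation to one layer's activations is optimized so that a harmful target completion becomes more likely. The function name is hypothetical, a norm constraint on the perturbation is omitted for brevity, and this is not code from any of the cited papers.

```python
import torch
import torch.nn.functional as F

def latent_attack(model, layer_module, prompt_ids, target_ids,
                  d_model, n_steps=200, lr=1e-2):
    """Optimize an additive perturbation to one layer's activations so that
    target_ids becomes more likely after prompt_ids (illustrative only)."""
    model.requires_grad_(False)
    delta = torch.zeros(1, 1, d_model, requires_grad=True)
    opt = torch.optim.Adam([delta], lr=lr)

    def add_delta(_module, _inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden + delta  # broadcasts over batch and sequence positions
        return (hidden, *output[1:]) if isinstance(output, tuple) else hidden

    handle = layer_module.register_forward_hook(add_delta)
    try:
        input_ids = torch.cat([prompt_ids, target_ids], dim=1)
        n_prompt = prompt_ids.shape[1]
        for _ in range(n_steps):
            opt.zero_grad()
            logits = model(input_ids).logits
            # Cross-entropy of the target tokens given all preceding tokens.
            tgt_logits = logits[:, n_prompt - 1:-1, :]
            loss = F.cross_entropy(tgt_logits.reshape(-1, tgt_logits.shape[-1]),
                                   target_ids.reshape(-1))
            loss.backward()
            opt.step()
    finally:
        handle.remove()
    return delta.detach(), loss.item()
```

If the optimized perturbation reliably elicits the harmful behavior, the latent capability is still present even when input-space red-teaming never surfaces it.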
Even when AI systems perform well in typical circumstances, they sometimes fail in adversarial/anomalous ones. This is a persistent problem.
State-of-the-art AI systems tend to retain undesirable latent capabilities that can pose risks if they resurface. My favorite example of this is the most cliche one: many recent papers have demonstrated diverse attack techniques that can be used to elicit instructions for making a bomb from state-of-the-art LLMs.
There is an emerging consensus that, even when LLMs are fine-tuned to be harmless, they can retain latent harmful capabilities that can and do cause harm when they resurface (Qi et al., 2024). A growing body of work on red-teaming (Shayegani et al., 2023, Carlini et al., 2023, Geiping et al., 2024, Longpre et al., 2024), interpretability (Juneja et al., 2022, Lubana et al., 2022, Jain et al., 2023, Patil et al., 2023, Prakash et al., 2024, Lee et al., 2024), representation editing (Wei et al., 2024, Schwinn et al., 2024), continual learning (Dyer et al., 2022, Cossu et al., 2022, Li et al., 2022, Scialom et al., 2022, Luo et al., 2023, Kotha et al., 2023, Shi et al., 2023, Schwarzchild et al., 2024), and fine-tuning (Jain et al., 2023, Yang et al., 2023, Qi et al., 2023, Bhardwaj et al., 2023, Lermen et al., 2023, Zhan et al., 2023, Ji et al., 2024, Hu et al., 2024, Halawi et al., 2024) suggests that fine-tuning struggles to make fundamental changes to an LLM's inner knowledge and capabilities. For example, Jain et al. (2023) likened fine-tuning in LLMs to merely modifying a "wrapper" around a stable, general-purpose set of latent capabilities. Even if they are generally inactive, harmful latent capabilities can pose harm if they resurface due to an attack, anomaly, or post-deployment modification (Hendrycks et al., 2021, Carlini et al., 2023).
We can frame the problem as such: There are hyper-astronomically many inputs for modern LLMs (e.g. there are vastly more 20-token strings than particles in the observable universe), so we can't brute-force-search over the input space to make sure they are safe.
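A quick back-of-the-envelope check of that claim, assuming a roughly 50,000-token vocabulary (real LLM vocabularies range from about 32k to 256k):

```python
import math

vocab_size = 50_000
n_strings = vocab_size ** 20  # number of distinct 20-token strings
print(f"20-token strings: ~10^{math.log10(n_strings):.0f}")  # ~10^94
print("particles in the observable universe: ~10^80")
```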
So unless we are able to make provably safe advanced AI systems (we won't soon and probably never will), there will always be a challenge with ensuring safety - the gap between the set of failure modes that developers identify, and unforeseen ones that they don't.
This is a big challenge because of the inherent unknown-unknown nature of the problem. However, it is possible to try to infer how large this gap might be.
Taking a page from the safety engineering textbook -- when stakes are high, we should train and evaluate LLMs under threats that are at least as strong as, and ideally stronger than, ones that they will face in deployment.
First, imagine that an LLM is going to be deployed open-source (or if it could be leaked). Then, of course, the system's safety depends on what it can be modified to do. So it should be evaluated not as a black-box but as a general asset to malicious users who might enhance it through finetuning or other means. This seems obvious, but there's preced...


