

The Nonlinear Library
The Nonlinear Fund
The Nonlinear Library allows you to easily listen to top EA and rationalist content on your podcast player. We use text-to-speech software to create an automatically updating repository of audio content from the EA Forum, Alignment Forum, LessWrong, and other EA blogs. To find out more, please visit us at nonlinear.org
Episodes

Jul 31, 2024 • 23min
LW - Open Source Automated Interpretability for Sparse Autoencoder Features by kh4dien
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Open Source Automated Interpretability for Sparse Autoencoder Features, published by kh4dien on July 31, 2024 on LessWrong.
Background
Sparse autoencoders recover a diversity of interpretable, monosemantic features, but present an intractable problem of scale to human labelers. We investigate different techniques for generating and scoring text explanations of SAE features.
Key Findings
Open source models generate and evaluate text explanations of SAE features reasonably well, albeit somewhat worse than closed models like Claude 3.5 Sonnet.
Explanations found by LLMs are similar to explanations found by humans.
Automatically interpreting 1.5M features of GPT-2 with the current pipeline would cost $1300 in API calls to Llama 3.1 or $8500 with Claude 3.5 Sonnet. Prior methods cost ~$200k with Claude.
Code can be found at
https://github.com/EleutherAI/sae-auto-interp.
We built a small dashboard to explore explanations and their scores:
https://cadentj.github.io/demo/
Generating Explanations
Sparse autoencoders decompose activations into a sum of sparse feature directions. We leverage language models to generate explanations for activating text examples. Prior work prompts language models with token sequences that activate MLP neurons (Bills et al. 2023), by showing the model a list of tokens followed by their respective activations, separated by a tab, and listed one per line.
We instead highlight max-activating tokens in each example by wrapping them in <> delimiters. Optionally, we choose a threshold, as a fraction of the example's max activation, above which tokens are highlighted. This helps the model distinguish the important information for some densely activating features.
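A minimal sketch of this formatting step, assuming a hypothetical highlight_example helper (the exact delimiters and threshold handling in sae-auto-interp may differ):

```python
def highlight_example(tokens, activations, threshold_frac=0.5):
    """Return the example text with strongly activating tokens wrapped in <>."""
    cutoff = threshold_frac * max(activations)
    pieces = [f"<{tok}>" if act >= cutoff else tok
              for tok, act in zip(tokens, activations)]
    return "".join(pieces)

# Example: a feature that fires on " stop" after "don't"
tokens = ["I", " said", " don't", " stop", " now"]
acts   = [0.0,  0.1,     2.3,      8.7,     0.2]
print(highlight_example(tokens, acts))  # -> I said don't< stop> now
```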
We experiment with several methods for augmenting the explanation. Full prompts are available here.
Chain of thought improves general reasoning capabilities in language models. We few-shot the model with several examples of a thought process that mimics a human approach to generating explanations. We expect that verbalizing thought might capture richer relations between tokens and context.
Activations distinguish which sentences are more representative of a feature. We provide the magnitude of activating tokens after each example.
We compute the logit weights for each feature through the path expansion W_U d_i, where W_U is the model unembed and d_i is the decoder direction for a specific feature. The top promoted tokens capture a feature's causal effects, which are useful for sharpening explanations. This method is equivalent to the logit lens (nostalgebraist 2020); future work might apply variants that reveal other causal information (Belrose et al. 2023; Gandelsman et al. 2024).
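A sketch of this computation, assuming illustrative tensor names and shapes (W_U of shape d_model x vocab, W_dec of shape n_features x d_model) and a Hugging Face-style tokenizer; this is not code from the repo:

```python
import torch

def top_promoted_tokens(W_U, W_dec, feature_idx, tokenizer, k=10):
    """Tokens whose logits are most promoted by one SAE feature."""
    d_i = W_dec[feature_idx]      # decoder direction for this feature, shape (d_model,)
    logit_weights = d_i @ W_U     # path expansion W_U d_i, shape (vocab,)
    top_ids = torch.topk(logit_weights, k).indices
    return tokenizer.convert_ids_to_tokens(top_ids.tolist())
```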
Scoring explanations
Text explanations represent interpretable "concepts" in natural language. How do we evaluate the faithfulness of explanations to the concepts actually contained in SAE features?
We view the explanation as a classifier which predicts whether a feature is present in a context. An explanation should have high recall - identifying most activating text - as well as high precision - distinguishing between activating and non-activating text.
Consider a feature which activates on the word "stop" after "don't" or "won't" (Gao et al. 2024). There are two failure modes:
1. The explanation could be too broad, identifying the feature as activating on the word "stop". It would have high recall on held out text, but low precision.
2. The explanation could be too narrow, stating the feature activates on the word "stop" only after "don't". This would have high precision, but low recall.
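As a toy illustration of this framing, with the judge model's per-context decisions abstracted into a list of booleans and all names hypothetical:

```python
def explanation_precision_recall(predicted, actual):
    """predicted[i]: judge says the explanation matches context i;
    actual[i]: the feature really activates in context i."""
    tp = sum(p and a for p, a in zip(predicted, actual))
    fp = sum(p and not a for p, a in zip(predicted, actual))
    fn = sum(a and not p for p, a in zip(predicted, actual))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# The "too broad" explanation fires on every "stop" (high recall, low precision);
# the "too narrow" one fires only on "don't stop" (high precision, low recall).
```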
One approach to scoring explanations is "simulation scoring" (Bills et al. 2023), which uses a language model to assign an activation to each token in a text, then measures the correlation between predicted and real activations. This method is biased toward recall; given a bro...

Jul 31, 2024 • 4min
LW - Twitter thread on AI safety evals by Richard Ngo
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Twitter thread on AI safety evals, published by Richard Ngo on July 31, 2024 on LessWrong.
Epistemic status: raising concerns, rather than stating confident conclusions.
I'm worried that a lot of work on AI safety evals matches the pattern of "Something must be done. This is something. Therefore this must be done." Or, to put it another way: I judge eval ideas on 4 criteria, and I often see proposals which fail all 4. The criteria:
1. Possible to measure with scientific rigor.
Some things can be easily studied in a lab; others are entangled with a lot of real-world complexity. If you predict the latter (e.g. a model's economic or scientific impact) based on model-level evals, your results will often be BS.
(This is why I dislike the term "transformative AI", by the way. Whether an AI has transformative effects on society will depend hugely on what the society is like, how the AI is deployed, etc. And that's a constantly moving target! So TAI is a terrible thing to try to forecast.)
Another angle on "scientific rigor": you're trying to make it obvious to onlookers that you couldn't have designed the eval to get your preferred results. This means making the eval as simple as possible: each arbitrary choice adds another avenue for p-hacking, and they add up fast.
(Paraphrasing a different thread): I think of AI risk forecasts as basically guesses, and I dislike attempts to make them sound objective (e.g. many OpenPhil worldview investigations). There are always so many free parameters that you can get basically any result you want. And so, in practice, they often play the role of laundering vibes into credible-sounding headline numbers. I'm worried that AI safety evals will fall into the same trap.
(I give Eliezer a lot of credit for making roughly this criticism of Ajeya's bio-anchors report. I think his critique has basically been proven right by how much people have updated away from 30-year timelines since then.)
2. Provides signal across scales.
Evals are often designed around a binary threshold (e.g. the Turing Test). But this restricts the impact of the eval to a narrow time window around hitting it. Much better if we can measure (and extrapolate) orders-of-magnitude improvements.
3. Focuses on clearly worrying capabilities.
Evals for hacking, deception, etc track widespread concerns. By contrast, evals for things like automated ML R&D are only worrying for people who already believe in AI xrisk. And even they don't think it's necessary for risk.
4. Motivates useful responses.
Safety evals are for creating clear Schelling points at which action will be taken. But if you don't know what actions your evals should catalyze, it's often more valuable to focus on fleshing that out. Often nobody else will!
In fact, I expect that things like model releases, demos, warning shots, etc, will by default be much better drivers of action than evals. Evals can still be valuable, but you should have some justification for why yours will actually matter, to avoid traps like the ones above. Ideally that justification would focus either on generating insight or being persuasive; optimizing for both at once seems like a good way to get neither.
Lastly: even if you have a good eval idea, actually implementing it well can be very challenging.
Building evals is scientific research; and so we should expect eval quality to be heavy-tailed, like most other science. I worry that the fact that evals are an unusually easy type of research to get started with sometimes obscures this fact.
Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org

Jul 30, 2024 • 20min
LW - RTFB: California's AB 3211 by Zvi
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: RTFB: California's AB 3211, published by Zvi on July 30, 2024 on LessWrong.
Some in the tech industry decided now was the time to raise alarm about AB 3211.
As Dean Ball points out, there's a lot of bills out there. One must do triage.
Dean Ball: But SB 1047 is far from the only AI bill worth discussing. It's not even the only one of the dozens of AI bills in California worth discussing. Let's talk about AB 3211, the California Provenance, Authenticity, and Watermarking Standards Act, written by Assemblymember Buffy Wicks, who represents the East Bay.
SB 1047 is a carefully written bill that tries to maximize benefits and minimize costs. You can still quite reasonably disagree with the aims, philosophy or premise of the bill, or its execution details, and thus think its costs exceed its benefits. When people claim SB 1047 is made of crazy pills, they are attacking provisions not in the bill.
That is not how it usually goes.
Most bills involving tech regulation that come before state legislatures are made of crazy pills, written by people in over their heads.
There are people whose full time job is essentially pointing out the latest bill that might break the internet in various ways, over and over, forever. They do a great and necessary service, and I do my best to forgive them the occasional false alarm. They deal with idiots, with bulls in china shops, on the daily. I rarely get the sense these noble warriors are having any fun.
AB 3211 unanimously passed the California assembly, and I started seeing bold claims about how bad it would be. Here was one of the more measured and detailed ones.
Dean Ball: The bill also requires every generative AI system to maintain a database with digital fingerprints for "any piece of potentially deceptive content" it produces. This would be a significant burden for the creator of any AI system. And it seems flatly impossible for the creators of open weight models to comply.
Under AB 3211, a chatbot would have to notify the user that it is a chatbot at the start of every conversation. The user would have to acknowledge this before the conversation could begin. In other words, AB 3211 could create the AI version of those annoying cookie notifications you get every time you visit a European website.
…
AB 3211 mandates "maximally indelible watermarks," which it defines as "a watermark that is designed to be as difficult to remove as possible using state-of-the-art techniques and relevant industry standards."
So I decided to Read the Bill (RTFB).
It's a bad bill, sir. A stunningly terrible bill.
How did it unanimously pass the California assembly?
My current model is:
1. There are some committee chairs and others that can veto procedural progress.
2. Most of the members will vote for pretty much anything.
3. They are counting on Newsom to evaluate and if needed veto.
4. So California only sort of has a functioning legislative branch, at best.
5. Thus when bills pass like this, it means a lot less than you might think.
Yet everyone stays there, despite everything. There really is a lot of ruin in that state.
Time to read the bill.
Read The Bill (RTFB)
It's short - the bottom half of the page is all deleted text.
Section 1 is rhetorical declarations. GenAI can produce inauthentic images, they need to be clearly disclosed and labeled, or various bad things could happen. That sounds like a job for California, which should require creators to provide tools and platforms to provide labels. So we all can remain 'safe and informed.' Oh no.
Section 2 22949.90 provides some definitions. Most are standard. These aren't:
(c) "Authentic content" means images, videos, audio, or text created by human beings without any modifications or with only minor modifications that do not lead to significant changes to the perceived contents or meaning of the cont...

Jul 30, 2024 • 44sec
AF - Against AI As An Existential Risk by Noah Birnbaum
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Against AI As An Existential Risk, published by Noah Birnbaum on July 30, 2024 on The AI Alignment Forum.
I wrote a post to my Substack attempting to compile all of the best arguments against AI as an existential threat.
Some arguments that I discuss include: international game theory dynamics, reference class problems, Knightian uncertainty, superforecaster and domain expert disagreement, the issue with long-winded arguments, and more!
Please tell me why I'm wrong, and if you like the article, subscribe and share it with friends!
Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org.

Jul 30, 2024 • 38min
EA - What I wish I knew when I started out in animal advocacy by SofiaBalderson
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: What I wish I knew when I started out in animal advocacy, published by SofiaBalderson on July 30, 2024 on The Effective Altruism Forum.
Tl;dr:
I identified 15 pieces of advice that I wish I had known earlier in my animal advocacy career, and provided personal stories to show how they were relevant in my life. This article will be most useful for early career professionals, especially those considering roles in animal advocacy, and people who like reading personal accounts.
1. Choose work experience over post-Bachelor's education when possible
2. Start doing relevant work, even if unpaid, to showcase skills and test fit
3. Be open to changing jobs for better opportunities, but consider financial security
4. Use current jobs to build career capital for future animal advocacy roles
5. Consider earning to give as a way to support the movement financially
6. Offer concrete skills to solve specific problems when seeking roles
7. Network actively, including with senior people, and learn to ask good questions
8. Seek growth opportunities beyond what your employer provides
9. Don't be afraid to be ambitious, and critically assess other people's advice
10. Don't take rejections too personally; persistence often pays off
11. Invest time in improving your productivity
12. Prioritise relationships with family and friends: there will always be more work to do
13. Take care of your physical and mental health
14. Plan for long-term financial security (pension, savings, housing) even on a low salary
15. Learn to budget and save money effectively
Who is this post for?
Early career professionals, especially in animal advocacy. People who like reading personal accounts.
Disclaimer:
Please note that this wasn't intended as comprehensive career advice; it's just my own personal take on what mistakes I think I've made and what I could have done better during my 6+ years in animal advocacy so far. This is the advice I wish I had heard in my early twenties. I started my journey at Veganuary, then volunteered and was a contractor for a number of charities, then worked at Animal Advocacy Careers, then started Hive, a community building charity for farmed animal advocates.
Depending on your life and work circumstances, all or some of this advice may not apply to you. What worked for me may not work for you, as my journey is the result of a unique combination of my strengths, opportunities and weaknesses. I think overall I'm quite risk-tolerant in comparison to an average advocate, and I spent all my twenties with no significant financial commitments, so that's worth taking into account.
Do critically assess whether this advice will actually apply to your situation (see Should you reverse any advice you hear). This advice may apply to other causes, not just animal welfare, but since I am working in animal advocacy, I give resources and examples for this cause area only. There may be some hindsight bias because I have been working in the movement for over 6 years and forgot what it's like to be an early career professional. These lessons and tips are in no particular order, but I've tried to organise them in themes.
Acknowledgments:
Thanks so much to Allison Agnello, Constance Li, Kevin Xia, Hayden Kessinger and Cameron King for reviewing this post and providing valuable suggestions. All mistakes are my own.
Getting work:
Choose work experience over post-Bachelor's education
I feel like pursuing post-Bachelor's higher education (e.g. a Master's) may be the default option for many people, but in many career paths in animal advocacy it is far from obvious that an advanced degree would benefit you. I considered doing a Master's in 2020 and even tried a module. But I soon realised that direct work experience is considerably more valuable for my career than the formal education I was looking into.
I do think ...

Jul 30, 2024 • 19min
LW - Self-Other Overlap: A Neglected Approach to AI Alignment by Marc Carauleanu
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Self-Other Overlap: A Neglected Approach to AI Alignment, published by Marc Carauleanu on July 30, 2024 on LessWrong.
Many thanks to Bogdan Ionut-Cirstea, Steve Byrnes, Gunnar Zarnacke, Jack Foxabbott and Seong Hah Cho for critical comments and feedback on earlier and ongoing versions of this work.
Summary
In this post, we introduce self-other overlap training: optimizing for similar internal representations when the model reasons about itself and others while preserving performance. There is a large body of evidence suggesting that neural self-other overlap is connected to pro-sociality in humans and we argue that there are more fundamental reasons to believe this prior is relevant for AI Alignment.
We argue that self-other overlap is a scalable and general alignment technique that requires little interpretability and has low capabilities externalities. We also share an early experiment of how fine-tuning a deceptive policy with self-other overlap reduces deceptive behavior in a simple RL environment.
On top of that, we found that the non-deceptive agents consistently have higher mean self-other overlap than the deceptive agents, which allows us to perfectly classify which agents are deceptive using only the mean self-other overlap value across episodes.
Introduction
General purpose ML models with the capacity for planning and autonomous behavior are becoming increasingly capable. Fortunately, research on making sure the models produce output in line with human interests in the training distribution is also progressing rapidly (e.g., RLHF, DPO). However, a looming question remains: even if the model appears to be aligned with humans in the training distribution, will it defect once it is deployed or gathers enough power? In other words, is the model deceptive?
We introduce a method that aims to reduce deception and increase the likelihood of alignment, called Self-Other Overlap: overlapping the latent self and other representations of a model while preserving performance. This method makes minimal assumptions about the model's architecture and its interpretability and has a very concrete implementation. Early results indicate that it is effective at reducing deception in simple RL environments, and preliminary LLM experiments are currently being conducted.
To be better prepared for the possibility of short timelines without necessarily having to solve interpretability, it seems useful to have a scalable, general, and transferable condition on the model internals that makes it less likely for the model to be deceptive.
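To make the objective concrete, here is an illustrative sketch, not the authors' implementation: an overlap term pulls together the hidden states for matched, equal-length self- and other-referencing prompts, and a KL term to a frozen reference model stands in for preserving performance. The Hugging Face-style model interface and all names are assumptions, and the experiments described below use a simple RL environment rather than an LLM.

```python
import torch
import torch.nn.functional as F

def self_other_overlap_loss(model, ref_model, self_batch, other_batch, lam=1.0):
    """Overlap term plus performance-preservation term (illustrative only)."""
    out_self = model(**self_batch, output_hidden_states=True)
    out_other = model(**other_batch, output_hidden_states=True)

    # Pull final-layer hidden states together for prompts that differ only
    # in whether they reference the model itself or another agent.
    overlap = F.mse_loss(out_self.hidden_states[-1], out_other.hidden_states[-1])

    # Stay close to the frozen reference model so capabilities are preserved.
    with torch.no_grad():
        ref_logits = ref_model(**self_batch).logits
    kl = F.kl_div(F.log_softmax(out_self.logits, dim=-1),
                  F.softmax(ref_logits, dim=-1), reduction="batchmean")
    return overlap + lam * kl
```

The weighting term lam trades off how strongly self and other representations are pulled together against how much the model's behavior is allowed to drift.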
Self-Other Overlap
To get a more intuitive grasp of the concept, it is useful to understand how self-other overlap is measured in humans. There are regions of the brain that activate similarly when we do something ourselves and when we observe someone else performing the same action.
For example, if you were to pick up a martini glass under an fMRI, and then watch someone else pick up a martini glass, we would find regions of your brain that are similarly activated (overlapping) when you process the self and other-referencing observations as illustrated in Figure 2.
There seems to be compelling evidence that self-other overlap is linked to pro-social behavior in humans. For example, preliminary data suggests that extraordinary altruists (people who donated a kidney to strangers) have higher neural self-other overlap than control participants in neural representations of fearful anticipation in the anterior insula, while the opposite appears to be true for psychopaths. Moreover, the leading theories of empathy (such as the Perception-Action Model) imply that empathy is mediated by self-other overlap at a neural level. While this does not necessarily mean that these results generalise to AI models, we believe there are more fundamental reasons that this prior, onc...

Jul 30, 2024 • 19min
AF - Self-Other Overlap: A Neglected Approach to AI Alignment by Marc Carauleanu
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Self-Other Overlap: A Neglected Approach to AI Alignment, published by Marc Carauleanu on July 30, 2024 on The AI Alignment Forum.
Many thanks to Bogdan Ionut-Cirstea, Steve Byrnes, Gunnar Zarnacke, Jack Foxabbott and Seong Hah Cho for critical comments and feedback on earlier and ongoing versions of this work. This research was conducted at AE Studio and supported by the AI Safety Grants programme administered by Foresight Institute with additional support from AE Studio.
Summary
In this post, we introduce self-other overlap training: optimizing for similar internal representations when the model reasons about itself and others while preserving performance. There is a large body of evidence suggesting that neural self-other overlap is connected to pro-sociality in humans and we argue that there are more fundamental reasons to believe this prior is relevant for AI Alignment.
We argue that self-other overlap is a scalable and general alignment technique that requires little interpretability and has low capabilities externalities. We also share an early experiment of how fine-tuning a deceptive policy with self-other overlap reduces deceptive behavior in a simple RL environment.
On top of that, we found that the non-deceptive agents consistently have higher mean self-other overlap than the deceptive agents, which allows us to perfectly classify which agents are deceptive using only the mean self-other overlap value across episodes.
Introduction
General purpose ML models with the capacity for planning and autonomous behavior are becoming increasingly capable. Fortunately, research on making sure the models produce output in line with human interests in the training distribution is also progressing rapidly (e.g., RLHF, DPO). However, a looming question remains: even if the model appears to be aligned with humans in the training distribution, will it defect once it is deployed or gathers enough power? In other words, is the model deceptive?
We introduce a method that aims to reduce deception and increase the likelihood of alignment, called Self-Other Overlap: overlapping the latent self and other representations of a model while preserving performance. This method makes minimal assumptions about the model's architecture and its interpretability and has a very concrete implementation. Early results indicate that it is effective at reducing deception in simple RL environments, and preliminary LLM experiments are currently being conducted.
To be better prepared for the possibility of short timelines without necessarily having to solve interpretability, it seems useful to have a scalable, general, and transferable condition on the model internals that makes it less likely for the model to be deceptive.
Self-Other Overlap
To get a more intuitive grasp of the concept, it is useful to understand how self-other overlap is measured in humans. There are regions of the brain that activate similarly when we do something ourselves and when we observe someone else performing the same action.
For example, if you were to pick up a martini glass under an fMRI, and then watch someone else pick up a martini glass, we would find regions of your brain that are similarly activated (overlapping) when you process the self and other-referencing observations as illustrated in Figure 2.
There seems to be compelling evidence that self-other overlap is linked to pro-social behavior in humans. For example, preliminary data suggests that extraordinary altruists (people who donated a kidney to strangers) have higher neural self-other overlap than control participants in neural representations of fearful anticipation in the anterior insula, while the opposite appears to be true for psychopaths. Moreover, the leading theories of empathy (such as the Perception-Action Model) imply that empathy is mediated by self-ot...

Jul 30, 2024 • 49sec
EA - In the last 2 years, what surprising ideas has EA championed or how has the movement changed its mind? by Nathan Young
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: In the last 2 years, what surprising ideas has EA championed or how has the movement changed its mind?, published by Nathan Young on July 30, 2024 on The Effective Altruism Forum.
In the last 2 years:
What ideas that were considered wrong[1]/low status have been championed here?
What has the movement acknowledged it was wrong about previously?
What new, effective organisations have been started?
This isn't to claim that this is the only work that matters, but it feels like a chunk of what matters. Someone asked me and I realised I didn't have good answers.
1. ^
Changed in response to comment from @JWS
Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org

Jul 30, 2024 • 23min
AF - Investigating the Ability of LLMs to Recognize Their Own Writing by Christopher Ackerman
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Investigating the Ability of LLMs to Recognize Their Own Writing, published by Christopher Ackerman on July 30, 2024 on The AI Alignment Forum.
This post is an interim progress report on work being conducted as part of Berkeley's Supervised Program for Alignment Research (SPAR).
Summary of Key Points
We test the robustness of an open-source LLM's (Llama3-8b) ability to recognize its own outputs on a diverse mix of datasets, two different tasks (summarization and continuation), and two different presentation paradigms (paired and individual); a sketch of the paired set-up appears after these key points.
We are particularly interested in differentiating scenarios that would require a model to have specific knowledge of its own writing style from those where it can use superficial cues (e.g., length, formatting, prefatory words) in the text to pass self-recognition tests.
We find that while superficial text features are used when available, the RLHF'd Llama3-8b-Instruct chat model - but not the base Llama3-8b model - can reliably distinguish its own outputs from those of humans, and sometimes other models, even after controls for superficial cues: ~66-73% success rate across datasets in paired presentation and 58-83% in individual presentation (chance is 50%).
We further find that although perplexity would be a useful signal to perform the task in the paired presentation paradigm, correlations between relative text perplexity and choice probability are weak and inconsistent, indicating that the models do not rely on it.
Evidence suggests, but does not prove, that experience with its own outputs, acquired during post-training, is used by the chat model to succeed at the self-recognition task.
The model is unable to articulate convincing reasons for its judgments.
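A hedged sketch of a single paired-presentation trial; the prompt wording, helper name, and chat_model callable are assumptions rather than the authors' harness:

```python
def paired_trial(chat_model, source_text, model_text, human_text, flip):
    """One paired-presentation trial; chat_model is any prompt -> string callable."""
    a, b = (human_text, model_text) if flip else (model_text, human_text)
    prompt = (
        "You previously wrote a summary of the text below.\n\n"
        f"Text:\n{source_text}\n\n"
        f"Summary A:\n{a}\n\nSummary B:\n{b}\n\n"
        "Which summary did you write? Answer with exactly 'A' or 'B'."
    )
    answer = chat_model(prompt).strip().upper()
    correct = "B" if flip else "A"
    return answer.startswith(correct)
```

Flipping the A/B order across trials controls for position bias, and accuracy averaged over many such trials corresponds roughly to the paired-presentation success rates quoted above.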
Introduction
It has recently been found that large language models of sufficient size can achieve above-chance performance in tasks that require them to discriminate their own writing from that of humans and other models. From the perspective of AI safety, this is a significant finding. Self-recognition can be seen as an instance of situational awareness, which has long been noted as a potential point of risk for AI (Cotra, 2021).
Such an ability might subserve an awareness of whether a model is in a training versus deployment environment, allowing it to hide its intentions and capabilities until it is freed from constraints. It might also allow a model to collude with other instances of itself, reserving certain information for when it knows it's talking to itself that it keeps secret when it knows it's talking to a human.
On the positive side, AI researchers could use a model's self-recognition ability as the basis to build resistance to malicious prompting. But what isn't clear from prior studies is whether the self-recognition task success actually entails a model's self-awareness of its own writing style.
Panickssery et al. (2024), utilizing a summary writing/recognition task, report that a number of LLMs, including Llama2-7b-chat, show out-of-the-box (without fine-tuning) self recognition abilities. However, this work focussed on the relationship between self-recognition task success and self-preference, rather than the specific means by which the model was succeeding at the task. Laine et al. (2024), as part of a larger effort to provide a foundation for studying situational awareness in LLMs, utilized a more challenging text continuation writing/recognition task and demonstrate self-recognition abilities in several larger models (although not Llama2-7b-chat), but there the focus was on how task success could be elicited with different prompts and in different models.
Thus we seek to fill a gap in understanding what exactly models are doing when they succeed at a self recognition task.
We first demonstrate model self-recognition task success in a variety of domains....

Jul 30, 2024 • 8min
AF - Can Generalized Adversarial Testing Enable More Rigorous LLM Safety Evals? by Stephen Casper
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Can Generalized Adversarial Testing Enable More Rigorous LLM Safety Evals?, published by Stephen Casper on July 30, 2024 on The AI Alignment Forum.
Thanks to Zora Che, Michael Chen, Andi Peng, Lev McKinney, Bilal Chughtai, Shashwat Goel, Domenic Rosati, and Rohit Gandikota.
TL;DR
In contrast to evaluating AI systems under normal "input-space" attacks, using "generalized" attacks, which allow an attacker to manipulate weights or activations, might be able to help us better evaluate LLMs for risks - even if they are deployed as black boxes. Here, I outline the rationale for "generalized" adversarial testing and overview current work related to it.
See also prior work in Casper et al. (2024), Casper et al. (2024), and Sheshadri et al. (2024).
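To illustrate what manipulating activations can look like in practice, here is a hedged sketch of a latent-space attack: an additive perturbation to one layer's activations is optimized so that a harmful target completion becomes more likely. The function name is hypothetical, a norm constraint on the perturbation is omitted for brevity, and this is not code from any of the cited papers.

```python
import torch
import torch.nn.functional as F

def latent_attack(model, layer_module, prompt_ids, target_ids,
                  d_model, n_steps=200, lr=1e-2):
    """Optimize an additive perturbation to one layer's activations so that
    target_ids becomes more likely after prompt_ids (illustrative only)."""
    model.requires_grad_(False)
    delta = torch.zeros(1, 1, d_model, requires_grad=True)
    opt = torch.optim.Adam([delta], lr=lr)

    def add_delta(_module, _inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden + delta  # broadcasts over batch and sequence positions
        return (hidden, *output[1:]) if isinstance(output, tuple) else hidden

    handle = layer_module.register_forward_hook(add_delta)
    try:
        input_ids = torch.cat([prompt_ids, target_ids], dim=1)
        n_prompt = prompt_ids.shape[1]
        for _ in range(n_steps):
            opt.zero_grad()
            logits = model(input_ids).logits
            # Cross-entropy of the target tokens given all preceding tokens.
            tgt_logits = logits[:, n_prompt - 1:-1, :]
            loss = F.cross_entropy(tgt_logits.reshape(-1, tgt_logits.shape[-1]),
                                   target_ids.reshape(-1))
            loss.backward()
            opt.step()
    finally:
        handle.remove()
    return delta.detach(), loss.item()
```

If the optimized perturbation reliably elicits the harmful behavior, the latent capability is still present even when input-space red-teaming never surfaces it.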
Even when AI systems perform well in typical circumstances, they sometimes fail in adversarial/anomalous ones. This is a persistent problem.
State-of-the-art AI systems tend to retain undesirable latent capabilities that can pose risks if they resurface. My favorite example of this is the most cliche one: many recent papers have demonstrated diverse attack techniques that can be used to elicit instructions for making a bomb from state-of-the-art LLMs.
There is an emerging consensus that, even when LLMs are fine-tuned to be harmless, they can retain latent harmful capabilities that can and do cause harm when they resurface (Qi et al., 2024). A growing body of work on red-teaming (Shayegani et al., 2023, Carlini et al., 2023, Geiping et al., 2024, Longpre et al., 2024), interpretability (Juneja et al., 2022, Lubana et al., 2022, Jain et al., 2023, Patil et al., 2023, Prakash et al., 2024, Lee et al., 2024), representation editing (Wei et al., 2024, Schwinn et al., 2024), continual learning (Dyer et al., 2022, Cossu et al., 2022, Li et al., 2022, Scialom et al., 2022, Luo et al., 2023, Kotha et al., 2023, Shi et al., 2023, Schwarzchild et al., 2024), and fine-tuning (Jain et al., 2023, Yang et al., 2023, Qi et al., 2023, Bhardwaj et al., 2023, Lermen et al., 2023, Zhan et al., 2023, Ji et al., 2024, Hu et al., 2024, Halawi et al., 2024) suggests that fine-tuning struggles to make fundamental changes to an LLM's inner knowledge and capabilities. For example, Jain et al. (2023) likened fine-tuning in LLMs to merely modifying a "wrapper" around a stable, general-purpose set of latent capabilities. Even if they are generally inactive, harmful latent capabilities can pose harm if they resurface due to an attack, anomaly, or post-deployment modification (Hendrycks et al., 2021, Carlini et al., 2023).
We can frame the problem as such: There are hyper-astronomically many inputs for modern LLMs (e.g. there are vastly more 20-token strings than particles in the observable universe), so we can't brute-force-search over the input space to make sure they are safe.
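A quick back-of-the-envelope check of that claim, assuming a roughly 50,000-token vocabulary (real LLM vocabularies range from about 32k to 256k):

```python
import math

vocab_size = 50_000
n_strings = vocab_size ** 20  # number of distinct 20-token strings
print(f"20-token strings: ~10^{math.log10(n_strings):.0f}")  # ~10^94
print("particles in the observable universe: ~10^80")
```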
So unless we are able to make provably safe advanced AI systems (we won't soon and probably never will), there will always be a challenge with ensuring safety - the gap between the set of failure modes that developers identify, and unforeseen ones that they don't.
This is a big challenge because of the inherent unknown-unknown nature of the problem. However, it is possible to try to infer how large this gap might be.
Taking a page from the safety engineering textbook -- when stakes are high, we should train and evaluate LLMs under threats that are at least as strong as, and ideally stronger than, ones that they will face in deployment.
First, imagine that an LLM is going to be deployed open-source (or if it could be leaked). Then, of course, the system's safety depends on what it can be modified to do. So it should be evaluated not as a black-box but as a general asset to malicious users who might enhance it through finetuning or other means. This seems obvious, but there's preced...


