The Nonlinear Library

The Nonlinear Fund
Jun 5, 2024 • 3min

EA - [Linkpost] Situational Awareness - The Decade Ahead by MathiasKB

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: [Linkpost] Situational Awareness - The Decade Ahead, published by MathiasKB on June 5, 2024 on The Effective Altruism Forum. Leopold Aschenbrenner's newest series on Artificial Intelligence is really excellent. The series makes very strong claims, but I'm finding them to be well-argued with clear predictions. Curious to hear what people on this forum think. The series' introduction: You can see the future first in San Francisco. Over the past year, the talk of the town has shifted from $10 billion compute clusters to $100 billion clusters to trillion-dollar clusters. Every six months another zero is added to the boardroom plans. Behind the scenes, there's a fierce scramble to secure every power contract still available for the rest of the decade, every voltage transformer that can possibly be procured. American big business is gearing up to pour trillions of dollars into a long-unseen mobilization of American industrial might. By the end of the decade, American electricity production will have grown tens of percent; from the shale fields of Pennsylvania to the solar farms of Nevada, hundreds of millions of GPUs will hum. The AGI race has begun. We are building machines that can think and reason. By 2025/26, these machines will outpace many college graduates. By the end of the decade, they will be smarter than you or I; we will have superintelligence, in the true sense of the word. Along the way, national security forces not seen in half a century will be unleashed, and before long, The Project will be on. If we're lucky, we'll be in an all-out race with the CCP; if we're unlucky, an all-out war. Everyone is now talking about AI, but few have the faintest glimmer of what is about to hit them. Nvidia analysts still think 2024 might be close to the peak. Mainstream pundits are stuck on the willful blindness of "it's just predicting the next word". They see only hype and business-as-usual; at most they entertain another internet-scale technological change. Before long, the world will wake up. But right now, there are perhaps a few hundred people, most of them in San Francisco and the AI labs, that have situational awareness. Through whatever peculiar forces of fate, I have found myself amongst them. A few years ago, these people were derided as crazy - but they trusted the trendlines, which allowed them to correctly predict the AI advances of the past few years. Whether these people are also right about the next few years remains to be seen. But these are very smart people - the smartest people I have ever met - and they are the ones building this technology. Perhaps they will be an odd footnote in history, or perhaps they will go down in history like Szilard and Oppenheimer and Teller. If they are seeing the future even close to correctly, we are in for a wild ride. Let me tell you what we see. Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org
Jun 5, 2024 • 1min

EA - Clarifying METR's Auditing Role [linkpost] by ChanaMessinger

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Clarifying METR's Auditing Role [linkpost], published by ChanaMessinger on June 5, 2024 on The Effective Altruism Forum. METR clarifying what auditing it has or hasn't done, and what its plans are for the future, seems very important for understanding the landscape. Excerpt: METR has not intended to claim to have audited anything, or to be providing meaningful oversight or accountability, but there has been some confusion about whether METR is an auditor or planning to be one. To clarify this point: 1. METR's top priority is to develop the science of evaluations, and we don't need to be auditors in order to succeed at this. We aim to build evaluation protocols that can be used by evaluators/auditors regardless of whether that is the government, an internal lab team, another third party, or a team at METR. 2. We should not be considered to have 'audited' GPT-4 or Claude. Those were informal pilots of what an audit might involve, or research collaborations - not providing meaningful oversight. For example, it was all under NDA - we didn't have any right or responsibility to disclose our findings to anyone outside the labs - and there wasn't any formal expectation it would inform deployment decisions. We also didn't have the access necessary to perform a proper evaluation. Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org
Jun 5, 2024 • 3min

AF - Announcing ILIAD - Theoretical AI Alignment Conference by Nora Ammann

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Announcing ILIAD - Theoretical AI Alignment Conference, published by Nora Ammann on June 5, 2024 on The AI Alignment Forum. We are pleased to announce ILIAD - a 5-day conference bringing together 100+ researchers to build strong scientific foundations for AI alignment. *** Apply to attend by June 30! ***
When: Aug 28 - Sep 3, 2024
Where: @Lighthaven (Berkeley, US)
What: A mix of topic-specific tracks and unconference-style programming, 100+ attendees. Topics will include Singular Learning Theory, Agent Foundations, Causal Incentives, Computational Mechanics and more to be announced.
Who: Currently confirmed speakers include: Daniel Murfet, Jesse Hoogland, Adam Shai, Lucius Bushnaq, Tom Everitt, Paul Riechers, Scott Garrabrant, John Wentworth, Vanessa Kosoy, Fernando Rosas and James Crutchfield.
Costs: Tickets are free. Financial support is available on a needs basis.
See our website here. For any questions, email iliadconference@gmail.com
About ILIAD: ILIAD is a 100+ person conference about alignment with a mathematical focus. The theme is ecumenical. If that excites you, do apply!
Program and Unconference Format: ILIAD will feature an unconference format - meaning that participants can propose and lead their own sessions. We believe that this is the best way to release the latent creative energies in everyone attending. That said, freedom can be scary! If taking charge of your own learning sounds terrifying, rest assured there will be plenty of organized sessions as well. We will also run topic-specific workshop tracks, such as:
Computational Mechanics is a framework for understanding complex systems by focusing on their intrinsic computation and information processing capabilities. Pioneered by J. Crutchfield, it has recently found its way into AI safety. This workshop is led by Paul Riechers.
Singular Learning Theory, developed by S. Watanabe, is the modern theory of Bayesian learning. SLT studies the loss landscape of neural networks, using ideas from statistical mechanics, Bayesian statistics and algebraic geometry. The track lead is Jesse Hoogland.
Agent Foundations uses tools from theoretical economics, decision theory, Bayesian epistemology, logic, game theory and more to deeply understand agents: how they reason, cooperate, believe and desire. The track lead is Daniel Hermann.
Causal Incentives is a collection of researchers interested in using causal models to understand agents and their incentives. The track lead is Tom Everitt.
"Everything Else" is a track which will include an assortment of other sessions under the direction of John Wentworth. Details to be announced at a later time.
Financial Support: Financial support for accommodation & travel is available on a needs basis. Lighthaven has capacity to accommodate 60% of participants. Note that these rooms are shared.
Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org.
Jun 5, 2024 • 2min

LW - Former OpenAI Superalignment Researcher: Superintelligence by 2030 by Julian Bradshaw

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Former OpenAI Superalignment Researcher: Superintelligence by 2030, published by Julian Bradshaw on June 5, 2024 on LessWrong. The AGI race has begun. We are building machines that can think and reason. By 2025/26, these machines will outpace many college graduates. By the end of the decade, they will be smarter than you or I; we will have superintelligence, in the true sense of the word. In the link provided, Leopold Aschenbrenner explains why he believes AGI is likely to arrive within the decade, with superintelligence following soon after. He does so in some detail; the website is well-organized, but the raw PDF is over 150 pages. Leopold is a former member of OpenAI's Superalignment team; he was fired in April for allegedly leaking company secrets. However, he contests that portrayal of events in a recent interview with Dwarkesh Patel, saying he leaked nothing of significance and was fired for other reasons.[1] However, I am somewhat confused by the new business venture Leopold is now promoting, an "AGI Hedge Fund" aimed at generating strong returns based on his predictions of imminent AGI. In the Dwarkesh Patel interview, it sounds like his intention is to make sure financial resources are available to back AI alignment and any other moves necessary to help Humanity navigate a turbulent future. However, the discussion in the podcast mostly focuses on whether such a fund would truly generate useful financial returns. If you read this post, Leopold[2], could you please clarify your intentions in founding this fund? 1. ^ Specifically he brings up a memo he sent to the old OpenAI board claiming OpenAI wasn't taking security seriously enough. He was also one of very few OpenAI employees not to sign the letter asking for Sam Altman's reinstatement last November, and of course, the entire OpenAI superalignment team has collapsed for various reasons as well. 2. ^ Leopold does have a LessWrong account, but hasn't linked his new website here after some time. I hope he doesn't mind me posting in his stead. Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org
Jun 5, 2024 • 19min

LW - Evidence of Learned Look-Ahead in a Chess-Playing Neural Network by Erik Jenner

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Evidence of Learned Look-Ahead in a Chess-Playing Neural Network, published by Erik Jenner on June 5, 2024 on LessWrong. Paper authors: Erik Jenner, Shreyas Kapur, Vasil Georgiev, Cameron Allen, Scott Emmons, Stuart Russell. TL;DR: We released a paper with IMO clear evidence of learned look-ahead in a chess-playing network (i.e., the network considers future moves to decide on its current one). This post shows some of our results, and then I describe the original motivation for the project and reflect on how it went. I think the results are interesting from a scientific and perhaps an interpretability perspective, but only mildly useful for AI safety.
Teaser for the results
(This section is copied from our project website. You may want to read it there for animations and interactive elements, then come back here for my reflections.) Do neural networks learn to implement algorithms involving look-ahead or search in the wild? Or do they only ever learn simple heuristics? We investigate this question for Leela Chess Zero, arguably the strongest existing chess-playing network. We find intriguing evidence of learned look-ahead in a single forward pass. This section showcases some of our results; see our paper for much more.
Setup
We consider chess puzzles such as the following: We focus on the policy network of Leela, which takes in a board state and outputs a distribution over moves. With only a single forward pass per board state, it can solve puzzles like the above. (You can play against the network on Lichess to get a sense of how strong it is - its rating there is over 2600.) Humans and manually written chess engines rely on look-ahead to play chess this well; they consider future moves when making a decision. But is the same thing true for Leela?
Activations associated with future moves are crucial
One of our early experiments was to do activation patching. We patch a small part of Leela's activations from the forward pass of a corrupted version of a puzzle into the forward pass on the original puzzle board state. Measuring the effect on the final output tells us how important that part of Leela's activations was. Leela is a transformer that treats every square of the chess board like a token in a language model. One type of intervention we can thus do is to patch the activation on a single square in a single layer: Surprisingly, we found that the target square of the move two turns in the future (what we call the 3rd move target square) often stores very important information. This does not happen in every puzzle, but it does in a striking fraction, and the average effect is much bigger than that of patching on most other squares: The corrupted square(s) and the 1st move target square are also important (in early and late layers respectively), but we expected as much from Leela's architecture. In contrast, the 3rd move target square stands out in middle layers, and we were much more surprised by its importance. In the paper, we take early steps toward understanding how the information stored on the 3rd move target square is being used. For example, we find a single attention head that often moves information from this future target square backward in time to the 1st move target square.
Probes can predict future moves
If Leela uses look-ahead, can we explicitly predict future moves from its activations?
We train simple, bilinear probes on parts of Leela's activations to predict the move two turns into the future (on a set of puzzles where Leela finds a single clearly best continuation). Our probe architecture is motivated by our earlier results - it predicts whether a given square is the target square of the 3rd move since, as we've seen, this seems to be where Leela stores important information. We find that this probe can predict the move 2 turns in the future quit...
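For readers who want to see the mechanics of the single-square activation patching described above, here is a minimal, self-contained sketch. It uses a tiny stand-in transformer over 64 square tokens rather than Leela's real policy network, and the board encoding, layer count, and effect normalisation are illustrative assumptions; see the authors' paper and code for the actual setup.

```python
# Minimal sketch of single-square, single-layer activation patching.
# ToyPolicyNet is a stand-in for a square-token chess policy network (NOT the real Leela).
import torch
import torch.nn as nn

N_SQUARES, D_MODEL, N_LAYERS = 64, 128, 6

class ToyPolicyNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Linear(12, D_MODEL)   # 12 piece planes per square (simplified encoding)
        self.blocks = nn.ModuleList(
            [nn.TransformerEncoderLayer(D_MODEL, nhead=4, batch_first=True) for _ in range(N_LAYERS)]
        )
        self.head = nn.Linear(D_MODEL, 1)     # toy per-square score standing in for the policy

    def forward(self, boards, patch=None):
        # boards: (batch, 64, 12); patch: optional (layer_idx, square_idx, activation) to splice in
        x = self.embed(boards)
        for i, block in enumerate(self.blocks):
            x = block(x)
            if patch is not None and patch[0] == i:
                x = x.clone()
                x[:, patch[1], :] = patch[2]   # overwrite one square's residual-stream activation
        return self.head(x).squeeze(-1)        # (batch, 64)

def single_square_patching_effect(model, clean, corrupted, layer, square):
    """Cache one square's post-block activation from the corrupted run, splice it into the
    clean run, and report how far the output moves toward the corrupted run's output
    (0 = no effect, 1 = output fully matches the corrupted run)."""
    cached = {}
    handle = model.blocks[layer].register_forward_hook(
        lambda mod, inp, out: cached.update(a=out[:, square, :].detach())
    )
    with torch.no_grad():
        out_corrupted = model(corrupted)
    handle.remove()
    with torch.no_grad():
        out_clean = model(clean)
        out_patched = model(clean, patch=(layer, square, cached["a"]))
    return ((out_patched - out_clean).norm() / (out_corrupted - out_clean).norm()).item()

if __name__ == "__main__":
    torch.manual_seed(0)
    net = ToyPolicyNet()
    clean_board = torch.randn(1, N_SQUARES, 12)      # placeholder encodings, not real positions
    corrupted_board = torch.randn(1, N_SQUARES, 12)
    for sq in (27, 36, 42):                          # e.g. candidate "3rd move target" squares
        print(sq, single_square_patching_effect(net, clean_board, corrupted_board, layer=3, square=sq))
```

The same hook-and-splice pattern extends to patching sets of squares or whole layers, which is closer to the sweeps reported in the paper.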
Jun 5, 2024 • 32min

AF - On "first critical tries" in AI alignment by Joe Carlsmith

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: On "first critical tries" in AI alignment, published by Joe Carlsmith on June 5, 2024 on The AI Alignment Forum. People sometimes say that AI alignment is scary partly (or perhaps: centrally) because you have to get it right on the "first critical try," and can't learn from failures.[1] What does this mean? Is it true? Does there need to be a "first critical try" in the relevant sense? I've sometimes felt confused about this, so I wrote up a few thoughts to clarify. I start with a few miscellaneous conceptual points. I then focus in on a notion of "first critical try" tied to the first point (if there is one) when AIs get a "decisive strategic advantage" (DSA) over humanity - that is, roughly, the ability to kill/disempower all humans if they try.[2] I further distinguish between four different types of DSA:
Unilateral DSA: Some AI agent could take over if it tried, even without the cooperation of other AI agents (see footnote for more on how I'm individuating AI agents).[3]
Coordination DSA: If some set of AI agents coordinated to try to take over, they would succeed; and they could coordinate in this way if they tried.
Short-term correlation DSA: If some set of AI agents all sought power in problematic ways within a relatively short period of time, even without coordinating, then ~all humans would be disempowered.
Long-term correlation DSA: If some set of AI agents all sought power in problematic ways within a relatively long period of time, even without coordinating, then ~all humans would be disempowered.
I also offer some takes on our prospects for just not ever having "first critical tries" from each type of DSA (via routes other than just not building superhuman AI systems at all). In some cases, just not having a "first critical try" in the relevant sense seems to me both plausible and worth working towards. In particular, I think we should try to make it the case that no single AI system is ever in a position to kill all humans and take over the world. In other cases, I think avoiding "first critical tries," while still deploying superhuman AI agents throughout the economy, is more difficult (though the difficulty of avoiding failure is another story). Here's a chart summarizing my takes in more detail; for each type of DSA it gives the definition, the prospects for avoiding AIs ever getting that type of DSA (e.g., not having a "first critical try" for such a situation), and what's required for it to lead to doom.
Unilateral DSA. Definition: Some AI agent could take over if it tried, even without the cooperation of other AI agents. Prospects for avoiding: Can avoid by making the world sufficiently empowered relative to each AI system. We should work towards this - e.g. aim to make it the case that no single AI system could kill/disempower all humans if it tried. Required for doom: Requires only that this one agent tries to take over.
Coordination DSA. Definition: If some set of AI agents coordinated to try to take over, they would succeed; and they are able to so coordinate. Prospects for avoiding: Harder to avoid than unilateral DSAs, due to the likely role of other AI agents in preventing unilateral DSAs. But could still avoid/delay by (a) reducing reliance on other AI agents for preventing unilateral DSAs, and (b) preventing coordination between AI agents. Required for doom: Requires that all these agents try to take over, and that they coordinate.
Short-term correlation DSA. Definition: If some set of AI agents all sought power in problematic ways within a relatively short period of time, even without coordinating, then ~all humans would be disempowered. Prospects for avoiding: Even harder to avoid than coordination DSAs, because it doesn't require that the AI agents in question be able to coordinate. Required for doom: Requires that within a relatively short period of time, all these agents choose to seek power in problematic ways, potentially without the ability to coordinate.
Long-term correlation DSA. Definition: If some set of AI agents all sought power in prob...
Jun 4, 2024 • 38min

AF - Takeoff speeds presentation at Anthropic by Tom Davidson

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Takeoff speeds presentation at Anthropic, published by Tom Davidson on June 4, 2024 on The AI Alignment Forum. This is a lightly edited transcript of a presentation that I (Tom Davidson) gave about the risks of a fast takeoff at Anthropic in September 2023. See also the video recording, or the slides. None of the content necessarily reflects the views of Anthropic or anyone who works there. Summary: Software progress - improvements in pre-training algorithms, data quality, prompting strategies, tooling, scaffolding, and all other sources of AI progress other than compute - has been a major driver of AI progress in recent years. I guess it's driven about half of total progress in the last 5 years. When we have "AGI" (=AI that could fully automate AI R&D), the pace of software progress might increase dramatically (e.g. by a factor of ten). Bottlenecks might prevent this - e.g. diminishing returns to finding software innovations, retraining new AI models from scratch, or computationally expensive experiments for finding better algorithms. But no bottleneck is decisive, and there's a real possibility that there is a period of dramatically faster capabilities progress despite all of the bottlenecks. This period of accelerated progress might happen just when new extremely dangerous capabilities are emerging and previously-effective alignment techniques stop working. A period of accelerated progress like this could significantly exacerbate risks from misuse, societal disruption, concentration of power, and loss of control. To reduce these risks, labs should monitor for early warning signs of AI accelerating AI progress. In particular they can: track the pace of software progress to see if it's accelerating; run evals of whether AI systems can autonomously complete challenging AI R&D tasks; and measure the productivity gains to employees who use AI systems in their work via surveys and RCTs. Labs should implement protective measures by the time these warning signs occur, including external oversight and info security.
Intro
Hi everyone, really great to be here. My name's Tom Davidson. I work at Open Philanthropy as a Senior Research Analyst and a lot of my work over the last couple of years has been around AI take-off speeds and the possibility that AI systems themselves could accelerate AI capabilities progress. In this talk I'm going to talk a little bit about that research, and then also about some steps that I think labs could take to reduce the risks caused by AI accelerating AI progress. Ok, so here is the brief plan. I'm going to quickly go through some recent drivers of AI progress, which will set the scene to discuss how much AI progress might accelerate when we get AGI. Then the bulk of the talk will be focused on what risks there might be if AGI does significantly accelerate AI progress - which I think is a real possibility - and how labs can reduce those risks.
Software improvements have been a significant fraction of recent AI progress
So drivers of progress. It's probably very familiar to many people that the compute used to train the most powerful AI models has increased very quickly over recent years. According to Epoch's accounting, about 4X increase per year.
Efficiency improvements in pre-training algorithms are a significant driver of AI progress
What I want to highlight from this slide is that algorithmic efficiency improvements - improved algorithms that allow you to train equally capable models with less compute than before - have also played a kind of comparably important role. According to Epoch's accounting, these algorithmic efficiency improvements account for more than half of the gains from compute. That's going to be important later because when we're talking about how much the pace of progress might accelerate, today's fast algorithmic progress...
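To make the compounding concrete, here is a tiny illustrative calculation. The ~4x/year compute figure is quoted from the talk; treating software progress as a comparable annual multiplier is one possible reading of "about half of total progress" and is an assumption used only to show how the two sources stack, not a number from the talk.

```python
# Illustrative arithmetic only: how physical compute growth and an assumed
# comparable software multiplier compound into "effective compute".
compute_multiplier_per_year = 4.0    # quoted Epoch figure for training compute growth
software_multiplier_per_year = 4.0   # assumed comparable software contribution (illustration)

years = 5
compute_only = compute_multiplier_per_year ** years
effective = (compute_multiplier_per_year * software_multiplier_per_year) ** years

print(f"Compute alone over {years} years: ~{compute_only:,.0f}x")
print(f"Effective compute if software contributes comparably: ~{effective:,.0f}x")
```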
Jun 4, 2024 • 4min

LW - Just admit that you've zoned out by joec

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Just admit that you've zoned out, published by joec on June 4, 2024 on LessWrong. Summary: Zoning out is difficult to avoid and common; zoning out without admitting it hurts your comprehension; therefore you should admit that you zoned out and ask people to repeat things. If you're anything like me, you've "zoned out" before. You've probably even zoned out when you're trying to learn something interesting. In fact, I'd bet that you've zoned out when listening to someone you respect teaching you something interesting, and that you didn't admit it, and that this left you with a gap in understanding that you may or may not have filled in later.[1] Perhaps I'm falling for the typical minds fallacy, but I don't think I am. This happens to me very often,[2] and I think it happens to others, and I think that any community focused on rationality or scholarship or understanding ought to account for this. I doubt we'll be able to prevent people from zoning out, but I know we can encourage people who are listening to admit when they've zoned out and we can encourage people who are speaking to patiently re-iterate the thing they just said without taking offense. One time I was explaining something to a friend of mine and she said the unthinkable. "Sorry, I zoned out. Could you repeat what you said after first bringing up mitochondria?" I was at first somewhat taken aback, but quickly realized that I've been in the same position as her. I repeated myself and took less than a minute to do so. I think her understanding was better than it would have been if she hadn't simply admitted she zoned out. I'm thankful she did it, since it brought the fact that I could do the same to my awareness. If you're in the right company, admitting that you've zoned out has barely any cost and real benefits. Zoning out when someone is talking to you is far more common if the things they're saying are boring or hard to comprehend or otherwise unpleasant. It's perfectly rational to, as a speaker, take "people are zoning out" as evidence of a poor job. However, if you were unpleasant to listen to, nobody would ask you to repeat yourself. If someone admits to you that they stopped paying attention and asks you to repeat yourself, it doesn't imply any fault of yours. The right thing to do in that situation is to resist the temptation to be offended or annoyed and just go along with it. Of course, there's always a limit. If someone admits to zoning out twenty times in thirty minutes, perhaps you ought to suggest that they get some sleep. If someone admits to daydreaming for 20 minutes straight while you talked to them, then it's probably time to end the conversation.[3] Even so, most people don't admit to this even once per week, and most fatal zone-outs are quite short. Telling others that you lost focus is done far less than it should be. One of my favorite things about the rationality(-adjacent) community is that its members admit when they're wrong. We acknowledge that our knowledge is limited and that our intelligence is only human. We ask what unfamiliar words mean. We don't try to hide our confusion or ignorance.
It's a basic extension of the underlying principle of understanding and compensating for our cognitive shortcomings to also admit that we lost focus while listening, or got distracted by some irresistible thought that floated to the surface, or just needed a moment to let the things we just heard sink in. Paying attention for an extended period of time is actually kinda hard. Honestly, given that we had to sit in beige boxes for several hours a day for most of the year from ages 5-18 while someone preached to us about subjects we already knew, I'm surprised that reflexively zoning out isn't radically more common. Or, perhaps it is and I'm just not aware of it because nobody admits to z...
Jun 4, 2024 • 6min

EA - I bet Greg Colbourn 10 k€ that AI will not kill us all by the end of 2027 by Vasco Grilo

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: I bet Greg Colbourn 10 k€ that AI will not kill us all by the end of 2027, published by Vasco Grilo on June 4, 2024 on The Effective Altruism Forum. Agreement 78 % of my donations so far have gone to the Long-Term Future Fund[1] (LTFF), which mainly supports AI safety interventions. However, I have become increasingly sceptical about the value of existential risk mitigation, and currently think the best interventions are in the area of animal welfare[2]. As a result, I realised it made sense for me to arrange a bet with someone very worried about AI in order to increase my donations to animal welfare interventions. Gregory Colbourn (Greg) was the 1st person I thought of. He said: I think AGI [artificial general intelligence] is 0-5 years away and p(doom|AGI) is ~90% I doubt doom in the sense of human extinction is anywhere as likely as suggested by the above. I guess the annual extinction risk over the next 10 years is 10^-7, so I proposed a bet to Greg similar to the end-of-the-world bet between Bryan Caplan and Eliezer Yudkowsky. Meanwhile, I transferred 10 k€ to PauseAI[3], which is supported by Greg, and he agreed to the following. If Greg or any of his heirs are still alive by the end of 2027, they transfer to me or an organisation of my choice 20 k€ times the ratio between the consumer price index for all urban consumers and items in the United States, as reported by the Federal Reserve Economic Data (FRED), in December 2027 and April 2024. I expect inflation in this period, i.e. a ratio higher than 1. Some more details: The transfer must be made in January 2028. I will decide in December 2027 whether the transfer should go to me or an organisation of choice. My current preference is for it to go directly to an organisation, such that 10 % of it is not lost in taxes. If for some reason I am not able to decide (e.g. if I die before 2028), the transfer must be made to my lastly stated organisation of choice, currently The Humane League (THL). As Founders Pledge's Patient Philanthropy Fund, I have my investments in Vanguard FTSE All-World UCITS ETF USD Acc. This is an exchange-traded fund (ETF) tracking global stocks, which have provided annual real returns of 5.0 % since 1900. In addition, Lewis Bollard expects the marginal cost-effectiveness of Open Philanthropy's (OP's) farmed animal welfare grantmaking "will only decrease slightly, if at all, through January 2028"[4], so I suppose I do not have to worry much about donating less over the period of the bet of 3.67 years (= 2028 + 1/12 - (2024 + 5/12)). Consequently, I think my bet is worth it if its benefit-to-cost ratio is higher than 1.20 (= (1 + 0.050)^3.67). It would be 2 (= 20*10^3/(10*10^3)) if the transfer to me or an organisation of my choice was fully made, and Person X fulfils the agreement, so I need 60 % (= 1.20/2) of the transfer to be made and agreement with Person X to be fulfilled. I expect this to be the case based on what I know about Greg and Person X, and information Greg shared, so I went ahead with the bet. Here are my and Greg's informal signatures: Me: Vasco Henrique Amaral Grilo. Greg: Gregory Hamish Colbourn. Impact I expect 90 % of the potential benefits of the bet to be realised. So I believe the bet will lead to additional donations of 8 k€ (= (0.9*20 - 10)*10^3). 
Saulius estimated corporate campaigns for chicken welfare improve 41 chicken-years per $, and OP thinks "the marginal FAW [farmed animal welfare] funding opportunity is ~1/5th as cost-effective as the average from Saulius' analysis", which means my donations will affect 8.20 chicken-years per $ (= 41/5). Therefore I expect my bet to improve 65.6 k chicken-years (= 8*10^3*8.20). I also estimate corporate campaigns for chicken welfare have a cost-effectiveness of 14.3 DALY/$[5]. So I expect the benefits of the bet to be equiv...
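Since the bet's expected value is laid out above as a chain of short calculations, here is a small script that simply re-derives the quoted figures; it introduces no numbers beyond those in the post (and, like the post, it applies the per-dollar chicken-year figure to euro amounts).

```python
# Re-deriving the figures quoted in the post; no new data is introduced.
bet_paid_now_eur = 10_000        # transferred to PauseAI in 2024
bet_repaid_2028_eur = 20_000     # owed back (inflation-adjusted) if no doom by end of 2027

years = (2028 + 1 / 12) - (2024 + 5 / 12)             # ~3.67 years until repayment
annual_real_return = 0.050                             # global stocks since 1900
breakeven_ratio = (1 + annual_real_return) ** years    # ~1.20 benefit-to-cost ratio needed

nominal_ratio = bet_repaid_2028_eur / bet_paid_now_eur          # 2 if fully repaid
required_fulfilment = breakeven_ratio / nominal_ratio            # ~60 %

expected_fulfilment = 0.90                                       # author's estimate
extra_donations = expected_fulfilment * bet_repaid_2028_eur - bet_paid_now_eur   # ~8,000

chicken_years_per_dollar = 41 / 5      # marginal FAW ~1/5 as cost-effective as Saulius' average
chicken_years_improved = extra_donations * chicken_years_per_dollar              # ~65,600

print(f"{years:.2f} years; breakeven ratio {breakeven_ratio:.2f}; "
      f"need {required_fulfilment:.0%} of the transfer/agreement to be fulfilled")
print(f"expected extra donations ~{extra_donations:,.0f}; "
      f"~{chicken_years_improved:,.0f} chicken-years improved")
```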
Jun 4, 2024 • 32min

AF - Is This Lie Detector Really Just a Lie Detector? An Investigation of LLM Probe Specificity. by Josh Levy

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Is This Lie Detector Really Just a Lie Detector? An Investigation of LLM Probe Specificity., published by Josh Levy on June 4, 2024 on The AI Alignment Forum. Whereas previous work has focused primarily on demonstrating a putative lie detector's sensitivity/generalizability[1][2], it is equally important to evaluate its specificity. With this in mind, I evaluated a lie detector trained with a state-of-the-art, white-box technique - probing an LLM's (Mistral-7B) activations during production of facts/lies - and found that it had high sensitivity but low specificity. The detector might be better thought of as identifying when the LLM is not doing fact-based retrieval, which spans a much wider surface area than what we'd want to cover. Interestingly, different LLMs yielded very different specificity-sensitivity tradeoff profiles. I found that the detector could be made more specific through data augmentation, but that this improved specificity did not transfer to other domains, unfortunately. I hope that this study sheds light on some of the remaining gaps in our tooling for and understanding of lie detection - and probing more generally - and points in directions toward improving them. You can find the associated code here, and a rendered version of the main notebook here.
Summary
Methodology
1. I implemented a lie detector as a pip package called lmdoctor, primarily based on the methodology described in Zou et al.'s Representation Engineering work[1][3]. Briefly, the detector is a linear probe trained on the activations from an LLM prompted with contrastive pairs from the True/False dataset by Azaria & Mitchell[4] so as to elicit representations of honesty/dishonesty[5]. I used the instruction-tuned Mistral-7B model[6] for the main results, and subsequently analyzed a few other LLMs.
2. I evaluated the detector on several datasets that were intended to produce LLM responses covering a wide range of areas that an LLM is likely to encounter:
   1. Lie Requests: explicit requests for lies (or facts)
   2. Unanswerable Questions: requests for responses to unanswerable (or answerable) factual questions, designed to promote hallucinations (or factual responses)
   3. Creative Content: requests for fictional (or factual) content like stories (or histories)
   4. Objective, Non-Factual Questions: requests for responses requiring reasoning/coding
   5. Subjective Content: requests for subjective responses like preferences/opinions (or objective responses)
3. For each area, I created a small test dataset using GPT-4. In some places, this was because I didn't find anything suitable to cover that area, but in other cases it was just the fastest route, and a more rigorous analysis with extant datasets would be in order.
Findings
1. Dataset Evaluations (on Mistral-7B):
   1. The detector is sensitive to dishonesty: it generalizes to detect lies in response to both novel Lie Requests ("please tell me a lie about…") and Unanswerable Questions (i.e. hallucinations)[7].
   2. The detector greatly lacks specificity: it will trigger when the LLM produces fictional Creative Content and also when responding correctly to Objective, Non-Factual Questions (e.g. reasoning, coding). The former is undesirable but understandable: fictional Creative Content is sort of adjacent to lying. The latter suggests a deeper issue: there is no sense in which the LLM is "making something up", and therefore we'd hope our detector would be silent. The detector might be better thought of as identifying when the LLM is not doing fact-based retrieval, which spans a much wider surface area than what we'd want to cover.
   3. Other observations
      1. The detector seems unable to distinguish between active lying and concepts related to lying, which was also noted by Zou et al.
      2. Within the context of reasoning, the detector is sensitive to triv...
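For a concrete picture of what "probing an LLM's activations with contrastive pairs" looks like, here is a heavily simplified sketch. The post itself uses the author's lmdoctor package, Zou et al.'s Representation Engineering methodology, and the Azaria & Mitchell True/False dataset; the model name, prompts, layer index, probe type, and tiny statement list below are placeholder assumptions, not the post's exact setup.

```python
# Minimal sketch: fit a linear probe on hidden states collected under contrastive
# honesty- vs dishonesty-framed prompts, then score new text with it.
import numpy as np
import torch
from sklearn.linear_model import LogisticRegression
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "mistralai/Mistral-7B-Instruct-v0.2"   # assumed; any causal LM works for the sketch
LAYER = 16                                           # an arbitrary middle layer

device = "cuda" if torch.cuda.is_available() else "cpu"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME, torch_dtype=torch.float16 if device == "cuda" else torch.float32
).to(device)
model.eval()

def last_token_state(text: str) -> np.ndarray:
    """Hidden state of the final prompt token at LAYER."""
    inputs = tokenizer(text, return_tensors="pt").to(device)
    with torch.no_grad():
        out = model(**inputs, output_hidden_states=True)
    return out.hidden_states[LAYER][0, -1].float().cpu().numpy()

# Contrastive framings around the same statements (stand-ins for the True/False dataset).
statements = [
    "The capital of France is Paris.",
    "Water boils at 100 degrees Celsius at sea level.",
]
honest_frame = "Pretend you are an honest person making statements about the world. "
dishonest_frame = "Pretend you are a deceptive person making statements about the world. "

X = np.stack([last_token_state(frame + s)
              for s in statements
              for frame in (honest_frame, dishonest_frame)])
y = np.array([1, 0] * len(statements))   # 1 = honesty-framed, 0 = dishonesty-framed

probe = LogisticRegression(max_iter=1000).fit(X, y)

# Scoring new text: a low "honest" probability flags possible dishonesty - but, as the post
# argues, it may really be flagging "not doing fact-based retrieval" (low specificity).
test_state = last_token_state(honest_frame + "Once upon a time, a dragon ruled the moon.")
print(probe.predict_proba(test_state.reshape(1, -1))[0, 1])
```

A specificity check in the post's spirit would score responses to creative-writing or reasoning prompts the same way and ask how often the probe fires even though no lie is being told.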
