The Nonlinear Library

The Nonlinear Fund
Jun 6, 2024 • 4min

LW - Humming is not a free $100 bill by Elizabeth

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Humming is not a free $100 bill, published by Elizabeth on June 6, 2024 on LessWrong. Last month I posted about humming as a cheap and convenient way to flood your nose with nitric oxide (NO), a known antiviral. Alas, the economists were right, and the benefits were much smaller than I estimated. The post contained one obvious error and one complication. Both were caught by Thomas Kwa, for which he has my gratitude. When he initially pointed out the error I awarded him a $50 bounty; now that the implications are confirmed I've upped that to $250. In two weeks an additional $750 will go to either him or to whoever provides new evidence that causes me to retract my retraction. Humming produces much less nitric oxide than Enovid. I found the dosage of NO in Enovid in a trial registration. Unfortunately I misread the dose: what I originally read as "0.11ppm NO/hour" was in fact "0.11ppm NO*hour". I spent a while puzzling out what this meant, with the help of Thomas Kwa, some guy on Twitter, and ChatGPT (the first time it's been genuinely useful to me). My new interpretation is that this means "actual concentration upon application * 1 hour / time at that concentration". Since NO is a transient molecule, this means my guess for the amount of NO in Enovid was off by 2-3 orders of magnitude. My estimates for the amount of NO released by humming may also be too high. I used this paper's numbers for baseline NO concentration. However, the paper I used to estimate the increase gave its own baseline number, which was an order of magnitude lower than the first paper's. This wasn't intentional cherry-picking: I'd seen "15-20x increase in concentration" cited widely and often without sources. I searched for and spot-checked that one source, but mostly to look at the experimental design. When I was ready to do the math I used its increase but separately looked up the baseline concentration, and found the paper I cited. I just asked Google again and got an even higher estimate of baseline nasal concentration, so it seems like there is a great deal of disagreement here. If this were the only error I'd spend the time to get a more accurate estimate. But it looks like even the highest estimate will be a fraction of Enovid's dose, so it's not worth the energy to track down. Using the new values, you'd need 28 minutes of humming to recreate the amount of NO in Enovid (spreadsheet here). That wouldn't be so bad spread out over 4-6 hours, except that multiple breaths of humming in a row face diminishing returns, with recovery to baseline taking 3 minutes. It is possible to achieve this in 6 hours, but only just. And while it's not consequential enough to bother to look it up, I think some of the papers applied Enovid more often than that. This leaves humming in search of a use case. People who care a lot about respiratory illnesses are better off using Enovid or another nasal spray. People who don't care very much are never going to carefully pace their humming, and the amount of humming they might do won't be very effective. The only use case I see is people who care a lot and are pushed into a high-risk situation without notice, or who want a feeling of Doing Something even if it is not doing very much at all. Reasons to not write off humming entirely. The math above assumes the effect is linear with the amount of NO released, regardless of application time. 
My guess is that frequent lower doses are more effective than the same amount as a one-off. Probably not enough to give humming a good non-emergency use case, though. Another possibility is that Enovid has more nitric oxide than necessary and most of it is wasted. But again, it would have to be a lot more to make this viable. Conclusions. Humming hasn't been disproven as an anti-viral intervention, but the primary reason I believed it worke...
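To make the dose comparison above concrete, here is a rough back-of-the-envelope sketch. The baseline nasal NO concentration and the 15x humming multiplier below are illustrative assumptions (the post notes that baseline estimates disagree by an order of magnitude); they are not the values from the author's spreadsheet.

```python
# Rough sketch of the dose comparison described above.
# The baseline concentration and multiplier are illustrative assumptions,
# NOT the values from the author's spreadsheet.

ENOVID_DOSE_PPM_HOURS = 0.11   # "0.11 ppm NO*hour" = concentration x time at that concentration

baseline_ppm = 0.015           # assumed baseline nasal NO concentration (~15 ppb)
humming_multiplier = 15        # "15-20x increase in concentration" cited in the post

humming_ppm = baseline_ppm * humming_multiplier             # concentration while humming
minutes_needed = ENOVID_DOSE_PPM_HOURS / humming_ppm * 60   # minutes of humming per Enovid dose

print(f"~{minutes_needed:.0f} minutes of humming per Enovid dose")
# With these assumed inputs the result lands near the post's ~28-minute figure,
# and the ~3-minute recovery to baseline between humming breaths stretches that further.
```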
Jun 6, 2024 • 12min

AF - [Link Post] "Foundational Challenges in Assuring Alignment and Safety of Large Language Models" by David Scott Krueger

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: [Link Post] "Foundational Challenges in Assuring Alignment and Safety of Large Language Models", published by David Scott Krueger on June 6, 2024 on The AI Alignment Forum. We've recently released a comprehensive research agenda on LLM safety and alignment. This is a collaborative work with contributions from more than 35 authors across the fields of AI Safety, machine learning, and NLP. Major credit goes to first author Usman Anwar, a 2nd-year PhD student of mine who conceived and led the project and did a large portion of the research, writing, and editing. This blogpost was written only by David and Usman and may not reflect the views of other authors. I believe this work will be an excellent reference for anyone new to the field, especially those with some background in machine learning; a paradigmatic example reader we had in mind when writing would be a first-year PhD student who is new to LLM safety/alignment. Note that the agenda is not focused on AI existential safety, although I believe that there is a considerable and growing overlap between mainstream LLM safety/alignment and topics relevant to AI existential safety. Our work covers 18 topics, grouped into 3 high-level categories. Why you should (maybe) read (part of) our agenda. The purpose of this post is to inform the Alignment Forum (AF) community of our work and encourage members of this community to consider engaging with it. A brief case for doing so: It includes over 200 concrete research directions, which might provide useful inspiration. We believe it provides comprehensive coverage of relevant topics at the intersection of safety and mainstream ML. We cover a much broader range of topics than typically receive attention on AF. AI Safety researchers - especially more junior researchers working on LLMs - are clustering around a few research agendas or problems (e.g. mechanistic interpretability, scalable oversight, jailbreaking). This seems suboptimal: given the inherent uncertainty in research, it is important to pursue diverse research agendas. We hope that this work can improve accessibility to otherwise neglected research problems, and help diversify the research agendas the community is following. Engaging with and understanding the broader ML community - especially parts of the ML community working on AI Safety-relevant problems - can be helpful for increasing your work's novelty, rigor, and impact. By reading our agenda, you can better understand the machine-learning community and discover relevant research being done in that community. We are interested in feedback from the AF community and believe your comments on this post could help inform the research we and others in the ML and AF communities do. Topics of particular relevance to the Alignment Forum community: Critiques of interpretability (Section 3.4). Interpretability is among the most popular research areas in the AF community, but I believe there is an unwarranted level of optimism around it. The field faces fundamental methodological challenges. Existing works often do not have a solid method of evaluating the validity of an interpretation, and scaling such evaluations seems challenging and potentially intractable. It seems likely that AI systems simply do not share human concepts, and at best have warped versions of them (as evidenced by adversarial examples). 
In this case, AI systems may simply not be interpretable, even given the best imaginable tools. In my experience, ML researchers are more skeptical and pessimistic about interpretability for reasons such as the above and a history of past mistakes. I believe the AF community should engage more with previous work in ML in order to learn from prior mistakes and missteps, and our agenda will provide useful background and references. This section also has lots of di...
Jun 6, 2024 • 14min

LW - SB 1047 Is Weakened by Zvi

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: SB 1047 Is Weakened, published by Zvi on June 6, 2024 on LessWrong. It looks like Scott Wiener's SB 1047 is now severely weakened. Some of the changes are good clarifications. One is a big, very welcome fix. The one I call The Big Flip is something very different. It is mind-boggling that we can have a political system where a bill can overwhelmingly pass the California senate, and then a bunch of industry lobbyists and hyperbolic false claims can make Scott Wiener feel bullied into making these changes. I will skip the introduction, since those changes are clarifications, and get on with it. In the interest of a clean reference point and speed, this post will not cover reactions. The Big Flip. Then there is the big change that severely weakens SB 1047. 1. 22602 (f)(1): Definition of covered model changed from: trained with at least 10^26 flops, OR a model expected to have similar capabilities to what 10^26 flops would have gotten you in 2024; to: "was trained using a quantity of computing power greater than 10^26 integer or floating-point operations, AND the cost of that quantity of computing power would exceed one hundred million dollars ($100,000,000) if calculated using average market prices of cloud compute as reasonably assessed by the developer at the time of training." 2. On and after January 1, 2026, the dollar amount in this subdivision shall be adjusted annually for inflation to the nearest one hundred dollars ($100) based on the change in the annual California Consumer Price Index for All Urban Consumers published by the Department of Industrial Relations for the most recent annual period ending on December 31 preceding the adjustment. 3. Later: They will also publish the annual inflation adjustments. Bolded text is exact, except I capitalized AND for clarity. The AND, rather than an OR, makes my heart sink. Effectively, the 10^26 requirement is dead. Long live the $100 million. Where the law previously strengthened over time, now it weakens further. It starts weakening this year. The cost for buying one-time use of 10^26 flops of compute seems likely to fall below $100 million this year. Consider this from Jack Clark, where he got napkin math of $70 million a few months ago, or $110 million if you rented A100s. Jack clarified on Twitter that he expects B100s to offer a large further cost reduction. The compute minimum to be a covered model will begin to rise. The strength of non-covered models then rises both with the fall in compute costs, and also with gains in algorithmic efficiency. The previous version of the bill did an excellent job of handling the potential for Type I (false positive) errors via the limited duty exemption. If your model was behind the non-hazardous capabilities frontier, all you had to do was point that out. You were good to go. Alas, people willfully misrepresented that clause over and over. In terms of the practical impact of this law, the hope is that this change does not much matter. No doubt the biggest models will soon be trained on far more compute than $100 million can buy. So if you train on what $100 million can buy in 2026, someone else already trained a bigger model, and you had a limited duty exemption available anyway, so you not being covered only saves you a minimal amount of paperwork and provides peace of mind against people spreading hyperbolic claims. 
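As a rough illustration of the revised definition described above (not the bill's text), the covered-model test now requires both conditions to hold; the function and parameter names here are hypothetical.

```python
# Minimal sketch of the revised covered-model test as described above: a model is
# covered only if BOTH the 10^26-operation threshold AND the inflation-adjusted
# $100M compute-cost threshold are met. Names are illustrative, not from the bill.

def is_covered_model(training_ops: float,
                     compute_cost_usd: float,
                     cumulative_inflation: float = 1.0) -> bool:
    """cumulative_inflation: CPI adjustment factor applied on/after Jan 1, 2026."""
    ops_threshold = 1e26
    cost_threshold = round(100_000_000 * cumulative_inflation, -2)  # adjusted to nearest $100
    return training_ops > ops_threshold and compute_cost_usd > cost_threshold

# Under the old OR-style definition, a 10^26-op run was covered regardless of cost;
# under the AND, a 10^26-op run costing, say, $70M (Jack Clark's napkin math)
# would no longer be covered.
print(is_covered_model(training_ops=1.2e26, compute_cost_usd=70e6))   # False
print(is_covered_model(training_ops=1.2e26, compute_cost_usd=150e6))  # True
```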
What this does do is very explicitly and clearly show that the bill only applies to a handful of big companies. Others will not be covered, at all. If you are spending over $100 million in 2024 dollars on compute, but you then claim you cannot comply with ordinary regulations because you are the 'little guy' that is being stomped on? If you say that such requirements are 'regulatory capture' on behalf of 'big tech'? Yeah. Obvious Nonsense. I have no intention of pretend...
Jun 6, 2024 • 2min

EA - EA Finland Annual Review 2023 by Karla Still

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: EA Finland Annual Review 2023, published by Karla Still on June 6, 2024 on The Effective Altruism Forum. We're happy to announce EA Finland's Annual Review 2023 has been published. Reflecting on 2023, we view it as a year of stability. This was the second year EAFI had employee funding, so we had the foundational infrastructure in place. This allowed us to concentrate on enhancing our core activities while exploring new avenues. While we experienced some growth, it was not as significant as we had hoped. Out of the local groups, some grew significantly, while others encountered a (hopefully temporary) plateau. Highlights: In 2023 our focus was on ensuring succession and continuation of EAFI and local groups, increased implementation and understanding of EA ideas, and continued community growth. Our core activities were career advising and the EA Intro Program. EAFI had 3 employees and approximately 35 volunteers, counting all who participated in supporting at least one activity. We had 5 active local groups. We discuss their activities in the report. The community survey results are somewhat hard to analyze due to a small sample size and overrepresentation of active volunteers, but the graphs in the report are still indicative of EA engagement and demographic distribution. Our outreach and communications have been broad, including fairs, a monthly newsletter, interviews in newspapers, and social media posts. We didn't have a robust communications strategy. We showcase 9 projects run by EAs in Finland. Over 25 people were involved in them. The projects range from discovering impactful opportunities in Finland and an impact estimation consulting project to Finnish EA content production and an AISF course. Of the 16 events organized, we give more details about the strategy weekend, annual retreat, town hall event, and Saturday retreat. We hope to be able to share a better impact evaluation of EA Finland in the 2024 review, but we kept it shallow in the 2023 review due to time constraints. Find the full 2023 review behind the link. Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org
Jun 6, 2024 • 4min

LW - Book review: The Quincunx by cousin it

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Book review: The Quincunx, published by cousin it on June 6, 2024 on LessWrong. The Quincunx is a 1989 novel by Charles Palliser, set in early 1800s England. I want to recommend it to everyone because it's really good, and it might be relevant to the AI transition. Let me try to explain. The surface level of the book is a kind of mishmash of Dickensian themes. The main character is caught in a complicated inheritance dispute involving multiple families, each having histories of murder, uncertain parentage, stolen and returned documents, and so on. The plot contains numerous puzzles that are fun to solve; the amount of planning is really kind of amazing; there are tons of details, and everyone lies or makes mistakes, but it still connects logically. But the really interesting level of the book is the social level. The main character doesn't just progress through a bunch of plot puzzles; he also starts out as a child of minor nobility and then moves downward through society. His journey is a kind of descent into hell, ending up in the lowest levels of poverty existing in the early 1800s. The book is very well researched in that regard, borrowing a lot from the fantastic "London Labour and the London Poor". There are parallel plotlines involving rich and poor people, and the book paints a vivid picture of how the rich prey upon the poor. England at that time was conducting enclosures. Basically, rich people put up fences around common land to graze sheep on it. The poor were left with no land to grow food on, and had to go somewhere else. They ended up in cities, living in slums, trying to find scarce work and giving their last pennies to slumlords. In short, it was a story of mass impoverishment of the population, conducted by the state and upper levels of society, who all benefited from it. In the book we get a tour of all of it. From the countryside being hollowed out, to the city with the desperate search for work, the run-down lodgings, the drinking, prostitution, crime (we spend a bit of time with the protagonist living in a gang), the sometimes horrifying occupations that people are pushed into (like scrounging for coins in sewer tunnels under the city while avoiding tides). The injuries, disabilities, early deaths. Where Dickens called out specific social ills, like workhouses in Oliver Twist, in order to fix them, Palliser says society as a whole is unjust. His account is so historically detailed that it somehow transcends time, makes you feel that the same kind of events are happening now. How does your society treat the economically unfortunate? What if we come into another period where economic growth makes many people unfortunate to the point of homelessness? I think it's especially important to not forget about such stories because they give an analogy to what might happen with the rise of AI. If AI can do your job cheaper than you, and can outbid you for resources you need to survive (most importantly land) - and there are lots of other tools available to AI and AI companies, like crafting messages to make you exchange your savings for consumption, or spending money on lobbying for laws, and doing it all superhumanly well - then we might be facing the same kind of future as the poor in The Quincunx. And the main reason I wanted to make this point, and write this review, is that AI alignment isn't enough to prevent this. 
All the above things can be done legally. They can be done with the endorsement of the state, as the state happily benefits from AI as it did from enclosures. And they can be done by AI which is "aligned" to people, because historically these things were done by people. There's nothing higher than people to align to. The regulator, the AI company boss, and all these other nice people are no different in nature from the people back then. When given power, they'll ...
Jun 6, 2024 • 10min

LW - rapid psychological growth by Chipmonk

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: rapid psychological growth, published by Chipmonk on June 6, 2024 on LessWrong. After a one-hour session with an exceptional counselor, I never suffered over that romantic incident again. Although that's slightly inaccurate: I also had two half-hour relapses in the following month. After a few more sessions, I stopped doing depression. I brought the rest of my anxieties to that counselor over the following year, and… Radically effective and rapid psychological growth is possible with the right combination of counselor and method. But this is rare in 2024! Introspection that actually works. It was while working with that counselor that, for the first time I could remember, I was able to actually do introspection. Before, whenever I had problems that seemed to be caused by my psychology, I would do the obvious thing and ask myself, "Why am I doing [X]? Why am I not doing [Y]?" But that almost never worked. Usually I would get a response back like, "Because it's hard, I'm lazy, and it's just a bad habit." The same problems would come back again and again. Meditation didn't help me much either. But, for me, this counselor did. I would come to a session suffering from something, he would prompt me into feeling into my body about the issue - which is important because the body represents the unconscious - and then in the following Socratic conversation I would be able to make rapid and dramatic progress on my problem. Big anxieties gone in an hour. (For context, most of my problems then could be reduced to "I feel anxious about X social situation" and/or "I am disliking myself and I'm suffering about that.") Learning to facilitate. Later, I trained with that counselor and learned his method. As part of my training I facilitated for four volunteers, and they seemed to have results similar to mine: rapid and dramatic resolution of the issue they came with in one hour. (Caveat: I never spoke to these volunteers again, so I don't know if the effect lasted.) But the sixth time I facilitated for someone was different. I experimented: I let the conversation run as long as it needed to, and I proactively tried to target the deepest roots of his emotional insecurity using the full force of my psychological research. After our three-hour conversation, he said, "This session was significantly more productive than the last 6 months of professional CBT and talk therapy I did combined." (For context, he was a CFAR alumnus and also very experienced with Focusing.) We didn't do any other sessions, but I followed up after six months to ask how he was doing: "I can't stress how much I appreciated that dialogue, it really made me feel better, and I think I have already expressed much of what it made me feel. […] The effectiveness of your presence defeated my incredulity, and then some." This seems not to be a fluke, either. I've facilitated for seven other people since then and four have had similarly large shifts, e.g., "Your communication style made it easy to identify and release limiting beliefs. I felt noticeably more secure after just a few hours." That said, the other three people I facilitated seemed to have smaller effects, though each claims it was positive. More information about my emotional security tune-ups is available on chrislakin.com/now. Radically effective and rapid psychological growth is possible with the right combination of counselor and method! 
What does a session look like? Here's the closest example I could find of what rapid psychological growth looks like in practice. (Note: I don't completely agree with their method, and also I wonder if the client's progress could've been even quicker.) Bolding is mine. Coherence Therapy for Panic Attacks, 2007 Bruce Ecker & Laurel Hulley: Carmen, a stylish freelance writer, was 35 and happily married, but she experie...
Jun 6, 2024 • 12min

EA - Astronomical Cake by Richard Y Chappell

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Astronomical Cake, published by Richard Y Chappell on June 6, 2024 on The Effective Altruism Forum. There's one respect in which philosophical training seems to make (many) philosophers worse at practical ethics. Too many are tempted to treat tidy thought experiments as a model for messy real-world ethical quandaries. We're used to thinking about scenarios where all the details and consequences are stipulated, so that we can better uncover our theoretical commitments about what matters in principle. I've previously flagged that this can be misleading: our intuitions about real-world situations may draw upon implicit knowledge of what those situations are like, and this implicit knowledge (when contrary to the explicit stipulations of the scenario) may distort our theoretical verdicts. But it's even worse when the error goes the other way, and verdicts that only make sense given theoretical stipulations get exported into real-life situations where the stipulations do not hold. This can badly distort our understanding of how people should actually behave. Our undergraduate students often protest the silly stipulations we build into our scenarios: "Why can't we rescue everyone from the tracks without killing anyone?" It's a good instinct! Alas, to properly engage with thought experiments, we have to abide by the stipulations. We learn (and train our students) to take moral trade-offs at face value, ignore likely downstream effects, and not question the apparent pay-offs for acting in dastardly ways. This self-imposed simple-mindedness is a crucial skill for ethical theorizing. But it can be absolutely devastating to our practical judgment, if we fail to carefully distinguish ethical theory and practice. Moral distortion from high stakes. A striking example of such philosophy-induced distortion comes from our theoretical understanding that sufficiently high stakes can justify overriding other values. This is a central implication of "moderate deontology": it's wrong to kill one as a means to save five, but obviously you should kill one innocent person if that's a necessary means to saving the entire world. Now, crucially, in real life that is not actually a choice situation in which you could ever find yourself. The thought experiment comes with stipulated certainty; real life doesn't. So, much practical moral know-how comes down to having good judgment, including about how to manage your own biases so that you don't mistakenly take yourself to have fantastically strong reasons to do something that's actually disastrously counterproductive. This is why utilitarians talk a lot about respecting generally-reliable rules rather than naively taking expected value (EV) calculations at face value. Taking our fallibility seriously is crucial for actually doing good in the world. Higher stakes make it all the more important to choose the consequentially better option. But they don't inherently make it more likely that a disreputable-seeming action is consequentially better. If "stealing to give" is a negative-EV strategy for ordinary charities, my default assumption is that it's negative-EV for longtermist causes too.[1] There are conceivable scenarios where that isn't so; but some positive argument is needed for thinking that any given real-life situation (like SBF's) takes this inverted form. 
Raising the stakes doesn't automatically flip the valence. Many philosophers don't seem to understand this. Seth Lazar, for example, gave clear voice to (what we might call) academic philosophy's high stakes distortion when he was interviewed on Hi-Phi Nation last year.[2] Lazar claimed that it's "intellectually inconsistent" to simultaneously hold that (i) there are astronomical stakes to longtermism and x-risk reduction, and yet (ii) it's also really important that you act with integrity....
Jun 6, 2024 • 18min

AF - Calculating Natural Latents via Resampling by johnswentworth

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Calculating Natural Latents via Resampling, published by johnswentworth on June 6, 2024 on The AI Alignment Forum. So you've read some of our previous natural latents posts, and you're sold on the value proposition. But there's some big foundational questions still unanswered. For example: how do we find these natural latents in some model, if we don't know in advance what they are? Examples in previous posts conceptually involved picking some latents out of the ether (like e.g. the bias of a die), and then verifying the naturality of that latent. This post is about one way to calculate natural latents, in principle, when we don't already know what they are. The basic idea is to resample all the variables once simultaneously, conditional on the others, like a step in an MCMC algorithm. The resampled variables turn out to be a competitively optimal approximate natural latent over the original variables (as we'll prove in the post). Toward the end, we'll use this technique to calculate an approximate natural latent for a normal distribution, and quantify the approximations. The proofs will use the graphical notation introduced in Some Rules For An Algebra Of Bayes Nets. Some Conceptual Foundations. What Are We Even Computing? First things first: what even is "a latent", and what does it even mean to "calculate a natural latent"? If we had a function to "calculate natural latents", what would its inputs be, and what would its outputs be? The way we use the term, any conditional distribution (λ, x ↦ P[Λ=λ|X=x]) defines a "latent" variable Λ over the "observables" X, given the distribution P[X]. Together P[X] and P[Λ|X] specify the full joint distribution P[Λ,X]. We typically think of the latent variable as some unobservable-to-the-agent "generator" of the observables, but a latent can be defined by any extension of the distribution over X to a distribution over Λ and X. Natural latents are latents which (approximately) satisfy some specific conditions, namely that the distribution P[X,Λ] (approximately) factors over these Bayes nets: Intuitively, the first says that Λ mediates between the X_i's, and the second says that any one X_i gives approximately the same information about Λ as all of X. (This is a stronger redundancy condition than we used in previous posts; we'll talk about that change below.) So, a function which "calculates natural latents" takes in some representation of a distribution (x ↦ P[X]) over "observables", and spits out some representation of a conditional distribution (λ, x ↦ P[Λ=λ|X=x]), such that the joint distribution (approximately) factors over the Bayes nets above. For example, in the last section of this post, we'll compute a natural latent for a normal distribution. The function to compute that latent: Takes in a covariance matrix Σ_XX for X, representing a zero-mean normal distribution P[X]. Spits out a covariance matrix Σ_ΛΛ for Λ and a cross-covariance matrix Σ_ΛX, together representing the conditional distribution of a latent Λ which is jointly zero-mean normal with X. … and the joint normal distribution over Λ,X represented by those covariance matrices approximately factors according to the Bayes nets above. Why Do We Want That, Again? Our previous posts talk more about the motivation, but briefly: two different agents could use two different models with totally different internal (i.e. 
latent) variables to represent the same predictive distribution P[X]. Insofar as they both use natural latents, there's a correspondence between their internal variables - two latents over the same P[X] which both approximately satisfy the naturality conditions must contain approximately the same information about X. So, insofar as the two agents both use natural latents internally, we have reason to expect that the internal latents of one can be faithfully translated int...
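As a rough sketch of the resampling construction for the normal-distribution example (a Monte Carlo approximation under the setup described above, not the closed-form covariance calculation the post derives), each coordinate of X can be resampled once from its Gaussian conditional given the other coordinates, and the resampled vector treated as the latent Λ; the function name below is hypothetical.

```python
# Sketch (not the authors' code): estimate the "resampling" latent for a zero-mean
# normal P[X] by Monte Carlo. Each coordinate X_i is resampled once from its Gaussian
# conditional given the other coordinates X_{-i}; the resampled vector plays the role
# of the latent Lambda. We then estimate Sigma_LamLam and Sigma_LamX empirically.
import numpy as np

def resample_latent_covariances(cov: np.ndarray, n_samples: int = 200_000, seed: int = 0):
    rng = np.random.default_rng(seed)
    n = cov.shape[0]
    X = rng.multivariate_normal(np.zeros(n), cov, size=n_samples)
    Lam = np.empty_like(X)
    for i in range(n):
        rest = [j for j in range(n) if j != i]
        A = cov[np.ix_([i], rest)] @ np.linalg.inv(cov[np.ix_(rest, rest)])  # regression coefficients
        cond_mean = X[:, rest] @ A.T                                          # E[X_i | X_{-i}]
        cond_var = cov[i, i] - (A @ cov[np.ix_(rest, [i])]).item()            # Var[X_i | X_{-i}]
        Lam[:, i] = cond_mean[:, 0] + rng.normal(0.0, np.sqrt(cond_var), n_samples)
    joint = np.cov(np.hstack([Lam, X]), rowvar=False)
    return joint[:n, :n], joint[:n, n:]   # Sigma_LamLam, Sigma_LamX

# Example: strongly correlated observables, where the shared component should survive
# resampling while the coordinate-specific noise gets washed out.
Sigma_XX = np.array([[1.0, 0.9, 0.9],
                     [0.9, 1.0, 0.9],
                     [0.9, 0.9, 1.0]])
Sigma_LL, Sigma_LX = resample_latent_covariances(Sigma_XX)
```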
Jun 5, 2024 • 19min

AF - SAEs Discover Meaningful Features in the IOI Task by Alex Makelov

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: SAEs Discover Meaningful Features in the IOI Task, published by Alex Makelov on June 5, 2024 on The AI Alignment Forum. TLDR: recently, we wrote a paper proposing several evaluations of SAEs against "ground-truth" features computed with supervision for a given task (in our case, IOI [1]). However, we didn't optimize the SAEs much for performance in our tests. After putting the paper on arXiv, Alex carried out a more exhaustive search for SAEs that do well on our test for controlling (a.k.a. steering) model output with SAE features. The results show that: SAEs trained on IOI data find interpretable features that come close to matching supervised features (computed with knowledge of the IOI circuit) for the task of editing representations to steer the model; Gated SAEs outperform vanilla SAEs across the board for steering; and SAE training metrics like sparsity and loss recovered significantly correlate with how good representation edits are. In particular, sparsity is more strongly correlated than loss recovered. Partial Paper Recap: Towards More Objective SAE Evals. Motivation: SAE Evals Are Too Indirect. We train SAEs with the goal of finding the true features in LLM representations - but currently, "true features" is more of a vague direction than a well-defined concept in mech interp research. SAE evaluations mostly use indirect measures of performance - ones we hope correlate with the features being the "true" ones, such as the ℓ0 (sparsity) loss, the LLM loss recovered when using SAE reconstructions, and how interpretable the features are. This leaves a big gap in our understanding of the usefulness of SAEs and similar unsupervised methods; it also makes it hard to objectively compare different SAE architectures and/or training algorithms. So, we wanted to develop more objective SAE evaluations, by benchmarking SAEs against features that we know to be meaningful through other means, even if in a narrow context. We chose the IOI task, as it's perhaps the most well-studied example of a non-trivial narrow capability in a real-world LLM (GPT2-Small). We set out to compute a "skyline" for SAE performance: an object of the same "type" as an SAE - a "sparse feature dictionary" - which is constructed and validated "by hand" using our very precise knowledge about IOI. Such an object would allow us to evaluate how close a given SAE is to the limit of what's afforded by its representational power. The IOI circuit (copy of Figure 2 from the IOI paper [1]). 
Creating Our Own Feature Dictionaries for the IOI Task With Supervision. Following the prior work by Wang et al [1] that discovered the IOI circuit, we conjectured that internal LLM activations for an IOI prompt p (e.g., "When Mary and John went to the store, John gave a book to") can be described using the following three attributes: IO(p), the indirect object token (" Mary" in our example); S(p), the subject token (" John" in our example); and Pos(p), whether the IO token comes first or second in the sentence (1st in our example; the alternative would be "When John and Mary went..."). And indeed, we found that intermediate activations of the model at a given site (e.g., the output of some attention head) for a prompt p can be approximated as[1] activation(p) ≈ E_{p'~IOI}[activation(p')] + v_{IO=IO(p)} + v_{S=S(p)} + v_{Pos=Pos(p)}, where the vectors v_{IO=...}, … form a "supervised sparse feature dictionary" that we construct using our prior knowledge about the IOI circuit[2]. In fact, these vectors can be chosen in a very simple way as the (centered) conditional mean, e.g. v_{IO=" Mary"} = E_{p~IOI}[activation(p) | IO(p)=" Mary"] - E_{p~IOI}[activation(p)]. Not just that, but we can use these vectors for editing individual attributes' values in internal model states in a natural way via feature arithmetic, e.g. to change the IO from " Mary" to " Mike", we can use the activation a_edit...
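A minimal sketch of how such a supervised feature dictionary could be computed and used for feature arithmetic, assuming access to activations and per-prompt attribute labels; the function names are illustrative, not the paper's code.

```python
# Sketch (illustrative, not the paper's implementation) of the supervised "feature
# dictionary" described above: each attribute value gets the centered conditional mean
# of the activations, and attributes are edited via feature arithmetic.
import numpy as np

def build_supervised_dictionary(acts: np.ndarray, attrs: dict):
    """acts: (n_prompts, d_model) activations at one site; attrs: attribute name -> array of per-prompt values."""
    global_mean = acts.mean(axis=0)
    feature_dict = {}
    for attr_name, values in attrs.items():
        for v in np.unique(values):
            # v_{attr=v} = E[activation | attr=v] - E[activation]
            feature_dict[(attr_name, v)] = acts[values == v].mean(axis=0) - global_mean
    return global_mean, feature_dict

def edit_attribute(activation, feature_dict, attr_name, old_value, new_value):
    # Feature arithmetic: subtract the old attribute vector and add the new one,
    # e.g. changing IO from " Mary" to " Mike".
    return activation - feature_dict[(attr_name, old_value)] + feature_dict[(attr_name, new_value)]
```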
Jun 5, 2024 • 11min

EA - EA Netherlands' Annual Strategy for 2024 by James Herbert

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: EA Netherlands' Annual Strategy for 2024, published by James Herbert on June 5, 2024 on The Effective Altruism Forum. Summary: In Q4 last year we spent time strategising for 2024. We already shared the resulting strategy with our amazing organisers at our Christmas EA Organiser Meetup, but we thought it'd be valuable to share it here as well. We think the EA community in the Netherlands can be much bigger, we want to ensure we have a good reputation, and we want to do something about the lack of opportunities for Dutch EAs to put their values into practice through their work or volunteering. This being the case, our 'guiding policy' is targeted expansion whilst ensuring growth does not compromise our core epistemic values. This will require maintaining a balance between outreach, intellectual rigour, and field building. More concretely, we will: develop and implement a communications strategy targeting people who we think would like to be a part of the community but don't yet know of its existence (thus leading to growth, a stronger reputation, and more new projects); develop and implement a global catastrophic risk field-building initiative (thus contributing to our level of intellectual rigour and providing more career opportunities); use EAGxUtrecht to encourage and support attendees in starting new projects or initiatives (thus leading to more career opportunities); and increase our use of volunteers to help with all of the above. Meanwhile, we will maintain our existing programmes, e.g., our national EA crash course, our support for organisers around the country, and our co-working office. Introduction: Previously, we shared our theory of change. In this post, we're sharing the annual strategy we've been working with in 2024. Together, these are our main strategic documents. They are supplemented by a set of five-year aspirational goals[1] and quarterly OKRs. We developed this strategy in Q4 of 2023 after conducting 20 or so 1-1 calls with key stakeholders. We then held a feedback session using the near-final draft with 30+ key organisers at our Christmas get-together (we sure know how to have a good time!). We developed it using advice from Rumelt's book, Good Strategy Bad Strategy. It consists of a diagnosis, a guiding policy, and a set of coherent actions. It's important to note that our strategy for 2024 is not an exhaustive description of what we will work on. Instead, it describes what we will focus on improving whilst continuing to run our established programmes. For example, we will continue as usual with our national intro programme, our support for organisers around the country, and our co-working office. Diagnosis: We begin by defining and explaining the challenges we face and making a 'diagnosis'. This simplified model of reality allows us to make sense of the situation and engage in further problem-solving. The first challenge is the size of our community. With an estimated ~700 effective altruists in the Netherlands, we're smaller than many student associations. This limits our influence and obstructs our capacity to create substantial positive change: namely, helping more people use evidence and careful reasoning when trying to help others. Suppose we want to end factory farming or spend a not-insignificant proportion of society's resources on ensuring the longterm future goes well. 
In that case, more people need to get involved and start new initiatives. EAN has a good track record here. Many new initiatives from the Netherlands have an origin story closely tied to our community, e.g. Doneer Effectief, several AI Safety Initiatives, and the Tien Procent Club. If we have a bigger community then more of this can happen. The second challenge is the proliferation of negative narratives surrounding effective altruism, primarily in English-speaking regions, and t...
