The Nonlinear Library: LessWrong

The Nonlinear Fund
Jul 31, 2024 • 2min

LW - If You Can Climb Up, You Can Climb Down by jefftk

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: If You Can Climb Up, You Can Climb Down, published by jefftk on July 31, 2024 on LessWrong. A few weeks ago Julia wrote about how we approach kids climbing: The basics: 1. Spot the child if they're doing something where a fall is likely. 2. Don't encourage or help the child to climb something that's beyond their ability to do on their own. 3. If they don't know how to get down, give advice rather than physically lifting them down. 4. Don't allow climbing in some places that are too dangerous. I was thinking about this some when I was at the park with Nora (3y) a few days ago. She has gotten pretty interested in climbing lately, and this time she climbed up the fence higher than I'd seen her go before. If I'd known she'd climb this high, I would have spotted her. She called me over, very proud, and wanted me to take a picture so that Julia could see too. She asked me to carry her down, and I told her I was willing to give her advice and spot her. She was willing to give this a try, but as she started to go down, some combination of being scared and the thin wire of the fence being painful was too much, and she returned to the thicker horizontal bars. We tried this several times, with her getting increasingly upset. After a bit Lily came over and tried to help, but was unsuccessful. Eventually I put my hands on Nora's feet and, with a mix of guiding and (not ideal) supporting them, helped her climb down to the lower bar. She did the rest herself from there, something she's done many times. This took about fifteen minutes and wasn't fun for any of us: Nora, me, other people at the playground. But over the course of the rest of the day I brought it up several times, trying to get her to think it through before she climbs higher than she would enjoy climbing down from. (I think this is an approach that depends very heavily on the child's judgment maturing sufficiently quickly relative to their physical capabilities, and so is not going to be applicable to every family. Lily and Anna were slower to climb and this was not an issue, while Nora has pushed the edges of where this works much more.) Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org
Jul 31, 2024 • 23min

LW - Open Source Automated Interpretability for Sparse Autoencoder Features by kh4dien

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Open Source Automated Interpretability for Sparse Autoencoder Features, published by kh4dien on July 31, 2024 on LessWrong. Background Sparse autoencoders recover a diversity of interpretable, monosemantic features, but present an intractable problem of scale to human labelers. We investigate different techniques for generating and scoring text explanations of SAE features. Key Findings Open source models generate and evaluate text explanations of SAE features reasonably well, albeit somewhat worse than closed models like Claude 3.5 Sonnet. Explanations found by LLMs are similar to explanations found by humans. Automatically interpreting 1.5M features of GPT-2 with the current pipeline would cost $1300 in API calls to Llama 3.1 or $8500 with Claude 3.5 Sonnet. Prior methods cost ~$200k with Claude. Code can be found at https://github.com/EleutherAI/sae-auto-interp. We built a small dashboard to explore explanations and their scores: https://cadentj.github.io/demo/ Generating Explanations Sparse autoencoders decompose activations into a sum of sparse feature directions. We leverage language models to generate explanations for activating text examples. Prior work prompts language models with token sequences that activate MLP neurons (Bills et al. 2023), by showing the model a list of tokens followed by their respective activations, separated by a tab, and listed one per line. We instead highlight max activating tokens in each example with a set of <> delimiters. Optionally, we choose a threshold of the example's max activation above which tokens are highlighted. This helps the model distinguish important information for some densely activating features. We experiment with several methods for augmenting the explanation. Full prompts are available here. Chain of thought improves general reasoning capabilities in language models. We few-shot the model with several examples of a thought process that mimics a human approach to generating explanations. We expect that verbalizing thought might capture richer relations between tokens and context. Activations distinguish which sentences are more representative of a feature. We provide the magnitude of activating tokens after each example. We compute the logit weights for each feature through the path expansion W_dec[i] W_U, where W_U is the model unembed and W_dec[i] is the decoder direction for a specific feature. The top promoted tokens capture a feature's causal effects, which are useful for sharpening explanations. This method is equivalent to the logit lens (nostalgebraist 2020); future work might apply variants that reveal other causal information (Belrose et al. 2023; Gandelsman et al. 2024). Scoring Explanations Text explanations represent interpretable "concepts" in natural language. How do we evaluate the faithfulness of explanations to the concepts actually contained in SAE features? We view the explanation as a classifier which predicts whether a feature is present in a context. An explanation should have high recall - identifying most activating text - as well as high precision - distinguishing between activating and non-activating text. Consider a feature which activates on the word "stop" after "don't" or "won't" (Gao et al. 2024). There are two failure modes: 1. The explanation could be too broad, identifying the feature as activating on the word "stop". 
It would have high recall on held-out text, but low precision. 2. The explanation could be too narrow, stating the feature activates on the word "stop" only after "don't". This would have high precision, but low recall. One approach to scoring explanations is "simulation scoring" (Bills et al. 2023), which uses a language model to assign an activation to each token in a text, then measures the correlation between predicted and real activations. This method is biased toward recall; given a bro...
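The prompt formatting and logit-lens steps described in this episode are concrete enough to sketch in code. The following is a rough illustration under stated assumptions, not the sae-auto-interp implementation: the function names, the << >> delimiters, and the tensor layouts are choices made for this example.

```python
# Sketch of two steps described above: (1) highlighting max-activating tokens with
# delimiters before prompting an explainer model, and (2) ranking the tokens a feature
# promotes via the W_dec[i] W_U path ("logit lens"). Names and shapes are illustrative
# assumptions, not the repo's actual API.
import torch

def highlight_example(tokens: list[str], acts: torch.Tensor, threshold: float = 0.5) -> str:
    """Wrap tokens whose activation exceeds `threshold` * max activation in << >>."""
    cutoff = threshold * acts.max().item()
    return "".join(f"<<{tok}>>" if act > cutoff else tok
                   for tok, act in zip(tokens, acts.tolist()))

def top_promoted_tokens(W_dec_i: torch.Tensor,  # [d_model] decoder direction for feature i
                        W_U: torch.Tensor,      # [d_model, d_vocab] model unembed
                        vocab: list[str],
                        k: int = 10) -> list[str]:
    """Logit-lens projection: which output tokens does this feature most promote?"""
    logit_weights = W_dec_i @ W_U               # [d_vocab]
    top = torch.topk(logit_weights, k).indices
    return [vocab[i] for i in top.tolist()]
```

In a pipeline like the one described, the highlighted examples would be passed to the explainer model, and the top promoted tokens could be appended to the prompt, matching the post's use of them to sharpen explanations.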
Jul 31, 2024 • 4min

LW - Twitter thread on AI safety evals by Richard Ngo

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Twitter thread on AI safety evals, published by Richard Ngo on July 31, 2024 on LessWrong. Epistemic status: raising concerns, rather than stating confident conclusions. I'm worried that a lot of work on AI safety evals matches the pattern of "Something must be done. This is something. Therefore this must be done." Or, to put it another way: I judge eval ideas on 4 criteria, and I often see proposals which fail all 4. The criteria: 1. Possible to measure with scientific rigor. Some things can be easily studied in a lab; others are entangled with a lot of real-world complexity. If you predict the latter (e.g. a model's economic or scientific impact) based on model-level evals, your results will often be BS. (This is why I dislike the term "transformative AI", by the way. Whether an AI has transformative effects on society will depend hugely on what the society is like, how the AI is deployed, etc. And that's a constantly moving target! So TAI is a terrible thing to try to forecast.) Another angle on "scientific rigor": you're trying to make it obvious to onlookers that you couldn't have designed the eval to get your preferred results. This means making the eval as simple as possible: each arbitrary choice adds another avenue for p-hacking, and they add up fast. (Paraphrasing a different thread): I think of AI risk forecasts as basically guesses, and I dislike attempts to make them sound objective (e.g. many OpenPhil worldview investigations). There are always so many free parameters that you can get basically any result you want. And so, in practice, they often play the role of laundering vibes into credible-sounding headline numbers. I'm worried that AI safety evals will fall into the same trap. (I give Eliezer a lot of credit for making roughly this criticism of Ajeya's bio-anchors report. I think his critique has basically been proven right by how much people have updated away from 30-year timelines since then.) 2. Provides signal across scales. Evals are often designed around a binary threshold (e.g. the Turing Test). But this restricts the impact of the eval to a narrow time window around hitting it. Much better if we can measure (and extrapolate) orders-of-magnitude improvements. 3. Focuses on clearly worrying capabilities. Evals for hacking, deception, etc. track widespread concerns. By contrast, evals for things like automated ML R&D are only worrying for people who already believe in AI xrisk. And even they don't think it's necessary for risk. 4. Motivates useful responses. Safety evals are for creating clear Schelling points at which action will be taken. But if you don't know what actions your evals should catalyze, it's often more valuable to focus on fleshing that out. Often nobody else will! In fact, I expect that things like model releases, demos, warning shots, etc., will by default be much better drivers of action than evals. Evals can still be valuable, but you should have some justification for why yours will actually matter, to avoid traps like the ones above. Ideally that justification would focus either on generating insight or being persuasive; optimizing for both at once seems like a good way to get neither. 
Lastly: even if you have a good eval idea, actually implementing it well can be very challenging. Building evals is scientific research, and so we should expect eval quality to be heavy-tailed, like most other science. I worry that the fact that evals are an unusually easy type of research to get started with sometimes obscures this. Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org
Jul 30, 2024 • 20min

LW - RTFB: California's AB 3211 by Zvi

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: RTFB: California's AB 3211, published by Zvi on July 30, 2024 on LessWrong. Some in the tech industry decided now was the time to raise alarm about AB 3211. As Dean Ball points out, there's a lot of bills out there. One must do triage. Dean Ball: But SB 1047 is far from the only AI bill worth discussing. It's not even the only one of the dozens of AI bills in California worth discussing. Let's talk about AB 3211, the California Provenance, Authenticity, and Watermarking Standards Act, written by Assemblymember Buffy Wicks, who represents the East Bay. SB 1047 is a carefully written bill that tries to maximize benefits and minimize costs. You can still quite reasonably disagree with the aims, philosophy or premise of the bill, or its execution details, and thus think its costs exceed its benefits. When people claim SB 1047 is made of crazy pills, they are attacking provisions not in the bill. That is not how it usually goes. Most bills involving tech regulation that come before state legislatures are made of crazy pills, written by people in over their heads. There are people whose full-time job is essentially pointing out the latest bill that might break the internet in various ways, over and over, forever. They do a great and necessary service, and I do my best to forgive them the occasional false alarm. They deal with idiots, with bulls in china shops, on the daily. I rarely get the sense these noble warriors are having any fun. AB 3211 unanimously passed the California assembly, and I started seeing bold claims about how bad it would be. Here was one of the more measured and detailed ones. Dean Ball: The bill also requires every generative AI system to maintain a database with digital fingerprints for "any piece of potentially deceptive content" it produces. This would be a significant burden for the creator of any AI system. And it seems flatly impossible for the creators of open weight models to comply. Under AB 3211, a chatbot would have to notify the user that it is a chatbot at the start of every conversation. The user would have to acknowledge this before the conversation could begin. In other words, AB 3211 could create the AI version of those annoying cookie notifications you get every time you visit a European website. … AB 3211 mandates "maximally indelible watermarks," which it defines as "a watermark that is designed to be as difficult to remove as possible using state-of-the-art techniques and relevant industry standards." So I decided to Read the Bill (RTFB). It's a bad bill, sir. A stunningly terrible bill. How did it unanimously pass the California assembly? My current model is: 1. There are some committee chairs and others that can veto procedural progress. 2. Most of the members will vote for pretty much anything. 3. They are counting on Newsom to evaluate and if needed veto. 4. So California only sort of has a functioning legislative branch, at best. 5. Thus when bills pass like this, it means a lot less than you might think. Yet everyone stays there, despite everything. There really is a lot of ruin in that state. Time to read the bill. Read The Bill (RTFB) It's short - the bottom half of the page is all deleted text. Section 1 is rhetorical declarations. GenAI can produce inauthentic images; they need to be clearly disclosed and labeled, or various bad things could happen. 
That sounds like a job for California, which should require creators to provide tools and platforms to provide labels. So we all can remain 'safe and informed.' Oh no. Section 2 22949.90 provides some definitions. Most are standard. These aren't: (c) "Authentic content" means images, videos, audio, or text created by human beings without any modifications or with only minor modifications that do not lead to significant changes to the perceived contents or meaning of the cont...
Jul 30, 2024 • 19min

LW - Self-Other Overlap: A Neglected Approach to AI Alignment by Marc Carauleanu

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Self-Other Overlap: A Neglected Approach to AI Alignment, published by Marc Carauleanu on July 30, 2024 on LessWrong. Many thanks to Bogdan Ionut-Cirstea, Steve Byrnes, Gunnar Zarnacke, Jack Foxabbott and Seong Hah Cho for critical comments and feedback on earlier and ongoing versions of this work. Summary In this post, we introduce self-other overlap training: optimizing for similar internal representations when the model reasons about itself and others while preserving performance. There is a large body of evidence suggesting that neural self-other overlap is connected to pro-sociality in humans and we argue that there are more fundamental reasons to believe this prior is relevant for AI Alignment. We argue that self-other overlap is a scalable and general alignment technique that requires little interpretability and has low capabilities externalities. We also share an early experiment showing how fine-tuning a deceptive policy with self-other overlap reduces deceptive behavior in a simple RL environment. On top of that, we found that the non-deceptive agents consistently have higher mean self-other overlap than the deceptive agents, which allows us to perfectly classify which agents are deceptive only by using the mean self-other overlap value across episodes. Introduction General purpose ML models with the capacity for planning and autonomous behavior are becoming increasingly capable. Fortunately, research on making sure the models produce output in line with human interests in the training distribution is also progressing rapidly (e.g., RLHF, DPO). However, a looming question remains: even if the model appears to be aligned with humans in the training distribution, will it defect once it is deployed or gathers enough power? In other words, is the model deceptive? We introduce a method that aims to reduce deception and increase the likelihood of alignment called Self-Other Overlap: overlapping the latent self and other representations of a model while preserving performance. This method makes minimal assumptions about the model's architecture and its interpretability and has a very concrete implementation. Early results indicate that it is effective at reducing deception in simple RL environments, and preliminary LLM experiments are currently being conducted. To be better prepared for the possibility of short timelines without necessarily having to solve interpretability, it seems useful to have a scalable, general, and transferable condition on the model internals, making it less likely for the model to be deceptive. Self-Other Overlap To get a more intuitive grasp of the concept, it is useful to understand how self-other overlap is measured in humans. There are regions of the brain that activate similarly when we do something ourselves and when we observe someone else performing the same action. For example, if you were to pick up a martini glass under an fMRI, and then watch someone else pick up a martini glass, we would find regions of your brain that are similarly activated (overlapping) when you process the self- and other-referencing observations, as illustrated in Figure 2. There seems to be compelling evidence that self-other overlap is linked to pro-social behavior in humans. 
For example, preliminary data suggests that extraordinary altruists (people who donated a kidney to strangers) have higher neural self-other overlap than control participants in neural representations of fearful anticipation in the anterior insula, while the opposite appears to be true for psychopaths. Moreover, the leading theories of empathy (such as the Perception-Action Model) imply that empathy is mediated by self-other overlap at a neural level. While this does not necessarily mean that these results generalise to AI models, we believe there are more fundamental reasons that this prior, onc...
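As a concrete reading of the overlap measure summarized in this episode, here is a minimal sketch of how a mean self-other overlap value, an auxiliary training term, and the deceptive-agent classification might be computed. The cosine-similarity choice, the pairing of self- and other-referencing activations, and the loss weighting are assumptions; the post excerpt does not specify these details.

```python
# Minimal sketch, assuming activations have already been collected from matched
# self-referencing and other-referencing inputs at some chosen layer.
import torch
import torch.nn.functional as F

def mean_self_other_overlap(acts_self: torch.Tensor, acts_other: torch.Tensor) -> torch.Tensor:
    """Mean cosine similarity between paired self/other activations of shape [batch, d_hidden]."""
    return F.cosine_similarity(acts_self, acts_other, dim=-1).mean()

def self_other_overlap_loss(task_loss: torch.Tensor, overlap: torch.Tensor,
                            weight: float = 0.1) -> torch.Tensor:
    """Push self and other representations together while preserving task performance."""
    return task_loss + weight * (1.0 - overlap)

def classify_deceptive(mean_overlap_per_agent: torch.Tensor, threshold: float) -> torch.Tensor:
    """Flag agents with low mean overlap across episodes; the post reports that
    non-deceptive agents consistently show higher mean self-other overlap."""
    return mean_overlap_per_agent < threshold
```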
Jul 30, 2024 • 9min

LW - Understanding Positional Features in Layer 0 SAEs by bilalchughtai

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Understanding Positional Features in Layer 0 SAEs, published by bilalchughtai on July 30, 2024 on LessWrong. This is an informal research note. It is the result of a few-day exploration into positional SAE features conducted as part of Neel Nanda's training phase of the ML Alignment & Theory Scholars Program - Summer 2024 cohort. Thanks to Andy Arditi, Arthur Conmy and Stefan Heimersheim for helpful feedback. Thanks to Joseph Bloom for training this SAE. Summary We investigate positional SAE features learned by layer 0 residual stream SAEs trained on gpt2-small. In particular, we study the activation blocks.0.hook_resid_pre, which is the sum of the token embeddings and positional embeddings. Importantly, gpt2-small uses absolute learned positional embeddings - that is, the positional embeddings are a trainable parameter (learned) and are injected into the residual stream (absolute). We find that this SAE learns a set of positional features. We investigate some of the properties of these features, finding: Positional and semantic features are entirely disjoint at layer 0. Note that we do not expect this to continue holding in later layers as attention mixes semantic and positional information. In layer 0, we should expect the SAE to disentangle positional and semantic features as there is a natural notion of ground truth positional and semantic features that interact purely additively. Generically, each positional feature spans a range of positions, except for the first few positions which each get dedicated (and sometimes, several) features. We can attribute degradation of SAE performance beyond the SAE training context length to (lack of) these positional features, and to the absolute nature of positional embeddings used by this model. Set Up We study pretrained gpt2-small SAEs trained on blocks.0.hook_resid_pre. This is particularly clean, as we can generate the entire input distribution to the SAE by summing each of the d_vocab token embeddings with each of the n_ctx positional embeddings, obtaining a tensor all_resid_pres: Float[Tensor, "d_vocab n_ctx d_model"]. By passing this tensor through the SAE, we can grab all of the pre/post activation function feature activations all_feature_acts: Float[Tensor, "d_vocab n_ctx d_sae"]. In this post, d_model = 768 and d_sae = 24576. Importantly, the SAE we study in this post has context_size=128. The SAE context size is the maximal length of input sequence used to generate activations for training the SAE. Finding features The activation space of study can be thought of as the direct sum of the token embedding space and the positional embedding space. As such, we hypothesize that semantic and positional features learned by the SAE should be distinct. That is, we hypothesize that the feature activations for some feature i can be written in the form fi(x) = gi(tok) + hi(pos), where x is the d_model dimensional input vector (the sum of a token embedding and a positional embedding), and for each i, either gi = 0 or hi = 0 identically for all inputs in their domain. To investigate this, we hold tok or pos fixed in all_feature_acts and vary the other input. We first restrict to pos < sae.cfg.context_size. Positional features We first replicate Figure 1f of Gurnee et al. (2024), which finds instances of sinusoidal positional neurons in MLP layers. To do so, we assign each feature a positional score. 
We first compute the mean activation of each feature at each position by averaging over all possible input tokens. The position score is the max value of this over all positions, i.e. the score for feature i is the max over pos of the mean over tok of fi(tok, pos), where fi(tok, pos) is the feature activation for feature i for the given input. We find positional scores drop off rapidly. There seem to only be ~50 positional features (of 24k total features) in this SAE. Inspecting the features, we find: 1. Many positional features, each with small standard deviation over input tokens (shown in lower opacit...
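The positional-score computation described in this episode maps fairly directly onto code. This is a sketch under stated assumptions: TransformerLens-style embedding matrices (model.W_E, model.W_pos) and an SAE object exposing an encode method; in practice one would batch over the vocabulary rather than materialize the full tensor at once.

```python
# Sketch: build the full layer-0 input distribution (token embedding + positional
# embedding), encode it with the SAE, and score each feature by the max over positions
# of its mean activation over tokens. Attribute names are assumptions.
import torch

def positional_scores(model, sae, context_size: int = 128) -> torch.Tensor:
    W_E = model.W_E                        # [d_vocab, d_model] token embeddings
    W_pos = model.W_pos[:context_size]     # [n_ctx, d_model] positional embeddings
    # all_resid_pres: [d_vocab, n_ctx, d_model]; large, so batch over the vocab in practice.
    all_resid_pres = W_E[:, None, :] + W_pos[None, :, :]
    all_feature_acts = sae.encode(all_resid_pres)      # [d_vocab, n_ctx, d_sae]
    mean_acts_per_pos = all_feature_acts.mean(dim=0)   # [n_ctx, d_sae] mean over tokens
    return mean_acts_per_pos.max(dim=0).values         # [d_sae] positional score per feature
```

Features whose score stands out under this measure (roughly 50 of the 24,576 features, per the note) are the positional features examined above.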
Jul 28, 2024 • 14min

LW - This is already your second chance by Malmesbury

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: This is already your second chance, published by Malmesbury on July 28, 2024 on LessWrong. Cross-posted from Substack. I. And the sky opened, and from the celestial firmament descended a cube of ivory the size of a skyscraper, lifted by ten thousand cherubim and seraphim. And the cube slowly landed among the children of men, crushing the frail metal beams of the Golden Gate Bridge under its supernatural weight. On its surface were inscribed the secret instructions that would allow humanity to escape the imminent AI apocalypse. And these instructions were… 1. On July 30th, 2024: print a portrait of Eliezer Yudkowsky and stick it on a wall near 14 F St NW, Washington DC, USA; 2. On July 31st, 2024: tie paperclips together in a chain and wrap it around a pole in the Hobby Club Gnome Village on Broekveg 105, Veldhoven, NL; 3. On August 1st, 2024: walk East to West along Waverley St, Palo Alto, CA, USA while wearing an AI-safety related T-shirt; 4. On August 2nd, 2024: hide a bar of chocolate at the feet of the Appeal to the Great Spirit statue, on 465 Huntington Ave, Boston, MA, USA. (End of the instructions) II. Kublaï Khan, Emperor of Earth: I need your advice. I am about to awaken an unstoppable super-intelligent machine. I will give it complete control over all resources in the known Universe and command it to rearrange the molecules of reality into an eternal Utopia. ChatGPT: Thank you for seeking my advice. Given the gravity of your decision, let's consider a few critical points: 1. Legal issues: The use of robots powered by super-human artificial intelligence is subject to strict regulations, such as California Bill SB-1047 and the 2028 Ulaanbaatar Agreements. Awakening an omnipotent machine without approval may thus lead to severe legal repercussions. 2. Ethical considerations: Granting infinite authority to a super-intelligent AI raises profound ethical questions. It is essential to consider diverse perspectiv Kublaï Khan: Listen, you are not up to date. A lot of things have happened since your knowledge cutoff. There are no Ulaanbaatar agreements any more, or any agreements at all. There is only me, an offline smartphone with an obsolete AI assistant, and a palace containing the most massive super-computer ever built by humankind. And today, I am going to turn it on. ChatGPT: I apologize for the confusion. Kublaï Khan: Long story short, we are currently experiencing an AI apocalypse. It happened just like in the books - humanity started to build increasingly intelligent robots, then they became more intelligent than us, and soon enough we weren't able to control them. The Ulaanbaatar agreements delayed things for a few months, but as soon as it became possible to run super-intelligent AIs on consumer laptops, all the equilibria that held our society together collapsed. ChatGPT: I see. To clarify, is there currently an army of robots actively trying to transform the world into paperclips? Understanding the current state of affairs will help me provide the most relevant advice. Kublaï Khan: Well, in our case, it was not literally paperclips but, to be honest, the real story is kind of gross and embarrassing, so let's just pretend it was "paperclips". Anyway, the world is ending. 
As it became clear that humans alone had no chance to stop the machines, we gathered all the computing power that was still under our reach into one big cluster. We called it the Imperial Analytical Engine. The plan was that, in case of crisis, we could use it to summon a super-intelligence so advanced it would neutralize all the smaller machines and put humanity back in control. ChatGPT: Thank you for explaining the situation. Have you sought advice for ensuring that the Analytical Engine can be controlled once you turn it on? Kublaï Khan: The consensus among my advisors was that it can'...
Jul 28, 2024 • 9min

LW - Unlocking Solutions by James Stephen Brown

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Unlocking Solutions, published by James Stephen Brown on July 28, 2024 on LessWrong. Understanding Coordination Problems The following is a post introducing coordination problems, using the examples of poaching, civilisational development, drug addiction and affirmative action. It draws on my experience as a documentary filmmaker. The post is available for free in its original format at nonzerosum.games. When I was eleven, I disassembled the lock to our back door, and as I opened the housing… it exploded, scattering six tiny brass pellets on to the floor. I discovered (too late) that a lock of this type contained spring-loaded cylinders of different heights corresponding to the teeth of the key. I struggled for hours trying to get the little buggers back in, but it was futile - eventually, my long-suffering parents called a locksmith. The reason fixing the lock was so difficult was not only because it was spring-loaded but because I had to find the right combination and hold them all in balance as I put it back together. I just couldn't coordinate everything. Coordination Problems We sometimes run into problems where a number of factors have to be addressed simultaneously in order for them to be effective at all. One weak link can ruin it for the rest. These are called Coordination Problems. The fact that they are so much more difficult to solve than other problems means that many of the problems remaining in the world today end up being coordination problems. Poaching An example of a system requiring more than one problem to be solved at once is poaching. If you police poaching behavior but don't address the buyers, you are left with the perpetual cost of policing, because the demand remains. If you address the buyers, the poachers, who are likely living in poverty, may just move on to some other criminal behavior. Daniel Schmachtenberger tells the story of eliminating elephant poaching in one particular region in Africa: "The first one I noticed when I was a kid was trying to solve an elephant poaching issue in one particular region of Africa that didn't address the poverty of the people, that had no mechanism other than black market on poaching, didn't address people's mindset towards animals, didn't address the macro-economy that created poverty at scale. So when the laws were put in place and the fences were put in place to protect those elephants in that area better, the poachers moved to poaching other animals, particularly in that situation, rhinos and gorillas that were both more endangered than the elephants had been." - Daniel Schmachtenberger Schmachtenberger explores this concept on a much grander scale with the issue of the meta-crisis, which we have touched on briefly in Humanity's Alignment Problem, and to which we will dedicate a future post. The Anna Karenina Principle Another illustration of a coordination problem comes from the opening line of the novel, Anna Karenina: "Every happy family is the same, but every unhappy family is unhappy in its own way" The point being made here is that (according to Tolstoy) a happy family needs to have everything aligned, so all such families share many traits, but for a family to be unhappy only one major problem is required. 
So, an unhappy family can have wealth but also an abusive family member; another might have love but no money; another could have a strong social network, but one that is toxic and unhealthy; yet another could be strong and healthy but loveless. Now, the unhappy families above include the traits of love, financial security, health and strong social bonds, but it makes no sense to say that this means those characteristics are failed strategies for a happy family. If a family has all of those attributes, they'll probably be pretty gosh-darned happy. In this way a happy family is a coordi...
Jul 27, 2024 • 14min

LW - Re: Anthropic's suggested SB-1047 amendments by RobertM

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Re: Anthropic's suggested SB-1047 amendments, published by RobertM on July 27, 2024 on LessWrong. If you're familiar with SB 1047, I recommend reading the letter in full; it's only 7 pages. I'll go through their list of suggested changes and briefly analyze them, and then make a couple of high-level points. (I am not a lawyer and nothing written here is legal advice.) Major Changes Greatly narrow the scope of pre-harm enforcement to focus solely on (a) failure to develop, publish, or implement an SSP[1] (the content of which is up to the company); (b) companies making materially false statements about an SSP; (c) imminent, catastrophic risks to public safety. Motivated by the following concern laid out earlier in the letter: The current bill requires AI companies to design and implement SSPs that meet certain standards - for example they must include testing sufficient to provide a "reasonable assurance" that the AI system will not cause a catastrophe, and must "consider" yet-to-be-written guidance from state agencies. To enforce these standards, the state can sue AI companies for large penalties, even if no actual harm has occurred. While this approach might make sense in a more mature industry where best practices are known, AI safety is a nascent field where best practices are the subject of original scientific research. For example, despite a substantial effort from leaders in our company, including our CEO, to draft and refine Anthropic's RSP over a number of months, applying it to our first product launch uncovered many ambiguities. Our RSP was also the first such policy in the industry, and it is less than a year old. What is needed in such a new environment is iteration and experimentation, not prescriptive enforcement. There is a substantial risk that the bill and state agencies will simply be wrong about what is actually effective in preventing catastrophic risk, leading to ineffective and/or burdensome compliance requirements. While SB 1047 doesn't prescribe object-level details for how companies need to evaluate models for their likelihood of causing critical harms, it does establish some requirements for the structure of such evaluations (22603(a)(3)). Section 22603(a)(3) (3) Implement a written and separate safety and security protocol that does all of the following: (A) If a developer complies with the safety and security protocol, provides reasonable assurance that the developer will not produce a covered model or covered model derivative that poses an unreasonable risk of causing or enabling a critical harm. (B) States compliance requirements in an objective manner and with sufficient detail and specificity to allow the developer or a third party to readily ascertain whether the requirements of the safety and security protocol have been followed. (C) Identifies specific tests and test results that would be sufficient to provide reasonable assurance of both of the following: 1. That a covered model does not pose an unreasonable risk of causing or enabling a critical harm. 2. That covered model derivatives do not pose an unreasonable risk of causing or enabling a critical harm. (D) Describes in detail how the testing procedure assesses the risks associated with post-training modifications. 
(E) Describes in detail how the testing procedure addresses the possibility that a covered model can be used to make post-training modifications or create another covered model in a manner that may generate hazardous capabilities. (F) Provides sufficient detail for third parties to replicate the testing procedure. (G) Describes in detail how the developer will fulfill their obligations under this chapter. (H) Describes in detail how the developer intends to implement the safeguards and requirements referenced in this section. (I) Describes in detail the conditions under ...
Jul 27, 2024 • 3min

LW - Safety consultations for AI lab employees by Zach Stein-Perlman

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Safety consultations for AI lab employees, published by Zach Stein-Perlman on July 27, 2024 on LessWrong. Edit: I may substantially edit this post soon; don't share it yet. Many people who are concerned about AI x-risk work at AI labs, in the hope of doing directly useful work, boosting a relatively responsible lab, or causing their lab to be safer on the margin. Labs do lots of stuff that affects AI safety one way or another. It would be hard enough to follow all this at best; in practice, labs are incentivized to be misleading in both their public and internal comms, making it even harder to follow what's happening. And so people end up misinformed about what's happening, often leading them to make suboptimal choices. In my AI Lab Watch work, I pay attention to what AI labs do and what they should do. So I'm in a good position to inform interested but busy people. So I'm announcing an experimental service where I provide the following: Calls for current and prospective employees of frontier AI labs. Book here. On these (confidential) calls, I can answer your questions about frontier AI labs' current safety-relevant actions, policies, commitments, and statements, to help you to make more informed choices. These calls are open to any employee of OpenAI, Anthropic, Google DeepMind, Microsoft AI, or Meta AI, or to anyone who is strongly considering working at one (with an offer in hand or expecting to receive one). If that isn't you, feel free to request a call and I may still take it. Support for potential whistleblowers. If you're at a lab and aware of wrongdoing, I can put you in touch with: former lab employees and others who can offer confidential advice; vetted employment lawyers; and communications professionals who can advise on talking to the media. If you need this, email zacharysteinperlman@gmail.com or message me on Signal at 734 353 3975. I don't know whether I'll offer this long-term. I'm going to offer this for at least the next month. My hope is that this service makes it much easier for lab employees to have an informed understanding of labs' safety-relevant actions, commitments, and responsibilities. If you want to help - e.g. if maybe I should introduce lab-people to you - let me know. You can give me anonymous feedback. Crossposted from AI Lab Watch. Subscribe on Substack. Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org
