The Nonlinear Library

The Nonlinear Fund
Mar 30, 2024 • 10min

LW - Back to Basics: Truth is Unitary by lsusr

A prospect seeking shelter at an abandoned temple on a dark, stormy night has a poignant interaction. Themes of choice and unexpected encounters are highlighted. Skills in tailoring, survivalism, and penetration testing are discussed. The interconnectedness of truths and the significance of seeking diverse perspectives are explored.
Mar 30, 2024 • 9min

EA - Metascience of the Vesuvius Challenge by Maxwell Tabarrok

Maxwell Tabarrok, author of Metascience of the Vesuvius Challenge, discusses the million-dollar contest to read ancient texts from Herculaneum using advanced technology. The success of the challenge highlights the importance of strategic funding in research. They explore optimizing research challenge funding, nerd sniping for project funding, and the significance of sustained research investments.
Mar 30, 2024 • 5min

LW - D&D.Sci: The Mad Tyrant's Pet Turtles by abstractapplic

Listen to a thrilling D&D.Sci scenario where players must guess the weight of the Mad Tyrant's pet turtles without weighing them. Incorrect estimates carry consequences, and tension builds as players navigate the whimsical yet dangerous scenario.
Mar 30, 2024 • 20min

EA - Exploring Ergodicity in the Context of Longtermism by Arthur Jongejans

Arthur Jongejans discusses how expected value theory misrepresents risk dynamics in multiplicative environments, advocating for the ergodicity framework in decision-making for effective altruism. Longtermism is explored, emphasizing the importance of considering future generations. The podcast challenges conventional economic theories and decision-making strategies, urging a shift towards preventing catastrophic outcomes and existential risks.
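For a concrete feel of the multiplicative-dynamics point, here is a standard ergodicity-economics illustration (the 1.5x/0.6x payoffs are assumed for this sketch, not taken from the episode): a repeated bet multiplies wealth by 1.5 or 0.6 with equal probability. The per-round expected multiplier is 1.05, yet the time-average growth factor is sqrt(1.5 * 0.6) ≈ 0.95, so the typical individual trajectory shrinks even though the ensemble average grows.

import numpy as np

rng = np.random.default_rng(0)
n_people, n_rounds = 100_000, 30

# Each round multiplies an individual's wealth (starting at 1) by 1.5 or 0.6 with equal probability.
multipliers = rng.choice([1.5, 0.6], size=(n_people, n_rounds))
final_wealth = multipliers.prod(axis=1)

print("theoretical expected value :", 1.05 ** n_rounds)         # ensemble average, ~4.3
print("empirical ensemble average :", final_wealth.mean())      # close to the above
print("median individual outcome  :", np.median(final_wealth))  # ~0.2: the typical person loses
print("time-average growth factor :", (1.5 * 0.6) ** 0.5)       # ~0.95 < 1 per round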
Mar 29, 2024 • 2min

EA - UK moves toward mandatory animal welfare labelling by AdamC

The podcast discusses the UK government's proposal for mandatory animal welfare labelling, which could lead to higher standards, prices, and competition in the food industry. It initially covers welfare tiers for chicken, eggs, and pig products, with potential expansion to beef, lamb, and dairy. The goal is to influence companies and drive EU reforms.
Mar 29, 2024 • 16min

LW - SAE reconstruction errors are (empirically) pathological by wesg

SAE reconstruction errors are empirically pathological, leading to significant changes in token prediction probabilities. Understanding these errors is crucial for advancing SAEs. The podcast explores challenges in maintaining faithfulness, the impact of perturbations on activation vectors, limitations of activation-space reconstruction, and the evaluation of SAE reconstructions through metrics like KL divergence.
Mar 29, 2024 • 12min

AF - Your LLM Judge may be biased by Rachel Freedman

Rachel Freedman, an AI safety researcher, discusses the bias present in LLM judges used by researchers. She details experiments and strategies to mitigate biases, including adjusting labeling systems and validating judgments against human ones. The podcast covers analyzing biases in the Llama 2 model and exploring techniques like few-shot prompting to reduce biases in language models.
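One generic way to control for the order sensitivity of an LLM judge is to ask for the same comparison in both presentation orders and only accept verdicts that survive the swap. This is a sketch of that common debiasing technique, not necessarily the mitigation used in the episode; the judge callable and its return labels are assumptions.

from typing import Callable, Optional

def debiased_judgment(judge: Callable[[str, str, str], str],
                      prompt: str, answer_a: str, answer_b: str) -> Optional[str]:
    """Ask an LLM judge which answer is better, in both presentation orders.
    Returns "A" or "B" only if the two orderings agree; returns None when the
    verdict flips with position (a case to resolve by other means, e.g. human review)."""
    first = judge(prompt, answer_a, answer_b)   # judge is assumed to return "first" or "second"
    second = judge(prompt, answer_b, answer_a)  # same comparison with the order swapped
    if first == "first" and second == "second":
        return "A"
    if first == "second" and second == "first":
        return "B"
    return None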
Mar 29, 2024 • 16min

AF - SAE reconstruction errors are (empirically) pathological by Wes Gurnee

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: SAE reconstruction errors are (empirically) pathological, published by Wes Gurnee on March 29, 2024 on The AI Alignment Forum.

Summary

Sparse Autoencoder (SAE) errors are empirically pathological: when a reconstructed activation vector is distance ϵ from the original activation vector, substituting a randomly chosen point at the same distance changes the next token prediction probabilities significantly less than substituting the SAE reconstruction[1] (measured by both KL and loss). This is true for all layers of the model (~2x to ~4.5x increase in KL and loss over baseline) and is not caused by feature suppression/shrinkage. Assuming others replicate, these results suggest the proxy reconstruction objective is behaving pathologically. I am not sure why these errors occur, but I expect understanding this gap will give us deeper insight into SAEs while also providing an additional metric to guide methodological progress.

Introduction

As the interpretability community allocates more resources to, and increases its reliance on, SAEs, it is important to understand the limitations and potential flaws of this method. SAEs are designed to find a sparse overcomplete feature basis for a model's latent space. This is done by minimizing the joint reconstruction error of the input data and the L1 norm of the intermediate activations (to promote sparsity), i.e., a loss of the form ||x − SAE(x)||_2^2 + λ ||f(x)||_1, where f(x) denotes the SAE's intermediate feature activations.

However, the true goal is to find a faithful feature decomposition that accurately captures the true causal variables in the model, and reconstruction error and sparsity are only easy-to-optimize proxy objectives. This raises the questions: how good a proxy objective is this? Do the reconstructed representations faithfully preserve other model behavior? How much are we proxy gaming?

Naively, this training objective defines faithfulness as L2 distance. But another natural property of a "faithful" reconstruction is that substituting the original activation with the reconstruction should approximately preserve the next-token prediction probabilities. More formally, for a set of tokens T and a model M, let P = M(T) be the model's true next-token probabilities. Then let Q_SAE = M(T | do(x ← SAE(x))) be the next-token probabilities after intervening on the model by replacing a particular activation x (e.g. a residual stream state or a layer of MLP activations) with the SAE reconstruction of x. The more faithful the reconstruction, the lower the KL divergence between P and Q_SAE (denoted D_KL(P || Q_SAE)) should be.

In this post, I study how D_KL(P || Q_SAE) compares to several natural baselines based on random perturbations of the activation vector x which preserve some error property of the SAE reconstruction (e.g., having the same L2 reconstruction error or cosine similarity). I find that the KL divergence is significantly higher (2.2x - 4.5x) for the residual stream SAE reconstructions compared to the random perturbations, and moderately higher (0.9x - 1.7x) for attention out SAEs. This suggests that the SAE reconstruction is not faithful by our definition, as it does not preserve the next-token prediction probabilities. This observation is important because it suggests that SAEs make systematic, rather than random, errors and that continuing to drive down reconstruction error may not actually increase SAE faithfulness.
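The comparison above can be sketched in a few lines of PyTorch (the names and the toy unembedding below are assumptions for illustration, not code from the post): build a random point at the same L2 distance from the original activation as the SAE reconstruction, run both through the rest of the model, and compare the resulting next-token KL divergences.

import torch
import torch.nn.functional as F

def epsilon_random(x: torch.Tensor, x_hat: torch.Tensor) -> torch.Tensor:
    """Return a point at the same L2 distance from x as the reconstruction x_hat,
    but displaced in a uniformly random direction."""
    eps = (x - x_hat).norm()
    direction = torch.randn_like(x)
    direction = direction / direction.norm()
    return x + eps * direction

def kl_from_logits(p_logits: torch.Tensor, q_logits: torch.Tensor) -> torch.Tensor:
    """D_KL(P || Q) over the next-token distribution, computed from logits."""
    log_p = F.log_softmax(p_logits, dim=-1)
    log_q = F.log_softmax(q_logits, dim=-1)
    return (log_p.exp() * (log_p - log_q)).sum(-1)

# Toy stand-in for "run the rest of the model from this activation":
# a random unembedding mapping a d-dimensional activation to vocab logits.
torch.manual_seed(0)
d_model, vocab = 64, 1000
W_U = torch.randn(d_model, vocab) / d_model ** 0.5
logits_fn = lambda act: act @ W_U

x = torch.randn(d_model)                 # original activation
x_hat = x + 0.1 * torch.randn(d_model)   # stand-in for an SAE reconstruction of x
x_rand = epsilon_random(x, x_hat)        # same-distance random baseline

kl_sae = kl_from_logits(logits_fn(x), logits_fn(x_hat))
kl_rand = kl_from_logits(logits_fn(x), logits_fn(x_rand))
print(f"KL(SAE patch)  = {kl_sae.item():.5f}")
print(f"KL(eps-random) = {kl_rand.item():.5f}")

In the post's setting, the first number would be computed with a real SAE reconstruction and a real model forward pass, averaged over many token positions; the pathology is that it comes out several times larger than the second.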
This observation potentially indicates that current SAEs are missing out on important parts of the learned representations of the model. The good news is that this KL-gap presents a clear target for methodological improvement and a new metric for evaluating SAEs. I intend to explore this in future work.

Intuition: how big a deal is this (KL) difference?

For some intuition, here are several real examples of the top-25 output token probabilities at the end of a prompt when patching in SAE and ϵ-random reconstructions, compared to the original model's next-token distributio...
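As a sketch of how the KL-gap could be reported as an evaluation metric (my framing of the idea above, not a definition from the post), one could average both KLs over many token positions and take their ratio, where a value well above 1 indicates the pathology:

def kl_gap(kl_sae: list, kl_rand: list) -> float:
    """Ratio of mean KL under SAE patching to mean KL under same-distance random
    patching; ~1 means SAE errors behave like random errors of the same size,
    while values well above 1 mean SAE errors are systematically worse."""
    return (sum(kl_sae) / len(kl_sae)) / max(sum(kl_rand) / len(kl_rand), 1e-12)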
Mar 29, 2024 • 5min

EA - Newly launched Humane Slaughter Initiative webpage by Léa Guttmann

Guest Léa Guttmann discusses the launch of the Humane Slaughter Initiative webpage, focusing on ethical shrimp production. Topics include sustainable practices in shrimp pond improvement, transition to electrical stunning, and advancements in shrimp welfare for humane slaughter.
Mar 29, 2024 • 13min

LW - How to safely use an optimizer by Simon Fischer

Simon Fischer, author of the post on safely using an optimizer, discusses the risks of using an untrustworthy optimizer and a method for finding safe, satisficing outputs. Topics include setting safety constraints, navigating oracle performance, and techniques for optimizing tasks. Mathematical proofs and insights on optimization power are explored.
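As a toy illustration of the satisficing idea (a generic sketch under assumed names, not Simon Fischer's construction): rather than asking an untrusted optimizer for the best possible output, ask it for any output that clears an independently verifiable threshold, and check that threshold yourself before using the result.

from typing import Callable, Optional

def safe_satisfice(untrusted_optimizer: Callable[[float], str],
                   verify_score: Callable[[str], float],
                   threshold: float) -> Optional[str]:
    """Request an output claimed to meet `threshold`, then re-score it with a
    trusted verifier; return it only if the claim checks out."""
    candidate = untrusted_optimizer(threshold)
    return candidate if verify_score(candidate) >= threshold else None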
