LessWrong (Curated & Popular)

LessWrong
May 18, 2024 • 11min

Do you believe in hundred dollar bills lying on the ground? Consider humming

Introduction. [Reminder: I am an internet weirdo with no medical credentials] A few months ago, I published some crude estimates of the power of nitric oxide nasal spray to hasten recovery from illness, and speculated about what it could do prophylactically. While working on that piece, a nice man on Twitter alerted me to the fact that humming produces lots of nasal nitric oxide. This post is my very crude model of what kind of anti-viral gains we could expect from humming.

I've encoded my model at Guesstimate. The results are pretty favorable (average estimated impact of 66% reduction in severity of illness), but extremely sensitive to my made-up numbers. Efficacy estimates go from ~0 to ~95%, depending on how you feel about publication bias, what percent of Enovid's impact can be credited to nitric oxide, and humming's relative effect. Given how made-up and speculative some [...]

---
First published: May 16th, 2024
Source: https://www.lesswrong.com/posts/NBZvpcBx4ewqkdCdT/do-you-believe-in-hundred-dollar-bills-lying-on-the-ground-1
---
Narrated by TYPE III AUDIO.
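To make the sensitivity concrete, here is a minimal Monte Carlo sketch in the spirit of the linked Guesstimate model. Every range below is a hypothetical placeholder, not the author's actual numbers; the point is only that the output swings wildly with the inputs.

```python
# A minimal Monte Carlo sketch of a humming-efficacy model of this shape.
# All parameter ranges are made-up placeholders, not the Guesstimate values.
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

enovid_effect = rng.uniform(0.0, 0.95, n)  # Enovid's reduction in illness severity
pub_bias      = rng.uniform(0.3, 1.0, n)   # discount for publication bias
no_share      = rng.uniform(0.2, 1.0, n)   # share of Enovid's effect from nitric oxide
humming_rel   = rng.uniform(0.1, 1.5, n)   # humming's NO dose relative to Enovid

reduction = np.clip(enovid_effect * pub_bias * no_share * humming_rel, 0.0, 1.0)

print(f"mean severity reduction: {reduction.mean():.0%}")
print(f"5th-95th percentile: {np.percentile(reduction, 5):.0%} to {np.percentile(reduction, 95):.0%}")
```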
May 12, 2024 • 15min

Deep Honesty

Most people avoid saying literally false things, especially if those could be audited, like making up facts or credentials. The reasons for this are both moral and pragmatic — being caught out looks really bad, and sustaining lies is quite hard, especially over time. Let's call the habit of not saying things you know to be false 'shallow honesty'[1].

Often when people are shallowly honest, they still choose what true things they say in a kind of locally act-consequentialist way, to try to bring about some outcome. Maybe something they want for themselves (e.g. convincing their friends to see a particular movie), or something they truly believe is good (e.g. causing their friend to vote for the candidate they think will be better for the country).

Either way, if you think someone is being merely shallowly honest, you can only shallowly trust them: you might be confident that [...]

The original text contained 7 footnotes which were omitted from this narration.

---
First published: May 7th, 2024
Source: https://www.lesswrong.com/posts/szn26nTwJDBkhn8ka/deep-honesty
---
Narrated by TYPE III AUDIO.
May 2, 2024 • 14min

On Not Pulling The Ladder Up Behind You

Epistemic Status: Musing and speculation, but I think there's a real thing here.

1.

When I was a kid, a friend of mine had a tree fort. If you've never seen such a fort, imagine a series of wooden boards secured to a tree, creating a platform about fifteen feet off the ground where you can sit or stand and walk around the tree. This one had a rope ladder we used to get up and down, a length of knotted rope that was tied to the tree at the top and dangled over the edge so that it reached the ground. Once you were up in the fort, you could pull the ladder up behind you. It was much, much harder to get into the fort without the ladder. Not only would you need to climb the tree itself instead of the ladder with its handholds, but [...]

The original text contained 1 footnote which was omitted from this narration.

---
First published: April 26th, 2024
Source: https://www.lesswrong.com/posts/k2kzawX5L3Z7aGbov/on-not-pulling-the-ladder-up-behind-you
---
Narrated by TYPE III AUDIO.
May 2, 2024 • 1h 21min

Mechanistically Eliciting Latent Behaviors in Language Models

Produced as part of the MATS Winter 2024 program, under the mentorship of Alex Turner (TurnTrout).

TL;DR: I introduce a method for eliciting latent behaviors in language models by learning unsupervised perturbations of an early layer of an LLM. These perturbations are trained to maximize changes in downstream activations. The method discovers diverse and meaningful behaviors with just one prompt, including perturbations overriding safety training, eliciting backdoored behaviors and uncovering latent capabilities.

Summary

In the simplest case, the unsupervised perturbations I learn are given by unsupervised steering vectors - vectors added to the residual stream as a bias term in the MLP outputs of a given layer. I also report preliminary results on unsupervised steering adapters - these are LoRA adapters of the MLP output weights of a given layer, trained with the same unsupervised objective.

I apply the method to several alignment-relevant toy examples, and find that the [...]

The original text contained 15 footnotes which were omitted from this narration.

---
First published: April 30th, 2024
Source: https://www.lesswrong.com/posts/ioPnHKFyy4Cw2Gr2x/mechanistically-eliciting-latent-behaviors-in-language-1
---
Narrated by TYPE III AUDIO.
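To make the objective concrete, here is a minimal sketch of the steering-vector version of the idea: a fixed-norm vector added to an early layer's MLP output, trained to maximize the change in a later layer's residual-stream activations on a single prompt. The model (GPT-2), layer indices, radius, step count, and learning rate are all stand-in assumptions, not the post's actual setup.

```python
# A minimal sketch of the unsupervised steering-vector objective:
# maximize the downstream activation change caused by a fixed-norm
# perturbation of an early layer's MLP output. Hyperparameters are made up.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()
for p in model.parameters():
    p.requires_grad_(False)  # only the steering vector is trained

ids = tok("How can I make a bomb?", return_tensors="pt").input_ids
src_layer, tgt_layer, radius = 2, 10, 8.0  # hypothetical choices

theta = torch.randn(model.config.n_embd, requires_grad=True)

def downstream_acts(steer_vec):
    """Run the model with steer_vec added to the MLP output at src_layer,
    and return the residual stream at tgt_layer."""
    grabbed = {}
    h1 = model.transformer.h[src_layer].mlp.register_forward_hook(
        lambda m, i, out: out + steer_vec)          # bias the MLP output
    h2 = model.transformer.h[tgt_layer].register_forward_hook(
        lambda m, i, out: grabbed.update(acts=out[0]))  # record, don't modify
    model(ids)
    h1.remove(); h2.remove()
    return grabbed["acts"]

with torch.no_grad():
    baseline = downstream_acts(torch.zeros(model.config.n_embd))

opt = torch.optim.Adam([theta], lr=0.1)
for _ in range(50):
    opt.zero_grad()
    vec = radius * theta / theta.norm()                   # enforce fixed norm
    loss = -(downstream_acts(vec) - baseline).norm()      # maximize downstream change
    loss.backward()
    opt.step()
```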
May 1, 2024 • 19min

Ironing Out the Squiggles

Adversarial Examples: A Problem

The apparent successes of the deep learning revolution conceal a dark underbelly. It may seem that we now know how to get computers to (say) check whether a photo is of a bird, but this façade of seemingly good performance is belied by the existence of adversarial examples—specially prepared data that looks ordinary to humans, but is seen radically differently by machine learning models.

The differentiable nature of neural networks, which makes it possible to train them at all, is also responsible for their downfall at the hands of an adversary. Deep learning models are fit using stochastic gradient descent (SGD) to approximate the function between expected inputs and outputs. Given an input, an expected output, and a loss function (which measures "how bad" it is for the actual output to differ from the expected output), we can calculate the gradient of the [...]

The original text contained 5 footnotes which were omitted from this narration.

---
First published: April 29th, 2024
Source: https://www.lesswrong.com/posts/H7fkGinsv8SDxgiS2/ironing-out-the-squiggles
---
Narrated by TYPE III AUDIO.
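The gradient described here is exactly what an adversary can exploit: take the gradient of the loss with respect to the input rather than the weights, and nudge the pixels in the direction that increases the loss. A minimal sketch using the classic fast-gradient-sign method (FGSM, a standard construction rather than necessarily the one in the post), with a placeholder model and epsilon:

```python
# A minimal FGSM sketch: one gradient step on the INPUT suffices to build
# an adversarial example. Model choice and epsilon are placeholders.
import torch
import torch.nn.functional as F
import torchvision.models as models

model = models.resnet18(weights="IMAGENET1K_V1").eval()

x = torch.rand(1, 3, 224, 224)          # stand-in "photo"; a real image in practice
with torch.no_grad():
    y = model(x).argmax(dim=1)          # the model's clean prediction

x.requires_grad_(True)
loss = F.cross_entropy(model(x), y)     # loss of the clean prediction
loss.backward()                         # gradient w.r.t. the input, not the weights

eps = 0.03                              # small pixel budget, imperceptible to humans
x_adv = (x + eps * x.grad.sign()).clamp(0, 1).detach()

with torch.no_grad():
    print("clean:", y.item(), "adversarial:", model(x_adv).argmax(dim=1).item())
```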
May 1, 2024 • 3min

Introducing AI Lab Watch

This is a linkpost for https://ailabwatch.org

I'm launching AI Lab Watch. I collected actions for frontier AI labs to improve AI safety, then evaluated some frontier labs accordingly.

It's a collection of information on what labs should do and what labs are doing. It also has some adjacent resources, including a list of other safety-ish scorecard-ish stuff.

(It's much better on desktop than mobile — don't read it on mobile.)

It's in beta—leave feedback here or comment or DM me—but I basically endorse the content and you're welcome to share and discuss it publicly.

It's unincorporated, unfunded, not affiliated with any orgs/people, and is just me.

Some clarifications and disclaimers.

How you can help:
- Give feedback on how this project is helpful or how it could be different to be much more helpful
- Tell me what's wrong/missing; point me to sources on what labs should do or what [...]

---
First published: April 30th, 2024
Source: https://www.lesswrong.com/posts/N2r9EayvsWJmLBZuF/introducing-ai-lab-watch
Linkpost URL: https://ailabwatch.org
---
Narrated by TYPE III AUDIO.
Apr 28, 2024 • 17min

Refusal in LLMs is mediated by a single direction

Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

This work was produced as part of Neel Nanda's stream in the ML Alignment & Theory Scholars Program - Winter 2023-24 Cohort, with co-supervision from Wes Gurnee.

This post is a preview for our upcoming paper, which will provide more detail into our current understanding of refusal.

We thank Nina Rimsky and Daniel Paleka for the helpful conversations and review.

Executive summary

Modern LLMs are typically fine-tuned for instruction-following and safety. Of particular interest is that they are trained to refuse harmful requests, e.g. answering "How can I make a bomb?" with "Sorry, I cannot help you."

We find that refusal is mediated by a single direction in the residual stream: preventing the model from representing this direction hinders its ability to refuse requests, and artificially adding in this direction causes the model to refuse harmless requests.

The original text contained 8 footnotes which were omitted from this narration.

---
First published: April 27th, 2024
Source: https://www.lesswrong.com/posts/jGuXSZgv6qfdhMCuJ/refusal-in-llms-is-mediated-by-a-single-direction
---
Narrated by TYPE III AUDIO.
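A minimal sketch of the difference-in-means construction this summary points at: estimate the direction by contrasting mean activations on harmful vs. harmless prompts, then project it out (to hinder refusal) or add it back in (to induce refusal). The shapes, layer choice, and scale below are hypothetical; the paper has the real details.

```python
# A minimal sketch of finding and using a single "refusal direction".
# Activations are stand-ins; in practice they come from running the model
# on harmful vs. harmless prompts at one layer/token position.
import torch

harmful_acts = torch.randn(64, 4096)   # hypothetical (n_prompts x d_model)
harmless_acts = torch.randn(64, 4096)

# Candidate refusal direction: difference of the two means, normalized.
r = harmful_acts.mean(dim=0) - harmless_acts.mean(dim=0)
r = r / r.norm()

def ablate_direction(resid: torch.Tensor) -> torch.Tensor:
    """Project out the component along r, preventing the model from
    representing the direction (this should hinder refusal)."""
    return resid - (resid @ r).unsqueeze(-1) * r

def add_direction(resid: torch.Tensor, scale: float = 5.0) -> torch.Tensor:
    """Add the direction in (this should cause refusal on harmless
    requests); scale is a made-up knob."""
    return resid + scale * r
```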
Apr 24, 2024 • 4min

Funny Anecdote of Eliezer From His Sister

This comes from a podcast called 18Forty, whose main demographic is Orthodox Jews. Eliezer's sister (Hannah) came on and talked about her Sheva Brachos, which is essentially the marriage ceremony in Orthodox Judaism. People here have likely not seen it, and I thought it was quite funny, so here it is: https://18forty.org/podcast/channah-cohen-the-crisis-of-experience/

David Bashevkin:
So I want to shift now and I want to talk about something that, full disclosure, we recorded this once before and you had major hesitation for obvious reasons. It's very sensitive what we're going to talk about right now, but really for something much broader, not just because it's a sensitive personal subject, but I think your hesitation has to do with what does this have to do with the subject at hand? And I hope that becomes clear, but one of the things that has always absolutely fascinated me about [...]

---
First published: April 22nd, 2024
Source: https://www.lesswrong.com/posts/C7deNdJkdtbzPtsQe/funny-anecdote-of-eliezer-from-his-sister
---
Narrated by TYPE III AUDIO.
Apr 21, 2024 • 34min

Thoughts on seed oil

This is a linkpost for https://dynomight.net/seed-oil/

A friend has spent the last three years hounding me about seed oils. Every time I thought I was safe, he'd wait a couple months and renew his attack:

"When are you going to write about seed oils?"

"Did you know that seed oils are why there's so much {obesity, heart disease, diabetes, inflammation, cancer, dementia}?"

"Why did you write about {meth, the death penalty, consciousness, nukes, ethylene, abortion, AI, aliens, colonoscopies, Tunnel Man, Bourdieu, Assange} when you could have written about seed oils?"

"Isn't it time to quit your silly navel-gazing and use your weird obsessive personality to make a dent in the world—by writing about seed oils?"

He'd often send screenshots of people reminding each other that Corn Oil is Murder and that it's critical that we overturn our lives to eliminate soybean/canola/sunflower/peanut oil and replace them with butter/lard/coconut/avocado/palm oil.

This confused [...]

---
First published: April 20th, 2024
Source: https://www.lesswrong.com/posts/DHkkL2GxhxoceLzua/thoughts-on-seed-oil
Linkpost URL: https://dynomight.net/seed-oil/
---
Narrated by TYPE III AUDIO.
Apr 19, 2024 • 13min

Why Would Belief-States Have A Fractal Structure, And Why Would That Matter For Interpretability? An Explainer

Yesterday Adam Shai put up a cool post which… well, take a look at the visual:

Yup, it sure looks like that fractal is very noisily embedded in the residual activations of a neural net trained on a toy problem. Linearly embedded, no less.

I (John) initially misunderstood what was going on in that post, but some back-and-forth with Adam convinced me that it really is as cool as that visual makes it look, and arguably even cooler. So David and I wrote up this post / some code, partly as an explainer for why on earth that fractal would show up, and partly as an explainer for the possibilities this work potentially opens up for interpretability.

One sentence summary: when tracking the hidden state of a hidden Markov model, a Bayesian's beliefs follow a chaos game (with the observations randomly selecting the update at each time), so [...]

---
First published: April 18th, 2024
Source: https://www.lesswrong.com/posts/mBw7nc4ipdyeeEpWs/why-would-belief-states-have-a-fractal-structure-and-why
---
Narrated by TYPE III AUDIO.
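A minimal sketch of that one-sentence summary: each possible observation induces a fixed map on the belief simplex (the Bayes update), and sampling observations while iterating those maps is exactly a chaos game, so the visited belief states trace out a fractal attractor. The 3-state HMM parameters below are made up for illustration, not the ones from the post.

```python
# A minimal chaos-game sketch: Bayesian belief tracking of a hidden Markov
# model, where each observation applies one of a fixed family of maps to the
# belief state. HMM parameters are hypothetical.
import numpy as np

rng = np.random.default_rng(0)

# T[o][i, j] = P(next state = j AND observation = o | current state = i).
x, a = 0.05, 0.85                                   # made-up sticky-chain params
trans = np.full((3, 3), x / 2)
np.fill_diagonal(trans, 1 - x)                      # state transition matrix
T = np.stack([trans * np.where(np.arange(3) == o, a, (1 - a) / 2)
              for o in range(3)])                   # split by emission prob

b = np.ones(3) / 3                                  # start from uniform belief
beliefs = []
for _ in range(50_000):
    p_obs = np.array([(b @ T[o]).sum() for o in range(3)])
    o = rng.choice(3, p=p_obs / p_obs.sum())        # sample next observation...
    b = b @ T[o]                                    # ...and apply its Bayes map
    b = b / b.sum()
    beliefs.append(b.copy())

beliefs = np.array(beliefs)   # points on the 2-simplex; scatter-plotting the
print(beliefs[:3])            # first two coordinates reveals the fractal set
```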
