

The Nonlinear Library
The Nonlinear Fund
The Nonlinear Library allows you to easily listen to top EA and rationalist content on your podcast player. We use text-to-speech software to create an automatically updating repository of audio content from the EA Forum, Alignment Forum, LessWrong, and other EA blogs. To find out more, please visit us at nonlinear.org
Episodes

Jun 18, 2024 • 1h 12min
EA - Are our Top Charities saving the same lives each year? by GiveWell
The podcast delves into the concern that top charity programs may be saving the same high-risk children repeatedly, potentially reducing the total lives saved. It examines the impact of repetitive saving on programs like Insecticide Treated Nets and Vitamin A Supplementation. Discussions include mortality risk parameters, skewed risk ratios, and challenges in determining treatment effects with minimal group differences.

Jun 18, 2024 • 1min
EA - Expressions of Interest: Starting an Islamic Effective Giving Org by Kaleem
Kaleem, a founder researching the launch of a new organization to make zakat more effective, discusses gathering expressions of interest for an org that would redirect zakat to effective charities. He invites individuals to join as co-founders or to contribute in other capacities, aiming to revamp the distribution of zakat.

Jun 18, 2024 • 5min
EA - Announcing AI Welfare Debate Week (July 1-7) by Toby Tremlett
Get ready for AI Welfare Debate Week on the Effective Altruism Forum, where experts will discuss the significance of AI welfare and its implications for future moral consideration. The chapter covers the setup of the debate, including a ranking system for impactful posts and a banner introduction.

Jun 18, 2024 • 27min
EA - Animal advocates should campaign to restrict AI precision livestock farming by Zachary Brown
Zachary Brown discusses the potential harm from AI to nonhuman animals, especially those on factory farms. He explores the risks of AI precision livestock farming and the implications for animal welfare. Animal advocates should consider campaigning against the use of AI in factory farms.

Jun 18, 2024 • 10min
LW - D&D.Sci Alchemy: Archmage Anachronos and the Supply Chain Issues Evaluation & Ruleset by aphyer
Archmage Anachronos and aphyer discuss potion brewing rules and strategies in a D&D scenario, focusing on magical and non-magical ingredients, empowering potions, and avoiding Magical Explosions. Detailed rules, outcomes, strategies, hidden patterns, and a player leaderboard are explored.

Jun 18, 2024 • 6min
LW - I would have shit in that alley, too by Declan Molony
Declan Molony discusses his experiences encountering homeless individuals in a major U.S. city, highlighting the lack of public restroom access and the complexities of interactions. He shares anecdotes of encounters with homeless individuals and reflects on how societal attitudes towards them can impact their lives.

Jun 18, 2024 • 4min
EA - Join GiveWell as a Research Analyst by GiveWell
Learn about the opportunities to join GiveWell as a Research Analyst, make a huge impact with your career, and increase funding for cost-effective programs that save and improve lives.

Jun 18, 2024 • 9min
EA - Why so many "racists" at Manifest? by Austin
The podcast discusses the controversy surrounding inviting controversial speakers to Manifest 2024, despite positive feedback on the event. It highlights the diversity of guests and sessions, emphasizing the importance of productive discourse and fostering intellectual debate.

Jun 17, 2024 • 3min
LW - Fat Tails Discourage Compromise by niplav
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Fat Tails Discourage Compromise, published by niplav on June 17, 2024 on LessWrong.
Say that we have a set of options, such as (for example) wild animal welfare interventions.
Say also that you have two axes along which you can score those interventions: popularity (how much people will like your intervention) and effectiveness (how much the intervention actually helps wild animals).
Assume that we (for some reason) can't convert between and compare those two properties.
Should you then pick an intervention that is a compromise on the two axes - that is, it scores decently well on both - or should you max out on a particular axis?
One thing you might consider is the distribution of options along those two axes: the distribution of interventions could be normal for both popularity and effectiveness, lognormal for both, or mixed (e.g. normal for popularity, and lognormal for effectiveness).
Intuitively, the distributions seem like they should affect the kinds of tradeoffs we can make, but how could we figure out how?
…
…
…
It turns out that if both properties are normally distributed, one gets a fairly large Pareto frontier, with a convex set of options, while if the two properties are lognormally distributed, one gets a concave set of options.
(Code here.)
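Since the linked code isn't reproduced here, the following is a minimal sketch of this kind of simulation (my own reconstruction under the setup above; function names and sample sizes are illustrative, not niplav's actual code): sample interventions on two axes from normal or lognormal distributions, then extract the Pareto frontier.

import numpy as np

def pareto_frontier(points):
    # Sort by the first axis descending; a point is on the frontier iff its
    # second coordinate beats that of every point with a larger first coordinate.
    order = points[np.argsort(-points[:, 0])]
    frontier, best_y = [], -np.inf
    for x, y in order:
        if y > best_y:
            frontier.append((x, y))
            best_y = y
    return np.array(frontier)

rng = np.random.default_rng(0)
n = 100_000

normal_opts = rng.normal(size=(n, 2))        # thin tails on both axes
lognormal_opts = rng.lognormal(size=(n, 2))  # heavy tails on both axes

for name, opts in [("normal", normal_opts), ("lognormal", lognormal_opts)]:
    f = pareto_frontier(opts)
    print(f"{name}: {len(f)} points on the Pareto frontier")

Plotting the frontier in each case (e.g. with matplotlib) shows the convex shape for normal axes and the concave shape for lognormal axes; making one axis normal and the other lognormal gives the mixed case discussed below.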
So if we believe that the interventions are normally distributed along both popularity and effectiveness, we would be justified in opting for an intervention that gets us the best of both worlds, such as sterilising stray dogs or finding less painful rodenticides.
If we, however, believe that popularity and effectiveness are lognormally distributed, we instead want to go in hard on only one of those, such as buying Brazilian beef that leads to Amazonian rainforest being destroyed, or writing a book of poetic short stories that detail the harsh life of wild animals.
What if popularity of interventions is normally distributed, but effectiveness is lognormally distributed?
In that case you get a pretty large Pareto frontier which almost looks linear to me, and it's not clear anymore that one can't get a good trade-off between the two options.
So if you believe that heavy tails dominate with the things you care about, on multiple dimensions, you might consider taking a barbell strategy and taking one or multiple options that each max out on a particular axis.
If you have thin tails, however, taking a concave disposition towards your available options can give you most of the value you want.
See Also
Being the (Pareto) Best in the World (johnswentworth, 2019)
Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org

Jun 17, 2024 • 33min
LW - Getting 50% (SoTA) on ARC-AGI with GPT-4o by ryan greenblatt
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Getting 50% (SoTA) on ARC-AGI with GPT-4o, published by ryan greenblatt on June 17, 2024 on LessWrong.
I recently got to 50%[1] accuracy on the public test set for ARC-AGI by having GPT-4o generate a huge number of Python implementations of the transformation rule (around 8,000 per problem) and then selecting among these implementations based on correctness of the Python programs on the examples (if this is confusing, go here)[2]. I use a variety of additional approaches and tweaks which overall substantially improve the performance of my method relative to just sampling 8,000 programs.
[This post is on a pretty different topic than the usual posts I make about AI safety.]
The additional approaches and tweaks are:
I use few-shot prompts which perform meticulous step-by-step reasoning.
I have GPT-4o try to revise some of the implementations after seeing what they actually output on the provided examples.
I do some feature engineering, providing the model with considerably better grid representations than the naive approach of just providing images; a toy sketch of this kind of representation follows this list. (See below for details on what a "grid" in ARC-AGI is.)
I use specialized few-shot prompts for the two main buckets of ARC-AGI problems (cases where the grid size changes vs. doesn't).
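As a toy sketch of the grid-representation point above (my own illustration, not the author's actual encoding): rendering a grid as ASCII text with row and column indices gives the model something far easier to parse than a raw list of numbers or an image alone.

def grid_to_ascii(grid):
    """Render an ARC grid (a list of rows of color indices) with row/column labels."""
    header = "    " + " ".join(str(c) for c in range(len(grid[0])))
    rows = [f"{r:2d}  " + " ".join(str(v) for v in row) for r, row in enumerate(grid)]
    return "\n".join([header] + rows)

print(grid_to_ascii([[0, 1, 0],
                     [1, 1, 1],
                     [0, 1, 0]]))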
The prior state of the art on this dataset was 34% accuracy, so this is a significant improvement.[3]
On a held-out subset of the train set, where humans get 85% accuracy, my solution gets 72% accuracy.[4] (The train set is significantly easier than the test set as noted here.)
Additional increases of runtime compute would further improve performance (and there are clear scaling laws), but this is left as an exercise to the reader.
In this post:
I describe my method;
I analyze what limits its performance and make predictions about what is needed to reach human performance;
I comment on what it means for claims that François Chollet makes about LLMs. Given that current LLMs can perform decently well on ARC-AGI, do claims like "LLMs like Gemini or ChatGPT [don't work] because they're basically frozen at inference time. They're not actually learning anything" make sense? (This quote is from here.)
Thanks to Fabien Roger and Buck Shlegeris for a bit of help with this project and with writing this post.
What is ARC-AGI?
ARC-AGI is a dataset built to evaluate the general reasoning abilities of AIs. It consists of visual problems like the below, where there are input-output examples which are grids of colored cells. The task is to guess the transformation from input to output and then fill out the missing grid. Here is an example from the tutorial:
This one is easy, and it's easy to get GPT-4o to solve it. But the tasks from the public test set are much harder; they're often non-trivial for (typical) humans. There is a reported MTurk human baseline for the train distribution of 85%, but no human baseline for the public test set, which is known to be significantly more difficult.
Here are representative problems from the test set[5], and whether my GPT-4o-based solution gets them correct or not.
[Problems 1, 2, and 3 are shown as grid images in the original post.]
My method
The main idea behind my solution is very simple: get GPT-4o to generate around 8,000 Python programs which attempt to implement the transformation, select a program which is right on all the examples (usually there are 3 examples), and then submit the output this program produces when applied to the additional test input(s). I show GPT-4o the problem as images and in various ASCII representations.
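A minimal sketch of that generate-and-select loop (my own reconstruction from the description above; sample_candidate_program stands in for the GPT-4o sampling call and is hypothetical):

def run_program(source, grid):
    """Execute candidate source that defines transform(grid) and apply it."""
    namespace = {}
    try:
        exec(source, namespace)          # candidate is expected to define transform()
        return namespace["transform"](grid)
    except Exception:
        return None                      # crashing candidates can never be selected

def solve(examples, test_inputs, sample_candidate_program, n_samples=8_000):
    # examples: list of (input_grid, output_grid) pairs, grids as nested lists.
    for _ in range(n_samples):
        source = sample_candidate_program(examples)   # hypothetical GPT-4o sampler
        # Select the first program that reproduces every provided example.
        if all(run_program(source, x) == y for x, y in examples):
            return [run_program(source, t) for t in test_inputs]
    return None                          # no sampled program fit all the examples

In practice the exec call would need sandboxing and a timeout, and the post's actual method layers revision steps and more careful selection on top of this skeleton.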
My approach is similar in spirit to the approach applied in AlphaCode in which a model generates millions of completions attempting to solve a programming problem and then aggregates over them to determine what to submit.
Actually getting to 50% with this main idea took me about 6 days of work. This work includes construct...


