Interconnects

Latest episodes

26 snips
May 6, 2025 • 8min

What people get wrong about the leading Chinese open models: Adoption and censorship

This discussion dives into the often-overlooked challenges surrounding the adoption of Chinese open AI models. Despite their advanced performance, Western companies hesitate to embrace them due to geopolitical tensions. The episode highlights misconceptions about censorship and the indirect influence of Chinese values in business practices. It explores the gap between technological capability and adoption, emphasizing how political concerns shape decision-making in AI spaces.
30 snips
Apr 30, 2025 • 19min

State of play of AI progress (and related brakes on an intelligence explosion)

https://www.interconnects.ai/p/brakes-on-an-intelligence-explosion

Intelligence explosions are far from a new idea in the technological discourse. They’re a natural thought experiment that follows from the question: What if progress keeps going?

From Wikipedia:

The technological singularity, or simply the singularity, is a hypothetical point in time at which technological growth becomes uncontrollable and irreversible, resulting in unforeseeable consequences for human civilization. According to the most popular version of the singularity hypothesis, I. J. Good's intelligence explosion model of 1965, an upgradable intelligent agent could eventually enter a positive feedback loop of successive self-improvement cycles; more intelligent generations would appear more and more rapidly, causing a rapid increase ("explosion") in intelligence which would culminate in a powerful superintelligence, far surpassing all human intelligence.

Given the recent progress in AI, it’s understandable to revisit these ideas. If you extrapolate the local constraints governing decisions within labs, the natural conclusion is an explosion.

Daniel Kokotajlo et al.’s AI 2027 forecast is far from a simple forecast of what happens without constraints. It’s a well-thought-out exercise in forecasting that rests on a few key assumptions of AI research progress accelerating due to improvements in extremely strong coding agents that mature into research agents with better experimental understanding. The core idea here is that these stronger AI models enable AI progress to change from 2x speed all the way up to 100x speed in the next few years. This number includes experiment time (i.e., the time to train the AIs), not just implementation time.

This is very unlikely. This forecast came at a good time for a summary of many ways the AI industry is evolving. What does it mean for AI as a technology to mature? How is AI research changing?
What can we expect in a few years?

In summary, AI is getting more robust in areas where we know it can work, and we’re consistently finding a few new domains of value where it can work extremely well. There are no signs that language model capabilities are on an arc similar to something like AlphaGo, where reinforcement learning in a narrow domain creates an intelligence way stronger than any human analog.

This post has the following sections:

* How labs make progress on evaluations,
* Current AI is broad, not narrow intelligence,
* Data research is the foundation of algorithmic AI progress,
* Over-optimism of RL training.

In many ways, this is more a critique of the AGI discourse generally, inspired by AI 2027, than a critique specifically of their forecast.

In this post, there will be many technical discussions of rapid, or even accelerating, AI research progress. Much of this falls into a technocentric world view where technical skill and capacity drive progress, but in reality, the biggest thing driving progress in 2025 is likely steep industrial competition (or international competition!). AI development and companies are still a very human problem, and competition is the most proven catalyst of performance.

See AI 2027 in its entirety, Scott Alexander’s reflections, their rebuttal to critiques that AI 2027 was ignoring China, Zvi’s roundup of discussions, or their appearance on the Dwarkesh Podcast. They definitely did much more editing and cohesiveness checks than I did on this response!

1. How labs make progress on evaluations

One of the hardest things to communicate in AI is talking down the various interpretations of evaluation progress looking vertical over time. If the evals are going from 0 to 1 in one year, doesn’t that indicate the AI models are getting better at everything super fast?
No, this is all about how evaluations are scoped as “reasonable” in AI development over time.

None of the popular evaluations, such as MMLU, GPQA, MATH, SWE-Bench, etc., that are getting released in a paper and then solved 18 months later are truly held out by the laboratories. They’re training goals. If these evaluations were unseen tests and going vertical, you should be much more optimistic about AI progress, but they aren’t.

Consider a recent evaluation, like Frontier Math or Humanity’s Last Exam. These evaluations are introduced with a performance of about 0-5% on leading models. Soon after the release, new models that could include data formatted for them are scoring above 20% (e.g. o3 and Gemini 2.5 Pro). These evaluations will continue to be the target of leading labs, and many researchers will work on improving performance directly.

These modern evaluations can become increasingly esoteric and hard for the sake of being hard. When will a power user of ChatGPT benefit from a model that solves extremely abstract math problems? Unlikely.

The story above could make more sense for something like MATH, which consists of hard but not impossible math questions. In the early 2020s, this was extremely hard for language models, but a few clicks of scaling made accurate mathematics a reasonable task, and laboratories quickly added similar techniques to the training data.

So this is how you end up with the plot from Epoch AI below: AI researchers figure out that a new evaluation is fair game for hill climbing with current techniques, and then they go all in on it.

Or the analogous version that can look even more shocking: the price falling for certain evaluations. This comes from two factors: laboratories getting better and better at core abilities in certain evaluations, and language model training getting far more efficient. Neither of these means that intelligence is rocketing.
This is a normal technological process: extreme efficiency at tasks we know we can do well.

In fact, it is a common job at AI laboratories to make new data that looks very close to popular evaluations. These laboratories can’t train on the test set directly for basic reasons of scientific integrity, but they can pay thousands to millions of dollars for new training data that looks practically identical. This is a very common practice and makes the hillclimbing on evaluations far less extraordinary.

AI capabilities in domains we are measuring aren't accelerating; they’re continuing. At the same time, AI’s abilities are expanding outwards into new domains. AI researchers solve domains when we focus on them, not really by accident. Generalization happens sometimes, but it is messy to track and argue for.

As the price of scaling kicks in, every subsequent task is getting more expensive to solve. The best benchmarks we have are correlated with real, valuable tasks, but many are not.

2. Current AI is broad, not narrow intelligence

Instead of stacking rapid evaluation progress onto one line as a cumulative, rapid improvement in intelligence, the above plots should make one think that AI is getting better at many tasks, rather than being superhuman in narrow tasks.

In a few years, we’ll look back and see that AI is now 95% robust on a lot of things that only worked 1-5% of the time today. A bunch of new use cases will surprise us as well. We won’t see AI systems that are so intelligent that they cause seismic shifts in the nature of certain domains. Software will still be software. AI will be way better than us at completing a code task and finding a bug, but the stacks we are working on will be largely subject to the same constraints.

Epoch AI had a very complementary post to this view.

There are many explanations for why this will be the case.
All of them rely on the complexity of the environment we are operating modern AI in being too high relative to the signal for improvement. The AI systems that furthest exceeded human performance in one domain were trained in environments where those domains were the entire world. AlphaGo is the perfect rendition of this.

AI research, software engineering, information synthesis, and all of the techniques needed to train a good AI model are not closed systems with simple forms of verification. Some parts of training AI systems are, such as wanting the loss to go down or getting more training tokens through your model, but those aren’t really the limiting factors right now on training.

The Wikipedia page for the singularity has another explanation for this that seems prescient as we open the floodgates to try and apply AI agents to every digital task. Paul Allen thought the deceleratory effects of complexity would be too strong:

Microsoft co-founder Paul Allen argued the opposite of accelerating returns, the complexity brake: the more progress science makes towards understanding intelligence, the more difficult it becomes to make additional progress. A study of the number of patents shows that human creativity does not show accelerating returns, but in fact, as suggested by Joseph Tainter in his The Collapse of Complex Societies, a law of diminishing returns. The number of patents per thousand peaked in the period from 1850 to 1900, and has been declining since. The growth of complexity eventually becomes self-limiting, and leads to a widespread "general systems collapse".

This may be a bit of an extreme case to tell a story, but it is worth considering.

Language models like o3 use a more complex system of tools to gain performance. GPT-4 was just a set of weights to answer every query; now ChatGPT also needs search, code execution, and memory.
The more layers there are, the smaller the magnitude of changes we’ll see.

This, of course, needs to be controlled for with inference costs as a constant. We still have many problems in AI that will be “solved” simply by us using 1,000X the inference compute on them.

3. Data research is the foundation of algorithmic AI progress

One of the main points of the AI 2027 forecast is that AI research is going to get 2X, then 4X, then 100X, and finally 1,000X as productive as it is today. This is based on end-to-end time for integrating new ideas into models and misinterprets the reality of what machine learning research is bottlenecked on. Scaling is getting more expensive. We don’t know what paradigm will come after reasoning for inference-time compute.

For machine learning research to accelerate at these rates, it needs to be entirely bottlenecked by compute efficiency and implementation difficulty, that is, by problems like getting the maximum theoretical FLOPs out of Nvidia GPUs and making the loss go as low as possible. These are things that people are currently doing and represent an important area of marginal gains in AI progress in recent years.

ML research is far messier. It is far more reliant on poking around the data, building intuitions, and launching yolo runs based on lingering feelings. AI models in the near future could easily launch yolo runs if we give them the compute, but they’re not using the same motivation for them. AI systems are going towards rapid cycles of trial and error to optimize very narrow signals.
These narrow signals, like loss or evaluation scores, closely mirror the RL scores that current models are trained on.

These types of improvements are crucial for making the model a bit better, but they are not the type of idea that gets someone to try to train GPT-3 in the first place or scale up RL to get something like o1.

A very popular question in the AI discourse today is “Why doesn’t AI make any discoveries despite having all of human knowledge?” (more here). Quoting Dwarkesh Patel’s interview with Dario Amodei:

One question I had for you while we were talking about the intelligence stuff was, as a scientist yourself, what do you make of the fact that these things have basically the entire corpus of human knowledge memorized and they haven't been able to make a single new connection that has led to a discovery?

The same applies to AI research. Models getting better and better at solving coding problems does not seem like the type of training that would enable this. We’re making our models better at the tasks that we know. This process is just as likely to narrow the total capabilities of the models as it is to magically instill impressive capabilities like scientific perspective.

As we discussed earlier in this piece, emergence isn’t magic; it’s a numerical phenomenon of evaluations being solved very quickly. AI research will get easier and go faster, but we aren’t heading for a doom loop.

The increased computing power AI researchers are getting their hands on is, for the time being, maintaining the pace of progress. As compute gets more expensive, maybe superhuman coding capabilities will continue to enable another few years of rapid progress, but eventually, saturation will come. Current progress is too correlated with increased compute to believe that this will be a self-fulfilling feedback loop.

There’s a saying in machine learning research that the same few ideas are repeated over and over again.
Here’s an extended version of this that leans in and says that there are no new ideas in machine learning, just new datasets:

The data problem is not something AI is going to have an easy time with.

One of the examples here is in post-training. We’ve been using the same loss functions forever, and we are hill-climbing rapidly by clever use of distillation from bigger, stronger models. The industry standard is that post-training is messy and involves incrementally training (and maybe merging) many checkpoints to slowly interweave new capabilities for the model. It’s easy to get that wrong, as we’ve seen with the recent GPT-4o sycophancy crisis, and lose the narrow band of good vibes for a model. I doubt AI supervision can monitor vibes like this.

For example, in Tülu 3 we found that a small dataset of synthetic instruction-following data had a second-order effect that improved overall performance in things like math and reasoning as well. This is not a hill that can be climbed on, but rather a lucky find.

AI research is still very messy and does not look like LeetCode problems or simple optimization hillclimbing. The key is always the data, and language models are not much better than humans at judging between different responses.

4. Over-optimism of RL training

A lot of people are really excited about scaling up RL training further right now, which will inevitably involve extending to more domains. Some of the most repeated ideas are adding RL training to continually fine-tune the model in real-world scenarios, including everything from web tasks to robotics and scientific experiments. There are two separate problems here:

* Continually training language models to add new capabilities to models “in flight” in production is not a solved problem,
* Training models to take actions in many domains.

The first problem is something that I’m confident we’ll solve.
It’s likely technically feasible now that RL is the final stage of post-training and is becoming far more stable. The challenge with it is more of a release and control problem, where a model being trained in-flight doesn’t have time for the usual safety training. This is something the industry can easily adapt to, and we will as traditional pretraining scaling saturates completely.

The second issue is putting us right back into the territory of why projects on scaling robotics or RL agents to multiple domains are hard. Even the most breakthrough works from DeepMind, like GATO (multi-domain RL control) or RT-X (multi-robot control policies), have major caveats alongside their obvious successes.

Building AI models that control multiple real-world systems is incredibly hard for many reasons, some of which involve:

* Different action spaces across domains mandate either modifying the domain to suit the underlying policy, which in this case is converting all control tasks to language, or modifying the model to be able to output more types of tokens.
* The real world is subject to constant drift, so the constant fine-tuning of the model will need to do as much to just maintain performance on systems with real degradation as it will need to learn to use them in the first place.

This sort of scaling RL to new types of domains is going to look much more like recent progress in robotics research than the takeoff pace of reasoning language models. Robotics progress is a slow grind and feels so different that it is hard to describe concisely. Robotics faces far more problems due to the nature of the environment rather than just the learning.

The current phase of RL training is suited for making the models capable of performing inference-time scaling on domains they have seen at pretraining.
Using these new RL stacks to learn entirely new, out-of-domain problems is a new research area.

If this is the next paradigm outside of inference-time scaling, I will be shocked, but obviously excited. We don’t have the evidence to suggest that it will be. The RL training we’re going to get is continuing to hill climb on search and code execution, giving us Deep Research plus plus, not an omnipotent action-taking model.

A world with compute shifting to inference

While the AI research world is dynamic, engaging, and rapidly moving forward, some signs of the above being correct could already be emerging. A basic sign for this future coming true will be the share of compute spent on research decreasing relative to inference amid the rapid buildout. If extremely rapid AI progress were available for organizations that put in marginally more compute, serving inference would be a far lower priority. If investing in research has a positive feedback loop on your potential business revenue, they’d all need to do it.

For example, consider our discussion of Meta’s compute allocation during Dylan’s and my appearance on the Lex Podcast:

(01:03:56) And forever, training will always be a portion of the total compute. We mentioned Meta’s 400,000 GPUs. Only 16,000 made Llama 3.

OpenAI is already making allocation trade-offs on their products, regularly complaining about GPUs melting. Part of the reason they, or anyone, could release an open-weights model is to reduce their inference demand. Make the user(s) pay for the compute.

Part of the U.S.’s economic strength is a strong services sector. AI is enabling that, and the more it succeeds there, the more companies will need to continue to enable it with compute.

With the changing world economic order, cases like Microsoft freezing datacenter buildouts are correlated indicators.
Microsoft’s buildout is correlated with many factors, only one of which is potential training progress, so it’s far from a sure thing.

In reality, with the large sums of capital at play, it is unlikely that labs give free rein over billions of dollars of compute to so-called “AI researchers in the datacenter” because of how constrained compute is at all of the top labs. Most of that compute goes to hillclimbing on fairly known gains for the next model! AI research with AI aid will be a hand-in-hand process and not an autonomous take-off, at least on a timeline of a few years into the future.

AI will make a ton of progress, but it will not be an obvious acceleration. With traditional pretraining saturating, it could even be argued that after the initial gains of inference-time compute, research is actually decelerating, but it will take years to know for sure.

Thanks to Steve Newman and Florian Brand for some early feedback on this post and many others in the Interconnects Discord for discussions that helped formulate it.

This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit www.interconnects.ai/subscribe
27 snips
Apr 28, 2025 • 14min

Transparency and (shifting) priority stacks

https://www.interconnects.ai/p/transparency-and-shifting-priority

The fact that we get new AI model launches from multiple labs detailing their performance on complex and shared benchmarks is an anomaly in the history of technology products. Getting such clear ways to compare similar software products is not normal. It goes back to AI’s roots as a research field and its growing pains in becoming something else. Ever since ChatGPT’s release, AI has been transitioning from a research-driven field to a product-driven field.

We had another example of the direction this is going just last week. OpenAI launched their latest model on a Friday with minimal official documentation and a bunch of confirmations on social media. Here’s what Sam Altman said:

Officially, there are “release notes,” but these aren’t very helpful.

We’re making additional improvements to GPT-4o, optimizing when it saves memories and enhancing problem-solving capabilities for STEM. We’ve also made subtle changes to the way it responds, making it more proactive and better at guiding conversations toward productive outcomes. We think these updates help GPT-4o feel more intuitive and effective across a variety of tasks–we hope you agree!

Another way of reading this is that the general capabilities of the model, i.e. traditional academic benchmarks, didn’t shift much, but internal evaluations such as user retention improved notably.

Of course, technology companies do this all the time. Google is famous for A/B testing to find the perfect button, and we can be sure Meta is constantly improving their algorithms to maximize user retention and advertisement targeting. This sort of lack of transparency from OpenAI is only surprising because the field of AI has been different.

AI has been different in its operation, not only because of its unusually fast transition from research to product, but also because many key leaders thought AI was different. AI was the crucial technology that we needed to get right.
This is why OpenAI was founded as a non-profit, and existential risk has been a central discussion. If we believe this technology is essential to get right, its releases need to be handled differently.

OpenAI releasing a model with no official notes is the clearest signal we have yet that AI is a normal technology. OpenAI is a product company, and its core users don’t need clear documentation on what’s changing with the model. Yes, they did have better documentation for their recent API models in GPT-4.1, but the fact that those models aren’t available in their widely used product, ChatGPT, means they’re not as relevant.

Sam Altman sharing a model launch like this is minor in a single instance, but it sets the tone for the company and industry broadly on what is an acceptable form of disclosure.

The people who need information on the model are people like me, people trying to keep track of the roller coaster ride we’re on so that the technology doesn’t cause major unintended harms to society. We are a minority in the world, but we feel strongly that transparency helps us keep a better understanding of the evolving trajectory of AI.

This is a good time for me to explain with more nuance the different ways transparency serves AI in the broader technological ecosystem, and how everyone is stating what their priorities are through their actions. We’ll come back to OpenAI’s obvious shifting priorities later on.

The type of openness I’ve regularly advocated for at the Allen Institute for AI (Ai2), with all aspects of the training process being open so everyone can learn and build on it, is in some ways one of the most boring types of priorities possible for transparency. It’s taken me a while to realize this. It relates to how openness and the transparency it carries are not a binary distinction, but rather a spectrum.

Transparency and openness occur at each aspect of the AI release process.
The subtle differences in decisions, from licenses to where your model is hosted or whether the weights are available publicly at all, fall on a gradient. The position I advocate for is on the extreme, which is often needed to enact change in the world these days. I operate at the extreme of a position to shift the reality that unfolds in the middle of the discourse. This also makes me realize what other priorities I’m implicitly devaluing by putting openness on the top. With finite effort, there are always trade-offs.

Many companies don’t have the ability to operate at such an extreme as Ai2 or I do, which results in much more nuanced and interesting trade-offs in what transparency is enabling. Both OpenAI and Anthropic care about showing the external world some inputs to their models’ behaviors. Anthropic’s Constitution for Claude is a much narrower artifact, showing some facts about the model, while OpenAI’s Model Spec shows more intention and opens it up to criticism.

Progress on transparency will only come when more people realize that a lot of good can be done by incrementally more transparency. We should support people advocating for narrow asks of openness and understand their motivations in order to make informed trade-offs. For now, most of the downsides of transparency I’ve seen are in the realm of corporate competition, once you accept basic realities like frontier model weights from the likes of OpenAI and Anthropic not getting uploaded to HuggingFace.

Back to my personal position around openness: it also happens to be really aligned with technological acceleration and optimism. I was motivated to this line of work because openness can help increase the net benefit of AI. This is partially accelerating the adoption of it, but also enabling safety research on the technology and mitigating any long-term structural failure modes.
Openness can enable many more people to be involved in AI’s development: think of the 1000s of academics without enough compute to lead on AI who would love to help understand and provide feedback on frontier AI models. Having more people involved also spreads knowledge, which reduces the risk of concentration of power.

For multiple years, I’ve feared that powerful AI will make companies even more powerful economically and culturally. My readers don’t need warnings on why technology that is way more personable and engaging than recommendation systems, while keeping similar goals, can push us in more negative rather than positive directions. Others commenting here have included Meta’s Mark Zuckerberg in Open Source AI is the Path Forward and Yann LeCun in his many comments on X; they both highlight concentration of power as a major concern.

Still, someone could come to the same number one priority of complete technical openness as I have through the ambition of economic growth, if they think that open-source models being on par can make the total market for AI companies larger. This accelerationism can also have phrasings such as “We need the powerful technology ASAP to address all of the biggest problems facing society.” Technology moving fast always has negative externalities on society that we have to manage.

Another popular motivation for transparency is to monitor the capabilities of frontier model development (recent posts here and here). Individuals advocating for this have a priority stack that has a serious short-term concern of an intelligence explosion or super-powerful AGI. My stack of priorities is the one that worries about the concentration of power, which takes time to accrue, and that assigns a low probability to an intelligence takeoff.
A lot of the transparency interventions advocated by this group, such as Daniel Kokotajlo on his Dwarkesh Podcast episode discussing AI 2027, align with subgoals I have.

If you’re not worried about either of these broad “safety” issues (concentration of power or dangerous AI risk), then you normally don’t weigh transparency very highly and prioritize other things, mostly pure progress, competition, and pricing. If we get into the finer-grained details on safety, such as explaining intentions and process, that’s where my goals would differ from an organization like a16z that has been very vocal about open-source. They obviously have a financial stake in the matter, which is enabled by making things useful rather than easier to study.

There are plenty more views that are valid for transparency. Transparency is used as a carrot by many different types of regulatory intervention. Groups with different priorities and concerns in the AI space will want transparency around different aspects of the AI process. These can encompass motives of the researchers, artifacts, method documentation, and many more things.

The lens I’m using to understand trade-offs in transparency is a priority stack, an evolution of the Principle Stack, revisited many times in the last 5+ years of the Stratechery universe. The core idea is that whether or not you like it, every business and decision is governed by a set of priorities ranked relative to each other. Everyone has things that they care about more and less, even if the issues are both extremely important. It is the basis for making trade-offs in determining the direction of businesses.

Some examples of who could advocate for information on what in the AI ecosystem include:

* Capability transparency: keeping the public informed of progress of models that may be unreleased, primarily to keep track of a potential intelligence explosion.
This often includes new types of systems now that AI agents are working.
* Base model transparency: these are most useful for people wanting to understand the role of pretraining on AI dynamics. The base models of today can easily follow instructions and do reasoning, but they’re less robust than the full final model. These are diminishing as a target of transparency, as reasoning and post-training grow in importance.
* Pre-moderation model transparency (endpoints without moderation filter, models without some refusals data): to test the evolution of content risk for models that may be deployed without moderation endpoints, such as open weight models, which tend to be released just months after closed models with similar capabilities.
* Reward model transparency (and more extreme, preference data collection instructions): those interested in the original goals of alignment, i.e. value alignment, can use these to test how the models’ views vary across different groups and test if the intended model preferences are picked up in the preference training process (i.e. relative to the instructions given to data labelers).
* Training specification transparency (Model Specs, Constitutions, and other goal-setting documents): there are so many people who would want to know why the model behaves a certain way. I’ve mentioned these benefits before:
  * Developers: Know what future models will become, which helps create a stable platform.
  * Regulators: Transparency into what the heck frontier labs care about, which helps understand the directions AI is going, and the motivations of super powerful companies.
  * Internal: Focus on defining and delivering your goals (separate from this transparency discussion).

There are also subtleties in these discussions, such as how structured access to models can serve different but complementary goals to open weights.
Structured access is a set of programs where prescreened individuals can use models in a secure environment and operate independently from the AI laboratories themselves.

This could be seen as a separate direction to transparency, where instead of the public getting the information or artifact, only a few pre-approved people do. In reality, structured access is a complement to transparency and will be needed for details that the companies cannot disclose publicly without substantial business competitiveness risk, such as novel algorithmic tricks that substantially modify how the AI works, or without real-world harm, such as model weights before safety interventions.

Some parts of AI should be accessible to the general public, and some to third-party testers. Currently, all of the transparency and access is below the safest equilibrium. We need more of both.

One of the most ignored details is just how access is implemented. A recent paper from Irene Solaiman et al. describes how releasing components is one step in sharing information and artifacts:

Generative AI release decisions determine whether system components are made available, but release does not address many other elements that change how users and stakeholders are able to engage with a system. Beyond release, access to system components informs potential risks and benefits. Access refers to practical needs, infrastructurally, technically, and societally, in order to use available components in some way.

The authors break access down into three axes:

* Resourcing: Infrastructural needs to host and serve.
* Usability: Varied technical skill levels can engage.
* Utility: Qualities (e.g. multilingual) with user utility.

As our models at Ai2 are becoming more capable, my relationship as a developer with my downstream users has changed. The models I’ve worked on have shifted from those primarily motivated by values, with the transparency we’re discussing being of top value, to now also adding utility as a much higher weight.
People want to use some of our models in real applications. While my priority stack hasn’t changed (openness is still the top value), the way it’s implemented is shifting. I’m no longer racing to get all of our results hot off the press into the world because of the time it takes to support them (support costs rise in proportion to the user base).

Other key players in the AI space have obviously changed their priority stack.

OpenAI’s recent actions confirm that ChatGPT as a product is its top priority. Transparency and safety have been moving down on their list of priorities in favor of growth. This is partially due to increased competition, but also due to a shifting political landscape. OpenAI’s coming release of an open model doesn’t shift this priority stack for me.

I used to hear a lot about OpenAI’s pre-release testing and the accompanying non-disclosure agreements. This quiet model drop being “the quickest we've shipped an update to our main 4o line” shows that safety is moving down their priority stack. This isn’t to say that their safety changes are immediately concerning to me, but rather that there are trade-offs in everything. OpenAI is moving cultural norms in leading AI away from releases with detailed evaluation metrics and towards the quiet, consistent drip of updates typical of a normal technology company.

Thanks to Miles Brundage for a discussion that helped motivate this post. This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit www.interconnects.ai/subscribe
103 snips
Apr 19, 2025 • 11min

OpenAI's o3: Over-optimization is back and weirder than ever

The discussion dives into the intriguing phenomenon of over-optimization in reinforcement learning. It highlights how this issue impacts language models and leads to unexpected behaviors, such as gibberish output. The hosts explore the new o3 model from OpenAI, showcasing its unique inference abilities and the balance between enhanced performance and potential pitfalls. Real-world examples, like the cartwheeling cheetah, illustrate the challenges of reward design and task generalization in AI development.
29 snips
Apr 14, 2025 • 7min

OpenAI's GPT-4.1 and separating the API from ChatGPT

Dive into the latest advancements from OpenAI, including the new GPT-4.1 model and its strategic shift separating ChatGPT from its API. Explore how the improved memory feature enhances user experience by recalling past conversations. Delve into the competitive landscape where this innovation stands against Google's Gemini. Discover why OpenAI is focusing on making its ChatGPT app uniquely appealing, blending personality and functionality to captivate users amidst a slew of other AI products.
43 snips
Apr 7, 2025 • 11min

Llama 4: Did Meta just push the panic button?

Meta's latest AI model, Llama 4, is met with skepticism as it lacks the excitement of its predecessors. The discussion highlights Meta's struggle with lengthy release times, leading to unmet expectations. There's a deep dive into the evolution of Meta’s open models, from OPT to Llama 3, showcasing both triumphs and pitfalls. The podcast also critiques Meta’s waning community support and the challenges posed by new regulations impacting their future in AI.
36 snips
Apr 5, 2025 • 16min

RL backlog: OpenAI's many RLs, clarifying distillation, and latent reasoning

Reinforcement learning is experiencing a major revival in the AI landscape, with exciting applications branching across OpenAI's models. The discussion dives into the innovative techniques of model distillation and how latent reasoning enhances model efficiency. Self-assessment in AI systems is also tackled, emphasizing the significance of having AI independently verify its own knowledge and decisions. This interplay between traditional programming and modern approaches reveals the evolving nature of AI's reliability.
42 snips
Mar 26, 2025 • 12min

Gemini 2.5 Pro and Google's second chance with AI

The launch of Gemini 2.5 Pro marks a significant leap in AI, outperforming competitors like GPT-4 Turbo on important benchmarks. The podcast discusses the evolution of reasoning models, highlighting their technical prowess in today's landscape. It delves into the competitive dynamics of AI, emphasizing the need for rapid deployment and better user experiences. Google's strategic shift aims to capitalize on its vast infrastructure, positioning itself as a leader in AI innovation rather than just another contender.
Mar 19, 2025 • 13min

Managing frontier model training organizations (or teams)

https://www.interconnects.ai/p/how-to-manage-ai-training-organizations

It is a closely guarded secret how the leading AI laboratories structure their training teams. As with other technology companies, the saying “you ship your org chart” still applies to training AI models. Looking at these organizational structures will reveal where research can be scaled up, the upper limits of size, and potentially even who uses the most compute.

How modeling teams do and do not work

A crucial area I’m working on (reach out if you would like to share more off the record) is how to scale these lessons to bigger, more complex teams. The core factor differentiating teams that succeed from those that do not is maintaining these principles while scaling team size.

Big teams inherently lead to politics and protecting territory, while language models need information to flow from the bottom to the top on what capabilities are possible. Regardless of the possibilities, leadership can shift resources to prioritize certain areas, but all of the signals on whether this is working come from those training models. If senior directors mandate results under them before unblocking model releases, the entire system will crumble.

This potential end state (without naming specific companies) is obviously desirable to avoid, but anticipating and avoiding it during rapid growth takes substantial intentionality.

Within training, the planning for pretraining and post-training traditionally could be managed differently. Pretraining has fewer, bigger runs, so improvements must be slotted in for those few annual runs. Post-training improvements can largely be continuous. These operational differences, on top of the obvious cost differences, also make post-training far more approachable for non-frontier labs (though still extremely hard).

Both teams have bottlenecks where improvements must be integrated. Scaling the pretraining bottlenecks (i.e. those making the final architecture and data decisions) seems impossible, but scaling teams around data acquisition, evaluation creation, and integrations is very easy. A large proportion of product decisions for AI models can be made irrespective of modeling decisions. Scaling these is also easy.

Effectively, organizations that fail to produce breakthrough models can do tons of low-level meaningful research, but adding organizational complexity dramatically increases the risk of “not being able to put it together.”

Another failure mode of top-down development, rather than bottom-up information, is that leaders can mandate that the team follow a technical decision that is not supported by experiments. Managing so-called “yolo runs” well is a coveted skill, but one that is held close to the models. Of course, so many techniques still work that mandates don’t have a 100% failure rate, but it sets a bad precedent.

Given the pace of releases and progress, it appears that Anthropic, OpenAI, DeepSeek, Google Gemini, and some others have positive forms of this bottom-up culture with extremely skilled technical leads managing complexity. Google took the longest to get it right with re-orgs, muddled launches (remember Bard), and so on. With the time lag between Meta’s releases, it still seems like they’re trying to find this culture to maximally express their wonderful talent and resources.

With all of this and off-the-record conversations with leadership at frontier AI labs, I have compiled a list of recommendations for managing AI training teams. This is focused on modeling research and does not encompass the majority of headcount in the leading AI companies.

Interconnects is a reader-supported publication.
Consider becoming a subscriber.

Recommendations

The most effective teams that regularly ship leading models follow many of these principles:
* The core language modeling teams remain small as AI companies become larger.
* For smaller teams, you can still have everyone in one room; take advantage of this. For me personally, I think this is where remote teams can be detrimental. In-person works for this, at least when best practices are evolving so fast.
* Avoid information silos. This goes for both teams and individuals. People need to quickly be able to build on the successes of those around them, and clear communication during consistent rapid progress is tricky.
* For larger teams, you can scale teams only where co-design isn’t needed. Where interactions aren’t needed there can be organizational distance.
  * An example would be one team focusing on post-training algorithms & approaches while other teams handle model character, model variants for API, etc. (specifications and iterations).
  * Another example is that reasoning teams are often separate from other pieces of post-training. This applies only to players that have scaled.
* Language model deployment is very much like early startup software. You don’t know exactly what users want nor what you can deliver. Embrace the uncertainty and learn quickly.
* Do not overly try to separate engineering teams from training. Engineering needs to build tools for the next-generation model and cannot do this without talking to researchers.
* Evergreen research is separate from the language modeling team itself, but still sits within “research”. Otherwise, it will be impossible to prioritize truly long-term ideas. Long-term goals are fragile and need nurturing. Language modeling is about the next 1, or maybe 2, models.
* A lot of the sexy work is not that helpful and a lot of the useful work isn't sexy.
Data is the prime example, as it is often the most impactful type of work.
* Expect failed training runs and do not overreact to them along the way.

Failure modes

High-priority projects can fail if you…
* Try to ship too many models for each capability improvement. Instead, stick to a set schedule of model training. Have fewer models that are more capable.
* Try to force contributions from individual teammates into the final product. Do not sacrifice performance for personalities in search of “a contribution”.
* Let in teams that try to territorially force their way into contributing to the big company goal.
* Scale the training organization too much. Having too many people “doing stuff” and adding noise to the organization detracts from high-level direction and focus on the execution of specific goals. (This can also relate to the first point, in the form of trying to do too much in one model.)
* Let politics grow, take many forms, and cause intertwined issues. Do not lose the sense of results being the #1 driving factor of decisions. Bad decisions here compound.
* Over-index on a single model evaluation, which will hamper (or flat-out block) real progress in other areas.

Before the rest of the post, which expands on the topics above, you may be interested in previous articles on this topic.

Related writing

For more reading on how language modeling teams work, see some of my other writing here, on team structure, and… managing risk.

An example of how mid-sized training projects work

I recently got a list of questions on how training for Tülu 3 operated (which is really a post-training analog to OLMo). I figured I would share these, and they serve as a foundation for me to gather useful information from friends at frontier labs on how representative it is.

With reasoning models, most of this translates directly.
Infrastructure is becoming more important because generating long sequences is particularly memory intensive (and can expose issues in open-source tools for inference), but when the time comes to make a state-of-the-art fully open reasoning recipe, the lessons learned here will apply directly.

1. How long does a large post-training project take?

Tülu 3 was the focus of our post-training team from mid-July until its release on November 21st, 2024. We were building on our previous recipes, in Tülu 2/2.5, so not very much of this was catching up on internal know-how, but rather integrating new external resources. If a team like this had been working continuously all year on the same focus, it would’ve taken approximately one month less to achieve these results. Bootup takes substantial time, as does release management.

2. How do you choose the right personnel for a moderately sized training project?

A project like Tülu 3 or any other effort to push the frontier of AI in a popular area normally takes a moderately sized team. The smaller the niche, the smaller the team you need. The team at Ai2, among the 20+ authors, is researcher-heavy rather than engineer-heavy. If prioritizing only performance on known techniques, the ratio of engineers can be far higher. Pushing the frontier takes 10x the resources of repeating extensively documented work.

In the case of Tülu 3, where most of the techniques are not known, the proportion of researchers is obviously higher. For companies trying to scope who to hire for modeling teams, though, this is not a trivial problem. First, one must scope the level of uncertainty in the domain of interest and then hire around it. Applying Tülu-style approaches could definitely be done with a team of 2-4 focused engineers.

3. What model sizes are used for iteration? How do results scale?

A core principle of modeling research is to iterate at the smallest model that provides a reliable signal.
This is the entire principle behind scaling laws as a de-risking tool. In post-training, compute costs are substantially lower, so the models used can actually be bigger. In this case, given a project designed around the Llama 3.1 base models, ~80% or more of experiments were at the 8B scale (normally 8 or 32 H100s, finishing in <1 day), ~19% at the 70B scale (normally 32 or 64 H100s, finishing in 2-3 days), and only a handful of runs at the 405B scale that were using 256 GPUs each for multiple days. In overall GPU utilization, the project utilized 100-600 GPUs concurrently for the entire 4-5 month span.

These days, results tend to transfer extremely well when scaling. Bigger models may need less data, especially less general data, and a gentler optimization (lower learning rate usually), but transfer hasn’t been a challenge. Changing base models is harder than scaling with post-training techniques.

4. How many experiments are actually run?

The Tülu project evaluated about 1000 checkpoints in our process. This feels about right for a major post-training process. Some of these are intermediate or competitor models, but most of them, 100s, are experimental training runs. The model scores can be plotted in a time sequence with the metadata we collected (credit Hamish Ivison for the plot). When you squint, it is largely a logarithmic curve with faster gains at the beginning and leveling off at the end. Of course, you can also see the flurry of models trained right in the last few weeks.

5. What is the biggest bottleneck on progress?

All of these projects are bottlenecked by available compute. Making systems more efficient is a compute multiplier, but if the starting point in the number of GPUs is too low, it won’t matter.
There’s often potential to accelerate projects by adding more people to explorations, whether it’s training approaches like process reward models (PRMs) or data curation, but scaling management and integration of data across numerous evaluations can be tricky. Best practices for models with 100s of target evaluations (as done in frontier laboratories), rather than the ~10 we used, are far from established.

The second bottleneck would be personnel willing to constantly grind on new data experiments. Focus on data almost always pays off fairly quickly.

6. What would I need to get a serious post-training effort off the ground from a cold start?

Finetuning has such a large gradation that impact can be made with almost any team size. To do truly excellent work takes mostly patience and proportional resources. Getting the model exactly right takes retraining many times even after you hit your initial benchmarking goals.

For companies focusing on local models, a few nodes of H100s (~100 GPUs) could go a very long way. For companies trying to make truly state-of-the-art models above the 7B scale, trying to do so with <500 H100 GPUs is likely not worth it. It is very easy to be stuck in the middle, and compute is still the largest determining factor of success.

These numbers will come down as best practices of distillation from strong models are established, but this knowledge is far from settled. If you want to invest in training, you need to do enough to move the frontier, or else you will inevitably fall behind and it would be better to ride on others’ coattails.

7. What is the hardest part of these projects? Where do you actually spend time?

Training projects take a lot of time and a lot of attention to detail. Teams need extreme isolation from other company goals to focus on their one goal of training. The hardest part is often this: having all the members of the training team focus on one single output for sustained periods.
Tracking down recent developments, running small experiments with training algorithms, curating data (likely where most of the hours go, as babysitting GPUs is largely an idle activity), etc. are all the bread and butter of solid engineering talent. Success is downstream of good decision-making by tech leads and managers while getting many small shots on goal.

In the case of projects like Tülu 3, the reason we don’t immediately transition to Tülu 4 is that people have other interests. Companies that directly align training with their bottom line don’t need to do this.

Thanks to Nicole Fitzgerald, Finbarr Timbers (Midjourney was not one of the companies I studied), and others unnamed at leading AI laboratories for comments or input that helped with this post.
12 snips
Mar 13, 2025 • 14min

Gemma 3, OLMo 2 32B, and the growing potential of open-source AI

The discussion centers on the exciting breakthroughs in open-source AI, specifically the release of OLMo 2 32B, which rivals GPT-4. The challenges faced by small players in the open-source arena are explored, showcasing the need for transparency and innovation. Listeners will learn about the contrasting approaches of OLMo and Gemma 3, alongside the significance of non-profits and academia in advancing open-source developments. Overall, it's a deep dive into the evolving landscape of AI and the implications of open accessibility.
