Interconnects

Latest episodes

26 snips
Apr 7, 2025 • 11min

Llama 4: Did Meta just push the panic button?

Meta's latest AI model, Llama 4, is met with skepticism as it lacks the excitement of its predecessors. The discussion highlights Meta's struggle with lengthy release times, leading to unmet expectations. There's a deep dive into the evolution of Meta’s open models, from OPT to Llama 3, showcasing both triumphs and pitfalls. The podcast also critiques Meta’s waning community support and the challenges posed by new regulations impacting their future in AI.
34 snips
Apr 5, 2025 • 16min

RL backlog: OpenAI's many RLs, clarifying distillation, and latent reasoning

Reinforcement learning is experiencing a major revival in the AI landscape, with exciting applications branching across OpenAI's models. The discussion dives into the innovative techniques of model distillation and how latent reasoning enhances model efficiency. Self-assessment in AI systems is also tackled, emphasizing the significance of having AI independently verify its own knowledge and decisions. This interplay between traditional programming and modern approaches reveals the evolving nature of AI's reliability.
43 snips
Mar 26, 2025 • 12min

Gemini 2.5 Pro and Google's second chance with AI

The launch of Gemini 2.5 Pro marks a significant leap in AI, outperforming competitors like GPT-4 Turbo on important benchmarks. The podcast discusses the evolution of reasoning models, highlighting their technical prowess in today's landscape. It delves into the competitive dynamics of AI, emphasizing the need for rapid deployment and better user experiences. Google's strategic shift aims to capitalize on its vast infrastructure, positioning itself as a leader in AI innovation rather than just another contender.
Mar 19, 2025 • 13min

Managing frontier model training organizations (or teams)

https://www.interconnects.ai/p/how-to-manage-ai-training-organizations

It is a closely guarded secret how the leading AI laboratories structure their training teams. As with other technology companies, the saying “you ship your org chart” still applies to training AI models. Looking at these organizational structures will reveal where research can be scaled up, the upper limits of size, and potentially even who uses the most compute.

How modeling teams do and do not work

A crucial area I’m working on (reach out if you would like to share more off the record) is how to scale these lessons to bigger, more complex teams. The core factor differentiating teams that succeed from those that do not is maintaining these principles while scaling team size.

Big teams inherently lead to politics and protecting territory, while language models need information to flow from the bottom to the top on what capabilities are possible. Regardless of the possibilities, leadership can shift resources to prioritize certain areas, but all of the signals on whether this is working come from those training models. If senior directors mandate results under them before unblocking model releases, the entire system will crumble.

This potential end state (without naming specific companies) is obviously one to avoid, but anticipating and avoiding it during rapid growth takes substantial intentionality.

Within training, the planning for pretraining and post-training traditionally could be managed differently. Pretraining has fewer, bigger runs, so improvements must be slotted in for those few annual runs. Post-training improvements can largely be continuous. These operational differences, on top of the obvious cost differences, also make post-training far more approachable for non-frontier labs (though still extremely hard).

Both teams have bottlenecks where improvements must be integrated. Scaling the pretraining bottlenecks (i.e. those making the final architecture and data decisions) seems impossible, but scaling teams around data acquisition, evaluation creation, and integrations is very easy. A large proportion of product decisions for AI models can be made irrespective of modeling decisions. Scaling these is also easy.

Effectively, organizations that fail to produce breakthrough models can do tons of low-level meaningful research, but adding organizational complexity dramatically increases the risk of “not being able to put it together.”

Another failure mode of top-down development, rather than bottom-up information, is that leaders can mandate that the team follow a technical decision that is not supported by experiments. Managing so-called “yolo runs” well is a coveted skill, but one that is held close to the models. Of course, so many techniques still work that mandates don’t have a 100% failure rate, but it sets a bad precedent.

Given the pace of releases and progress, it appears that Anthropic, OpenAI, DeepSeek, Google Gemini, and some others have positive forms of this bottom-up culture with extremely skilled technical leads managing complexity. Google took the longest to get it right with re-orgs, muddled launches (remember Bard), and so on. With the time lag between Meta’s releases, it still seems like they’re trying to find this culture to maximally express their wonderful talent and resources.

With all of this and off-the-record conversations with leadership at frontier AI labs, I have compiled a list of recommendations for managing AI training teams. This is focused on modeling research and does not encompass the majority of headcount in the leading AI companies.

Interconnects is a reader-supported publication.
Consider becoming a subscriber.

Recommendations

The most effective teams who regularly ship leading models follow many of these principles:

* The core language modeling teams remain small as AI companies become larger.
* For smaller teams, you can still have everyone in one room; take advantage of this. For me personally, I think this is where remote teams can be detrimental. In-person works for this, at least when best practices are evolving so fast.
* Avoid information silos. This goes for both teams and individuals. People need to quickly be able to build on the successes of those around them, and clear communication during consistent rapid progress is tricky.
* For larger teams, you can scale teams only where co-design isn’t needed. Where interactions aren’t needed there can be organizational distance.
  * An example would be one team focusing on post-training algorithms & approaches while other teams handle model character, model variants for API, etc. (specifications and iterations).
  * Another example is that reasoning teams are often separate from other pieces of post-training. This applies only to players that have scaled.
* Language model deployment is very much like early startup software. You don’t know exactly what users want nor what you can deliver. Embrace the uncertainty and learn quickly.
* Do not overly try to separate engineering teams from training. Engineering needs to build tools for the generation +1 model and cannot do this without talking to researchers.
* Evergreen research is separate from the language modeling teams themselves, but still sits within “research”. Otherwise, it will be impossible to prioritize truly long-term ideas. Long-term goals are fragile and need nurturing. Language modeling is about the next 1, or maybe 2, models.
* A lot of the sexy work is not that helpful and a lot of the useful work isn't sexy.
  Data is the prime example, as it is often the most impactful type of work.
* Expect failed training runs and do not overreact to them along the way.

Failure modes

High-priority projects can fail if you…

* Try to ship too many models for each capability improvement. Instead, stick to a set schedule of model training. Have fewer models that are more capable.
* Try to force contributions from individual teammates into the final product. Do not sacrifice performance for personalities in search of “a contribution”.
* Let in teams that try to territorially force their way into contributing to the big company goal.
* Scale the training organization too much. Having too many people “doing stuff” and adding noise to the organization detracts from high-level direction and focus on the execution of specific goals. (This can also relate to the first point, as a form of trying to do too much in one model.)
* Let politics grow. It takes many forms and causes intertwined issues. Do not lose the sense of results being the #1 driving factor of decisions. Bad decisions here compound.
* Over-index on a single model evaluation, which will hamper (or flat out block) real progress in other areas.

Before the rest of the post, expanding on the topics above, you may be interested in previous articles on this topic.

Related writing

For more reading on how language modeling teams work, see some of my other writing here, on team structure, and… managing risk.

An example of how mid-sized training projects work

I recently got a list of questions on how training for Tülu 3 operated (which is a post-training analog to OLMo really). I figured I would share these, and they serve as a foundation for gathering information from friends at frontier labs on how representative it is.

With reasoning models, most of this translates directly.
Infrastructure is becoming more important because generating long sequences is particularly memory intensive (and can expose issues in open-source tools for inference), but when the time comes to make a state-of-the-art fully open reasoning recipe, the lessons learned here will apply directly.

1. How long does a large post-training project take?

Tülu 3 was the focus of our post-training team from mid-July until its release on November 21st, 2024. We were building on our previous recipes, in Tülu 2/2.5, so not very much of this was catching up on internal know-how, but rather integrating new external resources. If a team like this were working continuously all year on the same focus, it would’ve taken approximately one month less to achieve these results. Bootup takes substantial time, as does release management.

2. How do you choose the right personnel for a moderately sized training project?

A project like Tülu 3, or any other effort to push the frontier of AI in a popular area, normally takes a moderately sized team. The smaller the niche, the smaller the team you need. The team at Ai2 is researcher-heavy rather than engineer-heavy among the 20+ authors. If prioritizing only performance on known techniques, the ratio of engineers can be far higher. Pushing the frontier takes 10x the resources of repeating extensively documented work.

In the case of Tülu 3, where most of the techniques are not known, the proportion of researchers is obviously higher. For companies trying to scope who to hire for modeling teams, though, this is not a trivial problem. First, one must scope the level of uncertainty in the domain of interest and then hire around it. Applying Tülu-style approaches could definitely be done with a team of 2-4 focused engineers.

3. What model sizes are used for iteration? How do results scale?

A core principle of modeling research is to iterate at the smallest model that provides a reliable signal.
This is the entire principle behind scaling laws as a de-risking tool. In post-training, compute costs are substantially lower, so the models used can actually be bigger. In this case, given a project designed around the Llama 3.1 base models, ~80% or more of experiments were at the 8B scale (normally 8 or 32 H100s, finishing in <1 day), ~19% at the 70B scale (normally 32 or 64 H100s, finishing in 2-3 days), and only a handful of runs were at the 405B scale, using 256 GPUs each for multiple days. Overall, the project used 100-600 GPUs concurrently for the entire 4-5 month span.

These days, results tend to transfer extremely well when scaling. Bigger models may need less data, especially less general data, and a gentler optimization (lower learning rate usually), but transfer hasn’t been a challenge. Changing base models is harder than scaling with post-training techniques.

4. How many experiments are actually run?

The Tülu project evaluated about 1000 checkpoints in our process. This feels about right for a major post-training process. Some of these are intermediate or competitor models, but most of them, hundreds, are experimental training runs. The model scores can be plotted in a time sequence with the metadata we collected (credit Hamish Ivison for the plot). When you squint, it is largely a logarithmic curve with faster gains at the beginning and leveling off at the end. Of course, you can also see the flurry of models trained right in the last few weeks.

5. What is the biggest bottleneck on progress?

All of these projects are bottlenecked by the compute available. Making systems more efficient is a compute multiplier, but if the starting point in the number of GPUs is too low, it won’t matter.
There’s often potential to accelerate projects by adding more people to explorations, whether it’s training approaches like process reward models (PRMs) or data curation, but scaling management and integration of data across numerous evaluations can be tricky. Best practices for models with hundreds of target evaluations (as done in frontier laboratories), rather than the ~10 we used, are far from established.

The second bottleneck would be personnel willing to constantly grind on new data experiments. Focus on data almost always pays off fairly quickly.

6. What would I need to get a serious post-training effort off the ground from a cold start?

Finetuning has such a large gradation that impact can be made with almost any team size. To do truly excellent work takes mostly patience and proportional resources. Getting the model exactly right takes retraining many times, even after you hit your initial benchmarking goals.

For companies focusing on local models, a few nodes of H100s (~100 GPUs) could go a very long way. For companies trying to make truly state-of-the-art models above the 7B scale, trying to do so with <500 H100 GPUs is likely not worth it. It is very easy to be stuck in the middle, and compute is still the largest determining factor of success.

These numbers will come down as best practices of distillation from strong models are established, but this knowledge is far from settled. If you want to invest in training, you need to do enough to move the frontier, or else you will inevitably fall behind and it would be better to ride on others’ coattails.

7. What is the hardest part of these projects? Where do you actually spend time?

Training projects take a lot of time and a lot of attention to detail. Teams need extreme isolation from other company goals to focus on their one goal of training. The hardest part is often this: having all the members of the training team focus on one single output for sustained periods.
Tracking down recent developments, running small experiments with training algorithms, curating data (likely most of the time in hours, as babysitting GPUs is largely an idle activity), etc. are all the bread and butter of solid engineering talent. Success is downstream of good decision-making by tech leads and managers while getting many small shots on goal.

In the case of projects like Tülu 3, the reason we don’t immediately transition to Tülu 4 is that people have other interests. Companies that directly align training with their bottom line don’t need to do this.

Thanks to Nicole Fitzgerald, Finbarr Timbers (Midjourney was not one of the companies I studied), and others unnamed at leading AI laboratories for comments or input that helped with this post.

This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit www.interconnects.ai/subscribe
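
As a rough sanity check on the compute figures quoted under question 3 above, the arithmetic can be sketched in a few lines of Python. This is a minimal sketch, not an accounting of the actual project: the helper name `gpu_hours` and the flat 30-day month are assumptions made for illustration. The run durations ("<1 day", "2-3 days") and GPU counts are taken from the text. The 405B runs are left out because the text only says "multiple days".

```python
# Back-of-the-envelope GPU-hour arithmetic for the figures quoted in question 3.
# Assumptions (not from the post): a flat 30-day month, and treating "<1 day"
# as an upper bound of exactly 1 day.

HOURS_PER_DAY = 24
DAYS_PER_MONTH = 30  # assumption for illustration


def gpu_hours(gpus, days):
    """GPU-hours consumed by `gpus` GPUs running for `days` days."""
    return gpus * days * HOURS_PER_DAY


# (low, high) GPU-hours per single run at each scale.
per_run = {
    "8B": (gpu_hours(8, 1), gpu_hours(32, 1)),    # upper bounds: runs finish in <1 day
    "70B": (gpu_hours(32, 2), gpu_hours(64, 3)),  # 32 or 64 H100s, 2-3 days
}

# (low, high) GPU-hours for the whole project: 100-600 GPUs over 4-5 months.
project = (
    gpu_hours(100, 4 * DAYS_PER_MONTH),
    gpu_hours(600, 5 * DAYS_PER_MONTH),
)

for scale, (low, high) in per_run.items():
    print(f"{scale}: {low:,} to {high:,} GPU-hours per run")
print(f"project: {project[0]:,} to {project[1]:,} GPU-hours")
```

Under these assumptions, a single 70B run costs several times as much as an 8B run, which is consistent with the stated principle of iterating at the smallest model that gives a reliable signal.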
12 snips
Mar 13, 2025 • 14min

Gemma 3, OLMo 2 32B, and the growing potential of open-source AI

The discussion centers on the exciting breakthroughs in open-source AI, specifically the release of OLMo 2 32B, which rivals GPT-4. The challenges faced by small players in the open-source arena are explored, showcasing the need for transparency and innovation. Listeners will learn about the contrasting approaches of OLMo and Gemma 3, alongside the significance of non-profits and academia in advancing open-source developments. Overall, it's a deep dive into the evolving landscape of AI and the implications of open accessibility.
30 snips
Mar 12, 2025 • 1h 9min

Interviewing Eugene Vinitsky on self-play for self-driving and what else people do with RL

Eugene Vinitsky, a professor at NYU's Civil and Urban Engineering department, dives into the fascinating world of reinforcement learning (RL). He discusses groundbreaking results in self-play for self-driving technology and its implications for future RL applications. The complexity of self-play in multi-agent systems is explored, alongside its surprising link to language model advancements. Eugene shares insights on scaling simulations, the importance of reward design, and the rich potential of AI collaboration, making for a thought-provoking conversation about the future of technology.
12 snips
Mar 10, 2025 • 8min

Elicitation, the simplest way to understand post-training

Discover how the concept of elicitation can dramatically enhance AI model performance after training. The discussion uses a thrilling Formula 1 analogy to illustrate how teams optimize their cars throughout a season, showing similar potential in AI models. The conversation also touches on the Superficial Alignment Hypothesis, emphasizing the importance of pre-existing data. Join in to explore innovative techniques that can lead to significant improvements in a short time frame!
6 snips
Mar 5, 2025 • 14min

Where inference-time scaling pushes the market for AI companies

The discussion dives into the unsustainable costs associated with providing free AI models to users. It highlights insights on GPT-4.5's model launch and the implications of inference-time computing. The conversation covers how profitability may stem from advertising as serving costs approach zero. Aggregation Theory is examined, shedding light on how a few companies could dominate the AI market by aggregating user demand. Proponents argue this could pave the way for a new era of successful, user-facing AI businesses.
7 snips
Feb 28, 2025 • 10min

GPT-4.5: "Not a frontier model"?

The discussion kicks off with the intriguing release of GPT-4.5 and its unusual classification as not a frontier model. Experts ponder the economic implications and community expectations tied to AI scaling. They also tackle the subtle but significant improvements this model brings compared to its predecessors. As they navigate the evolving landscape, the conversation highlights how GPT-4.5 could reshape future AI developments. Listeners will find insights about the challenges in distinguishing real advancements from perceived improvements.
6 snips
Feb 26, 2025 • 12min

Character training: Understanding and crafting a language model's personality

Delve into the intricate world of character training for AI language models. Discover the distinction between public evaluations and the internal assessments that drive real progress. Learn how leading labs are sculpting models like GPT-4 to enhance user interactions. Uncover the challenges of creating human-like traits in AI without sacrificing reliability. Join the conversation on the importance of crafting distinct personalities within models, an essential yet largely overlooked aspect of post-training.

