Inside s1: An o1-Style Reasoning Model That Cost Under $50 to Train with Niklas Muennighoff - #721
Mar 3, 2025
Niklas Muennighoff, a PhD student at Stanford, dives into his groundbreaking work on the s1 reasoning model, designed to efficiently mimic OpenAI's o1 while costing under $50 to train. He elaborates on innovative techniques like 'budget forcing' that help the model tackle complex problems more effectively. The discussion highlights the intricacies of test-time scaling, the importance of data curation, and the differences between supervised fine-tuning and reinforcement learning. Niklas also shares insights on the future of open-source AI models.
The s1 model introduces a budget forcing technique that controls how much computation is spent on reasoning by capping or extending the number of thinking tokens the model generates before it produces an answer.
Because s1 is fully open source and requires minimal resources to train, it is broadly accessible and invites further experimentation in AI reasoning by researchers.
Deep dives
Comparison of s1 and R1 Approaches
Both s1 and R1 seek to replicate the capabilities of OpenAI's o1 model, but they do so with different methodologies. R1 aims to reproduce the entire pipeline behind o1, striving for a comprehensive reconstruction of its capabilities. In contrast, s1 focuses on achieving the core benefits of o1, namely strong reasoning performance and test-time scaling, through a more minimalist approach. This strategic difference shapes the complexity and resource demands of each model, with s1 designed to be more accessible and cost-effective.
Data Curation and Distillation Process
For s1, data curation involved gathering a diverse set of challenging questions from many fields and narrowing them down to 1,000 high-quality items, with difficulty judged by whether previous models could already solve each question. The goal was coverage across a wide range of topics and difficulty levels. Training then relied on distillation from existing reasoning models: their reasoning traces served as supervision, and the model learned effectively from both correct and incorrect traces. By fine-tuning on these distilled traces, s1 reached impressive performance despite the inherent challenges of teaching a model to reason correctly. A rough sketch of the filtering step appears below.
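To make the selection criteria concrete, here is a minimal Python sketch of a difficulty-and-diversity filter of this kind. The field names (topic, solved_by_baseline), the round-robin sampling over topics, and the helper itself are illustrative assumptions, not the paper's exact pipeline.

```python
from collections import defaultdict

def curate(candidates: list[dict], target_size: int = 1_000) -> list[dict]:
    """Keep hard questions, spread across topics, up to target_size items.

    Hypothetical sketch: each candidate is a dict with "question", "topic",
    and "solved_by_baseline" (True if an existing model already answers it).
    """
    # Difficulty filter: drop anything a baseline model already solves.
    hard = [q for q in candidates if not q["solved_by_baseline"]]

    # Diversity: bucket by topic, then round-robin sample until the target is hit.
    by_topic: dict[str, list[dict]] = defaultdict(list)
    for q in hard:
        by_topic[q["topic"]].append(q)

    selected: list[dict] = []
    while len(selected) < target_size and any(by_topic.values()):
        for questions in by_topic.values():
            if questions and len(selected) < target_size:
                selected.append(questions.pop())
    return selected
```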
Implementing Test-Time Scaling
s1 implements test-time scaling through a technique called budget forcing: the model generates a reasoning trace under a token budget that dictates how much computational effort it may spend on a given question. When the budget is exhausted, the reasoning trace is cut off and the model is forced to produce an answer; when the model tries to stop reasoning too early, injecting a 'Wait' token encourages it to reassess its output and potentially improve its answer based on prior reasoning steps. This budget control not only improves the model's accuracy but also provides a structured way to manage computational resources at inference time. A minimal sketch of such a decoding loop is shown below.
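The following is a minimal, hypothetical sketch of a budget-forced decoding loop in Python. The generate and count_tokens callables, the end-of-thinking delimiter string, and the default budget values are assumptions made for illustration; s1's actual implementation may differ.

```python
from typing import Callable

WAIT = "Wait"                           # token appended to extend reasoning
END_OF_THINKING = "<|im_start|>answer"  # assumed end-of-thinking delimiter

def budget_forced_generate(
    prompt: str,
    generate: Callable[[str, int], str],  # (context, max_new_tokens) -> continuation
    count_tokens: Callable[[str], int],   # token counter for the model's tokenizer
    max_thinking_tokens: int = 2048,
    min_thinking_tokens: int = 256,
) -> str:
    """Produce an answer whose reasoning length is steered by a token budget."""
    trace = ""
    while True:
        remaining = max_thinking_tokens - count_tokens(trace)
        if remaining <= 0:
            # Budget exhausted: force the model to stop thinking and answer.
            trace += END_OF_THINKING
            break
        chunk = generate(prompt + trace, remaining)
        if not chunk:
            break
        trace += chunk
        if END_OF_THINKING in chunk:
            if count_tokens(trace) < min_thinking_tokens:
                # Model tried to stop too early: remove the delimiter and append
                # "Wait" so it reconsiders its reasoning before answering.
                trace = trace.replace(END_OF_THINKING, "") + WAIT
                continue
            break
    # Generate the final answer conditioned on the full reasoning trace.
    return generate(prompt + trace, 512)
```

Appending 'Wait' rather than the end-of-thinking delimiter is what lets the model spend more compute on harder questions, which corresponds to the sequential form of test-time scaling discussed in the episode.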
Open Source Contribution and Future Research Directions
The s1 model and its associated datasets are fully open source, and because it was trained with minimal computational resources, researchers and developers can replicate its results easily. This opens the door to further experimentation and adaptation across reasoning applications in the AI community. The challenges identified in this research motivate ongoing work on improving test-time scaling and context management in reasoning tasks, particularly for more complex queries. By addressing issues like context-window limitations and exploring cross-model collaboration, there is potential for significant advances in how AI systems reason about and understand complex information.
Today, we're joined by Niklas Muennighoff, a PhD student at Stanford University, to discuss his paper, “s1: Simple Test-Time Scaling.” We explore the motivations behind s1, as well as how it compares to OpenAI's o1 and DeepSeek's R1 models. We dig into the different approaches to test-time scaling, including parallel and sequential scaling, as well as s1's data curation process, its training recipe, and its use of model distillation from Google Gemini and DeepSeek R1. We explore the novel "budget forcing" technique developed in the paper, which allows the model to think longer on harder problems and optimize test-time compute for better performance. Additionally, we cover the evaluation benchmarks used, the comparison between supervised fine-tuning and reinforcement learning, and similar projects like Hugging Face's Open R1. Finally, we discuss the open-sourcing of s1 and its future directions.