Jesse Hoogland, executive director of Timaeus and researcher in singular learning theory (SLT), discusses his work on AI alignment. He explains the refined local learning coefficient (LLC) and its role in uncovering new circuits in language models. The conversation also covers the challenges of interpretability and model complexity, and Hoogland emphasizes the importance of outreach in disseminating research and fostering interdisciplinary collaboration on AI safety.
Singular Learning Theory (SLT) enhances our understanding of neural networks' generalization capabilities and informs AI safety practices.
The refined local learning coefficient offers a new way to probe neural network behavior, improving interpretability and helping to identify potential risks.
Deep dives
Applications of Singular Learning Theory
Singular Learning Theory (SLT) provides a framework for understanding why neural networks generalize effectively, which is crucial for evaluating how they will behave in real-world applications. The theory can help make evaluation benchmarks more predictive of how models actually behave once deployed. SLT also bears on interpretability problems, such as detecting when a model might take a dangerous action, like a treacherous turn, during operation. Advances in SLT are intended to yield tools that can probe and improve the safety of AI systems.
Research Focus of Timaeus
Timaeus is a research organization focused on both the theory of SLT and its practical application to alignment. The organization pursues two primary avenues: rigorous theoretical work, led by Daniel Murfet, and applying that theory to real-world models to advance AI safety. Ongoing experiments involve training a range of models and using learning-coefficient techniques derived from SLT to draw meaningful conclusions about model behavior. Outreach is an equal priority, enabling Timaeus to communicate its findings to relevant decision-makers and collaborators.
Developmental Interpretability and Learning Coefficients
Developmental interpretability, a research agenda built on SLT, examines how the geometry of the loss landscape changes over the course of neural network training, revealing insights into the internal structures models develop. The approach involves measuring local learning coefficients to assess model complexity and the effect of different datasets on model behavior. Recent findings show that learning coefficients track memorization, shedding light on how different attention heads specialize in handling data. Ongoing research will explore how refined learning coefficients can help identify interactions between heads and their roles in more complex computations.
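To give a rough sense of what "measuring a local learning coefficient" involves, here is a minimal PyTorch sketch of the standard SGLD-based estimator: sample weights from a tempered, localized posterior around a trained checkpoint and compare the average sampled loss with the loss at the checkpoint. The function names, hyperparameters (step size, localization strength `gamma`, burn-in), and training setup are illustrative assumptions, not the exact procedure discussed in the episode or the API of any particular library.

```python
import copy
import math
import torch


def mean_dataset_loss(model, loss_fn, dataloader, device):
    """Average loss over the whole dataset (an estimate of L_n at the current weights)."""
    model.eval()
    total, count = 0.0, 0
    with torch.no_grad():
        for xb, yb in dataloader:
            xb, yb = xb.to(device), yb.to(device)
            total += loss_fn(model(xb), yb).item() * len(xb)
            count += len(xb)
    model.train()
    return total / count


def estimate_llc(model, loss_fn, dataloader, n_samples,
                 num_steps=1000, num_burnin=200,
                 lr=1e-4, gamma=100.0, device="cpu"):
    """Rough LLC estimate: lambda-hat = n * beta * (E_w[L_n(w)] - L_n(w*)),
    with beta = 1 / log(n) and the expectation taken over SGLD samples
    drawn from a posterior localized around the trained checkpoint w*."""
    model = copy.deepcopy(model).to(device)
    w_star = [p.detach().clone() for p in model.parameters()]
    beta = 1.0 / math.log(n_samples)

    init_loss = mean_dataset_loss(model, loss_fn, dataloader, device)

    draws = []
    data_iter = iter(dataloader)
    for step in range(num_steps):
        try:
            xb, yb = next(data_iter)
        except StopIteration:
            data_iter = iter(dataloader)
            xb, yb = next(data_iter)
        xb, yb = xb.to(device), yb.to(device)

        model.zero_grad()
        loss = loss_fn(model(xb), yb)
        loss.backward()

        with torch.no_grad():
            for p, p0 in zip(model.parameters(), w_star):
                if p.grad is None:
                    continue
                # Tempered gradient plus a quadratic pull back toward w*,
                # followed by Gaussian noise: one SGLD step.
                drift = n_samples * beta * p.grad + gamma * (p - p0)
                p.add_(-0.5 * lr * drift)
                p.add_(torch.randn_like(p) * math.sqrt(lr))

        if step >= num_burnin:
            draws.append(loss.item())

    expected_loss = sum(draws) / len(draws)
    return n_samples * beta * (expected_loss - init_loss)
```

In practice, estimates like this are usually averaged over several independent SGLD chains, and the step size and localization strength have to be tuned so the chain stays near the checkpoint without collapsing onto it.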
You may have heard of singular learning theory, and its "local learning coefficient", or LLC - but have you heard of the refined LLC? In this episode, I chat with Jesse Hoogland about his work on SLT, and using the refined LLC to find a new circuit in language models.