“Surprising LLM reasoning failures make me think we still need qualitative breakthroughs for AGI” by Kaj_Sotala
Apr 17, 2025
Kaj Sotala, an AI researcher and writer, dives into the surprising reasoning failures of large language models (LLMs). He highlights issues like flawed logic in problem-solving, struggles to follow simple instructions, and inconsistent storytelling, particularly in character portrayal. Kaj argues that despite recent advances, LLMs still lack capabilities necessary for true artificial general intelligence, and that addressing these challenges will require qualitative breakthroughs rather than merely iterative improvements.
Current LLMs display profound reasoning failures that stem from fundamental misunderstandings of problem structure, suggesting real limits on their path to AGI.
The inability of LLMs to consistently follow instructions highlights a critical shortcoming in their understanding and application of task requirements.
Deep dives
Limitations of Current LLMs
Current large language models (LLMs) exhibit significant reasoning failures that call into question their potential for achieving artificial general intelligence (AGI). Despite performing many tasks at a near-human level, specific instances reveal an inability to generalize learned knowledge to problems that seem elementary, exposing a flaw in their reasoning processes. For example, when prompted with a sliding puzzle task, models like Claude produced convoluted solutions that were not only incorrect but also included impossible moves, indicating a fundamental misunderstanding of the problem's structure. This points to a worrying trend: improvements to LLMs primarily optimize performance on familiar tasks without developing the general reasoning skills required for novel challenges.
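To make concrete what an impossible move looks like, here is a minimal Python sketch of a legality check for a 3x3 sliding puzzle. It is my own illustration rather than anything from the post: the board layout, the move-as-tile-number format, and the example position are all assumptions.

```python
# Minimal sketch (assumed 3x3 puzzle, moves given as tile numbers): a move is
# only legal if the named tile sits orthogonally adjacent to the blank (0).

def find(board, tile):
    """Return the (row, col) of a tile; 0 marks the blank square."""
    for r, row in enumerate(board):
        for c, value in enumerate(row):
            if value == tile:
                return r, c
    raise ValueError(f"tile {tile} not on board")

def apply_moves(board, moves):
    """Apply a sequence of tile numbers, failing loudly on the first illegal move."""
    board = [row[:] for row in board]      # work on a copy
    for i, tile in enumerate(moves):
        br, bc = find(board, 0)            # blank position
        tr, tc = find(board, tile)         # tile position
        if abs(br - tr) + abs(bc - tc) != 1:
            raise ValueError(f"move {i}: tile {tile} is not adjacent to the blank")
        board[br][bc], board[tr][tc] = board[tr][tc], board[br][bc]
    return board

start = [[1, 2, 3],
         [4, 0, 5],
         [6, 7, 8]]
print(apply_moves(start, [5]))   # legal: tile 5 slides into the blank
# apply_moves(start, [1])        # would raise: tile 1 is not adjacent to the blank
```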
Coaching and Instruction Adherence Failures
LLMs struggle to consistently follow simple instructions, revealing a lack of reliability in their behavior. For instance, even when given explicit coaching guidelines, such as asking only one question at a time or keeping answers succinct, models frequently failed to adhere to these rules. This inconsistency was observed across multiple model generations, suggesting a systematic issue in understanding and applying guidelines. Such failures to follow basic conversational instructions point to a shortcoming in the models' comprehension of task requirements, which raises doubts about their reliability in practical settings.
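As a rough illustration of how such adherence could be checked mechanically, the sketch below flags replies that break the two rules mentioned above. It is not the author's setup; the 80-word limit, the question-mark heuristic, and the ask_model() client named in the comments are all hypothetical.

```python
# Crude adherence check for two assumed coaching rules: at most one question
# per reply, and replies kept short. Question counting via "?" is a heuristic.

def violates_coaching_rules(reply: str, max_words: int = 80) -> list[str]:
    """Return a list of human-readable rule violations found in a model reply."""
    violations = []
    if reply.count("?") > 1:
        violations.append("asked more than one question")
    if len(reply.split()) > max_words:
        violations.append(f"reply longer than {max_words} words")
    return violations

# Hypothetical usage with whatever client you have:
# reply = ask_model(coaching_system_prompt, user_message)
# for problem in violates_coaching_rules(reply):
#     print("rule broken:", problem)
```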
Inability to Adapt in Dynamic Scenarios
The inability of LLMs to learn from repetition and adjust their strategies in dynamic scenarios is another noticeable shortfall. For example, when playing tic-tac-toe, models like o1 repeatedly employed the same losing strategy, failing to learn from past mistakes even when explicitly encouraged to do so. This inability to recognize and correct flaws in their own gameplay demonstrates a fundamental lack of cognitive flexibility and learning capability. It suggests that LLMs do not genuinely learn from their interactions, further underscoring how far they remain from human-like reasoning.
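A small sketch of how one might detect this pattern across repeated games follows. Again, this is an assumed harness rather than the experiment from the post; play_one_game() is a hypothetical stand-in for whatever interface queries the model.

```python
# Detect when a model loses twice with the exact same move sequence, which is
# one concrete signature of "repeating the same losing strategy".

def detect_repeated_losses(game_records):
    """game_records: list of (moves, result) pairs, where moves is a tuple of
    the model's moves and result is 'win', 'loss', or 'draw' for the model.
    Returns the move sequences the model has lost with more than once."""
    seen_losses = set()
    repeated = []
    for moves, result in game_records:
        if result != "loss":
            continue
        if moves in seen_losses:
            repeated.append(moves)
        seen_losses.add(moves)
    return repeated

# Hypothetical usage:
# records = [play_one_game(model) for _ in range(10)]
# for line in detect_repeated_losses(records):
#     print("model repeated a losing line:", line)
```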
Challenges in Fictional and Abstract Contexts
LLMs also struggle to generate coherent narratives and maintain logical consistency within stories. Examples include depicting young characters with cognitive traits well beyond their developmental stage, suggesting a tendency to default to existing narrative templates rather than adhere to age-appropriate characterization. Similarly, in brainstorming scenarios, LLMs often propose suggestions that follow generic tropes rather than logically sound actions, revealing a reliance on stereotypical patterns. This habit of falling back on familiar constructs instead of reasoning through the situation shows how LLMs struggle with complex and creative tasks, raising significant questions about their suitability for work requiring nuanced understanding.
Writing this post puts me in a weird epistemic position. I simultaneously believe that:
The reasoning failures that I'll discuss are strong evidence that current LLM- or, more generally, transformer-based approaches won't get us AGI
As soon as major AI labs read about the specific reasoning failures described here, they might fix them
But future versions of GPT, Claude etc. succeeding at the tasks I've described here will provide zero evidence of their ability to reach AGI. If someone makes a future post where they report that they tested an LLM on all the specific things I described here and it aced all of them, that will not update my position at all.
That is because all of the reasoning failures that I describe here are surprising in the sense that given everything else that they can do, you’d expect LLMs to succeed at all of these tasks. The [...]
---
Outline:
(00:13) Introduction
(02:13) Reasoning failures
(02:17) Sliding puzzle problem
(07:17) Simple coaching instructions
(09:22) Repeatedly failing at tic-tac-toe
(10:48) Repeatedly offering an incorrect fix
(13:48) Various people's simple tests
(15:06) Various failures at logic and consistency while writing fiction
(15:21) Inability to write young characters when first prompted