The Illusion of Thinking
Why AI Reasoning Hits a Wall
Introduction
A notable 2025 study, "The Illusion of Thinking," used controlled puzzles to test large reasoning models (LRMs) and uncovered fundamental limitations: despite their advanced mechanisms, these models still fail to reliably generalize complex logic. We will explore the study's findings and argue that for real-world applications, especially in business, robust software must be built around the models to ensure dependable outcomes.
Large Language Models (LLMs) vs Large Reasoning Models (LRMs)
In September 2024, OpenAI released o1-preview, a model designed to "think" before answering. Soon after, other models such as OpenAI's o1/o3, DeepSeek-R1, Claude 3.7 Sonnet Thinking, and Gemini Thinking followed, promoting the idea of AI that can "reason." The main difference between regular Large Language Models (LLMs) and Large Reasoning Models (LRMs) lies in how they are trained and how they work. LLMs learn by predicting the next word across large amounts of text, which gives them fluency and broad general knowledge. LRMs build on this with guided training and feedback on their reasoning steps, learning to solve problems step by step. LRMs also often use extra tools, such as code interpreters, calculators, search engines, or databases, and can store intermediate steps so they can check and correct their own work.
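To make the distinction concrete, here is a minimal, vendor-neutral sketch contrasting a single-pass LLM call with an LRM-style loop that keeps a scratchpad of intermediate steps and can invoke a simple calculator tool. The `call_model` function is a hypothetical placeholder for any model API, not a real SDK call.

```python
def call_model(prompt: str) -> str:
    """Hypothetical stand-in for any LLM/LRM API call; plug in a real client here."""
    raise NotImplementedError

# Plain LLM usage: one pass, the answer is predicted directly from the prompt.
def llm_answer(question: str) -> str:
    return call_model(f"Answer concisely: {question}")

# LRM-style usage: the model is asked to work step by step, each step can invoke
# a tool, and intermediate results are stored so later steps can check earlier ones.
def lrm_answer(question: str, max_steps: int = 5) -> str:
    scratchpad: list[str] = []  # stored intermediate reasoning steps
    for _ in range(max_steps):
        step = call_model(
            "Solve step by step. Previous steps:\n"
            + "\n".join(scratchpad)
            + f"\nQuestion: {question}\nNext step (or FINAL: <answer>):"
        )
        if step.strip().startswith("CALC:"):  # toy calculator "tool"
            expr = step.split("CALC:", 1)[1].strip()
            step += f" = {eval(expr, {'__builtins__': {}})}"  # illustration only, not for production
        scratchpad.append(step)
        if "FINAL:" in step:
            return step.split("FINAL:", 1)[1].strip()
    return scratchpad[-1]
```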
The Illusion of Thinking
In June 2025, Apple researchers released a paper called "The Illusion of Thinking."
The study used controllable puzzle environments that let the researchers precisely scale problem complexity without changing the underlying logic. Their main goal was to test the reasoning ability of Large Reasoning Models (LRMs) and compare them with standard Large Language Models (LLMs).
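For intuition on how complexity scales in such environments, consider the Tower of Hanoi, one of the puzzles used in the study: the rules never change, but the minimum number of moves grows exponentially with the number of disks. A quick illustration:

```python
# Illustration only: complexity scaling in a controllable puzzle environment.
# For Tower of Hanoi with n disks, the rules stay identical, but the minimum
# number of moves grows as 2**n - 1.
for n in range(1, 11):
    print(f"disks={n:2d}  minimum moves={2**n - 1}")
```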
The research identifies a fundamental limitation: despite their sophisticated mechanisms, LRMs fail to generalize reasoning beyond certain complexity thresholds.
The study identified three distinct reasoning regimes:
- On low-complexity problems, standard LLMs actually did better than LRMs.
- On medium-complexity problems, LRMs had the advantage.
- On high-complexity problems, both failed completely.
Detailed analysis exposed complexity-dependent reasoning patterns, ranging from inefficient "overthinking" on simpler problems to complete failure on complex ones.
Surprising results included:
- Limitations in exact computation: for instance, providing the explicit solution algorithm for the Tower of Hanoi did not improve the models' performance on the puzzle (see the sketch after this list).
- Model behavior was inconsistent: models produced up to 100 correct moves in the Tower of Hanoi but fewer than 5 in the River Crossing puzzle. This likely reflects that River Crossing examples with N > 2 are scarce on the web, so LRMs may rarely have encountered or memorized such instances during training.
- Human performance on the AIME25 math benchmark was higher than on AIME24, suggesting that AIME25 may be less difficult. Yet models performed worse on AIME25 than on AIME24, which points to possible data contamination in the training of frontier LRMs.
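The Tower of Hanoi result is especially striking because the solution algorithm is short and purely mechanical. Below is a standard textbook recursive version (a generic formulation, not the exact prompt given to the models); even with something like this available, the models' accuracy did not improve.

```python
def hanoi(n: int, source: str = "A", target: str = "C", spare: str = "B", moves=None):
    """Return the full move list that solves Tower of Hanoi for n disks."""
    if moves is None:
        moves = []
    if n == 0:
        return moves
    hanoi(n - 1, source, spare, target, moves)   # move the top n-1 disks out of the way
    moves.append((source, target))               # move the largest disk
    hanoi(n - 1, spare, target, source, moves)   # move the n-1 disks back on top of it
    return moves

print(len(hanoi(10)))  # 1023 moves, i.e. 2**10 - 1
```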
The Workaround
To create software that can solve complex business challenges, we need to use a hybrid system. This approach combines the strengths of LLMs and LRMs with our existing, reliable business systems.
- LLMs excel at fluent language comprehension and content creation, while LRMs can build a step-by-step thought process (a chain of thought) that breaks the main problem into smaller, easier pieces.
- This core foundation is enhanced by Retrieval-Augmented Generation (RAG) systems. These systems use smart chunking to divide private or up-to-date company data into contextually useful segments. This process grounds the AI's output in facts and minimizes the risk of the AI generating false information (known as hallucinations).
- Crucially, established rule-based systems must act as a supervisory layer, enforcing specific business logic, constraints, and compliance mandates wherever a definitive, auditable outcome is essential (a minimal sketch of this layering follows the list).
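Below is a deliberately simplified sketch of that layering. The names (`retrieve_chunks`, `generate_answer`, `validate_order`) are hypothetical placeholders rather than any specific product's API; the point is the order of operations: retrieve grounding facts, let the model draft a structured answer, then let deterministic business rules have the final word.

```python
from dataclasses import dataclass

@dataclass
class DraftOrder:
    customer_id: str
    amount: float
    discount: float

# --- Hypothetical placeholders (swap in a real retriever and model client) ---

def retrieve_chunks(query: str) -> list[str]:
    """RAG step: return the most relevant chunks of private company data."""
    return ["Policy: maximum discount is 15%."]                            # toy stand-in

def generate_answer(query: str, context: list[str]) -> DraftOrder:
    """LLM/LRM step: draft a structured answer grounded in the retrieved context."""
    return DraftOrder(customer_id="C-001", amount=250.0, discount=0.10)    # toy stand-in

# --- Rule-based supervisory layer: deterministic, auditable checks ---

MAX_DISCOUNT = 0.15

def validate_order(order: DraftOrder) -> DraftOrder:
    if order.amount <= 0:
        raise ValueError("order amount must be positive")
    if order.discount > MAX_DISCOUNT:
        raise ValueError("discount exceeds company policy")
    return order

def handle_request(query: str) -> DraftOrder:
    context = retrieve_chunks(query)          # ground the model in facts (RAG)
    draft = generate_answer(query, context)   # flexible AI drafting
    return validate_order(draft)              # business rules have the final word

print(handle_request("Apply a discount for customer C-001"))
```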
We cannot yet rely solely on standard, off-the-shelf LLM and LRM tools. Their inherent limitations, such as instability at high complexity, inconsistent numerical accuracy, and a failure to generalize complex logic, require us to build this robust, custom-engineered framework to ensure enterprise-grade dependability. This combined architecture achieves the necessary balance of AI flexibility, factual accuracy, and non-negotiable business correctness.
The study highlights that off-the-shelf LRMs and LLMs aren't dependable when we need to build reliable logic into our systems. Want to see the hybrid architecture live? Book a demo to see how we build dependable, logic-driven AI solutions for your enterprise.