A deep dive into a 27M-parameter model that claims to outperform giants on reasoning tasks

You’ve probably seen the viral thread: a tiny 27-million-parameter model supposedly crushing Claude and other giants on reasoning benchmarks. The Hierarchical Reasoning Model (HRM) promises to be 5x faster while achieving remarkable scores on notoriously difficult tasks. But as with all things that sound too good to be true in AI, the devil is in the details.
After extensive research and verification, I’m here to separate the revolutionary from the hype. Spoiler alert: HRM is both more interesting and more limited than the viral posts suggest.
The Headlines vs. Reality
Let’s start with what the viral thread claimed:
- HRM (27M parameters) beats Claude 3.7 on ARC-AGI-1: ✅ TRUE (40.3% vs 21.2%)
- 74.5% on Sudoku where “most models score 0%”: ✅ TRUE* (with important caveats)
- “No language tokens needed”: ✅ TRUE (and genuinely innovative)
- “5x faster than bigger models”: ❌ UNVERIFIED (no systematic benchmarks provided)
But here’s what wasn’t mentioned:
- These are cherry-picked benchmarks where HRM excels
- HRM hasn’t been tested on standard AI benchmarks (MMLU, HumanEval, GSM8K)
- The paper is still in preprint (not peer-reviewed)
- It’s specialized for algorithmic puzzles, not general intelligence
The Real Innovation: Thinking Without Words
HRM’s most fascinating aspect isn’t its benchmark scores; it’s how it achieves them. Unlike today’s major language models, HRM doesn’t serialize its intermediate reasoning into language tokens. Instead, it performs “latent reasoning” directly in continuous vector space.
Imagine the difference between:
- Traditional AI: Thinking → Words → Processing → Words → Answer
- HRM: Thinking → Processing → Answer
Removing this “linguistic bottleneck” is inspired by how our brains actually work. When you solve a Sudoku puzzle, you don’t internally narrate each step; you manipulate patterns and relationships directly.
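To make the contrast concrete, here is a toy sketch in PyTorch. The token path quantizes the state into a discrete token and re-embeds it at every hop, the way a chain-of-thought model must; the latent path keeps refining a continuous vector. Every name, size, and module here is an illustrative stand-in, not HRM’s actual code.

```python
# Toy contrast: token-bottlenecked reasoning vs. latent reasoning.
# All sizes and modules are illustrative stand-ins, not HRM's code.
import torch
import torch.nn as nn

hidden, vocab = 64, 1000

decode = nn.Linear(hidden, vocab)     # thoughts -> words
embed = nn.Embedding(vocab, hidden)   # words -> thoughts
latent_step = nn.GRUCell(hidden, hidden)

def token_step(state):
    """One 'traditional' step: quantize to a token, then re-embed.
    The argmax discards everything except the single chosen token."""
    token = decode(state).argmax(dim=-1)
    return embed(token)

x = torch.randn(1, hidden)            # encoded problem
state = torch.randn(1, hidden)

for _ in range(8):
    state = token_step(state)         # lossy: bottlenecked through vocab

for _ in range(8):
    state = latent_step(x, state)     # lossless: stays in continuous space
```

The point isn’t the specific modules; it’s that the latent loop never forces its intermediate state through a discrete vocabulary, so nothing is lost to the linguistic bottleneck between steps.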
The Architecture That Makes It Possible
HRM uses a two-level hierarchy:
- High-level module: The “strategic planner” that works on abstract goals
- Low-level module: The “tactical executor” that rapidly performs detailed computations
The low-level module can complete multiple reasoning steps before the high-level module advances, creating a temporal hierarchy similar to how our prefrontal cortex interacts with faster processing regions.
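Here’s a minimal sketch of that two-timescale loop, with simple GRU cells standing in for both modules (the paper’s actual modules and training scheme are considerably more sophisticated, so treat this purely as an illustration of the nesting):

```python
# Minimal sketch of the two-timescale hierarchy described above.
# GRU cells stand in for HRM's actual modules; dimensions are arbitrary.
import torch
import torch.nn as nn

class TwoTimescaleReasoner(nn.Module):
    def __init__(self, dim=128, fast_steps=4, slow_cycles=3):
        super().__init__()
        self.fast_steps, self.slow_cycles = fast_steps, slow_cycles
        self.low = nn.GRUCell(2 * dim, dim)   # tactical executor (fast)
        self.high = nn.GRUCell(dim, dim)      # strategic planner (slow)
        self.readout = nn.Linear(dim, dim)

    def forward(self, x):
        z_low = torch.zeros_like(x)
        z_high = torch.zeros_like(x)
        for _ in range(self.slow_cycles):
            # The executor runs several steps, conditioned on the current
            # high-level plan, before the planner advances once.
            for _ in range(self.fast_steps):
                z_low = self.low(torch.cat([x, z_high], dim=-1), z_low)
            z_high = self.high(z_low, z_high)
        return self.readout(z_high)

model = TwoTimescaleReasoner()
answer = model(torch.randn(2, 128))   # (batch, dim) in, (batch, dim) out
```

The nesting is the whole trick: the planner only ever sees the executor’s settled state, which is how a small network can reach the effective compute depth that search-heavy problems demand.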
The Benchmark Deep Dive
ARC-AGI-1: A Genuine Achievement
The Abstraction and Reasoning Corpus is designed to test fluid intelligence—the ability to solve novel problems without relying on prior knowledge. HRM’s 40.3% score, achieved with only about 960 training examples, is genuinely impressive. For context:
- Claude 3.7 (with billions of parameters): 21.2%
- o3-mini-high: 34.5%
This isn’t just about beating bigger models; it’s about doing so with orders of magnitude fewer parameters and minimal training data.
Sudoku and Maze: Context Matters
The “most models score 0%” claim requires clarification. HRM was tested on:
- Sudoku-Extreme: A custom dataset of exceptionally hard puzzles, selected to defeat straightforward solving strategies
- Maze-Hard (30×30): Complex mazes that defeat simple path-finding
Yes, many models fail these specific variants, but this doesn’t mean they can’t solve standard Sudoku or maze problems. It’s like saying “most cars can’t complete this Formula 1 track”—technically true but misleading without context.
The Efficiency Revolution (With Asterisks)
HRM’s true breakthrough is its data efficiency:
- Traditional models: Need millions to billions of examples
- HRM: Achieves strong performance with just 1,000 examples
Training time is also impressive:
- Professional-level Sudoku solver: ~2 GPU hours
- Full ARC-AGI training: 50-200 GPU hours
- Training can run on a single laptop-class RTX 4070
However, the “5x faster” inference claim lacks empirical support. The paper mentions theoretical advantages but provides no systematic benchmarks against other models on identical hardware.
What HRM Can’t Do (Yet)
Here’s where the story gets less rosy:
- No General Language Understanding: HRM hasn’t been tested on reading comprehension, translation, or any standard NLP tasks
- Narrow Specialization: Excels at algorithmic puzzles but unproven on open-ended reasoning
- Scalability Unknown: Will the architecture work at larger scales? Nobody knows
- Implementation Complexity: Requires custom CUDA kernels and complex training procedures
The Survivorship Bias Problem
The viral thread exemplifies classic survivorship bias—showing only where HRM wins. It’s like a basketball player showing only their made shots. What about:
- Tasks where HRM fails?
- Benchmarks where traditional models excel?
- Real-world applications beyond puzzles?
This selective reporting makes it impossible to assess HRM’s true capabilities and limitations.
Why This Matters for AI’s Future
Despite the hype and limitations, HRM represents something important: a fundamentally different approach to AI reasoning. While everyone else is making transformers bigger, HRM suggests that architectural innovation might matter more than scale.
Key insights for the field:
- Biological inspiration works: Brain-like hierarchical processing shows promise
- Language isn’t everything: Direct latent reasoning avoids conversion overhead
- Specialization has value: Not every model needs to do everything
- Data efficiency is possible: Smart architectures can learn from fewer examples
The Verdict: Revolutionary But Not Ready
HRM is simultaneously overhyped and genuinely innovative. It’s not the general intelligence breakthrough that viral posts suggest, but it’s also not just another incremental improvement.
What HRM is:
- A clever specialized architecture for algorithmic reasoning
- Proof that small models can excel in specific domains
- An important research direction for efficient AI
- A reminder that bigger isn’t always better
What HRM isn’t:
- A replacement for general-purpose language models
- Proven at scale or on diverse tasks
- Peer-reviewed or independently validated
- The “future of AI” (at least not yet)
Looking Forward
HRM’s approach might inspire the next generation of AI architectures. Imagine combining:
- HRM’s efficient latent reasoning for logical tasks
- Traditional transformers for language understanding
- Specialized modules for different cognitive functions
This modular, brain-inspired future might deliver both efficiency and capability. But for now, HRM remains a fascinating glimpse of one possible path forward—not the destination itself.
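For flavor, here is a deliberately hypothetical sketch of what the top level of such a modular system could look like. Every class and routing rule below is invented for illustration; nothing like it ships in HRM or anywhere else.

```python
# Hypothetical sketch of a modular, brain-inspired dispatcher.
# Every class and route here is invented for illustration only.
from dataclasses import dataclass

@dataclass
class Task:
    kind: str       # e.g. "puzzle" or "language"
    payload: str

class LatentReasoner:
    """Stand-in for an HRM-style latent-reasoning module."""
    def solve(self, task: Task) -> str:
        return f"latent-solved({task.payload})"

class LanguageModel:
    """Stand-in for a conventional transformer LM."""
    def solve(self, task: Task) -> str:
        return f"generated({task.payload})"

class ModularRouter:
    """Send each task to the module suited to its cognitive function."""
    def __init__(self):
        self.routes = {"puzzle": LatentReasoner(),
                       "language": LanguageModel()}

    def solve(self, task: Task) -> str:
        return self.routes[task.kind].solve(task)

router = ModularRouter()
print(router.solve(Task("puzzle", "30x30 maze")))
print(router.solve(Task("language", "summarize this article")))
```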
The real test will come when:
- Independent researchers reproduce the results
- The architecture is tested on diverse benchmarks
- Someone tries scaling it up
- Real-world applications emerge
Until then, appreciate HRM for what it is: a bold experiment in thinking differently about AI reasoning. Just don’t believe everything you read in viral threads.
Want to dive deeper? Check out the actual paper (not the incorrect arXiv ID from the viral post) and the official implementation. And remember: in AI, as in investing, if something sounds too good to be true, it usually needs more context.
