A deep dive into a 27M-parameter model that claims to outperform giants on reasoning tasks

You’ve probably seen the viral thread: a tiny 27-million-parameter model supposedly crushing Claude and other giants on reasoning benchmarks. The Hierarchical Reasoning Model (HRM) promises to be 5x faster while achieving remarkable scores on notoriously difficult tasks. But as with all things that sound too good to be true in AI, the devil is in the details.
After extensive research and verification, I’m here to separate the revolutionary from the hype. Spoiler alert: HRM is both more interesting and more limited than the viral posts suggest.
The Headlines vs. Reality
Let’s start with what the viral thread claimed:
- HRM (27M parameters) beats Claude 3.7 on ARC-AGI-1: ✅ TRUE (40.3% vs 21.2%)
- 74.5% on Sudoku where “most models score 0%”: ✅ TRUE* (with important caveats)
- “No language tokens needed”: ✅ TRUE (and genuinely innovative)
- “5x faster than bigger models”: ❌ UNVERIFIED (no systematic benchmarks provided)
But here’s what wasn’t mentioned:
- These are cherry-picked benchmarks where HRM excels
- HRM hasn’t been tested on standard AI benchmarks (MMLU, HumanEval, GSM8K)
- The paper is still in preprint (not peer-reviewed)
- It’s specialized for algorithmic puzzles, not general intelligence
The Real Innovation: Thinking Without Words
HRM’s most fascinating aspect isn’t its benchmark scores; it’s how it achieves them. Unlike today’s major language models, HRM doesn’t serialize its intermediate reasoning into language tokens. Instead, it performs “latent reasoning” directly in continuous vector space.
Imagine the difference between:
- Traditional AI: Thinking → Words → Processing → Words → Answer
- HRM: Thinking → Processing → Answer
Removing this “linguistic bottleneck” is inspired by how our brains actually work. When you solve a Sudoku puzzle, you don’t internally narrate each step; you manipulate patterns and relationships directly.
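To make the contrast concrete, here is a toy sketch in PyTorch. The token path quantizes the state into a discrete token and re-embeds it at every hop, the way a chain-of-thought model must; the latent path keeps refining a continuous vector. Every name, size, and module here is an illustrative stand-in, not HRM’s actual code.

```python
# Toy contrast: token-bottlenecked reasoning vs. latent reasoning.
# All sizes and modules are illustrative stand-ins, not HRM's code.
import torch
import torch.nn as nn

hidden, vocab = 64, 1000

decode = nn.Linear(hidden, vocab)     # thoughts -> words
embed = nn.Embedding(vocab, hidden)   # words -> thoughts
latent_step = nn.GRUCell(hidden, hidden)

def token_step(state):
    """One 'traditional' step: quantize to a token, then re-embed.
    The argmax discards everything except the single chosen token."""
    token = decode(state).argmax(dim=-1)
    return embed(token)

x = torch.randn(1, hidden)            # encoded problem
state = torch.randn(1, hidden)

for _ in range(8):
    state = token_step(state)         # lossy: bottlenecked through vocab

for _ in range(8):
    state = latent_step(x, state)     # lossless: stays in continuous space
```

The point isn’t the specific modules; it’s that the latent loop never forces its intermediate state through a discrete vocabulary, so nothing is lost to the linguistic bottleneck between steps.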
The Architecture That Makes It Possible
HRM uses a two-level hierarchy:
- High-level module: The “strategic planner” that works on abstract goals
- Low-level module: The “tactical executor” that rapidly performs detailed computations
The low-level module can complete multiple reasoning steps before the high-level module advances, creating a temporal hierarchy similar to how our prefrontal cortex interacts with faster processing regions.
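Here’s a minimal sketch of that two-timescale loop, with simple GRU cells standing in for both modules (the paper’s actual modules and training scheme are considerably more sophisticated, so treat this purely as an illustration of the nesting):

```python
# Minimal sketch of the two-timescale hierarchy described above.
# GRU cells stand in for HRM's actual modules; dimensions are arbitrary.
import torch
import torch.nn as nn

class TwoTimescaleReasoner(nn.Module):
    def __init__(self, dim=128, fast_steps=4, slow_cycles=3):
        super().__init__()
        self.fast_steps, self.slow_cycles = fast_steps, slow_cycles
        self.low = nn.GRUCell(2 * dim, dim)   # tactical executor (fast)
        self.high = nn.GRUCell(dim, dim)      # strategic planner (slow)
        self.readout = nn.Linear(dim, dim)

    def forward(self, x):
        z_low = torch.zeros_like(x)
        z_high = torch.zeros_like(x)
        for _ in range(self.slow_cycles):
            # The executor runs several steps, conditioned on the current
            # high-level plan, before the planner advances once.
            for _ in range(self.fast_steps):
                z_low = self.low(torch.cat([x, z_high], dim=-1), z_low)
            z_high = self.high(z_low, z_high)
        return self.readout(z_high)

model = TwoTimescaleReasoner()
answer = model(torch.randn(2, 128))   # (batch, dim) in, (batch, dim) out
```

The nesting is the whole trick: the planner only ever sees the executor’s settled state, which is how a small network can reach the effective compute depth that search-heavy problems demand.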
The Benchmark Deep Dive
ARC-AGI-1: A Genuine Achievement
The Abstraction and Reasoning Corpus is designed to test fluid intelligence—the ability to solve novel problems without relying on prior knowledge. HRM’s 40.3% score, achieved with only about 960 training examples, is genuinely impressive. For context:
- Claude 3.7 (with billions of parameters): 21.2%
- o3-mini-high: 34.5%
This isn’t just about beating bigger models; it’s about doing so with orders of magnitude fewer parameters and minimal training data.
Sudoku and Maze: Context Matters
The “most models score 0%” claim requires clarification. HRM was tested on:
- Sudoku-Extreme: A custom dataset of exceptionally hard puzzles, selected to defeat straightforward solving strategies
- Maze-Hard (30×30): Complex mazes that defeat simple path-finding
Yes, many models fail these specific variants, but this doesn’t mean they can’t solve standard Sudoku or maze problems. It’s like saying “most cars can’t complete this Formula 1 track”—technically true but misleading without context.
The Efficiency Revolution (With Asterisks)
HRM’s true breakthrough is its data efficiency:
- Traditional models: Need millions to billions of examples
- HRM: Achieves strong performance with just 1,000 examples
Training time is also impressive:
- Professional-level Sudoku solver: ~2 GPU hours
- Full ARC-AGI training: 50-200 GPU hours
- Training can run on a single laptop-class RTX 4070
However, the “5x faster” inference claim lacks empirical support. The paper mentions theoretical advantages but provides no systematic benchmarks against other models on identical hardware.
What HRM Can’t Do (Yet)
Here’s where the story gets less rosy:
- No General Language Understanding: HRM hasn’t been tested on reading comprehension, translation, or any standard NLP tasks
- Narrow Specialization: Excels at algorithmic puzzles but unproven on open-ended reasoning
- Scalability Unknown: Will the architecture work at larger scales? Nobody knows
- Implementation Complexity: Requires custom CUDA kernels and complex training procedures
The Survivorship Bias Problem
The viral thread exemplifies classic survivorship bias—showing only where HRM wins. It’s like a basketball player showing only their made shots. What about:
- Tasks where HRM fails?
- Benchmarks where traditional models excel?
- Real-world applications beyond puzzles?
This selective reporting makes it impossible to assess HRM’s true capabilities and limitations.
Why This Matters for AI’s Future
Despite the hype and limitations, HRM represents something important: a fundamentally different approach to AI reasoning. While everyone else is making transformers bigger, HRM suggests that architectural innovation might matter more than scale.
Key insights for the field:
- Biological inspiration works: Brain-like hierarchical processing shows promise
- Language isn’t everything: Direct latent reasoning avoids conversion overhead
- Specialization has value: Not every model needs to do everything
- Data efficiency is possible: Smart architectures can learn from fewer examples
The Verdict: Revolutionary But Not Ready
HRM is simultaneously overhyped and genuinely innovative. It’s not the general intelligence breakthrough that viral posts suggest, but it’s also not just another incremental improvement.
What HRM is:
- A clever specialized architecture for algorithmic reasoning
- Proof that small models can excel in specific domains
- An important research direction for efficient AI
- A reminder that bigger isn’t always better
What HRM isn’t:
- A replacement for general-purpose language models
- Proven at scale or on diverse tasks
- Peer-reviewed or independently validated
- The “future of AI” (at least not yet)
Looking Forward
HRM’s approach might inspire the next generation of AI architectures. Imagine combining:
- HRM’s efficient latent reasoning for logical tasks
- Traditional transformers for language understanding
- Specialized modules for different cognitive functions
This modular, brain-inspired future might deliver both efficiency and capability. But for now, HRM remains a fascinating glimpse of one possible path forward—not the destination itself.
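For flavor, here is a deliberately hypothetical sketch of what the top level of such a modular system could look like. Every class and routing rule below is invented for illustration; nothing like it ships in HRM or anywhere else.

```python
# Hypothetical sketch of a modular, brain-inspired dispatcher.
# Every class and route here is invented for illustration only.
from dataclasses import dataclass

@dataclass
class Task:
    kind: str       # e.g. "puzzle" or "language"
    payload: str

class LatentReasoner:
    """Stand-in for an HRM-style latent-reasoning module."""
    def solve(self, task: Task) -> str:
        return f"latent-solved({task.payload})"

class LanguageModel:
    """Stand-in for a conventional transformer LM."""
    def solve(self, task: Task) -> str:
        return f"generated({task.payload})"

class ModularRouter:
    """Send each task to the module suited to its cognitive function."""
    def __init__(self):
        self.routes = {"puzzle": LatentReasoner(),
                       "language": LanguageModel()}

    def solve(self, task: Task) -> str:
        return self.routes[task.kind].solve(task)

router = ModularRouter()
print(router.solve(Task("puzzle", "30x30 maze")))
print(router.solve(Task("language", "summarize this article")))
```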
The real test will come when:
- Independent researchers reproduce the results
- The architecture is tested on diverse benchmarks
- Someone tries scaling it up
- Real-world applications emerge
Until then, appreciate HRM for what it is: a bold experiment in thinking differently about AI reasoning. Just don’t believe everything you read in viral threads.
Want to dive deeper? Check out the actual paper (not the incorrect arXiv ID from the viral post) and the official implementation. And remember: in AI, as in investing, if something sounds too good to be true, it usually needs more context.
