How a single arXiv paper reveals the hidden fault lines in LLM-based robot planning
https://arxiv.org/abs/2602.12244

There’s a concept in the history of science that most people learn and then forget: the epicycle. In the original Ptolemaic model of the solar system, when planets didn’t move where the theory predicted, astronomers didn’t abandon the model. They added circles on top of circles — epicycles — to make the predictions fit. The model got more complex, more accurate locally, but also more brittle. It worked, right up until it didn’t.
I think this pattern is playing out right now in robot task planning. And a paper that landed on arXiv this week makes the case better than I could.
The Paper
“Any House Any Task: Scalable Long-Horizon Planning for Abstract Human Tasks” (Liu et al., 2026) tackles a real and important problem: getting robots to follow complex household instructions like “tidy up the kitchen” in large, realistic environments. Their system, AHAT, is technically impressive. It trains an LLM to decompose ambiguous human instructions into formal subgoals written in PDDL (a classical planning language), then hands those subgoals to a symbolic solver for optimal action sequencing. To handle the ambiguity in natural language commands, they introduce TGPO — a reinforcement learning variant that corrects intermediate reasoning traces during training.
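To make the division of labor concrete, here is a minimal sketch of the decompose-then-solve pattern the paper describes. This is not the authors' code: `llm` and `symbolic_planner` are hypothetical caller-supplied functions standing in for an LLM API and a classical PDDL solver (Fast Downward, for example), and the prompt format is invented for illustration.

```python
from typing import Callable, List

# A minimal sketch of the decompose-then-solve pattern (not the paper's code).
# `llm` and `symbolic_planner` are hypothetical stand-ins supplied by the caller:
# one wraps an LLM API, the other a classical PDDL solver such as Fast Downward.

def instruction_to_goal(llm: Callable[[str], str], instruction: str, scene_graph: str) -> str:
    """Ask the LLM to rewrite an ambiguous instruction as formal PDDL goal literals."""
    prompt = (
        f"Scene graph:\n{scene_graph}\n"
        f"Instruction: {instruction}\n"
        "Rewrite the instruction as PDDL goal literals, one per line."
    )
    literals = [line.strip() for line in llm(prompt).splitlines() if line.strip()]
    return "(:goal (and " + " ".join(literals) + "))"  # e.g. (inside mug_1 cabinet_3)

def plan(
    llm: Callable[[str], str],
    symbolic_planner: Callable[[str, str], List[str]],
    instruction: str,
    scene_graph: str,
    domain_pddl: str,
    problem_prefix: str,
) -> List[str]:
    """The LLM handles language-to-logic; the symbolic solver handles action sequencing."""
    problem_pddl = problem_prefix + "\n" + instruction_to_goal(llm, instruction, scene_graph) + ")"
    return symbolic_planner(domain_pddl, problem_pddl)  # returns an action sequence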
It’s good work. It advances the state of the art. And it’s an epicycle.
The Paradigm
The dominant paradigm in robot planning circa 2025–2026 is straightforward: the LLM is the brain. The community bet that language models, because they encode vast amounts of world knowledge, can serve as the central planning engine for embodied agents. Want a robot to clean a room? Describe the task in natural language, let the LLM reason about it, and translate its output into executable actions.
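Stripped of scaffolding, that bet looks something like the loop below. Again, `llm` and `execute` are hypothetical placeholders, not any particular system's API.

```python
from typing import Callable, List

# The bare "LLM as planner" loop, reduced to a sketch. `llm` and `execute` are
# hypothetical placeholders for an LLM API and a low-level skill controller.

def llm_as_planner(llm: Callable[[str], str], execute: Callable[[str], None],
                   task: str, max_steps: int = 50) -> List[str]:
    history: List[str] = []
    for _ in range(max_steps):
        prompt = (
            f"Task: {task}\n"
            f"Actions taken so far: {history}\n"
            "Next primitive action (or DONE if the task is complete):"
        )
        action = llm(prompt).strip()
        if action == "DONE":
            break
        execute(action)          # hand off to robot skills / motion primitives
        history.append(action)
    return history
```

The hybrid pipeline sketched earlier, and every layer of machinery discussed below, is scaffolding bolted onto some version of this loop.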
This bet has produced extraordinary results on simple, short-horizon tasks. But it keeps hitting the same wall, and the community keeps describing that wall in the same language. Here’s the AHAT paper itself: “Performance often degrades rapidly with increasing environment size, plan length, instruction ambiguity, and constraint complexity.”
That sentence is an anomaly report. It names the boundary where the paradigm breaks down.
The Response: Add More Layers
What’s the paper’s response to that anomaly? Not to question whether LLMs are the right substrate for long-horizon spatial reasoning. Instead, the authors add machinery:
- A classical symbolic planner underneath (PDDL) to handle what the LLM can’t
- A new RL algorithm on top (TGPO, built on GRPO) to correct the LLM’s reasoning errors (see the sketch after this list)
- Scene graphs as an intermediate representation to bridge language and space
- External correction of reasoning traces to patch failure modes during training
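A note on the second item, as promised: TGPO’s details aren’t reproduced here, but GRPO, which it reportedly builds on, centers on a group-relative advantage. The sketch below shows only that generic GRPO step; the solvability reward is a hypothetical example, not the paper’s, and the trace-correction step TGPO adds is omitted.

```python
import numpy as np

# Generic GRPO-style group-relative advantage: a sketch of the base algorithm
# TGPO is said to build on, not TGPO itself. The trace-correction step the
# paper adds is not shown, and the reward below is a hypothetical example.

def grpo_advantages(group_rewards: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """Score each sampled reasoning trace relative to the other traces in its group.

    GRPO samples several traces per prompt, scores them, and normalizes within
    the group, so no learned value function is needed.
    """
    return (group_rewards - group_rewards.mean()) / (group_rewards.std() + eps)

# Example: four sampled decompositions of one instruction, rewarded 1.0 if the
# resulting PDDL goal turned out to be solvable and 0.0 otherwise.
print(grpo_advantages(np.array([1.0, 0.0, 1.0, 0.0])))  # ~[ 1., -1.,  1., -1.]
```

In GRPO, those advantages then weight a clipped policy-gradient update on the model that generated the traces, which is where the “RL algorithm on top” lives.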
Each of these additions is individually reasonable. Together, they form a pattern: the core commitment to “LLM as planner” is being preserved by surrounding it with increasingly elaborate scaffolding. The system gets heavier, not lighter. More complex, not more elegant.
This is exactly what Thomas Kuhn described in The Structure of Scientific Revolutions. When a paradigm is under strain, practitioners don’t abandon it. They add auxiliary hypotheses, methodological patches, and theoretical extensions to preserve the core framework. The patches work — locally, temporarily. But the accumulated weight of the patches is itself a signal.
The Broader Pattern
This isn’t just one paper. The robotics planning literature is showing a classic anomaly accumulation pattern:
- The anomaly is named repeatedly. Paper after paper acknowledges that LLM-based planners degrade on long horizons, large environments, and complex constraints.
- The fixes are getting more elaborate. Early work used simple prompting. Now we’re seeing RL fine-tuning, symbolic grounding, scene graph intermediaries, chain-of-thought correction, and multi-stage verification pipelines.
- The patches are domain-specific. Solutions that work in kitchens break in warehouses. Solutions for manipulation tasks fail at navigation. This is the opposite of what a healthy paradigm looks like.
None of this means the paradigm is about to collapse. Epicycles can sustain a framework for a long time — Ptolemaic astronomy worked well enough for centuries. But the pattern creates a predictive window: when the accumulated weight of patches exceeds the explanatory power of the core framework, the field becomes receptive to alternatives.
What a Paradigm Break Would Look Like
If you’re watching this space — as a researcher, investor, or technology strategist — here’s what to look for:
A successful long-horizon embodied planner that doesn’t use an LLM as the central reasoning engine. Maybe it’s a return to structured world models with learned dynamics. Maybe it’s a new architecture purpose-built for spatial-temporal reasoning. Maybe it’s something that treats language as an interface layer rather than a cognitive substrate.
The key signal won’t be the paper itself — it’ll be what happens in the citation network afterward. When researchers who are currently deep in the LLM-as-planner paradigm start citing and building on a non-LLM approach, that’s the bridge forming. That’s when peripheral anomalies become central challenges, and the field begins to reorganize.
I’m not predicting when this will happen. I’m saying the conditions that historically precede such shifts are accumulating. The epicycles are getting heavy. The anomalies are being named. And somewhere, probably, someone is working on the alternative that makes the scaffolding unnecessary.
Dan Aridor writes about paradigm dynamics in science and technology. Follow for more analysis at the intersection of philosophy of science and emerging tech.
