Do LLMs Reason, or Do They Just Predict Math Text?

When “predicts the next number” gets published as “knows how to reason.”

INGA314.ai analysis of KisMATH (Saha et al., TACL 2026)

https://arxiv.org/abs/2507.11408


TL;DR. KisMATH is a real and useful dataset wrapped in a theoretical claim its experiments cannot support. The headline result — that 15 LLMs assign higher next-token probability to math-expression tokens than to random tokens in a chain-of-thought trace — is a property of language modeling on math text, not evidence that LLMs “internally realize” reasoning structures. The decisive control experiment, buried in §6.1, shows the central thesis fails on olympiad-level problems. Headline-claim inflation factor: roughly 3–4×.


The paper, in one paragraph

KisMATH (Saha, Chaturvedi, Saha, Garain, Asher; pre-print accepted to TACL, January 2026) builds an automated pipeline that extracts a directed acyclic graph of mathematical expressions from a chain-of-thought reasoning trace. The pipeline runs on 1,671 problems from GSM8K, MATH500, and AIME, with all reasoning traces generated by OpenAI o3. The authors then run two experiments on 15 open-weight LLMs ranging from 1B to 70B parameters: an attention-suppression experiment (does suppressing attention to the math tokens raise answer entropy?) and a probability-rank experiment (do paths through the graph receive higher next-token probability than random equal-length spans of the same trace?). Both come back positive. The abstract concludes that LLMs “internally realize structures similar to our graphs” and that this is “constitutive of reasoning.” The paper has been accepted to one of the most selective venues in computational linguistics.

The empirical work is, in many ways, good. The dataset is large, open, reproducible, and 50–100× the scale of prior manual annotation efforts. The pipeline scales. The intervention methodology is more principled than the randomization-based ablations it positions itself against. If the paper had stopped at “here is a tool for studying CoT structure at scale,” it would be a clean contribution.

It does not stop there. The framing claims something much larger — that the experiments are evidence about reasoning, knowledge, and internal cognitive structure in LLMs. That framing is the subject of this analysis.

The rank-metric tautology

The clearest single problem is that the headline metric measures something close to a tautology of language modeling on math text, not a property of reasoning.

Equation 5 in the paper computes the probability the model assigns to the next token, given the chain-of-thought trace as context. The “R-path” is a sequence of math-expression tokens selected because they structurally complete an ongoing computation — the “9” in “4 + 5 = 9”, the tokens that finish the right-hand side of “x² − x + 1 = 0 ⟹ x = …”. A “random path” is an equal-length span drawn from everything else: discourse markers, connectives, variable declarations, partial words, punctuation. The paper finds R-path tokens receive higher next-token probability than random-path tokens, and concludes the LLM has “internally realized” the graph structure.

But completions of in-progress equations are vastly more predictable than discourse fillers. “4 + 5 =” → “9” has probability near 1.0 for any competent language model. “Let’s” can be followed by hundreds of plausible continuations. The probability gap is a property of next-token prediction in formal mathematical text, not of reasoning.
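The gap is easy to demonstrate outside the paper. A minimal sketch, with GPT-2 standing in for any causal LM; the model choice and the two contexts are ours, not the paper's, and the exact numbers will vary by model:

```python
# Minimal sketch of the probability gap: how much mass a causal LM puts on
# one continuation token after an in-progress equation vs. after a discourse
# opener. GPT-2 is an illustrative stand-in.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def next_token_prob(context: str, continuation: str) -> float:
    """Probability the model assigns to the first token of `continuation`."""
    ids = tok(context, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits[0, -1]               # next-position logits
    target = tok(continuation, add_special_tokens=False).input_ids[0]
    return torch.softmax(logits, dim=-1)[target].item()

# Completing an equation is nearly forced; opening a discourse move is one
# choice among hundreds.
print(next_token_prob("4 + 5 =", " 9"))        # high for any competent LM
print(next_token_prob("Let's", " start"))      # low: many plausible openers
```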

A genuine test of structural realization would compare R-paths to alternative-but-valid paths through the same problem — different sub-step orderings, alternate decompositions, alternate variable choices. KisMATH never runs that comparison. It compares math-completions to non-math-tokens and finds the former more probable. Once you see this, the “implicit realization” claim collapses into something close to: language models trained on math text are good at predicting math tokens given math context.

Mediation conflated with ablation

The paper imports causal-mediation language — direct effect, indirect effect, mediator, “constitutive of reasoning” — from the Pearl/Imai framework. In that framework you vary a treatment, observe an outcome, and decompose the effect into a part that travels through a mediator and a part that doesn’t.

KisMATH has a question Q (held fixed), a reasoning trace R (partially suppressed), and an answer A (observed). It never varies Q. It suppresses parts of R and watches A change. That is ablation, not mediation. Ablation tells you R-tokens carry information for predicting A. Mediation would tell you R transmits a Q→A effect.

This matters because the word “Causal” in “Causal CoT Graph” is loaded with the assumption that the graph captures a real mediation structure. The graph itself is built from SymPy parse-tree co-occurrence — if expression A and expression B share a parse-tree node, an edge is added. The causal warrant is supposed to come from the mediation experiment downstream. If that experiment is ablation, the warrant doesn’t transfer, and “structural co-occurrence in correctly-completed math text” is what is being measured.
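For concreteness, here is our reading of that edge criterion as a runnable sketch using SymPy. The matching rule (any shared non-atomic sub-expression) and the undirected edges are simplifications and assumptions on our part, not the paper's code:

```python
# Sketch of the co-occurrence edge criterion: connect two expressions when
# their SymPy parse trees share a sub-expression. Atoms (bare symbols and
# numbers) are excluded so that merely mentioning "x" does not create edges.
from sympy import preorder_traversal, sympify

def subtrees(expr_str: str) -> set[str]:
    """Non-atomic sub-expressions of a parsed expression, as canonical strings."""
    return {str(node) for node in preorder_traversal(sympify(expr_str)) if node.args}

def shares_node(a: str, b: str) -> bool:
    return bool(subtrees(a) & subtrees(b))

exprs = ["x**2 - x + 1", "x**2 - x", "2*x + 3"]   # illustrative trace expressions
edges = [(a, b) for i, a in enumerate(exprs)
         for b in exprs[i + 1:] if shares_node(a, b)]
print(edges)   # [('x**2 - x + 1', 'x**2 - x')]: shared subtrees x**2 and -x
```

Note what the criterion rewards: syntactic overlap between correctly formatted expressions, with no reference to the question or the answer.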

The distillation overlap problem

The 15 evaluation models are not a random sample of LLMs. Several were trained on traces structurally similar to o3’s output.

The DeepSeek R1 distilled variants (1.5B, 8B, 32B, 70B in this study) come from a teacher that produces o-style reasoning traces. Qwen 3 underwent reinforcement learning with verifiable rewards and reasoning-style supervision. Gemma 3 had its own reasoning-oriented post-training. These models have been optimized to produce traces in approximately the same structural style that o3 produces.

The evaluation loop is therefore: have o3 produce structured traces → take models trained on similar trace styles → measure how much they “agree” with o3’s structural choices → conclude that LLMs implicitly realize CCGraph structure.

The competing hypothesis — that the rank skew reflects training-data overlap, not internal structural realization — is never controlled for. The signal in the data actually supports the competing hypothesis. Llama 3.3 70B, which had less aggressive reasoning-style post-training, shows the weakest effect on the AIME split (D_KS = 0.31) compared to 0.85+ for the heavily reasoning-distilled models. The “internal realization” effect tracks how much each model has been trained on traces like the ones the test was built from.

To rule out the distillation explanation you would need either models with no reasoning-style post-training, or traces from a stylistically distinct generator. Neither appears in the study.

§6.1: where the thesis quietly fails

Section 6.1 of the paper runs the cleanest test in the entire study. M(G) suppresses the math tokens. M(Gᶜ) suppresses everything else — the discourse glue, the connectives, the natural language. If math expressions really are the mediators of reasoning, suppressing them should hurt the answer dramatically more than suppressing the surrounding language.

On GSM8K, this is what happens. The math-suppression intervention changes the answer 70.9% of the time on average; non-math suppression changes it 10.3%. The thesis holds cleanly.

On MATH500 and AIME, the paper says, in its own words:

M(G) has a slightly stronger effect on the final answer, but a statistically significant difference (χ²-test, α=0.01) between the two interventions is not observed in most cases.

Read that carefully. For olympiad-level math — the harder, more interesting half of the dataset — suppressing math tokens and suppressing the surrounding natural-language glue produce statistically indistinguishable answer-change rates for most of the 15 models. DeepSeek R1 32B on AIME: 22% (math) vs. 16% (non-math). Qwen 3 32B on AIME: 13% vs. 12%. Gemma 3 27B on AIME: 22% vs. 38% — non-math actually wins.
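To see why gaps of that size fail the paper's own significance criterion, put them through a χ²-test on a 2×2 contingency table. The per-model item count below is a hypothetical stand-in (the real counts are in the paper); the point is only that at AIME-scale n, 22% vs. 16% is statistical noise:

```python
# Chi-squared test on a 2x2 (intervention x answer-changed) table, matching
# the paper's alpha = 0.01 criterion. n = 60 is a hypothetical item count.
from scipy.stats import chi2_contingency

n = 60
changed_math = round(0.22 * n)       # M(G): math tokens suppressed
changed_glue = round(0.16 * n)       # M(G^c): non-math tokens suppressed
table = [[changed_math, n - changed_math],
         [changed_glue, n - changed_glue]]
chi2, p, dof, _ = chi2_contingency(table)
print(f"chi2 = {chi2:.2f}, p = {p:.3f}")   # p well above 0.01 at this n
```

A gap of that size needs on the order of hundreds of items per condition to clear α = 0.01, so the null results on the small, hard splits reflect effect size and statistical power together.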

The straightforward conclusion is that the CCGraph captures reasoning structure for arithmetic word problems but is insufficient for olympiad math. Discourse structure, which is by construction outside the graph, carries comparable causal weight at the harder end.

The paper does not draw this conclusion. Instead, it pivots to a discourse-theory citation chain and argues that for harder problems the model also needs the discourse structure. This is a graceful retreat: “Our graph captures part of reasoning. For harder problems you need more.” The abstract is not updated. The title is not updated. The framework is not amended.

This is the textbook discussion-section pattern for confidence inflation: find a result that weakens the thesis, retro-fit a theoretical explanation that preserves the thesis at higher levels of the paper, leave the abstract intact. The serious version of the §6.1 finding would say: the central framework holds for the easy split and fails for the hard splits. Nothing above §6 reflects that.

Other things worth knowing

First-token entropy is brittle for multi-digit answers. All entropy and answer-change measurements are computed over the first token of the answer only. For boxed numerical answers, “12”, “150”, and “1234” all share the first token “1”. A model uncertain among {100, 105, 120} has near-zero first-token entropy but high full-answer entropy. For GSM8K’s mostly-small numbers, first-token entropy is a reasonable proxy. For AIME’s larger answers, the proxy weakens substantially. The authors have the data to compute full-sequence entropy. They don’t report it.
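The brittleness is arithmetic, not speculation. A toy distribution (ours, not the paper's), assuming digit-level tokenization of the answer:

```python
# First-token entropy vs. full-answer entropy for a model genuinely uncertain
# among three answers that share a leading digit.
import math

def entropy_bits(dist: dict[str, float]) -> float:
    return sum(p * math.log2(1 / p) for p in dist.values() if p > 0)

answers = {"100": 1 / 3, "105": 1 / 3, "120": 1 / 3}

first: dict[str, float] = {}
for ans, p in answers.items():
    first[ans[0]] = first.get(ans[0], 0.0) + p   # collapse to first token

print(f"{entropy_bits(answers):.2f} bits")   # 1.58: real uncertainty
print(f"{entropy_bits(first):.2f} bits")     # 0.00: every candidate starts "1"
```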

The prompting style induces the analyzable structure. The o3 system prompt explicitly mandates: each thinking step on a separate line, all mathematical expressions in inline LaTeX inside dollar signs, no multi-line equations. The line-separated, SymPy-parseable expression structure that the algorithm requires is induced by the prompt. The CCGraphs are partly an artifact of formatting requirements, not a discovered property of LLM reasoning.

Selection bias compounds. The pipeline keeps only correct-answer traces, and manually intervenes on roughly 10% of degenerate graphs. The reported analysis is over: (o3 succeeded) ∩ (parsing succeeded) ∩ (graph was non-trivial OR was hand-curated to be non-trivial). We have no characterization of CCGraphs from failed reasoning. If the structural-realization signal exists only conditional on correctness, that is a weaker claim — possibly reducing to “successful traces are structurally cleaner than failed ones.”

Path selection is unjustified. R-paths are the “top-k longest unique directed simple paths,” with k = 5 for GSM8K and k = 10 elsewhere. Why longest? Why those k? No ablation. Longest paths string together the most parse-tree-matched expressions, which are by construction the most syntactically deterministic. The path-selection choice that gives the cleanest result is the one selected.
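Our reading of that selection rule, as a runnable sketch; the toy graph, the use of networkx, and the tie-breaking are our assumptions, since the paper does not specify them:

```python
# "Top-k longest unique directed simple paths" over a toy expression DAG.
# Ties in length break arbitrarily here, which is itself an unstated choice.
from itertools import permutations
import networkx as nx

def top_k_longest_paths(g: nx.DiGraph, k: int) -> list[tuple]:
    paths = {tuple(p)
             for s, t in permutations(g.nodes, 2)
             for p in nx.all_simple_paths(g, s, t)}
    return sorted(paths, key=len, reverse=True)[:k]

g = nx.DiGraph([("a", "b"), ("b", "c"), ("a", "c"), ("c", "d")])
print(top_k_longest_paths(g, k=2))
# [('a', 'b', 'c', 'd'), <one of the three length-3 paths>]
```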

The experiment that would have actually tested the claim

There is a single experimental design that would have distinguished “LLMs internally realize CCGraph structure” from “LLMs trained on similar trace styles agree with each other” from “language models predict math tokens well in math context.”

For each of the 15 models, have model M generate its own traces for the problems. Build CCGraphs from M’s own traces. Compute rank distributions for M’s R-paths under M’s own probability distribution. Then compare across models: is there structural agreement on R-paths across independently-generated traces? And include a non-reasoning-tuned baseline — an early base model, or a model with no reasoning-style post-training — to test whether the rank skew is reasoning-specific or a property of any LM in math text.

That experiment would distinguish all three hypotheses. None of the three is currently distinguishable in the data. The paper interprets them all as the first one.
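In outline, the design looks like this. Every helper below is a stub standing in for machinery the paper already has (trace generation, CCGraph extraction, Equation-5 rank computation); only the experimental arrangement is new:

```python
# Self-trace control, as a skeleton. The three stubs stand in for the paper's
# existing pipeline; the baseline model has no reasoning-style post-training.
def generate_trace(model, problem):          # model M's own CoT for the problem
    ...

def build_ccgraph(trace):                    # the paper's extraction pipeline
    ...

def rank_distribution(model, graph, trace):  # Eq. 5 ranks, under M's own probs
    ...

def self_trace_experiment(models, baseline, problems):
    results = {}
    for m in models + [baseline]:
        traces = [generate_trace(m, q) for q in problems]   # not o3's traces
        graphs = [build_ccgraph(t) for t in traces]
        results[m] = [rank_distribution(m, g, t)
                      for g, t in zip(graphs, traces)]
    return results

# "Internal realization" predicts cross-model agreement on R-path structure;
# "shared training style" predicts agreement only among reasoning-distilled
# models; "LMs predict math text" predicts the baseline shows the same skew.
```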

LAF scorecard

| Dimension | Verdict |
| --- | --- |
| Empirical execution | 0.75–0.85 (solid) |
| Dataset / methodology contribution | 0.75–0.85 (real and useful) |
| Scope discipline | 0.45 (geometry, abstract math, proof reasoning all excluded; not surfaced in title) |
| Proxy discipline | 0.40 (probability rank → “knowledge”; behavior → “internal realization”) |
| Confidence calibration | 0.45 (abstract / title overstate; thesis demonstrably weakens on harder splits) |
| Discussion-section honesty | Mixed (§6.1 reports the weakening but does not propagate it upward) |
| Composite validity | ≈ 0.50 |
| Inflation factor on title claim | ≈ 3–4× |

The honest paper inside this paper

Stripped of the theoretical overlay, KisMATH contributes the following:

Scalable extraction of expression-dataflow graphs from chain-of-thought traces, with an arithmetic-versus-olympiad asymmetry in math-token vs. discourse-token information content. For elementary arithmetic, the dataflow graph captures most of the information used by next-token prediction at the answer. For olympiad-level problems, the dataflow graph and the surrounding discourse structure carry comparable information.

That paper is real, useful, and would have been a clean contribution on its own terms. The paper as written is that paper plus 3–4× of theoretical overlay borrowed from causal mediation, dressed in cognitive vocabulary — “knowledge,” “internally realize,” “constitutive of reasoning” — that the experiments do not warrant.

The most generous reading is that the authors built a useful tool and stretched the framing to clear the bar for a TACL accept. The least generous reading is that the rank-metric result has been known to be a property of language-modeling on math text since chain-of-thought was introduced, and the paper has rediscovered it under a causal-graph wrapper.

Both readings are interesting. Neither supports the abstract.

Why this matters beyond one paper

KisMATH is not an outlier. The pattern it exhibits — solid empirical work, scope inflation in the title and abstract, proxy elevation from a measurement metric to a cognitive claim, a buried result in discussion that quietly contradicts the thesis — is the dominant failure mode of contemporary ML papers about LLM cognition.

What the Logical Analysis Framework does, in cases like this, is separate the empirical contribution from the framing. Both can be evaluated on their own terms. Reviewers, readers, and downstream researchers can then engage with each at the appropriate level of confidence. The dataset is worth using. The methodology is worth building on. The cognitive claims about “knowledge,” “realization,” and “constitutive reasoning” require an experiment that has not yet been run.

AI reflects consensus. It doesn’t question it. That’s why this work matters.


They find papers. We find flaws.

inga314 builds adjustable analytical layers on top of AI — a methodology for detecting scope violations, proxy elevation, causal inflation, and confidence inflation in high-stakes documents. From regulatory submissions to scientific papers to investment pitches, LAF turns critical analysis into a measurable, repeatable process.

Dan Aridor · inga314.ai · daridor.blog · daridor@inga314.ai

Published by:


Dan D. Aridor

I hold an MBA from Columbia Business School (1994) and a BA in Economics and Business Management from Bar-Ilan University (1991). Previously, I served as a Lieutenant Colonel (reserve) in the Israeli Intelligence Corps. Additionally, I have extensive experience managing various R&D projects across diverse technological fields. In 2024, I founded INGA314.com, a platform dedicated to providing professional scientific consultations and analytical insights. I am passionate about history and science fiction, and I occasionally write about these topics.
