How ByteDance’s “IMO Breakthrough” might not be as massive as claimed

https://www.alphaxiv.org/abs/2507.23726v1
When ByteDance announced that their Seed-Prover AI had “proved 5 out of 6 problems” at the prestigious International Mathematical Olympiad (IMO) 2025, the AI community erupted in celebration. Headlines proclaimed a new era of automated mathematical reasoning. The paper, featuring 36 co-authors and impressive performance charts, seemed to herald a genuine breakthrough.
But a detailed analysis using the INGA314 reveals a somewhat different story: one of computational brute force disguised as elegant reasoning, with competition rules bent along the way.
The Headline Lie: 5 Problems or 4?
Let’s start with the most basic claim. Seed-Prover “proves 5 out of 6 problems in IMO 2025,” according to the paper’s abstract. This claim appears in the title, abstract, and throughout their promotional materials.
Reality Check: During the actual competition, Seed-Prover solved only 4 problems. Problem 1 was completed “after the deadline” through “4 additional days of search.” In IMO scoring, post-competition solutions don’t count. The actual competition score was 30 points (silver medal level), not the implied near-perfect performance.
This isn’t a minor detail; it’s the difference between “we competed fairly and did well” and “we nearly matched human gold medalists.”
The 16× Time Violation That Nobody Talks About
Here’s where things get truly scandalous. The IMO has strict time limits: 4.5 hours for 3 problems. It’s a test of mathematical insight under pressure, not computational endurance.
Seed-Prover’s actual times:
- Problem 3: 3 days (72 hours)
- Problem 4: 3 days (72 hours)
- Problem 1: 7 days total
- Problem 5: 1 day (24 hours)
Even setting aside the post-deadline run on Problem 1, that’s up to roughly 16 times the 4.5-hour limit, about 1,500% over. Imagine a human competitor saying “I need 3 days instead of 4.5 hours” and claiming their performance is comparable to those who followed the rules. It’s not just unfair; it’s absurd.
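To make the arithmetic explicit, here is a quick back-of-envelope check using only the per-problem times listed above and the 4.5-hour limit for a 3-problem IMO paper; nothing below is an estimate beyond what the paper itself reports.

```python
# Back-of-envelope check: how far each reported solve time exceeds the
# 4.5-hour limit for a 3-problem IMO paper. Times are as reported in the paper.
SESSION_LIMIT_HOURS = 4.5

reported_hours = {
    "Problem 3": 72,    # 3 days
    "Problem 4": 72,    # 3 days
    "Problem 1": 168,   # 7 days total, finished after the deadline
    "Problem 5": 24,    # 1 day
}

for problem, hours in reported_hours.items():
    factor = hours / SESSION_LIMIT_HOURS
    print(f"{problem}: {hours} h is {factor:.0f}x the limit "
          f"({(factor - 1) * 100:.0f}% over)")
```

Run as written, this prints a 16x overrun for the 3-day solves, which is where the "roughly 1,500% over" figure comes from; the 7-day Problem 1 run comes out at about 37x.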
The Computational Shell Game
The paper introduces three “inference settings”: Light, Medium, and Heavy. Sounds technical and reasonable, right? Here’s what they actually mean:
- Light: 64-256 attempts, 1-2 hours runtime
- Medium: ~10× more compute than Light, includes “inner refinement”
- Heavy: “Days of thinking,” 5,000 conjectures, thousands of proven lemmas
Now here’s the trick: they use different settings for different problems, then compare against systems that use fixed evaluation protocols. It’s like a runner choosing between a bicycle, motorcycle, or jet plane for each segment of a race, then claiming the fastest overall time.
The real kicker? The computational costs are hidden throughout. Based on the described procedures, we estimate Seed-Prover used over 10,000 GPU-hours for its IMO performance. The paper never mentions this.
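To show how such an estimate can be assembled, here is the kind of back-of-envelope arithmetic involved. The wall-clock durations are the ones described in the paper; the cluster size and the per-GPU-hour price are illustrative assumptions, since the paper discloses neither.

```python
# Illustrative compute estimate. The wall-clock durations come from the
# paper's description; the GPU count and hourly price are assumptions,
# because the paper reports neither.
ASSUMED_PARALLEL_GPUS = 32        # assumption: a conservative cluster for the "heavy" setting
ASSUMED_USD_PER_GPU_HOUR = 10.0   # assumption: rough on-demand cloud rate for a modern GPU

wall_clock_hours = 72 + 72 + 168 + 24   # Problems 3, 4, 1, 5 as reported

gpu_hours = wall_clock_hours * ASSUMED_PARALLEL_GPUS
cost_usd = gpu_hours * ASSUMED_USD_PER_GPU_HOUR

print(f"~{gpu_hours:,.0f} GPU-hours, roughly ${cost_usd:,.0f} in compute")
# With these assumptions: ~10,752 GPU-hours and ~$107,520 -- and both
# inputs are arguably on the low side for a search over 5,000 conjectures.
```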
The 4,000-Line “Proof” Problem
Human IMO solutions typically span 1-5 pages. Elegant proofs demonstrate deep understanding through concise reasoning. Seed-Prover’s proofs?
- Problem 1: 4,000 lines
- Problem 4: 4,000 lines
- Problem 3: 2,000 lines
- Problem 5: 700+ lines
These aren’t proofs in any meaningful sense—they’re exhaustive computational searches formatted as formal verification. It’s like claiming you’ve “solved” chess because you can evaluate every possible move with enough compute.
Visual Manipulation 101
The paper’s Figure 1 deserves its own case study in “How to Lie with Statistics”:
- Y-axis starts at 30% instead of 0%, making improvements look 3× larger
- No computational cost axis—comparing bicycles to jet planes
- No error bars or confidence intervals
- Cherry-picked metrics that favor their approach
It’s a masterclass in making marginal improvements look revolutionary.
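To see how much work the truncated axis does, here is a small matplotlib sketch; the two scores are made up purely for illustration and are not the paper’s numbers.

```python
# Illustration of the axis trick (the numbers are made up, not taken from
# the paper): the same two scores on a truncated axis and on a full axis.
import matplotlib.pyplot as plt

systems = ["Baseline", "New system"]
scores = [40, 50]  # illustrative pass rates, in percent

fig, (ax_truncated, ax_full) = plt.subplots(1, 2, figsize=(8, 3))

ax_truncated.bar(systems, scores)
ax_truncated.set_ylim(30, 55)           # axis floor at 30%, as in Figure 1
ax_truncated.set_title("Truncated axis: gap looks dramatic")

ax_full.bar(systems, scores)
ax_full.set_ylim(0, 100)                # honest 0-100% axis
ax_full.set_title("Full axis: a 10-point gap")

for ax in (ax_truncated, ax_full):
    ax.set_ylabel("Pass rate (%)")

plt.tight_layout()
plt.show()
```

In the left panel the second bar looks roughly twice as tall as the first; in the right panel the same data reads as a 10-point gap on a 100-point scale.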
The Benchmark Bait-and-Switch
Perhaps most egregiously, the paper redefines established benchmarks to favor its own system. For the IMO Shortlist geometry problems, the authors note: “many geometry problems are ignored and here we fill in those missing geometry problems.”
Translation: “The standard benchmark didn’t make us look good enough, so we changed it.”
They then compare their performance on this modified benchmark against systems evaluated on the original. It’s like adding easier questions to a test after seeing the hard ones, then claiming you scored higher than previous test-takers.
The Training Data Contamination Nobody Wants to Discuss
Seed-Prover was trained on “data statistics of geometry problems over more than the past 20 years.” The IMO problems, while new each year, follow patterns and often build on historical techniques. When your training set includes decades of similar problems, how much of your “reasoning” is just sophisticated pattern matching?
The paper provides no analysis of this potential contamination. No held-out sets. No discussion of how they ensure true generalization versus memorization.
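Even a crude overlap check would have been better than silence. The sketch below is a minimal word-level n-gram contamination test, assuming one had access to the training corpus; the function names are hypothetical and nothing of the sort appears in the paper.

```python
# Minimal sketch of an n-gram overlap contamination check, assuming access
# to the training corpus and the evaluation problems. The function names
# are hypothetical; the paper describes no such analysis.
def ngrams(text: str, n: int = 8) -> set[tuple[str, ...]]:
    """Return the set of word-level n-grams in a piece of text."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def contamination_score(problem: str, training_docs: list[str], n: int = 8) -> float:
    """Fraction of the problem's n-grams that also appear in the training data."""
    problem_grams = ngrams(problem, n)
    if not problem_grams:
        return 0.0
    train_grams: set[tuple[str, ...]] = set()
    for doc in training_docs:
        train_grams |= ngrams(doc, n)
    return len(problem_grams & train_grams) / len(problem_grams)

# A high score would not prove memorization on its own, but reporting it is
# the bare minimum for a claim of genuine generalization.
```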
What This Means for AI Research
This isn’t just about one paper or one system. It’s about the standards we accept in AI research. When we allow:
- Post-competition results to be mixed with competition results
- Computational costs to be hidden behind abstractions
- Benchmarks to be redefined for convenience
- Time limits to be violated by orders of magnitude
…we’re not advancing science. We’re creating a race to the bottom where the biggest computational budget and the cleverest marketing win.
The Real Performance
Strip away the manipulation and what do we get?
- Actual IMO 2025 performance: 4/6 problems with massive time violations
- Under real IMO constraints: Likely 0-1 problems solvable
- Computational requirement: 10,000+ GPU-hours ($100,000+ in compute)
- True innovation: Minimal—it’s brute force with good engineering
A Call for Standards
The AI community needs to demand better:
- Fixed evaluation protocols: Same computational budget for all systems
- Time limit enforcement: Respect competition constraints
- Full computational disclosure: Report GPU-hours, not just accuracy
- Clean test sets: Prove contamination hasn’t occurred
- Honest reporting: Separate competition from post-competition results
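None of this requires heavy machinery. The sketch below is a hypothetical configuration object, not an existing standard; every field name and number is a placeholder meant to show what “fixed and disclosed” could look like in practice.

```python
# Hypothetical sketch of a fixed, disclosed evaluation protocol. The field
# names and values are placeholders, not an existing community standard.
from dataclasses import dataclass

@dataclass(frozen=True)
class EvaluationProtocol:
    wall_clock_limit_hours: float   # e.g. 4.5, matching the competition session
    gpu_hour_budget: float          # total compute allowed for the problem set
    attempts_per_problem: int       # fixed in advance, not chosen per problem
    results_split: bool             # competition vs. post-deadline reported separately
    contamination_report: bool      # overlap analysis against training data required

IMO_STYLE_PROTOCOL = EvaluationProtocol(
    wall_clock_limit_hours=4.5,
    gpu_hour_budget=100.0,          # illustrative budget; would need community agreement
    attempts_per_problem=64,        # illustrative; the point is that it is fixed
    results_split=True,
    contamination_report=True,
)
```

The point of freezing the protocol object is that the budget is set once, before evaluation, and applied identically to every system and every problem.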
Conclusion
Seed-Prover represents an impressive engineering effort. Combining reinforcement learning with formal verification and throwing massive compute at problems can indeed solve some IMO problems—given enough time and resources.
But it’s not a breakthrough in mathematical reasoning. It’s a breakthrough in computational brute force, wrapped in a paper that systematically misrepresents its achievements through every trick in the academic misconduct playbook.
The real tragedy? ByteDance clearly has talented researchers and significant resources. They could have made genuine contributions to automated reasoning. Instead, they chose to optimize for headlines over science.
As we enter an era where AI achievements are increasingly difficult to verify and computational resources can paper over fundamental limitations, we must be more vigilant than ever. Not every claimed breakthrough is what it seems. Sometimes, it’s just 10,000 GPUs in a trench coat, pretending to be a mathematician.
The author conducted this analysis using the INGA314, examining the paper for logical inconsistencies, temporal paradoxes, and methodological issues. All computational estimates are based on descriptions in the paper.
