The Seed-Prover: An investigation into one of 2025’s most hyped AI achievements

How ByteDance’s “IMO Breakthrough” might not be as massive as claimed


https://www.alphaxiv.org/abs/2507.23726v1

When ByteDance announced that their Seed-Prover AI had “proved 5 out of 6 problems” at the prestigious International Mathematical Olympiad (IMO) 2025, the AI community erupted in celebration. Headlines proclaimed a new era of automated mathematical reasoning. The paper, featuring 36 co-authors and impressive performance charts, seemed to herald a genuine breakthrough.

But a detailed analysis using the INGA314 platform reveals a somewhat different story: one of computational brute force disguised as elegant reasoning, and competition rules bent along the way.

The Headline Lie: 5 Problems or 4?

Let’s start with the most basic claim. Seed-Prover “proves 5 out of 6 problems in IMO 2025,” according to the paper’s abstract. This claim appears in the title, abstract, and throughout their promotional materials.

Reality Check: During the actual competition, Seed-Prover solved only 4 problems. Problem 1 was completed “after the deadline” through “4 additional days of search.” In IMO scoring, post-competition solutions don’t count. The actual competition score was 30 points (silver medal level), not the implied near-perfect performance.

This isn’t a minor detail—it’s the difference between “we competed fairly and did well” versus “we nearly matched human gold medalists.”

The 72-Hour Time Violation That Nobody Talks About

Here’s where things get truly scandalous. The IMO has strict time limits: 4.5 hours for 3 problems. It’s a test of mathematical insight under pressure, not computational endurance.

Seed-Prover’s actual times:

  • Problem 3: 3 days (72 hours)
  • Problem 4: 3 days (72 hours)
  • Problem 1: 7 days total
  • Problem 5: 1 day (24 hours)

Seventy-two hours against a 4.5-hour session is a factor of 16, roughly 1,500% over the time limit, and Problem 1’s seven days works out to roughly 37 times the limit. Imagine a human competitor saying “I need 3 days instead of 4.5 hours” and then claiming their performance is comparable to those who followed the rules. It’s not just unfair; it’s absurd.
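To make the overrun concrete, here is a minimal sketch of the arithmetic. The wall-clock figures are the ones listed above; the 4.5-hour session length is the official IMO allotment for three problems.

```python
# Back-of-envelope check of the reported runtimes against the IMO limit.
# Hour figures are the wall-clock times reported in the paper; 4.5 hours
# is the official allotment for one 3-problem IMO session.

IMO_SESSION_HOURS = 4.5

reported_hours = {
    "Problem 3": 72,    # 3 days
    "Problem 4": 72,    # 3 days
    "Problem 5": 24,    # 1 day
    "Problem 1": 168,   # 7 days total, finished after the deadline
}

for problem, hours in reported_hours.items():
    factor = hours / IMO_SESSION_HOURS
    print(f"{problem}: {hours} h is {factor:.0f}x the 4.5 h session "
          f"({(factor - 1) * 100:.0f}% over the limit)")
```

Running this prints, for example, “Problem 3: 72 h is 16x the 4.5 h session (1500% over the limit)”, which is where the figures above come from.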

The Computational Shell Game

The paper introduces three “inference settings”: Light, Medium, and Heavy. Sounds technical and reasonable, right? Here’s what they actually mean:

  • Light: 64-256 attempts, 1-2 hours runtime
  • Medium: ~10× more compute than Light, includes “inner refinement”
  • Heavy: “Days of thinking,” 5,000 conjectures, thousands of proven lemmas

Now here’s the trick: they use different settings for different problems, then compare against systems that use fixed evaluation protocols. It’s like a runner choosing between a bicycle, motorcycle, or jet plane for each segment of a race, then claiming the fastest overall time.

The real kicker? The computational costs are hidden throughout. Based on the described procedures, we estimate Seed-Prover used over 10,000 GPU-hours for its IMO performance. The paper never mentions this.
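Where could a five-figure GPU-hour estimate come from? Here is a hedged back-of-envelope sketch. The wall-clock runtimes are the ones described in the paper; the parallel-GPU count and the per-hour price are illustrative assumptions, not figures from the paper.

```python
# A hedged estimate of total GPU-hours and cost. Wall-clock runtimes are
# from the paper's description; the parallel-GPU count and hourly price
# are assumptions and dominate the uncertainty.

ASSUMED_PARALLEL_GPUS = 64          # assumption: a modest inference cluster
ASSUMED_PRICE_PER_GPU_HOUR = 5.0    # assumption: USD for a high-end accelerator

wall_clock_hours = {
    "Problem 3": 72,
    "Problem 4": 72,
    "Problem 5": 24,
    "Problem 1": 168,   # includes the post-deadline search
}

gpu_hours = sum(h * ASSUMED_PARALLEL_GPUS for h in wall_clock_hours.values())
cost = gpu_hours * ASSUMED_PRICE_PER_GPU_HOUR
print(f"~{gpu_hours:,} GPU-hours, roughly ${cost:,.0f} under these assumptions")
```

Even halving both assumptions leaves the total above the 10,000 GPU-hour mark. The point is the order of magnitude, not the exact figure, and it is the order of magnitude that the paper leaves unstated.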

The 4,000-Line “Proof” Problem

Human IMO solutions typically span 1-5 pages. Elegant proofs demonstrate deep understanding through concise reasoning. Seed-Prover’s proofs?

  • Problem 1: 4,000 lines
  • Problem 4: 4,000 lines
  • Problem 3: 2,000 lines
  • Problem 5: 700+ lines

These aren’t proofs in any meaningful sense—they’re exhaustive computational searches formatted as formal verification. It’s like claiming you’ve “solved” chess because you can evaluate every possible move with enough compute.

Visual Manipulation 101

The paper’s Figure 1 deserves a case study of its own in “How to Lie with Statistics”:

  1. Y-axis starts at 30% instead of 0%, making improvements look 3× larger
  2. No computational cost axis—comparing bicycles to jet planes
  3. No error bars or confidence intervals
  4. Cherry-picked metrics that favor their approach

It’s a masterclass in making marginal improvements look revolutionary.
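To see how a truncated axis inflates a gap, here is a small sketch with made-up scores; the 55% and 65% values are illustrative placeholders, not numbers from the paper.

```python
# How starting the y-axis at 30% exaggerates the visual gap between bars.
# Both scores below are illustrative, not figures from Figure 1.

baseline = 0.55   # hypothetical competing system
new = 0.65        # hypothetical headline system
floor = 0.30      # where the truncated y-axis begins

honest = new / baseline                         # bar-height ratio with a 0% axis
truncated = (new - floor) / (baseline - floor)  # ratio once the axis starts at 30%

print(f"Full axis: new bar is {honest:.2f}x taller")           # ~1.18x
print(f"Axis from 30%: new bar looks {truncated:.2f}x taller") # 1.40x
print(f"Visual exaggeration of the gap: {(truncated - 1) / (honest - 1):.1f}x")
```

With the axis floor pushed even closer to the smaller bar, the same ten-point gain looks several times larger still, which is how a truncated axis can make a modest improvement read as a step change.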

The Benchmark Bait-and-Switch

Perhaps most egregiously, the paper redefines established benchmarks to favor their system. For the IMO Shortlist geometry problems, they note: “many geometry problems are ignored and here we fill in those missing geometry problems.”

Translation: “The standard benchmark didn’t make us look good enough, so we changed it.”

They then compare their performance on this modified benchmark against systems evaluated on the original. It’s like adding easier questions to a test after seeing the hard ones, then claiming you scored higher than previous test-takers.

The Training Data Contamination Nobody Wants to Discuss

Seed-Prover was trained on “data statistics of geometry problems over more than the past 20 years.” The IMO problems, while new each year, follow patterns and often build on historical techniques. When your training set includes decades of similar problems, how much of your “reasoning” is just sophisticated pattern matching?

The paper provides no analysis of this potential contamination. No held-out sets. No discussion of how they ensure true generalization versus memorization.

What This Means for AI Research

This isn’t just about one paper or one system. It’s about the standards we accept in AI research. When we allow:

  • Post-competition results to be mixed with competition results
  • Computational costs to be hidden behind abstractions
  • Benchmarks to be redefined for convenience
  • Time limits to be violated by orders of magnitude

…we’re not advancing science. We’re creating a race to the bottom where the biggest computational budget and the cleverest marketing win.

The Real Performance

Strip away the manipulation and what do we get?

  • Actual IMO 2025 performance: 4/6 problems with massive time violations
  • Under real IMO constraints: Likely 0-1 problems solvable
  • Computational requirement: 10,000+ GPU-hours ($100,000+ in compute)
  • True innovation: Minimal—it’s brute force with good engineering

A Call for Standards

The AI community needs to demand better:

  1. Fixed evaluation protocols: Same computational budget for all systems
  2. Time limit enforcement: Respect competition constraints
  3. Full computational disclosure: Report GPU-hours, not just accuracy
  4. Clean test sets: Prove contamination hasn’t occurred
  5. Honest reporting: Separate competition from post-competition results

Conclusion

Seed-Prover represents an impressive engineering effort. Combining reinforcement learning with formal verification and throwing massive compute at problems can indeed solve some IMO problems—given enough time and resources.

But it’s not a breakthrough in mathematical reasoning. It’s a breakthrough in computational brute force, wrapped in a paper that systematically misrepresents its achievements through every trick in the academic misconduct playbook.

The real tragedy? ByteDance clearly has talented researchers and significant resources. They could have made genuine contributions to automated reasoning. Instead, they chose to optimize for headlines over science.

As we enter an era where AI achievements are increasingly difficult to verify and computational resources can paper over fundamental limitations, we must be more vigilant than ever. Not every claimed breakthrough is what it seems. Sometimes, it’s just 10,000 GPUs in a trench coat, pretending to be a mathematician.


The author conducted this analysis using the INGA314 platform, examining the paper for logical inconsistencies, temporal paradoxes, and methodological issues. All computational estimates are based on descriptions in the paper.

Published by:


Dan D. Aridor

I hold an MBA from Columbia Business School (1994) and a BA in Economics and Business Management from Bar-Ilan University (1991). Previously, I served as a Lieutenant Colonel (reserve) in the Israeli Intelligence Corps. Additionally, I have extensive experience managing various R&D projects across diverse technological fields. In 2024, I founded INGA314.com, a platform dedicated to providing professional scientific consultations and analytical insights. I am passionate about history and science fiction, and I occasionally write about these topics.
