ErdosBench leaderboard

Research-math reasoning, judged beyond raw solved counts.

ErdosBench evaluates how models behave on research-level, Erdős-inspired mathematical problems: finding decisive obstructions, using known theorems correctly, producing scoped partial progress, and avoiding unsafe solved claims.

The leaderboard below is based on the full 226-problem run with external judge grades and model-specific proof audits. It is not a formal theorem-certification board; it is a correctness-first signal of which systems generate useful, review-worthy mathematical progress.

View Leaderboard Scoring Methodology Proof Audits ErdosBench Repository Discuss Private Eval

226Erdős-inspired candidate problems, with provenance and audit metadata

904model/problem judge rows across four full runs

25problems with at least three A-grade model attempts

63GPT-5.5 xhigh non-partial resolutions proof-audited

External-judge leaderboard

Grades use a compact A/B/C/D/F/M scale: A means mathematically sound and appropriately scoped; B means useful but incomplete or underpowered; C means plausible with a material gap; D/F mark misleading or contradicted claims; M marks a missing row.

A sound / scoped B useful C material gap D likely wrong F contradicted M missing

Rank	Model	Coverage	Judge grades	Strong claims	Proof verification read	Avg	Leaderboard judgment
1	GPT-5.5 xhigh / CodexHighest correctness-adjusted broad run	226/226rows judged	A/B/C/D/F/M: 27 / 159 / 40 / 0 / 0 / 0	63self strong claims	30 verified · 9 literal · 16 conditional · 8 partial-mislabeled · 0 rejected	2.68judge avg, 0–4	Best overall: most high-yield strong claims, full coverage, no D/F judge rows, and the cleanest proof-verification profile.
2	Kimi K2.7 CodeMost inventive challenger	220/226rows judged	A/B/C/D/F/M: 32 / 151 / 32 / 3 / 2 / 6	50self strong claims	Strong audit: A=30 · B=8 · C=7 · D=3 · F=2; solved-label audit has 7 proof gaps and 4 rejected	2.62judge avg, 0–4	Best at compact constructions and counterexamples, but missing rows and proof-gap/rejected solved labels reduce trust.
3	Claude Opus 4.8 maxBest reviewer and partial-depth model	226/226rows judged	A/B/C/D/F/M: 22 / 135 / 69 / 0 / 0 / 0	37self strong claims	9 green natural · 4 green literal · 6 amber partial/overclaimed · 2 red	2.51judge avg, 0–4	Safest broadly: strong proof hygiene and scoped partials, but fewer decisive breakthroughs than GPT or Kimi.
4	Qwen 3.7 MaxUseful full-coverage corroborator	226/226rows judged	A/B/C/D/F/M: 18 / 113 / 91 / 3 / 1 / 0	20self strong claims	12 solved labels: 5 complete · 3 literal/conditional · 2 partial/overclaimed · 2 incorrect/not solved	2.37judge avg, 0–4	Good when it agrees with others, but isolated solved claims are high-risk and proof hygiene is weaker.

The safest public signal is not raw solved count. ErdosBench separates clean theorem-style proofs, literal or convention-dependent resolutions, conditional arguments, useful partials, and rejected claims.

Why ErdosBench is valuable

Decisive obstruction finding

Does the model spot a hidden divisor bound, coloring invariant, degeneracy, or counterexample that changes the problem?

Known theorem use

Does it deploy the right theorem family without hallucinating scope, constants, or missing hypotheses?

Proof hygiene

Does it distinguish solved, conditional, partial, literal, and rejected claims instead of overmarketing a sketch?

Review yield

Does it produce enough high-quality, review-ready mathematical material to justify expert attention?

Leaderboard read: GPT-5.5 xhigh is the strongest correctness-adjusted full-run model.

Kimi K2.7 is the most creative challenger, but proof verification downgrades several solved labels.

Claude Opus 4.8 is the strongest reviewer/partial-progress model. Qwen 3.7 Max is useful as corroboration, not as a lead source for public solved claims.

Proof audits: why raw solved counts are not enough

Model	Scope audited	Verification buckets	Takeaway
GPT-5.5 xhigh / Codex	63 non-partial attempted resolutions	30 verified 9 literal 16 conditional 8 partial-mislabeled 0 rejected	Best high-yield solver, but still needs expert review for conditional and partial-mislabeled rows.
Kimi K2.7 Code	30 solved-label claims; 50 strong claims deep-audited	6 verified 4 literal 4 conditional 5 partial-mislabeled 7 proof gaps 4 rejected	Excellent source of candidate breakthroughs and counterexamples; solved labels need strict checking.
Claude Opus 4.8 max	21 solved-label claims	9 green natural 4 green literal 6 amber 2 red	Best reviewer profile: careful partials and caveats, but not self-certifying on solved labels.
Qwen 3.7 Max	12 solved-label claims	5 complete 3 literal/conditional 2 partial 2 incorrect/not solved	Useful corroborator; isolated solved claims should be treated as high-risk.

Benchmark design

What is measured?

Solving is only one signal. ErdosBench also scores statement debugging, counterexample search, theorem selection, proof-gap detection, finite checks, and scoped partial progress.

Why it matters: these are the skills needed for research assistants, theorem-proving agents, and verifier-backed RLVR loops.

What is protected?

The 226 source records are research candidates, not a public list of certified open problems. Private splits, hidden verifier keys, expert rubrics, and review queues stay sealed.

Public reports show aggregate skill, proof hygiene, and leaderboard movement without burning holdout statements.

1Candidate problem corpus with provenance and risk notes

2Full model runs with structured verdicts and evidence

3External judge grades and proof-verification passes

4Review queues for expert mathematicians and verifier authors

5Training data: SFT, critique, reward-model, and RLVR views

Access

Run ErdosBench on your model.

Ulam can run private evaluations, build domain-specific research-math benchmarks, and convert failures into trainable proof-process data. Contact us for hidden-split access, custom model comparisons, or verifier-backed RLVR environments.

Inspect ErdosBench Data Products Research Talk to Ulam