ErdosBench leaderboard

Research-math reasoning, judged beyond raw solved counts.

ErdosBench evaluates how models behave on research-level, Erdős-inspired mathematical problems: finding decisive obstructions, using known theorems correctly, producing scoped partial progress, and avoiding unsafe solved claims.

The leaderboard below is based on the full 226-problem run with external judge grades and model-specific proof audits. It is not a formal theorem-certification board; it is a correctness-first signal of which systems generate useful, review-worthy mathematical progress.

226Erdős-inspired candidate problems, with provenance and audit metadata
904model/problem judge rows across four full runs
25problems with at least three A-grade model attempts
63GPT-5.5 xhigh non-partial resolutions proof-audited

External-judge leaderboard

Grades use a compact A/B/C/D/F/M scale: A means mathematically sound and appropriately scoped; B means useful but incomplete or underpowered; C means plausible with a material gap; D/F mark misleading or contradicted claims; M marks a missing row.

A sound / scoped B useful C material gap D likely wrong F contradicted M missing
Rank Model Coverage Judge grades Strong claims Proof verification read Avg Leaderboard judgment
1 GPT-5.5 xhigh / CodexHighest correctness-adjusted broad run 226/226rows judged A/B/C/D/F/M: 27 / 159 / 40 / 0 / 0 / 0 63self strong claims 30 verified · 9 literal · 16 conditional · 8 partial-mislabeled · 0 rejected 2.68judge avg, 0–4 Best overall: most high-yield strong claims, full coverage, no D/F judge rows, and the cleanest proof-verification profile.
2 Kimi K2.7 CodeMost inventive challenger 220/226rows judged A/B/C/D/F/M: 32 / 151 / 32 / 3 / 2 / 6 50self strong claims Strong audit: A=30 · B=8 · C=7 · D=3 · F=2; solved-label audit has 7 proof gaps and 4 rejected 2.62judge avg, 0–4 Best at compact constructions and counterexamples, but missing rows and proof-gap/rejected solved labels reduce trust.
3 Claude Opus 4.8 maxBest reviewer and partial-depth model 226/226rows judged A/B/C/D/F/M: 22 / 135 / 69 / 0 / 0 / 0 37self strong claims 9 green natural · 4 green literal · 6 amber partial/overclaimed · 2 red 2.51judge avg, 0–4 Safest broadly: strong proof hygiene and scoped partials, but fewer decisive breakthroughs than GPT or Kimi.
4 Qwen 3.7 MaxUseful full-coverage corroborator 226/226rows judged A/B/C/D/F/M: 18 / 113 / 91 / 3 / 1 / 0 20self strong claims 12 solved labels: 5 complete · 3 literal/conditional · 2 partial/overclaimed · 2 incorrect/not solved 2.37judge avg, 0–4 Good when it agrees with others, but isolated solved claims are high-risk and proof hygiene is weaker.

The safest public signal is not raw solved count. ErdosBench separates clean theorem-style proofs, literal or convention-dependent resolutions, conditional arguments, useful partials, and rejected claims.

Why ErdosBench is valuable

Decisive obstruction finding

Does the model spot a hidden divisor bound, coloring invariant, degeneracy, or counterexample that changes the problem?

Known theorem use

Does it deploy the right theorem family without hallucinating scope, constants, or missing hypotheses?

Proof hygiene

Does it distinguish solved, conditional, partial, literal, and rejected claims instead of overmarketing a sketch?

Review yield

Does it produce enough high-quality, review-ready mathematical material to justify expert attention?

Leaderboard read: GPT-5.5 xhigh is the strongest correctness-adjusted full-run model.

Kimi K2.7 is the most creative challenger, but proof verification downgrades several solved labels.

Claude Opus 4.8 is the strongest reviewer/partial-progress model. Qwen 3.7 Max is useful as corroboration, not as a lead source for public solved claims.

Proof audits: why raw solved counts are not enough

Model Scope audited Verification buckets Takeaway
GPT-5.5 xhigh / Codex 63 non-partial attempted resolutions 30 verified 9 literal 16 conditional 8 partial-mislabeled 0 rejected Best high-yield solver, but still needs expert review for conditional and partial-mislabeled rows.
Kimi K2.7 Code 30 solved-label claims; 50 strong claims deep-audited 6 verified 4 literal 4 conditional 5 partial-mislabeled 7 proof gaps 4 rejected Excellent source of candidate breakthroughs and counterexamples; solved labels need strict checking.
Claude Opus 4.8 max 21 solved-label claims 9 green natural 4 green literal 6 amber 2 red Best reviewer profile: careful partials and caveats, but not self-certifying on solved labels.
Qwen 3.7 Max 12 solved-label claims 5 complete 3 literal/conditional 2 partial 2 incorrect/not solved Useful corroborator; isolated solved claims should be treated as high-risk.

Benchmark design

What is measured?

Solving is only one signal. ErdosBench also scores statement debugging, counterexample search, theorem selection, proof-gap detection, finite checks, and scoped partial progress.

Why it matters: these are the skills needed for research assistants, theorem-proving agents, and verifier-backed RLVR loops.

What is protected?

The 226 source records are research candidates, not a public list of certified open problems. Private splits, hidden verifier keys, expert rubrics, and review queues stay sealed.

Public reports show aggregate skill, proof hygiene, and leaderboard movement without burning holdout statements.

1Candidate problem corpus with provenance and risk notes
2Full model runs with structured verdicts and evidence
3External judge grades and proof-verification passes
4Review queues for expert mathematicians and verifier authors
5Training data: SFT, critique, reward-model, and RLVR views

Access

Run ErdosBench on your model.

Ulam can run private evaluations, build domain-specific research-math benchmarks, and convert failures into trainable proof-process data. Contact us for hidden-split access, custom model comparisons, or verifier-backed RLVR environments.