Research-math reasoning, judged beyond raw solved counts.
ErdosBench evaluates how models behave on research-level, Erdős-inspired mathematical problems: finding decisive obstructions, using known theorems correctly, producing scoped partial progress, and avoiding unsafe solved claims.
The leaderboard below is based on the full 226-problem run with external judge grades and model-specific proof audits. It is not a formal theorem-certification board; it is a correctness-first signal of which systems generate useful, review-worthy mathematical progress.
External-judge leaderboard
Grades use a compact A/B/C/D/F/M scale: A means mathematically sound and appropriately scoped; B means useful but incomplete or underpowered; C means plausible with a material gap; D/F mark misleading or contradicted claims; M marks a missing row.
| Rank | Model | Coverage | Judge grades | Strong claims | Proof verification read | Avg | Leaderboard judgment |
|---|---|---|---|---|---|---|---|
| 1 | GPT-5.5 xhigh / CodexHighest correctness-adjusted broad run | 226/226rows judged | A/B/C/D/F/M: 27 / 159 / 40 / 0 / 0 / 0 | 63self strong claims | 30 verified · 9 literal · 16 conditional · 8 partial-mislabeled · 0 rejected | 2.68judge avg, 0–4 | Best overall: most high-yield strong claims, full coverage, no D/F judge rows, and the cleanest proof-verification profile. |
| 2 | Kimi K2.7 CodeMost inventive challenger | 220/226rows judged | A/B/C/D/F/M: 32 / 151 / 32 / 3 / 2 / 6 | 50self strong claims | Strong audit: A=30 · B=8 · C=7 · D=3 · F=2; solved-label audit has 7 proof gaps and 4 rejected | 2.62judge avg, 0–4 | Best at compact constructions and counterexamples, but missing rows and proof-gap/rejected solved labels reduce trust. |
| 3 | Claude Opus 4.8 maxBest reviewer and partial-depth model | 226/226rows judged | A/B/C/D/F/M: 22 / 135 / 69 / 0 / 0 / 0 | 37self strong claims | 9 green natural · 4 green literal · 6 amber partial/overclaimed · 2 red | 2.51judge avg, 0–4 | Safest broadly: strong proof hygiene and scoped partials, but fewer decisive breakthroughs than GPT or Kimi. |
| 4 | Qwen 3.7 MaxUseful full-coverage corroborator | 226/226rows judged | A/B/C/D/F/M: 18 / 113 / 91 / 3 / 1 / 0 | 20self strong claims | 12 solved labels: 5 complete · 3 literal/conditional · 2 partial/overclaimed · 2 incorrect/not solved | 2.37judge avg, 0–4 | Good when it agrees with others, but isolated solved claims are high-risk and proof hygiene is weaker. |
The safest public signal is not raw solved count. ErdosBench separates clean theorem-style proofs, literal or convention-dependent resolutions, conditional arguments, useful partials, and rejected claims.
Why ErdosBench is valuable
Decisive obstruction finding
Does the model spot a hidden divisor bound, coloring invariant, degeneracy, or counterexample that changes the problem?
Known theorem use
Does it deploy the right theorem family without hallucinating scope, constants, or missing hypotheses?
Proof hygiene
Does it distinguish solved, conditional, partial, literal, and rejected claims instead of overmarketing a sketch?
Review yield
Does it produce enough high-quality, review-ready mathematical material to justify expert attention?
Leaderboard read: GPT-5.5 xhigh is the strongest correctness-adjusted full-run model.
Kimi K2.7 is the most creative challenger, but proof verification downgrades several solved labels.
Claude Opus 4.8 is the strongest reviewer/partial-progress model. Qwen 3.7 Max is useful as corroboration, not as a lead source for public solved claims.
Proof audits: why raw solved counts are not enough
| Model | Scope audited | Verification buckets | Takeaway |
|---|---|---|---|
| GPT-5.5 xhigh / Codex | 63 non-partial attempted resolutions | 30 verified 9 literal 16 conditional 8 partial-mislabeled 0 rejected | Best high-yield solver, but still needs expert review for conditional and partial-mislabeled rows. |
| Kimi K2.7 Code | 30 solved-label claims; 50 strong claims deep-audited | 6 verified 4 literal 4 conditional 5 partial-mislabeled 7 proof gaps 4 rejected | Excellent source of candidate breakthroughs and counterexamples; solved labels need strict checking. |
| Claude Opus 4.8 max | 21 solved-label claims | 9 green natural 4 green literal 6 amber 2 red | Best reviewer profile: careful partials and caveats, but not self-certifying on solved labels. |
| Qwen 3.7 Max | 12 solved-label claims | 5 complete 3 literal/conditional 2 partial 2 incorrect/not solved | Useful corroborator; isolated solved claims should be treated as high-risk. |
Benchmark design
What is measured?
Solving is only one signal. ErdosBench also scores statement debugging, counterexample search, theorem selection, proof-gap detection, finite checks, and scoped partial progress.
Why it matters: these are the skills needed for research assistants, theorem-proving agents, and verifier-backed RLVR loops.
What is protected?
The 226 source records are research candidates, not a public list of certified open problems. Private splits, hidden verifier keys, expert rubrics, and review queues stay sealed.
Public reports show aggregate skill, proof hygiene, and leaderboard movement without burning holdout statements.
Access
Run ErdosBench on your model.
Ulam can run private evaluations, build domain-specific research-math benchmarks, and convert failures into trainable proof-process data. Contact us for hidden-split access, custom model comparisons, or verifier-backed RLVR environments.
