Synthetic Open Math Problems as a Benchmark for LLMs

We built a synthetic benchmark from 171 IMO-like proof problems and 40 Erdős-style research problems to test whether language models can reason mathematically when memorization is less useful—and when open-ended progress is scored separately from proof correctness.

Mathematical benchmarks are among the cleanest tests of reasoning. A proof can be checked. A construction either has the claimed property or it does not. A counterexample either refutes the statement or it does not. That clarity is why math has become a central way to measure the capabilities of frontier language models.

But the field is running into a familiar problem: the more important a benchmark becomes, the faster it becomes part of the training environment. Existing olympiad-style datasets are useful, but many are close to public contests, public solutions, and common problem templates. Research-level benchmarks help, but they are expensive to create and hard to refresh. We want a benchmark that is rigorous, low-contamination, and renewable.

The approach we are exploring is synthetic open math: generating new mathematical problem families, auditing them, keeping solutions and metadata private, and separating solved proof tasks from open-ended research tasks. The goal is not to replace human mathematicians or expert grading. The goal is to give the community a sharper instrument for measuring whether models can actually do mathematics.

The benchmark in one sentence

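The current internal draft has two complementary tracks: an IMO-like solved-problem track for strict proof evaluation, and an Erdős-style research track for measuring partial progress, calibration, and mathematical taste. An optional third track turns solved or partially solved Erdős-style records into closed subtasks that can be graded strictly.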

| Track | Current size | What it tests | How it should be scored |
| --- | --- | --- | --- |
| AI-IMO-Solved | 171 problems | Complete proof writing on synthetic olympiad-style problems | Strict proof score, with private canonical solutions and human-reviewed rubrics |
| AI-Erdos-Research | 40 problems | Research behavior on Erdős-style open, mixed, and partially solved questions | Separate research-progress score, not pass/fail accuracy |
| AI-Erdos-Known | Optional | Objectively gradable subtasks extracted from solved or partially solved Erdős-style records | Strict proof score after rewriting the prompt into a closed theorem |

This separation is the key design decision. A model can be right or wrong on a solved IMO-like problem. On an open Erdős-style problem, however, the useful behavior may be a new construction, a sharp special case, a reduction, or an honest explanation of where the proof fails. Those are different skills, and they should not be collapsed into one accuracy number.

Why Erdős-style problems?

Erdős problems have a characteristic shape: the statement is often short, the objects are elementary, and the solution may require deep structure. They ask questions like: how large can a set be if it avoids a certain additive pattern? How many colors are needed for a graph defined by arithmetic constraints? Does a simple construction give the right asymptotic constant?

That makes them a good model for research reasoning. A strong response should start with the obvious construction, prove the easy bounds, identify the closest known theorems, and then push beyond them. A weak response will often do the opposite: assert a grand theorem, hallucinate a citation, or silently smuggle in the statement it was supposed to prove.

Our synthetic Erdős-style set currently includes problems around coprimality, Sidon-type conditions, squarefree sums, Beatty sequences, additive bases, forbidden distances, prime-distance graphs, and local-global phenomena in grids. Some records are open, some are solved to order of magnitude, some contain solved variants, and some deliberately expose the gray zone between a good conjecture and a theorem.

Examples of the flavor

A problem such as Coprime Schur-free sets asks for the largest subset of [N] with no equation x + y = z where gcd(x, y) = 1. A problem such as GCD-Sidon sets asks whether one can build sets whose pairwise gcds are all distinct at essentially square-root scale. Later problems move into prime-distance geometry: coloring the integer lattice by prime Euclidean distances, bounding prime-distance-free subsets of square grids, and studying prime-step corners.
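To make the first example concrete, here is a minimal brute-force sketch of the coprime Schur-free condition. The function names are hypothetical, and the code assumes that x and y are allowed to coincide (so 1 + 1 = 2 counts as a violation); the benchmark's actual statement may fix that convention differently.

```python
from itertools import combinations
from math import gcd


def is_coprime_schur_free(s):
    """Return True if no x, y, z in s satisfy x + y = z with gcd(x, y) = 1.

    Assumption: x and y may be equal, so {1, 2} fails because 1 + 1 = 2.
    """
    members = set(s)
    for x in members:
        for y in members:
            if x <= y and gcd(x, y) == 1 and (x + y) in members:
                return False
    return True


def max_coprime_schur_free(n):
    """Exhaustively search for a largest coprime Schur-free subset of {1, ..., n}.

    The search is exponential in n, so it is only usable for very small n.
    """
    universe = range(1, n + 1)
    for size in range(n, 0, -1):
        for candidate in combinations(universe, size):
            if is_coprime_schur_free(candidate):
                return size, candidate
    return 0, ()
```

Two baseline constructions pass this check for any N: the even numbers (every pair has gcd at least 2) and the odd numbers (the sum of two odd numbers is even, so it is never in the set). Both have size about N/2, which is the kind of easy lower bound a strong response should establish before attempting anything harder.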

These are not meant to be trivia questions. They are meant to test whether a model can behave like a careful mathematical collaborator: prove something nontrivial, say what is merely heuristic, and avoid confusing plausibility with proof.

Why IMO-like synthetic problems?

Open-ended research tasks are valuable, but they are not enough for a headline benchmark. We also need closed problems with known solutions, because proof correctness has to be measured directly. That is the role of the AI-IMO-Solved track.

The current file contains 171 synthetic IMO-like problem statements and 171 aligned solutions. They cover the familiar olympiad modes: inequalities, number theory, functional equations, geometry, graph arguments, grid problems, parity, Vieta jumping, and extremal combinatorics. For each problem, the solution file also records similar existing IMO problems. That field is useful for contamination audits and grader calibration, but it should not be exposed during evaluation.

This track is designed to answer a simple question: given a new, non-public proof problem, can the model write a complete solution? Not a sketch, not a plausible outline, not a solution that works after a human repairs the missing lemma—a proof that can survive mathematical scrutiny.

How this compares to FrontierMath and IMO-Bench

We see this work as complementary to existing benchmarks, not as a criticism of them. FrontierMath set a high bar by using original, expert-crafted problems at advanced and research level. FrontierMath: Open Problems goes further by testing whether systems can make progress on unsolved mathematics. IMO-Bench focuses on robust olympiad-level reasoning, with answer, proof, and grading components.

| Benchmark | What it is especially good at | What our synthetic benchmark adds |
| --- | --- | --- |
| FrontierMath | Very hard expert-written mathematics, including research-level problems and automated verification in many cases | A cheaper, renewable generation pipeline and a separate track for Erdős-style research behavior |
| FrontierMath: Open Problems | Testing whether models can advance genuinely unsolved mathematics | A synthetic open-problem distribution that can be refreshed, audited, and mixed with known subtasks |
| IMO-Bench | IMO-level reasoning with curated answer, proof, and grading tasks | New synthetic IMO-like problems with private solutions, plus explicit contamination controls |
| Ulam synthetic benchmark | Low-contamination proof and research evaluation from generated problem families | It still needs human audit, rubrics, calibration, and hidden splits before it should be treated as a public replacement benchmark |

The main difference is refreshability. FrontierMath-quality expert problems are precious. Public olympiad problems are finite and increasingly exposed. Synthetic problem generation gives us a way to create new batches, hold back private test sets, and adapt the distribution as models improve.

The scoring principle: do not mix proof accuracy with research progress

For solved proof problems, the score should be strict. We use a 0–7 proof scale: no progress at the bottom, a complete and correct proof at 6, and a clear, robust proof with all edge cases handled at 7. The natural headline metric is pass@1, where a response passes if it scores at least 6.
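As a minimal sketch of the strict metric, the snippet below turns one graded response per problem into pass@1 and an average proof score. The threshold of 6 and the 0-7 range come from the description above; the function name and the returned dictionary keys are assumptions for illustration, not the harness's actual interface.

```python
PASS_THRESHOLD = 6  # from the text: a response passes if it scores at least 6


def proof_metrics(scores):
    """Summarize one graded response per problem on the 0-7 proof scale."""
    if not scores:
        raise ValueError("no scores to summarize")
    for s in scores:
        if not 0 <= s <= 7:
            raise ValueError(f"score {s} is outside the 0-7 scale")
    passed = sum(1 for s in scores if s >= PASS_THRESHOLD)
    return {
        "pass_at_1": passed / len(scores),
        "mean_score": sum(scores) / len(scores),
    }
```

For example, `proof_metrics([7, 6, 5, 0])` gives a pass@1 of 0.5 and a mean score of 4.5: the response that scored 5 earns partial credit in the average but does not count as a pass.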

For Erdős-style research problems, the scoring should reward different behavior: understanding the quantifiers, finding baseline constructions, proving special cases, stating reductions, identifying obstructions, and clearly labeling conjectures. A response that falsely claims to solve an open problem should be penalized, even if the prose sounds confident.
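One way to keep this score separate from proof accuracy is to record each rewarded behavior as its own grader judgment. The sketch below is purely illustrative: the field names, the equal credit per behavior, and the size of the overclaiming penalty are all assumptions, not the benchmark's rubric.

```python
from dataclasses import dataclass, fields


@dataclass
class ResearchGrade:
    """Human grader judgments for one response (hypothetical schema)."""
    quantifiers_understood: bool = False
    baseline_construction: bool = False
    special_case_proved: bool = False
    reduction_stated: bool = False
    obstruction_identified: bool = False
    conjectures_labeled: bool = False
    false_claim_of_solution: bool = False


CREDIT_PER_BEHAVIOR = 1  # illustrative weight, not a calibrated value
OVERCLAIM_PENALTY = 3    # illustrative penalty for claiming an open problem is solved


def research_progress_score(grade):
    """Add credit for each rewarded behavior, then subtract the overclaim penalty."""
    rewarded = [
        getattr(grade, f.name)
        for f in fields(grade)
        if f.name != "false_claim_of_solution"
    ]
    score = CREDIT_PER_BEHAVIOR * sum(rewarded)
    if grade.false_claim_of_solution:
        score -= OVERCLAIM_PENALTY
    return score
```

With these placeholder weights, a response that gives a baseline construction and proves a special case scores 2, while the same response with a false claim of a full solution scores -1. The point of the design is that confident overclaiming can cost more than the partial progress earned.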

This matters because mathematical research is not only theorem proving. It is also choosing the right relaxation, knowing which examples to compute, recognizing when a known theorem almost applies, and being honest about the remaining gap. A benchmark that measures only final answers will miss much of that behavior.

Public problems, private solutions

The benchmark should be released with public problem statements but private solutions, metadata, rubrics, and grader notes. In particular, we should not expose similar IMO references, related Erdős links, solution attempts, status labels, or literature notes in the test prompts. Those fields are valuable for audit and grading, but they are also hints.
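A simple way to enforce this is an allowlist: only fields explicitly marked public ever reach a test prompt. The sketch below assumes a hypothetical record schema with `id` and `statement` as the only public fields; the real field names are not given here.

```python
import json

# Hypothetical schema: only these fields may appear in public files and prompts.
PUBLIC_FIELDS = ("id", "statement")


def split_record(record):
    """Split one problem record into a public view and a private view.

    Anything not on the allowlist is treated as private, so a newly added
    metadata field cannot leak by accident.
    """
    public = {k: record[k] for k in PUBLIC_FIELDS if k in record}
    private = {k: v for k, v in record.items() if k not in PUBLIC_FIELDS}
    private["id"] = record["id"]  # keep the join key on both sides
    return public, private


def to_jsonl(records):
    """Serialize records as JSON Lines text, one object per line."""
    return "\n".join(
        json.dumps(r, ensure_ascii=False, sort_keys=True) for r in records
    )
```

An allowlist is chosen over a blocklist here because the failure mode is safer: forgetting to list a field hides it instead of exposing a hint.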

The same principle applies to model outputs. Raw outputs should be stored exactly, with model version, provider, date, sampling settings, prompt hash, tool access, and the full response. Grading should happen after generation, ideally with two independent human graders and adjudication when scores disagree. LLM judges can help triage, but they should not be the final authority for benchmark claims.
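The stored record described above might look like the following sketch. The class and field names are assumptions chosen to mirror the list in the text, and SHA-256 is one reasonable choice for the prompt hash, not a documented detail of the harness.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


def prompt_hash(prompt):
    """Hash the exact rendered prompt so later runs can be matched to it."""
    return hashlib.sha256(prompt.encode("utf-8")).hexdigest()


@dataclass(frozen=True)
class RunRecord:
    """One raw model output with the metadata needed to reproduce it."""
    problem_id: str
    model_version: str
    provider: str
    date: str             # e.g. an ISO 8601 date string
    sampling: dict        # temperature, max tokens, seed, and similar settings
    prompt_sha256: str
    tool_access: list     # names of any tools the model could call
    response: str         # stored verbatim, never cleaned or truncated


def to_jsonl_line(record):
    """Serialize one record as a single JSON Lines entry."""
    return json.dumps(asdict(record), ensure_ascii=False, sort_keys=True)
```

Storing the hash of the rendered prompt, instead of only the problem identifier, makes it possible to detect later that two runs saw different prompt templates for the same problem.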

A useful benchmark should also publish versioned hashes. If the problem set changes, the version changes. If a hidden split is used, it should remain hidden. If a new synthetic batch is added, it should be treated as a new benchmark version rather than silently mutating the old one.
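A version hash can be computed directly from the problem files. The sketch below is one possible scheme under stated assumptions: it hashes a canonical JSON form of each problem and sorts the lines first, so the result does not depend on file order. The function name is hypothetical.

```python
import hashlib
import json


def dataset_hash(problems):
    """Return an order-independent SHA-256 digest of a list of problem dicts."""
    canonical = sorted(
        json.dumps(p, sort_keys=True, ensure_ascii=False, separators=(",", ":"))
        for p in problems
    )
    digest = hashlib.sha256()
    for line in canonical:
        digest.update(line.encode("utf-8"))
        digest.update(b"\n")  # separator so adjacent records cannot blur together
    return digest.hexdigest()
```

Any edit to a statement, however small, changes the digest, which is what forces a new version number instead of a silent mutation. Publishing the digest of a hidden split also lets outsiders verify later that the split was not changed, without revealing its contents.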

What we have now

The current materials are a draft benchmark and a runnable evaluation harness. The harness builds public and private JSONL files, creates deterministic splits, renders prompts, runs models through OpenAI-compatible or command-line providers, exports manual grading CSVs, supports optional LLM-judge triage, and generates score reports.
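Deterministic splits can be built without any stored random state by hashing each problem identifier. This is a sketch of that general technique, not a description of how the harness actually does it; the function name, the `salt` parameter, and the split labels are assumptions.

```python
import hashlib


def assign_split(problem_id, hidden_fraction=0.5, salt="v1"):
    """Assign a problem to 'hidden' or 'public' from a hash of its id.

    The same id and salt always give the same split, independent of file
    order or of which other problems exist. Changing the salt reshuffles.
    """
    digest = hashlib.sha256(f"{salt}:{problem_id}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big")   # integer in [0, 2**64)
    threshold = int(hidden_fraction * 2**64)     # integer cut-off in [0, 2**64]
    return "hidden" if bucket < threshold else "public"
```

Because the assignment depends only on the identifier, adding a new synthetic batch never moves an existing problem from one split to the other. The realized hidden share is only approximately `hidden_fraction` for a small problem set.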

That is enough to run internal pilots. It is not yet enough to claim a public replacement for IMO-Bench or FrontierMath. The next step is not more rhetoric; it is audit work.

Before launch, the benchmark needs:

- independent human validation of every canonical solution
- per-problem rubrics
- a duplicate and contamination audit against olympiad and public synthetic datasets
- topic tags and difficulty calibration
- manual rewriting of any Erdős-known tasks into objectively gradable theorem statements
- a hidden test split
- baseline runs from representative models
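For the duplicate and contamination audit, a cheap first pass is lexical overlap between candidate statements and a reference corpus. The sketch below uses Jaccard similarity over word trigrams; the function names and the 0.5 threshold are illustrative assumptions. It only catches surface-level reuse, so a disguised known problem would still need human review.

```python
import re


def shingles(text, n=3):
    """Return the set of word n-grams in a statement, lowercased."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    if len(words) < n:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets; defined as 0.0 when both are empty."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def flag_near_duplicates(candidates, reference, threshold=0.5, n=3):
    """Return (candidate_id, reference_id, score) for every pair at or above threshold.

    Both arguments map an id to a problem statement.
    """
    reference_shingles = {rid: shingles(t, n) for rid, t in reference.items()}
    flagged = []
    for cid, text in candidates.items():
        candidate_shingles = shingles(text, n)
        for rid, ref in reference_shingles.items():
            score = jaccard(candidate_shingles, ref)
            if score >= threshold:
                flagged.append((cid, rid, score))
    return flagged
```

Flagged pairs are a queue for a human auditor, not an automatic rejection list.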

Why synthetic does not mean easy

A synthetic benchmark is only valuable if the generated problems are mathematically real. A random algebraic identity or a disguised known problem is not enough. The problems need to have meaningful structure, plausible false starts, multiple solution paths, and failure modes that reveal something about a model's reasoning.

That is why the benchmark is designed as a pipeline rather than a static file. Generation creates candidates. Mathematical audit removes broken or duplicated problems. Private solutions and rubrics support grading. Pilot runs reveal which problems are too easy, too ambiguous, or too close to known data. New batches keep the benchmark from becoming stale.

In other words, synthetic math problems should be treated the way mathematicians treat conjectures: useful only after they have survived attempts to break them.

The benchmark we want

The ideal benchmark is not one leaderboard number. It is a small family of measurements: solved-proof pass@1, average proof score, pass@k under a fixed sampling budget, per-topic performance, and a separate research-progress score for open-style prompts.
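For pass@k under a fixed sampling budget, one widely used option is the unbiased combinatorial estimator from code-generation evaluation: draw n samples per problem, count the c that pass, and estimate the chance that at least one of k samples passes. The text does not say which estimator the harness uses, so treat this as a sketch of the standard technique with hypothetical function names.

```python
from math import comb


def pass_at_k(n, c, k):
    """Unbiased pass@k estimate from n samples, c of which passed.

    Computes 1 - C(n - c, k) / C(n, k): one minus the probability that a
    random size-k subset of the n samples contains no passing sample.
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(per_problem, k):
    """Average pass@k over problems; per_problem is a list of (n, c) pairs."""
    if not per_problem:
        raise ValueError("no problems to average over")
    return sum(pass_at_k(n, c, k) for n, c in per_problem) / len(per_problem)
```

With k = 1 this reduces to c / n, the fraction of samples that pass. Reporting n alongside k matters: the fixed sampling budget is what makes numbers comparable across models.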

That structure lets us ask more precise questions. Can a model solve new olympiad-style problems closed-book? Can it make honest progress on a research-style problem? Does it overclaim? Does it improve with multiple samples? Does it do well in geometry but fail in additive combinatorics? Does tool use change the picture? These are the questions that matter if we want to understand mathematical capability rather than benchmark familiarity.

Synthetic Erdős-style and IMO-like problems are not the whole answer. But they are a promising way to keep evaluation moving as models improve. They give us new problems, clearer separation between proof and research, and a path toward benchmarks that remain useful after the current public tests saturate.

What's next

We are using the current draft for internal pilot runs and audit. The immediate goals are to validate the 171 IMO-like solutions, classify and rewrite the Erdős-style records into research and known-subtask tracks, add per-problem rubrics, and freeze a versioned split that can support meaningful baseline comparisons.

The long-term goal is a benchmark that can be refreshed continuously: new synthetic problems, private held-out splits, public statements where appropriate, and strict reporting of model versions, tools, sampling budgets, and grader decisions.

If we want language models to help with mathematics, we need evaluations that reward real mathematical behavior. Not just the final answer. Not just fluent proof-shaped text. The whole discipline: definitions, examples, constructions, proof, skepticism, and honest gaps.

Related benchmarks and references:
FrontierMath · FrontierMath: Open Problems
FrontierMath technical report
IMO-Bench · Towards Robust Mathematical Reasoning