UnsolvedMath: Benchmarking AI on Open Mathematical Problems

We're releasing UnsolvedMath, a curated collection of 1,146 open mathematical problems designed to benchmark AI reasoning capabilities on problems that humanity hasn't yet solved.

Why open problems matter for AI evaluation

Most mathematical benchmarks test AI on problems with known solutions. This creates a fundamental limitation: we can measure whether a model gets the right answer, but we can't measure whether it can reason beyond the frontier of human knowledge.

UnsolvedMath takes a different approach. By curating problems that remain open—some for decades, some for over a century—we create a benchmark where the quality of reasoning matters more than the final answer. A model that makes genuine progress on an open problem, even partial progress, demonstrates capabilities that closed-form benchmarks cannot capture.

The dataset is now available on Hugging Face under a CC BY 4.0 license.

What's in the dataset

UnsolvedMath contains 1,146 problems spanning 12 mathematical domains, with full LaTeX formatting and structured metadata. The collection includes:

The Erdos Problems (632 problems)

The largest machine-readable collection of problems posed by Paul Erdos, the legendary mathematician who collaborated with over 500 co-authors and whose problems shaped combinatorics, number theory, and graph theory for generations. These range from deceptively simple statements to deep conjectures that have resisted decades of attack.

Historical Problem Collections

We've assembled problems from the most significant collections in mathematical history:

  • Millennium Prize Problems (7) — The $1M challenges from the Clay Mathematics Institute
  • Hilbert's 23 Problems — The problems that shaped 20th-century mathematics
  • Smale's Problems (18) — Steve Smale's list for the 21st century
  • DARPA's 23 Mathematical Challenges — Defense-motivated open problems
  • Ben Green's 100 Open Problems — Modern combinatorics and number theory
  • Landau's Problems (4) — Century-old prime number conjectures
  • Hardy-Littlewood Conjectures — Foundational analytic number theory

Distribution by Domain

The problems cover the core areas of pure mathematics:

  • Number Theory: 497 problems (43.4%)
  • Graph Theory: 214 problems (18.7%)
  • Combinatorics: 195 problems (17.0%)
  • Plus: Algebra, Geometry, Topology, Analysis, Set Theory, and more

Difficulty Stratification

Each problem is tagged with a difficulty level from L1 (tractable research problems) to L5 (Millennium Prize-level difficulty). This enables evaluation across the full spectrum of mathematical challenge.
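For example, evaluation runs can be stratified by level. Here is a minimal sketch, assuming the problems.json file from the usage section below and that each record stores its level in a 'difficulty' field with values 'L1' through 'L5' (the field name is our assumption about the schema):

import json
from collections import defaultdict

# Bucket problems by difficulty level
with open('problems.json', 'r') as f:
    problems = json.load(f)

by_level = defaultdict(list)
for p in problems:
    by_level[p['difficulty']].append(p)

# Count problems at each level, L1 (tractable) through L5 (Millennium Prize-level)
for level in sorted(by_level):
    print(f"{level}: {len(by_level[level])} problems")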

Recent progress: AI systems tackling Erdos problems

The release of UnsolvedMath coincides with remarkable progress in AI mathematical reasoning. Two systems in particular have demonstrated capabilities that warrant serious attention.

GPT-5.2: Partial resolution of Erdos Problem 1139

OpenAI's GPT-5.2, released in late 2025, achieved what appears to be the first AI-generated partial result on an open Erdos problem. Working on EP-1139—a conjecture about the chromatic number of certain intersection graphs—the model produced a proof of the conjecture for a restricted class of graphs.

The proof has been verified by three independent mathematicians and submitted for peer review. While the full conjecture remains open, the partial result settles a case that had been explicitly noted as difficult in the literature. This represents a qualitative shift: not just solving textbook problems, but contributing to the frontier.

Aristotle AI: Novel approaches to Erdos Problem 524

Aristotle AI, the reasoning-focused system from a consortium of European research institutions, has taken a different approach. Rather than attempting direct proofs, Aristotle generates novel proof strategies and identifies connections between seemingly unrelated problems.

On EP-524, which concerns the density of sumsets in additive combinatorics, Aristotle identified a previously unknown connection to ergodic theory that has since led to new results by human mathematicians. The system didn't solve the problem, but it changed how mathematicians think about it.

This may be the more significant development. A system that can suggest productive research directions, even without producing complete proofs, becomes a genuine collaborator in mathematical research.

What we're measuring

UnsolvedMath enables several distinct evaluation paradigms:

  • Proof attempt quality — How well does a model identify relevant techniques, structure arguments, and recognize when approaches fail?
  • Problem understanding — Can the model correctly state implications, identify special cases, and connect to related results?
  • Research taste — Does the model identify promising directions, or does it pursue dead ends?
  • Partial progress — Can the model prove restricted cases, establish bounds, or reduce problems to simpler forms?

These metrics matter because they reflect how human mathematicians actually work. Mathematics isn't about getting the right answer on a problem set—it's about navigating uncertainty, building intuition, and making incremental progress on hard questions.
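One way to operationalize these axes is a per-attempt scorecard. The sketch below is illustrative only: the 0-to-5 rubric and the field names are our assumptions, not a released scoring schema.

from dataclasses import dataclass

@dataclass
class AttemptEvaluation:
    """Scores for one model attempt on one problem (0 to 5 per axis; rubric is illustrative)."""
    problem_number: str    # e.g. 'EP-524'
    proof_quality: int     # technique selection, argument structure, recognizing failed approaches
    understanding: int     # implications, special cases, and related results stated correctly
    research_taste: int    # promising directions pursued rather than dead ends
    partial_progress: int  # restricted cases, bounds, or reductions established

    def overall(self) -> float:
        # Unweighted mean across the four axes; weighting is a design choice
        return (self.proof_quality + self.understanding
                + self.research_taste + self.partial_progress) / 4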

Using the dataset

The dataset is structured for immediate use in machine learning pipelines:

import json

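# Load the full problem set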
with open('problems.json', 'r') as f:
    problems = json.load(f)

# Filter Erdos problems in Number Theory
erdos_nt = [p for p in problems
            if p['problem_number'].startswith('EP-')
            and p['category_id'] == 1]

# Get all Millennium Prize problems
millennium = [p for p in problems if p['set_id'] == 2]

Each problem includes the statement in LaTeX, background context, difficulty rating, proposer information, and structured metadata for filtering and analysis.
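Because the dataset is hosted on Hugging Face, it should also load with the standard datasets library; the split name below is an assumption about the repository layout.

from datasets import load_dataset

# Pull directly from the Hugging Face Hub (split name 'train' is assumed)
ds = load_dataset("ulamai/UnsolvedMath", split="train")

# The same Erdos/Number Theory filter as above, via the datasets API
erdos_nt = ds.filter(
    lambda p: p['problem_number'].startswith('EP-') and p['category_id'] == 1
)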

The road ahead

We believe mathematical reasoning is a critical capability for AI systems—not because we need machines to prove theorems, but because the skills required for mathematical research (precise reasoning, creative problem-solving, knowing what you don't know) are the skills required for reliable AI more broadly.

UnsolvedMath is a step toward evaluation that matches this ambition. We'll continue expanding the dataset, refining difficulty calibrations, and documenting AI progress on these problems as the field advances.

If you're working on mathematical reasoning systems, we'd welcome collaboration. And if your model makes progress on an Erdos problem, we definitely want to hear about it.

Dataset: huggingface.co/datasets/ulamai/UnsolvedMath