Research-Grade Data + Private Evals for Frontier AI
Reasoning Data for Frontier Models
Ulam builds research-grade reasoning datasets and evaluation systems for frontier AI labs and model developers.
Our work is grounded in open mathematical research, where progress is measured by the quality of reasoning, not just final answers.
We produce human-in-the-loop reasoning traces, expert-reviewed proof attempts, partial-progress data, and private benchmark programs
for teams that need harder-to-game signals on model capability.
We turn frontier mathematical research into data and evaluation assets that model builders can use immediately.
Research-Grade Reasoning Datasets
Curated open-problem corpora, structured metadata, and research workflows designed to probe reasoning beyond solved benchmarks.
Human-in-the-Loop Reasoning Traces
Iterative traces where models explore, revise, and refine ideas with expert intervention at the points that matter.
Expert-Reviewed Proof Attempts
Partial progress, failed approaches, and checker-backed artifacts labeled for signal quality rather than surface fluency.
Private Benchmarks and Evaluations
Custom benchmark suites and eval protocols for labs that need private, difficult-to-overfit measures of frontier reasoning.
How Teams Use Ulam
Train on Harder Reasoning Signals
Use expert-reviewed proof attempts, partial solutions, and human-in-the-loop traces to build stronger post-training and reasoning curricula.
Benchmark Beyond Public Math Sets
Evaluate models on private open-problem tracks where memorization offers little advantage and genuine mathematical progress becomes visible.
Run Private Frontier Evals
Design internal benchmark programs for new reasoning models, agents, or theorem systems with scoring that reflects research value.
Ground Results in Verifiable Artifacts
Connect natural-language reasoning to formalized outputs, reviewed writeups, and evidence trails that can survive expert scrutiny.
How We Collaborate
Scoping (2-3 weeks): choose the target model family, reasoning domain, and evaluation criteria that actually matter for your roadmap.
Pilot (10-12 weeks): deliver one dataset or private eval track with evidence artifacts, reviewed samples, and a repeatable scoring loop.
Expansion: grow into a recurring program for benchmark refreshes, new problem sets, and model-by-model regression tracking.
Current Product Wedge
UlamAI Prover is our open-source workflow for turning natural-language mathematical reasoning into machine-checkable outputs.
It gives us a verification backbone for higher-signal datasets, expert review workflows, and private evaluations where trace quality matters.
See repository.
We used GPT-5.4 Pro, Opus 4.6, and GPT-5.2 to attack open problems in mathematics. One of those problems is now fully settled, with a machine-checked proof in Lean 4.
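As a toy illustration of what a machine-checkable output looks like (this example is deliberately trivial and is not taken from the repository or from the settled problem), a Lean 4 theorem whose proof the kernel verifies:

```lean
-- Toy illustration only: a trivial statement, proved in term mode,
-- standing in for the kind of kernel-checked artifact that a
-- natural-language-to-Lean workflow produces.
theorem add_comm_example (a b : Nat) : a + b = b + a :=
  Nat.add_comm a b
```

Real outputs are far larger, but the guarantee is the same: if the Lean kernel accepts the proof, the reasoning trace behind it is backed by a verifiable artifact rather than surface fluency.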