Research-Grade Data + Private Evals for Frontier AI
Reasoning Data for Frontier Models
Ulam builds research-grade reasoning datasets and evaluation systems for frontier AI labs and model developers.
Our work is grounded in open mathematical research, where progress is measured by the quality of reasoning, not just final answers.
We produce human-in-the-loop reasoning traces, expert-reviewed proof attempts, partial-progress data, and private benchmark programs
for teams that need harder-to-game signals on model capability.
We turn frontier mathematical research into data and evaluation assets that model builders can use immediately.
Research-Grade Reasoning Datasets
Curated open-problem corpora, structured metadata, and research workflows designed to probe reasoning beyond solved benchmarks.
Human-in-the-Loop Reasoning Traces
Iterative traces where models explore, revise, and refine ideas with expert intervention at the points that matter.
Expert-Reviewed Proof Attempts
Partial progress, failed approaches, and checker-backed artifacts labeled for signal quality rather than surface fluency.
Private Benchmarks and Evaluations
Custom benchmark suites and eval protocols for labs that need private, difficult-to-overfit measures of frontier reasoning.
How Teams Use Ulam
Train on Harder Reasoning Signals
Use expert-reviewed proof attempts, partial solutions, and human-in-the-loop traces to build stronger post-training and reasoning curricula.
Benchmark Beyond Public Math Sets
Evaluate models on private open-problem tracks where memorization offers little advantage and genuine mathematical progress becomes visible.
Run Private Frontier Evals
Design internal benchmark programs for new reasoning models, agents, or theorem systems with scoring that reflects research value.
Ground Results in Verifiable Artifacts
Connect natural-language reasoning to formalized outputs, reviewed writeups, and evidence trails that can survive expert scrutiny.
How We Collaborate
Scoping (2-3 weeks): choose the target model family, reasoning domain, and evaluation criteria that actually matter for your roadmap.
Pilot (10-12 weeks): deliver one dataset or private eval track with evidence artifacts, reviewed samples, and a repeatable scoring loop.
Expansion: grow into a recurring program for benchmark refreshes, new problem sets, and model-by-model regression tracking.
Current Product Wedge
UlamAI Prover is our open-source workflow for turning natural-language mathematical reasoning into machine-checkable outputs.
It gives us a verification backbone for higher-signal datasets, expert review workflows, and private evaluations where trace quality matters.
See repository.
We used GPT-5.4 Pro, Opus 4.6, and GPT-5.2 to attack open problems in mathematics. One of those problems is now fully settled, with a machine-checked proof in Lean 4.
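As a toy illustration of what a machine-checkable output looks like (this example is deliberately trivial and is not taken from the repository or from the settled problem), a Lean 4 theorem whose proof the kernel verifies:

```lean
-- Toy illustration only: a trivial statement, proved in term mode,
-- standing in for the kind of kernel-checked artifact that a
-- natural-language-to-Lean workflow produces.
theorem add_comm_example (a b : Nat) : a + b = b + a :=
  Nat.add_comm a b
```

Real outputs are far larger, but the guarantee is the same: if the Lean kernel accepts the proof, the reasoning trace behind it is backed by a verifiable artifact rather than surface fluency.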