Research-Grade Data + Private Evals for Frontier AI

Reasoning Data for Frontier Models

Ulam builds research-grade reasoning datasets and evaluation systems for frontier AI labs and developers. Our work is grounded in open mathematical research, where progress is measured by the quality of reasoning, not just final answers.

We produce human-in-the-loop reasoning traces, expert-reviewed proof attempts, partial-progress data, and private benchmark programs for teams that need harder-to-game signals on model capability.


What We Build

We turn frontier mathematical research into data and evaluation assets that model builders can use immediately.

Research-Grade Reasoning Datasets

Curated open-problem corpora, structured metadata, and research workflows designed to probe reasoning beyond solved benchmarks.

Human-in-the-Loop Reasoning Traces

Iterative traces where models explore, revise, and refine ideas with expert intervention at the points that matter.

Expert-Reviewed Proof Attempts

Partial progress, failed approaches, and checker-backed artifacts labeled for signal quality rather than surface fluency.

Private Benchmarks and Evaluations

Custom benchmark suites and eval protocols for labs that need private, difficult-to-overfit measures of frontier reasoning.

How Teams Use Ulam

Train on Harder Reasoning Signals

Use expert-reviewed proof attempts, partial solutions, and human-in-the-loop traces to build stronger post-training and reasoning curricula.

Benchmark Beyond Public Math Sets

Evaluate models on private open-problem tracks where memorization offers little advantage and genuine mathematical progress becomes visible.

Run Private Frontier Evals

Design internal benchmark programs for new reasoning models, agents, or theorem systems with scoring that reflects research value.

Ground Results in Verifiable Artifacts

Connect natural-language reasoning to formalized outputs, reviewed writeups, and evidence trails that can survive expert scrutiny.

How We Collaborate

Scoping (2-3 weeks): choose the target model family, reasoning domain, and evaluation criteria that actually matter for your roadmap.

Pilot (10-12 weeks): deliver one dataset or private eval track with evidence artifacts, reviewed samples, and a repeatable scoring loop.

Expansion: grow into a recurring program for benchmark refreshes, new problem sets, and model-by-model regression tracking.

Current Product Wedge

UlamAI Prover is our open-source workflow for turning natural-language mathematical reasoning into machine-checkable outputs. It gives us a verification backbone for higher-signal datasets, expert review workflows, and private evaluations where trace quality matters. See repository.
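As an illustrative sketch only (this example is not drawn from the UlamAI Prover repository), a "machine-checkable output" is the kind of artifact a proof assistant can verify: a natural-language claim such as "the sum of two even numbers is even" rendered as a Lean 4 statement whose proof the checker accepts or rejects.

```lean
import Mathlib

-- Hypothetical example: the informal claim "the sum of two even
-- numbers is even", formalized so the Lean kernel can check it.
-- `Even n` unfolds to `∃ r, n = r + r` in Mathlib.
theorem even_add_even (a b : ℕ) (ha : Even a) (hb : Even b) :
    Even (a + b) := by
  obtain ⟨x, hx⟩ := ha   -- a = x + x
  obtain ⟨y, hy⟩ := hb   -- b = y + y
  exact ⟨x + y, by omega⟩
```

Once a trace bottoms out in an artifact like this, "did the reasoning succeed" becomes a mechanical question rather than a matter of reviewer judgment, which is what makes checker-backed data harder to game.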

Latest from the Lab