Data, Benchmarks, RLVR environments

Data and Benchmarks for Reasoning Models

Ulam builds proof-process datasets, private math benchmarks, and verifier-backed RLVR environments for teams steering reasoning models with SFT, reward modeling, process supervision, RLVR, and private evaluation.

The product is a full improvement loop: create the right data, measure proof reliability on hidden tasks, convert failures into trainable objects, and serve verifier rewards through UlamGym while keeping private rubrics and holdouts sealed.

Research Schema Whitepaper PDF Olympiad Schema ErdosBench Discuss Access

1,000+research-level trajectories from open Erdős-style and arXiv problems

57,648April 2026 arXiv math canonical records generated from TeX

20,000+OlympiadNet-Math problems across final-answer and proof-oriented tracks

3,395agentic workflow rows with tool observations and verifier metadata

Four products for proof-reliable models

Start with data, measure with private benchmarks, and close the loop with verifier-backed RLVR tasks. Each product can stand alone; together they create a recurring training signal for mathematical reasoning.

Benchmark / research-level

ErdősBench private benchmark

Research-level Erdős-like problems for evaluating mathematical research behavior: finite experiments, obstruction finding, counterexamples, proof-gap detection, partial progress, and calibrated uncertainty.

The public repository exposes the format; the full benchmark, hidden tasks, rubrics, and answer keys stay private.

View public smoke test →

Benchmark / olympiad-level

Synthetic IMO Bench

Hidden Olympiad-style rounds with 0-7 proof rubrics, answer/proof separation, blind judging, and traps for equality cases, exact constants, unsupported geometry lemmas, invalid bounds, and number-theory shortcuts.

Built to distinguish “right answer” from valid proof and to surface trainable false-solve patterns.

Data / SFT + RLVR

Custom SFT/RLVR datasets

One-time or recurring data builds from research trajectories, OlympiadNet-Math, arXiv blueprints, agentic workflows, and newly generated problems targeted to a model’s observed reasoning weaknesses.

Delivered with stable IDs, splits, metadata, verifier status, holdout notes, and training-view exports.

RLVR / reward environment

UlamGym RLVR environments

Verifier-backed reward environments for proof repair, counterexample search, premise selection, lemma graphs, exact-answer olympiad tasks, statement fidelity, and informal proof critique.

Trainers see prompts, rewards, vectors, and redacted replays; hidden manifests and rubrics remain server-side.

Benchmarking Research: monthly private rounds with 200 hidden or refreshed problems, 4–6 model configs, proof-validity judging, skill regressions, false-solve taxonomy, raw JSONL outputs, judge cards, and technical readouts every 2–3 weeks.

Custom datasets for SFT and RLVR

Ulam can license existing corpora or build custom packs around a model's observed weaknesses. Every pack is designed to be usable by training teams: schema docs, stable IDs, data splits, verifier status, and holdout notes.

Asset	Scale	Best Use	How It Steers the Model	Status
Research-level trajectories	1,000+ trajectories	Strict and near-strict RLVR pilots, process supervision, proof repair, private evals	Teaches search, failed branches, critiques, repairs, final proofs, and preference comparisons.	High-quality trajectories
April 2026 arXiv math build	57,648 canonical records from 5,387 TeX files	Long-context SFT, theorem dependency extraction, proof reconstruction, research triage	Supplies proof-process scaffolds, negative traces, PRM steps, adversarial tests, and review queues.	Candidate / review required
OlympiadNet-Math	20,000+ problems	Final-answer training, proof-validity tasks, verifier development, hidden benchmark variants	Separates answer-level rewards from proof-heavy reasoning where edge cases and exact constants matter.	Reviewed + candidate layers
Agentic workflow rows	3,395 rows	Tool-conditioned behavior, workflow SFT, verifier-backed tasks, preference training	Trains models to condition on tool observations, structured outputs, verifier feedback, and task state.	Verifier metadata included

The Ulam steering loop

01 Dataset

Target the failure mode

Build SFT, preference, proof-repair, counterexample, and verifier rows around the exact skills a model lacks: rigor, calibration, theorem use, edge cases, or tool-conditioned reasoning.

02 Benchmark

Measure proof validity

Run private benchmark rounds that score final answers separately from proof correctness and identify the first wrong inference rather than only checking answer resemblance.

03 Gym

Serve verifier rewards

Use UlamGym as a verifier-only reward API: public prompts for policy rollouts, hidden manifests for scoring, scalar rewards, reward vectors, strictness labels, and redacted replays.

04 Refresh

Turn misses into tasks

Convert model failures into new SFT rows, DPO pairs, proof-critic prompts, RLVR tasks, and held-out variants for the next benchmark round.

Research-level math

Strict RLVR fit

1,000+ open-problem trajectories

High-quality trajectories from attempts on open Erdős-style problems and other open problems from arXiv. These traces are built for hard reasoning, failure localization, verifier-backed scoring, and private evaluation tracks.

Coarse RLVR + SFT

5,000+ arXiv math papers from April 2026

5,387 local TeX files were transformed into a proof-process scaffold: SFT rows, RLVR task rows, PVU/PRM steps, negative traces, preference pairs, adversarial tests, and review queues.

April 2026 arXiv math build

201/201source shards processed

536 MiBTeX extracted

115,296SFT rows and RLVR task rows each

0validation failures across 57,648 generated records

Generated View	Rows	Notes
Canonical records	57,648	Rich proof-process records under the Ulam RLVR schema.
SFT rows	115,296	Flattened supervised examples for proof-style and critique-style training.
RLVR task rows	115,296	Coarse verifier/reward scaffolds; not yet strict verified reward data.
PVU / PRM step rows	166,817	Proof Verification Unit and process-reward-model training candidates.
Negative traces	159,329	Invalid, partial, or adversarial proof-process traces for critique and judge training.
Preference pairs	57,648	Comparison data for reward modeling and preference optimization.
Adversarial tests	115,296	Stress tests for verifier and judge behavior.

Olympiad-level math

OlympiadNet-Math is our main math olympiad corpus. V1 is immediately useful for final-answer training and RLVR; V2 targets harder proof/source gaps and is currently a large candidate layer plus review queue.

Dataset	Rows	Strict RLVR	SFT / Candidate Use	Summary
OlympiadNet-Math v1	12,911 reviewed rows	828 strict positive-weight rows	12,083 candidate / zero-weight rows	Final-answer oriented, simpler verifier contracts, viable now for supervised final-answer and proof-style SFT.
OlympiadNet-Math v2	9,661 canonical rows	34 promoted rows	6,213 source-solution rows; 1,014 reviewed generated candidates; 2,075 newly solved sidecar candidates	Proof/source-gap oriented. Valuable for filtered SFT and promotion work, but not a clean gold RLVR set yet.

Recommended now

RLVR: use the v1 strict train subset plus the 34 promoted v2 rows.

SFT: use Reviewed-v1 preferentially, and filtered v2 source-solution / generated-candidate rows with explicit quality caveats.

Promotion path

V2 candidates need independent adjudication, answer/solution reconciliation, verifier contracts, adversarial tests, and explicit eligible_for_rlvr=true before strict RLVR use.

Agentic workflow data

The workflow corpus is SFT/RLVR-ready for tool-conditioned assistant behavior, structured observations, verifier-backed tasks, and preference-pair training after RLVR.

Coverage

3,395 rows across 29 categories

Four-way split: train 2,332, eval_id 291, eval_ood 480, and sample_preview 292. Curriculum tags run from initial to core to post.

Trajectory format

Messages + tool results

Rows carry flat messages arrays, token estimates, length buckets, tool-use IDs, and observation turns that teach the model to condition on tool output.

Verifier state

Golden and adversarial checks pass

All structural invariants pass, verifiers are green on 3,395 golden and 4,863 adversarial checks, and contamination checks are clean on six hard OOD axes.

1,513 rows include preference pairs for DPO/IPO after RLVR. Negative traces are verifier-ready but not interleaved as full trajectories, which is suitable for RLVR and should be flagged before trajectory-level DPO use.

Program shapes

Teams can start with one-time data access, a custom dataset build, a private benchmark track, or UlamGym access. The strongest setup combines all three: data, recurring evaluation, and verifier-backed RL tasks.

One-time or custom

Dataset packs

Research trajectories, OlympiadNet-Math, arXiv blueprints, agentic workflow rows, or custom packs targeted to a model's failure modes. Delivered with stable IDs, metadata, splits, samples, and verifier-status notes.

Monthly

Benchmarking Research

Private benchmark refreshes, proof-level judging, 4–6 model configs, skill regressions, false-solve taxonomy, judge cards, and a technical readout every 2–3 weeks.

Add-on

UlamGym RLVR access

Verifier-only reward API, prompt/export bundle, hidden manifests, redacted trainer replays, reward vectors, usage metering, and monthly RLVR task refresh from benchmark failures.

Generalized private engagement pattern

A model team selects target skills and submits checkpoints or configs. Ulam runs private tasks, judges proof validity, identifies trainable failures, exports SFT/preference/RLVR artifacts, and serves the next round through UlamGym while keeping hidden verifier internals private.

Data details for training teams

Ulam data packs are meant to be used directly by model and RL teams. The same canonical records can be exported as SFT rows, RLVR tasks, process-reward steps, proof-critic prompts, preference pairs, private eval items, or Gym prompt rows.

Fields

Task and solution data

Problem IDs, source class, domain, difficulty, normalized statement, expected final answer when available, source or reference solution, proof outline, and checker or reviewer status.

Process

Search and repair traces

Attempts, failed branches, first-bad-step labels, critique notes, repair rationales, counterexamples, final proofs, and preference pairs for good vs. bad reasoning branches.

Rewards

Verifier metadata

Strictness labels, reward-policy tags, verifier type, hidden-manifest references, adversarial checks, quality gates, eligible-for-RLVR flags, and scorer-safe diagnostic fields.

Access

Splits and holdouts

Train/eval/holdout splits, decontamination notes, retired benchmark items, public-inspection samples, private answer keys, and upgrade paths from candidate rows to verified reward data.

Dataset family	Included signal	Training views	Holdout / verifier note
Research trajectories	Open-problem search, failed branches, critiques, repairs, final proof assets, preference comparisons.	SFT traces, proof-repair rows, DPO pairs, process-reward steps, private research evals.	Best for high-signal reasoning work where partial progress and proof hygiene matter.
OlympiadNet-Math	Final-answer and proof-oriented problems, domain tags, difficulty bands, reference solutions, verifier candidates.	Answer-level RLVR, proof-style SFT, judge training, Synthetic IMO-style hidden variants.	Final-answer problems are easier to score automatically; proof-heavy tasks need proof-validity grading.
arXiv math blueprints	Theorem/proof candidates, proof units, dependency scaffolds, negative traces, adversarial tests, review queues.	Long-context SFT, theorem dependency extraction, proof reconstruction, critique, research assistant training.	Experimental TeX-derived layer; valuable scaffold, upgraded through human/domain review or formal checks.
Agentic workflows	Messages, tool observations, verifier metadata, golden/adversarial checks, preference pairs, OOD splits.	Tool-use SFT, workflow RLVR, verifier-conditioned behavior, post-RL preference training.	Useful when the target model must reason with tools, observations, and structured task state.

Delivery: JSONL or Parquet, schema notes, stable IDs, source metadata, split manifests, sample loaders, score/export scripts, and verifier-status columns.

Custom builds: target a model’s failures directly — geometry overclaims, wrong constants, missing equality cases, invalid divisibility steps, weak counterexample search, or tool-use brittleness.

Inspect the format

Start with the public research schema repository and ErdosBench: inspect examples, review how canonical records flatten into RLVR/SFT/PRM views, and test the benchmark format. Commercial training packs, private holdouts, reviewer notes, hidden verifier manifests, and larger corpora are available under separate access terms.

Open Research Data Sample Read Whitepaper Related Olympiad Repository ErdosBench on GitHub Talk to Ulam