Data for Reasoning RL, SFT, and Evaluation

Ulam Data: Research Math, Olympiad Math, and Agentic Workflows

Ulam builds proof-process and workflow data for teams training reasoning-capable models: research-level mathematical trajectories, OlympiadNet math corpora, math-adjacent olympiad datasets, and agentic tool-use traces.

The common format is designed to preserve more than a final answer. Records keep statements, attempts, proof units, verifier metadata, negative traces, preference pairs, review status, and flattening paths for RLVR, SFT, PRM, critique, judge training, and private evaluations.

1,000+research-level trajectories from open Erdős-style and arXiv problems
57,648April 2026 arXiv math canonical records generated from TeX
20,000+OlympiadNet-Math problems across final-answer and proof-oriented tracks
3,395agentic workflow rows with tool observations and verifier metadata

Two Core Data Tracks

Ulam data is split into research-level mathematics and olympiad-level mathematics, with adjacent olympiad domains and agentic workflows available for broader reasoning post-training.

Track 1

Research-Level Mathematical Trajectories

Open-problem reasoning trajectories and April 2026 arXiv math proof-process records following the Ulam research trajectory schema. Best suited for Strict RLVR pilots, proof decomposition, critique, reward modeling, and private evals.

Track 2

OlympiadNet-Math and Adjacent Olympiads

Large olympiad-style math data with final-answer and proof-heavy tracks, plus informatics, physics, astronomy, and quantitative reasoning exports for SFT, coarse RLVR, and verifier development.

Research-Level Math

Strict RLVR fit

1,000+ open-problem trajectories

High-quality trajectories from attempts on open Erdős problems and other open problems from arXiv. These traces are built for hard reasoning, failure localization, verifier-backed scoring, and private evaluation tracks.

Coarse RLVR + SFT

5,000+ arXiv math papers from April 2026

5,387 local TeX files were transformed into a large proof-process scaffold: SFT rows, RLVR task rows, PVU/PRM steps, negative traces, preference pairs, adversarial tests, and review queues.

Research Asset Scale Best Use Current Status
Verified research trajectory schema 3 public examples Schema inspection, loading tests, RLVR/PRM flattening review Public sample
Open Erdős / arXiv problem trajectories 1,000+ trajectories Strict RLVR pilots, process supervision, judge training, private evals High-quality trajectories
April 2026 arXiv math TeX build 5,387 local TeX files; 57,648 canonical records SFT, proof decomposition, critique, reviewer triage, coarse RLVR scaffolding Candidate / review required

Canonical schema: ulam-rlvr-human-review-v0.3 keeps problem metadata, normalized statements, golden or source proofs, Proof Verification Units, dependency-gated rewards, negative traces, eval tasks, quality gates, preference pairs, adversarial tests, and human-review queues.

Flattening path: the canonical proof-process record can be flattened into RLVR tasks, SFT rows, PRM steps, proof-critique rows, preference data, and private-eval tasks.

Important caveat: TeX-derived arXiv rows are still ai_preannotated, candidate_solution, and L0_tex_heuristic. They are useful scaffolding, not verified mathematical RLVR reward data until upgraded by human/domain review or formal verification.

Research Schema Examples

The public Hugging Face sample contains three curated examples for inspecting the schema and the review vocabulary.

Example Status PVUs Negative Traces Eval Tasks Preference Pairs
Erdős Problem #258 Known / reviewed / verified 9 5 4 3
Erdős Problem #1201 Partial / draft 10 3 8 4
arXiv 2602.22147 extension Partial / conditional / reviewed 10 5 5 3

April 2026 arXiv Math Build

201/201source shards processed
536 MiBTeX extracted
115,296SFT rows and RLVR task rows each
0validation failures across 57,648 generated records
Generated View Rows Notes
Canonical records57,648Rich proof-process records under the Ulam RLVR schema.
SFT rows115,296Flattened supervised examples for proof-style and critique-style training.
RLVR task rows115,296Coarse verifier/reward scaffolds; not yet strict verified reward data.
PVU / PRM step rows166,817Proof Verification Unit and process-reward-model training candidates.
Negative traces159,329Invalid, partial, or adversarial proof-process traces for critique and judge training.
Preference pairs57,648Comparison data for reward modeling and preference optimization.
Adversarial tests115,296Stress tests for verifier and judge behavior.

Olympiad-level Math

OlympiadNet-Math is our main math olympiad corpus. V1 is immediately useful for final-answer training and RLVR; V2 targets harder proof/source gaps and is currently a large candidate layer plus review queue.

Dataset Rows Strict RLVR SFT / Candidate Use Summary
OlympiadNet-Math v1 12,911 reviewed rows 828 strict positive-weight rows 12,083 candidate / zero-weight rows Final-answer oriented, simpler verifier contracts, viable now for supervised final-answer and proof-style SFT.
OlympiadNet-Math v2 9,661 canonical rows 34 promoted rows 6,213 source-solution rows; 1,014 reviewed generated candidates; 2,075 newly solved sidecar candidates Proof/source-gap oriented. Valuable for filtered SFT and promotion work, but not a clean gold RLVR set yet.

Recommended now

RLVR: use the v1 strict train subset plus the 34 promoted v2 rows.

SFT: use Reviewed-v1 preferentially, and filtered v2 source-solution / generated-candidate rows with explicit quality caveats.

Promotion path

V2 candidates need independent adjudication, answer/solution reconciliation, verifier contracts, adversarial tests, and explicit eligible_for_rlvr=true before strict RLVR use.

Math-Adjacent Olympiads

These domains extend olympiad-style reasoning beyond pure math. They are useful for broader SFT, verifier development, and coarse RLVR experiments.

Domain SFT Rows Coarse RLVR Rows Comments
Informatics 19,904 9,644 Large SFT set. Many generated rows require review; RLVR has verifier metadata but executable harnesses are not materialized yet.
Physics 2,003 17 Good source-extracted SFT coverage. RLVR is intentionally sparse and mostly optional.
Astronomy 1,015 1 Cleaned SFT with 108 generated gold-candidate rows retained and 31 review-required rows removed. RLVR is not the focus.
Quant 1,992 1,992 Best current RLVR-adjacent domain: 3 strict exact-match rows and 1,989 coarse reference-match rows.
Training Slice Rows Use
Recommended SFT12,309Main gold-ish SFT pool.
Extended SFT12,383Recommended SFT plus 74 partial Astro gold candidates.
Strict RLVR3Quant exact-match only.
Coarse RLVR focus11,636Informatics metadata plus Quant coarse/reference rows.

Agentic Workflow Data

The cc-harness-rlvr workflow corpus is SFT/RLVR-ready for tool-conditioned assistant behavior, structured observations, verifier-backed tasks, and preference-pair training after RLVR.

Coverage

3,395 rows across 29 categories

Four-way split: train 2,332, eval_id 291, eval_ood 480, and sample_preview 292. REAP curriculum tags run from initial to core to post.

Trajectory format

Anthropic messages + tool results

Rows carry flat messages arrays, token estimates, length buckets, tool-use IDs, and observation turns that teach the model to condition on tool output.

Verifier state

Golden and adversarial checks pass

All structural invariants pass, verifiers are green on 3,395 golden and 4,863 adversarial checks, and contamination checks are clean on six hard OOD axes.

1,513 rows include preference pairs for DPO/IPO after RLVR. Negative traces are verifier-ready but not interleaved as full trajectories, which is fine for RLVR and should be flagged before trajectory-level DPO use.

Recommended Use

Use now

Strict and near-strict RLVR

Start with reviewed research trajectories, OlympiadNet-Math v1 strict rows, the 34 promoted v2 rows, Quant exact-match rows, and agentic workflow verifier tasks where the verifier contract is already green.

Use now

SFT and process supervision

Use arXiv proof-process rows, OlympiadNet-Math Reviewed-v1, filtered v2 source/candidate rows, math-adjacent recommended SFT, and agentic workflow messages to teach proof style, critique, repair, and tool-conditioned behavior.

Review queue

Promotion to stronger rewards

Promote candidate rows through independent adjudication, verifier contracts, adversarial tests, reconciliation, and explicit eligibility flags. This is where v2 and arXiv TeX-derived data have the most upside.

Private evals

Holdouts and hidden tracks

Build hidden holdouts and private evaluation packs from fresh problem families, hard proof-process cases, and agentic tasks that are difficult to overfit from public benchmark formats.

Inspect the Format

Start with the public research schema repository: inspect the three examples, read the whitepaper, and review how canonical records flatten into RLVR, SFT, PRM, proof-critique, preference, and eval views. Commercial training packs, private holdouts, reviewer notes, and larger corpora are available under separate access terms.