Data for Reasoning RL, SFT, and Evaluation
Ulam Data: Research Math, Olympiad Math, and Agentic Workflows
Ulam builds proof-process and workflow data for teams training reasoning-capable models: research-level mathematical
trajectories, OlympiadNet math corpora, math-adjacent olympiad datasets, and agentic tool-use traces.
The common format is designed to preserve more than a final answer. Records keep statements, attempts, proof units,
verifier metadata, negative traces, preference pairs, review status, and flattening paths for RLVR, SFT, PRM, critique,
judge training, and private evaluations.
1,000+research-level trajectories from open Erdős-style and arXiv problems
57,648April 2026 arXiv math canonical records generated from TeX
20,000+OlympiadNet-Math problems across final-answer and proof-oriented tracks
3,395agentic workflow rows with tool observations and verifier metadata
Two Core Data Tracks
Ulam data is split into research-level mathematics and olympiad-level mathematics, with adjacent olympiad domains and
agentic workflows available for broader reasoning post-training.
Track 1
Research-Level Mathematical Trajectories
Open-problem reasoning trajectories and April 2026 arXiv math proof-process records following the Ulam research
trajectory schema. Best suited for Strict RLVR pilots, proof decomposition, critique, reward modeling, and private evals.
Track 2
OlympiadNet-Math and Adjacent Olympiads
Large olympiad-style math data with final-answer and proof-heavy tracks, plus informatics, physics, astronomy, and
quantitative reasoning exports for SFT, coarse RLVR, and verifier development.
Research-Level Math
Strict RLVR fit
1,000+ open-problem trajectories
High-quality trajectories from attempts on open Erdős problems and other open problems from arXiv. These traces are
built for hard reasoning, failure localization, verifier-backed scoring, and private evaluation tracks.
Coarse RLVR + SFT
5,000+ arXiv math papers from April 2026
5,387 local TeX files were transformed into a large proof-process scaffold: SFT rows, RLVR task rows, PVU/PRM steps,
negative traces, preference pairs, adversarial tests, and review queues.
| Research Asset |
Scale |
Best Use |
Current Status |
| Verified research trajectory schema |
3 public examples |
Schema inspection, loading tests, RLVR/PRM flattening review |
Public sample |
| Open Erdős / arXiv problem trajectories |
1,000+ trajectories |
Strict RLVR pilots, process supervision, judge training, private evals |
High-quality trajectories |
| April 2026 arXiv math TeX build |
5,387 local TeX files; 57,648 canonical records |
SFT, proof decomposition, critique, reviewer triage, coarse RLVR scaffolding |
Candidate / review required |
Canonical schema: ulam-rlvr-human-review-v0.3 keeps problem metadata, normalized statements, golden or source proofs, Proof Verification Units, dependency-gated rewards, negative traces, eval tasks, quality gates, preference pairs, adversarial tests, and human-review queues.
Flattening path: the canonical proof-process record can be flattened into RLVR tasks, SFT rows, PRM steps, proof-critique rows, preference data, and private-eval tasks.
Important caveat: TeX-derived arXiv rows are still ai_preannotated, candidate_solution, and L0_tex_heuristic. They are useful scaffolding, not verified mathematical RLVR reward data until upgraded by human/domain review or formal verification.
Research Schema Examples
The public Hugging Face sample contains three curated examples for inspecting the schema and the review vocabulary.
April 2026 arXiv Math Build
201/201source shards processed
536 MiBTeX extracted
115,296SFT rows and RLVR task rows each
0validation failures across 57,648 generated records
| Generated View |
Rows |
Notes |
| Canonical records | 57,648 | Rich proof-process records under the Ulam RLVR schema. |
| SFT rows | 115,296 | Flattened supervised examples for proof-style and critique-style training. |
| RLVR task rows | 115,296 | Coarse verifier/reward scaffolds; not yet strict verified reward data. |
| PVU / PRM step rows | 166,817 | Proof Verification Unit and process-reward-model training candidates. |
| Negative traces | 159,329 | Invalid, partial, or adversarial proof-process traces for critique and judge training. |
| Preference pairs | 57,648 | Comparison data for reward modeling and preference optimization. |
| Adversarial tests | 115,296 | Stress tests for verifier and judge behavior. |
Olympiad-level Math
OlympiadNet-Math is our main math olympiad corpus. V1 is immediately useful for final-answer training and RLVR;
V2 targets harder proof/source gaps and is currently a large candidate layer plus review queue.
| Dataset |
Rows |
Strict RLVR |
SFT / Candidate Use |
Summary |
| OlympiadNet-Math v1 |
12,911 reviewed rows |
828 strict positive-weight rows |
12,083 candidate / zero-weight rows |
Final-answer oriented, simpler verifier contracts, viable now for supervised final-answer and proof-style SFT. |
| OlympiadNet-Math v2 |
9,661 canonical rows |
34 promoted rows |
6,213 source-solution rows; 1,014 reviewed generated candidates; 2,075 newly solved sidecar candidates |
Proof/source-gap oriented. Valuable for filtered SFT and promotion work, but not a clean gold RLVR set yet. |
Recommended now
RLVR: use the v1 strict train subset plus the 34 promoted v2 rows.
SFT: use Reviewed-v1 preferentially, and filtered v2 source-solution / generated-candidate rows with explicit quality caveats.
Promotion path
V2 candidates need independent adjudication, answer/solution reconciliation, verifier contracts, adversarial tests, and explicit eligible_for_rlvr=true before strict RLVR use.
Math-Adjacent Olympiads
These domains extend olympiad-style reasoning beyond pure math. They are useful for broader SFT, verifier development, and coarse RLVR experiments.
| Domain |
SFT Rows |
Coarse RLVR Rows |
Comments |
| Informatics |
19,904 |
9,644 |
Large SFT set. Many generated rows require review; RLVR has verifier metadata but executable harnesses are not materialized yet. |
| Physics |
2,003 |
17 |
Good source-extracted SFT coverage. RLVR is intentionally sparse and mostly optional. |
| Astronomy |
1,015 |
1 |
Cleaned SFT with 108 generated gold-candidate rows retained and 31 review-required rows removed. RLVR is not the focus. |
| Quant |
1,992 |
1,992 |
Best current RLVR-adjacent domain: 3 strict exact-match rows and 1,989 coarse reference-match rows. |
| Training Slice |
Rows |
Use |
| Recommended SFT | 12,309 | Main gold-ish SFT pool. |
| Extended SFT | 12,383 | Recommended SFT plus 74 partial Astro gold candidates. |
| Strict RLVR | 3 | Quant exact-match only. |
| Coarse RLVR focus | 11,636 | Informatics metadata plus Quant coarse/reference rows. |
Agentic Workflow Data
The cc-harness-rlvr workflow corpus is SFT/RLVR-ready for tool-conditioned assistant behavior, structured observations,
verifier-backed tasks, and preference-pair training after RLVR.
Coverage
3,395 rows across 29 categories
Four-way split: train 2,332, eval_id 291, eval_ood 480, and sample_preview 292. REAP curriculum tags run from initial to core to post.
Trajectory format
Anthropic messages + tool results
Rows carry flat messages arrays, token estimates, length buckets, tool-use IDs, and observation turns that teach the model to condition on tool output.
Verifier state
Golden and adversarial checks pass
All structural invariants pass, verifiers are green on 3,395 golden and 4,863 adversarial checks, and contamination checks are clean on six hard OOD axes.
1,513 rows include preference pairs for DPO/IPO after RLVR. Negative traces are verifier-ready but not interleaved as full trajectories,
which is fine for RLVR and should be flagged before trajectory-level DPO use.
Recommended Use
Use now
Strict and near-strict RLVR
Start with reviewed research trajectories, OlympiadNet-Math v1 strict rows, the 34 promoted v2 rows, Quant exact-match rows,
and agentic workflow verifier tasks where the verifier contract is already green.
Use now
SFT and process supervision
Use arXiv proof-process rows, OlympiadNet-Math Reviewed-v1, filtered v2 source/candidate rows, math-adjacent recommended SFT,
and agentic workflow messages to teach proof style, critique, repair, and tool-conditioned behavior.
Review queue
Promotion to stronger rewards
Promote candidate rows through independent adjudication, verifier contracts, adversarial tests, reconciliation, and explicit eligibility flags.
This is where v2 and arXiv TeX-derived data have the most upside.
Private evals
Holdouts and hidden tracks
Build hidden holdouts and private evaluation packs from fresh problem families, hard proof-process cases, and agentic tasks that are difficult
to overfit from public benchmark formats.
Inspect the Format
Start with the public research schema repository: inspect the three examples, read the whitepaper, and review how canonical records flatten
into RLVR, SFT, PRM, proof-critique, preference, and eval views. Commercial training packs, private holdouts, reviewer notes, and larger
corpora are available under separate access terms.