Verified Reasoning Trajectories for RL Labs

Erdős Problem Trajectories for Reasoning RL

Ulam is building verified reasoning trajectories from Erdős-style mathematical problems for RL post-training, RLVR (reinforcement learning with verifiable rewards), process supervision, reward modeling, and private evaluations.

The first release is a set of 1,000 trajectories: model attempts, expert review, intermediate reasoning states, failure localization, and verifier-ready artifacts where available. We are already extending this to 10,000 trajectories, then 100,000, and larger custom programs for teams training reasoning-capable models and agents.

What Is Inside the First 1,000

Each trajectory is designed to be useful as training signal, evaluation signal, or inspection material for teams that care about how a model reasons, not only whether it produces a final answer.

Erdős Problem Attempts

Trajectories built around difficult mathematical problems inspired by the Erdős tradition: combinatorics, number theory, graph theory, extremal reasoning, and related proof-heavy domains.

Step-by-Step Reasoning Traces

Model proposals, plans, reductions, examples, counterexamples, revisions, dead ends, and recovered paths, preserved as inspectable reasoning trajectories.

Expert Review and Failure Labels

Human review at the points where the signal matters: invalid lemmas, missing assumptions, weak reductions, promising ideas, partial progress, and final status.

Verifier-Ready Artifacts

Where possible, traces connect to checkable outputs, structured rubrics, proof sketches, formalization targets, and scoring criteria that can be used in RL or eval loops.
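
As a concrete, purely illustrative sketch, a single trajectory record might be structured along the lines below. Every field name here (problem_id, steps, final_status, failure_step, and so on) is a hypothetical assumption about the format, not the shipped schema; the sample folder shows the real layout.

```python
from dataclasses import dataclass, field

# Illustrative schema only: all field names are hypothetical, not the shipped format.

@dataclass
class Step:
    index: int
    kind: str           # e.g. "plan", "reduction", "example", "counterexample", "revision"
    text: str           # the model's reasoning at this step
    label: str | None   # expert label, e.g. "valid", "invalid_lemma", "missing_assumption"

@dataclass
class Trajectory:
    problem_id: str
    problem_statement: str
    steps: list[Step] = field(default_factory=list)
    final_status: str = "unresolved"    # e.g. "proved", "partial_progress", "failed"
    failure_step: int | None = None     # index of the first invalid step, if any
    artifacts: dict = field(default_factory=dict)  # rubric, proof sketch, formalization target
```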

Why RL Teams Use This

RLVR and Outcome Rewards

Use hard mathematical tasks with structured success criteria, final-status labels, and verifier-ready artifacts to create rewards that are harder to game than generic preference labels.
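
For example, a minimal outcome-reward function under this setup might prefer a hard verifier check whenever a checkable artifact exists and fall back to the expert-assigned final-status label otherwise. The status strings and weights below are illustrative assumptions, not part of the dataset spec:

```python
# Minimal outcome-reward sketch for an RLVR loop. Status values and weights
# are assumptions for illustration, not the dataset spec.

OUTCOME_REWARD = {"proved": 1.0, "partial_progress": 0.3, "failed": 0.0}

def outcome_reward(final_status: str, verifier_passed: bool | None = None) -> float:
    """Prefer a hard verifier check when an artifact exists; otherwise fall
    back to the expert-assigned final-status label."""
    if verifier_passed is not None:
        return 1.0 if verifier_passed else 0.0
    return OUTCOME_REWARD.get(final_status, 0.0)
```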

Process Supervision

Train models to distinguish productive reasoning from fluent but invalid reasoning by exposing them to intermediate steps, expert critiques, failed branches, and repaired arguments.
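
One way this could feed a training loop, sketched below under assumed label names, is to map each expert step label to a scalar process reward in the style of process-reward-model data:

```python
# Sketch of turning expert step labels into per-step process rewards.
# The label strings below are hypothetical placeholders.

STEP_REWARD = {
    "valid": 1.0,
    "promising": 0.5,
    "missing_assumption": -0.5,
    "invalid_lemma": -1.0,
}

def process_rewards(step_labels: list[str | None]) -> list[float]:
    """One reward per reasoning step; unlabeled steps get a neutral 0.0."""
    return [STEP_REWARD.get(label, 0.0) for label in step_labels]
```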

Reward Models and Judge Data

Build judge and reward-model datasets from comparisons between reasoning paths: valid versus invalid reductions, useful examples versus irrelevant examples, and promising partial progress versus hallucinated proof structure.
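
A hedged sketch of how such comparisons could be mined from step labels within a single trajectory, again with hypothetical label strings:

```python
# Sketch of mining (chosen, rejected) preference pairs for a judge or reward
# model from expert step labels. Label strings are hypothetical placeholders.

def preference_pairs(steps: list[tuple[str, str]]) -> list[tuple[str, str]]:
    """steps: (step_text, expert_label) pairs from one problem. Pair every
    step marked valid against every step marked invalid."""
    chosen = [text for text, label in steps if label == "valid"]
    rejected = [text for text, label in steps if label == "invalid_lemma"]
    return [(c, r) for c in chosen for r in rejected]
```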

Private Reasoning Evals

Use hidden holdouts and fresh problem families to test whether a model can explore, revise, and make real mathematical progress without overfitting to public benchmark formats.
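
A minimal harness for such a holdout might look like the sketch below; the grade callback stands in for whatever verifier or rubric scorer a team plugs in, and is not a shipped API:

```python
# Minimal private-eval sketch: score model attempts on a hidden holdout with
# a grader callback and report the pass rate. `grade` is a hypothetical
# callable (verifier or rubric scorer), not a shipped API.

from typing import Callable

def holdout_pass_rate(
    attempts: dict[str, str],            # problem_id -> model attempt text
    grade: Callable[[str, str], bool],   # (problem_id, attempt) -> passed?
) -> float:
    passed = sum(grade(pid, text) for pid, text in attempts.items())
    return passed / max(len(attempts), 1)
```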

Scaling Plan

We start with 1,000 trajectories so teams can inspect the format, run small post-training experiments, and test signal quality. The next milestones are 10,000 trajectories for larger ablations, 100,000 trajectories for full training runs and eval suites, and custom expansions for labs that want domain-specific verifiers or private tracks.

1,000 trajectories: sampleable corpus for inspection, reward-model pilots, process-supervision trials, and private eval design.

10,000 trajectories: broader coverage across problem families, reasoning styles, failure modes, and verifier targets.

100,000 trajectories: large-scale training asset for RL labs that need diverse, difficult, and reviewable reasoning data.

Beyond 100,000: custom trajectory generation, hidden holdouts, grader packs, and domain-specific verification workflows.

What Makes the Data Different

Hard Tasks

Problems are not simple benchmark items

Erdős-style problems force models to search, test ideas, abandon weak paths, and reason through proof structure rather than pattern-match a known exercise.

Trace Value

Failures are training signal

Invalid attempts, local mistakes, and repaired reasoning are preserved because they are often more useful for RL and judge training than a polished final answer alone.

Review Loop

Expert review focuses the signal

Review is applied where it changes training value: assumptions, reductions, examples, counterexamples, proof gaps, and claims that need verification.

Eval Ready

Designed for training and inspection

The same assets can support RL post-training, process supervision, data curation, reward-model training, and private release-gate evaluations.

Best Fit

Post-Training and RL Teams

Teams training reasoning models with RL, RLVR, tool feedback, process supervision, or verifier-backed reward functions.

Math, Code, and Agentic Reasoning Groups

Labs that care about proof search, long-horizon reasoning, theorem proving, coding-agent-style repair loops, or agent traces where intermediate states matter.

Evaluation and Safety Teams

Teams looking for private, difficult-to-overfit evaluations that show whether a model can reason under uncertainty and recover from failed lines of attack.

Inspect the Sample

The easiest next step is to inspect the sample folder, review the trajectory format, and decide whether the first 1,000 trajectories should be used as a pilot dataset, an eval set, or the seed for a larger 10,000- or 100,000-trajectory program. Open the data sample.