Erdős Problem Attempts
Trajectories built around difficult mathematical problems inspired by the Erdős tradition: combinatorics, number theory, graph theory, extremal reasoning, and related proof-heavy domains.
Ulam is building verified reasoning trajectories from Erdős-style mathematical problems for RL post-training, RLVR, process supervision, reward modeling, and private evaluations.
The first release is an initial set of 1,000 trajectories: model attempts, expert review, intermediate reasoning states, failure localization, and verifier-ready artifacts where available. We are already extending this to 10,000 and then 100,000 trajectories, along with larger custom programs for teams training reasoning-capable models and agents.
Each trajectory is designed to be useful as training signal, evaluation signal, or inspection material for teams that care about how a model reasons, not only whether it produces a final answer.
Model proposals, plans, reductions, examples, counterexamples, revisions, dead ends, and recovered paths, preserved as inspectable reasoning trajectories.
Human review at the points where the signal matters: invalid lemmas, missing assumptions, weak reductions, promising ideas, partial progress, and final status.
Where possible, traces connect to checkable outputs, structured rubrics, proof sketches, formalization targets, and scoring criteria that can be used in RL or eval loops.
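To make the pieces above concrete, here is a minimal sketch of what a single trajectory record could look like. All field names, labels, and the example problem are illustrative assumptions, not Ulam's actual schema.

```python
from dataclasses import dataclass, field

@dataclass
class ReviewNote:
    step_index: int   # which reasoning step the note refers to
    verdict: str      # e.g. "valid", "invalid_lemma", "missing_assumption"
    comment: str      # free-form expert annotation

@dataclass
class Trajectory:
    problem_id: str
    problem_statement: str
    steps: list[str]                        # intermediate reasoning states, in order
    review: list[ReviewNote] = field(default_factory=list)
    final_status: str = "unresolved"        # e.g. "solved", "partial_progress", "failed"
    verifier_artifacts: dict = field(default_factory=dict)  # rubrics, proof sketches, targets

# Example: a short attempt with one reviewed step.
t = Trajectory(
    problem_id="erdos-0001",
    problem_statement=(
        "Show that any n+1 integers chosen from {1,...,2n} "
        "contain two elements where one divides the other."
    ),
    steps=[
        "Plan: pigeonhole on odd parts.",
        "Each integer m writes uniquely as m = 2^k * q with q odd.",
        "Only n odd numbers exist in {1,...,2n}, so two chosen integers share q.",
    ],
    final_status="solved",
)
t.review.append(ReviewNote(step_index=2, verdict="valid", comment="Reduction checks out."))
print(t.final_status, len(t.steps))
```

A record like this keeps the reasoning states, the expert verdicts, and the final-status label in one inspectable object, which is the shape that reward-model and process-supervision pipelines typically consume.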
Use hard mathematical tasks with structured success criteria, final-status labels, and verifier-ready artifacts to create rewards that are harder to game than generic preference labels.
Train models to distinguish productive reasoning from fluent but invalid reasoning by exposing them to intermediate steps, expert critiques, failed branches, and repaired arguments.
Build judge and reward-model datasets from comparisons between reasoning paths: valid versus invalid reductions, useful examples versus irrelevant examples, and promising partial progress versus hallucinated proof structure.
Use hidden holdouts and fresh problem families to test whether a model can explore, revise, and make real mathematical progress without overfitting to public benchmark formats.
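As a sketch of the reward-model use case above: expert verdicts on competing reasoning paths can be turned into pairwise preference data. The record layout and scores here are illustrative assumptions.

```python
from itertools import combinations

# Three attempts at the same problem, each with an expert verdict score
# (higher = more productive reasoning). Contents are hypothetical.
attempts = [
    {"problem_id": "erdos-0001", "text": "Valid reduction via pigeonhole on odd parts.", "score": 2},
    {"problem_id": "erdos-0001", "text": "Asserts the claim 'by induction' with no base case.", "score": 0},
    {"problem_id": "erdos-0001", "text": "Partial progress: handles the n = 2 case correctly.", "score": 1},
]

def preference_pairs(attempts):
    """Yield (chosen, rejected) pairs from same-problem attempts with distinct scores."""
    for a, b in combinations(attempts, 2):
        if a["problem_id"] != b["problem_id"] or a["score"] == b["score"]:
            continue
        chosen, rejected = (a, b) if a["score"] > b["score"] else (b, a)
        yield {"chosen": chosen["text"], "rejected": rejected["text"]}

pairs = list(preference_pairs(attempts))
print(len(pairs))  # 3 attempts with distinct scores yield 3 ordered pairs
```

Because the comparisons are grounded in expert verdicts rather than surface fluency, pairs like these are harder to satisfy with confident-sounding but invalid proof structure.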
We start with 1,000 trajectories so teams can inspect the format, run small post-training experiments, and test signal quality. The next milestones are 10,000 trajectories for larger ablations, 100,000 trajectories for full training runs and eval suites, and custom expansions for labs that want domain-specific verifiers or private tracks.
1,000 trajectories: sampleable corpus for inspection, reward-model pilots, process-supervision trials, and private eval design.
10,000 trajectories: broader coverage across problem families, reasoning styles, failure modes, and verifier targets.
100,000 trajectories: large-scale training asset for RL labs that need diverse, difficult, and reviewable reasoning data.
Beyond 100,000: custom trajectory generation, hidden holdouts, grader packs, and domain-specific verification workflows.
Erdős-style problems force models to search, test ideas, abandon weak paths, and reason through proof structure rather than pattern-match a known exercise.
Invalid attempts, local mistakes, and repaired reasoning are preserved because they are often more useful for RL and judge training than a polished final answer alone.
Review is applied where it changes training value: assumptions, reductions, examples, counterexamples, proof gaps, and claims that need verification.
The same assets can support RL post-training, process supervision, data curation, reward-model training, and private release-gate evaluations.
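Because failed branches and repairs are preserved with step-level review, a process-supervision signal falls out almost directly. The labels and reward values below are illustrative assumptions, not a fixed scheme.

```python
# Hypothetical step labels from expert review of one trajectory,
# including a localized failure and its repair.
step_labels = ["valid", "valid", "invalid_lemma", "repaired", "valid"]

# Assumed mapping from review verdicts to per-step scalar rewards.
REWARDS = {"valid": 1.0, "repaired": 0.5, "invalid_lemma": -1.0}

def process_rewards(labels):
    """Map expert step verdicts to per-step rewards for process supervision."""
    return [REWARDS.get(label, 0.0) for label in labels]

rewards = process_rewards(step_labels)
print(rewards)  # [1.0, 1.0, -1.0, 0.5, 1.0]
```

A per-step signal like this is what lets training penalize the invalid lemma at step 3 while still crediting the recovery, rather than scoring only the final answer.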
Teams training reasoning models with RL, RLVR, tool feedback, process supervision, or verifier-backed reward functions.
Labs that care about proof search, long-horizon reasoning, theorem proving, coding-agent-style repair loops, or agent traces where intermediate states matter.
Teams looking for private, difficult-to-overfit evaluations that show whether a model can reason under uncertainty and recover from failed lines of attack.
The easiest next step is to inspect the sample folder, review the trajectory format, and decide whether the first 1,000 trajectories should be used as a pilot dataset, an eval set, or the seed for a larger 10,000 or 100,000 trajectory program. Open the data sample.