UlamAI Prover: An Open Lean 4 Theorem Prover

We're releasing UlamAI Prover, a truth-first CLI that combines LLM-guided reasoning with Lean 4 verification to produce machine-checked proofs—no hallucinations, no trust required.

Why verified proofs matter for AI reasoning

Large language models can generate plausible-looking mathematics. They can sketch proof strategies, identify relevant lemmas, and produce chains of reasoning that read convincingly. But plausibility is not proof. A single invalid step—a misapplied lemma, a subtle type error, a hand-waved inequality—invalidates the entire argument. And models hallucinate exactly these kinds of errors routinely.

The solution is obvious in principle: have a proof assistant check every step. If Lean 4 accepts the tactic, the step is valid. If it doesn't, the step is discarded. The model proposes; the kernel disposes. This is the architecture behind UlamAI Prover.

We've open-sourced the first working version at github.com/ulamai/ulamai.

What UlamAI Prover does

UlamAI Prover is a CLI tool with a simple loop:

  1. Open a Lean 4 goal—from a theorem in a .lean file, a LaTeX document, or an interactive prompt.
  2. Ask an LLM to propose the next tactic step.
  3. Execute it in Lean.
  4. If it fails, use the error as feedback: repair, backtrack, try again.
  5. When every subgoal is closed, return a verified proof. If time runs out, return a replayable failure trace with everything the system tried.
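The loop above can be sketched in a few lines. This is a toy illustration, not the repository's code: propose_tactic and run_lean are hypothetical stand-ins for the LLM call and the Lean 4 subprocess.

```python
# A minimal sketch of the propose-execute-repair loop. propose_tactic and
# run_lean are hypothetical stand-ins: the real system queries an LLM and
# drives a Lean 4 subprocess.

def propose_tactic(goal, last_error):
    # Stand-in for the LLM: propose a tactic, using the previous
    # Lean error (if any) as repair feedback.
    return "simp" if last_error is None else "norm_num"

def run_lean(goal, tactic):
    # Stand-in for executing the tactic in Lean. Only norm_num succeeds
    # here, so the repair path gets exercised.
    if tactic == "norm_num":
        return True, None
    return False, "simp made no progress"

def prove(goal, max_steps=5):
    error = None
    for _ in range(max_steps):
        tactic = propose_tactic(goal, error)
        ok, error = run_lean(goal, tactic)
        if ok:
            return tactic   # Lean accepted: the step is verified
    return None             # budget exhausted: fail, never fake a proof

print(prove("⊢ (2 : ℕ) + 2 = 4"))  # → norm_num
```

Note the shape of the failure case: the loop can return None, but it can never return a tactic Lean rejected.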

The key design decision is that Lean is the sole arbiter of correctness. The LLM is a heuristic guide—creative, fast, often wrong. Lean is the ground truth. This separation means UlamAI Prover cannot produce a false proof. It can fail to find a proof, but it cannot claim to have found one that doesn't check.

Architecture: search, not hope

Most LLM-based proof attempts rely on one-shot generation: prompt the model, hope the output compiles. This works for simple goals but collapses on anything requiring multi-step reasoning, backtracking, or lemma decomposition.

UlamAI Prover uses best-first search with beam capping over the space of proof states. At each state, the LLM proposes 8–16 candidate tactics. Each candidate is executed in Lean. Successful applications expand the search tree; failures are pruned. A transposition table prevents the system from re-exploring states it has already visited.

This is not a new idea—it's how the strongest theorem provers have worked for years. What's new is the integration with modern LLMs as the policy network, combined with retrieval-augmented premise selection from Mathlib, and a repair loop that feeds Lean error messages back to the model for targeted fixes.

The system also includes automation fallbacks: on every proof state, built-in tactics like aesop, simp, linarith, ring, and norm_num are tried before querying the LLM. Many subgoals that look hard to a language model are trivial for Lean's automation. This saves time, tokens, and search budget.
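The automation-first policy is a simple gate in front of the model. A sketch, with try_tactic as a hypothetical stand-in for a Lean run:

```python
# Automation-first fallback: cheap Lean tactics are tried before any
# LLM call. try_tactic is a hypothetical stand-in for a Lean run.

AUTO_TACTICS = ["aesop", "simp", "linarith", "ring", "norm_num"]

def close_goal(goal, try_tactic, ask_llm):
    for t in AUTO_TACTICS:          # no tokens spent on easy subgoals
        if try_tactic(goal, t):
            return t
    return ask_llm(goal)            # only hard goals reach the model

# Toy run: a goal that linarith happens to close.
result = close_goal(
    "⊢ x + 1 > x",
    try_tactic=lambda g, t: t == "linarith",
    ask_llm=lambda g: "llm-tactic",
)
print(result)  # → linarith
```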

How it connects to LLMs

UlamAI Prover is designed to work with whatever model you prefer. The current release supports:

  • OpenAI-compatible endpoints (GPT-4.1, GPT-5.2 Codex, and successors)
  • Anthropic Claude (via API or Claude Code CLI)
  • Ollama (local models—useful for experimentation and privacy)

Authentication is handled through the interactive menu or environment variables. The system also supports Codex CLI and Claude Code CLI login flows, so users with existing subscriptions can get started without managing API keys.

The LLM adapter is intentionally thin. If your model speaks the OpenAI chat format, UlamAI Prover can use it. This means local fine-tuned models, research checkpoints, or custom endpoints all work out of the box.
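To show how thin such an adapter can be, here is a stdlib-only sketch of the OpenAI chat-completions shape. The endpoint and model name are placeholders (any server speaking this format fits, including Ollama's OpenAI-compatible mode); this is not the repository's adapter code.

```python
import json
import urllib.request

# A minimal OpenAI-compatible chat adapter. build_chat_request is pure
# (easy to inspect); chat performs the actual HTTP call.

def build_chat_request(base_url, model, messages, api_key=None, temperature=0.2):
    url = f"{base_url}/v1/chat/completions"
    payload = json.dumps({
        "model": model,
        "messages": messages,
        "temperature": temperature,
    }).encode()
    headers = {"Content-Type": "application/json"}
    if api_key:
        headers["Authorization"] = f"Bearer {api_key}"
    return url, payload, headers

def chat(base_url, model, messages, **kw):
    url, payload, headers = build_chat_request(base_url, model, messages, **kw)
    req = urllib.request.Request(url, data=payload, headers=headers)
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]

# e.g. against a local server (placeholder port and model name):
url, payload, headers = build_chat_request(
    "http://localhost:11434", "llama3",
    [{"role": "user", "content": "Propose a tactic for ⊢ 2 + 2 = 4"}],
)
print(url)  # → http://localhost:11434/v1/chat/completions
```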

Open-source and bring-your-own-model

Most capable AI theorem provers are closed systems: proprietary models behind proprietary infrastructure, with results you can read about but can't reproduce. UlamAI Prover is different. The entire codebase—search loop, retrieval, repair logic, Lean runner, trace logging—is open-source and available on GitHub.

This matters for two reasons. First, reproducibility: if we claim a result, you can rerun the exact pipeline and verify it. Second, extensibility: researchers can swap components, add retrieval strategies, plug in custom value models, or modify the search algorithm without reverse-engineering a black box.

Equally important is that you're not locked into a single LLM provider. If you have an OpenAI Codex subscription, run ulam auth codex and the system imports your credentials directly—the same flow as the official Codex CLI. If you prefer Claude Code, run ulam auth claude and use your existing Anthropic subscription with --llm claude_cli. And if you want full control over your stack—or you're working in an environment where data can't leave your network—Ollama lets you run the entire pipeline against self-hosted models like LLaMA, Mistral, or DeepSeek on your own hardware. No API keys, no usage fees, no data leaving your machine. The interactive menu (ulam) walks you through whichever path you choose, so setup takes minutes regardless of provider.

This flexibility is the point. The best model for theorem proving six months from now probably doesn't exist yet. UlamAI Prover is built so that when it arrives, you can plug it in and run.

Retrieval: finding the right lemma

In Mathlib, knowing that the right lemma exists is often harder than applying it. UlamAI Prover includes a retrieval module that selects relevant premises from Mathlib or local repositories and injects them into the LLM's context.

Two retrieval modes are available: token-overlap (fast, no external dependencies) and embedding-based (higher quality, requires an embedding endpoint). Premises are formatted as name-statement pairs so the model can reference them directly in tactic proposals.

This matters because Mathlib contains over 100,000 lemmas. Without retrieval, the model must rely on memorized knowledge of the library—knowledge that degrades with model size, quantization, and the ever-growing surface area of Mathlib itself. Retrieval turns “do you remember this lemma?” into “here is the lemma; use it.”
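The token-overlap mode is simple enough to sketch end to end. The lemma statements below are illustrative stand-ins for indexed Mathlib declarations, and the ranking is a toy version of what a real premise selector does:

```python
import re

# Toy token-overlap premise selection: rank lemmas by how many identifier
# tokens their statements share with the goal.

def tokens(s):
    return set(re.findall(r"[A-Za-z_][A-Za-z0-9_.]*", s))

def select_premises(goal, lemmas, k=2):
    g = tokens(goal)
    ranked = sorted(lemmas.items(),
                    key=lambda kv: len(g & tokens(kv[1])),
                    reverse=True)
    # Return name-statement pairs' names, best overlap first.
    return [name for name, _ in ranked[:k]]

lemmas = {
    "Nat.add_comm": "∀ (n m : Nat), n + m = m + n",
    "List.length_append": "∀ (l₁ l₂ : List α), (l₁ ++ l₂).length = l₁.length + l₂.length",
    "Nat.mul_comm": "∀ (a b : Nat), a * b = b * a",
}
print(select_premises("⊢ ∀ (n m : Nat), n + m = m + n", lemmas, k=1))  # → ['Nat.add_comm']
```

The embedding-based mode replaces the overlap score with cosine similarity between goal and statement embeddings; the surrounding machinery is the same.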

Formalization from LaTeX

Beyond proving theorems that are already stated in Lean, UlamAI Prover includes an early autoformalization pipeline. Given a LaTeX document, the system attempts to:

  1. Extract mathematical statements.
  2. Translate them into Lean 4 declarations.
  3. Type-check the translations against Mathlib.
  4. Attempt proofs of the formalized statements.
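To make steps 2-4 concrete, here is the kind of translation the pipeline aims to produce. The LaTeX input and the Lean output are illustrative examples, not artifacts from the repository:

```lean
-- LaTeX input: "For every natural number $n$, $n + 0 = n$."
-- One plausible Lean 4 translation, type-checked against Mathlib,
-- with a proof attempt appended by the search loop:
theorem nat_add_zero (n : ℕ) : n + 0 = n := by
  simp
```

Even this tiny example shows where drift can creep in: translating the statement over ℤ or ℝ instead of ℕ would type-check but formalize a different claim.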

This is the hardest part of the pipeline and the least mature. Formalization drift—where the Lean statement doesn't faithfully capture the LaTeX meaning—is a real and subtle failure mode. The current system includes an optional equivalence-checking step and stores per-round artifacts for human review.

We're releasing this module because it's useful even in its current state: it automates the tedious scaffolding of import statements, type signatures, and Mathlib alignment, even when the proofs themselves require human guidance.

Reproducibility by design

Every UlamAI Prover run produces a JSONL trace file logging each search step: the proof state, the proposed tactic, the Lean response, and the system's scoring decision. Traces can be replayed deterministically with ulam replay, enabling:

  • Debugging — See exactly where and why a proof attempt failed.
  • Training data extraction — Successful traces become supervised fine-tuning examples for tactic prediction models.
  • Benchmarking — Compare different models, retrieval strategies, or search parameters on identical problem sets.
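Because traces are plain JSONL, analysis needs nothing beyond the standard library. A sketch of consuming one (the field names here — goal, tactic, ok, score — are illustrative; check the repository's trace schema for the real keys):

```python
import json

# Two hand-written trace lines standing in for a real run: one failed
# tactic application, one successful one.
trace_lines = [
    '{"goal": "⊢ 2 + 2 = 4", "tactic": "ring", "ok": false, "score": 0.3}',
    '{"goal": "⊢ 2 + 2 = 4", "tactic": "norm_num", "ok": true, "score": 0.9}',
]

steps = [json.loads(line) for line in trace_lines]
failures = [s["tactic"] for s in steps if not s["ok"]]   # debugging
solved = any(s["ok"] for s in steps)                     # benchmarking
print(failures, solved)  # → ['ring'] True
```

Successful lines feed fine-tuning datasets directly; failed lines are the hard negatives mentioned in the roadmap.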

The regression suite (ulam bench) runs against a curated set of theorems and reports solve rates, step counts, and timing. This is how we measure progress internally, and it's how we expect the community to hold us accountable.

What's next

The current release (v0.1.3) is a working scaffold—intentionally thin but functional end-to-end. The roadmap is public in the repository and reflects our priorities:

Near-term: better state hashing for the transposition table, stronger retrieval formatting, tighter cost controls on expensive tactics, and a growing regression suite.

Medium-term: supervised fine-tuning of tactic prediction models on extracted traces. The traces from successful proofs become training data; the traces from near-misses become hard negatives. This is where the system starts to improve itself.

Longer-term: MCTS-style search (replacing best-first when it makes sense), learned value models to guide search allocation, parallel rollouts, and eventually on-policy reinforcement learning from proof outcomes.

The autoformalization module will mature into a separate product surface: PDF-to-Lean translation with semantic faithfulness checks, iterative repair, and structured output of Lean files with proof attempts and annotated gaps.

Connection to UnsolvedMath

Earlier this month we released UnsolvedMath, a benchmark of 1,146 open mathematical problems. UlamAI Prover is the natural complement: UnsolvedMath defines what to prove; UlamAI Prover is the machinery for attempting those proofs in a verified setting.

As the prover matures, we plan to run it against formalized versions of UnsolvedMath problems—starting with restricted cases and special instances where partial progress is tractable. The traces from these attempts, whether successful or not, will be valuable both as training data and as a map of where current AI reasoning hits its limits.

Try it

From a fresh clone:

git clone https://github.com/ulamai/ulamai.git
cd ulamai
./install.sh
ulam

The interactive menu will walk you through LLM configuration and Lean setup. For a quick smoke test without Lean installed, mock mode works out of the box:

ulam prove examples/Smoke.lean --theorem irrational_sqrt_two_smoke

If you're working on AI for mathematics—or if you're a mathematician curious about what these systems can and can't do—we'd like your feedback. File issues, submit problem sets, or tell us where the system fails in ways that surprise you.

The failures are where the interesting work begins.

Repository: github.com/ulamai/ulamai