RAVE: compressing MoE models without deleting the reasoning experts
RAVE introduces router-aware virtual experts: a middle path between pruning and merging MoE experts. In an internal SU-01 proof-judging pilot, RAVE-64 led the aggregate leaderboard while using a 50% expert-centroid budget.
Mixture-of-Experts models are attractive because they activate only a few experts per token. But they still need to store, route, shard, and serve all experts. In practice, that means sparse MoE models can be compute-efficient while still being painful to deploy: memory footprint, expert placement, bandwidth, and routing overhead all matter.
Most post-training MoE compression methods choose one of two routes. Pruning keeps some experts and deletes the rest. This is simple and often strong, but it can remove specialists that are rarely used yet important on the hardest cases. Merging groups experts together. This preserves more aggregate information, but it can blur the sharp specialization that made the experts useful in the first place.
Our new paper introduces RAVE (Router-Aware Virtual Experts), a middle path between pruning and merging. RAVE keeps the router-addressable structure of an MoE model, but replaces some full experts with cheap virtual residual experts: a retained centroid expert plus a small low-rank correction.
The intuition is straightforward: do not ask only whether an expert should live or die. Ask whether it is important enough to keep full, redundant enough to absorb, cold enough to prune, or specialized enough to preserve as a cheap residual.
The four states of an expert
RAVE assigns each expert to one of four actions. The fourth state is the key difference: a virtual expert is not a full expert, but it is also not deleted. It remains addressable by the router, so the model can still express “this token wanted that specialist,” while paying only a small fraction of the storage cost.
| State | What happens to the expert | Why it matters |
|---|---|---|
| Full keep | The expert stays unchanged. | Preserves high-saliency specialists with no approximation. |
| Hard prune | The expert is removed entirely. | Deletes experts that are cold, redundant, or not worth storing. |
| Centroid absorb | The expert maps to a retained centroid with no residual. | Keeps the behavior only when the centroid is a good enough proxy. |
| Virtual residual expert | The expert keeps router identity, shares a retained centroid, and adds a small low-rank residual. | Preserves tail specialization without paying for a full expert. |
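
One way to picture the four states is as a per-expert plan record with a rough storage estimate. The sketch below is illustrative only: the `ExpertPlan` record, the `extra_params` helper, the residual shape (two rank-r factors of width `d_model`), and the layer sizes are all assumptions made for this example, not details taken from the paper or from the SU-01 model.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class ExpertState(Enum):
    FULL_KEEP = "full_keep"
    HARD_PRUNE = "hard_prune"
    CENTROID_ABSORB = "centroid_absorb"
    VIRTUAL_RESIDUAL = "virtual_residual"


@dataclass(frozen=True)
class ExpertPlan:
    """Compression decision for one original expert (hypothetical record)."""
    expert_id: int
    state: ExpertState
    centroid_id: Optional[int] = None  # retained centroid this ID maps to
    residual_rank: int = 0             # rank of the low-rank correction

    def __post_init__(self) -> None:
        shares_centroid = self.state in (
            ExpertState.CENTROID_ABSORB,
            ExpertState.VIRTUAL_RESIDUAL,
        )
        if shares_centroid != (self.centroid_id is not None):
            raise ValueError("centroid_id is required exactly for absorb/virtual states")
        is_virtual = self.state is ExpertState.VIRTUAL_RESIDUAL
        if is_virtual and self.residual_rank <= 0:
            raise ValueError("a virtual residual expert needs a positive rank")
        if not is_virtual and self.residual_rank != 0:
            raise ValueError("only virtual residual experts carry a residual")


def extra_params(plan: ExpertPlan, full_expert_params: int, d_model: int) -> int:
    """Parameters stored for this expert ID, excluding the shared centroid itself."""
    if plan.state is ExpertState.FULL_KEEP:
        return full_expert_params
    if plan.state is ExpertState.VIRTUAL_RESIDUAL:
        # Assumed residual shape: (d_model x r) and (r x d_model) factors.
        return 2 * d_model * plan.residual_rank
    return 0  # pruned or absorbed: nothing extra is stored for this ID


# Illustrative sizes only, not taken from the SU-01 model.
D_MODEL, D_FF = 1024, 4096
FULL = 3 * D_MODEL * D_FF  # assumed gated feed-forward expert with three matrices

virtual = ExpertPlan(expert_id=17, state=ExpertState.VIRTUAL_RESIDUAL,
                     centroid_id=3, residual_rank=8)
print(extra_params(virtual, FULL, D_MODEL) / FULL)  # fraction of a full expert
```

Under these assumed sizes the residual is a very small fraction of a full expert, which is the sense in which a virtual expert is "cheap"; the real ratio depends on the model's dimensions and the chosen rank.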
In implementation terms, several virtual expert IDs may share the same retained centroid. The serving runtime should compute that centroid once, aggregate the gates of all virtual IDs pointing to it, then add the small residual branches. This grouped execution is what can turn storage compression into real inference savings.
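The grouped execution described above can be sketched in NumPy. This is a minimal illustration under stated assumptions, not the paper's runtime: it assumes each virtual expert's output is the centroid output plus a linear low-rank branch, that gate values combine the outputs linearly, and that the centroid is a plain two-layer ReLU MLP. All function names and sizes are hypothetical.

```python
import numpy as np


def centroid_ffn(x, w_in, w_out):
    """Stand-in centroid expert: a two-layer ReLU MLP (assumed architecture)."""
    return np.maximum(x @ w_in, 0.0) @ w_out


def naive_forward(x, gates, w_in, w_out, A, B):
    """Evaluate every virtual ID separately, recomputing the centroid each time."""
    out = np.zeros(w_out.shape[1])
    for g, a, b in zip(gates, A, B):
        out += g * (centroid_ffn(x, w_in, w_out) + (x @ a) @ b)
    return out


def grouped_forward(x, gates, w_in, w_out, A, B):
    """Compute the shared centroid once, scale by the summed gates, add residuals."""
    out = gates.sum() * centroid_ffn(x, w_in, w_out)
    for g, a, b in zip(gates, A, B):
        out += g * ((x @ a) @ b)  # rank-r branch: project down, then back up
    return out


rng = np.random.default_rng(0)
d, h, r, k = 16, 32, 2, 3                 # model dim, hidden dim, rank, virtual IDs
x = rng.standard_normal(d)                # one token's hidden state
w_in = rng.standard_normal((d, h))
w_out = rng.standard_normal((h, d))
A = rng.standard_normal((k, d, r))        # per-ID down-projections
B = rng.standard_normal((k, r, d))        # per-ID up-projections
gates = np.array([0.5, 0.3, 0.2])         # router gates of the k virtual IDs

y_naive = naive_forward(x, gates, w_in, w_out, A, B)
y_grouped = grouped_forward(x, gates, w_in, w_out, A, B)
print(np.allclose(y_naive, y_grouped))
```

Both paths compute the same output under these assumptions, but the grouped path runs the expensive centroid MLP once instead of once per virtual ID. Whether that translates into wall-clock savings depends on the runtime and hardware.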
The important tradeoff: RAVE needs calibration data
There is a crucial difference between RAVE and the REAP/REAM baselines. REAP and REAM do not need external task data in the way RAVE does. They can be run as structural post-training compression baselines: prune or merge experts using the existing model, router behavior, and weight/expert statistics, without requiring a new domain-specific calibration corpus to define new virtual expert structure.
RAVE is different. To decide which experts should become centroids, which experts can be absorbed, and which removed experts deserve a virtual residual, RAVE needs a representative calibration stream. In our pilot, this calibration stream was our internal mathematics mixture, which we call OlympiadNet.
That makes RAVE more flexible but also more data-dependent. The advantage is that RAVE can be centered on the target domain, here olympiad-style mathematical reasoning. The cost is that the calibration mixture becomes part of the method. A bad or mismatched calibration set could choose the wrong centroids, over-protect the wrong experts, or fail to preserve the specialists that matter at evaluation time.
The 40-problem proof-judge evaluation set remained separate. OlympiadNet was used to calibrate the compressed expert structure, not as an answer source for the judged problems.
QeRAVE: using quantization noise as a redundancy probe
The paper also defines QeRAVE, a quantization-enhanced extension of RAVE. The idea is not that quantization magically improves reasoning. The idea is more cautious: quantization noise can help reveal which experts are stable, redundant, or brittle.
If an expert remains cold and unimportant under several low-bit perturbations, it is probably safer to prune or absorb. If its saliency or routing behavior changes sharply under small perturbations, it may be a brittle specialist that should be protected as a full or virtual residual expert.
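A toy version of this probe might look like the following. It is a sketch under loud assumptions: the saliency score (mean gate-weighted output norm), the symmetric round-to-nearest quantizer, the bit widths, and the thresholds are all hypothetical stand-ins chosen for illustration, and the sketch perturbs only the expert weights while holding router gates fixed, so it does not cover changes in routing behavior.

```python
import numpy as np


def fake_quantize(w: np.ndarray, bits: int) -> np.ndarray:
    """Symmetric round-to-nearest quantization followed by dequantization."""
    levels = 2 ** (bits - 1) - 1
    scale = np.abs(w).max() / levels
    if scale == 0.0:
        return w.copy()
    return np.round(w / scale) * scale


def saliency(w, x, gates) -> float:
    """Hypothetical saliency: mean gate-weighted output norm on calibration tokens."""
    return float(np.mean(gates * np.linalg.norm(x @ w, axis=1)))


def saliency_drift(w, x, gates, bit_widths=(8, 4, 3)):
    """Return clean saliency and its worst relative change under quantization."""
    base = saliency(w, x, gates)
    worst = max(abs(saliency(fake_quantize(w, b), x, gates) - base)
                for b in bit_widths)
    return base, worst / (base + 1e-12)


def suggest(base, drift, cold_threshold, drift_threshold) -> str:
    """Map the two probe signals to a coarse recommendation."""
    if drift > drift_threshold:
        return "protect"          # brittle: keep full or as a virtual residual
    if base < cold_threshold:
        return "prune_or_absorb"  # cold and stable under perturbation
    return "undecided"


rng = np.random.default_rng(0)
w = rng.standard_normal((16, 16))        # stand-in for one expert's weights
x = rng.standard_normal((64, 16))        # stand-in calibration activations
gates = rng.uniform(0.0, 1.0, size=64)   # stand-in router gate values
base, drift = saliency_drift(w, x, gates)
print(suggest(base, drift, cold_threshold=0.1, drift_threshold=0.25))
```

The point of the sketch is the shape of the decision, not the numbers: an expert is a pruning candidate only if it is both cold and stable across perturbations, while high drift argues for protection regardless of how cold it looks.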
The recommended path is to build and validate clean RAVE first; compare it fairly against pruning and merging at equal memory and latency budgets; and only then add QeRAVE as an extension, compared against ordinary post-hoc quantized baselines.
What exactly did we test?
For the internal pilot, we evaluated seven variants of the same 128-expert Simplified-Reasoning/SU-01 MoE base model.
| Variant | Meaning |
|---|---|
| original | The uncompressed SU-01 direct-decoding baseline, 128/128 experts. |
| reap_64 | REAP pruning, retaining 64 of 128 experts, roughly a 50% expert budget. |
| reap_77 | REAP pruning, retaining 77 of 128 experts, roughly a 60% expert budget. |
| ream_64 | REAM merging into a 64-expert target budget. |
| ream_77 | REAM merging into a 77-expert target budget. |
| rave_64 | RAVE with 64 retained centroids plus virtual residual experts. |
| rave_77 | RAVE with 77 retained centroids plus virtual residual experts. |
The suffix is the target expert budget out of 128: 64 is the roughly 50% setting and 77 is the roughly 60% setting. The RAVE variants are not just “RAVE-pruned” models: they use the calibration stream to center virtual experts around retained centroids.
The evaluation: proof quality, not just final answers
We tested the variants on 40 internal held-out synthetic IMO-style math problems. Each output was judged as a contest proof against a reference solution using a 0–7 olympiad rubric. Scores of 6 or 7 counted as passes.
This was a deliberately strict setting: all models used direct generation; all runs had an 8192-token cap; 274 of the 280 generations (7 variants × 40 problems) reached the cap; and the judge scored only the visible mathematical work, not what the model might have completed with more tokens.
That caveat matters. This is not a claim about unconstrained long-horizon test-time scaling. It is a capped direct-decoding proof-closure test.
Headline result: RAVE-64 led the aggregate leaderboard
| Rank | Variant | Score | Avg. /7 | Passes | Judge synopsis |
|---|---|---|---|---|---|
| 1 | rave_64 | 154/280 | 3.85 | 9/40 | Best overall; strongest mix of complete proofs and near-solutions, especially invariant-set, coordinate, and conic-substitution tasks. |
| 2 | reap_64 | 146/280 | 3.65 | 9/40 | Tied the best pass count and produced several clean short solutions, but stalled more often after finding the right idea. |
| 3 | rave_77 | 146/280 | 3.65 | 5/40 | Tied REAP-64 in total points but had fewer complete passes; often close on analytic/vector problems. |
| 4 | original | 143/280 | 3.58 | 5/40 | Strong direct-decoding baseline, but below RAVE-64 in total score and pass count under this 8K cap. |
| 5 | reap_77 | 139/280 | 3.48 | 7/40 | Volatile: some excellent passes, but more low partial scores than the best 64-expert variants. |
| 6 | ream_77 | 105/280 | 2.62 | 2/40 | Occasionally solved short counting/gap tasks, but far behind REAP/RAVE on proof closure. |
| 7 | ream_64 | 76/280 | 1.90 | 0/40 | Weakest run; no passes and many exploratory or incomplete outputs. |
The precise claim is important: RAVE-64 led by total judged proof score and tied REAP-64 on pass count. It did not dominate every problem. It scored eight points more than REAP-64 (154 vs. 146) with the same 9/40 passes, and eleven points more than the uncompressed original SU-01 direct run (154 vs. 143) in this capped setting.
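The arithmetic behind these margins is easy to re-derive from the leaderboard. The short script below copies the totals and pass counts from the table above; the helper names (`is_pass`, `margin`) are introduced here for illustration only.

```python
PROBLEMS = 40
MAX_SCORE = 7
PASS_THRESHOLD = 6  # scores of 6 or 7 count as passes

# (total points, passes) copied from the leaderboard table.
results = {
    "rave_64": (154, 9),
    "reap_64": (146, 9),
    "rave_77": (146, 5),
    "original": (143, 5),
    "reap_77": (139, 7),
    "ream_77": (105, 2),
    "ream_64": (76, 0),
}


def is_pass(score: int) -> bool:
    """Apply the pass rule to a single 0-7 proof score."""
    return score >= PASS_THRESHOLD


def margin(a: str, b: str) -> int:
    """Difference in total points between two variants."""
    return results[a][0] - results[b][0]


max_total = PROBLEMS * MAX_SCORE  # maximum points per variant
ranked = sorted(results, key=lambda name: results[name][0], reverse=True)

for name in ranked:
    total, passes = results[name]
    print(f"{name:9s} {total}/{max_total}  avg {total / PROBLEMS:.3f}  "
          f"passes {passes}/{PROBLEMS}")

print(margin("rave_64", "reap_64"), margin("rave_64", "original"))
```

Note that total points and pass count rank the variants differently: rave_77 ties reap_64 on points but has fewer passes, and reap_77 has more passes than the original despite a lower total.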
That is the result we care about: a 50%-budget RAVE variant preserved enough mathematical specialization to outperform the full direct baseline on aggregate proof quality under the same token cap.
What the results suggest
The best signal is not just that RAVE-64 won the table. It is where it won. The judge report highlighted RAVE-64’s strength on invariant-set arguments, coordinate solutions, conic substitutions, and finite-exception reasoning. These are exactly the kinds of tasks where deleting rare specialists can be dangerous: the successful proof may depend on a small number of expert behaviors that are not globally frequent.
REAP-64 also performed strongly. In fact, it tied RAVE-64 on passes. This is useful: pruning is a serious baseline, and any new MoE compression method should have to beat it fairly. The result does not say “pruning is bad.” It says the frontier is more interesting than keep/delete.
The data-dependence distinction matters here too. REAP/REAM are attractive partly because they are simpler baselines that do not require external domain data. RAVE buys its additional flexibility by using calibration data to place virtual experts. In our case, that means using OlympiadNet to make the compression math-centered. Future comparisons should therefore report not only scores, memory, and latency, but also the calibration data used to build the compressed model.
REAM was much weaker in this pilot. The likely lesson is that merging can preserve average statistics while damaging fine-grained expert specialization. For long mathematical reasoning, that loss of sharpness can matter.
Why virtual experts are useful
RAVE is built around a practical deployment hypothesis: some experts are too valuable to delete, but too redundant to keep as full experts. A virtual residual expert gives the model a cheap way to preserve that middle category.
This matters especially for reasoning models. Hard mathematical problems are not uniformly distributed. They often trigger narrow motifs: a number-theoretic factorization, a hidden coordinate transformation, a finite invariant set, a conic substitution, a parity repair lemma. These motifs may not dominate calibration frequency, but when they are needed, losing them can collapse a proof.
RAVE tries to compress the model while respecting this long tail. But the long tail has to be shown to the method. That is why RAVE needs a representative calibration mixture and why we centered the pilot on OlympiadNet.
What this does not prove yet
The paper is intentionally careful about the boundary between result and hypothesis. The internal SU-01 pilot is encouraging, but it is not a public benchmark claim. It uses one private 40-problem set, direct 8192-token decoding, and proof-judge scoring. It does not establish performance on public math benchmarks, code generation, instruction following, or production serving.
The serving-speed claim also needs a compact grouped runtime. The RAVE pilot was designed to evaluate quality of compressed expert structure, not to prove final wall-clock speed. To turn RAVE into a deployment method, the next implementation step is a runtime that computes shared centroids once, applies residuals efficiently, and measures real decode throughput.
RAVE also has a calibration requirement that REAP/REAM do not have in the same way. That is a real limitation. It means RAVE comparisons should always disclose the calibration mixture, whether it is general text, math, code, or domain-specific private data, and should test whether the method transfers when the calibration and evaluation domains differ.
That is why the paper separates four things: the method, router-aware virtual experts; the extra input, external calibration data for centering virtual experts; the first quality signal, RAVE-64 leading the strict proof-judge pilot; and the next validation step, public, latency-matched benchmarks with a compact runtime.
Why this fits Ulam’s research direction
At Ulam, we care about reasoning systems that can be measured under constraints. The best model is not only the one that can solve a benchmark in isolation. It is the one that can be served, evaluated, trusted, and improved at a cost that makes sense.
RAVE sits at the intersection of two parts of our work: model efficiency, reducing memory and inference cost without breaking behavior; and private reasoning evals, using proof-quality signals, partial progress, and hard-to-game mathematical tasks to detect capability changes that ordinary benchmarks miss.
This is the broader thesis: compression should not be blind. If a model has learned specialized reasoning circuits, we should try to preserve the geometry of those circuits, not just minimize parameter count.
RAVE is one step in that direction.
What comes next
The next stage is straightforward: implement the compact grouped RAVE runtime; run public, latency-matched evaluations against REAP, REAM, rank-0 absorption, and quantized baselines; report the calibration mixture explicitly for every compressed model; sweep residual ranks, active-centroid penalties, calibration mixtures, and adaptation budgets; test whether QeRAVE’s perturbation scoring improves robustness under low-bit deployment; and extend the evaluation beyond math proofs to code, instruction following, and agentic workloads.
The internal result is strong enough to justify this next step. RAVE-64 did not just survive compression. Under a strict proof-judging pilot, it became the aggregate winner.
That is the kind of result we want from model efficiency research: not smaller for its own sake, but smaller while preserving the behaviors that matter—and honest about what data the compression method needs.
Related paper:
RAVE and QeRAVE: Router-Aware Virtual Experts for Efficient Mixture-of-Experts LLM Compression
