Paper · April 2026

Conformal-Calibrated Rewards for Scientific RL: Procedural Regeneration Against Benchmark Contamination

This paper introduces a procedural regeneration protocol and a conformal-calibrated reward construction that together make benchmark contamination structurally infeasible for RL on scientific inverse problems. Across ten environments and five frontier models, classical baselines beat LLM agents in 32 of 50 head-to-head comparisons (p<0.05).
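The protocol itself is specified in the manuscript, not on this page. As a rough illustration of the general idea behind procedural regeneration, the sketch below derives every problem instance from a seed that only exists at run time, so no fixed test set is available to leak into training data. All names here (`episode_seed`, `sample_sparse_signal`, `run_secret`, the signal length and sparsity) are hypothetical and are not taken from the paper's code.

```python
import hashlib
import random


def episode_seed(env_id: str, episode: int, run_secret: str) -> int:
    """Derive a deterministic 64-bit seed from (secret, environment, episode)."""
    payload = f"{run_secret}:{env_id}:{episode}".encode("utf-8")
    digest = hashlib.sha256(payload).digest()
    return int.from_bytes(digest[:8], "big")


def sample_sparse_signal(seed: int, n: int = 64, k: int = 4) -> list[float]:
    """Generate a fresh k-sparse ground-truth signal of length n from a seed.

    This stands in for one instance of a sparse-recovery style task; the
    real environments are assumed to be considerably richer.
    """
    rng = random.Random(seed)
    support = rng.sample(range(n), k)  # k distinct nonzero positions
    signal = [0.0] * n
    for index in support:
        signal[index] = rng.gauss(0.0, 1.0)
    return signal


# Hypothetical usage: a new instance for every episode of environment "SF-1".
seed_a = episode_seed("SF-1", episode=0, run_secret="chosen-at-run-time")
seed_b = episode_seed("SF-1", episode=1, run_secret="chosen-at-run-time")
instance_a = sample_sparse_signal(seed_a)
instance_b = sample_sparse_signal(seed_b)
```

Under this assumption, an instance is reproducible for anyone who holds the secret, yet it does not exist anywhere before the run begins.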

Key finding: 32/50 classical-vs-LLM comparisons significant (p<0.05)
Key finding: +0.199 average absolute reward gap, classical over best LLM
Key finding: 0.901 ± 0.017 empirical conformal coverage at target 0.90
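
To make the coverage figure above concrete, here is a minimal sketch of how empirical coverage at a 0.90 target is typically checked with standard split conformal prediction. It is not the paper's reward construction: the nonconformity scores are synthetic, and the function names `conformal_quantile` and `empirical_coverage` are assumptions made for illustration.

```python
import math
import random


def conformal_quantile(cal_scores, alpha=0.10):
    """Split-conformal threshold: the ceil((n + 1) * (1 - alpha))-th smallest
    calibration score. Returns infinity when n is too small for the target."""
    n = len(cal_scores)
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        return math.inf
    return sorted(cal_scores)[k - 1]


def empirical_coverage(test_scores, threshold):
    """Fraction of held-out scores that fall at or below the threshold."""
    return sum(score <= threshold for score in test_scores) / len(test_scores)


# Synthetic nonconformity scores (assumption): absolute Gaussian residuals.
rng = random.Random(0)
cal_scores = [abs(rng.gauss(0.0, 1.0)) for _ in range(1000)]
test_scores = [abs(rng.gauss(0.0, 1.0)) for _ in range(5000)]

q_hat = conformal_quantile(cal_scores, alpha=0.10)
coverage = empirical_coverage(test_scores, q_hat)
print(f"threshold={q_hat:.3f}, empirical coverage={coverage:.3f}")
```

When calibration and test scores are exchangeable, the threshold covers a new score with probability at least 1 − alpha, so a measured value close to 0.90 is what a correctly calibrated procedure should produce.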

Figures

Selected figures from the manuscript. Full results, error bars, and ablations live in the Zenodo record.

Classical vs best-LLM mean reward (per domain)
[Figure. Domains: Sparse Fourier, CT (LoDoPaB), MRI Knee, Phase Retrieval, Super-Res DIV2K. Series: Classical, Best LLM. Reward axis: 0.00 to 1.00.]

Empirical conformal coverage over episodes
[Figure. Horizontal axis: Episode (×100). Coverage axis: 0.70 to 1.00, with a reference line at target = 0.90.]

Difficulty (1 − classical reward) vs LLM gap
[Figure. Environments: SF-1, SF-2, SF-3, CT-1, CT-2, MRI-1, MRI-2, PR-1, PR-2, SR-1. Labels: Difficulty, LLM gap.]

Classical-vs-LLM significance (p<0.05); green = significant
[Figure. Models: Opus, Sonnet, Haiku, GPT-5, Gemini 2.5. Environments: SF-1, SF-2, SF-3, CT-1, CT-2, MRI-1, MRI-2, PR-1, PR-2, SR-1.]

Cite this work

The Zenodo record is the canonical citation. Code is Apache-2.0 licensed; the manuscript is CC-BY-4.0.

citation.bib
@misc{zacharioudakis2026verifiable,
  title         = {Conformal-Calibrated Rewards for Scientific RL:
                    Procedural Regeneration Against Benchmark Contamination},
  author        = {Zacharioudakis, Stelios},
  year          = {2026},
  month         = {April},
  publisher     = {Zenodo},
  version       = {v1},
  doi           = {10.5281/zenodo.19786415},
  url           = {https://zenodo.org/records/19786415},
  note          = {National and Kapodistrian University of Athens}
}

Related work

Conformal Prediction (Vovk et al.)
Compressed Sensing (Donoho)
fastMRI (Zbontar et al.)
LoDoPaB-CT (Leuschner et al.)
GRPO (Shao et al.)
Procedural environments (Cobbe et al.)