Paper · April 2026
Conformal-Calibrated Rewards for Scientific RL: Procedural Regeneration Against Benchmark Contamination
A procedural regeneration protocol and a conformal-calibrated reward construction that make benchmark contamination structurally infeasible for RL on scientific inverse problems. Across ten environments and five frontier models, classical baselines beat LLM agents in 32 of 50 head-to-head comparisons (p<0.05).
Key finding: 32/50 classical-vs-LLM comparisons significant (p<0.05)
Key finding: +0.199 average absolute reward gap, classical over best LLM
Key finding: 0.901 ± 0.017 empirical conformal coverage at target 0.90
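For readers unfamiliar with the coverage figure, the sketch below shows how empirical coverage at a 0.90 target is typically measured in standard split conformal prediction. It is a minimal illustration under stated assumptions, not the manuscript's reward construction: the nonconformity scores are synthetic (absolute values of Gaussian noise), and the helper names `conformal_quantile` and `empirical_coverage` are hypothetical.

```python
import math
import random


def conformal_quantile(scores, alpha):
    """Split-conformal threshold: the k-th smallest calibration score,
    with k = ceil((n + 1) * (1 - alpha)). Returns +inf if k exceeds n."""
    n = len(scores)
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        return float("inf")
    return sorted(scores)[k - 1]


def empirical_coverage(cal_scores, test_scores, alpha):
    """Fraction of held-out scores at or below the calibrated threshold."""
    q = conformal_quantile(cal_scores, alpha)
    covered = sum(1 for s in test_scores if s <= q)
    return covered / len(test_scores)


# Synthetic scores (an assumption for illustration only): |residual| of
# a predictor whose errors are standard normal.
rng = random.Random(0)
cal = [abs(rng.gauss(0.0, 1.0)) for _ in range(1000)]
test = [abs(rng.gauss(0.0, 1.0)) for _ in range(5000)]

cov = empirical_coverage(cal, test, alpha=0.10)
print(f"empirical coverage: {cov:.3f}")
```

When calibration and test scores are exchangeable, the marginal coverage of this procedure is at least 1 - alpha, so a measured value close to 0.90 is the expected outcome for a target of 0.90.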
Figures
Selected figures from the manuscript. Full results, error bars, and ablations live in the Zenodo record.
Cite this work
The Zenodo record is the canonical citation. Code is Apache-2.0 licensed; the manuscript is CC-BY-4.0.
citation.bib
@misc{zacharioudakis2026verifiable,
  title     = {Conformal-Calibrated Rewards for Scientific RL: Procedural Regeneration Against Benchmark Contamination},
  author    = {Zacharioudakis, Stelios},
  year      = {2026},
  month     = {April},
  publisher = {Zenodo},
  version   = {v1},
  doi       = {10.5281/zenodo.19786415},
  url       = {https://zenodo.org/records/19786415},
  note      = {National and Kapodistrian University of Athens}
}
Related work
Conformal Prediction (Vovk et al.) · Compressed Sensing (Donoho) · fastMRI (Zbontar et al.) · LoDoPaB-CT (Leuschner et al.) · GRPO (Shao et al.) · Procedural environments (Cobbe et al.)