Quickstart
Run your first verifiable benchmark in under five minutes. This guide installs the SDK, runs a single environment, and explains the output.
Before you start
You’ll need Python 3.11+, an API key for at least one provider (Anthropic, OpenAI, Google, or OpenRouter), and ~500 MB for the package and its scientific dependencies. Everything runs locally — no Verifiable Labs account required.
Installation
Install from PyPI. The package pulls in all ten environments and their classical baselines (built on numpy / scipy / jax) under the hood.
$ pip install verifiable-labs
Verify the install with verifiable --version and verifiable list to see all 10 environments.
Set up an API key
The CLI reads provider keys from environment variables. The model name on the command line picks the provider automatically:
# Pick the provider you want to evaluate
export ANTHROPIC_API_KEY=sk-ant-...     # Claude models (claude-haiku-4.5, …)
export OPENAI_API_KEY=sk-...            # GPT / o-series
export GOOGLE_API_KEY=...               # Gemini

# Or use OpenRouter to access any provider with one key
export OPENROUTER_API_KEY=sk-or-v1-...
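The model-to-provider routing can be sketched as follows. This is an illustration only, not the CLI's internals; the prefix table and the `key_for_model` helper are assumptions based on the providers listed above.

```python
# Illustrative sketch: infer which env var authenticates a given model name.
# The prefixes below are assumptions, not the CLI's actual routing table.
PROVIDER_KEYS = {
    "claude": "ANTHROPIC_API_KEY",   # Claude models
    "gpt": "OPENAI_API_KEY",         # GPT series
    "o": "OPENAI_API_KEY",           # o-series
    "gemini": "GOOGLE_API_KEY",      # Gemini
}

def key_for_model(model: str) -> str:
    """Return the env-var name whose key would authenticate this model."""
    for prefix, var in PROVIDER_KEYS.items():
        if model.startswith(prefix):
            return var
    # Anything unrecognized falls back to OpenRouter's single key.
    return "OPENROUTER_API_KEY"

print(key_for_model("claude-haiku-4.5"))  # ANTHROPIC_API_KEY
print(key_for_model("mistral-large"))     # OPENROUTER_API_KEY
```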
Prefer interactive setup? verifiable login writes the keys to ~/.verifiable/config.toml with mode 0600.
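What the login step stores can be sketched roughly like this. It is a hypothetical illustration, not the CLI's code: the real file lives at ~/.verifiable/config.toml, while the sketch writes to a temporary directory so it is safe to run anywhere.

```python
import stat
import tempfile
from pathlib import Path

# Hypothetical sketch: write provider keys to a TOML file and restrict
# it to owner read/write (mode 0600). The real CLI targets
# ~/.verifiable/config.toml; this sketch uses a temp dir instead.
config_dir = Path(tempfile.mkdtemp()) / ".verifiable"
config_dir.mkdir()
config_path = config_dir / "config.toml"

config_path.write_text('[keys]\nanthropic = "sk-ant-..."\n')
config_path.chmod(0o600)  # keys should never be group- or world-readable

print(oct(stat.S_IMODE(config_path.stat().st_mode)))  # 0o600
```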
Run your first benchmark
Pick an environment, a model, and a seed. The CLI samples a fresh problem instance, runs the agent, computes classical baselines, and emits a calibrated reward interval.
$ verifiable run \
    --env sparse-fourier-recovery \
    --model claude-haiku-4.5 \
    --episodes 10 \
    --seed 42
Expected output (real numbers; yours will differ slightly):
✓ Loading environment: sparse-fourier-recovery
✓ Calibrating conformal interval (target 0.90)...
✓ Running 10 episodes...

  Mean reward: 0.327 ± 0.047
  Coverage: 0.933 (target 0.90) ✓
  Time: 0m 12s · Cost: $0.0021

Trace saved to ~/.verifiable/runs/sparse-fourier-recovery_claude-haiku-4.5_…jsonl
Verify the output
Every run prints a reward, a conformal interval, and a coverage check. If Coverage drops below the target, the calibration set is too small. Bump --episodes and re-run.
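The coverage check above follows the standard split-conformal recipe. Here is a minimal, self-contained sketch using absolute-error nonconformity scores; the toy data and the linear noise model are assumptions for illustration, not the SDK's internals.

```python
import numpy as np

rng = np.random.default_rng(42)

# Toy calibration set: predicted vs. observed rewards on past episodes.
predicted = rng.uniform(0, 1, size=200)
observed = predicted + rng.normal(0, 0.05, size=200)

# Split-conformal calibration with absolute-error nonconformity scores:
# pick the quantile of calibration errors that yields the target coverage.
scores = np.abs(observed - predicted)
alpha = 0.10  # target 90% coverage
n = len(scores)
q = np.quantile(scores, np.ceil((n + 1) * (1 - alpha)) / n, method="higher")

# Check empirical coverage on fresh episodes: the fraction of new
# observations falling inside [prediction - q, prediction + q].
new_pred = rng.uniform(0, 1, size=100)
new_obs = new_pred + rng.normal(0, 0.05, size=100)
coverage = np.mean(np.abs(new_obs - new_pred) <= q)
print(f"interval half-width {q:.3f}, coverage {coverage:.3f}")
```

If the calibration set is small, `q` is noisy and coverage can land below target, which is why raising `--episodes` tightens the check.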
--seed guarantees identical problem instances across machines. Seeds are versioned alongside the SDK release and never silently changed.
Next steps
You’re ready to wire Verifiable Labs into your training loop or evaluation harness. Three good places to go next:
- Browse all 10 environments with classical and LLM baselines.
- Read the paper to understand the conformal-calibration protocol.
- Star the GitHub repo and follow releases.