v0.1 · April 2026

Quickstart

Run your first verifiable benchmark in under five minutes. This guide installs the SDK, runs a single environment, and explains the output.

Before you start

You’ll need Python 3.11+, an API key for at least one provider (Anthropic, OpenAI, Google, or OpenRouter), and ~500 MB for the package and its scientific dependencies. Everything runs locally — no Verifiable Labs account required.

Tip
Want to skip the install? Open the Hugging Face Space and run the leaderboard interactively.

Installation

Install from PyPI. This pulls in all ten environments and their classical baselines (numpy / scipy / jax) under the hood.

install.sh
$ pip install verifiable-labs

Verify the install with verifiable --version, then run verifiable list to confirm all ten environments are available, as shown below.
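
Neither command should require an API key, so this works before any provider setup:

check.sh
$ verifiable --version
$ verifiable list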

Set up an API key

The CLI reads provider keys from environment variables. The model name on the command line picks the provider automatically:

env.sh
# Pick the provider you want to evaluate
export ANTHROPIC_API_KEY=sk-ant-...   # Claude models (claude-haiku-4.5, …)
export OPENAI_API_KEY=sk-...          # GPT / o-series
export GOOGLE_API_KEY=...             # Gemini

# Or use OpenRouter to access any provider with one key
export OPENROUTER_API_KEY=sk-or-v1-...

Prefer interactive setup? verifiable login writes the keys to ~/.verifiable/config.toml with mode 0600.
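
To confirm the login flow worked, check the file's permissions; mode 0600 shows up as -rw------- in ls output:

login.sh
$ verifiable login
$ ls -l ~/.verifiable/config.toml   # expect -rw------- (mode 0600)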

Run your first benchmark

Pick an environment, a model, and a seed. The CLI samples a fresh problem instance, runs the agent, computes classical baselines, and emits a calibrated reward interval.

run.sh
$ verifiable run \
  --env sparse-fourier-recovery \
  --model claude-haiku-4.5 \
  --episodes 10 \
  --seed 42

Expected output (numbers from a real run; yours will differ slightly):

output.txt
Loading environment: sparse-fourier-recovery
Calibrating conformal interval (target 0.90)...
Running 10 episodes...

Mean reward:   0.327 ± 0.047
Coverage:      0.933 (target 0.90) ✓
Time:          0m 12s · Cost: $0.0021

Trace saved to ~/.verifiable/runs/sparse-fourier-recovery_claude-haiku-4.5_…jsonl
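
The trace itself is JSONL, one JSON record per line, so standard tools can poke at it. The glob below stands in for the elided part of the filename, and the per-episode schema is an assumption, not documented in this guide:

trace.sh
# count records, then pretty-print the last one (requires jq)
$ wc -l ~/.verifiable/runs/sparse-fourier-recovery_claude-haiku-4.5_*.jsonl
$ tail -n 1 ~/.verifiable/runs/sparse-fourier-recovery_claude-haiku-4.5_*.jsonl | jq .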

Verify the output

Every run prints a reward, a conformal interval, and a coverage check. If Coverage drops below the target, the calibration set is too small; bump --episodes and re-run, as in the example below.
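
For instance, the same run with a larger episode budget (50 is an arbitrary illustration; all other flags match the run above):

rerun.sh
$ verifiable run \
  --env sparse-fourier-recovery \
  --model claude-haiku-4.5 \
  --episodes 50 \
  --seed 42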

Reproducibility
Pinning --seed guarantees identical problem instances across machines. Seeds are versioned alongside the SDK release and never silently changed.
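
To check this yourself, repeat the exact command from above on a second machine. The sampled problem instances will match; rewards may still differ if the provider samples nondeterministically:

repro.sh
# same SDK version + same seed => identical problem instances
$ verifiable run --env sparse-fourier-recovery --model claude-haiku-4.5 --episodes 10 --seed 42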

Next steps

You’re ready to wire Verifiable Labs into your training loop or evaluation harness. Three good places to go next: