Private beta opening for teams shipping AI agents.·Watch the demo

Platform

See how the release gate works.

Connect a baseline and candidate agent workflow. Challenge it beyond visible tests, diagnose what fails, and gate whether it should ship, be blocked, or be limited.

4
Scenario suites per run
2
Firewalls on every candidate
3
Outcomes · ship / block / limit
<60s
Typical time to decision

Why it exists

Highest public score ≠ safe to ship.

A candidate can top the tests your team can see and still regress where it counts. The release gate runs the change against what it can't see — and only clears the updates that hold.

1 · Challenge

Test the candidate beyond the visible tests

Connect a baseline and a candidate. Verifiable Labs runs the candidate across four scenario suites — including ones it has never seen — so a win on the public set has to prove it transfers.

  • Public, hidden, out-of-distribution and adversarial suites
  • Baseline vs candidate, scored suite by suite
  • Runs on managed inference — nothing to set up
Scenario suites challenging a candidate agent

2 · Diagnose

See exactly where it breaks

When the public score climbs but hidden or OOD scores fall, that transfer gap is the signal. The contamination firewall and anti-hack engine catch the rest before it reaches users.

  • Transfer gap, public → hidden → OOD
  • Contamination firewall flags train/eval leakage
  • Anti-hack engine flags reward hacking & spec gaming
vlabs · clean-gate
$ vlabs clean-gate --old baseline.json \
--new candidate.json
→ public 0.740 → 0.910
→ hidden 0.732 → 0.611 regressed
→ ood 0.701 → 0.488 regressed
decision: BLOCK
reasons: ood_regressed, hidden_regressed
✗ public up · hidden/OOD down — not shipped

3 · Improve

Turn failures into a fix list

Every blocked run comes back with a ranked diagnosis — the exact scenarios and categories where the candidate regressed — so the next iteration targets what actually failed.

  • Ranked failure diagnosis by suite and severity
  • Concrete scenarios, not just an aggregate score
  • Re-run the moment the candidate changes
Failure diagnosis for a candidate agent

4 · Gate

Ship only what holds

The gate returns one decision — SHIP, BLOCK, or LIMIT — with machine reasons attached, and a redacted evidence record for every run. Wire it into CI and it blocks the merge automatically.

  • SHIP / BLOCK / LIMIT with machine reasons
  • Redacted, reviewable evidence on every run
  • Runs as a status check on every pull request

Generalization Card

decision
BLOCK
reasons
ood_regressed · hidden_regressed
public
0.740 → 0.910
hidden
0.732 → 0.611
ood
0.701 → 0.488
record
redacted · reviewable

✗ candidate not promoted

Inside the gate

The checks behind every decision

Four independent signals combine into one ship/block/limit call — each one auditable in the evidence record.

Transfer-gap analysis

Quantifies how much of a candidate's public-set gain actually carries to hidden and out-of-distribution scenarios.

Contamination firewall

Flags possible train/eval overlap or public-score leakage that would otherwise inflate the result.

Anti-hack engine

Detects reward hacking, spec gaming, and shortcut exploitation that pass the letter of a test but not its intent.

Formal scope

Selected mathematical properties behind the contamination-resistant promotion gate are machine-verified in Lean 4, with the implementation property-tested against the spec. This does not mean the entire product, API, agent, or model is formally verified.

Improve what fails. Ship what holds.

Bring a baseline and candidate agent workflow. Verifiable Labs will show which updates should ship, which should be blocked, and which need limited rollout.