Platform

See how the release gate works.

Connect a baseline and a candidate agent workflow. Challenge the candidate beyond the visible tests, diagnose what fails, and decide whether it should ship, be blocked, or get a limited rollout.

Book a demo Watch demo

4: Scenario suites per run
2: Firewalls on every candidate
3: Outcomes · ship / block / limit
<60s: Typical time to decision

Why it exists

Highest public score ≠ safe to ship.

A candidate can top the tests your team can see and still regress where it counts. The release gate runs the change against scenarios the candidate has never seen, and clears only the updates that hold up there.

1 · Challenge

Test the candidate beyond the visible tests

Connect a baseline and a candidate. Verifiable Labs runs the candidate across four scenario suites — including ones it has never seen — so a win on the public set has to prove it transfers.

Public, hidden, out-of-distribution and adversarial suites
Baseline vs candidate, scored suite by suite
Runs on managed inference — nothing to set up
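To make "scored suite by suite" concrete, here is a minimal sketch in Python. It assumes each suite yields a list of per-scenario scores between 0 and 1 and that a suite score is their mean; the function names, the data layout, and the toy numbers are illustrative assumptions, not the Verifiable Labs API or real results.

```python
from statistics import mean

# The four suite names come from the list above; everything else is hypothetical.
SUITES = ("public", "hidden", "ood", "adversarial")

def score_by_suite(results):
    """Map {suite: [per-scenario scores]} to {suite: mean score}."""
    return {suite: mean(results[suite]) for suite in SUITES}

def compare(baseline, candidate):
    """Return {suite: (baseline_score, candidate_score, delta)}."""
    b = score_by_suite(baseline)
    c = score_by_suite(candidate)
    return {suite: (b[suite], c[suite], c[suite] - b[suite]) for suite in SUITES}

# Toy data, made up for illustration only.
baseline = {"public": [0.7, 0.8], "hidden": [0.7, 0.7],
            "ood": [0.6, 0.8], "adversarial": [0.5, 0.7]}
candidate = {"public": [0.9, 0.9], "hidden": [0.6, 0.6],
             "ood": [0.4, 0.6], "adversarial": [0.5, 0.7]}

table = compare(baseline, candidate)
for suite, (old, new, delta) in table.items():
    print(f"{suite:12s} {old:.3f} -> {new:.3f} ({delta:+.3f})")
```

Keeping the per-suite deltas separate, rather than averaging them into one number, is what lets a public-set gain and a hidden-set loss show up side by side.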

Scenario suites challenging a candidate agent

2 · Diagnose

See exactly where it breaks

When the public score climbs but hidden or OOD scores fall, that transfer gap is the signal. The contamination firewall and anti-hack engine catch what a score comparison alone misses, such as train/eval leakage and reward hacking, before it reaches users.

Transfer gap, public → hidden → OOD
Contamination firewall flags train/eval leakage
Anti-hack engine flags reward hacking & spec gaming

vlabs · clean-gate

$ vlabs clean-gate --old baseline.json \
    --new candidate.json
→ public  0.740 → 0.910
→ hidden  0.732 → 0.611  regressed
→ ood     0.701 → 0.488  regressed
decision: BLOCK
reasons: ood_regressed, hidden_regressed
✗ public up · hidden/OOD down — not shipped
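The run above can be read as a simple rule: if a held-out suite scores lower for the candidate than for the baseline, record a reason and block. The sketch below reproduces that outcome from the numbers shown. The rule itself, the `tolerance` parameter, and the function names are assumptions for illustration; the page does not describe the gate's real thresholds, and the LIMIT outcome is left out because its condition is not stated.

```python
def regressed_reasons(old, new, tolerance=0.0):
    """List a reason for each held-out suite where the candidate scores lower."""
    reasons = []
    for suite in ("ood", "hidden"):
        if new[suite] < old[suite] - tolerance:
            reasons.append(f"{suite}_regressed")
    return reasons

def decide(old, new):
    """Hypothetical rule: BLOCK on any held-out regression, otherwise SHIP."""
    reasons = regressed_reasons(old, new)
    return ("BLOCK" if reasons else "SHIP"), reasons

# Scores copied from the example run above.
old = {"public": 0.740, "hidden": 0.732, "ood": 0.701}
new = {"public": 0.910, "hidden": 0.611, "ood": 0.488}

decision, reasons = decide(old, new)
print("decision:", decision)
print("reasons:", ", ".join(reasons))
```

Note that the public score never enters the rule: a higher public number cannot offset a held-out regression.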

3 · Improve

Turn failures into a fix list

Every blocked run comes back with a ranked diagnosis — the exact scenarios and categories where the candidate regressed — so the next iteration targets what actually failed.

Ranked failure diagnosis by suite and severity
Concrete scenarios, not just an aggregate score
Re-run the moment the candidate changes
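A ranked diagnosis can be pictured as a sort over per-scenario regressions. The sketch below assumes severity is simply the size of the score drop from baseline to candidate; the row layout, the scenario IDs, and the categories are made up for illustration and are not the product's real output format.

```python
def rank_failures(rows):
    """Rank regressed scenarios by score drop, largest first.

    rows: iterable of (suite, category, scenario_id, baseline_score, candidate_score).
    """
    regressions = [
        {"suite": suite, "category": category, "scenario": scenario_id,
         "drop": baseline_score - candidate_score}
        for suite, category, scenario_id, baseline_score, candidate_score in rows
        if candidate_score < baseline_score  # keep only scenarios that got worse
    ]
    return sorted(regressions, key=lambda r: r["drop"], reverse=True)

# Hypothetical per-scenario results.
rows = [
    ("hidden", "refunds", "h-017", 0.9, 0.2),
    ("ood", "multilingual", "o-003", 0.8, 0.5),
    ("public", "billing", "p-101", 0.6, 0.9),  # improved, so not listed
]

ranked = rank_failures(rows)
for r in ranked:
    print(f'{r["suite"]:7s} {r["category"]:13s} {r["scenario"]}  drop {r["drop"]:.2f}')
```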

4 · Gate

Ship only what holds

The gate returns one decision (SHIP, BLOCK, or LIMIT) with machine-readable reasons attached, plus a redacted evidence record for every run. Wire it into CI and it blocks the merge automatically when the decision is BLOCK.

SHIP / BLOCK / LIMIT with machine-readable reasons
Redacted, reviewable evidence on every run
Runs as a status check on every pull request
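One way to picture the status check: a CI step reads the gate's decision and turns it into a process exit code, since most CI systems fail a job on a non-zero exit. The JSON shape with a `decision` field and the `fail_on_limit` switch are assumptions for this sketch, not a documented record format.

```python
import json

def exit_code_for(record_json, fail_on_limit=False):
    """Map a gate decision to a CI exit code: 0 passes the check, 1 fails it."""
    decision = json.loads(record_json)["decision"]
    if decision == "SHIP":
        return 0
    if decision == "LIMIT":
        # Whether a limited rollout may merge is a team policy choice.
        return 1 if fail_on_limit else 0
    if decision == "BLOCK":
        return 1
    raise ValueError(f"unknown decision: {decision!r}")

# Hypothetical record, mirroring the decision and reasons shown on this page.
record = '{"decision": "BLOCK", "reasons": ["ood_regressed", "hidden_regressed"]}'
code = exit_code_for(record)
print("exit code:", code)
```

In a real pipeline the last step would call `sys.exit(code)` so the pull request check reflects the decision.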

Generalization Card

decision: BLOCK
reasons: ood_regressed · hidden_regressed
public: 0.740 → 0.910
hidden: 0.732 → 0.611
ood: 0.701 → 0.488
record: redacted · reviewable

✗ candidate not promoted

Inside the gate

The checks behind every decision

Three independent checks combine into one ship/block/limit call, each one auditable in the evidence record. A formal-scope note states what is and is not machine-verified.

Transfer-gap analysis

Quantifies how much of a candidate's public-set gain actually carries to hidden and out-of-distribution scenarios.
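One simple way to express "how much of the gain carries" is a ratio: the change on a held-out suite divided by the gain on the public set. A value near 1 means the gain transferred, a value near 0 means it did not, and a negative value means the held-out suite got worse while the public score rose. This ratio and the function name are an illustrative assumption, not the metric Verifiable Labs necessarily computes; the scores are the ones shown in the example run on this page.

```python
def transfer_ratio(old, new, suite):
    """Held-out score change as a fraction of the public-set gain.

    Returns None when there is no public gain to compare against.
    """
    public_gain = new["public"] - old["public"]
    if public_gain <= 0:
        return None
    return (new[suite] - old[suite]) / public_gain

# Scores from the example run on this page.
old = {"public": 0.740, "hidden": 0.732, "ood": 0.701}
new = {"public": 0.910, "hidden": 0.611, "ood": 0.488}

hidden_ratio = transfer_ratio(old, new, "hidden")  # about -0.71
ood_ratio = transfer_ratio(old, new, "ood")        # about -1.25
print(f"hidden: {hidden_ratio:.2f}  ood: {ood_ratio:.2f}")
```

Both ratios are negative here: for every point gained on the public set, the candidate lost ground on the suites it had not seen.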

Contamination firewall

Flags possible train/eval overlap or public-score leakage that would otherwise inflate the result.
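A common heuristic for spotting train/eval overlap is to measure how many word n-grams of an evaluation item also appear in the training text. The sketch below shows that generic technique only; it is not a description of how the contamination firewall works, and the n-gram size, the whitespace tokenization, and the sample sentences are all assumptions.

```python
def ngrams(text, n):
    """Set of word n-grams in a text, lowercased and split on whitespace."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def overlap_fraction(eval_text, train_ngrams, n):
    """Fraction of the eval item's n-grams that also occur in training text."""
    grams = ngrams(eval_text, n)
    if not grams:
        return 0.0
    return len(grams & train_ngrams) / len(grams)

# Made-up texts for illustration.
train_text = "the agent must refund the order within two days"
train_grams = ngrams(train_text, n=3)

leaky = overlap_fraction("the agent must refund the order today", train_grams, n=3)
clean = overlap_fraction("summarize the quarterly report for finance", train_grams, n=3)
print(f"leaky item: {leaky:.2f}  clean item: {clean:.2f}")
```

An item with a high overlap fraction would be a candidate for review, since a model may have memorized it rather than solved it.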

Anti-hack engine

Detects reward hacking, spec gaming, and shortcut exploitation that pass the letter of a test but not its intent.

Formal scope

Selected mathematical properties behind the contamination-resistant promotion gate are machine-verified in Lean 4, with the implementation property-tested against the spec. This does not mean the entire product, API, agent, or model is formally verified.
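For a sense of what a machine-checked property looks like, here is a toy Lean 4 sketch. The `Decision` type, the `Scores` structure, and the `gate` function are hypothetical stand-ins invented for this example; they are not the product's definitions, and the theorem is not one of the properties Verifiable Labs has verified.

```lean
-- Toy model: scores are natural numbers (for example, points out of 1000).
inductive Decision where
  | ship | block | limit
  deriving DecidableEq, Repr

structure Scores where
  hidden : Nat
  ood : Nat

-- Hypothetical rule: block when either held-out suite regresses.
def gate (base cand : Scores) : Decision :=
  if cand.hidden < base.hidden ∨ cand.ood < base.ood then .block else .ship

-- Property: a candidate that regresses on the hidden suite is never shipped.
theorem regress_not_ship (base cand : Scores)
    (h : cand.hidden < base.hidden) : gate base cand ≠ .ship := by
  unfold gate
  simp [h]
```

The point of a proof like this is narrow: it holds for the stated rule under the stated definitions, which is why the scope note above is explicit about what is and is not covered.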

Improve what fails. Ship what holds.

Bring a baseline and candidate agent workflow. Verifiable Labs will show which updates should ship, which should be blocked, and which need limited rollout.
