Ship AI agents
only when they truly
generalize.

Verifiable Labs checks whether a candidate agent really improves beyond visible tests, then returns a SHIP, BLOCK, or LIMIT decision with redacted evidence.

Run the 5-minute demo View public evidence

Built to sit above your existing agent stack.

Agent Release Gate

Agent Release Gate — proof that your agent truly improves.

Every agent update can look better on visible tests. Verifiable Labs checks whether the improvement holds before it reaches users.

Connect a baseline and candidate agent workflow. Challenge the candidate beyond visible cases, diagnose where it fails, improve the update, and gate whether it should ship, be blocked, or be limited before production.

CHALLENGE BEYOND VISIBLE TESTS

Run candidate updates through hidden, out-of-distribution, adversarial, and failure-seeking scenarios before they reach users.

DIAGNOSE WHAT FAILED

Find where an agent update breaks across prompts, workflows, tools, retrieval behavior, model choices, and guardrails.

IMPROVE THE CANDIDATE

Turn failure signals into better agent updates before release, instead of treating evaluation as a final scorecard.

SHIP ONLY WHAT HOLDS

Gate the improved candidate with a clear decision — SHIP, BLOCK, or LIMIT — backed by a Generalization Card.

Highest public score ≠ promotion.

Most eval tools show scores. Verifiable Labs gates release decisions.

WITHOUT A RELEASE GATE

Visible tests pass
Public score improves
Hidden failures go unseen
Overfit behavior slips through
Teams ship on incomplete evidence

Baseline and candidate are compared
Hidden and OOD checks run
Failure signals are diagnosed
Decision returns: SHIP, BLOCK, or LIMIT
Evidence is recorded for review

Every agent update should earn its release.

A release workflow for every agent update.

CHANGE

Submit a candidate prompt, workflow, model, tool-policy, retrieval, or guardrail update.

CHALLENGE

Test the update beyond visible cases with hidden, OOD, adversarial, and failure-seeking scenarios.

IMPROVE

Use the failure diagnosis to refine the candidate before release.

GATE

Ship, block, or limit the update based on whether the improvement holds.

Built for teams shipping AI agents

Agent product teams

Review prompt changes, model swaps, workflow updates, and guardrail changes before they affect users.

Watch demo →

AI platform teams

Add a release gate between experimentation, evaluation, and production deployment.

Book a demo →

Enterprise AI teams

Create a repeatable review process for agent changes without exposing private data or evaluation content.

Talk to us →

Demo

Watch the gate block an update that overfits visible tests.

A candidate improves its visible score, but fails hidden/OOD checks. Verifiable Labs blocks the release before production.

vlabs · release-gate

$ pip install "vlabs-sdk==0.0.2"
 
$ vlabs clean-gate --old examples/demo/baseline.json \
  --new examples/demo/candidate.json
candidate → SHIP
 
$ vlabs clean-gate --old examples/demo/baseline.json \
  --new examples/demo/candidate_overfit.json
overfit candidate → BLOCK
reasons: ood_regressed, contamination_risk_flagged

Candidate scores higher on visible tests
Hidden and OOD checks regress
Gate returns BLOCK with reasons attached
No unsafe update reaches users

Book a demo See how it works

SHIPUpdates that hold

BLOCKUnsafe regressions

LIMITControlled rollout

REVIEWEvidence before deployment

Security

Private by default. Evidence when needed.

Verifiable Labs is designed to review agent updates without exposing what's private. Public evidence is synthetic/redacted and is not a training dataset — it does not include customer data, hidden evals, gold answers, raw traces, private traps, private engine internals, secrets, or provider keys.

Redacted Generalization Cards
Approval-gated exports
Private evaluation boundaries
No customer data in public demos
Designed for security review

Pricing

Pricing that scales with your releases.

Priced by agent release check — the gate you run each time you ship an agent update. Start free; upgrade as you ship more.

Free

For developers testing agent reliability.

3 agent release checks / month
SHIP / BLOCK / LIMIT decision
Generalization Card (basic)
vlabs-sdk + clean-gate CLI
community support

Run the 5-minute demo

Developer

$99/mo

For builders shipping early AI agents.

25 agent release checks / month
Hidden & OOD checks
GitHub Action release gate
Generalization Card
email support

Get started

Team

$499/mo

For AI teams validating agent updates before release.

150 agent release checks / month
Hidden, OOD & adversarial checks
Contamination + reward-hacking risk scoring
Custom gate policies & team history
priority support

Get started

Enterprise

Custom

For teams needing private deployment, security, compliance, and custom evaluation contracts.

Unlimited / custom release checks
SSO, VPC / on-prem, BYOK
Custom gates & compliance reports
Private evaluation boundaries
SLA & security review

Contact sales

Every plan includes

Baseline vs candidate reviewHidden & OOD checksShip / block / limit decisionRedacted Generalization Cards

Certified Audit Reports and private deployment available for enterprise customers.

Not sure which plan? Talk to us

Improve what fails. Ship what holds.

Bring a baseline and candidate agent workflow. Verifiable Labs will show which updates should ship, which should be blocked, and which need limited rollout.

Book a demo Watch demo

Ship AI agentsonly when they trulygeneralize.