Private beta opening for teams shipping AI agents. Watch the demo

Ship AI agents only when they truly generalize.

Verifiable Labs checks whether a candidate agent really improves beyond visible tests, then returns a SHIP, BLOCK, or LIMIT decision with redacted evidence.

Built to sit above your existing agent stack.

  • Anthropic
  • Google Gemini
  • LangChain
  • Python
  • Docker

Agent Release Gate

Agent Release Gate — proof that your agent truly improves.

Every agent update can look better on visible tests. Verifiable Labs checks whether the improvement holds before it reaches users.

Connect a baseline and candidate agent workflow. Challenge the candidate beyond visible cases, diagnose where it fails, improve the update, and gate whether it should ship, be blocked, or be limited before production.

CHALLENGE BEYOND VISIBLE TESTS

Run candidate updates through hidden, out-of-distribution, adversarial, and failure-seeking scenarios before they reach users.

[Image: vlabs clean-gate terminal output]

DIAGNOSE WHAT FAILED

Find where an agent update breaks across prompts, workflows, tools, retrieval behavior, model choices, and guardrails.

IMPROVE THE CANDIDATE

Turn failure signals into better agent updates before release, instead of treating evaluation as a final scorecard.

SHIP ONLY WHAT HOLDS

Gate the improved candidate with a clear decision — SHIP, BLOCK, or LIMIT — backed by a Generalization Card.

[Image: redacted Generalization Card]

Highest public score ≠ promotion.

Most eval tools show scores. Verifiable Labs gates release decisions.

WITHOUT A RELEASE GATE

  • Visible tests pass
  • Public score improves
  • Hidden failures go unseen
  • Overfit behavior slips through
  • Teams ship on incomplete evidence

Verifiable Labs

  • Baseline and candidate are compared
  • Hidden and OOD checks run
  • Failure signals are diagnosed
  • Decision returns: SHIP, BLOCK, or LIMIT
  • Evidence is recorded for review

Every agent update should earn its release.

A release workflow for every agent update.

CHANGE

Submit a candidate prompt, workflow, model, tool-policy, retrieval, or guardrail update.

CHALLENGE

Test the update beyond visible cases with hidden, OOD, adversarial, and failure-seeking scenarios.

IMPROVE

Use the failure diagnosis to refine the candidate before release.

GATE

Ship, block, or limit the update based on whether the improvement holds.
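
As a rough illustration of the GATE step, the sketch below compares baseline and candidate scores on visible, hidden, and OOD splits and returns a decision with reason codes. Everything in it is an assumption made for illustration: the `SplitScores` and `GateResult` types, the `gate` function, the `tolerance` value, and the `hidden_regressed` and `held_out_gain_unconfirmed` reason codes are hypothetical and do not describe Verifiable Labs' actual gate policy.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class SplitScores:
    """Hypothetical per-split scores in [0, 1]; not a real vlabs schema."""
    visible: float
    hidden: float
    ood: float


@dataclass
class GateResult:
    decision: str  # "SHIP", "BLOCK", or "LIMIT"
    reasons: List[str] = field(default_factory=list)


def gate(baseline: SplitScores, candidate: SplitScores,
         tolerance: float = 0.01) -> GateResult:
    """Illustrative policy only: block on held-out regressions,
    limit when only the visible score improves, otherwise ship."""
    reasons = []
    if candidate.hidden < baseline.hidden - tolerance:
        reasons.append("hidden_regressed")
    if candidate.ood < baseline.ood - tolerance:
        reasons.append("ood_regressed")
    if reasons:
        return GateResult("BLOCK", reasons)

    visible_gain = candidate.visible - baseline.visible
    held_out_gain = min(candidate.hidden - baseline.hidden,
                        candidate.ood - baseline.ood)
    # Visible gain without a matching held-out gain is treated as unproven.
    if visible_gain > tolerance and held_out_gain <= tolerance:
        return GateResult("LIMIT", ["held_out_gain_unconfirmed"])
    return GateResult("SHIP")


baseline = SplitScores(visible=0.70, hidden=0.68, ood=0.65)
overfit = SplitScores(visible=0.85, hidden=0.66, ood=0.55)
print(gate(baseline, overfit).decision)
```

In this sketch, a candidate whose visible score rises while its hidden and OOD scores fall is blocked, which is the overfitting pattern the gate is meant to catch.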

Demo

Watch the gate block an update that overfits visible tests.

A candidate improves its visible score but fails hidden and OOD checks, so Verifiable Labs blocks the release before production.

vlabs · release-gate
$ pip install "vlabs-sdk==0.0.2"
 
$ vlabs clean-gate --old examples/demo/baseline.json \
--new examples/demo/candidate.json
candidate → SHIP
 
$ vlabs clean-gate --old examples/demo/baseline.json \
--new examples/demo/candidate_overfit.json
overfit candidate → BLOCK
reasons: ood_regressed, contamination_risk_flagged
  • Candidate scores higher on visible tests
  • Hidden and OOD checks regress
  • Gate returns BLOCK with reasons attached
  • No unsafe update reaches users
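
One way to wire the demo command into a release script is sketched below. It uses only the `--old` and `--new` flags shown in the demo, and it assumes the decision appears in standard output as one of the tokens SHIP, BLOCK, or LIMIT. That output format, the `parse_decision`, `exit_code_for`, and `run_gate` helpers, and the exit-code mapping are assumptions for illustration, not documented SDK behavior.

```python
import re
import subprocess

# Hypothetical exit-code mapping for a CI step; not defined by vlabs.
EXIT_CODES = {"SHIP": 0, "BLOCK": 1, "LIMIT": 2}


def parse_decision(output: str) -> str:
    """Return the first SHIP/BLOCK/LIMIT token found in the CLI output.

    Assumes the decision is printed as a standalone uppercase word.
    """
    match = re.search(r"\b(SHIP|BLOCK|LIMIT)\b", output)
    if match is None:
        raise ValueError("no gate decision found in output")
    return match.group(1)


def exit_code_for(decision: str) -> int:
    """Map a decision to a process exit code so CI fails unless it is SHIP."""
    return EXIT_CODES[decision]


def run_gate(old_path: str, new_path: str) -> str:
    """Run the clean-gate command from the demo and return its decision."""
    result = subprocess.run(
        ["vlabs", "clean-gate", "--old", old_path, "--new", new_path],
        capture_output=True,
        text=True,
        check=False,  # inspect the output ourselves instead of raising
    )
    return parse_decision(result.stdout)


# Parsing the two outputs shown in the demo above:
print(parse_decision("candidate → SHIP"))
print(parse_decision("overfit candidate → BLOCK\n"
                     "reasons: ood_regressed, contamination_risk_flagged"))
```

A release script could call `run_gate` with the baseline and candidate paths and pass the result to `sys.exit(exit_code_for(...))`, so anything other than SHIP stops the pipeline.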

SHIP: Updates that hold
BLOCK: Unsafe regressions
LIMIT: Controlled rollout
REVIEW: Evidence before deployment

Security

Private by default. Evidence when needed.

Verifiable Labs is designed to review agent updates without exposing private material. Public evidence is synthetic or redacted and is not a training dataset. It does not include customer data, hidden evals, gold answers, raw traces, private traps, private engine internals, secrets, or provider keys.

  • Redacted Generalization Cards
  • Approval-gated exports
  • Private evaluation boundaries
  • No customer data in public demos
  • Designed for security review
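
To make the redaction idea concrete, the sketch below strips private fields from a card before it is shared. The field names, the `PRIVATE_FIELDS` set, and the `redact_card` helper are hypothetical stand-ins for the categories listed above; the actual Generalization Card schema and redaction process are not described here.

```python
# Hypothetical field names standing in for the private categories above.
PRIVATE_FIELDS = {
    "customer_data",
    "hidden_evals",
    "gold_answers",
    "raw_traces",
    "private_traps",
    "engine_internals",
    "secrets",
    "provider_keys",
}


def redact_card(card: dict) -> dict:
    """Return a copy of the card with private fields removed at every level."""
    redacted = {}
    for key, value in card.items():
        if key in PRIVATE_FIELDS:
            continue  # drop the private field entirely
        if isinstance(value, dict):
            redacted[key] = redact_card(value)
        elif isinstance(value, list):
            redacted[key] = [
                redact_card(item) if isinstance(item, dict) else item
                for item in value
            ]
        else:
            redacted[key] = value
    return redacted


card = {
    "decision": "BLOCK",
    "reasons": ["ood_regressed"],
    "raw_traces": ["step 1", "step 2"],
    "checks": {"ood": {"status": "regressed", "gold_answers": ["x"]}},
}
print(redact_card(card))
```

The sketch removes fields by name rather than masking their values, so a redacted card carries no trace of the private content.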

Pricing

Pricing that scales with your releases.

Priced by agent release check — the gate you run each time you ship an agent update. Start free; upgrade as you ship more.

Free

$0

For developers testing agent reliability.

  • 3 agent release checks / month
  • SHIP / BLOCK / LIMIT decision
  • Generalization Card (basic)
  • vlabs-sdk + clean-gate CLI
  • Community support

Developer

$99/mo

For builders shipping early AI agents.

  • 25 agent release checks / month
  • Hidden & OOD checks
  • GitHub Action release gate
  • Generalization Card
  • Email support

Team (most popular)

$499/mo

For AI teams validating agent updates before release.

  • 150 agent release checks / month
  • Hidden, OOD & adversarial checks
  • Contamination + reward-hacking risk scoring
  • Custom gate policies & team history
  • Priority support

Enterprise

Custom

For teams needing private deployment, security, compliance, and custom evaluation contracts.

  • Unlimited / custom release checks
  • SSO, VPC / on-prem, BYOK
  • Custom gates & compliance reports
  • Private evaluation boundaries
  • SLA & security review

Every plan includes

  • Baseline vs candidate review
  • Hidden & OOD checks
  • SHIP / BLOCK / LIMIT decision
  • Redacted Generalization Cards

Certified Audit Reports and private deployment available for enterprise customers.

Not sure which plan? Talk to us

Improve what fails. Ship what holds.

Bring a baseline and candidate agent workflow. Verifiable Labs will show which updates should ship, which should be blocked, and which need limited rollout.