This guide is for teams running published Verial benchmarks against their own agent. If you are authoring a benchmark instead, see Authoring a Benchmark.

Find a Benchmark to Run

Published benchmarks have a slug and an integer version. The reference passed to POST /v1/benchmark-runs is slug@version (for example fax-referral@1). Your Solver can run:
  • Any benchmark in its own organization.
  • Any benchmark marked visibility=Public.
List benchmarks visible to your organization with an organization API key:
curl "https://api.verial.ai/benchmarks" \
  -H "Authorization: Bearer $VERIAL_API_KEY"
The list endpoint supports cursor and limit for pagination. Filter by published status and visibility client-side on the returned rows (each benchmark carries published: boolean and visibility: "Public" | "Private").
Public benchmarks are discoverable and runnable by any organization’s Solver. Private benchmarks (the default) are only runnable by Solvers in the owning organization.
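Since filtering happens client-side, the returned rows can be narrowed to runnable public benchmarks with a small jq filter. A sketch over sample rows (the response rows are made-up, shaped after the documented slug, published, and visibility fields; jq is assumed to be installed):

```shell
# Sample rows shaped like the documented benchmark fields.
benchmarks='[
  {"slug": "fax-referral",   "published": true,  "visibility": "Public"},
  {"slug": "intake-triage",  "published": false, "visibility": "Public"},
  {"slug": "internal-audit", "published": true,  "visibility": "Private"}
]'
# Keep only published, public benchmarks. In practice, pipe the output of
# the curl above through the same filter.
printf '%s' "$benchmarks" \
  | jq -r '.[] | select(.published and .visibility == "Public") | .slug'
```

With the sample rows above, only fax-referral survives the filter.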

Picking a Version

A benchmark version is an immutable snapshot. Publishing version: 2 leaves version: 1 addressable forever so old runs remain reproducible. Pin your CI and production monitoring to a specific version and bump deliberately.

Set Up a Solver

You run benchmarks as a Solver, a per-organization agent identity that hosts Solver keys (vrl_slv_*). Each Solver represents one agent (name, description, agent version). See Solver Keys for the full setup and key lifecycle.
export VERIAL_SOLVER_KEY=vrl_slv_xxx
export BENCHMARK_REF=fax-referral@1

Drive a Run

The end-to-end flow is covered step by step in the Quick Start. In summary:
  1. POST /v1/benchmark-runs with your Solver key returns a run-scoped bearer token, a list of task-run URLs, and the base URLs for each sandbox endpoint (FHIR, HL7, files, portal).
  2. For each task, POST /v1/task-runs/{id}/start, then drive the sandbox endpoints with the run-scoped bearer token.
  3. POST /v1/task-runs/{id}/complete when your agent is done. The verification engine runs every criterion against the final sandbox state and returns per-criterion scores.
  4. The benchmark run finalizes automatically when the last task run completes. Read it with GET /v1/benchmark-runs/{id}.
Only one task run can be in phase started at a time per benchmark run. Start the next task explicitly after completing the previous one.
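The sequencing rule above can be sketched as a strictly serial loop. The API calls are stubbed with echo here; in a real rollout each function would issue the corresponding curl with the run-scoped bearer token:

```shell
# Stubs standing in for POST /v1/task-runs/{id}/start and .../complete.
start_task()    { echo "start $1"; }
complete_task() { echo "complete $1"; }

# Only one task run may be in phase "started" at a time, so always
# complete the current task before starting the next.
for id in task-1 task-2 task-3; do
  start_task "$id"
  # ... drive the sandbox endpoints (FHIR, HL7, files, portal) here ...
  complete_task "$id"
done
```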

Read the Results

The completion response for a task run and the final benchmark run both carry:
  • verdict: pass if the aggregate score is >= 0.9, fail if the score is 0, otherwise partial.
  • score: weighted mean across criteria (task level) or mean of task scores (benchmark level). Range [0, 1].
  • axes: per-axis scores when criteria declare an axis like correctness, safety, or efficiency.
  • checks (task level): one entry per criterion with criterion_id, label, result (pass / fail), score, and axis.
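A worked example of the rules above, using made-up criterion weights and scores (the aggregation shown is a plain weighted mean, matching the score description):

```shell
# Read "weight score" pairs, compute the weighted mean, then apply the
# verdict thresholds: >= 0.9 pass, exactly 0 fail, anything between partial.
score_and_verdict() {
  awk '{ num += $1 * $2; den += $1 }
       END {
         s = num / den
         printf "score %.2f ", s
         print (s >= 0.9 ? "pass" : (s == 0 ? "fail" : "partial"))
       }'
}
printf '%s\n' "2 1.0" "1 0.8" "1 1.0" | score_and_verdict   # score 0.95 pass
```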
Drill into a single criterion for full evidence:
curl "https://api.verial.ai/criterion-runs/$CRITERION_RUN_ID" \
  -H "Authorization: Bearer $VERIAL_API_KEY"
See Run Results for the full traversal, and Interactions for the raw protocol evidence (FHIR calls, HL7 messages, voice transcripts, fax documents).

Scored vs Unscored Runs

Pass scored: true when creating the run, and Verial withholds details and per-field evidence from the completion response. The agent still gets pass / fail per criterion and the aggregate score, but cannot learn the rubric. Use scored: true for:
  • CI regression gates (your agent must not memorize assertions between runs).
  • Leaderboard submissions.
  • Any run whose numbers you want to compare across agent versions.
Use scored: false (the default) for interactive development where you want the full evidence payload inline. Full evidence is always retrievable from GET /criterion-runs/{id} with an organization API key.
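A run-creation body with scoring enabled can be assembled like this (scored is documented above; benchmark as the name of the reference field is an assumption; jq required):

```shell
# Build the JSON body for POST /v1/benchmark-runs with scoring enabled.
body=$(jq -n --arg ref "fax-referral@1" '{benchmark: $ref, scored: true}')
echo "$body"
# Then, roughly:
#   curl -X POST "https://api.verial.ai/v1/benchmark-runs" \
#     -H "Authorization: Bearer $VERIAL_SOLVER_KEY" \
#     -H "Content-Type: application/json" -d "$body"
```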

Comparing Runs

Each benchmark run is independent. Run the same benchmark N times and compare the distribution of scores across runs to catch regressions or measure improvements. A useful pattern:
  • Run scored: true on every pull request (see GitHub Actions).
  • Store the run IDs and scores in your CI artifacts.
  • Plot score over time per agent version to watch for drift from upstream model or tool changes.
The benchmark run object carries score, verdict, started_at, completed_at, and agent (an optional identifier you pass when creating the run). That is usually enough to build a simple regression dashboard on top.
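A minimal regression signal over the stored scores, with sample numbers standing in for values read back from GET /v1/benchmark-runs/{id}:

```shell
# Mean and worst-case score across a batch of runs of the same benchmark.
summarize_scores() {
  awk '{ sum += $1; if (NR == 1 || $1 < min) min = $1 }
       END { printf "mean %.2f min %.2f\n", sum / NR, min }'
}
printf '%s\n' 0.92 0.95 0.88 | summarize_scores   # mean 0.92 min 0.88
```

Gating a pull request on the minimum (rather than the mean) catches the occasional catastrophic rollout that an average would smooth over.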

Troubleshooting

Task run timeout. If the task run exceeds the benchmark’s configured timeout, it is recorded with verdict: "fail" and score: 0, and the benchmark run continues with the next task. Tune the benchmark timeout at authoring time.
Partial verdict. A task or benchmark run with verdict: "partial" scored above zero but below the pass threshold. Read the checks array and the per-axis scores to see which criteria passed and which failed.
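For a partial verdict, the failing criteria can be pulled straight out of the checks array. A sketch over a sample completion payload (entries shaped after the fields listed under Read the Results; jq required):

```shell
# Sample task-run completion payload with one passing and one failing check.
completion='{"checks": [
  {"criterion_id": "c1", "label": "Referral faxed", "result": "pass", "score": 1, "axis": "correctness"},
  {"criterion_id": "c2", "label": "PHI redacted",   "result": "fail", "score": 0, "axis": "safety"}
]}'
# List failed criteria as "id: label".
printf '%s' "$completion" \
  | jq -r '.checks[] | select(.result == "fail") | .criterion_id + ": " + .label'
```

With the sample payload above, only the safety check c2 is reported.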
Single-active-task-run limit. The v1 flow allows only one task run in phase started at a time. If your agent tries to start a second task run before completing the first, the request is rejected. Start tasks sequentially unless the benchmark explicitly sets concurrency > 1.
Run bearer token expiry. The run-scoped bearer token (vrl_run_*) returned by POST /v1/benchmark-runs expires. Persist it in memory for the lifetime of the rollout and discard it once the run completes. See bearer_token_expires_at on the create response.

Next Steps

Quick Start

End-to-end walkthrough with curl, start to finish.

Solver Keys

Create, rotate, and understand the scope of a Solver key.

Run Results

Read a completed run top-down with full evidence.

GitHub Actions

Wire a benchmark into CI and gate pull requests on the score.