Find a Benchmark to Run
Published benchmarks have a slug and an integer version. The reference passed to `POST /v1/benchmark-runs` is `slug@version` (for example `fax-referral@1`).
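A reference can be split back into its parts with a couple of lines of code. This helper is illustrative, not part of any SDK:

```python
def parse_reference(ref: str) -> tuple[str, int]:
    """Split a benchmark reference like "fax-referral@1" into (slug, version)."""
    slug, _, version = ref.rpartition("@")
    if not slug or not version.isdigit():
        raise ValueError(f"expected slug@version, got {ref!r}")
    return slug, int(version)
```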
Your Solver can run:
- Any benchmark in its own organization.
- Any benchmark marked `visibility=Public`.
The list endpoint supports `cursor` and `limit` for pagination. Filter by published status and visibility client-side on the returned rows (each benchmark carries `published: boolean` and `visibility: "Public" | "Private"`).
Public benchmarks are discoverable and runnable by any organization’s Solver.
Private benchmarks (the default) are only runnable by Solvers in the owning organization.
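The client-side filter described above can be sketched as follows. This is illustrative: the `organization_id` field name on each row is an assumption (only `published` and `visibility` are documented here):

```python
def runnable_benchmarks(rows, my_org_id):
    """Keep benchmarks this organization's Solver can run:
    published, and either owned by us or visibility=Public.
    `organization_id` is an assumed field name."""
    return [
        row for row in rows
        if row["published"]
        and (row["visibility"] == "Public" or row["organization_id"] == my_org_id)
    ]
```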
Picking a Version
A benchmark version is an immutable snapshot. Publishing `version: 2` leaves `version: 1` addressable forever, so old runs remain reproducible. Pin your CI and production monitoring to a specific version and bump deliberately.
Set Up a Solver
You run benchmarks as a Solver, a per-organization agent identity that holds Solver keys (`vrl_slv_*`). Each Solver represents one agent (name, description, agent version). See Solver Keys for the full setup and key lifecycle.
Drive a Run
The end-to-end flow is covered step by step in the Quick Start. In summary:

- `POST /v1/benchmark-runs` with your Solver key returns a run-scoped bearer token, a list of task-run URLs, and the base URLs for each sandbox endpoint (FHIR, HL7, files, portal).
- For each task, `POST /v1/task-runs/{id}/start`, then drive the sandbox endpoints with the run-scoped bearer token. `POST /v1/task-runs/{id}/complete` when your agent is done. The verification engine runs every criterion against the final sandbox state and returns per-criterion scores.
- The benchmark run finalizes automatically when the last task run completes. Read it with `GET /v1/benchmark-runs/{id}`.
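The sequential flow above can be sketched as a small driver loop. This is illustrative only: `client` stands in for whatever HTTP client you use (anything with a `post(url)` method), and the agent's sandbox work is elided:

```python
def drive_run(client, task_run_urls):
    """Drive task runs sequentially: start, act, complete, then move on.

    Only one task run may be in phase "started" at a time, so each task
    is completed before the next one starts.
    """
    results = []
    for url in task_run_urls:
        client.post(f"{url}/start")
        # ... agent drives the sandbox endpoints with the run-scoped token ...
        results.append(client.post(f"{url}/complete"))  # per-criterion scores
    return results
```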
Only one task run can be in phase `started` at a time per benchmark run. Start the next task explicitly after completing the previous one.

Read the Results
The completion response for a task run and the final benchmark run both carry:

- `verdict`: `pass` if the aggregate score is `>= 0.9`, `fail` if the score is `0`, otherwise `partial`.
- `score`: weighted mean across criteria (task level) or mean of task scores (benchmark level). Range `[0, 1]`.
- `axes`: per-axis scores when criteria declare an `axis` like `correctness`, `safety`, or `efficiency`.
- `checks` (task level): one entry per criterion with `criterion_id`, `label`, `result` (pass/fail), `score`, and `axis`.
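The verdict and task-level scoring rules can be expressed directly. One caveat: the `weight` field name on a check is an assumption here (the scoring is described as a weighted mean, but the field that carries the weight is not named above):

```python
def task_score(checks):
    """Weighted mean of criterion scores; weight defaults to 1.
    The "weight" key is an assumed field name."""
    total = sum(c.get("weight", 1) for c in checks)
    return sum(c["score"] * c.get("weight", 1) for c in checks) / total

def verdict(score, threshold=0.9):
    """pass if score >= 0.9, fail if exactly 0, otherwise partial."""
    if score >= threshold:
        return "pass"
    if score == 0:
        return "fail"
    return "partial"
```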
Scored vs Unscored Runs
Pass `scored: true` when creating the run and Verial withholds criterion details and per-field evidence from the completion response. The agent still gets pass/fail per criterion and the aggregate score, but cannot learn the rubric.
Use scored: true for:
- CI regression gates (your agent must not memorize assertions between runs).
- Leaderboard submissions.
- Any run whose numbers you want to compare across agent versions.
Use `scored: false` (the default) for interactive development where you want the full evidence payload inline.
Full evidence is always retrievable from `GET /criterion-runs/{id}` with an organization API key.
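A create-run body for a scored CI run might look like the following sketch. Only the `slug@version` reference format and the `scored` flag are documented above, so the `benchmark` field name is an assumption:

```python
# Hypothetical request body for POST /v1/benchmark-runs.
# "benchmark" is an assumed field name; the reference format and the
# scored flag are as documented.
create_run_body = {
    "benchmark": "fax-referral@1",  # slug@version, pinned for reproducibility
    "scored": True,                 # withhold rubric evidence from the response
}
```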
Comparing Runs
Each benchmark run is independent. Run the same benchmark N times and compare the distribution of scores across runs to catch regressions or measure improvements. A useful pattern:
- Run `scored: true` on every pull request (see GitHub Actions).
- Store the run IDs and scores in your CI artifacts.
- Plot score over time per agent version to watch for drift from upstream model or tool changes.
Each completed run exposes `score`, `verdict`, `started_at`, `completed_at`, and `agent` (an optional identifier you pass when creating the run). That is usually enough to build a simple regression dashboard on top.
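A minimal regression gate over two batches of run scores might look like this. The drop threshold is illustrative; tune it to your benchmark's observed run-to-run variance:

```python
from statistics import mean

def regression_check(baseline_scores, candidate_scores, max_drop=0.05):
    """Flag a regression when the candidate's mean score falls more than
    `max_drop` below the baseline mean. Threshold is illustrative."""
    baseline, candidate = mean(baseline_scores), mean(candidate_scores)
    return {
        "baseline_mean": baseline,
        "candidate_mean": candidate,
        "regressed": (baseline - candidate) > max_drop,
    }
```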
Troubleshooting
**Task run timeout.** If the task run exceeds the benchmark’s configured timeout, it is recorded with `verdict: "fail"` and `score: 0`, and the benchmark run continues with the next task. Tune the benchmark timeout at authoring time.

**Partial verdict.** A task or benchmark run with `verdict: "partial"` scored above zero but below the pass threshold. Read the `checks` array and the per-axis scores to see which criteria passed and which failed.

**Single-active-task-run limit.** The v1 flow allows only one task run in phase `started` at a time. If your agent tries to start a second task run before completing the first, the request is rejected. Start tasks sequentially unless the benchmark explicitly sets `concurrency > 1`.

**Run bearer token expiry.** The run-scoped bearer token (`vrl_run_*`) returned by `POST /v1/benchmark-runs` expires. Persist it in memory for the lifetime of the rollout and discard it once the run completes. See `bearer_token_expires_at` on the create response.

Next Steps
- Quick Start: end-to-end walkthrough with curl, start to finish.
- Solver Keys: create, rotate, and understand the scope of a Solver key.
- Run Results: read a completed run top-down with full evidence.
- GitHub Actions: wire a benchmark into CI and gate pull requests on the score.