Find a Benchmark to Run
Published benchmarks have a slug and an integer version. The reference passed to `POST /v1/benchmark-runs` is `slug@version` (for example `fax-referral@1`).
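A reference can be split back into its parts with a couple of lines of code. This helper is illustrative, not part of any SDK:

```python
def parse_reference(ref: str) -> tuple[str, int]:
    """Split a benchmark reference like "fax-referral@1" into (slug, version)."""
    slug, _, version = ref.rpartition("@")
    if not slug or not version.isdigit():
        raise ValueError(f"expected slug@version, got {ref!r}")
    return slug, int(version)
```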
Your Solver can run:
- Any benchmark in its own organization.
- Any benchmark marked `visibility=Public`.
The list endpoint supports `cursor` and `limit` for pagination. Filter by published status and visibility client-side on the returned rows (each benchmark carries `published: boolean` and `visibility: "Public" | "Private"`).
Public benchmarks are discoverable and runnable by any organization’s Solver.
Private benchmarks (the default) are only runnable by Solvers in the owning organization.
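The client-side filter described above can be sketched as follows. This is illustrative: the `organization_id` field name on each row is an assumption (only `published` and `visibility` are documented here):

```python
def runnable_benchmarks(rows, my_org_id):
    """Keep benchmarks this organization's Solver can run:
    published, and either owned by us or visibility=Public.
    `organization_id` is an assumed field name."""
    return [
        row for row in rows
        if row["published"]
        and (row["visibility"] == "Public" or row["organization_id"] == my_org_id)
    ]
```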
Picking a Version
A benchmark version is an immutable snapshot. Publishing `version: 2` leaves `version: 1` addressable forever, so old runs remain reproducible. Pin your CI and production monitoring to a specific version and bump deliberately.
Set Up a Solver
You run benchmarks as a Solver, a per-organization agent identity that holds Solver keys (`vrl_slv_*`). Each Solver represents one agent (name, description, agent version). See Solver Keys for the full setup and key lifecycle.
Drive a Run
The end-to-end flow is covered step by step in the Quick Start. In summary:

- `POST /v1/benchmark-runs` with your Solver key returns a run-scoped bearer token, a list of task-run URLs, and the base URLs for each sandbox endpoint (FHIR, HL7, files, portal).
- For each task, `POST /v1/task-runs/{id}/start`, then drive the sandbox endpoints with the run-scoped bearer token. `POST /v1/task-runs/{id}/complete` when your agent is done. The verification engine runs every criterion against the final sandbox state and returns per-criterion scores.
- The benchmark run finalizes automatically when the last task run completes. Read it with `GET /v1/benchmark-runs/{id}`.
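The sequential flow above can be sketched as a small driver loop. This is illustrative only: `client` stands in for whatever HTTP client you use (anything with a `post(url)` method), and the agent's sandbox work is elided:

```python
def drive_run(client, task_run_urls):
    """Drive task runs sequentially: start, act, complete, then move on.

    Only one task run may be in phase "started" at a time, so each task
    is completed before the next one starts.
    """
    results = []
    for url in task_run_urls:
        client.post(f"{url}/start")
        # ... agent drives the sandbox endpoints with the run-scoped token ...
        results.append(client.post(f"{url}/complete"))  # per-criterion scores
    return results
```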
Only one task run can be in phase `started` at a time per benchmark run. Start the next task explicitly after completing the previous one.

Read the Results
The completion response for a task run and the final benchmark run both carry:

- `verdict`: `pass` if the aggregate score is `>= 0.9`, `fail` if the score is `0`, otherwise `partial`.
- `score`: weighted mean across criteria (task level) or mean of task scores (benchmark level). Range `[0, 1]`.
- `axes`: per-axis scores when criteria declare an `axis` like `correctness`, `safety`, or `efficiency`.
- `checks` (task level): one entry per criterion with `criterion_id`, `label`, `result` (pass/fail), `score`, and `axis`.
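The verdict and task-level scoring rules can be expressed directly. One caveat: the `weight` field name on a check is an assumption here (the scoring is described as a weighted mean, but the field that carries the weight is not named above):

```python
def task_score(checks):
    """Weighted mean of criterion scores; weight defaults to 1.
    The "weight" key is an assumed field name."""
    total = sum(c.get("weight", 1) for c in checks)
    return sum(c["score"] * c.get("weight", 1) for c in checks) / total

def verdict(score, threshold=0.9):
    """pass if score >= 0.9, fail if exactly 0, otherwise partial."""
    if score >= threshold:
        return "pass"
    if score == 0:
        return "fail"
    return "partial"
```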
Scored vs Unscored Runs
Pass `scored: true` when creating the run and Verial withholds criterion details and per-field evidence from the completion response. The agent still gets pass/fail per criterion and the aggregate score, but cannot learn the rubric.
Use scored: true for:
- CI regression gates (your agent must not memorize assertions between runs).
- Leaderboard submissions.
- Any run whose numbers you want to compare across agent versions.
Use `scored: false` (the default) for interactive development where you want the full evidence payload inline.
Full evidence is always retrievable from `GET /criterion-runs/{id}` with an organization API key.
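A create-run body for a scored CI run might look like the following sketch. Only the `slug@version` reference format and the `scored` flag are documented above, so the `benchmark` field name is an assumption:

```python
# Hypothetical request body for POST /v1/benchmark-runs.
# "benchmark" is an assumed field name; the reference format and the
# scored flag are as documented.
create_run_body = {
    "benchmark": "fax-referral@1",  # slug@version, pinned for reproducibility
    "scored": True,                 # withhold rubric evidence from the response
}
```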
Comparing Runs
Each benchmark run is independent. Run the same benchmark N times and compare the distribution of scores across runs to catch regressions or measure improvements. A useful pattern:
- Run `scored: true` on every pull request (see GitHub Actions).
- Store the run IDs and scores in your CI artifacts.
- Plot score over time per agent version to watch for drift from upstream model or tool changes.
Each completed run exposes `score`, `verdict`, `started_at`, `completed_at`, and `agent` (an optional identifier you pass when creating the run). That is usually enough to build a simple regression dashboard on top.
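A minimal regression gate over two batches of run scores might look like this. The drop threshold is illustrative; tune it to your benchmark's observed run-to-run variance:

```python
from statistics import mean

def regression_check(baseline_scores, candidate_scores, max_drop=0.05):
    """Flag a regression when the candidate's mean score falls more than
    `max_drop` below the baseline mean. Threshold is illustrative."""
    baseline, candidate = mean(baseline_scores), mean(candidate_scores)
    return {
        "baseline_mean": baseline,
        "candidate_mean": candidate,
        "regressed": (baseline - candidate) > max_drop,
    }
```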
Troubleshooting
**Task run timeout.** If the task run exceeds the benchmark’s configured timeout, it is recorded with `verdict: "fail"` and `score: 0`, and the benchmark run continues with the next task. Tune the benchmark timeout at authoring time.

**Partial verdict.** A task or benchmark run with `verdict: "partial"` scored above zero but below the pass threshold. Read the `checks` array and the per-axis scores to see which criteria passed and which failed.

**Single-active-task-run limit.** The v1 flow allows only one task run in phase `started` at a time. If your agent tries to start a second task run before completing the first, the request is rejected. Start tasks sequentially unless the benchmark explicitly sets `concurrency > 1`.

**Run bearer token expiry.** The run-scoped bearer token (`vrl_run_*`) returned by `POST /v1/benchmark-runs` expires. Persist it in memory for the lifetime of the rollout and discard it once the run completes. See `bearer_token_expires_at` on the create response.

Next Steps
- Quick Start: end-to-end walkthrough with curl, start to finish.
- Solver Keys: create, rotate, and understand the scope of a Solver key.
- Run Results: read a completed run top-down with full evidence.
- GitHub Actions: wire a benchmark into CI and gate pull requests on the score.