Benchmark Runs

A Benchmark Run is a single execution of a Benchmark. When you create a benchmark run, Verial provisions a Playground for each task, your agent drives the rollouts through sandbox endpoints, and the verification engine scores each task, producing Criterion Runs plus an aggregate score and verdict for the run. There are two entry points:

Internal (`/benchmark-runs`): Bearer API-key auth. Used by Verial tooling.
Public v1 (`/v1/benchmark-runs`): Solver-key auth. Used by external developers running a published benchmark from their own organization’s Solver. This is the recommended path; see the Quick Start and Authentication.

Internal Endpoints

Method	Endpoint	Description
`GET`	`/benchmark-runs?benchmark_id={benchmark_id}`	List benchmark runs
`POST`	`/benchmark-runs`	Create a benchmark run (body `{ "benchmark_id": "..." }`)
`GET`	`/benchmark-runs/{id}`	Get run details (includes task runs and criterion runs)
`POST`	`/benchmark-runs/{id}/complete`	Mark complete
`POST`	`/benchmark-runs/{id}/cancel`	Cancel
`POST`	`/benchmark-runs/{id}/publish`	Publish to leaderboard
`POST`	`/benchmark-runs/{id}/unpublish`	Unpublish from leaderboard
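
The paths in the table above follow one pattern: a run ID, optionally followed by a lifecycle action. A minimal sketch of that pattern is shown here; the helper names `listRunsPath` and `runPath` are hypothetical and not part of any Verial SDK, and authentication is left out.

```typescript
// The lifecycle actions listed in the internal endpoint table.
type RunAction = "complete" | "cancel" | "publish" | "unpublish";

// Path for listing the runs of one benchmark (used with GET).
function listRunsPath(benchmarkId: string): string {
  return `/benchmark-runs?benchmark_id=${encodeURIComponent(benchmarkId)}`;
}

// Path for fetching one run (GET) or, when an action is given,
// for the corresponding lifecycle call (POST).
function runPath(runId: string, action?: RunAction): string {
  const base = `/benchmark-runs/${encodeURIComponent(runId)}`;
  return action ? `${base}/${action}` : base;
}

console.log(listRunsPath("bench_abc123"));
console.log(runPath("br_abc123", "publish"));
```

Typing the action as a union keeps a misspelled action from compiling, which is the main reason to wrap such simple string building in a helper.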

Public v1 Endpoints

Method	Endpoint	Description
`POST`	`/v1/benchmark-runs`	Create a run. Body: `{ "benchmark": "slug@version", "scored": boolean }`. Returns bearer token, task-run URLs, and sandbox endpoint paths
`GET`	`/v1/benchmark-runs/{id}`	Get run summary (requires the run’s bearer token)
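
As a rough illustration of the create call, the sketch below assembles the request body in the `slug@version` form described above. The function `planCreateRun` and the `PlannedRequest` type are hypothetical, the rule that a slug may not contain `@` is an assumption made only so the combined string stays unambiguous, and the Solver-key headers are deliberately omitted (see Authentication for those).

```typescript
// Body shape taken from the endpoint table: { "benchmark": "slug@version", "scored": boolean }.
interface CreateRunBody {
  benchmark: string;
  scored: boolean;
}

// Hypothetical description of the request, without auth headers.
interface PlannedRequest {
  method: "POST";
  path: string;
  body: string;
}

function planCreateRun(slug: string, version: string, scored: boolean): PlannedRequest {
  // Assumption: reject inputs that would make "slug@version" ambiguous.
  if (slug === "" || version === "" || slug.includes("@")) {
    throw new Error("slug and version must be non-empty, and slug must not contain '@'");
  }
  const payload: CreateRunBody = { benchmark: `${slug}@${version}`, scored };
  return { method: "POST", path: "/v1/benchmark-runs", body: JSON.stringify(payload) };
}

const planned = planCreateRun("fax-referral", "1", false);
console.log(planned.method, planned.path, planned.body);
```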

Benchmark Run Object

Field	Type	Description
`id`	string	Unique identifier
`benchmark_id`	string	Parent Benchmark
`status`	string	`active`, `completed`, `cancelled`, `failed`
`phase`	string	`created`, `started`, `completed`
`scored`	boolean	`true` if evidence is withheld from the agent during completion responses
`verdict`	string | null	`pass`, `partial`, `fail`
`score`	number | null	Aggregate score (mean of task scores)
`agent`	string | null	Optional agent identifier
`started_at`	datetime | null	Time the run started
`completed_at`	datetime | null	Time the run completed
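
For client code, the fields above could be modeled roughly as follows. This is a sketch rather than an official type definition: the datetime fields are assumed to arrive as ISO 8601 strings, and the helpers `meanScore` and `isFinished` are hypothetical. `meanScore` only illustrates the "mean of task scores" rule; it does not reproduce how a verdict is derived, which this page does not specify.

```typescript
type RunStatus = "active" | "completed" | "cancelled" | "failed";
type RunPhase = "created" | "started" | "completed";
type Verdict = "pass" | "partial" | "fail";

interface BenchmarkRun {
  id: string;
  benchmark_id: string;
  status: RunStatus;
  phase: RunPhase;
  scored: boolean;
  verdict: Verdict | null;
  score: number | null;
  agent: string | null;
  started_at: string | null; // assumed ISO 8601 string
  completed_at: string | null; // assumed ISO 8601 string
}

// Arithmetic mean of task scores; null when there are no scores yet.
function meanScore(taskScores: number[]): number | null {
  if (taskScores.length === 0) return null;
  return taskScores.reduce((sum, s) => sum + s, 0) / taskScores.length;
}

// A run is treated as finished once its status is anything other than "active".
function isFinished(run: BenchmarkRun): boolean {
  return run.status !== "active";
}
```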

v1 Create Response

`POST /v1/benchmark-runs` returns everything the agent needs to drive the rollouts:

{
  "benchmark_run_id": "br_abc123",
  "benchmark": { "slug": "fax-referral", "version": "1", "name": "Fax referral intake" },
  "scored": false,
  "phase": "created",
  "bearer_token": "vrl_run_...",
  "bearer_token_expires_at": "2026-04-21T17:00:00.000Z",
  "endpoints": {
    "files_inbox": "/v1/benchmark-runs/br_abc123/files/inbox",
    "hl7_outbound": "/v1/benchmark-runs/br_abc123/hl7/outbound"
  },
  "task_runs": [
    {
      "id": "tr_1",
      "task_id": "task_1",
      "name": "Process referral #1",
      "phase": "created",
      "start_url": "/v1/task-runs/tr_1/start",
      "complete_url": "/v1/task-runs/tr_1/complete"
    }
  ]
}

Use the `bearer_token` to authenticate calls to all `/v1/task-runs/*` and `/v1/benchmark-runs/{id}/*` protocol endpoints (FHIR proxy, HL7, files, portal).
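
One way to consume the create response is sketched below. It assumes the token is sent in the conventional `Authorization: Bearer` header (the exact header is not stated on this page), and the base URL `https://api.example.com` is a placeholder. The helper names `authHeaders`, `tokenExpired`, `absolute`, and `taskUrls` are hypothetical.

```typescript
interface TaskRunRef {
  id: string;
  start_url: string;
  complete_url: string;
}

// Only the fields of the create response that this sketch uses.
interface CreateRunResponse {
  bearer_token: string;
  bearer_token_expires_at: string;
  endpoints: Record<string, string>;
  task_runs: TaskRunRef[];
}

// Assumption: the run token goes in a standard Bearer authorization header.
function authHeaders(res: CreateRunResponse): Record<string, string> {
  return { Authorization: `Bearer ${res.bearer_token}` };
}

// True once the token's expiry timestamp has been reached.
function tokenExpired(res: CreateRunResponse, now: Date = new Date()): boolean {
  return now.getTime() >= Date.parse(res.bearer_token_expires_at);
}

// Join a base URL and one of the relative paths from the response.
function absolute(baseUrl: string, path: string): string {
  return baseUrl.replace(/\/+$/, "") + path;
}

// Absolute start and complete URLs for every task run.
function taskUrls(res: CreateRunResponse, baseUrl: string): { id: string; start: string; complete: string }[] {
  return res.task_runs.map((t) => ({
    id: t.id,
    start: absolute(baseUrl, t.start_url),
    complete: absolute(baseUrl, t.complete_url),
  }));
}

// Trimmed copy of the example response above.
const sample: CreateRunResponse = {
  bearer_token: "vrl_run_...",
  bearer_token_expires_at: "2026-04-21T17:00:00.000Z",
  endpoints: { files_inbox: "/v1/benchmark-runs/br_abc123/files/inbox" },
  task_runs: [
    { id: "tr_1", start_url: "/v1/task-runs/tr_1/start", complete_url: "/v1/task-runs/tr_1/complete" },
  ],
};

console.log(taskUrls(sample, "https://api.example.com/"));
```

Checking `tokenExpired` before each call lets an agent stop early instead of failing partway through a rollout with a rejected token.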

Internal SDK Example

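// Assumes `verial` is an already-initialized SDK client authenticated with an API key.
const run = await verial.benchmarkRuns.create({ benchmarkId: "bench_abc123" });
// Fetch the run details, which include the aggregate score and verdict.
const detail = await verial.benchmarkRuns.get({ id: run.id });
console.log(detail.score, detail.verdict);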
