A Benchmark Run is a single execution of a Benchmark. When you create a benchmark run, Verial provisions a Playground for each task, and your agent drives the rollouts through sandbox endpoints. The verification engine then scores each task, producing Criterion Runs plus an aggregate score and verdict for the run. There are two entry points:
  • Internal (/benchmark-runs): Bearer API-key auth. Used by Verial tooling.
  • Public v1 (/v1/benchmark-runs): Solver-key auth. Used by external developers running a published benchmark from their own organization’s Solver. This is the recommended path; see the Quick Start and Authentication.

Internal Endpoints

| Method | Endpoint | Description |
| --- | --- | --- |
| GET | /benchmark-runs?benchmark_id={benchmark_id} | List benchmark runs |
| POST | /benchmark-runs | Create a benchmark run (body { "benchmark_id": "..." }) |
| GET | /benchmark-runs/{id} | Get run details (includes task runs and criterion runs) |
| POST | /benchmark-runs/{id}/complete | Mark complete |
| POST | /benchmark-runs/{id}/cancel | Cancel |
| POST | /benchmark-runs/{id}/publish | Publish to leaderboard |
| POST | /benchmark-runs/{id}/unpublish | Unpublish from leaderboard |

Public v1 Endpoints

| Method | Endpoint | Description |
| --- | --- | --- |
| POST | /v1/benchmark-runs | Create a run. Body: { "benchmark": "slug@version", "scored": boolean }. Returns a bearer token, task-run URLs, and sandbox endpoint paths |
| GET | /v1/benchmark-runs/{id} | Get run summary (requires the run’s bearer token) |
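
As a minimal sketch of the create-run request body described above, the helper below joins a slug and a version into the "slug@version" form. `buildCreateRunBody` is a hypothetical name, not part of any Verial SDK, and the rule that neither part may be empty or contain "@" is an assumption made for illustration.

```typescript
// Shape of the POST /v1/benchmark-runs body as given in the endpoint table.
interface CreateRunBody {
  benchmark: string; // "slug@version"
  scored: boolean;
}

// Hypothetical helper: builds the body and rejects obviously malformed input.
// The validation rule (non-empty parts, no "@" inside either part) is an
// assumption, not documented Verial behavior.
function buildCreateRunBody(slug: string, version: string, scored: boolean): CreateRunBody {
  if (slug === "" || version === "" || slug.includes("@") || version.includes("@")) {
    throw new Error(`invalid benchmark reference: "${slug}@${version}"`);
  }
  return { benchmark: `${slug}@${version}`, scored };
}

const createBody = buildCreateRunBody("fax-referral", "1", false);
console.log(JSON.stringify(createBody)); // {"benchmark":"fax-referral@1","scored":false}
```

The serialized body would then be sent to POST /v1/benchmark-runs with Solver-key auth; see Authentication for how the key is supplied.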

Benchmark Run Object

| Field | Type | Description |
| --- | --- | --- |
| id | string | Unique identifier |
| benchmark_id | string | Parent Benchmark |
| status | string | active, completed, cancelled, failed |
| phase | string | created, started, completed |
| scored | boolean | true if evidence is withheld from the agent during completion responses |
| verdict | string \| null | pass, partial, fail |
| score | number \| null | Aggregate score (mean of task scores) |
| agent | string \| null | Optional agent identifier |
| started_at | datetime \| null | |
| completed_at | datetime \| null | |
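
The fields above could be modeled on the client roughly as follows. This is a sketch, not a published type: it assumes datetimes arrive as ISO 8601 strings, and `meanScore` only illustrates the "mean of task scores" arithmetic. Returning null for an empty list and the 0 to 1 score scale in the example are assumptions.

```typescript
type RunStatus = "active" | "completed" | "cancelled" | "failed";
type RunPhase = "created" | "started" | "completed";
type Verdict = "pass" | "partial" | "fail";

// Client-side model of the Benchmark Run object (field names from the table).
interface BenchmarkRun {
  id: string;
  benchmark_id: string;
  status: RunStatus;
  phase: RunPhase;
  scored: boolean;
  verdict: Verdict | null;
  score: number | null;
  agent: string | null;
  started_at: string | null; // assumed to be an ISO 8601 string on the wire
  completed_at: string | null; // same assumption
}

// Arithmetic mean of task scores. Returning null when there are no scores
// is an assumption made for this sketch.
function meanScore(taskScores: number[]): number | null {
  if (taskScores.length === 0) return null;
  const total = taskScores.reduce((sum, s) => sum + s, 0);
  return total / taskScores.length;
}

console.log(meanScore([1, 0.5, 0])); // 0.5
```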

v1 Create Response

POST /v1/benchmark-runs returns everything the agent needs to drive the rollouts:
{
  "benchmark_run_id": "br_abc123",
  "benchmark": { "slug": "fax-referral", "version": "1", "name": "Fax referral intake" },
  "scored": false,
  "phase": "created",
  "bearer_token": "vrl_run_...",
  "bearer_token_expires_at": "2026-04-21T17:00:00.000Z",
  "endpoints": {
    "files_inbox": "/v1/benchmark-runs/br_abc123/files/inbox",
    "hl7_outbound": "/v1/benchmark-runs/br_abc123/hl7/outbound"
  },
  "task_runs": [
    {
      "id": "tr_1",
      "task_id": "task_1",
      "name": "Process referral #1",
      "phase": "created",
      "start_url": "/v1/task-runs/tr_1/start",
      "complete_url": "/v1/task-runs/tr_1/complete"
    }
  ]
}
Use the bearer_token to authenticate calls to all /v1/task-runs/* and /v1/benchmark-runs/{id}/* protocol endpoints (FHIR proxy, HL7, files, portal).
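
A sketch of how a client might prepare calls from the create response: resolve the relative paths against a base URL, attach the run's bearer token, and check the token's expiry. The base URL is a placeholder, the helper names are hypothetical, and sending the token as a standard `Authorization: Bearer` header is an assumption based on the field name.

```typescript
// Subset of the create response used here (field names from the JSON above).
interface TaskRunRef { id: string; start_url: string; complete_url: string }
interface CreateRunResponse {
  bearer_token: string;
  bearer_token_expires_at: string;
  endpoints: Record<string, string>;
  task_runs: TaskRunRef[];
}

// Assumption: the run token is sent as a standard Authorization: Bearer header.
function authHeaders(resp: CreateRunResponse): Record<string, string> {
  return { Authorization: `Bearer ${resp.bearer_token}` };
}

// Resolve a relative path from the response against the API base URL.
function absoluteUrl(baseUrl: string, path: string): string {
  return new URL(path, baseUrl).toString();
}

// True once `now` has reached the token's expiry timestamp.
function isTokenExpired(resp: CreateRunResponse, now: Date): boolean {
  return now.getTime() >= Date.parse(resp.bearer_token_expires_at);
}

const sample: CreateRunResponse = {
  bearer_token: "vrl_run_...",
  bearer_token_expires_at: "2026-04-21T17:00:00.000Z",
  endpoints: { files_inbox: "/v1/benchmark-runs/br_abc123/files/inbox" },
  task_runs: [
    { id: "tr_1", start_url: "/v1/task-runs/tr_1/start", complete_url: "/v1/task-runs/tr_1/complete" },
  ],
};

const apiBase = "https://api.example.test"; // placeholder, not the real API host
for (const tr of sample.task_runs) {
  console.log(absoluteUrl(apiBase, tr.start_url), absoluteUrl(apiBase, tr.complete_url));
}
```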

Internal SDK Example

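// Assumes `verial` is an already-initialized SDK client authenticated with an API key.
const run = await verial.benchmarkRuns.create({ benchmarkId: "bench_abc123" });
// Fetch the run details, including the aggregate score and verdict.
const detail = await verial.benchmarkRuns.get({ id: run.id });
console.log(detail.score, detail.verdict);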