## Structure
A benchmark contains:

- Name. What the benchmark tests.
- Environment. The simulated health system to run against.
- Tasks. Individual test cases, each with instructions and expected outcomes.
- Evals. Natural language assertions on each task, evaluated by an LLM judge.
- Configuration. Timeout, concurrency, and scoring thresholds.
## Creating a Benchmark
Benchmarks are linked to an environment at creation time.
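As a minimal sketch of what that link looks like, a creation payload might resemble the following. The field names (`environment_id`, `config`, and so on) are illustrative assumptions, not the platform's actual API schema:

```python
# Hypothetical benchmark-creation payload. The "environment_id" field is the
# key point: a benchmark is bound to one simulated health system at creation.
create_benchmark_request = {
    "name": "prior-auth-workflows",   # what the benchmark tests
    "environment_id": "env_abc123",   # the environment to run against (assumed ID format)
    "config": {
        "timeout": 300,               # seconds allowed per task
        "concurrency": 1,             # sequential execution by default
    },
}
```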
## Tasks

Tasks are separate resources linked to a benchmark. Each task defines:

- Name. A short identifier for the test case.
- Instruction. What the agent should do (natural language).
- Trigger. An optional event that kicks off the task (e.g., an inbound HL7 message, a phone call).
- Tags. Optional labels for filtering and grouping.
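Putting the fields above together, a task definition might look like this sketch. The shape and values (including the HL7 trigger encoding) are assumptions for illustration, not a real schema:

```python
# Hypothetical task payload mirroring the fields above: name, instruction,
# optional trigger, and optional tags.
task = {
    "name": "verify-eligibility",
    "instruction": (
        "Check the patient's insurance eligibility and document "
        "the result in the chart."
    ),
    # Optional kickoff event, e.g. an inbound HL7 registration message.
    "trigger": {"type": "inbound_hl7", "message_type": "ADT^A04"},
    # Optional labels for filtering and grouping.
    "tags": ["eligibility", "smoke"],
}
```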
## Evals
Evals are separate resources linked to a task. Each eval has a label (a short identifier), an assert (a natural language assertion), and a weight.
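For example, a task's evals might look like the sketch below. The structure is assumed for illustration; the `assert` field holds the natural language assertion the LLM judge evaluates, and `weight` is taken here to set each eval's relative contribution to the task score:

```python
# Hypothetical eval payloads attached to one task.
evals = [
    {
        "label": "eligibility-checked",
        "assert": "The agent verified the patient's insurance eligibility.",
        "weight": 2,
    },
    {
        "label": "chart-updated",
        "assert": "The eligibility result was documented in the patient's chart.",
        "weight": 1,
    },
]
```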
## Multi-Interface Tasks

Tasks can span multiple simulators. For example, a task might require your agent to:

- Read a patient’s chart in the FHIR EHR
- Call the insurance company’s IVR to check eligibility
- Submit a prior authorization through the payer portal
- Fax supporting documentation
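A task covering the steps above needs nothing special structurally; the single instruction simply touches several simulators. A hypothetical sketch (field names assumed, as before):

```python
# Hypothetical multi-interface task: one instruction, several simulators.
task = {
    "name": "prior-auth-end-to-end",
    "instruction": (
        "Review the patient's chart in the EHR, call the payer IVR to check "
        "eligibility, submit a prior authorization through the payer portal, "
        "and fax the supporting documentation."
    ),
    "tags": ["fhir", "ivr", "portal", "fax"],
}
```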
## Configuration
### Timeout
The maximum time (in seconds) allowed for each task. If a task does not complete within this window, it is marked as timed out and receives a score of 0.

### Concurrency
How many tasks can run simultaneously within a single run. Sequential execution (`concurrency: 1`) is the default and ensures tasks don’t interfere with each other. Higher concurrency is useful for independent tasks.
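To make the scoring behavior concrete, here is a sketch of how a task score could combine these rules. The timeout-means-zero rule comes from the docs above; the weighted-average combination of evals is an assumption for illustration, not the platform's documented formula:

```python
def task_score(evals: list[dict], timed_out: bool) -> float:
    """Sketch: score one task from its judged evals.

    Each eval dict is assumed to carry a numeric "weight" and a boolean
    "passed" (the LLM judge's verdict on its assertion).
    """
    # A task that exceeds its timeout scores 0 regardless of its evals.
    if timed_out:
        return 0.0
    total = sum(e["weight"] for e in evals)
    passed = sum(e["weight"] for e in evals if e["passed"])
    return passed / total
```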
## Next Steps

- Runs. Execute benchmarks and review results.
- Evals. Write effective evaluation criteria.