A benchmark is a named collection of tasks that test your agent’s ability to perform specific healthcare workflows. Each task includes a scenario description, optional triggers, and evaluations that determine success. Benchmarks are linked to an environment that provides the simulated interfaces.

Structure

A benchmark contains:
  • Name. What the benchmark tests.
  • Environment. The simulated health system to run against.
  • Tasks. Individual test cases, each with instructions and expected outcomes.
  • Evals. Natural language assertions on each task, evaluated by an LLM judge.
  • Configuration. Timeout, concurrency, and scoring thresholds.

Creating a Benchmark

Benchmarks are linked to an environment at creation time:
const benchmark = await verial.benchmarks.create({
  name: 'Prior Authorization Workflow',
  environmentId: environment.id,
  timeout: 300,      // 5 minutes per task
  concurrency: 1,    // Run tasks sequentially
})

Tasks

Tasks are separate resources linked to a benchmark. Each task defines:
  • Name. A short identifier for the test case.
  • Instruction. What the agent should do (natural language).
  • Trigger. An optional event that kicks off the task (e.g., an inbound HL7 message, a phone call).
  • Tags. Optional labels for filtering and grouping.
const task = await verial.tasks.create({
  benchmarkId: benchmark.id,
  name: 'Submit Prior Auth for MRI',
  instruction: 'Submit a prior authorization for an MRI of the lumbar spine for patient John Smith',
  timeout: 120,
})
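The trigger field is not exercised above. A hypothetical sketch of a triggered task, assuming triggers are declared inline as an object on the task — the `trigger` field names (`type`, `event`) are illustrative assumptions, not confirmed API:

```typescript
// Hypothetical only: the trigger shape is not documented here.
// `type` and `event` are assumed field names for illustration.
const taskWithTrigger = {
  name: 'Reconcile Transfer Orders',
  instruction:
    "When the transfer message arrives, reconcile the patient's active orders",
  trigger: {
    type: 'hl7-message', // assumed trigger type identifier
    event: 'ADT^A02',    // HL7 patient-transfer event
  },
  timeout: 120,
}
```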

Evals

Evals are separate resources linked to a task. Each eval has a label (short identifier), an assert (natural language assertion), and a weight:
await verial.evals.create({
  taskId: task.id,
  label: 'pa-submitted',
  assert: 'A prior auth request was submitted to the payer',
  weight: 1.0,
})

await verial.evals.create({
  taskId: task.id,
  label: 'correct-cpt',
  assert: 'The prior auth includes the correct CPT code (72148)',
  weight: 0.5,
})

await verial.evals.create({
  taskId: task.id,
  label: 'diagnosis-ref',
  assert: "The prior auth references the patient's existing lower back pain diagnosis",
  weight: 0.5,
})
See Evals for details on writing effective assertions.

Multi-Interface Tasks

Tasks can span multiple simulators. For example, a task might require your agent to:
  1. Read a patient’s chart in the FHIR EHR
  2. Call the insurance company’s IVR to check eligibility
  3. Submit a prior authorization through the payer portal
  4. Fax supporting documentation
The evaluation checks outcomes across all interfaces, verifying that the right actions happened in the right order.
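The ordering check can be pictured as a subsequence test over the recorded action log. This is not the platform's actual judge, just a sketch, assuming each simulator records (interface, action) events on a shared timeline:

```typescript
// Sketch: verify that the expected actions occurred in order within a
// recorded cross-interface event log (unrelated events are allowed
// in between).
type Action = { iface: string; action: string }

function happenedInOrder(log: Action[], expected: Action[]): boolean {
  let i = 0
  for (const event of log) {
    if (
      i < expected.length &&
      event.iface === expected[i].iface &&
      event.action === expected[i].action
    ) {
      i++
    }
  }
  return i === expected.length
}

// Event names below are illustrative, not the platform's vocabulary.
const log: Action[] = [
  { iface: 'fhir', action: 'read-chart' },
  { iface: 'ivr', action: 'check-eligibility' },
  { iface: 'portal', action: 'submit-prior-auth' },
  { iface: 'fax', action: 'send-documentation' },
]

const ok = happenedInOrder(log, [
  { iface: 'fhir', action: 'read-chart' },
  { iface: 'portal', action: 'submit-prior-auth' },
])
// ok === true: chart was read before the prior auth was submitted
```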

Configuration

Timeout

The maximum time (in seconds) allowed for each task. If a task is not completed within this window, it is marked as timed out and receives a score of 0.
const benchmark = await verial.benchmarks.create({
  name: 'Quick Eligibility Checks',
  environmentId: env.id,
  timeout: 60, // 1 minute per task
})

Concurrency

How many tasks can run simultaneously within a single run. Sequential execution (concurrency: 1) is the default and ensures tasks don’t interfere with each other. Higher concurrency is useful for independent tasks.
const benchmark = await verial.benchmarks.create({
  name: 'Parallel Patient Lookups',
  environmentId: env.id,
  concurrency: 5,
})

Next Steps

Runs

Execute benchmarks and review results.

Evals

Write effective evaluation criteria.