A benchmark is a named collection of tasks that test your agent’s ability to perform specific healthcare workflows. Each task binds input data, optional scenario steps, and criteria that the verification engine scores after a rollout. Benchmarks are linked to an environment that defines the simulated systems the agent will drive.
Structure
A benchmark contains:
- Name and slug/version. Published benchmarks are identified by slug@version.
- Environment. The simulated health system the agent will run against.
- Tasks. Individual test cases, each with input data bindings, optional scenario steps, and criteria.
- Criteria. Typed assertions per task, scored by the verification engine.
- Configuration. Timeout and concurrency settings.
Creating a Benchmark
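As a rough illustration of the parts listed under Structure, the sketch below models a benchmark as plain Python data classes. It is not the Verial SDK or API: the class names, field names, slug, environment name, and the timeout value are all hypothetical, chosen only to show how the pieces relate and how a slug@version identifier is composed.

```python
from dataclasses import dataclass, field


@dataclass
class Criterion:
    # Hypothetical shape for a typed assertion scored by the verification engine.
    type: str
    description: str


@dataclass
class Task:
    # Hypothetical task shape: an instruction plus the criteria scored after a rollout.
    instruction: str
    criteria: list[Criterion] = field(default_factory=list)


@dataclass
class Benchmark:
    name: str
    slug: str
    version: str
    environment: str  # the simulated health system the agent runs against
    tasks: list[Task] = field(default_factory=list)
    timeout_seconds: int = 300  # illustrative value only, not a documented default
    concurrency: int = 1  # sequential unless raised

    @property
    def identifier(self) -> str:
        # Published benchmarks are identified by slug@version.
        return f"{self.slug}@{self.version}"


benchmark = Benchmark(
    name="Prior Authorization",
    slug="prior-auth",
    version="1",
    environment="example-health-system",
    tasks=[
        Task(
            instruction="Submit a prior authorization request for the referred patient.",
            criteria=[Criterion(type="example-check", description="A request exists in the payer portal.")],
        )
    ],
)

print(benchmark.identifier)
```

Keeping tasks and criteria as nested lists mirrors the description above, where each task carries its own criteria and the benchmark carries the shared environment and run configuration.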
Tasks
Tasks are separate resources linked to a benchmark. Each task carries its task_item (instruction, trigger), an optional pre-rollout scenario, entities that scope it to specific dataset rows, and the typed criteria the verification engine scores. See the Tasks concept page for the full field reference and examples.
Multi-Interface Tasks
Tasks can span multiple simulators. A single task can require the agent to read a chart in FHIR, call an IVR line, submit a request to a payer portal, and fax documentation. The verification engine runs one criterion per observable outcome against the relevant sandbox state.
Configuration
Timeout
Per-task timeout (seconds). A task that exceeds its timeout is recorded with verdict: "fail" and score 0.
Concurrency
How many task rollouts may run in parallel within a benchmark run. Default 1 (sequential).
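The interaction between the two settings can be sketched with Python's asyncio. This is a toy model, not how the platform is implemented: run_task, run_benchmark, and the result dictionaries are hypothetical stand-ins, and the sketch assumes the timeout clock starts when a rollout begins rather than while it waits for a free slot.

```python
import asyncio


async def run_task(name: str, duration: float) -> dict:
    # Hypothetical stand-in for a rollout plus verification; it always passes here.
    await asyncio.sleep(duration)
    return {"task": name, "verdict": "pass", "score": 1.0}


async def run_benchmark(tasks: dict[str, float], timeout: float, concurrency: int = 1) -> list[dict]:
    # The semaphore caps how many rollouts are in flight at once.
    limit = asyncio.Semaphore(concurrency)

    async def guarded(name: str, duration: float) -> dict:
        async with limit:
            try:
                # The timeout applies per task, starting once the task holds a slot.
                return await asyncio.wait_for(run_task(name, duration), timeout)
            except asyncio.TimeoutError:
                # A task that exceeds its timeout is recorded as a fail with score 0.
                return {"task": name, "verdict": "fail", "score": 0}

    return await asyncio.gather(*(guarded(n, d) for n, d in tasks.items()))


results = asyncio.run(
    run_benchmark({"fast": 0.01, "slow": 0.5}, timeout=0.1, concurrency=2)
)
for result in results:
    print(result)
```

With concurrency left at 1 the tasks run one after another; raising it shortens the wall-clock time of a run but does not change the per-task timeout.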
Next Steps
- Runs. Execute benchmarks and review results.
- Criteria. Write typed assertions for each check type.