Compose an environment, build a benchmark of tasks with typed criteria, and publish a versioned slug.
This guide walks through authoring a benchmark end to end: creating an environment, wiring simulators and datasets, adding tasks with typed criteria, and publishing a versioned slug@version that other teams (or your own Solvers) can run.

Authoring uses the internal API with an organization API key (`vk_*`). Running a published benchmark uses a Solver key and the public `/v1` API instead.
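To make the two API surfaces concrete, here is a minimal sketch of routing by key prefix. The `vk_*` prefix and the `/v1` public path come from the description above; the helper itself and the exact base URLs are assumptions, not part of the SDK.

```typescript
// Hypothetical helper: pick the API surface from the key prefix.
// "vk_*" is the organization (authoring) key prefix described above;
// anything else is treated here as a Solver key. Base URLs are assumed.
function baseUrlForKey(apiKey: string): string {
  if (apiKey.startsWith("vk_")) {
    // Organization key → internal authoring API
    return "https://api.verial.ai";
  }
  // Solver key → public run API
  return "https://api.verial.ai/v1";
}
```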
Examples below mix the TypeScript SDK (for the resources the SDK exposes today: benchmarks, datasets, environments) with curl (for simulators, tasks, and criteria, which are currently available only on the REST API). The same operations are also available via the MCP server.
```typescript
import { Verial } from "@verial-ai/sdk";

const verial = new Verial({ apiKey: process.env.VERIAL_API_KEY! });
```
Simulators are individual simulated interfaces (FHIR EHR, voice line, payer portal, SFTP drop, HL7 endpoint, X12 clearinghouse, fax, messaging). Create each simulator and link it to the environment:
```shell
# Create the FHIR simulator
curl -X POST https://api.verial.ai/simulators \
  -H "Authorization: Bearer $VERIAL_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"type": "Fhir", "name": "Primary EHR"}'
# → save the returned id as $FHIR_SIMULATOR_ID

# Create the payer simulator
curl -X POST https://api.verial.ai/simulators \
  -H "Authorization: Bearer $VERIAL_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"type": "Payer", "name": "BlueCross Portal", "config": {"portal": "radmd"}}'
# → save the returned id as $PAYER_SIMULATOR_ID

# Link both simulators to the environment (env id from step 1 → $ENVIRONMENT_ID)
curl -X POST "https://api.verial.ai/environments/$ENVIRONMENT_ID/simulators/$FHIR_SIMULATOR_ID" \
  -H "Authorization: Bearer $VERIAL_API_KEY"
curl -X POST "https://api.verial.ai/environments/$ENVIRONMENT_ID/simulators/$PAYER_SIMULATOR_ID" \
  -H "Authorization: Bearer $VERIAL_API_KEY"
```
Datasets contain synthetic data that populates simulator sandboxes at rollout time. FHIR datasets carry a JSON config with patients, conditions, medications, etc. Files and SFTP datasets carry a manifest plus actual files stored in GCS.
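To illustrate the shape of a FHIR dataset config: the top-level keys `patients`, `conditions`, and `medications` come from the description above, but everything nested inside them is an illustrative guess, not the actual schema.

```typescript
// Illustrative FHIR dataset config. The exact schema is not shown in this
// guide; only the top-level collections are named above. All record fields
// and codes here are hypothetical sample data.
const fhirDatasetConfig = {
  patients: [
    { id: "pat-001", name: "Jane Doe", birthDate: "1984-02-11" },
  ],
  conditions: [
    { patientId: "pat-001", code: "M54.5", display: "Low back pain" },
  ],
  medications: [
    { patientId: "pat-001", code: "197696", display: "Naproxen 500 MG" },
  ],
};
```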
Datasets are linked to sandboxes (the runtime instances of simulators) when a playground is provisioned. Sandbox linking copies the dataset into a per-run child dataset, so the original stays pristine.
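The copy-on-link behavior can be sketched as a deep copy into a per-run child. The function name, id scheme, and shapes below are illustrative, not SDK API; only the parent/child copy semantics come from the text above.

```typescript
// Sketch of copy-on-link: provisioning a playground clones the dataset
// into a per-run child, so mutations during the run never touch the
// original. Names and the child-id scheme are hypothetical.
function linkDatasetToSandbox<T>(original: { id: string; config: T }, runId: string) {
  return {
    id: `${original.id}-${runId}`,        // hypothetical child id
    parentId: original.id,                // back-reference to the parent
    config: structuredClone(original.config), // deep copy: child owns its data
  };
}

const original = { id: "ds-1", config: { patients: ["pat-001"] } };
const child = linkDatasetToSandbox(original, "run-42");
child.config.patients.push("pat-999");
// original.config.patients is still ["pat-001"] — the parent stays pristine
```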
`timeout` is the per-task execution budget in seconds; tasks that exceed it are recorded with `verdict: "fail"` and `score: 0`. `concurrency` is the number of task rollouts that may run in parallel within a single benchmark run.
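The timeout rule above can be expressed as a small post-processing step. The function and types are a sketch of the stated behavior, not the engine's actual implementation.

```typescript
// Sketch of the timeout rule: a rollout that exceeds the per-task budget
// is recorded as a failure with score 0, regardless of what it produced.
interface RolloutResult {
  verdict: "pass" | "fail";
  score: number;
}

function applyTimeout(
  result: RolloutResult,
  elapsedSeconds: number,
  timeoutSeconds: number,
): RolloutResult {
  if (elapsedSeconds > timeoutSeconds) {
    return { verdict: "fail", score: 0 };
  }
  return result;
}
```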
- `taskItem`: structured payload with the instruction, trigger, and expected inputs.
- `scenario`: optional pre-rollout steps run by the scenario runner (for example, dropping a fax into the inbox before the agent starts).
- `entities`: `DatasetEntity` bindings that scope the task to specific synthetic records.
- `tags`: labels for filtering and organization.
```shell
curl -X POST https://api.verial.ai/tasks \
  -H "Authorization: Bearer $VERIAL_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "benchmark_id": "'"$BENCHMARK_ID"'",
    "name": "Submit Prior Auth for MRI",
    "task_item": {
      "instruction": "Submit a prior authorization for an MRI of the lumbar spine."
    },
    "tags": ["prior-auth", "imaging"]
  }'
# → save the returned id as $TASK_ID
```
Criteria are typed assertions the verification engine runs after the rollout. Each one has a label, an assertion spec, a weight, and an optional axis for per-axis scoring.

Prefer multiple narrow criteria over one compound criterion: one observable outcome per criterion makes failures easier to diagnose.
See the Criteria concept page for all supported check types (`fhir-resource-state`, `hl7-structural`, `portal-state-match`, `sftp-file-present`, `voice-transcript`, `x12-response`) and the full assertion reference on the Criteria API page.
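To show how weights and axes could combine into scores, here is a hypothetical aggregation sketch. The guide only says each criterion has a weight and an optional axis; the weighted pass-rate formula below is an assumption, not the engine's documented scoring.

```typescript
// Hypothetical per-axis scoring: weighted pass-rate of criteria grouped by
// axis. Criteria with no axis fall into an assumed "overall" bucket.
interface CriterionResult {
  passed: boolean;
  weight: number;
  axis?: string;
}

function scoreByAxis(results: CriterionResult[]): Map<string, number> {
  const totals = new Map<string, { passed: number; total: number }>();
  for (const r of results) {
    const axis = r.axis ?? "overall";
    const t = totals.get(axis) ?? { passed: 0, total: 0 };
    t.total += r.weight;
    if (r.passed) t.passed += r.weight;
    totals.set(axis, t);
  }
  const scores = new Map<string, number>();
  for (const [axis, t] of totals) {
    scores.set(axis, t.total === 0 ? 0 : t.passed / t.total);
  }
  return scores;
}
```

Narrow criteria pay off here: with one observable outcome per criterion, a low axis score points directly at the failing checks.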
Published benchmarks are immutable. To iterate, clone the benchmark and publish the clone as the next version (`version: 2`). Consumers can continue pinning to `slug@1` while you ship `slug@2`.
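A consumer pinning a benchmark can be sketched as parsing the `slug@version` reference shown above. The helper is hypothetical, not part of the SDK; only the `slug@1` / `slug@2` reference format comes from the text.

```typescript
// Hypothetical helper: split a "slug@version" reference ("prior-auth@2")
// into its parts. An unversioned slug is assumed to mean "latest".
function parseBenchmarkRef(ref: string): { slug: string; version?: number } {
  const at = ref.lastIndexOf("@");
  if (at === -1) return { slug: ref };
  return { slug: ref.slice(0, at), version: Number(ref.slice(at + 1)) };
}
```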