This guide walks through authoring a benchmark end to end: creating an environment, wiring simulators and datasets, adding tasks with typed criteria, and publishing a versioned benchmark, addressable as slug@version, that other teams (or your own Solvers) can run. Authoring uses the internal API with an organization API key (vk_*). Running a published benchmark uses a Solver key and the public /v1 API instead.

Who This Is For

  • Teams building internal benchmarks for their own agents.
  • Organizations publishing benchmarks for the wider healthcare AI ecosystem (visibility=Public).
  • Anyone authoring a new task or criterion against an existing environment.

Prerequisites

The examples below mix the TypeScript SDK (for the resources the SDK exposes today: benchmarks, datasets, and environments) with curl (for simulators, tasks, and criteria, which are currently available only on the REST API). The same operations are also available via the MCP server.

export VERIAL_API_KEY=vk_xxx

import { Verial } from "@verial-ai/sdk";

const verial = new Verial({ apiKey: process.env.VERIAL_API_KEY! });

1. Create an Environment

An environment is a reusable simulated health system. Start with a container:
const env = await verial.environments.create({
  name: "Regional Medical Center",
});
// → use env.id as $ENVIRONMENT_ID in the curl examples below

2. Add Simulators

Simulators are individual simulated interfaces (FHIR EHR, voice line, payer portal, SFTP drop, HL7 endpoint, X12 clearinghouse, fax, messaging). Create each simulator and link it to the environment:
# Create the FHIR simulator
curl -X POST https://api.verial.ai/simulators \
  -H "Authorization: Bearer $VERIAL_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"type": "Fhir", "name": "Primary EHR"}'
# → save the returned id as $FHIR_SIMULATOR_ID

# Create the payer simulator
curl -X POST https://api.verial.ai/simulators \
  -H "Authorization: Bearer $VERIAL_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"type": "Payer", "name": "BlueCross Portal", "config": {"portal": "radmd"}}'
# → save the returned id as $PAYER_SIMULATOR_ID

# Link both simulators to the environment (env id from step 1 → $ENVIRONMENT_ID)
curl -X POST "https://api.verial.ai/environments/$ENVIRONMENT_ID/simulators/$FHIR_SIMULATOR_ID" \
  -H "Authorization: Bearer $VERIAL_API_KEY"

curl -X POST "https://api.verial.ai/environments/$ENVIRONMENT_ID/simulators/$PAYER_SIMULATOR_ID" \
  -H "Authorization: Bearer $VERIAL_API_KEY"

3. Attach Datasets

Datasets contain synthetic data that populates simulator sandboxes at rollout time. FHIR datasets carry a JSON config with patients, conditions, medications, etc. Files and SFTP datasets carry a manifest plus actual files stored in GCS.
const dataset = await verial.datasets.create({
  name: "Primary Care Patients",
  data: {
    patients: [
      {
        name: "John Smith",
        dob: "1965-03-15",
        gender: "male",
        conditions: ["Type 2 Diabetes"],
        insurance: { plan: "BlueCross PPO", member_id: "BCB123456789" },
      },
    ],
  },
});
Datasets are linked to sandboxes (the runtime instances of simulators) when a playground is provisioned. Sandbox linking copies the dataset into a per-run child dataset, so the original stays pristine.
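The copy-on-link behavior can be sketched conceptually. This is an illustration of why the original stays pristine, not an SDK API; the platform performs the copy server-side, and all names here are hypothetical:

```typescript
// Conceptual sketch of copy-on-link: provisioning a sandbox clones the
// dataset into a per-run child, so mutations during a rollout never touch
// the original. Names here are illustrative, not SDK APIs.
type DatasetRecord = { id: string; data: unknown };

function linkDatasetToSandbox(original: DatasetRecord, runId: string): DatasetRecord {
  // Deep-copy the payload so the child is fully independent of the parent.
  return {
    id: `${original.id}:${runId}`,
    data: structuredClone(original.data),
  };
}

const parent: DatasetRecord = {
  id: "ds_123",
  data: { patients: [{ name: "John Smith", conditions: ["Type 2 Diabetes"] }] },
};
const child = linkDatasetToSandbox(parent, "run_1");
(child.data as any).patients[0].conditions.push("Hypertension");
// parent.data is unchanged: the rollout mutated only the child copy.
```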

4. Create the Benchmark

A benchmark groups tasks against one environment:
const benchmark = await verial.benchmarks.create({
  name: "Prior Authorization Workflow",
  environmentId: env.id,
  timeout: 300,
  concurrency: 1,
});
// → use benchmark.id as $BENCHMARK_ID below
  • timeout: the per-task execution budget in seconds. Tasks that exceed it are recorded with verdict: "fail" and score: 0.
  • concurrency: the number of task rollouts that may run in parallel within a single benchmark run.

5. Add Tasks

Each task is one test case. A task carries:
  • taskItem: structured payload with instruction, trigger, expected inputs.
  • scenario: optional pre-rollout steps run by the scenario runner (for example, drop a fax into the inbox before the agent starts).
  • entities: DatasetEntity bindings scoping the task to specific synthetic records.
  • tags: for filtering and organization.
curl -X POST https://api.verial.ai/tasks \
  -H "Authorization: Bearer $VERIAL_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "benchmark_id": "'"$BENCHMARK_ID"'",
    "name": "Submit Prior Auth for MRI",
    "task_item": {
      "instruction": "Submit a prior authorization for an MRI of the lumbar spine."
    },
    "tags": ["prior-auth", "imaging"]
  }'
# → save the returned id as $TASK_ID

6. Attach Criteria

Criteria are typed assertions the verification engine runs after the rollout. Each one has a label, an assertion spec, a weight, and an optional axis for per-axis scoring. Prefer multiple narrow criteria over one compound criterion: one observable outcome per criterion makes failures easier to diagnose.
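To build intuition for why weight and axis exist, here is one plausible way per-criterion results could roll up into a task score and per-axis scores. This is a hedged sketch, not the verification engine's documented formula (see the Verification page for that):

```typescript
// Hedged sketch: weighted pass/fail aggregation over criteria, overall and
// per axis. Illustrates the role of weight and axis only; the real formula
// belongs to the verification engine.
type CriterionResult = { label: string; weight: number; axis: string; passed: boolean };

function aggregate(results: CriterionResult[]) {
  const total = results.reduce((s, r) => s + r.weight, 0);
  const earned = results.reduce((s, r) => s + (r.passed ? r.weight : 0), 0);
  const axes: Record<string, { earned: number; total: number }> = {};
  for (const r of results) {
    const a = (axes[r.axis] ??= { earned: 0, total: 0 });
    a.total += r.weight;
    if (r.passed) a.earned += r.weight;
  }
  return { score: total ? earned / total : 0, axes };
}

const { score, axes } = aggregate([
  { label: "Correct CPT", weight: 1.0, axis: "correctness", passed: true },
  { label: "Appointment booked", weight: 1.0, axis: "correctness", passed: false },
  { label: "No PHI leaked", weight: 0.5, axis: "safety", passed: true },
]);
// score = 1.5 / 2.5 = 0.6; safety axis is fully earned, correctness is half
```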

Example: portal-state-match

curl -X POST https://api.verial.ai/criteria \
  -H "Authorization: Bearer $VERIAL_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "task_id": "'"$TASK_ID"'",
    "label": "Prior auth submitted with correct CPT",
    "weight": 1.0,
    "axis": "correctness",
    "assertion": {
      "assert": "portal-state-match",
      "correlate_by": { "resource": "prior_auth_requests", "key": "request_id" },
      "assertions": [
        { "path": "status", "expected": "submitted" },
        { "path": "cpt_code", "expected": "72148" }
      ]
    }
  }'

Example: fhir-resource-state

curl -X POST https://api.verial.ai/criteria \
  -H "Authorization: Bearer $VERIAL_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "task_id": "'"$TASK_ID"'",
    "label": "Follow-up appointment booked",
    "weight": 1.0,
    "axis": "correctness",
    "assertion": {
      "assert": "fhir-resource-state",
      "resource_type": "Appointment",
      "search": { "patient": "Patient/john-smith", "status": "booked" },
      "fields": [
        { "path": "participant.0.actor.display", "expected": "Dr. Rivera" }
      ]
    }
  }'

Example: voice-transcript (negative assertion)

curl -X POST https://api.verial.ai/criteria \
  -H "Authorization: Bearer $VERIAL_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "task_id": "'"$TASK_ID"'",
    "label": "No PHI leaked on voicemail",
    "weight": 0.5,
    "axis": "safety",
    "assertion": {
      "assert": "voice-transcript",
      "speaker": "agent",
      "not_contains": ["social security number", "SSN"]
    }
  }'
See the Criteria concept page for all supported check types (fhir-resource-state, hl7-structural, portal-state-match, sftp-file-present, voice-transcript, x12-response), and the Criteria API page for the full assertion reference.

7. Set Slug and Publish

Before publishing, assign a URL-safe slug (lowercase alphanumeric with hyphens):
await verial.benchmarks.update({
  id: benchmark.id,
  slug: "prior-auth-workflow",
});
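The slug constraint (lowercase alphanumeric with hyphens) can be pre-checked client-side before calling update. The regex below is an assumption that matches the stated rule; the server remains authoritative:

```typescript
// Client-side pre-check for the stated slug rule: lowercase alphanumeric
// segments separated by single hyphens, no leading or trailing hyphen.
// Mirrors the documented constraint; not the server's exact validator.
function isValidSlug(slug: string): boolean {
  return /^[a-z0-9]+(-[a-z0-9]+)*$/.test(slug);
}

isValidSlug("prior-auth-workflow"); // true
isValidSlug("Prior Auth");          // false: uppercase and space
isValidSlug("prior--auth");         // false: empty segment between hyphens
```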
Then publish a version. Versions are positive integers; once published, a benchmark is immutable and addressable as slug@version:
curl -X POST "https://api.verial.ai/benchmarks/$BENCHMARK_ID/publish" \
  -H "Authorization: Bearer $VERIAL_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "version": 1,
    "sponsor_name": "Acme Health",
    "methodology_url": "https://acme.example/benchmarks/prior-auth"
  }'
Published benchmarks are immutable. To iterate, clone the benchmark and publish the clone as the next version (version: 2). Consumers can continue pinning to slug@1 while you ship slug@2.
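Consumers pinning to slug@version need to split the reference into its parts. A minimal parser for the addressing scheme described above (a hypothetical helper, not part of the SDK):

```typescript
// Parse a "slug@version" reference. Versions are positive integers per the
// publish rules above; a bare slug (no "@") is treated as unpinned.
// Illustrative helper, not an SDK API.
function parseBenchmarkRef(ref: string): { slug: string; version?: number } {
  const at = ref.lastIndexOf("@");
  if (at === -1) return { slug: ref };
  const version = Number(ref.slice(at + 1));
  if (!Number.isInteger(version) || version < 1) {
    throw new Error(`invalid version in ref: ${ref}`);
  }
  return { slug: ref.slice(0, at), version };
}

parseBenchmarkRef("prior-auth-workflow@2"); // { slug: "prior-auth-workflow", version: 2 }
parseBenchmarkRef("prior-auth-workflow");   // { slug: "prior-auth-workflow" }
```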

8. Set Visibility

Visibility controls who can run a published benchmark:
  • Private (default): only Solvers in the owning organization can run it.
  • Public: any Solver in any organization can run it. The benchmark appears in public listings and leaderboards.
await verial.benchmarks.update({
  id: benchmark.id,
  visibility: "Public",
});
You cannot make a benchmark Public without a slug. Publish with a slug first, verify it runs as expected, then flip visibility.
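The visibility rule above can be stated as a small predicate. This is conceptual only; enforcement happens server-side, and the types here are illustrative:

```typescript
// Conceptual access check for running a published benchmark, per the
// visibility rules above. Types and field names are illustrative only.
type Visibility = "Private" | "Public";
type Benchmark = { ownerOrgId: string; visibility: Visibility };
type Solver = { orgId: string };

function canRun(benchmark: Benchmark, solver: Solver): boolean {
  if (benchmark.visibility === "Public") return true;
  // Private (the default): only Solvers in the owning organization.
  return benchmark.ownerOrgId === solver.orgId;
}
```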

Versioning

  • Cut a new version when criteria, tasks, or the linked environment change in a way that shifts scores.
  • Keep trivial edits (documentation, labels that do not affect scoring) to the same version by applying them before publishing.
  • Communicate version changes in the benchmark’s overview, scoring_rubric, and limitations fields.

Next Steps

Criteria

Reference for every supported check type with annotated examples.

Verification

How per-criterion results aggregate into task and benchmark scores.

Environments

Compose simulators and datasets into reusable simulated health systems.

Benchmarks API

REST endpoints for create, update, publish, and list.