Tasks

A task is a single test case inside a Benchmark. It describes what the agent should do, optionally sets up pre-rollout state, scopes the agent to specific records, and carries the criteria the verification engine runs to score it. Each task run is one rollout of that test case against a provisioned Playground.

Anatomy of a Task

| Field | Type | Description |
| --- | --- | --- |
| `name` | string | Short human-readable title |
| `task_item` | object \| null | Structured payload the agent receives: instruction, trigger, expected inputs |
| `scenario` | object \| null | Optional pre-rollout steps run by the scenario runner before the agent starts |
| `entities` | DatasetEntity[] | Bindings that scope this task to specific synthetic records (e.g. “the patient with DOB 1965-03-15”) |
| `tags` | string[] | Free-form labels for filtering and reporting |
| `timeout` | number \| null | Optional per-task timeout override in seconds |
| `criteria` | Criterion[] | Typed assertions scored after the rollout |
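As a rough illustration of the shape described above, the fields can be mirrored in a small client-side type. This is a sketch only: the `Task` class is hypothetical and not part of any Verial SDK, and the nested objects are left as plain dictionaries because their full schemas are not given here.

```python
from dataclasses import dataclass, field
from typing import Any, Optional


@dataclass
class Task:
    """Hypothetical client-side mirror of the task fields listed above."""

    name: str
    task_item: Optional[dict[str, Any]] = None  # instruction, trigger, expected inputs
    scenario: Optional[dict[str, Any]] = None   # pre-rollout steps
    entities: list[dict[str, Any]] = field(default_factory=list)
    tags: list[str] = field(default_factory=list)
    timeout: Optional[float] = None             # seconds; None means no override
    criteria: list[dict[str, Any]] = field(default_factory=list)


task = Task(
    name="Submit prior auth for lumbar MRI",
    task_item={"instruction": "Submit a prior authorization for MRI of the lumbar spine"},
    tags=["prior-auth", "imaging"],
)
```

Only `name` is required in this sketch; the nullable fields default to `None` and the list fields default to empty lists, matching the types in the table.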

`task_item`

The `task_item` object is what the agent receives at the start of the task run. Its schema is intentionally loose, so fields can be omitted or added per task. Common fields:

- `instruction`: the natural language direction for the agent.
- `trigger`: what starts the work (for example “an inbound referral fax”).
- `expected_inputs`: optional hints about what data the agent needs to pull from the sandboxes.
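Because the payload is loose, an agent harness has to tolerate missing fields. The sketch below shows one way to turn a `task_item` into prompt text. The `render_prompt` helper is hypothetical, and treating `expected_inputs` as a list of short strings is an assumption, since its exact type is not specified here.

```python
from typing import Any, Optional


def render_prompt(task_item: Optional[dict[str, Any]]) -> str:
    """Turn a loosely structured task_item into prompt text, skipping absent fields."""
    if not task_item:
        return ""
    lines = []
    if task_item.get("instruction"):
        lines.append(f"Instruction: {task_item['instruction']}")
    if task_item.get("trigger"):
        lines.append(f"Trigger: {task_item['trigger']}")
    # Assumption: expected_inputs, when present, is a list of short strings.
    for hint in task_item.get("expected_inputs") or []:
        lines.append(f"Expected input: {hint}")
    return "\n".join(lines)


prompt = render_prompt({
    "instruction": "Submit a prior authorization for MRI of the lumbar spine",
    "trigger": "Inbound referral fax in the SFTP inbox",
})
print(prompt)
```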

`scenario`

A scenario is a short program run by the scenario runner before the rollout starts. Typical scenarios seed inbound events the agent is meant to react to: dropping a fax into the SFTP inbox, posting an HL7 ORU message, or leaving a voicemail on the IVR line. Starting a task run executes its scenario, then hands control to the agent.
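To make the ordering concrete, here is a toy stand-in for a scenario runner. It is not the real runner: it assumes steps are executed in ascending `at_sec` order and dispatched by `action` name, and it does not wait in real time. The `run_scenario` function, the handler table, and the second entity id `ent_referral_2` are hypothetical.

```python
from typing import Any, Callable

Step = dict[str, Any]


def run_scenario(
    scenario: dict[str, Any],
    handlers: dict[str, Callable[[Step], None]],
) -> list[Step]:
    """Run each step in at_sec order by calling the handler registered for its action."""
    steps = sorted(scenario.get("steps", []), key=lambda step: step["at_sec"])
    for step in steps:
        handlers[step["action"]](step)  # raises KeyError for an unknown action
    return steps


# Stand-in handler: record the drop instead of touching a real SFTP inbox.
dropped: list[tuple[str, str]] = []
handlers = {
    "drop_file_inbox": lambda step: dropped.append(
        (step["entity_id"], step["inbound_path"])
    ),
}

scenario = {
    "steps": [
        {"at_sec": 5, "action": "drop_file_inbox",
         "entity_id": "ent_referral_2", "inbound_path": "inbox/"},
        {"at_sec": 0, "action": "drop_file_inbox",
         "entity_id": "ent_referral_1", "inbound_path": "inbox/"},
    ]
}

executed = run_scenario(scenario, handlers)
```

After the call, `dropped` lists the referral for `ent_referral_1` first, even though it was declared second, because its `at_sec` is lower.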

`entities`

Entities bind a task to specific rows inside the linked dataset. The binding flows through to criteria: a criterion’s input_entity_id can reference one of the task’s entities so the assertion runs against the right record. This is how one task template can be reused across many patients without rewriting assertions.
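The lookup can be pictured as follows. This is a minimal sketch under stated assumptions: each entity is shown as a dictionary with an `id` and a `record`, which is a guess at the `DatasetEntity` shape, and `resolve_entity` is a hypothetical helper. It also assumes that reused tasks keep the same entity id while binding it to a different row.

```python
from typing import Any, Optional


def resolve_entity(task: dict[str, Any], criterion: dict[str, Any]) -> Optional[dict[str, Any]]:
    """Return the task entity a criterion points at, or None if it is unscoped."""
    entity_id = criterion.get("input_entity_id")
    if entity_id is None:
        return None
    for entity in task["entities"]:
        if entity["id"] == entity_id:
            return entity
    raise KeyError(f"criterion references unknown entity {entity_id!r}")


# One criterion definition, reused unchanged by two tasks.
criterion = {"label": "Chart updated", "input_entity_id": "ent_patient"}

# Two tasks built from the same template, each bound to a different row.
task_a = {"entities": [{"id": "ent_patient", "record": {"dob": "1965-03-15"}}]}
task_b = {"entities": [{"id": "ent_patient", "record": {"dob": "1980-07-02"}}]}

record_a = resolve_entity(task_a, criterion)["record"]
record_b = resolve_entity(task_b, criterion)["record"]
```

The criterion is identical in both cases; only the entity binding changes which record the assertion would run against.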

Where Tasks Fit

Each task produces one task run inside a benchmark run. The task run carries a frozen snapshot of the task at the moment the benchmark was published, so reruns are reproducible even if the task is later edited.
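The snapshot behavior can be imitated locally with a deep copy. This is only an illustration of the idea, not how the platform stores snapshots; `publish` is a hypothetical function.

```python
import copy
from typing import Any


def publish(tasks: list[dict[str, Any]]) -> list[dict[str, Any]]:
    """Freeze the current task definitions so later edits cannot leak into runs."""
    return [copy.deepcopy(task) for task in tasks]


tasks = [{"name": "Submit prior auth for lumbar MRI", "tags": ["prior-auth", "imaging"]}]
snapshot = publish(tasks)

# Editing the live task after publishing leaves the snapshot untouched.
tasks[0]["name"] = "Submit prior auth for cervical MRI"
tasks[0]["tags"].append("edited")
```

A shallow copy would not be enough here, because nested values such as `tags` would still be shared between the live task and the snapshot.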

Multi-Interface Tasks

A single task can touch several simulators in one rollout. For example, a prior-auth task might:

1. Read the patient’s chart from the FHIR sandbox.
2. Call the payer’s IVR line on the Voice sandbox.
3. Submit the auth form on the Payer portal sandbox.
4. Fax supporting documentation via the Fax sandbox.

Attach one criterion per observable outcome (FHIR resource state, voice transcript phrases, portal state row, outbound fax content). The verification engine scores each criterion independently and rolls the results up into the task score.
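The exact roll-up formula is not stated here. As one plausible reading, the sketch below combines per-criterion scores with a weighted average, using the `weight` field that criteria carry. The `task_score` function, the 0 to 1 score range, and the sample results are all assumptions for illustration.

```python
from typing import Any


def task_score(results: list[dict[str, Any]]) -> float:
    """Weighted average of per-criterion scores (assumed roll-up, not the documented one)."""
    total_weight = sum(result["weight"] for result in results)
    if total_weight == 0:
        return 0.0
    weighted_sum = sum(result["weight"] * result["score"] for result in results)
    return weighted_sum / total_weight


# One independently scored result per observable outcome (sample values).
results = [
    {"label": "FHIR resource state", "weight": 1.0, "score": 1.0},
    {"label": "Voice transcript phrases", "weight": 1.0, "score": 1.0},
    {"label": "Portal state row", "weight": 1.0, "score": 0.0},
    {"label": "Outbound fax content", "weight": 1.0, "score": 1.0},
]

score = task_score(results)
print(score)  # 0.75
```

With equal weights, a single failed outcome out of four lowers the task score to 0.75 without hiding which criterion failed.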

Creating a Task

```bash
# Create the task
curl -X POST https://api.verial.ai/tasks \
  -H "Authorization: Bearer $VERIAL_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "benchmark_id": "'"$BENCHMARK_ID"'",
    "name": "Submit prior auth for lumbar MRI",
    "task_item": {
      "instruction": "Submit a prior authorization for MRI of the lumbar spine",
      "trigger": "Inbound referral fax in the SFTP inbox"
    },
    "scenario": {
      "steps": [
        {
          "at_sec": 0,
          "action": "drop_file_inbox",
          "entity_id": "ent_referral_1",
          "inbound_path": "inbox/"
        }
      ]
    },
    "tags": ["prior-auth", "imaging"]
  }'
# → save the returned id as $TASK_ID

# Attach criteria scoring the observable outcomes
curl -X POST https://api.verial.ai/criteria \
  -H "Authorization: Bearer $VERIAL_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "task_id": "'"$TASK_ID"'",
    "label": "Prior auth submitted with correct CPT",
    "weight": 1.0,
    "axis": "correctness",
    "assertion": {
      "assert": "portal-state-match",
      "correlate_by": { "resource": "prior_auth_requests", "key": "request_id" },
      "assertions": [
        { "path": "status", "expected": "submitted" },
        { "path": "cpt_code", "expected": "72148" }
      ]
    }
  }'
```
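For callers who prefer Python to curl, the same two requests can be assembled with the standard library. This sketch only builds the request objects and does not send them. The `build_request` helper is hypothetical, and the `BENCHMARK_ID` and `TASK_ID` environment variable names simply mirror the shell variables used above.

```python
import json
import os
import urllib.request
from typing import Any

API_BASE = "https://api.verial.ai"


def build_request(path: str, payload: dict[str, Any], api_key: str) -> urllib.request.Request:
    """Build (but do not send) an authenticated JSON POST request."""
    return urllib.request.Request(
        f"{API_BASE}{path}",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


api_key = os.environ.get("VERIAL_API_KEY", "")

task_request = build_request("/tasks", {
    "benchmark_id": os.environ.get("BENCHMARK_ID", ""),
    "name": "Submit prior auth for lumbar MRI",
    "task_item": {
        "instruction": "Submit a prior authorization for MRI of the lumbar spine",
        "trigger": "Inbound referral fax in the SFTP inbox",
    },
    "tags": ["prior-auth", "imaging"],
}, api_key)

criterion_request = build_request("/criteria", {
    "task_id": os.environ.get("TASK_ID", ""),  # id returned when the task was created
    "label": "Prior auth submitted with correct CPT",
    "weight": 1.0,
    "axis": "correctness",
    "assertion": {
        "assert": "portal-state-match",
        "correlate_by": {"resource": "prior_auth_requests", "key": "request_id"},
        "assertions": [
            {"path": "status", "expected": "submitted"},
            {"path": "cpt_code", "expected": "72148"},
        ],
    },
}, api_key)
```

Passing either object to `urllib.request.urlopen` would send it. The task request has to go first, since the criterion payload needs the id the first call returns.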

Next Steps

Criteria

Tasks API