A task is a single test case inside a Benchmark. It describes what the agent should do, optionally sets up pre-rollout state, scopes the agent to specific records, and carries the criteria the verification engine runs to score it. Each task run is one rollout of that test case against a provisioned Playground.

Anatomy of a Task

Field      Type             Description
name       string           Short human-readable title
task_item  object | null    Structured payload the agent receives: instruction, trigger, expected inputs
scenario   object | null    Optional pre-rollout steps run by the scenario runner before the agent starts
entities   DatasetEntity[]  Bindings that scope this task to specific synthetic records (e.g. “the patient with DOB 1965-03-15”)
tags       string[]         Free-form labels for filtering and reporting
timeout    number | null    Optional per-task timeout override in seconds
criteria   Criterion[]      Typed assertions scored after the rollout
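
As a rough mental model, the fields listed above can be mirrored in a small Python dataclass. This is an illustrative sketch, not an official SDK type: the class name `Task` and the use of plain dicts for nested objects are assumptions. Only the field names and their nullability come from the field list.

```python
from dataclasses import dataclass, field
from typing import Any, Optional


@dataclass
class Task:
    """Illustrative mirror of the task fields (hypothetical, not an SDK class)."""

    name: str
    task_item: Optional[dict[str, Any]] = None   # payload the agent receives
    scenario: Optional[dict[str, Any]] = None    # optional pre-rollout steps
    entities: list[dict[str, Any]] = field(default_factory=list)
    tags: list[str] = field(default_factory=list)
    timeout: Optional[float] = None              # seconds; None means no override
    criteria: list[dict[str, Any]] = field(default_factory=list)


# Only `name` is required in this sketch; everything else falls back to a default.
task = Task(
    name="Submit prior auth for lumbar MRI",
    task_item={
        "instruction": "Submit a prior authorization for MRI of the lumbar spine",
        "trigger": "Inbound referral fax in the SFTP inbox",
    },
    tags=["prior-auth", "imaging"],
)
```

Nullable fields default to `None` and list fields default to empty lists, matching the `| null` and `[]` types in the field list.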

task_item

The task_item object is what the agent receives at the start of the task run. Its schema is intentionally loose. Common fields:
  • instruction: the natural language direction for the agent.
  • trigger: what starts the work (for example “an inbound referral fax”).
  • expected_inputs: optional hints about what data the agent needs to pull from the sandboxes.

scenario

A scenario is a short program run by the scenario runner before the rollout starts. Typical scenarios seed inbound events the agent is meant to react to: dropping a fax into the SFTP inbox, posting an HL7 ORU message, or leaving a voicemail on the IVR line. Starting a task run executes its scenario, then hands control to the agent.
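
The ordering described above (scenario first, agent second) can be sketched as follows. This is a minimal sketch under stated assumptions: `run_scenario`, `start_task_run`, the handler table, and sorting steps by `at_sec` are all hypothetical and stand in for whatever the real scenario runner does. The step shape follows the `steps` array used in the task-creation request in this page.

```python
from typing import Any, Callable

Step = dict[str, Any]
Handler = Callable[[Step], None]


def run_scenario(steps: list[Step], handlers: dict[str, Handler]) -> None:
    """Execute pre-rollout steps in ascending at_sec order (assumed ordering)."""
    for step in sorted(steps, key=lambda s: s["at_sec"]):
        handlers[step["action"]](step)


def start_task_run(task: dict[str, Any], handlers: dict[str, Handler],
                   agent: Callable[[dict[str, Any]], Any]) -> Any:
    """Run the scenario first, then hand the task_item to the agent."""
    scenario = task.get("scenario") or {"steps": []}
    run_scenario(scenario["steps"], handlers)
    return agent(task.get("task_item") or {})


events: list[str] = []

task = {
    "task_item": {"instruction": "Submit a prior authorization for MRI of the lumbar spine"},
    "scenario": {"steps": [
        {"at_sec": 0, "action": "drop_file_inbox",
         "entity_id": "ent_referral_1", "inbound_path": "inbox/"},
    ]},
}

# Hypothetical handler: a real runner would write to the SFTP sandbox here.
handlers = {"drop_file_inbox": lambda step: events.append("seeded " + step["entity_id"])}


def agent(task_item: dict[str, Any]) -> str:
    events.append("agent started")
    return task_item["instruction"]


result = start_task_run(task, handlers, agent)
print(events)  # ['seeded ent_referral_1', 'agent started']
```

A task with no scenario skips straight to the agent, which matches the field being optional.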

entities

Entities bind a task to specific rows inside the linked dataset. The binding flows through to criteria: a criterion’s input_entity_id can reference one of the task’s entities so the assertion runs against the right record. This is how one task template can be reused across many patients without rewriting assertions.
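
To make the binding concrete, here is a small sketch of how a criterion's input_entity_id might be resolved against a task's entities. The entity shape (`id`, `record`) and the helper `resolve_entity` are assumptions made for illustration; only `input_entity_id` and the idea of binding to dataset rows come from the text above.

```python
from typing import Any, Optional


def resolve_entity(task: dict[str, Any],
                   criterion: dict[str, Any]) -> Optional[dict[str, Any]]:
    """Return the task entity a criterion points at, or None if it is unbound."""
    entity_id = criterion.get("input_entity_id")
    if entity_id is None:
        return None
    for entity in task["entities"]:
        if entity["id"] == entity_id:
            return entity
    raise KeyError(f"criterion references unknown entity {entity_id!r}")


# One criterion definition, reused unchanged across two tasks.
criterion = {"label": "Chart updated for the bound patient", "input_entity_id": "patient"}

# Hypothetical entity shape: an id plus the bound synthetic record.
task_a = {"entities": [{"id": "patient", "record": {"dob": "1965-03-15"}}]}
task_b = {"entities": [{"id": "patient", "record": {"dob": "1972-11-02"}}]}

record_a = resolve_entity(task_a, criterion)["record"]
record_b = resolve_entity(task_b, criterion)["record"]
```

The same criterion resolves to a different record in each task, which is the reuse the binding is meant to enable.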

Where Tasks Fit

Each task produces one task run inside a benchmark run. The task run carries a frozen snapshot of the task at the moment the benchmark was published, so reruns are reproducible even if the task is later edited.
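
The snapshot behavior can be illustrated with a deep copy taken at publish time. This is a sketch of the idea only: `publish_benchmark` is a hypothetical helper, and how the platform actually stores snapshots is not described here.

```python
import copy
from typing import Any


def publish_benchmark(tasks: list[dict[str, Any]]) -> list[dict[str, Any]]:
    """Return deep copies so later edits to the live tasks cannot leak in."""
    return [copy.deepcopy(task) for task in tasks]


live_task = {"name": "Submit prior auth for lumbar MRI", "tags": ["prior-auth", "imaging"]}
snapshot = publish_benchmark([live_task])[0]

# Editing the live task after publishing leaves the snapshot untouched.
live_task["tags"].append("edited-later")
```

A deep copy matters here because a shallow copy would still share the nested `tags` list with the live task.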

Multi-Interface Tasks

A single task can touch several simulators in one rollout. For example, a prior-auth task might:
  1. Read the patient’s chart from the FHIR sandbox.
  2. Call the payer’s IVR line on the Voice sandbox.
  3. Submit the auth form on the Payer portal sandbox.
  4. Fax supporting documentation via the Fax sandbox.
Attach one criterion per observable outcome (FHIR resource state, voice transcript phrases, portal state row, outbound fax content). The verification engine scores each criterion independently and rolls the results up into the task score.
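
One plausible way to picture the roll-up is a weighted average of per-criterion scores. This is an assumption made for illustration: the actual roll-up formula is not specified here, and the `score` field and `task_score` helper are hypothetical. The `weight` field mirrors the one sent when attaching a criterion.

```python
def task_score(results: list[dict]) -> float:
    """Weighted average of criterion scores (assumed roll-up, not the documented one)."""
    total_weight = sum(r["weight"] for r in results)
    if total_weight == 0:
        return 0.0  # no weighted criteria: nothing to score
    return sum(r["weight"] * r["score"] for r in results) / total_weight


# One result per observable outcome of the prior-auth example.
results = [
    {"label": "FHIR resource state", "weight": 1.0, "score": 1.0},
    {"label": "Voice transcript phrases", "weight": 1.0, "score": 1.0},
    {"label": "Portal state row", "weight": 2.0, "score": 0.0},
    {"label": "Outbound fax content", "weight": 1.0, "score": 1.0},
]

score = task_score(results)  # (1 + 1 + 0 + 1) / 5 = 0.6
```

Under this assumption, a failed criterion with a higher weight pulls the task score down more than a failed criterion with a lower one.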

Creating a Task

# Create the task
curl -X POST https://api.verial.ai/tasks \
  -H "Authorization: Bearer $VERIAL_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "benchmark_id": "'"$BENCHMARK_ID"'",
    "name": "Submit prior auth for lumbar MRI",
    "task_item": {
      "instruction": "Submit a prior authorization for MRI of the lumbar spine",
      "trigger": "Inbound referral fax in the SFTP inbox"
    },
    "scenario": {
      "steps": [
        {
          "at_sec": 0,
          "action": "drop_file_inbox",
          "entity_id": "ent_referral_1",
          "inbound_path": "inbox/"
        }
      ]
    },
    "tags": ["prior-auth", "imaging"]
  }'
# → save the returned id as $TASK_ID

# Attach criteria scoring the observable outcomes
curl -X POST https://api.verial.ai/criteria \
  -H "Authorization: Bearer $VERIAL_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "task_id": "'"$TASK_ID"'",
    "label": "Prior auth submitted with correct CPT",
    "weight": 1.0,
    "axis": "correctness",
    "assertion": {
      "assert": "portal-state-match",
      "correlate_by": { "resource": "prior_auth_requests", "key": "request_id" },
      "assertions": [
        { "path": "status", "expected": "submitted" },
        { "path": "cpt_code", "expected": "72148" }
      ]
    }
  }'

Next Steps

  • Criteria: the typed assertions that score each task run.
  • Tasks API: REST endpoints and full object reference.