User Prompt
“Create a benchmark that tests whether my agent can submit a prior auth for an MRI.”
Tool Call Sequence
Create benchmark
Call: `benchmarks` → Response: (omitted)
What the agent learns: The benchmark is created and linked to the environment. The 300-second timeout gives the agent-under-test five minutes to complete each task.
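The call payload itself isn't shown in this excerpt; as a hedged sketch with purely illustrative field names (the real schema may differ), the `benchmarks` create call might carry something like:

```python
# Hypothetical payload for the `benchmarks` create call.
# All field names here are illustrative assumptions, not the real schema.
benchmark_payload = {
    "name": "prior-auth-mri",            # assumed benchmark name
    "environment_id": "env-payer-portal", # assumed link to the environment
    "timeout_seconds": 300,               # five minutes per task, as stated above
}
```

The 300-second value is the one piece grounded in the walkthrough: it is what gives the agent-under-test five minutes per task.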
Create task
Call: `tasks` → Response: (omitted)
What the agent learns: The task defines what the agent-under-test should do. The instruction is specific about the procedure type and CPT code so the evals can verify exact behavior.
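As with the benchmark call, the task payload is omitted here. A minimal sketch, assuming hypothetical field names and wording (the CPT code 72148 comes from the eval results later in the walkthrough):

```python
# Hypothetical payload for the `tasks` create call; field names are assumptions.
task_payload = {
    "benchmark_id": "bench-123",  # assumed id returned by the benchmarks call
    "instruction": (
        "Submit a prior authorization for a lumbar spine MRI "
        "(CPT 72148) through the payer portal."
    ),
}
```

Naming the procedure and CPT code explicitly in the instruction is what lets the evals check exact behavior rather than a vague outcome.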
Create evals
The agent creates two evals: one to verify submission and one to verify the correct CPT code.
Call 1: `evals` → Response: (omitted)
Call 2: `evals` → Response: (omitted)
What the agent learns: The first eval (weight 1.0) is the critical check: did the agent actually submit a PA? The second eval (weight 0.5) verifies correctness of the CPT code. Weighted scoring means the submission check counts twice as much as the code check.
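The weights and eval names are given in the walkthrough; the payload shape is an assumption. A sketch of the two eval definitions:

```python
# Hypothetical payloads for the two `evals` calls.
# Names and weights come from the walkthrough; the field layout is assumed.
evals = [
    {"name": "pa-submitted", "weight": 1.0,
     "check": "A prior authorization was submitted through the payer portal."},
    {"name": "correct-cpt", "weight": 0.5,
     "check": "The submitted PA uses CPT code 72148."},
]

# Weight 1.0 vs 0.5: the submission check counts twice as much as the code check.
assert evals[0]["weight"] == 2 * evals[1]["weight"]
```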
Start run
Call: `benchmark_runs` → Response: (omitted)
What the agent learns: The run is in progress. The agent-under-test is now executing against the live playground. The agent should poll for completion.
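The polling step above can be sketched as a small loop. `get_run_status` is a hypothetical stand-in for the real `benchmark_runs` read call, which isn't shown in this excerpt:

```python
import time

# Hypothetical polling helper; `get_run_status` stands in for the real
# `benchmark_runs` read call, whose actual shape is not shown above.
def wait_for_run(get_run_status, run_id, interval=5.0, timeout=600.0):
    """Poll until the run leaves the 'in_progress' state or we time out."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        status = get_run_status(run_id)
        if status != "in_progress":
            return status  # e.g. "pass" or "fail"
        time.sleep(interval)
    raise TimeoutError(f"run {run_id} still in progress after {timeout}s")
```

A deadline based on `time.monotonic()` avoids polling forever if the run never finishes.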
Check results
The agent polls the run, then drills into task-level and eval-level results.
Call 1: `benchmark_runs` → Response: (omitted)
Call 2: `task-runs` → Response: (omitted)
Call 3: `eval-runs` → Response: (omitted)
What the agent learns: The run passed with a score of 0.87. The PA was submitted successfully (full marks), but the CPT code check scored 0.6 because the agent needed a correction attempt. The `reasoning` field explains exactly what happened.
Agent Response
Your prior auth benchmark completed with a Pass verdict and a score of 0.87. The agent successfully submitted the prior authorization, though it initially entered the wrong CPT code before self-correcting. You might want to improve your agent’s CPT code lookup logic.
| Eval | Result | Score | Detail |
|---|---|---|---|
| pa-submitted | Pass | 1.0 | PA was submitted through the payer portal |
| correct-cpt | Partial | 0.6 | CPT 72148 was correct in the final submission, but required a correction |
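The 0.87 overall score is consistent with a weighted average of the two eval scores above. A minimal sketch of that arithmetic, assuming weighted averaging is how the scores combine:

```python
# Weighted average of the two eval scores from the table above:
# (eval name, weight, score)
results = [
    ("pa-submitted", 1.0, 1.0),
    ("correct-cpt", 0.5, 0.6),
]
total_weight = sum(w for _, w, _ in results)
score = sum(w * s for _, w, s in results) / total_weight
print(round(score, 2))  # 0.87
```

This is why the submission check dominates: at weight 1.0 versus 0.5, a failed submission would drag the score down twice as hard as a wrong CPT code.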
Summary
| Metric | Value |
|---|---|
| Tool calls | 7 |
| Tools used | benchmarks, tasks, evals, benchmark_runs, task-runs, eval-runs |
| Key technique | Define the benchmark structure (benchmark, task, evals) before starting the run. Use weighted evals to distinguish critical checks from secondary validations. Poll the run, then drill into task-runs and eval-runs for detailed results. |