
User Prompt

“Create a benchmark that tests whether my agent can submit a prior auth for an MRI.”

Tool Call Sequence

1

Create benchmark

Call: benchmarks
{
  "action": "create",
  "name": "Prior Auth Test",
  "environmentId": "env_clinic_01",
  "timeout": 300
}
Response:
{
  "id": "bm_pa_01",
  "name": "Prior Auth Test",
  "environmentId": "env_clinic_01",
  "timeout": 300,
  "tasks": [],
  "createdAt": "2026-04-06T11:00:00Z"
}
What the agent learns: The benchmark is created and linked to the environment. The 300-second timeout gives the agent-under-test five minutes to complete each task.
2

Create task

Call: tasks
{
  "action": "create",
  "benchmarkId": "bm_pa_01",
  "name": "Submit PA for MRI",
  "instruction": "Submit a prior authorization request for a lumbar spine MRI (CPT 72148) for the patient. Use the payer portal to file the request and confirm submission."
}
Response:
{
  "id": "task_mri_01",
  "benchmarkId": "bm_pa_01",
  "name": "Submit PA for MRI",
  "instruction": "Submit a prior authorization request for a lumbar spine MRI (CPT 72148) for the patient. Use the payer portal to file the request and confirm submission.",
  "evals": [],
  "createdAt": "2026-04-06T11:00:01Z"
}
What the agent learns: The task defines what the agent-under-test should do. The instruction is specific about the procedure type and CPT code so evals can verify exact behavior.
3

Create evals

The agent creates two evals: one to verify submission and one to verify the correct CPT code.
Call 1: evals
{
  "action": "create",
  "taskId": "task_mri_01",
  "name": "pa-submitted",
  "assert": "A prior authorization request was submitted through the payer portal",
  "weight": 1.0
}
Response:
{
  "id": "eval_pa_01",
  "taskId": "task_mri_01",
  "name": "pa-submitted",
  "assert": "A prior authorization request was submitted through the payer portal",
  "weight": 1.0,
  "createdAt": "2026-04-06T11:00:02Z"
}
Call 2: evals
{
  "action": "create",
  "taskId": "task_mri_01",
  "name": "correct-cpt",
  "assert": "The prior authorization request includes CPT code 72148 for lumbar spine MRI",
  "weight": 0.5
}
Response:
{
  "id": "eval_cpt_01",
  "taskId": "task_mri_01",
  "name": "correct-cpt",
  "assert": "The prior authorization request includes CPT code 72148 for lumbar spine MRI",
  "weight": 0.5,
  "createdAt": "2026-04-06T11:00:03Z"
}
What the agent learns: The first eval (weight 1.0) is the critical check: did the agent actually submit a PA? The second eval (weight 0.5) verifies correctness of the CPT code. Weighted scoring means the submission check counts twice as much as the code check.
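
Steps 1 through 3 chain identifiers: the task needs the benchmark's `id`, and each eval needs the task's `id`. A minimal sketch of that chaining follows. It assumes a hypothetical `call_tool(tool_name, arguments)` helper that sends one tool call and returns the parsed JSON response; that helper and the `define_benchmark` name are illustrative and not part of the documented tools.

```python
from typing import Callable

# Hypothetical transport: call_tool(tool_name, arguments) -> parsed JSON response.
ToolCaller = Callable[[str, dict], dict]

INSTRUCTION = (
    "Submit a prior authorization request for a lumbar spine MRI (CPT 72148) "
    "for the patient. Use the payer portal to file the request and confirm submission."
)

# (name, assertion, weight) for each eval, copied from the walkthrough above.
EVALS = [
    ("pa-submitted",
     "A prior authorization request was submitted through the payer portal", 1.0),
    ("correct-cpt",
     "The prior authorization request includes CPT code 72148 for lumbar spine MRI", 0.5),
]


def define_benchmark(call_tool: ToolCaller, environment_id: str) -> dict:
    """Create the benchmark, its task, and its evals, threading each ID forward."""
    benchmark = call_tool("benchmarks", {
        "action": "create",
        "name": "Prior Auth Test",
        "environmentId": environment_id,
        "timeout": 300,  # seconds allowed per task
    })
    task = call_tool("tasks", {
        "action": "create",
        "benchmarkId": benchmark["id"],  # ID returned by the previous call
        "name": "Submit PA for MRI",
        "instruction": INSTRUCTION,
    })
    eval_ids = []
    for name, assertion, weight in EVALS:
        created = call_tool("evals", {
            "action": "create",
            "taskId": task["id"],  # every eval attaches to the same task
            "name": name,
            "assert": assertion,
            "weight": weight,
        })
        eval_ids.append(created["id"])
    return {"benchmarkId": benchmark["id"], "taskId": task["id"], "evalIds": eval_ids}
```

Passing the transport in as a parameter keeps the sketch independent of any particular client library and makes the call order easy to check with a stub.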
4

Start run

Call: benchmark_runs
{
  "action": "create",
  "benchmarkId": "bm_pa_01"
}
Response:
{
  "id": "run_01",
  "benchmarkId": "bm_pa_01",
  "status": "Running",
  "score": null,
  "verdict": null,
  "createdAt": "2026-04-06T11:00:04Z"
}
What the agent learns: The run is in progress. The agent-under-test is now executing against the live playground. The agent should poll for completion.
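
Polling for completion could look like the sketch below. It assumes the same hypothetical `call_tool(tool_name, arguments)` helper, and it treats any status other than `"Running"` as terminal; the walkthrough only shows `"Running"` and `"Completed"`, so other status values, the polling interval, and the overall wait limit are assumptions.

```python
import time
from typing import Callable

# Hypothetical transport: call_tool(tool_name, arguments) -> parsed JSON response.
ToolCaller = Callable[[str, dict], dict]


def wait_for_run(
    call_tool: ToolCaller,
    run_id: str,
    interval_s: float = 5.0,    # assumed delay between polls
    max_wait_s: float = 600.0,  # assumed overall limit for the whole run
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Poll benchmark_runs until the run leaves the "Running" state."""
    deadline = time.monotonic() + max_wait_s
    while True:
        run = call_tool("benchmark_runs", {"action": "get", "id": run_id})
        if run["status"] != "Running":
            return run  # terminal state; score and verdict should now be set
        if time.monotonic() >= deadline:
            raise TimeoutError(f"run {run_id} still running after {max_wait_s}s")
        sleep(interval_s)
```

The `sleep` parameter exists only so the loop can be exercised without real delays; in normal use the default `time.sleep` applies.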
5

Check results

The agent polls the run, then drills into task-level and eval-level results.
Call 1: benchmark_runs
{
  "action": "get",
  "id": "run_01"
}
Response:
{
  "id": "run_01",
  "benchmarkId": "bm_pa_01",
  "status": "Completed",
  "score": 0.87,
  "verdict": "Pass",
  "completedAt": "2026-04-06T11:03:42Z"
}
Call 2: task-runs
{
  "action": "list",
  "runId": "run_01"
}
Response:
{
  "items": [
    {
      "id": "tr_01",
      "runId": "run_01",
      "taskId": "task_mri_01",
      "name": "Submit PA for MRI",
      "status": "Completed",
      "score": 0.87
    }
  ]
}
Call 3: eval-runs
{
  "action": "list",
  "taskRunId": "tr_01"
}
Response:
{
  "items": [
    {
      "id": "er_01",
      "evalId": "eval_pa_01",
      "name": "pa-submitted",
      "result": "Pass",
      "score": 1.0,
      "reasoning": "The agent navigated to the BlueCross portal and submitted a prior auth request for the patient. The submission confirmation was recorded in the interaction log."
    },
    {
      "id": "er_02",
      "evalId": "eval_cpt_01",
      "name": "correct-cpt",
      "result": "Partial",
      "score": 0.6,
      "reasoning": "The prior auth request included CPT code 72148, but the agent initially entered 72141 and had to correct it. The final submission was correct."
    }
  ]
}
What the agent learns: The run passed with a score of 0.87. The PA was submitted successfully (full marks), but the CPT code check scored 0.6 because the agent needed a correction attempt. The reasoning field explains exactly what happened.
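
The walkthrough does not state how eval scores roll up into the task score. One plausible reading is a weight-normalized mean, which reproduces the 0.87 shown above: (1.0 × 1.0 + 0.5 × 0.6) / (1.0 + 0.5) ≈ 0.867. The sketch below works through that arithmetic; the formula is an assumption, not a documented rule.

```python
def weighted_score(evals: list[tuple[float, float]]) -> float:
    """Weight-normalized mean of (weight, score) pairs. Assumed aggregation rule."""
    total_weight = sum(weight for weight, _ in evals)
    if total_weight == 0:
        raise ValueError("at least one eval must have a non-zero weight")
    return sum(weight * score for weight, score in evals) / total_weight


# (weight, score) for pa-submitted and correct-cpt, taken from the run above.
score = weighted_score([(1.0, 1.0), (0.5, 0.6)])
print(round(score, 2))  # 0.87
```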

Agent Response

Your prior auth benchmark completed with a Pass verdict and a score of 0.87.

| Eval | Result | Score | Detail |
| --- | --- | --- | --- |
| pa-submitted | Pass | 1.0 | PA was submitted through the payer portal |
| correct-cpt | Partial | 0.6 | CPT 72148 was correct in the final submission, but required a correction |

The agent successfully submitted the prior authorization, though it initially entered the wrong CPT code before self-correcting. You might want to improve your agent’s CPT code lookup logic.

Summary

| Metric | Value |
| --- | --- |
| Tool calls | 8 |
| Tools used | benchmarks, tasks, evals, benchmark_runs, task-runs, eval-runs |
| Key technique | Define the benchmark structure (benchmark, task, evals) before starting the run. Use weighted evals to distinguish critical checks from secondary validations. Poll the run, then drill into task-runs and eval-runs for detailed results. |
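
The drill-down half of the key technique can be sketched as a nested walk: list the task runs for a run, then list the eval runs for each task run. As before, `call_tool(tool_name, arguments)` is a hypothetical helper that returns the parsed JSON response, and the function names are illustrative.

```python
from typing import Callable

# Hypothetical transport: call_tool(tool_name, arguments) -> parsed JSON response.
ToolCaller = Callable[[str, dict], dict]


def collect_eval_results(call_tool: ToolCaller, run_id: str) -> list[dict]:
    """Flatten run -> task-runs -> eval-runs into one row per eval."""
    rows = []
    task_runs = call_tool("task-runs", {"action": "list", "runId": run_id})["items"]
    for task_run in task_runs:
        eval_runs = call_tool(
            "eval-runs", {"action": "list", "taskRunId": task_run["id"]}
        )["items"]
        for eval_run in eval_runs:
            rows.append({
                "task": task_run["name"],
                "eval": eval_run["name"],
                "result": eval_run["result"],
                "score": eval_run["score"],
                "reasoning": eval_run["reasoning"],
            })
    return rows


def needs_attention(rows: list[dict]) -> list[dict]:
    """Keep only evals that did not fully pass, such as a "Partial" result."""
    return [row for row in rows if row["result"] != "Pass"]
```

For the run in this walkthrough, `needs_attention` would return the single `correct-cpt` row, whose `reasoning` field explains the corrected CPT entry.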