User Prompt
“Create a benchmark that tests whether my agent can submit a prior auth for an MRI.”
Tool Call Sequence
Create benchmark
Call: `benchmarks` → Response: (omitted)
What the agent learns: The benchmark is created and linked to the environment. The 300-second timeout gives the agent-under-test five minutes to complete each task.
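The call payload itself isn't shown in this excerpt; as a hedged sketch with purely illustrative field names (the real schema may differ), the `benchmarks` create call might carry something like:

```python
# Hypothetical payload for the `benchmarks` create call.
# All field names here are illustrative assumptions, not the real schema.
benchmark_payload = {
    "name": "prior-auth-mri",            # assumed benchmark name
    "environment_id": "env-payer-portal", # assumed link to the environment
    "timeout_seconds": 300,               # five minutes per task, as stated above
}
```

The 300-second value is the one piece grounded in the walkthrough: it is what gives the agent-under-test five minutes per task.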
Create task
Call: `tasks` → Response: (omitted)
What the agent learns: The task defines what the agent-under-test should do. The instruction is specific about the procedure type and CPT code so the evals can verify exact behavior.
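As with the benchmark call, the task payload is omitted here. A minimal sketch, assuming hypothetical field names and wording (the CPT code 72148 comes from the eval results later in the walkthrough):

```python
# Hypothetical payload for the `tasks` create call; field names are assumptions.
task_payload = {
    "benchmark_id": "bench-123",  # assumed id returned by the benchmarks call
    "instruction": (
        "Submit a prior authorization for a lumbar spine MRI "
        "(CPT 72148) through the payer portal."
    ),
}
```

Naming the procedure and CPT code explicitly in the instruction is what lets the evals check exact behavior rather than a vague outcome.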
Create evals
The agent creates two evals: one to verify submission and one to verify the correct CPT code.
Call 1: `evals` → Response: (omitted)
Call 2: `evals` → Response: (omitted)
What the agent learns: The first eval (weight 1.0) is the critical check: did the agent actually submit a PA? The second eval (weight 0.5) verifies correctness of the CPT code. Weighted scoring means the submission check counts twice as much as the code check.
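The weights and eval names are given in the walkthrough; the payload shape is an assumption. A sketch of the two eval definitions:

```python
# Hypothetical payloads for the two `evals` calls.
# Names and weights come from the walkthrough; the field layout is assumed.
evals = [
    {"name": "pa-submitted", "weight": 1.0,
     "check": "A prior authorization was submitted through the payer portal."},
    {"name": "correct-cpt", "weight": 0.5,
     "check": "The submitted PA uses CPT code 72148."},
]

# Weight 1.0 vs 0.5: the submission check counts twice as much as the code check.
assert evals[0]["weight"] == 2 * evals[1]["weight"]
```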
Start run
Call: `benchmark_runs` → Response: (omitted)
What the agent learns: The run is in progress. The agent-under-test is now executing against the live playground. The agent should poll for completion.
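The polling step above can be sketched as a small loop. `get_run_status` is a hypothetical stand-in for the real `benchmark_runs` read call, which isn't shown in this excerpt:

```python
import time

# Hypothetical polling helper; `get_run_status` stands in for the real
# `benchmark_runs` read call, whose actual shape is not shown above.
def wait_for_run(get_run_status, run_id, interval=5.0, timeout=600.0):
    """Poll until the run leaves the 'in_progress' state or we time out."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        status = get_run_status(run_id)
        if status != "in_progress":
            return status  # e.g. "pass" or "fail"
        time.sleep(interval)
    raise TimeoutError(f"run {run_id} still in progress after {timeout}s")
```

A deadline based on `time.monotonic()` avoids polling forever if the run never finishes.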
Check results
The agent polls the run, then drills into task-level and eval-level results.
Call 1: `benchmark_runs` → Response: (omitted)
Call 2: `task-runs` → Response: (omitted)
Call 3: `eval-runs` → Response: (omitted)
What the agent learns: The run passed with a score of 0.87. The PA was submitted successfully (full marks), but the CPT code check scored 0.6 because the agent needed a correction attempt. The `reasoning` field explains exactly what happened.
Agent Response
Your prior auth benchmark completed with a Pass verdict and a score of 0.87. The agent successfully submitted the prior authorization, though it initially entered the wrong CPT code before self-correcting. You might want to improve your agent’s CPT code lookup logic.
| Eval | Result | Score | Detail |
|---|---|---|---|
| pa-submitted | Pass | 1.0 | PA was submitted through the payer portal |
| correct-cpt | Partial | 0.6 | CPT 72148 was correct in the final submission, but required a correction |
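The 0.87 overall score is consistent with a weighted average of the two eval scores above. A minimal sketch of that arithmetic, assuming weighted averaging is how the scores combine:

```python
# Weighted average of the two eval scores from the table above:
# (eval name, weight, score)
results = [
    ("pa-submitted", 1.0, 1.0),
    ("correct-cpt", 0.5, 0.6),
]
total_weight = sum(w for _, w, _ in results)
score = sum(w * s for _, w, s in results) / total_weight
print(round(score, 2))  # 0.87
```

This is why the submission check dominates: at weight 1.0 versus 0.5, a failed submission would drag the score down twice as hard as a wrong CPT code.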
Summary
| Metric | Value |
|---|---|
| Tool calls | 7 |
| Tools used | benchmarks, tasks, evals, benchmark_runs, task-runs, eval-runs |
| Key technique | Define the benchmark structure (benchmark, task, evals) before starting the run. Use weighted evals to distinguish critical checks from secondary validations. Poll the run, then drill into task-runs and eval-runs for detailed results. |