User Prompt
“Create a benchmark that tests whether my agent can submit a prior auth for an MRI.”
Tool Call Sequence
Create benchmark
Call:
Response:
What the agent learns: The benchmark is created and linked to the environment. The 300-second timeout gives the agent-under-test five minutes to complete each task.
Tool: benchmarks
Create task
Call:
Response:
What the agent learns: The task defines what the agent-under-test should do. The instruction names the procedure type and CPT code explicitly so that criteria can verify exact behavior.
Tool: tasks
Create criteria
The agent creates two criteria: one to verify submission and one to verify the correct CPT code.
Call 1:
Response:
Call 2:
Response:
What the agent learns: The first criterion (weight 1.0) is the critical check: did the agent actually submit a PA? The second criterion (weight 0.5) verifies that the CPT code is correct. Weighted scoring means the submission check counts twice as much as the code check.
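At this point the benchmark, task, and criteria all exist. The following is a minimal sketch of that structure as plain Python data, not the platform's schema: the class and field names (`Benchmark`, `Task`, `Criterion`, `timeout_seconds`, `instruction`, `weight`) are assumptions made for illustration, the instruction string is a placeholder, and the criterion names are borrowed from the results reported later in this walkthrough. It shows why weights of 1.0 and 0.5 make the submission check count twice as much.

```python
from dataclasses import dataclass, field


@dataclass
class Criterion:
    name: str      # hypothetical identifier
    weight: float  # relative importance when scores are combined


@dataclass
class Task:
    instruction: str  # placeholder text, not the real task instruction
    criteria: list[Criterion] = field(default_factory=list)


@dataclass
class Benchmark:
    timeout_seconds: int
    tasks: list[Task] = field(default_factory=list)


def weight_shares(criteria):
    """Return each criterion's share of the total weight."""
    total = sum(c.weight for c in criteria)
    return {c.name: c.weight / total for c in criteria}


task = Task(
    instruction="Submit a prior auth for an MRI.",  # placeholder
    criteria=[Criterion("pa-submitted", 1.0), Criterion("correct-cpt", 0.5)],
)
benchmark = Benchmark(timeout_seconds=300, tasks=[task])
shares = weight_shares(task.criteria)
```

Under these assumptions, `pa-submitted` carries two thirds of the total weight and `correct-cpt` one third.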
Tool: criteria (both calls)
Start run
Call:
Response:
What the agent learns: The run is in progress. The agent-under-test is now executing against the live playground. The agent should poll for completion.
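Polling for completion can be sketched as a small loop. This is an illustration only: `fetch_status` is a hypothetical caller-supplied function standing in for a run lookup, the status names `"completed"` and `"failed"` are assumed, and the client-side `timeout_s` is a separate safeguard that merely reuses the 300-second figure as a default.

```python
import time


def poll_until_done(fetch_status, interval_s=5.0, timeout_s=300.0, sleep=time.sleep):
    """Call fetch_status() until it returns a terminal status or time runs out.

    fetch_status is a hypothetical callable that returns the run's
    current status as a string.
    """
    terminal = {"completed", "failed"}  # assumed status names
    waited = 0.0
    while True:
        status = fetch_status()
        if status in terminal:
            return status
        if waited >= timeout_s:
            raise TimeoutError(f"run still '{status}' after {waited:.0f}s")
        sleep(interval_s)
        waited += interval_s


# Usage with a stub in place of a real run lookup.
responses = iter(["running", "running", "completed"])
final = poll_until_done(lambda: next(responses), interval_s=0.0)
```

Passing `sleep` as a parameter keeps the loop easy to test without real delays.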
Tool: benchmark_runs
Check results
The agent polls the run, then drills into task-level and criterion-level results.
Call 1:
Response:
Call 2:
Response:
Call 3:
Response:
What the agent learns: The run passed with a score of 0.87. The PA was submitted successfully (full marks), but the CPT code check scored 0.6 because the agent needed a correction attempt. The reasoning field explains exactly what happened.
Tools: benchmark_runs, task-runs, criterion-runs
Agent Response
Your prior auth benchmark completed with a Pass verdict and a score of 0.87.
The agent successfully submitted the prior authorization, though it initially entered the wrong CPT code before self-correcting. You might want to improve your agent’s CPT code lookup logic.
| Criterion | Result | Score | Detail |
|---|---|---|---|
| pa-submitted | Pass | 1.0 | PA was submitted through the payer portal |
| correct-cpt | Partial | 0.6 | CPT 72148 was correct in the final submission, but required a correction |
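The overall score is consistent with a weight-normalized average of the criterion scores. The exact formula the platform uses is not stated here, so the sketch below is an assumption; it uses the weights set when the criteria were created (1.0 and 0.5) and the scores from the table above (1.0 and 0.6), and `weighted_score` is a hypothetical helper name.

```python
def weighted_score(results):
    """Weight-normalized average of (weight, score) pairs."""
    total_weight = sum(weight for weight, _ in results)
    if total_weight == 0:
        raise ValueError("at least one criterion needs a non-zero weight")
    return sum(weight * score for weight, score in results) / total_weight


results = [
    (1.0, 1.0),  # pa-submitted: weight 1.0, score 1.0
    (0.5, 0.6),  # correct-cpt: weight 0.5, score 0.6
]
overall = weighted_score(results)
print(round(overall, 2))  # 0.87
```

The arithmetic is (1.0 × 1.0 + 0.5 × 0.6) / (1.0 + 0.5) = 1.3 / 1.5 ≈ 0.87, which matches the reported run score.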
Summary
| Metric | Value |
|---|---|
| Tool calls | 8 |
| Tools used | benchmarks, tasks, criteria, benchmark_runs, task-runs, criterion-runs |
| Key technique | Define the benchmark structure (benchmark, task, criteria) before starting the run. Use weighted criteria to distinguish critical checks from secondary validations. Poll the run, then drill into task-runs and criterion-runs for detailed results. |