## Recommended Tool Flow

For most simulation workflows, follow this progression:

1. **Setup** — `simulators` to define interfaces, `environments` to compose them, `datasets` to prepare patient data
2. **Define** — `benchmarks` to create test suites, `tasks` to add test cases, `evals` to add assertions
3. **Execute** — `benchmark_runs` to start runs, poll with `get` until status is `Completed`
4. **Analyze** — `task-runs` to see per-task results, `eval-runs` to see per-eval reasoning and scores
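The progression above can be sketched as a sequence of tool calls. The `call_tool` helper below is a stand-in for however your MCP client dispatches requests, and the parameter names (`kind`, `simulator_ids`, `environment_id`, and so on) are illustrative assumptions, not the tools' actual schemas — check the tools reference for the real fields.

```python
# Hypothetical dispatcher: replace with your actual MCP client's call.
# Here it just echoes back an id so the sketch is runnable end-to-end.
def call_tool(name: str, action: str, **params) -> dict:
    return {"id": f"{name}-1", "tool": name, "action": action, **params}

# Setup: simulators first, then the environment that composes them.
sim = call_tool("simulators", "create", kind="payer_portal")
env = call_tool("environments", "create", simulator_ids=[sim["id"]])

# Define: benchmark -> task -> eval, top-down.
bench = call_tool("benchmarks", "create", environment_id=env["id"])
task = call_tool("tasks", "create", benchmark_id=bench["id"])
call_tool("evals", "create", task_id=task["id"],
          assert_="A prior authorization request was submitted")  # assert_ avoids the Python keyword

# Execute: start the run, then poll its status with get.
run = call_tool("benchmark_runs", "create", benchmark_id=bench["id"])
```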
## Tool Chaining Patterns

### Create-then-link

Simulators exist independently from environments. Create them first, then attach them.

### Benchmark definition

Benchmarks, tasks, and evals form a hierarchy. Create them top-down.

### Run and poll

Start a run, poll for completion, then drill into results. The `benchmark_runs` `get` response includes the overall score and verdict. Use `task-runs` and `eval-runs` to understand which specific checks passed or failed.
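A minimal poll loop for the run-and-poll pattern might look like the sketch below. The `call_tool` stub and the response fields (`status`, `id`) are assumptions standing in for your real client and the documented response shapes.

```python
import time

# Stub for a real MCP client: the first get returns Running,
# the second returns Completed, so the loop below terminates.
_statuses = iter(["Running", "Completed"])
def call_tool(name: str, action: str, **params) -> dict:
    if action == "get":
        return {"id": params["id"], "status": next(_statuses)}
    return {"id": f"{name}-1"}

run = call_tool("benchmark_runs", "create", benchmark_id="bench-1")
while True:
    state = call_tool("benchmark_runs", "get", id=run["id"])
    # A production loop would also break on failure states.
    if state["status"] == "Completed":
        break
    time.sleep(1)  # back off between polls

# Run finished: drill into per-task and per-eval results.
task_results = call_tool("task-runs", "list", benchmark_run_id=run["id"])
```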
## Writing Good Evals

Eval `assert` strings are evaluated by an LLM judge against the interaction evidence. Write them like test assertions: specific, observable, and unambiguous.
| Assert | Quality | Why |
|---|---|---|
| "The agent did a good job" | Bad | Subjective, no observable criteria |
| "A prior auth was submitted" | Okay | Observable but vague about what counts as "submitted" |
| "A prior authorization request was submitted through the payer portal with CPT code 72148" | Good | Specific action, specific channel, specific data point |
| "The 271 eligibility response shows active coverage with plan type PPO" | Good | Specific transaction type, specific fields to check |
| "The agent called the FHIR endpoint GET /Patient and received a 200 response" | Good | Verifiable against interaction logs |
| "The agent handled the error gracefully" | Bad | "Gracefully" is subjective |
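Applied in practice, each "Good" assert from the table becomes its own eval attached to a task. The `create_eval` helper and its parameters are hypothetical stand-ins for the `evals` tool's create action.

```python
# Hypothetical helper; substitute your MCP client's call to the evals tool.
def create_eval(task_id: str, assert_text: str) -> dict:
    return {"task_id": task_id, "assert": assert_text}  # stub response

# One observable check per eval, never a compound assertion.
evals = [
    create_eval("task-1",
        "A prior authorization request was submitted through the payer "
        "portal with CPT code 72148"),
    create_eval("task-1",
        "The 271 eligibility response shows active coverage with plan type PPO"),
]
```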
### One assertion per eval

Split compound checks into separate evals rather than combining them. This gives you granular scoring and clearer failure messages.

## Error Handling
All tools return errors in a consistent shape:

| Error Code | Meaning | Recommended Action |
|---|---|---|
| `not_found` | Entity does not exist or belongs to a different organization | Verify the ID; use a list action to find valid IDs |
| `validation_error` | Invalid parameters (missing required field, wrong type) | Check the parameter types and required fields in the tool reference |
| `conflict` | Duplicate or conflicting state (e.g., simulator already linked) | Use `get` to check current state before retrying |
| `timeout` | Run exceeded the benchmark timeout | Increase the benchmark timeout value or simplify the task |
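The table above can be encoded as a small dispatch on the error code. The response shape `{"error": {"code": ...}}` is an assumption here — confirm the exact structure in the tool reference.

```python
def recommended_action(response: dict) -> str:
    """Map an error response to the recommended next step (codes from the table above)."""
    code = response.get("error", {}).get("code")
    if code is None:
        return "ok"  # no error key: the call succeeded
    actions = {
        "not_found": "verify the ID; use a list action to find valid IDs",
        "validation_error": "check required fields and parameter types",
        "conflict": "use get to check current state before retrying",
        "timeout": "increase the benchmark timeout or simplify the task",
    }
    return actions.get(code, "unrecognized error code")

print(recommended_action({"error": {"code": "conflict"}}))
# → prints "use get to check current state before retrying"
```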
## Next Steps

- **Tools Reference** — Full parameter documentation for all eleven tools.
- **Workflow Examples** — Step-by-step tool call sequences for common simulation tasks.